[lxml-dev] Passing UTF-8 bytestrings to lxml
John J Lee
jjl at pobox.com
Mon Aug 4 20:02:12 CEST 2008
On Mon, 4 Aug 2008 16:07:33 +0200 (CEST), "Stefan Behnel"
<stefan_ml at behnel.de> said:
[...]
> > Looking at the code, it seems that changing function _utf8 in
> > apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> > be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> > seems to work.
>
> The internal encoding used by libxml2 is UTF-8, so I don't expect any
> problems when you pass in UTF-8 directly - as long as you can make sure
> that it's really a valid UTF-8 byte sequence.
Thanks for this, it's very helpful. I have a follow-up question, though.

On discovering that unicode strings containing non-ASCII characters
don't hash to the same value as their UTF-8 encoded bytestring
equivalents (even though, for example, they compare equal when the
default encoding is set to UTF-8), I'm having second thoughts about my
mixed-str-and-unicode scheme:
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("utf-8")
>>> hash(u"\xa3")
-610773982
>>> hash(u"\xa3".encode("utf-8"))
1195450215
>>> d = {}
>>> d[u"\xa3"] = 1
>>> d[u"\xa3".encode("utf-8")] = 2
>>> len(d)
2
FWIW, that fact is documented here:
http://www.python.org/dev/peps/pep-0100/ "Comparison & Hash Value"
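
One way to sidestep the two-keys problem shown in the session above is to
normalise every key to a single string type before it reaches the dict.
The following is only a minimal sketch, and it assumes Python 3 (where
text and bytestrings never compare equal, so the mismatch is at least
explicit). The helper name as_text is hypothetical and not part of lxml
or the standard library.

```python
# Hypothetical helper (not an lxml API): normalise a key to text.
def as_text(value, encoding="utf-8"):
    """Return value as a text string, decoding bytestrings as UTF-8."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

pound_text = "\xa3"
pound_bytes = pound_text.encode("utf-8")

# Without normalisation the two spellings are distinct keys.
mixed = {}
mixed[pound_text] = 1
mixed[pound_bytes] = 2

# With normalisation both spellings collapse onto a single key.
normalised = {}
normalised[as_text(pound_text)] = 1
normalised[as_text(pound_bytes)] = 2
```

The cost is one decode per lookup for bytestring keys, which is the price
of keeping a mixed scheme at all.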
So, my question: were I also to change the function funicode (also in
apihelpers.pxi) to return UTF-8 bytestrings, would lxml always return
UTF-8 bytestring objects from all of its API calls? Again, this seems
to work with a quick test, but I wonder whether there are cases where
funicode() is not called. The patch I'm thinking of would be
something like this:
--- apihelpers.pxi.orig 2008-08-04 12:52:34.000000000 +0100
+++ apihelpers.pxi 2008-08-04 18:40:57.000000000 +0100
@@ -623,30 +623,18 @@
     return is_non_ascii
 
 cdef object funicode(char* s):
-    cdef Py_ssize_t slen
-    cdef char* spos
-    cdef char c
-    spos = s
-    c = spos[0]
-    while c != c'\0':
-        if c & 0x80:
-            break
-        spos = spos + 1
-        c = spos[0]
-    slen = spos - s
-    if c != c'\0':
-        return python.PyUnicode_DecodeUTF8(s, slen+cstd.strlen(spos), NULL)
-    return python.PyString_FromStringAndSize(s, slen)
+    if s is NULL:
+        return python.PyString_FromString("")
+    return python.PyString_FromString(s)
 
 cdef object _utf8(object s):
     if python.PyString_Check(s):
-        assert not isutf8py(s), \
-            "All strings must be XML compatible, either Unicode or ASCII"
+        assert isutf8py(s) != -1, \
+            "All strings must be either unicode objects or UTF-8"
     elif python.PyUnicode_Check(s):
-        # FIXME: we should test these strings, too ...
         s = python.PyUnicode_AsUTF8String(s)
         assert isutf8py(s) != -1, \
-            "All strings must be XML compatible, either Unicode or ASCII"
+            "All strings must be either unicode objects or UTF-8"
     else:
         raise TypeError, "Argument must be string or unicode."
     return s

(I'm not requesting this patch be applied to lxml, just hoping to get
some help re whether this will do what I hope it will.)
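
For readers who don't want to read the Cython, here is a rough pure-Python
sketch of what the patched _utf8 is meant to do: accept either a text
string or a bytestring that is already valid UTF-8, and hand back UTF-8
bytes. It assumes Python 3 types, the name to_utf8 is hypothetical, and
it does not reproduce whatever additional checks isutf8py performs beyond
UTF-8 well-formedness.

```python
# Hypothetical stand-in for the patched _utf8 (not lxml's actual code).
def to_utf8(s):
    """Return s as a UTF-8 bytestring, rejecting anything else."""
    if isinstance(s, bytes):
        try:
            # Decode purely to validate; the decoded result is discarded.
            s.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError(
                "All strings must be either unicode objects or UTF-8")
        return s
    if isinstance(s, str):
        return s.encode("utf-8")
    raise TypeError("Argument must be string or unicode.")
```

Note that the sketch raises exceptions where the patch uses assert, so
the check would still run under python -O.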
[...]
> Global switches are always a bad thing. And I don't like the idea of
> accepting UTF-8 encoded strings at the API level and returning them as
> unicode strings (and: no, I would not allow returning UTF-8 encoded
> strings from the API).
>
> So I guess the answer is a pretty straight no.
Fair enough :-)
> PS: Regarding your actual problem: it's best to decode data directly when
> your code gets its hands on it, and to encode as late as possible, i.e.
[...]
Sure, that's the usual principle. In my case, I'm consciously looking
for a practical hack as a way of working with existing code. Also,
though, there are dissenters who argue in favour of encoding to UTF-8 as
early as possible (and recoding as late as possible). That view seems
self-consistent to me. Note that e.g. "x in y" tests still work fine if
you pick UTF-8. The only problems then are with things like len(),
.strip(), .upper(), etc., but those can be solved by using a different
len() function, and using functions instead of methods for strip, upper,
etc. It's a minority view, of course.
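
To make the "functions instead of methods" idea concrete, here is a
minimal sketch of such helpers for UTF-8 bytestrings. It assumes Python 3
bytes objects and valid UTF-8 input; the names utf8_len, utf8_upper and
utf8_strip are hypothetical, not from any library.

```python
def utf8_len(data):
    """Count characters (code points) in a valid UTF-8 bytestring."""
    # Continuation bytes look like 0b10xxxxxx; every other byte
    # starts a new character.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

def utf8_upper(data):
    """Upper-case a UTF-8 bytestring by round-tripping through text."""
    return data.decode("utf-8").upper().encode("utf-8")

def utf8_strip(data):
    """Strip Unicode whitespace from both ends of a UTF-8 bytestring."""
    return data.decode("utf-8").strip().encode("utf-8")

price = "\xa35 caf\xe9".encode("utf-8")   # pound sign, "5 caf", e-acute
pound = "\xa3".encode("utf-8")

byte_count = len(price)        # counts bytes, not characters
char_count = utf8_len(price)   # counts characters
has_pound = pound in price     # substring test works directly on the bytes
```

The substring test is safe because UTF-8 is self-synchronising: the
encoding of one character can never match in the middle of another.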
Thanks again,
John