[lxml-dev] Passing UTF-8 bytestrings to lxml

John J Lee jjl at pobox.com
Mon Aug 4 20:02:12 CEST 2008


On Mon, 4 Aug 2008 16:07:33 +0200 (CEST), "Stefan Behnel"
<stefan_ml at behnel.de> said:
[...]
> > Looking at the code, it seems that changing function _utf8 in
> > apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> > be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> > seems to work.
> 
> The internal encoding used by libxml2 is UTF-8, so I don't expect any
> problems when you pass in UTF-8 directly - as long as you can make sure
> that it's really a valid UTF-8 byte sequence.

Thanks for this, it's very helpful.  I have a follow-up question, though.

On discovering the fact that unicode strings containing non-ASCII
characters don't hash to the same value as their UTF-8 equivalent
bytestring (despite the fact that, for example, they compare equal, when
the default encoding is set to UTF-8), I'm having second thoughts about
my mixed-str-and-unicode scheme:

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("utf-8")
>>> hash(u"\xa3")
-610773982
>>> hash(u"\xa3".encode("utf-8"))
1195450215
>>> d = {}
>>> d[u"\xa3"] = 1
>>> d[u"\xa3".encode("utf-8")] = 2
>>> len(d)
2


FWIW, that fact is documented here:

http://www.python.org/dev/peps/pep-0100/ "Comparison & Hash Value"
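
One way to sidestep the hashing mismatch is to normalise every key to a
single string type before it reaches the dictionary.  A minimal sketch,
written for Python 3 (where bytes and str never compare equal, so the
problem shows up as two distinct keys rather than as equal keys with
different hashes).  to_text is a hypothetical helper, not part of lxml or
the standard library:

```python
def to_text(key, encoding="utf-8"):
    """Return key as a text string, decoding bytes if necessary.

    Hypothetical helper for illustration only.
    """
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


d = {}
d[to_text("\xa3")] = 1
# After normalisation this is the same key, so it overwrites the entry.
d[to_text("\xa3".encode("utf-8"))] = 2
size = len(d)
```

With both spellings routed through to_text, the dictionary ends up with
one entry instead of two.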


So, my question: were I also to change the function funicode (also in 
apihelpers.pxi) to return UTF-8 bytestrings, would lxml always return 
UTF-8 bytestring objects from all of its API calls?  Again, this seems 
to work with a quick test, but I wonder whether there are cases where 
funicode() is not called.  The patch I'm thinking of would be 
something like this:

--- apihelpers.pxi.orig 2008-08-04 12:52:34.000000000 +0100
+++ apihelpers.pxi      2008-08-04 18:40:57.000000000 +0100
@@ -623,30 +623,18 @@
     return is_non_ascii
 
 cdef object funicode(char* s):
-    cdef Py_ssize_t slen
-    cdef char* spos
-    cdef char c
-    spos = s
-    c = spos[0]
-    while c != c'\0':
-        if c & 0x80:
-            break
-        spos = spos + 1
-        c = spos[0]
-    slen = spos - s
-    if c != c'\0':
-        return python.PyUnicode_DecodeUTF8(s, slen+cstd.strlen(spos), NULL)
-    return python.PyString_FromStringAndSize(s, slen)
+    if s is NULL:
+        return python.PyString_FromString("")
+    return python.PyString_FromString(s)
 
 cdef object _utf8(object s):
     if python.PyString_Check(s):
-        assert not isutf8py(s), \
-               "All strings must be XML compatible, either Unicode or ASCII"
+        assert isutf8py(s) != -1, \
+               "All strings must be either unicode objects or UTF-8"
     elif python.PyUnicode_Check(s):
-        # FIXME: we should test these strings, too ...
         s = python.PyUnicode_AsUTF8String(s)
         assert isutf8py(s) != -1, \
-               "All strings must be XML compatible, either Unicode or ASCII"
+               "All strings must be either unicode objects or UTF-8"
     else:
         raise TypeError, "Argument must be string or unicode."
     return s


(I'm not requesting this patch be applied to lxml, just hoping to get
some help re whether this will do what I hope it will.)
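
For readers who don't have the lxml sources to hand, the intent of the
patched _utf8 can be sketched in pure Python 3.  This is an analogue under
stated assumptions, not lxml's code: utf8_bytes is a hypothetical name,
and it checks only that the bytes are well-formed UTF-8, which may be
weaker than whatever isutf8py verifies.

```python
def utf8_bytes(s):
    """Return s as UTF-8 encoded bytes.

    Hypothetical pure-Python analogue of the patched _utf8:
    bytes are accepted only if they are valid UTF-8, text is encoded.
    """
    if isinstance(s, bytes):
        # Raises UnicodeDecodeError if s is not well-formed UTF-8.
        s.decode("utf-8")
        return s
    if isinstance(s, str):
        return s.encode("utf-8")
    raise TypeError("Argument must be bytes or str.")
```

For example, utf8_bytes("\xa3") and utf8_bytes(b"\xc2\xa3") both return
b"\xc2\xa3", while the Latin-1 byte b"\xa3" is rejected.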

[...]
> Global switches are always a bad thing. And I don't like the idea of
> accepting UTF-8 encoded strings at the API level and returning them as
> unicode strings (and: no, I would not allow returning UTF-8 encoded
> strings from the API).
> 
> So I guess the answer is a pretty straight no.

Fair enough :-)


> PS: Regarding your actual problem: it's best to decode data directly when
> your code gets its hands on it, and to encode as late as possible, i.e.
[...]

Sure, that's the usual principle.  In my case, I'm consciously looking
for a practical hack as a way of working with existing code.  Also,
though, there are dissenters who argue in favour of encoding to UTF-8 as
early as possible (and recoding as late as possible).  That view seems
self-consistent to me.  Note that e.g. "x in y" tests still work fine if
you pick UTF-8.  The only problems then are with things like len(),
.strip(), .upper(), etc, but those can be solved by using a different
len() function, and using functions instead of methods for strip, upper,
etc.  It's a minority view, of course.
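
The "functions instead of methods" idea might look roughly like this in
Python 3, where UTF-8 data is a bytes object.  The names utf8_len,
utf8_upper and utf8_strip are hypothetical, and the sketch assumes its
input is already valid UTF-8:

```python
def utf8_len(b):
    """Count code points by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for byte in b if (byte & 0xC0) != 0x80)


def utf8_upper(b):
    """Upper-case via a round trip through text; bytes.upper() is ASCII-only."""
    return b.decode("utf-8").upper().encode("utf-8")


def utf8_strip(b):
    """Strip Unicode whitespace, including non-ASCII spaces such as U+00A0."""
    return b.decode("utf-8").strip().encode("utf-8")


pound = "\xa3".encode("utf-8")
text = "  \xa310\u00a0".encode("utf-8")

found = pound in text          # substring tests work directly on the bytes
byte_count = len(text)         # counts bytes, not characters
char_count = utf8_len(text)    # counts characters
stripped = utf8_strip(text)
shouted = utf8_upper("caf\xe9".encode("utf-8"))
```

The round trips through text show the cost of this approach: every
character-aware operation has to decode somewhere, it just happens inside
the helper rather than at the program's boundary.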

Thanks again,


John

