[lxml-dev] Passing UTF-8 bytestrings to lxml

John J Lee jjl at pobox.com
Mon Aug 4 14:10:16 CEST 2008


Hi

Apologies in advance if this is the wrong list -- I'm suggesting a change 
to lxml, so I guess this is the right place...

I'm working on some existing code that makes use of both unicode objects 
and UTF-8 encoded bytestring objects (both of which sometimes contain 
non-ASCII characters).  I'm making changes to the code to ensure that it 
supports the unicode character set.  Unfortunately, it's not practical to 
change all of the code to use unicode objects (partly because there's a 
lot of code, and partly because fixing that would probably entail fixing 
PyGTK to return unicode objects instead of UTF-8 encoded bytestrings). 
So, the plan is to live with both unicode and UTF-8 encoded bytestrings, 
and to ensure Python's default encoding is always set to UTF-8.  I'm sure 
the wisdom of that approach could be debated (!), but I hope that somebody 
will be kind enough to answer the following question anyway :-)

Looking at the code, it seems that changing function _utf8 in 
apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would 
be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that 
seems to work.

1. Will what I'm doing subtly break lxml in some way if I make use of this 
patched lxml in my own code?

2. Should lxml be changed in this way?  If it's considered important to 
avoid accidentally passing non-ASCII bytestrings to lxml, would it be 
acceptable to add a global switch to enable accepting UTF-8 encoded 
bytestrings?
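
To make the two behaviours concrete, here is a rough pure-Python sketch of
what I mean. It is not lxml's code: the name to_utf8 and the
accept_utf8_bytes parameter are made up for illustration (a global switch
would simply supply that parameter's value), it is written in Python 3
terms (bytes/str rather than str/unicode), it raises ValueError where
_utf8 uses assert, and it ignores any XML character-validity checks that
isutf8py may perform.

```python
def to_utf8(s, accept_utf8_bytes=False):
    """Return s as a UTF-8 encoded bytestring (illustrative sketch only).

    accept_utf8_bytes is a hypothetical switch; lxml has no such option.
    """
    if isinstance(s, bytes):
        if accept_utf8_bytes:
            # Proposed behaviour: accept any well-formed UTF-8 bytestring.
            try:
                s.decode("utf-8")
            except UnicodeDecodeError:
                raise ValueError(
                    "All strings must be either unicode objects or UTF-8")
        elif any(b > 0x7F for b in s):
            # Current behaviour: bytestrings must be plain ASCII.
            raise ValueError(
                "All strings must be XML compatible, either Unicode or ASCII")
        return s
    if isinstance(s, str):
        # Unicode text is always encoded to UTF-8.
        return s.encode("utf-8")
    raise TypeError("Argument must be string or unicode.")
```

With the switch off, to_utf8(b"\xc3\xa9") is rejected; with it on, the same
bytes pass through unchanged, while malformed input such as b"\xff" is
still refused.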


Thanks for any help


(This patch is against lxml 1.3.6, but this function in SVN trunk is
very similar)

--- apihelpers.pxi.orig	2008-08-04 12:52:34.000000000 +0100
+++ apihelpers.pxi	2008-08-04 12:41:57.000000000 +0100
@@ -640,13 +640,12 @@

  cdef object _utf8(object s):
      if python.PyString_Check(s):
-        assert not isutf8py(s), \
-               "All strings must be XML compatible, either Unicode or ASCII"
+        assert isutf8py(s) != -1, \
+               "All strings must be either unicode objects or UTF-8"
      elif python.PyUnicode_Check(s):
-        # FIXME: we should test these strings, too ...
          s = python.PyUnicode_AsUTF8String(s)
          assert isutf8py(s) != -1, \
-               "All strings must be XML compatible, either Unicode or ASCII"
+               "All strings must be either unicode objects or UTF-8"
      else:
          raise TypeError, "Argument must be string or unicode."
      return s



John

