[lxml-dev] Passing UTF-8 bytestrings to lxml
Stefan Behnel
stefan_ml at behnel.de
Mon Aug 4 16:07:33 CEST 2008
Hi,
John J Lee wrote:
> Apologies in advance if this is the wrong list -- I'm suggesting a change
> to lxml, so I guess this is the right place...
We only have one mailing list, so this is definitely the right place.
> Looking at the code, it seems that changing function _utf8 in
> apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> seems to work.
The internal encoding used by libxml2 is UTF-8, so I don't expect any
problems when you pass in UTF-8 directly - as long as you can make sure
that it's really a valid UTF-8 byte sequence.
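As a minimal sketch of what "really a valid UTF-8 byte sequence" means in practice, one could test a byte string by attempting to decode it. The helper name `is_valid_utf8` is made up for this illustration and is not part of lxml; the code uses Python 3 syntax, where bytes and text are separate types.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "é".encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = "é".encode("latin-1")  # one byte: 0xE9, not valid UTF-8 on its own

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

Note that a successful decode only shows the bytes are well-formed UTF-8, not that UTF-8 is what the producer actually intended.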
> 2. Should lxml be changed in this way? If it's considered important to
> avoid accidentally passing non-ASCII bytestrings to lxml
I consider that important, yes. The support for ASCII byte strings is a
pure convenience as ASCII names are extremely common in XML *and* they are
compatible with unicode strings in Python 2.x. Allowing anything other
than ASCII here would open the door for all sorts of hard to track down
encoding problems, as you would no longer get an exception when you
accidentally pass ISO encoded non-ASCII strings, for example.
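To illustrate the point, here is a rough sketch of an ASCII-only check. The function `to_text_ascii_only` is a hypothetical stand-in, not the actual `_utf8` code in apihelpers.pxi, and it is written in Python 3 syntax. It shows that a strict ASCII decode rejects ISO-8859-1 bytes loudly, while the same non-ASCII byte sequence can often be decoded under more than one encoding without any error.

```python
def to_text_ascii_only(value):
    """Hypothetical helper: accept text, or bytes that are pure ASCII."""
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError for any byte >= 0x80
        return value.decode("ascii")
    return value


latin1_name = "café".encode("latin-1")  # b'caf\xe9'
try:
    to_text_ascii_only(latin1_name)
    rejected = False
except UnicodeDecodeError:
    rejected = True  # the mistake surfaces immediately

# These bytes decode without error under both encodings,
# but the resulting text differs.
ambiguous = b"caf\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")      # 'café'
as_latin1 = ambiguous.decode("latin-1")  # 'cafÃ©'
```

With the ASCII restriction the error shows up at the call site; without it, the wrong interpretation would only be noticed later, in the data.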
Note that when lxml runs under Python 3, it will not allow you to pass
byte strings into the API at all (except for parsing, obviously).
> would it be
> acceptable to add a global switch to enable accepting UTF-8 encoded
> bytestrings?
Global switches are always a bad thing. And I don't like the idea of
accepting UTF-8 encoded strings at the API level and returning them as
unicode strings (and: no, I would not allow returning UTF-8 encoded
strings from the API).
So I guess the answer is a pretty straight no.
Stefan
PS: Regarding your actual problem: it's best to decode data directly when
your code gets its hands on it, and to encode as late as possible, i.e. on
the way out. I remember that when I worked with Qt3, I used two helper
functions to wrap (arguments of) Qt functions that accepted or returned
strings, so that I could work with clean Python unicode strings in the
rest of my code. That's the best advice I can give you. Besides, if PyGTK
worked like lxml, you wouldn't have this problem in the first place.
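A minimal sketch of such a pair of helpers might look like the following. All names here (`decode_in`, `encode_out`, `text_boundary`, `toolkit_upper`) are invented for illustration, the toolkit function is a stand-in rather than a real Qt or PyGTK call, UTF-8 is assumed as the toolkit's encoding, and the code uses Python 3 syntax.

```python
import functools


def decode_in(value, encoding="utf-8"):
    """Turn bytes coming from the toolkit into text; pass other values through."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def encode_out(value, encoding="utf-8"):
    """Turn text into bytes just before it is handed to the toolkit."""
    return value.encode(encoding) if isinstance(value, str) else value


def text_boundary(func):
    """Wrap a bytes-in/bytes-out function so callers can use text throughout."""
    @functools.wraps(func)
    def wrapper(*args):
        result = func(*(encode_out(arg) for arg in args))
        return decode_in(result)
    return wrapper


# Stand-in for a toolkit call that accepts and returns UTF-8 bytes (hypothetical).
def toolkit_upper(label: bytes) -> bytes:
    return label.decode("utf-8").upper().encode("utf-8")


upper = text_boundary(toolkit_upper)
result = upper("café")  # text goes in, text comes out
```

The rest of the program then only ever sees text, and the encoding decision lives in one place at the boundary.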