[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Mon Jul 20 15:59:17 CEST 2009
Nicholas Dudfield wrote:
>> So the editor you are working with (which one is it, BTW?)
>
> SublimeText: http://www.sublimetext.com/features
>
> It's a windows only, closed source editor which however has quite a
> few redeeming features, including a Python (2.5) API. All buffer
> access returns native unicode and the substr(pt1, pt2) indexing is
> character based rather than byte.
Sounds like a sensible API design.
There is a special thing about parsing from Python unicode strings, BTW.
Basically, lxml figures out the platform specific encoding that CPython
uses internally (at startup time), and then passes the plain unicode string
buffer to libxml2 together with the correct decoding selector. Thus,
libxml2 will first recode the UCS2/UCS4 encoded byte sequence into UTF-8,
and then parse that.
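
As a rough illustration of that recoding step (not lxml's actual code), the sketch below uses only the standard library codecs. The helper names `native_ucs4_bytes` and `recode_to_utf8` are made up for this example, and it assumes a UCS4 buffer in the platform's byte order.

```python
import sys


def native_ucs4_bytes(text):
    """Return text as a UCS4 (UTF-32) buffer in native byte order, without a BOM.

    This only imitates what an in-memory unicode buffer looks like on a
    UCS4 build; it is an assumption made for illustration.
    """
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return text.encode(codec), codec


def recode_to_utf8(raw, codec):
    """Stand-in for the recoding layer: decode the raw buffer, re-encode as UTF-8."""
    return raw.decode(codec).encode("utf-8")


text = u"<p>caf\u00e9</p>"
raw, codec = native_ucs4_bytes(text)   # four bytes per character
utf8 = recode_to_utf8(raw, codec)      # what the parser layer would then see
```

The point is only that the parser works on the UTF-8 result, so byte offsets it reports need not match character offsets in the original unicode string.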
>> So the parser of libxml2 first encodes the stream to UTF-8 (at the I/O stream buffer layer) and then processes it
>
> When you say it converts first to utf8 internally, does that include
> recoding xml entities as well? ie A file already utf8 encoded may not
> necessarily maintain the bytestream after the first stage of
> processing? eg {'<p>"</p>' : "<p>'</p>"}
No, the codec layer only recodes the characters, so you'd get a UTF-8
encoded "&quot;" byte string as result. The rest is handled by the XML
parser layer, which sees the "&" and considers it the start of an
entity/char reference.
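
A minimal sketch of the two layers, assuming the standard library's ElementTree parser as a dependency-free stand-in for libxml2 (lxml.etree.fromstring should give the same result for this input):

```python
import xml.etree.ElementTree as ET

source = u"<p>&quot;caf\u00e9&quot;</p>"

# Codec layer (stand-in): only characters are recoded, the entity
# reference is still spelled out byte for byte.
utf8 = source.encode("utf-8")
entity_survives = b"&quot;" in utf8

# Parser layer: sees the "&" and resolves the reference to a quote character.
root = ET.fromstring(utf8)
parsed_text = root.text
```

So the entity text survives the recoding untouched and only disappears once the parser has resolved it.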
Stefan