[lxml-dev] About the position of html parsing by HTML Target parser

Nicholas Dudfield ndudfield at gmail.com
Mon Jul 20 15:05:09 CEST 2009


>>> So the editor you are working with (which one is it, BTW?)

SublimeText:  http://www.sublimetext.com/features

It's a Windows-only, closed-source editor that nonetheless has quite a
few redeeming features, including a Python (2.5) API. All buffer
access returns native unicode, and the substr(pt1, pt2) indexing is
character-based rather than byte-based.
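
To illustrate the difference, here is a rough sketch in plain Python 3.
It is not SublimeText's API: the helper names are made up, and I'm
assuming the bytes in question are the UTF-8 encoding of the buffer.
It maps a character index (like the ones substr() takes) to a byte
offset and back:

```python
def char_to_byte_offset(text, char_index):
    """Byte offset, in the UTF-8 encoding of text, of the character at char_index."""
    return len(text[:char_index].encode("utf-8"))


def byte_to_char_offset(data, byte_offset):
    """Character index for a byte offset into UTF-8 data.

    byte_offset must fall on a character boundary, otherwise decode() raises.
    """
    return len(data[:byte_offset].decode("utf-8"))


sample = "<p>caf\u00e9</p>"          # 'é' takes two bytes in UTF-8
encoded = sample.encode("utf-8")

char_index = sample.index("</p>")   # position counted in characters
byte_index = encoded.index(b"</p>") # position counted in bytes
```

The two indexes only agree as long as everything before the position
is plain ASCII.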

>>> Or maybe that would be considered an implementation detail that doesn't show at the API level.

No idea which encoding it uses to represent characters internally.

>>> So the parser of libxml2 first encodes the stream to UTF-8 (at the I/O stream buffer layer) and then processes it

When you say it converts to UTF-8 first internally, does that include
replacing XML entities as well? I.e., might a file that is already
UTF-8 encoded not keep the same byte stream after the first stage of
processing? E.g. {'<p>&quot;</p>' : '<p>"</p>'}
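
To make the concern concrete, here is a small sketch in Python 3. It
uses html.unescape from the standard library purely as a stand-in for
entity expansion. I'm not claiming this is what libxml2 does
internally or at which stage; it only shows that positions measured
after expansion stop matching the original source:

```python
from html import unescape

source = '<p>&quot;</p>'
expanded = unescape(source)  # the entity reference becomes one character

# Where the closing tag starts, before and after expansion.
pos_in_source = source.index('</p>')
pos_in_expanded = expanded.index('</p>')

# How far positions after the entity have shifted.
shift = pos_in_source - pos_in_expanded
```

So if the reported position refers to the expanded text, I'd have to
track the shift for every entity in order to map it back to the
editor buffer.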


PS: Thanks very much for your time (and lxml!)

