[lxml-dev] About the position of html parsing by HTML Target parser
Nicholas Dudfield
ndudfield at gmail.com
Mon Jul 20 15:05:09 CEST 2009
>>> So the editor you are working with (which one is it, BTW?)
SublimeText: http://www.sublimetext.com/features
It's a Windows-only, closed-source editor that nonetheless has quite a
few redeeming features, including a Python (2.5) API. All buffer
access returns native unicode, and the substr(pt1, pt2) indexing is
character-based rather than byte-based.
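
To make the character/byte distinction concrete, here is a minimal sketch
of converting between the two kinds of offset. It assumes the buffer text
is available as a unicode string and that the bytes in question are UTF-8;
the helper names are hypothetical and are not part of the SublimeText or
lxml APIs. It is written for Python 3, not the 2.5 the editor embeds.

```python
def char_to_byte_offset(text, char_index, encoding="utf-8"):
    """Return the byte offset matching a character index into text."""
    # Encode only the prefix; its encoded length is the byte offset.
    return len(text[:char_index].encode(encoding))


def byte_to_char_offset(data, byte_index, encoding="utf-8"):
    """Return the character index matching a byte offset into data.

    Raises UnicodeDecodeError if byte_index falls inside a multi-byte
    sequence, since the prefix is then not valid on its own.
    """
    return len(data[:byte_index].decode(encoding))


text = "<p>\u00e9</p>"              # e-acute takes two bytes in UTF-8
close_tag = text.index("</p>")      # character index of the closing tag
print(close_tag, char_to_byte_offset(text, close_tag))
```

The two offsets only agree while everything before the position is ASCII,
which is why a byte position from a parser cannot be passed straight to a
character-based substr().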
>>> Or maybe that would be considered an implementation detail that doesn't show at the API level.
No idea what encoding it uses to represent characters internally.
>>> So the parser of libxml2 first encodes the stream to UTF-8 (at the I/O stream buffer layer) and then processes it
When you say it first converts to UTF-8 internally, does that include
recoding XML entities as well? I.e., might a file that is already UTF-8
encoded not keep the same byte stream after the first stage of
processing? E.g. {'<p>"</p>' : "<p>'</p>"}
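
To illustrate why this matters for positions, here is a small sketch of
how replacing an entity reference shifts every offset after it. It says
nothing about what libxml2 actually does internally; html.unescape from
the Python 3 standard library merely stands in for entity resolution, and
the input string is a made-up example.

```python
import html

source = "<p>&quot;</p>"            # hypothetical input with an entity reference
resolved = html.unescape(source)    # the reference becomes a single character

# The closing tag sits at different offsets before and after resolution.
before = source.index("</p>")
after = resolved.index("</p>")
print(len(source), len(resolved), before, after)
```

If a parser reported positions relative to the resolved text, they would
not map directly back onto the original buffer.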
PS. Thanks very much for your time (and lxml!)