[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Mon Jul 20 13:00:27 CEST 2009
Nicholas Dudfield wrote:
> ahocorasick doesn't seem to work with native unicode,
Ah, great. :-/ I never tried it myself, I just read about it more than once
on c.l.py. Looking at the code, it actually reuses an existing C
implementation, which is why it doesn't handle unicode strings.
> which makes it
> about as useful (for my particular purpose anyway) as the parser
> context utf8 stream positions :(
So the editor you are working with (which one is it, BTW?) gives you
unicode strings? I would have expected it to work with byte buffers
internally. Or maybe that would be considered an implementation detail that
doesn't show at the API level.
> Is there any fundamental reason why an xml parser couldn't work with
> native unicode? ie an abstract character stream? I'm completely
> clueless when it comes to parsers.
Not a fundamental reason, but it's a lot simpler and faster to parse XML
streams in UTF-8 than in any other encoding, and it's also more efficient
to parse them as a UTF-8 byte stream than as a Unicode character stream,
especially in C. You basically read one byte and immediately know if it
represents a control character or not. A fixed-width Unicode character
stream, in contrast, needs 4 bytes per character to cover all possible
code points.
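To illustrate the byte-level argument, here is a minimal Python 3 sketch. It is not how libxml2 is implemented; `find_tag_starts` is a hypothetical helper made up for this example. It relies only on a property of UTF-8 itself: every byte of a multi-byte sequence is >= 0x80, so an ASCII byte such as '<' (0x3C) can never appear inside another character.

```python
# Sketch only: scanning UTF-8 bytes for markup without decoding them.

def find_tag_starts(data):
    """Return the byte offsets of every '<' in UTF-8 encoded `data`.

    Hypothetical helper for illustration, not part of lxml or libxml2.
    """
    # Iterating over bytes yields integers; 0x3C is ASCII '<'.
    return [i for i, b in enumerate(data) if b == 0x3C]

sample = "<a>é<b/>€</a>".encode("utf-8")
tag_offsets = find_tag_starts(sample)

# Per-character storage: variable width in UTF-8, fixed 4 bytes in UTF-32.
utf8_sizes = [len(ch.encode("utf-8")) for ch in "<é€"]
utf32_sizes = [len(ch.encode("utf-32-le")) for ch in "<é€"]
```

The single-byte comparison is safe because none of the bytes that encode 'é' or '€' can be mistaken for '<'.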
Also, UTF-8 is the internal representation format used inside of libxml2
anyway, for the same reason. So the parser of libxml2 first converts the
input stream to UTF-8 (at the I/O stream buffer layer) and then processes
it without further modifications.
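If a position is available as a byte offset into the UTF-8 form of the document, it can be mapped to an index into the corresponding unicode string by decoding the prefix. The following Python 3 sketch assumes exactly that: the offset refers to the UTF-8 encoding of the same text and falls on a character boundary. `byte_offset_to_char_index` is a hypothetical helper, not an lxml API.

```python
# Sketch only: convert a UTF-8 byte offset into a unicode character index.

def byte_offset_to_char_index(data, byte_offset):
    """Map a byte offset in UTF-8 `data` to an index in the decoded text.

    Raises UnicodeDecodeError if the offset splits a multi-byte character.
    Hypothetical helper for illustration.
    """
    return len(data[:byte_offset].decode("utf-8"))

text = "<a>é<b/>€</a>"
data = text.encode("utf-8")

# The second '<' sits at byte offset 5, because 'é' occupies two bytes.
char_index = byte_offset_to_char_index(data, 5)
```

Decoding the prefix costs time proportional to the offset, so for many lookups it may be worth building an offset table in a single pass instead.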
Stefan