[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Mon Jul 20 09:06:00 CEST 2009
qhlonline wrote:
> I have tried to alter the libxml2 source to add a callback telling the
> current position when an element is parsed.
Note that something that requires patching libxml2 will not make it into an
lxml release.
As you noted before, the parser context already provides this information
at any time, not only when parsing elements. So adding a callback for it is
not a sensible approach.
I'm not even sure what this position means exactly. Is it (1) the byte
position in the original (undecoded) data stream, (2) the byte position in
the UTF-8 encoded parse stream, or (3) the character position in the XML
stream?
According to the libxml2 docs:

    long nbChars : number of xmlChar processed

This sounds like it's the second of these. That would not be useful and
shouldn't be exposed in lxml's API, as relying on it is rather error prone:
it works for ASCII and UTF-8 input, obviously, and may work for some other
encodings depending on the data, but it fails for most other streams. OTOH,
the first and the third /might/ be of interest, depending on your use case,
but are not easily recovered from the information that the parser provides.
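
To illustrate why the three positions differ, here is a minimal sketch in
plain Python. It does not use libxml2 or lxml at all; the helper name
`positions` and the sample document are made up for illustration. It simply
computes, for a marker inside a string, (1) the byte offset in the original
encoding, (2) the byte offset after recoding to UTF-8, and (3) the character
offset.

```python
def positions(text, target, source_encoding):
    """Return (raw_byte_pos, utf8_byte_pos, char_pos) of target in text.

    Hypothetical helper for illustration only; it is not part of lxml
    or libxml2.
    """
    # (3) character position in the decoded text
    char_pos = text.index(target)
    prefix = text[:char_pos]
    # (1) byte position in the original (undecoded) stream:
    #     the length of the prefix when encoded in the source encoding
    raw_pos = len(prefix.encode(source_encoding))
    # (2) byte position in the UTF-8 encoded parse stream
    utf8_pos = len(prefix.encode('utf-8'))
    return raw_pos, utf8_pos, char_pos


# Sample document containing one non-ASCII character (e with acute accent).
doc = '<root>caf\xe9<item/></root>'

latin1 = positions(doc, '<item/>', 'iso-8859-1')
utf16 = positions(doc, '<item/>', 'utf-16-le')
ascii_only = positions('<root>cafe<item/></root>', '<item/>', 'ascii')

print(latin1)      # raw and character offsets agree, UTF-8 offset differs
print(utf16)       # all three offsets differ
print(ascii_only)  # all three offsets agree
```

For pure ASCII data all three numbers coincide, which is why relying on the
UTF-8 byte count appears to work until the input contains non-ASCII
characters or arrives in a different encoding.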
> I have never compiled Cython source before. Can anybody give me some
> suggestions?
If you just change lxml's sources, running setup.py will build it just as
before. All you need to do is install Cython 0.11 or later.
http://codespeak.net/lxml/build.html
Stefan