[lxml-dev] About the position of html parsing by HTML Target parser
qhlonline
qhlonline at 163.com
Mon Jul 20 10:27:48 CEST 2009
2009-07-20,"Stefan Behnel" stefan_ml at behnel.de:
>
>qhlonline wrote:
>> I have tried to alter the libxml2 source to add a callback that reports
>> the current position when an element is parsed.
>
>Note that something that requires patching libxml2 will not make it into an
>lxml release.
>
>As you noted before, the parser context already provides this information
>at any time, not only when parsing elements. So adding a callback for it is
>not a sensible approach.
>
>I'm not even sure what this position means exactly. Is it (1) the byte
>position in the original (undecoded) data stream, (2) the byte position in
>the UTF-8 encoded parse stream, or (3) the character position in the XML
>stream?
>
>According to the libxml2 docs:
>
> long nbChars : number of xmlChar processed
>
>This sounds like it's the second information. That would not be useful and
>shouldn't get exposed in lxml's API as it's rather error prone to rely on
>it: works for ASCII and UTF-8, obviously, may work for some other encodings
>depending on the data, but fails for most other streams. OTOH, the first
>and the third information /might/ be of interest, depending on your use
>case, but are not easily recovered from the information that the parser
>provides.
>
>
>> I have never compiled Cython source before. Can anybody give me some
>> suggestions?
>
>If you just change lxml's sources, running setup.py will build it just as
>before. All you need to do is install Cython 0.11 or later.
>
>http://codespeak.net/lxml/build.html
>
>Stefan
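
To make the three meanings of "position" listed above concrete, here is a small sketch in plain Python. It does not touch libxml2 or lxml at all. The sample string, the UTF-16 source encoding and the helper name `offsets` are assumptions picked only for illustration; the point is that the same spot in a document can have three different offsets.

```python
def offsets(text, marker, source_encoding):
    """Return the offset of `marker` under three notions of position.

    Hypothetical helper for illustration, not part of lxml or libxml2.
    """
    raw = text.encode(source_encoding)   # (1) original, undecoded byte stream
    utf8 = text.encode("utf-8")          # (2) UTF-8 encoded parse stream
    return {
        "source_bytes": raw.index(marker.encode(source_encoding)),
        "utf8_bytes": utf8.index(marker.encode("utf-8")),
        "characters": text.index(marker),  # (3) character position
    }


# A document with one non-ASCII character before the closing tag.
result = offsets("<p>caf\u00e9</p>", "</p>", "utf-16-le")
print(result)
```

For this input the closing tag starts at byte 14 of the UTF-16 data, at byte 8 of the UTF-8 data and at character 7, so a UTF-8 byte count cannot be used directly as an index into the original input or into the decoded text.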
Now the key problem for me is that I don't know whether, and how, I can change the lxml target parser definition so that it supports my new callback in libxml2. I suspect that the TreeBuilder class in saxparser.pxi is the base class of the target parser, because it supports methods like 'start', 'end', 'close' and 'data', just like the target parser. But I am not sure, because, judging from its name, this class seems to be used to build an ElementTree or DOM for the parsed HTML document and to have no relationship with the target parser. Am I looking in the wrong place?
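
For reference, this is a minimal sketch of the target interface in question, written against the standard library's xml.etree.ElementTree so that it runs without lxml. The assumption is that lxml's etree.HTMLParser(target=...) accepts the same kind of object, since lxml documents its target interface as ElementTree compatible. The class name EventCollector and the sample markup are made up for illustration. The sketch shows only the public protocol (any object with these methods can be a target); it says nothing about how saxparser.pxi wires this up internally.

```python
from xml.etree.ElementTree import XMLParser


class EventCollector:
    """Hypothetical target: records parse events instead of building a tree."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Called for each opening tag.
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        # Called for each closing tag.
        self.events.append(("end", tag))

    def data(self, data):
        # Called for text content; may arrive in more than one chunk.
        self.events.append(("data", data))

    def close(self):
        # Whatever is returned here is what parser.close() returns.
        return self.events


parser = XMLParser(target=EventCollector())
parser.feed('<html><body><p class="x">Hello</p></body></html>')
events = parser.close()

for event in events:
    print(event)
```

Note that none of the callbacks receives a position argument, which is why reporting a position would need either a change to this interface or a separate way to query the parser.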