[lxml-dev] About the position of html parsing by HTML Target parser

qhlonline qhlonline at 163.com
Mon Jul 20 10:06:43 CEST 2009


2009-07-20,"Stefan Behnel" <stefan_ml at behnel.de> :
>
>qhlonline wrote:
>> I have tried to alter the libxml2 source to add a callback that reports the
>> current position when an element is parsed.
>
>Note that something that requires patching libxml2 will not make it into an
>lxml release.
>
>As you noted before, the parser context already provides this information
>at any time, not only when parsing elements. So adding a callback for it is
>not a sensible approach.
>
>I'm not even sure what this position means exactly. Is it (1) the byte
>position in the original (undecoded) data stream, (2) the byte position in
>the UTF-8 encoded parse stream, or (3) the character position in the XML
>stream?
>
My change is in the 'htmlParseStartTag' function in the HTMLparser.c source file, so I think it is probably the position in the UTF-8 stream.
 
>According to the libxml2 docs:
>
> long nbChars : number of xmlChar processed
>
>This sounds like it's the second information. That would not be useful and
>shouldn't get exposed in lxml's API as it's rather error prone to rely on
>it: works for ASCII and UTF-8, obviously, may work for some other encodings
>depending on the data, but fails for most other streams. OTOH, the first
>and the third information /might/ be of interest, depending on your use
>case, but are not easily recovered from the information that the parser
>provides.
This position may not be precise once the input has been converted from another encoding to UTF-8, but I think it is good enough for our needs, given my team lead's requirements.
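
To make the three kinds of position concrete, here is a minimal Python sketch. It does not touch libxml2 or lxml at all; the helper name tag_offsets and the sample markup are made up for illustration, and it simply assumes the whole document is available as a decoded string.

```python
def tag_offsets(text, tag, source_encoding):
    """Return the offset of the first occurrence of `tag` in three units.

    Hypothetical helper for illustration only, not part of lxml or libxml2.
    """
    char_pos = text.index(tag)      # (3) character position
    prefix = text[:char_pos]        # everything before the tag
    return {
        # (1) byte position in the original (undecoded) data stream
        "source_bytes": len(prefix.encode(source_encoding)),
        # (2) byte position in the UTF-8 encoded parse stream
        "utf8_bytes": len(prefix.encode("utf-8")),
        # (3) character position
        "chars": char_pos,
    }


# Sample markup with one non-ASCII character (e with acute accent).
sample = "<p>caf\u00e9</p><b>x</b>"

print(tag_offsets(sample, "<b>", "latin-1"))
print(tag_offsets(sample, "<b>", "utf-16-le"))
```

For pure ASCII input all three numbers agree. As soon as a non-ASCII character appears before the tag, the UTF-8 byte offset drifts away from both the character offset and the byte offset in the original encoding, which is the imprecision mentioned above.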
>
>> I have never compiled Cython source before. Can anybody give me some
>> suggestions?
>
>If you just change lxml's sources, running setup.py will build it just as
>before. All you need to do is install Cython 0.11 or later.
>
Thank you for your help! 
>http://codespeak.net/lxml/build.html
>
>Stefan