[lxml-dev] About the position of html parsing by HTML Target parser

Stefan Behnel stefan_ml at behnel.de
Fri Jul 17 08:53:38 CEST 2009


qhlonline wrote:
> I have to know the real parsing position when some special tags are found
> by the target parser.

Interesting requirement. I wonder who designs XML formats where you have to
know the stream position to read them. Do you actually mean the byte
position or the character position?


> Does 'structural position' mean information about the line and column,
> like that in a parsing error report?

No, with "structural position" I meant the position of the element within
the tree structure, such as the unique path from the root element to the
currently parsed element.
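
As a rough illustration of what such a structural position could look like,
here is a minimal sketch of a parser target that keeps a stack of the open
elements and builds a path from it. The names PathTarget and element_paths
are made up for this example, and it drives the target with the standard
library's XMLParser so that it runs on its own. It assumes that lxml's
etree.HTMLParser(target=...) accepts the same start/end/data/close
interface, so check that against the lxml version you use.

```python
from xml.etree.ElementTree import XMLParser


class PathTarget:
    """Hypothetical parser target that records the path of every element."""

    def __init__(self):
        self._stack = []     # path steps of the currently open elements
        self._counts = [{}]  # per nesting level: how often each tag was seen
        self.paths = []      # one path per start event, in document order

    def start(self, tag, attrib):
        siblings = self._counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        # The index makes the step unique among same-named siblings.
        self._stack.append("%s[%d]" % (tag, siblings[tag]))
        self._counts.append({})
        self.paths.append("/" + "/".join(self._stack))

    def end(self, tag):
        self._stack.pop()
        self._counts.pop()

    def data(self, text):
        pass  # text content is not needed for the structural position

    def close(self):
        return self.paths


def element_paths(xml_text):
    """Parse xml_text and return the path of each element that was opened."""
    parser = XMLParser(target=PathTarget())
    parser.feed(xml_text)
    return parser.close()


if __name__ == "__main__":
    for path in element_paths("<root><a/><b/><a><c/></a></root>"):
        print(path)
```

The path identifies an element by its place in the tree only. It says
nothing about byte or character offsets in the input.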


> In the libxml2 source file
> SAX2.c there is a callback interface (charactersSAXFunc) for the character event:
>
>     hdlr->characters = xmlSAX2Characters
>
> The event handler has a 'len' parameter which tells the length of the HTML
> stream parsed so far, and I noticed that in the lxml source file
> Saxparser.pxi there is a function definition:
> 
>       cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil:
>
> It works as the handler for the SAX characters event. How can I change
> the lxml source code of the target parser so that it also processes the
> SAX characters event with the 'data_len' parameter?

You don't have to. Just take the string that you receive in .data(), encode
it as UTF-8, and take its len(). However, that doesn't help you with your
problem: it is only the length of that one chunk of text, not a position in
the input stream, so it is not the information you are looking for.
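
For reference, a minimal sketch of that length computation. The names
ByteCountTarget and utf8_text_lengths are made up for this example, and it
again uses the standard library's XMLParser as a stand-in, on the
assumption that an lxml target parser calls .data() in the same way. Note
that a parser may deliver one run of text in several .data() calls, so
only the sum over the chunks is meaningful.

```python
from xml.etree.ElementTree import XMLParser


class ByteCountTarget:
    """Hypothetical target that records the UTF-8 length of each text chunk."""

    def __init__(self):
        self.lengths = []

    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, text):
        # Length in bytes of this chunk once encoded as UTF-8.
        self.lengths.append(len(text.encode("utf-8")))

    def close(self):
        return self.lengths


def utf8_text_lengths(xml_text):
    """Return the UTF-8 byte length of every text chunk passed to .data()."""
    parser = XMLParser(target=ByteCountTarget())
    parser.feed(xml_text)
    return parser.close()


if __name__ == "__main__":
    # "h\u00e9llo" has 5 characters but 6 bytes in UTF-8.
    print(sum(utf8_text_lengths("<p>h\u00e9llo</p>")))
```

These numbers cover text content only. Tags, attributes, comments and
entity references in the input are not counted, which is one more reason
why they cannot serve as a stream position.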

Stefan


