[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Fri Jul 17 08:53:38 CEST 2009
qhlonline wrote:
> I need to know the actual parsing position when the target parser finds
> certain special tags.
Interesting requirement. I wonder who designs XML formats where you have to
know the stream position to read them. Do you actually mean the byte
position or the character position?
> Does 'structural position' mean information about the line and column,
> like in a parsing error report?
No, with "structural position" I meant the position of the element within
the tree structure, such as the unique path from the root element to the
currently parsed element.
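
To illustrate what such a structural position could look like, here is a
minimal sketch of a parser target that keeps a stack of the open elements
and records the path to each tag of interest. The names PathTrackingTarget,
watched_tags and find_paths are made up for this example. It uses the
standard library's XMLParser so that it runs without extra dependencies;
since lxml's target interface follows the same start/end/data/close
protocol, the same target object should also be usable with
lxml.etree.HTMLParser(target=...).

```python
from xml.etree.ElementTree import XMLParser


class PathTrackingTarget:
    """Hypothetical parser target that records the path to watched tags."""

    def __init__(self, watched_tags):
        self.watched_tags = set(watched_tags)
        self._path = []            # steps such as "p[2]" for the open elements
        self._child_counts = [{}]  # per open element: tag -> children seen so far
        self.found = []            # paths of watched elements, in document order

    def start(self, tag, attrib):
        # Number the element among its same-named siblings so the path is unique.
        counts = self._child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self._path.append("%s[%d]" % (tag, counts[tag]))
        self._child_counts.append({})
        if tag in self.watched_tags:
            self.found.append("/" + "/".join(self._path))

    def end(self, tag):
        self._path.pop()
        self._child_counts.pop()

    def data(self, text):
        pass  # text content is not needed for the structural position

    def close(self):
        return self.found


def find_paths(document, watched_tags):
    """Parse the document and return the paths of all watched tags."""
    parser = XMLParser(target=PathTrackingTarget(watched_tags))
    parser.feed(document)
    return parser.close()  # returns whatever the target's close() returns
```

For example, find_paths("<html><body><p>a</p><p>b</p></body></html>", ["p"])
yields one path per <p> element, each ending in a different sibling index.
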
> In the libxml2 source file SAX2.c there is a callback interface
> (charactersSAXFunc) for the character event:
>
> hdlr->characters = xmlSAX2Characters
>
> The event handler has a 'len' parameter which gives the currently parsed
> HTML stream length. I also noticed that the lxml source file Saxparser.pxi
> contains a function definition:
>
> cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil:
>
> It works as the handler of the sax.character event. How can I change
> the lxml source code of the target parser to add sax.character event
> processing with the 'data_len' parameter?
You don't have to. Just take the string that you receive in .data(), encode
it as UTF-8, and take its len(). However, that doesn't help you with your
problem, as it is only the length of that one text chunk, not the stream
position you are looking for.
Stefan
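
As a rough illustration of the len() suggestion above, the following sketch
sums the UTF-8 byte length of every text chunk passed to .data(). The names
TextByteCounter and count_text_bytes are hypothetical, and the standard
library's XMLParser stands in for lxml's target parser here, on the
assumption that both call the target in the same way. Note that the total
covers text content only, not markup, so it is not a stream position.

```python
from xml.etree.ElementTree import XMLParser


class TextByteCounter:
    """Hypothetical parser target that sums the UTF-8 size of all text."""

    def __init__(self):
        self.total_bytes = 0

    def start(self, tag, attrib):
        pass  # tags themselves are not counted

    def end(self, tag):
        pass

    def data(self, text):
        # The parser hands over a decoded string; re-encode it to get the
        # length in bytes rather than in characters.
        self.total_bytes += len(text.encode("utf-8"))

    def close(self):
        return self.total_bytes


def count_text_bytes(document):
    """Return the total UTF-8 byte length of the document's text content."""
    parser = XMLParser(target=TextByteCounter())
    parser.feed(document)
    return parser.close()  # returns whatever the target's close() returns
```

The parser may deliver one run of text in several .data() calls, which is
why the sketch accumulates a total instead of relying on a single call.
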