[lxml-dev] About the position of html parsing by HTML Target parser
qhlonline
qhlonline at 163.com
Fri Jul 17 05:15:55 CEST 2009
"Stefan Behnel" <stefan_ml at behnel.de> wrote:
>
>qhlonline wrote:
>> Hi all, I am parsing HTML files with the lxml target parser. Now I want to
>> know: when I have reached some HTML tag, how can I find out the position in
>> the HTML document I am parsing?
>
>These are two different requirements. Do you really need the line/character
>information here? Isn't the structural position enough?
>
I have to know the actual parsing position when the target parser finds certain special tags. Does 'structural position' mean information about the line and column, like that in a parsing error report? I think that is of no help in computing the length of the parsed stream. In the libxml2 source file SAX2.c there is a callback interface (charactersSAXFunc) for the character event:
hdlr->characters = xmlSAX2Characters
The event handler has a 'len' parameter, which tells the length of the HTML stream currently parsed. I also noticed that the lxml source file Saxparser.pxi contains a function definition:
cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil:
It works just as the processor of the SAX character event. How can I change the lxml source code of the target parser to add SAX character event processing with a 'data_len' parameter? Not the default 'data' function of the target parser, of course: it has no parameter like 'data_len', and its 'data' parameter is only the text inside an element, not the whole parsed string.
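To illustrate the point about the 'data' callback, here is a minimal sketch of a parser target that sums the length of every text chunk it receives. It uses the standard library's xml.etree.ElementTree.XMLParser as a stand-in for the lxml target parser (the target methods start/end/data/close follow the same pattern), and it feeds well-formed markup so the XML parser accepts it. The class name LengthCountingTarget and the sample document are made up for this example.

```python
from xml.etree.ElementTree import XMLParser


class LengthCountingTarget:
    """Hypothetical target that adds up the length of all text chunks."""

    def __init__(self):
        self.text_length = 0
        self.tags = []

    def start(self, tag, attrib):
        # Called for every opening tag; no position information is passed in.
        self.tags.append(tag)

    def end(self, tag):
        pass

    def data(self, data):
        # 'data' is only the character content, never the markup itself,
        # so len(data) plays the role of a per-chunk 'data_len'.
        self.text_length += len(data)

    def close(self):
        # The parser's close() returns whatever the target's close() returns.
        return self.text_length


document = "<html><body><p>Hello</p><p>world!</p></body></html>"
target = LengthCountingTarget()
parser = XMLParser(target=target)
parser.feed(document)
text_length = parser.close()

print("characters seen by data():", text_length)
print("characters in the document:", len(document))
```

The sum of the chunk lengths covers only the text between tags, so it stays below the document length and cannot serve as a stream position on its own.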
>> Are there any callbacks in the target parser
>> that can tell me the total stream length I have parsed?
>
>Not that I know of. Same as in ElementTree, I'd say.
>
>Stefan
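One possible workaround, sketched outside lxml: the standard library's html.parser.HTMLParser has a getpos() method that returns the current line number and column, and these can be turned into an absolute character offset when the source text is at hand. This is a different parser, not the lxml target interface, and the class name PositionRecorder and the sample input are assumptions made for the example.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical parser that records where each start tag begins."""

    def __init__(self, source):
        super().__init__()
        # Offset of the first character of every line in the source.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives a 1-based line number and a 0-based column.
        line, column = self.getpos()
        offset = self.line_starts[line - 1] + column
        self.positions.append((tag, line, column, offset))


source = "<html>\n<body>\n  <p>Hello</p>\n</body>\n</html>\n"
recorder = PositionRecorder(source)
recorder.feed(source)
recorder.close()

for tag, line, column, offset in recorder.positions:
    print(tag, "line", line, "column", column, "offset", offset)
```

The offsets here are counted in characters of the decoded string; for a byte position in the original stream, the same table would have to be built over the raw bytes.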