[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Fri Jul 17 09:00:44 CEST 2009
qhlonline wrote:
> If there is some way for me to get the parsing context, and if I can
> access this structure directly, maybe this problem can be solved. In
> libxml2 there is a definition of "struct _xmlParserCtxt". This structure
> has a member "long nbChars;", which is just the "number of xmlChar
> processed".
You could subtype the XMLParser class in Cython. That's not trivial, since
it's not exported at the C-API level. You'll have to redefine the class
hierarchy in a separate lxml.etree.pxd file to do that. Note that you only
need to access the _parser_context and (maybe) _push_parser_context. The
other object type fields in the classes can be set to type "object" instead
of their real type.
But remember that the type isn't public. Future lxml versions may change
it, which means that you will have to adapt your code.
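If depending on non-public types is too fragile, a coarser approximation is
possible with the public API only. The following is a minimal sketch of a
different technique, not the subtyping approach described above: it pushes
the input through the feed interface in small chunks and records how many
bytes had been handed to the parser when each start event fired. The names
PositionTarget and parse_with_offsets and the chunk size are made up for
illustration, and the recorded offset is only an upper bound, since libxml2
may buffer input before reporting an event.

```python
from lxml import etree


class PositionTarget:
    """Parser target that tags each start event with a coarse input offset."""

    def __init__(self):
        self.fed = 0       # bytes handed to the parser so far (set by the caller)
        self.events = []   # (tag, bytes fed when the start event fired)

    def start(self, tag, attrib):
        self.events.append((tag, self.fed))

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        # Whatever the target returns here is returned by parser.close().
        return self.events


def parse_with_offsets(html_bytes, chunk_size=16):
    """Feed html_bytes in chunks; return a list of (tag, approximate offset)."""
    target = PositionTarget()
    parser = etree.HTMLParser(target=target)
    for i in range(0, len(html_bytes), chunk_size):
        chunk = html_bytes[i:i + chunk_size]
        # Count the chunk before feeding it, so that events fired while it
        # is being parsed see an offset at or past their true position.
        target.fed += len(chunk)
        parser.feed(chunk)
    return parser.close()
```

A smaller chunk_size tightens the bound at the cost of more feed() calls. The
result is never the exact character position that nbChars would give.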
That said, I still do not understand why you need the character stream
position for parsing. Could you elaborate on that?
Stefan