[lxml-dev] About the position of html parsing by HTML Target parser

Stefan Behnel stefan_ml at behnel.de
Mon Jul 20 09:48:04 CEST 2009


[bringing this back to the list]

Nicholas Dudfield wrote:
>>> even then it's only a *byte* at a time, not a *character* at a time).
> 
> I have been feeding it a unicode character at a time and tested it
> with some Chinese characters.

Ah, sure, that works.


>> Searching for the
>> above regexp is safe as "<" cannot occur anywhere in the XML data stream
>> except for a tag start/end or comment/PI
> 
> or in embedded <script></script> with CDATA eg
> $("someSelector").html('<p> ...')

Right, I keep forgetting that CDATA is evil. But you can check for CDATA as
well.
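
A minimal sketch of such a check, using only the standard "re" module. The
helper names cdata_spans() and in_cdata() are made up for illustration; the
idea is to record where the CDATA sections are and discard any tag hit that
falls inside one of them.

```python
import re

# Matches a complete CDATA section, including its delimiters.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

def cdata_spans(source):
    """Return the (start, end) offsets of every CDATA section in source."""
    return [m.span() for m in CDATA_RE.finditer(source)]

def in_cdata(offset, spans):
    """True if offset falls inside one of the given CDATA spans."""
    return any(start <= offset < end for start, end in spans)

source = "<script><![CDATA[$('x').html('<p>hi</p>')]]></script><p>real</p>"
spans = cdata_spans(source)

# Keep only the "<p" hits that lie outside of CDATA sections.
hits = [m.start() for m in re.finditer(r"<p[\s>/]", source)
        if not in_cdata(m.start(), spans)]
```

Here only the second "<p" survives, since the first one sits inside the
CDATA section.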


> For clarification, you are saying skip the target parser and feeding
> process and just iterate over the elements and regex for the tags?
> something like "<tagname\s"  but accounting for tags such as <br/> etc

I'd try that, yes. It should be a lot faster, since your code then only
handles the elements and not, for example, the text content. And you always
know exactly what the next opening/closing tag must be.

You can special-case self-closing tags by checking for contained text and
children. If an element has neither, be prepared to find the next opening
tag before finding a closing tag. Having the parsed tree available gives
you a lot of information that you can exploit.
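
As a sketch of that special case (helper names are made up): an element
without text and without children may have been written either as "<tag/>"
or as "<tag></tag>", so accept both when looking for its end. This assumes
that no attribute value contains "/>".

```python
import re

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree   # stdlib fallback

def may_be_self_closing(el):
    """True if el has neither text nor children, so "<tag/>" is possible."""
    return not el.text and len(el) == 0

def empty_element_end(source, el, start):
    """Return the offset just past an empty element that starts at 'start'.

    Assumption: no attribute value of the element contains "/>".
    """
    pattern = re.compile(r"/>|</%s\s*>" % re.escape(el.tag))
    match = pattern.search(source, start)
    if match is None:
        raise ValueError("no end found for %r" % el.tag)
    return match.end()

source = "<root><br/><hr></hr><p>x</p></root>"
root = etree.fromstring(source)
br, hr, p = list(root)

flags = [may_be_self_closing(el) for el in (br, hr, p)]
br_end = empty_element_end(source, br, source.index("<br"))
hr_end = empty_element_end(source, hr, source.index("<hr"))
```

For an element where may_be_self_closing() is false, the tree tells you
whether the next tag in the text is a child's opening tag or the closing tag.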

You may also consider using something like ahocorasick to search for all
opening tags at once (note that lxml makes the namespace prefix available
for each Element), plus "<![CDATA". Then double-check the result by
traversing the tree and fixing up any false positives 'somehow'.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
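
To illustrate the idea without depending on that module (I am not going to
guess its API here), the sketch below uses one combined regular expression
as a stand-in for a real Aho-Corasick matcher. The function name scan() and
the way the tag names are passed in are assumptions; with lxml, prefixed
names could be built from each Element's prefix and local name.

```python
import re

def scan(source, tag_names):
    """Return (offset, name) for every opening-tag or CDATA-start candidate.

    A single alternation regex stands in for an Aho-Corasick matcher.
    Hits are candidates only and still need checking against the tree.
    """
    names = "|".join(re.escape(n)
                     for n in sorted(set(tag_names), key=len, reverse=True))
    scanner = re.compile(r"<!\[CDATA\[|<(?P<tag>%s)(?=[\s/>])" % names)
    return [(m.start(), m.group("tag") or "CDATA")
            for m in scanner.finditer(source)]

source = "<doc><ns:item/><![CDATA[<doc>]]><item>x</item></doc>"
hits = scan(source, ["doc", "ns:item", "item"])
```

Note that the "<doc>" inside the CDATA section is reported as well. That is
exactly the kind of false positive the tree traversal has to weed out.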


>> I will also check if there is a way to provide the position at the (target)
>> parser level, but that needs to fit the current interface. And I currently
>> do not have much time to dig into this.
> 
> That would be awesomely ideal :)

... although not that helpful, as I just noted in another post. So I don't
think this is a viable solution for now.

Stefan

