[lxml-dev] A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)

Stefan Behnel stefan_ml at behnel.de
Tue Jul 1 14:03:35 CEST 2008


Hi,

Mike MacCana wrote:
> I solved the crap HTML problem as follows. Hopefully the following will
> be useful to anyone beginning XPath with lxml.

Just adding a few comments as I see fit.


> ## Function to strip non-ascii characters
> ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
> ## for list
> def onlyascii(char):
> 	if ord(char) < 32 or ord(char) > 126:
> 		return ''
> 	else:
> 		return char

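Note that this will not work as expected with multi-byte encodings such as
UTF-8: there, a single character spans several bytes, so a byte-by-byte
filter can split or silently drop characters instead of stripping them
cleanly.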
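
To illustrate the multi-byte problem, here is a minimal Python 3 sketch. The helper names `filter_bytes` and `filter_text`, the cutoff parameters, and the sample string are assumptions made up for this example, not part of the original code. It compares filtering raw UTF-8 bytes with decoding first and filtering whole characters.

```python
def filter_bytes(data, low=32, high=126):
    """Keep only bytes whose value lies in [low, high] (byte-by-byte filter)."""
    return bytes(b for b in data if low <= b <= high)


def filter_text(text, low=32, high=126):
    """Keep only characters whose code point lies in [low, high]."""
    return "".join(ch for ch in text if low <= ord(ch) <= high)


raw = "café €5".encode("utf-8")

# A cutoff above 127 keeps some continuation bytes but drops the lead
# bytes, so the result is no longer valid UTF-8.
mangled = filter_bytes(raw, high=176)

# A cutoff of 126 drops every byte of each non-ASCII character.
dropped = filter_bytes(raw, high=126)

# Decoding first lets the filter work on whole characters.
clean = filter_text(raw.decode("utf-8"))

print(mangled)
print(dropped)
print(clean)
```

Decoding to text before filtering keeps the decision at the character level, so a character is either kept whole or removed whole.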


> ## We can now access our cleaned content as 'cleanedcontent'
> cleanedcontent=cleaner.clean_html(asciihtml)

This will (obviously) parse the HTML into a tree internally, so it's more
efficient to pass a parsed tree directly.
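
A minimal sketch of passing a parsed tree to the cleaner, so the document is only parsed once. The sample markup and the `Cleaner` options are assumptions chosen for illustration. The import fallback is there because recent lxml releases (5.2 and later) ship the cleaner in the separate `lxml_html_clean` package.

```python
import lxml.html

try:
    # lxml 5.2 and later: the cleaner lives in a separate package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases bundle it.
    from lxml.html.clean import Cleaner

# Hypothetical sample input.
raw = "<html><body><p onclick='x()'>Hi<script>alert(1)</script></p></body></html>"

# Parse once into a tree.
tree = lxml.html.fromstring(raw)

# Given an element, clean_html() works on a copy and returns an element,
# so no second parsing step is needed before running XPath.
cleaner = Cleaner(scripts=True, javascript=True)
cleaned = cleaner.clean_html(tree)

texts = cleaned.xpath("//p/text()")
print(texts)
```

Calling the cleaner object directly, as in `cleaner(tree)`, cleans the tree in place instead of returning a copy.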


> ## Go parse our content
> cleanedcontentstringio = StringIO(cleanedcontent)
> parser = etree.XMLParser(recover=True)
> tree = etree.parse(cleanedcontentstringio, parser)

I wonder why you use an XML parser here. The HTML parser will likely work
better, as it knows about self-closing HTML tags.
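
A minimal sketch of the HTML parser variant, with made-up sample markup that uses an unclosed `<br>` and unclosed `<p>` tags. Note that the parser object has to be passed to `etree.parse()` explicitly, otherwise the default XML parser is used.

```python
from io import StringIO

from lxml import etree

# Hypothetical input: <br> is never closed, and neither are the <p> tags.
html = "<html><body><p>First<br>Second<p>Third</body></html>"

# The HTML parser knows that <br> is an empty element and that a new <p>
# implicitly closes the previous one.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

paragraphs = tree.xpath("//p")
text = "".join(tree.xpath("//p//text()"))
print(len(paragraphs), text)
```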

Stefan


