[lxml-dev] Some HTML target processing issues
Max Ivanov
ivanov.maxim at gmail.com
Fri Aug 8 06:54:57 CEST 2008
Hi all!
I've attached a small piece of code. From my point of view, lxml target
parsing has some problems.
1) I use lxml.html.HTMLParser, which should handle unknown HTML tags
since it uses lxml.html.HtmlElementClassLookup, which contains this code
in its lookup function:
"if node_type == 'element': return
self._element_classes.get(name.lower(), HtmlElement)". If I understand
it correctly, even unknown tags should be handled properly. But I
still get an error at the end of the code: lxml.etree.XMLSyntaxError: Tag
noindex invalid, line 266, column 17
I don't understand why; I hope someone can give me some good advice :)
2) Even if the whole process fails, etree.fromstring continues to call
the target methods (start, end, comment, etc.) even after the invalid tag
<noindex> has appeared. That's OK, it's a sort of fault tolerance. But
why does it not call target.close() at the end? Instead, it
raises an exception. If document processing continues after an error,
then target.close() should be called too! Maybe it would be better to
pass all the errors that occurred to the close function, so the target
could decide what to do.
3) lxml should stop processing when the target raises an exception.
Currently the exception is just ignored and everything continues.
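
To illustrate the reasoning in point 1, here is a minimal sketch of the
quoted lookup expression in plain Python. The classes HtmlElement,
FormElement and ClassLookupSketch are hypothetical stand-ins, not the real
lxml code; the sketch only models the dictionary fallback and says nothing
about how the underlying parser itself treats unknown tags.

```python
# Hypothetical stand-ins for the lxml.html classes; NOT the real lxml code.
class HtmlElement:
    pass


class FormElement(HtmlElement):
    pass


class ClassLookupSketch:
    def __init__(self):
        # Maps lower-cased tag names to specialised element classes.
        self._element_classes = {"form": FormElement}

    def lookup(self, node_type, name):
        if node_type == "element":
            # Names missing from the mapping fall back to HtmlElement.
            return self._element_classes.get(name.lower(), HtmlElement)
        return None


lookup = ClassLookupSketch()
known = lookup.lookup("element", "FORM")        # registered tag
unknown = lookup.lookup("element", "noindex")   # unregistered tag
```

In this sketch an unregistered name such as "noindex" simply resolves to
the generic class, which is the behaviour point 1 expects from the lookup
step.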
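
The target methods mentioned in points 2 and 3 follow the ElementTree
parser target protocol. As a rough sketch of the behaviour being asked for,
the example below uses the standard library's xml.etree.ElementTree as a
stand-in, not lxml, so it does not reproduce the lxml behaviour described
above. RecordingTarget, FailingTarget and DOC are hypothetical names made
up for this illustration.

```python
import xml.etree.ElementTree as ET


class RecordingTarget:
    """Hypothetical target that records which callbacks were invoked."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        pass  # character data is ignored in this sketch

    def close(self):
        self.events.append(("close", None))
        return self.events


class FailingTarget(RecordingTarget):
    """Hypothetical target that rejects one tag by raising an exception."""

    def start(self, tag, attrib):
        if tag == "noindex":
            raise ValueError("unsupported tag: " + tag)
        super().start(tag, attrib)


DOC = "<html><noindex>text</noindex></html>"

# Normal run: parser.close() calls target.close() and returns its result.
parser = ET.XMLParser(target=RecordingTarget())
parser.feed(DOC)
events = parser.close()

# Failing run: the exception raised by the target reaches the caller.
failing = FailingTarget()
raised = False
try:
    failing_parser = ET.XMLParser(target=failing)
    failing_parser.feed(DOC)
    failing_parser.close()
except ValueError:
    raised = True
```

With the standard library parser, the value returned by target.close() is
what the caller gets back, and an exception raised inside a target method
propagates to the caller instead of being ignored.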