[lxml-dev] Some HTML target processing issues
Stefan Behnel
stefan_ml at behnel.de
Fri Aug 8 07:18:25 CEST 2008
Hi,
Max Ivanov wrote:
> I've attached a small piece of code.
No, the attachment did not come through.
Anyway, HTML target parsing (or rather: target parsing with the "recover"
option) is rarely used, so you might have run into a bug due to lack of testing.
> lxml target parsing has some problems from my point of view.
>
> 1) I use lxml.html.HTMLParser, which should handle unknown HTML tags
> since it uses lxml.html.HtmlElementClassLookup, which contains this code
> in its lookup function:
> "if node_type == 'element': return
> self._element_classes.get(name.lower(), HtmlElement)". If I understand
> it right, then even unknown tags should be handled properly. But I
> still get an error at the end of the code: lxml.etree.XMLSyntaxError: Tag
> noindex invalid, line 266, column 17
You mix two different things here. The error you get comes from the parser,
whereas the lookup is called by the machinery that wraps an already parsed XML
node as an Element (i.e. much later).
> 2) Even if the whole process fails, etree.fromstring continues to call
> target methods (start, end, comment etc...) even after the invalid tag
> <noindex> has appeared. It's ok, it's some sort of fault tolerance. But
> why does it not call target.close() at the end? Instead of that it
> raises an exception.
Sounds like a bug to me. When you parse with recovery enabled, it should
finish gracefully also for the parser target.
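
For reference, the target protocol under discussion can be sketched as
follows. This is only an illustration under stated assumptions: it uses the
standard library's xml.etree.ElementTree (whose target interface lxml follows)
on well-formed input, so it does not reproduce the HTML "recover" behaviour
described above. EventCollector and parse_with_target are hypothetical names.

```python
import xml.etree.ElementTree as ET


class EventCollector:
    """Hypothetical parser target that records the callbacks it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # Ignore whitespace-only text to keep the event list short.
        if text.strip():
            self.events.append(("data", text))

    def close(self):
        # Whatever close() returns becomes the result of parser.close().
        return self.events


def parse_with_target(markup):
    """Feed markup to a target parser and return the target's close() result."""
    parser = ET.XMLParser(target=EventCollector())
    parser.feed(markup)
    return parser.close()


events = parse_with_target("<html><body><noindex>x</noindex></body></html>")
```

The point of interest is the last step: close() is the only way the target
hands its result back, which is why a parse that never calls it leaves the
target without a chance to finish.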
> Maybe it's better to pass all errors that occurred
> to the close function, so the target could decide what to do.
That's not part of the API. Besides, you can find the errors (and warnings) in
the parser's error log (parser.error_log).
> 3) lxml should stop processing when the target raises an exception.
> Currently it's just ignored and everything continues.
Might be another problem related to "recover" parsing, or a general problem.
I'll look into it when I find the time.
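
As a point of comparison, the behaviour asked for in 3) can be sketched like
this. It is a minimal illustration using the standard library's
xml.etree.ElementTree rather than lxml, so it says nothing about what lxml's
"recover" mode does; TargetError, AbortingTarget and feed_until_error are
hypothetical names.

```python
import xml.etree.ElementTree as ET


class TargetError(Exception):
    """Hypothetical error raised by the target to abort parsing."""


class AbortingTarget:
    """Hypothetical target that refuses one specific tag."""

    def __init__(self):
        self.seen = []
        self.closed = False

    def start(self, tag, attrib):
        self.seen.append(tag)
        if tag == "noindex":
            raise TargetError("unsupported tag: " + tag)

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        self.closed = True
        return self.seen


def feed_until_error(markup):
    """Parse markup and return (target, exception or None)."""
    target = AbortingTarget()
    parser = ET.XMLParser(target=target)
    try:
        parser.feed(markup)
        parser.close()
    except TargetError as exc:
        # The target's exception reaches the caller instead of being ignored.
        return target, exc
    return target, None


target, error = feed_until_error("<html><noindex/><p/></html>")
```

Here the exception raised in start() propagates to the caller and no further
callbacks are delivered for the remaining markup.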
Can you come up with a patch with a couple of simple test cases for
src/lxml/tests/test_htmlparser.py that show the three problems you describe?
That usually makes them easier (read: faster) to fix. There are some target
parser test cases in test_etree.py and test_elementtree.py that you can look
at for inspiration.
Stefan