[lxml-dev] Possible bug
Stefan Behnel
stefan_ml at behnel.de
Thu Mar 19 16:10:42 CET 2009
Bob Kline wrote:
> Before I dig into the work of producing a repro case, would the lxml
> developers be interested in a bug report if I confirm that the XSLT
> parser which comes with the lxml package chokes on the serialized
> version of an XML tree assembled by lxml's HTML parser when the
> original HTML document contains a comment which the XML spec doesn't
> like (because "--" appears inside the comment)?
So what you do is:
1) parse an HTML document that contains "--" in a comment
2) serialise it to XML, which produces XML that is not well-formed
because of the "--" in the comment value
You were not clear about the rest, but I guess it was not:
3) parse the result with an XML parser, which does not fail
4) pass it to XSLT(), which then fails to initialise
but rather:
3) return the serialised XML from a custom document resolver while an
XSLT is running
Is that right?
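Steps 1) and 2) can be sketched roughly as follows. This is only a minimal illustration, not a confirmed reproduction of the report: the sample document and the helper names serialise_html_as_xml and is_well_formed are made up for the example.

```python
from lxml import etree

# Hypothetical sample input: an HTML comment containing "--",
# which HTML tolerates but the XML spec forbids.
HTML_SOURCE = b"<html><body><!-- a -- b --><p>Hello</p></body></html>"


def serialise_html_as_xml(html_bytes):
    """Parse with the HTML parser, then serialise using XML syntax."""
    root = etree.fromstring(html_bytes, etree.HTMLParser())
    return etree.tostring(root)


def is_well_formed(xml_bytes):
    """Return True if a plain XML parser accepts the input."""
    try:
        etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError:
        return False
    return True


xml_bytes = serialise_html_as_xml(HTML_SOURCE)
print(xml_bytes)
print(is_well_formed(xml_bytes))
```

If the second print shows False, the serialised output is what breaks any XML consumer further down, including an XSLT that loads it through a resolver.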
As an under-informed guess, I would assume step 2) to be the problem here,
in which case there is not much lxml can do about it. I also doubt that
libxml2 can do much here, as the problem is that you are serialising an
HTML tree into XML syntax without first adapting content that is legal in
HTML but not in XML.
A good way to work around this would be to let the HTML parser remove all
comments on the way in by passing "remove_comments=True" to HTMLParser() -
unless you really need them in the document. Some of the other parser
options might also be of interest for your use case.
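A minimal sketch of that workaround, assuming the comments are not needed later on. The helper name parse_html_without_comments and the sample input are made up for illustration.

```python
from lxml import etree


def parse_html_without_comments(html_bytes):
    """Parse HTML and drop all comments while parsing."""
    parser = etree.HTMLParser(remove_comments=True)
    return etree.fromstring(html_bytes, parser)


# Hypothetical input with a comment that would be illegal in XML.
root = parse_html_without_comments(
    b"<html><body><!-- a -- b --><p>Hello</p></body></html>"
)

# With the comment gone, the XML serialisation can be parsed again.
xml_bytes = etree.tostring(root)
reparsed = etree.fromstring(xml_bytes)
print(xml_bytes)
```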
You can also use lxml.html.clean to remove some other potentially harmful
content from the HTML file before passing it into an XSLT.
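A rough sketch of the cleaning step, under the assumption that the default Cleaner settings are acceptable for your documents. Newer lxml releases ship this module as a separate package (lxml_html_clean), so the import is guarded here; the helper name clean_for_xslt and the sample markup are made up for the example.

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Newer lxml releases need the separate lxml_html_clean package.
    Cleaner = None


def clean_for_xslt(html_text):
    """Strip comments and scripts from an HTML string."""
    if Cleaner is None:
        raise RuntimeError("lxml.html.clean is not available")
    cleaner = Cleaner(comments=True, scripts=True)
    return cleaner.clean_html(html_text)


# Hypothetical sample input.
SAMPLE = (
    "<html><body><!-- a -- b -->"
    "<script>alert(1)</script><p>Hello</p></body></html>"
)

if Cleaner is not None:
    print(clean_for_xslt(SAMPLE))
```

The Cleaner also rewrites parts of the page structure by default, so it is worth checking its output against what the stylesheet expects.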
Stefan