[lxml-dev] Possible bug
Bob Kline
bkline at rksystems.com
Thu Mar 19 19:37:03 CET 2009
Stefan Behnel wrote:
> Bob Kline wrote:
>
>> Before I dig into the work of producing a repro case, would the lxml
>> developers be interested in a bug report if I confirm that the XSLT
>> parser which comes with the lxml package chokes on the serialized
>> version of an XML tree assembled by lxml's HTML parser when the
>> original HTML document contains a comment which the XML spec doesn't
>> like (because "--" appears inside the comment)?
>>
>
> So what you do is:
>
> 1) parse an HTML document that contains "--" in a comment
> 2) serialise it to XML, which produces broken XML because of the comment
> value
>
> You were not clear about the rest, but I guess it was not:
>
Hi, Stefan. Thanks for your reply.

Sorry for not being sufficiently clear. Here's what I'm doing:

    import urllib2
    from lxml import etree

    # Fetch the HTML page and parse it with lxml's HTML parser.
    reader = urllib2.urlopen(urlForHtmlPage)
    htmlPage = reader.read()
    tree = etree.HTML(htmlPage)

    # Serialize the wanted children into a new XML document.
    xmlDocStrings = ['<NewDoc>']
    for child in tree:
        if not rejectThisNode(child):
            xmlDocStrings.append(etree.tostring(child))
    xmlDocStrings.append('</NewDoc>')
    xmlDoc = "".join(xmlDocStrings)

    # Load the stylesheet and apply it.
    fp = open(nameOfXsltFile)
    transform = etree.XSLT(etree.parse(fp))
    newDoc = transform(etree.XML(xmlDoc))  # blows up here

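The failure at the last line can be reproduced without any HTML input, since the XML specification forbids "--" inside a comment. A minimal sketch, assuming the standard library's xml.etree.ElementTree parser as a stand-in for etree.XML (the helper name is_well_formed is hypothetical; lxml reports the problem through its own exception type rather than ParseError):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a conforming XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A single hyphen inside a comment is legal XML.
ok_doc = '<NewDoc><!-- a - b --></NewDoc>'
# A double hyphen inside a comment is not, although an HTML parser
# running in recover mode may let it through.
bad_doc = '<NewDoc><!-- a -- b --></NewDoc>'

print(is_well_formed(ok_doc))
print(is_well_formed(bad_doc))
```

Running the same check on the joined xmlDoc string before handing it to the XML parser would show whether a comment is the culprit.
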
> 3) parse it using an XML parser which does not fail
> 4) pass it to XSLT(), which then fails to initialise
>
> but rather
>
> 3) return the serialised XML from a custom document resolver while running
> an XSLT, right?
>
No custom document resolver is involved, as you can see from the code
above. I'm beginning to think I came to the wrong conclusion when
reading the following passage in the lxml documentation:

    HTML parsing is similarly simple. The parsers have a recover keyword
    argument that the HTMLParser sets by default. It lets libxml2 try
    its best to return something usable without raising an exception.

I read more into "something usable" than the behavior of the software
appears to justify: I thought an exception would be thrown if the result
was not a tree that serializes back into well-formed XML. That's not how
it works, though, is it?

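One way to avoid the malformed output is to sanitise comment text before serialising. A minimal sketch, assuming a hypothetical helper make_comment_xml_safe applied to each comment's text (for example while walking the parsed tree); this is not something lxml is known to do on its own, and the standard-library parser is used here only to check the result:

```python
import re
import xml.etree.ElementTree as ET


def make_comment_xml_safe(text):
    """Rewrite comment text so it is legal inside an XML comment."""
    # Put a space after every hyphen that is followed by another hyphen.
    safe = re.sub(r'-(?=-)', '- ', text)
    # A trailing hyphen would fuse with the closing "-->".
    if safe.endswith('-'):
        safe += ' '
    return safe


raw = 'begin -- middle --- end-'
doc = '<NewDoc><!--' + make_comment_xml_safe(raw) + '--></NewDoc>'
ET.fromstring(doc)  # raises ParseError if the comment is still illegal
```

Dropping comments altogether is the simpler choice when their content does not matter to the transform.
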
--
Bob Kline
http://www.rksystems.com
mailto:bkline at rksystems.com