[lxml-dev] Possible bug

Bob Kline bkline at rksystems.com
Thu Mar 19 19:37:03 CET 2009


Stefan Behnel wrote:
> Bob Kline wrote:
>   
>> Before I dig into the work of producing a repro case, would the lxml
>> developers be interested in a bug report if I confirm that the XSLT
>> parser which comes with the lxml package chokes on the serialized
>> version of an XML tree assembled by lxml's HTML parser when the
>> original HTML document contains a comment which the XML spec doesn't
>> like (because "--" appears inside the comment)?
>>     
>
> So what you do is:
>
> 1) parse an HTML document that contains "--" in a comment
> 2) serialise it to XML, which produces broken XML because of the comment
> value
>
> You were not clear about the rest, but I guess it was not:
>   

Hi, Stefan.  Thanks for your reply.

Sorry for not being sufficiently clear.  Here's what I'm doing:

import urllib2
from lxml import etree

# urlForHtmlPage, nameOfXsltFile and rejectThisNode() are defined elsewhere.
reader = urllib2.urlopen(urlForHtmlPage)
htmlPage = reader.read()
tree = etree.HTML(htmlPage)

# Serialize each accepted child and wrap the pieces in a new root element.
xmlDocStrings = ['<NewDoc>']
for child in tree:
    if not rejectThisNode(child):
        xmlDocStrings.append(etree.tostring(child))
xmlDocStrings.append('</NewDoc>')
xmlDoc = "".join(xmlDocStrings)

fp = open(nameOfXsltFile)
transform = etree.XSLT(etree.parse(fp))
newDoc = transform(etree.XML(xmlDoc)) # blows up here
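
To show the failure in isolation, here is a minimal sketch that uses only
the Python standard library (xml.etree.ElementTree, not lxml, so it is an
analogy rather than the lxml code path). The helper name is_well_formed and
the two sample strings are made up for illustration. It shows an XML parser
rejecting a comment that contains "--"; the same kind of check could be run
on the assembled document before it is handed to the transform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical documents: identical except for the comment text.
good = "<NewDoc><!-- a plain comment --><p>text</p></NewDoc>"
bad = "<NewDoc><!-- two -- hyphens --><p>text</p></NewDoc>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```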

> 3) parse it using an XML parser which does not fail
> 4) pass it to XSLT(), which then fails to initialise
>
> but rather
>
> 3) return the serialised XML from a custom document resolver while running
> an XSLT right?
>   

No custom document resolver involved, as you can see from the code 
above.  I'm beginning to think I came to the wrong conclusion when 
reading the following passage in the lxml documentation:

    HTML parsing is similarly simple. The parsers have a recover keyword
    argument that the HTMLParser sets by default. It lets libxml2 try
    its best to return something usable without raising an exception.

I read more into "something usable" than the behavior of the software
appears to justify.  I assumed that if the parser could not produce a
tree which serializes back into well-formed XML, an exception would be
raised.  That's not how it works, though, is it?

-- 
Bob Kline
http://www.rksystems.com
mailto:bkline at rksystems.com


