[lxml-dev] Possible bug
Stefan Behnel
stefan_ml at behnel.de
Thu Mar 19 22:35:56 CET 2009
Hi,
Bob Kline wrote:
> Sorry for not being sufficiently clear. Here's what I'm doing:
>
> reader = urllib2.urlopen(urlForHtmlPage)
> htmlPage = reader.read()
> tree = etree.HTML(htmlPage)
Use a custom parser here, as in:

    parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
    tree = etree.fromstring(htmlPage, parser)
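
As a minimal sketch of what the custom parser changes, the snippet below parses a small made-up page (the `html_page` content is an assumption for illustration, not the page from the thread) and checks that the comment no longer appears in the tree:

```python
from lxml import etree

# Hypothetical input standing in for the downloaded page.
html_page = b"<html><body><!-- tracking note --><p>Hello</p></body></html>"

# Comments and processing instructions are dropped while parsing,
# so they never become nodes in the resulting tree.
parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
tree = etree.fromstring(html_page, parser)

# Serialise only to inspect the result; the tree itself can be used directly.
cleaned = etree.tostring(tree)
```

The point of doing this at parse time is that nothing has to be filtered out again later, before the tree is handed to the transform.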
> xmlDocStrings = ['<NewDoc>']
> for child in tree:
>     if not rejectThisNode(child):
>         xmlDocStrings.append(etree.tostring(child))
> xmlDocStrings.append('</NewDoc>')
> xmlDoc = "".join(xmlDocStrings)
You do not need to serialise here. It's perfectly ok if you do this:

    xmlDocStrings = tree.makeelement("NewDoc")
    for child in tree[:]:
        if not rejectThisNode(child):
            remove_ugly_content(child)
            xmlDocStrings.append(child)
    newDoc = transform(xmlDocStrings)
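
The element-moving approach can be sketched in a self-contained form as follows. The filter `reject_this_node` and the sample page are hypothetical stand-ins for the helpers in the thread. Note the `tree[:]` copy: appending an element to a new parent moves it out of its old parent in lxml, so iterating over the live child list while moving children would skip nodes.

```python
from lxml import etree

def reject_this_node(node):
    # Hypothetical filter: keep everything except the <head> element.
    return node.tag == "head"

tree = etree.HTML(
    "<html><head><title>t</title></head><body><p>x</p></body></html>"
)

# Build the new root in the same document context as the parsed tree.
new_root = tree.makeelement("NewDoc")

# Iterate over a copy of the child list, because append() moves each
# accepted child from `tree` into `new_root`.
for child in tree[:]:
    if not reject_this_node(child):
        new_root.append(child)
```

After the loop, `new_root` holds the accepted children and `tree` keeps only the rejected ones, with no string serialisation or re-parsing in between.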
> fp = open(nameOfXsltFile)
> transform = etree.XSLT(etree.parse(fp))
> newDoc = transform(etree.XML(xmlDoc)) # blows up here
If you split the last line in two, I would assume that it will fail on

    root = etree.XML(xmlDoc)

and not on

    newDoc = transform(root)
Removing unwanted content before running the transform will fix this.
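
One way to check which of the two steps fails is to run them separately and catch their distinct exception types. This is only a sketch: the inline stylesheet and the helper name `run_in_two_steps` are assumptions made for illustration and stand in for the real XSLT file and code.

```python
from lxml import etree

# Hypothetical stylesheet standing in for the real XSLT file.
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:copy-of select="*"/></out>
  </xsl:template>
</xsl:stylesheet>
"""
transform = etree.XSLT(etree.XML(XSLT_SOURCE))

def run_in_two_steps(xml_doc):
    """Return the name of the failing step, or None if both succeed."""
    try:
        root = etree.XML(xml_doc)      # step 1: strict XML parsing
    except etree.XMLSyntaxError:
        return "parse"
    try:
        transform(root)                # step 2: applying the XSLT
    except etree.XSLTError:
        return "transform"
    return None
```

With an ill-formed string such as `"<NewDoc><p>a</NewDoc>"`, the helper reports the parse step, which matches the assumption that the strict XML parser rejects the serialised string before the XSLT is ever applied.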
>> 3) parse it using an XML parser which does not fail
>> 4) pass it to XSLT(), which then fails to initialise
>>
>> but rather
>>
>> 3) return the serialised XML from a custom document resolver while
>> running an XSLT, right?
>
> No custom document resolver involved, as you can see from the code
> above. I'm beginning to think I came to the wrong conclusion when
> reading the following passage in the lxml documentation:
>
> HTML parsing is similarly simple. The parsers have a recover keyword
> argument that the HTMLParser sets by default. It lets libxml2 try
> its best to return something usable without raising an exception.
>
> I assumed a different meaning for "something usable" than the behavior
> of the software appears to justify, thinking that if the result was not
> a tree which would serialize itself back into well-formed XML an
> exception would be thrown. That's not how it works, though, is it?
The "something usable" just means that while it may not succeed in parsing
all of a broken document 'correctly' (whatever that means in this context),
it will always return a document that is as complete as possible. Without
the "recover" option, you would get an exception when a parse error occurs.
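
The difference can be sketched with the XML parser, which is strict by default but accepts the same `recover` keyword. The broken inputs below are made up for illustration.

```python
from lxml import etree

broken_xml = "<root><a>text</root>"   # </a> is missing

# Strict parsing (the XMLParser default): a parse error raises.
try:
    etree.fromstring(broken_xml)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# With recover=True the parser returns whatever tree it could build.
recovered = etree.fromstring(broken_xml, etree.XMLParser(recover=True))

# The HTML parser recovers by default, so broken markup still yields a tree.
html_root = etree.HTML("<p>unclosed <b>text")
```

A recovered tree is only a best effort: nothing guarantees that it matches what the author of the broken document intended, so content that matters should still be checked after parsing.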
Stefan