[lxml-dev] Possible bug

Stefan Behnel stefan_ml at behnel.de
Thu Mar 19 22:35:56 CET 2009


Hi,

Bob Kline wrote:
> Sorry for not being sufficiently clear.  Here's what I'm doing:
> 
> reader = urllib2.urlopen(urlForHtmlPage)
> htmlPage = reader.read()
> tree = etree.HTML(htmlPage)

Use a custom parser here, as in

	parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
	tree = etree.fromstring(htmlPage, parser)
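
As a minimal sketch of what the parser options do (the sample markup and the
names html_page and serialised are made up for illustration, not taken from
your page), comments are dropped at parse time, so they never reach the
serialiser:

```python
from lxml import etree

# Made-up sample input; a real page would come from the network.
html_page = b"<html><body><!-- a comment --><p>Hello</p></body></html>"

# Same options as suggested above: drop comments and processing instructions.
parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
tree = etree.fromstring(html_page, parser)

# The comment is absent from the tree and from its serialisation.
serialised = etree.tostring(tree)
```

Whether comments are what breaks your document is an assumption on my side,
but stripping them early rules out one source of trouble.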


> xmlDocStrings = ['<NewDoc>']
> for child in tree:
>    if not rejectThisNode(child):
>        xmlDocStrings.append(etree.tostring(child))
> xmlDocStrings.append('</NewDoc>')
> xmlDoc = "".join(xmlDocStrings)

You do not need to serialise here. It's perfectly ok if you do this:

	newRoot = tree.makeelement("NewDoc")
	for child in tree[:]:
		if not rejectThisNode(child):
			remove_ugly_content(child)
			newRoot.append(child)

	newDoc = transform(newRoot)
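
For completeness, here is a self-contained sketch of the same idea. The
filter reject_this_node(), the cleanup remove_ugly_content(), the sample page
and the inline stylesheet are all hypothetical stand-ins for your own code and
files; only the lxml calls are real:

```python
from lxml import etree

# Made-up sample page standing in for the downloaded HTML.
html_page = (b"<html><head><title>t</title></head>"
             b"<body><p>keep</p><script>x()</script></body></html>")
parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
tree = etree.fromstring(html_page, parser)

def reject_this_node(node):
    # Hypothetical filter: skip the <head> element.
    return node.tag == "head"

def remove_ugly_content(node):
    # Hypothetical cleanup: drop <script> elements but keep their tail text.
    etree.strip_elements(node, "script", with_tail=False)

# Build the new root and move (not copy) the accepted children into it.
# Iterating over tree[:] avoids modifying the list being looped over.
new_root = tree.makeelement("NewDoc")
for child in tree[:]:
    if not reject_this_node(child):
        remove_ugly_content(child)
        new_root.append(child)

# Hypothetical stylesheet, inlined here instead of being read from a file.
xslt_doc = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/NewDoc">
    <out><xsl:copy-of select="body/p"/></out>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(xslt_doc)

# The element tree goes straight into the transform, no string round trip.
new_doc = transform(new_root)
```

Note that append() moves each element out of the original tree, so only the
rejected children remain in it afterwards.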

> fp = open(nameOfXsltFile)
> transform = etree.XSLT(etree.parse(fp))
> newDoc = transform(etree.XML(xmlDoc)) # blows up here

If you split the last line in two, I would assume that it will fail on

	root = etree.XML(xmlDoc)

and not on

	newDoc = transform(root)

Removing unwanted content before running the transform will fix this.
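
To check which of the two steps fails, something along these lines should do.
It is only a sketch: the xml_doc string is a made-up example of serialised
content that is not well-formed XML, and the empty stylesheet is a
placeholder for yours:

```python
from lxml import etree

# Made-up string: the unclosed <br> makes it ill-formed as XML.
xml_doc = "<NewDoc><p>line one<br></p></NewDoc>"

# Placeholder stylesheet; an empty one is enough to build an XSLT object.
transform = etree.XSLT(etree.XML(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'))

failed_step = None
try:
    # Step 1: parsing the string as XML.
    root = etree.XML(xml_doc)
except etree.XMLSyntaxError as exc:
    failed_step = "parse"
    parse_error = str(exc)
else:
    # Step 2: only reached when the string parsed cleanly.
    new_doc = transform(root)
```

With input like this, the exception comes from the XML parser and the
transform is never called.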


>> 3) parse it using an XML parser which does not fail
>> 4) pass it to XSLT(), which then fails to initialise
>>
>> but rather
>>
>> 3) return the serialised XML from a custom document resolver while
>> running an XSLT right?
> 
> No custom document resolver involved, as you can see from the code
> above.  I'm beginning to think I came to the wrong conclusion when
> reading the following passage in the lxml documentation:
> 
>    HTML parsing is similarly simple. The parsers have a recover keyword
>    argument that the HTMLParser sets by default. It lets libxml2 try
>    its best to return something usable without raising an exception.
> 
> I assumed a different meaning for "something usable" than the behavior
> of the software appears to justify, thinking that if the result was not
> a tree which would serialize itself back into well-formed XML an
> exception would be thrown.  That's not how it works, though, is it?

The "something usable" just means that, although the parser may not manage
to parse all of a broken document 'correctly' (whatever that means in this
context), it will always return a document that is as complete as possible.
Without the "recover" option, you would get an exception when a parse error
occurs.
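
The difference is easiest to see with the XML parser, where "recover" is off
by default. A small sketch, with a made-up broken input:

```python
from lxml import etree

# Made-up input: <a> is never closed.
broken = b"<root><a>text</root>"

# Default XML parser: a parse error raises an exception.
strict_failed = False
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError:
    strict_failed = True

# With recover=True, libxml2 returns whatever tree it could build.
recovering_parser = etree.XMLParser(recover=True)
recovered_root = etree.fromstring(broken, recovering_parser)
```

The HTMLParser behaves like the second case unless you pass recover=False.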

Stefan

