[lxml-dev] problems with document(''), possibly thread related - LXML 'BUG'

Stefan Behnel stefan_ml at behnel.de
Thu Aug 14 20:11:45 CEST 2008


Hi,

Brad Clements wrote:
> when document('') is processed, the base_url is used to look up 
> the stylesheet's canonical "URL", and then that URL is used to retrieve 
> the xml document tree that represents the stylesheet.

Yes, it's common to look up a document by its URL. That's an optimisation used
by libxslt, too, so if you assign the same URL to different documents, you
will run into problems, whether lxml does this or not.


> The problem here is that base_url could be wrong. It could be the same 
> value as some other document. In fact, I can recreate the problem by 
> setting base_url to the same value for both the xml source and the 
> stylesheet source.

You are deliberately lying to lxml, and still expect it to be so kind as to
do the right thing regardless?


> My understanding of the reason for base_url was just so that resolvers 
> would have a basis for resolving relative lookups. That is certainly how 
> I use base_url ... as the only mechanism to set the URL that is passed 
> to my custom resolver.

Yes, that's one way of using it. Others may use it differently.


> this is a design flaw in lxml. I'm thinking that using base_url as a way 
> to get back the original stylesheet XML was convenient for the lxml 
> developers, but has left a big undocumented pitfall for lxml users.

And it's easy to work around by providing a unique URL for each document. If
you think the documentation should be improved, please submit a patch.
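
For illustration, here is a minimal sketch of that workaround. The two
example.invalid URLs and the cfg:greeting element are made up for this
example; the only point is that the stylesheet and the source document are
parsed with different base_url values, so document('') has an unambiguous
URL to resolve against.

```python
from lxml import etree

# Hypothetical URLs: any values will do, as long as the two documents differ.
STYLE_URL = "http://example.invalid/style.xsl"
SOURCE_URL = "http://example.invalid/input.xml"

# The cfg:greeting element is invented data stored in the stylesheet itself,
# so that document('') has something to read back.
STYLESHEET = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cfg="urn:example:cfg"
    exclude-result-prefixes="cfg">
  <cfg:greeting>hello</cfg:greeting>
  <xsl:template match="/">
    <out><xsl:value-of select="document('')/xsl:stylesheet/cfg:greeting"/></out>
  </xsl:template>
</xsl:stylesheet>
"""

# Each document gets its own base_url, so a lookup by URL is unambiguous.
style_root = etree.XML(STYLESHEET, base_url=STYLE_URL)
source_root = etree.XML("<doc/>", base_url=SOURCE_URL)

transform = etree.XSLT(style_root)
result = transform(source_root)

# document('') is expected to resolve to the stylesheet, not the source.
greeting = result.getroot().text
print(greeting)
```

If both documents were given the same base_url instead, the lookup behind
document('') could pick up the wrong tree, which is the failure described in
the report.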


> The only documentation I could find on the website about base_url is on 
> http://codespeak.net/lxml/parsing.html#parsers  where no mention is made 
> about the requirement to NOT use the same base_url for different documents.

It sounds to me like the misunderstanding here is largely about what the
"base URL" of a document is. It's the URL that defines the origin of the
document. Assuming that you will get the same document when you re-read its
URL is not that stupid an idea, IMHO. Otherwise, the XSLT processor would have
to re-parse a document each time it encounters a document() reference. That
would really hurt performance.
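
To make the trade-off concrete, here is a toy sketch of a URL-keyed document
cache. It is not libxslt's or lxml's actual code; the class name, the
fake_parse helper and the URLs are all invented for illustration. It shows
why caching by URL avoids re-parsing, and why two different documents that
share one URL cannot both be found.

```python
class UrlDocumentCache:
    """Toy cache that hands out parsed documents by URL.

    Illustrative only: this is not libxslt's or lxml's actual code.
    """

    def __init__(self, parse):
        self._parse = parse      # callable: url -> parsed document
        self._docs = {}          # url -> parsed document
        self.parse_calls = 0     # how often the parser really had to run

    def register(self, url, doc):
        # The first document registered under a URL wins.
        self._docs.setdefault(url, doc)

    def get(self, url):
        # Parse only on the first request for a URL, then reuse the result.
        if url not in self._docs:
            self.parse_calls += 1
            self._docs[url] = self._parse(url)
        return self._docs[url]


def fake_parse(url):
    # Stand-in for a real parser; returns a label instead of a tree.
    return "parsed:" + url


cache = UrlDocumentCache(fake_parse)

# Distinct URL: repeated lookups return the same object, parsed only once.
first = cache.get("style.xsl")
second = cache.get("style.xsl")

# Shared URL: the source document is registered first, so a later lookup
# meant for the stylesheet silently returns the source document instead.
cache.register("shared.xml", "SOURCE DOCUMENT")
cache.register("shared.xml", "STYLESHEET")
looked_up = cache.get("shared.xml")
```

In this sketch the cache has no way to tell the two "shared.xml" documents
apart, which is the same situation as passing one base_url for both the
stylesheet and the XML source.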


> My test case program is shown below,
> when base_url is the same value for both the stylesheet and the xml 
> document, then document('') fails in the stylesheet.
> If base_url is different, it works.

I agree that separate documentation paragraphs in the parser documentation,
the resolver documentation, and the XSLT documentation would help here. Maybe
you can write up something?

Stefan