[lxml-dev] Setting URL from lxml.html.fromstring, etc
Stefan Behnel
stefan_ml at behnel.de
Mon Feb 18 09:33:17 CET 2008
Hi Ian,
Ian Bicking wrote:
> There doesn't seem to be any way to set a document's URL when parsing
> the document. E.g.:
>
> >>> from lxml import html
> >>> tree = html.parse('http://www.python.org')
> >>> tree.docinfo.URL
> 'http://www.python.org'
>
> But the parse function doesn't really take any arguments, and the URL
> attribute is read-only. Ideally you could do fromstring('...doc...',
> URL='location').
All keyword arguments that you pass to the parse/fromstring functions are
passed on to the corresponding functions in lxml.etree. That means you can
pass the "base_url" keyword. (Maybe that should be mentioned in the docstrings.)
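A minimal sketch of passing "base_url" through lxml.html.fromstring. The
markup and the example.org URL are made up for illustration, and the names
"markup" and "source_url" are just placeholders:

```python
from lxml import html

# Hypothetical markup and URL, used only for illustration.
markup = "<html><body><a href='/about'>About</a></body></html>"
source_url = "http://www.example.org/"

# base_url is handed on to the underlying lxml.etree parsing function.
doc = html.fromstring(markup, base_url=source_url)

# The URL can then be read back from the tree's docinfo
# and from the base_url property of the element.
url = doc.getroottree().docinfo.URL
print(url)
print(doc.base_url)
```

Here the URL is supplied once, at parse time, so nothing about the parsed
document has to be modified afterwards.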
> Also I'm not sure why the URL shouldn't be writable.
What would be the use case? The problem that arises is that the source URL of
a document would no longer be an immutable identifier of the document. If it
can change, it's less valuable for caching (for example). Passing a URL to
the parser because it can't know where the document came from is one thing;
changing the 'source' of a document at will is quite another.
Stefan