[lxml-dev] Setting URL from lxml.html.fromstring, etc
Ian Bicking
ianb at colorstudy.com
Mon Feb 18 19:04:29 CET 2008
Stefan Behnel wrote:
> Ian Bicking wrote:
>> There doesn't seem to be any way to set a document's URL when parsing
>> the document. E.g.:
>>
>> >>> from lxml import html
>> >>> tree = html.parse('http://www.python.org')
>> >>> tree.docinfo.URL
>> 'http://www.python.org'
>>
>> But the parse function doesn't really take any arguments, and the URL
>> attribute is read-only. Ideally you could do fromstring('...doc...',
>> URL='location').
>
> All keyword arguments that you pass to the parse/fromstring functions are
> passed on to lxml.etree's corresponding functions. That means, you can pass
> the "base_url" keyword. (Maybe that should be mentioned in the docstrings).

Yeah... it's hard to figure out which functions underlie these. I've
added a note to the docstrings and an explicit base_url argument to the
functions, so the parameter is easier to discover.

It does not appear that html.parse() takes a base_url argument (just as
etree.parse does not). If you pass a URL or filename then I suppose
that becomes the base. If you pass in a file-like object then I think
it also works, provided the file-like object has a geturl() method (as
the objects returned by urllib do).
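For reference, a minimal sketch of the base_url keyword described above.
The markup and the example.com URL are made up for illustration; the
point is only that the value passed to fromstring() ends up as the
document's URL and can be used to resolve relative links.

```python
from lxml import html

# Made-up markup and URL, for illustration only.
SOURCE = "<html><body><a href='about'>About</a></body></html>"

# base_url is forwarded to lxml.etree and recorded as the document's URL.
doc = html.fromstring(SOURCE, base_url="http://example.com/dir/")
url_from_docinfo = doc.getroottree().docinfo.URL
url_from_element = doc.base_url

# Relative links can then be resolved against the recorded URL.
doc.make_links_absolute()
href = doc.xpath("//a")[0].get("href")
```

Here both url_from_docinfo and url_from_element should be the URL that
was passed in, and href should come out as an absolute URL under it.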

>> Also I'm not sure why the URL shouldn't be writable.
>
> What would be the use case? The problem that arises is that the source URL of
> a document would no longer be an immutable identifier of the document. If it
> can change, it's less valuable for caching (for example). It's a different
> thing if you pass a URL to the parser because it can't know where the document
> came from, or if you change the 'source' of a document at will.

If you can just get it right during parsing it should be fine. But
there are things like xml:base (doesn't apply to HTML; not sure how
it's handled in XML), or unusual headers like Content-Location, which
you might want to handle at a point when the document has already been
parsed.

Probably not a big deal, but it doesn't seem like much of a problem to
make it writable either, especially since the document itself is
writable. Once you've edited the document, it's not *the* document at
that URL anyway. Maybe you get a page, edit it, and serve it at a new
location. Deliverance does this by getting the theme page, then
injecting the content into that page -- the theme page is the
originally-parsed object, though it will be served at a different
location. I'd like to be able to fix up that data. And I'm not sure
how I'd make a copy of a document with a new URL, if the URL/document
link is immutable. (Right now I'm mostly ignoring the URL, but it would
be nice if I could actually trust it.)
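
One possible workaround while the URL stays fixed after parsing, given
only as a rough sketch: serialize the parsed document and parse it
again with a different base_url. The URLs and markup below are made up,
and this costs a full serialize/parse round trip, so it is a stopgap
rather than a real way to copy a document under a new URL.

```python
from lxml import html

# Made-up theme page and URLs, for illustration only.
theme = html.fromstring(
    "<html><body><p>theme</p></body></html>",
    base_url="http://example.com/theme.html",
)

# Round-trip through bytes to get an independent tree with a new URL.
served = html.fromstring(
    html.tostring(theme),
    base_url="http://example.com/served.html",
)

original_url = theme.base_url   # unchanged by the round trip
served_url = served.base_url    # the URL given at re-parse time
served_text = served.findtext(".//p")
```

The original tree keeps its URL, and the re-parsed copy carries the new
one along with the same content.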
Ian