[lxml-dev] Setting URL from lxml.html.fromstring, etc

Ian Bicking ianb at colorstudy.com
Mon Feb 18 19:04:29 CET 2008


Stefan Behnel wrote:
> Ian Bicking wrote:
>> There doesn't seem to be any way to set a document's URL when parsing 
>> the document.  E.g.:
>>
>>  >>> from lxml import html
>>  >>> tree = html.parse('http://www.python.org')
>>  >>> tree.docinfo.URL
>> 'http://www.python.org'
>>
>> But the parse function doesn't really take any arguments, and the URL 
>> attribute is read-only.  Ideally you could do fromstring('...doc...', 
>> URL='location').
> 
> All keyword arguments that you pass to the parse/fromstring functions are
> passed on to lxml.etree's corresponding functions. That means, you can pass
> the "base_url" keyword. (Maybe that should be mentioned in the docstrings).

Yeah... it's hard to figure out what method is underlying these.  I've 
added a note to the docstring and an explicit base_url argument to the 
functions, so you can see the presence of the parameter more easily.
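
For illustration, here is a minimal sketch of the base_url keyword Stefan 
mentions.  The markup and the example.com URL are made up; only 
fromstring(), docinfo.URL and make_links_absolute() are lxml's own.

```python
from lxml import html

# Hypothetical markup and URL, for illustration only.
markup = '<html><body><a href="/about">About</a></body></html>'
doc = html.fromstring(markup, base_url='http://example.com/dir/page.html')

# The base URL is recorded on the document, much as it is after html.parse(url).
url = doc.getroottree().docinfo.URL

# Relative links can then be resolved against the recorded URL.
doc.make_links_absolute()
href = doc.xpath('//a/@href')[0]
```

With the URL recorded at parse time, make_links_absolute() needs no 
explicit argument.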

It does not appear that html.parse() takes a base_url argument (just as 
etree.parse does not).  If you pass a URL or filename then I suppose 
that becomes the base.  If you pass in a file-like object then I think 
it also works, if the file-like object has a geturl() method (like 
urllib's files do).
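
A rough sketch of the file-like case, assuming a small stand-in class 
(FakeResponse is hypothetical, as are the markup and URL) that offers 
read() and geturl() the way urllib's response objects do:

```python
import io

from lxml import html


class FakeResponse:
    """Hypothetical stand-in for a urllib response: read() plus geturl()."""

    def __init__(self, data, url):
        self._buf = io.BytesIO(data)
        self._url = url

    def read(self, size=-1):
        # Hand the parser the next chunk of bytes.
        return self._buf.read(size)

    def geturl(self):
        # Report where the data supposedly came from.
        return self._url


resp = FakeResponse(b'<html><body><p>hi</p></body></html>',
                    'http://example.com/page.html')
tree = html.parse(resp)

# The parser should pick up the URL reported by geturl().
url = tree.docinfo.URL
```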

>> Also I'm not sure why the URL shouldn't be writable.
> 
> What would be the use case? The problem that arises is that the source URL of
> a document would no longer be an immutable identifier of the document. If it
> can change, it's less valuable for caching (for example). It's a different
> thing if you pass a URL to the parser because it can't know where the document
> came from, or if you change the 'source' of a document at will.

If you can just get it right during parsing it should be fine.  But 
there are things like xml:base (doesn't apply to HTML; not sure how it's 
handled in XML), or unusual headers like Content-Location, which you 
might want to handle at a point in time when the document has already 
been parsed.

Probably not a problem, but it doesn't seem that much like a problem to 
make it writable too.  Especially since the document itself is writable.  
Once you've edited the document, it's not *the* document at that URL 
anyway.  Maybe you get a page, edit it, and serve it at a new location.  
Deliverance does this by getting the theme page, then injecting the 
content into that page -- but the theme page is the originally-parsed 
object, though it will be served at a different location.  I'd like to 
be able to fix up that data.  And I'm not sure how I'd make a copy of a 
document with a new URL, if the URL/document link is immutable.  (Right 
now I'm mostly ignoring the URL, but it would be nice if I could 
actually trust it.)
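
One possible workaround, sketched here under the assumption that a 
serialize-and-reparse round trip is acceptable: it does not need the URL 
to be writable, but it does cost a full reparse.  The markup and URLs are 
made up.

```python
from lxml import html

# Hypothetical theme page, parsed with its original location.
theme = html.fromstring(
    '<html><body><div id="content"></div></body></html>',
    base_url='http://example.com/theme.html')

# Serialize and reparse to get an independent copy with a new base URL.
served = html.fromstring(
    html.tostring(theme),
    base_url='http://example.com/served.html')

old_url = theme.getroottree().docinfo.URL
new_url = served.getroottree().docinfo.URL
```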

   Ian

