[lxml-dev] Setting URL from lxml.html.fromstring, etc

Stefan Behnel stefan_ml at behnel.de
Mon Feb 18 21:29:34 CET 2008


Hi Ian,

Ian Bicking wrote:
> Stefan Behnel wrote:
>> Ian Bicking wrote:
>>> There doesn't seem to be any way to set a document's URL when parsing
>>> the document.  E.g.:
>>>
>>>  >>> from lxml import html
>>>  >>> tree = html.parse('http://www.python.org')
>>>  >>> tree.docinfo.URL
>>> 'http://www.python.org'
>>>
>>> But the parse function doesn't really take any arguments, and the URL
>>> attribute is read-only.  Ideally you could do
>>> fromstring('...doc...', URL='location').
>>
>> All keyword arguments that you pass to the parse/fromstring functions are
>> passed on to lxml.etree's corresponding functions. That means, you can
>> pass the "base_url" keyword. (Maybe that should be mentioned in the
>> docstrings).
> 
> Yeah... it's hard to figure out what method is underlying these.  I've
> added a note to the docstring and an explicit base_url argument to the
> functions, so you can see the presence of the parameter more easily.

That's good, then epydoc can pick it up.


> It does not appear that html.parse() takes a base_url argument (just as
> etree.parse does not).  If you pass a URL or filename then I suppose
> that becomes the base.

Yes. parse() is for parsing from files/URLs, so you'd normally have some kind
of source name/URL. StringIO is a different thing, but then, in most cases
where you could use parse(StringIO), it would be better to use fromstring(),
which supports the "base_url" keyword.
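
As a rough sketch of what that looks like (the markup and the example.com
URL are made up for illustration, and this assumes the lxml.html version of
fromstring() hands "base_url" through as described above):

```python
from lxml import html

# Hypothetical document text and URL, for illustration only.
markup = "<html><body><p>Hello</p></body></html>"
page_url = "http://example.com/page.html"

# The string itself carries no location, so we tell the parser
# where the document came from via base_url.
doc = html.fromstring(markup, base_url=page_url)

# The URL is then visible on the element and through docinfo.
print(doc.base_url)
print(doc.getroottree().docinfo.URL)
```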


> If you pass in a file-like object then I think
> it also works, if the file-like object has a geturl() method (like
> urllib's files do).

The code we use is this:

cdef _getFilenameForFile(source):
    # file instances have a name attribute
    try:
        return source.name
    except AttributeError:
        pass
    # gzip file instances have a filename attribute
    try:
        return source.filename
    except AttributeError:
        pass
    # urllib2 provides a geturl() method
    try:
        geturl = source.geturl
    except AttributeError:
        # can't determine filename
        return None
    else:
        return geturl()
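
To make the lookup order easier to try out, here is a plain-Python sketch of
the same idea. It is not the Cython source above; guess_source_name() and
the three stand-in classes are hypothetical names used only for this
illustration.

```python
def guess_source_name(source):
    """Mimic the lookup order: .name, then .filename, then .geturl()."""
    for attr in ("name", "filename"):
        try:
            return getattr(source, attr)
        except AttributeError:
            pass
    try:
        geturl = source.geturl
    except AttributeError:
        # nothing usable: the source name stays unknown
        return None
    return geturl()


# Stand-ins for a plain file, a gzip file and a urllib-style response.
class FakeFile:
    name = "/tmp/doc.html"

class FakeGzip:
    filename = "/tmp/doc.html.gz"

class FakeResponse:
    def geturl(self):
        return "http://example.com/doc.html"


for source in (FakeFile(), FakeGzip(), FakeResponse(), object()):
    print(guess_source_name(source))
```

Note that an object that has both a name and a geturl() method would be
reported by its name here, since that attribute is tried first.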


>>> Also I'm not sure why the URL shouldn't be writable.
>>
>> What would be the use case? The problem that arises is that the source
>> URL of a document would no longer be an immutable identifier of the
>> document. If it can change, it's less valuable for caching (for example).
>> It's a different thing if you pass a URL to the parser because it can't
>> know where the document came from, or if you change the 'source' of a
>> document at will.
> 
> If you can just get it right during parsing it should be fine.  But
> there's things like xml:base (doesn't apply to HTML; not sure how it's
> handled in XML)

Not sure, but that should be handled in the parser. At least, it deals with
parse-time information.


> or unusual headers like Content-Location, which you
> might want to handle at a point in time when the document has already
> been parsed.

"Header" sounds more like something you'd also know in advance.


> Probably not a problem, but it doesn't seem that much like a problem to
> make it writable too.  Especially since the document itself is writable.
>  Once you've edited the document, it's not *the* document at that URL
> anyway.  Maybe you get a page, edit it, and serve it at a new location.
>  Deliverance does this by getting the theme page, then injecting the
> content into that page -- but the theme page is the originally-parsed
> object, though it will be served at a different location.  I'd like to
> be able to fix up that data.  And I'm not sure how I'd make a copy of a
> document with a new URL, if the URL/document link is immutable.  (Right
> now I'm mostly ignoring the URL, but it would be nice if I could
> actually trust it.)

I see. The URL is currently retrieved through "tree.docinfo" (i.e. the DocInfo
class), which is completely read-only. I'll have to figure out the
implications first - feel free to inject some ideas. :)
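
One workaround that might do in the meantime is to serialise the tree and
parse it again with the new location. This is only a sketch: copy_with_url()
and the example.com URLs are hypothetical, and the copy only contains
whatever survives the serialisation round trip.

```python
from lxml import html


def copy_with_url(doc, new_url):
    """Return a re-parsed copy of `doc` that reports `new_url`."""
    # tostring() serialises the tree; fromstring() builds a fresh
    # document whose URL is taken from base_url.
    return html.fromstring(html.tostring(doc), base_url=new_url)


theme = html.fromstring(
    "<html><body><p>theme</p></body></html>",
    base_url="http://example.com/theme.html",
)
served = copy_with_url(theme, "http://example.com/served.html")

print(theme.base_url)   # the original keeps its URL
print(served.base_url)  # the copy carries the new one
```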

Stefan


