[lxml-dev] Setting URL from lxml.html.fromstring, etc
Stefan Behnel
stefan_ml at behnel.de
Mon Feb 18 21:29:34 CET 2008
Hi Ian,
Ian Bicking wrote:
> Stefan Behnel wrote:
>> Ian Bicking wrote:
>>> There doesn't seem to be any way to set a document's URL when parsing
>>> the document. E.g.:
>>>
>>> >>> from lxml import html
>>> >>> tree = html.parse('http://www.python.org')
>>> >>> tree.docinfo.URL
>>> 'http://www.python.org'
>>>
>>> But the parse function doesn't really take any arguments, and the URL
>>> attribute is read-only. Ideally you could do
>>> fromstring('...doc...', URL='location').
>>
>> All keyword arguments that you pass to the parse/fromstring functions are
>> passed on to lxml.etree's corresponding functions. That means you can pass
>> the "base_url" keyword. (Maybe that should be mentioned in the docstrings.)
>
> Yeah... it's hard to figure out what method is underlying these. I've
> added a note to the docstring and an explicit base_url argument to the
> functions, so you can see the presence of the parameter more easily.
That's good, then epydoc can pick it up.
> It does not appear that html.parse() takes a base_url argument (just as
> etree.parse does not). If you pass a URL or filename then I suppose
> that becomes the base.
Yes. parse() is for parsing from files/URLs, so you'd normally have some kind
of source name/URL. StringIO is a different thing, but then, in most cases
where you could use parse(StringIO), it would be better to use fromstring(),
which supports the "base_url" keyword.
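A minimal sketch of the fromstring() route, assuming a reasonably recent lxml; the markup and the example.com URL are made up for illustration. It passes "base_url" at parse time, reads it back through docinfo, and then uses it to resolve a relative link:

```python
from lxml import html

# Hypothetical markup and URL, used only for illustration.
markup = '<html><body><a href="about.html">About</a></body></html>'
root = html.fromstring(markup, base_url="http://example.com/docs/index.html")

# The base URL given at parse time shows up as the document URL.
doc_url = root.getroottree().docinfo.URL
print(doc_url)

# Relative links can then be resolved against that URL.
root.make_links_absolute()
link = root.xpath("//a")[0].get("href")
print(link)
```

Without "base_url", the parser has no way to know where a string came from, so the document URL would be left unset.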
> If you pass in a file-like object then I think
> it also works, if the file-like object has a geturl() method (like
> urllib's files do).
The code we use is this:
    cdef _getFilenameForFile(source):
        # file instances have a name attribute
        try:
            return source.name
        except AttributeError:
            pass
        # gzip file instances have a filename attribute
        try:
            return source.filename
        except AttributeError:
            pass
        # urllib2 provides a geturl() method
        try:
            geturl = source.geturl
        except AttributeError:
            # can't determine filename
            return None
        else:
            return geturl()
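
For experimenting outside of lxml, the same lookup order can be sketched in plain Python. This is only an approximation of the Cython helper above: the names guess_source_name and FakeResponse are invented for the example and are not part of lxml.

```python
import io


def guess_source_name(source):
    """Hypothetical pure-Python mirror of the lookup order shown above."""
    # Plain file objects expose .name, older gzip file objects .filename.
    for attr in ("name", "filename"):
        try:
            return getattr(source, attr)
        except AttributeError:
            pass
    # urllib-style response objects provide a geturl() method.
    try:
        geturl = source.geturl
    except AttributeError:
        # Nothing usable: the source name cannot be determined.
        return None
    return geturl()


class FakeResponse:
    """Stand-in for a urllib response; illustration only."""

    def geturl(self):
        return "http://example.com/data.xml"


print(guess_source_name(FakeResponse()))
print(guess_source_name(io.StringIO("<root/>")))
```

An in-memory StringIO has none of the three attributes, so nothing can be derived from it, which is the case where passing "base_url" explicitly helps.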
>>> Also I'm not sure why the URL shouldn't be writable.
>>
>> What would be the use case? The problem that arises is that the source URL
>> of a document would no longer be an immutable identifier of the document.
>> If it can change, it's less valuable for caching (for example). It's a
>> different thing if you pass a URL to the parser because it can't know where
>> the document came from, or if you change the 'source' of a document at will.
>
> If you can just get it right during parsing it should be fine. But
> there's things like xml:base (doesn't apply to HTML; not sure how it's
> handled in XML)
Not sure, but that should be handled in the parser. At least, it deals with
parse-time information.
> or unusual headers like Content-Location, which you
> might want to handle at a point in time when the document has already been
> parsed.
"Header" sounds more like something you'd also know in advance.
> Probably not a problem, but it doesn't seem that much like a problem to
> make it writable too. Especially since the document itself is writable.
> Once you've edited the document, it's not *the* document at that URL
> anyway. Maybe you get a page, edit it, and serve it at a new location.
> Deliverance does this by getting the theme page, then injecting the
> content into that page -- but the theme page is the originally-parsed
> object, though it will be served at a different location. I'd like to
> be able to fix up that data. And I'm not sure how I'd make a copy of a
> document with a new URL, if the URL/document link is immutable. (Right
> now I'm mostly ignoring the URL, but it would be nice if I could
> actually trust it.)
I see. The URL is currently retrieved through "tree.docinfo" (i.e. the DocInfo
class), which is completely read-only. I'll have to figure out the
implications first - feel free to inject some ideas. :)
Stefan