[lxml-dev] Setting URL from lxml.html.fromstring, etc

Ian Bicking ianb at colorstudy.com
Fri Feb 29 23:35:52 CET 2008


Stefan Behnel wrote:
> Hi,
> 
> Ian Bicking wrote:
>> Stefan Behnel wrote:
>>> Don't you think it should behave differently for XML and HTML? For XML,
>>> I'd expect it to depend on xml:base, while for HTML, it'd rather always
>>> depend on the document URL (and not set an xml:base attribute on
>>> assignment).
>> Sure, they act somewhat differently, but does it make sense to use two
>> different names?  I think they mean similar things in both cases, though
>> perhaps the per-element base attribute in HTML shouldn't be writable.
>> (Though the tree is kind of this weird invisible thing that you wouldn't
>> know is there except for things like docinfo.URL, but a little
>> documentation can fix that of course.)
> 
> ok, I do prefer 'base' then, though, as it matches xml:base. It also makes
> less sense in the HTML area than in the XML area, where you actually /have/
> something like a base URL of an element, rather than just a URL of a document
> that the Element happens to be in. So, if you move an HTML Element from one
> tree to another, it will change its base URL, while in the XML world, you
> /can/ work around that if you need/want to.
> 
> I think we should deprecate 'base_url' in favour of 'base', and document the
> respective behaviour in the doc strings of both properties.

OK.  Would the HTML base attribute then just be a read-only property?
Like:

   def base(self):
       return super(HtmlElement, self).base
   base = property(base)

I'm not terribly concerned about whether it is read-only or not.  It's a 
little fuzzy: HTML is parsed into the lxml representation, which is not 
itself exactly HTML.  It will probably be serialized to HTML again (if it 
is serialized at all), and HTML doesn't have anything like xml:base.  But 
if you serialize to XHTML, then xml:base is available.
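
To make the read-only idea concrete, here is a minimal sketch.  The 
Element and HtmlElement classes below are hypothetical stand-ins written 
in plain Python, not lxml's real classes; the only point is that a 
getter-only property in the subclass hides the writable one in the 
parent, so assignment fails.

```python
class Element(object):
    """Hypothetical stand-in for the XML element class; not lxml's own."""

    def __init__(self, base=None):
        self._base = base

    def _get_base(self):
        return self._base

    def _set_base(self, url):
        self._base = url

    # Writable here, as one would expect where xml:base exists.
    base = property(_get_base, _set_base)


class HtmlElement(Element):
    """Hypothetical HTML subclass that exposes 'base' read-only."""

    def base(self):
        return super(HtmlElement, self).base
    base = property(base)  # getter only, so assignment raises AttributeError


xml_el = Element("http://example.com/doc.xml")
xml_el.base = "http://example.com/other/"  # allowed on the parent class

html_el = HtmlElement("http://example.com/page.html")
try:
    html_el.base = "http://example.com/elsewhere/"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

Reading html_el.base still goes through the parent's getter; only 
assignment is blocked.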

Also, translating HTML to XHTML is kind of an outstanding issue for 
lxml.html, and it seems reasonable to me that XHTML could be parsed into 
the same classes as HTML.  The only real caveat there is that XHTML uses 
different (namespaced) tag names.  If you strip the namespace from the 
tag names, then the classes and the lookup apply just fine.  (Presumably 
the lookup could be changed to support XHTML fairly easily.)
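
As a rough illustration of the namespace point, the sketch below uses the 
standard library's xml.etree.ElementTree rather than lxml.  The 
CLASS_FOR_TAG table and the lookup_class helper are made up for this 
example and only stand in for whatever tag-to-class lookup lxml.html 
performs; the class names are plain strings, not real classes.

```python
import xml.etree.ElementTree as ET

XHTML_PREFIX = "{http://www.w3.org/1999/xhtml}"

# Hypothetical lookup table keyed by plain HTML tag names.
CLASS_FOR_TAG = {"form": "FormElement", "input": "InputElement"}


def local_name(tag):
    """Return the tag without the XHTML namespace prefix, if it has one."""
    if tag.startswith(XHTML_PREFIX):
        return tag[len(XHTML_PREFIX):]
    return tag


def lookup_class(tag):
    """Pick a class name by plain tag name, with a generic fallback."""
    return CLASS_FOR_TAG.get(local_name(tag), "HtmlElement")


xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<body><form><input name="q"/></form></body></html>')
root = ET.fromstring(xhtml)

# Parsed XHTML tags carry the namespace in Clark notation ({uri}name),
# but the lookup sees only the local name.
found = [(el.tag, lookup_class(el.tag)) for el in root.iter()]
```

The same lookup_class call works for a plain "form" tag from an HTML 
parse, which is the sense in which one lookup could serve both.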

   Ian

