[lxml-dev] naming the lxml.html parse functions

Ian Bicking ianb at colorstudy.com
Mon Jul 9 22:13:05 CEST 2007


Stefan Behnel wrote:
> 
> Ian Bicking wrote:
>> Stefan Behnel wrote:
>>>> HTMLFragment() will always return a fragment, i.e. a single element that
>>>> can be pasted into a body. This means: remove html/body if they are present
>>>> and add a <div> if there are multiple elements. Maybe check if there
>>>> actually are any block tags and just wrap the fragments in a <p> otherwise,
>>>> but that's more of an optimisation.
>> I think we talked about using <span> if there were no block tags, not <p>.
> 
> Ah, sure. Anyway, I didn't change your implementation, so everything works as
> before (except for the naming).
> 
> 
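
For reference, the fragment rule quoted above (drop html/body, return a lone
element as-is, otherwise wrap in a <div>, or in a <span> when no block tags
are present) could be sketched roughly as follows. This is only an
illustration, not the lxml.html implementation: it uses the standard
library's xml.etree.ElementTree as a stand-in, so it needs well-formed
markup, and the names BLOCK_TAGS, unwrap_document() and as_single_fragment()
are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed, deliberately incomplete set of block-level tags (illustration only).
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "blockquote", "pre",
              "h1", "h2", "h3"}


def unwrap_document(root):
    """Return the content elements, dropping html/body wrappers if present.

    Text sitting directly inside <body> is ignored in this sketch.
    """
    if root.tag == "html":
        body = root.find("body")
        return list(body) if body is not None else []
    if root.tag == "body":
        return list(root)
    return [root]


def as_single_fragment(elements):
    """Return one element: the lone element itself, or a wrapper around all."""
    if len(elements) == 1:
        return elements[0]
    # Look at the elements and their descendants for any block-level tag.
    has_block = any(d.tag in BLOCK_TAGS for el in elements for d in el.iter())
    wrapper = ET.Element("div" if has_block else "span")
    wrapper.extend(elements)
    return wrapper


doc = ET.fromstring("<html><body><p>one</p><p>two</p></body></html>")
fragment = as_single_fragment(unwrap_document(doc))
print(fragment.tag, [child.tag for child in fragment])
```
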
>> Something about HTMLFragment(s) seems weird to me.  I guess HTML()
>> itself is weird, though it is reminiscent of XML().  Which is itself
>> weird, since neither is a class.
> 
> It's a factory though, that is mainly meant for HTML 'literals'. And it gives
> you an HtmlElement or a list of those. Hmmm, I admit that HTMLFragments() does
> not really sound like returning a list...

Everything is potentially a factory.  dict.items() is a list factory. 
HTML and HTMLFragment are factories for the same kind of object.


>> HTMLFragment() bothers me more because
>> it definitely doesn't return a different type of object from HTML(), but
>> the naming implies it does.
> 
> Hmmm, I don't really feel the same way, but maybe I'm too biased already. :)
> 
> It's Python after all, so the actual type is not that relevant.

Yes, but we're already badly abusing naming conventions.  These aren't 
classes, but they are named like classes.  This has caused confusion for 
me in the past.
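
To make the naming point concrete, here is a small check. It uses the
standard library's xml.etree.ElementTree rather than lxml, on the assumption
that its XML() follows the same convention as the etree XML() mentioned
above: a capitalised name that is a plain function, returning an ordinary
element.

```python
import inspect
import xml.etree.ElementTree as ET

root = ET.XML("<root><child/></root>")

# XML is named like a class, but it is a function that builds an element.
print("XML is a class:    ", inspect.isclass(ET.XML))
# Element, by contrast, really is a class.
print("Element is a class:", inspect.isclass(ET.Element))
# What XML() hands back is an instance of Element, not of some "XML" type.
print("result is Element: ", isinstance(root, ET.Element))
```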


>>> This means that "HTML()" now behaves differently from "fromstring()",
>>> although "XML()" and "fromstring()" behave the same in etree. But I find
>>> that ok, since they behave as you would expect. HTML() gives you an HTML
>>> page (including html/body) and "fromstring()" more or less gives you what
>>> you passed in as a string, be it with or without <html>.
>> Sometimes you actually don't get a body, like if you parse HTML('<link
>> rel="foo">') you only get a head.  And sometimes you don't get a head.
>> Maybe the parsing should normalize this too, as it's a corner case
>> people often don't think about.  For that matter, I think there should
>> probably be a body property on the html element (or all elements?),
>> since I find myself commonly plucking out the body element right away.
> 
> If we keep the current names, we should make sure they fit the expectations.
> Having HTML() always return a complete document sounds natural to me.

I'd be inclined to feel the other way, that HTML() would be more like 
fromstring(), and return what you give it instead of interpreting 
everything as a document.  But I'm not too concerned there.

> Checking the returned tag for 'body' or 'head' is simple enough.
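
A minimal sketch of that check, covering the head-only corner case from
above. Again this uses the standard library's ElementTree on well-formed
markup as a stand-in for lxml.html, and find_body() is a hypothetical helper
name, not an existing API.

```python
import xml.etree.ElementTree as ET


def find_body(root):
    """Return the <body> element for a parsed root, or None if there is none."""
    if root.tag == "body":
        return root
    if root.tag == "html":
        # find() returns None when the document only has a <head>.
        return root.find("body")
    return None


full = ET.fromstring("<html><head/><body><p>hi</p></body></html>")
head_only = ET.fromstring('<html><head><link rel="foo"/></head></html>')

for doc in (full, head_only):
    body = find_body(doc)
    print("no body" if body is None else "body with %d child(ren)" % len(body))
```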

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
             : Write code, do good : http://topp.openplans.org/careers

