[lxml-dev] naming the lxml.html parse functions

Ian Bicking ianb at colorstudy.com
Mon Jul 9 21:53:11 CEST 2007


Stefan Behnel wrote:
> Stefan Behnel wrote:
>> Stefan Behnel wrote:
>>> HTML is a factory function, so what about
>>> calling the string parser functions "HTML()", "HTMLFragment()" and
>>> "HTMLFragments()"?
>> That would also make the semantics pretty simple:
>>
>> HTML() will always return a complete HTML document, i.e. wrapped by html/body
>> if necessary.
>>
>> HTMLFragment() will always return a fragment, i.e. a single element that can
>> be pasted into a body. This means: remove html/body if they are present and
>> add a <div> if there are multiple elements. Maybe check if there actually are
>> any block tags and just wrap the fragments in a <p> otherwise, but that's more
>> of an optimisation.

I think we talked about using <span> if there were no block tags, not <p>.
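
The wrapping rule under discussion can be sketched roughly as follows. This is only an illustration built on the standard library's html.parser, not what lxml.html does; the name choose_wrapper, the scanner class, and the (incomplete) tag sets are all made up for the example. It assumes reasonably well-formed input and ignores the html/body stripping step entirely.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete tag sets, for illustration only.
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "blockquote", "pre", "h1", "h2", "h3"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TopLevelScanner(HTMLParser):
    """Record top-level tags, top-level text, and any block tag seen."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.top_tags = []
        self.has_top_text = False
        self.has_block = False

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.top_tags.append(tag)
        if tag in BLOCK_TAGS:
            self.has_block = True
        # Void elements never get an end tag, so they do not nest.
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.has_top_text = True


def choose_wrapper(html):
    """Return None (no wrapper needed), "div", or "span"."""
    scanner = TopLevelScanner()
    scanner.feed(html)
    scanner.close()
    # A single element with no stray text can be returned as-is.
    if len(scanner.top_tags) == 1 and not scanner.has_top_text:
        return None
    # Block content needs a block wrapper; otherwise a <span> is enough.
    return "div" if scanner.has_block else "span"
```

With this sketch, a lone `<p>one</p>` needs no wrapper, two paragraphs get a `div`, and `<b>one</b> and <i>two</i>` gets a `span`.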

Something about HTMLFragment(s) seems weird to me.  I guess HTML() 
itself is weird too, though it is reminiscent of XML(), which is itself 
weird, since neither is a class.  HTMLFragment() bothers me more because 
it doesn't return a different type of object from HTML(), but the 
naming implies that it does.
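
To make the naming point concrete, here is a small check using the standard library's xml.etree.ElementTree as a stand-in (lxml.etree follows the same API shape, but nothing below touches lxml itself). It shows that XML() is a callable rather than a class, and that XML() and fromstring() hand back the same element type.

```python
import xml.etree.ElementTree as ET

SOURCE = "<root><child/></root>"

doc = ET.XML(SOURCE)
same = ET.fromstring(SOURCE)

# Despite the capitalised name, XML is a factory callable, not a class.
xml_is_class = isinstance(ET.XML, type)

# Both entry points return the same element type with the same content.
same_type = type(doc) is type(same)
same_content = ET.tostring(doc) == ET.tostring(same)

print(xml_is_class, same_type, same_content)
```

So a name like XML() or HTML() says nothing about the type of the result; it only names the kind of input being parsed.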

>> HTMLFragments() will always return a list of fragments, i.e. text and/or
>> elements and remove any html/body parts that come from the document or were
>> added by the parser.
> 
> I changed this on the branch and also renamed the current do-what-I-mean
> "parse()" function to "fromstring()".

That seems like a fine name.

> This means that "HTML()" now behaves differently from "fromstring()", although
> "XML()" and "fromstring()" behave the same in etree. But I find that ok, since
> they behave as you would expect. HTML() gives you an HTML page (including
> html/body) and "fromstring()" more or less gives you what you passed in as a
> string, be it with or without <html>.

Sometimes you don't actually get a body: if you parse 
HTML('<link rel="foo">') you only get a head.  And sometimes you don't 
get a head.  Maybe the parsing should normalize this too, as it's a 
corner case people often don't think about.  For that matter, I think 
there should probably be a body property on the html element (or on all 
elements?), since I commonly find myself plucking out the body element 
right away.
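
A rough sketch of the normalization I have in mind, again using the standard library's ElementTree on well-formed markup instead of the real HTML parser. The function name normalize_document is hypothetical, and returning the body from it is just one way to stand in for the proposed body property.

```python
import xml.etree.ElementTree as ET


def normalize_document(root):
    """Make sure <html> has a <head> followed by a <body>; return the body.

    Hypothetical helper: it only looks at direct children of the root.
    """
    # Compare against None explicitly: an element without children is falsy.
    if root.find("head") is None:
        root.insert(0, ET.Element("head"))
    body = root.find("body")
    if body is None:
        body = ET.SubElement(root, "body")
    return body


# Head only, as in the <link rel="foo"> case.
head_only = ET.fromstring('<html><head><link rel="foo"/></head></html>')
body = normalize_document(head_only)

# Body only.
body_only = ET.fromstring("<html><body><p>hi</p></body></html>")
normalize_document(body_only)

print([child.tag for child in head_only])
print([child.tag for child in body_only])
```

After normalization both trees have a head followed by a body, so code that plucks out the body no longer needs to special-case either input.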


-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
             : Write code, do good : http://topp.openplans.org/careers

