[lxml-dev] naming the lxml.html parse functions
Ian Bicking
ianb at colorstudy.com
Mon Jul 9 21:53:11 CEST 2007
Stefan Behnel wrote:
> Stefan Behnel wrote:
>> Stefan Behnel wrote:
>>> HTML is a factory function, so what about
>>> calling the string parser functions "HTML()", "HTMLFragment()" and
>>> "HTMLFragments()"?
>> That would also make the semantics pretty simple:
>>
>> HTML() will always return a complete HTML document, i.e. wrapped by html/body
>> if necessary.
>>
>> HTMLFragment() will always return a fragment, i.e. a single element that can
>> be pasted into a body. This means: remove html/body if they are present and
>> add a <div> if there are multiple elements. Maybe check if there actually are
>> any block tags and just wrap the fragments in a <p> otherwise, but that's more
>> of an optimisation.
I think we talked about using <span> if there were no block tags, not <p>.
Something about HTMLFragment(s) seems weird to me. I guess HTML()
itself is weird, though it is reminiscent of XML(), which is weird
too, since neither is a class. HTMLFragment() bothers me more because
it definitely doesn't return a different type of object from HTML(), but
the naming implies it does.
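
To make the wrapping rule concrete, here is a rough sketch of what is being discussed: return a lone element as-is, otherwise wrap in a <div> if any block tag is present and in a <span> if not. It works on already-parsed elements from the standard library's xml.etree.ElementTree, not on lxml's parser, and the names wrap_fragment and BLOCK_TAGS (and the tag list itself) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately short list of block-level tags (an assumption,
# not the list lxml uses).
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "blockquote", "pre",
              "h1", "h2", "h3"}


def wrap_fragment(elements):
    """Return one element for a list of already-parsed elements.

    A single element is returned unchanged. Several elements are wrapped
    in a <div> if any of them is, or contains, a block tag, and in a
    <span> otherwise. An empty list gives an empty <span>.
    """
    if len(elements) == 1:
        return elements[0]
    has_block = any(
        descendant.tag in BLOCK_TAGS
        for element in elements
        for descendant in element.iter()  # iter() includes the element itself
    )
    wrapper = ET.Element("div" if has_block else "span")
    wrapper.extend(elements)
    return wrapper


inline = [ET.fromstring("<b>bold</b>"), ET.fromstring("<i>italic</i>")]
mixed = [ET.fromstring("<p>one</p>"), ET.fromstring("<b>two</b>")]

print(wrap_fragment(inline).tag)  # span
print(wrap_fragment(mixed).tag)   # div
```

Either way the caller gets back an ordinary element, the same type HTML() would hand out, which is the naming concern above.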
>> HTMLFragments() will always return a list of fragments, i.e. text and/or
>> elements and remove any html/body parts that come from the document or were
>> added by the parser.
>
> I changed this on the branch and also renamed the current do-what-I-mean
> "parse()" function to "fromstring()".
That seems like a fine name.
> This means that "HTML()" now behaves differently from "fromstring()", although
> "XML()" and "fromstring()" behave the same in etree. But I find that ok, since
> they behave as you would expect. HTML() gives you an HTML page (including
> html/body) and "fromstring()" more or less gives you what you passed in as a
> string, be it with or without <html>.
Sometimes you don't actually get a body: if you parse HTML('<link
rel="foo">') you only get a head. And sometimes you don't get a head.
Maybe the parsing should normalize this too, as it's a corner case
people often don't think about. For that matter, I think there should
probably be a body property on the html element (or on all elements?),
since I often find myself plucking out the body element right away.
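
A minimal sketch of the kind of normalization meant here, again using the standard library's xml.etree.ElementTree on an already-parsed tree rather than anything in lxml. The function names normalize_document and get_body are hypothetical, and get_body only stands in for the proposed body property.

```python
import xml.etree.ElementTree as ET


def normalize_document(root):
    """Make sure an <html> root has a <head> and a <body> child."""
    if root.tag != "html":
        raise ValueError("expected an <html> root element")
    if root.find("head") is None:
        # A missing head goes first, ahead of any existing body.
        root.insert(0, ET.Element("head"))
    if root.find("body") is None:
        # A missing body goes last, after the head.
        root.append(ET.Element("body"))
    return root


def get_body(root):
    """Stand-in for the proposed body property: the <body> child, or None."""
    return root.find("body")


# The head-only case from above, written as well-formed XML so that the
# standard library parser accepts it.
doc = ET.fromstring('<html><head><link rel="foo" /></head></html>')
normalize_document(doc)
print([child.tag for child in doc])  # ['head', 'body']
```

After normalizing, code that grabs the body right away no longer has to special-case documents that came back with only a head.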
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
: Write code, do good : http://topp.openplans.org/careers