[lxml-dev] naming the lxml.html parse functions

Ian Bicking ianb at colorstudy.com
Fri Jul 6 19:01:31 CEST 2007


Stefan Behnel wrote:
> Ian Bicking wrote:
>> I'm still not sure what to call all the parsing functions for HTML.
> 
> Hmm, there isn't really something comparable in lxml's API so far, so we can't
> just copy names here.
> 
> "parse_string()" would match their intention, so that would make it
> "parse_string_element()" and "parse_string_elements()". Maybe that's too long
> for an every-day-use function, but at least the names are clear. I don't even
> think length matters here as parse functions may be used in every program, but
> likely only once or a couple of times in a few selected places, so clarity
> outweighs typing here IMHO.
> 
> "strparse()" would be shorter but might suggest that they only parse plain
> strings, not unicode (although unicode parsing is somewhat 'advanced use' anyway).

For the different varieties, I wonder if they should just be attributes 
on the parser?  E.g., HTML() (full doc), HTML.element(), 
HTML.elements().  Similarly, parse(fn) (full doc), parse.element(fn), 
parse.elements(fn).  Then we just have HTML and parse.

One nice thing about this is that you don't have to fiddle with imports 
when you change your mind about what you are parsing.
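
A minimal sketch of the attribute idea, assuming nothing about how lxml would actually implement it. The standard library's ElementTree stands in for the real HTML parser (so it only accepts well-formed markup), and the helper names `_element` and `_elements` are hypothetical:

```python
import xml.etree.ElementTree as ET


def HTML(text):
    """Parse a complete document and return its root element."""
    # Stand-in parser: ElementTree only handles well-formed markup.
    return ET.fromstring(text)


def _elements(text):
    """Parse a fragment that may contain several sibling elements."""
    # Wrap the fragment so the parser sees a single root, then unwrap it.
    wrapper = ET.fromstring("<wrapper>%s</wrapper>" % text)
    return list(wrapper)


def _element(text):
    """Parse a fragment that must contain exactly one element."""
    found = _elements(text)
    if len(found) != 1:
        raise ValueError("expected exactly one element, got %d" % len(found))
    return found[0]


# Hang the variants off the main callable, so one import covers all three.
HTML.element = _element
HTML.elements = _elements
```

With this layout, switching from HTML(text) to HTML.elements(text) is a change at the call site only; the import line stays the same.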

> On the other hand, I'm wondering why they parse strings in the first place.
> Wouldn't parsing from a file make more sense? There's always StringIO if you
> need it (which is efficiently special cased in lxml). Note that libxml2 can
> even parse from http and ftp URLs directly, so you would even lose something
> (if only performance) if you required people to load a document into memory
> first and then pass it to the parser as a string. You'd also lose base URL
> information, BTW.

Where is base URL information kept?  This should be an optional argument 
for all the parsing functions that don't use a URL.
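
As a rough illustration of the optional argument, here is a sketch in which a string parser accepts a base URL and a helper uses it to resolve relative links. Both `parse_string` and `absolute_links` are hypothetical names, and ElementTree again stands in for the real parser:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def parse_string(text, base_url=None):
    """Hypothetical string parser that remembers where the text came from."""
    root = ET.fromstring(text)
    # A string carries no location of its own, so the caller supplies it.
    return root, base_url


def absolute_links(root, base_url):
    """Resolve every href against base_url; leave links as-is without one."""
    hrefs = [a.get("href") for a in root.iter("a") if a.get("href")]
    if base_url is None:
        return hrefs
    return [urljoin(base_url, href) for href in hrefs]
```

Without the argument, relative links parsed from a string simply stay relative, which is the information loss being discussed.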

> So, my preferred solution would be to keep the names and make them functions
> that parse from a filename or file-like object, just like etree.parse() works.
> Admittedly, that's a bit tricky as you can't check what the file starts with
> to decide how to parse it without opening it first...

If I did that, I'd just have to write the string-based versions over and 
over, as that's what I use (and pretty much have to use) in all the 
tests.  I suppose outside of tests it's not that useful, but tests are 
of course important.  Plus lxml.etree.XML, HTML, etc., already work on 
strings, so there should be equivalent parsers.
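
For comparison, this is roughly the wrapper that would get rewritten in every test suite if only file-based functions existed. It is a sketch using the standard library's ElementTree rather than lxml, and `parse_file` and `parse_string` are hypothetical names:

```python
import io
import xml.etree.ElementTree as ET


def parse_file(source):
    """File-based entry point: accepts a filename or a file-like object."""
    return ET.parse(source).getroot()


def parse_string(text):
    """String-based convenience wrapper, mostly useful in tests."""
    # Wrap the string in a file-like object and reuse the file-based parser.
    return parse_file(io.StringIO(text))
```

The wrapper is only a couple of lines, but shipping it once in the library saves every test module from carrying its own copy.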

>> Also
>> I'd like some method on at least HTML elements for doing CSS selections,
>> but I'm not sure what to call it.  Any ideas?
> 
> Well, the xpath() method is named after the language, so why not just call the
> method "cssselect()" ? That makes it clear where the implementation comes from
> and matches the existing API.

Sure.
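
To make the xpath() parallel concrete, here is a toy sketch of a cssselect() helper that translates a tiny subset of CSS into an ElementTree path expression. It is an assumption-laden illustration and not lxml's implementation: it handles only 'tag', '.cls', 'tag.cls' and '#id', and the class test is an exact match on the whole attribute value.

```python
import xml.etree.ElementTree as ET


def css_to_path(selector):
    """Translate 'tag', '.cls', 'tag.cls' or '#id' into an ElementTree path."""
    if selector.startswith("#"):
        return ".//*[@id='%s']" % selector[1:]
    tag, dot, cls = selector.partition(".")
    path = ".//%s" % (tag or "*")
    if dot:
        # Exact attribute match only; multi-class values are not handled.
        path += "[@class='%s']" % cls
    return path


def cssselect(root, selector):
    """Return the descendants of root matching the (very limited) selector."""
    return root.findall(css_to_path(selector))
```

As a method on elements, the call would read el.cssselect("p.note"), right next to el.xpath(...).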

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
             : Write code, do good : http://topp.openplans.org/careers

