[lxml-dev] naming the lxml.html parse functions

Stefan Behnel stefan_ml at behnel.de
Fri Jul 6 10:10:02 CEST 2007



Ian Bicking wrote:
> I'm still not sure what to call all the parsing functions for HTML.

Hmm, there isn't really anything comparable in lxml's API so far, so we can't
just copy names here.

"parse_string()" would match their intention, so that would make it
"parse_string_element()" and "parse_string_elements()". Maybe that's too long
for an everyday function, but at least the names are clear. I don't think
length even matters here: parse functions may appear in every program, but
likely only once or twice in a few selected places, so clarity outweighs
typing here IMHO.

"strparse()" would be shorter but might suggest that they only parse plain
strings, not unicode (although unicode parsing is somewhat 'advanced use' anyway).

On the other hand, I'm wondering why they parse strings in the first place.
Wouldn't parsing from a file make more sense? There's always StringIO if you
need it (which is efficiently special-cased in lxml). Note that libxml2 can
even parse from http and ftp URLs directly, so you would even lose something
(if only performance) if you required people to load a document into memory
first and then pass it to the parser as a string. You'd also lose base URL
information, BTW.
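
To illustrate the point about file-like input, here is a minimal sketch that
parses in-memory HTML through etree.parse() rather than a string function. It
assumes Python 3, where io.BytesIO plays the role of StringIO; the sample
markup and the example.com URL are made up for illustration.

```python
from io import BytesIO

from lxml import etree

# Made-up sample document, held in memory as bytes.
html = b"<html><body><p>Hello</p></body></html>"

# etree.parse() accepts a filename, URL, or file-like object.
# Passing base_url keeps the document's origin even for in-memory data;
# the URL below is a placeholder, not a real resource.
parser = etree.HTMLParser()
tree = etree.parse(
    BytesIO(html), parser, base_url="http://example.com/index.html"
)

root = tree.getroot()
print(root.tag)
print(root.findtext(".//p"))
print(tree.docinfo.URL)
```

Wrapping the data in a file-like object keeps a single entry point for files,
URLs, and in-memory content, at the cost of one extra line for the caller.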

So, my preferred solution would be to keep the names and make them functions
that parse from a filename or file-like object, just like etree.parse() works.
Admittedly, that's a bit tricky as you can't check what the file starts with
to decide how to parse it without opening it first...
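
The tricky part can be sketched roughly as follows: peek at the beginning of a
seekable file-like object, then rewind so that the real parser still sees the
whole input. The function name, the peek size, and the doctype/<html> heuristic
are all assumptions made for this example, not anything lxml provides.

```python
import io
import re

# Hypothetical helper: not part of lxml.
def looks_like_full_document(stream, peek_size=1024):
    """Guess whether a seekable stream holds a full HTML document.

    Returns True if the data starts with a doctype or an <html> tag,
    False if it looks like a fragment. The stream position is restored.
    """
    start = stream.tell()
    head = stream.read(peek_size)
    stream.seek(start)  # rewind so the parser can read from the start

    if isinstance(head, bytes):
        head = head.decode("ascii", "ignore")
    pattern = r"\s*(<!doctype\b|<html\b)"
    return re.match(pattern, head, re.IGNORECASE) is not None


document = io.BytesIO(b"<!DOCTYPE html><html><body></body></html>")
fragment = io.BytesIO(b"<p>just a fragment</p>")

print(looks_like_full_document(document))
print(looks_like_full_document(fragment))
```

This only works for seekable input. A socket or pipe would need its first
block buffered and replayed to the parser, which is where the real complexity
lies.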


> Also
> I'd like some method on at least HTML elements for doing CSS selections,
> but I'm not sure what to call it.  Any ideas?

Well, the xpath() method is named after the language, so why not just call the
method "cssselect()"? That makes it clear where the implementation comes from
and matches the existing API.

Stefan

