[lxml-dev] time for ideas: a good API for iterparse() on HTML ?

Stefan Behnel stefan_ml at behnel.de
Thu Sep 20 18:24:15 CEST 2007


Hi all,

I wonder what a good API for iterparse() on HTML would be. I'm kinda tempted
to change the iterparse class into a function like parse(), remove the
existing keyword arguments and replace them with a standard "parser" argument
as in parse() and fromstring():

  >>> iterator = etree.iterparse(f, parser=etree.HTMLParser())

I'm not sure this works, though. iterparse can't support a parser target
object (the "target" keyword of parsers) or the feed parser interface, both of
which the XMLParser and the HTMLParser currently support, and it wouldn't be
obvious from the API that you can't pass a target parser into iterparse(). So
it's not quite the perfect interface, as this would need to raise an error:

  >>> parser = etree.HTMLParser(target=SomeTarget())
  >>> iterator = etree.iterparse(f, parser=parser)
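
A minimal sketch of what that check could look like, assuming a function-style
iterparse() that takes a "parser" argument. Nothing here is lxml code:
StubParser and iterparse_sketch are hypothetical stand-ins, and the choice of
TypeError is an assumption, since the kind of error is left open above.

```python
class StubParser:
    """Hypothetical stand-in for a parser object (not an lxml class)."""

    def __init__(self, target=None):
        # Mirrors the "target" keyword of the parser classes.
        self.target = target


def iterparse_sketch(source, parser=None):
    """Hypothetical function-style iterparse() with a 'parser' argument."""
    # Reject parsers that were configured with a target object, since
    # iterparse could not honour it. TypeError is an assumed choice.
    if parser is not None and getattr(parser, "target", None) is not None:
        raise TypeError("iterparse() cannot use a parser that has a target")
    # A real implementation would parse 'source' and yield events here;
    # this sketch only demonstrates the argument check.
    return iter(())


# A plain parser is accepted, a target parser is rejected.
events = list(iterparse_sketch(None, parser=StubParser()))
try:
    iterparse_sketch(None, parser=StubParser(target=object()))
    rejected = False
except TypeError:
    rejected = True
```

The downside stays the same as described above: the error only shows up at
call time, and nothing in the signature hints at the restriction.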

The alternatives would be an "html" keyword option to iterparse (the
straightforward, simple solution, but one we use nowhere else in the API):

  >>> iterator = etree.iterparse(f, html=True)

or a "method" argument like in the serialisers:

  >>> iterator = etree.iterparse(f, method="html")

or maybe:

  >>> iterator = etree.iterparse(f, input_type="html")

or an "iterparsehtml" function/class (which would be the worst thing to do IMHO):

  >>> iterator = etree.iterparsehtml(f)
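
For comparison, this is how the "method" keyword mentioned above looks on the
serialiser side. It is a small sketch using the standard library's
xml.etree.ElementTree, chosen only so the snippet runs without lxml installed;
lxml's etree.tostring() accepts the same "method" keyword.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree containing an empty element.
root = ET.Element("p")
ET.SubElement(root, "br")

# The same tree, serialised with two different output methods.
xml_out = ET.tostring(root, method="xml")
html_out = ET.tostring(root, method="html")

print(xml_out)
print(html_out)
```

An iterparse(f, method="html") would reuse this vocabulary on the input side,
which is the kind of symmetry at stake here.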

I feel that there should be some symmetry between iterparse(), the other parse
functions and the parser classes, but I'm not sure which form it should take.

Any comments?

Stefan

