[lxml-dev] iterparse comment parsing

Stefan Behnel stefan_ml at behnel.de
Thu Oct 18 18:43:02 CEST 2007


Hi,

kris wrote:
>         I noticed a difference in parsing behavior between 
>         iterwalk and iterparse.  It could simply be that I do
>         not know how to turn off comment parsing.

You can switch it off for iterparse() and the normal parser via a keyword
argument. See help(etree.XMLParser) and read the parser docs on the web page.
I'm not sure lxml 1.1.x supports that already, though.
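
As a rough sketch of that keyword argument, assuming an lxml release whose
XMLParser accepts remove_comments (the helper name parse_without_comments and
the sample document are made up for illustration):

```python
from lxml import etree

# Sample document modelled on the report: one comment inside <r>.
XML = b"<r><!-- asjsjs --></r>"


def parse_without_comments(data):
    """Parse data with a parser that drops comments while building the tree.

    Hypothetical helper; assumes XMLParser supports remove_comments.
    """
    parser = etree.XMLParser(remove_comments=True)
    return etree.fromstring(data, parser)


root = parse_without_comments(XML)
print(len(root))  # number of child nodes left under <r>
```

The same parser object can be passed to etree.parse() in the same way.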

To work around this, you can wrap iterwalk() in a generator function that
checks whether the .tag attribute of each element is a string and only yields
the events that meet this criterion.
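
A minimal sketch of such a filter, assuming Python 3 (where tag names are str)
and an lxml version that provides etree.iterwalk(); the names skip_non_elements
and walk_elements are hypothetical, not part of lxml:

```python
def skip_non_elements(events):
    """Yield only the (event, element) pairs whose element has a string tag.

    Comments and processing instructions do not have a string in .tag,
    so the isinstance() check filters them out.
    """
    for event, element in events:
        if isinstance(element.tag, str):
            yield event, element


def walk_elements(root):
    """Hypothetical convenience wrapper around etree.iterwalk()."""
    # Imported here so that the filter above has no lxml dependency itself.
    from lxml import etree

    return skip_non_elements(etree.iterwalk(root, events=("start", "end")))
```

It would be used like iterwalk() itself, e.g.
"for event, element in walk_elements(root): ...". On Python 2 the check would
be against basestring rather than str. Releases that no longer report comments
from iterwalk() make the filter a harmless no-op.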


>         python ~/xml-test.py
>         using parse
>         ['start', <Element r at 2aad85d0f3c0>, 'end', <Element r at
>         2aad85d0f3c0>]
>         using walk
>         ['start', <Element r at 2aad85d0f460>, 'start', <!-- asjsjs -->,
>         'end', <!-- asjsjs -->, 'end', <Element r at 2aad85d0f460>]

>         BAD

I agree. :)


>         lxml.etree:        (1, 1, 2, 0)

That's pretty old and this won't get fixed in 1.1.x. But it should get fixed
in both 1.3 and 2.0. It should be easy for you to upgrade then.

I'm not quite sure which one to fix, though: iterparse() or iterwalk().
iterparse() already accepts the normal parser keyword arguments
"remove_comments" and "remove_pis", but those currently only change the tree
that is built, not the events that are generated. So at least in 1.3, iterwalk
should ignore comments and PIs as well.

I was thinking about changing the way iterparse() is currently implemented in
2.0 anyway, so the idea would be to pass a parser instead of a bunch of
keyword arguments. Not sure how that will work out...

Stefan


