[lxml-dev] Annoying interaction between comments and text
Stefan Behnel
stefan_ml at behnel.de
Tue Jun 12 22:01:44 CEST 2007
Hi Itamar,
Itamar Shtull-Trauring wrote:
> Let's say I have an element with some text in it. No subelements, just
> text. It may have a comment in it, but I really don't want to have to
> think about it. In elementtree I can just do:
>
> >>> elementtree.ElementTree.fromstring("<x>hello <!-- hello --> world</x>").text
> 'hello world'
This is only because ET strips comments in the parser. However, they are
actually part of its tree model and I find it rather surprising that I can
write out documents with comments in ET but when I parse them back in, the
comments are gone. So, ET is not consistent here.
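The round trip can be reproduced with the ElementTree that ships in the Python standard library as xml.etree.ElementTree. That module name and the Python 3 syntax are assumptions about the reader's setup; the original post uses the standalone elementtree package. A minimal sketch:

```python
import xml.etree.ElementTree as ET

# Build an <x> element that contains text, a comment, and trailing text.
x = ET.Element("x")
x.text = "hello "
comment = ET.Comment("hello")
comment.tail = " world"
x.append(comment)

# Writing the tree out keeps the comment ...
serialized = ET.tostring(x, encoding="unicode")
comment_written = "<!--" in serialized

# ... but the default parser discards it when reading the result back in.
reparsed = ET.fromstring(serialized)
children_after_reparse = len(reparsed)

print(serialized)
print(comment_written, children_after_reparse)
```

The serialized form contains the comment, while the reparsed element has no children at all, which is the asymmetry described above.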
To generate the above document in ET or lxml, you'd do this:
>>> from elementtree import ElementTree as et
>>> x = et.Element("x")
>>> comment = et.Comment("hello") # comment spacing bug in ET 1.2.x
>>> x.append(comment)
>>> x.text = "hello "
>>> comment.tail = " world"
>>> print et.tostring(x)
<x>hello <!-- hello --> world</x>
>>> x.text
'hello '
Big surprise?
lxml is compatible here, but it is also consistent:
> >>> lxml.etree.fromstring("<x>hello <!-- hello --> world</x>").text
> 'hello '
> One needs to use xpath to extract all the text. This is problematic
> because it means you can basically *never use the text attribute of
> elements*, since someone may have added a comment. Since comments have no
> semantic meaning this is something of a problem.
It's a problem in some cases and a feature in others. I understand that some
applications do not want to be bothered with comments, so a parser option to
cut them out would be the right solution. Note, however, that it would be
switched off by default.
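For readers on a later lxml release: to my knowledge such an option exists there as the remove_comments keyword of lxml.etree.XMLParser, and it is indeed off by default. The sketch below assumes an lxml version that supports this keyword and compares the default parser with the opt-in one:

```python
from lxml import etree

doc = "<x>hello <!-- hello --> world</x>"

# Default parser: the comment stays in the tree, so .text stops in front of it.
default_root = etree.fromstring(doc)

# Opt-in parser that drops comments while parsing.
parser = etree.XMLParser(remove_comments=True)
stripped_root = etree.fromstring(doc, parser)

print(repr(default_root.text), len(default_root))
print(repr(stripped_root.text), len(stripped_root))
```

With the default parser the element has one child (the comment). With remove_comments=True it has none, and .text covers the text on both sides of the former comment.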
BTW, it's a bit tricky but not too hard to remove comments yourself:
from lxml.etree import Comment

# collect all comments first, since the loop below modifies the tree
comments = [ el for el in root.getiterator() if el.tag is Comment ]
if comments and comments[0] is root:
    raise Exception("root node is a comment")
for comment in comments:
    parent = comment.getparent()
    if comment.tail:
        # keep the text after the comment by attaching it to whatever
        # precedes the comment: a sibling's tail or the parent's text
        pred = comment.getprevious()
        if pred is not None:
            pred.tail = (pred.tail or '') + comment.tail
        else:
            parent.text = (parent.text or '') + comment.tail
    parent.remove(comment)
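The same idea can be sketched without getparent() and getprevious(), which are lxml extensions, so that it also runs on the standard library's xml.etree.ElementTree. The function name strip_comments and the demo document are hypothetical, and the insert_comments option used to get comments into the tree in the first place assumes Python 3.8 or later. For lxml, the comparison would have to use lxml.etree.Comment instead of ET.Comment.

```python
import xml.etree.ElementTree as ET


def strip_comments(root):
    """Remove all comments below root, keeping their tail text in place."""
    if root.tag is ET.Comment:
        raise ValueError("root node is a comment")
    # Snapshot the elements first, because the loop mutates the tree.
    for parent in list(root.iter()):
        previous = None  # last surviving sibling seen so far
        for child in list(parent):
            if child.tag is not ET.Comment:
                previous = child
                continue
            if child.tail:
                # Attach the comment's tail to whatever precedes the comment.
                if previous is not None:
                    previous.tail = (previous.tail or "") + child.tail
                else:
                    parent.text = (parent.text or "") + child.tail
            parent.remove(child)


# Hypothetical demo input; the parser is told to keep comments in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(
    "<x>hello <!-- hello --> world<y/><!-- bye -->!</x>", parser=parser
)
strip_comments(root)
print(ET.tostring(root, encoding="unicode"))
```

Tracking the previous surviving sibling by hand replaces getprevious(), and walking each parent's children replaces getparent().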
> I bet there's lots and
> lots of lxml code that would break if someone added a comment inside an
> element's text.
I assume you mean "ElementTree code" here. lxml has had this feature since the
beginning.
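Code that wants to be robust against comments can collect the element's own text together with the tails of its children instead of reading .text alone. The helper name direct_text is hypothetical. The sketch uses the standard library's xml.etree.ElementTree with insert_comments=True (Python 3.8 or later) as a stand-in for a tree that keeps comments; the helper itself only relies on .text, .tail and iteration, so it should behave the same on lxml elements.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Join an element's own text with the tail text of its children.

    Text inside child elements is deliberately not included.
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


# Keep comments in the tree so that .text stops in front of the comment.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
x = ET.fromstring("<x>hello <!-- hello --> world</x>", parser=parser)

print(repr(x.text))
print(repr(direct_text(x)))
```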
Stefan