[lxml-dev] Annoying interaction between comments and text

Stefan Behnel stefan_ml at behnel.de
Tue Jun 12 22:01:44 CEST 2007


Hi Itamar,

Itamar Shtull-Trauring wrote:
> Let's say I have an element with some text in it. No subelements, just
> text. It may have a comment in it, but I really don't want to have to
> think about it. In elementtree I can just do:
> 
> >>> elementtree.ElementTree.fromstring("<x>hello <!-- hello --> world</x>").text
> 'hello  world'

This only works because ET's parser strips comments. Comments are
nevertheless part of its tree model, and I find it rather surprising that I
can write out documents with comments in ET, but when I parse them back in,
the comments are gone. So ET is not consistent here.

To generate the above document in ET or lxml, you'd do this:

  >>> from elementtree import ElementTree as et
  >>> x = et.Element("x")
  >>> comment = et.Comment("hello") # comment spacing bug in ET 1.2.x
  >>> x.append(comment)
  >>> x.text = "hello "
  >>> comment.tail = " world"

  >>> print et.tostring(x)
  <x>hello <!-- hello --> world</x>

  >>> x.text
  'hello '

Big surprise?
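
The round-trip inconsistency can be checked with a short script. This is a
minimal sketch that assumes the standard library's xml.etree.ElementTree
(Python 3) in place of the standalone elementtree package used above; the
variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Build <x>hello <!--hello--> world</x> by hand, as in the session above.
x = ET.Element("x")
x.text = "hello "
comment = ET.Comment("hello")
comment.tail = " world"
x.append(comment)

# Serialize the tree; the comment is written out.
serialized = ET.tostring(x, encoding="unicode")

# Parse the result back with the default parser; the comment is dropped
# and the text on both sides of it ends up in .text.
reparsed = ET.fromstring(serialized)

print(serialized)
print(len(x), len(reparsed))   # number of children before and after
print(repr(reparsed.text))
```

The hand-built tree has one child (the comment), while the reparsed tree has
none, so the same document gives two different values for .text.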

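lxml uses the same text/tail model here, but it is also consistent: its
parser keeps the comment, so parsing gives the same result as building the
tree by hand: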

> >>> lxml.etree.fromstring("<x>hello <!-- hello --> world</x>").text
> 'hello '


> One needs to use xpath to extract all the text. This is problematic
> because it means you can basically *never use the text attribute of
> elements*, since someone may have added a comment. Since comments have no
> semantic meaning this is something of a problem.

It's a problem in some cases and a feature in others. I understand that some
applications do not want to be bothered with comments, so a parser option to
cut them out would be the right solution. Note, however, that it would be
switched off by default.
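
Later lxml releases expose such an option as remove_comments=True on
lxml.etree.XMLParser, and it is indeed off by default. For comparison, here
is a sketch of the inverse switch in the standard library's
xml.etree.ElementTree, where comments are dropped by default and
TreeBuilder(insert_comments=True) keeps them. It assumes Python 3.8 or later;
the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = "<x>hello <!-- hello --> world</x>"

# TreeBuilder(insert_comments=True) requires Python 3.8 or later.
keep_comments = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
x = ET.fromstring(doc, parser=keep_comments)

# With comments kept, the text is split around the comment node:
# the part before it is x.text, the part after it is the comment's tail.
comment = x[0]
print(comment.tag is ET.Comment)
print(repr(x.text))
print(repr(comment.text))
print(repr(comment.tail))
```

With this parser the tree matches what lxml returns for the same document:
.text stops at the comment and the remainder sits in the comment's tail.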

BTW, it's a bit tricky but not too hard to remove comments yourself:

  from lxml.etree import Comment

  # collect all comment nodes in document order
  comments = [ el for el in root.getiterator() if el.tag is Comment ]
  if comments and comments[0] is root:
    raise Exception("root node is a comment")

  for comment in comments:
    parent = comment.getparent()
    if comment.tail:
      # remove() takes the tail away with the node, so move it first
      pred = comment.getprevious()
      if pred is not None:
        pred.tail = (pred.tail or '') + comment.tail
      else:
        parent.text = (parent.text or '') + comment.tail
    parent.remove(comment)
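
The same idea can be sketched without getparent() and getprevious(), which
the standard library's xml.etree.ElementTree does not provide. The function
name strip_comments is hypothetical, and the stdlib module is an assumption
made here so the example runs without lxml; it walks each parent's children
and tracks the previous non-comment sibling itself.

```python
import xml.etree.ElementTree as ET


def strip_comments(root):
    """Remove all comment nodes below root in place, keeping their tails."""
    if root.tag is ET.Comment:
        raise ValueError("root node is a comment")

    # Materialize the element list first, since the tree is modified below.
    for parent in list(root.iter()):
        previous = None  # last non-comment child seen so far
        for child in list(parent):
            if child.tag is not ET.Comment:
                previous = child
                continue
            if child.tail:
                # Move the tail to the preceding sibling, or to the
                # parent's text if the comment comes first.
                if previous is not None:
                    previous.tail = (previous.tail or "") + child.tail
                else:
                    parent.text = (parent.text or "") + child.tail
            parent.remove(child)


# Usage: the hand-built document from earlier in this message.
x = ET.Element("x")
x.text = "hello "
comment = ET.Comment("hello")
comment.tail = " world"
x.append(comment)

strip_comments(x)
print(repr(x.text))
print(len(x))
```

After the call the element has no children left and all of its text is
available through .text again.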


> I bet there's lots and
> lots of lxml code that would break if someone added a comment inside an
> element's text.

I assume you mean "ElementTree code" here. lxml has behaved this way since
the beginning.

Stefan


