[lxml-dev] Problem with ":" char in tag names

Stefan Behnel stefan_ml at behnel.de
Fri Aug 17 21:14:54 CEST 2007


Martijn Faassen wrote:
>> lxml (read: libxml2) supports XML 1.0 (don't think there were any relevant
>> changes in 1.1, which you cite above) and is generally namespace aware. This
>> means that ":" is considered a separator between a namespace prefix and the
>> tag name, and is therefore not allowed as part of a plain (namespace-less) tag
>> name.
> 
> What used to happen if you put a colon in a tag name? What would people 
> expect to happen?

Well, lxml.etree previously accepted those as part of a tag name. This means
that you could do this:

    >>> root = etree.Element("some:root")
    >>> print etree.tostring(root)
    <some:root/>

which allowed you to use namespace prefixes without declaring namespaces, i.e.
it really helps you in writing out broken XML. It also allowed you to do this,
which I think people did:

    >>> root = etree.XML('<p:root xmlns:p="http://whatever/"/>')
    >>> root.append( etree.Element("p:other") )
    >>> print etree.tostring(root)
    <p:root xmlns:p="http://whatever/"><p:other/></p:root>

Looks correct, right? However, it nicely breaks all namespace aware XML stuff
that works on the in-memory tree:

    >>> print root, root[0]
    <Element {http://whatever/}root at b7624e3c> <Element p:other at b792b93c>

    >>> print root.xpath("//p:other")
    Traceback (most recent call last):
      ...
    etree.XPathEvalError: Undefined namespace prefix

    >>> print root.xpath("//p:other", {"p":"http://whatever/"})
    []

So raising an exception here *really* prevents a lot of pitfalls and helps
people fix their programs.
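
For comparison, here is a minimal sketch of the namespace-aware way to build the same tree. It assumes a current lxml on Python 3 (the sessions above are Python 2), that the colon is rejected with a ValueError, and that prefixes are passed to xpath() through its "namespaces" keyword. The name NS is only an illustrative constant.

```python
from lxml import etree

NS = "http://whatever/"  # same placeholder namespace URI as above

root = etree.XML('<p:root xmlns:p="%s"/>' % NS)

# A colon in a plain tag name is refused instead of silently creating
# an element that only looks namespaced when serialised.
try:
    etree.Element("p:other")
    rejected = False
except ValueError:
    rejected = True

# Namespace-aware spelling: "{namespace-uri}localname".
child = etree.Element("{%s}other" % NS)
root.append(child)

# The element really lives in the namespace, so XPath can find it
# once the prefix is mapped for the query.
found = root.xpath("//p:other", namespaces={"p": NS})
print(rejected, child.tag, len(found))
```

The prefix "p" in the XPath expression is local to the query; only the namespace URI has to match the tree.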


> I wonder whether it'd be possible to support namespace prefixes the 
> proper way this way. I.e if I write:
> 
> Element('foo:bar', nsmap={'foo': 'blah'})
> 
> that could be equivalent to:
> 
> Element('{blah}bar', nsmap={'foo': 'blah'})

No. There should be one way to do this. We already use prefixes in XPath,
which causes a lot of annoyance for new users.

BTW, this is an extremely rare usage pattern. Normally, you would either work on
an XML document that already comes with its pre-defined prefixes, or you would
define an nsmap once (as you show above) and then stick to using
SubElement(..., "{ns}tag") without redefining the prefixes.

Note that lxml nicely reassigns prefixes now when inserting an element into an
existing tree, so there really is no need to assign prefixes more than once
(if at all).
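
A rough sketch of that pattern, assuming a current lxml on Python 3: the prefix is declared once on the root, and everything else refers to the namespace URI only. The names NS, first and second are illustrative, and the exact prefix chosen for an element built outside the tree is left to lxml rather than asserted here.

```python
from lxml import etree

NS = "http://whatever/"  # placeholder namespace URI

# Declare the prefix once, on the root element.
root = etree.Element("{%s}root" % NS, nsmap={"p": NS})

# Children only name the namespace URI; the prefix is not repeated.
first = etree.SubElement(root, "{%s}other" % NS)

# An element built elsewhere, without declaring any prefix at all,
# is still in the same namespace after being inserted.
second = etree.Element("{%s}other" % NS)
root.append(second)

tags = [el.tag for el in root]
print(etree.tostring(root).decode())
```

Both children report the same ".tag", regardless of which prefix ends up in the serialised output.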


> The nice thing is that you could avoid having to write '{%s}foo' % 
> my_namespace a lot.

Feel free to assign it to a global constant or to use the E factory as in
lxml.html.builder.
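
Both suggestions could look roughly like this. The sketch assumes a current lxml on Python 3 and uses the generic ElementMaker class from lxml.builder rather than the HTML-specific builder; NS, P and E are illustrative names.

```python
from lxml import etree
from lxml.builder import ElementMaker

NS = "http://whatever/"  # placeholder namespace URI

# Option 1: keep the "{uri}" part in a module-level constant.
P = "{%s}" % NS
root = etree.Element(P + "root", nsmap={"p": NS})
etree.SubElement(root, P + "other")

# Option 2: an E factory bound to the namespace; attribute access
# on E creates elements in that namespace.
E = ElementMaker(namespace=NS, nsmap={"p": NS})
doc = E.root(E.other())

print(etree.tostring(root).decode())
print(etree.tostring(doc).decode())
```

Either way the namespace URI is written down exactly once, and no '{%s}foo' % my_namespace formatting is scattered through the code.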


> Of course this has consequences for other areas, such as 'tag', so I'm 
> not sure whether this is a good idea, but throwing it in.

Right, it would let ".tag" return something other than what you passed into
the Element() function.
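
With the "{ns}tag" spelling, by contrast, ".tag" hands back exactly the string that was passed in. A small sketch, assuming a current lxml on Python 3 and reusing the 'blah' namespace from the quoted example; etree.QName is used here only to split the name into its parts.

```python
from lxml import etree

# Same names as in the quoted example above.
el = etree.Element("{blah}bar", nsmap={"foo": "blah"})

# .tag is the Clark-notation name that was passed in;
# the prefix is kept separately.
qname = etree.QName(el)
print(el.tag, el.prefix, qname.namespace, qname.localname)
```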


> It's definitely another extension on ElementTree, which can't really do 
> this kind of stuff well due to the lack of parent pointers.

Right, so it would unnecessarily add an additional namespace definition
pattern that is not supported by ET and at the same time allow the pitfalls
that the users who reported the problem currently run into. Meaning: it would
let people write programs that would stop working the day they wanted to
switch to ET or the day they started using XPath. Great.
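
As an illustration of the portability argument, the following sketch sticks to the "{ns}tag" notation and to calls that exist in both libraries, so it should run unchanged on lxml.etree and on the standard library's xml.etree.ElementTree (Python 3 assumed). NS is an illustrative constant.

```python
try:
    from lxml import etree
except ImportError:
    # Fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

NS = "http://whatever/"  # placeholder namespace URI

root = etree.Element("{%s}root" % NS)
etree.SubElement(root, "{%s}other" % NS)

# Lookup by Clark notation, no prefix involved.
by_clark = root.findall("{%s}other" % NS)

# Lookup by prefix; the prefix only exists in the mapping given here.
by_prefix = root.findall("p:other", namespaces={"p": NS})

print(len(by_clark), len(by_prefix))
```

Nothing in this code depends on which prefix the serialiser picks, which is what keeps it independent of the library underneath.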

No, this change is definitely a bug fix. I'm sorry for people who were not
aware of this bug in the past and accidentally misused it, but this has to change.

Stefan


