[lxml-dev] Problem with ":" char in tag names

Martijn Faassen faassen at startifact.com
Mon Aug 20 16:12:35 CEST 2007


Hey,

On 8/19/07, Stefan Behnel <stefan_ml at behnel.de> wrote:
[snip]
> > Yes, and prefixes are used in the XML serialization. The way we have
> > both prefixes and Clark notation *already* creates a lot of confusion
> > for users.
>
> That's why I would rather prefer getting it eliminated in XPath (with ETXPath)
> than introducing it in other parts of the API.

I assume ETXPath is a non-compliant way to express XPath expressions
that uses Clark notation? Those expressions will get very long. I'm not
sure whether that will make people's lives easier at all... Suddenly
standard XPath examples fail to work.
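
To make the length difference concrete, here is a small sketch. It
uses the standard library's xml.etree.ElementTree rather than ETXPath
itself (an assumption made for illustration), and the namespace URI and
element names are made up. The same search is written once in Clark
notation and once with a prefix mapping:

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns/doc"  # made-up namespace URI

root = ET.fromstring(
    '<d:doc xmlns:d="%s"><d:sec><d:para>hi</d:para></d:sec></d:doc>' % NS
)

# Clark notation: the full URI is repeated in every step of the path.
clark_path = "{%s}sec/{%s}para" % (NS, NS)
by_clark = root.findall(clark_path)

# Prefix notation: the URI appears once, in the prefix mapping.
prefix_path = "d:sec/d:para"
by_prefix = root.findall(prefix_path, namespaces={"d": NS})
```

Both searches find the same element; only the spelling of the path
differs, and the Clark form grows with every step.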

We must face it: namespace prefixes *are* something that people need
to worry about when dealing with XML anyway. We cannot make them go
away from our APIs entirely. My proposal is an attempt to make the
best of it.

> >> Note that lxml nicely reassigns prefixes now when inserting an element into
> >> an existing tree, so there really is no need to assign prefixes more than
> >> once (if at all).
> >
> > Assigning prefixes, sure. *Using* prefixes is what I'm talking about.
>
> But prefixes are error prone and this behaviour makes them even more error
> prone. Prefixes are not equivalent to namespaces as more than one prefix can
> map to the same namespace, in different parts of a document or even
> concurrently.

This is behavior that any XML programmer will need to be aware of
anyway. I don't think that in most XML handling code this will be
error-prone. People can understand that namespace definitions get
inherited through the XML tree. We can see prefixes as variables
"acquired" through the XML tree. If I use prefix 'a' on some node, the
system will walk up the parent chain until it finds the definition of
'a'. If the definition cannot be found, this is an error.
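
The lookup I have in mind can be sketched roughly as follows. The Node
class and resolve_prefix() are hypothetical names invented for this
example, not lxml API; each node only records the prefixes declared on
itself:

```python
class Node:
    """Toy element: a tag, a parent link, and locally declared prefixes."""

    def __init__(self, tag, parent=None, nsmap=None):
        self.tag = tag
        self.parent = parent
        self.nsmap = dict(nsmap or {})


def resolve_prefix(node, prefix):
    """Walk up the parent chain until a declaration of `prefix` is found."""
    current = node
    while current is not None:
        if prefix in current.nsmap:
            return current.nsmap[prefix]
        current = current.parent
    # No ancestor declares the prefix: treat this as an error.
    raise KeyError("undeclared namespace prefix: %r" % prefix)


root = Node("root", nsmap={"a": "urn:example:outer"})
child = Node("child", parent=root)
# A descendant may rebind the same prefix; the nearest declaration wins.
inner = Node("inner", parent=child, nsmap={"a": "urn:example:inner"})
```

Here resolve_prefix(child, "a") finds the declaration on root, while
resolve_prefix(inner, "a") stops at the rebinding on inner itself.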

> And since lxml.etree adapts namespace prefixes when merging
> documents or adding new elements, you can get surprising behaviour depending
> on the source of the document you are working on.

That is indeed a greater potential cause for errors. Under what
circumstances does this happen in practice? I imagine this is a bigger
problem when merging documents than when adding new elements, right?

> If you only generate XML
> from scratch without interacting with external code, you may be fine with
> prefix notation, but if you work on existing documents or pipe XML through
> external libraries, you may end up being surprised why lxml.etree starts
> throwing exceptions at you when you continue working on the document you just
> got back.

That's a good point.

Another question is what would happen with the default namespace - it
would be scary to have unprefixed names suddenly turn into namespaced
names.
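
Parsing already shows this effect. A minimal sketch, using the standard
library's ElementTree for illustration (the namespace URI is made up):
the two documents differ only in an xmlns declaration on the root, yet
the unprefixed child ends up with a different name.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring("<root><item/></root>")
defaulted = ET.fromstring('<root xmlns="urn:example:default"><item/></root>')

# Tag of the unprefixed <item/> child in each document.
plain_tag = plain[0].tag
defaulted_tag = defaulted[0].tag
```

In the second document the child's tag carries the default namespace in
Clark notation, although its source spelling is identical.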

> Allowing prefix notation in tag names encourages people to write code that
> makes assumptions about their data that may not be true for 100% equivalent
> data. And if you are aware of the potential pitfalls of such a feature, I
> doubt that you would use it except for a very limited number of use cases.

That's the question: is this set of use cases really "very limited"?
In a great many use cases, for instance almost all of my own, XML
documents only use a single namespace, or at most a few.

Possibly these (in my opinion very common) use cases would be served
by another strategy than meaningful namespace prefixes.

> > In addition, the Clark notation pattern forces one to write code like this:
> >
> > SubElement(el, '{%s}foo' % MY_NS)
> >
> > i.e. people generally don't want to spell out their entire namespace
> > URI over and over again when constructing XML.
>
> I absolutely see that problem. But I do not think that supporting prefix
> notation is a good way to solve this.

> I mean, the most common case where this
> really hurts is that you use one single namespace in your application and have
> to repeat it for every SubElement. But it's easy to write a factory that wraps
> SubElement() and simply copies the namespace of the parent over to the new
> child (if it doesn't provide one itself), something like this:
>
>     def SameNamespaceSubElement(parent, tag, *args, **kwargs):
>         if not tag.startswith("{") and parent.tag.startswith("{"):
>             tag = parent.tag[:parent.tag.index("}")+1] + tag
>         return etree.SubElement(parent, tag, *args, **kwargs)
>
> (plus QName() support, plus a better name, etc.)

In order to construct code like this more easily, it would be nice, by
the way, if elements had their namespace URI and namespace prefix
available as attributes (plus the namespace prefix -> namespace
mapping).

I do end up constructing a factory frequently. The above features
would allow me to construct a sub element factory that uses namespace
prefixes. We could then play with the feel of this and see whether we
can eventually move such a factory into the core (and in what form).
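
As a starting point for playing with this, here is a rough sketch of
such a factory. It is written against the standard library's
ElementTree so that it stands alone; split_tag() and
make_prefix_factory() are hypothetical names, the namespace URI is made
up, and the prefix mapping is passed in by the caller rather than read
from the tree:

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split a Clark-notation tag into (namespace URI or None, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def make_prefix_factory(prefixes):
    """Return a SubElement wrapper that accepts 'prefix:name' tags.

    `prefixes` maps prefix -> namespace URI and is supplied by the caller;
    nothing is looked up in the tree itself.
    """
    def sub_element(parent, name, attrib=None, **extra):
        if ":" in name and not name.startswith("{"):
            prefix, local = name.split(":", 1)
            # Raises KeyError for a prefix the caller did not declare.
            name = "{%s}%s" % (prefixes[prefix], local)
        return ET.SubElement(parent, name, attrib or {}, **extra)
    return sub_element


MY_NS = "http://example.com/ns/my"  # made-up URI
Sub = make_prefix_factory({"my": MY_NS})

root = ET.Element("{%s}root" % MY_NS)
foo = Sub(root, "my:foo", id="1")
```

With this, Sub(el, 'my:foo') takes the place of
SubElement(el, '{%s}foo' % MY_NS), and tags already in Clark notation
pass through unchanged.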

Regards,

Martijn

