[lxml-dev] Problem with ":" char in tag names

Martijn Faassen faassen at startifact.com
Sat Aug 18 18:22:51 CEST 2007


[whoops reply only sent to Stefan while I meant to include the list]

Hey Stefan,

I agree that this is a bugfix; sorry for the confusion. I never tried
to use namespace prefixes this way and used Clark notation
consistently. My feedback is not coming from the perspective of
supporting broken use, but wondering whether we cannot make lxml
easier to use.

On 8/17/07, Stefan Behnel <stefan_ml at behnel.de> wrote:

[snip in the past people used to be able to construct programs that
looked like they produced correct XML but actually didn't]

> So raising an exception here *really* prevents a lot of pitfalls and helps
> people fix their programs.

Okay, it is clear that the previous behavior was at best undefined, so
raising an error is a good idea. It does indicate one thing though -
if we *wanted* to write a feature that explicitly used namespace
prefixes, we could, as it's certainly not doing anything else. :)

> > I wonder whether it'd be possible to support namespace prefixes the
> > proper way this way. I.e if I write:
> >
> > Element('foo:bar', nsmap={'foo': 'blah'})
> >
> > that could be equivalent to:
> >
> > Element('{blah}bar', nsmap={'foo': 'blah'})
>
> No. There should be one way to do this. We already use prefixes in XPath,
> which causes a lot of annoyance for new users.

Yes, and prefixes are used in the XML serialization. The way we have
both prefixes and Clark notation *already* creates a lot of confusion
for users.

In addition, the Clark notation pattern forces one to write code like this:

SubElement(el, '{%s}foo' % MY_NS)

> BTW, this is an extremely rare use pattern. Normally, you would either work on
> an XML document that already comes with its pre-defined prefixes, or you would
> define an nsmap once (as you show above) and then stick to using
> SubElement(..., "{ns}tag") without redefining the prefixes.

Yes, but this is a very common pattern:

SubElement(el, '{%s}foo' % MY_NS)

i.e. people generally don't want to spell out their entire namespace
URI over and over again when constructing XML.

Therefore I started to wonder whether we could create a convenience
that uses namespace prefixes *and* does the right thing:

SubElement(el, 'myns:foo')

will work *if* myns has been defined as a namespace prefix in the
context of 'el'.

Of course, accessing tags through .tag would still return Clark notation.
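
A rough sketch of what such a lookup might do, assuming the prefix
mapping is available as a plain dict (in lxml an element exposes its
in-scope mapping as el.nsmap). The name resolve_tag is hypothetical and
not part of lxml; this only illustrates the idea, not a proposed
implementation:

```python
def resolve_tag(tag, nsmap):
    """Hypothetical helper: turn 'prefix:local' into '{uri}local'."""
    # Tags already in '{uri}local' form, or without a prefix, pass through.
    if tag.startswith('{') or ':' not in tag:
        return tag
    prefix, local = tag.split(':', 1)
    if prefix not in nsmap:
        # Undefined prefixes are rejected instead of silently accepted.
        raise ValueError('undefined namespace prefix: %r' % prefix)
    return '{%s}%s' % (nsmap[prefix], local)


# Placeholder mapping; the URI is made up for the example.
nsmap = {'myns': 'http://example.com/myns'}
resolved = resolve_tag('myns:foo', nsmap)
print(resolved)
```

The value handed back is what .tag would report, so setting
'myns:foo' and reading back the '{uri}foo' form is exactly the
asymmetry discussed next.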

We can have various objections against this. We can for instance say,
this is a bad idea as it is surprising behavior. After all, if
you set a tag and then get it, you'll get something else back. Then
again, since XML parsing already has this behavior and thus the user
will have to be familiar with it anyway, I don't think it's that
surprising in the end. It might actually be a useful convenience that
will make some code look cleaner.

Another objection is that we should have only one way to do it. But we
don't, really. In order to set namespaces for elements, we currently
have a number of ways to arrange your code. One is to use the "%s"
pattern. Another is to use a custom factory specific to your codebase.
And we *already* have two ways to get namespace information into the
application - through namespace prefixes in the parser, and through
Clark notation in the API.

> Note that lxml nicely reassigns prefixes now when inserting an element into an
> existing tree, so there really is no need to assign prefixes more than once
> (if at all).

Assigning prefixes, sure. *Using* prefixes is what I'm talking about.

> > The nice thing is that you could avoid having to write '{%s}foo' %
> > my_namespace a lot.
>
> Feel free to assign it to a global constant or to use the E factory as in
> lxml.html.builder.

Yes, remember that I've used lxml before. :)

I often use a global constant. It still means I scatter "{%s}foo" %
MY_GLOBAL_CONSTANT throughout my code. Meanwhile, I *already* have a
"global constant" that I also set somewhere, in the XML, namely my
namespace map.

I can of course create my own factory, which I've also frequently
done. That runs the risk of obscuring otherwise clear use of the lxml
API. (then again, the application's concerns may often force a factory
on the developer anyway).
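
For illustration, a minimal sketch of the kind of factory meant here.
It uses the standard library's xml.etree.ElementTree so that it runs
without lxml installed; the same Element/SubElement calls should work
with lxml.etree. MY_NS, tag_factory and T are made-up names, and the
URI is a placeholder:

```python
import xml.etree.ElementTree as ET

MY_NS = 'http://example.com/myns'  # placeholder namespace URI


def tag_factory(namespace):
    """Return a function that builds '{namespace}local' tag names."""
    def make(local_name):
        return '{%s}%s' % (namespace, local_name)
    return make


T = tag_factory(MY_NS)

root = ET.Element(T('root'))
child = ET.SubElement(root, T('foo'))
print(child.tag)
```

This hides the "{%s}foo" % MY_NS formatting behind T('foo'), but a
reader of the code now has to look up what T does before recognising
plain API usage.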

> > Of course this has consequences for other areas, such as 'tag', so I'm
> > not sure whether this is a good idea, but throwing it in.
>
> Right, it would let ".tag" return something other than what you passed into
> the Element() function.

Yes. If we make this change, we'd also need to figure out what happens
if you explicitly *set* tag. Should we allow:

foo.tag = 'foo:bar'

allowing potentially inconsistent behavior as you can set something
and then get back something else in Clark notation, or still forbid
it?

> > It's definitely another extension on ElementTree, which can't really do
> > this kind of stuff well due to the lack of parent pointers.
>
> Right, so it would unnecessarily add an additional namespace definition
> pattern that is not supported by ET and at the same time allow the pitfalls
> that the users who reported the problem currently run into. Meaning: it would
> let people write programs that would stop working the day they wanted to
> switch to ET or the day they started using XPath. Great.

I think there are two potential drawbacks:

* allow users to write programs that will stop working as soon as they
switch back to ET. This is a drawback. It's also a drawback that
already exists - we have many extensions on top of the ElementTree
API and people's programs will stop working if they don't stick to the
common subset.

* allow users to write programs that stop working when they switch to
XPath. I don't understand why you say this. Of course it's broken
*now*. I'm not advocating the current broken behavior at all, and
support disabling undefined behavior. I'm just wondering whether we
shouldn't support this behavior explicitly *and do the right thing*.
We can still raise an exception as soon as someone uses an undefined
namespace prefix, of course.

> No, this change is definitely a bug fix. I'm sorry for people who were not
> aware of this bug in the past and accidentally misused it, but this has to change.

Sorry for the confusion in my original reply. I didn't mean to say the
bugfix should be rolled back. It's indeed a bugfix and I support it.
It just led me to think we might have an opportunity there for
improving our API.

Regards,

Martijn

