[lxml-dev] invalid tag names get serialized

Stefan Behnel stefan_ml at behnel.de
Wed Jul 18 20:25:11 CEST 2007


jholg at gmx.de wrote:
> The name check should go directly into _createElement,

No, _createElement() is only a tiny wrapper around the element node creation
in libxml2. It must not raise Python exceptions.


> otherwise etree.SubElement will not pick it up.

Then SubElement will get its own check. I factored out the exception raising
so that it's only a one-liner to prevent invalid tags from passing through the
API.


> I'm also pro renaming TagNameIsValid to NCNameIsValid, as it is used on attributes also.

I actually renamed it to "_xmlNameIsValid()". It's not a public function yet,
but I might reconsider that.


>> Also, it's too late and too hard to debug. No, this patch works much better,
>> but the now failing tests seem to imply that Klingon tag names are not
>> allowed in well-formed XML documents. I'll have to check if it's the XML
>> spec that's xenophobe here or only libxml2...
> 
> I do think that the character \u1234 is not allowed for XML NCNames:
> BaseChar production snippet:
> 
> [...] #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] [...]

Right, I noticed that as well. I have now fixed the test cases and added a
bunch of new ones.
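
To make the quoted observation concrete, here is a small sketch that checks a
character against only the part of the BaseChar production quoted above. The
table and the function name are made up for illustration, and the full
production contains many more ranges than this excerpt, so this is not a
complete name check.

```python
# Only the part of the BaseChar production quoted above, written as
# (first, last) code point pairs. The full production is much longer.
QUOTED_BASECHAR_EXCERPT = [
    (0x11EB, 0x11EB),
    (0x11F0, 0x11F0),
    (0x11F9, 0x11F9),
    (0x1E00, 0x1E9B),
    (0x1EA0, 0x1EF9),
]


def in_quoted_excerpt(char):
    """Return True if `char` falls inside one of the quoted ranges."""
    code = ord(char)
    return any(first <= code <= last
               for first, last in QUOTED_BASECHAR_EXCERPT)


print(in_quoted_excerpt("\u1234"))  # False
print(in_quoted_excerpt("\u1e00"))  # True
```

U+1234 lands in the gap between #x11F9 and #x1E00, which is why it is not
matched by any entry in the excerpt.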

Stefan


