[lxml-dev] invalid tag names get serialized
Stefan Behnel
stefan_ml at behnel.de
Wed Jul 18 09:36:32 CEST 2007
jholg at gmx.de wrote:
> I noticed that lxml (both objectify and etree) happily accepts broken tag
> names (numbers, containing whitespace, ...) throughout the API and also
> serializes such documents; only when trying to re-parse them does this fail:
>
> >>> root = etree.Element("root")
> >>> etree.SubElement(root, " __foo bar ")
> ''
> >>> print etree.tostring(root)
> <root>< __foo bar /></root>
> >>> print etree.fromstring(etree.tostring(root))
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "etree.pyx", line 1970, in etree.fromstring
> File "parser.pxi", line 980, in etree._parseMemoryDocument
> File "parser.pxi", line 876, in etree._parseDoc
> File "parser.pxi", line 533, in etree._BaseParser._parseDoc
> File "parser.pxi", line 660, in etree._handleParseResult
> File "parser.pxi", line 608, in etree._raiseParseError
> etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 8
>
> I gather this is basically libxml2 behaviour. It is not nice, though, since
> you can produce serialized data without knowing your evil doings, and only
> detect it when you try to parse it back in (in vain). Would it be a problem
> to have the tag name checked before it is set for an element?
Not entirely "libxml2 behaviour", since libxml2 actually provides functions to
check names. You just have to use them, although 'just' is slightly too
simplistic here: the straightforward patch breaks lots of test cases, e.g.
getiterator('*'), where the argument is a wildcard rather than a valid name.
I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would
be enough to check names only in the factory functions, 'el.set()' and
'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care.
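
To illustrate the split between checked factory functions and unchecked
lookups, here is a minimal sketch. It is not the attached patch: it uses the
standard library's xml.etree.ElementTree as a stand-in for lxml, the helpers
is_valid_tag_name() and checked_subelement() are hypothetical names, and the
regular expression is a deliberately simplified ASCII-only approximation of an
XML name (the real rules also allow many non-ASCII characters).

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of an XML name without a colon.
# Assumption: good enough for illustration; the XML spec allows far more.
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_tag_name(name):
    """Return True if `name` looks like a plain ASCII XML name."""
    return _ASCII_NCNAME.fullmatch(name) is not None


def checked_subelement(parent, tag, attrib=None):
    """Hypothetical factory wrapper: validate names, then create the child."""
    attrib = dict(attrib or {})
    # Check the tag name and every attribute name before touching the tree.
    for name in [tag, *attrib]:
        if not is_valid_tag_name(name):
            raise ValueError("invalid XML name: %r" % (name,))
    return ET.SubElement(parent, tag, attrib)


root = ET.Element("root")
checked_subelement(root, "child", {"id": "1"})

try:
    checked_subelement(root, " __foo bar ")
except ValueError as exc:
    print("rejected:", exc)

# Lookups are left unchecked, so the '*' wildcard keeps working.
print([el.tag for el in root.iter("*")])

# Only valid names were accepted, so the output parses back in.
ET.fromstring(ET.tostring(root))
```

Because the check lives only in the factory wrapper, search arguments such as
'*' never pass through it, which is the point of restricting validation to
the places where names are actually set.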
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: name-validation.patch
Type: text/x-diff
Size: 1419 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070718/1d0f962b/attachment.bin