[lxml-dev] etree.tostring generate invalid XML?
Stefan Behnel
stefan_ml at behnel.de
Sun May 20 09:25:42 CEST 2007
Hi,
without looking it up, I don't think this is a bug (and definitely not in
lxml). The XML spec simply forbids certain characters in serialised XML.
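To make "certain characters" concrete: XML 1.0 allows only tab, line feed, carriage return, and the ranges U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF in a document. A rough sketch of a check for that, using only the standard library, could look like the following. The helper names has_invalid_xml_chars and strip_invalid_xml_chars are made up for this illustration and are not part of lxml or libxml2.

```python
import re

# Matches any character outside the ranges XML 1.0 permits in a document.
# Allowed: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def has_invalid_xml_chars(text):
    """Return True if text contains a character XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.search(text) is not None


def strip_invalid_xml_chars(text):
    """Return text with all forbidden characters removed (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub('', text)


sample = 'a\x08b'
cleaned = strip_invalid_xml_chars(sample)
```

Whether silently dropping such characters is acceptable depends on the application; rejecting the input is often the safer choice.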
Qiangning Hong wrote:
> >>> from lxml import etree
> >>> e = etree.Element('root')
> >>> e.text = u'\x08'
> >>> xml = etree.tostring(e, 'utf8')
> >>> xml
> '<root>\x08</root>'
Don't tell me you didn't expect that. :)
> >>> etree.XML(xml)
Interesting, no output here?
> >>> etree.XML(xml)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "etree.pyx", line 1749, in etree.XML
> File "parser.pxi", line 934, in etree._parseMemoryDocument
> File "parser.pxi", line 830, in etree._parseDoc
> File "parser.pxi", line 516, in etree._BaseParser._parseDoc
> File "parser.pxi", line 619, in etree._handleParseResult
> File "parser.pxi", line 590, in etree._raiseParseError
> etree.XMLSyntaxError: line 1: PCDATA invalid Char value 8
>
> Shouldn't xml be '<root></root>'? Is it a bug in lxml?
When you're dealing with binary data in XML, you should always encode it in a
way that makes it 'XML compatible', such as uuencode, base64 or whatever.
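As an illustration of the base64 route, here is a minimal sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; the payload value and variable names are made up for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary payload that includes bytes XML cannot carry as text.
raw = b'\x08\x00binary payload'

root = ET.Element('root')
# base64 output is plain ASCII, so it is always safe as element text.
root.text = base64.b64encode(raw).decode('ascii')

# Serialise, parse the result back, and recover the original bytes.
xml_bytes = ET.tostring(root, encoding='utf-8')
parsed = ET.fromstring(xml_bytes)
decoded = base64.b64decode(parsed.text)
```

The receiving side has to know that the element content is base64, so this convention needs to be agreed on by both ends.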
If you want, you can ask on the libxml2 mailing list, but I doubt they'll tell
you anything different. You might get an answer, though, that gives you a bit
more insight into what goes on.
Stefan