[lxml-dev] etree.tostring generate invalid XML?

Stefan Behnel stefan_ml at behnel.de
Sun May 20 09:25:42 CEST 2007


Hi,

Without looking it up, I don't think this is a bug (and definitely not in
lxml). The XML spec simply forbids certain characters in serialised XML.
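
To make that concrete, here is a minimal sketch of a check for such characters. It assumes the "Char" production of XML 1.0 (tab, LF, CR, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF). The helper name find_invalid_xml_chars is hypothetical and not part of lxml.

```python
import re

# Matches any character outside the XML 1.0 "Char" production
# (assumption: XML 1.0 rules; XML 1.1 is more permissive).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (position, code point) pairs for characters XML 1.0 forbids.

    Hypothetical helper for illustration only.
    """
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


# The backspace character from the example in this thread is reported,
# while ordinary text with tabs and newlines is not.
backspace_hits = find_invalid_xml_chars("\x08")
plain_hits = find_invalid_xml_chars("a\tb\n")
```

Running a check like this before assigning to .text would catch the problem at the point where the data enters the tree, rather than when the serialised result is parsed again.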

Qiangning Hong wrote:
> >>> from lxml import etree
> >>> e = etree.Element('root')
> >>> e.text = u'\x08'
> >>> xml = etree.tostring(e, 'utf8')
> >>> xml
>  '<root>\x08</root>'

Don't tell me you didn't expect that. :)


> >>> etree.XML(xml)

Interesting, no output here?


> >>> etree.XML(xml)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "etree.pyx", line 1749, in etree.XML
>   File "parser.pxi", line 934, in etree._parseMemoryDocument
>   File "parser.pxi", line 830, in etree._parseDoc
>   File "parser.pxi", line 516, in etree._BaseParser._parseDoc
>   File "parser.pxi", line 619, in etree._handleParseResult
>   File "parser.pxi", line 590, in etree._raiseParseError
> etree.XMLSyntaxError: line 1: PCDATA invalid Char value 8
> 
> Shouldn't xml be '<root>&#8;</root>'? Is it a bug in lxml?

When you're dealing with binary data in XML, you should always encode it in a
way that makes it 'XML compatible', such as uuencode, base64, or whatever.
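
As a rough sketch of the base64 route: the example below uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed. The calls used here (Element, tostring, fromstring) are also available in lxml.etree. The helper names set_binary_text and get_binary_text are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers the same calls


def set_binary_text(element, data):
    """Store raw bytes as base64 text so only XML-safe characters are written."""
    element.text = base64.b64encode(data).decode("ascii")


def get_binary_text(element):
    """Decode the base64 text of an element back into the original bytes."""
    return base64.b64decode(element.text)


root = ET.Element("root")
set_binary_text(root, b"\x08")  # the byte that broke the round trip above

# Serialise and parse again; the payload survives because it is plain ASCII.
xml = ET.tostring(root, encoding="utf-8")
restored = get_binary_text(ET.fromstring(xml))
```

The cost is that the element content is no longer human-readable, and the reader of the document has to know that the text is base64 encoded.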

If you want, you can ask on the libxml2 mailing list, but I doubt they'll tell
you anything different. You might get an answer, though, that gives you a bit
more insight into what goes on.

Stefan

