[lxml-dev] Low ASCII values as text

Stefan Behnel stefan_ml at behnel.de
Wed Apr 8 10:53:01 CEST 2009


Hi,

F Wolff wrote:
> I encountered a small issue from a user's error report, and a way to
> duplicate the issue is from this example code:
>
> from lxml import etree
> l = etree.Element('cow')
> l.text = unicode('\xd0\x94\x1bi\x1b\x1b\x1b?', "utf-8")
> etree.fromstring(etree.tostring(l))
>
> With lxml 2.1 I get:
>
> XMLSyntaxError: PCDATA invalid Char value 27, line 1, column 13
>
> It seems that etree.tostring() can generate XML that etree.fromstring()
> can't handle.

To be precise, tostring() could generate output that was not well-formed
XML. That was clearly a bug.


> But with a newer version (I think a beta of 2.2), I get
> "All strings must be XML compatible : Unicode or ASCII, no NULL bytes"
> on the assignment statement (l.text = ...).

This is in line with the set of allowed characters in XML, the relevant
snippet being:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | ...

"\x1b" is not in this set.

http://www.w3.org/TR/REC-xml/#charsets
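
As a rough illustration of that rule, here is a minimal sketch of a pre-check
that could be run on text before assigning it to an element. It is not lxml's
own validation code. The helper names (is_xml10_text, first_invalid) are
hypothetical, and the character class spells out the full Char production as
given in the XML 1.0 specification linked above.

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched by this pattern is NOT allowed in an XML 1.0 document.
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml10_text(text):
    """Return True if every character of `text` is an XML 1.0 Char."""
    return _INVALID_XML10.search(text) is None


def first_invalid(text):
    """Return (index, code point) of the first disallowed character, or None."""
    match = _INVALID_XML10.search(text)
    if match is None:
        return None
    return match.start(), ord(match.group())


# The text from the original report, decoded from its UTF-8 bytes.
text = b"\xd0\x94\x1bi\x1b\x1b\x1b?".decode("utf-8")
```

With the reported string, first_invalid(text) points at the ESC character
(code point 27), the same value the parser complained about.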


> So in either case my question is if lxml's handling of these low values
> in ASCII is correct, since it doesn't seem possible to actually
> represent them at all, but I guess I am missing something important. As
> far as I know the XML 1.0 specification demands indicating these with
> numeric entities.

No, you cannot even represent them as character references; they are
simply not allowed. The only (sensible) way to pass binary data through
XML is to encode it, e.g. using base64.
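
A minimal sketch of the base64 approach follows. To stay free of extra
dependencies it uses the standard library's xml.etree.ElementTree, on the
assumption that the same Element/tostring/fromstring calls carry over to
lxml.etree. The element name and the byte string are taken from the report
above; treating those bytes as opaque binary data is an assumption made for
the example.

```python
import base64
import xml.etree.ElementTree as ET

# Raw bytes from the report; they contain ESC (0x1b), which XML 1.0 forbids.
raw = b"\xd0\x94\x1bi\x1b\x1b\x1b?"

# Encode to base64 so that only plain ASCII letters, digits, '+', '/' and '='
# end up in the element text.
el = ET.Element("cow")
el.text = base64.b64encode(raw).decode("ascii")

# The serialised form is well-formed XML and parses back without errors.
serialized = ET.tostring(el)
parsed = ET.fromstring(serialized)

# Decoding the element text recovers the original bytes.
decoded = base64.b64decode(parsed.text)
```

The cost is that the content is no longer readable as text in the document,
so the receiving side has to know that the element carries base64 data.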

This restriction was relaxed in XML 1.1, which simply allows more
characters, including the range "[#x1-#xD7FF]". However, it still carries
this warning:

"""
Document authors are encouraged to avoid "compatibility characters", as
defined in Unicode [Unicode]. The characters defined in the following
ranges are also discouraged. They are either control characters or
permanently undefined Unicode characters:

[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
...
"""

http://www.w3.org/TR/xml11/#charsets

So, even in XML 1.1, it is still considered a bad idea to use these
characters in text content.
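
For anyone who wants to flag such characters before they reach a document,
here is a small sketch. It covers only the ranges quoted above (the list in
the specification continues past the "..."), and the names DISCOURAGED_RANGES
and discouraged_chars are hypothetical.

```python
# Discouraged ranges as quoted from the XML 1.1 specification above.
# The quoted list is truncated, so this table is deliberately partial.
DISCOURAGED_RANGES = [
    (0x1, 0x8),
    (0xB, 0xC),
    (0xE, 0x1F),
    (0x7F, 0x84),
    (0x86, 0x9F),
    (0xFDD0, 0xFDDF),
]


def discouraged_chars(text):
    """Return (index, hex code point) for each discouraged character."""
    hits = []
    for index, char in enumerate(text):
        code = ord(char)
        if any(low <= code <= high for low, high in DISCOURAGED_RANGES):
            hits.append((index, hex(code)))
    return hits
```

Tab, line feed and carriage return fall outside these ranges, so ordinary
whitespace is not reported.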

Stefan


