[lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities?

Stefan Behnel stefan_ml at behnel.de
Thu Feb 19 18:48:05 CET 2009


usernamenumber wrote:
> I am porting a perl/SAX tool to python/lxml. Ideally, given the same
> input, the new tool should produce the same output as the old tool. In
> fact, it introduces a number of problems for me if this is not the case.

It's always bad style to make applications depend on a specific XML
serialisation done by a specific tool. That's exactly what canonical XML
(C14N) was designed for.


> One annoying problem I am encountering is that SAX seems to store unicode
> entity IDs in hex, whereas lxml uses decimal, regardless of what value is
> used in the input:
> 
>  >>> import lxml.etree as etree
>  >>> example_sax_output = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>"  # Note: xA9
>  >>> e = etree.fromstring(example_sax_output)
>  >>> etree.tostring(e)
>  <foo>Copyright &#169; 2009 Foocorp, Inc</foo>  # Note: 169
> 
> Is it possible to avoid this without doing something horribly kludgey
> like going through the output with a regex search and manually
> converting the values to hex?

There isn't a straightforward way to do that. Decimal character references
were chosen for compatibility with ElementTree, which uses the
"xmlcharrefreplace" encoding error handler. However, if you have a bit of
memory to spare and do not care too much about raw performance, you can do
this:

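    # Python 2.6
    from itertools import imap

    # Serialise to a unicode string first, then encode by hand:
    # ASCII passes through, everything else becomes a hex reference.
    unicode_xml = etree.tostring(tree, encoding=unicode)
    bytes_xml = b''.join(chr(c) if c < 0x80 else b'&#x%X;' % c
                         for c in imap(ord, unicode_xml))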
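
On Python 3 the same idea could be expressed as a custom encoding error
handler, mirroring how "xmlcharrefreplace" works. This is only a sketch:
the handler name "xmlcharrefreplace_hex" and the helper to_hex_charrefs()
are made up for illustration and are not part of lxml or the standard
library. It assumes the XML has already been serialised to a str, e.g.
with etree.tostring(tree, encoding="unicode").

```python
import codecs


def hex_charref_replace(error):
    """Replace each unencodable character with a hex character reference."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join("&#x%X;" % ord(ch) for ch in bad)
    # Return the replacement text and the position to resume encoding at.
    return replacement, error.end


# Hypothetical handler name, chosen for this sketch only.
codecs.register_error("xmlcharrefreplace_hex", hex_charref_replace)


def to_hex_charrefs(unicode_xml):
    """Encode serialised XML text as ASCII bytes with hex references."""
    return unicode_xml.encode("ascii", errors="xmlcharrefreplace_hex")


sample = "<foo>Copyright \u00a9 2009 Foocorp, Inc</foo>"
print(to_hex_charrefs(sample))
```

Like the snippet above, this holds the whole document in memory as a
string before encoding it, so the same memory caveat applies.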

There's also a separate serialiser API in libxml2 that happens to output
hexadecimal character references. However, lxml does not use it, for
backward compatibility reasons.

Stefan

