[lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities?
Stefan Behnel
stefan_ml at behnel.de
Thu Feb 19 18:48:05 CET 2009
usernamenumber wrote:
> I am porting a perl/SAX tool to python/lxml. Ideally, given the same
> input, the new tool should produce the same output as the old tool. In
> fact, it introduces a number of problems for me if this is not the case.
It's always bad style to make applications depend on a specific XML
serialisation done by a specific tool. That's exactly what canonical XML
(C14N) was designed for.
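To illustrate the point about C14N, here is a minimal sketch. It assumes Python 3.8 or later and uses the standard library's xml.etree.ElementTree.canonicalize() rather than lxml, which the original post did not use for this. The variable names are made up for the example. It shows that the hex and decimal forms of the same character reference canonicalise to identical output, so a comparison on canonical XML does not depend on how the serialiser wrote the reference.

```python
from xml.etree.ElementTree import canonicalize

# The same document, once with a hex and once with a decimal character reference.
hex_form = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>"
dec_form = "<foo>Copyright &#169; 2009 Foocorp, Inc</foo>"

# canonicalize() returns the C14N 2.0 serialisation as a string
# when no output file is given.
canonical_hex = canonicalize(hex_form)
canonical_dec = canonicalize(dec_form)

# Both inputs end up with the literal character, so the results match.
print(canonical_hex == canonical_dec)
```

With lxml itself, etree.tostring(tree, method="c14n") should serve the same purpose.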
> One annoying problem I am encountering is that SAX seems to store unicode
> entity IDs in hex, whereas lxml uses decimal, regardless of what value is
> used in the input:
>
> >>> import lxml.etree as etree
> >>> example_sax_output = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>" # Note: xA9
> >>> e = etree.fromstring(example_sax_output)
> >>> etree.tostring(e)
> <foo>Copyright &#169; 2009 Foocorp, Inc</foo> # Note: 169
>
> Is it possible to avoid this without doing something horribly kludgey
> like going through the output with a regex search and manually
> converting the values to hex?
There isn't a straightforward way to do that. Decimal character references were
chosen for compatibility with ElementTree, which uses "xmlcharrefreplace".
However, if you have a bit of memory and do not care too much about raw
performance, you can do this:
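# Python 2.6
from itertools import imap

# Serialise to a unicode string first, then build an ASCII byte string,
# replacing every non-ASCII character with a hex character reference.
unicode_xml = etree.tostring(tree, encoding=unicode)
bytes_xml = b''.join(chr(c) if c < 0x80 else b'&#x%X;' % c
                     for c in imap(ord, unicode_xml))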
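The snippet above is Python 2 only. On Python 3, roughly the same idea can be sketched with a custom codec error handler. This is an assumption-laden sketch, not something lxml provides: the handler name "hexcharrefreplace" and the helper to_ascii_with_hex_refs() are invented for this example, and the input is expected to be a str such as the result of etree.tostring(tree, encoding="unicode").

```python
import codecs


def _hex_charref_replace(exc):
    """Replace unencodable characters with hex references like &#xA9;."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#x%X;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume encoding.
    return replacement, exc.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("hexcharrefreplace", _hex_charref_replace)


def to_ascii_with_hex_refs(unicode_xml):
    """Encode a serialised XML string to ASCII bytes using hex references."""
    return unicode_xml.encode("ascii", errors="hexcharrefreplace")


sample = "<foo>Copyright \u00a9 2009 Foocorp, Inc</foo>"

# The built-in handler yields decimal references, the custom one hex.
decimal_bytes = sample.encode("ascii", errors="xmlcharrefreplace")
hex_bytes = to_ascii_with_hex_refs(sample)
print(decimal_bytes)
print(hex_bytes)
```

The first print shows the decimal form that the built-in "xmlcharrefreplace" handler produces, the second the hex form. Like the Python 2 version, this works on the whole serialised string in memory.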
There's also a separate serialiser API in libxml2 that happens to output
hex entities. However, that's not used for backward compatibility reasons.
Stefan