[lxml-dev] HTML character code interpretation
Spencer Crissman
spencer.crissman at gmail.com
Mon Jul 28 22:24:06 CEST 2008
I am using lxml to process some XHTML files. The files have HTML character
codes embedded in them, for instance & #39; rather than a literal '. When I
parse the files, edit them, and then write them back out, I want my edits to
be the only changes in the output files, but lxml also replaces the character
codes with the actual characters they represent.
So if I have:
It& #39;s an example. <-- Space inserted to help readability.
It is writing out:
It's an example.
I've tried setting resolve_entities to False, like so:
tree = etree.parse(input, etree.XMLParser(resolve_entities=False))
But this seems to have no effect.
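A minimal sketch that reproduces the behavior described above, assuming a one-line <p> fragment stands in for the real XHTML files (the sample markup and the variable names are illustrative, not taken from the original files):

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for one of the XHTML files.
source = b"<p>It&#39;s an example.</p>"

# Same parser setup as in the message above.
parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(BytesIO(source), parser)
root = tree.getroot()

# The numeric character reference is already decoded in the parsed tree.
text = root.text
print(text)        # It's an example.

# Serializing writes the plain apostrophe, not the original &#39;.
serialized = etree.tostring(root)
print(serialized)  # b"<p>It's an example.</p>"
```

The character reference is decoded while parsing, so the tree only holds a plain apostrophe by the time it is written out. As far as I can tell, resolve_entities governs entity references such as &name; rather than numeric character references like this one.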
Is there a way to tell lxml to ignore these and leave them as is?
Thanks.
-s