[lxml-dev] HTML character code interpretation
Stefan Behnel
stefan_ml at behnel.de
Tue Jul 29 07:52:22 CEST 2008
Hi,
Spencer Crissman wrote:
> I am using lxml to process some xhtml files. The files have html character
> codes embedded in them. For instance: & #39;s rather than a '. When I
> parse the files, edit them, and then write them back out, I want my edits to
> be the only changes in the output files, but lxml is replacing the character
> codes with the actual characters they are supposed to represent as well.
Just to clarify, does that mean you want the character references and entities
exactly as they were before, or do you want certain characters escaped, or ...?
> So if I have:
> It& #39;s an example. <-- Space inserted to help readability.
>
> It is writing out:
> It's an example.
>
> I've tried setting resolve_entities to false, ala:
> tree = etree.parse(input, etree.XMLParser(resolve_entities=False))
The best place to find out what that option does is the libxml2 parser source
code, but IIRC it only relates to entities defined in DTDs, not to character
references.
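To illustrate the distinction, here is a minimal sketch. The sample document and the entity name "who" are made up for the example. As far as I can tell, the numeric character reference is decoded by the parser regardless of the option, while the DTD-defined entity is kept as a reference:

```python
from lxml import etree

# Hypothetical input: one numeric character reference (&#39;) and one
# entity declared in an internal DTD subset (&who;).
DOC = b"""<!DOCTYPE p [<!ENTITY who "an example">]>
<p>It&#39;s &who;.</p>"""

parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(DOC, parser)

# The character reference has already been turned into a plain apostrophe.
text_before_entity = root.text

# Serialising shows the apostrophe as a literal character, while the
# DTD entity is written back out as a reference.
out = etree.tostring(root)
print(out)
```

So resolve_entities=False is only of help for named entities such as &who;, not for references like & #39;.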
> There a way to tell lxml to ignore these/leave them as is?
No. Maybe you could give some more background on what you are trying to achieve.
There may still be ways to do what you want.
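One possible workaround, sketched here as an assumption rather than as anything lxml offers itself, is to hide the character references from the parser and put them back after serialising. The helper names protect() and restore() are hypothetical. Note that this also rewrites text that literally contained "&amp;#39;" in the input, and it touches comments and CDATA sections as well, so it only suits documents where that cannot happen:

```python
import re
from lxml import etree

# Matches decimal and hexadecimal character references such as &#39; or &#x27;.
CHAR_REF = re.compile(rb"&(#(?:[0-9]+|x[0-9A-Fa-f]+);)")
ESCAPED_CHAR_REF = re.compile(rb"&amp;(#(?:[0-9]+|x[0-9A-Fa-f]+);)")


def protect(data: bytes) -> bytes:
    """Escape the leading '&' so the parser sees the reference as plain text."""
    return CHAR_REF.sub(rb"&amp;\1", data)


def restore(data: bytes) -> bytes:
    """Undo protect() on the serialised output."""
    return ESCAPED_CHAR_REF.sub(rb"&\1", data)


source = b"<p>It&#39;s an example.</p>"

root = etree.fromstring(protect(source))
root.set("class", "edited")  # stand-in for the real edits

result = restore(etree.tostring(root))
print(result)
```

While the tree is in memory, the text reads "It&#39;s an example." with the reference spelled out, so any edits that look at the text need to take that into account.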
Stefan