[lxml-dev] HTML character code interpretation

Spencer Crissman spencer.crissman at gmail.com
Tue Jul 29 14:47:29 CEST 2008


Sorry, I was trying to be as concise as possible.  Here is some more info on
what I am trying to do.

We use the files as templates for our website.  They have some markers in
them to insert data that get processed when the files get served up.  I need
to add some new tags to a number of these pages, enough of them that I don't
wish to do so by hand.  I was hoping to use lxml to read in the files, add
the markers, and write out the files.  The output would need to be exactly
like the input except for the tags and/or attributes that I specifically add
to the element tree.

Most of this works, except for a few things:
1) The whitespace gets mangled a bit.  I lose some newlines, and a couple
get added.  This doesn't matter much, and I could live with it.
2) lxml is adding a meta tag to the output's head section.  This
also doesn't matter so much.
3) All the HTML character codes are getting replaced by the actual
characters.
4) We have some custom tags that are self closing, and when they get written
out, they are getting written as open and close tag pairs rather than a
self-closing element.
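
Issue 4 can be reproduced in a few lines.  This is only a sketch: it assumes
the files are being written through an HTML-style serializer, which may not
be the actual cause here.  It uses the standard library's ElementTree so that
it runs anywhere (the assumption being that lxml.etree.tostring accepts the
same arguments), and <marker> is a hypothetical stand-in for one of our
custom tags.

```python
import xml.etree.ElementTree as ET  # assumption: lxml.etree.tostring takes the same arguments

# <marker> is a hypothetical stand-in for one of the custom self-closing tags.
root = ET.fromstring('<div><marker name="x"/></div>')

# HTML-style output writes an explicit end tag for every element that is
# not a known HTML void element, so the custom tag becomes a tag pair.
as_html = ET.tostring(root, encoding="unicode", method="html")

# XML-style output keeps an empty element as a single self-closing tag.
as_xml = ET.tostring(root, encoding="unicode", method="xml")

print(as_html)
print(as_xml)
```

If that assumption holds, writing the files with the XML method might be
enough to keep the custom tags self-closing.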

So if anyone has a suggestion on how to address issues 3 and/or 4, either
with lxml or a different parser/x(ht)ml lib, I would appreciate any
pointers.  I was hoping to avoid a full-blown parser, but if I can't get a
little closer with something pre-built, I may have to go that way.
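
For issue 3, one workaround that might be worth trying is to hide the
references from the parser before parsing and put them back after
serializing.  This is a sketch rather than anything lxml provides: protect(),
restore() and the "[[ref:...]]" marker are made-up names, and the marker must
never occur in the real templates.  It uses the standard library's
ElementTree so that it runs anywhere, on the assumption that lxml.etree
offers the same fromstring/tostring calls.

```python
import re
import xml.etree.ElementTree as ET  # assumption: lxml.etree has the same fromstring/tostring calls

# Matches numeric (&#39; or &#x27;) and named (&nbsp;) references.
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Hypothetical placeholder markers; they must not appear in the real files.
OPEN, CLOSE = "[[ref:", "]]"
PLACEHOLDER = re.compile(re.escape(OPEN) + r"(.+?)" + re.escape(CLOSE))


def protect(markup):
    """Replace every reference with a placeholder the parser treats as plain text."""
    return REFERENCE.sub(lambda m: OPEN + m.group(1) + CLOSE, markup)


def restore(markup):
    """Turn the placeholders back into the original references."""
    return PLACEHOLDER.sub(lambda m: "&" + m.group(1) + ";", markup)


source = "<p>It&#39;s an example.</p>"

# Plain round trip: the reference is replaced by the character itself.
plain = ET.tostring(ET.fromstring(source), encoding="unicode")

# Protected round trip: parse, make the real edit, serialize, restore.
root = ET.fromstring(protect(source))
root.set("class", "marked")
result = restore(ET.tostring(root, encoding="unicode"))

print(plain)
print(result)
```

The parser never sees the references, so it has nothing to resolve.  The cost
is that text read from the tree in between contains the placeholders rather
than the characters.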

Thanks for the response,

Spencer


On Tue, Jul 29, 2008 at 1:52 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:

> Hi,
>
> Spencer Crissman wrote:
> > I am using lxml to process some xhtml files.  The files have html character
> > codes embedded in them.  For instance: & #39;s rather than a '.  When I
> > parse the files, edit them, and then write them back out, I want my edits to
> > be the only changes in the output files, but lxml is replacing the character
> > codes with the actual characters they are supposed to represent as well.
>
> Just to clarify, does that mean you want the character references and
> entities exactly as they were before, or do you want certain characters
> escaped, or ...
>
>
> > So if I have:
> > It& #39;s an example. <-- Space inserted to help readability.
> >
> > It is writing out:
> > It's an example.
> >
> > I've tried setting resolve_entities to False, a la:
> > tree = etree.parse(input, etree.XMLParser(resolve_entities=False))
>
> The best place to find that out is the libxml2 parser source code, but IIRC
> this option only relates to entities defined in DTDs, not to character
> references.
>
>
> > Is there a way to tell lxml to ignore these/leave them as is?
>
> No. Maybe you could give some more background on what you are trying to achieve.
> There may still be ways to do what you want.
>
> Stefan
>