[lxml-dev] HTML character code interpretation

Stefan Behnel stefan_ml at behnel.de
Tue Jul 29 19:15:38 CEST 2008


Hi,

Spencer Crissman wrote:
> We use the files as templates for our website. They have some markers in
> them to insert data that get processed when the files get served up. I need
> to add some new tags to a number of these pages, enough of them that I don't
> wish to do so by hand.  I was hoping to use lxml to read in the files, add
> the markers, and write out the files.  The output would need to be exactly
> like the input except for the tags and/or attributes that I specifically add
> to the element tree.

Here I have to assume that your templates are HTML based (as opposed to XML or
XHTML) and that you use the HTML parser to parse them. The HTML parser has
built-in knowledge of the HTML structure, and can therefore work differently
from a normal XML parser. Also, the HTML parser parses with error recovery
enabled by default (recover=True), so it will try to fix up the structure if
it finds any problems.


> Most of this works, except for a few things:
> 1) The whitespace gets mangled a bit.  I lose some newlines, and a couple
> get added.  This doesn't matter much, and I could live with it.

Could you provide the code that you use for parsing and serialising?


> 2) lxml is adding a meta attribute to the output's header section.  This
> also doesn't matter so much.

AFAIR, the HTML serialiser does that, not the parser. It's easy to remove from
the serialised string. We do that somewhere near the end (?) of
lxml/html/__init__.py.
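
A minimal sketch of that kind of string clean-up, assuming the serialiser
emits a <meta http-equiv="Content-Type" ...> tag. The helper name and the
sample document are made up for illustration, and the pattern is my guess at
the tag's shape, not a copy of what lxml does internally:

```python
import re

# Assumption: the unwanted tag starts with <meta http-equiv="Content-Type";
# everything up to the closing '>' is treated as part of it.
_META_CONTENT_TYPE = re.compile(
    r'<meta\s+http-equiv="Content-Type"[^>]*>', re.IGNORECASE
)

def strip_meta_content_type(html_text):
    """Remove Content-Type <meta> tags from an already serialised document."""
    return _META_CONTENT_TYPE.sub("", html_text)

# Hypothetical serialiser output, used only to exercise the helper.
sample = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    '<title>t</title></head><body></body></html>'
)
cleaned = strip_meta_content_type(sample)
```

Note that this also removes a Content-Type meta tag that was already present
in the template, so it only fits if your templates do not carry one.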


> 3) All the HTML character codes are getting replaced by the actual
> characters.

I don't think there is a way to prevent that. I'm having a hard time
understanding why the character references are so important to you. The
example you showed was escaping a plain ASCII character, and no browser on
earth should have a problem with that.

If it's a fixed set of character references that must be escaped (for whatever
reason), you can serialise the tree into a unicode string (encoding=unicode),
replace them with the respective charref sequence and then encode the string
into the target encoding.
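
Roughly like this, as a sketch only. The character set and the function name
are placeholders, and the input string stands in for whatever the serialiser
returns when asked for unicode output (on Python 3, lxml accepts
encoding="unicode" for that):

```python
# Hypothetical fixed set of characters that must stay escaped in the output.
CHARS_TO_ESCAPE = "\u00a9\u00e9"  # copyright sign, e with acute accent

def escape_as_charrefs(text, chars=CHARS_TO_ESCAPE):
    """Replace each listed character with its decimal character reference."""
    table = {ord(ch): "&#%d;" % ord(ch) for ch in chars}
    return text.translate(table)

# Stand-in for the unicode string produced by serialising the tree.
serialised = "<p>Caf\u00e9 \u00a9 2008</p>"

escaped = escape_as_charrefs(serialised)
# Encode only after the replacement; ASCII works here because every
# non-ASCII character in the sample is in the escape set.
data = escaped.encode("ascii")
```

If you need named references instead of numeric ones, the same table can map
each character to the entity string you want.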


> 4) We have some custom tags that are self closing, and when they get written
> out, they are getting written as open and close tag pairs rather than a
> self-closing element.

Hmm, yes, that's a problem. You can't extend the serialiser with additional
knowledge about special tags. The only special serialiser that libxml2
supports is the one for HTML. But since the self-closing tags are necessarily
empty, you can always post-process the serialised string and replace all
"></specialtag>" substrings with "/>" or ">" before you write it back out.
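
Something along these lines, where the tag names and the helper are invented
for the example. It relies on the custom tags always being empty, since a
plain substring replace would also hit a "></specialtag>" that follows nested
content:

```python
# Hypothetical names of the custom template tags that must stay self-closing.
SPECIAL_TAGS = ("specialtag", "othertag")

def collapse_empty_tags(html_text, tags=SPECIAL_TAGS, closer="/>"):
    """Replace every '></tag>' with the closer for the given tag names."""
    for tag in tags:
        html_text = html_text.replace("></%s>" % tag, closer)
    return html_text

# Stand-in for the string produced by the HTML serialiser.
serialised = '<p>Hi <specialtag name="x"></specialtag> there<othertag></othertag></p>'

fixed = collapse_empty_tags(serialised)
# Pass closer=">" if the templates expect a bare opening tag instead.
bare = collapse_empty_tags(serialised, closer=">")
```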

Stefan

