[lxml-dev] Tag name validation and HTML

Stefan Behnel stefan_ml at behnel.de
Thu Sep 27 15:50:48 CEST 2007


James Graham wrote:
> Is there a recommended way of creating 
> a custom tag type, preferably using the same code for ElementTree and 
> lxml.etree?

Both lxml.etree and ElementTree have support for (something like) this, but
not in the same way.

In ET, you can pass an "element_factory" argument to the TreeBuilder.

http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.TreeBuilder-class
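
For illustration, a minimal sketch of the ET side, written against the standard library's xml.etree.ElementTree rather than the effbot release. CustomElement and its is_custom marker are hypothetical names, not part of ET:

```python
import xml.etree.ElementTree as ET


class CustomElement(ET.Element):
    """Hypothetical Element subclass; the marker attribute is only for the demo."""
    is_custom = True


# The TreeBuilder calls element_factory(tag, attrib) for every element it builds.
builder = ET.TreeBuilder(element_factory=CustomElement)
parser = ET.XMLParser(target=builder)

parser.feed("<root><child/></root>")
root = parser.close()  # returns whatever the target's close() returns, here the root
```

Every element in the resulting tree is created through the factory, so the root and the child are both CustomElement instances.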

In lxml.etree, you can define an Element-Lookup for a parser.

http://codespeak.net/lxml/element_classes.html

As both approaches work at the parser level, it should be possible (though not
too easy) to write some glue code that sets up a parser for either library,
and then use the parser in the rest of the code without modification.
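
A rough sketch of what such glue code might look like, assuming you are happy to pick lxml when it is importable and fall back to the standard library's ElementTree otherwise. The function make_custom_parser and the class name CustomElement are made up for this example:

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as LET
except ImportError:  # lxml is optional in this sketch
    LET = None


def make_custom_parser():
    """Return (module, parser, element_class) for whichever library is available."""
    if LET is not None:
        # lxml: custom classes derive from ElementBase and are chosen by a lookup.
        class CustomElement(LET.ElementBase):
            is_custom = True

        parser = LET.XMLParser()
        lookup = LET.ElementDefaultClassLookup(element=CustomElement)
        parser.set_element_class_lookup(lookup)
        return LET, parser, CustomElement

    # ElementTree: custom classes are created by the TreeBuilder's factory.
    class CustomElement(ET.Element):
        is_custom = True

    parser = ET.XMLParser(target=ET.TreeBuilder(element_factory=CustomElement))
    return ET, parser, CustomElement


# The rest of the code only sees the module and the parser.
etree_module, parser, CustomElement = make_custom_parser()
root = etree_module.fromstring("<root><child/></root>", parser)
```

The element classes themselves still have to be defined once per library, since lxml requires ElementBase subclasses while ET expects something that behaves like its own Element.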

Note that in lxml.etree, the decision about which element class to use for a
given node is not taken inside the parser, but at element access time. Hence
the different approaches (and the extensive support in lxml).


> In particular html5lib needs to create a notional document root
> element whilst parsing.

This is a pretty specific problem. You can solve it in lxml.etree in two ways.
If the root node has a specific name, you can use the CustomElementClassLookup
scheme (so this won't work if you can't control the name of the root node).

http://codespeak.net/lxml/element_classes.html#custom-element-class-lookup
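
Something along these lines, as a sketch only. The tag name "document-root" and the class names DocumentRoot and RootByNameLookup are assumptions for the example, not anything html5lib or lxml defines:

```python
from lxml import etree


class DocumentRoot(etree.ElementBase):
    """Hypothetical class for the notional document root."""
    is_document_root = True


class RootByNameLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # node_type is one of "element", "comment", "PI" or "entity".
        if node_type == "element" and namespace is None and name == "document-root":
            return DocumentRoot
        return None  # None hands the decision on to the fallback lookup


parser = etree.XMLParser()
parser.set_element_class_lookup(RootByNameLookup())

root = etree.fromstring("<document-root><html/></document-root>", parser)
```

Only the element with the chosen name gets the special class; its children come back as ordinary lxml elements.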

If the only way to decide on the class is to check for a parent, you can
use the tree-based lookup and check whether "getparent()" returns None.

http://codespeak.net/lxml/element_classes.html#tree-based-element-class-lookup-in-python
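
A minimal sketch of the parent check, assuming the root is simply the one element without a parent. DocumentRoot and RootByPositionLookup are hypothetical names:

```python
from lxml import etree


class DocumentRoot(etree.ElementBase):
    """Hypothetical class for whichever element ends up as the root."""
    is_document_root = True


class RootByPositionLookup(etree.PythonElementClassLookup):
    def lookup(self, document, element):
        # 'element' is a read-only proxy of the node being looked up.
        if element.getparent() is None:
            return DocumentRoot
        return None  # let the fallback lookup pick the default class


parser = etree.XMLParser()
parser.set_element_class_lookup(RootByPositionLookup())

root = etree.fromstring("<html><body/></html>", parser)
```

This does not depend on the root's tag name at all, at the cost of running Python code for each element lookup.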

I don't think ET can make this decision at all from the element_factory above,
but then, you can always replace the root Element /after/ parsing, so I don't
think you would even need that machinery here.
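
Replacing the root afterwards could look roughly like this in ET (standard-library xml.etree.ElementTree). The DocumentRoot class and the replace_root helper are invented for the example:

```python
import xml.etree.ElementTree as ET


class DocumentRoot(ET.Element):
    """Hypothetical root class; behaves like a normal Element otherwise."""
    is_document_root = True


def replace_root(tree):
    """Swap the tree's root for a DocumentRoot carrying the same content."""
    old = tree.getroot()
    new = DocumentRoot(old.tag, dict(old.attrib))
    new.text = old.text
    new.tail = old.tail
    new.extend(list(old))  # ET elements have no parent pointer, so this is safe
    tree._setroot(new)     # documented ElementTree method, despite the underscore
    return new


tree = ET.ElementTree(ET.fromstring("<html lang='en'><body/></html>"))
root = replace_root(tree)
```

The children are reused as they are; only the outermost element object changes.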


> So far, we have been using an ordinary Element with a
> .tag that cannot be produced by parsing any input e.g. 
> root.tag="<DOCUMENT_ROOT>" but this doesn't feel very elegant.

Hmmm, but this changes the document, right? Could you explain a little what
that node is supposed to do differently from normal nodes? In particular, why
can't a tree wrapper do what you want?

Stefan

