[lxml-dev] Tag name validation and HTML
Stefan Behnel
stefan_ml at behnel.de
Thu Sep 27 15:50:48 CEST 2007
James Graham wrote:
> Is there a recommended way of creating
> a custom tag type, preferably using the same code for ElementTree and
> lxml.etree?
Both lxml.etree and ElementTree have support for (something like) this, but
not in the same way.
In ET, you can pass an "element_factory" argument to the TreeBuilder.
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.TreeBuilder-class
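A minimal sketch of the ET side, assuming the standard-library xml.etree.ElementTree and a hypothetical MyElement subclass (neither the class name nor its describe() method comes from the original discussion):

```python
import xml.etree.ElementTree as ET


class MyElement(ET.Element):
    """Hypothetical custom element type, used for illustration only."""

    def describe(self):
        return "<%s> with %d children" % (self.tag, len(self))


# The TreeBuilder calls element_factory(tag, attrib) for every element it builds.
builder = ET.TreeBuilder(element_factory=MyElement)
parser = ET.XMLParser(target=builder)

root = ET.fromstring("<root><a/><b/></root>", parser=parser)
print(type(root).__name__)  # MyElement
print(root.describe())      # <root> with 2 children
```

Every element in the resulting tree, not only the root, is created through the factory.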
In lxml.etree, you can define an Element-Lookup for a parser.
http://codespeak.net/lxml/element_classes.html
As both approaches work at the parser level, it should be possible (though not
too easy) to write some glue code that sets up a parser for either library,
and then use the parser in the rest of the code without modification.
Note that in lxml.etree, the decision about which element class to use for a
given node is not taken inside the parser, but at element access time. That is
why the two approaches differ (and why lxml has such extensive support for it).
> In particular html5lib needs to create a notional document root
> element whilst parsing.
This is a pretty specific problem. You can solve it in lxml.etree in two ways.
If the root node has a specific name, you can use the CustomElementClassLookup
scheme (so this won't work if you can't control the name of the root node).
http://codespeak.net/lxml/element_classes.html#custom-element-class-lookup
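A rough sketch of the name-based scheme, assuming the root element is known to be called "document-root" (a made-up name) and using a hypothetical DocumentRoot class:

```python
from lxml import etree


class DocumentRoot(etree.ElementBase):
    """Hypothetical element class for the notional document root."""
    is_document_root = True


class RootLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # "document-root" is an assumed tag name, chosen for this example.
        if node_type == "element" and name == "document-root":
            return DocumentRoot
        return None  # pass on to the fallback (default) lookup


parser = etree.XMLParser()
parser.set_element_class_lookup(RootLookup())

root = etree.fromstring("<document-root><p/></document-root>", parser)
print(isinstance(root, DocumentRoot))     # True
print(isinstance(root[0], DocumentRoot))  # False
```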
If the only way to decide about the class is to check for a parent, you can
use the tree based lookup and check "getparent()" for None.
http://codespeak.net/lxml/element_classes.html#tree-based-element-class-lookup-in-python
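The parent check could look roughly like this. It is a sketch only: DocumentRoot and ParentlessLookup are hypothetical names, and the element passed to lookup() is a read-only proxy, so the method should inspect the tree without modifying it:

```python
from lxml import etree


class DocumentRoot(etree.ElementBase):
    """Hypothetical element class for the parentless root node."""
    is_document_root = True


class ParentlessLookup(etree.PythonElementClassLookup):
    def lookup(self, document, element):
        # Only consider real elements (comments and PIs have a non-string tag).
        if isinstance(element.tag, str) and element.getparent() is None:
            return DocumentRoot
        return None  # pass on to the fallback (default) lookup


parser = etree.XMLParser()
parser.set_element_class_lookup(ParentlessLookup())

root = etree.fromstring("<anything><child/></anything>", parser)
print(isinstance(root, DocumentRoot))     # True
print(isinstance(root[0], DocumentRoot))  # False
```

Unlike the name-based lookup, this does not depend on what the root node is called.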
I don't think ET can take this decision at all from the element_factory above,
but then, you can always replace the root Element /after/ parsing, so I don't
think you would even need that machinery here.
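For illustration, replacing the root after parsing might be done along these lines in ET. The DocumentRoot class and the replace_root() helper are hypothetical names, not part of ElementTree:

```python
import xml.etree.ElementTree as ET


class DocumentRoot(ET.Element):
    """Hypothetical element class for the document root."""
    is_document_root = True


def replace_root(parsed_root):
    """Return a DocumentRoot that carries over tag, attributes, text and children."""
    new_root = DocumentRoot(parsed_root.tag, dict(parsed_root.attrib))
    new_root.text = parsed_root.text
    new_root.tail = parsed_root.tail
    # ET elements keep no parent pointers, so the children can simply be reused.
    new_root.extend(list(parsed_root))
    return new_root


root = replace_root(ET.fromstring("<html lang='en'><body/></html>"))
print(type(root).__name__, root.tag, len(root))  # DocumentRoot html 1
```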
> So far, we have been using an ordinary Element with a
> .tag that cannot be produced by parsing any input e.g.
> root.tag="<DOCUMENT_ROOT>" but this doesn't feel very elegant.
Hmmm, but this changes the document, right? Could you explain a little what
that node is supposed to do differently from normal nodes? In particular, why
can't a tree wrapper do what you want?
Stefan