[lxml-dev] Tag name validation and HTML
James Graham
jg307 at cam.ac.uk
Thu Sep 27 14:23:47 CEST 2007
The development branch of lxml 2 appears to restrict the characters that may
appear in a tag name. Whilst this may be appropriate for XML, it does not match
the behavior of all common HTML UAs and, as such, does not match the current
draft of the HTML 5 spec [1]. This is an issue for html5lib [2] as we are keen
to keep support for building lxml trees from HTML input, something which is
currently possible with lxml 1.3.
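
To make the mismatch concrete, here is a rough sketch using only the standard library. The regular expression is an ASCII-only approximation of the XML Name production (an assumption; the real production also admits many non-ASCII characters), and "a<b" is a hypothetical example of a tag name that an HTML tokenizer might emit. The standard library's ElementTree accepts all three names when the element is constructed.

```python
import re
import xml.etree.ElementTree as ET

# ASCII-only approximation of the XML "Name" production. This is an
# assumption for illustration, not the full definition from the XML spec.
XML_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")


def looks_like_xml_name(tag):
    """Return True if tag matches the ASCII approximation above."""
    return XML_NAME_ASCII.match(tag) is not None


# "a<b" is a hypothetical tag name; "<DOCUMENT_ROOT>" is the sentinel
# mentioned later in this message.
candidates = ["div", "<DOCUMENT_ROOT>", "a<b"]
results = {tag: looks_like_xml_name(tag) for tag in candidates}

# xml.etree.ElementTree does not validate tag names at construction time,
# so every candidate can still be used to build an element.
elements = [ET.Element(tag) for tag in candidates]
```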
On an only tangentially related note, is there a recommended way of creating
a custom tag type, preferably using the same code for ElementTree and
lxml.etree? In particular, html5lib needs to create a notional document root
element whilst parsing. So far, we have been using an ordinary Element with a
.tag that cannot be produced by parsing any input, e.g.
root.tag = "<DOCUMENT_ROOT>", but this doesn't feel very elegant.
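
For reference, the current workaround looks roughly like the sketch below. The helper names (make_document_root, is_document_root) and the etree_module parameter are hypothetical, chosen only for illustration, and are not html5lib's actual API. The example runs against the standard library's ElementTree; passing lxml.etree instead would presumably hit the tag name restriction described above.

```python
import xml.etree.ElementTree as ET

# Sentinel tag that no parsed input can produce (taken from this message).
DOCUMENT_ROOT_TAG = "<DOCUMENT_ROOT>"


def make_document_root(etree_module=ET):
    """Create the notional root with whichever ElementTree-like module is given.

    Hypothetical helper: any module exposing an Element factory could be passed.
    """
    return etree_module.Element(DOCUMENT_ROOT_TAG)


def is_document_root(element):
    """Tell the notional root apart from ordinary parsed elements."""
    return element.tag == DOCUMENT_ROOT_TAG


root = make_document_root()
html = ET.SubElement(root, "html")  # a real element hangs off the notional root
```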
[1] http://www.whatwg.org/specs/web-apps/current-work/#tag-name0
[2] http://code.google.com/p/html5lib/
--
"Eternity's a terrible thought. I mean, where's it all going to end?"
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead