[lxml-dev] Tag name validation and HTML

James Graham jg307 at cam.ac.uk
Thu Sep 27 14:23:47 CEST 2007


The development branch of lxml 2 appears to restrict the characters that may 
appear in a tag name. Whilst this may be appropriate for XML, it matches 
neither the behavior of the common HTML UAs nor, as a consequence, the current 
draft of the HTML 5 spec [1]. This is an issue for html5lib [2], as we are keen 
to keep support for building lxml trees from HTML input, something which is 
currently possible with lxml 1.3.
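
To make the mismatch concrete, here is a minimal sketch that asks a tree 
implementation whether it will build an element with a given tag name. The 
helper accepts_tag and the sample name a"b are illustrative only and not part 
of html5lib or lxml; the sketch assumes that an HTML tokenizer can yield such a 
name from input like <a"b>, and that an implementation which validates tag 
names signals rejection with ValueError. It runs against the standard library 
ElementTree and also tries lxml.etree if that happens to be installed.

```python
import xml.etree.ElementTree as ET


def accepts_tag(element_factory, tag):
    """Return True if element_factory(tag) builds an element without complaint.

    Assumption: an implementation that validates tag names raises ValueError.
    """
    try:
        element_factory(tag)
    except ValueError:
        return False
    return True


# Hypothetical sample: not a valid XML name, but assumed to be reachable
# from HTML input such as <a"b>.
html_only_tag = 'a"b'

# The standard library ElementTree performs no tag name validation.
print("ElementTree accepts it:", accepts_tag(ET.Element, html_only_tag))

# lxml is optional here; the result depends on the installed version.
try:
    from lxml import etree as lxml_etree
except ImportError:
    lxml_etree = None

if lxml_etree is not None:
    print("lxml.etree accepts it:", accepts_tag(lxml_etree.Element, html_only_tag))
```

The same check passes for an ordinary name such as "div" in both 
implementations, so only names outside the XML name production are affected.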

On an only tangentially related note, is there a recommended way of creating 
a custom tag type, preferably using the same code for ElementTree and 
lxml.etree? In particular, html5lib needs to create a notional document root 
element whilst parsing. So far, we have been using an ordinary Element with a 
.tag that cannot be produced by parsing any input, e.g. 
root.tag = "<DOCUMENT_ROOT>", but this doesn't feel very elegant.
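
For reference, the sentinel-tag approach described above can be sketched 
roughly as follows. The names DOCUMENT_ROOT_TAG, make_document_root and 
is_document_root are hypothetical and are not html5lib's actual API; the 
example uses the standard library ElementTree, and passing lxml.etree.Element 
as the factory would only work where the sentinel tag name is accepted.

```python
import xml.etree.ElementTree as ET

# Sentinel tag chosen so that no parsed input can produce it.
DOCUMENT_ROOT_TAG = "<DOCUMENT_ROOT>"


def make_document_root(element_factory=ET.Element):
    """Create the notional document root using the given Element factory."""
    return element_factory(DOCUMENT_ROOT_TAG)


def is_document_root(element):
    """Identify the notional root by its sentinel tag."""
    return element.tag == DOCUMENT_ROOT_TAG


# The parsed <html> element hangs off the notional root.
root = make_document_root()
html = ET.SubElement(root, "html")
print(is_document_root(root), is_document_root(html))
```

Taking the Element factory as a parameter keeps the tree-building code 
identical for both libraries; only the sentinel tag itself depends on what the 
implementation is willing to accept.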

[1] http://www.whatwg.org/specs/web-apps/current-work/#tag-name0
[2] http://code.google.com/p/html5lib/

-- 
"Eternity's a terrible thought. I mean, where's it all going to end?"
  -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

