[lxml-dev] Tag name validation and HTML

Stefan Behnel stefan_ml at behnel.de
Sat Oct 6 19:28:47 CEST 2007


Hi,

James Graham wrote:
> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.

Ok, so I understand that HTML tags must be treated differently from XML tags.


> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
> 
> import lxml.html
> lxml.html.Element("o:p")
> 
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks.

This absolutely makes sense to me. I'll have to look into the details of an
implementation, though: tag name validation currently happens inside
lxml.etree.Element, which the Python-implemented lxml.html simply reuses. So
we'd have to provide some kind of Python-level API to relax it.
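
To illustrate the idea, here is a rough sketch of one check with a strict XML
mode and a liberal HTML mode. It is not how lxml does or will do it: the
function name check_tag_name, the html flag and both regular expressions are
made up for this example, and the XML pattern is a simplified ASCII-only
stand-in for the real XML name rules.

```python
import re

# Simplified, ASCII-only approximation of an XML name without a colon.
# (Assumption for illustration; the real XML rules allow many more characters.)
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

# Hypothetical liberal HTML rule: only reject characters that could not
# survive inside a tag name in serialised markup.
_HTML_FORBIDDEN = re.compile(r"[\s<>/=\x00]")


def check_tag_name(name, html=False):
    """Return the name unchanged if it is acceptable, else raise ValueError."""
    if html:
        # HTML mode: anything non-empty without forbidden characters passes,
        # so colon-containing names such as "o:p" are accepted.
        ok = bool(name) and _HTML_FORBIDDEN.search(name) is None
    else:
        # XML mode: the whole name must match the strict pattern.
        ok = _XML_NAME.match(name) is not None
    if not ok:
        raise ValueError("Invalid tag name %r" % name)
    return name
```

With something like this, check_tag_name("o:p", html=True) would pass while
check_tag_name("o:p") raises ValueError, so an HTML-specific Element factory
could call the liberal variant and leave the XML behaviour untouched.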

Stefan

