[lxml-dev] Tag name validation and HTML

James Graham jg307 at cam.ac.uk
Tue Oct 2 22:33:06 CEST 2007


Stefan Behnel wrote:
> James Graham wrote:
>> The development branch of lxml 2 appears to restrict the characters that may 
>> appear in a tag name. Whilst this may be appropriate for XML, it does not match 
>> the behavior of all common HTML UAs and, as such, does not match the current 
>> draft of the HTML 5 spec [1].
> 
> This is actually not as simple as it might seem. The Element factory cannot
> distinguish between XML and HTML tags, so it cannot switch off validation for
> a particular tag. So the conservative solution would be to actually follow the
> HTML5 spec, as it is a superset of the XML spec, an extremely broad one even.
> But then there's not much left that you could honestly call validation. Also,
> I would still want to restrict ":" in tag names, as this has been a source of
> problems way too often. So that would just leave spaces and any of ":/>" as
> invalid characters in tag names.

The colon restriction is difficult because HTML UAs are expected to 
handle ":" in tag names, and there is content in the wild that depends 
on this being accepted; MS Office, for example, produces "HTML" 
containing tags like <o:p>. Since I, and I guess others too, want to 
use lxml to process arbitrary content that may have colons in its tag 
names, a hard failure in this case is a problem. To make matters worse, 
it is possible that the HTML spec will change in the future to 
introduce some sort of namespacing feature, which may or may not use 
colons.
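
As a rough illustration of how a tolerant HTML tokenizer treats such 
markup, here is a minimal sketch using Python's standard-library 
html.parser (not lxml, and not a browser). The TagCollector class and 
the sample string are made up for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start-tag name the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# Made-up sample in the style of MS Office output.
collector = TagCollector()
collector.feed('<p class="MsoNormal">Some text<o:p></o:p></p>')
collector.close()

# The colon is kept as part of the tag name rather than being rejected.
print(collector.start_tags)
```

A parser in this style simply reports "o:p" as the tag name, which is 
the kind of behaviour that content in the wild relies on.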

Given all of this, I would prefer it if it were possible to have an 
HTML-specific mode with much more liberal rules than the XML mode. This 
could then be adapted to support any namespacing features HTML grows in 
the future. For example, one could do something like

import lxml.html
lxml.html.Element("o:p")

where lxml.html.Element would behave just like lxml.etree.Element but 
without the XML-specific validity checks. I guess there might be serious 
practical difficulties with that exact solution, but I think the general 
idea of being able to flag an element as following either HTML rules or 
XML rules would be more user-friendly than having a single set of rules 
that matches neither the XML nor the HTML model correctly.
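
To make the two-mode idea concrete, here is a minimal sketch of a 
validator with an HTML flag. Everything in it is an assumption for 
illustration: is_valid_tag is a hypothetical name, not an lxml function, 
the XML pattern is an ASCII-only approximation of a colon-free XML name, 
and the HTML rule rejects only whitespace, "/" and ">".

```python
import re

# ASCII-only approximation of an XML name without a colon (assumption:
# the real XML production also allows many non-ASCII characters).
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")

# Liberal HTML rule (assumption): only whitespace, "/" and ">" are rejected.
_HTML_FORBIDDEN = re.compile(r"[\s/>]")


def is_valid_tag(name, html=False):
    """Return True if `name` is acceptable under the selected rule set."""
    if not name:
        return False
    if html:
        # HTML mode: colons and most other characters are allowed.
        return _HTML_FORBIDDEN.search(name) is None
    # XML mode: strict name check, colons rejected.
    return _XML_NAME.match(name) is not None


print(is_valid_tag("o:p"))             # strict XML mode
print(is_valid_tag("o:p", html=True))  # liberal HTML mode
```

The point of the flag is that the caller, not the validator, decides 
which model applies, so neither rule set has to be weakened to 
accommodate the other.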

-- 
"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conor Oberst

