[lxml-dev] Re: Tag name validation and HTML

jholg at gmx.de jholg at gmx.de
Fri Oct 5 14:00:41 CEST 2007


Hi,

> The : thing is difficult because HTML UAs are expected to deal with : in 
> the tag name and there is content in the wild that depends on this being 
> accepted; MS Office produces "HTML" containing tags like <o:p>, for 
> example. Since I, and I guess others too, want to use lxml to process 
> random content that may have colons in the tag names, hard failure for 
> this case is a problem. To make matters worse it is possible that the 
> HTML spec will change in the future to introduce some sort of 
> namespacing feature which may or may not use colons.

You'd get errors when parsing such stuff with the XML parser:

>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5

but not with the HTML parser:

>>> etree.HTML
<built-in function HTML>
>>> etree.HTML("""<o:p>foo</o:p>""")
<Element html at 2c8030>
>>>

So the parsers already distinguish between HTML and XML, but the API does not, e.g. when creating elements.
For my use case, I must *rely* on producing valid XML through the API, so making things more liberal could break my system: I need to pickle (i.e. serialize) tree content and reparse it somewhere else. If the API lets me produce invalid XML, some data receiver will choke on my data.
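
To make the round-trip concern concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for a "liberal" element API, since that module does not validate tag names on creation; it is not lxml, and the helper name receiver_accepts is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for a liberal API: the stdlib accepts any tag name on creation.
elem = ET.Element("o:p")
elem.text = "foo"

# Serialization succeeds and emits the colon-containing name unchanged.
payload = ET.tostring(elem)


def receiver_accepts(data):
    """Hypothetical receiver: True if a namespace-aware XML parser can reparse data."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# The sender produced the payload without complaint, but the receiver rejects it
# because the prefix "o" is not bound to any namespace.
roundtrip_ok = receiver_accepts(payload)
```

The failure only shows up on the receiving side, which is why a check at element creation time matters for this kind of setup.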

> Given all of this I would prefer it if it were possible to have an 
> HTML-specific mode with much more liberal rules than the XML mode. This 
> could then be adapted to support any namespacing features HTML grows in 
> the future. For example, if one could do something like
> 
> import lxml.html
> lxml.html.Element("o:p")
> 
> where lxml.html.Element would be just like lxml.etree.Element but 
> without XML-specific validity checks. I guess there might be serious 
> practical difficulties with that exact solution, but I think the general 
> idea of being able to flag an element as following HTML rules or XML 
> rules would be more user-friendly than having a set of rules that 
> neither matches the XML nor the HTML model correctly.

Sounds better to me than introducing some mixed set of rules. And I don't even think that it's difficult to implement, though it might mean introducing another public factory or some sort of switch on Element().
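
A rough sketch of what such a switch could look like, only to illustrate the idea. The factory make_element and its html flag are hypothetical and not part of lxml; the XML pattern is a simplified, ASCII-only approximation of a colon-free XML name, the HTML rule is an assumption, and the standard library's ElementTree stands in for the element class.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of a colon-free XML name (assumption).
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")
# Assumed liberal HTML rule: no whitespace and no markup characters.
_HTML_NAME = re.compile(r"[^\s<>/&\"']+\Z")


def make_element(tag, html=False):
    """Hypothetical factory: strict XML checks by default, liberal when html=True."""
    pattern = _HTML_NAME if html else _XML_NAME
    if not pattern.match(tag):
        mode = "HTML" if html else "XML"
        raise ValueError("Invalid %s tag name %r" % (mode, tag))
    return ET.Element(tag)


# HTML mode accepts the MS Office style name; XML mode keeps rejecting it.
office_para = make_element("o:p", html=True)
try:
    make_element("o:p")
    xml_mode_rejected = False
except ValueError:
    xml_mode_rejected = True
```

Keeping the strict check as the default means existing XML users see no change, while HTML users opt in to the liberal rules explicitly.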

Holger
