[lxml-dev] Tag name validation and HTML
Stefan Behnel
stefan_ml at behnel.de
Thu Sep 27 16:31:25 CEST 2007
James Graham wrote:
> The development branch of lxml 2 appears to restrict the characters that may
> appear in a tag name. Whilst this may be appropriate for XML, it does not match
> the behavior of all common HTML UAs and, as such, does not match the current
> draft of the HTML 5 spec [1].
This is actually not as simple as it might seem. The Element factory cannot
distinguish between XML and HTML tags, so it cannot switch off validation for
a particular tag. So the conservative solution would be to follow the HTML5
spec, since the names it allows are a superset of those the XML spec allows,
and an extremely broad one at that.
But then there's not much left that you could honestly call validation. Also,
I would still want to restrict ":" in tag names, as this has been a source of
problems way too often. So that would just leave spaces and any of ":/>" as
invalid characters in tag names.
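
To make that rule concrete, here is a minimal sketch of such a relaxed check.
The helper name is_relaxed_tag_name is hypothetical and not part of lxml, and
rejecting the empty name is an extra assumption on top of the rule above.

```python
# Hypothetical helper for illustration only; this is not lxml API.
# Characters rejected in addition to whitespace, as proposed above.
_FORBIDDEN = set(":/>")


def is_relaxed_tag_name(name):
    """Return True if name has no whitespace and none of ':/>'.

    Rejecting the empty string is an assumption, not part of the proposal.
    """
    if not name:
        return False
    return not any(ch.isspace() or ch in _FORBIDDEN for ch in name)


print(is_relaxed_tag_name("div"))      # True
print(is_relaxed_tag_name("foo:bar"))  # False, contains ":"
print(is_relaxed_tag_name("a b"))      # False, contains a space
```

Almost anything else passes, which is why there is not much left to call
validation.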
BTW, the spec you reference is actually a parser spec, so it describes what a
parser has to accept rather than what an API should allow. Obviously, allowing
"<" or "&" at the API level isn't a good idea either, so we end up defining our
own way of validating tag names that would be somewhere between the XML spec
and the HTML spec. And it would still allow you to write broken XML without
noticing...
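
A rough sketch of what such an in-between rule might look like, and of how it
lets non-XML names through. Both function names are hypothetical, and the
regular expression is only an ASCII approximation of the XML Name production
(the real one also allows many non-ASCII characters, and ":" is left out here
on purpose).

```python
import re

# ASCII-only approximation of an XML name, with ":" deliberately excluded.
# This is a simplification for illustration, not the full XML Name production.
_XML_NAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

# Whitespace plus these characters are rejected by the in-between rule.
_FORBIDDEN = set(":/><&")


def is_xml_name_ascii(name):
    """Strict check against the ASCII approximation above."""
    return _XML_NAME_ASCII.fullmatch(name) is not None


def is_api_tag_name(name):
    """Hypothetical in-between check: non-empty, no whitespace, no ':/><&'."""
    if not name:
        return False
    return not any(ch.isspace() or ch in _FORBIDDEN for ch in name)


# Print each name with the result of the strict and the relaxed check.
for name in ("div", "1abc", "a=b", "a<b"):
    print(name, is_xml_name_ascii(name), is_api_tag_name(name))
```

Names such as "1abc" and "a=b" pass the in-between check but fail the strict
one, so a tree built with them would not serialise to well-formed XML.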
> This is an issue for html5lib [2] as we are keen
> to keep support for building lxml trees from HTML input, something which is
> currently possible with lxml 1.3.
Extensive support for HTML is definitely a goal of lxml, so if the current
behaviour breaks the HTML spec, it must change. But I'll have to see how.
Any comments appreciated.
Stefan