[lxml-dev] Tag name validation and HTML
Stefan Behnel
stefan_ml at behnel.de
Thu Sep 27 18:30:13 CEST 2007
Stefan Behnel wrote:
> James Graham wrote:
>> The development branch of lxml 2 appears to restrict the characters that may
>> appear in a tag name. Whilst this may be appropriate for XML, it does not match
>> the behavior of all common HTML UAs and, as such, does not match the current
>> draft of the HTML 5 spec [1].
>
> This is actually not as simple as it might seem. The Element factory cannot
> distinguish between XML and HTML tags, so it cannot switch off validation for
> a particular tag. So the conservative solution would be to actually follow the
> HTML5 spec, as it is a superset of the XML spec, and an extremely broad one at that.
> But then there's not much left that you could honestly call validation. Also,
> I would still want to restrict ":" in tag names, as this has been a source of
> problems way too often. So that would just leave spaces and any of ":/>" as
> invalid characters in tag names.
>
> BTW, the spec you reference is actually a parser spec. Obviously, allowing "<"
> or "&" at the API level isn't a good idea either, so we end up defining our
> own way of validating tag names that would be somewhere between the XML spec
> and the HTML spec. And it would still allow you to write broken XML without
> noticing...
This patch might make for a good starting point. Comments appreciated.
Stefan
Index: src/lxml/apihelpers.pxi
===================================================================
--- src/lxml/apihelpers.pxi (Revision 46892)
+++ src/lxml/apihelpers.pxi (Arbeitskopie)
@@ -791,7 +791,23 @@
     return _xmlNameIsValid(_cstr(name_utf8))

 cdef int _xmlNameIsValid(char* c_name):
-    return tree.xmlValidateNCName(c_name, 0) == 0
+    #return tree.xmlValidateNCName(c_name, 0) == 0
+    if c_name is NULL or c_name[0] == c'\0':
+        return 0
+    while c_name[0] != c'\0':
+        if c_name[0] == c':' or \
+           c_name[0] == c'&' or \
+           c_name[0] == c'<' or \
+           c_name[0] == c'>' or \
+           c_name[0] == c'/' or \
+           c_name[0] == c'\x09' or \
+           c_name[0] == c'\x0A' or \
+           c_name[0] == c'\x0B' or \
+           c_name[0] == c'\x0C' or \
+           c_name[0] == c'\x20':
+            return 0
+        c_name = c_name + 1
+    return 1

 cdef int _tagValidOrRaise(tag_utf) except -1:
     if not _pyXmlNameIsValid(tag_utf):
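
For anyone who wants to try the rule without rebuilding lxml, here is a rough
pure-Python sketch of what the patched check is meant to do. The names
FORBIDDEN_BYTES and tag_name_is_valid are made up for illustration and are not
part of lxml. The sketch assumes the name is already UTF-8 encoded bytes and
does not model the NUL terminator handling of the C string version.

```python
# Hypothetical mirror of the patched _xmlNameIsValid(); not lxml API.
# Rejected bytes, as in the patch: ':', '&', '<', '>', '/', plus the
# whitespace bytes 0x09, 0x0A, 0x0B, 0x0C and 0x20.
FORBIDDEN_BYTES = frozenset(b":&<>/\x09\x0a\x0b\x0c\x20")


def tag_name_is_valid(name_utf8: bytes) -> bool:
    """Return True if the UTF-8 encoded name passes the relaxed check."""
    if not name_utf8:
        # The patch rejects NULL and empty names up front.
        return False
    # Iterating over bytes yields ints, which matches the frozenset above.
    return not any(byte in FORBIDDEN_BYTES for byte in name_utf8)


if __name__ == "__main__":
    for candidate in (b"div", b"1abc", b"a:b", b"foo bar", b"a<b", b""):
        print(repr(candidate), tag_name_is_valid(candidate))
```

Note that a name like "1abc" passes here although it is not a valid XML
NCName, which is exactly the "broken XML without noticing" case mentioned
above.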