[lxml-dev] Tag name validation and HTML

Stefan Behnel stefan_ml at behnel.de
Thu Sep 27 18:30:13 CEST 2007


Stefan Behnel wrote:
> James Graham wrote:
>> The development branch of lxml 2 appears to restrict the characters that may 
>> appear in a tag name. Whilst this may be appropriate for XML, it does not match 
>> the behavior of all common HTML UAs and, as such, does not match the current 
>> draft of the HTML 5 spec [1].
> 
> This is actually not as simple as it might seem. The Element factory cannot
> distinguish between XML and HTML tags, so it cannot switch off validation for
> a particular tag. So the conservative solution would be to actually follow the
> HTML5 spec, since the names it accepts are a superset of those the XML spec
> allows, and an extremely broad one. But then there's not much left that you
> could honestly call validation. Also,
> I would still want to restrict ":" in tag names, as this has been a source of
> problems way too often. So that would just leave spaces and any of ":/>" as
> invalid characters in tag names.
> 
> BTW, the spec you reference is actually a parser spec. Obviously, allowing "<"
> or "&" at the API level isn't a good idea either, so we end up defining our
> own way of validating tag names that would be somewhere between the XML spec
> and the HTML spec. And it would still allow you to write broken XML without
> noticing...

This patch might make a good starting point. Comments appreciated.

Stefan


Index: src/lxml/apihelpers.pxi
===================================================================
--- src/lxml/apihelpers.pxi     (revision 46892)
+++ src/lxml/apihelpers.pxi     (working copy)
@@ -791,7 +791,23 @@
     return _xmlNameIsValid(_cstr(name_utf8))

 cdef int _xmlNameIsValid(char* c_name):
-    return tree.xmlValidateNCName(c_name, 0) == 0
+    #return tree.xmlValidateNCName(c_name, 0) == 0
+    if c_name is NULL or c_name[0] == c'\0':
+        return 0
+    while c_name[0] != c'\0':
+        if c_name[0] == c':' or \
+                c_name[0] == c'&' or \
+                c_name[0] == c'<' or \
+                c_name[0] == c'>' or \
+                c_name[0] == c'/' or \
+                c_name[0] == c'\x09' or \
+                c_name[0] == c'\x0A' or \
+                c_name[0] == c'\x0B' or \
+                c_name[0] == c'\x0C' or \
+                c_name[0] == c'\x20':
+            return 0
+        c_name = c_name + 1
+    return 1

 cdef int _tagValidOrRaise(tag_utf) except -1:
     if not _pyXmlNameIsValid(tag_utf):

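For anyone who wants to try the rule without rebuilding lxml, here is a rough pure-Python sketch of what the patched _xmlNameIsValid() checks. The names tag_name_is_valid and _INVALID_TAG_CHARS are made up for illustration and are not part of lxml; the character set is copied from the patch above, so carriage return (\x0D) is not rejected here either.

```python
# Hypothetical pure-Python mirror of the patched _xmlNameIsValid().
# Neither name below exists in lxml; they are for illustration only.

# Characters rejected by the patch: ':', '&', '<', '>', '/',
# tab, line feed, vertical tab, form feed and space.
_INVALID_TAG_CHARS = frozenset(':&<>/\x09\x0A\x0B\x0C\x20')


def tag_name_is_valid(name):
    """Return True if name is non-empty and contains no rejected character."""
    if not name:
        # Matches the NULL / empty-string check in the C-level loop.
        return False
    return not any(ch in _INVALID_TAG_CHARS for ch in name)


if __name__ == '__main__':
    for candidate in ('div', 'a:b', 'a b', '1abc', ''):
        print(repr(candidate), tag_name_is_valid(candidate))
```

Note that a name like '1abc' passes this check although xmlValidateNCName() would reject it, which is the "broken XML without noticing" trade-off mentioned above.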

