[Lxml-checkins] r50510 - in lxml/trunk: . doc src/lxml src/lxml/tests

scoder at codespeak.net
Fri Jan 11 09:51:17 CET 2008


Author: scoder
Date: Fri Jan 11 09:51:17 2008
New Revision: 50510

Modified:
   lxml/trunk/   (props changed)
   lxml/trunk/CHANGES.txt
   lxml/trunk/doc/validation.txt
   lxml/trunk/src/lxml/iterparse.pxi
   lxml/trunk/src/lxml/parser.pxi
   lxml/trunk/src/lxml/tests/common_imports.py
   lxml/trunk/src/lxml/tests/test_dtd.py
   lxml/trunk/src/lxml/tests/test_xmlschema.py
   lxml/trunk/src/lxml/xmlparser.pxd
   lxml/trunk/src/lxml/xmlschema.pxd
   lxml/trunk/src/lxml/xmlschema.pxi
Log:
 r3226 at delle:  sbehnel | 2008-01-10 20:28:46 +0100
 on-the-fly XML schema validation in the parser
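
 For readers who want to try the new ``schema`` parser keyword, the
 following is a minimal sketch based on the doctest this revision adds to
 doc/validation.txt.  The helper names ``make_validating_parser`` and
 ``parse_or_none`` are hypothetical and exist only for illustration; the
 lxml calls themselves (``etree.XMLSchema``, ``etree.XMLParser(schema=...)``,
 ``etree.fromstring``, ``etree.XMLSyntaxError``) are the ones used in this
 commit's documentation and tests.

```python
from lxml import etree

# Same schema as in the doc/validation.txt example: the root element
# <a> must contain an integer.
SCHEMA_SOURCE = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:integer"/>
</xsd:schema>
"""


def make_validating_parser(schema_source):
    """Hypothetical helper: build a parser that validates while parsing."""
    schema = etree.XMLSchema(etree.XML(schema_source))
    return etree.XMLParser(schema=schema)


def parse_or_none(text, parser):
    """Hypothetical helper: return the root element, or None if the
    parser rejects the document (validation failures are reported as
    XMLSyntaxError, as in the tests of this revision)."""
    try:
        return etree.fromstring(text, parser)
    except etree.XMLSyntaxError:
        return None


parser = make_validating_parser(SCHEMA_SOURCE)
valid_root = parse_or_none("<a>5</a>", parser)
invalid_root = parse_or_none("<a>not int</a>", parser)
```

 With this setup, ``valid_root`` is the parsed ``<a>`` element, while
 ``invalid_root`` is ``None`` because the parser raised an exception
 instead of returning a tree.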


Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt	(original)
+++ lxml/trunk/CHANGES.txt	Fri Jan 11 09:51:17 2008
@@ -8,6 +8,8 @@
 Features added
 --------------
 
+* Parse-time XML schema validation (``schema`` parser keyword).
+
 * XPath string results of the ``text()`` function and attribute
   selection make their Element container accessible through a
   ``getparent()`` method.

Modified: lxml/trunk/doc/validation.txt
==============================================================================
--- lxml/trunk/doc/validation.txt	(original)
+++ lxml/trunk/doc/validation.txt	Fri Jan 11 09:51:17 2008
@@ -13,16 +13,17 @@
 
 There is also initial support for Schematron_.  However, it does not currently
 support error reporting in the validation phase due to insufficiencies in the
-implementation as of libxml2 2.6.29.
+implementation as of libxml2 2.6.30.
 
 .. _Schematron:   http://www.ascc.net/xml/schematron
 
 .. contents::
 .. 
-   1  DTD
-   2  RelaxNG
-   3  XMLSchema
-   4  Schematron
+   1  Validation at parse time
+   2  DTD
+   3  RelaxNG
+   4  XMLSchema
+   5  Schematron
 
 The usual setup procedure::
 
@@ -30,20 +31,59 @@
   >>> from StringIO import StringIO
 
 
+Validation at parse time
+------------------------
+
+The parser in lxml can do on-the-fly validation of a document against
+a DTD or an XML schema.  The DTD is retrieved automatically based on
+the DOCTYPE of the parsed document.  All you have to do is use a
+parser that has DTD validation enabled::
+
+  >>> parser = etree.XMLParser(dtd_validation=True)
+
+Obviously, a request for validation enables the DTD loading feature.
+There are two other options that enable loading the DTD, but that do
+not perform any validation.  The first is the ``load_dtd`` keyword
+option, which simply loads the DTD into the parser and makes it
+available to the document as external subset.  You can retrieve the
+DTD from the parsed document using the ``docinfo`` property of the
+result ElementTree object.  The internal subset is available as
+``internalDTD``, the external subset is provided as ``externalDTD``.
+
+The third way to activate DTD loading is with the
+``attribute_defaults`` option, which loads the DTD and weaves
+attribute default values into the document.  Again, no validation is
+performed unless explicitly requested.
+
+XML schema is supported in a similar way, but requires an explicit
+schema to be provided::
+
+  >>> schema_root = etree.XML('''\
+  ...   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
+  ...     <xsd:element name="a" type="xsd:integer"/>
+  ...   </xsd:schema>
+  ... ''')
+  >>> schema = etree.XMLSchema(schema_root)
+
+  >>> parser = etree.XMLParser(schema = schema)
+  >>> root = etree.fromstring("<a>5</a>", parser)
+
+If the validation fails (be it for a DTD or an XML schema), the parser
+will raise an exception::
+
+  >>> root = etree.fromstring("<a>not int</a>", parser)
+  Traceback (most recent call last):
+  XMLSyntaxError: Element 'a': 'not int' is not a valid value of the atomic type 'xs:integer'.
+
+
 DTD
 ---
 
-There are two places in lxml where DTDs are supported: parsers and the DTD
-class.  If you pass a keyword option to a parser that requires DTD loading,
-lxml will automatically include the DTD in the parsing process.  If you pass
-the keyword for DTD validation, lxml (or rather libxml2) will use this DTD
-right inside the parser and report failure or success when parsing terminates.
-
-The parser support for DTDs depends on internal or external subsets of the XML
-file.  This means that the XML file itself must either contain a DTD or must
-reference a DTD to make this work.  If you want to validate an XML document
-against a DTD that is not referenced by the document itself, you can use the
-``DTD`` class.
+As described above, the parser support for DTDs depends on internal or
+external subsets of the XML file.  This means that the XML file itself
+must either contain a DTD or must reference a DTD to make this work.
+If you want to validate an XML document against a DTD that is not
+referenced by the document itself, you can use the ``DTD`` class.
 
 To use the ``DTD`` class, you must first pass a filename or file-like object
 into the constructor to parse a DTD::

Modified: lxml/trunk/src/lxml/iterparse.pxi
==============================================================================
--- lxml/trunk/src/lxml/iterparse.pxi	(original)
+++ lxml/trunk/src/lxml/iterparse.pxi	Fri Jan 11 09:51:17 2008
@@ -272,14 +272,15 @@
 
     Other keyword arguments:
     * encoding           - override the document encoding
+    * schema             - an XMLSchema to validate against
     """
     cdef object _source
     cdef readonly object root
-    def __init__(self, source, events=("end",), tag=None,
+    def __init__(self, source, events=("end",), *, tag=None,
                  attribute_defaults=False, dtd_validation=False,
                  load_dtd=False, no_network=True, remove_blank_text=False,
                  remove_comments=False, remove_pis=False, encoding=None,
-                 html=False):
+                 html=False, XMLSchema schema=None):
         cdef _IterparseContext context
         cdef char* c_encoding
         cdef int parse_options
@@ -318,7 +319,7 @@
         if remove_blank_text:
             parse_options = parse_options | xmlparser.XML_PARSE_NOBLANKS
 
-        _BaseParser.__init__(self, parse_options, html,
+        _BaseParser.__init__(self, parse_options, html, schema,
                              remove_comments, remove_pis,
                              None, filename, encoding)
 

Modified: lxml/trunk/src/lxml/parser.pxi
==============================================================================
--- lxml/trunk/src/lxml/parser.pxi	(original)
+++ lxml/trunk/src/lxml/parser.pxi	Fri Jan 11 09:51:17 2008
@@ -375,9 +375,13 @@
 cdef class _ParserContext(_ResolverContext)
 cdef class _SaxParserContext(_ParserContext)
 cdef class _TargetParserContext(_SaxParserContext)
+cdef class _ParserSchemaValidationContext
+cdef class _Validator
+cdef class XMLSchema(_Validator)
 
 cdef class _ParserContext(_ResolverContext):
     cdef _ErrorLog _error_log
+    cdef _ParserSchemaValidationContext _validator
     cdef xmlparser.xmlParserCtxt* _c_ctxt
     cdef python.PyThread_type_lock _lock
 
@@ -390,6 +394,7 @@
     cdef _ParserContext _copy(self):
         cdef _ParserContext context
         context = self.__class__()
+        context._validator = self._validator.copy()
         _initParserContext(context, self._resolvers._copy(), NULL)
         return context
 
@@ -414,11 +419,15 @@
             if result == 0:
                 raise ParserError, "parser locking failed"
         self._error_log.connect()
+        if self._validator is not None:
+            self._validator.connect(self._c_ctxt)
         return 0
 
     cdef int cleanup(self) except -1:
         self._resetParserContext()
         self.clear()
+        if self._validator is not None:
+            self._validator.disconnect()
         self._error_log.disconnect()
         if config.ENABLE_THREADING and self._lock is not NULL:
             python.PyThread_release_lock(self._lock)
@@ -487,7 +496,10 @@
         c_ctxt.myDoc = NULL
 
     if result is not NULL:
-        if recover or (c_ctxt.wellFormed and \
+        if context._validator is not None and \
+                not context._validator.isvalid():
+            well_formed = 0 # actually not 'valid', but anyway ...
+        elif recover or (c_ctxt.wellFormed and \
                        c_ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
             well_formed = 1
         elif not c_ctxt.replaceEntities and not c_ctxt.validate \
@@ -535,16 +547,15 @@
     cdef bint _for_html
     cdef bint _remove_comments
     cdef bint _remove_pis
+    cdef XMLSchema _schema
     cdef object _filename
     cdef object _target
     cdef object _default_encoding
     cdef int _default_encoding_int
 
-    def __init__(self, int parse_options, bint for_html,
-                 remove_comments, remove_pis,
-                 target, filename, encoding):
+    def __init__(self, int parse_options, bint for_html, XMLSchema schema,
+                 remove_comments, remove_pis, target, filename, encoding):
         cdef int c_encoding
-        cdef xmlparser.xmlParserCtxt* pctxt
         if not isinstance(self, HTMLParser) and \
                 not isinstance(self, XMLParser) and \
                 not isinstance(self, iterparse):
@@ -556,6 +567,7 @@
         self._for_html = for_html
         self._remove_comments = remove_comments
         self._remove_pis = remove_pis
+        self._schema = schema
 
         self._resolvers = _ResolverRegistry()
 
@@ -575,6 +587,9 @@
         cdef xmlparser.xmlParserCtxt* pctxt
         if self._parser_context is None:
             self._parser_context = self._createContext(self._target)
+            if self._schema is not None:
+                self._parser_context._validator = \
+                    self._schema._newSaxValidator()
             pctxt = self._newParserCtxt()
             if pctxt is NULL:
                 python.PyErr_NoMemory()
@@ -591,6 +606,9 @@
         cdef xmlparser.xmlParserCtxt* pctxt
         if self._push_parser_context is None:
             self._push_parser_context = self._createContext(self._target)
+            if self._schema is not None:
+                self._push_parser_context._validator = \
+                    self._schema._newSaxValidator()
             pctxt = self._newPushParserCtxt()
             if pctxt is NULL:
                 python.PyErr_NoMemory()
@@ -1439,6 +1457,7 @@
     Other keyword arguments:
     * encoding - override the document encoding
     * target   - a parser target object that will receive the parse events
+    * schema   - an XMLSchema to validate against
 
     Note that you should avoid sharing parsers between threads.  While this is
     not harmful, it is more efficient to use separate parsers.  This does not
@@ -1448,7 +1467,8 @@
                  load_dtd=False, no_network=True, ns_clean=False,
                  recover=False, remove_blank_text=False, compact=True,
                  resolve_entities=True, remove_comments=False,
-                 remove_pis=False, target=None, encoding=None):
+                 remove_pis=False, target=None, encoding=None,
+                 XMLSchema schema=None):
         cdef int parse_options
         parse_options = _XML_DEFAULT_PARSE_OPTIONS
         if load_dtd:
@@ -1472,7 +1492,7 @@
         if not resolve_entities:
             parse_options = parse_options ^ xmlparser.XML_PARSE_NOENT
 
-        _BaseParser.__init__(self, parse_options, 0,
+        _BaseParser.__init__(self, parse_options, 0, schema,
                              remove_comments, remove_pis,
                              target, None, encoding)
 
@@ -1487,7 +1507,7 @@
                  load_dtd=False, no_network=True, ns_clean=False,
                  recover=False, remove_blank_text=False, compact=True,
                  resolve_entities=True, remove_comments=True,
-                 remove_pis=True, target=None, encoding=None):
+                 remove_pis=True, target=None, encoding=None, schema=None):
         XMLParser.__init__(self,
                            attribute_defaults=attribute_defaults,
                            dtd_validation=dtd_validation,
@@ -1501,7 +1521,8 @@
                            remove_comments=remove_comments,
                            remove_pis=remove_pis,
                            target=target,
-                           encoding=encoding)
+                           encoding=encoding,
+                           schema=schema)
 
 
 cdef XMLParser __DEFAULT_XML_PARSER
@@ -1561,13 +1582,15 @@
     Other keyword arguments:
     * encoding - override the document encoding
     * target   - a parser target object that will receive the parse events
+    * schema   - an XMLSchema to validate against
 
     Note that you should avoid sharing parsers between threads for performance
     reasons.
     """
-    def __init__(self, recover=True, no_network=True, remove_blank_text=False,
-                 compact=True, remove_comments=False, remove_pis=False,
-                 target=None, encoding=None):
+    def __init__(self, *, recover=True, no_network=True,
+                 remove_blank_text=False, compact=True, remove_comments=False,
+                 remove_pis=False, target=None, encoding=None,
+                 XMLSchema schema=None):
         cdef int parse_options
         parse_options = _HTML_DEFAULT_PARSE_OPTIONS
         if remove_blank_text:
@@ -1579,7 +1602,7 @@
         if not compact:
             parse_options = parse_options ^ htmlparser.HTML_PARSE_COMPACT
 
-        _BaseParser.__init__(self, parse_options, 1,
+        _BaseParser.__init__(self, parse_options, 1, schema,
                              remove_comments, remove_pis,
                              target, None, encoding)
 

Modified: lxml/trunk/src/lxml/tests/common_imports.py
==============================================================================
--- lxml/trunk/src/lxml/tests/common_imports.py	(original)
+++ lxml/trunk/src/lxml/tests/common_imports.py	Fri Jan 11 09:51:17 2008
@@ -53,9 +53,9 @@
     def tearDown(self):
         gc.collect()
 
-    def parse(self, text):
+    def parse(self, text, parser=None):
         f = StringIO(text)
-        return etree.parse(f)
+        return etree.parse(f, parser=parser)
     
     def _rootstring(self, tree):
         return etree.tostring(tree.getroot()).replace(' ', '').replace('\n', '')

Modified: lxml/trunk/src/lxml/tests/test_dtd.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_dtd.py	(original)
+++ lxml/trunk/src/lxml/tests/test_dtd.py	Fri Jan 11 09:51:17 2008
@@ -26,6 +26,13 @@
         dtd = etree.DTD(StringIO("<!ELEMENT b EMPTY>"))
         self.assert_(dtd.validate(root))
 
+    def test_dtd_parse_invalid(self):
+        fromstring = etree.fromstring
+        parser = etree.XMLParser(dtd_validation=True)
+        xml = '<!DOCTYPE b SYSTEM "%s"><b><a/></b>' % fileInTestDir("test.dtd")
+        self.assertRaises(etree.XMLSyntaxError,
+                          fromstring, xml, parser=parser)
+
     def test_dtd_invalid(self):
         root = etree.XML("<b><a/></b>")
         dtd = etree.DTD(StringIO("<!ELEMENT b EMPTY>"))

Modified: lxml/trunk/src/lxml/tests/test_xmlschema.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_xmlschema.py	(original)
+++ lxml/trunk/src/lxml/tests/test_xmlschema.py	Fri Jan 11 09:51:17 2008
@@ -26,6 +26,26 @@
         self.assert_(schema.validate(tree_valid))
         self.assert_(not schema.validate(tree_invalid))
 
+    def test_xmlschema_parse(self):
+        schema = self.parse('''
+<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
+  <xsd:element name="a" type="AType"/>
+  <xsd:complexType name="AType">
+    <xsd:sequence>
+      <xsd:element name="b" type="xsd:string" />
+    </xsd:sequence>
+  </xsd:complexType>
+</xsd:schema>
+''')
+        schema = etree.XMLSchema(schema)
+        parser = etree.XMLParser(schema=schema)
+
+        tree_valid = self.parse('<a><b></b></a>', parser=parser)
+        self.assertEquals('a', tree_valid.getroot().tag)
+
+        self.assertRaises(etree.XMLSyntaxError,
+                          self.parse, '<a><c></c></a>', parser=parser)
+
     def test_xmlschema_elementtree_error(self):
         self.assertRaises(ValueError, etree.XMLSchema, etree.ElementTree())
 

Modified: lxml/trunk/src/lxml/xmlparser.pxd
==============================================================================
--- lxml/trunk/src/lxml/xmlparser.pxd	(original)
+++ lxml/trunk/src/lxml/xmlparser.pxd	Fri Jan 11 09:51:17 2008
@@ -91,6 +91,7 @@
         xmlError lastError
         xmlNode* node
         xmlSAXHandler* sax
+        void* userData
         int* spaceTab
         int spaceMax
         bint html

Modified: lxml/trunk/src/lxml/xmlschema.pxd
==============================================================================
--- lxml/trunk/src/lxml/xmlschema.pxd	(original)
+++ lxml/trunk/src/lxml/xmlschema.pxd	Fri Jan 11 09:51:17 2008
@@ -1,10 +1,11 @@
-cimport tree
+from xmlparser cimport xmlSAXHandler
 from tree cimport xmlDoc
 
 cdef extern from "libxml/xmlschemas.h":
     ctypedef struct xmlSchema
     ctypedef struct xmlSchemaParserCtxt
 
+    ctypedef struct xmlSchemaSAXPlugStruct
     ctypedef struct xmlSchemaValidCtxt
 
     cdef xmlSchemaValidCtxt* xmlSchemaNewValidCtxt(xmlSchema* schema) nogil
@@ -15,3 +16,9 @@
     cdef void xmlSchemaFree(xmlSchema* schema) nogil
     cdef void xmlSchemaFreeParserCtxt(xmlSchemaParserCtxt* ctxt) nogil
     cdef void xmlSchemaFreeValidCtxt(xmlSchemaValidCtxt* ctxt) nogil
+
+    cdef xmlSchemaSAXPlugStruct* xmlSchemaSAXPlug(xmlSchemaValidCtxt* ctxt,
+                                                  xmlSAXHandler** sax,
+                                                  void** data) nogil
+    cdef int xmlSchemaSAXUnplug(xmlSchemaSAXPlugStruct* sax_plug)
+    cdef int xmlSchemaIsValid(xmlSchemaValidCtxt* ctxt)

Modified: lxml/trunk/src/lxml/xmlschema.pxi
==============================================================================
--- lxml/trunk/src/lxml/xmlschema.pxi	(original)
+++ lxml/trunk/src/lxml/xmlschema.pxi	Fri Jan 11 09:51:17 2008
@@ -105,8 +105,53 @@
 
         self._error_log.disconnect()
         if ret == -1:
-            raise XMLSchemaValidateError, "Internal error in XML Schema validation."
+            raise XMLSchemaValidateError(
+                "Internal error in XML Schema validation.")
         if ret == 0:
             return True
         else:
             return False
+
+    cdef _ParserSchemaValidationContext _newSaxValidator(self):
+        cdef _ParserSchemaValidationContext context
+        context = NEW_SCHEMA_CONTEXT(_ParserSchemaValidationContext)
+        context._schema = self
+        context._valid_ctxt = NULL
+        context._sax_plug = NULL
+        return context
+
+cdef class _ParserSchemaValidationContext:
+    cdef XMLSchema _schema
+    cdef xmlschema.xmlSchemaValidCtxt* _valid_ctxt
+    cdef xmlschema.xmlSchemaSAXPlugStruct* _sax_plug
+
+    def __dealloc__(self):
+        if self._sax_plug:
+            self.disconnect()
+        if self._valid_ctxt:
+            xmlschema.xmlSchemaFreeValidCtxt(self._valid_ctxt)
+
+    cdef _ParserSchemaValidationContext copy(self):
+        return self._schema._newSaxValidator()
+
+    cdef int connect(self, xmlparser.xmlParserCtxt* c_ctxt) except -1:
+        if self._valid_ctxt is NULL:
+            self._valid_ctxt = xmlschema.xmlSchemaNewValidCtxt(
+                self._schema._c_schema)
+            if self._valid_ctxt is NULL:
+                raise XMLSchemaError, "Failed to create validation context"
+        self._sax_plug = xmlschema.xmlSchemaSAXPlug(
+            self._valid_ctxt, &c_ctxt.sax, &c_ctxt.userData)
+
+    cdef void disconnect(self):
+        xmlschema.xmlSchemaSAXUnplug(self._sax_plug)
+        self._sax_plug = NULL
+
+    cdef bint isvalid(self):
+        if self._valid_ctxt is NULL:
+            return 1 # valid
+        return xmlschema.xmlSchemaIsValid(self._valid_ctxt)
+
+cdef extern from "etree_defs.h":
+    # macro call to 't->tp_new()' for fast instantiation
+    cdef _ParserSchemaValidationContext NEW_SCHEMA_CONTEXT "PY_NEW" (object t)

