[Lxml-checkins] r50510 - in lxml/trunk: . doc src/lxml src/lxml/tests
scoder at codespeak.net
Fri Jan 11 09:51:17 CET 2008
Author: scoder
Date: Fri Jan 11 09:51:17 2008
New Revision: 50510
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/doc/validation.txt
lxml/trunk/src/lxml/iterparse.pxi
lxml/trunk/src/lxml/parser.pxi
lxml/trunk/src/lxml/tests/common_imports.py
lxml/trunk/src/lxml/tests/test_dtd.py
lxml/trunk/src/lxml/tests/test_xmlschema.py
lxml/trunk/src/lxml/xmlparser.pxd
lxml/trunk/src/lxml/xmlschema.pxd
lxml/trunk/src/lxml/xmlschema.pxi
Log:
r3226 at delle: sbehnel | 2008-01-10 20:28:46 +0100
on-the-fly XML schema validation in the parser
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Jan 11 09:51:17 2008
@@ -8,6 +8,8 @@
Features added
--------------
+* Parse-time XML schema validation (``schema`` parser keyword).
+
* XPath string results of the ``text()`` function and attribute
selection make their Element container accessible through a
``getparent()`` method.
Modified: lxml/trunk/doc/validation.txt
==============================================================================
--- lxml/trunk/doc/validation.txt (original)
+++ lxml/trunk/doc/validation.txt Fri Jan 11 09:51:17 2008
@@ -13,16 +13,17 @@
There is also initial support for Schematron_. However, it does not currently
support error reporting in the validation phase due to insufficiencies in the
-implementation as of libxml2 2.6.29.
+implementation as of libxml2 2.6.30.
.. _Schematron: http://www.ascc.net/xml/schematron
.. contents::
..
- 1 DTD
- 2 RelaxNG
- 3 XMLSchema
- 4 Schematron
+ 1 Validation at parse time
+ 2 DTD
+ 3 RelaxNG
+ 4 XMLSchema
+ 5 Schematron
The usual setup procedure::
@@ -30,20 +31,59 @@
>>> from StringIO import StringIO
+Validation at parse time
+------------------------
+
+The parser in lxml can do on-the-fly validation of a document against
+a DTD or an XML schema. The DTD is retrieved automatically based on
+the DOCTYPE of the parsed document. All you have to do is use a
+parser that has DTD validation enabled::
+
+ >>> parser = etree.XMLParser(dtd_validation=True)
+
+Obviously, a request for validation enables the DTD loading feature.
+There are two other options that enable loading the DTD, but that do
+not perform any validation. The first is the ``load_dtd`` keyword
+option, which simply loads the DTD into the parser and makes it
+available to the document as external subset. You can retrieve the
+DTD from the parsed document using the ``docinfo`` property of the
+result ElementTree object. The internal subset is available as
+``internalDTD``, the external subset is provided as ``externalDTD``.
+
+The third way to activate DTD loading is with the
+``attribute_defaults`` option, which loads the DTD and weaves
+attribute default values into the document. Again, no validation is
+performed unless explicitly requested.
+
+XML schema is supported in a similar way, but requires an explicit
+schema to be provided::
+
+ >>> schema_root = etree.XML('''\
+ ... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
+ ... <xsd:element name="a" type="xsd:integer"/>
+ ... </xsd:schema>
+ ... ''')
+ >>> schema = etree.XMLSchema(schema_root)
+
+ >>> parser = etree.XMLParser(schema = schema)
+ >>> root = etree.fromstring("<a>5</a>", parser)
+
+If the validation fails (be it for a DTD or an XML schema), the parser
+will raise an exception::
+
+ >>> root = etree.fromstring("<a>not int</a>", parser)
+ Traceback (most recent call last):
+ XMLSyntaxError: Element 'a': 'not int' is not a valid value of the atomic type 'xs:integer'.
+
+
DTD
---
-There are two places in lxml where DTDs are supported: parsers and the DTD
-class. If you pass a keyword option to a parser that requires DTD loading,
-lxml will automatically include the DTD in the parsing process. If you pass
-the keyword for DTD validation, lxml (or rather libxml2) will use this DTD
-right inside the parser and report failure or success when parsing terminates.
-
-The parser support for DTDs depends on internal or external subsets of the XML
-file. This means that the XML file itself must either contain a DTD or must
-reference a DTD to make this work. If you want to validate an XML document
-against a DTD that is not referenced by the document itself, you can use the
-``DTD`` class.
+As described above, the parser support for DTDs depends on internal or
+external subsets of the XML file. This means that the XML file itself
+must either contain a DTD or must reference a DTD to make this work.
+If you want to validate an XML document against a DTD that is not
+referenced by the document itself, you can use the ``DTD`` class.
To use the ``DTD`` class, you must first pass a filename or file-like object
into the constructor to parse a DTD::
Modified: lxml/trunk/src/lxml/iterparse.pxi
==============================================================================
--- lxml/trunk/src/lxml/iterparse.pxi (original)
+++ lxml/trunk/src/lxml/iterparse.pxi Fri Jan 11 09:51:17 2008
@@ -272,14 +272,15 @@
Other keyword arguments:
* encoding - override the document encoding
+ * schema - an XMLSchema to validate against
"""
cdef object _source
cdef readonly object root
- def __init__(self, source, events=("end",), tag=None,
+ def __init__(self, source, events=("end",), *, tag=None,
attribute_defaults=False, dtd_validation=False,
load_dtd=False, no_network=True, remove_blank_text=False,
remove_comments=False, remove_pis=False, encoding=None,
- html=False):
+ html=False, XMLSchema schema=None):
cdef _IterparseContext context
cdef char* c_encoding
cdef int parse_options
@@ -318,7 +319,7 @@
if remove_blank_text:
parse_options = parse_options | xmlparser.XML_PARSE_NOBLANKS
- _BaseParser.__init__(self, parse_options, html,
+ _BaseParser.__init__(self, parse_options, html, schema,
remove_comments, remove_pis,
None, filename, encoding)
Modified: lxml/trunk/src/lxml/parser.pxi
==============================================================================
--- lxml/trunk/src/lxml/parser.pxi (original)
+++ lxml/trunk/src/lxml/parser.pxi Fri Jan 11 09:51:17 2008
@@ -375,9 +375,13 @@
cdef class _ParserContext(_ResolverContext)
cdef class _SaxParserContext(_ParserContext)
cdef class _TargetParserContext(_SaxParserContext)
+cdef class _ParserSchemaValidationContext
+cdef class _Validator
+cdef class XMLSchema(_Validator)
cdef class _ParserContext(_ResolverContext):
cdef _ErrorLog _error_log
+ cdef _ParserSchemaValidationContext _validator
cdef xmlparser.xmlParserCtxt* _c_ctxt
cdef python.PyThread_type_lock _lock
@@ -390,6 +394,7 @@
cdef _ParserContext _copy(self):
cdef _ParserContext context
context = self.__class__()
+ context._validator = self._validator.copy()
_initParserContext(context, self._resolvers._copy(), NULL)
return context
@@ -414,11 +419,15 @@
if result == 0:
raise ParserError, "parser locking failed"
self._error_log.connect()
+ if self._validator is not None:
+ self._validator.connect(self._c_ctxt)
return 0
cdef int cleanup(self) except -1:
self._resetParserContext()
self.clear()
+ if self._validator is not None:
+ self._validator.disconnect()
self._error_log.disconnect()
if config.ENABLE_THREADING and self._lock is not NULL:
python.PyThread_release_lock(self._lock)
@@ -487,7 +496,10 @@
c_ctxt.myDoc = NULL
if result is not NULL:
- if recover or (c_ctxt.wellFormed and \
+ if context._validator is not None and \
+ not context._validator.isvalid():
+ well_formed = 0 # actually not 'valid', but anyway ...
+ elif recover or (c_ctxt.wellFormed and \
c_ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
well_formed = 1
elif not c_ctxt.replaceEntities and not c_ctxt.validate \
@@ -535,16 +547,15 @@
cdef bint _for_html
cdef bint _remove_comments
cdef bint _remove_pis
+ cdef XMLSchema _schema
cdef object _filename
cdef object _target
cdef object _default_encoding
cdef int _default_encoding_int
- def __init__(self, int parse_options, bint for_html,
- remove_comments, remove_pis,
- target, filename, encoding):
+ def __init__(self, int parse_options, bint for_html, XMLSchema schema,
+ remove_comments, remove_pis, target, filename, encoding):
cdef int c_encoding
- cdef xmlparser.xmlParserCtxt* pctxt
if not isinstance(self, HTMLParser) and \
not isinstance(self, XMLParser) and \
not isinstance(self, iterparse):
@@ -556,6 +567,7 @@
self._for_html = for_html
self._remove_comments = remove_comments
self._remove_pis = remove_pis
+ self._schema = schema
self._resolvers = _ResolverRegistry()
@@ -575,6 +587,9 @@
cdef xmlparser.xmlParserCtxt* pctxt
if self._parser_context is None:
self._parser_context = self._createContext(self._target)
+ if self._schema is not None:
+ self._parser_context._validator = \
+ self._schema._newSaxValidator()
pctxt = self._newParserCtxt()
if pctxt is NULL:
python.PyErr_NoMemory()
@@ -591,6 +606,9 @@
cdef xmlparser.xmlParserCtxt* pctxt
if self._push_parser_context is None:
self._push_parser_context = self._createContext(self._target)
+ if self._schema is not None:
+ self._push_parser_context._validator = \
+ self._schema._newSaxValidator()
pctxt = self._newPushParserCtxt()
if pctxt is NULL:
python.PyErr_NoMemory()
@@ -1439,6 +1457,7 @@
Other keyword arguments:
* encoding - override the document encoding
* target - a parser target object that will receive the parse events
+ * schema - an XMLSchema to validate against
Note that you should avoid sharing parsers between threads. While this is
not harmful, it is more efficient to use separate parsers. This does not
@@ -1448,7 +1467,8 @@
load_dtd=False, no_network=True, ns_clean=False,
recover=False, remove_blank_text=False, compact=True,
resolve_entities=True, remove_comments=False,
- remove_pis=False, target=None, encoding=None):
+ remove_pis=False, target=None, encoding=None,
+ XMLSchema schema=None):
cdef int parse_options
parse_options = _XML_DEFAULT_PARSE_OPTIONS
if load_dtd:
@@ -1472,7 +1492,7 @@
if not resolve_entities:
parse_options = parse_options ^ xmlparser.XML_PARSE_NOENT
- _BaseParser.__init__(self, parse_options, 0,
+ _BaseParser.__init__(self, parse_options, 0, schema,
remove_comments, remove_pis,
target, None, encoding)
@@ -1487,7 +1507,7 @@
load_dtd=False, no_network=True, ns_clean=False,
recover=False, remove_blank_text=False, compact=True,
resolve_entities=True, remove_comments=True,
- remove_pis=True, target=None, encoding=None):
+ remove_pis=True, target=None, encoding=None, schema=None):
XMLParser.__init__(self,
attribute_defaults=attribute_defaults,
dtd_validation=dtd_validation,
@@ -1501,7 +1521,8 @@
remove_comments=remove_comments,
remove_pis=remove_pis,
target=target,
- encoding=encoding)
+ encoding=encoding,
+ schema=schema)
cdef XMLParser __DEFAULT_XML_PARSER
@@ -1561,13 +1582,15 @@
Other keyword arguments:
* encoding - override the document encoding
* target - a parser target object that will receive the parse events
+ * schema - an XMLSchema to validate against
Note that you should avoid sharing parsers between threads for performance
reasons.
"""
- def __init__(self, recover=True, no_network=True, remove_blank_text=False,
- compact=True, remove_comments=False, remove_pis=False,
- target=None, encoding=None):
+ def __init__(self, *, recover=True, no_network=True,
+ remove_blank_text=False, compact=True, remove_comments=False,
+ remove_pis=False, target=None, encoding=None,
+ XMLSchema schema=None):
cdef int parse_options
parse_options = _HTML_DEFAULT_PARSE_OPTIONS
if remove_blank_text:
@@ -1579,7 +1602,7 @@
if not compact:
parse_options = parse_options ^ htmlparser.HTML_PARSE_COMPACT
- _BaseParser.__init__(self, parse_options, 1,
+ _BaseParser.__init__(self, parse_options, 1, schema,
remove_comments, remove_pis,
target, None, encoding)
Modified: lxml/trunk/src/lxml/tests/common_imports.py
==============================================================================
--- lxml/trunk/src/lxml/tests/common_imports.py (original)
+++ lxml/trunk/src/lxml/tests/common_imports.py Fri Jan 11 09:51:17 2008
@@ -53,9 +53,9 @@
def tearDown(self):
gc.collect()
- def parse(self, text):
+ def parse(self, text, parser=None):
f = StringIO(text)
- return etree.parse(f)
+ return etree.parse(f, parser=parser)
def _rootstring(self, tree):
return etree.tostring(tree.getroot()).replace(' ', '').replace('\n', '')
Modified: lxml/trunk/src/lxml/tests/test_dtd.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_dtd.py (original)
+++ lxml/trunk/src/lxml/tests/test_dtd.py Fri Jan 11 09:51:17 2008
@@ -26,6 +26,13 @@
dtd = etree.DTD(StringIO("<!ELEMENT b EMPTY>"))
self.assert_(dtd.validate(root))
+ def test_dtd_parse_invalid(self):
+ fromstring = etree.fromstring
+ parser = etree.XMLParser(dtd_validation=True)
+ xml = '<!DOCTYPE b SYSTEM "%s"><b><a/></b>' % fileInTestDir("test.dtd")
+ self.assertRaises(etree.XMLSyntaxError,
+ fromstring, xml, parser=parser)
+
def test_dtd_invalid(self):
root = etree.XML("<b><a/></b>")
dtd = etree.DTD(StringIO("<!ELEMENT b EMPTY>"))
Modified: lxml/trunk/src/lxml/tests/test_xmlschema.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_xmlschema.py (original)
+++ lxml/trunk/src/lxml/tests/test_xmlschema.py Fri Jan 11 09:51:17 2008
@@ -26,6 +26,26 @@
self.assert_(schema.validate(tree_valid))
self.assert_(not schema.validate(tree_invalid))
+ def test_xmlschema_parse(self):
+ schema = self.parse('''
+<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
+ <xsd:element name="a" type="AType"/>
+ <xsd:complexType name="AType">
+ <xsd:sequence>
+ <xsd:element name="b" type="xsd:string" />
+ </xsd:sequence>
+ </xsd:complexType>
+</xsd:schema>
+''')
+ schema = etree.XMLSchema(schema)
+ parser = etree.XMLParser(schema=schema)
+
+ tree_valid = self.parse('<a><b></b></a>', parser=parser)
+ self.assertEquals('a', tree_valid.getroot().tag)
+
+ self.assertRaises(etree.XMLSyntaxError,
+ self.parse, '<a><c></c></a>', parser=parser)
+
def test_xmlschema_elementtree_error(self):
self.assertRaises(ValueError, etree.XMLSchema, etree.ElementTree())
Modified: lxml/trunk/src/lxml/xmlparser.pxd
==============================================================================
--- lxml/trunk/src/lxml/xmlparser.pxd (original)
+++ lxml/trunk/src/lxml/xmlparser.pxd Fri Jan 11 09:51:17 2008
@@ -91,6 +91,7 @@
xmlError lastError
xmlNode* node
xmlSAXHandler* sax
+ void* userData
int* spaceTab
int spaceMax
bint html
Modified: lxml/trunk/src/lxml/xmlschema.pxd
==============================================================================
--- lxml/trunk/src/lxml/xmlschema.pxd (original)
+++ lxml/trunk/src/lxml/xmlschema.pxd Fri Jan 11 09:51:17 2008
@@ -1,10 +1,11 @@
-cimport tree
+from xmlparser cimport xmlSAXHandler
from tree cimport xmlDoc
cdef extern from "libxml/xmlschemas.h":
ctypedef struct xmlSchema
ctypedef struct xmlSchemaParserCtxt
+ ctypedef struct xmlSchemaSAXPlugStruct
ctypedef struct xmlSchemaValidCtxt
cdef xmlSchemaValidCtxt* xmlSchemaNewValidCtxt(xmlSchema* schema) nogil
@@ -15,3 +16,9 @@
cdef void xmlSchemaFree(xmlSchema* schema) nogil
cdef void xmlSchemaFreeParserCtxt(xmlSchemaParserCtxt* ctxt) nogil
cdef void xmlSchemaFreeValidCtxt(xmlSchemaValidCtxt* ctxt) nogil
+
+ cdef xmlSchemaSAXPlugStruct* xmlSchemaSAXPlug(xmlSchemaValidCtxt* ctxt,
+ xmlSAXHandler** sax,
+ void** data) nogil
+ cdef int xmlSchemaSAXUnplug(xmlSchemaSAXPlugStruct* sax_plug)
+ cdef int xmlSchemaIsValid(xmlSchemaValidCtxt* ctxt)
Modified: lxml/trunk/src/lxml/xmlschema.pxi
==============================================================================
--- lxml/trunk/src/lxml/xmlschema.pxi (original)
+++ lxml/trunk/src/lxml/xmlschema.pxi Fri Jan 11 09:51:17 2008
@@ -105,8 +105,53 @@
self._error_log.disconnect()
if ret == -1:
- raise XMLSchemaValidateError, "Internal error in XML Schema validation."
+ raise XMLSchemaValidateError(
+ "Internal error in XML Schema validation.")
if ret == 0:
return True
else:
return False
+
+ cdef _ParserSchemaValidationContext _newSaxValidator(self):
+ cdef _ParserSchemaValidationContext context
+ context = NEW_SCHEMA_CONTEXT(_ParserSchemaValidationContext)
+ context._schema = self
+ context._valid_ctxt = NULL
+ context._sax_plug = NULL
+ return context
+
+cdef class _ParserSchemaValidationContext:
+ cdef XMLSchema _schema
+ cdef xmlschema.xmlSchemaValidCtxt* _valid_ctxt
+ cdef xmlschema.xmlSchemaSAXPlugStruct* _sax_plug
+
+ def __dealloc__(self):
+ if self._sax_plug:
+ self.disconnect()
+ if self._valid_ctxt:
+ xmlschema.xmlSchemaFreeValidCtxt(self._valid_ctxt)
+
+ cdef _ParserSchemaValidationContext copy(self):
+ return self._schema._newSaxValidator()
+
+ cdef int connect(self, xmlparser.xmlParserCtxt* c_ctxt) except -1:
+ if self._valid_ctxt is NULL:
+ self._valid_ctxt = xmlschema.xmlSchemaNewValidCtxt(
+ self._schema._c_schema)
+ if self._valid_ctxt is NULL:
+ raise XMLSchemaError, "Failed to create validation context"
+ self._sax_plug = xmlschema.xmlSchemaSAXPlug(
+ self._valid_ctxt, &c_ctxt.sax, &c_ctxt.userData)
+
+ cdef void disconnect(self):
+ xmlschema.xmlSchemaSAXUnplug(self._sax_plug)
+ self._sax_plug = NULL
+
+ cdef bint isvalid(self):
+ if self._valid_ctxt is NULL:
+ return 1 # valid
+ return xmlschema.xmlSchemaIsValid(self._valid_ctxt)
+
+cdef extern from "etree_defs.h":
+ # macro call to 't->tp_new()' for fast instantiation
+ cdef _ParserSchemaValidationContext NEW_SCHEMA_CONTEXT "PY_NEW" (object t)
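
For readers who want to try the new ``schema`` parser keyword from this revision, here is a minimal usage sketch. It is modelled on the doctest added to doc/validation.txt above, but it assumes a current lxml release running on Python 3, and the helper name ``parse_validated`` is hypothetical (it is not part of lxml or of this commit).

```python
from lxml import etree

# Same one-element schema as in the documentation change above:
# the root element <a> must contain an integer.
schema_root = etree.XML(b"""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:integer"/>
</xsd:schema>
""")
schema = etree.XMLSchema(schema_root)

# The parser validates against the schema while it parses.
parser = etree.XMLParser(schema=schema)


def parse_validated(text):
    """Hypothetical helper: return the root element, or None on failure.

    A schema violation surfaces as XMLSyntaxError, the same exception
    that is raised for documents that are not well-formed.
    """
    try:
        return etree.fromstring(text, parser)
    except etree.XMLSyntaxError:
        return None


valid_root = parse_validated("<a>5</a>")
invalid_root = parse_validated("<a>not int</a>")
```

Because both kinds of failure raise the same exception type, callers that need to tell a validation error from a syntax error would have to inspect the exception message or the parser's ``error_log``.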