[Lxml-checkins] r39292 - in lxml/trunk: . doc
scoder at codespeak.net
Wed Feb 21 16:43:47 CET 2007
Author: scoder
Date: Wed Feb 21 16:43:46 2007
New Revision: 39292
Added:
lxml/trunk/doc/parsing.txt
- copied, changed from r39233, lxml/trunk/doc/api.txt
lxml/trunk/doc/validation.txt
- copied, changed from r39233, lxml/trunk/doc/api.txt
lxml/trunk/doc/xpathxslt.txt
- copied, changed from r39233, lxml/trunk/doc/api.txt
Modified:
lxml/trunk/CHANGES.txt
lxml/trunk/doc/api.txt
lxml/trunk/doc/main.txt
lxml/trunk/doc/mkhtml.py
Log:
first take on a major split-up of api.txt
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Wed Feb 21 16:43:46 2007
@@ -17,6 +17,11 @@
* The pattern for attribute names in ObjectPath was too restrictive
+Other changes
+-------------
+
+* major restructuring in the documentation
+
1.2 (2007-02-20)
================
Modified: lxml/trunk/doc/api.txt
==============================================================================
--- lxml/trunk/doc/api.txt (original)
+++ lxml/trunk/doc/api.txt Wed Feb 21 16:43:46 2007
@@ -4,23 +4,35 @@
lxml tries to follow established APIs wherever possible. Sometimes, however,
the need to expose a feature in an easy way led to the invention of a new API.
+This page describes the major differences and a few additions to the main
+ElementTree API.
+
+Separate pages describe the support for `parsing XML`_, executing `XPath and
+XSLT`_, `validating XML`_ and interfacing with other XML tools through the
+`SAX-API`_.
+
+lxml is extremely extensible through `XPath functions in Python`_, custom
+`Python element classes`_, custom `URL resolvers`_ and even `at the C-level`_.
+
+.. _`parsing XML`: parsing.html
+.. _`XPath and XSLT`: xpathxslt.html
+.. _`validating XML`: validation.html
+.. _`SAX-API`: sax.html
+.. _`XPath functions in Python`: extensions.html
+.. _`Python element classes`: element_classes.html
+.. _`at the C-level`: capi.html
+.. _`URL resolvers`: resolvers.txt
+
.. contents::
..
- 1 lxml.etree
- 2 Other Element APIs
- 3 Trees and Documents
- 4 Iteration
- 5 Parsers
- 6 iterparse and iterwalk
- 7 Error handling on exceptions
- 8 Python unicode strings
- 9 XPath
- 10 XSLT
- 11 RelaxNG
- 12 XMLSchema
- 13 xinclude
- 14 write_c14n on ElementTree
+ 1 lxml.etree
+ 2 Other Element APIs
+ 3 Trees and Documents
+ 4 Iteration
+ 5 Error handling on exceptions
+ 6 xinclude
+ 7 write_c14n on ElementTree
lxml.etree
@@ -167,208 +179,9 @@
['d']
See also the section on the utility functions ``iterparse()`` and
-``iterwalk()`` below.
-
-
-Parsers
--------
+``iterwalk()`` in the `parser documentation`_.
-One of the differences is the parser. There is support for both XML and
-(broken) HTML. Both are based on libxml2 and therefore only support options
-that are backed by the library. Parsers take a number of keyword arguments.
-The following is an example for namespace cleanup during parsing, first with
-the default parser, then with a parametrized one::
-
- >>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
-
- >>> et = etree.parse(StringIO(xml))
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b xmlns="test"/></a>
-
- >>> parser = etree.XMLParser(ns_clean=True)
- >>> et = etree.parse(StringIO(xml), parser)
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b/></a>
-
-HTML parsing is similarly simple. The parsers have a ``recover`` keyword
-argument that the HTMLParser sets by default. It lets libxml2 try its best to
-return something usable without raising an exception. You should use libxml2
-version 2.6.21 or newer to take advantage of this feature::
-
- >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
-
- >>> parser = etree.HTMLParser()
- >>> et = etree.parse(StringIO(broken_html), parser)
-
- >>> print etree.tostring(et.getroot())
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-Lxml has an HTML function, similar to the XML shortcut known from
-ElementTree::
-
- >>> html = etree.HTML(broken_html)
- >>> print etree.tostring(html)
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-The support for parsing broken HTML depends entirely on libxml2's recovery
-algorithm. It is *not* the fault of lxml if you find documents that are so
-heavily broken that the parser cannot handle them. There is also no guarantee
-that the resulting tree will contain all data from the original document. The
-parser may have to drop seriously broken parts when struggling to keep
-parsing. Especially misplaced meta tags can suffer from this, which may lead
-to encoding problems.
-
-The use of the libxml2 parsers makes some additional information available at
-the API level. Currently, ElementTree objects can access the DOCTYPE
-information provided by a parsed document, as well as the XML version and the
-original encoding::
-
- >>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
- >>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
- >>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
- >>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
- >>> xhtml = xml_header + doctype_string + '<html><body></body></html>'
-
- >>> tree = etree.parse(StringIO(xhtml))
- >>> docinfo = tree.docinfo
- >>> print docinfo.public_id
- -//W3C//DTD XHTML 1.0 Transitional//EN
- >>> print docinfo.system_url
- http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
- >>> docinfo.doctype == doctype_string
- True
-
- >>> print docinfo.xml_version
- 1.0
- >>> print docinfo.encoding
- ascii
-
-
-iterparse and iterwalk
-----------------------
-
-As known from ElementTree, the ``iterparse()`` utility function returns an
-iterator that generates parser events for an XML file (or file-like object),
-while building the tree. The values are tuples ``(event-type, object)``. The
-event types are 'start', 'end', 'start-ns' and 'end-ns'.
-
-The 'start' and 'end' events represent opening and closing elements and are
-accompanied by the respective element. By default, only 'end' events are
-generated::
-
- >>> xml = '''\
- ... <root>
- ... <element key='value'>text</element>
- ... <element>text</element>tail
- ... <empty-element xmlns="testns" />
- ... </root>
- ... '''
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
- end {testns}empty-element
- end root
-
-The resulting tree is available through the ``root`` property of the iterator::
-
- >>> context.root.tag
- 'root'
-
-The other types can be activated with the ``events`` keyword argument::
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, elem in context:
- ... print action, elem.tag
- start root
- start element
- end element
- start element
- end element
- start {testns}empty-element
- end {testns}empty-element
- end root
-
-You can modify the element and its descendants when handling the 'end' event.
-To save memory, for example, you can remove subtrees that are no longer
-needed::
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print len(elem),
- ... elem.clear()
- 0 0 0 3
- >>> context.root.getchildren()
- []
-
-**WARNING**: During the 'start' event, the descendants and following siblings
-are not yet available and should not be accessed. During the 'end' event, the
-element and its descendants can be freely modified, but its following siblings
-should not be accessed. During either of the two events, you **must not**
-modify or move the ancestors (parents) of the current element. You should
-also avoid moving or discarding the element itself. The golden rule is: do
-not touch anything that will have to be touched again by the parser later on.
-
-If you have elements with a long list of children in your XML file and want to
-save more memory during parsing, you can clean up the preceding siblings of
-the current element::
-
- >>> for event, element in etree.iterparse(StringIO(xml)):
- ... # ... do something with the element
- ... element.clear() # clean up children
- ... if element.getprevious(): # clean up preceding siblings
- ... del element.getparent()[0]
-
-You can use ``while`` instead of ``if`` if you skipped siblings using the
-``tag`` keyword argument. The more selective your tag is, however, the more
-thought you will have to put into finding the right way to clean up the
-elements that were skipped. Therefore, it is sometimes easier to traverse all
-elements and do the tag selection by hand in the event handler code.
-
-The 'start-ns' and 'end-ns' events notify about namespace declarations and
-generate tuples ``(prefix, URI)``::
-
- >>> events = ("start-ns", "end-ns")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, obj in context:
- ... print action, obj
- start-ns ('', 'testns')
- end-ns None
-
-It is common practice to use a list as namespace stack and pop the last entry
-on the 'end-ns' event.
-
-lxml.etree supports two extensions compared to ElementTree. It accepts a
-``tag`` keyword argument just like ``element.getiterator(tag)``. This
-restricts events to a specific tag or namespace.
-
- >>> context = etree.iterparse(StringIO(xml), tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
- >>> for action, elem in context:
- ... print action, elem.tag
- start {testns}empty-element
- end {testns}empty-element
-
-The second extension is the ``iterwalk()`` function. It behaves exactly like
-``iterparse()``, but works on Elements and ElementTrees::
-
- >>> root = context.root
- >>> context = etree.iterwalk(root, events=events, tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- start element
- end element
- start element
- end element
+.. _`parser documentation`: parsing.html#iterparse-and-iterwalk
Error handling on exceptions
@@ -415,467 +228,6 @@
etc. which are described in their respective sections below.
-Python unicode strings
-----------------------
-
-lxml.etree has broader support for Python unicode strings than the ElementTree
-library. First of all, where ElementTree would raise an exception, the
-parsers in lxml.etree can handle unicode strings straight away. This is most
-helpful for XML snippets embedded in source code using the ``XML()``
-function::
-
- >>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
- >>> uxml
- u'<test> \uf8d1 + \uf8d2 </test>'
- >>> root = etree.XML(uxml)
-
-This requires, however, that unicode strings do not specify a conflicting
-encoding themselves and thus lie about their real encoding::
-
- >>> etree.XML(u'<?xml version="1.0" encoding="ASCII"?>\n' + uxml)
- Traceback (most recent call last):
- ...
- ValueError: Unicode strings with encoding declaration are not supported.
-
-Similarly, you will get errors when you try the same with HTML data in a
-unicode string that specifies a charset in a meta tag of the header. You
-should generally avoid converting XML/HTML data to unicode before passing it
-into the parsers. It is both slower and error prone.
-
-To serialize the result, you would normally use the ``tostring`` module
-function, which serializes to plain ASCII by default or a number of other
-encodings if asked for::
-
- >>> etree.tostring(root)
- '<test>  +  </test>'
-
- >>> etree.tostring(root, 'UTF-8', xml_declaration=False)
- '<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'
-
-As an extension, lxml.etree has a new ``tounicode()`` function that you can
-call on XML tree objects to retrieve a Python unicode representation::
-
- >>> etree.tounicode(root)
- u'<test> \uf8d1 + \uf8d2 </test>'
-
- >>> el = etree.Element("test")
- >>> etree.tounicode(el)
- u'<test/>'
-
- >>> subel = etree.SubElement(el, "subtest")
- >>> etree.tounicode(el)
- u'<test><subtest/></test>'
-
- >>> et = etree.ElementTree(el)
- >>> etree.tounicode(et)
- u'<test><subtest/></test>'
-
-The result of ``tounicode()`` can be treated like any other Python unicode
-string and then passed back into the parsers. However, if you want to save
-the result to a file or pass it over the network, you should use ``write()``
-or ``tostring()`` with an encoding argument (typically UTF-8) to serialize the
-XML. The main reason is that unicode strings returned by ``tounicode()``
-never have an XML declaration and therefore do not specify their encoding.
-These strings are most likely not parsable by other XML libraries.
-
-In contrast, the ``tostring()`` function automatically adds a declaration as
-needed that reflects the encoding of the returned string. This makes it
-possible for other parsers to correctly parse the XML byte stream. Note that
-using ``tostring()`` with UTF-8 is also considerably faster in most cases.
-
-
-XPath
------
-
-lxml.etree supports the simple path syntax of the ``findall()`` etc. methods
-on ElementTree and Element, as known from the original ElementTree library.
-As an extension, these classes also provide an ``xpath()`` method that
-supports expressions in the complete XPath syntax.
-
-There are also specialized XPath evaluator classes that are more efficient for
-frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
-comparison`_ to learn when to use which. Their semantics when used on
-Elements and ElementTrees are the same as for the ``xpath()`` method described
-here.
-
-.. _`performance comparison`: performance.html#xpath
-
-For ElementTree, the xpath method performs a global XPath query against the
-document (if absolute) or against the root node (if relative)::
-
- >>> f = StringIO('<foo><bar></bar></foo>')
- >>> tree = etree.parse(f)
-
- >>> r = tree.xpath('/foo/bar')
- >>> len(r)
- 1
- >>> r[0].tag
- 'bar'
-
- >>> r = tree.xpath('bar')
- >>> r[0].tag
- 'bar'
-
-When ``xpath()`` is used on an element, the XPath expression is evaluated
-against the element (if relative) or against the root tree (if absolute)::
-
- >>> root = tree.getroot()
- >>> r = root.xpath('bar')
- >>> r[0].tag
- 'bar'
-
- >>> bar = root[0]
- >>> r = bar.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
- >>> tree = bar.getroottree()
- >>> r = tree.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
-Optionally, you can provide a ``namespaces`` keyword argument, which should be
-a dictionary mapping the namespace prefixes used in the XPath expression to
-namespace URIs::
-
- >>> f = StringIO('''\
- ... <a:foo xmlns:a="http://codespeak.net/ns/test1"
- ... xmlns:b="http://codespeak.net/ns/test2">
- ... <b:bar>Text</b:bar>
- ... </a:foo>
- ... ''')
- >>> doc = etree.parse(f)
- >>> r = doc.xpath('/t:foo/b:bar', {'t': 'http://codespeak.net/ns/test1',
- ... 'b': 'http://codespeak.net/ns/test2'})
- >>> len(r)
- 1
- >>> r[0].tag
- '{http://codespeak.net/ns/test2}bar'
- >>> r[0].text
- 'Text'
-
-There is also an optional ``extensions`` argument which is used to define
-`extension functions`_ in Python that are local to this evaluation.
-
-.. _`extension functions`: extensions.html
-
-The return values of XPath evaluations vary, depending on the XPath expression
-used:
-
-* True or False, when the XPath expression has a boolean result
-
-* a float, when the XPath expression has a numeric result (integer or float)
-
-* a (unicode) string, when the XPath expression has a string result.
-
-* a list of items, when the XPath expression has a list as result. The items
- may include elements, strings and tuples. Text nodes and attributes in the
- result are returned as strings (the text node content or attribute value).
- Comments are also returned as strings, enclosed by the usual ``<!--`` and
- ``-->`` markers. Namespace declarations are returned as tuples of strings:
- ``(prefix, URI)``.
-
-A related convenience method of ElementTree objects is ``getpath(element)``,
-which returns a structural, absolute XPath expression to find that element::
-
- >>> a = etree.Element("a")
- >>> b = etree.SubElement(a, "b")
- >>> c = etree.SubElement(a, "c")
- >>> d1 = etree.SubElement(c, "d")
- >>> d2 = etree.SubElement(c, "d")
-
- >>> tree = etree.ElementTree(c)
- >>> print tree.getpath(d2)
- /c/d[2]
- >>> tree.xpath(tree.getpath(d2)) == [d2]
- True
-
-
-XSLT
-----
-
-lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
-given an ElementTree object to construct an XSLT transformer::
-
- >>> f = StringIO('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> xslt_doc = etree.parse(f)
- >>> transform = etree.XSLT(xslt_doc)
-
-You can then run the transformation on an ElementTree document by simply
-calling it, and this results in another ElementTree object::
-
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
- >>> result = transform(doc)
-
-The result object can be accessed like a normal ElementTree document::
-
- >>> result.getroot().text
- 'Text'
-
-but, as opposed to normal ElementTree objects, can also be turned into an (XML
-or text) string by applying the str() function::
-
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-The result is always a plain string, encoded as requested by the
-``xsl:output`` element in the stylesheet. If you want a Python unicode string
-instead, you should set this encoding to ``UTF-8`` (unless the `ASCII` default
-is sufficient). This allows you to call the builtin ``unicode()`` function on
-the result::
-
- >>> unicode(result)
- u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-You can use other encodings at the cost of multiple recoding. Encodings that
-are not supported by Python will result in an error::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:output encoding="UCS4"/>
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
-
- >>> result = transform(doc)
- >>> unicode(result)
- Traceback (most recent call last):
- [...]
- LookupError: unknown encoding: UCS4
-
-It is possible to pass parameters, in the form of XPath expressions, to the
-XSLT template::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="$a" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
-
-The parameters are passed as keyword parameters to the transform call. First
-let's try passing in a simple string expression::
-
- >>> result = transform(doc, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-Let's try a non-string XPath expression now::
-
- >>> result = transform(doc, a="/a/b/text()")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-There's also a convenience method on the tree object for doing XSL
-transformations. This is less efficient if you want to apply the same XSL
-transformation to multiple documents, but is shorter to write for one-shot
-operations, as you do not have to instantiate a stylesheet yourself::
-
- >>> result = doc.xslt(xslt_tree, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-By default, XSLT supports all extension functions from libxslt and libexslt as
-well as Python regular expressions through EXSLT. Note that some extensions
-enable style sheets to read and write files on the local file system. See the
-`document loader documentation`_ on how to deal with this.
-
-.. _`document loader documentation`: resolvers.html
-
-If you want to know how your stylesheet performed, pass the ``profile_run``
-keyword to the transform::
-
- >>> result = transform(doc, a="/a/b/text()", profile_run=True)
- >>> profile = result.xslt_profile
-
-The value of the ``xslt_profile`` property is an ElementTree with profiling
-data about each template, similar to the following::
-
- <profile>
- <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
- </profile>
-
-Note that this is a read-only document. You must not move any of its elements
-to other documents. Please deep-copy the document if you need to modify it.
-If you want to free it from memory, just do::
-
- >>> del result.xslt_profile
-
-
-RelaxNG
--------
-
-lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can
-be given an ElementTree object to construct a Relax NG validator::
-
- >>> f = StringIO('''\
- ... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
- ... <zeroOrMore>
- ... <element name="b">
- ... <text />
- ... </element>
- ... </zeroOrMore>
- ... </element>
- ... ''')
- >>> relaxng_doc = etree.parse(f)
- >>> relaxng = etree.RelaxNG(relaxng_doc)
-
-You can then validate some ElementTree document against the schema. You'll get
-back True if the document is valid against the Relax NG schema, and False if
-not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> relaxng.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> relaxng.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not relaxng(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> relaxng.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> relaxng.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Starting with version 0.9, lxml now has a simple API to report the errors
-generated by libxml2. If you want to find out why the validation failed in the
-second case, you can look up the error log of the validation process and check
-it for relevant messages::
-
- >>> log = relaxng.error_log
- >>> print log.last_error
- <string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there
-
-You can see that the error (ERROR) happened during RelaxNG validation
-(RELAXNGV). The message then tells you what went wrong. Note that this error
-is local to the RelaxNG object. It will only contain log entries that
-appeares during the validation. The DocumentInvalid exception raised by the
-``assertValid`` method above provides access to the global error log (like all
-other lxml exceptions).
-
-Similar to XSLT, there's also a less efficient but easier shortcut method to
-do one-shot RelaxNG validation::
-
- >>> doc.relaxng(relaxng_doc)
- 1
- >>> doc2.relaxng(relaxng_doc)
- 0
-
-
-XMLSchema
----------
-
-lxml.etree also has a XML Schema (XSD) support, using the class
-lxml.etree.XMLSchema. This support is very similar to the Relax NG
-support. The class can be given an ElementTree object to construct a
-XMLSchema validator::
-
- >>> f = StringIO('''\
- ... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
- ... <xsd:element name="a" type="AType"/>
- ... <xsd:complexType name="AType">
- ... <xsd:sequence>
- ... <xsd:element name="b" type="xsd:string" />
- ... </xsd:sequence>
- ... </xsd:complexType>
- ... </xsd:schema>
- ... ''')
- >>> xmlschema_doc = etree.parse(f)
- >>> xmlschema = etree.XMLSchema(xmlschema_doc)
-
-You can then validate some ElementTree document with this. Like with
-RelaxNG, you'll get back true if the document is valid against the XML
-schema, and false if not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> xmlschema.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> xmlschema.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not xmlschema(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> xmlschema.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> xmlschema.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Error reporting works like for the RelaxNG class::
-
- >>> log = xmlschema.error_log
- >>> error = log.last_error
- >>> print error.domain_name
- SCHEMASV
- >>> print error.type_name
- SCHEMAV_ELEMENT_CONTENT
-
-If you were to print this log entry, you would get something like the
-following. Note that the error message depends on the libxml2 version in
-use::
-
- <string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).
-
-Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut
-method to do XML Schema validation::
-
- >>> doc.xmlschema(xmlschema_doc)
- 1
- >>> doc2.xmlschema(xmlschema_doc)
- 0
-
-
xinclude
--------
Modified: lxml/trunk/doc/main.txt
==============================================================================
--- lxml/trunk/doc/main.txt (original)
+++ lxml/trunk/doc/main.txt Wed Feb 21 16:43:46 2007
@@ -66,6 +66,8 @@
* `lxml.etree specific API`_ documentation
+ * `XML validation`_ with RelaxNG and XML Schema
+
* Python `extension functions`_ for XPath and XSLT
* `custom element classes`_ for custom XML APIs
@@ -109,6 +111,7 @@
.. _`benchmark results`: performance.html
.. _`compatibility`: compatibility.html
.. _`lxml.etree specific API`: api.html
+.. _`XML validation`: validation.html
.. _`extension functions`: extensions.html
.. _`custom element classes`: element_classes.html
.. _`SAX compliant API`: sax.html
Modified: lxml/trunk/doc/mkhtml.py
==============================================================================
--- lxml/trunk/doc/mkhtml.py (original)
+++ lxml/trunk/doc/mkhtml.py Wed Feb 21 16:43:46 2007
@@ -14,7 +14,8 @@
for name in ['main.txt', 'intro.txt', 'api.txt', 'compatibility.txt',
'extensions.txt', 'element_classes.txt', 'sax.txt',
'build.txt', 'FAQ.txt', 'performance.txt', 'resolvers.txt',
- 'capi.txt', 'objectify.txt']:
+ 'capi.txt', 'objectify.txt', 'validation.txt',
+ 'xpathxslt.txt', 'parsing.txt']:
path = os.path.join(doc_dir, name)
outname = os.path.splitext(name)[0] + '.html'
outpath = os.path.join(dirname, outname)
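Note on the mkhtml.py hunk above: the context lines show that each listed source name is turned into an output name by swapping its extension for ``.html``, so the three newly listed files should be rendered as validation.html, xpathxslt.html and parsing.html. A minimal sketch of that mapping follows. The helper name ``html_targets`` and the directory names ``'doc'`` and ``'html'`` are assumptions made for illustration and do not appear in the script:

```python
import os


def html_targets(doc_dir, out_dir, names):
    """Map each source name to a (source path, HTML output path) pair.

    Hypothetical helper: it mirrors the three path expressions visible
    in the diff context, nothing more.
    """
    targets = []
    for name in names:
        path = os.path.join(doc_dir, name)
        # 'parsing.txt' -> 'parsing.html'
        outname = os.path.splitext(name)[0] + '.html'
        outpath = os.path.join(out_dir, outname)
        targets.append((path, outpath))
    return targets


# The three files this revision adds to the list in mkhtml.py.
new_pages = ['validation.txt', 'xpathxslt.txt', 'parsing.txt']

for src, dst in html_targets('doc', 'html', new_pages):
    print(src, '->', dst)
```

This only illustrates the naming step; how the script actually converts each file to HTML is outside the lines shown in this diff.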
Copied: lxml/trunk/doc/parsing.txt (from r39233, lxml/trunk/doc/api.txt)
==============================================================================
--- lxml/trunk/doc/api.txt (original)
+++ lxml/trunk/doc/parsing.txt Wed Feb 21 16:43:46 2007
@@ -1,173 +1,15 @@
=====================
-APIs specific to lxml
+Parsing XML with lxml
=====================
-lxml tries to follow established APIs wherever possible. Sometimes, however,
-the need to expose a feature in an easy way led to the invention of a new API.
+lxml provides a very simple and powerful API for parsing XML. It supports
+one-step parsing as well as step-by-step parsing using an event-driven API.
.. contents::
..
- 1 lxml.etree
- 2 Other Element APIs
- 3 Trees and Documents
- 4 Iteration
- 5 Parsers
- 6 iterparse and iterwalk
- 7 Error handling on exceptions
- 8 Python unicode strings
- 9 XPath
- 10 XSLT
- 11 RelaxNG
- 12 XMLSchema
- 13 xinclude
- 14 write_c14n on ElementTree
-
-
-lxml.etree
-----------
-
-lxml.etree tries to follow the `ElementTree API`_ wherever it can. There are
-however some incompatibilities (see `compatibility`_). The extensions are
-documented here.
-
-.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
-.. _`compatibility`: compatibility.html
-
-If you need to know which version of lxml is installed, you can access the
-``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple. Note,
-however, that it did not exist before version 1.0, so you will get an
-AttributeError in older versions. The versions of libxml2 and libxslt are
-available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
-
-The following examples usually assume this to be executed first::
-
- >>> from lxml import etree
- >>> from StringIO import StringIO
-
-
-Other Element APIs
-------------------
-
-While lxml.etree itself uses the ElementTree API, it is possible to replace
-the Element implementation by `custom element subclasses`_. This has been
-used to implement well-known XML APIs on top of lxml. The ``lxml.elements``
-package contains examples. Currently, there is a data-binding implementation
-called `objectify`_, which is similar to the `Amara bindery`_ tool.
-
-Additionally, the `lxml.elements.classlookup`_ module provides a number of
-different schemes to customize the mapping between libxml2 nodes and the
-Element classes used by lxml.etree.
-
-.. _`custom element subclasses`: namespace_extensions.html
-.. _`objectify`: objectify.html
-.. _`lxml.elements.classlookup`: elements.html#lxml.elements.classlookup
-.. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/
-
-
-Trees and Documents
--------------------
-
-Compared to the original ElementTree API, lxml.etree has an extended tree
-model. It knows about parents and siblings of elements::
-
- >>> root = etree.Element("root")
- >>> a = etree.SubElement(root, "a")
- >>> b = etree.SubElement(root, "b")
- >>> c = etree.SubElement(root, "c")
- >>> d = etree.SubElement(root, "d")
- >>> e = etree.SubElement(d, "e")
- >>> b.getparent() == root
- True
- >>> print b.getnext().tag
- c
- >>> print c.getprevious().tag
- b
-
-Elements always live within a document context in lxml. This implies that
-there is also a notion of an absolute document root. You can retrieve an
-ElementTree for the root node of a document from any of its elements::
-
- >>> tree = d.getroottree()
- >>> print tree.getroot().tag
- root
-
-Note that this is different from wrapping an Element in an ElementTree. You
-can use ElementTrees to create XML trees with an explicit root node::
-
- >>> tree = etree.ElementTree(d)
- >>> print tree.getroot().tag
- d
- >>> print etree.tostring(tree)
- <d><e/></d>
-
-All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
-will understand the explicitly chosen root as root node of a document. They
-will not see any elements outside the ElementTree. However, ElementTrees do
-not modify their Elements::
-
- >>> element = tree.getroot()
- >>> print element.tag
- d
- >>> print element.getparent().tag
- root
- >>> print element.getroottree().getroot().tag
- root
-
-The rule is that all operations that are applied to Elements use either the
-Element itself as reference point, or the absolute root of the document that
-contains this Element (e.g. for absolute XPath expressions). All operations
-on an ElementTree use its explicit root node as reference.
-
-
-Iteration
----------
-
-The ElementTree API makes Elements iterable to supports iteration over their
-children. Using the tree defined above, we get::
-
- >>> [ el.tag for el in root ]
- ['a', 'b', 'c', 'd']
-
-Tree traversal is commonly based on the ``element.getiterator()`` method::
-
- >>> [ el.tag for el in root.getiterator() ]
- ['root', 'a', 'b', 'c', 'd', 'e']
-
-lxml.etree also supports this, but additionally features an extended API for
-iteration over the children, following/preceding siblings, ancestors and
-descendants of an element, as defined by the respective XPath axis::
-
- >>> [ el.tag for el in root.iterchildren() ]
- ['a', 'b', 'c', 'd']
- >>> [ el.tag for el in root.iterchildren(reversed=True) ]
- ['d', 'c', 'b', 'a']
- >>> [ el.tag for el in b.itersiblings() ]
- ['c', 'd']
- >>> [ el.tag for el in c.itersiblings(preceding=True) ]
- ['b', 'a']
- >>> [ el.tag for el in e.iterancestors() ]
- ['d', 'root']
- >>> [ el.tag for el in root.iterdescendants() ]
- ['a', 'b', 'c', 'd', 'e']
-
-Note how ``element.iterdescendants()`` does not include the element itself, as
-opposed to ``element.getiterator()``. The latter effectively implements the
-'descendant-or-self' axis in XPath.
-
-All of these iterators support an additional ``tag`` keyword argument that
-filters the generated elements by tag name::
-
- >>> [ el.tag for el in root.iterchildren(tag='a') ]
- ['a']
- >>> [ el.tag for el in d.iterchildren(tag='a') ]
- []
- >>> [ el.tag for el in root.iterdescendants(tag='d') ]
- ['d']
- >>> [ el.tag for el in root.getiterator(tag='d') ]
- ['d']
-
-See also the section on the utility functions ``iterparse()`` and
-``iterwalk()`` below.
+ 1 Parsers
+ 2 iterparse and iterwalk
+ 3 Python unicode strings
Parsers
@@ -371,50 +213,6 @@
end element
-Error handling on exceptions
-----------------------------
-
-Libxml2 provides error messages for failures, be it during parsing, XPath
-evaluation or schema validation. Whenever an exception is raised, you can
-retrieve the errors that occurred and "might have" led to the problem::
-
- >>> etree.clearErrorLog()
- >>> broken_xml = '<a>'
- >>> try:
- ... etree.parse(StringIO(broken_xml))
- ... except etree.XMLSyntaxError, e:
- ... pass # just put the exception into e
- >>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
- >>> print log
- <string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
-
-This might look a little cryptic at first, but it is the information that
-libxml2 gives you. At least the message at the end should give you a hint
-what went wrong and you can see that the fatal error (FATAL) happened during
-parsing (PARSER) line 1 of a string (<string>, or filename if available).
-Here, PARSER is the so-called error domain, see lxml.etree.ErrorDomains for
-that. You can get it from a log entry like this::
-
- >>> entry = log[0]
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-There is also a convenience attribute ``last_error`` that returns the last
-error or fatal error that occurred::
-
- >>> entry = e.error_log.last_error
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-Alternatively, lxml.etree supports logging libxml2 messages to the Python
-stdlib logging module. This is done through the ``etree.PyErrorLog`` class.
-It disables the error reporting from exceptions and forwards log messages to a
-Python logger. To use it, see the descriptions of the function
-``etree.useGlobalPythonLog`` and the class ``etree.PyErrorLog`` for help.
-Note that this does not affect the local error logs of XSLT, XMLSchema,
-etc. which are described in their respective sections below.
-
-
Python unicode strings
----------------------
@@ -482,429 +280,3 @@
needed that reflects the encoding of the returned string. This makes it
possible for other parsers to correctly parse the XML byte stream. Note that
using ``tostring()`` with UTF-8 is also considerably faster in most cases.
-
-
-XPath
------
-
-lxml.etree supports the simple path syntax of the ``findall()`` etc. methods
-on ElementTree and Element, as known from the original ElementTree library.
-As an extension, these classes also provide an ``xpath()`` method that
-supports expressions in the complete XPath syntax.
-
-There are also specialized XPath evaluator classes that are more efficient for
-frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
-comparison`_ to learn when to use which. Their semantics when used on
-Elements and ElementTrees are the same as for the ``xpath()`` method described
-here.
-
-.. _`performance comparison`: performance.html#xpath
-
-For ElementTree, the xpath method performs a global XPath query against the
-document (if absolute) or against the root node (if relative)::
-
- >>> f = StringIO('<foo><bar></bar></foo>')
- >>> tree = etree.parse(f)
-
- >>> r = tree.xpath('/foo/bar')
- >>> len(r)
- 1
- >>> r[0].tag
- 'bar'
-
- >>> r = tree.xpath('bar')
- >>> r[0].tag
- 'bar'
-
-When ``xpath()`` is used on an element, the XPath expression is evaluated
-against the element (if relative) or against the root tree (if absolute)::
-
- >>> root = tree.getroot()
- >>> r = root.xpath('bar')
- >>> r[0].tag
- 'bar'
-
- >>> bar = root[0]
- >>> r = bar.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
- >>> tree = bar.getroottree()
- >>> r = tree.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
-Optionally, you can provide a ``namespaces`` keyword argument, which should be
-a dictionary mapping the namespace prefixes used in the XPath expression to
-namespace URIs::
-
- >>> f = StringIO('''\
- ... <a:foo xmlns:a="http://codespeak.net/ns/test1"
- ... xmlns:b="http://codespeak.net/ns/test2">
- ... <b:bar>Text</b:bar>
- ... </a:foo>
- ... ''')
- >>> doc = etree.parse(f)
- >>> r = doc.xpath('/t:foo/b:bar', {'t': 'http://codespeak.net/ns/test1',
- ... 'b': 'http://codespeak.net/ns/test2'})
- >>> len(r)
- 1
- >>> r[0].tag
- '{http://codespeak.net/ns/test2}bar'
- >>> r[0].text
- 'Text'
-
-There is also an optional ``extensions`` argument which is used to define
-`extension functions`_ in Python that are local to this evaluation.
-
-.. _`extension functions`: extensions.html
-
-The return values of XPath evaluations vary, depending on the XPath expression
-used:
-
-* True or False, when the XPath expression has a boolean result
-
-* a float, when the XPath expression has a numeric result (integer or float)
-
-* a (unicode) string, when the XPath expression has a string result.
-
-* a list of items, when the XPath expression has a list as result. The items
- may include elements, strings and tuples. Text nodes and attributes in the
- result are returned as strings (the text node content or attribute value).
- Comments are also returned as strings, enclosed by the usual ``<!--`` and
- ``-->`` markers. Namespace declarations are returned as tuples of strings:
- ``(prefix, URI)``.
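
The result types listed above can be checked directly. The following is a minimal sketch, assuming a current lxml release on Python 3; the sample document and the variable names are made up for this illustration and are not part of the original text:

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
root = etree.XML('<a><b id="1">Text</b></a>')

is_present = root.xpath('boolean(b)')   # boolean result
count = root.xpath('count(b)')          # numeric result, returned as a float
text = root.xpath('string(b)')          # string result
ids = root.xpath('b/@id')               # attribute values come back as strings
texts = root.xpath('b/text()')          # text nodes come back as strings
elements = root.xpath('b')              # elements come back as Element objects
```

Comparing the list results with ``==`` works because the returned items behave like ordinary strings.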
-
-A related convenience method of ElementTree objects is ``getpath(element)``,
-which returns a structural, absolute XPath expression to find that element::
-
- >>> a = etree.Element("a")
- >>> b = etree.SubElement(a, "b")
- >>> c = etree.SubElement(a, "c")
- >>> d1 = etree.SubElement(c, "d")
- >>> d2 = etree.SubElement(c, "d")
-
- >>> tree = etree.ElementTree(c)
- >>> print tree.getpath(d2)
- /c/d[2]
- >>> tree.xpath(tree.getpath(d2)) == [d2]
- True
-
-
-XSLT
-----
-
-lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
-given an ElementTree object to construct an XSLT transformer::
-
- >>> f = StringIO('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> xslt_doc = etree.parse(f)
- >>> transform = etree.XSLT(xslt_doc)
-
-You can then run the transformation on an ElementTree document by simply
-calling it, and this results in another ElementTree object::
-
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
- >>> result = transform(doc)
-
-The result object can be accessed like a normal ElementTree document::
-
- >>> result.getroot().text
- 'Text'
-
-but, as opposed to normal ElementTree objects, can also be turned into an (XML
-or text) string by applying the str() function::
-
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-The result is always a plain string, encoded as requested by the
-``xsl:output`` element in the stylesheet. If you want a Python unicode string
-instead, you should set this encoding to ``UTF-8`` (unless the `ASCII` default
-is sufficient). This allows you to call the builtin ``unicode()`` function on
-the result::
-
- >>> unicode(result)
- u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-You can use other encodings at the cost of multiple recoding. Encodings that
-are not supported by Python will result in an error::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:output encoding="UCS4"/>
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
-
- >>> result = transform(doc)
- >>> unicode(result)
- Traceback (most recent call last):
- [...]
- LookupError: unknown encoding: UCS4
-
-It is possible to pass parameters, in the form of XPath expressions, to the
-XSLT template::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="$a" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
-
-The parameters are passed as keyword parameters to the transform call. First
-let's try passing in a simple string expression::
-
- >>> result = transform(doc, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-Let's try a non-string XPath expression now::
-
- >>> result = transform(doc, a="/a/b/text()")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-There's also a convenience method on the tree object for doing XSL
-transformations. This is less efficient if you want to apply the same XSL
-transformation to multiple documents, but is shorter to write for one-shot
-operations, as you do not have to instantiate a stylesheet yourself::
-
- >>> result = doc.xslt(xslt_tree, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-By default, XSLT supports all extension functions from libxslt and libexslt as
-well as Python regular expressions through EXSLT. Note that some extensions
-enable style sheets to read and write files on the local file system. See the
-`document loader documentation`_ on how to deal with this.
-
-.. _`document loader documentation`: resolvers.html
-
-If you want to know how your stylesheet performed, pass the ``profile_run``
-keyword to the transform::
-
- >>> result = transform(doc, a="/a/b/text()", profile_run=True)
- >>> profile = result.xslt_profile
-
-The value of the ``xslt_profile`` property is an ElementTree with profiling
-data about each template, similar to the following::
-
- <profile>
- <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
- </profile>
-
-Note that this is a read-only document. You must not move any of its elements
-to other documents. Please deep-copy the document if you need to modify it.
-If you want to free it from memory, just do::
-
- >>> del result.xslt_profile
-
-
-RelaxNG
--------
-
-lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can
-be given an ElementTree object to construct a Relax NG validator::
-
- >>> f = StringIO('''\
- ... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
- ... <zeroOrMore>
- ... <element name="b">
- ... <text />
- ... </element>
- ... </zeroOrMore>
- ... </element>
- ... ''')
- >>> relaxng_doc = etree.parse(f)
- >>> relaxng = etree.RelaxNG(relaxng_doc)
-
-You can then validate some ElementTree document against the schema. You'll get
-back True if the document is valid against the Relax NG schema, and False if
-not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> relaxng.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> relaxng.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not relaxng(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> relaxng.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> relaxng.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Starting with version 0.9, lxml now has a simple API to report the errors
-generated by libxml2. If you want to find out why the validation failed in the
-second case, you can look up the error log of the validation process and check
-it for relevant messages::
-
- >>> log = relaxng.error_log
- >>> print log.last_error
- <string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there
-
-You can see that the error (ERROR) happened during RelaxNG validation
-(RELAXNGV). The message then tells you what went wrong. Note that this error
-is local to the RelaxNG object. It will only contain log entries that
-appeared during the validation. The DocumentInvalid exception raised by the
-``assertValid`` method above provides access to the global error log (like all
-other lxml exceptions).
-
-Similar to XSLT, there's also a less efficient but easier shortcut method to
-do one-shot RelaxNG validation::
-
- >>> doc.relaxng(relaxng_doc)
- 1
- >>> doc2.relaxng(relaxng_doc)
- 0
-
-
-XMLSchema
----------
-
-lxml.etree also has XML Schema (XSD) support, using the class
-lxml.etree.XMLSchema. This support is very similar to the Relax NG
-support. The class can be given an ElementTree object to construct an
-XMLSchema validator::
-
- >>> f = StringIO('''\
- ... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
- ... <xsd:element name="a" type="AType"/>
- ... <xsd:complexType name="AType">
- ... <xsd:sequence>
- ... <xsd:element name="b" type="xsd:string" />
- ... </xsd:sequence>
- ... </xsd:complexType>
- ... </xsd:schema>
- ... ''')
- >>> xmlschema_doc = etree.parse(f)
- >>> xmlschema = etree.XMLSchema(xmlschema_doc)
-
-You can then validate some ElementTree document with this. Like with
-RelaxNG, you'll get back true if the document is valid against the XML
-schema, and false if not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> xmlschema.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> xmlschema.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not xmlschema(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> xmlschema.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> xmlschema.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Error reporting works like for the RelaxNG class::
-
- >>> log = xmlschema.error_log
- >>> error = log.last_error
- >>> print error.domain_name
- SCHEMASV
- >>> print error.type_name
- SCHEMAV_ELEMENT_CONTENT
-
-If you were to print this log entry, you would get something like the
-following. Note that the error message depends on the libxml2 version in
-use::
-
- <string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).
-
-Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut
-method to do XML Schema validation::
-
- >>> doc.xmlschema(xmlschema_doc)
- 1
- >>> doc2.xmlschema(xmlschema_doc)
- 0
-
-
-xinclude
---------
-
-Simple XInclude support exists. You can let lxml process xinclude statements
-in a document by calling the xinclude() method on a tree::
-
- >>> data = StringIO('''\
- ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
- ... <foo/>
- ... <xi:include href="doc/test.xml" />
- ... </doc>''')
-
- >>> tree = etree.parse(data)
- >>> tree.xinclude()
- >>> etree.tostring(tree.getroot())
- '<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'
-
-
-write_c14n on ElementTree
--------------------------
-
-The lxml.etree.ElementTree class has a method write_c14n, which takes a file
-object as argument. This file object will receive a UTF-8 representation of
-the canonicalized form of the XML, following the W3C C14N recommendation. For
-example::
-
- >>> f = StringIO('<a><b/></a>')
- >>> tree = etree.parse(f)
- >>> f2 = StringIO()
- >>> tree.write_c14n(f2)
- >>> f2.getvalue()
- '<a><b></b></a>'
Copied: lxml/trunk/doc/validation.txt (from r39233, lxml/trunk/doc/api.txt)
==============================================================================
--- lxml/trunk/doc/api.txt (original)
+++ lxml/trunk/doc/validation.txt Wed Feb 21 16:43:46 2007
@@ -1,719 +1,18 @@
-=====================
-APIs specific to lxml
-=====================
+====================
+Validation with lxml
+====================
+
+Apart from DTD support in the parsers, lxml currently supports two schema
+languages: `Relax NG`_ and `XML Schema`_. Both provide identical APIs,
+represented by a validator class with the obvious names.
-lxml tries to follow established APIs wherever possible. Sometimes, however,
-the need to expose a feature in an easy way led to the invention of a new API.
+.. _`Relax NG`: http://www.relaxng.org/
+.. _`XML Schema`: http://www.w3.org/XML/Schema
.. contents::
..
- 1 lxml.etree
- 2 Other Element APIs
- 3 Trees and Documents
- 4 Iteration
- 5 Parsers
- 6 iterparse and iterwalk
- 7 Error handling on exceptions
- 8 Python unicode strings
- 9 XPath
- 10 XSLT
- 11 RelaxNG
- 12 XMLSchema
- 13 xinclude
- 14 write_c14n on ElementTree
-
-
-lxml.etree
-----------
-
-lxml.etree tries to follow the `ElementTree API`_ wherever it can. There are
-however some incompatibilities (see `compatibility`_). The extensions are
-documented here.
-
-.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
-.. _`compatibility`: compatibility.html
-
-If you need to know which version of lxml is installed, you can access the
-``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple. Note,
-however, that it did not exist before version 1.0, so you will get an
-AttributeError in older versions. The versions of libxml2 and libxslt are
-available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
-
-The following examples usually assume this to be executed first::
-
- >>> from lxml import etree
- >>> from StringIO import StringIO
-
-
-Other Element APIs
-------------------
-
-While lxml.etree itself uses the ElementTree API, it is possible to replace
-the Element implementation by `custom element subclasses`_. This has been
-used to implement well-known XML APIs on top of lxml. The ``lxml.elements``
-package contains examples. Currently, there is a data-binding implementation
-called `objectify`_, which is similar to the `Amara bindery`_ tool.
-
-Additionally, the `lxml.elements.classlookup`_ module provides a number of
-different schemes to customize the mapping between libxml2 nodes and the
-Element classes used by lxml.etree.
-
-.. _`custom element subclasses`: namespace_extensions.html
-.. _`objectify`: objectify.html
-.. _`lxml.elements.classlookup`: elements.html#lxml.elements.classlookup
-.. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/
-
-
-Trees and Documents
--------------------
-
-Compared to the original ElementTree API, lxml.etree has an extended tree
-model. It knows about parents and siblings of elements::
-
- >>> root = etree.Element("root")
- >>> a = etree.SubElement(root, "a")
- >>> b = etree.SubElement(root, "b")
- >>> c = etree.SubElement(root, "c")
- >>> d = etree.SubElement(root, "d")
- >>> e = etree.SubElement(d, "e")
- >>> b.getparent() == root
- True
- >>> print b.getnext().tag
- c
- >>> print c.getprevious().tag
- b
-
-Elements always live within a document context in lxml. This implies that
-there is also a notion of an absolute document root. You can retrieve an
-ElementTree for the root node of a document from any of its elements::
-
- >>> tree = d.getroottree()
- >>> print tree.getroot().tag
- root
-
-Note that this is different from wrapping an Element in an ElementTree. You
-can use ElementTrees to create XML trees with an explicit root node::
-
- >>> tree = etree.ElementTree(d)
- >>> print tree.getroot().tag
- d
- >>> print etree.tostring(tree)
- <d><e/></d>
-
-All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
-will understand the explicitly chosen root as root node of a document. They
-will not see any elements outside the ElementTree. However, ElementTrees do
-not modify their Elements::
-
- >>> element = tree.getroot()
- >>> print element.tag
- d
- >>> print element.getparent().tag
- root
- >>> print element.getroottree().getroot().tag
- root
-
-The rule is that all operations that are applied to Elements use either the
-Element itself as reference point, or the absolute root of the document that
-contains this Element (e.g. for absolute XPath expressions). All operations
-on an ElementTree use its explicit root node as reference.
-
-
-Iteration
----------
-
-The ElementTree API makes Elements iterable to support iteration over their
-children. Using the tree defined above, we get::
-
- >>> [ el.tag for el in root ]
- ['a', 'b', 'c', 'd']
-
-Tree traversal is commonly based on the ``element.getiterator()`` method::
-
- >>> [ el.tag for el in root.getiterator() ]
- ['root', 'a', 'b', 'c', 'd', 'e']
-
-lxml.etree also supports this, but additionally features an extended API for
-iteration over the children, following/preceding siblings, ancestors and
-descendants of an element, as defined by the respective XPath axis::
-
- >>> [ el.tag for el in root.iterchildren() ]
- ['a', 'b', 'c', 'd']
- >>> [ el.tag for el in root.iterchildren(reversed=True) ]
- ['d', 'c', 'b', 'a']
- >>> [ el.tag for el in b.itersiblings() ]
- ['c', 'd']
- >>> [ el.tag for el in c.itersiblings(preceding=True) ]
- ['b', 'a']
- >>> [ el.tag for el in e.iterancestors() ]
- ['d', 'root']
- >>> [ el.tag for el in root.iterdescendants() ]
- ['a', 'b', 'c', 'd', 'e']
-
-Note how ``element.iterdescendants()`` does not include the element itself, as
-opposed to ``element.getiterator()``. The latter effectively implements the
-'descendant-or-self' axis in XPath.
-
-All of these iterators support an additional ``tag`` keyword argument that
-filters the generated elements by tag name::
-
- >>> [ el.tag for el in root.iterchildren(tag='a') ]
- ['a']
- >>> [ el.tag for el in d.iterchildren(tag='a') ]
- []
- >>> [ el.tag for el in root.iterdescendants(tag='d') ]
- ['d']
- >>> [ el.tag for el in root.getiterator(tag='d') ]
- ['d']
-
-See also the section on the utility functions ``iterparse()`` and
-``iterwalk()`` below.
-
-
-Parsers
--------
-
-One of the differences is the parser. There is support for both XML and
-(broken) HTML. Both are based on libxml2 and therefore only support options
-that are backed by the library. Parsers take a number of keyword arguments.
-The following is an example for namespace cleanup during parsing, first with
-the default parser, then with a parametrized one::
-
- >>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
-
- >>> et = etree.parse(StringIO(xml))
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b xmlns="test"/></a>
-
- >>> parser = etree.XMLParser(ns_clean=True)
- >>> et = etree.parse(StringIO(xml), parser)
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b/></a>
-
-HTML parsing is similarly simple. The parsers have a ``recover`` keyword
-argument that the HTMLParser sets by default. It lets libxml2 try its best to
-return something usable without raising an exception. You should use libxml2
-version 2.6.21 or newer to take advantage of this feature::
-
- >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
-
- >>> parser = etree.HTMLParser()
- >>> et = etree.parse(StringIO(broken_html), parser)
-
- >>> print etree.tostring(et.getroot())
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-Lxml has an HTML function, similar to the XML shortcut known from
-ElementTree::
-
- >>> html = etree.HTML(broken_html)
- >>> print etree.tostring(html)
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-The support for parsing broken HTML depends entirely on libxml2's recovery
-algorithm. It is *not* the fault of lxml if you find documents that are so
-heavily broken that the parser cannot handle them. There is also no guarantee
-that the resulting tree will contain all data from the original document. The
-parser may have to drop seriously broken parts when struggling to keep
-parsing. Especially misplaced meta tags can suffer from this, which may lead
-to encoding problems.
-
-The use of the libxml2 parsers makes some additional information available at
-the API level. Currently, ElementTree objects can access the DOCTYPE
-information provided by a parsed document, as well as the XML version and the
-original encoding::
-
- >>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
- >>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
- >>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
- >>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
- >>> xhtml = xml_header + doctype_string + '<html><body></body></html>'
-
- >>> tree = etree.parse(StringIO(xhtml))
- >>> docinfo = tree.docinfo
- >>> print docinfo.public_id
- -//W3C//DTD XHTML 1.0 Transitional//EN
- >>> print docinfo.system_url
- http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
- >>> docinfo.doctype == doctype_string
- True
-
- >>> print docinfo.xml_version
- 1.0
- >>> print docinfo.encoding
- ascii
-
-
-iterparse and iterwalk
-----------------------
-
-As known from ElementTree, the ``iterparse()`` utility function returns an
-iterator that generates parser events for an XML file (or file-like object),
-while building the tree. The values are tuples ``(event-type, object)``. The
-event types are 'start', 'end', 'start-ns' and 'end-ns'.
-
-The 'start' and 'end' events represent opening and closing elements and are
-accompanied by the respective element. By default, only 'end' events are
-generated::
-
- >>> xml = '''\
- ... <root>
- ... <element key='value'>text</element>
- ... <element>text</element>tail
- ... <empty-element xmlns="testns" />
- ... </root>
- ... '''
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
- end {testns}empty-element
- end root
-
-The resulting tree is available through the ``root`` property of the iterator::
-
- >>> context.root.tag
- 'root'
-
-The other types can be activated with the ``events`` keyword argument::
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, elem in context:
- ... print action, elem.tag
- start root
- start element
- end element
- start element
- end element
- start {testns}empty-element
- end {testns}empty-element
- end root
-
-You can modify the element and its descendants when handling the 'end' event.
-To save memory, for example, you can remove subtrees that are no longer
-needed::
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print len(elem),
- ... elem.clear()
- 0 0 0 3
- >>> context.root.getchildren()
- []
-
-**WARNING**: During the 'start' event, the descendants and following siblings
-are not yet available and should not be accessed. During the 'end' event, the
-element and its descendants can be freely modified, but its following siblings
-should not be accessed. During either of the two events, you **must not**
-modify or move the ancestors (parents) of the current element. You should
-also avoid moving or discarding the element itself. The golden rule is: do
-not touch anything that will have to be touched again by the parser later on.
-
-If you have elements with a long list of children in your XML file and want to
-save more memory during parsing, you can clean up the preceding siblings of
-the current element::
-
- >>> for event, element in etree.iterparse(StringIO(xml)):
- ... # ... do something with the element
- ... element.clear() # clean up children
- ... if element.getprevious() is not None: # clean up preceding siblings
- ... del element.getparent()[0]
-
-You can use ``while`` instead of ``if`` if you skipped siblings using the
-``tag`` keyword argument. The more selective your tag is, however, the more
-thought you will have to put into finding the right way to clean up the
-elements that were skipped. Therefore, it is sometimes easier to traverse all
-elements and do the tag selection by hand in the event handler code.
-
-The 'start-ns' and 'end-ns' events notify about namespace declarations and
-generate tuples ``(prefix, URI)``::
-
- >>> events = ("start-ns", "end-ns")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, obj in context:
- ... print action, obj
- start-ns ('', 'testns')
- end-ns None
-
-It is common practice to use a list as namespace stack and pop the last entry
-on the 'end-ns' event.
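
The namespace stack mentioned above could look roughly like this. It is a sketch only: it uses the standard library's ``xml.etree.ElementTree.iterparse()`` as a stand-in so that it runs without lxml, and the sample document and the names ``ns_stack`` and ``seen`` are assumptions made for this illustration:

```python
from io import BytesIO
import xml.etree.ElementTree as ET

# Hypothetical sample document with two nested namespace declarations.
xml = (b'<root xmlns:a="urn:a">'
       b'<a:child xmlns:b="urn:b"><b:leaf/></a:child>'
       b'</root>')

ns_stack = []   # (prefix, URI) pairs that are currently in scope
seen = []       # (tag, prefixes in scope) recorded at each 'start' event

events = ("start-ns", "end-ns", "start")
for event, obj in ET.iterparse(BytesIO(xml), events=events):
    if event == "start-ns":
        ns_stack.append(obj)      # obj is the (prefix, URI) tuple
    elif event == "end-ns":
        ns_stack.pop()            # the innermost declaration goes out of scope
    else:
        seen.append((obj.tag, [prefix for prefix, _ in ns_stack]))
```

Because the declarations are reported before the 'start' event of the element that carries them, the stack already contains an element's own prefixes when that element is seen.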
-
-lxml.etree supports two extensions compared to ElementTree. It accepts a
-``tag`` keyword argument just like ``element.getiterator(tag)``. This
-restricts events to a specific tag or namespace::
-
- >>> context = etree.iterparse(StringIO(xml), tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
- >>> for action, elem in context:
- ... print action, elem.tag
- start {testns}empty-element
- end {testns}empty-element
-
-The second extension is the ``iterwalk()`` function. It behaves exactly like
-``iterparse()``, but works on Elements and ElementTrees::
-
- >>> root = context.root
- >>> context = etree.iterwalk(root, events=events, tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- start element
- end element
- start element
- end element
-
-
-Error handling on exceptions
-----------------------------
-
-Libxml2 provides error messages for failures, be it during parsing, XPath
-evaluation or schema validation. Whenever an exception is raised, you can
-retrieve the errors that occurred and "might have" led to the problem::
-
- >>> etree.clearErrorLog()
- >>> broken_xml = '<a>'
- >>> try:
- ... etree.parse(StringIO(broken_xml))
- ... except etree.XMLSyntaxError, e:
- ... pass # just put the exception into e
- >>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
- >>> print log
- <string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
-
-This might look a little cryptic at first, but it is the information that
-libxml2 gives you. At least the message at the end should give you a hint
-what went wrong and you can see that the fatal error (FATAL) happened during
-parsing (PARSER) line 1 of a string (<string>, or filename if available).
-Here, PARSER is the so-called error domain, see lxml.etree.ErrorDomains for
-that. You can get it from a log entry like this::
-
- >>> entry = log[0]
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-There is also a convenience attribute ``last_error`` that returns the last
-error or fatal error that occurred::
-
- >>> entry = e.error_log.last_error
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-Alternatively, lxml.etree supports logging libxml2 messages to the Python
-stdlib logging module. This is done through the ``etree.PyErrorLog`` class.
-It disables the error reporting from exceptions and forwards log messages to a
-Python logger. To use it, see the descriptions of the function
-``etree.useGlobalPythonLog`` and the class ``etree.PyErrorLog`` for help.
-Note that this does not affect the local error logs of XSLT, XMLSchema,
-etc. which are described in their respective sections below.
-
-
-Python unicode strings
-----------------------
-
-lxml.etree has broader support for Python unicode strings than the ElementTree
-library. First of all, where ElementTree would raise an exception, the
-parsers in lxml.etree can handle unicode strings straight away. This is most
-helpful for XML snippets embedded in source code using the ``XML()``
-function::
-
- >>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
- >>> uxml
- u'<test> \uf8d1 + \uf8d2 </test>'
- >>> root = etree.XML(uxml)
-
-This requires, however, that unicode strings do not specify a conflicting
-encoding themselves and thus lie about their real encoding::
-
- >>> etree.XML(u'<?xml version="1.0" encoding="ASCII"?>\n' + uxml)
- Traceback (most recent call last):
- ...
- ValueError: Unicode strings with encoding declaration are not supported.
-
-Similarly, you will get errors when you try the same with HTML data in a
-unicode string that specifies a charset in a meta tag of the header. You
-should generally avoid converting XML/HTML data to unicode before passing it
-into the parsers. It is both slower and error prone.
-
-To serialize the result, you would normally use the ``tostring`` module
-function, which serializes to plain ASCII by default or a number of other
-encodings if asked for::
-
- >>> etree.tostring(root)
- '<test> &#63697; + &#63698; </test>'
-
- >>> etree.tostring(root, 'UTF-8', xml_declaration=False)
- '<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'
-
-As an extension, lxml.etree has a new ``tounicode()`` function that you can
-call on XML tree objects to retrieve a Python unicode representation::
-
- >>> etree.tounicode(root)
- u'<test> \uf8d1 + \uf8d2 </test>'
-
- >>> el = etree.Element("test")
- >>> etree.tounicode(el)
- u'<test/>'
-
- >>> subel = etree.SubElement(el, "subtest")
- >>> etree.tounicode(el)
- u'<test><subtest/></test>'
-
- >>> et = etree.ElementTree(el)
- >>> etree.tounicode(et)
- u'<test><subtest/></test>'
-
-The result of ``tounicode()`` can be treated like any other Python unicode
-string and then passed back into the parsers. However, if you want to save
-the result to a file or pass it over the network, you should use ``write()``
-or ``tostring()`` with an encoding argument (typically UTF-8) to serialize the
-XML. The main reason is that unicode strings returned by ``tounicode()``
-never have an XML declaration and therefore do not specify their encoding.
-These strings are most likely not parsable by other XML libraries.
-
-In contrast, the ``tostring()`` function automatically adds a declaration as
-needed that reflects the encoding of the returned string. This makes it
-possible for other parsers to correctly parse the XML byte stream. Note that
-using ``tostring()`` with UTF-8 is also considerably faster in most cases.
-
-
-XPath
------
-
-lxml.etree supports the simple path syntax of the ``findall()`` etc. methods
-on ElementTree and Element, as known from the original ElementTree library.
-As an extension, these classes also provide an ``xpath()`` method that
-supports expressions in the complete XPath syntax.
-
-There are also specialized XPath evaluator classes that are more efficient for
-frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
-comparison`_ to learn when to use which. Their semantics when used on
-Elements and ElementTrees are the same as for the ``xpath()`` method described
-here.
-
-.. _`performance comparison`: performance.html#xpath
-
-For ElementTree, the xpath method performs a global XPath query against the
-document (if absolute) or against the root node (if relative)::
-
- >>> f = StringIO('<foo><bar></bar></foo>')
- >>> tree = etree.parse(f)
-
- >>> r = tree.xpath('/foo/bar')
- >>> len(r)
- 1
- >>> r[0].tag
- 'bar'
-
- >>> r = tree.xpath('bar')
- >>> r[0].tag
- 'bar'
-
-When ``xpath()`` is used on an element, the XPath expression is evaluated
-against the element (if relative) or against the root tree (if absolute)::
-
- >>> root = tree.getroot()
- >>> r = root.xpath('bar')
- >>> r[0].tag
- 'bar'
-
- >>> bar = root[0]
- >>> r = bar.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
- >>> tree = bar.getroottree()
- >>> r = tree.xpath('/foo/bar')
- >>> r[0].tag
- 'bar'
-
-Optionally, you can provide a ``namespaces`` keyword argument, which should be
-a dictionary mapping the namespace prefixes used in the XPath expression to
-namespace URIs::
-
- >>> f = StringIO('''\
- ... <a:foo xmlns:a="http://codespeak.net/ns/test1"
- ... xmlns:b="http://codespeak.net/ns/test2">
- ... <b:bar>Text</b:bar>
- ... </a:foo>
- ... ''')
- >>> doc = etree.parse(f)
- >>> r = doc.xpath('/t:foo/b:bar', {'t': 'http://codespeak.net/ns/test1',
- ... 'b': 'http://codespeak.net/ns/test2'})
- >>> len(r)
- 1
- >>> r[0].tag
- '{http://codespeak.net/ns/test2}bar'
- >>> r[0].text
- 'Text'
-
-There is also an optional ``extensions`` argument which is used to define
-`extension functions`_ in Python that are local to this evaluation.
-
-.. _`extension functions`: extensions.html
-
-The return values of XPath evaluations vary, depending on the XPath expression
-used:
-
-* True or False, when the XPath expression has a boolean result
-
-* a float, when the XPath expression has a numeric result (integer or float)
-
-* a (unicode) string, when the XPath expression has a string result.
-
-* a list of items, when the XPath expression has a list as result. The items
- may include elements, strings and tuples. Text nodes and attributes in the
- result are returned as strings (the text node content or attribute value).
- Comments are also returned as strings, enclosed by the usual ``<!--`` and
- ``-->`` markers. Namespace declarations are returned as tuples of strings:
- ``(prefix, URI)``.
-
-A related convenience method of ElementTree objects is ``getpath(element)``,
-which returns a structural, absolute XPath expression to find that element::
-
- >>> a = etree.Element("a")
- >>> b = etree.SubElement(a, "b")
- >>> c = etree.SubElement(a, "c")
- >>> d1 = etree.SubElement(c, "d")
- >>> d2 = etree.SubElement(c, "d")
-
- >>> tree = etree.ElementTree(c)
- >>> print tree.getpath(d2)
- /c/d[2]
- >>> tree.xpath(tree.getpath(d2)) == [d2]
- True
-
-
-XSLT
-----
-
-lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
-given an ElementTree object to construct an XSLT transformer::
-
- >>> f = StringIO('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> xslt_doc = etree.parse(f)
- >>> transform = etree.XSLT(xslt_doc)
-
-You can then run the transformation on an ElementTree document by simply
-calling it, and this results in another ElementTree object::
-
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
- >>> result = transform(doc)
-
-The result object can be accessed like a normal ElementTree document::
-
- >>> result.getroot().text
- 'Text'
-
-but, as opposed to normal ElementTree objects, can also be turned into an (XML
-or text) string by applying the str() function::
-
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-The result is always a plain string, encoded as requested by the
-``xsl:output`` element in the stylesheet. If you want a Python unicode string
-instead, you should set this encoding to ``UTF-8`` (unless the `ASCII` default
-is sufficient). This allows you to call the builtin ``unicode()`` function on
-the result::
-
- >>> unicode(result)
- u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-You can use other encodings at the cost of multiple recoding. Encodings that
-are not supported by Python will result in an error::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:output encoding="UCS4"/>
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="/a/b/text()" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
-
- >>> result = transform(doc)
- >>> unicode(result)
- Traceback (most recent call last):
- [...]
- LookupError: unknown encoding: UCS4
-
-It is possible to pass parameters, in the form of XPath expressions, to the
-XSLT template::
-
- >>> xslt_tree = etree.XML('''\
- ... <xsl:stylesheet version="1.0"
- ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- ... <xsl:template match="/">
- ... <foo><xsl:value-of select="$a" /></foo>
- ... </xsl:template>
- ... </xsl:stylesheet>''')
- >>> transform = etree.XSLT(xslt_tree)
- >>> f = StringIO('<a><b>Text</b></a>')
- >>> doc = etree.parse(f)
-
-The parameters are passed as keyword parameters to the transform call. First
-let's try passing in a simple string expression::
-
- >>> result = transform(doc, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-Let's try a non-string XPath expression now::
-
- >>> result = transform(doc, a="/a/b/text()")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>Text</foo>\n'
-
-There's also a convenience method on the tree object for doing XSL
-transformations. This is less efficient if you want to apply the same XSL
-transformation to multiple documents, but is shorter to write for one-shot
-operations, as you do not have to instantiate a stylesheet yourself::
-
- >>> result = doc.xslt(xslt_tree, a="'A'")
- >>> str(result)
- '<?xml version="1.0"?>\n<foo>A</foo>\n'
-
-By default, XSLT supports all extension functions from libxslt and libexslt as
-well as Python regular expressions through EXSLT. Note that some extensions
-enable style sheets to read and write files on the local file system. See the
-`document loader documentation`_ on how to deal with this.
-
-.. _`document loader documentation`: resolvers.html
-
-If you want to know how your stylesheet performed, pass the ``profile_run``
-keyword to the transform::
-
- >>> result = transform(doc, a="/a/b/text()", profile_run=True)
- >>> profile = result.xslt_profile
-
-The value of the ``xslt_profile`` property is an ElementTree with profiling
-data about each template, similar to the following::
-
- <profile>
- <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
- </profile>
-
-Note that this is a read-only document. You must not move any of its elements
-to other documents. Please deep-copy the document if you need to modify it.
-If you want to free it from memory, just do::
-
- >>> del result.xslt_profile
+ 1 RelaxNG
+ 2 XMLSchema
RelaxNG
@@ -874,37 +173,3 @@
1
>>> doc2.xmlschema(xmlschema_doc)
0
-
-
-xinclude
---------
-
-Simple XInclude support exists. You can let lxml process xinclude statements
-in a document by calling the xinclude() method on a tree::
-
- >>> data = StringIO('''\
- ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
- ... <foo/>
- ... <xi:include href="doc/test.xml" />
- ... </doc>''')
-
- >>> tree = etree.parse(data)
- >>> tree.xinclude()
- >>> etree.tostring(tree.getroot())
- '<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'
-
-
-write_c14n on ElementTree
--------------------------
-
-The lxml.etree.ElementTree class has a method write_c14n, which takes a file
-object as argument. This file object will receive a UTF-8 representation of
-the canonicalized form of the XML, following the W3C C14N recommendation. For
-example::
-
- >>> f = StringIO('<a><b/></a>')
- >>> tree = etree.parse(f)
- >>> f2 = StringIO()
- >>> tree.write_c14n(f2)
- >>> f2.getvalue()
- '<a><b></b></a>'
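
Note on the write_c14n example above: a minimal sketch of the same call for
readers on Python 3, assuming a current lxml release. The use of io.BytesIO
in place of StringIO is an assumption of this sketch and is not part of the
committed documentation; the output is bytes rather than a plain string.

```python
from io import BytesIO

from lxml import etree

# Parse a small document from an in-memory byte buffer.
tree = etree.parse(BytesIO(b'<a><b/></a>'))

# write_c14n() writes the canonical (C14N) form, encoded as UTF-8,
# to a file-like object opened for binary writing.
out = BytesIO()
tree.write_c14n(out)

canonical = out.getvalue()
print(canonical)
```

Canonicalization expands the empty-element tag into a start/end tag pair, so
the serialized form differs from the input even though the tree is unchanged.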
Copied: lxml/trunk/doc/xpathxslt.txt (from r39233, lxml/trunk/doc/api.txt)
==============================================================================
--- lxml/trunk/doc/api.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Wed Feb 21 16:43:46 2007
@@ -1,487 +1,14 @@
-=====================
-APIs specific to lxml
-=====================
+========================
+XPath and XSLT with lxml
+========================
-lxml tries to follow established APIs wherever possible. Sometimes, however,
-the need to expose a feature in an easy way led to the invention of a new API.
+lxml supports both XPath and XSLT through libxml2 and libxslt in a standards
+compliant way.
.. contents::
..
- 1 lxml.etree
- 2 Other Element APIs
- 3 Trees and Documents
- 4 Iteration
- 5 Parsers
- 6 iterparse and iterwalk
- 7 Error handling on exceptions
- 8 Python unicode strings
- 9 XPath
- 10 XSLT
- 11 RelaxNG
- 12 XMLSchema
- 13 xinclude
- 14 write_c14n on ElementTree
-
-
-lxml.etree
-----------
-
-lxml.etree tries to follow the `ElementTree API`_ wherever it can. There are
-however some incompatibilities (see `compatibility`_). The extensions are
-documented here.
-
-.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
-.. _`compatibility`: compatibility.html
-
-If you need to know which version of lxml is installed, you can access the
-``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple. Note,
-however, that it did not exist before version 1.0, so you will get an
-AttributeError in older versions. The versions of libxml2 and libxslt are
-available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
-
-The following examples usually assume this to be executed first::
-
- >>> from lxml import etree
- >>> from StringIO import StringIO
-
-
-Other Element APIs
-------------------
-
-While lxml.etree itself uses the ElementTree API, it is possible to replace
-the Element implementation by `custom element subclasses`_. This has been
-used to implement well-known XML APIs on top of lxml. The ``lxml.elements``
-package contains examples. Currently, there is a data-binding implementation
-called `objectify`_, which is similar to the `Amara bindery`_ tool.
-
-Additionally, the `lxml.elements.classlookup`_ module provides a number of
-different schemes to customize the mapping between libxml2 nodes and the
-Element classes used by lxml.etree.
-
-.. _`custom element subclasses`: namespace_extensions.html
-.. _`objectify`: objectify.html
-.. _`lxml.elements.classlookup`: elements.html#lxml.elements.classlookup
-.. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/
-
-
-Trees and Documents
--------------------
-
-Compared to the original ElementTree API, lxml.etree has an extended tree
-model. It knows about parents and siblings of elements::
-
- >>> root = etree.Element("root")
- >>> a = etree.SubElement(root, "a")
- >>> b = etree.SubElement(root, "b")
- >>> c = etree.SubElement(root, "c")
- >>> d = etree.SubElement(root, "d")
- >>> e = etree.SubElement(d, "e")
- >>> b.getparent() == root
- True
- >>> print b.getnext().tag
- c
- >>> print c.getprevious().tag
- b
-
-Elements always live within a document context in lxml. This implies that
-there is also a notion of an absolute document root. You can retrieve an
-ElementTree for the root node of a document from any of its elements::
-
- >>> tree = d.getroottree()
- >>> print tree.getroot().tag
- root
-
-Note that this is different from wrapping an Element in an ElementTree. You
-can use ElementTrees to create XML trees with an explicit root node::
-
- >>> tree = etree.ElementTree(d)
- >>> print tree.getroot().tag
- d
- >>> print etree.tostring(tree)
- <d><e/></d>
-
-All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
-will understand the explicitly chosen root as root node of a document. They
-will not see any elements outside the ElementTree. However, ElementTrees do
-not modify their Elements::
-
- >>> element = tree.getroot()
- >>> print element.tag
- d
- >>> print element.getparent().tag
- root
- >>> print element.getroottree().getroot().tag
- root
-
-The rule is that all operations that are applied to Elements use either the
-Element itself as reference point, or the absolute root of the document that
-contains this Element (e.g. for absolute XPath expressions). All operations
-on an ElementTree use its explicit root node as reference.
-
-
-Iteration
----------
-
-The ElementTree API makes Elements iterable to support iteration over their
-children. Using the tree defined above, we get::
-
- >>> [ el.tag for el in root ]
- ['a', 'b', 'c', 'd']
-
-Tree traversal is commonly based on the ``element.getiterator()`` method::
-
- >>> [ el.tag for el in root.getiterator() ]
- ['root', 'a', 'b', 'c', 'd', 'e']
-
-lxml.etree also supports this, but additionally features an extended API for
-iteration over the children, following/preceding siblings, ancestors and
-descendants of an element, as defined by the respective XPath axis::
-
- >>> [ el.tag for el in root.iterchildren() ]
- ['a', 'b', 'c', 'd']
- >>> [ el.tag for el in root.iterchildren(reversed=True) ]
- ['d', 'c', 'b', 'a']
- >>> [ el.tag for el in b.itersiblings() ]
- ['c', 'd']
- >>> [ el.tag for el in c.itersiblings(preceding=True) ]
- ['b', 'a']
- >>> [ el.tag for el in e.iterancestors() ]
- ['d', 'root']
- >>> [ el.tag for el in root.iterdescendants() ]
- ['a', 'b', 'c', 'd', 'e']
-
-Note how ``element.iterdescendants()`` does not include the element itself, as
-opposed to ``element.getiterator()``. The latter effectively implements the
-'descendant-or-self' axis in XPath.
-
-All of these iterators support an additional ``tag`` keyword argument that
-filters the generated elements by tag name::
-
- >>> [ el.tag for el in root.iterchildren(tag='a') ]
- ['a']
- >>> [ el.tag for el in d.iterchildren(tag='a') ]
- []
- >>> [ el.tag for el in root.iterdescendants(tag='d') ]
- ['d']
- >>> [ el.tag for el in root.getiterator(tag='d') ]
- ['d']
-
-See also the section on the utility functions ``iterparse()`` and
-``iterwalk()`` below.
-
-
-Parsers
--------
-
-One of the differences is the parser. There is support for both XML and
-(broken) HTML. Both are based on libxml2 and therefore only support options
-that are backed by the library. Parsers take a number of keyword arguments.
-The following is an example for namespace cleanup during parsing, first with
-the default parser, then with a parametrized one::
-
- >>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
-
- >>> et = etree.parse(StringIO(xml))
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b xmlns="test"/></a>
-
- >>> parser = etree.XMLParser(ns_clean=True)
- >>> et = etree.parse(StringIO(xml), parser)
- >>> print etree.tostring(et.getroot())
- <a xmlns="test"><b/></a>
-
-HTML parsing is similarly simple. The parsers have a ``recover`` keyword
-argument that the HTMLParser sets by default. It lets libxml2 try its best to
-return something usable without raising an exception. You should use libxml2
-version 2.6.21 or newer to take advantage of this feature::
-
- >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
-
- >>> parser = etree.HTMLParser()
- >>> et = etree.parse(StringIO(broken_html), parser)
-
- >>> print etree.tostring(et.getroot())
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-Lxml has an HTML function, similar to the XML shortcut known from
-ElementTree::
-
- >>> html = etree.HTML(broken_html)
- >>> print etree.tostring(html)
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
-
-The support for parsing broken HTML depends entirely on libxml2's recovery
-algorithm. It is *not* the fault of lxml if you find documents that are so
-heavily broken that the parser cannot handle them. There is also no guarantee
-that the resulting tree will contain all data from the original document. The
-parser may have to drop seriously broken parts when struggling to keep
-parsing. Especially misplaced meta tags can suffer from this, which may lead
-to encoding problems.
-
-The use of the libxml2 parsers makes some additional information available at
-the API level. Currently, ElementTree objects can access the DOCTYPE
-information provided by a parsed document, as well as the XML version and the
-original encoding::
-
- >>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
- >>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
- >>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
- >>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
- >>> xhtml = xml_header + doctype_string + '<html><body></body></html>'
-
- >>> tree = etree.parse(StringIO(xhtml))
- >>> docinfo = tree.docinfo
- >>> print docinfo.public_id
- -//W3C//DTD XHTML 1.0 Transitional//EN
- >>> print docinfo.system_url
- http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
- >>> docinfo.doctype == doctype_string
- True
-
- >>> print docinfo.xml_version
- 1.0
- >>> print docinfo.encoding
- ascii
-
-
-iterparse and iterwalk
-----------------------
-
-As known from ElementTree, the ``iterparse()`` utility function returns an
-iterator that generates parser events for an XML file (or file-like object),
-while building the tree. The values are tuples ``(event-type, object)``. The
-event types are 'start', 'end', 'start-ns' and 'end-ns'.
-
-The 'start' and 'end' events represent opening and closing elements and are
-accompanied by the respective element. By default, only 'end' events are
-generated::
-
- >>> xml = '''\
- ... <root>
- ... <element key='value'>text</element>
- ... <element>text</element>tail
- ... <empty-element xmlns="testns" />
- ... </root>
- ... '''
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
- end {testns}empty-element
- end root
-
-The resulting tree is available through the ``root`` property of the iterator::
-
- >>> context.root.tag
- 'root'
-
-The other types can be activated with the ``events`` keyword argument::
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, elem in context:
- ... print action, elem.tag
- start root
- start element
- end element
- start element
- end element
- start {testns}empty-element
- end {testns}empty-element
- end root
-
-You can modify the element and its descendants when handling the 'end' event.
-To save memory, for example, you can remove subtrees that are no longer
-needed::
-
- >>> context = etree.iterparse(StringIO(xml))
- >>> for action, elem in context:
- ... print len(elem),
- ... elem.clear()
- 0 0 0 3
- >>> context.root.getchildren()
- []
-
-**WARNING**: During the 'start' event, the descendants and following siblings
-are not yet available and should not be accessed. During the 'end' event, the
-element and its descendants can be freely modified, but its following siblings
-should not be accessed. During either of the two events, you **must not**
-modify or move the ancestors (parents) of the current element. You should
-also avoid moving or discarding the element itself. The golden rule is: do
-not touch anything that will have to be touched again by the parser later on.
-
-If you have elements with a long list of children in your XML file and want to
-save more memory during parsing, you can clean up the preceding siblings of
-the current element::
-
- >>> for event, element in etree.iterparse(StringIO(xml)):
- ... # ... do something with the element
- ... element.clear() # clean up children
- ... if element.getprevious() is not None: # clean up preceding siblings
- ... del element.getparent()[0]
-
-You can use ``while`` instead of ``if`` if you skipped siblings using the
-``tag`` keyword argument. The more selective your tag is, however, the more
-thought you will have to put into finding the right way to clean up the
-elements that were skipped. Therefore, it is sometimes easier to traverse all
-elements and do the tag selection by hand in the event handler code.
-
-The 'start-ns' and 'end-ns' events notify about namespace declarations and
-generate tuples ``(prefix, URI)``::
-
- >>> events = ("start-ns", "end-ns")
- >>> context = etree.iterparse(StringIO(xml), events=events)
- >>> for action, obj in context:
- ... print action, obj
- start-ns ('', 'testns')
- end-ns None
-
-It is common practice to use a list as namespace stack and pop the last entry
-on the 'end-ns' event.
-
-lxml.etree supports two extensions compared to ElementTree. It accepts a
-``tag`` keyword argument just like ``element.getiterator(tag)``. This
-restricts events to a specific tag or namespace::
-
- >>> context = etree.iterparse(StringIO(xml), tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
-
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
- >>> for action, elem in context:
- ... print action, elem.tag
- start {testns}empty-element
- end {testns}empty-element
-
-The second extension is the ``iterwalk()`` function. It behaves exactly like
-``iterparse()``, but works on Elements and ElementTrees::
-
- >>> root = context.root
- >>> context = etree.iterwalk(root, events=events, tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- start element
- end element
- start element
- end element
-
-
-Error handling on exceptions
-----------------------------
-
-Libxml2 provides error messages for failures, be it during parsing, XPath
-evaluation or schema validation. Whenever an exception is raised, you can
-retrieve the errors that occurred and "might have" led to the problem::
-
- >>> etree.clearErrorLog()
- >>> broken_xml = '<a>'
- >>> try:
- ... etree.parse(StringIO(broken_xml))
- ... except etree.XMLSyntaxError, e:
- ... pass # just put the exception into e
- >>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
- >>> print log
- <string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
-
-This might look a little cryptic at first, but it is the information that
-libxml2 gives you. At least the message at the end should give you a hint
-what went wrong and you can see that the fatal error (FATAL) happened during
-parsing (PARSER) line 1 of a string (<string>, or filename if available).
-Here, PARSER is the so-called error domain, see lxml.etree.ErrorDomains for
-that. You can get it from a log entry like this::
-
- >>> entry = log[0]
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-There is also a convenience attribute ``last_error`` that returns the last
-error or fatal error that occurred::
-
- >>> entry = e.error_log.last_error
- >>> print entry.domain_name, entry.type_name, entry.filename
- PARSER ERR_TAG_NOT_FINISHED <string>
-
-Alternatively, lxml.etree supports logging libxml2 messages to the Python
-stdlib logging module. This is done through the ``etree.PyErrorLog`` class.
-It disables the error reporting from exceptions and forwards log messages to a
-Python logger. To use it, see the descriptions of the function
-``etree.useGlobalPythonLog`` and the class ``etree.PyErrorLog`` for help.
-Note that this does not affect the local error logs of XSLT, XMLSchema,
-etc. which are described in their respective sections below.
-
-
-Python unicode strings
-----------------------
-
-lxml.etree has broader support for Python unicode strings than the ElementTree
-library. First of all, where ElementTree would raise an exception, the
-parsers in lxml.etree can handle unicode strings straight away. This is most
-helpful for XML snippets embedded in source code using the ``XML()``
-function::
-
- >>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
- >>> uxml
- u'<test> \uf8d1 + \uf8d2 </test>'
- >>> root = etree.XML(uxml)
-
-This requires, however, that unicode strings do not specify a conflicting
-encoding themselves and thus lie about their real encoding::
-
- >>> etree.XML(u'<?xml version="1.0" encoding="ASCII"?>\n' + uxml)
- Traceback (most recent call last):
- ...
- ValueError: Unicode strings with encoding declaration are not supported.
-
-Similarly, you will get errors when you try the same with HTML data in a
-unicode string that specifies a charset in a meta tag of the header. You
-should generally avoid converting XML/HTML data to unicode before passing it
-into the parsers. It is both slower and error prone.
-
-To serialize the result, you would normally use the ``tostring`` module
-function, which serializes to plain ASCII by default or a number of other
-encodings if asked for::
-
- >>> etree.tostring(root)
- '<test> &#63697; + &#63698; </test>'
-
- >>> etree.tostring(root, 'UTF-8', xml_declaration=False)
- '<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'
-
-As an extension, lxml.etree has a new ``tounicode()`` function that you can
-call on XML tree objects to retrieve a Python unicode representation::
-
- >>> etree.tounicode(root)
- u'<test> \uf8d1 + \uf8d2 </test>'
-
- >>> el = etree.Element("test")
- >>> etree.tounicode(el)
- u'<test/>'
-
- >>> subel = etree.SubElement(el, "subtest")
- >>> etree.tounicode(el)
- u'<test><subtest/></test>'
-
- >>> et = etree.ElementTree(el)
- >>> etree.tounicode(et)
- u'<test><subtest/></test>'
-
-The result of ``tounicode()`` can be treated like any other Python unicode
-string and then passed back into the parsers. However, if you want to save
-the result to a file or pass it over the network, you should use ``write()``
-or ``tostring()`` with an encoding argument (typically UTF-8) to serialize the
-XML. The main reason is that unicode strings returned by ``tounicode()``
-never have an XML declaration and therefore do not specify their encoding.
-These strings are most likely not parsable by other XML libraries.
-
-In contrast, the ``tostring()`` function automatically adds a declaration as
-needed that reflects the encoding of the returned string. This makes it
-possible for other parsers to correctly parse the XML byte stream. Note that
-using ``tostring()`` with UTF-8 is also considerably faster in most cases.
+ 1 XPath
+ 2 XSLT
XPath
@@ -714,197 +241,3 @@
If you want to free it from memory, just do::
>>> del result.xslt_profile
-
-
-RelaxNG
--------
-
-lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can
-be given an ElementTree object to construct a Relax NG validator::
-
- >>> f = StringIO('''\
- ... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
- ... <zeroOrMore>
- ... <element name="b">
- ... <text />
- ... </element>
- ... </zeroOrMore>
- ... </element>
- ... ''')
- >>> relaxng_doc = etree.parse(f)
- >>> relaxng = etree.RelaxNG(relaxng_doc)
-
-You can then validate some ElementTree document against the schema. You'll get
-back True if the document is valid against the Relax NG schema, and False if
-not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> relaxng.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> relaxng.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not relaxng(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> relaxng.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> relaxng.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Starting with version 0.9, lxml now has a simple API to report the errors
-generated by libxml2. If you want to find out why the validation failed in the
-second case, you can look up the error log of the validation process and check
-it for relevant messages::
-
- >>> log = relaxng.error_log
- >>> print log.last_error
- <string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there
-
-You can see that the error (ERROR) happened during RelaxNG validation
-(RELAXNGV). The message then tells you what went wrong. Note that this error
-log is local to the RelaxNG object. It will only contain log entries that
-appeared during the validation. The DocumentInvalid exception raised by the
-``assertValid`` method above provides access to the global error log (like all
-other lxml exceptions).
-
-Similar to XSLT, there's also a less efficient but easier shortcut method to
-do one-shot RelaxNG validation::
-
- >>> doc.relaxng(relaxng_doc)
- 1
- >>> doc2.relaxng(relaxng_doc)
- 0
-
-
-XMLSchema
----------
-
-lxml.etree also has XML Schema (XSD) support, using the class
-lxml.etree.XMLSchema. This support is very similar to the Relax NG
-support. The class can be given an ElementTree object to construct an
-XMLSchema validator::
-
- >>> f = StringIO('''\
- ... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
- ... <xsd:element name="a" type="AType"/>
- ... <xsd:complexType name="AType">
- ... <xsd:sequence>
- ... <xsd:element name="b" type="xsd:string" />
- ... </xsd:sequence>
- ... </xsd:complexType>
- ... </xsd:schema>
- ... ''')
- >>> xmlschema_doc = etree.parse(f)
- >>> xmlschema = etree.XMLSchema(xmlschema_doc)
-
-You can then validate some ElementTree document with this. Like with
-RelaxNG, you'll get back true if the document is valid against the XML
-schema, and false if not::
-
- >>> valid = StringIO('<a><b></b></a>')
- >>> doc = etree.parse(valid)
- >>> xmlschema.validate(doc)
- 1
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> xmlschema.validate(doc2)
- 0
-
-Calling the schema object has the same effect as calling its validate
-method. This is sometimes used in conditional statements::
-
- >>> invalid = StringIO('<a><c></c></a>')
- >>> doc2 = etree.parse(invalid)
- >>> if not xmlschema(doc2):
- ... print "invalid!"
- invalid!
-
-If you prefer getting an exception when validating, you can use the
-``assert_`` or ``assertValid`` methods::
-
- >>> xmlschema.assertValid(doc2)
- Traceback (most recent call last):
- [...]
- DocumentInvalid: Document does not comply with schema
-
- >>> xmlschema.assert_(doc2)
- Traceback (most recent call last):
- [...]
- AssertionError: Document does not comply with schema
-
-Error reporting works like for the RelaxNG class::
-
- >>> log = xmlschema.error_log
- >>> error = log.last_error
- >>> print error.domain_name
- SCHEMASV
- >>> print error.type_name
- SCHEMAV_ELEMENT_CONTENT
-
-If you were to print this log entry, you would get something like the
-following. Note that the error message depends on the libxml2 version in
-use::
-
- <string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).
-
-Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut
-method to do XML Schema validation::
-
- >>> doc.xmlschema(xmlschema_doc)
- 1
- >>> doc2.xmlschema(xmlschema_doc)
- 0
-
-
-xinclude
---------
-
-Simple XInclude support exists. You can let lxml process xinclude statements
-in a document by calling the xinclude() method on a tree::
-
- >>> data = StringIO('''\
- ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
- ... <foo/>
- ... <xi:include href="doc/test.xml" />
- ... </doc>''')
-
- >>> tree = etree.parse(data)
- >>> tree.xinclude()
- >>> etree.tostring(tree.getroot())
- '<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'
-
-
-write_c14n on ElementTree
--------------------------
-
-The lxml.etree.ElementTree class has a method write_c14n, which takes a file
-object as argument. This file object will receive a UTF-8 representation of
-the canonicalized form of the XML, following the W3C C14N recommendation. For
-example::
-
- >>> f = StringIO('<a><b/></a>')
- >>> tree = etree.parse(f)
- >>> f2 = StringIO()
- >>> tree.write_c14n(f2)
- >>> f2.getvalue()
- '<a><b></b></a>'