From scoder at codespeak.net Thu May 3 21:10:50 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:10:50 +0200 (CEST) Subject: [Lxml-checkins] r42642 - lxml/trunk/doc Message-ID: <20070503191050.CFD368075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:10:48 2007 New Revision: 42642 Modified: lxml/trunk/doc/FAQ.txt lxml/trunk/doc/build.txt Log: contributing and building Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:10:48 2007 @@ -17,23 +17,26 @@ 1.5 What is the difference between lxml.etree and lxml.objectify? 1.6 Why is my application so slow? 1.7 Why do I get errors about missing UCS4 symbols when installing lxml? - 2 Bugs - 2.1 My application crashes! Why does lxml.etree do that? - 2.2 I think I have found a bug in lxml. What should I do? - 3 Threading - 3.1 Can I use threads to concurrently access the lxml API? - 3.2 Does my program run faster if I use threads? - 3.3 Would my single-threaded program run faster if I turned off threading? - 4 Parsing and Serialisation - 4.1 Why doesn't the ``pretty_print`` option reformat my XML output? - 4.2 Why can't lxml parse my XML from unicode strings? - 4.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ? - 4.4 Why can't I just delete parents or clear the root node in iterparse()? - 5 XPath and Document Traversal - 5.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)? - 5.2 Why doesn't ``findall()`` support full XPath expressions? - 5.3 How can I find out which namespace prefixes are used in a document? - 5.4 How can I specify a default namespace for XPath expressions? + 2 Contributing + 2.1 Why is lxml not written in Python? + 2.2 How can I contribute? + 3 Bugs + 3.1 My application crashes! Why does lxml.etree do that? + 3.2 I think I have found a bug in lxml. What should I do? 
+ 4 Threading + 4.1 Can I use threads to concurrently access the lxml API? + 4.2 Does my program run faster if I use threads? + 4.3 Would my single-threaded program run faster if I turned off threading? + 5 Parsing and Serialisation + 5.1 Why doesn't the ``pretty_print`` option reformat my XML output? + 5.2 Why can't lxml parse my XML from unicode strings? + 5.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ? + 5.4 Why can't I just delete parents or clear the root node in iterparse()? + 6 XPath and Document Traversal + 6.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)? + 6.2 Why doesn't ``findall()`` support full XPath expressions? + 6.3 How can I find out which namespace prefixes are used in a document? + 6.4 How can I specify a default namespace for XPath expressions? General Questions @@ -167,6 +170,64 @@ .. _`build instructions`: build.html +Contributing +============ + +Why is lxml not written in Python? +---------------------------------- + +lxml interfaces with two C libraries: libxml2 and libxslt. Accessing them at +the C-level is required for performance reasons. + +To avoid writing plain C-code and caring too much about the details of +built-in types and reference counting, lxml is written in Pyrex_, a +Python-like language that is translated into C-code. Chances are that if you +know Python, you can write code that Pyrex accepts. Again, the C-ish style +used in the lxml code is just for performance optimisations. If you want to +contribute, don't bother with the details, a Python implementation of your +contribution is better than none. And keep in mind that lxml's flexible API +often favours an implementation of features in pure Python, without bothering +with C-code at all. + +Please contact the `mailing list`_ if you need any help. + +.. _Pyrex: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ + + +How can I contribute? 
+--------------------- + +Besides enhancing the code, there are a lot of places where you can help the +project and its user base. You can + +* spread the word and write about lxml. Many users (especially new Python + users) have not yet heard about lxml, although our user base is constantly + growing. If you write your own blog and feel like saying something about + lxml, go ahead and do so. If we think your contribution or criticism is + valuable to other users, we may even put a link or a quote on the project + page. + +* provide code examples for the general usage of lxml or specific problems + solved with lxml. Readable code is a very good way of showing how a library + can be used and what great things you can do with it. Again, if we hear + about it, we can set a link on the project page. + +* work on the documentation. The web page is generated from a set of ReST_ + `text files`_. It is meant both as a representative project page for lxml + and as a site for documenting lxml's API and usage. If you have questions + or an idea how to make it more readable and accessible while you are reading + it, please send a comment to the `mailing list`_. + +.. _ReST: http://docutils.sourceforge.net/rst.html +.. _`text files`: http://codespeak.net/svn/lxml/trunk/doc/ + +* improve the docstrings. lxml uses docstrings to support Python's integrated + online ``help()`` function. However, sometimes these are not sufficient to + grasp the details of the function in question. If you find such a place, + you can try to write up a better description and send it to the `mailing + list`_. + + Bugs ==== @@ -176,7 +237,7 @@ One of the goals of lxml is "no segfaults", so if there is no clear warning in the documentation that you were doing something potentially harmful, you have found a bug and we would like to hear about it. Please report this bug to the -mailing list. See the next section on how to do that. +`mailing list`_. See the next section on how to do that. 
I think I have found a bug in lxml. What should I do? Modified: lxml/trunk/doc/build.txt ============================================================================== --- lxml/trunk/doc/build.txt (original) +++ lxml/trunk/doc/build.txt Thu May 3 21:10:48 2007 @@ -2,8 +2,10 @@ ============================= To build lxml from source, you need libxml2 and libxslt properly installed, -including header files (possibly shipped in -dev packages). The build process -also requires setuptools_. +*including the header files*. These are likely shipped in separate ``-dev`` +or ``-devel`` packages like ``libxml2-dev``, which you need to install. The +build process also requires setuptools_. The lxml source distribution comes +with a script called ``ez_setup.py`` that can be used to install them. .. _setuptools: http://peak.telecommunity.com/DevCenter/setuptools @@ -34,18 +36,22 @@ Newer versions of lxml depend on features and bug fixes that are not yet available in an official Pyrex release. This includes support for the - external C-API of lxml, for Python 2.5 and for 64 bit architectures. + external C-API of lxml.etree, for Python 2.5 and for 64 bit architectures. To build lxml 1.1 and later from non-release or modified sources, you must - therefore install an updated Pyrex version from here: + therefore use an updated Pyrex version from here: http://codespeak.net/svn/lxml/pyrex/ - Since version 1.1.2, the lxml source distribution includes this Pyrex - version. It will be used if the 'pyrex' directory is available in the lxml - root directory. If you install from SVN or delete this directory from the - unpacked distribution directory, the normally installed Pyrex version will - be used. + A subversion checkout of lxml will automatically retrieve the latest Pyrex + as external project source (``svn:externals``). Look out for the ``Pyrex`` + directory in the source tree. + + Since version 1.1.2, the lxml source distribution also includes this Pyrex + version. 
It will be used if the ``Pyrex`` directory is available in the + lxml root directory. If you install from SVN or delete this directory from + the unpacked distribution directory, the normally installed Pyrex version + will be used. * lxml 1.0 and earlier @@ -86,6 +92,10 @@ python setup.py build +or:: + + python setup.py bdist_egg + If you want to test lxml from the source directory, it is better to build it in-place like this:: @@ -96,15 +106,24 @@ make If you get errors about missing header files (e.g., ``libxml/xmlversion.h``) -then you need to add the location of that file to the include path like:: +then you need to make sure the development packages of libxml2 and libxslt are +properly installed. If this doesn't help, you may have to add the location of +the header files to the include path like:: - python setup.py build_ext -i -I /usr/include/libxml2 + python setup.py build_ext -i -I /usr/include/libxml2 where the file is in ``/usr/include/libxml2/libxml/xmlversion.h`` To use lxml.etree in-place, you can place lxml's ``src`` directory on your Python module search path (PYTHONPATH) and then import ``lxml.etree`` to play -with it. +with it:: + + # cd lxml + # PYTHONPATH=src python + Python 2.5.1 + Type "help", "copyright", "credits" or "license" for more information. + >>> from lxml import etree + >>> To recompile after changes, note that you may have to run ``make clean`` or delete the file ``src/lxml/etree.c``. Distutils do not automatically pick up @@ -125,8 +144,8 @@ make test -To run the ElementTree and cElementTree compatibility tests, make sure -you have lxml on your PYTHONPATH first, then run:: +This also runs the ElementTree and cElementTree compatibility tests. To call +them separately, make sure you have lxml on your PYTHONPATH first, then run:: python selftest.py @@ -147,15 +166,16 @@ This is the procedure to make an lxml egg for your platform: -* download the lxml-x.y.tar.gz release. 
This contains the pregenerated C so we - don't run into any Pyrex issues. Unpack it and cd into it. +* Download the lxml-x.y.tar.gz release. This contains the pregenerated C so + that you don't run into any Pyrex issues. Unpack it and cd into it. * python setup.py build -* if you're on a unixy platform, cd into build/lib.your.platform and - strip any .so file you find there. This reduces the size of the egg. +* If you're on a unixy platform, cd into ``build/lib.your.platform`` and strip + any ``.so`` file you find there. This reduces the size of the egg + considerably. -* python setup.py bdist_egg upload +* ``python setup.py bdist_egg upload`` The last 'upload' step only works if you have access to the lxml cheeseshop entry. If not, you can just make an egg with ``bdist_egg`` and mail it to the From scoder at codespeak.net Thu May 3 21:12:14 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:12:14 +0200 (CEST) Subject: [Lxml-checkins] r42643 - lxml/trunk Message-ID: <20070503191214.F0D258075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:12:13 2007 New Revision: 42643 Modified: lxml/trunk/ (props changed) Log: properties From scoder at codespeak.net Thu May 3 21:13:33 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:13:33 +0200 (CEST) Subject: [Lxml-checkins] r42644 - lxml/trunk/doc Message-ID: <20070503191333.788948075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:13:32 2007 New Revision: 42644 Modified: lxml/trunk/doc/FAQ.txt Log: faq Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:13:32 2007 @@ -1,6 +1,6 @@ -========================== -Frequently Asked Questions -========================== +================================ +Frequently Asked Questions (FAQ) +================================ See also the notes on compatibility_ to 
ElementTree_. From scoder at codespeak.net Thu May 3 21:52:31 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:52:31 +0200 (CEST) Subject: [Lxml-checkins] r42646 - lxml/trunk/doc Message-ID: <20070503195231.EB7C08075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:52:29 2007 New Revision: 42646 Modified: lxml/trunk/doc/FAQ.txt lxml/trunk/doc/performance.txt Log: doc on benchmark and performance Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:52:29 2007 @@ -15,7 +15,7 @@ 1.3 What standards does lxml implement? 1.4 Where are the Windows binaries? 1.5 What is the difference between lxml.etree and lxml.objectify? - 1.6 Why is my application so slow? + 1.6 How can I make my application run faster? 1.7 Why do I get errors about missing UCS4 symbols when installing lxml? 2 Contributing 2.1 Why is lxml not written in Python? @@ -136,17 +136,18 @@ XPath, XSLT or validation. -Why is my application so slow? ------------------------------- +How can I make my application run faster? +----------------------------------------- lxml.etree is a very fast library for processing XML. There are, however, `a few caveats`_ involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the -simplicity of the API might suggest. The `benchmark page`_ has a comparison -to other ElementTree implementations and a number of tips for performance -tweaking. As with any Python application, the rule of thumb is: the more of -your processing runs in C, the faster your application gets. See also the -section on threading_. +simplicity of the API might suggest, while some use cases can heavily benefit +from finding the right way of doing them. 
The `benchmark page`_ has a +comparison to other ElementTree implementations and a number of tips for +performance tweaking. As with any Python application, the rule of thumb is: +the more of your processing runs in C, the faster your application gets. See +also the section on threading_. .. _`a few caveats`: performance.html#the-elementtree-api .. _`benchmark page`: performance.html @@ -182,7 +183,7 @@ To avoid writing plain C-code and caring too much about the details of built-in types and reference counting, lxml is written in Pyrex_, a Python-like language that is translated into C-code. Chances are that if you -know Python, you can write code that Pyrex accepts. Again, the C-ish style +know Python, you can write `code that Pyrex accepts`_. Again, the C-ish style used in the lxml code is just for performance optimisations. If you want to contribute, don't bother with the details, a Python implementation of your contribution is better than none. And keep in mind that lxml's flexible API @@ -192,6 +193,7 @@ Please contact the `mailing list`_ if you need any help. .. _Pyrex: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ +.. _`code that Pyrex accepts`: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/overview.html How can I contribute? Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Thu May 3 21:52:29 2007 @@ -30,10 +30,15 @@ attributes (-/A), with or without ASCII or unicode text (-/S/U), and either against a tree or its serialised form (T/X). In the result extracts cited below, T1 refers to a 3-level tree with many children at the third level, T2 -is swapped around to have many children at the root element, T3 is a deep tree -with few children at each level and T4 is a small tree, slightly broader than -deep. 
If repetition is involved, this usually means running the benchmark in -a loop over all children of the tree root. +is swapped around to have many children below the root element, T3 is a deep +tree with few children at each level and T4 is a small tree, slightly broader +than deep. If repetition is involved, this usually means running the +benchmark in a loop over all children of the tree root, otherwise, the +operation is run on the root node (C/R). + +As an example, the character code ``(SATR T1)`` states that the benchmark was +running for tree T1, with plain string text (S) and attributes (A). It was +run against the root element (R) in the tree structure of the data (T). .. contents:: .. @@ -48,11 +53,11 @@ Bad things first ---------------- -First thing to say: there *is* an overhead involved in having a C library -mimic the ElementTree API. As opposed to ElementTree, lxml has to generate -Python objects on the fly when asked for them. What this means is: the more -of your code runs in Python, the slower your application gets. Note, however, -that this is true for most performance critical Python applications. +First thing to say: there *is* an overhead involved in having a DOM-like C +library mimic the ElementTree API. As opposed to ElementTree, lxml has to +generate Python objects on the fly when asked for them. What this means is: +the more of your code runs in Python, the slower your application gets. Note, +however, that this is true for most performance critical Python applications. 
Parsing and Serialising From scoder at codespeak.net Thu May 3 22:26:17 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 22:26:17 +0200 (CEST) Subject: [Lxml-checkins] r42647 - in lxml/trunk: benchmark doc Message-ID: <20070503202617.C04DB8075@code0.codespeak.net> Author: scoder Date: Thu May 3 22:26:17 2007 New Revision: 42647 Modified: lxml/trunk/benchmark/bench_etree.py lxml/trunk/doc/performance.txt Log: benchmark for indexed child access Modified: lxml/trunk/benchmark/bench_etree.py ============================================================================== --- lxml/trunk/benchmark/bench_etree.py (original) +++ lxml/trunk/benchmark/bench_etree.py Thu May 3 22:26:17 2007 @@ -18,6 +18,19 @@ for child in reversed(root): pass + def bench_first_child(self, root): + for i in range(1000): + child = root[0] + + def bench_last_child(self, root): + for i in range(1000): + child = root[-1] + + def bench_middle_child(self, root): + pos = len(root) / 2 + for i in range(1000): + child = root[pos] + @with_attributes(True, False) @with_text(text=True, utext=True) def bench_tostring_utf8(self, root): Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Thu May 3 22:26:17 2007 @@ -14,7 +14,7 @@ .. _ElementTree: http://effbot.org/zone/element-index.htm .. _cElementTree: http://effbot.org/zone/celementtree.htm -The statements made here are backed by the benchmark scripts +The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with the lxml source distribution. 
The timings cited below compare lxml 1.3 (with libxml2 2.6.26) to the ElementTree and cElementTree versions shipped with @@ -166,6 +166,29 @@ cET: root_getchildren (--TR T2) 0.0150 msec/pass ET : root_getchildren (--TR T2) 0.0091 msec/pass +When accessing single children, however, e.g. by index, this handicap is +negligible:: + + lxe: first_child (--TR T2) 0.2499 msec/pass + cET: first_child (--TR T2) 0.2048 msec/pass + ET : first_child (--TR T2) 0.9291 msec/pass + + lxe: last_child (--TR T1) 0.2511 msec/pass + cET: last_child (--TR T1) 0.2148 msec/pass + ET : last_child (--TR T1) 0.9191 msec/pass + +... unless you add the time to find a child index in a bigger list, as ET and +cET use Python lists here, which are based on arrays. The data structure used +by libxml2 is a linked tree, and thus, a linked list of children:: + + lxe: middle_child (--TR T1) 0.2921 msec/pass + cET: middle_child (--TR T1) 0.2069 msec/pass + ET : middle_child (--TR T1) 0.9291 msec/pass + + lxe: middle_child (--TR T2) 1.9028 msec/pass + cET: middle_child (--TR T2) 0.2089 msec/pass + ET : middle_child (--TR T2) 0.9360 msec/pass + As opposed to ET, libxml2 has a notion of documents that each element must be in. 
This results in a major performance difference for creating independent Elements that end up in independently created documents:: From scoder at codespeak.net Fri May 4 11:37:29 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 4 May 2007 11:37:29 +0200 (CEST) Subject: [Lxml-checkins] r42667 - lxml/trunk/doc Message-ID: <20070504093729.512778075@code0.codespeak.net> Author: scoder Date: Fri May 4 11:37:28 2007 New Revision: 42667 Modified: lxml/trunk/doc/performance.txt Log: tree timings and note on non-comparable absolute numbers Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Fri May 4 11:37:28 2007 @@ -40,6 +40,13 @@ running for tree T1, with plain string text (S) and attributes (A). It was run against the root element (R) in the tree structure of the data (T). +Note that very small operations are repeated in integer loops to make them +measurable. It is therefore not always possible to compare the absolute +timings of, say, a single access benchmark (which usually loops) and a 'get +all in one step' benchmark, which already takes enough time to be measurable +and is therefore measured as is. Take a look at the concrete benchmarks in +the scripts to understand how the numbers compare. + .. contents:: .. 
1 Bad things first @@ -137,20 +144,20 @@ (given in seconds):: lxe: -- S- U- -A SA UA - T1: 0.1029 0.1005 0.0998 0.1003 0.0998 0.1002 - T2: 0.1035 0.1013 0.1015 0.1090 0.1089 0.1090 - T3: 0.0276 0.0270 0.0273 0.0679 0.0673 0.0673 - T4: 0.0004 0.0004 0.0004 0.0013 0.0013 0.0013 + T1: 0.1155 0.1154 0.1153 0.1159 0.1181 0.1158 + T2: 0.1183 0.1197 0.1200 0.1267 0.1261 0.1264 + T3: 0.0341 0.0312 0.0314 0.0726 0.0717 0.0720 + T4: 0.0005 0.0004 0.0004 0.0014 0.0014 0.0014 cET: -- S- U- -A SA UA - T1: 0.0277 0.0273 0.0273 0.0272 0.0278 0.0275 - T2: 0.0281 0.0347 0.0281 0.0285 0.0284 0.0284 - T3: 0.0074 0.0074 0.0074 0.0122 0.0102 0.0101 + T1: 0.0290 0.0271 0.0275 0.0297 0.0273 0.0274 + T2: 0.0280 0.0280 0.0281 0.0285 0.0283 0.0286 + T3: 0.0071 0.0072 0.0071 0.0113 0.0096 0.0096 T4: 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 ET : -- S- U- -A SA UA - T1: 0.1349 0.1962 0.2356 0.1288 0.2642 0.1351 - T2: 0.3104 0.1344 0.3566 0.3857 0.1354 0.4677 - T3: 0.0313 0.0325 0.0312 0.0356 0.3803 0.0364 - T4: 0.0005 0.0005 0.0008 0.0006 0.0007 0.0006 + T1: 0.1362 0.1985 0.2300 0.1344 0.2672 0.1335 + T2: 0.3107 0.1386 0.3581 0.3886 0.1388 0.4277 + T3: 0.0334 0.0332 0.0320 0.0367 0.3769 0.0375 + T4: 0.0006 0.0005 0.0008 0.0007 0.0007 0.0006 While lxml is still faster than ET in most cases (30-60%), cET can be up to three times faster than lxml here. 
One of the reasons is that lxml must From scoder at codespeak.net Fri May 4 21:00:41 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 4 May 2007 21:00:41 +0200 (CEST) Subject: [Lxml-checkins] r42690 - lxml/trunk/src/lxml Message-ID: <20070504190041.C97308068@code0.codespeak.net> Author: scoder Date: Fri May 4 21:00:40 2007 New Revision: 42690 Modified: lxml/trunk/src/lxml/etree.pyx Log: missing PI function and Comment.values() method Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Fri May 4 21:00:40 2007 @@ -1101,6 +1101,9 @@ def items(self): return [] + def values(self): + return [] + cdef class _Comment(__ContentOnlyElement): property tag: def __get__(self): @@ -1751,6 +1754,8 @@ tree.xmlAddChild(c_doc, c_node) return _elementFactory(doc, c_node) +PI = ProcessingInstruction + def SubElement(_Element _parent not None, _tag, attrib=None, nsmap=None, **_extra): """Subelement factory. This function creates an element instance, and appends it to an From scoder at codespeak.net Sat May 5 12:29:51 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 5 May 2007 12:29:51 +0200 (CEST) Subject: [Lxml-checkins] r42695 - in lxml/trunk: . 
doc src/lxml src/lxml/tests Message-ID: <20070505102951.495DB8075@code0.codespeak.net> Author: scoder Date: Sat May 5 12:29:50 2007 New Revision: 42695 Modified: lxml/trunk/CHANGES.txt lxml/trunk/doc/sax.txt lxml/trunk/src/lxml/apihelpers.pxi lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/sax.py lxml/trunk/src/lxml/serializer.pxi lxml/trunk/src/lxml/tests/test_etree.py lxml/trunk/src/lxml/tests/test_sax.py Log: comment/PI fixes for lxml.sax, support for serialising top-level PIs and comments, appending and prepending comments and PIs to the root node Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Sat May 5 12:29:50 2007 @@ -8,12 +8,17 @@ Features added -------------- +* ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support + adding processing instructions and comments around the root node + * Element.attrib now has a ``pop()`` method * Extended type annotation in objectify: cleaner annotation namespace setup plus new ``xsiannotate()`` and ``deannotate()`` functions -* Support for custom Element class instantiation in lxml.sax +* Support for custom Element class instantiation in lxml.sax: passing a + ``makeelement()`` function to the ElementTreeContentHandler will reuse the + lookup context of that function * '.' represents empty ObjectPath (identity) @@ -30,6 +35,11 @@ Bugs fixed ---------- +* Documents lost their top-level PIs and comments on serialisation + +* lxml.sax failed on comments and PIs. Comments are now properly ignored and + PIs are copied. 
+ * Thread safety in XPath evaluators * Raise AssertionError when passing strings containing '\0' bytes Modified: lxml/trunk/doc/sax.txt ============================================================================== --- lxml/trunk/doc/sax.txt (original) +++ lxml/trunk/doc/sax.txt Sat May 5 12:29:50 2007 @@ -39,6 +39,10 @@ >>> lxml.etree.tostring(tree.getroot()) 'Hello world' +By passing a ``makeelement`` function to the constructor of +``ElementTreeContentHandler``, e.g. the one of a parser you configured, you +can determine which element class lookup scheme should be used. + Producing SAX events from an ElementTree or Element --------------------------------------------------- Modified: lxml/trunk/src/lxml/apihelpers.pxi ============================================================================== --- lxml/trunk/src/lxml/apihelpers.pxi (original) +++ lxml/trunk/src/lxml/apihelpers.pxi Sat May 5 12:29:50 2007 @@ -541,7 +541,6 @@ c_node = child._c_node # store possible text node c_next = c_node.next - # XXX what if element is coming from a different document? tree.xmlUnlinkNode(c_node) # move node itself tree.xmlAddChild(parent._c_node, c_node) @@ -550,6 +549,38 @@ # parent element has moved; change them too.. moveNodeToDocument(child, parent._doc) +cdef void _appendSibling(_Element element, _Element sibling): + """Add a sibling directly after an element. + """ + cdef xmlNode* c_next + cdef xmlNode* c_node + c_node = sibling._c_node + # store possible text node + c_next = c_node.next + tree.xmlUnlinkNode(c_node) + # move node itself + tree.xmlAddNextSibling(element._c_node, c_node) + _moveTail(c_next, c_node) + # uh oh, elements may be pointing to different doc when + # parent element has moved; change them too.. + moveNodeToDocument(sibling, element._doc) + +cdef void _prependSibling(_Element element, _Element sibling): + """Add a sibling directly before an element. 
+ """ + cdef xmlNode* c_next + cdef xmlNode* c_node + c_node = sibling._c_node + # store possible text node + c_next = c_node.next + tree.xmlUnlinkNode(c_node) + # move node itself + tree.xmlAddPrevSibling(element._c_node, c_node) + _moveTail(c_next, c_node) + # uh oh, elements may be pointing to different doc when + # parent element has moved; change them too.. + moveNodeToDocument(sibling, element._doc) + cdef int isutf8(char* s): cdef char c c = s[0] Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sat May 5 12:29:50 2007 @@ -531,6 +531,36 @@ """ _appendChild(self, element) + def addnext(self, _Element element): + """Adds the element as a following sibling directly after this + element. + + This is normally used to set a processing instruction or comment after + the root node of a document. Note that tail text is automatically + discarded when adding at the root level. + """ + if self._c_node.parent != NULL and not _isElement(self._c_node.parent): + if element._c_node.type != tree.XML_PI_NODE: + if element._c_node.type != tree.XML_COMMENT_NODE: + raise TypeError, "Only processing instructions and comments can be siblings of the root element" + element.tail = None + _appendSibling(self, element) + + def addprevious(self, _Element element): + """Adds the element as a preceding sibling directly before this + element. + + This is normally used to set a processing instruction or comment + before the root node of a document. Note that tail text is + automatically discarded when adding at the root level. 
+ """ + if self._c_node.parent != NULL and not _isElement(self._c_node.parent): + if element._c_node.type != tree.XML_PI_NODE: + if element._c_node.type != tree.XML_COMMENT_NODE: + raise TypeError, "Only processing instructions and comments can be siblings of the root element" + element.tail = None + _prependSibling(self, element) + def extend(self, elements): """Extends the current children by the elements in the iterable. """ Modified: lxml/trunk/src/lxml/sax.py ============================================================================== --- lxml/trunk/src/lxml/sax.py (original) +++ lxml/trunk/src/lxml/sax.py Sat May 5 12:29:50 2007 @@ -1,5 +1,6 @@ from xml.sax.handler import ContentHandler from etree import ElementTree, Element, SubElement, LxmlError +from etree import XML, Comment, ProcessingInstruction class SaxError(LxmlError): pass @@ -15,6 +16,7 @@ """ def __init__(self, makeelement=None): self._root = None + self._root_siblings = [] self._element_stack = [] self._default_ns = None self._ns_mapping = { None : [None] } @@ -82,6 +84,10 @@ if self._root is None: element = self._root = \ self._makeelement(el_name, attrs, self._new_mappings) + if self._root_siblings and hasattr(element, 'addprevious'): + for sibling in self._root_siblings: + element.addprevious(sibling) + del self._root_siblings[:] else: element = SubElement(element_stack[-1], el_name, attrs, self._new_mappings) @@ -89,10 +95,16 @@ self._new_mappings.clear() + def processingInstruction(self, target, data): + pi = ProcessingInstruction(target, data) + if self._root is None: + self._root_siblings.append(pi) + else: + self._element_stack[-1].append(pi) + def endElementNS(self, ns_name, qname): element = self._element_stack.pop() - tag = element.tag - if ns_name != _getNsTag(tag): + if ns_name != _getNsTag(element.tag): raise SaxError, "Unexpected element closed: {%s}%s" % ns_name def startElement(self, name, attributes=None): @@ -106,10 +118,13 @@ try: # if there already is a child element, we 
must append to its tail last_element = last_element[-1] - last_element.tail = (last_element.tail or u'') + data + last_element.tail = (last_element.tail or '') + data except IndexError: # otherwise: append to the text - last_element.text = (last_element.text or u'') + data + last_element.text = (last_element.text or '') + data + + ignorableWhitespace = characters + class ElementTreeProducer(object): """Produces SAX events for an element and children. @@ -124,13 +139,41 @@ from xml.sax.xmlreader import AttributesNSImpl as attr_class self._attr_class = attr_class self._empty_attributes = attr_class({}, {}) - + def saxify(self): self._content_handler.startDocument() - self._recursive_saxify(self._element, {}) + + element = self._element + if hasattr(element, 'getprevious'): + siblings = [] + sibling = element.getprevious() + while getattr(sibling, 'tag', None) is ProcessingInstruction: + siblings.append(sibling) + sibling = sibling.getprevious() + for sibling in siblings[::-1]: + self._recursive_saxify(sibling, {}) + + self._recursive_saxify(element, {}) + + if hasattr(element, 'getnext'): + sibling = element.getnext() + while getattr(sibling, 'tag', None) is ProcessingInstruction: + self._recursive_saxify(sibling, {}) + sibling = sibling.getnext() + self._content_handler.endDocument() def _recursive_saxify(self, element, prefixes): + content_handler = self._content_handler + tag = element.tag + if tag is Comment or tag is ProcessingInstruction: + if tag is ProcessingInstruction: + content_handler.processingInstruction( + element.target, element.text) + if element.tail: + content_handler.characters(element.tail) + return + new_prefixes = [] build_qname = self._build_qname attribs = element.items() @@ -146,10 +189,9 @@ else: sax_attributes = self._empty_attributes - ns_uri, local_name = _getNsTag(element.tag) + ns_uri, local_name = _getNsTag(tag) qname = build_qname(ns_uri, local_name, prefixes, new_prefixes) - content_handler = self._content_handler for prefix, uri in 
new_prefixes:
             content_handler.startPrefixMapping(prefix, uri)
         content_handler.startElementNS((ns_uri, local_name),

Modified: lxml/trunk/src/lxml/serializer.pxi
==============================================================================
--- lxml/trunk/src/lxml/serializer.pxi (original)
+++ lxml/trunk/src/lxml/serializer.pxi Sat May 5 12:29:50 2007
@@ -78,8 +78,10 @@
     if write_xml_declaration:
         _writeDeclarationToBuffer(c_buffer, c_doc.version, encoding)
 
+    _writePrevSiblings(c_buffer, c_node, encoding, pretty_print)
     tree.xmlNodeDumpOutput(c_buffer, c_doc, c_node, 0, pretty_print, encoding)
     _writeTail(c_buffer, c_node, encoding, pretty_print)
+    _writeNextSiblings(c_buffer, c_node, encoding, pretty_print)
 
 cdef void _writeDeclarationToBuffer(tree.xmlOutputBuffer* c_buffer,
                                     char* version, char* encoding):
@@ -100,6 +102,36 @@
                                pretty_print, encoding)
         c_node = c_node.next
 
+cdef void _writePrevSiblings(tree.xmlOutputBuffer* c_buffer, xmlNode* c_node,
+                             char* encoding, int pretty_print):
+    cdef xmlNode* c_sibling
+    if c_node.parent is not NULL and _isElement(c_node.parent):
+        return
+    # we are at a root node, so add PI and comment siblings
+    c_sibling = c_node
+    while c_sibling.prev != NULL and \
+            (c_sibling.prev.type == tree.XML_PI_NODE or \
+             c_sibling.prev.type == tree.XML_COMMENT_NODE):
+        c_sibling = c_sibling.prev
+    while c_sibling != c_node:
+        tree.xmlNodeDumpOutput(c_buffer, c_node.doc, c_sibling, 0,
+                               pretty_print, encoding)
+        c_sibling = c_sibling.next
+
+cdef void _writeNextSiblings(tree.xmlOutputBuffer* c_buffer, xmlNode* c_node,
+                             char* encoding, int pretty_print):
+    cdef xmlNode* c_sibling
+    if c_node.parent is not NULL and _isElement(c_node.parent):
+        return
+    # we are at a root node, so add PI and comment siblings
+    c_sibling = c_node.next
+    while c_sibling != NULL and \
+            (c_sibling.type == tree.XML_PI_NODE or \
+             c_sibling.type == tree.XML_COMMENT_NODE):
+        tree.xmlNodeDumpOutput(c_buffer, c_node.doc, c_sibling, 0,
+                               pretty_print, encoding)
+        c_sibling =
c_sibling.next + # output to file-like objects cdef class _FilelikeWriter: Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Sat May 5 12:29:50 2007 @@ -404,6 +404,156 @@ Element = self.etree.Element self.assertRaises(TypeError, Element('a').append, None) + def test_addnext(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + root = Element('root') + SubElement(root, 'a') + SubElement(root, 'b') + + self.assertEquals(['a', 'b'], + [c.tag for c in root]) + root[1].addnext(root[0]) + self.assertEquals(['b', 'a'], + [c.tag for c in root]) + + def test_addprevious(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + root = Element('root') + SubElement(root, 'a') + SubElement(root, 'b') + + self.assertEquals(['a', 'b'], + [c.tag for c in root]) + root[0].addprevious(root[1]) + self.assertEquals(['b', 'a'], + [c.tag for c in root]) + + def test_addnext_root(self): + Element = self.etree.Element + a = Element('a') + b = Element('b') + self.assertRaises(TypeError, a.addnext, b) + + def test_addnext_root(self): + Element = self.etree.Element + a = Element('a') + b = Element('b') + self.assertRaises(TypeError, a.addnext, b) + + def test_addprevious_pi(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + PI = self.etree.PI + root = Element('root') + SubElement(root, 'a') + pi = PI('TARGET', 'TEXT') + pi.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root[0].addprevious(pi) + self.assertEquals('TAIL', + self._writeElement(root)) + + def test_addprevious_root_pi(self): + Element = self.etree.Element + PI = self.etree.PI + root = Element('root') + pi = PI('TARGET', 'TEXT') + pi.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root.addprevious(pi) + self.assertEquals('\n', + 
self._writeElement(root)) + + def test_addnext_pi(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + PI = self.etree.PI + root = Element('root') + SubElement(root, 'a') + pi = PI('TARGET', 'TEXT') + pi.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root[0].addnext(pi) + self.assertEquals('TAIL', + self._writeElement(root)) + + def test_addnext_root_pi(self): + Element = self.etree.Element + PI = self.etree.PI + root = Element('root') + pi = PI('TARGET', 'TEXT') + pi.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root.addnext(pi) + self.assertEquals('\n', + self._writeElement(root)) + + def test_addnext_comment(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + Comment = self.etree.Comment + root = Element('root') + SubElement(root, 'a') + comment = Comment('TEXT ') + comment.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root[0].addnext(comment) + self.assertEquals('TAIL', + self._writeElement(root)) + + def test_addnext_root_comment(self): + Element = self.etree.Element + Comment = self.etree.Comment + root = Element('root') + comment = Comment('TEXT ') + comment.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root.addnext(comment) + self.assertEquals('\n', + self._writeElement(root)) + + def test_addprevious_comment(self): + Element = self.etree.Element + SubElement = self.etree.SubElement + Comment = self.etree.Comment + root = Element('root') + SubElement(root, 'a') + comment = Comment('TEXT ') + comment.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root[0].addprevious(comment) + self.assertEquals('TAIL', + self._writeElement(root)) + + def test_addprevious_root_comment(self): + Element = self.etree.Element + Comment = self.etree.Comment + root = Element('root') + comment = Comment('TEXT ') + comment.tail = "TAIL" + + self.assertEquals('', + self._writeElement(root)) + root.addprevious(comment) + 
self.assertEquals('\n', + self._writeElement(root)) + # ET's Elements have items() and key(), but not values() def test_attribute_values(self): XML = self.etree.XML Modified: lxml/trunk/src/lxml/tests/test_sax.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_sax.py (original) +++ lxml/trunk/src/lxml/tests/test_sax.py Sat May 5 12:29:50 2007 @@ -25,6 +25,30 @@ self.assertEquals('abbbba', xml_out) + def test_etree_sax_comment(self): + tree = self.parse('abba') + xml_out = self._saxify_serialize(tree) + self.assertEquals('abba', + xml_out) + + def test_etree_sax_pi(self): + tree = self.parse('abba') + xml_out = self._saxify_serialize(tree) + self.assertEquals('abba', + xml_out) + + def test_etree_sax_comment_root(self): + tree = self.parse('ab') + xml_out = self._saxify_serialize(tree) + self.assertEquals('ab', + xml_out) + + def test_etree_sax_pi_root(self): + tree = self.parse('ab') + xml_out = self._saxify_serialize(tree) + self.assertEquals('ab', + xml_out) + def test_etree_sax_attributes(self): tree = self.parse('abba') xml_out = self._saxify_serialize(tree) From scoder at codespeak.net Sat May 5 19:02:35 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 5 May 2007 19:02:35 +0200 (CEST) Subject: [Lxml-checkins] r42704 - in lxml/trunk: . 
src/lxml
Message-ID: <20070505170235.1CACD806D@code0.codespeak.net>

Author: scoder
Date: Sat May 5 19:02:33 2007
New Revision: 42704

Modified:
   lxml/trunk/CHANGES.txt
   lxml/trunk/src/lxml/extensions.pxi
Log:
support passing a node-set instead of a string in XPath regexps

Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Sat May 5 19:02:33 2007
@@ -8,6 +8,9 @@
 Features added
 --------------
 
+* The regular expression functions in XPath now support passing a node-set
+  instead of a string
+
 * ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support
   adding processing instructions and comments around the root node
 
@@ -17,7 +20,7 @@
   plus new ``xsiannotate()`` and ``deannotate()`` functions
 
 * Support for custom Element class instantiation in lxml.sax: passing a
-  ``makeelement()`` function to the ElementTreeContentHandler will reuse the
+  ``makeelement`` function to the ElementTreeContentHandler will reuse the
   lookup context of that function
 
 * '.'
represents empty ObjectPath (identity)

Modified: lxml/trunk/src/lxml/extensions.pxi
==============================================================================
--- lxml/trunk/src/lxml/extensions.pxi (original)
+++ lxml/trunk/src/lxml/extensions.pxi Sat May 5 19:02:33 2007
@@ -306,6 +306,18 @@
 ################################################################################
 # EXSLT regexp implementation
 
+cdef int _collect_tree_text(element, l) except -1:
+    # recursively collect all text (XPath 'string-value' of a node)
+    text = element.text
+    if text is not None:
+        python.PyList_Append(l, text)
+    for child in element:
+        _collect_tree_text(child, l)
+    tail = element.tail
+    if tail is not None:
+        python.PyList_Append(l, tail)
+    return 0
+
 cdef class _ExsltRegExp:
     cdef object _compile_map
     def __init__(self):
@@ -314,6 +326,19 @@
     cdef _make_string(self, value):
         if _isString(value):
             return value
+        elif python.PyList_Check(value):
+            # node set: take recursive text concatenation of first element
+            if python.PyList_GET_SIZE(value) == 0:
+                return ''
+            firstnode = value[0]
+            if _isString(firstnode):
+                return firstnode
+            elif isinstance(firstnode, _Element):
+                l = []
+                _collect_tree_text(firstnode, l)
+                return ''.join(l)
+            else:
+                return str(firstnode)
         else:
             raise TypeError, "Invalid argument type %s" % type(value)

From scoder at codespeak.net Sat May 5 19:09:00 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 5 May 2007 19:09:00 +0200 (CEST)
Subject: [Lxml-checkins] r42705 - lxml/trunk/doc
Message-ID: <20070505170900.2FDB6806D@code0.codespeak.net>

Author: scoder
Date: Sat May 5 19:08:59 2007
New Revision: 42705

Modified:
   lxml/trunk/doc/xpathxslt.txt
Log:
rewrite of XPath doc page

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Sat May 5 19:08:59 2007
@@ -6,10 +6,15 @@
 compliant way.
 
 .. contents::
-..
+..
1 XPath + 1.1 The ``xpath()`` method + 1.2 The ``XPath`` class + 1.3 The ``XPathEvaluator`` classes + 1.4 ``ETXPath`` 2 XSLT + The usual setup procedure:: >>> from lxml import etree @@ -17,12 +22,17 @@ XPath ------ +===== + +lxml.etree supports the simple path syntax of the `find, findall and +findtext`_ methods on ElementTree and Element, as known from the original +ElementTree library (ElementPath_). As an lxml specific extension, these +classes also provide an ``xpath()`` method that supports expressions in the +complete XPath syntax, as well as `extension functions`_. -lxml.etree supports the simple path syntax of the ``findall()`` etc. methods -on ElementTree and Element, as known from the original ElementTree library. -As an extension, these classes also provide an ``xpath()`` method that -supports expressions in the complete XPath syntax. +.. _ElementPath: http://effbot.org/zone/element-xpath.htm +.. _`find, findall and findtext`: http://effbot.org/zone/element.htm#searching-for-subelements +.. _`extension functions`: extensions.html There are also specialized XPath evaluator classes that are more efficient for frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance @@ -32,6 +42,10 @@ .. 
_`performance comparison`: performance.html#xpath + +The ``xpath()`` method +---------------------- + For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative):: @@ -48,7 +62,7 @@ >>> r[0].tag 'bar' -When ``xpath()`` is used on an element, the XPath expression is evaluated +When ``xpath()`` is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute):: >>> root = tree.getroot() @@ -66,6 +80,19 @@ >>> r[0].tag 'bar' +The ``xpath()`` method has support for XPath variables:: + + >>> expr = "//*[local-name() = $name]" + + >>> print root.xpath(expr, name = "foo")[0].tag + foo + + >>> print root.xpath(expr, name = "bar")[0].tag + bar + + >>> print root.xpath("$text", text = "Hello World!") + Hello World! + Optionally, you can provide a ``namespaces`` keyword argument, which should be a dictionary mapping the namespace prefixes used in the XPath expression to namespace URIs:: @@ -102,11 +129,10 @@ * a (unicode) string, when the XPath expression has a string result. * a list of items, when the XPath expression has a list as result. The items - may include elements, strings and tuples. Text nodes and attributes in the - result are returned as strings (the text node content or attribute value). - Comments are also returned as strings, enclosed by the usual ```` markers. Namespace declarations are returned as tuples of strings: - ``(prefix, URI)``. + may include elements (also comments and processing instructions), strings + and tuples. Text nodes and attributes in the result are returned as strings + (the text node content or attribute value). Namespace declarations are + returned as tuples of strings: ``(prefix, URI)``. 
A related convenience method of ElementTree objects is ``getpath(element)``, which returns a structural, absolute XPath expression to find that element:: @@ -124,8 +150,111 @@ True +The ``XPath`` class +------------------- + +The ``XPath`` class compiles an XPath expression into a callable function:: + + >>> root = etree.XML("") + + >>> find = etree.XPath("//b") + >>> print find(root)[0].tag + b + +The compilation takes as much time as in the ``xpath()`` method, but it is +done only once per class instantiation. This makes it especially efficient +for repeated evaluation of the same XPath expression. + +Just like the ``xpath()`` method, the ``XPath`` class supports XPath +variables:: + + >>> count_elements = etree.XPath("count(//*[local-name() = $name])") + + >>> print count_elements(root, name = "a") + 1.0 + >>> print count_elements(root, name = "b") + 2.0 + +This supports very efficient evaluation of modified versions of an XPath +expression, as compilation is still only required once. + +Prefix-to-namespace mappings can be passed as second parameter:: + + >>> root = etree.XML("") + + >>> find = etree.XPath("//n:b", {'n':'NS'}) + >>> print find(root)[0].tag + {NS}b + +You can pass the boolean keyword ``regexp`` to enable Python regular +expressions in the EXSLT_ namespace:: + + >>> regexpNS = "http://exslt.org/regular-expressions" + >>> find = etree.XPath("//*[r:test(., '^abc$', 'i')]", + ... {'r':regexpNS}, regexp = True) + + >>> root = etree.XML("aBaBc") + >>> print find(root)[0].text + aBc + +.. _EXSLT: http://www.exslt.org/ + + +The ``XPathEvaluator`` classes +------------------------------ + +lxml.etree provides two other efficient XPath evaluators that work on +ElementTrees or Elements respectively: ``XPathDocumentEvaluator`` and +``XPathElementEvaluator``. 
They are automatically selected if you use the +XPathEvaluator helper for instantiation:: + + >>> root = etree.XML("") + >>> xpatheval = etree.XPathEvaluator(root) + + >>> print isinstance(xpatheval, etree.XPathElementEvaluator) + True + + >>> print xpatheval("//b")[0].tag + b + +This class provides efficient support for evaluating different XPath +expressions on the same Element or ElementTree. + + +``ETXPath`` +----------- + +ElementTree supports a language named ElementPath_ in its ``find*()`` methods. +One of the main differences between XPath and ElementPath is that the XPath +language requires an indirection through prefixes for namespace support, +whereas ElementTree uses the Clark notation (``{ns}name``) to avoid prefixes +completely. The other major difference regards the capabilities of both path +languages. Where XPath supports various sophisticated ways of restricting the +result set through functions and boolean expressions, ElementPath only +supports pure path traversal without nesting or further conditions. So, while +the ElementPath syntax is self-contained and therefore easier to write and +handle, XPath is much more powerful and expressive. + +lxml.etree bridges this gap through the class ``ETXPath``, which accepts XPath +expressions with namespaces in Clark notation. It is identical to the +``XPath`` class, except for the namespace notation. Normally, you would +write:: + + >>> root = etree.XML("") + + >>> find = etree.XPath("//p:b", {'p' : 'ns'}) + >>> print find(root)[0].tag + {ns}b + +``ETXPath`` allows you to change this to:: + + >>> find = etree.ETXPath("//{ns}b") + >>> print find(root)[0].tag + {ns}b + + XSLT ----- +==== lxml.etree introduces a new class, lxml.etree.XSLT. 
The class can be given an ElementTree object to construct an XSLT transformer:: From scoder at codespeak.net Sun May 6 08:57:04 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 08:57:04 +0200 (CEST) Subject: [Lxml-checkins] r42723 - lxml/trunk/doc Message-ID: <20070506065704.92C57807C@code0.codespeak.net> Author: scoder Date: Sun May 6 08:57:03 2007 New Revision: 42723 Modified: lxml/trunk/doc/resolvers.txt lxml/trunk/doc/xpathxslt.txt Log: restructuring of XSLT docs Modified: lxml/trunk/doc/resolvers.txt ============================================================================== --- lxml/trunk/doc/resolvers.txt (original) +++ lxml/trunk/doc/resolvers.txt Sun May 6 08:57:03 2007 @@ -3,13 +3,20 @@ .. contents:: .. - 1 Document loaders in context - 2 I/O access control in XSLT + 1 Resolvers + 2 Document loading in context + 3 I/O access control in XSLT Lxml has support for custom document loaders in both the parsers and XSL transformations. These so-called resolvers are subclasses of the -etree.Resolver class as in the following example:: +etree.Resolver class. + + +Resolvers +--------- + +Here is an example of a custom resolver:: >>> from lxml import etree @@ -32,10 +39,10 @@ * ``resolve_file`` takes an open file-like object that has at least a read() method * ``resolve_empty`` resolves into an empty document -The ``resolve`` method may choose to return None, in which case the next -registered resolver (or the default resolver) is consulted. It is never -called if the resolver returns the result of any of the above ``resolve_*`` -methods. +The ``resolve()`` method may choose to return None, in which case the next +registered resolver (or the default resolver) is consulted. Resolving always +terminates if ``resolve()`` returns the result of any of the above +``resolve_*()`` methods. Resolvers are registered local to a parser:: @@ -58,7 +65,7 @@ fragment. 
-Document loaders in context +Document loading in context --------------------------- XML documents memorise their initial parser (and its resolvers) during their @@ -180,12 +187,16 @@ I/O access control in XSLT -------------------------- -XSLT has an additional mechanism to control the access to certain I/O -operations during the transformation process. This is most interesting where -XSL scripts come from potentially insecure sources and must be prevented from -modifying the local file system. Note, however, that there is no way to keep -them from eating up your precious CPU time, so this should not stop you from -thinking about what XSLT you execute. +By default, XSLT supports all extension functions from libxslt and libexslt as +well as Python regular expressions through EXSLT. Some extensions enable +style sheets to read and write files on the local file system. + +XSLT has a mechanism to control the access to certain I/O operations during +the transformation process. This is most interesting where XSL scripts come +from potentially insecure sources and must be prevented from modifying the +local file system. Note, however, that there is no way to keep them from +eating up your precious CPU time, so this should not stop you from thinking +about what XSLT you execute. Access control is configured using the ``XSLTAccessControl`` class. It can be called with a number of keyword arguments that allow or deny specific Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Sun May 6 08:57:03 2007 @@ -28,11 +28,11 @@ findtext`_ methods on ElementTree and Element, as known from the original ElementTree library (ElementPath_). As an lxml specific extension, these classes also provide an ``xpath()`` method that supports expressions in the -complete XPath syntax, as well as `extension functions`_. 
+complete XPath syntax, as well as `custom extension functions`_. .. _ElementPath: http://effbot.org/zone/element-xpath.htm .. _`find, findall and findtext`: http://effbot.org/zone/element.htm#searching-for-subelements -.. _`extension functions`: extensions.html +.. _`custom extension functions`: extensions.html There are also specialized XPath evaluator classes that are more efficient for frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance @@ -115,9 +115,11 @@ 'Text' There is also an optional ``extensions`` argument which is used to define -`extension functions`_ in Python that are local to this evaluation. +`custom extension functions`_ in Python that are local to this evaluation. -.. _`extension functions`: extensions.html + +XPath return values +------------------- The return values of XPath evaluations vary, depending on the XPath expression used: @@ -315,6 +317,18 @@ [...] LookupError: unknown encoding: UCS4 +By default, XSLT supports all extension functions from libxslt and libexslt as +well as Python regular expressions through EXSLT. Also see the documentation +on `custom extension functions`_ and `document resolvers`_. There is a +separate section on `controlling access`_ to external documents and resources. + +.. _`document resolvers`: resolvers.html +.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt + + +Stylesheet parameters +--------------------- + It is possible to pass parameters, in the form of XPath expressions, to the XSLT template:: @@ -342,7 +356,11 @@ >>> str(result) '\nText\n' -There's also a convenience method on the tree object for doing XSL + +The ``xslt()`` tree method +-------------------------- + +There's also a convenience method on ElementTree objects for doing XSL transformations. 
This is less efficient if you want to apply the same XSL transformation to multiple documents, but is shorter to write for one-shot operations, as you do not have to instantiate a stylesheet yourself:: @@ -351,12 +369,16 @@ >>> str(result) '\nA\n' -By default, XSLT supports all extension functions from libxslt and libexslt as -well as Python regular expressions through EXSLT. Note that some extensions -enable style sheets to read and write files on the local file system. See the -`document loader documentation`_ on how to deal with this. +This is a shortcut for the following code:: + + >>> transform = etree.XSLT(xslt_tree) + >>> result = transform(doc, a="'A'") + >>> str(result) + '\nA\n' + -.. _`document loader documentation`: resolvers.html +Profiling +--------- If you want to know how your stylesheet performed, pass the ``profile_run`` keyword to the transform:: From scoder at codespeak.net Sun May 6 09:02:34 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 09:02:34 +0200 (CEST) Subject: [Lxml-checkins] r42724 - lxml/trunk/doc Message-ID: <20070506070234.CC10F807C@code0.codespeak.net> Author: scoder Date: Sun May 6 09:02:34 2007 New Revision: 42724 Modified: lxml/trunk/doc/xpathxslt.txt Log: restructuring of XSLT docs Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Sun May 6 09:02:34 2007 @@ -9,10 +9,15 @@ .. 
1 XPath 1.1 The ``xpath()`` method - 1.2 The ``XPath`` class - 1.3 The ``XPathEvaluator`` classes - 1.4 ``ETXPath`` + 1.2 XPath return values + 1.3 The ``XPath`` class + 1.4 The ``XPathEvaluator`` classes + 1.5 ``ETXPath`` 2 XSLT + 2.1 XSLT result objects + 2.2 Stylesheet parameters + 2.3 The ``xslt()`` tree method + 2.4 Profiling The usual setup procedure:: @@ -276,9 +281,28 @@ >>> f = StringIO('Text') >>> doc = etree.parse(f) - >>> result = transform(doc) + >>> result_tree = transform(doc) + +By default, XSLT supports all extension functions from libxslt and libexslt as +well as Python regular expressions through the `EXSLT regexp functions`_. +Also see the documentation on `custom extension functions`_ and `document +resolvers`_. There is a separate section on `controlling access`_ to external +documents and resources. + +.. _`EXSLT regexp functions`: http://www.exslt.org/regexp/ +.. _`document resolvers`: resolvers.html +.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt + + +XSLT result objects +------------------- -The result object can be accessed like a normal ElementTree document:: +The result of an XSL transformation can be accessed like a normal ElementTree +document:: + + >>> f = StringIO('Text') + >>> doc = etree.parse(f) + >>> result = transform(doc) >>> result.getroot().text 'Text' @@ -317,14 +341,6 @@ [...] LookupError: unknown encoding: UCS4 -By default, XSLT supports all extension functions from libxslt and libexslt as -well as Python regular expressions through EXSLT. Also see the documentation -on `custom extension functions`_ and `document resolvers`_. There is a -separate section on `controlling access`_ to external documents and resources. - -.. _`document resolvers`: resolvers.html -.. 
_`controlling access`: resolvers.html#i-o-access-control-in-xslt - Stylesheet parameters --------------------- From scoder at codespeak.net Sun May 6 09:10:32 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 09:10:32 +0200 (CEST) Subject: [Lxml-checkins] r42725 - lxml/trunk/doc Message-ID: <20070506071032.1B3D4807C@code0.codespeak.net> Author: scoder Date: Sun May 6 09:10:31 2007 New Revision: 42725 Modified: lxml/trunk/doc/element_classes.txt Log: cleanup Modified: lxml/trunk/doc/element_classes.txt ============================================================================== --- lxml/trunk/doc/element_classes.txt (original) +++ lxml/trunk/doc/element_classes.txt Sun May 6 09:10:31 2007 @@ -29,11 +29,12 @@ 2.2 Namespace class lookup 2.3 Attribute based lookup 2.4 Custom element class lookup + 2.5 Tree based element class lookup in Python 3 Implementing namespaces Element initialization ----------------------- +====================== There is one thing to know up front. Element classes *must not* have a constructor, neither must there be any internal state (except for the data @@ -72,7 +73,7 @@ Setting up a class lookup scheme --------------------------------- +================================ The first thing to do when deploying custom element classes is to register a class lookup scheme on a parser. lxml.etree provides quite a number of @@ -140,7 +141,7 @@ Default class lookup -.................... +-------------------- This is the most simple lookup mechanism. It always returns the default element class. Consequently, no further fallbacks are supported, but this @@ -179,7 +180,7 @@ Namespace class lookup -...................... +---------------------- This is an advanced lookup mechanism that supports namespace/tag-name specific element classes. You can select it by calling:: @@ -204,14 +205,15 @@ Attribute based lookup -...................... +---------------------- This scheme uses a mapping from attribute values to classes. 
An attribute name is set at initialisation time and is then used to find the corresponding value. It is set up as follows:: >>> id_class_mapping = {} # maps attribute values to element classes - >>> lookup = etree.AttributeBasedElementClassLookup('id', id_class_mapping) + >>> lookup = etree.AttributeBasedElementClassLookup( + ... 'id', id_class_mapping) >>> parser = etree.XMLParser() >>> parser.setElementClassLookup(lookup) @@ -230,7 +232,7 @@ Custom element class lookup -........................... +--------------------------- This is the most customisable way of finding element classes on a per-element basis. It allows you to implement a custom lookup scheme in a subclass:: @@ -252,7 +254,7 @@ Tree based element class lookup in Python -......................................... +----------------------------------------- Taking more elaborate decisions than allowed by the custom scheme is difficult to achieve in pure Python. It would require access to the tree - before the @@ -291,7 +293,7 @@ Implementing namespaces ------------------------ +======================= lxml allows you to implement namespaces, in a rather literal sense. After setting up the namespace class lookup mechanism as described above, you can From scoder at codespeak.net Sun May 6 09:24:19 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 09:24:19 +0200 (CEST) Subject: [Lxml-checkins] r42727 - lxml/trunk/doc Message-ID: <20070506072419.E583C807C@code0.codespeak.net> Author: scoder Date: Sun May 6 09:24:19 2007 New Revision: 42727 Modified: lxml/trunk/doc/element_classes.txt Log: cleanup Modified: lxml/trunk/doc/element_classes.txt ============================================================================== --- lxml/trunk/doc/element_classes.txt (original) +++ lxml/trunk/doc/element_classes.txt Sun May 6 09:24:19 2007 @@ -4,8 +4,8 @@ lxml has very sophisticated support for custom Element classes. 
You can provide your own classes for Elements and have lxml use them by default, for -all elements generated by a specific parser or only for a specific tag name in -a specific namespace. +all elements generated by a specific parser, for a specific tag name in a +specific namespace or for an exact element at a specific position in the tree. Custom Elements must inherit from the ``lxml.etree.ElementBase`` class, which provides the Element interface for subclasses:: @@ -44,10 +44,12 @@ called, the object may not even be initialized yet to represent the XML tag, so there is not much use in providing an ``__init__`` method in subclasses. -However, there is one possible way to do things on element initialization, if -you really need to. ElementBase classes have an ``_init()`` method that can -be overridden. It can be used to modify the XML tree, e.g. to construct -special children or verify and update attributes. +Most use cases will not require any class initialisation, so you can content +yourself with skipping to the next section for now. However, if you really +need to set up your element class on instantiation, there is one possible way +to do so. ElementBase classes have an ``_init()`` method that can be +overridden. It can be used to modify the XML tree, e.g. to construct special +children or verify and update attributes. 
The semantics of ``_init()`` are as follows:

From scoder at codespeak.net Sun May 6 11:10:59 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 11:10:59 +0200 (CEST)
Subject: [Lxml-checkins] r42730 - lxml/trunk/src/lxml
Message-ID: <20070506091059.8894B807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 11:10:59 2007
New Revision: 42730

Modified:
   lxml/trunk/src/lxml/etree.pyx
   lxml/trunk/src/lxml/etree_defs.h
   lxml/trunk/src/lxml/python.pxd
Log:
fast path for instantiation of _Element class (20% faster)

Modified: lxml/trunk/src/lxml/etree.pyx
==============================================================================
--- lxml/trunk/src/lxml/etree.pyx (original)
+++ lxml/trunk/src/lxml/etree.pyx Sun May 6 11:10:59 2007
@@ -1043,6 +1043,9 @@
 else:
     ELEMENT_CREATION_LOCK = NULL
 
+cdef extern from "etree_defs.h":
+    cdef _Element NEW_ELEMENT "PY_NEW" (object t)
+
 cdef _Element _elementFactory(_Document doc, xmlNode* c_node):
     cdef python.PyThreadState* state
     cdef _Element result
@@ -1061,9 +1064,13 @@
             python.PyThread_release_lock(ELEMENT_CREATION_LOCK)
         return result
 
-    element_class = LOOKUP_ELEMENT_CLASS(ELEMENT_CLASS_LOOKUP_STATE,
-                                         doc, c_node)
-    result = element_class()
+    element_class = LOOKUP_ELEMENT_CLASS(
+        ELEMENT_CLASS_LOOKUP_STATE, doc, c_node)
+    if element_class is _Element:
+        # fast path for standard _Element class
+        result = NEW_ELEMENT(_Element)
+    else:
+        result = element_class()
     result._doc = doc
     result._c_node = c_node
     registerProxy(result)
@@ -1071,7 +1078,8 @@
     if config.ENABLE_THREADING:
         python.PyThread_release_lock(ELEMENT_CREATION_LOCK)
 
-    result._init()
+    if element_class is not _Element:
+        result._init()
     return result

Modified: lxml/trunk/src/lxml/etree_defs.h
==============================================================================
--- lxml/trunk/src/lxml/etree_defs.h (original)
+++ lxml/trunk/src/lxml/etree_defs.h Sun May 6 11:10:59 2007
@@ -64,6 +64,16 @@
 #define iter(o) PyObject_GetIter(o)
 #define _cstr(s)
PyString_AS_STRING(s) +static PyObject* __PY_NEW_GLOBAL_EMPTY_TUPLE = NULL; + +#define PY_NEW(T) \ + (((PyTypeObject*)(T))->tp_new( \ + (PyTypeObject*)(T), \ + ((__PY_NEW_GLOBAL_EMPTY_TUPLE == NULL) ? \ + (__PY_NEW_GLOBAL_EMPTY_TUPLE = PyTuple_New(0)) : \ + (__PY_NEW_GLOBAL_EMPTY_TUPLE)), \ + NULL)) + #define _isString(obj) PyObject_TypeCheck(obj, &PyBaseString_Type) #define _isElement(c_node) \ Modified: lxml/trunk/src/lxml/python.pxd ============================================================================== --- lxml/trunk/src/lxml/python.pxd (original) +++ lxml/trunk/src/lxml/python.pxd Sun May 6 11:10:59 2007 @@ -112,3 +112,4 @@ cdef object repr(object obj) cdef object iter(object obj) cdef char* _cstr(object s) + cdef object PY_NEW(object t) From scoder at codespeak.net Sun May 6 11:13:04 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 11:13:04 +0200 (CEST) Subject: [Lxml-checkins] r42731 - lxml/trunk/src/lxml Message-ID: <20070506091304.4167B807C@code0.codespeak.net> Author: scoder Date: Sun May 6 11:13:04 2007 New Revision: 42731 Modified: lxml/trunk/src/lxml/etree.pyx Log: cleanup Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sun May 6 11:13:04 2007 @@ -1044,6 +1044,7 @@ ELEMENT_CREATION_LOCK = NULL cdef extern from "etree_defs.h": + # macro call to 'tp_new()' for fast instantiation cdef _Element NEW_ELEMENT "PY_NEW" (object t) cdef _Element _elementFactory(_Document doc, xmlNode* c_node): From scoder at codespeak.net Sun May 6 11:32:10 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 6 May 2007 11:32:10 +0200 (CEST) Subject: [Lxml-checkins] r42736 - lxml/trunk/doc Message-ID: <20070506093210.723A8807F@code0.codespeak.net> Author: scoder Date: Sun May 6 11:32:10 2007 New Revision: 42736 Modified: lxml/trunk/doc/performance.txt Log: new benchmark 
results after _Element instantiation speedup Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Sun May 6 11:32:10 2007 @@ -144,9 +144,9 @@ (given in seconds):: lxe: -- S- U- -A SA UA - T1: 0.1155 0.1154 0.1153 0.1159 0.1181 0.1158 - T2: 0.1183 0.1197 0.1200 0.1267 0.1261 0.1264 - T3: 0.0341 0.0312 0.0314 0.0726 0.0717 0.0720 + T1: 0.1181 0.1080 0.1074 0.1088 0.1087 0.1099 + T2: 0.1103 0.1109 0.1164 0.1241 0.1203 0.1231 + T3: 0.0297 0.0309 0.0297 0.0716 0.0704 0.0703 T4: 0.0005 0.0004 0.0004 0.0014 0.0014 0.0014 cET: -- S- U- -A SA UA T1: 0.0290 0.0271 0.0275 0.0297 0.0273 0.0274 @@ -169,18 +169,18 @@ Where ET and cET can quickly create a shallow copy of their list of children, lxml has to create a Python object for each child and collect them in a list:: - lxe: root_getchildren (--TR T2) 0.3500 msec/pass + lxe: root_getchildren (--TR T2) 0.1960 msec/pass cET: root_getchildren (--TR T2) 0.0150 msec/pass ET : root_getchildren (--TR T2) 0.0091 msec/pass When accessing single children, however, e.g. by index, this handicap is negligible:: - lxe: first_child (--TR T2) 0.2499 msec/pass + lxe: first_child (--TR T2) 0.2289 msec/pass cET: first_child (--TR T2) 0.2048 msec/pass ET : first_child (--TR T2) 0.9291 msec/pass - lxe: last_child (--TR T1) 0.2511 msec/pass + lxe: last_child (--TR T1) 0.2310 msec/pass cET: last_child (--TR T1) 0.2148 msec/pass ET : last_child (--TR T1) 0.9191 msec/pass @@ -188,11 +188,11 @@ cET use Python lists here, which are based on arrays. 
The data structure used by libxml2 is a linked tree, and thus, a linked list of children:: - lxe: middle_child (--TR T1) 0.2921 msec/pass + lxe: middle_child (--TR T1) 0.2759 msec/pass cET: middle_child (--TR T1) 0.2069 msec/pass ET : middle_child (--TR T1) 0.9291 msec/pass - lxe: middle_child (--TR T2) 1.9028 msec/pass + lxe: middle_child (--TR T2) 1.7111 msec/pass cET: middle_child (--TR T2) 0.2089 msec/pass ET : middle_child (--TR T2) 0.9360 msec/pass @@ -208,11 +208,11 @@ are supposed to end up in, either as SubElements of an Element or using the explicit ``Element.makeelement()`` call:: - lxe: makeelement (--TC T2) 2.5990 msec/pass + lxe: makeelement (--TC T2) 2.3680 msec/pass cET: makeelement (--TC T2) 0.3128 msec/pass ET : makeelement (--TC T2) 1.6940 msec/pass - lxe: create_subelements (--TC T2) 2.3072 msec/pass + lxe: create_subelements (--TC T2) 2.2051 msec/pass cET: create_subelements (--TC T2) 0.2370 msec/pass ET : create_subelements (--TC T2) 3.2189 msec/pass @@ -257,11 +257,11 @@ You should keep this difference in mind when you merge very large trees. 
On the other hand, deep copying a tree is fast in lxml:: - lxe: deepcopy (--TC T1) 10.6010 msec/pass + lxe: deepcopy (--TC T1) 10.5221 msec/pass cET: deepcopy (--TC T1) 220.2251 msec/pass ET : deepcopy (--TC T1) 463.7730 msec/pass - lxe: deepcopy (--TC T3) 8.2979 msec/pass + lxe: deepcopy (--TC T3) 8.2841 msec/pass cET: deepcopy (--TC T3) 53.8740 msec/pass ET : deepcopy (--TC T3) 118.2799 msec/pass @@ -277,33 +277,33 @@ especially if few elements are of interest or the element tag name is known, lxml is a good choice:: - lxe: getiterator_all (--TR T2) 10.3800 msec/pass + lxe: getiterator_all (--TR T2) 6.4790 msec/pass cET: getiterator_all (--TR T2) 28.2831 msec/pass ET : getiterator_all (--TR T2) 26.0720 msec/pass - lxe: getiterator_islice (--TR T2) 0.1140 msec/pass + lxe: getiterator_islice (--TR T2) 0.0892 msec/pass cET: getiterator_islice (--TR T2) 0.2460 msec/pass ET : getiterator_islice (--TR T2) 26.6550 msec/pass - lxe: getiterator_tag (--TR T2) 0.3879 msec/pass + lxe: getiterator_tag (--TR T2) 0.3850 msec/pass cET: getiterator_tag (--TR T2) 9.3720 msec/pass ET : getiterator_tag (--TR T2) 22.8221 msec/pass - lxe: getiterator_tag_all (--TR T2) 0.8819 msec/pass + lxe: getiterator_tag_all (--TR T2) 0.7222 msec/pass cET: getiterator_tag_all (--TR T2) 27.2939 msec/pass ET : getiterator_tag_all (--TR T2) 22.8271 msec/pass This similarly shows in ``Element.findall()``:: - lxe: findall (--TR T2) 10.9370 msec/pass + lxe: findall (--TR T2) 6.8321 msec/pass cET: findall (--TR T2) 28.8639 msec/pass ET : findall (--TR T2) 27.1060 msec/pass - lxe: findall (--TR T3) 2.1989 msec/pass + lxe: findall (--TR T3) 1.3590 msec/pass cET: findall (--TR T3) 8.9881 msec/pass ET : findall (--TR T3) 6.4890 msec/pass - lxe: findall_tag (--TR T2) 0.9520 msec/pass + lxe: findall_tag (--TR T2) 0.9229 msec/pass cET: findall_tag (--TR T2) 27.2651 msec/pass ET : findall_tag (--TR T2) 22.7208 msec/pass From scoder at codespeak.net Mon May 7 11:00:39 2007 From: scoder at codespeak.net (scoder at 
codespeak.net) Date: Mon, 7 May 2007 11:00:39 +0200 (CEST) Subject: [Lxml-checkins] r42773 - lxml/trunk/src/lxml Message-ID: <20070507090039.93BD28069@code0.codespeak.net> Author: scoder Date: Mon May 7 11:00:39 2007 New Revision: 42773 Modified: lxml/trunk/src/lxml/etree.pyx Log: comment Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Mon May 7 11:00:39 2007 @@ -1044,7 +1044,7 @@ ELEMENT_CREATION_LOCK = NULL cdef extern from "etree_defs.h": - # macro call to 'tp_new()' for fast instantiation + # macro call to 't->tp_new()' for fast instantiation cdef _Element NEW_ELEMENT "PY_NEW" (object t) cdef _Element _elementFactory(_Document doc, xmlNode* c_node): From scoder at codespeak.net Mon May 7 11:01:41 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 11:01:41 +0200 (CEST) Subject: [Lxml-checkins] r42774 - lxml/trunk/src/lxml Message-ID: <20070507090141.54EC68069@code0.codespeak.net> Author: scoder Date: Mon May 7 11:01:41 2007 New Revision: 42774 Modified: lxml/trunk/src/lxml/extensions.pxi Log: cleanup: use libxml2 API function Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Mon May 7 11:01:41 2007 @@ -306,18 +306,6 @@ ################################################################################ # EXSLT regexp implementation -cdef int _collect_tree_text(element, l) except -1: - # recursively collect all text (XPath 'string-value' of a node) - text = element.text - if text is not None: - python.PyList_Append(l, text) - for child in element: - _collect_tree_text(child, l) - tail = element.tail - if tail is not None: - python.PyList_Append(l, tail) - return 0 - cdef class _ExsltRegExp: cdef object _compile_map def 
__init__(self): @@ -334,9 +322,8 @@ if _isString(firstnode): return firstnode elif isinstance(firstnode, _Element): - l = [] - _collect_tree_text(firstnode, l) - return ''.join(l) + return funicode( + tree.xmlNodeGetContent((<_Element>firstnode)._c_node)) else: return str(firstnode) else: From scoder at codespeak.net Mon May 7 11:02:49 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 11:02:49 +0200 (CEST) Subject: [Lxml-checkins] r42775 - lxml/trunk/src/lxml Message-ID: <20070507090249.45AA88065@code0.codespeak.net> Author: scoder Date: Mon May 7 11:02:48 2007 New Revision: 42775 Modified: lxml/trunk/src/lxml/relaxng.pxi lxml/trunk/src/lxml/xmlschema.pxi Log: make clear when libxml2 bug was fixed Modified: lxml/trunk/src/lxml/relaxng.pxi ============================================================================== --- lxml/trunk/src/lxml/relaxng.pxi (original) +++ lxml/trunk/src/lxml/relaxng.pxi Mon May 7 11:02:48 2007 @@ -32,11 +32,12 @@ root_node = _rootNodeOrRaise(etree) c_node = root_node._c_node # work around for libxml2 bug if document is not RNG at all - c_href = _getNs(c_node) - if c_href is NULL or \ - cstd.strcmp(c_href, - 'http://relaxng.org/ns/structure/1.0') != 0: - raise RelaxNGParseError, "Document is not Relax NG" + if _LIBXML_VERSION_INT < 20624: + c_href = _getNs(c_node) + if c_href is NULL or \ + cstd.strcmp(c_href, + 'http://relaxng.org/ns/structure/1.0') != 0: + raise RelaxNGParseError, "Document is not Relax NG" fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) parser_ctxt = relaxng.xmlRelaxNGNewDocParserCtxt(fake_c_doc) elif file is not None: Modified: lxml/trunk/src/lxml/xmlschema.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlschema.pxi (original) +++ lxml/trunk/src/lxml/xmlschema.pxi Mon May 7 11:02:48 2007 @@ -30,11 +30,12 @@ root_node = _rootNodeOrRaise(etree) # work around for libxml2 bug if document is not XML schema at all - c_node 
= root_node._c_node - c_href = _getNs(c_node) - if c_href is NULL or \ - cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0: - raise XMLSchemaParseError, "Document is not XML Schema" + if _LIBXML_VERSION_INT < 20624: + c_node = root_node._c_node + c_href = _getNs(c_node) + if c_href is NULL or \ + cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0: + raise XMLSchemaParseError, "Document is not XML Schema" fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) parser_ctxt = xmlschema.xmlSchemaNewDocParserCtxt(fake_c_doc) From scoder at codespeak.net Mon May 7 14:09:34 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 14:09:34 +0200 (CEST) Subject: [Lxml-checkins] r42794 - lxml/trunk/doc Message-ID: <20070507120934.568BA8067@code0.codespeak.net> Author: scoder Date: Mon May 7 14:09:32 2007 New Revision: 42794 Modified: lxml/trunk/doc/xpathxslt.txt Log: section for getpath() Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Mon May 7 14:09:32 2007 @@ -10,9 +10,10 @@ 1 XPath 1.1 The ``xpath()`` method 1.2 XPath return values - 1.3 The ``XPath`` class - 1.4 The ``XPathEvaluator`` classes - 1.5 ``ETXPath`` + 1.3 Generating XPath expressions + 1.4 The ``XPath`` class + 1.5 The ``XPathEvaluator`` classes + 1.6 ``ETXPath`` 2 XSLT 2.1 XSLT result objects 2.2 Stylesheet parameters @@ -141,8 +142,12 @@ (the text node content or attribute value). Namespace declarations are returned as tuples of strings: ``(prefix, URI)``. 
-A related convenience method of ElementTree objects is ``getpath(element)``, -which returns a structural, absolute XPath expression to find that element:: + +Generating XPath expressions +---------------------------- + +A convenience method of ElementTree objects is ``getpath(element)``, which +returns a structural, absolute XPath expression to find that element:: >>> a = etree.Element("a") >>> b = etree.SubElement(a, "b") From scoder at codespeak.net Mon May 7 14:14:38 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 14:14:38 +0200 (CEST) Subject: [Lxml-checkins] r42795 - lxml/trunk/doc Message-ID: <20070507121438.4E9378067@code0.codespeak.net> Author: scoder Date: Mon May 7 14:14:36 2007 New Revision: 42795 Modified: lxml/trunk/doc/xpathxslt.txt Log: cleanup Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Mon May 7 14:14:36 2007 @@ -146,8 +146,8 @@ Generating XPath expressions ---------------------------- -A convenience method of ElementTree objects is ``getpath(element)``, which -returns a structural, absolute XPath expression to find that element:: +ElementTree objects have a method ``getpath(element)``, which returns a +structural, absolute XPath expression to find that element:: >>> a = etree.Element("a") >>> b = etree.SubElement(a, "b") From scoder at codespeak.net Mon May 7 20:26:25 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 20:26:25 +0200 (CEST) Subject: [Lxml-checkins] r42837 - lxml/trunk/doc Message-ID: <20070507182625.ADA278065@code0.codespeak.net> Author: scoder Date: Mon May 7 20:26:25 2007 New Revision: 42837 Modified: lxml/trunk/doc/performance.txt Log: benchmarks and optimisation example Modified: lxml/trunk/doc/performance.txt ============================================================================== --- 
lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Mon May 7 20:26:25 2007 @@ -6,35 +6,57 @@ enough for most applications, so lxml is probably somewhere between 'fast enough' and 'the best choice' for yours. -This text describes where lxml.etree (lxe) excels, gives hints on some -performance traps and compares the overall performance to the original -ElementTree_ (ET) and cElementTree_ (cET) libraries by Fredrik Lundh. The -cElementTree library is a fast C-implementation of the original ElementTree. +This text describes where lxml.etree (abbreviated to 'lxe') excels, gives +hints on some performance traps and compares the overall performance to the +original ElementTree_ (ET) and cElementTree_ (cET) libraries by Fredrik Lundh. +The cElementTree library is a fast C-implementation of the original +ElementTree. .. _ElementTree: http://effbot.org/zone/element-index.htm .. _cElementTree: http://effbot.org/zone/celementtree.htm +.. contents:: +.. + 1 How to read the timings + 2 Bad things first + 3 Parsing and Serialising + 4 The ElementTree API + 5 Tree traversal + 6 XPath + 7 lxml.objectify + + +How to read the timings +----------------------- + The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with -the lxml source distribution. The timings cited below compare lxml 1.3 (with -libxml2 2.6.26) to the ElementTree and cElementTree versions shipped with -CPython 2.5 (based on ElementTree 1.2.6). They were run single-threaded on a -1.8GHz Intel Core Duo machine. +the lxml source distribution. They are distributed under the same BSD license +as lxml itself, and the lxml project would like to promote them as a general +benchmarking suite for all ElementTree implementations. New benchmarks are +very easy to add as tiny test methods, so if you write a performance test for +a specific part of the API yourself, please consider sending it to the lxml +mailing list. 
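As a rough illustration of how small such a performance test can be, here is a standalone sketch that times a single-child access using the standard library's ``timeit`` and ``xml.etree.ElementTree``. It does not use the harness in ``bench_etree.py``, whose API is not shown in this text. The names ``build_tree``, ``bench_first_child`` and ``msec_per_pass`` are invented for this example.

```python
import timeit
import xml.etree.ElementTree as ET


def build_tree(children=100):
    """Build a flat tree: one root element with `children` child elements."""
    root = ET.Element("root")
    for i in range(children):
        ET.SubElement(root, "child").text = str(i)
    return root


def bench_first_child(root):
    """The operation under test: index access to the first child."""
    return root[0]


def msec_per_pass(func, root, passes=1000):
    """Average wall-clock time of one call to func(root), in milliseconds."""
    seconds = timeit.timeit(lambda: func(root), number=passes)
    return seconds / passes * 1000.0


if __name__ == "__main__":
    tree_root = build_tree()
    timing = msec_per_pass(bench_first_child, tree_root)
    print("first_child %.4f msec/pass" % timing)
```

The operation is wrapped in a loop of ``passes`` calls because a single index access is too short to time on its own. The resulting numbers depend on the machine and are not comparable to the figures cited in this text.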
+ +The timings cited below compare lxml 1.3 (with libxml2 2.6.27) to the +ElementTree and cElementTree versions shipped with CPython 2.5 (based on +ElementTree 1.2.6). They were run single-threaded on a 1.8GHz Intel Core Duo +machine under Ubuntu Linux 7.04 (Feisty). .. _`bench_etree.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_etree.py .. _`bench_xpath.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_xpath.py .. _`bench_objectify.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_objectify.py The scripts run a number of simple tests on the different libraries, using -different XML tree configurations: different tree sizes, with or without -attributes (-/A), with or without ASCII or unicode text (-/S/U), and either -against a tree or its serialised form (T/X). In the result extracts cited -below, T1 refers to a 3-level tree with many children at the third level, T2 -is swapped around to have many children below the root element, T3 is a deep -tree with few children at each level and T4 is a small tree, slightly broader -than deep. If repetition is involved, this usually means running the -benchmark in a loop over all children of the tree root, otherwise, the -operation is run on the root node (C/R). +different XML tree configurations: different tree sizes (T1-4), with or +without attributes (-/A), with or without ASCII string or unicode text +(-/S/U), and either against a tree or its serialised XML form (T/X). In the +result extracts cited below, T1 refers to a 3-level tree with many children at +the third level, T2 is swapped around to have many children below the root +element, T3 is a deep tree with few children at each level and T4 is a small +tree, slightly broader than deep. If repetition is involved, this usually +means running the benchmark in a loop over all children of the tree root, +otherwise, the operation is run on the root node (C/R). 
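The configuration codes attached to each result can be read mechanically. The sketch below decodes a code such as ``(SATR T1)``. It assumes the field order text, attributes, input form, target, which is inferred from the codes in the result extracts. The function and dictionary names are hypothetical and not part of the benchmark scripts.

```python
# Meaning of each flag character, by position in the four-letter code.
TEXT = {"-": "no text", "S": "ASCII string text", "U": "unicode text"}
ATTRIBUTES = {"-": "no attributes", "A": "attributes"}
INPUT_FORM = {"T": "tree", "X": "serialised XML"}
TARGET = {"C": "loop over the children of the root", "R": "root node"}


def decode_benchmark_code(code):
    """Decode a benchmark code like '(SATR T1)' into a readable dict."""
    flags, tree = code.strip("()").split()
    if len(flags) != 4:
        raise ValueError("expected four flag characters, got %r" % flags)
    text, attributes, input_form, target = flags
    return {
        "tree": tree,
        "text": TEXT[text],
        "attributes": ATTRIBUTES[attributes],
        "input": INPUT_FORM[input_form],
        "target": TARGET[target],
    }


if __name__ == "__main__":
    for key, value in sorted(decode_benchmark_code("(SATR T1)").items()):
        print("%-10s %s" % (key, value))
```

An unknown flag character raises ``KeyError``, which is the intended behaviour for a sketch like this: it makes an unexpected code visible rather than guessing at its meaning.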
As an example, the character code ``(SATR T1)`` states that the benchmark was running for tree T1, with plain string text (S) and attributes (A). It was @@ -44,27 +66,29 @@ measurable. It is therefore not always possible to compare the absolute timings of, say, a single access benchmark (which usually loops) and a 'get all in one step' benchmark, which already takes enough time to be measurable -and is therefore measured as is. Take a look at the concrete benchmarks in -the scripts to understand how the numbers compare. - -.. contents:: -.. - 1 Bad things first - 2 Parsing and Serialising - 3 The ElementTree API - 4 Tree traversal - 5 XPath - 6 lxml.objectify +and is therefore measured as is. An example is the index access to a single +child, which cannot be compared to the timings for ``getchildren()``. Take a +look at the concrete benchmarks in the scripts to understand how the numbers +compare. -Bad things first ----------------- +General notes +------------- First thing to say: there *is* an overhead involved in having a DOM-like C library mimic the ElementTree API. As opposed to ElementTree, lxml has to -generate Python objects on the fly when asked for them. What this means is: -the more of your code runs in Python, the slower your application gets. Note, -however, that this is true for most performance critical Python applications. +generate Python representations of tree nodes on the fly when asked for them, +and the internal tree structure of libxml2 results in a higher maintenance +overhead than the simpler top-down structure of ElementTree. What this means +is: the more of your code runs in Python, the less you can benefit from the +speed of lxml and libxml2. Note, however, that this is true for most +performance critical Python applications. No one would implement complex +matrix calculations in pure Python when you can use Numeric. 
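To make the point about Python-level code concrete, here is a minimal sketch that contrasts a hand-written recursive traversal with a traversal delegated to the library. It uses the standard library's ``xml.etree.ElementTree`` so that it runs without lxml installed, and ``iter(tag)``, the newer spelling of ``getiterator(tag)``. The sample document and the function names are made up for illustration, and no timings are claimed.

```python
import xml.etree.ElementTree as ET

# Small made-up document: two <v> elements of interest, one <w> to skip.
XML = "<root><a><v>one</v></a><b><v>two</v><w>skip</w></b></root>"


def collect_by_hand(element, tag, found=None):
    """Walk the tree in Python code, visiting every element once."""
    if found is None:
        found = []
    if element.tag == tag:
        found.append(element)
    for child in element:
        collect_by_hand(child, tag, found)
    return found


def collect_with_iterator(root, tag):
    """Let the library do the traversal and the tag filtering."""
    return list(root.iter(tag))


if __name__ == "__main__":
    root = ET.fromstring(XML)
    print([v.text for v in collect_by_hand(root, "v")])
    print([v.text for v in collect_with_iterator(root, "v")])
```

Both functions return the same elements in document order. The difference is where the loop runs: the first executes one Python-level step per node, while the second hands the whole traversal to the library.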
+ +The up side then is that lxml provides powerful tools like tree iterators, +XPath and XSLT, that can handle complex operations at the speed of C. Their +pythonic API in lxml makes them so flexible that most applications can easily +benefit from them. Parsing and Serialising @@ -111,26 +135,32 @@ ET : parse_stringIO (UAXR T3) 163.5361 msec/pass The expat parser allows cET to be up to 80% faster than lxml on plain parser -performance. Similar timings can be observer for the ``iterparse()`` -function. However, if you take a complete serialize-parse cycle, the numbers +performance. Similar timings can be observed for the ``iterparse()`` +function. However, if you take a complete input-output cycle, the numbers will look similar to these:: - lxe: write_utf8_parse_stringIO (S-TR T1) 316.6230 msec/pass - cET: write_utf8_parse_stringIO (S-TR T1) 592.1209 msec/pass - ET : write_utf8_parse_stringIO (S-TR T1) 817.9121 msec/pass - - lxe: write_utf8_parse_stringIO (UATR T3) 49.9680 msec/pass - cET: write_utf8_parse_stringIO (UATR T3) 434.6111 msec/pass - ET : write_utf8_parse_stringIO (UATR T3) 574.1441 msec/pass - - lxe: write_utf8_parse_stringIO (SATR T4) 1.2789 msec/pass - cET: write_utf8_parse_stringIO (SATR T4) 12.2640 msec/pass - ET : write_utf8_parse_stringIO (SATR T4) 15.6620 msec/pass + lxe: write_utf8_parse_stringIO (S-TR T1) 166.3210 msec/pass + cET: write_utf8_parse_stringIO (S-TR T1) 581.2099 msec/pass + ET : write_utf8_parse_stringIO (S-TR T1) 803.5331 msec/pass + + lxe: write_utf8_parse_stringIO (UATR T2) 184.4249 msec/pass + cET: write_utf8_parse_stringIO (UATR T2) 671.5119 msec/pass + ET : write_utf8_parse_stringIO (UATR T2) 924.3481 msec/pass + + lxe: write_utf8_parse_stringIO (S-TR T3) 9.1329 msec/pass + cET: write_utf8_parse_stringIO (S-TR T3) 77.9850 msec/pass + ET : write_utf8_parse_stringIO (S-TR T3) 157.0492 msec/pass + + lxe: write_utf8_parse_stringIO (SATR T4) 1.3900 msec/pass + cET: write_utf8_parse_stringIO (SATR T4) 12.6081 msec/pass + ET : 
write_utf8_parse_stringIO (SATR T4) 16.2580 msec/pass

 For applications that require a high parser throughput and do little
 serialization, cET is the best choice. Also for iterparse applications that
 extract small amounts of data from large XML data sets. If it comes to
-round-trip performance, however, lxml tends to be 3-4 times faster in total.
+round-trip performance, however, lxml tends to be 3-4 times faster in
+total. So, whenever the input documents are not considerably bigger than the
+output, lxml is the clear winner.


 The ElementTree API
@@ -261,7 +291,7 @@
     cET: deepcopy (--TC T1) 220.2251 msec/pass
     ET : deepcopy (--TC T1) 463.7730 msec/pass

-    lxe: deepcopy (--TC T3) 8.2841 msec/pass
+    lxe: deepcopy (--TC T3) 4.2651 msec/pass
     cET: deepcopy (--TC T3) 53.8740 msec/pass
     ET : deepcopy (--TC T3) 118.2799 msec/pass

@@ -359,6 +389,115 @@
     lxe: xpath_class_repeat (--TC T4) 1.0269 msec/pass


+A bigger example
+----------------
+
+A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would
+read in a 3 MB XML version of the Old Testament of the Bible and look for the
+text "begat" in all verses. Apparently, it is contained in 120 of them. This
+is easy to implement in ElementTree using ``findall()``. However, the fastest
+way to do this is obviously ``iterparse()``, as most of the data is not of any
+interest.
+
+.. _`xml.org`: http://xml.org/...
+
+Now, Uche's original proposal was::
+
+    def bench_ET():
+        tree = ElementTree.parse("ot.xml")
+        result = []
+        for v in tree.findall("//v"):
+            text = v.text
+            if 'begat' in text:
+                result.append(text)
+        return len(result)
+
+which takes about one second on my machine today. The faster ``iterparse()``
+variant looks like this::
+
+    def bench_ET_iterparse():
+        result = []
+        for event, v in ElementTree.iterparse("ot.xml"):
+            if v.tag == 'v':
+                text = v.text
+                if 'begat' in text:
+                    result.append(text)
+                v.clear()
+        return len(result)
+
+The improvement is about 10%. At the time I first tried (early 2006), lxml
+didn't have ``iterparse()`` support, but the ``findall()`` variant was already
+faster than ElementTree. This changes immediately when you switch to
+cElementTree. The latter only needs 0.17 seconds to do the trick today and
+an impressive 0.10 seconds when running the iterparse version. And
+even back then, it was quite a bit faster than what lxml could achieve.
+
+Since then, lxml has matured a lot and has gotten much faster. The iterparse
+variant now runs in 0.14 seconds, and if you remove the ``v.clear()``, it is
+even a little faster (which isn't the case for cElementTree). When you move
+the whole thing to a pure XPath implementation, it will look like this::
+
+    def bench_lxml_xpath_all():
+        tree = etree.parse("ot.xml")
+        result = tree.xpath("//v[contains(., 'begat')]/text()")
+        return len(result)
+
+This runs in about 0.13 seconds and is about the shortest possible
+implementation (in lines of Python code) that I could come up with. Now, this
+is already a rather complex XPath expression compared to the simple "//v"
+ElementPath expression we started with. Since this is also valid XPath, let's
+try this instead::
+
+    def bench_lxml_xpath():
+        tree = etree.parse("ot.xml")
+        result = []
+        for v in tree.xpath("//v"):
+            text = v.text
+            if 'begat' in text:
+                result.append(text)
+        return len(result)
+
+This gets us down to 0.12 seconds. However, since this is not much different
+from the original findall variant, we can remove the complexity of the XPath
+call completely and just go with what we had in the beginning. Under lxml,
+this runs in the same 0.12 seconds.
+
+But there is one thing left to try. We can replace the simple ElementPath
+expression with a native tree iterator::
+
+    def bench_lxml_getiterator():
+        tree = etree.parse("ot.xml")
+        result = []
+        for v in tree.getiterator("v"):
+            text = v.text
+            if 'begat' in text:
+                result.append(text)
+        return len(result)
+
+This implements the same thing, just without the overhead of parsing and
+evaluating a path expression. And this makes it another bit faster, down to
+0.11 seconds. For comparison, cElementTree runs this version in 0.17 seconds.
+
+So, what have we learned?
+
+* It's important to know the available options - and it's worth starting with
+  the simplest one. In this case, a programmer would then probably have
+  started with ``getiterator("v")`` or ``iterparse()``. Either of them would
+  already have been the most efficient, depending on which library is used.
+
+* It's not always worth optimising. After all that hassle we got from 0.12
+  seconds for the initial implementation to 0.11 seconds. Switching over to
+  cElementTree and writing an ``iterparse()`` based version would have given
+  us 0.10 seconds - not a big difference for 3MB of XML.
+
+* Take care which operation is really dominating in your use case. Here, lxml
+  is a little slower than cElementTree on ``parse()`` (both about 0.06 seconds),
+  but more visibly slower on ``iterparse()``: 0.07 versus 0.10 seconds.
+  However, tree iteration in lxml is incredibly fast, so it can be faster to
+  parse the whole tree and then iterate over it rather than using
+  ``iterparse()`` to do both in one step.
+
+
 lxml.objectify
 --------------

@@ -439,9 +578,10 @@
 Here are some more things to try if optimisation is required:

 * A lot of time is usually spent in tree traversal to find the addressed
-  elements in the tree. If you often work in subtrees, assign the parent of
-  the subtree to a variable or pass it into functions instead of starting at
-  the root. This allows accessing its descendents more directly.
+  elements in the tree.
If you often work in subtrees, do what you would also + do with deep Python objects: assign the parent of the subtree to a variable + or pass it into functions instead of starting at the root. This allows + accessing its descendents more directly. * Try assigning data values directly to attributes instead of passing them through DataElement. From scoder at codespeak.net Mon May 7 22:17:04 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 22:17:04 +0200 (CEST) Subject: [Lxml-checkins] r42838 - lxml/trunk/doc Message-ID: <20070507201704.58B51806D@code0.codespeak.net> Author: scoder Date: Mon May 7 22:17:02 2007 New Revision: 42838 Modified: lxml/trunk/doc/validation.txt Log: clarifications in validation docs Modified: lxml/trunk/doc/validation.txt ============================================================================== --- lxml/trunk/doc/validation.txt (original) +++ lxml/trunk/doc/validation.txt Mon May 7 22:17:02 2007 @@ -13,7 +13,8 @@ There is also initial support for Schematron_. However, it is currently disabled in lxml builds due to insufficiencies in the implementation as of -libxml2 2.6.27. +libxml2 2.6.27. To enable it when you build from sources, you currently have +to uncomment the include line at the end of the file ``src/lxml/etree.pyx``. .. _Schematron: http://www.ascc.net/xml/schematron @@ -84,6 +85,10 @@ >>> relaxng_doc = etree.parse(f) >>> relaxng = etree.RelaxNG(relaxng_doc) +Alternatively, pass a filename to the ``file`` keyword argument to parse from +a file. This also enables correct handling of include files from within the +RelaxNG parser. + You can then validate some ElementTree document against the schema. You'll get back True if the document is valid against the Relax NG schema, and False if not:: @@ -130,7 +135,7 @@ You can see that the error (ERROR) happened during RelaxNG validation (RELAXNGV). The message then tells you what went wrong. Note that this error -is local to the RelaxNG object. 
It will only contain log entries that +log is local to the RelaxNG object. It will only contain log entries that appeared during the validation. The DocumentInvalid exception raised by the ``assertValid`` method above provides access to the global error log (like all other lxml exceptions). @@ -147,10 +152,9 @@ XMLSchema --------- -lxml.etree also has a XML Schema (XSD) support, using the class -lxml.etree.XMLSchema. This support is very similar to the Relax NG -support. The class can be given an ElementTree object to construct a -XMLSchema validator:: +lxml.etree also has XML Schema (XSD) support, using the class +lxml.etree.XMLSchema. The API is very similar to the Relax NG and DTD +classes. Pass an ElementTree object to construct a XMLSchema validator:: >>> f = StringIO('''\ ... @@ -165,9 +169,9 @@ >>> xmlschema_doc = etree.parse(f) >>> xmlschema = etree.XMLSchema(xmlschema_doc) -You can then validate some ElementTree document with this. Like with -RelaxNG, you'll get back true if the document is valid against the XML -schema, and false if not:: +You can then validate some ElementTree document with this. Like with RelaxNG, +you'll get back true if the document is valid against the XML schema, and +false if not:: >>> valid = StringIO('') >>> doc = etree.parse(valid) @@ -179,8 +183,8 @@ >>> xmlschema.validate(doc2) 0 -Calling the schema object has the same effect as calling its validate -method. This is sometimes used in conditional statements:: +Calling the schema object has the same effect as calling its validate method. +This is sometimes used in conditional statements:: >>> invalid = StringIO('') >>> doc2 = etree.parse(invalid) @@ -201,7 +205,7 @@ [...] 
AssertionError: Document does not comply with schema -Error reporting works like for the RelaxNG class:: +Error reporting works as for the RelaxNG class:: >>> log = xmlschema.error_log >>> error = log.last_error From scoder at codespeak.net Mon May 7 23:35:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 23:35:00 +0200 (CEST) Subject: [Lxml-checkins] r42840 - in lxml/trunk: doc src/lxml Message-ID: <20070507213500.8C8B4807C@code0.codespeak.net> Author: scoder Date: Mon May 7 23:35:00 2007 New Revision: 42840 Modified: lxml/trunk/doc/parsing.txt lxml/trunk/doc/validation.txt lxml/trunk/src/lxml/parser.pxi Log: clarifications in parser docs Modified: lxml/trunk/doc/parsing.txt ============================================================================== --- lxml/trunk/doc/parsing.txt (original) +++ lxml/trunk/doc/parsing.txt Mon May 7 23:35:00 2007 @@ -18,26 +18,56 @@ Parsers -------- +======= Parsers are represented by parser objects. There is support for parsing both -XML and (broken) HTML (note that XHTML is best parsed as XML). Both are based -on libxml2 and therefore only support options that are backed by the library. -Parsers take a number of keyword arguments. The following is an example for -namespace cleanup during parsing, first with the default parser, then with a -parametrized one:: +XML and (broken) HTML. Note that XHTML is best parsed as XML, parsing it with +the HTML parser can lead to unexpected results. Here is a simple example for +XML parsing:: >>> xml = '' - >>> et = etree.parse(StringIO(xml)) + >>> et = etree.parse(StringIO(xml)) >>> print etree.tostring(et.getroot()) + +Parser options +-------------- + +The parsers accept a number of setup options as keyword arguments. 
The above +example is easily extended to clean up namespaces during parsing:: + >>> parser = etree.XMLParser(ns_clean=True) >>> et = etree.parse(StringIO(xml), parser) >>> print etree.tostring(et.getroot()) +The keyword arguments in the constructor are mainly based on the libxml2 +parser configuration. A DTD will also be loaded if validation or attribute +default values are requested. + +Available boolean keyword arguments: + +* attribute_defaults - read the DTD (if referenced by the document) and add + the default attributes from it + +* dtd_validation - validate while parsing (if a DTD was referenced) + +* load_dtd - load and parse the DTD while parsing (no validation is performed) + +* no_network - prevent network access when looking up external documents + +* ns_clean - try to clean up redundant namespace declarations + +* recover - try hard to parse through broken XML + +* remove_blank_text - discard blank text nodes between tags + + +Parsing HTML +------------ + HTML parsing is similarly simple. The parsers have a ``recover`` keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return something usable without raising an exception. You should use libxml2 @@ -48,15 +78,29 @@ >>> parser = etree.HTMLParser() >>> et = etree.parse(StringIO(broken_html), parser) - >>> print etree.tostring(et.getroot()) - test

page title

+ >>> print etree.tostring(et.getroot(), pretty_print=True) + + + test + + +

page title

+ + Lxml has an HTML function, similar to the XML shortcut known from ElementTree:: >>> html = etree.HTML(broken_html) - >>> print etree.tostring(html) - test

page title

+ >>> print etree.tostring(html, pretty_print=True) + + + test + + +

page title

+ + The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is *not* the fault of lxml if you find documents that are so @@ -66,6 +110,10 @@ parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems. + +Doctype information +------------------- + The use of the libxml2 parsers makes some additional information available at the API level. Currently, ElementTree objects can access the DOCTYPE information provided by a parsed document, as well as the XML version and the @@ -93,7 +141,7 @@ iterparse and iterwalk ----------------------- +====================== As known from ElementTree, the ``iterparse()`` utility function returns an iterator that generates parser events for an XML file (or file-like object), @@ -125,7 +173,7 @@ >>> context.root.tag 'root' -The other types can be activated with the ``events`` keyword argument:: +The other event types can be activated with the ``events`` keyword argument:: >>> events = ("start", "end") >>> context = etree.iterparse(StringIO(xml), events=events) @@ -140,6 +188,32 @@ end {testns}empty-element end root + +Selective tag events +-------------------- + +As an extension over ElementTree, lxml.etree accepts a ``tag`` keyword +argument just like ``element.getiterator(tag)``. This restricts events to a +specific tag or namespace:: + + >>> context = etree.iterparse(StringIO(xml), tag="element") + >>> for action, elem in context: + ... print action, elem.tag + end element + end element + + >>> events = ("start", "end") + >>> context = etree.iterparse( + ... StringIO(xml), events=events, tag="{testns}*") + >>> for action, elem in context: + ... print action, elem.tag + start {testns}empty-element + end {testns}empty-element + + +Modifying the tree +------------------ + You can modify the element and its descendants when handling the 'end' event. To save memory, for example, you can remove subtrees that are no longer needed:: @@ -170,11 +244,12 @@ ... 
if element.getprevious(): # clean up preceding siblings ... del element.getparent()[0] -You can use ``while`` instead of ``if`` if you skipped siblings using the -``tag`` keyword argument. The more selective your tag is, however, the more -thought you will have to put into finding the right way to clean up the -elements that were skipped. Therefore, it is sometimes easier to traverse all -elements and do the tag selection by hand in the event handler code. +You can use ``while`` instead of the ``if`` to delete multiple siblings in a +row if you skipped over them using the ``tag`` keyword argument. The more +selective your tag is, however, the more thought you will have to put into +finding the right way to clean up the elements that were skipped. Therefore, +it is sometimes easier to traverse all elements and do the tag selection by +hand in the event handler code. The 'start-ns' and 'end-ns' events notify about namespace declarations and generate tuples ``(prefix, URI)``:: @@ -189,28 +264,16 @@ It is common practice to use a list as namespace stack and pop the last entry on the 'end-ns' event. -lxml.etree supports two extensions compared to ElementTree. It accepts a -``tag`` keyword argument just like ``element.getiterator(tag)``. This -restricts events to a specific tag or namespace. - >>> context = etree.iterparse(StringIO(xml), tag="element") - >>> for action, elem in context: - ... print action, elem.tag - end element - end element +iterwalk +-------- - >>> events = ("start", "end") - >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*") - >>> for action, elem in context: - ... print action, elem.tag - start {testns}empty-element - end {testns}empty-element - -The second extension is the ``iterwalk()`` function. It behaves exactly like -``iterparse()``, but works on Elements and ElementTrees:: +A second extension over ElementTree is the ``iterwalk()`` function. 
It +behaves exactly like ``iterparse()``, but works on Elements and ElementTrees:: - >>> root = context.root - >>> context = etree.iterwalk(root, events=events, tag="element") + >>> root = etree.XML(xml) + >>> context = etree.iterwalk( + ... root, events=("start", "end"), tag="element") >>> for action, elem in context: ... print action, elem.tag start element @@ -220,7 +283,7 @@ Python unicode strings ----------------------- +====================== lxml.etree has broader support for Python unicode strings than the ElementTree library. First of all, where ElementTree would raise an exception, the @@ -246,6 +309,10 @@ should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone. + +Serialising to Unicode strings +------------------------------ + To serialize the result, you would normally use the ``tostring`` module function, which serializes to plain ASCII by default or a number of other encodings if asked for:: Modified: lxml/trunk/doc/validation.txt ============================================================================== --- lxml/trunk/doc/validation.txt (original) +++ lxml/trunk/doc/validation.txt Mon May 7 23:35:00 2007 @@ -4,7 +4,7 @@ Apart from the built-in DTD support in parsers, lxml currently supports three schema languages: DTD_, `Relax NG`_ and `XML Schema`_. All three provide -identical APIs in lxml, represented by a validator class with the obvious +identical APIs in lxml, represented by validator classes with the obvious names. .. 
_DTD: http://en.wikipedia.org/wiki/Document_Type_Definition Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Mon May 7 23:35:00 2007 @@ -664,14 +664,9 @@ * recover - try hard to parse through broken XML * remove_blank_text - discard blank text nodes - For read-only documents that will not be altered after parsing, you can - also pass the following keyword arguments: - * compact - compactly store short element text content - - Note that you should avoid sharing parsers between threads. This does not + Note that you should avoid sharing parsers between threads. While this is + not harmful, it is more efficient to use separate parsers. This does not apply to the default parser. - - You must not modify documents that were parsed with the 'compact' option. """ def __init__(self, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=False, ns_clean=False, @@ -794,12 +789,8 @@ * no_network - prevent network access * remove_blank_text - discard empty text nodes - For read-only documents that will not be altered after parsing, you can - also pass the following keyword arguments: - * compact - compactly store short element text content - - Note that you should avoid sharing parsers between threads. You must not - modify documents that were parsed with the 'compact' option. + Note that you should avoid sharing parsers between threads for performance + reasons.
""" def __init__(self, recover=True, no_network=False, remove_blank_text=False, compact=True): From scoder at codespeak.net Tue May 8 11:57:16 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 11:57:16 +0200 (CEST) Subject: [Lxml-checkins] r42843 - lxml/trunk/doc Message-ID: <20070508095716.8A03C8065@code0.codespeak.net> Author: scoder Date: Tue May 8 11:57:14 2007 New Revision: 42843 Modified: lxml/trunk/doc/parsing.txt Log: short comparison code for iterwalk/iterparse Modified: lxml/trunk/doc/parsing.txt ============================================================================== --- lxml/trunk/doc/parsing.txt (original) +++ lxml/trunk/doc/parsing.txt Tue May 8 11:57:14 2007 @@ -271,6 +271,7 @@ A second extension over ElementTree is the ``iterwalk()`` function. It behaves exactly like ``iterparse()``, but works on Elements and ElementTrees:: + >>> root = etree.XML(xml) >>> context = etree.iterwalk( ... root, events=("start", "end"), tag="element") @@ -281,6 +282,17 @@ start element end element + >>> f = StringIO(xml) + >>> context = etree.iterparse( + ... f, events=("start", "end"), tag="element") + + >>> for action, elem in context: + ... 
print action, elem.tag + start element + end element + start element + end element + Python unicode strings ====================== From scoder at codespeak.net Tue May 8 13:43:59 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 13:43:59 +0200 (CEST) Subject: [Lxml-checkins] r42846 - lxml/trunk/doc Message-ID: <20070508114359.12E6D8069@code0.codespeak.net> Author: scoder Date: Tue May 8 13:43:58 2007 New Revision: 42846 Modified: lxml/trunk/doc/performance.txt Log: numpy Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 13:43:58 2007 @@ -83,7 +83,7 @@ is: the more of your code runs in Python, the less you can benefit from the speed of lxml and libxml2. Note, however, that this is true for most performance critical Python applications. No one would implement complex -matrix calculations in pure Python when you can use Numeric. +matrix calculations in pure Python when you can use NumPy. The up side then is that lxml provides powerful tools like tree iterators, XPath and XSLT, that can handle complex operations at the speed of C. Their From scoder at codespeak.net Tue May 8 14:20:13 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 14:20:13 +0200 (CEST) Subject: [Lxml-checkins] r42849 - lxml/trunk/doc Message-ID: <20070508122013.5B9E38069@code0.codespeak.net> Author: scoder Date: Tue May 8 14:20:12 2007 New Revision: 42849 Modified: lxml/trunk/doc/performance.txt Log: docs Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 14:20:12 2007 @@ -82,8 +82,8 @@ overhead than the simpler top-down structure of ElementTree. 
What this means is: the more of your code runs in Python, the less you can benefit from the speed of lxml and libxml2. Note, however, that this is true for most -performance critical Python applications. No one would implement complex -matrix calculations in pure Python when you can use NumPy. +performance critical Python applications. No one would implement fourier +transformations in pure Python when you can use NumPy. The up side then is that lxml provides powerful tools like tree iterators, XPath and XSLT, that can handle complex operations at the speed of C. Their @@ -480,6 +480,11 @@ So, what have we learned? +* Python code is not slow. The pure XPath solution was not even as fast as + the first shot Python implementation. In general, a few more lines in + Python make things more readable, which is much more important than the last + 5% of performance. + * It's important to know the available options - and it's worth starting with the most simple one. In this case, a programmer would then probably have started with ``getiterator("v")`` or ``iterparse()``. 
Either of them would From scoder at codespeak.net Tue May 8 14:23:09 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 14:23:09 +0200 (CEST) Subject: [Lxml-checkins] r42850 - lxml/trunk/doc Message-ID: <20070508122309.B5FA38069@code0.codespeak.net> Author: scoder Date: Tue May 8 14:23:09 2007 New Revision: 42850 Modified: lxml/trunk/doc/performance.txt Log: cleanup Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 14:23:09 2007 @@ -1,3 +1,4 @@ +==================== Benchmarks and Speed ==================== @@ -27,7 +28,7 @@ How to read the timings ------------------------ +======================= The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with @@ -73,7 +74,7 @@ General notes -------------- +============= First thing to say: there *is* an overhead involved in having a DOM-like C library mimic the ElementTree API. As opposed to ElementTree, lxml has to @@ -92,7 +93,7 @@ Parsing and Serialising ------------------------ +======================= These are areas where lxml excels. The reason is that both parts are executed entirely at the C level, without major interaction with Python code. The @@ -164,7 +165,7 @@ The ElementTree API -------------------- +=================== Since all three libraries implement the same API, their performance is easy to compare in this area. 
A major disadvantage for lxml's performance is the @@ -390,7 +391,7 @@ An bigger example ------------------ +================= A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would read in a 3 MB XML version of the Old Testament of the Bible and look for the @@ -504,7 +505,7 @@ lxml.objectify --------------- +============== The following timings are based on the benchmark script `bench_objectify.py`_. From scoder at codespeak.net Wed May 9 00:08:43 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 00:08:43 +0200 (CEST) Subject: [Lxml-checkins] r42886 - lxml/trunk/doc Message-ID: <20070508220843.C2A15806D@code0.codespeak.net> Author: scoder Date: Wed May 9 00:08:42 2007 New Revision: 42886 Modified: lxml/trunk/doc/performance.txt Log: doc restructuring Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 00:08:42 2007 @@ -27,6 +27,25 @@ 7 lxml.objectify +General notes +============= + +First thing to say: there *is* an overhead involved in having a DOM-like C +library mimic the ElementTree API. As opposed to ElementTree, lxml has to +generate Python representations of tree nodes on the fly when asked for them, +and the internal tree structure of libxml2 results in a higher maintenance +overhead than the simpler top-down structure of ElementTree. What this means +is: the more of your code runs in Python, the less you can benefit from the +speed of lxml and libxml2. Note, however, that this is true for most +performance critical Python applications. No one would implement fourier +transformations in pure Python when you can use NumPy. + +The up side then is that lxml provides powerful tools like tree iterators, +XPath and XSLT, that can handle complex operations at the speed of C. 
Their +pythonic API in lxml makes them so flexible that most applications can easily +benefit from them. + + How to read the timings ======================= @@ -73,25 +92,6 @@ compare. -General notes -============= - -First thing to say: there *is* an overhead involved in having a DOM-like C -library mimic the ElementTree API. As opposed to ElementTree, lxml has to -generate Python representations of tree nodes on the fly when asked for them, -and the internal tree structure of libxml2 results in a higher maintenance -overhead than the simpler top-down structure of ElementTree. What this means -is: the more of your code runs in Python, the less you can benefit from the -speed of lxml and libxml2. Note, however, that this is true for most -performance critical Python applications. No one would implement fourier -transformations in pure Python when you can use NumPy. - -The up side then is that lxml provides powerful tools like tree iterators, -XPath and XSLT, that can handle complex operations at the speed of C. Their -pythonic API in lxml makes them so flexible that most applications can easily -benefit from them. - - Parsing and Serialising ======================= From scoder at codespeak.net Wed May 9 00:15:45 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 00:15:45 +0200 (CEST) Subject: [Lxml-checkins] r42887 - lxml/trunk/doc Message-ID: <20070508221545.099E1806D@code0.codespeak.net> Author: scoder Date: Wed May 9 00:15:45 2007 New Revision: 42887 Modified: lxml/trunk/doc/performance.txt Log: doc restructuring Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 00:15:45 2007 @@ -196,6 +196,10 @@ are no longer referenced. ET and cET represent the tree itself through these objects, which reduces the overhead in creating them. 
+ +Child access +------------ + The same reason makes operations like ``getchildren()`` more costly in lxml. Where ET and cET can quickly create a shallow copy of their list of children, lxml has to create a Python object for each child and collect them in a list:: @@ -227,6 +231,10 @@ cET: middle_child (--TR T2) 0.2089 msec/pass ET : middle_child (--TR T2) 0.9360 msec/pass + +Element creation +---------------- + As opposed to ET, libxml2 has a notion of documents that each element must be in. This results in a major performance difference for creating independent Elements that end up in independently created documents:: @@ -252,6 +260,10 @@ choice. Note, however, that the serialisation performance may even out this advantage, especially for smaller trees and trees with many attributes. + +Merging different sources +------------------------- + A critical action for lxml is moving elements between document contexts. It requires lxml to do recursive adaptations throughout the moved tree structure. @@ -285,8 +297,13 @@ cET: replace_children_element (--TC T1) 0.0238 msec/pass ET : replace_children_element (--TC T1) 0.1628 msec/pass -You should keep this difference in mind when you merge very large trees. On -the other hand, deep copying a tree is fast in lxml:: +You should keep this difference in mind when you merge very large trees. + + +deepcopy +-------- + +Deep copying a tree is fast in lxml:: lxe: deepcopy (--TC T1) 10.5221 msec/pass cET: deepcopy (--TC T1) 220.2251 msec/pass @@ -347,7 +364,7 @@ XPath ------ +===== The following timings are based on the benchmark script `bench_xpath.py`_. @@ -390,8 +407,8 @@ lxe: xpath_class_r