From scoder at codespeak.net Thu May 3 21:10:50 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:10:50 +0200 (CEST) Subject: [Lxml-checkins] r42642 - lxml/trunk/doc Message-ID: <20070503191050.CFD368075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:10:48 2007 New Revision: 42642 Modified: lxml/trunk/doc/FAQ.txt lxml/trunk/doc/build.txt Log: contributing and building Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:10:48 2007 @@ -17,23 +17,26 @@ 1.5 What is the difference between lxml.etree and lxml.objectify? 1.6 Why is my application so slow? 1.7 Why do I get errors about missing UCS4 symbols when installing lxml? - 2 Bugs - 2.1 My application crashes! Why does lxml.etree do that? - 2.2 I think I have found a bug in lxml. What should I do? - 3 Threading - 3.1 Can I use threads to concurrently access the lxml API? - 3.2 Does my program run faster if I use threads? - 3.3 Would my single-threaded program run faster if I turned off threading? - 4 Parsing and Serialisation - 4.1 Why doesn't the ``pretty_print`` option reformat my XML output? - 4.2 Why can't lxml parse my XML from unicode strings? - 4.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ? - 4.4 Why can't I just delete parents or clear the root node in iterparse()? - 5 XPath and Document Traversal - 5.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)? - 5.2 Why doesn't ``findall()`` support full XPath expressions? - 5.3 How can I find out which namespace prefixes are used in a document? - 5.4 How can I specify a default namespace for XPath expressions? + 2 Contributing + 2.1 Why is lxml not written in Python? + 2.2 How can I contribute? + 3 Bugs + 3.1 My application crashes! Why does lxml.etree do that? + 3.2 I think I have found a bug in lxml. What should I do? 
+ 4 Threading + 4.1 Can I use threads to concurrently access the lxml API? + 4.2 Does my program run faster if I use threads? + 4.3 Would my single-threaded program run faster if I turned off threading? + 5 Parsing and Serialisation + 5.1 Why doesn't the ``pretty_print`` option reformat my XML output? + 5.2 Why can't lxml parse my XML from unicode strings? + 5.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ? + 5.4 Why can't I just delete parents or clear the root node in iterparse()? + 6 XPath and Document Traversal + 6.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)? + 6.2 Why doesn't ``findall()`` support full XPath expressions? + 6.3 How can I find out which namespace prefixes are used in a document? + 6.4 How can I specify a default namespace for XPath expressions? General Questions @@ -167,6 +170,64 @@ .. _`build instructions`: build.html +Contributing +============ + +Why is lxml not written in Python? +---------------------------------- + +lxml interfaces with two C libraries: libxml2 and libxslt. Accessing them at +the C-level is required for performance reasons. + +To avoid writing plain C-code and caring too much about the details of +built-in types and reference counting, lxml is written in Pyrex_, a +Python-like language that is translated into C-code. Chances are that if you +know Python, you can write code that Pyrex accepts. Again, the C-ish style +used in the lxml code is just for performance optimisations. If you want to +contribute, don't bother with the details, a Python implementation of your +contribution is better than none. And keep in mind that lxml's flexible API +often favours an implementation of features in pure Python, without bothering +with C-code at all. + +Please contact the `mailing list`_ if you need any help. + +.. _Pyrex: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ + + +How can I contribute? 
+---------------------
+
+Besides enhancing the code, there are a lot of places where you can help the
+project and its user base. You can
+
+* spread the word and write about lxml. Many users (especially new Python
+  users) have not yet heard about lxml, although our user base is constantly
+  growing. If you write your own blog and feel like saying something about
+  lxml, go ahead and do so. If we think your contribution or criticism is
+  valuable to other users, we may even put a link or a quote on the project
+  page.
+
+* provide code examples for the general usage of lxml or specific problems
+  solved with lxml. Readable code is a very good way of showing how a library
+  can be used and what great things you can do with it. Again, if we hear
+  about it, we can set a link on the project page.
+
+* work on the documentation. The web page is generated from a set of ReST_
+  `text files`_. It is meant both as a representative project page for lxml
+  and as a site for documenting lxml's API and usage. If you have questions
+  or an idea how to make it more readable and accessible while you are reading
+  it, please send a comment to the `mailing list`_.
+
+.. _ReST: http://docutils.sourceforge.net/rst.html
+.. _`text files`: http://codespeak.net/svn/lxml/trunk/doc/
+
+* improve the docstrings. lxml uses docstrings to support Python's integrated
+  online ``help()`` function. However, sometimes these are not sufficient to
+  grasp the details of the function in question. If you find such a place,
+  you can try to write up a better description and send it to the `mailing
+  list`_.
+
+
 Bugs
 ====

@@ -176,7 +237,7 @@
 One of the goals of lxml is "no segfaults", so if there is no clear warning in
 the documentation that you were doing something potentially harmful, you have
 found a bug and we would like to hear about it. Please report this bug to the
-mailing list. See the next section on how to do that.
+`mailing list`_. See the next section on how to do that.
I think I have found a bug in lxml. What should I do? Modified: lxml/trunk/doc/build.txt ============================================================================== --- lxml/trunk/doc/build.txt (original) +++ lxml/trunk/doc/build.txt Thu May 3 21:10:48 2007 @@ -2,8 +2,10 @@ ============================= To build lxml from source, you need libxml2 and libxslt properly installed, -including header files (possibly shipped in -dev packages). The build process -also requires setuptools_. +*including the header files*. These are likely shipped in separate ``-dev`` +or ``-devel`` packages like ``libxml2-dev``, which you need to install. The +build process also requires setuptools_. The lxml source distribution comes +with a script called ``ez_setup.py`` that can be used to install them. .. _setuptools: http://peak.telecommunity.com/DevCenter/setuptools @@ -34,18 +36,22 @@ Newer versions of lxml depend on features and bug fixes that are not yet available in an official Pyrex release. This includes support for the - external C-API of lxml, for Python 2.5 and for 64 bit architectures. + external C-API of lxml.etree, for Python 2.5 and for 64 bit architectures. To build lxml 1.1 and later from non-release or modified sources, you must - therefore install an updated Pyrex version from here: + therefore use an updated Pyrex version from here: http://codespeak.net/svn/lxml/pyrex/ - Since version 1.1.2, the lxml source distribution includes this Pyrex - version. It will be used if the 'pyrex' directory is available in the lxml - root directory. If you install from SVN or delete this directory from the - unpacked distribution directory, the normally installed Pyrex version will - be used. + A subversion checkout of lxml will automatically retrieve the latest Pyrex + as external project source (``svn:externals``). Look out for the ``Pyrex`` + directory in the source tree. + + Since version 1.1.2, the lxml source distribution also includes this Pyrex + version. 
It will be used if the ``Pyrex`` directory is available in the + lxml root directory. If you install from SVN or delete this directory from + the unpacked distribution directory, the normally installed Pyrex version + will be used. * lxml 1.0 and earlier @@ -86,6 +92,10 @@ python setup.py build +or:: + + python setup.py bdist_egg + If you want to test lxml from the source directory, it is better to build it in-place like this:: @@ -96,15 +106,24 @@ make If you get errors about missing header files (e.g., ``libxml/xmlversion.h``) -then you need to add the location of that file to the include path like:: +then you need to make sure the development packages of libxml2 and libxslt are +properly installed. If this doesn't help, you may have to add the location of +the header files to the include path like:: - python setup.py build_ext -i -I /usr/include/libxml2 + python setup.py build_ext -i -I /usr/include/libxml2 where the file is in ``/usr/include/libxml2/libxml/xmlversion.h`` To use lxml.etree in-place, you can place lxml's ``src`` directory on your Python module search path (PYTHONPATH) and then import ``lxml.etree`` to play -with it. +with it:: + + # cd lxml + # PYTHONPATH=src python + Python 2.5.1 + Type "help", "copyright", "credits" or "license" for more information. + >>> from lxml import etree + >>> To recompile after changes, note that you may have to run ``make clean`` or delete the file ``src/lxml/etree.c``. Distutils do not automatically pick up @@ -125,8 +144,8 @@ make test -To run the ElementTree and cElementTree compatibility tests, make sure -you have lxml on your PYTHONPATH first, then run:: +This also runs the ElementTree and cElementTree compatibility tests. To call +them separately, make sure you have lxml on your PYTHONPATH first, then run:: python selftest.py @@ -147,15 +166,16 @@ This is the procedure to make an lxml egg for your platform: -* download the lxml-x.y.tar.gz release. 
This contains the pregenerated C so we - don't run into any Pyrex issues. Unpack it and cd into it. +* Download the lxml-x.y.tar.gz release. This contains the pregenerated C so + that you don't run into any Pyrex issues. Unpack it and cd into it. * python setup.py build -* if you're on a unixy platform, cd into build/lib.your.platform and - strip any .so file you find there. This reduces the size of the egg. +* If you're on a unixy platform, cd into ``build/lib.your.platform`` and strip + any ``.so`` file you find there. This reduces the size of the egg + considerably. -* python setup.py bdist_egg upload +* ``python setup.py bdist_egg upload`` The last 'upload' step only works if you have access to the lxml cheeseshop entry. If not, you can just make an egg with ``bdist_egg`` and mail it to the From scoder at codespeak.net Thu May 3 21:12:14 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:12:14 +0200 (CEST) Subject: [Lxml-checkins] r42643 - lxml/trunk Message-ID: <20070503191214.F0D258075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:12:13 2007 New Revision: 42643 Modified: lxml/trunk/ (props changed) Log: properties From scoder at codespeak.net Thu May 3 21:13:33 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:13:33 +0200 (CEST) Subject: [Lxml-checkins] r42644 - lxml/trunk/doc Message-ID: <20070503191333.788948075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:13:32 2007 New Revision: 42644 Modified: lxml/trunk/doc/FAQ.txt Log: faq Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:13:32 2007 @@ -1,6 +1,6 @@ -========================== -Frequently Asked Questions -========================== +================================ +Frequently Asked Questions (FAQ) +================================ See also the notes on compatibility_ to 
ElementTree_. From scoder at codespeak.net Thu May 3 21:52:31 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 21:52:31 +0200 (CEST) Subject: [Lxml-checkins] r42646 - lxml/trunk/doc Message-ID: <20070503195231.EB7C08075@code0.codespeak.net> Author: scoder Date: Thu May 3 21:52:29 2007 New Revision: 42646 Modified: lxml/trunk/doc/FAQ.txt lxml/trunk/doc/performance.txt Log: doc on benchmark and performance Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Thu May 3 21:52:29 2007 @@ -15,7 +15,7 @@ 1.3 What standards does lxml implement? 1.4 Where are the Windows binaries? 1.5 What is the difference between lxml.etree and lxml.objectify? - 1.6 Why is my application so slow? + 1.6 How can I make my application run faster? 1.7 Why do I get errors about missing UCS4 symbols when installing lxml? 2 Contributing 2.1 Why is lxml not written in Python? @@ -136,17 +136,18 @@ XPath, XSLT or validation. -Why is my application so slow? ------------------------------- +How can I make my application run faster? +----------------------------------------- lxml.etree is a very fast library for processing XML. There are, however, `a few caveats`_ involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the -simplicity of the API might suggest. The `benchmark page`_ has a comparison -to other ElementTree implementations and a number of tips for performance -tweaking. As with any Python application, the rule of thumb is: the more of -your processing runs in C, the faster your application gets. See also the -section on threading_. +simplicity of the API might suggest, while some use cases can heavily benefit +from finding the right way of doing them. 
The `benchmark page`_ has a +comparison to other ElementTree implementations and a number of tips for +performance tweaking. As with any Python application, the rule of thumb is: +the more of your processing runs in C, the faster your application gets. See +also the section on threading_. .. _`a few caveats`: performance.html#the-elementtree-api .. _`benchmark page`: performance.html @@ -182,7 +183,7 @@ To avoid writing plain C-code and caring too much about the details of built-in types and reference counting, lxml is written in Pyrex_, a Python-like language that is translated into C-code. Chances are that if you -know Python, you can write code that Pyrex accepts. Again, the C-ish style +know Python, you can write `code that Pyrex accepts`_. Again, the C-ish style used in the lxml code is just for performance optimisations. If you want to contribute, don't bother with the details, a Python implementation of your contribution is better than none. And keep in mind that lxml's flexible API @@ -192,6 +193,7 @@ Please contact the `mailing list`_ if you need any help. .. _Pyrex: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ +.. _`code that Pyrex accepts`: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/overview.html How can I contribute? Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Thu May 3 21:52:29 2007 @@ -30,10 +30,15 @@ attributes (-/A), with or without ASCII or unicode text (-/S/U), and either against a tree or its serialised form (T/X). In the result extracts cited below, T1 refers to a 3-level tree with many children at the third level, T2 -is swapped around to have many children at the root element, T3 is a deep tree -with few children at each level and T4 is a small tree, slightly broader than -deep. 
If repetition is involved, this usually means running the benchmark in -a loop over all children of the tree root. +is swapped around to have many children below the root element, T3 is a deep +tree with few children at each level and T4 is a small tree, slightly broader +than deep. If repetition is involved, this usually means running the +benchmark in a loop over all children of the tree root, otherwise, the +operation is run on the root node (C/R). + +As an example, the character code ``(SATR T1)`` states that the benchmark was +running for tree T1, with plain string text (S) and attributes (A). It was +run against the root element (R) in the tree structure of the data (T). .. contents:: .. @@ -48,11 +53,11 @@ Bad things first ---------------- -First thing to say: there *is* an overhead involved in having a C library -mimic the ElementTree API. As opposed to ElementTree, lxml has to generate -Python objects on the fly when asked for them. What this means is: the more -of your code runs in Python, the slower your application gets. Note, however, -that this is true for most performance critical Python applications. +First thing to say: there *is* an overhead involved in having a DOM-like C +library mimic the ElementTree API. As opposed to ElementTree, lxml has to +generate Python objects on the fly when asked for them. What this means is: +the more of your code runs in Python, the slower your application gets. Note, +however, that this is true for most performance critical Python applications. 
Parsing and Serialising From scoder at codespeak.net Thu May 3 22:26:17 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 3 May 2007 22:26:17 +0200 (CEST) Subject: [Lxml-checkins] r42647 - in lxml/trunk: benchmark doc Message-ID: <20070503202617.C04DB8075@code0.codespeak.net> Author: scoder Date: Thu May 3 22:26:17 2007 New Revision: 42647 Modified: lxml/trunk/benchmark/bench_etree.py lxml/trunk/doc/performance.txt Log: benchmark for indexed child access Modified: lxml/trunk/benchmark/bench_etree.py ============================================================================== --- lxml/trunk/benchmark/bench_etree.py (original) +++ lxml/trunk/benchmark/bench_etree.py Thu May 3 22:26:17 2007 @@ -18,6 +18,19 @@ for child in reversed(root): pass + def bench_first_child(self, root): + for i in range(1000): + child = root[0] + + def bench_last_child(self, root): + for i in range(1000): + child = root[-1] + + def bench_middle_child(self, root): + pos = len(root) / 2 + for i in range(1000): + child = root[pos] + @with_attributes(True, False) @with_text(text=True, utext=True) def bench_tostring_utf8(self, root): Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Thu May 3 22:26:17 2007 @@ -14,7 +14,7 @@ .. _ElementTree: http://effbot.org/zone/element-index.htm .. _cElementTree: http://effbot.org/zone/celementtree.htm -The statements made here are backed by the benchmark scripts +The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with the lxml source distribution. 
The timings cited below compare lxml 1.3 (with
 libxml2 2.6.26) to the ElementTree and cElementTree versions shipped with
@@ -166,6 +166,29 @@
   cET: root_getchildren (--TR T2) 0.0150 msec/pass
   ET : root_getchildren (--TR T2) 0.0091 msec/pass

+When accessing single children, however, e.g. by index, this handicap is
+negligible::
+
+  lxe: first_child (--TR T2) 0.2499 msec/pass
+  cET: first_child (--TR T2) 0.2048 msec/pass
+  ET : first_child (--TR T2) 0.9291 msec/pass
+
+  lxe: last_child (--TR T1) 0.2511 msec/pass
+  cET: last_child (--TR T1) 0.2148 msec/pass
+  ET : last_child (--TR T1) 0.9191 msec/pass
+
+... unless you add the time to find a child index in a bigger list, as ET and
+cET use Python lists here, which are based on arrays. The data structure used
+by libxml2 is a linked tree, and thus, a linked list of children::
+
+  lxe: middle_child (--TR T1) 0.2921 msec/pass
+  cET: middle_child (--TR T1) 0.2069 msec/pass
+  ET : middle_child (--TR T1) 0.9291 msec/pass
+
+  lxe: middle_child (--TR T2) 1.9028 msec/pass
+  cET: middle_child (--TR T2) 0.2089 msec/pass
+  ET : middle_child (--TR T2) 0.9360 msec/pass
+
 As opposed to ET, libxml2 has a notion of documents that each element must be
 in.
This results in a major performance difference for creating independent Elements that end up in independently created documents:: From scoder at codespeak.net Fri May 4 11:37:29 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 4 May 2007 11:37:29 +0200 (CEST) Subject: [Lxml-checkins] r42667 - lxml/trunk/doc Message-ID: <20070504093729.512778075@code0.codespeak.net> Author: scoder Date: Fri May 4 11:37:28 2007 New Revision: 42667 Modified: lxml/trunk/doc/performance.txt Log: tree timings and note on non-comparable absolute numbers Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Fri May 4 11:37:28 2007 @@ -40,6 +40,13 @@ running for tree T1, with plain string text (S) and attributes (A). It was run against the root element (R) in the tree structure of the data (T). +Note that very small operations are repeated in integer loops to make them +measurable. It is therefore not always possible to compare the absolute +timings of, say, a single access benchmark (which usually loops) and a 'get +all in one step' benchmark, which already takes enough time to be measurable +and is therefore measured as is. Take a look at the concrete benchmarks in +the scripts to understand how the numbers compare. + .. contents:: .. 
    1 Bad things first
@@ -137,20 +144,20 @@
 (given in seconds)::

   lxe:         --     S-     U-     -A     SA     UA
-       T1: 0.1029 0.1005 0.0998 0.1003 0.0998 0.1002
-       T2: 0.1035 0.1013 0.1015 0.1090 0.1089 0.1090
-       T3: 0.0276 0.0270 0.0273 0.0679 0.0673 0.0673
-       T4: 0.0004 0.0004 0.0004 0.0013 0.0013 0.0013
+       T1: 0.1155 0.1154 0.1153 0.1159 0.1181 0.1158
+       T2: 0.1183 0.1197 0.1200 0.1267 0.1261 0.1264
+       T3: 0.0341 0.0312 0.0314 0.0726 0.0717 0.0720
+       T4: 0.0005 0.0004 0.0004 0.0014 0.0014 0.0014
   cET:         --     S-     U-     -A     SA     UA
-       T1: 0.0277 0.0273 0.0273 0.0272 0.0278 0.0275
-       T2: 0.0281 0.0347 0.0281 0.0285 0.0284 0.0284
-       T3: 0.0074 0.0074 0.0074 0.0122 0.0102 0.0101
+       T1: 0.0290 0.0271 0.0275 0.0297 0.0273 0.0274
+       T2: 0.0280 0.0280 0.0281 0.0285 0.0283 0.0286
+       T3: 0.0071 0.0072 0.0071 0.0113 0.0096 0.0096
        T4: 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
   ET :         --     S-     U-     -A     SA     UA
-       T1: 0.1349 0.1962 0.2356 0.1288 0.2642 0.1351
-       T2: 0.3104 0.1344 0.3566 0.3857 0.1354 0.4677
-       T3: 0.0313 0.0325 0.0312 0.0356 0.3803 0.0364
-       T4: 0.0005 0.0005 0.0008 0.0006 0.0007 0.0006
+       T1: 0.1362 0.1985 0.2300 0.1344 0.2672 0.1335
+       T2: 0.3107 0.1386 0.3581 0.3886 0.1388 0.4277
+       T3: 0.0334 0.0332 0.0320 0.0367 0.3769 0.0375
+       T4: 0.0006 0.0005 0.0008 0.0007 0.0007 0.0006

 While lxml is still faster than ET in most cases (30-60%), cET can be up to
 three times faster than lxml here. One of the reasons is that lxml must
One of the reasons is that lxml must From scoder at codespeak.net Fri May 4 21:00:41 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 4 May 2007 21:00:41 +0200 (CEST) Subject: [Lxml-checkins] r42690 - lxml/trunk/src/lxml Message-ID: <20070504190041.C97308068@code0.codespeak.net> Author: scoder Date: Fri May 4 21:00:40 2007 New Revision: 42690 Modified: lxml/trunk/src/lxml/etree.pyx Log: missing PI function and Comment.values() method Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Fri May 4 21:00:40 2007 @@ -1101,6 +1101,9 @@ def items(self): return [] + def values(self): + return [] + cdef class _Comment(__ContentOnlyElement): property tag: def __get__(self): @@ -1751,6 +1754,8 @@ tree.xmlAddChild(c_doc, c_node) return _elementFactory(doc, c_node) +PI = ProcessingInstruction + def SubElement(_Element _parent not None, _tag, attrib=None, nsmap=None, **_extra): """Subelement factory. This function creates an element instance, and appends it to an From scoder at codespeak.net Sat May 5 12:29:51 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 5 May 2007 12:29:51 +0200 (CEST) Subject: [Lxml-checkins] r42695 - in lxml/trunk: . 
doc src/lxml src/lxml/tests
Message-ID: <20070505102951.495DB8075@code0.codespeak.net>

Author: scoder
Date: Sat May 5 12:29:50 2007
New Revision: 42695

Modified:
   lxml/trunk/CHANGES.txt
   lxml/trunk/doc/sax.txt
   lxml/trunk/src/lxml/apihelpers.pxi
   lxml/trunk/src/lxml/etree.pyx
   lxml/trunk/src/lxml/sax.py
   lxml/trunk/src/lxml/serializer.pxi
   lxml/trunk/src/lxml/tests/test_etree.py
   lxml/trunk/src/lxml/tests/test_sax.py
Log:
comment/PI fixes for lxml.sax, support for serialising top-level PIs and comments, appending and prepending comments and PIs to the root node

Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Sat May 5 12:29:50 2007
@@ -8,12 +8,17 @@
 Features added
 --------------

+* ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support
+  adding processing instructions and comments around the root node
+
 * Element.attrib now has a ``pop()`` method

 * Extended type annotation in objectify: cleaner annotation namespace setup
   plus new ``xsiannotate()`` and ``deannotate()`` functions

-* Support for custom Element class instantiation in lxml.sax
+* Support for custom Element class instantiation in lxml.sax: passing a
+  ``makeelement()`` function to the ElementTreeContentHandler will reuse the
+  lookup context of that function

 * '.' represents empty ObjectPath (identity)

@@ -30,6 +35,11 @@
 Bugs fixed
 ----------

+* Documents lost their top-level PIs and comments on serialisation
+
+* lxml.sax failed on comments and PIs. Comments are now properly ignored and
+  PIs are copied.
+
 * Thread safety in XPath evaluators

 * Raise AssertionError when passing strings containing '\0' bytes

Modified: lxml/trunk/doc/sax.txt
==============================================================================
--- lxml/trunk/doc/sax.txt (original)
+++ lxml/trunk/doc/sax.txt Sat May 5 12:29:50 2007
@@ -39,6 +39,10 @@
   >>> lxml.etree.tostring(tree.getroot())
   'Hello world'

+By passing a ``makeelement`` function to the constructor of
+``ElementTreeContentHandler``, e.g. the one of a parser you configured, you
+can determine which element class lookup scheme should be used.
+

 Producing SAX events from an ElementTree or Element
 ---------------------------------------------------

Modified: lxml/trunk/src/lxml/apihelpers.pxi
==============================================================================
--- lxml/trunk/src/lxml/apihelpers.pxi (original)
+++ lxml/trunk/src/lxml/apihelpers.pxi Sat May 5 12:29:50 2007
@@ -541,7 +541,6 @@
     c_node = child._c_node
     # store possible text node
     c_next = c_node.next
-    # XXX what if element is coming from a different document?
     tree.xmlUnlinkNode(c_node)
     # move node itself
     tree.xmlAddChild(parent._c_node, c_node)
@@ -550,6 +549,38 @@
     # parent element has moved; change them too..
     moveNodeToDocument(child, parent._doc)

+cdef void _appendSibling(_Element element, _Element sibling):
+    """Add a sibling directly after the given element.
+    """
+    cdef xmlNode* c_next
+    cdef xmlNode* c_node
+    c_node = sibling._c_node
+    # store possible text node
+    c_next = c_node.next
+    tree.xmlUnlinkNode(c_node)
+    # move node itself
+    tree.xmlAddNextSibling(element._c_node, c_node)
+    _moveTail(c_next, c_node)
+    # uh oh, elements may be pointing to different doc when
+    # parent element has moved; change them too..
+    moveNodeToDocument(sibling, element._doc)
+
+cdef void _prependSibling(_Element element, _Element sibling):
+    """Add a sibling directly before the given element.
+ """ + cdef xmlNode* c_next + cdef xmlNode* c_node + c_node = sibling._c_node + # store possible text node + c_next = c_node.next + tree.xmlUnlinkNode(c_node) + # move node itself + tree.xmlAddPrevSibling(element._c_node, c_node) + _moveTail(c_next, c_node) + # uh oh, elements may be pointing to different doc when + # parent element has moved; change them too.. + moveNodeToDocument(sibling, element._doc) + cdef int isutf8(char* s): cdef char c c = s[0] Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sat May 5 12:29:50 2007 @@ -531,6 +531,36 @@ """ _appendChild(self, element) + def addnext(self, _Element element): + """Adds the element as a following sibling directly after this + element. + + This is normally used to set a processing instruction or comment after + the root node of a document. Note that tail text is automatically + discarded when adding at the root level. + """ + if self._c_node.parent != NULL and not _isElement(self._c_node.parent): + if element._c_node.type != tree.XML_PI_NODE: + if element._c_node.type != tree.XML_COMMENT_NODE: + raise TypeError, "Only processing instructions and comments can be siblings of the root element" + element.tail = None + _appendSibling(self, element) + + def addprevious(self, _Element element): + """Adds the element as a preceding sibling directly before this + element. + + This is normally used to set a processing instruction or comment + before the root node of a document. Note that tail text is + automatically discarded when adding at the root level. 
+ """ + if self._c_node.parent != NULL and not _isElement(self._c_node.parent): + if element._c_node.type != tree.XML_PI_NODE: + if element._c_node.type != tree.XML_COMMENT_NODE: + raise TypeError, "Only processing instructions and comments can be siblings of the root element" + element.tail = None + _prependSibling(self, element) + def extend(self, elements): """Extends the current children by the elements in the iterable. """ Modified: lxml/trunk/src/lxml/sax.py ============================================================================== --- lxml/trunk/src/lxml/sax.py (original) +++ lxml/trunk/src/lxml/sax.py Sat May 5 12:29:50 2007 @@ -1,5 +1,6 @@ from xml.sax.handler import ContentHandler from etree import ElementTree, Element, SubElement, LxmlError +from etree import XML, Comment, ProcessingInstruction class SaxError(LxmlError): pass @@ -15,6 +16,7 @@ """ def __init__(self, makeelement=None): self._root = None + self._root_siblings = [] self._element_stack = [] self._default_ns = None self._ns_mapping = { None : [None] } @@ -82,6 +84,10 @@ if self._root is None: element = self._root = \ self._makeelement(el_name, attrs, self._new_mappings) + if self._root_siblings and hasattr(element, 'addprevious'): + for sibling in self._root_siblings: + element.addprevious(sibling) + del self._root_siblings[:] else: element = SubElement(element_stack[-1], el_name, attrs, self._new_mappings) @@ -89,10 +95,16 @@ self._new_mappings.clear() + def processingInstruction(self, target, data): + pi = ProcessingInstruction(target, data) + if self._root is None: + self._root_siblings.append(pi) + else: + self._element_stack[-1].append(pi) + def endElementNS(self, ns_name, qname): element = self._element_stack.pop() - tag = element.tag - if ns_name != _getNsTag(tag): + if ns_name != _getNsTag(element.tag): raise SaxError, "Unexpected element closed: {%s}%s" % ns_name def startElement(self, name, attributes=None): @@ -106,10 +118,13 @@ try: # if there already is a child element, we 
must append to its tail
             last_element = last_element[-1]
-            last_element.tail = (last_element.tail or u'') + data
+            last_element.tail = (last_element.tail or '') + data
         except IndexError:
             # otherwise: append to the text
-            last_element.text = (last_element.text or u'') + data
+            last_element.text = (last_element.text or '') + data
+
+    ignorableWhitespace = characters
+
 class ElementTreeProducer(object):
     """Produces SAX events for an element and children.
@@ -124,13 +139,41 @@
         from xml.sax.xmlreader import AttributesNSImpl as attr_class
         self._attr_class = attr_class
         self._empty_attributes = attr_class({}, {})
-
+
     def saxify(self):
         self._content_handler.startDocument()
-        self._recursive_saxify(self._element, {})
+
+        element = self._element
+        if hasattr(element, 'getprevious'):
+            siblings = []
+            sibling = element.getprevious()
+            while getattr(sibling, 'tag', None) is ProcessingInstruction:
+                siblings.append(sibling)
+                sibling = sibling.getprevious()
+            for sibling in siblings[::-1]:
+                self._recursive_saxify(sibling, {})
+
+        self._recursive_saxify(element, {})
+
+        if hasattr(element, 'getnext'):
+            sibling = element.getnext()
+            while getattr(sibling, 'tag', None) is ProcessingInstruction:
+                self._recursive_saxify(sibling, {})
+                sibling = sibling.getnext()
+
         self._content_handler.endDocument()

     def _recursive_saxify(self, element, prefixes):
+        content_handler = self._content_handler
+        tag = element.tag
+        if tag is Comment or tag is ProcessingInstruction:
+            if tag is ProcessingInstruction:
+                content_handler.processingInstruction(
+                    element.target, element.text)
+            if element.tail:
+                content_handler.characters(element.tail)
+            return
+
         new_prefixes = []
         build_qname = self._build_qname
         attribs = element.items()
@@ -146,10 +189,9 @@
         else:
             sax_attributes = self._empty_attributes

-        ns_uri, local_name = _getNsTag(element.tag)
+        ns_uri, local_name = _getNsTag(tag)
         qname = build_qname(ns_uri, local_name, prefixes, new_prefixes)

-        content_handler = self._content_handler
         for prefix, uri in new_prefixes:
             content_handler.startPrefixMapping(prefix, uri)
         content_handler.startElementNS((ns_uri, local_name),

Modified: lxml/trunk/src/lxml/serializer.pxi
==============================================================================
--- lxml/trunk/src/lxml/serializer.pxi (original)
+++ lxml/trunk/src/lxml/serializer.pxi Sat May 5 12:29:50 2007
@@ -78,8 +78,10 @@
     if write_xml_declaration:
         _writeDeclarationToBuffer(c_buffer, c_doc.version, encoding)

+    _writePrevSiblings(c_buffer, c_node, encoding, pretty_print)
     tree.xmlNodeDumpOutput(c_buffer, c_doc, c_node, 0, pretty_print, encoding)
     _writeTail(c_buffer, c_node, encoding, pretty_print)
+    _writeNextSiblings(c_buffer, c_node, encoding, pretty_print)

 cdef void _writeDeclarationToBuffer(tree.xmlOutputBuffer* c_buffer,
                                     char* version, char* encoding):
@@ -100,6 +102,36 @@
                                pretty_print, encoding)
         c_node = c_node.next

+cdef void _writePrevSiblings(tree.xmlOutputBuffer* c_buffer, xmlNode* c_node,
+                             char* encoding, int pretty_print):
+    cdef xmlNode* c_sibling
+    if c_node.parent is not NULL and _isElement(c_node.parent):
+        return
+    # we are at a root node, so add PI and comment siblings
+    c_sibling = c_node
+    while c_sibling.prev != NULL and \
+            (c_sibling.prev.type == tree.XML_PI_NODE or \
+             c_sibling.prev.type == tree.XML_COMMENT_NODE):
+        c_sibling = c_sibling.prev
+    while c_sibling != c_node:
+        tree.xmlNodeDumpOutput(c_buffer, c_node.doc, c_sibling, 0,
+                               pretty_print, encoding)
+        c_sibling = c_sibling.next
+
+cdef void _writeNextSiblings(tree.xmlOutputBuffer* c_buffer, xmlNode* c_node,
+                             char* encoding, int pretty_print):
+    cdef xmlNode* c_sibling
+    if c_node.parent is not NULL and _isElement(c_node.parent):
+        return
+    # we are at a root node, so add PI and comment siblings
+    c_sibling = c_node.next
+    while c_sibling != NULL and \
+            (c_sibling.type == tree.XML_PI_NODE or \
+             c_sibling.type == tree.XML_COMMENT_NODE):
+        tree.xmlNodeDumpOutput(c_buffer, c_node.doc, c_sibling, 0,
+                               pretty_print, encoding)
+        c_sibling = c_sibling.next
+
 # output to file-like objects

 cdef class _FilelikeWriter:

Modified: lxml/trunk/src/lxml/tests/test_etree.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_etree.py (original)
+++ lxml/trunk/src/lxml/tests/test_etree.py Sat May 5 12:29:50 2007
@@ -404,6 +404,156 @@
         Element = self.etree.Element
         self.assertRaises(TypeError, Element('a').append, None)

+    def test_addnext(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        root = Element('root')
+        SubElement(root, 'a')
+        SubElement(root, 'b')
+
+        self.assertEquals(['a', 'b'],
+                          [c.tag for c in root])
+        root[1].addnext(root[0])
+        self.assertEquals(['b', 'a'],
+                          [c.tag for c in root])
+
+    def test_addprevious(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        root = Element('root')
+        SubElement(root, 'a')
+        SubElement(root, 'b')
+
+        self.assertEquals(['a', 'b'],
+                          [c.tag for c in root])
+        root[0].addprevious(root[1])
+        self.assertEquals(['b', 'a'],
+                          [c.tag for c in root])
+
+    def test_addnext_root(self):
+        Element = self.etree.Element
+        a = Element('a')
+        b = Element('b')
+        self.assertRaises(TypeError, a.addnext, b)
+
+    def test_addnext_root(self):
+        Element = self.etree.Element
+        a = Element('a')
+        b = Element('b')
+        self.assertRaises(TypeError, a.addnext, b)
+
+    def test_addprevious_pi(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        PI = self.etree.PI
+        root = Element('root')
+        SubElement(root, 'a')
+        pi = PI('TARGET', 'TEXT')
+        pi.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root[0].addprevious(pi)
+        self.assertEquals('TAIL',
+                          self._writeElement(root))
+
+    def test_addprevious_root_pi(self):
+        Element = self.etree.Element
+        PI = self.etree.PI
+        root = Element('root')
+        pi = PI('TARGET', 'TEXT')
+        pi.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root.addprevious(pi)
+        self.assertEquals('\n',
+                          self._writeElement(root))
+
+    def test_addnext_pi(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        PI = self.etree.PI
+        root = Element('root')
+        SubElement(root, 'a')
+        pi = PI('TARGET', 'TEXT')
+        pi.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root[0].addnext(pi)
+        self.assertEquals('TAIL',
+                          self._writeElement(root))
+
+    def test_addnext_root_pi(self):
+        Element = self.etree.Element
+        PI = self.etree.PI
+        root = Element('root')
+        pi = PI('TARGET', 'TEXT')
+        pi.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root.addnext(pi)
+        self.assertEquals('\n',
+                          self._writeElement(root))
+
+    def test_addnext_comment(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        Comment = self.etree.Comment
+        root = Element('root')
+        SubElement(root, 'a')
+        comment = Comment('TEXT ')
+        comment.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root[0].addnext(comment)
+        self.assertEquals('TAIL',
+                          self._writeElement(root))
+
+    def test_addnext_root_comment(self):
+        Element = self.etree.Element
+        Comment = self.etree.Comment
+        root = Element('root')
+        comment = Comment('TEXT ')
+        comment.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root.addnext(comment)
+        self.assertEquals('\n',
+                          self._writeElement(root))
+
+    def test_addprevious_comment(self):
+        Element = self.etree.Element
+        SubElement = self.etree.SubElement
+        Comment = self.etree.Comment
+        root = Element('root')
+        SubElement(root, 'a')
+        comment = Comment('TEXT ')
+        comment.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root[0].addprevious(comment)
+        self.assertEquals('TAIL',
+                          self._writeElement(root))
+
+    def test_addprevious_root_comment(self):
+        Element = self.etree.Element
+        Comment = self.etree.Comment
+        root = Element('root')
+        comment = Comment('TEXT ')
+        comment.tail = "TAIL"
+
+        self.assertEquals('',
+                          self._writeElement(root))
+        root.addprevious(comment)
+        self.assertEquals('\n',
+                          self._writeElement(root))
+
     # ET's Elements have items() and key(), but not values()
     def test_attribute_values(self):
         XML = self.etree.XML

Modified: lxml/trunk/src/lxml/tests/test_sax.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_sax.py (original)
+++ lxml/trunk/src/lxml/tests/test_sax.py Sat May 5 12:29:50 2007
@@ -25,6 +25,30 @@
         self.assertEquals('abbbba',
                           xml_out)

+    def test_etree_sax_comment(self):
+        tree = self.parse('abba')
+        xml_out = self._saxify_serialize(tree)
+        self.assertEquals('abba',
+                          xml_out)
+
+    def test_etree_sax_pi(self):
+        tree = self.parse('abba')
+        xml_out = self._saxify_serialize(tree)
+        self.assertEquals('abba',
+                          xml_out)
+
+    def test_etree_sax_comment_root(self):
+        tree = self.parse('ab')
+        xml_out = self._saxify_serialize(tree)
+        self.assertEquals('ab',
+                          xml_out)
+
+    def test_etree_sax_pi_root(self):
+        tree = self.parse('ab')
+        xml_out = self._saxify_serialize(tree)
+        self.assertEquals('ab',
+                          xml_out)
+
     def test_etree_sax_attributes(self):
         tree = self.parse('abba')
         xml_out = self._saxify_serialize(tree)

From scoder at codespeak.net Sat May 5 19:02:35 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 5 May 2007 19:02:35 +0200 (CEST)
Subject: [Lxml-checkins] r42704 - in lxml/trunk: . src/lxml
Message-ID: <20070505170235.1CACD806D@code0.codespeak.net>

Author: scoder
Date: Sat May 5 19:02:33 2007
New Revision: 42704

Modified:
   lxml/trunk/CHANGES.txt
   lxml/trunk/src/lxml/extensions.pxi
Log:
support passing a node-set instead of a string in XPath regexps

Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Sat May 5 19:02:33 2007
@@ -8,6 +8,9 @@
 Features added
 --------------

+* The regular expression functions in XPath now support passing a node-set
+  instead of a string
+
 * ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support
   adding processing instructions and comments around the root node

@@ -17,7 +20,7 @@
   plus new ``xsiannotate()`` and ``deannotate()`` functions

 * Support for custom Element class instantiation in lxml.sax: passing a
-  ``makeelement()`` function to the ElementTreeContentHandler will reuse the
+  ``makeelement`` function to the ElementTreeContentHandler will reuse the
   lookup context of that function

 * '.' represents empty ObjectPath (identity)

Modified: lxml/trunk/src/lxml/extensions.pxi
==============================================================================
--- lxml/trunk/src/lxml/extensions.pxi (original)
+++ lxml/trunk/src/lxml/extensions.pxi Sat May 5 19:02:33 2007
@@ -306,6 +306,18 @@
 ################################################################################
 # EXSLT regexp implementation

+cdef int _collect_tree_text(element, l) except -1:
+    # recursively collect all text (XPath 'string-value' of a node)
+    text = element.text
+    if text is not None:
+        python.PyList_Append(l, text)
+    for child in element:
+        _collect_tree_text(child, l)
+    tail = element.tail
+    if tail is not None:
+        python.PyList_Append(l, tail)
+    return 0
+
 cdef class _ExsltRegExp:
     cdef object _compile_map
     def __init__(self):
@@ -314,6 +326,19 @@
     cdef _make_string(self, value):
         if _isString(value):
             return value
+        elif python.PyList_Check(value):
+            # node set: take recursive text concatenation of first element
+            if python.PyList_GET_SIZE(value) == 0:
+                return ''
+            firstnode = value[0]
+            if _isString(firstnode):
+                return firstnode
+            elif isinstance(firstnode, _Element):
+                l = []
+                _collect_tree_text(firstnode, l)
+                return ''.join(l)
+            else:
+                return str(firstnode)
         else:
             raise TypeError, "Invalid argument type %s" % type(value)

From scoder at codespeak.net Sat May 5 19:09:00 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 5 May 2007 19:09:00 +0200 (CEST)
Subject: [Lxml-checkins] r42705 - lxml/trunk/doc
Message-ID: <20070505170900.2FDB6806D@code0.codespeak.net>

Author: scoder
Date: Sat May 5 19:08:59 2007
New Revision: 42705

Modified:
   lxml/trunk/doc/xpathxslt.txt
Log:
rewrite of XPath doc page

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Sat May 5 19:08:59 2007
@@ -6,10 +6,15 @@
 compliant way.

 .. contents::
-..
+..
    1 XPath
+     1.1 The ``xpath()`` method
+     1.2 The ``XPath`` class
+     1.3 The ``XPathEvaluator`` classes
+     1.4 ``ETXPath``
    2 XSLT

+
 The usual setup procedure::

     >>> from lxml import etree
@@ -17,12 +22,17 @@

 XPath
------
+=====
+
+lxml.etree supports the simple path syntax of the `find, findall and
+findtext`_ methods on ElementTree and Element, as known from the original
+ElementTree library (ElementPath_). As an lxml specific extension, these
+classes also provide an ``xpath()`` method that supports expressions in the
+complete XPath syntax, as well as `extension functions`_.

-lxml.etree supports the simple path syntax of the ``findall()`` etc. methods
-on ElementTree and Element, as known from the original ElementTree library.
-As an extension, these classes also provide an ``xpath()`` method that
-supports expressions in the complete XPath syntax.
+.. _ElementPath: http://effbot.org/zone/element-xpath.htm
+.. _`find, findall and findtext`: http://effbot.org/zone/element.htm#searching-for-subelements
+.. _`extension functions`: extensions.html

 There are also specialized XPath evaluator classes that are more efficient for
 frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
@@ -32,6 +42,10 @@

 .. _`performance comparison`: performance.html#xpath

+
+The ``xpath()`` method
+----------------------
+
 For ElementTree, the xpath method performs a global XPath query against the
 document (if absolute) or against the root node (if relative)::

@@ -48,7 +62,7 @@
     >>> r[0].tag
     'bar'

-When ``xpath()`` is used on an element, the XPath expression is evaluated
+When ``xpath()`` is used on an Element, the XPath expression is evaluated
 against the element (if relative) or against the root tree (if absolute)::

     >>> root = tree.getroot()
@@ -66,6 +80,19 @@
     >>> r[0].tag
     'bar'

+The ``xpath()`` method has support for XPath variables::
+
+    >>> expr = "//*[local-name() = $name]"
+
+    >>> print root.xpath(expr, name = "foo")[0].tag
+    foo
+
+    >>> print root.xpath(expr, name = "bar")[0].tag
+    bar
+
+    >>> print root.xpath("$text", text = "Hello World!")
+    Hello World!
+
 Optionally, you can provide a ``namespaces`` keyword argument, which should be
 a dictionary mapping the namespace prefixes used in the XPath expression to
 namespace URIs::
@@ -102,11 +129,10 @@
 * a (unicode) string, when the XPath expression has a string result.

 * a list of items, when the XPath expression has a list as result. The items
-  may include elements, strings and tuples. Text nodes and attributes in the
-  result are returned as strings (the text node content or attribute value).
-  Comments are also returned as strings, enclosed by the usual ```` markers. Namespace declarations are returned as tuples of strings:
-  ``(prefix, URI)``.
+  may include elements (also comments and processing instructions), strings
+  and tuples. Text nodes and attributes in the result are returned as strings
+  (the text node content or attribute value). Namespace declarations are
+  returned as tuples of strings: ``(prefix, URI)``.

 A related convenience method of ElementTree objects is ``getpath(element)``,
 which returns a structural, absolute XPath expression to find that element::
@@ -124,8 +150,111 @@
     True


+The ``XPath`` class
+-------------------
+
+The ``XPath`` class compiles an XPath expression into a callable function::
+
+    >>> root = etree.XML("")
+
+    >>> find = etree.XPath("//b")
+    >>> print find(root)[0].tag
+    b
+
+The compilation takes as much time as in the ``xpath()`` method, but it is
+done only once per class instantiation. This makes it especially efficient
+for repeated evaluation of the same XPath expression.
+
+Just like the ``xpath()`` method, the ``XPath`` class supports XPath
+variables::
+
+    >>> count_elements = etree.XPath("count(//*[local-name() = $name])")
+
+    >>> print count_elements(root, name = "a")
+    1.0
+    >>> print count_elements(root, name = "b")
+    2.0
+
+This supports very efficient evaluation of modified versions of an XPath
+expression, as compilation is still only required once.
+
+Prefix-to-namespace mappings can be passed as second parameter::
+
+    >>> root = etree.XML("")
+
+    >>> find = etree.XPath("//n:b", {'n':'NS'})
+    >>> print find(root)[0].tag
+    {NS}b
+
+You can pass the boolean keyword ``regexp`` to enable Python regular
+expressions in the EXSLT_ namespace::
+
+    >>> regexpNS = "http://exslt.org/regular-expressions"
+    >>> find = etree.XPath("//*[r:test(., '^abc$', 'i')]",
+    ...                    {'r':regexpNS}, regexp = True)
+
+    >>> root = etree.XML("aBaBc")
+    >>> print find(root)[0].text
+    aBc
+
+.. _EXSLT: http://www.exslt.org/
+
+
+The ``XPathEvaluator`` classes
+------------------------------
+
+lxml.etree provides two other efficient XPath evaluators that work on
+ElementTrees or Elements respectively: ``XPathDocumentEvaluator`` and
+``XPathElementEvaluator``. They are automatically selected if you use the
+XPathEvaluator helper for instantiation::
+
+    >>> root = etree.XML("")
+    >>> xpatheval = etree.XPathEvaluator(root)
+
+    >>> print isinstance(xpatheval, etree.XPathElementEvaluator)
+    True
+
+    >>> print xpatheval("//b")[0].tag
+    b
+
+This class provides efficient support for evaluating different XPath
+expressions on the same Element or ElementTree.
+
+
+``ETXPath``
+-----------
+
+ElementTree supports a language named ElementPath_ in its ``find*()`` methods.
+One of the main differences between XPath and ElementPath is that the XPath
+language requires an indirection through prefixes for namespace support,
+whereas ElementTree uses the Clark notation (``{ns}name``) to avoid prefixes
+completely. The other major difference regards the capabilities of both path
+languages. Where XPath supports various sophisticated ways of restricting the
+result set through functions and boolean expressions, ElementPath only
+supports pure path traversal without nesting or further conditions. So, while
+the ElementPath syntax is self-contained and therefore easier to write and
+handle, XPath is much more powerful and expressive.
+
+lxml.etree bridges this gap through the class ``ETXPath``, which accepts XPath
+expressions with namespaces in Clark notation. It is identical to the
+``XPath`` class, except for the namespace notation. Normally, you would
+write::
+
+    >>> root = etree.XML("")
+
+    >>> find = etree.XPath("//p:b", {'p' : 'ns'})
+    >>> print find(root)[0].tag
+    {ns}b
+
+``ETXPath`` allows you to change this to::
+
+    >>> find = etree.ETXPath("//{ns}b")
+    >>> print find(root)[0].tag
+    {ns}b
+
+
 XSLT
----
+====

 lxml.etree introduces a new class, lxml.etree.XSLT. The class can be given an
 ElementTree object to construct an XSLT transformer::

From scoder at codespeak.net Sun May 6 08:57:04 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 08:57:04 +0200 (CEST)
Subject: [Lxml-checkins] r42723 - lxml/trunk/doc
Message-ID: <20070506065704.92C57807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 08:57:03 2007
New Revision: 42723

Modified:
   lxml/trunk/doc/resolvers.txt
   lxml/trunk/doc/xpathxslt.txt
Log:
restructuring of XSLT docs

Modified: lxml/trunk/doc/resolvers.txt
==============================================================================
--- lxml/trunk/doc/resolvers.txt (original)
+++ lxml/trunk/doc/resolvers.txt Sun May 6 08:57:03 2007
@@ -3,13 +3,20 @@

 .. contents::
 ..
-   1 Document loaders in context
-   2 I/O access control in XSLT
+   1 Resolvers
+   2 Document loading in context
+   3 I/O access control in XSLT


 Lxml has support for custom document loaders in both the parsers and XSL
 transformations. These so-called resolvers are subclasses of the
-etree.Resolver class as in the following example::
+etree.Resolver class.
+
+
+Resolvers
+---------
+
+Here is an example of a custom resolver::

     >>> from lxml import etree

@@ -32,10 +39,10 @@
 * ``resolve_file`` takes an open file-like object that has at least a read() method
 * ``resolve_empty`` resolves into an empty document

-The ``resolve`` method may choose to return None, in which case the next
-registered resolver (or the default resolver) is consulted. It is never
-called if the resolver returns the result of any of the above ``resolve_*``
-methods.
+The ``resolve()`` method may choose to return None, in which case the next
+registered resolver (or the default resolver) is consulted. Resolving always
+terminates if ``resolve()`` returns the result of any of the above
+``resolve_*()`` methods.

 Resolvers are registered local to a parser::

@@ -58,7 +65,7 @@
 fragment.


-Document loaders in context
+Document loading in context
 ---------------------------

 XML documents memorise their initial parser (and its resolvers) during their
@@ -180,12 +187,16 @@
 I/O access control in XSLT
 --------------------------

-XSLT has an additional mechanism to control the access to certain I/O
-operations during the transformation process. This is most interesting where
-XSL scripts come from potentially insecure sources and must be prevented from
-modifying the local file system. Note, however, that there is no way to keep
-them from eating up your precious CPU time, so this should not stop you from
-thinking about what XSLT you execute.
+By default, XSLT supports all extension functions from libxslt and libexslt as
+well as Python regular expressions through EXSLT. Some extensions enable
+style sheets to read and write files on the local file system.
+
+XSLT has a mechanism to control the access to certain I/O operations during
+the transformation process. This is most interesting where XSL scripts come
+from potentially insecure sources and must be prevented from modifying the
+local file system. Note, however, that there is no way to keep them from
+eating up your precious CPU time, so this should not stop you from thinking
+about what XSLT you execute.

 Access control is configured using the ``XSLTAccessControl`` class. It can be
 called with a number of keyword arguments that allow or deny specific

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Sun May 6 08:57:03 2007
@@ -28,11 +28,11 @@
 findtext`_ methods on ElementTree and Element, as known from the original
 ElementTree library (ElementPath_). As an lxml specific extension, these
 classes also provide an ``xpath()`` method that supports expressions in the
-complete XPath syntax, as well as `extension functions`_.
+complete XPath syntax, as well as `custom extension functions`_.

 .. _ElementPath: http://effbot.org/zone/element-xpath.htm
 .. _`find, findall and findtext`: http://effbot.org/zone/element.htm#searching-for-subelements
-.. _`extension functions`: extensions.html
+.. _`custom extension functions`: extensions.html

 There are also specialized XPath evaluator classes that are more efficient for
 frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
@@ -115,9 +115,11 @@
     'Text'

 There is also an optional ``extensions`` argument which is used to define
-`extension functions`_ in Python that are local to this evaluation.
+`custom extension functions`_ in Python that are local to this evaluation.

-.. _`extension functions`: extensions.html
+
+XPath return values
+-------------------

 The return values of XPath evaluations vary, depending on the XPath expression
 used:
@@ -315,6 +317,18 @@
     [...]
     LookupError: unknown encoding: UCS4

+By default, XSLT supports all extension functions from libxslt and libexslt as
+well as Python regular expressions through EXSLT. Also see the documentation
+on `custom extension functions`_ and `document resolvers`_. There is a
+separate section on `controlling access`_ to external documents and resources.
+
+.. _`document resolvers`: resolvers.html
+.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt
+
+
+Stylesheet parameters
+---------------------
+
 It is possible to pass parameters, in the form of XPath expressions, to the
 XSLT template::

@@ -342,7 +356,11 @@
     >>> str(result)
     '\nText\n'

-There's also a convenience method on the tree object for doing XSL
+
+The ``xslt()`` tree method
+--------------------------
+
+There's also a convenience method on ElementTree objects for doing XSL
 transformations. This is less efficient if you want to apply the same XSL
 transformation to multiple documents, but is shorter to write for one-shot
 operations, as you do not have to instantiate a stylesheet yourself::
@@ -351,12 +369,16 @@
     >>> str(result)
     '\nA\n'

-By default, XSLT supports all extension functions from libxslt and libexslt as
-well as Python regular expressions through EXSLT. Note that some extensions
-enable style sheets to read and write files on the local file system. See the
-`document loader documentation`_ on how to deal with this.
+This is a shortcut for the following code::
+
+    >>> transform = etree.XSLT(xslt_tree)
+    >>> result = transform(doc, a="'A'")
+    >>> str(result)
+    '\nA\n'
+

-.. _`document loader documentation`: resolvers.html
+Profiling
+---------

 If you want to know how your stylesheet performed, pass the ``profile_run``
 keyword to the transform::

From scoder at codespeak.net Sun May 6 09:02:34 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 09:02:34 +0200 (CEST)
Subject: [Lxml-checkins] r42724 - lxml/trunk/doc
Message-ID: <20070506070234.CC10F807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 09:02:34 2007
New Revision: 42724

Modified:
   lxml/trunk/doc/xpathxslt.txt
Log:
restructuring of XSLT docs

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Sun May 6 09:02:34 2007
@@ -9,10 +9,15 @@
 ..
    1 XPath
      1.1 The ``xpath()`` method
-     1.2 The ``XPath`` class
-     1.3 The ``XPathEvaluator`` classes
-     1.4 ``ETXPath``
+     1.2 XPath return values
+     1.3 The ``XPath`` class
+     1.4 The ``XPathEvaluator`` classes
+     1.5 ``ETXPath``
    2 XSLT
+     2.1 XSLT result objects
+     2.2 Stylesheet parameters
+     2.3 The ``xslt()`` tree method
+     2.4 Profiling


 The usual setup procedure::
@@ -276,9 +281,28 @@

     >>> f = StringIO('Text')
     >>> doc = etree.parse(f)
-    >>> result = transform(doc)
+    >>> result_tree = transform(doc)
+
+By default, XSLT supports all extension functions from libxslt and libexslt as
+well as Python regular expressions through the `EXSLT regexp functions`_.
+Also see the documentation on `custom extension functions`_ and `document
+resolvers`_. There is a separate section on `controlling access`_ to external
+documents and resources.
+
+.. _`EXSLT regexp functions`: http://www.exslt.org/regexp/
+.. _`document resolvers`: resolvers.html
+.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt
+
+
+XSLT result objects
+-------------------

-The result object can be accessed like a normal ElementTree document::
+The result of an XSL transformation can be accessed like a normal ElementTree
+document::
+
+    >>> f = StringIO('Text')
+    >>> doc = etree.parse(f)
+    >>> result = transform(doc)

     >>> result.getroot().text
     'Text'
@@ -317,14 +341,6 @@
     [...]
     LookupError: unknown encoding: UCS4

-By default, XSLT supports all extension functions from libxslt and libexslt as
-well as Python regular expressions through EXSLT. Also see the documentation
-on `custom extension functions`_ and `document resolvers`_. There is a
-separate section on `controlling access`_ to external documents and resources.
-
-.. _`document resolvers`: resolvers.html
-.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt
-

 Stylesheet parameters
 ---------------------

From scoder at codespeak.net Sun May 6 09:10:32 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 09:10:32 +0200 (CEST)
Subject: [Lxml-checkins] r42725 - lxml/trunk/doc
Message-ID: <20070506071032.1B3D4807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 09:10:31 2007
New Revision: 42725

Modified:
   lxml/trunk/doc/element_classes.txt
Log:
cleanup

Modified: lxml/trunk/doc/element_classes.txt
==============================================================================
--- lxml/trunk/doc/element_classes.txt (original)
+++ lxml/trunk/doc/element_classes.txt Sun May 6 09:10:31 2007
@@ -29,11 +29,12 @@
      2.2 Namespace class lookup
      2.3 Attribute based lookup
      2.4 Custom element class lookup
+     2.5 Tree based element class lookup in Python
    3 Implementing namespaces


 Element initialization
-----------------------
+======================

 There is one thing to know up front. Element classes *must not* have a
 constructor, neither must there be any internal state (except for the data
@@ -72,7 +73,7 @@


 Setting up a class lookup scheme
---------------------------------
+================================

 The first thing to do when deploying custom element classes is to register a
 class lookup scheme on a parser. lxml.etree provides quite a number of
@@ -140,7 +141,7 @@


 Default class lookup
-....................
+--------------------

 This is the most simple lookup mechanism. It always returns the default
 element class. Consequently, no further fallbacks are supported, but this
@@ -179,7 +180,7 @@


 Namespace class lookup
-......................
+----------------------

 This is an advanced lookup mechanism that supports namespace/tag-name specific
 element classes. You can select it by calling::
@@ -204,14 +205,15 @@


 Attribute based lookup
-......................
+----------------------

 This scheme uses a mapping from attribute values to classes. An attribute
 name is set at initialisation time and is then used to find the corresponding
 value. It is set up as follows::

     >>> id_class_mapping = {} # maps attribute values to element classes
-    >>> lookup = etree.AttributeBasedElementClassLookup('id', id_class_mapping)
+    >>> lookup = etree.AttributeBasedElementClassLookup(
+    ...                                      'id', id_class_mapping)

     >>> parser = etree.XMLParser()
     >>> parser.setElementClassLookup(lookup)
@@ -230,7 +232,7 @@


 Custom element class lookup
-...........................
+---------------------------

 This is the most customisable way of finding element classes on a per-element
 basis. It allows you to implement a custom lookup scheme in a subclass::
@@ -252,7 +254,7 @@


 Tree based element class lookup in Python
-.........................................
+-----------------------------------------

 Taking more elaborate decisions than allowed by the custom scheme is difficult
 to achieve in pure Python. It would require access to the tree - before the
@@ -291,7 +293,7 @@


 Implementing namespaces
------------------------
+=======================

 lxml allows you to implement namespaces, in a rather literal sense. After
 setting up the namespace class lookup mechanism as described above, you can

From scoder at codespeak.net Sun May 6 09:24:19 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 09:24:19 +0200 (CEST)
Subject: [Lxml-checkins] r42727 - lxml/trunk/doc
Message-ID: <20070506072419.E583C807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 09:24:19 2007
New Revision: 42727

Modified:
   lxml/trunk/doc/element_classes.txt
Log:
cleanup

Modified: lxml/trunk/doc/element_classes.txt
==============================================================================
--- lxml/trunk/doc/element_classes.txt (original)
+++ lxml/trunk/doc/element_classes.txt Sun May 6 09:24:19 2007
@@ -4,8 +4,8 @@

 lxml has very sophisticated support for custom Element classes. You can
 provide your own classes for Elements and have lxml use them by default, for
-all elements generated by a specific parser or only for a specific tag name in
-a specific namespace.
+all elements generated by a specific parser, for a specific tag name in a
+specific namespace or for an exact element at a specific position in the tree.

 Custom Elements must inherit from the ``lxml.etree.ElementBase`` class, which
 provides the Element interface for subclasses::
@@ -44,10 +44,12 @@
 called, the object may not even be initialized yet to represent the XML tag,
 so there is not much use in providing an ``__init__`` method in subclasses.

-However, there is one possible way to do things on element initialization, if
-you really need to. ElementBase classes have an ``_init()`` method that can
-be overridden. It can be used to modify the XML tree, e.g. to construct
-special children or verify and update attributes.
+Most use cases will not require any class initialisation, so you can content
+yourself with skipping to the next section for now. However, if you really
+need to set up your element class on instantiation, there is one possible way
+to do so. ElementBase classes have an ``_init()`` method that can be
+overridden. It can be used to modify the XML tree, e.g. to construct special
+children or verify and update attributes.

 The semantics of ``_init()`` are as follows:

From scoder at codespeak.net Sun May 6 11:10:59 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 11:10:59 +0200 (CEST)
Subject: [Lxml-checkins] r42730 - lxml/trunk/src/lxml
Message-ID: <20070506091059.8894B807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 11:10:59 2007
New Revision: 42730

Modified:
   lxml/trunk/src/lxml/etree.pyx
   lxml/trunk/src/lxml/etree_defs.h
   lxml/trunk/src/lxml/python.pxd
Log:
fast path for instantiation of _Element class (20% faster)

Modified: lxml/trunk/src/lxml/etree.pyx
==============================================================================
--- lxml/trunk/src/lxml/etree.pyx (original)
+++ lxml/trunk/src/lxml/etree.pyx Sun May 6 11:10:59 2007
@@ -1043,6 +1043,9 @@
 else:
     ELEMENT_CREATION_LOCK = NULL

+cdef extern from "etree_defs.h":
+    cdef _Element NEW_ELEMENT "PY_NEW" (object t)
+
 cdef _Element _elementFactory(_Document doc, xmlNode* c_node):
     cdef python.PyThreadState* state
     cdef _Element result
@@ -1061,9 +1064,13 @@
             python.PyThread_release_lock(ELEMENT_CREATION_LOCK)
         return result

-    element_class = LOOKUP_ELEMENT_CLASS(ELEMENT_CLASS_LOOKUP_STATE,
-                                         doc, c_node)
-    result = element_class()
+    element_class = LOOKUP_ELEMENT_CLASS(
+        ELEMENT_CLASS_LOOKUP_STATE, doc, c_node)
+    if element_class is _Element:
+        # fast path for standard _Element class
+        result = NEW_ELEMENT(_Element)
+    else:
+        result = element_class()
     result._doc = doc
     result._c_node = c_node
     registerProxy(result)
@@ -1071,7 +1078,8 @@
     if config.ENABLE_THREADING:
         python.PyThread_release_lock(ELEMENT_CREATION_LOCK)

-    result._init()
+    if element_class is not _Element:
+        result._init()
     return result


Modified: lxml/trunk/src/lxml/etree_defs.h
==============================================================================
--- lxml/trunk/src/lxml/etree_defs.h (original)
+++ lxml/trunk/src/lxml/etree_defs.h Sun May 6 11:10:59 2007
@@ -64,6 +64,16 @@
 #define iter(o) PyObject_GetIter(o)
 #define _cstr(s) PyString_AS_STRING(s)

+static PyObject* __PY_NEW_GLOBAL_EMPTY_TUPLE = NULL;
+
+#define PY_NEW(T) \
+     (((PyTypeObject*)(T))->tp_new( \
+             (PyTypeObject*)(T), \
+             ((__PY_NEW_GLOBAL_EMPTY_TUPLE == NULL) ? \
+                 (__PY_NEW_GLOBAL_EMPTY_TUPLE = PyTuple_New(0)) : \
+                 (__PY_NEW_GLOBAL_EMPTY_TUPLE)), \
+             NULL))
+
 #define _isString(obj) PyObject_TypeCheck(obj, &PyBaseString_Type)

 #define _isElement(c_node) \

Modified: lxml/trunk/src/lxml/python.pxd
==============================================================================
--- lxml/trunk/src/lxml/python.pxd (original)
+++ lxml/trunk/src/lxml/python.pxd Sun May 6 11:10:59 2007
@@ -112,3 +112,4 @@
     cdef object repr(object obj)
     cdef object iter(object obj)
     cdef char* _cstr(object s)
+    cdef object PY_NEW(object t)

From scoder at codespeak.net Sun May 6 11:13:04 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 11:13:04 +0200 (CEST)
Subject: [Lxml-checkins] r42731 - lxml/trunk/src/lxml
Message-ID: <20070506091304.4167B807C@code0.codespeak.net>

Author: scoder
Date: Sun May 6 11:13:04 2007
New Revision: 42731

Modified:
   lxml/trunk/src/lxml/etree.pyx
Log:
cleanup

Modified: lxml/trunk/src/lxml/etree.pyx
==============================================================================
--- lxml/trunk/src/lxml/etree.pyx (original)
+++ lxml/trunk/src/lxml/etree.pyx Sun May 6 11:13:04 2007
@@ -1044,6 +1044,7 @@
     ELEMENT_CREATION_LOCK = NULL

 cdef extern from "etree_defs.h":
+    # macro call to 'tp_new()' for fast instantiation
     cdef _Element NEW_ELEMENT "PY_NEW" (object t)

 cdef _Element _elementFactory(_Document doc, xmlNode* c_node):

From scoder at codespeak.net Sun May 6 11:32:10 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 6 May 2007 11:32:10 +0200 (CEST)
Subject: [Lxml-checkins] r42736 - lxml/trunk/doc
Message-ID: <20070506093210.723A8807F@code0.codespeak.net>

Author: scoder
Date: Sun May 6 11:32:10 2007
New Revision: 42736

Modified:
   lxml/trunk/doc/performance.txt
Log:
new benchmark results after _Element instantiation speedup

Modified: lxml/trunk/doc/performance.txt
==============================================================================
--- lxml/trunk/doc/performance.txt (original)
+++ lxml/trunk/doc/performance.txt Sun May 6 11:32:10 2007
@@ -144,9 +144,9 @@
 (given in seconds)::

      lxe:   --     S-     U-     -A     SA     UA
-     T1: 0.1155 0.1154 0.1153 0.1159 0.1181 0.1158
-     T2: 0.1183 0.1197 0.1200 0.1267 0.1261 0.1264
-     T3: 0.0341 0.0312 0.0314 0.0726 0.0717 0.0720
+     T1: 0.1181 0.1080 0.1074 0.1088 0.1087 0.1099
+     T2: 0.1103 0.1109 0.1164 0.1241 0.1203 0.1231
+     T3: 0.0297 0.0309 0.0297 0.0716 0.0704 0.0703
      T4: 0.0005 0.0004 0.0004 0.0014 0.0014 0.0014
      cET:   --     S-     U-     -A     SA     UA
      T1: 0.0290 0.0271 0.0275 0.0297 0.0273 0.0274
@@ -169,18 +169,18 @@
 Where ET and cET can quickly create a shallow copy of their list of children,
 lxml has to create a Python object for each child and collect them in a list::

-  lxe: root_getchildren (--TR T2) 0.3500 msec/pass
+  lxe: root_getchildren (--TR T2) 0.1960 msec/pass
   cET: root_getchildren (--TR T2) 0.0150 msec/pass
   ET : root_getchildren (--TR T2) 0.0091 msec/pass

 When accessing single children, however, e.g. by index, this handicap is
 negligible::

-  lxe: first_child (--TR T2) 0.2499 msec/pass
+  lxe: first_child (--TR T2) 0.2289 msec/pass
   cET: first_child (--TR T2) 0.2048 msec/pass
   ET : first_child (--TR T2) 0.9291 msec/pass

-  lxe: last_child (--TR T1) 0.2511 msec/pass
+  lxe: last_child (--TR T1) 0.2310 msec/pass
   cET: last_child (--TR T1) 0.2148 msec/pass
   ET : last_child (--TR T1) 0.9191 msec/pass

@@ -188,11 +188,11 @@
 cET use Python lists here, which are based on arrays.
The data structure used by libxml2 is a linked tree, and thus, a linked list of children:: - lxe: middle_child (--TR T1) 0.2921 msec/pass + lxe: middle_child (--TR T1) 0.2759 msec/pass cET: middle_child (--TR T1) 0.2069 msec/pass ET : middle_child (--TR T1) 0.9291 msec/pass - lxe: middle_child (--TR T2) 1.9028 msec/pass + lxe: middle_child (--TR T2) 1.7111 msec/pass cET: middle_child (--TR T2) 0.2089 msec/pass ET : middle_child (--TR T2) 0.9360 msec/pass @@ -208,11 +208,11 @@ are supposed to end up in, either as SubElements of an Element or using the explicit ``Element.makeelement()`` call:: - lxe: makeelement (--TC T2) 2.5990 msec/pass + lxe: makeelement (--TC T2) 2.3680 msec/pass cET: makeelement (--TC T2) 0.3128 msec/pass ET : makeelement (--TC T2) 1.6940 msec/pass - lxe: create_subelements (--TC T2) 2.3072 msec/pass + lxe: create_subelements (--TC T2) 2.2051 msec/pass cET: create_subelements (--TC T2) 0.2370 msec/pass ET : create_subelements (--TC T2) 3.2189 msec/pass @@ -257,11 +257,11 @@ You should keep this difference in mind when you merge very large trees. 
On the other hand, deep copying a tree is fast in lxml:: - lxe: deepcopy (--TC T1) 10.6010 msec/pass + lxe: deepcopy (--TC T1) 10.5221 msec/pass cET: deepcopy (--TC T1) 220.2251 msec/pass ET : deepcopy (--TC T1) 463.7730 msec/pass - lxe: deepcopy (--TC T3) 8.2979 msec/pass + lxe: deepcopy (--TC T3) 8.2841 msec/pass cET: deepcopy (--TC T3) 53.8740 msec/pass ET : deepcopy (--TC T3) 118.2799 msec/pass @@ -277,33 +277,33 @@ especially if few elements are of interest or the element tag name is known, lxml is a good choice:: - lxe: getiterator_all (--TR T2) 10.3800 msec/pass + lxe: getiterator_all (--TR T2) 6.4790 msec/pass cET: getiterator_all (--TR T2) 28.2831 msec/pass ET : getiterator_all (--TR T2) 26.0720 msec/pass - lxe: getiterator_islice (--TR T2) 0.1140 msec/pass + lxe: getiterator_islice (--TR T2) 0.0892 msec/pass cET: getiterator_islice (--TR T2) 0.2460 msec/pass ET : getiterator_islice (--TR T2) 26.6550 msec/pass - lxe: getiterator_tag (--TR T2) 0.3879 msec/pass + lxe: getiterator_tag (--TR T2) 0.3850 msec/pass cET: getiterator_tag (--TR T2) 9.3720 msec/pass ET : getiterator_tag (--TR T2) 22.8221 msec/pass - lxe: getiterator_tag_all (--TR T2) 0.8819 msec/pass + lxe: getiterator_tag_all (--TR T2) 0.7222 msec/pass cET: getiterator_tag_all (--TR T2) 27.2939 msec/pass ET : getiterator_tag_all (--TR T2) 22.8271 msec/pass This similarly shows in ``Element.findall()``:: - lxe: findall (--TR T2) 10.9370 msec/pass + lxe: findall (--TR T2) 6.8321 msec/pass cET: findall (--TR T2) 28.8639 msec/pass ET : findall (--TR T2) 27.1060 msec/pass - lxe: findall (--TR T3) 2.1989 msec/pass + lxe: findall (--TR T3) 1.3590 msec/pass cET: findall (--TR T3) 8.9881 msec/pass ET : findall (--TR T3) 6.4890 msec/pass - lxe: findall_tag (--TR T2) 0.9520 msec/pass + lxe: findall_tag (--TR T2) 0.9229 msec/pass cET: findall_tag (--TR T2) 27.2651 msec/pass ET : findall_tag (--TR T2) 22.7208 msec/pass From scoder at codespeak.net Mon May 7 11:00:39 2007 From: scoder at codespeak.net (scoder at 
codespeak.net) Date: Mon, 7 May 2007 11:00:39 +0200 (CEST) Subject: [Lxml-checkins] r42773 - lxml/trunk/src/lxml Message-ID: <20070507090039.93BD28069@code0.codespeak.net> Author: scoder Date: Mon May 7 11:00:39 2007 New Revision: 42773 Modified: lxml/trunk/src/lxml/etree.pyx Log: comment Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Mon May 7 11:00:39 2007 @@ -1044,7 +1044,7 @@ ELEMENT_CREATION_LOCK = NULL cdef extern from "etree_defs.h": - # macro call to 'tp_new()' for fast instantiation + # macro call to 't->tp_new()' for fast instantiation cdef _Element NEW_ELEMENT "PY_NEW" (object t) cdef _Element _elementFactory(_Document doc, xmlNode* c_node): From scoder at codespeak.net Mon May 7 11:01:41 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 11:01:41 +0200 (CEST) Subject: [Lxml-checkins] r42774 - lxml/trunk/src/lxml Message-ID: <20070507090141.54EC68069@code0.codespeak.net> Author: scoder Date: Mon May 7 11:01:41 2007 New Revision: 42774 Modified: lxml/trunk/src/lxml/extensions.pxi Log: cleanup: use libxml2 API function Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Mon May 7 11:01:41 2007 @@ -306,18 +306,6 @@ ################################################################################ # EXSLT regexp implementation -cdef int _collect_tree_text(element, l) except -1: - # recursively collect all text (XPath 'string-value' of a node) - text = element.text - if text is not None: - python.PyList_Append(l, text) - for child in element: - _collect_tree_text(child, l) - tail = element.tail - if tail is not None: - python.PyList_Append(l, tail) - return 0 - cdef class _ExsltRegExp: cdef object _compile_map def 
__init__(self): @@ -334,9 +322,8 @@ if _isString(firstnode): return firstnode elif isinstance(firstnode, _Element): - l = [] - _collect_tree_text(firstnode, l) - return ''.join(l) + return funicode( + tree.xmlNodeGetContent((<_Element>firstnode)._c_node)) else: return str(firstnode) else: From scoder at codespeak.net Mon May 7 11:02:49 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 11:02:49 +0200 (CEST) Subject: [Lxml-checkins] r42775 - lxml/trunk/src/lxml Message-ID: <20070507090249.45AA88065@code0.codespeak.net> Author: scoder Date: Mon May 7 11:02:48 2007 New Revision: 42775 Modified: lxml/trunk/src/lxml/relaxng.pxi lxml/trunk/src/lxml/xmlschema.pxi Log: make clear when libxml2 bug was fixed Modified: lxml/trunk/src/lxml/relaxng.pxi ============================================================================== --- lxml/trunk/src/lxml/relaxng.pxi (original) +++ lxml/trunk/src/lxml/relaxng.pxi Mon May 7 11:02:48 2007 @@ -32,11 +32,12 @@ root_node = _rootNodeOrRaise(etree) c_node = root_node._c_node # work around for libxml2 bug if document is not RNG at all - c_href = _getNs(c_node) - if c_href is NULL or \ - cstd.strcmp(c_href, - 'http://relaxng.org/ns/structure/1.0') != 0: - raise RelaxNGParseError, "Document is not Relax NG" + if _LIBXML_VERSION_INT < 20624: + c_href = _getNs(c_node) + if c_href is NULL or \ + cstd.strcmp(c_href, + 'http://relaxng.org/ns/structure/1.0') != 0: + raise RelaxNGParseError, "Document is not Relax NG" fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) parser_ctxt = relaxng.xmlRelaxNGNewDocParserCtxt(fake_c_doc) elif file is not None: Modified: lxml/trunk/src/lxml/xmlschema.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlschema.pxi (original) +++ lxml/trunk/src/lxml/xmlschema.pxi Mon May 7 11:02:48 2007 @@ -30,11 +30,12 @@ root_node = _rootNodeOrRaise(etree) # work around for libxml2 bug if document is not XML schema at all - c_node 
= root_node._c_node - c_href = _getNs(c_node) - if c_href is NULL or \ - cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0: - raise XMLSchemaParseError, "Document is not XML Schema" + if _LIBXML_VERSION_INT < 20624: + c_node = root_node._c_node + c_href = _getNs(c_node) + if c_href is NULL or \ + cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0: + raise XMLSchemaParseError, "Document is not XML Schema" fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) parser_ctxt = xmlschema.xmlSchemaNewDocParserCtxt(fake_c_doc) From scoder at codespeak.net Mon May 7 14:09:34 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 14:09:34 +0200 (CEST) Subject: [Lxml-checkins] r42794 - lxml/trunk/doc Message-ID: <20070507120934.568BA8067@code0.codespeak.net> Author: scoder Date: Mon May 7 14:09:32 2007 New Revision: 42794 Modified: lxml/trunk/doc/xpathxslt.txt Log: section for getpath() Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Mon May 7 14:09:32 2007 @@ -10,9 +10,10 @@ 1 XPath 1.1 The ``xpath()`` method 1.2 XPath return values - 1.3 The ``XPath`` class - 1.4 The ``XPathEvaluator`` classes - 1.5 ``ETXPath`` + 1.3 Generating XPath expressions + 1.4 The ``XPath`` class + 1.5 The ``XPathEvaluator`` classes + 1.6 ``ETXPath`` 2 XSLT 2.1 XSLT result objects 2.2 Stylesheet parameters @@ -141,8 +142,12 @@ (the text node content or attribute value). Namespace declarations are returned as tuples of strings: ``(prefix, URI)``. 
-A related convenience method of ElementTree objects is ``getpath(element)``, -which returns a structural, absolute XPath expression to find that element:: + +Generating XPath expressions +---------------------------- + +A convenience method of ElementTree objects is ``getpath(element)``, which +returns a structural, absolute XPath expression to find that element:: >>> a = etree.Element("a") >>> b = etree.SubElement(a, "b") From scoder at codespeak.net Mon May 7 14:14:38 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 14:14:38 +0200 (CEST) Subject: [Lxml-checkins] r42795 - lxml/trunk/doc Message-ID: <20070507121438.4E9378067@code0.codespeak.net> Author: scoder Date: Mon May 7 14:14:36 2007 New Revision: 42795 Modified: lxml/trunk/doc/xpathxslt.txt Log: cleanup Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Mon May 7 14:14:36 2007 @@ -146,8 +146,8 @@ Generating XPath expressions ---------------------------- -A convenience method of ElementTree objects is ``getpath(element)``, which -returns a structural, absolute XPath expression to find that element:: +ElementTree objects have a method ``getpath(element)``, which returns a +structural, absolute XPath expression to find that element:: >>> a = etree.Element("a") >>> b = etree.SubElement(a, "b") From scoder at codespeak.net Mon May 7 20:26:25 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 20:26:25 +0200 (CEST) Subject: [Lxml-checkins] r42837 - lxml/trunk/doc Message-ID: <20070507182625.ADA278065@code0.codespeak.net> Author: scoder Date: Mon May 7 20:26:25 2007 New Revision: 42837 Modified: lxml/trunk/doc/performance.txt Log: benchmarks and optimisation example Modified: lxml/trunk/doc/performance.txt ============================================================================== --- 
lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Mon May 7 20:26:25 2007 @@ -6,35 +6,57 @@ enough for most applications, so lxml is probably somewhere between 'fast enough' and 'the best choice' for yours. -This text describes where lxml.etree (lxe) excels, gives hints on some -performance traps and compares the overall performance to the original -ElementTree_ (ET) and cElementTree_ (cET) libraries by Fredrik Lundh. The -cElementTree library is a fast C-implementation of the original ElementTree. +This text describes where lxml.etree (abbreviated to 'lxe') excels, gives +hints on some performance traps and compares the overall performance to the +original ElementTree_ (ET) and cElementTree_ (cET) libraries by Fredrik Lundh. +The cElementTree library is a fast C-implementation of the original +ElementTree. .. _ElementTree: http://effbot.org/zone/element-index.htm .. _cElementTree: http://effbot.org/zone/celementtree.htm +.. contents:: +.. + 1 How to read the timings + 2 Bad things first + 3 Parsing and Serialising + 4 The ElementTree API + 5 Tree traversal + 6 XPath + 7 lxml.objectify + + +How to read the timings +----------------------- + The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with -the lxml source distribution. The timings cited below compare lxml 1.3 (with -libxml2 2.6.26) to the ElementTree and cElementTree versions shipped with -CPython 2.5 (based on ElementTree 1.2.6). They were run single-threaded on a -1.8GHz Intel Core Duo machine. +the lxml source distribution. They are distributed under the same BSD license +as lxml itself, and the lxml project would like to promote them as a general +benchmarking suite for all ElementTree implementations. New benchmarks are +very easy to add as tiny test methods, so if you write a performance test for +a specific part of the API yourself, please consider sending it to the lxml +mailing list. 
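As a rough illustration of what such a tiny test method could look like, here is a minimal sketch of a micro-benchmark. It does not follow the actual format of ``bench_etree.py``. It uses the standard library's ``xml.etree.ElementTree`` and ``timeit`` as stand-ins, and the helper names ``build_tree``, ``bench_first_child`` and ``msec_per_pass`` are hypothetical.

```python
import timeit
import xml.etree.ElementTree as ET


def build_tree(children=100):
    # Build a shallow tree: one root element with many direct children.
    root = ET.Element("root")
    for i in range(children):
        ET.SubElement(root, "child").text = str(i)
    return root


def bench_first_child(root):
    # The operation under test: index access to a single child.
    return root[0]


def msec_per_pass(func, arg, repeat=3, number=1000):
    # Best-of-N timing; one "pass" here means `number` calls of func(arg).
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) * 1000.0


root = build_tree()
elapsed = msec_per_pass(bench_first_child, root)
print("ET : first_child %.4f msec/pass" % elapsed)
```

Taking the minimum over several repetitions reduces noise from other processes. The absolute numbers depend on the loop count and are not comparable to the timings cited in this text.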
+ +The timings cited below compare lxml 1.3 (with libxml2 2.6.27) to the +ElementTree and cElementTree versions shipped with CPython 2.5 (based on +ElementTree 1.2.6). They were run single-threaded on a 1.8GHz Intel Core Duo +machine under Ubuntu Linux 7.04 (Feisty). .. _`bench_etree.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_etree.py .. _`bench_xpath.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_xpath.py .. _`bench_objectify.py`: http://codespeak.net/svn/lxml/branch/lxml-1.3/benchmark/bench_objectify.py The scripts run a number of simple tests on the different libraries, using -different XML tree configurations: different tree sizes, with or without -attributes (-/A), with or without ASCII or unicode text (-/S/U), and either -against a tree or its serialised form (T/X). In the result extracts cited -below, T1 refers to a 3-level tree with many children at the third level, T2 -is swapped around to have many children below the root element, T3 is a deep -tree with few children at each level and T4 is a small tree, slightly broader -than deep. If repetition is involved, this usually means running the -benchmark in a loop over all children of the tree root, otherwise, the -operation is run on the root node (C/R). +different XML tree configurations: different tree sizes (T1-4), with or +without attributes (-/A), with or without ASCII string or unicode text +(-/S/U), and either against a tree or its serialised XML form (T/X). In the +result extracts cited below, T1 refers to a 3-level tree with many children at +the third level, T2 is swapped around to have many children below the root +element, T3 is a deep tree with few children at each level and T4 is a small +tree, slightly broader than deep. If repetition is involved, this usually +means running the benchmark in a loop over all children of the tree root, +otherwise, the operation is run on the root node (C/R). 
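The character codes used in the result extracts can be decoded mechanically. The following is a minimal sketch, not part of the benchmark scripts: the ``decode`` helper is hypothetical, and the position of each flag is an assumption read off codes such as ``(SATR T1)`` and ``(--TC T2)`` that appear in the timings.

```python
# Flag meanings as described in the text; the position of each flag is an
# assumption inferred from codes such as "(SATR T1)" and "(--TC T2)".
TEXT = {"-": "no text", "S": "plain string text", "U": "unicode text"}
ATTRIBUTES = {"-": "no attributes", "A": "attributes"}
INPUT = {"T": "tree", "X": "serialised XML"}
TARGET = {"R": "root node", "C": "loop over the root's children"}


def decode(code):
    """Split a code like '(SATR T1)' into its documented parts."""
    flags, tree = code.strip("()").split()
    text, attributes, source, target = flags
    return {
        "tree": tree,
        "text": TEXT[text],
        "attributes": ATTRIBUTES[attributes],
        "input": INPUT[source],
        "target": TARGET[target],
    }


print(decode("(SATR T1)"))
```

An unknown flag raises ``KeyError``, which is probably the desired behaviour for a lookup table that is meant to mirror the documented legend exactly.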
As an example, the character code ``(SATR T1)`` states that the benchmark was running for tree T1, with plain string text (S) and attributes (A). It was @@ -44,27 +66,29 @@ measurable. It is therefore not always possible to compare the absolute timings of, say, a single access benchmark (which usually loops) and a 'get all in one step' benchmark, which already takes enough time to be measurable -and is therefore measured as is. Take a look at the concrete benchmarks in -the scripts to understand how the numbers compare. - -.. contents:: -.. - 1 Bad things first - 2 Parsing and Serialising - 3 The ElementTree API - 4 Tree traversal - 5 XPath - 6 lxml.objectify +and is therefore measured as is. An example is the index access to a single +child, which cannot be compared to the timings for ``getchildren()``. Take a +look at the concrete benchmarks in the scripts to understand how the numbers +compare. -Bad things first ----------------- +General notes +------------- First thing to say: there *is* an overhead involved in having a DOM-like C library mimic the ElementTree API. As opposed to ElementTree, lxml has to -generate Python objects on the fly when asked for them. What this means is: -the more of your code runs in Python, the slower your application gets. Note, -however, that this is true for most performance critical Python applications. +generate Python representations of tree nodes on the fly when asked for them, +and the internal tree structure of libxml2 results in a higher maintenance +overhead than the simpler top-down structure of ElementTree. What this means +is: the more of your code runs in Python, the less you can benefit from the +speed of lxml and libxml2. Note, however, that this is true for most +performance critical Python applications. No one would implement complex +matrix calculations in pure Python when you can use Numeric. 
+ +The up side then is that lxml provides powerful tools like tree iterators, +XPath and XSLT, that can handle complex operations at the speed of C. Their +pythonic API in lxml makes them so flexible that most applications can easily +benefit from them. Parsing and Serialising @@ -111,26 +135,32 @@ ET : parse_stringIO (UAXR T3) 163.5361 msec/pass The expat parser allows cET to be up to 80% faster than lxml on plain parser -performance. Similar timings can be observer for the ``iterparse()`` -function. However, if you take a complete serialize-parse cycle, the numbers +performance. Similar timings can be observed for the ``iterparse()`` +function. However, if you take a complete input-output cycle, the numbers will look similar to these:: - lxe: write_utf8_parse_stringIO (S-TR T1) 316.6230 msec/pass - cET: write_utf8_parse_stringIO (S-TR T1) 592.1209 msec/pass - ET : write_utf8_parse_stringIO (S-TR T1) 817.9121 msec/pass - - lxe: write_utf8_parse_stringIO (UATR T3) 49.9680 msec/pass - cET: write_utf8_parse_stringIO (UATR T3) 434.6111 msec/pass - ET : write_utf8_parse_stringIO (UATR T3) 574.1441 msec/pass - - lxe: write_utf8_parse_stringIO (SATR T4) 1.2789 msec/pass - cET: write_utf8_parse_stringIO (SATR T4) 12.2640 msec/pass - ET : write_utf8_parse_stringIO (SATR T4) 15.6620 msec/pass + lxe: write_utf8_parse_stringIO (S-TR T1) 166.3210 msec/pass + cET: write_utf8_parse_stringIO (S-TR T1) 581.2099 msec/pass + ET : write_utf8_parse_stringIO (S-TR T1) 803.5331 msec/pass + + lxe: write_utf8_parse_stringIO (UATR T2) 184.4249 msec/pass + cET: write_utf8_parse_stringIO (UATR T2) 671.5119 msec/pass + ET : write_utf8_parse_stringIO (UATR T2) 924.3481 msec/pass + + lxe: write_utf8_parse_stringIO (S-TR T3) 9.1329 msec/pass + cET: write_utf8_parse_stringIO (S-TR T3) 77.9850 msec/pass + ET : write_utf8_parse_stringIO (S-TR T3) 157.0492 msec/pass + + lxe: write_utf8_parse_stringIO (SATR T4) 1.3900 msec/pass + cET: write_utf8_parse_stringIO (SATR T4) 12.6081 msec/pass + ET : 
write_utf8_parse_stringIO (SATR T4) 16.2580 msec/pass For applications that require a high parser throughput and do little serialization, cET is the best choice. Also for iterparse applications that extract small amounts of data from large XML data sets. If it comes to -round-trip performance, however, lxml tends to be 3-4 times faster in total. +round-trip performance, however, lxml tends to be 3-4 times faster in +total. So, whenever the input documents are not considerably bigger than the +output, lxml is the clear winner. The ElementTree API @@ -261,7 +291,7 @@ cET: deepcopy (--TC T1) 220.2251 msec/pass ET : deepcopy (--TC T1) 463.7730 msec/pass - lxe: deepcopy (--TC T3) 8.2841 msec/pass + lxe: deepcopy (--TC T3) 4.2651 msec/pass cET: deepcopy (--TC T3) 53.8740 msec/pass ET : deepcopy (--TC T3) 118.2799 msec/pass @@ -359,6 +389,115 @@ lxe: xpath_class_repeat (--TC T4) 1.0269 msec/pass +An bigger example +----------------- + +A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would +read in a 3 MB XML version of the Old Testament of the Bible and look for the +text "begat" in all verses. Apparently, it is contained in 120 of them. This +is easy to implement in ElementTree using ``findall()``. However, the fastest +way to do this is obviously ``iterparse()``, as most of the data is not of any +interest. + +.. _`xml.org`: http://xml.org/... + +Now, Uche's original proposal was:: + + def bench_ET(): + tree = ElementTree.parse("ot.xml") + result = [] + for v in tree.findall("//v"): + text = v.text + if 'begat' in text: + result.append(text) + return len(result) + +which takes about one second on my machine today. The faster ``iterparse()`` +variant looks like this:: + + def bench_ET_iterparse(): + result = [] + for event, v in ElementTree.iterparse("ot.xml"): + if v.tag == 'v': + text = v.text + if 'begat' in text: + result.append(text) + v.clear() + return len(result) + +The improvement is about 10%. 
At the time I first tried (early 2006), lxml +didn't have ``iterparse()`` support, but the ``findall()`` variant was already +faster than ElementTree. This changes immediately when you switch to +cElementTree. The latter only needs 0.17 seconds to do the trick today and +only some impressive 0.10 seconds when running the iterparse version. And +even back then, it was quite a bit faster than what lxml could achieve. + +Since then, lxml has matured a lot and has gotten much faster. The iterparse +variant now runs in 0.14 seconds, and if you remove the ``v.clear()``, it is +even a little faster (which isn't the case for cElementTree). When you move +the whole thing to a pure XPath implementation, it will look like this:: + + def bench_lxml_xpath_all(): + tree = etree.parse("ot.xml") + result = tree.xpath("//v[contains(., 'begat')]/text()") + return len(result) + +This runs in about 0.13 seconds and is about the shortest possible +implementation (in lines of Python code) that I could come up with. Now, this +is already a rather complex XPath expression compared to the simple "//v" +ElementPath expression we started with. Since this is also valid XPath, let's +try this instead:: + + def bench_lxml_xpath(): + tree = etree.parse("ot.xml") + result = [] + for v in tree.xpath("//v"): + text = v.text + if 'begat' in text: + result.append(text) + return len(result) + +This gets us down to 0.12 seconds. However, since this is not much different +from the original findall variant, we can remove the complexity of the XPath +call completely and just go with what we had in the beginning. Under lxml, +this runs in the same 0.12 seconds. + +But there is one thing left to try. 
We can replace the simple ElementPath +expression with a native tree iterator:: + + def bench_lxml_getiterator(): + tree = etree.parse("ot.xml") + result = [] + for v in tree.getiterator("v"): + text = v.text + if 'begat' in text: + result.append(text) + return len(result) + +This implements the same thing, just without the overhead of parsing and +evaluating a path expression. And this makes it another bit faster, down to +0.11 seconds. For comparison, cElementTree runs this version in 0.17 seconds. + +So, what have we learned? + +* It's important to know the available options - and it's worth starting with + the most simple one. In this case, a programmer would then probably have + started with ``getiterator("v")`` or ``iterparse()``. Either of them would + already have been the most efficient, depending on which library is used. + +* It's not always worth optimising. After all that hassle we got from 0.12 + seconds for the initial implementation to 0.11 seconds. Switching over to + cElementTree and writing an ``iterparse()`` based version would have given + us 0.10 seconds - not a big difference for 3MB of XML. + +* Take care what operation is really dominating in your use case. Here, lxml + is little slower than cElementTree on ``parse()`` (both about 0.06 seconds), + but more visibly slower on ``iterparse()``: 0.07 versus 0.10 seconds. + However, tree iteration in lxml is increadibly fast, so it can be faster to + parse the whole tree and then iterate over it rather than using + ``iterparse()`` to do both in one step. + + lxml.objectify -------------- @@ -439,9 +578,10 @@ Here are some more things to try if optimisation is required: * A lot of time is usually spent in tree traversal to find the addressed - elements in the tree. If you often work in subtrees, assign the parent of - the subtree to a variable or pass it into functions instead of starting at - the root. This allows accessing its descendents more directly. + elements in the tree. 
If you often work in subtrees, do what you would also + do with deep Python objects: assign the parent of the subtree to a variable + or pass it into functions instead of starting at the root. This allows + accessing its descendents more directly. * Try assigning data values directly to attributes instead of passing them through DataElement. From scoder at codespeak.net Mon May 7 22:17:04 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 22:17:04 +0200 (CEST) Subject: [Lxml-checkins] r42838 - lxml/trunk/doc Message-ID: <20070507201704.58B51806D@code0.codespeak.net> Author: scoder Date: Mon May 7 22:17:02 2007 New Revision: 42838 Modified: lxml/trunk/doc/validation.txt Log: clarifications in validation docs Modified: lxml/trunk/doc/validation.txt ============================================================================== --- lxml/trunk/doc/validation.txt (original) +++ lxml/trunk/doc/validation.txt Mon May 7 22:17:02 2007 @@ -13,7 +13,8 @@ There is also initial support for Schematron_. However, it is currently disabled in lxml builds due to insufficiencies in the implementation as of -libxml2 2.6.27. +libxml2 2.6.27. To enable it when you build from sources, you currently have +to uncomment the include line at the end of the file ``src/lxml/etree.pyx``. .. _Schematron: http://www.ascc.net/xml/schematron @@ -84,6 +85,10 @@ >>> relaxng_doc = etree.parse(f) >>> relaxng = etree.RelaxNG(relaxng_doc) +Alternatively, pass a filename to the ``file`` keyword argument to parse from +a file. This also enables correct handling of include files from within the +RelaxNG parser. + You can then validate some ElementTree document against the schema. You'll get back True if the document is valid against the Relax NG schema, and False if not:: @@ -130,7 +135,7 @@ You can see that the error (ERROR) happened during RelaxNG validation (RELAXNGV). The message then tells you what went wrong. Note that this error -is local to the RelaxNG object. 
It will only contain log entries that +log is local to the RelaxNG object. It will only contain log entries that appeared during the validation. The DocumentInvalid exception raised by the ``assertValid`` method above provides access to the global error log (like all other lxml exceptions). @@ -147,10 +152,9 @@ XMLSchema --------- -lxml.etree also has a XML Schema (XSD) support, using the class -lxml.etree.XMLSchema. This support is very similar to the Relax NG -support. The class can be given an ElementTree object to construct a -XMLSchema validator:: +lxml.etree also has XML Schema (XSD) support, using the class +lxml.etree.XMLSchema. The API is very similar to the Relax NG and DTD +classes. Pass an ElementTree object to construct a XMLSchema validator:: >>> f = StringIO('''\ ... @@ -165,9 +169,9 @@ >>> xmlschema_doc = etree.parse(f) >>> xmlschema = etree.XMLSchema(xmlschema_doc) -You can then validate some ElementTree document with this. Like with -RelaxNG, you'll get back true if the document is valid against the XML -schema, and false if not:: +You can then validate some ElementTree document with this. Like with RelaxNG, +you'll get back true if the document is valid against the XML schema, and +false if not:: >>> valid = StringIO('') >>> doc = etree.parse(valid) @@ -179,8 +183,8 @@ >>> xmlschema.validate(doc2) 0 -Calling the schema object has the same effect as calling its validate -method. This is sometimes used in conditional statements:: +Calling the schema object has the same effect as calling its validate method. +This is sometimes used in conditional statements:: >>> invalid = StringIO('') >>> doc2 = etree.parse(invalid) @@ -201,7 +205,7 @@ [...] 
AssertionError: Document does not comply with schema -Error reporting works like for the RelaxNG class:: +Error reporting works as for the RelaxNG class:: >>> log = xmlschema.error_log >>> error = log.last_error From scoder at codespeak.net Mon May 7 23:35:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 7 May 2007 23:35:00 +0200 (CEST) Subject: [Lxml-checkins] r42840 - in lxml/trunk: doc src/lxml Message-ID: <20070507213500.8C8B4807C@code0.codespeak.net> Author: scoder Date: Mon May 7 23:35:00 2007 New Revision: 42840 Modified: lxml/trunk/doc/parsing.txt lxml/trunk/doc/validation.txt lxml/trunk/src/lxml/parser.pxi Log: clarifications in parser docs Modified: lxml/trunk/doc/parsing.txt ============================================================================== --- lxml/trunk/doc/parsing.txt (original) +++ lxml/trunk/doc/parsing.txt Mon May 7 23:35:00 2007 @@ -18,26 +18,56 @@ Parsers -------- +======= Parsers are represented by parser objects. There is support for parsing both -XML and (broken) HTML (note that XHTML is best parsed as XML). Both are based -on libxml2 and therefore only support options that are backed by the library. -Parsers take a number of keyword arguments. The following is an example for -namespace cleanup during parsing, first with the default parser, then with a -parametrized one:: +XML and (broken) HTML. Note that XHTML is best parsed as XML, parsing it with +the HTML parser can lead to unexpected results. Here is a simple example for +XML parsing:: >>> xml = '' - >>> et = etree.parse(StringIO(xml)) + >>> et = etree.parse(StringIO(xml)) >>> print etree.tostring(et.getroot()) + +Parser options +-------------- + +The parsers accept a number of setup options as keyword arguments. 
The above +example is easily extended to clean up namespaces during parsing:: + >>> parser = etree.XMLParser(ns_clean=True) >>> et = etree.parse(StringIO(xml), parser) >>> print etree.tostring(et.getroot()) +The keyword arguments in the constructor are mainly based on the libxml2 +parser configuration. A DTD will also be loaded if validation or attribute +default values are requested. + +Available boolean keyword arguments: + +* attribute_defaults - read the DTD (if referenced by the document) and add + the default attributes from it + +* dtd_validation - validate while parsing (if a DTD was referenced) + +* load_dtd - load and parse the DTD while parsing (no validation is performed) + +* no_network - prevent network access when looking up external documents + +* ns_clean - try to clean up redundant namespace declarations + +* recover - try hard to parse through broken XML + +* remove_blank_text - discard blank text nodes between tags + + +Parsing HTML +------------ + HTML parsing is similarly simple. The parsers have a ``recover`` keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return something usable without raising an exception. You should use libxml2 @@ -48,15 +78,29 @@ >>> parser = etree.HTMLParser() >>> et = etree.parse(StringIO(broken_html), parser) - >>> print etree.tostring(et.getroot()) - test

page title

+ >>> print etree.tostring(et.getroot(), pretty_print=True) + + + test + + +

page title

+ + Lxml has an HTML function, similar to the XML shortcut known from ElementTree:: >>> html = etree.HTML(broken_html) - >>> print etree.tostring(html) - test

page title

+ >>> print etree.tostring(html, pretty_print=True) + + + test + + +

page title

+ + The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is *not* the fault of lxml if you find documents that are so @@ -66,6 +110,10 @@ parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems. + +Doctype information +------------------- + The use of the libxml2 parsers makes some additional information available at the API level. Currently, ElementTree objects can access the DOCTYPE information provided by a parsed document, as well as the XML version and the @@ -93,7 +141,7 @@ iterparse and iterwalk ----------------------- +====================== As known from ElementTree, the ``iterparse()`` utility function returns an iterator that generates parser events for an XML file (or file-like object), @@ -125,7 +173,7 @@ >>> context.root.tag 'root' -The other types can be activated with the ``events`` keyword argument:: +The other event types can be activated with the ``events`` keyword argument:: >>> events = ("start", "end") >>> context = etree.iterparse(StringIO(xml), events=events) @@ -140,6 +188,32 @@ end {testns}empty-element end root + +Selective tag events +-------------------- + +As an extension over ElementTree, lxml.etree accepts a ``tag`` keyword +argument just like ``element.getiterator(tag)``. This restricts events to a +specific tag or namespace:: + + >>> context = etree.iterparse(StringIO(xml), tag="element") + >>> for action, elem in context: + ... print action, elem.tag + end element + end element + + >>> events = ("start", "end") + >>> context = etree.iterparse( + ... StringIO(xml), events=events, tag="{testns}*") + >>> for action, elem in context: + ... print action, elem.tag + start {testns}empty-element + end {testns}empty-element + + +Modifying the tree +------------------ + You can modify the element and its descendants when handling the 'end' event. To save memory, for example, you can remove subtrees that are no longer needed:: @@ -170,11 +244,12 @@ ... 
if element.getprevious(): # clean up preceding siblings ... del element.getparent()[0] -You can use ``while`` instead of ``if`` if you skipped siblings using the -``tag`` keyword argument. The more selective your tag is, however, the more -thought you will have to put into finding the right way to clean up the -elements that were skipped. Therefore, it is sometimes easier to traverse all -elements and do the tag selection by hand in the event handler code. +You can use ``while`` instead of the ``if`` to delete multiple siblings in a +row if you skipped over them using the ``tag`` keyword argument. The more +selective your tag is, however, the more thought you will have to put into +finding the right way to clean up the elements that were skipped. Therefore, +it is sometimes easier to traverse all elements and do the tag selection by +hand in the event handler code. The 'start-ns' and 'end-ns' events notify about namespace declarations and generate tuples ``(prefix, URI)``:: @@ -189,28 +264,16 @@ It is common practice to use a list as namespace stack and pop the last entry on the 'end-ns' event. -lxml.etree supports two extensions compared to ElementTree. It accepts a -``tag`` keyword argument just like ``element.getiterator(tag)``. This -restricts events to a specific tag or namespace. - >>> context = etree.iterparse(StringIO(xml), tag="element") - >>> for action, elem in context: - ... print action, elem.tag - end element - end element +iterwalk +-------- - >>> events = ("start", "end") - >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*") - >>> for action, elem in context: - ... print action, elem.tag - start {testns}empty-element - end {testns}empty-element - -The second extension is the ``iterwalk()`` function. It behaves exactly like -``iterparse()``, but works on Elements and ElementTrees:: +A second extension over ElementTree is the ``iterwalk()`` function. 
It +behaves exactly like ``iterparse()``, but works on Elements and ElementTrees:: - >>> root = context.root - >>> context = etree.iterwalk(root, events=events, tag="element") + >>> root = etree.XML(xml) + >>> context = etree.iterwalk( + ... root, events=("start", "end"), tag="element") >>> for action, elem in context: ... print action, elem.tag start element @@ -220,7 +283,7 @@ Python unicode strings ----------------------- +====================== lxml.etree has broader support for Python unicode strings than the ElementTree library. First of all, where ElementTree would raise an exception, the @@ -246,6 +309,10 @@ should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone. + +Serialising to Unicode strings +------------------------------ + To serialize the result, you would normally use the ``tostring`` module function, which serializes to plain ASCII by default or a number of other encodings if asked for:: Modified: lxml/trunk/doc/validation.txt ============================================================================== --- lxml/trunk/doc/validation.txt (original) +++ lxml/trunk/doc/validation.txt Mon May 7 23:35:00 2007 @@ -4,7 +4,7 @@ Apart from the built-in DTD support in parsers, lxml currently supports three schema languages: DTD_, `Relax NG`_ and `XML Schema`_. All three provide -identical APIs in lxml, represented by a validator class with the obvious +identical APIs in lxml, represented by validator classes with the obvious names. .. 
_DTD: http://en.wikipedia.org/wiki/Document_Type_Definition Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Mon May 7 23:35:00 2007 @@ -664,14 +664,9 @@ * recover - try hard to parse through broken XML * remove_blank_text - discard blank text nodes - For read-only documents that will not be altered after parsing, you can - also pass the following keyword arguments: - * compact - compactly store short element text content - - Note that you should avoid sharing parsers between threads. This does not + Note that you should avoid sharing parsers between threads. While this is + not harmful, it is more efficient to use separate parsers. This does not apply to the default parser. - - You must not modify documents that were parsed with the 'compact' option. """ def __init__(self, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=False, ns_clean=False, @@ -794,12 +789,8 @@ * no_network - prevent network access * remove_blank_text - discard empty text nodes - For read-only documents that will not be altered after parsing, you can - also pass the following keyword arguments: - * compact - compactly store short element text content - - Note that you should avoid sharing parsers between threads. You must not - modify documents that were parsed with the 'compact' option. + Note that you should avoid sharing parsers between threads for performance + reasons.
""" def __init__(self, recover=True, no_network=False, remove_blank_text=False, compact=True): From scoder at codespeak.net Tue May 8 11:57:16 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 11:57:16 +0200 (CEST) Subject: [Lxml-checkins] r42843 - lxml/trunk/doc Message-ID: <20070508095716.8A03C8065@code0.codespeak.net> Author: scoder Date: Tue May 8 11:57:14 2007 New Revision: 42843 Modified: lxml/trunk/doc/parsing.txt Log: short comparison code for iterwalk/iterparse Modified: lxml/trunk/doc/parsing.txt ============================================================================== --- lxml/trunk/doc/parsing.txt (original) +++ lxml/trunk/doc/parsing.txt Tue May 8 11:57:14 2007 @@ -271,6 +271,7 @@ A second extension over ElementTree is the ``iterwalk()`` function. It behaves exactly like ``iterparse()``, but works on Elements and ElementTrees:: + >>> root = etree.XML(xml) >>> context = etree.iterwalk( ... root, events=("start", "end"), tag="element") @@ -281,6 +282,17 @@ start element end element + >>> f = StringIO(xml) + >>> context = etree.iterparse( + ... f, events=("start", "end"), tag="element") + + >>> for action, elem in context: + ... 
print action, elem.tag + start element + end element + start element + end element + Python unicode strings ====================== From scoder at codespeak.net Tue May 8 13:43:59 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 13:43:59 +0200 (CEST) Subject: [Lxml-checkins] r42846 - lxml/trunk/doc Message-ID: <20070508114359.12E6D8069@code0.codespeak.net> Author: scoder Date: Tue May 8 13:43:58 2007 New Revision: 42846 Modified: lxml/trunk/doc/performance.txt Log: numpy Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 13:43:58 2007 @@ -83,7 +83,7 @@ is: the more of your code runs in Python, the less you can benefit from the speed of lxml and libxml2. Note, however, that this is true for most performance critical Python applications. No one would implement complex -matrix calculations in pure Python when you can use Numeric. +matrix calculations in pure Python when you can use NumPy. The up side then is that lxml provides powerful tools like tree iterators, XPath and XSLT, that can handle complex operations at the speed of C. Their From scoder at codespeak.net Tue May 8 14:20:13 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 14:20:13 +0200 (CEST) Subject: [Lxml-checkins] r42849 - lxml/trunk/doc Message-ID: <20070508122013.5B9E38069@code0.codespeak.net> Author: scoder Date: Tue May 8 14:20:12 2007 New Revision: 42849 Modified: lxml/trunk/doc/performance.txt Log: docs Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 14:20:12 2007 @@ -82,8 +82,8 @@ overhead than the simpler top-down structure of ElementTree. 
What this means is: the more of your code runs in Python, the less you can benefit from the speed of lxml and libxml2. Note, however, that this is true for most -performance critical Python applications. No one would implement complex -matrix calculations in pure Python when you can use NumPy. +performance critical Python applications. No one would implement Fourier +transformations in pure Python when you can use NumPy. The up side then is that lxml provides powerful tools like tree iterators, XPath and XSLT, that can handle complex operations at the speed of C. Their @@ -480,6 +480,11 @@ So, what have we learned? +* Python code is not slow. The pure XPath solution was not even as fast as + the first shot Python implementation. In general, a few more lines in + Python make things more readable, which is much more important than the last + 5% of performance. + * It's important to know the available options - and it's worth starting with the most simple one. In this case, a programmer would then probably have started with ``getiterator("v")`` or ``iterparse()``.
Either of them would From scoder at codespeak.net Tue May 8 14:23:09 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 8 May 2007 14:23:09 +0200 (CEST) Subject: [Lxml-checkins] r42850 - lxml/trunk/doc Message-ID: <20070508122309.B5FA38069@code0.codespeak.net> Author: scoder Date: Tue May 8 14:23:09 2007 New Revision: 42850 Modified: lxml/trunk/doc/performance.txt Log: cleanup Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Tue May 8 14:23:09 2007 @@ -1,3 +1,4 @@ +==================== Benchmarks and Speed ==================== @@ -27,7 +28,7 @@ How to read the timings ------------------------ +======================= The statements made here are backed by the (micro-)benchmark scripts `bench_etree.py`_, `bench_xpath.py`_ and `bench_objectify.py`_ that come with @@ -73,7 +74,7 @@ General notes -------------- +============= First thing to say: there *is* an overhead involved in having a DOM-like C library mimic the ElementTree API. As opposed to ElementTree, lxml has to @@ -92,7 +93,7 @@ Parsing and Serialising ------------------------ +======================= These are areas where lxml excels. The reason is that both parts are executed entirely at the C level, without major interaction with Python code. The @@ -164,7 +165,7 @@ The ElementTree API -------------------- +=================== Since all three libraries implement the same API, their performance is easy to compare in this area. 
A major disadvantage for lxml's performance is the @@ -390,7 +391,7 @@ An bigger example ------------------ +================= A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would read in a 3 MB XML version of the Old Testament of the Bible and look for the @@ -504,7 +505,7 @@ lxml.objectify --------------- +============== The following timings are based on the benchmark script `bench_objectify.py`_. From scoder at codespeak.net Wed May 9 00:08:43 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 00:08:43 +0200 (CEST) Subject: [Lxml-checkins] r42886 - lxml/trunk/doc Message-ID: <20070508220843.C2A15806D@code0.codespeak.net> Author: scoder Date: Wed May 9 00:08:42 2007 New Revision: 42886 Modified: lxml/trunk/doc/performance.txt Log: doc restructuring Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 00:08:42 2007 @@ -27,6 +27,25 @@ 7 lxml.objectify +General notes +============= + +First thing to say: there *is* an overhead involved in having a DOM-like C +library mimic the ElementTree API. As opposed to ElementTree, lxml has to +generate Python representations of tree nodes on the fly when asked for them, +and the internal tree structure of libxml2 results in a higher maintenance +overhead than the simpler top-down structure of ElementTree. What this means +is: the more of your code runs in Python, the less you can benefit from the +speed of lxml and libxml2. Note, however, that this is true for most +performance critical Python applications. No one would implement Fourier +transformations in pure Python when you can use NumPy. + +The up side then is that lxml provides powerful tools like tree iterators, +XPath and XSLT, that can handle complex operations at the speed of C.
Their +pythonic API in lxml makes them so flexible that most applications can easily +benefit from them. + + How to read the timings ======================= @@ -73,25 +92,6 @@ compare. -General notes -============= - -First thing to say: there *is* an overhead involved in having a DOM-like C -library mimic the ElementTree API. As opposed to ElementTree, lxml has to -generate Python representations of tree nodes on the fly when asked for them, -and the internal tree structure of libxml2 results in a higher maintenance -overhead than the simpler top-down structure of ElementTree. What this means -is: the more of your code runs in Python, the less you can benefit from the -speed of lxml and libxml2. Note, however, that this is true for most -performance critical Python applications. No one would implement Fourier -transformations in pure Python when you can use NumPy. - -The up side then is that lxml provides powerful tools like tree iterators, -XPath and XSLT, that can handle complex operations at the speed of C. Their -pythonic API in lxml makes them so flexible that most applications can easily -benefit from them. - - Parsing and Serialising ======================= From scoder at codespeak.net Wed May 9 00:15:45 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 00:15:45 +0200 (CEST) Subject: [Lxml-checkins] r42887 - lxml/trunk/doc Message-ID: <20070508221545.099E1806D@code0.codespeak.net> Author: scoder Date: Wed May 9 00:15:45 2007 New Revision: 42887 Modified: lxml/trunk/doc/performance.txt Log: doc restructuring Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 00:15:45 2007 @@ -196,6 +196,10 @@ are no longer referenced. ET and cET represent the tree itself through these objects, which reduces the overhead in creating them.
+ +Child access +------------ + The same reason makes operations like ``getchildren()`` more costly in lxml. Where ET and cET can quickly create a shallow copy of their list of children, lxml has to create a Python object for each child and collect them in a list:: @@ -227,6 +231,10 @@ cET: middle_child (--TR T2) 0.2089 msec/pass ET : middle_child (--TR T2) 0.9360 msec/pass + +Element creation +---------------- + As opposed to ET, libxml2 has a notion of documents that each element must be in. This results in a major performance difference for creating independent Elements that end up in independently created documents:: @@ -252,6 +260,10 @@ choice. Note, however, that the serialisation performance may even out this advantage, especially for smaller trees and trees with many attributes. + +Merging different sources +------------------------- + A critical action for lxml is moving elements between document contexts. It requires lxml to do recursive adaptations throughout the moved tree structure. @@ -285,8 +297,13 @@ cET: replace_children_element (--TC T1) 0.0238 msec/pass ET : replace_children_element (--TC T1) 0.1628 msec/pass -You should keep this difference in mind when you merge very large trees. On -the other hand, deep copying a tree is fast in lxml:: +You should keep this difference in mind when you merge very large trees. + + +deepcopy +-------- + +Deep copying a tree is fast in lxml:: lxe: deepcopy (--TC T1) 10.5221 msec/pass cET: deepcopy (--TC T1) 220.2251 msec/pass @@ -347,7 +364,7 @@ XPath ------ +===== The following timings are based on the benchmark script `bench_xpath.py`_. 
@@ -390,8 +407,8 @@ lxe: xpath_class_repeat (--TC T4) 1.0269 msec/pass -An bigger example -================= +A bigger example +================ A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would read in a 3 MB XML version of the Old Testament of the Bible and look for the @@ -521,6 +538,10 @@ API, the create-discard cycles can become a bottleneck, as elements have to be instantiated over and over again. + +ObjectPath +---------- + ObjectPath can be used to speed up the access to elements that are deep in the tree. It avoids step-by-step Python element instantiations along the path, which can substantially improve the access time:: @@ -544,6 +565,10 @@ Note, however, that parsing ObjectPath expressions is not for free either, so this is most effective for frequently accessing the same element. + +Caching Elements +---------------- + A way to improve the normal attribute access time is static instantiation of the Python objects, thus trading memory for speed. Just create a cache dictionary and run:: @@ -581,6 +606,10 @@ is most effective for largely immutable trees. You should consider using a set instead of a list in this case and add new elements by hand. 
+ +Further optimisations +--------------------- + Here are some more things to try if optimisation is required: * A lot of time is usually spent in tree traversal to find the addressed From scoder at codespeak.net Wed May 9 00:39:36 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 00:39:36 +0200 (CEST) Subject: [Lxml-checkins] r42888 - lxml/trunk/doc Message-ID: <20070508223936.550E08075@code0.codespeak.net> Author: scoder Date: Wed May 9 00:39:36 2007 New Revision: 42888 Modified: lxml/trunk/doc/FAQ.txt lxml/trunk/doc/objectify.txt lxml/trunk/doc/performance.txt Log: annotations and links Modified: lxml/trunk/doc/FAQ.txt ============================================================================== --- lxml/trunk/doc/FAQ.txt (original) +++ lxml/trunk/doc/FAQ.txt Wed May 9 00:39:36 2007 @@ -2,6 +2,11 @@ Frequently Asked Questions (FAQ) ================================ +.. meta:: + :description: Frequently Asked Questions about lxml (FAQ) + :keywords: lxml, lxml.etree, FAQ, frequently asked questions + + See also the notes on compatibility_ to ElementTree_. .. _compatibility: compatibility.html Modified: lxml/trunk/doc/objectify.txt ============================================================================== --- lxml/trunk/doc/objectify.txt (original) +++ lxml/trunk/doc/objectify.txt Wed May 9 00:39:36 2007 @@ -2,6 +2,9 @@ lxml.objectify ============== +:Author: + Stefan Behnel + lxml supports an alternative API similar to the Amara_ bindery or gnosis.xml.objectify_ through a custom Element implementation. The main idea is to hide the usage of XML behind normal Python objects, sometimes referred Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 00:39:36 2007 @@ -2,6 +2,14 @@ Benchmarks and Speed ==================== +:Author: + Stefan Behnel + +.. 
meta:: + :description: Performance evaluation of lxml and ElementTree + :keywords: lxml performance, lxml.etree, lxml.objectify, benchmarks, ElementTree + + As an XML library, lxml.etree is very fast. It is also slow. As with all software, it depends on what you do with it. Rest assured that lxml is fast enough for most applications, so lxml is probably somewhere between 'fast @@ -410,16 +418,17 @@ A bigger example ================ -A while ago, Uche Ogbuji posted a benchmark proposal at `xml.org`_ that would -read in a 3 MB XML version of the Old Testament of the Bible and look for the -text "begat" in all verses. Apparently, it is contained in 120 of them. This -is easy to implement in ElementTree using ``findall()``. However, the fastest +A while ago, Uche Ogbuji posted a `benchmark proposal`_ that would read in a +3MB XML version of the `Old Testament`_ of the Bible and look for the word +*begat* in all verses. Apparently, it is contained in 120 of them. This is +easy to implement in ElementTree using ``findall()``. However, the fastest way to do this is obviously ``iterparse()``, as most of the data is not of any interest. -.. _`xml.org`: http://xml.org/... +.. _`benchmark proposal`: http://www.onlamp.com/pub/wlg/6291 +.. _`Old Testament`: http://www.ibiblio.org/bosak/xml/eg/religion.2.00.xml.zip -Now, Uche's original proposal was:: +Now, Uche's original proposal was more or less the following:: def bench_ET(): tree = ElementTree.parse("ot.xml") From scoder at codespeak.net Wed May 9 01:02:05 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 01:02:05 +0200 (CEST) Subject: [Lxml-checkins] r42889 - in lxml/trunk/doc: . 
html Message-ID: <20070508230205.88A398075@code0.codespeak.net> Author: scoder Date: Wed May 9 01:02:04 2007 New Revision: 42889 Modified: lxml/trunk/doc/html/style.css lxml/trunk/doc/mkhtml.py Log: formatting, timestamp in HTML files Modified: lxml/trunk/doc/html/style.css ============================================================================== --- lxml/trunk/doc/html/style.css (original) +++ lxml/trunk/doc/html/style.css Wed May 9 01:02:04 2007 @@ -162,6 +162,11 @@ text-decoration: underline; } +p.timestamp { + text-align: right; + font-size: 70%; +} + p { /*margin: 0.5em 0em 1em 0em;*/ text-align: justify; @@ -169,6 +174,12 @@ margin: 0.5em 0em 0em 0em; } +th.docinfo-name { + padding-left: 3ex; + text-align: right; + font-weight: bold; +} + hr { clear: both; height: 1px; Modified: lxml/trunk/doc/mkhtml.py ============================================================================== --- lxml/trunk/doc/mkhtml.py (original) +++ lxml/trunk/doc/mkhtml.py Wed May 9 01:02:04 2007 @@ -1,5 +1,5 @@ from lxml.etree import parse, Element, SubElement, XPath -import os, shutil, re, sys, copy +import os, shutil, re, sys, copy, time SITE_STRUCTURE = [ ('lxml', ('main.txt', 'intro.txt', 'FAQ.txt', 'compatibility.txt', @@ -21,6 +21,8 @@ {"h" : "http://www.w3.org/1999/xhtml"}) find_menu = XPath("//h:ul[@id=$name]", {"h" : "http://www.w3.org/1999/xhtml"}) +find_page_end = XPath("/h:html/h:body/h:div[last()]", + {"h" : "http://www.w3.org/1999/xhtml"}) replace_invalid = re.compile(r'[-_/.\s\\]').sub @@ -103,9 +105,15 @@ build_menu(tree, basename, section, menu) - # integrate menu + # integrate menu and date + date = Element("{http://www.w3.org/1999/xhtml}p", {"class":"timestamp"}) + date.text = "Page generated on " + time.strftime("%Y-%m-%d") for tree, basename, outpath in trees.itervalues(): new_tree = merge_menu(tree, menu, basename) + div = find_page_end(new_tree) + if div: + div[-1].append(copy.deepcopy(date)) + new_tree.write(outpath) # also convert INSTALL.txt and 
CHANGES.txt From scoder at codespeak.net Wed May 9 01:12:46 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 01:12:46 +0200 (CEST) Subject: [Lxml-checkins] r42890 - lxml/trunk/doc Message-ID: <20070508231246.B1536806F@code0.codespeak.net> Author: scoder Date: Wed May 9 01:12:46 2007 New Revision: 42890 Modified: lxml/trunk/doc/performance.txt Log: well Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 01:12:46 2007 @@ -527,7 +527,8 @@ but more visibly slower on ``iterparse()``: 0.07 versus 0.10 seconds. However, tree iteration in lxml is incredibly fast, so it can be faster to parse the whole tree and then iterate over it rather than using - ``iterparse()`` to do both in one step. + ``iterparse()`` to do both in one step. Or, you can just rely on the lxml + authors to optimise iterparse in one of the next releases... lxml.objectify From scoder at codespeak.net Wed May 9 13:28:20 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 13:28:20 +0200 (CEST) Subject: [Lxml-checkins] r42929 - in lxml/trunk/doc: .
html Message-ID: <20070509112820.33531807A@code0.codespeak.net> Author: scoder Date: Wed May 9 13:28:19 2007 New Revision: 42929 Modified: lxml/trunk/doc/html/style.css lxml/trunk/doc/mkhtml.py Log: date footer in HTML pages Modified: lxml/trunk/doc/html/style.css ============================================================================== --- lxml/trunk/doc/html/style.css (original) +++ lxml/trunk/doc/html/style.css Wed May 9 13:28:19 2007 @@ -8,14 +8,14 @@ padding: 1em 1em 1em 21em; } - div.document { + div.document, div.footer { width: 45em; background-color: white; } } @media print { - div.document { + div.document, div.footer { width: auto; padding-left: 0px; } @@ -25,12 +25,20 @@ } } -div.document { +div.document, div.footer { margin: 1em auto 1em auto; color: #222; +} + +div.document { text-align: left; } +div.footer { + text-align: center; + font-size: 70%; +} + /*** TOC ***/ div.contents.topic > ul { @@ -162,11 +170,6 @@ text-decoration: underline; } -p.timestamp { - text-align: right; - font-size: 70%; -} - p { /*margin: 0.5em 0em 1em 0em;*/ text-align: justify; Modified: lxml/trunk/doc/mkhtml.py ============================================================================== --- lxml/trunk/doc/mkhtml.py (original) +++ lxml/trunk/doc/mkhtml.py Wed May 9 13:28:19 2007 @@ -13,6 +13,8 @@ RST2HTML_OPTIONS = " ".join([ "--no-toc-backlinks", "--strip-comments", + "--language en", + "--date", ]) find_title = XPath("/h:html/h:head/h:title/text()", @@ -105,15 +107,9 @@ build_menu(tree, basename, section, menu) - # integrate menu and date - date = Element("{http://www.w3.org/1999/xhtml}p", {"class":"timestamp"}) - date.text = "Page generated on " + time.strftime("%Y-%m-%d") + # integrate menu for tree, basename, outpath in trees.itervalues(): new_tree = merge_menu(tree, menu, basename) - div = find_page_end(new_tree) - if div: - div[-1].append(copy.deepcopy(date)) - new_tree.write(outpath) # also convert INSTALL.txt and CHANGES.txt From scoder at codespeak.net Wed 
May 9 13:28:57 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 13:28:57 +0200 (CEST) Subject: [Lxml-checkins] r42930 - lxml/trunk/doc Message-ID: <20070509112857.23B3A807A@code0.codespeak.net> Author: scoder Date: Wed May 9 13:28:56 2007 New Revision: 42930 Modified: lxml/trunk/doc/performance.txt Log: small docfix Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 13:28:56 2007 @@ -527,7 +527,7 @@ but more visibly slower on ``iterparse()``: 0.07 versus 0.10 seconds. However, tree iteration in lxml is incredibly fast, so it can be faster to parse the whole tree and then iterate over it rather than using - ``iterparse()`` to do both in one step. Or, you can just rely on the lxml + ``iterparse()`` to do both in one step. Or, you can just wait for the lxml authors to optimise iterparse in one of the next releases...
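
The trade-off touched on in the hunk above (letting ``iterparse()`` parse and inspect in one step, versus parsing the whole tree first and then iterating over it) can be sketched roughly as follows. This is only an illustration, not the benchmark code from performance.txt: the sample document, the ``v`` tag and the function names are made up for the example, it is written for Python 3, and it falls back to the standard library's ElementTree when lxml is not installed.

```python
import io

try:
    from lxml import etree  # assumption: lxml is installed
except ImportError:
    # The standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

# Tiny stand-in document; element names and content are hypothetical.
XML = (b"<book>"
       b"<v>Adam begat Seth</v>"
       b"<v>In the beginning</v>"
       b"<v>Seth begat Enos</v>"
       b"</book>")


def count_with_iterparse(data, word="begat"):
    """Parse and inspect in one step, checking each element as it ends."""
    count = 0
    for _event, elem in etree.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "v" and elem.text and word in elem.text:
            count += 1
    return count


def count_with_tree_iteration(data, word="begat"):
    """Parse the whole tree first, then iterate over the 'v' elements."""
    tree = etree.parse(io.BytesIO(data))
    return sum(1 for elem in tree.iter("v")
               if elem.text and word in elem.text)
```

Both functions return the same count. Which one is faster depends on the library, its version and the document, so it is worth timing both on real data before settling on one.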
From scoder at codespeak.net Wed May 9 13:30:37 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 13:30:37 +0200 (CEST) Subject: [Lxml-checkins] r42931 - lxml/trunk/doc Message-ID: <20070509113037.917F9807A@code0.codespeak.net> Author: scoder Date: Wed May 9 13:30:37 2007 New Revision: 42931 Modified: lxml/trunk/doc/performance.txt Log: small doc fix Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 13:30:37 2007 @@ -415,7 +415,7 @@ lxe: xpath_class_repeat (--TC T4) 1.0269 msec/pass -A bigger example +A longer example ================ A while ago, Uche Ogbuji posted a `benchmark proposal`_ that would read in a From scoder at codespeak.net Wed May 9 13:33:21 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 13:33:21 +0200 (CEST) Subject: [Lxml-checkins] r42932 - lxml/trunk/doc Message-ID: <20070509113321.23AB2807A@code0.codespeak.net> Author: scoder Date: Wed May 9 13:33:20 2007 New Revision: 42932 Modified: lxml/trunk/doc/performance.txt Log: small doc fix Modified: lxml/trunk/doc/performance.txt ============================================================================== --- lxml/trunk/doc/performance.txt (original) +++ lxml/trunk/doc/performance.txt Wed May 9 13:33:20 2007 @@ -420,10 +420,10 @@ A while ago, Uche Ogbuji posted a `benchmark proposal`_ that would read in a 3MB XML version of the `Old Testament`_ of the Bible and look for the word -*begat* in all verses. Apparently, it is contained in 120 of them. This is -easy to implement in ElementTree using ``findall()``. However, the fastest -way to do this is obviously ``iterparse()``, as most of the data is not of any -interest. +*begat* in all verses. Apparently, it is contained in 120 out of almost 24000 +verses. This is easy to implement in ElementTree using ``findall()``. 
+However, the fastest way to do this is obviously ``iterparse()``, as most of
+the data is not of any interest.

 .. _`benchmark proposal`: http://www.onlamp.com/pub/wlg/6291
 .. _`Old Testament`: http://www.ibiblio.org/bosak/xml/eg/religion.2.00.xml.zip

From scoder at codespeak.net Wed May 9 21:49:01 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Wed, 9 May 2007 21:49:01 +0200 (CEST)
Subject: [Lxml-checkins] r42973 - lxml/trunk/doc
Message-ID: <20070509194901.BB6C2807A@code0.codespeak.net>

Author: scoder
Date: Wed May 9 21:49:01 2007
New Revision: 42973

Modified:
   lxml/trunk/doc/parsing.txt
Log:
cleanup

Modified: lxml/trunk/doc/parsing.txt
==============================================================================
--- lxml/trunk/doc/parsing.txt (original)
+++ lxml/trunk/doc/parsing.txt Wed May 9 21:49:01 2007
@@ -240,16 +240,17 @@

   >>> for event, element in etree.iterparse(StringIO(xml)):
   ...     # ... do something with the element
-  ...     element.clear()                # clean up children
-  ...     if element.getprevious():      # clean up preceding siblings
-  ...         del element.getparent()[0]
-
-You can use ``while`` instead of the ``if`` to delete multiple siblings in a
-row if you skipped over them using the ``tag`` keyword argument. The more
-selective your tag is, however, the more thought you will have to put into
-finding the right way to clean up the elements that were skipped. Therefore,
-it is sometimes easier to traverse all elements and do the tag selection by
-hand in the event handler code.
+  ...     element.clear()                 # clean up children
+  ...     while element.getprevious() is not None:
+  ...         del element.getparent()[0]  # clean up preceding siblings
+
+The ``while`` loop deletes multiple siblings in a row. This is only necessary
+if you skipped over some of them using the ``tag`` keyword argument.
+Otherwise, a simple ``if`` should do. The more selective your tag is,
+however, the more thought you will have to put into finding the right way to
+clean up the elements that were skipped. Therefore, it is sometimes easier to
+traverse all elements and do the tag selection by hand in the event handler
+code.

 The 'start-ns' and 'end-ns' events notify about namespace declarations and
 generate tuples ``(prefix, URI)``::

From scoder at codespeak.net Wed May 9 21:49:37 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Wed, 9 May 2007 21:49:37 +0200 (CEST)
Subject: [Lxml-checkins] r42974 - lxml/trunk/doc
Message-ID: <20070509194937.8572C807A@code0.codespeak.net>

Author: scoder
Date: Wed May 9 21:49:37 2007
New Revision: 42974

Modified:
   lxml/trunk/doc/performance.txt
Log:
doc rephrasing

Modified: lxml/trunk/doc/performance.txt
==============================================================================
--- lxml/trunk/doc/performance.txt (original)
+++ lxml/trunk/doc/performance.txt Wed May 9 21:49:37 2007
@@ -484,10 +484,11 @@
               result.append(text)
       return len(result)

-This gets us down to 0.12 seconds. However, since this is not much different
-from the original findall variant, we can remove the complexity of the XPath
-call completely and just go with what we had in the beginning. Under lxml,
-this runs in the same 0.12 seconds.
+This gets us down to 0.12 seconds, thus showing that a generic XPath
+evaluation engine cannot always compete with a simpler, tailored solution.
+However, since this is not much different from the original findall variant,
+we can remove the complexity of the XPath call completely and just go with
+what we had in the beginning. Under lxml, this runs in the same 0.12 seconds.

 But there is one thing left to try. We can replace the simple ElementPath
 expression with a native tree iterator::
@@ -522,13 +523,14 @@
   cElementTree and writing an ``iterparse()`` based version would have given us
   0.10 seconds - not a big difference for 3MB of XML.
-* Take care what operation is really dominating in your use case. Here, lxml
-  is little slower than cElementTree on ``parse()`` (both about 0.06 seconds),
-  but more visibly slower on ``iterparse()``: 0.07 versus 0.10 seconds.
-  However, tree iteration in lxml is increadibly fast, so it can be faster to
-  parse the whole tree and then iterate over it rather than using
-  ``iterparse()`` to do both in one step. Or, you can just wait for the lxml
-  authors to optimise iterparse in one of the next releases...
+* Take care what operation is really dominating in your use case. If we split
+  up the operations, we can see that lxml is slightly slower than cElementTree
+  on ``parse()`` (both about 0.06 seconds), but more visibly slower on
+  ``iterparse()``: 0.07 versus 0.10 seconds. However, tree iteration in lxml
+  is increadibly fast, so it can be better to parse the whole tree and then
+  iterate over it rather than using ``iterparse()`` to do both in one step.
+  Or, you can just wait for the lxml authors to optimise iterparse in one of
+  the next releases...
lxml.objectify From scoder at codespeak.net Wed May 9 21:51:13 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 21:51:13 +0200 (CEST) Subject: [Lxml-checkins] r42975 - lxml/trunk/src/lxml Message-ID: <20070509195113.30F79807B@code0.codespeak.net> Author: scoder Date: Wed May 9 21:51:12 2007 New Revision: 42975 Modified: lxml/trunk/src/lxml/iterparse.pxi Log: larger iterparse chunks Modified: lxml/trunk/src/lxml/iterparse.pxi ============================================================================== --- lxml/trunk/src/lxml/iterparse.pxi (original) +++ lxml/trunk/src/lxml/iterparse.pxi Wed May 9 21:51:12 2007 @@ -1,7 +1,7 @@ # iterparse -- incremental parsing cdef object __ITERPARSE_CHUNK_SIZE -__ITERPARSE_CHUNK_SIZE = 16384 +__ITERPARSE_CHUNK_SIZE = 32768 ctypedef enum IterparseEventFilter: ITERPARSE_FILTER_START = 1 From scoder at codespeak.net Wed May 9 23:02:31 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 23:02:31 +0200 (CEST) Subject: [Lxml-checkins] r42980 - lxml/trunk/benchmark Message-ID: <20070509210231.3E0A3807A@code0.codespeak.net> Author: scoder Date: Wed May 9 23:02:31 2007 New Revision: 42980 Modified: lxml/trunk/benchmark/bench_etree.py Log: deepcopy with attributes Modified: lxml/trunk/benchmark/bench_etree.py ============================================================================== --- lxml/trunk/benchmark/bench_etree.py (original) +++ lxml/trunk/benchmark/bench_etree.py Wed May 9 23:02:31 2007 @@ -212,10 +212,14 @@ child[:] @children + @with_attributes(True, False) + @with_text(utext=True, text=True, no_text=True) def bench_deepcopy(self, children): for child in children: copy.deepcopy(child) + @with_attributes(True, False) + @with_text(utext=True, text=True, no_text=True) def bench_deepcopy_all(self, root): copy.deepcopy(root) From scoder at codespeak.net Wed May 9 23:03:15 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 9 May 2007 23:03:15 
+0200 (CEST)
Subject: [Lxml-checkins] r42981 - lxml/trunk/doc
Message-ID: <20070509210315.EBCD2807A@code0.codespeak.net>

Author: scoder
Date: Wed May 9 23:03:15 2007
New Revision: 42981

Modified:
   lxml/trunk/doc/performance.txt
Log:
bench doc: deepcopy, some rephrasing

Modified: lxml/trunk/doc/performance.txt
==============================================================================
--- lxml/trunk/doc/performance.txt (original)
+++ lxml/trunk/doc/performance.txt Wed May 9 23:03:15 2007
@@ -313,16 +313,21 @@

 Deep copying a tree is fast in lxml::

-  lxe: deepcopy (--TC T1)   10.5221 msec/pass
-  cET: deepcopy (--TC T1)  220.2251 msec/pass
-  ET : deepcopy (--TC T1)  463.7730 msec/pass
-
-  lxe: deepcopy (--TC T3)    4.2651 msec/pass
-  cET: deepcopy (--TC T3)   53.8740 msec/pass
-  ET : deepcopy (--TC T3)  118.2799 msec/pass
-
-So, for example, if you often need to create independent subtrees from a large
-tree that you have parsed in, lxml is by far the best choice here.
+  lxe: deepcopy_all (--TR T1)   11.0400 msec/pass
+  cET: deepcopy_all (--TR T1)  119.6141 msec/pass
+  ET : deepcopy_all (--TR T1)  451.2160 msec/pass
+
+  lxe: deepcopy_all (-ATR T2)   13.5410 msec/pass
+  cET: deepcopy_all (-ATR T2)  135.2482 msec/pass
+  ET : deepcopy_all (-ATR T2)  476.1350 msec/pass
+
+  lxe: deepcopy_all (S-TR T3)    4.2889 msec/pass
+  cET: deepcopy_all (S-TR T3)   36.0429 msec/pass
+  ET : deepcopy_all (S-TR T3)  113.4322 msec/pass
+
+So, for example, if you have a database-like scenario where you parse in a
+large tree and then search and copy independent subtrees from it for further
+processing, lxml is by far the best choice here.


 Tree traversal
@@ -330,8 +335,8 @@

 Another area where lxml is very fast is iteration for tree traversal.
If your algorithms can benefit from step-by-step traversal of the XML tree and -especially if few elements are of interest or the element tag name is known, -lxml is a good choice:: +especially if few elements are of interest or the target element tag name is +known, lxml is a good choice:: lxe: getiterator_all (--TR T2) 6.4790 msec/pass cET: getiterator_all (--TR T2) 28.2831 msec/pass @@ -349,7 +354,7 @@ cET: getiterator_tag_all (--TR T2) 27.2939 msec/pass ET : getiterator_tag_all (--TR T2) 22.8271 msec/pass -This similarly shows in ``Element.findall()``:: +This translates directly into similar timings for ``Element.findall()``:: lxe: findall (--TR T2) 6.8321 msec/pass cET: findall (--TR T2) 28.8639 msec/pass From scoder at codespeak.net Thu May 10 11:39:36 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Thu, 10 May 2007 11:39:36 +0200 (CEST) Subject: [Lxml-checkins] r43008 - lxml/trunk Message-ID: <20070510093936.A8D368081@code0.codespeak.net> Author: scoder Date: Thu May 10 11:39:36 2007 New Revision: 43008 Modified: lxml/trunk/TODO.txt Log: todo Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Thu May 10 11:39:36 2007 @@ -16,6 +16,9 @@ * more testing on multi-threading +* better exception messages for XPath and schemas based on error log, + e.g. missing namespace mappings in XPath + ElementTree ----------- From scoder at codespeak.net Fri May 11 11:25:34 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 11 May 2007 11:25:34 +0200 (CEST) Subject: [Lxml-checkins] r43160 - in lxml/trunk: . 
src/lxml Message-ID: <20070511092534.9EDF28090@code0.codespeak.net> Author: scoder Date: Fri May 11 11:25:34 2007 New Revision: 43160 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/iterparse.pxi lxml/trunk/src/lxml/xmlparser.pxd Log: more robust error handling in iterparse() Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Fri May 11 11:25:34 2007 @@ -38,6 +38,8 @@ Bugs fixed ---------- +* More robust error handling in ``iterparse()`` + * Documents lost their top-level PIs and comments on serialisation * lxml.sax failed on comments and PIs. Comments are now properly ignored and Modified: lxml/trunk/src/lxml/iterparse.pxi ============================================================================== --- lxml/trunk/src/lxml/iterparse.pxi (original) +++ lxml/trunk/src/lxml/iterparse.pxi Fri May 11 11:25:34 2007 @@ -48,7 +48,7 @@ c_ns = c_ns.next return count -cdef class _IterparseResolverContext(_ResolverContext): +cdef class _IterparseContext(_ResolverContext): cdef xmlparser.startElementNsSAX2Func _origSaxStart cdef xmlparser.endElementNsSAX2Func _origSaxEnd cdef _Element _root @@ -64,8 +64,8 @@ cdef char* _tag_href cdef char* _tag_name - def __init__(self, *args): - _ResolverContext.__init__(self, *args) + def __init__(self, _ResolverRegistry resolvers): + _ResolverContext.__init__(self, resolvers) self._ns_stack = [] self._pop_ns = self._ns_stack.pop self._node_stack = [] @@ -90,7 +90,7 @@ ITERPARSE_FILTER_END_NS): sax.endElementNs = _saxEnd - cdef void _setEventFilter(self, events, tag): + cdef _setEventFilter(self, events, tag): self._event_filter = _buildIterparseEventFilter(events) if tag is None or tag == '*': self._tag_href = NULL @@ -109,8 +109,7 @@ if self._tag_href is NULL and self._tag_name is NULL: self._tag_tuple = None - cdef void startNode(self, xmlNode* c_node): - cdef _Element node + cdef int startNode(self, xmlNode* 
c_node) except -1: cdef xmlNs* c_ns cdef int ns_count if self._event_filter & ITERPARSE_FILTER_START_NS: @@ -129,9 +128,9 @@ python.PyList_Append(self._node_stack, node) if self._event_filter & ITERPARSE_FILTER_START: python.PyList_Append(self._events, ("start", node)) + return 0 - cdef void endNode(self, xmlNode* c_node): - cdef _Element node + cdef int endNode(self, xmlNode* c_node) except -1: cdef xmlNs* c_ns cdef int ns_count if self._event_filter & ITERPARSE_FILTER_END: @@ -141,7 +140,6 @@ ITERPARSE_FILTER_START_NS | \ ITERPARSE_FILTER_END_NS): node = self._pop_node() - assert node._c_node is c_node else: if self._doc is None: self._doc = _documentFactory(c_node.doc, None) @@ -155,23 +153,36 @@ event = ("end-ns", None) for i from 0 <= i < ns_count: python.PyList_Append(self._events, event) + return 0 cdef void _pushSaxStartEvent(xmlparser.xmlParserCtxt* c_ctxt, xmlNode* c_node): - cdef _IterparseResolverContext context - context = <_IterparseResolverContext>c_ctxt._private - context.startNode(c_node) + cdef _IterparseContext context + context = <_IterparseContext>c_ctxt._private + try: + context.startNode(c_node) + except: + if c_ctxt.errNo == xmlerror.XML_ERR_OK: + c_ctxt.errNo = xmlerror.XML_ERR_INTERNAL_ERROR + c_ctxt.disableSAX = 1 + context._store_raised() cdef void _pushSaxEndEvent(xmlparser.xmlParserCtxt* c_ctxt, xmlNode* c_node): - cdef _IterparseResolverContext context - context = <_IterparseResolverContext>c_ctxt._private - context.endNode(c_node) + cdef _IterparseContext context + context = <_IterparseContext>c_ctxt._private + try: + context.endNode(c_node) + except: + if c_ctxt.errNo == xmlerror.XML_ERR_OK: + c_ctxt.errNo = xmlerror.XML_ERR_INTERNAL_ERROR + c_ctxt.disableSAX = 1 + context._store_raised() cdef xmlparser.startElementNsSAX2Func _getOrigStart(xmlparser.xmlParserCtxt* c_ctxt): - return (<_IterparseResolverContext>c_ctxt._private)._origSaxStart + return (<_IterparseContext>c_ctxt._private)._origSaxStart cdef 
xmlparser.endElementNsSAX2Func _getOrigEnd(xmlparser.xmlParserCtxt* c_ctxt): - return (<_IterparseResolverContext>c_ctxt._private)._origSaxEnd + return (<_IterparseContext>c_ctxt._private)._origSaxEnd cdef void _saxStart(void* ctxt, char* localname, char* prefix, char* URI, int nb_namespaces, char** namespaces, @@ -230,7 +241,7 @@ def __init__(self, source, events=("end",), tag=None, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=False, remove_blank_text=False): - cdef _IterparseResolverContext context + cdef _IterparseContext context cdef char* c_filename cdef int parse_options if not hasattr(source, 'read'): @@ -246,7 +257,7 @@ c_filename = NULL self._source = source - _BaseParser.__init__(self, _IterparseResolverContext) + _BaseParser.__init__(self, _IterparseContext) parse_options = _XML_DEFAULT_PARSE_OPTIONS if load_dtd: @@ -263,7 +274,7 @@ parse_options = parse_options | xmlparser.XML_PARSE_NOBLANKS self._parse_options = parse_options - context = <_IterparseResolverContext>self._context + context = <_IterparseContext>self._context context._setEventFilter(events, tag) context._wrapCallbacks(self._parser_ctxt.sax) xmlparser.xmlCtxtUseOptions(self._parser_ctxt, parse_options) @@ -274,12 +285,12 @@ return self def __next__(self): - cdef _IterparseResolverContext context + cdef _IterparseContext context cdef int error cdef char* c_filename if self._source is None: raise StopIteration - context = <_IterparseResolverContext>self._context + context = <_IterparseContext>self._context if python.PyList_GET_SIZE(context._events) > context._event_index: item = python.PyList_GET_ITEM(context._events, context._event_index) python.Py_INCREF(item) # 'borrowed reference' from PyList_GET_ITEM @@ -291,7 +302,6 @@ while python.PyList_GET_SIZE(context._events) == 0 and error == 0: data = self._source.read(__ITERPARSE_CHUNK_SIZE) if not python.PyString_Check(data): - #xmlparser.xmlParseChunk(self._parser_ctxt, NULL, 0, 1) self._source = None raise 
TypeError, "reading file objects must return plain strings" elif data: @@ -307,6 +317,7 @@ _raiseParseError(self._parser_ctxt, self._filename) if python.PyList_GET_SIZE(context._events) == 0: self.root = context._root + self._source = None raise StopIteration context._event_index = 1 @@ -316,8 +327,8 @@ cdef class iterwalk: - """A tree walker that generates ``iterparse()`` events from an existing - tree as if it was parsing XML data. + """A tree walker that generates events from an existing tree as if it was + parsing XML data with ``iterparse()``. """ cdef object _node_stack cdef object _pop_node Modified: lxml/trunk/src/lxml/xmlparser.pxd ============================================================================== --- lxml/trunk/src/lxml/xmlparser.pxd (original) +++ lxml/trunk/src/lxml/xmlparser.pxd Fri May 11 11:25:34 2007 @@ -52,6 +52,8 @@ int wellFormed int recovery int options + int disableSAX + int errNo xmlError lastError xmlNode* node xmlSAXHandler* sax From scoder at codespeak.net Fri May 11 18:54:16 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 11 May 2007 18:54:16 +0200 (CEST) Subject: [Lxml-checkins] r43227 - in lxml/trunk: . 
src/lxml src/lxml/tests Message-ID: <20070511165416.830E6809C@code0.codespeak.net> Author: scoder Date: Fri May 11 18:54:14 2007 New Revision: 43227 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/tests/test_elementtree.py Log: clear() method on Element.attrib Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Fri May 11 18:54:14 2007 @@ -38,6 +38,10 @@ Bugs fixed ---------- +* More ET compatible behaviour when writing out XML declarations or not + +* ``Element.attrib`` was missing ``clear()`` method + * More robust error handling in ``iterparse()`` * Documents lost their top-level PIs and comments on serialisation Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Fri May 11 18:54:14 2007 @@ -1467,6 +1467,12 @@ _delAttribute(self._element, key) return result + def clear(self): + cdef xmlNode* c_node + c_node = self._element._c_node + while c_node.properties is not NULL: + tree.xmlRemoveProp(c_node.properties) + # ACCESSORS def __repr__(self): return repr(dict( _attributeIteratorFactory(self._element, 3) )) @@ -1882,17 +1888,15 @@ """ cdef int write_declaration cdef int c_pretty_print - if encoding is None: - encoding = 'ASCII' - else: - encoding = encoding.upper() c_pretty_print = bool(pretty_print) if xml_declaration is None: # by default, write an XML declaration only for non-standard encodings - write_declaration = encoding not in \ + write_declaration = encoding is not None and encoding.upper() not in \ ('ASCII', 'UTF-8', 'UTF8', 'US-ASCII') else: write_declaration = bool(xml_declaration) + if encoding is None: + encoding = 'ASCII' if isinstance(element_or_tree, _Element): return _tostring(<_Element>element_or_tree, Modified: 
lxml/trunk/src/lxml/tests/test_elementtree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_elementtree.py (original) +++ lxml/trunk/src/lxml/tests/test_elementtree.py Fri May 11 18:54:14 2007 @@ -290,6 +290,27 @@ self.assertEquals(None, root.get('three')) self.assertEquals('foo', root.get('three', 'foo')) + def test_attrib_clear(self): + XML = self.etree.XML + + root = XML('') + self.assertEquals('One', root.get('one')) + self.assertEquals('Two', root.get('two')) + root.attrib.clear() + self.assertEquals(None, root.get('one')) + self.assertEquals(None, root.get('two')) + + def test_attrib_set_clear(self): + Element = self.etree.Element + + root = Element("root", one="One") + root.set("two", "Two") + self.assertEquals('One', root.get('one')) + self.assertEquals('Two', root.get('two')) + root.attrib.clear() + self.assertEquals(None, root.get('one')) + self.assertEquals(None, root.get('two')) + def test_attribute_update_dict(self): XML = self.etree.XML From scoder at codespeak.net Fri May 11 19:01:49 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Fri, 11 May 2007 19:01:49 +0200 (CEST) Subject: [Lxml-checkins] r43235 - lxml/trunk Message-ID: <20070511170149.C0B6B8092@code0.codespeak.net> Author: scoder Date: Fri May 11 19:01:49 2007 New Revision: 43235 Modified: lxml/trunk/selftest.py lxml/trunk/selftest2.py Log: enabled some more ET selftests (rest is broken due to different serialisation) Modified: lxml/trunk/selftest.py ============================================================================== --- lxml/trunk/selftest.py (original) +++ lxml/trunk/selftest.py Fri May 11 19:01:49 2007 @@ -272,28 +272,31 @@ ## '

spamegg

' ## """ -## def parseliteral(): -## r""" -## >>> element = ElementTree.XML("text") -## >>> ElementTree.ElementTree(element).write(sys.stdout) -## text -## >>> element = ElementTree.fromstring("text") -## >>> ElementTree.ElementTree(element).write(sys.stdout) -## text -## >>> print ElementTree.tostring(element) -## text -## >>> print ElementTree.tostring(element, "ascii") -## -## text -## >>> _, ids = ElementTree.XMLID("text") -## >>> len(ids) -## 0 -## >>> _, ids = ElementTree.XMLID("text") -## >>> len(ids) -## 1 -## >>> ids["body"].tag -## 'body' -## """ +def parseliteral(): + r""" + >>> element = ElementTree.XML("text") + >>> ElementTree.ElementTree(element).write(sys.stdout) + text + >>> element = ElementTree.fromstring("text") + >>> ElementTree.ElementTree(element).write(sys.stdout) + text + >>> print ElementTree.tostring(element) + text + +# looks different in lxml +# >>> print ElementTree.tostring(element, "ascii") +# +# text + + >>> _, ids = ElementTree.XMLID("text") + >>> len(ids) + 0 + >>> _, ids = ElementTree.XMLID("text") + >>> len(ids) + 1 + >>> ids["body"].tag + 'body' + """ ## def simpleparsefile(): ## """ @@ -519,16 +522,18 @@ ## """ -## def xmllang(): -## """ -## This appears to be a problem; in underlying libxml2? +def xmllang(): + """ + This appears to be a problem; in underlying libxml2? -## 1) xml namespace + 1) xml namespace -## >>> elem = ElementTree.XML("") -## >>> serialize(elem) # 1.1 -## '' -## """ + >>> elem = ElementTree.XML("") + >>> serialize(elem) # 1.1 + '' + +# '' # ElementTree produces an extra blank + """ def namespace(): """ Modified: lxml/trunk/selftest2.py ============================================================================== --- lxml/trunk/selftest2.py (original) +++ lxml/trunk/selftest2.py Fri May 11 19:01:49 2007 @@ -133,30 +133,30 @@ 'textsubtext' """ -## def encoding(): -## r""" -## Test encoding issues. +def encoding(): + r""" + Test encoding issues. 
-## >>> elem = ElementTree.Element("tag") -## >>> elem.text = u"abc" -## >>> serialize(elem) -## 'abc' -## >>> serialize(elem, "utf-8") -## 'abc' -## >>> serialize(elem, "us-ascii") -## 'abc' -## >>> serialize(elem, "iso-8859-1") -## "\nabc" + >>> elem = ElementTree.Element("tag") + >>> elem.text = u"abc" + >>> serialize(elem) + 'abc' + >>> serialize(elem, "utf-8") + 'abc' + >>> serialize(elem, "us-ascii") + 'abc' + >>> serialize(elem, "iso-8859-1").lower() + "\nabc" -## >>> elem.text = "<&\"\'>" -## >>> serialize(elem) -## '<&"\'>' -## >>> serialize(elem, "utf-8") -## '<&"\'>' -## >>> serialize(elem, "us-ascii") # cdata characters -## '<&"\'>' -## >>> serialize(elem, "iso-8859-1") -## '\n<&"\'>' + >>> elem.text = "<&\"\'>" + >>> serialize(elem) + '<&"\'>' + >>> serialize(elem, "utf-8") + '<&"\'>' + >>> serialize(elem, "us-ascii") # cdata characters + '<&"\'>' + >>> serialize(elem, "iso-8859-1").lower() + '\n<&"\'>' ## >>> elem.attrib["key"] = "<&\"\'>" ## >>> elem.text = None @@ -169,16 +169,16 @@ ## >>> serialize(elem, "iso-8859-1") ## '\n' -## >>> elem.text = u'\xe5\xf6\xf6<>' -## >>> elem.attrib.clear() -## >>> serialize(elem) -## 'åöö<>' -## >>> serialize(elem, "utf-8") -## '\xc3\xa5\xc3\xb6\xc3\xb6<>' -## >>> serialize(elem, "us-ascii") -## 'åöö<>' -## >>> serialize(elem, "iso-8859-1") -## "\n\xe5\xf6\xf6<>" + >>> elem.text = u'\xe5\xf6\xf6<>' + >>> elem.attrib.clear() + >>> serialize(elem) + 'åöö<>' + >>> serialize(elem, "utf-8") + '\xc3\xa5\xc3\xb6\xc3\xb6<>' + >>> serialize(elem, "us-ascii") + 'åöö<>' + >>> serialize(elem, "iso-8859-1").lower() + "\n\xe5\xf6\xf6<>" ## >>> elem.attrib["key"] = u'\xe5\xf6\xf6<>' ## >>> elem.text = None @@ -191,25 +191,25 @@ ## >>> serialize(elem, "iso-8859-1") ## '\n' -## """ + """ -## def qname(): -## """ -## Test QName handling. +def qname(): + """ + Test QName handling. 
-## 1) decorated tags + 1) decorated tags -## >>> elem = ElementTree.Element("{uri}tag") -## >>> serialize(elem) # 1.1 -## '' + >>> elem = ElementTree.Element("{uri}tag") + >>> serialize(elem) # 1.1 + '' ## 2) decorated attributes ## >>> elem.attrib["{uri}key"] = "value" ## >>> serialize(elem) # 2.1 -## '' +## '' -## """ + """ def cdata(): """ From scoder at codespeak.net Sat May 12 17:34:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 12 May 2007 17:34:00 +0200 (CEST) Subject: [Lxml-checkins] r43301 - in lxml/trunk: . src/lxml src/lxml/tests Message-ID: <20070512153400.E18D1807F@code0.codespeak.net> Author: scoder Date: Sat May 12 17:34:00 2007 New Revision: 43301 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/tests/test_xpathevaluator.py lxml/trunk/src/lxml/xmlerror.pxi lxml/trunk/src/lxml/xpath.pxi Log: new XPathEvalError for evaluation errors (instead of always raising XPathSyntaxError) Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Sat May 12 17:34:00 2007 @@ -8,6 +8,10 @@ Features added -------------- +* Error specific messages in XPath parsing and evaluation + NOTE: for evaluation errors, you will now get an XPathEvalError instead of + an XPathSyntaxError. 
To catch both, you can except on ``XPathError`` + * The regular expression functions in XPath now support passing a node-set instead of a string Modified: lxml/trunk/src/lxml/tests/test_xpathevaluator.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_xpathevaluator.py (original) +++ lxml/trunk/src/lxml/tests/test_xpathevaluator.py Sat May 12 17:34:00 2007 @@ -114,7 +114,20 @@ def test_xpath_error(self): tree = self.parse('') - self.assertRaises(SyntaxError, tree.xpath, '\\fad') + self.assertRaises(etree.XPathEvalError, tree.xpath, '\\fad') + + def test_xpath_class_error(self): + self.assertRaises(SyntaxError, etree.XPath, '\\fad') + self.assertRaises(etree.XPathSyntaxError, etree.XPath, '\\fad') + + def test_xpath_prefix_error(self): + tree = self.parse('') + self.assertRaises(etree.XPathEvalError, tree.xpath, '/fa:d') + + def test_xpath_class_prefix_error(self): + tree = self.parse('') + xpath = etree.XPath("/fa:d") + self.assertRaises(etree.XPathEvalError, xpath, tree) def test_elementtree_getpath(self): a = etree.Element("a") Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Sat May 12 17:34:00 2007 @@ -5,8 +5,9 @@ # module level API functions def clearErrorLog(): - """Clear the global error log. - Note that this log is already bounded to a fixed size.""" + """Clear the global error log. Note that this log is already bound to a + fixed size. 
+ """ __GLOBAL_ERROR_LOG.clear() # dummy function: no debug output at all @@ -145,6 +146,15 @@ def __len__(self): return len(self._entries) + def __contains__(self, error_type): + for entry in self._entries: + if entry.type == error_type: + return True + return False + + def __nonzero__(self): + return bool(self._entries) + def filter_domains(self, domains): cdef _LogEntry entry filtered = [] Modified: lxml/trunk/src/lxml/xpath.pxi ============================================================================== --- lxml/trunk/src/lxml/xpath.pxi (original) +++ lxml/trunk/src/lxml/xpath.pxi Sat May 12 17:34:00 2007 @@ -1,14 +1,36 @@ # XPath evaluation -class XPathContextError(XPathError): +class XPathSyntaxError(LxmlSyntaxError, XPathError): pass -class XPathSyntaxError(LxmlSyntaxError, XPathError): +class XPathEvalError(XPathError): pass ################################################################################ # XPath +cdef object _XPATH_SYNTAX_ERRORS +_XPATH_SYNTAX_ERRORS = ( + xmlerror.XML_XPATH_NUMBER_ERROR, + xmlerror.XML_XPATH_UNFINISHED_LITERAL_ERROR, + xmlerror.XML_XPATH_VARIABLE_REF_ERROR, + xmlerror.XML_XPATH_INVALID_PREDICATE_ERROR, + xmlerror.XML_XPATH_UNCLOSED_ERROR, + xmlerror.XML_XPATH_INVALID_CHAR_ERROR +) + +cdef object _XPATH_EVAL_ERRORS +_XPATH_EVAL_ERRORS = ( + xmlerror.XML_XPATH_UNDEF_VARIABLE_ERROR, + xmlerror.XML_XPATH_UNDEF_PREFIX_ERROR, + xmlerror.XML_XPATH_UNKNOWN_FUNC_ERROR, + xmlerror.XML_XPATH_INVALID_OPERAND, + xmlerror.XML_XPATH_INVALID_TYPE, + xmlerror.XML_XPATH_INVALID_ARITY, + xmlerror.XML_XPATH_INVALID_CTXT_SIZE, + xmlerror.XML_XPATH_INVALID_CTXT_POSITION +) + cdef int _register_xpath_function(void* ctxt, name_utf, ns_utf): if ns_utf is None: return xpath.xmlXPathRegisterFunc( @@ -76,11 +98,17 @@ cdef xpath.xmlXPathContext* _xpathCtxt cdef _XPathContext _context cdef python.PyThread_type_lock _eval_lock + cdef _ErrorLog _error_log def __init__(self, namespaces, extensions, enable_regexp): + self._error_log = _ErrorLog() 
self._context = _XPathContext(namespaces, extensions, enable_regexp, None) + property error_log: + def __get__(self): + return self._error_log.copy() + def __dealloc__(self): if self._xpathCtxt is not NULL: xpath.xmlXPathFreeContext(self._xpathCtxt) @@ -127,6 +155,12 @@ python.PyThread_release_lock(self._eval_lock) cdef _raise_parse_error(self): + entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) + if entries: + entry = entries[0] + if entry is not None and entry.message: + raise XPathSyntaxError, entry.message + if self._xpathCtxt is not NULL and \ self._xpathCtxt.lastError.message is not NULL: message = funicode(self._xpathCtxt.lastError.message) @@ -134,6 +168,24 @@ message = "error in xpath expression" raise XPathSyntaxError, message + cdef _raise_eval_error(self): + entries = self._error_log.filter_types(_XPATH_EVAL_ERRORS) + if entries: + entry = entries[0] + if entry is not None and entry.message: + raise XPathEvalError, entry.message + entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) + if entries: + entry = entries[0] + if entry is not None and entry.message: + raise XPathSyntaxError, entry.message + if self._xpathCtxt is not NULL and \ + self._xpathCtxt.lastError.message is not NULL: + message = funicode(self._xpathCtxt.lastError.message) + else: + message = "error in xpath evaluation" + raise XPathEvalError, message + cdef object _handle_result(self, xpath.xmlXPathObject* xpathObj, _Document doc): if self._context._exc._has_raised(): if xpathObj is not NULL: @@ -144,7 +196,7 @@ if xpathObj is NULL: self._context._release_temp_refs() - self._raise_parse_error() + self._raise_eval_error() try: result = _unwrapXPathObject(xpathObj, doc) @@ -176,7 +228,7 @@ _XPathEvaluatorBase.__init__(self, namespaces, extensions, regexp) xpathCtxt = xpath.xmlXPathNewContext(doc._c_doc) if xpathCtxt is NULL: - raise XPathContextError, "Unable to create new XPath context" + python.PyErr_NoMemory() self.set_context(xpathCtxt) def registerNamespace(self, 
prefix, uri): @@ -207,6 +259,7 @@ doc = self._element._doc self._lock() + self._error_log.connect() self._xpathCtxt.node = self._element._c_node try: self._context.register_context(doc) @@ -217,6 +270,7 @@ python.PyEval_RestoreThread(state) result = self._handle_result(xpathObj, doc) finally: + self._error_log.disconnect() self._context.unregister_context() self._unlock() @@ -249,6 +303,7 @@ doc = self._element._doc self._lock() + self._error_log.connect() try: self._context.register_context(doc) c_doc = _fakeRootDoc(doc._c_doc, self._element._c_node) @@ -265,6 +320,7 @@ _destroyFakeDoc(doc._c_doc, c_doc) self._context.unregister_context() finally: + self._error_log.disconnect() self._unlock() return result @@ -308,9 +364,11 @@ path = _utf8(path) xpathCtxt = xpath.xmlXPathNewContext(NULL) if xpathCtxt is NULL: - raise XPathContextError, "Unable to create new XPath context" + python.PyErr_NoMemory() self.set_context(xpathCtxt) + self._error_log.connect() self._xpath = xpath.xmlXPathCtxtCompile(xpathCtxt, _cstr(path)) + self._error_log.disconnect() if self._xpath is NULL: self._raise_parse_error() @@ -325,6 +383,7 @@ element = _rootNodeOrRaise(_etree_or_element) self._lock() + self._error_log.connect() self._xpathCtxt.doc = document._c_doc self._xpathCtxt.node = element._c_node @@ -337,6 +396,7 @@ python.PyEval_RestoreThread(state) result = self._handle_result(xpathObj, document) finally: + self._error_log.disconnect() self._context.unregister_context() self._unlock() return result From scoder at codespeak.net Sun May 13 20:42:05 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 13 May 2007 20:42:05 +0200 (CEST) Subject: [Lxml-checkins] r43318 - in lxml/trunk: doc src/lxml Message-ID: <20070513184205.41F79807C@code0.codespeak.net> Author: scoder Date: Sun May 13 20:42:04 2007 New Revision: 43318 Modified: lxml/trunk/doc/xpathxslt.txt lxml/trunk/src/lxml/extensions.pxi lxml/trunk/src/lxml/xpath.pxi Log: always raise eval exception from XPath 
evaluators

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Sun May 13 20:42:04 2007
@@ -14,6 +14,7 @@
      1.4 The ``XPath`` class
      1.5 The ``XPathEvaluator`` classes
      1.6 ``ETXPath``
+     1.7 Error handling
    2 XSLT
      2.1 XSLT result objects
      2.2 Stylesheet parameters
@@ -265,6 +266,58 @@
   {ns}b
 
 
+Error handling
+--------------
+
+lxml.etree raises exceptions when errors occur while parsing or evaluating an
+XPath expression::
+
+  >>> find = etree.XPath("\\")
+  Traceback (most recent call last):
+    ...
+  XPathSyntaxError: Error in xpath expression
+
+lxml will also try to give you a hint what went wrong, so if you pass a more
+complex expression, you may get a somewhat more specific error::
+
+  >>> find = etree.XPath("//*[1.1.1]")
+  Traceback (most recent call last):
+    ...
+  XPathSyntaxError: Invalid predicate
+
+During evaluation, lxml will emit an XPathEvalError on errors::
+
+  >>> find = etree.XPath("//ns:a")
+  >>> find(root)
+  Traceback (most recent call last):
+    ...
+  XPathEvalError: Undefined namespace prefix
+
+This works for the ``XPath`` class, however, the other evaluators (including
+the ``xpath()`` method) are one-shot operations that do parsing and evaluation
+in one step. They therefore raise evaluation exceptions in all cases::
+
+  >>> root = etree.Element("test")
+  >>> find = root.xpath("//*[1.1.1]")
+  Traceback (most recent call last):
+    ...
+  XPathEvalError: Invalid predicate
+
+  >>> find = root.xpath("//ns:a")
+  Traceback (most recent call last):
+    ...
+  XPathEvalError: Undefined namespace prefix
+
+  >>> find = root.xpath("\\")
+  Traceback (most recent call last):
+    ...
+  XPathEvalError: Error in xpath evaluation
+
+Note that lxml versions before 1.3 always raised an ``XPathSyntaxError`` for
+all errors, including evaluation errors. The best way to support older
+versions is to except on the superclass ``XPathError``.
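
The advice to except on the superclass can be sketched without lxml itself.
The classes below are stand-ins that only mirror the hierarchy touched by this
commit (``XPathError`` as the common base of ``XPathSyntaxError`` and
``XPathEvalError``); they are not imported from lxml, and ``evaluate`` and
``safe_evaluate`` are hypothetical helpers invented for this illustration.

```python
# Stand-in classes mirroring the exception hierarchy described in the diff.
# They are NOT the real lxml classes; the sketch runs without lxml installed.
class LxmlError(Exception):
    pass


class XPathError(LxmlError):
    pass


class XPathEvalError(XPathError):
    pass


class XPathSyntaxError(XPathError):
    # In lxml this class also derives from LxmlSyntaxError; omitted here.
    pass


def evaluate(raises):
    """Hypothetical evaluator: raises the given exception class, if any."""
    if raises is not None:
        raise raises("simulated XPath failure")
    return []


def safe_evaluate(raises=None):
    """Catch the common superclass so both error kinds are handled alike."""
    try:
        return evaluate(raises)
    except XPathError as exc:
        # Report which concrete class was caught.
        return type(exc).__name__


results = [
    safe_evaluate(XPathSyntaxError),  # what older versions always raised
    safe_evaluate(XPathEvalError),    # what evaluators raise after this change
    safe_evaluate(),                  # no error at all
]
print(results)
```

Because the handler names only the base class, calling code of this shape
should not need to change when a library moves an error from one subclass to
the other.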
+
+
 XSLT
 ====
 

Modified: lxml/trunk/src/lxml/extensions.pxi
==============================================================================
--- lxml/trunk/src/lxml/extensions.pxi (original)
+++ lxml/trunk/src/lxml/extensions.pxi Sun May 13 20:42:04 2007
@@ -3,10 +3,13 @@
 class XPathError(LxmlError):
     pass
 
-class XPathFunctionError(XPathError):
+class XPathEvalError(XPathError):
     pass
 
-class XPathResultError(XPathError):
+class XPathFunctionError(XPathEvalError):
+    pass
+
+class XPathResultError(XPathEvalError):
     pass
 
 # forward declarations

Modified: lxml/trunk/src/lxml/xpath.pxi
==============================================================================
--- lxml/trunk/src/lxml/xpath.pxi (original)
+++ lxml/trunk/src/lxml/xpath.pxi Sun May 13 20:42:04 2007
@@ -3,9 +3,6 @@
 class XPathSyntaxError(LxmlSyntaxError, XPathError):
     pass
 
-class XPathEvalError(XPathError):
-    pass
-
 ################################################################################
 # XPath
 
@@ -165,7 +162,7 @@
                self._xpathCtxt.lastError.message is not NULL:
             message = funicode(self._xpathCtxt.lastError.message)
         else:
-            message = "error in xpath expression"
+            message = "Error in xpath expression"
         raise XPathSyntaxError, message
 
     cdef _raise_eval_error(self):
@@ -178,12 +175,12 @@
         if entries:
             entry = entries[0]
             if entry is not None and entry.message:
-                raise XPathSyntaxError, entry.message
+                raise XPathEvalError, entry.message
         if self._xpathCtxt is not NULL and \
                self._xpathCtxt.lastError.message is not NULL:
             message = funicode(self._xpathCtxt.lastError.message)
         else:
-            message = "error in xpath evaluation"
+            message = "Error in xpath evaluation"
         raise XPathEvalError, message
 
     cdef object _handle_result(self, xpath.xmlXPathObject* xpathObj, _Document doc):

From scoder at codespeak.net Sun May 13 20:44:00 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sun, 13 May 2007 20:44:00 +0200 (CEST)
Subject: [Lxml-checkins] r43319 - lxml/trunk/src/lxml
Message-ID:
<20070513184400.31B33807C@code0.codespeak.net> Author: scoder Date: Sun May 13 20:43:59 2007 New Revision: 43319 Modified: lxml/trunk/src/lxml/dtd.pxi lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/extensions.pxi lxml/trunk/src/lxml/nsclasses.pxi lxml/trunk/src/lxml/parser.pxi lxml/trunk/src/lxml/relaxng.pxi lxml/trunk/src/lxml/sax.py lxml/trunk/src/lxml/schematron.pxi lxml/trunk/src/lxml/xmlschema.pxi lxml/trunk/src/lxml/xslt.pxi Log: docstrings on exceptions Modified: lxml/trunk/src/lxml/dtd.pxi ============================================================================== --- lxml/trunk/src/lxml/dtd.pxi (original) +++ lxml/trunk/src/lxml/dtd.pxi Sun May 13 20:43:59 2007 @@ -2,12 +2,18 @@ cimport dtdvalid class DTDError(LxmlError): + """Base class for DTD errors. + """ pass class DTDParseError(DTDError): + """Error while parsing a DTD. + """ pass class DTDValidateError(DTDError): + """Error while validating an XML document with a DTD. + """ pass ################################################################################ Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sun May 13 20:43:59 2007 @@ -85,6 +85,9 @@ # module level superclass for all exceptions class LxmlError(Error): + """Main exception base class for lxml. All other exceptions inherit from + this one. + """ def __init__(self, *args): _initError(self, *args) self.error_log = __copyGlobalErrorLog() @@ -106,15 +109,18 @@ # superclass for all syntax errors class LxmlSyntaxError(LxmlError, SyntaxError): - pass - -class DocumentInvalid(LxmlError): + """Base class for all syntax errors. + """ pass class XIncludeError(LxmlError): + """Error during XInclude processing. + """ pass class C14NError(LxmlError): + """Error during C14N serialisation. 
+ """ pass # version information @@ -1957,6 +1963,12 @@ ################################################################################ # Validation +class DocumentInvalid(LxmlError): + """Validation error. Raised by all document validators when their + ``assertValid(tree)`` method fails. + """ + pass + cdef class _Validator: "Base class for XML validators." cdef _ErrorLog _error_log Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Sun May 13 20:43:59 2007 @@ -1,15 +1,23 @@ # support for extension functions in XPath and XSLT class XPathError(LxmlError): + """Base class of all XPath errors. + """ pass class XPathEvalError(XPathError): + """Error during XPath evaluation. + """ pass class XPathFunctionError(XPathEvalError): + """Internal error looking up an XPath extension function. + """ pass class XPathResultError(XPathEvalError): + """Error handling an XPath result. + """ pass # forward declarations Modified: lxml/trunk/src/lxml/nsclasses.pxi ============================================================================== --- lxml/trunk/src/lxml/nsclasses.pxi (original) +++ lxml/trunk/src/lxml/nsclasses.pxi Sun May 13 20:43:59 2007 @@ -1,9 +1,13 @@ # module-level API for namespace implementations class LxmlRegistryError(LxmlError): + """Base class of lxml registry errors. + """ pass class NamespaceRegistryError(LxmlRegistryError): + """Error registering a namespace extension. + """ pass cdef object __NAMESPACE_REGISTRIES Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Sun May 13 20:43:59 2007 @@ -5,9 +5,13 @@ from xmlparser cimport xmlParserCtxt, xmlDict class XMLSyntaxError(LxmlSyntaxError): + """Syntax error while parsing an XML document. 
+ """ pass class ParserError(LxmlError): + """Internal lxml parser error. + """ pass ctypedef enum LxmlParserType: @@ -378,7 +382,7 @@ raise TypeError, "This class cannot be instantiated" self._parser_ctxt = pctxt if pctxt is NULL: - raise ParserError, "Failed to create parser context" + python.PyErr_NoMemory() if pctxt.sax != NULL: # hard switch-off for CDATA nodes => makes them plain text pctxt.sax.cdataBlock = NULL Modified: lxml/trunk/src/lxml/relaxng.pxi ============================================================================== --- lxml/trunk/src/lxml/relaxng.pxi (original) +++ lxml/trunk/src/lxml/relaxng.pxi Sun May 13 20:43:59 2007 @@ -2,12 +2,18 @@ cimport relaxng class RelaxNGError(LxmlError): + """Base class for RelaxNG errors. + """ pass class RelaxNGParseError(RelaxNGError): + """Error while parsing an XML document as RelaxNG. + """ pass class RelaxNGValidateError(RelaxNGError): + """Error while validating an XML document with a RelaxNG schema. + """ pass ################################################################################ Modified: lxml/trunk/src/lxml/sax.py ============================================================================== --- lxml/trunk/src/lxml/sax.py (original) +++ lxml/trunk/src/lxml/sax.py Sun May 13 20:43:59 2007 @@ -3,6 +3,8 @@ from etree import XML, Comment, ProcessingInstruction class SaxError(LxmlError): + """General SAX error. + """ pass def _getNsTag(tag): Modified: lxml/trunk/src/lxml/schematron.pxi ============================================================================== --- lxml/trunk/src/lxml/schematron.pxi (original) +++ lxml/trunk/src/lxml/schematron.pxi Sun May 13 20:43:59 2007 @@ -48,12 +48,18 @@ """ class SchematronError(LxmlError): + """Base class of all Schematron errors. + """ pass class SchematronParseError(SchematronError): + """Error while parsing an XML document as Schematron schema. 
+ """ pass class SchematronValidateError(SchematronError): + """Error while validating an XML document with a Schematron schema. + """ pass ################################################################################ Modified: lxml/trunk/src/lxml/xmlschema.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlschema.pxi (original) +++ lxml/trunk/src/lxml/xmlschema.pxi Sun May 13 20:43:59 2007 @@ -2,12 +2,18 @@ cimport xmlschema class XMLSchemaError(LxmlError): + """Base class of all XML Schema errors + """ pass class XMLSchemaParseError(XMLSchemaError): + """Error while parsing an XML document as XML Schema. + """ pass class XMLSchemaValidateError(XMLSchemaError): + """Error while validating an XML document with an XML Schema. + """ pass ################################################################################ Modified: lxml/trunk/src/lxml/xslt.pxi ============================================================================== --- lxml/trunk/src/lxml/xslt.pxi (original) +++ lxml/trunk/src/lxml/xslt.pxi Sun May 13 20:43:59 2007 @@ -3,18 +3,28 @@ cimport xslt class XSLTError(LxmlError): + """Base class of all XSLT errors. + """ pass class XSLTParseError(XSLTError): + """Error parsing a stylesheet document. + """ pass class XSLTApplyError(XSLTError): + """Error running an XSL transformation. + """ pass class XSLTSaveError(XSLTError): + """Error serialising an XSLT result. + """ pass class XSLTExtensionError(XSLTError): + """Error registering an XSLT extension. 
+    """
     pass
 
 # version information

From scoder at codespeak.net Mon May 14 10:33:48 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Mon, 14 May 2007 10:33:48 +0200 (CEST)
Subject: [Lxml-checkins] r43345 - lxml/trunk
Message-ID: <20070514083348.4EB4A807A@code0.codespeak.net>

Author: scoder
Date: Mon May 14 10:33:47 2007
New Revision: 43345

Modified:
   lxml/trunk/update-error-constants.py
Log:
comment

Modified: lxml/trunk/update-error-constants.py
==============================================================================
--- lxml/trunk/update-error-constants.py (original)
+++ lxml/trunk/update-error-constants.py Mon May 14 10:33:47 2007
@@ -92,7 +92,9 @@
 append_pxi('''\
 # Constants are stored in tuples of strings, for which Pyrex generates very
 # efficient setup code. To parse them, iterate over the tuples and parse each
-# line in each string independently.
+# line in each string independently. Tuples of strings (instead of a plain
+# string) are required as some C-compilers of a certain well-known OS vendor
+# cannot handle strings that are a few thousand bytes in length.
 ''')
 
 ctypedef_indent = ' '*4

From scoder at codespeak.net Mon May 14 10:33:57 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Mon, 14 May 2007 10:33:57 +0200 (CEST)
Subject: [Lxml-checkins] r43346 - lxml/trunk/doc
Message-ID: <20070514083357.951F18080@code0.codespeak.net>

Author: scoder
Date: Mon May 14 10:33:57 2007
New Revision: 43346

Modified:
   lxml/trunk/doc/extensions.txt
Log:
fix after XPath exception change

Modified: lxml/trunk/doc/extensions.txt
==============================================================================
--- lxml/trunk/doc/extensions.txt (original)
+++ lxml/trunk/doc/extensions.txt Mon May 14 10:33:57 2007
@@ -128,8 +128,9 @@
   >>> print xslt(doc)
   Ola Haegar
 
-It is also possible to register namespaces with a single evaluator.
While the
-following example involves no functions, the idea should still be clear::
+It is also possible to register namespaces with a single evaluator after its
+creation. While the following example involves no functions, the idea should
+still be clear::
 
   >>> f = StringIO('')
   >>> ns_doc = etree.parse(f)
@@ -138,7 +139,8 @@
   []
 
 This returns nothing, as we did not ask for the right namespace. When we
-register the namespace with the evaluator, we can access it via a prefix::
+register the namespace with the evaluator, however, we can access it via a
+prefix::
 
   >>> e.registerNamespace('foo', 'http://mydomain.org/myfunctions')
   >>> e.evaluate('/foo:a')[0].tag
@@ -151,7 +153,7 @@
   >>> e2.evaluate('/foo:a')
   Traceback (most recent call last):
     ...
-  XPathSyntaxError: error in xpath expression
+  XPathEvalError: Undefined namespace prefix
 
 
 Evaluator-local extensions

From scoder at codespeak.net Mon May 14 10:34:05 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Mon, 14 May 2007 10:34:05 +0200 (CEST)
Subject: [Lxml-checkins] r43347 - lxml/trunk/doc
Message-ID: <20070514083405.C4458807A@code0.codespeak.net>

Author: scoder
Date: Mon May 14 10:34:05 2007
New Revision: 43347

Modified:
   lxml/trunk/doc/api.txt
Log:
section on error logging

Modified: lxml/trunk/doc/api.txt
==============================================================================
--- lxml/trunk/doc/api.txt (original)
+++ lxml/trunk/doc/api.txt Mon May 14 10:34:05 2007
@@ -31,9 +31,10 @@
    3 Trees and Documents
    4 Iteration
    5 Error handling on exceptions
-   6 Serialisation
-   7 XInclude and ElementInclude
-   8 write_c14n on ElementTree
+   6 Error logging
+   7 Serialisation
+   8 XInclude and ElementInclude
+   9 write_c14n on ElementTree
 
 
 lxml.etree
@@ -188,29 +189,46 @@
 ----------------------------
 
 Libxml2 provides error messages for failures, be it during parsing, XPath
-evaluation or schema validation.
Whenever an exception is raised, you can
-retrieve the errors that occurred and "might have" led to the problem::
+evaluation or schema validation. The preferred way of accessing them is
+through the local ``error_log`` property of the respective evaluator or
+transformer object. See their documentation for details.
+
+However, lxml also keeps a global error log of all errors that occurred at the
+application level. Whenever an exception is raised, you can retrieve the
+errors that occurred and "might have" led to the problem from the error log
+copy attached to the exception::
 
   >>> etree.clearErrorLog()
-  >>> broken_xml = '<a>'
+  >>> broken_xml = '''
+  ... <root>
+  ...   <a>
+  ... </root>
+  ... '''
   >>> try:
   ...   etree.parse(StringIO(broken_xml))
   ... except etree.XMLSyntaxError, e:
   ...   pass # just put the exception into e
-  >>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
+
+Once you have caught this exception, you can access its ``error_log`` property
+to retrieve the log entries or filter them by a specific type, error domain or
+error level::
+
+  >>> log = e.error_log.filter_from_level(etree.ErrorLevels.FATAL)
   >>> print log
-  <string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
+  <string>:4:FATAL:PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: a line 3 and root
+  <string>:5:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag root line 2
 
 This might look a little cryptic at first, but it is the information that
 libxml2 gives you. At least the message at the end should give you a hint
-what went wrong and you can see that the fatal error (FATAL) happened during
-parsing (PARSER) line 1 of a string (<string>, or filename if available).
-Here, PARSER is the so-called error domain, see lxml.etree.ErrorDomains for
-that. You can get it from a log entry like this::
+what went wrong and you can see that the fatal errors (FATAL) happened during
+parsing (PARSER) lines 4 and 5 of a string (<string>, or the filename if
+available).
Here, PARSER is the so-called error domain, see
+``lxml.etree.ErrorDomains`` for that. You can get it from a log entry like
+this::
 
   >>> entry = log[0]
   >>> print entry.domain_name, entry.type_name, entry.filename
-  PARSER ERR_TAG_NOT_FINISHED <string>
+  PARSER ERR_TAG_NAME_MISMATCH <string>
 
 There is also a convenience attribute ``last_error`` that returns the last
 error or fatal error that occurred::
@@ -219,13 +237,16 @@
   >>> print entry.domain_name, entry.type_name, entry.filename
   PARSER ERR_TAG_NOT_FINISHED <string>
 
-Alternatively, lxml.etree supports logging libxml2 messages to the Python
-stdlib logging module. This is done through the ``etree.PyErrorLog`` class.
-It disables the error reporting from exceptions and forwards log messages to a
-Python logger. To use it, see the descriptions of the function
-``etree.useGlobalPythonLog`` and the class ``etree.PyErrorLog`` for help.
-Note that this does not affect the local error logs of XSLT, XMLSchema,
-etc. which are described in their respective sections below.
+
+Error logging
+-------------
+
+lxml.etree supports logging libxml2 messages to the Python stdlib logging
+module. This is done through the ``etree.PyErrorLog`` class. It disables the
+error reporting from exceptions and forwards log messages to a Python logger.
+To use it, see the descriptions of the function ``etree.useGlobalPythonLog``
+and the class ``etree.PyErrorLog`` for help. Note that this does not affect
+the local error logs of XSLT, XMLSchema, etc.
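
The forwarding idea described above can be sketched with the stdlib logging
module alone. This is an illustration under stated assumptions, not the
``etree.PyErrorLog`` implementation: the integer levels stand in for
``etree.ErrorLevels``, and ``LEVEL_MAP``, ``ListHandler``, ``forward`` and the
logger name are hypothetical names invented for the example.

```python
import logging

# Assumed stand-ins for etree.ErrorLevels: libxml2 numbers its error levels
# 1 (warning), 2 (error) and 3 (fatal).
WARNING, ERROR, FATAL = 1, 2, 3

# Illustrative mapping from libxml2-style levels to logging levels.
LEVEL_MAP = {
    WARNING: logging.WARNING,
    ERROR: logging.ERROR,
    FATAL: logging.CRITICAL,
}


class ListHandler(logging.Handler):
    """Keeps emitted records in a list so they can be inspected."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def forward(logger, entries):
    """Send (level, message) pairs to the logger at the mapped level."""
    for level, message in entries:
        logger.log(LEVEL_MAP.get(level, logging.NOTSET), "%s", message)


logger = logging.getLogger("lxml.sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the sketch's output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

forward(logger, [
    (FATAL, "Premature end of data in tag root line 2"),
    (WARNING, "hypothetical warning message"),
])
levels = [record.levelname for record in handler.records]
print(levels)
```

Once entries arrive as ordinary log records, the usual logging configuration
(handlers, filters, formatters) decides where they end up.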
Serialisation From scoder at codespeak.net Mon May 14 10:34:21 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 10:34:21 +0200 (CEST) Subject: [Lxml-checkins] r43348 - lxml/trunk/doc Message-ID: <20070514083421.A40DB807A@code0.codespeak.net> Author: scoder Date: Mon May 14 10:34:21 2007 New Revision: 43348 Modified: lxml/trunk/doc/capi.txt Log: small fixes Modified: lxml/trunk/doc/capi.txt ============================================================================== --- lxml/trunk/doc/capi.txt (original) +++ lxml/trunk/doc/capi.txt Mon May 14 10:34:21 2007 @@ -9,7 +9,7 @@ The API is described in the file `etreepublic.pxd`_, which is directly c-importable by Pyrex modules. -.. _`etreepublic.pxd`: http://codespeak.net/svn/lxml/branch/capi/src/lxml/etreepublic.pxd +.. _`etreepublic.pxd`: http://codespeak.net/svn/lxml/trunk/src/lxml/etreepublic.pxd .. contents:: .. @@ -23,6 +23,8 @@ This is the easiest way of extending lxml at the C level. A Pyrex module should start like this:: + # My Pyrex extension + # import the public functions and classes of lxml.etree cimport etreepublic as cetree @@ -47,7 +49,8 @@ def setValue(self, myval): self.set("my_attribute", myval) - etree.setDefaultElementClass(NewElementClass) + etree.setElementClassLookup( + DefaultElementClassLookup(element=NewElementClass)) Writing external modules in C From scoder at codespeak.net Mon May 14 10:34:28 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 10:34:28 +0200 (CEST) Subject: [Lxml-checkins] r43349 - lxml/trunk/src/lxml Message-ID: <20070514083428.6D9E88080@code0.codespeak.net> Author: scoder Date: Mon May 14 10:34:28 2007 New Revision: 43349 Modified: lxml/trunk/src/lxml/xpath.pxi Log: cleanup Modified: lxml/trunk/src/lxml/xpath.pxi ============================================================================== --- lxml/trunk/src/lxml/xpath.pxi (original) +++ lxml/trunk/src/lxml/xpath.pxi Mon May 14 10:34:28 2007 @@ -167,11 
+167,8 @@ cdef _raise_eval_error(self): entries = self._error_log.filter_types(_XPATH_EVAL_ERRORS) - if entries: - entry = entries[0] - if entry is not None and entry.message: - raise XPathEvalError, entry.message - entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) + if not entries: + entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) if entries: entry = entries[0] if entry is not None and entry.message: From scoder at codespeak.net Mon May 14 10:34:35 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 10:34:35 +0200 (CEST) Subject: [Lxml-checkins] r43350 - lxml/trunk/src/lxml Message-ID: <20070514083435.598158081@code0.codespeak.net> Author: scoder Date: Mon May 14 10:34:35 2007 New Revision: 43350 Modified: lxml/trunk/src/lxml/xmlerror.pxi Log: cleanup, doc strings Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Mon May 14 10:34:35 2007 @@ -129,6 +129,9 @@ self._entries = entries def copy(self): + """Creates a shallow copy of this error log. Reuses the list of + entries. + """ return _ListErrorLog(self._entries, self.last_error) def __iter__(self): @@ -156,6 +159,9 @@ return bool(self._entries) def filter_domains(self, domains): + """Filter the errors by the given domains and return a new error log + containing the matches. + """ cdef _LogEntry entry filtered = [] if not python.PySequence_Check(domains): @@ -166,6 +172,9 @@ return _ListErrorLog(filtered) def filter_types(self, types): + """Filter the errors by the given types and return a new error log + containing the matches. + """ cdef _LogEntry entry if not python.PySequence_Check(types): types = (types,) @@ -176,8 +185,9 @@ return _ListErrorLog(filtered) def filter_levels(self, levels): - """Return a log with all messages of the requested level(s). 
Takes a - single log level or a sequence.""" + """Filter the errors by the given error levels and return a new error + log containing the matches. + """ cdef _LogEntry entry if not python.PySequence_Check(levels): levels = (levels,) @@ -223,6 +233,8 @@ del self._entries[:] def copy(self): + """Creates a shallow copy of this error log and the list of entries. + """ return _ListErrorLog(self._entries[:], self.last_error) def __iter__(self): @@ -270,7 +282,8 @@ object and calls ``self.log(log_entry, format_string, arg1, arg2, ...)`` with appropriate data. """ - cdef public object level_map + cdef readonly object level_map + cdef object _map_level cdef object _log def __init__(self, logger_name=None): _BaseErrorLog.__init__(self) @@ -280,6 +293,7 @@ ErrorLevels.ERROR : logging.ERROR, ErrorLevels.FATAL : logging.CRITICAL } + self._map_level = self.level_map.get if logger_name: logger = logging.getLogger(logger_name) else: @@ -287,11 +301,13 @@ self._log = logger.log def copy(self): + """Dummy method that returns an empty error log. + """ return _ListErrorLog([]) def log(self, entry, message_format_string, *args): self._log( - self.level_map.get(entry.level, 0), + self._map_level(entry.level, 0), message_format_string, *args ) @@ -310,9 +326,8 @@ """Replace the global error log by an etree.PyErrorLog that uses the standard Python logging package. - Note that this slows down processing and disables access to the global - error log from exceptions. Parsers, XSLT etc. will continue to provide - their normal local error log. + Note that this disables access to the global error log from exceptions. + Parsers, XSLT etc. will continue to provide their normal local error log. 
""" global __GLOBAL_ERROR_LOG __GLOBAL_ERROR_LOG = log From scoder at codespeak.net Mon May 14 10:34:42 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 10:34:42 +0200 (CEST) Subject: [Lxml-checkins] r43351 - lxml/trunk/src/lxml Message-ID: <20070514083442.3F6948081@code0.codespeak.net> Author: scoder Date: Mon May 14 10:34:41 2007 New Revision: 43351 Modified: lxml/trunk/src/lxml/etreepublic.pxd Log: fixed field name in C-API Modified: lxml/trunk/src/lxml/etreepublic.pxd ============================================================================== --- lxml/trunk/src/lxml/etreepublic.pxd (original) +++ lxml/trunk/src/lxml/etreepublic.pxd Mon May 14 10:34:41 2007 @@ -36,7 +36,7 @@ cdef class lxml.etree._ElementTree [ object LxmlElementTree ]: cdef _Document _doc - cdef _Element _element + cdef _Element _context_node cdef class lxml.etree.ElementClassLookup [ object LxmlElementClassLookup ]: cdef object (*_lookup_function)(object, _Document, tree.xmlNode*) @@ -82,7 +82,7 @@ cdef object lookupNamespaceElementClass(_1, _Document _2, tree.xmlNode* c_node) - # call the fallback lookup function of an FallbackElementClassLookup + # call the fallback lookup function of a FallbackElementClassLookup cdef object callLookupFallback(FallbackElementClassLookup lookup, _Document doc, tree.xmlNode* c_node) From scoder at codespeak.net Mon May 14 10:34:49 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 10:34:49 +0200 (CEST) Subject: [Lxml-checkins] r43352 - in lxml/trunk/src/lxml: . 
tests Message-ID: <20070514083449.4B3E0807A@code0.codespeak.net> Author: scoder Date: Mon May 14 10:34:48 2007 New Revision: 43352 Modified: lxml/trunk/src/lxml/classlookup.pxi lxml/trunk/src/lxml/tests/test_classlookup.py Log: cleanup, support custom PI/comment classes as fallback Modified: lxml/trunk/src/lxml/classlookup.pxi ============================================================================== --- lxml/trunk/src/lxml/classlookup.pxi (original) +++ lxml/trunk/src/lxml/classlookup.pxi Mon May 14 10:34:48 2007 @@ -79,6 +79,9 @@ cdef class ElementDefaultClassLookup(ElementClassLookup): """Element class lookup scheme that always returns the default Element class. + + The keyword arguments ``element``, ``comment`` and ``pi`` accept the + respective Element classes. """ cdef readonly object element_class cdef readonly object comment_class @@ -86,21 +89,21 @@ def __init__(self, element=None, comment=None, pi=None): self._lookup_function = _lookupDefaultElementClass if element is None: - self.element_class = _Element + self.element_class = None elif issubclass(element, ElementBase): self.element_class = element else: raise TypeError, "element class must be subclass of ElementBase" if comment is None: - self.comment_class = _Comment + self.comment_class = None elif issubclass(comment, CommentBase): self.comment_class = comment else: raise TypeError, "comment class must be subclass of CommentBase" if pi is None: - self.pi_class = _ProcessingInstruction + self.pi_class = None elif issubclass(pi, PIBase): self.pi_class = pi else: @@ -109,17 +112,23 @@ cdef object _lookupDefaultElementClass(state, _Document _doc, xmlNode* c_node): "Trivial class lookup function that always returns the default class." 
if c_node.type == tree.XML_ELEMENT_NODE: - if state is None: + if state is not None: + cls = (state).element_class + if cls is None: return _Element else: - return (state).element_class + return cls elif c_node.type == tree.XML_COMMENT_NODE: - if state is None: + if state is not None: + cls = (state).comment_class + if cls is None: return _Comment else: - return (state).comment_class + return cls elif c_node.type == tree.XML_PI_NODE: - if state is None: + if state is not None: + cls = (state).pi_class + if cls is None: # special case XSLT-PI if c_node.name is not NULL and c_node.content is not NULL: if cstd.strcmp(c_node.name, "xml-stylesheet") == 0: @@ -128,7 +137,7 @@ return _XSLTProcessingInstruction return _ProcessingInstruction else: - return (state).pi_class + return cls else: assert 0, "Unknown node type: %s" % c_node.type @@ -145,9 +154,9 @@ dictionary. Arguments: - * attribute name ('{ns}name' style string) - * class mapping (Python dict mapping attribute values to Element classes) - * fallback (optional fallback lookup mechanism) + * attribute name - '{ns}name' style string + * class mapping - Python dict mapping attribute values to Element classes + * fallback - optional fallback lookup mechanism A None key in the class mapping will be checked if the attribute is missing. 
@@ -194,10 +203,9 @@ cdef object _parser_class_lookup(state, _Document doc, xmlNode* c_node): cdef FallbackElementClassLookup lookup lookup = state - if c_node.type == tree.XML_ELEMENT_NODE: - if doc._parser._class_lookup is not None: - return doc._parser._class_lookup._lookup_function( - doc._parser._class_lookup, doc, c_node) + if doc._parser._class_lookup is not None: + return doc._parser._class_lookup._lookup_function( + doc._parser._class_lookup, doc, c_node) return lookup._callFallback(doc, c_node) Modified: lxml/trunk/src/lxml/tests/test_classlookup.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_classlookup.py (original) +++ lxml/trunk/src/lxml/tests/test_classlookup.py Mon May 14 10:34:48 2007 @@ -51,6 +51,31 @@ def test_default_class_lookup(self): class TestElement(etree.ElementBase): + FIND_ME = "default element" + class TestComment(etree.CommentBase): + FIND_ME = "default comment" + class TestPI(etree.PIBase): + FIND_ME = "default pi" + + parser = etree.XMLParser() + + lookup = etree.ElementDefaultClassLookup( + element=TestElement, comment=TestComment, pi=TestPI) + parser.setElementClassLookup(lookup) + + root = etree.XML(""" + + + + + """, parser) + + self.assertEquals("default element", root.FIND_ME) + self.assertEquals("default pi", root[0].FIND_ME) + self.assertEquals("default comment", root[1].FIND_ME) + + def test_default_class_lookup_is_not_nslookup(self): + class TestElement(etree.ElementBase): FIND_ME = "namespace class" ns = etree.Namespace("myNS") From scoder at codespeak.net Mon May 14 15:13:01 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 15:13:01 +0200 (CEST) Subject: [Lxml-checkins] r43363 - in lxml/trunk: . 
src/lxml src/lxml/tests Message-ID: <20070514131301.345008081@code0.codespeak.net> Author: scoder Date: Mon May 14 15:12:59 2007 New Revision: 43363 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/tests/test_etree.py lxml/trunk/src/lxml/xmlerror.pxd lxml/trunk/src/lxml/xmlerror.pxi Log: column field on error log entries Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Mon May 14 15:12:59 2007 @@ -8,6 +8,8 @@ Features added -------------- +* ``column`` field on error log entries to accompany the ``line`` field + * Error specific messages in XPath parsing and evaluation NOTE: for evaluation errors, you will now get an XPathEvalError instead of an XPathSyntaxError. To catch both, you can except on ``XPathError`` Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Mon May 14 15:12:59 2007 @@ -183,6 +183,10 @@ if 'PARSER' in log.domain_name]) self.assert_([ log for log in logs if 'TAG_NAME_MISMATCH' in log.type_name ]) + self.assert_([ log for log in logs + if 1 == log.line ]) + self.assert_([ log for log in logs + if 15 == log.column ]) def test_parse_error_from_file(self): parse = self.etree.parse Modified: lxml/trunk/src/lxml/xmlerror.pxd ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxd (original) +++ lxml/trunk/src/lxml/xmlerror.pxd Mon May 14 15:12:59 2007 @@ -780,6 +780,8 @@ char* str2 char* str3 int line + int int1 + int int2 ctypedef void (*xmlGenericErrorFunc)(void* ctxt, char* msg, ...) 
ctypedef void (*xmlStructuredErrorFunc)(void* userData, xmlError* error) Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Mon May 14 15:12:59 2007 @@ -36,15 +36,18 @@ cdef readonly object domain cdef readonly object type cdef readonly object line + cdef readonly object column cdef readonly object level cdef readonly object message cdef readonly object filename + cdef _setError(self, xmlerror.xmlError* error): cdef int size self.domain = error.domain self.type = error.code self.level = error.level self.line = error.line + self.column = error.int2 size = cstd.strlen(error.message) if size > 0 and error.message[size-1] == c'\n': size = size - 1 # strip EOL From scoder at codespeak.net Mon May 14 15:22:38 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 15:22:38 +0200 (CEST) Subject: [Lxml-checkins] r43365 - lxml/trunk Message-ID: <20070514132238.4F7748084@code0.codespeak.net> Author: scoder Date: Mon May 14 15:22:38 2007 New Revision: 43365 Modified: lxml/trunk/TODO.txt Log: started list of todo's for lxml 2.0 Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Mon May 14 15:22:38 2007 @@ -1,3 +1,7 @@ +=============== +ToDo's for lxml +=============== + lxml ==== @@ -44,3 +48,11 @@ * RelaxNG compact notation (rnc versus rng) support. Currently not supported by libxml2 (patch exists) + + +lxml 2.0 +======== + +* print column number in error log lines + +* clean up (and remove?) 
duplicated API for extension functions From scoder at codespeak.net Mon May 14 19:26:59 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 19:26:59 +0200 (CEST) Subject: [Lxml-checkins] r43379 - lxml/trunk Message-ID: <20070514172659.9A6088075@code0.codespeak.net> Author: scoder Date: Mon May 14 19:26:59 2007 New Revision: 43379 Modified: lxml/trunk/CHANGES.txt Log: cleanup Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Mon May 14 19:26:59 2007 @@ -20,7 +20,7 @@ * ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support adding processing instructions and comments around the root node -* Element.attrib now has a ``pop()`` method +* ``Element.attrib`` was missing ``clear()`` and ``pop()`` methods * Extended type annotation in objectify: cleaner annotation namespace setup plus new ``xsiannotate()`` and ``deannotate()`` functions @@ -46,8 +46,6 @@ * More ET compatible behaviour when writing out XML declarations or not -* ``Element.attrib`` was missing ``clear()`` method - * More robust error handling in ``iterparse()`` * Documents lost their top-level PIs and comments on serialisation From scoder at codespeak.net Mon May 14 19:35:03 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 19:35:03 +0200 (CEST) Subject: [Lxml-checkins] r43380 - lxml/trunk Message-ID: <20070514173503.750C1807A@code0.codespeak.net> Author: scoder Date: Mon May 14 19:35:03 2007 New Revision: 43380 Modified: lxml/trunk/TODO.txt Log: more comments on 2.0 Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Mon May 14 19:35:03 2007 @@ -53,6 +53,11 @@ lxml 2.0 ======== -* print column number in error log lines +* reformat error log lines, add column number + +* always use '' 
as URL when tree was parsed from string? (can libxml2 + handle this?) * clean up (and remove?) duplicated API for extension functions + +* find a way to integrate Schematron (if it's available) From scoder at codespeak.net Mon May 14 22:42:52 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 14 May 2007 22:42:52 +0200 (CEST) Subject: [Lxml-checkins] r43386 - in lxml/trunk: . doc src/lxml Message-ID: <20070514204252.62EC1807A@code0.codespeak.net> Author: scoder Date: Mon May 14 22:42:51 2007 New Revision: 43386 Modified: lxml/trunk/CHANGES.txt lxml/trunk/doc/objectify.txt lxml/trunk/src/lxml/objectify.pyx Log: objectify didn't handle prefixed type names in xsi:type Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Mon May 14 22:42:51 2007 @@ -44,6 +44,8 @@ Bugs fixed ---------- +* Objectify couldn't handle prefixed XSD type names in ``xsi:type`` + * More ET compatible behaviour when writing out XML declarations or not * More robust error handling in ``iterparse()`` Modified: lxml/trunk/doc/objectify.txt ============================================================================== --- lxml/trunk/doc/objectify.txt (original) +++ lxml/trunk/doc/objectify.txt Mon May 14 22:42:51 2007 @@ -733,22 +733,23 @@ and/or 'xsi:type' information:: >>> root = objectify.fromstring('''\ - ... - ... 5 - ... 5 - ... 5 + ... + ... 5 + ... 5 + ... 5 ... 
''') >>> objectify.annotate(root) >>> print objectify.dump(root) root = None [ObjectifiedElement] d = 5.0 [FloatElement] - * xsi:type = 'double' + * xsi:type = 'xsd:double' * py:pytype = 'float' l = 5L [LongElement] - * xsi:type = 'long' + * xsi:type = 'xsd:long' * py:pytype = 'long' s = '5' [StringElement] - * xsi:type = 'string' + * xsi:type = 'xsd:string' * py:pytype = 'str' >>> objectify.deannotate(root) >>> print objectify.dump(root) @@ -780,7 +781,7 @@ root = None [ObjectifiedElement] x = 5L [LongElement] * py:pytype = 'long' - * xsi:type = 'integer' + * xsi:type = 'xsd:integer' There is a side effect of the type lookup. If you assign a string value using attribute assignment and that string value turns out to be valid for any of Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Mon May 14 22:42:51 2007 @@ -1011,12 +1011,13 @@ xsi_ns = "{%s}" % XML_SCHEMA_INSTANCE_NS pytype_ns = "{%s}" % PYTYPE_NAMESPACE for name, value in cetree.iterattributes(element, 3): - if name == PYTYPE_ATTRIBUTE: - if value == TREE_PYTYPE: - continue - else: - name = name.replace(pytype_ns, 'py:') - name = name.replace(xsi_ns, 'xsi:') + if '{' in name: + if name == PYTYPE_ATTRIBUTE: + if value == TREE_PYTYPE: + continue + else: + name = name.replace(pytype_ns, 'py:') + name = name.replace(xsi_ns, 'xsi:') result = result + "%s * %s = %r\n" % (indentstr, name, value) indent = indent + 1 @@ -1097,6 +1098,9 @@ if value is not None: dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) + if dict_result is NULL and ':' in value: + prefix, value = value.split(':', 1) + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) if dict_result is not NULL: return (dict_result)._type @@ -1516,6 +1520,9 @@ if value is not None: dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) + if dict_result is NULL and ':' in 
value: + prefix, value = value.split(':', 1) + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) if dict_result is not NULL: pytype = dict_result @@ -1574,6 +1581,9 @@ c_node, _XML_SCHEMA_INSTANCE_NS, "type") if typename is not None: dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename) + if dict_result is NULL and ':' in typename: + prefix, typename = typename.split(':', 1) + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename) if dict_result is not NULL: pytype = dict_result if pytype is not StrType: @@ -1617,6 +1627,15 @@ cetree.delAttributeFromNsName(c_node, _XML_SCHEMA_INSTANCE_NS, "type") else: # update or create attribute + c_ns = cetree.findOrBuildNodeNs(doc, c_node, _XML_SCHEMA_NS) + if c_ns is not NULL: + if c_ns.prefix is not NULL and c_ns.prefix[0] != c'\0': + if ':' in typename: + oldprefix, name = typename.split(':', 1) + if cstd.strcmp(_cstr(oldprefix), c_ns.prefix) != 0: + typename = c_ns.prefix + ':' + name + elif ':' in typename: + _, typename = typename.split(':', 1) c_ns = cetree.findOrBuildNodeNs(doc, c_node, _XML_SCHEMA_INSTANCE_NS) tree.xmlSetNsProp(c_node, c_ns, "type", _cstr(typename)) tree.END_FOR_EACH_ELEMENT_FROM(c_node) @@ -1729,6 +1748,7 @@ if the type can be identified. If '_pytype' or '_xsi' are among the keyword arguments, they will be used instead. 
""" + cdef python.PyObject* dict_result if nsmap is None: nsmap = _DEFAULT_NSMAP if attrib is not None: @@ -1736,12 +1756,30 @@ attrib.update(_attributes) _attributes = attrib if _xsi is not None: + if ':' in _xsi: + prefix, name = _xsi.split(':', 1) + ns = nsmap.get(prefix) + if ns != XML_SCHEMA_NS: + raise TypeError, "XSD types require the XSD namespace" + elif nsmap is _DEFAULT_NSMAP: + name = _xsi + _xsi = 'xsd' + ':' + _xsi + else: + name = _xsi + for p, ns in nsmap.items(): + if ns == XML_SCHEMA_NS: + _xsi = prefix + ':' + _xsi + break + else: + raise TypeError, "XSD types require the XSD namespace" python.PyDict_SetItem(_attributes, XML_SCHEMA_INSTANCE_TYPE_ATTR, _xsi) if _pytype is None: - # allow for s.o. using unregistered or even wrong xsi:type names - pytype_lookup = _SCHEMA_TYPE_DICT.get(_xsi) - if pytype_lookup is not None: - _pytype = pytype_lookup.name + # allow using unregistered or even wrong xsi:type names + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, _xsi) + if dict_result is NULL: + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, name) + if dict_result is not NULL: + _pytype = (dict_result).name if python._isString(_value): strval = _value From scoder at codespeak.net Tue May 15 08:40:31 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 15 May 2007 08:40:31 +0200 (CEST) Subject: [Lxml-checkins] r43400 - lxml/trunk Message-ID: <20070515064031.A8535807A@code0.codespeak.net> Author: scoder Date: Tue May 15 08:40:30 2007 New Revision: 43400 Modified: lxml/trunk/TODO.txt Log: todo Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Tue May 15 08:40:30 2007 @@ -61,3 +61,5 @@ * clean up (and remove?) 
duplicated API for extension functions * find a way to integrate Schematron (if it's available) + +* always use ns-prefixed type names in objectify's ``xsi:type`` attributes From scoder at codespeak.net Tue May 15 16:56:12 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 15 May 2007 16:56:12 +0200 (CEST) Subject: [Lxml-checkins] r43410 - in lxml/trunk: . src/lxml Message-ID: <20070515145612.34B6B8084@code0.codespeak.net> Author: scoder Date: Tue May 15 16:56:11 2007 New Revision: 43410 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/objectify.pyx Log: empty (non-None) string prefixes handled as None instead of empty string Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Tue May 15 16:56:11 2007 @@ -44,6 +44,8 @@ Bugs fixed ---------- +* passing '' as namespace prefix in nsmap could be passed through to libxml2 + * Objectify couldn't handle prefixed XSD type names in ``xsi:type`` * More ET compatible behaviour when writing out XML declarations or not Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Tue May 15 16:56:11 2007 @@ -355,7 +355,7 @@ for prefix, href in nsmap.items(): href_utf = _utf8(href) c_href = _cstr(href_utf) - if prefix is not None: + if prefix is not None and prefix: prefix_utf = _utf8(prefix) c_prefix = _cstr(prefix_utf) else: Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Tue May 15 16:56:11 2007 @@ -1768,7 +1768,8 @@ name = _xsi for p, ns in nsmap.items(): if ns == XML_SCHEMA_NS: - _xsi = prefix + ':' + _xsi + if p is not None and P: + _xsi = prefix + ':' + 
_xsi break else: raise TypeError, "XSD types require the XSD namespace" From scoder at codespeak.net Tue May 15 18:37:46 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 15 May 2007 18:37:46 +0200 (CEST) Subject: [Lxml-checkins] r43416 - lxml/trunk/src/lxml Message-ID: <20070515163746.B3B428084@code0.codespeak.net> Author: scoder Date: Tue May 15 18:37:45 2007 New Revision: 43416 Modified: lxml/trunk/src/lxml/objectify.pyx Log: typo Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Tue May 15 18:37:45 2007 @@ -1768,7 +1768,7 @@ name = _xsi for p, ns in nsmap.items(): if ns == XML_SCHEMA_NS: - if p is not None and P: + if p is not None and p: _xsi = prefix + ':' + _xsi break else: From scoder at codespeak.net Tue May 15 18:38:38 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 15 May 2007 18:38:38 +0200 (CEST) Subject: [Lxml-checkins] r43417 - lxml/trunk/src/lxml Message-ID: <20070515163838.060698084@code0.codespeak.net> Author: scoder Date: Tue May 15 18:38:37 2007 New Revision: 43417 Modified: lxml/trunk/src/lxml/xmlerror.pxi Log: memory access fix Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Tue May 15 18:38:37 2007 @@ -406,6 +406,11 @@ c_error.domain = xmlerror.XML_FROM_XSLT c_error.code = xmlerror.XML_ERR_OK # what else? c_error.level = xmlerror.XML_ERR_ERROR # what else? 
+ c_error.str1 = NULL + c_error.str2 = NULL + c_error.str3 = NULL + c_error.int1 = 0 + c_error.int2 = 0 _forwardError(c_log_handler, &c_error) From scoder at codespeak.net Tue May 15 18:44:50 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 15 May 2007 18:44:50 +0200 (CEST) Subject: [Lxml-checkins] r43418 - lxml/trunk/src/lxml Message-ID: <20070515164450.679138084@code0.codespeak.net> Author: scoder Date: Tue May 15 18:44:50 2007 New Revision: 43418 Modified: lxml/trunk/src/lxml/extensions.pxi Log: memory leak Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Tue May 15 18:44:50 2007 @@ -323,6 +323,7 @@ self._compile_map = {} cdef _make_string(self, value): + cdef char* c_text if _isString(value): return value elif python.PyList_Check(value): @@ -333,8 +334,10 @@ if _isString(firstnode): return firstnode elif isinstance(firstnode, _Element): - return funicode( - tree.xmlNodeGetContent((<_Element>firstnode)._c_node)) + c_text = tree.xmlNodeGetContent((<_Element>firstnode)._c_node) + s = funicode(c_text) + tree.xmlFree(c_text) + return s else: return str(firstnode) else: From scoder at codespeak.net Wed May 16 00:18:04 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 00:18:04 +0200 (CEST) Subject: [Lxml-checkins] r43421 - lxml/trunk/src/lxml Message-ID: <20070515221804.EF1A58084@code0.codespeak.net> Author: scoder Date: Wed May 16 00:18:03 2007 New Revision: 43421 Modified: lxml/trunk/src/lxml/xmlerror.pxi Log: cleanup Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Wed May 16 00:18:03 2007 @@ -406,10 +406,6 @@ c_error.domain = xmlerror.XML_FROM_XSLT c_error.code = xmlerror.XML_ERR_OK # what 
else? c_error.level = xmlerror.XML_ERR_ERROR # what else? - c_error.str1 = NULL - c_error.str2 = NULL - c_error.str3 = NULL - c_error.int1 = 0 c_error.int2 = 0 _forwardError(c_log_handler, &c_error) From scoder at codespeak.net Wed May 16 00:19:16 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 00:19:16 +0200 (CEST) Subject: [Lxml-checkins] r43422 - lxml/trunk/src/lxml Message-ID: <20070515221916.E1A048084@code0.codespeak.net> Author: scoder Date: Wed May 16 00:19:16 2007 New Revision: 43422 Modified: lxml/trunk/src/lxml/extensions.pxi Log: default to returning str(value) from XPath string coercion function instead of exception Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Wed May 16 00:19:16 2007 @@ -341,7 +341,7 @@ else: return str(firstnode) else: - raise TypeError, "Invalid argument type %s" % type(value) + return str(value) cdef _compile(self, rexp, ignore_case): cdef python.PyObject* c_result From scoder at codespeak.net Wed May 16 00:20:36 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 00:20:36 +0200 (CEST) Subject: [Lxml-checkins] r43423 - in lxml/trunk: . 
src/lxml Message-ID: <20070515222036.36C698084@code0.codespeak.net> Author: scoder Date: Wed May 16 00:20:35 2007 New Revision: 43423 Modified: lxml/trunk/TODO.txt lxml/trunk/src/lxml/etreepublic.pxd lxml/trunk/src/lxml/public-api.pxi Log: provide findOrBuildNodeNsPrefix() function in C-API to support a preferred prefix Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Wed May 16 00:20:35 2007 @@ -63,3 +63,6 @@ * find a way to integrate Schematron (if it's available) * always use ns-prefixed type names in objectify's ``xsi:type`` attributes + +* remove ``findOrBuildNodeNs()`` from C-API (replaced by + findOrBuildNodeNsPrefix) Modified: lxml/trunk/src/lxml/etreepublic.pxd ============================================================================== --- lxml/trunk/src/lxml/etreepublic.pxd (original) +++ lxml/trunk/src/lxml/etreepublic.pxd Wed May 16 00:20:35 2007 @@ -201,6 +201,10 @@ cdef tree.xmlNs* findOrBuildNodeNs(_Document doc, tree.xmlNode* c_node, char* href) + # recursively lookup a namespace in element or ancestors, or create it + cdef tree.xmlNs* findOrBuildNodeNsPrefix( + _Document doc, tree.xmlNode* c_node, char* href, char* prefix) + # find the Document of an Element, ElementTree or Document (itself!) 
cdef _Document documentOrRaise(object input) Modified: lxml/trunk/src/lxml/public-api.pxi ============================================================================== --- lxml/trunk/src/lxml/public-api.pxi (original) +++ lxml/trunk/src/lxml/public-api.pxi Wed May 16 00:20:35 2007 @@ -142,3 +142,9 @@ if doc is None: raise TypeError return doc._findOrBuildNodeNs(c_node, href, NULL) + +cdef public tree.xmlNs* findOrBuildNodeNsPrefix( + _Document doc, xmlNode* c_node, char* href, char* prefix) except NULL: + if doc is None: + raise TypeError + return doc._findOrBuildNodeNs(c_node, href, prefix) From scoder at codespeak.net Wed May 16 00:21:29 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 00:21:29 +0200 (CEST) Subject: [Lxml-checkins] r43424 - in lxml/trunk: doc src/lxml src/lxml/tests Message-ID: <20070515222129.3D93F8084@code0.codespeak.net> Author: scoder Date: Wed May 16 00:21:28 2007 New Revision: 43424 Modified: lxml/trunk/doc/objectify.txt lxml/trunk/src/lxml/objectify.pyx lxml/trunk/src/lxml/tests/test_objectify.py Log: always use xsd namespace prefixes for schema types in objectify's xsi:type Modified: lxml/trunk/doc/objectify.txt ============================================================================== --- lxml/trunk/doc/objectify.txt (original) +++ lxml/trunk/doc/objectify.txt Wed May 16 00:21:28 2007 @@ -688,19 +688,19 @@ >>> root = objectify.fromstring('''\ ... - ... 5 - ... 5 - ... 5 + ... 5 + ... 5 + ... 5 ... ... 
''') >>> print objectify.dump(root) root = None [ObjectifiedElement] d = 5.0 [FloatElement] - * xsi:type = 'double' + * xsi:type = 'xsd:double' l = 5L [LongElement] - * xsi:type = 'long' + * xsi:type = 'xsd:long' s = '5' [StringElement] - * xsi:type = 'string' + * xsi:type = 'xsd:string' Again, there is a utility function ``xsiannotate()`` that recursively generates the "xsi:type" attribute for the elements of a tree:: @@ -719,11 +719,11 @@ >>> print objectify.dump(root) root = None [ObjectifiedElement] a = 'test' [StringElement] - * xsi:type = 'string' + * xsi:type = 'xsd:string' b = 5 [IntElement] - * xsi:type = 'int' + * xsi:type = 'xsd:int' c = True [BoolElement] - * xsi:type = 'boolean' + * xsi:type = 'xsd:boolean' Note, however, that ``xsiannotate()`` will always use the first XML Schema datatype that is defined for any given Python type, see also Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Wed May 16 00:21:28 2007 @@ -1538,7 +1538,8 @@ c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) else: # update or create attribute - c_ns = cetree.findOrBuildNodeNs(doc, c_node, _PYTYPE_NAMESPACE) + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _PYTYPE_NAMESPACE, 'py') tree.xmlSetNsProp(c_node, c_ns, _PYTYPE_ATTRIBUTE_NAME, _cstr(pytype.name)) tree.END_FOR_EACH_ELEMENT_FROM(c_node) @@ -1627,16 +1628,21 @@ cetree.delAttributeFromNsName(c_node, _XML_SCHEMA_INSTANCE_NS, "type") else: # update or create attribute - c_ns = cetree.findOrBuildNodeNs(doc, c_node, _XML_SCHEMA_NS) + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _XML_SCHEMA_NS, 'xsd') if c_ns is not NULL: - if c_ns.prefix is not NULL and c_ns.prefix[0] != c'\0': - if ':' in typename: - oldprefix, name = typename.split(':', 1) - if cstd.strcmp(_cstr(oldprefix), c_ns.prefix) != 0: - typename = c_ns.prefix + ':' + name - elif ':' in 
typename: - _, typename = typename.split(':', 1) - c_ns = cetree.findOrBuildNodeNs(doc, c_node, _XML_SCHEMA_INSTANCE_NS) + if ':' in typename: + prefix, name = typename.split(':', 1) + if c_ns.prefix is NULL or c_ns.prefix[0] == c'\0': + typename = name + elif cstd.strcmp(_cstr(prefix), c_ns.prefix) != 0: + prefix = c_ns.prefix + typename = prefix + ':' + name + elif c_ns.prefix is not NULL or c_ns.prefix[0] != c'\0': + prefix = c_ns.prefix + typename = prefix + ':' + typename + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _XML_SCHEMA_INSTANCE_NS, 'xsi') tree.xmlSetNsProp(c_node, c_ns, "type", _cstr(typename)) tree.END_FOR_EACH_ELEMENT_FROM(c_node) @@ -1763,12 +1769,12 @@ raise TypeError, "XSD types require the XSD namespace" elif nsmap is _DEFAULT_NSMAP: name = _xsi - _xsi = 'xsd' + ':' + _xsi + _xsi = 'xsd:' + _xsi else: name = _xsi - for p, ns in nsmap.items(): + for prefix, ns in nsmap.items(): if ns == XML_SCHEMA_NS: - if p is not None and p: + if prefix is not None and prefix: _xsi = prefix + ':' + _xsi break else: Modified: lxml/trunk/src/lxml/tests/test_objectify.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_objectify.py (original) +++ lxml/trunk/src/lxml/tests/test_objectify.py Wed May 16 00:21:28 2007 @@ -601,19 +601,19 @@ child_types = [ c.get(XML_SCHEMA_INSTANCE_TYPE_ATTR) for c in root.iterchildren() ] - self.assertEquals("int", child_types[0]) - self.assertEquals("string", child_types[1]) - self.assertEquals("float", child_types[2]) - self.assertEquals("string", child_types[3]) - self.assertEquals("boolean", child_types[4]) - self.assertEquals(None, child_types[5]) - self.assertEquals(None, child_types[6]) - self.assertEquals("int", child_types[7]) - self.assertEquals("int", child_types[8]) - self.assertEquals("int", child_types[9]) - self.assertEquals("string", child_types[10]) - self.assertEquals("float", child_types[11]) - self.assertEquals("integer", child_types[12]) + 
self.assertEquals("xsd:int", child_types[0]) + self.assertEquals("xsd:string", child_types[1]) + self.assertEquals("xsd:float", child_types[2]) + self.assertEquals("xsd:string", child_types[3]) + self.assertEquals("xsd:boolean", child_types[4]) + self.assertEquals(None, child_types[5]) + self.assertEquals(None, child_types[6]) + self.assertEquals("xsd:int", child_types[7]) + self.assertEquals("xsd:int", child_types[8]) + self.assertEquals("xsd:int", child_types[9]) + self.assertEquals("xsd:string", child_types[10]) + self.assertEquals("xsd:float", child_types[11]) + self.assertEquals("xsd:integer", child_types[12]) self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR)) @@ -641,19 +641,19 @@ child_types = [ c.get(XML_SCHEMA_INSTANCE_TYPE_ATTR) for c in root.iterchildren() ] - self.assertEquals("int", child_types[0]) - self.assertEquals("string", child_types[1]) - self.assertEquals("float", child_types[2]) - self.assertEquals("string", child_types[3]) - self.assertEquals("boolean", child_types[4]) - self.assertEquals(None, child_types[5]) - self.assertEquals(None, child_types[6]) - self.assertEquals("double", child_types[7]) - self.assertEquals("float", child_types[8]) - self.assertEquals("string", child_types[9]) - self.assertEquals("string", child_types[10]) - self.assertEquals("float", child_types[11]) - self.assertEquals("integer", child_types[12]) + self.assertEquals("xsd:int", child_types[0]) + self.assertEquals("xsd:string", child_types[1]) + self.assertEquals("xsd:float", child_types[2]) + self.assertEquals("xsd:string", child_types[3]) + self.assertEquals("xsd:boolean", child_types[4]) + self.assertEquals(None, child_types[5]) + self.assertEquals(None, child_types[6]) + self.assertEquals("xsd:double", child_types[7]) + self.assertEquals("xsd:float", child_types[8]) + self.assertEquals("xsd:string", child_types[9]) + self.assertEquals("xsd:string", child_types[10]) + self.assertEquals("xsd:float", child_types[11]) + self.assertEquals("xsd:integer", 
child_types[12]) self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR)) @@ -710,19 +710,19 @@ child_types = [ c.get(XML_SCHEMA_INSTANCE_TYPE_ATTR) for c in root.iterchildren() ] - self.assertEquals("int", child_types[ 0]) - self.assertEquals("string", child_types[ 1]) - self.assertEquals("float", child_types[ 2]) - self.assertEquals("string", child_types[ 3]) - self.assertEquals("boolean", child_types[ 4]) - self.assertEquals(None, child_types[ 5]) - self.assertEquals(None, child_types[ 6]) - self.assertEquals("int", child_types[ 7]) - self.assertEquals("int", child_types[ 8]) - self.assertEquals("int", child_types[ 9]) - self.assertEquals("string", child_types[10]) - self.assertEquals("float", child_types[11]) - self.assertEquals("integer", child_types[12]) + self.assertEquals("xsd:int", child_types[ 0]) + self.assertEquals("xsd:string", child_types[ 1]) + self.assertEquals("xsd:float", child_types[ 2]) + self.assertEquals("xsd:string", child_types[ 3]) + self.assertEquals("xsd:boolean", child_types[ 4]) + self.assertEquals(None, child_types[ 5]) + self.assertEquals(None, child_types[ 6]) + self.assertEquals("xsd:int", child_types[ 7]) + self.assertEquals("xsd:int", child_types[ 8]) + self.assertEquals("xsd:int", child_types[ 9]) + self.assertEquals("xsd:string", child_types[10]) + self.assertEquals("xsd:float", child_types[11]) + self.assertEquals("xsd:integer", child_types[12]) self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR)) From scoder at codespeak.net Wed May 16 10:17:26 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 10:17:26 +0200 (CEST) Subject: [Lxml-checkins] r43426 - in lxml/trunk: . 
src/lxml src/lxml/tests Message-ID: <20070516081726.74A35808D@code0.codespeak.net> Author: scoder Date: Wed May 16 10:17:25 2007 New Revision: 43426 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/extensions.pxi lxml/trunk/src/lxml/tests/test_xpathevaluator.py Log: raise exception when passing empty prefixes or namespaces to XPath Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Wed May 16 10:17:25 2007 @@ -44,6 +44,8 @@ Bugs fixed ---------- +* passing '' as XPath namespace prefix did not raise an error + * passing '' as namespace prefix in nsmap could be passed through to libxml2 * Objectify couldn't handle prefixed XSD type names in ``xsi:type`` Modified: lxml/trunk/src/lxml/extensions.pxi ============================================================================== --- lxml/trunk/src/lxml/extensions.pxi (original) +++ lxml/trunk/src/lxml/extensions.pxi Wed May 16 10:17:25 2007 @@ -69,10 +69,10 @@ if namespaces: ns = [] for prefix, ns_uri in namespaces: - if prefix is None: + if prefix is None or not prefix: raise TypeError, \ "empty namespace prefix is not supported in XPath" - if ns_uri is None: + if ns_uri is None or not ns_uri: raise TypeError, \ "setting default namespace is not supported in XPath" prefix_utf = self._to_utf(prefix) Modified: lxml/trunk/src/lxml/tests/test_xpathevaluator.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_xpathevaluator.py (original) +++ lxml/trunk/src/lxml/tests/test_xpathevaluator.py Wed May 16 10:17:25 2007 @@ -112,6 +112,13 @@ TypeError, root.xpath, '//b', {None: 'uri:a'}) + def test_xpath_ns_empty(self): + tree = self.parse('') + root = tree.getroot() + self.assertRaises( + TypeError, + root.xpath, '//b', {'': 'uri:a'}) + def test_xpath_error(self): tree = self.parse('') self.assertRaises(etree.XPathEvalError, tree.xpath, 
'\\fad') From scoder at codespeak.net Wed May 16 22:19:19 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:19:19 +0200 (CEST) Subject: [Lxml-checkins] r43438 - in lxml/trunk: doc src/lxml/tests Message-ID: <20070516201919.483488075@code0.codespeak.net> Author: scoder Date: Wed May 16 22:19:17 2007 New Revision: 43438 Added: lxml/trunk/doc/tutorial.txt Modified: lxml/trunk/doc/main.txt lxml/trunk/doc/mkhtml.py lxml/trunk/src/lxml/tests/test_etree.py Log: first take on an lxml.etree tutorial Modified: lxml/trunk/doc/main.txt ============================================================================== --- lxml/trunk/doc/main.txt (original) +++ lxml/trunk/doc/main.txt Wed May 16 22:19:17 2007 @@ -4,8 +4,8 @@ .. contents:: .. 1 Introduction - 2 Download - 3 Documentation + 2 Documentation + 3 Download 4 Mailing list 5 License 6 Old Versions @@ -25,42 +25,6 @@ .. _FAQ: FAQ.html -Download --------- - -The best way to download binary versions is to visit `lxml at the Python -cheeseshop`_. It has the source, eggs and installers for various platforms. -The source distribution is signed with `this key`_. - -.. _`lxml at the Python cheeseshop`: http://cheeseshop.python.org/pypi/lxml/ -.. _`this key`: pubkey.asc - -The latest version is `lxml 1.3beta`_, released 2007-02-27 (`changes for 1.3beta`_). -`Older versions`_ are listed below. - -.. _`lxml 1.3beta`: lxml-1.3beta.tgz -.. _`CHANGES for 1.3beta`: changes-1.3beta.html -.. _`Older versions`: #old-versions - -Please take a look at the `installation instructions`_! - -.. _`installation instructions`: installation.html - -It's also possible to check out the latest development version of lxml -from svn directly, using a command like this:: - - svn co http://codespeak.net/svn/lxml/trunk lxml - -You can also `browse it through the web`_. Please read `how to build lxml -from source`_ first. The `latest CHANGES`_ of the developer version are also -accessible. 
You can check there if a bug you found has been fixed or a -feature you want has been implemented in the latest trunk version. - -.. _`how to build lxml from source`: build.html -.. _`browse it through the web`: http://codespeak.net/svn/lxml -.. _`latest CHANGES`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt - - Documentation ------------- @@ -74,6 +38,8 @@ * lxml.etree: + * the `lxml.etree Tutorial`_ + * `lxml.etree specific API`_ documentation * parsing_ and validating_ XML @@ -95,17 +61,19 @@ * a brief comparison of `objectify and etree`_ lxml.etree follows the ElementTree_ API as much as possible, building it on -top of the native libxml2 tree. See also the ElementTree compatibility_ -overview and the `benchmark results`_ comparing lxml to the original -ElementTree_ and cElementTree_ implementations. - -Right after the ElementTree_ documentation, the most important place to look -is the `lxml.etree specific API`_ documentation. It describes how lxml extends the -ElementTree API to expose libxml2 and libxslt specific functionality, such as -XPath_, `Relax NG`_, `XML Schema`_, `XSLT`_, and `c14n`_. Python code can be -called from XPath expressions and XSLT stylesheets through the use of -`extension functions`_. lxml also offers a `SAX compliant API`_, that works -with the SAX support in the standard library. +top of the native libxml2 tree. If you are new to ElementTree, start with the +`lxml.etree Tutorial`_. See also the ElementTree compatibility_ overview and +the `benchmark results`_ comparing lxml to the original ElementTree_ and +cElementTree_ implementations. + +Right after the `lxml.etree Tutorial`_ and the ElementTree_ documentation, the +most important place to look is the `lxml.etree specific API`_ documentation. +It describes how lxml extends the ElementTree API to expose libxml2 and +libxslt specific functionality, such as XPath_, `Relax NG`_, `XML Schema`_, +`XSLT`_, and `c14n`_. 
Python code can be called from XPath expressions and +XSLT stylesheets through the use of `extension functions`_. lxml also offers +a `SAX compliant API`_, that works with the SAX support in the standard +library. There is a separate module `lxml.objectify`_ that implements a data-binding API on top of lxml.etree. See the `objectify and etree`_ FAQ entry for a @@ -120,6 +88,7 @@ .. _ElementTree: http://effbot.org/zone/element-index.htm .. _cElementTree: http://effbot.org/zone/celementtree.htm +.. _`lxml.etree Tutorial`: tutorial.html .. _`benchmark results`: performance.html .. _`compatibility`: compatibility.html .. _`lxml.etree specific API`: api.html @@ -140,6 +109,42 @@ .. _`c14n`: http://www.w3.org/TR/xml-c14n +Download +-------- + +The best way to download binary versions is to visit `lxml at the Python +cheeseshop`_. It has the source, eggs and installers for various platforms. +The source distribution is signed with `this key`_. + +.. _`lxml at the Python cheeseshop`: http://cheeseshop.python.org/pypi/lxml/ +.. _`this key`: pubkey.asc + +The latest version is `lxml 1.3beta`_, released 2007-02-27 (`changes for 1.3beta`_). +`Older versions`_ are listed below. + +.. _`lxml 1.3beta`: lxml-1.3beta.tgz +.. _`CHANGES for 1.3beta`: changes-1.3beta.html +.. _`Older versions`: #old-versions + +Please take a look at the `installation instructions`_! + +.. _`installation instructions`: installation.html + +It's also possible to check out the latest development version of lxml +from svn directly, using a command like this:: + + svn co http://codespeak.net/svn/lxml/trunk lxml + +You can also `browse it through the web`_. Please read `how to build lxml +from source`_ first. The `latest CHANGES`_ of the developer version are also +accessible. You can check there if a bug you found has been fixed or a +feature you want has been implemented in the latest trunk version. + +.. _`how to build lxml from source`: build.html +.. 
_`browse it through the web`: http://codespeak.net/svn/lxml +.. _`latest CHANGES`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt + + Mailing list ------------ Modified: lxml/trunk/doc/mkhtml.py ============================================================================== --- lxml/trunk/doc/mkhtml.py (original) +++ lxml/trunk/doc/mkhtml.py Wed May 16 22:19:17 2007 @@ -4,10 +4,11 @@ SITE_STRUCTURE = [ ('lxml', ('main.txt', 'intro.txt', 'FAQ.txt', 'compatibility.txt', 'performance.txt', 'build.txt')), - ('Developing with lxml', ('api.txt', 'parsing.txt', 'validation.txt', - 'xpathxslt.txt', 'objectify.txt')), - ('Extending lxml', ('resolvers.txt', 'extensions.txt', 'element_classes.txt', - 'sax.txt', 'capi.txt')), + ('Developing with lxml', ('tutorial.txt', 'api.txt', 'parsing.txt', + 'validation.txt', 'xpathxslt.txt', + 'objectify.txt')), + ('Extending lxml', ('resolvers.txt', 'extensions.txt', + 'element_classes.txt', 'sax.txt', 'capi.txt')), ] RST2HTML_OPTIONS = " ".join([ Added: lxml/trunk/doc/tutorial.txt ============================================================================== --- (empty file) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:19:17 2007 @@ -0,0 +1,336 @@ +======================= +The lxml.etree Tutorial +======================= + +This tutorial briefly overviews the main concepts of the `ElementTree API`_ as +implemented by lxml.etree, and some simple enhancements that make your life as +a programmer easier. + +.. _`ElementTree API`: http://effbot.org/zone/element-index.htm#documentation + +.. contents:: +.. 
+ 1 Elements and ElementTrees + 1.1 The Element class + 1.2 The ElementTree class + 2 Parsing and XML literals + 2.1 The XML() function + 2.2 The parse() function + 3 Namespaces + 4 The find*() methods + 4.1 findall() + 4.2 find() + 4.3 findtext() + + +A common way to import ``lxml.etree`` is as follows:: + + >>> from lxml import etree + +If your code only uses the ElementTree API and does not rely on any +functionality that is specific to ``lxml.etree``, you can also use the +following import chain as a fall-back to the original ElementTree:: + + try: + from lxml import etree + print "running with lxml.etree" + except ImportError: + try: + # Python 2.5 + import xml.etree.cElementTree as etree + print "running with cElementTree on Python 2.5+" + except ImportError: + try: + # Python 2.5 + import xml.etree.ElementTree as etree + print "running with ElementTree on Python 2.5+" + except ImportError: + try: + # normal cElementTree install + import cElementTree as etree + print "running with cElementTree" + except ImportError: + try: + # normal ElementTree install + import elementtree.ElementTree as etree + print "running with ElementTree" + except ImportError: + print "Failed to import ElementTree from any known place" + +To aid in writing portable code, this tutorial makes it clear in the examples +which part of the presented API is an extension of lxml.etree over the +original `ElementTree API`_, as defined by Fredrik Lundh's `ElementTree +library`_. + +.. _`ElementTree library`: http://effbot.org/zone/element-index.htm + + +The Element class +================= + +An ``Element`` is the main container object for the ElementTree API. Most of +the XML tree functionality is accessed through this class. Elements are +easily created through the ``Element`` factory:: + + >>> root = etree.Element("root") + +The XML tag name of elements is accessed through the ``tag`` property:: + + >>> print root.tag + root + +Elements are organised in an XML tree structure. 
To create child elements and +add them to a parent element, you can use the ``append()`` method:: + + >>> root.append( etree.Element("child1") ) + +However, a much more efficient and more common way to do this is through the +``SubElement`` factory. It accepts the same arguments as the ``Element`` +factory, but additionally requires the parent as first argument:: + + >>> child2 = etree.SubElement(root, "child2") + >>> child3 = etree.SubElement(root, "child3") + +To see that this is really XML, you can serialise the tree you have created:: + + >>> print etree.tostring(root, pretty_print=True) + <root> + <child1/> + <child2/> + <child3/> + </root> + +Elements are lists +------------------ + +To make the access to these subelements as easy and straight forward as +possible, elements behave exactly like normal Python lists:: + + >>> child = root[0] + >>> print child.tag + child1 + + >>> for child in root: + ... print child.tag + child1 + child2 + child3 + + >>> if root: + ... print "root has children!" + root has children! + + >>> root.insert(0, etree.Element("child0")) + >>> start = root[:1] + >>> end = root[-1:] + + >>> print start[0].tag + child0 + >>> print end[0].tag + child3 + + >>> root[0] = root[-1] + >>> for child in root: + ... print child.tag + child3 + child1 + child2 + +Note how the last element was moved to a different position in the last +example. This is a difference from the original ElementTree (and from lists), +where elements can sit in multiple positions of any number of trees. In +lxml.etree, elements can only sit in one position of one tree at a time. + +To retrieve a 'real' Python list of all children (or a *shallow copy* of the +element children list), you can call the ``getchildren()`` method:: + + >>> children = root.getchildren() + + >>> print type(children) is type([]) + True + + >>> for child in children: + ... print child.tag + child3 + child1 + child2 + +The way up in the tree is provided through the ``getparent()`` method:: + + >>> root is root[0].getparent() # lxml.etree only!
+ True + +The siblings (or neighbours) of an element are accessed as next and previous +elements:: + + >>> root[0] is root[1].getprevious() # lxml.etree only! + True + >>> root[1] is root[0].getnext() # lxml.etree only! + True + + +Elements carry attributes +------------------------- + +XML elements support attributes. You can create them directly in the Element +factory:: + + >>> root = etree.Element("root", interesting="totally") + >>> print etree.tostring(root) + <root interesting="totally"/> + +Fast and direct access to these attributes is provided by the ``set()`` and +``get()`` methods of elements:: + + >>> print root.get("interesting") + totally + + >>> root.set("interesting", "somewhat") + >>> print root.get("interesting") + somewhat + +However, a very convenient way of dealing with them is through the dictionary +interface of the ``attrib`` property:: + + >>> attributes = root.attrib + + >>> print attributes["interesting"] + somewhat + + >>> print attributes.get("hello") + None + + >>> attributes["hello"] = "Guten Tag" + >>> print attributes.get("hello") + Guten Tag + >>> print root.get("hello") + Guten Tag + + +Elements carry text +------------------- + +Elements can contain text:: + + >>> root = etree.Element("root") + >>> root.text = "TEXT" + + >>> print root.text + TEXT + + >>> print etree.tostring(root) + <root>TEXT</root> + +In many XML documents (so-called *data-centric* documents), this is the only +place where text can be found. It is encapsulated by a leaf tag somewhere in +the tree hierarchy. + +However, if XML is used for tagged text documents such as (X)HTML, text can +also appear between different elements, right in the middle of the tree:: + + <html><body>Hello<br/>World</body></html> + +Here, the ``<br/>`` tag is surrounded by text. This is often referred to as +*document-style* XML. Elements support this through their ``tail`` property. +It contains the text that directly follows the element, up to the next element +in the XML tree:: + + >>> html = etree.Element("html") + >>> body = etree.SubElement(html, "body") + >>> body.text = "TEXT" + + >>> print etree.tostring(html) + <html><body>TEXT</body></html> + + >>> br = etree.SubElement(body, "br") + >>> print etree.tostring(html) + <html><body>TEXT<br/></body></html> + + >>> br.tail = "TAIL" + >>> print etree.tostring(html) + <html><body>TEXT<br/>TAIL</body></html> + +These two properties are enough to represent any text content in an XML +document. If you want to read the text without the intermediate tags, +however, you have to recursively concatenate all ``text`` and ``tail`` +attributes in the correct order. A simpler way to do this is XPath_:: + + >>> print html.xpath("string()") # lxml.etree only! + TEXTTAIL + +.. _XPath: xpathxslt.txt#xpath + + +Tree iteration +-------------- + +For problems like the above, where you want to recursively traverse the tree +and do something with its elements, tree iteration is a very convenient +solution. Elements provide a tree iterator for this purpose. It yields +elements in *document order*, i.e. in the order their tags would appear if you +serialised the tree to XML:: + + >>> root = etree.Element("root") + >>> etree.SubElement(root, "child").text = "Child 1" + >>> etree.SubElement(root, "child").text = "Child 2" + >>> etree.SubElement(root, "another").text = "Child 3" + + >>> print etree.tostring(root, pretty_print=True) + <root> + <child>Child 1</child> + <child>Child 2</child> + <another>Child 3</another> + </root> + + >>> for element in root.getiterator(): + ... print element.tag, '-', element.text + root - None + child - Child 1 + child - Child 2 + another - Child 3 + +If you know you are only interested in a single tag, you can pass its name to +``getiterator()`` to have it filter for you:: + + >>> for element in root.getiterator("child"): + ... print element.tag, '-', element.text + child - Child 1 + child - Child 2 + +In lxml.etree, elements provide `further iterators`_ for all directions in the +tree: children, parents (or rather ancestors) and siblings. + +..
_`further iterators`: api.html#iteration + + + + +The ElementTree class +===================== + + +Parsing files and XML literals +============================== + +The XML() function +------------------ + +The parse() function +-------------------- + +Namespaces +========== + + +ElementPath +=========== + +findall() +--------- + +find() +------ + +findtext() +---------- Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Wed May 16 22:19:17 2007 @@ -1567,6 +1567,8 @@ suite.addTests([unittest.makeSuite(ElementIncludeTestCase)]) suite.addTests([unittest.makeSuite(ETreeC14NTestCase)]) suite.addTests( + [doctest.DocFileSuite('../../../doc/tutorial.txt')]) + suite.addTests( [doctest.DocFileSuite('../../../doc/api.txt')]) suite.addTests( [doctest.DocFileSuite('../../../doc/parsing.txt')]) From scoder at codespeak.net Wed May 16 22:24:59 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:24:59 +0200 (CEST) Subject: [Lxml-checkins] r43439 - lxml/trunk/doc Message-ID: <20070516202459.6A4DC8075@code0.codespeak.net> Author: scoder Date: Wed May 16 22:24:59 2007 New Revision: 43439 Modified: lxml/trunk/doc/tutorial.txt Log: author, heading fix Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:24:59 2007 @@ -2,6 +2,9 @@ The lxml.etree Tutorial ======================= +:Author: + Stefan Behnel + This tutorial briefly overviews the main concepts of the `ElementTree API`_ as implemented by lxml.etree, and some simple enhancements that make your life as a programmer easier. 
@@ -208,8 +211,8 @@ Guten Tag -Elements carry text -------------------- +Elements contain text +--------------------- Elements can contain text:: From scoder at codespeak.net Wed May 16 22:49:36 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:49:36 +0200 (CEST) Subject: [Lxml-checkins] r43440 - lxml/trunk/doc Message-ID: <20070516204936.B2D6A8082@code0.codespeak.net> Author: scoder Date: Wed May 16 22:49:36 2007 New Revision: 43440 Modified: lxml/trunk/doc/tutorial.txt Log: paragraph on deep copying elements Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:49:36 2007 @@ -145,6 +145,20 @@ where elements can sit in multiple positions of any number of trees. In lxml.etree, elements can only sit in one position of one tree at a time. +If you want to *copy* an element to a different position, consider creating a +independent *deep copy* using the ``copy`` module from Python's standard +library:: + + >>> from copy import deepcopy + + >>> element = etree.Element("neu") + >>> element.append( deepcopy(root[1]) ) + + >>> print element[0].tag + child1 + >>> print [ c.tag for c in root ] + ['child3', 'child1', 'child2'] + To retrieve a 'real' Python list of all children (or a *shallow copy* of the element children list), you can call the ``getchildren()`` method:: From scoder at codespeak.net Wed May 16 22:51:55 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:51:55 +0200 (CEST) Subject: [Lxml-checkins] r43441 - lxml/trunk/doc Message-ID: <20070516205155.90E848097@code0.codespeak.net> Author: scoder Date: Wed May 16 22:51:54 2007 New Revision: 43441 Modified: lxml/trunk/doc/tutorial.txt Log: fix Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt 
(original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:51:54 2007 @@ -140,7 +140,7 @@ child1 child2 -Note how the last element was moved to a different position in the last +Note how the last element was *moved* to a different position in the last example. This is a difference from the original ElementTree (and from lists), where elements can sit in multiple positions of any number of trees. In lxml.etree, elements can only sit in one position of one tree at a time. From scoder at codespeak.net Wed May 16 22:52:38 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:52:38 +0200 (CEST) Subject: [Lxml-checkins] r43442 - lxml/trunk/doc Message-ID: <20070516205238.C2CEC8097@code0.codespeak.net> Author: scoder Date: Wed May 16 22:52:38 2007 New Revision: 43442 Modified: lxml/trunk/doc/tutorial.txt Log: fix Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:52:38 2007 @@ -145,7 +145,7 @@ where elements can sit in multiple positions of any number of trees. In lxml.etree, elements can only sit in one position of one tree at a time. 
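The contrast drawn in the paragraph above can be checked against the standard library's ``xml.etree.ElementTree``, which follows the original ElementTree behaviour on this point. The following is only a minimal sketch, written for Python 3 (an assumption, since the tutorial itself uses Python 2 syntax); the variable names ``tags_after_assignment``, ``same_object_twice`` and ``holder`` are illustrative and do not come from the tutorial.

```python
import xml.etree.ElementTree as ET
from copy import deepcopy

# Build <root><child1/><child2/><child3/></root>
root = ET.Element("root")
for name in ("child1", "child2", "child3"):
    ET.SubElement(root, name)

# With the standard library's ElementTree, item assignment does not move
# the element: the same object now sits at the first and last position.
root[0] = root[-1]
tags_after_assignment = [child.tag for child in root]
same_object_twice = root[0] is root[-1]
print(tags_after_assignment)   # ['child3', 'child2', 'child3']

# A deep copy is an independent element, so it can be placed elsewhere
# without touching the tree it was copied from.
holder = ET.Element("neu")
holder.append(deepcopy(root[1]))
print(holder[0].tag)           # child2
```

Under lxml.etree the same assignment would move the element instead, as the tutorial text states, which is why a deep copy is the way to duplicate content there.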
-If you want to *copy* an element to a different position, consider creating a +If you want to *copy* an element to a different position, consider creating an independent *deep copy* using the ``copy`` module from Python's standard library:: From scoder at codespeak.net Wed May 16 22:58:11 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:58:11 +0200 (CEST) Subject: [Lxml-checkins] r43443 - lxml/trunk/doc Message-ID: <20070516205811.01C078097@code0.codespeak.net> Author: scoder Date: Wed May 16 22:58:11 2007 New Revision: 43443 Modified: lxml/trunk/doc/tutorial.txt Log: another XPath text example Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:58:11 2007 @@ -275,6 +275,8 @@ >>> print html.xpath("string()") # lxml.etree only! TEXTTAIL + >>> print html.xpath("//text()") # lxml.etree only! + ['TEXT', 'TAIL'] .. _XPath: xpathxslt.txt#xpath From scoder at codespeak.net Wed May 16 22:59:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 22:59:00 +0200 (CEST) Subject: [Lxml-checkins] r43444 - lxml/trunk/doc Message-ID: <20070516205900.5BD548097@code0.codespeak.net> Author: scoder Date: Wed May 16 22:59:00 2007 New Revision: 43444 Modified: lxml/trunk/doc/tutorial.txt Log: broken URL Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 22:59:00 2007 @@ -278,7 +278,7 @@ >>> print html.xpath("//text()") # lxml.etree only! ['TEXT', 'TAIL'] -.. _XPath: xpathxslt.txt#xpath +.. 
_XPath: xpathxslt.html#xpath Tree iteration From scoder at codespeak.net Wed May 16 23:02:19 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 16 May 2007 23:02:19 +0200 (CEST) Subject: [Lxml-checkins] r43445 - lxml/trunk/doc Message-ID: <20070516210219.97E828097@code0.codespeak.net> Author: scoder Date: Wed May 16 23:02:19 2007 New Revision: 43445 Modified: lxml/trunk/doc/tutorial.txt Log: doc fix Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Wed May 16 23:02:19 2007 @@ -240,8 +240,8 @@ TEXT In many XML documents (so-called *data-centric* documents), this is the only -place where text can be found. It is encapsulated by a leaf tag somewhere in -the tree hierarchy. +place where text can be found. It is encapsulated by a leaf tag at the very +bottom of the tree hierarchy. However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:: From scoder at codespeak.net Sun May 20 10:18:14 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 20 May 2007 10:18:14 +0200 (CEST) Subject: [Lxml-checkins] r43509 - lxml/trunk/doc Message-ID: <20070520081814.779A88084@code0.codespeak.net> Author: scoder Date: Sun May 20 10:18:13 2007 New Revision: 43509 Modified: lxml/trunk/doc/tutorial.txt Log: tutorial: XPath as function Modified: lxml/trunk/doc/tutorial.txt ============================================================================== --- lxml/trunk/doc/tutorial.txt (original) +++ lxml/trunk/doc/tutorial.txt Sun May 20 10:18:13 2007 @@ -278,6 +278,12 @@ >>> print html.xpath("//text()") # lxml.etree only! ['TEXT', 'TAIL'] +If you want to use this more often, you can wrap it in a function:: + + >>> buildTextList = etree.XPath("//text()") # lxml.etree only! + >>> print buildTextList(html) + ['TEXT', 'TAIL'] + .. 
_XPath: xpathxslt.html#xpath From scoder at codespeak.net Sun May 20 11:37:57 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 20 May 2007 11:37:57 +0200 (CEST) Subject: [Lxml-checkins] r43510 - lxml/trunk/src/lxml/tests Message-ID: <20070520093757.50D5F807F@code0.codespeak.net> Author: scoder Date: Sun May 20 11:37:56 2007 New Revision: 43510 Modified: lxml/trunk/src/lxml/tests/test_xslt.py Log: exslt test split Modified: lxml/trunk/src/lxml/tests/test_xslt.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_xslt.py (original) +++ lxml/trunk/src/lxml/tests/test_xslt.py Sun May 20 11:37:56 2007 @@ -134,19 +134,40 @@ self.assertEquals(expected, unicode(res)) - def test_exslt(self): + def test_exslt_str(self): tree = self.parse('
BC') style = self.parse('''\ + exclude-result-prefixes="str xsl"> + + + +''') + + st = etree.XSLT(style) + res = st(tree) + self.assertEquals('''\ + +-B--C- +''', + str(res)) + + def test_exslt_math(self): + tree = self.parse('BC') + style = self.parse('''\ + + + @@ -159,7 +180,7 @@ res = st(tree) self.assertEquals('''\ --B--C- +BC ''', str(res)) From scoder at codespeak.net Mon May 21 18:00:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 21 May 2007 18:00:00 +0200 (CEST) Subject: [Lxml-checkins] r43529 - lxml/trunk/doc Message-ID: <20070521160000.91676808F@code0.codespeak.net> Author: scoder Date: Mon May 21 17:59:54 2007 New Revision: 43529 Modified: lxml/trunk/doc/xpathxslt.txt Log: doc: make clear regexp support is the default Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Mon May 21 17:59:54 2007 @@ -199,12 +199,11 @@ >>> print find(root)[0].tag {NS}b -You can pass the boolean keyword ``regexp`` to enable Python regular -expressions in the EXSLT_ namespace:: +By default, ``XPath`` supports regular expressions in the EXSLT_ namespace:: >>> regexpNS = "http://exslt.org/regular-expressions" - >>> find = etree.XPath("//*[r:test(., '^abc$', 'i')]", - ... {'r':regexpNS}, regexp = True) + >>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]", + ... {'re':regexpNS}) >>> root = etree.XML("aBaBc") >>> print find(root)[0].text @@ -212,6 +211,9 @@ .. _EXSLT: http://www.exslt.org/ +You can disable this with the boolean keyword argument ``regexp`` which +defaults to True. 
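The default described in this change can be tried out with a short sketch. It assumes Python 3 and a current lxml release, where ``namespaces`` is passed as a keyword argument (the doctest in the diff passes the mapping positionally); the names ``REGEXP_NS`` and ``find_abc`` and the sample document are illustrative choices.

```python
from lxml import etree

REGEXP_NS = "http://exslt.org/regular-expressions"

# EXSLT regular expressions are enabled by default; the prefix only has
# to be mapped to the EXSLT namespace.
find_abc = etree.XPath("//*[re:test(., '^abc$', 'i')]",
                       namespaces={"re": REGEXP_NS})

root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
matches = find_abc(root)

# Only <b> has a string value matching ^abc$ case-insensitively.
print([el.text for el in matches])   # ['aBc']
```

Passing ``regexp=False`` to the same constructor is the switch the paragraph above refers to for turning this support off.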
+ The ``XPathEvaluator`` classes ------------------------------ From scoder at codespeak.net Mon May 21 18:00:10 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 21 May 2007 18:00:10 +0200 (CEST) Subject: [Lxml-checkins] r43530 - lxml/trunk/doc/html Message-ID: <20070521160010.C0C9A8092@code0.codespeak.net> Author: scoder Date: Mon May 21 18:00:09 2007 New Revision: 43530 Modified: lxml/trunk/doc/html/style.css Log: CSS layout Modified: lxml/trunk/doc/html/style.css ============================================================================== --- lxml/trunk/doc/html/style.css (original) +++ lxml/trunk/doc/html/style.css Mon May 21 18:00:09 2007 @@ -41,13 +41,13 @@ /*** TOC ***/ -div.contents.topic > ul { +div.contents.topic ul { margin-top: 0px; } -div.contents.topic > ul > li { +div.contents.topic ul > li { text-decoration: none; - line-height: 1.1em; + line-height: 1.2em; } div.contents.topic > p > a { From scoder at codespeak.net Mon May 21 18:01:17 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 21 May 2007 18:01:17 +0200 (CEST) Subject: [Lxml-checkins] r43531 - lxml/trunk/doc Message-ID: <20070521160117.A588C808F@code0.codespeak.net> Author: scoder Date: Mon May 21 18:01:16 2007 New Revision: 43531 Modified: lxml/trunk/doc/objectify.txt Log: huge restructuring in objectify.txt, Holger's section on XSI annotation Modified: lxml/trunk/doc/objectify.txt ============================================================================== --- lxml/trunk/doc/objectify.txt (original) +++ lxml/trunk/doc/objectify.txt Mon May 21 18:01:16 2007 @@ -2,8 +2,8 @@ lxml.objectify ============== -:Author: - Stefan Behnel +:Authors: + Stefan Behnel, Holger Joukl lxml supports an alternative API similar to the Amara_ bindery or gnosis.xml.objectify_ through a custom Element implementation. The main idea @@ -25,22 +25,27 @@ .. contents:: .. 
- 1 Setting up lxml.objectify - 2 Creating objectify trees - 3 Element access through object attributes - 4 Namespace handling - 5 ObjectPath - 6 Python data types - 7 Defining additional data classes - 8 Recursive string representation of elements - 9 What is different from ElementTree? - 10 Resetting the API + 1 Setting up lxml.objectify + 2 The lxml.objectify API + 2.1 Creating objectify trees + 2.2 Element access through object attributes + 2.3 Namespace handling + 3 ObjectPath + 4 Python data types + 5 Recursive tree dump + 5.1 Recursive string representation of elements + 6 How data types are matched + 6.1 Type annotations + 6.2 XML Schema datatype annotation + 6.3 The DataElement factory + 6.4 Defining additional data classes + 7 What is different from lxml.etree? Setting up lxml.objectify -------------------------- +========================= -To make use of ``objectify``, you need both the ``lxml.etree`` module and +To set up and use ``objectify``, you need both the ``lxml.etree`` module and ``lxml.objectify``:: >>> from lxml import etree @@ -74,6 +79,13 @@ .. _`namespace specific classes`: element_classes.html#namespace-class-lookup +The lxml.objectify API +====================== + +In ``lxml.objectify``, element trees provide an API that models the behaviour +of normal Python object trees as closely as possible. + + Creating objectify trees ------------------------ @@ -318,7 +330,7 @@ ObjectPath ----------- +========== For both convenience and speed, objectify supports its own path language, represented by the ``ObjectPath`` class:: @@ -455,7 +467,7 @@ Python data types ------------------ +================= The objectify module knows about Python data types and tries its best to let element content behave like them. For example, they support the normal math @@ -488,6 +500,67 @@ >>> print root.d % (1234, 12345) 1234 - 12345 +However, data elements continue to provide the objectify API. 
This means that +sequence operations such as ``len()``, slicing and indexing (e.g. of strings) +cannot behave as the Python types. Like all other tree elements, they show +the normal slicing behaviour of objectify elements:: + + >>> root = objectify.fromstring("<root><a>test</a><b>toast</b></root>") + >>> print root.a + ' me' # behaves like a string, right? + test me + >>> len(root.a) # but there's only one 'a' element! + 1 + >>> [ a.tag for a in root.a ] + ['a'] + >>> print root.a[0].tag + a + + >>> print root.a + test + >>> [ str(a) for a in root.a[:1] ] + ['test'] + +If you need to run sequence operations on data types, you must ask the API for +the *real* Python value. The string value is always available through the +normal ElementTree ``.text`` attribute. Additionally, all data classes +provide a ``.pyval`` attribute that returns the value as plain Python type:: + + >>> root = objectify.fromstring("<root><a>test</a><b>5</b></root>") + >>> root.a.text + 'test' + >>> root.a.pyval + 'test' + + >>> root.b.text + '5' + >>> root.b.pyval + 5 + +Note, however, that both attributes are read-only in objectify. If you want +to change values, just assign them directly to the attribute:: + + >>> root.a.text = "25" + Traceback (most recent call last): + ... + TypeError: attribute 'text' of 'StringElement' objects is not writable + + >>> root.a.pyval = 25 + Traceback (most recent call last): + ... + TypeError: attribute 'pyval' of 'StringElement' objects is not writable + + >>> root.a = 25 + >>> print root.a + 25 + >>> print root.a.pyval + 25 + +In other words, ``objectify`` data elements behave like immutable Python +types. You can replace them, but not modify them. + + +Recursive tree dump +------------------- To see the data types that are currently used, you can call the module level ``dump()`` function that returns a recursive string representation for @@ -547,64 +620,46 @@ a = 2 [IntElement] a = 3 [IntElement] -However, data elements continue to provide the objectify API.
This means that -sequence operations such as ``len()``, slicing and indexing (e.g. of strings) -cannot behave as the Python types. Like all other tree elements, they show -the normal slicing behaviour of objectify elements:: - - >>> root = objectify.fromstring("<root><a>test</a><b>toast</b></root>") - >>> print root.a + ' me' # behaves like a string, right? - test me - >>> len(root.a) # but there's only one 'a' element! - 1 - >>> [ a.tag for a in root.a ] - ['a'] - >>> print root.a[0].tag - a - >>> print root.a - test - >>> [ str(a) for a in root.a[:1] ] - ['test'] - -If you need to run sequence operations on data types, you must ask the API for -the *real* Python value. The string value is always available through the -normal ElementTree ``.text`` attribute. Additionally, all data classes -provide a ``.pyval`` attribute that returns the value as plain Python type:: - - >>> root = objectify.fromstring("<root><a>test</a><b>5</b></root>") - >>> root.a.text - 'test' - >>> root.a.pyval - 'test' +Recursive string representation of elements +------------------------------------------- - >>> root.b.text - '5' - >>> root.b.pyval - 5 +Normally, elements use the standard string representation for str() that is +provided by lxml.etree. You can enable a pretty-print representation for +objectify elements like this:: -Note, however, that both attributes are read-only in objectify. If you want -to change values, just assign them directly to the attribute:: + >>> objectify.enableRecursiveStr() - >>> root.a.text = "25" - Traceback (most recent call last): - ... - TypeError: attribute 'text' of 'StringElement' objects is not writable + >>> root = objectify.fromstring(""" + ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> + ... <a attr1="foo" attr2="bar">1</a> + ... <a>1.2</a> + ... <b>1</b> + ... <b>true</b> + ... <c>what?</c> + ... <d xsi:nil="true"/> + ... </root> + ... """) - >>> root.a.pyval = 25 - Traceback (most recent call last): - ... 
- TypeError: attribute 'pyval' of 'StringElement' objects is not writable + >>> print str(root) + root = None [ObjectifiedElement] + a = 1 [IntElement] + * attr1 = 'foo' + * attr2 = 'bar' + a = 1.2 [FloatElement] + b = 1 [IntElement] + b = True [BoolElement] + c = 'what?' [StringElement] + d = None [NoneElement] + * xsi:nil = 'true' - >>> root.a = 25 - >>> print root.a - 25 +This behaviour can be switched off in the same way:: -In other words, objectify data elements behave like immutable Python types. + >>> objectify.enableRecursiveStr(False) How data types are matched --------------------------- +========================== Objectify uses two different types of Elements. Structural Elements (or tree Elements) represent the object tree structure. Data Elements represent the @@ -639,6 +694,10 @@ classes used in these cases. By default, ``tree_class`` is a class called ``ObjectifiedElement`` and ``empty_data_class`` is a ``StringElement``. + +Type annotations +---------------- + The "type hint" mechanism deploys an XML attribute defined as ``lxml.objectify.PYTYPE_ATTRIBUTE``. It may contain any of the following string values: int, long, float, str, unicode, none:: @@ -682,12 +741,17 @@ b = 5 [IntElement] * py:pytype = 'int' + +XML Schema datatype annotation +------------------------------ + A second way of specifying data type information uses XML Schema types as element annotations. Objectify knows those that can be mapped to normal Python types:: >>> root = objectify.fromstring('''\ - ... + ... ... 5 ... 5 ... 5 @@ -758,6 +822,10 @@ l = 5 [IntElement] s = 5 [IntElement] + +The DataElement factory +----------------------- + For convenience, the ``DataElement()`` factory creates an Element with a Python value in one step. 
You can pass the required Python type name or the XSI type name:: @@ -798,13 +866,106 @@ provide the type of a data element by hand:: >>> root = objectify.Element("root") - >>> root.s = objectify.DataElement(5, _pytype="str") + >>> root.s = objectify.DataElement(5, _pytype="str") >>> print objectify.dump(root) root = None [ObjectifiedElement] s = '5' [StringElement] * py:pytype = 'str' - +Likewise, the data type can be provided as an XML Schema type using the _xsi +argument of ``DataElement()``:: + + >>> root = objectify.Element("root") + >>> root.s = objectify.DataElement(5, _xsi="string") + >>> print objectify.dump(root) + root = None [ObjectifiedElement] + s = '5' [StringElement] + * py:pytype = 'str' + * xsi:type = 'xsd:string' + +XML Schema types reside in the XML schema namespace thus ``DataElement()`` +tries to correctly prefix the xsi:type attribute value for you:: + + >>> root = objectify.Element("root") + >>> root.s = objectify.DataElement(5, _xsi="string") + + >>> objectify.deannotate(root, xsi=False) + >>> print etree.tostring(root, pretty_print=True) + + 5 + + +``DataElement()`` uses a default nsmap to set these prefixes:: + + >>> el = objectify.DataElement('5', _xsi='string') + >>> for prefix, namespace in el.nsmap.items(): + ... print prefix, '-', namespace + py - http://codespeak.net/lxml/objectify/pytype + xsd - http://www.w3.org/2001/XMLSchema + xsi - http://www.w3.org/2001/XMLSchema-instance + + >>> print el.get("{http://www.w3.org/2001/XMLSchema-instance}type") + xsd:string + +While you can set custom namespace prefixes, it is necessary to provide valid +namespace information if you choose to do so:: + + >>> el = objectify.DataElement('5', _xsi='foo:string', + ... nsmap={'foo': 'http://www.w3.org/2001/XMLSchema'}) + >>> for prefix, namespace in el.nsmap.items(): + ... 
print prefix, '-', namespace + ns0 - http://codespeak.net/lxml/objectify/pytype + ns1 - http://www.w3.org/2001/XMLSchema-instance + foo - http://www.w3.org/2001/XMLSchema + + >>> print el.get("{http://www.w3.org/2001/XMLSchema-instance}type") + foo:string + + >>> el = objectify.DataElement('5', _xsi='foo:string', + ... nsmap={'foo': 'http://www.w3.org/2001/XMLSchema', + ... 'myxsi': 'http://www.w3.org/2001/XMLSchema-instance'}) + >>> for prefix, namespace in el.nsmap.items(): + ... print prefix, '-', namespace + ns0 - http://codespeak.net/lxml/objectify/pytype + foo - http://www.w3.org/2001/XMLSchema + myxsi - http://www.w3.org/2001/XMLSchema-instance + + >>> print el.get("{http://www.w3.org/2001/XMLSchema-instance}type") + foo:string + +Care must be taken if different namespace prefixes have been used for the same +namespace. Namespace information gets merged to avoid duplicate definitions +when adding a new sub-element to a tree, but this mechanism does not adapt the +prefixes of attribute values:: + + >>> root = objectify.fromstring("""""") + >>> print etree.tostring(root, pretty_print=True) + + + >>> s = objectify.DataElement("17", _xsi="string") + >>> print etree.tostring(s, pretty_print=True) + 17 + + >>> root.s = s + >>> print etree.tostring(root, pretty_print=True) + + 17 + + +It is your responsibility to fix the prefixes of attribute values if you +choose to deviate from the standard prefixes. A convenient way to do this for +xsi:type attributes is to use the ``xsiannotate()`` utility:: + + >>> objectify.xsiannotate(root) + >>> print etree.tostring(root, pretty_print=True) + + 17 + + +Of course, it is discouraged to use different prefixes for one and the same +namespace when building up an objectify tree. + + Defining additional data classes -------------------------------- @@ -894,45 +1055,8 @@ after all references are gone and the Python object is garbage collected. 
-Recursive string representation of elements -------------------------------------------- - -Normally, elements use the standard string representation for str() that is -provided by lxml.etree. You can enable a pretty-print representation for -objectify elements like this:: - - >>> objectify.enableRecursiveStr() - - >>> root = objectify.fromstring(""" - ... - ... 1 - ... 1.2 - ... 1 - ... true - ... what? - ... - ... - ... """) - - >>> print str(root) - root = None [ObjectifiedElement] - a = 1 [IntElement] - * attr1 = 'foo' - * attr2 = 'bar' - a = 1.2 [FloatElement] - b = 1 [IntElement] - b = True [BoolElement] - c = 'what?' [StringElement] - d = None [NoneElement] - * xsi:nil = 'true' - -This behaviour can be switched off in the same way:: - - >>> objectify.enableRecursiveStr(False) - - -What is different from ElementTree? ------------------------------------ +What is different from lxml.etree? +================================== Such a different Element API obviously implies some side effects to the normal behaviour of the rest of the API. @@ -945,7 +1069,8 @@ can access all children with the ``iterchildren()`` method on elements or retrieve a list by calling the ``getchildren()`` method. -* The find, findall and findtext methods use a different implementation as - they rely on the original iteration scheme. This has the disadvantage that - they may not be 100% backwards compatible, and the additional advantage that - they now support any XPath expression. +* The find, findall and findtext methods require a different implementation + based on ETXPath. In ``lxml.etree``, they use a Python implementation based + on the original iteration scheme. This has the disadvantage that they may + not be 100% backwards compatible, and the additional advantage that they now + support any XPath expression. 
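
Note on the ``DataElement()`` prefix handling documented in the objectify.txt
diff above: the following is a minimal pure-Python sketch of how an
``xsi:type`` value could be given its prefix. It is not lxml's implementation.
``resolve_xsi_type`` and ``DEFAULT_NSMAP`` are hypothetical names introduced
here for illustration, and raising ``ValueError`` on a bad prefix is an
assumption of this sketch. Only the namespace URIs and the example prefixes
are taken from the documentation text.

```python
XML_SCHEMA_NS = "http://www.w3.org/2001/XMLSchema"
XML_SCHEMA_INSTANCE_NS = "http://www.w3.org/2001/XMLSchema-instance"
PYTYPE_NAMESPACE = "http://codespeak.net/lxml/objectify/pytype"

# Hypothetical stand-in for the default prefixes shown in the docs (py, xsd, xsi).
DEFAULT_NSMAP = {
    "py": PYTYPE_NAMESPACE,
    "xsd": XML_SCHEMA_NS,
    "xsi": XML_SCHEMA_INSTANCE_NS,
}


def resolve_xsi_type(xsi, nsmap=None):
    """Return the prefixed xsi:type value for an XML Schema type name.

    Hypothetical helper, not part of lxml.
    """
    if nsmap is None:
        nsmap = DEFAULT_NSMAP
    if ":" in xsi:
        # An explicit prefix must be bound to the XML Schema namespace.
        prefix, _name = xsi.split(":", 1)
        if nsmap.get(prefix) != XML_SCHEMA_NS:
            raise ValueError("XSD types require the XSD namespace")
        return xsi
    # No prefix given: use whichever prefix is bound to the XSD namespace.
    for prefix, namespace in nsmap.items():
        if namespace == XML_SCHEMA_NS:
            return "%s:%s" % (prefix, xsi)
    raise ValueError("XSD types require the XSD namespace")
```

With the default mapping, ``resolve_xsi_type("string")`` gives ``xsd:string``.
With a custom mapping such as ``{"foo": XML_SCHEMA_NS}``, the value
``foo:string`` is accepted unchanged, which matches the documented examples.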
From scoder at codespeak.net Mon May 21 18:01:44 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Mon, 21 May 2007 18:01:44 +0200 (CEST)
Subject: [Lxml-checkins] r43532 - lxml/trunk/doc
Message-ID: <20070521160144.BBE84808F@code0.codespeak.net>

Author: scoder
Date: Mon May 21 18:01:38 2007
New Revision: 43532

Modified:
   lxml/trunk/doc/api.txt
Log:
doc cleanup

Modified: lxml/trunk/doc/api.txt
==============================================================================
--- lxml/trunk/doc/api.txt	(original)
+++ lxml/trunk/doc/api.txt	Mon May 21 18:01:38 2007
@@ -1,11 +1,11 @@
-=====================
-APIs specific to lxml
-=====================
-
-lxml tries to follow established APIs wherever possible.  Sometimes, however,
-the need to expose a feature in an easy way led to the invention of a new API.
-This page describes the major differences and a few additions to the main
-ElementTree API.
+===========================
+APIs specific to lxml.etree
+===========================
+
+lxml.etree tries to follow established APIs wherever possible.  Sometimes,
+however, the need to expose a feature in an easy way led to the invention of a
+new API.  This page describes the major differences and a few additions to the
+main ElementTree API.
 
 Separate pages describe the support for `parsing XML`_, executing `XPath and
 XSLT`_, `validating XML`_ and interfacing with other XML tools through the
From scoder at codespeak.net Mon May 21 18:02:08 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Mon, 21 May 2007 18:02:08 +0200 (CEST)
Subject: [Lxml-checkins] r43533 - in lxml/trunk: .
 src/lxml
Message-ID: <20070521160208.984BA808F@code0.codespeak.net>

Author: scoder
Date: Mon May 21 18:02:07 2007
New Revision: 43533

Modified:
   lxml/trunk/CHANGES.txt
   lxml/trunk/src/lxml/xpath.pxi
Log:
ETXPath was missing regexp keyword

Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt	(original)
+++ lxml/trunk/CHANGES.txt	Mon May 21 18:02:07 2007
@@ -44,6 +44,8 @@
 Bugs fixed
 ----------
 
+* ``ETXPath`` was missing the ``regexp`` keyword argument
+
 * passing '' as XPath namespace prefix did not raise an error
 
 * passing '' as namespace prefix in nsmap could be passed through to libxml2

Modified: lxml/trunk/src/lxml/xpath.pxi
==============================================================================
--- lxml/trunk/src/lxml/xpath.pxi	(original)
+++ lxml/trunk/src/lxml/xpath.pxi	Mon May 21 18:02:07 2007
@@ -407,10 +407,14 @@
 
 cdef class ETXPath(XPath):
     """Special XPath class that supports the ElementTree {uri} notation for
-    namespaces."""
-    def __init__(self, path, extensions=None):
+    namespaces.
+
+    Note that this class does not accept the ``namespace`` keyword
+    argument. All namespaces must be passed as part of the path string.
+ """ + def __init__(self, path, extensions=None, regexp=True): path, namespaces = self._nsextract_path(path) - XPath.__init__(self, path, namespaces, extensions) + XPath.__init__(self, path, namespaces, extensions, regexp) cdef _nsextract_path(self, path): # replace {namespaces} by new prefixes @@ -422,7 +426,7 @@ i = 1 for namespace_def in _find_namespaces(stripped_path): if namespace_def not in namespace_defs: - prefix = python.PyString_FromFormat("xpp%02d", i) + prefix = python.PyString_FromFormat("__xpp%02d", i) i = i+1 python.PyList_Append(namespace_defs, namespace_def) namespace = namespace_def[1:-1] # remove '{}' From scoder at codespeak.net Mon May 21 18:03:17 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 21 May 2007 18:03:17 +0200 (CEST) Subject: [Lxml-checkins] r43534 - lxml/trunk/src/lxml Message-ID: <20070521160317.BE3E3808F@code0.codespeak.net> Author: scoder Date: Mon May 21 18:03:15 2007 New Revision: 43534 Modified: lxml/trunk/src/lxml/objectify.pyx Log: raise ValueError instead of TypeError when XSD namespace is missing in objectify Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Mon May 21 18:03:15 2007 @@ -1458,7 +1458,7 @@ # StrType does not have a typecheck but is the default anyway, # so just accept it if given as type information if pytype is None: - return pytype + return None value = textOf(c_node) try: pytype.type_check(value) @@ -1468,7 +1468,6 @@ pass return None - def annotate(element_or_tree, ignore_old=True): """Recursively annotates the elements of an XML tree with 'pytype' attributes. 
@@ -1503,8 +1502,8 @@ if dict_result is not NULL: pytype = dict_result if pytype is not StrType: - # StrType does not have a typecheck but is the default anyway, - # so just accept it if given as type information + # StrType does not have a typecheck but is the default + # anyway, so just accept it if given as type information pytype = _check_type(c_node, pytype) if pytype is None: @@ -1766,7 +1765,7 @@ prefix, name = _xsi.split(':', 1) ns = nsmap.get(prefix) if ns != XML_SCHEMA_NS: - raise TypeError, "XSD types require the XSD namespace" + raise ValueError, "XSD types require the XSD namespace" elif nsmap is _DEFAULT_NSMAP: name = _xsi _xsi = 'xsd:' + _xsi @@ -1778,7 +1777,7 @@ _xsi = prefix + ':' + _xsi break else: - raise TypeError, "XSD types require the XSD namespace" + raise ValueError, "XSD types require the XSD namespace" python.PyDict_SetItem(_attributes, XML_SCHEMA_INSTANCE_TYPE_ATTR, _xsi) if _pytype is None: # allow using unregistered or even wrong xsi:type names From scoder at codespeak.net Mon May 21 18:03:23 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Mon, 21 May 2007 18:03:23 +0200 (CEST) Subject: [Lxml-checkins] r43535 - lxml/trunk/src/lxml/tests Message-ID: <20070521160323.A82008092@code0.codespeak.net> Author: scoder Date: Mon May 21 18:03:22 2007 New Revision: 43535 Modified: lxml/trunk/src/lxml/tests/test_objectify.py Log: test cases for XSI annotations in objectify Modified: lxml/trunk/src/lxml/tests/test_objectify.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_objectify.py (original) +++ lxml/trunk/src/lxml/tests/test_objectify.py Mon May 21 18:03:22 2007 @@ -13,6 +13,7 @@ from lxml import objectify +XML_SCHEMA_NS = "http://www.w3.org/2001/XMLSchema" XML_SCHEMA_INSTANCE_NS = "http://www.w3.org/2001/XMLSchema-instance" XML_SCHEMA_INSTANCE_TYPE_ATTR = "{%s}type" % XML_SCHEMA_INSTANCE_NS XML_SCHEMA_NIL_ATTR = "{%s}nil" % XML_SCHEMA_INSTANCE_NS @@ -497,6 
+498,23 @@ root.b = False self.assertFalse(root.b) + def test_dataelement_xsi(self): + el = objectify.DataElement(1, _xsi="string") + self.assertEquals( + el.get(XML_SCHEMA_INSTANCE_TYPE_ATTR), + 'xsd:string') + + def test_dataelement_xsi_nsmap(self): + el = objectify.DataElement(1, _xsi="string", + nsmap={'schema': XML_SCHEMA_NS}) + self.assertEquals( + el.get(XML_SCHEMA_INSTANCE_TYPE_ATTR), + 'schema:string') + + def test_dataelement_xsi_prefix_error(self): + self.assertRaises(ValueError, objectify.DataElement, 1, + _xsi="foo:string") + def test_pytype_annotation(self): XML = self.XML root = XML(u'''\ From scoder at codespeak.net Wed May 23 16:11:58 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 23 May 2007 16:11:58 +0200 (CEST) Subject: [Lxml-checkins] r43574 - in lxml/trunk: . src/lxml Message-ID: <20070523141158.0F0968074@code0.codespeak.net> Author: scoder Date: Wed May 23 16:11:57 2007 New Revision: 43574 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/xslt.pxi Log: pass resolver context on to imported XSLT documents Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Wed May 23 16:11:57 2007 @@ -44,6 +44,8 @@ Bugs fixed ---------- +* XSLT parsing failed to pass resolver context on to imported documents + * ``ETXPath`` was missing the ``regexp`` keyword argument * passing '' as XPath namespace prefix did not raise an error Modified: lxml/trunk/src/lxml/xslt.pxi ============================================================================== --- lxml/trunk/src/lxml/xslt.pxi (original) +++ lxml/trunk/src/lxml/xslt.pxi Wed May 23 16:11:57 2007 @@ -141,6 +141,8 @@ c_doc = _xslt_resolve_stylesheet(c_uri, c_pcontext) if c_doc is not NULL: python.PyGILState_Release(gil_state) + if c_type == xslt.XSLT_LOAD_STYLESHEET: + c_doc._private = c_pcontext return c_doc c_doc = _xslt_resolve_from_python(c_uri, c_pcontext, 
parse_options, &error) @@ -151,6 +153,8 @@ _xslt_store_resolver_exception(c_uri, c_pcontext, c_type) python.PyGILState_Release(gil_state) + if c_doc is not NULL and c_type == xslt.XSLT_LOAD_STYLESHEET: + c_doc._private = c_pcontext return c_doc cdef xslt.xsltDocLoaderFunc XSLT_DOC_DEFAULT_LOADER From scoder at codespeak.net Wed May 23 21:49:59 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 23 May 2007 21:49:59 +0200 (CEST) Subject: [Lxml-checkins] r43589 - lxml/trunk/src/lxml Message-ID: <20070523194959.B5B1B807C@code0.codespeak.net> Author: scoder Date: Wed May 23 21:49:58 2007 New Revision: 43589 Modified: lxml/trunk/src/lxml/xinclude.pxd Log: some more C-function declared Modified: lxml/trunk/src/lxml/xinclude.pxd ============================================================================== --- lxml/trunk/src/lxml/xinclude.pxd (original) +++ lxml/trunk/src/lxml/xinclude.pxd Wed May 23 21:49:58 2007 @@ -1,9 +1,17 @@ from tree cimport xmlDoc, xmlNode cdef extern from "libxml/xinclude.h": - + + ctypedef struct xmlXIncludeCtxt + cdef int xmlXIncludeProcess(xmlDoc* doc) cdef int xmlXIncludeProcessFlags(xmlDoc* doc, int parser_opts) cdef int xmlXIncludeProcessTree(xmlNode* doc) cdef int xmlXIncludeProcessTreeFlags(xmlNode* doc, int parser_opts) - + + cdef xmlXIncludeCtxt* xmlXIncludeNewContext(xmlDoc* doc) + cdef int xmlXIncludeProcessNode(xmlXIncludeCtxt* ctxt, xmlNode* node) + cdef int xmlXIncludeSetFlags(xmlXIncludeCtxt* ctxt, int flags) + + # libxml2 >= 2.6.27 + cdef int xmlXIncludeProcessFlagsData(xmlDoc* doc, int flags, void* data) From scoder at codespeak.net Wed May 23 21:51:22 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Wed, 23 May 2007 21:51:22 +0200 (CEST) Subject: [Lxml-checkins] r43590 - in lxml/trunk: . 
src/lxml Message-ID: <20070523195122.4E63F807C@code0.codespeak.net> Author: scoder Date: Wed May 23 21:51:22 2007 New Revision: 43590 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/objectify.pyx Log: parse() function in objectify Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Wed May 23 21:51:22 2007 @@ -8,6 +8,8 @@ Features added -------------- +* ``parse()`` function in ``objectify``, corresponding to ``XML()`` etc. + * ``column`` field on error log entries to accompany the ``line`` field * Error specific messages in XPath parsing and evaluation Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Wed May 23 21:51:22 2007 @@ -1685,8 +1685,8 @@ __DEFAULT_PARSER = etree.XMLParser(remove_blank_text=True) __DEFAULT_PARSER.setElementClassLookup( ObjectifyElementClassLookup() ) -cdef object parser -parser = __DEFAULT_PARSER +cdef object objectify_parser +objectify_parser = __DEFAULT_PARSER def setDefaultParser(new_parser = None): """Replace the default parser used by objectify's Element() and @@ -1696,16 +1696,16 @@ Call without arguments to reset to the original parser. 
""" - global parser + global objectify_parser if new_parser is None: - parser = __DEFAULT_PARSER + objectify_parser = __DEFAULT_PARSER elif isinstance(new_parser, etree.XMLParser): - parser = new_parser + objectify_parser = new_parser else: raise TypeError, "parser must inherit from lxml.etree.XMLParser" cdef _Element _makeElement(tag, text, attrib, nsmap): - return cetree.makeElement(tag, None, parser, text, None, attrib, nsmap) + return cetree.makeElement(tag, None, objectify_parser, text, None, attrib, nsmap) ################################################################################ # Module level factory functions @@ -1718,10 +1718,18 @@ NOTE: requires parser based element class lookup activated in lxml.etree! """ - return _fromstring(xml, parser) + return _fromstring(xml, objectify_parser) XML = fromstring +cdef object _parse +_parse = etree.parse + +def parse(f, parser=None): + if parser is None: + parser = objectify_parser + return _parse(f, parser) + cdef object _DEFAULT_NSMAP _DEFAULT_NSMAP = { "py" : PYTYPE_NAMESPACE, "xsi" : XML_SCHEMA_INSTANCE_NS, From scoder at codespeak.net Sat May 26 20:45:09 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 26 May 2007 20:45:09 +0200 (CEST) Subject: [Lxml-checkins] r43690 - in lxml/trunk: . 
doc src/lxml Message-ID: <20070526184509.3069D80E6@code0.codespeak.net> Author: scoder Date: Sat May 26 20:45:08 2007 New Revision: 43690 Modified: lxml/trunk/CHANGES.txt lxml/trunk/doc/api.txt lxml/trunk/doc/xpathxslt.txt lxml/trunk/src/lxml/iterparse.pxi lxml/trunk/src/lxml/parser.pxi lxml/trunk/src/lxml/xmlerror.pxi Log: display first error in exception string instead of last one Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Sat May 26 20:45:08 2007 @@ -46,6 +46,9 @@ Bugs fixed ---------- +* The text in exceptions raised by XML parsers and XPath evaluators now + reports the first error that occurred instead of the last + * XSLT parsing failed to pass resolver context on to imported documents * ``ETXPath`` was missing the ``regexp`` keyword argument Modified: lxml/trunk/doc/api.txt ============================================================================== --- lxml/trunk/doc/api.txt (original) +++ lxml/trunk/doc/api.txt Sat May 26 20:45:08 2007 @@ -265,18 +265,44 @@ -By default, lxml (and ElementTree) output the XML declaration only if it is -required. You can enable or disable it explicitly by passing another keyword -argument for the serialisation:: +By default, lxml (just as ElementTree) outputs the XML declaration only if it +is required by the standard:: - >>> print etree.tostring(root, xml_declaration=True) - - + >>> unicode_root = etree.Element(u"t\u1234st") + >>> unicode_root.text = u"t\u4321st" + >>> etree.tostring(unicode_root, encoding="utf-8") + 't\xe4\x8c\xa1st' + + >>> print etree.tostring(unicode_root, encoding="iso-8859-1") + + t䌡st Also see the general remarks on `Unicode support`_. .. 
_`Unicode support`: parsing.html#python-unicode-strings +You can enable or disable the declaration explicitly by passing another +keyword argument for the serialisation:: + + >>> print etree.tostring(root, xml_declaration=True) + + + + >>> etree.tostring(unicode_root, encoding="utf-8", + ... xml_declaration=False) + 't\xe4\x8c\xa1st' + +Note that a standard compliant XML parser will not consider the last line +well-formed XML if the encoding is not explicitly provided somehow, e.g. in an +underlying transport protocol:: + + >>> notxml = etree.tostring(unicode_root, encoding="utf-8", + ... xml_declaration=False) + >>> etree.XML(notxml) + Traceback (most recent call last): + ... + XMLSyntaxError: line 1: error parsing attribute name + XInclude and ElementInclude --------------------------- Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Sat May 26 20:45:08 2007 @@ -277,7 +277,7 @@ >>> find = etree.XPath("\\") Traceback (most recent call last): ... 
- XPathSyntaxError: Error in xpath expression + XPathSyntaxError: Invalid expression lxml will also try to give you a hint what went wrong, so if you pass a more complex expression, you may get a somewhat more specific error:: Modified: lxml/trunk/src/lxml/iterparse.pxi ============================================================================== --- lxml/trunk/src/lxml/iterparse.pxi (original) +++ lxml/trunk/src/lxml/iterparse.pxi Sat May 26 20:45:08 2007 @@ -314,7 +314,7 @@ break if error != 0: self._source = None - _raiseParseError(self._parser_ctxt, self._filename) + _raiseParseError(self._parser_ctxt, self._filename, None) if python.PyList_GET_SIZE(context._events) == 0: self.root = context._root self._source = None Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Sat May 26 20:45:08 2007 @@ -429,9 +429,6 @@ def __get__(self): return self._error_log.copy() - def __dummy(self): - pass - def setElementClassLookup(self, ElementClassLookup lookup = None): """Set a lookup scheme for element classes generated from this parser. 
@@ -496,7 +493,8 @@ python.PyEval_RestoreThread(state) recover = self._parse_options & xmlparser.XML_PARSE_RECOVER - return _handleParseResult(pctxt, result, None, recover) + return _handleParseResult(pctxt, result, None, + self._error_log._first_error, recover) finally: self._cleanup() self._context.clear() @@ -529,7 +527,8 @@ python.PyEval_RestoreThread(state) recover = self._parse_options & xmlparser.XML_PARSE_RECOVER - return _handleParseResult(pctxt, result, None, recover) + return _handleParseResult(pctxt, result, None, + self._error_log._first_error, recover) finally: self._cleanup() self._context.clear() @@ -558,7 +557,8 @@ python.PyEval_RestoreThread(state) recover = self._parse_options & xmlparser.XML_PARSE_RECOVER - return _handleParseResult(pctxt, result, c_filename, recover) + return _handleParseResult(pctxt, result, c_filename, + self._error_log._first_error, recover) finally: self._cleanup() self._context.clear() @@ -583,14 +583,15 @@ pctxt, self._parse_options, self._parser_type) recover = self._parse_options & xmlparser.XML_PARSE_RECOVER - return _handleParseResult(pctxt, result, filename, recover) + return _handleParseResult(pctxt, result, filename, + self._error_log._first_error, recover) finally: self._cleanup() self._context.clear() self._error_log.disconnect() self._unlockParser() -cdef int _raiseParseError(xmlParserCtxt* ctxt, filename) except 0: +cdef int _raiseParseError(xmlParserCtxt* ctxt, filename, error) except 0: if filename is not None and \ ctxt.lastError.domain == xmlerror.XML_FROM_IO: if ctxt.lastError.message is not NULL: @@ -599,16 +600,21 @@ else: message = "Error reading file '%s'" % filename raise IOError, message + elif error is not None and error.message is not None: + message = error.message + if error.line > 0: + message = "line %d: %s" % (error.line, message) + raise XMLSyntaxError, message elif ctxt.lastError.message is not NULL: message = (ctxt.lastError.message).strip() - if ctxt.lastError.line >= 0: + if 
ctxt.lastError.line > 0: message = "line %d: %s" % (ctxt.lastError.line, message) raise XMLSyntaxError, message else: raise XMLSyntaxError cdef xmlDoc* _handleParseResult(xmlParserCtxt* ctxt, xmlDoc* result, - filename, int recover) except NULL: + filename, error, int recover) except NULL: cdef _ResolverContext context if ctxt.myDoc is not NULL: if ctxt.myDoc != result: @@ -632,7 +638,7 @@ context._raise_if_stored() if result is NULL: - _raiseParseError(ctxt, filename) + _raiseParseError(ctxt, filename, error) elif result.URL is NULL and filename is not None: result.URL = tree.xmlStrdup(_cstr(filename)) return result @@ -715,7 +721,7 @@ pctxt, c_text, NULL, NULL, options) try: recover = options & xmlparser.XML_PARSE_RECOVER - c_doc = _handleParseResult(pctxt, c_doc, None, recover) + c_doc = _handleParseResult(pctxt, c_doc, None, None, recover) finally: xmlparser.xmlFreeParserCtxt(pctxt) return c_doc @@ -739,7 +745,7 @@ filename = None else: filename = c_filename - c_doc = _handleParseResult(pctxt, c_doc, filename, recover) + c_doc = _handleParseResult(pctxt, c_doc, filename, None, recover) finally: xmlparser.xmlFreeParserCtxt(pctxt) return c_doc Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Sat May 26 20:45:08 2007 @@ -222,10 +222,13 @@ return self.filter_from_level(ErrorLevels.WARNING) cdef class _ErrorLog(_ListErrorLog): + cdef object _first_error def __init__(self): + self._first_error = None _ListErrorLog.__init__(self, []) cdef void connect(self): + self._first_error = None del self._entries[:] connectErrorLog(self) @@ -233,6 +236,7 @@ connectErrorLog(NULL) def clear(self): + self._first_error = None del self._entries[:] def copy(self): @@ -244,6 +248,8 @@ return iter(self._entries[:]) def receive(self, entry): + if self._first_error is None: + self._first_error = entry 
python.PyList_Append(self._entries, entry) cdef class _DomainErrorLog(_ErrorLog): From scoder at codespeak.net Sat May 26 20:45:34 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 26 May 2007 20:45:34 +0200 (CEST) Subject: [Lxml-checkins] r43691 - lxml/trunk/doc Message-ID: <20070526184534.8B78280E9@code0.codespeak.net> Author: scoder Date: Sat May 26 20:45:34 2007 New Revision: 43691 Modified: lxml/trunk/doc/objectify.txt Log: refer to benchmark page from objectify docs Modified: lxml/trunk/doc/objectify.txt ============================================================================== --- lxml/trunk/doc/objectify.txt (original) +++ lxml/trunk/doc/objectify.txt Sat May 26 20:45:34 2007 @@ -20,8 +20,12 @@ not be mixed with other element implementations, to avoid non-obvious behaviour. +The `benchmark page`_ has some hints on performance optimisation of code using +lxml.objectify. + .. _Amara: http://uche.ogbuji.net/tech/4suite/amara/ .. _gnosis.xml.objectify: http://gnosis.cx/download/ +.. _`benchmark page`: performance.html#lxml-objectify .. contents:: .. 
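
Note on the "first error" reporting introduced in r43690: the idea can be
sketched in plain Python as below. This is an illustration only, not the
Pyrex code from xmlerror.pxi or parser.pxi. ``ErrorEntry``, ``FirstErrorLog``
and ``build_message`` are hypothetical names. The sketch assumes that each log
entry carries a ``line`` number and a ``message``, as the diff suggests.

```python
class ErrorEntry:
    """Hypothetical log entry with a line number and a message."""

    def __init__(self, line, message):
        self.line = line
        self.message = message


class FirstErrorLog:
    """Collects entries and remembers the first one received."""

    def __init__(self):
        self._first_error = None
        self._entries = []

    def receive(self, entry):
        # Only the first entry after a reset is kept as "the" error.
        if self._first_error is None:
            self._first_error = entry
        self._entries.append(entry)

    def clear(self):
        self._first_error = None
        del self._entries[:]

    def build_message(self, default):
        # Fall back to the default text if nothing useful was logged.
        first = self._first_error
        if first is None or first.message is None:
            return default
        if first.line > 0:
            return "line %d: %s" % (first.line, first.message)
        return first.message
```

A parser that logs several follow-up errors would then report the entry that
started the cascade, which is usually the most informative one for the user.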
From scoder at codespeak.net Sat May 26 20:48:57 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 26 May 2007 20:48:57 +0200 (CEST)
Subject: [Lxml-checkins] r43692 - lxml/trunk/src/lxml
Message-ID: <20070526184857.3584080E6@code0.codespeak.net>

Author: scoder
Date: Sat May 26 20:48:56 2007
New Revision: 43692

Modified:
   lxml/trunk/src/lxml/xpath.pxi
Log:
display first error in exception string instead of last one

Modified: lxml/trunk/src/lxml/xpath.pxi
==============================================================================
--- lxml/trunk/src/lxml/xpath.pxi	(original)
+++ lxml/trunk/src/lxml/xpath.pxi	Sat May 26 20:48:56 2007
@@ -158,7 +158,10 @@
             if entry is not None and entry.message:
                 raise XPathSyntaxError, entry.message
 
-        if self._xpathCtxt is not NULL and \
+        if self._error_log._first_error is not None and \
+               self._error_log._first_error.message is not None:
+            message = self._error_log._first_error.message
+        elif self._xpathCtxt is not NULL and \
                self._xpathCtxt.lastError.message is not NULL:
             message = funicode(self._xpathCtxt.lastError.message)
         else:
From scoder at codespeak.net Sat May 26 20:51:15 2007
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 26 May 2007 20:51:15 +0200 (CEST)
Subject: [Lxml-checkins] r43693 - lxml/trunk/src/lxml
Message-ID: <20070526185115.C402880E6@code0.codespeak.net>

Author: scoder
Date: Sat May 26 20:51:15 2007
New Revision: 43693

Modified:
   lxml/trunk/src/lxml/proxy.pxi
Log:
copy and paste bug? (not sure)

Modified: lxml/trunk/src/lxml/proxy.pxi
==============================================================================
--- lxml/trunk/src/lxml/proxy.pxi	(original)
+++ lxml/trunk/src/lxml/proxy.pxi	Sat May 26 20:51:15 2007
@@ -56,7 +56,6 @@
     c_new_root = tree.xmlDocCopyNode(c_node, c_doc, 2) # non recursive!
tree.xmlDocSetRootElement(c_doc, c_new_root) _copyParentNamespaces(c_node, c_new_root) - _copyParentNamespaces(c_node, c_root) c_new_root.children = c_node.children c_new_root.last = c_node.last From scoder at codespeak.net Sat May 26 20:52:16 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 26 May 2007 20:52:16 +0200 (CEST) Subject: [Lxml-checkins] r43694 - lxml/trunk/src/lxml Message-ID: <20070526185216.6E3DC80E8@code0.codespeak.net> Author: scoder Date: Sat May 26 20:52:16 2007 New Revision: 43694 Modified: lxml/trunk/src/lxml/etree_defs.h lxml/trunk/src/lxml/tree.pxd Log: provide libxml2 xmlsave API (in case we ever need it) Modified: lxml/trunk/src/lxml/etree_defs.h ============================================================================== --- lxml/trunk/src/lxml/etree_defs.h (original) +++ lxml/trunk/src/lxml/etree_defs.h Sat May 26 20:52:16 2007 @@ -47,6 +47,27 @@ #define HTML_PARSE_RECOVER XML_PARSE_RECOVER #endif +/* added to xmlsave API in libxml2 2.6.23 */ +#if LIBXML_VERSION < 20623 +#define xmlSaveToBuffer(buffer, encoding, options) +#endif + +/* added to xmlsave API in libxml2 2.6.22 */ +#if LIBXML_VERSION < 20622 +#define XML_SAVE_NO_EMPTY 1<<2, /* no empty tags */ +#define XML_SAVE_NO_XHTML 1<<3 /* disable XHTML1 specific rules */ +#endif + +/* added to xmlsave API in libxml2 2.6.21 */ +#if LIBXML_VERSION < 20621 +#define XML_SAVE_NO_DECL 1<<1, /* drop the xml declaration */ +#endif + +/* added to xmlsave API in libxml2 2.6.17 */ +#if LIBXML_VERSION < 20617 +#define XML_SAVE_FORMAT 1<<0, /* format save output */ +#endif + /* work around MSDEV 6.0 */ #if (_MSC_VER == 1200) && (WINVER < 0x0500) long _ftol( double ); //defined by VC6 C libs Modified: lxml/trunk/src/lxml/tree.pxd ============================================================================== --- lxml/trunk/src/lxml/tree.pxd (original) +++ lxml/trunk/src/lxml/tree.pxd Sat May 26 20:52:16 2007 @@ -212,6 +212,7 @@ cdef int xmlReconciliateNs(xmlDoc* doc, xmlNode* 
tree) cdef xmlNs* xmlNewReconciliedNs(xmlDoc* doc, xmlNode* tree, xmlNs* ns) cdef xmlBuffer* xmlBufferCreate() + cdef void xmlBufferFree(xmlBuffer* buf) cdef char* xmlBufferContent(xmlBuffer* buf) cdef int xmlBufferLength(xmlBuffer* buf) cdef int xmlKeepBlanksDefault(int val) @@ -245,11 +246,23 @@ cdef extern from "libxml/xmlsave.h": ctypedef struct xmlSaveCtxt - + + ctypedef enum xmlSaveOption: + XML_SAVE_FORMAT = 1 # format save output (2.6.17) + XML_SAVE_NO_DECL = 2 # drop the xml declaration (2.6.21) + XML_SAVE_NO_EMPTY = 4 # no empty tags (2.6.22) + XML_SAVE_NO_XHTML = 8 # disable XHTML1 specific rules (2.6.22) + cdef xmlSaveCtxt* xmlSaveToFilename(char* filename, char* encoding, int options) + cdef xmlSaveCtxt* xmlSaveToBuffer(xmlBuffer* buffer, char* encoding, + int options) # libxml2 2.6.23 cdef long xmlSaveDoc(xmlSaveCtxt* ctxt, xmlDoc* doc) + cdef long xmlSaveTree(xmlSaveCtxt* ctxt, xmlNode* node) cdef int xmlSaveClose(xmlSaveCtxt* ctxt) + cdef int xmlSaveFlush(xmlSaveCtxt* ctxt) + cdef int xmlSaveSetAttrEscape(xmlSaveCtxt* ctxt, void* escape_func) + cdef int xmlSaveSetEscape(xmlSaveCtxt* ctxt, void* escape_func) cdef extern from "libxml/globals.h": cdef int xmlThrDefKeepBlanksDefaultValue(int onoff) From scoder at codespeak.net Sat May 26 20:54:28 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 26 May 2007 20:54:28 +0200 (CEST) Subject: [Lxml-checkins] r43695 - lxml/trunk/src/lxml Message-ID: <20070526185428.6A91580E6@code0.codespeak.net> Author: scoder Date: Sat May 26 20:54:27 2007 New Revision: 43695 Modified: lxml/trunk/src/lxml/etree.pyx Log: provide original parser at ElementTree API level even if there is no root node Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sat May 26 20:54:27 2007 @@ -1240,6 +1240,8 @@ if self._context_node is not None and \ self._context_node._doc is 
not None: return self._context_node._doc._parser + if self._doc is not None: + return self._doc._parser return None def write(self, file, encoding=None, @@ -1319,7 +1321,6 @@ path = "." + path return root.findall(path) - # extensions to ElementTree API def xpath(self, _path, namespaces=None, extensions=None, **_variables): """XPath evaluate in context of document. @@ -1396,7 +1397,8 @@ There is support for loading files through the file system, HTTP and FTP. - Note that XInclude does not support custom resolvers in Python space. + Note that XInclude does not support custom resolvers in Python space + due to restrictions of libxml2 <= 2.6.28. """ cdef python.PyThreadState* state cdef int result From scoder at codespeak.net Sat May 26 20:54:52 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sat, 26 May 2007 20:54:52 +0200 (CEST) Subject: [Lxml-checkins] r43696 - lxml/trunk Message-ID: <20070526185452.825EE80E8@code0.codespeak.net> Author: scoder Date: Sat May 26 20:54:52 2007 New Revision: 43696 Modified: lxml/trunk/TODO.txt Log: cleanup Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Sat May 26 20:54:52 2007 @@ -16,8 +16,6 @@ * test namespaces more in-depth -* will namespace nodes of unknown namespaces be added (and never freed?) - * more testing on multi-threading * better exception messages for XPath and schemas based on error log, From scoder at codespeak.net Sun May 27 14:10:09 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 27 May 2007 14:10:09 +0200 (CEST) Subject: [Lxml-checkins] r43711 - in lxml/trunk: . 
src/lxml src/lxml/tests Message-ID: <20070527121009.DB6B880B4@code0.codespeak.net> Author: scoder Date: Sun May 27 14:10:06 2007 New Revision: 43711 Modified: lxml/trunk/CHANGES.txt lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/parser.pxi lxml/trunk/src/lxml/relaxng.pxi lxml/trunk/src/lxml/tests/test_etree.py lxml/trunk/src/lxml/xmlerror.pxi lxml/trunk/src/lxml/xmlschema.pxi lxml/trunk/src/lxml/xpath.pxi lxml/trunk/src/lxml/xslt.pxi Log: better exception messages in XPath, parsers and validators; major cleanup of error message extraction Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Sun May 27 14:10:06 2007 @@ -46,8 +46,8 @@ Bugs fixed ---------- -* The text in exceptions raised by XML parsers and XPath evaluators now - reports the first error that occurred instead of the last +* The text in exceptions raised by XML parsers, validators and XPath + evaluators now reports the first error that occurred instead of the last * XSLT parsing failed to pass resolver context on to imported documents Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Sun May 27 14:10:06 2007 @@ -1976,7 +1976,7 @@ cdef _ErrorLog _error_log def __init__(self): self._error_log = _ErrorLog() - + def validate(self, etree): """Validate the document using this schema. @@ -1986,12 +1986,14 @@ def assertValid(self, etree): "Raises DocumentInvalid if the document does not comply with the schema." if not self(etree): - raise DocumentInvalid, "Document does not comply with schema" + raise DocumentInvalid, self._error_log._buildExceptionMessage( + "Document does not comply with schema") def assert_(self, etree): "Raises AssertionError if the document does not comply with the schema." 
if not self(etree): - raise AssertionError, "Document does not comply with schema" + raise AssertionError, self._error_log._buildExceptionMessage( + "Document does not comply with schema") property error_log: def __get__(self): Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Sun May 27 14:10:06 2007 @@ -494,7 +494,7 @@ recover = self._parse_options & xmlparser.XML_PARSE_RECOVER return _handleParseResult(pctxt, result, None, - self._error_log._first_error, recover) + self._error_log, recover) finally: self._cleanup() self._context.clear() @@ -528,7 +528,7 @@ recover = self._parse_options & xmlparser.XML_PARSE_RECOVER return _handleParseResult(pctxt, result, None, - self._error_log._first_error, recover) + self._error_log, recover) finally: self._cleanup() self._context.clear() @@ -558,7 +558,7 @@ recover = self._parse_options & xmlparser.XML_PARSE_RECOVER return _handleParseResult(pctxt, result, c_filename, - self._error_log._first_error, recover) + self._error_log, recover) finally: self._cleanup() self._context.clear() @@ -584,14 +584,15 @@ recover = self._parse_options & xmlparser.XML_PARSE_RECOVER return _handleParseResult(pctxt, result, filename, - self._error_log._first_error, recover) + self._error_log, recover) finally: self._cleanup() self._context.clear() self._error_log.disconnect() self._unlockParser() -cdef int _raiseParseError(xmlParserCtxt* ctxt, filename, error) except 0: +cdef int _raiseParseError(xmlParserCtxt* ctxt, filename, + _ErrorLog error_log) except 0: if filename is not None and \ ctxt.lastError.domain == xmlerror.XML_FROM_IO: if ctxt.lastError.message is not NULL: @@ -600,11 +601,9 @@ else: message = "Error reading file '%s'" % filename raise IOError, message - elif error is not None and error.message is not None: - message = error.message - if error.line > 0: - message = "line %d: %s" % 
(error.line, message) - raise XMLSyntaxError, message + elif error_log is not None: + raise XMLSyntaxError, error_log._buildExceptionMessage( + "Document is not well formed") elif ctxt.lastError.message is not NULL: message = (ctxt.lastError.message).strip() if ctxt.lastError.line > 0: @@ -614,7 +613,8 @@ raise XMLSyntaxError cdef xmlDoc* _handleParseResult(xmlParserCtxt* ctxt, xmlDoc* result, - filename, error, int recover) except NULL: + filename, _ErrorLog error_log, + int recover) except NULL: cdef _ResolverContext context if ctxt.myDoc is not NULL: if ctxt.myDoc != result: @@ -638,7 +638,7 @@ context._raise_if_stored() if result is NULL: - _raiseParseError(ctxt, filename, error) + _raiseParseError(ctxt, filename, error_log) elif result.URL is NULL and filename is not None: result.URL = tree.xmlStrdup(_cstr(filename)) return result Modified: lxml/trunk/src/lxml/relaxng.pxi ============================================================================== --- lxml/trunk/src/lxml/relaxng.pxi (original) +++ lxml/trunk/src/lxml/relaxng.pxi Sun May 27 14:10:06 2007 @@ -31,19 +31,21 @@ cdef xmlDoc* fake_c_doc cdef char* c_href cdef relaxng.xmlRelaxNGParserCtxt* parser_ctxt + _Validator.__init__(self) self._c_schema = NULL fake_c_doc = NULL if etree is not None: doc = _documentOrRaise(etree) root_node = _rootNodeOrRaise(etree) c_node = root_node._c_node - # work around for libxml2 bug if document is not RNG at all + # work around for libxml2 crash bug if document is not RNG at all if _LIBXML_VERSION_INT < 20624: c_href = _getNs(c_node) if c_href is NULL or \ cstd.strcmp(c_href, 'http://relaxng.org/ns/structure/1.0') != 0: raise RelaxNGParseError, "Document is not Relax NG" + self._error_log.connect() fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) parser_ctxt = relaxng.xmlRelaxNGNewDocParserCtxt(fake_c_doc) elif file is not None: @@ -52,27 +54,30 @@ # XXX assume a string object filename = file filename = _encodeFilename(filename) + self._error_log.connect() 
parser_ctxt = relaxng.xmlRelaxNGNewParserCtxt(_cstr(filename)) else: raise RelaxNGParseError, "No tree or file given" if parser_ctxt is NULL: + self._error_log.disconnect() if fake_c_doc is not NULL: _destroyFakeDoc(doc._c_doc, fake_c_doc) raise RelaxNGParseError, "Document is not parsable as Relax NG" self._c_schema = relaxng.xmlRelaxNGParse(parser_ctxt) + self._error_log.disconnect() - # XXX: freeing parser context will crash if document was not RNG!! - #relaxng.xmlRelaxNGFreeParserCtxt(parser_ctxt) + if _LIBXML_VERSION_INT >= 20624: + relaxng.xmlRelaxNGFreeParserCtxt(parser_ctxt) if self._c_schema is NULL: if fake_c_doc is not NULL: - relaxng.xmlRelaxNGFreeParserCtxt(parser_ctxt) + if _LIBXML_VERSION_INT < 20624: + relaxng.xmlRelaxNGFreeParserCtxt(parser_ctxt) _destroyFakeDoc(doc._c_doc, fake_c_doc) - raise RelaxNGParseError, "Document is not valid Relax NG" - relaxng.xmlRelaxNGFreeParserCtxt(parser_ctxt) + raise RelaxNGParseError, self._error_log._buildExceptionMessage( + "Document is not valid Relax NG") if fake_c_doc is not NULL: _destroyFakeDoc(doc._c_doc, fake_c_doc) - _Validator.__init__(self) def __dealloc__(self): relaxng.xmlRelaxNGFree(self._c_schema) @@ -95,7 +100,7 @@ valid_ctxt = relaxng.xmlRelaxNGNewValidCtxt(self._c_schema) if valid_ctxt is NULL: self._error_log.disconnect() - raise RelaxNGError, "Failed to create validation context" + python.PyErr_NoMemory() c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) state = python.PyEval_SaveThread() Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Sun May 27 14:10:06 2007 @@ -342,7 +342,8 @@ def resolve(self, url, id, context): assertEqual(url, test_url) return self.resolve_string( - u'' % url, context) + u''' + ''' % url, context) parser.resolvers.add(MyResolver()) @@ -351,9 +352,9 @@ root = tree.getroot() 
self.assertEquals(root.text, test_url) - def test_resolve_empty(self): + def _test_resolve_empty(self): parse = self.etree.parse - parser = self.etree.XMLParser(dtd_validation=True) + parser = self.etree.XMLParser(load_dtd=True) assertEqual = self.assertEqual test_url = u"__nosuch.dtd" @@ -369,12 +370,9 @@ parser.resolvers.add(MyResolver()) xml = u'&myentity;' % test_url - tree = parse(StringIO(xml), parser) + self.assertRaises(etree.XMLSyntaxError, parse, StringIO(xml), parser) self.assert_(check.resolved) - root = tree.getroot() - self.assertEquals(root.text, None) - def test_resolve_error(self): parse = self.etree.parse parser = self.etree.XMLParser(dtd_validation=True) Modified: lxml/trunk/src/lxml/xmlerror.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlerror.pxi (original) +++ lxml/trunk/src/lxml/xmlerror.pxi Sun May 27 14:10:06 2007 @@ -84,12 +84,14 @@ return ErrorLevels._getName(self.level, "unknown") cdef class _BaseErrorLog: + cdef _LogEntry _first_error cdef readonly object last_error - def __init__(self, last_error=None): + def __init__(self, first_error, last_error): + self._first_error = first_error self.last_error = last_error def copy(self): - return _BaseErrorLog(self.last_error) + return _BaseErrorLog(self._first_error, self.last_error) def __repr__(self): return '' @@ -124,18 +126,40 @@ if is_error: self.last_error = entry + cdef _buildExceptionMessage(self, default_message): + if self._first_error is None: + return default_message + if self._first_error.message is not None and self._first_error.message: + message = self._first_error.message + elif default_message is None: + return None + else: + message = default_message + if self._first_error.line > 0: + if self._first_error.column > 0: + message = "%s, line %d, column %d" % ( + message, self._first_error.line, self._first_error.column) + else: + message = "%s, line %d" % (message, self._first_error.line) + return message + cdef 
class _ListErrorLog(_BaseErrorLog): "Immutable base version of a list based error log." cdef object _entries - def __init__(self, entries, last_error=None): - _BaseErrorLog.__init__(self, last_error) + def __init__(self, entries, first_error, last_error): + if entries: + if first_error is None: + first_error = entries[0] + if last_error is None: + last_error = entries[-1] + _BaseErrorLog.__init__(self, first_error, last_error) self._entries = entries def copy(self): """Creates a shallow copy of this error log. Reuses the list of entries. """ - return _ListErrorLog(self._entries, self.last_error) + return _ListErrorLog(self._entries, self._first_error, self.last_error) def __iter__(self): return iter(self._entries) @@ -172,7 +196,7 @@ for entry in self._entries: if entry.domain in domains: python.PyList_Append(filtered, entry) - return _ListErrorLog(filtered) + return _ListErrorLog(filtered, None, None) def filter_types(self, types): """Filter the errors by the given types and return a new error log @@ -185,7 +209,7 @@ for entry in self._entries: if entry.type in types: python.PyList_Append(filtered, entry) - return _ListErrorLog(filtered) + return _ListErrorLog(filtered, None, None) def filter_levels(self, levels): """Filter the errors by the given error levels and return a new error @@ -198,7 +222,7 @@ for entry in self._entries: if entry.level in levels: python.PyList_Append(filtered, entry) - return _ListErrorLog(filtered) + return _ListErrorLog(filtered, None, None) def filter_from_level(self, level): "Return a log with all messages of the requested level or worse." @@ -207,7 +231,7 @@ for entry in self._entries: if entry.level >= level: python.PyList_Append(filtered, entry) - return _ListErrorLog(filtered) + return _ListErrorLog(filtered, None, None) def filter_from_fatals(self): "Convenience method to get all fatal error messages."
@@ -222,10 +246,8 @@ return self.filter_from_level(ErrorLevels.WARNING) cdef class _ErrorLog(_ListErrorLog): - cdef object _first_error def __init__(self): - self._first_error = None - _ListErrorLog.__init__(self, []) + _ListErrorLog.__init__(self, [], None, None) cdef void connect(self): self._first_error = None @@ -242,7 +264,8 @@ def copy(self): """Creates a shallow copy of this error log and the list of entries. """ - return _ListErrorLog(self._entries[:], self.last_error) + return _ListErrorLog(self._entries[:], self._first_error, + self.last_error) def __iter__(self): return iter(self._entries[:]) @@ -295,7 +318,7 @@ cdef object _map_level cdef object _log def __init__(self, logger_name=None): - _BaseErrorLog.__init__(self) + _BaseErrorLog.__init__(self, None, None) import logging self.level_map = { ErrorLevels.WARNING : logging.WARNING, @@ -312,7 +335,7 @@ def copy(self): """Dummy method that returns an empty error log. """ - return _ListErrorLog([]) + return _ListErrorLog([], None, None) def log(self, entry, message_format_string, *args): self._log( @@ -344,9 +367,9 @@ # local log functions: forward error to logger object cdef void _forwardError(void* c_log_handler, xmlerror.xmlError* error): - cdef _ErrorLog log_handler + cdef _BaseErrorLog log_handler if c_log_handler is not NULL: - log_handler = <_ErrorLog>c_log_handler + log_handler = <_BaseErrorLog>c_log_handler else: log_handler = __GLOBAL_ERROR_LOG log_handler._receive(error) Modified: lxml/trunk/src/lxml/xmlschema.pxi ============================================================================== --- lxml/trunk/src/lxml/xmlschema.pxi (original) +++ lxml/trunk/src/lxml/xmlschema.pxi Sun May 27 14:10:06 2007 @@ -30,7 +30,9 @@ cdef xmlNode* c_node cdef char* c_href cdef xmlschema.xmlSchemaParserCtxt* parser_ctxt + _Validator.__init__(self) self._c_schema = NULL + fake_c_doc = NULL if etree is not None: doc = _documentOrRaise(etree) root_node = _rootNodeOrRaise(etree) @@ -44,29 +46,32 @@ raise 
XMLSchemaParseError, "Document is not XML Schema" fake_c_doc = _fakeRootDoc(doc._c_doc, root_node._c_node) + self._error_log.connect() parser_ctxt = xmlschema.xmlSchemaNewDocParserCtxt(fake_c_doc) - if parser_ctxt is NULL: - _destroyFakeDoc(doc._c_doc, fake_c_doc) - raise XMLSchemaParseError, "Document is not parsable as XML Schema" - self._c_schema = xmlschema.xmlSchemaParse(parser_ctxt) - - xmlschema.xmlSchemaFreeParserCtxt(parser_ctxt) - _destroyFakeDoc(doc._c_doc, fake_c_doc) elif file is not None: filename = _getFilenameForFile(file) if filename is None: # XXX assume a string object filename = file filename = _encodeFilename(filename) + self._error_log.connect() parser_ctxt = xmlschema.xmlSchemaNewParserCtxt(_cstr(filename)) - self._c_schema = xmlschema.xmlSchemaParse(parser_ctxt) - xmlschema.xmlSchemaFreeParserCtxt(parser_ctxt) else: raise XMLSchemaParseError, "No tree or file given" + if parser_ctxt is not NULL: + self._c_schema = xmlschema.xmlSchemaParse(parser_ctxt) + if _LIBXML_VERSION_INT > 20624: + xmlschema.xmlSchemaFreeParserCtxt(parser_ctxt) + + self._error_log.disconnect() + + if fake_c_doc is not NULL: + _destroyFakeDoc(doc._c_doc, fake_c_doc) + if self._c_schema is NULL: - raise XMLSchemaParseError, "Document is not valid XML Schema" - _Validator.__init__(self) + raise XMLSchemaParseError, self._error_log._buildExceptionMessage( + "Document is not valid XML Schema") def __dealloc__(self): xmlschema.xmlSchemaFree(self._c_schema) Modified: lxml/trunk/src/lxml/xpath.pxi ============================================================================== --- lxml/trunk/src/lxml/xpath.pxi (original) +++ lxml/trunk/src/lxml/xpath.pxi Sun May 27 14:10:06 2007 @@ -152,36 +152,26 @@ python.PyThread_release_lock(self._eval_lock) cdef _raise_parse_error(self): + cdef _BaseErrorLog entries entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) if entries: - entry = entries[0] - if entry is not None and entry.message: - raise XPathSyntaxError, entry.message - 
- if self._error_log._first_error is not None and \ - self._error_log._first_error.message is not None: - message = self._error_log._first_error.message - elif self._xpathCtxt is not NULL and \ - self._xpathCtxt.lastError.message is not NULL: - message = funicode(self._xpathCtxt.lastError.message) - else: - message = "Error in xpath expression" - raise XPathSyntaxError, message + message = entries._buildExceptionMessage(None) + if message is not None: + raise XPathSyntaxError, message + raise XPathSyntaxError, self._error_log._buildExceptionMessage( + "Error in xpath expression") cdef _raise_eval_error(self): + cdef _BaseErrorLog entries entries = self._error_log.filter_types(_XPATH_EVAL_ERRORS) if not entries: entries = self._error_log.filter_types(_XPATH_SYNTAX_ERRORS) if entries: - entry = entries[0] - if entry is not None and entry.message: - raise XPathEvalError, entry.message - if self._xpathCtxt is not NULL and \ - self._xpathCtxt.lastError.message is not NULL: - message = funicode(self._xpathCtxt.lastError.message) - else: - message = "Error in xpath evaluation" - raise XPathEvalError, message + message = entries._buildExceptionMessage(None) + if message is not None: + raise XPathEvalError, message + raise XPathEvalError, self._error_log._buildExceptionMessage( + "Error in xpath expression") cdef object _handle_result(self, xpath.xmlXPathObject* xpathObj, _Document doc): if self._context._exc._has_raised(): Modified: lxml/trunk/src/lxml/xslt.pxi ============================================================================== --- lxml/trunk/src/lxml/xslt.pxi (original) +++ lxml/trunk/src/lxml/xslt.pxi Sun May 27 14:10:06 2007 @@ -307,6 +307,7 @@ if c_style is NULL: tree.xmlFreeDoc(c_doc) self._xslt_resolver_context._raise_if_stored() + # last error seems to be the most accurate here if self._error_log.last_error is not None: raise XSLTParseError, self._error_log.last_error.message else: @@ -391,6 +392,7 @@ self._xslt_resolver_context._raise_if_stored() if 
c_result is NULL: + # last error seems to be the most accurate here error = self._error_log.last_error if error is not None and error.message: if error.line >= 0: From scoder at codespeak.net Sun May 27 14:11:02 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 27 May 2007 14:11:02 +0200 (CEST) Subject: [Lxml-checkins] r43712 - lxml/trunk/doc Message-ID: <20070527121102.91A0B80B4@code0.codespeak.net> Author: scoder Date: Sun May 27 14:11:02 2007 New Revision: 43712 Modified: lxml/trunk/doc/api.txt lxml/trunk/doc/validation.txt lxml/trunk/doc/xpathxslt.txt Log: better exception messages in XPath, parsers and validators; major cleanup of error message extraction Modified: lxml/trunk/doc/api.txt ============================================================================== --- lxml/trunk/doc/api.txt (original) +++ lxml/trunk/doc/api.txt Sun May 27 14:11:02 2007 @@ -301,7 +301,7 @@ >>> etree.XML(notxml) Traceback (most recent call last): ... - XMLSyntaxError: line 1: error parsing attribute name + XMLSyntaxError: error parsing attribute name, line 1, column 3 XInclude and ElementInclude Modified: lxml/trunk/doc/validation.txt ============================================================================== --- lxml/trunk/doc/validation.txt (original) +++ lxml/trunk/doc/validation.txt Sun May 27 14:11:02 2007 @@ -118,12 +118,12 @@ >>> relaxng.assertValid(doc2) Traceback (most recent call last): [...] - DocumentInvalid: Document does not comply with schema + DocumentInvalid: Did not expect element c there, line 1 >>> relaxng.assert_(doc2) Traceback (most recent call last): [...] - AssertionError: Document does not comply with schema + AssertionError: Did not expect element c there, line 1 If you want to find out why the validation failed in the second case, you can look up the error log of the validation process and check it for relevant @@ -198,12 +198,12 @@ >>> xmlschema.assertValid(doc2) Traceback (most recent call last): [...] 
- DocumentInvalid: Document does not comply with schema + DocumentInvalid: Element 'c': This element is not expected. Expected is ( b )., line 1 >>> xmlschema.assert_(doc2) Traceback (most recent call last): [...] - AssertionError: Document does not comply with schema + AssertionError: Element 'c': This element is not expected. Expected is ( b )., line 1 Error reporting works as for the RelaxNG class:: Modified: lxml/trunk/doc/xpathxslt.txt ============================================================================== --- lxml/trunk/doc/xpathxslt.txt (original) +++ lxml/trunk/doc/xpathxslt.txt Sun May 27 14:11:02 2007 @@ -313,7 +313,7 @@ >>> find = root.xpath("\\") Traceback (most recent call last): ... - XPathEvalError: Error in xpath evaluation + XPathEvalError: Invalid expression Note that lxml versions before 1.3 always raised an ``XPathSyntaxError`` for all errors, including evaluation errors. The best way to support older From scoder at codespeak.net Sun May 27 14:11:21 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 27 May 2007 14:11:21 +0200 (CEST) Subject: [Lxml-checkins] r43713 - lxml/trunk Message-ID: <20070527121121.521A280B4@code0.codespeak.net> Author: scoder Date: Sun May 27 14:11:21 2007 New Revision: 43713 Modified: lxml/trunk/TODO.txt Log: todo Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Sun May 27 14:11:21 2007 @@ -64,3 +64,5 @@ * remove ``findOrBuildNodeNs()`` from C-API (replaced by findOrBuildNodeNsPrefix) + +* follow PEP 8 in API naming From scoder at codespeak.net Sun May 27 14:19:33 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 27 May 2007 14:19:33 +0200 (CEST) Subject: [Lxml-checkins] r43714 - in lxml/trunk/src/lxml: . 
tests Message-ID: <20070527121933.8620F80B4@code0.codespeak.net> Author: scoder Date: Sun May 27 14:19:32 2007 New Revision: 43714 Modified: lxml/trunk/src/lxml/parser.pxi lxml/trunk/src/lxml/tests/test_etree.py Log: raise XMLSyntax error even on recoverable errors Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Sun May 27 14:19:32 2007 @@ -622,7 +622,8 @@ ctxt.myDoc = NULL if result is not NULL: - if ctxt.wellFormed or recover: + if recover or (ctxt.wellFormed and \ + ctxt.lastError.level < xmlerror.XML_ERR_ERROR): __GLOBAL_PARSER_CONTEXT.initDocDict(result) else: # free broken document Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Sun May 27 14:19:32 2007 @@ -352,7 +352,7 @@ root = tree.getroot() self.assertEquals(root.text, test_url) - def _test_resolve_empty(self): + def test_resolve_empty(self): parse = self.etree.parse parser = self.etree.XMLParser(load_dtd=True) assertEqual = self.assertEqual From scoder at codespeak.net Sun May 27 21:31:09 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Sun, 27 May 2007 21:31:09 +0200 (CEST) Subject: [Lxml-checkins] r43750 - lxml/trunk Message-ID: <20070527193109.C8A5780BB@code0.codespeak.net> Author: scoder Date: Sun May 27 21:31:08 2007 New Revision: 43750 Modified: lxml/trunk/TODO.txt Log: some more todos Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Sun May 27 21:31:08 2007 @@ -65,4 +65,6 @@ * remove ``findOrBuildNodeNs()`` from C-API (replaced by findOrBuildNodeNsPrefix) -* follow PEP 8 in API naming +* follow PEP 8 in API naming (avoidCamelCase 
in_favour_of_underscores) + +* clean support for entities (maybe an Entity element class?) From scoder at codespeak.net Tue May 29 10:58:23 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 29 May 2007 10:58:23 +0200 (CEST) Subject: [Lxml-checkins] r43836 - in lxml/trunk/src/lxml: . tests Message-ID: <20070529085823.BE027808D@code0.codespeak.net> Author: scoder Date: Tue May 29 10:58:22 2007 New Revision: 43836 Modified: lxml/trunk/src/lxml/classlookup.pxi lxml/trunk/src/lxml/etree.pyx lxml/trunk/src/lxml/etree_defs.h lxml/trunk/src/lxml/objectify.pyx lxml/trunk/src/lxml/parser.pxi lxml/trunk/src/lxml/tests/test_etree.py lxml/trunk/src/lxml/tree.pxd lxml/trunk/src/lxml/xmlparser.pxd Log: Entity support at the API level Modified: lxml/trunk/src/lxml/classlookup.pxi ============================================================================== --- lxml/trunk/src/lxml/classlookup.pxi (original) +++ lxml/trunk/src/lxml/classlookup.pxi Tue May 29 10:58:22 2007 @@ -34,6 +34,16 @@ ``_init(self)`` method that will be called after object creation. """ +cdef class EntityBase(_Entity): + """All custom Entity classes must inherit from this one. + + Note that subclasses *must not* override __init__ or __new__ as it is + absolutely undefined when these objects will be created or destroyed. All + persistent state of Entities must be stored in the underlying XML. If you + really need to initialize the object after creation, you can implement an + ``_init(self)`` method that will be called after object creation. + """ + ################################################################################ # Element class lookup @@ -80,13 +90,14 @@ """Element class lookup scheme that always returns the default Element class. - The keyword arguments ``element``, ``comment`` and ``pi`` accept the - respective Element classes. + The keyword arguments ``element``, ``comment``, ``pi`` and ``entity`` + accept the respective Element classes. 
""" cdef readonly object element_class cdef readonly object comment_class cdef readonly object pi_class - def __init__(self, element=None, comment=None, pi=None): + cdef readonly object entity_class + def __init__(self, element=None, comment=None, pi=None, entity=None): self._lookup_function = _lookupDefaultElementClass if element is None: self.element_class = None @@ -109,6 +120,13 @@ else: raise TypeError, "PI class must be subclass of PIBase" + if entity is None: + self.entity_class = None + elif issubclass(pi, EntityBase): + self.entity_class = pi + else: + raise TypeError, "Entity class must be subclass of EntityBase" + cdef object _lookupDefaultElementClass(state, _Document _doc, xmlNode* c_node): "Trivial class lookup function that always returns the default class." if c_node.type == tree.XML_ELEMENT_NODE: @@ -138,6 +156,13 @@ return _ProcessingInstruction else: return cls + elif c_node.type == tree.XML_ENTITY_REF_NODE: + if state is not None: + cls = (state).entity_class + if cls is None: + return _Entity + else: + return cls else: assert 0, "Unknown node type: %s" % c_node.type @@ -217,10 +242,10 @@ lookup(self, type, doc, namespace, name) to lookup the element class for a node. Arguments of the method: - * type: one of 'element', 'comment', 'PI' + * type: one of 'element', 'comment', 'PI', 'entity' * doc: document that the node is in - * namespace: namespace URI of the node (or None for comments/PIs) - * name: name of the element, None for comments, target for PIs + * namespace: namespace URI of the node (or None for comments/PIs/entities) + * name: name of the element/entity, None for comments, target for PIs If you return None from this method, the fallback will be called. 
""" @@ -237,10 +262,14 @@ lookup = state - if c_node.type == tree.XML_COMMENT_NODE: + if c_node.type == tree.XML_ELEMENT_NODE: + element_type = "element" + elif c_node.type == tree.XML_COMMENT_NODE: element_type = "comment" elif c_node.type == tree.XML_PI_NODE: element_type = "PI" + elif c_node.type == tree.XML_ENTITY_REF_NODE: + element_type = "entity" else: element_type = "element" if c_node.name is NULL: Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Tue May 29 10:58:22 2007 @@ -1179,6 +1179,24 @@ else: return "" % self.target +cdef class _Entity(__ContentOnlyElement): + property tag: + def __get__(self): + return Entity + + property name: + # not in ElementTree + def __get__(self): + return funicode(self._c_node.name) + + def __set__(self, value): + value = _utf8(value) + c_text = _cstr(value) + tree.xmlNodeSetName(self._c_node, c_text) + + def __repr__(self): + return "&%s;" % self.name + cdef public class _ElementTree [ type LxmlElementTreeType, object LxmlElementTree ]: @@ -1692,12 +1710,15 @@ cdef class ElementDepthFirstIterator(_ElementTagMatcher): """Iterates over an element and its sub-elements in document order (depth - first pre-order). + first pre-order). Note that this also includes comments, entities and + processing instructions. To filter them out, check if the ``tag`` + property of the returned element is a string (i.e. not None and not a + factory function). - If the optional 'tag' argument is not None, it returns only the elements - that match the respective name and namespace. + If the optional 'tag' argument is not None, the iterator returns only the + elements that match the respective name and namespace. 
- The optional boolean 'inclusive' argument defaults to True and can be set + The optional boolean argument 'inclusive' defaults to True and can be set to False to exclude the start element itself. Note that the behaviour of this iterator is completely undefined if the @@ -1707,6 +1728,7 @@ # keep next node to return and a depth counter in the tree cdef _Element _next_node cdef _Element _top_node + cdef int _include_all_types def __init__(self, _Element node not None, tag=None, inclusive=True): self._top_node = node self._next_node = node @@ -1731,7 +1753,10 @@ c_node = self._nextNodeAnyTag(c_node) else: c_node = self._nextNodeMatchTag(c_node) - self._next_node = _elementFactory(current_node._doc, c_node) + if c_node is NULL: + self._next_node = None + else: + self._next_node = _elementFactory(current_node._doc, c_node) return current_node cdef xmlNode* _nextNodeAnyTag(self, xmlNode* c_node): @@ -1742,8 +1767,9 @@ cdef xmlNode* _nextNodeMatchTag(self, xmlNode* c_node): tree.BEGIN_FOR_EACH_ELEMENT_FROM(self._top_node._c_node, c_node, 0) - if _tagMatches(c_node, self._href, self._name): - return c_node + if c_node.type == tree.XML_ELEMENT_NODE: + if _tagMatches(c_node, self._href, self._name): + return c_node tree.END_FOR_EACH_ELEMENT_FROM(c_node) return NULL @@ -1762,6 +1788,11 @@ c_node = tree.xmlNewDocPI(c_doc, target, text) return c_node +cdef xmlNode* _createEntity(xmlDoc* c_doc, char* name): + cdef xmlNode* c_node + c_node = tree.xmlNewReference(c_doc, name) + return c_node + # module-level API for ElementTree def Element(_tag, attrib=None, nsmap=None, **_extra): @@ -1809,6 +1840,22 @@ PI = ProcessingInstruction +def Entity(name): + """Entity factory. This factory function creates a special element that + will be serialized as an XML entity. Note, however, that the entity will + not be automatically declared in the document. A document that uses + entities requires a DTD. 
+ """ + cdef _Document doc + cdef xmlNode* c_node + cdef xmlDoc* c_doc + name = _utf8(name) + c_doc = _newDoc() + doc = _documentFactory(c_doc, None) + c_node = _createEntity(c_doc, _cstr(name)) + tree.xmlAddChild(c_doc, c_node) + return _elementFactory(doc, c_node) + def SubElement(_Element _parent not None, _tag, attrib=None, nsmap=None, **_extra): """Subelement factory. This function creates an element instance, and appends it to an Modified: lxml/trunk/src/lxml/etree_defs.h ============================================================================== --- lxml/trunk/src/lxml/etree_defs.h (original) +++ lxml/trunk/src/lxml/etree_defs.h Tue May 29 10:58:22 2007 @@ -99,7 +99,8 @@ #define _isElement(c_node) \ (((c_node)->type == XML_ELEMENT_NODE) || \ - ((c_node)->type == XML_COMMENT_NODE) || \ + ((c_node)->type == XML_COMMENT_NODE) || \ + ((c_node)->type == XML_ENTITY_REF_NODE) || \ ((c_node)->type == XML_PI_NODE)) #define _isElementOrXInclude(c_node) \ Modified: lxml/trunk/src/lxml/objectify.pyx ============================================================================== --- lxml/trunk/src/lxml/objectify.pyx (original) +++ lxml/trunk/src/lxml/objectify.pyx Tue May 29 10:58:22 2007 @@ -1491,56 +1491,57 @@ NoneType = _PYTYPE_DICT.get('none') c_node = element._c_node tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1) - pytype = None - value = None - if not ignore: - # check that old value is valid - old_value = cetree.attributeValueFromNsName( - c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) - if old_value is not None and old_value != TREE_PYTYPE: - dict_result = python.PyDict_GetItem(_PYTYPE_DICT, old_value) - if dict_result is not NULL: - pytype = dict_result - if pytype is not StrType: - # StrType does not have a typecheck but is the default - # anyway, so just accept it if given as type information - pytype = _check_type(c_node, pytype) - - if pytype is None: - # if element is defined as xsi:nil, represent it as None - if cetree.attributeValueFromNsName( 
- c_node, _XML_SCHEMA_INSTANCE_NS, "nil") == "true": - pytype = NoneType - - if pytype is None: - # check for XML Schema type hint - value = cetree.attributeValueFromNsName( - c_node, _XML_SCHEMA_INSTANCE_NS, "type") - - if value is not None: - dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) - if dict_result is NULL and ':' in value: - prefix, value = value.split(':', 1) + if c_node.type == tree.XML_ELEMENT_NODE: + pytype = None + value = None + if not ignore: + # check that old value is valid + old_value = cetree.attributeValueFromNsName( + c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) + if old_value is not None and old_value != TREE_PYTYPE: + dict_result = python.PyDict_GetItem(_PYTYPE_DICT, old_value) + if dict_result is not NULL: + pytype = dict_result + if pytype is not StrType: + # StrType does not have a typecheck but is the default + # anyway, so just accept it if given as type information + pytype = _check_type(c_node, pytype) + + if pytype is None: + # if element is defined as xsi:nil, represent it as None + if cetree.attributeValueFromNsName( + c_node, _XML_SCHEMA_INSTANCE_NS, "nil") == "true": + pytype = NoneType + + if pytype is None: + # check for XML Schema type hint + value = cetree.attributeValueFromNsName( + c_node, _XML_SCHEMA_INSTANCE_NS, "type") + + if value is not None: dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) - if dict_result is not NULL: - pytype = dict_result + if dict_result is NULL and ':' in value: + prefix, value = value.split(':', 1) + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value) + if dict_result is not NULL: + pytype = dict_result - if pytype is None: - # try to guess type - if cetree.findChildForwards(c_node, 0) is NULL: - # element has no children => data class - pytype = _guessPyType(textOf(c_node), StrType) - - if pytype is None: - # delete attribute if it exists - cetree.delAttributeFromNsName( - c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) - else: - # update or 
create attribute - c_ns = cetree.findOrBuildNodeNsPrefix( - doc, c_node, _PYTYPE_NAMESPACE, 'py') - tree.xmlSetNsProp(c_node, c_ns, _PYTYPE_ATTRIBUTE_NAME, - _cstr(pytype.name)) + if pytype is None: + # try to guess type + if cetree.findChildForwards(c_node, 0) is NULL: + # element has no children => data class + pytype = _guessPyType(textOf(c_node), StrType) + + if pytype is None: + # delete attribute if it exists + cetree.delAttributeFromNsName( + c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) + else: + # update or create attribute + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _PYTYPE_NAMESPACE, 'py') + tree.xmlSetNsProp(c_node, c_ns, _PYTYPE_ATTRIBUTE_NAME, + _cstr(pytype.name)) tree.END_FOR_EACH_ELEMENT_FROM(c_node) def xsiannotate(element_or_tree, ignore_old=True): @@ -1571,78 +1572,79 @@ StrType = _PYTYPE_DICT.get('str') c_node = element._c_node tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1) - typename = None - pytype = None - value = None - istree = 0 - if not ignore: - # check that old value is valid - typename = cetree.attributeValueFromNsName( - c_node, _XML_SCHEMA_INSTANCE_NS, "type") - if typename is not None: - dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename) - if dict_result is NULL and ':' in typename: - prefix, typename = typename.split(':', 1) + if c_node.type == tree.XML_ELEMENT_NODE: + typename = None + pytype = None + value = None + istree = 0 + if not ignore: + # check that old value is valid + typename = cetree.attributeValueFromNsName( + c_node, _XML_SCHEMA_INSTANCE_NS, "type") + if typename is not None: dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename) - if dict_result is not NULL: - pytype = dict_result - if pytype is not StrType: - # StrType does not have a typecheck but is the default anyway, - # so just accept it if given as type information - pytype = _check_type(c_node, pytype) - if pytype is None: - typename = None - - if typename is None: - if pytype is None: - # check for pytype hint 
- value = cetree.attributeValueFromNsName( - c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) - - if value is not None: - if value == TREE_PYTYPE: - istree = 1 + if dict_result is NULL and ':' in typename: + prefix, typename = typename.split(':', 1) + dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename) + if dict_result is not NULL: + pytype = dict_result + if pytype is not StrType: + # StrType does not have a typecheck but is the default anyway, + # so just accept it if given as type information + pytype = _check_type(c_node, pytype) + if pytype is None: + typename = None + + if typename is None: + if pytype is None: + # check for pytype hint + value = cetree.attributeValueFromNsName( + c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) + + if value is not None: + if value == TREE_PYTYPE: + istree = 1 + else: + dict_result = python.PyDict_GetItem(_PYTYPE_DICT, value) + if dict_result is not NULL: + pytype = dict_result + if pytype is not StrType: + pytype = _check_type(c_node, pytype) + + if not istree and pytype is None: + # try to guess type + if cetree.findChildForwards(c_node, 0) is NULL: + # element has no children => data class + pytype = _guessPyType(textOf(c_node), StrType) else: - dict_result = python.PyDict_GetItem(_PYTYPE_DICT, value) - if dict_result is not NULL: - pytype = dict_result - if pytype is not StrType: - pytype = _check_type(c_node, pytype) - - if not istree and pytype is None: - # try to guess type - if cetree.findChildForwards(c_node, 0) is NULL: - # element has no children => data class - pytype = _guessPyType(textOf(c_node), StrType) - else: - istree = 1 + istree = 1 - if typename is None and not istree and pytype is not None: - if python.PyList_GET_SIZE(pytype._schema_types) > 0: - # pytype->xsi:type is a 1:n mapping so simply take the first - typename = pytype._schema_types[0] - - if typename is None or istree: - # delete attribute if it exists - cetree.delAttributeFromNsName(c_node, _XML_SCHEMA_INSTANCE_NS, "type") - 
else: - # update or create attribute - c_ns = cetree.findOrBuildNodeNsPrefix( - doc, c_node, _XML_SCHEMA_NS, 'xsd') - if c_ns is not NULL: - if ':' in typename: - prefix, name = typename.split(':', 1) - if c_ns.prefix is NULL or c_ns.prefix[0] == c'\0': - typename = name - elif cstd.strcmp(_cstr(prefix), c_ns.prefix) != 0: + if typename is None and not istree and pytype is not None: + if python.PyList_GET_SIZE(pytype._schema_types) > 0: + # pytype->xsi:type is a 1:n mapping so simply take the first + typename = pytype._schema_types[0] + + if typename is None or istree: + # delete attribute if it exists + cetree.delAttributeFromNsName(c_node, _XML_SCHEMA_INSTANCE_NS, "type") + else: + # update or create attribute + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _XML_SCHEMA_NS, 'xsd') + if c_ns is not NULL: + if ':' in typename: + prefix, name = typename.split(':', 1) + if c_ns.prefix is NULL or c_ns.prefix[0] == c'\0': + typename = name + elif cstd.strcmp(_cstr(prefix), c_ns.prefix) != 0: + prefix = c_ns.prefix + typename = prefix + ':' + name + elif c_ns.prefix is not NULL or c_ns.prefix[0] != c'\0': prefix = c_ns.prefix - typename = prefix + ':' + name - elif c_ns.prefix is not NULL or c_ns.prefix[0] != c'\0': - prefix = c_ns.prefix - typename = prefix + ':' + typename - c_ns = cetree.findOrBuildNodeNsPrefix( - doc, c_node, _XML_SCHEMA_INSTANCE_NS, 'xsi') - tree.xmlSetNsProp(c_node, c_ns, "type", _cstr(typename)) + typename = prefix + ':' + typename + c_ns = cetree.findOrBuildNodeNsPrefix( + doc, c_node, _XML_SCHEMA_INSTANCE_NS, 'xsi') + tree.xmlSetNsProp(c_node, c_ns, "type", _cstr(typename)) tree.END_FOR_EACH_ELEMENT_FROM(c_node) @@ -1661,20 +1663,23 @@ c_node = element._c_node if pytype and xsi: tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1) - cetree.delAttributeFromNsName( - c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) - cetree.delAttributeFromNsName( - c_node, _XML_SCHEMA_INSTANCE_NS, "type") + if c_node.type == tree.XML_ELEMENT_NODE: + 
cetree.delAttributeFromNsName( + c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) + cetree.delAttributeFromNsName( + c_node, _XML_SCHEMA_INSTANCE_NS, "type") tree.END_FOR_EACH_ELEMENT_FROM(c_node) elif pytype: tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1) - cetree.delAttributeFromNsName( - c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) + if c_node.type == tree.XML_ELEMENT_NODE: + cetree.delAttributeFromNsName( + c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME) tree.END_FOR_EACH_ELEMENT_FROM(c_node) else: tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1) - cetree.delAttributeFromNsName( - c_node, _XML_SCHEMA_INSTANCE_NS, "type") + if c_node.type == tree.XML_ELEMENT_NODE: + cetree.delAttributeFromNsName( + c_node, _XML_SCHEMA_INSTANCE_NS, "type") tree.END_FOR_EACH_ELEMENT_FROM(c_node) Modified: lxml/trunk/src/lxml/parser.pxi ============================================================================== --- lxml/trunk/src/lxml/parser.pxi (original) +++ lxml/trunk/src/lxml/parser.pxi Tue May 29 10:58:22 2007 @@ -616,6 +616,7 @@ filename, _ErrorLog error_log, int recover) except NULL: cdef _ResolverContext context + cdef int well_formed if ctxt.myDoc is not NULL: if ctxt.myDoc != result: tree.xmlFreeDoc(ctxt.myDoc) @@ -624,6 +625,20 @@ if result is not NULL: if recover or (ctxt.wellFormed and \ ctxt.lastError.level < xmlerror.XML_ERR_ERROR): + well_formed = 1 + elif not ctxt.replaceEntities and not ctxt.validate: + # in this mode, we ignore errors about undefined entities + for error in error_log.filter_from_errors(): + if error.type != ErrorTypes.WAR_UNDECLARED_ENTITY and \ + error.type != ErrorTypes.ERR_UNDECLARED_ENTITY: + well_formed = 0 + break + else: + well_formed = 1 + else: + well_formed = 0 + + if well_formed: __GLOBAL_PARSER_CONTEXT.initDocDict(result) else: # free broken document @@ -674,6 +689,8 @@ * ns_clean - clean up redundant namespace declarations * recover - try hard to parse through broken XML * remove_blank_text - discard blank text 
nodes + * compact - safe memory for short text content (default: on) + * resolve_entities - replace entities by their text value (default: on) Note that you should avoid sharing parsers between threads. While this is not harmful, it is more efficient to use separate parsers. This does not @@ -681,7 +698,8 @@ """ def __init__(self, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=False, ns_clean=False, - recover=False, remove_blank_text=False, compact=True): + recover=False, remove_blank_text=False, compact=True, + resolve_entities=True): cdef int parse_options _BaseParser.__init__(self) @@ -704,6 +722,8 @@ parse_options = parse_options | xmlparser.XML_PARSE_NOBLANKS if not compact: parse_options = parse_options ^ xmlparser.XML_PARSE_COMPACT + if not resolve_entities: + parse_options = parse_options ^ xmlparser.XML_PARSE_NOENT self._parse_options = parse_options @@ -799,6 +819,7 @@ * recover - try hard to parse through broken HTML (default: True) * no_network - prevent network access * remove_blank_text - discard empty text nodes + * compact - safe memory for short text content (default: on) Note that you should avoid sharing parsers between threads for parformance reasons. 
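Note on the parser.pxi change above: the new well-formedness decision can be sketched in plain Python as below. This is only an illustration of the control flow in the diff, not lxml code. The function name `is_well_formed`, its flat argument list, the `IGNORABLE` set and the string constants standing in for `ErrorTypes` values are all assumptions made for the example.

```python
# Hypothetical stand-ins for the ErrorTypes values referenced in the diff.
WAR_UNDECLARED_ENTITY = "WAR_UNDECLARED_ENTITY"
ERR_UNDECLARED_ENTITY = "ERR_UNDECLARED_ENTITY"
IGNORABLE = {WAR_UNDECLARED_ENTITY, ERR_UNDECLARED_ENTITY}


def is_well_formed(recover, parser_ok, replace_entities, validate, error_types):
    """Mirror the three-way decision from the diff (illustrative only).

    recover          -- the parser was created with recover=True
    parser_ok        -- stands for "ctxt.wellFormed and no error-level lastError"
    replace_entities -- stands for ctxt.replaceEntities
    validate         -- stands for ctxt.validate
    error_types      -- types of the entries the error log reports as errors
    """
    if recover or parser_ok:
        return True
    if not replace_entities and not validate:
        # Entities are kept in the tree, so reports about undefined
        # entities are ignored; any other error rejects the document.
        # all() on an empty sequence is True, like the for/else in the diff.
        return all(t in IGNORABLE for t in error_types)
    return False


# Only undefined-entity errors, entities kept: document is accepted.
print(is_well_formed(False, False, False, False, [ERR_UNDECLARED_ENTITY]))
# Same errors, but entities are to be replaced: document is rejected.
print(is_well_formed(False, False, True, False, [ERR_UNDECLARED_ENTITY]))
```

The sketch collapses the `for ... else` loop of the diff into a single `all()` call, which should behave the same for an empty error list.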
Modified: lxml/trunk/src/lxml/tests/test_etree.py ============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Tue May 29 10:58:22 2007 @@ -390,6 +390,38 @@ xml = u'&myentity;' self.assertRaises(_LocalException, parse, StringIO(xml), parser) + def test_entity(self): + parse = self.etree.parse + tostring = self.etree.tostring + parser = self.etree.XMLParser(resolve_entities=False) + Entity = self.etree.Entity + + xml = '&myentity;' + tree = parse(StringIO(xml), parser) + root = tree.getroot() + self.assertEquals(root[0].tag, Entity) + self.assertFalse(root[0].text) + self.assertEquals(root[0].tail, None) + self.assertEquals(root[0].name, "myentity") + + self.assertEquals('&myentity;', + tostring(root)) + + def test_entity_append(self): + Entity = self.etree.Entity + Element = self.etree.Element + + root = Element("root") + root.append( Entity("test") ) + + self.assertEquals(root[0].tag, Entity) + self.assertFalse(root[0].text) + self.assertEquals(root[0].tail, None) + self.assertEquals(root[0].name, "test") + + self.assertEquals('&test;', + tostring(root)) + # TypeError in etree, AssertionError in ElementTree; def test_setitem_assert(self): Element = self.etree.Element Modified: lxml/trunk/src/lxml/tree.pxd ============================================================================== --- lxml/trunk/src/lxml/tree.pxd (original) +++ lxml/trunk/src/lxml/tree.pxd Tue May 29 10:58:22 2007 @@ -165,6 +165,7 @@ cdef xmlNode* xmlNewDocText(xmlDoc* doc, char* content) cdef xmlNode* xmlNewDocComment(xmlDoc* doc, char* content) cdef xmlNode* xmlNewDocPI(xmlDoc* doc, char* name, char* content) + cdef xmlNode* xmlNewReference(xmlDoc* doc, char* name) cdef xmlNs* xmlNewNs(xmlNode* node, char* href, char* prefix) cdef xmlNode* xmlAddChild(xmlNode* parent, xmlNode* cur) cdef xmlNode* xmlReplaceNode(xmlNode* old, xmlNode* cur) Modified: 
lxml/trunk/src/lxml/xmlparser.pxd ============================================================================== --- lxml/trunk/src/lxml/xmlparser.pxd (original) +++ lxml/trunk/src/lxml/xmlparser.pxd Tue May 29 10:58:22 2007 @@ -54,6 +54,8 @@ int options int disableSAX int errNo + int replaceEntities + int validate xmlError lastError xmlNode* node xmlSAXHandler* sax From scoder at codespeak.net Tue May 29 10:58:35 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 29 May 2007 10:58:35 +0200 (CEST) Subject: [Lxml-checkins] r43837 - lxml/trunk Message-ID: <20070529085835.079E0808F@code0.codespeak.net> Author: scoder Date: Tue May 29 10:58:35 2007 New Revision: 43837 Modified: lxml/trunk/CHANGES.txt Log: Entity support at the API level Modified: lxml/trunk/CHANGES.txt ============================================================================== --- lxml/trunk/CHANGES.txt (original) +++ lxml/trunk/CHANGES.txt Tue May 29 10:58:35 2007 @@ -8,6 +8,10 @@ Features added -------------- +* Entity support through an ``Entity`` factory and element classes. XML + parsers now have a ``resolve_entities`` keyword argument that can be set to + False to keep entities in the document. + * ``parse()`` function in ``objectify``, corresponding to ``XML()`` etc. 
* ``column`` field on error log entries to accompany the ``line`` field @@ -46,6 +50,8 @@ Bugs fixed ---------- +* The XML parser did not report undefined entities as error + * The text in exceptions raised by XML parsers, validators and XPath evaluators now reports the first error that occurred instead of the last From scoder at codespeak.net Tue May 29 10:59:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 29 May 2007 10:59:00 +0200 (CEST) Subject: [Lxml-checkins] r43838 - lxml/trunk/src/lxml Message-ID: <20070529085900.45716808D@code0.codespeak.net> Author: scoder Date: Tue May 29 10:58:59 2007 New Revision: 43838 Modified: lxml/trunk/src/lxml/etree.pyx Log: ElementTree._setroot() Modified: lxml/trunk/src/lxml/etree.pyx ============================================================================== --- lxml/trunk/src/lxml/etree.pyx (original) +++ lxml/trunk/src/lxml/etree.pyx Tue May 29 10:58:59 2007 @@ -1228,6 +1228,14 @@ self._doc = None return self._context_node + def _setroot(self, _Element root not None): + """Relocate the ElementTree to a new root node. + """ + if root._c_node.type != tree.XML_ELEMENT_NODE: + raise TypeError, "Only elements can be the root of an ElementTree" + self._context_node = root + self._doc = None + def getroot(self): """Gets the root element for this tree. 
""" From ianb at codespeak.net Tue May 29 17:04:20 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Tue, 29 May 2007 17:04:20 +0200 (CEST) Subject: [Lxml-checkins] r43853 - lxml/trunk Message-ID: <20070529150420.C1990808F@code0.codespeak.net> Author: ianb Date: Tue May 29 17:04:20 2007 New Revision: 43853 Modified: lxml/trunk/versioninfo.py Log: don't cause an error when there is a directory with no version information (an uncommitted directory) Modified: lxml/trunk/versioninfo.py ============================================================================== --- lxml/trunk/versioninfo.py (original) +++ lxml/trunk/versioninfo.py Tue May 29 17:04:20 2007 @@ -38,7 +38,9 @@ elif data.startswith(' Author: ianb Date: Tue May 29 17:05:51 2007 New Revision: 43854 Added: lxml/branch/html/ - copied from r43853, lxml/trunk/ lxml/branch/html/src/lxml/doctestcompare.py (contents, props changed) lxml/branch/html/src/lxml/html/ lxml/branch/html/src/lxml/html/__init__.py (contents, props changed) lxml/branch/html/src/lxml/html/clean.py (contents, props changed) lxml/branch/html/src/lxml/html/defs.py (contents, props changed) lxml/branch/html/src/lxml/html/htmldiff.py (contents, props changed) lxml/branch/html/src/lxml/html/rewritelinks.py (contents, props changed) lxml/branch/html/src/lxml/html/tests/ lxml/branch/html/src/lxml/html/tests/__init__.py (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_basic.py (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_basic.txt (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_clean.py (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_clean.txt (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_htmldiff.py (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_htmldiff.txt (contents, props changed) lxml/branch/html/src/lxml/html/tests/test_rewritelinks.py (contents, props changed) 
lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt (contents, props changed) lxml/branch/html/src/lxml/html/usedoctest.py (contents, props changed) lxml/branch/html/src/lxml/usedoctest.py (contents, props changed) Modified: lxml/branch/html/setup.py Log: Branch with lxml.html work Modified: lxml/branch/html/setup.py ============================================================================== --- lxml/trunk/setup.py (original) +++ lxml/branch/html/setup.py Tue May 29 17:05:51 2007 @@ -70,7 +70,7 @@ ], package_dir = {'': 'src'}, - packages = ['lxml'], + packages = ['lxml', 'lxml.html'], zip_safe = False, ext_modules = setupinfo.ext_modules( STATIC_INCLUDE_DIRS, STATIC_LIBRARY_DIRS, STATIC_CFLAGS), Added: lxml/branch/html/src/lxml/doctestcompare.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/doctestcompare.py Tue May 29 17:05:51 2007 @@ -0,0 +1,395 @@ +""" +lxml-based doctest output comparison. + +To use this you must call ``lxmldoctest.install()``, which will cause +doctest to use this in all subsequent calls. + +This changes the way output is checked and comparisons are made for +XML or HTML-like content. + +XML or HTML content is noticed because the example starts with ``<`` +(it's HTML if it starts with ```` or include an ``any`` +attribute in the tag. An ``any`` tag matches any tag, while the +attribute matches any and all attributes. + +When a match fails, the reformatted example and gotten text is +displayed (indented), and a rough diff-like output is given. Anything +marked with ``-`` is in the output but wasn't supposed to be, and +similarly ``+`` means its in the example but wasn't in the output. 
+""" + +from lxml import etree +from lxml.html import HTML +import re +import doctest +import cgi + +PARSE_HTML = doctest.register_optionflag('PARSE_HTML') +PARSE_XML = doctest.register_optionflag('PARSE_XML') + +OutputChecker = doctest.OutputChecker + +def strip(v): + if v is None: + return None + else: + return v.strip() + +class LXMLOutputChecker(OutputChecker): + + empty_tags = ( + 'param', 'img', 'area', 'br', 'basefont', 'input', + 'base', 'meta', 'link', 'col') + + default_parser = etree.XML + + def check_output(self, want, got, optionflags): + alt_self = getattr(self, '_temp_override_self', None) + if alt_self is not None: + super_method = self._temp_call_super_check_output + self = alt_self + else: + super_method = OutputChecker.check_output + parser = self.get_parser(want, got, optionflags) + if not parser: + return super_method( + self, want, got, optionflags) + try: + want_doc = parser(want) + except etree.XMLSyntaxError: + return False + try: + got_doc = parser(got) + except etree.XMLSyntaxError: + return False + return self.compare_docs(want_doc, got_doc) + + def get_parser(self, want, got, optionflags): + parser = None + if PARSE_HTML & optionflags: + parser = HTML + elif PARSE_XML & optionflags: + parser = etree.XML + elif want.strip().lower().startswith('' % el.tag + return '<%s %s>' % (el.tag, ' '.join(attrs)) + + def format_end_tag(self, el): + return '' % el.tag + + def collect_diff(self, want, got, html, indent): + parts = [] + if not len(want) and not len(got): + parts.append(' '*indent) + parts.append(self.collect_diff_tag(want, got)) + if not self.html_empty_tag(got, html): + parts.append(self.collect_diff_text(want.text, got.text)) + parts.append(self.collect_diff_end_tag(want, got)) + parts.append(self.collect_diff_text(want.tail, got.tail)) + parts.append('\n') + return ''.join(parts) + parts.append(' '*indent) + parts.append(self.collect_diff_tag(want, got)) + parts.append('\n') + if strip(want.text) or strip(got.text): + parts.append(' 
'*indent) + parts.append(self.collect_diff_text(want.text, got.text)) + parts.append('\n') + want_children = list(want) + got_children = list(got) + while want_children or got_children: + if not want_children: + parts.append(self.format_doc(got_children.pop(0), html, indent+2, '-')) + continue + if not got_children: + parts.append(self.format_doc(want_children.pop(0), html, indent+2, '+')) + continue + parts.append(self.collect_diff( + want_children.pop(0), got_children.pop(0), html, indent+2)) + parts.append(' '*indent) + parts.append(self.collect_diff_end_tag(want, got)) + parts.append('\n') + if strip(want.tail) or strip(got.tail): + parts.append(' '*indent) + parts.append(self.collect_diff_text(want.tail, got.tail)) + parts.append('\n') + return ''.join(parts) + + def collect_diff_tag(self, want, got): + if want.tag != got.tag and want.tag != 'any': + tag = '%s (not %s)' % (got.tag, want.tag) + else: + tag = got.tag + attrs = [] + any = want.tag == 'any' or 'any' in want.attrib + for name, value in sorted(got.attrib.items()): + if name not in want.attrib and not any: + attrs.append('-%s="%s"' % (name, self.format_text(value, False))) + else: + if name in want.attrib: + text = self.collect_diff_text(value, want.attrib[name], False) + else: + text = self.format_text(value, False) + attrs.append('%s="%s"' % (name, text)) + if not any: + for name, value in sorted(got.attrib.items()): + if name in got.attrib: + continue + attrs.append('+%s="%s"' % (name, self.format_text(value, False))) + if attrs: + tag = '<%s %s>' % (tag, ' '.join(attrs)) + else: + tag = '<%s>' % tag + return tag + + def collect_diff_end_tag(self, want, got): + if want.tag != got.tag: + tag = '%s (not %s)' % (got.tag, want.tag) + else: + tag = got.tag + return '' % tag + + def collect_diff_text(self, want, got, strip=True): + if self.text_compare(want, got, strip): + if not got: + return '' + return self.format_text(got, strip) + text = '%s (not %s)' % (got, want) + return self.format_text(text, 
strip) + +class LHTMLOutputChecker(LXMLOutputChecker): + default_parser = HTML + +def install(html=False): + """ + Install doctestcompare for all future doctests. + + If html is true, then by default the HTML parser will be used; + otherwise the XML parser is used. + """ + if html: + doctest.OutputChecker = LHTMLOutputChecker + else: + doctest.OutputChecker = LXMLOutputChecker + +def temp_install(html=False): + """ + Use this *inside* a doctest to enable this checker for this + doctest only. + + If html is true, then by default the HTML parser will be used; + otherwise the XML parser is used. + """ + if html: + Checker = LHTMLOutputChecker + else: + Checker = LXMLOutputChecker + frame = _find_doctest_frame() + dt_self = frame.f_locals['self'] + checker = Checker() + old_checker = dt_self._checker + dt_self._checker = checker + # The unfortunate thing is that there is a local variable 'check' + # in the function that runs the doctests, that is a bound method + # into the output checker. We have to update that. We can't + # modify the frame, so we have to modify the object in place. The + # only way to do this is to actually change the func_code + # attribute of the method. We change it, and then wait for + # __record_outcome to be run, which signals the end of the __run + # method, at which point we restore the previous check_output + # implementation. 
+ check_func = frame.f_locals['check'].im_func + # Because we can't patch up func_globals, this is the only global + # in check_output that we care about: + doctest.etree = etree + _RestoreChecker(dt_self, old_checker, checker, + check_func, checker.check_output.im_func) + +class _RestoreChecker(object): + def __init__(self, dt_self, old_checker, new_checker, check_func, clone_func): + self.dt_self = dt_self + self.checker = old_checker + self.checker._temp_call_super_check_output = self.call_super + self.checker._temp_override_self = new_checker + self.check_func = check_func + self.clone_func = clone_func + self.install_clone() + self.install_dt_self() + def install_clone(self): + self.func_code = self.check_func.func_code + self.func_globals = self.check_func.func_globals + self.check_func.func_code = self.clone_func.func_code + def uninstall_clone(self): + self.check_func.func_code = self.func_code + def install_dt_self(self): + self.prev_func = self.dt_self._DocTestRunner__record_outcome + self.dt_self._DocTestRunner__record_outcome = self + def uninstall_dt_self(self): + self.dt_self._DocTestRunner__record_outcome = self.prev_func + def __call__(self, *args, **kw): + self.uninstall_clone() + self.uninstall_dt_self() + del self.checker._temp_override_self + del self.checker._temp_call_super_check_output + return self.prev_func(*args, **kw) + def call_super(self, *args, **kw): + self.uninstall_clone() + try: + return self.check_func(*args, **kw) + finally: + self.install_clone() + +def _find_doctest_frame(): + import sys + frame = sys._getframe(1) + while frame: + l = frame.f_locals + if 'BOOM' in l: + # Sign of doctest + return frame + frame = frame.f_back + raise LookupError( + "Could not find doctest (only use this function *inside* a doctest)") + Added: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/__init__.py Tue May 29 17:05:51 
2007 @@ -0,0 +1,201 @@ +import threading +import re +from lxml import etree + +__all__ = ['HTML', 'tostring'] + +_rel_links_xpath = etree.XPath("descendant-or-self::a[fn:upper-case(@rel)=$rel]") +#_class_xpath = etree.XPath(r"descendant-or-self::*[regexp:match(@class, concat('\b', $class_name, '\b'))]", {'regexp': 'http://exslt.org/regular-expressions'}) +_class_xpath = etree.XPath("descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), concat(' ', $class_name, ' '))]") + +class HtmlMixin(object): + + def remove_element(self): + """ + Removes this element from the tree, including its children and + text. The tail text is joined to the previous element or + parent. + """ + parent = self.getparent() + assert parent + index = parent.index(self) + if self.tail: + if index == 0: + parent.text = (parent.text or '') + self.tail + else: + previous = parent[index-1] + previous.tail = (previous.tail or '') + self.tail + parent.remove(self) + + def remove_tag(self): + """ + Remove the tag, but not its children or text. The children and text + are merged into the parent. 
+ """ + parent = self.getparent() + assert parent + index = parent.index(self) + if self.text: + if index == 0: + parent.text = (parent.text or '') + self.text + else: + prev = parent[index-1] + prev.tail = (prev.tail or '') + self.text + if self.tail: + if len(self): + last = self[-1] + last.tail = (last.tail or '') + self.tail + elif index == 0: + parent.text = (parent.text or '') + self.tail + else: + prev = parent[index-1] + prev.tail = (prev.tail or '') + self.tail + parent[index:index+1] = list(self) + + def find_rel_links(self, rel): + return _rel_links_xpath(self, rel=rel.lower()) + + def find_class(self, class_name): + return _class_xpath(self, class_name=class_name.lower()) + +class HtmlComment(etree._Comment, HtmlMixin): + pass + +class HtmlElement(etree.ElementBase, HtmlMixin): + pass + +class HtmlLookup(etree.CustomElementClassLookup): + + def lookup(self, node_type, document, namespace, name): + if node_type == 'element': + return HtmlElement + elif node_type == 'comment': + return HtmlComment + else: + # Delegate + return None + +html_parser = etree.HTMLParser() +html_parser.setElementClassLookup(HtmlLookup()) + +def HTML(html): + # FIXME: should this notice a fragment and parse accordingly? + value = etree.HTML(html, html_parser) + if value is None: + raise ParserError( + "Could not parse document") + return value + +def parse_elements(html, no_leading_text=False): + """ + Parses several HTML elements, returning a list of elements. + + The first item in the list may be a string (though leading + whitespace is removed). If no_leading_text is true, then it will + be an error if there is leading text. + """ + # FIXME: check what happens when you give html with a body, head, etc. 
+ html = '%s' % html + doc = HTML(html) + assert doc.tag == 'html' + bodies = [e for e in doc if e.tag == 'body'] + assert len(bodies) == 1 + body = bodies[0] + elements = [] + if no_leading_text and body.text and body.text.strip(): + raise ParserError( + "There is leading text: %r" % body.text) + if body.text and body.text.strip(): + elements.append(body.text) + elements.extend(body) + # FIXME: removing the reference to the parent artificial document + # would be nice + return elements + +def parse_element(html, create_parent=False): + """ + Parses a single HTML element; it is an error if there is more than + one element, or if anything but whitespace precedes or follows the + element. + + If create_parent is true (or is a tag name) then a parent node + will be created to encapsulate the HTML in a single element. + """ + if create_parent: + if not isinstance(create_parent, basestring): + create_parent = 'div' + return parse_element('<%s>%s' % (create_parent, html, create_parent)) + elements = parse_elements(html, no_leading_text=True) + if not elements: + raise ParserError( + "No elements found") + if len(elements) > 1: + raise ParserError( + "Multiple elements found (%s)" + % ', '.join([e.tag for e in elements])) + el = elements[0] + if el.tail and el.tail.strip(): + raise ParserError( + "Element followed by text: %r" % el.tail) + el.tail = None + return el + +def Element(*args, **kw): + # FIXME: this is totally broken; segfaults + v = HtmlElement(*args, **kw) + return v + +############################################################ +## Serialization +############################################################ + +_html_xsl = """\ + + + + + + +""" + +_pretty_html_xsl = """\ + + + + + + +""" + +_local_transforms = threading.local() +# FIXME: should we just lazily compile these? 
+_local_transforms.html_transform = etree.XSLT(etree.XML(_html_xsl)) +_local_transforms.pretty_html_transform = etree.XSLT(etree.XML(_pretty_html_xsl)) + +# This isn't a general match, but it's a match for what XSLT specifically creates: +_meta_content_type_re = re.compile( + r'') + +def tostring(doc, pretty=False, include_meta_content_type=False): + """ + return HTML string representation of the document given + + note: this will create a meta http-equiv="Content" tag in the head + and may replace any that are present + """ + assert doc is not None + if pretty: + try: + pretty_html_transform = _local_transforms.pretty_html_transform + except AttributeError: + pretty_html_transform = _local_transforms.pretty_html_transform = etree.XSLT(etree.XML(_pretty_html_xsl)) + html = str(pretty_html_transform(doc)) + else: + try: + html_transform = _local_transforms.html_transform + except AttributeError: + html_transform = _local_transforms.html_transform = etree.XSLT(etree.XML(_html_xsl)) + html = str(html_transform(doc)) + if not include_meta_content_type: + html = _meta_content_type_re.sub('', html) + return html Added: lxml/branch/html/src/lxml/html/clean.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/clean.py Tue May 29 17:05:51 2007 @@ -0,0 +1,157 @@ +from lxml import etree +from lxml.html import defs +from lxml.html import HTML, tostring + +__all__ = ['clean_html', 'clean'] + +def clean_html(html, **kw): + """ + Like clean(), but takes a text input document, and returns a text + document. + """ + doc = HTML(html) + clean(doc, **kw) + return tostring(doc) + +def clean(doc, + scripts=True, + javascript=True, + comments=True, + # process instructions? + style=False, + links=False, + embedded=True, + frames=True, + forms=True, + remove_tags=None, + allow_tags=None, + strip_tags=True, + remove_unknown_tags=True, + add_nofollow=False, + # callbacks? 
+ ): + """ + Cleans the document of each of the possible offending elements: + + ``scripts``: + Any `` +... +... +... +... +... a link +... another link +...

a paragraph

+...
secret EVIL!
+... of EVIL! +... +...
+... Password: +...
+... annoying EVIL! +... spam spam SPAM! +... +... ''' +>>> print doc + + + + + + + + a link + another link +

a paragraph

+
secret EVIL!
+ of EVIL! + +
+ Password: +
+ annoying EVIL! + spam spam SPAM! + + +>>> print tostring(HTML(doc)) + + + + + + + + a link + another link +

a paragraph

+
secret EVIL!
+ of EVIL! + +
+ Password: +
+ annoying EVIL! + spam spam SPAM! + + +>>> print clean_html(doc) + + + + + + a link + another link +

a paragraph

+
secret EVIL!
+ Password: + annoying EVIL! + spam spam SPAM! + + +>>> print clean_html(doc, style=True, links=True, add_nofollow=True) + + + + + a link + another link +

a paragraph

+
secret EVIL!
+ Password: + annoying EVIL! + spam spam SPAM! + + Added: lxml/branch/html/src/lxml/html/tests/test_htmldiff.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/tests/test_htmldiff.py Tue May 29 17:05:51 2007 @@ -0,0 +1,13 @@ +import unittest +from lxml.tests.common_imports import doctest + +from lxml.html import htmldiff + +def test_suite(): + suite = unittest.TestSuite() + suite.addTests([doctest.DocFileSuite('test_htmldiff.txt'), + doctest.DocTestSuite(htmldiff)]) + return suite + +if __name__ == '__main__': + unittest.main() Added: lxml/branch/html/src/lxml/html/tests/test_htmldiff.txt ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/tests/test_htmldiff.txt Tue May 29 17:05:51 2007 @@ -0,0 +1,248 @@ +htmldiff does HTML comparisons. These are word-based comparisons. + +First, a handy function for normalizing whitespace and doing word wrapping:: + + >>> import re, textwrap + >>> def pwrapped(text): + ... text = re.sub(r'[ \n\t\r]+', ' ', text) + ... text = textwrap.fill(text) + ... print text + >>> def pdiff(text1, text2): + ... pwrapped(htmldiff(text1, text2)) + +Example:: + + >>> from lxml.html.htmldiff import htmldiff, split_unbalanced, html_annotate + >>> html1 = '

This is some test text with some changes and some same stuff

' + >>> html2 = '''

This is some test textual writing with some changed stuff + ... and some same stuff

''' + >>> pdiff(html1, html2) +

This is some test textual writing with some changed + stuff text with some changes and some same stuff

+ +Style tags are largely ignored in terms of differences, though markup is not eliminated:: + + >>> html1 = '

Hi you guys

' + >>> html2 = '

Hi you guys

' + >>> pdiff(html1, html2) +

Hi you guys

+ >>> pdiff('text', '

text

') +

text

+ >>> pdiff('Hi guys !!', 'Hi guy !!') + Hi guy guys !! + >>> pdiff('Hi', 'Hi') + Hi Hi + >>> pdiff('A B C', 'A C') + A B C + >>> pdiff('A B C', 'B C') + A B C + >>> pdiff('

', '

') +

+ >>> pdiff('

Hi

', '

Bye

') +

Bye

Hi

+ >>> pdiff('

Hi Guy

', '

Bye Guy

') +

Bye Hi Guy

+ >>> pdiff('

Hey there

', '') +

Hey there

+ +Whitespace is ignored, as it's not meaningful in HTML:: + + >>> pdiff('
Hi\n\nguys
', '
Hi guy
') +
Hi guy guys
+ +Movement between paragraphs is ignored, as tag-based changes are generally ignored:: + >>> + >>> pdiff('

Hello

World

', '

Hello World

') +

Hello World

+ +As a special case, changing the href of a link is displayed, and +images are treated like words: + + >>> pdiff('search', 'search') + search Link: http://google.com + Link: http://yahoo.com + >>> pdiff('

Print this

', '

Print this

') +

Print this

+ >>> pdiff('search', 'search') + search + +The sixteen combinations:: + +First "insert start" (del start/middle/end/none): + + >>> pdiff('A B C', 'D B C D A B C + >>> pdiff('A B C', 'D A C') + D A B C + >>> pdiff('A B C', 'D A B') + D A B C + >>> pdiff('A B C', 'D A B C') + D A B C + +Next, "insert middle" (del start/middle/end/none): + + >>> pdiff('A B C', 'D B C') + D A B C + >>> pdiff('A B C', 'A D C') + A D B C + >>> pdiff('A B C', 'A D B') + A D B C + +This one case hits the threshold of our insensitive matching: + + >>> pdiff('A B C', 'A D B C') + A D A B C + + +Then "insert end" (del start/middle/end/none): + + >>> pdiff('A B C', 'B C D') + A B C D + >>> pdiff('A B C', 'A C D') + A B C D + >>> pdiff('A B C', 'A B D') + A B D C + >>> pdiff('A B C', 'A B C D') + A B C D + +Then no insert (del start/middle/end): + + >>> pdiff('A B C', 'B C') + A B C + >>> pdiff('A B C', 'A C') + A B C + >>> pdiff('A B C', 'A B') + A B C + + >>> pdiff('A B C', 'A B') + A B C + >>> pdiff('A B C', 'A B') + A B C + >>> pdiff('A

hey there how are you?

', 'A') + A

hey there how are you?

+ +Testing a larger document, to make sure there are not weird +unnecessary parallels found: + + >>> pdiff(''' + ...

This is a test document with many words in it that goes on + ... for a while and doesn't have anything do to with the next + ... document that we match this against

''', ''' + ...

This is another document with few similarities to the preceding + ... one, but enough that it may have overlap that could turn into + ... a confusing series of deletes and inserts. + ...

''') +

This is another document with few similarities to the + preceding one, but enough that it may have overlap that could turn + into a confusing series of deletes and inserts.

+

This is a test document with many words in it that goes on for + a while and doesn't have anything do to with the next document that we + match this against

+ + + +Annotation of content can also be done, where every bit of content is +marked up with information about where it came from. + +First, some setup; note that html_annotate is called with a sequence +of documents and the annotation associated with that document. We'll +just use indexes, but you could use author or timestamp information. + + >>> def markup(text, annotation): + ... return '%s' % (annotation, text) + >>> def panno(*docs): + ... pwrapped(html_annotate([(doc, index) for index, doc in enumerate(docs)], + ... markup=markup)) + +Now, a sequence of documents: + + >>> panno('Hello cruel world', 'Hi cruel world', 'Hi world') + Hi world + >>> panno('A similar document', 'A similar document', + ... 'A similar document here') + A similar document here + >>> panno('

P1 para

P2 para

', '

P1 para

P3 foo

') +

P1 para

P3 + foo

+ >>> panno('Hello

There World

','Hello

There Town

') + Hello

There Town

+ >>> panno('

Hello

There World','

Hello

There Town') +

Hello

There + Town + >>> panno('

Hello

There World

','

Hello

There Town

') +

Hello

There + Town

+ >>> panno('

Hi You

', + ... '

Hi You

', + ... '

Hi You

') +

Hi You

+ >>> panno('

Hey

', + ... '

Hey

') +

Hey

+ >>> panno('

Hey You

', + ... '

Hey Guy

') +

Hey Guy

+ + + +Here's a test of a utility function!: + + >>> from lxml.html.htmldiff import _merge_element_contents + >>> from lxml import etree + >>> doc = '''
+ ...
a b content c d
+ ...
content and more stuff trailing
+ ...
hicontent
+ ...
Hi some stuffmore stuff
+ ...
''' + >>> doc = etree.HTML(doc) + >>> def show_result(id): + ... el = doc.xpath("//*[@id='d%s']" % id)[0] + ... _merge_element_contents(el) + ... container = doc.xpath("//*[@id='c%s']" % id)[0] + ... print etree.tostring(container).strip() + >>> show_result(1) +
a b content c d
+ >>> show_result(2) +
content and more stuff trailing
+ >>> show_result(3) +
hicontent
+ >>> show_result(4) +
Hi some stuffmore stuff
+ +More utility: + + >>> from lxml.html.htmldiff import fixup_ins_del_tags + >>> def pfixup(text): + ... print fixup_ins_del_tags(text).strip() + >>> pfixup('

some text and more text and more

') +

some text and more text and more

+ >>> pfixup('

Hi! you

') +

Hi! you

+ >>> pfixup('
Some text and

more text

') +
Some text and

more text

+ >>> pfixup(''' + ...
One tableMore stuff
''') + + + +
One tableMore stuff
+ + +Testing split_unbalanced: + + >>> split_unbalanced(['', 'hey', '']) + ([], ['', 'hey', ''], []) + >>> split_unbalanced(['', 'hey']) + ([''], ['hey'], []) + >>> split_unbalanced(['Hey', '', 'You', '
']) + ([], ['Hey', 'You'], ['', '
']) + >>> split_unbalanced(['So', '', 'Hi', '', 'There', '']) + ([], ['So', 'Hi', '', 'There', ''], ['']) + >>> split_unbalanced(['So', '', 'Hi', '', 'There']) + ([''], ['So', 'Hi', 'There'], ['']) + Added: lxml/branch/html/src/lxml/html/tests/test_rewritelinks.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/tests/test_rewritelinks.py Tue May 29 17:05:51 2007 @@ -0,0 +1,10 @@ +import unittest +from lxml.tests.common_imports import doctest + +def test_suite(): + suite = unittest.TestSuite() + suite.addTests([doctest.DocFileSuite('test_rewritelinks.txt')]) + return suite + +if __name__ == '__main__': + unittest.main() Added: lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt Tue May 29 17:05:51 2007 @@ -0,0 +1,79 @@ +These are tests of relocateresponse:: + + >>> from lxml.html.rewritelinks import * + +In all these examples we'll be using ``http://old`` for the old +(to-be-replaced) URL and ``https://new`` for the new URL (note the +scheme change). Out of laziness we'll define some keywords we use +with all these transformations:: + + >>> relocate_href = Relocator( + ... base_href='http://old/base/path.html', + ... old_href='http://old/', + ... new_href='https://new/') + +Now lets look at simple href rewriting. + +Normal rewrite:: + + >>> relocate_href('http://old/bar') + 'https://new/bar' + +Note that the trailing / doesn't matter in this one case (since +``http://old`` and ``http://old/`` are entirely equivalent):: + + >>> relocate_href('http://old') + 'https://new/' + +The trailing / does matter in other cases:: + + >>> Relocator( + ... base_href='', + ... old_href='http://old-test/foo/', + ... new_href='https://new', + ... )('http://old-test/foo') + 'http://old-test/foo' + >>> Relocator( + ... base_href='', + ... 
old_href='http://old-test/foo/', + ... new_href='https://new', + ... )('http://old-test/foo/') + 'https://new' + +Rewriting a link that doesn't match old_href is a no-op:: + + >>> relocate_href('http://foo/bar') + 'http://foo/bar' + +Relative links are handled:: + + >>> relocate_href('index.html') + 'https://new/base/index.html' + +Now for content. First, to make it easier on us, we need to trim the +normalized HTML we get from these functions:: + + >>> import re + >>> def pr_html(html): + ... html = re.sub(r'', '', html) + ... html = re.sub(r'', '', html) + ... print html.strip() + +Some basics:: + + >>> pr_html(rewrite_links_html( + ... 'link', relocate_href)) + link + >>> pr_html(rewrite_links_html( + ... '', relocate_href)) + + >>> pr_html(rewrite_links_html( + ... '', relocate_href)) + + >>> pr_html(rewrite_links_html('''\ + ... + ... + ... x\ + ... ''', relocate_href)) + + x Added: lxml/branch/html/src/lxml/html/usedoctest.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/html/usedoctest.py Tue May 29 17:05:51 2007 @@ -0,0 +1,3 @@ +from lxml import doctestcompare + +doctestcompare.temp_install(html=True) Added: lxml/branch/html/src/lxml/usedoctest.py ============================================================================== --- (empty file) +++ lxml/branch/html/src/lxml/usedoctest.py Tue May 29 17:05:51 2007 @@ -0,0 +1,3 @@ +from lxml import doctestcompare + +doctestcompare.temp_install() From ianb at codespeak.net Tue May 29 22:30:11 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Tue, 29 May 2007 22:30:11 +0200 (CEST) Subject: [Lxml-checkins] r43882 - lxml/trunk/src/lxml/tests Message-ID: <20070529203011.6F9098080@code0.codespeak.net> Author: ianb Date: Tue May 29 22:30:11 2007 New Revision: 43882 Modified: lxml/trunk/src/lxml/tests/test_etree.py Log: little typo in a test Modified: lxml/trunk/src/lxml/tests/test_etree.py 
============================================================================== --- lxml/trunk/src/lxml/tests/test_etree.py (original) +++ lxml/trunk/src/lxml/tests/test_etree.py Tue May 29 22:30:11 2007 @@ -410,6 +410,7 @@ def test_entity_append(self): Entity = self.etree.Entity Element = self.etree.Element + tostring = self.etree.tostring root = Element("root") root.append( Entity("test") ) From ianb at codespeak.net Tue May 29 22:34:41 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Tue, 29 May 2007 22:34:41 +0200 (CEST) Subject: [Lxml-checkins] r43883 - lxml/branch/html/src/lxml/tests Message-ID: <20070529203441.F1D298080@code0.codespeak.net> Author: ianb Date: Tue May 29 22:34:41 2007 New Revision: 43883 Modified: lxml/branch/html/src/lxml/tests/test_etree.py Log: fix test typo Modified: lxml/branch/html/src/lxml/tests/test_etree.py ============================================================================== --- lxml/branch/html/src/lxml/tests/test_etree.py (original) +++ lxml/branch/html/src/lxml/tests/test_etree.py Tue May 29 22:34:41 2007 @@ -410,6 +410,7 @@ def test_entity_append(self): Entity = self.etree.Entity Element = self.etree.Element + tostring = self.etree.tostring root = Element("root") root.append( Entity("test") ) From ianb at codespeak.net Tue May 29 22:42:18 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Tue, 29 May 2007 22:42:18 +0200 (CEST) Subject: [Lxml-checkins] r43884 - lxml/branch/html/src/lxml/html Message-ID: <20070529204218.A2F608082@code0.codespeak.net> Author: ianb Date: Tue May 29 22:42:18 2007 New Revision: 43884 Modified: lxml/branch/html/src/lxml/html/__init__.py lxml/branch/html/src/lxml/html/clean.py lxml/branch/html/src/lxml/html/rewritelinks.py Log: Added get_element_by_id and text_only; some comments for TODOs Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py 
(original) +++ lxml/branch/html/src/lxml/html/__init__.py Tue May 29 22:42:18 2007 @@ -7,6 +7,7 @@ _rel_links_xpath = etree.XPath("descendant-or-self::a[fn:upper-case(@rel)=$rel]") #_class_xpath = etree.XPath(r"descendant-or-self::*[regexp:match(@class, concat('\b', $class_name, '\b'))]", {'regexp': 'http://exslt.org/regular-expressions'}) _class_xpath = etree.XPath("descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), concat(' ', $class_name, ' '))]") +_id_xpath = etree.XPath("descendant-or-self::*[@id=$id]") class HtmlMixin(object): @@ -58,6 +59,27 @@ def find_class(self, class_name): return _class_xpath(self, class_name=class_name.lower()) + def get_element_by_id(self, id, default=None): + # FIXME: should this raise an exception when something isn't found? + try: + # FIXME: should this check for multiple matches? + # browsers just return the first one + return _id_xpath(self, id=id)[0] + except IndexError: + return default + + def text_only(self, with_tail=False): + """ + Return the text inside this element, without any tags. If with_tail + is true, then also include the text that follows this tag. + """ + parts = [self.text or ''] + for child in self: + parts.append(child.text_only(with_tail=True)) + if with_tail: + parts.append(self.tail or '') + return ''.join(parts) + class HtmlComment(etree._Comment, HtmlMixin): pass Modified: lxml/branch/html/src/lxml/html/clean.py ============================================================================== --- lxml/branch/html/src/lxml/html/clean.py (original) +++ lxml/branch/html/src/lxml/html/clean.py Tue May 29 22:42:18 2007 @@ -86,6 +86,8 @@ del el.attrib[attrib] for attrib in defs.link_attrs: # FIXME: should call lower-case() + # FIXME: starts-with isn't really good either, because + # href=" javascript:..." 
is also a problem for el in doc.xpath("descendant-or-self::*[starts-with(@%s, 'javascript:')]" % attrib): if isinstance(el, basestring): assert 0, repr(el) Modified: lxml/branch/html/src/lxml/html/rewritelinks.py ============================================================================== --- lxml/branch/html/src/lxml/html/rewritelinks.py (original) +++ lxml/branch/html/src/lxml/html/rewritelinks.py Tue May 29 22:42:18 2007 @@ -39,6 +39,7 @@ if remove_base_tags: resolve_base_href(doc) + # FIXME: should use defs.link_attrs for attrib in 'href', 'src': els = doc.xpath('//*[@%s]' % attrib) for el in els: From scoder at codespeak.net Tue May 29 22:56:00 2007 From: scoder at codespeak.net (scoder at codespeak.net) Date: Tue, 29 May 2007 22:56:00 +0200 (CEST) Subject: [Lxml-checkins] r43887 - lxml/trunk Message-ID: <20070529205600.DCD928082@code0.codespeak.net> Author: scoder Date: Tue May 29 22:56:00 2007 New Revision: 43887 Modified: lxml/trunk/TODO.txt Log: todo 2.0: remove ctxt argument from extension functions Modified: lxml/trunk/TODO.txt ============================================================================== --- lxml/trunk/TODO.txt (original) +++ lxml/trunk/TODO.txt Tue May 29 22:56:00 2007 @@ -58,6 +58,8 @@ * clean up (and remove?) 
duplicated API for extension functions +* remove first 'context' argument from extension functions + * find a way to integrate Schematron (if it's available) * always use ns-prefixed type names in objectify's ``xsi:type`` attributes From ianb at codespeak.net Wed May 30 00:09:04 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Wed, 30 May 2007 00:09:04 +0200 (CEST) Subject: [Lxml-checkins] r43891 - lxml/branch/html/src/lxml/html Message-ID: <20070529220904.11B008080@code0.codespeak.net> Author: ianb Date: Wed May 30 00:09:03 2007 New Revision: 43891 Modified: lxml/branch/html/src/lxml/html/__init__.py lxml/branch/html/src/lxml/html/defs.py Log: Fixed lxml.html.Element(); TODO comment Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py (original) +++ lxml/branch/html/src/lxml/html/__init__.py Wed May 30 00:09:03 2007 @@ -2,7 +2,7 @@ import re from lxml import etree -__all__ = ['HTML', 'tostring'] +__all__ = ['HTML', 'tostring', 'Element'] _rel_links_xpath = etree.XPath("descendant-or-self::a[fn:upper-case(@rel)=$rel]") #_class_xpath = etree.XPath(r"descendant-or-self::*[regexp:match(@class, concat('\b', $class_name, '\b'))]", {'regexp': 'http://exslt.org/regular-expressions'}) @@ -164,7 +164,7 @@ def Element(*args, **kw): # FIXME: this is totally broken; segfaults - v = HtmlElement(*args, **kw) + v = html_parser.makeelement(*args, **kw) return v ############################################################ Modified: lxml/branch/html/src/lxml/html/defs.py ============================================================================== --- lxml/branch/html/src/lxml/html/defs.py (original) +++ lxml/branch/html/src/lxml/html/defs.py Wed May 30 00:09:03 2007 @@ -1,3 +1,7 @@ +# FIXME: this should all be confirmed against what a DTD says +# (probably in a test; this may not match the DTD exactly, but we +# should document just how it 
differs). + # Data taken from http://www.w3.org/TR/html401/index/elements.html empty_tags = [ From ianb at codespeak.net Wed May 30 17:46:38 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Wed, 30 May 2007 17:46:38 +0200 (CEST) Subject: [Lxml-checkins] r43898 - lxml/branch/html/src/lxml/html Message-ID: <20070530154638.B6CED807E@code0.codespeak.net> Author: ianb Date: Wed May 30 17:46:38 2007 New Revision: 43898 Modified: lxml/branch/html/src/lxml/html/__init__.py lxml/branch/html/src/lxml/html/clean.py Log: rename remove_element to drop_element, remove_tag to drop_tag. Add clean support for dropping meta tags, and drop applet along with other embedded objects Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py (original) +++ lxml/branch/html/src/lxml/html/__init__.py Wed May 30 17:46:38 2007 @@ -11,7 +11,7 @@ class HtmlMixin(object): - def remove_element(self): + def drop_element(self): """ Removes this element from the tree, including its children and text. The tail text is joined to the previous element or @@ -28,7 +28,7 @@ previous.tail = (previous.tail or '') + self.tail parent.remove(self) - def remove_tag(self): + def drop_tag(self): """ Remove the tag, but not its children or text. The children and text are merged into the parent. Modified: lxml/branch/html/src/lxml/html/clean.py ============================================================================== --- lxml/branch/html/src/lxml/html/clean.py (original) +++ lxml/branch/html/src/lxml/html/clean.py Wed May 30 17:46:38 2007 @@ -4,6 +4,13 @@ __all__ = ['clean_html', 'clean'] +# FIXME: I should study this for more ideas: http://feedparser.org/docs/html-sanitization.html +# In CSS/style attribute: +# url(javascript:...) +# expression(...) +# Other on* attributes that aren't standard? 
+# Try these tests: http://feedparser.org/tests/wellformed/sanitize/ + def clean_html(html, **kw): """ Like clean(), but takes a text input document, and returns a text @@ -20,6 +27,7 @@ # process instructions? style=False, links=False, + meta=False, embedded=True, frames=True, forms=True, @@ -48,6 +56,9 @@ ``links``: Remove any ```` tags + ``meta``: + Remove any ```` tags + ``frames``: Remove any frame-related tags @@ -99,17 +110,19 @@ if isinstance(el, etree._Comment): bad.append(el) for el in bad: - el.remove_element() + el.drop_element() if style: kill_tags.append('style') for el in doc.xpath('descendant-or-self::link[lower-case(@rel)="stylesheet"]'): - el.remove_element() + el.drop_element() for el in doc.xpath('descendant-or-self::*[@style]'): del el.attrib['style'] if links: kill_tags.append('link') + if meta: + kill_tags.append('meta') if embedded: - kill_tags.extend(['object', 'embed', 'iframe']) + kill_tags.extend(['object', 'embed', 'iframe', 'applet']) if frames: kill_tags.extend(defs.frame_tags) if forms: @@ -122,17 +135,17 @@ if el.tag in kill_tags: bad.append(el) for el in bad: - el.remove_element() + el.drop_element() if remove_tags: xpath = ' | '.join([ "descendant-or-self::%s" % tag for tag in remove_tags]) for el in doc.xpath(xpath): if strip_tags: - el.remove_tag() + el.drop_tag() else: # FIXME: Should we test if this has been removed because of a parent? - el.remove_element() + el.drop_element() if remove_unknown_tags: if allow_tags: raise ValueError( @@ -145,10 +158,10 @@ bad.append(el) for el in bad: if strip_tags: - el.remove_tag() + el.drop_tag() else: # FIXME: Should we test if this has been removed because of a parent? 
- el.remove_element() + el.drop_element() if add_nofollow: for el in doc.xpath('descendant-or-self::a[@href]'): href = el.attrib['href'] From ianb at codespeak.net Wed May 30 17:49:36 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Wed, 30 May 2007 17:49:36 +0200 (CEST) Subject: [Lxml-checkins] r43899 - lxml/branch/html/src/lxml/html Message-ID: <20070530154936.5DA33807E@code0.codespeak.net> Author: ianb Date: Wed May 30 17:49:36 2007 New Revision: 43899 Modified: lxml/branch/html/src/lxml/html/__init__.py Log: use CommentBase instead of _Comment as a superclass; switch to ElementDefaultClassLookup Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py (original) +++ lxml/branch/html/src/lxml/html/__init__.py Wed May 30 17:49:36 2007 @@ -80,25 +80,15 @@ parts.append(self.tail or '') return ''.join(parts) -class HtmlComment(etree._Comment, HtmlMixin): +class HtmlComment(etree.CommentBase, HtmlMixin): pass class HtmlElement(etree.ElementBase, HtmlMixin): pass -class HtmlLookup(etree.CustomElementClassLookup): - - def lookup(self, node_type, document, namespace, name): - if node_type == 'element': - return HtmlElement - elif node_type == 'comment': - return HtmlComment - else: - # Delegate - return None - html_parser = etree.HTMLParser() -html_parser.setElementClassLookup(HtmlLookup()) +html_parser.setElementClassLookup(etree.ElementDefaultClassLookup( + element=HtmlElement, comment=HtmlComment)) def HTML(html): # FIXME: should this notice a fragment and parse accordingly? 
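
For readers following the `text_only` helper added in r43884: a minimal sketch of the same recursion, written against the standard library's `xml.etree.ElementTree` rather than lxml. That substitution is an assumption made so the example runs without lxml installed, and the sample markup is illustrative, not taken from the lxml test suite. It collects an element's own text, then each child's text and tail, and optionally the element's own tail.

```python
import xml.etree.ElementTree as ET


def text_only(el, with_tail=False):
    """Return the text inside ``el`` with all tags removed."""
    parts = [el.text or '']
    for child in el:
        # A child's tail is part of the parent's content, so always include it.
        parts.append(text_only(child, with_tail=True))
    if with_tail:
        # The element's own tail follows its closing tag; include it on request.
        parts.append(el.tail or '')
    return ''.join(parts)


# Illustrative input (hypothetical, not from the lxml tests).
root = ET.fromstring('<div>Hello <b>cruel</b> world<i>!</i> bye</div>')
print(text_only(root))
```

With the standard library, `''.join(root.itertext())` should give the same string. In lxml the equivalent is the XPath `string()` function, which concatenates all descendant text nodes and so makes the hand-written recursion unnecessary.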
From ianb at codespeak.net Wed May 30 17:59:35 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Wed, 30 May 2007 17:59:35 +0200 (CEST) Subject: [Lxml-checkins] r43900 - lxml/branch/html/src/lxml/html Message-ID: <20070530155935.BC620807E@code0.codespeak.net> Author: ianb Date: Wed May 30 17:59:35 2007 New Revision: 43900 Modified: lxml/branch/html/src/lxml/html/__init__.py Log: Rename text_only to get_text_content. Remove unnecessary from the fragment parsing. Add some doc strings Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py (original) +++ lxml/branch/html/src/lxml/html/__init__.py Wed May 30 17:59:35 2007 @@ -32,6 +32,13 @@ """ Remove the tag, but not its children or text. The children and text are merged into the parent. + + Example:: + + >>> h = parse_element('
Hello World!
') + >>> h.xpath('//b')[0].drop_tag() + >>> print tostring(h) +
Hello World!
""" parent = self.getparent() assert parent @@ -54,12 +61,27 @@ parent[index:index+1] = list(self) def find_rel_links(self, rel): + """ + Find any links like ``...``; returns a list of elements. + """ return _rel_links_xpath(self, rel=rel.lower()) def find_class(self, class_name): + """ + Find any elements with the given class name. + """ return _class_xpath(self, class_name=class_name.lower()) def get_element_by_id(self, id, default=None): + """ + Get the first element in a document with the given id. If + none are found, return default (None). + + Note that there can be more than one element with the same id, + and this isn't uncommon in HTML documents found in the wild. + Browsers return only the first match, and this function does + the same. + """ # FIXME: should this raise an exception when something isn't found? try: # FIXME: should this check for multiple matches? @@ -68,17 +90,11 @@ except IndexError: return default - def text_only(self, with_tail=False): + def get_text_content(self): """ - Return the text inside this element, without any tags. If with_tail - is true, then also include the text that follows this tag. + Return the text content of the tag (and the text in any children). """ - parts = [self.text or ''] - for child in self: - parts.append(child.text_only(with_tail=True)) - if with_tail: - parts.append(self.tail or '') - return ''.join(parts) + return self.xpath("string()") class HtmlComment(etree.CommentBase, HtmlMixin): pass @@ -104,10 +120,11 @@ The first item in the list may be a string (though leading whitespace is removed). If no_leading_text is true, then it will - be an error if there is leading text. + be an error if there is leading text, and it will always be a list + of only elements. """ # FIXME: check what happens when you give html with a body, head, etc. 
- html = '%s' % html + html = '%s' % html doc = HTML(html) assert doc.tag == 'html' bodies = [e for e in doc if e.tag == 'body'] @@ -153,7 +170,6 @@ return el def Element(*args, **kw): - # FIXME: this is totally broken; segfaults v = html_parser.makeelement(*args, **kw) return v From ianb at codespeak.net Thu May 31 20:46:51 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Thu, 31 May 2007 20:46:51 +0200 (CEST) Subject: [Lxml-checkins] r43949 - lxml/branch/html/src/lxml/html Message-ID: <20070531184651.8E69A80A6@code0.codespeak.net> Author: ianb Date: Thu May 31 20:46:51 2007 New Revision: 43949 Modified: lxml/branch/html/src/lxml/html/clean.py Log: a few notes about things I should do Modified: lxml/branch/html/src/lxml/html/clean.py ============================================================================== --- lxml/branch/html/src/lxml/html/clean.py (original) +++ lxml/branch/html/src/lxml/html/clean.py Thu May 31 20:46:51 2007 @@ -10,6 +10,16 @@ # expression(...) # Other on* attributes that aren't standard? # Try these tests: http://feedparser.org/tests/wellformed/sanitize/ +# Also http://code.sixapart.com/trac/livejournal/browser/trunk/cgi-bin/cleanhtml.pl +# IE treats like +# ...? +# and is fishy in a fragment +# max width for words +# max height? +# autolink? +# CSS stuff? +# remove images? 
+ + def clean_html(html, **kw): """ From ianb at codespeak.net Thu May 31 23:54:19 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Thu, 31 May 2007 23:54:19 +0200 (CEST) Subject: [Lxml-checkins] r43950 - lxml/branch/html/src/lxml Message-ID: <20070531215419.17E3E80AD@code0.codespeak.net> Author: ianb Date: Thu May 31 23:54:18 2007 New Revision: 43950 Modified: lxml/branch/html/src/lxml/doctestcompare.py Log: Fix for problem with HTML-as-a-function and that it then looks like a method to the LHTMLOutputChecker class Modified: lxml/branch/html/src/lxml/doctestcompare.py ============================================================================== --- lxml/branch/html/src/lxml/doctestcompare.py (original) +++ lxml/branch/html/src/lxml/doctestcompare.py Thu May 31 23:54:18 2007 @@ -48,7 +48,8 @@ 'param', 'img', 'area', 'br', 'basefont', 'input', 'base', 'meta', 'link', 'col') - default_parser = etree.XML + def get_default_parser(self): + return etree.XML def check_output(self, want, got, optionflags): alt_self = getattr(self, '_temp_override_self', None) @@ -80,7 +81,7 @@ elif want.strip().lower().startswith('<html'): parser = HTML elif want.strip().startswith('<'): - parser = self.default_parser + parser = self.get_default_parser() return parser def compare_docs(self, want, got): @@ -146,7 +147,7 @@ return '\n'.join(errors) else: return value - html = parser is etree.HTML + html = parser is HTML diff_parts = [] diff_parts.append('Expected:') diff_parts.append(self.format_doc(want_doc, html, 2)) @@ -300,7 +301,8 @@ return self.format_text(text, strip) class LHTMLOutputChecker(LXMLOutputChecker): - default_parser = HTML + def get_default_parser(self): + return HTML def install(html=False): """ From ianb at codespeak.net Thu May 31 23:55:43 2007 From: ianb at codespeak.net (ianb at codespeak.net) Date: Thu, 31 May 2007 23:55:43 +0200 (CEST) Subject: [Lxml-checkins] r43951 - in lxml/branch/html/src/lxml/html: .
tests Message-ID: <20070531215543.BAB8880AD@code0.codespeak.net> Author: ianb Date: Thu May 31 23:55:43 2007 New Revision: 43951 Modified: lxml/branch/html/src/lxml/html/__init__.py lxml/branch/html/src/lxml/html/rewritelinks.py lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt Log: Added iter_links; added methods for each of the functions; added functions for each of the methods. Added some more tests. Consolidation of the functions will happen in a following commit Modified: lxml/branch/html/src/lxml/html/__init__.py ============================================================================== --- lxml/branch/html/src/lxml/html/__init__.py (original) +++ lxml/branch/html/src/lxml/html/__init__.py Thu May 31 23:55:43 2007 @@ -96,6 +96,125 @@ """ return self.xpath("string()") + ######################################## + ## Link functions + ######################################## + + def make_links_absolute(self, base_href, resolve_base_href=True): + """ + Make all links in the document absolute, given the + ``base_href`` for the document (the full URL where the + document came from). + + If ``resolve_base_href`` is true, then any ``<base href>`` + tags in the document are used *and* removed from the document. + If it is false then any such tag is ignored. + """ + from lxml.html.rewritelinks import make_links_absolute + make_links_absolute(self, base_href, resolve_base_href=resolve_base_href) + + def resolve_base_href(self): + """ + Find any ``<base href>`` tag in the document, and apply its + values to all links found in the document. Also remove the + tag once it has been applied. + """ + from lxml.html.rewritelinks import resolve_base_href + resolve_base_href(self) + + def iter_links(self, in_order=True): + """ + Iterate over all the links in the document, yielding + ``(element, attribute, link)``. + + The ``element`` contains the link. ``attribute`` is a string + like ``'href'`` or ``'src'``.
It may be None, which means + that the link is in the body of the element. The only case + where this occurs is with ``<style>`` tags that contain links like + ``url(...)``. ``link`` is the actual link, like + ``'http://codespeak.net'``. + + Note: if ``in_order`` is false, links are not returned in document order. + """ + from lxml.html.rewritelinks import iter_links + return iter_links(self, in_order=in_order) + + def rewrite_links(self, link_repl_func, resolve_base_href=True, + base_href=None): + """ + Rewrite all the links in the document. For each link + ``link_repl_func(link)`` will be called, and the return value + will replace the old link. + + Note that links may not be absolute (unless you first called + ``make_links_absolute()``), and may be internal (e.g., + ``'#anchor'``). They can also be values like + ``'mailto:email'`` or ``'javascript:expr'``. + + If you give ``base_href`` then all links passed to + ``link_repl_func()`` will be absolute. + """ + from lxml.html.rewritelinks import rewrite_links + if base_href is not None: + # FIXME: this can be done in one pass with a wrapper + # around link_repl_func + self.make_links_absolute(base_href, resolve_base_href=resolve_base_href) + resolve_base_href = False + rewrite_links(self, link_repl_func, remove_base_tags=resolve_base_href) + +class _MethodFunc(object): + def __init__(self, name, fragment=False, source_class=HtmlMixin): + self.name = name + self.fragment = fragment + self.__doc__ = getattr(source_class, self.name).__doc__ + def __call__(self, doc, *args, **kw): + if 'fragment' in kw: + fragment = kw.pop('fragment') + else: + fragment = self.fragment + if isinstance(doc, basestring): + if fragment: + doc = parse_element(doc) + else: + doc = HTML(doc) + meth = getattr(doc, self.name) + result = meth(*args, **kw) + if result is None: + # Then serialize and return + return tostring(doc) + else: + return result + +find_rel_links = _MethodFunc('find_rel_links') +find_class = _MethodFunc('find_class') +make_links_absolute =
_MethodFunc('make_links_absolute') +resolve_base_href = _MethodFunc('resolve_base_href') +iter_links = _MethodFunc('iter_links') +rewrite_links = _MethodFunc('rewrite_links') + +class _SubmoduleFunc(object): + def __init__(self, module, name, doc=None): + self.module = module + self.name = name + self.obj = None + if doc is None: + doc = 'See %s.%s' % (module, name) + self.__doc__ = doc + def __call__(self, *args, **kw): + if self.obj is None: + import sys + __import__(self.module) + mod = sys.modules[self.module] + self.obj = getattr(mod, self.name) + self.__doc__ = self.obj.__doc__ + return self.obj(*args, **kw) + +# FIXME: Damn module names conflict with the function names :( +#clean = _SubmoduleFunc('lxml.html.clean', 'clean') +#clean_html = _SubmoduleFunc('lxml.html.clean', 'clean_html') +#htmldiff = _SubmoduleFunc('lxml.html.htmldiff', 'htmldiff') +#html_annotate = _SubmoduleFunc('lxml.html.htmldiff', 'html_annotate') + class HtmlComment(etree.CommentBase, HtmlMixin): pass Modified: lxml/branch/html/src/lxml/html/rewritelinks.py ============================================================================== --- lxml/branch/html/src/lxml/html/rewritelinks.py (original) +++ lxml/branch/html/src/lxml/html/rewritelinks.py Thu May 31 23:55:43 2007 @@ -4,6 +4,7 @@ from lxml.html import tostring, HTML +from lxml.html import defs import urlparse import re @@ -11,10 +12,10 @@ 'rewrite_links', 'rewrite_links_html', 'Relocator'] -def make_links_absolute(doc, base_href): +def make_links_absolute(doc, base_href, resolve_base_href=True): def link_repl(href): return urlparse.urljoin(base_href, href) - rewrite_links(doc, link_repl_func) + rewrite_links(doc, link_repl, remove_base_tags=resolve_base_href) def make_links_absolute_html(html, base_href): doc = HTML(html) @@ -88,6 +89,30 @@ for el in doc.xpath("//*[contains(@style, 'url(')]"): el.attrib['style'] = CSS_URL_PAT.sub(absuri, el.attrib['style']) +def iter_links(doc): + """ + Yield (element, attribute, link, pos),
where attribute may be None + (indicating the link is in the text). ``pos`` is the position + where the link occurs; often 0, but sometimes something else in + the case of links in stylesheets or style tags. + + Note: <base href> is *not* taken into account in any way. The + link you get is exactly the link in the document. + """ + link_attrs = defs.link_attrs + for el in doc.iterdescendants(): + for attrib in link_attrs: + if attrib in el.attrib: + yield (el, attrib, el.attrib[attrib], 0) + if el.tag == 'style' and el.text: + for match in CSS_URL_PAT.finditer(el.text): + yield (el, None, match.group(1), match.start(1)) + for match in CSS_IMPORT_PAT.finditer(el.text): + yield (el, None, match.group(1), match.start(1)) + if 'style' in el.attrib: + for match in CSS_URL_PAT.finditer(el.attrib['style']): + yield (el, 'style', match.group(1), match.start(1)) + class Relocator(object): """ This helper can be used to move all links in a document from one @@ -120,3 +145,4 @@ # A link somewhere else entirely return href return self.new_href + real_href[len(self.old_href):] + Modified: lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt ============================================================================== --- lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt (original) +++ lxml/branch/html/src/lxml/html/tests/test_rewritelinks.txt Thu May 31 23:55:43 2007 @@ -1,20 +1,18 @@ These are tests of relocateresponse:: - >>> from lxml.html.rewritelinks import * + >>> from lxml.html.rewritelinks import Relocator In all these examples we'll be using ``http://old`` for the old (to-be-replaced) URL and ``https://new`` for the new URL (note the -scheme change). Out of laziness we'll define some keywords we use -with all these transformations:: +scheme change). To test the rewriting we'll use this handy rewriter +that rewrites everything from one base to another base:: >>> relocate_href = Relocator( ... base_href='http://old/base/path.html', ... 
old_href='http://old/', ... new_href='https://new/') -Now lets look at simple href rewriting. - -Normal rewrite:: +Now lets look at simple href rewriting. Normal rewrite:: >>> relocate_href('http://old/bar') 'https://new/bar' @@ -53,27 +51,138 @@ Now for content. First, to make it easier on us, we need to trim the normalized HTML we get from these functions:: - >>> import re - >>> def pr_html(html): - ... html = re.sub(r'</?(?:html|head|body)>', '', html) - ... html = re.sub(r'<meta.*?>', '', html) - ... print html.strip() - Some basics:: - >>> pr_html(rewrite_links_html( - ... '<a href="http://old/blah/blah.html">link</a>', relocate_href)) + >>> from lxml.html import usedoctest, parse_element, tostring + >>> from lxml.html import rewrite_links + >>> print rewrite_links( + ... '<a href="http://old/blah/blah.html">link</a>', relocate_href) <a href="https://new/blah/blah.html">link</a> - >>> pr_html(rewrite_links_html( - ... '<script src="http://old/foo.js"></script>', relocate_href)) + >>> print rewrite_links( + ... '<script src="http://old/foo.js"></script>', relocate_href) <script src="https://new/foo.js"></script> - >>> pr_html(rewrite_links_html( - ... '<link href="foo.css">', relocate_href)) + >>> print rewrite_links( + ... '<link href="foo.css">', relocate_href) <link href="https://new/base/foo.css"> - >>> pr_html(rewrite_links_html('''\ + >>> print rewrite_links('''\ ... <base href="http://blah/stuff/index.html"> ... <link href="foo.css"> ... <a href="http://old/bar.html">x</a>\ - ... ''', relocate_href)) + ... ''', relocate_href) <link href="http://blah/stuff/foo.css"> <a href="https://new/bar.html">x</a> + +Links in CSS are also handled:: + + >>> print rewrite_links(''' + ... <style> + ... body {background-image: url(http://old/image.gif)}; + ... @import "http://old/other-style.css"; + ... 
</style>''', relocate_href) + <html><head><style> + body {background-image: url(https://new/image.gif)}; + @import "https://new/other-style.css"; + </style></head></html> + +Links in style attributes are also rewritten:: + + >>> print rewrite_links(''' + ... <div style="background-image: url(http://old/image.gif)">text</div> + ... ''', relocate_href) + <div style="background-image: url(https://new/image.gif)">text</div> + +The ``<base href>`` tag is also respected (and then removed):: + + >>> print rewrite_links(''' + ... <html><head> + ... <base href="http://old/"> + ... </head> + ... <body> + ... <a href="foo.html">link</a> + ... </body></html>''', relocate_href) + <html> + <head></head> + <body> + <a href="https://new/foo.html">link</a> + </body> + </html> + +The ``iter_links`` method (and function) gives you all the links in +the document, along with the element and attribute the link comes +from. This makes it fairly easy to see what resources the document +references or embeds (an ``<a>`` tag is a reference, an ``<img>`` tag +is something embedded). It returns a generator of ``(element, attrib, +link, pos)``, which is awkward to test here, so we'll make a printer:: + + >>> from lxml.html import iter_links + >>> def print_iter(seq): + ... for element, attrib, link, pos in seq: + ... if pos: + ... extra = '@%s' % pos + ... else: + ... extra = '' + ... print '%s %s="%s"%s' % (element.tag, attrib, link, extra) + >>> print_iter(iter_links(''' + ... <html> + ... <head> + ... <link rel="stylesheet" href="style.css"> + ... <style type="text/css"> + ... body { + ... background-image: url(/bg.gif); + ... } + ... @import "/other-styles.css"; + ... </style> + ... <script src="/js-funcs.js"></script> + ... </head> + ... <body> + ... <table> + ... <tr><td><ul> + ... <li><a href="/test.html">Test stuff</a></li> + ... <li><a href="/other.html">Other stuff</a></li> + ... </td></tr> + ... <td style="background-image: url(/td-bg.png)"> + ... <img src="/logo.gif"> + ...
Hi world! + ... </td></tr> + ... </table> + ... </body></html>''')) + link href="style.css" + style None="/bg.gif"@40 + style None="/other-styles.css"@69 + script src="/js-funcs.js" + a href="/test.html" + a href="/other.html" + td style="/td-bg.png"@22 + img src="/logo.gif" + >>> print_iter(iter_links(''' + ... <html> + ... <head> + ... <link rel="stylesheet" href="style.css"> + ... <style type="text/css"> + ... body { + ... background-image: url(/bg.gif); + ... } + ... @import "/other-styles.css"; + ... </style> + ... <script src="/js-funcs.js"></script> + ... </head> + ... <body> + ... <table> + ... <tr><td><ul> + ... <li><a href="/test.html">Test stuff</a></li> + ... <li><a href="/other.html">Other stuff</a></li> + ... </td></tr> + ... <td style="background-image: url(/td-bg.png)"> + ... <img src="/logo.gif"> + ... Hi world! + ... </td></tr> + ... </table> + ... </body></html>''', False)) + link href="style.css" + a href="/test.html" + a href="/other.html" + script src="/js-funcs.js" + img src="/logo.gif" + style None="/bg.gif"@40 + style None="/other-styles.css"@69 + td style="/td-bg.png"@22
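
The ``@40`` and ``@69`` suffixes in the output above are the ``pos`` values: the offset of the link inside the ``<style>`` text or ``style`` attribute. As a rough, standalone sketch of how such (link, pos) pairs can be collected with the standard ``re`` module, something like the following might do. The two patterns here are assumptions made for illustration (the real ``CSS_URL_PAT`` and ``CSS_IMPORT_PAT`` are not shown in this diff and may well differ), ``iter_css_links`` is a hypothetical helper name, and the code is written for current Python rather than the Python 2 used in the commit.

```python
import re

# Hypothetical patterns, for illustration only. The real CSS_URL_PAT and
# CSS_IMPORT_PAT in rewritelinks.py are not shown in this diff.
CSS_URL_PAT = re.compile(r'url\(\s*["\']?([^"\')\s]+)["\']?\s*\)', re.I)
CSS_IMPORT_PAT = re.compile(r'@import\s+["\']([^"\']+)["\']', re.I)


def iter_css_links(css_text):
    """Yield (link, pos) for each url(...) and @import "..." in css_text.

    pos is the offset of the link itself within css_text, taken from
    match.start(1) as in the iter_links() function in the diff.
    """
    for pattern in (CSS_URL_PAT, CSS_IMPORT_PAT):
        for match in pattern.finditer(css_text):
            yield match.group(1), match.start(1)


css = 'body {background-image: url(/bg.gif);} @import "/other-styles.css";'
for link, pos in iter_css_links(css):
    # Slicing at pos recovers the link, so a caller can rewrite it in place.
    print(link, pos, css[pos:pos + len(link)] == link)
```

Like the committed function, this reports all ``url(...)`` matches before any ``@import`` matches, so the pairs come back grouped by pattern rather than strictly by position.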