From scoder at codespeak.net Fri Feb 27 14:47:12 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:12 +0100 (CET)
Subject: [Lxml-checkins] r62228 - lxml/trunk
Message-ID: <20090227134712.5EC5516846D@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:11 2009
New Revision: 62228
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/setup.py
Log:
r5054 at delle: sbehnel | 2009-02-22 15:44:15 +0100
mark Py3 exception crash bug fixed in Cython, official support for Python 3.0.1
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 14:47:11 2009
@@ -17,6 +17,9 @@
Bugs fixed
----------
+* Crash bug in exception handling code under Python 3. This was due
+ to a problem in Cython, not lxml itself.
+
* ``lxml.html.FormElement._name()`` failed for non top-level forms.
* ``TAG`` special attribute in constructor of custom Element classes
@@ -25,6 +28,8 @@
Other changes
-------------
+* Official support for Python 3.0.1.
+
* ``Element.findtext()`` now returns an empty string instead of None
for Elements without text content.
Modified: lxml/trunk/setup.py
==============================================================================
--- lxml/trunk/setup.py (original)
+++ lxml/trunk/setup.py Fri Feb 27 14:47:11 2009
@@ -98,8 +98,8 @@
'Programming Language :: Python :: 2.4',
'Programming Language :: Python :: 2.5',
'Programming Language :: Python :: 2.6',
-# 'Programming Language :: Python :: 3',
-# 'Programming Language :: Python :: 3.0',
+ 'Programming Language :: Python :: 3',
+ 'Programming Language :: Python :: 3.0',
'Programming Language :: C',
'Operating System :: OS Independent',
'Topic :: Text Processing :: Markup :: HTML',
From scoder at codespeak.net Fri Feb 27 14:47:17 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:17 +0100 (CET)
Subject: [Lxml-checkins] r62229 - in lxml/trunk: . doc
Message-ID: <20090227134717.9817E168487@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:17 2009
New Revision: 62229
Modified:
lxml/trunk/ (props changed)
lxml/trunk/doc/FAQ.txt
Log:
r5055 at delle: sbehnel | 2009-02-27 12:04:42 +0100
FAQ update: clean up threading sections, reference dev-works article
Modified: lxml/trunk/doc/FAQ.txt
==============================================================================
--- lxml/trunk/doc/FAQ.txt (original)
+++ lxml/trunk/doc/FAQ.txt Fri Feb 27 14:47:17 2009
@@ -96,7 +96,9 @@
``lxml.objectify``, read the `objectify documentation`_.
John Shipman has written another tutorial called `Python XML
-processing with lxml`_ that contains lots of examples.
+processing with lxml`_ that contains lots of examples. Liza Daly
+wrote a nice article about high-performance aspects when `parsing
+large files with lxml`_.
.. _`lxml.etree Tutorial`: tutorial.html
.. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
@@ -104,6 +106,8 @@
.. _`objectify documentation`: objectify.html
.. _`Python XML processing with lxml`: http://www.nmt.edu/tcc/help/pubs/pylxml/
.. _`element library`: http://effbot.org/zone/element-lib.htm
+.. _`parsing large files with lxml`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
+
Where can I find more documentation about lxml?
-----------------------------------------------
@@ -194,7 +198,10 @@
* zif.sedna_, an XQuery based interface to the Sedna OpenSource XML database
And don't miss the quotes by our generally happy_ users_, and other
-`sites that link to lxml`_.
+`sites that link to lxml`_. As `Liza Daly`_ puts it: "Many software
+products come with the pick-two caveat, meaning that you must choose
+only two: speed, flexibility, or readability. When used carefully,
+lxml can provide all three."
.. _Zope: http://www.zope.org/
.. _Plone: http://www.plone.org/
@@ -215,6 +222,7 @@
.. _happy: http://thread.gmane.org/gmane.comp.python.lxml.devel/3244/focus=3244
.. _users: http://article.gmane.org/gmane.comp.python.lxml.devel/3246
.. _`sites that link to lxml`: http://www.google.com/search?as_lq=http:%2F%2Fcodespeak.net%2Flxml
+.. _`Liza Daly`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
What is the difference between lxml.etree and lxml.objectify?
@@ -619,8 +627,8 @@
either the default parser (which is replicated for each thread) or
create a parser for each thread yourself. lxml also allows
concurrency during validation (RelaxNG and XMLSchema) and XSL
-transformation. You can share RelaxNG, XMLSchema and (with
-restrictions) XSLT objects between threads.
+transformation. You can share RelaxNG, XMLSchema and XSLT objects
+between threads.
While you can also share parsers between threads, this will serialize
the access to each of them, so it is better to ``.copy()`` parsers or
@@ -629,19 +637,16 @@
internal lock to protect their prepared evaluation contexts. It is
therefore best to use separate evaluator instances in threads.
-Due to the way libxslt handles threading, applying a stylesheets is
-most efficient if it was parsed in the same thread that executes it.
-One way to achieve this is by caching stylesheets in thread-local
-storage.
-
-Warning: Before lxml 2.2, there were various issues when moving
-subtrees between different threads. If you need code to run with
-older versions, you should generally avoid modifying trees in other
-threads than the one it was generated in. Although this should work
-in many cases, there are certain scenarios where the termination of a
-thread that parsed a tree can crash the application if subtrees of
-this tree were moved to other documents. You should be on the safe
-side when passing trees between threads if you either
+Warning: Before lxml 2.2, and especially before 2.1, there were
+various issues when moving subtrees between different threads, or when
+applying XSLT objects from one thread to trees parsed or modified in
+another. If you need code to run with older versions, you should
+generally avoid modifying trees in other threads than the one it was
+generated in. Although this should work in many cases, there are
+certain scenarios where the termination of a thread that parsed a tree
+can crash the application if subtrees of this tree were moved to other
+documents. You should be on the safe side when passing trees between
+threads if you either
- do not modify these trees and do not move their elements to other
trees, or
@@ -650,6 +655,13 @@
use (e.g. by using a fixed size thread-pool or long-running threads
in processing chains)
+Since lxml 2.2, even multi-thread pipelines are supported. However,
+note that it is more efficient to do all tree work inside one thread,
+than to let multiple threads work on a tree one after the other. This
+is because trees inherit state from the thread that created them,
+which must be maintained when the tree is modified inside another
+thread.
+
Does my program run faster if I use threads?
--------------------------------------------
@@ -657,11 +669,13 @@
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the
-interpreter, so if the majority of your processing is done in Python code
-(walking trees, modifying elements, etc.), your gain will be close to 0. The
-more of your XML processing moves into lxml, however, the higher your gain.
-If your application is bound by XML parsing and serialisation, or by complex
-XSLTs, your speedup on multi-processor machines can be substantial.
+interpreter, so if the majority of your processing is done in Python
+code (walking trees, modifying elements, etc.), your gain will be
+close to zero. The more of your XML processing moves into lxml,
+however, the higher your gain. If your application is bound by XML
+parsing and serialisation, or by very selective XPath expressions and
+complex XSLTs, your speedup on multi-processor machines can be
+substantial.
See the question above to learn which operations free the GIL to support
multi-threading.
@@ -670,30 +684,28 @@
Would my single-threaded program run faster if I turned off threading?
----------------------------------------------------------------------
-Quite likely, yes. You can see for yourself by compiling lxml
-entirely without threading support. Pass the ``--without-threading``
-option to setup.py when building lxml from source. You can also build
-libxml2 without pthread support (``--without-pthreads`` option), which
-may add another bit of performance. Note that this will leave
-internal data structures entirely without thread protection, so make
-sure you really do not use lxml outside of the main application thread
-in this case.
+Possibly, yes. You can see for yourself by compiling lxml entirely
+without threading support. Pass the ``--without-threading`` option to
+setup.py when building lxml from source. You can also build libxml2
+without pthread support (``--without-pthreads`` option), which may add
+another bit of performance. Note that this will leave internal data
+structures entirely without thread protection, so make sure you really
+do not use lxml outside of the main application thread in this case.
Why can't I reuse XSLT stylesheets in other threads?
----------------------------------------------------
-Since lxml 2.0, you can. However, it is a lot more efficient to use
-stylesheets in the thread that created them. This is due to some
-interfering optimisations in libxslt and lxml.etree. It is therefore
-a good idea to cache them in thread local storage (see Python's
-threading module). lxml cannot easily do this for you, as it cannot
-know when to discard them from such a cache.
-
-If you use very complex stylesheets or create stylesheets
-programmatically, you should do so in the main thread, and then copy
-them into the thread cache using the ``copy`` module from the standard
-library.
+Since later lxml 2.0 versions, you can do this. There is some
+overhead involved as the result document needs an additional cleanup
+traversal when the input document and/or the stylesheet were created
+in other threads. However, on a multi-processor machine, the gain of
+freeing the GIL easily covers this drawback.
+
+If you need even the last bit of performance, consider keeping (a copy
+of) the stylesheet in thread-local storage, and try creating the input
+document(s) in the same thread. And do not forget to benchmark your
+code to see if the increased code complexity is really worth it.
My program crashes when run with mod_python/Pyro/Zope/Plone/...
@@ -709,10 +721,11 @@
code runs perfectly when started by hand, the following gives you a few hints
for possible approaches to solve your specific problem:
-* make sure you use recent versions of libxml2, libxslt and lxml. The libxml2
- developers keep fixing bugs in each release, and lxml also tries to become
- more robust against possible pitfalls. So newer versions might already fix
- your problem in a reliable way.
+* make sure you use recent versions of libxml2, libxslt and lxml. The
+ libxml2 developers keep fixing bugs in each release, and lxml also
+ tries to become more robust against possible pitfalls. So newer
+ versions might already fix your problem in a reliable way. Version
+ 2.2 of lxml contains many improvements.
* make sure the library versions you installed are really used. Do
not rely on what your operating system tells you! Print the version
@@ -736,14 +749,15 @@
from crashing, which should be worth more to you than peak performance.
Remember that lxml is fast anyway, so concurrency may not even be worth it.
-* avoid doing fancy XSLT stuff like foreign document access or passing in
- subtrees through XSLT variables. This might or might not work, depending on
- your specific usage.
+* look out for fancy XSLT stuff like foreign document access or
+ passing in subtrees through XSLT variables. This might or might not
+ work, depending on your specific usage. Again, later versions of
+ lxml and libxslt provide safer support here.
* try copying trees at suspicious places in your code and working with
- those instead of a tree shared between threads. A good candidate
- might be the result of an XSLT or the stylesheet itself, if it
- traverses thread boundaries.
+ those instead of a tree shared between threads. Note that the
+ copying must happen inside the target thread to be effective, not in
+ the thread that created the tree.
* try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
instead of sharing one. Also see the question above.
@@ -756,6 +770,10 @@
of lxml, libxml2 and libxslt you are using (see the question on reporting
a bug).
+Note that most of these options will degrade performance and/or your
+code quality. If you are unsure what to do, please ask on the mailing
+list.
+
Parsing and Serialisation
=========================
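The FAQ text above suggests keeping (a copy of) a stylesheet in thread-local storage. A minimal sketch of that pattern follows, using only the standard library. The helper name ``get_thread_local`` and the ``factory`` argument are hypothetical and not part of lxml; the factory stands in for whatever builds the per-thread object.

```python
import threading

# One storage object for the module; each thread sees its own attributes.
_local = threading.local()


def get_thread_local(key, factory):
    """Return the calling thread's own object for `key`.

    `factory` is called at most once per thread and key; the result is
    cached in thread-local storage and reused on later calls.
    (Hypothetical helper, not an lxml API.)
    """
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if key not in cache:
        cache[key] = factory()
    return cache[key]
```

With lxml, the factory could be something like ``lambda: etree.XSLT(etree.parse("style.xsl"))``, where ``style.xsl`` is a placeholder file name. As the FAQ says, benchmark before adopting this, since it adds code complexity.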
From scoder at codespeak.net Fri Feb 27 14:47:22 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:22 +0100 (CET)
Subject: [Lxml-checkins] r62230 - in lxml/trunk: . doc
Message-ID: <20090227134722.5F44C168480@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:21 2009
New Revision: 62230
Modified:
lxml/trunk/ (props changed)
lxml/trunk/doc/FAQ.txt
Log:
r5056 at delle: sbehnel | 2009-02-27 12:09:09 +0100
minor doc update
Modified: lxml/trunk/doc/FAQ.txt
==============================================================================
--- lxml/trunk/doc/FAQ.txt (original)
+++ lxml/trunk/doc/FAQ.txt Fri Feb 27 14:47:21 2009
@@ -757,7 +757,9 @@
* try copying trees at suspicious places in your code and working with
those instead of a tree shared between threads. Note that the
copying must happen inside the target thread to be effective, not in
- the thread that created the tree.
+ the thread that created the tree. Serialising in one thread and
+ parsing in another is also a simple (and fast) way of separating
+ thread contexts.
* try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
instead of sharing one. Also see the question above.
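The hint about serialising in one thread and parsing in another can be sketched as follows. This is an illustration under assumptions, not code from lxml: the function names and the queue-based hand-over are made up for the example, and the standard library ``xml.etree.ElementTree`` is used as a fallback only so that the sketch runs without lxml installed.

```python
import queue
import threading

try:
    from lxml import etree  # preferred, if installed
except ImportError:
    # Fallback so the sketch still runs; it offers the same
    # fromstring()/tostring() calls used below.
    import xml.etree.ElementTree as etree


def producer(out_queue):
    # Build the tree in this thread, then hand over bytes, not the tree.
    root = etree.fromstring(b"<root><a>text</a></root>")
    out_queue.put(etree.tostring(root))


def consumer(in_queue, results):
    # Parse again here, so the tree is created by the thread that modifies it.
    root = etree.fromstring(in_queue.get())
    root.set("seen", "yes")
    results.append(etree.tostring(root))


def run_pipeline():
    q = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(q,)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]
```

Only the serialised bytes cross the thread boundary, so no tree object is ever shared between the two threads.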
From scoder at codespeak.net Fri Feb 27 14:47:27 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:27 +0100 (CET)
Subject: [Lxml-checkins] r62231 - in lxml/trunk: . src/lxml src/lxml/tests
Message-ID: <20090227134727.AE76116846D@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:27 2009
New Revision: 62231
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/src/lxml/tests/test_threading.py
lxml/trunk/src/lxml/xslt.pxi
Log:
r5057 at delle: sbehnel | 2009-02-27 13:32:16 +0100
fix crash when overwriting attributes in XSLT
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 14:47:27 2009
@@ -17,6 +17,9 @@
Bugs fixed
----------
+* Crash in XSLT when overwriting an already defined attribute using
+ ``xsl:attribute``.
+
* Crash bug in exception handling code under Python 3. This was due
to a problem in Cython, not lxml itself.
Modified: lxml/trunk/src/lxml/tests/test_threading.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_threading.py (original)
+++ lxml/trunk/src/lxml/tests/test_threading.py Fri Feb 27 14:47:27 2009
@@ -84,6 +84,31 @@
self.assertEquals(_bytes('
BCB'),
tostring(root))
+ def test_thread_xslt_attr_replace(self):
+ # this is the only case in XSLT where the result tree can be
+ # modified in-place
+ XML = self.etree.XML
+ tostring = self.etree.tostring
+ style = self.etree.XSLT(XML(_bytes('''\
+
+
+
+
+ xyz
+
+
+ ''')))
+
+ result = []
+ def run_thread():
+ root = XML(_bytes('
'))
+ result.append( style(root).getroot() )
+
+ self._run_thread(run_thread)
+ self.assertEquals(_bytes('
'),
+ tostring(result[0]))
+
def test_thread_create_xslt(self):
XML = self.etree.XML
tostring = self.etree.tostring
Modified: lxml/trunk/src/lxml/xslt.pxi
==============================================================================
--- lxml/trunk/src/lxml/xslt.pxi (original)
+++ lxml/trunk/src/lxml/xslt.pxi Fri Feb 27 14:47:27 2009
@@ -486,7 +486,15 @@
_destroyFakeDoc(input_doc._c_doc, c_doc)
python.PyErr_NoMemory()
- initTransformDict(transform_ctxt)
+ # using the stylesheet dict is safer than using a possibly
+ # unrelated dict from the current thread. Almost all
+ # non-input tag/attr names will come from the stylesheet
+ # anyway.
+ if transform_ctxt.dict is not NULL:
+ xmlparser.xmlDictFree(transform_ctxt.dict)
+ transform_ctxt.dict = self._c_style.doc.dict
+ xmlparser.xmlDictReference(transform_ctxt.dict)
+
xslt.xsltSetCtxtParseOptions(
transform_ctxt, input_doc._parser._parse_options)
@@ -776,9 +784,6 @@
# enable EXSLT support for XSLT
xslt.exsltRegisterAll()
-cdef void initTransformDict(xslt.xsltTransformContext* transform_ctxt):
- __GLOBAL_PARSER_CONTEXT.initThreadDictRef(&transform_ctxt.dict)
-
################################################################################
# XSLT PI support
From scoder at codespeak.net Fri Feb 27 14:47:32 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:32 +0100 (CET)
Subject: [Lxml-checkins] r62232 - in lxml/trunk: . src/lxml/html src/lxml/html/tests
Message-ID: <20090227134732.5EDF11684C8@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:31 2009
New Revision: 62232
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/src/lxml/html/soupparser.py
lxml/trunk/src/lxml/html/tests/test_elementsoup.py
Log:
r5058 at delle: sbehnel | 2009-02-27 14:06:21 +0100
fix bug #334718: soupparser fails on attribute without value
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 14:47:31 2009
@@ -17,6 +17,8 @@
Bugs fixed
----------
+* Soupparser failed on broken attributes without values.
+
* Crash in XSLT when overwriting an already defined attribute using
``xsl:attribute``.
Modified: lxml/trunk/src/lxml/html/soupparser.py
==============================================================================
--- lxml/trunk/src/lxml/html/soupparser.py (original)
+++ lxml/trunk/src/lxml/html/soupparser.py Fri Feb 27 14:47:31 2009
@@ -109,8 +109,14 @@
import re
handle_entities = re.compile("&(\w+);").sub
+try:
+ empty_string = unicode()
+except NameError:
+ empty_string = str()
def unescape(string):
+ if not string:
+ return empty_string
# work around oddities in BeautifulSoup's entity handling
def unescape_entity(m):
try:
Modified: lxml/trunk/src/lxml/html/tests/test_elementsoup.py
==============================================================================
--- lxml/trunk/src/lxml/html/tests/test_elementsoup.py (original)
+++ lxml/trunk/src/lxml/html/tests/test_elementsoup.py Fri Feb 27 14:47:31 2009
@@ -1,5 +1,5 @@
import unittest, sys
-from lxml.tests.common_imports import make_doctest
+from lxml.tests.common_imports import make_doctest, HelperTestCase
try:
import BeautifulSoup
@@ -7,11 +7,25 @@
except ImportError:
BS_INSTALLED = False
+if BS_INSTALLED:
+ class SoupParserTestCase(HelperTestCase):
+ from lxml.html import soupparser
+
+ def test_broken_attribute(self):
+ html = """\
+
+
+
+ """
+ root = self.soupparser.fromstring(html)
+ self.assert_(root.find('.//input').get('disabled') is not None)
+
def test_suite():
suite = unittest.TestSuite()
if sys.version_info >= (2,4):
if BS_INSTALLED:
+ suite.addTests([unittest.makeSuite(SoupParserTestCase)])
suite.addTests([make_doctest('../../../../doc/elementsoup.txt')])
return suite
From scoder at codespeak.net Fri Feb 27 14:47:37 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 14:47:37 +0100 (CET)
Subject: [Lxml-checkins] r62233 - in lxml/trunk: . src/lxml src/lxml/tests
Message-ID: <20090227134737.BBADC168480@codespeak.net>
Author: scoder
Date: Fri Feb 27 14:47:36 2009
New Revision: 62233
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/src/lxml/apihelpers.pxi
lxml/trunk/src/lxml/lxml.etree.pyx
lxml/trunk/src/lxml/tests/test_etree.py
Log:
r5059 at delle: sbehnel | 2009-02-27 14:43:30 +0100
make deep-copying an ElementTree copy PI/comment siblings, too
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 14:47:36 2009
@@ -17,6 +17,9 @@
Bugs fixed
----------
+* Deep-copying an ElementTree did not copy its sibling PIs and
+ comments.
+
* Soupparser failed on broken attributes without values.
* Crash in XSLT when overwriting an already defined attribute using
Modified: lxml/trunk/src/lxml/apihelpers.pxi
==============================================================================
--- lxml/trunk/src/lxml/apihelpers.pxi (original)
+++ lxml/trunk/src/lxml/apihelpers.pxi Fri Feb 27 14:47:36 2009
@@ -871,7 +871,7 @@
c_target = c_tail
c_tail = c_next
-cdef void _copyTail(xmlNode* c_tail, xmlNode* c_target):
+cdef int _copyTail(xmlNode* c_tail, xmlNode* c_target) except -1:
cdef xmlNode* c_new_tail
# tail copying support: look for any text nodes trailing this node and
# copy it to the target node
@@ -881,9 +881,34 @@
c_new_tail = tree.xmlDocCopyNode(c_tail, c_target.doc, 0)
else:
c_new_tail = tree.xmlCopyNode(c_tail, 0)
+ if c_new_tail is NULL:
+ python.PyErr_NoMemory()
tree.xmlAddNextSibling(c_target, c_new_tail)
c_target = c_new_tail
c_tail = _textNodeOrSkip(c_tail.next)
+ return 0
+
+cdef int _copyNonElementSiblings(xmlNode* c_node, xmlNode* c_target) except -1:
+ cdef xmlNode* c_copy
+ cdef xmlNode* c_sibling = c_node
+ while c_sibling.prev != NULL and \
+ (c_sibling.prev.type == tree.XML_PI_NODE or \
+ c_sibling.prev.type == tree.XML_COMMENT_NODE):
+ c_sibling = c_sibling.prev
+ while c_sibling != c_node:
+ c_copy = tree.xmlDocCopyNode(c_sibling, c_target.doc, 1)
+ if c_copy is NULL:
+ python.PyErr_NoMemory()
+ tree.xmlAddPrevSibling(c_target, c_copy)
+ c_sibling = c_sibling.next
+ while c_sibling.next != NULL and \
+ (c_sibling.next.type == tree.XML_PI_NODE or \
+ c_sibling.next.type == tree.XML_COMMENT_NODE):
+ c_sibling = c_sibling.next
+ c_copy = tree.xmlDocCopyNode(c_sibling, c_target.doc, 1)
+ if c_copy is NULL:
+ python.PyErr_NoMemory()
+ tree.xmlAddNextSibling(c_target, c_copy)
cdef int _deleteSlice(_Document doc, xmlNode* c_node,
Py_ssize_t count, Py_ssize_t step) except -1:
Modified: lxml/trunk/src/lxml/lxml.etree.pyx
==============================================================================
--- lxml/trunk/src/lxml/lxml.etree.pyx (original)
+++ lxml/trunk/src/lxml/lxml.etree.pyx Fri Feb 27 14:47:36 2009
@@ -1571,9 +1571,21 @@
def __deepcopy__(self, memo):
cdef _Element root
+ cdef _Document doc
+ cdef xmlDoc* c_doc
if self._context_node is not None:
root = self._context_node.__copy__()
- return _elementTreeFactory(None, root)
+ _copyNonElementSiblings(self._context_node._c_node, root._c_node)
+ return _elementTreeFactory(None, root)
+ elif self._doc is not None:
+ c_doc = tree.xmlCopyDoc(self._doc._c_doc, 1)
+ if c_doc is NULL:
+ python.PyErr_NoMemory()
+ doc = _documentFactory(c_doc, self._doc._parser)
+ return _elementTreeFactory(doc, None)
+ else:
+ # so what ...
+ return self
# not in ElementTree, read-only
property docinfo:
Modified: lxml/trunk/src/lxml/tests/test_etree.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_etree.py (original)
+++ lxml/trunk/src/lxml/tests/test_etree.py Fri Feb 27 14:47:36 2009
@@ -233,6 +233,22 @@
self.assertEquals('ONE', a.text)
self.assertEquals('ANOTHER', b.text)
+ def test_deepcopy_elementtree_pi(self):
+ XML = self.etree.XML
+ tostring = self.etree.tostring
+ root = XML(_bytes("
"))
+ tree1 = self.etree.ElementTree(root)
+ self.assertEquals(_bytes("
"),
+ tostring(tree1))
+
+ tree2 = copy.deepcopy(tree1)
+ self.assertEquals(_bytes("
"),
+ tostring(tree2))
+
+ root2 = copy.deepcopy(tree1.getroot())
+ self.assertEquals(_bytes("
"),
+ tostring(root2))
+
def test_attribute_set(self):
# ElementTree accepts arbitrary attribute values
# lxml.etree allows only strings
From scoder at codespeak.net Fri Feb 27 15:08:29 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 15:08:29 +0100 (CET)
Subject: [Lxml-checkins] r62234 - in lxml/trunk: . src/lxml src/lxml/tests
Message-ID: <20090227140829.26DE91684A6@codespeak.net>
Author: scoder
Date: Fri Feb 27 15:08:27 2009
New Revision: 62234
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/src/lxml/lxml.etree.pyx
lxml/trunk/src/lxml/tests/test_etree.py
Log:
r5069 at delle: sbehnel | 2009-02-27 15:06:11 +0100
copy int/ext DTD subsets when deep-copying an ElementTree
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 15:08:27 2009
@@ -17,8 +17,8 @@
Bugs fixed
----------
-* Deep-copying an ElementTree did not copy its sibling PIs and
- comments.
+* Deep-copying an ElementTree copied neither its sibling PIs and
+ comments nor its internal/external DTD subsets.
* Soupparser failed on broken attributes without values.
Modified: lxml/trunk/src/lxml/lxml.etree.pyx
==============================================================================
--- lxml/trunk/src/lxml/lxml.etree.pyx (original)
+++ lxml/trunk/src/lxml/lxml.etree.pyx Fri Feb 27 15:08:27 2009
@@ -1576,6 +1576,16 @@
if self._context_node is not None:
root = self._context_node.__copy__()
_copyNonElementSiblings(self._context_node._c_node, root._c_node)
+ doc = root._doc
+ c_doc = self._context_node._doc._c_doc
+ if c_doc.intSubset and not doc._c_doc.intSubset:
+ doc._c_doc.intSubset = tree.xmlCopyDtd(c_doc.intSubset)
+ if doc._c_doc.intSubset is NULL:
+ python.PyErr_NoMemory()
+ if c_doc.extSubset and not doc._c_doc.extSubset:
+ doc._c_doc.extSubset = tree.xmlCopyDtd(c_doc.extSubset)
+ if doc._c_doc.extSubset is NULL:
+ python.PyErr_NoMemory()
return _elementTreeFactory(None, root)
elif self._doc is not None:
c_doc = tree.xmlCopyDoc(self._doc._c_doc, 1)
Modified: lxml/trunk/src/lxml/tests/test_etree.py
==============================================================================
--- lxml/trunk/src/lxml/tests/test_etree.py (original)
+++ lxml/trunk/src/lxml/tests/test_etree.py Fri Feb 27 15:08:27 2009
@@ -249,6 +249,21 @@
self.assertEquals(_bytes("
"),
tostring(root2))
+ def test_deepcopy_elementtree_dtd(self):
+ XML = self.etree.XML
+ tostring = self.etree.tostring
+ xml = _bytes('\n]>\n
')
+ root = XML(xml)
+ tree1 = self.etree.ElementTree(root)
+ self.assertEquals(xml, tostring(tree1))
+
+ tree2 = copy.deepcopy(tree1)
+ self.assertEquals(xml, tostring(tree2))
+
+ root2 = copy.deepcopy(tree1.getroot())
+ self.assertEquals(_bytes("
"),
+ tostring(root2))
+
def test_attribute_set(self):
# ElementTree accepts arbitrary attribute values
# lxml.etree allows only strings
From scoder at codespeak.net Fri Feb 27 15:41:02 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 15:41:02 +0100 (CET)
Subject: [Lxml-checkins] r62235 - in lxml/trunk: . src/lxml/html
Message-ID: <20090227144102.8E5BD168519@codespeak.net>
Author: scoder
Date: Fri Feb 27 15:41:02 2009
New Revision: 62235
Modified:
lxml/trunk/ (props changed)
lxml/trunk/src/lxml/html/soupparser.py
Log:
r5071 at delle: sbehnel | 2009-02-27 15:29:21 +0100
simplification
Modified: lxml/trunk/src/lxml/html/soupparser.py
==============================================================================
--- lxml/trunk/src/lxml/html/soupparser.py (original)
+++ lxml/trunk/src/lxml/html/soupparser.py Fri Feb 27 15:41:02 2009
@@ -109,14 +109,10 @@
import re
handle_entities = re.compile("&(\w+);").sub
-try:
- empty_string = unicode()
-except NameError:
- empty_string = str()
def unescape(string):
if not string:
- return empty_string
+ return ''
# work around oddities in BeautifulSoup's entity handling
def unescape_entity(m):
try:
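The hunk above shows only the start of ``unescape()``. A self-contained sketch of how the empty-value guard fits into such a function is given below. The entity lookup through ``html.entities.name2codepoint`` and the Python 3 spelling are assumptions; the rest of the real function body is not part of this diff.

```python
import re
from html.entities import name2codepoint

handle_entities = re.compile(r"&(\w+);").sub


def unescape(string):
    # Attributes without a value arrive as None or '': return early.
    if not string:
        return ''

    def unescape_entity(m):
        try:
            return chr(name2codepoint[m.group(1)])
        except KeyError:
            return m.group(0)  # unknown entity: keep it as is

    return handle_entities(unescape_entity, string)
```

Without the guard, a ``None`` value would reach the regular expression substitution and raise a ``TypeError``, which matches the failure described in the log message.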
From scoder at codespeak.net Fri Feb 27 16:22:49 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 16:22:49 +0100 (CET)
Subject: [Lxml-checkins] r62239 - in lxml/trunk: . doc
Message-ID: <20090227152249.4CDC316852D@codespeak.net>
Author: scoder
Date: Fri Feb 27 16:22:48 2009
New Revision: 62239
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/doc/main.txt
lxml/trunk/version.txt
Log:
r5073 at delle: sbehnel | 2009-02-27 15:44:43 +0100
prepare release of 2.2beta4
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Feb 27 16:22:48 2009
@@ -2,8 +2,8 @@
lxml changelog
==============
-Under development
-=================
+2.2beta4 (2009-02-27)
+=====================
Features added
--------------
Modified: lxml/trunk/doc/main.txt
==============================================================================
--- lxml/trunk/doc/main.txt (original)
+++ lxml/trunk/doc/main.txt Fri Feb 27 16:22:48 2009
@@ -147,8 +147,8 @@
source release. If you can't wait, consider trying a less recent
release version first.
-The latest version is `lxml 2.2beta3`_, released 2009-02-17
-(`changes for 2.2beta3`_). `Older versions`_ are listed below.
+The latest version is `lxml 2.2beta4`_, released 2009-02-27
+(`changes for 2.2beta4`_). `Older versions`_ are listed below.
Please take a look at the `installation instructions`_!
@@ -220,7 +220,9 @@
`2.0
`_ and the `current
in-development version
`_.
-.. _`PDF documentation`: lxmldoc-2.2beta3.pdf
+.. _`PDF documentation`: lxmldoc-2.2beta4.pdf
+
+* `lxml 2.2beta3`_, released 2009-02-17 (`changes for 2.2beta3`_)
* `lxml 2.2beta2`_, released 2009-01-25 (`changes for 2.2beta2`_)
@@ -314,6 +316,7 @@
* `lxml 0.5`_, released 2005-04-08
+.. _`lxml 2.2beta4`: lxml-2.2beta4.tgz
.. _`lxml 2.2beta3`: lxml-2.2beta3.tgz
.. _`lxml 2.2beta2`: lxml-2.2beta2.tgz
.. _`lxml 2.2beta1`: lxml-2.2beta1.tgz
@@ -361,6 +364,7 @@
.. _`lxml 0.5.1`: lxml-0.5.1.tgz
.. _`lxml 0.5`: lxml-0.5.tgz
+.. _`changes for 2.2beta4`: changes-2.2beta4.html
.. _`changes for 2.2beta3`: changes-2.2beta3.html
.. _`changes for 2.2beta2`: changes-2.2beta2.html
.. _`changes for 2.2beta1`: changes-2.2beta1.html
Modified: lxml/trunk/version.txt
==============================================================================
--- lxml/trunk/version.txt (original)
+++ lxml/trunk/version.txt Fri Feb 27 16:22:48 2009
@@ -1 +1 @@
-2.2beta3
+2.2beta4
From scoder at codespeak.net Fri Feb 27 16:22:53 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 16:22:53 +0100 (CET)
Subject: [Lxml-checkins] r62240 - in lxml/trunk: . doc
Message-ID: <20090227152253.9600616852E@codespeak.net>
Author: scoder
Date: Fri Feb 27 16:22:53 2009
New Revision: 62240
Modified:
lxml/trunk/ (props changed)
lxml/trunk/doc/mklatex.py
Log:
r5074 at delle: sbehnel | 2009-02-27 16:19:53 +0100
fix PDF docs
Modified: lxml/trunk/doc/mklatex.py
==============================================================================
--- lxml/trunk/doc/mklatex.py (original)
+++ lxml/trunk/doc/mklatex.py Fri Feb 27 16:22:53 2009
@@ -90,7 +90,8 @@
break
if line.startswith('%') or \
r'\documentclass' in line or \
- r'\makeindex' in line:
+ r'\makeindex' in line or \
+ r'{inputenc}' in line:
continue
if line.startswith(r'\usepackage'):
if line in existing_header_lines:
@@ -270,7 +271,7 @@
if hln.startswith(r"\documentclass"):
#hln = hln.replace('article', 'book')
hln = DOCUMENT_CLASS
- elif hln.startswith("%% generator "):
+ elif hln.startswith("%% generator ") or hln.startswith("% generated "):
master.write(EPYDOC_IMPORT)
elif hln.startswith(r"\begin{document}"):
# pygments and epydoc support
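The first hunk above extends the set of LaTeX header lines that are skipped when headers are merged. The condition can be sketched as a standalone predicate. The function name ``keep_header_line`` is hypothetical and the surrounding loop in ``mklatex.py`` is not reproduced here.

```python
# Markers taken from the condition in the diff above.
SKIPPED_MARKERS = (r'\documentclass', r'\makeindex', r'{inputenc}')


def keep_header_line(line):
    """Return False for header lines that the merge step skips.

    Hypothetical helper mirroring the condition in the diff: comment
    lines and lines containing one of the markers are dropped.
    """
    if line.startswith('%'):
        return False
    return not any(marker in line for marker in SKIPPED_MARKERS)
```

Skipping ``{inputenc}`` lines presumably avoids loading the input encoding package twice when several generated headers are combined; that reading is an inference from the log message "fix PDF docs", not something the commit states.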
From scoder at codespeak.net Fri Feb 27 16:22:58 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Fri, 27 Feb 2009 16:22:58 +0100 (CET)
Subject: [Lxml-checkins] r62241 - in lxml/trunk: . doc
Message-ID: <20090227152258.221CA16852F@codespeak.net>
Author: scoder
Date: Fri Feb 27 16:22:57 2009
New Revision: 62241
Modified:
lxml/trunk/ (props changed)
lxml/trunk/INSTALL.txt
lxml/trunk/doc/build.txt
Log:
r5075 at delle: sbehnel | 2009-02-27 16:20:41 +0100
fix dependency versions
Modified: lxml/trunk/INSTALL.txt
==============================================================================
--- lxml/trunk/INSTALL.txt (original)
+++ lxml/trunk/INSTALL.txt Fri Feb 27 16:22:57 2009
@@ -12,7 +12,7 @@
http://xmlsoft.org/downloads.html
If you want to use XPath, do not use libxml2 2.6.27. We recommend
- libxml2 2.6.28 or later.
+ libxml2 2.7.2 or later.
* libxslt 1.1.15 or later. It can be found here:
http://xmlsoft.org/XSLT/downloads.html
Modified: lxml/trunk/doc/build.txt
==============================================================================
--- lxml/trunk/doc/build.txt (original)
+++ lxml/trunk/doc/build.txt Fri Feb 27 16:22:57 2009
@@ -45,9 +45,9 @@
want to be an lxml developer, then you do need a working Cython
installation. You can use EasyInstall_ to install it::
- easy_install Cython==0.10.3
+ easy_install Cython==0.11
-lxml currently requires Cython 0.10.3, later release versions should
+lxml currently requires Cython 0.11, later release versions should
work as well.
From scoder at codespeak.net Sat Feb 28 23:04:00 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 28 Feb 2009 23:04:00 +0100 (CET)
Subject: [Lxml-checkins] r62278 - lxml/trunk
Message-ID: <20090228220400.06062168458@codespeak.net>
Author: scoder
Date: Sat Feb 28 23:03:59 2009
New Revision: 62278
Modified:
lxml/trunk/ (props changed)
lxml/trunk/ez_setup.py
Log:
r5079 at delle: sbehnel | 2009-02-28 23:01:11 +0100
updated setuptools install script
Modified: lxml/trunk/ez_setup.py
==============================================================================
--- lxml/trunk/ez_setup.py (original)
+++ lxml/trunk/ez_setup.py Sat Feb 28 23:03:59 2009
@@ -14,8 +14,8 @@
This file can also be run as a script to install or upgrade setuptools.
"""
import sys
-DEFAULT_VERSION = "0.6c3"
-DEFAULT_URL = "http://cheeseshop.python.org/packages/%s/s/setuptools/" % sys.version[:3]
+DEFAULT_VERSION = "0.6c9"
+DEFAULT_URL = "http://pypi.python.org/packages/%s/s/setuptools/" % sys.version[:3]
md5_data = {
'setuptools-0.6b1-py2.3.egg': '8822caf901250d848b996b7f25c6e6ca',
@@ -33,13 +33,33 @@
'setuptools-0.6c3-py2.3.egg': 'f181fa125dfe85a259c9cd6f1d7b78fa',
'setuptools-0.6c3-py2.4.egg': 'e0ed74682c998bfb73bf803a50e7b71e',
'setuptools-0.6c3-py2.5.egg': 'abef16fdd61955514841c7c6bd98965e',
+ 'setuptools-0.6c4-py2.3.egg': 'b0b9131acab32022bfac7f44c5d7971f',
+ 'setuptools-0.6c4-py2.4.egg': '2a1f9656d4fbf3c97bf946c0a124e6e2',
+ 'setuptools-0.6c4-py2.5.egg': '8f5a052e32cdb9c72bcf4b5526f28afc',
+ 'setuptools-0.6c5-py2.3.egg': 'ee9fd80965da04f2f3e6b3576e9d8167',
+ 'setuptools-0.6c5-py2.4.egg': 'afe2adf1c01701ee841761f5bcd8aa64',
+ 'setuptools-0.6c5-py2.5.egg': 'a8d3f61494ccaa8714dfed37bccd3d5d',
+ 'setuptools-0.6c6-py2.3.egg': '35686b78116a668847237b69d549ec20',
+ 'setuptools-0.6c6-py2.4.egg': '3c56af57be3225019260a644430065ab',
+ 'setuptools-0.6c6-py2.5.egg': 'b2f8a7520709a5b34f80946de5f02f53',
+ 'setuptools-0.6c7-py2.3.egg': '209fdf9adc3a615e5115b725658e13e2',
+ 'setuptools-0.6c7-py2.4.egg': '5a8f954807d46a0fb67cf1f26c55a82e',
+ 'setuptools-0.6c7-py2.5.egg': '45d2ad28f9750e7434111fde831e8372',
+ 'setuptools-0.6c8-py2.3.egg': '50759d29b349db8cfd807ba8303f1902',
+ 'setuptools-0.6c8-py2.4.egg': 'cba38d74f7d483c06e9daa6070cce6de',
+ 'setuptools-0.6c8-py2.5.egg': '1721747ee329dc150590a58b3e1ac95b',
+ 'setuptools-0.6c9-py2.3.egg': 'a83c4020414807b496e4cfbe08507c03',
+ 'setuptools-0.6c9-py2.4.egg': '260a2be2e5388d66bdaee06abec6342a',
+ 'setuptools-0.6c9-py2.5.egg': 'fe67c3e5a17b12c0e7c541b7ea43a8e6',
+ 'setuptools-0.6c9-py2.6.egg': 'ca37b1ff16fa2ede6e19383e7b59245a',
}
import sys, os
+try: from hashlib import md5
+except ImportError: from md5 import md5
def _validate_md5(egg_name, data):
if egg_name in md5_data:
- from md5 import md5
digest = md5(data).hexdigest()
if digest != md5_data[egg_name]:
print >>sys.stderr, (
@@ -49,7 +69,6 @@
sys.exit(2)
return data
-
def use_setuptools(
version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir,
download_delay=15
@@ -65,31 +84,31 @@
this routine will print a message to ``sys.stderr`` and raise SystemExit in
an attempt to abort the calling script.
"""
- try:
- import setuptools
- if setuptools.__version__ == '0.0.1':
- print >>sys.stderr, (
- "You have an obsolete version of setuptools installed. Please\n"
- "remove it from your system entirely before rerunning this script."
- )
- sys.exit(2)
- except ImportError:
+ was_imported = 'pkg_resources' in sys.modules or 'setuptools' in sys.modules
+ def do_download():
egg = download_setuptools(version, download_base, to_dir, download_delay)
sys.path.insert(0, egg)
import setuptools; setuptools.bootstrap_install_from = egg
-
- import pkg_resources
try:
- pkg_resources.require("setuptools>="+version)
-
+ import pkg_resources
+ except ImportError:
+ return do_download()
+ try:
+ pkg_resources.require("setuptools>="+version); return
except pkg_resources.VersionConflict, e:
- # XXX could we install in a subprocess here?
- print >>sys.stderr, (
+ if was_imported:
+ print >>sys.stderr, (
"The required version of setuptools (>=%s) is not available, and\n"
"can't be installed while this script is running. Please install\n"
- " a more recent version first.\n\n(Currently using %r)"
- ) % (version, e.args[0])
- sys.exit(2)
+ " a more recent version first, using 'easy_install -U setuptools'."
+ "\n\n(Currently using %r)"
+ ) % (version, e.args[0])
+ sys.exit(2)
+ else:
+ del pkg_resources, sys.modules['pkg_resources'] # reload ok
+ return do_download()
+ except pkg_resources.DistributionNotFound:
+ return do_download()
def download_setuptools(
version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir,
@@ -138,9 +157,43 @@
if dst: dst.close()
return os.path.realpath(saveto)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
def main(argv, version=DEFAULT_VERSION):
"""Install or upgrade setuptools and EasyInstall"""
-
try:
import setuptools
except ImportError:
@@ -155,8 +208,11 @@
os.unlink(egg)
else:
if setuptools.__version__ == '0.0.1':
- # tell the user to uninstall obsolete version
- use_setuptools(version)
+ print >>sys.stderr, (
+ "You have an obsolete version of setuptools installed. Please\n"
+ "remove it from your system entirely before rerunning this script."
+ )
+ sys.exit(2)
req = "setuptools>="+version
import pkg_resources
@@ -177,13 +233,10 @@
print "Setuptools version",version,"or greater has been installed."
print '(Run "ez_setup.py -U setuptools" to reinstall or upgrade.)'
-
-
def update_md5(filenames):
"""Update our built-in md5 registry"""
import re
- from md5 import md5
for name in filenames:
base = os.path.basename(name)
@@ -220,3 +273,4 @@
+
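
Editorial sketch (not part of the commit): the checksum validation touched by this change, rewritten for current Python, where `hashlib` is always available and the `md5` module fallback is no longer needed. The registry entry and egg name below are made up for illustration, and the sketch raises `ValueError` where ez_setup.py prints to stderr and exits.

```python
import hashlib


def validate_md5(egg_name, data, registry):
    """Return data unchanged if its MD5 digest matches the registry entry.

    Names missing from the registry pass through unchecked, as in the
    script above. A mismatch raises ValueError (an assumption of this
    sketch; the original calls sys.exit(2)).
    """
    expected = registry.get(egg_name)
    if expected is not None:
        digest = hashlib.md5(data).hexdigest()
        if digest != expected:
            raise ValueError("md5 validation of %s failed" % egg_name)
    return data


# Hypothetical registry built at runtime so no digest is hard-coded.
payload = b"example egg contents"
registry = {"example-0.1-py3.egg": hashlib.md5(payload).hexdigest()}

checked = validate_md5("example-0.1-py3.egg", payload, registry)
```

MD5 is used here only because the original script does. It detects accidental corruption, not deliberate tampering.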
From scoder at codespeak.net Sat Feb 28 23:04:05 2009
From: scoder at codespeak.net (scoder at codespeak.net)
Date: Sat, 28 Feb 2009 23:04:05 +0100 (CET)
Subject: [Lxml-checkins] r62279 - in lxml/trunk: . doc
Message-ID: <20090228220405.76FB5168469@codespeak.net>
Author: scoder
Date: Sat Feb 28 23:04:04 2009
New Revision: 62279
Modified:
lxml/trunk/ (props changed)
lxml/trunk/doc/performance.txt
Log:
r5080 at delle: sbehnel | 2009-02-28 23:01:43 +0100
updated performance docs to lxml 2.2 and Python 2.6
Modified: lxml/trunk/doc/performance.txt
==============================================================================
--- lxml/trunk/doc/performance.txt (original)
+++ lxml/trunk/doc/performance.txt Sat Feb 28 23:04:04 2009
@@ -86,12 +86,15 @@
a specific part of the API yourself, please consider sending it to the lxml
mailing list.
-The timings cited below compare lxml 2.1 (with libxml2 2.6.33) to the
-April 2008 SVN trunk versions of ElementTree (1.3alpha) and
-cElementTree (1.2.7). They were run single-threaded on a 1.8GHz Intel
-Core Duo machine under Ubuntu Linux 7.10 (Gutsy). The C libraries
-were compiled with the same platform specific optimisation flags. The
-Python interpreter (2.5.1) was used as provided by the distribution.
+The timings cited below compare lxml 2.2 (with libxml2 2.7.3) to the
+February 2009 SVN versions of ElementTree (1.3alpha2) and cElementTree
+(1.0.6). They were run single-threaded on a 1.8GHz Intel Core Duo
+machine under Ubuntu Linux 8.10 (Intrepid). The C libraries were
+compiled with the same platform specific optimisation flags. The
+Python interpreter (2.6.1) was manually compiled for the platform.
+Note that many of the following ElementTree timings are therefore
+better than what a normal Python installation with the standard
+library (c)ElementTree modules would yield.
.. _`bench_etree.py`: http://codespeak.net/svn/lxml/trunk/benchmark/bench_etree.py
.. _`bench_xpath.py`: http://codespeak.net/svn/lxml/trunk/benchmark/bench_xpath.py
@@ -129,107 +132,108 @@
executes entirely at the C level, without any interaction with Python
code. The results are rather impressive, especially for UTF-8, which
is native to libxml2. While 20 to 40 times faster than (c)ElementTree
-1.2 (which is part of the standard library in Python 2.5), lxml is
+1.2 (which is part of the standard library since Python 2.5), lxml is
still more than 7 times as fast as the much improved ElementTree 1.3::
- lxe: tostring_utf16 (SATR T1) 25.7590 msec/pass
- cET: tostring_utf16 (SATR T1) 179.6291 msec/pass
- ET : tostring_utf16 (SATR T1) 188.5638 msec/pass
-
- lxe: tostring_utf16 (UATR T1) 26.0060 msec/pass
- cET: tostring_utf16 (UATR T1) 176.9981 msec/pass
- ET : tostring_utf16 (UATR T1) 188.2110 msec/pass
-
- lxe: tostring_utf16 (S-TR T2) 26.9201 msec/pass
- cET: tostring_utf16 (S-TR T2) 182.5061 msec/pass
- ET : tostring_utf16 (S-TR T2) 190.2061 msec/pass
-
- lxe: tostring_utf8 (S-TR T2) 19.5830 msec/pass
- cET: tostring_utf8 (S-TR T2) 183.0020 msec/pass
- ET : tostring_utf8 (S-TR T2) 187.7251 msec/pass
-
- lxe: tostring_utf8 (U-TR T3) 5.5292 msec/pass
- cET: tostring_utf8 (U-TR T3) 56.1349 msec/pass
- ET : tostring_utf8 (U-TR T3) 56.6628 msec/pass
+ lxe: tostring_utf16 (SATR T1) 22.4042 msec/pass
+ cET: tostring_utf16 (SATR T1) 184.5090 msec/pass
+ ET : tostring_utf16 (SATR T1) 182.4350 msec/pass
+
+ lxe: tostring_utf16 (UATR T1) 23.1769 msec/pass
+ cET: tostring_utf16 (UATR T1) 188.6780 msec/pass
+ ET : tostring_utf16 (UATR T1) 186.7781 msec/pass
+
+ lxe: tostring_utf16 (S-TR T2) 21.8501 msec/pass
+ cET: tostring_utf16 (S-TR T2) 200.0139 msec/pass
+ ET : tostring_utf16 (S-TR T2) 190.8720 msec/pass
+
+ lxe: tostring_utf8 (S-TR T2) 17.1690 msec/pass
+ cET: tostring_utf8 (S-TR T2) 192.3709 msec/pass
+ ET : tostring_utf8 (S-TR T2) 189.7140 msec/pass
+
+ lxe: tostring_utf8 (U-TR T3) 4.9832 msec/pass
+ cET: tostring_utf8 (U-TR T3) 60.2911 msec/pass
+ ET : tostring_utf8 (U-TR T3) 57.8101 msec/pass
The same applies to plain text serialisation. Note that cElementTree
-does not currently support this, as it is new in ET 1.3::
+does not currently support this, as it is a new feature in ET 1.3 and
+lxml.etree 2.0::
- lxe: tostring_text_ascii (S-TR T1) 3.8729 msec/pass
- ET : tostring_text_ascii (S-TR T1) 90.7841 msec/pass
+ lxe: tostring_text_ascii (S-TR T1) 4.3709 msec/pass
+ ET : tostring_text_ascii (S-TR T1) 83.9939 msec/pass
- lxe: tostring_text_ascii (S-TR T3) 1.1508 msec/pass
- ET : tostring_text_ascii (S-TR T3) 28.0581 msec/pass
+ lxe: tostring_text_ascii (S-TR T3) 1.3590 msec/pass
+ ET : tostring_text_ascii (S-TR T3) 26.6340 msec/pass
- lxe: tostring_text_utf16 (S-TR T1) 5.6219 msec/pass
- ET : tostring_text_utf16 (S-TR T1) 87.4891 msec/pass
+ lxe: tostring_text_utf16 (S-TR T1) 6.2978 msec/pass
+ ET : tostring_text_utf16 (S-TR T1) 84.7399 msec/pass
- lxe: tostring_text_utf16 (U-TR T1) 7.0660 msec/pass
- ET : tostring_text_utf16 (U-TR T1) 82.1049 msec/pass
+ lxe: tostring_text_utf16 (U-TR T1) 7.7510 msec/pass
+ ET : tostring_text_utf16 (U-TR T1) 79.9279 msec/pass
Unlike ElementTree, the ``tostring()`` function in lxml also supports
serialisation to a Python unicode string object::
- lxe: tostring_text_unicode (S-TR T1) 4.2419 msec/pass
- lxe: tostring_text_unicode (U-TR T1) 5.2760 msec/pass
- lxe: tostring_text_unicode (S-TR T3) 1.3049 msec/pass
- lxe: tostring_text_unicode (U-TR T3) 1.4210 msec/pass
+ lxe: tostring_text_unicode (S-TR T1) 4.6940 msec/pass
+ lxe: tostring_text_unicode (U-TR T1) 6.3069 msec/pass
+ lxe: tostring_text_unicode (S-TR T3) 1.3652 msec/pass
+ lxe: tostring_text_unicode (U-TR T3) 2.0702 msec/pass
For parsing, on the other hand, the advantage is clearly with
cElementTree. The (c)ET libraries use a very thin layer on top of the
expat parser, which is known to be extremely fast::
- lxe: parse_stringIO (SAXR T1) 40.6771 msec/pass
- cET: parse_stringIO (SAXR T1) 19.3741 msec/pass
- ET : parse_stringIO (SAXR T1) 355.7711 msec/pass
-
- lxe: parse_stringIO (S-XR T3) 5.9960 msec/pass
- cET: parse_stringIO (S-XR T3) 5.8751 msec/pass
- ET : parse_stringIO (S-XR T3) 93.7259 msec/pass
-
- lxe: parse_stringIO (UAXR T3) 26.2671 msec/pass
- cET: parse_stringIO (UAXR T3) 30.6449 msec/pass
- ET : parse_stringIO (UAXR T3) 178.8890 msec/pass
+ lxe: parse_stringIO (SAXR T1) 50.0100 msec/pass
+ cET: parse_stringIO (SAXR T1) 19.3238 msec/pass
+ ET : parse_stringIO (SAXR T1) 318.2330 msec/pass
+
+ lxe: parse_stringIO (S-XR T3) 6.1851 msec/pass
+ cET: parse_stringIO (S-XR T3) 5.7080 msec/pass
+ ET : parse_stringIO (S-XR T3) 83.5931 msec/pass
+
+ lxe: parse_stringIO (UAXR T3) 34.4319 msec/pass
+ cET: parse_stringIO (UAXR T3) 28.8520 msec/pass
+ ET : parse_stringIO (UAXR T3) 164.5968 msec/pass
While about as fast for smaller documents, the expat parser allows cET
to be up to 2 times faster than lxml on plain parser performance for
large input documents. Similar timings can be observed for the
``iterparse()`` function::
- lxe: iterparse_stringIO (SAXR T1) 50.8120 msec/pass
- cET: iterparse_stringIO (SAXR T1) 24.9379 msec/pass
- ET : iterparse_stringIO (SAXR T1) 388.9420 msec/pass
-
- lxe: iterparse_stringIO (UAXR T3) 29.0790 msec/pass
- cET: iterparse_stringIO (UAXR T3) 32.1240 msec/pass
- ET : iterparse_stringIO (UAXR T3) 189.1720 msec/pass
+ lxe: iterparse_stringIO (SAXR T1) 57.8308 msec/pass
+ cET: iterparse_stringIO (SAXR T1) 23.8140 msec/pass
+ ET : iterparse_stringIO (SAXR T1) 349.5209 msec/pass
+
+ lxe: iterparse_stringIO (UAXR T3) 37.2162 msec/pass
+ cET: iterparse_stringIO (UAXR T3) 30.2329 msec/pass
+ ET : iterparse_stringIO (UAXR T3) 171.4060 msec/pass
However, if you benchmark the complete round-trip of a serialise-parse
cycle, the numbers will look similar to these::
- lxe: write_utf8_parse_stringIO (S-TR T1) 63.7550 msec/pass
- cET: write_utf8_parse_stringIO (S-TR T1) 292.0721 msec/pass
- ET : write_utf8_parse_stringIO (S-TR T1) 635.2799 msec/pass
-
- lxe: write_utf8_parse_stringIO (UATR T2) 75.0258 msec/pass
- cET: write_utf8_parse_stringIO (UATR T2) 341.7251 msec/pass
- ET : write_utf8_parse_stringIO (UATR T2) 713.1951 msec/pass
-
- lxe: write_utf8_parse_stringIO (S-TR T3) 11.4899 msec/pass
- cET: write_utf8_parse_stringIO (S-TR T3) 96.8502 msec/pass
- ET : write_utf8_parse_stringIO (S-TR T3) 185.6079 msec/pass
-
- lxe: write_utf8_parse_stringIO (SATR T4) 1.2081 msec/pass
- cET: write_utf8_parse_stringIO (SATR T4) 6.8581 msec/pass
- ET : write_utf8_parse_stringIO (SATR T4) 10.6261 msec/pass
+ lxe: write_utf8_parse_stringIO (S-TR T1) 60.2388 msec/pass
+ cET: write_utf8_parse_stringIO (S-TR T1) 314.9750 msec/pass
+ ET : write_utf8_parse_stringIO (S-TR T1) 616.4260 msec/pass
+
+ lxe: write_utf8_parse_stringIO (UATR T2) 71.7540 msec/pass
+ cET: write_utf8_parse_stringIO (UATR T2) 364.4099 msec/pass
+ ET : write_utf8_parse_stringIO (UATR T2) 684.5109 msec/pass
+
+ lxe: write_utf8_parse_stringIO (S-TR T3) 10.7441 msec/pass
+ cET: write_utf8_parse_stringIO (S-TR T3) 103.3869 msec/pass
+ ET : write_utf8_parse_stringIO (S-TR T3) 179.5921 msec/pass
+
+ lxe: write_utf8_parse_stringIO (SATR T4) 1.1981 msec/pass
+ cET: write_utf8_parse_stringIO (SATR T4) 7.0901 msec/pass
+ ET : write_utf8_parse_stringIO (SATR T4) 10.4899 msec/pass
For applications that require a high parser throughput of large files,
and that do little to no serialization, cET is the best choice. Also
-for iterparse applications that extract small amounts of data from
-large XML data sets that do not fit into the memory. If it comes to
-round-trip performance, however, lxml tends to be multiple times
-faster in total. So, whenever the input documents are not
-considerably larger than the output, lxml is the clear winner.
+for iterparse applications that extract small amounts of data or
+aggregate information from large XML data sets that do not fit into
+memory. If it comes to round-trip performance, however, lxml tends to
+be multiple times faster in total. So, whenever the input documents
+are not considerably larger than the output, lxml is the clear winner.
Regarding HTML parsing, Ian Bicking has done some `benchmarking on
lxml's HTML parser`_, comparing it to a number of other famous HTML
@@ -241,6 +245,13 @@
.. _`benchmarking on lxml's HTML parser`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
+Liza Daly has written an article that presents a couple of tweaks to
+get the most out of lxml's parser for very large XML documents. She
+quite favourably positions ``lxml.etree`` as a tool for
+`high-performance XML parsing`_.
+
+.. _`high-performance XML parsing`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
+
The ElementTree API
===================
@@ -253,27 +264,27 @@
(given in seconds)::
lxe: -- S- U- -A SA UA
- T1: 0.0437 0.0498 0.0516 0.0430 0.0498 0.0519
- T2: 0.0550 0.0643 0.0677 0.0612 0.0685 0.0721
- T3: 0.0168 0.0142 0.0159 0.0338 0.0350 0.0359
- T4: 0.0003 0.0002 0.0003 0.0007 0.0007 0.0007
+ T1: 0.0502 0.0572 0.0613 0.0494 0.0575 0.0615
+ T2: 0.0602 0.0691 0.0747 0.0651 0.0745 0.0796
+ T3: 0.0145 0.0157 0.0176 0.0392 0.0411 0.0415
+ T4: 0.0003 0.0003 0.0003 0.0008 0.0008 0.0008
cET: -- S- U- -A SA UA
- T1: 0.0093 0.0093 0.0093 0.0097 0.0094 0.0094
- T2: 0.0153 0.0155 0.0152 0.0157 0.0154 0.0154
- T3: 0.0076 0.0076 0.0076 0.0099 0.0122 0.0100
+ T1: 0.0092 0.0094 0.0094 0.0094 0.0096 0.0093
+ T2: 0.0152 0.0151 0.0152 0.0156 0.0154 0.0154
+ T3: 0.0079 0.0080 0.0079 0.0106 0.0107 0.0134
T4: 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
ET : -- S- U- -A SA UA
- T1: 0.1074 0.1669 0.1050 0.2054 0.2401 0.1047
- T2: 0.2920 0.1172 0.3393 0.3830 0.1184 0.4215
- T3: 0.0347 0.0331 0.0316 0.0368 0.3944 0.0377
- T4: 0.0006 0.0005 0.0007 0.0006 0.0007 0.0006
-
-
-While lxml is still faster than ET in most cases (10-70%), cET can be up to
-five times faster than lxml here. One of the reasons is that lxml must
-additionally discard the created Python elements after their use, when they
-are no longer referenced. ET and cET represent the tree itself through these
-objects, which reduces the overhead in creating them.
+ T1: 0.1017 0.1715 0.1962 0.1080 0.2470 0.1049
+ T2: 0.3130 0.3324 0.1130 0.3897 0.1158 0.4246
+ T3: 0.0341 0.0323 0.0338 0.0358 0.3965 0.0359
+ T4: 0.0006 0.0005 0.0006 0.0006 0.0007 0.0006
+
+While lxml is still a lot faster than ET in most cases, cET can be up
+to five times faster than lxml here. One of the reasons is that lxml
+must additionally discard the created Python elements after their use,
+when they are no longer referenced. ET and cET represent the tree
+itself through these objects, which reduces the overhead in creating
+them.
Child access
@@ -284,36 +295,36 @@
create a shallow copy of their list of children, lxml has to create a
Python object for each child and collect them in a list::
- lxe: root_list_children (--TR T1) 0.0160 msec/pass
- cET: root_list_children (--TR T1) 0.0081 msec/pass
- ET : root_list_children (--TR T1) 0.0541 msec/pass
-
- lxe: root_list_children (--TR T2) 0.2100 msec/pass
- cET: root_list_children (--TR T2) 0.0319 msec/pass
- ET : root_list_children (--TR T2) 0.4420 msec/pass
+ lxe: root_list_children (--TR T1) 0.0148 msec/pass
+ cET: root_list_children (--TR T1) 0.0050 msec/pass
+ ET : root_list_children (--TR T1) 0.0219 msec/pass
+
+ lxe: root_list_children (--TR T2) 0.1719 msec/pass
+ cET: root_list_children (--TR T2) 0.0260 msec/pass
+ ET : root_list_children (--TR T2) 0.3390 msec/pass
This handicap is also visible when accessing single children::
- lxe: first_child (--TR T2) 0.2341 msec/pass
- cET: first_child (--TR T2) 0.2198 msec/pass
- ET : first_child (--TR T2) 0.8960 msec/pass
-
- lxe: last_child (--TR T1 ) 0.2549 msec/pass
- cET: last_child (--TR T1 ) 0.2251 msec/pass
- ET : last_child (--TR T1 ) 0.8969 msec/pass
+ lxe: first_child (--TR T2) 0.1879 msec/pass
+ cET: first_child (--TR T2) 0.1760 msec/pass
+ ET : first_child (--TR T2) 0.8099 msec/pass
+
+ lxe: last_child (--TR T1) 0.1910 msec/pass
+ cET: last_child (--TR T1) 0.1872 msec/pass
+ ET : last_child (--TR T1) 0.8099 msec/pass
... unless you also add the time to find a child index in a bigger
list. ET and cET use Python lists here, which are based on arrays.
The data structure used by libxml2 is a linked tree, and thus, a
linked list of children::
- lxe: middle_child (--TR T1) 0.2699 msec/pass
- cET: middle_child (--TR T1) 0.2089 msec/pass
- ET : middle_child (--TR T1) 0.8910 msec/pass
-
- lxe: middle_child (--TR T2) 1.9410 msec/pass
- cET: middle_child (--TR T2) 0.2151 msec/pass
- ET : middle_child (--TR T2) 0.8960 msec/pass
+ lxe: middle_child (--TR T1) 0.2189 msec/pass
+ cET: middle_child (--TR T1) 0.1779 msec/pass
+ ET : middle_child (--TR T1) 0.8030 msec/pass
+
+ lxe: middle_child (--TR T2) 2.4071 msec/pass
+ cET: middle_child (--TR T2) 0.1781 msec/pass
+ ET : middle_child (--TR T2) 0.8039 msec/pass
Element creation
@@ -323,21 +334,21 @@
in. This results in a major performance difference for creating independent
Elements that end up in independently created documents::
- lxe: create_elements (--TC T2) 1.7340 msec/pass
- cET: create_elements (--TC T2) 0.1929 msec/pass
- ET : create_elements (--TC T2) 1.3809 msec/pass
+ lxe: create_elements (--TC T2) 2.1949 msec/pass
+ cET: create_elements (--TC T2) 0.1941 msec/pass
+ ET : create_elements (--TC T2) 1.2760 msec/pass
Therefore, it is always preferable to create Elements for the document they
are supposed to end up in, either as SubElements of an Element or using the
explicit ``Element.makeelement()`` call::
- lxe: makeelement (--TC T2) 1.6100 msec/pass
- cET: makeelement (--TC T2) 0.3171 msec/pass
- ET : makeelement (--TC T2) 1.6270 msec/pass
+ lxe: makeelement (--TC T2) 1.8370 msec/pass
+ cET: makeelement (--TC T2) 0.3200 msec/pass
+ ET : makeelement (--TC T2) 1.5380 msec/pass
- lxe: create_subelements (--TC T2) 1.3542 msec/pass
+ lxe: create_subelements (--TC T2) 1.6761 msec/pass
cET: create_subelements (--TC T2) 0.2329 msec/pass
- ET : create_subelements (--TC T2) 3.3019 msec/pass
+ ET : create_subelements (--TC T2) 3.0999 msec/pass
So, if the main performance bottleneck of an application is creating large XML
trees in memory through calls to Element and SubElement, cET is the best
@@ -354,13 +365,13 @@
The following benchmark appends all root children of the second tree to the
root of the first tree::
- lxe: append_from_document (--TR T1,T2) 3.0038 msec/pass
+ lxe: append_from_document (--TR T1,T2) 3.4299 msec/pass
cET: append_from_document (--TR T1,T2) 0.2639 msec/pass
- ET : append_from_document (--TR T1,T2) 1.2522 msec/pass
+ ET : append_from_document (--TR T1,T2) 1.1489 msec/pass
- lxe: append_from_document (--TR T3,T4) 0.0398 msec/pass
- cET: append_from_document (--TR T3,T4) 0.0160 msec/pass
- ET : append_from_document (--TR T3,T4) 0.0811 msec/pass
+ lxe: append_from_document (--TR T3,T4) 0.0429 msec/pass
+ cET: append_from_document (--TR T3,T4) 0.0169 msec/pass
+ ET : append_from_document (--TR T3,T4) 0.0780 msec/pass
Although these are fairly small numbers compared to parsing, this easily shows
the different performance classes for lxml and (c)ET. Where the latter do not
@@ -371,22 +382,22 @@
This difference is not always as visible, but applies to most parts of the
API, like inserting newly created elements::
- lxe: insert_from_document (--TR T1,T2) 4.9140 msec/pass
- cET: insert_from_document (--TR T1,T2) 0.4108 msec/pass
- ET : insert_from_document (--TR T1,T2) 1.4670 msec/pass
+ lxe: insert_from_document (--TR T1,T2) 6.1119 msec/pass
+ cET: insert_from_document (--TR T1,T2) 0.4129 msec/pass
+ ET : insert_from_document (--TR T1,T2) 1.4160 msec/pass
or replacing the child slice by a newly created element::
- lxe: replace_children_element (--TC T1) 0.1500 msec/pass
- cET: replace_children_element (--TC T1) 0.0238 msec/pass
- ET : replace_children_element (--TC T1) 0.1600 msec/pass
+ lxe: replace_children_element (--TC T1) 0.1769 msec/pass
+ cET: replace_children_element (--TC T1) 0.0250 msec/pass
+ ET : replace_children_element (--TC T1) 0.1538 msec/pass
as opposed to replacing the slice with an existing element from the
same document::
- lxe: replace_children (--TC T1) 0.0160 msec/pass
+ lxe: replace_children (--TC T1) 0.0169 msec/pass
cET: replace_children (--TC T1) 0.0119 msec/pass
- ET : replace_children (--TC T1) 0.0741 msec/pass
+ ET : replace_children (--TC T1) 0.0758 msec/pass
While these numbers are too small to provide a major performance
impact in practice, you should keep this difference in mind when you
@@ -398,17 +409,17 @@
Deep copying a tree is fast in lxml::
- lxe: deepcopy_all (--TR T1) 9.4090 msec/pass
- cET: deepcopy_all (--TR T1) 120.1589 msec/pass
- ET : deepcopy_all (--TR T1) 901.3789 msec/pass
-
- lxe: deepcopy_all (-ATR T2) 12.4569 msec/pass
- cET: deepcopy_all (-ATR T2) 135.8809 msec/pass
- ET : deepcopy_all (-ATR T2) 940.7840 msec/pass
-
- lxe: deepcopy_all (S-TR T3) 2.7640 msec/pass
- cET: deepcopy_all (S-TR T3) 30.1108 msec/pass
- ET : deepcopy_all (S-TR T3) 228.4350 msec/pass
+ lxe: deepcopy_all (--TR T1) 10.0670 msec/pass
+ cET: deepcopy_all (--TR T1) 115.8700 msec/pass
+ ET : deepcopy_all (--TR T1) 866.8201 msec/pass
+
+ lxe: deepcopy_all (-ATR T2) 12.4321 msec/pass
+ cET: deepcopy_all (-ATR T2) 130.1000 msec/pass
+ ET : deepcopy_all (-ATR T2) 901.1638 msec/pass
+
+ lxe: deepcopy_all (S-TR T3) 2.6951 msec/pass
+ cET: deepcopy_all (S-TR T3) 28.9950 msec/pass
+ ET : deepcopy_all (S-TR T3) 218.7109 msec/pass
So, for example, if you have a database-like scenario where you parse in a
large tree and then search and copy independent subtrees from it for further
@@ -423,42 +434,43 @@
especially if few elements are of interest or the target element tag name is
known, lxml is a good choice::
- lxe: getiterator_all (--TR T1) 5.0449 msec/pass
- cET: getiterator_all (--TR T1) 42.0539 msec/pass
- ET : getiterator_all (--TR T1) 22.9158 msec/pass
-
- lxe: getiterator_islice (--TR T2) 0.0789 msec/pass
- cET: getiterator_islice (--TR T2) 0.3579 msec/pass
- ET : getiterator_islice (--TR T2) 0.2351 msec/pass
-
- lxe: getiterator_tag (--TR T2) 0.0651 msec/pass
- cET: getiterator_tag (--TR T2) 0.7648 msec/pass
- ET : getiterator_tag (--TR T2) 0.4380 msec/pass
-
- lxe: getiterator_tag_all (--TR T2) 0.8650 msec/pass
- cET: getiterator_tag_all (--TR T2) 42.7120 msec/pass
- ET : getiterator_tag_all (--TR T2) 21.5559 msec/pass
+ lxe: getiterator_all (--TR T1) 4.7209 msec/pass
+ cET: getiterator_all (--TR T1) 45.8400 msec/pass
+ ET : getiterator_all (--TR T1) 22.9480 msec/pass
+
+ lxe: getiterator_islice (--TR T2) 0.0398 msec/pass
+ cET: getiterator_islice (--TR T2) 0.3798 msec/pass
+ ET : getiterator_islice (--TR T2) 0.1900 msec/pass
+
+ lxe: getiterator_tag (--TR T2) 0.0160 msec/pass
+ cET: getiterator_tag (--TR T2) 0.8149 msec/pass
+ ET : getiterator_tag (--TR T2) 0.3560 msec/pass
+
+ lxe: getiterator_tag_all (--TR T2) 0.6580 msec/pass
+ cET: getiterator_tag_all (--TR T2) 46.3769 msec/pass
+ ET : getiterator_tag_all (--TR T2) 20.3989 msec/pass
This translates directly into similar timings for ``Element.findall()``::
- lxe: findall (--TR T2) 6.8750 msec/pass
- cET: findall (--TR T2) 46.8600 msec/pass
- ET : findall (--TR T2) 27.0121 msec/pass
-
- lxe: findall (--TR T3) 1.5690 msec/pass
- cET: findall (--TR T3) 13.6340 msec/pass
- ET : findall (--TR T3) 8.8100 msec/pass
-
- lxe: findall_tag (--TR T2) 1.0221 msec/pass
- cET: findall_tag (--TR T2) 42.8400 msec/pass
- ET : findall_tag (--TR T2) 21.4801 msec/pass
-
- lxe: findall_tag (--TR T3) 0.4241 msec/pass
- cET: findall_tag (--TR T3) 10.7069 msec/pass
- ET : findall_tag (--TR T3) 5.8560 msec/pass
-
-Note that all three libraries currently use the same Python implementation for
-``findall()``, except for their native tree iterator (``element.iter()``).
+ lxe: findall (--TR T2) 6.7198 msec/pass
+ cET: findall (--TR T2) 51.2750 msec/pass
+ ET : findall (--TR T2) 26.9110 msec/pass
+
+ lxe: findall (--TR T3) 1.4520 msec/pass
+ cET: findall (--TR T3) 14.2760 msec/pass
+ ET : findall (--TR T3) 8.4310 msec/pass
+
+ lxe: findall_tag (--TR T2) 0.7401 msec/pass
+ cET: findall_tag (--TR T2) 46.5961 msec/pass
+ ET : findall_tag (--TR T2) 20.3760 msec/pass
+
+ lxe: findall_tag (--TR T3) 0.3331 msec/pass
+ cET: findall_tag (--TR T3) 11.5960 msec/pass
+ ET : findall_tag (--TR T3) 5.4510 msec/pass
+
+Note that all three libraries currently use the same Python
+implementation for ``.findall()``, except for their native tree
+iterator (``element.iter()``).
XPath
@@ -471,38 +483,38 @@
of the lxml API you use. The most straight forward way is to call the
``xpath()`` method on an Element or ElementTree::
- lxe: xpath_method (--TC T1) 1.5969 msec/pass
- lxe: xpath_method (--TC T2) 21.3680 msec/pass
- lxe: xpath_method (--TC T3) 0.1218 msec/pass
- lxe: xpath_method (--TC T4) 1.0300 msec/pass
+ lxe: xpath_method (--TC T1) 1.5750 msec/pass
+ lxe: xpath_method (--TC T2) 20.9570 msec/pass
+ lxe: xpath_method (--TC T3) 0.1199 msec/pass
+ lxe: xpath_method (--TC T4) 1.0121 msec/pass
This is well suited for testing and when the XPath expressions are as diverse
as the trees they are called on. However, if you have a single XPath
expression that you want to apply to a larger number of different elements,
the ``XPath`` class is the most efficient way to do it::
- lxe: xpath_class (--TC T1) 0.6590 msec/pass
- lxe: xpath_class (--TC T2) 2.9969 msec/pass
- lxe: xpath_class (--TC T3) 0.0520 msec/pass
- lxe: xpath_class (--TC T4) 0.1619 msec/pass
+ lxe: xpath_class (--TC T1) 0.6301 msec/pass
+ lxe: xpath_class (--TC T2) 2.6128 msec/pass
+ lxe: xpath_class (--TC T3) 0.0498 msec/pass
+ lxe: xpath_class (--TC T4) 0.1400 msec/pass
Note that this still allows you to use variables in the expression, so you can
parse it once and then adapt it through variables at call time. In other
cases, where you have a fixed Element or ElementTree and want to run different
expressions on it, you should consider the ``XPathEvaluator``::
- lxe: xpath_element (--TR T1) 0.4120 msec/pass
- lxe: xpath_element (--TR T2) 11.5321 msec/pass
- lxe: xpath_element (--TR T3) 0.1152 msec/pass
- lxe: xpath_element (--TR T4) 0.3202 msec/pass
+ lxe: xpath_element (--TR T1) 0.2739 msec/pass
+ lxe: xpath_element (--TR T2) 10.8800 msec/pass
+ lxe: xpath_element (--TR T3) 0.0660 msec/pass
+ lxe: xpath_element (--TR T4) 0.2739 msec/pass
While it looks slightly slower, creating an XPath object for each of the
expressions generates a much higher overhead here::
- lxe: xpath_class_repeat (--TC T1) 1.5409 msec/pass
- lxe: xpath_class_repeat (--TC T2) 20.2711 msec/pass
- lxe: xpath_class_repeat (--TC T3) 0.1161 msec/pass
- lxe: xpath_class_repeat (--TC T4) 0.9799 msec/pass
+ lxe: xpath_class_repeat (--TC T1) 1.5399 msec/pass
+ lxe: xpath_class_repeat (--TC T2) 20.5159 msec/pass
+ lxe: xpath_class_repeat (--TC T3) 0.1178 msec/pass
+ lxe: xpath_class_repeat (--TC T4) 0.9880 msec/pass
A longer example
@@ -640,8 +652,8 @@
``iterparse()``: 0.07 versus 0.10 seconds. However, tree iteration in lxml
is incredibly fast, so it can be better to parse the whole tree and then
iterate over it rather than using ``iterparse()`` to do both in one step.
- Or, you can just wait for the lxml authors to optimise iterparse in one of
- the next releases...
+ Or, you can just wait for the lxml developers to optimise iterparse in one
+ of the next releases...
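A rough sketch of the two approaches mentioned above, assuming a small synthetic document; the helper names are hypothetical, and no timings are implied by this example.

```python
from io import BytesIO
from lxml import etree

# Synthetic input, used only for illustration.
data = b"<root>" + b"<entry><title>t</title></entry>" * 1000 + b"</root>"

def count_with_iterparse(xml_bytes):
    # Parse and visit the elements in a single step ("end" events by default).
    return sum(1 for _event, _element in etree.iterparse(BytesIO(xml_bytes)))

def count_with_tree_iteration(xml_bytes):
    # Parse the whole document first, then iterate over the finished tree.
    root = etree.fromstring(xml_bytes)
    return sum(1 for _element in root.iter())

iterparse_count = count_with_iterparse(data)
tree_count = count_with_tree_iteration(data)
```

Both functions visit every element exactly once, so they can be timed against each other on real data to see which variant is faster for a given document.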
lxml.objectify
@@ -669,21 +681,21 @@
tree. It avoids step-by-step Python element instantiations along the path,
which can substantially improve the access time::
- lxe: attribute (--TR T1) 8.4081 msec/pass
- lxe: attribute (--TR T2) 51.3301 msec/pass
- lxe: attribute (--TR T4) 8.2269 msec/pass
-
- lxe: objectpath (--TR T1) 4.6120 msec/pass
- lxe: objectpath (--TR T2) 47.0440 msec/pass
- lxe: objectpath (--TR T4) 4.4930 msec/pass
-
- lxe: attributes_deep (--TR T1) 12.6550 msec/pass
- lxe: attributes_deep (--TR T2) 56.0241 msec/pass
- lxe: attributes_deep (--TR T4) 12.5690 msec/pass
-
- lxe: objectpath_deep (--TR T1) 5.9190 msec/pass
- lxe: objectpath_deep (--TR T2) 49.6972 msec/pass
- lxe: objectpath_deep (--TR T4) 5.7530 msec/pass
+ lxe: attribute (--TR T1) 6.9990 msec/pass
+ lxe: attribute (--TR T2) 29.2060 msec/pass
+ lxe: attribute (--TR T4) 6.9048 msec/pass
+
+ lxe: objectpath (--TR T1) 3.5410 msec/pass
+ lxe: objectpath (--TR T2) 24.9801 msec/pass
+ lxe: objectpath (--TR T4) 3.5069 msec/pass
+
+ lxe: attributes_deep (--TR T1) 16.9580 msec/pass
+ lxe: attributes_deep (--TR T2) 39.8140 msec/pass
+ lxe: attributes_deep (--TR T4) 16.9699 msec/pass
+
+ lxe: objectpath_deep (--TR T1) 9.4180 msec/pass
+ lxe: objectpath_deep (--TR T2) 31.7512 msec/pass
+ lxe: objectpath_deep (--TR T4) 9.4421 msec/pass
Note, however, that parsing ObjectPath expressions is not free either, so
this is most effective for frequently accessing the same element.
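As a minimal sketch of the difference, assuming a small hypothetical document: plain attribute access walks the tree one element at a time, while an ``ObjectPath`` is parsed once and then applied to the root.

```python
from lxml import objectify

# Hypothetical sample document, used only for illustration.
root = objectify.fromstring("<root><a><b><c>value</c></b></a></root>")

# Step-by-step attribute access creates a Python object at each level.
via_attributes = root.a.b.c.text

# An ObjectPath is parsed once and can then be applied repeatedly;
# its first segment names the root element.
path = objectify.ObjectPath("root.a.b.c")
via_path = path(root).text
```

Keeping the ``path`` object around and reusing it is what amortises the parsing cost mentioned above.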
@@ -713,17 +725,17 @@
subtrees and elements) to cache, you can trade memory usage against access
speed::
- lxe: attribute_cached (--TR T1) 6.4209 msec/pass
- lxe: attribute_cached (--TR T2) 48.0378 msec/pass
- lxe: attribute_cached (--TR T4) 6.3779 msec/pass
-
- lxe: attributes_deep_cached (--TR T1) 7.8559 msec/pass
- lxe: attributes_deep_cached (--TR T2) 51.0719 msec/pass
- lxe: attributes_deep_cached (--TR T4) 7.7350 msec/pass
-
- lxe: objectpath_deep_cached (--TR T1) 3.2761 msec/pass
- lxe: objectpath_deep_cached (--TR T2) 45.7590 msec/pass
- lxe: objectpath_deep_cached (--TR T4) 3.1459 msec/pass
+ lxe: attribute_cached (--TR T1) 5.1420 msec/pass
+ lxe: attribute_cached (--TR T2) 27.0739 msec/pass
+ lxe: attribute_cached (--TR T4) 5.1429 msec/pass
+
+ lxe: attributes_deep_cached (--TR T1) 7.0908 msec/pass
+ lxe: attributes_deep_cached (--TR T2) 29.5591 msec/pass
+ lxe: attributes_deep_cached (--TR T4) 7.1721 msec/pass
+
+ lxe: objectpath_deep_cached (--TR T1) 2.2731 msec/pass
+ lxe: objectpath_deep_cached (--TR T2) 23.1631 msec/pass
+ lxe: objectpath_deep_cached (--TR T4) 2.3179 msec/pass
Things to note: you cannot currently use ``weakref.WeakKeyDictionary`` objects
for this as lxml's element objects do not support weak references (which are