[Lxml-checkins] r33794 - in lxml/branch/lxml-1.1: benchmark doc

scoder at codespeak.net
Fri Oct 27 09:00:04 CEST 2006


Author: scoder
Date: Fri Oct 27 08:59:57 2006
New Revision: 33794

Modified:
   lxml/branch/lxml-1.1/benchmark/bench_objectify.py
   lxml/branch/lxml-1.1/doc/FAQ.txt
   lxml/branch/lxml-1.1/doc/main.txt
   lxml/branch/lxml-1.1/doc/performance.txt
Log:
merges from trunk: objectify benchmarks, FAQ/performance.txt

Modified: lxml/branch/lxml-1.1/benchmark/bench_objectify.py
==============================================================================
--- lxml/branch/lxml-1.1/benchmark/bench_objectify.py	(original)
+++ lxml/branch/lxml-1.1/benchmark/bench_objectify.py	Fri Oct 27 08:59:57 2006
@@ -2,12 +2,6 @@
 from itertools import *
 from StringIO import StringIO
 
-from lxml import etree, objectify
-
-parser = etree.XMLParser(remove_blank_text=True)
-lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
-parser.setElementClassLookup(lookup)
-
 import benchbase
 from benchbase import with_attributes, with_text, onlylib, serialized
 
@@ -17,10 +11,21 @@
 
 class BenchMark(benchbase.BenchMarkBase):
     def __init__(self, lib):
-        benchbase.BenchMarkBase.__init__(self, lib, parser)
+        from lxml import etree, objectify
+        self.objectify = objectify
+        parser = etree.XMLParser(remove_blank_text=True)
+        lookup = objectify.ObjectifyElementClassLookup()
+        parser.setElementClassLookup(lookup)
+        super(BenchMark, self).__init__(etree, parser)
+
+    def bench_attribute(self, root):
+        "1 2 4"
+        for i in repeat(None, 3000):
+            root.zzzzz
 
-    def bench_attributes(self, root):
+    def bench_attribute_cached(self, root):
         "1 2 4"
+        cache = root.zzzzz  # hold a reference so the element proxy is not discarded
         for i in repeat(None, 3000):
             root.zzzzz
 
@@ -38,13 +43,13 @@
 
     def bench_objectpath(self, root):
         "1 2 4"
-        path = objectify.ObjectPath(".zzzzz")
+        path = self.objectify.ObjectPath(".zzzzz")
         for i in repeat(None, 3000):
             path(root)
 
     def bench_objectpath_deep(self, root):
         "1 2 4"
-        path = objectify.ObjectPath(".zzzzz.{cdefg}z00000")
+        path = self.objectify.ObjectPath(".zzzzz.{cdefg}z00000")
         for i in repeat(None, 3000):
             path(root)
 
@@ -52,9 +57,32 @@
         "1 2 4"
         cache1 = root.zzzzz
         cache2 = cache1['{cdefg}z00000']
-        path = objectify.ObjectPath(".zzzzz.{cdefg}z00000")
+        path = self.objectify.ObjectPath(".zzzzz.{cdefg}z00000")
         for i in repeat(None, 3000):
             path(root)
 
+    @with_text(text=True, utext=True, no_text=True)
+    def bench_annotate(self, root):
+        self.objectify.annotate(root)
+
+    def bench_descendantpaths(self, root):
+        root.descendantpaths()
+
+    @with_text(text=True)
+    def bench_type_inference(self, root):
+        "1 2 4"
+        el = root.aaaaa
+        for i in repeat(None, 1000):
+            el.getchildren()
+
+    @with_text(text=True)
+    def bench_type_inference_annotated(self, root):
+        "1 2 4"
+        el = root.aaaaa
+        self.objectify.annotate(el)
+        for i in repeat(None, 1000):
+            el.getchildren()
+
+
 if __name__ == '__main__':
     benchbase.main(BenchMark)

Modified: lxml/branch/lxml-1.1/doc/FAQ.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/FAQ.txt	(original)
+++ lxml/branch/lxml-1.1/doc/FAQ.txt	Fri Oct 27 08:59:57 2006
@@ -64,6 +64,30 @@
 .. _`the web page`:    http://codespeak.net/lxml/#documentation
 
 
+What is the difference between lxml.etree and lxml.objectify?
+-------------------------------------------------------------
+
+The two modules provide different ways of handling XML.  However, objectify
+builds on top of lxml.etree and therefore inherits most of its capabilities
+and a large portion of its API.
+
+* lxml.etree is a generic API for XML and HTML handling.  It aims for
+  ElementTree compatibility_ and supports the entire XML infoset.  It is well
+  suited for both mixed content and data centric XML.  Its generality makes it
+  the best choice for most applications.
+
+* lxml.objectify is a specialized API for XML data handling in a Python object
+  syntax.  It provides a very natural way to deal with data fields stored in a
+  structurally well defined XML format.  Data is automatically converted to
+  Python data types and can be manipulated with normal Python operators.  Look
+  at the examples in the `objectify documentation`_ to see what it feels like
+  to use it.
+
+  Objectify is not well suited for mixed contents or HTML documents.  As it is
+  built on top of lxml.etree, however, it inherits the normal support for
+  XPath, XSLT or validation.
+
+
 Why is my application so slow?
 ------------------------------
 
@@ -325,78 +349,3 @@
 You can't.  In XPath, there is no such thing as a default namespace.  Just use
 an arbitrary prefix and let the namespace dictionary of the XPath evaluators
 map it to your namespace.  See also the question above.
-
-
-lxml.objectify
-==============
-
-What is the difference between lxml.etree and lxml.objectify?
--------------------------------------------------------------
-
-The two modules provide different ways of handling XML.  However, objectify
-builds on top of lxml.etree and therefore inherits most of its capabilities
-and a large portion of its API.
-
-* lxml.etree is a generic API for XML and HTML handling.  It aims for
-  ElementTree compatibility_ and supports the entire XML infoset.  It is well
-  suited for both mixed content and data centric XML.  Its generality makes it
-  the best choice for most applications.
-
-* lxml.objectify is a specialized API for XML data handling in a Python object
-  syntax.  It provides a very natural way to deal with data fields stored in a
-  structurally well defined XML format.  Data is automatically converted to
-  Python data types and can be manipulated with normal Python operators.  Look
-  at the examples in the `objectify documentation`_ to see what it feels like
-  to use it.
-
-  Objectify is not well suited for mixed contents or HTML documents.  As it is
-  built on top of lxml.etree, however, it inherits the normal support for
-  XPath, XSLT or validation.
-
-Is there a way to speed up frequent element access?
----------------------------------------------------
-
-lxml.objectify creates Python representations of elements on the fly.  To save
-memory, the normal Python garbage collection mechanisms will discard them when
-their last reference is gone.  In cases where deeply nested elements are
-frequently accessed through the objectify API, the create-discard cycles can
-become a bottleneck, as elements have to be instantiated over and over again.
-
-If your benchmarks prove that the overhead is too high for your specific use
-case, here are some things to try:
-
-* If you often work in subtrees, assign the parent of the subtree to a
-  variable or pass it into functions instead of starting at the root.  This
-  allows accessing its descendents more directly.
-
-* Use precompiled ObjectPath expressions instead of accessing deeply nested
-  elements step-by-step via object attributes.
-
-* Try assigning data values directly to attributes instead of passing them
-  through DataElement.
-
-* Run ``objectify.annotate()`` over read-only trees to speed up the attribute
-  type inference on access.
-
-* To prevent frequent object create-discard cycles, you can keep a permanent
-  reference to the Python objects in a tree.  Just create a cache dictionary
-  and run::
-
-     cache[root] = list(root.getiterator())
-
-  after parsing and::
-
-     del cache[root]
-
-  when you are done with the tree.  This will keep the Python element
-  representations of all elements alive and thus avoid the overhead of
-  repeated Python object creation.  By choosing the right trees (or even
-  elements) to cache, you can trade memory usage against access speed.
-
-  Things to note: you cannot currently use ``weakref.WeakKeyDictionary``
-  objects for this as lxml's elements do not support weak references for
-  memory reasons.  Also note that new element objects that you add to these
-  trees will not turn up in the cache automatically and will therefore still
-  be garbage collected when all their Python references are gone, so this is
-  most effective for largely immutable trees.  You should consider using a set
-  instead of a list in this case and add new elements by hand.

Modified: lxml/branch/lxml-1.1/doc/main.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/main.txt	(original)
+++ lxml/branch/lxml-1.1/doc/main.txt	Fri Oct 27 08:59:57 2006
@@ -29,7 +29,7 @@
 
 .. _`installation instructions`: installation.html
 
-* `lxml 1.1.2`_, released 2006-10-26 (`changes for 1.1.2`_)
+* `lxml 1.1.2`_, released 2006-10-27 (`changes for 1.1.2`_)
 
 * `lxml 1.1.1`_, released 2006-09-21 (`changes for 1.1.1`_)
 

Modified: lxml/branch/lxml-1.1/doc/performance.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/performance.txt	(original)
+++ lxml/branch/lxml-1.1/doc/performance.txt	Fri Oct 27 08:59:57 2006
@@ -305,3 +305,93 @@
   lxe: xpath_class_repeat   (-- T3)    0.3126 msec/pass
   lxe: xpath_class_repeat   (-- T4)    1.1111 msec/pass
 
+
+lxml.objectify
+--------------
+
+Objectify is a data-binding API for XML that builds on lxml.etree and was
+added in version 1.1.  It uses standard Python attribute access to traverse
+the XML tree.  It also features ObjectPath, a fast path language based on the
+same idea.
+
+Just like lxml.etree, lxml.objectify creates Python representations of
+elements on the fly.  To save memory, the normal Python garbage collection
+mechanisms will discard them when their last reference is gone.  In cases
+where deeply nested elements are frequently accessed through the objectify
+API, the create-discard cycles can become a bottleneck, as elements have to be
+instantiated over and over again.
+
+ObjectPath can be used to speed up the access to elements that are deep in the
+tree.  It avoids step-by-step Python element instantiations along the path,
+which can substantially improve the access time::
+
+  lxe: attribute                  (--T T1)   14.8621 msec/pass
+  lxe: attribute                  (--T T2)   61.8820 msec/pass
+  lxe: attribute                  (--T T4)   14.9317 msec/pass
+
+  lxe: objectpath                 (--T T1)   13.7311 msec/pass
+  lxe: objectpath                 (--T T2)   58.5930 msec/pass
+  lxe: objectpath                 (--T T4)    8.0961 msec/pass
+
+  lxe: attributes_deep            (--T T1)   81.4488 msec/pass
+  lxe: attributes_deep            (--T T2)   77.0266 msec/pass
+  lxe: attributes_deep            (--T T4)   27.1226 msec/pass
+
+  lxe: objectpath_deep            (--T T1)   63.1915 msec/pass
+  lxe: objectpath_deep            (--T T2)   65.2469 msec/pass
+  lxe: objectpath_deep            (--T T4)   11.0138 msec/pass
+
+Note, however, that parsing ObjectPath expressions is not free either, so
+this is most effective when the same element is accessed frequently.
+
+A way to improve the normal attribute access time is static instantiation of
+the Python objects, thus trading memory for speed.  Just create a cache
+dictionary and run::
+
+    cache[root] = list(root.getiterator())
+
+after parsing and::
+
+    del cache[root]
+
+when you are done with the tree.  This will keep the Python element
+representations of all elements alive and thus avoid the overhead of repeated
+Python object creation.  You can also consider using filters or generator
+expressions to be more selective.  By choosing the right trees (or even
+subtrees and elements) to cache, you can trade memory usage against access
+speed::
+
+  lxe: attribute_cached           (--T T1)   10.8343 msec/pass
+  lxe: attribute_cached           (--T T2)   55.5890 msec/pass
+  lxe: attribute_cached           (--T T4)   10.9514 msec/pass
+
+  lxe: attributes_deep_cached     (--T T1)   63.7080 msec/pass
+  lxe: attributes_deep_cached     (--T T2)   65.6838 msec/pass
+  lxe: attributes_deep_cached     (--T T4)   15.4514 msec/pass
+
+Things to note: you cannot currently use ``weakref.WeakKeyDictionary`` objects
+for this as lxml's element objects do not support weak references (which are
+costly in terms of memory).  Also note that new element objects that you add
+to these trees will not turn up in the cache automatically and will therefore
+still be garbage collected when all their Python references are gone, so this
+is most effective for largely immutable trees.  You should consider using a
+set instead of a list in this case and add new elements by hand.
+
+Here are some more things to try if optimisation is required:
+
+* A lot of time is usually spent in tree traversal to find the addressed
+  elements in the tree.  If you often work in subtrees, assign the parent of
+  the subtree to a variable or pass it into functions instead of starting at
+  the root.  This allows accessing its descendants more directly.
+
+* Try assigning data values directly to attributes instead of passing them
+  through DataElement.
+
+* If you use custom data types that are costly to parse, try running
+  ``objectify.annotate()`` over read-only trees to speed up the attribute type
+  inference on read access.
+
+Note that none of these measures is guaranteed to speed up your application.
+As usual, you should prefer readable code over premature optimisations and
+profile your expected use cases before bothering to apply optimisations at
+random.

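The ObjectPath advice in the performance.txt text above can be sketched roughly as follows: compile the path once, then apply it repeatedly.  This is only an illustration under assumptions.  The sample document and its tag names (``a``, ``b``) are made up, and ``objectify.fromstring()`` is the entry point of current lxml releases, which may differ from what the 1.1 branch offered.

```python
from lxml import objectify

# Hypothetical sample document; the tag names are illustrative only.
root = objectify.fromstring("<root><a><b>42</b></a></root>")

# Compile the path once.  The leading dot makes it relative to whatever
# element is passed in later, so the root tag name does not matter.
path = objectify.ObjectPath(".a.b")

# Step-by-step attribute access goes through every level of the tree ...
by_attribute = root.a.b

# ... while the precompiled path is applied to the root in one call.
by_path = path(root)

print(by_attribute, by_path)
```

Whether this pays off depends on how often the same path is reused, so it is worth measuring with your own data before committing to it.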

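The caching approach described in the performance.txt text above might look like this in practice.  It is a sketch, not a recommendation: the document and tag names are hypothetical, and ``iter()`` is used here as the current spelling of the ``getiterator()`` call shown in the text.

```python
from lxml import objectify

# Hypothetical sample document; the tag names are illustrative only.
root = objectify.fromstring("<root><a><b>1</b></a><c>2</c></root>")

# Keep a strong reference to every element object in the tree so that
# the Python representations are not discarded and re-created on access.
cache = {}
cache[root] = list(root.iter())   # root, a, b and c

cached_count = len(cache[root])

# ... work with the tree here ...

# Drop the references when the tree is no longer needed, so that normal
# garbage collection can reclaim the element objects again.
del cache[root]
```

Elements added to the tree after the list was built are not covered by it, which is why the text suggests this mainly for largely immutable trees.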