[Lxml-checkins] r33705 - lxml/trunk/doc

scoder at codespeak.net scoder at codespeak.net
Wed Oct 25 08:49:25 CEST 2006


Author: scoder
Date: Wed Oct 25 08:49:24 2006
New Revision: 33705

Modified:
   lxml/trunk/doc/FAQ.txt
Log:
FAQ entry on objectify performance tweaking

Modified: lxml/trunk/doc/FAQ.txt
==============================================================================
--- lxml/trunk/doc/FAQ.txt	(original)
+++ lxml/trunk/doc/FAQ.txt	Wed Oct 25 08:49:24 2006
@@ -12,9 +12,8 @@
    1  General Questions
      1.1  Is there a tutorial?
      1.2  Where can I find more documentation about lxml?
-     1.3  What is the difference between lxml.etree and lxml.objectify?
-     1.4  Why is my application so slow?
-     1.5  Why do I get errors about missing UCS4 symbols when installing lxml?
+     1.3  Why is my application so slow?
+     1.4  Why do I get errors about missing UCS4 symbols when installing lxml?
    2  Bugs
      2.1  My application crashes! Why does lxml.etree do that?
      2.2  I think I have found a bug in lxml. What should I do?
@@ -31,6 +30,9 @@
      5.2  Why doesn't ``findall()`` support full XPath expressions?
      5.3  How can I find out which namespace prefixes are used in a document?
      5.4  How can I specify a default namespace for XPath expressions?
+   6  lxml.objectify
+     6.1  What is the difference between lxml.etree and lxml.objectify?
+     6.2  Is there a way to speed up frequent element access?
 
 
 General Questions
@@ -62,30 +64,6 @@
 .. _`the web page`:    http://codespeak.net/lxml/#documentation
 
 
-What is the difference between lxml.etree and lxml.objectify?
--------------------------------------------------------------
-
-The two modules provide different ways of handling XML.  However, objectify
-builds on top of lxml.etree and therefore inherits most of its capabilities
-and a large portion of its API.
-
-* lxml.etree is a generic API for XML and HTML handling.  It aims for
-  ElementTree compatibility_ and supports the entire XML infoset.  It is well
-  suited for both mixed content and data centric XML.  Its generality makes it
-  the best choice for most applications.
-
-* lxml.objectify is a specialized API for XML data handling in a Python object
-  syntax.  It provides a very natural way to deal with data fields stored in a
-  structurally well defined XML format.  Data is automatically converted to
-  Python data types and can be manipulated with normal Python operators.  Look
-  at the examples in the `objectify documentation`_ to see what it feels like
-  to use it.
-
-  Objectify is not well suited for mixed contents or HTML documents.  As it is
-  built on top of lxml.etree, however, it inherits the normal support for
-  XPath, XSLT or validation.
-
-
 Why is my application so slow?
 ------------------------------
 
@@ -178,7 +156,7 @@
 
 Due to the way libxslt handles threading, concurrent access to stylesheets is
 currently only possible if it was parsed in the main thread.  Parsing and
-using a stylesheet inside one thread also works.
+applying a stylesheet inside one thread also works.
 
 Warning: You should generally avoid modifying trees in threads other than the
 one they were generated in.  Although this should work in many cases, there are
@@ -200,10 +178,10 @@
 
 The global interpreter lock (GIL) in Python serializes access to the
 interpreter, so if the majority of your processing is done in Python code
-(traversing trees, modifying elements, etc.), your gain will be close to 0.
-The more of your XML processing moves into lxml, however, the higher your
-gain.  If your application is bound by XML parsing and serialisation, or by
-complex XSLTs, your speedup on multi-processor machines can be substantial.
+(walking trees, modifying elements, etc.), your gain will be close to 0.  The
+more of your XML processing moves into lxml, however, the higher your gain.
+If your application is bound by XML parsing and serialisation, or by complex
+XSLTs, your speedup on multi-processor machines can be substantial.
 
 See the question above to learn which operations free the GIL to support
 multi-threading.
@@ -347,3 +325,78 @@
 You can't.  In XPath, there is no such thing as a default namespace.  Just use
 an arbitrary prefix and let the namespace dictionary of the XPath evaluators
 map it to your namespace.  See also the question above.
+
+
+lxml.objectify
+==============
+
+What is the difference between lxml.etree and lxml.objectify?
+-------------------------------------------------------------
+
+The two modules provide different ways of handling XML.  However, objectify
+builds on top of lxml.etree and therefore inherits most of its capabilities
+and a large portion of its API.
+
+* lxml.etree is a generic API for XML and HTML handling.  It aims for
+  ElementTree compatibility_ and supports the entire XML infoset.  It is well
+  suited for both mixed content and data centric XML.  Its generality makes it
+  the best choice for most applications.
+
+* lxml.objectify is a specialized API for XML data handling in a Python object
+  syntax.  It provides a very natural way to deal with data fields stored in a
+  structurally well defined XML format.  Data is automatically converted to
+  Python data types and can be manipulated with normal Python operators.  Look
+  at the examples in the `objectify documentation`_ to see what it feels like
+  to use it.
+
+  Objectify is not well suited for mixed contents or HTML documents.  As it is
+  built on top of lxml.etree, however, it inherits the normal support for
+  XPath, XSLT or validation.
+
+Is there a way to speed up frequent element access?
+---------------------------------------------------
+
+lxml.objectify creates Python representations of elements on the fly.  To save
+memory, the normal Python garbage collection mechanisms will discard them when
+their last reference is gone.  In cases where deeply nested elements are
+frequently accessed through the objectify API, the create-discard cycles can
+become a bottleneck, as elements have to be instantiated over and over again.
+
+If your benchmarks prove that the overhead is too high for your specific use
+case, here are some things to try:
+
+* If you often work in subtrees, assign the parent of the subtree to a
+  variable or pass it into functions instead of starting at the root.  This
+  allows accessing its descendants more directly.
+
+* Use precompiled ObjectPath expressions instead of accessing deeply nested
+  elements step-by-step via object attributes.
+
+* Try assigning data values directly to attributes instead of passing them
+  through DataElement.
+
+* Run ``objectify.annotate()`` over read-only trees to speed up the attribute
+  type inference on access.
+
+* To prevent frequent object create-discard cycles, you can keep a permanent
+  reference to the Python objects in a tree.  Just create a cache dictionary
+  and run::
+
+     cache[root] = list(root.getiterator())
+
+  after parsing and::
+
+     del cache[root]
+
+  when you are done with the tree.  This will keep the Python element
+  representations of all elements alive and thus avoid the overhead of
+  repeated Python object creation.  By choosing the right trees (or even
+  elements) to cache, you can trade memory usage against access speed.
+
+  Things to note: you cannot currently use ``weakref.WeakKeyDictionary``
+  objects for this as lxml's elements do not support weak references for
+  memory reasons.  Also note that new element objects that you add to these
+  trees will not turn up in the cache automatically and will therefore still
+  be garbage collected when all their Python references are gone, so this is
+  most effective for largely immutable trees.  You should consider using a set
+  instead of a list in this case and adding new elements by hand.
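
For readers of this checkin, the tips in the new FAQ entry could be tried out roughly as follows.  This is only a sketch and not part of the committed FAQ text: the sample document, its tag names and the variable names are made up for illustration, and ``root.iter()`` is used as the current spelling of the ``getiterator()`` call shown in the entry.

```python
from lxml import objectify

# Hypothetical sample document; the tag names are illustrative only.
XML = (
    "<root><config><server>"
    "<host>localhost</host><port>8080</port>"
    "</server></config></root>"
)

root = objectify.fromstring(XML)

# Keep a reference to the subtree instead of walking down from the
# root on every access.
server = root.config.server

# Precompile the path once, then apply it to the tree when needed.
port_path = objectify.ObjectPath("root.config.server.port")
port = port_path(root)

# Hold a reference to every Python element object of the tree so that
# none of them is discarded and re-created between accesses.
cache = {}
cache[root] = list(root.iter())

# Elements added later do not show up in the cache automatically,
# so they are appended by hand.
server.timeout = 30
cache[root].append(server.timeout)

n_cached = len(cache[root])
print(server.host, port, n_cached)

# Drop the references once the tree is no longer needed.
del cache[root]
```

Whether any of this pays off depends on the access pattern, so it is worth benchmarking with and without the cache before keeping it.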

