[Lxml-checkins] r33705 - lxml/trunk/doc
scoder at codespeak.net
Wed Oct 25 08:49:25 CEST 2006
Author: scoder
Date: Wed Oct 25 08:49:24 2006
New Revision: 33705
Modified:
lxml/trunk/doc/FAQ.txt
Log:
FAQ entry on objectify performance tweaking
Modified: lxml/trunk/doc/FAQ.txt
==============================================================================
--- lxml/trunk/doc/FAQ.txt (original)
+++ lxml/trunk/doc/FAQ.txt Wed Oct 25 08:49:24 2006
@@ -12,9 +12,8 @@
1 General Questions
1.1 Is there a tutorial?
1.2 Where can I find more documentation about lxml?
- 1.3 What is the difference between lxml.etree and lxml.objectify?
- 1.4 Why is my application so slow?
- 1.5 Why do I get errors about missing UCS4 symbols when installing lxml?
+ 1.3 Why is my application so slow?
+ 1.4 Why do I get errors about missing UCS4 symbols when installing lxml?
2 Bugs
2.1 My application crashes! Why does lxml.etree do that?
2.2 I think I have found a bug in lxml. What should I do?
@@ -31,6 +30,9 @@
5.2 Why doesn't ``findall()`` support full XPath expressions?
5.3 How can I find out which namespace prefixes are used in a document?
5.4 How can I specify a default namespace for XPath expressions?
+ 6 lxml.objectify
+ 6.1 What is the difference between lxml.etree and lxml.objectify?
+ 6.2 Is there a way to speed up frequent element access?
General Questions
@@ -62,30 +64,6 @@
.. _`the web page`: http://codespeak.net/lxml/#documentation
-What is the difference between lxml.etree and lxml.objectify?
--------------------------------------------------------------
-
-The two modules provide different ways of handling XML. However, objectify
-builds on top of lxml.etree and therefore inherits most of its capabilities
-and a large portion of its API.
-
-* lxml.etree is a generic API for XML and HTML handling. It aims for
- ElementTree compatibility_ and supports the entire XML infoset. It is well
- suited for both mixed content and data centric XML. Its generality makes it
- the best choice for most applications.
-
-* lxml.objectify is a specialized API for XML data handling in a Python object
- syntax. It provides a very natural way to deal with data fields stored in a
- structurally well defined XML format. Data is automatically converted to
- Python data types and can be manipulated with normal Python operators. Look
- at the examples in the `objectify documentation`_ to see what it feels like
- to use it.
-
- Objectify is not well suited for mixed contents or HTML documents. As it is
- built on top of lxml.etree, however, it inherits the normal support for
- XPath, XSLT or validation.
-
-
Why is my application so slow?
------------------------------
@@ -178,7 +156,7 @@
Due to the way libxslt handles threading, concurrent access to stylesheets is
currently only possible if it was parsed in the main thread. Parsing and
-using a stylesheet inside one thread also works.
+applying a stylesheet inside one thread also works.
Warning: You should generally avoid modifying a tree in a thread other than the
one it was generated in. Although this should work in many cases, there are
@@ -200,10 +178,10 @@
The global interpreter lock (GIL) in Python serializes access to the
interpreter, so if the majority of your processing is done in Python code
-(traversing trees, modifying elements, etc.), your gain will be close to 0.
-The more of your XML processing moves into lxml, however, the higher your
-gain. If your application is bound by XML parsing and serialisation, or by
-complex XSLTs, your speedup on multi-processor machines can be substantial.
+(walking trees, modifying elements, etc.), your gain will be close to 0. The
+more of your XML processing moves into lxml, however, the higher your gain.
+If your application is bound by XML parsing and serialisation, or by complex
+XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support
multi-threading.
@@ -347,3 +325,78 @@
You can't. In XPath, there is no such thing as a default namespace. Just use
an arbitrary prefix and let the namespace dictionary of the XPath evaluators
map it to your namespace. See also the question above.
+
+
+lxml.objectify
+==============
+
+What is the difference between lxml.etree and lxml.objectify?
+-------------------------------------------------------------
+
+The two modules provide different ways of handling XML. However, objectify
+builds on top of lxml.etree and therefore inherits most of its capabilities
+and a large portion of its API.
+
+* lxml.etree is a generic API for XML and HTML handling. It aims for
+ ElementTree compatibility_ and supports the entire XML infoset. It is well
+ suited for both mixed content and data centric XML. Its generality makes it
+ the best choice for most applications.
+
+* lxml.objectify is a specialized API for XML data handling in a Python object
+ syntax. It provides a very natural way to deal with data fields stored in a
+ structurally well defined XML format. Data is automatically converted to
+ Python data types and can be manipulated with normal Python operators. Look
+ at the examples in the `objectify documentation`_ to see what it feels like
+ to use it.
+
+ Objectify is not well suited for mixed contents or HTML documents. As it is
+ built on top of lxml.etree, however, it inherits the normal support for
+ XPath, XSLT or validation.
+
+Is there a way to speed up frequent element access?
+---------------------------------------------------
+
+lxml.objectify creates Python representations of elements on the fly. To save
+memory, the normal Python garbage collection mechanisms will discard them when
+their last reference is gone. In cases where deeply nested elements are
+frequently accessed through the objectify API, the create-discard cycles can
+become a bottleneck, as elements have to be instantiated over and over again.
+
+If your benchmarks prove that the overhead is too high for your specific use
+case, here are some things to try:
+
+* If you often work in subtrees, assign the parent of the subtree to a
+ variable or pass it into functions instead of starting at the root. This
+ allows accessing its descendants more directly.
+
+* Use precompiled ObjectPath expressions instead of accessing deeply nested
+ elements step-by-step via object attributes.
+
+* Try assigning data values directly to attributes instead of passing them
+ through DataElement.
+
+* Run ``objectify.annotate()`` over read-only trees to speed up the attribute
+ type inference on access.
+
+* To prevent frequent object create-discard cycles, you can keep a permanent
+ reference to the Python objects in a tree. Just create a cache dictionary
+ and run::
+
+ cache[root] = list(root.getiterator())
+
+ after parsing and::
+
+ del cache[root]
+
+ when you are done with the tree. This will keep the Python element
+ representations of all elements alive and thus avoid the overhead of
+ repeated Python object creation. By choosing the right trees (or even
+ elements) to cache, you can trade memory usage against access speed.
+
+ Things to note: you cannot currently use ``weakref.WeakKeyDictionary``
+ objects for this as lxml's elements do not support weak references for
+ memory reasons. Also note that new element objects that you add to these
+ trees will not turn up in the cache automatically and will therefore still
+ be garbage collected when all their Python references are gone, so this is
+ most effective for largely immutable trees. You should consider using a set
+ instead of a list in this case and add new elements by hand.
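
The tips in the new "speed up frequent element access" entry can be tried out
with a sketch like the one below. It is an illustration only and not part of
the commit: the sample document and all variable names are made up, and
``iter()`` is used as the current spelling of the ``getiterator()`` call shown
in the FAQ text. Whether any of these steps pays off depends on your own
benchmarks.

```python
from lxml import objectify

# Hypothetical sample document, invented for this illustration.
XML = b"<root><a><b><c>1</c><c>2</c></b></a></root>"
root = objectify.fromstring(XML)

# Tip: keep a reference to the subtree parent instead of walking down
# from the root on every access.
subtree = root.a.b
values = [int(c) for c in subtree.c]  # iterates over the <c> siblings

# Tip: precompile an ObjectPath once and reuse it for repeated lookups.
first_c_path = objectify.ObjectPath("root.a.b.c")
first_c = first_c_path(root)  # first <c> element below root.a.b

# Tip: annotate a read-only tree so that the inferred types are stored
# in the tree as attributes.
objectify.annotate(root)

# Tip: keep the Python element objects alive in a plain dict, as
# WeakKeyDictionary cannot be used with lxml elements.
cache = {}
cache[root] = list(root.iter())
cached_count = len(cache[root])

# ... work with the tree here ...

# Drop the references again when the tree is no longer needed.
del cache[root]
```

The cache is keyed by the root element, so several trees can be cached side by
side and released individually.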