[Lxml-checkins] r33707 - in lxml/branch/lxml-1.1: . doc src/lxml

scoder at codespeak.net
Wed Oct 25 09:55:31 CEST 2006


Author: scoder
Date: Wed Oct 25 09:55:26 2006
New Revision: 33707

Modified:
   lxml/branch/lxml-1.1/CHANGES.txt
   lxml/branch/lxml-1.1/doc/FAQ.txt
   lxml/branch/lxml-1.1/doc/build.txt
   lxml/branch/lxml-1.1/src/lxml/docloader.pxi
   lxml/branch/lxml-1.1/src/lxml/etree.pyx
   lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd
   lxml/branch/lxml-1.1/src/lxml/objectify.pyx
   lxml/branch/lxml-1.1/src/lxml/parser.pxi
   lxml/branch/lxml-1.1/src/lxml/public-api.pxi
   lxml/branch/lxml-1.1/src/lxml/xmlid.pxi
   lxml/branch/lxml-1.1/src/lxml/xslt.pxi
Log:
big merge from trunk: xmlid fixes, _Attrib cleanup, FAQ about objectify, resolver context cleanup

Modified: lxml/branch/lxml-1.1/CHANGES.txt
==============================================================================
--- lxml/branch/lxml-1.1/CHANGES.txt	(original)
+++ lxml/branch/lxml-1.1/CHANGES.txt	Wed Oct 25 09:55:26 2006
@@ -18,6 +18,9 @@
 Bugs fixed
 ----------
 
+* Open files and XML strings returned by Python resolvers were not
+  closed/freed
+
 * Copying Comments and ProcessingInstructions failed
 
 * Memory leak for external URLs in _XSLTProcessingInstruction.parseXSL()

Modified: lxml/branch/lxml-1.1/doc/FAQ.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/FAQ.txt	(original)
+++ lxml/branch/lxml-1.1/doc/FAQ.txt	Wed Oct 25 09:55:26 2006
@@ -12,9 +12,8 @@
    1  General Questions
      1.1  Is there a tutorial?
      1.2  Where can I find more documentation about lxml?
-     1.3  What is the difference between lxml.etree and lxml.objectify?
-     1.4  Why is my application so slow?
-     1.5  Why do I get errors about missing UCS4 symbols when installing lxml?
+     1.3  Why is my application so slow?
+     1.4  Why do I get errors about missing UCS4 symbols when installing lxml?
    2  Bugs
      2.1  My application crashes! Why does lxml.etree do that?
      2.2  I think I have found a bug in lxml. What should I do?
@@ -31,6 +30,9 @@
      5.2  Why doesn't ``findall()`` support full XPath expressions?
      5.3  How can I find out which namespace prefixes are used in a document?
      5.4  How can I specify a default namespace for XPath expressions?
+   6  lxml.objectify
+     6.1  What is the difference between lxml.etree and lxml.objectify?
+     6.2  Is there a way to speed up frequent element access?
 
 
 General Questions
@@ -62,30 +64,6 @@
 .. _`the web page`:    http://codespeak.net/lxml/#documentation
 
 
-What is the difference between lxml.etree and lxml.objectify?
--------------------------------------------------------------
-
-The two modules provide different ways of handling XML.  However, objectify
-builds on top of lxml.etree and therefore inherits most of its capabilities
-and a large portion of its API.
-
-* lxml.etree is a generic API for XML and HTML handling.  It aims for
-  ElementTree compatibility_ and supports the entire XML infoset.  It is well
-  suited for both mixed content and data centric XML.  Its generality makes it
-  the best choice for most applications.
-
-* lxml.objectify is a specialized API for XML data handling in a Python object
-  syntax.  It provides a very natural way to deal with data fields stored in a
-  structurally well defined XML format.  Data is automatically converted to
-  Python data types and can be manipulated with normal Python operators.  Look
-  at the examples in the `objectify documentation`_ to see what it feels like
-  to use it.
-
-  Objectify is not well suited for mixed contents or HTML documents.  As it is
-  built on top of lxml.etree, however, it inherits the normal support for
-  XPath, XSLT or validation.
-
-
 Why is my application so slow?
 ------------------------------
 
@@ -178,7 +156,7 @@
 
 Due to the way libxslt handles threading, concurrent access to stylesheets is
 currently only possible if it was parsed in the main thread.  Parsing and
-using a stylesheet inside one thread also works.
+applying a stylesheet inside one thread also works.
 
 Warning: You should generally avoid modifying trees in other threads than the
 one it was generated in.  Although this should work in many cases, there are
@@ -200,10 +178,10 @@
 
 The global interpreter lock (GIL) in Python serializes access to the
 interpreter, so if the majority of your processing is done in Python code
-(traversing trees, modifying elements, etc.), your gain will be close to 0.
-The more of your XML processing moves into lxml, however, the higher your
-gain.  If your application is bound by XML parsing and serialisation, or by
-complex XSLTs, your speedup on multi-processor machines can be substantial.
+(walking trees, modifying elements, etc.), your gain will be close to 0.  The
+more of your XML processing moves into lxml, however, the higher your gain.
+If your application is bound by XML parsing and serialisation, or by complex
+XSLTs, your speedup on multi-processor machines can be substantial.
 
 See the question above to learn which operations free the GIL to support
 multi-threading.
@@ -347,3 +325,78 @@
 You can't.  In XPath, there is no such thing as a default namespace.  Just use
 an arbitrary prefix and let the namespace dictionary of the XPath evaluators
 map it to your namespace.  See also the question above.
+
+
+lxml.objectify
+==============
+
+What is the difference between lxml.etree and lxml.objectify?
+-------------------------------------------------------------
+
+The two modules provide different ways of handling XML.  However, objectify
+builds on top of lxml.etree and therefore inherits most of its capabilities
+and a large portion of its API.
+
+* lxml.etree is a generic API for XML and HTML handling.  It aims for
+  ElementTree compatibility_ and supports the entire XML infoset.  It is well
+  suited for both mixed content and data centric XML.  Its generality makes it
+  the best choice for most applications.
+
+* lxml.objectify is a specialized API for XML data handling in a Python object
+  syntax.  It provides a very natural way to deal with data fields stored in a
+  structurally well defined XML format.  Data is automatically converted to
+  Python data types and can be manipulated with normal Python operators.  Look
+  at the examples in the `objectify documentation`_ to see what it feels like
+  to use it.
+
+  Objectify is not well suited for mixed contents or HTML documents.  As it is
+  built on top of lxml.etree, however, it inherits the normal support for
+  XPath, XSLT or validation.
+
+Is there a way to speed up frequent element access?
+---------------------------------------------------
+
+lxml.objectify creates Python representations of elements on the fly.  To save
+memory, the normal Python garbage collection mechanisms will discard them when
+their last reference is gone.  In cases where deeply nested elements are
+frequently accessed through the objectify API, the create-discard cycles can
+become a bottleneck, as elements have to be instantiated over and over again.
+
+If your benchmarks prove that the overhead is too high for your specific use
+case, here are some things to try:
+
+* If you often work in subtrees, assign the parent of the subtree to a
+  variable or pass it into functions instead of starting at the root.  This
+  allows accessing its descendants more directly.
+
+* Use precompiled ObjectPath expressions instead of accessing deeply nested
+  elements step-by-step via object attributes.
+
+* Try assigning data values directly to attributes instead of passing them
+  through DataElement.
+
+* Run ``objectify.annotate()`` over read-only trees to speed up the attribute
+  type inference on access.
+
+* To prevent frequent object create-discard cycles, you can keep a permanent
+  reference to the Python objects in a tree.  Just create a cache dictionary
+  and run::
+
+     cache[root] = list(root.getiterator())
+
+  after parsing and::
+
+     del cache[root]
+
+  when you are done with the tree.  This will keep the Python element
+  representations of all elements alive and thus avoid the overhead of
+  repeated Python object creation.  By choosing the right trees (or even
+  elements) to cache, you can trade memory usage against access speed.
+
+  Things to note: you cannot currently use ``weakref.WeakKeyDictionary``
+  objects for this as lxml's elements do not support weak references for
+  memory reasons.  Also note that new element objects that you add to these
+  trees will not turn up in the cache automatically and will therefore still
+  be garbage collected when all their Python references are gone, so this is
+  most effective for largely immutable trees.  You should consider using a set
+  instead of a list in this case and add new elements by hand.

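The caching idiom and the ObjectPath hint from the new FAQ entry above can be sketched as follows. This is a minimal illustration, not part of the commit: the sample document and the names `XML`, `cache`, `first_b`, `value` and `cached_count` are made up, and `root.iter()` is used here as the current spelling of the `getiterator()` call shown in the FAQ text.

```python
from lxml import objectify

# Hypothetical sample document, only for illustration.
XML = "<root><a><b>1</b><b>2</b></a><c>text</c></root>"

root = objectify.fromstring(XML)

# Keep a strong reference to the Python proxy of every element in the
# tree, so that repeated access does not have to re-create them.
cache = {}
cache[root] = list(root.iter())
cached_count = len(cache[root])

# A precompiled ObjectPath avoids step-by-step attribute access for
# deeply nested elements.
first_b = objectify.ObjectPath("root.a.b")
value = first_b(root)

# Drop the references when the tree is no longer needed.
del cache[root]
```

Whether this pays off depends on the access pattern, so it is worth benchmarking with and without the cache before keeping it.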
Modified: lxml/branch/lxml-1.1/doc/build.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/build.txt	(original)
+++ lxml/branch/lxml-1.1/doc/build.txt	Wed Oct 25 09:55:26 2006
@@ -224,11 +224,11 @@
 * check md5sum of created tar.gz file and place new sum and size in dsc file
 * do ``dpkg-source -x lxml-...dsc`` and cd into the newly created directory
 * run ``dch -i`` and add a comment like "use trunk version", this will
-  increase the debian version number so apt/dpkg don't get confused
+  increase the debian version number so apt/dpkg won't get confused
 * run ``dpkg-buildpackage -rfakeroot -us -uc`` to build the package
 
-Eventually dpkg-buildpackage will tell you that some dependecies are missing,
-you can either install them manually or run apt-get build-dep lxml
+In case ``dpkg-buildpackage`` tells you that some dependencies are missing, you
+can either install them manually or run ``apt-get build-dep lxml``.
 
 That will give you .deb packages in the parent directory which can be
 installed using ``dpkg -i``.

Modified: lxml/branch/lxml-1.1/src/lxml/docloader.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/docloader.pxi	(original)
+++ lxml/branch/lxml-1.1/src/lxml/docloader.pxi	Wed Oct 25 09:55:26 2006
@@ -98,3 +98,7 @@
         _ExceptionContext.__init__(self)
         self._resolvers = resolvers
         self._storage = _TempStore()
+
+    cdef void clear(self):
+        _ExceptionContext.clear(self)
+        self._storage.clear()

Modified: lxml/branch/lxml-1.1/src/lxml/etree.pyx
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/etree.pyx	(original)
+++ lxml/branch/lxml-1.1/src/lxml/etree.pyx	Wed Oct 25 09:55:26 2006
@@ -24,6 +24,9 @@
 cdef object super
 super = __builtin__.super
 
+cdef object StopIteration
+StopIteration = __builtin__.StopIteration
+
 del __builtin__
 
 cdef object _elementpath
@@ -41,6 +44,9 @@
 except ImportError:
     pass
 
+cdef object ITER_EMPTY
+ITER_EMPTY = iter(())
+
 # the rules
 # any libxml C argument/variable is prefixed with c_
 # any non-public function/class is prefixed with an underscore
@@ -1052,13 +1058,13 @@
         """Gets a list of attribute names. The names are returned in an arbitrary
         order (just like for an ordinary Python dictionary).
         """
-        return self.attrib.keys()
+        return python.PySequence_List( _attributeIteratorFactory(self, 1) )
 
     def items(self):
         """Gets element attributes, as a sequence. The attributes are returned in
         an arbitrary order.
         """
-        return self.attrib.items()
+        return python.PySequence_List( _attributeIteratorFactory(self, 3) )
 
     def getchildren(self):
         """Returns all subelements. The elements are returned in document order.
@@ -1302,10 +1308,7 @@
 
     # ACCESSORS
     def __repr__(self):
-        result = {}
-        for key, value in self.items():
-            result[key] = value
-        return repr(result)
+        return repr(dict( _attributeIteratorFactory(self._element, 3) ))
     
     def __getitem__(self, key):
         result = _getAttributeValue(self._element, key, None)
@@ -1338,17 +1341,8 @@
         return _getAttributeValue(self._element, key, default)
 
     def keys(self):
-        cdef xmlNode* c_node
-        cdef xmlAttr* c_attr
-        c_node = self._element._c_node
-        c_attr = c_node.properties
-        result = []
-        while c_attr is not NULL:
-            if c_attr.type == tree.XML_ATTRIBUTE_NODE:
-                python.PyList_Append(
-                    result, _namespacedName(<xmlNode*>c_attr))
-            c_attr = c_attr.next
-        return result
+        return python.PySequence_List(
+            _attributeIteratorFactory(self._element, 1) )
 
     def __iter__(self):
         return iter(self.keys())
@@ -1357,35 +1351,15 @@
         return iter(self.keys())
 
     def values(self):
-        cdef xmlNode* c_node
-        cdef xmlAttr* c_attr
-        c_node = self._element._c_node
-        c_attr = c_node.properties
-        result = []
-        while c_attr is not NULL:
-            if c_attr.type == tree.XML_ATTRIBUTE_NODE:
-                python.PyList_Append(
-                    result, _attributeValue(c_node, c_attr))
-            c_attr = c_attr.next
-        return result
+        return python.PySequence_List(
+            _attributeIteratorFactory(self._element, 2) )
 
     def itervalues(self):
         return iter(self.values())
 
     def items(self):
-        result = []
-        cdef xmlNode* c_node
-        cdef xmlAttr* c_attr
-        c_node = self._element._c_node
-        c_attr = c_node.properties
-        while c_attr is not NULL:
-            if c_attr.type == tree.XML_ATTRIBUTE_NODE:
-                python.PyList_Append(result, (
-                    _namespacedName(<xmlNode*>c_attr),
-                    _attributeValue(c_node, c_attr)
-                    ))
-            c_attr = c_attr.next
-        return result
+        return python.PySequence_List(
+            _attributeIteratorFactory(self._element, 3) )
 
     def iteritems(self):
         return iter(self.items())
@@ -1413,6 +1387,47 @@
             tree.xmlFree(c_result)
             return 1
 
+cdef class _AttribIterator:
+    """Attribute iterator - for internal use only!
+    """
+    # XML attributes must not be removed while running!
+    cdef _Element _node
+    cdef xmlAttr* _c_attr
+    cdef int _keysvalues # 1 - keys, 2 - values, 3 - items (key, value)
+    def __iter__(self):
+        return self
+
+    def __next__(self):
+        cdef xmlAttr* c_attr
+        if self._node is None:
+            raise StopIteration
+        c_attr = self._c_attr
+        while c_attr is not NULL and c_attr.type != tree.XML_ATTRIBUTE_NODE:
+            c_attr = c_attr.next
+        if c_attr is NULL:
+            self._node = None
+            raise StopIteration
+
+        self._c_attr = c_attr.next
+        if self._keysvalues == 1:
+            return _namespacedName(<xmlNode*>c_attr)
+        elif self._keysvalues == 2:
+            return _attributeValue(self._node._c_node, c_attr)
+        else:
+            return (_namespacedName(<xmlNode*>c_attr),
+                    _attributeValue(self._node._c_node, c_attr))
+
+cdef object _attributeIteratorFactory(_Element element, int keysvalues):
+    cdef _AttribIterator attribs
+    if element._c_node.properties is NULL:
+        return ITER_EMPTY
+    attribs = _AttribIterator()
+    attribs._node = element
+    attribs._c_attr = element._c_node.properties
+    attribs._keysvalues = keysvalues
+    return attribs
+
+
 ctypedef xmlNode* (*_node_to_node_function)(xmlNode*)
 
 cdef public class _ElementTagMatcher [ object LxmlElementTagMatcher,

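The iterator added in the etree.pyx hunk above walks a linked list of attribute nodes and returns keys, values or items depending on an integer mode. A pure-Python sketch of that pattern, assuming a hypothetical `Node` class in place of libxml2's `xmlAttr` structures (all names below are illustrative, not lxml API):

```python
KEYS, VALUES, ITEMS = 1, 2, 3  # same mode numbers as in the diff


class Node:
    """Hypothetical stand-in for one entry of a C-level attribute list."""

    def __init__(self, name, value, is_attribute=True):
        self.name = name
        self.value = value
        self.is_attribute = is_attribute
        self.next = None


def link(nodes):
    """Chain the nodes into a singly linked list and return its head."""
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes[0] if nodes else None


# An exhausted iterator stays exhausted, so one instance can be shared.
EMPTY_ITER = iter(())


class AttribIterator:
    def __init__(self, head, mode):
        self._node = head
        self._mode = mode

    def __iter__(self):
        return self

    def __next__(self):
        node = self._node
        # Skip entries that are not real attribute nodes.
        while node is not None and not node.is_attribute:
            node = node.next
        if node is None:
            self._node = None
            raise StopIteration
        self._node = node.next
        if self._mode == KEYS:
            return node.name
        if self._mode == VALUES:
            return node.value
        return (node.name, node.value)


def iterattributes(head, mode):
    """Return the shared empty iterator when there is nothing to walk."""
    if head is None:
        return EMPTY_ITER
    return AttribIterator(head, mode)


head = link([
    Node("id", "x1"),
    Node("#other", "", is_attribute=False),
    Node("class", "big"),
])
keys = list(iterattributes(head, KEYS))
values = list(iterattributes(head, VALUES))
items = dict(iterattributes(head, ITEMS))
```

As the comment in the diff warns, the list must not lose entries while such an iterator is running, because the iterator holds a reference to the next node.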
Modified: lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd	(original)
+++ lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd	Wed Oct 25 09:55:26 2006
@@ -103,6 +103,10 @@
     # return the value of attribute "{ns}name", or the default value
     cdef object getAttributeValue(_NodeBase element, key, default)
 
+    # return an iterator over attribute names (1), values (2) or items (3)
+    # attributes must not be removed during iteration!
+    cdef object iterattributes(_Element element, int keysvalues)
+
     # set an attribute value on an element
     # on failure, sets an exception and returns -1
     cdef int setAttributeValue(_NodeBase element, key, value) except -1

Modified: lxml/branch/lxml-1.1/src/lxml/objectify.pyx
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/objectify.pyx	(original)
+++ lxml/branch/lxml-1.1/src/lxml/objectify.pyx	Wed Oct 25 09:55:26 2006
@@ -941,7 +941,7 @@
                                    value, type(element).__name__)
     xsi_ns    = "{%s}" % XML_SCHEMA_INSTANCE_NS
     pytype_ns = "{%s}" % PYTYPE_NAMESPACE
-    for name, value in element.items():
+    for name, value in cetree.iterattributes(element, 3):
         if name == PYTYPE_ATTRIBUTE and value == TREE_PYTYPE:
             continue
         name = name.replace(xsi_ns, 'xsi:').replace(pytype_ns, 'py:')

Modified: lxml/branch/lxml-1.1/src/lxml/parser.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/parser.pxi	(original)
+++ lxml/branch/lxml-1.1/src/lxml/parser.pxi	Wed Oct 25 09:55:26 2006
@@ -461,6 +461,7 @@
             recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
             return _handleParseResult(pctxt, result, None, recover)
         finally:
+            self._context.clear()
             self._error_log.disconnect()
             self._unlockParser()
 
@@ -492,6 +493,7 @@
             recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
             return _handleParseResult(pctxt, result, None, recover)
         finally:
+            self._context.clear()
             self._error_log.disconnect()
             self._unlockParser()
 
@@ -519,6 +521,7 @@
             recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
             return _handleParseResult(pctxt, result, c_filename, recover)
         finally:
+            self._context.clear()
             self._error_log.disconnect()
             self._unlockParser()
 
@@ -542,6 +545,7 @@
             recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
             return _handleParseResult(pctxt, result, filename, recover)
         finally:
+            self._context.clear()
             self._error_log.disconnect()
             self._unlockParser()
 

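The docloader.pxi and parser.pxi hunks above make the parser clear its resolver context in a `finally` block, so that resources handed over by resolvers are released whether or not parsing succeeds. A simplified pure-Python sketch of that pattern, with a hypothetical `ResolverContext` class and `parse()` function standing in for the Pyrex code:

```python
import io


class ResolverContext:
    """Hypothetical context that remembers resources opened by resolvers."""

    def __init__(self):
        self._storage = []

    def open_resource(self, text):
        stream = io.StringIO(text)
        self._storage.append(stream)
        return stream

    def clear(self):
        # Close everything that was handed out and forget it.
        for stream in self._storage:
            stream.close()
        del self._storage[:]


def parse(context, text, fail=False):
    try:
        stream = context.open_resource(text)
        if fail:
            raise ValueError("simulated parse error")
        return stream.read().upper()
    finally:
        # Runs on success and on error alike.
        context.clear()


context = ResolverContext()
ok_result = parse(context, "<a/>")

try:
    parse(context, "<b/>", fail=True)
except ValueError:
    failed = True
else:
    failed = False
```

After both calls the context holds no open streams, which is the behaviour the CHANGES.txt entry about unclosed files describes.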
Modified: lxml/branch/lxml-1.1/src/lxml/public-api.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/public-api.pxi	(original)
+++ lxml/branch/lxml-1.1/src/lxml/public-api.pxi	Wed Oct 25 09:55:26 2006
@@ -80,6 +80,9 @@
 cdef public object getAttributeValue(_NodeBase element, key, default):
     return _getAttributeValue(element, key, default)
 
+cdef public object iterattributes(_Element element, int keysvalues):
+    return _attributeIteratorFactory(element, keysvalues)
+
 cdef public int setAttributeValue(_NodeBase element, key, value) except -1:
     return _setAttributeValue(element, key, value)
 

Modified: lxml/branch/lxml-1.1/src/lxml/xmlid.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/xmlid.pxi	(original)
+++ lxml/branch/lxml-1.1/src/lxml/xmlid.pxi	Wed Oct 25 09:55:26 2006
@@ -47,7 +47,7 @@
 
     The dictionary must be instantiated with the root element of a parsed XML
     document, otherwise the behaviour is undefined.  Elements and XML trees
-    that were created or modified through the API are not supported.
+    that were created or modified 'by hand' are not supported.
     """
     cdef _Document _doc
     cdef object _keys
@@ -89,7 +89,7 @@
         return c_id is not NULL
 
     def has_key(self, id_name):
-        return self.__contains__(id_name)
+        return id_name in self
 
     def __cmp__(self, other):
         if other is None:
@@ -113,68 +113,59 @@
         return repr(dict(self))
 
     def keys(self):
-        keys = self._keys
-        if keys is not None:
-            return python.PySequence_List(keys)
-        keys = self._build_keys()
-        self._keys = python.PySequence_Tuple(keys)
-        return keys
+        if self._keys is None:
+            self._keys = self._build_keys()
+        return self._keys[:]
 
     def __iter__(self):
-        keys = self._keys
-        if keys is None:
-            keys = self.keys()
-        return iter(keys)
+        if self._keys is None:
+            self._keys = self._build_keys()
+        return iter(self._keys)
 
     def iterkeys(self):
-        return self.__iter__()
+        return self
 
     def __len__(self):
-        keys = self._keys
-        if keys is None:
-            keys = self.keys()
-        return len(keys)
-
-    cdef object _build_keys(self):
-        keys = []
-        tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
-                         _collectIdHashKeys, <python.PyObject*>keys)
-        return keys
+        if self._keys is None:
+            self._keys = self._build_keys()
+        return len(self._keys)
 
     def items(self):
-        items = self._items
-        if items is not None:
-            return python.PySequence_List(items)
-        items = self._build_items()
-        self._items = python.PySequence_Tuple(items)
-        return items
+        if self._items is None:
+            self._items = self._build_items()
+        return self._items[:]
 
     def iteritems(self):
-        items = self._items
-        if items is None:
-            items = self.items()
-        return iter(items)
-
-    cdef object _build_items(self):
-        items = []
-        context = (items, self._doc)
-        tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
-                         _collectIdHashItemList, <python.PyObject*>context)
-        return items
+        if self._items is None:
+            self._items = self._build_items()
+        return iter(self._items)
 
     def values(self):
-        items = self._items
-        if items is None:
-            items = self.items()
+        if self._items is None:
+            self._items = self._build_items()
         values = []
-        for item in items:
+        for item in self._items:
             value = python.PyTuple_GET_ITEM(item, 1)
+            python.Py_INCREF(value)
             python.PyList_Append(values, value)
         return values
 
     def itervalues(self):
         return iter(self.values())
 
+    cdef object _build_keys(self):
+        keys = []
+        tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
+                         _collectIdHashKeys, <python.PyObject*>keys)
+        return keys
+
+    cdef object _build_items(self):
+        items = []
+        context = (items, self._doc)
+        tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
+                         _collectIdHashItemList, <python.PyObject*>context)
+        return items
+
 cdef void _collectIdHashItemDict(void* payload, void* context, char* name):
     # collect elements from ID attribute hash table
     cdef tree.xmlID* c_id

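The rewritten `keys()`, `__iter__()` and `__len__()` methods in the xmlid.pxi hunk above build the key list once, on first use, and hand out copies afterwards. A pure-Python sketch of that lazy-cache pattern follows; the `LazyKeyIndex` class and the `scan_ids()` function are hypothetical, and the sketch factors the repeated `is None` check into a helper, which the diff does not do.

```python
class LazyKeyIndex:
    """Hypothetical container that builds its key list on first access."""

    def __init__(self, build_keys):
        self._build_keys = build_keys
        self._keys = None

    def _ensure_keys(self):
        if self._keys is None:
            self._keys = list(self._build_keys())
        return self._keys

    def keys(self):
        # Return a copy so callers cannot modify the cached list.
        return self._ensure_keys()[:]

    def __iter__(self):
        return iter(self._ensure_keys())

    def __len__(self):
        return len(self._ensure_keys())


calls = []


def scan_ids():
    # Stands in for the expensive hash table scan; records each call.
    calls.append(1)
    return ["chapter1", "chapter2"]


index = LazyKeyIndex(scan_ids)
first = index.keys()
first.append("tampered")  # only changes the caller's copy
second = index.keys()
```

Returning a slice copy costs one list allocation per call, but keeps the cached list safe from changes made by the caller.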
Modified: lxml/branch/lxml-1.1/src/lxml/xslt.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/xslt.pxi	(original)
+++ lxml/branch/lxml-1.1/src/lxml/xslt.pxi	Wed Oct 25 09:55:26 2006
@@ -425,23 +425,26 @@
         _destroyFakeDoc(input_doc._c_doc, c_doc)
 
         self._error_log.disconnect()
-        if self._xslt_resolver_context._has_raised():
-            if c_result is not NULL:
-                tree.xmlFreeDoc(c_result)
-            self._xslt_resolver_context._raise_if_stored()
-
-        if c_result is NULL:
-            error = self._error_log.last_error
-            if error is not None and error.message:
-                if error.line >= 0:
-                    message = "%s, line %d" % (error.message, error.line)
+        try:
+            if self._xslt_resolver_context._has_raised():
+                if c_result is not NULL:
+                    tree.xmlFreeDoc(c_result)
+                self._xslt_resolver_context._raise_if_stored()
+
+            if c_result is NULL:
+                error = self._error_log.last_error
+                if error is not None and error.message:
+                    if error.line >= 0:
+                        message = "%s, line %d" % (error.message, error.line)
+                    else:
+                        message = error.message
+                elif error is not None and error.line >= 0:
+                    message = "Error applying stylesheet, line %d" % error.line
                 else:
-                    message = error.message
-            elif error.line >= 0:
-                message = "Error applying stylesheet, line %d" % error.line
-            else:
-                message = "Error applying stylesheet"
-            raise XSLTApplyError, message
+                    message = "Error applying stylesheet"
+                raise XSLTApplyError, message
+        finally:
+            self._xslt_resolver_context.clear()
 
         result_doc = _documentFactory(c_result, input_doc._parser)
         return _xsltResultTreeFactory(result_doc, self, profile_doc)

