[Lxml-checkins] r33707 - in lxml/branch/lxml-1.1: . doc src/lxml
scoder at codespeak.net
Wed Oct 25 09:55:31 CEST 2006
Author: scoder
Date: Wed Oct 25 09:55:26 2006
New Revision: 33707
Modified:
lxml/branch/lxml-1.1/CHANGES.txt
lxml/branch/lxml-1.1/doc/FAQ.txt
lxml/branch/lxml-1.1/doc/build.txt
lxml/branch/lxml-1.1/src/lxml/docloader.pxi
lxml/branch/lxml-1.1/src/lxml/etree.pyx
lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd
lxml/branch/lxml-1.1/src/lxml/objectify.pyx
lxml/branch/lxml-1.1/src/lxml/parser.pxi
lxml/branch/lxml-1.1/src/lxml/public-api.pxi
lxml/branch/lxml-1.1/src/lxml/xmlid.pxi
lxml/branch/lxml-1.1/src/lxml/xslt.pxi
Log:
big merge from trunk: xmlid fixes, _Attrib cleanup, FAQ about objectify, resolver context cleanup
Modified: lxml/branch/lxml-1.1/CHANGES.txt
==============================================================================
--- lxml/branch/lxml-1.1/CHANGES.txt (original)
+++ lxml/branch/lxml-1.1/CHANGES.txt Wed Oct 25 09:55:26 2006
@@ -18,6 +18,9 @@
Bugs fixed
----------
+* Open files and XML strings returned by Python resolvers were not
+ closed/freed
+
* Copying Comments and ProcessingInstructions failed
* Memory leak for external URLs in _XSLTProcessingInstruction.parseXSL()
Modified: lxml/branch/lxml-1.1/doc/FAQ.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/FAQ.txt (original)
+++ lxml/branch/lxml-1.1/doc/FAQ.txt Wed Oct 25 09:55:26 2006
@@ -12,9 +12,8 @@
1 General Questions
1.1 Is there a tutorial?
1.2 Where can I find more documentation about lxml?
- 1.3 What is the difference between lxml.etree and lxml.objectify?
- 1.4 Why is my application so slow?
- 1.5 Why do I get errors about missing UCS4 symbols when installing lxml?
+ 1.3 Why is my application so slow?
+ 1.4 Why do I get errors about missing UCS4 symbols when installing lxml?
2 Bugs
2.1 My application crashes! Why does lxml.etree do that?
2.2 I think I have found a bug in lxml. What should I do?
@@ -31,6 +30,9 @@
5.2 Why doesn't ``findall()`` support full XPath expressions?
5.3 How can I find out which namespace prefixes are used in a document?
5.4 How can I specify a default namespace for XPath expressions?
+ 6 lxml.objectify
+ 6.1 What is the difference between lxml.etree and lxml.objectify?
+ 6.2 Is there a way to speed up frequent element access?
General Questions
@@ -62,30 +64,6 @@
.. _`the web page`: http://codespeak.net/lxml/#documentation
-What is the difference between lxml.etree and lxml.objectify?
--------------------------------------------------------------
-
-The two modules provide different ways of handling XML. However, objectify
-builds on top of lxml.etree and therefore inherits most of its capabilities
-and a large portion of its API.
-
-* lxml.etree is a generic API for XML and HTML handling. It aims for
- ElementTree compatibility_ and supports the entire XML infoset. It is well
- suited for both mixed content and data centric XML. Its generality makes it
- the best choice for most applications.
-
-* lxml.objectify is a specialized API for XML data handling in a Python object
- syntax. It provides a very natural way to deal with data fields stored in a
- structurally well defined XML format. Data is automatically converted to
- Python data types and can be manipulated with normal Python operators. Look
- at the examples in the `objectify documentation`_ to see what it feels like
- to use it.
-
- Objectify is not well suited for mixed contents or HTML documents. As it is
- built on top of lxml.etree, however, it inherits the normal support for
- XPath, XSLT or validation.
-
-
Why is my application so slow?
------------------------------
@@ -178,7 +156,7 @@
Due to the way libxslt handles threading, concurrent access to stylesheets is
currently only possible if it was parsed in the main thread. Parsing and
-using a stylesheet inside one thread also works.
+applying a stylesheet inside one thread also works.
Warning: You should generally avoid modifying trees in other threads than the
one it was generated in. Although this should work in many cases, there are
@@ -200,10 +178,10 @@
The global interpreter lock (GIL) in Python serializes access to the
interpreter, so if the majority of your processing is done in Python code
-(traversing trees, modifying elements, etc.), your gain will be close to 0.
-The more of your XML processing moves into lxml, however, the higher your
-gain. If your application is bound by XML parsing and serialisation, or by
-complex XSLTs, your speedup on multi-processor machines can be substantial.
+(walking trees, modifying elements, etc.), your gain will be close to 0. The
+more of your XML processing moves into lxml, however, the higher your gain.
+If your application is bound by XML parsing and serialisation, or by complex
+XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support
multi-threading.
@@ -347,3 +325,78 @@
You can't. In XPath, there is no such thing as a default namespace. Just use
an arbitrary prefix and let the namespace dictionary of the XPath evaluators
map it to your namespace. See also the question above.
+
+
+lxml.objectify
+==============
+
+What is the difference between lxml.etree and lxml.objectify?
+-------------------------------------------------------------
+
+The two modules provide different ways of handling XML. However, objectify
+builds on top of lxml.etree and therefore inherits most of its capabilities
+and a large portion of its API.
+
+* lxml.etree is a generic API for XML and HTML handling. It aims for
+ ElementTree compatibility_ and supports the entire XML infoset. It is well
+ suited for both mixed content and data centric XML. Its generality makes it
+ the best choice for most applications.
+
+* lxml.objectify is a specialized API for XML data handling in a Python object
+ syntax. It provides a very natural way to deal with data fields stored in a
+ structurally well defined XML format. Data is automatically converted to
+ Python data types and can be manipulated with normal Python operators. Look
+ at the examples in the `objectify documentation`_ to see what it feels like
+ to use it.
+
+ Objectify is not well suited for mixed contents or HTML documents. As it is
+ built on top of lxml.etree, however, it inherits the normal support for
+ XPath, XSLT or validation.
+
+Is there a way to speed up frequent element access?
+---------------------------------------------------
+
+lxml.objectify creates Python representations of elements on the fly. To save
+memory, the normal Python garbage collection mechanisms will discard them when
+their last reference is gone. In cases where deeply nested elements are
+frequently accessed through the objectify API, the create-discard cycles can
+become a bottleneck, as elements have to be instantiated over and over again.
+
+If your benchmarks prove that the overhead is too high for your specific use
+case, here are some things to try:
+
+* If you often work in subtrees, assign the parent of the subtree to a
+ variable or pass it into functions instead of starting at the root. This
+ allows accessing its descendents more directly.
+
+* Use precompiled ObjectPath expressions instead of accessing deeply nested
+ elements step-by-step via object attributes.
+
+* Try assigning data values directly to attributes instead of passing them
+ through DataElement.
+
+* Run ``objectify.annotate()`` over read-only trees to speed up the attribute
+ type inference on access.
+
+* To prevent frequent object create-discard cycles, you can keep a permanent
+ reference to the Python objects in a tree. Just create a cache dictionary
+ and run::
+
+ cache[root] = list(root.getiterator())
+
+ after parsing and::
+
+ del cache[root]
+
+ when you are done with the tree. This will keep the Python element
+ representations of all elements alive and thus avoid the overhead of
+ repeated Python object creation. By choosing the right trees (or even
+ elements) to cache, you can trade memory usage against access speed.
+
+ Things to note: you cannot currently use ``weakref.WeakKeyDictionary``
+ objects for this as lxml's elements do not support weak references for
+ memory reasons. Also note that new element objects that you add to these
+ trees will not turn up in the cache automatically and will therefore still
+ be garbage collected when all their Python references are gone, so this is
+ most effective for largely immutable trees. You should consider using a set
+ instead of a list in this case and add new elements by hand.
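The caching recipe from the new FAQ entry above can be sketched as follows. This is only an illustration under assumptions: the standard library's xml.etree.ElementTree stands in for lxml so that the snippet runs without lxml installed, root.iter() is used as the current spelling of getiterator(), and the helper names pin_tree and unpin_tree are hypothetical and do not appear in the commit.

```python
import xml.etree.ElementTree as ET  # assumption: stand-in for lxml.etree / lxml.objectify

# The cache dictionary described in the FAQ entry.
cache = {}


def pin_tree(root):
    """Keep a reference to root and every element below it (hypothetical helper)."""
    cache[root] = list(root.iter())


def unpin_tree(root):
    """Drop the references again once the tree is no longer needed."""
    del cache[root]


root = ET.fromstring("<root><a><b/></a><c/></root>")
pin_tree(root)
pinned = len(cache[root])  # number of element objects kept alive
unpin_tree(root)
```

With the standard library the element objects persist anyway, so the snippet only shows the bookkeeping. The effect the FAQ describes applies to lxml, where the Python proxies are otherwise created and discarded on demand.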
Modified: lxml/branch/lxml-1.1/doc/build.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/build.txt (original)
+++ lxml/branch/lxml-1.1/doc/build.txt Wed Oct 25 09:55:26 2006
@@ -224,11 +224,11 @@
* check md5sum of created tar.gz file and place new sum and size in dsc file
* do ``dpkg-source -x lxml-...dsc`` and cd into the newly created directory
* run ``dch -i`` and add a comment like "use trunk version", this will
- increase the debian version number so apt/dpkg don't get confused
+ increase the debian version number so apt/dpkg won't get confused
* run ``dpkg-buildpackage -rfakeroot -us -uc`` to build the package
-Eventually dpkg-buildpackage will tell you that some dependecies are missing,
-you can either install them manually or run apt-get build-dep lxml
+In case ``dpkg-buildpackage`` tells you that some dependencies are missing, you
+can either install them manually or run ``apt-get build-dep lxml``.
That will give you .deb packages in the parent directory which can be
installed using ``dpkg -i``.
Modified: lxml/branch/lxml-1.1/src/lxml/docloader.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/docloader.pxi (original)
+++ lxml/branch/lxml-1.1/src/lxml/docloader.pxi Wed Oct 25 09:55:26 2006
@@ -98,3 +98,7 @@
_ExceptionContext.__init__(self)
self._resolvers = resolvers
self._storage = _TempStore()
+
+ cdef void clear(self):
+ _ExceptionContext.clear(self)
+ self._storage.clear()
Modified: lxml/branch/lxml-1.1/src/lxml/etree.pyx
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/etree.pyx (original)
+++ lxml/branch/lxml-1.1/src/lxml/etree.pyx Wed Oct 25 09:55:26 2006
@@ -24,6 +24,9 @@
cdef object super
super = __builtin__.super
+cdef object StopIteration
+StopIteration = __builtin__.StopIteration
+
del __builtin__
cdef object _elementpath
@@ -41,6 +44,9 @@
except ImportError:
pass
+cdef object ITER_EMPTY
+ITER_EMPTY = iter(())
+
# the rules
# any libxml C argument/variable is prefixed with c_
# any non-public function/class is prefixed with an underscore
@@ -1052,13 +1058,13 @@
"""Gets a list of attribute names. The names are returned in an arbitrary
order (just like for an ordinary Python dictionary).
"""
- return self.attrib.keys()
+ return python.PySequence_List( _attributeIteratorFactory(self, 1) )
def items(self):
"""Gets element attributes, as a sequence. The attributes are returned in
an arbitrary order.
"""
- return self.attrib.items()
+ return python.PySequence_List( _attributeIteratorFactory(self, 3) )
def getchildren(self):
"""Returns all subelements. The elements are returned in document order.
@@ -1302,10 +1308,7 @@
# ACCESSORS
def __repr__(self):
- result = {}
- for key, value in self.items():
- result[key] = value
- return repr(result)
+ return repr(dict( _attributeIteratorFactory(self._element, 3) ))
def __getitem__(self, key):
result = _getAttributeValue(self._element, key, None)
@@ -1338,17 +1341,8 @@
return _getAttributeValue(self._element, key, default)
def keys(self):
- cdef xmlNode* c_node
- cdef xmlAttr* c_attr
- c_node = self._element._c_node
- c_attr = c_node.properties
- result = []
- while c_attr is not NULL:
- if c_attr.type == tree.XML_ATTRIBUTE_NODE:
- python.PyList_Append(
- result, _namespacedName(<xmlNode*>c_attr))
- c_attr = c_attr.next
- return result
+ return python.PySequence_List(
+ _attributeIteratorFactory(self._element, 1) )
def __iter__(self):
return iter(self.keys())
@@ -1357,35 +1351,15 @@
return iter(self.keys())
def values(self):
- cdef xmlNode* c_node
- cdef xmlAttr* c_attr
- c_node = self._element._c_node
- c_attr = c_node.properties
- result = []
- while c_attr is not NULL:
- if c_attr.type == tree.XML_ATTRIBUTE_NODE:
- python.PyList_Append(
- result, _attributeValue(c_node, c_attr))
- c_attr = c_attr.next
- return result
+ return python.PySequence_List(
+ _attributeIteratorFactory(self._element, 2) )
def itervalues(self):
return iter(self.values())
def items(self):
- result = []
- cdef xmlNode* c_node
- cdef xmlAttr* c_attr
- c_node = self._element._c_node
- c_attr = c_node.properties
- while c_attr is not NULL:
- if c_attr.type == tree.XML_ATTRIBUTE_NODE:
- python.PyList_Append(result, (
- _namespacedName(<xmlNode*>c_attr),
- _attributeValue(c_node, c_attr)
- ))
- c_attr = c_attr.next
- return result
+ return python.PySequence_List(
+ _attributeIteratorFactory(self._element, 3) )
def iteritems(self):
return iter(self.items())
@@ -1413,6 +1387,47 @@
tree.xmlFree(c_result)
return 1
+cdef class _AttribIterator:
+ """Attribute iterator - for internal use only!
+ """
+ # XML attributes must not be removed while running!
+ cdef _Element _node
+ cdef xmlAttr* _c_attr
+ cdef int _keysvalues # 1 - keys, 2 - values, 3 - items (key, value)
+ def __iter__(self):
+ return self
+
+ def __next__(self):
+ cdef xmlAttr* c_attr
+ if self._node is None:
+ raise StopIteration
+ c_attr = self._c_attr
+ while c_attr is not NULL and c_attr.type != tree.XML_ATTRIBUTE_NODE:
+ c_attr = c_attr.next
+ if c_attr is NULL:
+ self._node = None
+ raise StopIteration
+
+ self._c_attr = c_attr.next
+ if self._keysvalues == 1:
+ return _namespacedName(<xmlNode*>c_attr)
+ elif self._keysvalues == 2:
+ return _attributeValue(self._node._c_node, c_attr)
+ else:
+ return (_namespacedName(<xmlNode*>c_attr),
+ _attributeValue(self._node._c_node, c_attr))
+
+cdef object _attributeIteratorFactory(_Element element, int keysvalues):
+ cdef _AttribIterator attribs
+ if element._c_node.properties is NULL:
+ return ITER_EMPTY
+ attribs = _AttribIterator()
+ attribs._node = element
+ attribs._c_attr = element._c_node.properties
+ attribs._keysvalues = keysvalues
+ return attribs
+
+
ctypedef xmlNode* (*_node_to_node_function)(xmlNode*)
cdef public class _ElementTagMatcher [ object LxmlElementTagMatcher,
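For readers who do not read Pyrex, the iteration scheme of _AttribIterator and its factory can be modelled in plain Python roughly like this. It is a sketch under assumptions, not the code from the commit: the Attr class is a hypothetical stand-in for the libxml2 xmlAttr list, and the names AttribIterator and iter_attributes are made up for the illustration. The mode numbers follow the diff (1 keys, 2 values, 3 items).

```python
from dataclasses import dataclass
from typing import Optional

ATTRIBUTE_NODE = "attribute"  # assumption: stand-in for tree.XML_ATTRIBUTE_NODE


@dataclass
class Attr:
    """Hypothetical stand-in for one node of the libxml2 attribute list."""
    name: str
    value: str
    type: str = ATTRIBUTE_NODE
    next: Optional["Attr"] = None


class AttribIterator:
    """Walks the linked list; mode 1 = keys, 2 = values, 3 = (key, value)."""

    def __init__(self, first, mode):
        self._attr = first
        self._mode = mode

    def __iter__(self):
        return self

    def __next__(self):
        attr = self._attr
        # Skip list entries that are not real attribute nodes.
        while attr is not None and attr.type != ATTRIBUTE_NODE:
            attr = attr.next
        if attr is None:
            self._attr = None
            raise StopIteration
        # Advance first, so the next call resumes behind this node.
        self._attr = attr.next
        if self._mode == 1:
            return attr.name
        if self._mode == 2:
            return attr.value
        return (attr.name, attr.value)


def iter_attributes(first, mode):
    """Factory: no attributes means an empty iterator (the diff shares ITER_EMPTY)."""
    if first is None:
        return iter(())
    return AttribIterator(first, mode)


attrs = Attr("id", "x1", next=Attr("skip", "", type="other", next=Attr("class", "big")))
keys = list(iter_attributes(attrs, 1))
values = list(iter_attributes(attrs, 2))
items = dict(iter_attributes(attrs, 3))
```

As in the diff, the list must not be modified while an iterator is running, because the iterator holds a pointer to the next node.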
Modified: lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd (original)
+++ lxml/branch/lxml-1.1/src/lxml/etreepublic.pxd Wed Oct 25 09:55:26 2006
@@ -103,6 +103,10 @@
# return the value of attribute "{ns}name", or the default value
cdef object getAttributeValue(_NodeBase element, key, default)
+ # return an iterator over attribute names (1), values (2) or items (3)
+ # attributes must not be removed during iteration!
+ cdef object iterattributes(_Element element, int keysvalues)
+
# set an attribute value on an element
# on failure, sets an exception and returns -1
cdef int setAttributeValue(_NodeBase element, key, value) except -1
Modified: lxml/branch/lxml-1.1/src/lxml/objectify.pyx
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/objectify.pyx (original)
+++ lxml/branch/lxml-1.1/src/lxml/objectify.pyx Wed Oct 25 09:55:26 2006
@@ -941,7 +941,7 @@
value, type(element).__name__)
xsi_ns = "{%s}" % XML_SCHEMA_INSTANCE_NS
pytype_ns = "{%s}" % PYTYPE_NAMESPACE
- for name, value in element.items():
+ for name, value in cetree.iterattributes(element, 3):
if name == PYTYPE_ATTRIBUTE and value == TREE_PYTYPE:
continue
name = name.replace(xsi_ns, 'xsi:').replace(pytype_ns, 'py:')
Modified: lxml/branch/lxml-1.1/src/lxml/parser.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/parser.pxi (original)
+++ lxml/branch/lxml-1.1/src/lxml/parser.pxi Wed Oct 25 09:55:26 2006
@@ -461,6 +461,7 @@
recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
return _handleParseResult(pctxt, result, None, recover)
finally:
+ self._context.clear()
self._error_log.disconnect()
self._unlockParser()
@@ -492,6 +493,7 @@
recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
return _handleParseResult(pctxt, result, None, recover)
finally:
+ self._context.clear()
self._error_log.disconnect()
self._unlockParser()
@@ -519,6 +521,7 @@
recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
return _handleParseResult(pctxt, result, c_filename, recover)
finally:
+ self._context.clear()
self._error_log.disconnect()
self._unlockParser()
@@ -542,6 +545,7 @@
recover = self._parse_options & xmlparser.XML_PARSE_RECOVER
return _handleParseResult(pctxt, result, filename, recover)
finally:
+ self._context.clear()
self._error_log.disconnect()
self._unlockParser()
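The four hunks above all add the same cleanup step to a finally block. A minimal Python sketch of why that placement matters, assuming a hypothetical ResolverContext class in place of the real resolver context and its _TempStore:

```python
class ResolverContext:
    """Hypothetical stand-in for the resolver context and its temporary store."""

    def __init__(self):
        self.storage = []

    def clear(self):
        # Forget everything the resolvers handed over during the last run.
        del self.storage[:]


def parse(context, data, fail=False):
    try:
        # A resolver would park file objects or strings here while parsing.
        context.storage.append(data)
        if fail:
            raise ValueError("parse error")
        return len(data)
    finally:
        # Runs on success and on error, so nothing stays referenced.
        context.clear()


ctx = ResolverContext()
ok_result = parse(ctx, "<root/>")
left_after_ok = len(ctx.storage)

try:
    parse(ctx, "<root>", fail=True)
except ValueError:
    pass
left_after_error = len(ctx.storage)
```

Clearing in the finally block rather than after the return statement is what covers the error path as well.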
Modified: lxml/branch/lxml-1.1/src/lxml/public-api.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/public-api.pxi (original)
+++ lxml/branch/lxml-1.1/src/lxml/public-api.pxi Wed Oct 25 09:55:26 2006
@@ -80,6 +80,9 @@
cdef public object getAttributeValue(_NodeBase element, key, default):
return _getAttributeValue(element, key, default)
+cdef public object iterattributes(_Element element, int keysvalues):
+ return _attributeIteratorFactory(element, keysvalues)
+
cdef public int setAttributeValue(_NodeBase element, key, value) except -1:
return _setAttributeValue(element, key, value)
Modified: lxml/branch/lxml-1.1/src/lxml/xmlid.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/xmlid.pxi (original)
+++ lxml/branch/lxml-1.1/src/lxml/xmlid.pxi Wed Oct 25 09:55:26 2006
@@ -47,7 +47,7 @@
The dictionary must be instantiated with the root element of a parsed XML
document, otherwise the behaviour is undefined. Elements and XML trees
- that were created or modified through the API are not supported.
+ that were created or modified 'by hand' are not supported.
"""
cdef _Document _doc
cdef object _keys
@@ -89,7 +89,7 @@
return c_id is not NULL
def has_key(self, id_name):
- return self.__contains__(id_name)
+ return id_name in self
def __cmp__(self, other):
if other is None:
@@ -113,68 +113,59 @@
return repr(dict(self))
def keys(self):
- keys = self._keys
- if keys is not None:
- return python.PySequence_List(keys)
- keys = self._build_keys()
- self._keys = python.PySequence_Tuple(keys)
- return keys
+ if self._keys is None:
+ self._keys = self._build_keys()
+ return self._keys[:]
def __iter__(self):
- keys = self._keys
- if keys is None:
- keys = self.keys()
- return iter(keys)
+ if self._keys is None:
+ self._keys = self._build_keys()
+ return iter(self._keys)
def iterkeys(self):
- return self.__iter__()
+ return self
def __len__(self):
- keys = self._keys
- if keys is None:
- keys = self.keys()
- return len(keys)
-
- cdef object _build_keys(self):
- keys = []
- tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
- _collectIdHashKeys, <python.PyObject*>keys)
- return keys
+ if self._keys is None:
+ self._keys = self._build_keys()
+ return len(self._keys)
def items(self):
- items = self._items
- if items is not None:
- return python.PySequence_List(items)
- items = self._build_items()
- self._items = python.PySequence_Tuple(items)
- return items
+ if self._items is None:
+ self._items = self._build_items()
+ return self._items[:]
def iteritems(self):
- items = self._items
- if items is None:
- items = self.items()
- return iter(items)
-
- cdef object _build_items(self):
- items = []
- context = (items, self._doc)
- tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
- _collectIdHashItemList, <python.PyObject*>context)
- return items
+ if self._items is None:
+ self._items = self._build_items()
+ return iter(self._items)
def values(self):
- items = self._items
- if items is None:
- items = self.items()
+ if self._items is None:
+ self._items = self._build_items()
values = []
- for item in items:
+ for item in self._items:
value = python.PyTuple_GET_ITEM(item, 1)
+ python.Py_INCREF(value)
python.PyList_Append(values, value)
return values
def itervalues(self):
return iter(self.values())
+ cdef object _build_keys(self):
+ keys = []
+ tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
+ _collectIdHashKeys, <python.PyObject*>keys)
+ return keys
+
+ cdef object _build_items(self):
+ items = []
+ context = (items, self._doc)
+ tree.xmlHashScan(<tree.xmlHashTable*>self._doc._c_doc.ids,
+ _collectIdHashItemList, <python.PyObject*>context)
+ return items
+
cdef void _collectIdHashItemDict(void* payload, void* context, char* name):
# collect elements from ID attribute hash table
cdef tree.xmlID* c_id
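The rewritten keys(), __iter__() and __len__() methods share one pattern: build the list on first use, keep it, and hand out copies to callers. A small Python sketch of that pattern, assuming a hypothetical IdIndex class and a builds counter that exists only to make the caching visible:

```python
class IdIndex:
    """Hypothetical stand-in for the ID dictionary in xmlid.pxi."""

    def __init__(self, source):
        self._source = source
        self._keys = None
        self.builds = 0  # illustration only: counts how often the list is built

    def _build_keys(self):
        self.builds += 1
        return list(self._source)

    def keys(self):
        if self._keys is None:
            self._keys = self._build_keys()
        # Return a copy so callers cannot modify the cached list.
        return self._keys[:]

    def __iter__(self):
        if self._keys is None:
            self._keys = self._build_keys()
        return iter(self._keys)

    def __len__(self):
        if self._keys is None:
            self._keys = self._build_keys()
        return len(self._keys)


index = IdIndex({"a": 1, "b": 2})
first = index.keys()
first.append("zzz")      # only changes the caller's copy
second = index.keys()
size = len(index)
```

The copy in keys() costs one list allocation per call, while __iter__() and __len__() read the cached list directly.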
Modified: lxml/branch/lxml-1.1/src/lxml/xslt.pxi
==============================================================================
--- lxml/branch/lxml-1.1/src/lxml/xslt.pxi (original)
+++ lxml/branch/lxml-1.1/src/lxml/xslt.pxi Wed Oct 25 09:55:26 2006
@@ -425,23 +425,26 @@
_destroyFakeDoc(input_doc._c_doc, c_doc)
self._error_log.disconnect()
- if self._xslt_resolver_context._has_raised():
- if c_result is not NULL:
- tree.xmlFreeDoc(c_result)
- self._xslt_resolver_context._raise_if_stored()
-
- if c_result is NULL:
- error = self._error_log.last_error
- if error is not None and error.message:
- if error.line >= 0:
- message = "%s, line %d" % (error.message, error.line)
+ try:
+ if self._xslt_resolver_context._has_raised():
+ if c_result is not NULL:
+ tree.xmlFreeDoc(c_result)
+ self._xslt_resolver_context._raise_if_stored()
+
+ if c_result is NULL:
+ error = self._error_log.last_error
+ if error is not None and error.message:
+ if error.line >= 0:
+ message = "%s, line %d" % (error.message, error.line)
+ else:
+ message = error.message
+ elif error.line >= 0:
+ message = "Error applying stylesheet, line %d" % error.line
else:
- message = error.message
- elif error.line >= 0:
- message = "Error applying stylesheet, line %d" % error.line
- else:
- message = "Error applying stylesheet"
- raise XSLTApplyError, message
+ message = "Error applying stylesheet"
+ raise XSLTApplyError, message
+ finally:
+ self._xslt_resolver_context.clear()
result_doc = _documentFactory(c_result, input_doc._parser)
return _xsltResultTreeFactory(result_doc, self, profile_doc)