[Lxml-checkins] r51015 - in lxml/trunk: . doc src/lxml

scoder at codespeak.net
Fri Jan 25 10:36:17 CET 2008


Author: scoder
Date: Fri Jan 25 10:36:16 2008
New Revision: 51015

Modified:
   lxml/trunk/   (props changed)
   lxml/trunk/CHANGES.txt
   lxml/trunk/doc/tutorial.txt
   lxml/trunk/doc/xpathxslt.txt
   lxml/trunk/src/lxml/extensions.pxi
   lxml/trunk/src/lxml/python.pxd
Log:
 r3317 at delle:  sbehnel | 2008-01-25 10:35:30 +0100
 XPath string results are always smart objects, but no longer forced into unicode


Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt	(original)
+++ lxml/trunk/CHANGES.txt	Fri Jan 25 10:36:16 2008
@@ -8,6 +8,12 @@
 Features added
 --------------
 
+* Plain ASCII XPath string results are no longer forced into unicode
+  objects (as in 2.0beta1).
+
+* All XPath string results are 'smart' objects that have a
+  ``getparent()`` method to retrieve their parent Element.
+
 * ``with_tail`` option in serialiser functions.
 
 * More accurate exception messages in validator creation.

Modified: lxml/trunk/doc/tutorial.txt
==============================================================================
--- lxml/trunk/doc/tutorial.txt	(original)
+++ lxml/trunk/doc/tutorial.txt	Fri Jan 25 10:36:16 2008
@@ -312,18 +312,18 @@
     >>> print html.xpath("string()") # lxml.etree only!
     TEXTTAIL
     >>> print html.xpath("//text()") # lxml.etree only!
-    [u'TEXT', u'TAIL']
+    ['TEXT', 'TAIL']
 
 If you want to use this more often, you can wrap it in a function::
 
     >>> build_text_list = etree.XPath("//text()") # lxml.etree only!
     >>> print build_text_list(html)
-    [u'TEXT', u'TAIL']
+    ['TEXT', 'TAIL']
 
-Note that the ``text()`` function in XPath always returns unicode
-strings.  This is because it is actually a special object that knows
-about its origins.  You can ask it where it came from through its
-``getparent()`` method, just as you would with Elements::
+Note that a string result returned by XPath is a special 'smart'
+object that knows about its origins.  You can ask it where it came
+from through its ``getparent()`` method, just as you would with
+Elements::
 
     >>> texts = build_text_list(html)
     >>> print texts[0]
@@ -346,6 +346,16 @@
     >>> print texts[1].is_tail
     True
 
+While this works for the results of the ``text()`` function, lxml will
+not tell you the origin of a string value that was constructed by
+the XPath functions ``string()`` or ``concat()``::
+
+    >>> stringify = etree.XPath("string()")
+    >>> print stringify(html)
+    TEXTTAIL
+    >>> print stringify(html).getparent()
+    None
+
 
 Tree iteration
 --------------

Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt	(original)
+++ lxml/trunk/doc/xpathxslt.txt	Fri Jan 25 10:36:16 2008
@@ -136,13 +136,32 @@
 
 * a float, when the XPath expression has a numeric result (integer or float)
 
-* a (unicode) string, when the XPath expression has a string result.
+* a 'smart' string (as described below), when the XPath expression has
+  a string result.
 
-* a list of items, when the XPath expression has a list as result.  The items
-  may include elements (also comments and processing instructions), strings
-  and tuples.  Text nodes and attributes in the result are returned as strings
-  (the text node content or attribute value).  Namespace declarations are
-  returned as tuples of strings: ``(prefix, URI)``.
+* a list of items, when the XPath expression has a list as result.
+  The items may include Elements (also comments and processing
+  instructions), strings and tuples.  Text nodes and attributes in the
+  result are returned as 'smart' string values.  Namespace
+  declarations are returned as tuples of strings: ``(prefix, URI)``.
+
+XPath string results are 'smart' in that they provide a
+``getparent()`` method that knows their origin:
+
+* for attribute values, ``result.getparent()`` returns the Element
+  that carries them.  An example is ``//foo/@attribute``, where the
+  parent would be a ``foo`` Element.
+
+* for the ``text()`` function (as in ``//text()``), it returns the
+  Element that contains the text or tail that was returned.
+
+You can distinguish between different text origins with the boolean
+properties ``is_text``, ``is_tail`` and ``is_attribute``.
+
+Note that ``getparent()`` may not always return an Element.  For
+example, the XPath functions ``string()`` and ``concat()`` will
+construct strings that do not have an origin.  For them,
+``getparent()`` will return None.
 
 
 Generating XPath expressions

Modified: lxml/trunk/src/lxml/extensions.pxi
==============================================================================
--- lxml/trunk/src/lxml/extensions.pxi	(original)
+++ lxml/trunk/src/lxml/extensions.pxi	Fri Jan 25 10:36:16 2008
@@ -491,7 +491,8 @@
     elif xpathObj.type == xpath.XPATH_NUMBER:
         return xpathObj.floatval
     elif xpathObj.type == xpath.XPATH_STRING:
-        return funicode(xpathObj.stringval)
+        return _elementStringResultFactory(
+            funicode(xpathObj.stringval), None, 0, 0)
     elif xpathObj.type == xpath.XPATH_POINT:
         raise NotImplementedError
     elif xpathObj.type == xpath.XPATH_RANGE:
@@ -524,7 +525,7 @@
             value = _fakeDocElementFactory(doc, c_node)
         elif c_node.type == tree.XML_TEXT_NODE or \
                 c_node.type == tree.XML_ATTRIBUTE_NODE:
-            value = _newElementStringResult(doc, c_node)
+            value = _buildElementStringResult(doc, c_node)
         elif c_node.type == tree.XML_NAMESPACE_DECL:
             s = (<xmlNs*>c_node).href
             if s is NULL:
@@ -560,7 +561,7 @@
 ################################################################################
 # special str/unicode subclasses
 
-cdef class _ElementStringResult(python.unicode):
+cdef class _ElementUnicodeResult(python.unicode):
     cdef _Element parent
     cdef readonly object is_tail
     cdef readonly object is_text
@@ -569,27 +570,56 @@
     def getparent(self):
         return self.parent
 
-cdef object _newElementStringResult(_Document doc, xmlNode* c_node):
-    cdef _ElementStringResult result
+class _ElementStringResult(str):
+    # we need to use a Python class here, str cannot be C-subclassed
+    # in Pyrex/Cython
+    def getparent(self):
+        return self._parent
+
+cdef object _elementStringResultFactory(string_value, _Element parent,
+                                        bint is_attribute, bint is_tail):
+    cdef _ElementUnicodeResult uresult
+    cdef bint is_text
+    if parent is None:
+        is_text = 0
+    else:
+        is_text = not (is_tail or is_attribute)
+
+    if python.PyString_CheckExact(string_value):
+        result = _ElementStringResult(string_value)
+        result._parent = parent
+        result.is_attribute = is_attribute
+        result.is_tail = is_tail
+        result.is_text = is_text
+        return result
+    else:
+        uresult = _ElementUnicodeResult(string_value)
+        uresult.parent = parent
+        uresult.is_attribute = is_attribute
+        uresult.is_tail = is_tail
+        uresult.is_text = is_text
+        return uresult
+
+cdef object _buildElementStringResult(_Document doc, xmlNode* c_node):
+    cdef _Element parent
     cdef xmlNode* c_element
     cdef char* s
-    cdef bint is_attribute, is_tail
+    cdef bint is_attribute, is_text, is_tail
 
     if c_node.type == tree.XML_ATTRIBUTE_NODE:
         is_attribute = 1
         is_tail = 0
         s = tree.xmlNodeGetContent(c_node)
         try:
-            value = python.PyUnicode_DecodeUTF8(s, cstd.strlen(s), NULL)
+            value = funicode(s)
         finally:
             tree.xmlFree(s)
         c_element = NULL
     else:
         #assert c_node.type == tree.XML_TEXT_NODE, "invalid node type"
         is_attribute = 0
-        # tail text?
-        value = python.PyUnicode_DecodeUTF8(
-            c_node.content, cstd.strlen(c_node.content), NULL)
+        # may be tail text or normal text
+        value = funicode(c_node.content)
         c_element = _previousElement(c_node)
         is_tail = c_element is not NULL
 
@@ -599,15 +629,12 @@
         while c_element is not NULL and not _isElement(c_element):
             c_element = c_element.parent
 
-    if c_element is NULL:
-        return value
+    if c_element is not NULL:
+        parent = _fakeDocElementFactory(doc, c_element)
+
+    return _elementStringResultFactory(
+        value, parent, is_attribute, is_tail)
 
-    result = _ElementStringResult(value)
-    result.parent = _fakeDocElementFactory(doc, c_element)
-    result.is_attribute = is_attribute
-    result.is_tail = is_tail
-    result.is_text = not (is_tail or is_attribute)
-    return result
 
 ################################################################################
 # callbacks for XPath/XSLT extension functions

Modified: lxml/trunk/src/lxml/python.pxd
==============================================================================
--- lxml/trunk/src/lxml/python.pxd	(original)
+++ lxml/trunk/src/lxml/python.pxd	Fri Jan 25 10:36:16 2008
@@ -25,6 +25,7 @@
 
     cdef int PyUnicode_Check(object obj)
     cdef int PyString_Check(object obj)
+    cdef int PyString_CheckExact(object obj)
 
     cdef object PyUnicode_FromEncodedObject(object s, char* encoding,
                                             char* errors)
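
For readers who do not want to trace the Cython code, the flag logic of
_elementStringResultFactory in the extensions.pxi hunk can be sketched in
plain Python. This is an illustration only, not the lxml implementation:
the names SmartString and smart_string_factory are hypothetical, it uses a
single Python 3 str subclass instead of the separate str and unicode
classes in the commit, and the parent is a placeholder object rather than
a real Element.

```python
class SmartString(str):
    """Hypothetical stand-in for the smart string result classes."""

    def getparent(self):
        # Returns the object the string came from, or None if it has no origin.
        return self._parent


def smart_string_factory(string_value, parent, is_attribute, is_tail):
    # Same rule as in the diff: a string without a parent (for example the
    # result of string() or concat()) is not flagged as text, and otherwise
    # it is text exactly when it is neither tail nor attribute content.
    if parent is None:
        is_text = False
    else:
        is_text = not (is_tail or is_attribute)

    result = SmartString(string_value)
    result._parent = parent
    result.is_attribute = bool(is_attribute)
    result.is_tail = bool(is_tail)
    result.is_text = is_text
    return result


# Placeholder for an Element; any object works for this sketch.
parent = object()

text = smart_string_factory("TEXT", parent, False, False)
tail = smart_string_factory("TAIL", parent, False, True)
built = smart_string_factory("TEXTTAIL", None, False, False)
```

Here text and tail report their origin through getparent() and the
is_text / is_tail flags, while built behaves like the result of
string(): it is still a string, but getparent() returns None and all
three flags are false.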

