[Lxml-checkins] r51015 - in lxml/trunk: . doc src/lxml
scoder at codespeak.net
Fri Jan 25 10:36:17 CET 2008
Author: scoder
Date: Fri Jan 25 10:36:16 2008
New Revision: 51015
Modified:
lxml/trunk/ (props changed)
lxml/trunk/CHANGES.txt
lxml/trunk/doc/tutorial.txt
lxml/trunk/doc/xpathxslt.txt
lxml/trunk/src/lxml/extensions.pxi
lxml/trunk/src/lxml/python.pxd
Log:
r3317 at delle: sbehnel | 2008-01-25 10:35:30 +0100
XPath string results are always smart objects, but no longer forced into unicode
Modified: lxml/trunk/CHANGES.txt
==============================================================================
--- lxml/trunk/CHANGES.txt (original)
+++ lxml/trunk/CHANGES.txt Fri Jan 25 10:36:16 2008
@@ -8,6 +8,12 @@
Features added
--------------
+* Plain ASCII XPath string results are no longer forced into unicode
+ objects (as in 2.0beta1).
+
+* All XPath string results are 'smart' objects that have a
+ ``getparent()`` method to retrieve their parent Element.
+
* ``with_tail`` option in serialiser functions.
* More accurate exception messages in validator creation.
Modified: lxml/trunk/doc/tutorial.txt
==============================================================================
--- lxml/trunk/doc/tutorial.txt (original)
+++ lxml/trunk/doc/tutorial.txt Fri Jan 25 10:36:16 2008
@@ -312,18 +312,18 @@
>>> print html.xpath("string()") # lxml.etree only!
TEXTTAIL
>>> print html.xpath("//text()") # lxml.etree only!
- [u'TEXT', u'TAIL']
+ ['TEXT', 'TAIL']
If you want to use this more often, you can wrap it in a function::
>>> build_text_list = etree.XPath("//text()") # lxml.etree only!
>>> print build_text_list(html)
- [u'TEXT', u'TAIL']
+ ['TEXT', 'TAIL']
-Note that the ``text()`` function in XPath always returns unicode
-strings. This is because it is actually a special object that knows
-about its origins. You can ask it where it came from through its
-``getparent()`` method, just as you would with Elements::
+Note that a string result returned by XPath is a special 'smart'
+object that knows about its origins. You can ask it where it came
+from through its ``getparent()`` method, just as you would with
+Elements::
>>> texts = build_text_list(html)
>>> print texts[0]
@@ -346,6 +346,16 @@
>>> print texts[1].is_tail
True
+While this works for the results of the ``text()`` function, lxml will
+not tell you the origin of a string value that was constructed by
+the XPath functions ``string()`` or ``concat()``::
+
+ >>> stringify = etree.XPath("string()")
+ >>> print stringify(html)
+ TEXTTAIL
+ >>> print stringify(html).getparent()
+ None
+
Tree iteration
--------------
Modified: lxml/trunk/doc/xpathxslt.txt
==============================================================================
--- lxml/trunk/doc/xpathxslt.txt (original)
+++ lxml/trunk/doc/xpathxslt.txt Fri Jan 25 10:36:16 2008
@@ -136,13 +136,32 @@
* a float, when the XPath expression has a numeric result (integer or float)
-* a (unicode) string, when the XPath expression has a string result.
+* a 'smart' string (as described below), when the XPath expression has
+ a string result.
-* a list of items, when the XPath expression has a list as result. The items
- may include elements (also comments and processing instructions), strings
- and tuples. Text nodes and attributes in the result are returned as strings
- (the text node content or attribute value). Namespace declarations are
- returned as tuples of strings: ``(prefix, URI)``.
+* a list of items, when the XPath expression has a list as result.
+ The items may include Elements (also comments and processing
+ instructions), strings and tuples. Text nodes and attributes in the
+ result are returned as 'smart' string values. Namespace
+ declarations are returned as tuples of strings: ``(prefix, URI)``.
+
+XPath string results are 'smart' in that they provide a
+``getparent()`` method that knows their origin:
+
+* for attribute values, ``result.getparent()`` returns the Element
+ that carries them. An example is ``//foo/@attribute``, where the
+ parent would be a ``foo`` Element.
+
+* for the ``text()`` function (as in ``//text()``), it returns the
+ Element that contains the text or tail that was returned.
+
+You can distinguish between different text origins with the boolean
+properties ``is_text``, ``is_tail`` and ``is_attribute``.
+
+Note that ``getparent()`` may not always return an Element. For
+example, the XPath functions ``string()`` and ``concat()`` will
+construct strings that do not have an origin. For them,
+``getparent()`` will return None.
Generating XPath expressions
Modified: lxml/trunk/src/lxml/extensions.pxi
==============================================================================
--- lxml/trunk/src/lxml/extensions.pxi (original)
+++ lxml/trunk/src/lxml/extensions.pxi Fri Jan 25 10:36:16 2008
@@ -491,7 +491,8 @@
elif xpathObj.type == xpath.XPATH_NUMBER:
return xpathObj.floatval
elif xpathObj.type == xpath.XPATH_STRING:
- return funicode(xpathObj.stringval)
+ return _elementStringResultFactory(
+ funicode(xpathObj.stringval), None, 0, 0)
elif xpathObj.type == xpath.XPATH_POINT:
raise NotImplementedError
elif xpathObj.type == xpath.XPATH_RANGE:
@@ -524,7 +525,7 @@
value = _fakeDocElementFactory(doc, c_node)
elif c_node.type == tree.XML_TEXT_NODE or \
c_node.type == tree.XML_ATTRIBUTE_NODE:
- value = _newElementStringResult(doc, c_node)
+ value = _buildElementStringResult(doc, c_node)
elif c_node.type == tree.XML_NAMESPACE_DECL:
s = (<xmlNs*>c_node).href
if s is NULL:
@@ -560,7 +561,7 @@
################################################################################
# special str/unicode subclasses
-cdef class _ElementStringResult(python.unicode):
+cdef class _ElementUnicodeResult(python.unicode):
cdef _Element parent
cdef readonly object is_tail
cdef readonly object is_text
@@ -569,27 +570,56 @@
def getparent(self):
return self.parent
-cdef object _newElementStringResult(_Document doc, xmlNode* c_node):
- cdef _ElementStringResult result
+class _ElementStringResult(str):
+ # we need to use a Python class here, str cannot be C-subclassed
+ # in Pyrex/Cython
+ def getparent(self):
+ return self._parent
+
+cdef object _elementStringResultFactory(string_value, _Element parent,
+ bint is_attribute, bint is_tail):
+ cdef _ElementUnicodeResult uresult
+ cdef bint is_text
+ if parent is None:
+ is_text = 0
+ else:
+ is_text = not (is_tail or is_attribute)
+
+ if python.PyString_CheckExact(string_value):
+ result = _ElementStringResult(string_value)
+ result._parent = parent
+ result.is_attribute = is_attribute
+ result.is_tail = is_tail
+ result.is_text = is_text
+ return result
+ else:
+ uresult = _ElementUnicodeResult(string_value)
+ uresult.parent = parent
+ uresult.is_attribute = is_attribute
+ uresult.is_tail = is_tail
+ uresult.is_text = is_text
+ return uresult
+
+cdef object _buildElementStringResult(_Document doc, xmlNode* c_node):
+ cdef _Element parent
cdef xmlNode* c_element
cdef char* s
- cdef bint is_attribute, is_tail
+ cdef bint is_attribute, is_text, is_tail
if c_node.type == tree.XML_ATTRIBUTE_NODE:
is_attribute = 1
is_tail = 0
s = tree.xmlNodeGetContent(c_node)
try:
- value = python.PyUnicode_DecodeUTF8(s, cstd.strlen(s), NULL)
+ value = funicode(s)
finally:
tree.xmlFree(s)
c_element = NULL
else:
#assert c_node.type == tree.XML_TEXT_NODE, "invalid node type"
is_attribute = 0
- # tail text?
- value = python.PyUnicode_DecodeUTF8(
- c_node.content, cstd.strlen(c_node.content), NULL)
+ # may be tail text or normal text
+ value = funicode(c_node.content)
c_element = _previousElement(c_node)
is_tail = c_element is not NULL
@@ -599,15 +629,12 @@
while c_element is not NULL and not _isElement(c_element):
c_element = c_element.parent
- if c_element is NULL:
- return value
+ if c_element is not NULL:
+ parent = _fakeDocElementFactory(doc, c_element)
+
+ return _elementStringResultFactory(
+ value, parent, is_attribute, is_tail)
- result = _ElementStringResult(value)
- result.parent = _fakeDocElementFactory(doc, c_element)
- result.is_attribute = is_attribute
- result.is_tail = is_tail
- result.is_text = not (is_tail or is_attribute)
- return result
################################################################################
# callbacks for XPath/XSLT extension functions
Modified: lxml/trunk/src/lxml/python.pxd
==============================================================================
--- lxml/trunk/src/lxml/python.pxd (original)
+++ lxml/trunk/src/lxml/python.pxd Fri Jan 25 10:36:16 2008
@@ -25,6 +25,7 @@
cdef int PyUnicode_Check(object obj)
cdef int PyString_Check(object obj)
+ cdef int PyString_CheckExact(object obj)
cdef object PyUnicode_FromEncodedObject(object s, char* encoding,
char* errors)
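
The factory logic added to extensions.pxi above can be sketched in plain
Python as follows. This is only an illustrative sketch, not the code lxml
ships: ``SmartStr``, ``make_smart_string`` and ``Element`` are hypothetical
names, and the sketch assumes Python 3, where there is a single string type,
so the str/unicode split made by ``_elementStringResultFactory`` collapses
into one class.

```python
class SmartStr(str):
    """Hypothetical stand-in for the 'smart' string result classes."""

    def getparent(self):
        # Returns the originating Element, or None if there is no origin.
        return self._parent


def make_smart_string(value, parent=None, is_attribute=False, is_tail=False):
    """Hypothetical analogue of the factory: attach origin info to a string."""
    # A string without a parent has no origin, so it is never plain "text".
    is_text = parent is not None and not (is_tail or is_attribute)
    result = SmartStr(value)
    result._parent = parent
    result.is_attribute = is_attribute
    result.is_tail = is_tail
    result.is_text = is_text
    return result


class Element:
    """Hypothetical placeholder for an Element; only carries a tag name."""

    def __init__(self, tag):
        self.tag = tag


body = Element("body")
text = make_smart_string("TEXT", body)                # like a text() result
tail = make_smart_string("TAIL", body, is_tail=True)  # like tail text
# Concatenating yields a plain str, so the constructed value is wrapped
# again without a parent, much like a string() or concat() result.
joined = make_smart_string(text + tail)
```

Under these assumptions, ``text.getparent()`` gives back ``body``, while
``joined.getparent()`` gives ``None`` and all three flags on ``joined`` are
false, which mirrors the ``parent is None`` branch of the factory.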