[lxml-dev] Spacing and the presence of xml:space="preserve"

Stefan Behnel stefan_ml at behnel.de
Tue Mar 24 16:32:01 CET 2009


F Wolff wrote:
> We are currently using this expression to obtain a plain text version
> of the content inside a node. For example:
>
> >>> from lxml import etree
> >>> string_xpath = etree.XPath("string()")
> >>> string_xpath(etree.fromstring("<a>  asdf  <b/>fdsa  </a>"))
> '  asdf  fdsa  '
>
> This works great and returns the string assuming xml:space="preserve",
> in other words, spacing is taken verbatim. We work on a file format
> where some of the spacing is very important (XLIFF). We generate such
> files with xml:space="preserve" in the necessary places. Not everybody
> generates such files, unfortunately, so we need to also handle the
> normalised versions. If I instead use the XPath function
> "normalize-space()", I get the normalised spacing:
> 'asdf fdsa'
>
> but unfortunately it does this even if xml:space="preserve" is set:
>
> >>> string_xpath = etree.XPath("normalize-space()")
> >>> string_xpath(etree.fromstring('''<a xml:space="preserve">  asdf
> ... <b/>fdsa  </a>'''))
> 'asdf fdsa'
>
>
> Unfortunately, I don't see a way to get the correct version (normalised
> by default, but with whitespace preserved if xml:space="preserve" is
> set). Do I have to handle the cases separately, or is there a way for
> lxml to help me by just doing the right thing?  I could special case on
> the node, but it would be a bit harder to know if some xml:space
> directive was given higher up in the tree.

Here is what the XPath 1.0 spec says about normalize-space():

"""
Function: string normalize-space(string?)

The normalize-space function returns the argument string with whitespace
normalized by stripping leading and trailing whitespace and replacing
sequences of whitespace characters by a single space. Whitespace
characters are the same as those allowed by the S production in XML. If
the argument is omitted, it defaults to the context node converted to a
string, in other words the string-value of the context node.
"""

So there is no reference to "xml:space" that would dictate a specific
behaviour, neither for the context node nor for subtrees.
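
Since XPath ignores xml:space, the inherited value has to be looked up
explicitly. A minimal sketch of one way to do that from Python, assuming the
nearest declaration on the node or one of its ancestors should win. The names
node_text(), effective_space, verbatim_text and normalised_text are made up
for illustration, and the sketch relies on the "xml" prefix being usable in
XPath without a namespace mapping:

```python
from lxml import etree

# Value of the nearest xml:space attribute on the node or its ancestors,
# or '' if none is declared. On the reverse ancestor-or-self axis,
# position 1 is the closest node.
effective_space = etree.XPath(
    "string(ancestor-or-self::*[@xml:space][1]/@xml:space)")
verbatim_text = etree.XPath("string()")
normalised_text = etree.XPath("normalize-space()")


def node_text(node):
    """Hypothetical helper: verbatim text under xml:space="preserve",
    whitespace-normalised text otherwise."""
    if effective_space(node) == "preserve":
        return verbatim_text(node)
    return normalised_text(node)


# The declaration sits on the parent, the text is read from the child.
root = etree.fromstring(
    '<unit xml:space="preserve"><a>  asdf  <b/>fdsa  </a></unit>')
print(repr(node_text(root[0])))
```

This decides once for the node that is passed in, so an xml:space attribute
that changes the setting somewhere inside that node's subtree is not taken
into account.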

But have you considered writing the required logic in XSLT instead of
plain XPath or Python? The "mode" attribute on XSLT's templates should
give you all that's needed here, and you'll still end up with a callable
that returns a string (built entirely in C space), just a bit smarter this
time.
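
To illustrate the idea, here is a rough sketch of what such a stylesheet
might look like when driven from lxml. It is an assumption about one possible
design, not a tested solution for XLIFF: the mode name "preserve", the
STYLESHEET constant and the plain_text() wrapper are all invented for this
example.

```python
from lxml import etree

STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Default mode: whitespace is normalised. -->

  <!-- No xml:space="preserve" here or below: normalise the whole subtree. -->
  <xsl:template match="*[not(.//@xml:space = 'preserve')]">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- This element asks for preservation: switch modes for its content. -->
  <xsl:template match="*[@xml:space = 'preserve']">
    <xsl:apply-templates mode="preserve"/>
  </xsl:template>

  <!-- Mixed case (preserve only further down): normalise text piecewise. -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- "preserve" mode: text is copied verbatim. -->

  <!-- xml:space="default" switches back to the default mode. -->
  <xsl:template match="*[@xml:space = 'default']" mode="preserve">
    <xsl:apply-templates select="."/>
  </xsl:template>

  <xsl:template match="text()" mode="preserve">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))


def plain_text(node):
    """Hypothetical wrapper: run the stylesheet and return a string."""
    return str(transform(node))


print(repr(plain_text(etree.fromstring(
    '<a xml:space="preserve">  asdf  <b/>fdsa  </a>'))))
print(repr(plain_text(etree.fromstring(
    '<a>  asdf  <b/>fdsa  </a>'))))
```

The sketch only looks at the node it is given and what lies below it, so an
xml:space inherited from further up would have to be passed in separately.
In the mixed case, each text node is normalised on its own, which also drops
whitespace between neighbouring text nodes.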

If you do this, please post the stylesheet. I think this might be
interesting to others, too.

Stefan


