[lxml-dev] Spacing and the presence of xml:space="preserve"

F Wolff friedel at translate.org.za
Tue Mar 24 11:39:48 CET 2009


Hallo all

We are currently using the XPath expression "string()" to obtain a
plain text version of the content inside a node. For example:
>>> from lxml import etree
>>> string_xpath = etree.XPath("string()")
>>> string_xpath(etree.fromstring("<a>  asdf  <b/>fdsa  </a>"))
'  asdf  fdsa  '

This works great and returns the string assuming xml:space="preserve",
in other words, spacing is taken verbatim. We work on a file format
where some of the spacing is very important (XLIFF). We generate such
files with xml:space="preserve" in the necessary places. Not everybody
generates such files, unfortunately, so we need to also handle the
normalised versions. If I instead use the XPath function
"normalize-space()" on the same input, I get the normalised spacing:
'asdf fdsa'

but unfortunately it does this even if xml:space="preserve" is set:

>>> string_xpath = etree.XPath("normalize-space()")
>>> string_xpath(etree.fromstring('''<a xml:space="preserve">  asdf  <b/>fdsa  </a>'''))
'asdf fdsa'


Unfortunately, I don't see a way to get the correct version (normalised
by default, but with white-space preserved if xml:space="preserve" is
set). Do I have to handle the cases separately, or is there a way for
lxml to help me by just doing the right thing?  I could special case on
the node, but it would be a bit harder to know if some xml:space
directive was given higher up in the tree. Or am I missing something in
XPath / lxml?
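
To make the desired behaviour concrete, here is a minimal sketch of the
"handle the cases separately" route. It walks up from the node with
getparent() until it finds the nearest xml:space attribute, then picks
"string()" or "normalize-space()" accordingly. The helper names
effective_space and node_text are hypothetical and not part of lxml.
The sketch also assumes that one setting applies to the whole node: a
descendant that overrides xml:space is not treated separately.

```python
from lxml import etree

# The "xml" prefix is always bound to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

string_xpath = etree.XPath("string()")
normalized_xpath = etree.XPath("normalize-space()")


def effective_space(node):
    """Return the nearest xml:space value on node or its ancestors.

    Hypothetical helper: falls back to "default" if none is declared.
    """
    while node is not None:
        value = node.get(XML_SPACE)
        if value is not None:
            return value
        node = node.getparent()
    return "default"


def node_text(node):
    """Plain text of node: verbatim under xml:space="preserve",
    whitespace-normalised otherwise (hypothetical helper)."""
    if effective_space(node) == "preserve":
        return string_xpath(node)
    return normalized_xpath(node)


plain = etree.fromstring("<a>  asdf  <b/>fdsa  </a>")
kept = etree.fromstring('<a xml:space="preserve">  asdf  <b/>fdsa  </a>')
print(repr(node_text(plain)))
print(repr(node_text(kept)))
```

Because the lookup stops at the nearest declaration, an
xml:space="default" on a nearer element overrides a "preserve" set
higher up in the tree.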

Any help would be appreciated.

Friedel Wolff


--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/video-virtaals-functionality


