[lxml-dev] Spacing and the presence of xml:space="preserve"
F Wolff
friedel at translate.org.za
Tue Mar 24 11:39:48 CET 2009
Hallo all
We are currently using the XPath expression "string()" to obtain a plain
text version of the content inside a node. For example:
>>> from lxml import etree
>>> string_xpath = etree.XPath("string()")
>>> string_xpath(etree.fromstring("<a> asdf <b/>fdsa </a>"))
' asdf fdsa '
This works well and returns the string as if xml:space="preserve" were
set; in other words, spacing is taken verbatim. We work on a file format
where some of the spacing is very important (XLIFF). We generate such
files with xml:space="preserve" in the necessary places. Not everybody
generates such files, unfortunately, so we need to also handle the
normalised versions. If I use the XPath function "normalize-space()"
instead, I get the normalised spacing:
>>> string_xpath = etree.XPath("normalize-space()")
>>> string_xpath(etree.fromstring("<a> asdf <b/>fdsa </a>"))
'asdf fdsa'
but unfortunately it does this even if xml:space="preserve" is set:
>>> string_xpath(etree.fromstring('''<a xml:space="preserve"> asdf <b/>fdsa </a>'''))
'asdf fdsa'
Unfortunately, I don't see a way to get the correct version (normalised
by default, but with white-space preserved if xml:space="preserve" is
set). Do I have to handle the cases separately, or is there a way for
lxml to help me by just doing the right thing? I could special case on
the node, but it would be a bit harder to know if some xml:space
directive was given higher up in the tree. Or am I missing something in
XPath / lxml?
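To make the special-casing idea concrete, here is a minimal sketch of what
handling it by hand might look like. It walks up from the node to find the
nearest xml:space attribute and then picks one of the two XPath expressions
accordingly. The names XML_SPACE, effective_space and node_text are
hypothetical helpers made up for illustration, not part of lxml, and this is
only one possible approach, not necessarily the best one.

```python
from lxml import etree

# Clark-notation name of the xml:space attribute (the "xml" prefix is
# always bound to this namespace).
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

string_xpath = etree.XPath("string()")
normalize_xpath = etree.XPath("normalize-space()")


def effective_space(node):
    """Return the nearest xml:space value on the node or its ancestors.

    Hypothetical helper: falls back to "default" when no element up to
    the root carries the attribute.
    """
    while node is not None:
        value = node.get(XML_SPACE)
        if value is not None:
            return value
        node = node.getparent()
    return "default"


def node_text(node):
    """Hypothetical helper: verbatim text under xml:space="preserve",
    normalised text otherwise."""
    if effective_space(node) == "preserve":
        return string_xpath(node)
    return normalize_xpath(node)
```

This only looks at the setting in effect for the node itself. If a
descendant switches back with xml:space="default", its text is still taken
verbatim along with the rest, so mixed cases would need more work.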
Any help would be appreciated.
Friedel Wolff
--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/video-virtaals-functionality