[lxml-dev] space normalisation for .text and .tail

Stefan Behnel stefan_ml at behnel.de
Sat Jul 4 13:24:00 CEST 2009


Hi,

F Wolff wrote:
> On 2009-03-24 I wrote about space normalisation with reference to the
> xml:space attribute, and the string() and normalize-space() functions
> in XPath. I solved my problem in code, partly due to slightly changing
> requirements.
> 
> Now I need to do similar magic, but need to handle the text nodes
> separately, without descending into child nodes.
> 
> From the XPath document:
>> The string-value of an element node is the concatenation of the
>> string-values of all text node descendants of the element node in
>> document order.
> ...which is not what I need to do in this case.
> 
> Is there a way to apply normalize-space() to a node's .text or .tail
> only? Is there another way to obtain the same result?

Well, lxml will not let you modify individual text nodes that the parser
happened to create next to each other (that split is an implementation
detail), even though XPath lets you get your hands on them using "text()".
The .text and .tail properties are as deep down as it gets.
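
Working at that level, a rough sketch of normalising only an element's own
.text and .tail could look like the following. The helper names
(normalize_space, normalize_text_and_tail) and the sample document are made
up for illustration, and the regular expression is only my approximation of
XPath's normalize-space(), which collapses runs of space, tab, CR and LF.
The import falls back to the standard library's ElementTree, which has the
same .text/.tail properties.

```python
import re

try:
    from lxml import etree
except ImportError:  # stdlib ElementTree offers the same .text/.tail API
    import xml.etree.ElementTree as etree

# XML whitespace only: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value):
    """Approximate XPath normalize-space() for one string; None stays None."""
    if value is None:
        return None
    # Collapse each whitespace run to one space, then trim both ends.
    return _XML_WS.sub(" ", value).strip(" ")


def normalize_text_and_tail(element):
    """Normalise this element's own .text and .tail, leaving children alone."""
    element.text = normalize_space(element.text)
    element.tail = normalize_space(element.tail)


root = etree.fromstring("<p>  some\n  text <b>  bold  </b>  after\tbold  </p>")
normalize_text_and_tail(root)     # only the text before <b>
normalize_text_and_tail(root[0])  # the text inside <b> and the text after it
```

Note that trimming each piece separately also drops the space between
"text" and the <b> element, so the string value of <p> changes. Whether
that is acceptable depends on your requirements.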


> From the looks of
> it, there is no reliable way that I can normalise correctly in code,
> since I won't know if a newline (for example) was given as a newline or
> as a character reference, and this should influence the normalisation.

Why is that? XML parsers always replace character references with the
Unicode characters they stand for, so there is no way XPath could see them.
If you need that information for your algorithm, you will have to parse the
XML byte stream yourself. Neither the XML infoset nor the XPath data model
preserves the distinction.
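
A quick way to convince yourself, as a minimal sketch with made-up sample
documents: parse the same content once with a literal newline and once with
the character reference &#10;, then compare what ends up in .text.

```python
try:
    from lxml import etree
except ImportError:  # stdlib ElementTree behaves the same way here
    import xml.etree.ElementTree as etree

# "\n" in the Python string is a literal newline character in the XML.
literal = etree.fromstring("<a>x\ny</a>")
# Here the newline is written as a numeric character reference.
reference = etree.fromstring("<a>x&#10;y</a>")

# After parsing, both trees carry the same text content.
same = literal.text == reference.text
```

Once parsed, nothing in the tree records which of the two spellings was
used in the source.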

Stefan

