[lxml-dev] space normalisation for .text and .tail

F Wolff friedel at translate.org.za
Tue Jul 7 09:29:51 CEST 2009


On Sat, 2009-07-04 at 13:24 +0200, Stefan Behnel wrote:
> Hi,
> 
> F Wolff wrote:
> > On 2009-03-24 I wrote about space normalisation with reference to the
> > xml:space attribute, and the string() and normalize-space() functions
> > in XPath. I solved my problem in code, partly due to slightly changing
> > requirements.
> > 
> > Now I need to do similar magic, but need to handle the text nodes
> > separately, without descending into child nodes.
> > 
> > From the XPath document:
> >> The string-value of an element node is the concatenation of the
> >> string-values of all text node descendants of the element node in
> >> document order.
> > ...which is not what I need to do in this case.
> > 
> > Is there a way to apply normalize-space() to a node's .text or .tail
> > only? Is there another way to obtain the same result?
> 
> Well, lxml will not allow you to modify individual text nodes that the
> parser created next to each other for whatever reason (likely due to
> implementation details), even if XPath allows you to get your hands on them
> using "text()". The text/tail properties are as deep down as it gets.

Sorry, let me rephrase: I don't need to alter the internal XML
structure, I just want to obtain normalised versions of the .text
and .tail values in a tree where text and XML elements are intertwined.

For example:
<a>
    Moo
    <b>
    Mew
    </b>
    bla    bla
</a>

In this case I'm looking for a way to obtain the strings "Moo", "Mew",
and "bla bla" (with the spaces normalised).  XPath's
normalize-space() can give me "Moo Mew bla bla", but I still want access
to each .text and .tail, separately normalised.
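
To make the goal concrete, here is a minimal sketch of what I am after,
done in plain Python rather than XPath. The helper name normalize_space
is my own invention, not an lxml function; it assumes that the four
characters XPath 1.0 treats as whitespace (space, tab, CR, LF) are the
only ones that should be collapsed. The fallback import is only there so
the sketch also runs without lxml, since the standard library's
ElementTree exposes the same .text and .tail attributes.

```python
import re

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same .text/.tail API

# XPath 1.0 normalize-space() treats only these four characters as whitespace.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value):
    """Hypothetical helper: mimic normalize-space() on one string."""
    if value is None:  # .text and .tail are None when there is no text
        return ""
    # Collapse each whitespace run to one space, then trim both ends.
    return _XML_WS.sub(" ", value).strip(" ")


root = etree.fromstring(
    "<a>\n    Moo\n    <b>\n    Mew\n    </b>\n    bla    bla\n</a>"
)

# Visit every element and normalise its .text and .tail independently,
# without descending into child nodes for either value.
for el in root.iter():
    print(el.tag, repr(normalize_space(el.text)), repr(normalize_space(el.tail)))
```

Note that Python's own str.split() would also collapse characters such
as U+00A0, which XPath does not count as whitespace, hence the explicit
character class.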

> 
> > From the looks of
> > it, there is no reliable way that I can normalise correctly in code,
> > since I won't know if a newline (for example) was given as a newline or
> > as a character reference, and this should influence the normalisation.
> 
> Why is that? XML parsers will always replace character references by their
> Unicode character value, and there is no way XPath could see them. If you
> need that information for your algorithm, you will have to parse the XML
> byte stream yourself. Neither the XML infoset nor the XPath data model
> provide this.
> 
> Stefan

My understanding was that the normalisation does not touch character
references, and that the following two are not equivalent when normalised:
<a>&#10;</a>

vs.

<a>
</a>

...but playing with normalize-space() now, it seems that they are
equivalent.
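
A small check along these lines seems to agree with what Stefan
describes: by the time the tree is built, the character reference has
already been replaced, so .text holds the same string in both cases.
This is only a sketch; the variable names are mine, and the fallback
import is there so it runs with the standard library's ElementTree when
lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same .text API

# One document spells the newline as a character reference,
# the other as a literal newline character.
with_reference = etree.fromstring("<a>&#10;</a>")
with_literal = etree.fromstring("<a>\n</a>")

# Both parse to the same one-character string, so code working on
# .text cannot tell the two spellings apart.
print(repr(with_reference.text), repr(with_literal.text))
same = with_reference.text == with_literal.text
print(same)
```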

Would it be possible to normalise correctly in code in all cases?

Thank you for the help.
Friedel


--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009


