[lxml-dev] space normalisation for .text and .tail
F Wolff
friedel at translate.org.za
Tue Jul 7 09:29:51 CEST 2009
On Sat, 2009-07-04 at 13:24 +0200, Stefan Behnel wrote:
> Hi,
>
> F Wolff wrote:
> > On 2009-03-24 I wrote about space normalisation with reference to the
> > xml:space attribute, and the string() and normalize-space() functions
> > in XPath. I solved my problem in code, partly due to slightly changing
> > requirements.
> >
> > Now I need to do similar magic, but need to handle the text nodes
> > separately, without descending into child nodes.
> >
> > From the XPath document:
> >> The string-value of an element node is the concatenation of the
> >> string-values of all text node descendants of the element node in
> >> document order.
> > ...which is not what I need to do in this case.
> >
> > Is there a way to apply normalize-space() to a node's .text or .tail
> > only? Is there another way to obtain the same result?
>
> Well, lxml will not allow you to modify individual text nodes that the
> parser created next to each other for whatever reason (likely due to
> implementation details), even if XPath allows you to get your hands on them
> using "text()". The text/tail properties are as deep down as it gets.
Sorry, let me rephrase: I don't need to alter the internal XML
structure, I just want to obtain normalised versions of the .text
and .tail nodes in a tree with text and xml elements intertwined.
For example:
<a>
Moo
<b>
Mew
</b>
bla bla
</a>
In this case I'm looking for a way to obtain the strings "Moo", "Mew",
and "bla bla" (with the spaces normalised). XPath's
normalize-space() can give me "Moo Mew bla bla", but I still want access
to each .text and .tail normalised separately.
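
A minimal sketch of one way to get this in plain Python, without XPath.
The helper names normalize_space() and normalized_parts() are made up for
illustration and are not part of lxml; the indentation of the sample
document is assumed. The regular expression covers only the four
whitespace characters that XPath 1.0 recognises (space, tab, CR, LF),
since str.split() would also collapse other Unicode whitespace.

```python
import re

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same .text/.tail API

# XPath 1.0 treats only space, tab, CR and LF as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value):
    """Mimic XPath normalize-space() on one string; None becomes ''."""
    if value is None:
        return ""
    return _XPATH_WS.sub(" ", value).strip(" \t\r\n")


def normalized_parts(el, is_root=True):
    """Yield (tag, 'text' or 'tail', normalised string) in document order."""
    is_element = isinstance(el.tag, str)  # lxml comments/PIs have a non-string tag
    if is_element:
        yield el.tag, "text", normalize_space(el.text)
        for child in el:
            yield from normalized_parts(child, is_root=False)
    if not is_root:  # the root's tail lies outside the document element
        yield (el.tag if is_element else None), "tail", normalize_space(el.tail)


doc = "<a>\n  Moo\n  <b>\n    Mew\n  </b>\n  bla   bla\n</a>"
parts = list(normalized_parts(etree.fromstring(doc)))
print(parts)
# [('a', 'text', 'Moo'), ('b', 'text', 'Mew'), ('b', 'tail', 'bla bla')]
```

Nothing in the tree is modified here; the strings are only read and
normalised on the way out.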
>
> > From the looks of
> > it, there is no reliable way that I can normalise correctly in code,
> > since I won't know if a newline (for example) was given as a newline or
> > as a character reference, and this should influence the normalisation.
>
> Why is that? XML parsers will always replace character references by their
> Unicode character value, and there is no way XPath could see them. If you
> need that information for your algorithm, you will have to parse the XML
> byte stream yourself. Neither the XML infoset nor the XPath data model
> provide this.
>
> Stefan
My understanding was that the normalisation does not touch entities, and
that the following two are not equivalent when normalised:
<a> </a>
vs.
<a>
</a>
...but playing now with normalize-space() it seems that they are
equivalent.
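
A quick check of this, as a sketch: it assumes the first case was meant
to contain a character reference such as &#10; (the archived copy shows
only whitespace there). Once parsed, both forms expose the same .text, so
any normalisation done afterwards cannot tell them apart.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback behaves the same here

# Newline written as a character reference (assumed form of the first case).
with_reference = etree.fromstring("<a>&#10;</a>")
# Newline written literally.
with_literal = etree.fromstring("<a>\n</a>")

# The parser has already replaced the reference by the character itself.
same_text = with_reference.text == with_literal.text
print(repr(with_reference.text), repr(with_literal.text), same_text)
```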
Would it be possible to normalise correctly in code in all cases?
Thank you for the help.
Friedel
--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009