[lxml-dev] Question about newlines
Stefan Behnel
stefan_ml at behnel.de
Sun Dec 9 18:57:11 CET 2007
Noah Slater wrote:
> On Sun, Dec 09, 2007 at 08:48:17AM +0100, Stefan Behnel wrote:
>> Serialisation will never alter content.
> [snip]
>>> 1) When adding a PI via the element.addprevious method, the PI has
>>> its tail trimmed, so when serialising, the PI runs into the
>>> root element.
>
> Well, this is well and good but lxml REMOVES the PI tail so I cannot
> insert a newline even if I want to.
Ah, got it. Thanks for insisting. :)
lxml.etree does this on purpose. If you allow character data around the
processing instructions that you add as siblings of the root node, you need to
make sure it's only whitespace (not 'real' data) to keep the in-memory tree
well-formed and to serialise well-formed XML. So the behaviour would be: strip
the tail, but keep it if it's whitespace. Sounds a bit ugly to me...
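A minimal sketch of the tail stripping described above, assuming a reasonably recent lxml where addprevious() still discards the tail at the root level. The stylesheet PI and its href are made up for illustration, and the exact serialised bytes may differ between lxml versions:

```python
from lxml import etree

root = etree.Element("root")

# Hypothetical stylesheet PI, used only for illustration.
pi = etree.ProcessingInstruction(
    "xml-stylesheet", 'type="text/xsl" href="style.xsl"')
pi.tail = "\n"  # ask for a line break between the PI and the root element

# The PI becomes a sibling that precedes the root element.
root.addprevious(pi)

# At the root level the tail is discarded, so the requested newline is gone.
tail_after_insert = pi.tail
print(repr(tail_after_insert))

# Serialise the whole document, including the siblings of the root.
serialised = etree.tostring(root.getroottree())
print(serialised)
```

Passing pretty_print=True to etree.tostring() is one way to get line breaks in the output without relying on tail text.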
I also noted that libxml2's parser drops whitespace at the root level, which
is perfectly fine, as it is most definitely ignorable whitespace.
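As a rough illustration of the parser behaviour, assuming lxml on top of libxml2 (the sample document is made up), the whitespace between root-level nodes never shows up as tail text after parsing:

```python
from lxml import etree

# Line breaks surround the PI, the root element and the trailing comment.
doc = etree.fromstring(
    b"<?pi some data?>\n\n<root/>\n<!-- trailing comment -->\n")

pi = doc.getprevious()   # the PI in front of the root element
comment = doc.getnext()  # the comment that follows the root element

# None of the root-level nodes carries the whitespace from the input.
root_level_tails = (pi.tail, doc.tail, comment.tail)
print(root_level_tails)
```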
I personally prefer having lxml add a line break when serialising processing
instructions and comments at the root level, and consistently dropping all tail
text of PIs and comments appended/prepended to a root node. So the behaviour
for the root level would be: drop all whitespace when parsing, and add line
breaks around PIs and comments on serialisation.
There's also the document ending issue. The document serialiser of libxml2
does append a newline, and one day, lxml may switch to using it. So I added
this behaviour now - and had to adapt tons of test cases that compare
serialised XML between ET and lxml. But I don't mind having whitespace
differences in the serialisation as long as it's well-formed, equivalent XML.
Stefan