[lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
Robert Pierce
robert at smithpierce.net
Wed Jun 3 16:32:45 CEST 2009
In case it isn't obvious, I'm not an XML guru and haven't been using
lxml for long, but truly IMHO:
I stipulate the importance of nil (or null) in schema definitions, as
well as in attaching types to the in-memory representation of the
tree. But from the standpoint of text representation,
<foo xsi:nil='true'/> doesn't seem to carry any additional information
over <foo/>.
My use case is passing XML through SQS, which has an upper bound of
about 6 kB (after HTTP headers are accounted for). When lxml annotates
empty elements, it attaches BOTH schema and type to each node, which
increases the size of the text representation of the element by a
factor of 4 or more. So I really have to deannotate it "all the way".
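To put a rough number on that overhead, here is a minimal sketch using
only the standard library. The names encoded_size, fits_budget and
SQS_BUDGET_BYTES are made up for illustration, the 6 kB figure is just
my estimate from above, and the annotated string assumes the xsi
namespace gets declared on the element itself:

```python
# Hypothetical helpers for illustration; not part of lxml or any SQS client.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SQS_BUDGET_BYTES = 6 * 1024  # assumption: "about 6 kB" usable per message


def encoded_size(xml_text):
    """Return the size in bytes of the UTF-8 encoded text."""
    return len(xml_text.encode("utf-8"))


def fits_budget(xml_text, budget=SQS_BUDGET_BYTES):
    """Return True if the encoded text fits within the byte budget."""
    return encoded_size(xml_text) <= budget


bare = "<foo/>"
# Assumed shape of a nil-annotated empty element with an inline
# namespace declaration.
nil_annotated = '<foo xmlns:xsi="%s" xsi:nil="true"/>' % XSI_NS

# Ratio of annotated size to bare size for a single empty element.
overhead_factor = encoded_size(nil_annotated) / encoded_size(bare)
```

With the declaration inlined like this, overhead_factor comes out well
above 4 for one empty element. If the namespace is declared once on an
ancestor, the per-element cost is lower.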
On 6/2/09, jholg at gmx.de <jholg at gmx.de> wrote:
> Consider this use case:
>
> >>> root = objectify.fromstring("""
> ... <root xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'><x xsi:nil='true'/></root>""")
> >>> print etree.tostring(root)
> <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><x xsi:nil="true"/></root>
> >>> objectify.deannotate(root) # Should this *remove* xsi:nil?!
> >>> print etree.tostring(root)
> <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><x xsi:nil="true"/></root>
> >>>
>
> I wouldn't want deannotate() to remove xsi:nil here.
I think it is impossible to retain input intent once a tree is parsed
into memory. Really, in the absence of a schema I shouldn't be able
to tell the difference between your input and
root = objectify.fromstring('<root><x/></root>')
or
root = objectify.fromstring('<root/>')
root.x = None
You can only ask for consistency on output. Currently, the output of
deannotate is not consistent in this case. In any event, type
constraints are more properly defined in a schema, aren't they? Just
because you passed me <root><x xsi:nil='true'/></root> doesn't
constrain me from passing you back <root><x><y/></x></root> unless
there's a schema that says otherwise.
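To make "consistency on output" concrete, here is a rough sketch of the
kind of post-processing I have in mind. It uses the standard library's
xml.etree.ElementTree rather than lxml.objectify, and strip_nil and
canonical are hypothetical helper names, not lxml functions:

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xsi:nil attribute in ElementTree notation.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"


def strip_nil(root):
    """Remove xsi:nil attributes from root and all descendants, in place."""
    for elem in root.iter():
        elem.attrib.pop(XSI_NIL, None)
    return root


def canonical(xml_text):
    """Parse the text, strip xsi:nil everywhere, and serialize to bytes."""
    return ET.tostring(strip_nil(ET.fromstring(xml_text)))


with_nil = (
    "<root xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>"
    "<x xsi:nil='true'/></root>"
)
without_nil = "<root><x/></root>"

# After stripping, both inputs serialize to the same bytes.
same = canonical(with_nil) == canonical(without_nil)
```

Once the attribute is gone, ElementTree no longer emits the xsi
namespace declaration either, so the two inputs become
indistinguishable on output.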
> What's the use case for a deannotate() that removes xsi:nil? Why not just
> assign '' instead of None and deannotate() afterwards?
As you suggest, I can set the element value to '', so it is a string
type and deannotate() removes the type. However, deannotate() followed
by tostring() then produces <foo></foo> rather than <foo/>... better,
but still not efficient. Of course, there is a valid argument that a
space-constrained API shouldn't use a bloated data format like XML at
all, but (for my API) it's too late to make that argument.
> A compromise may be to add another keyword arg "nil" to deannotate() to
> allow for xsi:nil removal if needed (defaults to False, of course :)
Works for me!