[lxml-dev] parsing and serializing XML fragments

Stefan Behnel stefan_ml at behnel.de
Thu Jun 25 09:14:19 CEST 2009


Hi,

Hervé Cauwelier wrote:
> Hi, I'm trying to load fragments of XML to inject them in an existing
> document tree.
> 
> They look like this:
> 
>   <table:table table:name="%s" table:style-name="%s"/>

Just curious: why do you create a document in which you can do string
replacements?


> Converting the fragment to the "{uri}name" syntax is not an option since
> I must remain agnostic to the XML parser.

That can't be parsed by lxml's parser either.


> I would expect the XML() function to take an "nsmap" argument, like the
> xpath() method on elements, or parts of the API for subclassing elements.

No, it's a namespace aware XML parser, so it will reject documents that are
not namespace well-formed.


> For now I have another template, a complete document with namespace
> declaration, and I inject my fragment using string formatting. Lxml will
> parse it and I extract the first child element.

You can use the feed parser instead and do

    parser = etree.XMLParser(...)
    parser.feed('<?xml ...?><root xmlns:...>')
    parser.feed(the_fragment)
    parser.feed('</root>'
    fragment = parser.close()[0]

Feed parsers are reusable after a call to close(), BTW.


> I have looked at custom elements and other resolving methods but lxml
> was raising a namespace error before my "print"'s show up.

Element proxies are created /after/ parsing, so this won't help.


> Another issue is to save the element back to its snippet form, for unit
> test validation. Lxml will produce a valid document with namespace
> declaration. Either how to serialize without namespace declaration or
> how to remove it while keeping prefixes?

I put a lot of work into preventing lxml from serialising broken documents,
sorry. I also doubt that lxml's doctest support can help you here, as it
also requires parsing.

But you can insert your document into a new root Element that defines all
used namespaces (either fixed or collected at runtime), serialise that, and
strip the root element from top and bottom of the serialised byte string. I
do a bit of string mangling in lxml's own doctests to make them work in
both Py2 and Py3, it's not that hard to add these things. You can write a
little wrapper class around lxml.etree and override tostring() and parse()
to fit your needs (kudos to Fredrik for making them functions, BTW).

Here's an example:

http://codespeak.net/svn/lxml/trunk/doc/api.txt

Stefan


More information about the lxml-dev mailing list