[lxml-dev] Ignore namespace when parsing

Aaron Maxwell amax at redsymbol.net
Fri May 1 19:41:19 CEST 2009


Hi all,

When using Python's lxml to parse an XML document whose root element
declares a default namespace, is there some way for the library to let
me omit that namespace in queries?

Consider an XML document with this content:
{{{
<?xml version="1.0" ?>
<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>
}}}

If I parse it like this:
{{{
from lxml import etree

def ignore_ns(path_to_file):
    # Passing the path lets lxml open and close the file itself.
    x = etree.parse(path_to_file)
    for kid in x.getroot():
        print(kid.tag)
}}}

... where path_to_file points to a file containing the above XML
document, then this output is produced:

{{{
{http://redsymbol.net/SomeNamespace}Child1
{http://redsymbol.net/SomeNamespace}Child2
}}}

Alternatively, I can define a namespace-string stripping function
dynamically, and apply it as needed:

{{{
from lxml import etree

def strip_out_ns(path_to_file):
    x = etree.parse(path_to_file)
    # nsmap[None] is the default (unprefixed) namespace URI.
    ns = x.getroot().nsmap[None]
    def no_ns(s):
        return s.split('{'+ns+'}')[-1]
    for kid in x.getroot():
        print(no_ns(kid.tag))
}}}

The output of this is simpler:
{{{
Child1
Child2
}}}
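
For reference, the same stripping idea can be sketched without looking
up the namespace first, by cutting each tag at its closing brace. This
is only a minimal sketch: the helper name local_name and the SAMPLE
constant are made up for illustration, and the fallback import is an
assumption so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree reports tags in the same
    # '{uri}name' form, so it can stand in for this illustration.
    import xml.etree.ElementTree as etree

# The document from the top of this message, as bytes.
SAMPLE = b"""<?xml version="1.0" ?>
<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>
"""

def local_name(tag):
    """Return the tag without a leading '{namespace}' part, if any."""
    # Namespaced tags look like '{uri}name'; plain tags have no brace,
    # in which case rsplit returns the tag unchanged.
    return tag.rsplit('}', 1)[-1]

root = etree.fromstring(SAMPLE)
names = [local_name(kid.tag) for kid in root]
print(names)  # ['Child1', 'Child2']
```

Unlike the nsmap[None] lookup, this does not fail when the document has
no default namespace, but it still only cleans up tags after the fact.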

More commonly, I will want to search for a child element of some root,
using a query like 

{{{
rootElement.find('Child1')
}}}

(where rootElement is an Element object).  In the namespaced XML
document above, this call to .find() returns None, but

{{{
# ns found from rootElement.nsmap as above
rootElement.find('{' + ns + '}' + 'Child1')
}}}

will correctly find the child element.
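
To make the behaviour above easy to reproduce, here is a small
self-contained sketch. It parses the sample document from a bytes
string, reads the namespace URI from the root tag, and compares the
unqualified and qualified lookups. It also shows the prefix-map form of
find(); the prefix 's' and the SAMPLE constant are arbitrary choices
for this example, and the fallback import is an assumption so it runs
without lxml as well.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree behaves the same for these calls.
    import xml.etree.ElementTree as etree

# The document from the top of this message, as bytes.
SAMPLE = b"""<?xml version="1.0" ?>
<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>
"""

root = etree.fromstring(SAMPLE)

# Pull the namespace URI out of the root's own '{uri}Root' tag.
ns = root.tag[1:].split('}', 1)[0] if root.tag.startswith('{') else ''

# Unqualified name: no match, because the element lives in a namespace.
plain = root.find('Child1')

# Fully qualified name in '{uri}name' form: matches.
qualified = root.find('{' + ns + '}Child1')

# Same lookup through a prefix map; 's' is an arbitrary local prefix.
prefixed = root.find('s:Child1', namespaces={'s': ns})

print(plain is None)
print(qualified.tag)
print(prefixed.tag == qualified.tag)
```

The prefix map keeps the query strings shorter, but the namespace still
has to be supplied on every call, so it does not answer the question
below.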

In this kind of situation, where I just want to parse the document and
really don't care about the namespace, is there some way to construct
a parser that will ignore it in a more automated way?  Is there a
simpler, better approach, or some insight I'm missing?

Thanks everyone in advance.

Cheers,
Aaron

