[lxml-dev] problems with document(''), possibly thread related - LXML 'BUG'

Brad Clements bkc at murkworks.com
Thu Aug 14 04:20:59 CEST 2008


Brad Clements wrote:
> I have a stylesheet that uses document('') to reference itself.
>
> The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10
>
> However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it 
> does not work.
>
Now that I've had some sleep and another hour of google time, I have 
been able to recreate the problem in a test program.

The big clue came from this old thread from 2006:

http://article.gmane.org/gmane.comp.python.lxml.devel/1083/match=document

Basically, that post makes me think that the document('') problem is 
related to the base_url passed to fromstring().

That is, when document('') is processed, the base_url is used to look up 
the stylesheet's canonical "URL", and then that URL is used to retrieve 
the XML document tree that represents the stylesheet.
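
A minimal sketch of the value involved, assuming only that lxml is installed. It does not show lxml's internal stylesheet lookup; it only shows that the base_url given to fromstring() ends up as the URL recorded on the parsed document. The URL below is a made-up label and nothing is fetched.

```python
from lxml import etree

# Hypothetical URL, used purely as a label for the parsed document.
STYLE_URL = "http://example.invalid/style.xsl"

root = etree.fromstring("<root/>", base_url=STYLE_URL)

# The base_url is stored as the URL of the document that owns the element.
doc_url = root.getroottree().docinfo.URL
print(doc_url)
```

Two documents parsed with the same base_url therefore report the same URL, which is the situation described in the next paragraph.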

The problem here is that base_url could be wrong: it could be the same 
value as some other document's. In fact, I can recreate the problem by 
setting base_url to the same value for both the XML source and the 
stylesheet source.

My understanding of the reason for base_url was just so that resolvers 
would have a basis for resolving relative lookups. That is certainly how 
I use base_url ... as the only mechanism to set the URL that is passed 
to my custom resolver.
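
A rough sketch of that use of base_url, under stated assumptions: the RecordingResolver class, the root.dtd system identifier and the example.invalid URL are all invented for illustration. The resolver records whatever URL lxml hands it and answers with an inline DTD, so nothing is fetched over the network. The exact URL string the resolver receives may vary with the lxml/libxml2 version, so the sketch just prints it.

```python
from lxml import etree

seen_urls = []  # URLs handed to the resolver, in call order


class RecordingResolver(etree.Resolver):
    """Hypothetical resolver that records each URL it is asked to resolve."""

    def resolve(self, url, public_id, context):
        seen_urls.append(url)
        # Answer with an inline DTD so no network or file access is needed.
        return self.resolve_string("<!ELEMENT root EMPTY>", context)


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(RecordingResolver())

# The relative system identifier "root.dtd" has to be resolved somehow;
# base_url gives the parser something to resolve it against.
xml = '<!DOCTYPE root SYSTEM "root.dtd"><root/>'
etree.fromstring(xml, parser,
                 base_url="http://example.invalid/docs/input.xml")

print(seen_urls)
```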

It seems to me that after spending more than 5 hours trying to 
troubleshoot this "problem" with document(''), I'm going to say that 
this is a design flaw in lxml. I'm thinking that using base_url as a way 
to get back the original stylesheet XML was convenient for the lxml 
developers, but has left a big undocumented pitfall for lxml users.

The only documentation I could find on the website about base_url is at 
http://codespeak.net/lxml/parsing.html#parsers, where no mention is made 
of the requirement NOT to use the same base_url for different documents.

Of course, I could be wrong here, and I don't want to get anyone upset by 
making invalid claims. My test case program is shown below. When base_url 
is the same value for both the stylesheet and the XML document, 
document('') fails in the stylesheet.

If base_url is different, it works.

--------------- test.py -----------

# Demonstrate the problem with a self-referencing stylesheet in lxml.
# The problem occurs when base_url is the same for both the stylesheet
# and the xml document.

from lxml import etree

class Resolver(etree.Resolver):
   
    def __init__(self):
        # super() must name this class, not the base class
        super(Resolver, self).__init__()
       
    def resolve(self, URL, ID, ctxt):
        print("RESOLVE URL %r" % (URL,))
        return None

stylesheet_src = """<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xf="http://www.w3.org/2002/xforms"
                xmlns:const="const.uri" version="1.0"
                exclude-result-prefixes="const">
               
  <xsl:output encoding="utf-8" method="xml" />
  <const:head-elements id="location-selector-model-container">
    <xf:model  id="location-selector-model">
                <xf:instance xmlns="" id="location-selector-counties">
                    <data>
                        <test>Hi!</test>
                    </data>
                </xf:instance>
            </xf:model>
  </const:head-elements>
 
 
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
        </head>
        <body>
            <div>
               xf model id: <xsl:value-of select="document('')//const:head-elements/xf:model/@id" />  <br />
               expected value is: location-selector-model
            </div>
        </body>
    </html>
   </xsl:template>
</xsl:stylesheet>
"""

xml_src = """<?xml version="1.0"?>
<root />"""



def test():
    ss_parser = etree.XMLParser(load_dtd=True)
    ss_parser.resolvers.add(Resolver())

    stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser,
                                      base_url='http://myfile.xml')
    stylesheet = etree.XSLT(stylesheet_doc)
   
    doc_parser = etree.XMLParser(load_dtd=True)
    doc_parser.resolvers.add(Resolver())
                              
    xml_doc = etree.fromstring(xml_src, doc_parser,
                               base_url='http://myfile.xml')
   
    print("%s" % stylesheet(xml_doc))
   
if __name__ == "__main__":
    test()
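
For comparison, a condensed sketch of the case reported above as working, where the stylesheet and the input document get different base_url values. The stylesheet is a cut-down stand-in for the one in test.py. The transform() helper, the const:data element and the example.invalid URLs are made up for illustration.

```python
from lxml import etree

# Cut-down stand-in for the stylesheet in test.py: it reads an attribute
# from its own source via document('').
STYLESHEET = """<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:const="const.uri" version="1.0"
                exclude-result-prefixes="const">
  <const:data id="expected-id"/>
  <xsl:template match="/">
    <out><xsl:value-of select="document('')//const:data/@id"/></out>
  </xsl:template>
</xsl:stylesheet>
"""


def transform(stylesheet_url, document_url):
    """Hypothetical helper: run the stylesheet with the given base URLs."""
    style_doc = etree.fromstring(STYLESHEET, base_url=stylesheet_url)
    xml_doc = etree.fromstring("<root/>", base_url=document_url)
    return str(etree.XSLT(style_doc)(xml_doc))


# Distinct base URLs for the stylesheet and the input document.
result = transform("http://example.invalid/style.xsl",
                   "http://example.invalid/input.xml")
print(result)
```

Passing the same URL for both arguments is the condition under which test.py shows the failure.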
   





-- 
Brad Clements,                bkc at murkworks.com    (315)268-1000
http://www.murkworks.com                          
AOL-IM: BKClements


