[lxml-dev] problems with document(''), possibly thread related - LXML 'BUG'
Brad Clements
bkc at murkworks.com
Thu Aug 14 04:20:59 CEST 2008
Brad Clements wrote:
> I have a stylesheet that uses document('') to reference itself.
>
> The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10
>
> However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it
> does not work.
>
Now that I've had some sleep and another hour of google time, I have
been able to recreate the problem in a test program.
The big clue came from this old thread from 2006:
http://article.gmane.org/gmane.comp.python.lxml.devel/1083/match=document
Basically, that post makes me think that the document('') problem is
related to the base_url passed to fromstring().
That is, when document('') is processed, the base_url is used to look up
the stylesheet's canonical "URL", and then that URL is used to retrieve
the XML document tree that represents the stylesheet.
The problem here is that base_url could be wrong. It could be the same
value as that of some other document. In fact, I can recreate the problem
by setting base_url to the same value for both the xml source and the
stylesheet source.
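To make the suspected failure mode concrete, here is a toy model of a document lookup keyed by URL. It is only a sketch of the behaviour hypothesized above, not lxml's or libxslt's actual code; DocumentRegistry, resolve_self_reference and demo are made-up names for illustration.

```python
# Toy model only: NOT lxml's or libxslt's real code. It mimics a document
# registry keyed by URL to show why two documents sharing a base_url collide.


class DocumentRegistry(object):
    """Hypothetical URL -> document lookup table."""

    def __init__(self):
        self._docs = []  # (url, document) pairs in registration order

    def register(self, url, document):
        self._docs.append((url, document))

    def lookup(self, url):
        # The first document registered under this URL wins.
        for known_url, document in self._docs:
            if known_url == url:
                return document
        return None


def resolve_self_reference(registry, stylesheet_url):
    """Model document(''): look the stylesheet up by its own base URL."""
    return registry.lookup(stylesheet_url)


def demo(source_url, stylesheet_url):
    registry = DocumentRegistry()
    registry.register(source_url, "source document")
    registry.register(stylesheet_url, "stylesheet")
    return resolve_self_reference(registry, stylesheet_url)


print(demo("http://myfile.xml", "http://myfile.xml"))      # prints: source document
print(demo("http://source.xml", "http://stylesheet.xml"))  # prints: stylesheet
```

If the real lookup behaves anything like this, a shared base_url would hand back the xml source instead of the stylesheet, which would match the symptom I am seeing.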
My understanding of the reason for base_url was just so that resolvers
would have a basis for resolving relative lookups. That is certainly how
I use base_url ... as the only mechanism to set the URL that is passed
to my custom resolver.
It seems to me that after spending more than 5 hours trying to
troubleshoot this "problem" with document(''), I'm going to say that
this is a design flaw in lxml. I'm thinking that using base_url as a way
to get back the original stylesheet XML was convenient for the lxml
developers, but has left a big undocumented pitfall for lxml users.
The only documentation I could find on the website about base_url is on
http://codespeak.net/lxml/parsing.html#parsers where no mention is made
of the requirement to NOT use the same base_url for different documents.
Of course, I could be wrong here and I don't want to get anyone upset by
making invalid claims. My test case program is shown below.
When base_url is the same value for both the stylesheet and the xml
document, document('') fails in the stylesheet.
If base_url is different, it works.
--------------- test.py -----------
# Demonstrate the problem with a self-referencing stylesheet in lxml.
# The problem occurs when base_url is the same for both the stylesheet and
# the xml document.

from lxml import etree


class Resolver(etree.Resolver):
    def __init__(self):
        super(Resolver, self).__init__()

    def resolve(self, URL, ID, ctxt):
        # Log every lookup, then return None to fall back to the default
        # resolution.
        print("RESOLVE URL %r" % (URL, ))
        return None


stylesheet_src = """<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xf="http://www.w3.org/2002/xforms"
    xmlns:const="const.uri" version="1.0"
    exclude-result-prefixes="const">
  <xsl:output encoding="utf-8" method="xml" />
  <const:head-elements id="location-selector-model-container">
    <xf:model id="location-selector-model">
      <xf:instance xmlns="" id="location-selector-counties">
        <data>
          <test>Hi!</test>
        </data>
      </xf:instance>
    </xf:model>
  </const:head-elements>
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      </head>
      <body>
        <div>
          xf model id: <xsl:value-of
            select="document('')//const:head-elements/xf:model/@id" /> <br />
          expected value is: location-selector-model
        </div>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
"""

xml_src = """<?xml version="1.0"?>
<root />"""


def test():
    ss_parser = etree.XMLParser(load_dtd=True)
    ss_parser.resolvers.add(Resolver())
    stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser,
                                      base_url='http://myfile.xml')
    stylesheet = etree.XSLT(stylesheet_doc)

    doc_parser = etree.XMLParser(load_dtd=True)
    doc_parser.resolvers.add(Resolver())
    # Same base_url as the stylesheet: this is what triggers the failure.
    xml_doc = etree.fromstring(xml_src, doc_parser,
                               base_url='http://myfile.xml')
    print("%s" % stylesheet(xml_doc))


if __name__ == "__main__":
    test()
--
Brad Clements, bkc at murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements