From faassen at infrae.com Fri Apr 7 15:56:21 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri Apr 7 15:55:44 2006 Subject: [lxml-dev] lxml for windows and python 2.3? Message-ID: <44366F85.1050906@infrae.com> Hi there, If someone happens to have a setup lying around to build this, I'd appreciate to see an egg or installer for lxml 0.9.1 on Windows for python 2.3. This because at Infrae we use Zope and this uses Python 2.3 still (at least before Zope 2.9), and we sometimes need to install on windows. Thanks! Regards, Martijn From tom-lxml-dev at fish.cx Wed Apr 12 18:48:51 2006 From: tom-lxml-dev at fish.cx (Tom Lynn) Date: Wed Apr 12 18:49:30 2006 Subject: [lxml-dev] bug - lxml crash after broken schema instantiation Message-ID: The following code causes a crash which kills Python, although if the incorrect XMLSchema line is removed it works:: ActivePython 2.4.1 Build 245 (ActiveState Corp.) based on Python 2.4.1 (#65, Mar 30 2005, 09:33:37) [MSC v.1310 32 bit (Intel)] on win32 >>> import lxml.etree >>> rng = lxml.etree.parse(file("devicemap.rng")) >>> lxml.etree.XMLSchema(rng) Traceback (most recent call last): File "", line 1, in ? File "xmlschema.pxi", line 32, in etree.XMLSchema.__init__ XMLSchemaParseError: Document is not valid XML Schema >>> lxml.etree.RelaxNG(rng) devicemap.rng is attached, and was generated from a DTD via Trang. Tom -- They're bouncy, trouncy, flouncy, pouncy Fun, fun, fun, fun, FUN! But the most wonderful thing about Immortals Is there can be only one. 
-------------- next part -------------- random ordered distributed true false all first From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 13 08:03:02 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 13 08:03:53 2006 Subject: [lxml-dev] bug - lxml crash after broken schema instantiation In-Reply-To: References: Message-ID: <443DE996.1070103@gkec.informatik.tu-darmstadt.de> Tom Lynn wrote: > The following code causes a crash which kills Python, although if the > incorrect XMLSchema line is removed it works:: > > ActivePython 2.4.1 Build 245 (ActiveState Corp.) based on > Python 2.4.1 (#65, Mar 30 2005, 09:33:37) [MSC v.1310 32 bit (Intel)] on win32 > >>> import lxml.etree > >>> rng = lxml.etree.parse(file("devicemap.rng")) > >>> lxml.etree.XMLSchema(rng) > Traceback (most recent call last): > File "", line 1, in ? > File "xmlschema.pxi", line 32, in etree.XMLSchema.__init__ > XMLSchemaParseError: Document is not valid XML Schema > >>> lxml.etree.RelaxNG(rng) Hi Tom, thanks for reporting this, I can reproduce it, even with a test case as simple as >>> from lxml.etree import XMLSchema, XML, ElementTree >>> et = ElementTree(XML("")) >>> XMLSchema(et) So it is not related to RelaxNG or your schema. This problem does not appear if the parsed document is within the XML-Schema namespace (we have a test case for that), but only if the document is not XML-Schema at all. This fact (and the Valgrind trace that shows xmlSchemaParse) makes me believe that this is a bug in libxml2, not lxml. Note that XML-Schema support is marked as "incomplete" even in the latest libxml2 versions. I applied a work-around that checks if the root node of the document passed into XMLSchema is within the XMLSchema namespace and otherwise rejects the document without calling libxml2 at all. Please check out and test the current SVN trunk version. 
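A minimal sketch of the kind of check the work-around describes, using only the standard library's ElementTree rather than lxml's internals. The helper name `root_in_schema_namespace` and the two sample documents are assumptions made for illustration; the actual lxml code may look quite different.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by W3C XML Schema documents.
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def root_in_schema_namespace(root):
    """Hypothetical helper: True if the root element lives in the XSD namespace."""
    # ElementTree spells qualified names as "{namespace}localname".
    return root.tag.startswith("{%s}" % XSD_NS)


# Two tiny sample documents (assumed for illustration only).
rng_root = ET.fromstring('<grammar xmlns="http://relaxng.org/ns/structure/1.0"/>')
xsd_root = ET.fromstring('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>')

for name, root in (("RelaxNG", rng_root), ("XML Schema", xsd_root)):
    if root_in_schema_namespace(root):
        print(name, "-> pass on to the schema parser")
    else:
        print(name, "-> reject early, without calling the schema parser")
```

Rejecting early keeps a document that is clearly not a schema away from the library call that crashed.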
http://codespeak.net/svn/lxml/trunk See here if you need help with the compilation: http://codespeak.net/lxml/installation.html Thanks, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 13 10:06:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 13 10:07:20 2006 Subject: [lxml-dev] bug - lxml crash after broken schema instantiation In-Reply-To: <443DE996.1070103@gkec.informatik.tu-darmstadt.de> References: <443DE996.1070103@gkec.informatik.tu-darmstadt.de> Message-ID: <443E0686.5080604@gkec.informatik.tu-darmstadt.de> Stefan Behnel wrote: > Tom Lynn wrote: >> The following code causes a crash which kills Python, although if the >> incorrect XMLSchema line is removed it works:: >> >> ActivePython 2.4.1 Build 245 (ActiveState Corp.) based on >> Python 2.4.1 (#65, Mar 30 2005, 09:33:37) [MSC v.1310 32 bit (Intel)] on win32 >> >>> import lxml.etree >> >>> rng = lxml.etree.parse(file("devicemap.rng")) >> >>> lxml.etree.XMLSchema(rng) >> Traceback (most recent call last): >> File "", line 1, in ? >> File "xmlschema.pxi", line 32, in etree.XMLSchema.__init__ >> XMLSchemaParseError: Document is not valid XML Schema >> >>> lxml.etree.RelaxNG(rng) > > I can reproduce it, even with a test case as simple as > > >>> from lxml.etree import XMLSchema, XML, ElementTree > >>> et = ElementTree(XML("")) > >>> XMLSchema(et) > > So it is not related to RelaxNG or your schema. This problem does not appear > if the parsed document is within the XML-Schema namespace (we have a test case > for that), but only if the document is not XML-Schema at all. > > This fact (and the Valgrind trace that shows xmlSchemaParse) makes me believe > that this is a bug in libxml2, not lxml. Note that XML-Schema support is > marked as "incomplete" even in the latest libxml2 versions. For the archives: This has been confirmed as being a bug in libxml2 (338303). The same bug is in RelaxNG (338306). Both were fixed today in the current CVS version of libxml2. 
http://bugzilla.gnome.org/show_bug.cgi?id=338303 http://bugzilla.gnome.org/show_bug.cgi?id=338306 The work-around in lxml will therefore stay for a while, since we cannot depend on the CVS version of libxml2 for this kind of bug. I will commit the changes to the lxml 0.9.x branch also. Since this is not really critical (most people will only pass RNG/Schema documents into RelaxNG/XMLSchema anyway), there won't be a 0.9.2 right away. Please use the SVN version for now if you need this problem fixed or report back if your 1000000-user-distributed-application depends on it. Stefan From tom-lxml-dev at fish.cx Thu Apr 13 12:52:05 2006 From: tom-lxml-dev at fish.cx (Tom Lynn) Date: Thu Apr 13 12:52:44 2006 Subject: [lxml-dev] bug - lxml crash after broken schema instantiation Message-ID: On Thu, 13 Apr 2006, Stefan Behnel wrote: > Hi Tom, > > thanks for reporting this, I can reproduce it, even with a test case as simple as > > >>> from lxml.etree import XMLSchema, XML, ElementTree > >>> et = ElementTree(XML("")) > >>> XMLSchema(et) > > So it is not related to RelaxNG or your schema. This problem does not appear > if the parsed document is within the XML-Schema namespace (we have a test case > for that), but only if the document is not XML-Schema at all. > > This fact (and the Valgrind trace that shows xmlSchemaParse) makes me believe > that this is a bug in libxml2, not lxml. Note that XML-Schema support is > marked as "incomplete" even in the latest libxml2 versions. I see you've submitted a libxml2 bug report too (#338303), which has already been marked as fixed. Remarkable stuff, this open source. :-) > I applied a work-around that checks if the root node of the document passed > into XMLSchema is within the XMLSchema namespace and otherwise rejects the > document without calling libxml2 at all. That's great, thanks. Tom -- Brain: Pinky, are you pondering what I'm pondering? Pinky: I think so, Brain, but if we give peas a chance, won't the lima beans feel left out? 
-- Pinky and the Brain, 'All You Need is Narf' From paul at zope-europe.org Sun Apr 16 18:59:27 2006 From: paul at zope-europe.org (Paul Everitt) Date: Sun Apr 16 19:00:23 2006 Subject: [lxml-dev] HTMLParser status and issues Message-ID: Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got: python setup.py build_ext -i running build_ext building 'lxml.etree' extension gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.4/include/python2.4 -c src/lxml/etree.c -o build/temp.darwin-8.6.0-Power_Macintosh-2.4/src/lxml/etree.o -w -I/usr/include/libxml2 src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__': src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function) src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once src/lxml/etree.c:17245: error: for each function it appears in.) src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function) src/lxml/etree.c: In function 'initetree': src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function) src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function) error: command 'gcc' failed with exit status 1 --Paul From paul at zope-europe.org Mon Apr 17 10:01:16 2006 From: paul at zope-europe.org (Paul Everitt) Date: Mon Apr 17 10:02:19 2006 Subject: [lxml-dev] Re: HTMLParser status and issues In-Reply-To: References: Message-ID: <44434B4C.3010707@zope-europe.org> Forgot to ask the question about status. :^) First, there are two branches: http://codespeak.net/svn/lxml/branch/htmlparse/ http://codespeak.net/svn/lxml/branch/htmlparser/ I'm presuming the latter is the one I want. Perhaps the former should get renamed to something less of a decoy? 
Next, once I get the parser working, I'd also like to use extensions as described here: http://codespeak.net/svn/lxml/trunk/doc/extensions.txt However, the htmlparser branch is older than the extensions work (I believe). Stefan, any chance the htmlparser branch could get the changes from the trunk? I'm particularly eager to get this combination working. The pipeline templating stuff I'm working on needs to handle non-well-formed HTML. It also needs a workaround for the fact that DOCTYPE (and encoding) information isn't available in the parse tree and thus isn't available in an XSLT template. As a workaround, I'd like to retrieve the information out-of-band and make it available as an extension function. --Paul Paul Everitt wrote: > > Howdy. I was giving the htmlparser branch a try. In trying to compile > it, I got: > > python setup.py build_ext -i > running build_ext > building 'lxml.etree' extension > gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp > -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -Wall > -Wstrict-prototypes > -I/Library/Frameworks/Python.framework/Versions/2.4/include/python2.4 -c > src/lxml/etree.c -o > build/temp.darwin-8.6.0-Power_Macintosh-2.4/src/lxml/etree.o -w > -I/usr/include/libxml2 > src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__': > src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first > use in this function) > src/lxml/etree.c:17245: error: (Each undeclared identifier is reported > only once > src/lxml/etree.c:17245: error: for each function it appears in.) 
> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first > use in this function) > src/lxml/etree.c: In function 'initetree': > src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first > use in this function) > src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first > use in this function) > error: command 'gcc' failed with exit status 1 > > --Paul From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Apr 17 10:36:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon Apr 17 10:36:20 2006 Subject: [lxml-dev] Re: HTMLParser status and issues In-Reply-To: <44434B4C.3010707@zope-europe.org> References: <44434B4C.3010707@zope-europe.org> Message-ID: <4443537C.9020707@gkec.informatik.tu-darmstadt.de> Hi Paul, Paul Everitt wrote: > First, there are two branches: > > http://codespeak.net/svn/lxml/branch/htmlparse/ > http://codespeak.net/svn/lxml/branch/htmlparser/ > > I'm presuming the latter is the one I want. Yes. I actually created that branch before I noticed that there already was a branch called "htmlparse"... > Perhaps the former should > get renamed to something less of a decoy? Would be better, yes. Anyway, if "htmlparser" gets merged into the trunk, that won't matter too much... > Next, once I get the parser working, I'd also like to use extensions as > described here: > > http://codespeak.net/svn/lxml/trunk/doc/extensions.txt > > However, the htmlparser branch is older than the extensions work (I > believe). Stefan, any chance the htmlparser branch could get the > changes from the trunk? Hmm, they should actually be in the branch. I merged them a while ago in order to make the diff usable. > I'm particularly eager to get this combination working. The pipeline > templating stuff I'm working on needs to handle non-well-formed HTML. 
It > also needs a workaround for the fact that DOCTYPE (and encoding) > information isn't available in the parse tree and thus isn't available > in an XSLT template. > > As a workaround, I'd like to retrieve the information out-of-band and > make it available as an extension function. Just try, it should work. The more you test the branch, the faster it can be merged into the trunk. Then you will have everything in there that the current trunk supports. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Apr 17 16:44:50 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon Apr 17 16:44:37 2006 Subject: [lxml-dev] HTMLParser status and issues In-Reply-To: References: Message-ID: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de> Paul Everitt wrote: > Howdy. I was giving the htmlparser branch a try. In trying to compile > it, I got: > > src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__': > src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first > use in this function) > src/lxml/etree.c:17245: error: (Each undeclared identifier is reported > only once > src/lxml/etree.c:17245: error: for each function it appears in.) > src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first > use in this function) > src/lxml/etree.c: In function 'initetree': > src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first > use in this function) > src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first > use in this function) > error: command 'gcc' failed with exit status 1 Hmm, I don't see a reason for that error. My clean checkout compiles nicely. What's your libxml2 version on MacOS? 
In my include/libxml2/HTMLparser.h it says somewhere around line 175:

typedef enum {
    HTML_PARSE_RECOVER  = 1<<0, /* Relaxed parsing */
    HTML_PARSE_NOERROR  = 1<<5, /* suppress error reports */
    HTML_PARSE_NOWARNING= 1<<6, /* suppress warning reports */
    HTML_PARSE_PEDANTIC = 1<<7, /* pedantic error reporting */
    HTML_PARSE_NOBLANKS = 1<<8, /* remove blank nodes */
    HTML_PARSE_NONET    = 1<<11,/* Forbid network access */
    HTML_PARSE_COMPACT  = 1<<16 /* compact small text nodes */
} htmlParserOption;

All options known in my place - but then, that's libxml 2.6.23 ...

If the above enum contains the variables in your system, would you mind
sending me the etree.c that Pyrex generated for you?

Stefan

From paul at zope-europe.org Mon Apr 17 19:21:24 2006
From: paul at zope-europe.org (Paul Everitt)
Date: Mon Apr 17 19:22:26 2006
Subject: [lxml-dev] Re: HTMLParser status and issues
In-Reply-To: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de>
References: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de>
Message-ID: <4443CE94.2030603@zope-europe.org>

Stefan Behnel wrote:
> Paul Everitt wrote:
>> Howdy. I was giving the htmlparser branch a try. In trying to compile
>> it, I got:
>>
>> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
>> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first
>> use in this function)
>> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported
>> only once
>> src/lxml/etree.c:17245: error: for each function it appears in.)
>> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first
>> use in this function)
>> src/lxml/etree.c: In function 'initetree':
>> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first
>> use in this function)
>> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first
>> use in this function)
>> error: command 'gcc' failed with exit status 1
>
>
> Hmm, I don't see a reason for that error. My clean checkout compiles nicely.
> > What's your libxml2 version on MacOS? In my include/libxml2/HTMLparser.h it > says somewhere around line 175: $ xmllint --version xmllint: using libxml version 20622 You're not OS X, right? > typedef enum { > HTML_PARSE_RECOVER = 1<<0, /* Relaxed parsing */ > HTML_PARSE_NOERROR = 1<<5, /* suppress error reports */ > HTML_PARSE_NOWARNING= 1<<6, /* suppress warning reports */ > HTML_PARSE_PEDANTIC = 1<<7, /* pedantic error reporting */ > HTML_PARSE_NOBLANKS = 1<<8, /* remove blank nodes */ > HTML_PARSE_NONET = 1<<11,/* Forbid network access */ > HTML_PARSE_COMPACT = 1<<16 /* compact small text nodes */ > } htmlParserOption; > > > All options known in my place - but then, that's libxml 2.6.23 ... That will be kinda funny if .22 is the smoking gun. ;^) > If the above enum contains the variables in your system, would you mind > sending me the etree.c that Pyrex generated for you? Yep, I'll send it in a private note. Thanks! --Paul From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Apr 17 22:16:31 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon Apr 17 22:14:40 2006 Subject: [lxml-dev] Re: HTMLParser status and issues In-Reply-To: <4443CE94.2030603@zope-europe.org> References: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de> <4443CE94.2030603@zope-europe.org> Message-ID: <4443F79F.90507@gkec.informatik.tu-darmstadt.de> Paul Everitt wrote: > Stefan Behnel wrote: >> Paul Everitt wrote: >>> Howdy. I was giving the htmlparser branch a try. In trying to compile >>> it, I got: >>> >>> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__': >>> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first >>> use in this function) >>> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported >>> only once >>> src/lxml/etree.c:17245: error: for each function it appears in.) 
>>> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first >>> use in this function) >>> src/lxml/etree.c: In function 'initetree': >>> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first >>> use in this function) >>> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first >>> use in this function) >>> error: command 'gcc' failed with exit status 1 >> >> >> Hmm, I don't see a reason for that error. My clean checkout compiles >> nicely. >> >> What's your libxml2 version on MacOS? In my >> include/libxml2/HTMLparser.h it >> says somewhere around line 175: > > $ xmllint --version > xmllint: using libxml version 20622 > > You're not OS X, right? I'm on Linux. 2.6.22 should work perfectly, I just checked. >> typedef enum { >> HTML_PARSE_RECOVER = 1<<0, /* Relaxed parsing */ >> HTML_PARSE_NOERROR = 1<<5, /* suppress error reports */ >> HTML_PARSE_NOWARNING= 1<<6, /* suppress warning reports */ >> HTML_PARSE_PEDANTIC = 1<<7, /* pedantic error reporting */ >> HTML_PARSE_NOBLANKS = 1<<8, /* remove blank nodes */ >> HTML_PARSE_NONET = 1<<11,/* Forbid network access */ >> HTML_PARSE_COMPACT = 1<<16 /* compact small text nodes */ >> } htmlParserOption; >> >> All options known in my place - but then, that's libxml 2.6.23 ... > > That will be kinda funny if .22 is the smoking gun. ;^) > >> If the above enum contains the variables in your system, would you mind >> sending me the etree.c that Pyrex generated for you? > > Yep, I'll send it in a private note. Thanks! Thanks. I really can't see a problem in there. Maybe it's a compiler issue. I rewrote a part that might have shown a different usage of those two enum values. Could you retry with the current SVN? 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 20 08:10:50 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 20 08:11:31 2006 Subject: [lxml-dev] Re: HTMLParser status and issues In-Reply-To: <4444AF44.80302@gkec.informatik.tu-darmstadt.de> References: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de> <4443CE94.2030603@zope-europe.org> <4443F79F.90507@gkec.informatik.tu-darmstadt.de> <73DCD630-05BA-4221-8CC5-B729573C6BF8@zeapartners.org> <4443FAA0.9070909@gkec.informatik.tu-darmstadt.de> <4B044BD8-B441-4DD7-A620-AAAD33088A27@zeapartners.org> <4444A60F.3060408@gkec.informatik.tu-darmstadt.de> <4444AF44.80302@gkec.informatik.tu-darmstadt.de> Message-ID: <444725EA.2020606@gkec.informatik.tu-darmstadt.de> Stefan Behnel wrote: > Paul Everitt wrote: >> I got this: >> >> # 175 "/usr/include/libxml2/libxml/HTMLparser.h" >> typedef enum { >> HTML_PARSE_NOERROR = 1<<5, >> HTML_PARSE_NOWARNING= 1<<6, >> HTML_PARSE_PEDANTIC = 1<<7, >> HTML_PARSE_NOBLANKS = 1<<8, >> HTML_PARSE_NONET = 1<<11 >> } htmlParserOption; > > > That's not libxml2 2.6.22 then. I think your C compiler uses the Mac-OS system > libraries instead of the libraries installed by what your xmllint uses. Back to this issue, the missing option HTML_PARSE_RECOVER came up in libxml2 2.6.21, while Mac-OS Tiger ships with 2.6.16. However, it looks like the HTML_PARSE_* options follow the numeric values of the XML_PARSE_* enum exactly. So, as a work-around, we could use XML_PARSE_RECOVER to make it compile and simply state that libxml2 2.6.21+ is required for parsing broken HTML. That way, it would keep working with the system libraries on Mac-OS X. Paul, I applied the above change to the branch for now. I'd be glad if you could check that it now compiles with the Mac-OS system libraries. Please run the test suite. If everything works as expected, only the test case(s) for parsing broken HTML should fail. 
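The observation that the HTML_PARSE_* options share their numeric values with the XML_PARSE_* options can be sketched as a small comparison. The HTML values are the ones from the enum quoted earlier in this thread; the XML values are a subset as I understand them from libxml2's parser.h, so treat them as an assumption and check them against your installed headers.

```python
# HTML parser options, as quoted from HTMLparser.h earlier in the thread.
HTML_PARSE = {
    "RECOVER": 1 << 0,
    "NOERROR": 1 << 5,
    "NOWARNING": 1 << 6,
    "PEDANTIC": 1 << 7,
    "NOBLANKS": 1 << 8,
    "NONET": 1 << 11,
    "COMPACT": 1 << 16,
}

# Matching subset of the XML parser options (assumed from libxml2's parser.h).
XML_PARSE = {
    "RECOVER": 1 << 0,
    "NOERROR": 1 << 5,
    "NOWARNING": 1 << 6,
    "PEDANTIC": 1 << 7,
    "NOBLANKS": 1 << 8,
    "NONET": 1 << 11,
    "COMPACT": 1 << 16,
}

# Names whose bit value differs between the two enums (expected to be empty).
mismatches = sorted(
    name for name, value in HTML_PARSE.items() if XML_PARSE.get(name) != value
)

for name, value in HTML_PARSE.items():
    print("%-10s html=0x%05x xml=0x%05x" % (name, value, XML_PARSE[name]))
print("mismatches:", mismatches)
```

If the values line up like this, passing XML_PARSE_RECOVER where HTML_PARSE_RECOVER is undeclared sets the same bit, which is why the work-around compiles against older headers.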
If there are no objections, I'll then start merging the HTML parser into the trunk. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 20 12:44:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 20 12:45:07 2006 Subject: [lxml-dev] HTML parser is in the trunk! Ready for 1.0 ? Message-ID: <44476604.1050501@gkec.informatik.tu-darmstadt.de> Hi all, I just merged the HTML parser branch into the trunk. Paul reported that the latest branch version compiled cleanly on Mac-OS X Tiger (libxml 2.6.16) - and it even passed all tests there, including those on broken HTML. Newer versions of both libxml2 and libxslt are recommended, though. Another recent update on the trunk is the support for xml:id, which is currently available through an XMLDTDID function (XMLID was already in use by ET and is compatible in lxml). The new functionality is now directly based on the libxml2 ID hash table provided by the parser. This means that lxml now supports dictionary-like access to elements having an "xml:id" attribute or DTD-REF attributes. I think it is now the time to fix features for lxml 1.0. Expect it to be released next month (hopefully after Pyrex 0.9.4.1). If you think that lxml still misses something that should be in 1.0 or if you know about any remaining (or new) bugs, report back to the list. Please start a separate thread in that case instead of replying to this mail. Martijn and I are happy about any comment that helps us get lxml better. Have fun, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 20 19:26:21 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 20 19:26:37 2006 Subject: [lxml-dev] Custom resolvers Message-ID: <4447C43D.8000702@gkec.informatik.tu-darmstadt.de> Hi, since Paul kept bugging me, I created a new branch (resolver-new) and implemented an API for the custom resolvers stuff. 
It should be pretty simple to use, just create a parser and register the
resolver:

    parser = XMLParser()
    parser.resolvers.add(my_resolver)

"my_resolver" must be of type etree.Resolver and provide a method

    resolve(system_url, public_id, context)

that returns either None (== "can't resolve, ask someone else") or a
_ParserInput object. These can be built from files or strings using the
Resolver methods 'resolve_string' and 'resolve_filename'.

So, to create a custom resolver, you basically do this

---------------
class MyResolver(lxml.etree.Resolver):
    entity = "This was an entity"

    def resolve(self, url, id, context):
        if url == 'my.dtd':
            # I can handle this
            return self.resolve_string(
                u'' % self.entity, context)
        elif url.startswith('http://'):
            # the default resolver can handle this
            return super(MyResolver, self).resolve(url, id, context)
        else:
            # don't know what to do, let someone else try
            return None

my_resolver = MyResolver()
---------------

I'll see how to integrate that in other places of the API, especially XSLT
and schemas. Anyway, this works so far. Feel free to comment on it.

Stefan

From bkc at murkworks.com Thu Apr 20 19:46:21 2006
From: bkc at murkworks.com (Brad Clements)
Date: Thu Apr 20 19:47:01 2006
Subject: [lxml-dev] Custom resolvers
In-Reply-To: <4447C43D.8000702@gkec.informatik.tu-darmstadt.de>
Message-ID: <444790AD.22365.328393B4@bkc.murkworks.com>

On 20 Apr 2006 at 19:26, Stefan Behnel wrote:

> parser = XMLParser()
> parser.resolvers.add(my_resolver)

Great, so does this resolver only get called when this one parser is used,
or is it global to the process (like it is with libxml2)?

> I'll see how to integrate that in other places of the API, especially
> XSLT and schemas. Anyway, this works so far. Feel free to comment on

If I create a parser, add my resolver, then load an .xslt file into that
parser, I'd expect that subsequent use of the parsed document in a
transform would continue to use my resolver.
and that my resolver would not be called by other documents or transforms. Is that what really happens? If so, nirvana! -- Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM or SKYPE: BKClements From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 20 20:30:07 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 20 20:30:05 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <444790AD.22365.328393B4@bkc.murkworks.com> References: <444790AD.22365.328393B4@bkc.murkworks.com> Message-ID: <4447D32F.5090802@gkec.informatik.tu-darmstadt.de> Brad Clements wrote: > On 20 Apr 2006 at 19:26, Stefan Behnel wrote: > >> parser = XMLParser() >> parser.resolvers.add(my_resolver) > > > Great, so does this resolver only get called when this one parser is used, or is it > global to the process (like it is with libxml2)? It's currently local to a parser. I'm looking for a module level API also, but I'm not sure yet how to make it look pretty. Anyway, the parser-level API is likely the preferred one anyway. >> I'll see how to integrate that in other places of the API, especially >> XSLT and schemas. Anyway, this works so far. Feel free to comment on > > If I create a parser, add my resolver, then load an .xslt file into that parser, I'd > expect that subsequent use of the parsed document in a transform would > continue to use my resolver. and that my resolver would not be called by > other documents or transforms. So you'd want the resolvers stored at a per-document level rather than in XSLT or RelaxNG? That would totally simplify the API. I think that's a good idea. So, just to make that clear: 1) resolvers are only registered with parsers. 2) once a document is parsed, a reference to the parser-local resolvers is kept in the document to be reused in all operations where resolving is involved (XSLT, RelaxNG, XInclude, etc.). 
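The two points above can be sketched in plain Python. The classes `SketchParser` and `SketchDocument` and the function `dtd_resolver` are hypothetical stand-ins invented for this illustration; they are not lxml's API and only model where the resolvers are stored.

```python
class SketchParser:
    """Hypothetical parser: resolvers are registered here and nowhere else."""

    def __init__(self):
        self.resolvers = []

    def parse(self, text):
        # Point 2: the document keeps a reference to the parser-local resolvers.
        return SketchDocument(text, self.resolvers)


class SketchDocument:
    """Hypothetical document that remembers the resolvers of its parser."""

    def __init__(self, text, resolvers):
        self.text = text
        self.resolvers = resolvers

    def resolve(self, url):
        # Ask each resolver in turn; None means "can't resolve, ask someone else".
        for resolver in self.resolvers:
            result = resolver(url)
            if result is not None:
                return result
        return None


def dtd_resolver(url):
    """Hypothetical resolver that only knows about 'my.dtd'."""
    return "contents of my.dtd" if url == "my.dtd" else None


parser_a = SketchParser()
parser_a.resolvers.append(dtd_resolver)   # point 1: registered with a parser
parser_b = SketchParser()                 # a second parser without resolvers

doc_a = parser_a.parse("<a/>")
doc_b = parser_b.parse("<b/>")

print(doc_a.resolve("my.dtd"))   # answered by parser_a's resolver
print(doc_b.resolve("my.dtd"))   # parser_b never saw that resolver
```

Because the sketch stores a reference, a resolver added to `parser_a` later would also become visible to `doc_a`; storing a copy instead would freeze the set at parse time.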
Questions: * if you parse an XSL document with one set of resolvers and then use it to transform an XML document with another set of resolvers - which ones should be used during the transform? My guess is: the document ones, but that may break lookups at the XSLT level (which libxslt handles in the standard resolvers, even for lookups inside the stylesheet itself!). Keeping these lookups separated by source document can get pretty hard, I assume. * should the document registries be independent of the parser registries or should they reflect updates in their original parser? > Is that what really happens? If so, nirvana! Not yet, but close :) Stefan From bkc at murkworks.com Thu Apr 20 21:21:12 2006 From: bkc at murkworks.com (Brad Clements) Date: Thu Apr 20 21:21:50 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <4447D32F.5090802@gkec.informatik.tu-darmstadt.de> References: <444790AD.22365.328393B4@bkc.murkworks.com> Message-ID: <4447A6E8.17038.32DA6356@bkc.murkworks.com> On 20 Apr 2006 at 20:30, Stefan Behnel wrote: > > Great, so does this resolver only get called when this one parser is > > used, or is it global to the process (like it is with libxml2)? > > It's currently local to a parser. I'm looking for a module level API > also, but I'm not sure yet how to make it look pretty. Anyway, the > parser-level API is likely the preferred one anyway. Is the ability to register a resolver by-parser new functionality in libxml2? > So you'd want the resolvers stored at a per-document level rather than > in XSLT or RelaxNG? That would totally simplify the API. I think > that's a good idea. I don't know anything about RelaxNG.. But with respect to xslt.. see below > So, just to make that clear: > > 1) resolvers are only registered with parsers. yes > > 2) once a document is parsed, a reference to the parser-local > resolvers is kept in the document to be reused in all operations where > resolving is involved (XSLT, RelaxNG, XInclude, etc.). 
yes > Questions: > > * if you parse an XSL document with one set of resolvers and then use > it to transform an XML document with another set of resolvers - which > ones should be used during the transform? Well hmm.. when does the xsl transform process xsl:include and xsl:import? I think those two statements should use the resolver assigned to the base xslt document. During the transform, calls to document() should use the resolver of the base-uri. So, that could be tricky, the document() call is complicated. I suppose you could say that document() always uses the resolver associated with the source xml file and just leave it at that.. that'd be easy. > * should the document registries be independent of the parser > registries or should they reflect updates in their original parser? sorry, I don't understand what you mean. -- Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM or SKYPE: BKClements From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 20 22:10:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu Apr 20 22:09:57 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <4447A6E8.17038.32DA6356@bkc.murkworks.com> References: <444790AD.22365.328393B4@bkc.murkworks.com> <4447A6E8.17038.32DA6356@bkc.murkworks.com> Message-ID: <4447EAB0.80705@gkec.informatik.tu-darmstadt.de> Brad Clements wrote: > On 20 Apr 2006 at 20:30, Stefan Behnel wrote: > >>> Great, so does this resolver only get called when this one parser is >>> used, or is it global to the process (like it is with libxml2)? >> It's currently local to a parser. I'm looking for a module level API >> also, but I'm not sure yet how to make it look pretty. Anyway, the >> parser-level API is likely the preferred one anyway. > > Is the ability to register a resolver by-parser new functionality in libxml2? No, lxml registers a global resolver and dispatches internally, possibly falling back to the original default resolver. 
>> Questions: >> >> * if you parse an XSL document with one set of resolvers and then use >> it to transform an XML document with another set of resolvers - which >> ones should be used during the transform? > > Well hmm.. when does the xsl transform process xsl:include and xsl:import? > I think those two statements should use the resolver assigned to the base xslt > document. Includes and imports are handled at compilation time, which happens in XSLT.__init__(). Libxslt uses a different mechanism than libxml2 here, which (as usual) complicates things. It allows you to specify an "xsltDocLoaderFunction" that is expected to operate in the current XSLT context. Replacing this function would also fix the document('') call as it could access the in-memory stylesheet structure instead of trying to re-load it from a possibly unknown source. However, there doesn't seem to be a way to figure out the default document loader function to provide the necessary fallback. So, I don't know, maybe I'll have to see if libxslt can use the libxml2 resolver capabilities instead... > During the transform, calls to document() should use the resolver of the base-uri. That's the main problem I see. I'm not sure we can figure out the document that a resolver request comes from by means of libxml2. Libxslt provides this information to the loader function, but as long as we don't have a fallback, we can't just replace the loader function without re-implementing it completely. > So, that could be tricky, the document() call is complicated. I suppose you could > say that document() always uses the resolver associated with the source xml file > and just leave it at that.. that'd be easy. Yeah, but it can't always work. Imagine a stylesheet loaded from a ZIP file applied to an XML file loaded from the web. You'd then need both resolvers registered on the XML document. You could possibly imagine using both (e.g. using the XSLT resolvers as a fallback to the XML resolvers). 
But that may yield other race conditions. >> * should the document registries be independent of the parser >> registries or should they reflect updates in their original parser? > > sorry, I don't understand what you mean. I just meant: should they be stored by reference or copied? But I assume you'd want independent copies to allow updating the parser-local registry without affecting documents that were parsed earlier. So that's a minor problem here. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 11:17:46 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 11:18:21 2006 Subject: [lxml-dev] Pyrex 0.9.4.1 is out Message-ID: <4448A33A.1020409@gkec.informatik.tu-darmstadt.de> Hi, Pyrex 0.9.4.1 was released today. It finally compiles lxml nicely and out-of-the-box with Python 2.4 and gcc 4.x. And it can be installed with easy_install Pyrex I updated the INSTALL.txt on the trunk and the 0.9.x branch to make this version a requirement for those who want to twiddle with the source. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 13:17:05 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 13:17:07 2006 Subject: [lxml-dev] document('') fixed Message-ID: <4448BF31.4010509@gkec.informatik.tu-darmstadt.de> Hi, I played with the XSLT document loaders and found that the default loader can apparently handle "document('')" on XSL documents read from strings as long as they have a non-empty URL. This only makes sense when you know that libxslt keeps a list of known documents during the transformation, so it apparently searches that list for the URL of the requested document. I changed the code on the trunk to create a fake URL for the case that the document URL is empty. So, document('') should now work from any stylesheet (if anyone wants to verify...) 
Stefan From cazic at gmx.net Fri Apr 21 14:03:03 2006 From: cazic at gmx.net (cazic@gmx.net) Date: Fri Apr 21 14:03:41 2006 Subject: [lxml-dev] document('') fixed References: <4448BF31.4010509@gkec.informatik.tu-darmstadt.de> Message-ID: <18271.1145620983@www084.gmx.net> Hi, > --- Original Message --- > From: Stefan Behnel > To: ML-Lxml-dev > Subject: [lxml-dev] document('') fixed > Date: Fri, 21 Apr 2006 13:17:05 +0200 > > Hi, > > I played with the XSLT document loaders and found that the default loader can apparently handle "document('')" on XSL documents read from strings as long as they have a non-empty URL. This only makes sense when you know that libxslt keeps a list of known documents during the transformation, so it apparently searches that list for the URL of the requested document. > > I changed the code on the trunk to create a fake URL for the case that the document URL is empty. So, document('') should now work from any stylesheet (if anyone wants to verify...) Does it still reparse the stylesheet document? If you managed to reuse the stylesheet-tree for this purpose then this will produce problems, since the stylesheet-compilation process of Libxslt will change the tree; i.e., e.g. it will eliminate xsl:text elements and preserve whitespace-only text-nodes if they are children of xsl:text. Regards, Kasimier From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 14:39:53 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 14:39:34 2006 Subject: [lxml-dev] document('') fixed In-Reply-To: <18271.1145620983@www084.gmx.net> References: <4448BF31.4010509@gkec.informatik.tu-darmstadt.de> <18271.1145620983@www084.gmx.net> Message-ID: <4448D299.7080306@gkec.informatik.tu-darmstadt.de> cazic@gmx.net wrote: > Stefan Behnel wrote: >> I changed the code on the trunk to create a fake URL for the case that the >> document URL is empty.
So, document('') should now work from any >> stylesheet >> (if anyone wants to verify...) > > Does it still reparse the stylesheet document? I'm not changing anything here, I'm only providing a URL for the stylesheet, which already exists for stylesheets read from files or the network. I tried this:
----------------------
>>> from lxml.etree import XSLT, XML
>>> xml = XML("""\
...
...
... """)
>>> xslt = XSLT(xml)
>>> str(xslt(xml))
'\n\n'
----------------------
The output is all in one line. strace tells me that it tries to find the fake file and fails. It then checks a catalog in /etc/xml and then retries finding the fake file (which fails again). However, it then returns the above tree, so there must be a fallback somewhere that lets document('') succeed. > If you managed to > reuse the stylesheet-tree for this purpose then this will produce > problems, since the stylesheet-compilation process of Libxslt will > change the tree; i.e., e.g. it will eliminate xsl:text elements and > preserve whitespace-only text-nodes if they are children of xsl:text. That would produce the above output, yes. So, what you say is that we should rather handle the lookup "manually"? That would require copying the document twice before the XSLT compilation, to use one copy for compilation and to store the other one. The doc loader would then return a copy of the second copy when the stylesheet URL is requested. Is that the correct approach? That would really make it a lot of deep copying. If this is really necessary, would you mind if I called this behaviour a bug in libxslt?
Stefan From cazic at gmx.net Fri Apr 21 16:06:21 2006 From: cazic at gmx.net (cazic@gmx.net) Date: Fri Apr 21 16:06:59 2006 Subject: [lxml-dev] document('') fixed References: <4448D299.7080306@gkec.informatik.tu-darmstadt.de> Message-ID: <11724.1145628381@www067.gmx.net> Hi, > --- Original Message --- > From: Stefan Behnel > To: cazic@gmx.net > Cc: lxml-dev@codespeak.net > Subject: Re: [lxml-dev] document('') fixed > Date: Fri, 21 Apr 2006 14:39:53 +0200 [...] > rather handle the lookup "manually"? That would require copying the document twice before the XSLT compilation, to use one copy for compilation and to store the other one. The doc loader would then return a copy of the second copy when the stylesheet URL is requested. > > Is that the correct approach? That would really make it a lot of deep copying. If this is really necessary, would you mind if I called this behaviour a bug in libxslt? For whitespace-stripping see: http://www.w3.org/TR/xslt#strip or the XSLT 2.0 spec, which clarifies the intended behaviour much better: http://www.w3.org/TR/xslt20/#stylesheet-stripping The elimination of xsl:text elements is a Libxslt-only thingy, but it's just an internal processing like pre-compilation of XPath expressions. I learned that the spec of XSLT 2.0 clarifies the semantics of the document() function (which, as I was told, was introduced in an abandoned draft of XSLT 1.1 and never made it into the recommendation): "One effect of these rules is that unless XML entities or xml:base are used, and provided that the base URI of the stylesheet module is known, document("") refers to the document node of the containing stylesheet module (the definitive rules are in [RFC3986]). The XML resource containing the stylesheet module is processed exactly as if it were any other XML document, for example there is no special recognition of xsl:text elements, and no special treatment of comments and processing instructions."
(http://www.w3.org/TR/xslt20/#document) So this mechanism relies on a base URI to be known, which is not known if the stylesheet-tree is constructed from an in-memory string. I haven't read RFC3986, but an interesting question for me is whether the *string* containing the XML could be treated as the document and be addressed/acquired via the document("") function. So if you could tweak lxml to keep a reference to that string, and feed Libxslt with it when document("") is called, that would be a nice solution, I think. Regards, Kasimier From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 17:00:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 17:00:04 2006 Subject: [lxml-dev] document('') fixed In-Reply-To: <11724.1145628381@www067.gmx.net> References: <4448D299.7080306@gkec.informatik.tu-darmstadt.de> <11724.1145628381@www067.gmx.net> Message-ID: <4448F3AA.1080904@gkec.informatik.tu-darmstadt.de> cazic@gmx.net wrote: > For whitespace-stripping see: > http://www.w3.org/TR/xslt#strip > > or the XSLT 2.0 spec, which clarifies the intended behaviour > much better: > http://www.w3.org/TR/xslt20/#stylesheet-stripping > > The elimination of xsl:text elements is a Libxslt-only thingy, > but it's just an internal processing like pre-compilation of > XPath expressions. [snip] > So this mechanism relies on a base URI to be known, which is > not known if the stylesheet-tree is constructed from an in-memory > string. Ok, I understand that there are certain minor changes in the stylesheet structure, mainly for white-space nodes and xsl:text elements. I personally don't think this is worth storing XML data and copying documents all over the place. Since most people will use document() only to a) find documents in the same directory as the stylesheet (which works anyway) or b) access data in the stylesheet (as opposed to templates, etc.), I can't see why it should hurt anyone to just leave it as it is now.
Even the white-space stripping stuff will presumably only show surprising results in very rare cases. So, my preferred solution is to just let document('') access the stylesheet and "maybe" collect some possible surprising effects somewhere in the documentation. Everything else would be too much overhead in the average case (and for the programmer :). Stefan From cazic at gmx.net Fri Apr 21 17:16:19 2006 From: cazic at gmx.net (cazic@gmx.net) Date: Fri Apr 21 17:16:58 2006 Subject: [lxml-dev] document('') fixed References: <4448F3AA.1080904@gkec.informatik.tu-darmstadt.de> Message-ID: <18191.1145632579@www095.gmx.net> Hi, > --- Original Message --- > From: Stefan Behnel > To: cazic@gmx.net > Cc: lxml-dev@codespeak.net > Subject: Re: [lxml-dev] document('') fixed > Date: Fri, 21 Apr 2006 17:00:58 +0200 [...] > anyone to just leave it as it is now. Even the white-space stripping stuff will presumably only show surprising results in very rare cases. > > So, my preferred solution is to just let document('') access the stylesheet and "maybe" collect some possible surprising effects somewhere in the documentation. Everything else would be too much overhead in the average case (and for the programmer :). Well, we could strip processing-instructions by default in the Libxml2-parser; I don't use them and I think they are rarely used out there anyway ;-) Regards, Kasimier From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 17:19:06 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 17:18:05 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <4447EAB0.80705@gkec.informatik.tu-darmstadt.de> References: <444790AD.22365.328393B4@bkc.murkworks.com> <4447A6E8.17038.32DA6356@bkc.murkworks.com> <4447EAB0.80705@gkec.informatik.tu-darmstadt.de> Message-ID: <4448F7EA.8080600@gkec.informatik.tu-darmstadt.de> Ok, things are getting somewhere...
Stefan Behnel wrote: > Brad Clements wrote: >> On 20 Apr 2006 at 20:30, Stefan Behnel wrote: >>> * if you parse an XSL document with one set of resolvers and then use >>> it to transform an XML document with another set of resolvers - which >>> ones should be used during the transform? >> >> Well hmm.. when does the xsl transform process xsl:include and xsl:import? >> I think those two statements should use the resolver assigned to the base xslt >> document. > > Includes and imports are handled at compilation time, which happens in > XSLT.__init__(). Libxslt uses a different mechanism than libxml2 here, which > (as usual) complicates things. It allows you to specify an > "xsltDocLoaderFunction" that is expected to operate in the current XSLT context. libxslt cleanly separates XSL compile-time and transformation-time lookups by an argument passed to the loader function. This allows us to use a different set of resolvers for each context. There is an undocumented public reference to the default loader that we can use as fall-back. (See http://www.google.de/search?q=xsltDocDefaultLoader+site%3Axmlsoft.org on why I call it undocumented.) New problem: the default loader of libxslt reuses document references internally, referenced by their URL. I think we should keep this behaviour, which would mean: run the default loader first, and only if that fails dispatch to the Python resolvers. This would disable custom resolvers for file/network URLs etc. but enable it for custom URIs. Those would then even benefit from the internal document reuse. I'm currently using a special prefix "py:" for URIs that are always passed to the custom loaders first. I implemented a preliminary version in the resolver-new branch and unified the API towards libxml2 and libxslt document loaders (it's the same as in my first mail). I don't currently have any test cases, so maybe those who have been waiting for this feature can start playing with it? 
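The lookup order proposed in this message (default loader first, Python resolvers second, except for "py:" URIs, which go to the Python resolvers first) can be sketched roughly as follows. This is a plain-Python illustration under assumed names (`load_document`, `fake_default`, `fake_python`), not the code in the resolver-new branch.

```python
# Hypothetical sketch of the lookup order proposed in this message.
# Loaders return a document (any object) or None when they cannot help.

def load_document(url, default_loader, python_resolvers):
    def ask_default():
        return default_loader(url)

    def ask_python():
        for resolver in python_resolvers:
            doc = resolver(url)
            if doc is not None:
                return doc
        return None

    # "py:" URIs go to the Python resolvers first; everything else
    # tries the default loader first.
    if url.startswith("py:"):
        order = (ask_python, ask_default)
    else:
        order = (ask_default, ask_python)

    for ask in order:
        doc = ask()
        if doc is not None:
            return doc
    return None


def fake_default(url):
    # Pretend the default loader only understands file: URLs.
    return "default" if url.startswith("file:") else None


def fake_python(url):
    # Pretend the Python resolver can answer anything.
    return "python"


print(load_document("file:a.xsl", fake_default, [fake_python]))
print(load_document("py:a.xsl", fake_default, [fake_python]))
print(load_document("zip:a.xsl", fake_default, [fake_python]))
```

With this ordering, a custom resolver never sees a URL that the default loader can already handle, which is exactly the trade-off under discussion.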
Note that exception handling is not currently working in XSLT but in the parsers. So, lxml can happily crash if you raise one. That will change - eventually... :) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Apr 21 17:45:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri Apr 21 17:44:50 2006 Subject: [lxml-dev] document('') fixed In-Reply-To: <18191.1145632579@www095.gmx.net> References: <4448F3AA.1080904@gkec.informatik.tu-darmstadt.de> <18191.1145632579@www095.gmx.net> Message-ID: <4448FE36.8030307@gkec.informatik.tu-darmstadt.de> cazic@gmx.net wrote: > Stefan Behnel wrote: >> So, my preferred solutions is to just let document('') access the >> stylesheet >> and "maybe" collect some possible surprising effects somewhere in the >> documentation. Everything else would be too much overhead in the average >> case (and for the programmer :). > > Well, we could strip processing-instructions by default in > the Libxml2-parser; I don't use them and I think they are > rarely used out there anyway ;-) True. But don't forget to document that somewhere in the source code! ;) Stefan From bkc at murkworks.com Fri Apr 21 17:50:22 2006 From: bkc at murkworks.com (Brad Clements) Date: Fri Apr 21 17:51:03 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <4448F7EA.8080600@gkec.informatik.tu-darmstadt.de> References: <4447EAB0.80705@gkec.informatik.tu-darmstadt.de> Message-ID: <4448C6FE.1794.373FB87B@bkc.murkworks.com> On 21 Apr 2006 at 17:19, Stefan Behnel wrote: > New problem: the default loader of libxslt reuses document references > internally, referenced by their URL. I think we should keep this > behaviour, which would mean: run the default loader first, and only if > that fails dispatch to the Python resolvers. > > This would disable custom resolvers for file/network URLs etc. but > enable it for custom URIs. Those would then even benefit from the > internal document reuse. 
I'm currently using a special prefix "py:" > for URIs that are always passed to the custom loaders first. This is definitely a non-starter for me. My client's websites serve xml with xslt-pi instructions to web clients. We sniff the client, and if that browser can't support client-side transforms we then perform the transform on the server. In that case, the URL to be resolved is probably already a network URL. I need to be sure that my resolver gets the first crack at it, because I don't want libxslt making a callback to my web server (possibly by a url that the local process doesn't have access to) and definitely occurring outside the context in which it should occur. For example: web requests from authenticated clients with cookies. The cookie won't be passed by libxml2 back to the web server, so authentication is lost. I am using WSGI .. I use the paste.recursive.include module to "re-use" the current web request when handling Resolver callbacks from libxml2. -- Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM or SKYPE: BKClements From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Apr 22 09:07:59 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat Apr 22 09:08:39 2006 Subject: [lxml-dev] Custom resolvers In-Reply-To: <4448C6FE.1794.373FB87B@bkc.murkworks.com> References: <4447EAB0.80705@gkec.informatik.tu-darmstadt.de> <4448C6FE.1794.373FB87B@bkc.murkworks.com> Message-ID: <4449D64F.6060504@gkec.informatik.tu-darmstadt.de> Brad Clements wrote: > My client's websites serve xml with xslt-pi instructions to web clients. > We sniff the client, and if that browser can't support client-side transforms we > then perform the transform on the server. > > In that case, the URL to be resolved is probably already a network URL.
I need to > be sure that my resolver gets the first crack at it, because I don't want libxslt > making a callback to my web server (possibly by a url that the local process > doesn't have access to) and definitely occuring outside the context in which it > should occur. That's a reasonable use-case. I removed the first-shot for the default resolver (and a couple of bugs and crashes). This leaves it to users to decide about the trade-off between document re-use and the full flexibility of dynamic document loading. Although I didn't test it, document re-use should now require some additional user code like URL caching: if the document for that URL was already generated, the default resolver should know about it... XSLT document loaders should now be in a preliminary usable state. I'll write up some doctests for the new code (doc/resolvers.txt). That'll also show me if (and where) there are still bugs. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Apr 22 21:36:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat Apr 22 21:36:48 2006 Subject: [lxml-dev] document('') fixed In-Reply-To: <4448D299.7080306@gkec.informatik.tu-darmstadt.de> References: <4448BF31.4010509@gkec.informatik.tu-darmstadt.de> <18271.1145620983@www084.gmx.net> <4448D299.7080306@gkec.informatik.tu-darmstadt.de> Message-ID: <444A85A1.4080900@gkec.informatik.tu-darmstadt.de> Hi, previously, I wrote: > rather handle the lookup "manually"? That would require copying the document > twice before the XSLT compilation, to use one copy for compilation and to > store the other one. The doc loader would then return a copy of the second > copy when the stylesheet URL is requested. I revised my previous opinion on this. The current code now uses exactly this approach. Storing the string or a filename reference would not have solved the problem as there is nothing that keeps a user from building stylesheets by hand using the API. 
Alternatively, we could serialize the XSL to a string before compiling it and parse it on request. Daniel suggested that this might even be faster than deep-copying. I wouldn't mind hearing other opinions on this. Anyway, this is how it works now. Stylesheets that were parsed from strings are now special cased and a fake URI is generated for them. The lookup works as follows (first match wins): 1) if the requested URI is a fake URI a) the default resolver is asked to find the document b) the URI is checked against the current XSL document 2) the Python resolvers are called 3) the default resolver is called 4) fail This allows document('') to work in all cases (cross-fingers) and prefers the Python resolvers for anything but string-loaded stylesheets. I think that's a good trade-off. Doctests and explanations can be found in doc/resolvers.txt. Remember: The more feedback I get, the faster the branch can be merged into the trunk. If anyone can come up with additional doctests, clarifications or unit test cases, that would be much appreciated. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Apr 23 16:19:06 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun Apr 23 16:19:45 2006 Subject: [lxml-dev] document('') fixed In-Reply-To: <444A85A1.4080900@gkec.informatik.tu-darmstadt.de> References: <4448BF31.4010509@gkec.informatik.tu-darmstadt.de> <18271.1145620983@www084.gmx.net> <4448D299.7080306@gkec.informatik.tu-darmstadt.de> <444A85A1.4080900@gkec.informatik.tu-darmstadt.de> Message-ID: <444B8CDA.5000403@gkec.informatik.tu-darmstadt.de> Hi, one more comment on this: > Alternatively, we could serialize the XSL to a string before compiling it and > parse it on request. Daniel suggested that this might even be faster than > deep-copying. I did a couple of tests and found that this is much slower for small stylesheets. It may also carry the additional risk of requiring special parser options that may not be known to XSLT. 
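The two strategies compared here (keeping a pristine tree and deep-copying it, versus keeping the serialized text and re-parsing it on request) can be tried with a rough sketch like the one below. It uses the standard library's ElementTree purely as a stand-in, which is an assumption: lxml and libxml2 will show different timings. The sample stylesheet and the function names are made up for illustration.

```python
# Sketch using the standard library's ElementTree as a stand-in for lxml,
# so the absolute numbers say nothing about lxml or libxml2 itself.
import copy
import timeit
import xml.etree.ElementTree as ET

STYLESHEET = (
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:template match="/"><out><xsl:value-of select="."/></out></xsl:template>'
    '</xsl:stylesheet>'
)

root = ET.fromstring(STYLESHEET)
serialized = ET.tostring(root)


def via_deepcopy():
    """Keep a pristine tree and hand out deep copies."""
    return copy.deepcopy(root)


def via_reparse():
    """Keep the serialized bytes and parse them on request."""
    return ET.fromstring(serialized)


# Both strategies must yield the same document.
same = ET.tostring(via_deepcopy()) == ET.tostring(via_reparse())

if __name__ == "__main__":
    print("equal:", same)
    print("deepcopy:", timeit.timeit(via_deepcopy, number=2000))
    print("reparse: ", timeit.timeit(via_reparse, number=2000))
```

Which variant wins is likely to depend on document size and on the library, so measurements on real stylesheets are needed before drawing conclusions.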
I now simplified the code somewhat and special cased only the current stylesheet itself. The rest is handed to the Python loaders and subsequently to the default loader. If there are no substantial counter-arguments to this behaviour, I'll just wait for a few bug reports and otherwise merge it next week. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Apr 23 18:57:53 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun Apr 23 18:58:32 2006 Subject: [lxml-dev] Linking against libexslt Message-ID: <444BB211.5070705@gkec.informatik.tu-darmstadt.de> Hi, I'd like to include EXSLT support in lxml as it greatly enhances the capabilities of XSLT. It isn't currently enabled, although supported since libxslt 1.0.19. However, this requires linking against libexslt and xslt-config does (intentionally) not help here: http://mail.gnome.org/archives/xslt/2001-October/msg00133.html I don't know if simply adding "-lexslt" works on all systems (that may be the GCC way of doing it), so I added the following to my setup.py:
---------------------------
xslt_libs = flags('xslt-config --libs')
for i, libname in enumerate(xslt_libs):
    if 'exslt' in libname:
        # exslt is already listed, nothing to add
        break
    if 'xslt' in libname:
        # put the exslt library right in front of the xslt library
        xslt_libs.insert(i, libname.replace('xslt', 'exslt'))
        break
[...]
# near the end in setup() call:
extra_link_args = xslt_libs
---------------------------
This basically replaces "-lxslt" by "-lexslt -lxslt" on my machine (Linux). I hope this is sufficiently platform-independent to make it work on other systems, but I don't have Windows or Mac-OS available to check that this actually works. Since I was working on XSLT in the "resolver-new" branch anyway, I committed it there. I also attached the complete patch that can be applied to the trunk. It includes a new test case for exslt.
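To see what the list manipulation in the setup.py snippet does without running xslt-config, the same loop can be wrapped in a small function and fed a hand-written flag list. The function name `with_exslt` and the sample flags are assumptions for illustration; the real flags come from the `flags()` helper in setup.py.

```python
def with_exslt(libs):
    """Return a copy of the linker flags with an exslt entry before xslt."""
    libs = list(libs)
    for i, libname in enumerate(libs):
        if 'exslt' in libname:
            break  # already present, nothing to do
        if 'xslt' in libname:
            # insert the exslt variant right before the first xslt entry
            libs.insert(i, libname.replace('xslt', 'exslt'))
            break
    return libs


print(with_exslt(['-L/usr/lib', '-lxslt', '-lxml2', '-lz']))
# ['-L/usr/lib', '-lexslt', '-lxslt', '-lxml2', '-lz']
```

One caveat of the substring test: a search path that happens to contain "xslt" (say `-L/opt/libxslt/lib`) would be matched before `-lxslt` and duplicated as `-L/opt/libexslt/lib`, so checking for flags that start with `-l` might be safer.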
Could anyone on the Win/Mac platforms please tell me if this is necessary, or if the following (which also works for me) is sufficient: extra_link_args = flags('xslt-config --libs') + ['-lexslt'] Also, could you please verify that the resolver-new branch compiles and that the test cases pass? Please remember that you may want to install Pyrex 0.9.4.1 before compiling from SVN. Thanks, Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: enable-exslt.patch Type: text/x-patch Size: 3075 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060423/6f6268ee/enable-exslt.bin From paul at zope-europe.org Sun Apr 23 20:53:58 2006 From: paul at zope-europe.org (Paul Everitt) Date: Sun Apr 23 20:54:50 2006 Subject: [lxml-dev] Re: HTMLParser status and issues In-Reply-To: <444725EA.2020606@gkec.informatik.tu-darmstadt.de> References: <4443A9E2.9030808@gkec.informatik.tu-darmstadt.de> <4443CE94.2030603@zope-europe.org> <4443F79F.90507@gkec.informatik.tu-darmstadt.de> <73DCD630-05BA-4221-8CC5-B729573C6BF8@zeapartners.org> <4443FAA0.9070909@gkec.informatik.tu-darmstadt.de> <4B044BD8-B441-4DD7-A620-AAAD33088A27@zeapartners.org> <4444A60F.3060408@gkec.informatik.tu-darmstadt.de> <4444AF44.80302@gkec.informatik.tu-darmstadt.de> <444725EA.2020606@gkec.informatik.tu-darmstadt.de> Message-ID: <444BCD46.6010603@zope-europe.org> Stefan Behnel wrote: > Stefan Behnel wrote: >> Paul Everitt wrote: >>> I got this: >>> >>> # 175 "/usr/include/libxml2/libxml/HTMLparser.h" >>> typedef enum { >>> HTML_PARSE_NOERROR = 1<<5, >>> HTML_PARSE_NOWARNING= 1<<6, >>> HTML_PARSE_PEDANTIC = 1<<7, >>> HTML_PARSE_NOBLANKS = 1<<8, >>> HTML_PARSE_NONET = 1<<11 >>> } htmlParserOption; >> >> That's not libxml2 2.6.22 then. I think your C compiler uses the Mac-OS system >> libraries instead of the libraries installed by what your xmllint uses. 
> > > Back to this issue, the missing option HTML_PARSE_RECOVER came up in libxml2 > 2.6.21, while Mac-OS Tiger ships with 2.6.16. However, it looks like the > HTML_PARSE_* options follow the numeric values of the XML_PARSE_* enum > exactly. So, as a work-around, we could use XML_PARSE_RECOVER to make it > compile and simply state that libxml2 2.6.21+ is required for parsing broken > HTML. That way, it would keep working with the system libraries on Mac-OS X. > > Paul, I applied the above change to the branch for now. I'd be glad if you > could check that it now compiles with the Mac-OS system libraries. Please run > the test suite. If everything works as expected, only the test case(s) for > parsing broken HTML should fail. Go way for the weekend and I miss all the fun. :^) Yes, this works. --Paul From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Apr 24 17:09:25 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon Apr 24 17:10:08 2006 Subject: [lxml-dev] exslt:regexp implementation based on 're' Message-ID: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> Hi, I noticed that exslt:regexp was not supported by libexslt, so I wrote three extension functions that use Python's re module (which is not really JavaScript compatible as requested by the spec, but who cares...). Here's an example: ---------------------------------------- >>> xslt = etree.XSLT(etree.XML("""\ """)) >>> result = xslt(etree.XML('123098987')) >>> print str(result) 987 ---------------------------------------- Since the test cases worked out perfectly, it's already in the trunk. So, when the regular exslt support gets merged, lxml will have more complete exslt support than libxslt itself. 
:) Stefan From faassen at infrae.com Tue Apr 25 15:21:47 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue Apr 25 15:18:59 2006 Subject: [lxml-dev] exslt:regexp implementation based on 're' In-Reply-To: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> Message-ID: <444E226B.3020805@infrae.com> Hey, Stefan Behnel wrote: > I noticed that exslt:regexp was not supported by libexslt, so I wrote three > extension functions that use Python's re module (which is not really > JavaScript compatible as requested by the spec, but who cares...). I think one might care if one had a stylesheet that uses exslt and then have it not work with lxml because the regex behavior is different? > Here's an > example: > > ---------------------------------------- > >>> xslt = etree.XSLT(etree.XML("""\ > xmlns:regexp="http://exslt.org/regular-expressions" > xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> > > > > > """)) > > >>> result = xslt(etree.XML('123098987')) > >>> print str(result) > 987 > ---------------------------------------- > > Since the test cases worked out perfectly, it's already in the trunk. So, when > the regular exslt support gets merged, lxml will have more complete exslt > support than libxslt itself. :) Cool. :) One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexs and call arbitrary python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern. 
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Apr 25 17:05:29 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue Apr 25 17:06:08 2006 Subject: [lxml-dev] exslt:regexp implementation based on 're' In-Reply-To: <444E226B.3020805@infrae.com> References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> <444E226B.3020805@infrae.com> Message-ID: <444E3AB9.5010506@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> I noticed that exslt:regexp was not supported by libexslt, so I wrote >> three >> extension functions that use Python's re module (which is not really >> JavaScript compatible as requested by the spec, but who cares...). > > I think one might care if one had a stylesheet that uses exslt and then > have it not work with lxml because the regex behavior is different? The API is identical, it just depends on what sort of expressions you use. The normal ().*+ stuff should be the same, also \w and the like. But you'll never find two RE implementations that are completely compatible. So, well, you'll just have to take care if you want to write portable stylesheets. Note that many processors do not even support REs at all and different processors base their support on different libraries (JavaScript or Apache or whatever). >> Here's an >> example: >> >> ---------------------------------------- >> >>> xslt = etree.XSLT(etree.XML("""\ >> > xmlns:regexp="http://exslt.org/regular-expressions" >> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> >> >> >> >> >> """)) >> >> >>> result = xslt(etree.XML('123098987')) >> >>> print str(result) >> 987 >> ---------------------------------------- >> >> Since the test cases worked out perfectly, it's already in the trunk. >> So, when >> the regular exslt support gets merged, lxml will have more complete exslt >> support than libxslt itself. :) > > Cool. :) > > One thing that I wonder about is potential security issues? 
Are there > ways to break out of the Python regexs and call arbitrary python code? > If not, then we don't need to worry about it. XSLT can be run from > fairly unsafe sources so this may be a concern. I wouldn't know why there should be any risks. The regexps are just handed to the re.compile function as is and there shouldn't be any way to break out of the (s)re module. There are no calls to "eval" or anything like it. The EXSLT extensions shouldn't do any harm either. On the other hand, registering the libxslt "extra" extension functions may be a risk. There is a "debug" element that becomes accessible and the "output" and "write" elements that can write(!) to files. So, maybe we should require some initialization function call to add those extras. I'll just remove the "extra" registration for now. Also, remember that the document() function can be used to access local XML files. That may already be a risk in some cases. Stefan From fredrik at pythonware.com Tue Apr 25 19:32:30 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue Apr 25 19:35:49 2006 Subject: [lxml-dev] Re: exslt:regexp implementation based on 're' References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> <444E226B.3020805@infrae.com> Message-ID: Martijn Faassen wrote: > One thing that I wonder about is potential security issues? Are there > ways to break out of the Python regexs and call arbitrary python code? > If not, then we don't need to worry about it. XSLT can be run from > fairly unsafe sources so this may be a concern. you can "hang" RE if you want (by crafting a really lousy RE that causes excessive backtracking), but since you can "hang" any XML parser that supports internal DTD:s (google for the "billion laughs attack"), I'm not sure how serious this is. I wouldn't accept XSLT programs from untrusted sources, though... 
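The "hang" mentioned here can be reproduced with Python's re module using a commonly cited pattern with nested quantifiers. The sketch below keeps the input short so that it finishes quickly; the pattern and the helper name `time_match` are chosen for illustration only, and timings will vary by machine and Python version.

```python
import re
import time

# Nested quantifiers: a classic trigger for excessive backtracking.
EVIL = re.compile(r"(a+)+$")


def time_match(n):
    """Match n letters 'a' followed by a 'b' and report the elapsed time."""
    subject = "a" * n + "b"  # the trailing "b" makes every way of splitting the a's fail
    start = time.perf_counter()
    result = EVIL.match(subject)
    return result, time.perf_counter() - start


for n in (10, 14, 18):
    result, elapsed = time_match(n)
    print(n, result, round(elapsed, 4))
```

The match fails for every n, but the time needed to find that out grows rapidly with the input length, so a stylesheet author who controls both the expression and the subject string can make a transformation run for a very long time.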
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Apr 25 20:20:47 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue Apr 25 20:21:31 2006 Subject: [lxml-dev] Re: exslt:regexp implementation based on 're' In-Reply-To: References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> <444E226B.3020805@infrae.com> Message-ID: <444E687F.1060107@gkec.informatik.tu-darmstadt.de> Fredrik Lundh wrote: > Martijn Faassen wrote: > >> One thing that I wonder about is potential security issues? Are there >> ways to break out of the Python regexs and call arbitrary python code? >> If not, then we don't need to worry about it. XSLT can be run from >> fairly unsafe sources so this may be a concern. > > you can "hang" RE if you want (by crafting a really lousy RE that > causes excessive backtracking), but since you can "hang" any XML > parser that supports internal DTD:s (google for the "billion laughs > attack"), I'm not sure how serious this is. > > I wouldn't accept XSLT programs from untrusted sources, though... Sure, that's the main threat. XSLT is Turing-complete. Anyone can write an infinitely recursing stylesheet - and no machine can ever decide if it will terminate... Stefan From faassen at infrae.com Wed Apr 26 10:57:44 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed Apr 26 10:55:14 2006 Subject: [lxml-dev] Re: exslt:regexp implementation based on 're' In-Reply-To: References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de> <444E226B.3020805@infrae.com> Message-ID: <444F3608.30204@infrae.com> Fredrik Lundh wrote: > Martijn Faassen wrote: > > >>One thing that I wonder about is potential security issues? Are there >>ways to break out of the Python regexs and call arbitrary python code? >>If not, then we don't need to worry about it. XSLT can be run from >>fairly unsafe sources so this may be a concern. 
>
>
> you can "hang" RE if you want (by crafting a really lousy RE that
> causes excessive backtracking), but since you can "hang" any XML
> parser that supports internal DTD:s (google for the "billion laughs
> attack"), I'm not sure how serious this is.
>
> I wouldn't accept XSLT programs from untrusted sources, though...

Agreed that accepting any programs from untrusted sources is dangerous,
but it also depends a bit on exactly how untrusted your sources are.

I just wanted to make sure we didn't get some kind of potential
privilege escalation where people from XSLT could trigger Python by
cleverly crafted regexes using some specific extension in Python that I
don't know about. Apparently this is safe.

Regards,

Martijn

From faassen at infrae.com Wed Apr 26 10:58:57 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed Apr 26 10:56:25 2006
Subject: [lxml-dev] exslt:regexp implementation based on 're'
In-Reply-To: <444E3AB9.5010506@gkec.informatik.tu-darmstadt.de>
References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de>
 <444E226B.3020805@infrae.com>
 <444E3AB9.5010506@gkec.informatik.tu-darmstadt.de>
Message-ID: <444F3651.4010202@infrae.com>

Stefan Behnel wrote:
[snip]
> Also, remember that the document() function can be used to access local XML
> files. That may already be a risk in some cases.

Good point. The custom resolver story could help against that, right?
Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Apr 26 11:09:59 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed Apr 26 11:10:47 2006
Subject: [lxml-dev] exslt:regexp implementation based on 're'
In-Reply-To: <444F3651.4010202@infrae.com>
References: <444CEA25.3040809@gkec.informatik.tu-darmstadt.de>
 <444E226B.3020805@infrae.com>
 <444E3AB9.5010506@gkec.informatik.tu-darmstadt.de>
 <444F3651.4010202@infrae.com>
Message-ID: <444F38E7.1080402@gkec.informatik.tu-darmstadt.de>

Martijn Faassen wrote:
> Stefan Behnel wrote:
> [snip]
>> Also, remember that the document() function can be used to access
>> local XML files. That may already be a risk in some cases.
>
> Good point. The custom resolver story could help against that, right?

Right. As long as you return anything but None from the Python resolvers, it
will be parsed and handed directly back to libxslt. So, if you want to keep
libxslt from accessing the network or hard disk, it "should" (untested) be
enough to write a dummy resolver that returns a dummy or the empty document
(resolve_empty()).

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Apr 27 23:03:42 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu Apr 27 23:04:27 2006
Subject: [lxml-dev] Resolver branch merged, trunk status for 1.0
Message-ID: <445131AE.8000503@gkec.informatik.tu-darmstadt.de>

Hi all,

the resolver branch is now in the trunk. This means that lxml 1.0 will have
full-fledged support for

* custom document loaders (see doc/resolvers.txt)
* EXSLT
* Python regexps in XSLT (can be switched off via 'regexp' keyword)
* the XSLT node-set() function
* xml:id and DTD IDs
* HTML parsing

I personally consider the trunk now feature-complete for 1.0. Anyone who is
interested in getting their hands on a release sooner rather than later should
now take the opportunity to start testing the trunk and report any remaining
bugs.
http://codespeak.net/svn/lxml/trunk

Please install Pyrex 0.9.4.1 to compile it.

Have fun,
Stefan
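
As an illustration of the custom document loaders mentioned in this
announcement, and of the dummy-resolver idea from the exslt:regexp
thread, the following is a rough sketch written against the
etree.Resolver API as documented for later lxml releases (Python 3). It
was not tested against the 2006 trunk. The class name BlockingResolver,
the `<blocked/>` placeholder document, and the file URL are hypothetical.
It returns a dummy document via resolve_string(); resolve_empty() is the
other option named in the thread.

```python
from lxml import etree


class BlockingResolver(etree.Resolver):
    """Hypothetical resolver that answers every lookup with a dummy document."""

    def __init__(self):
        super().__init__()
        self.seen = []  # URLs that were requested, kept for inspection

    def resolve(self, url, pubid, context):
        self.seen.append(url)
        # Returning anything but None means this result is used for the
        # lookup, so the default file/network loaders are not consulted.
        return self.resolve_string("<blocked/>", context)


# Stylesheet that tries to read a (made-up) local file through document().
STYLESHEET = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="name(document('file:///hypothetical/secret.xml')/*)"/>
  </xsl:template>
</xsl:stylesheet>
"""


def run_with_blocking_resolver():
    """Run the stylesheet and return (output text, URLs the resolver saw)."""
    resolver = BlockingResolver()
    parser = etree.XMLParser()
    parser.resolvers.add(resolver)
    # The stylesheet is parsed with the parser that carries the resolver;
    # document() lookups made by the resulting XSLT go through it.
    transform = etree.XSLT(etree.fromstring(STYLESHEET, parser))
    result = transform(etree.XML("<root/>"))
    return str(result), resolver.seen


if __name__ == "__main__":
    text, seen = run_with_blocking_resolver()
    print(repr(text), seen)
```

A resolver like this only covers lookups that are routed through the
Python resolvers; it is a sketch of the idea, not a complete sandbox for
untrusted stylesheets.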