From cmtaylor at ti.com Fri Dec 4 15:05:42 2009 From: cmtaylor at ti.com (Taylor, Martin) Date: Fri, 4 Dec 2009 08:05:42 -0600 Subject: [lxml-dev] Windows 7 64-bit Message-ID: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> I'm looking for an lxml Windows binary installer for Windows 7 64-bit. On http://pypi.python.org/pypi/lxml/2.2.4 I see lxml-2.2.4.win-amd64-py2.6.exe. Is this specifically, and only, for AMD 64-bit processors or will it install and run on all 64-bit Windows platforms, including Windows 7 on Intel hardware? Thanks, Martin Taylor -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091204/534be1f6/attachment.htm From stefan_ml at behnel.de Fri Dec 4 15:21:15 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 04 Dec 2009 15:21:15 +0100 Subject: [lxml-dev] Windows 7 64-bit In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> References: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> Message-ID: <4B191ADB.8080805@behnel.de> [and to the list...] Taylor, Martin, 04.12.2009 15:05: > I'm looking for an lxml Windows binary installer for Windows 7 64-bit. > On http://pypi.python.org/pypi/lxml/2.2.4 I see > lxml-2.2.4.win-amd64-py2.6.exe. Is this specifically, and only, for AMD > 64-bit processors or will it install and run on all 64-bit Windows > platforms, including Windows 7 on Intel hardware? Did you try it? Stefan From cmtaylor at ti.com Fri Dec 4 15:29:55 2009 From: cmtaylor at ti.com (Taylor, Martin) Date: Fri, 4 Dec 2009 08:29:55 -0600 Subject: [lxml-dev] Windows 7 64-bit In-Reply-To: <4B191AA8.5080906@behnel.de> References: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> <4B191AA8.5080906@behnel.de> Message-ID: <92CDD168D1E81F4F9D3839DC45903FC6674DDEF3@dlee03.ent.ti.com> Not yet. I should be getting my new Win 7 64-bit system later today. 
I've just been searching online for the various packages I'll need to
install on this new system to prepare it for use as a test platform. If I
don't get any replies to the contrary I will try it later today.

Martin

> -----Original Message-----
> From: Stefan Behnel [mailto:stefan_ml at behnel.de]
> Sent: Friday, December 04, 2009 8:20 AM
> To: Taylor, Martin
> Subject: Re: [lxml-dev] Windows 7 64-bit
>
>
> Taylor, Martin, 04.12.2009 15:05:
> > I'm looking for an lxml Windows binary installer for Windows 7 64-bit.
> > On http://pypi.python.org/pypi/lxml/2.2.4 I see
> > lxml-2.2.4.win-amd64-py2.6.exe. Is this specifically, and only, for
> > AMD 64-bit processors or will it install and run on all 64-bit Windows
> > platforms, including Windows 7 on Intel hardware?
>
> Did you try it?
>
> Stefan
>
>

From stefan_ml at behnel.de Fri Dec 4 22:21:55 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 04 Dec 2009 22:21:55 +0100
Subject: [lxml-dev] Critical crashes on Windows under high load
In-Reply-To:
References:
Message-ID: <4B197D73.7030709@behnel.de>

Hi,

Martin Aspeli, 02.11.2009 03:58:
> We have an incredibly frustrating, show-stopping problem using lxml (under
> Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on
> Windows.
>
> Under high load, the Python process crashes. There is no traceback in the log,
> so I can't identify where it actually happens, but we get a Windows error
> dialogue saying python.exe (or pythonservice.exe if running as a Windows
> service) has crashed in etree.pyd (at some binary address, no line numbers or
> function references).
>[...]

Any news from this front?
Stefan From lists at cheimes.de Sat Dec 5 02:37:31 2009 From: lists at cheimes.de (Christian Heimes) Date: Sat, 05 Dec 2009 02:37:31 +0100 Subject: [lxml-dev] Windows 7 64-bit In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> References: <92CDD168D1E81F4F9D3839DC45903FC6674DDE9F@dlee03.ent.ti.com> Message-ID: Taylor, Martin schrieb: > I'm looking for an lxml Windows binary installer for Windows 7 64-bit. On http://pypi.python.org/pypi/lxml/2.2.4 I see lxml-2.2.4.win-amd64-py2.6.exe. Is this specifically, and only, for AMD 64-bit processors or will it install and > run on all 64-bit Windows platforms, including Windows 7 on Intel hardware? It will install on all AMD64 / X86_64 compatible 64bit Windows platforms but not on IA64. All modern Intel CPUs are based on AMD64 architecture. Christian From optilude+lists at gmail.com Sat Dec 5 05:15:31 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Sat, 05 Dec 2009 12:15:31 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4B197D73.7030709@behnel.de> References: <4B197D73.7030709@behnel.de> Message-ID: Stefan Behnel wrote: > Hi, > > Martin Aspeli, 02.11.2009 03:58: >> We have an incredibly frustrating, show-stopping problem using lxml (under >> Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on >> Windows. >> >> Under high load, the Python process crashes. There is no traceback in the log, >> so I can't identify where it actually happens, but we get a Windows error >> dialogue saying python.exe (or pythonservice.exe if running as a Windows >> service) has crashed in etree.pyd (at some binary address, no line numbers or >> function references). >> [...] > > Any news from this front? Unfortunately not. We tried to simplify the xpath expressions, but it still crashed (perhaps a bit less often). 
Our "solution" was to ditch Deliverance in favour of collective.xdv, which still uses lxml, but uses the XDV XSLT-based transformation process. So now, we're only using lxml to execute two XSLT files (the first one generates the second). Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From jholg at gmx.de Sat Dec 5 23:47:27 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Sat, 05 Dec 2009 23:47:27 +0100 Subject: [lxml-dev] trunk build problem Message-ID: <20091205224727.130500@gmx.net> Hi, I just wanted to start work on the iso schematron support and created a branch off the current trunk for this. However: holger at lisa:/data/work/lxml$ python setup.py build Building lxml version 2.3.dev-69913. Building with Cython 0.11.2. Using build configuration of libxslt 1.1.22 Building against libxml2/libxslt in the following directory: /usr/lib running build running build_py running build_ext cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c warning: /data/work/lxml/src/lxml/python.pxd:15:4: slice already a builtin Cython type warning: /data/work/lxml/src/lxml/python.pxd:20:4: unicode already a builtin Cython type Error converting Pyrex file to C: ------------------------------------------------------------ ... 
    c_child = _findChildForwards(c_node, 0)
    while c_child is not NULL:
        if c_child.type == tree.XML_ELEMENT_NODE:
            for i in xrange(c_tag_count):
                if _tagMatchesExactly(c_child, c_ns_tags[2*i], c_ns_tags[2*i+1]):
                    c_next = _findChildForwards(c_child, 0) or _nextElement(c_child)
                                                            ^
------------------------------------------------------------

/data/work/lxml/src/lxml/cleanup.pxi:246:64: Cannot assign type 'int' to 'xmlNode *'
building 'lxml.etree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/libxml2 -I/usr/include/python2.5 -c src/lxml/lxml.etree.c -o build/temp.linux-i686-2.5/src/lxml/lxml.etree.o -w
src/lxml/lxml.etree.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
error: command 'gcc' failed with exit status 1
holger at lisa:/data/work/lxml$

Holger

From stefan_ml at behnel.de Sat Dec 5 23:53:22 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 05 Dec 2009 23:53:22 +0100
Subject: [lxml-dev] trunk build problem
In-Reply-To: <20091205224727.130500@gmx.net>
References: <20091205224727.130500@gmx.net>
Message-ID: <4B1AE462.8050106@behnel.de>

jholg at gmx.de, 05.12.2009 23:47:
>                 c_next = _findChildForwards(c_child, 0) or _nextElement(c_child)
>                                                         ^
> ------------------------------------------------------------
>
> /data/work/lxml/src/lxml/cleanup.pxi:246:64: Cannot assign type 'int' to 'xmlNode *'

Yep, I remember that being new in Cython 0.12: "and/or" now have Python
semantics everywhere, also for C types. You'll need 0.12 to build lxml 2.3.
I'll update the docs accordingly.

Stefan

From dsoulayrol at free.fr Sun Dec 6 11:28:11 2009
From: dsoulayrol at free.fr (dsoulayrol at free.fr)
Date: Sun, 6 Dec 2009 11:28:11 +0100
Subject: [lxml-dev] XML Catalogs
Message-ID: <20091206102811.GB6122@asquith>

Hello.
I'd like to use catalogs to locate XSL stylesheets the way it is done
in this document:

http://www.sagehill.net/docbookxsl/WriteCatalog.html

I'm not sure whether it is possible to do this with lxml, or how I could
achieve it. The only information I found on using catalogs is the use
of external_id with etree.DTD:

http://codespeak.net/lxml/validation.html

Does anyone have the answer?

--
David

From stefan_ml at behnel.de Sun Dec 6 12:02:36 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 06 Dec 2009 12:02:36 +0100
Subject: [lxml-dev] XML Catalogs
In-Reply-To: <20091206102811.GB6122@asquith>
References: <20091206102811.GB6122@asquith>
Message-ID: <4B1B8F4C.7080709@behnel.de>

dsoulayrol at free.fr, 06.12.2009 11:28:
> I'd like to use catalogs to locate XSL stylesheets the way it is done
> in this document:
>
> http://www.sagehill.net/docbookxsl/WriteCatalog.html

lxml's catalog support is based on libxml2:

http://xmlsoft.org/catalog.html

Is anything not working for you regarding catalogs?

Stefan

From dsoulayrol at free.fr Sun Dec 6 17:04:04 2009
From: dsoulayrol at free.fr (David Soulayrol)
Date: Sun, 6 Dec 2009 17:04:04 +0100
Subject: [lxml-dev] XML Catalogs
In-Reply-To: <4B1B8F4C.7080709@behnel.de>
References: <20091206102811.GB6122@asquith> <4B1B8F4C.7080709@behnel.de>
Message-ID: <20091206160404.GA4560@asquith>

On Sun, Dec 06, 2009 at 12:02:36PM +0100, Stefan Behnel wrote:
>
> lxml's catalog support is based on libxml2:
>
> http://xmlsoft.org/catalog.html
>
> Is anything not working for you regarding catalogs?

Actually nothing at the moment. My problem is I don't understand how I
can achieve what I want with the documentation I have.
All I found in the lxml documentation is how to resolve a DTD Public ID
from the catalog:

    dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XML V4.2//EN")

What I want is to get a URL from an XSL sheet name using an entry in my
catalog:

I may be wrong, but I didn't find any help for this in the parsing or
XSLT subjects of the documentation. I think I've understood the xmlsoft
doc as well, but it didn't help in this matter.

Sorry if it is a silly question, but I'm in the dark there.

--
David

From mykingheaven at gmail.com Mon Dec 7 03:42:53 2009
From: mykingheaven at gmail.com (David Shieh)
Date: Mon, 7 Dec 2009 10:42:53 +0800
Subject: [lxml-dev] Can't load the whole context
Message-ID:

Hi everyone,

I've a problem by using lxml to parse my html file. I opened a html file,
then use lxml.html.fromstring to parse it, but when I use xpath find an
element and print it, it can't show the whole context.

Here's the code for my operations.

    from lxml.html import fromstring, tostring

    def test():
        html = open('/home/icefox/qingbao/sina.htm')
        tree = fromstring("".join(html.readlines()))
        text = tree.xpath('//body')
        file = open('test.html', 'wb')
        print len(text)
        for i, t in enumerate(text):
            file.write(tostring(t))
            print str(i) + "---" + tostring(t)[:50]
        file.close()

--
----------------------------------------------
Attitude determines everything !
----------------------------------------------

From stefan_ml at behnel.de Mon Dec 7 09:36:26 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Dec 2009 09:36:26 +0100
Subject: [lxml-dev] Can't load the whole context
In-Reply-To:
References:
Message-ID: <4B1CBE8A.5090008@behnel.de>

David Shieh, 07.12.2009 03:42:
> I've a problem by using lxml to parse my html file.
I opened a html file,
> then use lxml.html.fromstring to parse it, but when I use xpath find an
> element and print it, it can't show the whole context.

What do you mean by "context"? What output do you get and what output do
you expect?

Stefan

From stefan_ml at behnel.de Mon Dec 7 10:55:23 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Dec 2009 10:55:23 +0100
Subject: [lxml-dev] Can't load the whole context
In-Reply-To:
References: <4B1CBE8A.5090008@behnel.de>
Message-ID: <4B1CD10B.8060604@behnel.de>

Hi,

please keep the list involved.

David Shieh, 07.12.2009 10:43:
> 2009/12/7 Stefan Behnel
>> David Shieh, 07.12.2009 03:42:
>>> I've a problem by using lxml to parse my html file. I opened a html file,
>>> then use lxml.html.fromstring to parse it, but when I use xpath find an
>>> element and print it, it can't show the whole context.
>> What do you mean by "context"? What output do you get and what output do
>> you expect?
>>
> I used xpath '//body' to find body element, so I expected the output is the
> whole text in body element.

That tells me neither what you got as output, nor what exactly you expected
that you did not get. In case what you want is the complete serialisation
of the "body" element, then serialise the body element.

Stefan

From steven.vereecken at gmail.com Mon Dec 7 16:29:44 2009
From: steven.vereecken at gmail.com (Steven Vereecken)
Date: Mon, 7 Dec 2009 16:29:44 +0100
Subject: [lxml-dev] Critical crashes on Windows under high load
Message-ID:

> Martin Aspeli, 02.11.2009 03:58:
>> We have an incredibly frustrating, show-stopping problem using lxml (under
>> Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on
>> Windows.
>>
>> Under high load, the Python process crashes.
There is no traceback in the log, >> so I can't identify where it actually happens, but we get a Windows error >> dialogue saying python.exe (or pythonservice.exe if running as a Windows >> service) has crashed in etree.pyd (at some binary address, no line numbers or >> function references). >>[...] > > Any news from this front? > > Stefan > I don't know if this is any help, but I experienced a (possibly) similar problem under high load on Windows, where I got an error dialog only mentioning: The instruction at "0x00e2f922" referenced memory at "0x00000000". The memory could not be "read". This was with lxml version 2.2.2. The script in question was parsing a lot of large (6 to 10MB) documents, finding an element with a certain ID, replacing that with a new one (each time a deepcopy of an original), validating the result with a dtd, and writing the result back out. After some experimenting, it started to look as if the garbagecollector just couldn't follow (anything I did to make sure I didn't keep accidental references to the documents that I knew of didn't help) Calling gc.collect() after each file was processed "solved" the problem. I don't know if this is at all related to Martin's problem, but I thought I'd mention it. It might just help... greetings, Steven From steven.vereecken at gmail.com Mon Dec 7 16:39:38 2009 From: steven.vereecken at gmail.com (Steven Vereecken) Date: Mon, 7 Dec 2009 16:39:38 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: Message-ID: 2009/12/7 Steven Vereecken : >> Martin Aspeli, 02.11.2009 03:58: >>> We have an incredibly frustrating, show-stopping problem using lxml (under >>> Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on >>> Windows. >>> >>> Under high load, the Python process crashes. 
There is no traceback in the log, >>> so I can't identify where it actually happens, but we get a Windows error >>> dialogue saying python.exe (or pythonservice.exe if running as a Windows >>> service) has crashed in etree.pyd (at some binary address, no line numbers or >>> function references). >>>[...] >> >> Any news from this front? >> >> Stefan >> > > I don't know if this is any help, but I experienced a (possibly) > similar problem under high load on Windows, where I got an error > dialog only mentioning: The instruction at "0x00e2f922" referenced > memory at "0x00000000". The memory could not be "read". > > This was with lxml version 2.2.2. > > The script in question was parsing a lot of large (6 to 10MB) > documents, finding an element with a certain ID, replacing that with a > new one (each time a deepcopy of an original), validating the result > with a dtd, and writing the result back out. > > After some experimenting, it started to look as if the > garbagecollector just couldn't follow (anything I did to make sure I > didn't keep accidental references to the documents that I knew of > didn't help) Calling gc.collect() after each file was processed > "solved" the problem. > > I don't know if this is at all related to Martin's problem, but I > thought I'd mention it. It might just help... > > > greetings, > > Steven > Oh, and I forgot to mention: memory grew VERY large! (in case that wasn't clear from my description) From stefan_ml at behnel.de Mon Dec 7 16:44:21 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 Dec 2009 16:44:21 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: Message-ID: <4B1D22D5.3080107@behnel.de> Steven Vereecken, 07.12.2009 16:39: >> I don't know if this is any help, but I experienced a (possibly) >> similar problem under high load on Windows, where I got an error >> dialog only mentioning: The instruction at "0x00e2f922" referenced >> memory at "0x00000000". 
The memory could not be "read". >> >> This was with lxml version 2.2.2. >> >> The script in question was parsing a lot of large (6 to 10MB) >> documents, finding an element with a certain ID, replacing that with a >> new one (each time a deepcopy of an original), validating the result >> with a dtd, and writing the result back out. >> >> After some experimenting, it started to look as if the >> garbagecollector just couldn't follow (anything I did to make sure I >> didn't keep accidental references to the documents that I knew of >> didn't help) Calling gc.collect() after each file was processed >> "solved" the problem. >> >> I don't know if this is at all related to Martin's problem, but I >> thought I'd mention it. It might just help... >> > Oh, and I forgot to mention: memory grew VERY large! (in case that > wasn't clear from my description) That sounds more like a memory problem due to the garbage collector not cleaning up reference cycles in time. In case you used "element.attrib" in your code, this should no longer be a problem in lxml 2.3. Stefan From optilude+lists at gmail.com Mon Dec 7 16:45:40 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 07 Dec 2009 23:45:40 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: Message-ID: Steven Vereecken wrote: > 2009/12/7 Steven Vereecken : >>> Martin Aspeli, 02.11.2009 03:58: >>>> We have an incredibly frustrating, show-stopping problem using lxml (under >>>> Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on >>>> Windows. >>>> >>>> Under high load, the Python process crashes. There is no traceback in the log, >>>> so I can't identify where it actually happens, but we get a Windows error >>>> dialogue saying python.exe (or pythonservice.exe if running as a Windows >>>> service) has crashed in etree.pyd (at some binary address, no line numbers or >>>> function references). >>>> [...] >>> Any news from this front? 
>>> >>> Stefan >>> >> I don't know if this is any help, but I experienced a (possibly) >> similar problem under high load on Windows, where I got an error >> dialog only mentioning: The instruction at "0x00e2f922" referenced >> memory at "0x00000000". The memory could not be "read". >> >> This was with lxml version 2.2.2. >> >> The script in question was parsing a lot of large (6 to 10MB) >> documents, finding an element with a certain ID, replacing that with a >> new one (each time a deepcopy of an original), validating the result >> with a dtd, and writing the result back out. >> >> After some experimenting, it started to look as if the >> garbagecollector just couldn't follow (anything I did to make sure I >> didn't keep accidental references to the documents that I knew of >> didn't help) Calling gc.collect() after each file was processed >> "solved" the problem. >> >> I don't know if this is at all related to Martin's problem, but I >> thought I'd mention it. It might just help... >> >> >> greetings, >> >> Steven >> > > Oh, and I forgot to mention: memory grew VERY large! (in case that > wasn't clear from my description) This may be the same thing. At least we saw pretty rapid memory growth, and a similarly stupid error message. I'm not sure where one would insert such a gc.collect() call (is this the Python GC?). In lxml? In Deliverance? Or whether this is the right thing to do... :) Unfortunately, the project has now moved off Deliverance as mentioned, so it's going to be hard to try this out. :( Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. 
See http://martinaspeli.net/plone-book From optilude+lists at gmail.com Mon Dec 7 17:02:52 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Tue, 08 Dec 2009 00:02:52 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4B1D22D5.3080107@behnel.de> References: <4B1D22D5.3080107@behnel.de> Message-ID: Stefan Behnel wrote: > Steven Vereecken, 07.12.2009 16:39: >>> I don't know if this is any help, but I experienced a (possibly) >>> similar problem under high load on Windows, where I got an error >>> dialog only mentioning: The instruction at "0x00e2f922" referenced >>> memory at "0x00000000". The memory could not be "read". >>> >>> This was with lxml version 2.2.2. >>> >>> The script in question was parsing a lot of large (6 to 10MB) >>> documents, finding an element with a certain ID, replacing that with a >>> new one (each time a deepcopy of an original), validating the result >>> with a dtd, and writing the result back out. >>> >>> After some experimenting, it started to look as if the >>> garbagecollector just couldn't follow (anything I did to make sure I >>> didn't keep accidental references to the documents that I knew of >>> didn't help) Calling gc.collect() after each file was processed >>> "solved" the problem. >>> >>> I don't know if this is at all related to Martin's problem, but I >>> thought I'd mention it. It might just help... >>> >> Oh, and I forgot to mention: memory grew VERY large! (in case that >> wasn't clear from my description) > > That sounds more like a memory problem due to the garbage collector not > cleaning up reference cycles in time. In case you used "element.attrib" in > your code, this should no longer be a problem in lxml 2.3. Can you elaborate on this? Is there a bug in 2.2.x that would cause this error? Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. 
See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Mon Dec 7 17:23:39 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 Dec 2009 17:23:39 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: <4B1D22D5.3080107@behnel.de> Message-ID: <4B1D2C0B.5000802@behnel.de> Martin Aspeli, 07.12.2009 17:02: > Stefan Behnel wrote: >> That sounds more like a memory problem due to the garbage collector not >> cleaning up reference cycles in time. In case you used "element.attrib" in >> your code, this should no longer be a problem in lxml 2.3. > > Can you elaborate on this? Is there a bug in 2.2.x that would cause this > error? Not a bug, rather an unfortunate design decision. Accessing "element.attrib" creates a reference cycle in 2.x (don't remember the version where it was added). It therefore requires garbage collection for clean-up. So until the garbage collector is triggered, the document that contains the Element will stay alive, which can be a problem in systems with high document throughput, and especially with large documents. I decided to remove the cyclic reference in lxml 2.3, which makes "element.attrib" somewhat slower but avoids the cyclic garbage collection. So unless you add ref-cycles in your own code, lxml will no longer do it for you. I don't remember that you mentioned high memory load before the crash - that would have been a hint. Stefan From stefan_ml at behnel.de Mon Dec 7 17:32:34 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 Dec 2009 17:32:34 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: Message-ID: <4B1D2E22.4050705@behnel.de> Martin Aspeli, 07.12.2009 16:45: > This may be the same thing. At least we saw pretty rapid memory growth, There might also be something else that triggers this. You mentioned heavy XPath usage - did you (or does deliverance) search for text or attribute values? 
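
[A minimal sketch of the reference-cycle effect described above, using only the standard library rather than lxml. The `Document` and `Element` classes and both function names are hypothetical stand-ins, not lxml types or APIs. The sketch only illustrates, under CPython's reference-counting behaviour, that an object caught in a cycle stays alive, together with everything it references, until the cyclic garbage collector runs; that is the situation a forced gc.collect() works around.]

```python
import gc
import weakref


class Document(object):
    """Hypothetical stand-in for a large parsed document."""


class Element(object):
    """Hypothetical stand-in for an element that keeps its document alive."""

    def __init__(self, document):
        self.document = document
        self.attrib_proxy = None


def make_element(with_cycle):
    """Return (weak reference to the document, element)."""
    doc = Document()
    elem = Element(doc)
    if with_cycle:
        # elem -> dict -> elem: a reference cycle, loosely analogous to the
        # cycle that Stefan says "element.attrib" creates in lxml 2.x.
        elem.attrib_proxy = {"owner": elem}
    return weakref.ref(doc), elem


def document_alive_after_del(with_cycle):
    """Report whether the document survives `del elem`, before and after
    an explicit gc.collect(). Assumes CPython reference counting."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running mid-test
    try:
        doc_ref, elem = make_element(with_cycle)
        del elem
        alive_before = doc_ref() is not None
        gc.collect()
        alive_after = doc_ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


if __name__ == "__main__":
    print(document_alive_after_del(with_cycle=False))
    print(document_alive_after_del(with_cycle=True))
```

[Without the cycle, dropping the last reference frees the element and its document immediately. With the cycle, both linger until a collection pass, so a process that handles many large documents between passes can see its memory grow quickly.]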
Since 2.2, lxml returns 'smart strings' by default, which keep a reference to their Element. This needs to be disabled explicitly for an XPath call if you do not want it. I guess that deliverance doesn't do this, so depending on what you do with the string results and how long you keep them alive, they may well keep their original document alive as well. There are two ways to deal with this: pass "smart_strings=False" into the XPath evaluation, or pass each string result through str() on reception. Stefan From stefan_ml at behnel.de Mon Dec 7 17:33:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 Dec 2009 17:33:53 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4B1D2E22.4050705@behnel.de> References: <4B1D2E22.4050705@behnel.de> Message-ID: <4B1D2E71.2010705@behnel.de> Stefan Behnel, 07.12.2009 17:32: > pass each string result through str() on reception. ... ok, that's for Py3. Make that unicode() in Py2. Stefan From stefan_ml at behnel.de Tue Dec 8 09:44:07 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 08 Dec 2009 09:44:07 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <5af21ed10912071653u637f76dm726c483d79003da4@mail.gmail.com> References: <4B1D2E22.4050705@behnel.de> <5af21ed10912071653u637f76dm726c483d79003da4@mail.gmail.com> Message-ID: <4B1E11D7.1040402@behnel.de> Martin Aspeli, 08.12.2009 01:53: > 2009/12/8 Stefan Behnel: >> Martin Aspeli, 07.12.2009 16:45: >>> This may be the same thing. At least we saw pretty rapid memory growth, >> There might also be something else that triggers this. You mentioned heavy >> XPath usage - did you (or does deliverance) search for text or attribute >> values? > > Yes, definitely, all the time. > > It will have used the css-to-xpath stuff to search by id or class, specifically. That's not what I meant - I was referring to the return values, i.e. 
if you receive attribute values or text as a string result of an XPath query, as opposed to Elements. Passing XPath variables and using string values in queries is not related to this. >> Since 2.2, lxml returns 'smart strings' by default, which keep a >> reference to their Element. This needs to be disabled explicitly for an >> XPath call if you do not want it. >> >> I guess that deliverance doesn't do this, so depending on what you do with >> the string results and how long you keep them alive, they may well keep >> their original document alive as well. > > I can't imagine Deliverance keeping anything after the transformation > has finished, though. It's a WSGI middleware filter, and I don't think > it stores anything between requests. Ok, so I guess this behaviour is not a problem here. Stefan From optilude+lists at gmail.com Tue Dec 8 12:35:23 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Tue, 08 Dec 2009 19:35:23 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4B1E11D7.1040402@behnel.de> References: <4B1D2E22.4050705@behnel.de> <5af21ed10912071653u637f76dm726c483d79003da4@mail.gmail.com> <4B1E11D7.1040402@behnel.de> Message-ID: Stefan Behnel wrote: > Martin Aspeli, 08.12.2009 01:53: >> 2009/12/8 Stefan Behnel: >>> Martin Aspeli, 07.12.2009 16:45: >>>> This may be the same thing. At least we saw pretty rapid memory growth, >>> There might also be something else that triggers this. You mentioned heavy >>> XPath usage - did you (or does deliverance) search for text or attribute >>> values? >> Yes, definitely, all the time. >> >> It will have used the css-to-xpath stuff to search by id or class, specifically. > > That's not what I meant - I was referring to the return values, i.e. if you > receive attribute values or text as a string result of an XPath query, as > opposed to Elements. Passing XPath variables and using string values in > queries is not related to this. 
I'm not sure I understand the implications fully, but Deliverance basically does this: - read a configuration file (XML) which contains rules like "replace this element in the 'theme' with this element in the 'content'" - the rules are specified in terms of xpath or CSS; I'm pretty sure they're resolved to XPath eventually. - parse the 'theme' HTML, usually a static file, with the HTML parser - parse the 'content' HTML, e.g. as output from Plone, with the HTML parser - apply the rules, basically modifying the theme tree (and sometimes, temporarily, the content tree, e.g. for 'drop' rules) - serialise the result This happens per request, though the rules may be cached. The code is here: http://codespeak.net/svn/z3/deliverance/trunk/ >>> Since 2.2, lxml returns 'smart strings' by default, which keep a >>> reference to their Element. This needs to be disabled explicitly for an >>> XPath call if you do not want it. >>> >>> I guess that deliverance doesn't do this, so depending on what you do with >>> the string results and how long you keep them alive, they may well keep >>> their original document alive as well. >> I can't imagine Deliverance keeping anything after the transformation >> has finished, though. It's a WSGI middleware filter, and I don't think >> it stores anything between requests. > > Ok, so I guess this behaviour is not a problem here. There may of course be a bug in Deliverance that does this by accident. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From steven.vereecken at gmail.com Tue Dec 8 18:03:26 2009 From: steven.vereecken at gmail.com (Steven Vereecken) Date: Tue, 8 Dec 2009 18:03:26 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load Message-ID: > The code is here: http://codespeak.net/svn/z3/deliverance/trunk/ > >>>> Since 2.2, lxml returns 'smart strings' by default, which keep a >>>> reference to their Element. 
This needs to be disabled explicitly for an
>>>> XPath call if you do not want it.
>>>>
>>>> I guess that deliverance doesn't do this, so depending on what you do with
>>>> the string results and how long you keep them alive, they may well keep
>>>> their original document alive as well.
>>> I can't imagine Deliverance keeping anything after the transformation
>>> has finished, though. It's a WSGI middleware filter, and I don't think
>>> it stores anything between requests.
>>
>> Ok, so I guess this behaviour is not a problem here.
>
> There may of course be a bug in Deliverance that does this by accident.

From a quick look, it seems that there are indeed a couple of places
in Deliverance (rules.py, paster_templates.py, pyref.py) where
"element.attrib" is used, and that could lead to the situation with
the cyclic references that Stefan described (solved in the upcoming
version). In fact, I think it almost certainly *will* trigger that
problem, if you have large documents and a lot of requests in a short
timespan.

The quickest way to check or work around this would probably be to add
"gc.collect()" somewhere in DeliveranceMiddleware.__call__, so that
somewhere in each request, a garbage collection is forced (not sure if
this would slow down things much, but it should keep memory usage down
if this "bug" is the cause).
Or cleaner: add an extra middleware to the wsgi stack that does this
as long as we don't have lxml 2.3 yet...

(there might be better ways, but the gc.collect() trick worked for me ;-))

From mykingheaven at gmail.com Wed Dec 9 08:37:27 2009
From: mykingheaven at gmail.com (David Shieh)
Date: Wed, 9 Dec 2009 15:37:27 +0800
Subject: [lxml-dev] lxml missed elements !
Message-ID:

Dear developers,

Here's a comparison of lxml and libxml2, and here's the code I used to do
the comparison:

#coding=utf-8

import libxml2
import lxml.html as H
from lxml.html import fromstring, tostring

def main():
    filelists = [['/home/icefox/qingbao/sina.htm',
                  '//body/div[2]/div[4]/div[2]/div'],
                 ['/home/icefox/qingbao/qq.htm',
                  '//body/div']]
    for file in filelists:
        name = file[0]
        xpath = file[1]
        #libxml2
        doc = libxml2.htmlReadFile(name, 'utf-8', 1)
        test = doc.xpathNewContext()
        res = test.xpathEval(xpath)

        print 'libxml2 - ' + str(len(res))

        #lxml
        file = open(name, 'r')
        ldoc = H.fromstring("".join(file.readlines()))
        lres = ldoc.xpath(xpath)

        print 'lxml - ' + str(len(lres))


if __name__ == '__main__':
    main()

and the output is:

libxml2 - 4
lxml - 3
libxml2 - 28
lxml - 4

lxml always misses elements! Am I doing something wrong, or is it a bug?

Regards,
David

--
----------------------------------------------
Attitude determines everything !
----------------------------------------------

From stefan_ml at behnel.de Wed Dec 9 08:56:53 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 09 Dec 2009 08:56:53 +0100
Subject: [lxml-dev] lxml missed elements !
In-Reply-To: References: Message-ID: <4B1F5845.8070200@behnel.de> David Shieh, 09.12.2009 08:37: > Here's a comparison of lxml and libxml2, and here's the code I used to do > the comparison: > > #coding=utf-8 > > import libxml2 > import lxml.html as H > from lxml.html import fromstring, tostring > > def main(): > filelists = [['/home/icefox/qingbao/sina.htm', > '//body/div[2]/div[4]/div[2]/div'],['/home/icefox/qingbao/qq.htm', > '//body/div']] > for file in filelists: > name = file[0] > xpath = file[1] > #libxml2 > doc = libxml2.htmlReadFile(name, 'utf-8', 1) > test = doc.xpathNewContext() > res = test.xpathEval(xpath) > > print 'libxml2 - ' + str(len(res)) > > #lxml > file = open(name, 'r') > ldoc = H.fromstring("".join(file.readlines())) > lres = ldoc.xpath(xpath) > > print 'lxml - ' + str(len(lres)) > > > if __name__ == '__main__': > main() > > and the output is : > > libxml2 - 4 > lxml - 3 > libxml2 - 28 > lxml - 4 Your report lacks a couple of important details. What system are you on? Which versions of libxml2 and lxml (and its libxml2) are you using? What's the difference between the two parsed trees? How broken is the HTML that you parse? Did you try parsing without error recovery? Why do you use such complicated code for parsing with lxml, instead of the obvious equivalent of the libxml2 code? Stefan From stefan_ml at behnel.de Wed Dec 9 11:22:08 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 09 Dec 2009 11:22:08 +0100 Subject: [lxml-dev] lxml missed elements ! In-Reply-To: References: <4B1F5845.8070200@behnel.de> Message-ID: <4B1F7A50.1060702@behnel.de> Hi, please always reply to the list, and please don't top-post. David Shieh, 09.12.2009 11:09: > 2009/12/9 Stefan Behnel >> David Shieh, 09.12.2009 08:37: >>> [...] >>> filelists = [['/home/icefox/qingbao/sina.htm', >>> '//body/div[2]/div[4]/div[2]/div'],['/home/icefox/qingbao/qq.htm', >>> '//body/div']] >> Your report lacks a couple of important details. What system are you on? 
>> Which versions of libxml2 and lxml (and its libxml2) are you using? What's >> the difference between the two parsed trees? How broken is the HTML that >> you parse? Did you try parsing without error recovery? Why do you use such >> complicated code for parsing with lxml, instead of the obvious equivalent >> of the libxml2 code? What I meant with the last comment is that the obvious code for parsing a file in lxml.html is tree = lxml.html.parse(file_path_or_URL) > libxml2 version is 2.7.4 > lxml version is 2.2.4 > OS is Gentoo with kernel 2.6.31 > > sina.htm and qq.com are 2 different websites' index pages. sina.htm is > http://www.sina.com.cn/ and qq.htm is http://www.qq.com/. > I do these tests because lxml lost elements as I really need these elements. I don't think it lost them, it's rather likely that it repaired your HTML document in one way or another and now your XPath expression no longer fits the document. Try to make your XPath expression more robust, or use lxml.cssselect instead. An expression like the above is very likely to fail with even minor changes to the web page. > I don't quite understand your parsing without error recovery, could you > explain it more? http://codespeak.net/lxml/parsing.html#parser-options The same applies to lxml.html, just use lxml.html.HTMLParser(). Stefan From mykingheaven at gmail.com Thu Dec 10 07:43:53 2009 From: mykingheaven at gmail.com (David Shieh) Date: Thu, 10 Dec 2009 14:43:53 +0800 Subject: [lxml-dev] lxml missed elements ! In-Reply-To: <4B1F7A50.1060702@behnel.de> References: <4B1F5845.8070200@behnel.de> <4B1F7A50.1060702@behnel.de> Message-ID: Hi, Thanks for your reply. I will modify my code then. But, I've got a new question through your reply. You said it is likely to repair my HTML document to make it right, but shouldn't it be the same structure as libxml2 and browser do? 
Why will lxml repair html document and make its structure different from what we can see in the browser, or even different from libxml2? What I mean is, lxml should parse the html document into a structure same as libxml2 and browser do, and I think that's the right structure of this html document, isn't it? And for xpath choose elements, there's nothing more I can do. The requirment is to build the structure and find an element exactly in that position. I can rely it on class name or id name, cause I DO see some html documents that contain many elements with same id name.So I guess, xpath is the best way to select the specific element. Regards, David 2009/12/9 Stefan Behnel > Hi, > > please always reply to the list, and please don't top-post. > > > David Shieh, 09.12.2009 11:09: > > 2009/12/9 Stefan Behnel > >> David Shieh, 09.12.2009 08:37: > >>> [...] > >>> filelists = [['/home/icefox/qingbao/sina.htm', > >>> '//body/div[2]/div[4]/div[2]/div'],['/home/icefox/qingbao/qq.htm', > >>> '//body/div']] > >> Your report lacks a couple of important details. What system are you on? > >> Which versions of libxml2 and lxml (and its libxml2) are you using? > What's > >> the difference between the two parsed trees? How broken is the HTML that > >> you parse? Did you try parsing without error recovery? Why do you use > such > >> complicated code for parsing with lxml, instead of the obvious > equivalent > >> of the libxml2 code? > > What I meant with the last comment is that the obvious code for parsing a > file in lxml.html is > > tree = lxml.html.parse(file_path_or_URL) > > > > libxml2 version is 2.7.4 > > lxml version is 2.2.4 > > OS is Gentoo with kernel 2.6.31 > > > > sina.htm and qq.com are 2 different websites' index pages. sina.htm is > > http://www.sina.com.cn/ and qq.htm is http://www.qq.com/. > > I do these tests because lxml lost elements as I really need these > elements. 
> > I don't think it lost them, it's rather likely that it repaired your HTML > document in one way or another and now your XPath expression no longer fits > the document. > > Try to make your XPath expression more robust, or use lxml.cssselect > instead. An expression like the above is very likely to fail with even > minor changes to the web page. > > > > I don't quite understand your parsing without error recovery, could you > > explain it more? > > http://codespeak.net/lxml/parsing.html#parser-options > > The same applies to lxml.html, just use lxml.html.HTMLParser(). > > Stefan > > -- ---------------------------------------------- Attitude determines everything ! ---------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091210/daaaf494/attachment.htm From stefan_ml at behnel.de Thu Dec 10 10:34:12 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 10 Dec 2009 10:34:12 +0100 Subject: [lxml-dev] lxml missed elements ! In-Reply-To: References: <4B1F5845.8070200@behnel.de> <4B1F7A50.1060702@behnel.de> Message-ID: <4B20C094.1010107@behnel.de> Hi, *please*, don't top-post. David Shieh, 10.12.2009 07:43: > But, I've got a new question through your reply. You said it is likely to > repair my HTML document to make it right, > but shouldn't it be the same structure as libxml2 and browser do? Why will > lxml repair html document and make its structure different from what we can > see in the browser, or even different from libxml2? 1) because there isn't "one right way" to repair a broken HTML document. Broken is broken. 2) it quite likely *does* give you what libxml2 returns. You just have to use the same setup for libxml2 that lxml uses. BTW, you still didn't provide any example output that shows /how/ lxml behaves different. > And for xpath choose elements, there's nothing more I can do. 
The requirment > is to build the structure and find an element exactly in that position. I actually doubt that "exactly in that position" implies exactly one XPath expression. If software development was about dropping specifications into code 1:1, we'd all be out of our job by now (either being fired or bored to death). Stefan From nicolas at nexedi.com Fri Dec 11 19:45:58 2009 From: nicolas at nexedi.com (Nicolas Delaby) Date: Fri, 11 Dec 2009 19:45:58 +0100 Subject: [lxml-dev] RelaxNG validation result depending of python version Message-ID: <4B229366.9070302@nexedi.com> Hi, I' have a issue with RelaxNG validation depending of version of python used. xmllint => No errors Python 2.6.4 => No Errors Python 2.4.6 => Errors as the version of libraries are the same, i don't know how to find the weakness in my python2.4 installation. Do you have any clue to help me understanding this issue ? Nicolas # xmllint --version xmllint: using libxml version 20706 For both python2.4 and python2.6 >>> etree.LXML_VERSION (2, 2, 4, 0) >>> etree.LIBXML_VERSION (2, 7, 6) >>> etree.LIBXML_COMPILED_VERSION (2, 7, 6) >>> etree.LIBXSLT_VERSION (1, 1, 20) >>> etree.LIBXSLT_COMPILED_VERSION (1, 1, 20) PS: I try to validate ODF content with oasis rng http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-schema-v1.1.rng -- Nicolas Delaby Nexedi: Consulting and Development of Libre / Open Source Software http://www.nexedi.com/ From stefan_ml at behnel.de Fri Dec 11 23:31:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 11 Dec 2009 23:31:53 +0100 Subject: [lxml-dev] RelaxNG validation result depending of python version In-Reply-To: <4B229366.9070302@nexedi.com> References: <4B229366.9070302@nexedi.com> Message-ID: <4B22C859.3010407@behnel.de> Nicolas Delaby, 11.12.2009 19:45: > I' have a issue with RelaxNG validation depending of version of python used. > > xmllint => No errors > Python 2.6.4 => No Errors > Python 2.4.6 => Errors What kind of errors? 
> as the version of libraries are the same, i don't know how to find the
> weakness in my python2.4 installation.
>
> Do you have any clue to help me understanding this issue ?
>
>
> Nicolas
>
> # xmllint --version
> xmllint: using libxml version 20706
>
> For both python2.4 and python2.6
>>>> etree.LXML_VERSION
> (2, 2, 4, 0)
>>>> etree.LIBXML_VERSION
> (2, 7, 6)
>>>> etree.LIBXML_COMPILED_VERSION
> (2, 7, 6)
>>>> etree.LIBXSLT_VERSION
> (1, 1, 20)
>>>> etree.LIBXSLT_COMPILED_VERSION
> (1, 1, 20)

Do you have the libxml2 Python bindings installed in any of the Python
versions? Is 2.7.6 the only libxml2 version that is installed on your system?

Stefan

From nicolas at nexedi.com Mon Dec 14 12:33:24 2009
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Mon, 14 Dec 2009 12:33:24 +0100
Subject: [lxml-dev] RelaxNG validation result depending of python version
In-Reply-To: <4B22C859.3010407@behnel.de>
References: <4B229366.9070302@nexedi.com> <4B22C859.3010407@behnel.de>
Message-ID: <4B262284.5020208@nexedi.com>

Stefan Behnel wrote:
> Nicolas Delaby, 11.12.2009 19:45:
>> I' have a issue with RelaxNG validation depending of version of python used.
>>
>> xmllint => No errors
>> Python 2.6.4 => No Errors
>> Python 2.4.6 => Errors
>
> What kind of errors?

It says that the XML document does not comply with the RNG schema:

:2:0:ERROR:VALID:DTD_UNKNOWN_ID: IDREF attribute control references an unknown ID "control1"

>
> Do you have the libxml2 Python bindings installed in any of the Python
> versions?

I tried with and without, for both Python versions. The result is still the same.

> Is 2.7.6 the only libxml2 version that is installed on your system?

Yes it is.
Nicolas

--
Nicolas Delaby
Nexedi: Consulting and Development of Libre / Open Source Software
http://www.nexedi.com/

From ovnicraft at gmail.com Tue Dec 15 06:19:11 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Tue, 15 Dec 2009 00:19:11 -0500
Subject: [lxml-dev] Cant add a new elements
Message-ID:

Hi folks,

I am trying to add elements to my file [1]. I wrote a dirty script [2] for
that, but when I print the root I see that it doesn't change.
Am I forgetting something?

Best regards,

[1] http://pastebin.com/mfb48def
[2] http://pastebin.com/m7cae532e

--
Cristian Salamea
CEO GnuThink Software Labs
Software Libre / Open Source
(+593-8) 4-36-44-48

From terry_n_brown at yahoo.com Tue Dec 15 15:10:07 2009
From: terry_n_brown at yahoo.com (Terry Brown)
Date: Tue, 15 Dec 2009 08:10:07 -0600
Subject: [lxml-dev] Cant add a new elements
Message-ID: <20091215081007.271cc9a3@nrri.umn.edu>

On Tue, 15 Dec 2009 00:19:11 -0500 Ovnicraft wrote:

> Hi folks, I am trying to add elements to my file [1]. I wrote a dirty
> script [2] for that, but when I print the root I see that it doesn't change.
> Am I forgetting something?

I ran your code (Ubuntu 9.10, python 2.6, lxml 2.1.5) and it seemed to work:

http://pastebin.com/m142da07e

The only change was that I commented out the pdb lines.

Cheers -Terry

> best regards,
>
> [1] http://pastebin.com/mfb48def
> [2] http://pastebin.com/m7cae532e
>

From nicolas at nexedi.com Wed Dec 16 17:44:17 2009
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Wed, 16 Dec 2009 17:44:17 +0100
Subject: [lxml-dev] RelaxNG validation result depending of python version
In-Reply-To: <4B262284.5020208@nexedi.com>
References: <4B229366.9070302@nexedi.com> <4B22C859.3010407@behnel.de> <4B262284.5020208@nexedi.com>
Message-ID: <4B290E61.6010005@nexedi.com>

Hi, I provide more materials, maybe it will help.
XML source: http://pastebin.com/m7c9849d RNG: http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-schema-v1.1.rng sample which show that error occurs only with lxml but libxml2: http://pastebin.com/m1cc51ed9 more information about python-libxml2 package of my system rpm -qi python2.4-libxml2 Name : python2.4-libxml2 Relocations: (not relocatable) Version : 2.6.29 Vendor: Mandriva If you need more details do not hesitate to ask. Thanks for having a look, Nicolas -- Nicolas Delaby Nexedi: Consulting and Development of Libre / Open Source Software http://www.nexedi.com/ From stefan_ml at behnel.de Wed Dec 16 17:46:49 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 Dec 2009 17:46:49 +0100 Subject: [lxml-dev] RelaxNG validation result depending of python version In-Reply-To: <4B290E61.6010005@nexedi.com> References: <4B229366.9070302@nexedi.com> <4B22C859.3010407@behnel.de> <4B262284.5020208@nexedi.com> <4B290E61.6010005@nexedi.com> Message-ID: <4B290EF9.7010200@behnel.de> Nicolas Delaby, 16.12.2009 17:44: > more information about python-libxml2 package of my system > rpm -qi python2.4-libxml2 > Name : python2.4-libxml2 Relocations: (not relocatable) > Version : 2.6.29 Vendor: Mandriva Hmm, didn't you say that 2.7.6 was the only libxml2 version installed on your system? Looks like this uses 2.6.29! 
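One way to double-check which libxml2 a given interpreter actually ends up with is to compare the compiled and runtime version tuples side by side. The following is only a minimal sketch: `describe_versions` is a hypothetical helper, while `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are the lxml attributes already quoted earlier in this thread. Note that it reports what lxml sees, not what the separate libxml2 Python bindings were built against.

```python
def describe_versions(runtime, compiled):
    """Summarise a (runtime, compiled) pair of version tuples in one line."""
    def fmt(version):
        return ".".join(str(part) for part in version)

    if tuple(runtime) == tuple(compiled):
        return "libxml2 %s (compiled and runtime match)" % fmt(runtime)
    return "libxml2 runtime %s differs from compiled %s" % (
        fmt(runtime), fmt(compiled))


try:
    from lxml import etree
except ImportError:
    # lxml is not installed for this interpreter, so there is nothing to report.
    etree = None

if etree is not None:
    print(describe_versions(etree.LIBXML_VERSION,
                            etree.LIBXML_COMPILED_VERSION))
```

Running the same snippet under python2.4 and python2.6 would show whether both interpreters load the same library.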
Stefan

From nicolas at nexedi.com Wed Dec 16 18:08:26 2009
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Wed, 16 Dec 2009 18:08:26 +0100
Subject: [lxml-dev] RelaxNG validation result depending of python version
In-Reply-To: <4B290EF9.7010200@behnel.de>
References: <4B229366.9070302@nexedi.com> <4B22C859.3010407@behnel.de> <4B262284.5020208@nexedi.com> <4B290E61.6010005@nexedi.com> <4B290EF9.7010200@behnel.de>
Message-ID: <4B29140A.4020609@nexedi.com>

Stefan Behnel wrote:
> Nicolas Delaby, 16.12.2009 17:44:
>> more information about python-libxml2 package of my system
>> rpm -qi python2.4-libxml2
>> Name : python2.4-libxml2 Relocations: (not relocatable)
>> Version : 2.6.29 Vendor: Mandriva
>
> Hmm, didn't you say that 2.7.6 was the only libxml2 version installed on
> your system? Looks like this uses 2.6.29!
>

I'm no expert, but as far as I can understand, the python2.4 bindings were
compiled against libxml2 2.6.29, but that doesn't mean libxml2 2.6.29 is
installed. I'm almost sure that only libxml2-2.7.6 is installed.

# ldconfig -v | grep libxml2
libxml2.so.2 -> libxml2.so.2.7.6

Am I wrong?

Anyway, python2.4-libxml2 (2.6.29) seems to validate the document against the
RNG just like xmllint (2.7.6) does (see http://pastebin.com/m1cc51ed9).

Nicolas

--
Nicolas Delaby
Nexedi: Consulting and Development of Libre / Open Source Software
http://www.nexedi.com/

From stefan_ml at behnel.de Wed Dec 16 18:29:24 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 16 Dec 2009 18:29:24 +0100
Subject: [lxml-dev] RelaxNG validation result depending of python version
In-Reply-To: <4B290E61.6010005@nexedi.com>
References: <4B229366.9070302@nexedi.com> <4B22C859.3010407@behnel.de> <4B262284.5020208@nexedi.com> <4B290E61.6010005@nexedi.com>
Message-ID: <4B2918F4.1040706@behnel.de>

Nicolas Delaby, 16.12.2009 17:44:
> Hi, I provide more materials, maybe it will help.
> > XML source: http://pastebin.com/m7c9849d > RNG: http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-schema-v1.1.rng > > sample which show that error occurs only with lxml but libxml2: > http://pastebin.com/m1cc51ed9 > > more information about python-libxml2 package of my system > rpm -qi python2.4-libxml2 > Name : python2.4-libxml2 Relocations: (not relocatable) > Version : 2.6.29 Vendor: Mandriva Ok, I tried and I can't reproduce this. $ python2.4 -c 'import lxml.etree as et; print et.RelaxNG(file="OpenDocument-schema-v1.1.rng").validate(et.parse("m7c9849d.xml"))' True $ python2.4 -c 'import lxml.etree as et; print et.RelaxNG(et.parse("OpenDocument-schema-v1.1.rng")).validate(et.parse("m7c9849d.xml"))' True For the above I tried python 2.4.5/2.6.4 and libxml2 2.6.32/2.7.6 with lxml trunk (with no RelaxNG changes since 2.2 that I'm aware of). I also tried the validation with the libxml2 Python bindings in Python 2.4, which works nicely as well. Stefan From jholg at gmx.de Thu Dec 17 01:03:43 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 01:03:43 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B0D4551.1020702@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> Message-ID: <20091217000343.142550@gmx.net> Hi, > Oh, no further need for glory and fame on my side - just go ahead. :) Finally got around working on this, just checked in for review: Committed revision 70160. URL: http://codespeak.net/svn/lxml/branch/iso-schematron This comes complete with doc updates, unittests and whatnot. Some notes: * implemented as package lxml.schematron to cleanly bundle all the xsl/rng resources * the API allows for handing stylesheet parameters to the separate schematron-to-xsl compilation steps. Currently, these must be provided with stylesheet parameter properties in mind, i.e. text parameters must be given like {'phase': "'mandatory'"}. 
  As all the stylesheet parameters seem to be text parameters, maybe we
  could make this a little more convenient and auto-strparam() everything
  in the arg dicts. But I think it's not really worth the effort and better
  to stick to normal stylesheet parameter handling

* I had to modify xmlerror.pxi to stay as close to the original validators'
  workings as possible:

$ svn diff --old=http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi@HEAD --new=src/lxml/xmlerror.pxi
Index: src/lxml/xmlerror.pxi
===================================================================
--- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi) (revision 70161)
+++ src/lxml/xmlerror.pxi (.../src/lxml/xmlerror.pxi) (working copy)
@@ -71,6 +71,8 @@
         else:
             self.filename = _decodeFilename(error.file)
 
+    #FIXME: This seems not to have been used anywhere, so far. Is my addition
+    #FIXME: of _utf8()-ing message & filename correct?
     cdef _setGeneric(self, int domain, int type, int level, int line,
                      message, filename):
         self.domain = domain
@@ -78,8 +80,8 @@
         self.level = level
         self.line = line
         self.column = 0
-        self.message = message
-        self.filename = filename
+        self.message = _utf8(message)
+        self.filename = _utf8(filename)
 
     def __repr__(self):
         return u"%s:%d:%d:%s:%s:%s: %s" % (
@@ -102,6 +104,12 @@
         def __get__(self):
             return ErrorLevels._getName(self.level, u"unknown")
 
+#FIXME: Can _LogEntry be settable itself so we don't need this?
+cdef class _SettableLogEntry(_LogEntry):
+    cpdef setGeneric(self, int domain, int type, int level, int line,
+                     message, filename):
+        self._setGeneric(domain, type, level, line, message, filename)
+
 cdef class _BaseErrorLog:
     cdef _LogEntry _first_error
     cdef readonly object last_error
@@ -172,7 +180,7 @@
             message = u"%s, line %d" % (message, line)
         return exctype(message, code, line, column)
 
-    cdef _buildExceptionMessage(self, default_message):
+    cpdef _buildExceptionMessage(self, default_message):
         if self._first_error is None:
             return default_message
         if self._first_error.message is not None and self._first_error.message:

I'm not too sure about these changes so here are some questions:

* Ok to cpdef _buildExceptionMessage() instead of cdef?

* Instead of adding _SettableLogEntry, would it also be ok to just cpdef
  _LogEntry.setGeneric?

Also, I added a Validator class to isoschematron that mimics
etree._Validator. This wouldn't be necessary if etree._Validator made
_error_log accessible from python. Can we just do that?

Have fun
Holger

From stefan_ml at behnel.de Thu Dec 17 09:12:15 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 17 Dec 2009 09:12:15 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <20091217000343.142550@gmx.net>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net>
Message-ID: <4B29E7DF.6010203@behnel.de>

Hi Holger,

jholg at gmx.de, 17.12.2009 01:03:
> Finally got around working on this, just checked in for review:
>
> Committed revision 70160.
>
> URL: http://codespeak.net/svn/lxml/branch/iso-schematron
>
> This comes complete with doc updates, unittests and whatnot.

Cool, thanks!
> * implemented as package lxml.schematron to cleanly bundle all > the xsl/rng resources That's ok. I was about to propose calling it "lxml.isoschematron" when I noticed that that was what you meant anway. :) > * the API allows for handing stylesheet parameters to the separate > schematron-to-xsl compilation steps. Currently, these must be provided > with stylesheet parameter properties in mind, i.e. text parameters must > be given like {'phase': "'mandatory'"}. As all the stylesheet parameters > seem to be text parameters, maybe we could make this a little > more convenient and auto-strparam() everything in the arg dicts. > But I think it's not really worth the effort and better to stick to > normal stylesheet parameter handling I prefer simplifying the interface here and make them all string parameters. I just skimmed through your code, it looks like you want users to pass dicts instead of regular keyword arguments. Why is that? > * I had to modify xmlerror.pxi to stay as close to the original validators' > workings as possible: > > $ svn diff --old=http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi at HEAD --new=src/lxml/xmlerror.pxi > Index: src/lxml/xmlerror.pxi > =================================================================== > --- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi) (revision 70161) > +++ src/lxml/xmlerror.pxi (.../src/lxml/xmlerror.pxi) (working copy) > @@ -71,6 +71,8 @@ > else: > self.filename = _decodeFilename(error.file) > > + #FIXME: This seems not to have been used anywhere, so far. Is my addition > + #FIXME: of _utf8()-ing message & filename correct? > cdef _setGeneric(self, int domain, int type, int level, int line, > message, filename): > self.domain = domain > @@ -78,8 +80,8 @@ > self.level = level > self.line = line > self.column = 0 > - self.message = message > - self.filename = filename > + self.message = _utf8(message) > + self.filename = _utf8(filename) That won't work. 
Neither the message nor the filename need to be compatible with the UTF-8 encoding. > def __repr__(self): > return u"%s:%d:%d:%s:%s:%s: %s" % ( > @@ -102,6 +104,12 @@ > def __get__(self): > return ErrorLevels._getName(self.level, u"unknown") > > +#FIXME: Can _LogEntry be settable itself so we don't need this? > +cdef class _SettableLogEntry(_LogEntry): > + cpdef setGeneric(self, int domain, int type, int level, int line, > + message, filename): > + self._setGeneric(domain, type, level, line, message, filename) > + > cdef class _BaseErrorLog: > cdef _LogEntry _first_error > cdef readonly object last_error > @@ -172,7 +180,7 @@ > message = u"%s, line %d" % (message, line) > return exctype(message, code, line, column) > > - cdef _buildExceptionMessage(self, default_message): > + cpdef _buildExceptionMessage(self, default_message): > if self._first_error is None: > return default_message > if self._first_error.message is not None and self._first_error.message: > > Also, I added a Validator class to isoschematron that mimics > etree._Validator. This wouldn't be necessary if etree._Validator made > _error_log accessible from python. Can we just do that? I wonder why this is necessary anyway. Can't we just reuse the error log of the underlying XSLT object? I don't expect that we need to generate any log messages ourselves, do we? Stefan From jholg at gmx.de Thu Dec 17 10:57:34 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 10:57:34 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B29E7DF.6010203@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> Message-ID: <20091217095734.67420@gmx.net> Hi Stefan, > > * the API allows for handing stylesheet parameters to the separate > > schematron-to-xsl compilation steps. 
Currently, these must be provided > > with stylesheet parameter properties in mind, i.e. text parameters > must > > be given like {'phase': "'mandatory'"}. As all the stylesheet > parameters > > seem to be text parameters, maybe we could make this a little > > more convenient and auto-strparam() everything in the arg dicts. > > But I think it's not really worth the effort and better to stick to > > normal stylesheet parameter handling > > I prefer simplifying the interface here and make them all string > parameters. I just skimmed through your code, it looks like you want users > to pass dicts instead of regular keyword arguments. Why is that? Because I want/need to separate the attributes for the steps include, abstract, compile. If we don't use dicts, how to distinguish these? I don't like using naming conventions here. What if any future version of the skeleton implementation resolves to using some number xslt args? We'd at least need to only auto-strparam() string arguments. But what if a parameter taking an xpath expression should show up someday? We can't really discriminate this from a normal string parameter; it's also not possible to hand in an etree.XPath object (just tried it). The only option I see here is to provide another switch to turn off auto-strparam()ing, if need be, defaulting to False. > > + #FIXME: This seems not to have been used anywhere, so far. Is my > addition > > + #FIXME: of _utf8()-ing message & filename correct? > > cdef _setGeneric(self, int domain, int type, int level, int line, > > message, filename): > > self.domain = domain > > @@ -78,8 +80,8 @@ > > self.level = level > > self.line = line > > self.column = 0 > > - self.message = message > > - self.filename = filename > > + self.message = _utf8(message) > > + self.filename = _utf8(filename) > > That won't work. Neither the message nor the filename need to be > compatible > with the UTF-8 encoding. I wondered about that. 
Maybe a misunderstanding of mine of what _utf8 is supposed to do. This just ensures that a string is 7-bit-ascii or unicode and returns it utf8-encoded, right? Now, with _setGeneric becoming public in one or the other way, don't you need to store message and filename in a well-known encoding? Or is it the caller's responsibility to know the encoding? What happens to unicode parameters? > > Also, I added a Validator class to isoschematron that mimics > > etree._Validator. This wouldn't be necessary if etree._Validator made > > _error_log accessible from python. Can we just do that? > > I wonder why this is necessary anyway. Can't we just reuse the error log > of > the underlying XSLT object? I don't expect that we need to generate any > log > messages ourselves, do we? The XSLT object returns a perfectly valid XSLT result tree. What makes it a schematron validation error is what's then selected from the result using an XPath expression (which is exposed in the package if someone has different selection needs/chooses to write his own "meta stylsheet" instead of iso_svrl_for_xslt1.xsl which might produce different output): # svrl result accessors svrl_validation_errors = _etree.XPath( '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS}) What we retrieve as a result from this xpath are the validation errors, which then need to be put into the Schematron._error_log; this is why I need to manually access _error_log in the subclass, and also why it needs to be setGeneric()-able. Btw: The validation result doc is accessible as property validation_report if store_report is true (default). I didn't wrap this in a special class as - I can't really imagine what one might want to actually extract from the svrl report - it's just so easy to get to what you want using XPath on this result tree One thing I forgot to mention: I haven't tested with Python 3, as I currently don't have an installation. 
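To illustrate the failed-assert selection described above, here is a minimal sketch that uses only the standard library's ElementTree rather than lxml, run on a made-up SVRL report. The sample document and the helper name `failed_asserts` are assumptions for illustration only; just the `svrl:failed-assert` element and the SVRL namespace come from the schematron tooling.

```python
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# Made-up validation report, only for illustration.
REPORT = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="order"/>
  <svrl:failed-assert test="@id" location="/order[1]">
    <svrl:text>An order needs an id.</svrl:text>
  </svrl:failed-assert>
  <svrl:successful-report test="item" location="/order[1]">
    <svrl:text>Order has items.</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>
"""


def failed_asserts(report_root):
    """Return (location, message) pairs for every svrl:failed-assert."""
    results = []
    for node in report_root.iter("{%s}failed-assert" % SVRL_NS):
        text = node.find("{%s}text" % SVRL_NS)
        message = text.text.strip() if text is not None and text.text else ""
        results.append((node.get("location"), message))
    return results


errors = failed_asserts(ET.fromstring(REPORT))
for location, message in errors:
    print("%s: %s" % (location, message))
```

Only the failed assertion is collected; the successful report in the same document is ignored, which is the distinction that turns an ordinary XSLT result tree into a list of validation errors.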
Holger -- Preisknaller: GMX DSL Flatrate f?r nur 16,99 Euro/mtl.! http://portal.gmx.net/de/go/dsl02 From stefan_ml at behnel.de Thu Dec 17 11:30:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 17 Dec 2009 11:30:27 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <20091217095734.67420@gmx.net> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> Message-ID: <4B2A0843.80509@behnel.de> jholg at gmx.de, 17.12.2009 10:57: >> I prefer simplifying the interface here and make them all string >> parameters. I just skimmed through your code, it looks like you want users >> to pass dicts instead of regular keyword arguments. Why is that? > > Because I want/need to separate the attributes for the steps include, > abstract, compile. If we don't use dicts, how to distinguish these? I don't > like using naming conventions here. Why not make the most common (and non-overlapping) parameters keyword arguments and use the dicts only as fallbacks? We could rename them to "additional_..._parameters" or something. > What if any future version of the skeleton implementation resolves to > using some number xslt args? We'd at least need to only auto-strparam() > string arguments. But what if a parameter taking an xpath expression should > show up someday? We can't really discriminate this from a normal string > parameter; it's also not possible to hand in an etree.XPath object (just > tried it). That's a good idea, though. Passing an XPath object in would simply let the parameter mangling code extract the underlying unparsed XPath expression. It's some unnecessary work if you don't actually want to use the pre-parsed expression, but it's definitely explicit. > The only option I see here is to provide another switch to turn off > auto-strparam()ing, if need be, defaulting to False. Ugly. 
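For reference, the quoting that text parameters currently need (as in {'phase': "'mandatory'"}) can be sketched in a few lines of plain Python. The helper names `quote_string_param` and `quote_all` are hypothetical, and this is not lxml's implementation; in real code the strparam() helper mentioned in this thread would be the thing to use.

```python
def quote_string_param(value):
    """Wrap a plain string as an XPath 1.0 string literal."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # XPath 1.0 literals have no escape syntax, so a string containing
    # both quote characters cannot be written as a single literal.
    raise ValueError("cannot quote a string containing both quote characters")


def quote_all(params):
    """Quote every value of a parameter dict (hypothetical convenience)."""
    return dict((name, quote_string_param(value))
                for name, value in params.items())


compile_params = quote_all({"phase": "mandatory"})
print(compile_params)
```

Quoting everything automatically is only safe as long as every stylesheet parameter really is a text parameter, which is exactly the concern raised above.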
> Maybe a misunderstanding of mine of what _utf8 is supposed to do. This > just ensures that a string is 7-bit-ascii or unicode and returns it > utf8-encoded, right? Right, it's a user input validation and normalisation function. Sort of the opposite of funicode(). > Now, with _setGeneric becoming public in one or the other way You say that like it was decided. It's a totally internal thing that shouldn't get exposed to Python space. >>> Also, I added a Validator class to isoschematron that mimics >>> etree._Validator. This wouldn't be necessary if etree._Validator made >>> _error_log accessible from python. Can we just do that? >> I wonder why this is necessary anyway. Can't we just reuse the error log >> of the underlying XSLT object? I don't expect that we need to generate any >> log messages ourselves, do we? > > The XSLT object returns a perfectly valid XSLT result tree. What makes > it a schematron validation error is what's then selected from the result using > an XPath expression (which is exposed in the package if someone has > different selection needs/chooses to write his own "meta stylsheet" instead > of iso_svrl_for_xslt1.xsl which might produce different output): > > # svrl result accessors > svrl_validation_errors = _etree.XPath( > '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS}) > > What we retrieve as a result from this xpath are the validation errors, > which then need to be put into the Schematron._error_log; this is why I > need to manually access _error_log in the subclass, and also why it needs > to be setGeneric()-able. Why don't you just create a fake error log? There's nothing that requires that the error_log property is of type _ErrorLog or that it receives its error messages in the normal C level way. If a fake-log isn't easy to do, I'm fine with making that simpler, but I'm against making C level APIs public just for a case like this. 
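
A fake log of the kind suggested here could be sketched in plain Python roughly as follows. This is only a sketch under assumptions: `SvrlLogEntry`, `SvrlErrorLog` and `filter_level_names` are hypothetical names that are not part of lxml, and only a small part of what the real log classes offer is mirrored.

```python
from collections import namedtuple

# Hypothetical entry type; the field names loosely follow lxml's log entries.
SvrlLogEntry = namedtuple("SvrlLogEntry", "message filename line level_name")


class SvrlErrorLog:
    """Hypothetical pure-Python stand-in for an lxml error log."""

    def __init__(self, entries=()):
        self._entries = list(entries)

    def append(self, message, filename="<string>", line=0, level_name="ERROR"):
        # Entries are added from Python, no C-level entry point involved.
        self._entries.append(SvrlLogEntry(message, filename, line, level_name))

    def clear(self):
        del self._entries[:]

    def filter_level_names(self, *level_names):
        # Return a new log holding only entries with one of the given levels.
        return SvrlErrorLog(
            e for e in self._entries if e.level_name in level_names)

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

    def __str__(self):
        return "\n".join(
            "%s:%d:%s:%s" % (e.filename, e.line, e.level_name, e.message)
            for e in self._entries)


log = SvrlErrorLog()
log.append("title must not be empty", filename="doc.xml", line=3)
log.append("deprecated element", filename="doc.xml", line=7,
           level_name="WARNING")
errors_only = log.filter_level_names("ERROR")
print(len(log), len(errors_only))
print(errors_only)
```

Since nothing requires the error_log property to be an _ErrorLog, a validator written in Python could hand out such an object directly.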
> The validation result doc is accessible as property validation_report if > store_report is true (default). I didn't wrap this in a special class as > - I can't really imagine what one might want to actually extract from the svrl report > - it's just so easy to get to what you want using XPath on this result tree I'm actually for extracting the error log lazily when the error_log property is first read, so storing the unmodified result document sounds like a good idea to me. > One thing I forgot to mention: I haven't tested with Python 3, as I > currently don't have an installation. That'll come. Stefan From jholg at gmx.de Thu Dec 17 12:21:29 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 12:21:29 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2A0843.80509@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> Message-ID: <20091217112129.158960@gmx.net> Hi, > >> I prefer simplifying the interface here and make them all string > >> parameters. I just skimmed through your code, it looks like you want > users > >> to pass dicts instead of regular keyword arguments. Why is that? > > > > Because I want/need to separate the attributes for the steps include, > > abstract, compile. If we don't use dicts, how to distinguish these? I > don't > > like using naming conventions here. > > Why not make the most common (and non-overlapping) parameters keyword > arguments and use the dicts only as fallbacks? We could rename them to > "additional_..._parameters" or something. These are the parameters of the involved xsl stylesheets: iso_dsdl_include.xsl: true true true true true true iso_abstract_expand.xsl: iso_svrl_for_xslt1.xsl: true [...] 
false true true One thing is that there are so many, another thing is to decide which ones are the "most common". As it stands, isoschematron already has quite a lot of parameters. > > string arguments. But what if a parameter taking an xpath expression > should > > show up someday? We can't really discriminate this from a normal string > > parameter; it's also not possible to hand in an etree.XPath object (just > > tried it). > > That's a good idea, though. Passing an XPath object in would simply let > the > parameter mangling code extract the underlying unparsed XPath expression. > It's some unnecessary work if you don't actually want to use the > pre-parsed > expression, but it's definitely explicit. Do you mean for etree.XSLT to allow for XPath object arguments (that's what I meant), or for isoschematron to extract the path using XPath.path? Supposing you meant the latter, the rules would then be: - if an arg is string, auto-strparam() it - if an arg is an XPath object, just use its .path property - else use unicode() (in a py3 compatible way) This would add a little more convenience to the parameter passing; I'm still not convinced that this shouldn't/couldn't be rather just addressed in the documentation. The advantage of not implementing the magic is that I can use the very same arg dictionary with isoschematron.Schematron() as with any other XSLT transform. > > Maybe a misunderstanding of mine of what _utf8 is supposed to do. This > > just ensures that a string is 7-bit-ascii or unicode and returns it > > utf8-encoded, right? > > Right, it's a user input validation and normalisation function. Sort of > the > opposite of funicode(). > > > > Now, with _setGeneric becoming public in one or the other way > > You say that like it was decided. It's a totally internal thing that > shouldn't get exposed to Python space. :) I know it's not - that's why I'm asking these questions. 
But it seems easier to me to reuse the existing stuff than replicating the very same functionality. Why not make this stuff a little friendlier for subclassing? Also, for the _setGeneric case I actually added the class _SettableLogEntry(_LogEntry) to make this minimally intrusive for the existing infrastructure; I just seemed to get it wrong regarding the _utf8() stuff. > > What we retrieve as a result from this xpath are the validation errors, > > which then need to be put into the Schematron._error_log; this is why I > > need to manually access _error_log in the subclass, and also why it > needs > > to be setGeneric()-able. > > Why don't you just create a fake error log? There's nothing that requires > that the error_log property is of type _ErrorLog or that it receives its > error messages in the normal C level way. If a fake-log isn't easy to do, > I'm fine with making that simpler, but I'm against making C level APIs > public just for a case like this. Of course we can do that, but then we need to basically reimplement _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one class, minus the C level entry points. Is there much gain in this? Or, in other words: What's lost by exposing _buildExceptionMessage() to the python side? > I'm actually for extracting the error log lazily when the error_log > property is first read, so storing the unmodified result document sounds > like a good idea to me. I need to retrieve the errors in the __call__ method anyway to see if the validation result is true, so why not store it right away? Holger -- Preisknaller: GMX DSL Flatrate f?r nur 16,99 Euro/mtl.! 
http://portal.gmx.net/de/go/dsl02 From stefan_ml at behnel.de Thu Dec 17 14:09:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 17 Dec 2009 14:09:50 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <20091217112129.158960@gmx.net> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> Message-ID: <4B2A2D9E.40302@behnel.de> jholg at gmx.de, 17.12.2009 12:21: >> Why not make the most common (and non-overlapping) parameters keyword >> arguments and use the dicts only as fallbacks? We could rename them to >> "additional_..._parameters" or something. > > These are the parameters of the involved xsl stylesheets: > > iso_dsdl_include.xsl: > true > true > true > true > true > true All 'true' sounds like a good default, that seems to make all of the above "uncommon" enough to drop them into a dict parameter. > iso_abstract_expand.xsl: > Is there any visible need to override that? If not, I'd just drop it completely. > iso_svrl_for_xslt1.xsl: > true > > [...] > > false > true > true > > > > > One thing is that there are so many, another thing is to decide which ones are the "most common". > As it stands, isoschematron already has quite a lot of parameters. I just want to keep users from having to pass a dict in /most/ cases. The ones above do not seem to be that unuseful. >>> string arguments. But what if a parameter taking an xpath expression >> should >>> show up someday? We can't really discriminate this from a normal string >>> parameter; it's also not possible to hand in an etree.XPath object (just >>> tried it). >> That's a good idea, though. Passing an XPath object in would simply let >> the >> parameter mangling code extract the underlying unparsed XPath expression. 
>> It's some unnecessary work if you don't actually want to use the >> pre-parsed >> expression, but it's definitely explicit. > > Do you mean for etree.XSLT to allow for XPath object arguments (that's > what I meant), or for isoschematron to extract the path using XPath.path? I'm fine with the first. It's just the same as supporting QName for tags, CDATA for text and strparam() for XSLT parameters. Passing an XPath object is really hard to misinterpret (at least a lot harder than a plain string value). >>> What we retrieve as a result from this xpath are the validation errors, >>> which then need to be put into the Schematron._error_log; this is why I >>> need to manually access _error_log in the subclass, and also why it >> needs to be setGeneric()-able. >> Why don't you just create a fake error log? There's nothing that requires >> that the error_log property is of type _ErrorLog or that it receives its >> error messages in the normal C level way. If a fake-log isn't easy to do, >> I'm fine with making that simpler, but I'm against making C level APIs >> public just for a case like this. > > Of course we can do that, but then we need to basically reimplement > _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one > class, minus the C level entry points. Is there much gain in this? Well, there isn't that much functionality in there, really. And at the point where you need to do the conversion, you'd probably already know in what kind of errors the user is interested (see the filter_*() methods), so having a custom class here isn't all that wrong. I agree that it's worth trying to make the existing classes a little friendlier to subclasses, though, that might help already. >> I'm actually for extracting the error log lazily when the error_log >> property is first read, so storing the unmodified result document sounds >> like a good idea to me. 
> > I need to retrieve the errors in the __call__ method anyway to see if > the validation result is true, so why not store it right away? Because I expect that many users won't be interested in the exact errors and can live with a boolean predicate result. Extracting the information if errors were found at all is a simple and fast XPath search with a boolean result, right? Stefan From jholg at gmx.de Thu Dec 17 15:05:48 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 15:05:48 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2A2D9E.40302@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> Message-ID: <20091217140548.67420@gmx.net> > >> Why not make the most common (and non-overlapping) parameters keyword > >> arguments and use the dicts only as fallbacks? We could rename them to > >> "additional_..._parameters" or something. > > > > These are the parameters of the involved xsl stylesheets: > > > > iso_dsdl_include.xsl: > > true > > true > > true > > true > > true > > true > > All 'true' sounds like a good default, that seems to make all of the above > "uncommon" enough to drop them into a dict parameter. > > > > iso_abstract_expand.xsl: > > > > Is there any visible need to override that? If not, I'd just drop it > completely. This is for: """ It also * extracts a particular schema using an ID, where there are multiple schemas, such as when they are embedded in the same NVDL script """ No need to expose as extra kwarg for this. > > iso_svrl_for_xslt1.xsl: > > true > > > > [...] > > > > false > > true > > true > > > > > > > > > > One thing is that there are so many, another thing is to decide which > ones are the "most common". 
> > As it stands, isoschematron already has quite a lot of parameters. > > I just want to keep users from having to pass a dict in /most/ cases. The > ones above do not seem to be that unuseful. These seem mostly to be for trimming the svrl output somewhat, for providing additional information to the bare-bones validation failure messages. I'm pretty much a Schematron beginner so I'm not too sure which would be worth exposing. One exception, though: 'phase' is s.th. I plan to make excessive use of in my use case, as this allows grouping validation patterns and gives you a mechanism to selectively validate. So this would be my only candidate for an extra keyword arg. > I'm fine with the first. It's just the same as supporting QName for tags, > CDATA for text and strparam() for XSLT parameters. Passing an XPath object > is really hard to misinterpret (at least a lot harder than a plain string > value). Ok, I'll look at the XSLT implementation. I take it you don't see any value in keeping the xsl parameter handling compatible to what you normally have to hand to etree.XSLT as stylesheet parameters? > > Of course we can do that, but then we need to basically reimplement > > _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one > > class, minus the C level entry points. Is there much gain in this? > > Well, there isn't that much functionality in there, really. And at the > point where you need to do the conversion, you'd probably already know in > what kind of errors the user is interested (see the filter_*() methods), > so > having a custom class here isn't all that wrong. > > I agree that it's worth trying to make the existing classes a little > friendlier to subclasses, though, that might help already. For the error log stuff the only thing I needed to access from python was cpdef _buildExceptionMessage(self, default_message): in the isoschematron._Validator class. 
This need would go away if I could subclass etree._Validator and access _Validator._error_log from Python, so that I can call self._error_log.receive(logEntry) > > I need to retrieve the errors in the __call__ method anyway to see if > > the validation result is true, so why not store it right away? > > Because I expect that many users won't be interested in the exact errors > and can live with a boolean predicate result. Extracting the information > if > errors were found at all is a simple and fast XPath search with a boolean > result, right? Currently, the XPath searches all failed-assert elements, which are the actual error messages put into the error log: svrl_validation_errors = _etree.XPath( '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS}) This could be easily enough changed to return a boolean, with above xpath being used only for accessing the error_log property. Of course, if subclassing etree._Validator, lazy extraction would then mean to override error_log access. As a side effect, lazy error_log extraction would mean to always need to store the result report (this makes store_report arg obsolete). Then again, all the other validators return a simple boolean true, store any validation error message in ._error_log during __call__() and return a copy of this on .error_log() access. Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! 
Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From stefan_ml at behnel.de Thu Dec 17 16:08:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 17 Dec 2009 16:08:05 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <20091217140548.67420@gmx.net> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> Message-ID: <4B2A4955.6030608@behnel.de> jholg at gmx.de, 17.12.2009 15:05: >>> iso_svrl_for_xslt1.xsl: >>> true >>> >>> [...] >>> >>> false >>> true >>> true >>> >>> >>> >>> >>> One thing is that there are so many, another thing is to decide which >> ones are the "most common". >>> As it stands, isoschematron already has quite a lot of parameters. >> I just want to keep users from having to pass a dict in /most/ cases. The >> ones above do not seem to be that unuseful. > > These seem mostly to be for trimming the svrl output somewhat, for > providing additional information to the bare-bones validation failure > messages. I'm pretty much a Schematron beginner so I'm not too sure which > would be worth exposing. One exception, though: > 'phase' is s.th. I plan to make excessive use of in my use case, as this > allows grouping validation patterns and gives you a mechanism to > selectively validate. > > So this would be my only candidate for an extra keyword arg. Fine with me. Note that we can usually add additional keyword arguments at the end if we notice that people use them a lot. If provided, their value would then override the keywords passed as dict. >> I'm fine with the first. It's just the same as supporting QName for tags, >> CDATA for text and strparam() for XSLT parameters. 
Passing an XPath object >> is really hard to misinterpret (at least a lot harder than a plain string >> value). > > Ok, I'll look at the XSLT implementation. > > I take it you don't see any value in keeping the xsl parameter handling > compatible to what you normally have to hand to etree.XSLT as stylesheet > parameters? I'm not sure what exactly you mean here. I'm fine with *extending* the current functionality with something that is useful but doesn't currently work. I'm also all for making the schematron interface more specific (and more usable) than the generic XSLT interface. That's the whole point of integrating the stylesheets, after all. >>> Of course we can do that, but then we need to basically reimplement >>> _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one >>> class, minus the C level entry points. Is there much gain in this? >> Well, there isn't that much functionality in there, really. And at the >> point where you need to do the conversion, you'd probably already know in >> what kind of errors the user is interested (see the filter_*() methods), >> so having a custom class here isn't all that wrong. >> >> I agree that it's worth trying to make the existing classes a little >> friendlier to subclasses, though, that might help already. > > For the error log stuff the only thing I needed to access from python was > cpdef _buildExceptionMessage(self, default_message): > in the isoschematron._Validator class. > > This need would go away if I could subclass etree._Validator and access > _Validator._error_log from Python, so that I can call > self._error_log.receive(logEntry) Actually, looking through the code, I think "_receiveGeneric()" was originally used but then replaced by a locally constructed xmlError and a call to _receive(), so it's actually a dead method by now. We could provide _Validator with an _append_log_message() method that basically calls it. Would that solve the issue? 
>>> I need to retrieve the errors in the __call__ method anyway to see if >>> the validation result is true, so why not store it right away? >> Because I expect that many users won't be interested in the exact errors >> and can live with a boolean predicate result. Extracting the information >> if >> errors were found at all is a simple and fast XPath search with a boolean >> result, right? > > Currently, the XPath searches all failed-assert elements, which are the > actual error messages put into the error log: > > svrl_validation_errors = _etree.XPath( > '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS}) > > This could be easily enough changed to return a boolean, with above > xpath being used only for accessing the error_log property. Of course, > if subclassing etree._Validator, lazy extraction would then mean to > override error_log access. As a side effect, lazy error_log extraction > would mean to always need to store the result report (this makes > store_report arg obsolete). If False, the report would simply be evaluated and deleted immediately after the run (should be easy to do by accessing the error_log once if it's evaluated lazily :) I think it should be False by default, BTW. If users want the report, it's easy to enable it. > Then again, all the other validators return a simple boolean true, store > any validation error message in ._error_log during __call__() and return > a copy of this on .error_log() access. The difference is that the errors are collected in the log during the run. Here, they are extracted from the result *after* running the validation. Ok, let's make that a potential optimisation, not a requirement. I'm fine with having the error log extracted immediately after validation and throwing away the result document if the users asked to do so by passing the option. I would guess that the XSLT based validation is already heavy enough anyway. 
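
The difference between the cheap boolean check and the full extraction discussed above can be sketched like this. It is a minimal sketch, assuming a hand-written SVRL document in place of a real validation report; the names `has_failed_asserts` and `failed_asserts` are made up for illustration and are not what isoschematron exposes.

```python
from lxml import etree

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# Cheap predicate: the XPath boolean() function makes lxml return a bool.
has_failed_asserts = etree.XPath(
    "boolean(//svrl:failed-assert)", namespaces={"svrl": SVRL_NS})
# Full extraction: returns the list of matching elements.
failed_asserts = etree.XPath(
    "//svrl:failed-assert", namespaces={"svrl": SVRL_NS})

# Hand-written stand-in for the result of the compiled schematron XSLT.
report = etree.fromstring(
    '<svrl:schematron-output xmlns:svrl="%s">'
    '<svrl:failed-assert test="item" location="/order">'
    '<svrl:text>An order needs at least one item.</svrl:text>'
    '</svrl:failed-assert>'
    '</svrl:schematron-output>' % SVRL_NS)

# The validation result only needs the predicate ...
is_valid = not has_failed_asserts(report)
# ... while the messages are only needed once somebody asks for the log.
messages = [el.findtext("{%s}text" % SVRL_NS) for el in failed_asserts(report)]
print(is_valid)
print(messages)
```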
Stefan From jholg at gmx.de Thu Dec 17 22:34:05 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 22:34:05 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2A4955.6030608@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> <4B2A4955.6030608@behnel.de> Message-ID: <20091217213405.92940@gmx.net> > > 'phase' is s.th. I plan to make excessive use of in my use case, as this > > allows grouping validation patterns and gives you a mechanism to > > selectively validate. > > > > So this would be my only candidate for an extra keyword arg. > > Fine with me. Note that we can usually add additional keyword arguments at > the end if we notice that people use them a lot. If provided, their value > would then override the keywords passed as dict. Ok. > > I take it you don't see any value in keeping the xsl parameter handling > > compatible to what you normally have to hand to etree.XSLT as stylesheet > > parameters? > > I'm not sure what exactly you mean here. I'm fine with *extending* the > current functionality with something that is useful but doesn't currently > work. What I mean is adding magic to the handling of stylesheet parameters will not let one reuse the very same parameters (dict or keyword) when performing steps manually. 
This can of course be done by just using the existing stylesheets, and the isoschematron package also exposes the steps as globals:

# the iso-schematron skeleton implementation steps aka xsl transformations
extract_from_xsd = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'XSD2Schtrn.xsl')))
extract_from_rng = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'RNG2Schtrn.xsl')))
iso_dsdl_include = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_dsdl_include.xsl')))
iso_abstract_expand = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_abstract_expand.xsl')))
iso_svrl_for_xslt1 = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_svrl_for_xslt1.xsl')))

# if you want to use another "meta-stylesheet" for compilation to xslt, plug it
# here
iso_compile2xslt = iso_svrl_for_xslt1

> > This could be easily enough changed to return a boolean, with above
> > xpath being used only for accessing the error_log property. Of course,
> > if subclassing etree._Validator, lazy extraction would then mean to
> > override error_log access. As a side effect, lazy error_log extraction
> > would mean to always need to store the result report (this makes
> > store_report arg obsolete).
>
> If False, the report would simply be evaluated and deleted immediately
> after the run (should be easy to do by accessing the error_log once if it's
> evaluated lazily :)

Not sure I follow. __call__ runs the xslt on the input data and produces the svrl report aka xsl result document. Now, if I want lazy error log extraction I need to store this result report.

> I think it should be False by default, BTW. If users want the report, it's
> easy to enable it.

Fine with me. But then, I need to put the errors into the error log before throwing away the report: No lazy error log extraction.
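
To make that consequence concrete, here is a minimal sketch of eager extraction, assuming a hand-written SVRL document instead of a real report. `ReportEvaluator` and its attributes are hypothetical names for illustration only, not part of isoschematron.

```python
from lxml import etree

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
_failed_asserts = etree.XPath(
    "//svrl:failed-assert", namespaces={"svrl": SVRL_NS})


class ReportEvaluator:
    """Hypothetical helper, not lxml API: evaluates an SVRL report eagerly."""

    def __init__(self, store_report=False):
        self._store_report = store_report
        self.validation_report = None
        self.error_messages = []

    def evaluate(self, report):
        # Extract the messages right away, because the report may be
        # dropped two lines further down.
        self.error_messages = [
            "".join(el.itertext()).strip() for el in _failed_asserts(report)]
        # Keep the report only if the user asked for it.
        self.validation_report = report if self._store_report else None
        return not self.error_messages


report = etree.fromstring(
    '<svrl:schematron-output xmlns:svrl="%s">'
    '<svrl:failed-assert test="item" location="/order">'
    '<svrl:text>An order needs at least one item.</svrl:text>'
    '</svrl:failed-assert>'
    '</svrl:schematron-output>' % SVRL_NS)

evaluator = ReportEvaluator(store_report=False)
result = evaluator.evaluate(report)
print(result)
print(evaluator.error_messages)
print(evaluator.validation_report)
```

With store_report=False the messages survive although the report itself is gone, which is exactly why they cannot be extracted lazily in that case.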
> The difference is that the errors are collected in the log during the run. > Here, they are extracted from the result *after* running the validation. I see. So you'd say the overhead of putting all the errors into the error log one by one in __call__ is expensive to a degree and we should avoid that. > Ok, let's make that a potential optimisation, not a requirement. I'm fine > with having the error log extracted immediately after validation and > throwing away the result document if the users asked to do so by passing > the option. I would guess that the XSLT based validation is already heavy > enough anyway. For lazy error log extraction we need to store the validation report. So maybe we could just compromise: If the user opts for storing the result report error_log will use lazy extraction, if not error log needs to be set inside __call__. Classical trade of memory vs speed :) Holger -- Preisknaller: GMX DSL Flatrate f?r nur 16,99 Euro/mtl.! http://portal.gmx.net/de/go/dsl02 From jholg at gmx.de Thu Dec 17 23:33:30 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 Dec 2009 23:33:30 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2A4955.6030608@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> <4B2A4955.6030608@behnel.de> Message-ID: <20091217223330.246330@gmx.net> Hi, > Actually, looking through the code, I think "_receiveGeneric()" was > originally used but then replaced by a locally constructed xmlError and a > call to _receive(), so it's actually a dead method by now. We could > provide > _Validator with an _append_log_message() method that basically calls it. > Would that solve the issue? 
>

I took a look:
A method to clear the error log from the subclass would also be needed.

Holger

From jholg at gmx.de Thu Dec 17 23:43:59 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 17 Dec 2009 23:43:59 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <20091217213405.92940@gmx.net>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> <4B2A4955.6030608@behnel.de> <20091217213405.92940@gmx.net>
Message-ID: <20091217224359.67400@gmx.net>

> > Ok, let's make that a potential optimisation, not a requirement. I'm fine
> > with having the error log extracted immediately after validation and
> > throwing away the result document if the users asked to do so by passing
> > the option. I would guess that the XSLT based validation is already heavy
> > enough anyway.
>
> For lazy error log extraction we need to store the validation report. So
> maybe we could just compromise: If the user opts for storing the result
> report error_log will use lazy extraction, if not error log needs to be set
> inside __call__. Classical trade-off of memory vs speed :)

After looking at the code I feel this solution's implementation would be quite a bit clumsier compared to now, e.g. because you need to save the file uri/name of the validated tree in the lazy-extraction case for later reuse when error_log is first accessed. So I don't think it's worth the effort now.

Holger

From stefan_ml at behnel.de Fri Dec 18 07:58:09 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 18 Dec 2009 07:58:09 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <20091217224359.67400@gmx.net>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> <4B2A4955.6030608@behnel.de> <20091217213405.92940@gmx.net> <20091217224359.67400@gmx.net>
Message-ID: <4B2B2801.8010002@behnel.de>

jholg at gmx.de, 17.12.2009 23:43:
>>> Ok, let's make that a potential optimisation, not a requirement. I'm fine
>>> with having the error log extracted immediately after validation and
>>> throwing away the result document if the users asked to do so by passing
>>> the option. I would guess that the XSLT based validation is already heavy
>>> enough anyway.
>> For lazy error log extraction we need to store the validation report. So
>> maybe we could just compromise: If the user opts for storing the result
>> report error_log will use lazy extraction, if not error log needs to be set
>> inside __call__. Classical trade-off of memory vs speed :)
>
> After looking at the code I feel this solution's implementation would be
> quite a bit clumsier compared to now, e.g. because you need to save the
> file uri/name of the validated tree in the lazy-extraction case for later
> reuse when error_log is first accessed. So I don't think it's worth the
> effort now.

Ok.
Stefan

From stefan_ml at behnel.de Fri Dec 18 07:59:47 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 18 Dec 2009 07:59:47 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <20091217223330.246330@gmx.net>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091217140548.67420@gmx.net> <4B2A4955.6030608@behnel.de> <20091217223330.246330@gmx.net>
Message-ID: <4B2B2863.50404@behnel.de>

jholg at gmx.de, 17.12.2009 23:33:
>> Actually, looking through the code, I think "_receiveGeneric()" was
>> originally used but then replaced by a locally constructed xmlError and a
>> call to _receive(), so it's actually a dead method by now. We could provide
>> _Validator with an _append_log_message() method that basically calls it.
>> Would that solve the issue?
>
> I took a look:
> A method to clear the error log from the subclass would also be needed.

Fine with me.

Stefan

From djjordaan at gmail.com Fri Dec 18 15:15:18 2009
From: djjordaan at gmail.com (Johan Jordaan)
Date: Fri, 18 Dec 2009 16:15:18 +0200
Subject: [lxml-dev] Valadtion not correct.
Message-ID: <2b53a0220912180615o59552c2cvd66c658350c58ea8@mail.gmail.com>

The code below validates two pieces of XML against the provided XSD. The first piece of XML should be valid (I have independently verified this with another validator). The second piece should be invalid, which it is. Is this a bug in lxml or have I missed something?
from lxml import etree from StringIO import StringIO f = StringIO(''' ''') xmlschema_doc = etree.parse(f) xmlschema = etree.XMLSchema(xmlschema_doc) valid = StringIO(''' ''') doc = etree.parse(valid) print "Valid","Valid" if xmlschema.validate(doc) else "Invalid" invalid = StringIO(''' ''') doc = etree.parse(invalid) print "Invalid","Valid" if xmlschema.validate(doc) else "Invalid" From jholg at gmx.de Fri Dec 18 17:25:37 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 18 Dec 2009 17:25:37 +0100 Subject: [lxml-dev] Valadtion not correct. In-Reply-To: <2b53a0220912180615o59552c2cvd66c658350c58ea8@mail.gmail.com> References: <2b53a0220912180615o59552c2cvd66c658350c58ea8@mail.gmail.com> Message-ID: <20091218162537.200600@gmx.net> Hi, quite tricky: > > f = StringIO(''' > > > > > [...] > > > > > > > Note that attribute name for the key-ed attribute of type Fruit is of type "xsd:string", whereas the keyref-ing attribute of favourite_fruit is of type If you remove the type attribute from the definition in , it works: >>> from lxml import etree >>> from StringIO import StringIO >>> >>> f = StringIO(''' ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ''') >>> >>> xmlschema_doc = etree.parse(f) >>> xmlschema = etree.XMLSchema(xmlschema_doc) >>> >>> >>> valid = StringIO(''' ... ... ... ... ... ... ... ... ... ... ''') >>> doc = etree.parse(valid) >>> >>> xmlschema.validate(doc) True >>> print xmlschema.error_log >>> Why this is I cannot say so a little W3C XMLSchema Rec reading would be needed... It might still be a bug of the tools but I see the same behaviour in both oxygen (xerces) and lxml (libxml2), so I suppose the odds are that they behave correctly. Please let us now if you find out what the spec says :) Holger -- Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher! 
http://portal.gmx.net/de/go/chbrowser From djjordaan at gmail.com Fri Dec 18 18:55:58 2009 From: djjordaan at gmail.com (Johan Jordaan) Date: Fri, 18 Dec 2009 19:55:58 +0200 Subject: [lxml-dev] Valadtion not correct. In-Reply-To: <20091218162537.200600@gmx.net> References: <2b53a0220912180615o59552c2cvd66c658350c58ea8@mail.gmail.com> <20091218162537.200600@gmx.net> Message-ID: <2b53a0220912180955i7b7b4259oa02d1356af1afd46@mail.gmail.com> Thanks for the hint. I opted to add a type to the anonymous complexType: >> >> >> >> >> >> >> And that solved my issue... Thanks again. On Fri, Dec 18, 2009 at 6:25 PM, wrote: > Hi, > > quite tricky: > >> >> f = StringIO(''' >> >> ? >> ? ? >> ? >> [...] >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? > > Note that attribute name for the key-ed attribute of type Fruit is of type "xsd:string", whereas the keyref-ing attribute of favourite_fruit is of type > > If you remove the type attribute from the definition in , it works: > >>>> from lxml import etree >>>> from StringIO import StringIO >>>> >>>> f = StringIO(''' > ... > ... ? > ... ? ? > ... ? > ... > ... ? > ... ? ? > ... ? ? ? ... type="Fruit"/> > ... ? ? > ... ? > ... > ... ? > ... ? ? > ... ? ? ? > ... ? ? ? ? ... type="FruitArray"/> > ... ? ? ? ? > ... ? ? ? ? ? > ... ? ? ? ? ? ? > ... ? ? ? ? ? > ... ? ? ? ? > ... ? ? ? > ... ? ? > ... > ... ? ? > ... ? ? ? > ... ? ? ? > ... ? ? > ... > ... ? ? > ... ? ? ? > ... ? ? ? > ... ? ? > ... ? > ... > ... ''') >>>> >>>> xmlschema_doc = etree.parse(f) >>>> xmlschema = etree.XMLSchema(xmlschema_doc) >>>> >>>> >>>> valid = StringIO(''' > ... > ... ? > ... ? ? > ... ? ? > ... ? ? > ... ? > ... > ... ? > ... > ... ''') >>>> doc = etree.parse(valid) >>>> >>>> xmlschema.validate(doc) > True >>>> print xmlschema.error_log > >>>> > > Why this is I cannot say so a little W3C XMLSchema Rec reading would be needed... 
> It might still be a bug of the tools but I see the same behaviour in both oxygen (xerces) and lxml (libxml2), so I suppose the odds are that they behave correctly. > > Please let us now if you find out what the spec says :) > > Holger > -- > Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - > sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser > From jholg at gmx.de Fri Dec 18 22:29:03 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 18 Dec 2009 22:29:03 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2A2D9E.40302@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> Message-ID: <20091218212903.296010@gmx.net> Hi Stefan, > >> That's a good idea, though. Passing an XPath object in would simply let > >> the > >> parameter mangling code extract the underlying unparsed XPath > expression. > >> It's some unnecessary work if you don't actually want to use the > >> pre-parsed > >> expression, but it's definitely explicit. > > > > Do you mean for etree.XSLT to allow for XPath object arguments (that's > > what I meant), or for isoschematron to extract the path using > XPath.path? > > I'm fine with the first. It's just the same as supporting QName for tags, > CDATA for text and strparam() for XSLT parameters. Passing an XPath object > is really hard to misinterpret (at least a lot harder than a plain string > value). You really sure you want that for XSLT.__call__? Because that means looping through the arg dict on every invocation of the stylesheet, doesn't it? What about just providing a helper function that takes keyword args and does the stylesheet parameter mangling? 
This could then be used in isoschematron.Schematron() and wherever else someone needs it. Holger From stefan_ml at behnel.de Sat Dec 19 07:41:31 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Dec 2009 07:41:31 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <20091218212903.296010@gmx.net> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091218212903.296010@gmx.net> Message-ID: <4B2C759B.9080301@behnel.de> jholg at gmx.de, 18.12.2009 22:29: >>>> That's a good idea, though. Passing an XPath object in would simply let >>>> the >>>> parameter mangling code extract the underlying unparsed XPath >> expression. >>>> It's some unnecessary work if you don't actually want to use the >>>> pre-parsed >>>> expression, but it's definitely explicit. >>> Do you mean for etree.XSLT to allow for XPath object arguments (that's >>> what I meant), or for isoschematron to extract the path using >> XPath.path? >> >> I'm fine with the first. It's just the same as supporting QName for tags, >> CDATA for text and strparam() for XSLT parameters. Passing an XPath object >> is really hard to misinterpret (at least a lot harder than a plain string >> value). > > You really sure you want that for XSLT.__call__? Because that means > looping through the arg dict on every invocation of the stylesheet, doesn't it? Well, how do you think lxml passes the parameters to libxslt? Look at the _run_transform() method.
Stefan From jholg at gmx.de Tue Dec 22 01:39:13 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 22 Dec 2009 01:39:13 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <4B2C759B.9080301@behnel.de> References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net> <4B0D4551.1020702@behnel.de> <20091217000343.142550@gmx.net> <4B29E7DF.6010203@behnel.de> <20091217095734.67420@gmx.net> <4B2A0843.80509@behnel.de> <20091217112129.158960@gmx.net> <4B2A2D9E.40302@behnel.de> <20091218212903.296010@gmx.net> <4B2C759B.9080301@behnel.de> Message-ID: <20091222003913.296010@gmx.net> Hi > Well, how do you think lxml passes the parameters to libxslt? Look at the > _run_transform() method. Yeah, not my brightest hour there... Anyway: Committed revision 70244. I've now updated the isoschematron implementation to - inherit from etree._Validator - use simple arguments and automagically convert them to stylesheet parameters - added the 'phase' keyword arg Also, etree.XSLT now accepts XPath objects as stylesheet parameters - test & doc updated to reflect this. 
These are the changes I made on the core lxml files since branching: $ svn diff -N --old=http://codespeak.net/svn/lxml/trunk/src/lxml/@69913 --new=src/lxml Index: src/lxml/xslt.pxi =================================================================== --- src/lxml/xslt.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml) (revision 69913) +++ src/lxml/xslt.pxi (.../src/lxml) (working copy) @@ -609,7 +609,10 @@ xslt.xsltQuoteOneUserParam( transform_ctxt, _cstr(k), _cstr(v)) else: - v = _utf8(value) + if isinstance(value, XPath): + v = _utf8((value).path) + else: + v = _utf8(value) params[i] = _cstr(k) i += 1 params[i] = _cstr(v) Index: src/lxml/xmlerror.pxi =================================================================== --- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml)(revision 69913) +++ src/lxml/xmlerror.pxi (.../src/lxml) (working copy) @@ -102,6 +102,12 @@ def __get__(self): return ErrorLevels._getName(self.level, u"unknown") +#FIXME: Can _LogEntry be settable itself so we don't need this? 
+cdef class _SettableLogEntry(_LogEntry): + cpdef setGeneric(self, int domain, int type, int level, int line, + message, filename): + self._setGeneric(domain, type, level, line, message, filename) + cdef class _BaseErrorLog: cdef _LogEntry _first_error cdef readonly object last_error Index: src/lxml/lxml.etree.pyx =================================================================== --- src/lxml/lxml.etree.pyx (.../http://codespeak.net/svn/lxml/trunk/src/lxml)(revision 69913) +++ src/lxml/lxml.etree.pyx (.../src/lxml) (working copy) @@ -2783,6 +2783,14 @@ raise AssertionError, self._error_log._buildExceptionMessage( u"Document does not comply with schema") + cpdef _append_log_message(self, int domain, int type, int level, int line, + message, filename): + self._error_log._receiveGeneric(domain, type, level, line, message, + filename) + + cpdef _clear_error_log(self): + self._error_log.clear() + property error_log: u"The log of validation errors and warnings." def __get__(self): Holger -- Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser From lists at zopyx.com Tue Dec 22 15:45:36 2009 From: lists at zopyx.com (Andreas Jung) Date: Tue, 22 Dec 2009 15:45:36 +0100 Subject: [lxml-dev] Replacing node.text with a node structure Message-ID: <4B30DB90.4010203@zopyx.com> Hi there, I have an XML structure like this

hello bar world ....

and need it to transform to this

hello bar world ....

p_node.text == 'hello bar world' should be replaced with a text node + Element node + another text node. How can I do this using lxml? Andreas -------------- next part -------------- A non-text attachment was scrubbed... Name: lists.vcf Type: text/x-vcard Size: 316 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20091222/8e308e92/attachment.vcf From jq at qdevelop.de Tue Dec 22 16:55:44 2009 From: jq at qdevelop.de (Jens Quade) Date: Tue, 22 Dec 2009 16:55:44 +0100 Subject: [lxml-dev] Replacing node.text with a node structure In-Reply-To: <4B30DB90.4010203@zopyx.com> References: <4B30DB90.4010203@zopyx.com> Message-ID: On 22.12.2009, at 15:45, Andreas Jung wrote: > Hi there, > > I have an XML structure like this > >

> hello bar world > > .... > >

> > > and need it to transform to this > >

> hello > > bar > > world > > .... > >

> > p_node.text == 'hello bar world' should be replaced with a text node + > Element node + another text node. > > How can I do this using lxml? >>> from lxml.etree import Element, XML, dump >>> p = XML('

Hello World!

') >>> dump(p)

Hello World!

>>> p.text = 'Hello' >>> foo = Element('foo') >>> foo.text = 'wonderful' >>> foo.tail = 'World' >>> p[:] = [foo] + p[:] >>> dump(p)

HellowonderfulWorld!
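A minimal sketch of the text/tail handling used above, written against the standard library's ElementTree API (lxml.etree should accept the same calls). The element names "p", "foo" and "i" and the helper wrap_word() are hypothetical, since the archive stripped the original markup; only the first occurrence of the word is wrapped.

```python
import xml.etree.ElementTree as ET


def wrap_word(elem, word, tag):
    """Wrap the first occurrence of `word` in elem.text in a new <tag> child.

    Hypothetical helper for illustration; returns the new child,
    or None if `word` does not occur in elem.text.
    """
    text = elem.text or ""
    before, found, after = text.partition(word)
    if not found:
        return None
    child = ET.Element(tag)
    child.text = word
    # Whatever followed the word now lives in the tail of the new child.
    child.tail = after
    elem.text = before
    # elem.text always precedes all children, so the new child goes first.
    elem.insert(0, child)
    return child


p = ET.fromstring("<p>hello bar world<i>!</i></p>")
wrap_word(p, "bar", "foo")
print(ET.tostring(p, encoding="unicode"))
# -> <p>hello <foo>bar</foo> world<i>!</i></p>
```

The same idea extends to tail text: a word found in some element's .tail would be wrapped in a new sibling inserted right after that element, with the remainder of the tail moved to the new sibling's .tail.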

From sergio at sergiomb.no-ip.org Tue Dec 22 18:52:28 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Tue, 22 Dec 2009 17:52:28 +0000 Subject: [lxml-dev] Replacing node.text with a node structure In-Reply-To: <4B30DB90.4010203@zopyx.com> References: <4B30DB90.4010203@zopyx.com> Message-ID: <1261504348.2737.1.camel@segulix> For questions like that, my reference was: http://infohost.nmt.edu/tcc/help/pubs/pylxml/index.html On Tue, 2009-12-22 at 15:45 +0100, Andreas Jung wrote: > Hi there, > > I have an XML structure like this > >

> hello bar world > > .... > >

> > > and need it to transform to this > >

> hello > > bar > > world > > .... > >

> > p_node.text == 'hello bar world' should be replaced with a text node + > Element node + another text node. > > How can I do this using lxml? > > Andreas > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20091222/a37cb02f/attachment-0001.bin From stefan_ml at behnel.de Fri Dec 25 07:39:28 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 25 Dec 2009 07:39:28 +0100 Subject: [lxml-dev] Different node instance depending on method of access In-Reply-To: References: Message-ID: <4B345E20.8020404@behnel.de> Hi, the mailing list would have been the place to ask this. Kevin D Smith, 25.12.2009 00:01: > I'm trying to use lxml to parse a document, apply CSS styles to the > nodes, then walk through the document to render it to another format. > The problem is that different ways of accessing the document return > different instances of the nodes. I really need to work on the same > instance no matter what way I access them. > > doc.find('body') returns > > CSSSelector('body')(doc) returns > > for action, elem in etree.iterwalk(doc, events=('start',)): > if elem.tag == 'body': return elem > > returns > > Using saxify(doc, handler) and processing styles in the event handlers > gives the following: > > CSSSelector('html, address, blockquote, body, dd, div, dl, dt, fieldset, > form, frame, frameset, h1, h2, h3, h4, h5, h6, noframes, ol, p, ul, > center, dir, hr, menu, pre')(self.doc) returns 101da2ce8> {u'display': 'block'} > > CSSSelector('body') returns > > As you can see, within the context of the iterwalk and saxify, the node > instances aren't the same as returned by doc.find and CSSSelector > outside of iterwalk and saxify. 
It's not because you are using different ways to get to the element, it's because you are throwing away the reference in between. This might help: http://codespeak.net/lxml/element_classes.html#background-on-element-proxies > Is there a way to guarantee that all of > these methods will use the same nodes? You didn't write /why/ you consider this a requirement - it likely isn't one. But if you really need this, you can cache the element instances like this:

    cache = list(root_element.iter())
    # do stuff with the elements in the tree
    del cache

As long as you don't add elements to the tree, this will ensure that you always get the same instances back. If you add elements, just add them to the cache as well. Stefan From stefan_ml at behnel.de Mon Dec 28 19:47:38 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 28 Dec 2009 19:47:38 +0100 Subject: [lxml-dev] XML Catalogs In-Reply-To: <20091206160404.GA4560@asquith> References: <20091206102811.GB6122@asquith> <4B1B8F4C.7080709@behnel.de> <20091206160404.GA4560@asquith> Message-ID: <4B38FD4A.8020901@behnel.de> David Soulayrol, 06.12.2009 17:04: > On Sun, Dec 06, 2009 at 12:02:36PM +0100, Stefan Behnel wrote: >> lxml's catalog support is based on libxml2: >> >> http://xmlsoft.org/catalog.html >> >> Is anything not working for you regarding catalogs? > > Actually nothing at the moment. My problem is I don't understand how I > can achieve what I want with the documentation I have. All I found in > lxml documentation is how to resolve a DTD Public ID from the catalog: > > dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XML V4.2//EN") > > What I want is to get an URL from a xsl sheet name using a entry > in my catalog: > > > name="my_sheet.xsl" > uri="file:///usr/local/share/xml/my_sheets/sheet1.xsl"/> > Ah, ok, there isn't currently any API for that in lxml. Catalog support is mostly used for loading DTDs etc., so that's what's supported behind the scenes.
Explicit lookups in a catalog would happen through a call to xmlCatalogResolveURI() in libxml2: http://xmlsoft.org/html/libxml-catalog.html#xmlCatalogResolveURI but that would require wrapping catalog.h first. Stefan From richardbp+lxml at gmail.com Tue Dec 29 06:07:53 2009 From: richardbp+lxml at gmail.com (Richard Baron Penman) Date: Tue, 29 Dec 2009 16:07:53 +1100 Subject: [lxml-dev] embedding sub-elements in text attributes Message-ID: Hello, I want to emphasize certain words in an HTML document, and my current solution is:

    tree = etree.fromstring(html)
    for e in tree.getiterator():
        for attr in 'text', 'tail':
            words = getattr(e, attr) or ''
            for word in words.split():
                if important(word):
                    setattr(e, attr, getattr(e, attr).replace(word, '' + word + ''))

The above examines the text of each element and emphasizes the important words it finds. However, it does this by embedding HTML tags in the text attributes, which are escaped when rendering, so I need to counter with:

    html = etree.tostring(tree).replace('&gt;', '>').replace('&lt;', '<')

This makes me uncomfortable so I want to do it properly. However, to embed a new Element I would need to shift around the 'text' and 'tail' attributes so that the emphasized text appears at the same position. And this would be really tricky when iterating as above. Any advice on how to do this properly would be appreciated. I am sure there is something I have missed in the API! (Although ideally from the ElementTree API so that I could use it when lxml is not available.) thanks, Richard -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091229/26876587/attachment.htm From stefan_ml at behnel.de Tue Dec 29 15:58:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 29 Dec 2009 15:58:50 +0100 Subject: [lxml-dev] embedding sub-elements in text attributes In-Reply-To: References: Message-ID: <4B3A192A.4050906@behnel.de> Hi, Richard Baron Penman, 29.12.2009 06:07: > I want to emphasize certain words in a HTML document, and my current > solution is: > > tree = etree.fromstring(html) > for e in tree.getiterator(): > for attr in 'text', 'tail': > words = getattr(e, attr) or '' > for word in words.split(): > if important(word): > setattr(e, attr, getattr(e, attr).replace(word, '' + word > + '')) Note that this fails for words that appear more than once. > The above examines the text of each element and emphasizes the important > words it finds. > However it does this by embedding HTML tags in the text attributes, which is > escaped when rendering so that I need to counter with: > > html = etree.tostring(tree).replace('>', '>').replace('<', '<') You should really avoid code like this. It's very unsafe. > This makes me uncomfortable so I want to do it properly. > However to embed a new Element I would need to shift around the 'text' and > 'tail' attributes so that the emphasized text appeared at the same position. > And this would be really tricky when iterating as above. First, make sure you do not modify the tree you iterate over: for e in list(tree.getiterator()): Then, you can (mis)use the E-factory for the replacements: H = lxml.html.builder def highlighted_content(text): if not text: return [] return [ H.B(word+' ') if important(word) else word+' ' for word in text.split() ] div_tag = H.DIV(*highlighted_content(e.text)) e.text = div_tag.text e[:0] = div_tag[:] div_tag = H.DIV(*highlighted_content(e.tail)) e.tail = div_tag.text e.extend(div_tag) There is also an E-factory for ElementTree, BTW. 
Note that the above is untested and likely contains bugs and formatting problems. But it shouldn't be hard to fix them. Stefan From stefan_ml at behnel.de Tue Dec 29 16:34:56 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 29 Dec 2009 16:34:56 +0100 Subject: [lxml-dev] etree not printing pretty :( In-Reply-To: <4B05268F.2040701@gmx.de> References: <4B051E56.4070701@gmx.de> <4B05209E.6090408@behnel.de> <4B05268F.2040701@gmx.de> Message-ID: <4B3A21A0.8050209@behnel.de> Martin Seiler, 19.11.2009 12:05: >>> I am rather new to python and lxml and I wonder about the output of my >>> tree. I add an Element, which contains some childs to a tree. When I >>> write the output everything is in pretty print, but the elements I >>> appended. They show up in one line. >> did you read this? >> >> http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output > > Yes, I did and applied before. If I print out the Element before > appending it, it works fine, just after appending it to an existing tree > it shows this behavior. libxml2 uses a heuristic to handle document-style content. When it finds tail text on an element, it switches to keeping text content intact within the respective subtree. That's likely the problem here. Setting the .tail property of each element to None should fix this, i.e. for el in root.iter(): el.tail = None > Also changing the call to: > >> et.ElementTree(element=eggs, parser=parser).write(xmlout, pretty_print=True) > > doesn't help? No. Setting a different parser here doesn't do anything (probably worth an exception...) 
Stefan From dakota at brokenpipe.ru Tue Dec 29 18:13:40 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Tue, 29 Dec 2009 20:13:40 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4AF992C9.7090400@behnel.de> References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> Message-ID: > > > Maybe I could join > in and dig a bit? Just don't know where to start. I hope it's not too > > complicated and it's possible - I have my problem's elegant solution, but > it > > needs this feature. > > You can try to dig into libxslt to find it out. I don't currently have the > time to implement major new features, but if I get an outline how this > should work, so that I can estimate the amount of work it takes, I may get > around to do it. > I've dug in. After some time spent figuring things out and reading the libxslt code (the only way to understand what's really happening, because libxslt's documentation is ugly) I've ended up with a solution. The patch is below and it's rather simple. I'm almost sure that it needs a few fixes from you, because you know much better how to manage elements, memory and so on. But it does what I needed. I worked with the source code of lxml 2.2.4.
--------------- PATCH STARTS HERE --------------- diff -r c8813376f20b -r 0a195f4f7df2 xslt.pxd --- a/xslt.pxd Tue Dec 29 19:03:23 2009 +0300 +++ b/xslt.pxd Tue Dec 29 19:25:19 2009 +0300 @@ -30,10 +30,13 @@ xmlNode* node xmlDoc* output xmlNode* insert + xmlNode* inst xsltTransformState state ctypedef struct xsltStackElem + ctypedef struct xsltTemplate + cdef xsltStylesheet* xsltParseStylesheetDoc(xmlDoc* doc) nogil cdef void xsltFreeStylesheet(xsltStylesheet* sheet) nogil @@ -84,6 +87,10 @@ cdef xsltTransformContext* xsltNewTransformContext(xsltStylesheet* style, xmlDoc* doc) nogil cdef void xsltFreeTransformContext(xsltTransformContext* context) nogil + cdef void xsltApplyOneTemplate(xsltTransformContext* ctxt, + xmlNode* contextNode, xmlNode* list, + xsltTemplate* templ, + xsltStackElem* params) nogil cdef extern from "libxslt/xsltutils.h": cdef int xsltSaveResultToString(char** doc_txt_ptr, diff -r c8813376f20b -r 0a195f4f7df2 xsltext.pxi --- a/xsltext.pxi Tue Dec 29 19:03:23 2009 +0300 +++ b/xsltext.pxi Tue Dec 29 19:25:19 2009 +0300 @@ -66,6 +66,30 @@ tree.xmlFreeNode(c_parent) return results + def evaluate(self, _XSLTContext context not None, _Element output_parent): + u"""evaluate(self, context, output_parent) + + Call this method to evaluate XSLT content of extension element. + + Evaluation result will be placed into output_parent element. + """ + cdef xslt.xsltTransformContext* ctxt + cdef xmlNode* c_backup + + ctxt = context._xsltCtxt + c_backup = ctxt.insert + + # I'm not sure about output_parent's type, maybe it should be some type + # of proxy. This needs better knowing man's opinion. + # And I'm using output_parent node for adding results instead of + # elements list used in apply_templates, that's easier and allows to + # use attributes added to extension element with . + # And that's exactly the thing I need. 
+ ctxt.insert = output_parent._c_node + xslt.xsltApplyOneTemplate(ctxt, + ctxt.node, ctxt.inst.children, NULL, NULL) + ctxt.insert = c_backup + cdef _registerXSLTExtensions(xslt.xsltTransformContext* c_ctxt, extension_dict): --------------- PATCH ENDS HERE --------------- So, with this patch we can handle XSLT content of extension elements (including attributes) and we can have extension elements inside extension elements. For example I can have xslt file looks like: 123 blabla And execute method for my:ext looks like: def execute(self, context, self_node, input_node, output_parent): tmp = etree.Element('tmp') self.evaluate(context, tmp) output_parent.append(tmp) And the result is: I think it's great feature. Is there any chance this thing will be included in nearest release? Thanks. -- Marat -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091229/a8816354/attachment-0001.htm From sridharr at activestate.com Tue Dec 29 20:17:49 2009 From: sridharr at activestate.com (Sridhar Ratnakumar) Date: Tue, 29 Dec 2009 11:17:49 -0800 Subject: [lxml-dev] Instructions to build on Windows 64-bit? Message-ID: <4B3A55DD.1030004@activestate.com> I noticed that the lxml PyPI page provides 64-bit Windows installers [http://pypi.python.org/pypi/lxml/2.2.4 ; lxml-*amd64.exe]. I assume they are statically linked with the libxml/xslt libraries. In the interest of providing 64-bit binaries in PyPM [pypm.activestate.com], may I know how these binaries are built? I tried buildlibxml.py which fails at several steps; and the compiled libraries provided at ftp.zlatkovic.com are 32-bit only. 
-srid From stefan_ml at behnel.de Tue Dec 29 23:14:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 29 Dec 2009 23:14:25 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> Message-ID: <4B3A7F41.6000304@behnel.de> Hi, Marat Dakota, 29.12.2009 18:13: > After some time of figuring out and reading libxslt code > (the only way to understand what's really happening, because libxslt's > documentation is ugly) I've ended up with solution. > > The patch is placed below and it's rather simple. I'm almost sure that it > needs a few of your fixes just because you know much-much better how to > manage elements, memory and so on. But it does things I really needed. > [...] > I think it's great feature. Is there any chance this thing will be included > in nearest release? Thanks a lot, it looks reasonable at first glance and I'll take a closer look as soon as I get to it. If it works well, it should make it into 2.3. Could you add a couple of tests to src/lxml/tests/test_xslt.py? That would help in making sure that it keeps working as expected even if I find that I need to rework the patch. Also, it's best to send patches as a readable attachment rather than inline. Mail programs tend to reformat text and it's easy to lose empty trailing lines etc. Thanks for pulling this out! Stefan From dattam at umich.edu Wed Dec 30 20:48:37 2009 From: dattam at umich.edu (Dattatreya Mellacheruvu) Date: Wed, 30 Dec 2009 14:48:37 -0500 Subject: [lxml-dev] Is there a 'Generic' XML-to-Relational Database Program? Message-ID: <5a5db39d3ae4fd381bc58cc155c1e61c@umich.edu> Hi All, 1. Is there a Python (UTILITY) program that converts 'any' XML to a relational database (using, say, lxml)?
So, the utility program should not only parse the XML, but also understand the most general 'hierarchical' structure of the elements, create tables (say in sqlite) accordingly and insert the data. I have seen many relational-DB-to-XML programs, but I haven't come across one that does the opposite. 2. Is there a UTILITY (program) that displays the most general hierarchical structure of the elements as a tree (or as an expandable/collapsible list)? 3. Is there a program which helps me to filter elements (in an XML file)? Supposing we were able to see the results of 2 (above), we should then be able to ask for something like "discard all elements which do not have xyz and create a new file." Is there a pre-built UTILITY for doing such a thing? If nothing pre-built is available, I am going to build them now. But I would love NOT to reinvent the wheel. Gurus on the forum, please advise. Regards, Datta. Graduate Student, UMICH. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091230/ffee4635/attachment.htm From ygingras at ygingras.net Thu Dec 31 17:11:10 2009 From: ygingras at ygingras.net (Yannick Gingras) Date: Thu, 31 Dec 2009 11:11:10 -0500 Subject: [lxml-dev] Looking for performance tips for soupparser Message-ID: <200912311111.20018.ygingras@ygingras.net> Hi, first of all, I have to say that I really like soupparser. Thanks a lot for it. I use it a lot for data mining on a somewhat large document collection that I often revisit to try new ideas. Soupparser is fast but I put a lot of strain on it so I was looking for ways to speed things up. My first idea was to use beaker to cache the root Element object of every document to disk. Unfortunately, Element instances are not pickleable so I have to look for something else. Would any of you have some tips to share on speeding things up with soupparser? How hard would it be to make elements conform to the pickling protocol?
-- Yannick Gingras http://ygingras.net http://confoo.ca -- track coordinator http://montrealpython.org -- lead organizer -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part. Url : http://codespeak.net/pipermail/lxml-dev/attachments/20091231/28cebfa1/attachment.pgp From jkrukoff at ltgc.com Thu Dec 31 18:26:25 2009 From: jkrukoff at ltgc.com (John Krukoff) Date: Thu, 31 Dec 2009 10:26:25 -0700 Subject: [lxml-dev] Is there a 'Generic' XML-to-Relational Database Program? In-Reply-To: <5a5db39d3ae4fd381bc58cc155c1e61c@umich.edu> References: <5a5db39d3ae4fd381bc58cc155c1e61c@umich.edu> Message-ID: <1262280385.3886.12.camel@localhost.localdomain> On Wed, 2009-12-30 at 14:48 -0500, Dattatreya Mellacheruvu wrote: > Hi All, > > 1. Is there a python (UTILITY) program that converts 'any' XML to > Relational Database program (which uses, say, lxml)? > 2. Is there a UTILITY (program) whtat diplays the most general > hierarchical structure of the elements as a tree (or as an > expandable/collapsable list). > > 3. Is there a program which helps me to filter elements (in an xml > file)? First, if you do go down the path of trying to jam XML in a relational database, you'll probably want to start with an XML schema of some sort rather than just plain XML. Trying to infer structure (and data type!) information from random XML sounds like a very hard, if not impossible, problem. You could take a look at some of the automatic XML schema generation tools to get an idea of what the hard problems are, since that's essentially what you'd need to do as a first step. But, if your goal is to store random XML in a database, are you sure you should be looking at relational databases? You might find that a native XML database based on XQuery (such as eXist: http://www.exist-db.org/) solves your problem space in a much more elegant way. 
-- John Krukoff Land Title Guarantee Company From stefan_ml at behnel.de Thu Dec 31 23:08:11 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 31 Dec 2009 23:08:11 +0100 Subject: [lxml-dev] Looking for performance tips for soupparser In-Reply-To: <200912311111.20018.ygingras@ygingras.net> References: <200912311111.20018.ygingras@ygingras.net> Message-ID: <4B3D20CB.4000305@behnel.de> Yannick Gingras, 31.12.2009 17:11: > first of all, I have to say that I really like soupparser. Thanks a > lot for it. I use it a lot data mining on a somewhat large document > collection that I often revisit to try new ideas. Soupparser is fast Erm, no, not really. It uses BeautifulSoup as a parser backend, which really isn't that fast: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ > I put a lot of strain on it so I was looking for ways to speed things > up. My first idea was to use beaker to cache the root Element object > of every document to disk. Unfortunately, Element instances are not > pickleable so I have to look for something else. > > Would any of you have some tips to share on speeding things up with > soupparser? How hard would it be to make elements conform to the > pickling protocol? I'd use the normal HTML parser instead, and only fall back to using the soupparser when things go really wrong (whatever that means in your case). Another thing you can do (assuming that caching is helpful in your case), is to parse the documents using soupparser and serialise them into the cache. Then parse them from the cache using the normal HTML parser (preferably with "recover=False") when you need them. A serialise-parse cycle is several times faster than a new parser run of BeautifulSoup, so if you need the documents multiple times, this will speed things up. Stefan
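A rough sketch of the serialise-then-reparse cache described above, under some assumptions: the function names, the cache file naming scheme and the cache directory layout are all made up for illustration, and the slow path needs BeautifulSoup to be installed for lxml.html.soupparser to import.

```python
import os

import lxml.html


def parse_slow(path):
    """Parse with the BeautifulSoup-based parser (the slow but lenient path)."""
    # Imported lazily so the module loads even without BeautifulSoup.
    from lxml.html import soupparser
    return soupparser.parse(path).getroot()


def load_document(path, cache_dir, slow_parser=parse_slow):
    """Return the root element of `path`, going through a serialised cache.

    Hypothetical helper: the first call runs the slow parser and writes
    the serialised tree to cache_dir; every call then parses the cached
    file with lxml's normal HTML parser.
    """
    cache_path = os.path.join(cache_dir, os.path.basename(path) + ".cached.html")
    if not os.path.exists(cache_path):
        root = slow_parser(path)
        with open(cache_path, "wb") as f:
            # lxml.html.tostring() returns bytes by default.
            f.write(lxml.html.tostring(root))
    # A parser created with recover=False could be passed here via the
    # `parser` argument, as suggested above; omitted to keep the sketch small.
    return lxml.html.parse(cache_path).getroot()
```

Because the cache holds plain serialised HTML rather than pickled objects, it also sidesteps the fact that Element instances are not pickleable.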