From aymeric.augustin at polytechnique.org Sat Jan 2 20:57:02 2010
From: aymeric.augustin at polytechnique.org (Aymeric Augustin)
Date: Sat, 2 Jan 2010 20:57:02 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
Message-ID: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>

Hello,

lxml.etree.parse is able to load gzipped XML files directly, but
lxml.etree.iterparse is not. See below for an interactive session
demonstrating the problem on Debian stable.

Is this the expected behavior, or is it a bug? The documentation does
not address this point; it says only:

> lxml can parse from a local file, an HTTP URL or an FTP URL. It
> also auto-detects and reads gzip-compressed XML files (.gz).

Context: I'm handling hundreds of GB-sized files. It would be nice to
store them gzipped and have lxml decompress them on the fly, without
any specific Python code.

Thanks!

% python
Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip, sys
>>> from lxml import etree
>>> print etree.__version__
2.1.1

Let's create a gzipped XML file:

>>> gzip.open('test.xml.gz', 'wb').write('')

etree.parse is able to load it:

>>> tree = etree.parse('test.xml.gz')
>>> tree.write(sys.stdout); print

etree.iterparse crashes:

>>> ctx = etree.iterparse('test.xml.gz')
>>> list(ctx)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

etree.iterparse accepts the ungzipped file:

>>> ctx = etree.iterparse(gzip.open('test.xml.gz', 'rb'))
>>> list(ctx)
[(u'end', ), (u'end', )]

--
Aymeric Augustin.
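The last step of the session above can be sketched as a small helper. This is only an illustration, assuming Python 3 and a reasonably recent lxml; the name iter_gzipped is hypothetical and not part of lxml.

```python
import gzip

from lxml import etree


def iter_gzipped(path, tag=None):
    """Yield elements from a gzip-compressed XML file as they are completed."""
    # gzip.open() returns a file-like object that decompresses on read;
    # iterparse() only needs something with a .read() method.
    with gzip.open(path, "rb") as stream:
        for _event, element in etree.iterparse(stream, events=("end",), tag=tag):
            yield element
            # Drop the element's content once the caller is done with it,
            # which helps keep memory bounded on very large files.
            element.clear()
```

Because each element is cleared after it has been yielded, children are already empty by the time their parent is yielded. Passing tag="record" (or whatever the repeated element is called in the real data) restricts the iteration to the elements of interest.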
From stefan_ml at behnel.de Sun Jan 3 08:25:51 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 03 Jan 2010 08:25:51 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
Message-ID: <4B40467F.7020305@behnel.de>

Aymeric Augustin, 02.01.2010 20:57:
> lxml.etree.parse is able to load gzipped XML files directly, but
> lxml.etree.iterparse is not.
> [...]
> The documentation does not address this point; it says only:
>> lxml can parse from a local file, an HTTP URL or an FTP URL. It
>> also auto-detects and reads gzip-compressed XML files (.gz).

Right, there should be a note in the iterparse docs also. The input support
in iterparse() is a lot simpler than that. It doesn't support URLs either.

Due to the inner workings of iterparse, all of this isn't trivial to add, as
lxml would have to detect and apply the correct reading mechanism itself
(e.g. by building up a decompression step for libxml2 manually). Even
detecting the compression would require opening the file and reading from it
first. Now imagine named pipes and system streams, which you cannot just
reopen afterwards...

It might be possible to detect GzipFile objects and bypass them, but that
would already be a difference to the normal parse() behaviour.

> Context: I'm handling hundreds of GB-sized files. It would be nice to
> store them gzipped and have lxml decompress them on the fly, without
> any specific Python code.

The way to do that is currently by passing through the gzip module.

You can also try using a pipe to an externally started gzip process. I
frequently use this on 64-bit multicore Sun machines where the
system-provided gzip is incredibly fast, much faster than Python's gzip
module.
Stefan

From aymeric.augustin at polytechnique.org Sun Jan 3 11:00:41 2010
From: aymeric.augustin at polytechnique.org (Aymeric Augustin)
Date: Sun, 3 Jan 2010 11:00:41 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To: <4B40467F.7020305@behnel.de>
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org> <4B40467F.7020305@behnel.de>
Message-ID:

On 3 Jan 2010, at 08:25, Stefan Behnel wrote:

> Right, there should be a note in the iterparse docs also. The input
> support in iterparse() is a lot simpler than that. It doesn't
> support URLs either.
>
> Due to the inner workings of iterparse, all of this isn't trivial
> to add, as lxml would have to detect and apply the correct reading
> mechanism itself (e.g. by building up a decompression step for
> libxml2 manually). Even detecting the compression would require
> opening the file and reading from it first. Now imagine named pipes
> and system streams, which you cannot just reopen afterwards...

OK, thanks for the explanation.

>> Context: I'm handling hundreds of GB-sized files. It would be nice
>> to store them gzipped and have lxml decompress them on the fly,
>> without any specific Python code.
>
> The way to do that is currently by passing through the gzip module.
> You can also try using a pipe to an externally started gzip
> process. I frequently use this on 64-bit multicore Sun machines
> where the system-provided gzip is incredibly fast, much faster
> than Python's gzip module.

I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the
pipe. On a Debian machine with a Core 2 Duo, it's 10% faster than using
the gzip module. The performance gain comes from gunzipping and parsing
in parallel; the overall resource consumption (user + system) is nearly
identical.

So for now I'll stick with the gzip module.

--
Aymeric Augustin.
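The pipe variant discussed above can be sketched as follows. This is a rough illustration rather than anyone's actual setup: it assumes Python 3 and a gzip executable on the PATH, uses "gzip -dc" in place of zcat (the two behave the same on most Linux systems), and iterparse_via_pipe is a hypothetical helper name.

```python
import subprocess

from lxml import etree


def iterparse_via_pipe(path, events=("end",)):
    """Parse a gzipped XML file while an external gzip process decompresses it."""
    # The child process decompresses in parallel with the parser, which is
    # where the speedup reported in the thread comes from.
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    try:
        for event, element in etree.iterparse(proc.stdout, events=events):
            yield event, element
    finally:
        # Close our end of the pipe and reap the child, also when the
        # caller stops iterating early.
        proc.stdout.close()
        proc.wait()
```

Whether this beats the gzip module depends on the machine; the thread reports roughly 10% on a dual-core box, with similar total CPU time.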
From stefan_ml at behnel.de Sun Jan 3 15:14:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 03 Jan 2010 15:14:35 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To:
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org> <4B40467F.7020305@behnel.de>
Message-ID: <4B40A64B.60302@behnel.de>

Aymeric Augustin, 03.01.2010 11:00:
> I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the
> pipe. On a Debian with a Core 2 Duo, it's faster by 10% than using the
> gzip module. The performance gain comes from gunzipping and parsing in
> parallel; the overall resource consumption (user + system) is nearly
> identical.
>
> So for now I'll stick with the gzip module.

Sounds reasonable. You can also try to adjust the gzip buffer size and see
if that reduces the overhead.

Stefan

From lists at zopyx.com Wed Jan 6 14:33:57 2010
From: lists at zopyx.com (Andreas Jung)
Date: Wed, 06 Jan 2010 14:33:57 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
Message-ID: <4B449145.8060602@zopyx.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there,

given a structure like

<p>
    foo
    blather
    bar
    blather
    fox
</p>

How can I parse the content of <p>
into a flat list like

['foo', , 'bar', , 'fox']

?

Andreas

- --
ZOPYX Limited           \ zopyx group
Charlottenstr. 37/1     \ The full-service network for your
D-72070 Tübingen        \ Python, Zope and Plone projects
www.zopyx.com, info at zopyx.com \ www.zopyxgroup.com
- ------------------------------------------------------------------------
E-Publishing, Python, Zope & Plone development, Consulting

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktEkUUACgkQCJIWIbr9KYwbXwCfWorSz4vRAcGHTop0AYcNpvmq
rSMAoJruQ6iWdOivdteLBnkCZHI4mM6m
=M0/g
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lists.vcf
Type: text/x-vcard
Size: 316 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100106/3f446a88/attachment.vcf

From stefan_ml at behnel.de Thu Jan 7 14:15:41 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 07 Jan 2010 14:15:41 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
In-Reply-To: <4B449145.8060602@zopyx.com>
References: <4B449145.8060602@zopyx.com>
Message-ID: <4B45DE7D.9050903@behnel.de>

Andreas Jung, 06.01.2010 14:33:
> given a structure like
>
> <p>
>    foo
>    blather
>    bar
>    blather
>    fox
> </p>
> > How can I parse the content of
<p>
> into a flat list like
>
> ['foo', , 'bar', , 'fox']

for p in root.iter(tag='p'):
    flat_list = []
    if p.text:
        flat_list.append(p.text)
    for el in p:
        flat_list.append(el)
        if el.tail:
            flat_list.append(el.tail)
    print(flat_list)

Stefan

From d.rothe at semantics.de Thu Jan 7 22:38:46 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Thu, 07 Jan 2010 22:38:46 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
In-Reply-To: <4B45DE7D.9050903@behnel.de>
References: <4B449145.8060602@zopyx.com> <4B45DE7D.9050903@behnel.de>
Message-ID:

On Thu, 07 Jan 2010 14:15:41 +0100, Stefan Behnel wrote:
>
> Andreas Jung, 06.01.2010 14:33:
>> given a structure like
>>
>> <p>
>>    foo
>>    blather
>>    bar
>>    blather
>>    fox
>> </p>
>> >> How can I parse the content of
<p>
>> into a flat list like
>>
>> ['foo', , 'bar', , 'fox']
>
> for p in root.iter(tag='p'):
>     flat_list = []
>     if p.text:
>         flat_list.append(p.text)
>     for el in p:
>         flat_list.append(el)
>         if el.tail:
>             flat_list.append(el.tail)
>     print(flat_list)

Another variant with xpath:

flat_list = root.xpath('//p/node()')

or if root is the <p>
element

flat_list = root.xpath('node()')

--dirk

From ygingras at ygingras.net Sun Jan 10 21:16:31 2010
From: ygingras at ygingras.net (Yannick Gingras)
Date: Sun, 10 Jan 2010 15:16:31 -0500
Subject: [lxml-dev] Looking for performance tips for soupparser
In-Reply-To: <4B3D20CB.4000305@behnel.de>
References: <200912311111.20018.ygingras@ygingras.net> <4B3D20CB.4000305@behnel.de>
Message-ID: <201001101516.31343.ygingras@ygingras.net>

On December 31, 2009, Stefan Behnel wrote:
> > Would any of you have some tips to share on speeding things up with
> > soupparser? How hard would it be to make elements conform to the
> > pickling protocol?
>
> I'd use the normal HTML parser instead, and only fall back to using the
> soupparser when things go really wrong (whatever that means in your case).
>
> Another thing you can do (assuming that caching is helpful in your case),
> is to parse the documents using soupparser and serialise them into the
> cache. Then parse them from the cache using the normal HTML parser
> (preferably with "recover=False") when you need them. A serialise-parse
> cycle is several times faster than a new parser run of BeautifulSoup, so if
> you need the documents multiple times, this will speed things up.

I implemented both ideas and it resulted in at least a 10-fold speedup.
Thanks a lot!

--
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part.
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100110/c9d590f6/attachment.pgp

From b.tarde at gmail.com Mon Jan 11 22:04:34 2010
From: b.tarde at gmail.com (Peter Baker)
Date: Mon, 11 Jan 2010 16:04:34 -0500
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
Message-ID: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>

I've just recently been getting acquainted with lxml. Thanks to the
developers: it's great!

There's just one thing (so far) that I haven't been able to do. In an
XSLT transformation, I can't figure out what's going on with
xsl:message when there is no terminate="yes" attribute. Command-line
processors like xsltproc print the message to stderr. With libxslt you
can capture it with a callback function. But I can't figure out how to
display such messages with lxml.

For just a little background, my app often needs to do very long
transforms: up to half an hour, though a minute is more typical. It
seems important to be able to spit out warnings and messages to mark
the progress of the transformation. So reading an error log after the
transform is over is not what I want.

Sorry if the answer is an obvious one. It's hard to search a forum's
archive for "message," and the other keywords I've been able to think
of haven't given me the answer in either the documentation or the
forum.

Thanks in advance,
Peter Baker

From stefan_ml at behnel.de Tue Jan 12 09:05:44 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 12 Jan 2010 09:05:44 +0100
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
In-Reply-To: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
References: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
Message-ID: <4B4C2D58.1030604@behnel.de>

Hi,

Peter Baker, 11.01.2010 22:04:
> I've just recently been getting acquainted with lxml. Thanks to the
> developers: it's great!

Thanks :)

> There's just one thing (so far) that I haven't been able to do.
> In an XSLT transformation, I can't figure out what's going on with
> xsl:message when there is no terminate="yes" attribute. Command-line
> processors like xsltproc print the message to stderr. With libxslt you
> can capture it with a callback function. But I can't figure out how to
> display such messages with lxml.
>
> For just a little background, my app often needs to do very long
> transforms: up to a half hour, though a minute is more typical. It
> seems important to be able to spit out warnings and messages to mark
> the progress of the transformation. So reading an error log after the
> transform is over is not what I want.

I never tried that, but you should be able to read the error_log also during
the transformation (you obviously need a reference to the running XSLT
object to do that).

You shouldn't read the log from a separate thread, though. I'm not sure if
that works, but if it works, I should consider it a bug (I'll have to check
that).

A different way is to use a dedicated extension element to export the
message, instead of the generic xsl:message tag. That would allow you to do
whatever you want in Python code, e.g. use the logging package, and also to
provide more information than just a plain message string.

http://codespeak.net/lxml/extensions.html#xslt-extension-elements

And a third approach would be to divert the global thread error log (i.e.
the output of everything libxml2/libxslt does in the current thread) to
Python's logging package.

http://codespeak.net/lxml/api/lxml.etree.PyErrorLog-class.html
http://codespeak.net/lxml/api/lxml.etree-module.html#use_global_python_log

However, this doesn't appear to be used very much, so it hasn't been
exercised as much as other parts of the API. It was even broken for a long
time before 2.2.3 until a user finally noticed (this feature isn't easy to
test as it obviously has the side-effect of diverting the error output).
Just looking through the code now revealed a couple of spots where the API
could be improved - I guess that should be done for 2.3.

HTH - maybe not "one perfect way to do it", but at least a couple of paths
to explore. :)

> Sorry if the answer is an obvious one. It's hard to search a forum's
> archive for "message," and the other keywords I've been able to think
> of haven't given me the answer in either the documentation or the
> forum.

I'd look for "xsl:message" given that the prefix is so commonly used. But I
don't remember having seen any related threads on this list so far.

Stefan

From ashish.vyas at motorola.com Tue Jan 12 10:33:56 2010
From: ashish.vyas at motorola.com (VYAS ASHISH M-NTB837)
Date: Tue, 12 Jan 2010 17:33:56 +0800
Subject: [lxml-dev] FW: lxml 2.2.4 on python3.1, Windows XP gives importerror
Message-ID: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>

Dear All

I have Python 3.1 installed on Windows XP and it works nicely.
I downloaded lxml 2.2.4 (lxml-2.2.4.win32-py3.1.exe) from pypi.

When I try:
from lxml import etree
I get:
ImportError: DLL load failed: This application has failed to start
because the application configuration is incorrect. Reinstalling the
application may fix this problem.

For information: 'import lxml' works fine.

After reinstalling python3.1 the error message is still the same. Any
help is appreciated!

Regards,
Ashish Vyas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100112/50e974ba/attachment.htm

From dakota at brokenpipe.ru Tue Jan 12 14:05:43 2010
From: dakota at brokenpipe.ru (Marat Dakota)
Date: Tue, 12 Jan 2010 16:05:43 +0300
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements
In-Reply-To: <4B3A7F41.6000304@behnel.de>
References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de>
Message-ID:

> Thanks a lot, it looks reasonable at first glance and I'll take a closer
> look as soon as I get to it. If it works well, it should make it into 2.3.

Is there a roadmap date for the 2.3 release?

> Could you add a couple of tests to src/lxml/tests/test_xslt.py? That would
> help in making sure that it keeps working as expected even if I find that I
> need to rework the patch.

I've added tests. I've also renamed variables to fit your code better and
added the possibility to evaluate an extension element's content directly to
_AppendOnlyElementProxy as well as to _Element. It looks like I'm satisfied
with the code now. I wonder what you will say about it.

> Also, it's best to send patches as a readable attachment rather than
> inline. Mail programs tend to reformat text and it's easy to lose empty
> trailing lines etc.

The patch is attached. Can't wait to see it in trunk :)

> Thanks for pulling this out!

And thank you for making a very nice and useful thing!

--
Marat
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100112/2f83eef6/attachment-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxml-2.2.4.patch
Type: application/octet-stream
Size: 7125 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100112/2f83eef6/attachment-0001.obj

From b.tarde at gmail.com Tue Jan 12 14:51:15 2010
From: b.tarde at gmail.com (Peter Baker)
Date: Tue, 12 Jan 2010 08:51:15 -0500
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
In-Reply-To: <4B4C2D58.1030604@behnel.de>
References: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com> <4B4C2D58.1030604@behnel.de>
Message-ID: <64e117cb1001120551o45fd6b79scca71cff088b09b2@mail.gmail.com>

On Tue, Jan 12, 2010 at 3:05 AM, Stefan Behnel wrote:
>
> I never tried that, but you should be able to read the error_log also during
> the transformation (you obviously need a reference to the running XSLT
> object to do that).
>
> You shouldn't read the log from a separate thread, though. I'm not sure if
> that works, but if it works, I should consider it a bug (I'll have to check
> that).
>

Stefan,

Thanks for the quick reply. I'm not partial to extension elements, and your
third method sounds iffy, so I'd like to try the first. But looking at the
API it is not clear to me how I get at the error_log *while the transform is
running.* (I'm still learning to think Pythonically: my current task is
converting from a Bash script that calls xsltproc to Python that uses the
lxml API.)
Here's a snippet from my current (probably un-Pythonic) code:

from lxml import etree
xgfdoc = etree.parse(self.xgffile)
styledoc = etree.parse(self.xslfile)
transform = etree.XSLT(styledoc)
newparams = dict()
for k in self.params.keys():
    newparams[k] = "'" + self.params[k] + "'"
try:
    result = transform(xgfdoc, **newparams)
except etree.XSLTApplyError as msg:
    raise XGFCompilationError(msg)

Peter

From sridharr at activestate.com Tue Jan 12 19:02:38 2010
From: sridharr at activestate.com (Sridhar Ratnakumar)
Date: Tue, 12 Jan 2010 10:02:38 -0800
Subject: [lxml-dev] Instructions to build on Windows 64-bit?
In-Reply-To: <4B3A55DD.1030004@activestate.com>
References: <4B3A55DD.1030004@activestate.com>
Message-ID: <4B4CB93E.5070903@activestate.com>

On 12/29/2009 11:17 AM, Sridhar Ratnakumar wrote:
> I noticed that the lxml PyPI page provides 64-bit Windows installers
> [http://pypi.python.org/pypi/lxml/2.2.4 ; lxml-*amd64.exe]. I assume
> they are statically linked with the libxml/xslt libraries.
>
> In the interest of providing 64-bit binaries in PyPM
> [pypm.activestate.com], may I know how these binaries are built? I tried
> buildlibxml.py which fails at several steps; and the compiled libraries
> provided at ftp.zlatkovic.com are 32-bit only.

Any response?

I ask because this would enable us to provide builds for lxml via PyPM:

http://pypm.activestate.com/list-l.html#lxml

Currently the 64-bit builds are missing. If the person who made the amd64
installers could respond to this, that would be great. It will make
http://codespeak.net/lxml/installation.html#installation-in-activepython
just work on the 64-bit systems.

Specifically I am asking for the 64-bit version of
http://codespeak.net/lxml/build.html#static-linking-on-windows

-srid

From sidnei.da.silva at gmail.com Tue Jan 12 20:01:24 2010
From: sidnei.da.silva at gmail.com (Sidnei da Silva)
Date: Tue, 12 Jan 2010 17:01:24 -0200
Subject: [lxml-dev] Instructions to build on Windows 64-bit?
In-Reply-To: <4B4CB93E.5070903@activestate.com>
References: <4B3A55DD.1030004@activestate.com> <4B4CB93E.5070903@activestate.com>
Message-ID:

On Tue, Jan 12, 2010 at 4:02 PM, Sridhar Ratnakumar wrote:
> Any response?
>
> I ask because this would enable us to provide builds for lxml via PyPM:
>
>   http://pypm.activestate.com/list-l.html#lxml
>
> Currently the 64-bits are missing. If the person who made the amd64
> installers could respond to this, that would be great. It will make
> http://codespeak.net/lxml/installation.html#installation-in-activepython
> just work on the 64-bit systems.
>
> Specifically I am asking for the 64-bit version of
> http://codespeak.net/lxml/build.html#static-linking-on-windows

I've used the binaries from:

http://pecl2.php.net/downloads/php-windows-builds/php-libs/

I also had to modify the 'libraries' function in setupinfo.py since
those binaries are slightly different. I don't remember the details,
but something about a '_a' postfix IIRC.

-- Sidnei

From mike.maccana at gmail.com Wed Jan 13 21:40:26 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Wed, 13 Jan 2010 20:40:26 +0000
Subject: [lxml-dev] Python module to make / modify / create docx files
Message-ID: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>

Hi LXML folks,

Just a short note to say I've made a Python module to create, modify, query
and extract text from Microsoft Word 2007 docx files - using everyone's
favorite Python XML module (that's LXML of course).

If you're interested, check out http://github.com/mikemaccana/python-docx
for a full feature list.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100113/2b79c91f/attachment.htm

From martin at ozoneonline.com Wed Jan 13 21:27:16 2010
From: martin at ozoneonline.com (Martin Fisher)
Date: Wed, 13 Jan 2010 12:27:16 -0800
Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard...
Message-ID: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>

Hi Guys,

I think I've searched diligently but can't find a good solution:

I'm seeing the following error when loading docx...

ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
  Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
  Expected in: flat namespace
 in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so

I used easy_install to install lxml and macports to load libxml2,libxslt
after cleaning/removing all versions.

Any clues where to look will be appreciated.

Thanks
Martin

Ozone Online
Martin Fisher
Title: CTO
phone: 415-692-4182
email: martin at ozoneonline.com
fax: 415-771-5530
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100113/e16b185f/attachment.htm

From stefan_ml at behnel.de Thu Jan 14 09:52:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 14 Jan 2010 09:52:33 +0100
Subject: [lxml-dev] Python module to make / modify / create docx files
In-Reply-To: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>
References: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>
Message-ID: <4B4EDB51.4090001@behnel.de>

MacCana, 13.01.2010 21:40:
> Just a short note to say I've made a Python module to create, modify, query
> and extract text from Microsoft Word 2007 docx files - using everyone's
> favorite Python XML module (that's LXML of course).
>
> If you're interested, check out
> http://github.com/mikemaccana/python-docx for a full feature list.

Thanks for sharing that, I'll add a link to the lxml 'who uses it' FAQ.
Stefan

From mike.maccana at gmail.com Sat Jan 16 23:44:48 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Sat, 16 Jan 2010 22:44:48 +0000
Subject: [lxml-dev] Namespaces on attribute values
Message-ID: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>

Hi Stefan and others,

I'm currently creating a particularly tricky element from a string, using:

etree.fromstring('''2010-01-01T21:07:00Z''')

which works fine. However I'd like to make the element manually. This bit is
causing me trouble:

xsi:type="dcterms:W3CDTF"

Is 'dcterms' a namespace on an attribute value? If so, how can I set it in
LXML? I haven't seen that before, and I can't find much documentation
online. I'm quite comfortable with namespaces on tags and elements. Here's
what I currently have:

bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
        'dcterms:W3CDTF')
bar.text = '2010-01-01T21:07:00Z'

Alas the app that parses my XML doesn't like it - though the fromstring() is
fine. Any way I can set the namespace on the attribute value?

Thanks very much for any help,

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100116/b4fe86b2/attachment.html

From stefan_ml at behnel.de Sun Jan 17 08:48:23 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 08:48:23 +0100
Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard...
In-Reply-To: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
References: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
Message-ID: <4B52C0C7.3090302@behnel.de>

Martin Fisher, 13.01.2010 21:27:
> I think I've searched diligently but can't find a good solution:
>
> I'm seeing the following error when loading docx...
>
> ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
>   Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
>   Expected in: flat namespace
>  in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so

That's a symbol from libxml2 that it can't find. Something's wrong with
your installation.

> I used easy_install to install lxml and macports to load libxml2,libxslt after cleaning/removing all versions.

It shouldn't use macports to provide the libraries. See here for
installation instructions:

http://codespeak.net/lxml/installation.html#installation

Stefan

From stefan_ml at behnel.de Sun Jan 17 08:53:02 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 08:53:02 +0100
Subject: [lxml-dev] FW: lxml 2.2.4 on python3.1, Windows XP gives importerror
In-Reply-To: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>
References: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>
Message-ID: <4B52C1DE.8010400@behnel.de>

VYAS ASHISH M-NTB837, 12.01.2010 10:33:
> I have Python 3.1 installed on Windows XP and Works nice.
> I downloaded lxml 2.2.4 (lxml-2.2.4.win32-py3.1.exe) from pypi.
>
> When I try:
> from lxml import etree
> I get:
> ImportError: DLL load failed: This application has failed to start
> because the application configuration is incorrect. Reinstalling the
> application may fix this problem.

I can't extract much information from that error message, except that the
lxml.etree module failed to load. Do others experience the same problem with
the 3.1 installer?

> For information: 'import lxml' works fine.

That's because it doesn't load any DLLs.

> After reinstalling python3.1 also the error message is the same. Any
> help is appreciated!

Did you try reinstalling lxml?
Stefan

From stefan_ml at behnel.de Sun Jan 17 09:00:01 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 09:00:01 +0100
Subject: [lxml-dev] Namespaces on attribute values
In-Reply-To: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
Message-ID: <4B52C381.1050304@behnel.de>

Mike MacCana, 16.01.2010 23:44:
> Hi Stefan and others,
>
> I'm currently creating a particularly tricky element from a string, using:
>
> etree.fromstring('''
> xsi:type="dcterms:W3CDTF">2010-01-01T21:07:00Z''')
>
> which works fine. However I'd like to make the element manually. This bit is
> causing me trouble:
>
> xsi:type="dcterms:W3CDTF"
>
> is 'dcterms' a namespace on a attribute value? If so, how can I set it in
> LXML? I haven't seen that before, and I can't find much documentation
> online. I'm quite comfortable with namespaces on tags and elements. Here's
> what I currently have:
>
> bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
> bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
>         'dcterms:W3CDTF')
> bar.text = '2010-01-01T21:07:00Z'
>
> Alas the app that parses my XML doesn't like it - though the fromstring() is
> fine. Any way I can set the namespace on attribute value?

It's not a namespace (i.e. a URI), it's just the associated prefix. You
have to extract it from the Element you set the namespace on ('prefix'
attribute) or use an 'nsmap' to specify a prefix for the namespace.

http://codespeak.net/lxml/tutorial.html#namespaces

If you don't provide a prefix yourself, lxml.etree will use a default name
like 'ns0', which doesn't correspond with the 'dcterms' you use in your
value.
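
The nsmap approach described above might look roughly like this. It is a sketch in Python 3 syntax using the element and attribute from the question; the variable names are illustrative only.

```python
from lxml import etree

DCTERMS = "http://purl.org/dc/terms/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Bind the prefixes explicitly, so that the 'dcterms' used inside the
# attribute value matches a prefix that is actually declared in the output.
created = etree.Element(
    "{%s}created" % DCTERMS,
    nsmap={"dcterms": DCTERMS, "xsi": XSI},
)
# The attribute name is namespaced; its value is a plain string that
# merely happens to start with the prefix declared above.
created.set("{%s}type" % XSI, "dcterms:W3CDTF")
created.text = "2010-01-01T21:07:00Z"

xml = etree.tostring(created).decode("ascii")
print(xml)
```

Without the nsmap, the serialised element would carry generated prefixes such as ns0, and the literal "dcterms:" in the value would no longer refer to any declared prefix.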
Stefan

From mike.maccana at gmail.com Sun Jan 17 12:19:16 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Sun, 17 Jan 2010 11:19:16 +0000
Subject: [lxml-dev] Namespaces on attribute values
In-Reply-To: <4B52C381.1050304@behnel.de>
References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com> <4B52C381.1050304@behnel.de>
Message-ID: <73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com>

On Sun, Jan 17, 2010 at 8:00 AM, Stefan Behnel wrote:
>
> Mike MacCana, 16.01.2010 23:44:
> > Hi Stefan and others,
> >
> > I'm currently creating a particularly tricky element from a string, using:
> >
> > etree.fromstring('''
> > xsi:type="dcterms:W3CDTF">2010-01-01T21:07:00Z''')
> >
> > which works fine. However I'd like to make the element manually. This bit is
> > causing me trouble:
> >
> > xsi:type="dcterms:W3CDTF"
> >
> > is 'dcterms' a namespace on a attribute value? If so, how can I set it in
> > LXML? I haven't seen that before, and I can't find much documentation
> > online. I'm quite comfortable with namespaces on tags and elements. Here's
> > what I currently have:
> >
> > bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
> > bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
> >         'dcterms:W3CDTF')
> > bar.text = '2010-01-01T21:07:00Z'
> >
> > Alas the app that parses my XML doesn't like it - though the fromstring() is
> > fine. Any way I can set the namespace on attribute value?
>
> It's not a namespace (i.e. a URI), it's just the associated prefix. You
> have to extract it from the Element you set the namespace on ('prefix'
> attribute) or use an 'nsmap' to specify a prefix for the namespace.
>
> http://codespeak.net/lxml/tutorial.html#namespaces
>
> If you don't provide a prefix yourself, lxml.etree will use a default name
> like 'ns0', which doesn't correspond with the 'dcterms' you use in your
> value.
> > Stefan > > Thanks - I should clarify that my aim is to set the right namespace on the attribute's value - ns0 is fine if it points to the right URI. My question is, how can I set the right namespace on what appears to be an attributes *value*? I've read the namespace tutorial, and understand: - Namespace of tag itself is "http://purl.org/dc/terms/". Cool. - Namespace of the attribute is "http://www.w3.org/2001/XMLSchema-instance". Cool. But how can I set a namespace (preferably directly or via a prefix) on an attribute value (not the attribute)? Trying: bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', '{ http://purl.org/dc/terms/}'+'W3CDTF') Doesn't seem to give me any luck either... Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100117/13831277/attachment.htm From gilles.lenfant at gmail.com Sun Jan 17 12:44:44 2010 From: gilles.lenfant at gmail.com (Gilles Lenfant) Date: Sun, 17 Jan 2010 12:44:44 +0100 Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard... In-Reply-To: <4B52C0C7.3090302@behnel.de> References: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com> <4B52C0C7.3090302@behnel.de> Message-ID: <7c3325691001170344u4b5313d2u99d419d09662c54a@mail.gmail.com> BTW (perhaps off topic, but...) Somebody knows how to do the equivalent of this in a buildout? "STATIC_DEPS=true easy_install lxml" Do I need to write a recipe for this or is there an OTB method I didn't find? Thanks by advance. -- Gilles Lenfant 2010/1/17 Stefan Behnel : > > Martin Fisher, 13.01.2010 21:27: >> I think I've searched diligently but can't find a good solution: >> >> I'm seeing the following error when loading docx... >> >> ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk >> ? 
Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so >> ? Expected in: flat namespace >> ?in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so > > That's a symbol from libxml2 that it can't find. Something's wrong with > your installation. > > >> I used easy_install to install lxml and macports to load libxml2,libxslt after cleaning/removing all versions. > > It shouldn't use macports to provide the libraries. See here for > installation instructions: > > http://codespeak.net/lxml/installation.html#installation > > Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From stefan_ml at behnel.de Sun Jan 17 15:00:56 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 17 Jan 2010 15:00:56 +0100 Subject: [lxml-dev] Namespaces on attribute values In-Reply-To: <73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com> References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com> <4B52C381.1050304@behnel.de> <73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com> Message-ID: <4B531818.1030406@behnel.de> Mike MacCana, 17.01.2010 12:19: > On Sun, Jan 17, 2010 at 8:00 AM, Stefan Behnel wrote: >> Mike MacCana, 16.01.2010 23:44: >>> xsi:type="dcterms:W3CDTF" >>> >>> is 'dcterms' a namespace on a attribute value? If so, how can I set it in >>> LXML? I haven't seen that before, and I can't find much documentation >>> online. I'm quite comfortable with namespaces on tags and elements. >>> Here's what I currently have: >>> >>> bar = etree.Element('{http://purl.org/dc/terms/}'+'created') >>> bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', >>> 'dcterms:W3CDTF') >>> bar.text = '2010-01-01T21:07:00Z' >>> >>> Alas the app that parses my XML doesn't like it - though the fromstring() >> is >>> fine. 
Any way I can set the namespace on attribute value? >> >> It's not a namespace (i.e. a URI), it's just the associated prefix. You >> have to extract it from the Element you set the namespace on ('prefix' >> attribute) I might have been unclear here. Elements have a 'prefix' attribute that gives you the prefix they use for their namespace. >> http://codespeak.net/lxml/tutorial.html#namespaces >> >> If you don't provide a prefix yourself, lxml.etree will use a default name >> like 'ns0', which doesn't correspond with the 'dcterms' you use in your >> value. > > Thanks - I should clarify that my aim is to set the right namespace on the > attribute's value - ns0 is fine if it points to the right URI. My question > is, how can I set the right namespace on what appears to be an attributes > *value*? > > I've read the namespace tutorial, and understand: > - Namespace of tag itself is "http://purl.org/dc/terms/". Cool. > - Namespace of the attribute is "http://www.w3.org/2001/XMLSchema-instance". > Cool. > But how can I set a namespace (preferably directly or via a prefix) on an > attribute value (not the attribute)? Trying: > > bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', '{ > http://purl.org/dc/terms/}'+'W3CDTF') There is a feature I had almost forgotten about, I'm not even sure it's documented anywhere (doc patches welcome). You can do this: bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', etree.QName('http://purl.org/dc/terms/', 'W3CDTF')) 'QName' means 'qualified name' and works wherever lxml.etree accepts a tag name, with attribute values as a special feature for exactly your use case. Note that lxml.etree can't take care for keeping the prefix used in the (now plain text) attribute value up to date during tree changes, so if (e.g.) 
you append the element above to a tree that already defines the namespace under a different prefix, the attribute value will not get updated and may lose its meaning if the namespace declarations get reassigned.

Stefan

From dakota at brokenpipe.ru Sun Jan 17 21:12:01 2010
From: dakota at brokenpipe.ru (Marat Dakota)
Date: Sun, 17 Jan 2010 23:12:01 +0300
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements
In-Reply-To:
References: <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de>
Message-ID:

Hi,

I wonder if you've noticed my last letter with patch and questions...

--
Marat

From stefan_ml at behnel.de Mon Jan 18 08:21:16 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 18 Jan 2010 08:21:16 +0100
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements
In-Reply-To:
References: <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de>
Message-ID: <4B540BEC.5090900@behnel.de>

Marat Dakota, 17.01.2010 21:12:
> I wonder if you've noticed my last letter with patch and questions...

Sorry! Yes, I noticed it, but didn't have the time to reply back then. I haven't looked at it yet, but I definitely will. As I said, the last one looked good already, so I'll see that I get it applied as soon as I get to it.

Thanks!

Stefan From jholg at gmx.de Mon Jan 18 11:55:28 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 18 Jan 2010 11:55:28 +0100 Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml Message-ID: <20100118105528.89660@gmx.net> Hi, having played around with schematron a bit more I propose some changes to the current trunk additions for iso-schematron support: The "validation criteria", i.e. if the validation result is True or False, is currently exposed on a module level: # svrl result accessors svrl_validation_errors = _etree.XPath( '//svrl:failed-assert', namespaces={'svrl': SVRL_NS}) So you can customize the criteria globally. With schematron however you can use "assert" as well as "report" tests, and also categorize the tests using "flag" and "role" attributes that will show up in the resulting svrl xml document (I do not currently understand the intended difference between the two, but that's another story). So I think it may well be possible that what's interpreted as a validation error depends very much on how one designs the schematron schema, and it may be helpful to be able to customize validation outcome per validator instance. This speaks for pulling the result accessor into the Schematron class, probably as a class attribute that can be overridden on an instance level. The same might make sense for the iso-schematron implementation xsl transformation steps. Opinions? Holger -- Preisknaller: GMX DSL Flatrate f?r nur 16,99 Euro/mtl.! 
http://portal.gmx.net/de/go/dsl02 From stefan_ml at behnel.de Mon Jan 18 12:01:15 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Jan 2010 12:01:15 +0100 Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml In-Reply-To: <20100118105528.89660@gmx.net> References: <20100118105528.89660@gmx.net> Message-ID: <4B543F7B.8000803@behnel.de> jholg at gmx.de, 18.01.2010 11:55: > having played around with schematron a bit more I propose some changes to the current trunk additions for iso-schematron support: > > The "validation criteria", i.e. if the validation result is True or False, is currently exposed on a module level: > > # svrl result accessors > svrl_validation_errors = _etree.XPath( > '//svrl:failed-assert', namespaces={'svrl': SVRL_NS}) > > So you can customize the criteria globally. > > With schematron however you can use "assert" as well as "report" tests, and also categorize the tests using "flag" and "role" attributes that will show up in the resulting svrl xml document (I do not currently understand the intended difference between the two, but that's another story). > So I think it may well be possible that what's interpreted as a validation error depends very much on how one designs the schematron schema, and it may be helpful to be able to customize validation outcome per validator instance. > > This speaks for pulling the result accessor into the Schematron class, probably as a class attribute that can be overridden on an instance level. > > The same might make sense for the iso-schematron implementation xsl transformation steps. Sounds like a much better interface. Any interesting global options would be better overridden by subtyping the validator class, so class attributes make sense to me. 
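
To make the proposal concrete, here is a rough sketch of the class-attribute pattern being discussed. It is not the code that is in trunk: the class names 'SvrlResultChecker' and 'ReportsAreErrorsChecker' and the method 'is_valid' are hypothetical, and the SVRL namespace URI is assumed. Only 'etree.XPath' is real lxml API. The result accessor lives on the class, a subtype overrides it, and a single instance can still override it locally:

```python
from lxml import etree

# Assumed SVRL namespace URI for the validation report.
SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
NSMAP = {"svrl": SVRL_NS}


class SvrlResultChecker(object):
    """Hypothetical stand-in for a validator class (not lxml's own)."""

    # Class-level default: only failed asserts count as validation errors.
    validation_errors = etree.XPath("//svrl:failed-assert", namespaces=NSMAP)

    def is_valid(self, svrl_report):
        # Looked up through the instance, so a subclass or a single
        # instance can replace the accessor.
        return not self.validation_errors(svrl_report)


class ReportsAreErrorsChecker(SvrlResultChecker):
    """Subtype that also treats successful reports as errors."""

    validation_errors = etree.XPath(
        "//svrl:failed-assert | //svrl:successful-report", namespaces=NSMAP)


# A tiny hand-written SVRL document containing one successful report.
report = etree.fromstring(
    '<svrl:schematron-output xmlns:svrl="%s">'
    '<svrl:successful-report/>'
    '</svrl:schematron-output>' % SVRL_NS)

default_result = SvrlResultChecker().is_valid(report)
strict_result = ReportsAreErrorsChecker().is_valid(report)

# Per-instance override, without defining a new class.
custom = SvrlResultChecker()
custom.validation_errors = ReportsAreErrorsChecker.validation_errors
custom_result = custom.is_valid(report)
```

With this layout the module-level global is no longer needed: the default stays in one place, and schema-specific policies become subclasses or per-instance settings.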
Stefan From jholg at gmx.de Mon Jan 18 12:48:42 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 18 Jan 2010 12:48:42 +0100 Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml In-Reply-To: <4B543F7B.8000803@behnel.de> References: <20100118105528.89660@gmx.net> <4B543F7B.8000803@behnel.de> Message-ID: <20100118114842.89660@gmx.net> Hi Stefan, > > This speaks for pulling the result accessor into the Schematron class, > probably as a class attribute that can be overridden on an instance level. > > > > The same might make sense for the iso-schematron implementation xsl > transformation steps. > > Sounds like a much better interface. Any interesting global options would > be better overridden by subtyping the validator class, so class attributes > make sense to me. I'll change that, then. Do you prefer me making the changes on the iso-schematron branch or directly in trunk? Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From stefan_ml at behnel.de Mon Jan 18 12:52:54 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Jan 2010 12:52:54 +0100 Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml In-Reply-To: <20100118114842.89660@gmx.net> References: <20100118105528.89660@gmx.net> <4B543F7B.8000803@behnel.de> <20100118114842.89660@gmx.net> Message-ID: <4B544B96.50803@behnel.de> jholg at gmx.de, 18.01.2010 12:48: >>> This speaks for pulling the result accessor into the Schematron class, >> probably as a class attribute that can be overridden on an instance level. >>> The same might make sense for the iso-schematron implementation xsl >> transformation steps. >> >> Sounds like a much better interface. Any interesting global options would >> be better overridden by subtyping the validator class, so class attributes >> make sense to me. > > I'll change that, then. 
Do you prefer me making the changes on the iso-schematron branch or directly in trunk?

From my POV, the branch is basically dead after the merge, so please change the trunk only.

Stefan

From animator333 at yahoo.com Fri Jan 22 10:02:12 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Fri, 22 Jan 2010 14:32:12 +0530 (IST)
Subject: [lxml-dev] lxml newbie objectify & subclassing
Message-ID: <11593.3219.qm@web94914.mail.in2.yahoo.com>

Hi,

This is my first post. I have used Python and XML before, but on a very small scale. This time we are developing a fairly large application where the primary data storage is based on XML. I have been reading the docs on the site and lxml is looking quite promising, especially the "objectify" module. Just to start with, here is the first question:

from lxml import etree
from lxml import objectify

class Attribute(objectify.ObjectifiedDataElement):
    """"""
    def __init__(self):
        objectify.ObjectifiedDataElement.__init__(self)
        self.set("datatype", "")
        self.set("range", "0.,1.")

    def asXml(self):
        return etree.tostring(self, method="xml", pretty_print=True)

#-------------------------------------------------------------------------------

class FloatAttribute(Attribute):
    """"""
    def __init__(self, tag="float", value=0.):
        Attribute.__init__(self)
        self.tag = tag
        self.set("datatype", "float")
        etree.SubElement(self, "float").text = str(value)

f = FloatAttribute()
print f.asXml()

The above code crashes. What's wrong here?

Prashant

Python 2.6.2
wxPython 2.8.10.1
lxml 2.2.4
XP 32

From animator333 at yahoo.com Fri Jan 22 10:14:38 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Fri, 22 Jan 2010 14:44:38 +0530 (IST)
Subject: [lxml-dev] lxml newbie xpath as connections
Message-ID: <169717.40674.qm@web94904.mail.in2.yahoo.com>

Hi,

This is regarding XPath and making some virtual connections in an XML file.
txt=""" 0.1 0.2 0.3 0.4 0.5 0.6 0.1 0.2 0.3 0.4 0.5 0.6 /node/emission/color[0]/red /node/diffuse/color[1]/blue """ o = objectify.fromstring(txt) Considering the above example, once is parsed, I could get the value of connection>output using >> print o.connection.output but it's a text. Is it possible to define connection>output in such a way, that once parsed connection>output should refer to element it is pointing by path as text. If not at the time of parsing then using xpath later, I tried: >> r = o.xpath(str(o.connection.output)) >> r.pyval I am getting an empty list. How ever if I try: >> r = o.xpath("/node/emission/color/red") >> print r r is list containing two values for each "red" of "color". How do I precisely get: /node/emission/color[0]/red Thanks Prashant Python 2.6.2 wxPython 2.8.10.1 lxml 2.2.4 XP 32 The INTERNET now has a personality. YOURS! See your Yahoo! Homepage. http://in.yahoo.com/ From jholg at gmx.de Fri Jan 22 10:25:36 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 22 Jan 2010 10:25:36 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <11593.3219.qm@web94914.mail.in2.yahoo.com> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> Message-ID: <20100122092536.155750@gmx.net> > from lxml import etree > from lxml import objectify > > class Attribute(objectify.ObjectifiedDataElement): > """""" > def __init__(self): > objectify.ObjectifiedDataElement.__init__(self) > self.set("datatype", "") > self.set("range", "0.,1.") > > def asXml(self): > return etree.tostring(self, method="xml", pretty_print=True) > > #------------------------------------------------------------------------------- > > > class FloatAttribute(Attribute): > """""" > def __init__(self, tag="float", value=0.): > Attribute.__init__(self) > self.tag = tag > self.set("datatype", "float") > etree.SubElement(self, "float").text = str(value) > > > f = FloatAttribute() > print f.asXml() > > The above code crashes. What's wrong here? 
Please take a look at
http://codespeak.net/lxml/element_classes.html
especially:
http://codespeak.net/lxml/element_classes.html#element-initialization

Note that Elements get instantiated through the Element()/DataElement() factory functions.

Holger

From jholg at gmx.de Fri Jan 22 10:29:39 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 22 Jan 2010 10:29:39 +0100
Subject: [lxml-dev] lxml newbie xpath as connections
In-Reply-To: <169717.40674.qm@web94904.mail.in2.yahoo.com>
References: <169717.40674.qm@web94904.mail.in2.yahoo.com>
Message-ID: <20100122092939.155740@gmx.net>

> I am getting an empty list. How ever if I try:
> >> r = o.xpath("/node/emission/color/red")
> >> print r
>
> r is list containing two values for each "red" of "color". How do I
> precisely get:
>
> /node/emission/color[0]/red
>

Please read up on XPath. Indexing starts at 1 in XPath, so use

/node/emission/color[1]/red

Holger
--
Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher!
http://portal.gmx.net/de/go/atbrowser From stefan_ml at behnel.de Fri Jan 22 12:31:37 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jan 2010 12:31:37 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <20100122092536.155750@gmx.net> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> <20100122092536.155750@gmx.net> Message-ID: <4B598C99.5030202@behnel.de> jholg at gmx.de, 22.01.2010 10:25: > >> from lxml import etree >> from lxml import objectify >> >> class Attribute(objectify.ObjectifiedDataElement): >> """""" >> def __init__(self): >> objectify.ObjectifiedDataElement.__init__(self) >> self.set("datatype", "") >> self.set("range", "0.,1.") >> >> def asXml(self): >> return etree.tostring(self, method="xml", pretty_print=True) >> >> #------------------------------------------------------------------------------- >> >> >> class FloatAttribute(Attribute): >> """""" >> def __init__(self, tag="float", value=0.): >> Attribute.__init__(self) >> self.tag = tag >> self.set("datatype", "float") >> etree.SubElement(self, "float").text = str(value) >> >> >> f = FloatAttribute() >> print f.asXml() >> >> The above code crashes. What's wrong here? > > Please take a look at > http://codespeak.net/lxml/element_classes.html > especially: > http://codespeak.net/lxml/element_classes.html#element-initialization > > Note that Elements get instantiated through the Element()/DataElement factory functions. This is true in general, however, the above works in lxml.etree and should also work in lxml.objectify (and definitely shouldn't just crash). 
Stefan From stefan_ml at behnel.de Fri Jan 22 12:36:19 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jan 2010 12:36:19 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <11593.3219.qm@web94914.mail.in2.yahoo.com> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> Message-ID: <4B598DB3.20800@behnel.de> Hi, Prashant Saxena, 22.01.2010 10:02: > This is my first post. I have used python and xml earlier but on a very small scale. This time we are developing a fairly large > application where primary data storage is based on xml. I have been reading the docs on the site and lxml is looking quite > promising, specially the "objectify" module. Just to start with here is the first question: > > from lxml import etree > from lxml import objectify > > class Attribute(objectify.ObjectifiedDataElement): > """""" > def __init__(self): > objectify.ObjectifiedDataElement.__init__(self) > self.set("datatype", "") > self.set("range", "0.,1.") > > def asXml(self): > return etree.tostring(self, method="xml", pretty_print=True) > > #------------------------------------------------------------------------------- > > class FloatAttribute(Attribute): > """""" > def __init__(self, tag="float", value=0.): > Attribute.__init__(self) > self.tag = tag > self.set("datatype", "float") > etree.SubElement(self, "float").text = str(value) > > > f = FloatAttribute() > print f.asXml() > > The above code crashes. What's wrong here? > > Prashant > Python 2.6.2 > wxPython 2.8.10.1 > lxml 2.2.4 > XP 32 Thanks for the report, I can reproduce this. There seems to be an unexpected interaction between the __init__ method of ElementBase and the API of ObjectifiedElement. This needs to be made more robust (at least). Could you open a ticket in the bug tracker? A work-around (and actually the expected usage) is to instantiate the elements through the factories, as Holger suggested. 
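
For reference, a small sketch of the factory-based work-around mentioned above, assuming a node that carries a single float-valued child. The tag names 'node' and 'red' and the attribute values are borrowed from the examples in this thread and are otherwise arbitrary:

```python
from lxml import etree, objectify

# Create the container through the factory instead of calling a class.
node = objectify.Element("node")

# Assigning a Python float creates a typed child element named 'red'.
node.red = 0.231

# Extra metadata is stored as ordinary XML attributes on that child.
node.red.set("datatype", "float")
node.red.set("range", "0.,1.")

# objectify converts the text content back to a Python value on access.
red_value = node.red.pyval

# A stand-alone data element can be built with the DataElement() factory.
standalone = objectify.DataElement(0.5, _pytype="float")

print(etree.tostring(node, pretty_print=True))
```

Since the value comes back as a Python float through '.pyval', no manual string conversion is needed when hooking the data up to other code.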
Thanks, Stefan From stefan_ml at behnel.de Fri Jan 22 14:14:36 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jan 2010 14:14:36 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <101858.8586.qm@web94906.mail.in2.yahoo.com> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> <4B598DB3.20800@behnel.de> <101858.8586.qm@web94906.mail.in2.yahoo.com> Message-ID: <4B59A4BC.7080901@behnel.de> Prashant Saxena, 22.01.2010 13:53: > The bug is reported in bug tracker. > BTW, the code below is not working because of a bug or this feature is not been implemented? It's a bug because of a feature that is implemented in lxml.etree but not in lxml.objectify, so it's a bit of both. ;) > As of now, I would prefer to use as it is more pythonic & easy to implement, compare to methods explained > here: > http://codespeak.net/lxml/element_classes.html#element-initialization Not sure what you mean exactly. The ElementBase class is only meant for subtyping, not for direct usage. Could you provide some background about your use case? Stefan From jholg at gmx.de Fri Jan 22 15:06:41 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 22 Jan 2010 15:06:41 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <4B598C99.5030202@behnel.de> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> <20100122092536.155750@gmx.net> <4B598C99.5030202@behnel.de> Message-ID: <20100122140641.155740@gmx.net> Hi, > >> The above code crashes. What's wrong here? > > > > Please take a look at > > http://codespeak.net/lxml/element_classes.html > > especially: > > http://codespeak.net/lxml/element_classes.html#element-initialization > > > > Note that Elements get instantiated through the Element()/DataElement > factory functions. > This is true in general, however, the above works in lxml.etree and should > also work in lxml.objectify (and definitely shouldn't just crash). Looks like I'm missing something. 
What about " There is one thing to know up front. Element classes *must not* have an __init__ or __new__ method. " ? Has this requirement been relaxed? Holger -- Haiti-Nothilfe! Helfen Sie per SMS: Sende UIHAITI an die Nummer 81190. Von 5 Euro je SMS (zzgl. SMS-Geb?hr) gehen 4,83 Euro an UNICEF. From jholg at gmx.de Fri Jan 22 15:07:58 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 22 Jan 2010 15:07:58 +0100 Subject: [lxml-dev] lxml newbie objectify & subclassing In-Reply-To: <180543.6188.qm@web94913.mail.in2.yahoo.com> References: <11593.3219.qm@web94914.mail.in2.yahoo.com> <20100122092536.155750@gmx.net> <180543.6188.qm@web94913.mail.in2.yahoo.com> Message-ID: <20100122140758.155730@gmx.net> (cc-ing the list) -------- Original-Nachricht -------- > Datum: Fri, 22 Jan 2010 16:11:38 +0530 (IST) > Von: Prashant Saxena > An: jholg at gmx.de > Betreff: Re: [lxml-dev] lxml newbie objectify & subclassing > Thanks for a quick reply. > > > Why I need sub classing is because the application needs various types of > custom data types(elements), such as vector, matrix, > color etc. A collections of these data types is a node. These attributes > are created from custom xml files at run time, stored in a node, value of an > element is changed/edited using front end gui and then node is saved to > disk(xml format). You can again load the node from disk create all the > attribute at run time. > > Considering above scenario, the simplest attribute that represents a float > element is created from class. > > Instead of using , If I use > , there are no errors and code is working fine. > > The only draw back is that I have to convert string to pydata types myself > to hook with the gui, which is not difficult at all. > > If you do have some suggestions then please let me know. 
> > Thanks > > Prashant > > > > > ----- Original Message ---- > From: "jholg at gmx.de" > To: Prashant Saxena ; lxml-dev at codespeak.net > Sent: Fri, 22 January, 2010 2:55:36 PM > Subject: Re: [lxml-dev] lxml newbie objectify & subclassing > > > > > from lxml import etree > > from lxml import objectify > > > > class Attribute(objectify.ObjectifiedDataElement): > > """""" > > def __init__(self): > > objectify.ObjectifiedDataElement.__init__(self) > > self.set("datatype", "") > > self.set("range", "0.,1.") > > > > def asXml(self): > > return etree.tostring(self, method="xml", pretty_print=True) > > > > > #------------------------------------------------------------------------------- > > > > > > class FloatAttribute(Attribute): > > """""" > > def __init__(self, tag="float", value=0.): > > Attribute.__init__(self) > > self.tag = tag > > self.set("datatype", "float") > > etree.SubElement(self, "float").text = str(value) > > > > > > f = FloatAttribute() > > print f.asXml() > > > > The above code crashes. What's wrong here? > > Please take a look at > http://codespeak.net/lxml/element_classes.html > especially: > http://codespeak.net/lxml/element_classes.html#element-initialization > > Note that Elements get instantiated through the Element()/DataElement > factory functions. > > Holger > > -- > Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 > - > sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser > > > > The INTERNET now has a personality. YOURS! See your Yahoo! Homepage. > http://in.yahoo.com/ -- Haiti-Nothilfe! Helfen Sie per SMS: Sende UIHAITI an die Nummer 81190. Von 5 Euro je SMS (zzgl. SMS-Geb?hr) gehen 4,83 Euro an UNICEF. 
From stefan_ml at behnel.de Fri Jan 22 15:41:59 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 15:41:59 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <20100122140641.155740@gmx.net>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com> <20100122092536.155750@gmx.net> <4B598C99.5030202@behnel.de> <20100122140641.155740@gmx.net>
Message-ID: <4B59B937.7030703@behnel.de>

jholg at gmx.de, 22.01.2010 15:06:
>>>> The above code crashes. What's wrong here?
>>> Please take a look at
>>> http://codespeak.net/lxml/element_classes.html
>>> especially:
>>> http://codespeak.net/lxml/element_classes.html#element-initialization
>>>
>>> Note that Elements get instantiated through the Element()/DataElement
>>> factory functions.
>>
>> This is true in general, however, the above works in lxml.etree and should
>> also work in lxml.objectify (and definitely shouldn't just crash).
>
> Looks like I'm missing something. What about
> "
> There is one thing to know up front. Element classes *must not* have an __init__ or __new__ method.
> "
> ? Has this requirement been relaxed?

Sort of. It works if you call the __init__ method of the superclass first thing in your subtype's __init__, and it will actually do The Right Thing, i.e. create a new element. It will still not get called when lxml.etree instantiates an element proxy behind the scenes, so the warning is still true in the sense that __init__ may not have been called on an Element proxy when lxml.etree returns it.

As long as you only change the XML element in __init__ (as in the code that Prashant presented) and do not keep any local state in the class, you're fine, though.

However, it crashes in this specific case because the __init__ method in ElementBase, as currently implemented, accesses the object before it is initialised, so this *is* a bug. I'm not sure how to fix this yet, but I'm considering adding more safety checks to the API in general. I'll see.
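
As an illustration of the rule described above (superclass __init__ first, all state kept in the XML), here is a hedged sketch for plain lxml.etree. The class name and attribute values are adapted from Prashant's example. The 'TAG' class attribute is assumed to be supported by the installed lxml version, and the objectify case that crashes in this thread is deliberately not covered:

```python
from lxml import etree


class FloatAttribute(etree.ElementBase):
    # Tag name used when the class is instantiated directly
    # (assumes an lxml version that honours the TAG class attribute).
    TAG = "float"

    def __init__(self, value=0.0):
        # Call the superclass first: this creates the underlying element.
        etree.ElementBase.__init__(self)
        # Only modify the XML itself; do not keep state on the instance,
        # since proxies that lxml creates later never run this method.
        self.set("datatype", "float")
        self.set("range", "0.,1.")
        self.text = str(value)

    def as_xml(self):
        return etree.tostring(self, pretty_print=True)


attr = FloatAttribute(0.5)
print(attr.as_xml())
```

Everything the class needs later (datatype, range, value) can be read back from the element, which is why it does not matter that __init__ is skipped when lxml re-creates a proxy for the same node.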
Stefan From stefan_ml at behnel.de Fri Jan 22 23:00:19 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jan 2010 23:00:19 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de> Message-ID: <4B5A1FF3.7040703@behnel.de> Marat Dakota, 12.01.2010 14:05: >> Thanks a lot, it's looks reasonable at first glance and I'll take a closer >> look as soon as I get to it. If it works well, it should make it into 2.3. > > Is there a roadmap date for 2.3 release? Not yet, no. >> Could you add a couple of tests to src/lxml/tests/test_xslt.py? That would >> help in making sure that it keeps working as expected even if I find that I >> need to rework the patch. >> > > I've added tests, I've also renamed variables to fit your code better and > added possibility to evaluate extension element's content directly to > _AppendOnlyElementProxy as well as to _Element. It looks like I'm satisfied > with the code now. I wonder what will you say about it. Hmm, and did you *run* the tests? The test code actually contains obvious errors (such as non well-formed XML), so I wonder how you tested it at all. After fixing the tests, they even crash on my machine. So, sorry, but this patch isn't in an acceptable state. Could you please open up a ticket on launchpad for this? That would make it easier to track the progress of this patch. Stefan From animator333 at yahoo.com Sat Jan 23 16:20:28 2010 From: animator333 at yahoo.com (Prashant Saxena) Date: Sat, 23 Jan 2010 20:50:28 +0530 (IST) Subject: [lxml-dev] pretty_print with tail Message-ID: <80093.83351.qm@web94909.mail.in2.yahoo.com> Hi, A test example prints: 0.231kFloat0.326kFloat0.921kFloatkColor I am using: etree.tostring(root, method="xml", pretty_print=True) May be because every element/children has a tail text. 
Is it possible to format the output in this way: 0.231kFloat 0.326kFloat 0.921kFloat kColor Thanks Prashant Python 2.6.2 lxml 2.2.4 wxPython 2.8.10.1 XP 32 The INTERNET now has a personality. YOURS! See your Yahoo! Homepage. http://in.yahoo.com/ From stefan_ml at behnel.de Sat Jan 23 17:53:50 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 23 Jan 2010 17:53:50 +0100 Subject: [lxml-dev] pretty_print with tail In-Reply-To: <80093.83351.qm@web94909.mail.in2.yahoo.com> References: <80093.83351.qm@web94909.mail.in2.yahoo.com> Message-ID: <4B5B299E.2080403@behnel.de> Prashant Saxena, 23.01.2010 16:20: > A test example prints: > 0.231kFloat0.326kFloat0.921kFloatkColor > > I am using: > etree.tostring(root, method="xml", pretty_print=True) > > May be because every element/children has a tail text. > > Is it possible to format the output in this way: > > > > 0.231kFloat > 0.326kFloat > 0.921kFloat > kColor > http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output Stefan From animator333 at yahoo.com Sat Jan 23 18:56:20 2010 From: animator333 at yahoo.com (Prashant Saxena) Date: Sat, 23 Jan 2010 23:26:20 +0530 (IST) Subject: [lxml-dev] pretty_print with tail In-Reply-To: <4B5B299E.2080403@behnel.de> References: <80093.83351.qm@web94909.mail.in2.yahoo.com> <4B5B299E.2080403@behnel.de> Message-ID: <246745.27582.qm@web94903.mail.in2.yahoo.com> Prashant Saxena, 23.01.2010 16:20: > A test example prints: > 0.231kFloat0.326kFloat0.921kFloatkColor > > I am using: > etree.tostring(root, method="xml", pretty_print=True) > > May be because every element/children has a tail text. > > Is it possible to format the output in this way: > > > > 0.231kFloat > 0.326kFloat > 0.921kFloat > kColor > http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output Parsing a xml file from disk which has written as above is not a problem and output is as is. I am interested in writing it to disk as above. 
from lxml import etree

root = etree.Element("emmision")
color = etree.SubElement(root, "color")
color.tail = "kColor"
red = etree.SubElement(color, "red")
green = etree.SubElement(color, "green")
blue = etree.SubElement(color, "blue")
red.text = "0.231"
green.text = "0.326"
blue.text = "0.291"
red.tail = "kFloat"
green.tail = "kFloat"
blue.tail = "kFloat"
print etree.tostring(root, method="xml", pretty_print=True)

This code prints everything on a single line and it's hard to read. Do I have to write a custom function to parse the string and print it as needed?

Prashant

Python 2.6.2
lxml 2.2.4

From animator333 at yahoo.com Sat Jan 23 19:01:43 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Sat, 23 Jan 2010 23:31:43 +0530 (IST)
Subject: [lxml-dev] conditional pretty_print
Message-ID: <333445.56710.qm@web94912.mail.in2.yahoo.com>

Hi,

In short: While printing,

1. Ignore *all* attributes(keys()) of every element.
2. Ignore *certain* attributes(keys()) of every element.
3. Ignore *certain* attributes(keys()) of element with *tag* .

Prashant

Python 2.6.2
lxml 2.2.4

From stefan_ml at behnel.de Sun Jan 24 12:37:06 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 24 Jan 2010 12:37:06 +0100
Subject: [lxml-dev] pretty_print with tail
In-Reply-To: <246745.27582.qm@web94903.mail.in2.yahoo.com>
References: <80093.83351.qm@web94909.mail.in2.yahoo.com> <4B5B299E.2080403@behnel.de> <246745.27582.qm@web94903.mail.in2.yahoo.com>
Message-ID: <4B5C30E2.8000800@behnel.de>

Prashant Saxena, 23.01.2010 18:56:
>
> Prashant Saxena, 23.01.2010 16:20:
>> A test example prints:
>> 0.231kFloat0.326kFloat0.921kFloatkColor
>>
>> I am using:
>> etree.tostring(root, method="xml", pretty_print=True)
>>
>> May be because every element/children has a tail text.
>> >> Is it possible to format the output in this way: >> >> >> >> 0.231kFloat >> 0.326kFloat >> 0.921kFloat >> kColor >> > > http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output > [...] > This code prints every thing in a single line & it's hard to read. Do I have to write a custom function to parse the string and > print as needed? See the last paragraph of the section I linked above. Stefan From stefan_ml at behnel.de Sun Jan 24 19:36:42 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 24 Jan 2010 19:36:42 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4B540BEC.5090900@behnel.de> References: <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de> <4B540BEC.5090900@behnel.de> Message-ID: <4B5C933A.4010803@behnel.de> Hi, Stefan Behnel, 18.01.2010 08:21: > Marat Dakota, 17.01.2010 21:12: >> I wonder if you've noticed my last letter with patch and questions... > > Sorry! Yes, I noticed it, but didn't have the time to reply at the time. I > haven't looked at it yet, but I definitely will. As I said, the last one > looked good already, so I'll see that I get it applied as soon as I get to it. I have committed an extended version of the patch to the trunk. Please review the new API to see if it works for you. https://codespeak.net/viewvc/?view=rev&revision=70799 Stefan From dakota at brokenpipe.ru Mon Jan 25 09:04:36 2010 From: dakota at brokenpipe.ru (Marat Dakota) Date: Mon, 25 Jan 2010 11:04:36 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4B5C933A.4010803@behnel.de> References: <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de> <4B540BEC.5090900@behnel.de> <4B5C933A.4010803@behnel.de> Message-ID: > > I have committed an extended version of the patch to the trunk. 
Please > review the new API to see if it works for you. > > https://codespeak.net/viewvc/?view=rev&revision=70799 Thanks so much! I'll celebrate it :) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100125/c49dfda6/attachment.htm From stefan_ml at behnel.de Mon Jan 25 12:49:29 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Jan 2010 12:49:29 +0100 Subject: [lxml-dev] conditional pretty_print In-Reply-To: <333445.56710.qm@web94912.mail.in2.yahoo.com> References: <333445.56710.qm@web94912.mail.in2.yahoo.com> Message-ID: <4B5D8549.4020300@behnel.de> Prashant Saxena, 23.01.2010 19:01: > In short: Too short, I guess. > While printing, > > 1. Ignore *all* attributes(keys()) of every element. > 2. Ignore *certain* attributes(keys()) of every element. > 3. Ignore *certain* attributes(keys()) of element with *tag* . If the above is intended to describe a custom serialisation scheme, I assume you want to use this scheme to serialise an XML tree, right? Two ways to do that: 1) strip all unwanted information from the tree before serialising 2) roll your own serialiser (IIRC, the ElementTree docs mention this somewhere) Which way is better for you is mostly dependent on whether you want an opt-in or opt-out solution, I guess. Stefan From optilude+lists at gmail.com Thu Jan 28 14:52:21 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Thu, 28 Jan 2010 21:52:21 +0800 Subject: [lxml-dev] Building an ESI tag with lxml Message-ID: Hi, I'm trying to use lxml to conditionally insert an tag into an HTML document. The document is parsed with the HTML parser and manipulated in various ways. 
At one point, I search for a node ('placeholder') and want to replace it with something that renders to: The code I used looks like this: nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} root.nsmap.update(nsmap) # root is the element esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', url) placeholder.addnext(esiNode) # placeholder is later removed There are two problems with this: - The xmlns:esi ends up on the tag instead of the HTML root. Varnish doesn't like this apparently. - The tag is not self-closing when rendered with the html.tostring (using etree.tostring is not really an option as other things are going on which want html rendering). Varnish likes this even less. Thus: What can I do to push the namespace declaration up to the top node ('root') and make the tag self-closing? Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Thu Jan 28 17:04:19 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 28 Jan 2010 17:04:19 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: Message-ID: <4B61B583.50003@behnel.de> Hi Martin, Martin Aspeli, 28.01.2010 14:52: > I'm trying to use lxml to conditionally insert an tag > into an HTML document. First problem: HTML is not namespace aware - namespaces in HTML are underdefined at best (and they certainly were not well defined back in 2001, when the ESI spec appeared). > The document is parsed with the HTML parser and manipulated in various > ways. At one point, I search for a node ('placeholder') and want to > replace it with something that renders to: > > > > The code I used looks like this: > > nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} > root.nsmap.update(nsmap) # root is the element Updating the nsmap property has no effect. I've updated the docstring appropriately. 
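To make the point about nsmap concrete, here is a minimal sketch (Python 3 syntax; the variable names are only illustrative and not taken from the code under discussion). It assumes nothing beyond lxml.etree: updating the dict returned by the nsmap property leaves the element untouched, while passing nsmap when the element is created does declare the prefix.

```python
from lxml import etree

ESI_NS = 'http://www.edge-delivery.org/esi/1.0'

root = etree.Element('html')
# This only mutates a temporary dict; the element itself is unchanged.
root.nsmap.update({'esi': ESI_NS})
has_esi_after_update = 'esi' in root.nsmap

# Declaring the prefix when the element is created does take effect.
declared = etree.Element('html', nsmap={'esi': ESI_NS})
serialised = etree.tostring(declared)
```

So the declaration has to be in place when the element is built; it cannot be patched in afterwards through the property.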
> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) > esiNode.set('src', url) > placeholder.addnext(esiNode) # placeholder is later removed > > There are two problems with this: > > - The xmlns:esi ends up on the tag instead of the HTML > root. Varnish doesn't like this apparently. As I said, namespaces in HTML... To move the namespace declaration to the top-level element, you can create a new 'html' root element that has it and move the nodes over, e.g. new_root = etree.Element('html', nsmap=nsmap) new_root[:] = root[:] # or copy.deepcopy(root)[:] I think it would be nice to allow an 'nsmap' parameter in the cleanup_namespaces() function. Its namespace declarations would then get added to the element it runs on before starting the cleanup process. That would be a 2.3 feature, though. I don't think adding support for changing 'el.nsmap' would be a good idea, as changing namespace prefixes is actually a rather non-trivial process. This should be requested explicitly at a well selected step in the code (usually just before serialisation, when prefixes become interesting). > - The tag is not self-closing when rendered with the > html.tostring (using etree.tostring is not really an option as other > things are going on which want html rendering). Varnish likes this even > less. Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't know that it's supposed to be self-closing. I think you only have three options here: 1) fix Varnish 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) 3) close the tag through byte string substitution *after* serialisation If you choose to go with 2), you may consider converting the stream back to plain HTML *after* processing the esi tags, using an additional parse-serialise cycle (or an external tool like xmllint or tidy). 
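For reference, the root-replacement idea from earlier in this message can be tried out roughly like this. It is only a sketch in Python 3 syntax: the tiny input page is made up for illustration, and the snippet deliberately ignores root attributes and the doctype, which would have to be carried over separately.

```python
from lxml import etree, html

nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}

# Hypothetical input page, parsed with the HTML parser.
root = html.fromstring(
    "<html><head><title>t</title></head><body><p>x</p></body></html>")

# Build a new root that carries the namespace declaration,
# then move (not copy) the children over to it.
new_root = etree.Element('html', nsmap=nsmap)
new_root[:] = root[:]

serialised = etree.tostring(new_root)
```

After the slice assignment the old root is empty, and the xmlns:esi declaration sits on the new top-level element.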
Stefan From optilude+lists at gmail.com Fri Jan 29 01:00:37 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 08:00:37 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B61B583.50003@behnel.de> References: <4B61B583.50003@behnel.de> Message-ID: Stefan Behnel wrote: > Hi Martin, > > Martin Aspeli, 28.01.2010 14:52: >> I'm trying to use lxml to conditionally insert an tag >> into an HTML document. > > First problem: HTML is not namespace aware - namespaces in HTML are > underdefined at best (and they certainly were not well defined back in > 2001, when the ESI spec appeared). > > >> The document is parsed with the HTML parser and manipulated in various >> ways. At one point, I search for a node ('placeholder') and want to >> replace it with something that renders to: >> >> >> >> The code I used looks like this: >> >> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} >> root.nsmap.update(nsmap) # root is the element > > Updating the nsmap property has no effect. I've updated the docstring > appropriately. Ok, thanks. >> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >> esiNode.set('src', url) >> placeholder.addnext(esiNode) # placeholder is later removed >> >> There are two problems with this: >> >> - The xmlns:esi ends up on the tag instead of the HTML >> root. Varnish doesn't like this apparently. > > As I said, namespaces in HTML... > > To move the namespace declaration to the top-level element, you can create > a new 'html' root element that has it and move the nodes over, e.g. > > new_root = etree.Element('html', nsmap=nsmap) > new_root[:] = root[:] # or copy.deepcopy(root)[:] How (in)efficient is this? > I think it would be nice to allow an 'nsmap' parameter in the > cleanup_namespaces() function. Its namespace declarations would then get > added to the element it runs on before starting the cleanup process. That > would be a 2.3 feature, though. Ok. 
> I don't think adding support for changing 'el.nsmap' would be a good idea, > as changing namespace prefixes is actually a rather non-trivial process. > This should be requested explicitly at a well selected step in the code > (usually just before serialisation, when prefixes become interesting). Agree. I'd be happy to pass something to the serialiser about namespaces. >> - The tag is not self-closing when rendered with the >> html.tostring (using etree.tostring is not really an option as other >> things are going on which want html rendering). Varnish likes this even >> less. > > Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't > know that it's supposed to be self-closing. I think you only have three > options here: Is there no way to make it aware of it? Seems this should be configurable (or monkey-patch-able) somewhere... > 1) fix Varnish > 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) It probably would. What's that look like? > 3) close the tag through byte string substitution *after* serialisation Yipes. If I do that, I'll just do the entire tag through such a substitution to be honest and not use lxml at all. > If you choose to go with 2), you may consider converting the stream back to > plain HTML *after* processing the esi tags, using an additional > parse-serialise cycle (or an external tool like xmllint or tidy). That sounds pretty bad for performance. :( Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. 
See http://martinaspeli.net/plone-book From l at lrowe.co.uk Fri Jan 29 02:47:14 2010 From: l at lrowe.co.uk (Laurence Rowe) Date: Fri, 29 Jan 2010 01:47:14 +0000 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> Message-ID: 2010/1/29 Martin Aspeli : > Stefan Behnel wrote: >> Hi Martin, >> >> Martin Aspeli, 28.01.2010 14:52: >>> I'm trying to use lxml to conditionally insert an ?tag >>> into an HTML document. >> >> First problem: HTML is not namespace aware - namespaces in HTML are >> underdefined at best (and they certainly were not well defined back in >> 2001, when the ESI spec appeared). >> >> >>> The document is parsed with the HTML parser and manipulated in various >>> ways. At one point, I search for a node ('placeholder') and want to >>> replace it with something that renders to: >>> >>> ? ? >>> >>> The code I used looks like this: >>> >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} >>> root.nsmap.update(nsmap) # root is the ?element >> >> Updating the nsmap property has no effect. I've updated the docstring >> appropriately. > > Ok, thanks. > >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >>> esiNode.set('src', url) >>> placeholder.addnext(esiNode) # placeholder is later removed >>> >>> There are two problems with this: >>> >>> ? ?- The xmlns:esi ends up on the ?tag instead of the HTML >>> root. Varnish doesn't like this apparently. >> >> As I said, namespaces in HTML... >> >> To move the namespace declaration to the top-level element, you can create >> a new 'html' root element that has it and move the nodes over, e.g. >> >> ? ? ?new_root = etree.Element('html', nsmap=nsmap) >> ? ? ?new_root[:] = root[:] # or copy.deepcopy(root)[:] > > How (in)efficient is this? > >> I think it would be nice to allow an 'nsmap' parameter in the >> cleanup_namespaces() function. 
Its namespace declarations would then get >> added to the element it runs on before starting the cleanup process. That >> would be a 2.3 feature, though. > > Ok. > >> I don't think adding support for changing 'el.nsmap' would be a good idea, >> as changing namespace prefixes is actually a rather non-trivial process. >> This should be requested explicitly at a well selected step in the code >> (usually just before serialisation, when prefixes become interesting). > > Agree. I'd be happy to pass something to the serialiser about namespaces. > >>> ? ?- The ?tag is not self-closing when rendered with the >>> html.tostring (using etree.tostring is not really an option as other >>> things are going on which want html rendering). Varnish likes this even >>> less. >> >> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't >> know that it's supposed to be self-closing. I think you only have three >> options here: > > Is there no way to make it aware of it? Seems this should be > configurable (or monkey-patch-able) somewhere... > >> 1) fix Varnish >> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) > > It probably would. What's that look like? > >> 3) close the tag through byte string substitution *after* serialisation > > Yipes. If I do that, I'll just do the entire tag through such a > substitution to be honest and not use lxml at all. > >> If you choose to go with 2), you may consider converting the stream back to >> plain HTML *after* processing the esi tags, using an additional >> parse-serialise cycle (or an external tool like xmllint or tidy). > > That sounds pretty bad for performance. :( > > Martin FWIW, the only way I've found to get good xhtml output from html parsing is with an xsl like the following... This triggers the xml output mode to produce valid xhtml. If et.docinfo.public_id and et.docinfo.system_url could be set somehow then I'm sure it would work without the transform. 
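The stylesheet itself did not survive in this archive. As a rough sketch of the general technique only (this is not the original stylesheet; the identity template, the choice of the XHTML 1.0 Transitional identifiers and all names below are assumptions), an xsl:output element with doctype-public and doctype-system can be combined with an identity transform, in Python 3 syntax:

```python
from lxml import etree, html

# Hypothetical stylesheet: copy everything through unchanged and let
# xsl:output attach an XHTML doctype to the XML-serialised result.
XHTML_OUTPUT_XSL = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"
      doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

to_xhtml = etree.XSLT(etree.XML(XHTML_OUTPUT_XSL))

# Made-up input, parsed with the HTML parser.
page = html.document_fromstring("<html><body><p>hi<br>there</p></body></html>")
output = str(to_xhtml(page))
```

Serialising the transform result with str() honours the xsl:output settings, so the doctype appears at the top of the output.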
(The relevant code is at the top of libxml2/xmlsave.c - basically so long as you have one of the xhtml public ids or system urls you'll get the right output). Laurence From optilude+lists at gmail.com Fri Jan 29 03:26:48 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 10:26:48 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B61B583.50003@behnel.de> References: <4B61B583.50003@behnel.de> Message-ID: Stefan Behnel wrote: > Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't > know that it's supposed to be self-closing. I think you only have three > options here: > > 1) fix Varnish > 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) > 3) close the tag through byte string substitution *after* serialisation > > If you choose to go with 2), you may consider converting the stream back to > plain HTML *after* processing the esi tags, using an additional > parse-serialise cycle (or an external tool like xmllint or tidy). I've just tried this with serialization using lxml.etree.tostring instead of lxml.html.tostring. Unfortunately, I'm still getting an open-close tag pair instead of a self-closed tag. Any idea what I may be doing wrong? esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) esiNode.set('src', tileHref) esiNode.text = None tilePlaceholderNode.addnext(esiNode) toRemove.append(tilePlaceholderNode) Output:

...

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book

From optilude+lists at gmail.com Fri Jan 29 04:38:40 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 11:38:40 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B61B583.50003@behnel.de>
References: <4B61B583.50003@behnel.de>
Message-ID:

Stefan Behnel wrote:

> To move the namespace declaration to the top-level element, you can create
> a new 'html' root element that has it and move the nodes over, e.g.
>
>     new_root = etree.Element('html', nsmap=nsmap)
>     new_root[:] = root[:] # or copy.deepcopy(root)[:]

I tried this (self-closing tag issue notwithstanding), like so:

    root = tree.getroot()
    nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
    newRoot = etree.Element('html', nsmap=nsmap)
    newRoot.attrib.update(root.attrib.items())
    newRoot[:] = copy.deepcopy(root)[:]
    tree._setroot(newRoot)

Unfortunately, I've now lost the doctype. :( The head of the page looks like:

Intriguingly, the tag now self-closes. :-) However, Firefox is showing an empty page.

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book

From stefan_ml at behnel.de Fri Jan 29 08:36:10 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 08:36:10 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de>
Message-ID: <4B628FEA.2030905@behnel.de>

Martin Aspeli, 29.01.2010 04:38:
> Stefan Behnel wrote:
>
>> To move the namespace declaration to the top-level element, you can create
>> a new 'html' root element that has it and move the nodes over, e.g.
>> >> new_root = etree.Element('html', nsmap=nsmap) >> new_root[:] = root[:] # or copy.deepcopy(root)[:] > > I tried this (self-closing tag issue notwithstanding), like so: > > root = tree.getroot() > nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} > newRoot = etree.Element('html', > nsmap=newRoot.attrib.update(root.attrib.items()) > newRoot[:] = copy.deepcopy(root)[:] > tree._setroot(newRoot) > > Unfortunately, I've now lost the doctype. :( The head of the page looks > like: > > xmlns="http://www.w3.org/1999/xhtml" lang="en"> > You can also create the element using the parser: newRoot = etree.XML(''' ''') Sadly, doctype setting isn't currently as easy as it could be... > Intriguingly, the tag now self-closes. :-) However, > Firefox is showing an empty page. May or may not be due to the missing doctype. Stefan From stefan_ml at behnel.de Fri Jan 29 08:47:27 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 08:47:27 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> Message-ID: <4B62928F.5050400@behnel.de> Martin Aspeli, 29.01.2010 03:26: > Stefan Behnel wrote: > >> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't >> know that it's supposed to be self-closing. I think you only have three >> options here: >> >> 1) fix Varnish >> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) >> 3) close the tag through byte string substitution *after* serialisation >> >> If you choose to go with 2), you may consider converting the stream back to >> plain HTML *after* processing the esi tags, using an additional >> parse-serialise cycle (or an external tool like xmllint or tidy). > > I've just tried this with serialization using lxml.etree.tostring > instead of lxml.html.tostring. Unfortunately, I'm still getting an > open-close tag pair instead of a self-closed tag. Any idea what I may be > doing wrong? 
> > esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) > esiNode.set('src', tileHref) > esiNode.text = None Setting the .text to None is redundant as this is a new element. Otherwise, doing that should be enough to erase all text content. > tilePlaceholderNode.addnext(esiNode) > toRemove.append(tilePlaceholderNode) I guess I would have used parent.replace(old,new) here. > Output: > >

... xmlns:esi="http://www.edge-delivery.org/esi/1.0" > src="http://...">

Works for me. Could you send me a complete code snippet where it doesn't work for you? Stefan From optilude+lists at gmail.com Fri Jan 29 09:15:50 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 16:15:50 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62928F.5050400@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> Message-ID: Stefan Behnel wrote: > Martin Aspeli, 29.01.2010 03:26: >> Stefan Behnel wrote: >> >>> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't >>> know that it's supposed to be self-closing. I think you only have three >>> options here: >>> >>> 1) fix Varnish >>> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) >>> 3) close the tag through byte string substitution *after* serialisation >>> >>> If you choose to go with 2), you may consider converting the stream back to >>> plain HTML *after* processing the esi tags, using an additional >>> parse-serialise cycle (or an external tool like xmllint or tidy). >> I've just tried this with serialization using lxml.etree.tostring >> instead of lxml.html.tostring. Unfortunately, I'm still getting an >> open-close tag pair instead of a self-closed tag. Any idea what I may be >> doing wrong? >> >> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >> esiNode.set('src', tileHref) >> esiNode.text = None > > Setting the .text to None is redundant as this is a new element. Otherwise, > doing that should be enough to erase all text content. It was an act of desperation. :) >> tilePlaceholderNode.addnext(esiNode) >> toRemove.append(tilePlaceholderNode) > > I guess I would have used parent.replace(old,new) here. I didn't do this for two reasons: 1. In some cases (though not here) I'm replacing one placeholder with multiple nodes. 2. This code appears within a loop that's manipulating the tree for each of multiple elements matched with an XPath expression. 
I thought deleting a node mid-iteration would cause problems. >> Output: >> >>

...> xmlns:esi="http://www.edge-delivery.org/esi/1.0" >> src="http://...">

> > Works for me. Could you send me a complete code snippet where it doesn't
> work for you?

How much work are you willing to put in? :-)

I can give you a Plone buildout that will set up everything and talk you through the steps to reproduce. It's not very hard, it just requires a few steps. I won't bother explaining it if you don't have half an hour to chase it down, though. :)

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book

From d.rothe at semantics.de Fri Jan 29 09:57:33 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Fri, 29 Jan 2010 09:57:33 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
Message-ID:

During XPath evaluation inside XSL transformations, it's possible to use functions from http://www.exslt.org/ (so the expression in [5] below would indeed match the element there). During plain XPath evaluation, however, it's only possible to use the standard XPath/XSLT functions. Is there a chance to enable the functions from exslt for lxml XPath evaluations as well?
========================================================= In [1]: from lxml import etree In [2]: from StringIO import StringIO In [3]: tree = etree.parse(StringIO('')) In [4]: print tree.xpath('/a[@b=concat("1","2")]') [] In [5]: print tree.xpath('/a[@b=str:split("12 34")]') [] ========================================================= --dirk From stefan_ml at behnel.de Fri Jan 29 09:59:14 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 09:59:14 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> Message-ID: <4B62A362.40502@behnel.de> Martin Aspeli, 29.01.2010 01:00: > Stefan Behnel wrote: >> Martin Aspeli, 28.01.2010 14:52: >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >>> esiNode.set('src', url) >>> placeholder.addnext(esiNode) # placeholder is later removed >>> >>> There are two problems with this: >>> >>> - The xmlns:esi ends up on the tag instead of the HTML >>> root. Varnish doesn't like this apparently. >> As I said, namespaces in HTML... >> >> To move the namespace declaration to the top-level element, you can create >> a new 'html' root element that has it and move the nodes over, e.g. >> >> new_root = etree.Element('html', nsmap=nsmap) >> new_root[:] = root[:] # or copy.deepcopy(root)[:] > > How (in)efficient is this? It's about linear in the number of elements in your tree, plus the number of direct children for the move operation. Maybe not the most efficient thing to do, but usually pretty fast. Certainly a lot faster than you could ever get your own hand-rolled serialiser in Python, for instance. You can compare the absolute numbers on this page: http://codespeak.net/lxml/performance.html#parsing-and-serialising http://codespeak.net/lxml/performance.html#merging-different-sources >> I don't think adding support for changing 'el.nsmap' would be a good idea, >> as changing namespace prefixes is actually a rather non-trivial process. 
>> This should be requested explicitly at a well selected step in the code >> (usually just before serialisation, when prefixes become interesting). > > Agree. I'd be happy to pass something to the serialiser about namespaces. I know, that's how ET's serialiser works. Can't work for lxml, though. The serialiser in libxml2 can only write out what is there. It could work for a doctype, though. Support for passing that verbatimly into the serialiser would be a nice feature. >>> - The tag is not self-closing when rendered with the >>> html.tostring (using etree.tostring is not really an option as other >>> things are going on which want html rendering). Varnish likes this even >>> less. >> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't >> know that it's supposed to be self-closing. I think you only have three >> options here: > > Is there no way to make it aware of it? Seems this should be > configurable (or monkey-patch-able) somewhere... Monkey-patching isn't all that easy in libxml2, though... Not that it can't work for C code, it's just not that portable - nor particularly safe... ;-) >> 1) fix Varnish >> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that) > > It probably would. What's that look like? Depends on your input. If it's HTML, there's an html_to_xhtml() function in lxml.html that can do the conversion for you. And the serialiser can always be chosen using the 'method' argument (that's basically the difference between lxml.etree.tostring() and lxml.html.tostring()). >> 3) close the tag through byte string substitution *after* serialisation > > Yipes. If I do that, I'll just do the entire tag through such a > substitution to be honest and not use lxml at all. It's rather safe, though. The exact string to replace would be ">", which won't appear that easily in your content. Doing the parsing and replacing manually on the input is a lot more fragile. 
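A minimal sketch of such a post-serialisation substitution, in Python 3 syntax. It assumes the serialiser emits the empty element as an open/close pair written exactly as `></esi:include>`; the helper name and the sample markup are made up for illustration.

```python
def self_close_esi(serialised: bytes) -> bytes:
    """Rewrite empty <esi:include ...></esi:include> pairs as self-closing tags.

    Assumes the element has no content, so the end of its start tag is
    immediately followed by the closing tag.
    """
    return serialised.replace(b"></esi:include>", b"/>")


# Hypothetical serialiser output containing one ESI include.
page = b'<div><esi:include src="http://example.com/tile"></esi:include></div>'
fixed = self_close_esi(page)
# fixed == b'<div><esi:include src="http://example.com/tile"/></div>'
```

Since the replacement only touches that one byte sequence, the rest of the serialised page passes through unchanged.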
>> If you choose to go with 2), you may consider converting the stream back to >> plain HTML *after* processing the esi tags, using an additional >> parse-serialise cycle (or an external tool like xmllint or tidy). > > That sounds pretty bad for performance. :( Don't underestimate the speed of a tool that was made for the job. Stefan From optilude+lists at gmail.com Fri Jan 29 10:10:54 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 17:10:54 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62928F.5050400@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> Message-ID: Stefan Behnel wrote: > Works for me. Could you send me a complete code snippet where it doesn't > work for you? Okay, here's a pseudo-doctest that illustrates the problem: First, we create a simple document. We use the HTML parser here, because we don't necessarily trust the input being 100% valid XHTML, even though the doctype says so. >>> from lxml import etree, html >>> doc = """\ ... ... ... ...

./target.html

... ... ... """ >>> inputTree = html.fromstring(doc) We are going to replace the tag with an tag. We find it via an XPath: >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]") >>> matched = list(placeholderXPath(inputTree)) >>> matchedNode = matched[0] We then create the ESI node. At this point, it's nice and self-closing. Note that we use the etree.tostring() method, since we want XHTML output. >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >>> esiNode.set('src', matchedNode.get('alt')) >>> print etree.tostring(esiNode) Now we connect it to the parent: >>> matchedNode.getparent().replace(matchedNode, esiNode) At this point it's all over: >>> print etree.tostring(esiNode) And sure enough: >>> print etree.tostring(inputTree)

It's also interesting to note that this suddenly has the xmlns declaration twice. Any ideas would be highly welcome. I've tried to play with different ways to construct the ESI tag, and different placements for the placeholder (e.g. outside the

tag), but it's all the same. It also doesn't seem to make any difference whether I parse with etree.fromstring() or html.fromstring() (in the real code I'm actually feeding an HTMLParser). As soon as I insert it into the parent tree, the tag stops self closing. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Fri Jan 29 10:18:52 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 10:18:52 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> Message-ID: <4B62A7FC.80607@behnel.de> Martin Aspeli, 29.01.2010 09:15: > Stefan Behnel wrote: >> Setting the .text to None is redundant as this is a new element. > > It was an act of desperation. :) I know. ;) >>> tilePlaceholderNode.addnext(esiNode) >>> toRemove.append(tilePlaceholderNode) >> I guess I would have used parent.replace(old,new) here. > > I didn't do this for two reasons: > > 1. In some cases (though not here) I'm replacing one placeholder with > multiple nodes. Another nice feature: support a sequence as replacement. :) Although that requirement is basically satisfied with slice replacements, so I guess that won't make it in for now. > 2. This code appears within a loop that's manipulating the tree for > each of multiple elements matched with an XPath expression. I thought > deleting a node mid-iteration would cause problems. XPath returns a list of nodes, so you are no longer iterating over the tree structure in this case. Ripping stuff out should be absolutely safe here. >>> Output: >>> >>>

...>> xmlns:esi="http://www.edge-delivery.org/esi/1.0" >>> src="http://...">

>> Works for me. Could you send me a complete code snippet where it doesn't >> work for you? > > How much work are you willing to put in? :-) > > I can give you a Plone buildout that will set up everything and talk you > through the steps to reproduce. LOL! :) "You know, I have this huge pile of code here, but it's really easy to set up and then all you have to do is a tiny bit of debugging. It's easy! It really is! I can't believe you don't want to feel the fun to try it!" Honestly, could you try to come up with a little example that injects namespaced XML content into a small HTML page, and that shows that the XML serialiser behaves unexpected? Shouldn't be hard to write... Stefan From optilude+lists at gmail.com Fri Jan 29 10:22:44 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 17:22:44 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62A7FC.80607@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62A7FC.80607@behnel.de> Message-ID: Stefan Behnel wrote: >> 1. In some cases (though not here) I'm replacing one placeholder with >> multiple nodes. > > Another nice feature: support a sequence as replacement. :) > > Although that requirement is basically satisfied with slice replacements, > so I guess that won't make it in for now. Could you elaborate an example? >> 2. This code appears within a loop that's manipulating the tree for >> each of multiple elements matched with an XPath expression. I thought >> deleting a node mid-iteration would cause problems. > > XPath returns a list of nodes, so you are no longer iterating over the tree > structure in this case. Ripping stuff out should be absolutely safe here. Cool! Less code. >>>> Output: >>>> >>>>

...>>> xmlns:esi="http://www.edge-delivery.org/esi/1.0" >>>> src="http://...">

>>> Works for me. Could you send me a complete code snippet where it doesn't >>> work for you? >> How much work are you willing to put in? :-) >> >> I can give you a Plone buildout that will set up everything and talk you >> through the steps to reproduce. > > LOL! :) > > "You know, I have this huge pile of code here, but it's really easy to set > up and then all you have to do is a tiny bit of debugging. It's easy! It > really is! I can't believe you don't want to feel the fun to try it!" That's why I asked. :-p > Honestly, could you try to come up with a little example that injects > namespaced XML content into a small HTML page, and that shows that the XML > serialiser behaves unexpected? Shouldn't be hard to write... See my other mail. I got a minimal example that's bombing out for me. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Fri Jan 29 11:08:56 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 11:08:56 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> Message-ID: <4B62B3B8.60206@behnel.de> Martin Aspeli, 29.01.2010 10:10: > here's a pseudo-doctest that illustrates the problem: > > First, we create a simple document. We use the HTML parser here, because > we don't necessarily trust the input being 100% valid XHTML, even though > the doctype says so. I think that's the main problem. If you parse XHTML using the HTML parser, you loose information due to the fact that namespaces are not well-defined for HTML. I'd *always* try with the XML parser first. However, according to your last comment, it seems you have tried the XML parser already... > >>> from lxml import etree, html > >>> doc = """\ > ... "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > ... 
xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> > ... > ...

alt="./target.html" />

> ... > ... > ... """ > >>> inputTree = html.fromstring(doc) > > We are going to replace the tag with an tag. We > find it via an XPath: > > >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', > normalize-space(@class), ' '), ' mceTile ')]") Perfect use case for lxml.cssselect. :) > >>> matched = list(placeholderXPath(inputTree)) As I said, XPath returns a *list*. Personally, I'd love to have it return an iterable, but libxml2 doesn't easily give you that. IIRC, there's some limited support for this (it works for certain patterns), but that would need some serious wrapping effort with non-trivial memory management. > >>> matchedNode = matched[0] > > We then create the ESI node. At this point, it's nice and self-closing. > Note that we use the etree.tostring() method, since we want XHTML output. > > >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} > >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) > >>> esiNode.set('src', matchedNode.get('alt')) > > >>> print etree.tostring(esiNode) > src="./target.html"/> Ok so far. > Now we connect it to the parent: > > >>> matchedNode.getparent().replace(matchedNode, esiNode) > > At this point it's all over: > > >>> print etree.tostring(esiNode) > src="./target.html"> Ah, this is because it is now part of an HTML document, so the HTML semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) that provided 'better' support for HTML serialisation by taking into account the document context. Looks like this strikes here. I just looked it up in the sources, recent 2.7.x versions of libxml2 have added a way to override this behaviour again, but lxml doesn't do this yet. IIRC, it wasn't trivial at the time - I think it required going through a different serialisation function or something. > And sure enough: > > >>> print etree.tostring(inputTree) > xmlns:esi="http://www.edge-delivery.org/esi/1.0" > xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" > xml:lang="en"> >

src="./target.html">

> > > It's also interesting to note that this suddenly has the xmlns > declaration twice. ... namespaces in HTML ... > Any ideas would be highly welcome. I've tried to play with different > ways to construct the ESI tag, and different placements for the > placeholder (e.g. outside the

tag), but it's all the same. It also > doesn't seem to make any difference whether I parse with > etree.fromstring() or html.fromstring() (in the real code I'm actually > feeding an HTMLParser). It *should* make a difference, but from your example I can see that it doesn't. No idea why. I'll have a closer look later. Stefan From stefan_ml at behnel.de Fri Jan 29 11:11:47 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 11:11:47 +0100 Subject: [lxml-dev] exslt functions in xpath expressions In-Reply-To: References: Message-ID: <4B62B463.4040701@behnel.de> Dirk Rothe, 29.01.2010 09:57: > During XPath Evaluations in XSL-Transformations it's possible to use Stuff > from http://www.exslt.org/ (so [5] does indeed match the element). > During XPath Evaluations its only possible to use standard XPath/XSLT > Functions. Is there a chance to enable the functions from exslt for lxml > XPath evaluations as well? > > ========================================================= > In [1]: from lxml import etree > > In [2]: from StringIO import StringIO > > In [3]: tree = etree.parse(StringIO('')) > > In [4]: print tree.xpath('/a[@b=concat("1","2")]') > [] > > In [5]: print tree.xpath('/a[@b=str:split("12 34")]') > [] > ========================================================= They should be enabled. But you have to specify the namespace of the function you use. http://codespeak.net/lxml/xpathxslt.html#xpath Stefan From d.rothe at semantics.de Fri Jan 29 11:29:30 2010 From: d.rothe at semantics.de (Dirk Rothe) Date: Fri, 29 Jan 2010 11:29:30 +0100 Subject: [lxml-dev] exslt functions in xpath expressions In-Reply-To: <4B62B463.4040701@behnel.de> References: <4B62B463.4040701@behnel.de> Message-ID: On Fri, 29 Jan 2010 11:11:47 +0100, Stefan Behnel wrote: > > Dirk Rothe, 29.01.2010 09:57: >> During XPath Evaluations in XSL-Transformations it's possible to use >> Stuff >> from http://www.exslt.org/ (so [5] does indeed match the element). 
>> During XPath Evaluations its only possible to use standard XPath/XSLT >> Functions. Is there a chance to enable the functions from exslt for lxml >> XPath evaluations as well? >> >> ========================================================= >> In [1]: from lxml import etree >> >> In [2]: from StringIO import StringIO >> >> In [3]: tree = etree.parse(StringIO('')) >> >> In [4]: print tree.xpath('/a[@b=concat("1","2")]') >> [] >> >> In [5]: print tree.xpath('/a[@b=str:split("12 34")]') >> [] >> ========================================================= > > They should be enabled. But you have to specify the namespace of the > function you use. > > http://codespeak.net/lxml/xpathxslt.html#xpath Ah, sorry. I should have checked this. --dirkse From optilude+lists at gmail.com Fri Jan 29 11:51:29 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 18:51:29 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62B3B8.60206@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> Message-ID: Stefan Behnel wrote: > Martin Aspeli, 29.01.2010 10:10: >> here's a pseudo-doctest that illustrates the problem: >> >> First, we create a simple document. We use the HTML parser here, because >> we don't necessarily trust the input being 100% valid XHTML, even though >> the doctype says so. > > I think that's the main problem. If you parse XHTML using the HTML parser, > you loose information due to the fact that namespaces are not well-defined > for HTML. I'd *always* try with the XML parser first. This code is being used in a post-processing step for output from Plone. Performance is important, so trial-and-error like this is probably undesirable. And even then, this would need to work for documents parsed with the HTML parser. The output being transformed could include not-quite-well-formed XHTML from content-managed pages. 
That's the attraction of lxml in the first place - it can deal with somewhat-crap output. ;) > However, according to your last comment, it seems you have tried the XML > parser already... I just re-confirmed it. If the whole thing is parsed with etree.fromstring (and lxml.html is not used anywhere) it still doesn't close. >> >>> from lxml import etree, html >> >>> doc = """\ >> ...> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> ...> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> >> ... >> ...

> alt="./target.html" />

>> ... >> ... >> ... """ >> >>> inputTree = html.fromstring(doc) >> >> We are going to replace the tag with an tag. We >> find it via an XPath: >> >> >>> placeholderXPath = etree.XPath("//img[contains(concat(' ', >> normalize-space(@class), ' '), ' mceTile ')]") > > Perfect use case for lxml.cssselect. :) Well, I got the XPath from css2xpath.appspot.com which uses the same algorithm I think. >> >>> matched = list(placeholderXPath(inputTree)) > > As I said, XPath returns a *list*. Cool, thanks. > Personally, I'd love to have it return an iterable, but libxml2 doesn't > easily give you that. IIRC, there's some limited support for this (it works > for certain patterns), but that would need some serious wrapping effort > with non-trivial memory management. Changing it in a future release may be risky if it returns a list now. >> >>> matchedNode = matched[0] >> >> We then create the ESI node. At this point, it's nice and self-closing. >> Note that we use the etree.tostring() method, since we want XHTML output. >> >> >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'} >> >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap) >> >>> esiNode.set('src', matchedNode.get('alt')) >> >> >>> print etree.tostring(esiNode) >> > src="./target.html"/> > > Ok so far. > > >> Now we connect it to the parent: >> >> >>> matchedNode.getparent().replace(matchedNode, esiNode) >> >> At this point it's all over: >> >> >>> print etree.tostring(esiNode) >> > src="./target.html"> > > Ah, this is because it is now part of an HTML document, so the HTML > semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) > that provided 'better' support for HTML serialisation by taking into > account the document context. Looks like this strikes here. > > I just looked it up in the sources, recent 2.7.x versions of libxml2 have > added a way to override this behaviour again, but lxml doesn't do this yet. 
> IIRC, it wasn't trivial at the time - I think it required going through a > different serialisation function or something. Makes sense, sorta, but I would've thought this was a matter for serialisation, not parsing? Even when parsing as HTML, I'm using etree.tostring() to serialise. >> And sure enough: >> >> >>> print etree.tostring(inputTree) >> > xmlns:esi="http://www.edge-delivery.org/esi/1.0" >> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" >> xml:lang="en"> >>

> src="./target.html">

>> >> >> It's also interesting to note that this suddenly has the xmlns >> declaration twice. > > ... namespaces in HTML ... Yeah, yeah. XHTML. ;-) >> Any ideas would be highly welcome. I've tried to play with different >> ways to construct the ESI tag, and different placements for the >> placeholder (e.g. outside the

tag), but it's all the same. It also >> doesn't seem to make any difference whether I parse with >> etree.fromstring() or html.fromstring() (in the real code I'm actually >> feeding an HTMLParser). > > It *should* make a difference, but from your example I can see that it > doesn't. No idea why. I'll have a closer look later. I appreciate it! Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Fri Jan 29 13:16:59 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 13:16:59 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> Message-ID: <4B62D1BB.1050408@behnel.de> Martin Aspeli, 29.01.2010 11:51: > Stefan Behnel wrote: >> Personally, I'd love to have it return an iterable, but libxml2 doesn't >> easily give you that. IIRC, there's some limited support for this (it works >> for certain patterns), but that would need some serious wrapping effort >> with non-trivial memory management. > > Changing it in a future release may be risky if it returns a list now. Obviously. It would rather become a new method on the XPath class, like xpath.iterfind(el). >>> ...>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >>> ...>> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> >>> ... >>> ...

>> alt="./target.html" />

>>> ... >>> ... >> [...] >>> >>> matchedNode.getparent().replace(matchedNode, esiNode) >>> >>> print etree.tostring(esiNode) >>> >> src="./target.html"> >> >> Ah, this is because it is now part of an HTML document, so the HTML >> semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?) >> that provided 'better' support for HTML serialisation by taking into >> account the document context. Looks like this strikes here. >> >> I just looked it up in the sources, recent 2.7.x versions of libxml2 have >> added a way to override this behaviour again, but lxml doesn't do this yet. >> IIRC, it wasn't trivial at the time - I think it required going through a >> different serialisation function or something. > > Makes sense, sorta, but I would've thought this was a matter for > serialisation, not parsing? Even when parsing as HTML, I'm using > etree.tostring() to serialise. I read through the libxml2 sources a bit more. It's not confusing HTML at all, it's even smarter than I thought. It looks at the *doctype* of the document that is being serialised and then applies special XHTML formatting rules. :o) http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 Stefan From optilude+lists at gmail.com Fri Jan 29 14:27:28 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 21:27:28 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62D1BB.1050408@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> Message-ID: Stefan Behnel wrote: > I read through the libxml2 sources a bit more. It's not confusing HTML at > all, it's even smarter than I thought. It looks at the *doctype* of the > document that is being serialised and then applies special XHTML formatting > rules. :o) But... XHTML says empty tags can self-close as far as I know. And even then, this is in a different namespace. 
> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 My C fu is weak. Any hints in there I'm missing? Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Fri Jan 29 15:32:42 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 15:32:42 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> Message-ID: <4B62F18A.2040105@behnel.de> Martin Aspeli, 29.01.2010 14:27: > Stefan Behnel wrote: > >> I read through the libxml2 sources a bit more. It's not confusing HTML at >> all, it's even smarter than I thought. It looks at the *doctype* of the >> document that is being serialised and then applies special XHTML formatting >> rules. :o) > > But... XHTML says empty tags can self-close as far as I know. Sure. I just pointed you to the code that formats the output. Given that the DOCTYPE plays the card here, you may also consider keeping the DOCTYPE out of the tree and prepending it after the serialisation. >> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 > > My C fu is weak. Any hints in there I'm missing? This, for example: http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451 The rule that bites you here is in line 1452. If the element uses a namespace prefix, it will not become self-closing. I have no idea about the reasoning behind such a rule, but if you are interested, I'd go straight to the libxml2 mailing list and ask. There's also line 1414 http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414 which emits a default namespace declaration for the XHTML namespace regardless of the existing declarations. Certainly space left for enhancements. IIRC, the XHTML formatting is rather new, may have been added in the 2.7 line. 
You'll have a good chance of being heard if you propose some sensible improvements to it. Stefan From mateusz-lists at ant.gliwice.pl Fri Jan 29 15:49:16 2010 From: mateusz-lists at ant.gliwice.pl (Mateusz Korniak) Date: Fri, 29 Jan 2010 15:49:16 +0100 Subject: [lxml-dev] Extending //r/text() Message-ID: <201001291549.16846.mateusz-lists@ant.gliwice.pl> Hi ! I am using lxml and find it really great ! Is it possible, along the lines of http://codespeak.net/lxml/extensions.html, to define extension functions that can be used like text() in //r/text() ? I have defined "testf": ns = lxml.etree.FunctionNamespace(None) ns['testf'] = testf but when running: res = test_root.xpath("//r[testf()]") res = test_root.xpath("//r/text()") res = test_root.xpath("//r/testf()") I get: lxml.etree.XPathEvalError: Invalid expression when executing "//r/testf()" Thanks in advance and regards ! -- Mateusz Korniak From optilude+lists at gmail.com Fri Jan 29 16:09:32 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 29 Jan 2010 23:09:32 +0800 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62F18A.2040105@behnel.de> References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> <4B62F18A.2040105@behnel.de> Message-ID: Stefan Behnel wrote: > Given that the DOCTYPE plays the card here, you may also consider keeping > the DOCTYPE out of the tree and prepending it after the serialisation. That's going to be pretty tricky, but I guess we can try. I wonder what the side-effects of this might be, though. Presumably, the DOCTYPE detection is there for a reason. >>> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137 >> My C fu is weak. Any hints in there I'm missing? > > This, for example: > > http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451 > > The rule that bites you here is in line 1452. If the element uses a > namespace prefix, it will not become self-closing.
I have no idea about the > reasoning behind such a rule, but if you are interested, I'd go straight to > the libxml2 mailing list and ask. > > There's also line 1414 > > http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414 > > which emits a default namespace declaration for the XHTML namespace > regardless of the existing declarations. Certainly space left for enhancements. > > IIRC, the XHTML formatting is rather new, may have been added in the 2.7 > line. You'll have a good chance of being heard if you propose some sensible > improvements to it. Good to know. I'm not sure I know how to formulate the needed changes except by re-stating the problem I'm having here, though. It'd probably help if I understood the purpose of the special formatting better. I naively thought that XHTML = XML and wouldn't need any magic. :) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Fri Jan 29 16:27:09 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 16:27:09 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: <4B62A362.40502@behnel.de> References: <4B61B583.50003@behnel.de> <4B62A362.40502@behnel.de> Message-ID: <4B62FE4D.7090504@behnel.de> [replying to myself] Stefan Behnel, 29.01.2010 09:59: > The serialiser in libxml2 can only write out what is there. > > It could work for a doctype, though. Support for passing that verbatim > into the serialiser would be a nice feature. There we go: https://codespeak.net/viewvc/?view=rev&revision=70976 >>> xml = '\n' >>> tree = etree.parse(StringIO(xml)) >>> print(etree.tostring(tree)) >>> print(etree.tostring(tree, ...
doctype='')) Stefan From stefan_ml at behnel.de Fri Jan 29 16:45:17 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 16:45:17 +0100 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> <4B62F18A.2040105@behnel.de> Message-ID: <4B63028D.2030603@behnel.de> Martin Aspeli, 29.01.2010 16:09: > Stefan Behnel wrote: >> Given that the DOCTYPE plays the card here, you may also consider keeping >> the DOCTYPE out of the tree and prepending it after the serialisation. > > That's going to be pretty tricky, but I guess we can try. > > I wonder what the side-effect of this may, though. Presumably, the > DOCTYPE detection is there for a reason. :) I think it's because I complained about one of the early 2.7.x versions breaking lxml's serialisation completely, so Daniel eventually added some "do what I mean" work-around to call the right functions in absence of a specific configuration (which lxml can't pass as the API it uses doesn't allow it ...) Don't expect everything in libxml2 to be well designed from the ground up. It was grown over years and has become a crucial part of the GNU/GNOME/... infrastructure. It naturally carries quite a bit of backwards compatibility with it, in both API and functionality. It certainly has its edges. Discussing new stuff to move it into the right directions is almost always worth it. >> IIRC, the XHTML formatting is rather new, may have been added in the 2.7 >> line. You'll have a good chance of being heard if you propose some sensible >> improvements to it. > > Good to know. I'm not sure I know how to formulate the needed changes > except by re-stating the problem I'm having here, though. It'd probably > help if I understood the purpose of the special formatting better. I > naively thought that XHTML = XML and wouldn't need any magic. :) I wouldn't call that naive. Just go and ask. 
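The suggestion made earlier in this thread, keeping the DOCTYPE out of the tree and prepending it after serialisation, might look roughly like the sketch below. It is only an illustration under stated assumptions: the input document is a made-up stand-in for the Plone output, the DOCTYPE is written out as the standard XHTML 1.0 Transitional declaration, and the input is assumed to be well-formed enough for the XML parser. Whether the inserted element then comes out self-closing depends on the libxml2 version in use and is not checked here.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
ESI_NS = "http://www.edge-delivery.org/esi/1.0"
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')

# Hypothetical input, deliberately parsed *without* its DOCTYPE so that the
# tree carries no DTD for the serialiser to inspect.
body = ('<html xmlns="%s"><body><div>'
        '<img class="mceTile" alt="./target.html"/>'
        '</div></body></html>' % XHTML_NS)
root = etree.fromstring(body)

# Swap the placeholder image for an esi:include element.
placeholder = root.find(".//{%s}img" % XHTML_NS)
esi_node = etree.Element("{%s}include" % ESI_NS, nsmap={"esi": ESI_NS})
esi_node.set("src", placeholder.get("alt"))
placeholder.getparent().replace(placeholder, esi_node)

# Serialise as plain XML, then put the DOCTYPE back by hand.
serialised = DOCTYPE + "\n" + etree.tostring(root, encoding="unicode")
print(serialised)
```

The trade-off is that the DOCTYPE has to be tracked separately from the tree, for example by remembering it per response, instead of travelling with the parsed document.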
Stefan From stefan_ml at behnel.de Fri Jan 29 16:52:01 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 29 Jan 2010 16:52:01 +0100 Subject: [lxml-dev] Extending //r/text() In-Reply-To: <201001291549.16846.mateusz-lists@ant.gliwice.pl> References: <201001291549.16846.mateusz-lists@ant.gliwice.pl> Message-ID: <4B630421.7090604@behnel.de> Mateusz Korniak, 29.01.2010 15:49: > I am using lxml and find it really great ! Happy to hear that. :) > Is it possible to extend like > http://codespeak.net/lxml/extensions.html > functions which are similar to > //r/text() ? > > Although I have defined "testf" > > ns = lxml.etree.FunctionNamespace(None) > ns['testf'] = testf > > from running: > > res = test_root.xpath("//r[testf()]") > res = test_root.xpath("//r/text()") > res = test_root.xpath("//r/testf()") > > I get: > lxml.etree.XPathEvalError: Invalid expression > > when executing "//r/testf()" So that's only for the last expression, right? It's different in that it doesn't have anything to match on. "text()" is a special function in XPath that matches any text node. This special property can't be replaced with an extension function. However, you didn't write anything about your use case. There may be other ways to do what you want, such as "myfunc( //r/node() )". Stefan From d.rothe at semantics.de Sat Jan 30 16:14:44 2010 From: d.rothe at semantics.de (Dirk Rothe) Date: Sat, 30 Jan 2010 16:14:44 +0100 Subject: [lxml-dev] exslt functions in xpath expressions In-Reply-To: References: <4B62B463.4040701@behnel.de> Message-ID: On Fri, 29 Jan 2010 11:29:30 +0100, Dirk Rothe wrote: > On Fri, 29 Jan 2010 11:11:47 +0100, Stefan Behnel > wrote: > >> >> Dirk Rothe, 29.01.2010 09:57: >>> During XPath Evaluations in XSL-Transformations it's possible to use >>> Stuff >>> from http://www.exslt.org/ (so [5] does indeed match the
element). >>> During XPath Evaluations its only possible to use standard XPath/XSLT >>> Functions. Is there a chance to enable the functions from exslt for >>> lxml >>> XPath evaluations as well? >>> >>> ========================================================= >>> In [1]: from lxml import etree >>> >>> In [2]: from StringIO import StringIO >>> >>> In [3]: tree = etree.parse(StringIO('')) >>> >>> In [4]: print tree.xpath('/a[@b=concat("1","2")]') >>> [] >>> >>> In [5]: print tree.xpath('/a[@b=str:split("12 34")]') >>> [] >>> ========================================================= >> >> They should be enabled. But you have to specify the namespace of the >> function you use. >> >> http://codespeak.net/lxml/xpathxslt.html#xpath > > Ah, sorry. I should have checked this. Hmm, but it's not working: ========================================================= In [9]: print tree.xpath("/a[@b=str:split('12 34')]", namespaces={'str': "http://exslt.org/strings"}) ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (23303, 0)) ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (24793, 0)) --------------------------------------------------------------------------- XPathEvalError Traceback (most recent call last) D:\vls-trunk\server\bin\ in () d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd in lxml.etree._ElementTree.xpath (sr c/lxml/lxml.etree.c:41699)() d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd in lxml.etree.XPathDocumentEvaluator .__call__ (src/lxml/lxml.etree.c:103472)() d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd in lxml.etree._XPathEvaluatorBase._h andle_result (src/lxml/lxml.etree.c:102330)() 
d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:102153)() XPathEvalError: Unregistered function From stefan_ml at behnel.de Sat Jan 30 17:10:25 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 30 Jan 2010 17:10:25 +0100 Subject: [lxml-dev] exslt functions in xpath expressions In-Reply-To: References: <4B62B463.4040701@behnel.de> Message-ID: <4B6459F1.9090600@behnel.de> Dirk Rothe, 30.01.2010 16:14: > In [9]: print tree.xpath("/a[@b=str:split('12 34')]", namespaces={'str': > "http://exslt.org/strings"}) >[...] > XPathEvalError: Unregistered function You're right, they are currently only available to XSLT. It seems that at least the date, math, sets and string functions can be enabled in plain XPath, but only from libxslt 1.1.25 onwards. That version was released on 2009-09-17, so it's fairly recent. http://xmlsoft.org/XSLT/EXSLT/html/libexslt-exslt.html Could you file a feature request for this in the bug tracker? I should be able to add support in lxml 2.3. Stefan From richardbp+lxml at gmail.com Sun Jan 31 05:06:56 2010 From: richardbp+lxml at gmail.com (Richard Baron Penman) Date: Sun, 31 Jan 2010 15:06:56 +1100 Subject: [lxml-dev] ElementTree 1.3a xpath position broken? Message-ID: hello, I am after xpath support for an application running on Google App Engine, which unfortunately rules out lxml. According to this document (http://effbot.org/zone/element-xpath.htm) the development version of ElementTree 1.3a has additional support for xpath, which covers my use cases. From my tests I found attributes and child nodes work: >>> from elementtree import ElementTree >>> tree = ElementTree.fromstring('') >>> print list(tree.findall('.//*[@class="test"]')) [] >>> print list(tree.findall('.//b[c]')) [] However tag positions appear to be broken: >>> print list(tree.findall('.//b[1]')) # should return b element [] Have I missed something?
Suggestions? regards, Richard From stefan_ml at behnel.de Sun Jan 31 07:44:09 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 31 Jan 2010 07:44:09 +0100 Subject: [lxml-dev] ElementTree 1.3a xpath position broken? In-Reply-To: References: Message-ID: <4B6526B9.9060007@behnel.de> Richard Baron Penman, 31.01.2010 05:06: > I am after xpath support for an application running on Google App Engine, > which unfortunately rules out lxml. Yeah, I know. That's one of the reasons I never found a use for the GAE on my side. That also makes your e-mail somewhat misplaced on this list. ;) > According to this document (http://effbot.org/zone/element-xpath.htm) the > development version of ElementTree 1.3a has additional support for xpath, Careful. It has *extended* the supported *subset* of XPath, compared to what ET 1.2 has. The ElementPath implementation was also completely rewritten and it's what lxml uses since 2.0. > which covers my use cases. Apparently not. You can look at the sources to see what it supports. It's really quite short and simple. http://svn.effbot.org/public/elementtree-1.3/elementtree/ElementPath.py You may also want to look for "xpath" on this page: http://effbot.org/zone/element-index.htm That should get you here: http://sourceforge.net/projects/pdis-xpath/ I never tried it, but it's been recently updated, so it looks like it's still maintained. > From my tests I found attributes and child nodes work: > >>>> from elementtree import ElementTree >>>> tree = ElementTree.fromstring(' class="test">') >>>> print list(tree.findall('.//*[@class="test"]')) > [] >>>> print list(tree.findall('.//b[c]')) > [] > > > However tag positions appear to be broken: >>>> print list(tree.findall('.//b[1]')) # should return b element > [] That shouldn't be hard to add.
You just have to make sure it only counts
elements within the same parent, so you may have to add the selector in
more than one place. I guess that's why Fredrik didn't add it while he was
at it.

Stefan

From richardbp+lxml at gmail.com Sun Jan 31 14:56:01 2010
From: richardbp+lxml at gmail.com (Richard Baron Penman)
Date: Mon, 1 Feb 2010 00:56:01 +1100
Subject: [lxml-dev] ElementTree 1.3a xpath position broken?
In-Reply-To: <4B6526B9.9060007@behnel.de>
References: <4B6526B9.9060007@behnel.de>
Message-ID:

hi Stefan,

thanks very much for your reply.

> > I am after xpath support for an application running on Google App Engine,
> > which unfortunately rules out lxml.
>
> Yeah, I know. That's one of the reasons I never found a use for the GAE on
> my side. That also makes your e-mail somewhat misplaced on this list. ;)
>

Hopefully the lxml feature request goes somewhere:
http://code.google.com/p/googleappengine/issues/detail?id=18

Can you recommend an alternative for discussing ElementTree?
I tried emailing Fredrik earlier but didn't get a response and the
ElementTree repository hasn't been committed to since 2007.

> http://sourceforge.net/projects/pdis-xpath/
>
> I never tried it, but it's been recently updated, so it looks like it's
> still maintained.
>

That project does look promising, however it doesn't yet support // or ..

> > However tag positions appear to be broken:
> >>>> print list(tree.findall('.//b[1]')) # should return b element
> > []
>
> That shouldn't be hard to add. You just have to make sure it only counts
> elements within the same parent, so you may have to add the selector in
> more than one place. I guess that's why Fredrik didn't add it while he was
> at it.
>

I found it was half implemented and finished it off.
There is some elegant code in ElementPath.py but it needs refactoring...

Richard
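
For reference, the per-parent counting that Stefan describes for a
positional step like './/b[1]' can be sketched roughly as follows. This is
only an illustration of the idea, not the ElementPath code and not
Richard's patch. It assumes the standard-library xml.etree.ElementTree
(Python 2.7 or later, for iter()); the helper name nth_child_by_tag and
the sample document are made up for the example.

```python
from xml.etree import ElementTree


def nth_child_by_tag(root, tag, position):
    """Collect, for every element under root (root included), its
    position-th child named `tag`. Positions are 1-based, as in XPath.

    Hypothetical helper for illustration only.
    """
    matches = []
    for parent in root.iter():
        # Count only the children of this one parent that carry the tag,
        # so the position restarts for every parent.
        same_tag = [child for child in parent if child.tag == tag]
        if len(same_tag) >= position:
            matches.append(same_tag[position - 1])
    return matches


# Made-up sample document: two <b> children under <a>, one under <c>.
root = ElementTree.fromstring('<a><b id="1"/><b id="2"/><c><b id="3"/></c></a>')
first_bs = nth_child_by_tag(root, 'b', 1)
ids = [b.get('id') for b in first_bs]
```

With this sample, the first b of each parent is selected, so the second b
under a is skipped while the b under c is kept. A naive count over all
descendants would instead return just one element, which is the pitfall
Stefan points out.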