From faassen at startifact.com Tue Jun 3 20:30:07 2008
From: faassen at startifact.com (Martijn Faassen)
Date: Tue, 03 Jun 2008 20:30:07 +0200
Subject: [lxml-dev] segfault when using etree.CustomElementClassLookup
Message-ID:

Hi there,

I just ran into a segfault with lxml (2.0.6). The problem is as follows:

from lxml import etree

class Lookup(etree.CustomElementClassLookup):
    def __init__(self):
        pass

    def lookup(self, node_type, document, namespace, name):
        return Foo

class Foo(etree.ElementBase):
    def custom(self):
        return "test"

lookup = Lookup()
parser = etree.XMLParser()
parser.setElementClassLookup(lookup)
root = etree.XML('', parser) # crash!

If I leave out the custom __init__ in Lookup, things won't crash.

Regards,

Martijn

From ultrokevjr at gmail.com Wed Jun 4 11:23:01 2008
From: ultrokevjr at gmail.com (Kevin JR)
Date: Wed, 4 Jun 2008 17:23:01 +0800
Subject: [lxml-dev] svn version failed to compiled against hg version of cython
Message-ID: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com>

libxslt-1.1.24
libxml2-2.6.32
python-2.5.2
cython-hg(482)

the error message:

$ python setup.py build
Building lxml version 2.1.beta3-55506.
Building with Cython 0.9.6.14.
Using build configuration of libxslt 1.1.24
Building against libxml2/libxslt in the following directory: /usr/lib
running build
running build_py
running build_ext
cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c

Error converting Pyrex file to C:
------------------------------------------------------------
...
        c_attr = c_attr.next
    return attributes

cdef object __RE_XML_ENCODING
__RE_XML_ENCODING = re.compile(
    ur'^(\s*<\?\s*xml[^>]+)\s+encoding\s*=\s*"[^"]*"\s*', re.U)
      ^
------------------------------------------------------------

/dev/shm/python-lxml/src/lxml-build/src/lxml/apihelpers.pxi:487:6: Expected ')'
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080604/8727135a/attachment.htm From stefan_ml at behnel.de Wed Jun 4 12:36:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 4 Jun 2008 12:36:12 +0200 (CEST) Subject: [lxml-dev] svn version failed to compiled against hg version of cython In-Reply-To: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com> References: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com> Message-ID: <24598.194.114.62.69.1212575772.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Kevin JR wrote: > libxslt-1.1.24 > libxml2-2.6.32 > python-2.5.2 > cython-hg(482) > > the error message: > > $ python setup.py build > Building lxml version 2.1.beta3-55506. > Building with Cython 0.9.6.14. > Using build configuration of libxslt 1.1.24 > Building against libxml2/libxslt in the following directory: /usr/lib > running build > running build_py > running build_ext > cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c > > Error converting Pyrex file to C: > ------------------------------------------------------------ > ... > c_attr = c_attr.next > return attributes > > cdef object __RE_XML_ENCODING > __RE_XML_ENCODING = re.compile( > ur'^(\s*<\?\s*xml[^>]+)\s+encoding\s*=\s*"[^"]*"\s*', re.U) > ^ > ------------------------------------------------------------ > > /dev/shm/python-lxml/src/lxml-build/src/lxml/apihelpers.pxi:487:6: > Expected > ')' I guess you are actually using an older Cython version, likely installed with easy_install. The version number in current hg wasn't increased yet. 
Stefan From ultrokevjr at gmail.com Wed Jun 4 12:44:16 2008 From: ultrokevjr at gmail.com (Kevin JR) Date: Wed, 4 Jun 2008 18:44:16 +0800 Subject: [lxml-dev] svn version failed to compiled against hg version of cython In-Reply-To: <24598.194.114.62.69.1212575772.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com> <24598.194.114.62.69.1212575772.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <3f0e729b0806040344lb60c9f7qbf0fdbffca90db1@mail.gmail.com> On Wed, Jun 4, 2008 at 6:36 PM, Stefan Behnel wrote: > Kevin JR wrote: > > libxslt-1.1.24 > > libxml2-2.6.32 > > python-2.5.2 > > cython-hg(482) > > > > the error message: > > > > $ python setup.py build > > Building lxml version 2.1.beta3-55506. > > Building with Cython 0.9.6.14. > > Using build configuration of libxslt 1.1.24 > > Building against libxml2/libxslt in the following directory: /usr/lib > > running build > > running build_py > > running build_ext > > cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c > > > > Error converting Pyrex file to C: > > ------------------------------------------------------------ > > ... > > c_attr = c_attr.next > > return attributes > > > > cdef object __RE_XML_ENCODING > > __RE_XML_ENCODING = re.compile( > > ur'^(\s*<\?\s*xml[^>]+)\s+encoding\s*=\s*"[^"]*"\s*', re.U) > > ^ > > ------------------------------------------------------------ > > > > /dev/shm/python-lxml/src/lxml-build/src/lxml/apihelpers.pxi:487:6: > > Expected > > ')' > > I guess you are actually using an older Cython version, likely installed > with easy_install. The version number in current hg wasn't increased yet. > > no, it's compiled from hg pool. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080604/027dbd4b/attachment.htm From stefan_ml at behnel.de Wed Jun 4 14:38:30 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jun 2008 14:38:30 +0200 Subject: [lxml-dev] segfault when using etree.CustomElementClassLookup In-Reply-To: References: Message-ID: <48468CC6.4050804@behnel.de> Hi Martijn, Martijn Faassen wrote: > I just ran into a segfault with lxml (2.0.6). The problem is as follows: > > from lxml import etree > > class Lookup(etree.CustomElementClassLookup): > def __init__(self): > pass Yep, you didn't call the __init__() method of the super class here, so the internal lookup function call isn't set up. I replaced that with a __cinit__() now that always sets it to the default lookup scheme, so that it won't segfault anymore even if people forget the obvious. ;) A patch is attached and it's generally easy to work around this by writing correct code, so there won't be a 2.0.7 right away. BTW, this: > parser.setElementClassLookup(lookup) is correctly spelled > parser.set_element_class_lookup(lookup) since lxml 2.0, following PEP 8 naming conventions. However, I didn't dare to remove the original method, since I figured that it would break tons of code for no major reason. At least the examples should reflect the new name everywhere now, so maybe I can remove it in lxml 3.0. ;) Stefan -------------- next part -------------- A non-text attachment was scrubbed... 
Name: class-lookup-crash-fix.patch Type: text/x-patch Size: 3033 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080604/81a340c2/attachment.bin From stefan_ml at behnel.de Wed Jun 4 19:50:01 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jun 2008 19:50:01 +0200 Subject: [lxml-dev] svn version failed to compiled against hg version of cython In-Reply-To: <3f0e729b0806040344lb60c9f7qbf0fdbffca90db1@mail.gmail.com> References: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com> <24598.194.114.62.69.1212575772.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <3f0e729b0806040344lb60c9f7qbf0fdbffca90db1@mail.gmail.com> Message-ID: <4846D5C9.6090301@behnel.de> Hi, Kevin JR wrote: > On Wed, Jun 4, 2008 at 6:36 PM, Stefan Behnel wrote: >> I guess you are actually using an older Cython version, likely installed >> with easy_install. The version number in current hg wasn't increased yet. >> >> > no, it's compiled from hg pool. Believe me, you are not using a recent developer version of Cython. Maybe you have hg pulled from cython-release instead of cython-devel or whatever. Try moving your hg Cython directory out of the way and check if it still compiles. Stefan From faassen at startifact.com Wed Jun 4 22:01:24 2008 From: faassen at startifact.com (Martijn Faassen) Date: Wed, 4 Jun 2008 22:01:24 +0200 Subject: [lxml-dev] segfault when using etree.CustomElementClassLookup In-Reply-To: <48468CC6.4050804@behnel.de> References: <48468CC6.4050804@behnel.de> Message-ID: <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> Hey Stefan, On Wed, Jun 4, 2008 at 2:38 PM, Stefan Behnel wrote: > Martijn Faassen wrote: >> I just ran into a segfault with lxml (2.0.6). 
The problem is as follows:
>>
>> from lxml import etree
>>
>> class Lookup(etree.CustomElementClassLookup):
>>     def __init__(self):
>>         pass
>
> Yep, you didn't call the __init__() method of the super class here, so the
> internal lookup function call isn't set up.

Hm, I thought you were wrong, but you are right. I actually did have a
super call before I whittled it away to a minimal (too minimal!) test
case:

class Lookup(etree.CustomElementClassLookup):
    def __init__(self):
        super(etree.CustomElementClassLookup, self).__init__()

But I just realized that call was wrong, and should've been:

class Lookup(etree.CustomElementClassLookup):
    def __init__(self):
        super(Lookup, self).__init__()

that *does* work. :)

> I replaced that with a __cinit__()
> now that always sets it to the default lookup scheme, so that it won't
> segfault anymore even if people forget the obvious. ;)

> A patch is attached and it's generally easy to work around this by writing
> correct code, so there won't be a 2.0.7 right away.

Yes, that's fine, I could work around it anyway, and you're right it's
also a mistake for me. You don't expect a segfault even if you do it
wrong of course, but it's a corner case. My apologies for the mistaken
bug report!

> BTW, this:
>
>> parser.setElementClassLookup(lookup)
>
> is correctly spelled
>
>> parser.set_element_class_lookup(lookup)
>
> since lxml 2.0, following PEP 8 naming conventions. However, I didn't dare to
> remove the original method, since I figured that it would break tons of code
> for no major reason. At least the examples should reflect the new name
> everywhere now, so maybe I can remove it in lxml 3.0. ;)

The documentation on the website still has the camelCases when I read
it yesterday.
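
For reference, a minimal sketch of the working variant discussed in this thread. It assumes a current lxml and uses the PEP 8 method name; the '<a/>' input and the node_type check are illustrative additions, not part of the original report. super(X, self) starts the method search after X in the MRO, so naming the base class there skips that class's own initialiser, while naming the subclass runs it.

```python
from lxml import etree


class Foo(etree.ElementBase):
    def custom(self):
        return "test"


class Lookup(etree.CustomElementClassLookup):
    def __init__(self):
        # Name the subclass itself in super(), so that the initialiser
        # of CustomElementClassLookup actually runs.
        super(Lookup, self).__init__()

    def lookup(self, node_type, document, namespace, name):
        # Only elements get the custom class; returning None delegates
        # comments, PIs and entities to the fallback lookup.
        if node_type == 'element':
            return Foo
        return None


parser = etree.XMLParser()
parser.set_element_class_lookup(Lookup())

# '<a/>' is a placeholder document, not the one from the report.
root = etree.XML('<a/>', parser)
print(root.custom())
```

Leaving out __init__ entirely has the same effect as the super() call shown here, which matches the observation that the crash only appeared with the custom __init__.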
Regards, Martijn From stefan_ml at behnel.de Wed Jun 4 22:23:39 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jun 2008 22:23:39 +0200 Subject: [lxml-dev] segfault when using etree.CustomElementClassLookup In-Reply-To: <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> References: <48468CC6.4050804@behnel.de> <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> Message-ID: <4846F9CB.6010704@behnel.de> Hi, Martijn Faassen wrote: > On Wed, Jun 4, 2008 at 2:38 PM, Stefan Behnel wrote: >>> parser.setElementClassLookup(lookup) >> is correctly spelled >> >>> parser.set_element_class_lookup(lookup) >> since lxml 2.0, following PEP 8 naming conventions. However, I didn't dare to >> remove the original method, since I figured that it would break tons of code >> for no major reason. At least the examples should reflect the new name >> everywhere now, so maybe I can remove it in lxml 3.0. ;) > > The documentation on the website still has the camelCases when I read > it yesterday. Hrmpf, thanks. I had actually postponed my initial decision to follow PEP 8 everywhere, and then forgotten to fix that function name for 2.0. Then I figured out later that it was still used everywhere in the docs, so I couldn't remove it without a longer warning phase. I had fixed it on the trunk back then, but apparently forgot to merge the doc changes over to the 2.0 branch... It's fixed now ... finally ... Stefan From faassen at startifact.com Wed Jun 4 22:27:31 2008 From: faassen at startifact.com (Martijn Faassen) Date: Wed, 4 Jun 2008 22:27:31 +0200 Subject: [lxml-dev] segfault when using etree.CustomElementClassLookup In-Reply-To: <4846F9CB.6010704@behnel.de> References: <48468CC6.4050804@behnel.de> <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> <4846F9CB.6010704@behnel.de> Message-ID: <8928d4e90806041327o253fc5b3s7aa8c717b7f51b4d@mail.gmail.com> Hi there, On Wed, Jun 4, 2008 at 10:23 PM, Stefan Behnel wrote: [snip] > It's fixed now ... 
finally ... I'm glad this thread came to some good after all then, even though it was all based on a mistake by me. :) Regards, Martijn From rogerpatterson at gmail.com Thu Jun 5 01:59:45 2008 From: rogerpatterson at gmail.com (Roger Patterson) Date: Wed, 04 Jun 2008 16:59:45 -0700 Subject: [lxml-dev] a different segfault In-Reply-To: <8928d4e90806041327o253fc5b3s7aa8c717b7f51b4d@mail.gmail.com> References: <48468CC6.4050804@behnel.de> <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> <4846F9CB.6010704@behnel.de> <8928d4e90806041327o253fc5b3s7aa8c717b7f51b4d@mail.gmail.com> Message-ID: <48472C71.3000409@gmail.com> Hi Stefan et al. I am getting a mysterious segfault using the XSLT lib. Basically, if I have: in my transform, I get the segfault, if I remove that line, it works fine. Now, I haven't distilled it down to a succinct example yet, and my transform and code are pretty large, but I was wondering if anyone else has experienced this? cheers -Roger The dump looks like this: *** glibc detected *** python: double free or corruption (!prev): 0x000000000f7dab00 *** ======= Backtrace: ========= /lib64/libc.so.6[0x352d46e890] /lib64/libc.so.6(cfree+0x8c)[0x352d471fac] /usr/lib64/libxml2.so.2(xmlFreeNodeList+0x177)[0x3536e4ff27] /usr/lib64/libxml2.so.2(xmlFreeNodeList+0x89)[0x3536e4fe39] /usr/lib64/libxml2.so.2(xmlFreeNodeList+0x89)[0x3536e4fe39] /usr/lib64/libxml2.so.2(xmlFreeDoc+0xb6)[0x3536e4fc96] /usr/lib/python2.4/site-packages/lxml-2.0.4-py2.4-linux-x86_64.egg/lxml/etree.so[0x2aaaaf1b79c8] /usr/lib/python2.4/site-packages/lxml-2.0.4-py2.4-linux-x86_64.egg/lxml/etree.so[0x2aaaaf1b900d] /usr/lib64/libpython2.4.so.1.0[0x353fa74f98] /usr/lib64/libpython2.4.so.1.0[0x353fa4abd2] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x383)[0x353fa95363] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f)[0x353fa9405f] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f)[0x353fa9405f] 
/usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6)[0x353fa94486] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6)[0x353fa94486] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6)[0x353fa94486] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6)[0x353fa94486] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f)[0x353fa9405f] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f)[0x353fa9405f] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0[0x353fa4c263] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10)[0x353fa35f90] /usr/lib64/libpython2.4.so.1.0[0x353fa3c01f] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10)[0x353fa35f90] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x220d)[0x353fa921ed] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6)[0x353fa94486] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f)[0x353fa9405f] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925)[0x353fa95905] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCode+0x32)[0x353fa95952] /usr/lib64/libpython2.4.so.1.0[0x353fab1ea9] /usr/lib64/libpython2.4.so.1.0(PyRun_SimpleFileExFlags+0x1a8)[0x353fab3358] /usr/lib64/libpython2.4.so.1.0(Py_Main+0xa5d)[0x353fab979d] /lib64/libc.so.6(__libc_start_main+0xf4)[0x352d41d8a4] python[0x400629] From ultrokevjr at gmail.com Thu Jun 5 02:57:36 2008 From: ultrokevjr at gmail.com (Kevin JR) Date: Thu, 5 Jun 2008 08:57:36 +0800 Subject: [lxml-dev] svn version failed to compiled against hg version of cython In-Reply-To: <4846D5C9.6090301@behnel.de> References: <3f0e729b0806040223u6aacfe8ei346e207c94b8038b@mail.gmail.com> <24598.194.114.62.69.1212575772.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <3f0e729b0806040344lb60c9f7qbf0fdbffca90db1@mail.gmail.com> 
<4846D5C9.6090301@behnel.de> Message-ID: <3f0e729b0806041757q60776feape95f9fa853a3f490@mail.gmail.com> On Thu, Jun 5, 2008 at 1:50 AM, Stefan Behnel wrote: > Hi, > > Kevin JR wrote: > > On Wed, Jun 4, 2008 at 6:36 PM, Stefan Behnel > wrote: > >> I guess you are actually using an older Cython version, likely installed > >> with easy_install. The version number in current hg wasn't increased > yet. > >> > >> > > no, it's compiled from hg pool. > > Believe me, you are not using a recent developer version of Cython. Maybe > you > have hg pulled from cython-release instead of cython-devel or whatever. Try > moving your hg Cython directory out of the way and check if it still > compiles. > > yes, it works with cython-devel. thanks very much! :) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080605/8078e02c/attachment-0001.htm From stefan_ml at behnel.de Thu Jun 5 08:48:18 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 05 Jun 2008 08:48:18 +0200 Subject: [lxml-dev] a different segfault In-Reply-To: <48472C71.3000409@gmail.com> References: <48468CC6.4050804@behnel.de> <8928d4e90806041301o2a3ab2b9l35618dd04a82674f@mail.gmail.com> <4846F9CB.6010704@behnel.de> <8928d4e90806041327o253fc5b3s7aa8c717b7f51b4d@mail.gmail.com> <48472C71.3000409@gmail.com> Message-ID: <48478C32.6020301@behnel.de> Hi, Roger Patterson wrote: > I am getting a mysterious segfault using the XSLT lib. > Basically, if I have: > > > in my transform, I get the segfault, if I remove that line, it works fine. > > Now, I haven't distilled it down to a succinct example yet, and my > transform and code are pretty large, but I was wondering if anyone else > has experienced this? Your stack trace doesn't tell me much. Could you a) use lxml 2.0.6 and b) rebuild lxml with "-g3" CFLAGS to see the function that's executed here? Running your code in valgrind would be great, there's a command line in the Makefile. 
Stefan From Olivier.Collioud at wipo.int Fri Jun 6 10:25:38 2008 From: Olivier.Collioud at wipo.int (Olivier Collioud) Date: Fri, 06 Jun 2008 10:25:38 +0200 Subject: [lxml-dev] xpath comparison Message-ID: Hello, is there any xpath equivalent in lxml to this: (ms:string-compare(@START,$symbol)=-1 and ms:string-compare(@END,$symbol)=1) which mean: (START < symbol and symbol < END) ? Regards, Olivier. ------ World Intellectual Property Organization Disclaimer: This electronic message may contain privileged, confidential and copyright protected information. If you have received this e-mail by mistake, please immediately notify the sender and delete this e-mail and all its attachments. Please ensure all e-mail attachments are scanned for viruses prior to opening or using. From stefan_ml at behnel.de Fri Jun 6 11:08:17 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Jun 2008 11:08:17 +0200 Subject: [lxml-dev] xpath comparison In-Reply-To: References: Message-ID: <4848FE81.8060301@behnel.de> Olivier Collioud wrote: > is there any xpath equivalent in lxml to this: > > (ms:string-compare(@START,$symbol)=-1 and > ms:string-compare(@END,$symbol)=1) > > which mean: > > (START < symbol and symbol < END) You can implement your own python function to do this. http://codespeak.net/lxml/extensions.html Stefan From romme at tesall.ru Fri Jun 6 12:28:07 2008 From: romme at tesall.ru (RommeDeSerieux) Date: Fri, 06 Jun 2008 14:28:07 +0400 Subject: [lxml-dev] Forced attribute value escaping Message-ID: <48491137.4070908@tesall.ru> Why does lxml insist on encoding non-ascii characters in attribute values as xml entities while leaving the text nodes, tag names and attribute names intact? How do i turn it off? 
From stefan_ml at behnel.de Fri Jun 6 12:35:22 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Jun 2008 12:35:22 +0200 Subject: [lxml-dev] Forced attribute value escaping In-Reply-To: <48491137.4070908@tesall.ru> References: <48491137.4070908@tesall.ru> Message-ID: <484912EA.4060102@behnel.de> Hi, RommeDeSerieux wrote: > Why does lxml insist on encoding non-ascii characters in attribute > values as xml entities while leaving the text nodes, tag names and > attribute names intact? How do i turn it off? How about a code example? Stefan From romme at tesall.ru Fri Jun 6 13:06:07 2008 From: romme at tesall.ru (RommeDeSerieux) Date: Fri, 06 Jun 2008 15:06:07 +0400 Subject: [lxml-dev] Forced attribute value escaping In-Reply-To: <484912EA.4060102@behnel.de> References: <48491137.4070908@tesall.ru> <484912EA.4060102@behnel.de> Message-ID: <48491A1F.5070703@tesall.ru> Stefan Behnel wrote: > How about a code example? #! /usr/bin/env python ## vim: fileencoding=utf-8 from lxml import etree node = etree.Element(u'tag_???') node.attrib[u'attribute_???????'] = u'value_????????' node.text = u'text_?????' # what i'm getting (with some linebreaks for email): # text_????? # # # expected result: # text_????? # print etree.tostring(node, encoding='utf-8') # P.S. Sorry for sending that to your own address From stefan_ml at behnel.de Fri Jun 6 13:20:39 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Jun 2008 13:20:39 +0200 Subject: [lxml-dev] Forced attribute value escaping In-Reply-To: <484919CE.8060202@tesall.ru> References: <48491137.4070908@tesall.ru> <484912EA.4060102@behnel.de> <484919CE.8060202@tesall.ru> Message-ID: <48491D87.1080808@behnel.de> Hi, please keep the list in CC. RommeDeSerieux wrote: > #! /usr/bin/env python > ## vim: fileencoding=utf-8 > from lxml import etree > > node = etree.Element(u'tag_???') > node.attrib[u'attribute_???????'] = u'value_????????' > node.text = u'text_?????' 
> > # what i'm getting (with some linebreaks for email): > # text_????? > # > # > # expected result: > # text_????? > # > print etree.tostring(node, encoding='utf-8') Yep, the serialisation is done by libxml2, so if you feel that this should look different, please file a bug report over there, or report it on their mailing list. http://xmlsoft.org/bugs.html Stefan From stefan_ml at behnel.de Fri Jun 6 13:48:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Jun 2008 13:48:58 +0200 Subject: [lxml-dev] Forced attribute value escaping In-Reply-To: <48491D87.1080808@behnel.de> References: <48491137.4070908@tesall.ru> <484912EA.4060102@behnel.de> <484919CE.8060202@tesall.ru> <48491D87.1080808@behnel.de> Message-ID: <4849242A.9010809@behnel.de> Hi again, Stefan Behnel wrote: > RommeDeSerieux wrote: >> #! /usr/bin/env python >> ## vim: fileencoding=utf-8 >> from lxml import etree >> >> node = etree.Element(u'tag_???') >> node.attrib[u'attribute_???????'] = u'value_????????' >> node.text = u'text_?????' >> >> # what i'm getting (with some linebreaks for email): >> # text_????? >> # >> # >> # expected result: >> # text_????? >> # >> print etree.tostring(node, encoding='utf-8') > > the serialisation is done by libxml2 Taking a deeper look at it, it seems that there's actually some legacy code in libxml2 that triggers this behaviour when the document encoding is not provided. We can work around that by always setting it to "UTF-8" for new documents. Here's a patch. 
Stefan

=== src/lxml/parser.pxi
==================================================================
--- src/lxml/parser.pxi (revision 4485)
+++ src/lxml/parser.pxi (local)
@@ -588,8 +588,11 @@
                 _raiseParseError(c_ctxt, filename, context._error_log)
             else:
                 _raiseParseError(c_ctxt, filename, None)
-    elif result.URL is NULL and filename is not None:
-        result.URL = tree.xmlStrdup(_cstr(filename))
+    else:
+        if result.URL is NULL and filename is not None:
+            result.URL = tree.xmlStrdup(_cstr(filename))
+        if result.encoding is NULL:
+            result.encoding = tree.xmlStrdup("UTF-8")
     return result
 
 cdef int _fixHtmlDictNames(tree.xmlDict* c_dict, xmlDoc* c_doc) nogil:
@@ -1366,6 +1369,8 @@
     result = tree.xmlNewDoc(NULL)
     if result is NULL:
         python.PyErr_NoMemory()
+    if result.encoding is NULL:
+        result.encoding = tree.xmlStrdup("UTF-8")
     __GLOBAL_PARSER_CONTEXT.initDocDict(result)
     return result

From Olivier.Collioud at wipo.int Fri Jun 6 15:10:14 2008
From: Olivier.Collioud at wipo.int (Olivier Collioud)
Date: Fri, 06 Jun 2008 15:10:14 +0200
Subject: [lxml-dev] xpath comparison
Message-ID:

Thanks Stefan,

But I would have expected that:

def isin(context,symbol,start,end):
    return start <= symbol <= end
ns = ElementTree.FunctionNamespace('http://wipo.int/isin')
ns.prefix = 'ii'
ns['isin'] = isin

for mref in definitionsTree.xpath('//MREF[ii:isin(%s,@START,@END)]' % symbol):
    ...

would be faster than:

for mref in definitionsTree.xpath('//MREF'):
    if mref.get('START') <= symbol <= mref.get('END'):
        ...

Because the first solution should visit less node.
But it seems that it is the contrary. Why ?

Olivier

>>> Stefan Behnel 6/06/08 11:08 am >>>
Olivier Collioud wrote:
> is there any xpath equivalent in lxml to this:
>
> (ms:string-compare(@START,$symbol)=-1 and
> ms:string-compare(@END,$symbol)=1)
>
> which mean:
>
> (START < symbol and symbol < END)

You can implement your own python function to do this.
http://codespeak.net/lxml/extensions.html

Stefan

From stefan_ml at behnel.de Fri Jun 6 15:16:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Jun 2008 15:16:56 +0200
Subject: [lxml-dev] first lessons learned while porting lxml to Py3
In-Reply-To: <8928d4e90806020610o1b03461r9d7223723919d15a@mail.gmail.com>
References: <4830827A.8050304@behnel.de> <48309D44.9080603@behnel.de> <48401C2F.3020500@behnel.de> <48418CAF.5010800@palladion.com> <8928d4e90806020610o1b03461r9d7223723919d15a@mail.gmail.com>
Message-ID: <484938C8.8040000@behnel.de>

Hi,

Martijn Faassen wrote:
> Looks like
> thanks to the use of Cython, porting to Python 3 while remaining
> compatible with Python 2.x actually is easier than if it'd been plain
> Python.

That's actually a nice migration path: "sorry, this pure Python module is
currently only available as a C extension for Python3, due to incompatible
syntax changes in the Python language."
;)

Stefan

From stefan_ml at behnel.de Fri Jun 6 15:55:36 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Jun 2008 15:55:36 +0200
Subject: [lxml-dev] xpath comparison
In-Reply-To:
References:
Message-ID: <484941D8.6000806@behnel.de>

Olivier Collioud top-posted:
> I would have expected that:
>
> def isin(context,symbol,start,end):
>     return start <= symbol <= end
> ns = ElementTree.FunctionNamespace('http://wipo.int/isin')
> ns.prefix = 'ii'
> ns['isin'] = isin
>
> for mref in definitionsTree.xpath('//MREF[ii:isin(%s,@START,@END)]' %
> symbol):

That should spell

    for mref in definitionsTree.xpath(
            '//MREF[ii:isin($symbol,@START,@END)]', symbol=symbol):

> would be faster than:
>
> for mref in definitionsTree.xpath('//MREF'):
>     if mref.get('START') <= symbol <= mref.get('END'):
>         ...
>
> Because the first solution should visit less node.

No. It has to visit all nodes and call the Python function on them. This can't
be optimised away.

libxml2 doesn't seem to support string comparisons using "<" and ">" in XPath,
so I guess you're basically out of luck. What you could try is testing for a
common prefix of your start and end marker and only compare values that start
with that.

Or, if your elements are sorted, you can use el.find() instead, which
short-circuits in lxml 2.0. (BTW, you can sort the children of an element
using the usual tuple pack-sort-unpack scheme and reassign the slice at the
end).

Please report back if any of the above approaches worked better for you.
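
A self-contained sketch of the extension-function approach with an XPath variable, assuming a current lxml. The namespace URI, the sample document and the values in it are made up for illustration. One detail worth noting: node-set arguments such as @START reach the Python function as lists of strings, so the function has to unpack them before comparing.

```python
from lxml import etree

# Hypothetical namespace URI, used only for this illustration.
NS_URI = 'http://example.org/isin'


def isin(context, symbol, start, end):
    # @START and @END arrive as node-sets, i.e. lists of strings.
    # An element lacking either attribute yields an empty list.
    if not start or not end:
        return False
    return start[0] <= symbol <= end[0]


ns = etree.FunctionNamespace(NS_URI)
ns['isin'] = isin

# Made-up sample data.
root = etree.XML(
    '<ROOT>'
    '<MREF START="A" END="C"/>'
    '<MREF START="D" END="F"/>'
    '<MREF/>'
    '</ROOT>')

# The value is passed as the XPath variable $symbol instead of being
# interpolated into the expression string.
matches = root.xpath(
    '//MREF[ii:isin($symbol, @START, @END)]',
    namespaces={'ii': NS_URI},
    symbol='B')

print([m.get('START') for m in matches])
```

The function is still called once per MREF element, so this keeps the filtering inside the XPath expression without reducing the number of nodes visited.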
Stefan

From stefan_ml at behnel.de Fri Jun 6 16:06:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Jun 2008 16:06:56 +0200
Subject: [lxml-dev] xpath comparison
In-Reply-To:
References:
Message-ID: <48494480.6090505@behnel.de>

Hi again,

Olivier Collioud wrote:
> for mref in definitionsTree.xpath('//MREF'):
>     if mref.get('START') <= symbol <= mref.get('END'):

If you really need a high-speed implementation for this, consider rewriting it
as an external Cython module that works directly on the libxml2 tree.

http://codespeak.net/lxml/capi.html#writing-external-modules-in-cython

The public C-API of lxml.etree is described here:

http://codespeak.net/svn/lxml/trunk/src/lxml/etreepublic.pxd

Stefan

From stefan_ml at behnel.de Fri Jun 6 16:08:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Jun 2008 16:08:44 +0200
Subject: [lxml-dev] xpath comparison
In-Reply-To: <484941D8.6000806@behnel.de>
References: <484941D8.6000806@behnel.de>
Message-ID: <484944EC.9090509@behnel.de>

Hi,

please ignore this, I was thinking about it the wrong way round.

Stefan Behnel wrote:
> What you could try is testing for a
> common prefix of your start and end marker and only compare values that start
> with that.

From chris at simplistix.co.uk Sun Jun 8 11:52:33 2008
From: chris at simplistix.co.uk (Chris Withers)
Date: Sun, 08 Jun 2008 04:52:33 -0500
Subject: [lxml-dev] confused newbie ;-)
Message-ID: <484BABE1.2030100@simplistix.co.uk>

Hi Guys,

I've finally come around to trying out lxml2, so I googled for it and
ended up here:

http://codespeak.net/lxml/index.html#download

I got myself in a bit of a knot 'cos I didn't realise the pointer to PyPI
was a link (I could have sworn I hovered over it and got no link, but I
see now that it is).

So I went to http://pypi.python.org/pypi and searched for lxml.
This returns a link to version 0.9.

This is what really threw me... well, once I found out there was a
version 2.1beta2 available!

I guess my main questions are:

- Why are there no links from http://pypi.python.org/pypi/lxml/0.9 to
http://pypi.python.org/pypi/lxml?

- Why is 0.9 being returned as the only relevant search result?

- How come http://pypi.python.org/pypi/lxml returns a list with no other
information, versus say http://pypi.python.org/pypi/mailinglogger, which
returns the docs and a list of versions?

cheers,

Chris

PS: I did also try using easy_install, but that sucked down 2.1beta2 and
then whined that there was no C compiler. How can I teach easy_install
that I only want eggs where there is already a windows binary?

--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk

From stefan_ml at behnel.de Sun Jun 8 19:21:18 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 08 Jun 2008 19:21:18 +0200
Subject: [lxml-dev] confused newbie ;-)
In-Reply-To: <484BABE1.2030100@simplistix.co.uk>
References: <484BABE1.2030100@simplistix.co.uk>
Message-ID: <484C150E.7000602@behnel.de>

Hi,

Chris Withers wrote:
> I've finally come around to trying out lxml2, so I googled for it and
> ended up here:
>
> http://codespeak.net/lxml/index.html#download

Hmm, when I google for "lxml", I get the homepage as first hit. But then,
Google isn't the same everywhere...

> - Why are there no links from http://pypi.python.org/pypi/lxml/0.9 to
> http://pypi.python.org/pypi/lxml?

Because PyPI isn't very intelligent.

> - Why is 0.9 being returned as the only relevant search result?

No idea, honestly. But that's something you can ask on the catalog-sig list.
I'd like to hear what they have to say about it.

http://mail.python.org/mailman/listinfo/catalog-sig

> - How come http://pypi.python.org/pypi/lxml returns a list with no other
> information,

It lists all available versions, pretty much in the right order and all with
links to the download pages. What other information do you need?
> versus say http://pypi.python.org/pypi/mailinglogger, which > returns the docs and a list of versions? Not for me. I just see a single version with the downloadable archives on that page. No link to any other version or any note thereof. > PS: I did also try using easy_install, but that sucked down 2.1beta2 and > then whined that there was no C compiler. How can I teach easy_install > that I only want eggs where there is already a windows binary? http://peak.telecommunity.com/DevCenter/EasyInstall http://peak.telecommunity.com/DevCenter/EasyInstall#changing-the-active-version Stefan From cz at gocept.com Mon Jun 9 09:50:54 2008 From: cz at gocept.com (Christian Zagrodnick) Date: Mon, 9 Jun 2008 09:50:54 +0200 Subject: [lxml-dev] Pickling objectified trees References: <45E197C8.70400@behnel.de> Message-ID: Hi! On 2007-02-25 15:06:00 +0100, Stefan Behnel said: > Christian Zagrodnick wrote: >> the other day I had to pickle objectified trees. I just thought to >> share my findings. >> >> You might consider just registering the reduce function in lxml itself. > > Interesting. Sure, why not? Objectify is totally about data classes after all. > > Applied to the trunk (with small changes). I found a may-be-considered-a-bug. The script below shows that when pickling the node, the processing instruction is ommited. When trying to pickle the root tree, the error is raised. The problem is, that we pickle the `xml` object and get it back with all its descendants. But after unpickling the whole tree is not as it was before. So that's actually a bug. I guess the best would be to a) always serialize the roottree on pickle and b) remember which part of the tree actually was pickelt so on unpickle this exact object can be restored. I got to play with this a bit before I can deliver some useful code, thogh. 
Regards, Christian ---------------- import pickle import lxml.etree import lxml.objectify xml = lxml.objectify.fromstring('') print pickle.dumps(xml) print pickle.dumps(xml.getroottree()) -------------------- ----------------- clxml.objectify fromstring p0 (S'' p1 tp2 Rp3 . Traceback (most recent call last): File "pi.py", line 8, in ? print pickle.dumps(xml.getroottree()) File "/Users/zagy/development/python/lib/python2.4/pickle.py", line 1386, in dumps Pickler(file, protocol, bin).dump(obj) File "/Users/zagy/development/python/lib/python2.4/pickle.py", line 231, in dump self.save(obj) File "/Users/zagy/development/python/lib/python2.4/pickle.py", line 313, in save rv = reduce(self.proto) File "/Users/zagy/development/python/lib/python2.4/copy_reg.py", line 69, in _reduce_ex raise TypeError, "can't pickle %s objects" % base.__name__ TypeError: can't pickle _ElementTree objects -------------------------------------- -- Christian Zagrodnick ? cz at gocept.com gocept gmbh & co. kg ? forsterstra?e 29 ? 06112 halle (saale) ? germany http://gocept.com ? tel +49 345 1229889 4 ? fax +49 345 1229889 1 Zope and Plone consulting and development From cz at gocept.com Mon Jun 9 10:06:33 2008 From: cz at gocept.com (Christian Zagrodnick) Date: Mon, 9 Jun 2008 10:06:33 +0200 Subject: [lxml-dev] Pickling objectified trees References: <45E197C8.70400@behnel.de> Message-ID: On 2008-06-09 09:50:54 +0200, Christian Zagrodnick said: > Hi! > > On 2007-02-25 15:06:00 +0100, Stefan Behnel said: >> Christian Zagrodnick wrote: >>> the other day I had to pickle objectified trees. I just thought to >>> share my findings. >>> > >>> You might consider just registering the reduce function in lxml itself. >> > >> Interesting. Sure, why not? Objectify is totally about data classes after > all. >> > >> Applied to the trunk (with small changes). > > > I found a may-be-considered-a-bug. The script below shows that when > > pickling the node, the processing instruction is ommited. 
> > When trying to pickle the root tree, the error is raised. > > The problem is, that we pickle the `xml` object and get it back with > > all its descendants. But after unpickling the whole tree is not as it > > was before. So that's actually a bug. > > I guess the best would be to > > a) always serialize the roottree on pickle and > b) remember which part of the tree actually was pickelt so on unpickle > > this exact object can be restored. > > I got to play with this a bit before I can deliver some useful code, thogh. Well for pickling the root node all it really takes is to use the getroottree() for serializing. The fromstring method returns the right object anyway. -- Christian Zagrodnick ? cz at gocept.com gocept gmbh & co. kg ? forsterstra?e 29 ? 06112 halle (saale) ? germany http://gocept.com ? tel +49 345 1229889 4 ? fax +49 345 1229889 1 Zope and Plone consulting and development From tom_schr at web.de Mon Jun 9 10:34:46 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Mon, 09 Jun 2008 10:34:46 +0200 Subject: [lxml-dev] Patch for xpathgrep.py Message-ID: <720747312@web.de> Hi, although I haven't wrote on this mailinglist yet, I read it regularly and like lxml very much! Thanks for all the hard work that went into this library! :-) To improve it a bit, I've created a humble patch against xpathgrep.py. I've looked at it and I missed a namespace option. Well, here it is. See the attached patch. Hope it's ok, feel free to adapt it to your needs. Keep up the excellent work! Thanks, Tom -------------- next part -------------- A non-text attachment was scrubbed... 
Name: xpathgrep.py.patch Type: text/x-diff Size: 2413 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080609/ebf28b24/attachment.bin From stefan_ml at behnel.de Mon Jun 9 12:51:31 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 9 Jun 2008 12:51:31 +0200 (CEST) Subject: [lxml-dev] Pickling objectified trees In-Reply-To: References: <45E197C8.70400@behnel.de> Message-ID: <62315.194.114.62.66.1213008691.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, Christian Zagrodnick wrote: > Well for pickling the root node all it really takes is to use the > getroottree() for serializing. The fromstring method returns the right > object anyway. Not quite. Pickling works nicely through tostring(), but the unpickle process must return an ElementTree and there isn't currently a straight forward unpickle function that takes a string and returns an ElementTree. I'll see how to fix that. Stefan From stefan_ml at behnel.de Mon Jun 9 14:58:00 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 9 Jun 2008 14:58:00 +0200 (CEST) Subject: [lxml-dev] Patch for xpathgrep.py In-Reply-To: <720747312@web.de> References: <720747312@web.de> Message-ID: <59772.194.114.62.66.1213016280.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Thomas Schraitle wrote: > like > lxml very much! Thanks for all the hard work that went into this library! :) > To improve it a bit, I've created a humble patch against xpathgrep.py. > I've > looked at it and I missed a namespace option. Well, here it is. See the > attached patch. Hope it's ok, feel free to adapt it to your needs. I considered using ETXPath() instead of plain XPath() for that script, so that you can say "//{ns}stuff". But I think your patch makes sense as well and may mean less typing/c&p on the cmd line. I'll enable both then. 
Thanks, Stefan From tom_schr at web.de Mon Jun 9 15:42:05 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Mon, 09 Jun 2008 15:42:05 +0200 Subject: [lxml-dev] Patch for xpathgrep.py Message-ID: <720896553@web.de> Hi Stefan, > [...] > > To improve it a bit, I've created a humble patch against xpathgrep.py. > > I've > > looked at it and I missed a namespace option. Well, here it is. See the > > attached patch. Hope it's ok, feel free to adapt it to your needs. > > I considered using ETXPath() instead of plain XPath() for that script, so > that you can say "//{ns}stuff". But I think your patch makes sense as well > and may mean less typing/c&p on the cmd line. I'll enable both then. Ahh, I forgot about the Clark notation. Good to know, thanks! I'm so get used to the prefix notation. By the way, I think there is a small error in my patch: The variable namespaces isn't correctly initialized. There should be a "parser.set_defaults(namespaces=[])" before the "parser.add_option("-N", "--ns", ...)" line. To spread the word about lxml and xpathgrep.py: I've wrote a small article here: http://lizards.opensuse.org/2008/06/09/query-your-xml-with-xpathgreppy/ :-) Thanks, Tom From pete.forman at westerngeco.com Mon Jun 9 17:57:31 2008 From: pete.forman at westerngeco.com (Pete Forman) Date: Mon, 09 Jun 2008 16:57:31 +0100 Subject: [lxml-dev] Patch for xpathgrep.py References: <720896553@web.de> Message-ID: Thomas Schraitle writes: > To spread the word about lxml and xpathgrep.py: I've wrote a small > article here: > http://lizards.opensuse.org/2008/06/09/query-your-xml-with-xpathgreppy/ Something has prettified away your apostrophes (with an e) in the //chapter[not(@id)]/title example. -- Pete Forman -./\.- Disclaimer: This post is originated WesternGeco -./\.- by myself and does not represent pete.forman at westerngeco.com -./\.- the opinion of Schlumberger or http://petef.22web.net -./\.- WesternGeco. 
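Editor's sketch of the option setup Thomas describes above (calling set_defaults(namespaces=[]) before adding the "-N"/"--ns" option). This is not the actual patch, which was scrubbed from the archive. The PREFIX=URI argument format, the build_parser() and parse_namespaces() helpers and the sample file name are assumptions made for illustration only.

```python
from optparse import OptionParser


def build_parser():
    """Create a fresh parser so the default list is never shared between runs."""
    parser = OptionParser(usage="%prog [options] XPATH [FILE ...]")
    # Without this default, options.namespaces is None when -N is never given.
    parser.set_defaults(namespaces=[])
    parser.add_option(
        "-N", "--ns",
        action="append", dest="namespaces", metavar="PREFIX=URI",
        help="bind a namespace prefix for use in the XPath expression")
    return parser


def parse_namespaces(items):
    """Turn ['db=http://...'] into {'db': 'http://...'} (assumed PREFIX=URI format)."""
    return dict(item.split("=", 1) for item in items)


options, args = build_parser().parse_args(
    ["-N", "db=http://docbook.org/ns/docbook", "//db:title", "book.xml"])
ns_map = parse_namespaces(options.namespaces)
print(ns_map, args)
```

A mapping of this shape is what lxml's XPath classes accept through their namespaces keyword argument, so a script along these lines could pass ns_map straight through.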
From stefan_ml at behnel.de Mon Jun 9 19:32:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jun 2008 19:32:37 +0200 Subject: [lxml-dev] Patch for xpathgrep.py In-Reply-To: <720896553@web.de> References: <720896553@web.de> Message-ID: <484D6935.1070601@behnel.de> Hi, Thomas Schraitle wrote: > I forgot about the Clark notation. Actually, I found that it's easier to stay with your simple option. If you work on the command line, the Clark notation just gets in the way. > To spread the word about lxml and xpathgrep.py: I've wrote a small article here: > http://lizards.opensuse.org/2008/06/09/query-your-xml-with-xpathgreppy/ Thanks. I updated the script a little. It now supports all methods of the unicode type as XPath functions, and some other goodies. Hope you like it. Stefan From tom_schr at web.de Mon Jun 9 21:48:08 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Mon, 9 Jun 2008 21:48:08 +0200 Subject: [lxml-dev] Patch for xpathgrep.py In-Reply-To: <484D6935.1070601@behnel.de> References: <720896553@web.de> <484D6935.1070601@behnel.de> Message-ID: <200806092148.09451.tom_schr@web.de> Hi, On Montag, 9. Juni 2008, Stefan Behnel wrote: > > Thomas Schraitle wrote: > > I forgot about the Clark notation. > > Actually, I found that it's easier to stay with your simple option. If > you work on the command line, the Clark notation just gets in the way. Yes, the command line gets a bit long in this case. :) > > [...] > > I updated the script a little. It now supports all methods of > the unicode type as XPath functions, and some other goodies. Hope you > like it. Very good! Ahh, by the way: I think, I discovered a strange behaviour that should be fixed too. Try to run it like this: # Without(!) 
any filename: $ xpathgrep.py /foo I get this: Traceback (most recent call last): File "tools/xpathgrep.py", line 256, in found = main(options, args) File "tools/xpathgrep.py", line 242, in main sys.stdin, xpath, print_name, options.xinclude, UnboundLocalError: local variable 'print_name' referenced before assignment I think that is not the intended behavior, isn't it? :) Maybe the script should check, if the there are at least 3 arguments? Thanks, Tom -- Thomas Schraitle From tom_schr at web.de Mon Jun 9 22:06:41 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Mon, 9 Jun 2008 22:06:41 +0200 Subject: [lxml-dev] Patch for xpathgrep.py In-Reply-To: References: <720896553@web.de> Message-ID: <200806092206.42185.tom_schr@web.de> Hi, On Montag, 9. Juni 2008, Pete Forman wrote: > Thomas Schraitle writes: > > To spread the word about lxml and xpathgrep.py: I've wrote a small > > article here: > > http://lizards.opensuse.org/2008/06/09/query-your-xml-with-xpathgrep > >py/ > > Something has prettified away your apostrophes (with an e) in the > //chapter[not(@id)]/title example. Strange. I can see it with the apostrophes in Firefox and Konqueror. Maybe some font issue in your browser? Thanks, Tom -- Thomas Schraitle From stefan_ml at behnel.de Mon Jun 9 22:56:00 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jun 2008 22:56:00 +0200 Subject: [lxml-dev] Patch for xpathgrep.py In-Reply-To: <200806092148.09451.tom_schr@web.de> References: <720896553@web.de> <484D6935.1070601@behnel.de> <200806092148.09451.tom_schr@web.de> Message-ID: <484D98E0.5000004@behnel.de> Thomas Schraitle wrote: > # Without(!) any filename: > $ xpathgrep.py /foo > > I get this: > > Traceback (most recent call last): > File "tools/xpathgrep.py", line 256, in > found = main(options, args) > File "tools/xpathgrep.py", line 242, in main > sys.stdin, xpath, print_name, options.xinclude, > UnboundLocalError: local variable 'print_name' referenced before > assignment Ah, a copy&paste bug. 
Thanks. Stefan From Olivier.Collioud at wipo.int Tue Jun 10 09:26:28 2008 From: Olivier.Collioud at wipo.int (Olivier Collioud) Date: Tue, 10 Jun 2008 09:26:28 +0200 Subject: [lxml-dev] xpath comparison Message-ID: Hello. > Please report back if any of the above approaches worked better for you. Finally I choose to first collect SREF and MREF elements values from the first XML file in a dictionary/set structure (to factorise and avoid duplicates) and then loop over it for each SYMBOL found in the second XML file. This solution is at least 10 times faster than the previous one and is fast enough. Thank you. Olivier. ------ World Intellectual Property Organization Disclaimer: This electronic message may contain privileged, confidential and copyright protected information. If you have received this e-mail by mistake, please immediately notify the sender and delete this e-mail and all its attachments. Please ensure all e-mail attachments are scanned for viruses prior to opening or using. From pete.forman at westerngeco.com Tue Jun 10 09:38:07 2008 From: pete.forman at westerngeco.com (Pete Forman) Date: Tue, 10 Jun 2008 08:38:07 +0100 Subject: [lxml-dev] Patch for xpathgrep.py References: <720896553@web.de> <200806092206.42185.tom_schr@web.de> Message-ID: Thomas Schraitle writes: > On Montag, 9. Juni 2008, Pete Forman wrote: >> Thomas Schraitle writes: >> > To spread the word about lxml and xpathgrep.py: I've wrote a small >> > article here: >> > http://lizards.opensuse.org/2008/06/09/query-your-xml-with-xpathgrep >> >py/ >> >> Something has prettified away your apostrophes (with an e) in the >> //chapter[not(@id)]/title example. > > Strange. I can see it with the apostrophes in Firefox and > Konqueror. Maybe some font issue in your browser? Those are typographically correct left and right single quotation marks U+2018 and U+2019. However the command line shell needs an apostrophe U+0027. Copy and paste from the browser to the command line will not work. 
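Editor's sketch of the point about U+2018/U+2019 versus U+0027: a few lines of Python that turn a pasted command back into something a shell can quote correctly. The helper name asciify_quotes() is hypothetical and only the two code points named above are mapped.

```python
# Map the typographic single quotes to the plain ASCII apostrophe (U+0027).
QUOTE_MAP = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
}


def asciify_quotes(command):
    """Return the command with typographic single quotes replaced by U+0027."""
    return command.translate(QUOTE_MAP)


# The command as it would arrive when copied from the rendered blog page.
pasted = "xpathgrep.py \u2018//chapter[not(@id)]/title\u2019 db.xml"
fixed = asciify_quotes(pasted)
print(fixed)
```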
I see single quotation marks in Opera, FF and IE, and viewing source:

Now this reduces the above output just to the wanted chapter titles. And I can extent my query just for all chapters that doesn’t have an id attribute:

$ xpathgrep.py ‘//chapter[not(@id)]/title’ db.xml

(You need the apostroph because of the shell.) The listing of File db.xml is similarly affected. -- Pete Forman -./\.- Disclaimer: This post is originated WesternGeco -./\.- by myself and does not represent pete.forman at westerngeco.com -./\.- the opinion of Schlumberger or http://petef.22web.net -./\.- WesternGeco. From tom_schr at web.de Tue Jun 10 19:19:49 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Tue, 10 Jun 2008 19:19:49 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? Message-ID: <200806101919.49899.tom_schr@web.de> Hi, while I'm constantly learning lxml, I came across a problem in one of my extension functions: Let's say I have a couple of parameters in my XSLT file. How can I access them in my extension function? I've looked into the documentation, mainly "Extension functions for XPath and XSLT", however I couldn't find anything about it, nor an example for this specific case. Perhaps I just overlooked it. Can someone give me a hint where to look? Or maybe a small example would be greatly appreciated. Thanks, Tom -- Thomas Schraitle From stefan_ml at behnel.de Tue Jun 10 20:01:08 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jun 2008 20:01:08 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? In-Reply-To: <200806101919.49899.tom_schr@web.de> References: <200806101919.49899.tom_schr@web.de> Message-ID: <484EC164.9080907@behnel.de> Hi, Thomas Schraitle wrote: > Let's say I have a couple of parameters in my XSLT file. How can I access > them in my extension function? You can't. Any reason why you are not just passing them into the function? Stefan From tom_schr at web.de Tue Jun 10 20:30:34 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Tue, 10 Jun 2008 20:30:34 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? 
In-Reply-To: <484EC164.9080907@behnel.de> References: <200806101919.49899.tom_schr@web.de> <484EC164.9080907@behnel.de> Message-ID: <200806102030.35928.tom_schr@web.de> Hi, On Dienstag, 10. Juni 2008, Stefan Behnel wrote: > > Thomas Schraitle wrote: > > Let's say I have a couple of parameters in my XSLT file. How can I > > access them in my extension function? > > You can't. Any reason why you are not just passing them into the > function? The reason is I'm just confused. :) Ok, so probably you mean I just have to collect the parameter(s) (with a XPath) and... what? From the documentation I can't find any place where I can insert my arguments. Let me explain my confusion a bit: For example, you have an extension function foo: def foo(_): return True As the documentation says ("What to return from a function"), you create an XPath evaluator, for example: e = etree.XPathEvaluator(doc) e.evaluate("foo()") BUT: Where do I pass my arguments? Maybe I can play with a global variable, but I think, that would be a bad approach. Sorry, could you give a short example to elaborate this further, please? Thanks, Tom -- Thomas Schraitle From stefan_ml at behnel.de Tue Jun 10 20:41:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jun 2008 20:41:58 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? In-Reply-To: <200806102030.35928.tom_schr@web.de> References: <200806101919.49899.tom_schr@web.de> <484EC164.9080907@behnel.de> <200806102030.35928.tom_schr@web.de> Message-ID: <484ECAF6.8080600@behnel.de> Hi, Thomas Schraitle wrote: > On Dienstag, 10. Juni 2008, Stefan Behnel wrote: >> Thomas Schraitle wrote: >>> Let's say I have a couple of parameters in my XSLT file. How can I >>> access them in my extension function? >> You can't. Any reason why you are not just passing them into the >> function? 
> > Let me explain my confusion a bit: For example, you have an extension > function foo: > > def foo(_): > return True Why not def foo(_, a, b, c): ... Then you take your XSLT parameter, say "$param" and call your function like this: Stefan From tom_schr at web.de Tue Jun 10 21:05:43 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Tue, 10 Jun 2008 21:05:43 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? In-Reply-To: <484ECAF6.8080600@behnel.de> References: <200806101919.49899.tom_schr@web.de> <200806102030.35928.tom_schr@web.de> <484ECAF6.8080600@behnel.de> Message-ID: <200806102105.44138.tom_schr@web.de> Hi, On Dienstag, 10. Juni 2008, Stefan Behnel wrote: > [...] > Why not > > def foo(_, a, b, c): ... > > Then you take your XSLT parameter, say "$param" and call your function > like this: > > Ok, I knew it was that dump simple... ;) Unfortunately, it does not work in my case. I use the DocBook stylesheets very much in combination with xsltproc. However, xsltproc doesn't support the DocBook extension functions. These create line numbers, get the size of a graphic, etc. You know, these sort of things that are very difficult or impossible to deal with XSLT. The problem is these extension functions are only implemented for Saxon or Xalan. Well, my idea was to write these in lxml. To do this, I have to stick with the number of arguments, unless I change the DocBook stylesheets. My dream would be to have a kind of a wrapper for xsltproc which implements the extension function. With this I can transform any DocBook document without any customization so far. Do you see any other solution? I thought I could get the parameters from the context, but obviously this isn't possible. It would be very useful if I could get access to the stylesheet itself to seek for any parameters (just day dreaming. 
;) ) Thanks, Tom -- Thomas Schraitle From stefan_ml at behnel.de Tue Jun 10 21:28:30 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jun 2008 21:28:30 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? In-Reply-To: <200806102105.44138.tom_schr@web.de> References: <200806101919.49899.tom_schr@web.de> <200806102030.35928.tom_schr@web.de> <484ECAF6.8080600@behnel.de> <200806102105.44138.tom_schr@web.de> Message-ID: <484ED5DE.9030904@behnel.de> Hi, Thomas Schraitle wrote: > I use the DocBook stylesheets very much in combination with xsltproc. > However, xsltproc doesn't support the DocBook extension functions. These > create line numbers, get the size of a graphic, etc. You know, these sort > of things that are very difficult or impossible to deal with XSLT. The > problem is these extension functions are only implemented for Saxon or > Xalan. > > Well, my idea was to write these in lxml. To do this, I have to stick with > the number of arguments, unless I change the DocBook stylesheets. My > dream would be to have a kind of a wrapper for xsltproc which implements > the extension function. What about this (totally untested) code snippet: class DocBoocXSLT(XSLT): def __init__(self, xslt_input): extensions = {("ns", "myfunc1") : self.myfunc1} XSLT.__init__(self, xslt_input, extensions = extensions) def myfunc1(self, ctxt, a, b, c): param = self.parameters["someparam"] ... def __call__(self, *args, **kwargs): self.parameters = kwargs return XSLT.__call__(self, *args, **kwargs) Stefan From tom_schr at web.de Tue Jun 10 21:35:08 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Tue, 10 Jun 2008 21:35:08 +0200 Subject: [lxml-dev] How to access XSLT Parameters Inside an Extension Function? In-Reply-To: <484ED5DE.9030904@behnel.de> References: <200806101919.49899.tom_schr@web.de> <200806102105.44138.tom_schr@web.de> <484ED5DE.9030904@behnel.de> Message-ID: <200806102135.09362.tom_schr@web.de> Hi, On Dienstag, 10. 
Juni 2008, Stefan Behnel wrote: > [...] > > What about this (totally untested) code snippet: > > class DocBoocXSLT(XSLT): > def __init__(self, xslt_input): > extensions = {("ns", "myfunc1") : self.myfunc1} > XSLT.__init__(self, xslt_input, extensions = extensions) > > def myfunc1(self, ctxt, a, b, c): > param = self.parameters["someparam"] > ... > > def __call__(self, *args, **kwargs): > self.parameters = kwargs > return XSLT.__call__(self, *args, **kwargs) > That looks very promising! I will test it more and come back if I have further questions. :) Thank you very much, Stefan! Tom -- Thomas Schraitle From tom_schr at web.de Thu Jun 12 09:32:15 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Thu, 12 Jun 2008 09:32:15 +0200 Subject: [lxml-dev] Error from Variables in Extension Functions? Message-ID: <722094985@web.de> Hi, as I am trying to learn more of lxml simplicity, I stumpled over a strange behaviour that I couldn't explain. Probably it's only my wrong understanding. :) I've played with some extension functions and I could condense my problem to a small testcase. Maybe it's totally dumb and useless and only my wrong understanding. Well, let's see. :) I used the source code from lxml-2.0.6, compiled it and modified the file in lxml/tests/test_xslt.py. I added the following small test function: def test_extensions3(self): tree = self.parse('B') style = self.parse('''\ ''') def mytext(ctxt, values): return 'X' * len(values) namespace = etree.FunctionNamespace('testns') namespace['mytext'] = mytext result = tree.xslt(style) self.assertEquals(self._rootstring(result), _bytes('X')) Actually, the function test_extensions3() is derived from test_extensions2. The only difference is the variable content which is passed on to the extension function. I've expected an identical result like test_extensions2. From what I know, I think it is a common approach to collect the content in a variable and pass it on to other functions. 
I've ran the testcase like this: $ python test.py -vv lxml/tests/test_xslt.py Unfortunately it gives me this output: ====================================================================== ERROR: test_extensions3 (lxml.tests.test_xslt.ETreeXSLTTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python2.5/unittest.py", line 260, in run testMethod() File "/local/repos/home:thomas-schraitle/python-lxml/lxml-2.0.6/src/lxml/tests/test_xslt.py", line 620, in test_extensions3 result = tree.xslt(style) File "lxml.etree.pyx", line 1732, in lxml.etree._ElementTree.xslt (src/lxml/lxml.etree.c:16290) File "xslt.pxi", line 457, in lxml.etree.XSLT.__call__ (src/lxml/lxml.etree.c:83016) XSLTApplyError: XPath evaluation returned no result. ---------------------------------------------------------------------- Ran 54 tests in 0.686s FAILED (errors=1) Any hints what I'm doing wrong? Thanks, Tom Used libraries: libxml2-2.6.31-3.1 libxslt-1.1.22-3.2 From jholg at gmx.de Thu Jun 12 11:30:13 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 12 Jun 2008 11:30:13 +0200 Subject: [lxml-dev] Error from Variables in Extension Functions? In-Reply-To: <722094985@web.de> References: <722094985@web.de> Message-ID: <20080612093014.33530@gmx.net> Hi, > def test_extensions3(self): > tree = self.parse('B') > style = self.parse('''\ > xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > xmlns:myns="testns" > exclude-result-prefixes="myns"> > > > > > > > ''') > > def mytext(ctxt, values): > return 'X' * len(values) > > namespace = etree.FunctionNamespace('testns') > namespace['mytext'] = mytext > > result = tree.xslt(style) > self.assertEquals(self._rootstring(result), > _bytes('X')) > > Actually, the function test_extensions3() is derived from > test_extensions2. The > only difference is the variable content which is passed on to the > extension > function. I've expected an identical result like test_extensions2. 
From > what > I know, I think it is a common approach to collect the content in a > variable > and pass it on to other functions. > > [...] > > Any hints what I'm doing wrong? > In the stylesheet, your selection of content into the "content" XSLT variable does not work. Try s.th. like style = self.parse('''\ ''') Cheers, Holger -- Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080612/28125f5c/attachment.htm From stefan_ml at behnel.de Thu Jun 12 11:45:59 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Jun 2008 11:45:59 +0200 Subject: [lxml-dev] Error from Variables in Extension Functions? In-Reply-To: <722094985@web.de> References: <722094985@web.de> Message-ID: <4850F057.5000607@behnel.de> Hi, Thomas Schraitle wrote: > lxml-2.0.6 > [...] > > > > > [...] > XSLTApplyError: XPath evaluation returned no result. Thanks for the test case. lxml 2.1 gives me a NotImplementedError when trying to convert the XSLT result tree fragment in "content" to a Python object. So this is actually a missing feature. Could you file a feature request in launchpad for handling XSLT result trees in XPath? Stefan From tom_schr at web.de Thu Jun 12 12:40:21 2008 From: tom_schr at web.de (Thomas Schraitle) Date: Thu, 12 Jun 2008 12:40:21 +0200 Subject: [lxml-dev] Error from Variables in Extension Functions? Message-ID: <722208853@web.de> Hi, > Thomas Schraitle wrote: > > lxml-2.0.6 > > [...] > > > > > > > > > > [...] > > XSLTApplyError: XPath evaluation returned no result. > > Thanks for the test case. lxml 2.1 gives me a NotImplementedError when trying > to convert the XSLT result tree fragment in "content" to a Python object. So > this is actually a missing feature.
Could you file a feature request in > launchpad for handling XSLT result trees in XPath? Ok, done. See https://bugs.launchpad.net/lxml/+bug/239425 Tom From nepi at gmx.ch Wed Jun 18 16:35:05 2008 From: nepi at gmx.ch (Daniel Jirku) Date: Wed, 18 Jun 2008 16:35:05 +0200 Subject: [lxml-dev] lxml with utf-8 Message-ID: <20080618143505.197240@gmx.net> hi.. i'm new to lxml but very interested to using it... i now have a problem. i want to add an element with an umlaut (non ascii character), so im using utf-8. but as soon as i run my pyhton script, i get the following error: File "lxml.etree.pyx", line 835, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:9595) File "apihelpers.pxi", line 409, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:28436) File "apihelpers.pxi", line 951, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32423) AssertionError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes My script looks like this... # -*- coding: utf-8 -*- .... parser = etree.XMLParser(encoding="utf-8") etree.set_default_parser(parser) # bad sign '?' badString = "bl?m" root = etree.Element("neuIns") for i in range(5): tagAd = etree.SubElement(root, "ad", id=str(i)) foo = etree.SubElement(tagAd, "foo") foo.text = badString.encode("utf8") toStringValue = etree.tostring(root, encoding="utf-8", method="xml") writeToFile(toStringValue) -------- all the new parser set up and badString.enocde i just did to be shure everything is utf-8... without it it also doesn't work :) my setup is: python 2.5 pydev (eclipse) default encoding in eclipse is utf-8, also stdout encoding is utf-8 i found this in the mailing list (http://article.gmane.org/gmane.comp.python.lxml.devel/2320/match=assertionerror), but i think it should be possible to write utf-8 strings to an xml with lxml?! what i'm doing wrong... hope you can help.. thanks in advance.. dani. -- Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! 
Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer From stefan_ml at behnel.de Wed Jun 18 16:46:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 18 Jun 2008 16:46:58 +0200 (CEST) Subject: [lxml-dev] lxml with utf-8 In-Reply-To: <20080618143505.197240@gmx.net> References: <20080618143505.197240@gmx.net> Message-ID: <53684.145.253.136.18.1213800418.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Daniel Jirku wrote: > My script looks like this... > > # -*- coding: utf-8 -*- > .... > badString = "bl?m" Make that badString = u"bl?m" Mind the 'u', which makes it a unicode string. Your code above gives you a UTF-8 encoded byte string. Stefan From kalisky at hotmail.com Thu Jun 19 08:49:00 2008 From: kalisky at hotmail.com (Ofer Kalisky) Date: Thu, 19 Jun 2008 06:49:00 +0000 Subject: [lxml-dev] text of an etree Element Message-ID: Hi, When I access some node's text (node.text) I get only its first text value. For example, if the HTML is: asdfsomethingqwer and "node" is the "", then node.text retrieves asdf. If I use node.itertext(), I get [asdf, something, qwer], which is again not what I want... Is there a way to reach the other text of a node ("qwer")? _________________________________________________________________ Connect to the next generation of MSN Messenger? http://imagine-msn.com/messenger/launch80/default.aspx?locale=en-us&source=wlmailtagline -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080619/5621ce37/attachment-0001.htm From jholg at gmx.de Thu Jun 19 09:09:54 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 19 Jun 2008 09:09:54 +0200 Subject: [lxml-dev] text of an etree Element In-Reply-To: References: Message-ID: <20080619070954.68820@gmx.net> Hi, > > asdfsomethingqwer > > and "node" is the "", then node.text retrieves asdf. > if I use node.itertext(), I get [asdf, something, qwer], which is again, > not what I want...
> > is there a way to reach the other text of a node ("qwer")? ?Have you already looked at?http://codespeak.net/lxml/tutorial.html#elements-contain-text???Cheers,Holger?? -- GMX startet ShortView.de. Hier findest Du Leute mit Deinen Interessen! Jetzt dabei sein: http://www.shortview.de/?mc=sv_ext_mf at gmx -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080619/e201f78a/attachment.htm From stefan_ml at behnel.de Thu Jun 19 09:36:29 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Jun 2008 09:36:29 +0200 (CEST) Subject: [lxml-dev] text of an etree Element In-Reply-To: References: Message-ID: <55176.145.253.136.18.1213860989.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Ofer Kalisky wrote: > When I access some node's text (node.text) I get only the first text value > of it, for example, if the html is: > > asdfsomethingqwer > > and "node" is the "", then node.text retrieves asdf. > if I use node.itertext(), I get [asdf, something, qwer], which is again, > not what I want... > > is there a way to reach the other text of a node ("qwer")? http://codespeak.net/lxml/tutorial.html#elements-contain-text Stefan From marc at spielefanpages.de Thu Jun 19 11:05:27 2008 From: marc at spielefanpages.de (Marcel Hellkamp) Date: Thu, 19 Jun 2008 11:05:27 +0200 (CEST) Subject: [lxml-dev] Segmentation fault in lxml.html after pickling Message-ID: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de> This script crashes with a segmentation fault :/ Using Python-2.5.2, libxslt-1.1.9, libxml2-2.6.32, lxml-2.1beta2, linux-i686 #!/usr/bin/python # coding=utf-8 import cPickle import lxml, lxml.html html = ''' Test Page Test Page ''' tree = lxml.html.fromstring(html) cf = open('test.pcl', 'w') cPickle.dump(tree, cf, -1) cf.close() cf = open('test.pcl', 'r') pickled_tree = cPickle.load(cf) cf.close() print 'This works fine...' lxml.html.tostring(tree) print 'This crashes...' 
lxml.html.tostring(pickled_tree) From stefan_ml at behnel.de Thu Jun 19 11:25:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Jun 2008 11:25:58 +0200 (CEST) Subject: [lxml-dev] Segmentation fault in lxml.html after pickling In-Reply-To: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de> References: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de> Message-ID: <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Marcel Hellkamp wrote: > This script crashes with a segmentation fault :/ > > import cPickle > [...] > tree = lxml.html.fromstring(html) > > cf = open('test.pcl', 'w') > cPickle.dump(tree, cf, -1) > cf.close() > > cf = open('test.pcl', 'r') > pickled_tree = cPickle.load(cf) > cf.close() Yes, you can't pickle Elements in lxml.etree. This feature is currently only available in lxml.objectify, where Elements behave a lot more like Python objects. I think it makes a little less sense in lxml.etree where you'd have to keep some more state about the Element classes used inside the tree. I'm not sure how valuable this is in lxml.html. Could you describe your use case a little? Stefan From faassen at startifact.com Thu Jun 19 13:45:03 2008 From: faassen at startifact.com (Martijn Faassen) Date: Thu, 19 Jun 2008 13:45:03 +0200 Subject: [lxml-dev] Segmentation fault in lxml.html after pickling In-Reply-To: <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de> <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: [snip] > I think it makes a little less sense in lxml.etree where you'd have to > keep some more state about the Element classes used inside the tree. I'm > not sure how valuable this is in lxml.html. I'd love it if I could somehow store lxml trees in the ZODB, and that'd need pickle support. 
Whether it could be made to be efficient I don't know - you'd not want the whole tree to be pickled as a whole in case of large trees, but some form of partitioning scheme into separate pickles. You're right that custom-element binding would be nice in this case, and that means the pickle can't simply be the XML content unless it's somehow annotated first. Anyway, this is a rather out there use case. I am just intrigued to learn that objectify elements can be pickled. Regards, Martijn From stefan_ml at behnel.de Thu Jun 19 16:16:35 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Jun 2008 16:16:35 +0200 (CEST) Subject: [lxml-dev] Segmentation fault in lxml.html after pickling In-Reply-To: References: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de> <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <35832.145.253.136.18.1213884995.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Martijn Faassen wrote: > I'd love it if I could somehow store lxml trees in the ZODB, and that'd > need pickle support. Whether it could be made to be efficient I don't > know - you'd not want the whole tree to be pickled as a whole in case of > large trees, but some form of partitioning scheme into separate pickles. > You're right that custom-element binding would be nice in this case, and > that means the pickle can't simply be the XML content unless it's > somehow annotated first. > > Anyway, this is a rather out there use case. I am just intrigued to > learn that objectify elements can be pickled. It's just easier to do in objectify, as it has a pretty comprehensive setup for Element class mapping. If you want to be sure to get back exactly the same Element tree after pickling, you can just annotate() an objectify tree before pickling it. Doing the same thing in lxml.etree would require storing some information about the current Element lookup, which may be a lot of information, e.g. 
for the namespace class setup. That's a parser-local setup, so we can't just use the setup of the default parser either but need a concrete context for the unpickling. lxml.html might be considered having such a context in a similar way lxml.objectify has it, as it comes with its own classes and lookup scheme. Stefan
From stefan_ml at behnel.de Thu Jun 19 18:58:20 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Jun 2008 18:58:20 +0200 Subject: [lxml-dev] lxml 2.1 beta3 released Message-ID: <485A902C.8050706@behnel.de> Hi all, I'm proud to release lxml 2.1beta3 to PyPI. This is the first lxml release that builds and works on Python 2.3, 2.4, 2.5, 2.6 (beta) and 3.0 (beta). Unusual for a beta release, the third beta contains more new features than bug fixes, which is largely (but not only) due to adaptations with respect to Python 3. The changelog follows below. I expect this to be the last beta release before 2.1 final. Feedback is very much appreciated, especially on the "experimental" features like the namespace cleanup function and on Python 2.6/3.0 support. Your feedback will help in making the final release the best lxml ever. Have fun, Stefan 2.1beta3 (2008-06-19) Features added * Major overhaul of tools/xpathgrep.py script. * Pickling ElementTree objects in lxml.objectify. * Support for parsing from file-like objects that return unicode strings. * New function etree.cleanup_namespaces(el) that removes unused namespace declarations from a (sub)tree (experimental). * XSLT results support the buffer protocol in Python 3. * Polymorphic functions in lxml.html that accept either a tree or a parsable string will return either a UTF-8 encoded byte string, a unicode string or a tree, based on the type of the input. Previously, the result was always a byte string or a tree. * Support for Python 2.6 and 3.0 beta. * File name handling now uses a heuristic to convert between byte strings (usually filenames) and unicode strings (usually URLs). * Parsing from a plain file object frees the GIL under Python 2.x. * Running iterparse() on a plain file (or filename) frees the GIL on reading under Python 2.x. 
* Conversion functions html_to_xhtml() and xhtml_to_html() in lxml.html (experimental). * Most features in lxml.html work for XHTML namespaced tag names (experimental). Bugs fixed * ElementTree.parse() didn't handle target parser result. * Crash in Element class lookup classes when the __init__() method of the super class is not called from Python subclasses. * A number of problems related to unicode/byte string conversion of filenames and error messages were fixed. * Building on MacOS-X now passes the "flat_namespace" option to the C compiler, which reportedly prevents build quirks and crashes on this platform. * Windows build was broken. * Rare crash when serialising to a file object with certain encodings. Other changes * Non-ASCII characters in attribute values are no longer escaped on serialisation. * Passing non-ASCII byte strings or invalid unicode strings as .tag, namespaces, etc. will result in a ValueError instead of an AssertionError (just like the tag well-formedness check). * Up to several times faster attribute access (i.e. tree traversal) in lxml.objectify. From stefan_ml at behnel.de Fri Jun 20 11:13:16 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Jun 2008 11:13:16 +0200 Subject: [lxml-dev] lxml 2.0.7 released Message-ID: <485B74AC.5090802@behnel.de> Hi all, I just released lxml 2.0.7 to PyPI. This is a bug-and-convenience-fixes-only release for the stable 2.0 series. Changelog below. The C sources in this release were generated with Cython 0.9.8 and tested with Python 2.6 beta1. Have fun, Stefan 2.0.7 (2008-06-20) Features added * Pickling ElementTree objects in lxml.objectify. Bugs fixed * Descending dot-separated classes in CSS selectors were not resolved correctly. * ElementTree.parse() didn't handle target parser result. * Potential threading problem in XInclude. * Crash in Element class lookup classes when the __init__() method of the super class is not called from Python subclasses. 
Other changes

* Non-ASCII characters in attribute values are no longer escaped on
  serialisation.

From jholg at gmx.de Fri Jun 20 17:47:11 2008
From: jholg at gmx.de (Holger Joukl)
Date: Fri, 20 Jun 2008 17:47:11 +0200
Subject: [lxml-dev] operator.delslice() not working with 2.1beta3
Message-ID: <20080620155913.68830@gmx.net>

Hi,

I just realized that operator.delslice() does not work with 2.1beta3:

>>> root = msg
>>> root.a = [1,2,3,4]
>>> import operator
>>> operator.delslice(msg.a, 0, 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support slice deletion
>>>

which worked in 2.0alpha

>>> root = msg
>>> root.a = [1,2,3,4]
>>> import operator
>>> operator.delslice(msg.a, 0, 4)
>>>

I suspect this was introduced with extended slicing.

operator.getslice and operator.setslice still work, btw.

Holger

--
Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen!
Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080620/1cddd1ee/attachment.htm

From stefan_ml at behnel.de Fri Jun 20 21:37:26 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 20 Jun 2008 21:37:26 +0200
Subject: [lxml-dev] operator.delslice() not working with 2.1beta3
In-Reply-To: <20080620155913.68830@gmx.net>
References: <20080620155913.68830@gmx.net>
Message-ID: <485C06F6.8030700@behnel.de>

Hi,

Holger Joukl wrote:
> I just realized that operator.delslice() does not work with 2.1beta3:
>
> >>> root = msg
>>>> root.a = [1,2,3,4]
>>>> import operator
>>>> operator.delslice(msg.a, 0, 4)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: object doesn't support slice deletion
> which worked in 2.0alpha
>
> >>> root = msg
>>>> root.a = [1,2,3,4]
>>>> import operator
>>>> operator.delslice(msg.a, 0, 4)
>>>>
> I suspect this was introduced with extended slicing.
>
> operator.getslice and operator.setslice still work, btw.

Hmmm, yes, the operator maps to a pretty direct call of
PySequence_DelSlice(), which internally does this:

    m = s->ob_type->tp_as_sequence;
    if (m && m->sq_ass_slice) {
        ...
    }
    type_error("'%.200s' object doesn't support slice deletion", s);
    return -1;

So to make this work, I'll have to wrap __delitem__() in a deprecated
__delslice__(). That's ugly...

Stefan

From dfedoruk at gmail.com Fri Jun 20 22:50:52 2008
From: dfedoruk at gmail.com (Dmitri Fedoruk)
Date: Sat, 21 Jun 2008 00:50:52 +0400
Subject: [lxml-dev] file descriptors leak since lxml 2.0.5 while resolving local DTD
In-Reply-To: <48199845.4@behnel.de>
References: <48199845.4@behnel.de>
Message-ID: <485C182C.6050307@gmail.com>

Greetings,

I've been using 2.0 for a while and today I've decided to upgrade to the
most recent 2.0.7.

I got a problem, and, by binary search (based on change log) :) I found
it in 2.0.5 first - it is the local file DTD resolver.

This issue originates in
http://article.gmane.org/gmane.comp.python.lxml.devel/3499

Eventually I have to load DTD in some specific cases for parsing. Even
if I load it from local disc and cache it, the parsing time is longer up
to 10 times (40ms instead of 4ms).

So, I came up to the following (ugly) solution:

class LocalDTDResolver(etree.Resolver):
    def __init__(self, conf):
        self.conf = conf
        self.cached = None
    def resolve(self, url, id, context):
        if not self.cached:
            self.cached = self.resolve_filename( self.conf + '/vxml.dtd' , context )
        return self.cached

class LxmlUser(...): # just the relevant snippets
    def __init__(...)
        self.xmlParser = etree.XMLParser(no_network=True,
            resolve_entities=False, load_dtd=False)
        self.resolvingParser = etree.XMLParser(no_network=False,
            resolve_entities=False, load_dtd=True)
        self.resolvingParser.resolvers.add(LocalDTDResolver(local_path))

    def call_parser(self, replies):
        for data in replies:
            if need_resolve:
                parser = self.resolvingParser
            else:
                parser = self.xmlParser
            xmlres = etree.parse( StringIO.StringIO( data ), parser )

Systems are FreeBSD 6.2/7.0,
lxml.etree: (2, 0, 5, 0)
libxml used: (2, 6, 30)
libxml compiled: (2, 6, 30)
libxslt used: (1, 1, 22)
libxslt compiled: (1, 1, 22)

This code is run within mod_python3/apache2.2.8

Up to 2.0.5 I have no problem when the resolvingParser is called. But
since 2.0.5 after I have this:
# no call of resolving parser
[root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
kern.openfiles: 377
# after a single (!) call of resolving parser
[root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
kern.openfiles: 11439

And my local DTD file is opened about 11000 times (according to fstat
and find -inode).

Am I doing something wrong in such a way of coding or it is a bug?

Cheers,
Dmitri

From stefan_ml at behnel.de Sat Jun 21 11:56:43 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jun 2008 11:56:43 +0200
Subject: [lxml-dev] file descriptors leak since lxml 2.0.5 while resolving local DTD
In-Reply-To: <485C182C.6050307@gmail.com>
References: <48199845.4@behnel.de> <485C182C.6050307@gmail.com>
Message-ID: <485CD05B.6000703@behnel.de>

Hi,

Dmitri Fedoruk wrote:
> I got a problem, and, by binary search (based on change log) :) I found
> it in 2.0.5 first - it is the local file DTD resolver.

I'll take a look.

> This issue originates in
> http://article.gmane.org/gmane.comp.python.lxml.devel/3499
>
> Eventually I have to load DTD in some specific cases for parsing. Even
> if I load it from local disc and cache it, the parsing time is longer up
> to 10 times (40ms instead of 4ms).
> > So, I came up to the following (ugly) solution: > > class LocalDTDResolver(etree.Resolver): > def __init__(self, conf): > self.conf = conf > self.cached = None > def resolve(self, url, id, context): > if not self.cached: > self.cached = self.resolve_filename( self.conf + '/vxml.dtd' > , context ) > return self.cached Not that ugly, but not very helpful either. You are caching the filename, not the content. Check docloader.pxi to see how simple the machinery is here. There isn't currently a way to return a parsed document from a resolver (and I don't think libxml2 supports that), so I think the best you can do is to return the content as a cached string, thus avoiding I/O but not the parse overhead. > Systems are FreeBSD 6.2/7.0, > lxml.etree: (2, 0, 5, 0) > libxml used: (2, 6, 30) > libxml compiled: (2, 6, 30) > libxslt used: (1, 1, 22) > libxslt compiled: (1, 1, 22) > > This code is run within mod_python3/apache2.2.8 Now that you mention it: are you using the single interpreter option in mod_python or does it work without? I fixed a couple of threading things in 2.0.6, so that should now work without that work-around. But it's still untested due to lack of feedback. > Up to 2.0.5 I have no problem when the resolvingParser is called. But > since 2.0.5 after I have this: > # no call of resolving parser > [root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles > kern.openfiles: 377 > # after a single (!) call of resolving parser > [root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles > kern.openfiles: 11439 If you are really using the above code then it means that libxml2 is reading the DTD internally. Maybe there's something more we have to clean up, or maybe it's really a leak in libxml2. But the numbers you post here look very unrealistic to me. > And my local DTD file is opened about 11000 times (according to fstat > and find -inode). If you parse it once, libxml2 should open the DTD file once, and not more. I'll look into that. 
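
The suggestion above, returning the DTD content as a cached string rather than a filename, could look roughly like the sketch below. The class name CachedDTDResolver, the reads counter and the throwaway vxml.dtd file are made up for illustration, and the code is written for a current Python 3 / lxml rather than the 2008 versions discussed here. It only avoids the repeated file I/O; the DTD is still parsed on every document.

```python
import os
import tempfile
from io import BytesIO

from lxml import etree


class CachedDTDResolver(etree.Resolver):
    """Hypothetical resolver: read the DTD file once, then serve it from memory."""

    def __init__(self, dtd_path):
        self.dtd_path = dtd_path
        self.content = None  # cached DTD bytes
        self.reads = 0       # how often the file was actually opened

    def resolve(self, url, public_id, context):
        if self.content is None:
            with open(self.dtd_path, "rb") as f:
                self.content = f.read()
            self.reads += 1
        # resolve_string() hands the cached content to the parser,
        # so no file is opened for later documents.
        return self.resolve_string(self.content, context)


# Illustrative setup (assumption): a throwaway DTD on disk and a
# document that refers to it by system identifier.
tmpdir = tempfile.mkdtemp()
dtd_path = os.path.join(tmpdir, "vxml.dtd")
with open(dtd_path, "wb") as f:
    f.write(b"<!ELEMENT doc (#PCDATA)>")

resolver = CachedDTDResolver(dtd_path)
parser = etree.XMLParser(load_dtd=True, resolve_entities=False)
parser.resolvers.add(resolver)

xml = b'<!DOCTYPE doc SYSTEM "vxml.dtd"><doc>hello</doc>'
# Parse the same document twice; the DTD file should be read only once.
roots = [etree.parse(BytesIO(xml), parser).getroot() for _ in range(2)]
```

Comparing the open file count before and after a loop like this would be one way to check whether the descriptor leak depends on resolve_filename().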
Stefan

From stefan_ml at behnel.de Sat Jun 21 12:37:09 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jun 2008 12:37:09 +0200
Subject: [lxml-dev] file descriptors leak since lxml 2.0.5 while resolving local DTD
In-Reply-To: <485C182C.6050307@gmail.com>
References: <48199845.4@behnel.de> <485C182C.6050307@gmail.com>
Message-ID: <485CD9D5.1040002@behnel.de>

Hi,

Dmitri Fedoruk wrote:
> Up to 2.0.5 I have no problem when the resolvingParser is called. But
> since 2.0.5 after I have this:
> # no call of resolving parser
> [root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
> kern.openfiles: 377
> # after a single (!) call of resolving parser
> [root at machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
> kern.openfiles: 11439
>
> And my local DTD file is opened about 11000 times (according to fstat
> and find -inode).

When I run the "resolve_filename_dtd" test in test_etree.py and print
"lsof | wc -l" directly before and after parsing, I get the same number
each time. So I can't see your leak here.

Stefan

From stefan_ml at behnel.de Sun Jun 22 07:46:39 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 22 Jun 2008 07:46:39 +0200
Subject: [lxml-dev] operator.delslice() not working with 2.1beta3
In-Reply-To: <485C06F6.8030700@behnel.de>
References: <20080620155913.68830@gmx.net> <485C06F6.8030700@behnel.de>
Message-ID: <485DE73F.3030201@behnel.de>

Hi,

> Holger Joukl wrote:
>> I just realized that operator.delslice() does not work with 2.1beta3

Actually, delslice() never worked in 2.0 and was removed from the operator
module in Py3. It's also easily replaced with operator.delitem(x, slice()).

So I won't change the current behaviour.

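
For illustration, a minimal sketch of that replacement. It uses a plain Python list as a stand-in for the objectify sequence msg.a from the report (an assumption; the lxml element API itself is not exercised here):

```python
import operator

# Stand-in for the objectify sequence from the report (assumption:
# any object whose __delitem__ accepts a slice behaves the same way).
items = [1, 2, 3, 4]

# Old spelling, gone in Python 3: operator.delslice(items, 0, 4)
# Replacement: pass an explicit slice object to delitem().
operator.delitem(items, slice(0, 4))

print(items)  # []
```

The same call is equivalent to the statement del items[0:4], which goes through __delitem__ rather than the deprecated __delslice__.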
Stefan From john at jpevans.com Sun Jun 22 09:42:59 2008 From: john at jpevans.com (John Evans) Date: Sun, 22 Jun 2008 00:42:59 -0700 Subject: [lxml-dev] Potential bug when using nth-child in CSSSelector Message-ID: <6f7ea56f0806220042y6e5a7f0o7135efa7d25e6d79@mail.gmail.com> Hi All, I have observed some unexpected behavior with lxml -- I looked in the bug tracker and didn't see anything that seemed related to this. daedalus:~ john$ python Python 2.5.2 Stackless 3.1b3 060516 (python-2.52:61022, Feb 27 2008, 16:52:03) [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from lxml.cssselect import CSSSelector >>> CSSSelector("div div:nth-child(5) div div div:nth-child(3) img").path "descendant-or-self::div/descendant::*[name() = 'div' and (position() = 5)]" >>> It looks like when you use the nth-child member in a CSSSelector it chops off the rest of the CSSSelector and does not process it when compiling it down to an xpath. Is this a bug or am I just doing something stupid? Thanks, - John -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080622/9ae6535e/attachment.htm From stefan_ml at behnel.de Sun Jun 22 11:15:49 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 22 Jun 2008 11:15:49 +0200 Subject: [lxml-dev] Potential bug when using nth-child in CSSSelector In-Reply-To: <6f7ea56f0806220042y6e5a7f0o7135efa7d25e6d79@mail.gmail.com> References: <6f7ea56f0806220042y6e5a7f0o7135efa7d25e6d79@mail.gmail.com> Message-ID: <485E1845.4010104@behnel.de> Hi, thanks for the report. John Evans wrote: > I have observed some unexpected behavior with lxml -- I looked in the bug > tracker and didn't see anything that seemed related to this. > > daedalus:~ john$ python > Python 2.5.2 Stackless 3.1b3 060516 (python-2.52:61022, Feb 27 2008, > 16:52:03) > [GCC 4.0.1 (Apple Computer, Inc. 
build 5341)] on darwin > Type "help", "copyright", "credits" or "license" for more information. >>>> from lxml.cssselect import CSSSelector >>>> CSSSelector("div div:nth-child(5) div div div:nth-child(3) img").path > "descendant-or-self::div/descendant::*[name() = 'div' and (position() = 5)]" > > It looks like when you use the nth-child member in a CSSSelector it chops > off the rest of the CSSSelector and does not process it when compiling it > down to an xpath. That was an indentation bug (i.e. wrong block-level) in the selector parser, which made it stop short after a (pseudo) function with parameters. Here's a fixed version for 2.0: http://codespeak.net/svn/lxml/branch/lxml-2.0/src/lxml/cssselect.py or for 2.1 beta: http://codespeak.net/svn/lxml/trunk/src/lxml/cssselect.py Stefan From john at jpevans.com Mon Jun 23 00:04:42 2008 From: john at jpevans.com (John Evans) Date: Sun, 22 Jun 2008 15:04:42 -0700 Subject: [lxml-dev] Potential bug when using nth-child in CSSSelector In-Reply-To: <485E1845.4010104@behnel.de> References: <6f7ea56f0806220042y6e5a7f0o7135efa7d25e6d79@mail.gmail.com> <485E1845.4010104@behnel.de> Message-ID: <6f7ea56f0806221504v63df150m6993ee9ef56d5e2a@mail.gmail.com> Thanks very much, that did the trick. On Sun, Jun 22, 2008 at 2:15 AM, Stefan Behnel wrote: > Hi, > > thanks for the report. > > John Evans wrote: > > I have observed some unexpected behavior with lxml -- I looked in the bug > > tracker and didn't see anything that seemed related to this. > > > > daedalus:~ john$ python > > Python 2.5.2 Stackless 3.1b3 060516 (python-2.52:61022, Feb 27 2008, > > 16:52:03) > > [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin > > Type "help", "copyright", "credits" or "license" for more information. 
> >>>> from lxml.cssselect import CSSSelector > >>>> CSSSelector("div div:nth-child(5) div div div:nth-child(3) img").path > > "descendant-or-self::div/descendant::*[name() = 'div' and (position() = > 5)]" > > > > It looks like when you use the nth-child member in a CSSSelector it chops > > off the rest of the CSSSelector and does not process it when compiling it > > down to an xpath. > > That was an indentation bug (i.e. wrong block-level) in the selector > parser, > which made it stop short after a (pseudo) function with parameters. > > Here's a fixed version for 2.0: > > http://codespeak.net/svn/lxml/branch/lxml-2.0/src/lxml/cssselect.py > > or for 2.1 beta: > > http://codespeak.net/svn/lxml/trunk/src/lxml/cssselect.py > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080622/4c69bad0/attachment-0001.htm From jholg at gmx.de Mon Jun 23 17:49:46 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 23 Jun 2008 17:49:46 +0200 Subject: [lxml-dev] operator.delslice() not working with 2.1beta3 In-Reply-To: <485DE73F.3030201@behnel.de> References: <20080620155913.68830@gmx.net> <485C06F6.8030700@behnel.de> <485DE73F.3030201@behnel.de> Message-ID: <20080623155307.238660@gmx.net> Hi, > Actually, delslice() never worked in 2.0 and was removed from the > operator > module in Py3. It's also easily replaced with operator.delitem(x, > slice()). > > So I won't change the current behaviour. ?I've been running on some frozen 2.0alpha state for a while now so I probably missed __delslice__ and cohorts going away. ?Anyhow, I should've looked into the Py3 what's new stuff rather than the operator module documentation which is not yet up-to-date with regard to these deprecations. ?I'll use operator.delitem, then. ?Thanks for the clarification H.? -- Ist Ihr Browser Vista-kompatibel? 
Jetzt die neuesten Browser-Versionen downloaden: http://www.gmx.net/de/go/browser
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080623/696bfbec/attachment.htm

From jholg at gmx.de Mon Jun 23 17:56:37 2008
From: jholg at gmx.de (Holger Joukl)
Date: Mon, 23 Jun 2008 17:56:37 +0200
Subject: [lxml-dev] objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta
Message-ID: <20080623160833.295700@gmx.net>

Hi,

I have a usecase where I need to deannotate an objectified tree and then
manually set py:pytype or xsi:type attributes.

However, this seems to be getting difficult with 2.1beta as deannotate
wipes out all nsmap information with its call to cleanup_namespaces(),
and I cannot set a namespaced attribute through .set(...)

Maybe I'm missing something and there's a convenient way to do this?

If not, could we make the call to cleanup_namespaces optional (defaults
to True) in deannotate()?

Holger

--
GMX startet ShortView.de. Hier findest Du Leute mit Deinen Interessen!
Jetzt dabei sein: http://www.shortview.de/?mc=sv_ext_mf at gmx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080623/6a56cb1b/attachment.htm

From usernamenumber at gmail.com Tue Jun 24 02:22:34 2008
From: usernamenumber at gmail.com (Brad Smith)
Date: Mon, 23 Jun 2008 20:22:34 -0400
Subject: [lxml-dev] Problem handling &nbsp;
Message-ID:

Hello,

I am trying to handle some html data (the content of which I don't have
control over) with lxml. The problem is that whenever &nbsp; is
encountered lxml.etree.fromstring throws "XMLSyntaxError: Entity 'nbsp'
not defined" and parsing fails.

I have to admit I'm at a loss for how to deal with this.
I've looked up the DTDs for html and xhtml and the entity isn't defined there, so where would it be or, since I just want to store certain bits of the content, not render it, can I make lxml less picky? The lxml.html.soupparser can handle entities, but actually mis-interprets the html because it is "too" well-formed. For example, the author does: foostuff Which soupparser interprets as ...wrongly "correcting" the original markup. Any help here would be greatly appreciated. --Brad From sidnei at enfoldsystems.com Tue Jun 24 02:29:48 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 23 Jun 2008 21:29:48 -0300 Subject: [lxml-dev] lxml 2.0.7 released In-Reply-To: <485B74AC.5090802@behnel.de> References: <485B74AC.5090802@behnel.de> Message-ID: Binaries for Windows have been uploaded for Python 2.4 and Python 2.5. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at enfoldsystems.com Tue Jun 24 02:30:07 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 23 Jun 2008 21:30:07 -0300 Subject: [lxml-dev] lxml 2.1 beta3 released In-Reply-To: <485A902C.8050706@behnel.de> References: <485A902C.8050706@behnel.de> Message-ID: Binaries for Windows have been uploaded for Python 2.4 and Python 2.5. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From ballbach at rten.net Tue Jun 24 03:45:25 2008 From: ballbach at rten.net (Michael Ballbach) Date: Mon, 23 Jun 2008 21:45:25 -0400 Subject: [lxml-dev] schema validation and resolvers Message-ID: <20080624014525.GA26170@wayreth.rten.net> I've been trying to use etree.XMLSchema() to load a schema that has external imports - but I'd like to utilize an etree.Resolver object. In my application schemas aren't always on disk. 
This does not work, because the libxml2 xmlSchemaAddSchemaDoc() function,
when called to prepare the import, creates a new parser context and the
_local_resolver() function is unable to map this unknown context to a
Resolvers object.

I've made a simple patch but I'll preface this by saying that I don't
know libxml2 or lxml internals very well, so I may well have
misunderstood something about this or there may be a way to do this
(other than the base_url trick, which works, but as I said my schemas
aren't always in files, and when they are they aren't always in the same
directories).

The naive way to fix this is:

Index: parser.pxi
===================================================================
--- parser.pxi (revision 56012)
+++ parser.pxi (working copy)
@@ -393,13 +393,22 @@
     context._storage.add(data)
     return c_input
 
+cdef xmlparser.xmlParserCtxt* _findDefaultParserContext() with gil:
+    return __GLOBAL_PARSER_CONTEXT.getDefaultParser()._getParserContext()._c_ctxt
+
 cdef xmlparser.xmlParserInput* _local_resolver(char* c_url, char* c_pubid,
                                                xmlparser.xmlParserCtxt* c_context) nogil:
     # no Python objects here, may be called without thread context !
     # when we declare a Python object, Pyrex will INCREF(None) !
     cdef xmlparser.xmlParserInput* c_input
     cdef int error
+
+    # check the default parser to support contexts generated within libxml2
     if c_context._private is NULL:
+        if _findDefaultParserContext() is not NULL:
+            c_context = _findDefaultParserContext()
+
+    if c_context._private is NULL:
         if __DEFAULT_ENTITY_LOADER is NULL:
             return NULL
         return __DEFAULT_ENTITY_LOADER(c_url, c_pubid, c_context)

This will default to using the thread's default parser's
_ParserDictionaryContext object when none was found inside the passed
context. This 'makes sense' to me in the sense that when I was first
debugging this problem one of the first things I tried was using the
default parser, which I figured might come into play for any parsing
that happens behind the scenes.
This works fine and I'm able to use a Resolver now - as long as I add it to the default parser's resolvers list. However, it's not 'correct' in the sense that in reality it should probably use the resolvers associated with the schema's original parser. (I think it could be argued that the patch is correct in the abstract in that it catches resolution requests that would otherwise be missed, but that it is incorrect in the sense that the more specific resolver associated with the document's original parser is what should be found in this particular case.) From my perspective this could be fixed in one of two basic ways: 1) Modify libxml2 to somehow get lxml's _private stuff in there. This could be by passing a user specified context for use in new parses or something like that. I'd imagine this is much less likely to work out than option 2. 2) Use the _ParserDictionaryContext system there to store state about when the schema code is entered so that a proper XML context can be inferred and the original document's parser's resolvers can be called. One must be careful here to make sure that any use of lxml within the resolver callbacks does not mess up this state. I'd love feedback about this issue, and I'd be happy to implement one of these changes or something else, whatever makes sense to folks, as my long term interest is in having this work 'out-of-the-box'. In closing, thanks to all you fellows working on lxml, it's really great! -- Michael Ballbach, N0ZTQ ballbach at rten.net -- PGP KeyID: 0xA05D5555 http://www.rten.net/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080623/cc9da193/attachment.pgp From stefan_ml at behnel.de Tue Jun 24 08:10:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Jun 2008 08:10:12 +0200 Subject: [lxml-dev] objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta In-Reply-To: <20080623160833.295700@gmx.net> References: <20080623160833.295700@gmx.net> Message-ID: <48608FC4.6000200@behnel.de> Hi, Holger Joukl wrote: > I have a usecase where I need to deannotate an objectified tree > and then manually set py:pytype or xsi:type attributes. > > However, this seems to be getting difficult with 2.1beta as deannotate > wipes out all nsmap information with its call to cleanup_namespaces(), > and I cannot set a namespaced > > attribute through .set(...) > > could we make the call to cleanup_namespaces optional (defaults > to True) in deannotate()? I wasn't entirely sure if it was a good idea when I added it. I guess it's best to keep it out or make it optional (default False). Stefan From aryeh at bigfoot.com Tue Jun 24 11:12:51 2008 From: aryeh at bigfoot.com (Arye) Date: Tue, 24 Jun 2008 11:12:51 +0200 Subject: [lxml-dev] validation with multiple XSD files In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A22D4@ZIRIA.esd189.org> References: <3A49C88789256B4AB33AC603DB6AF49B011A22D4@ZIRIA.esd189.org> Message-ID: <6f63a0ad0806240212h550ae194y6e88b3d2c65ec814@mail.gmail.com> Hello, I am coming back to this after a delay. On top of lxml I was also considering ZSI to process my schema file. It appears that ZSI has no problem following the includes. So I will explore ZSI for now. Not being an expert on XML/XSD, I suspect that this is the best thing to do to avoid some tricky issues with namespaces. Regards, Arye. On Fri, May 9, 2008 at 6:38 PM, John Lovell wrote: > Arye: > > I had a similar problem and this is how I handled it. 
> > http://messagesleuth.svn.sourceforge.net/viewvc/messagesleuth/trunk/xsd/xsd2one.py?view=markup > > I didn't ask the group so others may have a better or more full featured > approach. > > John W. Lovell > Web Applications Engineer > Northwest Educational Service District > 1601 R Avenue > Anacortes, WA 98221 > (360) 299-4086 > jlovell at nwesd.org > > www.esd189.org > Together We Can ... > ________________________________ > From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] > On Behalf Of Arye > Sent: Friday, May 09, 2008 9:26 AM > To: lxml-dev at codespeak.net > Subject: [lxml-dev] validation with multiple XSD files > > > Hello all, > I would like to so some schema validation and started with the instructions > in : > http://codespeak.net/lxml/dev/validation.html#xmlschema > > > This all works great. Now I would like to extend this to a XSD file that > includes many other files. In other words I have a directory of XSD files > that I would like to use. The include statement look like this (the included > file is referenced by its name): > > > elementFormDefault="qualified"> > > > ... > ... some types defined in "base.xsd" are used here > > > > I am new to lxml so sorry in advance if the question does not make sense. > > Regards, > Arye. > > > > From stefan_ml at behnel.de Tue Jun 24 09:36:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Jun 2008 09:36:37 +0200 Subject: [lxml-dev] schema validation and resolvers In-Reply-To: <20080624014525.GA26170@wayreth.rten.net> References: <20080624014525.GA26170@wayreth.rten.net> Message-ID: <4860A405.7030003@behnel.de> Hi, Michael Ballbach wrote: > I've been trying to use etree.XMLSchema() to load a schema that has > external imports - but I'd like to utilize an etree.Resolver object. In > my application schemas aren't always on disk. 
> > This does not work, because the libxml2 xmlSchemaAddSchemaDoc() > function, when called to prepare the import, creates a new parser > context and the _local_resolver() function is unable to map this unknown > context to a Resolvers object. I agree that this is not satisfactory. It means that a quirk in libxml2 leaks into Python space. > 2) Use the _ParserDictionaryContext system there to store state about > when the schema code is entered so that a proper XML context can be > inferred and the original document's parser's resolvers can be > called. One must be careful here to make sure that any use of lxml > within the resolver callbacks does not mess up this state. I think that's the way to go. And yes, it must be a stack (i.e. a list) in this case. I'd just push the right parser (not only its context) as a "current thread parser" in cases where we know that libxml2 will not provide us with a context (XInclude is another case, happy to see that resolved at the same time). Then we can fall back to using the top-most parser if the _private pointer is NULL. Stefan From stefan_ml at behnel.de Tue Jun 24 09:00:32 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Jun 2008 09:00:32 +0200 Subject: [lxml-dev] Problem handling &nbsp; In-Reply-To: References: Message-ID: <48609B90.9010108@behnel.de> Hi, Brad Smith wrote: > I am trying to handle some html data (the content of which I don't > have control over) with lxml. The problem is that whenever &nbsp; is > encountered lxml.etree.fromstring throws "XMLSyntaxError: Entity > 'nbsp' not defined" and parsing fails. Are you using the HTML parser? Pass an HTMLParser() instance to fromstring() or try the fromstring() in lxml.html.
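A minimal sketch of that suggestion, assuming a made-up input string (the variable names are illustrative and not taken from Brad's code). It shows the XML parser rejecting the entity and the two HTML-aware alternatives accepting it:

```python
from lxml import etree
import lxml.html

# Hypothetical input: HTML using an entity that plain XML does not define.
html = "<p>foo&nbsp;bar</p>"

# The default XML parser rejects the undefined entity ...
try:
    etree.fromstring(html)
    xml_failed = False
except etree.XMLSyntaxError:
    xml_failed = True

# ... while an explicit HTML parser knows the HTML entity set.
# The result is the <html> root that the HTML parser wraps around the input.
root = etree.fromstring(html, etree.HTMLParser())
text = root.findtext(".//p")

# lxml.html.fromstring() uses an HTML parser without further setup.
para = lxml.html.fromstring(html)
```

In both HTML variants the entity ends up in the text as the non-breaking space character U+00A0.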
Stefan From stefan_ml at behnel.de Tue Jun 24 18:36:29 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Jun 2008 18:36:29 +0200 Subject: [lxml-dev] validation with multiple XSD files In-Reply-To: References: Message-ID: <4861228D.8040904@behnel.de> Hi, Arye wrote: > Now I would like to extend this to a XSD file that > includes many other files. In other words I have a directory of XSD files > that I would like to use. The include statement look like this (the included > file is referenced by its name): > > > elementFormDefault="qualified"> > > > ... > ... some types defined in "base.xsd" are used here I'm not sure what you are trying to do here. Including or importing XSD files should not be a problem at all, so maybe you could elaborate on the actual problem you are facing? Maybe with some example code that shows what you are doing? Stefan From ballbach at rten.net Wed Jun 25 01:04:40 2008 From: ballbach at rten.net (Michael Ballbach) Date: Tue, 24 Jun 2008 19:04:40 -0400 Subject: [lxml-dev] schema validation and resolvers In-Reply-To: <4860A405.7030003@behnel.de> References: <20080624014525.GA26170@wayreth.rten.net> <4860A405.7030003@behnel.de> Message-ID: <20080624230440.GA16587@wayreth.rten.net> On Tue, Jun 24, 2008 at 09:36:37AM +0200, Stefan Behnel wrote: > I think that's the way to go. And yes, it must be a stack (i.e. a > list) in this case. I'd just push the right parser (not only its > context) as a "current thread parser" in cases where we know that > libxml2 will not provide us with a context (XInclude is another case, > happy to see that resolved at the same time). Then we can fall back to > using the top-most parser if the _private pointer is NULL. How does this look? (Unit tests follow the patch) Index: src/lxml/xinclude.pxi =================================================================== --- src/lxml/xinclude.pxi (revision 56012) +++ src/lxml/xinclude.pxi (working copy) @@ -33,12 +33,14 @@ # i.e. 
as a sibling, which does not conflict with traversal. cdef int result self._error_log.connect() + __GLOBAL_PARSER_CONTEXT.pushImpliedContext(node._doc._parser) with nogil: if node._doc._parser is not None: result = xinclude.xmlXIncludeProcessTreeFlags( node._c_node, node._doc._parser._parse_options) else: result = xinclude.xmlXIncludeProcessTree(node._c_node) + __GLOBAL_PARSER_CONTEXT.popImpliedContext() self._error_log.disconnect() if result == -1: Index: src/lxml/xmlschema.pxi =================================================================== --- src/lxml/xmlschema.pxi (revision 56012) +++ src/lxml/xmlschema.pxi (working copy) @@ -65,7 +65,13 @@ raise XMLSchemaParseError, u"No tree or file given" if parser_ctxt is not NULL: + # calling xmlSchemaParse on a schema with imports or includes will + # cause libxml2 to create an internal context for parsing, so push + # an implied context to route resolve requests to the document's parser + __GLOBAL_PARSER_CONTEXT.pushImpliedContext(doc._parser) self._c_schema = xmlschema.xmlSchemaParse(parser_ctxt) + __GLOBAL_PARSER_CONTEXT.popImpliedContext() + if _LIBXML_VERSION_INT >= 20624: xmlschema.xmlSchemaFreeParserCtxt(parser_ctxt) Index: src/lxml/parser.pxi =================================================================== --- src/lxml/parser.pxi (revision 56012) +++ src/lxml/parser.pxi (working copy) @@ -42,6 +42,11 @@ cdef tree.xmlDict* _c_dict cdef _BaseParser _default_parser + cdef object _implied_parser_contexts + + def __init__(self): + self._implied_parser_contexts = [] + def __dealloc__(self): if self._c_dict is not NULL: xmlparser.xmlDictFree(self._c_dict) @@ -131,6 +136,38 @@ # otherwise we'd free data that's in use => segfault self.initThreadDictRef(&result.dict) + cdef xmlparser.xmlParserCtxt *findImpliedContext(self) with gil: + u"""Return any current implied xml parser context for this thread. 
This + is used when the _local_resolver function is called with a context + that was generated from within libxml2 - which happens when parsing + schema and xinclude external references.""" + + cdef _ParserDictionaryContext thread_context + cdef _BaseParser implied_parser + cdef Py_ssize_t count + + # see if we have a current implied parser + count = python.PyList_GET_SIZE(self._implied_parser_contexts) + if count != 0: + implied_parser = python.PyList_GET_ITEM(self._implied_parser_contexts, count - 1) + python.Py_INCREF(implied_parser) # borrowed reference + if implied_parser is not None: + return implied_parser._getParserContext()._c_ctxt + + # we don't, so use the thread's default parser context + thread_context = __GLOBAL_PARSER_CONTEXT._findThreadParserContext() + return thread_context._default_parser._getParserContext()._c_ctxt + + cdef void pushImpliedContext(self, context) with gil: + u"Push a new implied context object." + if context is not None and not isinstance(context, _BaseParser): + raise TypeError, u"implied contexts must be _ParserContext objects" + python.PyList_Append(self._implied_parser_contexts, context) + + cdef void popImpliedContext(self) with gil: + u"Pop and return the current implied context object." + self._implied_parser_contexts.pop() + cdef _ParserDictionaryContext __GLOBAL_PARSER_CONTEXT __GLOBAL_PARSER_CONTEXT = _ParserDictionaryContext() __GLOBAL_PARSER_CONTEXT.initMainParserContext() @@ -399,7 +436,13 @@ # when we declare a Python object, Pyrex will INCREF(None) ! cdef xmlparser.xmlParserInput* c_input cdef int error + + # if there is no _ParserDictionaryContext associated with the xmlParserCtxt + # passed, check to see if the thread state object has an implied context. 
if c_context._private is NULL: + c_context = __GLOBAL_PARSER_CONTEXT.findImpliedContext() + + if c_context._private is NULL: if __DEFAULT_ENTITY_LOADER is NULL: return NULL return __DEFAULT_ENTITY_LOADER(c_url, c_pubid, c_context) Unit tests: Index: src/lxml/tests/test_etree.py =================================================================== --- src/lxml/tests/test_etree.py (revision 56012) +++ src/lxml/tests/test_etree.py (working copy) @@ -2342,6 +2342,28 @@ 'a', tree.getroot()[1].tag) + def test_xinclude_resolver(self): + """Test that xinclude references can be processed by a resolver.""" + + class res(etree.Resolver): + def __init__(self, text): + self.text = text + self.called = False + + def resolve(self, url, Id, context): + self.called = True + return self.resolve_string(self.text, context) + + include_text = open(fileInTestDir('test.xml')).read() + parser = etree.XMLParser() + res_instance = res(include_text) + parser.resolvers.add(res_instance) + tree = etree.parse(fileInTestDir('include/test_xinclude.xml'), parser = parser) + self.include(tree) + + # make sure the resolver was used + self.assert_(res_instance.called) + class ETreeXIncludeTestCase(XIncludeTestCase): def include(self, tree): tree.xinclude() Index: src/lxml/tests/test_xmlschema.py =================================================================== --- src/lxml/tests/test_xmlschema.py (revision 56012) +++ src/lxml/tests/test_xmlschema.py (working copy) @@ -152,7 +152,108 @@ self.assert_(tree_valid.xmlschema(schema)) self.assert_(not tree_invalid.xmlschema(schema)) - + # + # schema + resolvers tests&data: + # + + resolver_schema_int = BytesIO("""\ + + + +""") + + resolver_schema_int2 = BytesIO("""\ + + + +""") + + resolver_schema_ext = """\ + + + + +""" + + class simple_resolver(etree.Resolver): + def __init__(self, schema): + self.schema = schema + + def resolve(self, url, Id, context): + assert(url == 'XXX.xsd') + return self.resolve_string(self.schema, context) + + def 
test_xmlschema_resolvers(self): + """Test that resolvers work with schema.""" + + parser = etree.XMLParser() + parser.resolvers.add(self.simple_resolver(self.resolver_schema_ext)) + schema_doc = etree.parse(self.resolver_schema_int, parser = parser) + schema = etree.XMLSchema(schema_doc) + + def test_xmlschema_resolvers_root(self): + """Test that the default resolver will get called if there's no + specific parser resolver.""" + + root_resolver = self.simple_resolver(self.resolver_schema_ext) + etree.get_default_parser().resolvers.add(root_resolver) + schema_doc = etree.parse(self.resolver_schema_int) + schema = etree.XMLSchema(schema_doc) + etree.get_default_parser().resolvers.remove(root_resolver) + + def test_xmlschema_resolvers_noroot(self): + """Test that the default resolver will not get called when a more + specific resolver is registered.""" + + class res_root(etree.Resolver): + def resolve(self, url, Id, context): + assert(False) + return None + + root_resolver = res_root() + etree.get_default_parser().resolvers.add(root_resolver) + parser = etree.XMLParser() + parser.resolvers.add(self.simple_resolver(self.resolver_schema_ext)) + schema_doc = etree.parse(self.resolver_schema_int, parser = parser) + schema = etree.XMLSchema(schema_doc) + etree.get_default_parser().resolvers.remove(root_resolver) + + def test_xmlschema_nested_resolvers(self): + """Test that resolvers work in a nested fashion.""" + + class res_nested(etree.Resolver): + def __init__(self, ext_schema): + self.ext_schema = ext_schema + + def resolve(self, url, Id, context): + assert(url == 'YYY.xsd') + return self.resolve_string(self.ext_schema, context) + + class res(etree.Resolver): + def __init__(self, ext_schema_1, ext_schema_2): + self.ext_schema_1 = ext_schema_1 + self.ext_schema_2 = ext_schema_2 + + def resolve(self, url, Id, context): + assert(url == 'XXX.xsd') + + new_parser = etree.XMLParser() + new_parser.resolvers.add(res_nested(self.ext_schema_2)) + new_schema_doc = 
etree.parse(self.ext_schema_1, parser = new_parser) + new_schema = etree.XMLSchema(new_schema_doc) + + return self.resolve_string(ETreeXMLSchemaTestCase.resolver_schema_ext, context) + + parser = etree.XMLParser() + parser.resolvers.add(res(self.resolver_schema_int2, self.resolver_schema_ext)) + schema_doc = etree.parse(self.resolver_schema_int, parser = parser) + schema = etree.XMLSchema(schema_doc) + def test_suite(): suite = unittest.TestSuite() suite.addTests([unittest.makeSuite(ETreeXMLSchemaTestCase)]) -- Michael Ballbach, N0ZTQ ballbach at rten.net -- PGP KeyID: 0xA05D5555 http://www.rten.net/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080624/b0633e38/attachment.pgp From stefan_ml at behnel.de Wed Jun 25 22:10:53 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 25 Jun 2008 22:10:53 +0200 Subject: [lxml-dev] schema validation and resolvers In-Reply-To: <20080624230440.GA16587@wayreth.rten.net> References: <20080624014525.GA26170@wayreth.rten.net> <4860A405.7030003@behnel.de> <20080624230440.GA16587@wayreth.rten.net> Message-ID: <4862A64D.2040609@behnel.de> Hi, Michael Ballbach wrote: > How does this look? (Unit tests follow the patch) [patch stripped] Thanks for the patch. Please send non-trivial patches as attachments, that makes them easier to apply. I had to fix it up a little, as I decided not to actually push the parser but only the parser context. If we ever need to propagate more than we store in the context, I guess we'd best augment the context class... Also, the XInclude test case didn't work, so it didn't catch a couple of problems with the resolver integration. Thanks also for the test cases. Could you check if the current trunk works for you? 
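For reference, a minimal sketch of the use case this thread is about: a schema held in memory whose xsd:import is served by a Resolver registered on the document's parser. The namespaces, the `base.xsd` location and the `DictResolver` class are made up for illustration, and the sketch assumes an lxml release that already contains the fix discussed above.

```python
from io import BytesIO
from lxml import etree

# Hypothetical imported schema that exists only in memory, keyed by the
# schemaLocation used in the xsd:import below.
EXTERNAL = {
    "base.xsd": b"""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/base">
  <xsd:simpleType name="Code">
    <xsd:restriction base="xsd:string"/>
  </xsd:simpleType>
</xsd:schema>""",
}

# Hypothetical main schema that imports the one above.
MAIN = b"""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:base="http://example.org/base"
            targetNamespace="http://example.org/main">
  <xsd:import namespace="http://example.org/base" schemaLocation="base.xsd"/>
  <xsd:element name="code" type="base:Code"/>
</xsd:schema>"""


class DictResolver(etree.Resolver):
    """Serve imported schemas from a dict instead of the file system."""

    def __init__(self, documents):
        self.documents = documents
        self.requested = []

    def resolve(self, url, public_id, context):
        self.requested.append(url)
        if url in self.documents:
            return self.resolve_string(self.documents[url], context)
        return None  # let the default lookup handle anything else


parser = etree.XMLParser()
resolver = DictResolver(EXTERNAL)
parser.resolvers.add(resolver)

# The resolver belongs to the parser that read the schema document, so
# the import triggered inside XMLSchema() is routed to it.
schema_doc = etree.parse(BytesIO(MAIN), parser)
schema = etree.XMLSchema(schema_doc)

valid = schema.validate(
    etree.fromstring('<code xmlns="http://example.org/main">abc</code>'))
```

The resolver is attached to a dedicated parser rather than to the default parser, which matches the "more specific resolver" behaviour the test cases in the patch check for.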
Stefan From ballbach at rten.net Wed Jun 25 22:17:28 2008 From: ballbach at rten.net (Michael Ballbach) Date: Wed, 25 Jun 2008 16:17:28 -0400 Subject: [lxml-dev] schema validation and resolvers In-Reply-To: <4862A64D.2040609@behnel.de> References: <20080624014525.GA26170@wayreth.rten.net> <4860A405.7030003@behnel.de> <20080624230440.GA16587@wayreth.rten.net> <4862A64D.2040609@behnel.de> Message-ID: <20080625201728.GA5301@wayreth.rten.net> On Wed, Jun 25, 2008 at 10:10:53PM +0200, Stefan Behnel wrote: > Thanks for the patch. Sure. > Thanks also for the test cases. Could you check if the current trunk > works for you? Seems to, thanks. -- Michael Ballbach, N0ZTQ ballbach at rten.net -- PGP KeyID: 0xA05D5555 http://www.rten.net/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080625/c1686ba6/attachment.pgp From xkenneth at gmail.com Thu Jun 26 06:40:21 2008 From: xkenneth at gmail.com (Kenneth Miller) Date: Wed, 25 Jun 2008 23:40:21 -0500 Subject: [lxml-dev] Mac OS X Status? Message-ID: <5CA8787A-DA97-4103-848B-B5E2EDCCD224@gmail.com> Has support for OS X improved at all? I still find "easy_install lxml" failing. Regards, Kenneth Miller From sergio at sergiomb.no-ip.org Thu Jun 26 07:09:25 2008 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Thu, 26 Jun 2008 06:09:25 +0100 Subject: [lxml-dev] Mac OS X Status? In-Reply-To: <5CA8787A-DA97-4103-848B-B5E2EDCCD224@gmail.com> References: <5CA8787A-DA97-4103-848B-B5E2EDCCD224@gmail.com> Message-ID: <1214456965.14491.7.camel@monteirov> On Wed, 2008-06-25 at 23:40 -0500, Kenneth Miller wrote: > Has support for OS X improved at all? I still find "easy_install > lxml" > failing. > Hi, what is failing? -- Sérgio M.B. -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080626/bef0832a/attachment-0001.bin From eric at infrae.com Thu Jun 26 16:53:43 2008 From: eric at infrae.com (eric casteleijn) Date: Thu, 26 Jun 2008 16:53:43 +0200 Subject: [lxml-dev] easyinstall 2.0.7 does not work Message-ID: <4863AD77.7040305@infrae.com> Hi, when I do: easy_install-2.4 lxml==2.0.7 I get: Searching for lxml==2.0.7 Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.0.7 Downloading http://codespeak.net/lxml/lxml-2.0.7.tgz error: Can't download http://codespeak.net/lxml/lxml-2.0.7.tgz: 404 Not Found I see that there's no download link for 2.0.7 on the codespeak.net/lxml news page, and somehow setuptools ends up looking there. easy_install-2.4 lxml==2.0.6 works just fine (but also takes the download link on the codespeak page.) -- - eric casteleijn http://infrae.com -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080626/e939af98/attachment.pgp From stefan_ml at behnel.de Thu Jun 26 19:15:57 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Jun 2008 19:15:57 +0200 Subject: [lxml-dev] easyinstall 2.0.7 does not work In-Reply-To: <4863AD77.7040305@infrae.com> References: <4863AD77.7040305@infrae.com> Message-ID: <4863CECD.6060702@behnel.de> eric casteleijn wrote: > easy_install-2.4 lxml==2.0.7 > > I get: > > Searching for lxml==2.0.7 > Reading http://pypi.python.org/simple/lxml/ > Reading http://codespeak.net/lxml > Best match: lxml 2.0.7 > Downloading http://codespeak.net/lxml/lxml-2.0.7.tgz > error: Can't download http://codespeak.net/lxml/lxml-2.0.7.tgz: 404 Not > Found Hmmm, interesting, looks like I forgot to upload it... Should be fixed now. 
Stefan From xkenneth at gmail.com Thu Jun 26 21:27:57 2008 From: xkenneth at gmail.com (Kenneth Miller) Date: Thu, 26 Jun 2008 14:27:57 -0500 Subject: [lxml-dev] Mac OS X Status? In-Reply-To: <1214456965.14491.7.camel@monteirov> References: <5CA8787A-DA97-4103-848B-B5E2EDCCD224@gmail.com> <1214456965.14491.7.camel@monteirov> Message-ID: <283C94C6-EFC2-4FF2-9072-FA7478078F05@gmail.com> Here's an error log... http://rafb.net/p/5zNLTj87.html Regards, Ken Thanks for your time! On Jun 26, 2008, at 12:09 AM, Sergio Monteiro Basto wrote: > On Wed, 2008-06-25 at 23:40 -0500, Kenneth Miller wrote: >> Has support for OS X improved at all? I still find "easy_install >> lxml" >> failing. >> > > Hi, what is failing? > -- > Sérgio M.B. From stefan_ml at behnel.de Thu Jun 26 21:35:57 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Jun 2008 21:35:57 +0200 Subject: [lxml-dev] Mac OS X Status? In-Reply-To: <283C94C6-EFC2-4FF2-9072-FA7478078F05@gmail.com> References: <5CA8787A-DA97-4103-848B-B5E2EDCCD224@gmail.com> <1214456965.14491.7.camel@monteirov> <283C94C6-EFC2-4FF2-9072-FA7478078F05@gmail.com> Message-ID: <4863EF9D.7020706@behnel.de> Hi, Kenneth Miller wrote: > Here's an error log... > > http://rafb.net/p/5zNLTj87.html Your log says: > Using build configuration of libxslt 1.1.12 which is too old. Do you have a current libxml2/libxslt installed? http://codespeak.net/lxml/dev/build.html#providing-newer-library-versions-on-mac-os-x Stefan From faassen at startifact.com Thu Jun 26 23:29:26 2008 From: faassen at startifact.com (Martijn Faassen) Date: Thu, 26 Jun 2008 23:29:26 +0200 Subject: [lxml-dev] pickling ElementStringResult Message-ID: Hi there, I just ran into a problem when upgrading an application to use lxml 2.0.x instead of 1.3.6. xpath now returns a special lxml.etree._ElementStringResult object, a smart string. A bit too smart for my situation... Previously, I'd get a string back and I could stuff that into the ZODB just fine - it was picklable.
Now I get this and it can't be pickled, so the same code fails. Is it only xpath that is so affected or are there other operations where these things can be returned? It's quite frustrating to have to worry whether a string came from an lxml xpath (and possibly other operations?) and therefore is broken in a subtle way when you want to store it somewhere. You always worry whether your test coverage is complete enough to detect all cases. It's also relatively hard to track this kind of bug down even if you have good test coverage, as the error only occurs later on when the ZODB tries to pickle the object with this kind of special string somewhere in it. Another potential problem is that a smart string when held in an object database might keep the underlying XML document alive long past its intended lifetime. I was originally thinking about coming up with a special way to pickle these smart strings, but this actually leads me to prefer another solution: a flag that can be passed to xpath() that turns off the returning of smart strings altogether. That would fit my use case quite well, though I'd have to remember to use the flag all the time. It might be nice if the flag could instead be passed to the parser, but I'm not sure whether that is implementable. Regards, Martijn From xkenneth at gmail.com Thu Jun 26 23:50:30 2008 From: xkenneth at gmail.com (Kenneth Miller) Date: Thu, 26 Jun 2008 16:50:30 -0500 Subject: [lxml-dev] LXML and persistence with the ZODB? Message-ID: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> Anyone done anything with this? I know it's simple enough to simply store an XML string, but has anyone bothered with storing lxml objects?
Regards, Ken From stefan_ml at behnel.de Fri Jun 27 07:50:11 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 07:50:11 +0200 Subject: [lxml-dev] pickling ElementStringResult In-Reply-To: References: Message-ID: <48647F93.6080800@behnel.de> Hi, Martijn Faassen wrote: > Previously, I'd get a string back and I could stuff that into the ZODB > just fine - it was picklable. Now I get this and it can't be pickled, so > the same code fails. Is it only xpath that is so affected or are there > other operations where these things can be returned? No, only for XPath results (wherever they occur). > Another potential problem is that a smart string when held in an object > database might keep the underlying XML document alive long past its > intended lifetime. > > I was originally thinking about coming up with a special way to pickle > these smart strings, but this actually leads me to prefer another > solution: a flag that can be passed to xpath() that turns off the > returning of smart strings at all. Wouldn't it be enough to pickle the string subclass as a plain (unicode) string? You would obviously lose information that way, but pickling the string result together with the entire tree would be much more surprising IMHO. Stefan From stefan_ml at behnel.de Fri Jun 27 07:53:33 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 07:53:33 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? In-Reply-To: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> Message-ID: <4864805D.4050701@behnel.de> Hi, Kenneth Miller wrote: > Anyone done anything with this? I know it's simple enough to simply > store an XML string, but has anyone bothered with storing lxml objects? lxml.objectify objects can be pickled. Note this recent thread, which has not reached a decision.
http://comments.gmane.org/gmane.comp.python.lxml.devel/3730 Stefan From faassen at startifact.com Fri Jun 27 11:03:15 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 11:03:15 +0200 Subject: [lxml-dev] pickling ElementStringResult In-Reply-To: <48647F93.6080800@behnel.de> References: <48647F93.6080800@behnel.de> Message-ID: <8928d4e90806270203h6c5cfc1byc8f9b752bec7cfda@mail.gmail.com> Hi there, On Fri, Jun 27, 2008 at 7:50 AM, Stefan Behnel wrote: [snip] > Wouldn't it be enough to pickle the string subclass as a plain (unicode) > string? You would obviously lose information that way, but pickling the > string result together with the entire tree would be much more surprising IMHO. Sorry for being unclear, I'm not suggesting that the entire tree should be pickled. The ZODB has a cache, which is simply some of the "recently touched" Python objects in memory. I have no idea how long the object in question will remain in the ZODB cache (i.e. just a normal Python object in memory). It could be there for hours, days, depending on activity and cache size, etc. The smart string keeps the document it was in awake, possibly way past the expected time, and the document will only be collected if the object is removed from the ZODB cache, which I can't predict very well. Only when the smart string is pickled would the reference with the document be broken. At least, I *think* this is how it works. So, while pickling happens immediately, the object doesn't disappear right away after pickling, keeping this reference alive. I think that in general smart strings behave somewhat unexpectedly in the face of potentially long-running processes. One is inclined to treat them as strings, but their memory behavior is quite different. Regards, Martijn From marius at pov.lt Fri Jun 27 15:57:36 2008 From: marius at pov.lt (Marius Gedminas) Date: Fri, 27 Jun 2008 16:57:36 +0300 Subject: [lxml-dev] LXML and persistence with the ZODB?
In-Reply-To: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> Message-ID: <20080627135736.GB26640@fridge.pov.lt> On Thu, Jun 26, 2008 at 04:50:30PM -0500, Kenneth Miller wrote: > Anyone done anything with this? I know it's simple enough to simply > store an XML string, but has anyone bothered with storing lxml objects? I wouldn't even try. ZODB is tricky to get right, especially if you're trying to store mutable objects not designed to be stored in the ZODB. And if future-compatibility with your old database is important, you must commit to never renaming a class or an attribute, which is hard to do when you rely on an external library without such commitment. Marius Gedminas -- Unix for stability; Macs for productivity; Palm for mobility; Windows for Solitaire -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080627/e3741982/attachment.pgp From faassen at startifact.com Fri Jun 27 17:38:09 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 17:38:09 +0200 Subject: [lxml-dev] XSLT and threading Message-ID: Hi there, We've been trying to do XSLT transformations in a multi-threaded (Zope 2) situation. On 1.3.6 this won't work, but 2.0.x is supposed to have support for this. Unfortunately we're getting memory errors from within XSLT.__call__, and we think this is the problem: transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) if transform_ctxt is NULL: _destroyFakeDoc(input_doc._c_doc, c_doc) python.PyErr_NoMemory() For some reason this sometimes works, sometimes fails; it doesn't always do this, so we suspect perhaps we're already in a copied XSLT sheet (by _copyXSLT) when this happens. We're also wondering about the threading strategy of 2.1.x; the copyXSLT code is removed.
Is there a new strategy? I couldn't find anything about it in CHANGES.txt. I mean, I wouldn't be unhappy with a new strategy as actually re-parsing the stylesheet each time this gets called from a different thread is rather expensive (the stylesheet isn't cached as far as I can see). What's the new strategy, if any? (Unfortunately 2.1.x also gives us an error, though a different one. I don't have this error handy here.) We're trying to reduce this to a simpler test case that demonstrates the problem but we're having a hard time so far. Any hints would be welcome. Regards, Martijn From eric at infrae.com Fri Jun 27 17:42:06 2008 From: eric at infrae.com (eric casteleijn) Date: Fri, 27 Jun 2008 17:42:06 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: References: Message-ID: <48650A4E.6030205@infrae.com> Martijn Faassen wrote: > Hi there, > > We've been trying to do XSLT transformations in a multi-threaded (Zope > 2) situation. On 1.3.6 this won't work, but 2.0.x is supposed to have > support for this. Unfortunately we're getting memory errors from within > XSLT.__call__, and we think this is the problem: > > transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) > if transform_ctxt is NULL: > _destroyFakeDoc(input_doc._c_doc, c_doc) > python.PyErr_NoMemory() > > For some reason this sometimes works, sometimes fails; it doesn't always > do this, so we suspect perhaps we're already in a copied XSLT sheet (by > _copyXSLT) when this happens. > > We're also wondering about the threading strategy of 2.1.x; the copyXSLT > code is removed. Is there a new strategy? I couldn't find anything about > it in CHANGES.txt. I mean, I wouldn't be unhappy with a new strategy as > actually re-parsing the stylesheet each time this gets called from a > different thread is rather expensive (the stylesheet isn't cached as far > as I can see). What's the new strategy, if any? > > (Unfortunately 2.1.x also gives us an error, though a different one. 
I > don't have this error handy here.) > > We're trying to reduce this to a simpler test case that demonstrates the > problem but we're having a hard time so far. Any hints would be welcome. Important to note: We are using a filename resolver for the url of an xsl:import in the setup where these errors occur. -- - eric casteleijn http://infrae.com From faassen at startifact.com Fri Jun 27 17:41:14 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 17:41:14 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? In-Reply-To: <20080627135736.GB26640@fridge.pov.lt> References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> Message-ID: Marius Gedminas wrote: > On Thu, Jun 26, 2008 at 04:50:30PM -0500, Kenneth Miller wrote: >> Anyone done anything with this? I know it's simple enough to simply >> store an XML string, but has anyone bothered with storing lxml objects? > > I wouldn't even try. ZODB is tricky to get right, especially if you're > trying to store mutable objects not designed to be stored in the ZODB. > And if future-compatibility with your old database is important, you > must commit to never renaming a class or an attribute, which is hard to do > when you rely on an external library without such commitment. Actually we're pickling a very well known data structure, XML, not a tree of elements. You could therefore write a pickler that takes advantage of this particular property. You'd only worry about the outer class name at most. That said, it'd still be tricky.
It'd be nice to get a modification notification from lxml for instance, so that the ZODB knows that a tree has been changed and thus needs to be marked dirty so that a new transaction can be committed. There are other tricky issues; if you use custom classes for your elements, you can't get away with just pickling the XML anymore, as you'd lose this information. Regards, Martijn From faassen at startifact.com Fri Jun 27 17:55:06 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 17:55:06 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? In-Reply-To: References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> Message-ID: Martijn Faassen wrote: > Marius Gedminas wrote: >> On Thu, Jun 26, 2008 at 04:50:30PM -0500, Kenneth Miller wrote: >>> Anyone done anything with this? I know it's simple enough to simply >>> store an XML string, but has anyone bothered with storing lxml objects? >> I wouldn't even try. ZODB is tricky to get right, especially if you're >> trying to store mutable objects not designed to be stored in the ZODB. >> And if future-compatibility with your old database is important, you >> must commit to never renaming a class or an attribute, which is hard to do >> when you rely on an external library without such commitment. > > Actually we're pickling a very well known data structure, XML, not a > tree of elements. You could therefore write a pickler that takes > advantage of this particular property. You'd only worry about the outer > class name at most. Checking the lxml.objectify code, it's indeed using this strategy. Since libxml2 serialization and deserialization of XML is pretty fast, this should be okay to use in a pickler. Regards, Martijn From jwashin at vt.edu Fri Jun 27 18:12:50 2008 From: jwashin at vt.edu (Jim Washington) Date: Fri, 27 Jun 2008 12:12:50 -0400 Subject: [lxml-dev] LXML and persistence with the ZODB?
In-Reply-To: References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> Message-ID: <48651182.8040104@vt.edu> Martijn Faassen wrote: > Martijn Faassen wrote: >> Marius Gedminas wrote: >>> On Thu, Jun 26, 2008 at 04:50:30PM -0500, Kenneth Miller wrote: >>>> Anyone done anything with this? I know it's simple enough to simply >>>> store an XML string, but has anyone bothered with storing lxml objects? >>> I wouldn't even try. ZODB is tricky to get right, especially if you're >>> trying to store mutable objects not designed to be stored in the ZODB. >>> And if future-compatibility with your old database is important, you >>> must commit to never renaming a class or an attribute, which is hard to do >>> when you rely on an external library without such commitment. >> Actually we're pickling a very well known data structure, XML, not a >> tree of elements. You could therefore write a pickler that takes >> advantage of this particular property. You'd only worry about the outer >> class name at most. > > Checking the lxml.objectify code, it's indeed using this strategy. Since > libxml2 serialization and deserialization of XML is pretty fast, this > should be okay to use in a pickler. If you are feeling adventuresome (and don't mind running yet another server), zif.sedna.sednaobject is also a future possibility. It's in alpha release and available at the cheese shop. http://zif.svn.sourceforge.net/viewvc/zif/zif.sedna/trunk/src/zif/sedna/README_sednaobject.txt?view=markup - Jim Washington From stefan_ml at behnel.de Fri Jun 27 18:23:46 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 18:23:46 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: References: Message-ID: <48651412.1030403@behnel.de> Hi, Martijn Faassen wrote: > We've been trying to do XSLT transformations in a multi-threaded (Zope > 2) situation.
On 1.3.6 this won't work, Yep, I should really put up a warning somewhere that threading in 1.3.x is buggy in a couple of use cases and should be used with caution. (isn't there one in the FAQ anyway?) 2.1 is much cleaner here, but the latest 2.0 versions should also work in most cases. > Unfortunately we're getting memory errors from within > XSLT.__call__, and we think this is the problem: > > transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) > if transform_ctxt is NULL: > _destroyFakeDoc(input_doc._c_doc, c_doc) > python.PyErr_NoMemory() > > For some reason this sometimes works, sometimes fails; it doesn't always > do this, so we suspect perhaps we're already in a copied XSLT sheet (by > _copyXSLT) when this happens. I think I'll need a test case to see this. Anyway, the _copyXSLT() is really just a work-around in 2.0 up to 2.0.5, which lacked a reliable way of keeping (sub-)trees within thread boundaries, where the dictionary keeping their tag names is defined. lxml 2.1 does this at the end of XSLT.__call__(): if not _checkThreadDict(c_result.dict): # fix document dictionary c_node = _findChildForwards(c_result, 0) if c_node is not NULL: __GLOBAL_PARSER_CONTEXT.initThreadDictRef(&c_result.dict) moveNodeToDocument(result_doc, self._c_style.doc, c_node) I don't remember why 2.0.6 doesn't do this. It shouldn't break anything to backport it - maybe I just left it out at the time because it hasn't been tested very well yet, due to lack of user feedback. That makes the copy work-around appear as a safer choice (especially if you are still running into problems with 2.1beta). > We're also wondering about the threading strategy of 2.1.x; the copyXSLT > code is removed. Is there a new strategy? 2.1 detects it when a subtree is merged into a tree from a different thread and migrates the names stored in the thread dictionary over to the target thread. 
This is a bit more overhead, but you only pay it in multi-threaded environments, where you gain safety in parallel execution. > (Unfortunately 2.1.x also gives us an error, though a different one. I > don't have this error handy here.) Luckily, 2.1 is still in beta, so it would be great to sort this out soon. Stefan From faassen at startifact.com Fri Jun 27 19:14:00 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 19:14:00 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: <48651412.1030403@behnel.de> References: <48651412.1030403@behnel.de> Message-ID: <8928d4e90806271014v61ec6392rfe3ee4d162086400@mail.gmail.com> Hey Stefan, Yeah, we realize you need a test case. Eric worked quite hard to try to isolate the problem, but no luck yet so far. Hopefully we'll be able to tell you more next week. The situation seems to occur when custom resolvers are involved, and it's conceivable it's not really a threading problem at all. Regards, Martijn P.S. If not, we can just personally say "thank you" at EuroPython - Eric is also going to be there. ;) From stefan_ml at behnel.de Fri Jun 27 20:14:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 20:14:12 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: References: Message-ID: <48652DF4.1000404@behnel.de> Hi, Martijn Faassen wrote: > Unfortunately we're getting memory errors from within > XSLT.__call__, and we think this is the problem: > > transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) > if transform_ctxt is NULL: > _destroyFakeDoc(input_doc._c_doc, c_doc) > python.PyErr_NoMemory() Does "memory errors" mean you get an exception or a crash? Maybe there are cases in libxslt where xsltNewTransformContext() can return NULL that do not involve a malloc problem and must be handled differently? If you get a crash, is there any chance you can come up with a valgrind trace? 
Stefan From stefan_ml at behnel.de Fri Jun 27 20:32:29 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 20:32:29 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? In-Reply-To: References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> Message-ID: <4865323D.1050205@behnel.de> Hi, Martijn Faassen wrote: > That said, it'd still be tricky. It'd be nice to get a modification > notification from lxml for instance, so that the ZODB knows that a tree > has been changed and thus needs to be marked dirty so that a new > transaction can be committed. If someone provides code for such a feature, I will consider adding it. However, it must have a negligible performance impact (especially if not used) and meaningful event types. Problems: the public C-API enables access to the tree at the C level, and adding an Element to a different tree removes it from the source tree, so changes can be non-local for a tree. > There are other tricky issues; if you use custom classes for your > elements, you can't get away with just pickling the XML anymore, as > you'd lose this information. I wonder if it isn't enough to put a warning into the docs that pickling does not store all tree state and that unpickling will only work correctly inside the same parser setup that was used for pickling. I would expect that most applications can easily achieve this. We'd only have to find a way to associate a concrete parser setup with an unpickler. 
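One way to picture the "pickle the XML, re-parse with the same parser setup" idea from this thread is the rough sketch below. The `PARSER_FACTORIES` registry and the `dump_tree` / `load_tree` helpers are hypothetical names made up for illustration, not lxml API; the import fallback is only there so the sketch also runs where lxml is not installed. Custom element classes and other parser state are not stored, which is exactly the limitation discussed above.

```python
import pickle

try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical registry: parser setups are looked up by name at load time,
# so the pickle itself never has to contain a parser object.
PARSER_FACTORIES = {"default": etree.XMLParser}


def dump_tree(root, parser_name="default"):
    """Pickle a tree as serialized XML plus the name of its parser setup."""
    state = {"xml": etree.tostring(root), "parser": parser_name}
    return pickle.dumps(state)


def load_tree(blob):
    """Re-parse the stored XML with the parser setup recorded in the pickle."""
    state = pickle.loads(blob)
    parser = PARSER_FACTORIES[state["parser"]]()
    return etree.fromstring(state["xml"], parser)


root = etree.fromstring("<root><a>text</a></root>")
restored = load_tree(dump_tree(root))
print(restored.tag, restored[0].text)
```

Storing a parser name rather than a parser object keeps the pickle small and leaves it to the application to register the same setup before loading; a `__getstate__` / `__setstate__` pair on a wrapper class could call the same two helpers.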
Stefan From faassen at startifact.com Fri Jun 27 20:42:28 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 20:42:28 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: <48652DF4.1000404@behnel.de> References: <48652DF4.1000404@behnel.de> Message-ID: <8928d4e90806271142t15c66c62p79e8fade1a2fcbfb@mail.gmail.com> Hi again, On Fri, Jun 27, 2008 at 8:14 PM, Stefan Behnel wrote: > Martijn Faassen wrote: >> Unfortunately we're getting memory errors from within >> XSLT.__call__, and we think this is the problem: >> >> transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) >> if transform_ctxt is NULL: >> _destroyFakeDoc(input_doc._c_doc, c_doc) >> python.PyErr_NoMemory() > > Does "memory errors" mean you get an exception or a crash? Maybe there are > cases in libxslt where xsltNewTransformContext() can return NULL that do not > involve a malloc problem and must be handled differently? The exception, not a crash. Yes, it looks like that returns NULL for some reason. Possibly the resolvers haven't been set up correctly in the copy and that's why it fails? I think we got a crash though with lxml2., so we can oblige you with both sets of information. :) Unfortunately I don't have a set up here that demonstrates the behavior. Regards, Martijn From faassen at startifact.com Fri Jun 27 20:48:12 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Jun 2008 20:48:12 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? In-Reply-To: <4865323D.1050205@behnel.de> References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> <4865323D.1050205@behnel.de> Message-ID: <8928d4e90806271148n1a3d9addhfd826a9b430c1232@mail.gmail.com> Hey, On Fri, Jun 27, 2008 at 8:32 PM, Stefan Behnel wrote: > Martijn Faassen wrote: >> That said, it'd still be tricky. 
It'd be nice to get a modification >> notification from lxml for instance, so that the ZODB knows that a tree >> has been changed and thus needs to be marked dirty so that a new >> transaction can be committed. > > If someone provides code for such a feature, I will consider adding it. > However, it must have a negligible performance impact (especially if not used) > and meaningful event types. What do you mean by meaningful event types? There's only one relevant event in case of the ZODB: the tree has changed. > Problems: the public C-API enables access to the tree at the C level, and > adding an Element to a different tree removes it from the source tree, so > changes can be non-local for a tree. I think we may be able to ignore the C-API, and just tell people who really want to make sure this works to also manually call invalidate. When an element is added to a different tree I think you still have enough time to mark the original tree dirty though, right? > I wonder if it isn't enough to put a warning into the docs that pickling does > not store all tree state and that unpickling will only work correctly inside > the same parser setup that was used for pickling. I would expect that most > applications can easily achieve this. > > We'd only have to find a way to associate a concrete parser setup with an > unpickler. Tricky.. In the case of the ZODB, we don't control pickling or unpickling directly, or when it happens, and you'd still want to support the case of different parser setups in different cases. Not that I'm in urgent need of this feature, but it'd be very cool if it worked nonetheless, it opens up some interesting possibilities. Regards, Martijn From stefan_ml at behnel.de Fri Jun 27 21:35:31 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 21:35:31 +0200 Subject: [lxml-dev] LXML and persistence with the ZODB? 
In-Reply-To: <8928d4e90806271148n1a3d9addhfd826a9b430c1232@mail.gmail.com> References: <73DDE6A1-1F34-438A-963B-1B61CFA921A1@gmail.com> <20080627135736.GB26640@fridge.pov.lt> <4865323D.1050205@behnel.de> <8928d4e90806271148n1a3d9addhfd826a9b430c1232@mail.gmail.com> Message-ID: <48654103.1040406@behnel.de> Hi, Martijn Faassen wrote: > On Fri, Jun 27, 2008 at 8:32 PM, Stefan Behnel wrote: >> Martijn Faassen wrote: >>> That said, it'd still be tricky. It'd be nice to get a modification >>> notification from lxml for instance, so that the ZODB knows that a tree >>> has been changed and thus needs to be marked dirty so that a new >>> transaction can be committed. >> If someone provides code for such a feature, I will consider adding it. >> However, it must have a negligible performance impact (especially if not used) >> and meaningful event types. > > What do you mean by meaningful event types? There's only one relevant > event in case of the ZODB: the tree has changed. What is a tree? Imagine you have two ElementTree objects wrapping subtrees one of which is contained in the other. Now a subtree of both changes. What event(s) do you expect? And is "tree changed" really enough, or are there applications that can benefit from a distinction between creation, update and removal of Elements? What information is part of an event notification? Should applications be able to efficiently decide if it affects a certain subtree before they react or not? What about just changing the text of an Element? Should the Element create an event or is this also only a tree event? >> Problems: the public C-API enables access to the tree at the C level, and >> adding an Element to a different tree removes it from the source tree, so >> changes can be non-local for a tree. > > I think we may be able to ignore the C-API, and just tell people who > really want to make sure this works to also manually call invalidate. I'm fine with that. 
> When an element is added to a different tree I think you still have > enough time to mark the original tree dirty though, right? Sure, but these things usually involve more issues than you can see before you start implementing them. I have no idea if this has any impact on the source thread of a tree, for example. >> I wonder if it isn't enough to put a warning into the docs that pickling does >> not store all tree state and that unpickling will only work correctly inside >> the same parser setup that was used for pickling. I would expect that most >> applications can easily achieve this. >> >> We'd only have to find a way to associate a concrete parser setup with an >> unpickler. > > Tricky.. In the case of the ZODB, we don't control pickling or > unpickling directly, or when it happens, and you'd still want to > support the case of different parser setups in different cases. Definitely. It's just easier in objectify because there is a comprehensive default setup, so most people won't notice that pickling is not parser-specific. Stefan From stefan_ml at behnel.de Fri Jun 27 21:55:26 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jun 2008 21:55:26 +0200 Subject: [lxml-dev] XSLT and threading In-Reply-To: <8928d4e90806271142t15c66c62p79e8fade1a2fcbfb@mail.gmail.com> References: <48652DF4.1000404@behnel.de> <8928d4e90806271142t15c66c62p79e8fade1a2fcbfb@mail.gmail.com> Message-ID: <486545AE.9040106@behnel.de> Hi, Martijn Faassen wrote: > On Fri, Jun 27, 2008 at 8:14 PM, Stefan Behnel wrote: >> Martijn Faassen wrote: >>> Unfortunately we're getting memory errors from within >>> XSLT.__call__, and we think this is the problem: >>> >>> transform_ctxt = xslt.xsltNewTransformContext(self._c_style, c_doc) >>> if transform_ctxt is NULL: >>> _destroyFakeDoc(input_doc._c_doc, c_doc) >>> python.PyErr_NoMemory() >> Does "memory errors" mean you get an exception or a crash? 
Maybe there are >> cases in libxslt where xsltNewTransformContext() can return NULL that do not >> involve a malloc problem and must be handled differently? > > The exception, not a crash. Yes, it looks like that returns NULL for > some reason. Possibly the resolvers haven't been set up correctly in > the copy and that's why it fails? Hmmm, I wouldn't know what impact custom resolvers could have here... Maybe this is really a memory problem after all? Do you have a chance to check the error log in lxml.etree when this happens? libxml2/libxslt often write a message there when something fails. Stefan From stefan_ml at behnel.de Sat Jun 28 15:39:08 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 28 Jun 2008 15:39:08 +0200 Subject: [lxml-dev] pickling ElementStringResult In-Reply-To: <8928d4e90806270203h6c5cfc1byc8f9b752bec7cfda@mail.gmail.com> References: <48647F93.6080800@behnel.de> <8928d4e90806270203h6c5cfc1byc8f9b752bec7cfda@mail.gmail.com> Message-ID: <48663EFC.7070102@behnel.de> Hi, Martijn Faassen wrote: > The ZODB has a cache, which is simply some of the "recently touched" > Python objects in memory. Why would an XPath string result end up in that cache in the first place? > I think that in general smart strings behave somewhat unexpectedly in the > face of potentially long-running processes. One is inclined to treat > them as strings, but their memory behavior is quite different. True. But the only way I see that would work around this internally is a weak reference - and Elements are not currently weak referenceable. I never tried, but I would imagine that there is an overhead involved in adding a "__weakref__" to the _Element class. IIRC, this adds a dictionary to the class. I could also imagine giving the smart strings a method ".toplainstring()" that would return a plain string value without the parent link. That way, users who want to pass on the string to a potentially long-living place can unlink the string from its parent.
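The memory behaviour described here can be modelled in a few lines of plain Python. This is only a sketch of the idea, not lxml's implementation: `Document` and `SmartString` are made-up stand-ins, and `toplainstring()` is the method name floated above rather than an existing API.

```python
import gc
import weakref


class Document(object):
    """Stand-in for a parsed document that is expensive to keep in memory."""


class SmartString(str):
    """Hypothetical model of a string result that remembers its parent."""

    def __new__(cls, value, parent):
        self = super(SmartString, cls).__new__(cls, value)
        self._parent = parent  # strong reference keeps the document alive
        return self

    def getparent(self):
        return self._parent

    def toplainstring(self):
        # str() of a str subclass instance yields a plain str, without the parent link.
        return str(self)


doc = Document()
doc_ref = weakref.ref(doc)  # lets us observe when the document is freed
result = SmartString("some text", doc)

del doc
gc.collect()
alive_while_smart = doc_ref() is not None  # the smart string still holds the document

plain = result.toplainstring()
del result
gc.collect()
freed_after_unlink = doc_ref() is None  # only the plain string is left
```

The point of the model is that the plain copy compares equal to the smart string but carries no reference back to the document, so it is safe to hand to a long-lived cache.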
Your proposal of configuring this behaviour on a parser (XML parser, not XPath parser) isn't impossible either, since we already pass a _Document (with a parser reference) into the XPath value unpacker. But I'm not convinced that that is the right place for such an option. Doing that in the XPath class looks harder at first sight. Stefan From vik.list.nutch at gmail.com Sun Jun 29 11:34:02 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Sun, 29 Jun 2008 02:34:02 -0700 Subject: [lxml-dev] Possible bug in DOM tree iteration? Message-ID: <4867570A.2000202@gmail.com> Hi all, I'm running some tests on a page's DOM tree by assigning each element a unique identifier and then doing some analysis using this. I use code similar to, root = bs.fromstring(txtcontent) self.pagetree = etree.iterwalk(root, events=("start",)) for event, element in self.pagetree: element.attrib['uid'] = str(cnt) cnt = cnt + 1 etc. However, I notice that when iterating through the DOM, on text such as the following: --

This is something here which has some more text here and repeats here again for this statement and some more text here that doesn't have any tags at all.

-- The uid is assigned only to the P and the first B, but everything after is left untouched. So, the word "statement" is never assigned an Id. Moreover, I'm not sure how to access the rest of the text under the P tag. When iterating through the tree, shouldn't the other tags be included too, and shouldn't the text for the P element contain ALL the text in there, including the b tags? If this is intended behavior, could someone point me to how I could achieve accessing all the other text under the P tag, as well as the B tags in there? Cheers Viksit From stefan_ml at behnel.de Mon Jun 30 09:44:27 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 30 Jun 2008 09:44:27 +0200 Subject: [lxml-dev] Possible bug in DOM tree iteration? In-Reply-To: <4867570A.2000202@gmail.com> References: <4867570A.2000202@gmail.com> Message-ID: <48688EDB.4090207@behnel.de> Hi, Viksit Gaur wrote: > I'm running some tests on a page's DOM tree by assigning each element a > unique identifier and then doing some analysis using this. I use code > similar to, > > root = bs.fromstring(txtcontent) > self.pagetree = etree.iterwalk(root, events=("start",)) > for event, element in self.pagetree: > element.attrib['uid'] = str(cnt) > cnt = cnt + 1 I guess it's really only similar to the above, as this code works just fine for the HTML snippet you present below. > I'm not sure how to access the rest of the text under the > P tag. When iterating through the tree, shouldn't the other tags be > included too, and shouldn't the text for the P element contain ALL > the text in there, including the b tags? Please read the tutorial on Elements containing text. Stefan From terry_n_brown at yahoo.com Mon Jun 30 20:20:20 2008 From: terry_n_brown at yahoo.com (Terry Brown) Date: Mon, 30 Jun 2008 13:20:20 -0500 Subject: [lxml-dev] schema validation - what would be valid at this point? Message-ID: <20080630132020.2672f42b@nrri.umn.edu> Hi all, First thanks for lxml - it's great.
If this is the wrong place for feature requests please point me elsewhere... A feature that would be very useful for a number of applications, particularly editors: the ability to ask "what attributes / elements would be valid at this point?" where I guess a point would be identified by an XPath or perhaps an Element object. I don't know if any of the libxml2 schema APIs expose this, if they don't, that's probably where the change needs to be made first, but if they do it would be great if lxml passed it on to python. Thanks, Terry From jlovell at esd189.org Mon Jun 30 20:49:50 2008 From: jlovell at esd189.org (John Lovell) Date: Mon, 30 Jun 2008 11:49:50 -0700 Subject: [lxml-dev] schema validation - what would be valid at this point? In-Reply-To: <20080630132020.2672f42b@nrri.umn.edu> References: <20080630132020.2672f42b@nrri.umn.edu> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2389@ZIRIA.esd189.org> Can you give an example of what that might look like? John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.nwesd.org Together We Can ... -----Original Message----- From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Terry Brown Sent: Monday, June 30, 2008 11:20 AM To: lxml-dev at codespeak.net Subject: [lxml-dev] schema validation - what would be valid at this point? Hi all, First thanks for lxml - it's great. If this is the wrong place for feature requests please point me elsewhere... A feature that would be very useful for a number of applications, particularly editors: the ability to ask "what attributes / elements would be valid at this point?" where I guess a point would be identified by an XPath or perhaps an Element object. 
I don't know if any of the libxml2 schema APIs expose this, if they don't, that's probably where the change needs to be made first, but if they do it would be great if lxml passed it on to python. Thanks, Terry From jlovell at esd189.org Mon Jun 30 22:02:34 2008 From: jlovell at esd189.org (John Lovell) Date: Mon, 30 Jun 2008 13:02:34 -0700 Subject: [lxml-dev] schema validation - what would be valid at this point? In-Reply-To: <20080630142238.6c8eb242@nrri.umn.edu> References: <20080630132020.2672f42b@nrri.umn.edu><3A49C88789256B4AB33AC603DB6AF49B011A2389@ZIRIA.esd189.org> <20080630142238.6c8eb242@nrri.umn.edu> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A238B@ZIRIA.esd189.org> So with schema snippets that look like this, what would it give you as valid input options? John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.nwesd.org Together We Can ... -----Original Message----- From: Terry Brown [mailto:tbrown at nrri.umn.edu] Sent: Monday, June 30, 2008 12:23 PM To: John Lovell Subject: Re: [lxml-dev] schema validation - what would be valid at this point? On Mon, 30 Jun 2008 11:49:50 -0700 "John Lovell" wrote: > Can you give an example of what that might look like? What I'd really like to have is the functionality of Emacs's nxml-mode via a Python API. That mode parses the text up to the insert point, and can then tell you what elements would be valid there, according to any RELAX-NG schema you specify. So really I guess it makes more sense to be able to get the answer at parse time, rather than when the tree's complete. If, so far, the parser has seen "" the answer to the question "what's valid at this point?" would be "base, isindex, link, meta, script, style, title".
I don't know if the libxml2 schema API will answer that question, if not then it's more a libxml2 feature request (initially :) rather than an lxml thing. Just using HTML as an example, you could answer in terms of any user supplied schema. Sorry my initial description was confused because I was thinking about already complete and valid trees to which you might add something, it would be nice to handle that case too, but it seems to make more sense to approach it from the incomplete, half parsed :-) tree point of view first. Cheers -Terry > John W. Lovell > Web Applications Engineer > Northwest Educational Service District > 1601 R Avenue > Anacortes, WA 98221 > (360) 299-4086 > jlovell at nwesd.org > > www.nwesd.org > Together We Can ... > > > -----Original Message----- > From: lxml-dev-bounces at codespeak.net > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Terry Brown > Sent: Monday, June 30, 2008 11:20 AM > To: lxml-dev at codespeak.net > Subject: [lxml-dev] schema validation - what would be valid at this > point? > > Hi all, > > First thanks for lxml - it's great. > > If this is the wrong place for feature requests please point me > elsewhere... > > A feature that would be very useful for a number of applications, > particularly editors: the ability to ask "what attributes / elements > would be valid at this point?" where I guess a point would be > identified by an XPath or perhaps an Element object. > > I don't know if any of the libxml2 schema APIs expose this, if they > don't, that's probably where the change needs to be made first, but if > they do it would be great if lxml passed it on to python.
> > Thanks, > > Terry > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > ----------------------------------------------------------------------- Terry Brown Natural Resources Research Institute Research Associate 5013 Miller Trunk Highway tbrown at nrri.umn.edu University of Minnesota, Duluth Ph. 218 720 4345 Duluth, Minnesota 55811 Fax 218 720 4328 http://beaver.nrri.umn.edu/~tbrown/ From eric at ejahn.net Mon Jun 30 22:11:37 2008 From: eric at ejahn.net (Eric Jahn) Date: Mon, 30 Jun 2008 20:11:37 +0000 Subject: [lxml-dev] namespace strangeness in lxml 1.1 Message-ID: <1214856697.868.38.camel@localhost.localdomain> Using python-lxml 1.1.1-1 which comes with Debian Etch, I am noticing some odd behavior in how lxml handles attribute namespaces, and namespaces in general. In the following code, all actual namespace urls should be mapped to namespaces in the resulting output, and they are, except if an attribute is involved: type="{http://domain2.info}someattribute Is this an lxml/libxml2 bug? The second odd behavior is that, for subelements, all the namespace declarations are automatically redeclared in the subelement tag, which is redundant. Does anyone know how to get around these problems? from lxml import etree NS1_NAMESPACE = "http://domain1.info" NS2_NAMESPACE = "http://domain2.info" NS1 = "{%s}" % NS1_NAMESPACE NS2 = "{%s}" % NS2_NAMESPACE NSMAP = {"NS1" : NS1_NAMESPACE , "NS2" : NS2_NAMESPACE} root = etree.Element(NS1 + "firstelement", nsmap=NSMAP) element = etree.Element(NS2 + "secondelement", nsmap=NSMAP, type = NS2 + "someattribute") root.append(element) print etree.tostring(root,pretty_print=True) Which results in the following output: From terry_n_brown at yahoo.com Mon Jun 30 22:22:44 2008 From: terry_n_brown at yahoo.com (Terry Brown) Date: Mon, 30 Jun 2008 15:22:44 -0500 Subject: [lxml-dev] schema validation - what would be valid at this point? 
In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A238B@ZIRIA.esd189.org> References: <20080630132020.2672f42b@nrri.umn.edu> <3A49C88789256B4AB33AC603DB6AF49B011A2389@ZIRIA.esd189.org> <20080630142238.6c8eb242@nrri.umn.edu> <3A49C88789256B4AB33AC603DB6AF49B011A238B@ZIRIA.esd189.org> Message-ID: <20080630152244.5fe10dcb@nrri.umn.edu> Display a list of approx. 3.40282e+38 possible codes and ask you to select one. :-) I don't know what nxml-mode would do, it's RELAX-NG based and I don't know if RELAX-NG supports patterns like that. I guess there's no real need to support offering options for attribute values (or text() node values), although nxml-mode will offer "http://www.w3.org/1999/xhtml" as a value for @xmlns in . If it only generated lists of valid element names and attribute names it would still be very helpful. And like I said, I'm only suggesting that if the libxml2 schema API exposes this it would be great if lxml could pass it on. At some point the validating parser must generate a list of valid element names in order to check that the current element is in that list. Granted you've shown that for attribute values there may be no list, just a match pattern, but I'm sure for element and attribute names there would be a list. Cheers -Terry On Mon, 30 Jun 2008 13:02:34 -0700 "John Lovell" wrote: > So with schema snippets that look like this, what would it give you as > valid input options? > > > > > > > > > > > > xmlns:xs="http://www.w3.org/2001/XMLSchema" /> > > > > > > John W. Lovell > Web Applications Engineer > Northwest Educational Service District > 1601 R Avenue > Anacortes, WA 98221 > (360) 299-4086 > jlovell at nwesd.org > > www.nwesd.org > Together We Can ... > > -----Original Message----- > From: Terry Brown [mailto:tbrown at nrri.umn.edu] > Sent: Monday, June 30, 2008 12:23 PM > To: John Lovell > Subject: Re: [lxml-dev] schema validation - what would be valid at > this point? 
> > On Mon, 30 Jun 2008 11:49:50 -0700 > "John Lovell" wrote: > > > Can you give an example of what that might look like? > > What I'd really like to have is the functionality of Emac's nxml-mode > via a Python API. That mode parses the text up to the insert point, > and can then tell you what elements would be valid there, according > to any RELAX-NX schema you specify. > > So really I guess it makes more sense to be able to get the answer at > parse time, rather than when the tree's complete. > > If, so far, the parser has seen "" the answer to the > question "what's valid at this point?" would be "base, isindex, link, > meta, script, style, title". I don't know if the > libxml2 schema API will answer that question, if not then it's more a > libxml2 feature request (initally :) rather than an lxml thing. Just > using HTML as an example, you could answer in terms of any user > supplied schema. > > Sorry my initial description was confused because I was thinking about > already complete and valid trees to which you might add something, it > would be nice to handle that case too, but it seems to make more sense > to approach it from the incomplete, half parsed :-) tree point of view > first. > > Cheers -Terry > > > > John W. Lovell > > Web Applications Engineer > > Northwest Educational Service District > > 1601 R Avenue > > Anacortes, WA 98221 > > (360) 299-4086 > > jlovell at nwesd.org > > > > www.nwesd.org > > Together We Can ... > > > > > > -----Original Message----- > > From: lxml-dev-bounces at codespeak.net > > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Terry Brown > > Sent: Monday, June 30, 2008 11:20 AM > > To: lxml-dev at codespeak.net > > Subject: [lxml-dev] schema validation - what would be valid at this > > point? > > > > Hi all, > > > > First thanks for lxml - it's great. > > > > If this is the wrong place for feature requests please point me > > elsewhere... 
> > > > A feature that would be very useful for a number of applications, > > particularly editors: the ability to ask "what attributes / > > elements would be valid at this point?" where I guess a point would > > be identified by an XPath or perhaps an Element object. > > > > I don't know if any of the libxml2 schema APIs expose this, if they > > don't, that's probably where the change needs to be made first, but > > if > > > they do it would be great if lxml passed it on to python. > > > > Thanks, > > > > Terry > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > > ----------------------------------------------------------------------- > Terry Brown Natural Resources Research Institute > Research Associate 5013 Miller Trunk Highway > tbrown at nrri.umn.edu University of Minnesota, Duluth > Ph. 218 720 4345 Duluth, Minnesota 55811 > Fax 218 720 4328 http://beaver.nrri.umn.edu/~tbrown/ > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From eric at ejahn.net Mon Jun 30 22:28:10 2008 From: eric at ejahn.net (Eric Jahn) Date: Mon, 30 Jun 2008 16:28:10 -0400 Subject: [lxml-dev] namespace strangeness in lxml 1.1 In-Reply-To: <1214856697.868.38.camel@localhost.localdomain> References: <1214856697.868.38.camel@localhost.localdomain> Message-ID: <1214857690.868.42.camel@localhost.localdomain> On Mon, 2008-06-30 at 20:11 +0000, Eric Jahn wrote: > ...The second odd behavior is that, for > subelements, all the namespace declarations are automatically redeclared > in the subelement tag, which is redundant. 
I see that this second error goes away when I use

child1 = etree.SubElement(root, NS2 + "secondelement", nsmap=NSMAP, type = NS2 + "someattribute")

instead of

root = etree.Element(NS1 + "firstelement", nsmap=NSMAP)
element = etree.Element(NS2 + "secondelement", nsmap=NSMAP, type = NS2 + "someattribute")
root.append(element)

-Eric

From eric at ejahn.net Mon Jun 30 22:49:38 2008
From: eric at ejahn.net (Eric Jahn)
Date: Mon, 30 Jun 2008 16:49:38 -0400
Subject: [lxml-dev] namespace strangeness in lxml 1.1
In-Reply-To: <1214857690.868.42.camel@localhost.localdomain>
References: <1214856697.868.38.camel@localhost.localdomain> <1214857690.868.42.camel@localhost.localdomain>
Message-ID: <1214858978.868.44.camel@localhost.localdomain>

I just confirmed that this problem still occurs in lxml 2.1beta3.

-Eric
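
A minimal sketch of the attribute half of the report above, under one loud assumption: it uses the standard library's xml.etree.ElementTree, not lxml, so that it runs without extra packages. It therefore says nothing about what lxml 1.1.1 or 2.1beta3 actually print. In ElementTree, a "{uri}name" string is treated as a qualified name only in tag and attribute names; as an attribute value it is plain text and is written out unchanged, while a QName object used as the value is resolved to prefix:name on serialization. The build() helper is hypothetical and exists only for this illustration.

```python
import xml.etree.ElementTree as ET

NS1_NAMESPACE = "http://domain1.info"
NS2_NAMESPACE = "http://domain2.info"

# Register the prefixes from the report so the serializer uses them
# instead of generated ones (stdlib API, assumed here in place of nsmap).
ET.register_namespace("NS1", NS1_NAMESPACE)
ET.register_namespace("NS2", NS2_NAMESPACE)


def build(type_value):
    """Hypothetical helper: build the two-element tree and serialize it."""
    root = ET.Element("{%s}firstelement" % NS1_NAMESPACE)
    child = ET.SubElement(root, "{%s}secondelement" % NS2_NAMESPACE)
    child.set("type", type_value)
    return ET.tostring(root, encoding="unicode")


# A Clark-notation *string* as attribute value is ordinary text:
# it comes out as type="{http://domain2.info}someattribute".
literal = build("{%s}someattribute" % NS2_NAMESPACE)

# A QName *object* as attribute value is resolved by the serializer:
# it comes out as type="NS2:someattribute".
resolved = build(ET.QName(NS2_NAMESPACE, "someattribute"))

print(literal)
print(resolved)
```

Where the prefix is already fixed by the namespace map, as with NSMAP in the report, writing the value by hand as "NS2:someattribute" is another option, since the value is only a string either way.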