From jholg at gmx.de Tue Oct 2 17:58:52 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 02 Oct 2007 17:58:52 +0200 Subject: [lxml-dev] trunk schematron tests core dump (was: annotate, pyannotate, xsiannotate) In-Reply-To: <20070921142905.315500@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> <46F37F62.40307@behnel.de> <20070921092340.311080@gmx.net> <20070921142905.315500@gmx.net> Message-ID: <20071002155852.130440@gmx.net> Hi, > > > Schematron uses XPath a lot, so I wouldn't be surprised if this was > > > related to > > > the XPath bug in libxml2 2.6.27. Is there any chance you could switch > to > [...] > Unfortunately, using the latest & greatest libxml2/libxslt (2.6.33/1.1.22) > doesn't solve the problem for me. I'm trying to get some sensible information but have real problems with debugging, as I'm seeing line number information that is just plain wrong, though compiling with debugging on and everything, the likes of: (gdb) info source Current source file is src/lxml/etree.c Compilation directory is /home/lb54320/pydev/LXML/lxml/ Located in /home/lb54320/pydev/LXML/lxml/src/lxml/etree.c Contains 90795 lines. Source language is c. Compiled with stabs debugging format. (gdb) b etree.c:70850 No line 70850 in file "src/lxml/etree.c". (gdb) No idea what I'm doing wrong here, at the moment. So the info on the crash does not get much better than that backtrace at the moment: Program received signal SIGSEGV, Segmentation fault. 
0xff0b3218 in strlen () from /usr/lib/libc.so.1 (gdb) bt #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 #3 0xfe23df04 in __xmlRaiseError () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 #4 0xfe3e717c in xmlSchematronPErr () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 #5 0xfe3e9878 in xmlSchematronParse () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 #6 0xfe68dfdc in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x1b30f0, __pyx_args=0x1db670, __pyx_kwds=0x0) at src/lxml/etree.c:5663 What I can see, though, is that using the same schematron schema with xmllint does not crash: 0 $ cat invalid_empty.xst 0 $ python2.4 -i -c 'from lxml import etree; print etree.LIBXML_VERSION; schema = etree.Schematron(etree.parse("invalid_empty.xst"))' (2, 6, 30) Segmentation Fault (core dumped) whereas $ /apps/pydev/bin/xmllint --schematron invalid_empty.xst foo.xml --version /apps/pydev/bin/xmllint: using libxml version 20630 compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib invalid_empty.xst:1: element schema: Schemas parser error : The schematron document 'invalid_empty.xst' has no pattern Schematron schema invalid_empty.xst failed to compile Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. 
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail From jg307 at cam.ac.uk Tue Oct 2 22:33:06 2007 From: jg307 at cam.ac.uk (James Graham) Date: Tue, 02 Oct 2007 21:33:06 +0100 Subject: [lxml-dev] Tag name validation and HTML In-Reply-To: <46FBBEBD.7030308@behnel.de> References: <46FBA0D3.6010700@cam.ac.uk> <46FBBEBD.7030308@behnel.de> Message-ID: <4702AB02.6080300@cam.ac.uk> Stefan Behnel wrote: > James Graham wrote: >> The development branch of lxml 2 appears to restrict the characters that may >> appear in a tag name. Whilst this may be appropriate for XML, it does not match >> the behavior of all common HTML UAs and, as such, does not match the current >> draft of the HTML 5 spec [1]. > > This is actually not as simple as it might seem. The Element factory cannot > distinguish between XML and HTML tags, so it cannot switch off validation for > a particular tag. So the conservative solution would be to actually follow the > HTML5 spec, as it is a superset of the XML spec, an extremely broad one even. > But then there's not much left that you could honestly call validation. Also, > I would still want to restrict ":" in tag names, as this has been a source of > problems way too often. So that would just leave spaces and any of ":/>" as > invalid characters in tag names. The : thing is difficult because HTML UAs are expected to deal with : in the tag name and there is content in the wild that depends on this being accepted; MS Office produces "HTML" containing tags like , for example. Since I, and I guess others too, want to use lxml to process random content that may have colons in the tag names, hard failure for this case is a problem. To make matters worse it is possible that the HTML spec will change in the future to introduce some sort of namespacing feature which may or may not use colons. Given all of this I would prefer it if it were possible to have an HTML-specific mode with much more liberal rules than the XML mode. 
This could then be adapted to support any namespacing features HTML grows in the future. For example, if one could do something like import lxml.html lxml.html.Element("o:p") where lxml.html.Element would be just like lxml.etree.Element but without XML-specific validity checks. I guess there might be serious practical difficulties with that exact solution, but I think the general idea of being able to flag an element as following HTML rules or XML rules would be more user-friendly than having a set of rules that neither matches the XML nor the HTML model correctly. -- "Mixed up signals Bullet train People snuffed out in the brutal rain" --Conner Oberst From l.oluyede at gmail.com Wed Oct 3 15:31:16 2007 From: l.oluyede at gmail.com (Lawrence Oluyede) Date: Wed, 3 Oct 2007 15:31:16 +0200 Subject: [lxml-dev] Namespace serialization patch Message-ID: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> I had the same problem Anders Bruun Olsen had in this thread: http://comments.gmane.org/gmane.comp.python.lxml.devel/2924 What I'd like to know if I have to wait for 2.0 completion (using the alpha is not an option AFAIK) to use it or you plan to release an interim 1.3.x version with that patch applied. 
Thanks -- Lawrence, oluyede.org - neropercaso.it "It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair From anders at bruun-olsen.net Wed Oct 3 20:28:48 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 03 Oct 2007 20:28:48 +0200 Subject: [lxml-dev] Namespace serialization patch In-Reply-To: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> Message-ID: <4703DF60.4040108@bruun-olsen.net> Lawrence Oluyede wrote: > I had the same problem Anders Bruun Olsen had in this thread: > http://comments.gmane.org/gmane.comp.python.lxml.devel/2924 > What I'd like to know if I have to wait for 2.0 completion (using the > alpha is not an option AFAIK) to use it or you plan to release an > interim 1.3.x version with that patch applied. Building LXML from SVN is really rather straightforward and of course includes the fixes for that particular problem as well as others. See the download page for instructions on building from SVN. -- Anders From l.oluyede at gmail.com Wed Oct 3 21:47:19 2007 From: l.oluyede at gmail.com (Lawrence Oluyede) Date: Wed, 3 Oct 2007 21:47:19 +0200 Subject: [lxml-dev] Namespace serialization patch In-Reply-To: <4703DF60.4040108@bruun-olsen.net> References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> <4703DF60.4040108@bruun-olsen.net> Message-ID: <9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com> > Building LXML from SVN is really rather straightforward and of course > includes the fixes for that particular problem as well as others. > See the download page for instructions on building from SVN. I, personally, don't have a problem with that but AFAIK at work using the SVN version is a lesser option than using the 2.0alpha. 
-- Lawrence, oluyede.org - neropercaso.it "It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair

From mwm-keyword-lxml.9112b8 at mired.org Wed Oct 3 22:50:16 2007
From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer)
Date: Wed, 3 Oct 2007 16:50:16 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
Message-ID: <20071003165016.104d2caf@bhuda.mired.org>

I'm getting crashes - by which I mean the python process is segfaulting and, with some tweaking of GNU/Linux, leaving me a core file - while using lxml to parse data.

Versions:

OS: RHEL 5
Python: 2.5.1 (custom built).
lxml: 1.3.3
libxml: 2.6.26 (both compiled and built)
libxslt: 1.1.17

[Yes, I know those are a bit out of date, but we had to give our client host requirements months ago, and those were current at the time, and changing them is a non-trivial process, and I've already started on it, but I'd rather not do that if I can avoid it....]

Rebuilding python with OPTS=-g (I set that for the lxml build as well), I can get a "where" output that points at lxml:

#0 0x00002aaaaf906c3a in rename () from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#1 0x00002aaaaf906be7 in rename () from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#2 0x00002aaaaf8ebdfe in rename () from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#3 0x00002aaaaf966a5c in findOrBuildNodeNs () from /usr/local/lib/python2.5/site-packages/lxml/etree.so

The first problem is that this isn't repeatable. I've got test data that will make it happen, but I have to feed that data through the system a few thousand times. This is part of a database ETL system, parsing data from the XML to load into the database. If I feed it the exact same data over and over again, it'll work 9999 times out of ten thousand - but then fail that ten-thousandth time with a segfault.
While this might not seem like a big deal, we're planning on processing hundreds of thousands of documents a day, so we're talking about having an instance of the process die tens of times a day. So I sorta need to fix it. The document is straightforward: it starts with a meta element with a set of attributes, and then has a lot of data elements, all the same type, all with the same attributes (give or take an optional one), and I just use document.xpath to find the elements, and then read off their attribute values to save to a database load file. Hints on how to proceed - setting things up so I can use gdb on the lxml sources, for instance - would be greatly appreciated. If this looks like a bug that's been fixed if I update one or more libraries, that would be great information (i.e. - I can use it to get all the libraries updated). Anything else that you think I oughta know would be nice as well. The sample document is almost half a megabyte (and might be proprietary). If you'd like to look at it, drop me a line. thanks, http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. From lists.steve at arachnedesign.net Thu Oct 4 00:50:50 2007 From: lists.steve at arachnedesign.net (Steve Lianoglou) Date: Wed, 3 Oct 2007 18:50:50 -0400 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: <20071003165016.104d2caf@bhuda.mired.org> References: <20071003165016.104d2caf@bhuda.mired.org> Message-ID: <7BFF4FC4-72EC-419C-A1A6-0C333F435B09@arachnedesign.net> > I'm getting crashes - by which I mean the python process is > segfaulting and, with some tweaking of GNU/Linux, leaving me a core > file - while using lxml to parse data. > > Versions: > > OS: RHEL 5 > Python: 2.5.1 (custom built). > lxml: 1.3.3 > libxml: 2.6.26 (both compiled and built) > libxslt: 1.1.17 As an aside (addendum?, whatever ..) I recently got nailed w/ segfaults and bus errors that seemed to not be 100% reproducible on OS X. 
I built lxml against:

libxml 2.6.30
libxslt 1.1.22
python2.5.1 (and python2.4.4)
lxml 1.3.4
(all using MacPorts)

My code was basically generating large(-ish -- though really not much bigger than 4 megs or so) documents like so (inspired by ElementTree examples):

    import lxml.etree as ET
    root = ET.Element('graph', **root_attribs)
    # keep a reference to the new node so that children can be attached to it
    node = ET.SubElement(root, 'node', id='something', label=name)
    ET.SubElement(node, 'att', name='pvalue', type='real', value=pval)
    ...

The nesting level wouldn't ever really go more than 3 or 4 children deep.

Anyway, I know there was talk about lxml crashing w/ the default OS X xml libs, but here's the case when I'm using the newer ones.

I don't know if this is the same issue as Mike's having, but since this just happened to me and I haven't been able to smoke it out, I'm bringing it up here (in the meantime I've switched to elementtree and the same code works fine, if slower).

I will try to create a minimal test case after one of my deadlines passes to help smoke this out better (also, I don't know if the minimal test case will help; is it possible that it's a function of the size of the xml doc that I'm trying to build?)

Thanks,
-steve

From etiffany at alum.mit.edu Thu Oct 4 01:43:01 2007
From: etiffany at alum.mit.edu (Eric Tiffany)
Date: Wed, 03 Oct 2007 19:43:01 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <7BFF4FC4-72EC-419C-A1A6-0C333F435B09@arachnedesign.net>
Message-ID:

On OS X, you might actually be using the system libs rather than the newer libs (in /opt/local/lib, if you are using MacPorts, for example). I had lots of segfault problems until I realized that even though lxml was claiming it was running with the newer libs, the info was only based on what it was built with. At least, that's what it seemed like.

Anyway, all my (segfault) problems went away when I exported

DYLD_LIBRARY_PATH=/opt/local/lib

into the environment where python was running.
Actually, python was running zope/plone, but I think this problem could be similar to yours. ET On 10/3/07 6:50 PM, "Steve Lianoglou" wrote: >> I'm getting crashes - by which I mean the python process is >> segfaulting and, with some tweaking of GNU/Linux, leaving me a core >> file - while using lxml to parse data. >> >> Versions: >> >> OS: RHEL 5 >> Python: 2.5.1 (custom built). >> lxml: 1.3.3 >> libxml: 2.6.26 (both compiled and built) >> libxslt: 1.1.17 > > As an aside (addendum?, whatever ..) I recently got nailed w/ > segfaults and bus errors that seemed to not be 100% reproducible on > OS X. > > I built lxml against: > > libxml 2.6.30 > libxslt 1.1.22 > python2.5.1(and python2.4.4) > lxml 1.3.4 > (all using MacPorts) > > My code was basically generating large(-ish -- though really not much > bigger than 4 megs or so) documents like so (inspired from > ElementTree examples): > > import lxml.etree as ET > root = ET.Element('graph', **root_attribs) > ET.SubElement(root, 'node', id='something', label=name) > ET.SubElement(node, 'att', name='pvalue', type='real', value=pval) > ... > > The nesting level wouldn't ever really go more than 3 or 4 children > deep. > > Anyway, I know there was talk about lxml crashing w/ the default OS X > xml libs, but here's the case when I'm using the newer ones. > > I don't know if this is the same issue as Mike's having, but since > this just happened to me and I haven't been able to smoke it out, I'm > bringing it up here (in the meantime I've switched to elementtree and > the same code works fine (if not slower)). > > I will try to create a minimal test case after one of my deadlines > pass to help smoke this out better(also, I don't know if the minimal > test case will help, is it possible that it's a function of the size > of the xml doc that I'm trying to build?) 
> > Thanks, > -steve > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- ________________________________________________ Eric Tiffany | +1 413-458-3743 etiffany at alum.mit.edu | +1 413-627-1778 mobile From lists.steve at arachnedesign.net Thu Oct 4 01:50:11 2007 From: lists.steve at arachnedesign.net (Steve Lianoglou) Date: Wed, 3 Oct 2007 19:50:11 -0400 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: References: Message-ID: <9AD07B43-4E11-43A2-AE9E-6E7060F8A5F2@arachnedesign.net> > On OS X, you might actually be using the system libs rather than > the newer > libs (in /opt/local/lib, if you are using MacOSPorts, for > example). I had > lots of segfault problems until I realized that even though lxml was > claiming it was running with the newer libs, the info was only > based on what > it was built with. At least, that's what it seemed like. > > Anyway, all my (segfault) problems went away when I exported > > DYLD_LIBRARY_PATH=/opt/local/lib > > Into the environment where python was running. Hmm .. interesting. I was playing with DYLD_LIBRARY_PATH, but I thought that had to be set during compile time (of lxml). Even though ... through my hunting on the intarweb, I came across a suggestion to use `otool` to see what libs were being used. 
So I tried like so: $ otool -L /opt/local/Library/Frameworks/Python.framework/Versions/ Current/lib/python2.4/site-packages/lxml/etree.so /opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/ python2.4/site-packages/lxml/etree.so: /opt/local/lib/libxslt.1.dylib (compatibility version 3.0.0, current version 3.22.0) /opt/local/lib/libexslt.0.dylib (compatibility version 9.0.0, current version 9.13.0) /opt/local/lib/libxml2.2.dylib (compatibility version 9.0.0, current version 9.30.0) /opt/local/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.3) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.9) The fact that the xml libs in /opt/local were the ones being referenced made me think that those are the ones it would use ... is that right? Looking at that closer, I do see ``/usr/lib/ libSystem.B.dylib``which is OS X default, but honestly don't know what it's responsible for ... -steve From etiffany at alum.mit.edu Thu Oct 4 04:06:11 2007 From: etiffany at alum.mit.edu (Eric Tiffany) Date: Wed, 03 Oct 2007 22:06:11 -0400 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: <9AD07B43-4E11-43A2-AE9E-6E7060F8A5F2@arachnedesign.net> Message-ID: Check the man page for dyld, which notes DYLD_LIBRARY_PATH This is a colon separated list of directories that contain libraries. The dynamic linker searches these directories before it searches the default locations for libraries. It allows you to test new versions of existing libraries. For each library that a program uses, the dynamic linker looks for it in each directory in DYLD_LIBRARY_PATH in turn. If it still can't find the library, it then searches DYLD_FALL- BACK_FRAMEWORK_PATH and DYLD_FALLBACK_LIBRARY_PATH in turn. Use the -L option to otool(1). to discover the frameworks and shared libraries that the executable is linked against. 
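
The build-time versus run-time distinction discussed here can also be checked from inside Python, by comparing the two libxml2 version tuples that lxml.etree exposes. A minimal sketch, assuming `LIBXML_VERSION` (used earlier in this archive) reports the library loaded at run time and `LIBXML_COMPILED_VERSION` reports the one lxml was built against, as in current lxml releases; the helper names `describe_versions` and `lxml_libxml2_report` are made up for illustration:

```python
def _dotted(version):
    """Turn a version tuple such as (2, 6, 30) into the string '2.6.30'."""
    return ".".join(str(part) for part in version)


def describe_versions(runtime, compiled):
    """Return a one-line summary comparing two libxml2 version tuples."""
    if tuple(runtime) == tuple(compiled):
        return "libxml2 %s (same at build time and run time)" % _dotted(runtime)
    return "libxml2 mismatch: built against %s, running with %s" % (
        _dotted(compiled), _dotted(runtime))


def lxml_libxml2_report():
    """Summarise the libxml2 versions seen by the installed lxml."""
    # Imported here so that describe_versions() works without lxml installed.
    from lxml import etree
    return describe_versions(etree.LIBXML_VERSION,
                             etree.LIBXML_COMPILED_VERSION)
```

Printing the result of `lxml_libxml2_report()` in the same environment as the crashing process, once with and once without DYLD_LIBRARY_PATH set, should show whether an older system library is being picked up at run time.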
I think otool is telling you what libs the .so would *like* to use, but the environment will tell dyld where to look at runtime. At least, that's the way I interpret it. Anyway, my segfaults and bus errors stopped. ET On 10/3/07 7:50 PM, "Steve Lianoglou" wrote: >> On OS X, you might actually be using the system libs rather than >> the newer >> libs (in /opt/local/lib, if you are using MacOSPorts, for >> example). I had >> lots of segfault problems until I realized that even though lxml was >> claiming it was running with the newer libs, the info was only >> based on what >> it was built with. At least, that's what it seemed like. >> >> Anyway, all my (segfault) problems went away when I exported >> >> DYLD_LIBRARY_PATH=/opt/local/lib >> >> Into the environment where python was running. > > Hmm .. interesting. > > I was playing with DYLD_LIBRARY_PATH, but I thought that had to be > set during compile time (of lxml). > > Even though ... through my hunting on the intarweb, I came across a > suggestion to use `otool` to see what libs were being used. So I > tried like so: > > $ otool -L /opt/local/Library/Frameworks/Python.framework/Versions/ > Current/lib/python2.4/site-packages/lxml/etree.so > /opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/ > python2.4/site-packages/lxml/etree.so: > /opt/local/lib/libxslt.1.dylib (compatibility version 3.0.0, > current version 3.22.0) > /opt/local/lib/libexslt.0.dylib (compatibility version > 9.0.0, current version 9.13.0) > /opt/local/lib/libxml2.2.dylib (compatibility version 9.0.0, > current version 9.30.0) > /opt/local/lib/libz.1.dylib (compatibility version 1.0.0, > current version 1.2.3) > /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, > current version 88.3.9) > > The fact that the xml libs in /opt/local were the ones being > referenced made me think that those are the ones it would use ... is > that right? 
Looking at that closer, I do see ``/usr/lib/
> libSystem.B.dylib``which is OS X default, but honestly don't know
> what it's responsible for ...
>
> -steve

--
____________________________________________________
Eric Tiffany | eric at projectliberty.org
Interop Tech Lead | +1 413-458-3743
Liberty Alliance | +1 413-627-1778 mobile

From rocarras at gmail.com Thu Oct 4 15:51:43 2007
From: rocarras at gmail.com (Roberto Carrasco)
Date: Thu, 4 Oct 2007 09:51:43 -0400
Subject: [lxml-dev] Problem with lxml library running on Windows
Message-ID: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>

Hi:

We have an issue with the libxml library running on Windows.
We are trying to read an XML document from a string over and over, but the program crashes in the while loop. We suspect the problem is that we cannot run the function etree.parse too many times when we are reading an XML document from a string.

The code crashes when the program reads an XML document repeatedly.
The issue is specific to Windows, because in a Linux environment there is no problem executing it.

We are trying to execute the piece of code shown below in this environment:

- Windows XP Service Pack 2
- Python 2.5
- lxml 1.3.4 and 2.0 alpha 3

The question is: what are we doing wrong? Or is this a problem with the library running on Windows?

    # -*- coding: UTF-8 -*-
    from lxml import etree
    from StringIO import StringIO

    if __name__ == "__main__":

        document=""" 1-3 2006-03-13 08:44:52 SANTIAGO PUENTE 13/03/2006 Robertin MANZANO 2006-03-10 15:52:29 """

        j=0
        while 1:
            print j
            j+=1

            #tree = etree.parse(StringIO(docRauco0))
            tree = etree.fromstring(document)
            images_url = tree.xpath('//link[@rel="media"][@href]')
            image_url_name=images_url[0].attrib['href']

--
Regards,
Roberto
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20071004/3c12f465/attachment.htm From jholg at gmx.de Fri Oct 5 14:00:41 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 05 Oct 2007 14:00:41 +0200 Subject: [lxml-dev] Re: Tag name validation and HTML Message-ID: <20071005120041.180700@gmx.net> Hi, > The : thing is difficult because HTML UAs are expected to deal with : in > the tag name and there is content in the wild that depends on this being > accepted; MS Office produces "HTML" containing tags like , for > example. Since I, and I guess others too, want to use lxml to process > random content that may have colons in the tag names, hard failure for > this case is a problem. To make matters worse it is possible that the > HTML spec will change in the future to introduce some sort of > namespacing feature which may or may not use colons. You'd get errors when parsing such stuff with the XML parser: >>> etree.fromstring("""foo""") Traceback (most recent call last): File "", line 1, in ? File "etree.pyx", line 2137, in etree.fromstring File "parser.pxi", line 1301, in etree._parseMemoryDocument File "parser.pxi", line 1207, in etree._parseDoc File "parser.pxi", line 782, in etree._BaseParser._parseDoc File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc File "parser.pxi", line 523, in etree._handleParseResult File "parser.pxi", line 471, in etree._raiseParseError etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5 but not with the HTML parser: >>> etree.HTML >>> etree.HTML("""foo""") >>> So here's a distinction between HTML and XML, but not API-wise, e.g when creating elements. For my usecase, I must *rely* on producing valid XML through the API, so making things more liberal potentially breaks my system. That's because I need to pickle (i.e. serialize) tree content and reparse somewhere else. Now if I allow for producing invalid XML, some data receiver will choke on my data. 
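
The round-trip concern can be illustrated with the standard library's xml.etree.ElementTree, used here only as a stand-in because it performs no tag name validation when elements are created (this is not a statement about lxml's behaviour). A minimal sketch; the helper name `survives_roundtrip` is hypothetical:

```python
import xml.etree.ElementTree as ET


def survives_roundtrip(element):
    """Serialise an element and report whether the result parses again."""
    data = ET.tostring(element)
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # The receiver's parser rejects what the sender's API happily built.
        return False
    return True


plain = ET.Element("p")
plain.text = "foo"

# The colon is taken verbatim, so the serialised output uses a prefix "o"
# that is never declared.
prefixed = ET.Element("o:p")
prefixed.text = "foo"

print(survives_roundtrip(plain))     # True
print(survives_roundtrip(prefixed))  # False
```

A validating element factory catches the second case when the element is created, rather than leaving it to whoever reparses the data later.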
> Given all of this I would prefer it if it were possible to have an > HTML-specific mode with much more liberal rules than the XML mode. This > could then be adapted to support any namespacing features HTML grows in > the future. For example, if one could do something like > > import lxml.html > lxml.html.Element("o:p") > > where lxml.html.Element would be just like lxml.etree.Element but > without XML-specific validity checks. I guess there might be serious > practical difficulties with that exact solution, but I think the general > idea of being able to flag an element as following HTML rules or XML > rules would be more user-friendly than having a set of rules that > neither matches the XML nor the HTML model correctly. Sounds better to me than introducing some mixed set of rules. And I don't even think that it's difficult to implement, though it might mean introducing another public factory or some sort of switch on Element(). Holger
From stefan_ml at behnel.de Sat Oct 6 19:22:28 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 06 Oct 2007 19:22:28 +0200 Subject: [lxml-dev] trunk schematron tests core dump In-Reply-To: <20071002155852.130440@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> <46F37F62.40307@behnel.de> <20070921092340.311080@gmx.net> <20070921142905.315500@gmx.net> <20071002155852.130440@gmx.net> Message-ID: <4707C454.8020302@behnel.de> jholg at gmx.de wrote: >>>> Schematron uses XPath a lot, so I wouldn't be surprised if this was >>>> related to >>>> the XPath bug in libxml2 2.6.27. Is there any chance you could switch >> to >> [...] >> Unfortunately, using the latest & greatest libxml2/libxslt (2.6.33/1.1.22) >> doesn't solve the problem for me. > > I'm trying to get some sensible information but have real problems with debugging, as I'm seeing line number information that is just plain wrong, though compiling with debugging on and everything, the likes of: > > (gdb) info source > Current source file is src/lxml/etree.c > Compilation directory is /home/lb54320/pydev/LXML/lxml/ > Located in /home/lb54320/pydev/LXML/lxml/src/lxml/etree.c > Contains 90795 lines. > Source language is c. > Compiled with stabs debugging format. > (gdb) b etree.c:70850 > No line 70850 in file "src/lxml/etree.c". > (gdb) Never seen that before. I assume you did a clean build before that? Maybe gdb doesn't get along with the source line references in the comments of the generated C file?
> So the info on the crash does not get much better than that backtrace at the moment: > > Program received signal SIGSEGV, Segmentation fault. > 0xff0b3218 in strlen () from /usr/lib/libc.so.1 > (gdb) bt > #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 > #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 > #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 > #3 0xfe23df04 in __xmlRaiseError () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 > #4 0xfe3e717c in xmlSchematronPErr () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 > #5 0xfe3e9878 in xmlSchematronParse () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2 > #6 0xfe68dfdc in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x1b30f0, > __pyx_args=0x1db670, __pyx_kwds=0x0) at src/lxml/etree.c:5663 > > > What I can see, though, is that using the same schematron schema with xmllint does not crash: > 0 $ cat invalid_empty.xst > > > 0 $ python2.4 -i -c 'from lxml import etree; print etree.LIBXML_VERSION; schema = etree.Schematron(etree.parse("invalid_empty.xst"))' > (2, 6, 30) > Segmentation Fault (core dumped) > > whereas > > $ /apps/pydev/bin/xmllint --schematron invalid_empty.xst foo.xml --version > /apps/pydev/bin/xmllint: using libxml version 20630 > compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib > invalid_empty.xst:1: element schema: Schemas parser error : The schematron document 'invalid_empty.xst' has no pattern > Schematron schema invalid_empty.xst failed to compile > > xmllint has a different error reporting setup, that might make the difference. Anyway, error reporting in Schematron is pretty basic and remember working around that at the time. I'll have to take a deeper look into it when I find the time. 
Stefan From stefan_ml at behnel.de Sat Oct 6 19:28:47 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 06 Oct 2007 19:28:47 +0200 Subject: [lxml-dev] Tag name validation and HTML In-Reply-To: <4702AB02.6080300@cam.ac.uk> References: <46FBA0D3.6010700@cam.ac.uk> <46FBBEBD.7030308@behnel.de> <4702AB02.6080300@cam.ac.uk> Message-ID: <4707C5CF.9080102@behnel.de> Hi, James Graham wrote: > The : thing is difficult because HTML UAs are expected to deal with : in > the tag name and there is content in the wild that depends on this being > accepted; MS Office produces "HTML" containing tags like , for > example. Since I, and I guess others too, want to use lxml to process > random content that may have colons in the tag names, hard failure for > this case is a problem. To make matters worse it is possible that the > HTML spec will change in the future to introduce some sort of > namespacing feature which may or may not use colons. Ok, so I understand that HTML tags must be treated different from XML tags. > Given all of this I would prefer it if it were possible to have an > HTML-specific mode with much more liberal rules than the XML mode. This > could then be adapted to support any namespacing features HTML grows in > the future. For example, if one could do something like > > import lxml.html > lxml.html.Element("o:p") > > where lxml.html.Element would be just like lxml.etree.Element but > without XML-specific validity checks. This absolutely makes sense to me. I'll have to look into the details of an implementation though, since tag name validation is currently done in lxml.etree.Element, which is simply reused by the Python-implemented lxml.html. So we'd have to provide some kind of Python-level API for this. 
Stefan From stefan_ml at behnel.de Sat Oct 6 19:33:07 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 06 Oct 2007 19:33:07 +0200 Subject: [lxml-dev] prefix mappings In-Reply-To: <46FEAF82.60501@antwerpen.be> References: <46FEAF82.60501@antwerpen.be> Message-ID: <4707C6D3.1080001@behnel.de> FnH wrote: > I would like to generate the following serialization: > > > > [...] > In order to solve this I think it would be a good idea to allow (or take > into account) prefix mappings on non root nodes as well. The output I'd > like could then be achieved by the following code snippet: > > a = Element("{foo}a", nsmap={None:"foo"}) > a.append(Element("{bar}b", nsmap={None:"bar"})) >>> from lxml.etree import Element, tostring >>> a = Element("{foo}a", nsmap={None:"foo"}) >>> a.append(Element("{bar}b", nsmap={None:"bar"})) >>> print tostring(a, pretty_print=True) This is on lxml 2.0 alpha, lxml 1.3 should work alike. Stefan From stefan_ml at behnel.de Sat Oct 6 19:39:02 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 06 Oct 2007 19:39:02 +0200 Subject: [lxml-dev] Namespace serialization patch In-Reply-To: <9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com> References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> <4703DF60.4040108@bruun-olsen.net> <9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com> Message-ID: <4707C836.8070307@behnel.de> Lawrence Oluyede wrote: >> Building LXML from SVN is really rather straightforward and of course >> includes the fixes for that particular problem as well as others. >> See the download page for instructions on building from SVN. > > I, personally, don't have a problem with that but AFAIK at work using > the SVN version is a lesser option than using the 2.0alpha. There will be a new release soon, but I can't currently tell when exactly. 
However, the patch in the current 1.3 branch (which reflects the stable 1.3
series) will definitely go in there, so you should be just fine with using an
unofficial branch build for now, or a patched 1.3 build (which would currently
be exactly the same anyway).

Stefan

From stefan_ml at behnel.de Sat Oct 6 22:21:28 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 22:21:28 +0200
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
Message-ID: <4707EE48.7000704@behnel.de>

Hi,

as expected, I cannot reproduce your problem on Linux.

Roberto Carrasco wrote:
> We have an issue with libxml library running on Windows.
> We are trying to read an xml document from a string over and over but the
> program crashes in the while loop. We suspect the problem is that we cannot
> run the function etree.parse too many times when we are reading a xml
> document from a string.

lxml.etree actually optimises parsing from a StringIO object into parsing via
fromstring() - or rather its internal implementation. So I can't see how this
would make a difference.

> We are trying to execute the piece of code shown below in this environment:
>
> - Windows XP Service Pack 2
> - Python 2.5
> - lxml 1.3.4 and 2.0 alpha 3

You are using the pre-built binaries from PyPI, right? I'm not currently sure
which version of libxml2 they use, but should be 2.6.28 or later.

> The code crashes when the program read a xml document repeatedly.
> The issue is on Windows because on an Linux environment there is no problem
> executing it.
>
> The question is: what we are doing wrong? or is this a problem with the
> library running on Windows?
>
> # -*- coding: UTF-8 -*-
> from lxml import etree
> from StringIO import StringIO
>
> if __name__ == "__main__":
>
>     document="""
> 1-3
> 2006-03-13
> 08:44:52
> SANTIAGO
> PUENTE
> type="string">13/03/2006
> type="string">Robertin
> MANZANO
>
> 2006-03-10
> 15:52:29
>
>
>
> """
>
>     j=0
>     while 1:
>         print j
>         j+=1
>
>         #tree = etree.parse(StringIO(docRauco0))
>         tree = etree.fromstring(document)
>         images_url = tree.xpath('//link[@rel="media"][@href]')
>         image_url_name=images_url[0].attrib['href']

Just to mention it, you could simplify this to

    images_url_names = tree.xpath('//link[@rel="media"]/@href')

Regarding your problem - instead of this line:

    image_url_name=images_url[0].attrib['href']

could you try this instead, to see if it still crashes:

    image_url_name=images_url[0].get('href')

Apart from that, I would need some debugging information to understand what's
happening here. While there are differences between the behaviour of libxml2
under Linux and Windows, I don't currently see any that could cause the above
code to fail.

Stefan

From stefan_ml at behnel.de Sun Oct 7 07:14:33 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 07 Oct 2007 07:14:33 +0200
Subject: [lxml-dev] lxml 2.0alpha4 released
Message-ID: <47086B39.4030802@behnel.de>

Hi all,

I just released a 4th alpha version of lxml 2.0 to PyPI. It hopefully sets an
end to the tag name validation problems by distinguishing between HTML tags
and XML tags based on the associated parser, i.e. either the one that parsed
it or the one that created the element through its "makeelement" method. Note
that the Element factory of lxml.etree uses the XMLParser by default, while
the factory in lxml.html uses the HTMLParser, and thus allows HTML tag names.

Everyone who bumped into and/or reported problems with this, please verify
that this provides a viable solution to you.
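
A minimal sketch of the distinction described above, assuming a current lxml
release rather than 2.0alpha4 itself; the tag name "o:p" is only the MS Office
example from the earlier thread:

```python
from lxml import etree
import lxml.html

# The default Element factory in lxml.etree is tied to the XML parser,
# so a colon in the tag name is rejected with a ValueError.
try:
    etree.Element("o:p")
    xml_accepts_colon = True
except ValueError:
    xml_accepts_colon = False

# The factory in lxml.html is tied to an HTML parser and is more liberal.
html_el = lxml.html.Element("o:p")

# An element created through an HTMLParser's makeelement() follows the
# HTML rules as well.
made_el = etree.HTMLParser().makeelement("o:p")

print(xml_accepts_colon, html_el.tag, made_el.tag)
```

Which rules apply is decided by the parser behind the factory, so code that
has to accept tag names from arbitrary HTML would go through lxml.html or an
HTMLParser instance instead of lxml.etree.Element.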
Have fun,
Stefan


2.0alpha4 (2007-10-07)

Features added

Bugs fixed

    * AttributeError in feed parser on parse errors

Other changes

    * Tag name validation in lxml.etree (and lxml.html) now distinguishes
      between HTML tags and XML tags based on the parser that was used to
      parse or create them. HTML tags no longer reject any non-ASCII
      characters in tag names but only spaces and the special characters
      <>&/'"

From Michael.Pechal at silabs.com Sun Oct 7 21:30:34 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Sun, 7 Oct 2007 14:30:34 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation
Message-ID: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>

Hello,

I am new to XML and I have found lxml.objectify to be very useful. I am
using XML to store register settings. I use the register mnemonic as
the tag. I use custom attributes to store additional information, such
as address, description, apply, etc. I am also using the _pytype and
_xsi attributes. I am using the binary install of lxml 1.3.4 on WinXP
running Python 2.5.1.

My main problem is that assigning a new value to an
objectify.DataElement destroys the existing attribute list. My current
workaround is to retrieve the attributes with the items() call, assign
the new value, and then reapply attributes with set() method for each
pair in the items dict. I dug through the API documentation and I did
not see a way around this issue. Am I missing something here?

I thought about subclassing DataElement and then I scanned the SVN
development change list. I saw some discussion about preserving _pytype
or _xsi attributes, but does this include ALL attributes? If so, I will
proceed with a build from the latest SVN copy. How stable are dev
versions? Are there automated acceptance tests (unittest) that gate the
check in? I may just use my workaround until 1.3.5 arrives.
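
The workaround described above might look roughly like the following sketch.
The element name "gain" and the attribute "addr" are hypothetical stand-ins
for a register mnemonic and its address, and the calls are checked against
current lxml rather than 1.3.4:

```python
from lxml import objectify

# Hypothetical register document: <gain> carries a custom "addr" attribute.
root = objectify.fromstring('<codec><gain addr="0x10">1</gain></codec>')

# Remember the attributes before the assignment replaces the element.
saved = dict(root.gain.attrib)

# Assigning a plain Python value; custom attributes may be lost here.
root.gain = 5

# Re-apply the saved attributes to the element now stored under "gain".
for name, value in saved.items():
    root.gain.set(name, value)

print(root.gain.pyval, root.gain.get("addr"))
```

Only the attributes that existed before the assignment are copied back, so
any type annotation that objectify adds for the new value is left alone.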

Another issue I noticed is that if I specify _xsi='int', the _pytype
attribute will be 'long' instead of 'integer', so I am forced to use
_pytype='integer' for all integer data elements. Also, if you run
objectify.annotate(), the integer becomes a long type again. Annotate
should look to the _xsi or even pyval type. Has this been fixed? This
is not really an issue for me, since I always keep the list annotated.

The objectify API documentation was helpful. As a new user, I had a few
problems with save and retrieve from file. I would suggest updating the
objectify API document to provide a full example of saving to and
loading from a file. I have provided a test case from my unittest code
below that may be useful for the documentation:

    # -------------------------------------------------------------------------
    def testFileSaveAndLoad(self):
        """ Save to XML file, then reload and compare data. """
        # note the self.objRoot is created in the setUp() method
        tofile = etree.tostring(self.objRoot, pretty_print=True)
        xmlFH = open('test.xml', 'w')
        xmlFH.write(tofile)
        xmlFH.close()
        parser = etree.XMLParser(remove_blank_text=True)
        lookup = objectify.ObjectifyElementClassLookup()
        parser.setElementClassLookup(lookup)
        tree = etree.parse('test.xml', parser)
        root = tree.getroot()  # crucial step, as parse() doesn't return the root
        fromfile = etree.tostring(root, pretty_print=True)
        self.assertEqual(tofile, fromfile)

Also on the documentation front, there is a failure with help() on the
objectify module:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import objectify
>>> from lxml import etree
>>> help(objectify)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python25\lib\site.py", line 346, in __call__
    return pydoc.help(*args, **kwds)
  File "C:\Python25\lib\pydoc.py", line 1645, in __call__
    self.help(request)
  File "C:\Python25\lib\pydoc.py", line 1689, in help
    else: doc(request, 'Help on %s:')
  File "C:\Python25\lib\pydoc.py", line 1481, in doc
    pager(title % desc + '\n\n' + text.document(object, name))
  File "C:\Python25\lib\pydoc.py", line 324, in document
    if inspect.ismodule(object): return self.docmodule(*args)
  File "C:\Python25\lib\pydoc.py", line 1070, in docmodule
    inspect.getclasstree(classlist, 1), name)]
  File "C:\Python25\lib\inspect.py", line 656, in getclasstree
    for parent in c.__bases__:
TypeError: 'functools.partial' object is not iterable
>>> objectify

Note that help(etree) works fine.

Thanks,
Michael

This email and any attachments thereto may contain private, confidential,
and privileged material for the sole use of the intended recipient. Any
review, copying, or distribution of this email (or any attachments thereto)
by others is strictly prohibited. If you are not the intended recipient,
please contact the sender immediately and permanently delete the original
and any copies of this email and any attachments thereto.

From stefan_ml at behnel.de Mon Oct 8 09:47:07 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 09:47:07 +0200
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation
In-Reply-To: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
Message-ID: <4709E07B.5000201@behnel.de>

Hi,

thanks for sharing your impressions.

Michael Pechal wrote:
> I am new to XML and I have found lxml.objectify to be very useful. I am
> using XML to store register settings. I use the register mnemonic as
> the tag. I use custom attributes to store additional information, such
> as address, description, apply, etc.

This sounds a bit like attribute misuse. If you are not tied to a specific XML
language, consider storing this information in the XML structure rather than
XML attributes, just as you would in a Python object.

> I am also using the _pytype and
> _xsi attributes. I am using the binary install of lxml 1.3.4 on WinXP
> running Python 2.5.1.
>
> My main problem is that assigning a new value to an
> objectify.DataElement destroys the existing attribute list.

That's intentional. It's like when you assign a Python value to an object
attribute. The old value will be lost in that case, including all of its own
attributes. Note that I'm not talking about XML at all here, this is plain
Python object behaviour, which is what objectify mimics.

> I thought about subclassing DataElement

DataElement is not a class, it's a factory function. So you can write a
wrapper but you cannot subclass it.

> and then I scanned the SVN
> development change list. I saw some discussion about preserving _pytype
> or _xsi attributes, but does this include ALL attributes?

No, these two are special (or rather their namespaced XML attributes).

> If so, I will
> proceed with a build from the latest SVN copy. How stable are dev
> versions?

There are currently two actively maintained branches: the 1.3 branch for the
stable 1.3 series (basically, everything that gets committed here will be in a
future 1.3.x release), and the current trunk for the future 2.0 series, which
is currently in alpha status. This means: some functionality and some APIs
are not stable yet and there may still be incompatible changes to come if
their value for lxml 2.0 is considered high enough to break current code.
The 2.0 web pages are also online:

http://codespeak.net/lxml/dev/

> Are there automated acceptance tests (unittest) that gate the
> check in?

Sure, check out the test suite that comes with the source distribution. It's
pretty extensive by now.

http://codespeak.net/lxml/build.html#running-the-tests-and-reporting-errors

There is also a benchmarking suite that might be of interest to you.

http://codespeak.net/lxml/performance.html

> I may just use my workaround until 1.3.5 arrives.

No 1.3.x release will ever change the above behaviour, and it won't change for
2.0 either.

> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements.

You're mixing names here, so I'm not quite sure what exactly you are doing.
Make sure you are distinguishing between Python type names and XSI type names
in your code. In general, XSI types are more accurate, so you might want to
prefer them. The Python type "long" maps to the XSI type "integer" and various
other XSI types. Only the small XSI integer types like "int" or "short" map to
a Python int.

> Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.

This has been changed in 2.0, which uses annotations quite a bit more
naturally. The current behaviour in 1.3 will not change, unless it's
considered a bug that should be fixed.

> The objectify API documentation was helpful. As a new user, I had a few
> problems with save and retrieve from file.

That's because this is done through lxml.etree rather than objectify directly.
You will almost certainly need both to work with objectify anyway, so it's
worth skimming through the lxml.etree tutorial.

That said, the objectify documentation is starting to get rather lengthy.
Maybe we should start focussing it a bit, also towards users that do not know
lxml.etree or ElementTree before looking at objectify.

> I would suggest updating the
> objectify API document to provide a full example of saving to and
> loading from a file. I have provided a test case from my unittest code
> below that may be useful for the documentation:

Thanks. This is a question of duplicating documentation versus making it
easily accessible. We also use our documentation as doctests, where accessing
files is not as straight forward as it should look.

> Also on the documentation front, there is a failure with help() on the
> objectify module. :
>
> >>> help(objectify)
>
> Traceback (most recent call last):
> TypeError: 'functools.partial' object is not iterable
>
> Note that help(etree) works fine.

Ah, I wasn't aware of that. However, it seems to be more of a problem in
help() itself rather than objectify. I'll have to investigate this one day or
another...

Stefan

From jholg at gmx.de Mon Oct 8 09:59:01 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 08 Oct 2007 09:59:01 +0200
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation
In-Reply-To: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
Message-ID: <20071008075901.13870@gmx.net>

Hi Michael,

> My main problem is that assigning a new value to an
> objectify.DataElement destroys the existing attribute list. My current
> workaround is to retrieve the attributes with the items() call, assign
> the new value, and then reapply attributes with set() method for each
> pair in the items dict. I dug through the API documentation and I did
> not see a way around this issue. Am I missing something here?

objectify maps element data to simple python builtins, so it treats them as
immutable, i.e. you cannot modify element.text.
This is intentional and not going to change (I hope :)

With 2.0alpha, you could however do this to keep your attributes:

>>> msg.a = 12345
>>> msg.a.set("foo", "bar")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 12345 [IntElement]
      * py:pytype = 'int'
      * foo = 'bar'
>>> msg.a = objectify.DataElement("changeMe", attrib=dict(msg.a.attrib))
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 'changeMe' [StringElement]
      * py:pytype = 'str'
      * foo = 'bar'
>>>

Note that the foo attribute remains intact, whilst the py:pytype gets
corrected to something that fits the element value.

> I thought about subclassing DataElement and then I scanned the SVN
> development change list. I saw some discussion about preserving _pytype
> or _xsi attributes, but does this include ALL attributes? If so, I will
> proceed with a build from the latest SVN copy. How stable are dev
> versions? Are there automated acceptance tests (unittest) that gate the
> check in? I may just use my workaround until 1.3.5 arrives.

Generally I'd say the dev versions are still very stable with regard to
robustness, but of course feature-wise they can be in flux.

> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements. Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.

Please try 2.0alpha, the behaviour with regard to py:pytype/xsi:type has been
tweaked a little, and some parts now behave more "natural". There's also now a
triplet of annotation functions (pyannotate, xsiannotate, annotate) that give
you fine-grained control of annotation.
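
As a rough sketch of how the annotation functions differ (written against
current lxml, where they still exist under these names; the sample document is
made up for illustration):

```python
from lxml import objectify

# Fully qualified names of the two annotation attributes.
PYTYPE = "{http://codespeak.net/lxml/objectify/pytype}pytype"
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

root = objectify.fromstring("<msg><a>5</a><b>text</b></msg>")

# pyannotate() adds py:pytype attributes only.
objectify.pyannotate(root)
py_only = (root.a.get(PYTYPE), root.a.get(XSI_TYPE))

# deannotate() strips both kinds of annotation again.
objectify.deannotate(root)

# xsiannotate() adds xsi:type attributes only.
objectify.xsiannotate(root)
xsi_only = (root.a.get(PYTYPE), root.a.get(XSI_TYPE))

print(py_only, xsi_only)
```

Running deannotate() in between keeps the two annotation passes from mixing,
which makes it easier to see which function wrote which attribute.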
The most prominent change is that you get auto-pytypification now:

>>> msg.a = 999
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 999 [IntElement]
      * py:pytype = 'int'
>>>

Here, the actual Python type of the RVAL of the assignment gets taken into
account now.

Regarding your example, explicitly setting _xsi="int" gives

>>> msg.a = objectify.DataElement(8, _xsi="int")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 8 [IntElement]
      * py:pytype = 'int'
      * xsi:type = 'xsd:int'
>>>

I do think it's just the same with 1.3, so I think you might have mixed
something up here.

However, specifying _xsi="integer" will result in:

>>> msg.a = objectify.DataElement(8, _xsi="integer")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 8L [LongElement]
      * py:pytype = 'long'
      * xsi:type = 'xsd:integer'
>>>

This is due to the XML Schema type system, where an XML Schema integer is not
restricted to 32 bits, like a Python int (still is), see
http://www.w3.org/TR/xmlschema-2/

Holger

From faassen at startifact.com Mon Oct 8 14:47:35 2007
From: faassen at startifact.com (Martijn Faassen)
Date: Mon, 08 Oct 2007 14:47:35 +0200
Subject: [lxml-dev] did windows binary versions ever get removed from the cheeseshop?
Message-ID:

Hi there,

To start off, I'm not 100% sure on this, so I'm just checking.

I thought at some stage I had a working windows installation of my
software that was using lxml 1.3 binaries (not 1.3.1 or something, just
1.3). When I tried again today it didn't work anymore, and instead had
to start using lxml 1.3.4 (for instance).

Is it possible that someone for some reason removed the versions for 1.3
from the cheeseshop? If so, my general recommendation would be never to
do this, even if the packages are broken somehow. A release is a
release, and people might be depending on it.
For this reason, never remove release files, and also never overwrite
release files.

Anyway, I'm not at all sure this actually happened with lxml, but I'm
just writing this to make sure it won't happen. :)

Regards,

Martijn

From stefan_ml at behnel.de Mon Oct 8 19:51:38 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 19:51:38 +0200
Subject: [lxml-dev] did windows binary versions ever get removed from the cheeseshop?
In-Reply-To:
References:
Message-ID: <470A6E2A.8050602@behnel.de>

Martijn Faassen wrote:
> To start off, I'm not 100% sure on this, so I'm just checking.
>
> I thought at some stage I had a working windows installation of my
> software that was using lxml 1.3 binaries (not 1.3.1 or something, just
> 1.3). When I tried again today it didn't work anymore, and instead had
> to start using lxml 1.3.4 (for instance).
>
> Is it possible that someone for some reason removed the versions for 1.3
> from the cheeseshop? If so, my general recommendation would be never to
> do this, even if the packages are broken somehow. A release is a
> release, and people might be depending on it. For this reason, never
> remove release files, and also never overwrite release files.
>
> Anyway, I'm not at all sure this actually happened with lxml, but I'm
> just writing this to make sure it won't happen. :)

Thanks for the warning. However, I didn't remove anything myself and I
wouldn't know why Sidnei should have. I'm not sure but I have a feeling that
we never had any Windows binaries for 1.3...

Anyway, I agree that releases should stay where they were uploaded. There are
always reasons why you would want to go back to or compare/test with older
versions. Note that the "Index of Packages" even lists them all. I actually do
that by hand after each release - distutils/PyPI doesn't seem to have a way to
say: "don't hide other releases after an upload".

Stefan

From Michael.Pechal at silabs.com Mon Oct 8 21:37:31 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Mon, 8 Oct 2007 14:37:31 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation
In-Reply-To: <4709E07B.5000201@behnel.de>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <4709E07B.5000201@behnel.de>
Message-ID: <6DD7584058DDB24193C9049940B45FFB018885E0@EXCAUS001.silabs.com>

Stefan,

Thank you for your prompt reply. I found the Epydoc URL
(http://codespeak.net/lxml/dev/api/lxml.objectify-module.html) which
provided more information. I did not see a direct link to here from the
lxml page or perhaps I missed it? I have provided a few responses
below.

Regards,
Michael

-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Stefan Behnel
Sent: Monday, October 08, 2007 2:47 AM
To: Michael Pechal
Cc: lxml-dev at codespeak.net
Subject: Re: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation

Hi,

thanks for sharing your impressions.

Michael Pechal wrote:
> I am new to XML and I have found lxml.objectify to be very useful. I am
> using XML to store register settings. I use the register mnemonic as
> the tag. I use custom attributes to store additional information, such
> as address, description, apply, etc.

This sounds a bit like attribute misuse. If you are not tied to a
specific XML language, consider storing this information in the XML
structure rather than XML attributes, just as you would in a Python
object.

[MP] I suppose I was being a little lazy here. I am retrofitting an
existing cPickle implementation with custom data classes. Only a few
parts of the data model will require the attributes, so it won't be too
painful to create sub-elements for what I am now trying to store as
attributes.

> I thought about subclassing DataElement

DataElement is not a class, it's a factory function. So you can write a
wrapper but you cannot subclass it.

[MP] Thanks for the clarification. I should have stated
"objectify.IntElement".

> Are there automated acceptance tests (unittest) that gate the
> check in?

Sure, check out the test suite that comes with the source distribution.
It's pretty extensive by now.

http://codespeak.net/lxml/build.html#running-the-tests-and-reporting-errors

There is also a benchmarking suite that might be of interest to you.

http://codespeak.net/lxml/performance.html

[MP] Very nice!

> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements.

You're mixing names here, so I'm not quite sure what exactly you are
doing. Make sure you are distinguishing between Python type names and
XSI type names in your code. In general, XSI types are more accurate,
so you might want to prefer them. The Python type "long" maps to the
XSI type "integer" and various other XSI types. Only the small XSI
integer types like "int" or "short" map to a Python int.

[MP] Ah, I see. I misunderstood the various formats for the _xsi
attribute. I should have used _xsi='int' or 'short'. Thanks for the
clarification.

>>> e = objectify.DataElement(1, _xsi='integer')
>>> type(e.pyval)
>>> e = objectify.DataElement(1, _xsi='int')
>>> type(e.pyval)
>>> e = objectify.DataElement(1, _xsi='short')
>>> type(e.pyval)

> Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.

This has been changed in 2.0, which uses annotations quite a bit more
naturally. The current behaviour in 1.3 will not change, unless it's
considered a bug that should be fixed.

[MP] With the correct _xsi attribute, annotate() correctly restores the
pytype attribute to 'int'.

>>> e = objectify.DataElement(1, _xsi='int')
>>> e.items()
[('{http://codespeak.net/lxml/objectify/pytype}pytype', 'int'),
 ('{http://www.w3.org/2001/XMLSchema-instance}type', 'short')]
>>> objectify.deannotate(e)
>>> e.items()
[]
>>> objectify.annotate(e)
>>> e.items()
[('{http://codespeak.net/lxml/objectify/pytype}pytype', 'int')]

> I would suggest updating the
> objectify API document to provide a full example of saving to and
> loading from a file. I have provided a test case from my unittest code
> below that may be useful for the documentation:

Thanks. This is a question of duplicating documentation versus making it
easily accessible. We also use our documentation as doctests, where
accessing files is not as straight forward as it should look.

[MP] I would just clarify that the etree.parse() call returns
etree._ElementTree type, while the getroot() call returns
objectify.ObjectifiedElement type for direct use. This took about ten
minutes to uncover, so it was not a huge problem. I just like to jump
in and run with examples to see how far I can get before I have to
roll up my sleeves and review the details of a new module. :) I will
spend some time reviewing the etree documentation as well.

>>> parser = etree.XMLParser(remove_blank_text=True)
>>> lookup = objectify.ObjectifyElementClassLookup()
>>> parser.setElementClassLookup(lookup)
>>> tree = etree.parse('codec.xml', parser)
>>> type(tree)
>>> tree.reg_config
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'etree._ElementTree' object has no attribute 'reg_config'
>>> root = tree.getroot()
>>> type(root)
>>> root.reg_config

From stefan_ml at behnel.de Mon Oct 8 21:58:09 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 21:58:09 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071003165016.104d2caf@bhuda.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org>
Message-ID: <470A8BD1.6060305@behnel.de>

Hi,

sorry for the late reply, I was on vacation last week and am just catching up
with my e-mail.

Mike Meyer wrote:
> I'm getting crashes - by which I mean the python process is
> segfaulting and, with some tweaking of GNU/Linux, leaving me a core
> file - while using lxml to parse data.
>
> Versions:
>
> OS: RHEL 5
> Python: 2.5.1 (custom built).
> lxml: 1.3.3
> libxml: 2.6.26 (both compiled and built)
> libxslt: 1.1.17
>
> Yes, I know those are a bit out of date

They should work, though.

> Rebuilding python with OPTS=-g (I set that for the lxml build as
> well), I can get a "where" output that points at lxml:
>
>
> #0 0x00002aaaaf906c3a in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #1 0x00002aaaaf906be7 in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #2 0x00002aaaaf8ebdfe in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #3 0x00002aaaaf966a5c in findOrBuildNodeNs ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
>
> The first problem is that this isn't repeatable. I've got test data
> that will make it happen, but I have to feed that data through the
> system a few thousand times in. This is part of a database ETL system,
> parsing data from the XML to load into the database. If I feed it the
> exact same data over and over again, it'll work 9999 times out of ten
> thousand - but then fail that ten thousandth time with a segfault.

Are those the real numbers?
The 10000, I mean? That would explain a *lot*.

lxml.etree currently has a hard limit for namespace prefix generation (the
"nsXX" bit), which happens to be (an arbitrary) 10000 *per document*.
Admittedly, the resulting behaviour is far from robust and you seem to have
triggered a case where this number matters.

I attached a patch (against the trunk) that switches the counter to a Python
long instead, which is only bound by available memory.

> The document is straightforward: it starts with a meta element with a
> set of attributes, and then has a lot of data elements, all the same
> type, all with the same attributes (give or take an optional one), and
> I just use document.xpath to find the elements, and then read off
> their attribute values to save to a database load file.
>
> Hints on how to proceed - setting things up so I can use gdb on the
> lxml sources, for instance - would be greatly appreciated.

A way to work around this is to *not* reuse documents. You mention a "meta
element", so I guess you use a single document and keep adding namespaced
elements to it. That lets the counter overflow, as the namespaces must declare
and adapt their prefixes when being added to an existing document. You can
print the "prefix" attribute of elements to see how the numbers go up.

I don't know your code, so I can't be more specific. Please ask back if you
need any further hints what you can do to avoid this in general.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: python-namespace-prefix-counter.patch
Type: text/x-diff
Size: 2622 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20071008/babff101/attachment-0001.bin

From Michael.Pechal at silabs.com Mon Oct 8 22:55:33 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Mon, 8 Oct 2007 15:55:33 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation
In-Reply-To: <20071008075901.13870@gmx.net>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <20071008075901.13870@gmx.net>
Message-ID: <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com>

Holger,

> Note that the foo attribute remains intact, whilst the py:pytype gets
> corrected to s.th. that fits the element value.

[MP] Thanks for your suggestion. I will follow Stefan's advice and
create separate elements versus using the attribute list. Thus, I would
have the following structure in my codec.xml file:

    codec (TREE)
        reg_config (TREE)
            filter_type (TREE)
                value (int)
                desc (str)
                apply (bool)
                default (int)

This should be faster and cleaner, as I am not abusing the attribute
list. I also don't have to worry about attribute type conversions,
since all attributes are strings. I can access the pyval property, so
no conversion is required in my data handler accessor methods.

Phase one involves translating everything into XML (currently custom
data classes with cPickling). Phase two entails developing an XSD file
to validate the XML. I imagine the schema validation will be cleaner
with the separate elements versus the bloated attribute list.

I have a better understanding of the spirit of lxml, but I have much to
learn regarding XML and XSLT in general. I will perform more background
reading before troubling this list again with basic questions.
:) > Regarding your example, explicitly setting _xsi="int" gives > >>> msg.a = objectify.DataElement(8, _xsi="int") > >>> print objectify.dump(msg) > msg = None [ObjectifiedElement] > a = 8 [IntElement] > * py:pytype = 'int' > * xsi:type = 'xsd:int' > >>> [MP] Thanks for the clarification. I need to use _xsi="int". I will review the XML schema link that you provided. Regards, Michael This email and any attachments thereto may contain private, confidential, and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments thereto) by others is strictly prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete the original and any copies of this email and any attachments thereto. From stefan_ml at behnel.de Mon Oct 8 23:13:22 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Oct 2007 23:13:22 +0200 Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation In-Reply-To: <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com> References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <20071008075901.13870@gmx.net> <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com> Message-ID: <470A9D72.4060407@behnel.de> Michael Pechal wrote: > Phase one involves translating everything into XML (currently custom > data classes with cPickling). Doesn't tostring() or ElementTree(root).write() do what you want? I don't see why you would go through pickling here... http://effbot.org/elementtree/elementtree-elementtree.htm > Phase two entails developing an XSD file to validate the XML. Unless you are very firm with XML Schema and/or have good tool support, I generally suggest writing a RelaxNG schema instead (preferably in the "compact syntax" aka RNC), which is easy to write, read and understand and is well supported by lxml/libxml2. 
It also supports the XSD datatypes and can be translated into an XML Schema via tools like trang. Stefan From agustin.villena at gmail.com Mon Oct 8 23:42:13 2007 From: agustin.villena at gmail.com (=?ISO-8859-1?Q?Agust=EDn_Villena?=) Date: Mon, 08 Oct 2007 17:42:13 -0400 Subject: [lxml-dev] Problem with lxml library running on Windows In-Reply-To: <4707EE48.7000704@behnel.de> References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de> Message-ID: Hi! I tested a simplified code (attached to this post) in 2 versions of Windows, with different results: Python version: Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32 lxml version: LIBXML_COMPILED_VERSION: (2, 6, 28) LIBXML_VERSION : (2, 6, 28) LIBXSLT_COMPILED_VERSION: (1, 1, 19) LIBXSLT_VERSION: (1, 1, 19), LXML_VERSION: (1, 3, 4, 0)} For the same version of python and lxml - Doesn't crashes in Microsoft Windows Vista Ultimate Version: 6.0.6000 build 6000 - Crashes after 137 iterations Microsoft Windows XP Profesional Version: 5.1.2600 Service Pack 2 Build 2600 The generated error signature is: AppName: python.exe AppVer: 0.0.0.0 ModName: etree.pyd ModVer: 0.0.0.0 Offset: 00010c90 Attached to this post is the error report generated for Microsoft after the crash Cheers Agustin Stefan Behnel escribi?: > Hi, > > as expected, I cannot reproduce your problem on Linux. > > > Roberto Carrasco wrote: >> We have an issue with libxml library running on Windows. >> We are trying to read an xml document from a string over and over but the >> program crashes in the while loop. We suspect the problem is that we cannot >> run the function etree.parse too much times when we are reading a xml >> document from a string. > > lxml.etree actually optimises parsing from a StringIO object into parsing via > fromstring() - or rather its internal implementation. So I can't see how this > would make a difference. 
> > >> We are trying to execute the piece of code shown below in this environment: >> >> - Windows XP Service Pack 2 >> - Python 2.5 >> - lxml 1.3.4 and 2.0 alpha 3 > > You are using the pre-built binaries from PyPI, right? I'm not currently sure > which version of libxml2 they use, but should be 2.6.28 or later. > > >> The code crashes when the program read a xml document repeatedly. >> The issue is on Windows becuase on an Linux environment there is no problem >> excecuting it. >> >> The question is: what we are doing wrong? or is this a problem with the >> library running on Windows? >> >> # -*- coding: UTF-8 -*- >> from lxml import etree >> from StringIO import StringIO >> >> if __name__ == "__main__": >> >> document=""" >> 1-3 >> 2006-03-13 >> 08:44:52 >> SANTIAGO >> PUENTE >> > type="string">13/03/2006 >> > type="string">Robertin >> MANZANO >> >> 2006-03-10 >> 15:52:29 >> >> >> >> """ >> >> j=0 >> while 1: >> print j >> j+=1 >> >> #tree = etree.parse(StringIO(docRauco0)) >> tree = etree.fromstring(document) >> images_url = tree.xpath('//link[@rel="media"][@href]') >> image_url_name=images_url[0].attrib['href'] > > Just to mention it, you could simplify this to > > images_url_names = tree.xpath('//link[@rel="media"]/@href') > > > Regarding your problem - instead of this line: > > image_url_name=images_url[0].attrib['href'] > > could you try this instead, to see if it still crashes: > > image_url_name=images_url[0].get('href') > > > Apart from that, I would need some debugging information to understand what's > happening here. While there are differences between the behaviour of libxml2 > under Linux and Windows, I don't currently see any that could cause the above > code to fail. > > Stefan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: 2551_appcompat.txt Url: http://codespeak.net/pipermail/lxml-dev/attachments/20071008/225338ee/attachment.txt -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lxml_crash_windows.py Url: http://codespeak.net/pipermail/lxml-dev/attachments/20071008/225338ee/attachment.diff From Michael.Pechal at silabs.com Tue Oct 9 00:38:15 2007 From: Michael.Pechal at silabs.com (Michael Pechal) Date: Mon, 8 Oct 2007 17:38:15 -0500 Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment, attribute handling, and documentation References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <20071008075901.13870@gmx.net> <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com> <470A9D72.4060407@behnel.de> Message-ID: <6DD7584058DDB24193C9049940B45FFB0188860E@EXCAUS001.silabs.com> Stefan, > Doesn't tostring() or ElementTree(root).write() do what you want? I don't > see why you would go through pickling here... > http://effbot.org/elementtree/elementtree-elementtree.htm They work very well! What I was trying to say is that I currently use custom python classes that are persisted via cPickle. Phase one involves replacing the data model with lxml.objectify and all of its superior power. So, goodbye cPickled data classes and hello lxml.objectify! In the past, I have leveraged cPickle, ConfigParser, or custom parser. I have wanted to leverage XML for some time but the learning curve is steep. Now lxml.objectify has come to my rescue. My tool is based on MVC design. I have converted the data model to an objectified tree and I have a unittest wrapper to exercise the data model. Before I update the controller methods for tree access, I wanted to finalize the XML structure. I just need to refactor the data model and "do it right" with more elements versus hacking the attribute list. Then, phase one will be complete. > > Phase two entails developing an XSD file to validate the XML. 
> Unless you are very firm with XML Schema and/or have good tool support, I > generally suggest writing a RelaxNG schema instead (preferably in the > "compact syntax" aka RNC), which is easy to write, read and understand and > is well supported by lxml/libxml2. It also supports the XSD datatypes and > can be translated into an XML Schema via tools like trang. Thanks for the advice. I will explore RelaxNG schema first. We are out of licenses for Altova XMLSpy 2007 and it is pricy! I found Editix (http://www.editix.com/) for $85. It is cross-platform (*nix, OS X and Windows). The documentation lists Schema Generator (DTD, W3C XML Schema, XML Relax NG) from XML documents. When I am serious about schema work, I will try out the shareware version. Regards, Michael From kf9150 at gmail.com Tue Oct 9 01:07:00 2007 From: kf9150 at gmail.com (Kelie) Date: Mon, 8 Oct 2007 23:07:00 +0000 (UTC) Subject: [lxml-dev] is there a binary windows installer for lxml 2.0alpha4 release? Message-ID: as subject. thanks. From stefan_ml at behnel.de Tue Oct 9 12:39:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Oct 2007 12:39:19 +0200 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: <20071008172121.4a6acd8b@bhuda.mired.org> References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> Message-ID: <470B5A57.4030102@behnel.de> Hi, ok, that wasn't the problem then (it's still good to have it fixed, though).
Mike Meyer wrote: > A master process reads in a a couple of config files, and parses and > checks them against a schema, and then possibly plugs in some default > attribute values. It then forks two processes: > > 1) Uses http to get xml documents from a remote server. These are the > ones I described; they have a meta element and then a data element > containing "row "elements, with the actual values in the attributes > to "row" elements. This process uses iterparse to pull one value > from the meta element, and then saves the entire thing to disk. That's the process that fails, right? Can you find out if it's the iterparse() or something else that fails here? Using valgrind is usually a great way to find out what's going wrong. It will make the run a lot slower, but it should print some helpful infos when it crashes. Run it like this: valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \ python yourscript.py preferably only on the process that crashes. > I.e. - the only documents that gets reused a lot is the schema, which > are built, passed to RelaxNG, and then used to validate each of those > thousands of documents. That's ok. > architecture makes things a little convoluted, but the basic path is > something like: > > data = urlopen(....) > try: > parsed = fromstring(data.read()) parse(data) should do, BTW. > if not schema.validate(parsed): > handle_broken_document(parsed=parsed) > for node in parsed.xpath('Types/Type'): > d = dict(node.attrib): > save_for_db(d) > for node in parsed.xpath('AltTypes/AltType'): > d = dict(node.attrib): > save_for_db(d) > for node in parsed.xpath('MoreTypes/MoreType'): > d = dict(node.attrib): > save_for_db(d) That's pretty straight forward code, I don't see any risk here. But I'm wondering which of the two processes actually fails now - you're presenting this one, but from your previous posts I though it was the other one that crashed. 
> I tried turning off the parsing - which pretty much makes everything > else do nothing but pass around the raw data - and got no failures. I > also tried turning off just the validation, so that the work is still > getting done - and got failures. Hmmmm, are those failures related to validation errors? Just in case it's the second process that fails (the XPath one), it could be worth testing if using the XPath() class instead of the xpath() method works better. That might give us a hint on where the problem comes from. It should also be faster, BTW. Stefan From stefan_ml at behnel.de Tue Oct 9 13:00:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Oct 2007 13:00:05 +0200 Subject: [lxml-dev] is there a binary windows installer for lxml 2.0alpha4 release? In-Reply-To: References: Message-ID: <470B5F35.7090601@behnel.de> Kelie wrote: > as subject. thanks. 1) According to PyPI: no. 2) According to me: not yet, wait for Sidnei to upload it to PyPI (see 1). Stefan From stefan_ml at behnel.de Tue Oct 9 13:02:11 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Oct 2007 13:02:11 +0200 Subject: [lxml-dev] Problem with lxml library running on Windows In-Reply-To: References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de> Message-ID: <470B5FB3.80708@behnel.de> Agustín Villena wrote: > For the same version of python and lxml > > - Doesn't crash in > Microsoft Windows Vista Ultimate > Version: 6.0.6000 build 6000 > > - Crashes after 137 iterations > Microsoft Windows XP Profesional > Version: 5.1.2600 Service Pack 2 Build 2600 Hmm, but then I can't see how this is supposed to be a problem with lxml. I mean, if the only difference is the code that Microsoft puts below the runtime environment, I would just go and ask Microsoft what they did wrong (or what they fixed in Vista to make it work).
Stefan From agustin.villena+gmane at gmail.com Tue Oct 9 13:28:09 2007 From: agustin.villena+gmane at gmail.com (Agustin Villena) Date: Tue, 09 Oct 2007 07:28:09 -0400 Subject: [lxml-dev] Problem with lxml library running on Windows In-Reply-To: <470B5FB3.80708@behnel.de> References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de> <470B5FB3.80708@behnel.de> Message-ID: I agree. Nonetheless, WinXP SP2 is too relevant a platform to be ignored :( And lxml is the fastest XPath alternative for Python. Do you know whether lxml's Windows egg build system is available in the source tree? It would be a very good base for me to debug and reproduce the problem. Does it use pre-compiled libxml libraries, or custom-compiled ones? Thanks Agustin Stefan Behnel wrote: > Agustín Villena wrote: >> For the same version of python and lxml >> >> - Doesn't crash in >> Microsoft Windows Vista Ultimate >> Version: 6.0.6000 build 6000 >> >> - Crashes after 137 iterations >> Microsoft Windows XP Profesional >> Version: 5.1.2600 Service Pack 2 Build 2600 > > Hmm, but then I can't see how this is supposed to be a problem with lxml. I > mean, if the only difference is the code that Microsoft puts below the runtime > environment, I would just go and ask Microsoft what they did wrong (or what > they fixed in Vista to make it work). > > Stefan From stefan_ml at behnel.de Tue Oct 9 15:02:29 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Oct 2007 15:02:29 +0200 Subject: [lxml-dev] Problem with lxml library running on Windows In-Reply-To: References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de> <470B5FB3.80708@behnel.de> Message-ID: <470B7BE5.8000404@behnel.de> Agustin Villena wrote: > Nonetheless, WinXP SP2 is too relevant a platform to be ignored :( Sadly, yes.
> Do you know whether lxml's Windows egg build system is available in the > source tree? It would be a very good base for me to debug and reproduce > the problem. > Does it use pre-compiled libxml libraries, or custom-compiled ones? AFAIK (Sidnei will know better), it should compile with MSVC 2003 like this: http://codespeak.net/lxml/build.html#static-linking-on-windows You might also have success with MinGW (setup.py --compiler=mingw32). If you need help, please ask back on the list. Stefan From mwm-keyword-lxml.9112b8 at mired.org Tue Oct 9 21:01:56 2007 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Tue, 9 Oct 2007 15:01:56 -0400 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: <470B5A57.4030102@behnel.de> References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> <470B5A57.4030102@behnel.de> Message-ID: <20071009150156.2a90e40a@bhuda.mired.org> On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel wrote: > ok, that wasn't the problem then (it's still good to have it fixed, though). > Mike Meyer wrote: > > A master process reads in a couple of config files, and parses and > > checks them against a schema, and then possibly plugs in some default > > attribute values. It then forks two processes: > > > > 1) Uses http to get xml documents from a remote server. These are the > > ones I described; they have a meta element and then a data element > > containing "row" elements, with the actual values in the attributes > > to "row" elements. This process uses iterparse to pull one value > > from the meta element, and then saves the entire thing to disk. > That's the process that fails, right? No, it's the second process, that uses xpath expressions to find elements to pull the attribute values from, that fails. > Can you find out if it's the iterparse() or something else that fails here? Well, I did try isolating parts of the parsing process.
The problem appears to be in the attribute extraction code. Basically, I have a routine that I pass an xpath expression to, and a list of attributes I want values for from those elements. I was being clever (probably too clever), and letting lxml provide a dictionary, using dict to make a copy of it (i.e. - the "d = dict(node.attrib)" line), and then playing games with sets to remove extra keys and add empty strings for missing attributes. If I just create an empty dictionary and plug empty strings into it for all the keys, the problem goes away. So I rewrote that code with something a bit more straightforward: d = dict() for key in keys: d[key] = node.get(key, '') and again, I haven't been able to recreate the problem. The rest of this is probably irrelevant at this point. I've got code that appears to be working, and things to try if it doesn't work. If you'd like to continue chasing this, let me know if there's anything I can do to help. > Using valgrind is usually a great way to find out what's going wrong. It will > make the run a lot slower, but it should print some helpful infos when it > crashes. Run it like this: > > valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \ > python yourscript.py > > preferably only on the process that crashes. I've got this. I get errors from the Python parser and Oracle libraries (uninitialized values). Then errors from lxml that look like the gdb "where" output: it just points through etree.so, but adds that it's doing an invalid read of size 8 or 4 (didn't have the size before, but this should cause the segfaults). These all seem to be followed by an error that says Address 0x4D31450 is 8 bytes inside a block of size 120 free'd And then traces back through vg_replace_malloc.c, then xmlFreeNodeList in libxml2 a couple of times, and then back to etree.so. > > architecture makes things a little convoluted, but the basic path is > > something like: > > > > data = urlopen(....)
> > try: > > parsed = fromstring(data.read()) > > parse(data) should do, BTW. Yeah, I know. But the urlopen happens in a different process (and host, for that matter) than the parsing code. That got lost in the simplification. Note that I changed this - I'm actually using the "findall" method, not the "xpath" method, to find the elements of interest. All values passed to findall are paths as indicated, though. > > if not schema.validate(parsed): > > handle_broken_document(parsed=parsed) > > for node in parsed.findall('Types/Type'): > > d = dict(node.attrib): > > save_for_db(d) > > for node in parsed.findall('AltTypes/AltType'): > > d = dict(node.attrib): > > save_for_db(d) > > for node in parsed.findall('MoreTypes/MoreType'): > > d = dict(node.attrib): > > save_for_db(d) > > That's pretty straight forward code, I don't see any risk here. But I'm > wondering which of the two processes actually fails now - you're presenting > this one, but from your previous posts I though it was the other one that crashed. > > I tried turning of the parsing - which pretty much makes everything > > else do nothing but pass around the raw data - and got no failures. I > > also tried turning off just the validation, so that the work is still > > getting done - and got failures. > Hmmmm, are those failures related to validation errors? Nope. I have files without validation errors that cause failures, whereas I haven't caught the one test file that does validate causing problems. > Just in case it's the second process that fails (the XPath one), it could be > worth testing if using the XPath() class instead of the xpath() method works > better. That might give us a hint on where the problem comes from. It should > also be faster, BTW. I should have thought of that myself. Faster is good, so I went ahead and made this change. Haven't tried it in the dict(node.attrib) version, though. 
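The explicit-default extraction discussed above can be sketched as follows. This is only an illustration, not the code from the failing program: the sample document, the element names and the extract_attributes helper are all made up, and the import falls back to the standard library's ElementTree, which provides the same fromstring()/findall()/get() calls used here.

```python
# Sketch only: sample data, element names and helper are hypothetical.
try:
    from lxml import etree                   # preferred, if installed
except ImportError:
    import xml.etree.ElementTree as etree    # same API for the calls below

SAMPLE = b"""<root>
  <Types>
    <Type id="1" name="alpha" extra="ignored"/>
    <Type id="2"/>
  </Types>
</root>"""


def extract_attributes(root, path, keys, default=""):
    """Return one plain dict per element matched by `path`.

    Only the requested keys are copied; a missing attribute gets `default`.
    The attrib mapping is never copied wholesale; each value is fetched
    individually with get().
    """
    rows = []
    for node in root.findall(path):
        rows.append(dict((key, node.get(key, default)) for key in keys))
    return rows


root = etree.fromstring(SAMPLE)
rows = extract_attributes(root, "Types/Type", ["id", "name"])
print(rows)
```

Whether this avoids the crash in the original setup is not something the sketch can show; it only spells out the "ask for each key with a default" approach in runnable form.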
-- Mike Meyer http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. From stefan_ml at behnel.de Wed Oct 10 09:03:04 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 10 Oct 2007 09:03:04 +0200 Subject: [lxml-dev] Dealing with segfaults in lxml? In-Reply-To: <20071009150156.2a90e40a@bhuda.mired.org> References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> <470B5A57.4030102@behnel.de> <20071009150156.2a90e40a@bhuda.mired.org> Message-ID: <470C7928.7000307@behnel.de> Mike Meyer wrote: > On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel wrote: >> Can you find out if it's the iterparse() or something else that fails here? > > Well, I did try isolating parts of the parsing process. The problem > appears to be in the attribute extraction code. > > Basically, I have a routine that I pass an xpath expression to, and a > list of attributes I want values for from those elements. I was being > clever (probably too clever), and letting lxml provide a dictionary, > using dict to make a copy of it (i.e. - the "d = dict(node.attrib)" > line), That should work though. You should also be able to safely do d = dict(node.items()) or something along those lines, which should even be faster as it avoids the intermediate attrib proxy and iterator creation steps. If you want to be more selective, a generator expression will do. > and then playing games with sets to remove extra keys and add > empty strings for missing attributes. If I just create an empty > dictionary and plug empty strings into it for all the keys, the > problem goes away. > > So I rewrote that code with something a bit more straightforward: > > d = dict() > for key in keys: > d[key] = node.get(key, '') > > and again, I haven't been able to recreate the problem. Hmmmm, this sounds like a deallocation problem then.
Calling .attrib creates a dict-like Proxy that adds a cyclic reference to the underlying Element, so this changes the garbage collection behaviour. Things have been going astray a couple of times already here, as this is really hard to get right for the tons and tons of possible use cases (involving threading race conditions and what not). Though I was pretty sure that 1.3.2+ didn't suffer from anything like that anymore and the attrib stuff should actually have been fixed in 1.2 already AFAIR. >> Using valgrind is usually a great way to find out what's going wrong. It will >> make the run a lot slower, but it should print some helpful infos when it >> crashes. Run it like this: >> >> valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \ >> python yourscript.py >> >> preferably only on the process that crashes. > > I've got this. I get errors from the Python parser and oracle > libraries (uninitialized values). Then errors from lxml that look like > the gdb "where" output: it just points through etree.so, but adds that > it's doing an invalid read of size 8 or 4 (didn't have the size > before, but this should cause the segfaults). These all seem to be > followed by an error that says > Address 0x4D31450 is 8 bytes inside a block of size 120 free'd > And then traces back through vg_replace_malloc.c, then xmlFreeNodeList > in libxml2 a couple of times, and then back to etree.so. Then it is a deallocation problem. Apparently, the XML nodes it accesses were already freed before - that's what's great about valgrind: it tells you what last happened to the memory that it now fails to access, so you can figure out why it was freed in the first place. Could you send me the output? Stefan From tillea at rki.de Thu Oct 11 09:56:24 2007 From: tillea at rki.de (Andreas Tille) Date: Thu, 11 Oct 2007 09:56:24 +0200 (CEST) Subject: [lxml-dev] Beginner question Message-ID: Hi, I'm sorry to start with this beginner question. 
Yesterday I stumbled over lxml and I think it is a really great tool which exactly is what I ever wanted but I'm afraid I need some kick start. I try to parse some XML files that are used as transport medium between different databases. We use a self defined XSD schema. The xml file lokes like this: ... With the code that I adopted from the tutorial for event, elem in etree.iterparse(infile, events=("start")): if event == "start": print "start:", etree.tostring(elem, pretty_print=True) print "--->", elem.tag I got something like: ... start: ---> {http://www3.rki.de/ns/agi/ibs/2007/T06/report}source start: ---> {http://www3.rki.de/ns/rki/base/ct/2007/T03}software start ---> {http://www3.rki.de/ns/agi/ibs/2007/T06/report}source ... the elements as a whole with children on the one hand but I have no idea how to finally access the values like 'idSource="NRZ Berlin" ' nor do I have an idea how to get rid of the default name space that is prepended before the tags. I would rather like to access the tag called "source" (without the default name space) or "ct:software" with the shortcut of the name space. I also found the very interesting objectify method at http://codespeak.net/lxml/objectify.html but I finally have no idea how to use that in the parser because the page just describes creating objects (or did I missed something?) Sorry for my ignorance in case things should be obvious from reading the docs. Kind regards Andreas. -- http://fam-tille.de From stefan_ml at behnel.de Thu Oct 11 13:49:59 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 11 Oct 2007 13:49:59 +0200 Subject: [lxml-dev] Beginner question In-Reply-To: References: Message-ID: <470E0DE7.2080600@behnel.de> Andreas Tille wrote: > I'm sorry to start with this beginner question. Everyone's a beginner right from the start. 
:) > With the code that I adopted from the tutorial > > for event, elem in etree.iterparse(infile, events=("start")): > if event == "start": > print "start:", etree.tostring(elem, pretty_print=True) > print "--->", elem.tag The "start" event only guarantees that the Element itself is complete, but its children may or may not be parsed yet. Use the "end" event if you need to access the children. BTW, testing for event == "start" if you already restricted the events to ("start",) is redundant. > idea how to finally access the values like 'idSource="NRZ Berlin" ' That would be an attribute. Read the tutorial on this. http://codespeak.net/lxml/tutorial.html#elements-carry-attributes > nor do I have an idea how to get rid of the default name space that > is prepended before the tags. I would rather like to access the tag > called "source" (without the default name space) But there *is* a namespace, so how would you distinguish it from a plain "source" tag without namespace? If it's just for brevity, you can always use string constants. > or "ct:software" with the shortcut of the name space. Who guarantees that the namespace prefix ("ct") is used in all data files? Your code would stop working if it wasn't... > I also found the very interesting objectify method at > http://codespeak.net/lxml/objectify.html > but I finally have no idea how to use that in the parser because > the page just describes creating objects (or did I missed something?) http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify iterparse() also returns a (special) parser, so the setup of the lookup scheme should work alike. I never tried it, but this should work: parser = etree.iterparse(source_file, remove_blank_text=True) lookup = objectify.ObjectifyElementClassLookup() parser.setElementClassLookup(lookup) for event, element in parser: ... 
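The points above ("end" events, attribute access, a string constant for the namespaced tag) can be put together in a short sketch. The namespace URI and the idSource attribute come from the output quoted in the thread; the sample document itself is invented, since the real file is not shown, and the import falls back to the standard library's ElementTree, which offers the same iterparse() interface for this usage.

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Namespace URI as quoted in the thread; SAMPLE is a made-up stand-in
# for the real data file.
REPORT_NS = "http://www3.rki.de/ns/agi/ibs/2007/T06/report"
SOURCE_TAG = "{%s}source" % REPORT_NS    # string constant, Clark notation

SAMPLE = (
    '<report xmlns="%s">'
    '<source idSource="NRZ Berlin"><child/></source>'
    '</report>' % REPORT_NS
).encode("utf-8")


def source_ids(stream):
    """Collect the idSource attribute of every <source> element."""
    found = []
    # events has to be a tuple: ("end",), not ("end").
    for event, elem in etree.iterparse(stream, events=("end",)):
        # On "end" the element and all of its children are fully parsed.
        if elem.tag == SOURCE_TAG:
            found.append(elem.get("idSource"))
    return found


ids = source_ids(io.BytesIO(SAMPLE))
print(ids)
```

Comparing against the full "{namespace}source" constant keeps the check exact, so a plain "source" element without a namespace would not be picked up by accident.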
Stefan From tillea at rki.de Thu Oct 11 14:35:18 2007 From: tillea at rki.de (Andreas Tille) Date: Thu, 11 Oct 2007 14:35:18 +0200 (CEST) Subject: [lxml-dev] Beginner question In-Reply-To: <470E0DE7.2080600@behnel.de> References: <470E0DE7.2080600@behnel.de> Message-ID: On Thu, 11 Oct 2007, Stefan Behnel wrote: >> for event, elem in etree.iterparse(infile, events=("start")): >> if event == "start": >> print "start:", etree.tostring(elem, pretty_print=True) >> print "--->", elem.tag > > The "start" event only guarantees that the Element itself is complete, but its > children may or may not be parsed yet. Use the "end" event if you need to > access the children. Does this mean the usage of etree.iterparse(infile, events=("end")) would be what I really want? > BTW, testing for event == "start" if you already restricted the events to > ("start",) is redundant. Right. The condition was a remaining from some other tests ... >> idea how to finally access the values like 'idSource="NRZ Berlin" ' > > That would be an attribute. Read the tutorial on this. > > http://codespeak.net/lxml/tutorial.html#elements-carry-attributes Ahhh, elem.get(attribute) did the trick. Thanks. > But there *is* a namespace, so how would you distinguish it from a plain > "source" tag without namespace? > > If it's just for brevity, you can always use string constants. I decided for if elem.tag.endswith('}source'): source = elem.get("idSource") because for practical reasons I can be sure that I'm in the default name space. >> or "ct:software" with the shortcut of the name space. > > Who guarantees that the namespace prefix ("ct") is used in all data files? > Your code would stop working if it wasn't... It would not validate before if the ct would be missing in the place where it is used here. But I can see your arguing and can cope with it. I just thought I would have missed something in the API that would enable me to use shortcuts. 
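On the question of prefix shortcuts, a small sketch may help. It assumes a current lxml or Python standard library, where find() and findall() accept a namespaces mapping; the two documents and the "other" prefix are invented for illustration, only the namespace URI is taken from the thread. The prefix map lives in the code and is independent of whatever prefix a data file happens to use.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

CT_NS = "http://www3.rki.de/ns/rki/base/ct/2007/T03"   # URI from the thread

# Two made-up documents binding different prefixes to the same namespace.
DOC_A = ('<root xmlns:ct="%s"><ct:software/></root>' % CT_NS).encode("utf-8")
DOC_B = ('<root xmlns:other="%s"><other:software/></root>' % CT_NS).encode("utf-8")

# Code-side shortcut: "ct" here is our own choice, not read from the files.
NSMAP = {"ct": CT_NS}

tags = []
for doc in (DOC_A, DOC_B):
    root = etree.fromstring(doc)
    match = root.find("ct:software", namespaces=NSMAP)
    tags.append(match.tag)

print(tags)
```

Both documents yield the same fully qualified tag, which is why matching on the namespace URI is robust while matching on the prefix text found in a file is not.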
> http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify > > iterparse() also returns a (special) parser, so the setup of the lookup scheme > should work alike. I never tried it, but this should work: > > parser = etree.iterparse(source_file, remove_blank_text=True) > > lookup = objectify.ObjectifyElementClassLookup() > parser.setElementClassLookup(lookup) > > for event, element in parser: > ... Well, when using the code: for event, element in parser: print "element: ", etree.tostring(element, pretty_print=True) gives for instance: element: element: element: I here also wonder how to obtain the attribute idSource from the source tag for instance. Many thanks for the hint in the beginning which brought me quite a step foreward Andreas. -- http://fam-tille.de From stefan_ml at behnel.de Thu Oct 11 16:04:35 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 11 Oct 2007 16:04:35 +0200 Subject: [lxml-dev] Beginner question In-Reply-To: References: <470E0DE7.2080600@behnel.de> Message-ID: <470E2D73.60904@behnel.de> Andreas Tille wrote: > On Thu, 11 Oct 2007, Stefan Behnel wrote: > >>> for event, elem in etree.iterparse(infile, events=("start")): >>> if event == "start": >>> print "start:", etree.tostring(elem, pretty_print=True) >>> print "--->", elem.tag >> The "start" event only guarantees that the Element itself is complete, but its >> children may or may not be parsed yet. Use the "end" event if you need to >> access the children. > > Does this mean the usage of > etree.iterparse(infile, events=("end")) > would be what I really want? Depends on what you want, but likely yes. Note that ("end",) is the default anyway. >>> or "ct:software" with the shortcut of the name space. >> Who guarantees that the namespace prefix ("ct") is used in all data files? >> Your code would stop working if it wasn't... > > It would not validate before if the ct would be missing in the place where > it is used here. Why not? 
You could use "humptydumpty:software" as long as you associated
"humptydumpty" with the right namespace. And your XML document could define
1000 prefixes for the same namespace and then use a different prefix for each
tag. And it would validate just fine, as the namespace would be correct.

>> http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify
>>
>> iterparse() also returns a (special) parser, so the setup of the lookup scheme
>> should work alike. I never tried it, but this should work:
>>
>>     parser = etree.iterparse(source_file, remove_blank_text=True)
>>
>>     lookup = objectify.ObjectifyElementClassLookup()
>>     parser.setElementClassLookup(lookup)
>>
>>     for event, element in parser:
>>         ...
>
> Well, when using the code:
>
>     for event, element in parser:
>         print "element: ", etree.tostring(element, pretty_print=True)
>
> gives for instance:
>
>     element:
>     element:
>
>
>     element:
>
> I here also wonder how to obtain the attribute idSource from the source tag
> for instance.

Same attribute access as before, just the child access API is different, as
described in the objectify docs.

Stefan

From felwert at uni-bremen.de Thu Oct 11 16:26:45 2007
From: felwert at uni-bremen.de (Frederik Elwert)
Date: Thu, 11 Oct 2007 16:26:45 +0200
Subject: [lxml-dev] [Spam: 5.001 ] Re: Beginner question
In-Reply-To:
References: <470E0DE7.2080600@behnel.de>
Message-ID: <1192112805.8247.12.camel@FredDesk>

On Thursday, 11.10.2007, at 14:35 +0200, Andreas Tille wrote:
> On Thu, 11 Oct 2007, Stefan Behnel wrote:
> > But there *is* a namespace, so how would you distinguish it from a plain
> > "source" tag without namespace?
> >
> > If it's just for brevity, you can always use string constants.
>
> I decided for
>
>     if elem.tag.endswith('}source'):
>         source = elem.get("idSource")
>
> because for practical reasons I can be sure that I'm in the default
> name space.
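Stefan's point about prefixes, made earlier in the thread, can be checked directly: after parsing, a tag is identified by its namespace URI, not by the prefix used in the file. A minimal sketch, using the standard library's xml.etree.ElementTree (lxml.etree reports tags in the same {namespace}name form). The namespace URI and the three sample documents are invented for illustration.

```python
import xml.etree.ElementTree as etree

# Hypothetical namespace URI; the three documents differ only in the prefix.
DOC_CT = '<ct:software xmlns:ct="http://example.org/ct"/>'
DOC_HD = '<humptydumpty:software xmlns:humptydumpty="http://example.org/ct"/>'
DOC_DEFAULT = '<software xmlns="http://example.org/ct"/>'

# Parse each document and collect the tag name the parser reports.
tags = [etree.fromstring(doc).tag for doc in (DOC_CT, DOC_HD, DOC_DEFAULT)]
print(tags)
```

All three tags come out identical, which is why code that matches on the qualified name keeps working whatever prefix a data file happens to use.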
If you want it a bit less "dirty" and more XMLish, you could use
local-name() from XPath:

    lname = etree.XPath('local-name()')
    if lname(elem) == 'source':
        source = elem.get('idSource')

Cheers,
Frederik

From stefan_ml at behnel.de Thu Oct 11 17:18:57 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 11 Oct 2007 17:18:57 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071010104245.361b359c@mbook.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> <470B5A57.4030102@behnel.de> <20071009150156.2a90e40a@bhuda.mired.org> <470C7928.7000307@behnel.de> <20071010104245.361b359c@mbook.mired.org>
Message-ID: <470E3EE1.3050707@behnel.de>

Mike Meyer wrote:
> On Wed, 10 Oct 2007 09:03:04 +0200 Stefan Behnel wrote:
>>     d = dict(node.items())
>>
>> or something in that line, which should even be faster as it avoids the
>> intermediate attrib proxy and iterator creation steps. If you want to be more
>> selective, a generator expression will do.
>
> I tried the node.items() variation, and that was still causing
> segfaults.

Then it's still different than I thought. If all you change is this line:

    d = dict(node.attrib)

and you get segfaults with this:

    d = dict(node.items())

but not with this:

    d = dict()
    for key in keys:
        d[key] = node.get(key, '?')

I really can't extract anything meaningful from that. The complete valgrind
trace would be helpful.

Stefan

From bkc at murkworks.com Mon Oct 15 20:18:46 2007
From: bkc at murkworks.com (Brad Clements)
Date: Mon, 15 Oct 2007 14:18:46 -0400
Subject: [lxml-dev] custom resolver, why does system url start with XSLT:?
Message-ID: <4713AF06.3060408@murkworks.com>

I have a project (XSL based TAL) that has used libxml2 and libxslt for a
couple of years. I have a custom resolver that has worked "ok" with this.

Now I have converted the project to use lxml. I am creating a parser and
adding my resolver.
When my resolver gets called, the URIs are weirdly mangled like this:

    resolve url 'XSLT:///xml/navigation.xml' id None ctext
    resolve url '/xml/carrier_payables_navigation.xml' id None ctext
    resolve url 'XSLT:///services/+payment_accounts' id None ctext

(the 2nd one is not mangled, looks ok to me)

What's the story with XSLT:// being stuck on the front of the system URLs? I
don't see that happen when I use libxml2 directly.

I tried looking through the lxml source to find this, but I couldn't find it
in docloader, parser, or xslt.

Where is the XSLT scheme coming from: is it lxml or libxslt? Why is it being
inserted?

The last example URL comes from using document() in a stylesheet (the
converted form of this:)