From optilude+lists at gmail.com Mon Nov 2 03:58:26 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 2 Nov 2009 02:58:26 +0000 (UTC) Subject: [lxml-dev] Critical crashes on Windows under high load Message-ID: Hi folks, We have an incredibly frustrating, show-stopping problem using lxml (under Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on Windows. Under high load, the Python process crashes. There is no traceback in the log, so I can't identify where it actually happens, but we get a Windows error dialogue saying python.exe (or pythonservice.exe if running as a Windows service) has crashed in etree.pyd (at some binary address, no line numbers or function references). The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're trying to simplify these, but there's nothing obviously wrong, and in any case it shouldn't crash. We've tried to run both multi-threaded and single-threaded 'paster' processes: the problem happens with both. I did read somewhere that it's possible to build a single-threaded lxml egg (?), but I haven't found one. We would be incredibly grateful for any help with (a) debugging and (b) resolving this. At present, we're having to fight a lot of nervousness regarding the production-worthiness of our Deliverance/lxml based solution, which is rather unfortunate. :-( Cheers, Martin From mykingheaven at gmail.com Mon Nov 2 08:04:26 2009 From: mykingheaven at gmail.com (David Shieh) Date: Mon, 2 Nov 2009 15:04:26 +0800 Subject: [lxml-dev] About encoding question ! Message-ID: Hey guys, I recently use lxml to do my HTML parsing, it's really great, and indeed the fastest one compare to other libraries. But since I begin to parse some other pages using gb2312 coding, I've a problem. The output is in here: http://david-paste.cn/paste/50/ Please help me with this, thanks you guys. Regards, David -- ---------------------------------------------- Attitude determines everything ! 
---------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091102/7c02ac03/attachment.htm From stefan_ml at behnel.de Mon Nov 2 14:50:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 02 Nov 2009 14:50:01 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: References: Message-ID: <4AEEE389.5080100@behnel.de> Martin Aspeli, 02.11.2009 03:58: > We have an incredibly frustrating, show-stopping problem using lxml I assume you are using lxml 2.2.2? > [...] on Windows. And now we have two problems... > Under high load, the Python process crashes. There is no traceback in the log, > so I can't identify where it actually happens, but we get a Windows error > dialogue saying python.exe (or pythonservice.exe if running as a Windows > service) has crashed in etree.pyd (at some binary address, no line numbers or > function references). I do not build the Windows binaries myself, so I have no idea if there are any debug symbols in there. Would certainly be nice to have them. > The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're > trying to simplify these, but there's nothing obviously wrong, and in any case > it shouldn't crash. XPath shouldn't crash by itself, so I'd rather focus the debugging on the other things you are doing. Are you running the XPath queries against trees that are being modified concurrently? Did you check for memory problems? Could you try to come up with a stripped down set of operations that your code does using lxml? And which of them happen concurrently? > We've tried to run both multi-threaded and single-threaded 'paster' processes: > the problem happens with both. Does that mean that this happens even if you run everything single-threaded? > We would be incredibly grateful for any help with (a) debugging and (b) > resolving this. 
At present, we're having to fight a lot of nervousness regarding > the production-worthiness of our Deliverance/lxml based solution, which is > rather unfortunate. :-( Certainly. Stefan From piet at cs.uu.nl Mon Nov 2 14:37:31 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Mon, 2 Nov 2009 14:37:31 +0100 Subject: [lxml-dev] About encoding question ! In-Reply-To: References: Message-ID: <19182.57499.984992.946714@Cochabamba.local> >>>>> David Shieh (DS) wrote: >DS> Hey guys, >DS> I recently use lxml to do my HTML parsing, it's really great, and >DS> indeed the fastest one compare to other libraries. >DS> But since I begin to parse some other pages using gb2312 coding, I've a >DS> problem. The output is in here: http://david-paste.cn/paste/50/ Firstly, HTML is not XML. XHTML is, however. So if your input is not XHTML, you should use an HTML parser instead of the XML parser. From the first error message it seems that you have a byte string as input, not a Unicode string (this also seems to be implied by your message: 'pages using gb2312 coding'). If you feed these to the XML parser they should contain an encoding declaration, like: <?xml version="1.0" encoding="gb2312"?> Otherwise the parser thinks it is utf-8, as the error message indicates. contents.encode('utf-8') doesn't make sense when contents contains a byte string. This would only make sense when it contains a Unicode string. Neither does contents.encode('gb2312'). contents.decode('utf-8') is wrong if contents does not contain a utf-8 encoded byte string. However, contents.decode('gb2312') would make sense if contents contains a gb2312 encoded byte string. This will deliver a Unicode string that you can pass to etree.fromstring. So etree.fromstring(contents.decode('gb2312')) could be an alternative to specifying gb2312 in the file itself.
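As a minimal sketch of the decode-then-parse route described above: the sample markup and the variable names are made up for illustration, and the import falls back to the standard library's ElementTree only so that the snippet still runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Made-up sample: a tiny document with Chinese text, stored as gb2312 bytes
# and carrying no encoding declaration.
raw = u"<page><title>\u4e2d\u6587</title></page>".encode("gb2312")

# Treating the bytes as UTF-8 fails, which is what the parser complains about.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Decoding with the actual encoding gives a Unicode string the parser accepts.
root = etree.fromstring(raw.decode("gb2312"))
title = root.findtext("title")
print(decodes_as_utf8, title == u"\u4e2d\u6587")
```

This only covers well-formed XML; for real-world HTML pages an HTML parser is probably still the better choice, as noted above.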
-- Piet van Oostrum WWW: http://pietvanoostrum.com/ PGP key: [8DAE142BE17999C4] From ndudfield at gmail.com Mon Nov 2 14:53:25 2009 From: ndudfield at gmail.com (Nicholas Dudfield) Date: Tue, 03 Nov 2009 00:53:25 +1100 Subject: [lxml-dev] lxml 2.2.3 released In-Reply-To: References: Message-ID: <4AEEE455.2050209@gmail.com> Stefan & Co, Great news on 2.2.3. Any ETA for windows 2.5 binaries? Cheers. From optilude+lists at gmail.com Mon Nov 2 15:24:39 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 02 Nov 2009 22:24:39 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4AEEE389.5080100@behnel.de> References: <4AEEE389.5080100@behnel.de> Message-ID: <4AEEEBA7.1090906@gmail.com> Stefan Behnel wrote: > Martin Aspeli, 02.11.2009 03:58: >> We have an incredibly frustrating, show-stopping problem using lxml > > I assume you are using lxml 2.2.2? Yes, though we also tried the latest in the 2.0.x line as a downgrade for a bit. Same problem. >> [...] on Windows. > > And now we have two problems... > > >> Under high load, the Python process crashes. There is no traceback in the log, >> so I can't identify where it actually happens, but we get a Windows error >> dialogue saying python.exe (or pythonservice.exe if running as a Windows >> service) has crashed in etree.pyd (at some binary address, no line numbers or >> function references). > > I do not build the Windows binaries myself, so I have no idea if there are > any debug symbols in there. Would certainly be nice to have them. Who does? Sidnei? >> The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're >> trying to simplify these, but there's nothing obviously wrong, and in any case >> it shouldn't crash. > > XPath shouldn't crash by itself, so I'd rather focus the debugging on the > other things you are doing. Are you running the XPath queries against trees > that are being modified concurrently? 
It's possible that Deliverance is doing something evil here, but I kind of doubt it. As far as I can tell, this is a Windows-specific problem, or at least no-one seems to have reported it on Unix. > Did you check for memory problems? How would I do that? > Could you try to come up with a stripped down set of operations that your > code does using lxml? And which of them happen concurrently? I'm not sure. It'd be difficult. The crash dialogue doesn't tell me where in lxml the problem is (since there's no stack trace). Deliverance is doing a fair amount of work with lxml (evaluating xpath expressions, parsing the two input trees (theme + content), modifying the output tree). So far, we've not been able to pinpoint exactly where it happens, or if it's even deterministic. >> We've tried to run both multi-threaded and single-threaded 'paster' processes: >> the problem happens with both. > > Does that mean that this happens even if you run everything single-threaded? We put the paster processes under which the WSGI pipeline runs into single threaded mode (or at least, we set the threadpool size of each process to 1), so in theory, there shouldn't be any concurrency. I don't know if that's actually the case, though. I guess the most constructive thing would be if I could find some better way of debugging this. People closer to the project (and server) where this is happening are working on a load test suite that can reproduce this reliably, though it's pretty much trial and error. The problem is that as of right now, I don't know what I'd do next even if they did make it occur reliably. I don't understand how lxml is built, how Cython works, how to write C extensions, or how to do C development on Windows. It's a loooong time since I wrote C/C++ and that was on Linux. 
;-) Martin From optilude+lists at gmail.com Mon Nov 2 15:27:00 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 02 Nov 2009 22:27:00 +0800 Subject: [lxml-dev] 2.2.2 binary egg for Mac OS X 10.6 In-Reply-To: References: Message-ID: Martin Aspeli wrote: > Martin Aspeli gmail.com> writes: > >>> Is there any chance we could have 10.6 eggs? If there are reliable build >>> instructions, I can help build them. > > Okay, I finally got this to work using zc.buildout and z3c.recipe.staticlxml. > > I have eggs for Python 2.4 and 2.6. The build for Python 2.5 is failing in > mysterious ways (it says "no egg found" in the temp directory). > > Can I have PyPI access (username 'optilude') to upload these? Otherwise, can I > send them somewhere for someone else? Actually, I'm not sure that these *do* work. I think I need to defer to Stefan Eletzhofer or someone else with a bit more experience of doing this right. It is really frustrating. People get stuck on this almost on a daily basis trying to use some of the new Plone tools we have that depend on lxml. :( I realise it's not lxml's fault, but unfortunately it's something that lxml will have to fix, since Apple aren't going to. ;-) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Mon Nov 2 16:01:09 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 02 Nov 2009 16:01:09 +0100 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4AEEEBA7.1090906@gmail.com> References: <4AEEE389.5080100@behnel.de> <4AEEEBA7.1090906@gmail.com> Message-ID: <4AEEF435.8030404@behnel.de> Martin Aspeli, 02.11.2009 15:24: > Stefan Behnel wrote: >> Martin Aspeli, 02.11.2009 03:58: >>> Under high load, the Python process crashes. 
There is no traceback in the log, >>> so I can't identify where it actually happens, but we get a Windows error >>> dialogue saying python.exe (or pythonservice.exe if running as a Windows >>> service) has crashed in etree.pyd (at some binary address, no line numbers or >>> function references). >> I do not build the Windows binaries myself, so I have no idea if there are >> any debug symbols in there. Would certainly be nice to have them. > > Who does? Sidnei? Yes. >>> The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're >>> trying to simplify these, but there's nothing obviously wrong, and in any case >>> it shouldn't crash. >> XPath shouldn't crash by itself, so I'd rather focus the debugging on the >> other things you are doing. Are you running the XPath queries against trees >> that are being modified concurrently? > > It's possible that Deliverance is doing something evil here, but I kind > of doubt it. As far as I can tell, this is a Windows-specific problem, > or at least no-one seems to have reported it on Unix. So I assume you ran similar load tests under Unix systems? >> Did you check for memory problems? > > How would I do that? I mean, does the process' memory usage grow uncontrolled? If it's running out of memory, it's quite possible that it crashes. Not all memory errors can be handled safely. >> Could you try to come up with a stripped down set of operations that your >> code does using lxml? And which of them happen concurrently? > > I'm not sure. It'd be difficult. Who said debugging would come for free? > The crash dialogue doesn't tell me > where in lxml the problem is (since there's no stack trace). Deliverance > is doing a fair amount of work with lxml (evaluating xpath expressions, > parsing the two input trees (theme + content), modifying the output > tree). Is that one tree per thread or are trees being handled by multiple threads? 
If threads don't share data, it can't be a threading issue (at least not from the POV of lxml). >>> We've tried to run both multi-threaded and single-threaded 'paster' processes: >>> the problem happens with both. >> Does that mean that this happens even if you run everything single-threaded? > > We put the paster processes under which the WSGI pipeline runs into > single threaded mode (or at least, we set the threadpool size of each > process to 1), so in theory, there shouldn't be any concurrency. I don't > know if that's actually the case, though. It would be helpful if you could find out. In the worst case, you can inject a WSGI layer that simply acquires a lock while it forwards the request. Then you're sure it's single threaded. > I guess the most constructive thing would be if I could find some better > way of debugging this. People closer to the project (and server) where > this is happening are working on a load test suite that can reproduce > this reliably, though it's pretty much trial and error. The problem is > that as of right now, I don't know what I'd do next even if they did > make it occur reliably. Well, at least, if it can be reproduced, it can be tracked down and fixed. > I don't understand how lxml is built, how Cython works, how to write C > extensions, or how to do C development on Windows. It's a loooong time > since I wrote C/C++ and that was on Linux. ;-) Luckily, you don't have to. lxml is written in Cython, not in C. Stefan From optilude+lists at gmail.com Mon Nov 2 16:11:45 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 02 Nov 2009 23:11:45 +0800 Subject: [lxml-dev] Critical crashes on Windows under high load In-Reply-To: <4AEEF435.8030404@behnel.de> References: <4AEEE389.5080100@behnel.de> <4AEEEBA7.1090906@gmail.com> <4AEEF435.8030404@behnel.de> Message-ID: <4AEEF6B1.6050102@gmail.com> Stefan Behnel wrote: >>>> The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. 
We're >>>> trying to simplify these, but there's nothing obviously wrong, and in any case >>>> it shouldn't crash. >>> XPath shouldn't crash by itself, so I'd rather focus the debugging on the >>> other things you are doing. Are you running the XPath queries against trees >>> that are being modified concurrently? >> It's possible that Deliverance is doing something evil here, but I kind >> of doubt it. As far as I can tell, this is a Windows-specific problem, >> or at least no-one seems to have reported it on Unix. > > So I assume you ran similar load tests under Unix systems? No, I wish we could. :( I'm basing this on the fact that (a) Unix deployments seem more common (b) no-one has reported this on Unix that I can see and (c) I've found at least one other person with Windows crashes. But who knows, I could be completely wrong. What I can say for certain is that the crashes do occur from time to time under relatively normal usage patterns. >>> Did you check for memory problems? >> How would I do that? > > I mean, does the process' memory usage grow uncontrolled? If it's running > out of memory, it's quite possible that it crashes. Not all memory errors > can be handled safely. We normally discover the error only after the process has crashed. There's no pre-warning. It looks like memory usage is relatively stable when the system is running normally. I'll try to take a closer look, though. >>> Could you try to come up with a stripped down set of operations that your >>> code does using lxml? And which of them happen concurrently? >> I'm not sure. It'd be difficult. > > Who said debugging would come for free? Heh, true. A *lot* of time has gone into this already. We're talking about a fairly big stack here, though. What I think we try, though is to attempt to reproduce the problem with a load test suite and a static back end instead of having Plone in the mix. That should produce a relatively small WSGI pipeline and a manageable amount of code. 
If it still crashes, of course. >> The crash dialogue doesn't tell me >> where in lxml the problem is (since there's no stack trace). Deliverance >> is doing a fair amount of work with lxml (evaluating xpath expressions, >> parsing the two input trees (theme + content), modifying the output >> tree). > > Is that one tree per thread or are trees being handled by multiple threads? > If threads don't share data, it can't be a threading issue (at least not > from the POV of lxml). One per thread almost certainly. They're read on each request as far as I can tell. I'd have to defer to the Deliverance developers, though. >>>> We've tried to run both multi-threaded and single-threaded 'paster' processes: >>>> the problem happens with both. >>> Does that mean that this happens even if you run everything single-threaded? >> We put the paster processes under which the WSGI pipeline runs into >> single threaded mode (or at least, we set the threadpool size of each >> process to 1), so in theory, there shouldn't be any concurrency. I don't >> know if that's actually the case, though. > > It would be helpful if you could find out. In the worst case, you can > inject a WSGI layer that simply acquires a lock while it forwards the > request. Then you're sure it's single threaded. Does anyone know? We're using Paste#httpserver and set threadpool_count = 1. I assume that means single threaded? >> I guess the most constructive thing would be if I could find some better >> way of debugging this. People closer to the project (and server) where >> this is happening are working on a load test suite that can reproduce >> this reliably, though it's pretty much trial and error. The problem is >> that as of right now, I don't know what I'd do next even if they did >> make it occur reliably. > > Well, at least, if it can be reproduced, it can be tracked down and fixed. Yeah. That's basically what we're working towards now. 
But it's not straightforward, at least not in a way that we can give to other people to look at. >> I don't understand how lxml is built, how Cython works, how to write C >> extensions, or how to do C development on Windows. It's a loooong time >> since I wrote C/C++ and that was on Linux. ;-) > > Luckily, you don't have to. lxml is written in Cython, not in C. But libxml2 and libxslt are. I suppose it's conceivable the problem is there, or in the way they're statically linked perhaps? Not that I understand Cython either. ;-) Thanks for your help! Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From manu3d at gmail.com Mon Nov 2 16:29:39 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Mon, 2 Nov 2009 15:29:39 +0000 Subject: [lxml-dev] Handling processing instructions In-Reply-To: <915dc91d0910280949leb7379arec539568a9aad834@mail.gmail.com> References: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com> <433ebc870910271208s499e722j4b3d1441c80b0987@mail.gmail.com> <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com> <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com> <915dc91d0910280949leb7379arec539568a9aad834@mail.gmail.com> Message-ID: <915dc91d0911020729w380be326qdd2955c3d0f21e07@mail.gmail.com> Stefan, I don't know if you missed this thread: is it possible to remove a processing instruction that is a preceding sibling of the root node of an ElementTree? Somehow I can access it via tree.getroot().getprevious() or tree.getroot().itersiblings(preceding=True).next() but I cannot find a way to delete it. A test case to cut&paste: from lxml import etree from StringIO import StringIO tree = etree.parse(StringIO("")) Thank you! Manu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091102/1b24cd64/attachment-0001.htm From jholg at gmx.de Thu Nov 5 14:08:18 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 05 Nov 2009 14:08:18 +0100 Subject: [lxml-dev] confusing xpath performance characteristics Message-ID: <20091105130818.216300@gmx.net> Hi, I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing: I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema): >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.095885038375854492, 0.096823930740356445, 0.096174955368041992] So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance: >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.1770780086517334, 0.1775970458984375, 0.17748594284057617] Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema: >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [103.79744100570679, 103.83671712875366, 103.61817717552185] What??? 
>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.044407129287719727, 0.044126987457275391, 0.044229030609130859] >>> Ok, this version's better than my naive approach, which seems logical to me. But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ? libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this). Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From jholg at gmx.de Thu Nov 5 22:05:07 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 05 Nov 2009 22:05:07 +0100 Subject: [lxml-dev] confusing xpath performance characteristics In-Reply-To: <20091105130818.216300@gmx.net> References: <20091105130818.216300@gmx.net> Message-ID: <20091105210507.314490@gmx.net> Oops, > But why would > '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' > perform drastically slower than > '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' > ? > > libxml2 problem? Running the same xpaths in Oxygen I don't notice > performance differences (can't profile this). That was supposed to read: Why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '//xs:element[@name="equity"]/@type' ? 
Holger From kevinar18 at hotmail.com Fri Nov 6 04:49:31 2009 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Thu, 5 Nov 2009 22:49:31 -0500 Subject: [lxml-dev] No module named lxml In-Reply-To: References: Message-ID: I'm not really sure how to setup lxml properly, so I need some help. Summary: import lxml gives the following error: ImportError: No module named lxml Background: I first tried out the install instructions here: http://codespeak.net/lxml/installation.html#ms-windows This involved 3 steps: 1) using easy_install lxml 2) Installing libxml 3) Installing libxslt Suffice to say that method never worked right. During all those steps, I probably did something wrong. Next, I tried the Windows binary egg method (after I deleted the previous lxml install). I downloaded: lxml-2.2.2-py2.6-win-amd64.egg and I copied the file to: Python2.6.1\Lib\site-packages\lxml-2.2.2-py2.6-win-amd64.egg Questions: * Is this the correct way to install an egg? (or is there more to it?) * Could, elementree or html5lib packages be causing a conflict? * So, what am I doing wrong? :) Is it something simple that I missed or should it be working the way I have it now? 
System info: Windows Vista 64bit Python 2.6.4 64bit Contents of my site-packages: babel-0.9.4-py2.6.egg cherrypy-3.1.1-py2.6.egg decorator-3.0.0-py2.6.egg django-1.0.2_final-py2.6.egg django_wikiapp-0.2.0-py2.6.egg 939 easy-install.pth elementtree-1.2.7_20070827_preview-py2.6.egg elixir-0.3.0-py2.6.egg elixir-0.6.1-py2.6.egg FormEncode-1.2.3dev_r0-py2.6.egg Genshi-0.5.1-py2.6.egg Genshi.egg-info html5lib-0.11.1-py2.6 - Copy.zip html5lib-0.11.1-py2.6.egg lxml-2.2.2-py2.6-win-amd64.egg mako-0.2.4-py2.6.egg nose-0.10.4-py2.6.egg paste-1.7.2-py2.6.egg pastedeploy-1.3.3-py2.6.egg pastescript-1.7.3-py2.6.egg posterity-0.6-py2.6.egg Pylons-0.9.7-py2.6.egg pysqlite-2.5.5-py2.6.egg-info pysqlite2 pytidylib-0.1.2-py2.6.egg-info README.txt resolver-0.2.1-py2.6.egg selector-0.8.11-py2.6.egg setuptools-0.6c9-py2.6.egg setuptools.pth sqlalchemy-0.3.11-py2.6.egg sqlalchemy-0.5.3-py2.6.egg tempita-0.2-py2.6.egg tidy tidylib uTidylib-0.2-py2.6.egg-info weberror-0.10.1-py2.6.egg webob-0.9.6.1-py2.6.egg webtest-1.1-py2.6.egg wsgiref-0.1.2-py2.6.egg xlrd xlrd-0.6.1-py2.6.egg-info zzzzzzhtml5lib-0.11.1-py2.6 - Copyzzzzzzzzzzz -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091105/c41baef8/attachment.htm From stefan_ml at behnel.de Fri Nov 6 08:06:20 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Nov 2009 08:06:20 +0100 Subject: [lxml-dev] No module named lxml In-Reply-To: References: Message-ID: <4AF3CAEC.9090703@behnel.de> Hi, Kevin Ar18, 06.11.2009 04:49: > import lxml gives the following error: > > ImportError: No module named lxml Normal Python import error when it can't find a module or package. 
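A quick way to check whether the interpreter can locate a package at all, without importing it, might look like the sketch below. It assumes a current Python 3 (importlib.util.find_spec is not available in the Python 2.6 used in this thread), and the helper name is_importable is made up here.

```python
import importlib.util
import sys


def is_importable(name):
    """Return True if the import system can locate a top-level module or package."""
    return importlib.util.find_spec(name) is not None


print("lxml importable:", is_importable("lxml"))

# These are the locations that get searched. An egg file that was merely
# copied into site-packages (rather than installed) is typically not listed.
for entry in sys.path:
    print(entry)
```

If the package is reported as not importable, comparing the printed search path with the place where the files actually ended up is usually the next step.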
> Background: > > I first tried out the install instructions here: http://codespeak.net/lxml/installation.html#ms-windows > > This involved 3 steps: 1) using easy_install lxml 2) Installing libxml 3) Installing libxslt No, it just involves the step that is mentioned in that section, i.e. 1). > Suffice to say that method never worked right. No, that's not enough. Please provide the output of the easy_install call. > During all those steps, I probably did something wrong. Or something went wrong and you didn't notice. > Next, I tried the Windows binary egg method (after I deleted the previous lxml install). easy_install *is* the "Windows binary egg method". > I downloaded: lxml-2.2.2-py2.6-win-amd64.egg and I copied the file to: > > Python2.6.1\Lib\site-packages\lxml-2.2.2-py2.6-win-amd64.egg Ok, that's where it came from then. I noticed it in the file list in site-packages below. > * Is this the correct way to install an egg? (or is there more to it?) That's not how eggs work. They require installation. > * Could, elementree or html5lib packages be causing a conflict? No. > Contents of my site-packages: > [...] > lxml-2.2.2-py2.6-win-amd64.egg > [...] Move that file out of site-packages and run "easy_install" on it. If that fails, please provide the complete output. Stefan From jholg at gmx.de Fri Nov 6 08:57:48 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 06 Nov 2009 08:57:48 +0100 Subject: [lxml-dev] confusing xpath performance characteristics In-Reply-To: <20091105210507.314490@gmx.net> References: <20091105130818.216300@gmx.net> <20091105210507.314490@gmx.net> Message-ID: <20091106075748.282000@gmx.net> > That was supposed to read: > > Why would > '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' > perform drastically slower than > '//xs:element[@name="equity"]/@type' > ? 
And one more thing I forgot to mention: lxml version: 2.2.2 libxml2 version: (2, 6, 32) libxslt version: (1, 1, 23) Holger From stefan_ml at behnel.de Fri Nov 6 09:29:26 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Nov 2009 09:29:26 +0100 Subject: [lxml-dev] No module named lxml In-Reply-To: References: Message-ID: <4AF3DE66.2000008@behnel.de> Hi, Kevin Ar18, 06.11.2009 04:49: > I'm not really sure how to setup lxml properly, so I need some help. Please check if the updated installation instructions are clearer now. http://codespeak.net/lxml/installation.html Stefan From stefan_ml at behnel.de Fri Nov 6 14:54:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Nov 2009 14:54:27 +0100 Subject: [lxml-dev] confusing xpath performance characteristics In-Reply-To: <20091105130818.216300@gmx.net> References: <20091105130818.216300@gmx.net> Message-ID: <4AF42A93.5020703@behnel.de> Hi, jholg at gmx.de, 05.11.2009 14:08: > I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing: > > I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema): > >>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) > [0.095885038375854492, 0.096823930740356445, 0.096174955368041992] > > So I think I'm being smart and give a little more path information > - reckoning that this should *improve* performance: > >>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = 
etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) > [0.1770780086517334, 0.1775970458984375, 0.17748594284057617] > > Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema: > >>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) > [103.79744100570679, 103.83671712875366, 103.61817717552185] > > What??? > >>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) > [0.044407129287719727, 0.044126987457275391, 0.044229030609130859] > > Ok, this version's better than my naive approach, which seems logical to me. I have no idea what libxml2 does here, but it looks like there are some optimisations going on in the most simple case. Also see the other thread two weeks ago, where John Krukoff stumbled over unexpected performance differences when testing for namespaces. Consider running the evaluation through callgrind and kcachegrind to find out what happens in each case and what works differently. Stefan From kevinar18 at hotmail.com Sat Nov 7 02:58:11 2009 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Fri, 6 Nov 2009 20:58:11 -0500 Subject: [lxml-dev] No module named lxml In-Reply-To: <4AF3DE66.2000008@behnel.de> References: <4AF3DE66.2000008@behnel.de> Message-ID: > > I'm not really sure how to setup lxml properly, so I need some help. 
> > Please check if the updated installation instructions are clearer now. > > http://codespeak.net/lxml/installation.html That does help some, yes. Based on both of your replies I was able to get it working. However, there are still some potential points for confusion. Did you actually want some help to clarify parts of the Installation page (to make it easier for newcomers) or were you just asking if I was able to install it? I mean, I would be willing to, but I don't think that's what you really had in mind when you asked. :) On the other hand, if you do, then I guess I have a few questions I would need to ask first. Thanks again _________________________________________________________________ Hotmail: Trusted email with Microsoft's powerful SPAM protection. http://clk.atdmt.com/GBL/go/177141664/direct/01/ http://clk.atdmt.com/GBL/go/177141664/direct/01/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091106/e28092f9/attachment-0001.htm From stefan_ml at behnel.de Sat Nov 7 08:33:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Nov 2009 08:33:25 +0100 Subject: [lxml-dev] No module named lxml In-Reply-To: References: <4AF3DE66.2000008@behnel.de> Message-ID: <4AF522C5.8000302@behnel.de> Hi, Kevin Ar18, 07.11.2009 02:58: >>> I'm not really sure how to setup lxml properly, so I need some help. >> Please check if the updated installation instructions are clearer now. >> >> http://codespeak.net/lxml/installation.html >> > That does help some, yes. Based on both of your replies I was able to > get it working. However, there are still some potential points for confusion. > > Did you actually want some help to clarify parts of the Installation > page (to make it easier for newcomers) or were you just asking if I was > able to install it? I mean, I would be willing to, but I don't think > that's what you really had in mind when you asked. 
:) Any contribution is appreciated. > On the other hand, if you do, then I guess I have a few questions I > would need to ask first. Go ahead, this mailing list is there for getting questions answered (amongst other things). Stefan From stefan_ml at behnel.de Sat Nov 7 17:12:38 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Nov 2009 17:12:38 +0100 Subject: [lxml-dev] Handling processing instructions In-Reply-To: <915dc91d0911020729w380be326qdd2955c3d0f21e07@mail.gmail.com> References: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com> <433ebc870910271208s499e722j4b3d1441c80b0987@mail.gmail.com> <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com> <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com> <915dc91d0910280949leb7379arec539568a9aad834@mail.gmail.com> <915dc91d0911020729w380be326qdd2955c3d0f21e07@mail.gmail.com> Message-ID: <4AF59C76.1060806@behnel.de> Emanuele D'Arrigo, 02.11.2009 16:29: > is it possible to remove a processing instruction that is a preceding > sibling of the root node of an ElementTree? Somehow I can access it via > tree.getroot().getprevious() or > tree.getroot().itersiblings(preceding=True).next() but I cannot find a way > to delete it. > > A test case to cut&paste: > > from lxml import etree > from StringIO import StringIO > tree = etree.parse(StringIO("")) I don't think that's currently possible, no. Could you file a bug in the bug tracker? Thanks! Stefan From stefan_ml at behnel.de Sat Nov 7 17:24:43 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Nov 2009 17:24:43 +0100 Subject: [lxml-dev] Fun with unicode errors In-Reply-To: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com> References: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com> Message-ID: <4AF59F4B.2020607@behnel.de> Hi, thanks for the report. 
Praktikant3 - SAG, 29.10.2009 09:53: > Maybe you have an idea what could be happening here, otherwise I will > (try to) come back with a more complete example. For now I have this > small code excerpt that behaves strangely: (isinstance(output_xml, > lxml.etree._Element) is True) > > # The two ET.tostring() invocations below, (1) and (2), show the > # following behaviour: > > # (1) "works" (UnicodeDecodeError about el.text after (2)) > > # (1) (2) "works" (UnicodeDecodeError about el.text after (2)) > > # (2) does not work, lxml.etree.SerialisationError: IO_ENCODER about > ET.tostring() (2) > > ET.tostring(output_xml) # (1) Ok, so, do I understand this correctly: a normal serialisation works, right? Only when you start deleting text content, it will start failing to serialise? > # Make pretty-printing work by removing unnecessary whitespace: > for el in output_xml.iter(): > ET.tostring(el) # (2) > if len(el) and el.text and not el.text.strip(): > el.text = None > if el.tail and not el.tail.strip(): > el.tail = None I don't have your input file, so I can't test this. Please provide either the input file (private e-mail is ok) or at least the XML snippet that contains the text that makes this fail. To do this, try to remove only the .text or the .tail attribute and see which one of them produces this problem. Then, print a tag trace on each iteration to see which element fails and try to remove unrelated XML content until you can reproduce this with a short XML file. Stefan From dakota at brokenpipe.ru Sun Nov 8 23:28:03 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Mon, 9 Nov 2009 01:28:03 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements Message-ID: Hi, I have a problem I couldn't find the way to solve. Sadly if it's not possible. 
That's the code from XSLT extensions tutorial: Everything's working cool, but I need a bit more: Let's assume that's my execute function's signature: execute(self, context, self_node, input_node, output_parent) When it's started by XSLT processor, self_node contains tag with and tags inside it. What should I do to ask XSLT processor to evaluate it? I need to make self_node containing . And even more. What if my XSLT file looks like: I need to have execute function called for deepest first, I need to evaluate inside it to have ... And to process the rest having the result of processed and all XSLT instructions inside. Is there a way to do it? Or maybe there is some other way to get such possibilities? Thanks. -- Marat -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091109/82b2c501/attachment.htm From lxml-dev at barillari.org Mon Nov 9 02:56:27 2009 From: lxml-dev at barillari.org (Joseph Barillari) Date: Sun, 8 Nov 2009 20:56:27 -0500 Subject: [lxml-dev] Bug? lxml.html produces bogus results if HTML contains control characters Message-ID: <20091109015627.GB20347@barillari.org> Hi, I'm a big fan of lxml.html, but I think I've just found a bug in it: Here's a test case. html = """
one \x05two three
""" import lxml.html tree = lxml.html.fromstring(html) xpath = "/descendant::table" cells = tree.xpath(xpath)[0].getchildren()[0].getchildren() print [cell.text_content() for cell in cells] # prints ['one', ''] tree = lxml.html.fromstring(html.replace("\x05","")) cells = tree.xpath(xpath)[0].getchildren()[0].getchildren() print [cell.text_content() for cell in cells] # prints ['one', 'two', 'three'] The apparent bug is that lxml.html fails to parse the above HTML properly when it contains a control character (\x05 a.k.a ^E). Obviously, well-formed HTML will not contain such characters, but real-world HTML often does. The library does not report an error, but simply truncates the row at the control character, which made this behavior tricky for me track down. I'm running the following versions: lxml.etree: (2, 2, 2, 0) libxml used: (2, 7, 6) libxml compiled: (2, 7, 5) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 26) I must admit that I do not know if this is a bug in lxml or in libxml, but I am certainly willing to help investigate. Best, Joe From stefan_ml at behnel.de Mon Nov 9 08:44:03 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Nov 2009 08:44:03 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: Message-ID: <4AF7C843.6050509@behnel.de> Marat Dakota, 08.11.2009 23:28: > Hi, I have a problem I couldn't find the way to solve. Sadly if it's not > possible. > > That's the code from XSLT extensions tutorial: > > > > > > > > Everything's working cool, but I need a bit more: > > > > /> > > > > Let's assume that's my execute function's signature: > > execute(self, context, self_node, input_node, output_parent) > > When it's started by XSLT processor, self_node contains tag with > and tags inside it. What should I do to ask > XSLT processor to evaluate it? It's funny that you found the example above but didn't make it to the text sections right below it. 
http://codespeak.net/lxml/extensions.html#applying-xsl-templates Stefan From stefan_ml at behnel.de Mon Nov 9 09:57:12 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Nov 2009 09:57:12 +0100 Subject: [lxml-dev] confusing xpath performance characteristics In-Reply-To: <20091105210507.314490@gmx.net> References: <20091105130818.216300@gmx.net> <20091105210507.314490@gmx.net> Message-ID: <4AF7D968.4040400@behnel.de> [cross-posting this from lxml-dev] jholg at gmx.de, 05.11.2009 22:05: > Why would > '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' [takes 103.62 seconds in libxml2 2.6.32] > perform drastically slower than > '//xs:element[@name="equity"]/@type' [takes 0.096 seconds] > ? Yes, that's surprising, especially when you see the absolute times. I ran it through callgrind, and the problem is that the first case produces separate node set results, one for each parent element that was found and searched. I.e., the evaluation works more or less like this: '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' -> walk through all "xs:schema" root elements (1 result) -> walk through all "xs:complexType" children (many found) -> for each result, search matching "xs:element" descendants -> for each match, select the "type" attribute (1 result) -> collect all matched attributes in a node-set -> merge the set of results and sort them into document order It's the last operation, merging and sorting large sets of results, that makes this extremely slow - it takes 92% of the evaluation time in my tests (using libxml2 2.7.5). It's much faster to traverse the document in a single step, and just select single attributes from it, that can quickly be appended to the node set. I imagine that this step could actually be optimised away in many cases (like the case above, where results are guaranteed to be found in doc order), so I guess it's just in there to avoid too much special casing. But it seriously kills the performance here. 
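A minimal sketch for reproducing this comparison on a synthetic schema, since the original NDM.xsd could not be posted. The generated schema, its size and every element name other than "equity" are assumptions made for illustration; absolute timings depend on the libxml2 version and will not match the numbers quoted above.

```python
import timeit
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS}


def build_schema(n_types=200, n_elements=20):
    """Build a synthetic schema (hypothetical stand-in for NDM.xsd)."""
    schema = etree.Element("{%s}schema" % XS, nsmap={"xs": XS})
    for i in range(n_types):
        ctype = etree.SubElement(schema, "{%s}complexType" % XS, name="T%d" % i)
        seq = etree.SubElement(ctype, "{%s}sequence" % XS)
        for j in range(n_elements):
            # One "equity" element per complex type, the rest is filler.
            name = "equity" if j == 0 else "e%d" % j
            etree.SubElement(seq, "{%s}element" % XS, name=name, type="xs:string")
    return schema


schema = build_schema()

# The four expressions from the thread, from least to most specific.
expressions = [
    '//xs:element[@name="equity"]/@type',
    '/xs:schema//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type',
]

compiled = [etree.XPath(expr, namespaces=NS) for expr in expressions]

# All four select the same attributes; only the evaluation strategy differs.
results = [[str(value) for value in xp(schema)] for xp in compiled]

for expr, xp in zip(expressions, compiled):
    seconds = min(timeit.repeat(lambda: xp(schema), number=10, repeat=3))
    print("%8.4f s  %s" % (seconds, expr))
```

Checking that all expressions return the same result first makes it easier to attribute any timing difference to the evaluation itself rather than to a different match set.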
The sorting algorithm is an unstable shell-sort, whose exponential runtime I expect to be the key problem here. I would assume that in most cases, the partial node-sets are already in doc-order, so optimising away the sorting in favour of appending one node-set to the other whenever doc order is guaranteed to be preserved would drastically drop the runtime of the longer expression. Stefan From dakota at brokenpipe.ru Mon Nov 9 10:13:56 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Mon, 9 Nov 2009 12:13:56 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4AF7C843.6050509@behnel.de> References: <4AF7C843.6050509@behnel.de> Message-ID: Then it looks like I don't completely understand how it works. Could you please help me: XSLT file is: blabla Function is: def execute(self, context, self_node, input_node, output_parent): ????? print etree.tostring(deepcopy(self_node)) What should I put instead of ????? to make self_node have an attribute named test with "111" as value? I tried to call self.apply_templates for everything (for self_node, for self_node[0], even for input_node and output_parent). No result in either case, print result is the same: blabla What am I doing wrong? Thanks. -- Marat On Mon, Nov 9, 2009 at 10:44 AM, Stefan Behnel wrote: > > Marat Dakota, 08.11.2009 23:28: > > Hi, I have a problem I couldn't find the way to solve. Sadly if it's not > > possible. > > > > That's the code from XSLT extensions tutorial: > > > > > > > > > > > > > > > > Everything's working cool, but I need a bit more: > > > > > > > > > /> > > > > > > > > Let's assume that's my execute function's signature: > > > > execute(self, context, self_node, input_node, output_parent) > > > > When it's started by XSLT processor, self_node contains tag > with > > and tags inside it. What should I do to > ask > > XSLT processor to evaluate it? 
> > It's funny that you found the example above but didn't make it to the text > sections right below it. > > http://codespeak.net/lxml/extensions.html#applying-xsl-templates > > Stefan > From ptuzla at gmail.com Mon Nov 9 14:01:26 2009 From: ptuzla at gmail.com (Polat Tuzla) Date: Mon, 9 Nov 2009 15:01:26 +0200 Subject: [lxml-dev] ancestor-or-self Message-ID: Hi, I'm observing that xpath axis "ancestor-or-self" does not function properly. Below outputs demonstrate the case. In the first one whole tree is printed, and in the second one ancestor-or-self is used, but the result does not differ. I'll open a ticket for this as a bug, unless someone tells me that I'm missing a point. Thanks, Polat Tuzla In [309]: print etree.tostring(root, pretty_print=True) In [310]: print etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[0], pretty_print=True) .....: From stefan_ml at behnel.de Mon Nov 9 14:45:41 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Nov 2009 14:45:41 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: References: Message-ID: <4AF81D05.8010308@behnel.de> Polat Tuzla, 09.11.2009 14:01: > I'm observing that xpath axis "ancestor-or-self" does not function > properly. Below outputs demonstrate the case. > In the first one whole tree is printed, and in the second one > ancestor-or-self is used, but the result does not differ. > > I'll open a ticket for this as a bug, unless someone tells me that I'm > missing a point. 
> Thanks, > > Polat Tuzla > > > In [309]: print etree.tostring(root, pretty_print=True) > > > > > > > > > > In [310]: print > etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[0], > pretty_print=True) > .....: > > > > > > > > Note that you only look at the first result using the "[0]" subscript, which in this case is the root node. Stefan From jholg at gmx.de Mon Nov 9 15:04:45 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 09 Nov 2009 15:04:45 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: References: Message-ID: <20091109140445.30990@gmx.net> Hi, > I'm observing that xpath axis "ancestor-or-self" does not function > properly. Below outputs demonstrate the case. > In the first one whole tree is printed, and in the second one > ancestor-or-self is used, but the result does not differ. > > I'll open a ticket for this as a bug, unless someone tells me that I'm > missing a point. You're missing a point :) > In [309]: print etree.tostring(root, pretty_print=True) > > > > > > > > Ok, so you print out root here. > In [310]: print > etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[0], > pretty_print=True) > .....: > > > > > > > > Note how you print out root again, now: >>> root.xpath("/a/b/c/ancestor-or-self::*") [, , ] >>> root.xpath("/a/b/c/ancestor-or-self::*")[0] is root True >>> xpath() returns a list of elements in this case, of which you select the first item - which is root. ancestor-or-self is a forward axis and the position of nodes in a forward axis is defined in terms of document order: See the Xpath Rec: 2.4 Predicates [...] Thus, the ancestor, ancestor-or-self, preceding, and preceding-sibling axes are reverse axes; all other axes are forward axes. [...] 
The proximity position of a member of a node-set with respect to an axis is defined to be the position of the node in the node-set ordered in document order if the axis is a forward axis and ordered in reverse document order if the axis is a reverse axis. The first position is 1. Holger From ptuzla at gmail.com Mon Nov 9 15:17:52 2009 From: ptuzla at gmail.com (Polat Tuzla) Date: Mon, 9 Nov 2009 16:17:52 +0200 Subject: [lxml-dev] ancestor-or-self In-Reply-To: <4AF81D05.8010308@behnel.de> References: <4AF81D05.8010308@behnel.de> Message-ID: Thank you for your response. I looked at the other results, and they did not seem to obey the xpath axis either. By using ancestor-or-self, I'm expecting an output like this: But the other results that are returned to me are: In [311]: print etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[1], pretty_print=True) .....: In [312]: print etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[2], pretty_print=True) .....: In [313]: print etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[3], pretty_print=True) .....: IndexError: list index out of range On Mon, Nov 9, 2009 at 3:45 PM, Stefan Behnel wrote: > > Polat Tuzla, 09.11.2009 14:01: > > I'm observing that xpath axis "ancestor-or-self" does not function > > properly. Below outputs demonstrate the case. > > In the first one whole tree is printed, and in the second one > > ancestor-or-self is used, but the result does not differ. > > > > I'll open a ticket for this as a bug, unless someone tells me that I'm > > missing a point. 
> > Thanks, > > > > Polat Tuzla > > > > > > In [309]: print etree.tostring(root, pretty_print=True) > > > > > > > > > > > > > > > > > > > > In [310]: print > > etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[0], > > pretty_print=True) > > .....: > > > > > > > > > > > > > > > > > > Note that you only look at the first result using the "[0]" subscript, > which in this case is the root node. > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091109/0f473eec/attachment.htm From jholg at gmx.de Mon Nov 9 16:03:02 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 09 Nov 2009 16:03:02 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: <20091109140445.30990@gmx.net> References: <20091109140445.30990@gmx.net> Message-ID: <20091109150302.30970@gmx.net> I might have talked nonsense here: > ancestor-or-self is a forward axis and the position of nodes in a forward > axis is defined in terms of document order: > > See the Xpath Rec: > 2.4 Predicates > > [...] Thus, the ancestor, ancestor-or-self, preceding, and > preceding-sibling axes are reverse axes; all other axes are forward axes. > [...] The proximity position of a member of a node-set with respect to an > axis is defined to be the position of the node in the node-set ordered in > document order if the axis is a forward axis and ordered in reverse document > order if the axis is a reverse axis. The first position is 1. First of all, ancestor-or-self is a *reverse-axis* so the "proximity position" is in reverse document order. Second, I now think that this "proximity position" is only relevant with regard to positional predicate filtering, not with regard to the order of nodes in the xpath result node set. E.g. 
>>> root.xpath("/a/b/c/ancestor-or-self::*[position()=1]") [] >>> root.xpath("/a/b/c/ancestor-or-self::*[position()=2]") [] >>> root.xpath("/a/b/c/ancestor-or-self::*[position()=3]") [] >>> As you can see, position numbering indeed follows reverse document order here. Third, I *think* that, as the result of the xpath expression "/a/b/c/ancestor-or-self::*" is a node-set, the order of nodes in the nodeset is an implementation detail; document order might be a good choice to have consistent results. Clarifications welcome. Holger From thomas.schloegl at unicreditgroup.at Mon Nov 9 16:20:12 2009 From: thomas.schloegl at unicreditgroup.at (SCHLÖGL Thomas) Date: Mon, 9 Nov 2009 16:20:12 +0100 Subject: [lxml-dev] problem with etree.tostring and special characters inxml-text-attributes Message-ID: <98ED979E46225E47A0088454A34E03FE3C2C24@SRES1MXS5V1.res1.loc.lan.at> Hi, I have a problem with the replacement of special characters (xml-entities) in xml-text-properties of the etree.tostring method: After etree.tostring the characters &,<,> are replaced with &amp; &lt; &gt; but ' and " remain in the text-properties. e.g.: final_delim=end, record_delim='\n', delim='|', quote=double But I need: final_delim=end, record_delim=&apos;\n&apos;, delim=&apos;|&apos;, quote=double I wrote a method that iterates through all text properties after the tostring-method and replaces ' and " of this text-properties in the serialized string, but with large xml-files this consumes a lot of time. Does anybody have a better solution? Kind Regards, Thomas Ing. Thomas Schlögl 4266-UGG3C1 / CFO Datawarehouse Core Solutions Member of the Crossfunctions-Team UniCredit Global Information Services S.p.A. (UGIS S.p.A.) 
Zweigniederlassung Österreich Nordbergstrasse 13 A-1090 Vienna Room: TZB-6.18 Tel: +43 1 71730 50499 Mobile: - mailto:thomas.schloegl at wave-solutions.com From marcello at perathoner.de Mon Nov 9 17:22:37 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 09 Nov 2009 17:22:37 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: <20091109150302.30970@gmx.net> References: <20091109140445.30990@gmx.net> <20091109150302.30970@gmx.net> Message-ID: <4AF841CD.9090307@perathoner.de> jholg at gmx.de wrote: > Third, I *think* that, as the result of the xpath expression > "/a/b/c/ancestor-or-self::*" is a node-set, the order of nodes in the > nodeset is an implementation detail; document order might be a good > choice to have consistent results. > > Clarifications welcome. "An axis is either a forward axis or a reverse axis. An axis that only ever contains the context node or nodes that are after the context node in document order is a forward axis. An axis that only ever contains the context node or nodes that are before the context node in document order is a reverse axis." Forward and reverse axes both contain the nodes in document order. "The proximity position of a member of a node-set with respect to an axis is defined to be the position of the node in the node-set ordered in document order if the axis is a forward axis and ordered in reverse document order if the axis is a reverse axis." In a reverse axis the 'proximity position' is defined in reverse order. root.xpath ("/a/b/c/ancestor-or-self::*") returns the nodes in document order and root.xpath ("/a/b/c/ancestor-or-self::*")[0] is the first node in document order. It's a thinko, not a bug. 
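The distinction between result order and predicate position can be checked with a small sketch. The three-element document below is an assumption, reconstructed only from the element names a, b and c used in the query earlier in the thread.

```python
from lxml import etree

# Assumed stand-in for the document discussed in the thread.
root = etree.fromstring("<a><b><c/></b></a>")

# The result list of the whole expression comes back in document order.
nodes = root.xpath("/a/b/c/ancestor-or-self::*")
tags_in_result = [el.tag for el in nodes]

# Inside a predicate, positions on a reverse axis count backwards from
# the context node, so position 1 is <c> itself.
nearest = root.xpath("/a/b/c/ancestor-or-self::*[1]")[0].tag

# Parenthesising first applies the predicate to the node-set in
# document order, so position 1 is the root element.
first_in_doc_order = root.xpath("(/a/b/c/ancestor-or-self::*)[1]")[0].tag

print(tags_in_result, nearest, first_in_doc_order)
```

Indexing the Python list with [0] behaves like the parenthesised form, which is why it yields the root element.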
-- Marcello Perathoner webmaster at gutenberg.org From marcello at perathoner.de Mon Nov 9 18:15:48 2009 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 09 Nov 2009 18:15:48 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: <20091109150302.30970@gmx.net> References: <20091109140445.30990@gmx.net> <20091109150302.30970@gmx.net> Message-ID: <4AF84E44.4010303@perathoner.de> jholg at gmx.de wrote: > Clarifications welcome. I found a better explanation. XPath 2.0 clarifies all this: "[Definition: An axis step returns a sequence of nodes that are reachable from the context node via a specified axis. Such a step has two parts: an axis, which defines the "direction of movement" for the step, and a node test, which selects nodes based on their kind, name, and/or type annotation.] If the context item is a node, an axis step returns a sequence of zero or more nodes; otherwise, a type error is raised [err:XPTY0020]. The resulting node sequence is returned in document order. An axis step may be either a forward step or a reverse step, followed by zero or more predicates." ---- http://www.w3.org/TR/xpath20/#dt-axis-step "Note: When using predicates with a sequence of nodes selected using a reverse axis, it is important to remember that the the context positions for such a sequence are assigned in reverse document order. For example, preceding::foo[1] returns the first qualifying foo element in reverse document order, because the predicate is part of an axis step using a reverse axis. By contrast, (preceding::foo)[1] returns the first qualifying foo element in document order, because the parentheses cause (preceding::foo) to be parsed as a primary expression in which context positions are assigned in document order. Similarly, ancestor::*[1] returns the nearest ancestor element, because the ancestor axis is a reverse axis, whereas (ancestor::*)[1] returns the root element (first ancestor in document order). 
The fact that a reverse-axis step assigns context positions in reverse document order for the purpose of evaluating predicates does not alter the fact that the final result of the step is always in document order." ---- http://www.w3.org/TR/xpath20/#id-predicates -- Marcello Perathoner webmaster at gutenberg.org From stefan_ml at behnel.de Mon Nov 9 18:16:57 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Nov 2009 18:16:57 +0100 Subject: [lxml-dev] ancestor-or-self In-Reply-To: References: <4AF81D05.8010308@behnel.de> Message-ID: <4AF84E89.4020809@behnel.de> Polat Tuzla, 09.11.2009 15:17: > Thank you for your response. > I looked at the other results, and they did not seem to obey the xpath axis > either. > By using ancestor-or-self, I'm expecting an output like this: > > > > > > > > > But the other results that are returned to me are: > > In [311]: print etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[1], > pretty_print=True) > .....: > > > > > > An XPath query will not construct a new tree for you. What you see here is the result of serialising the second node in the result set, including its subtree *as defined in the document*. This has nothing to do with the query you ran *before* the serialisation, and which correctly returned the matching nodes in a list. Stefan From ptuzla at gmail.com Mon Nov 9 18:23:11 2009 From: ptuzla at gmail.com (Polat Tuzla) Date: Mon, 9 Nov 2009 19:23:11 +0200 Subject: [lxml-dev] ancestor-or-self In-Reply-To: <4AF84E89.4020809@behnel.de> References: <4AF81D05.8010308@behnel.de> <4AF84E89.4020809@behnel.de> Message-ID: OK. I see.. Thank you all for the quick responses. Regards, Polat On Mon, Nov 9, 2009 at 7:16 PM, Stefan Behnel wrote: > > Polat Tuzla, 09.11.2009 15:17: > > Thank you for your response. > > I looked at the other results, and they did not seem to obey the xpath > axis > > either. 
> > By using ancestor-or-self, I'm expecting an output like this: > > > > > > > > > > > > > > > > > > But the other results that are returned to me are: > > > > In [311]: print > etree.tostring(root.xpath("/a/b/c/ancestor-or-self::*")[1], > > pretty_print=True) > > .....: > > > > > > > > > > > > > > An XPath query will not construct a new tree for you. What you see here is > the result of serialising the second node in the result set, including its > subtree *as defined in the document*. This has nothing to do with the query > you ran *before* the serialisation, and which correctly returned the > matching nodes in a list. > > Stefan > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091109/74f0db5c/attachment.htm From manu3d at gmail.com Mon Nov 9 23:53:03 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Mon, 9 Nov 2009 22:53:03 +0000 Subject: [lxml-dev] Handling processing instructions In-Reply-To: <4AF59C76.1060806@behnel.de> References: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com> <433ebc870910271208s499e722j4b3d1441c80b0987@mail.gmail.com> <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com> <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com> <915dc91d0910280949leb7379arec539568a9aad834@mail.gmail.com> <915dc91d0911020729w380be326qdd2955c3d0f21e07@mail.gmail.com> <4AF59C76.1060806@behnel.de> Message-ID: <915dc91d0911091453l1635ad2et4ccac80caade77d9@mail.gmail.com> 2009/11/7 Stefan Behnel > > Emanuele D'Arrigo, 02.11.2009 16:29: > > is it possible to remove a processing instruction that is a preceding > > sibling of the root node of an ElementTree? > > I don't think that's currently possible, no. Could you file a bug in the > bug tracker? > Done: https://bugs.launchpad.net/lxml/+bug/479613 Manu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091109/49c8d894/attachment.htm From manu3d at gmail.com Tue Nov 10 00:08:21 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Mon, 9 Nov 2009 23:08:21 +0000 Subject: [lxml-dev] Entity References resolution Message-ID: <915dc91d0911091508t7f702d54l469ec8b5f4b44b70@mail.gmail.com> Hi everybody, today I tried this: parser = etree.XMLParser(resolve_entities=False) tree = etree.parse(StringIO("&aReference;"), parser) I was expecting it to simply treat the entity reference as text but it raises an error. What is the parser's parameter resolve_entities for then? Manu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091109/cf5ea461/attachment.htm From dakota at brokenpipe.ru Tue Nov 10 06:56:15 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Tue, 10 Nov 2009 08:56:15 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> Message-ID: Am I asking something wrong or something too obvious to answer? On Mon, Nov 9, 2009 at 12:13 PM, Marat Dakota wrote: > Than it looks like I don't completely understand how it works. Could you > please help me: > > XSLT file is: > > xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > xmlns:my="testns" > extension-element-prefixes="my"> > > > > /> > blabla > > > > > > Function is: > > def execute(self, context, self_node, input_node, output_parent): > ????? > print etree.tostring(deepcopy(self_node)) > > What should I put instead of ????? to make self_node having attribute named > test with "111" as value? > I tried to call self.apply_templates for everything (for self_node, for > self_node[0], even for input_node and output_parent). No result in either > case, print result is the same: > > > > blabla > > > What am I doing wrong? > > Thanks. 
> > -- > Marat > > > On Mon, Nov 9, 2009 at 10:44 AM, Stefan Behnel wrote: > >> >> Marat Dakota, 08.11.2009 23:28: >> > Hi, I have a problem I couldn't find the way to solve. Sadly if it's not >> > possible. >> > >> > That's the code from XSLT extensions tutorial: >> > >> > >> > >> > >> > >> > >> > >> > Everything's working cool, but I need a bit more: >> > >> > >> > >> > > > /> >> > >> > >> > >> > Let's assume that's my execute function's signature: >> > >> > execute(self, context, self_node, input_node, output_parent) >> > >> > When it's started by XSLT processor, self_node contains tag >> with >> > and tags inside it. What should I do to >> ask >> > XSLT processor to evaluate it? >> >> It's funny that you found the example above but didn't make it to the text >> sections right below it. >> >> http://codespeak.net/lxml/extensions.html#applying-xsl-templates >> >> Stefan >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091110/002eedb5/attachment.html From dakota at brokenpipe.ru Tue Nov 10 07:29:12 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Tue, 10 Nov 2009 09:29:12 +0300 Subject: [lxml-dev] Entity References resolution In-Reply-To: <915dc91d0911091508t7f702d54l469ec8b5f4b44b70@mail.gmail.com> References: <915dc91d0911091508t7f702d54l469ec8b5f4b44b70@mail.gmail.com> Message-ID: Hi, I don't know why but resolve_entities=False works for me only when DOCTYPE declaration is present. For your case will work: tree = etree.parse(StringIO('&aReference;'), parser) Just be careful because unparsed entities are not just parts of element's text, they are this element's child objects. This causes problems with attributes (I don't know if this is a bug or feature) because attributes are strings and can't contain objects inside. By this I mean the following. 
If you have xml like: bla &aReference; foo You'll get: >>> print tree.getroot().text bla >>> print tree.getroot()[0] &aReference; >>> print tree.getroot()[0].tail foo But when you have xml like: You'll get two child elements inside aRoot: >>> print tree.getroot()[0] &aReference; >>> print tree.getroot()[1] >>> print tree.getroot()[1].get('attr') bla foo So, parser prepends entity object before child node. That doesn't look nice... -- Marat On Tue, Nov 10, 2009 at 2:08 AM, Emanuele D'Arrigo wrote: > Hi everybody, > > today I tried this: > > parser = etree.XMLParser(resolve_entities=False) > tree = etree.parse(StringIO("&aReference;"), parser) > > I was expecting it to simply treat the entity reference as text but it > raises an error. What is the parser's parameter resolve_entities for then? > > Manu > > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091110/1edee2f7/attachment-0001.htm From stefan_ml at behnel.de Tue Nov 10 09:26:10 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Nov 2009 09:26:10 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> Message-ID: <4AF923A2.8010006@behnel.de> Hi, please don't top-post. Marat Dakota, 09.11.2009 10:13: > Than it looks like I don't completely understand how it works. Could you > please help me: > > XSLT file is: > > xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > xmlns:my="testns" > extension-element-prefixes="my"> > > > > /> > blabla > > > > Ah, yes, attributes. Attributes are mapped to smart strings when passed through Python code. > Function is: > > def execute(self, context, self_node, input_node, output_parent): > ????? 
> print etree.tostring(deepcopy(self_node)) Remember that self_node is the extension element itself. It will not change during the evaluation, so printing it is uninteresting. > What should I put instead of ????? to make self_node having attribute named > test with "111" as value? > I tried to call self.apply_templates for everything (for self_node, for > self_node[0], even for input_node and output_parent). I would guess that you want to call it on "input_node". Calling .apply_templates() should return a string (although I never tested that), which you can then add as a new attribute to the tree. You have to do that manually, though. I would expect that "is_attribute" flag of the smart string to be True in your case (see the XPath docs), that would allow you to distinguish attributes from plain text context. Stefan From stefan_ml at behnel.de Tue Nov 10 09:36:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Nov 2009 09:36:42 +0100 Subject: [lxml-dev] Entity References resolution In-Reply-To: References: <915dc91d0911091508t7f702d54l469ec8b5f4b44b70@mail.gmail.com> Message-ID: <4AF9261A.4060708@behnel.de> Marat Dakota, 10.11.2009 07:29: > I don't know why but resolve_entities=False works for me only when DOCTYPE > declaration is present. For your case will work: > > tree = etree.parse(StringIO(' "some_dtd_link">&aReference;'), parser) Undeclared entities are an error for XML parsers, regardless if you want to keep them in the tree or resolve them. BTW, using etree.fromstring() instead of etree.parse(StringIO()) is both more readable and more efficient. > Just be careful because unparsed entities are not just parts of element's > text, they are this element's child objects. This is a feature. > This causes problems with > attributes (I don't know if this is a bug or feature) Certainly not a feature. 
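A minimal sketch of the behaviour discussed here, assuming a made-up document whose internal DTD subset declares the entity (the entity value and the variable names are illustrative, not taken from the thread). With resolve_entities=False the declared reference is kept in the tree as a child node of the element, and the surrounding text ends up in .text and .tail:

```python
from lxml import etree

# Hypothetical input: the internal DTD subset declares the entity, so the
# parser accepts the reference instead of rejecting it as undeclared.
XML = (
    b'<!DOCTYPE aRoot [<!ENTITY aReference "replacement text">]>'
    b'<aRoot>bla &aReference; foo</aRoot>'
)

parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(XML, parser)

entity = root[0]            # the unresolved reference is a child node
print(repr(root.text))      # text before the reference
print(entity.name)          # entity name without '&' and ';'
print(repr(entity.tail))    # text after the reference
```

Code that walks such a tree can tell these nodes apart from elements by checking whether their tag is etree.Entity.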
> when you have xml like: > > > > You'll get two child elements inside aRoot: > >>> print tree.getroot()[0] > &aReference; > >>> print tree.getroot()[1] > > >>> print tree.getroot()[1].get('attr') > bla foo > > So, parser prepends entity object before child node. I doubt that it does that. However, this looks like something that needs better handling. Could you file a bug report? Thanks, Stefan From dakota at brokenpipe.ru Tue Nov 10 10:38:43 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Tue, 10 Nov 2009 12:38:43 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4AF923A2.8010006@behnel.de> References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> Message-ID: Thanks for reply. It looks like it's not working as expected. class MyExtElement(etree.XSLTExtension): def execute(self, context, self_node, input_node, output_parent): results = self.apply_templates(context, input_node) print results This code causes: Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in 'lxml.etree._callExtensionElement' ignored [] [] [] [] [] ... and so on for my example. When I try input_node[0] instead of input_node: results = self.apply_templates(context, input_node[0]) print results results is empty list. By the way, this simple code causes segmentation fault on my OSX, Linux and Windows machines. class MyExtElement(etree.XSLTExtension): def execute(self, context, self_node, input_node, output_parent): print input_node -- Marat On Tue, Nov 10, 2009 at 11:26 AM, Stefan Behnel wrote: > Hi, > > please don't top-post. > > > Marat Dakota, 09.11.2009 10:13: > > Than it looks like I don't completely understand how it works. Could you > > please help me: > > > > XSLT file is: > > > > > xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > > xmlns:my="testns" > > extension-element-prefixes="my"> > > > > > > > > > /> > > blabla > > > > > > > > > > Ah, yes, attributes. 
Attributes are mapped to smart strings when passed > through Python code. > > > > Function is: > > > > def execute(self, context, self_node, input_node, output_parent): > > ????? > > print etree.tostring(deepcopy(self_node)) > > Remember that self_node is the extension element itself. It will not change > during the evaluation, so printing it is uninteresting. > > > > What should I put instead of ????? to make self_node having attribute > named > > test with "111" as value? > > I tried to call self.apply_templates for everything (for self_node, for > > self_node[0], even for input_node and output_parent). > > I would guess that you want to call it on "input_node". Calling > .apply_templates() should return a string (although I never tested that), > which you can then add as a new attribute to the tree. You have to do that > manually, though. I would expect that "is_attribute" flag of the smart > string to be True in your case (see the XPath docs), that would allow you > to distinguish attributes from plain text context. > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091110/dc322a89/attachment.htm From stefan_ml at behnel.de Tue Nov 10 16:22:21 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Nov 2009 16:22:21 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> Message-ID: <4AF9852D.3020408@behnel.de> Hi, please, don't top-post. Marat Dakota, 10.11.2009 10:38: > Thanks for reply. It looks like it's not working as expected. 
> > class MyExtElement(etree.XSLTExtension): > def execute(self, context, self_node, input_node, output_parent): > results = self.apply_templates(context, input_node) > print results > > This code causes: > > Exception RuntimeError: 'maximum recursion depth exceeded while calling a > Python object' in 'lxml.etree._callExtensionElement' ignored > [] > [] > [] > [] > [] > ... and so on Ok, that's a bug. It shouldn't swallow that exception. It's because it's trying to call Python functions in the exception handler, which fails if the recursion limit is already reached. I'll see if I can make this more robust. But, yes, the above will necessarily lead to infinite recursion. Sorry, didn't think of that. The feature you want is not "apply_templates", as that just mimics the behaviour of xsl:apply-templates in XSLT, i.e. you can't define which template will be applied or which XSL tags will run. That feature is currently not available (apart from creating a new stylesheet from the content of the extension element and applying that to input_node, but that's a clumsy and also incomplete solution). > When I try input_node[0] instead of input_node: > > results = self.apply_templates(context, input_node[0]) > print results > > results is empty list. You didn't show your input document, so I don't know what "input_node" or "input_node[0]" actually are in your case. > By the way, this simple code causes segmentation fault on my OSX, Linux and > Windows machines. > > > class MyExtElement(etree.XSLTExtension): > def execute(self, context, self_node, input_node, output_parent): > print input_node No idea why, I'll have to look into that. 
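For comparison, a sketch of the usage pattern from the extensions documentation linked earlier in the thread. The recursion above comes from applying templates to input_node, which matches the same template that contains the extension element again. Applying templates to a child of the extension element (self_node[0]) instead lets a separate template handle it. The stylesheet and the names my:ext, child and MyExt are illustrative:

```python
from lxml import etree

STYLESHEET = etree.XML(b'''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="testns"
    extension-element-prefixes="my">
  <xsl:template match="/">
    <foo><my:ext><child>XYZ</child></my:ext></foo>
  </xsl:template>
  <xsl:template match="child">
    <CHILD>--xyz--</CHILD>
  </xsl:template>
</xsl:stylesheet>''')


class MyExt(etree.XSLTExtension):
    def execute(self, context, self_node, input_node, output_parent):
        # Apply the stylesheet's templates to the first child of the
        # extension element, not to the input node.
        child = self_node[0]
        results = self.apply_templates(context, child)
        # Results are not added to the output automatically.
        output_parent.append(results[0])


transform = etree.XSLT(STYLESHEET, extensions={('testns', 'ext'): MyExt()})
result = transform(etree.XML(b'<dummy/>'))
print(str(result))
```

This only runs the templates that match the child node. It does not evaluate arbitrary XSLT instructions placed inside the extension element, which is the feature described as unavailable above.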
Stefan From dakota at brokenpipe.ru Tue Nov 10 16:58:42 2009 From: dakota at brokenpipe.ru (Marat Dakota) Date: Tue, 10 Nov 2009 18:58:42 +0300 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4AF9852D.3020408@behnel.de> References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> Message-ID: On Tue, Nov 10, 2009 at 6:22 PM, Stefan Behnel wrote: > Hi, > > please, don't top-post. > Sorry, I didn't know what 'top-post' mean until you've asked me twice not to do it and I checked wikipedia. > > The feature you want is not "apply_templates", as that just mimics the > behaviour of xsl:apply-templates in XSLT, i.e. you can't define which > template will be applied or which XSL tags will run. That feature is > currently not available (apart from creating a new stylesheet from the > content of the extension element and applying that to input_node, but > that's a clumsy and also incomplete solution). > > That's sad. Do you think it's hard to implement this feature, are there libxml or libxslt limits that will not allow to do that? Maybe I could join in and dig a bit? Just don't know where to start. I hope it's not too complicated and it's possible - I have my problem's elegant solution, but it needs this feature. > > > When I try input_node[0] instead of input_node: > > > > results = self.apply_templates(context, input_node[0]) > > print results > > > > results is empty list. > > You didn't show your input document, so I don't know what "input_node" or > "input_node[0]" actually are in your case. > > My input document is just: > > > By the way, this simple code causes segmentation fault on my OSX, Linux > and > > Windows machines. > > > > > > class MyExtElement(etree.XSLTExtension): > > def execute(self, context, self_node, input_node, output_parent): > > print input_node > > No idea why, I'll have to look into that. My colleague traced it a bit. 
He was just very curious if it's python's segfault (which he never met and that's why he's curious) or lxml's one. He said lxml dies when trying to read property tag of input_node. -- Marat -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091110/6ddf3963/attachment.htm From stefan_ml at behnel.de Tue Nov 10 17:20:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Nov 2009 17:20:25 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> Message-ID: <4AF992C9.7090400@behnel.de> Marat Dakota, 10.11.2009 16:58: > On Tue, Nov 10, 2009 at 6:22 PM, Stefan Behnel wrote: >> The feature you want is not "apply_templates", as that just mimics the >> behaviour of xsl:apply-templates in XSLT, i.e. you can't define which >> template will be applied or which XSL tags will run. That feature is >> currently not available (apart from creating a new stylesheet from the >> content of the extension element and applying that to input_node, but >> that's a clumsy and also incomplete solution). >> > That's sad. Do you think it's hard to implement this feature, are there > libxml or libxslt limits that will not allow to do that? All it takes is figuring out how to make libxslt start its evaluation at a given point in the stylesheet document. That's usually more work than looking it up in the docs, although the place to start would likely be here: http://xmlsoft.org/XSLT/html/index.html http://xmlsoft.org/XSLT/html/libxslt-templates.html > Maybe I could join > in and dig a bit? Just don't know where to start. I hope it's not too > complicated and it's possible - I have my problem's elegant solution, but it > needs this feature. You can try to dig into libxslt to find it out. 
I don't currently have the time to implement major new features, but if I get an outline how this should work, so that I can estimate the amount of work it takes, I may get around to do it. >>> When I try input_node[0] instead of input_node: >>> >>> results = self.apply_templates(context, input_node[0]) >>> print results >>> >>> results is empty list. >> You didn't show your input document, so I don't know what "input_node" or >> "input_node[0]" actually are in your case. >> > My input document is just: > > In that case, input_node[0] just doesn't exist, so there's nothing to do. Actually, I wonder why that doesn't raise an IndexError... >>> By the way, this simple code causes segmentation fault on my OSX, Linux >> and Windows machines. >>> >>> class MyExtElement(etree.XSLTExtension): >>> def execute(self, context, self_node, input_node, output_parent): >>> print input_node >> No idea why, I'll have to look into that. > > My colleague traced it a bit. He was just very curious if it's python's > segfault (which he never met and that's why he's curious) or lxml's one. He > said lxml dies when trying to read property tag of input_node. My guess is that input_node is not an element here but a different kind of XML node. Looks like that case isn't handled in the sources (a seriously blatant omission, although I guess I just didn't expect that this can happen...) Stefan From stefan_ml at behnel.de Wed Nov 11 16:35:30 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 11 Nov 2009 16:35:30 +0100 Subject: [lxml-dev] lxml 2.2.4 released Message-ID: <4AFAD9C2.30102@behnel.de> Hi all, I just pushed a new bug-fix release to PyPI. The *only* change is a fix for the static build, which was accidentally broken in 2.2.3. There are no code changes in this release. 
Have fun, Stefan From stefan_ml at behnel.de Wed Nov 11 22:18:16 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 11 Nov 2009 22:18:16 +0100 Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT extension elements In-Reply-To: <4AF992C9.7090400@behnel.de> References: <4AF7C843.6050509@behnel.de> <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> Message-ID: <4AFB2A18.5060504@behnel.de> Stefan Behnel, 10.11.2009 17:20: > Marat Dakota, 10.11.2009 16:58: >>>> By the way, this simple code causes segmentation fault on my OSX, Linux >>>> and Windows machines. >>>> >>>> class MyExtElement(etree.XSLTExtension): >>>> def execute(self, context, self_node, input_node, output_parent): >>>> print input_node >>> >>> No idea why, I'll have to look into that. >> >> My colleague traced it a bit. He was just very curious if it's python's >> segfault (which he never met and that's why he's curious) or lxml's one. He >> said lxml dies when trying to read property tag of input_node. > > My guess is that input_node is not an element here but a different kind of > XML node. Yes, that was the reason. Template matching on "/" makes the *document node* the context node instead of the root node. While this is ok from the POV of XPath/XSLT, it doesn't make sense in lxml.etree. lxml 2.2.5 will no longer crash here (fix is committed) and lxml 2.3 will support other types of context nodes. Stefan From herve.cauwelier at free.fr Fri Nov 13 09:57:19 2009 From: herve.cauwelier at free.fr (=?UTF-8?B?SGVydsOpIENhdXdlbGllcg==?=) Date: Fri, 13 Nov 2009 09:57:19 +0100 Subject: [lxml-dev] inconsistency in the text nodes? 
Message-ID: <4AFD1F6F.8030407@free.fr>

Hi,

I've set up the following excerpt to reproduce it:

  >>> from lxml import etree
  >>> s = etree._ElementStringResult("herve")
  >>> u = etree._ElementUnicodeResult(u"Hervé")
  >>> print s.getparent()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "extensions.pxi", line 603, in lxml.etree._ElementStringResult.getparent (src/lxml/lxml.etree.c:99519)
  AttributeError: '_ElementStringResult' object has no attribute '_parent'
  >>> print u.getparent()
  None

I'd like both of them to have the same behaviour, my personal preference being returning None. I guess it would look like that:

--- lxml/extensions.pxi.orig 2009-11-13 09:46:27.760913564 +0100
+++ lxml/extensions.pxi 2009-11-13 09:46:34.869420791 +0100
@@ -597,6 +597,8 @@
         return self._parent

 class _ElementStringResult(str):
+    cdef _Element _parent
+    cdef readonly object is_tail
+    cdef readonly object is_text
+    cdef readonly object is_attribute
+
     # we need to use a Python class here, str cannot be C-subclassed
     # in Pyrex/Cython
     def getparent(self):

I'm playing with these directly because I need to wrap them.

Regards,

Hervé

From optilude+lists at gmail.com Sat Nov 14 17:25:16 2009
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Sun, 15 Nov 2009 00:25:16 +0800
Subject: [lxml-dev] lxml 2.2.3 - html.usedoctest bug?
Message-ID:

Hi,

I put this into my doctest:

  >>> import lxml.html.usedoctest

And now I get:

Error in test /Users/optilude/Development/Plone/Code/Build/plone/4.0/src/plone.app.blocks/plone/app/blocks/rendering.txt
Traceback (most recent call last):
  File "/opt/python/parts/opt/lib/python2.6/unittest.py", line 279, in run
    testMethod()
  File "/Users/optilude/.buildout/eggs/zope.testing-3.7.7-py2.6.egg/zope/testing/doctest.py", line 2327, in runTest
    test, out=write, clear_globs=False)
  File "/Users/optilude/.buildout/eggs/zope.testing-3.7.7-py2.6.egg/zope/testing/doctest.py", line 1497, in run
    return self.__run(test, compileflags, out)
  File "/Users/optilude/.buildout/eggs/zope.testing-3.7.7-py2.6.egg/zope/testing/doctest.py", line 1376, in __run
    if check(example.want, got, self.optionflags):
  File "/Users/optilude/.buildout/eggs/lxml-2.2.3-py2.6-macosx-10.6-i386.egg/lxml/doctestcompare.py", line 100, in check_output
    except etree.XMLSyntaxError:
NameError: global name 'etree' is not defined

This is pretty weird, since 'etree' is definitely imported in doctestcompare.py in my lxml egg.

Any ideas?

Martin

--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book

From stefan_ml at behnel.de Sun Nov 15 13:52:16 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 15 Nov 2009 13:52:16 +0100
Subject: [lxml-dev] inconsistency in the text nodes?
In-Reply-To: <4AFD1F6F.8030407@free.fr>
References: <4AFD1F6F.8030407@free.fr>
Message-ID: <4AFFF980.6000909@behnel.de>

Hi,

Hervé
Cauwelier, 13.11.2009 09:57:
> I've set up the following excerpt to reproduce it:
>
> >>> from lxml import etree
> >>> s = etree._ElementStringResult("herve")
> >>> u = etree._ElementUnicodeResult(u"Hervé")
> >>> print s.getparent()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "extensions.pxi", line 603, in
> lxml.etree._ElementStringResult.getparent
> (src/lxml/lxml.etree.c:99519)
> AttributeError: '_ElementStringResult' object has no attribute '_parent'
> >>> print u.getparent()
> None

Note the leading underscore in the class names. They are not public.

> I'd like both of them to have the same behaviour, my personal preference
> being returning None. I guess it would look like that:
>
> --- lxml/extensions.pxi.orig 2009-11-13 09:46:27.760913564 +0100
> +++ lxml/extensions.pxi 2009-11-13 09:46:34.869420791 +0100
> @@ -597,6 +597,8 @@
>          return self._parent
>
>  class _ElementStringResult(str):
> +    cdef _Element _parent
> +    cdef readonly object is_tail
> +    cdef readonly object is_text
> +    cdef readonly object is_attribute
> +
>      # we need to use a Python class here, str cannot be C-subclassed
>      # in Pyrex/Cython
>      def getparent(self):

That's not valid Cython code. "_ElementStringResult" is a normal Python class.

> I'm playing with these directly because I need to wrap them.

Can you provide some details why you want to do that?

Stefan

From cswiggett at knowledgemosaic.com Mon Nov 16 01:03:03 2009
From: cswiggett at knowledgemosaic.com (Clif Swiggett)
Date: Sun, 15 Nov 2009 16:03:03 -0800
Subject: [lxml-dev] etree Parser corrupting HTML
Message-ID: <4B0096B7.3090704@knowledgemosaic.com>

I've had two mysteries come up recently using etree.parse() (lxml version 2.2.2). Can anyone shed some light on how to work around these?

Start with this input

======input.html=========

Foo

====================== ... then parse using ... htmlTree = etree.parse(open("input.html"), parser=etree.HTMLParser()) open("out.xml", "wb").write(etree.tostring(htmlTree)) ... the resulting XML is corrupted in two ways: ======output.html=========

Foo

======================= Problem 1: The namespace declaration is duplicated (resulting in invalid XML) Problem 2: The

tag was moved outside of the

tag. Basic dom structure has been re-arranged. Neither of these problems happen when I use the XMLParser (e.g. 'xmlTree = etree.parse(open("test2.html"), parser=etree.XMLParser())'). However I don't know if my source html will be valid XHTML or, more likely, just HTML. Is there a way to get the HTMLParser to do the right thing? Thanks for any suggestions or advice. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091115/24f35b11/attachment.htm From optilude+lists at gmail.com Mon Nov 16 02:55:14 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 16 Nov 2009 09:55:14 +0800 Subject: [lxml-dev] etree Parser corrupting HTML In-Reply-To: <4B0096B7.3090704@knowledgemosaic.com> References: <4B0096B7.3090704@knowledgemosaic.com> Message-ID: Clif Swiggett wrote: > I've had two mysteries come up recently using etree.parse() (lxml > version 2.2.2). Can anyone shed some light on how to work around these? > > Start with this input > ======input.html========= > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > > >

>

Foo

>

> > > ====================== > > ... then parse using ... > > htmlTree = etree.parse(open("input.html"), parser=etree.HTMLParser()) > open("out.xml", "wb").write(etree.tostring(htmlTree)) > > ... the resulting XML is corrupted in two ways: > > ======output.html========= > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > xmlns="http://www.w3.org/1999/xhtml"> > >

>

>

Foo

> > > ======================= > > Problem 1: The namespace declaration is duplicated (resulting in > invalid XML) That does seem dodgy indeed. > Problem 2: The

tag was moved outside of the

tag. Basic dom > structure has been re-arranged. I suspect this is because having a

inside an

is illegal in HTML (at least I think it is), so it's "cleaning" up your HTML. :) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Mon Nov 16 08:51:23 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Nov 2009 08:51:23 +0100 Subject: [lxml-dev] etree Parser corrupting HTML In-Reply-To: <4B0096B7.3090704@knowledgemosaic.com> References: <4B0096B7.3090704@knowledgemosaic.com> Message-ID: <4B01047B.9070602@behnel.de> Clif Swiggett, 16.11.2009 01:03: > I've had two mysteries come up recently using etree.parse() (lxml > version 2.2.2). Can anyone shed some light on how to work around these? > > Start with this input > ======input.html========= > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > > >

>

Foo

>

> > > ======================

Note that this is an XML document.

> ... then parse using ...
>
> htmlTree = etree.parse(open("input.html"), parser=etree.HTMLParser())
> open("out.xml", "wb").write(etree.tostring(htmlTree))

You should really try to write less verbose code:

    htmlTree = etree.parse("input.html", parser=etree.HTMLParser())
    htmlTree.write("out.xml")

This is shorter and also a lot faster.

> ... the resulting XML is corrupted in two ways:

That's because you are parsing it with an HTML parser. For parsing XML, use an XML parser.

Stefan

From thomas.schloegl at unicreditgroup.at Mon Nov 16 10:00:57 2009
From: thomas.schloegl at unicreditgroup.at (SCHLÖGL Thomas)
Date: Mon, 16 Nov 2009 10:00:57 +0100
Subject: [lxml-dev] FW: problem with etree.tostring and special characters in xml-text-attributes
Message-ID: <98ED979E46225E47A0088454A34E03FE42A4DA@SRES1MXS5V1.res1.loc.lan.at>

Isn't there anybody who can give a comment to my question below (e.g. you're too stupid for lxml)?

Regards,
Thomas

From: SCHLÖGL Thomas
Sent: Monday, November 09, 2009 4:20 PM
To: 'lxml-dev at codespeak.net'
Subject: problem with etree.tostring and special characters in xml-text-attributes

Hi,

I have a problem with the replacement of special characters (xml-entities) in xml-text-properties of the etree.tostring method:

After etree.tostring the characters &,<,> are replaced with &amp; &lt; &gt; but ' and " remain in the text-properties.

e.g.:

final_delim=end, record_delim='\n', delim='|', quote=double

But I need:

final_delim=end, record_delim=&apos;\n&apos;, delim=&apos;|&apos;, quote=double

I wrote a method that iterates through all text properties after the tostring-method and replaces ' and " of these text-properties in the serialized string, but with large xml-files this consumes a lot of time.

Does anybody have a better solution?

Kind Regards,
Thomas

Ing. Thomas Schlögl
4266-UGG3C1 / CFO Datawarehouse Core Solutions
Member of the Crossfunctions-Team
UniCredit Global Information Services S.p.A.
(UGIS S.p.A.)
Zweigniederlassung Österreich
Nordbergstrasse 13
A-1090 Vienna
Room: TZB-6.18
Tel: +43 1 71730 50499
Mobile: -
mailto:thomas.schloegl at wave-solutions.com

From herve.cauwelier at free.fr Mon Nov 16 10:07:57 2009
From: herve.cauwelier at free.fr (Hervé Cauwelier)
Date: Mon, 16 Nov 2009 10:07:57 +0100
Subject: [lxml-dev] inconsistency in the text nodes?
In-Reply-To: <4AFFF980.6000909@behnel.de>
References: <4AFD1F6F.8030407@free.fr> <4AFFF980.6000909@behnel.de>
Message-ID: <4B01166D.4030200@free.fr>

On 15/11/2009 13:52, Stefan Behnel wrote:
> Note the leading underscore in the class names. They are not public.

Still, they are exposed in XPath results.

> That's not valid Cython code. "_ElementStringResult" is a normal Python class.

Then it's:

--- extensions.pxi.orig 2009-11-13 09:46:27.760913564 +0100
+++ extensions.pxi 2009-11-16 10:01:27.399150556 +0100
@@ -597,6 +597,11 @@
         return self._parent

 class _ElementStringResult(str):
+    _parent = None
+    is_tail = None
+    is_text = None
+    is_attribute = None
+
     # we need to use a Python class here, str cannot be C-subclassed
     # in Pyrex/Cython
     def getparent(self):

But the idea is the same.

>> I'm playing with these directly because I need to wrap them.
>
> Can you provide some details why you want to do that?

I'm abstracting the XML library in the library I design. Now I needed text node objects that are always unicode and with an API similar to the rest of the library.

I inherit from the Python unicode class, and wrap _Element*Result's parent in my own element class.
From stefan_ml at behnel.de Mon Nov 16 10:23:00 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 16 Nov 2009 10:23:00 +0100
Subject: [lxml-dev] FW: problem with etree.tostring and special characters in xml-text-attributes
In-Reply-To: <98ED979E46225E47A0088454A34E03FE42A4DA@SRES1MXS5V1.res1.loc.lan.at>
References: <98ED979E46225E47A0088454A34E03FE42A4DA@SRES1MXS5V1.res1.loc.lan.at>
Message-ID: <4B0119F4.4070002@behnel.de>

SCHLÖGL Thomas, 16.11.2009 10:00:
> Isn't there anybody who can give a comment to my question below
> (e.q. you're too stupid for lxml) ?

Sorry, what?

Stefan

From stefan_ml at behnel.de Mon Nov 16 11:38:43 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 16 Nov 2009 11:38:43 +0100
Subject: [lxml-dev] problem with etree.tostring and special characters in xml-text-attributes
In-Reply-To: <98ED979E46225E47A0088454A34E03FE3C2C24@SRES1MXS5V1.res1.loc.lan.at>
References: <98ED979E46225E47A0088454A34E03FE3C2C24@SRES1MXS5V1.res1.loc.lan.at>
Message-ID: <4B012BB3.4080808@behnel.de>

SCHLÖGL Thomas, 09.11.2009 16:20:
> I have a problem with the replacement of special characters
> (xml-entities) in xml-text-properties of the etree.tostring method:
>
> After etree.tostring the characters &,<,> are replaced with &amp; &lt; &gt;
> but ' and " remain in the text-properties.
>
> e.g.:
>
> final_delim=end, record_delim='\n', delim='|',
> quote=double
>
> But I need:
>
> final_delim=end, record_delim=&apos;\n&apos;,
> delim=&apos;|&apos;, quote=double
>
> I wrote a method that iterates through all text properties after the
> tostring-method and replaces ' and " of this text-properties in the
> serialized string, but with large xml-files this consumes a lot of time.
>
> Does anybody have a better solution?

There is no automatic way to inject these entity references, as it's simply not required to do that. Note that you didn't present any reasoning for your requirement, BTW.
What you can do is write an XPath expression that searches the tree for the text you need to replace, and then replace it by entity references before serialising. May or may not be faster than what you do now, but it certainly is safer.

Stefan

From herve.cauwelier at free.fr Mon Nov 16 12:05:35 2009
From: herve.cauwelier at free.fr (Hervé Cauwelier)
Date: Mon, 16 Nov 2009 12:05:35 +0100
Subject: [lxml-dev] inconsistency in the text nodes?
In-Reply-To: <4B0118E3.2040407@behnel.de>
References: <4AFD1F6F.8030407@free.fr> <4AFFF980.6000909@behnel.de> <4B01166D.4030200@free.fr> <4B0118E3.2040407@behnel.de>
Message-ID: <4B0131FF.8050004@free.fr>

On 16/11/2009 10:18, Stefan Behnel wrote:
>
> Hervé Cauwelier, 16.11.2009 10:07:
>> I inherit from the Python unicode class, and wrap _Element*Result's
>> parent in my own element class.
>
> That doesn't tell me why you would want to instantiate them yourself.

It's because of the first test case where I test the construction of my own instances. I give an instance of text result without using XPath (which is tested in another test case).

There's a factory function but it's not exposed to the Python level.

Is it just because Cython requires variable initialization that these are declared in _ElementUnicodeResult but not in _ElementStringResult? I didn't notice at first glance that _ElementStringResult is pure Python. I'm not familiar with Cython.

Hervé

From stefan_ml at behnel.de Mon Nov 16 12:08:38 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 16 Nov 2009 12:08:38 +0100
Subject: [lxml-dev] inconsistency in the text nodes?
In-Reply-To: <4B0131FF.8050004@free.fr>
References: <4AFD1F6F.8030407@free.fr> <4AFFF980.6000909@behnel.de> <4B01166D.4030200@free.fr> <4B0118E3.2040407@behnel.de> <4B0131FF.8050004@free.fr>
Message-ID: <4B0132B6.7070603@behnel.de>

Hervé Cauwelier, 16.11.2009 12:05:
> On 16/11/2009 10:18, Stefan Behnel wrote:
>>
>> Hervé
Cauwelier, 16.11.2009 10:07: >>> I inherit from the Python unicode class, and wrap _Element*Result's >>> parent in my own element class. >> >> That doesn't tell me why you would want to instantiate them yourself. > > It's because of the first test case where I test the construction of my > own instances. I give an instance of text result without using XPath > (which is tested in another test case). So, let's sum this up. You want me to give non-public classes a public interface so that you can use them in a user code test case? If that's your use case, my answer is that this is not going to happen. Stefan From stefan_ml at behnel.de Mon Nov 16 17:45:29 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Nov 2009 17:45:29 +0100 Subject: [lxml-dev] etree Parser corrupting HTML In-Reply-To: <4B017DD3.8070108@knowledgemosaic.com> References: <4B0096B7.3090704@knowledgemosaic.com> <4B01047B.9070602@behnel.de> <4B017DD3.8070108@knowledgemosaic.com> Message-ID: <4B0181A9.7010304@behnel.de> Clif Swiggett, 16.11.2009 17:29: > Thanks for the tips to tighten up my code. > Do you have any suggestions for how best to detect whether a source file > is html or xml? I'm harvesting arbitrary files off the web ... You can either pass it into the XML parser and fall back to the HTML parser if that fails (which it most likely will for the large majority of HTML files), or, if you expect more HTML pages than XML pages, you can read the page into a string and search for the XHTML namespace before passing it into the parser. If you want to parse arbitrary web pages, you will likely end up using some kind of parser cascade anyway to make sure you get the best conformance and speed in the normal cases and the best error recovery in the pathological cases. Remember that lxml's parser is very fast, so it's ok to parse things multiple times with different parser configurations. 
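The first option above (try the XML parser, fall back to the HTML parser) could be sketched roughly as follows. The helper name parse_markup and the sample inputs are made up for illustration, and a real harvester would probably add further stages for encoding problems and empty input:

```python
from lxml import etree


def parse_markup(data):
    """Return (root, kind): strict XML first, lenient HTML as fallback."""
    try:
        return etree.fromstring(data, etree.XMLParser()), "xml"
    except etree.XMLSyntaxError:
        # Not well-formed XML, so let the recovering HTML parser handle it.
        return etree.fromstring(data, etree.HTMLParser()), "html"


xhtml = b'<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Foo</p></body></html>'
tag_soup = b'<html><body><p>Foo<br></body></html>'

xml_root, xml_kind = parse_markup(xhtml)
html_root, html_kind = parse_markup(tag_soup)
print(xml_kind, xml_root.tag)
print(html_kind, html_root.tag)
```

Well-formed XHTML keeps its namespace this way, while anything the XML parser rejects is still parsed, at the cost of a second pass over the input.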
Stefan From mike at easyads.net Wed Nov 18 21:03:44 2009 From: mike at easyads.net (Mike Naglee) Date: Wed, 18 Nov 2009 15:03:44 -0500 Subject: [lxml-dev] Python 2.6 Win-32 binary? Message-ID: <00af01ca688a$3c216ec0$b4644c40$@net> Hey folks, Am I just missing something or is there no Python 2.6 Win32 binary for lxml? Thanks, and have an interesting day! Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091118/584683da/attachment.htm From martin.seiler at gmx.de Thu Nov 19 11:30:46 2009 From: martin.seiler at gmx.de (Martin Seiler) Date: Thu, 19 Nov 2009 11:30:46 +0100 Subject: [lxml-dev] etree not printing pretty :( Message-ID: <4B051E56.4070701@gmx.de> Hi all, I am rather new to python and lxml and I wonder about the output of my tree. I add an Element, which contains some childs to a tree. When I write the output everything is in pretty print, but the elements I appended. They show up in one line. Here is a snippet of my code: > eggs = et.fromstring(xmltransformed) > for each in eggs: > if each.tag == "someElement": > > each.append(anotherElement) > > else: > pass > > > et.ElementTree(element=eggs).write(xmlout, pretty_print=True) When I open the file in firefox its formated all right, but when I view the code 'anotherElement' with all its childs is just one line... what am I doing wrong here? python: 2.6 python-lxml: 2.1.5-1ubuntu2 OS: Ubuntu 9.04 Thanks for you help! Cheers, martin From stefan_ml at behnel.de Thu Nov 19 11:40:30 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Nov 2009 11:40:30 +0100 Subject: [lxml-dev] etree not printing pretty :( In-Reply-To: <4B051E56.4070701@gmx.de> References: <4B051E56.4070701@gmx.de> Message-ID: <4B05209E.6090408@behnel.de> Hi, Martin Seiler, 19.11.2009 11:30: > I am rather new to python and lxml and I wonder about the output of my > tree. I add an Element, which contains some childs to a tree. 
When I > write the output everything is in pretty print, but the elements I > appended. They show up in one line. did you read this? http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output Stefan From esbenbugge at gmail.com Fri Nov 20 10:44:09 2009 From: esbenbugge at gmail.com (Esben Bugge) Date: Fri, 20 Nov 2009 10:44:09 +0100 Subject: [lxml-dev] Installing on Mac OS X 10.6.2: "lipo: can't figure out the architecture type of" Message-ID: I have tried to install lxml on Mac OS X 10.6.2 using both sudo python setup.py build --static-deps and sudo STATIC_DEPS=true easy_install lxml In both cases I get lipo: can't figure out the architecture type of: /var/tmp//ccdcpY3e.out make[2]: *** [SAX.lo] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 Exception: Command "make" returned code 2 What can I do to resolve this error? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091120/9fdb6c64/attachment-0001.htm From stefan_ml at behnel.de Fri Nov 20 11:11:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Nov 2009 11:11:01 +0100 Subject: [lxml-dev] Installing on Mac OS X 10.6.2: "lipo: can't figure out the architecture type of" In-Reply-To: References: Message-ID: <4B066B35.8000605@behnel.de> Hi, Esben Bugge, 20.11.2009 10:44: > I have tried to install lxml on Mac OS X 10.6.2 using both > > sudo python setup.py build --static-deps > > and > > sudo STATIC_DEPS=true easy_install lxml > > In both cases I get > > lipo: can't figure out the architecture type of: /var/tmp//ccdcpY3e.out > make[2]: *** [SAX.lo] Error 1 > make[1]: *** [all-recursive] Error 1 > make: *** [all] Error 2 > Exception: Command "make" returned code 2 > > What can I do to resolve this error? 
Here's a patch that is supposed to make things work with MacOS 10.6: https://codespeak.net/viewvc/lxml/branch/lxml-2.2/buildlibxml.py?r1=66217&r2=69351&view=patch Please try if that changes anything for you. Stefan From esbenbugge at gmail.com Fri Nov 20 13:15:16 2009 From: esbenbugge at gmail.com (Esben Bugge) Date: Fri, 20 Nov 2009 13:15:16 +0100 Subject: [lxml-dev] Installing on Mac OS X 10.6.2: "lipo: can't figure out the architecture type of" In-Reply-To: <4B066B35.8000605@behnel.de> References: <4B066B35.8000605@behnel.de> Message-ID: Thanks for your response. I added the changes of the patch to buildlibxml.py and the installation went a lot further... but it still fails. Now with: ld: library not found for -lbundle1.o ld: library not found for -lbundle1.o collect2: ld returned 1 exit statuscollect2: ld returned 1 exit status lipo: can't open input file: /var/tmp//ccUI4H6k.out (No such file or directory) error: command 'gcc-4.0' failed with exit status 1 Any other ideas? 2009/11/20 Stefan Behnel > Hi, > > Esben Bugge, 20.11.2009 10:44: > > I have tried to install lxml on Mac OS X 10.6.2 using both > > > > sudo python setup.py build --static-deps > > > > and > > > > sudo STATIC_DEPS=true easy_install lxml > > > > In both cases I get > > > > lipo: can't figure out the architecture type of: /var/tmp//ccdcpY3e.out > > make[2]: *** [SAX.lo] Error 1 > > make[1]: *** [all-recursive] Error 1 > > make: *** [all] Error 2 > > Exception: Command "make" returned code 2 > > > > What can I do to resolve this error? > > Here's a patch that is supposed to make things work with MacOS 10.6: > > > https://codespeak.net/viewvc/lxml/branch/lxml-2.2/buildlibxml.py?r1=66217&r2=69351&view=patch > > Please try if that changes anything for you. > > Stefan > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091120/9d6ca204/attachment.htm From nicolas at nexedi.com Sat Nov 21 16:34:35 2009 From: nicolas at nexedi.com (Nicolas Delaby) Date: Sat, 21 Nov 2009 16:34:35 +0100 Subject: [lxml-dev] smart_strings Message-ID: <4B08088B.90605@nexedi.com> Hi, I would like to know if exists a way to reach an attribute name with an xpath result from _ElementStringResult instance. I see that we can reach is parent, but i need to know to which attribute this result belong, without parsing xpath expression with regex or something similar. root = etree.XML('TEXT') result = root.xpath('/root/a/attribute::a')[0] assert result.is_attribute result.getparent().???? Regards, Nicolas -- Nicolas Delaby Nexedi: Consulting and Development of Libre / Open Source Software http://www.nexedi.com/ From stefan_ml at behnel.de Sat Nov 21 18:05:17 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 21 Nov 2009 18:05:17 +0100 Subject: [lxml-dev] smart_strings In-Reply-To: <4B08088B.90605@nexedi.com> References: <4B08088B.90605@nexedi.com> Message-ID: <4B081DCD.2020809@behnel.de> Nicolas Delaby, 21.11.2009 16:34: > I would like to know if exists a way to reach an attribute name with an > xpath result from _ElementStringResult instance. > I see that we can reach is parent, but i need to know to which attribute > this result belong, without parsing xpath expression with regex or > something similar. > > root = etree.XML('TEXT') > result = root.xpath('/root/a/attribute::a')[0] > assert result.is_attribute > result.getparent().???? You'll have to wait for lxml 2.3 (or use the current SVN trunk), which will provide the smart string results of attributes with an "attrname" property. 
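With an lxml version that has the "attrname" property (2.3 or later, per the reply above), the lookup could look like the sketch below. The sample document is an assumption, since the archive stripped the markup from the original snippet, and the helper name is made up for illustration.

```python
from lxml import etree


def describe_attribute_result(root, expression):
    """Return (parent tag, attribute name, value) for the first XPath hit.

    Hypothetical helper; relies on the smart string properties
    is_attribute, attrname and getparent().
    """
    result = root.xpath(expression)[0]
    assert result.is_attribute
    return result.getparent().tag, result.attrname, str(result)


if __name__ == "__main__":
    # Assumed sample document, not the one from the original post.
    doc = etree.XML('<root><a a="TEXT"/></root>')
    print(describe_attribute_result(doc, '/root/a/attribute::a'))
```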
Stefan

From nicolas at nexedi.com Mon Nov 23 10:04:29 2009
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Mon, 23 Nov 2009 10:04:29 +0100
Subject: [lxml-dev] smart_strings
In-Reply-To: <4B081DCD.2020809@behnel.de>
References: <4B08088B.90605@nexedi.com> <4B081DCD.2020809@behnel.de>
Message-ID: <4B0A501D.9080406@nexedi.com>

Stefan Behnel wrote:
> Nicolas Delaby, 21.11.2009 16:34:
>> I would like to know if exists a way to reach an attribute name with an
>> xpath result from _ElementStringResult instance.
>> I see that we can reach is parent, but i need to know to which attribute
>> this result belong, without parsing xpath expression with regex or
>> something similar.
>>
>> root = etree.XML('TEXT')
>> result = root.xpath('/root/a/attribute::a')[0]
>> assert result.is_attribute
>> result.getparent().????
>
> You'll have to wait for lxml 2.3 (or use the current SVN trunk), which will
> provide the smart string results of attributes with an "attrname" property.
>
> Stefan
>
This is good news, I'll be patient :)

Thanks,
Nicolas

--
Nicolas Delaby
Nexedi: Consulting and Development of Libre / Open Source Software
http://www.nexedi.com/

From read.beyond.data at gmx.net Mon Nov 23 12:56:13 2009
From: read.beyond.data at gmx.net (Celvin)
Date: Mon, 23 Nov 2009 12:56:13 +0100
Subject: [lxml-dev] lxml Win32 egg for Python 2.6
Message-ID: <1495183427.20091123125613@gmx.net>

Hi,

I just went to my local cheese shop to upgrade lxml for Python 2.6 but
noticed that there are only lxml 2.2.4 / Python 2.6 packages for AMD64 -
is this intentional or a mere oversight?

Cheers,
Celvin

From mike at easyads.net Mon Nov 23 17:18:25 2009
From: mike at easyads.net (Mike Naglee)
Date: Mon, 23 Nov 2009 11:18:25 -0500
Subject: [lxml-dev] lxml Win32 egg for Python 2.6
In-Reply-To: <1495183427.20091123125613@gmx.net>
References: <1495183427.20091123125613@gmx.net>
Message-ID: <007a01ca6c58$965f4bb0$c31de310$@net>

Celvin,

I have hit the same roadblock.
I sent a similar message to the mailing list last Wednesday, but have not heard anything back yet. If I find a work around I'll let you know. Mike -----Original Message----- From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Celvin Sent: Monday, November 23, 2009 6:56 AM To: lxml-dev at codespeak.net Subject: [lxml-dev] lxml Win32 egg for Python 2.6 Hi, I just went to my local cheese shop to upgrade lxml for Python 2.6 but noticed that there are only lxml 2.2.4 / Python 2.6 packages for AMD64 - is this intentional or a mere oversight? Cheers, Celvin _______________________________________________ lxml-dev mailing list lxml-dev at codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev No virus found in this incoming message. Checked by AVG - www.avg.com Version: 8.5.425 / Virus Database: 270.14.77/2520 - Release Date: 11/22/09 19:40:00 From chris at simplistix.co.uk Tue Nov 24 07:32:38 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 24 Nov 2009 06:32:38 +0000 Subject: [lxml-dev] lxml slower than ElementTree? In-Reply-To: <4B0B3E29.2020203@lexicon.net> References: <914552d0-09d9-45c3-ba41-7a7b8b75b275@1g2000vbm.googlegroups.com> <4B0B1B95.5030603@lexicon.net> <4B0B3E29.2020203@lexicon.net> Message-ID: <4B0B7E06.6000606@simplistix.co.uk> Background: John maintains xlrd, the python package for reading Excel files, and is looking to support Microsoft's newer xml based format... John Machin wrote: > Note that it needs an ElementTree implementation (supplied with more > recent Pythons), and tries to find one in various places. Limited > testing with lxml gave identical results, but slightly slower, so you > could try that instead if you wanted to (would require fiddling with > imports in xlsxrd.py) That surprised me. Would anyone here be interested in taking a look at John's code to see what's tripping up lxml and causing it to be slower? 
cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From stefan_ml at behnel.de Tue Nov 24 09:48:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Nov 2009 09:48:05 +0100 Subject: [lxml-dev] lxml slower than ElementTree? In-Reply-To: <4B0B7E06.6000606@simplistix.co.uk> References: <914552d0-09d9-45c3-ba41-7a7b8b75b275@1g2000vbm.googlegroups.com> <4B0B1B95.5030603@lexicon.net> <4B0B3E29.2020203@lexicon.net> <4B0B7E06.6000606@simplistix.co.uk> Message-ID: <4B0B9DC5.7080506@behnel.de> Chris Withers, 24.11.2009 07:32: > Background: John maintains xlrd, the python package for reading Excel > files, and is looking to support Microsoft's newer xml based format... > > John Machin wrote: > >> Note that it needs an ElementTree implementation (supplied with more >> recent Pythons), and tries to find one in various places. Limited >> testing with lxml gave identical results, but slightly slower, so you >> could try that instead if you wanted to (would require fiddling with >> imports in xlsxrd.py) > > That surprised me. Would anyone here be interested in taking a look at > John's code to see what's tripping up lxml and causing it to be slower? I didn't look at the code, but you can take a look at http://codespeak.net/lxml/performance.html In general, ET can't compete with lxml.etree, whereas cET can, especially when you stay with code that supports both ET and lxml.etree. Some major differences: - lxml has a fast parser and a fast serialiser. cET has the first but not the latter. ET is straight out. - lxml parses much faster from file names than from open file(-like) objects, especially multi-threaded. ET handles them exactly the same. - lxml can run multi-threaded with great gains, ET benefits very little. - lxml has XPath, ET doesn't. - lxml has XSLT, ET doesn't. - ET creates the tree as Python objects once and for all, lxml creates Python proxies only at request. 
- ET 1.2 uses a simpler and faster ElementPath implementation than ET 1.3. lxml.etree uses the 1.3 implementation since version 2.x. - Tree iteration using getiterator() in lxml.etree is much faster than in cET, and also much, much faster than using .find(). This strikes even more when searching specific tags, because fewer proxies have to be created. (c)ET doesn't show a difference here. So it is not surprising that code performs worse in lxml.etree if it was tuned for performance using ET - which it likely was, given the quote above. If it had been tuned for lxml.etree, I wouldn't be suprised if it ran faster in absolute numbers, but slower with ET. It's also worth reading this: http://codespeak.net/lxml/performance.html#a-longer-example It might give an idea of how unexpected the performance of an implementation can be. Stefan From srijit at aim.com Tue Nov 24 10:12:50 2009 From: srijit at aim.com (srijit at aim.com) Date: Tue, 24 Nov 2009 04:12:50 -0500 Subject: [lxml-dev] lxml 2.2.4 for Python 2.6 Message-ID: <8CC3AFD6EA1145B-145C-4FAC@webmail-m001.sysops.aol.com> Is there any reason why lxml-2.2.4-py2.6-win32.egg or lxml-2.2.4.win32-py2.6.exe is not available in http://pypi.python.org/pypi/lxml/2.2.4? Best regards, /Srijit -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091124/7c13684d/attachment-0001.htm From stefan_ml at behnel.de Tue Nov 24 16:12:44 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Nov 2009 16:12:44 +0100 Subject: [lxml-dev] lxml slower than ElementTree? In-Reply-To: <4B0BD8D2.4080505@lexicon.net> References: <914552d0-09d9-45c3-ba41-7a7b8b75b275@1g2000vbm.googlegroups.com> <4B0B1B95.5030603@lexicon.net> <4B0B3E29.2020203@lexicon.net> <4B0B7E06.6000606@simplistix.co.uk> <4B0B9DC5.7080506@behnel.de> <4B0BD8D2.4080505@lexicon.net> Message-ID: <4B0BF7EC.1090606@behnel.de> Hi, John Machin, 24.11.2009 14:00: > Thanks for your response, Stefan. 
You're welcome. > On 24/11/2009 7:48 PM, Stefan Behnel wrote: >> In general, ET can't compete with lxml.etree, whereas cET can, especially >> when you stay with code that supports both ET and lxml.etree. > > As it stands, the code tries to import xml.etree.cElementTree, then > cElementTree, then ElementTree. I plan to allow an override option, > where the caller can import any minimal-ET-subset-compliant module (e.g. > lxml) and pass it in. The mentioned speed comparison was using cET. >> - lxml parses much faster from file names than from open file(-like) >> objects, especially multi-threaded. ET handles them exactly the same. > > The MS Excel 2007 file format is a ZIP file containing compressed XML > documents. Hence one ends up parsing file-like objects. Sure, I expected something like that. The thing is just that parsing from file-like object from lxml means that it needs to walk all the way up the Python stack to look up and call the object's .read() method for each chunk of data, which then builds and returns a new Python string object. Parsing from a file path implies straight calls to C's file reading function, which works completely outside of the GIL. So there is a pretty huge performance penalty for file-like objects. This penalty applies to (c)ET as well, though, so that's not a reason for lxml to be slower than them. > For better or worse, ET comes with Python. Design goal was to avoid > requiring Nth party modules where appropriate I totally understand that design goal. Being in the stdlib is clearly a major advantage of ET. >> - ET creates the tree as Python objects once and for all, lxml creates >> Python proxies only at request. > > I acknowledge that ET may start swapping much sooner than lxml That's not what I was referring to, though. It's ok to call ET an elephant, but cET is actually *very* memory friendly. 
I meant to say that lxml.etree implies a performance penalty if you access many elements in the tree, where cET can just return their reference. The penalty is not large, but it's certainly worth being a bit more selective when walking the tree and searching elements. That's what XPath and getiterator(tagname) are great for. >> - Tree iteration using getiterator() in lxml.etree is much faster than in >> cET, and also much, much faster than using .find(). This strikes even >> more >> when searching specific tags, because fewer proxies have to be created. >> (c)ET doesn't show a difference here. > > A base class provides a controller method that uses > getiterator(tag=None) and a mapping from tags to methods. Each type of > document (about 5 types) has a subclass. I would expect that to be pretty fast on cET. For lxml.etree, however, it might still be faster to traverse the document independently for each interesting tag name - as long as the document fits into memory completely, and as long as the methods are independent from each other. The benefit certainly depends on the ratio of interesting versus ignored elements in the traversal, but given that lxml's tight traversal loop is easily an order of magnitude faster than even cET, being selective can really turn the vane here. > The area where heavy lifting is required is the worksheet document, > which contains cell elements as children of row elements. Max 2**20 rows > and max 2**14 cells per row. If iterparse is available, a specialised > controller method is used; it uses iterparse to iterate over row > elements (only "end" events), clearing each row as it is finished. This > should solve most of the memory problem for [c]ET. Extending this to > clear the root element (i.e. avoid leaving empty row elements lying > about) is a possibility. 
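The row-by-row iterparse pattern described in the quoted text can be sketched for lxml.etree as below, including the removal of already processed siblings through their parent so that no empty row elements pile up. The element names "sheet", "row" and "c" and the helper name are made up for illustration and are not the real worksheet vocabulary; the tag= filter is specific to lxml's iterparse().

```python
import io

from lxml import etree


def iter_row_values(source, row_tag="row"):
    """Yield the cell texts of each row while keeping memory use flat.

    Hypothetical helper: only "end" events for row_tag are requested,
    each row is cleared after use, and its preceding siblings are
    deleted through the common parent element.
    """
    for _event, row in etree.iterparse(source, events=("end",), tag=row_tag):
        yield [cell.text for cell in row]
        row.clear()
        # Drop rows that were already processed (now empty shells).
        while row.getprevious() is not None:
            del row.getparent()[0]


if __name__ == "__main__":
    data = b"<sheet><row><c>1</c><c>2</c></row><row><c>3</c><c>4</c></row></sheet>"
    for values in iter_row_values(io.BytesIO(data)):
        print(values)
```

Only the row currently being handled stays attached to the tree, which is the point of deleting siblings instead of merely clearing them.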
Too bad that a) lxml's iterparse() is slightly slower than the one in cET and b) clearing the root element doesn't work in lxml.etree I guess a) is where your initial comment mainly originated from. Regarding b), getting your hands at the root element in (c)ET would require you to also accept 'start' events, which usually results in such a huge performance drop for larger documents that it's almost always better to just leave the dead elements around instead. Also note that lxml.etree would allow you to drop them during the iteration by removing the preceding siblings of the current row element through their common parent element. So you can have the cake and eat it, too. :) Stefan From stefan_ml at behnel.de Tue Nov 24 16:14:29 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Nov 2009 16:14:29 +0100 Subject: [lxml-dev] lxml 2.2.4 for Python 2.6 In-Reply-To: <8CC3AFD6EA1145B-145C-4FAC@webmail-m001.sysops.aol.com> References: <8CC3AFD6EA1145B-145C-4FAC@webmail-m001.sysops.aol.com> Message-ID: <4B0BF855.1040806@behnel.de> srijit at aim.com, 24.11.2009 10:12: > Is there any reason why lxml-2.2.4-py2.6-win32.egg or > lxml-2.2.4.win32-py2.6.exe is not available in > http://pypi.python.org/pypi/lxml/2.2.4? They'll hopefully become available soon. Be patient, or use lxml 2.2.2 for now. Stefan From tseaver at palladion.com Tue Nov 24 17:43:17 2009 From: tseaver at palladion.com (Tres Seaver) Date: Tue, 24 Nov 2009 11:43:17 -0500 Subject: [lxml-dev] lxml 2.2.4 for Python 2.6 In-Reply-To: <4B0BF855.1040806@behnel.de> References: <8CC3AFD6EA1145B-145C-4FAC@webmail-m001.sysops.aol.com> <4B0BF855.1040806@behnel.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stefan Behnel wrote: > srijit at aim.com, 24.11.2009 10:12: >> Is there any reason why lxml-2.2.4-py2.6-win32.egg or >> lxml-2.2.4.win32-py2.6.exe is not available in >> http://pypi.python.org/pypi/lxml/2.2.4? > > They'll hopefully become available soon. Be patient, or use lxml 2.2.2 for now. 
Or get a free compiler and build it from source[1]. If you manage that, you could even get blessed to upload the result for others. [1] http://boodebr.org/main/python/build-windows-extensions Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAksMDSQACgkQ+gerLs4ltQ7g2wCcDagtIfm6YWDvOkbBlKcjVHpk WBMAoI6wHmW+glDhq7wsmo2S5dL22ABo =CTRw -----END PGP SIGNATURE----- From cognomeome at tiscali.it Tue Nov 24 17:40:34 2009 From: cognomeome at tiscali.it (cgnmm) Date: Tue, 24 Nov 2009 17:40:34 +0100 Subject: [lxml-dev] elementtree as a hierarchical object tree Message-ID: <1259080834.5166.40.camel@fdb-netbook> Hi, I'm new of lxml and XML programming. I would to use etree as a hierarchical object tree: every node element should have a property associated to an object. >>> import lxml.etree as etree >>> element=etree.fromstring('some text') >>> element.text="a string" that's all ok, but: >>> element.text={'a':'real', 'python':'object'} TypeError: Argument must be string or unicode. OK, the property name is "text" so some text is expected! My question is: if write a custom element with a property called "object", may I store inside it a reference to an object? From kevinar18 at hotmail.com Tue Nov 24 19:22:33 2009 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Tue, 24 Nov 2009 13:22:33 -0500 Subject: [lxml-dev] Why can't I turn all xpath node types into string? Message-ID: XML defines several node types that get returned by XPATH: http://www.w3.org/TR/xpath#data-model When I use the xpath feature in lxml and then try to read out the data, it won't work for certain node types. How can I get a string version of all node types? Examples follow. 
The following works:

Domain/HTML: http://www.citiesxl.com/
XPATH: //div//a

    from lxml import etree
    self.xpath_list = self.tree.xpath(xpath)
    for entry in self.xpath_list:
        tmp = ""
        try:
            tmp = etree.tostring(entry)
        except:
            tmp = str(entry)
        print tmp

In this example, xpath_list is of type:

The etree.tostring() function works for this type.

The following will NOT work for etree.tostring()

XPATH: //div//a/attribute::href

It says it can't handle the type or something. In this case, xpath_list
is of type: or something like that (I forget the actual type). The str()
function will work on this one.

The following will NOT work at all using either str() or etree.tostring()

XPATH: //self::text()

One of the types is:

and I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 38: ordinal not in range(128)

Is there an lxml function that will convert all types to strings properly
instead of trying to use this try/except hack to handle all the types?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091124/537dd7bc/attachment.htm

From jholg at gmx.de Tue Nov 24 21:47:21 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 24 Nov 2009 21:47:21 +0100
Subject: [lxml-dev] Why can't I turn all xpath node types into string?
In-Reply-To:
References:
Message-ID: <20091124204721.294880@gmx.net>

Hi,

>
> When I use the xpath feature in lxml and then try to read out the data, it
> won't work for certain node types. How can I get a string version of all
> node types? Examples follow.
Please take a look at
http://codespeak.net/lxml/xpathxslt.html#xpath
(section xpath return values)

> In this example, xpath_list is of type:
>
> The etree.tostring() function works for this type.

etree.tostring() is lxml's serialization function and takes an element or
an elementtree (see api reference)

>
> The following will NOT work for etree.tostring()
>
> XPATH: //div//a/attribute::href
>
>
>
> It says it can't handle the type or something. In this case, xpath_list
> is of type: or something like
> that (I forget the actual type). The str() function will work on this one.

See above.

> The following will NOT work at all using either str() or etree.tostring()
>
> XPATH: //self::text()
>
> One of the types is:
>
>
> and I get this error:
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
> position 38: ordinal not in range(128)

You can't safely use str() on a unicode string as this implicitly tries to
encode to your Python installation's default encoding (which is usually
ascii). You might want to look at e.g.
http://www.amk.ca/python/howto/unicode
for background on unicode in python.

> Is there an lxml function that will convert all types to strings properly
> instead of trying to use this try/except hack to handle all the types?

Well, you can use unicode() to convert all these to unicode strings (and
afterwards encode to the encoding you need), but I suppose you want a
serialized representation of an element result. So, you will have to treat
the types separately, e.g. by using isinstance() tests.

brgds
Holger

From jholg at gmx.de Tue Nov 24 21:54:16 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 24 Nov 2009 21:54:16 +0100
Subject: [lxml-dev] elementtree as a hierarchical object tree
In-Reply-To: <1259080834.5166.40.camel@fdb-netbook>
References: <1259080834.5166.40.camel@fdb-netbook>
Message-ID: <20091124205416.139990@gmx.net>

Hi,

> I would to use etree as a hierarchical object tree: every node element
> should have a property associated to an object.
>
> >>> import lxml.etree as etree
> >>> element=etree.fromstring('some text')
> >>> element.text="a string"
>
> that's all ok, but:
>
> >>> element.text={'a':'real', 'python':'object'}
> TypeError: Argument must be string or unicode.
>
> OK, the property name is "text" so some text is expected!
>
> My question is: if write a custom element with a property called
> "object", may I store inside it a reference to an object?

This could be helpful:
http://codespeak.net/lxml/element_classes.html

But be aware that
"
[...]
There is one thing to know up front. Element classes must not have an
__init__ or __new__ method. There should not be any internal state either,
except for the data stored in the underlying XML tree.
[...]
"

Depending on your use case you might also want to look at
http://codespeak.net/lxml/objectify.html

Maybe you don't need to reference separate objects but could instead use an
XML tree that very much behaves like native python objects.

brgds
Holger

From jholg at gmx.de Wed Nov 25 11:38:00 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 25 Nov 2009 11:38:00 +0100
Subject: [lxml-dev] better schematron support
Message-ID: <20091125103800.199670@gmx.net>

Hi,

schematron support in lxml is currently a second-class citizen due to
libxml2 restrictions, see e.g.
http://mail.gnome.org/archives/xml/2007-August/msg00016.html
where Stefan commented on error reporting deficiencies or
http://mail.gnome.org/archives/xml/2009-September/msg00022.html
where Daniel of libxml2 fame comments on a feature request to support
schematron embedded in XML Schema, basically stating that the
implementation is incomplete.

However, there is a pure-XSLT implementation of the now-ISO-standardized
schematron by its inventor (and editor of the standard) Rick Jelliffe, the
so-called skeleton implementation:
http://www.schematron.com/
(Daniel also mentions this in his comment)

Basically, the "skeleton" toolchain creates an xslt that is used for
validation. Indeed, schematron itself is just a well-defined way of using
xslt for validation, which is something I was already looking into without
being aware that this is exactly what schematron does.

The "skeleton" implementation is available in both an xslt 1 and an xslt 2
flavour.

The toolchain steps are (taken from www.schematron.com, modifications by me)
0) [Extract from XML Schema/RelaxNG schema]
1) Process inclusions
2) Process abstract patterns
3) Compile the schema
4) Validate

which translates to

  xsltproc XSD2Schtrn.xsl XMLSchema.xsd > theSchema.sch
or
  xsltproc RNG2Schtrn.xsl RelaxNGSchema.rng > theSchema.sch
  xsltproc iso_dsdl_include.xsl theSchema.sch > theSchema1.sch
  xsltproc iso_abstract_expand.xsl theSchema1.sch > theSchema2.sch
  xsltproc iso_svrl_for_xsltn.xsl theSchema2.sch > theSchema.xsl
  xsltproc theSchema.xsl myDocument.xml > myResult.xml

Enter libxslt aka lxml's xslt capabilities:
It looks pretty easy to integrate this xslt-based toolchain into lxml,
effectively enabling full ISO schematron support.
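As a rough idea of what chaining the xsltproc steps above might look like with lxml.etree.XSLT, here is a minimal sketch. The function name run_toolchain is made up, and the two tiny stylesheets in the demo are stand-ins; in a real setup the list would hold the parsed skeleton stylesheets (include, abstract expand, compile), and the output of the compile step would itself be wrapped in etree.XSLT to validate documents.

```python
from lxml import etree


def run_toolchain(doc, stylesheets):
    """Apply each stylesheet in order, feeding every result into the next step.

    Hypothetical helper; 'stylesheets' is a list of parsed XSLT documents.
    """
    for sheet in stylesheets:
        doc = etree.XSLT(sheet)(doc)
    return doc


XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def rename_step(old, new):
    """Build a stand-in stylesheet that renames the root element old -> new."""
    return etree.XML(
        '<xsl:stylesheet version="1.0" xmlns:xsl="%s">'
        '<xsl:template match="/%s"><%s><xsl:apply-templates/></%s></xsl:template>'
        '</xsl:stylesheet>' % (XSL_NS, old, new, new)
    )


if __name__ == "__main__":
    result = run_toolchain(etree.XML("<a>x</a>"),
                           [rename_step("a", "b"), rename_step("b", "c")])
    print(etree.tostring(result))
```

Keeping the steps as a plain list leaves room for the optional extraction step 0 and for per-step parameters, which a convenience API would presumably want to hide by default.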
I suggest complementing the current lxml schematron support using this approach: - add the necessary stylesheets (extraction and skeleton implementation) to the lxml codebase - add a convenient API to support xslt-based schematron validation to lxml that - hides toolchain steps, at least in default mode - fits in with the current validators' API - provides support for the parameters used for the separate toolchain steps And finally: Maybe somebody has already done this with lxml + schematron. Care to step forward? Any Comments? Holger -- Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser From stefan_ml at behnel.de Wed Nov 25 13:48:33 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 25 Nov 2009 13:48:33 +0100 Subject: [lxml-dev] better schematron support In-Reply-To: <20091125103800.199670@gmx.net> References: <20091125103800.199670@gmx.net> Message-ID: <4B0D27A1.2020607@behnel.de> Hi Holger, thanks for bringing this up. I'm all in favour of doing this. jholg at gmx.de, 25.11.2009 11:38: > I suggest complementing the current lxml schematron support using this approach: > - add the necessary stylesheets (extraction and skeleton implementation) to the lxml codebase It looks like the license is ok for inclusion. The basic restrictions are that the author wants to keep his name in the sources and that modifications must be recognisable as such. So shipping the files verbatimly should not do any harm to lxml's users. > - add a convenient API to support xslt-based schematron validation to lxml that > - hides toolchain steps, at least in default mode > - fits in with the current validators' API > - provides support for the parameters used for the separate toolchain steps This sounds like it can easily be done in plain Python code, so I'd appreciate having a separate "lxml.schematron" module that implements this. 
It should mimic the existing validator APIs as much as possible. Additional parameters in the validator constructor are fine.

I imagine that the result document could be represented by a special class that helps in interpreting errors, although I haven't actually looked into this any deeper.

Holger, could you open a bug report for this?

Stefan

From jholg at gmx.de Wed Nov 25 15:45:21 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 25 Nov 2009 15:45:21 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <4B0D27A1.2020607@behnel.de>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de>
Message-ID: <20091125144521.194110@gmx.net>

> This sounds like it can easily be done in plain Python code, so I'd
> appreciate having a separate "lxml.schematron" module that implements
> this.
> It should mimic the existing validator APIs as much as possible.
> Additional parameters in the validator constructor are fine.
>
> I imagine that the result document could be represented by a special class
> that helps in interpreting errors, although I haven't actually looked into
> this any deeper.
>
> Holger, could you open a bug report for this?
>
> Stefan

Done: https://bugs.launchpad.net/lxml/+bug/488222

Given that I'm currently establishing schematron validations embedded in an XML Schema we're using here, I will need this tool chain anyhow, sooner or later. So unless you need a nice little after work project yourself (or have already started hacking away ;) I'd volunteer to come up with an implementation for this (on a branch).

Holger

From stefan_ml at behnel.de Wed Nov 25 15:55:13 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 25 Nov 2009 15:55:13 +0100
Subject: [lxml-dev] better schematron support
In-Reply-To: <20091125144521.194110@gmx.net>
References: <20091125103800.199670@gmx.net> <4B0D27A1.2020607@behnel.de> <20091125144521.194110@gmx.net>
Message-ID: <4B0D4551.1020702@behnel.de>

jholg at gmx.de, 25.11.2009 15:45:
> unless you need a nice little after work project yourself (or have
> already started hacking away ;) I'd volunteer to come up with an
> implementation for this (on a branch).

Oh, no further need for glory and fame on my side - just go ahead. :)

Stefan

From esbenbugge at gmail.com Wed Nov 25 16:20:39 2009
From: esbenbugge at gmail.com (Esben Bugge)
Date: Wed, 25 Nov 2009 16:20:39 +0100
Subject: [lxml-dev] Installing on Mac OS X 10.6.2: "lipo: can't figure out the architecture type of"
In-Reply-To:
References: <4B066B35.8000605@behnel.de>
Message-ID:

I am still experiencing these problems. Does anyone have a suggestion on what I can do?

Regards,
Esben

2009/11/20 Esben Bugge

> Thanks for your response. I added the changes of the patch to
> buildlibxml.py and the installation went a lot further... but it still
> fails. Now with:
>
> ld: library not found for -lbundle1.o
> ld: library not found for -lbundle1.o
> collect2: ld returned 1 exit statuscollect2: ld returned 1 exit status
> lipo: can't open input file: /var/tmp//ccUI4H6k.out (No such file or
> directory)
> error: command 'gcc-4.0' failed with exit status 1
>
> Any other ideas?
>
> 2009/11/20 Stefan Behnel
>
> Hi,
>>
>> Esben Bugge, 20.11.2009 10:44:
>> > I have tried to install lxml on Mac OS X 10.6.2 using both
>> >
>> > sudo python setup.py build --static-deps
>> >
>> > and
>> >
>> > sudo STATIC_DEPS=true easy_install lxml
>> >
>> > In both cases I get
>> >
>> > lipo: can't figure out the architecture type of: /var/tmp//ccdcpY3e.out
>> > make[2]: *** [SAX.lo] Error 1
>> > make[1]: *** [all-recursive] Error 1
>> > make: *** [all] Error 2
>> > Exception: Command "make" returned code 2
>> >
>> > What can I do to resolve this error?
>>
>> Here's a patch that is supposed to make things work with MacOS 10.6:
>>
>>
>> https://codespeak.net/viewvc/lxml/branch/lxml-2.2/buildlibxml.py?r1=66217&r2=69351&view=patch
>>
>> Please try if that changes anything for you.
>>
>> Stefan
>>
>>
>

From jkrukoff at ltgc.com Sat Nov 28 00:02:28 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Fri, 27 Nov 2009 16:02:28 -0700
Subject: [lxml-dev] XPath optimization troubles.
In-Reply-To: <4AE17C22.90204@behnel.de>
References: <1256149220.16931.9.camel@localhost.localdomain> <4AE17C22.90204@behnel.de>
Message-ID: <1259362948.7275.12.camel@localhost.localdomain>

On Fri, 2009-10-23 at 11:49 +0200, Stefan Behnel wrote:
> John Krukoff wrote:
> > I expect this is properly a libxml2 question, but it's weird enough I
> > wanted to check here first to make sure that lxml isn't affecting the
> > results.
> >
> > I have equivalent XPath expressions, one using prefixes to do the
> > selection, and one using namespace-uri to do the check. The
> > namespace-uri version consistently runs 2-3x faster on a range of test
> > data, and I have no idea why.
> >
> > Here's the prefix version:
> >> '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]'
> >
> > And here's the namespace-uri version:
> >> '//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces
> BTW, have you also measured the performance of using an XPath variable for
> the URI in the second case?
>
> Stefan

Finally got around to giving this a try, and the performance difference for using XPath variables looks to be negligible. Switched over to using variables then, as the substitution is obviously safer in the general case, and it looks a bit cleaner.

Wanted to mention on the list, in case anybody else ends up using this optimization.

As always, thanks for the tip Stefan.

--
John Krukoff
Land Title Guarantee Company
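
For anyone reading this in the archive, here is a minimal sketch of what the variable-based form of the namespace-uri expression might look like. It is not John's actual code: the namespace URI, the sample document, and the names GIZMO_NS, find_parents and prefix_matches are made up for illustration. It only relies on lxml passing keyword arguments to a compiled etree.XPath object as XPath variables ($uri here), and it runs the prefix version alongside for comparison.

```python
from lxml import etree

# Hypothetical namespace URI, standing in for the real "gizmo" namespace.
GIZMO_NS = "http://example.com/gizmo"

# Small made-up document covering the three interesting cases.
doc = etree.fromstring(
    '<root xmlns:g="http://example.com/gizmo">'
    '<a g:flag="1"/>'      # plain element carrying a gizmo attribute
    '<g:b g:flag="2"/>'    # gizmo element carrying a gizmo attribute
    '<c plain="3"/>'       # no gizmo attribute at all
    '</root>'
)

# namespace-uri version: $uri is an XPath variable, so the URI is never
# pasted into the expression text with string formatting.
find_parents = etree.XPath(
    "//@*[namespace-uri() = $uri]/parent::*[namespace-uri() != $uri]"
)
matches = find_parents(doc, uri=GIZMO_NS)
tags = [el.tag for el in matches]

# Prefix version from the original post, for comparison.
prefix_matches = doc.xpath(
    "//@gizmo:*/parent::*[not(self::gizmo:*)]",
    namespaces={"gizmo": GIZMO_NS},
)

print(tags)
print([el.tag for el in prefix_matches])
```

Both queries should select only the non-gizmo elements that carry a gizmo attribute. Compiling the expression once with etree.XPath and passing the URI per call also avoids quoting problems if the URI ever contains a double quote.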