From stefan_ml at behnel.de  Sat Mar  1 09:33:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 01 Mar 2008 09:33:56 +0100
Subject: [lxml-dev] Setting URL from lxml.html.fromstring, etc
In-Reply-To: <47C888C8.50102@colorstudy.com>
References: <47B8C56E.3090106@colorstudy.com> <47B942CD.5090501@behnel.de>
    <47B9C8AD.1050502@colorstudy.com> <47C2EA9B.8030007@behnel.de>
    <47C47CC0.7090904@colorstudy.com> <47C68BA1.3090902@behnel.de>
    <47C6E9FB.1060903@colorstudy.com> <47C86848.80003@behnel.de>
    <47C888C8.50102@colorstudy.com>
Message-ID: <47C914F4.4080003@behnel.de>

Hi Ian,

Ian Bicking wrote:
> OK.  Then would the html base attribute just be a read-only property
> then?  Like:
>
>     def base(self):
>         return super(HtmlElement, self).base
>     base = property(base)
>
> I'm not terribly concerned about whether it is read-only or not.  It's a
> little fuzzy, since HTML is parsed to the lxml representation, and
> though it will probably be serialized to HTML again (if it is serialized
> at all) and HTML doesn't have anything like xml:base, the lxml
> representation is not itself exactly HTML.  And if you serialize to
> XHTML, then xml:base is available.

Hmm, true. However, if you use lxml.html, you're likely to stay in the HTML
world, so I would prefer making this read-only. If you really want an xml:base
attribute, you can set it yourself, and if you really want to set the document
URL, it's better to be explicit than setting it through an Element.

> Also translating HTML to XHTML is kind of an outstanding issue for
> lxml.html, and it seems reasonable to me that XHTML could be parsed into
> the same classes as HTML.  The only real caveat there is that XHTML uses
> different (namespaced) tag names.  If you remove the tag names, then the
> classes and the lookup applies just fine.  (Presumably the lookup could
> be changed to support XHTML fairly easily.)

That's a different topic, so I think we should discuss that in a separate thread.
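
The read-only `base` idea quoted in this message can be tried out without lxml. The following is only a sketch: `Element` and `HtmlElement` here are hypothetical stand-in classes, not lxml's real ones, and the setter on the parent class is an assumption made so that the difference is visible.

```python
class Element(object):
    """Hypothetical stand-in for a base element class with a writable `base`."""

    def __init__(self, base=None):
        self._base = base

    @property
    def base(self):
        return self._base

    @base.setter
    def base(self, url):
        self._base = url


class HtmlElement(Element):
    """Hypothetical subclass that exposes `base` as read-only."""

    @property
    def base(self):
        # Delegate the lookup to the parent class. Because this property
        # defines no setter, assigning to it raises AttributeError.
        return super(HtmlElement, self).base


el = HtmlElement(base="http://example.com/page.html")
print(el.base)
try:
    el.base = "http://example.com/other.html"
except AttributeError:
    print("base is read-only on HtmlElement")
```

Redefining the property in the subclass without a setter is what makes it read-only; the parent's setter is shadowed rather than removed.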

Stefan

From stefan_ml at behnel.de  Sat Mar  1 09:49:25 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 01 Mar 2008 09:49:25 +0100
Subject: [lxml-dev] XHTML handling in lxml.html
Message-ID: <47C91895.8060805@behnel.de>

Ian Bicking wrote:
> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
> and it seems reasonable to me that XHTML could be parsed into the same
> classes as HTML.  The only real caveat there is that XHTML uses different
> (namespaced) tag names.

I agree that there is more we could do. For example, we could add "xhtml" as a
serialisation method and do stuff internally to add a namespace declaration to
the serialised "<html>" (iff there isn't a namespace declared already). I'm
not sure if it would be an error if the tree contains non-HTML elements, I
guess we could just leave that to the user.

> If you remove the tag names, then the classes and
> the lookup applies just fine.  (Presumably the lookup could be changed to
> support XHTML fairly easily.)

I would say so, yes. There would also be issues with the XPath expressions in
things like html.clean, I assume. It would definitely be a good thing if the
whole machinery could handle namespace-free HTML and namespaced XHTML equally
well.

Stefan

From sidnei at enfoldsystems.com  Sat Mar  1 12:26:35 2008
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Sat, 1 Mar 2008 08:26:35 -0300
Subject: [lxml-dev] Using EXSLT extensions on Windows with standard lxml binaries
In-Reply-To: <47C7C32B.8030704@behnel.de>
References: <181320030.20080226163010@gmail.com> <47C6EFB5.70006@behnel.de>
    <47C71EEC.20202@behnel.de> <47C7C32B.8030704@behnel.de>
Message-ID:

On Fri, Feb 29, 2008 at 5:32 AM, Stefan Behnel wrote:
> > What should we do? Release new builds of 1.3.x with updated libxslt?
>
> This is not a critical problem, so I wouldn't do a re-release. If you can
> build 2.0.2 with a newer libxslt, that's just fine.
> I currently don't have the
> time to backport fixes for a 1.3.7 release, but once that gets done, we'll
> have that problem sorted out as well.

Ok, 2.0.2 is up.

> Is there a way you could document the libxml2/libxslt versions used when
> uploading binaries? Like, in the file comment on PyPI?

Right now, only if I do it manually, or if I override the 'upload'
setuptools command. There's no command-line or setup.py option for
specifying what the comment will be; it is hardcoded inside the 'upload'
command.

-- 
Sidnei da Silva
Enfold Systems                http://enfoldsystems.com
Fax +1 832 201 8856     Office +1 713 942 2377 Ext 214

From stefan_ml at behnel.de  Sat Mar  1 17:06:01 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 01 Mar 2008 17:06:01 +0100
Subject: [lxml-dev] Segfault and bus error when importing lxml.html.clean after importing webbrowser
In-Reply-To:
References:
Message-ID: <47C97EE9.4080808@behnel.de>

Hi,

Jon Rosebaugh wrote:
> I was trying to use lxml.html.clean to sanitize comments in my blog.
> Unfortunately, although I can import and use it in a standalone
> console session, it fails within the webapp. Sometimes it segfaults,
> and sometimes it's a bus error instead.
> After going through all the imports to see what _they_ imported, I
> finally tracked down a minimal example that can cause the problem:
>
> import webbrowser
> import lxml.html.clean
>
> If I reverse the order of imports, everything works fine, so for the
> moment I've worked around it by making sure that lxml.html.clean is
> imported the very first thing.

The problem has been investigated. Apparently, importing the webbrowser module
can dynamically load the libxml2 library. As only lxml was built against the
updated libraries, this first import will load the older system libraries,
which then conflict with the libraries that lxml requires.
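
A mismatch of this kind can sometimes be spotted by comparing the library versions lxml was compiled against with the ones loaded at runtime. The sketch below is an illustration, not a diagnosis tool: `versions_differ` and `report` are hypothetical helper names, and a differing version is only a hint, since minor differences are normal with dynamic linking. It uses lxml's `LIBXML_VERSION`, `LIBXML_COMPILED_VERSION`, `LIBXSLT_VERSION` and `LIBXSLT_COMPILED_VERSION` tuples when lxml is importable.

```python
def versions_differ(used, compiled):
    """Return True if the runtime library version differs from the build-time one."""
    return tuple(used) != tuple(compiled)


def report(name, used, compiled):
    """Format one line describing the used vs. compiled version of a library."""
    status = "MISMATCH" if versions_differ(used, compiled) else "ok"
    return "%s: used %s, compiled against %s [%s]" % (
        name,
        ".".join(map(str, used)),
        ".".join(map(str, compiled)),
        status,
    )


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml is not installed; the helpers above still work

if etree is not None:
    print(report("libxml2", etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION))
    print(report("libxslt", etree.LIBXSLT_VERSION, etree.LIBXSLT_COMPILED_VERSION))
```

Note that this only helps when the import itself survives; in the crash described above, importing lxml first remains the practical workaround.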

https://bugs.launchpad.net/lxml/+bug/197243

This problem is due to a misconfigured system that uses conflicting library
versions, so there is nothing lxml can do here.

Stefan

From stefan_ml at behnel.de  Sun Mar  2 10:08:29 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 02 Mar 2008 10:08:29 +0100
Subject: [lxml-dev] XSLT extension elements landed on trunk
Message-ID: <47CA6E8D.9080005@behnel.de>

Hi,

the current trunk now has support for Python implemented XSLT extension
elements. It's sort of a sandbox environment with read-only Elements, where
you can do basically anything based on the stylesheet and the input document,
and then append some result subtree to the XSLT output tree.

Here's a short XSLT snippet that uses an extension, and a Python class that
provides such an extension:

    [XSLT snippet stripped by the archive; only the text "XY" remains]

    from lxml import etree

    class MyExt(etree.XSLTExtension):
        def execute(self, context, self_node, input_node, output_parent):
            # apply templates to my own children and process the result
            for child in self_node:
                for result in self.apply_templates(context, child):
                    if isinstance(result, basestring):
                        # wrap plain text results in a new "T" element
                        el = etree.Element("T")
                        el.text = result
                    else:
                        el = result
                    output_parent.append(el)

I don't remember when I first started thinking about this, but it was actually
pretty hard to get right until now.

I uploaded some docs to the dev site:

http://codespeak.net/lxml/dev/xpathxslt.html#extension-elements

Note that this is currently an experimental feature that will go into lxml
2.1. Any comments and bug reports will be very much appreciated.

Have fun,
Stefan

From stefan_ml at behnel.de  Mon Mar  3 08:22:58 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 03 Mar 2008 08:22:58 +0100
Subject: [lxml-dev] Using EXSLT extensions on Windows with standard lxml binaries
In-Reply-To:
References: <181320030.20080226163010@gmail.com> <47C6EFB5.70006@behnel.de>
    <47C71EEC.20202@behnel.de> <47C7C32B.8030704@behnel.de>
Message-ID: <47CBA752.4050703@behnel.de>

Hi Sidnei,

Sidnei da Silva wrote:
> On Fri, Feb 29, 2008 at 5:32 AM, Stefan Behnel wrote:
>> Is there a way you could document the libxml2/libxslt versions used when
>> uploading binaries? Like, in the file comment on PyPI?
>
> Right now, only if I do it manually, or if I override the 'upload'
> setuptools command.

Overriding 'upload' isn't practicable as there isn't a hook for it. The
comment is built right before uploading the file.

You could add a line to the package description on the PyPI site manually,
like "the Windows binary downloads on this site statically include libxml2
2.6.XY and libxslt 1.1.Z". Not sure you're currently allowed to do so, though.

Stefan

From lists at necoro.eu  Tue Mar  4 00:58:08 2008
From: lists at necoro.eu (René 'Necoro' Neumann)
Date: Tue, 04 Mar 2008 00:58:08 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
Message-ID: <47CC9090.9070305@necoro.eu>

Hi,

I'm developing a PyGTK-application that uses lxml to validate plugin-XMLs.
After upgrading to lxml-2*, I noticed that my application is not shut
down correctly (i.e. I close the application, but it still runs in the
background).

After evaluating a little bit, I got the test case attached. This case
hangs at the end:

Example output:
necoro at Devoty ~ % ./test.py
lxml.etree:        (2, 0, 2, 0)
libxml used:       (2, 6, 31)
libxml compiled:   (2, 6, 31)
libxslt used:      (1, 1, 22)
libxslt compiled:  (1, 1, 22)
Destroy...
Destroyed
Now after the GTK-Main ...
So the main has finished

Important notes:
- it does not hang if I use etree.XML instead of etree.parse
- it does not hang if gobject.threads_init() is not called
- works with lxml-1.3.6

A quick glance with gdb:

(gdb) bt
#0  0xb7f97410 in __kernel_vsyscall ()
#1  0xb7e5248e in sem_wait () from /lib/libpthread.so.0
#2  0xb7f1ea38 in PyThread_acquire_lock () from /usr/lib/libpython2.5.so.1.0

(No further information as I don't have debug information in all my
libraries ;))

Would be great if anyone has an idea or is able to fix it :)

Regards,
Necoro
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: text/x-python
Size: 745 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080304/f89f5080/attachment.py

From ESoutyrina at scopeseven.com  Tue Mar  4 03:07:00 2008
From: ESoutyrina at scopeseven.com (Elena Soutyrina)
Date: Mon, 3 Mar 2008 18:07:00 -0800
Subject: [lxml-dev] error installing lxml
Message-ID:

I am having trouble installing lxml.

I already installed libxml2 (2.6.30) and libxslt, Cython (0.9.6.12)

easy_install lxml gives me this error:

Building with Cython 0.9.6.12.
Building lxml version 2.0.2.
warning: no previously-included files found matching 'doc/pyrex.txt'
src/lxml/lxml.etree.c:1536: error: syntax error before 'xmlSchemaSAXPlugStruct'
src/lxml/lxml.etree.c:1536: error: syntax error before 'xmlSchemaSAXPlugStruct'

What am I missing?

Best regards,

Elena Soutyrina
Application Engineer | Scope Seven
310 220 3939 x430
2201 Park Place, Suite 100 | El Segundo, CA 90245
www.scopeseven.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080303/4a0da0e0/attachment-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/jpeg
Size: 2743 bytes
Desc: Glacier Bkgrd.jpg
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080303/4a0da0e0/attachment-0001.jpeg

From stefan_ml at behnel.de  Tue Mar  4 08:16:53 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 04 Mar 2008 08:16:53 +0100
Subject: [lxml-dev] error installing lxml
In-Reply-To:
References:
Message-ID: <47CCF765.4040000@behnel.de>

Hi,

Elena Soutyrina wrote:
> I am having trouble to install lxml.
>
> I already installed libxml2 (2.6.30) and libxslt, Cython (0.9.6.12)

Not installing Cython is generally a good idea, although it shouldn't change
anything here.

> easy_install lxml gives me error Building with Cython 0.9.6.12.
>
> Building lxml version 2.0.2.
>
> warning: no previously-included files found matching 'doc/pyrex.txt'
>
> src/lxml/lxml.etree.c:1536: error: syntax error before
> 'xmlSchemaSAXPlugStruct'
>
> src/lxml/lxml.etree.c:1536: error: syntax error before
> 'xmlSchemaSAXPlugStruct'
>
> What am I missing?

At least, I'm missing the output that comes *before* the gcc errors above.
Could you send that in as well?

Stefan

From lists at necoro.eu  Tue Mar  4 10:22:06 2008
From: lists at necoro.eu (René 'Necoro' Neumann)
Date: Tue, 04 Mar 2008 10:22:06 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47CC9090.9070305@necoro.eu>
References: <47CC9090.9070305@necoro.eu>
Message-ID: <47CD14BE.4040409@necoro.eu>

Just some more information if it is needed:

Python:    2.5.1
PyGTK:     2.12.0
PyGObject: 2.14.0

From ianb at colorstudy.com  Tue Mar  4 20:27:17 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 04 Mar 2008 13:27:17 -0600
Subject: [lxml-dev] XHTML handling in lxml.html
In-Reply-To: <47C91895.8060805@behnel.de>
References: <47C91895.8060805@behnel.de>
Message-ID: <47CDA295.7000501@colorstudy.com>

Stefan Behnel wrote:
> Ian Bicking wrote:
>> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
>> and it seems reasonable to me that XHTML could be parsed into the same
>> classes as HTML.  The only real caveat there is that XHTML uses different
>> (namespaced) tag names.
>
> I agree that there is more we could do. For example, we could add "xhtml" as a
> serialisation method and do stuff internally to add a namespace declaration to
> the serialised "<html>" (iff there isn't a namespace declared already). I'm
> not sure if it would be an error if the tree contains non-HTML elements, I
> guess we could just leave that to the user.

I think one of the justifications for XHTML (what few there are ;) is
that it can represent non-HTML elements reasonably elegantly.  But I
don't think this is a problem.

>> If you remove the tag names, then the classes and
>> the lookup applies just fine.
>> (Presumably the lookup could be changed to
>> support XHTML fairly easily.)
>
> I would say so, yes. There would also be issues with the XPath expressions in
> things like html.clean, I assume. It would definitely be a good thing if the
> whole machinery could handle namespace-free HTML and namespaced XHTML equally
> well.

This came up with Deliverance as well, as some people want to use XHTML.
Because of all the namespace/URI/prefix confusion, it seems quite
awkward.  The most elegant solution, at least in that context, seems
like using just HTML internally.  So if we get XHTML, we parse it as XML
and remove the namespace from every element in the namespace
http://www.w3.org/1999/xhtml.  Then when serializing to XHTML, we add
that namespace to everything that doesn't have a namespace (and maybe
with a whitelist of elements in XHTML).  Then internally there's a
consistent representation, and the XHTML/HTML division can be treated
more like a parsing/serialization issue.

Arguably the distinction is more than just serialization, and
{http://www.w3.org/1999/xhtml}div is really distinct from a plain div.
But that's not an argument I'd make ;)

Mostly as an aside, I'm planning to parse XHTML using the XML parser,
but to fall back to the HTML parser if that fails, as the error
handling of the two parsers is so different that they aren't really
equivalent.  Or... put another way, if you consider the error-tolerant
HTML parser to be suitable for a task, then the error-intolerant XML
parser may not be suitable (by itself).

  Ian

From stefan_ml at behnel.de  Wed Mar  5 22:22:00 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 05 Mar 2008 22:22:00 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47CC9090.9070305@necoro.eu>
References: <47CC9090.9070305@necoro.eu>
Message-ID: <47CF0EF8.6030605@behnel.de>

Hi,

René 'Necoro' Neumann wrote:
> I'm developing a PyGTK-application that uses lxml to validate plugin-XMLs.
> After upgrading to lxml-2*, I noticed that my application is not shut
> down correctly (i.e. I close the application, but it still runs in the
> background).

Just a quick note that I can reproduce this. But I'll have to take a deeper
look into this to figure out what's going on...

Stefan

From stefan_ml at behnel.de  Sun Mar  9 11:50:54 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 09 Mar 2008 11:50:54 +0100
Subject: [lxml-dev] lxml for Python 2.3
In-Reply-To: <000a01c88164$811663a0$0907050a@valeriya7y577d>
References: <000a01c88164$811663a0$0907050a@valeriya7y577d>
Message-ID: <47D3C10E.5030700@behnel.de>

Hi,

VALERIY POGREBITSKIY wrote:
> I am looking for an lxml distribution for Python 2.3, and found no such
> distribution (it exists for Python 2.4 and 2.5 only). I also found some of
> your comments that you found a way to compile 'lxml' for Python 2.3.
>
> Do you have (or do you know where I can find) such distribution (install) -
> for Python 2.3?
>
> I would greatly appreciate your help.

Since lxml compiles and runs nicely under Python 2.3 (use easy_install, as
usual), I assume that what you actually want is a pre-built binary for MS
Windows, right?

Maybe someone on the list has a Windows build for Py2.3, or can get one ready?

Stefan

From sidnei at enfoldsystems.com  Mon Mar 10 02:30:09 2008
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Sun, 9 Mar 2008 22:30:09 -0300
Subject: [lxml-dev] lxml for Python 2.3
In-Reply-To: <47D3C10E.5030700@behnel.de>
References: <000a01c88164$811663a0$0907050a@valeriya7y577d>
    <47D3C10E.5030700@behnel.de>
Message-ID:

On Sun, Mar 9, 2008 at 7:50 AM, Stefan Behnel wrote:
> Since lxml compiles and runs nicely under Python 2.3 (use easy_install, as
> usual), I assume that what you actually want is a pre-built binary for MS
> Windows, right?
>
> Maybe someone on the list has a Windows build for Py2.3, or can get one ready?

Py2.3 needs VS6, which I haven't bothered installing after moving to a
new machine.
I don't feel compelled to go through that. :)

-- 
Sidnei da Silva
Enfold Systems                http://enfoldsystems.com
Fax +1 832 201 8856     Office +1 713 942 2377 Ext 214

From ianb at colorstudy.com  Mon Mar 10 09:19:17 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 10 Mar 2008 03:19:17 -0500
Subject: [lxml-dev] should lxml.html have the same API as lxml.etree?
Message-ID: <47D4EF05.4060209@colorstudy.com>

I've noticed in html5lib that they do (in html5lib.treebuilder.etree_lxml):

    try:
        import lxml.html as etree
    except ImportError:
        from lxml import etree

with the expectation that the two work the same way.  They don't work
the same, specifically there's no etree.Comment.

Is it a reasonable expectation that they act the same?  (I think they
haven't tested the code much with lxml 2, so basically they haven't
exercised the first case... though looking at the code some I'm not sure
it works with the second case either)

  Ian

From ianb at colorstudy.com  Mon Mar 10 09:22:13 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 10 Mar 2008 03:22:13 -0500
Subject: [lxml-dev] lxml.html.ElementSoup
Message-ID: <47D4EFB5.4000800@colorstudy.com>

I believe lxml.html.ElementSoup.parse doesn't have the same API as other
parse functions -- it returns an element, where the other parse
functions return trees.

Also, should ElementSoup have a lower case name?  (Maybe renaming it
would make deprecating this parse behavior easier too)

  Ian

From stefan_ml at behnel.de  Mon Mar 10 09:28:24 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 Mar 2008 09:28:24 +0100
Subject: [lxml-dev] lxml.html.ElementSoup
In-Reply-To: <47D4EFB5.4000800@colorstudy.com>
References: <47D4EFB5.4000800@colorstudy.com>
Message-ID: <47D4F128.1040606@behnel.de>

Hi,

Ian Bicking wrote:
> I believe lxml.html.ElementSoup.parse doesn't have the same API as other
> parse functions -- it returns an element, where the other parse
> functions return trees.
>
> Also, should ElementSoup have a lower case name?
> (Maybe renaming it
> would make deprecating this parse behavior easier too)

I agree that it's not consistent, but it's how ET does it:

http://effbot.org/zone/element-soup.htm

Stefan

From stefan_ml at behnel.de  Mon Mar 10 09:51:35 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 Mar 2008 09:51:35 +0100
Subject: [lxml-dev] should lxml.html have the same API as lxml.etree?
In-Reply-To: <47D4EF05.4060209@colorstudy.com>
References: <47D4EF05.4060209@colorstudy.com>
Message-ID: <47D4F697.4060901@behnel.de>

Hi Ian,

Ian Bicking wrote:
> I've noticed in html5lib that they do (in html5lib.treebuilder.etree_lxml):
>
>     try:
>         import lxml.html as etree
>     except ImportError:
>         from lxml import etree
>
> with the expectation that the two work the same way.  They don't work
> the same, specifically there's no etree.Comment.
>
> Is it a reasonable expectation that they act the same?  (I think they
> haven't tested the code much with lxml 2, so basically they haven't
> exercised the first case... though looking at the code some I'm not sure
> it works with the second case either)

I thought about that for lxml.objectify, too. I mean, you could import
basically everything from etree into the package/module namespace and be done.
The question of having an "__all__" or not is related to this.

The thing that made me think about this was tostring(). There is no tostring()
in objectify, so it's unlikely that you will ever be able to use it without
also importing etree. But on the other hand, if you agree to import some
names, where do you draw the line? Would you want to provide XSLT and RelaxNG
as well? What about all the exceptions? And I'm absolutely sure there will be
the day where I forget to add an import (or assignment) somewhere when I add a
new name to etree.

Currently, lxml.objectify is positioned as an API *on-top* of etree, although
things that behave differently are duplicated already. I haven't made up my
mind yet.
However, I do feel that there may be more things that might want to
behave different in the future, so having them duplicated from the beginning
makes it a) easier to grow the different APIs in their respective direction,
and b) easier for users to use them consistently inside the module/package API
that they are using, without caring about the interaction with etree if they
don't use it.

Any more opinions on this?

Stefan

From ianb at colorstudy.com  Mon Mar 10 10:54:30 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 10 Mar 2008 04:54:30 -0500
Subject: [lxml-dev] Some benchmarks
Message-ID: <47D50556.1040601@colorstudy.com>

For the curious, I've attached some benchmarks.  These are preliminary,
I'm putting together the numbers for my HTML talk at PyCon.

One thing that I'd like to test is the memory use for documents.  To do
this I'm parsing about 4.5Mb of documents and keeping them in memory,
and looking at the VSZ/RSS sizes reported by ps before and after.  I
don't think this is the right/best way to do this.  For instance,
transient memory use by some parsers makes Python grab a bunch of
memory, but it might be free after parsing, and usable for other things.
Also, I don't know if VSZ/RSS is valid at all.  I get the impression it
isn't that valid.  And the increases I'm seeing for lxml don't seem to
be sufficient; at least the process should grow by 4.5Mb, right?  lxml
can't be that much more efficient than the serialized form of these files.

Another clear indication that we're measuring transient stuff is that
when using the BeautifulSoup or html5 parser with an lxml document the
memory increases substantially.  So any ideas on how to test memory
would be much appreciated.  (Maybe I could look at ps, and then start
creating Python objects until the memory use increases, so that I know
I've used up any extra allocated memory?)

I've also attached the script, though you'll need to grab your own HTML
files.
html_lxml is broken; I patched it locally to work
(http://code.google.com/p/html5lib/issues/detail?id=65).

  Ian
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: performance-results.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20080310/c8eb7ddd/attachment-0001.txt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tester.py
Type: text/x-python
Size: 8794 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080310/c8eb7ddd/attachment-0001.py

From stefan_ml at behnel.de  Mon Mar 10 11:27:47 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 Mar 2008 11:27:47 +0100
Subject: [lxml-dev] should lxml.html have the same API as lxml.etree?
In-Reply-To: <47D4EF05.4060209@colorstudy.com>
References: <47D4EF05.4060209@colorstudy.com>
Message-ID: <47D50D23.5060507@behnel.de>

Hi Ian,

Ian Bicking wrote:
> I think they
> haven't tested the code much with lxml 2, so basically they haven't
> exercised the first case... though looking at the code some I'm not sure
> it works with the second case either

I guess you are referring to fromstring()? There /is/ a fromstring() in etree,
which does the same as XML(). The reasoning is that for XML literals it reads
well to write

    XML("< ... >")

while when parsing from a string variable, it's more readable to write

    fromstring(some_content)

Stefan

From jholg at gmx.de  Mon Mar 10 11:36:02 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 10 Mar 2008 11:36:02 +0100
Subject: [lxml-dev] should lxml.html have the same API as lxml.etree?
In-Reply-To: <47D4F697.4060901@behnel.de>
References: <47D4EF05.4060209@colorstudy.com> <47D4F697.4060901@behnel.de>
Message-ID: <20080310105610.80120@gmx.net>

Hi,

> > Is it a reasonable expectation that they act the same?  (I think they
> > haven't tested the code much with lxml 2, so basically they haven't
> > exercised the first case...
> > though looking at the code some I'm not sure
> > it works with the second case either)
>
> [...]
>
> Currently, lxml.objectify is positioned as an API *on-top* of etree,
> although things that behave differently are duplicated already. I haven't
> made up my mind yet. However, I do feel that there may be more things that
> might want to behave different in the future, so having them duplicated
> from the beginning makes it a) easier to grow the different APIs in their
> respective direction, and b) easier for users to use them consistently
> inside the module/package API that they are using, without caring about
> the interaction with etree if they don't use it.

Mirroring the etree API completely in objectify might make the incautious
user think that these modules can be used completely interchangeably, while
they cannot.

And the differences are subtle, e.g. the indexing behaviour (sibling access
in objectify, children access in etree), and will not necessarily produce
easily detectable errors.

So for an lxml.objectify that exposes a full etree-API, it should be stated
very prominently that you can't just take your existing etree worker code
and start using objectify instead.

Holger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080310/298e3015/attachment.htm

From stefan_ml at behnel.de  Mon Mar 10 12:01:41 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 Mar 2008 12:01:41 +0100
Subject: [lxml-dev] Some benchmarks
In-Reply-To: <47D50556.1040601@colorstudy.com>
References: <47D50556.1040601@colorstudy.com>
Message-ID: <47D51515.7040103@behnel.de>

Hi Ian,

Ian Bicking wrote:
> For the curious, I've attached some benchmarks.  These are preliminary,
> I'm putting together the numbers for my HTML talk at PyCon.

Those /are/ pretty impressive numbers. Go, get some lxml ads up on PyCon. :)

> One thing that I'd like to test is the memory use for documents.  To do
> this I'm parsing about 4.5Mb of documents and keeping them in memory,
> and looking at the VSZ/RSS sizes reported by ps before and after.  I
> don't think this is the right/best way to do this.  For instance,
> transient memory use by some parsers makes Python grab a bunch of
> memory, but it might be free after parsing, and usable for other things.
> Also, I don't know if VSZ/RSS is valid at all.  I get the impression it
> isn't that valid.  And the increases I'm seeing for lxml don't seem to
> be sufficient; at least the process should grow by 4.5Mb, right?  lxml
> can't be that much more efficient than the serialized form of these files.

:) Didn't you see the code snippet in lxml's parser that sneaks all documents
into dark memory?

I noticed that you calculate the initial size /after/ parsing in the
--serialize case. If I move it before that, I get reasonable numbers for lxml:
+17M for 2.5MB of documents on a 32bit machine.

I don't mind having a bit of setup-time memory in those numbers, as the
absolute numbers are dominated by the document size. They very much depend on
your specific documents anyway (amount of text vs. tags, for example). So if
two libraries are close here, either of them might win for a specific input.
And if they are far away, well, then it's obvious enough which is better. A
meg more or less is of no value.

> Another clear indication that we're measuring transient stuff is that
> when using the BeautifulSoup or html5 parser with an lxml document the
> memory increases substantially.  So any ideas on how to test memory
> would be much appreciated.

Somewhat hard to do across libraries. For example, the way the ElementSoup
parser (i.e. BS on lxml) works, is: parse the document with BS, and then
recursively translate the tree into an lxml tree. So you temporarily use about
twice the memory.
You'd have to intercept the tree builder process at the end
(before releasing the BS tree) and measure there in order to get the maximum
amount of memory used.

I'd run it a couple of times and just watch top while it's running. That way,
you can figure out something close to the maximum yourself.

On the other hand, I don't know if temporary memory is of that much value for
a comparison. If it takes more space while parsing - so what? You'll likely
keep the document tree in memory much longer than the parsing takes, so that's
the dominating factor.

Stefan

From stefan_ml at behnel.de  Mon Mar 10 12:17:10 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 Mar 2008 12:17:10 +0100
Subject: [lxml-dev] should lxml.html have the same API as lxml.etree?
In-Reply-To: <20080310105610.80120@gmx.net>
References: <47D4EF05.4060209@colorstudy.com> <47D4F697.4060901@behnel.de>
    <20080310105610.80120@gmx.net>
Message-ID: <47D518B6.60800@behnel.de>

Hi,

jholg at gmx.de wrote:
> Mirroring the etree API completely in objectify might make the incautious
> user think that these modules can be used completely interchangeably, while
> they cannot.
>
> And the differences are subtle, e.g. the indexing behaviour (sibling access
> in objectify, children access in etree), and will not necessarily produce
> easily detectable errors.
>
> So for an lxml.objectify that exposes a full etree-API, it should be
> stated very prominently that you can't just take your existing etree worker
> code and start using objectify instead.

That's not really something I'm worried about. I think it's clear from the
docs (and intuitively from etree and objectify being separate modules) that
they are not interchangeable.

I think the real question is: should users need to import etree, when what
they actually work with is objectify or lxml.html?
And I think duplicating the module content and having to import only one module/package makes it even clearer that they /have/ separate APIs than the current mix of "partly etree, partly objectify" can. It allows users to stay inside the world of one tool, without even having to think about the differences to another tool that they do not care about (or at least should not have to). Stefan From stefan_ml at behnel.de Mon Mar 10 12:26:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 10 Mar 2008 12:26:37 +0100 Subject: [lxml-dev] lxml for Python 2.3 In-Reply-To: References: Message-ID: <47D51AED.2040401@behnel.de> Hi, Valeriy Pogrebitskiy wrote: > This communication may contain privileged or confidential information. If > you are not the intended recipient, you are hereby notified that you have > received this message in error and that any review, dissemination, > distribution or copying of this message is strictly prohibited. Note that your message is archived, which is clearly stated on the subscription page. Stefan From stefan_ml at behnel.de Mon Mar 10 12:30:45 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 10 Mar 2008 12:30:45 +0100 Subject: [lxml-dev] lxml for Python 2.3 In-Reply-To: References: Message-ID: <47D51BE5.40605@behnel.de> Hi, Valeriy Pogrebitskiy wrote: > Does this mean that easy_install will, or will not build lxml? > And yes - I am looking for Windows version. If you have the necessary tools installed on your machine (which seems to be VC6 for Python 2.3), you should be able to build lxml. You might also be successful with MinGW, but you'll have to work that out. May I ask what the reason is why you are tied to Py2.3? 
Stefan

From ianb at colorstudy.com Mon Mar 10 20:12:55 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 10 Mar 2008 14:12:55 -0500
Subject: [lxml-dev] Some benchmarks
In-Reply-To: <47D51515.7040103@behnel.de>
References: <47D50556.1040601@colorstudy.com> <47D51515.7040103@behnel.de>
Message-ID: <47D58837.4070108@colorstudy.com>

Stefan Behnel wrote:
> I noticed that you calculate the initial size /after/ parsing in the
> --serialize case. If I move it before that, I get reasonable numbers for lxml:
> +17M for 2.5MB of documents on a 32bit machine.

I didn't intend to include the --serialize option, but must have done so. Though I don't know why they weren't *all* messed up then? Anyway, I get 25MB, which seems quite reasonable. Here's the revised numbers:

                   VSZ /    RSS
lxml          :  25908 /  26232
bs            :  82508 /  82168
html5_cet     :  54616 /  54760
html5_et      :  64688 /  64964
html5_lxml    :  49056 /  49124
html5_minidom : 194352 / 192936
html5_simple  :  99772 /  98016
lxml_bs       : 104916 / 104856
htmlparser    :   4440 /   4448

I also tried allocating random strings until the size increased, to see if there was lots of allocated but free memory (the unused amount is an estimate, as I'm unsure what the exact internal representation of a list of strings is). The results were peculiar:

                   VSZ      RSS (used)
lxml          :  26952 /  26211 (unused: 5)
bs            :  83408 /  82156 (unused: 0)
html5_cet     :  55640 /  54745 (unused: 19)
html5_et      :  65712 /  64946 (unused: 14)
html5_lxml    :  50072 /  48986 (unused: 134)
html5_minidom : 195372 / 192914 (unused: 14)
html5_simple  :  99772 /  97999 (unused: 17)
lxml_bs       : 104644 /  73037 (unused: 31783)
htmlparser    :   4448 /   4433 (unused: 19)

I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't, at least html5_lxml should be similar I'd think (though I guess if you take into account the unused memory then html5_lxml and lxml_bs are similar).

I don't actually know if BS is better than lxml in parsing...
anything. I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.

If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.

>> Another clear indication that we're measuring transient stuff is that
>> when using the BeautifulSoup or html5 parser with an lxml document the
>> memory increases substantially. So any ideas on how to test memory
>> would be much appreciated.
>
> Somewhat hard to do across libraries. For example, the way the ElementSoup
> parser (i.e. BS on lxml) works, is: parse the document with BS, and then
> recursively translate the tree into an lxml tree. So you temporarily use about
> twice the memory. You'd have to intercept the tree builder process at the end
> (before releasing the BS tree) and measure there in order to get the maximum
> amount of memory used. I'd run it a couple of times and just watch top while
> it's running. That way, you can figure out something close to the maximum
> yourself.

I'm pretty sure what you end up with afterwards is the maximum use, as Python doesn't release memory back to the operating system after it's allocated it. (Or at least Python 2.4 doesn't.) So instead you have a pool of memory that Python isn't using, but the OS doesn't know that. I guess the assumption is that if Python never needs to use it again, at least the OS can move it to virtual memory.

> On the other hand, I don't know if temporary memory is of that much value for
> a comparison. If it takes more space while parsing - so what? You'll likely
> keep the document tree in memory much longer than the parsing takes, so that's
> the dominating factor.
Right, I'm more interested in the memory the finished document takes. Intermediate memory use shows up in the performance numbers anyway. Though I don't know if all that memory use might also lead to fragmentation, slowing down later allocations? This is beyond my understanding of Python performance. Ian From stefan_ml at behnel.de Mon Mar 10 21:15:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 10 Mar 2008 21:15:58 +0100 Subject: [lxml-dev] Some benchmarks In-Reply-To: <47D58837.4070108@colorstudy.com> References: <47D50556.1040601@colorstudy.com> <47D51515.7040103@behnel.de> <47D58837.4070108@colorstudy.com> Message-ID: <47D596FE.6070609@behnel.de> Hi, Ian Bicking wrote: > I get 25MB, which seems quite reasonable. Here's the revised numbers: > > VSZ / RSS > lxml : 25908 / 26232 > bs : 82508 / 82168 > html5_cet : 54616 / 54760 > html5_et : 64688 / 64964 > html5_lxml : 49056 / 49124 > html5_minidom : 194352 / 192936 > html5_simple : 99772 / 98016 > lxml_bs : 104916 / 104856 > htmlparser : 4440 / 4448 Still pretty good for lxml. That actually surprises me, cET is more memory friendly by itself (due to its simpler tree model), so it must be html5lib that takes its bite here. > I also tried allocating random strings until the size increased, to see > if there was lots of allocated but free memory (the unused amount is an > estimate, as I'm unsure what the exact internal representation of a list > of strings is). The results were peculiar: > > VSZ RSS (used) > lxml : 26952 / 26211 (unused: 5) > bs : 83408 / 82156 (unused: 0) > html5_cet : 55640 / 54745 (unused: 19) > html5_et : 65712 / 64946 (unused: 14) > html5_lxml : 50072 / 48986 (unused: 134) > html5_minidom : 195372 / 192914 (unused: 14) > html5_simple : 99772 / 97999 (unused: 17) > lxml_bs : 104644 / 73037 (unused: 31783) > htmlparser : 4448 / 4433 (unused: 19) > > I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots > of free memory left over at the end. 
I am surprised that the others > don't, at least html5_lxml should be similar I'd think (though I guess > if you take into account the unused memory then html5_lxml and lxml_bs > are similar). That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use Python's memory management, so memory that is freed by the parser is really freed to the OS, not just left as a growing interpreter heap. I think that's the main reason why html5_lxml ends up below html5_cet in your test. (Please correct me :) > I don't actually know if BS is better than lxml in parsing... anything. When I tried it on the generated libxml2 HTML documentation (2.5 MB), BS crashed with an encoding error, while lxml worked just fine. But you might argue that libxml2 should be able to parse its own documentation. ;) > I haven't looked hard (yet, at least). The example on the ElementSoup > page parses *slightly* better with BS, but lxml parses it very similarly > to how html5lib parses it, which I'd consider the better standard. > html5lib has the advantage of being a kind of standard. > > If I had a good collection of crappy HTML, that would probably be an > interesting test to see how differently html5lib, BS, and lxml parse it. > I'm not sure where to find a good collection like that. Maybe > html5lib's tests, I guess. There seem to be a fair amount of HTML browser compliance test suites on the web, but I didn't find any test suites for broken HTML at a first glance. >> ElementSoup >> parser (i.e. BS on lxml) works, is: parse the document with BS, and then >> recursively translate the tree into an lxml tree. So you temporarily >> use about >> twice the memory. You'd have to intercept the tree builder process at >> the end >> (before releasing the BS tree) and measure there in order to get the >> maximum >> amount of memory used. I'd run it a couple of times and just watch top >> while >> it's running. That way, you can figure out something close to the maximum >> yourself. 
>
> I'm pretty sure what you end up with after is the maximum use, as Python
> doesn't release memory back to the operating system after it's allocated
> it. (Or at least Python 2.4 doesn't.) So instead you have a pool of
> memory that Python isn't using, but the OS doesn't know that. I guess
> the assumption is that if Python never needs to use it again, at least
> the OS can move it to virtual memory.

Again, unfair advantage for lxml.

What about running a shell script in parallel to the parser tests that dumps the program's current RAM usage to a file as fast as it can, and then running that file through "sort -n -r | head -1" to get the peak and using that?

>> On the other hand, I don't know if temporary memory is of that much
>> value for a comparison. If it takes more space while parsing - so what?
>> You'll likely keep the document tree in memory much longer than the
>> parsing takes, so that's the dominating factor.
>
> Right, I'm more interested in the memory the finished document takes.
> Intermediate memory use shows up in the performance numbers anyway.
> Though I don't know if all that memory use might also lead to
> fragmentation, slowing down later allocations? This is beyond my
> understanding of Python performance.

My guess is that there is enough memory overhead involved in a dynamic language like Python to keep the impact of memory fragmentation on the parser performance rather low in comparison. But that's just a guess.
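
Coming back to the peak memory question: here is a rough Python sketch of the same idea, assuming a Unix system. Instead of polling from a shell script, it lets the kernel keep the high-water mark and reads it back through the standard resource module. The function name and the child command are made up for illustration; they are not part of the benchmark script.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return the peak resident set size of our children.

    ru_maxrss is a high-water mark over *all* child processes this process
    has waited for so far, so use a fresh Python process per measurement.
    The unit is platform dependent (kilobytes on Linux, bytes on Mac OS X).
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Hypothetical stand-in for one parser benchmark run: build a ~50 MB string and exit.
benchmark_cmd = [sys.executable, "-c", "data = b'x' * (50 * 1024 * 1024)"]
print(peak_child_rss(benchmark_cmd))
```

This measures the whole child process, interpreter included, so it only makes sense for comparing runs against each other, not as an absolute document size.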
Stefan

From mwm-keyword-lxml.9112b8 at mired.org Mon Mar 10 22:02:48 2008
From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer)
Date: Mon, 10 Mar 2008 17:02:48 -0400
Subject: [lxml-dev] Some benchmarks
In-Reply-To: <47D596FE.6070609@behnel.de>
References: <47D50556.1040601@colorstudy.com> <47D51515.7040103@behnel.de> <47D58837.4070108@colorstudy.com> <47D596FE.6070609@behnel.de>
Message-ID: <20080310170248.239e978f@bhuda.mired.org>

On Mon, 10 Mar 2008 21:15:58 +0100 Stefan Behnel wrote:
> > I also tried allocating random strings until the size increased, to see
> > if there was lots of allocated but free memory (the unused amount is an
> > estimate, as I'm unsure what the exact internal representation of a list
> > of strings is). The results were peculiar:
> >
> >                    VSZ      RSS (used)
> > lxml          :  26952 /  26211 (unused: 5)
> > bs            :  83408 /  82156 (unused: 0)
> > html5_cet     :  55640 /  54745 (unused: 19)
> > html5_et      :  65712 /  64946 (unused: 14)
> > html5_lxml    :  50072 /  48986 (unused: 134)
> > html5_minidom : 195372 / 192914 (unused: 14)
> > html5_simple  :  99772 /  97999 (unused: 17)
> > lxml_bs       : 104644 /  73037 (unused: 31783)
> > htmlparser    :   4448 /   4433 (unused: 19)
> >
> > I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots
> > of free memory left over at the end. I am surprised that the others
> > don't, at least html5_lxml should be similar I'd think (though I guess
> > if you take into account the unused memory then html5_lxml and lxml_bs
> > are similar).
>
> That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use
> Python's memory management, so memory that is freed by the parser is really
> freed to the OS, not just left as a growing interpreter heap.

Not necessarily. libxml2 uses the C library's free/malloc. Historically, on Unix systems the C library's free/malloc don't return the memory to the OS, but keep it in an internal heap.
Systems that are Not Unix tend to do otherwise, creating some confusion for people moving from those systems to Unix.

> > I haven't looked hard (yet, at least). The example on the ElementSoup
> > page parses *slightly* better with BS, but lxml parses it very similarly
> > to how html5lib parses it, which I'd consider the better standard.
> > html5lib has the advantage of being a kind of standard.
> >
> > If I had a good collection of crappy HTML, that would probably be an
> > interesting test to see how differently html5lib, BS, and lxml parse it.
> > I'm not sure where to find a good collection like that. Maybe
> > html5lib's tests, I guess.
>
> There seem to be a fair amount of HTML browser compliance test suites on the
> web, but I didn't find any test suites for broken HTML at a first glance.

I think Google has a nice collection of broken HTML :-).

http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

From ianb at colorstudy.com Mon Mar 10 23:21:46 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 10 Mar 2008 17:21:46 -0500
Subject: [lxml-dev] .base and docinfo.URL
Message-ID: <47D5B47A.7030302@colorstudy.com>

Does .base inherit from docinfo.URL? It doesn't seem like it does. I tried changing .base_url to just return self.base, but if I do:

    >>> from lxml.html import parse
    >>> doc = parse('http://python.org').getroot()
    >>> print doc.base
    None
    >>> doc.getroottree().docinfo.URL
    'http://python.org'

From stefan_ml at behnel.de Tue Mar 11 08:41:09 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 11 Mar 2008 08:41:09 +0100
Subject: [lxml-dev] Some benchmarks
In-Reply-To: <20080310170248.239e978f@bhuda.mired.org>
References: <47D50556.1040601@colorstudy.com> <47D51515.7040103@behnel.de> <47D58837.4070108@colorstudy.com> <47D596FE.6070609@behnel.de> <20080310170248.239e978f@bhuda.mired.org>
Message-ID: <47D63795.1080402@behnel.de>

Hi,

Mike Meyer wrote:
> Not necessarily.
libxml2 uses the c libraries > free/malloc. Historically, on Unix systems the C library free/malloc > don't return the memory to the OS, but keep it in an internal > heap. Systems that are Not Unix tend to do otherwise, creating some > confusion for people moving from those systems to unix. I tend to consider libc a part of the OS. But technically you are right and it even makes a difference here. >>> I haven't looked hard (yet, at least). The example on the ElementSoup >>> page parses *slightly* better with BS, but lxml parses it very similarly >>> to how html5lib parses it, which I'd consider the better standard. >>> html5lib has the advantage of being a kind of standard. >>> >>> If I had a good collection of crappy HTML, that would probably be an >>> interesting test to see how differently html5lib, BS, and lxml parse it. >>> I'm not sure where to find a good collection like that. Maybe >>> html5lib's tests, I guess. >> There seem to be a fair amount of HTML browser compliance test suites on the >> web, but I didn't find any test suites for broken HTML at a first glance. > > I think google has a nice collection of broken html :-). Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just don't know how to write a Google search query for broken HTML pages... :) Anyway, I'm not sure they actually keep the broken HTML pages around. I would expect them to send them through a sanitizer before doing anything else with them (including local caching). Stefan From stefan_ml at behnel.de Tue Mar 11 08:54:03 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 11 Mar 2008 08:54:03 +0100 Subject: [lxml-dev] .base and docinfo.URL In-Reply-To: <47D5B47A.7030302@colorstudy.com> References: <47D5B47A.7030302@colorstudy.com> Message-ID: <47D63A9B.1070101@behnel.de> Hi Ian, Ian Bicking wrote: > Does .base inherit from docinfo.URL? It doesn't seem like it does. 
I
> tried changing .base_url to just return self.base, but if I do:
>
>     >>> from lxml.html import parse
>     >>> doc = parse('http://python.org').getroot()
>     >>> print doc.base
>     None
>     >>> doc.getroottree().docinfo.URL
>     'http://python.org'

I just checked the libxml2 source: it actually behaves completely differently for HTML documents. Here, it looks for

    <base href="...">

and takes that. It completely ignores the document URL for HTML.

I think it would be good to override that (directly in etree), so that it returns the document URL if nothing is returned from the base search. That way, it's consistent with the fallback in XML.

Stefan

From stefan_ml at behnel.de Tue Mar 11 10:35:01 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 11 Mar 2008 10:35:01 +0100
Subject: [lxml-dev] .base and docinfo.URL
In-Reply-To: <47D63A9B.1070101@behnel.de>
References: <47D5B47A.7030302@colorstudy.com> <47D63A9B.1070101@behnel.de>
Message-ID: <47D65245.1030600@behnel.de>

Hi,

fixed on the trunk.

Stefan

Stefan Behnel wrote:
> Ian Bicking wrote:
>> Does .base inherit from docinfo.URL? It doesn't seem like it does. I
>> tried changing .base_url to just return self.base, but if I do:
>>
>>     >>> from lxml.html import parse
>>     >>> doc = parse('http://python.org').getroot()
>>     >>> print doc.base
>>     None
>>     >>> doc.getroottree().docinfo.URL
>>     'http://python.org'
>
> I just checked the libxml2 source: it actually behaves completely differently
> for HTML documents. Here, it looks for
>
>     <base href="...">
>
> and takes that. It completely ignores the document URL for HTML.
>
> I think it would be good to override that (directly in etree), so that it
> returns the document URL if nothing is returned from the base search. That
> way, it's consistent with the fallback in XML.
> > Stefan > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From faassen at startifact.com Tue Mar 11 19:06:58 2008 From: faassen at startifact.com (Martijn Faassen) Date: Tue, 11 Mar 2008 19:06:58 +0100 Subject: [lxml-dev] lxml.html.ElementSoup In-Reply-To: <47D4F128.1040606@behnel.de> References: <47D4EFB5.4000800@colorstudy.com> <47D4F128.1040606@behnel.de> Message-ID: Stefan Behnel wrote: > Ian Bicking wrote: [snip] >> Also, should ElementSoup have a lower case name? (Maybe renaming it >> would make deprecating this parse behavior easier too) > > I agree that it's not consistent, but it's how ET does it: > > http://effbot.org/zone/element-soup.htm We do have precedents for lower-casing things that ET doesn't lowercase, like the entire 'etree' module itself. Regards, Martijn From stefan_ml at behnel.de Tue Mar 11 19:31:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 11 Mar 2008 19:31:05 +0100 Subject: [lxml-dev] lxml.html.ElementSoup In-Reply-To: References: <47D4EFB5.4000800@colorstudy.com> <47D4F128.1040606@behnel.de> Message-ID: <47D6CFE9.9020409@behnel.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: > [snip] >>> Also, should ElementSoup have a lower case name? (Maybe renaming it >>> would make deprecating this parse behavior easier too) >> I agree that it's not consistent, but it's how ET does it: >> >> http://effbot.org/zone/element-soup.htm > > We do have precedents for lower-casing things that ET doesn't lowercase, > like the entire 'etree' module itself. ok, but since we even have the same name, it's a good excuse for also following the inconsistent API. :) The API is the main problem here, not the package name. 
Stefan From stefan_ml at behnel.de Tue Mar 11 20:37:04 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 11 Mar 2008 20:37:04 +0100 Subject: [lxml-dev] Help getting lxml to work reliably on MacOS-X In-Reply-To: References: <47A6D0B1.5020600@behnel.de> <47A8C588.9060701@behnel.de> <47ACB0D6.4070704@behnel.de> Message-ID: <47D6DF60.5070507@behnel.de> Hi, Christian Zagrodnick wrote: > On 2008-02-09 14:33:43 +0100, Christian Zagrodnick said: >> On 2008-02-08 20:43:18 +0100, Stefan Behnel said: >>> Christian Zagrodnick wrote: >>>> The main problem is, that lxml runs the wrong xslt-config. So I was >>>> basically building libxml2 and libxslt just for the fun of it. >>>> >>>> The question is if lxml really always needs to call xslt-config. Or how >>>> one would set the path in the buildout so that the right xslt-config is >>>> called. >>>> >>>> If I manually set the path it works like charm: >>>> >>>> % PATH=`pwd`/parts/libxslt/bin:$PATH bin/buildout >>> You can now pass "--with-xslt-config=XXX" to setup.py. ... and --with-xml2-config=YYY, just in case you installed both in different places. >> Gotta check how we best pass that along in buildout. > > So in the *next* version of zc.recipe.egg (i.e. >1.0.0), the following > will probably work. It works with the trunk of zc.recipe.egg (but this > is neither released nor has it been reviewed by Jim Fulton, yet): > > [lxml-environment] > PATH=${buildout:directory}/parts/libxslt/bin:%(PATH)s [...] ... and, you can also set the XML2_CONFIG and XSLT_CONFIG environment variables to make sure setup.py really picks up the right config. 
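
As a rough sketch of the environment variable route: the helper name and the directory layout below are only assumptions modelled on the parts/libxslt/bin path quoted above, not something setup.py provides.

```python
import os


def config_env(xslt_prefix, xml2_prefix=None, base_env=None):
    """Return a copy of the environment with XSLT_CONFIG and XML2_CONFIG set.

    Assumes each prefix keeps its config script in <prefix>/bin.
    xml2_prefix defaults to xslt_prefix when both libraries share a prefix.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XSLT_CONFIG"] = os.path.join(xslt_prefix, "bin", "xslt-config")
    env["XML2_CONFIG"] = os.path.join(xml2_prefix or xslt_prefix, "bin", "xml2-config")
    return env


# Hypothetical buildout directories; adjust to wherever the libraries were built.
env = config_env("/path/to/buildout/parts/libxslt", "/path/to/buildout/parts/libxml2")
print(env["XSLT_CONFIG"])
print(env["XML2_CONFIG"])
```

The resulting dict can be handed to subprocess as env= when running setup.py, which avoids touching PATH at all.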
Stefan From stefan_ml at behnel.de Wed Mar 12 18:55:20 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Mar 2008 18:55:20 +0100 Subject: [lxml-dev] Python XML Validator In-Reply-To: <20080312005244.4319e4a5@bhuda.mired.org> References: <20080302230708.260fa4a9@bhuda.mired.org> <20080305102306.0a3db148@mbook-fbsd> <20080311121006.0da78de7@mbook-fbsd> <47D6BAE9.4050105@behnel.de> <20080312005244.4319e4a5@bhuda.mired.org> Message-ID: <47D81908.5080508@behnel.de> Hi, moving this here from python-dev (where it started for whatever reason...) Mike Meyer wrote: > On Tue, 11 Mar 2008 18:01:29 +0100 Stefan Behnel wrote: >> BTW, we had MacOS builds a while ago, so I wouldn't mind having someone >> volunteer to contribute builds on a regular basis (static builds preferred). > > For which Python build? python.org? Activatestate? Leopard? Macports? > Fink? pkgsrc? Any idea if a single build will work for all of them? I have no idea. At the very least, different Python major versions will pose a problem. And I guess the builds provided by package distributions like fink and macports will also require newer dependencies on other ends, or be built with newer compilers... >>> The second time for OS-X, I used an older version of lxml (1.3.6), and >>> just did "setup.py install". Worked like a charm. That's not hard. >> Interesting. 1.3.6 should also require libxml2 2.6.20 (although maybe less >> strictly than 2.0). > > I just grabbed it and tried parsing thing with it; I didn't try the > advanced features that I depend on lxml for (rng validation and lots > of xpath), or what the OP was looking for (validation). 
Running the
> test.py suite turns up one failure:
>
> File "/Users/mwm/lxml-1.3.6/src/lxml/tests/../../../doc/parsing.txt", line 369, in parsing.txt
> Failed example:
>     etree.tounicode(root)
> Expected:
>     u' \uf8d1 + \uf8d2 '
> Got:
>     u'  +  '

If that's the only problem, then 1.3.x works 'acceptably' with 2.6.16 - except that newer versions are much better in parsing HTML and validating with XML schema (amongst other things). Note that the test suite tends to avoid testing features that only depend on libxml2, and especially stuff that has changed between library versions. It's a test suite for lxml, not for libxml2.

However, 2.0 will not work that easily. Things like parse-time schema validation and schematron support do not work on versions below 2.6.20 (or actually 2.6.21, but we disable schematron on 2.6.20). We might be able to work around some more stuff by spreading some #ifdef's and #defines, but so far, I find it perfectly acceptable if 2.0 requires newer dependencies for new features. People who care about reliability will not use libraries as old as 2.6.16 anyway. The list of fixed bugs only gets longer with newer versions.

>>>>> Which means you wind up having to
>>>>> build those yourself if you want a recent version of lxml, even if
>>>>> you're using a system that includes lxml in its package system.
>>>> If you want a clean system, e.g. for production use, buildout has proven to be
>>>> a good idea. And we also provide pretty good instructions on our web page on
>>>> how to install lxml on MacOS-X and what to take care of.
>>> Yes, but the proposal was to include it in the Python standard
>>> library. Software that doesn't work on popular target platforms
>>> without updating a standard system library isn't really suitable for
>>> that.
>> Hmm, coming somewhat back on-topic: how does Python currently handle its
>> dependencies under MacOS-X? SQLite, for example? Does it use system libraries
>> only, or are there libraries it ships with?
(The MacOS distro is much bigger, >> but that might be due to the universal build - although that suggests that >> MacOS-X users do not care about disk space or download size anyway) > > For most of them, it checks for the existence of the libraries and > header files for those packages, and then builds the wrapper libraries > if it finds their requirements. Look through the 2.5.2 setup.py for > how sqlite3 is handled (it's a bit much to include here). Funny, looking for the sqlite setup was actually a good idea. It does all sorts of things to figure out a good one to use, specifically on MacOS-X. There even appears to be some trickery to take the first library it finds, static or dynamic, instead of continuing to look for a dynlib. I wouldn't mind adding a similar setup to lxml's setupinfo.py. Maybe someone can get a hand on this? It would be great to have an automatic static build on MacOS, so that people could just run setup.py and be sure it uses the expected libs the next time they use it. Is there a standard directory prefix where macport & Co. install libraries and related stuff like xslt-config? Stefan From mwm-keyword-lxml.9112b8 at mired.org Thu Mar 13 05:53:00 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Thu, 13 Mar 2008 00:53:00 -0400 Subject: [lxml-dev] Python XML Validator In-Reply-To: <47D81908.5080508@behnel.de> References: <20080302230708.260fa4a9@bhuda.mired.org> <20080305102306.0a3db148@mbook-fbsd> <20080311121006.0da78de7@mbook-fbsd> <47D6BAE9.4050105@behnel.de> <20080312005244.4319e4a5@bhuda.mired.org> <47D81908.5080508@behnel.de> Message-ID: <20080313005300.6266e3aa@bhuda.mired.org> On Wed, 12 Mar 2008 18:55:20 +0100 Stefan Behnel wrote: > Mike Meyer wrote: > > On Tue, 11 Mar 2008 18:01:29 +0100 Stefan Behnel wrote: > >> BTW, we had MacOS builds a while ago, so I wouldn't mind having someone > >> volunteer to contribute builds on a regular basis (static builds preferred). > > For which Python build? python.org? 
ActiveState? Leopard? Macports?
> > Fink? pkgsrc? Any idea if a single build will work for all of them?
> I have no idea. At the very least, different Python major versions will pose a
> problem. And I guess the builds provided by package distributions like fink
> and macports will also require newer dependencies on other ends, or be built
> with newer compilers...

The package system versions can probably be ignored, as the package systems will provide lxml (or the libraries and an environment to find them from setup.py). ActiveState seems to be a world of its own. Personally, I'll use the Apple python until I need a build with features it doesn't provide (probably sometime after 2.6 comes out later this year). But that doesn't seem like a good way to pick one to support.

> >>> The second time for OS-X, I used an older version of lxml (1.3.6), and
> >>> just did "setup.py install". Worked like a charm. That's not hard.
> >> Interesting. 1.3.6 should also require libxml2 2.6.20 (although maybe less
> >> strictly than 2.0).
> >
> > I just grabbed it and tried parsing things with it; I didn't try the
> > advanced features that I depend on lxml for (rng validation and lots
> > of xpath), or what the OP was looking for (validation). Running the
> > test.py suite turns up one failure:
> >
> > File "/Users/mwm/lxml-1.3.6/src/lxml/tests/../../../doc/parsing.txt", line 369, in parsing.txt
> > Failed example:
> >     etree.tounicode(root)
> > Expected:
> >     u' \uf8d1 + \uf8d2 '
> > Got:
> >     u'  +  '
>
> If that's the only problem, then 1.3.x works 'acceptably' with 2.6.16 - except
> that newer versions are much better in parsing HTML and validating with XML
> schema (amongst other things). Note that the test suite tends to avoid testing
> features that only depend on libxml2, and especially stuff that has changed
> between library versions. It's a test suite for lxml, not for libxml2.

That's what I was afraid of.
This is the "easy" solution for OSX, but it doesn't get you software you'd want to use for the advanced features that make lxml so attractive.

> However, 2.0 will not work that easily. Things like parse-time schema
> validation and schematron support do not work on versions below 2.6.20 (or
> actually 2.6.21, but we disable schematron on 2.6.20). We might be able to
> work around some more stuff by spreading some #ifdef's and #defines, but so
> far, I find it perfectly acceptable if 2.0 requires newer dependencies for new
> features. People who care about reliability will not use libraries as old as
> 2.6.16 anyway. The list of fixed bugs only gets longer with newer versions.

lxml is a cutting-edge tool for XML work. I need the features it offers that make it such, and that means having recent versions of those libraries, because earlier ones didn't have them. That's cool with me.

> >>>>> Which means you wind up having to
> >>>>> build those yourself if you want a recent version of lxml, even if
> >>>>> you're using a system that includes lxml in its package system.
> >>>> If you want a clean system, e.g. for production use, buildout has proven to be
> >>>> a good idea. And we also provide pretty good instructions on our web page on
> >>>> how to install lxml on MacOS-X and what to take care of.
> >>> Yes, but the proposal was to include it in the Python standard
> >>> library. Software that doesn't work on popular target platforms
> >>> without updating a standard system library isn't really suitable for
> >>> that.
> >> Hmm, coming somewhat back on-topic: how does Python currently handle its
> >> dependencies under MacOS-X? SQLite, for example? Does it use system libraries
> >> only, or are there libraries it ships with?
(The MacOS distro is much bigger,
> >> but that might be due to the universal build - although that suggests that
> >> MacOS-X users do not care about disk space or download size anyway)
> >
> > For most of them, it checks for the existence of the libraries and
> > header files for those packages, and then builds the wrapper libraries
> > if it finds their requirements. Look through the 2.5.2 setup.py for
> > how sqlite3 is handled (it's a bit much to include here).
>
> Funny, looking for the sqlite setup was actually a good idea. It does all
> sorts of things to figure out a good one to use, specifically on MacOS-X.
> There even appears to be some trickery to take the first library it finds,
> static or dynamic, instead of continuing to look for a dynlib.
>
> I wouldn't mind adding a similar setup to lxml's setupinfo.py. Maybe someone
> can get a hand on this? It would be great to have an automatic static build on
> MacOS, so that people could just run setup.py and be sure it uses the expected
> libs the next time they use it.

This sounds like the best approach, especially if the install/build document provides pointers to the package systems it checks for.

> Is there a standard directory prefix where macport & Co. install libraries and
> related stuff like xslt-config?

There's a default, but you can change it for all of them (a feature they all inherited from Hubbard's original version for FreeBSD). The default prefix for each of them is:

MacPorts: /opt/local/...
Fink:     /sw/...
pkgsrc:   /usr/pkg/...

http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
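
A minimal sketch of how a setup script might probe those default prefixes for xslt-config. It assumes the script sits in <prefix>/bin, which is not stated above, and the function and constant names are made up for illustration. Anyone who changed the default prefix would still need --with-xslt-config or XSLT_CONFIG.

```python
import os

# Default prefixes as listed above for MacPorts, Fink and pkgsrc.
DEFAULT_PREFIXES = ["/opt/local", "/sw", "/usr/pkg"]


def find_config_script(name="xslt-config", prefixes=DEFAULT_PREFIXES,
                       exists=os.path.exists):
    """Return the first <prefix>/bin/<name> that exists, or None.

    The exists argument is only there so the lookup can be exercised
    without a real package system installed.
    """
    for prefix in prefixes:
        candidate = os.path.join(prefix, "bin", name)
        if exists(candidate):
            return candidate
    return None


# Prints a path on a machine with one of the package systems, else None.
print(find_config_script())
```

Taking the first hit mirrors the order of the list, so a machine with several package systems installed would need an explicit override.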
From james at yaean.com Thu Mar 13 11:26:05 2008 From: james at yaean.com (James Zhu) Date: Thu, 13 Mar 2008 18:26:05 +0800 Subject: [lxml-dev] Strange problem with lxml.html.diff Message-ID: <51066ce10803130326x64226e78r6625a5d773d287e9@mail.gmail.com> Hi guys, Here's what I did: james at orchid ~ $ python Python 2.4.4 (#1, Mar 10 2008, 14:55:59) [GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from lxml import etree >>> etree.LXML_VERSION (2, 0, 2, 0) >>> from lxml.html.diff import htmldiff >>> doc1 = """some

test""" >>> doc2 = """some

text""" >>> print htmldiff(doc1, doc2) some

text

test

>>> doc3 = """some
test""" >>> doc4 = """some
text""" >>> print htmldiff(doc3, doc4) some
>>> It seems that the contents after
mysteriously disappeared. Any ideas? Regards James From ianb at colorstudy.com Thu Mar 13 20:05:13 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 13 Mar 2008 14:05:13 -0500 Subject: [lxml-dev] Strange problem with lxml.html.diff In-Reply-To: <51066ce10803130326x64226e78r6625a5d773d287e9@mail.gmail.com> References: <51066ce10803130326x64226e78r6625a5d773d287e9@mail.gmail.com> Message-ID: <47D97AE9.9010908@colorstudy.com> James Zhu wrote: > Hi guys, > > Here's what I did: > > james at orchid ~ $ python > Python 2.4.4 (#1, Mar 10 2008, 14:55:59) > [GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> from lxml import etree >>>> etree.LXML_VERSION > (2, 0, 2, 0) >>>> from lxml.html.diff import htmldiff >>>> doc1 = """some

test""" >>>> doc2 = """some

text""" >>>> print htmldiff(doc1, doc2) > some

text

test

>>>> doc3 = """some
test""" >>>> doc4 = """some
text""" >>>> print htmldiff(doc3, doc4) > some
> > It seems that the contents after
mysteriously disappeared. Any ideas? Hmm... it looks like a bug with empty elements, generally. For instance: >>> print htmldiff('

Some text

', '

Some other

')

Some

At first I thought it might be something with block-level elements, but no: >>> print htmldiff('

Some x text

', '

Some x other

')

Some x other text

It looks like there's some code in htmldiff that drops empty tags, ignoring their .tail. It might be a small fix, but I'm not sure, and with PyCon I'm a little pressed for time, so I can't fix it right now I'm afraid. Ian From james at yaean.com Fri Mar 14 03:41:13 2008 From: james at yaean.com (James Zhu) Date: Fri, 14 Mar 2008 10:41:13 +0800 Subject: [lxml-dev] Strange problem with lxml.html.diff In-Reply-To: <47D97AE9.9010908@colorstudy.com> References: <51066ce10803130326x64226e78r6625a5d773d287e9@mail.gmail.com> <47D97AE9.9010908@colorstudy.com> Message-ID: <51066ce10803131941p4b5b3db5tf80ad7957803afca@mail.gmail.com> On 3/14/08, Ian Bicking wrote: > > It looks like there's some code in htmldiff that drops empty tags, > ignoring their .tail. It might be a small fix, but I'm not sure, and > with PyCon I'm a little pressed for time, so I can't fix it right now > I'm afraid. > Never mind. Take your time. ;) James From james at yaean.com Fri Mar 14 09:03:18 2008 From: james at yaean.com (James Zhu) Date: Fri, 14 Mar 2008 16:03:18 +0800 Subject: [lxml-dev] Strange problem with lxml.html.diff In-Reply-To: <47D97AE9.9010908@colorstudy.com> References: <51066ce10803130326x64226e78r6625a5d773d287e9@mail.gmail.com> <47D97AE9.9010908@colorstudy.com> Message-ID: <51066ce10803140103s50dd6224v166e2502b1ada509@mail.gmail.com> On 3/14/08, Ian Bicking wrote: > (snipped) > It looks like there's some code in htmldiff that drops empty tags, > ignoring their .tail. (snipped) Yes, you're right. I looked at the code and made some minor changes. Now it seems to work fine. But I don't know whether there'll be any side effects. I'm attaching a patch FYI. James -------------- next part -------------- A non-text attachment was scrubbed... 
Name: htmldiff.patch Type: application/octet-stream Size: 1647 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080314/adc6391c/attachment.obj From xphuture at gmail.com Fri Mar 14 15:57:25 2008 From: xphuture at gmail.com (Fabien) Date: Fri, 14 Mar 2008 15:57:25 +0100 Subject: [lxml-dev] Help getting lxml to work reliably on MacOS-X In-Reply-To: <47A6D0B1.5020600@behnel.de> References: <47A6D0B1.5020600@behnel.de> Message-ID: <622afeaa0803140757k2854afccx26d87167e4b6a1b8@mail.gmail.com> Hello, > - what package management system (fink/macports) do you use? macports > - are you using the stock Python or one that is installed separately? the stock Python bundled with Leopard > - what library versions are you using of libxml2, libxslt, zlib, libiconv? $ port installed |egrep 'libxml2|xslt|zlib|iconv' libiconv @1.12_0 (active) libxml2 @2.6.31_0 (active) libxslt @1.1.22_0 (active) zlib @1.2.3_1 (active) When I'm trying to make the install, I get the following errors : MacBook-Pro:Downloads jibaku$ sudo easy_install lxml==2.0.2 Searching for lxml==2.0.2 Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.0.2 Downloading http://codespeak.net/lxml/lxml-2.0.2.tgz Processing lxml-2.0.2.tgz Running lxml-2.0.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install--6uwin/lxml-2.0.2/egg-dist-tmp-25cVmi Building lxml version 2.0.2. NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available. 
warning: no previously-included files found matching 'doc/pyrex.txt' ld warning: in /opt/local/lib/libxslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libexslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libxml2.dylib, file is not of required architecture ld warning: in /opt/local/lib/libz.dylib, file is not of required architecture ld warning: in /opt/local/lib/libxslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libexslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libxml2.dylib, file is not of required architecture ld warning: in /opt/local/lib/libz.dylib, file is not of required architecture ld warning: in /opt/local/lib/libxslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libexslt.dylib, file is not of required architecture ld warning: in /opt/local/lib/libxml2.dylib, file is not of required architecture ld warning: in /opt/local/lib/libz.dylib, file is not of required architecture No eggs found in /tmp/easy_install--6uwin/lxml-2.0.2/egg-dist-tmp-25cVmi (setup script problem?) Thanks in advance -- Fabien SCHWOB From stefan_ml at behnel.de Fri Mar 14 17:30:32 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 14 Mar 2008 17:30:32 +0100 Subject: [lxml-dev] Help getting lxml to work reliably on MacOS-X In-Reply-To: <622afeaa0803140757k2854afccx26d87167e4b6a1b8@mail.gmail.com> References: <47A6D0B1.5020600@behnel.de> <622afeaa0803140757k2854afccx26d87167e4b6a1b8@mail.gmail.com> Message-ID: <47DAA828.6040507@behnel.de> Hi, Fabien wrote: >> - what package management system (fink/macports) do you use? > macports > >> - are you using the stock Python or one that is installed separately? > the stock Python bundled with Leopard > >> - what library versions are you using of libxml2, libxslt, zlib, libiconv? 
> $ port installed |egrep 'libxml2|xslt|zlib|iconv' > libiconv @1.12_0 (active) > libxml2 @2.6.31_0 (active) > libxslt @1.1.22_0 (active) > zlib @1.2.3_1 (active) > > When I'm trying to make the install, I get the following errors : > > MacBook-Pro:Downloads jibaku$ sudo easy_install lxml==2.0.2 > Searching for lxml==2.0.2 > Reading http://pypi.python.org/simple/lxml/ > Reading http://codespeak.net/lxml > Best match: lxml 2.0.2 > Downloading http://codespeak.net/lxml/lxml-2.0.2.tgz > Processing lxml-2.0.2.tgz > Running lxml-2.0.2/setup.py -q bdist_egg --dist-dir > /tmp/easy_install--6uwin/lxml-2.0.2/egg-dist-tmp-25cVmi > Building lxml version 2.0.2. > NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' > needs to be available. > warning: no previously-included files found matching 'doc/pyrex.txt' > ld warning: in /opt/local/lib/libxslt.dylib, file is not of required > architecture > ld warning: in /opt/local/lib/libexslt.dylib, file is not of required > architecture Just guessing, but this looks like a problem with your setup. It seems to use the correct libraries in /opt/local (although it doesn't look like building anything here). But somehow they do not seem to fit either your Python or the compiler, so apparently it can't build against them. Is this on Intel or PPC? Maybe others can comment on this? Stefan From mwm-keyword-lxml.9112b8 at mired.org Fri Mar 14 17:59:20 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Fri, 14 Mar 2008 12:59:20 -0400 Subject: [lxml-dev] Help getting lxml to work reliably on MacOS-X In-Reply-To: <47DAA828.6040507@behnel.de> References: <47A6D0B1.5020600@behnel.de> <622afeaa0803140757k2854afccx26d87167e4b6a1b8@mail.gmail.com> <47DAA828.6040507@behnel.de> Message-ID: <20080314125920.01c867a1@bhuda.mired.org> On Fri, 14 Mar 2008 17:30:32 +0100 Stefan Behnel wrote: > Hi, > > Fabien wrote: > >> - what package management system (fink/macports) do you use? 
> > macports > > > >> - are you using the stock Python or one that is installed separately? > > the stock Python bundled with Leopard > > > >> - what library versions are you using of libxml2, libxslt, zlib, libiconv? > > $ port installed |egrep 'libxml2|xslt|zlib|iconv' > > libiconv @1.12_0 (active) > > libxml2 @2.6.31_0 (active) > > libxslt @1.1.22_0 (active) > > zlib @1.2.3_1 (active) > > > > When I'm trying to make the install, I get the following errors : > > > > MacBook-Pro:Downloads jibaku$ sudo easy_install lxml==2.0.2 > > Searching for lxml==2.0.2 > > Reading http://pypi.python.org/simple/lxml/ > > Reading http://codespeak.net/lxml > > Best match: lxml 2.0.2 > > Downloading http://codespeak.net/lxml/lxml-2.0.2.tgz > > Processing lxml-2.0.2.tgz > > Running lxml-2.0.2/setup.py -q bdist_egg --dist-dir > > /tmp/easy_install--6uwin/lxml-2.0.2/egg-dist-tmp-25cVmi > > Building lxml version 2.0.2. > > NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' > > needs to be available. > > warning: no previously-included files found matching 'doc/pyrex.txt' > > ld warning: in /opt/local/lib/libxslt.dylib, file is not of required > > architecture > > ld warning: in /opt/local/lib/libexslt.dylib, file is not of required > > architecture > > Just guessing, but this looks like a problem with your setup. It seems to use > the correct libraries in /opt/local (although it doesn't look like building > anything here). But somehow they do not seem to fit either your Python or the > compiler, so apparently it can't build against them. Is this on Intel or PPC? > > Maybe others can comment on this? IIRC, macports builds binaries for your system, not universal binaries. /usr/bin/python, on the other hand, is a universal binary. See if the library ports have a universal variant, and if so, try installing that. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. 
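[Editorial sketch, not part of the original message.] One way to check the mismatch described above is to compare the architectures of the interpreter with those of each library. The sketch below shells out to the macOS lipo tool and assumes that the architecture list follows the last ": " in the output of `lipo -info`; that output format, and the helper names, are assumptions rather than anything stated in the thread.

```python
import subprocess


def parse_lipo_info(output):
    """Extract architecture names from one line of `lipo -info` output.

    Assumed format: the architecture list follows the final ": ",
    for both fat and non-fat files.
    """
    return set(output.strip().rsplit(": ", 1)[-1].split())


def architectures(path):
    """Run `lipo -info` on path (macOS only) and return its architectures."""
    out = subprocess.check_output(["lipo", "-info", path])
    return parse_lipo_info(out.decode("utf-8", "replace"))


def missing_architectures(interpreter, library):
    """Architectures the interpreter was built for but the library lacks."""
    return architectures(interpreter) - architectures(library)
```

On an affected system, something like missing_architectures("/usr/bin/python", "/opt/local/lib/libxml2.dylib") would be expected to return a non-empty set, which would match the "file is not of required architecture" warnings.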
From stefan_ml at behnel.de Sun Mar 16 09:34:40 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 16 Mar 2008 09:34:40 +0100
Subject: [lxml-dev] lxml.html.ElementSoup
In-Reply-To: <47D4EFB5.4000800@colorstudy.com>
References: <47D4EFB5.4000800@colorstudy.com>
Message-ID: <47DCDBA0.6090102@behnel.de>

Hi,

Ian Bicking wrote:
> I believe lxml.html.ElementSoup.parse doesn't have the same API as other
> parse functions -- it returns an element, where the other parse
> functions return trees.
>
> Also, should ElementSoup have a lower case name? (Maybe renaming it
> would make deprecating this parse behavior easier too)

I think we can fix both problems. What about adding a module "lxml.html.soupparser" that has the expected interface, and make ElementSoup.py a pure legacy wrapper around that? We could then also add a fromstring() method that would accept plain strings, so that users wouldn't have to do the StringIO() dance themselves.

Stefan

From stefan_ml at behnel.de Sun Mar 16 13:59:54 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 16 Mar 2008 13:59:54 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47CC9090.9070305@necoro.eu>
References: <47CC9090.9070305@necoro.eu>
Message-ID: <47DD19CA.2070202@behnel.de>

Hi,

René 'Necoro' Neumann wrote:
> I'm developing a PyGTK-application that uses lxml to validate plugin-XMLs.
> After upgrading to lxml-2*, I noticed, that my application is not shut
> down correctly (i.e. I close the application, but it still runs in the
> background).

I looked into this again and came up with a simple change to your example that works. Instead of

    parse(StringIO(...))

use

    parse(StringIO(...), base_url="http://the/url")

I have no idea why that works, there isn't any apparent reason why that makes a difference - the different code path is not related to any threading at all. So I suppose this only works around a symptom (threading issues can really be that bizarre).
Also, as you noted, calling fromstring() instead of parse() works for whatever reason, although parse(StringIO(...)) does essentially the same thing internally.

I attached a patch that also works for me and (again) seems to work around the symptom you see.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parse-gtk-problem.patch
Type: text/x-patch
Size: 1698 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080316/e587e25c/attachment.bin

From stefan_ml at behnel.de Mon Mar 17 17:20:42 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 17 Mar 2008 17:20:42 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47DE9082.7010303@necoro.eu>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu>
Message-ID: <47DE9A5A.7020403@behnel.de>

Hi,

René 'Necoro' Neumann wrote:
>> I attached a patch that also works for me and (again) seems to work
>> around the symptom you see.
>
> This patch only works for me when passing a filename or file to parse().
> When passing a StringIO instance it still hangs.

True, didn't realise that in my tests.

Here is another patch that seems to work better for me (still no idea why).

> What about the way simpler patch:
> Add the following lines somewhere, where they fit:
>
> if base_url is None:
>     base_url = ""
>
> Because passing an empty base_url works too. Or does this have
> side-effects?

Yes. The URL is used to identify the document in a couple of places, and the base_url parameter is just a way to override the URL in cases where lxml can't determine it itself.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gtk-threading-filename-guessing.patch Type: text/x-patch Size: 1054 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080317/4bf371f5/attachment.bin From rogerpatterson at gmail.com Tue Mar 18 20:03:44 2008 From: rogerpatterson at gmail.com (Roger Patterson) Date: Tue, 18 Mar 2008 12:03:44 -0700 Subject: [lxml-dev] html entities and lxml.html.ElementSoup Message-ID: <47E01210.1030703@gmail.com> Hi there, I'm getting an interesting situation. When using the very cool ElementSoup add-on to lxml.html with certain source-html files that already encode entities (eg. £), using the ElementSoup.parse() messes up the entities. I looked through the code, and see that you are using the unescape() function from ElementTree's ElementSoup. Unfortunately, what I think is happening, is that unescape() should only be called if the html was initially parsed by BeautifulSoup with convertEntities="html" (as in ElementTree's ElementSoup), otherwise, you can sometimes get html pages with entities that are unescaped getting unescaped again. What I'm currently doing to solve this is first parsing it with BeautifulSoup(html, convertEntities="html"), then calling ElementSoup.convert_tree(soup). This work-around works fine, but I thought I'd bring it to your attention. cheers -Roger From lists at necoro.eu Tue Mar 18 20:27:37 2008 From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=) Date: Tue, 18 Mar 2008 20:27:37 +0100 Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop In-Reply-To: <47DE9A5A.7020403@behnel.de> References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> Message-ID: <47E017A9.5070908@necoro.eu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hey, Stefan Behnel schrieb: | Hi, | | Ren? 'Necoro' Neumann wrote: |>> I attached a patch that also works for me and (again) seems to work |>> around the symptom you see. 
|> This patch only works for me when passing a filename or file to parse(). |> When passing a StringIO instance it still hangs. | | True, didn't realise that in my tests. | | Here is another patch that seems to work better for me (still no idea why). I did not test the patch - but I guess it does not change anything, does it? It just replaces "try ... except" with "if ... is not None" constructs :) Or is the "return None" important. Because StringIO instances have none of the tested attributes. Regards, Necoro -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH4Bep4UOg/zhYFuARArqiAJ4goMG7dHMRUWxwzHDrqASSvv54pgCdGTa7 FF6VwDzFpdPpLjpJ3g9aBDQ= =UO7H -----END PGP SIGNATURE----- From stefan_ml at behnel.de Tue Mar 18 22:25:53 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 18 Mar 2008 22:25:53 +0100 Subject: [lxml-dev] html entities and lxml.html.ElementSoup In-Reply-To: <47E01210.1030703@gmail.com> References: <47E01210.1030703@gmail.com> Message-ID: <47E03361.3070006@behnel.de> Hi, Roger Patterson wrote: > I'm getting an interesting situation. When using the very cool > ElementSoup add-on to lxml.html with certain source-html files that > already encode entities (eg. £), using the ElementSoup.parse() > messes up the entities. It looks like it's not the parse(), but rather the serialisation. What happens is that the entity references end up in the /text/ content, which is clearly wrong as it leads to re-escaping of the references on the way out. > What I'm currently doing to solve this is first parsing it with > BeautifulSoup(html, convertEntities="html"), then calling > ElementSoup.convert_tree(soup). This work-around works fine, but I > thought I'd bring it to your attention. ElementSoup should do that for you. I fixed it on the trunk. 
Stefan From jlovell at esd189.org Wed Mar 19 18:38:08 2008 From: jlovell at esd189.org (John Lovell) Date: Wed, 19 Mar 2008 10:38:08 -0700 Subject: [lxml-dev] An lxml tree inside a lxml tree. Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> I am appending one lxml tree's root element into another lxml tree. Everything seems to work fine until I serialize (a.k.a. tostring) the containing tree. At which point everything is serialized, however the appended subtree ignores the given pretty_print setting. Is there a way to have the included tree be just more elements in the bigger tree? Or, do I misunderstand the problem. Note that I haven't updated to lxml 2 yet. python: 2.5.1 lxml.etree: (1, 3, 3, 0) libxml used: (2, 6, 30) libxml compiled: (2, 6, 29) libxslt used: (1, 1, 21) libxslt compiled: (1, 1, 21) Thanks, John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 www.esd189.org Together We Can ... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080319/2815ac06/attachment.htm From jlovell at esd189.org Thu Mar 20 19:27:16 2008 From: jlovell at esd189.org (John Lovell) Date: Thu, 20 Mar 2008 11:27:16 -0700 Subject: [lxml-dev] An lxml tree inside a lxml tree. In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> Okay, I came up with some sample code to illustrate my problem and the behavior shows up without included one tree inside another. Code: from lxml import etree from StringIO import StringIO xml = """ """ # It doesn't matter which one of these I use. 
root = etree.fromstring(xml)
#root = etree.parse(StringIO(xml)).getroot()

print root.tag

print ""

print etree.tostring(root, pretty_print=False)


Output:

root


Thanks,

John

________________________________

From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of John Lovell
Sent: Wednesday, March 19, 2008 10:38 AM
To: lxml-dev at codespeak.net
Subject: [lxml-dev] An lxml tree inside a lxml tree.

I am appending one lxml tree's root element into another lxml tree. Everything seems to work fine until I serialize (a.k.a. tostring) the containing tree. At which point everything is serialized, however the appended subtree ignores the given pretty_print setting. Is there a way to have the included tree be just more elements in the bigger tree? Or, do I misunderstand the problem.

Note that I haven't updated to lxml 2 yet.

python: 2.5.1
lxml.etree: (1, 3, 3, 0)
libxml used: (2, 6, 30)
libxml compiled: (2, 6, 29)
libxslt used: (1, 1, 21)
libxslt compiled: (1, 1, 21)

Thanks,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
www.esd189.org
Together We Can ...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080320/60d04e5b/attachment-0001.htm

From stefan_ml at behnel.de Wed Mar 19 08:00:12 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 19 Mar 2008 08:00:12 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47E017A9.5070908@necoro.eu>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu>
Message-ID: <47E0B9FC.6000606@behnel.de>

Hi,

René 'Necoro' Neumann wrote:
> Stefan Behnel wrote:
> | René 'Necoro' Neumann wrote:
> |>> I attached a patch that also works for me and (again) seems to work
> |>> around the symptom you see.
> |> This patch only works for me when passing a filename or file to parse().
> |> When passing a StringIO instance it still hangs.
> |
> | True, didn't realise that in my tests.
> |
> | Here is another patch that seems to work better for me (still no idea
> why).
>
> I did not test the patch - but I guess it does not change anything, does
> it? It just replaces "try ... except" with "if ... is not None"
> constructs :)

It works better, as I said. At least for me. The function I changed is the main difference between what works and what doesn't.

> Or is the "return None" important. Because StringIO instances have none
> of the tested attributes.

The "return None" was there before.

Stefan

From stefan_ml at behnel.de Sun Mar 23 17:57:27 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 23 Mar 2008 17:57:27 +0100
Subject: [lxml-dev] An lxml tree inside a lxml tree.
In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org>
References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org>
Message-ID: <47E68BF7.3020807@behnel.de>

Hi,

John Lovell wrote:
> Okay, I came up with some sample code to illustrate my problem and the
> behavior shows up without included one tree inside another.
>
>
> Code:
>
> from lxml import etree
> from StringIO import StringIO
>
> xml = """
>
>
>
>
> """
>
> # It doesn't matter which one of these I use.
> root = etree.fromstring(xml)
> #root = etree.parse(StringIO(xml)).getroot()
>
> print root.tag
>
> print ""
>
> print etree.tostring(root, pretty_print=False)
>
>
> Output:
>
> root
>
>
>
>
>
>

lxml will not alter your document unless you tell it to do so. This includes whitespace - which you may or may not consider ignorable. The "pretty_print=False" option tells lxml to not modify the document on serialisation, so it keeps the whitespace the way it is.
If you want the whitespace removed, tell the parser to do it for you using the "remove_blank_text" option. Stefan From jlovell at esd189.org Mon Mar 24 17:10:20 2008 From: jlovell at esd189.org (John Lovell) Date: Mon, 24 Mar 2008 09:10:20 -0700 Subject: [lxml-dev] An lxml tree inside a lxml tree. In-Reply-To: <47E68BF7.3020807@behnel.de> References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> <47E68BF7.3020807@behnel.de> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A224F@ZIRIA.esd189.org> Stefan: Of course coming from you this works great. I am sorry I missed this in the tutorial. John -----Original Message----- From: Stefan Behnel [mailto:stefan_ml at behnel.de] Sent: Sunday, March 23, 2008 9:57 AM To: John Lovell Cc: lxml-dev at codespeak.net Subject: Re: [lxml-dev] An lxml tree inside a lxml tree. Hi, John Lovell wrote: > Okay, I came up with some sample code to illustrate my problem and the > behavior shows up without included one tree inside another. > > > Code: > > from lxml import etree > from StringIO import StringIO > > xml = """ > > > > > """ > > # It doesn't matter which one of these I use. > root = etree.fromstring(xml) > #root = etree.parse(StringIO(xml)).getroot() > > print root.tag > > print "" > > print etree.tostring(root, pretty_print=False) > > > Output: > > root > > > > > > lxml will not alter your document unless you tell it to do so. This includes whitespace - which you may or may not consider ignorable. The "pretty_print=False" option tells lxml to not modify the document on serialisation, so it doesn't keeps the whitespace the way it is. If you want the whitespace removed, tell the parser to do it for you using the "remove_blank_text" option. 
Stefan From sidnei at enfoldsystems.com Tue Mar 25 01:55:47 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 24 Mar 2008 21:55:47 -0300 Subject: [lxml-dev] Quickly dupe a parsed XSL tree Message-ID: I have one application using lxml which runs as a Apache output filter. This app fetches a XSL file which in turn includes/imports several other XSL files over the network. Doing so is reasonably expensive, but needs to happen once and only once, since I keep a cached copy of the parsed XSL file in memory per thread. The problem is though that there can be a *lot* of threads in a typical Apache setup, and until the application has 'warmed up', each client that first hits a thread pays the price of fetching and parsing this XSL. I am looking for a way to, as soon as at least one thread has done the work and has the XSL object in memory, have the other threads just clone/dupe the existing in-memory XSL object instead of doing all this useless computation all over again. Does such a thing exist already? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From stefan_ml at behnel.de Tue Mar 25 10:12:42 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 25 Mar 2008 10:12:42 +0100 (CET) Subject: [lxml-dev] Quickly dupe a parsed XSL tree In-Reply-To: References: Message-ID: <60864.194.114.62.66.1206436362.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Sidnei da Silva wrote: > I am looking for a way to, as soon as at least one thread has done the > work and has the XSL object in memory, have the other threads just > clone/dupe the existing in-memory XSL object instead of doing all this > useless computation all over again. If you are using lxml 2.0, you can "copy.deepcopy()" XSLT instances. 
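[Editorial sketch, not part of the original message.] The deepcopy suggestion could be wired into a per-thread cache roughly as follows. The stylesheet here is a hypothetical stand-in for the expensive network-fetched one, and the names get_transform, _master and _local are invented for illustration. Whether copying actually avoids re-resolving imported stylesheets is not stated in the thread, so this would need measuring.

```python
import copy
import threading

from lxml import etree

# Hypothetical stand-in for the expensive, network-fetched stylesheet.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/in"/></out>
  </xsl:template>
</xsl:stylesheet>"""

_master = None                   # first fully built transformer
_master_lock = threading.Lock()  # guards building and copying the master
_local = threading.local()       # each thread keeps its own copy here


def get_transform():
    """Return this thread's XSLT object, building the master only once."""
    global _master
    transform = getattr(_local, "transform", None)
    if transform is None:
        with _master_lock:
            if _master is None:
                _master = etree.XSLT(etree.XML(STYLESHEET))
            # Copy rather than share, as suggested in the message above.
            transform = copy.deepcopy(_master)
        _local.transform = transform
    return transform
```

The lock is a cautious choice made for this sketch: it serialises the copy so that no two threads touch the master at the same time.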
Stefan From stefan_ml at behnel.de Tue Mar 25 22:04:02 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 25 Mar 2008 22:04:02 +0100 Subject: [lxml-dev] [XML-SIG] lxml iterparse and comments In-Reply-To: <47E87D43.1090802@frii.com> References: <47E87D43.1090802@frii.com> Message-ID: <47E968C2.6030905@behnel.de> Hi, Stuart McGraw wrote: >> Stuart McGraw wrote: >> > I am probably mising something elementary (I am new >> > to both xml and lxml), but I am having problems figuring >> > out how to get comments when using lxml's iterparse(). >> > When I parse xml with parse() and iterate though the >> > result, I get the comments. But when I try to do the >> > same thing (approximately I think) with iterparse, >> > I don't see any comments. >> >> While the comments end up in the tree that iterparse generates, they >> do not show up in the events. Now that you mention it, I >> actually think that should change. There should be events >> "comment" and "pi" that yield them if requested. > > That would be ideal, from my perspective. It also seems > more consistent with the other interfaces (parse, parse target, > etc) Implemented on the trunk, will be in lxml 2.1. >> Have you tried the parser target interface? > I am having trouble getting it to work. Specifically, the test > code below produces the output I expected when run with > cElementTree, but with lxml, it is missing "end" callbacks, > the second "start(entry) " callback, and the resolved entity > text. Am I doing something wrong? > > Test code: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > #import xml.etree.cElementTree as ET > import lxml.etree as ET > from cStringIO import StringIO > > # XML data... 
> #============================================= > xmltxt = \ > ''' > > > > > > ]> > > > > text 1 is &ex; > > text 2 > ''' > #============================================= > > print '\nTargetParser:\n-------------' > > try: XMLParser = ET.XMLParser > except AttributeError: XMLParser = ET.XMLTreeBuilder > > class EchoTarget: > def comment(self, tag): > print "comment", tag > def start(self, tag, attrib): > print "start", tag, attrib > def end(self, tag): > print "end", tag > def data(self, data): > print "data", repr(data) > def close(self): > print "close" > return "closed!" > > parser = XMLParser( target = EchoTarget()) > result = ET.parse( StringIO (xmltxt), parser) I can reproduce that. Seems to require an entity reference in the data, though. I'll look into it. Stefan From stefan_ml at behnel.de Wed Mar 26 00:23:09 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 26 Mar 2008 00:23:09 +0100 Subject: [lxml-dev] [XML-SIG] lxml iterparse and comments In-Reply-To: <47E968C2.6030905@behnel.de> References: <47E87D43.1090802@frii.com> <47E968C2.6030905@behnel.de> Message-ID: <47E9895D.4040407@behnel.de> Hi, Stuart McGraw wrote: >> I am having trouble getting it to work. Specifically, the test >> code below produces the output I expected when run with >> cElementTree, but with lxml, it is missing "end" callbacks, >> the second "start(entry) " callback, and the resolved entity >> text. Fixed for 2.0.3. Stefan From albert.brandl at tttech.com Wed Mar 26 11:15:55 2008 From: albert.brandl at tttech.com (Albert Brandl) Date: Wed, 26 Mar 2008 11:15:55 +0100 Subject: [lxml-dev] An lxml tree inside a lxml tree. In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> Message-ID: <20080326101555.GE32704@tttech.com> With lxml 1.3.6, pretty-printing still is broken. 
If I append a subtree to an element, >>> elem1 = fromstring(""" ... ... ... ... """) >>> elem2 = Element("e") >>> elem2.append(elem1) pretty-printing does not what I'd expect: >>> print tostring(elem2, pretty_print = True) The indentation of the element "a" is correct, but it looks like pretty- printing is not applied to anything below (including the closing tag ). I don't know if this problem also exists with lxml 2.x - we still use the 1.3 branch. Best regards, Albert From stefan_ml at behnel.de Wed Mar 26 12:07:09 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 26 Mar 2008 12:07:09 +0100 (CET) Subject: [lxml-dev] An lxml tree inside a lxml tree. In-Reply-To: <20080326101555.GE32704@tttech.com> References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> <20080326101555.GE32704@tttech.com> Message-ID: <8927.194.114.62.37.1206529629.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, Albert Brandl wrote: > With lxml 1.3.6, pretty-printing still is broken. If I append a subtree to > an element, > >>>> elem1 = fromstring(""" > ... > ... > ... > ... """) >>>> elem2 = Element("e") >>>> elem2.append(elem1) > > pretty-printing does not what I'd expect: > >>>> print tostring(elem2, pretty_print = True) > > > > > > I added an answer here: https://answers.launchpad.net/lxml/+question/28032 The so-called "pretty printing" of XML essentially means adding white-space at places where it looks natural and where it is unlikely to scramble the content. Mind the word "unlikely". The notion of "ignorable whitespace" in XML is underdefined and a pure parser thing. You can help the serialiser in figuring out what whitespace is "ignorable" by either a) letting the parser remove ignorable whitespace for you by giving it a DTD and the "remove_blank_text" option, or b) by removing it yourself, e.g. by deleting empty tail text and empty text before elements. 
After all, you know best what is ignorable and what isn't. Example:

    def remove_ignorable_whitespace(root):
        for el in root.iter():
            if len(el) and el.text and not el.text.strip():
                el.text = None
            if el.tail and not el.tail.strip():
                el.tail = None

Stefan

From albert.brandl at tttech.com Wed Mar 26 13:34:07 2008
From: albert.brandl at tttech.com (Albert Brandl)
Date: Wed, 26 Mar 2008 13:34:07 +0100
Subject: [lxml-dev] An lxml tree inside a lxml tree.
In-Reply-To: <8927.194.114.62.37.1206529629.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org> <3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org> <20080326101555.GE32704@tttech.com> <8927.194.114.62.37.1206529629.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <20080326123407.GH32704@tttech.com>

On Wed, Mar 26, 2008 at 12:07:09PM +0100, Stefan Behnel wrote:
> I added an answer here:
>
> https://answers.launchpad.net/lxml/+question/28032

I finally got it ;-). It was not clear to me that lxml only _adds_
whitespace when appropriate, neither what "appropriate" actually means.

If I understand your explanation correctly, the pretty-printing
algorithm leaves alone any elements that already contain whitespace -
this corresponds to the behavior in my example, and your method of
removing leading and trailing whitespace will help in most cases.

Thanks,

Albert

From jlovell at esd189.org Wed Mar 26 16:12:47 2008
From: jlovell at esd189.org (John Lovell)
Date: Wed, 26 Mar 2008 08:12:47 -0700
Subject: [lxml-dev] An lxml tree inside a lxml tree.
In-Reply-To: <20080326123407.GH32704@tttech.com>
References: <3A49C88789256B4AB33AC603DB6AF49B011A223D@ZIRIA.esd189.org><3A49C88789256B4AB33AC603DB6AF49B011A2243@ZIRIA.esd189.org><20080326101555.GE32704@tttech.com><8927.194.114.62.37.1206529629.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20080326123407.GH32704@tttech.com>
Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2251@ZIRIA.esd189.org>

Based on Stefan's previous answer to me or the tutorial (blush).

Code:

from lxml import etree
from StringIO import StringIO

top = etree.Element("root")
xml = """ """
parser = etree.XMLParser(remove_blank_text=True)
# It doesn't matter which one of these I use.
root = etree.fromstring(xml, parser)
#root = etree.parse(StringIO(xml), parser).getroot()
print root.tag
top.append(root)
print ""
print etree.tostring(top, pretty_print=False)
print ""
print etree.tostring(top, pretty_print=True)

Output:

subroot

Hope this helps,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
www.esd189.org
Together We Can ...

-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Albert Brandl
Sent: Wednesday, March 26, 2008 5:34 AM
To: Stefan Behnel
Cc: lxml-dev at codespeak.net
Subject: Re: [lxml-dev] An lxml tree inside a lxml tree.

On Wed, Mar 26, 2008 at 12:07:09PM +0100, Stefan Behnel wrote:
> I added an answer here:
>
> https://answers.launchpad.net/lxml/+question/28032

I finally got it ;-). It was not clear to me that lxml only _adds_
whitespace when appropriate, neither what "appropriate" actually means.

If I understand your explanation correctly, the pretty-printing
algorithm leaves alone any elements that already contain whitespace -
this corresponds to the behavior in my example, and your method of
removing leading and trailing whitespace will help in most cases.
Thanks,

Albert
_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

From Dennis.Benzinger at gmx.net Wed Mar 26 21:02:56 2008
From: Dennis.Benzinger at gmx.net (Dennis Benzinger)
Date: Wed, 26 Mar 2008 21:02:56 +0100
Subject: [lxml-dev] XPath2 and XQuery support via XQilla?
Message-ID: <47EAABF0.3060302@gmx.net>

Hello!

Would it be possible to add XPath2 and XQuery support via XQilla
to lxml?

The problem I see is that XQilla is built on top of Xerces-C and lxml
uses libxml2.

Dennis Benzinger

From stefan_ml at behnel.de Wed Mar 26 22:19:42 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 26 Mar 2008 22:19:42 +0100
Subject: [lxml-dev] lxml 2.0.3 released
Message-ID: <47EABDEE.1060505@behnel.de>

Hi all,

lxml 2.0.3 is on PyPI. This is a bugfix release for the stable 2.0
series. Changelog follows below.

Have fun,
Stefan

2.0.3 (2008-03-26)

Features added

  * soupparser.parse() allows passing keyword arguments on to
    BeautifulSoup.

  * fromstring() method in lxml.html.soupparser.

Bugs fixed

  * lxml.html.diff didn't treat empty tags properly (e.g.,
    ).

  * Handle entity replacements correctly in target parser.

  * Crash when using iterparse() with XML Schema validation.

  * The BeautifulSoup parser (soupparser.py) did not replace entities,
    which made them turn up in text content.

  * Attribute assignment of custom PyTypes in objectify could fail to
    correctly serialise the value to a string.

Other changes

  * lxml.html.ElementSoup was replaced by a new module
    lxml.html.soupparser with a more consistent API. The old module
    remains for compatibility with ElementTree's own ElementSoup module.

  * Setting the XSLT_CONFIG and XML2_CONFIG environment variables at
    build time will let setup.py pick up the xml2-config and xslt-config
    scripts from the supplied path name.

  * Passing --with-xml2-config=/path/to/xml2-config to setup.py will
    override the xml2-config script that is used to determine the C
    compiler options. The same applies for the --with-xslt-config
    option.

From stefan_ml at behnel.de Thu Mar 27 09:41:42 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 27 Mar 2008 09:41:42 +0100
Subject: [lxml-dev] XPath2 and XQuery support via XQilla?
In-Reply-To: <47EAABF0.3060302@gmx.net>
References: <47EAABF0.3060302@gmx.net>
Message-ID: <47EB5DC6.7040802@behnel.de>

Hi,

Dennis Benzinger wrote:
> Would it be possible to add XPath2 and XQuery support via XQilla
> to lxml?

Possibly. It's a somewhat unofficial design goal of lxml to interface
with great tools. Would need some deeper insights, though. At first
glance, XQilla doesn't seem to have complete XQuery support. Also,
XPath2/XQuery is a lot about XML schema, so it's not just like changing
the XML tree implementation would do the job already.

> The problem I see is that XQilla is built on top of Xerces-C and lxml
> uses libxml2.

That is the obvious problem. :)

Sounds like a bigger project, so don't expect any code unless a) you do
it yourself or b) someone really wants to pay for it.
Stefan

From stefan_ml at behnel.de Thu Mar 27 17:55:33 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 27 Mar 2008 17:55:33 +0100
Subject: [lxml-dev] lxml 2.1alpha1 released
Message-ID: <47EBD185.1080902@behnel.de>

Hi all,

the first alpha release of lxml 2.1 is on PyPI. This is a "cleanup and
new features" release that follows the current stable 2.0 series. The
major new features are XSLT extension elements and support for
comments/PIs in iterparse().

A lot of formerly deprecated API functions were finally removed, so you
may have to adapt your code (a little). After the change, it will
continue to run with both lxml 2.0.x and 2.1.

Have fun,
Stefan

2.1alpha1 (2008-03-27)

Features added

  * New event types 'comment' and 'pi' in iterparse().

  * XSLTAccessControl instances have a property options that returns a
    dict of access configuration options.

  * Constant instances DENY_ALL and DENY_WRITE on XSLTAccessControl
    class.

  * Extension elements for XSLT (experimental!)

  * Element.base property returns the xml:base or HTML base URL of an
    Element.

  * docinfo.URL property is writable.

Bugs fixed

  * Default encoding for plain text serialisation was different from
    that of XML serialisation (UTF-8 instead of ASCII).

Other changes

  * Minor API speed-ups.

  * The benchmark suite now uses tail text in the trees, which makes the
    absolute numbers incomparable to previous results.

  * Generating the HTML documentation now requires Pygments, which is
    used to enable syntax highlighting for the doctest examples.
Most long-time deprecated functions and methods were removed:

  * etree.clearErrorLog(), use etree.clear_error_log()

  * etree.useGlobalPythonLog(), use etree.use_global_python_log()

  * etree.ElementClassLookup.setFallback(), use
    etree.ElementClassLookup.set_fallback()

  * etree.getDefaultParser(), use etree.get_default_parser()

  * etree.setDefaultParser(), use etree.set_default_parser()

  * etree.setElementClassLookup(), use etree.set_element_class_lookup()

    Note that parser.setElementClassLookup() has not been removed yet,
    although parser.set_element_class_lookup() should be used instead.

  * xpath_evaluator.registerNamespace(), use
    xpath_evaluator.register_namespace()

  * xpath_evaluator.registerNamespaces(), use
    xpath_evaluator.register_namespaces()

  * objectify.setPytypeAttributeTag, use
    objectify.set_pytype_attribute_tag

  * objectify.setDefaultParser(), use objectify.set_default_parser()

From lists at necoro.eu Thu Mar 27 18:40:12 2008
From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=)
Date: Thu, 27 Mar 2008 18:40:12 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47E0B9FC.6000606@behnel.de>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu> <47E0B9FC.6000606@behnel.de>
Message-ID: <47EBDBFC.9030105@necoro.eu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey,

I'm seeing announcements for new versions on this list :) - Any status
update? - Is this bug fixed in one of the announced releases?
- - Necoro
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH69v84UOg/zhYFuARAh3fAJwJsWw7JKjybPZMRKefPKPopn9gKwCgg6rX
JgvArRG+kJUuHwJRMqDXVmI=
=YB+H
-----END PGP SIGNATURE-----

From stefan_ml at behnel.de Thu Mar 27 18:53:41 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 27 Mar 2008 18:53:41 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47EBDBFC.9030105@necoro.eu>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu> <47E0B9FC.6000606@behnel.de> <47EBDBFC.9030105@necoro.eu>
Message-ID: <47EBDF25.3090703@behnel.de>

Hi,

René 'Necoro' Neumann wrote:
> I'm seeing announcements for new versions on this list :) - Any status
> update? - Is this bug fixed in one of the announced releases?

Your test script works for me, so please give it a try and report back. :)

Stefan

From lists at necoro.eu Thu Mar 27 20:47:24 2008
From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=)
Date: Thu, 27 Mar 2008 20:47:24 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47EBDF25.3090703@behnel.de>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu> <47E0B9FC.6000606@behnel.de> <47EBDBFC.9030105@necoro.eu> <47EBDF25.3090703@behnel.de>
Message-ID: <47EBF9CC.500@necoro.eu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Stefan Behnel wrote:
| Hi,
|
| René 'Necoro' Neumann wrote:
|> I'm seeing announcements for new versions on this list :) - Any status
|> update? - Is this bug fixed in one of the announced releases?
|
| Your test script works for me, so please give it a try and report back. :)
|
| Stefan
|
|

I know there was something to do ^^.
Ok - I tested it, and it works for me with files and StringIO ... even
if I don't understand why... (perhaps it is something in cython)

Thanks :)

René
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6/nM4UOg/zhYFuARAu1GAJ9jhls3YWcH7i2t17hvba0XL7OHpgCdFeOM
rYjGTr8V+6K4JQ+qdGnL1aE=
=OcCa
-----END PGP SIGNATURE-----

From lists at necoro.eu Thu Mar 27 21:09:02 2008
From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=)
Date: Thu, 27 Mar 2008 21:09:02 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47EBF9CC.500@necoro.eu>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu> <47E0B9FC.6000606@behnel.de> <47EBDBFC.9030105@necoro.eu> <47EBDF25.3090703@behnel.de> <47EBF9CC.500@necoro.eu>
Message-ID: <47EBFEDE.40307@necoro.eu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

René 'Necoro' Neumann wrote:
> Stefan Behnel wrote:
> | Hi,
> |
> | René 'Necoro' Neumann wrote:
> |> I'm seeing announcements for new versions on this list :) - Any status
> |> update? - Is this bug fixed in one of the announced releases?
> |
> | Your test script works for me, so please give it a try and report back. :)
> |
> | Stefan
> |
> |
>
> I know there was something to do ^^. Ok - I tested it, and it works for
> me with files and StringIO ... even if I don't understand why...
> (perhaps it is something in cython)
>
> Thanks :)
>
> René

Hmm ... an enigmail flaw. Sorry...

So in case you were wondering ... the email really was sent by me, even
if enigmail is telling something different ^^ (or is it just _my_
enigmail)?

René
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6/7e4UOg/zhYFuARAlF/AJ4o6XY/Nnf6j16KEHuYIWnA2EDrlACfc/ee
0EcB82O48+s1t6STw5AifRc=
=bGtM
-----END PGP SIGNATURE-----

From lists at necoro.eu Thu Mar 27 21:27:37 2008
From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=)
Date: Thu, 27 Mar 2008 21:27:37 +0100
Subject: [lxml-dev] [BUG] lxml-2* hangs on interpreter shutdown with gtk-mainloop
In-Reply-To: <47EBDF25.3090703@behnel.de>
References: <47CC9090.9070305@necoro.eu> <47DD19CA.2070202@behnel.de> <47DE9082.7010303@necoro.eu> <47DE9A5A.7020403@behnel.de> <47E017A9.5070908@necoro.eu> <47E0B9FC.6000606@behnel.de> <47EBDBFC.9030105@necoro.eu> <47EBDF25.3090703@behnel.de>
Message-ID: <47EC0339.7030700@necoro.eu>

Stefan Behnel wrote:
> Hi,
>
> René 'Necoro' Neumann wrote:
>> I'm seeing announcements for new versions on this list :) - Any status
>> update? - Is this bug fixed in one of the announced releases?
>
> Your test script works for me, so please give it a try and report back. :)
>
> Stefan
>
>

Ok ... just again (unsigned):

Yes it works here too. But don't know why ;) (perhaps a cython issue?)

Regarding the two other mails: Sorry for the spam. But your mailinglist
software seems to have an error... I sent the last copy to myself too
(BCC) - and there it was received ok and with correct signing. But the
mailinglist seems to have altered the text... I only received it in
Base64 coding (or something alike):

From the header:
Content-Transfer-Encoding: base64

as opposed to the normal:
Content-Transfer-Encoding: 8bit

From jkrukoff at ltgc.com Fri Mar 28 21:30:51 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Fri, 28 Mar 2008 14:30:51 -0600
Subject: [lxml-dev] ElementTree.find does not accept QName objects.
Message-ID: <1206736251.5734.29.camel@localhost.localdomain>

Since I was the one that complained about the find method on Elements
not accepting QNames, it's probably not surprising that I expected them
to work with the ElementTree find method as well. Instead an unsliceable
error is thrown, due to the value being expected to be a string:

>>> from lxml import etree
>>> x = etree.XML( '' )
>>> e = etree.ElementTree( x )
>>> e.find( 'b' )
>>> e.find( etree.QName( 'b' ) )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "etree.pyx", line 508, in etree._ElementTree.find
TypeError: 'etree.QName' object is unsliceable

Though using e.getroot( ).find is working fine as a workaround for me.

--
John Krukoff
Land Title Guarantee Company

From stefan_ml at behnel.de Sat Mar 29 11:42:08 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 29 Mar 2008 11:42:08 +0100
Subject: [lxml-dev] ElementTree.find does not accept QName objects.
In-Reply-To: <1206736251.5734.29.camel@localhost.localdomain>
References: <1206736251.5734.29.camel@localhost.localdomain>
Message-ID: <47EE1D00.3000109@behnel.de>

Hi,

John Krukoff wrote:
> Since I was the one that complained about the find method on Elements
> not accepting QNames, it's probably not surprising that I expected them
> to work with the ElementTree find method as well. Instead an unsliceable
> error is thrown, due to the value being expected to be a string

Sure, here's the obvious patch. BTW, I expect ET to have the same
problem here.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: et-find-qname.patch
Type: text/x-patch
Size: 1163 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080329/fec491b2/attachment.bin
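
[Editorial sketch, not part of the original thread and not the contents
of the scrubbed patch.] For code that has to run against a release
without the fix, one caller-side option is to hand find() the string
form of the QName. The helper name find_by_qname, the sample document
and the namespace URI below are assumptions made for illustration; the
sketch relies only on QName objects exposing their "{uri}tag" string as
.text, and falls back to the standard library's ElementTree if lxml is
not installed.

```python
try:
    from lxml import etree                  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree   # stand-in with the same basic API


def find_by_qname(node, name):
    """Call node.find() with a plain string, whether name is a str or a QName.

    Hypothetical helper: a QName carries its string form in .text,
    while a plain string has no such attribute and is passed through.
    """
    tag = getattr(name, "text", name)
    return node.find(tag)


# Assumed sample document with one plain and one namespaced child.
root = etree.XML('<a xmlns:n="http://example.org/ns"><b/><n:c/></a>')
tree = etree.ElementTree(root)

plain = find_by_qname(tree, etree.QName("b"))
spaced = find_by_qname(tree, etree.QName("http://example.org/ns", "c"))

print(plain.tag)
print(spaced.tag)
```

The same helper accepts either an ElementTree or an Element as its first
argument, so it covers both the e.find() call from the report and the
e.getroot().find() workaround.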