From stefan_ml at behnel.de Sat Sep 1 10:21:29 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 01 Sep 2007 10:21:29 +0200
Subject: [lxml-dev] objectify factories
In-Reply-To: <20070830142730.280890@gmx.net>
References: <46C945BF.9020805@behnel.de> <46CC80A4.7010607@behnel.de> <20070830114448.280910@gmx.net> <46D6BC03.4050103@behnel.de> <20070830142730.280890@gmx.net>
Message-ID: <46D92109.8070105@behnel.de>

jholg at gmx.de wrote:
>> Given the current behaviour of _setElementValue(), I'd say it should just
>> go and annotate everything it produces.
>
> Meaning an additional TypedElementMaker, right? I think it is actually nice
> to have the not-annotating ElementMaker as a choice.

BTW, that's easy to achieve, I just added a simple "annotate=True" keyword
argument to objectify.ElementMaker (not committed yet). If you create a new E
factory and pass False, it will just skip the annotation step.

Stefan

From stefan_ml at behnel.de Sun Sep 2 18:25:58 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 02 Sep 2007 18:25:58 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
Message-ID: <46DAE416.1080807@behnel.de>

Hi all,

I'm proudly announcing the first alpha release of lxml 2.0.

It features a major cleanup both behind the scenes and at the surface that
improves the XML tool integration and makes the API clearer and more
consistent in many places. The major new addition, however, is the lxml.html
package, a new toolkit for HTML handling.

The web site for the pre-2.0 series is online at

http://codespeak.net/lxml/dev/

The "what's new" page has a description of the major changes:

http://codespeak.net/lxml/dev/lxml2.html

and the ChangeLog has a more detailed list, see below.

This being an alpha release means that not everything is stable, both in
terms of crashes and the API. There will be a small number of alpha releases
to make the advancements publicly available, before the beta releases focus
on improving the stability.
I warmly invite everyone to contribute to the final release by discussing the
API changes and the new features on the mailing list. There is always space
for improvements!

There is currently a known problem with Microsoft's compilers, so Windows
builds may not become available for 2.0alpha1. The next alpha will hopefully
come with prebuilt binaries for that platform. Building with the more
standards compliant MinGW compilers should work.

Note that working on the code now requires Cython (version 0.9.6.5), an
enhanced fork of Pyrex. lxml therefore no longer ships with a copy of Pyrex
or Cython, but as usual, building from the distribution sources does not
require Cython. It can be installed with "easy_install Cython" or from here:

http://www.cython.org/

I hope that lxml 2.0 will become a straight continuation of the success story
that lxml 1.x was already.

Have fun,
Stefan


2.0alpha1 (2007-09-02)

Features added

* Reimplemented objectify.E for better performance and improved integration
  with objectify. Provides extended type support based on registered PyTypes.

* XSLT objects now support deep copying

* New makeSubElement() C-API function that allows creating a new subelement
  straight with text, tail and attributes.

* XPath extension functions can now access the current context node
  (context.context_node) and use a context dictionary (context.eval_context)
  from the context provided in their first parameter

* HTML tag soup parser based on BeautifulSoup in lxml.html.ElementSoup

* New module lxml.doctestcompare by Ian Bicking for writing simplified
  doctests based on XML/HTML output. Use by importing lxml.usedoctest or
  lxml.html.usedoctest from within a doctest.

* New module lxml.cssselect by Ian Bicking for selecting Elements with CSS
  selectors.

* New package lxml.html written by Ian Bicking for advanced HTML treatment.

* Namespace class setup is now local to the ElementNamespaceClassLookup
  instance and no longer global.
* Schematron validation (incomplete in libxml2)

* Additional stringify argument to objectify.PyType() takes a conversion
  function to strings to support setting text values from arbitrary types.

* Entity support through an Entity factory and element classes. XML parsers
  now have a resolve_entities keyword argument that can be set to False to
  keep entities in the document.

* column field on error log entries to accompany the line field

* Error specific messages in XPath parsing and evaluation
  NOTE: for evaluation errors, you will now get an XPathEvalError instead of
  an XPathSyntaxError. To catch both, you can except on XPathError.

* The regular expression functions in XPath now support passing a node-set
  instead of a string

* Extended type annotation in objectify: new xsiannotate() function

* EXSLT RegExp support in standard XPath (not only XSLT)

Bugs fixed

* lxml.etree did not check tag/attribute names

* The XML parser did not report undefined entities as error

* The text in exceptions raised by XML parsers, validators and XPath
  evaluators now reports the first error that occurred instead of the last

* Passing '' as XPath namespace prefix did not raise an error

* Thread safety in XPath evaluators

Other changes

* objectify.PyType for None is now called "NoneType"

* el.getiterator() renamed to el.iter(), following ElementTree 1.3 -
  original name is still available as alias

* In the public C-API, findOrBuildNodeNs() was replaced by the more generic
  findOrBuildNodeNsPrefix

* Major refactoring in XPath/XSLT extension function code

* Network access in parsers disabled by default

From stefan_ml at behnel.de Mon Sep 3 09:29:43 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 03 Sep 2007 09:29:43 +0200
Subject: [lxml-dev] [XML-SIG] lxml 2.0alpha1 released
In-Reply-To: <46DAEC38.6060900@comcast.net>
References: <46DAE500.40408@behnel.de> <46DAEC38.6060900@comcast.net>
Message-ID: <46DBB7E7.1060309@behnel.de>

Hi,

Gloria W wrote:
> Stefan, congratulations.
> This is definitely useful.

Thanks! :)

> Please talk a bit about the API, and how it differs/varies from
> cElementTree,

http://codespeak.net/lxml/dev/compatibility.html

> or link to some examples.

The docs are full of doctest examples. However, as lxml.html is still pretty
new, its docs are not as comprehensive as those for lxml.etree yet.

> For example, the node nesting,
> the usage of a 'tail' for trailing text. I wonder if lxml offers more of
> a DOM compliant node nesting, or if it conforms to the
> conventions/oddities of ElementTree.

lxml.etree aims for ElementTree compatibility, so these things work alike.
The above link describes the differences that we either cannot work around
for technical reasons (or performance reasons) or that are deliberate
decisions where we think ElementTree is wrong.

Note that the ElementTree API is more and more becoming a basis for other
APIs in lxml. There is lxml.objectify, which replaces a lot of this API by
something that works more like Python objects themselves (a data binding
approach). lxml.html extends the API with a bunch of helper methods for link
handling and also changes the way the serialisation works to better adapt it
to HTML. There's also MathDOM, a MathML implementation, which was the
original reason for making lxml extensible at the Element level, back in the
days of lxml 0.7. The original idea was actually 'stolen' from Xist, although
lxml has definitely found its own way of dealing with it.

The one thing I like most about lxml is the tool integration. For example,
you can use the Element API in lxml.etree or lxml.objectify or lxml.html,
with any of the five path languages: ElementPath, ETXPath, XPath,
CSS-Selectors or ObjectPath. I think this is a trend that should continue.
Most XML/HTML formats can benefit from specialised Element classes with
specially adapted or added methods, properties and even different tree
behaviour, while still taking advantage of all the other tools that lxml
provides.
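
A minimal sketch of what that integration looks like for three of the path
languages named above, assuming a current lxml installation. The sample
document and the variable names are made up for illustration, and the
CSS-Selector and ObjectPath variants are left out because they need extra
setup:

```python
from lxml import etree

# Hypothetical sample document, just for illustration.
root = etree.fromstring("<root><a><b>one</b></a><a><b>two</b></a></root>")

# ElementPath: the ElementTree-compatible subset behind find()/findall().
via_elementpath = [b.text for b in root.findall("a/b")]

# XPath: full XPath 1.0 as implemented by libxml2.
via_xpath = root.xpath("//a/b/text()")

# ETXPath: XPath expressions that accept ElementTree's {namespace}tag notation.
find_b = etree.ETXPath("//b")
via_etxpath = [b.text for b in find_b(root)]
```

All three queries run against the same Element objects, which is the point
being made: the tree API stays the same while the query tool changes.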
The possibilities that lxml offers here are close to unlimited (both at the
Python level and at the C level) - even with the 'oddities' (as you called
it) of ElementTree. I personally believe that .tail attributes are actually a
big advantage, as the ignorance of text nodes simplifies the tree model
considerably (well, the public one, not necessarily the internal one...)

> Also show us how it differs from BeautifulSoup, which has extremely
> robust unicode handling and mangled XML/HTML tag completion, but may
> benchmark a bit slower.

libxml2 does not have as robust support for HTML-like tag soup as
BeautifulSoup, but it does a pretty good job anyway. In lxml 2.0, lxml.html
comes with BeautifulSoup integration (as ElementTree does), so now you can
have both: a tag soup parser and all the features of lxml.

Stefan

From mantegazza at ill.fr Mon Sep 3 09:38:09 2007
From: mantegazza at ill.fr (=?iso-8859-15?q?Fr=E9d=E9ric_Mantegazza?=)
Date: Mon, 3 Sep 2007 09:38:09 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DAE416.1080807@behnel.de>
References: <46DAE416.1080807@behnel.de>
Message-ID: <200709030938.09485.mantegazza@ill.fr>

On Sunday 2 September 2007 18:25, Stefan Behnel wrote:

>     * XSLT objects now support deep copying

Good ;o)

--
Frédéric

From stefan_ml at behnel.de Mon Sep 3 09:55:14 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 03 Sep 2007 09:55:14 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <200709030938.09485.mantegazza@ill.fr>
References: <46DAE416.1080807@behnel.de> <200709030938.09485.mantegazza@ill.fr>
Message-ID: <46DBBDE2.4040702@behnel.de>

Frédéric Mantegazza wrote:
> On Sunday 2 September 2007 18:25, Stefan Behnel wrote:
>
>> * XSLT objects now support deep copying
>
> Good ;o)

... although that's such a recent feature that I wouldn't bet my household on
it. Since you had code that stumbled over the lack of that feature, could you
give it some more testing so that we can see if it works?
Especially in the threaded case?

Thanks,
Stefan

From mantegazza at ill.fr Mon Sep 3 10:14:47 2007
From: mantegazza at ill.fr (=?iso-8859-15?q?Fr=E9d=E9ric_Mantegazza?=)
Date: Mon, 3 Sep 2007 10:14:47 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DBBDE2.4040702@behnel.de>
References: <46DAE416.1080807@behnel.de> <200709030938.09485.mantegazza@ill.fr> <46DBBDE2.4040702@behnel.de>
Message-ID: <200709031015.04577.mantegazza@ill.fr>

On Monday 3 September 2007 09:55, Stefan Behnel wrote:
> Frédéric Mantegazza wrote:
> > On Sunday 2 September 2007 18:25, Stefan Behnel wrote:
> >> * XSLT objects now support deep copying
> >
> > Good ;o)
>
> ... although that's such a recent feature that I wouldn't bet my
> household on it. Since you had code that stumbled over the lack of that
> feature, could you give it some more testing so that we can see if it
> works? Especially in the threaded case?

OK, I will run some tests.

--
Frédéric

From jholg at gmx.de Mon Sep 3 12:59:29 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 03 Sep 2007 12:59:29 +0200
Subject: [lxml-dev] cython + python2.4 problem (was: objectify factories)
In-Reply-To: <46D86DC7.3080304@behnel.de>
References: <46C945BF.9020805@behnel.de> <46CC80A4.7010607@behnel.de> <20070830114448.280910@gmx.net> <46D6BC03.4050103@behnel.de> <20070830142730.280890@gmx.net> <46D7D934.1090909@behnel.de> <20070831093430.303420@gmx.net> <46D7E1C3.7060706@behnel.de> <20070831160504.231930@gmx.net> <46D86DC7.3080304@behnel.de>
Message-ID: <20070903105929.168580@gmx.net>

Hi,

> the trunk now builds with Cython instead of Pyrex, so please install it to
> get
> rid of the one failing doctest. (the reason the test fails is that Cython
> knows about the package you specify in distutils, Pyrex ignores it).
>
> http://www.cython.org/
>
> lxml requires Cython 0.9.6.5.

Lazy me, not like you hadn't announced that quite a while ago.
However, unfortunately:

Just downloaded cython and tried to build lxml:

0 lb54320 at adevp02 .../lxml $ /apps/pydev/bin/python2.4 setup.py build
Traceback (most recent call last):
  File "setup.py", line 28, in ?
    import setupinfo
  File "/data/pydev/hjoukl/LXML/lxml/setupinfo.py", line 5, in ?
    from Cython.Distutils import build_ext as build_pyx
  [...]
  File "/apps/pydev/lib/python2.4/site-packages/Cython/Compiler/TypeSlots.py", line 88
    full_args = "O" + self.fixed_arg_format if self.has_dummy_arg else self.fixed_arg_format
                                             ^
SyntaxError: invalid syntax

Seems like cython relies on Python2.5 syntax, which renders it unusable for
me. Any chance to remove the hard 2.5-syntax-dependency? At a quick glance
this seems to be the only place where conditional expressions turn up.

Holger

From jholg at gmx.de Tue Sep 4 12:00:01 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 04 Sep 2007 12:00:01 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DAE416.1080807@behnel.de>
References: <46DAE416.1080807@behnel.de>
Message-ID: <20070904100001.277890@gmx.net>

Hi,

> * Extended type annotation in objectify: new xsiannotate() function

I propose renaming the existing annotate() function to pyannotate() and
adding a public interface annotate() to the internal _annotate(), so you can
xsi-typify and py-typify in one step.
I also think it would be better to change the defaults of the "ignore_old"
keyword args of the annotation functions to False, to avoid:

>>> root = E.root(E.i(23), E.s("12"), E.sub())
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    i = 23 [IntElement]
      * py:pytype = 'int'
    s = '12' [StringElement]
      * py:pytype = 'str'
    sub = '' [StringElement]
>>> objectify.annotate(root)
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    i = 23 [IntElement]
      * py:pytype = 'int'
    s = 12 [IntElement]
      * py:pytype = 'int'
    sub = '' [StringElement]
>>>

where you lose the "str" type information of root.s. I think the current
default is a bit counter-intuitive.

What do you say?

Holger

From mantegazza at ill.fr Tue Sep 4 15:39:32 2007
From: mantegazza at ill.fr (=?iso-8859-15?q?Fr=E9d=E9ric_Mantegazza?=)
Date: Tue, 4 Sep 2007 15:39:32 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <46D69BBC.4070701@behnel.de>
References: <46D69BBC.4070701@behnel.de>
Message-ID: <200709041539.33032.mantegazza@ill.fr>

On Thursday 30 August 2007 12:28, Stefan Behnel wrote:

> I just released lxml 1.3.4 to PyPI. It has a minor bug fix and a few
> compatibility enhancements both backwards and forwards. Changelog
> follows.

On my debian etch (stable), I only have setuptools 0.6c3, but lxml 1.3.4
needs at least 0.6c5... I just changed the test in setup.py, and all compiled
fine. But might I run into problems? Does 1.3.4 *really* need setuptools
>=0.6c5?
--
Frédéric

From jholg at gmx.de Tue Sep 4 15:46:41 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 04 Sep 2007 15:46:41 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <200709041539.33032.mantegazza@ill.fr>
References: <46D69BBC.4070701@behnel.de> <200709041539.33032.mantegazza@ill.fr>
Message-ID: <20070904134641.117140@gmx.net>

Hi,

> On my debian etch (stable), I only have setuptools 0.6c3, but lxml 1.3.4
> needs at least 0.6c5... I just changed the test in setup.py, and all
> compiled fine. But might I run into problems? Does 1.3.4 *really* need
> setuptools >=0.6c5?

I had the same issue on my kubuntu system the other day, and did the same
thing to get it to build (I *think* that was svn trunk, though), because I'd
rather use os packages than easy_install newer setuptools, if possible.

Worked for me.

Holger

From mantegazza at ill.fr Tue Sep 4 15:50:55 2007
From: mantegazza at ill.fr (=?iso-8859-15?q?Fr=E9d=E9ric_Mantegazza?=)
Date: Tue, 4 Sep 2007 15:50:55 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <20070904134641.117140@gmx.net>
References: <46D69BBC.4070701@behnel.de> <200709041539.33032.mantegazza@ill.fr> <20070904134641.117140@gmx.net>
Message-ID: <200709041550.57485.mantegazza@ill.fr>

On Tuesday 4 September 2007 15:46, jholg at gmx.de wrote:
> > On my debian etch (stable), I only have setuptools 0.6c3, but lxml
> > 1.3.4 needs at least 0.6c5... I just changed the test in setup.py, and
> > all compiled fine. But might I run into problems? Does 1.3.4 *really*
> > need setuptools >=0.6c5?
>
> I had the same issue on my kubuntu system the other day, and did the same
> thing to get it to build (I *think* that was svn trunk, though), because
> I'd rather use os packages than easy_install newer setuptools, if
> possible. Worked for me.
Thanks for the feedback :o)

--
Frédéric

From jtk at yahoo.com Wed Sep 5 22:49:57 2007
From: jtk at yahoo.com (Jeff Kowalczyk)
Date: Wed, 05 Sep 2007 16:49:57 -0400
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
Message-ID:

The 2007-08-30 addition of lxml-2.0alpha1 to PyPI is possibly causing
buildouts depending on lxml to fail with the following error when running
in fetch-newest mode:

Getting required 'zope.testbrowser'
  required by z3c.etestbrowser 1.0.2-r75829.
We have the best distribution that satisfies 'zope.testbrowser'.
Picked: zope.testbrowser = 3.4.1
Getting required 'lxml'
  required by z3c.etestbrowser 1.0.2-r75829.
Getting distribution for 'lxml'.
While:
  Installing test.
  Getting distribution for 'lxml'.
Error: Can't download
http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found

Thanks.

From stefan_ml at behnel.de Wed Sep 5 23:12:48 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 05 Sep 2007 23:12:48 +0200
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
In-Reply-To:
References:
Message-ID: <46DF1BD0.60400@behnel.de>

Jeff Kowalczyk wrote:
> The 2007-08-30 addition of lxml-2.0alpha1 to PyPI is possibly causing
> buildouts depending on lxml to fail with the following error when running
> in fetch-newest mode:
>
> Getting required 'zope.testbrowser'
>   required by z3c.etestbrowser 1.0.2-r75829.
> We have the best distribution that satisfies 'zope.testbrowser'.
> Picked: zope.testbrowser = 3.4.1
> Getting required 'lxml'
>   required by z3c.etestbrowser 1.0.2-r75829.
> Getting distribution for 'lxml'.
> While:
>   Installing test.
>   Getting distribution for 'lxml'.
> Error: Can't download
> http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found

Yep, that's because there isn't such a branch yet. It will become available
when 2.0 is released.

I guess we should take that paragraph out till then...
Stefan

From l.oluyede at gmail.com Thu Sep 6 15:41:39 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Thu, 6 Sep 2007 15:41:39 +0200
Subject: [lxml-dev] Resolve RelaxNG document
Message-ID: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>

I'd like to know if there's a practical way to resolve all the
references in a RelaxNG document. Maybe 'expand' is the correct word.

I have a document with some which are mostly custom data
types and some elements using those types. For example:

----
a
b
---

What I'd like is to replace the reference "foo" with the definition of
"foo" as a whole.

Is there an easy way? I read something about custom resolvers...

I use lxml 2.0alpha1

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From stefan_ml at behnel.de Thu Sep 6 15:59:30 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Sep 2007 15:59:30 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
Message-ID: <46E007C2.3000107@behnel.de>

Lawrence Oluyede wrote:
> I'd like to know if there's a practical way to resolve all the
> references in a RelaxNG document. Maybe 'expand' is the correct word.
>
> I have a document with some which are mostly custom data
> types and some elements using those types. For example:
>
> ----
>
>
> a
> b
>
>
>
>
>
>
>
>
> ---
>
> What I'd like is to replace the reference "foo" with the definition of
> "foo" as a whole.

I think the easiest way is to do it by hand, something like:

    resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
    for ref in tree.iter("ref"):
        define = resolve(tree, name = ref.get("name"))
        if define:
            ref.getparent().replace(ref, define[0]) # or define[0][0] ?
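
Fleshed out into something runnable, the idea above might look like the
following. This is only a sketch under stated assumptions: the schema is
invented for illustration (the one from the original mail did not survive),
the "rng" prefix is assumed to be bound to the standard RELAX NG namespace,
the <ref> elements are matched by their namespace-qualified tag, and copies of
the children of each <define> are put in place of the <ref>. Recursive
definitions would need extra care.

```python
import copy
from lxml import etree

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Invented example schema: one named pattern "foo", referenced once.
SCHEMA = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="doc"><ref name="foo"/></element>
  </start>
  <define name="foo">
    <choice><value>a</value><value>b</value></choice>
  </define>
</grammar>
""".strip()

tree = etree.fromstring(SCHEMA)

# Look up a <define> by name; $name is passed as a keyword argument on call.
resolve = etree.XPath("//rng:define[@name = $name]", namespaces={"rng": RNG_NS})

# Collect the refs first so the tree is not modified while iterating over it.
for ref in list(tree.iter("{%s}ref" % RNG_NS)):
    define = resolve(tree, name=ref.get("name"))
    if not define:
        continue  # dangling reference: leave it untouched
    parent = ref.getparent()
    position = parent.index(ref)
    # Insert copies of the definition's children where the ref was,
    # so the <define> itself stays intact for other references.
    for offset, child in enumerate(define[0]):
        parent.insert(position + offset, copy.deepcopy(child))
    parent.remove(ref)

expanded = etree.tostring(tree, pretty_print=True).decode()
```

Copying the children rather than moving define[0] avoids pulling the <define>
element out of the grammar when the same pattern is referenced more than once.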
Stefan

From l.oluyede at gmail.com Thu Sep 6 16:36:16 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Thu, 6 Sep 2007 16:36:16 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <46E007C2.3000107@behnel.de>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com> <46E007C2.3000107@behnel.de>
Message-ID: <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>

> I think the easiest way is to do it by hand, something like:
>
>     resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
>     for ref in tree.iter("ref"):
>         define = resolve(tree, name = ref.get("name"))
>         if define:
>             ref.getparent().replace(ref, define[0]) # or define[0][0] ?
>

Ok thanks anyway, I was doing it by hand. I hoped there was something
in the schema validator. How does the relaxng validator know if a
document is valid if it doesn't expand references? That was my thought

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From stefan_ml at behnel.de Thu Sep 6 17:01:21 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Sep 2007 17:01:21 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com> <46E007C2.3000107@behnel.de> <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>
Message-ID: <46E01641.3070903@behnel.de>

Lawrence Oluyede wrote:
>> I think the easiest way is to do it by hand, something like:
>>
>>     resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
>>     for ref in tree.iter("ref"):
>>         define = resolve(tree, name = ref.get("name"))
>>         if define:
>>             ref.getparent().replace(ref, define[0]) # or define[0][0] ?
>>
>
> Ok thanks anyway, I was doing it by hand. I hoped there was something
> in the schema validator.
> How does the relaxng validator know if a
> document is valid if it doesn't expand references?

What makes you think it doesn't? It should be part of the evaluation step.
However, you can't see that from the outside as the tree you pass in is not
modified.

Stefan

From l.oluyede at gmail.com Thu Sep 6 17:46:30 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Thu, 6 Sep 2007 17:46:30 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <46E01641.3070903@behnel.de>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com> <46E007C2.3000107@behnel.de> <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com> <46E01641.3070903@behnel.de>
Message-ID: <9eebf5740709060846mded4030wbee9392b2b0f1358@mail.gmail.com>

> What makes you think it doesn't?

That's exactly what I meant. It does obviously, so I hoped there was a
way to hook in the evaluation step and grab the expanded references
but you just responded here below:

> However, you can't see that from the
> outside as the tree you pass in is not modified.

Thanks!

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From jtk at yahoo.com Thu Sep 6 17:49:08 2007
From: jtk at yahoo.com (Jeff Kowalczyk)
Date: Thu, 06 Sep 2007 11:49:08 -0400
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
References: <46DF1BD0.60400@behnel.de>
Message-ID:

Stefan Behnel wrote:
> > Getting distribution for 'lxml'.
> > While:
> >   Installing test.
> >   Getting distribution for 'lxml'.
> > Error: Can't download
> > http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found
>
> Yep, that's because there isn't such a branch yet. It will become
> available when 2.0 is released.
>
> I guess we should take that paragraph out till then...
The latest/pending update http://cheeseshop.python.org/pypi/lxml/2.0alpha2
causes a different error for zc.buildout-1.0.0b30 in fetch-newest mode:

Picked: zope.testbrowser = 3.4.1
Getting required 'lxml'
  required by z3c.etestbrowser 1.0.2-r75829.
Getting distribution for 'lxml'.
While:
  Installing test.
  Getting distribution for 'lxml'.
Error: Can't download
http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0alpha2.tar.gz:
404 Not Found

lxml-2.0alpha2.tar.gz is not (yet, temporarily) available at that URL. I
think zc.buildout-1.0.0b30 uses the package's simple index:
http://cheeseshop.python.org/simple/lxml/

From jholg at gmx.de Thu Sep 6 18:14:12 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 06 Sep 2007 18:14:12 +0200
Subject: [lxml-dev] annotate, pyannotate, xsiannotate (was: lxml 2.0alpha1 released)
In-Reply-To: <20070904100001.277890@gmx.net>
References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net>
Message-ID: <20070906161412.136940@gmx.net>

Hi,

> I propose renaming the existing annotate() function to pyannotate() and
> adding a public interface annotate() to the internal _annotate(), so you can
> xsi-typify and py-typify in one step.
> I also think it would be better to change the defaults of the "ignore_old"
> keyword args of the annotation functions to False, to avoid:

Attached is a patch that does just that, with tests.

I quirked the defaults for the new annotate() function, which can now
py:pytype and xsi:type-annotate in one step, so that it behaves just like the
former annotate() (at least it passes all the existing unittests, which I did
not alter).

The new pyannotate() and xsiannotate() use different defaults, as suggested.

There's an additional keyword arg keep_tree that lets you preserve existing
TREE attribute values, if switched on.

Holger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pyannotate_annotate.patch
Type: application/octet-stream
Size: 18093 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070906/adac2214/attachment-0001.obj

From stefan_ml at behnel.de Fri Sep 7 06:58:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 07 Sep 2007 06:58:22 +0200
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
In-Reply-To:
References: <46DF1BD0.60400@behnel.de>
Message-ID: <46E0DA6E.4000107@behnel.de>

Jeff Kowalczyk wrote:
> Stefan Behnel wrote:
>>> Getting distribution for 'lxml'.
>>> While:
>>>   Installing test.
>>>   Getting distribution for 'lxml'.
>>> Error: Can't download
>>> http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found
>> Yep, that's because there isn't such a branch yet. It will become
>> available when 2.0 is released.
>>
>> I guess we should take that paragraph out till then...
>
> The latest/pending update http://cheeseshop.python.org/pypi/lxml/2.0alpha2
> causes a different error for zc.buildout-1.0.0b30 in fetch-newest mode:
>
> Picked: zope.testbrowser = 3.4.1
> Getting required 'lxml'
>   required by z3c.etestbrowser 1.0.2-r75829.
> Getting distribution for 'lxml'.
> While:
>   Installing test.
>   Getting distribution for 'lxml'.
> Error: Can't download
> http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0alpha2.tar.gz:
> 404 Not Found
>
> lxml-2.0alpha2.tar.gz is not (yet, temporarily) available at that URL. I
> think zc.buildout-1.0.0b30 uses the package's simple index:
> http://cheeseshop.python.org/simple/lxml/

Ah, didn't even know that existed...

I accidentally registered alpha2 when updating the branch link, as I had
already increased the trunk version. Should be fixed now.

Thanks for making me aware of this.
Stefan

From l.oluyede at gmail.com Sun Sep 9 19:32:27 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Sun, 9 Sep 2007 19:32:27 +0200
Subject: [lxml-dev] Reparenting a node
Message-ID: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com>

I have a doc A and a doc B, I'd like to put a node extracted from A in
the document B but I always get a ValueError:

ValueError: Element is not a child of this node.

I didn't find any "setparent" in the API.

How can I do this?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From stefan_ml at behnel.de Sun Sep 9 19:57:12 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 09 Sep 2007 19:57:12 +0200
Subject: [lxml-dev] Reparenting a node
In-Reply-To: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com>
References: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com>
Message-ID: <46E433F8.6090501@behnel.de>

Lawrence Oluyede wrote:
> I have a doc A and a doc B, I'd like to put a node extracted from A in
> the document B but I always get a ValueError:
>
> ValueError: Element is not a child of this node.

Sounds like you're using remove() or index(), no need to do that.

> I didn't find any "setparent" in the API.
>
> How can I do this?

try

    node_in_B.append(node_in_A)

See the "Elements are lists" section in the tutorial.

Stefan

From stefan_ml at behnel.de Tue Sep 11 22:00:04 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 11 Sep 2007 22:00:04 +0200
Subject: [lxml-dev] ET compatible parser interfaces
Message-ID: <46E6F3C4.7050403@behnel.de>

Hi all,

just wanted to send a note that I implemented ElementTree compatible
interfaces on top of lxml's parsers.
This means that you can now do two additional things: use "parser.feed(data)"
and "parser.close()" to pass data to a parser in a step-by-step fashion, and
pass a "target" keyword argument to a parser to receive SAX-like method calls
on the object you pass.

The interface is described here:

http://effbot.org/elementtree/elementtree-xmlparser.htm

This *should* work for both XML and HTML (the latter was hard enough to
implement due to differences in the libxml2 API). I also added a parser
section to the lxml tutorial while I was at it.

The current down-side is that the trunk will require a patched version of
Cython until the next Cython release. I added the patch to SVN (and to the
Cython bug tracker). The reason is a syntax addition that allows grabbing the
GIL during a function call, which simplifies the implementation considerably.

Have fun,
Stefan

From anders at bruun-olsen.net Tue Sep 11 22:14:11 2007
From: anders at bruun-olsen.net (Anders Bruun Olsen)
Date: Tue, 11 Sep 2007 22:14:11 +0200
Subject: [lxml-dev] Serialization with namespaces
Message-ID: <46E6F713.3070801@bruun-olsen.net>

Hi,

I need to chop up some XML based on XPath expressions and serialize the
resulting chunks individually. I thought LXML would be perfect for this
task but have run into some problems.

Here is the sample I use, test.xtm:

Abele Henriksdatter i Radsted, Gotfred Bangs hustru

First I parse the file and grab the root:

>>> from lxml import etree
>>> tree = etree.parse("test.xtm")
>>> root = tree.getroot()
>>> root.nsmap
{None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink':
'http://www.w3.org/1999/xlink'}

Then I do a little XPath magic:

>>> find_topics = etree.ETXPath("//{%s}topic" % root.nsmap[None])
>>> elem = find_topics(root)[0]
>>> elem

Now the problem occurs when I try to serialize. When I serialize the
root, everything looks fine:

>>> etree.tostring(root, pretty_print=True)
'
...

The XML Namespace is applied as it should.
However on the topic-element that I found using XPath no XML Namespace
is output:

>>> etree.tostring(elem, pretty_print=True)
'\n\t\t\n\t\t\t
...

Even though the nsmap attribute is set correctly:

>>> elem.nsmap
{None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink':
'http://www.w3.org/1999/xlink'}

I realize this might be because the element is not the root of the
current document. How can I make LXML output the xmlns in this case?

--
Anders

From stefan_ml at behnel.de Wed Sep 12 12:19:17 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 12 Sep 2007 12:19:17 +0200
Subject: [lxml-dev] Serialization with namespaces
In-Reply-To: <46E6F713.3070801@bruun-olsen.net>
References: <46E6F713.3070801@bruun-olsen.net>
Message-ID: <46E7BD25.4020100@behnel.de>

Anders Bruun Olsen wrote:
> Now the problem occurs when I try to serialize. When I serialize the
> root, everything looks fine:
>
> >>> etree.tostring(root, pretty_print=True)
> '
> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1">
> ...
>
> The XML Namespace is applied as it should. However on the topic-element
> that I found using XPath no XML Namespace is output:
>
> >>> etree.tostring(elem, pretty_print=True)
> '\n\t\t\n\t\t\t
> ...
>
> Even though the nsmap attribute is set correctly:
>
> >>> elem.nsmap
> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink':
> 'http://www.w3.org/1999/xlink'}

Hmm, I actually thought these problems were gone with 1.3, but I can
reproduce this with the current trunk.

I'll look into it.

Stefan

From stefan_ml at behnel.de Wed Sep 12 12:47:30 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 12 Sep 2007 12:47:30 +0200
Subject: [lxml-dev] Serialization with namespaces
In-Reply-To: <46E7BD25.4020100@behnel.de>
References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de>
Message-ID: <46E7C3C2.10103@behnel.de>

Stefan Behnel wrote:
> Anders Bruun Olsen wrote:
>> Now the problem occurs when I try to serialize.
When I serialize the >> root, everything looks fine: >> >> >>> etree.tostring(root, pretty_print=True) >> '> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >> ... >> >> The XML Namespace is applied as it should. However on the topic-element >> that I found using XPath no XML Namespace is output: >> >> >>> etree.tostring(elem, pretty_print=True) >> '\n\t\t\n\t\t\t> ... >> >> Even though the nsmap attribute is set correctly: >> >> >>> elem.nsmap >> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >> 'http://www.w3.org/1999/xlink'} > > Hmm, I actually thought these problems were gone with 1.3, but I can reproduce > this with the current trunk. Ok, so the problem here is libxml2. It serialises only the namespaces that are defined on the node itself, not all those that are defined in the node's context. I'll try to work around it. Stefan From stefan_ml at behnel.de Wed Sep 12 14:54:08 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 14:54:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7C3C2.10103@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> Message-ID: <46E7E170.3000803@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> Anders Bruun Olsen wrote: >>> Now the problem occurs when I try to serialize. When I serialize the >>> root, everything looks fine: >>> >>> >>> etree.tostring(root, pretty_print=True) >>> '>> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >>> ... >>> >>> The XML Namespace is applied as it should. However on the topic-element >>> that I found using XPath no XML Namespace is output: >>> >>> >>> etree.tostring(elem, pretty_print=True) >>> '\n\t\t\n\t\t\t>> ... 
>>> >>> Even though the nsmap attribute is set correctly: >>> >>> >>> elem.nsmap >>> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >>> 'http://www.w3.org/1999/xlink'} >> Hmm, I actually thought these problems were gone with 1.3, but I can reproduce >> this with the current trunk. > > Ok, so the problem here is libxml2. It serialises only the namespaces that are > defined on the node itself, not all those that are defined in the node's context. Here's a patch (against the trunk) that works for me. It copies the node before the serialisation and adds all namespaces that were declared up in the tree. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: ns-serialisation.patch Type: text/x-diff Size: 3984 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070912/c376e7b9/attachment.bin From anders at bruun-olsen.net Wed Sep 12 15:19:21 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:19:21 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E170.3000803@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> Message-ID: <46E7E759.1090308@bruun-olsen.net> Stefan Behnel wrote: > Stefan Behnel wrote: >> Stefan Behnel wrote: >>> Anders Bruun Olsen wrote: >>>> Now the problem occurs when I try to serialize. When I serialize the >>>> root, everything looks fine: >>>> >>>> >>> etree.tostring(root, pretty_print=True) >>>> '>>> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >>>> ... >>>> >>>> The XML Namespace is applied as it should. However on the topic-element >>>> that I found using XPath no XML Namespace is output: >>>> >>>> >>> etree.tostring(elem, pretty_print=True) >>>> '\n\t\t\n\t\t\t>>> ... 
>>>> >>>> Even though the nsmap attribute is set correctly: >>>> >>>> >>> elem.nsmap >>>> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >>>> 'http://www.w3.org/1999/xlink'} >>> Hmm, I actually thought these problems were gone with 1.3, but I can reproduce >>> this with the current trunk. >> Ok, so the problem here is libxml2. It serialises only the namespaces that are >> defined on the node itself, not all those that are defined in the node's context. > > Here's a patch (against the trunk) that works for me. It copies the node > before the serialisation and adds all namespaces that were declared up in the > tree. > > Stefan > Something seems amiss with the patch: $ svn co http://codespeak.net/svn/lxml/trunk lxml $ cd lxml $ patch <~/download/ns-serialisation.patch can't find file to patch at input line 5 Perhaps you should have used the -p or --strip option? The text leading up to this was: -------------------------- |Index: src/lxml/proxy.pxi |=================================================================== |--- src/lxml/proxy.pxi (Revision 46423) |+++ src/lxml/proxy.pxi (Arbeitskopie) -------------------------- File to patch: -- Anders From anders at bruun-olsen.net Wed Sep 12 15:22:08 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:22:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E759.1090308@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> Message-ID: <46E7E800.4010902@bruun-olsen.net> Anders Bruun Olsen wrote: > Something seems amiss with the patch: Sorry, my bad, you have of course already applied it to trunk. 
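For anyone who wants to check the fix on their own data, here is a minimal sketch of the "find a chunk, serialise it alone" round trip. It uses the standard library's ElementTree so that it runs without a patched lxml build; the sample document (the topicMap root and the t1 id) is made up for illustration and is not the original test.xtm. With lxml the same calls should apply, assuming a version that contains the namespace fix.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"

# Hypothetical stand-in for test.xtm: the default namespace is declared
# on the root only, not on the <topic> child.
doc = (
    '<topicMap xmlns="%s" id="personnavnereg1">'
    '<topic id="t1"><baseName/></topic>'
    '</topicMap>' % XTM_NS
)

root = ET.fromstring(doc)
topic = root.find("{%s}topic" % XTM_NS)

# Serialise the child on its own; the namespace declaration has to travel
# with it, otherwise the chunk changes meaning when parsed separately.
chunk = ET.tostring(topic, encoding="unicode")

# Round trip: the re-parsed chunk must still live in the XTM namespace.
reparsed = ET.fromstring(chunk)
```

The useful check is the round trip rather than the exact output string, since the prefix a serialiser picks for the namespace may differ between libraries.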
-- Anders From anders at bruun-olsen.net Wed Sep 12 15:28:08 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:28:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E800.4010902@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> Message-ID: <46E7E968.7050808@bruun-olsen.net> Anders Bruun Olsen wrote: > Anders Bruun Olsen wrote: >> Something seems amiss with the patch: > > Sorry, my bad, you have of course already applied it to trunk. > However, it seems that trunk does not build: $ make python setup.py build_ext -i Building with Cython. Building lxml version 2.0.alpha2-46501 running build_ext building 'lxml.etree' extension Error converting Pyrex file to C: ------------------------------------------------------------ ... include "xmlerror.pxi" # Error and log handling include "classlookup.pxi" # Element class lookup mechanisms include "nsclasses.pxi" # Namespace implementation and registry include "docloader.pxi" # Support for custom document loaders include "parser.pxi" # XML Parser include "parsertarget.pxi" # ET Parser target ^ ------------------------------------------------------------ /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found make: *** [inplace] Error 1 -- Anders From ianb at colorstudy.com Wed Sep 12 19:37:00 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 12 Sep 2007 12:37:00 -0500 Subject: [lxml-dev] ET 1.3 Message-ID: <46E823BC.10906@colorstudy.com> I was just reading the ElementTree 1.3 release notes: http://effbot.org/zone/elementtree-13-intro.htm Generally I like the changes. The change from Element as a factory function to Element as a subclassable class (akin to ElementBase), is nice -- I never understood why there was a distinction. Except... 
because "el = Element(tag)" doesn't necessarily mean that "el.__class__ is Element"...? getiterator to iter is a simple seeming change. Since getiterator actually returns an iterable, not an iterator, it's also just a little more accurate. Looks like it also moves to an iterator, not a list. I don't have much of an opinion on the parser and serializer stuff, though I'd love it if there was a proper serializer for HTML (not the dumb XSLT-based thing I put in lxml.html). I notice that elements now give warnings when treated as booleans. I like this a lot, as I've found many bugs in my code where I did "if el" where I should have done "if el is not None". And an element with no children doesn't feel falsish at all to me. I've actually already taken to using len(el) to test for children, just because I can't get myself to commit to this weird-seeming behavior. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Wed Sep 12 18:38:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 18:38:05 +0200 Subject: [lxml-dev] ET compatible parser interfaces In-Reply-To: <46E6F3C4.7050403@behnel.de> References: <46E6F3C4.7050403@behnel.de> Message-ID: <46E815ED.6020604@behnel.de> Stefan Behnel wrote: > The current down-side is that the trunk will require a patched version of > Cython until the next Cython release. ... which has just been released as Cython 0.9.6.6. 
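For reference, the feed/close/target interface this thread is about can be sketched as follows. This is only an illustration written against the standard library's ElementTree XMLParser, which implements the same interface described on the effbot page; the TagCollector class and the sample chunks are made up for the example.

```python
import xml.etree.ElementTree as ET


class TagCollector:
    """Parser target (hypothetical example): receives SAX-like callbacks."""

    def __init__(self):
        self.tags = []

    def start(self, tag, attrib):
        # Called for every opening tag.
        self.tags.append(tag)

    def end(self, tag):
        # Called for every closing tag; nothing to do here.
        pass

    def data(self, text):
        # Called for text content; ignored in this sketch.
        pass

    def close(self):
        # Whatever the target returns here is returned by parser.close().
        return self.tags


parser = ET.XMLParser(target=TagCollector())

# Pass the document to the parser step by step.
for piece in ("<root><a/>", "<b>text</b>", "</root>"):
    parser.feed(piece)

result = parser.close()
```

Without a target, close() returns the root element of the tree that was built; with a target, it returns the result of the target's own close() method.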
Stefan :) From stefan_ml at behnel.de Wed Sep 12 20:19:49 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 20:19:49 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070906161412.136940@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> Message-ID: <46E82DC5.4010101@behnel.de> jholg at gmx.de wrote: >> I propose renaming the existing annotate() function to pyannotate() and >> adding a public interface annotate() to the internal _annotate(), so you can >> xsi-typify and py-typify in one step. Sure, that's fine. >> I also think it would be better to change the defaults of the "ignore_old" >> keyword args of the annotation functions to False Definitely ok for the new ones. Maybe for annotate() also, I'm not sure yet. > Attached is patch that does just that, with tests. > > I quirked the defaults for the new annotate() function that can now > py:pytype and xsi:type-annotate in one step so that it behaves just like > the former annotate() (at least it passes all the existing unittests which > I did not alter) The new pyannotate() and xsiannotate() use different > defaults, as suggested. I still have to look through the patch a bit more, but I generally like the intention, except: > There's an additional keyword arg keep_tree that lets you preserve existing TREE attribute values, if switched on. No way. :) It doesn't match the existing "ignore_*" parameters and the default is to /remove/ the tree annotation when what we want is to /create/ annotations. Taking one step back: what was the reason again why we started using TREE annotation at all? I mean, it doesn't have any advantage and it currently looks like it's getting in the way. Is there a reason that should keep us from just dropping it? completely? (minus backwards compatibility?) I mean, honestly, it's not used and it's even faster to check for children than it is to look up the attribute... 
Stefan From stefan_ml at behnel.de Wed Sep 12 21:59:43 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 21:59:43 +0200 Subject: [lxml-dev] ET 1.3 In-Reply-To: <46E823BC.10906@colorstudy.com> References: <46E823BC.10906@colorstudy.com> Message-ID: <46E8452F.5000800@behnel.de> Ian Bicking wrote: > I was just reading the ElementTree 1.3 release notes: > > http://effbot.org/zone/elementtree-13-intro.htm Ah, good to know. I already had a few discussions with Fredrik about a couple of features or changes in lxml.etree or ET 1.3, so both are continuously getting closer (especially now that parsers are almost compatible :). > Generally I like the changes. The change from Element as a factory > function to Element as a subclassable class (akin to ElementBase), is > nice Hmm, I'm not even sure we could do that in Cython. Sounds like he's been playing with __new__, not sure Cython supports that. > -- I never understood why there was a distinction. Except... > because "el = Element(tag)" doesn't necessarily mean that "el.__class__ > is Element"...? At least in lxml that's getting pretty rare these days... > getiterator to iter is a simple seeming change. Since getiterator > actually returns an iterable, not an iterator, it's also just a little > more accurate. Looks like it also moves to an iterator, not a list. That's one of the changes Fredrik mentioned a while ago, so lxml.etree already has it in 1.3. > I don't have much of an opinion on the parser and serializer stuff, > though I'd love it if there was a proper serializer for HTML (not the > dumb XSLT-based thing I put in lxml.html). I know. Actually, libxml2 distinguishes between HTML documents and XML documents internally, so we could already take that as a serialisation hint. So, if you parse stuff with HTML() or an HTMLParser, you'd get an HTML document on serialisation, otherwise you'd get an XML document. 
I could also imagine something like a separate ElementTree class in lxml.html that you could wrap any Element in to make sure it gets serialised as plain HTML (and not XHTML). > I notice that elements now give warnings when treated as booleans. I > like this a lot, as I've found many bugs in my code where I did "if el" > where I should have done "if el is not None". And an element with no > children doesn't feel falsish at all to me. I've actually already taken > to using len(el) to test for children, just because I can't get myself > to commit to this weird-seeming behavior. I guess lxml.etree will just follow in 2.0. I'll also take a look through the other changes. There were a few that I had not yet heard of. I like the fact that ET 1.3 and lxml 2.0 share a common alpha phase. That makes additions and learning from each other pretty easy. Stefan From stefan_ml at behnel.de Wed Sep 12 22:02:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 22:02:44 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E968.7050808@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> <46E7E968.7050808@bruun-olsen.net> Message-ID: <46E845E4.3030104@behnel.de> Anders Bruun Olsen wrote: > /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found Ah, thanks. I forgot that file when committing the target parser implementation. Fixed now. Stefan From ianb at colorstudy.com Wed Sep 12 22:48:56 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 12 Sep 2007 15:48:56 -0500 Subject: [lxml-dev] 2.0alpha too visible Message-ID: <46E850B8.8040803@colorstudy.com> I think the 2.0alpha release might be too visible. If you do "easy_install lxml" you get that version. 
One way to help this would be to not upload 2.0alpha to PyPI, but instead just put a link to a tarball with #egg=lxml-twoalpha or something, so it won't be considered newer than 1.3 (but you could install it with easy_install lxml==twoalpha). -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From jholg at gmx.de Thu Sep 13 10:17:04 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 13 Sep 2007 10:17:04 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46E82DC5.4010101@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> Message-ID: <20070913081704.138210@gmx.net> > There's an additional keyword arg keep_tree that lets you preserve > existing TREE attribute values, if switched on. > > No way. :) > > It doesn't match the existing "ignore_*" parameters and the default is to > /remove/ the tree annotation when what we want is to /create/ annotations. Hm, maybe then pyannotate() should rather not default to remove TREE attributes? > Taking one step back: what was the reason again why we started using TREE > annotation at all? I mean, it doesn't have any advantage and it currently > looks like it's getting in the way. Is there a reason that should keep us > from > just dropping it? completely? (minus backwards compatibility?) > > I mean, honestly, it's not used and it's even faster to check for children > than it is to look up the attribute... It's there to allow for leaf elements to be ObjectifiedElements, rather than ObjectifiedDataElements. The rules are easy for all other use cases: - the root has no parent element -> ObjectifiedElement - any other element with children -> ObjectifiedElement Things get difficult if you assign leaf elements and actually instantiate the python proxy objects. 
If no TREE attributes get used, these will end up being "default empty elements", usually string elements. Also, once having been serialized, there is no way that leaf elements can be recognized as ObjectifiedElements without the help of the TREE attribute. That's the main reason I propose the keep_tree functionality, to make ObjectifiedElement-leaves survive a creation-serialization-parse cycle. Holger From stefan_ml at behnel.de Thu Sep 13 11:04:24 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 11:04:24 +0200 Subject: [lxml-dev] findall() returns an iterable instead of a sequence in ET 1.3 Message-ID: <46E8FD18.1010306@behnel.de> Hi Fredrik, I just noticed the above when I tried to copy over the new ElementPath implementation from the current ET 1.3 SVN. The current ET docs of 1.2 clearly state that findall() returns a sequence. I'm not questioning the new behaviour, but it's not even mentioned in your "ET 1.3 intro" text. Don't you think that change will break a lot of code out there? It already breaks a couple of places in lxml.html, e.g. code where the author knew that there were few results to expect (and thus a list was the perfect thing to return) and where it was convenient to test for the truth value of the returned list to check for results. Admittedly, it's easy to write list(el.findall()), but the thing is: a) someone has to do that, and b) it's not always the best solution, so the change requires people to rethink their code. And the worst is: you will not even get an exception in all cases, as "if result" will simply behave differently and your code after that may still work - just not as expected. That's a pretty heavy change IMHO.
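To make the "if result" pitfall concrete, here is a small sketch. It uses the standard library's ElementTree and its iterfind() method, which current ElementTree versions provide as the lazy counterpart of findall(); the sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><a/><a/></root>")

# A lazy result is truthy even when it yields nothing, so an
# "if result:" test written for lists silently takes the wrong branch.
lazy_result = root.iterfind("missing")
lazy_is_truthy = bool(lazy_result)

# Wrapping the call in list() restores the sequence behaviour that
# existing code relies on: an empty list is falsy.
no_matches = list(root.iterfind("missing"))
two_matches = list(root.iterfind("a"))
```

No exception is raised anywhere in the lazy case, which is exactly why such code can keep "working" while doing something different from what its author intended.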
Stefan From stefan_ml at behnel.de Thu Sep 13 11:48:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 11:48:12 +0200 Subject: [lxml-dev] findall() returns an iterable instead of a sequence in ET 1.3 In-Reply-To: <368a5cd50709130216k54ad919ar2b089c4674c51ec@mail.gmail.com> References: <46E8FD18.1010306@behnel.de> <368a5cd50709130216k54ad919ar2b089c4674c51ec@mail.gmail.com> Message-ID: <46E9075C.9050006@behnel.de> Fredrik Lundh wrote: >> I just noticed the above when I tried to copy over the new ElementPath >> implementation from the current ET 1.3 SVN. The current ET docs of 1.2 clearly >> state that findall() returns a sequence. I'm not questioning the new >> behaviour, but it's not even mentioned in your "ET 1.3 intro" text. > > That's probably because I dropped in the new (still pretty rough) path > implementation after I wrote the first episode, but before I uploaded > the code... > > The ET 1.2 documentation does indeed say that findall may return a > sequence *or* an iterator: > > http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree._ElementInterface.findall-method Ok, but this page doesn't (and I find it pretty visible): http://effbot.org/elementtree/elementtree-element.htm#tag-ET.Element.findall > but as you say, chances are that people are relying on behaviour > rather than implementation. Yet, it's pretty nice to have an iterator > for things like: > > for elem in tree.findall(simple pattern): > check elem properties > if right elem: > break I think the "findALL()" makes it sound like something that returns a sequence rather than an iterator. The API shouldn't work against people's intuition. > But maybe we could provide an "iterfind", perhaps? (that may or may > not be the same thing as findall). I would definitely prefer that, and I like the name already. Then, findall() could be as simple as return list(self.iterfind()) And it /should/ do the same as findall(), as it carries "find" in its name. 
It meets my intuition that findall() and iterfind() return exactly the same results, just in the expected different ways. > fwiw, I've had the same concerns wrt the iter/getiterator changes; in > 1.2, getiterator returned a list, not an iterator. in 1.3a3, it's an > alias for "elem.iter()". maybe it should be an alias for > "list(elem.iter())" instead? I think that's different as people do not generally expect something called "getiterator" to return a sequence, so as long as they don't look into the documentation, they would not easily use it in a way that breaks after the change. BUT, since you already deprecated getiterator() anyway, why not make it a pure legacy function that works as it did in the early days? Stefan From anders at bruun-olsen.net Thu Sep 13 13:15:37 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Thu, 13 Sep 2007 13:15:37 +0200 Subject: [lxml-dev] Potential bug in trunk Message-ID: <46E91BD9.2010904@bruun-olsen.net> Hi, I've run into a weird problem. I am making a CherryPy-based application which uses lxml to do XSLT conversion of XML before sending it to the browser. It worked fine before switching to trunk (which has the namespace-patch that I need).
Here is the output: [13/Sep/2007:13:08:37] HTTP Serving HTTP on http://0.0.0.0:8080/ [13/Sep/2007:13:08:44] Traceback (most recent call last): File "/usr/lib64/python2.4/site-packages/cherrypy/_cprequest.py", line 90, in run hook() File "/usr/lib64/python2.4/site-packages/cherrypy/_cprequest.py", line 58, in __call__ return self.callback(**self.kwargs) File "/home/abo/workspace/xmldict/src/xmldict/__init__.py", line 42, in transform_output xsltdoc = etree.parse(open(xslfile)) File "etree.pyx", line 2189, in etree.parse File "parser.pxi", line 1183, in etree._parseDocument File "parser.pxi", line 1217, in etree._parseFilelikeDocument File "parser.pxi", line 1126, in etree._parseDocFromFilelike File "parser.pxi", line 83, in etree._ParserDictionaryContext.getDefaultParser File "parser.pxi", line 585, in etree._BaseParser._copy AttributeError: 'lxml.etree._ResolverRegistry' object has no attribute '_copy' The really weird part is that when I start up the interactive interpreter and do the exact same operation it works: >>> from lxml import etree >>> xsltfile = "/home/abo/workspace/dicts/svda/svda.xsl" >>> xsltdoc = etree.parse(open(xsltfile)) Anybody able to venture a guess as to where this bug might lie? Is it in lxml, cherrypy or my code? -- Anders From anders at bruun-olsen.net Thu Sep 13 13:58:07 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Thu, 13 Sep 2007 13:58:07 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E845E4.3030104@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> <46E7E968.7050808@bruun-olsen.net> <46E845E4.3030104@behnel.de> Message-ID: <46E925CF.9000603@bruun-olsen.net> Stefan Behnel wrote: >> /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found > Ah, thanks. 
I forgot that file when committing the target parser > implementation. Fixed now. Okay, trunk builds now. And I can confirm that the namespace patch works. Thanks! :) -- Anders From dfedoruk at gmail.com Thu Sep 13 17:18:21 2007 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Thu, 13 Sep 2007 19:18:21 +0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode Message-ID: Hello everyone, I'm developing a mod_python application that is based on XML\XSLT transforming. I used 4Suite libraries for that, but as the speed was unacceptable for me, I switched to lxml. Everything became much easier and 10 times faster, but I've encountered the subject problem. In brief - all my data and xslt are stored and transferred in UTF-8. With 4Suite everything was fine all the time. With lxml it works fine from the console, but inside mod_python it occasionally dies, ~ one time out of three. Strange - the same code with the same data works or dies seemingly at random. As far as I have found, there was a similar problem with PyXML and the encodings module, this is the problem with UTF, but there was no clear solution. So, my configuration is the following: Python 2.5.1 Server version: Apache/2.2.4 (FreeBSD) mod_python-3.3.1 And the relevant parts of my code are these: def extApplyXslt(xslt, data, logger ): try: strXslt = urllib2.urlopen(xslt).read() # i have to read the xslt url to the python string except urllib2.HTTPError, e: ....... except urllib2.URLError, e: ............. try: xslt_parser = etree.XMLParser() xslt_parser.resolvers.add( PrefixResolver("XSLT") ) # and now I have to use the string; a more elegant solution, anyone?
f = StringIO(strXslt) xslt_doc = etree.parse(f, xslt_parser) # and here is where the problem comes transform = etree.XSLT(xslt_doc) except Exception, exc: logger.log(logging.CRITICAL, exc.__str__() ) try: result_tree = transform(data) return etree.tostring(result_tree, 'utf-8') except Exception, exc: print "xslt processing error!", exc.__str__() return "" It dies with the message 'cannot unmarshal code objects in restricted execution mode'. By profiling I detected the point where the problem occurs: transform = etree.XSLT(xslt_doc) So, I would be grateful for any suggestions on how to get rid of this. I'd really like to use lxml. Maybe I should initialize the xslt processor in some other way? Thanks in advance, Dmitri From lee.brown at elecdev.com Thu Sep 13 17:40:33 2007 From: lee.brown at elecdev.com (Lee Brown) Date: Thu, 13 Sep 2007 11:40:33 -0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects inrestricted execution mode In-Reply-To: Message-ID: <200709131540.l8DFeFoY005233@ns1.elecdev.net> Greetings! The first thing I'd suggest is to put your query on the Mod Python list as well. A few questions: Are you trying to execute this code in a Handler or in a Filter? There's a world of hidden trouble lurking in Filters because of their re-entrant nature. Which Apache MPM are you using? If you're using a multiple-process module, you might try switching to a single-process-multiple-thread module to see if this behavior changes. > -----Original Message----- > From: lxml-dev-bounces at codespeak.net > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Dmitri Fedoruk > Sent: Thursday, September 13, 2007 11:18 AM > To: lxml-dev at codespeak.net > Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code > objects inrestricted execution mode > > Hello everyone, > > I'm developing a mod_python application that is based on > XML\XSLT transforming. > > I used 4Suite libraries for that, but as the speed was > unacceptable for me, I switched to lxml.
Everything became > much easier and 10 times faster, but I've encountered the > subject problem. > > In brief - all my data and xslt are stored and transferred in UTF-8. > With 4Suite everything was fine all the time. With lxml it > works fine from the console, but inside mod_python it > occasionaly dies, ~ one time out of three. Strange - the same > code with the same data works or dies by its own means. > > As far as I have found, there was a similar problem with > PyXML and encodings module, this is the problem with UTF, but > there was no clear solution. > > So, my configuration is the following: > Python 2.5.1 > Server version: Apache/2.2.4 (FreeBSD) > mod_python-3.3.1 > > And the relevant parts of my code are these: > > def extApplyXslt(xslt, data, logger ): > try: > strXslt = urllib2.urlopen(xslt).read() > # i have to read the xslt url to the python string > except urllib2.HTTPError, e: > ....... > except urllib2.URLError, e: > ............. > try: > xslt_parser = etree.XMLParser() > xslt_parser.resolvers.add( PrefixResolver("XSLT") ) > > # and now I have to use the string; a more elegant > solution, anyone? > f = StringIO(strXslt) > xslt_doc = etree.parse(f, xslt_parser) > > # and here where the problem comes > transform = etree.XSLT(xslt_doc) > > except Exception, exc: > logger.log(logging.CRITICAL, exc.__str__() ) > > try: > result_tree = transform(data) > return etree.tostring(result_tree, 'utf-8') > except Exception, exc: > print "xslt processing error!", exc.__str__() > return "" > > It dies with the message 'cannot unmarshal code objects in > restricted execution mode'. By profiling I detected the point > where problem > occurs: > transform = etree.XSLT(xslt_doc) > > So, I would be grateful for any suggestions how to get rid of this. > I'd really like to use lxml. Maybe I should initialize the > xslt processor in somehow other way? 
> > Thanks in advance, > Dmitri > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From stefan_ml at behnel.de Thu Sep 13 17:45:09 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 17:45:09 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: References: Message-ID: <46E95B05.9020709@behnel.de> Dmitri Fedoruk wrote: > I'm developing a mod_python application that is based on XML\XSLT > transforming. > > I used 4Suite libraries for that, but as the speed was unacceptable > for me, I switched to lxml. Everything became much easier and 10 times > faster Thanks for sharing that. :) > but I've encountered the subject problem. > > In brief - all my data and xslt are stored and transferred in UTF-8. > With 4Suite everything was fine all the time. With lxml it works fine > from the console, but inside mod_python it occasionaly dies, ~ one > time out of three. Strange - the same code with the same data works or > dies by its own means. > > As far as I have found, there was a similar problem with PyXML and > encodings module, this is the problem with UTF, but there was no clear > solution. > > So, my configuration is the following: > Python 2.5.1 > Server version: Apache/2.2.4 (FreeBSD) > mod_python-3.3.1 Looks like you forgot to mention the lxml version you are using. > And the relevant parts of my code are these: > > def extApplyXslt(xslt, data, logger ): > try: > strXslt = urllib2.urlopen(xslt).read() > # i have to read the xslt url to the python string > except urllib2.HTTPError, e: > ....... > except urllib2.URLError, e: > ............. > try: > xslt_parser = etree.XMLParser() > xslt_parser.resolvers.add( PrefixResolver("XSLT") ) > > # and now I have to use the string; a more elegant solution, As I already mentioned on c.l.py, you can pass the result of urlopen() directly into parse(). 
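A minimal sketch of that suggestion: parse() accepts a file-like object, so the intermediate string and the StringIO wrapper are not needed. To keep the example self-contained it uses the standard library's ElementTree and an in-memory io.BytesIO as a stand-in for the object returned by urlopen(); the one-line stylesheet is made up for the example. lxml's etree.parse() takes a file-like object in the same position, with the parser as an optional second argument.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the response of urllib2.urlopen(xslt): any object with a
# read() method that returns the document bytes will do.
xslt_bytes = (
    b'<xsl:stylesheet version="1.0" '
    b'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
)
response = io.BytesIO(xslt_bytes)

# Hand the file-like object straight to parse(); no .read() and no
# StringIO round trip in between.
xslt_doc = ET.parse(response)
root_tag = xslt_doc.getroot().tag
```

Letting the parser read the bytes itself also leaves encoding detection to the parser, instead of relying on how the string was decoded beforehand.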
> f = StringIO(strXslt) > xslt_doc = etree.parse(f, xslt_parser) > > # and here where the problem comes > transform = etree.XSLT(xslt_doc) > > except Exception, exc: > logger.log(logging.CRITICAL, exc.__str__() ) > > try: > result_tree = transform(data) > return etree.tostring(result_tree, 'utf-8') > except Exception, exc: > print "xslt processing error!", exc.__str__() > return "" > > It dies with the message 'cannot unmarshal code objects in restricted > execution mode'. By profiling I detected the point where problem > occurs: > transform = etree.XSLT(xslt_doc) Hmmm, I can't see where any "unmarshaling" should be taking place here - definitely not in XSLT(). And I don't get why this should only happen once in a while. Can you figure out what is writing this message? The python interpreter or mod_python? Stefan From goliath.mailinglist at gmx.de Thu Sep 13 18:02:21 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 18:02:21 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: References: Message-ID: <46E95F0D.4030506@gmx.de> > Everything became much easier and 10 times > faster, but I've encountered the subject problem. Same problem here, but with different code and versions: * Django as webframework * Apache 2.0.59 and 2.2.4 * lxml 1.3.x (all versions) * mod_python 3.2.10 and 3.3.1 * libxml2 2.6.28 / libxslt 1.1.20 I think this might have something to do with mod_python fiddling with __builtins__, at least googling for the error message told me, that Python switches to restricted mode when doing so (but this might one trigger of many). lxml seems to have callbacks run in its own "sandbox" (or something like this, at least it seems to be a different environment as the outer code had), which works fine unless the restricted mode is triggered. 
Somehow restricted mode is only mentioned in the docs for RExec (http://docs.python.org/lib/module-rexec.html), but should not be available any more, so I don't know what lxml exactly does to use callbacks. Some further bug-finding I did revealed that the "unmarshaling"-error only occurred if all modules I used in the callback are loaded before the callback runs. If I load them inside the callback the error differs. Example:
------------8<----------------------------------------------------
# unmarshaling error
from foo import bar
def callback(ctx, ...):
    return bar()
---------------------------------------------------->8------------
------------8<----------------------------------------------------
# other error
def callback(ctx, ...):
    from foo import bar
    return bar()
---------------------------------------------------->8------------
As I have the needed mod_python-configuration not done here I can't tell the other error, but I will add this later. (And I think it was some ImportError) I did not report this problem, because I was not sure which part in the chain to produce webpages was responsible. Django does fiddle with __builtins__, too (but removing it didn't help). And perhaps this is simply a mod_python-bug. So I used FastCGI, which works well. But I'm very interested in a better solution. ;-) For the questions raised by Lee Brown: > Are you trying to execute this code in a Handler or in a Filter? There's world > of hidden trouble lurking in Filters because of their re-entrant nature. I use normal XSLT-callbacks. Tried different methods to tell lxml which callbacks I have, none worked. (global namespace, callbacks as "extensions"-parameter for etree.XSLT) XSLT-sample-snippet: (Namespace is defined, callback gets called and works fine...until I try to use the code with mod_python) > Which Apache MPM are you using? If you're using a multiple-process module, you > might try switching to a single-process-multiple-thread module to see if this > behavior changes.
Using prefork here, as all threaded modules have problems with mod_php. mod_php might be another error-source. Read something about failing DB-connections when using mod_php and mod_python. But I don't really think disabling mod_php will make a difference here. Greetings, David Danier From lee.brown at elecdev.com Thu Sep 13 18:07:13 2007 From: lee.brown at elecdev.com (Lee Brown) Date: Thu, 13 Sep 2007 12:07:13 -0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E95F0D.4030506@gmx.de> Message-ID: <200709131606.l8DG6uuM005741@ns1.elecdev.net> Greetings! Sorry, I should have stated my first question more clearly. Are you calling your routines from within a Mod Python requestHandler object or an outputFilter object? > -----Original Message----- > From: lxml-dev-bounces at codespeak.net > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of David Danier > Sent: Thursday, September 13, 2007 12:02 PM > To: lxml-dev at codespeak.net > Subject: Re: [lxml-dev] lxml + mod_python: cannot unmarshal > code objects in restricted execution mode > > > Everything became much easier and 10 times faster, but I've > > encountered the subject problem. > > Same problem here, but with different code and versions: > * Django as webframework > * Apache 2.0.59 and 2.2.4 > * lxml 1.3.x (all versions) > * mod_python 3.2.10 and 3.3.1 > * libxml2 2.6.28 / libxslt 1.1.20 > > I think this might have something to do with mod_python > fiddling with __builtins__, at least googling for the error > message told me, that Python switches to restricted mode when > doing so (but this might one trigger of many). lxml seems to > have callbacks run in its own "sandbox" > (or something like this, at least it seems to be a different > environment as the outer code had), which works fine unless > the restricted mode is triggered. 
> > Somehow restricted mode is only mentioned in the docs for > RExec (http://docs.python.org/lib/module-rexec.html), but > should not be available any more, to I don't know what lxml > exactly does to use callbacks. > > Some further bug-finding I did revealed, that the > "unmarshaling"-error only occured if all modules I used in > the callback are loaded before the callback runs. If I load > them inside the callback the error differs. > Example: > ------------8<---------------------------------------------------- > # unmarshaling error > from foo import bar > def callback(ctx, ...): > return bar() > ---------------------------------------------------->8------------ > ------------8<---------------------------------------------------- > # other error > def callback(ctx, ...): > from foo import bar > return bar() > ---------------------------------------------------->8------------ > As I have the needed mod_python-configuration not done here I > can't tell the other error, but I will add this later. (And I > think it was some > ImportError) > > I did not report this problem, because I was not sure which > part in the chain to produce webpages was responsible. Django > does fiddle with __builtins__, too (but removing it didn't > help). And perhaps this is simply a mod_python-bug. So I used > FastCGI, which works well. > But I'm very interested in a better solution. ;-) > > For the questions raised by Lee Brown: > > Are you trying to execute this code in a Handler or in a Filter? > > There's world of hidden trouble lurking in Filters because > of their re-entrant nature. > > I use normal XSLT-callbacks. Tried different methods to tell > lxml which callbacks I have, none worked. > (global namespace, callbacks as "extensions"-parameter for etree.XSLT) > > XSLT-sample-snippet: > disable-output-escaping="yes"/> > (Namespace is defined, callback gets called and works > fine...until I try to use the code with mod_python) > > > Which Apache MPM are you using? 
If you're using a multiple-process > > module, you might try swithing to a single-process-multiple-thread > > module to see if this behavior changes. > > Using prefork here, as all threaded modules have problems > with mod_php. > mod_php might be another error-source. Read something about > failing DB-connections when using mod_php and mod_python. But > I don't really think disabling mod_php will make a difference here. > > Greetings, David Danier > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From goliath.mailinglist at gmx.de Thu Sep 13 18:50:54 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 18:50:54 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <200709131606.l8DG6uuM005741@ns1.elecdev.net> References: <200709131606.l8DG6uuM005741@ns1.elecdev.net> Message-ID: <46E96A6E.1090003@gmx.de> > Sorry, I should have stated my first question more clearly. Are you calling > your routines from within a Mod Python requestHandler object or an outputFilter > object? It is called out of a RequestHandler, but I'm not really doing this myself. Django does most of the work, see: http://www.djangoproject.com/documentation/modpython/ http://code.djangoproject.com/browser/django/trunk/django/core/handlers/modpython.py#L176 Greetings, David Danier From stefan_ml at behnel.de Thu Sep 13 18:53:34 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 18:53:34 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode Message-ID: <46E96B0E.3040202@behnel.de> ... just forwarding to the list ... [original mail by Dmitri Fedoruk] On 9/13/07, Stefan Behnel wrote: > Looks like you forgot to mention the lxml version you are using. 
The most important thing lxml-1.3.4 > As I already mentioned on c.l.py, you can pass the result of urlopen() > directly into parse(). Thank you, that looks better. > Hmmm, I can't see where any "unmarshaling" should be taking place here - > definitely not in XSLT(). And I don't get why this should only happen once in > a while. The point is that it than happens again and again, but I can't see any regularity. Pretty random. Here is the real code and it's profiling output: try: xslt_parser = etree.XMLParser() xslt_parser.resolvers.add( PrefixResolver("XSLT") ) inLogger.log(logging.INFO, "parser created" ) xslt_doc = etree.parse( urllib2.urlopen(xslt) , xslt_parser) inLogger.log(logging.INFO, "%s parsed" % xslt ) transform = etree.XSLT(xslt_doc) inLogger.log(logging.INFO, "xslt transformation created" ) except Exception, exc: inLogger.log(logging.CRITICAL, exc.__str__() ) logging output: Thu, 13 Sep 2007 19:53:31 INFO parser created Thu, 13 Sep 2007 19:53:31 INFO http://***/web-out-long.xsl parsed Thu, 13 Sep 2007 19:53:31 CRITICAL cannot unmarshal code objects in restricted execution mode As there is no "xslt transformation created" line, that's why I had to assume that the error happens in etree.XSLT . > Can you figure out what is writing this message? The python interpreter or > mod_python? mod_python . The python interpreter runs fine with it, not a single error. Dmitri From goliath.mailinglist at gmx.de Thu Sep 13 19:05:49 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 19:05:49 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E95F0D.4030506@gmx.de> References: <46E95F0D.4030506@gmx.de> Message-ID: <46E96DED.9050800@gmx.de> > Somehow restricted mode is only mentioned in the docs for RExec > (http://docs.python.org/lib/module-rexec.html), but should not be > available any more, to I don't know what lxml exactly does to use callbacks. 
Found another place that mentions restricted mode by accident: http://www.modpython.org/live/current/doc-html/pyapi-interps.html I think this paragraph describes the problem pretty well: ------------8<---------------------------------------------------- Note that if any third party module is being used which has a C code component that uses the simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules, then the interpreter name must be forcibly set to be "main_interpreter". This is necessary as such a module will only work correctly if run within the context of the first Python interpreter created by the process. If not forced to run under the "main_interpreter", a range of Python errors can arise, each typically referring to code being run in restricted mode. ---------------------------------------------------->8------------ (thanks to Lee Brown for asking about where lxml is called, it made me read the mod_python-docs again) I'll try to setup my site on mod_python and using "PythonInterpreter main_interpreter" in the config. According to the docs this might help...but if I read this right might produce namespace-problems or at least pollute some global namespace. As this takes some time I will post the result later. Perhaps it can be fixed in lxml by not using the "simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules"? 
Greetings, David Danier From stefan_ml at behnel.de Thu Sep 13 19:28:50 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 19:28:50 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E96DED.9050800@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> Message-ID: <46E97352.4050704@behnel.de> Hi, David Danier wrote: >> Somehow restricted mode is only mentioned in the docs for RExec >> (http://docs.python.org/lib/module-rexec.html), but should not be >> available any more, to I don't know what lxml exactly does to use callbacks. > > Found another place that mentions restricted mode by accident: > http://www.modpython.org/live/current/doc-html/pyapi-interps.html > > I think this paragraph describes the problem pretty well: > ------------8<---------------------------------------------------- > Note that if any third party module is being used which has a C code > component that uses the simplified API for access to the Global > Interpreter Lock (GIL) for Python extension modules, then the > interpreter name must be forcibly set to be "main_interpreter". This is > necessary as such a module will only work correctly if run within the > context of the first Python interpreter created by the process. If not > forced to run under the "main_interpreter", a range of Python errors can > arise, each typically referring to code being run in restricted mode. > ---------------------------------------------------->8------------ > (thanks to Lee Brown for asking about where lxml is called, it made me > read the mod_python-docs again) thanks for the infos, that's good to know. > I'll try to setup my site on mod_python and using "PythonInterpreter > main_interpreter" in the config. According to the docs this might > help...but if I read this right might produce namespace-problems or at > least pollute some global namespace. 
As this takes some time I will post > the result later. Please do. > Perhaps it can be fixed in lxml by not using the "simplified API for > access to the Global Interpreter Lock (GIL) for Python extension modules"? No way. There's a reason why it is there, and it is the same reason why we use it: it's simple and usable. Using anything else would mean a lot of rewriting. You might want to try compiling lxml with "--without-threading", though, which disables concurrency support completely (i.e. no more GIL freeing). Stefan From goliath.mailinglist at gmx.de Thu Sep 13 19:51:14 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 19:51:14 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E97352.4050704@behnel.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> <46E97352.4050704@behnel.de> Message-ID: <46E97892.9010008@gmx.de> >> As this takes some time I will post >> the result later. > Please do. Seems to work properly. But I'm not really sure how badly "main_interpreter" is polluted now. > No way. There's a reason why it is there, and it is the same reason why we use it: it's > simple and usable. Using anything else would mean a lot of rewriting. That's sad. What are the chances that patches addressing this problem are accepted? (Must review the code first, but I would really like a clean solution here) > You might want to try compiling lxml with "--without-threading", though, which > disables concurrency support completely (i.e. no more GIL freeing). Works, too. But I'm not really sure if it is a good idea to do so, as Py_NewInterpreter seems to create a thread, see http://www.python.org/doc/current/api/initialization.html#l2h-820. But I think this might not be a problem if not using a threaded Apache-MPM.
Greetings, David Danier From dfedoruk at gmail.com Fri Sep 14 10:28:41 2007 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Fri, 14 Sep 2007 12:28:41 +0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E96DED.9050800@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> Message-ID: Hello, > I'll try to setup my site on mod_python and using "PythonInterpreter main_interpreter" in the config. Fine, works for me too. As I'm not very good in python, I can't tell whether this is good or evil, but this trick works and that's all I need. Thanks! Dmitri From faassen at startifact.com Fri Sep 14 16:30:35 2007 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 14 Sep 2007 16:30:35 +0200 Subject: [lxml-dev] 2.0alpha too visible In-Reply-To: <46E850B8.8040803@colorstudy.com> References: <46E850B8.8040803@colorstudy.com> Message-ID: Ian Bicking wrote: > I think the 2.0alpha release might be too visible. If you do > "easy_install lxml" you get that version. > > One way to help this would be to not upload 2.0alpha to PyPI, but > instead just put a link to a tarball with #egg=lxml-twoalpha or > something, so it won't be considered newer than 1.3 (but you could > install it with easy_install lxml==twoalpha). I've run into it trying to get 2.0alpha several times too in buildout processes (which use setuptools underneath). Hiding 2.0alpha better might help. That said, in my opinion it's really a problem that we need to tackle on the buildout or framework end, being more explicit about what versions we need. This keeps happening with all kinds of other libraries as well and is not really the library's fault. 
(Buildout already has a feature to prefer released versions which can help a bit here) Regards, Martijn From stefan_ml at behnel.de Sat Sep 15 10:28:08 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 15 Sep 2007 10:28:08 +0200 Subject: [lxml-dev] 2.0alpha too visible In-Reply-To: <46E850B8.8040803@colorstudy.com> References: <46E850B8.8040803@colorstudy.com> Message-ID: <46EB9798.6000306@behnel.de> Hi Ian, Ian Bicking wrote: > I think the 2.0alpha release might be too visible. If you do > "easy_install lxml" you get that version. > > One way to help this would be to not upload 2.0alpha to PyPI, but > instead just put a link to a tarball with #egg=lxml-twoalpha or > something, so it won't be considered newer than 1.3 (but you could > install it with easy_install lxml==twoalpha). hmmm, I would like to keep 2.0alpha visible as there were some changes (and some more to come) that people should be aware of, especially when writing new code. So I want it uploaded on PyPI and I want it in the list of version you see when going to http://pypi.python.org/pypi/lxml I personally consider it a bug in easy_install that it always takes the newest version without paying attention to the development status (which is clearly stated as "3 - alpha" in the Trove list), or at least to the version string. It doesn't even provide an option to control that. I just wrote to the distutils list about that, we'll see. Stefan From stefan_ml at behnel.de Sat Sep 15 17:48:30 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 15 Sep 2007 17:48:30 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E97892.9010008@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> <46E97352.4050704@behnel.de> <46E97892.9010008@gmx.de> Message-ID: <46EBFECE.4030605@behnel.de> David Danier wrote: >>> As this takes some time I will post >>> the result later. >> Please do. > > Seems to work properly. 
But I'm not really sure how badly > "main_interpreter" is polluted now. I wouldn't expect much (namespace) pollution - unless there's real evidence that this can become a problem. And a crash is definitely a more important problem than namespace pollution. >> No way. There's a reason why it is there, and it is the same reason why we use it: it's >> simple and usable. Using anything else would mean a lot of rewriting. > > That's sad. What are the chances that patches addressing this problem are > accepted? > (Must review the code first, but I would really like a clean solution here) We always accept patches as long as there is general interest and/or a good motivation behind them. But threading is pretty much an issue by itself in lxml.etree. And the "simplified API" gives you a way to just say "release GIL - call to libxml2 - acquire GIL" and "acquire GIL - run callback code - free GIL". That's as easy as it can get - especially since Cython has support for the latter nowadays. It is very unlikely that this can get any "cleaner" by changing the thread-lock calls. >> You might want to try compiling lxml with "--without-threading", though, which >> disables concurrency support completely (i.e. no more GIL freeing). > > Works, too. But I'm not really sure if it is a good idea to do so, as > Py_NewInterpreter seems to create a thread, see > http://www.python.org/doc/current/api/initialization.html#l2h-820. But I > think this might not be a problem if not using a threaded Apache-MPM. What this option does is that lxml.etree stops freeing the GIL internally when calling into libxml2, which simply disables any concurrency as it keeps the GIL until execution returns to Python code. In particular, the (simplified) Thread API is no longer used, so there should no longer be any threading issues.
Stefan From stefan_ml at behnel.de Sun Sep 16 00:37:06 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 16 Sep 2007 00:37:06 +0200 Subject: [lxml-dev] lxml 2.0alpha2 released Message-ID: <46EC5E92.2050607@behnel.de> Hi all, I just released lxml 2.0alpha2 to PyPI. http://pypi.python.org/pypi/lxml/2.0alpha2 http://codespeak.net/lxml/dev/ It features a number of major API additions that follow the ElementTree library and the future API changes in ElementTree 1.3. The main new features are HTML serialisation support, a feed interface to the parsers, a SAX-like target parser interface, and iterfind() as an iterator version of findall(). All of these are currently more or less experimental, so feedback is warmly welcome. The mailing list is always open for discussion, not only on the new features. The complete changelog is below. Have fun Stefan 2.0alpha2 (2007-09-15) Features added * ET.write(), tostring() and tounicode() now accept a keyword argument "method" that can be one of 'xml' (or None), 'html' or 'text' to serialise as XML, HTML or plain text content. 
* iterfind() method on Elements returns an iterator equivalent to findall() * itertext() method on Elements * Setting a QName object as value of the .text property or as an attribute will resolve its prefix in the respective context * ElementTree-like parser target interface as described in http://effbot.org/elementtree/elementtree-xmlparser.htm * ElementTree-like feed parser interface on XMLParser and HTMLParser (feed() and close() methods) Bugs fixed * lxml failed to serialise namespace declarations of elements other than the root node of a tree * Race condition in XSLT where the resolver context leaked between concurrent XSLT calls Other changes * element.getiterator() returns a list, use element.iter() to retrieve an iterator (ElementTree 1.3 compatible behaviour) From stefan_ml at behnel.de Sun Sep 16 17:38:30 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 16 Sep 2007 17:38:30 +0200 Subject: [lxml-dev] Potential bug in trunk In-Reply-To: <46E91BD9.2010904@bruun-olsen.net> References: <46E91BD9.2010904@bruun-olsen.net> Message-ID: <46ED4DF6.5090100@behnel.de> Anders Bruun Olsen wrote: > I've run into a weird problem. I am making a CherryPy-based application > which uses lxml to do XSLT conversion of XML before sending it to the > browser. It worked fine before switching to trunk (which has the > namespace-patch that I need). Here is the output: > > [13/Sep/2007:13:08:37] HTTP Serving HTTP on http://0.0.0.0:8080/ > [13/Sep/2007:13:08:44] Traceback (most recent call last): > etree._ParserDictionaryContext.getDefaultParser > File "parser.pxi", line 585, in etree._BaseParser._copy > AttributeError: 'lxml.etree._ResolverRegistry' object has no attribute > '_copy' > > Anybody able to venture a guess as to where this bug might lie? Is it in > lxml, cherrypy or my code? It was a bug in lxml that came in as part of the parser code refactoring. I fixed it for alpha2. Thanks for the report. 
Stefan From ebgssth at gmail.com Mon Sep 17 14:02:52 2007 From: ebgssth at gmail.com (js) Date: Mon, 17 Sep 2007 21:02:52 +0900 Subject: [lxml-dev] non-ascii characters get garbled Message-ID: Hello, list. The lxml doc [*1] says that "You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone." [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings but my experience is different from that. For example, the following code doesn't bother with encodings and leaves the work to lxml.etree. According to the doc, this is the right way, but it doesn't work and you'll get garbled characters. (give it a try)
--------------------------------------------------------------------
# -*- coding: utf-8 -*-
from lxml import html as etree
url='http://apple.com/kr'

tree = etree.parse(url)
from pprint import pformat
for t in tree.xpath('//a[text()]'):
    print t.text_content()
--------------------------------------------------------------------
The next one breaks the rule and does all charset conversion by itself. This one works great and all charset conversion succeeds.
--------------------------------------------------------------------
# -*- coding: utf-8 -*-
from lxml import html as etree
from urllib2 import urlopen
from StringIO import StringIO
url='http://apple.com/kr'
res = urlopen(url)
html = res.read().decode(res.headers.getparam('charset'))

tree = etree.parse(StringIO(html))
from pprint import pformat
for t in tree.xpath('//a[text()]'):
    print t.text_content()
--------------------------------------------------------------------
But the latter doesn't always work. Sometimes I get "ValueError: Unicode strings with encoding declaration are not supported. " Is this a known issue? If so, how can I get out of this problem? Are there any workarounds? I tried to figure out the cause of these and looked over the code of lxml and libxml2 but could not find a clue.
(To me this appeared to be not a lxml's problem but libxml2's ,though) Any information would be greatly appriciated. Thanks you in advance. From felwert at uni-bremen.de Mon Sep 17 14:11:23 2007 From: felwert at uni-bremen.de (Frederik Elwert) Date: Mon, 17 Sep 2007 14:11:23 +0200 Subject: [lxml-dev] CSS and lxml Message-ID: <1190031083.7564.21.camel@FredDesk> Hello! I am currently looking into the possibilities to work with CSS in lxml or Python in general. I hope it's not too off-topic, but since it's related to lxml, I thought I might post it here. The specific use case I have in mind would take an element and look for all applying CSS rules. Let's say we have a style element: and in the tree, there are two elements:

Some text

Some other text

Then I'd like to get something like >>> el1.getstyle() {'font-size': '16pt', 'color': 'red'} for the first element and >>> el2.getstyle() {'font-size': '16pt', 'font-weight': 'bold'} for the second one. I know that this is currently not possible. The only true CSS library for Python that I found were cssutils . They have quite sophisticated support for CSS parsing, but I think the library itself is quite DOM-centric and so it's not very pythonic / doesn't fit well to lxml. But more important, it has no real XML bindings. So it's possible to query stylesheets to get properties that match a selector: >>> stylesheet.props('p.strong') {'font-size': '16pt', 'font-weight': 'bold'} but not to query true elements to get the applying properties. On the other hand, lxml now has cssselect, which works the other way around: It takes a selector and returns all the elements that match that selector. >>> sel = CSSSelector('p.strong') >>> [e.text for e in sel(tree)] ['Some other text'] So I just wanted to ask if somebody already had thought about this, or if somebody has any ideas in which direction to head to solve this problem. Maybe one could write a module, that combines cssutils and lxml.cssselect to match css style properties and actual elements. But maybe a completely different approach would be needed. Regards, Frederik From stefan_ml at behnel.de Mon Sep 17 14:41:25 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 14:41:25 +0200 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: References: Message-ID: <46EE75F5.4090606@behnel.de> js wrote: > The lxml doc [*1] says that > "You should generally avoid converting XML/HTML data to unicode before > passing it into the parsers. It is both slower and error prone." > > [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings > > but my experience is different from that. Not quite. As you say below, you sometimes get ValueErrors depending on the page data, so it *is* error prone. 
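A small sketch of the route the documentation recommends, assuming Python 3 (where the "unicode" type discussed here is str); the sample document is invented. The undecoded bytes, including the XML declaration, go to the parser untouched, and the parser decodes them according to the declaration. With lxml, passing already decoded text that still carries an encoding declaration is what triggers the ValueError mentioned in this thread. The import falls back to the standard library's ElementTree only so that the sketch runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same fromstring() call

# Invented sample: UTF-8 encoded bytes as they would arrive from the network,
# complete with their encoding declaration.
raw = '<?xml version="1.0" encoding="UTF-8"?><p>\uc560\ud50c \xa9 2007</p>'.encode("utf-8")

# Hand the undecoded bytes to the parser; it honours the declared encoding.
root = etree.fromstring(raw)
text = root.text  # already a decoded text string
print(repr(text))
```

No manual decode() step is involved, so a wrong or missing charset in the HTTP headers cannot corrupt the data before the parser sees it.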
> For example, the following code doesn't bother encoding things > and leave the work to lxml.etree. > According to the doc, this is right way, but it does't work > and you'll got garbled characters. (give it a try) > > -------------------------------------------------------------------- > # -*- coding: utf-8 -*- > from lxml import html as etree This import makes your code hard to read IMHO. If you use lxml.html, say it. > url='http://apple.com/kr' > > tree = etree.parse(url) > from pprint import pformat > for t in tree.xpath('//a[text()]'): > print t.text_content() > -------------------------------------------------------------------- Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And when I collect the text, it looks perfectly reasonable, including strings like u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 ' This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console. Are you sure it's the text content and not just the console output on your side? > Sometimes I got "ValueError: Unicode strings with encoding declaration > are not supported. " On the same page? I assume you were referring to a different page here that probably uses XHTML instead of HTML, right? The above should work for both - as long as libxml2 can detect the encoding (and if it can't, there's lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed). Stefan From stefan_ml at behnel.de Mon Sep 17 15:07:21 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 15:07:21 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190031083.7564.21.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> Message-ID: <46EE7C09.8000006@behnel.de> Hi, Frederik Elwert wrote: > I am currently looking into the possibilities to work with CSS in lxml > or Python in general. I hope it's not too off-topic, but since it's > related to lxml, I thought I might post it here. 
Sure. > The specific use case I have in mind would take an element and look for > all applying CSS rules. Let's say we have a style element: > > > > and in the tree, there are two elements: > >

Some text

>

Some other text

> > Then I'd like to get something like > >>>> el1.getstyle() > {'font-size': '16pt', 'color': 'red'} > > for the first element and > >>>> el2.getstyle() > {'font-size': '16pt', 'font-weight': 'bold'} > for the second one. > > I know that this is currently not possible. The only true CSS library > for Python that I found were cssutils . > They have quite sophisticated support for CSS parsing, but I think the > library itself is quite DOM-centric and so it's not very pythonic / > doesn't fit well to lxml. But more important, it has no real XML > bindings. So it's possible to query stylesheets to get properties that > match a selector: > >>>> stylesheet.props('p.strong') > {'font-size': '16pt', 'font-weight': 'bold'} > > but not to query true elements to get the applying properties. > > On the other hand, lxml now has cssselect, which works the other way > around: It takes a selector and returns all the elements that match that > selector. > >>>> sel = CSSSelector('p.strong') >>>> [e.text for e in sel(tree)] > ['Some other text'] > > So I just wanted to ask if somebody already had thought about this, or > if somebody has any ideas in which direction to head to solve this > problem. > > Maybe one could write a module, that combines cssutils and > lxml.cssselect to match css style properties and actual elements. But > maybe a completely different approach would be needed. There are a couple of things you have to do here. First, you have to parse CSS, which only the cssutils currently do. Then you have to find out which of the rules apply to an element which AFAICT is not currently supported at all. You could do a brute force test and just take all selectors that you find in all CSS stylesheets in the document or in external references, to match them against the element in question - but that would be quite some overhead. 
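The brute-force test described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the rules are given as already parsed (path, properties) pairs, ElementPath expressions stand in for real CSS selectors (with lxml one would rather use lxml.cssselect), the sample tree is invented, and getstyle() is a hypothetical helper, not an existing lxml function. Cascading and specificity are ignored; later rules simply override earlier ones.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same ElementPath support

# Hypothetical, pre-parsed rules: (ElementPath expression, style properties).
RULES = [
    (".//p", {"font-size": "16pt"}),
    (".//p[@class='strong']", {"font-weight": "bold"}),
]


def getstyle(root, element, rules=RULES):
    """Brute force: evaluate every rule on the tree, keep those matching element."""
    style = {}
    for path, props in rules:
        if any(match is element for match in root.iterfind(path)):
            style.update(props)  # later rules override earlier ones
    return style


root = etree.fromstring(
    "<body><p>Some text</p><p class='strong'>Some other text</p></body>"
)
el1, el2 = root  # the two <p> children

print(getstyle(root, el1))
print(getstyle(root, el2))
```

Every single lookup evaluates all rules against the whole tree, which is where the overhead comes from.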
On the other hand, if style lookup is more frequent than document parsing, you
can build an inverse index: run through all CSS selectors, find the elements
they match and store the style content for each of the elements, thus
aggregating the style properties per element.

You could maybe implement a "cssannotate(stylesheet, tree)" function, which
would map a stylesheet onto a tree by setting (or extending) the "style"
attributes on each element accordingly. That would come pretty close to what
you were looking for.

Stefan

From ebgssth at gmail.com Mon Sep 17 16:09:02 2007
From: ebgssth at gmail.com (js)
Date: Mon, 17 Sep 2007 23:09:02 +0900
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To: <46EE75F5.4090606@behnel.de>
References: <46EE75F5.4090606@behnel.de>
Message-ID:

Thank you for your reply.

> > --------------------------------------------------------------------
> > # -*- coding: utf-8 -*-
> > from lxml import html as etree
>
> This import makes your code hard to read IMHO. If you use lxml.html, say it.

Oh, html is just a slightly different version of etree, so I always use the
above import. That's just my thinking. Next time I'll just say what I'm
using, thanks. Explicit is better than implicit :)

> > url='http://apple.com/kr'
> >
> > tree = etree.parse(url)
> > from pprint import pformat
> > for t in tree.xpath('//a[text()]'):
> >     print t.text_content()
> > --------------------------------------------------------------------
>
> Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
> when I collect the text, it looks perfectly reasonable, including strings like
>
> u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544.
> \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '
>
> This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.
>
> Are you sure it's the text content and not just the console output on your side?

This is on lxml 2.0alpha2 and libxml2 2.6.29_0. I got the following.
$ ./lxml_test.py Apple Store Mac iPod + iTunes Downloads Support ????????? ????? ??? ?? ?"? ?????? ?????? ??????? ? ????????? ??? ????? ?????? ?????? ?? ??? ?????? ????????? ???? ???? ?????? ?????? ?"?????????? - iBook G4 ??? PowerBook G4 ????????? ?????? ??"?(c)? ??? ? ???? ???? ???? ?????? eMac ?????? ?????? ?"?????????? here. ?????(c) ?????? ??????? ???? ??????? ??? This is not a console problem because I can get correct result by using latter method as I said before. > > Sometimes I got "ValueError: Unicode strings with encoding declaration > > are not supported. " > > On the same page? I assume you were referring to a different page here that > probably uses XHTML instead of HTML, right? The above should work for both - > as long as libxml2 can detect the encoding (and if it can't, there's > lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed). Yes, from different page. I got the error when I'm getting http://www.hatena.com/ Thanks. From felwert at uni-bremen.de Mon Sep 17 16:12:40 2007 From: felwert at uni-bremen.de (Frederik Elwert) Date: Mon, 17 Sep 2007 16:12:40 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <46EE7C09.8000006@behnel.de> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> Message-ID: <1190038360.7564.49.camel@FredDesk> Am Montag, den 17.09.2007, 15:07 +0200 schrieb Stefan Behnel: > There are a couple of things you have to do here. First, you have to parse > CSS, which only the cssutils currently do. Then you have to find out which of > the rules apply to an element which AFAICT is not currently supported at all. No, cssutils supports only parsing and generating CSS, but not element-based style selection. And if it would, I guess they'd stick to xml.dom or something. 
> You could do a brute force test and just take all selectors that you find in > all CSS stylesheets in the document or in external references, to match them > against the element in question - but that would be quite some overhead. I thought about that and came to the same conclusion as you do regarding the overhead. > On > the other hand, if style lookup is more frequent than document parsing, you > can build an inverse index: run through all CSS selectors, find the elements > they match and store the style content for each of the elements, thus > aggregating the style properties per element. This would be quite practical, right. I'm just not sure about where to store the information. > You could maybe implement a "cssannotate(stylesheet, tree)" function, which > would map a stylesheet on a tree by setting (or extending) the "style" > attributes on each element accordingly. That would come pretty close to what > you were looking for. This just had the negative side-effect of changing the tree itself. So it would only be applicable for read-only-operations, since one wouldn't want to put all style permanently into style attributes for most use cases. Hm, I have to think about this. But it seems that a combination of lxml.cssselect and cssutils would quite do. Since I don't want to rely on lxml 2.0 yet, I'd wait for the implementation anyway. Thanks for your hints! 
Regards, Frederik From stefan_ml at behnel.de Mon Sep 17 16:37:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 16:37:05 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190038360.7564.49.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> Message-ID: <46EE9111.5050705@behnel.de> Frederik Elwert wrote: > Am Montag, den 17.09.2007, 15:07 +0200 schrieb Stefan Behnel: >> On >> the other hand, if style lookup is more frequent than document parsing, you >> can build an inverse index: run through all CSS selectors, find the elements >> they match and store the style content for each of the elements, thus >> aggregating the style properties per element. > > This would be quite practical, right. I'm just not sure about where to > store the information. If you don't want to alter the tree, you can use a dict to map Elements to a style dict. However, note that Elements are not currently weak referenceable, so you'd have to make sure the trees are discarded after use. >> You could maybe implement a "cssannotate(stylesheet, tree)" function, which >> would map a stylesheet on a tree by setting (or extending) the "style" >> attributes on each element accordingly. That would come pretty close to what >> you were looking for. > > This just had the negative side-effect of changing the tree itself. So > it would only be applicable for read-only-operations, since one wouldn't > want to put all style permanently into style attributes for most use > cases. Agreed. However, you can't store anything in Elements that is not reflected by the underlying tree, as Element objects (which are actually just proxies) can be garbage collected while the tree stays alive. You can also store style information in the tree under a separate namespace. > Hm, I have to think about this. But it seems that a combination of > lxml.cssselect and cssutils would quite do. 
Since I don't want to rely
> on lxml 2.0 yet, I'd wait for the implementation anyway. Thanks for your
> hints!

I don't think cssselect.py uses any 2.0-specific features. Copying it over to
1.3 (or into your own code base) should work as a temporary solution.

Stefan

From gilles.lenfant at gmail.com Mon Sep 17 16:41:40 2007
From: gilles.lenfant at gmail.com (Gilles Lenfant)
Date: Mon, 17 Sep 2007 16:41:40 +0200
Subject: [lxml-dev] XML files starting with BOM
References: <42B618E0-1193-4211-8314-6383654EF8FF@gmail.com>
Message-ID:

Hi from an lxml newbie,

First, many thanks for lxml, which is the easiest XML lib for Python.

lxml doesn't like XML files starting with a BOM (see
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info).

M$Office 2007 documents use such notation in their inner XML files.
And I need to skip all chars from the file until I get a "<" before
passing the stream to lxml. Fortunately, the files are UTF-8.

Is it a bug or a feature?

--
Gilles Lenfant
gilles.lenfant at gmail.com

From felwert at uni-bremen.de Mon Sep 17 17:00:21 2007
From: felwert at uni-bremen.de (Frederik Elwert)
Date: Mon, 17 Sep 2007 17:00:21 +0200
Subject: [lxml-dev] CSS and lxml
In-Reply-To: <46EE9111.5050705@behnel.de>
References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> <46EE9111.5050705@behnel.de>
Message-ID: <1190041221.7564.58.camel@FredDesk>

Am Montag, den 17.09.2007, 16:37 +0200 schrieb Stefan Behnel:
> If you don't want to alter the tree, you can use a dict to map Elements to a
> style dict. However, note that Elements are not currently weak referenceable,
> so you'd have to make sure the trees are discarded after use.

Erm, I must confess, I'm not sure what this means, practically speaking.
Is it enough to "del" the dict after use? Aside from that, this sounds quite good.
> >> You could maybe implement a "cssannotate(stylesheet, tree)" function, which > >> would map a stylesheet on a tree by setting (or extending) the "style" > >> attributes on each element accordingly. That would come pretty close to what > >> you were looking for. > > > > This just had the negative side-effect of changing the tree itself. So > > it would only be applicable for read-only-operations, since one wouldn't > > want to put all style permanently into style attributes for most use > > cases. > > Agreed. However, you can't store anything in Elements that is not reflected by > the underlying tree, as Element objects (which are actually just proxies) can > be garbage collected while the tree stays alive. Yes, sure. So a style dict would have to be a totally separated object, I guess? I think I can live with that. > You can also store style information in the tree under a separate namespace. Hm, true. I have to think about that, since it would introduce some redundancy, but it might be the easiest way to go. > I don't think cssselect.py uses any 2.0 specific features. Copying it over to > 1.3 (or into your own code base) should work as a temporary solution. Ah, that's good, I'll give it a try! Thanks, Frederik From stefan_ml at behnel.de Mon Sep 17 17:00:51 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 17:00:51 +0200 Subject: [lxml-dev] XML files starting with BOM In-Reply-To: References: <42B618E0-1193-4211-8314-6383654EF8FF@gmail.com> Message-ID: <46EE96A3.6080604@behnel.de> Gilles Lenfant wrote: > A first, many thanks for lxml that's the easiest XML lib for Python. :) > lxml doesnt't like XML files starting with a BOM (See http:// > www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info). > > M$Office 2007 documents use such notation in their inner xml files. > And I need to skip all chars from the file until I get a "<" before > passing the stream to lxml. Hopefully, the files are UTF-8. Is this only with UTF-8 BOMs? 
> Is it a bug or a feature ? Parsing BOM-ed XML data should work. Could you give some more detail here? Such as some short example code that shows what you are doing to parse XML data with a BOM and that fails on your machine? Stefan From stefan_ml at behnel.de Mon Sep 17 17:20:51 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 17:20:51 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190041221.7564.58.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> <46EE9111.5050705@behnel.de> <1190041221.7564.58.camel@FredDesk> Message-ID: <46EE9B53.704@behnel.de> Frederik Elwert wrote: > Am Montag, den 17.09.2007, 16:37 +0200 schrieb Stefan Behnel: >> If you don't want to alter the tree, you can use a dict to map Elements to a >> style dict. However, note that Elements are not currently weak referenceable, >> so you'd have to make sure the trees are discarded after use. > > Erm, I must confess, I'm not sure what this means, practically speaking. > Is it enough to "del" the dict after use? Yes. If you use a per-tree dict and delete it when you delete the tree, you will be fine. Weak referencing means that the reference does not count for garbage collection, so when the Element is no longer used, it will not be kept alive only by the reference in the dict. See the weakref module. >> You can also store style information in the tree under a separate namespace. > > Hm, true. I have to think about that, since it would introduce some > redundancy, but it might be the easiest way to go. That's what lxml.objectify does for type annotations. It also provides annotate() and deannotate() functions to annotate everything and to clean up the tree when you're done. 
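To make the per-tree dict idea from this thread a bit more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not lxml or cssutils API: it uses the standard library's xml.etree.ElementTree so that it runs anywhere, `selector_to_path()` and `build_style_index()` are hypothetical helpers, only `tag` and `tag.class` selectors are handled, later rules simply override earlier ones (no CSS specificity), and the rule list is made up for the example. With lxml, the selector matching would presumably go through lxml.cssselect instead.

```python
import xml.etree.ElementTree as ET


def selector_to_path(selector):
    """Translate 'tag' or 'tag.class' into an ElementTree path.

    Hypothetical helper: only these two selector forms are handled, and
    the class attribute has to match exactly.
    """
    tag, _, cls = selector.partition(".")
    path = ".//" + (tag or "*")
    if cls:
        path += "[@class='%s']" % cls
    return path


def build_style_index(rules, tree):
    """Build the inverse index: element -> aggregated style properties.

    `rules` is a list of (selector, properties) pairs in stylesheet order.
    Later rules override earlier ones; specificity is ignored here.
    """
    index = {}
    for selector, props in rules:
        for el in tree.findall(selector_to_path(selector)):
            index.setdefault(el, {}).update(props)
    return index


root = ET.fromstring(
    '<body><p>Some text</p><p class="strong">Some other text</p></body>'
)
# Made-up rules for illustration only.
rules = [
    ("p", {"font-size": "16pt"}),
    ("p.strong", {"font-weight": "bold"}),
]
styles = build_style_index(rules, root)
el1, el2 = root.findall("p")
print(styles[el1])
print(styles[el2])
```

The dict holds strong references to the elements, so it should live and die together with the tree it was built for, e.g. by running `del styles` when the tree is dropped.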
Stefan From stefan_ml at behnel.de Mon Sep 17 20:32:22 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 20:32:22 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070913081704.138210@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> Message-ID: <46EEC836.9030603@behnel.de> Hi Holger, jholg at gmx.de wrote: > Things get difficult if you assign leaf elements and actually instantiate > the python proxy objects. If no TREE attributes get used, these will end up > being "default empty elements", usually string elements. > > Also, once having been serialized, there is no way that leaf elements can > be recognized as ObjectifiedElements without the help of the TREE > attribute. That's the main reason I propose the keep_tree functionality, to > make ObjectifiedElement-leaves survive a creation-serialization-parse > cycle. I think we should do this: if old_pytypename == TREE_PYTYPE: if cetree.findChild(c_node, 0) is NULL: pytype = TREE_PYTYPE else: # check old type Do you still think we need the keep_tree then? Stefan From jholg at gmx.de Tue Sep 18 09:21:07 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 18 Sep 2007 09:21:07 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46EEC836.9030603@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> Message-ID: <20070918072107.19040@gmx.net> Hello Stefan, > > attribute. That's the main reason I propose the keep_tree functionality, > to > > make ObjectifiedElement-leaves survive a creation-serialization-parse > > cycle. 
> > I think we should do this: > > if old_pytypename == TREE_PYTYPE: > if cetree.findChild(c_node, 0) is NULL: > pytype = TREE_PYTYPE > else: > # check old type > > Do you still think we need the keep_tree then? You really don't like it, do you ;-)? I'd say this should work and remove the need for keep_tree, though. Sidenote: So I thought maybe we should revise the use of TREE in objectify in general, but one has to be very careful. You really want to have it e.g. in objectify.Element(): >>> o = objectify.Element("structural") >>> e = etree.Element("structural") >>> type(o), type(e) (, ) >>> root.o = o >>> root.e = e >>> # Now type lookup can not rely on parent == None ... >>> type(root.o), type(root.e) (, ) >>> Holger -- Psssst! Schon vom neuen GMX MultiMessenger geh?rt? Der kanns mit allen: http://www.gmx.net/de/go/multimessenger From stefan_ml at behnel.de Tue Sep 18 10:42:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 18 Sep 2007 10:42:19 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070918072107.19040@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> Message-ID: <46EF8F6B.8040403@behnel.de> Hi Holger, jholg at gmx.de wrote: >> I think we should do this: >> >> if old_pytypename == TREE_PYTYPE: >> if cetree.findChild(c_node, 0) is NULL: >> pytype = TREE_PYTYPE >> else: >> # check old type >> >> Do you still think we need the keep_tree then? > > You really don't like it, do you ;-)? > I'd say this should work and remove the need for keep_tree, though. Ok. I also added the tests from your patch now. Obvious question then: anything still missing from what your last patch did? > Sidenote: So I thought maybe we should revise the use of TREE in objectify in general, but one has to be very careful. You really want to have it e.g. 
in objectify.Element():

I think we should, and we should restrict its use to a minimum. If you want,
you can take a look at it. I don't feel like touching working code at the
moment. :)

>>>> o = objectify.Element("structural")
>>>> e = etree.Element("structural")
>>>> type(o), type(e)
> (, )

Whatever. I don't want any code to rely on *that*. :)
(but I can see what you're getting at)

>>>> root.o = o
>>>> root.e = e
>>>> # Now type lookup can not rely on parent == None
> ...
>>>> type(root.o), type(root.e)
> (, )

I'm not (any longer :) questioning the TREE type in general. I just think we
should not write annotations where we know we will not need them.

Stefan

From jholg at gmx.de Tue Sep 18 10:57:58 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 18 Sep 2007 10:57:58 +0200
Subject: [lxml-dev] annotate, pyannotate, xsiannotate
In-Reply-To: <46EF8F6B.8040403@behnel.de>
References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de>
Message-ID: <20070918085758.19050@gmx.net>

> Ok. I also added the tests from your patch now.
>
> Obvious question then: anything still missing from what your last patch
> did?

I'll take a look.

> you can take a look at it. I don't feel like touching working code at the
> moment. :)

I already peeked, and there are really not many places where TREE is used.
I'd say it is needed anywhere it currently is, but I'll take a closer look.

Holger
--
Is your browser Vista-compatible? Download the latest browser versions now: http://www.gmx.net/de/go/browser

From ebgssth at gmail.com Tue Sep 18 15:56:21 2007
From: ebgssth at gmail.com (js)
Date: Tue, 18 Sep 2007 22:56:21 +0900
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To: <46EE75F5.4090606@behnel.de>
References: <46EE75F5.4090606@behnel.de>
Message-ID:

Hello again.
I downgraded libxml2 from 2.6.29_0 to 2.6.27_0 and re-run the test script. surprise, Now it all works as in the lxml doc! seems newer libxml2 has some problem converting charset. (2.6.28_1 doesn't work either.) I'll look at libxml2's source. Thank you. On 9/17/07, Stefan Behnel wrote: > > js wrote: > > The lxml doc [*1] says that > > "You should generally avoid converting XML/HTML data to unicode before > > passing it into the parsers. It is both slower and error prone." > > > > [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings > > > > but my experience is different from that. > > Not quite. As you say below, you sometimes get ValueErrors depending on the > page data, so it *is* error prone. > > > > For example, the following code doesn't bother encoding things > > and leave the work to lxml.etree. > > According to the doc, this is right way, but it does't work > > and you'll got garbled characters. (give it a try) > > > > -------------------------------------------------------------------- > > # -*- coding: utf-8 -*- > > from lxml import html as etree > > This import makes your code hard to read IMHO. If you use lxml.html, say it. > > > > url='http://apple.com/kr' > > > > tree = etree.parse(url) > > from pprint import pformat > > for t in tree.xpath('//a[text()]'): > > print t.text_content() > > -------------------------------------------------------------------- > > Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And > when I collect the text, it looks perfectly reasonable, including strings like > > u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. > \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 ' > > This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console. > > Are you sure it's the text content and not just the console output on your side? > > > > Sometimes I got "ValueError: Unicode strings with encoding declaration > > are not supported. " > > On the same page? 
I assume you were referring to a different page here that
> probably uses XHTML instead of HTML, right? The above should work for both -
> as long as libxml2 can detect the encoding (and if it can't, there's
> lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).
>
> Stefan
>

From jholg at gmx.de Wed Sep 19 13:24:09 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 19 Sep 2007 13:24:09 +0200
Subject: [lxml-dev] annotate, pyannotate, xsiannotate
In-Reply-To: <20070918085758.19050@gmx.net>
References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net>
Message-ID: <20070919112409.271040@gmx.net>

Hi,

attached patch
- enhances the annotation tests to check the TREE-attribute survival for
  leaf-TREE-elements in annotate/pyannotate/xsiannotate
- fixes a bug in DataElement that did not set py:pytype correctly when
  invoked with unicode string args and adds some tests for this.

I renamed _get_pytypename() to _pytypename() (internal) and
__get_pytypename() to pytypename() (public), so DataElement() now uses
_pytypename() rather than _typename().
Holger

Btw I'm getting core dumps in the schematron tests:

685/802 ( 85.4%): test_schematron_invalid_schema_empty (...hematron.ETreeSchematronTestCase)
Segmentation Fault (core dumped)

#0 0xff0b3218 in strlen () from /usr/lib/libc.so.1
#1 0xff106530 in _doprnt () from /usr/lib/libc.so.1
#2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1
#3 0xfe2b7874 in __xmlRaiseError () from /apps/prod/lib/libxml2.so.2
#4 0xfe45fd5c in xmlSchematronPErr () from /apps/prod/lib/libxml2.so.2
#5 0xfe462d24 in xmlSchematronParse () from /apps/prod/lib/libxml2.so.2
#6 0xfe60f080 in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x829a10, __pyx_args=0x8872c8, __pyx_kwds=0x109b04) at src/lxml/etree.c:4131
#7 0x58504 in type_call (type=0xfe665988, args=0x829d30, kwds=0x89d4b0) at Objects/typeobject.c:443
#8 0x260c4 in PyObject_Call (func=0x829a10, arg=0x829d30, kw=0x89d4b0) at Objects/abstract.c:1802

--
Psst! Heard about the new GMX MultiMessenger yet? It works with them all: http://www.gmx.net/de/go/multimessenger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_treesurvival_dataelement_ustr.patch Type: application/octet-stream Size: 9707 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070919/fc92de37/attachment-0001.obj From stefan_ml at behnel.de Wed Sep 19 15:03:25 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 19 Sep 2007 15:03:25 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070919112409.271040@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> Message-ID: <46F11E1D.70309@behnel.de> jholg at gmx.de wrote: > attached patch > - enhances the annotation tests to check the TREE-attribute survival for leaf-TREE-elements in annotate/pyannotate/xsiannotate Ok. > - fixes a bug in DataElement that did not set py:pytype correctly when invoked with unicode string args and adds some tests for this. You should just commit this kind of fixes instead of sending them to the list. > I renamed _get_pytypename() to _pytypename() (internal) and __get_pytypename() to pytypename() (public), so DataElement() now uses _pytypename() rather than _type