From stefan_ml at behnel.de Sat Sep 1 10:21:29 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 01 Sep 2007 10:21:29 +0200
Subject: [lxml-dev] objectify factories
In-Reply-To: <20070830142730.280890@gmx.net>
References: <46C945BF.9020805@behnel.de> <46CC80A4.7010607@behnel.de>
 <20070830114448.280910@gmx.net> <46D6BC03.4050103@behnel.de>
 <20070830142730.280890@gmx.net>
Message-ID: <46D92109.8070105@behnel.de>

jholg at gmx.de wrote:
>> Given the current behaviour of _setElementValue(), I'd say it should just
>> go and annotate everything it produces.
>
> Meaning an additional TypedElementMaker, right?

I think it is actually nice to have the not-annotating ElementMaker as a
choice. BTW, that's easy to achieve, I just added a simple "annotate=True"
keyword argument to objectify.ElementMaker (not committed yet). If you create
a new E factory and pass False, it will just skip the annotation step.

Stefan

From stefan_ml at behnel.de Sun Sep 2 18:25:58 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 02 Sep 2007 18:25:58 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
Message-ID: <46DAE416.1080807@behnel.de>

Hi all,

I'm proudly announcing the first alpha release of lxml 2.0. It features a
major cleanup both behind the scenes and at the surface that improves the
XML tool integration and makes the API clearer and more consistent in many
places. The major new addition, however, is the lxml.html package, a new
toolkit for HTML handling.

The web site for the pre-2.0 series is online at

http://codespeak.net/lxml/dev/

The "what's new" page has a description of the major changes:

http://codespeak.net/lxml/dev/lxml2.html

and the ChangeLog has a more detailed list, see below.

This being an alpha release means that not everything is stable, both in
terms of crashes and the API. There will be a small number of alpha releases
to make the advancements publicly available, before the beta releases focus
on improving the stability.
I warmly invite everyone to contribute to the final release by discussing
the API changes and the new features on the mailing list. There is always
space for improvements!

There is currently a known problem with Microsoft's compilers, so Windows
builds may not become available for 2.0alpha1. The next alpha will hopefully
come with prebuilt binaries for that platform. Building with the more
standards compliant MinGW compilers should work.

Note that working on the code now requires Cython (version 0.9.6.5), an
enhanced fork of Pyrex. lxml therefore no longer ships with a copy of Pyrex
or Cython, but as usual, building from the distribution sources does not
require Cython. It can be installed with "easy_install Cython" or from here:

http://www.cython.org/

I hope that lxml 2.0 will become a straight continuation of the success
story that lxml 1.x was already.

Have fun,
Stefan


2.0alpha1 (2007-09-02)

Features added

* Reimplemented objectify.E for better performance and improved integration
  with objectify. Provides extended type support based on registered PyTypes.

* XSLT objects now support deep copying

* New makeSubElement() C-API function that allows creating a new subelement
  straight with text, tail and attributes.

* XPath extension functions can now access the current context node
  (context.context_node) and use a context dictionary (context.eval_context)
  from the context provided in their first parameter

* HTML tag soup parser based on BeautifulSoup in lxml.html.ElementSoup

* New module lxml.doctestcompare by Ian Bicking for writing simplified
  doctests based on XML/HTML output. Use by importing lxml.usedoctest or
  lxml.html.usedoctest from within a doctest.

* New module lxml.cssselect by Ian Bicking for selecting Elements with CSS
  selectors.

* New package lxml.html written by Ian Bicking for advanced HTML treatment.

* Namespace class setup is now local to the ElementNamespaceClassLookup
  instance and no longer global.
* Schematron validation (incomplete in libxml2)

* Additional stringify argument to objectify.PyType() takes a conversion
  function to strings to support setting text values from arbitrary types.

* Entity support through an Entity factory and element classes. XML parsers
  now have a resolve_entities keyword argument that can be set to False to
  keep entities in the document.

* column field on error log entries to accompany the line field

* Error specific messages in XPath parsing and evaluation
  NOTE: for evaluation errors, you will now get an XPathEvalError instead of
  an XPathSyntaxError. To catch both, you can except on XPathError.

* The regular expression functions in XPath now support passing a node-set
  instead of a string

* Extended type annotation in objectify: new xsiannotate() function

* EXSLT RegExp support in standard XPath (not only XSLT)

Bugs fixed

* lxml.etree did not check tag/attribute names

* The XML parser did not report undefined entities as error

* The text in exceptions raised by XML parsers, validators and XPath
  evaluators now reports the first error that occurred instead of the last

* Passing '' as XPath namespace prefix did not raise an error

* Thread safety in XPath evaluators

Other changes

* objectify.PyType for None is now called "NoneType"

* el.getiterator() renamed to el.iter(), following ElementTree 1.3 -
  original name is still available as alias

* In the public C-API, findOrBuildNodeNs() was replaced by the more generic
  findOrBuildNodeNsPrefix

* Major refactoring in XPath/XSLT extension function code

* Network access in parsers disabled by default

From stefan_ml at behnel.de Mon Sep 3 09:29:43 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 03 Sep 2007 09:29:43 +0200
Subject: [lxml-dev] [XML-SIG] lxml 2.0alpha1 released
In-Reply-To: <46DAEC38.6060900@comcast.net>
References: <46DAE500.40408@behnel.de> <46DAEC38.6060900@comcast.net>
Message-ID: <46DBB7E7.1060309@behnel.de>

Hi,

Gloria W wrote:
> Stefan, congratulations.
> This is definitely useful.

Thanks! :)

> Please talk a bit about the API, and how it differs/varies from
> cElementTree,

http://codespeak.net/lxml/dev/compatibility.html

> or link to some examples.

The docs are full of doctest examples. However, as lxml.html is still pretty
new, its docs are not as comprehensive as those for lxml.etree yet.

> For example, the node nesting,
> the usage of a 'tail' for trailing text. I wonder if lxml offers more of
> a DOM compliant node nesting, or if it conforms to the
> conventions/oddities of ElementTree.

lxml.etree aims for ElementTree compatibility, so these things work alike.
The above link describes the differences that we either cannot work around
for technical reasons (or performance reasons) or that are deliberate
decisions where we think ElementTree is wrong.

Note that the ElementTree API is more and more becoming a basis for other
APIs in lxml. There is lxml.objectify, which replaces a lot of this API by
something that works more like Python objects themselves (a data binding
approach). lxml.html extends the API with a bunch of helper methods for link
handling and also changes the way the serialisation works to better adapt it
to HTML. There's also MathDOM, a MathML implementation, which was the
original reason for making lxml extensible at the Element level, back in the
days of lxml 0.7. The original idea was actually 'stolen' from Xist, although
lxml has definitely found its own way of dealing with it.

The one thing I like most about lxml is the tool integration. For example,
you can use the Element API in lxml.etree or lxml.objectify or lxml.html,
with any of the five path languages: ElementPath, ETXPath, XPath,
CSS-Selectors or ObjectPath.

I think this is a trend that should continue. Most XML/HTML formats can
benefit from specialised Element classes with specially adapted or added
methods, properties and even different tree behaviour, while still taking
advantage of all the other tools that lxml provides.
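
To make the remark about path languages a little more concrete, here is a
minimal sketch that runs three of them (ElementPath, XPath and ETXPath)
against the same lxml.etree tree. The tiny document and the example.com
namespace are made up for illustration only; CSS selectors and ObjectPath
are left out because they need the cssselect and objectify machinery.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
root = etree.XML(
    '<root xmlns:x="http://example.com/ns">'
    '<item id="1">one</item>'
    '<x:item id="2">two</x:item>'
    '</root>'
)

# ElementPath: the find*() methods known from the ElementTree API.
first = root.find("item")
print(first.text)

# XPath: full XPath 1.0, with namespace prefixes mapped explicitly.
ids = root.xpath("//x:item/@id", namespaces={"x": "http://example.com/ns"})
print(ids)

# ETXPath: XPath expressions written with ElementTree's {namespace}tag notation.
find_ns_items = etree.ETXPath("//{http://example.com/ns}item")
print([el.text for el in find_ns_items(root)])
```

All three calls operate on the very same Element objects, so results from
one path language can be fed straight into another.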
The possibilities that lxml offers here are close to unlimited (both at the
Python level and at the C level) - even with the 'oddities' (as you called
them) of ElementTree. I personally believe that .tail attributes are actually
a big advantage, as ignoring text nodes simplifies the tree model
considerably (well, the public one, not necessarily the internal one...)

> Also show us how it differs from BeautifulSoup, which has extremely
> robust unicode handling and mangled XML/HTML tag completion, but may
> benchmark a bit slower.

libxml2 does not have as robust support for HTML-like tag soup as
BeautifulSoup, but it does a pretty good job anyway. In lxml 2.0, lxml.html
comes with BeautifulSoup integration (as ElementTree does), so now you can
have both: a tag soup parser and all the features of lxml.

Stefan

From mantegazza at ill.fr Mon Sep 3 09:38:09 2007
From: mantegazza at ill.fr (Frédéric Mantegazza)
Date: Mon, 3 Sep 2007 09:38:09 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DAE416.1080807@behnel.de>
References: <46DAE416.1080807@behnel.de>
Message-ID: <200709030938.09485.mantegazza@ill.fr>

On Sunday 2 September 2007 18:25, Stefan Behnel wrote:

>     * XSLT objects now support deep copying

Good ;o)

--
Frédéric

From stefan_ml at behnel.de Mon Sep 3 09:55:14 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 03 Sep 2007 09:55:14 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <200709030938.09485.mantegazza@ill.fr>
References: <46DAE416.1080807@behnel.de> <200709030938.09485.mantegazza@ill.fr>
Message-ID: <46DBBDE2.4040702@behnel.de>

Frédéric Mantegazza wrote:
> On Sunday 2 September 2007 18:25, Stefan Behnel wrote:
>
>>     * XSLT objects now support deep copying
>
> Good ;o)

... although that's such a recent feature that I wouldn't bet my household
on it. Since you had code that stumbled over the lack of that feature, could
you give it some more testing so that we can see if it works?
Especially in the threaded case?

Thanks,
Stefan

From mantegazza at ill.fr Mon Sep 3 10:14:47 2007
From: mantegazza at ill.fr (Frédéric Mantegazza)
Date: Mon, 3 Sep 2007 10:14:47 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DBBDE2.4040702@behnel.de>
References: <46DAE416.1080807@behnel.de> <200709030938.09485.mantegazza@ill.fr>
 <46DBBDE2.4040702@behnel.de>
Message-ID: <200709031015.04577.mantegazza@ill.fr>

On Monday 3 September 2007 09:55, Stefan Behnel wrote:
> Frédéric Mantegazza wrote:
> > On Sunday 2 September 2007 18:25, Stefan Behnel wrote:
> >>     * XSLT objects now support deep copying
> >
> > Good ;o)
>
> ... although that's such a recent feature that I wouldn't bet my
> household on it. Since you had code that stumbled over the lack of that
> feature, could you give it some more testing so that we can see if it
> works? Especially in the threaded case?

OK, I will run some tests.

--
Frédéric

From jholg at gmx.de Mon Sep 3 12:59:29 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 03 Sep 2007 12:59:29 +0200
Subject: [lxml-dev] cython + python2.4 problem (was: objectify factories)
In-Reply-To: <46D86DC7.3080304@behnel.de>
References: <46C945BF.9020805@behnel.de> <46CC80A4.7010607@behnel.de>
 <20070830114448.280910@gmx.net> <46D6BC03.4050103@behnel.de>
 <20070830142730.280890@gmx.net> <46D7D934.1090909@behnel.de>
 <20070831093430.303420@gmx.net> <46D7E1C3.7060706@behnel.de>
 <20070831160504.231930@gmx.net> <46D86DC7.3080304@behnel.de>
Message-ID: <20070903105929.168580@gmx.net>

Hi,

> the trunk now builds with Cython instead of Pyrex, so please install it to
> get rid of the one failing doctest. (the reason the test fails is that
> Cython knows about the package you specify in distutils, Pyrex ignores it).
>
> http://www.cython.org/
>
> lxml requires Cython 0.9.6.5.

Lazy me, not like you hadn't announced that quite a while ago.
However, unfortunately: Just downloaded cython and tried to build lxml:

0 lb54320 at adevp02 .../lxml $ /apps/pydev/bin/python2.4 setup.py build
Traceback (most recent call last):
  File "setup.py", line 28, in ?
    import setupinfo
  File "/data/pydev/hjoukl/LXML/lxml/setupinfo.py", line 5, in ?
    from Cython.Distutils import build_ext as build_pyx
[...]
  File "/apps/pydev/lib/python2.4/site-packages/Cython/Compiler/TypeSlots.py", line 88
    full_args = "O" + self.fixed_arg_format if self.has_dummy_arg else self.fixed_arg_format
                                             ^
SyntaxError: invalid syntax

Seems like cython relies on Python2.5 syntax, which renders it unusable for
me. Any chance to remove the hard 2.5-syntax-dependency? At a quick glance
this seems to be the only place where conditional expressions turn up.

Holger

From jholg at gmx.de Tue Sep 4 12:00:01 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 04 Sep 2007 12:00:01 +0200
Subject: [lxml-dev] lxml 2.0alpha1 released
In-Reply-To: <46DAE416.1080807@behnel.de>
References: <46DAE416.1080807@behnel.de>
Message-ID: <20070904100001.277890@gmx.net>

Hi,

> * Extended type annotation in objectify: new xsiannotate() function

I propose renaming the existing annotate() function to pyannotate() and
adding a public interface annotate() to the internal _annotate(), so you can
xsi-typify and py-typify in one step.
I also think it would be better to change the defaults of the "ignore_old"
keyword args of the annotation functions to False, to avoid:

>>> root = E.root(E.i(23), E.s("12"), E.sub())
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    i = 23 [IntElement]
      * py:pytype = 'int'
    s = '12' [StringElement]
      * py:pytype = 'str'
    sub = '' [StringElement]
>>> objectify.annotate(root)
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    i = 23 [IntElement]
      * py:pytype = 'int'
    s = 12 [IntElement]
      * py:pytype = 'int'
    sub = '' [StringElement]
>>>

where you lose the "str" type information of root.s. I think the current
default is a bit counter-intuitive.

What do you say?

Holger

From mantegazza at ill.fr Tue Sep 4 15:39:32 2007
From: mantegazza at ill.fr (Frédéric Mantegazza)
Date: Tue, 4 Sep 2007 15:39:32 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <46D69BBC.4070701@behnel.de>
References: <46D69BBC.4070701@behnel.de>
Message-ID: <200709041539.33032.mantegazza@ill.fr>

On Thursday 30 August 2007 12:28, Stefan Behnel wrote:

> I just released lxml 1.3.4 to PyPI. It has a minor bug fix and a few
> compatibility enhancements both backwards and forwards. Changelog
> follows.

On my debian etch (stable), I only have setuptools 0.6c3, but lxml 1.3.4
needs at least 0.6c5... I just changed the test in setup.py, and all
compiled fine. But may I have some problems? Do 1.3.4 *really* need
setuptools >=0.6c5?
--
Frédéric

From jholg at gmx.de Tue Sep 4 15:46:41 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 04 Sep 2007 15:46:41 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <200709041539.33032.mantegazza@ill.fr>
References: <46D69BBC.4070701@behnel.de> <200709041539.33032.mantegazza@ill.fr>
Message-ID: <20070904134641.117140@gmx.net>

Hi,

> On my debian etch (stable), I only have setuptools 0.6c3, but lxml 1.3.4
> needs at least 0.6c5... I just changed the test in setup.py, and all
> compiled fine. But may I have some problems? Do 1.3.4 *really* need
> setuptools >=0.6c5?

I had the same issue on my kubuntu system the other day, and did the same
thing to get it to build (I *think* that was svn trunk, though), because I'd
rather use os packages than easy_install newer setuptools, if possible.

Worked for me.

Holger

From mantegazza at ill.fr Tue Sep 4 15:50:55 2007
From: mantegazza at ill.fr (Frédéric Mantegazza)
Date: Tue, 4 Sep 2007 15:50:55 +0200
Subject: [lxml-dev] lxml 1.3.4 released
In-Reply-To: <20070904134641.117140@gmx.net>
References: <46D69BBC.4070701@behnel.de> <200709041539.33032.mantegazza@ill.fr>
 <20070904134641.117140@gmx.net>
Message-ID: <200709041550.57485.mantegazza@ill.fr>

On Tuesday 4 September 2007 15:46, jholg at gmx.de wrote:
> > On my debian etch (stable), I only have setuptools 0.6c3, but lxml
> > 1.3.4 needs at least 0.6c5... I just changed the test in setup.py, and
> > all compiled fine. But may I have some problems? Do 1.3.4 *really* need
> > setuptools >=0.6c5?
>
> I had the same issue on my kubuntu system the other day, and did the same
> thing to get it to build (I *think* that was svn trunk, though), because
> I'd rather use os packages than easy_install newer setuptools, if
> possible. Worked for me.
Thanks for the feedback :o)

--
Frédéric

From jtk at yahoo.com Wed Sep 5 22:49:57 2007
From: jtk at yahoo.com (Jeff Kowalczyk)
Date: Wed, 05 Sep 2007 16:49:57 -0400
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
Message-ID:

The 2007-08-30 addition of lxml-2.0alpha1 to PyPI is possibly causing
buildouts depending on lxml to fail with the following error when running
in fetch-newest mode:

Getting required 'zope.testbrowser'
  required by z3c.etestbrowser 1.0.2-r75829.
We have the best distribution that satisfies 'zope.testbrowser'.
Picked: zope.testbrowser = 3.4.1
Getting required 'lxml'
  required by z3c.etestbrowser 1.0.2-r75829.
Getting distribution for 'lxml'.
While:
  Installing test.
  Getting distribution for 'lxml'.
Error: Can't download
http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found

Thanks.

From stefan_ml at behnel.de Wed Sep 5 23:12:48 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 05 Sep 2007 23:12:48 +0200
Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1
In-Reply-To:
References:
Message-ID: <46DF1BD0.60400@behnel.de>

Jeff Kowalczyk wrote:
> The 2007-08-30 addition of lxml-2.0alpha1 to PyPI is possibly causing
> buildouts depending on lxml to fail with the following error when running
> in fetch-newest mode:
>
> Getting required 'zope.testbrowser'
>   required by z3c.etestbrowser 1.0.2-r75829.
> We have the best distribution that satisfies 'zope.testbrowser'.
> Picked: zope.testbrowser = 3.4.1
> Getting required 'lxml'
>   required by z3c.etestbrowser 1.0.2-r75829.
> Getting distribution for 'lxml'.
> While:
>   Installing test.
>   Getting distribution for 'lxml'.
> Error: Can't download
> http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found

Yep, that's because there isn't such a branch yet. It will become available
when 2.0 is released.

I guess we should take that paragraph out till then...
Stefan

From l.oluyede at gmail.com Thu Sep 6 15:41:39 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Thu, 6 Sep 2007 15:41:39 +0200
Subject: [lxml-dev] Resolve RelaxNG document
Message-ID: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>

I'd like to know if there's a practical way to resolve all the
references in a RelaxNG document. Maybe 'expand' is the correct word.

I have a document with some which are mostly custom data
types and some elements using those types. For example:

----
a
b
---

What I'd like is to replace the reference "foo" with the definition of
"foo" as a whole.

Is there an easy way? I read something about custom resolvers...

I use lxml2.0alpha1

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From stefan_ml at behnel.de Thu Sep 6 15:59:30 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Sep 2007 15:59:30 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
Message-ID: <46E007C2.3000107@behnel.de>

Lawrence Oluyede wrote:
> I'd like to know if there's a practical way to resolve all the
> references in a RelaxNG document. Maybe 'expand' is the correct word.
>
> I have a document with some which are mostly custom data
> types and some elements using those types. For example:
>
> ----
> a
> b
> ---
>
> What I'd like is to replace the reference "foo" with the definition of
> "foo" as a whole.

I think the easiest way is to do it by hand, something like:

    resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
    for ref in tree.iter("ref"):
        define = resolve(tree, name = ref.get("name"))
        if define:
            ref.getparent().replace(ref, define[0]) # or define[0][0] ?
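
A more complete, runnable variant of the same idea might look like the
sketch below. Several things in it are assumptions rather than part of the
original exchange: the small grammar is invented (the schema from the
question did not survive in the archive), the standard RELAX NG structure
namespace is filled in for the elided namespaces argument, the element
lookup is namespace-qualified, and each reference is replaced by deep copies
of the children of the matching define, which answers the "define[0][0]"
question one particular way. It does a single pass only, so references
nested inside a definition are not expanded further.

```python
from copy import deepcopy
from lxml import etree

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Hypothetical schema, standing in for the one lost from the original mail.
schema = etree.XML(b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="root">
      <ref name="foo"/>
    </element>
  </start>
  <define name="foo">
    <choice>
      <value>a</value>
      <value>b</value>
    </choice>
  </define>
</grammar>
""")

resolve = etree.XPath("//rng:define[@name = $name]",
                      namespaces={"rng": RNG_NS})

# Take a snapshot of the <ref> elements first, because the tree is modified
# while we walk over them.
for ref in list(schema.iter("{%s}ref" % RNG_NS)):
    matches = resolve(schema, name=ref.get("name"))
    if not matches:
        continue  # leave unresolved references untouched
    parent = ref.getparent()
    position = parent.index(ref)
    # Copy the children of the definition so that the <define> itself stays
    # intact and can serve other references as well.
    for offset, child in enumerate(matches[0]):
        parent.insert(position + offset, deepcopy(child))
    parent.remove(ref)

print(etree.tostring(schema, pretty_print=True).decode())
```

Copying instead of moving matters here: appending or inserting an existing
element moves it out of its old parent, which would empty the definition
after the first reference had been expanded.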
Stefan

From l.oluyede at gmail.com Thu Sep 6 16:36:16 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Thu, 6 Sep 2007 16:36:16 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <46E007C2.3000107@behnel.de>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
 <46E007C2.3000107@behnel.de>
Message-ID: <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>

> I think the easiest way is to do it by hand, something like:
>
>     resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
>     for ref in tree.iter("ref"):
>         define = resolve(tree, name = ref.get("name"))
>         if define:
>             ref.getparent().replace(ref, define[0]) # or define[0][0] ?
>

Ok thanks anyway, I was doing it by hand. I hoped there was something
in the schema validator. How does the relaxng validator know if a
document is valid if it doesn't expand references? That was my thought

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

From stefan_ml at behnel.de Thu Sep 6 17:01:21 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Sep 2007 17:01:21 +0200
Subject: [lxml-dev] Resolve RelaxNG document
In-Reply-To: <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>
References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com>
 <46E007C2.3000107@behnel.de>
 <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com>
Message-ID: <46E01641.3070903@behnel.de>

Lawrence Oluyede wrote:
>> I think the easiest way is to do it by hand, something like:
>>
>>     resolve = etree.XPath("//rng:define[@name = $name]", namespaces=...)
>>     for ref in tree.iter("ref"):
>>         define = resolve(tree, name = ref.get("name"))
>>         if define:
>>             ref.getparent().replace(ref, define[0]) # or define[0][0] ?
>>
>
> Ok thanks anyway, I was doing it by hand. I hoped there was something
> in the schema validator.
How does the relaxng validator knows if a > document is valid if it doesn't expand references? What makes you think it doesn't? It should be part of the evaluation step. However, you can't see that from the outside as the tree you pass in is not modified. Stefan From l.oluyede at gmail.com Thu Sep 6 17:46:30 2007 From: l.oluyede at gmail.com (Lawrence Oluyede) Date: Thu, 6 Sep 2007 17:46:30 +0200 Subject: [lxml-dev] Resolve RelaxNG document In-Reply-To: <46E01641.3070903@behnel.de> References: <9eebf5740709060641i132bc385w91a793a76b61a511@mail.gmail.com> <46E007C2.3000107@behnel.de> <9eebf5740709060736v7804f1bdrd4f24a43172d2570@mail.gmail.com> <46E01641.3070903@behnel.de> Message-ID: <9eebf5740709060846mded4030wbee9392b2b0f1358@mail.gmail.com> > What makes you think it doesn't? That's exactly what I meant. It does obviously, so I hoped there was a way to hook in the evaluation step and grab the expanded references but you just responded here below: > However, you can't see that from the > outside as the tree you pass in is not modified. Thanks! -- Lawrence, oluyede.org - neropercaso.it "It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair From jtk at yahoo.com Thu Sep 6 17:49:08 2007 From: jtk at yahoo.com (Jeff Kowalczyk) Date: Thu, 06 Sep 2007 11:49:08 -0400 Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1 References: <46DF1BD0.60400@behnel.de> Message-ID: Stefan Behnel wrote: > > Getting distribution for 'lxml'. > > While: > > Installing test. > > Getting distribution for 'lxml'. > > Error: Can't download > > http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found > > Yep, that's because there isn't such a branch yet. It will become > available when 2.0 is released. > > I guess we should take that paragraph out till then... 
The latest/pending update http://cheeseshop.python.org/pypi/lxml/2.0alpha2 causes a different error for zc.buildout-1.0.0b30 in fetch-newest mode: Picked: zope.testbrowser = 3.4.1 Getting required 'lxml' required by z3c.etestbrowser 1.0.2-r75829. Getting distribution for 'lxml'. While: Installing test. Getting distribution for 'lxml'. Error: Can't download http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0alpha2.tar.gz: 404 Not Found lxml-2.0alpha2.tar.gz is not (yet, temporarily) available at that URL. I think zc.buildout-1.0.0b30 uses the package's simple index: http://cheeseshop.python.org/simple/lxml/ From jholg at gmx.de Thu Sep 6 18:14:12 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 06 Sep 2007 18:14:12 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate (was: lxml 2.0alpha1 released) In-Reply-To: <20070904100001.277890@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> Message-ID: <20070906161412.136940@gmx.net> Hi, > I propose renaming the existing annotate() function to pyannotate() and > adding a public interface annotate() to the internal _annotate(), so you can > xsi-typify and py-typify in one step. > I also think it would be better to change the defaults of the "ignore_old" > keyword args of the annotation functions to False, to avoid: Attached is patch that does just that, with tests. I quirked the defaults for the new annotate() function that can now py:pytype and xsi:type-annotate in one step so that it behaves just like the former annotate() (at least it passes all the existing unittests which I did not alter) The new pyannotate() and xsiannotate() use different defaults, as suggested. There's an additional keyword arg keep_tree that lets you preserve existing TREE attribute values, if switched on. Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. 
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail -------------- next part -------------- A non-text attachment was scrubbed... Name: pyannotate_annotate.patch Type: application/octet-stream Size: 18093 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070906/adac2214/attachment-0001.obj From stefan_ml at behnel.de Fri Sep 7 06:58:22 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 07 Sep 2007 06:58:22 +0200 Subject: [lxml-dev] buildout lxml fetch 404 error since lxml-2.0alpha1 In-Reply-To: References: <46DF1BD0.60400@behnel.de> Message-ID: <46E0DA6E.4000107@behnel.de> Jeff Kowalczyk wrote: > Stefan Behnel wrote: >>> Getting distribution for 'lxml'. >>> While: >>> Installing test. >>> Getting distribution for 'lxml'. >>> Error: Can't download >>> http://codespeak.net/svn/lxml/branch/lxml-2.0: 404 Not Found >> Yep, that's because there isn't such a branch yet. It will become >> available when 2.0 is released. >> >> I guess we should take that paragraph out till then... > > The latest/pending update http://cheeseshop.python.org/pypi/lxml/2.0alpha2 > causes a different error for zc.buildout-1.0.0b30 in fetch-newest mode: > > Picked: zope.testbrowser = 3.4.1 > Getting required 'lxml' > required by z3c.etestbrowser 1.0.2-r75829. > Getting distribution for 'lxml'. > While: > Installing test. > Getting distribution for 'lxml'. > Error: Can't download > http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0alpha2.tar.gz: > 404 Not Found > > lxml-2.0alpha2.tar.gz is not (yet, temporarily) available at that URL. I > think zc.buildout-1.0.0b30 uses the package's simple index: > http://cheeseshop.python.org/simple/lxml/ Ah, didn't even know that existed... I accidentally registered alpha2 when updating the branch link, as I had already increased the trunk version. Should be fixed now. Thanks for making me aware of this. 
Stefan From l.oluyede at gmail.com Sun Sep 9 19:32:27 2007 From: l.oluyede at gmail.com (Lawrence Oluyede) Date: Sun, 9 Sep 2007 19:32:27 +0200 Subject: [lxml-dev] Reparenting a node Message-ID: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com> I have a doc A and a doc B, I'd like to put a node extracted from A in the document B but I always get a ValueError: ValueError: Element is not a child of this node. I didn't find any "setparent" in the API. How can I do this? -- Lawrence, oluyede.org - neropercaso.it "It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair From stefan_ml at behnel.de Sun Sep 9 19:57:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 09 Sep 2007 19:57:12 +0200 Subject: [lxml-dev] Reparenting a node In-Reply-To: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com> References: <9eebf5740709091032y1d146dfeia8a1be874bbf57e1@mail.gmail.com> Message-ID: <46E433F8.6090501@behnel.de> Lawrence Oluyede wrote: > I have a doc A and a doc B, I'd like to put a node extracted from A in > the document B but I always get a ValueError: > > ValueError: Element is not a child of this node. Sounds like you're using remove() or index(), no need to do that. > I didn't find any "setparent" in the API. > > How can I do this? try node_in_B.append(node_in_A) See the "Elements are lists" section in the tutorial. Stefan From stefan_ml at behnel.de Tue Sep 11 22:00:04 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 11 Sep 2007 22:00:04 +0200 Subject: [lxml-dev] ET compatible parser interfaces Message-ID: <46E6F3C4.7050403@behnel.de> Hi all, just wanted to send a note that I implemented ElementTree compatible interfaces on top of lxml's parsers. 
This means that you can now do two additional things: use "parser.feed(data)" and "parser.close()" to pass data to a parser in a step-by-step fashion, and pass a "target" keyword argument to a parser to receive SAX-like method calls on the object you pass. The interface is described here: http://effbot.org/elementtree/elementtree-xmlparser.htm This *should* work for both XML and HTML (the latter was hard enough to implement due to differences in the libxml2 API). I also added a parser section to the lxml tutorial while I was at it. The current down-side is that the trunk will require a patched version of Cython until the next Cython release. I added the patch to SVN (and to the Cython bug tracker). The reason is a syntax addition that allows grabbing the GIL during a function call, which simplifies the implementation considerably. Have fun, Stefan From anders at bruun-olsen.net Tue Sep 11 22:14:11 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Tue, 11 Sep 2007 22:14:11 +0200 Subject: [lxml-dev] Serialization with namespaces Message-ID: <46E6F713.3070801@bruun-olsen.net> Hi, I need to chop up some XML based on XPath expressions and serialize the resulting chunks individually. I thought LXML would be perfect for this task but have run into some problems. Here is the sample I use, test.xtm: Abele Henriksdatter i Radsted, Gotfred Bangs hustru First I parse the file and grab the root: >>> from lxml import etree >>> tree = etree.parse("test.xtm") >>> root = tree.getroot() >>> root.nsmap {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': 'http://www.w3.org/1999/xlink'} Then I do a little XPath magic: >>> find_topics = etree.ETXPath("//{%s}topic" % root.nsmap[None]) >>> elem = find_topics(root)[0] >>> elem Now the problem occurs when I try to serialize. When I serialize the root, everything looks fine: >>> etree.tostring(root, pretty_print=True) ' ... The XML Namespace is applied as it should. 
However on the topic-element that I found using XPath no XML Namespace is output: >>> etree.tostring(elem, pretty_print=True) '\n\t\t\n\t\t\t>> elem.nsmap {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': 'http://www.w3.org/1999/xlink'} I realize this might be because the element is not the root of the current document. How can I make LXML output the xmlns in this case? -- Anders From stefan_ml at behnel.de Wed Sep 12 12:19:17 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 12:19:17 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E6F713.3070801@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> Message-ID: <46E7BD25.4020100@behnel.de> Anders Bruun Olsen wrote: > Now the problem occurs when I try to serialize. When I serialize the > root, everything looks fine: > > >>> etree.tostring(root, pretty_print=True) > ' xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> > ... > > The XML Namespace is applied as it should. However on the topic-element > that I found using XPath no XML Namespace is output: > > >>> etree.tostring(elem, pretty_print=True) > '\n\t\t\n\t\t\t ... > > Even though the nsmap attribute is set correctly: > > >>> elem.nsmap > {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': > 'http://www.w3.org/1999/xlink'} Hmm, I actually thought these problems were gone with 1.3, but I can reproduce this with the current trunk. I'll look into it. Stefan From stefan_ml at behnel.de Wed Sep 12 12:47:30 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 12:47:30 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7BD25.4020100@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> Message-ID: <46E7C3C2.10103@behnel.de> Stefan Behnel wrote: > Anders Bruun Olsen wrote: >> Now the problem occurs when I try to serialize. 
When I serialize the >> root, everything looks fine: >> >> >>> etree.tostring(root, pretty_print=True) >> '> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >> ... >> >> The XML Namespace is applied as it should. However on the topic-element >> that I found using XPath no XML Namespace is output: >> >> >>> etree.tostring(elem, pretty_print=True) >> '\n\t\t\n\t\t\t> ... >> >> Even though the nsmap attribute is set correctly: >> >> >>> elem.nsmap >> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >> 'http://www.w3.org/1999/xlink'} > > Hmm, I actually thought these problems were gone with 1.3, but I can reproduce > this with the current trunk. Ok, so the problem here is libxml2. It serialises only the namespaces that are defined on the node itself, not all those that are defined in the node's context. I'll try to work around it. Stefan From stefan_ml at behnel.de Wed Sep 12 14:54:08 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 14:54:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7C3C2.10103@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> Message-ID: <46E7E170.3000803@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> Anders Bruun Olsen wrote: >>> Now the problem occurs when I try to serialize. When I serialize the >>> root, everything looks fine: >>> >>> >>> etree.tostring(root, pretty_print=True) >>> '>> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >>> ... >>> >>> The XML Namespace is applied as it should. However on the topic-element >>> that I found using XPath no XML Namespace is output: >>> >>> >>> etree.tostring(elem, pretty_print=True) >>> '\n\t\t\n\t\t\t>> ... 
>>> >>> Even though the nsmap attribute is set correctly: >>> >>> >>> elem.nsmap >>> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >>> 'http://www.w3.org/1999/xlink'} >> Hmm, I actually thought these problems were gone with 1.3, but I can reproduce >> this with the current trunk. > > Ok, so the problem here is libxml2. It serialises only the namespaces that are > defined on the node itself, not all those that are defined in the node's context. Here's a patch (against the trunk) that works for me. It copies the node before the serialisation and adds all namespaces that were declared up in the tree. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: ns-serialisation.patch Type: text/x-diff Size: 3984 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070912/c376e7b9/attachment.bin From anders at bruun-olsen.net Wed Sep 12 15:19:21 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:19:21 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E170.3000803@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> Message-ID: <46E7E759.1090308@bruun-olsen.net> Stefan Behnel wrote: > Stefan Behnel wrote: >> Stefan Behnel wrote: >>> Anders Bruun Olsen wrote: >>>> Now the problem occurs when I try to serialize. When I serialize the >>>> root, everything looks fine: >>>> >>>> >>> etree.tostring(root, pretty_print=True) >>>> '>>> xmlns:xlink="http://www.w3.org/1999/xlink" id="personnavnereg1"> >>>> ... >>>> >>>> The XML Namespace is applied as it should. However on the topic-element >>>> that I found using XPath no XML Namespace is output: >>>> >>>> >>> etree.tostring(elem, pretty_print=True) >>>> '\n\t\t\n\t\t\t>>> ... 
>>>> >>>> Even though the nsmap attribute is set correctly: >>>> >>>> >>> elem.nsmap >>>> {None: 'http://www.topicmaps.org/xtm/1.0/', 'xlink': >>>> 'http://www.w3.org/1999/xlink'} >>> Hmm, I actually thought these problems were gone with 1.3, but I can reproduce >>> this with the current trunk. >> Ok, so the problem here is libxml2. It serialises only the namespaces that are >> defined on the node itself, not all those that are defined in the node's context. > > Here's a patch (against the trunk) that works for me. It copies the node > before the serialisation and adds all namespaces that were declared up in the > tree. > > Stefan > Something seems amiss with the patch: $ svn co http://codespeak.net/svn/lxml/trunk lxml $ cd lxml $ patch <~/download/ns-serialisation.patch can't find file to patch at input line 5 Perhaps you should have used the -p or --strip option? The text leading up to this was: -------------------------- |Index: src/lxml/proxy.pxi |=================================================================== |--- src/lxml/proxy.pxi (Revision 46423) |+++ src/lxml/proxy.pxi (Arbeitskopie) -------------------------- File to patch: -- Anders From anders at bruun-olsen.net Wed Sep 12 15:22:08 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:22:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E759.1090308@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> Message-ID: <46E7E800.4010902@bruun-olsen.net> Anders Bruun Olsen wrote: > Something seems amiss with the patch: Sorry, my bad, you have of course already applied it to trunk. 
-- Anders From anders at bruun-olsen.net Wed Sep 12 15:28:08 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Wed, 12 Sep 2007 15:28:08 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E800.4010902@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> Message-ID: <46E7E968.7050808@bruun-olsen.net> Anders Bruun Olsen wrote: > Anders Bruun Olsen wrote: >> Something seems amiss with the patch: > > Sorry, my bad, you have of course already applied it to trunk. > However, it seems that trunk does not build: $ make python setup.py build_ext -i Building with Cython. Building lxml version 2.0.alpha2-46501 running build_ext building 'lxml.etree' extension Error converting Pyrex file to C: ------------------------------------------------------------ ... include "xmlerror.pxi" # Error and log handling include "classlookup.pxi" # Element class lookup mechanisms include "nsclasses.pxi" # Namespace implementation and registry include "docloader.pxi" # Support for custom document loaders include "parser.pxi" # XML Parser include "parsertarget.pxi" # ET Parser target ^ ------------------------------------------------------------ /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found make: *** [inplace] Error 1 -- Anders From ianb at colorstudy.com Wed Sep 12 19:37:00 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 12 Sep 2007 12:37:00 -0500 Subject: [lxml-dev] ET 1.3 Message-ID: <46E823BC.10906@colorstudy.com> I was just reading the ElementTree 1.3 release notes: http://effbot.org/zone/elementtree-13-intro.htm Generally I like the changes. The change from Element as a factory function to Element as a subclassable class (akin to ElementBase), is nice -- I never understood why there was a distinction. Except... 
because "el = Element(tag)" doesn't necessarily mean that "el.__class__ is Element"...? getiterator to iter is a simple seeming change. Since getiterator actually returns an iterable, not an iterator, it's also just a little more accurate. Looks like it also moves to an iterator, not a list. I don't have much of an opinion on the parser and serializer stuff, though I'd love it if there was a proper serializer for HTML (not the dumb XSLT-based thing I put in lxml.html). I notice that elements now give warnings when treated as booleans. I like this a lot, as I've found many bugs in my code where I did "if el" where I should have done "if el is not None". And an element with no children doesn't feel falsish at all to me. I've actually already taken to using len(el) to test for children, just because I can't get myself to commit to this weird-seeming behavior. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Wed Sep 12 18:38:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 18:38:05 +0200 Subject: [lxml-dev] ET compatible parser interfaces In-Reply-To: <46E6F3C4.7050403@behnel.de> References: <46E6F3C4.7050403@behnel.de> Message-ID: <46E815ED.6020604@behnel.de> Stefan Behnel wrote: > The current down-side is that the trunk will require a patched version of > Cython until the next Cython release. ... which has just been released as Cython 0.9.6.6. 
Stefan :) From stefan_ml at behnel.de Wed Sep 12 20:19:49 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 20:19:49 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070906161412.136940@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> Message-ID: <46E82DC5.4010101@behnel.de> jholg at gmx.de wrote: >> I propose renaming the existing annotate() function to pyannotate() and >> adding a public interface annotate() to the internal _annotate(), so you can >> xsi-typify and py-typify in one step. Sure, that's fine. >> I also think it would be better to change the defaults of the "ignore_old" >> keyword args of the annotation functions to False Definitely ok for the new ones. Maybe for annotate() also, I'm not sure yet. > Attached is patch that does just that, with tests. > > I quirked the defaults for the new annotate() function that can now > py:pytype and xsi:type-annotate in one step so that it behaves just like > the former annotate() (at least it passes all the existing unittests which > I did not alter) The new pyannotate() and xsiannotate() use different > defaults, as suggested. I still have to look through the patch a bit more, but I generally like the intention, except: > There's an additional keyword arg keep_tree that lets you preserve existing TREE attribute values, if switched on. No way. :) It doesn't match the existing "ignore_*" parameters and the default is to /remove/ the tree annotation when what we want is to /create/ annotations. Taking one step back: what was the reason again why we started using TREE annotation at all? I mean, it doesn't have any advantage and it currently looks like it's getting in the way. Is there a reason that should keep us from just dropping it? completely? (minus backwards compatibility?) I mean, honestly, it's not used and it's even faster to check for children than it is to look up the attribute... 
Stefan From stefan_ml at behnel.de Wed Sep 12 21:59:43 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 21:59:43 +0200 Subject: [lxml-dev] ET 1.3 In-Reply-To: <46E823BC.10906@colorstudy.com> References: <46E823BC.10906@colorstudy.com> Message-ID: <46E8452F.5000800@behnel.de> Ian Bicking wrote: > I was just reading the ElementTree 1.3 release notes: > > http://effbot.org/zone/elementtree-13-intro.htm Ah, good to know. I already had a few discussions with Fredrik about a couple of features or changes in lxml.etree or ET 1.3, so both are continuously getting closer (especially now that parsers are almost compatible :). > Generally I like the changes. The change from Element as a factory > function to Element as a subclassable class (akin to ElementBase), is > nice Hmm, I'm not even sure we could do that in Cython. Sounds like he's been playing with __new__, not sure Cython supports that. > -- I never understood why there was a distinction. Except... > because "el = Element(tag)" doesn't necessarily mean that "el.__class__ > is Element"...? At least in lxml that's getting pretty rare these days... > getiterator to iter is a simple seeming change. Since getiterator > actually returns an iterable, not an iterator, it's also just a little > more accurate. Looks like it also moves to an iterator, not a list. That's one of the changes Fredrik mentioned a while ago, so lxml.etree already has it in 1.3. > I don't have much of an opinion on the parser and serializer stuff, > though I'd love it if there was a proper serializer for HTML (not the > dumb XSLT-based thing I put in lxml.html). I know. Actually, libxml2 distinguishes between HTML documents and XML documents internally, so we could already take that as a serialisation hint. So, if you parse stuff with HTML() or an HTMLParser, you'd get an HTML document on serialisation, otherwise you'd get an XML document. 
I could also imagine something like a separate ElementTree class in lxml.html that you could wrap any Element in to make sure it gets serialised as plain HTML (and not XHTML). > I notice that elements now give warnings when treated as booleans. I > like this a lot, as I've found many bugs in my code where I did "if el" > where I should have done "if el is not None". And an element with no > children doesn't feel falsish at all to me. I've actually already taken > to using len(el) to test for children, just because I can't get myself > to commit to this weird-seeming behavior. I guess lxml.etree will just follow in 2.0. I'll also take a look through the other changes. There were a few that I had not yet heard of. I like the fact that ET 1.3 and lxml 2.0 share a common alpha phase. That makes additions and learning from each other pretty easy. Stefan From stefan_ml at behnel.de Wed Sep 12 22:02:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Sep 2007 22:02:44 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E7E968.7050808@bruun-olsen.net> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> <46E7E968.7050808@bruun-olsen.net> Message-ID: <46E845E4.3030104@behnel.de> Anders Bruun Olsen wrote: > /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found Ah, thanks. I forgot that file when committing the target parser implementation. Fixed now. Stefan From ianb at colorstudy.com Wed Sep 12 22:48:56 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 12 Sep 2007 15:48:56 -0500 Subject: [lxml-dev] 2.0alpha too visible Message-ID: <46E850B8.8040803@colorstudy.com> I think the 2.0alpha release might be too visible. If you do "easy_install lxml" you get that version. 
One way to help this would be to not upload 2.0alpha to PyPI, but instead just put a link to a tarball with #egg=lxml-twoalpha or something, so it won't be considered newer than 1.3 (but you could install it with easy_install lxml==twoalpha). -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From jholg at gmx.de Thu Sep 13 10:17:04 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 13 Sep 2007 10:17:04 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46E82DC5.4010101@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> Message-ID: <20070913081704.138210@gmx.net> > There's an additional keyword arg keep_tree that lets you preserve > existing TREE attribute values, if switched on. > > No way. :) > > It doesn't match the existing "ignore_*" parameters and the default is to > /remove/ the tree annotation when what we want is to /create/ annotations. Hm, maybe then pyannotate() should rather not default to remove TREE attributes? > Taking one step back: what was the reason again why we started using TREE > annotation at all? I mean, it doesn't have any advantage and it currently > looks like it's getting in the way. Is there a reason that should keep us > from > just dropping it? completely? (minus backwards compatibility?) > > I mean, honestly, it's not used and it's even faster to check for children > than it is to look up the attribute... It's there to allow for leaf elements to be ObjectifiedElements, rather than ObjectifiedDataElements. The rules are easy for all other use cases: - the root has no parent element -> ObjectifiedElement - any other element with children -> ObjectifiedElement Things get difficult if you assign leaf elements and actually instantiate the python proxy objects. 
If no TREE attributes get used, these will end up being "default empty elements", usually string elements. Also, once having been serialized, there is no way that leaf elements can be recognized as ObjectifiedElements without the help of the TREE attribute. That's the main reason I propose the keep_tree functionality, to make ObjectifiedElement-leaves survive a creation-serialization-parse cycle. Holger From stefan_ml at behnel.de Thu Sep 13 11:04:24 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 11:04:24 +0200 Subject: [lxml-dev] findall() returns an iterable instead of a sequence in ET 1.3 Message-ID: <46E8FD18.1010306@behnel.de> Hi Fredrik, I just noticed the above when I tried to copy over the new ElementPath implementation from the current ET 1.3 SVN. The current ET docs of 1.2 clearly state that findall() returns a sequence. I'm not questioning the new behaviour, but it's not even mentioned in your "ET 1.3 intro" text. Don't you think that change will break a lot of code out there? It already breaks a couple of places in lxml.html, e.g. code where the author knew that there were few results to expect (and thus a list was the perfect thing to return) and where it was convenient to test for the truth value of the returned list to check for results. Admittedly, it's easy to write list(el.findall()), but the thing is: a) someone has to do that, and b) it's not always the best solution, so the change requires people to rethink their code. And the worst is: you will not even get an exception in all cases, as "if result" will simply behave differently and your code after that may still work - just not as expected. That's a pretty heavy change IMHO.
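The silent change in "if result" described above can be sketched as follows. This is only an illustration, not code from the thread: it uses the standard library's xml.etree.ElementTree (where findall() returns a list), simulates an iterator-returning findall() by wrapping the list in iter(), and the helper name first_match is hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><a/><a/><b/></root>")

# Sequence result: an empty list is falsy, so "if result" means "any match?".
as_list = root.findall("c")

# Stand-in for an iterator-returning findall(): the iterator object itself
# is always truthy, even when it will yield nothing.
as_iterator = iter(root.findall("c"))


def first_match(results):
    """Return the first result or None; works for lists and iterators alike."""
    return next(iter(results), None)
```

With the list, "if as_list:" skips the branch because nothing matched; with the iterator, the same test enters the branch without raising any exception. A check such as "first_match(results) is not None" behaves the same for both kinds of result.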
Stefan From stefan_ml at behnel.de Thu Sep 13 11:48:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 11:48:12 +0200 Subject: [lxml-dev] findall() returns an iterable instead of a sequence in ET 1.3 In-Reply-To: <368a5cd50709130216k54ad919ar2b089c4674c51ec@mail.gmail.com> References: <46E8FD18.1010306@behnel.de> <368a5cd50709130216k54ad919ar2b089c4674c51ec@mail.gmail.com> Message-ID: <46E9075C.9050006@behnel.de> Fredrik Lundh wrote: >> I just noticed the above when I tried to copy over the new ElementPath >> implementation from the current ET 1.3 SVN. The current ET docs of 1.2 clearly >> state that findall() returns a sequence. I'm not questioning the new >> behaviour, but it's not even mentioned in your "ET 1.3 intro" text. > > That's probably because I dropped in the new (still pretty rough) path > implementation after I wrote the first episode, but before I uploaded > the code... > > The ET 1.2 documentation does indeed say that findall may return a > sequence *or* an iterator: > > http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree._ElementInterface.findall-method Ok, but this page doesn't (and I find it pretty visible): http://effbot.org/elementtree/elementtree-element.htm#tag-ET.Element.findall > but as you say, chances are that people are relying on behaviour > rather than implementation. Yet, it's pretty nice to have an iterator > for things like: > > for elem in tree.findall(simple pattern): > check elem properties > if right elem: > break I think the "findALL()" makes it sound like something that returns a sequence rather than an iterator. The API shouldn't work against people's intuition. > But maybe we could provide an "iterfind", perhaps? (that may or may > not be the same thing as findall). I would definitely prefer that, and I like the name already. Then, findall() could be as simple as return list(self.iterfind()) And it /should/ do the same as findall(), as it carries "find" in its name. 
It meets my intuition that findall() and iterfind() return exactly the same results, just in the expected different ways. > fwiw, I've had the same concerns wrt the iter/getiterator changes; in > 1.2, getiterator returned a list, not an iterator. in 1.3a3, it's an > alias for "elem.iter()". maybe it should be an alias for > "list(elem.iter())" instead? I think that's different as people do not generally expect something called "getiterator" to return a sequence, so as long as they don't look into the documentation, they would not easily use it in a way that breaks after the change. BUT, since you already deprecated getiterator() anyway, why not make it a pure legacy function that works as it did in the early days? Stefan From anders at bruun-olsen.net Thu Sep 13 13:15:37 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Thu, 13 Sep 2007 13:15:37 +0200 Subject: [lxml-dev] Potential bug in trunk Message-ID: <46E91BD9.2010904@bruun-olsen.net> Hi, I've run into a weird problem. I am making a CherryPy-based application which uses lxml to do XSLT conversion of XML before sending it to the browser. It worked fine before switching to trunk (which has the namespace-patch that I need).
Here is the output: [13/Sep/2007:13:08:37] HTTP Serving HTTP on http://0.0.0.0:8080/ [13/Sep/2007:13:08:44] Traceback (most recent call last): File "/usr/lib64/python2.4/site-packages/cherrypy/_cprequest.py", line 90, in run hook() File "/usr/lib64/python2.4/site-packages/cherrypy/_cprequest.py", line 58, in __call__ return self.callback(**self.kwargs) File "/home/abo/workspace/xmldict/src/xmldict/__init__.py", line 42, in transform_output xsltdoc = etree.parse(open(xslfile)) File "etree.pyx", line 2189, in etree.parse File "parser.pxi", line 1183, in etree._parseDocument File "parser.pxi", line 1217, in etree._parseFilelikeDocument File "parser.pxi", line 1126, in etree._parseDocFromFilelike File "parser.pxi", line 83, in etree._ParserDictionaryContext.getDefaultParser File "parser.pxi", line 585, in etree._BaseParser._copy AttributeError: 'lxml.etree._ResolverRegistry' object has no attribute '_copy' The really weird part is that when I start up the interactive interpreter and do the exact same operation it works: >>> from lxml import etree >>> xsltfile = "/home/abo/workspace/dicts/svda/svda.xsl" >>> xsltdoc = etree.parse(open(xsltfile)) Anybody able to venture a guess as to where this bug might lie? Is it in lxml, cherrypy or my code? -- Anders From anders at bruun-olsen.net Thu Sep 13 13:58:07 2007 From: anders at bruun-olsen.net (Anders Bruun Olsen) Date: Thu, 13 Sep 2007 13:58:07 +0200 Subject: [lxml-dev] Serialization with namespaces In-Reply-To: <46E845E4.3030104@behnel.de> References: <46E6F713.3070801@bruun-olsen.net> <46E7BD25.4020100@behnel.de> <46E7C3C2.10103@behnel.de> <46E7E170.3000803@behnel.de> <46E7E759.1090308@bruun-olsen.net> <46E7E800.4010902@bruun-olsen.net> <46E7E968.7050808@bruun-olsen.net> <46E845E4.3030104@behnel.de> Message-ID: <46E925CF.9000603@bruun-olsen.net> Stefan Behnel wrote: >> /home/abo/tmp/lxml/src/lxml/etree.pyx:2156:0: 'parsertarget.pxi' not found > Ah, thanks. 
I forgot that file when committing the target parser > implementation. Fixed now. Okay, trunk builds now. And I can confirm that the namespace patch works. Thanks! :) -- Anders From dfedoruk at gmail.com Thu Sep 13 17:18:21 2007 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Thu, 13 Sep 2007 19:18:21 +0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode Message-ID: Hello everyone, I'm developing a mod_python application that is based on XML\XSLT transforming. I used 4Suite libraries for that, but as the speed was unacceptable for me, I switched to lxml. Everything became much easier and 10 times faster, but I've encountered the subject problem. In brief - all my data and xslt are stored and transferred in UTF-8. With 4Suite everything was fine all the time. With lxml it works fine from the console, but inside mod_python it occasionally dies, ~ one time out of three. Strange - the same code with the same data works or dies by its own means. As far as I have found, there was a similar problem with PyXML and encodings module, this is the problem with UTF, but there was no clear solution. So, my configuration is the following: Python 2.5.1 Server version: Apache/2.2.4 (FreeBSD) mod_python-3.3.1 And the relevant parts of my code are these:

def extApplyXslt(xslt, data, logger ):
    try:
        strXslt = urllib2.urlopen(xslt).read()
        # i have to read the xslt url to the python string
    except urllib2.HTTPError, e:
        .......
    except urllib2.URLError, e:
        .............
    try:
        xslt_parser = etree.XMLParser()
        xslt_parser.resolvers.add( PrefixResolver("XSLT") )

        # and now I have to use the string; a more elegant solution, anyone?
        f = StringIO(strXslt)
        xslt_doc = etree.parse(f, xslt_parser)

        # and here where the problem comes
        transform = etree.XSLT(xslt_doc)

    except Exception, exc:
        logger.log(logging.CRITICAL, exc.__str__() )

    try:
        result_tree = transform(data)
        return etree.tostring(result_tree, 'utf-8')
    except Exception, exc:
        print "xslt processing error!", exc.__str__()
        return ""

It dies with the message 'cannot unmarshal code objects in restricted execution mode'. By profiling I detected the point where the problem occurs: transform = etree.XSLT(xslt_doc) So, I would be grateful for any suggestions how to get rid of this. I'd really like to use lxml. Maybe I should initialize the xslt processor in some other way? Thanks in advance, Dmitri From lee.brown at elecdev.com Thu Sep 13 17:40:33 2007 From: lee.brown at elecdev.com (Lee Brown) Date: Thu, 13 Sep 2007 11:40:33 -0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects inrestricted execution mode In-Reply-To: Message-ID: <200709131540.l8DFeFoY005233@ns1.elecdev.net> Greetings! The first thing I'd suggest is to also put your query on the Mod Python list as well. A few questions: Are you trying to execute this code in a Handler or in a Filter? There's a world of hidden trouble lurking in Filters because of their re-entrant nature. Which Apache MPM are you using? If you're using a multiple-process module, you might try switching to a single-process-multiple-thread module to see if this behavior changes. > -----Original Message----- > From: lxml-dev-bounces at codespeak.net > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Dmitri Fedoruk > Sent: Thursday, September 13, 2007 11:18 AM > To: lxml-dev at codespeak.net > Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code > objects inrestricted execution mode > > Hello everyone, > > I'm developing a mod_python application that is based on > XML\XSLT transforming. > > I used 4Suite libraries for that, but as the speed was > unacceptable for me, I switched to lxml.
Everything became > much easier and 10 times faster, but I've encountered the > subject problem. > > In brief - all my data and xslt are stored and transferred in UTF-8. > With 4Suite everything was fine all the time. With lxml it > works fine from the console, but inside mod_python it > occasionaly dies, ~ one time out of three. Strange - the same > code with the same data works or dies by its own means. > > As far as I have found, there was a similar problem with > PyXML and encodings module, this is the problem with UTF, but > there was no clear solution. > > So, my configuration is the following: > Python 2.5.1 > Server version: Apache/2.2.4 (FreeBSD) > mod_python-3.3.1 > > And the relevant parts of my code are these: > > def extApplyXslt(xslt, data, logger ): > try: > strXslt = urllib2.urlopen(xslt).read() > # i have to read the xslt url to the python string > except urllib2.HTTPError, e: > ....... > except urllib2.URLError, e: > ............. > try: > xslt_parser = etree.XMLParser() > xslt_parser.resolvers.add( PrefixResolver("XSLT") ) > > # and now I have to use the string; a more elegant > solution, anyone? > f = StringIO(strXslt) > xslt_doc = etree.parse(f, xslt_parser) > > # and here where the problem comes > transform = etree.XSLT(xslt_doc) > > except Exception, exc: > logger.log(logging.CRITICAL, exc.__str__() ) > > try: > result_tree = transform(data) > return etree.tostring(result_tree, 'utf-8') > except Exception, exc: > print "xslt processing error!", exc.__str__() > return "" > > It dies with the message 'cannot unmarshal code objects in > restricted execution mode'. By profiling I detected the point > where problem > occurs: > transform = etree.XSLT(xslt_doc) > > So, I would be grateful for any suggestions how to get rid of this. > I'd really like to use lxml. Maybe I should initialize the > xslt processor in somehow other way? 
> > Thanks in advance, > Dmitri > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From stefan_ml at behnel.de Thu Sep 13 17:45:09 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 17:45:09 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: References: Message-ID: <46E95B05.9020709@behnel.de> Dmitri Fedoruk wrote: > I'm developing a mod_python application that is based on XML\XSLT > transforming. > > I used 4Suite libraries for that, but as the speed was unacceptable > for me, I switched to lxml. Everything became much easier and 10 times > faster Thanks for sharing that. :) > but I've encountered the subject problem. > > In brief - all my data and xslt are stored and transferred in UTF-8. > With 4Suite everything was fine all the time. With lxml it works fine > from the console, but inside mod_python it occasionaly dies, ~ one > time out of three. Strange - the same code with the same data works or > dies by its own means. > > As far as I have found, there was a similar problem with PyXML and > encodings module, this is the problem with UTF, but there was no clear > solution. > > So, my configuration is the following: > Python 2.5.1 > Server version: Apache/2.2.4 (FreeBSD) > mod_python-3.3.1 Looks like you forgot to mention the lxml version you are using. > And the relevant parts of my code are these: > > def extApplyXslt(xslt, data, logger ): > try: > strXslt = urllib2.urlopen(xslt).read() > # i have to read the xslt url to the python string > except urllib2.HTTPError, e: > ....... > except urllib2.URLError, e: > ............. > try: > xslt_parser = etree.XMLParser() > xslt_parser.resolvers.add( PrefixResolver("XSLT") ) > > # and now I have to use the string; a more elegant solution, As I already mentioned on c.l.py, you can pass the result of urlopen() directly into parse(). 
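A minimal sketch of the suggestion above, i.e. handing a file-like object straight to parse() instead of reading it into a string and wrapping it in StringIO. To keep the example free of network access, io.BytesIO stands in for the object returned by urlopen(); the stylesheet content and the names XSLT_SOURCE and response are made up for illustration. The import falls back to the standard library's ElementTree if lxml is not installed, since both accept a file-like source in parse().

```python
import io

try:
    from lxml import etree
except ImportError:  # fall back to the standard library's ElementTree
    import xml.etree.ElementTree as etree

# Hypothetical stylesheet, only here so that there is something to parse.
XSLT_SOURCE = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><out/></xsl:template>
</xsl:stylesheet>"""

# io.BytesIO stands in for the response object returned by urlopen();
# both expose read(), which is what parse() relies on here.
response = io.BytesIO(XSLT_SOURCE)

# No intermediate string and no StringIO wrapper needed.
xslt_doc = etree.parse(response)
root_tag = xslt_doc.getroot().tag
```

In the code from the original post, the parser object would be passed as the second argument to parse() as before; it is left out here only to keep the sketch short.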
> f = StringIO(strXslt) > xslt_doc = etree.parse(f, xslt_parser) > > # and here where the problem comes > transform = etree.XSLT(xslt_doc) > > except Exception, exc: > logger.log(logging.CRITICAL, exc.__str__() ) > > try: > result_tree = transform(data) > return etree.tostring(result_tree, 'utf-8') > except Exception, exc: > print "xslt processing error!", exc.__str__() > return "" > > It dies with the message 'cannot unmarshal code objects in restricted > execution mode'. By profiling I detected the point where problem > occurs: > transform = etree.XSLT(xslt_doc) Hmmm, I can't see where any "unmarshaling" should be taking place here - definitely not in XSLT(). And I don't get why this should only happen once in a while. Can you figure out what is writing this message? The python interpreter or mod_python? Stefan From goliath.mailinglist at gmx.de Thu Sep 13 18:02:21 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 18:02:21 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: References: Message-ID: <46E95F0D.4030506@gmx.de> > Everything became much easier and 10 times > faster, but I've encountered the subject problem. Same problem here, but with different code and versions: * Django as webframework * Apache 2.0.59 and 2.2.4 * lxml 1.3.x (all versions) * mod_python 3.2.10 and 3.3.1 * libxml2 2.6.28 / libxslt 1.1.20 I think this might have something to do with mod_python fiddling with __builtins__, at least googling for the error message told me that Python switches to restricted mode when doing so (but this might be one trigger of many). lxml seems to have callbacks run in its own "sandbox" (or something like this, at least it seems to be a different environment than the outer code has), which works fine unless the restricted mode is triggered.
Somehow restricted mode is only mentioned in the docs for RExec (http://docs.python.org/lib/module-rexec.html), which should not be available any more, so I don't know what exactly lxml does to run callbacks.

Some further debugging revealed that the "unmarshaling" error only occurred if all modules I used in the callback were loaded before the callback ran. If I load them inside the callback, the error differs.
Example:
------------8<----------------------------------------------------
# unmarshaling error
from foo import bar
def callback(ctx, ...):
    return bar()
---------------------------------------------------->8------------
------------8<----------------------------------------------------
# other error
def callback(ctx, ...):
    from foo import bar
    return bar()
---------------------------------------------------->8------------
As I don't have the needed mod_python configuration set up here, I can't tell what the other error was, but I will add this later. (I think it was some ImportError.)

I did not report this problem because I was not sure which part of the chain that produces the webpages was responsible. Django fiddles with __builtins__, too (but removing that didn't help). And perhaps this is simply a mod_python bug. So I used FastCGI, which works well. But I'm very interested in a better solution. ;-)

For the questions raised by Lee Brown:
> Are you trying to execute this code in a Handler or in a Filter? There's world
> of hidden trouble lurking in Filters because of their re-entrant nature.

I use normal XSLT callbacks. I tried different methods to tell lxml which callbacks I have; none worked. (global namespace, callbacks as "extensions" parameter for etree.XSLT)

XSLT-sample-snippet: (Namespace is defined, callback gets called and works fine...until I try to use the code with mod_python)

> Which Apache MPM are you using? If you're using a multiple-process module, you
> might try switching to a single-process-multiple-thread module to see if this
> behavior changes.
Using prefork here, as all threaded modules have problems with mod_php. mod_php might be another error-source. Read something about failing DB-connections when using mod_php and mod_python. But I don't really think disabling mod_php will make a difference here.

Greetings, David Danier

From lee.brown at elecdev.com Thu Sep 13 18:07:13 2007
From: lee.brown at elecdev.com (Lee Brown)
Date: Thu, 13 Sep 2007 12:07:13 -0400
Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode
In-Reply-To: <46E95F0D.4030506@gmx.de>
Message-ID: <200709131606.l8DG6uuM005741@ns1.elecdev.net>

Greetings!

Sorry, I should have stated my first question more clearly. Are you calling your routines from within a Mod Python requestHandler object or an outputFilter object?

From goliath.mailinglist at gmx.de Thu Sep 13 18:50:54 2007
From: goliath.mailinglist at gmx.de (David Danier)
Date: Thu, 13 Sep 2007 18:50:54 +0200
Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode
In-Reply-To: <200709131606.l8DG6uuM005741@ns1.elecdev.net>
References: <200709131606.l8DG6uuM005741@ns1.elecdev.net>
Message-ID: <46E96A6E.1090003@gmx.de>

> Sorry, I should have stated my first question more clearly. Are you calling
> your routines from within a Mod Python requestHandler object or an outputFilter
> object?

It is called out of a RequestHandler, but I'm not really doing this myself. Django does most of the work, see:
http://www.djangoproject.com/documentation/modpython/
http://code.djangoproject.com/browser/django/trunk/django/core/handlers/modpython.py#L176

Greetings, David Danier

From stefan_ml at behnel.de Thu Sep 13 18:53:34 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 13 Sep 2007 18:53:34 +0200
Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode
Message-ID: <46E96B0E.3040202@behnel.de>

... just forwarding to the list ...

[original mail by Dmitri Fedoruk]

On 9/13/07, Stefan Behnel wrote:
> Looks like you forgot to mention the lxml version you are using.
The most important thing: lxml-1.3.4

> As I already mentioned on c.l.py, you can pass the result of urlopen()
> directly into parse().

Thank you, that looks better.

> Hmmm, I can't see where any "unmarshaling" should be taking place here -
> definitely not in XSLT(). And I don't get why this should only happen once in
> a while.

The point is that it then happens again and again, but I can't see any regularity. Pretty random.

Here is the real code and its logging output:

try:
    xslt_parser = etree.XMLParser()
    xslt_parser.resolvers.add(PrefixResolver("XSLT"))
    inLogger.log(logging.INFO, "parser created")
    xslt_doc = etree.parse(urllib2.urlopen(xslt), xslt_parser)
    inLogger.log(logging.INFO, "%s parsed" % xslt)
    transform = etree.XSLT(xslt_doc)
    inLogger.log(logging.INFO, "xslt transformation created")
except Exception, exc:
    inLogger.log(logging.CRITICAL, exc.__str__())

logging output:

Thu, 13 Sep 2007 19:53:31 INFO parser created
Thu, 13 Sep 2007 19:53:31 INFO http://***/web-out-long.xsl parsed
Thu, 13 Sep 2007 19:53:31 CRITICAL cannot unmarshal code objects in restricted execution mode

As there is no "xslt transformation created" line, I had to assume that the error happens in etree.XSLT.

> Can you figure out what is writing this message? The python interpreter or
> mod_python?

mod_python. The Python interpreter on its own runs the code fine, without a single error.

Dmitri

From goliath.mailinglist at gmx.de Thu Sep 13 19:05:49 2007
From: goliath.mailinglist at gmx.de (David Danier)
Date: Thu, 13 Sep 2007 19:05:49 +0200
Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode
In-Reply-To: <46E95F0D.4030506@gmx.de>
References: <46E95F0D.4030506@gmx.de>
Message-ID: <46E96DED.9050800@gmx.de>

> Somehow restricted mode is only mentioned in the docs for RExec
> (http://docs.python.org/lib/module-rexec.html), but should not be
> available any more, so I don't know what lxml exactly does to use callbacks.
Found another place that mentions restricted mode by accident: http://www.modpython.org/live/current/doc-html/pyapi-interps.html I think this paragraph describes the problem pretty well: ------------8<---------------------------------------------------- Note that if any third party module is being used which has a C code component that uses the simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules, then the interpreter name must be forcibly set to be "main_interpreter". This is necessary as such a module will only work correctly if run within the context of the first Python interpreter created by the process. If not forced to run under the "main_interpreter", a range of Python errors can arise, each typically referring to code being run in restricted mode. ---------------------------------------------------->8------------ (thanks to Lee Brown for asking about where lxml is called, it made me read the mod_python-docs again) I'll try to setup my site on mod_python and using "PythonInterpreter main_interpreter" in the config. According to the docs this might help...but if I read this right might produce namespace-problems or at least pollute some global namespace. As this takes some time I will post the result later. Perhaps it can be fixed in lxml by not using the "simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules"? 
Greetings, David Danier From stefan_ml at behnel.de Thu Sep 13 19:28:50 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Sep 2007 19:28:50 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E96DED.9050800@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> Message-ID: <46E97352.4050704@behnel.de> Hi, David Danier wrote: >> Somehow restricted mode is only mentioned in the docs for RExec >> (http://docs.python.org/lib/module-rexec.html), but should not be >> available any more, to I don't know what lxml exactly does to use callbacks. > > Found another place that mentions restricted mode by accident: > http://www.modpython.org/live/current/doc-html/pyapi-interps.html > > I think this paragraph describes the problem pretty well: > ------------8<---------------------------------------------------- > Note that if any third party module is being used which has a C code > component that uses the simplified API for access to the Global > Interpreter Lock (GIL) for Python extension modules, then the > interpreter name must be forcibly set to be "main_interpreter". This is > necessary as such a module will only work correctly if run within the > context of the first Python interpreter created by the process. If not > forced to run under the "main_interpreter", a range of Python errors can > arise, each typically referring to code being run in restricted mode. > ---------------------------------------------------->8------------ > (thanks to Lee Brown for asking about where lxml is called, it made me > read the mod_python-docs again) thanks for the infos, that's good to know. > I'll try to setup my site on mod_python and using "PythonInterpreter > main_interpreter" in the config. According to the docs this might > help...but if I read this right might produce namespace-problems or at > least pollute some global namespace. 
As this takes some time I will post > the result later. Please do. > Perhaps it can be fixed in lxml by not using the "simplified API for > access to the Global Interpreter Lock (GIL) for Python extension modules"? No way. There's a reason why it is there which is the same why we use it: it's simple and usable. Using anything else would mean a lot of rewriting. You might want to try compiling lxml with "--without-threading", though, which disables concurrency support completely (i.e. not more GIL freeing). Stefan From goliath.mailinglist at gmx.de Thu Sep 13 19:51:14 2007 From: goliath.mailinglist at gmx.de (David Danier) Date: Thu, 13 Sep 2007 19:51:14 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E97352.4050704@behnel.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> <46E97352.4050704@behnel.de> Message-ID: <46E97892.9010008@gmx.de> >> As this takes some time I will post >> the result later. > Please do. Seems to work properly. But I'm not really sure how bad "main_interpreter" is polluted now. > No way. There's a reason why it is there which is the same why we use it: it's > simple and usable. Using anything else would mean a lot of rewriting. Thats sad. What are the chances that patches addressing this problem are accepted? (Must review the code first, but I would really like a clean solution here) > You might want to try compiling lxml with "--without-threading", though, which > disables concurrency support completely (i.e. not more GIL freeing). Works, too. But I'm not really sure it it is a good idea to do so, as Py_NewInterpreter seems to create a thread, see http://www.python.org/doc/current/api/initialization.html#l2h-820. But I think this might not be a problem if not using a threaded Apache-MPM. 
Greetings, David Danier From dfedoruk at gmail.com Fri Sep 14 10:28:41 2007 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Fri, 14 Sep 2007 12:28:41 +0400 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E96DED.9050800@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> Message-ID: Hello, > I'll try to setup my site on mod_python and using "PythonInterpreter main_interpreter" in the config. Fine, works for me too. As I'm not very good in python, I can't tell whether this is good or evil, but this trick works and that's all I need. Thanks! Dmitri From faassen at startifact.com Fri Sep 14 16:30:35 2007 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 14 Sep 2007 16:30:35 +0200 Subject: [lxml-dev] 2.0alpha too visible In-Reply-To: <46E850B8.8040803@colorstudy.com> References: <46E850B8.8040803@colorstudy.com> Message-ID: Ian Bicking wrote: > I think the 2.0alpha release might be too visible. If you do > "easy_install lxml" you get that version. > > One way to help this would be to not upload 2.0alpha to PyPI, but > instead just put a link to a tarball with #egg=lxml-twoalpha or > something, so it won't be considered newer than 1.3 (but you could > install it with easy_install lxml==twoalpha). I've run into it trying to get 2.0alpha several times too in buildout processes (which use setuptools underneath). Hiding 2.0alpha better might help. That said, in my opinion it's really a problem that we need to tackle on the buildout or framework end, being more explicit about what versions we need. This keeps happening with all kinds of other libraries as well and is not really the library's fault. 
(Buildout already has a feature to prefer released versions which can help a bit here) Regards, Martijn From stefan_ml at behnel.de Sat Sep 15 10:28:08 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 15 Sep 2007 10:28:08 +0200 Subject: [lxml-dev] 2.0alpha too visible In-Reply-To: <46E850B8.8040803@colorstudy.com> References: <46E850B8.8040803@colorstudy.com> Message-ID: <46EB9798.6000306@behnel.de> Hi Ian, Ian Bicking wrote: > I think the 2.0alpha release might be too visible. If you do > "easy_install lxml" you get that version. > > One way to help this would be to not upload 2.0alpha to PyPI, but > instead just put a link to a tarball with #egg=lxml-twoalpha or > something, so it won't be considered newer than 1.3 (but you could > install it with easy_install lxml==twoalpha). hmmm, I would like to keep 2.0alpha visible as there were some changes (and some more to come) that people should be aware of, especially when writing new code. So I want it uploaded on PyPI and I want it in the list of version you see when going to http://pypi.python.org/pypi/lxml I personally consider it a bug in easy_install that it always takes the newest version without paying attention to the development status (which is clearly stated as "3 - alpha" in the Trove list), or at least to the version string. It doesn't even provide an option to control that. I just wrote to the distutils list about that, we'll see. Stefan From stefan_ml at behnel.de Sat Sep 15 17:48:30 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 15 Sep 2007 17:48:30 +0200 Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode In-Reply-To: <46E97892.9010008@gmx.de> References: <46E95F0D.4030506@gmx.de> <46E96DED.9050800@gmx.de> <46E97352.4050704@behnel.de> <46E97892.9010008@gmx.de> Message-ID: <46EBFECE.4030605@behnel.de> David Danier wrote: >>> As this takes some time I will post >>> the result later. >> Please do. > > Seems to work properly. 
But I'm not really sure how bad
> "main_interpreter" is polluted now.

I wouldn't expect much (namespace) pollution, unless there's real evidence that this can become a problem. And a crash is definitely a more important problem than namespace pollution.

>> No way. There's a reason why it is there which is the same why we use it: it's
>> simple and usable. Using anything else would mean a lot of rewriting.
>
> That's sad. What are the chances that patches addressing this problem are
> accepted?
> (Must review the code first, but I would really like a clean solution here)

We always accept patches as long as there is general interest and/or a good motivation behind them. But threading is pretty much an issue by itself in lxml.etree. And the "simplified API" gives you a way to just say "release GIL - call to libxml2 - acquire GIL" and "acquire GIL - run callback code - free GIL". That's as easy as it can get, especially since Cython has support for the latter nowadays. It is very unlikely that this can get any "cleaner" by changing the thread-lock calls.

>> You might want to try compiling lxml with "--without-threading", though, which
>> disables concurrency support completely (i.e. no more GIL freeing).
>
> Works, too. But I'm not really sure if it is a good idea to do so, as
> Py_NewInterpreter seems to create a thread, see
> http://www.python.org/doc/current/api/initialization.html#l2h-820. But I
> think this might not be a problem if not using a threaded Apache-MPM.

What this option does is make lxml.etree stop releasing the GIL internally when calling into libxml2, which simply disables any concurrency, as it keeps the GIL until execution returns to Python code. In particular, the (simplified) thread API is no longer used, so there should no longer be any threading issues.
Stefan From stefan_ml at behnel.de Sun Sep 16 00:37:06 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 16 Sep 2007 00:37:06 +0200 Subject: [lxml-dev] lxml 2.0alpha2 released Message-ID: <46EC5E92.2050607@behnel.de> Hi all, I just released lxml 2.0alpha2 to PyPI. http://pypi.python.org/pypi/lxml/2.0alpha2 http://codespeak.net/lxml/dev/ It features a number of major API additions that follow the ElementTree library and the future API changes in ElementTree 1.3. The main new features are HTML serialisation support, a feed interface to the parsers, a SAX-like target parser interface, and iterfind() as an iterator version of findall(). All of these are currently more or less experimental, so feedback is warmly welcome. The mailing list is always open for discussion, not only on the new features. The complete changelog is below. Have fun Stefan 2.0alpha2 (2007-09-15) Features added * ET.write(), tostring() and tounicode() now accept a keyword argument "method" that can be one of 'xml' (or None), 'html' or 'text' to serialise as XML, HTML or plain text content. 
* iterfind() method on Elements returns an iterator equivalent to findall() * itertext() method on Elements * Setting a QName object as value of the .text property or as an attribute will resolve its prefix in the respective context * ElementTree-like parser target interface as described in http://effbot.org/elementtree/elementtree-xmlparser.htm * ElementTree-like feed parser interface on XMLParser and HTMLParser (feed() and close() methods) Bugs fixed * lxml failed to serialise namespace declarations of elements other than the root node of a tree * Race condition in XSLT where the resolver context leaked between concurrent XSLT calls Other changes * element.getiterator() returns a list, use element.iter() to retrieve an iterator (ElementTree 1.3 compatible behaviour) From stefan_ml at behnel.de Sun Sep 16 17:38:30 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 16 Sep 2007 17:38:30 +0200 Subject: [lxml-dev] Potential bug in trunk In-Reply-To: <46E91BD9.2010904@bruun-olsen.net> References: <46E91BD9.2010904@bruun-olsen.net> Message-ID: <46ED4DF6.5090100@behnel.de> Anders Bruun Olsen wrote: > I've run into a weird problem. I am making a CherryPy-based application > which uses lxml to do XSLT conversion of XML before sending it to the > browser. It worked fine before switching to trunk (which has the > namespace-patch that I need). Here is the output: > > [13/Sep/2007:13:08:37] HTTP Serving HTTP on http://0.0.0.0:8080/ > [13/Sep/2007:13:08:44] Traceback (most recent call last): > etree._ParserDictionaryContext.getDefaultParser > File "parser.pxi", line 585, in etree._BaseParser._copy > AttributeError: 'lxml.etree._ResolverRegistry' object has no attribute > '_copy' > > Anybody able to venture a guess as to where this bug might lie? Is it in > lxml, cherrypy or my code? It was a bug in lxml that came in as part of the parser code refactoring. I fixed it for alpha2. Thanks for the report. 
Stefan

From ebgssth at gmail.com Mon Sep 17 14:02:52 2007
From: ebgssth at gmail.com (js)
Date: Mon, 17 Sep 2007 21:02:52 +0900
Subject: [lxml-dev] non-ascii characters get garbled
Message-ID:

Hello, list.

The lxml doc [*1] says that
"You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone."

[*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings

but my experience is different from that.

For example, the following code doesn't bother with encodings and leaves the work to lxml.etree. According to the doc, this is the right way, but it doesn't work and you'll get garbled characters. (give it a try)

--------------------------------------------------------------------
# -*- coding: utf-8 -*-
from lxml import html as etree
url='http://apple.com/kr'

tree = etree.parse(url)
from pprint import pformat
for t in tree.xpath('//a[text()]'):
    print t.text_content()
--------------------------------------------------------------------

The next one breaks the rule and does all the charset conversion itself. This one works great and all charset conversion succeeds.

--------------------------------------------------------------------
# -*- coding: utf-8 -*-
from lxml import html as etree
from urllib2 import urlopen
from StringIO import StringIO
url='http://apple.com/kr'

res = urlopen(url)
html = res.read().decode(res.headers.getparam('charset'))
tree = etree.parse(StringIO(html))
from pprint import pformat
for t in tree.xpath('//a[text()]'):
    print t.text_content()
--------------------------------------------------------------------

But the latter doesn't always work. Sometimes I get "ValueError: Unicode strings with encoding declaration are not supported."

Is this a known issue? If so, how can I get out of this problem? Are there any workarounds?

I tried to figure out the cause of this and looked over the code of lxml and libxml2, but could not find a clue.
(To me this appeared to be not a lxml's problem but libxml2's ,though) Any information would be greatly appriciated. Thanks you in advance. From felwert at uni-bremen.de Mon Sep 17 14:11:23 2007 From: felwert at uni-bremen.de (Frederik Elwert) Date: Mon, 17 Sep 2007 14:11:23 +0200 Subject: [lxml-dev] CSS and lxml Message-ID: <1190031083.7564.21.camel@FredDesk> Hello! I am currently looking into the possibilities to work with CSS in lxml or Python in general. I hope it's not too off-topic, but since it's related to lxml, I thought I might post it here. The specific use case I have in mind would take an element and look for all applying CSS rules. Let's say we have a style element: and in the tree, there are two elements:

Some text

Some other text

Then I'd like to get something like >>> el1.getstyle() {'font-size': '16pt', 'color': 'red'} for the first element and >>> el2.getstyle() {'font-size': '16pt', 'font-weight': 'bold'} for the second one. I know that this is currently not possible. The only true CSS library for Python that I found were cssutils . They have quite sophisticated support for CSS parsing, but I think the library itself is quite DOM-centric and so it's not very pythonic / doesn't fit well to lxml. But more important, it has no real XML bindings. So it's possible to query stylesheets to get properties that match a selector: >>> stylesheet.props('p.strong') {'font-size': '16pt', 'font-weight': 'bold'} but not to query true elements to get the applying properties. On the other hand, lxml now has cssselect, which works the other way around: It takes a selector and returns all the elements that match that selector. >>> sel = CSSSelector('p.strong') >>> [e.text for e in sel(tree)] ['Some other text'] So I just wanted to ask if somebody already had thought about this, or if somebody has any ideas in which direction to head to solve this problem. Maybe one could write a module, that combines cssutils and lxml.cssselect to match css style properties and actual elements. But maybe a completely different approach would be needed. Regards, Frederik From stefan_ml at behnel.de Mon Sep 17 14:41:25 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 14:41:25 +0200 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: References: Message-ID: <46EE75F5.4090606@behnel.de> js wrote: > The lxml doc [*1] says that > "You should generally avoid converting XML/HTML data to unicode before > passing it into the parsers. It is both slower and error prone." > > [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings > > but my experience is different from that. Not quite. As you say below, you sometimes get ValueErrors depending on the page data, so it *is* error prone. 
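The difference can be sketched as follows, assuming Python 3 and a current lxml; the tiny document and the variable names are made up for illustration. Byte input that carries its own encoding declaration is decoded by the parser itself, while the same data passed as an already decoded string is rejected with the ValueError quoted above.

```python
from lxml import etree

# Hypothetical document: UTF-8 bytes with an explicit encoding declaration.
XML_BYTES = (
    '<?xml version="1.0" encoding="utf-8"?><p>\uc560\ud50c</p>'
).encode("utf-8")

# Bytes input: the parser reads the declaration and decodes the text itself.
root = etree.fromstring(XML_BYTES)
text_from_bytes = root.text

# Decoded input: the declaration now contradicts the in-memory string,
# so lxml refuses to parse it.
try:
    etree.fromstring(XML_BYTES.decode("utf-8"))
    unicode_error = None
except ValueError as exc:
    unicode_error = str(exc)
```

This is why handing the raw response bytes (or the response object itself) to the parser tends to be the more robust choice.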
> For example, the following code doesn't bother encoding things > and leave the work to lxml.etree. > According to the doc, this is right way, but it does't work > and you'll got garbled characters. (give it a try) > > -------------------------------------------------------------------- > # -*- coding: utf-8 -*- > from lxml import html as etree This import makes your code hard to read IMHO. If you use lxml.html, say it. > url='http://apple.com/kr' > > tree = etree.parse(url) > from pprint import pformat > for t in tree.xpath('//a[text()]'): > print t.text_content() > -------------------------------------------------------------------- Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And when I collect the text, it looks perfectly reasonable, including strings like u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 ' This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console. Are you sure it's the text content and not just the console output on your side? > Sometimes I got "ValueError: Unicode strings with encoding declaration > are not supported. " On the same page? I assume you were referring to a different page here that probably uses XHTML instead of HTML, right? The above should work for both - as long as libxml2 can detect the encoding (and if it can't, there's lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed). Stefan From stefan_ml at behnel.de Mon Sep 17 15:07:21 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 15:07:21 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190031083.7564.21.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> Message-ID: <46EE7C09.8000006@behnel.de> Hi, Frederik Elwert wrote: > I am currently looking into the possibilities to work with CSS in lxml > or Python in general. I hope it's not too off-topic, but since it's > related to lxml, I thought I might post it here. 
Sure. > The specific use case I have in mind would take an element and look for > all applying CSS rules. Let's say we have a style element: > > > > and in the tree, there are two elements: > >

Some text

>

Some other text

> > Then I'd like to get something like > >>>> el1.getstyle() > {'font-size': '16pt', 'color': 'red'} > > for the first element and > >>>> el2.getstyle() > {'font-size': '16pt', 'font-weight': 'bold'} > for the second one. > > I know that this is currently not possible. The only true CSS library > for Python that I found were cssutils . > They have quite sophisticated support for CSS parsing, but I think the > library itself is quite DOM-centric and so it's not very pythonic / > doesn't fit well to lxml. But more important, it has no real XML > bindings. So it's possible to query stylesheets to get properties that > match a selector: > >>>> stylesheet.props('p.strong') > {'font-size': '16pt', 'font-weight': 'bold'} > > but not to query true elements to get the applying properties. > > On the other hand, lxml now has cssselect, which works the other way > around: It takes a selector and returns all the elements that match that > selector. > >>>> sel = CSSSelector('p.strong') >>>> [e.text for e in sel(tree)] > ['Some other text'] > > So I just wanted to ask if somebody already had thought about this, or > if somebody has any ideas in which direction to head to solve this > problem. > > Maybe one could write a module, that combines cssutils and > lxml.cssselect to match css style properties and actual elements. But > maybe a completely different approach would be needed. There are a couple of things you have to do here. First, you have to parse CSS, which only the cssutils currently do. Then you have to find out which of the rules apply to an element which AFAICT is not currently supported at all. You could do a brute force test and just take all selectors that you find in all CSS stylesheets in the document or in external references, to match them against the element in question - but that would be quite some overhead. 
On the other hand, if style lookup is more frequent than document parsing, you can build an inverse index: run through all CSS selectors, find the elements they match and store the style content for each of the elements, thus aggregating the style properties per element. You could maybe implement a "cssannotate(stylesheet, tree)" function, which would map a stylesheet on a tree by setting (or extending) the "style" attributes on each element accordingly. That would come pretty close to what you were looking for. Stefan From ebgssth at gmail.com Mon Sep 17 16:09:02 2007 From: ebgssth at gmail.com (js) Date: Mon, 17 Sep 2007 23:09:02 +0900 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: <46EE75F5.4090606@behnel.de> References: <46EE75F5.4090606@behnel.de> Message-ID: Thank you for you reply. > > -------------------------------------------------------------------- > > # -*- coding: utf-8 -*- > > from lxml import html as etree > > This import makes your code hard to read IMHO. If you use lxml.html, say it. Oh, html is just a little bit different version of etree so I always do above import. that's just my thought. I'll just say what I'll do next time, thanks. Explicit is better than implicit :) > > url='http://apple.com/kr' > > > > tree = etree.parse(url) > > from pprint import pformat > > for t in tree.xpath('//a[text()]'): > > print t.text_content() > > -------------------------------------------------------------------- > > Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And > when I collect the text, it looks perfectly reasonable, including strings like > > u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. > \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 ' > > This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console. > > Are you sure it's the text content and not just the console output on your side? This is on lxml 2.0alpha2 and libxml2 2.6.29_0. I got the following. 
$ ./lxml_test.py Apple Store Mac iPod + iTunes Downloads Support ????????? ????? ??? ?? ?"? ?????? ?????? ??????? ? ????????? ??? ????? ?????? ?????? ?? ??? ?????? ????????? ???? ???? ?????? ?????? ?"?????????? - iBook G4 ??? PowerBook G4 ????????? ?????? ??"?(c)? ??? ? ???? ???? ???? ?????? eMac ?????? ?????? ?"?????????? here. ?????(c) ?????? ??????? ???? ??????? ??? This is not a console problem because I can get correct result by using latter method as I said before. > > Sometimes I got "ValueError: Unicode strings with encoding declaration > > are not supported. " > > On the same page? I assume you were referring to a different page here that > probably uses XHTML instead of HTML, right? The above should work for both - > as long as libxml2 can detect the encoding (and if it can't, there's > lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed). Yes, from different page. I got the error when I'm getting http://www.hatena.com/ Thanks. From felwert at uni-bremen.de Mon Sep 17 16:12:40 2007 From: felwert at uni-bremen.de (Frederik Elwert) Date: Mon, 17 Sep 2007 16:12:40 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <46EE7C09.8000006@behnel.de> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> Message-ID: <1190038360.7564.49.camel@FredDesk> Am Montag, den 17.09.2007, 15:07 +0200 schrieb Stefan Behnel: > There are a couple of things you have to do here. First, you have to parse > CSS, which only the cssutils currently do. Then you have to find out which of > the rules apply to an element which AFAICT is not currently supported at all. No, cssutils supports only parsing and generating CSS, but not element-based style selection. And if it would, I guess they'd stick to xml.dom or something. 
> You could do a brute force test and just take all selectors that you find in > all CSS stylesheets in the document or in external references, to match them > against the element in question - but that would be quite some overhead. I thought about that and came to the same conclusion as you do regarding the overhead. > On > the other hand, if style lookup is more frequent than document parsing, you > can build an inverse index: run through all CSS selectors, find the elements > they match and store the style content for each of the elements, thus > aggregating the style properties per element. This would be quite practical, right. I'm just not sure about where to store the information. > You could maybe implement a "cssannotate(stylesheet, tree)" function, which > would map a stylesheet on a tree by setting (or extending) the "style" > attributes on each element accordingly. That would come pretty close to what > you were looking for. This just had the negative side-effect of changing the tree itself. So it would only be applicable for read-only-operations, since one wouldn't want to put all style permanently into style attributes for most use cases. Hm, I have to think about this. But it seems that a combination of lxml.cssselect and cssutils would quite do. Since I don't want to rely on lxml 2.0 yet, I'd wait for the implementation anyway. Thanks for your hints! 
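[Editorial sketch] The inverse index discussed above can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, a toy matcher (the hypothetical `matches()` helper, which understands only `tag`, `.class` and `tag.class`) in place of lxml.cssselect, and a hand-written list of (selector, properties) pairs in place of a stylesheet parsed by cssutils. The sample document and rules are made up for illustration.

```python
import xml.etree.ElementTree as ET


def matches(element, selector):
    """Toy matcher (assumption): supports only 'tag', '.class' and 'tag.class'."""
    tag, _, cls = selector.partition(".")
    if tag and element.tag != tag:
        return False
    if cls and cls not in element.get("class", "").split():
        return False
    return True


def build_style_index(root, rules):
    """Map each element to the aggregated properties of all matching rules.

    `rules` is a list of (selector, properties) pairs in stylesheet order.
    Later rules simply override earlier ones; CSS specificity is ignored.
    """
    index = {}
    for selector, properties in rules:
        for element in root.iter():
            if matches(element, selector):
                index.setdefault(element, {}).update(properties)
    return index


# Hypothetical sample data, not taken from the thread.
root = ET.fromstring(
    '<body><p>Some text</p><p class="strong">Some other text</p></body>'
)
rules = [
    ("p", {"font-size": "16pt"}),
    ("p.strong", {"font-weight": "bold"}),
]

styles = build_style_index(root, rules)
first, second = root.findall("p")
print(styles[first])
print(styles[second])
```

The index lives in a plain dict next to the tree, so the tree itself is not modified. The dict holds strong references to the elements and should be dropped together with the tree.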
Regards, Frederik From stefan_ml at behnel.de Mon Sep 17 16:37:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 16:37:05 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190038360.7564.49.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> Message-ID: <46EE9111.5050705@behnel.de> Frederik Elwert wrote: > Am Montag, den 17.09.2007, 15:07 +0200 schrieb Stefan Behnel: >> On >> the other hand, if style lookup is more frequent than document parsing, you >> can build an inverse index: run through all CSS selectors, find the elements >> they match and store the style content for each of the elements, thus >> aggregating the style properties per element. > > This would be quite practical, right. I'm just not sure about where to > store the information. If you don't want to alter the tree, you can use a dict to map Elements to a style dict. However, note that Elements are not currently weak referenceable, so you'd have to make sure the trees are discarded after use. >> You could maybe implement a "cssannotate(stylesheet, tree)" function, which >> would map a stylesheet on a tree by setting (or extending) the "style" >> attributes on each element accordingly. That would come pretty close to what >> you were looking for. > > This just had the negative side-effect of changing the tree itself. So > it would only be applicable for read-only-operations, since one wouldn't > want to put all style permanently into style attributes for most use > cases. Agreed. However, you can't store anything in Elements that is not reflected by the underlying tree, as Element objects (which are actually just proxies) can be garbage collected while the tree stays alive. You can also store style information in the tree under a separate namespace. > Hm, I have to think about this. But it seems that a combination of > lxml.cssselect and cssutils would quite do. 
Since I don't want to rely
> on lxml 2.0 yet, I'd wait for the implementation anyway. Thanks for your
> hints!

I don't think cssselect.py uses any 2.0 specific features. Copying it over to
1.3 (or into your own code base) should work as a temporary solution.

Stefan

From gilles.lenfant at gmail.com Mon Sep 17 16:41:40 2007
From: gilles.lenfant at gmail.com (Gilles Lenfant)
Date: Mon, 17 Sep 2007 16:41:40 +0200
Subject: [lxml-dev] XML files starting with BOM
References: <42B618E0-1193-4211-8314-6383654EF8FF@gmail.com>
Message-ID:

Hi from an lxml newbie,

First of all, many thanks for lxml, which is the easiest XML lib for Python.

lxml doesn't like XML files starting with a BOM (see
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info).

M$Office 2007 documents use such notation in their inner XML files.
So I need to skip all chars from the file until I get a "<" before
passing the stream to lxml. Hopefully, the files are UTF-8.

Is it a bug or a feature?

--
Gilles Lenfant
gilles.lenfant at gmail.com

From felwert at uni-bremen.de Mon Sep 17 17:00:21 2007
From: felwert at uni-bremen.de (Frederik Elwert)
Date: Mon, 17 Sep 2007 17:00:21 +0200
Subject: [lxml-dev] CSS and lxml
In-Reply-To: <46EE9111.5050705@behnel.de>
References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> <46EE9111.5050705@behnel.de>
Message-ID: <1190041221.7564.58.camel@FredDesk>

On Monday, 17.09.2007, at 16:37 +0200, Stefan Behnel wrote:
> If you don't want to alter the tree, you can use a dict to map Elements to a
> style dict. However, note that Elements are not currently weak referenceable,
> so you'd have to make sure the trees are discarded after use.

Erm, I must confess, I'm not sure what this means, practically speaking.
Is it enough to "del" the dict after use? Aside from that, this sounds quite good.
> >> You could maybe implement a "cssannotate(stylesheet, tree)" function, which > >> would map a stylesheet on a tree by setting (or extending) the "style" > >> attributes on each element accordingly. That would come pretty close to what > >> you were looking for. > > > > This just had the negative side-effect of changing the tree itself. So > > it would only be applicable for read-only-operations, since one wouldn't > > want to put all style permanently into style attributes for most use > > cases. > > Agreed. However, you can't store anything in Elements that is not reflected by > the underlying tree, as Element objects (which are actually just proxies) can > be garbage collected while the tree stays alive. Yes, sure. So a style dict would have to be a totally separated object, I guess? I think I can live with that. > You can also store style information in the tree under a separate namespace. Hm, true. I have to think about that, since it would introduce some redundancy, but it might be the easiest way to go. > I don't think cssselect.py uses any 2.0 specific features. Copying it over to > 1.3 (or into your own code base) should work as a temporary solution. Ah, that's good, I'll give it a try! Thanks, Frederik From stefan_ml at behnel.de Mon Sep 17 17:00:51 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 17:00:51 +0200 Subject: [lxml-dev] XML files starting with BOM In-Reply-To: References: <42B618E0-1193-4211-8314-6383654EF8FF@gmail.com> Message-ID: <46EE96A3.6080604@behnel.de> Gilles Lenfant wrote: > A first, many thanks for lxml that's the easiest XML lib for Python. :) > lxml doesnt't like XML files starting with a BOM (See http:// > www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info). > > M$Office 2007 documents use such notation in their inner xml files. > And I need to skip all chars from the file until I get a "<" before > passing the stream to lxml. Hopefully, the files are UTF-8. Is this only with UTF-8 BOMs? 
> Is it a bug or a feature ? Parsing BOM-ed XML data should work. Could you give some more detail here? Such as some short example code that shows what you are doing to parse XML data with a BOM and that fails on your machine? Stefan From stefan_ml at behnel.de Mon Sep 17 17:20:51 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 17:20:51 +0200 Subject: [lxml-dev] CSS and lxml In-Reply-To: <1190041221.7564.58.camel@FredDesk> References: <1190031083.7564.21.camel@FredDesk> <46EE7C09.8000006@behnel.de> <1190038360.7564.49.camel@FredDesk> <46EE9111.5050705@behnel.de> <1190041221.7564.58.camel@FredDesk> Message-ID: <46EE9B53.704@behnel.de> Frederik Elwert wrote: > Am Montag, den 17.09.2007, 16:37 +0200 schrieb Stefan Behnel: >> If you don't want to alter the tree, you can use a dict to map Elements to a >> style dict. However, note that Elements are not currently weak referenceable, >> so you'd have to make sure the trees are discarded after use. > > Erm, I must confess, I'm not sure what this means, practically speaking. > Is it enough to "del" the dict after use? Yes. If you use a per-tree dict and delete it when you delete the tree, you will be fine. Weak referencing means that the reference does not count for garbage collection, so when the Element is no longer used, it will not be kept alive only by the reference in the dict. See the weakref module. >> You can also store style information in the tree under a separate namespace. > > Hm, true. I have to think about that, since it would introduce some > redundancy, but it might be the easiest way to go. That's what lxml.objectify does for type annotations. It also provides annotate() and deannotate() functions to annotate everything and to clean up the tree when you're done. 
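[Editorial sketch] The "separate namespace" idea mentioned above could look roughly like this. It is a sketch under stated assumptions, not lxml.objectify's implementation: it uses the standard library's xml.etree.ElementTree, and the namespace URI and the helper names `annotate_styles()` / `deannotate_styles()` are hypothetical, chosen only to mirror the annotate/deannotate pairing.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace used only for this illustration.
STYLE_NS = "http://example.org/ns/style-annotation"
STYLE_ATTR = "{%s}style" % STYLE_NS


def annotate_styles(root, styles):
    """Write each element's style dict into a namespaced attribute."""
    for element in root.iter():
        properties = styles.get(element)
        if properties:
            element.set(
                STYLE_ATTR,
                "; ".join("%s: %s" % item for item in sorted(properties.items())),
            )


def deannotate_styles(root):
    """Remove the namespaced style attributes again, restoring the tree."""
    for element in root.iter():
        element.attrib.pop(STYLE_ATTR, None)


root = ET.fromstring("<body><p>Some text</p></body>")
paragraph = root.find("p")

annotate_styles(root, {paragraph: {"font-size": "16pt", "color": "red"}})
annotated = paragraph.get(STYLE_ATTR)
print(annotated)

deannotate_styles(root)
print(paragraph.attrib)
```

Because the annotation sits in its own namespace, it does not collide with a real "style" attribute and can be stripped again before the tree is serialized.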
Stefan From stefan_ml at behnel.de Mon Sep 17 20:32:22 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 17 Sep 2007 20:32:22 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070913081704.138210@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> Message-ID: <46EEC836.9030603@behnel.de> Hi Holger, jholg at gmx.de wrote: > Things get difficult if you assign leaf elements and actually instantiate > the python proxy objects. If no TREE attributes get used, these will end up > being "default empty elements", usually string elements. > > Also, once having been serialized, there is no way that leaf elements can > be recognized as ObjectifiedElements without the help of the TREE > attribute. That's the main reason I propose the keep_tree functionality, to > make ObjectifiedElement-leaves survive a creation-serialization-parse > cycle. I think we should do this: if old_pytypename == TREE_PYTYPE: if cetree.findChild(c_node, 0) is NULL: pytype = TREE_PYTYPE else: # check old type Do you still think we need the keep_tree then? Stefan From jholg at gmx.de Tue Sep 18 09:21:07 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 18 Sep 2007 09:21:07 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46EEC836.9030603@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> Message-ID: <20070918072107.19040@gmx.net> Hello Stefan, > > attribute. That's the main reason I propose the keep_tree functionality, > to > > make ObjectifiedElement-leaves survive a creation-serialization-parse > > cycle. 
> > I think we should do this: > > if old_pytypename == TREE_PYTYPE: > if cetree.findChild(c_node, 0) is NULL: > pytype = TREE_PYTYPE > else: > # check old type > > Do you still think we need the keep_tree then? You really don't like it, do you ;-)? I'd say this should work and remove the need for keep_tree, though. Sidenote: So I thought maybe we should revise the use of TREE in objectify in general, but one has to be very careful. You really want to have it e.g. in objectify.Element(): >>> o = objectify.Element("structural") >>> e = etree.Element("structural") >>> type(o), type(e) (, ) >>> root.o = o >>> root.e = e >>> # Now type lookup can not rely on parent == None ... >>> type(root.o), type(root.e) (, ) >>> Holger -- Psssst! Schon vom neuen GMX MultiMessenger geh?rt? Der kanns mit allen: http://www.gmx.net/de/go/multimessenger From stefan_ml at behnel.de Tue Sep 18 10:42:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 18 Sep 2007 10:42:19 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070918072107.19040@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> Message-ID: <46EF8F6B.8040403@behnel.de> Hi Holger, jholg at gmx.de wrote: >> I think we should do this: >> >> if old_pytypename == TREE_PYTYPE: >> if cetree.findChild(c_node, 0) is NULL: >> pytype = TREE_PYTYPE >> else: >> # check old type >> >> Do you still think we need the keep_tree then? > > You really don't like it, do you ;-)? > I'd say this should work and remove the need for keep_tree, though. Ok. I also added the tests from your patch now. Obvious question then: anything still missing from what your last patch did? > Sidenote: So I thought maybe we should revise the use of TREE in objectify in general, but one has to be very careful. You really want to have it e.g. 
in objectify.Element(): I think we should, and we should restrict its use to a minimum. If you want, you can take a look at it. I don't feel like touching working code at the moment. :) >>>> o = objectify.Element("structural") >>>> e = etree.Element("structural") >>>> type(o), type(e) > (, ) Whatever. I don't want any code to rely on *that*. :) (but I can see what your getting at) >>>> root.o = o >>>> root.e = e >>>> # Now type lookup can not rely on parent == None > ... >>>> type(root.o), type(root.e) > (, ) I'm not (any longer :) questioning the TREE type in general. I just think we should not write annotations where we know we will not need them. Stefan From jholg at gmx.de Tue Sep 18 10:57:58 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 18 Sep 2007 10:57:58 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46EF8F6B.8040403@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> Message-ID: <20070918085758.19050@gmx.net> > Ok. I also added the tests from your patch now. > > Obvious question then: anything still missing from what your last patch > did? I'll take a look. > you can take a look at it. I don't feel like touching working code at the > moment. :) I already peeked, and there is really not many places where TREE is used. I'd say it is needed anywhere it currently is, but I'll take a closer look. Holger -- Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten Browser-Versionen downloaden: http://www.gmx.net/de/go/browser From ebgssth at gmail.com Tue Sep 18 15:56:21 2007 From: ebgssth at gmail.com (js) Date: Tue, 18 Sep 2007 22:56:21 +0900 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: <46EE75F5.4090606@behnel.de> References: <46EE75F5.4090606@behnel.de> Message-ID: Hello again. 
I downgraded libxml2 from 2.6.29_0 to 2.6.27_0 and re-run the test script. surprise, Now it all works as in the lxml doc! seems newer libxml2 has some problem converting charset. (2.6.28_1 doesn't work either.) I'll look at libxml2's source. Thank you. On 9/17/07, Stefan Behnel wrote: > > js wrote: > > The lxml doc [*1] says that > > "You should generally avoid converting XML/HTML data to unicode before > > passing it into the parsers. It is both slower and error prone." > > > > [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings > > > > but my experience is different from that. > > Not quite. As you say below, you sometimes get ValueErrors depending on the > page data, so it *is* error prone. > > > > For example, the following code doesn't bother encoding things > > and leave the work to lxml.etree. > > According to the doc, this is right way, but it does't work > > and you'll got garbled characters. (give it a try) > > > > -------------------------------------------------------------------- > > # -*- coding: utf-8 -*- > > from lxml import html as etree > > This import makes your code hard to read IMHO. If you use lxml.html, say it. > > > > url='http://apple.com/kr' > > > > tree = etree.parse(url) > > from pprint import pformat > > for t in tree.xpath('//a[text()]'): > > print t.text_content() > > -------------------------------------------------------------------- > > Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And > when I collect the text, it looks perfectly reasonable, including strings like > > u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. > \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 ' > > This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console. > > Are you sure it's the text content and not just the console output on your side? > > > > Sometimes I got "ValueError: Unicode strings with encoding declaration > > are not supported. " > > On the same page? 
I assume you were referring to a different page here that > probably uses XHTML instead of HTML, right? The above should work for both - > as long as libxml2 can detect the encoding (and if it can't, there's > lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed). > > Stefan > From jholg at gmx.de Wed Sep 19 13:24:09 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 19 Sep 2007 13:24:09 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070918085758.19050@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> Message-ID: <20070919112409.271040@gmx.net> Hi, attached patch - enhances the annotation tests to check the TREE-attribute survival for leaf-TREE-elements in annotate/pyannotate/xsiannotate - fixes a bug in DataElement that did not set py:pytype correctly when invoked with unicode string args and adds some tests for this. I renamed _get_pytypename() to _pytypename() (internal) and __get_pytypename() to pytypename() (public), so DataElement() now uses _pytypename() rather than _typename(). 
Holger

Btw I'm getting core dumps in the schematron tests:

685/802 ( 85.4%): test_schematron_invalid_schema_empty (...hematron.ETreeSchematronTestCase)Segmentation Fault (core dumped)

#0  0xff0b3218 in strlen () from /usr/lib/libc.so.1
#1  0xff106530 in _doprnt () from /usr/lib/libc.so.1
#2  0xff108730 in vsnprintf () from /usr/lib/libc.so.1
#3  0xfe2b7874 in __xmlRaiseError () from /apps/prod/lib/libxml2.so.2
#4  0xfe45fd5c in xmlSchematronPErr () from /apps/prod/lib/libxml2.so.2
#5  0xfe462d24 in xmlSchematronParse () from /apps/prod/lib/libxml2.so.2
#6  0xfe60f080 in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x829a10,
    __pyx_args=0x8872c8, __pyx_kwds=0x109b04) at src/lxml/etree.c:4131
#7  0x58504 in type_call (type=0xfe665988, args=0x829d30, kwds=0x89d4b0)
    at Objects/typeobject.c:443
#8  0x260c4 in PyObject_Call (func=0x829a10, arg=0x829d30, kw=0x89d4b0)
    at Objects/abstract.c:1802

--
Psssst! Schon vom neuen GMX MultiMessenger gehört?
Der kanns mit allen: http://www.gmx.net/de/go/multimessenger
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_treesurvival_dataelement_ustr.patch Type: application/octet-stream Size: 9707 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070919/fc92de37/attachment-0001.obj From stefan_ml at behnel.de Wed Sep 19 15:03:25 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 19 Sep 2007 15:03:25 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070919112409.271040@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> Message-ID: <46F11E1D.70309@behnel.de> jholg at gmx.de wrote: > attached patch > - enhances the annotation tests to check the TREE-attribute survival for leaf-TREE-elements in annotate/pyannotate/xsiannotate Ok. > - fixes a bug in DataElement that did not set py:pytype correctly when invoked with unicode string args and adds some tests for this. You should just commit this kind of fixes instead of sending them to the list. > I renamed _get_pytypename() to _pytypename() (internal) and __get_pytypename() to pytypename() (public), so DataElement() now uses _pytypename() rather than _typename(). Any reason there *is* a pytypename() function? It doesn't seem to be used. 
> Btw I'm getting core dumps in the schematron tests: > > 685/802 ( 85.4%): test_schematron_invalid_schema_empty (...hematron.ETreeSchematronTestCase)Segmentation Fault (core dumped) > > #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 > #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 > #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 > #3 0xfe2b7874 in __xmlRaiseError () from /apps/prod/lib/libxml2.so.2 > #4 0xfe45fd5c in xmlSchematronPErr () from /apps/prod/lib/libxml2.so.2 > #5 0xfe462d24 in xmlSchematronParse () from /apps/prod/lib/libxml2.so.2 > #6 0xfe60f080 in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x829a10, > __pyx_args=0x8872c8, __pyx_kwds=0x109b04) at src/lxml/etree.c:4131 > #7 0x58504 in type_call (type=0xfe665988, args=0x829d30, kwds=0x89d4b0) > at Objects/typeobject.c:443 > #8 0x260c4 in PyObject_Call (func=0x829a10, arg=0x829d30, kw=0x89d4b0) > at Objects/abstract.c:1802 I don't get those, with none of the supported libxml2 versions. What's the one you use? Have you seen those with the trunk before or is it just now? Stefan From jholg at gmx.de Wed Sep 19 15:43:27 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 19 Sep 2007 15:43:27 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <46F11E1D.70309@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> Message-ID: <20070919134327.17290@gmx.net> Hi Stefan, > > - fixes a bug in DataElement that did not set py:pytype correctly when > invoked with unicode string args and adds some tests for this. > > You should just commit this kind of fixes instead of sending them to the > list. Ok. 
> > I renamed _get_pytypename() to _pytypename() (internal) and > __get_pytypename() to pytypename() (public), so DataElement() now uses _pytypename() > rather than _typename(). > > Any reason there *is* a pytypename() function? It doesn't seem to be used. I figured it's nice to have it usable from outside objectify if you need to use explicit pytype names, so you don't have to reimplement the str/unicode distinction everywhere. > > Btw I'm getting core dumps in the schematron tests: > > > > 685/802 ( 85.4%): test_schematron_invalid_schema_empty > (...hematron.ETreeSchematronTestCase)Segmentation Fault (core dumped) > > > > #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 > > #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 > > #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 > > #3 0xfe2b7874 in __xmlRaiseError () from /apps/prod/lib/libxml2.so.2 > > #4 0xfe45fd5c in xmlSchematronPErr () from /apps/prod/lib/libxml2.so.2 > > #5 0xfe462d24 in xmlSchematronParse () from /apps/prod/lib/libxml2.so.2 > > #6 0xfe60f080 in __pyx_f_5etree_10Schematron___init__ > (__pyx_v_self=0x829a10, > > __pyx_args=0x8872c8, __pyx_kwds=0x109b04) at src/lxml/etree.c:4131 > > #7 0x58504 in type_call (type=0xfe665988, args=0x829d30, kwds=0x89d4b0) > > at Objects/typeobject.c:443 > > #8 0x260c4 in PyObject_Call (func=0x829a10, arg=0x829d30, kw=0x89d4b0) > > at Objects/abstract.c:1802 > > I don't get those, with none of the supported libxml2 versions. What's the > one > you use? Have you seen those with the trunk before or is it just now? No, I've not seen such problems on the trunk before. I had to upgrade to latest cython to build this time. This is the setup: TESTED VERSION: 2.0.alpha2-46719 Python: (2, 4, 4, 'final', 0) lxml.etree: (2, 0, -198, 46719) libxml used: (2, 6, 27) libxml compiled: (2, 6, 27) libxslt used: (1, 1, 20) libxslt compiled: (1, 1, 20) Holger -- Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! 
Ideal f?r Modem und ISDN: http://www.gmx.net/de/go/smartsurfer From jholg at gmx.de Wed Sep 19 16:12:25 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 19 Sep 2007 16:12:25 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070919134327.17290@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> Message-ID: <20070919141225.271030@gmx.net> > No, I've not seen such problems on the trunk before. I had to upgrade to > latest cython to build this time. Latest Cython *Release* (0.9.6.6), that is, to be exact. Holger -- Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! Ideal f?r Modem und ISDN: http://www.gmx.net/de/go/smartsurfer From stefan_ml at behnel.de Wed Sep 19 17:13:40 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 19 Sep 2007 17:13:40 +0200 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: References: <46EE75F5.4090606@behnel.de> Message-ID: <46F13CA4.5020302@behnel.de> js wrote: > I downgraded libxml2 from 2.6.29_0 to 2.6.27_0 > and re-run the test script. > surprise, Now it all works as in the lxml doc! > > seems newer libxml2 has some problem converting charset. > (2.6.28_1 doesn't work either.) 
I think it's this change in the function htmlCtxtReset() of HTMLparser.c in
libxml2 2.6.28:

@@ -5806,7 +5850,7 @@
     ctxt->inSubset = 0;
     ctxt->errNo = XML_ERR_OK;
     ctxt->depth = 0;
-    ctxt->charset = XML_CHAR_ENCODING_UTF8;
+    ctxt->charset = XML_CHAR_ENCODING_NONE;
     ctxt->catalogs = NULL;
     xmlInitNodeInfoSeq(&ctxt->node_seq);

So the default encoding is no longer UTF-8 and instead it tries auto detection
(which apparently fails for your page, so it's likely the page that is broken
here).

The problem is that you can't really defend UTF-8 as a default encoding (or
any default encoding at all) as I don't think there is any clear winner in the
page encodings of all web pages out there. And UTF-8 is definitely something
that will fail for many pages, while things like ISO-8859-1 just let the
content pass so that you can still fix it by hand (if you feel like it). So
libxml2 is actually right in not defaulting to UTF-8.

Just in case you can't accept that, have you tried installing BeautifulSoup
and parsing with lxml.html.ElementSoup? BeautifulSoup has pretty good encoding
detection support.

Stefan

From stefan_ml at behnel.de Thu Sep 20 14:23:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 20 Sep 2007 14:23:22 +0200
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To: <46F13CA4.5020302@behnel.de>
References: <46EE75F5.4090606@behnel.de> <46F13CA4.5020302@behnel.de>
Message-ID: <46F2663A.1050508@behnel.de>

Stefan Behnel wrote:
> js wrote:
>> I downgraded libxml2 from 2.6.29_0 to 2.6.27_0
>> and re-run the test script.
>> surprise, Now it all works as in the lxml doc!
>
> So the default encoding is no longer UTF-8 and instead it tries auto detection
> (which apparently fails for your page, so it's likely the page that is broken
> here).

I added an "encoding" keyword argument to the parsers in the current trunk to
override the document encoding (in case you happen to know better).
So you could now parse the HTML document with

    >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
    >>> tree = etree.parse("http://the/file.html", utf8_html_parser)

This will (very, very likely) give you an exception if the document is not
UTF-8, so you can then fall back to another parser.

Note that building the SVN trunk currently requires Cython 0.9.6.6, but the
third alpha shouldn't be /that/ far away.

Stefan

From ebgssth at gmail.com Thu Sep 20 15:59:16 2007
From: ebgssth at gmail.com (js)
Date: Thu, 20 Sep 2007 22:59:16 +0900
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To: <46F2663A.1050508@behnel.de>
References: <46EE75F5.4090606@behnel.de> <46F13CA4.5020302@behnel.de> <46F2663A.1050508@behnel.de>
Message-ID:

On 9/20/07, Stefan Behnel wrote:
> I added an "encoding" keyword argument to the parsers in the current trunk to
> override the document encoding (in case you happen to know better). So you
> could now parse the HTML document with
>
> >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
> >>> tree = etree.parse("http://the/file.html", utf8_html_parser)
>
> This will (very, very likely) give you an exception if the document is not
> UTF-8, so you can then fall back to another parser.

Thank you for your effort, but I wonder how we can know what character set
the document is written in before GETting the page and checking the response
header, the meta tag and the content itself. We really need to GET the doc
first.

So I think that calling urlopen(url).read().decode(somecharset) and letting
lxml parse the result is not only easier but also gives us more flexibility.
For example, with Python's urllib2 you can easily set the User-Agent, add
more handlers, etc.

Stefan, is it possible to change lxml to avoid the "ValueError" exception
when passing a decoded string to lxml.parse()?
If the answer is no, could you please give me some advice or your ideas
on this problem?

Thank you in advance.
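[Editorial sketch] The "check the response header" step raised in this message can be illustrated with the standard library alone. The helper name `charset_from_content_type()` is hypothetical. The sketch only extracts the charset parameter from a Content-Type header value; it does not fetch anything or look at meta tags. The result could then be handed to a parser through the "encoding" keyword described in the quoted message.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html; charset=EUC-KR"))
print(charset_from_content_type("text/html", default="iso-8859-1"))
```

Only the header is inspected here, so the document body does not have to be read and decoded by hand before it is passed to the parser.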
From stefan_ml at behnel.de Thu Sep 20 17:38:32 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 20 Sep 2007 17:38:32 +0200 Subject: [lxml-dev] non-ascii characters get garbled In-Reply-To: References: <46EE75F5.4090606@behnel.de> <46F13CA4.5020302@behnel.de> <46F2663A.1050508@behnel.de> Message-ID: <46F293F8.9040002@behnel.de> js wrote: > On 9/20/07, Stefan Behnel wrote: >> I added an "encoding" keyword argument to the parsers in the current trunk to >> override the document encoding (in case you happen to know better). So you >> could now parse the HTML document with >> >> >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8") >> >>> tree = etree.parse("http://the/file.html", utf8_html_parser) >> >> This will (very, very likely) give you an exception if the document is not >> UTF-8, so you can then fall back to another parser. > > Thank you for your effort. > but I wonder how can we know in what character set the document is > written before > GETing the page and check the response header, meta tag and contents itself? The libxml2 HTML parser does that for you. If there is a Content-Type (which is not too hidden inside the tag soup), the parser will obey it. It doesn't know about the header, but for that, you can pass in the "encoding" keyword. Note that you usually don't have to read the file if all you want it the header. Just read the header, check if you have to override the input encoding and then pass the file into parse(). > We really need to GET the doc first. > So I think urlopen(url).read().decode(somecharset) and This will not work if the document contains an encoding hint, such as a tag in HTML or an XML declaration, as the parser will switch encodings when it sees it. Thus the "encoding" keyword for detection override. > letting lxml parse it is not only easier but also giving us more flexibility. > For example, by using python's urllib2, you can easily set User-Agent, > adding more handlers, etc. 
You can pass the result of urlopen(url) into parse() instead of reading the
string first.

> Stefan, Is it possible to change lxml to avoid "ValueError" exception
> when passing
> decoded string to lxml.parse()?
> If the answer is no, could you please give me some advice or your idea
> on thin problem?

Use lxml.fromstring() for parsing strings.

Stefan

From stefan_ml at behnel.de Thu Sep 20 18:24:15 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 20 Sep 2007 18:24:15 +0200
Subject: [lxml-dev] time for ideas: a good API for iterparse() on HTML ?
Message-ID: <46F29EAF.1020403@behnel.de>

Hi all,

I wonder what a good API for iterparse() on HTML would be. I'm kinda tempted
to change the iterparse class into a function like parse(), remove the
existing keyword arguments and replace them with a standard "parser" argument
as in parse() and fromstring():

    >>> iterator = etree.iterparse(f, parser=etree.HTMLParser())

I'm not sure this works, as we can't support a parser target object ("target"
keyword of parsers) or the feed parser interface with iterparse (which both
the XMLParser and the HTMLParser currently support), but it wouldn't be
obvious from the API that you can't pass a target parser into iterparse().
So it's not quite the perfect interface, as this would need to raise an error: >>> parser = etree.HTMLParser(target=SomeTarget()) >>> iterator = etree.iterparse(f, parser=parser) The alternatives would be an "html" keyword option to iterparse (the straight forward, simple solution, but which we use nowhere else in the API): >>> iterator = etree.iterparse(f, html=True) or a "method" argument like in the serialisers: >>> iterator = etree.iterparse(f, method="html") or maybe: >>> iterator = etree.iterparse(f, input_type="html") or an "iterparsehtml" function/class (which would be the worst thing to do IMHO): >>> iterator = etree.iterparsehtml(f) I feel that there should be some symmetry between iterparse(), the other parse functions and the parser classes, but I'm not sure which. Any comments? Stefan From stefan_ml at behnel.de Fri Sep 21 08:50:46 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 21 Sep 2007 08:50:46 +0200 Subject: [lxml-dev] time for ideas: a good API for iterparse() on HTML ? In-Reply-To: <46F29EAF.1020403@behnel.de> References: <46F29EAF.1020403@behnel.de> Message-ID: <46F369C6.5080901@behnel.de> Stefan Behnel wrote: > The alternatives would be an "html" keyword option to iterparse (the straight > forward, simple solution, but which we use nowhere else in the API): > > >>> iterator = etree.iterparse(f, html=True) I implemented this on the trunk for now. I'm not particularly happy about it, but it'll do for an alpha release. Any comments appreciated. 
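
For illustration, the "html" keyword variant that went into the trunk could be used roughly like this. It is a sketch only, assuming an lxml version in which iterparse() still accepts the keyword (the interface is described above as subject to change); the sample document is made up.

```python
from io import BytesIO

from lxml import etree

html = b"<html><body><p>one</p><p>two</p></body></html>"

# html=True selects the HTML parser. With no "events" argument,
# iterparse() reports "end" events only.
paragraphs = []
for event, element in etree.iterparse(BytesIO(html), html=True):
    if element.tag == "p":
        paragraphs.append(element.text)

print(paragraphs)
```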
Stefan From stefan_ml at behnel.de Fri Sep 21 10:22:58 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 21 Sep 2007 10:22:58 +0200 Subject: [lxml-dev] annotate, pyannotate, xsiannotate In-Reply-To: <20070919134327.17290@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> Message-ID: <46F37F62.40307@behnel.de> jholg at gmx.de wrote: >> Any reason there *is* a pytypename() function? It doesn't seem to be used. > > I figured it's nice to have it usable from outside objectify if you need to use explicit pytype names, so you don't have to reimplement the str/unicode distinction everywhere. Ok, why not. I added it. >>> Btw I'm getting core dumps in the schematron tests: >>> >>> 685/802 ( 85.4%): test_schematron_invalid_schema_empty >> (...hematron.ETreeSchematronTestCase)Segmentation Fault (core dumped) >>> #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 >>> #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 >>> #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 >>> #3 0xfe2b7874 in __xmlRaiseError () from /apps/prod/lib/libxml2.so.2 >>> #4 0xfe45fd5c in xmlSchematronPErr () from /apps/prod/lib/libxml2.so.2 >>> #5 0xfe462d24 in xmlSchematronParse () from /apps/prod/lib/libxml2.so.2 >>> #6 0xfe60f080 in __pyx_f_5etree_10Schematron___init__ >> (__pyx_v_self=0x829a10, >>> __pyx_args=0x8872c8, __pyx_kwds=0x109b04) at src/lxml/etree.c:4131 >>> #7 0x58504 in type_call (type=0xfe665988, args=0x829d30, kwds=0x89d4b0) >>> at Objects/typeobject.c:443 >>> #8 0x260c4 in PyObject_Call (func=0x829a10, arg=0x829d30, kw=0x89d4b0) >>> at Objects/abstract.c:1802 >> I don't get those, with none of the supported libxml2 versions. 
What's the >> one >> you use? Have you seen those with the trunk before or is it just now? > > No, I've not seen such problems on the trunk before. I had to upgrade to latest cython to build this time. > > This is the setup: > > TESTED VERSION: 2.0.alpha2-46719 > Python: (2, 4, 4, 'final', 0) > lxml.etree: (2, 0, -198, 46719) > libxml used: (2, 6, 27) > libxml compiled: (2, 6, 27) > libxslt used: (1, 1, 20) > libxslt compiled: (1, 1, 20) Schematron uses XPath a lot, so I wouldn't be surprised if this was related to the XPath bug in libxml2 2.6.27. Is there any chance you could switch to 2.6.28 or later? Note that lxml.etree (trunk) now emits a warning if you use XPath on 2.6.27, as we can't really work around it. It happens when you get certain errors in the XPath evaluation, as in the case above. Stefan From jholg at gmx.de Fri Sep 21 11:23:40 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 21 Sep 2007 11:23:40 +0200 Subject: [lxml-dev] trunk schematron tests core dump (was: annotate, pyannotate, xsiannotate) In-Reply-To: <46F37F62.40307@behnel.de> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> <46F37F62.40307@behnel.de> Message-ID: <20070921092340.311080@gmx.net> Hi Stefan, > >>> Btw I'm getting core dumps in the schematron tests: > >>> [...] > Schematron uses XPath a lot, so I wouldn't be surprised if this was > related to > the XPath bug in libxml2 2.6.27. Is there any chance you could switch to > 2.6.28 or later? Note that lxml.etree (trunk) now emits a warning if you > use > XPath on 2.6.27, as we can't really work around it. It happens when you > get > certain errors in the XPath evaluation, as in the case above. 
I'll try out the latest libxml2, I had also noted the warning. Holger From ianb at colorstudy.com Fri Sep 21 16:25:25 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 21 Sep 2007 09:25:25 -0500 Subject: [lxml-dev] 2.0alpha docs? Message-ID: <46F3D455.8020205@colorstudy.com> Are the docs for 2.0 built and up anywhere online? -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From jholg at gmx.de Fri Sep 21 16:29:05 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 21 Sep 2007 16:29:05 +0200 Subject: [lxml-dev] trunk schematron tests core dump (was: annotate, pyannotate, xsiannotate) In-Reply-To: <20070921092340.311080@gmx.net> References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net> <46F37F62.40307@behnel.de> <20070921092340.311080@gmx.net> Message-ID: <20070921142905.315500@gmx.net> Hi, > > Schematron uses XPath a lot, so I wouldn't be surprised if this was > > related to > > the XPath bug in libxml2 2.6.27. Is there any chance you could switch to > > 2.6.28 or later? Note that lxml.etree (trunk) now emits a warning if you > > use > > XPath on 2.6.27, as we can't really work around it. It happens when you > > get > > certain errors in the XPath evaluation, as in the case above. > > I'll try out the latest libxml2, I had also noted the warning. Unfortunately, using the latest & greatest libxml2/libxslt (2.6.33/1.1.22) doesn't solve the problem for me.
Btw I won't come near a Solaris box for the next week, and probably not be reachable by mail, so unfortunately I will only be able to provide more info then. Have a nice week, everybody! Holger Here's what I see: Something strange (a cython bug?): #6 0xfe60fee0 in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x8c7c50, __pyx_args=0x887700, __pyx_kwds=0x109b04) at src/lxml/etree.c:4905 But when I look at etree.c in line 4905 this is nowhere near __pyx_f_5etree_10Schematron___init__: etree.c: ======== [...] 70188 70189 static int __pyx_f_5etree_10Schematron___init__(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/ 70190 static int __pyx_f_5etree_10Schematron___init__(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) { 70191 PyObject *__pyx_v_etree = 0; 70192 PyObject *__pyx_v_file = 0; [...] Test & backtrace ================ /apps/pydev/bin/python2.4 setup.py build_ext -i Building with Cython. Building lxml version 2.0.alpha2-46776 running build_ext /apps/pydev/bin/python2.4 test.py -p -v TESTED VERSION: 2.0.alpha2-46776 Python: (2, 4, 4, 'final', 0) lxml.etree: (2, 0, -198, 46776) libxml used: (2, 6, 30) libxml compiled: (2, 6, 30) libxslt used: (1, 1, 22) libxslt compiled: (1, 1, 22) 111/810 ( 13.7%): Doctest: validation.txt /Total line 1: Sum is not 100%. /Total line 1: Sum is not 100%. 671/810 ( 82.8%): Doctest: validation.txt /Total line 1: Sum is not 100%. /Total line 1: Sum is not 100%. 690/810 ( 85.2%): test_schematron (lxml.tests.test_schematron.ETreeSchematronTestCase)/AAA line 1: There is an extra element 693/810 ( 85.6%): test_schematron_invalid_schema_empty (...schematron.ETreeSchematronTestCase)make: *** [test_inplace] Segmentation Fault (core dumped) 2 lb54320 at adevp02 .../lxml $ gdb python2.4 -c core GNU gdb 4.18 Copyright 1998 Free Software Foundation, Inc. 
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "sparc-sun-solaris2.6"... Core was generated by `/apps/pydev/bin/python2.4 test.py -p -v'. Program terminated with signal 9, Killed. Reading symbols from /usr/lib/libresolv.so.2...done. Reading symbols from /usr/lib/libsocket.so.1...done. Reading symbols from /usr/lib/libnsl.so.1...done. Reading symbols from /usr/lib/librt.so.1...done. Reading symbols from /usr/lib/libdl.so.1...done. Reading symbols from /usr/lib/libpthread.so.1...done. Reading symbols from /usr/lib/libm.so.1...done. Reading symbols from /usr/lib/libc.so.1...done. Reading symbols from /usr/lib/libmp.so.2...done. Reading symbols from /usr/lib/libaio.so.1...done. Reading symbols from /usr/platform/SUNW,Sun-Fire-V440/lib/libc_psr.so.1...done. Reading symbols from /usr/lib/libthread.so.1...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/time.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/itertools.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_curses.so...done. Reading symbols from /apps/prod/lib/libncurses.so.5...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/collections.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/strop.so...done. Reading symbols from /data/pydev/hjoukl/LXML/lxml/src/lxml/etree.so...done. Reading symbols from /apps/pydev/lib/libxslt.so.1...done. Reading symbols from /apps/pydev/lib/libexslt.so.0...done. Reading symbols from /apps/pydev/lib/libxml2.so.2...done. Reading symbols from /apps/prod/lib/libz.so...done. Reading symbols from /apps/prod//lib/libiconv.so.2...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_bisect.so...done. 
Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_heapq.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/cStringIO.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/math.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/binascii.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_random.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/fcntl.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_socket.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_ssl.so...done. Reading symbols from /apps/local/lib/libssl.so.0.9.6...done. Reading symbols from /apps/local/lib/libcrypto.so.0.9.6...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/operator.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/struct.so...done. ---Type to continue, or q to quit--- Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/md5.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/sha.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/datetime.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/zlib.so...done. Reading symbols from /data/pydev/hjoukl/LXML/lxml/src/lxml/objectify.so...done. Reading symbols from /data/pydev/hjoukl/LXML/lxml/src/lxml/pyclasslookup.so...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/_locale.so...done. Reading symbols from /apps/local/lib/libintl.so.1...done. Reading symbols from /apps/pydev/lib/python2.4/lib-dynload/readline.so...done. Reading symbols from /apps/prod/lib/libreadline.so...done. 
#0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 (gdb) bt #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1 #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1 #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1 #3 0xfe2b7afc in __xmlRaiseError () from /apps/pydev/lib/libxml2.so.2 #4 0xfe461e2c in xmlSchematronPErr () from /apps/pydev/lib/libxml2.so.2 #5 0xfe4648b4 in xmlSchematronParse () from /apps/pydev/lib/libxml2.so.2 #6 0xfe60fee0 in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x8c7c50, __pyx_args=0x887700, __pyx_kwds=0x109b04) at src/lxml/etree.c:4905 #7 0x58504 in type_call (type=0xfe666d80, args=0x824f30, kwds=0x89c810) at Objects/typeobject.c:443 #8 0x260c4 in PyObject_Call (func=0x8c7c50, arg=0x824f30, kw=0x89c810) at Objects/abstract.c:1802 #9 0x88f4c in ext_do_call (func=0xfe666d80, pp_stack=0xffbed5ec, flags=3, na=-1, nk=0) at Python/ceval.c:3848 #10 0x85af8 in PyEval_EvalFrame (f=0x1803d0) at Python/ceval.c:2214 #11 0x86eb8 in PyEval_EvalCodeEx (co=0x1be460, globals=0x0, locals=0x1803d0, args=0x88496c, argcount=4, kws=0x88497c, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2752 #12 0x88888 in update_keyword_args (orig_kwdict=0x0, nk=-4270016, pp_stack=0x4, func=0x4) at Python/ceval.c:3676 #13 0x886b0 in call_function (pp_stack=0xffbed840, oparg=4) at Python/ceval.c:3597 #14 0x85a00 in PyEval_EvalFrame (f=0x884818) at Python/ceval.c:2186 #15 0x887fc in fast_function (func=0x6d4770, pp_stack=0x77ef28, n=1, na=1, nk=1240904) at Python/ceval.c:3654 #16 0x886b0 in call_function (pp_stack=0xffbeda08, oparg=1) at Python/ceval.c:3597 #17 0x85a00 in PyEval_EvalFrame (f=0x77edc8) at Python/ceval.c:2186 #18 0x86eb8 in PyEval_EvalCodeEx (co=0x1be260, globals=0x0, locals=0x77edc8, args=0x98b834, argcount=2, kws=0x2f4370, kwcount=0, defs=0x1c067c, defcount=1, closure=0x0) at Python/ceval.c:2752 #19 0xdadd4 in PyFunction_GetCode (op=0x1c8a30) at Objects/funcobject.c:66 #20 0x260c4 in PyObject_Call (func=0x1c8a30, arg=0x98b828, 
kw=0x8c1150) at Objects/abstract.c:1802 #21 0x88f4c in ext_do_call (func=0x1c8a30, pp_stack=0xffbedca4, flags=3, na=-1, nk=0) at Python/ceval.c:3848 #22 0x85af8 in PyEval_EvalFrame (f=0x48eb78) at Python/ceval.c:2214 #23 0x86eb8 in PyEval_EvalCodeEx (co=0x1be2a0, globals=0x0, locals=0x48eb78, args=0x973f3c, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2752 #24 0xdadd4 in PyFunction_GetCode (op=0x1c8a70) at Objects/funcobject.c:66 #25 0x260c4 in PyObject_Call (func=0x1c8a70, arg=0x973f30, kw=0x0) at Objects/abstract.c:1802 #26 0x2e30c in instancemethod_descr_get (meth=0x1, obj=0x973f30, cls=0x0) at Objects/classobject.c:2539 #27 0x260c4 in PyObject_Call (func=0x1c8a70, arg=0x973f30, kw=0x0) at Objects/abstract.c:1802 #28 0x638b8 in slot_tp_call (self=0x6ced10, args=0x25e470, kwds=0x0) at Objects/typeobject.c:4549 #29 0x260c4 in PyObject_Call (func=0x6ced10, arg=0x25e470, kw=0x0) at Objects/abstract.c:1802 #30 0x8a8e0 in do_call (func=0x6ced10, pp_stack=0xffbee3a8, na=-1, nk=2483312)- -- Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! Ideal f?r Modem und ISDN: http://www.gmx.net/de/go/smartsurfer From felwert at uni-bremen.de Fri Sep 21 18:34:23 2007 From: felwert at uni-bremen.de (Frederik Elwert) Date: Fri, 21 Sep 2007 18:34:23 +0200 Subject: [lxml-dev] 2.0alpha docs? In-Reply-To: <46F3D5D9.3010202@colorstudy.com> References: <46F3D455.8020205@colorstudy.com> <1190384977.9653.0.camel@FredDesk> <46F3D5D9.3010202@colorstudy.com> Message-ID: <1190392463.9653.3.camel@FredDesk> Am Freitag, den 21.09.2007, 09:31 -0500 schrieb Ian Bicking: > Frederik Elwert wrote: > > Am Freitag, den 21.09.2007, 09:25 -0500 schrieb Ian Bicking: > >> Are the docs for 2.0 built and up anywhere online? > > > > Yes, they are: > > > > http://codespeak.net/lxml/dev/ > > The front page doesn't have any links to > http://codespeak.net/lxml/dev/lxmlhtml.html -- but I'm not sure how the > doc index is created, so could you link that in? 
Sorry, forgot to answer to the list instead of you personally. I am not involved in the development of lxml, but somebody else on the list might look into this. Cheers, Frederik From stefan_ml at behnel.de Fri Sep 21 19:44:56 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 21 Sep 2007 19:44:56 +0200 Subject: [lxml-dev] 2.0alpha docs? In-Reply-To: <1190392463.9653.3.camel@FredDesk> References: <46F3D455.8020205@colorstudy.com> <1190384977.9653.0.camel@FredDesk> <46F3D5D9.3010202@colorstudy.com> <1190392463.9653.3.camel@FredDesk> Message-ID: <46F40318.8000800@behnel.de> Frederik Elwert wrote: > Am Freitag, den 21.09.2007, 09:31 -0500 schrieb Ian Bicking: >> Frederik Elwert wrote: >>> Am Freitag, den 21.09.2007, 09:25 -0500 schrieb Ian Bicking: >>>> Are the docs for 2.0 built and up anywhere online? >>> Yes, they are: >>> >>> http://codespeak.net/lxml/dev/ >> >> The front page doesn't have any links to >> http://codespeak.net/lxml/dev/lxmlhtml.html Sure it does. Look out for "lxml.html" in the side menu, right below lxml.objectify. Stefan From rcdailey at gmail.com Mon Sep 24 22:36:05 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 24 Sep 2007 15:36:05 -0500 Subject: [lxml-dev] Python script to optimize XML text Message-ID: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> Hi, I'm currently seeking a python script that provides a way of optimizing out useless characters in an XML document to provide the optimal size for the file. For example, assume the following XML script: By running this through an XML optimizer, the file would appear as: Note that the following were changed: - All comments were stripped from the XML - All spaces, tabs, carriage returns, and other forms of unimportant whitespace are removed - Elements that contain no text or children that are in the form of use the short-hand method for ending an element body: Anyone know of a tool or python script that can perform optimizations like explained above? 
I realize I could probably do this with regular expressions in python, however I was hoping someone already did this work. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070924/9a44339f/attachment-0001.htm From sidnei at enfoldsystems.com Mon Sep 24 22:49:01 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 24 Sep 2007 17:49:01 -0300 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> Message-ID: If your XML is well-formed, a XSLT is probably your best choice. I believe even the most trivial 'pass through' example might produce the output you expect here. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From lxml at holloway.co.nz Mon Sep 24 23:45:58 2007 From: lxml at holloway.co.nz (Matthew Cruickshank) Date: Tue, 25 Sep 2007 09:45:58 +1200 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> Message-ID: <46F83016.6070106@holloway.co.nz> Robert Dailey wrote: > Note that the following were changed: > - All comments were stripped from the XML > - All spaces, tabs, carriage returns, and other forms of unimportant > whitespace are removed > - Elements that contain no text or children that are in the form of > use the short-hand method for ending an element body: > As Sidnei says an XSLT is probably the easiest way, The first and third requirement are done by default in XSLT (I think), so you'd only need to match text nodes and normalize them... ps. please avoid using regexs with XML... that way leads to madness. 
With the possibility of commented-out nodes and nested structures and such regexs will only ever work on a tiny subset of XML. .Matthew Cruickshank http://docvert.org << MS Word to OpenDocument via an extensible XML Pipeline -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070925/6e47eb01/attachment.htm From rcdailey at gmail.com Tue Sep 25 00:02:43 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 24 Sep 2007 17:02:43 -0500 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <46F83016.6070106@holloway.co.nz> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> <46F83016.6070106@holloway.co.nz> Message-ID: <496954360709241502m7243c698iedfbb48094e28c0@mail.gmail.com> If I wanted to remove all whitespace between elements, I would use this regex: exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL ) However, this isn't working for some reason. I'm fairly new to regular expressions so I may be missing something obvious. Thanks. On 9/24/07, Matthew Cruickshank wrote: > > Robert Dailey wrote: > > Note that the following were changed: > - All comments were stripped from the XML > - All spaces, tabs, carriage returns, and other forms of unimportant > whitespace are removed > - Elements that contain no text or children that are in the form of > use the short-hand method for ending an element body: > > > > As Sidnei says an XSLT is probably the easiest way, > > The first and third requirement are done by default in XSLT (I think), so > you'd only need to match text nodes and normalize them... > > > > > > ps. please avoid using regexs with XML... that way leads to madness. With > the possibility of commented-out nodes and nested structures and such regexs > will only ever work on a tiny subset of XML. 
> > > .Matthew Cruickshank > http://docvert.org << MS Word to OpenDocument via an extensible XML > Pipeline > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070924/9a769a94/attachment.htm From rcdailey at gmail.com Tue Sep 25 00:05:31 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 24 Sep 2007 17:05:31 -0500 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <496954360709241502m7243c698iedfbb48094e28c0@mail.gmail.com> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> <46F83016.6070106@holloway.co.nz> <496954360709241502m7243c698iedfbb48094e28c0@mail.gmail.com> Message-ID: <496954360709241505i4abcdd2fr62f9fb55bc8a023e@mail.gmail.com> Woops! Never mind! I also need \r in there :) Works perfectly now On 9/24/07, Robert Dailey wrote: > > If I wanted to remove all whitespace between elements, I would use this > regex: > > exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL ) > > However, this isn't working for some reason. I'm fairly new to regular > expressions so I may be missing something obvious. Thanks. > > On 9/24/07, Matthew Cruickshank wrote: > > > Robert Dailey wrote: > > > > Note that the following were changed: > > - All comments were stripped from the XML > > - All spaces, tabs, carriage returns, and other forms of unimportant > > whitespace are removed > > - Elements that contain no text or children that are in the form of > > use the short-hand method for ending an element body: > > > > > > > > As Sidnei says an XSLT is probably the easiest way, > > > > The first and third requirement are done by default in XSLT (I think), > > so you'd only need to match text nodes and normalize them... > > > > > > > > > > > > ps. please avoid using regexs with XML... 
that way leads to madness. > > With the possibility of commented-out nodes and nested structures and such > > regexs will only ever work on a tiny subset of XML. > > > > > > .Matthew Cruickshank > > http://docvert.org << MS Word to OpenDocument via an extensible XML > > Pipeline > > > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070924/69673618/attachment.htm From mwm-keyword-lxml.9112b8 at mired.org Tue Sep 25 00:36:33 2007 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Mon, 24 Sep 2007 18:36:33 -0400 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <496954360709241502m7243c698iedfbb48094e28c0@mail.gmail.com> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> <46F83016.6070106@holloway.co.nz> <496954360709241502m7243c698iedfbb48094e28c0@mail.gmail.com> Message-ID: <20070924183633.712a194b@bhuda.mired.org> On Mon, 24 Sep 2007 17:02:43 -0500 "Robert Dailey" wrote: > If I wanted to remove all whitespace between elements, I would use this > regex: > > exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL ) > > However, this isn't working for some reason. I'm fairly new to regular > expressions so I may be missing something obvious. Thanks. Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski I think you can do this relatively simply with standard ElementTree tools: >>> doc = parse('.paneconfig.xml') >>> for e in doc.xpath('//*'): ... if e.tail: e.tail = e.tail.strip() ... if e.text: e.text = e.text.strip() ... However: - The regular expression version will be faster if you don't otherwise have to deal with the text as XML. 
- "unimportant whitespace" is *very* much an application-dependent definition. The solution I just gave you and the one presented above are different. The very statement implies that you're using XML as a data language, not a markup language, and my version works fine for the applications I've done that do that. Doesn't mean it's right for you, though. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. From stefan_ml at behnel.de Tue Sep 25 09:22:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 25 Sep 2007 09:22:44 +0200 Subject: [lxml-dev] Python script to optimize XML text In-Reply-To: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> References: <496954360709241336i43173947sda5245c2d7477ecb@mail.gmail.com> Message-ID: <46F8B744.4090806@behnel.de> Robert Dailey wrote: > I'm currently seeking a python script that provides a way of optimizing > out useless characters in an XML document to provide the optimal size > for the file. Have you tried the "remove_blank_text" and "remove_comments" keyword options of the XMLParser? Try >>> help(etree.XMLParser) They may not always produce an "optimal" result, but that's because there is no such thing as an "optimal" result (as Mike already noted). What is "useless characters" in XML is very much application dependent. Just think of an XHTML document where all text content was stripped: ...

some bold text

... or even ... some cited text ...

Not a good idea to remove all whitespace-only content here, IMHO.

A good way to help the parser understand what you consider "useless" is to
provide a DTD.

Stefan

From stefan_ml at behnel.de Wed Sep 26 13:18:44 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 26 Sep 2007 13:18:44 +0200
Subject: [lxml-dev] lxml 2.0alpha3 released
Message-ID: <46FA4014.7060806@behnel.de>

Hi all,

I'm happy to announce the third alpha release of lxml 2.0. It features a
number of clean-ups and another large refactoring of the parser code. The feed
parser and the normal parsers now use separate contexts, so that you can use
both concurrently on the same parser instance.

Have fun,
Stefan

2.0alpha3 (2007-09-26)

Features added

* Separate feed_error_log property for the feed parser interface. The normal
  parser interface and iterparse continue to use error_log.
* The normal parsers and the feed parser interface are now separated and can
  be used concurrently on the same parser instance.
* fromstringlist() and tostringlist() functions as in ElementTree 1.3
* iterparse() accepts an 'html' boolean keyword argument for parsing with the
  HTML parser (note that this interface may be subject to change)
* Parsers accept an 'encoding' keyword argument that overrides the encoding of
  the parsed documents.
* New C-API function hasChild() to test for children
* annotate() function in objectify can annotate with Python types and XSI
  types in one step. Accompanied by xsiannotate() and pyannotate().

Bugs fixed

* XML feed parser setup problem
* Type annotation for unicode strings in DataElement()

Other changes

* lxml.etree now emits a warning if you use XPath with libxml2 2.6.27 (which
  can crash on certain XPath errors)
* Type annotation in objectify now preserves the already annotated type by
  default to prevent losing type information that is already there.
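
A short sketch of two of the features listed above, fromstringlist()/tostringlist() and the parser "encoding" keyword, as they behave in current lxml releases. The sample documents are made up for illustration and are not taken from the release itself.

```python
from lxml import etree

# fromstringlist() parses a document that arrives as a sequence of fragments.
root = etree.fromstringlist([b"<root>", b"<item>text</item>", b"</root>"])

# tostringlist() serialises to a list of byte strings; joining the pieces
# gives the same bytes as tostring().
pieces = etree.tostringlist(root)
serialised = b"".join(pieces)

# The "encoding" keyword overrides whatever encoding the parser would
# otherwise detect for the document.
parser = etree.HTMLParser(encoding="UTF-8")
page = u"<html><head><title>\ud55c\uae00</title></head></html>".encode("UTF-8")
title = etree.fromstring(page, parser).findtext(".//title")
```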
From ebgssth at gmail.com Wed Sep 26 16:51:59 2007
From: ebgssth at gmail.com (js)
Date: Wed, 26 Sep 2007 23:51:59 +0900
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To: <46F293F8.9040002@behnel.de>
References: <46EE75F5.4090606@behnel.de> <46F13CA4.5020302@behnel.de> <46F2663A.1050508@behnel.de> <46F293F8.9040002@behnel.de>
Message-ID:

Thank you for your reply.

On 9/21/07, Stefan Behnel wrote:
> > Thank you for your effort.
> > but I wonder how can we know in what character set the document is
> > written before
> > GETing the page and check the response header, meta tag and contents itself?
>
> The libxml2 HTML parser does that for you. If there is a Content-Type
> (which is not too hidden inside the tag soup), the parser will obey it. It
> doesn't know about the header, but for that, you can pass in the "encoding"
> keyword. Note that you usually don't have to read the file if all you want it
> the header. Just read the header, check if you have to override the input
> encoding and then pass the file into parse().

You're right. When libxml2 finds a meta tag, it converts the encoding according to it. But on the real web, it doesn't always work. For example, look at the HTML on http://www.apple.com/kr/

----------------------------------------------------------------------
????????
----------------------------------------------------------------------

As you can see above, the meta tag (and actually the Content-Type HTTP header) declares that the character encoding used in this page is UTF-8. libxml2, however, cannot detect this fact until it reaches the meta tag, because the title appears before the meta tag. For this reason, libxml2 treats the title value as ISO 8859-1 (also known as latin-1) and the value gets garbled. (I'm not quite sure, but in this case some text after the meta tag also gets garbled.)
To avoid this, I can, right as you said, pass the charset value in the "encoding" keyword, which I can get from the HTTP header using a HEAD/GET request. That works for this page, but what can I do if the web server doesn't return a Content-Type HTTP header? I cannot rely on libxml2 for the reason I explained above. The best way I can think of is to GET the page and analyze it by myself, using a Perl LWP-like module. So,

> > We really need to GET the doc first. What do you think?
> > Stefan, Is it possible to change lxml to avoid "ValueError" exception
> > when passing
> > decoded string to lxml.parse()?
> > If the answer is no, could you please give me some advice or your idea
> > on thin problem?
>
> Use lxml.fromstring() for parsing strings.

Oh, that's exactly what I'm looking for. Now I can do something like below.

----------------------------------------------------------------------
res = urlopen(url)
doc = res.read()
# Precedence rules from http://www.w3.org/International/tutorials/tutorial-char-enc/
encoding = (
    res.headers.getparam('charset') or
    checkXMLDeclarationForEncoding(doc) or  # returns charset values in XML declaration
    checkMetaForEncoding(doc) or            # returns charset values in meta tag
    chardet.detect(doc).get('encoding')     # http://chardet.feedparser.org/
)
tree = etree.fromstring(doc, etree.HTMLParser(encoding=encoding))
----------------------------------------------------------------------

I don't have checkXMLDeclarationForEncoding nor checkMetaForEncoding, though.

From ebgssth at gmail.com Wed Sep 26 17:36:47 2007
From: ebgssth at gmail.com (js)
Date: Thu, 27 Sep 2007 00:36:47 +0900
Subject: [lxml-dev] Any way to pass encoding to html.html_parser?
Message-ID:

Hello.

A simple question about lxml2.0alpha3's new feature.

> * Parsers accept an 'encoding' keyword argument that overrides the
> encoding of the parsed documents.
How can I pass encoding argument to the parser when using html.parse instead of etree.parse?

From stefan_ml at behnel.de Thu Sep 27 08:09:48 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 27 Sep 2007 08:09:48 +0200
Subject: [lxml-dev] non-ascii characters get garbled
In-Reply-To:
References: <46EE75F5.4090606@behnel.de> <46F13CA4.5020302@behnel.de> <46F2663A.1050508@behnel.de> <46F293F8.9040002@behnel.de>
Message-ID: <46FB492C.6020903@behnel.de>

Hi,

js wrote:
> You're right. When libxml2 find meta tag, it converts the encoding
> according to it.
> But in real web, it doesn't always work.

I know, libxml2's HTML parser works pretty well, but it's not perfect. Especially robust encoding detection is still an issue.

> ----------------------------------------------------------------------
> "http://www.w3.org/TR/html4/loose.dtd">
>
>
> ????????
>
>
> ----------------------------------------------------------------------

Yes, libxml2 will switch encodings when it sees the <meta> tag, but it will not start over to make sure the beginning is parsed correctly.

There are a couple of things you can do. For example, you can parse the page and then check the encoding through the docinfo property (after wrapping the result Element with an ElementTree, if you use "fromstring"), or look for a <meta> tag through find() or XPath. Then, reparse the document with the "encoding" keyword set.

Or, you can install BeautifulSoup and use lxml.html.ElementSoup for parsing. BeautifulSoup has an HTML parser that comes with brilliant encoding detection. ElementSoup will build the lxml.html tree for you automatically.

Or, you can use a regexp to detect a <meta> tag yourself before parsing.
The function you use below would mainly check for something like

<meta[^>]*charset=["']([^"'>]*)["']

Stefan

> ----------------------------------------------------------------------
> res = urlopen(url)
> doc = res.read()
> # Precedence rules from http://www.w3.org/International/tutorials/tutorial-char-enc/
> encoding =
> res.headers.getparam('charset') or
> checkXMLDeclarationForEncoding(doc) or # returns charset values in XML declaration
> checkMetaForEncoding(doc) or # returns charset values in meta tag
> chardet.detect(doc).get('encoding') # http://chardet.feedparser.org/
> tree = etree.fromstring(doc, etree.HTMLParser(encoding=encoding))
> ----------------------------------------------------------------------

From stefan_ml at behnel.de Thu Sep 27 08:32:38 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 27 Sep 2007 08:32:38 +0200
Subject: [lxml-dev] Any way to pass encoding to html.html_parser?
In-Reply-To:
References:
Message-ID: <46FB4E86.3000103@behnel.de>

js wrote:
> A simple question about lxml2.0alpha3's new feature.
>
>> * Parsers accept an 'encoding' keyword argument that overrides the
>> encoding of the parsed documents.
>
> How can I pass encoding argument to the parser when using html.parse instead of
> etree.parse?

Hmm, true, you can't currently do that, as lxml.html.html_parser is a parser instance, not a class.

It's easy to build an equivalent parser, though.
The next release will duplicate the parser class into lxml.html, until then, you can do this:

    class HTMLParser(lxml.etree.HTMLParser):
        def __init__(self, **kwargs):
            super(HTMLParser, self).__init__(**kwargs)
            self.setElementClassLookup(lxml.html.HtmlElementClassLookup())

Stefan

From jg307 at cam.ac.uk Thu Sep 27 14:23:47 2007
From: jg307 at cam.ac.uk (James Graham)
Date: Thu, 27 Sep 2007 13:23:47 +0100
Subject: [lxml-dev] Tag name validation and HTML
Message-ID: <46FBA0D3.6010700@cam.ac.uk>

The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1]. This is an issue for html5lib [2] as we are keen to keep support for building lxml trees from HTML input, something which is currently possible with lxml 1.3.

In an only tangentially related question, is there a recommended way of creating a custom tag type, preferably using the same code for ElementTree and lxml.etree? In particular html5lib needs to create a notional document root element whilst parsing. So far, we have been using an ordinary Element with a .tag that cannot be produced by parsing any input e.g. root.tag="" but this doesn't feel very elegant.

[1] http://www.whatwg.org/specs/web-apps/current-work/#tag-name0
[2] http://code.google.com/p/html5lib/

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From ebgssth at gmail.com Thu Sep 27 14:51:19 2007
From: ebgssth at gmail.com (js)
Date: Thu, 27 Sep 2007 21:51:19 +0900
Subject: [lxml-dev] Any way to pass encoding to html.html_parser?
In-Reply-To: <46FB4E86.3000103@behnel.de>
References: <46FB4E86.3000103@behnel.de>
Message-ID:

Thank you for your help! and I'm looking forward to the next release.
On 9/27/07, Stefan Behnel wrote: > > js wrote: > > A simple question about lxml2.0alpha3's new feature. > > > >> * Parsers accept an 'encoding' keyword argument that overrides the > >> encoding of the parsed documents. > > > > How can I pass encoding argument to the parser when using html.parse instead of > > etree.parse? > > Hmm, true, you can't currently do that, as lxml.html.html_parser is a parser > instance, not a class. > > It's easy to build an equivalent parser, though. The next release will > duplicate the parser class into lxml.html, until then, you can do this: > > class HTMLParser(lxml.etree.HTMLParser): > def __init__(self, **kwargs): > super(HTMLParser, self).__init__(**kwargs) > self.setElementClassLookup(lxml.html.HtmlElementClassLookup()) > > Stefan > > From stefan_ml at behnel.de Thu Sep 27 15:50:48 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 27 Sep 2007 15:50:48 +0200 Subject: [lxml-dev] Tag name validation and HTML In-Reply-To: <46FBA0D3.6010700@cam.ac.uk> References: <46FBA0D3.6010700@cam.ac.uk> Message-ID: <46FBB538.5010604@behnel.de> James Graham wrote: > Is there a recommended way of creating > a custom tag type, preferably using the same code for ElementTree and > lxml.etree? Both lxml.etree and ElementTree have support for (something like) this, but not in the same way. In ET, you can pass an "element_factory" argument to the TreeBuilder. http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.TreeBuilder-class In lxml.etree, you can define an Element-Lookup for a parser. http://codespeak.net/lxml/element_classes.html As both approaches work at the parser level, it should be possible (though not too easy) to write some glue code that sets up a parser for either library, and then use the parser in the rest of the code without modification. Note that in lxml.etree, the decision about which element class to use for a given node is not taken inside the parser, but at element access time. 
Hence the different approaches (and the extensive support in lxml). > In particular html5lib needs to create a notional document root > element whilst parsing. This is a pretty specific problem. You can solve it in lxml.etree in two ways. If the root node has a specific name, you can use the CustomElementClassLookup scheme (so this won't work if you can't control the name of the root node). http://codespeak.net/lxml/element_classes.html#custom-element-class-lookup If the only way to decide about the class is to check for a parent, you can use the tree based lookup and check "getparent()" for None. http://codespeak.net/lxml/element_classes.html#tree-based-element-class-lookup-in-python I don't think ET can take this decision at all from the element_factory above, but then, you can always replace the root Element /after/ parsing, so I don't think you would even need that machinery here. > So far, we have been using an ordinary Element with a > .tag that cannot be produced by parsing any input e.g. > root.tag="" but this doesn't feel very elegant. Hmmm, but this changes the document, right? Could you explain a little what that node is supposed to do different than normal nodes? In particular, why can't a tree wrapper do what you want? Stefan From stefan_ml at behnel.de Thu Sep 27 16:31:25 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 27 Sep 2007 16:31:25 +0200 Subject: [lxml-dev] Tag name validation and HTML In-Reply-To: <46FBA0D3.6010700@cam.ac.uk> References: <46FBA0D3.6010700@cam.ac.uk> Message-ID: <46FBBEBD.7030308@behnel.de> James Graham wrote: > The development branch of lxml 2 appears to restrict the characters that may > appear in a tag name. Whilst this may be appropriate for XML, it does not match > the behavior of all common HTML UAs and, as such, does not match the current > draft of the HTML 5 spec [1]. This is actually not as simple as it might seem. 
The Element factory cannot distinguish between XML and HTML tags, so it cannot switch off validation for a particular tag. So the conservative solution would be to actually follow the HTML5 spec, as it is a superset of the XML spec, an extremely broad one even. But then there's not much left that you could honestly call validation. Also, I would still want to restrict ":" in tag names, as this has been a source of problems way too often. So that would just leave spaces and any of ":/>" as invalid characters in tag names. BTW, the spec you reference is actually a parser spec. Obviously, allowing "<" or "&" at the API level isn't a good idea either, so we end up defining our own way of validating tag names that would be somewhere between the XML spec and the HTML spec. And it would still allow you to write broken XML without noticing... > This is an issue for html5lib [2] as we are keen > to keep support for building lxml trees from HTML input, something which is > currently possible with lxml 1.3. Extensive support for HTML is definitely a goal of lxml, so if the current behaviour breaks the HTML spec, it must change. But I'll have to see how. Any comments appreciated. Stefan From stefan_ml at behnel.de Thu Sep 27 18:30:13 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 27 Sep 2007 18:30:13 +0200 Subject: [lxml-dev] Tag name validation and HTML In-Reply-To: <46FBBEBD.7030308@behnel.de> References: <46FBA0D3.6010700@cam.ac.uk> <46FBBEBD.7030308@behnel.de> Message-ID: <46FBDA95.10603@behnel.de> Stefan Behnel wrote: > James Graham wrote: >> The development branch of lxml 2 appears to restrict the characters that may >> appear in a tag name. Whilst this may be appropriate for XML, it does not match >> the behavior of all common HTML UAs and, as such, does not match the current >> draft of the HTML 5 spec [1]. > > This is actually not as simple as it might seem. 
The Element factory cannot
> distinguish between XML and HTML tags, so it cannot switch off validation for
> a particular tag. So the conservative solution would be to actually follow the
> HTML5 spec, as it is a superset of the XML spec, an extremely broad one even.
> But then there's not much left that you could honestly call validation. Also,
> I would still want to restrict ":" in tag names, as this has been a source of
> problems way too often. So that would just leave spaces and any of ":/>" as
> invalid characters in tag names.
>
> BTW, the spec you reference is actually a parser spec. Obviously, allowing "<"
> or "&" at the API level isn't a good idea either, so we end up defining our
> own way of validating tag names that would be somewhere between the XML spec
> and the HTML spec. And it would still allow you to write broken XML without
> noticing...

This patch might make for a good starter. Comments appreciated.

Stefan

Index: src/lxml/apihelpers.pxi
===================================================================
--- src/lxml/apihelpers.pxi (Revision 46892)
+++ src/lxml/apihelpers.pxi (Arbeitskopie)
@@ -791,7 +791,23 @@
     return _xmlNameIsValid(_cstr(name_utf8))
 
 cdef int _xmlNameIsValid(char* c_name):
-    return tree.xmlValidateNCName(c_name, 0) == 0
+    #return tree.xmlValidateNCName(c_name, 0) == 0
+    if c_name is NULL or c_name[0] == c'\0':
+        return 0
+    while c_name[0] != c'\0':
+        if c_name[0] == c':' or \
+           c_name[0] == c'&' or \
+           c_name[0] == c'<' or \
+           c_name[0] == c'>' or \
+           c_name[0] == c'/' or \
+           c_name[0] == c'\x09' or \
+           c_name[0] == c'\x0A' or \
+           c_name[0] == c'\x0B' or \
+           c_name[0] == c'\x0C' or \
+           c_name[0] == c'\x20':
+            return 0
+        c_name = c_name + 1
+    return 1
 
 cdef int _tagValidOrRaise(tag_utf) except -1:
     if not _pyXmlNameIsValid(tag_utf):

From ebgssth at gmail.com Fri Sep 28 11:52:45 2007
From: ebgssth at gmail.com (js)
Date: Fri, 28 Sep 2007 18:52:45 +0900
Subject: [lxml-dev] ElementSoup doesn't work as in doc/elementsoup.txt
Message-ID: <a23effaf0709280252m49f9ccbdy7e592c2e6cd2f9f4@mail.gmail.com>

Hello.

I'm learning ElementSoup, but it doesn't work the way it's supposed to be.
I tried sample code in doc/elementsoup.txt but failed with error.

----------------------------------------------------------------------
>>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
>>> from lxml.html.ElementSoup import parse
>>> from StringIO import StringIO
>>> root = parse(StringIO(tag_soup))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 19, in parse
    root = _convert_tree(tree, makeelement)
  File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 40, in _convert_tree
    attrib=dict(beautiful_soup_tree.attrs))
  File "parser.pxi", line 702, in etree._BaseParser.makeelement
  File "apihelpers.pxi", line 102, in etree._makeElement
  File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
ValueError: Invalid tag name u'[document]'
----------------------------------------------------------------------

I'm using
  Python2.5
  lxml-2.0alpha3
  BeautifulSoup 3.0.4

Any clues?

From stefan_ml at behnel.de Fri Sep 28 22:38:07 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 28 Sep 2007 22:38:07 +0200
Subject: [lxml-dev] ElementSoup doesn't work as in doc/elementsoup.txt
In-Reply-To: <a23effaf0709280252m49f9ccbdy7e592c2e6cd2f9f4@mail.gmail.com>
References: <a23effaf0709280252m49f9ccbdy7e592c2e6cd2f9f4@mail.gmail.com>
Message-ID: <46FD662F.9020005@behnel.de>

Hi,

js wrote:
> I'm learning ElementSoup, but it doesn't work the way it's supposed to be.
> I tried sample code in doc/elementsoup.txt but failed with error.
> ----------------------------------------------------------------------
> >>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
> >>> from lxml.html.ElementSoup import parse
> >>> from StringIO import StringIO
> >>> root = parse(StringIO(tag_soup))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 19, in parse
>     root = _convert_tree(tree, makeelement)
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 40, in _convert_tree
>     attrib=dict(beautiful_soup_tree.attrs))
>   File "parser.pxi", line 702, in etree._BaseParser.makeelement
>   File "apihelpers.pxi", line 102, in etree._makeElement
>   File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
> ValueError: Invalid tag name u'[document]'
> ----------------------------------------------------------------------

That's because of the tag name validation. Evidently, "[document]" (which is returned by BeautifulSoup) isn't a valid tag name. Sadly, the doctest above was not yet included in the test suite.

However, the behaviour will change in alpha 4. lxml will no longer reject tag names except if they contain spaces or XML special characters. See this recent thread, which also has a patch:

http://comments.gmane.org/gmane.comp.python.lxml.devel/3003?set_lines=100000

Sorry for the inconvenience, but don't forget that this is alpha software. Things might not always work as expected or might change unexpectedly (although we try to keep these changes as rare as possible).
Stefan

From FnH at antwerpen.be Sat Sep 29 22:03:14 2007
From: FnH at antwerpen.be (FnH)
Date: Sat, 29 Sep 2007 22:03:14 +0200
Subject: [lxml-dev] prefix mappings
Message-ID: <46FEAF82.60501@antwerpen.be>

Hi,

I would like to generate the following serialization:

    <a xmlns="foo">
      <b xmlns="bar"/>
    </a>

Currently, the closest I can come is:

    <a xmlns="foo" xmlns:x="bar">
      <x:b/>
    </a>

using the following code:

    a = Element("{foo}a", nsmap={None:"foo", "x":"bar"})
    a.append(Element("{bar}b"))

Note that even though these serializations are equivalent, sometimes remapping namespaces is necessary to work around faulty implementations in other products. The feed validator for example warns against using namespace prefixes on elements for good reason. If you don't, many readers aren't able to parse your feed (http://www.intertwingly.net/wiki/pie/XmlNamespaceConformanceTests). If you're using xhtml as content this means remapping the default namespace.

In order to solve this I think it would be a good idea to allow (or take into account) prefix mappings on non root nodes as well. The output I'd like could then be achieved by the following code snippet:

    a = Element("{foo}a", nsmap={None:"foo"})
    a.append(Element("{bar}b", nsmap={None:"bar"}))

What do you think?

Regards,
Nick
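
A minimal sketch of the serialization asked for above, assuming a later lxml release in which SubElement() accepts an nsmap argument on non-root nodes. The thread itself does not say whether or when this was added, so treat that as an assumption about current lxml rather than about 2.0alpha3. The namespace URIs "foo" and "bar" are the placeholders from the message.

```python
from lxml import etree

# Root element with "foo" as its default (unprefixed) namespace.
a = etree.Element("{foo}a", nsmap={None: "foo"})

# Assumption: nsmap is honoured on a non-root node, so the child can
# redeclare the default namespace instead of being given a prefix.
b = etree.SubElement(a, "{bar}b", nsmap={None: "bar"})

serialized = etree.tostring(a)
print(serialized)
```

Creating the child with SubElement() rather than Element() plus append() keeps the namespace declaration attached to the node from the start, which avoids depending on how namespaces are reconciled when a node is moved between documents.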