From stefan_ml at behnel.de Thu Oct 1 09:12:39 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 01 Oct 2009 09:12:39 +0200
Subject: [lxml-dev] Question about lxml.html.builder
In-Reply-To: <200909301728.55127.reagle@mit.edu>
References: <200909301728.55127.reagle@mit.edu>
Message-ID: <4AC45667.6050706@behnel.de>

Joseph Reagle wrote:
> I'm using Python 2.5.2 with lxml 2.1.1-1ubuntu1.
>
> I'm trying to get some version of the lines involving 'wp_ps' to work:
>
>     if opts.text:
>         wp_ps = [E.P(p) for p in bio.wp_text.split('\n')]
>         table.append(
>             E.TR(
>                 E.TD(''),
>                 E.TD().extend(wp_ps),
>                 #E.TD(bio.wp_text, colspan='2'),
>                 E.TD(bio.eb_text, colspan='2'),
>                 valign="top",
>                 ),
>             )
>
> Simply, I want to take text with LFs and turn them into Ps.

You didn't mention what isn't working here and how it isn't working the way
you expect. Anyway, without testing, I'd write it this way:

    table.append(
        E.TR(
            E.TD(''),
            E.TD(*[E.P(p) for p in bio.wp_text.split('\n')]),
            #E.TD(bio.wp_text, colspan='2'),
            E.TD(bio.eb_text, colspan='2'),
            valign="top",
            ),
        )

Stefan

From manu3d at gmail.com Thu Oct 1 13:22:02 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Thu, 1 Oct 2009 12:22:02 +0100
Subject: [lxml-dev] xpath check, selective xslt
In-Reply-To: <4AC2234C.5080209@behnel.de>
References: <915dc91d0909200416p59026e97of0c1fe9a65e1a1c1@mail.gmail.com>
	<4AC0BE22.5010007@behnel.de>
	<915dc91d0909290232r78552a9co2d5953c090f4a8cd@mail.gmail.com>
	<4AC1D871.5080207@behnel.de>
	<915dc91d0909290407l11c7730fmcdefe0eb583e0a75@mail.gmail.com>
	<4AC2195A.1010307@behnel.de>
	<915dc91d0909290750n7234131ey5b5b8e81e0e8fc1a@mail.gmail.com>
	<4AC2234C.5080209@behnel.de>
Message-ID: <915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com>

2009/9/29 Stefan Behnel

> ... not as good as that. I just checked, and it actually decouples the
> element from the rest of the document before running the XSLT. So that
> won't help in the case that the XSLT needs to refer to the ancestors.
> Darn!
> I wonder if it would make sense to disable the decoupling for plain
> Elements. The problem is that this might break code.

Still, in-context sub-tree transformations don't seem to be a conceptually
far-fetched idea. Is the XSLT standard somehow forbidding it? If not, I
think it'd be worth discussing it with the libxml people.

> OTOH, I expect little XSLT code to really depend on this,

You are probably right -now-, but if the functionality becomes available I
suspect it'd be quickly adopted for all sorts of unforeseen purposes.

> and it's easy to work around by wrapping the element in an ElementTree
> object.
>

I didn't get this. The negatively important thing is that the element is not
"in-context" when it's transformed. How would wrapping it in an ElementTree
object help?

Manu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091001/bc07bce9/attachment.htm

From stefan_ml at behnel.de Thu Oct 1 13:39:10 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 01 Oct 2009 13:39:10 +0200
Subject: [lxml-dev] xpath check, selective xslt
In-Reply-To: <915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com>
References: <915dc91d0909200416p59026e97of0c1fe9a65e1a1c1@mail.gmail.com>
	<4AC0BE22.5010007@behnel.de>
	<915dc91d0909290232r78552a9co2d5953c090f4a8cd@mail.gmail.com>
	<4AC1D871.5080207@behnel.de>
	<915dc91d0909290407l11c7730fmcdefe0eb583e0a75@mail.gmail.com>
	<4AC2195A.1010307@behnel.de>
	<915dc91d0909290750n7234131ey5b5b8e81e0e8fc1a@mail.gmail.com>
	<4AC2234C.5080209@behnel.de>
	<915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com>
Message-ID: <4AC494DE.2030601@behnel.de>

Emanuele D'Arrigo wrote:
>> OTOH, I expect little XSLT code to really depend on this,
>
> You are probably right -now-, but if the functionality becomes available I
> suspect it'd be quickly adopted for all sorts of unforeseen purposes.
>
>> and it's easy to work around by wrapping the element in an ElementTree
>> object.
>
> I didn't get this. The negatively important thing is that the element is not
> "in-context" when it's transformed. How would wrapping it in an ElementTree
> object help?

I meant the opposite: If we switch the way it works, I don't expect much
code to break, and if it does, it'll be easy to fix by wrapping the element
in an ElementTree object before passing it to the transformer.

I'm not sure how hard this will be to implement, though. I didn't find an
obvious function in libxslt that would take a context/start element in
addition to the document to be transformed, so someone needs to investigate
how libxslt works here (we normally call the xsltApplyStylesheetUser()
function), and what it takes to start the transform from a different node
(e.g. if calling xsltProcessOneNode() works out-of-the-box, or if there is
a setup required beforehand).

Could you file a feature request (i.e. bug) on the bug tracker for now?

Stefan

From reagle at mit.edu Thu Oct 1 18:29:58 2009
From: reagle at mit.edu (Joseph Reagle)
Date: Thu, 1 Oct 2009 12:29:58 -0400
Subject: [lxml-dev] Question about lxml.html.builder
In-Reply-To: <4AC45667.6050706@behnel.de>
References: <200909301728.55127.reagle@mit.edu> <4AC45667.6050706@behnel.de>
Message-ID: <200910011229.58802.reagle@mit.edu>

On Thursday 01 October 2009, Stefan Behnel wrote:
>             E.TD(*[E.P(p) for p in bio.wp_text.split('\n')]),

Thanks, that does the trick: you're using *args syntax to "delist" the list
and make them arguments?

Also, when I want to add a valign, I note it has to be:
    E.TD(colspan='2', *[E.P(p) for p in bio.wp_text.split('\n')]),
not
    E.TD(*[E.P(p) for p in bio.wp_text.split('\n')], colspan='2'),
though I'm not sure why.
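
On the question of why the keyword argument has to come before the
`*[...]` argument: as far as I know this is a restriction of the call
syntax in Python 2.5 and earlier, which rejects keyword arguments written
after `*args`, while Python 2.6 and later accept both orderings. A minimal
sketch follows. The function `td` is a hypothetical stand-in for `E.TD`
(it is not part of lxml) and only records what it receives.

```python
# Hypothetical stand-in for E.TD (not an lxml function): it collects the
# positional children and the keyword attributes it was called with.
def td(*children, **attributes):
    return list(children), dict(attributes)

paragraphs = ["first", "second"]

# Keyword argument before *args: also accepted by Python 2.5.
keyword_first = td(colspan='2', *paragraphs)

# Keyword argument after *args: a SyntaxError on Python 2.5,
# accepted from Python 2.6 onwards.
keyword_last = td(*paragraphs, colspan='2')

# Both spellings hand the same arguments to the function.
print(keyword_first == keyword_last)
```

On a current interpreter both calls are equivalent, so the ordering only
matters when the code has to run on Python 2.5.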

From mike_mp at zzzcomputing.com Sat Oct 3 20:12:46 2009
From: mike_mp at zzzcomputing.com (Michael Bayer)
Date: Sat, 3 Oct 2009 14:12:46 -0400
Subject: [lxml-dev] setuptools issues with python2.6 maint
In-Reply-To: <159c0df5ffd364411bf21a6d9edd3574.squirrel@www.geekisp.com>
References: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com>
	<4A7DB7FA.2090400@behnel.de>
	<159c0df5ffd364411bf21a6d9edd3574.squirrel@www.geekisp.com>
Message-ID:

On Aug 10, 2009, at 3:48 PM, Michael Bayer wrote:

> Stefan Behnel wrote:
>>
>>
>> No idea, never seen this before. I use the same setuptools version under
>> Py2.6.2, and it works perfectly well.
>>
>> Have you tried the bdist_egg target instead of a mere "install"?
>>
>> Also, the way setuptools patches into distutils makes it quite possible
>> that newer Python releases introduce incompatibilities, so maybe there's
>> an issue over there.
>>
>
> I was running build_ext in most cases. Didn't try bdist_egg. Anyway my
> dependency on that version of python is over for now, so if it is in fact
> an issue with py2.6+, we'll all find out soon enough. thanks for the
> help.

just an FYI, so now that 2.6.3 is out, more eyes have been on the
problem and the issue is now described here:

http://tarekziade.wordpress.com/2009/10/03/python-2-6-3-and-distribute/

From stefan_ml at behnel.de Mon Oct 5 09:48:45 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 05 Oct 2009 09:48:45 +0200
Subject: [lxml-dev] lxml and xmlsec
In-Reply-To:
References:
Message-ID: <4AC9A4DD.7080308@behnel.de>

Roland Hedberg wrote:
> I need xmlsec and I'd like to use lxml instead of libxml, is it
> possible ?
> libxml is as it's distributed dependent on libxml, but I don't know
> how hard it would be to put together a lxmlsec where libxml was
> exchanged for lxml.

Depending on how much of xmlsec you need, it shouldn't be too hard to wrap
the functionality in Cython and put together a new extension module.
lxml.etree exports the main parts in a public API that should allow you to
get to the libxml2 tree.

The Cython tutorial below has a section on wrapping a C library:

http://sage.math.washington.edu/home/dagss/cython-tutorial-preprint.pdf

> Has anyone done it ?

From the archives, it looks like John Krukoff should have some experience
with it.

Stefan

From cswiggett at knowledgemosaic.com Wed Oct 7 19:45:41 2009
From: cswiggett at knowledgemosaic.com (Clif Swiggett)
Date: Wed, 07 Oct 2009 10:45:41 -0700
Subject: [lxml-dev] Text attribute is None when element has text
Message-ID: <4ACCD3C5.4010508@knowledgemosaic.com>

If I run the code:

    test = etree.XML('<root><a/>text</root>')
    for x in test.iter():
        print("%s - %s"%(x.tag, x.text))

I get the output:

    root - None
    a - None

I expected that root.text would have been 'text' rather than none.
However, if I flip the text and tag, then it works. E.g.

    test = etree.XML('<root>text<a/></root>')
    for x in test.iter():
        print("%s - %s"%(x.tag, x.text))

Output:

    root - text
    a - None

Anyone know why this is, or how to work around it?
Thanks!
- Clif

From stefan_ml at behnel.de Wed Oct 7 20:09:54 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 07 Oct 2009 20:09:54 +0200
Subject: [lxml-dev] Text attribute is None when element has text
In-Reply-To: <4ACCD3C5.4010508@knowledgemosaic.com>
References: <4ACCD3C5.4010508@knowledgemosaic.com>
Message-ID: <4ACCD972.2040802@behnel.de>

Clif Swiggett wrote:
> If I run the code:
>
>     test = etree.XML('<root><a/>text</root>')
>     for x in test.iter():
>         print("%s - %s"%(x.tag, x.text))
>
> I get the output:
>
>     root - None
>     a - None
>
> I expected that root.text would have been 'text' rather than none.
> However, if I flip the text and tag, then it works. E.g.
>
>     test = etree.XML('<root>text<a/></root>')
>     for x in test.iter():
>         print("%s - %s"%(x.tag, x.text))
>
> Output:
>
>     root - text
>     a - None
>
> Anyone know why this is, or how to work around it?

Sure, read the docs:

http://codespeak.net/lxml/tutorial.html#elements-contain-text

Stefan

From jkrukoff at ltgc.com Wed Oct 7 20:12:13 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 07 Oct 2009 12:12:13 -0600
Subject: [lxml-dev] Text attribute is None when element has text
In-Reply-To: <4ACCD3C5.4010508@knowledgemosaic.com>
References: <4ACCD3C5.4010508@knowledgemosaic.com>
Message-ID: <1254939133.15233.4.camel@localhost.localdomain>

On Wed, 2009-10-07 at 10:45 -0700, Clif Swiggett wrote:
> If I run the code:
>
>     test = etree.XML('<root><a/>text</root>')
>     for x in test.iter():
>         print("%s - %s"%(x.tag, x.text))
>
> I get the output:
>
>     root - None
>     a - None
>
> I expected that root.text would have been 'text' rather than none.
> However, if I flip the text and tag, then it works. E.g.
>
>     test = etree.XML('<root>text<a/></root>')
>     for x in test.iter():
>         print("%s - %s"%(x.tag, x.text))
>
> Output:
>
>     root - text
>     a - None
>
> Anyone know why this is, or how to work around it?
> Thanks!
> - Clif
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev

Take a look at the tutorial, especially the "Elements contain text"
section near here:
http://codespeak.net/lxml/tutorial.html#the-element-class

Hopefully that will explain how .text and .tail work for accessing text.
The short answer is that your text is in test.find( 'a' ).tail, though.

The itertext method may also be useful for you, depending on your use
case.
--
John Krukoff
Land Title Guarantee Company

From jkrukoff at ltgc.com Wed Oct 7 20:23:55 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 07 Oct 2009 12:23:55 -0600
Subject: [lxml-dev] lxml and xmlsec
In-Reply-To:
References:
Message-ID: <1254939835.15233.13.camel@localhost.localdomain>

On Tue, 2009-09-29 at 15:59 +0200, Roland Hedberg wrote:
> Hi!
>
> I need xmlsec and I'd like to use lxml instead of libxml, is it
> possible ?
> libxml is as it's distributed dependent on libxml, but I don't know
> how hard it would be to put together a lxmlsec where libxml was
> exchanged for lxml.
> Has anyone done it ?
>
> --Roland
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev

I've gotten the two to work together, but I didn't try anything as hard
as that. I used this set of python bindings for xmlsec:
http://xmlsig.sourceforge.net/

And serialized from lxml and had that library reparse. If I had it to do
over again, I'd try this set of python bindings instead, since I had
crash issues with the first library when errors occurred:
http://pyxmlsec.labs.libre-entreprise.org/index.php

It turned out that neither was actually suitable for my use case,
however, as libxmlsec didn't seem to be capable of signing a document
that the .NET 1.1 application I was talking to would accept. I ended up
being forced into creating a .NET service (with its own dedicated
windows box!) just to handle the signing, as it was the only way I could
interop.
--
John Krukoff
Land Title Guarantee Company

From john at nmt.edu Wed Oct 7 20:22:31 2009
From: john at nmt.edu (John W. Shipman)
Date: Wed, 7 Oct 2009 12:22:31 -0600 (MDT)
Subject: [lxml-dev] Text attribute is None when element has text
In-Reply-To: <4ACCD3C5.4010508@knowledgemosaic.com>
References: <4ACCD3C5.4010508@knowledgemosaic.com>
Message-ID:

On Wed, 7 Oct 2009, Clif Swiggett wrote:

+--
| If I run the code:
|
| test = etree.XML('<root><a/>text</root>')
| for x in test.iter():
|     print("%s - %s"%(x.tag, x.text))
|
| I get the output:
|
| root - None
| a - None
|
| I expected that root.text would have been 'text' rather than none.
+--

lxml does not represent mixed content (text intermingled with
elements) in the same way that most other XML tools do.
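
The point about mixed content can be illustrated with a small sketch. It
uses the standard library's xml.etree.ElementTree instead of lxml, on the
assumption that both follow the same ElementTree text/tail model, so it
runs without lxml installed. The sample document `<root><a/>text</root>`
is likewise an assumption, chosen to match the output reported in this
thread.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: the text follows the empty <a/> child.
root = ET.fromstring('<root><a/>text</root>')
a = root.find('a')

# Text that follows an element's end tag is stored on that element as
# .tail, not on the parent as .text.
print(root.text)   # no text before the first child
print(a.text)      # <a/> has no content of its own
print(a.tail)      # the text that follows <a/>

# itertext() walks all text in document order, tails included.
all_text = ''.join(root.itertext())
print(all_text)

# With the text placed before the child, it is stored in root.text.
flipped = ET.fromstring('<root>text<a/></root>')
print(flipped.text)
print(flipped.find('a').tail)
```

Reading `.tail` on the child, or using `itertext()` on the parent, are
the two workarounds suggested in the replies.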
I have attempted to explain this here:

    http://www.nmt.edu/tcc/help/pubs/pylxml/

The relevant section is here:

    http://www.nmt.edu/tcc/help/pubs/pylxml/etree-view.html

Here's your interactive example with the .tail attribute shown.

================================================================
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree as et
>>> test=et.XML('<root><a/>text</root>')
>>> for x in test.iter():
...     print ( "tag='%s' text='%s' tail='%s'" %
...             (x.tag, x.text, x.tail) )
...
tag='root' text='None' tail='None'
tag='a' text='None' tail='text'
>>>
----------------------------------------------------------------

Best regards,
John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
  ``Let's go outside and commiserate with nature.''  --Dave Farber

From jkrukoff at ltgc.com Wed Oct 7 23:27:48 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 07 Oct 2009 15:27:48 -0600
Subject: [lxml-dev] XPath & XSLT Extension functions and node sets
Message-ID: <1254950868.23503.12.camel@localhost.localdomain>

So, I've been experimenting with custom python extension functions and
have come across a surprising limitation. Is there any way to get an
extension function to be able to return a list of text nodes, instead of
only a list of XML nodes?

It's easy enough to pass in a list of text nodes (i.e. '//text()' ), but
I can't find a way to return that same list from an extension function.
I tried returning the result directly from an xpath call, thinking that
maybe the smart strings would work, but no dice.
--
John Krukoff
Land Title Guarantee Company

From stefan_ml at behnel.de Thu Oct 8 09:51:23 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 08 Oct 2009 09:51:23 +0200
Subject: [lxml-dev] XPath & XSLT Extension functions and node sets
In-Reply-To: <1254950868.23503.12.camel@localhost.localdomain>
References: <1254950868.23503.12.camel@localhost.localdomain>
Message-ID: <4ACD99FB.9060305@behnel.de>

John Krukoff wrote:
> So, I've been experimenting with custom python extension functions and
> have come across a surprising limitation. Is there any way to get an
> extension function to be able to return a list of text nodes, instead of
> only a list of XML nodes?
>
> It's easy enough to pass in a list of text nodes (i.e. '//text()' ), but
> I can't find a way to return that same list from an extension function.
> I tried returning the result directly from an xpath call, thinking that
> maybe the smart strings would work, but no dice.

Interesting. The code in extensions.pxi actually goes

    elif python.PySequence_Check(obj):
        resultSet = xpath.xmlXPathNodeSetCreate(NULL)
        for element in obj:
            if isinstance(element, _Element):
                node = <_Element>element
                xpath.xmlXPathNodeSetAdd(resultSet, node._c_node)
            else:
                xpath.xmlXPathFreeNodeSet(resultSet)
                raise XPathResultError, u"This is not a node: %r" % element

so that's clearly not supported. I wonder if it can be. The problem is that
the content of a node-set isn't freed when the node-set is, as the nodes
are expected to be referenced in a document. So if we create new text nodes
here, we'll most likely end up leaking their memory, simply because it will
be impossible to tell if they originated from a document or from user
provided strings.

Supporting this for smart strings (that know their Element parent) might be
relatively easy, but it's more tricky in the general case.

It might work if we instead create a new Element and append the nodes to it.
That would be a bit of an overhead, but it would enable garbage collection
for us.

Could you file a feature request for now?

Stefan

From rzinkstok at gmail.com Thu Oct 8 16:20:09 2009
From: rzinkstok at gmail.com (Roel Zinkstok)
Date: Thu, 8 Oct 2009 16:20:09 +0200
Subject: [lxml-dev] Custom elements, class lookup and xpath
Message-ID:

Hi!

I am playing around with custom element classes, and encountered the
following problem: when using xpath to find a specific (custom) element, the
element that is returned is not an instance of the custom element class.
Instead, it is an instance of the standard Element class. This causes any
call to a method of the custom class to fail with an AttributeError. See
(simplified but working) code below.

I have just started using lxml, so I am not very experienced; however, I
could not find an answer to my question in the online tutorial and reference
material, while google did not give any clues either. Can anyone help me
with this? Is the class lookup method implemented correctly in my code? It
seems to work when the custom elements are generated...

Thanks!

Roel

### Start of code ###
from lxml import etree as ET

class BliepElement(ET.ElementBase):
    TAG = 'bliep'

    def test(self):
        print "Bliep is testing!"

class BlupElement(ET.ElementBase):
    TAG = 'blup'

    def test(self):
        print "Blup is testing!"

    def addBliep(self):
        x = self.xpath('//bliep')
        if not x:
            b = BliepElement()
            self.insert(0, b)

    def getBliep(self):
        x = self.xpath('//bliep')
        if not x:
            return None
        else:
            return x[0]

class MyLookup(ET.CustomElementClassLookup):
    def lookup(self, nodetype, document, namespace, name):
        if name=='bliep':
            return BliepElement
        if name=='blup':
            return BlupElement

parser = ET.XMLParser()
parser.set_element_class_lookup(MyLookup())
ET.Element = parser.makeelement

b1 = BlupElement()
b1.addBliep()
b2 = b1.getBliep()

print ET.tostring(b1)
print type(b1).mro()
print type(b2).mro()
b2.test()
### End of code ###

Output when run in terminal:
$ python test.py

[, , , ]
[, ]
Traceback (most recent call last):
  File "test.py", line 54, in <module>
    b2.test()
AttributeError: 'lxml.etree._Element' object has no attribute 'test'
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091008/7dfc01c7/attachment-0001.htm

From stefan_ml at behnel.de Thu Oct 8 18:25:00 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 08 Oct 2009 18:25:00 +0200
Subject: [lxml-dev] Custom elements, class lookup and xpath
In-Reply-To:
References:
Message-ID: <4ACE125C.5080405@behnel.de>

Hi,

Roel Zinkstok wrote:
> parser = ET.XMLParser()
> parser.set_element_class_lookup(MyLookup())
> ET.Element = parser.makeelement

This doesn't do what (I guess) you expect. The Element() factory is a pure
API function and is not used internally.

> b1 = BlupElement()
> b1.addBliep()
> b2 = b1.getBliep()

You should read the warning here:

http://codespeak.net/lxml/element_classes.html#generating-xml-with-custom-classes

However, it seems the way to handle this isn't documented on the doc pages,
only on the ElementBase class.

You can add a PARSER attribute to the custom Element classes, whose
configuration will then be used for Element class lookups.
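
A minimal sketch of the PARSER approach, assuming lxml is installed and
that the `PARSER` class attribute behaves as described in this thread.
The class names follow the posted example; `get_bliep` and the use of
return values instead of print statements are illustrative choices, not
part of the original code.

```python
from lxml import etree

class BliepElement(etree.ElementBase):
    TAG = 'bliep'

    def test(self):
        return 'Bliep is testing!'

class BlupElement(etree.ElementBase):
    TAG = 'blup'

    def get_bliep(self):
        # Hypothetical helper: return the first <bliep> element, if any.
        found = self.xpath('//bliep')
        return found[0] if found else None

class MyLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if name == 'bliep':
            return BliepElement
        if name == 'blup':
            return BlupElement
        return None  # fall back to the default element class

parser = etree.XMLParser()
parser.set_element_class_lookup(MyLookup())

# Attach the configured parser to the custom classes, so that documents
# created by instantiating them use MyLookup for class lookups.
BliepElement.PARSER = parser
BlupElement.PARSER = parser

blup = BlupElement()
blup.insert(0, BliepElement())

# The element returned by xpath() should now be a BliepElement.
bliep = blup.get_bliep()
print(type(bliep).__name__)
print(bliep.test())
```

Without the two `PARSER` assignments, the element returned by `xpath()`
would be a plain `_Element`, which is the failure reported above.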

Stefan

From jkrukoff at ltgc.com Thu Oct 8 22:05:22 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Thu, 08 Oct 2009 14:05:22 -0600
Subject: [lxml-dev] XPath & XSLT Extension functions and node sets
In-Reply-To: <4ACD99FB.9060305@behnel.de>
References: <1254950868.23503.12.camel@localhost.localdomain>
	<4ACD99FB.9060305@behnel.de>
Message-ID: <1255032322.31484.3.camel@localhost.localdomain>

On Thu, 2009-10-08 at 09:51 +0200, Stefan Behnel wrote:
[ snipped ]
> Could you file a feature request for now?
>
> Stefan

Okay, added bug to launchpad:
https://bugs.launchpad.net/lxml/+bug/446654

It doesn't look like launchpad will let me assign it to the wishlist,
though.
--
John Krukoff
Land Title Guarantee Company

From matteo.malosio at itia.cnr.it Fri Oct 9 10:14:51 2009
From: matteo.malosio at itia.cnr.it (Matteo Malosio)
Date: Fri, 09 Oct 2009 10:14:51 +0200
Subject: [lxml-dev] How to apply Relax ng schemas
Message-ID: <4ACEF0FB.5050908@itia.cnr.it>

Is it possible to apply a Relax ng schema, similarly to how it is
possible for XML schema?

in the following line "schema" can only be an XML schema or also a Relax
ng schema?

schema = etree.XMLSchema(file=f)
parser = objectify.makeparser(schema = schema)
a = objectify.fromstring(xml, parser)

Thank you

Matteo

From rzinkstok at gmail.com Fri Oct 9 13:09:19 2009
From: rzinkstok at gmail.com (Roel Zinkstok)
Date: Fri, 9 Oct 2009 13:09:19 +0200
Subject: [lxml-dev] Custom elements, class lookup and xpath
In-Reply-To: <4ACE125C.5080405@behnel.de>
References: <4ACE125C.5080405@behnel.de>
Message-ID:

>
> You can add a PARSER attribute to the custom Element classes, whose
> configuration will then be used for Element class
> lookups.
>

Thanks a lot, indeed adding a PARSER=parser statement to the custom element
classes (combined with removing the ET.Element = parser.makeelement
statement) did solve the issue.

Roel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091009/2b0aac48/attachment.htm

From jholg at gmx.de Fri Oct 9 13:17:11 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 09 Oct 2009 13:17:11 +0200
Subject: [lxml-dev] How to apply Relax ng schemas
In-Reply-To: <4ACEF0FB.5050908@itia.cnr.it>
References: <4ACEF0FB.5050908@itia.cnr.it>
Message-ID: <20091009111711.110010@gmx.net>

Hi,

> Is it possible to apply a Relax ng schema, similarly to how it is
> possible for XML schema?
>
> in the following line "schema" can only be an XML schema or also a Relax
> ng schema?
>
> schema = etree.XMLSchema(file=f)
> parser = objectify.makeparser(schema = schema)
> a = objectify.fromstring(xml, parser)

No:

>>> schema = etree.RelaxNG(etree.fromstring("""
...
...
...
...
...
...
... """)
... )
>>> parser = objectify.makeparser(schema=schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "lxml.objectify.pyx", line 1773, in lxml.objectify.makeparser (src/lxml/lxml.objectify.c:18224)
  File "parser.pxi", line 1222, in lxml.etree.XMLParser.__init__ (src/lxml/lxml.etree.c:69205)
TypeError: Argument 'schema' has incorrect type (expected lxml.etree.XMLSchema, got lxml.etree.RelaxNG)
>>>

Don't know if handing the XMLSchema to the parser is merely a shortcut for
the parse-validate cycle or if libxml2 actually uses some other, "validating
XML Schema parser" instead of parsing the instance document with the "normal
parser" and then applying schema validation to the parsed document. I
suppose there is a difference with regard to feed parsing.

Of course validating with RelaxNG schema is fully functional in lxml, as
you probably have already seen:
http://codespeak.net/lxml/validation.html#relaxng

Holger
--
Download now for free: Internet Explorer 8 and Mozilla Firefox 3.5 -
safer, faster and easier!
http://portal.gmx.net/de/go/atbrowser

From matteo.malosio at itia.cnr.it Fri Oct 9 14:38:11 2009
From: matteo.malosio at itia.cnr.it (Matteo Malosio)
Date: Fri, 09 Oct 2009 14:38:11 +0200
Subject: [lxml-dev] Objectify python lists
Message-ID: <4ACF2EB3.9040602@itia.cnr.it>

I'd like to interpret XML Schema lists (i.e. "/1.0 2.0 3.0 4.0 5.0
6.0/" defined with something like "//" ) as python lists (a list made
up of 6 doubles). I can't find any reference in the manual about python
list support. I'm only able to use a sequence of positions (one per
number), but to me that seems too verbose. An XML Schema /list/ would be
better.

Thank you!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091009/d598f6d6/attachment.htm

From dhbaird at gmail.com Sun Oct 11 08:32:25 2009
From: dhbaird at gmail.com (David Baird)
Date: Sun, 11 Oct 2009 00:32:25 -0600
Subject: [lxml-dev] CSSSelector namespace issues
Message-ID: <440abda90910102332y59de6c7cqa69e5eceaa4ba0cb@mail.gmail.com>

Hi,

I love lxml and thanks for all the hard work. I was surprised and
impressed to see that namespaces are supported in the CSS selector
parser.

This evening, I found some issues related to the cssselect module and
namespaces. According to these documents,

    http://www.w3.org/TR/css3-namespace/#css-qnames
    http://www.w3.org/TR/css3-selectors/#typenmsp

these are the possible ways to match elements:

    ns|E
        elements with name E in namespace ns
    *|E
        elements with name E in any namespace, including those without
        a namespace
    |E
        elements with name E without a namespace
    E
        if no default namespace has been declared for selectors, this is
        equivalent to *|E. Otherwise it is equivalent to ns|E where ns
        is the default namespace.

I think that these would be the most correct ways to rewrite CSS
Selectors into XPath:

    ns|E -> ns:E
    *|E  -> *[local-name(.) = 'E']
    |E   -> E
    E    -> *[local-name(.) = 'E']

So, I did some experiments with cssselect.CSSSelector and got these
results:

    CSSSelector('a|b').path  # okay: descendant-or-self::a:b
    CSSSelector('a|*').path  # okay: descendant-or-self::a:*
    CSSSelector('*|*').path  # okay: descendant-or-self::*
    CSSSelector('*|b').path  # fails to parse :-(
    CSSSelector('|b').path   # fails to parse :-(
    CSSSelector('|*').path   # fails to parse :-(
    CSSSelector('b').path    # semi-okay: descendant-or-self::b

(I think this last one should really be
descendant-or-self::*[local-name(.) = 'b'])

So I thought I would dig around the code and see if I could fix things
up. And I did, sort of. But, my fixes resulted in several existing
regression tests failing. I'm having a hard time keeping myself awake
longer, but I've attached a patch that I came up with. It is still a
mystery to me why it is causing so many extra failures though.

Also, while I was digging around, I found some inconsistent rules for
converting to lower case in cssselect.py:

    # This does do lower case!
    class Element(object):
        def xpath(self):
            if self.namespace == '*':
                el = self.element.lower()
            ...
            return XPathExpr(element=el)

    # This doesn't do lower case!
    def css_to_xpath(css_expr, prefix='descendant-or-self::'):
        if isinstance(css_expr, _basestring):
            match = _el_re.search(css_expr)
            if match is not None:
                return '%s%s' % (prefix, match.group(0).strip())

This results in:

    CSSSelector('Bar[a="b"]').path == "descendant-or-self::bar[@a = 'b']"
    CSSSelector('Bar').path == 'descendant-or-self::Bar'

Index: src/lxml/tests/test_css.py
===================================================================
--- src/lxml/tests/test_css.py (revision 68307)
+++ src/lxml/tests/test_css.py (working copy)
@@ -114,6 +114,57 @@
     def shortDescription(self):
         return self.selectors[self.index][0]
 
+
+class CSSXPathTestCase(HelperTestCase):
+
+    test_cases = [
+        # Selector   XPath
+        # --------   -----
+        ( 'a|b'    , "descendant-or-self::a:b"),
+        ( 'a|*'    , "descendant-or-self::a:*"),
+        ( '*|*'    , "descendant-or-self::*"),
+        ( '*|b'    , "descendant-or-self::*[local-name(.) = 'b']"),
+        ( '|b'     , "descendant-or-self::b"),
+        ( '|*'     , "descendant-or-self::*"),
+        ( 'b'      , "descendant-or-self::*[local-name(.) = 'b']"),
+        ( ''       , "descendant-or-self::*"),
+
+        # test the lower-case rules:
+        ( 'Foo|Bar', "descendant-or-self::Foo:Bar"),
+        ( 'Foo|*'  , "descendant-or-self::Foo:*"),
+        ( '*|Bar'  , "descendant-or-self::*[local-name(.) = 'Bar']"),
+        ( '|Bar'   , "descendant-or-self::Bar"),
+        ( 'Bar[a="b"]',
+            "descendant-or-self::*[local-name(.) = 'bar' and (@a = 'b')]"),
+        # FIXME: inconsistent lower casing rules!!!:
+        #        (caused because the parser uses a shortcut regex to match
+        #        single element names)
+        #( 'Bar'    , "descendant-or-self::*[local-name(.) = 'bar']"), # this should succeed?
+        ( 'Bar'    , "descendant-or-self::*[local-name(.) = 'Bar']"), # this should fail?
+        # XXX: Darn, if only "lower-case" were a standard function in XPath :-(
+        #      ...then we could emit XPath expressions like the following:
+        ##( 'Bar'    , "descendant-or-self::*[lower-case(local-name(.)) = 'bar']"),
+        ]
+
+    def __init__(self, index):
+        self.index = index
+        super(HelperTestCase, self).__init__()
+
+    def all(cls):
+        for i in range(len(cls.test_cases)):
+            yield cls(i)
+    all = classmethod(all)
+
+    def runTest(self):
+        selector, xpath = self.test_cases[self.index]
+        result = cssselect.CSSSelector(selector).path
+        if result != xpath:
+            assert 0, 'Got %s, expected %s' % (repr(result), repr(xpath))
+
+    def shortDescription(self):
+        return self.test_cases[self.index][0]
+
+
 def unique(s):
     found = {}
     result = []
@@ -130,4 +181,5 @@
     suite.addTests([make_doctest('test_css_select.txt')])
     suite.addTests([make_doctest('test_css.txt')])
     suite.addTests(list(CSSTestCase.all()))
+    suite.addTests(list(CSSXPathTestCase.all()))
     return suite
Index: src/lxml/cssselect.py
===================================================================
--- src/lxml/cssselect.py (revision 68307)
+++ src/lxml/cssselect.py (working copy)
@@ -386,11 +386,43 @@
 class Element(object):
     """
     Represents namespace|element
+
+    Selector  XPath                   Comments
+    --------  -----                   --------
+
+    ns|E      ns:E                    Elements with name E in
+                                      namespace ns.
+
+    *|E       *[local-name(.) = 'E']  Elements with name E in any
+                                      namespace, including those
+                                      without a namespace.
+
+    |E        E                       Elements with name E without
+                                      a namespace.
+
+    E         *[local-name(.) = 'E']  If no default namespace has been
+                                      declared for selectors, this is
+                                      equivalent to *|E. Otherwise it
+                                      is equivalent to ns|E where ns
+                                      is the default namespace.
+
+                                      Note that the parser also
+                                      forces E to lowercase in this
+                                      case. (Perhaps the user should
+                                      be allowed to have more control
+                                      over whether or not lowercase
+                                      conversion is done...)
+
+    See also:
+
+    - http://www.w3.org/TR/css3-namespace/#css-qnames
+    - http://www.w3.org/TR/css3-selectors/#typenmsp
     """
 
-    def __init__(self, namespace, element):
+    def __init__(self, namespace, element, make_lower=False):
         self.namespace = namespace
         self.element = element
+        self.make_lower = make_lower
 
     def __repr__(self):
         return '%s[%s]' % (
@@ -404,12 +436,22 @@
         return '%s|%s' % (self.namespace, self.element)
 
     def xpath(self):
-        if self.namespace == '*':
-            el = self.element.lower()
-        else:
+        if self.make_lower: _el = self.element.lower()
+        else: _el = self.element
+        if self.namespace == '*' and _el == '*': # *|*
+            el = '*'
+            cond = None
+        elif self.namespace == '*': # E, *|E
+            el = '*'
+            cond = 'local-name(.) = %s' % (xpath_literal(_el),)
+        elif self.namespace == '': # |E
+            el = _el
+            cond = None
+        else: # ns|E
             # FIXME: Should we lowercase here?
-            el = '%s:%s' % (self.namespace, self.element)
-        return XPathExpr(element=el)
+            el = '%s:%s' % (self.namespace, _el)
+            cond = None
+        return XPathExpr(element=el, condition=cond)
 
 class Hash(object):
     """
@@ -511,7 +553,13 @@
     if isinstance(css_expr, _basestring):
         match = _el_re.search(css_expr)
         if match is not None:
-            return '%s%s' % (prefix, match.group(0).strip())
+            # XXX: this bypasses class Element:
+            # XXX: inconsistent application of lower case (class Element
+            #      and the parser would have made this be lower case,
+            #      as a special rule):
+            #return '%s*[local-name(.) = %s]' % (prefix,
+            #    xpath_literal(match.group(0).strip())) # <-- more correct
+            return '%s%s' % (prefix, match.group(0).strip()) # <-- less correct
         match = _id_re.search(css_expr)
         if match is not None:
             return "%s%s[@id = '%s']" % (
@@ -691,24 +739,33 @@
 
 def parse_simple_selector(stream):
     peek = stream.peek()
-    if peek != '*' and not isinstance(peek, Symbol):
+    # NOTE: See the doc string for Element to get an idea of the syntax here:
+    if peek != '*' and peek != '|' and not isinstance(peek, Symbol):
+        make_lower = True # <- irrelevant in this particular case
         element = namespace = '*'
     else:
         next = stream.next()
-        if next != '*' and not isinstance(next, Symbol):
-            raise SelectorSyntaxError(
-                "Expected symbol, got '%s'" % next)
-        if stream.peek() == '|':
+        if stream.peek() == '|': # ns|E, *|E, ns|*, *|*
+            make_lower = False
             namespace = next
             stream.next()
             element = stream.next()
-            if element != '*' and not isinstance(next, Symbol):
+            if element != '*' and not isinstance(element, Symbol):
                 raise SelectorSyntaxError(
                     "Expected symbol, got '%s'" % next)
+        elif next == '|': # |E, |*
+            # TODO: |* isn't handled properly yet :-(
+            make_lower = False
+            namespace = ''
+            element = stream.next()
+        elif next == '*' or isinstance(next, Symbol): # E
+            make_lower = True
+            #namespace = '*' # <-- more correct
+            namespace = '' # <-- less correct
+            element = next
         else:
-            namespace = '*'
-            element = next
-    result = Element(namespace, element)
+            raise SelectorSyntaxError("Expected symbol, got '%s'" % next)
+    result = Element(namespace, element, make_lower)
     has_hash = False
     while 1:
         peek = stream.peek()
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091011/03894ee6/attachment-0001.htm

From lydia.patrovic at rbcmail.ru Sun Oct 11 11:56:55 2009
From: lydia.patrovic at rbcmail.ru (Lydia Patrovic)
Date: Sun, 11 Oct 2009 13:56:55 +0400
Subject: [lxml-dev] html parsing incomplete - bug?
Message-ID: <2af3bd1b256bcc61fc54aa1fd2f367ae560376e3@www.pochta.ru>

Hello,

I have tried parsing a webpage, but unfortunately the node /html/body is not found. I suspect this is a bug, since parsing works with the HTML file saved by Firefox, but not with the data from urllib that I am using in my project.

I am not sure whether this is the right place or whether I should ask on the libxml2 list instead. If I am wrong here, please accept my apologies; if not, here is some further information:

lxml.etree:       (2, 2, 2, 0)
libxml used:      (2, 7, 2)
libxml compiled:  (2, 7, 2)
libxslt used:     (1, 1, 24)
libxslt compiled: (1, 1, 24)

I have tried setting different encodings in etree.parse(source, etree.HTMLParser(encoding=...)), as well as deleting all the embedded files from the page that Firefox saved (to rule out a dependency). The page itself does contain special characters (e.g. umlauts), and the encoding given in the header is charset=iso-8859-1.

None of the fixed bugs in http://codespeak.net/svn/lxml/trunk/CHANGES.txt seems to be related to this, nor does any entry in the bug tracker.

I have attached the urllib-downloaded page as a txt file. Any help is appreciated.

L. Patrovic

-------------- next part --------------
Schueler.CC | Dein Schüler Community-Center
From stefan_ml at behnel.de Sun Oct 11 14:21:09 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 11 Oct 2009 14:21:09 +0200 Subject: [lxml-dev] html parsing incomplete - bug? In-Reply-To: <2af3bd1b256bcc61fc54aa1fd2f367ae560376e3@www.pochta.ru> References: <2af3bd1b256bcc61fc54aa1fd2f367ae560376e3@www.pochta.ru> Message-ID: <4AD1CDB5.5010802@behnel.de> Hi, Lydia Patrovic wrote: > I have tried parsing a webpage, but unfortunately, the node /html/body is not found. > [...] > I have attached the urllib-downloaded page as a txt-file. > > Any help is appreciated. I get the same result with "xmllint --html", so it's definitely a libxml2 problem. It seems to read all tags and then just stops parsing without further notice. The next tag would be the Note the "main&20090924_2" attribute value, which can be interpreted as an unterminated entity. Please report this on the libxml2 mailing list. Stefan From stefan_ml at behnel.de Sun Oct 11 18:30:19 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 11 Oct 2009 18:30:19 +0200 Subject: [lxml-dev] CSSSelector namespace issues In-Reply-To: <440abda90910102332y59de6c7cqa69e5eceaa4ba0cb@mail.gmail.com> References: <440abda90910102332y59de6c7cqa69e5eceaa4ba0cb@mail.gmail.com> Message-ID: <4AD2081B.1060601@behnel.de> Hi, David Baird wrote: > I love lxml and thanks for all the hard work. Happy to hear that. :) > So I thought I would dig around the code and see if I could fix things up. > And I did, sort of. But, my fixes resulted in several existing regression > tests failing. I'm having a hard time keeping myself awake longer, but I've > attached a patch that I came up with. It is still a mystery to me why it is > causing so many extra failures though. Thanks a lot for the patch! I'll look through it ASAP. I didn't write that code, but I also noticed that the parser is somewhat fragile. It's a bit too forgiving in some places and non trivial to extend in others. 
> Also, while I was digging around, I found some inconsistent rules for > converting to lower case in cssselect.py: > > # This does do lower case! > class Element(object): > def xpath(self): > if self.namespace == '*': > el = self.element.lower() > ... > return XPathExpr(element=el) > > # This doesn't do lower case! > def css_to_xpath(css_expr, prefix='descendant-or-self::'): > if isinstance(css_expr, _basestring): > match = _el_re.search(css_expr) > if match is not None: > return '%s%s' % (prefix, match.group(0).strip()) > > This results in: > > CSSSelector('Bar[a="b"]').path == "descendant-or-self::bar[@a = 'b']" > CSSSelector('Bar').path == 'descendant-or-self::Bar' I'm not sure this is intended. It might make sense for HTML, but I actually doubt that lower casing tag names is a good idea in the general case. Thanks again, Stefan From stefan_ml at behnel.de Mon Oct 12 21:38:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 Oct 2009 21:38:53 +0200 Subject: [lxml-dev] CSSSelector namespace issues In-Reply-To: <440abda90910111328o5be44f69o35e9648a2b25ffc8@mail.gmail.com> References: <440abda90910102332y59de6c7cqa69e5eceaa4ba0cb@mail.gmail.com> <4AD2081B.1060601@behnel.de> <440abda90910111328o5be44f69o35e9648a2b25ffc8@mail.gmail.com> Message-ID: <4AD385CD.9050404@behnel.de> Hi, please keep the list involved in your replies. David Baird wrote: > On Sun, Oct 11, 2009 at 10:30 AM, Stefan Behnel wrote: >> Thanks a lot for the patch! I'll look through it ASAP. > > I think my test cases would have fit better into the test_css.txt > framework rather than into test_css.py. Yep. Could you move them there? > Too bad Python's batteries-included standard library doesn't include > *parsing tools* :-P (well, besides regex). True. pyparsing would be nice, although there are certainly others that deserve a similar standing. 
>>> Also, while I was digging around, I found some inconsistent rules for >>> converting to lower case in cssselect.py: >> I'm not sure this is intended. It might make sense for HTML, but I actually >> doubt that lower casing tag names is a good idea in the general case. > > For the situation of HTML (case-insensitive) versus XML > (case-sensitive), I think a good way to handle this is allow the user > to choose if they want lower-case conversion or not. I added some of > the hooks for doing this (especially by adding the "make_lower" > property in Element), but I didn't do much with these hooks yet. > Unfortunately, just lower-casing the selector doesn't do too much good > all by itself - the whole HTML tree would need to be lower cased as > well, and I guess the user would just have to be diligent to do that. >>> import lxml.html as h >>> [el.tag for el in ... h.fromstring("
").iter()] ['html', 'body', 'br'] So the HTML parser actually does it for you. Still, there is no reason why users can't just write their CSS expressions in lower case when searching HTML. Lower casing the expression internally breaks non-HTML traversal. I think that's just one more thing to clean up in lxml 2.3. Stefan From lydia.patrovic at rbcmail.ru Wed Oct 14 19:45:24 2009 From: lydia.patrovic at rbcmail.ru (Lydia Patrovic) Date: Wed, 14 Oct 2009 21:45:24 +0400 Subject: [lxml-dev] [Re]: html parsing incomplete - bug? Message-ID: Hello, thank you for your help, as the replace("\x00","") really helped me. Lydia From mateusz-lists at ant.gliwice.pl Fri Oct 16 10:52:06 2009 From: mateusz-lists at ant.gliwice.pl (Mateusz Korniak) Date: Fri, 16 Oct 2009 10:52:06 +0200 Subject: [lxml-dev] Parsing xml ignoring xml namesapces ? Message-ID: <200910161052.07134.mateusz-lists@ant.gliwice.pl> Hi ! Is it possible to parse XML completly ignoring xml ns declarations ? Code [1] provies element names like: {http://www.w3.org/1999/xhtml}html and I would prefer just html . Thanks in advance , regards ! [1]: # -*- coding: utf-8 -* import StringIO import lxml.html import lxml.etree content_xml = '\n \n

Foo

\n \n\n' parser = lxml.etree.XMLParser(ns_clean=True) tree = lxml.etree.parse(StringIO.StringIO(content_xml), parser) # root_elem = lxml.etree.fromstring(content_xml) root_elem = tree.getroot() for elem in root_elem.iter(): print "elem: %r/%s" % (elem,elem,) -- Mateusz Korniak From stefan_ml at behnel.de Fri Oct 16 11:12:47 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 16 Oct 2009 11:12:47 +0200 Subject: [lxml-dev] Parsing xml ignoring xml namesapces ? In-Reply-To: <200910161052.07134.mateusz-lists@ant.gliwice.pl> References: <200910161052.07134.mateusz-lists@ant.gliwice.pl> Message-ID: <4AD8390F.3050704@behnel.de> Mateusz Korniak wrote: > Is it possible to parse XML completly ignoring xml ns declarations ? No. > Code [1] provies element names like: > > {http://www.w3.org/1999/xhtml}html > > and I would prefer just > > html Use the lxml.html.xhtml_to_html() function, potentially followed by lxml.etree.cleanup_namespaces(). Stefan From mateusz-lists at ant.gliwice.pl Fri Oct 16 11:16:41 2009 From: mateusz-lists at ant.gliwice.pl (Mateusz Korniak) Date: Fri, 16 Oct 2009 11:16:41 +0200 Subject: [lxml-dev] How to delete mutibple attributes (parsed from HTML) Message-ID: <200910161116.41591.mateusz-lists@ant.gliwice.pl> I am parsing[1] simple HTML having two times defined I wish I could delete both xmlns definitions but: del root_elem.attrib["xmlns"] deletes only second one for both setups I used[2] and later there are no more "xmlns" ... :/ Any way I could do it ? Thanks in advance, regards ! 
[1]: # -*- coding: utf-8 -* import lxml.html import lxml.etree from lxml import etree print "lxml.etree: ", etree.LXML_VERSION print "libxml used: ", etree.LIBXML_VERSION print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION print "libxslt used: ", etree.LIBXSLT_VERSION print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION content = """ Foo """ parser = lxml.html.HTMLParser(remove_comments=True,remove_pis=True) # Removing comments as later xml output with them is broken root_elem = lxml.html.document_fromstring(str(content),parser=parser) print "DEBUG: root_elem: %r/%s" % (root_elem,root_elem,) content_xml = lxml.etree.tostring(root_elem,pretty_print=True,with_tail=False,method="xml") # This produces non-parsable XML print "DEBUG: content_xml: %r" % (content_xml,) ## lxml.etree.fromstring(content_xml) # lxml.etree.XMLSyntaxError: Attribute xmlns redefined, line 1, column 18 print "DEBUG: root_elem.attrib: %r" % (root_elem.attrib,) del root_elem.attrib["xmlns"] print "DEBUG: root_elem.attrib after first delete: %r" % (root_elem.attrib,) content_xml = lxml.etree.tostring(root_elem,pretty_print=True,with_tail=False,method="xml") # This produces non-parsable XML print "DEBUG: content_xml: %r" % (content_xml,) lxml.etree.fromstring(content_xml) # This parses, but still has [2]: lxml.etree: (2, 2, 0, 0) libxml used: (2, 7, 4) libxml compiled: (2, 7, 3) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) and: lxml.etree: (2, 1, 5, 0) libxml used: (2, 7, 2) libxml compiled: (2, 7, 2) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) -- Mateusz Korniak From optilude at gmail.com Sat Oct 17 16:29:37 2009 From: optilude at gmail.com (Martin Aspeli) Date: Sat, 17 Oct 2009 14:29:37 +0000 (UTC) Subject: [lxml-dev] 2.2.2 binary egg for Mac OS X 10.6 Message-ID: Hi, We have binary eggs for Python 2.4, 2.5 and 2.6 on Mac OS X 10.5. Unfortunately, we don't have them for 10.6 (Snow Leopard), and the 10.5 egg doesn't work on 10.6 (it crashes on startup). 
Is there any chance we could have 10.6 eggs? If there are reliable build instructions, I can help build them. Cheers, Martin From optilude at gmail.com Sat Oct 17 18:29:27 2009 From: optilude at gmail.com (Martin Aspeli) Date: Sat, 17 Oct 2009 16:29:27 +0000 (UTC) Subject: [lxml-dev] 2.2.2 binary egg for Mac OS X 10.6 References: Message-ID: Martin Aspeli gmail.com> writes: > Is there any chance we could have 10.6 eggs? If there are reliable build > instructions, I can help build them. I've tried this, and it hasn't been easy. :) - I first tried from the svn tag, but I couldn't get Cython to install. It complained about int or double size (I don't have the error anymore) - I then tried from the source .tar.gz, with python setup.py build --static- deps. That failed with compiler errors until I re-installed XCode with the OS X 10.4 compatibility library - I saw some advice about linking /usr/bin/gcc to gcc-4.0 instead of gcc-4.2, but that caused other platform issues in the build process - With gcc 4.2, the new XCode, and --static-deps it now builds cleanly, but when I try to run the tests I get: $ python test.py Traceback (most recent call last): File "test.py", line 595, in ? exitcode = main(sys.argv) File "test.py", line 558, in main test_cases = get_test_cases(test_files, cfg, tracer=tracer) File "test.py", line 260, in get_test_cases module = import_module(file, cfg, tracer=tracer) File "test.py", line 203, in import_module mod = __import__(modname) File "/Users/optilude/tmp/python/lxml/lxml-2.2.2/src/lxml/html/ __init__.py", line 12, in ? 
from lxml import etree ImportError: Failure linking new module: /Users/optilude/tmp/python/lxml/lxml-2.2.2/src/lxml/etree.so: Symbol not found: _htmlParseChunk Referenced from: /Users/optilude/tmp/python/lxml/lxml- 2.2.2/src/lxml/etree.so Expected in: flat namespace in /Users/optilude/tmp/python/lxml/lxml-2.2.2/src/lxml/etree.so I also tried to use the egg in my application, and got: $ ./bin/test plone.z3cform Test-module import failures: Module: plone.z3cform.tests Traceback (most recent call last): File "/Users/optilude/Development/Plone/Code/Build/dexterity/src/ plone.z3cform/plone/z3cform/tests.py", line 12, in ? import z3c.form.testing File "/Users/optilude/.buildout/eggs/z3c.form-2.1.0- py2.4.egg/z3c/form/testing.py", line 42, in ? import lxml.html File "build/bdist.macosx-10.6-i386/egg/lxml/html/__init__.py", line 12, in ? File "build/bdist.macosx-10.6-i386/egg/lxml/etree.py", line 7, in ? File "build/bdist.macosx-10.6-i386/egg/lxml/etree.py", line 6, in __bootstrap__ ImportError: Failure linking new module: /Users/optilude/.python-eggs/ lxml-2.2.2-py2.4-macosx-10.6-i386.egg-tmp/lxml/etree.so: Symbol not found: _exsltRegisterAll Referenced from: /Users/optilude/.python-eggs/ lxml-2.2.2-py2.4-macosx-10.6-i386.egg-tmp/lxml/etree.so Expected in: flat namespace in /Users/optilude/.python-eggs/ lxml-2.2.2-py2.4-macosx-10.6-i386.egg-tmp/lxml/etree.so At this point, I don't know what else to try. Google is not giving me many clues. :) Any ideas? Martin From jholg at gmx.de Tue Oct 20 10:13:48 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 20 Oct 2009 10:13:48 +0200 Subject: [lxml-dev] Objectify python lists In-Reply-To: <4ACF2EB3.9040602@itia.cnr.it> References: <4ACF2EB3.9040602@itia.cnr.it> Message-ID: <20091020081348.287500@gmx.net> Hi, this fell under my radar somehow. > I'd like to interpret XML Schema lists (i.e. > "/1.0 2.0 3.0 4.0 5.0 6.0/" > defined with something like > "//" ) > as python lists (a list made up of 6 doubles). 
>
> I can't find any reference in the manual about python list supporting.
> I'm able only to use a sequence of positions (one per each number) but
> to me it seems too verbose. XMLschema /list/ would be better

XML Schema lists are not supported by lxml.objectify, in the sense that they will not get mapped to Python lists and there is no such thing as an "ObjectifiedListElement".

I do not see a sane way to support this, as list-like behaviour is already used by ObjectifiedElement to provide access to "twin siblings", i.e. elements with the same name, both in access and assignment:

>>> root.l = [1, 2, "three"]
>>> print objectify.dump(root)
msg = None [ObjectifiedElement]
    l = 1 [IntElement]
      * py:pytype = 'int'
    l = 2 [IntElement]
      * py:pytype = 'int'
    l = 'three' [StringElement]
      * py:pytype = 'str'
>>> print root.l[0]
1
>>> print root.l[1]
2
>>> print root.l[2]
three
>>>

I'm not sure about the exact syntax rules for XML Schema lists, but assuming they are whitespace-separated values of a certain type, these will end up in a string element in objectify, so you should be able to use .pyval.split() to get at the distinct values.

>>> root = objectify.fromstring("<root><list>1.0 2.0 3.0</list></root>")
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    list = '1.0 2.0 3.0' [StringElement]
>>> root.list.pyval.split()
['1.0', '2.0', '3.0']
>>>

A problem might be list elements that contain only a single value, which objectify will map to a builtin type:

>>> root = objectify.fromstring("<root><list>4.0</list></root>")
>>> print objectify.dump(root)
root = None [ObjectifiedElement]
    list = 4.0 [FloatElement]

So the safest thing for "known-to-be-lists" elements is probably

>>> root = objectify.fromstring("<root><list>1.0 2.0 3.0</list></root>")
>>> unicode(root.list).split()
[u'1.0', u'2.0', u'3.0']
>>> root = objectify.fromstring("<root><list>4.0</list></root>")
>>> unicode(root.list).split()
[u'4.0']
>>>

Holger

-- 
Download now for free: Internet Explorer 8 and Mozilla Firefox 3.5 - safer, faster and easier!
http://portal.gmx.net/de/go/chbrowser From manu3d at gmail.com Tue Oct 20 23:24:07 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Tue, 20 Oct 2009 22:24:07 +0100 Subject: [lxml-dev] xpath check, selective xslt In-Reply-To: <4AC494DE.2030601@behnel.de> References: <915dc91d0909200416p59026e97of0c1fe9a65e1a1c1@mail.gmail.com> <4AC0BE22.5010007@behnel.de> <915dc91d0909290232r78552a9co2d5953c090f4a8cd@mail.gmail.com> <4AC1D871.5080207@behnel.de> <915dc91d0909290407l11c7730fmcdefe0eb583e0a75@mail.gmail.com> <4AC2195A.1010307@behnel.de> <915dc91d0909290750n7234131ey5b5b8e81e0e8fc1a@mail.gmail.com> <4AC2234C.5080209@behnel.de> <915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com> <4AC494DE.2030601@behnel.de> Message-ID: <915dc91d0910201424v3e611b6x5987d5befdc228ff@mail.gmail.com> 2009/10/1 Stefan Behnel > I'm not sure how hard this will be to implement, though. I didn't find an > obvious function in libxslt that would take a context/start element in > addition to the document to be transformed, so someone needs to investigate > how libxslt works here (we normally call the xsltApplyStylesheetUser() > function), and what it takes to start the transform from a different node > (e.g. if calling xsltProcessOneNode() works out-of-the-box, or if there is > a setup required beforehand). > > Could you file a feature request (i.e. bug) on the bug tracker for now? > Apologies, swamped by some tricky coding (still banging my head over it) I've been procrastinating on this issue. Of course I will. Are you referring to lxml's bug tracker or libxslt's bug tracker given that both lxml's functionality and limitations in this context arise from the latter? Manu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20091020/0cb210ab/attachment.htm From optilude+lists at gmail.com Wed Oct 21 02:27:28 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Wed, 21 Oct 2009 00:27:28 +0000 (UTC) Subject: [lxml-dev] 2.2.2 binary egg for Mac OS X 10.6 References: Message-ID: Martin Aspeli gmail.com> writes: > > Is there any chance we could have 10.6 eggs? If there are reliable build > > instructions, I can help build them. Okay, I finally got this to work using zc.buildout and z3c.recipe.staticlxml. I have eggs for Python 2.4 and 2.6. The build for Python 2.5 is failing in mysterious ways (it says "no egg found" in the temp directory). Can I have PyPI access (username 'optilude') to upload these? Otherwise, can I send them somewhere for someone else? Martin From stefan_ml at behnel.de Wed Oct 21 08:25:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 21 Oct 2009 08:25:42 +0200 Subject: [lxml-dev] xpath check, selective xslt In-Reply-To: <915dc91d0910201424v3e611b6x5987d5befdc228ff@mail.gmail.com> References: <915dc91d0909200416p59026e97of0c1fe9a65e1a1c1@mail.gmail.com> <4AC0BE22.5010007@behnel.de> <915dc91d0909290232r78552a9co2d5953c090f4a8cd@mail.gmail.com> <4AC1D871.5080207@behnel.de> <915dc91d0909290407l11c7730fmcdefe0eb583e0a75@mail.gmail.com> <4AC2195A.1010307@behnel.de> <915dc91d0909290750n7234131ey5b5b8e81e0e8fc1a@mail.gmail.com> <4AC2234C.5080209@behnel.de> <915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com> <4AC494DE.2030601@behnel.de> <915dc91d0910201424v3e611b6x5987d5befdc228ff@mail.gmail.com> Message-ID: <4ADEA966.5070107@behnel.de> Emanuele D'Arrigo wrote: > 2009/10/1 Stefan Behnel > >> I'm not sure how hard this will be to implement, though. 
I didn't find an >> obvious function in libxslt that would take a context/start element in >> addition to the document to be transformed, so someone needs to investigate >> how libxslt works here (we normally call the xsltApplyStylesheetUser() >> function), and what it takes to start the transform from a different node >> (e.g. if calling xsltProcessOneNode() works out-of-the-box, or if there is >> a setup required beforehand). >> >> Could you file a feature request (i.e. bug) on the bug tracker for now? >> > > Apologies, swamped by some tricky coding (still banging my head over it) > I've been procrastinating on this issue. Of course I will. Are you referring > to lxml's bug tracker or libxslt's bug tracker given that both lxml's > functionality and limitations in this context arise from the latter? lxml's bug tracker at launchpad. Thanks! Stefan From dhbaird at gmail.com Wed Oct 21 19:25:20 2009 From: dhbaird at gmail.com (David Baird) Date: Wed, 21 Oct 2009 11:25:20 -0600 Subject: [lxml-dev] CSSSelector namespace issues In-Reply-To: <4AD385CD.9050404@behnel.de> References: <440abda90910102332y59de6c7cqa69e5eceaa4ba0cb@mail.gmail.com> <4AD2081B.1060601@behnel.de> <440abda90910111328o5be44f69o35e9648a2b25ffc8@mail.gmail.com> <4AD385CD.9050404@behnel.de> Message-ID: <440abda90910211025q7a7ab4ddle9ec03016da83aee@mail.gmail.com> On Mon, Oct 12, 2009 at 1:38 PM, Stefan Behnel wrote: > David Baird wrote: >> I think my test cases would have fit better into the test_css.txt >> framework rather than into test_css.py. > Yep. Could you move them there? I got a little too zealous this weekend. And so I don't quite have a patch yet... This weekend I tried to fix a lot of the regressions. I fixed maybe ~50% of them. Here's some of the reasons why adding more-complete namespace support is hard and why it broke many regressions: 1. Wildcard namespaces require the use of local-name() inside of a predicate. 
   Therefore, expressions that were once written like this:

       //*/ol[position() = 1]

   must be rewritten like this:

       //*/*[local-name() = 'ol'][position() = 1]

   The builder needs some refactoring to achieve this. It needs to be able to support creating expressions like these:

       li:nth-of-type(1)    //*/*[local-name() = 'li'][position() = 1]
       li:nth-child(1)      //*/*[local-name() = 'li' and position() = 1]

2. Some parts of the parser incorrectly parse expressions as elements. As a consequence of modifying elements to have better namespace support, many of these parses now generate invalid results. For example:

       div:nth-child(odd)

   "odd" was being incorrectly parsed as an element (just like div). These function-like selectors need a custom parser for everything between the parentheses.

3. Inconsistent/improper rules for lower-casing of element names (mentioned in the original post).

I spent a few hours trying to work around these issues. I had some success, but I only resolved ~50% of the regression failures and started having doubts about how much farther I could go.

So, I am thinking about doing some independent work on a new parser and builder to see what I can come up with. This is not intended as criticism of the current cssselect - I am really happy that some other people were already working on it and I feel that they have made a lot of progress, and it might even be possible to reuse some pieces of that code.

-David

From jkrukoff at ltgc.com Wed Oct 21 20:20:20 2009
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 21 Oct 2009 12:20:20 -0600
Subject: [lxml-dev] XPath optimization troubles.
Message-ID: <1256149220.16931.9.camel@localhost.localdomain>

Hello,

I expect this is properly a libxml2 question, but it's weird enough that I wanted to check here first to make sure that lxml isn't affecting the results.

I have equivalent XPath expressions, one using prefixes to do the selection, and one using namespace-uri to do the check.
The namespace-uri version consistently runs 2-3x faster on a range of test data, and I have no idea why.

Here's the prefix version:
> '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]'

And here's the namespace-uri version:
> '//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces

I'm running these as compiled expressions using etree.XPath( ..., namespaces = namespaces ), if that makes a difference. Any hints? Or a faster XPath to do the same thing for the ambitiously bored?

-- 
John Krukoff
Land Title Guarantee Company

From nicolas at nexedi.com Fri Oct 23 10:41:10 2009
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Fri, 23 Oct 2009 10:41:10 +0200
Subject: [lxml-dev] Availibilty of nsmap
Message-ID: <4AE16C26.2080703@nexedi.com>

Hi all,

I would like to know whether the code sample below shows the expected behaviour of nsmap handling.

>>> from lxml import etree
>>> string = ''
>>> doc = etree.fromstring(string)
>>> doc.find('node/{any_uri}node').nsmap
{'sub': 'any_uri'}
>>> doc.nsmap
{}
>>>

Why doesn't nsmap at the root Element level return the nsmap of its children?

Regards,
Nicolas

-- 
Nicolas Delaby
Nexedi: Consulting and Development of Libre / Open Source Software
http://www.nexedi.com/

From stefan_ml at behnel.de Fri Oct 23 11:16:08 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 23 Oct 2009 11:16:08 +0200
Subject: [lxml-dev] Availibilty of nsmap
In-Reply-To: <4AE16C26.2080703@nexedi.com>
References: <4AE16C26.2080703@nexedi.com>
Message-ID: <4AE17458.1000600@behnel.de>

Nicolas Delaby wrote:
> I would like to know whether the code sample below shows the expected
> behaviour of nsmap handling.
>
>>>> from lxml import etree
>>>> string = ''
>>>> doc = etree.fromstring(string)
>>>> doc.find('node/{any_uri}node').nsmap
> {'sub': 'any_uri'}
>>>> doc.nsmap
> {}
>
> Why doesn't nsmap at the root Element level return the nsmap of its children?

Because the nsmap of an Element contains the namespaces that are defined in its context.
The namespaces defined by its children are not defined for the parent. Also note that children can happily redefine namespace prefixes that already exist in their parents. And nothing (ok, except good taste, maybe) keeps you from redefining the same namespace prefix differently on each child. Namespace prefixes are completely meaningless outside of their definition scope. Stefan From stefan_ml at behnel.de Fri Oct 23 11:49:22 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 23 Oct 2009 11:49:22 +0200 Subject: [lxml-dev] XPath optimization troubles. In-Reply-To: <1256149220.16931.9.camel@localhost.localdomain> References: <1256149220.16931.9.camel@localhost.localdomain> Message-ID: <4AE17C22.90204@behnel.de> John Krukoff wrote: > I expect this is properly a libxml2 question, but it's weird enough I > wanted to check here first to make sure that lxml isn't effecting the > results. > > I have equivalent XPath expressions, one using prefixes to do the > selection, and one using namespace-uri to do the check. The > namespace-uri version consistently runs 2-3x faster on a range of test > data, and I have no idea why. > > Here's the prefix version: >> '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]' > > And here's the namespace-uri version: >> '//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != > "%(gizmo)s" ]' % namespaces > > I'm running these as compiled expressions using etree.XPath( ..., > namespaces = namespaces ), if that makes a difference. Any hints? Or a > faster XPath to do the same thing for the ambitiously bored? Just guessing: in libxml2, a node (element/attribute) knows its namespace URI, so comparing it to a constant string is a fast and local operation. Comparing namespace prefixes requires an indirection, as the prefix is mapped to a URI by the XPath evaluation context, and only the URI can be compared in a meaningful way. 
I'm not sure if the XPath engine can optimise this, as it might be possible to change the namespace-prefix mapping during the run. So simply replacing the prefix by a URI check may not be correct. But this is something that might be worth bringing to the attention of the libxml2 mailing list. BTW, have you also measured the performance of using an XPath variable for the URI in the second case? Stefan From nicolas at nexedi.com Fri Oct 23 14:52:12 2009 From: nicolas at nexedi.com (Nicolas Delaby) Date: Fri, 23 Oct 2009 14:52:12 +0200 Subject: [lxml-dev] Availibilty of nsmap In-Reply-To: <4AE17458.1000600@behnel.de> References: <4AE16C26.2080703@nexedi.com> <4AE17458.1000600@behnel.de> Message-ID: <4AE1A6FC.4020903@nexedi.com> > Namespace prefixes are completely meaningless outside of their definition > scope. > I understand, thanks you. Nicolas -- Nicolas Delaby Nexedi: Consulting and Development of Libre / Open Source Software http://www.nexedi.com/ From Praktikant3 at schmidhauser.ch Tue Oct 27 13:47:59 2009 From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG) Date: Tue, 27 Oct 2009 13:47:59 +0100 Subject: [lxml-dev] etree._Element.items(): Really "arbitrary order" or rather "in document order"? Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706BAE@ZCH502.ch-sag.lenze.com> Hi, I have designed this Loop element that determines its loop counter variable from the first attribute. Example: I implemented this loop a year ago, finding the loop counter attribute using the following code: # 'el' is the etree._Element from above loop_counter = None format = None variables = {} for name, value in el.attrib.iteritems(): if name == "format": format = value continue if loop_counter is None: # the first attribute that is not 'format' loop_counter = name variables[name] = value I looked at the lxml implementation and it comes down to the _collectAttributes() function, defined in src/lxml/apihelpers.pxi. 
At that point I had to give up, I found it to be quicker to write this email than to find my way around libxml2 docs and code.

I wanted to determine whether attributes parsed by libxml2 will remain in document order, regardless of what lxml's API documents (lxml.etree._Element.items() -> "arbitrary order"). Because, the code above worked without a hitch so far, always delivering attributes in document order.

Is there an lxml or libxml2 version that will *not* keep attributes in document order? Why is it documented as being in "arbitrary order"? Do you want to be API-compatible for future changes that might break the order?

Is libxml2 silent about this? What does its parser do? It was just observation that led me to believe document order is maintained, but I'd like to have the "proof" behind that observation. I don't know where to continue.

Thanks,
Felix Rabe

From stefan_ml at behnel.de Tue Oct 27 17:48:43 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 27 Oct 2009 17:48:43 +0100
Subject: [lxml-dev] etree._Element.items(): Really "arbitrary order" or rather "in document order"?
In-Reply-To: <1C3FA7C0D2C03E46A6690DB0242BB73D706BAE@ZCH502.ch-sag.lenze.com>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706BAE@ZCH502.ch-sag.lenze.com>
Message-ID: <4AE7246B.3040101@behnel.de>

Hi,

Praktikant3 - SAG, 27.10.2009 13:47:
> I wanted to determine whether attributes parsed by libxml2 will remain
> in document order, regardless of what lxml's API documents
> (lxml.etree._Element.items() -> "arbitrary order").

Yes, they will. The parser in libxml2 actually guarantees that (at least, according to the source comments).

> Is there an lxml or libxml2 version that will *not* keep attributes in
> document order? Why is it documented as being in "arbitrary order"?

Because 1) ElementTree does not guarantee that it's document order and 2) lxml does not guarantee it either and 3) document order usually *is* an arbitrary order, except for canonical XML. Writing code that relies on a specific order of attributes within an element is bound to fail in most cases.
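[Editorial illustration, not part of the original message.] One way to avoid depending on attribute order is to name the loop counter explicitly. The sketch below is written for Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree. The `counter` attribute, the `read_loop` helper and the sample `Loop` element are hypothetical, since the original example element did not survive in the archive.

```python
import xml.etree.ElementTree as ET


def read_loop(el):
    """Split a Loop element's attributes without looking at their order."""
    attrs = dict(el.attrib)
    fmt = attrs.pop("format", None)
    # Hypothetical convention: 'counter' names the attribute that is the loop counter.
    counter_name = attrs.pop("counter")
    if counter_name not in attrs:
        raise ValueError("counter %r has no matching attribute" % counter_name)
    return counter_name, fmt, attrs


# The counter attribute 'i' is deliberately not the first attribute.
el = ET.fromstring('<Loop format="%02d" j="a,b" counter="i" i="1..3"/>')
loop_counter, fmt, variables = read_loop(el)
print(loop_counter, fmt, variables)
```

With this layout the result is the same however a parser or serialiser orders the attributes.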
Note that the interface is a dict-like mapping object. I do not guarantee that it will always stay that way. For example, it might become a dict subclass one day or the .items() method might return a dict view in Py3, or whatever.

> Do you want to be API-compatible for future changes that might break the
> order?

Sure, works so far.

> Is libxml2 silent about this? What does its parser do? It was
> just observation that led me to believe document order is maintained,
> but I'd like to have the "proof" behind that observation. I don't know
> where to continue.

libxml2 stores attributes as tree nodes, more specifically as an ordered linked list of attribute nodes, and the parser puts them into that list one after the other, in document order. That should be enough of a "proof" that it works that way.

Stefan

From manu3d at gmail.com Tue Oct 27 19:25:33 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Tue, 27 Oct 2009 18:25:33 +0000
Subject: [lxml-dev] Handling processing instructions
Message-ID: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com>

A couple of questions on processing instructions:

1) short of iterating from the root element using repeatedly getprevious(), is there a way to obtain the processing instructions that are listed before the root element?

2) is it possible to delete a processing instruction from a tree/element?

Manu

From Praktikant3 at schmidhauser.ch Wed Oct 28 09:25:01 2009
From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG)
Date: Wed, 28 Oct 2009 09:25:01 +0100
Subject: [lxml-dev] etree._Element.items(): Really "arbitrary order" or rather "in document order"?
In-Reply-To: <4AE7246B.3040101@behnel.de>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706BAE@ZCH502.ch-sag.lenze.com> <4AE7246B.3040101@behnel.de>
Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706C4D@ZCH502.ch-sag.lenze.com>

Hi Stefan,

Thanks! That helped clarify it a lot.

- Felix

From bob at brandt.ie Wed Oct 28 11:53:05 2009
From: bob at brandt.ie (Bob Brandt)
Date: Wed, 28 Oct 2009 10:53:05 +0000
Subject: [lxml-dev] Comments before or after the root element
Message-ID:

First let me say I love lxml!

I have noticed one problem though... (Although I am unsure whether the problem is with lxml or PEBCAK)

I need to insert comments both before and after the root element. From what I can see this is not only perfectly legal but specifically allowed in the RFCs (although for the life of me I can't find the relevant sections again)

I cannot figure out how to do this. And from my testing it appears that even if I import (with either tostring or parse) a document which has comments before and after the root element, lxml just ignores them...

For now I am running a "workaround" which just outputs the comments, then the xml, then the comments, but it would be much better if lxml could do it since there could be problems with the XML Declaration.

Thanks
Bob

--
The problem with socialism is that you eventually run out of other people's money. - Margaret Thatcher

From cfbearden at gmail.com Wed Oct 28 14:35:23 2009
From: cfbearden at gmail.com (Chuck Bearden)
Date: Wed, 28 Oct 2009 08:35:23 -0500
Subject: [lxml-dev] Handling processing instructions
In-Reply-To: <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com>
References: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com> <433ebc870910271208s499e722j4b3d1441c80b0987@mail.gmail.com> <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com>
Message-ID: <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com>

On Wed, Oct 28, 2009 at 4:21 AM, Emanuele D'Arrigo wrote:
> 2009/10/27 Chuck Bearden
>>
>> On Tue, Oct 27, 2009 at 1:25 PM, Emanuele D'Arrigo
>> wrote:
>> > 1) short of iterating from the root element using repeatedly
>> > getprevious(),
>> > is there a way to obtain the processing instructions that are listed
>> > before
>>
>> It is possible via XPath:
>>
>> >>> xmlTree.getroot().xpath('preceding-sibling::node()')
>
> Hi Chuck, thanks for your help!
>
> How is the xpath approach above different from:
>
> xmlTree.getRoot().getprevious()
>
> ?

Evidently I mis-typed or mis-read the first time I tried the ElementTree way of getting a PI that precedes the root element. The obvious ElementTree way does indeed work. My mistake.

>>> from lxml import etree
>>> from StringIO import StringIO
>>> xmlString = '''
...
...
...
...
...
... '''
>>> xmlTree = etree.parse(StringIO(xmlString))
>>> xmlTree.getroot().xpath('preceding-sibling::node()')
[]
>>> xmlTree.getroot().getprevious()
>>>

> And would I be correct in saying one would still need to iterate "upward" to
> find the very first processing instruction?

I guess I think of .getprevious() and the preceding-sibling XPath axis as horizontal rather than vertical, so I would say that you need to recur or iterate "backwards" to get all nodes preceding the root element.
>>> from lxml import etree
>>> from StringIO import StringIO
>>> xmlString = '''
...
...
...
...
...
...
... '''
>>> xmlTree = etree.parse(StringIO(xmlString))
>>> xmlTree.getroot().xpath('preceding-sibling::node()')
[, ]
>>> xmlTree.getroot().getprevious()
>>> xmlTree.getroot().getprevious().getprevious()
>>> xmlTree.getroot().getprevious().getprevious().getprevious()
>>>

Chuck

From cfbearden at gmail.com Wed Oct 28 15:05:36 2009
From: cfbearden at gmail.com (Chuck Bearden)
Date: Wed, 28 Oct 2009 09:05:36 -0500
Subject: [lxml-dev] Comments before or after the root element
In-Reply-To:
References:
Message-ID: <433ebc870910280705n264037cfw533405098a9c2a56@mail.gmail.com>

On Wed, Oct 28, 2009 at 5:53 AM, Bob Brandt wrote:
> First let me say I love lxml!
>
> I have noticed one problem though... (Although I am unsure whether the
> problem is with lxml or PEBCAK)
>
> I need to insert comments both before and after the root element. From what
> I can see this is not only perfectly legal but specifically allowed in the
> RFCs (although for the life of me I can't find the relevant sections again)
>
> I cannot figure out how to do this. And from my testing it appears
> that even if I import (with either tostring or parse) a document which
> has comments before and after the root element, lxml just ignores them...
>
> For now I am running a "workaround" which just outputs the comments, then the
> xml, then the comments, but it would be much better if lxml could do it
> since there could be problems with the XML Declaration.

The .addprevious() and .addnext() methods called on the root element seem to work for me:

>>> from lxml import etree
>>> from StringIO import StringIO
>>> etree.__version__
u'2.1.5'
>>> xmlString = '''
...
...
...
...
... '''
>>> xmlTree = etree.parse(StringIO(xmlString))
>>> xmlTree.getroot().addprevious(etree.PI('prior_PI'))
>>> xmlTree.getroot().addnext(etree.Comment('trailing comment'))
>>> print etree.tostring(xmlTree, xml_declaration=True)
>>>

This also works for me with lxml 2.2.2. What version are you using?

Likewise, my installation doesn't ignore the comments and PIs when I serialize and reparse the XML:

>>> xmlString2 = etree.tostring(xmlTree, xml_declaration=True)
>>> xmlTree2 = etree.parse(StringIO(xmlString2))
>>> print etree.tostring(xmlTree2, xml_declaration=True)
>>>

Chuck

From manu3d at gmail.com Wed Oct 28 17:49:34 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Wed, 28 Oct 2009 16:49:34 +0000
Subject: [lxml-dev] Handling processing instructions
In-Reply-To: <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com>
References: <915dc91d0910271125y4a8736cat51335f693ea0238e@mail.gmail.com> <433ebc870910271208s499e722j4b3d1441c80b0987@mail.gmail.com> <915dc91d0910280221q3a058410q3ea8316e9d0e2c95@mail.gmail.com> <433ebc870910280635r10444dc0ubccd0293c839197a@mail.gmail.com>
Message-ID: <915dc91d0910280949leb7379arec539568a9aad834@mail.gmail.com>

2009/10/28 Chuck Bearden

> > And would I be correct in saying one would still need to iterate "upward" to
> > find the very first processing instruction?
>
> I guess I think of .getprevious() and the preceding-sibling XPath
> axis as horizontal rather than vertical, so I would say that you need
> to recur or iterate "backwards" to get all nodes preceding the root
> element.
>

Yes, I think we're thinking the same thing. I was thinking "upward" because of the top-bottom orientation of the source code, but "backward" is indeed reasonable in the context of siblings in the tree. =)

Thank you! Now I just hope somebody (Stefan?) will be able to answer question number 2, how to delete a processing instruction from the tree!!

Manu

From bob at brandt.ie Wed Oct 28 22:18:04 2009
From: bob at brandt.ie (Bob Brandt)
Date: Wed, 28 Oct 2009 21:18:04 +0000
Subject: [lxml-dev] Comments before or after the root element
In-Reply-To: <433ebc870910280705n264037cfw533405098a9c2a56@mail.gmail.com>
References: <433ebc870910280705n264037cfw533405098a9c2a56@mail.gmail.com>
Message-ID:

Well, I'm running version 2.0.5 and I guess it was a PEBCAK issue, since .addprevious() worked for me as well. I swear I looked over the documentation, but just missed the obvious.

Thanks
Bob

--
The problem with socialism is that you eventually run out of other people's money. - Margaret Thatcher

From manu3d at gmail.com Wed Oct 28 23:55:26 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Wed, 28 Oct 2009 22:55:26 +0000
Subject: [lxml-dev] xpath check, selective xslt
In-Reply-To: <4ADEA966.5070107@behnel.de>
References: <915dc91d0909200416p59026e97of0c1fe9a65e1a1c1@mail.gmail.com> <4AC1D871.5080207@behnel.de> <915dc91d0909290407l11c7730fmcdefe0eb583e0a75@mail.gmail.com> <4AC2195A.1010307@behnel.de> <915dc91d0909290750n7234131ey5b5b8e81e0e8fc1a@mail.gmail.com> <4AC2234C.5080209@behnel.de> <915dc91d0910010422y32939438tf5ff2affc75f005f@mail.gmail.com> <4AC494DE.2030601@behnel.de> <915dc91d0910201424v3e611b6x5987d5befdc228ff@mail.gmail.com> <4ADEA966.5070107@behnel.de>
Message-ID: <915dc91d0910281555k1e73ae45xbc76aad620d7c692@mail.gmail.com>

2009/10/21 Stefan Behnel

>
>> Could you file a feature request (i.e. bug) on the bug tracker for now?
>

I finally managed to stop procrastinating on this and added a blueprint rather than a bug: a bug would imply an error in lxml or libxslt but I suspect this issue might be something not even the XML and XSLT specifications consider and/or allow.

https://blueprints.launchpad.net/lxml/+spec/lxml-in-context-validation-and-transformations

Hope it helps!

Manu

From Praktikant3 at schmidhauser.ch Thu Oct 29 09:53:28 2009
From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG)
Date: Thu, 29 Oct 2009 09:53:28 +0100
Subject: [lxml-dev] Fun with unicode errors
Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com>

Hi,

Maybe you have an idea what could be happening here, otherwise I will (try to) come back with a more complete example. For now I have this small code excerpt that behaves strangely:

(isinstance(output_xml, lxml.etree._Element) is True)

# The two ET.tostring() invocations below, (1) and (2), show the
# following behaviour:
# (1)     "works" (UnicodeDecodeError about el.text after (2))
# (1) (2) "works" (UnicodeDecodeError about el.text after (2))
#     (2) does not work, lxml.etree.SerialisationError: IO_ENCODER about ET.tostring() (2)
ET.tostring(output_xml) # (1)
# Make pretty-printing work by removing unnecessary whitespace:
for el in output_xml.iter():
    ET.tostring(el) # (2)
    if len(el) and el.text and not el.text.strip():
        el.text = None
    if el.tail and not el.tail.strip():
        el.tail = None

(1) and (2) are commented out to run them in the different combinations discussed.

If you have a hint about what might be the issue there, I'd be very glad to hear it. I'm using this code on WinXP SP3, Python 2.6.4, lxml 2.2.2 (pre-compiled package).

Expected behaviour would be to run without raising any exception in any of the runs.
Something about output_xml changed, as this code snippet used to work.

Thanks,
Felix

From Praktikant3 at schmidhauser.ch Thu Oct 29 11:30:35 2009
From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG)
Date: Thu, 29 Oct 2009 11:30:35 +0100
Subject: [lxml-dev] Fun with unicode errors
In-Reply-To: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com>
Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com>

Hi,

The debugging continues. The issue below occurs when I read the file using:

input_xml = ET.parse(input_filename).getroot()

If I change this to:

input_xml = ET.XML(file(input_filename, "rb").read())

I get the UnicodeDecodeError in each of the (1)/(2) combinations.

$ ./01_loop_debug.sh
/c/python26/python.exe tools/xml merge/xmlmerge.py -i Build/CABxxxB.xml
Traceback (most recent call last):
  File "tools/xml merge/xmlmerge.py", line 470, in
    sys.exit(main(sys.argv))
  File "tools/xml merge/xmlmerge.py", line 447, in main
    output_xml = postprocess_xml(output_xml)
  File "tools/xml merge/xmlmerge.py", line 173, in postprocess_xml
    if el.tail and not el.tail.strip():
  File "lxml.etree.pyx", line 833, in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:32942)
  File "apihelpers.pxi", line 620, in lxml.etree._collectText (src/lxml/lxml.etree.c:14919)
  File "apihelpers.pxi", line 1232, in lxml.etree.funicode (src/lxml/lxml.etree.c:19564)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 2: unexpected end of data

Hope that helps the understanding of the issue a bit. Git is being a big help right now. If I reduce the input file to a certain amount, the problem goes away. Hopefully I can isolate the cause soon (before my employer makes me reimplement the workaround again, basically insert:

xml = ET.XML(ET.tostring(xml, encoding="utf-8"))

in several places).
- Felix

From Praktikant3 at schmidhauser.ch Thu Oct 29 12:07:59 2009
From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG)
Date: Thu, 29 Oct 2009 12:07:59 +0100
Subject: [lxml-dev] Fun with unicode errors
In-Reply-To: <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com> <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com>
Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706D6E@ZCH502.ch-sag.lenze.com>

Forget what I'm saying about the changed exceptions. There must be memory corruption. An unrelated change now makes the code raise the SerialisationError again.

I'll keep working on this thing.

- Felix

From marcello at perathoner.de Thu Oct 29 12:41:55 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 29 Oct 2009 12:41:55 +0100
Subject: [lxml-dev] Fun with unicode errors
In-Reply-To: <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com> <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com>
Message-ID: <4AE97F83.6010304@perathoner.de>

Praktikant3 - SAG wrote:
> The debugging continues. The issue below has been when I read the file using:
>
> input_xml = ET.parse(input_filename).getroot()
>
> If I change this to:
>
> input_xml = ET.XML(file(input_filename, "rb").read())
>
> I get the UnicodeDecodeError in each of the (1)/(2) combinations.

Your input file has a bogus encoding declaration and/or encoding errors.

> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 2: unexpected end of data

In utf-8, 0xe5 is the start of a 3-byte sequence. It must be followed by two more bytes.
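[Editorial illustration, not part of the original message.] The claim about 0xe5 can be checked directly. The sketch below is written for Python 3 (the thread itself uses Python 2); the helper name `utf8_sequence_length` is made up for illustration. It classifies a lead byte and then reproduces the same kind of error by decoding a byte string that ends right after 0xe5.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence starting with lead_byte needs,
    or None if the byte cannot start a sequence."""
    if lead_byte <= 0x7F:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return None  # continuation byte (0x80-0xBF) or invalid lead byte


length = utf8_sequence_length(0xE5)

# Lead byte followed by its two continuation bytes decodes to one character.
complete = b"\xe5\xa5\xbd".decode("utf-8")

# Lead byte at position 2 with nothing after it: the decoder runs out of data.
try:
    b"ab\xe5".decode("utf-8")
    error_reason = None
    error_position = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason
    error_position = exc.start

print(length, repr(complete), error_reason, error_position)
```

The error position matches the "position 2" in the traceback only because the sample bytes were chosen that way; it says nothing about where the stray byte came from in the original program.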
--
Marcello Perathoner
webmaster at gutenberg.org

From Praktikant3 at schmidhauser.ch Thu Oct 29 13:31:40 2009
From: Praktikant3 at schmidhauser.ch (Praktikant3 - SAG)
Date: Thu, 29 Oct 2009 13:31:40 +0100
Subject: [lxml-dev] Fun with unicode errors
In-Reply-To: <4AE97F83.6010304@perathoner.de>
References: <1C3FA7C0D2C03E46A6690DB0242BB73D706D39@ZCH502.ch-sag.lenze.com> <1C3FA7C0D2C03E46A6690DB0242BB73D706D5D@ZCH502.ch-sag.lenze.com> <4AE97F83.6010304@perathoner.de>
Message-ID: <1C3FA7C0D2C03E46A6690DB0242BB73D706D81@ZCH502.ch-sag.lenze.com>

Funny thing is:

>>> s = file("input.xml", "rb").read()
>>> '\xe5' in s
False

and:

>>> u = s.decode('utf-8')
>>> len(u) == len(s)
True

There are no multi-byte sequences at all, Python can decode in a straightforward manner.

- Felix

From stefan_ml at behnel.de Fri Oct 30 15:08:05 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 30 Oct 2009 15:08:05 +0100
Subject: [lxml-dev] lxml 2.2.3 released
Message-ID: <4AEAF345.50109@behnel.de>

Hi all,

I just released a new bugfix-only release to PyPI: lxml 2.2.3.
It fixes many bugs, some of which are critical or at least annoying, so updating is generally recommended and should mostly be a drop-in for 2.2.x users.

Have fun,

Stefan


2.2.3 (2009-10-30)

Bugs fixed

* The resolve_entities option did not work in the incremental feed parser.
* Looking up and deleting attributes without a namespace could hit a namespaced attribute of the same name instead.
* Late errors during calls to SubElement() (e.g. attribute related ones) could leave a partially initialised element in the tree.
* Modifying trees that contain parsed entity references could result in an infinite loop.
* ObjectifiedElement.__setattr__ created an empty-string child element when the attribute value was rejected as a non-unicode/non-ascii string.
* Syntax errors in lxml.cssselect could result in misleading error messages.
* Invalid syntax in CSS expressions could lead to an infinite loop in the parser of lxml.cssselect.
* CSS special character escapes were not properly handled in lxml.cssselect.
* CSS Unicode escapes were not properly decoded in lxml.cssselect.
* Select options in HTML forms that had no explicit value attribute were not handled correctly. The HTML standard dictates that their value is defined by their text content. This is now supported by lxml.html.
* XPath raised a TypeError when finding CDATA sections. This is now fully supported.
* Calling help(lxml.objectify) didn't work at the prompt.
* The ElementMaker in lxml.objectify no longer defines the default namespaces when annotation is disabled.
* Feed parser failed to honour the 'recover' option on parse errors.
* Diverting the error logging to Python's logging system was broken.
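[Editorial illustration, not part of the original message.] For the attribute lookup item in the list above, this is a minimal sketch of the intended behaviour, where a plain name and a namespaced name of the same local name stay separate. It is written for Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in, since lxml.etree follows the same `{uri}name` convention for attribute keys. The `urn:example:ns` URI and the sample element are made up, and the snippet does not reproduce the lxml bug itself.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:ns"  # hypothetical namespace URI, for illustration only

el = ET.fromstring('<item xmlns:n="urn:example:ns" n:id="namespaced" id="plain"/>')

# The un-namespaced name and the namespaced name are two different keys.
plain = el.get("id")
namespaced = el.get("{%s}id" % NS)

# Deleting the plain attribute must leave the namespaced one untouched.
del el.attrib["id"]
remaining = sorted(el.attrib)

print(plain, namespaced, remaining)
```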
From stefan_ml at behnel.de Sat Oct 31 16:39:16 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 31 Oct 2009 16:39:16 +0100
Subject: [lxml-dev] lxml 2.2.3 released
In-Reply-To: <4AEAF345.50109@behnel.de>
References: <4AEAF345.50109@behnel.de>
Message-ID: <4AEC5A24.7020306@behnel.de>

Stefan Behnel, 30.10.2009 15:08:
> I just released a new bugfix-only release to PyPI: lxml 2.2.3.

I forgot to state that this release was built using Cython 0.11.3. Starting with the upcoming 2.3 release, lxml will depend on Cython 0.12, which is to be released soon.

Stefan