From chris0wj at gmail.com Mon Mar 2 03:24:26 2009
From: chris0wj at gmail.com (Chris Wj)
Date: Sun, 1 Mar 2009 21:24:26 -0500
Subject: [lxml-dev] Segmentation Fault loading graphml.xsd
Message-ID: <3a0f5ffd0903011824o47c8f68ck200d604c19cff3e9@mail.gmail.com>

Before posting a new bug I want to confirm this. I am reading in graphml.xsd
as a Schema to validate against, which has other xsd files that it
references located in the same folder.

Linux x86_64, Python 2.5, lxml 2.2beta4

lxml.etree: (2, 2, -96, 0)
libxml used: (2, 6, 32)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

Code to reproduce error:

In [1]: from lxml import etree

In [2]: etree.XMLSchema(file="grap
graphml+svg.xsd graphml-attributes.xsd graphml-parseinfo.xsd
graphml-structure.xsd graphml.dtd graphml.xsd

In [2]: s = etree.XMLSchema(file="graphml.xsd")
Segmentation fault

Schemas can be obtained here:
http://graphml.graphdrawing.org/specification.html
Loading the others seg faults too.

-Chris

From stefan_ml at behnel.de Mon Mar 2 14:07:51 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 02 Mar 2009 14:07:51 +0100
Subject: [lxml-dev] Segmentation Fault loading graphml.xsd
In-Reply-To: <3a0f5ffd0903011824o47c8f68ck200d604c19cff3e9@mail.gmail.com>
References: <3a0f5ffd0903011824o47c8f68ck200d604c19cff3e9@mail.gmail.com>
Message-ID: <49ABDA27.1010705@behnel.de>

Hi,

Chris Wj wrote:
> Before posting a new bug I want to confirm this. I am reading in graphml.xsd
> as a Schema to validate against, which has other xsd files that it
> references located in the same folder.
>
> Linux x86_64, Python 2.5, lxml 2.2beta4
>
> lxml.etree: (2, 2, -96, 0)
> libxml used: (2, 6, 32)
> libxml compiled: (2, 6, 32)
> libxslt used: (1, 1, 24)
> libxslt compiled: (1, 1, 24)
>
> Code to reproduce error:
>
> In [1]: from lxml import etree
>
> In [2]: etree.XMLSchema(file="grap
> graphml+svg.xsd graphml-attributes.xsd graphml-parseinfo.xsd
> graphml-structure.xsd graphml.dtd graphml.xsd
>
> In [2]: s = etree.XMLSchema(file="graphml.xsd")
> Segmentation fault
>
> Schemas can be obtained here:
> http://graphml.graphdrawing.org/specification.html
> Loading the others seg faults too.

Thanks for the report. I can confirm that this was a bug in lxml. It only
happens when you parse the schema directly from a filename. This will be
fixed in the final 2.2 release.

Stefan

From jholg at gmx.de Tue Mar 3 11:31:11 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 03 Mar 2009 11:31:11 +0100
Subject: [lxml-dev] (re-raising) exceptions problem in lxml 2.2beta4
Message-ID: <20090303103111.167440@gmx.net>

Hi,

I just ran into a problem with some code that re-raises exceptions,
where accessing sys.exc_info() returns (None, None, None) instead
of the expected most recent exception information.
This seems to happen if "some operation" is performed on the tree after the
exception has been caught and before sys.exc_info() gets invoked.

Looks like lxml clears the exception information somewhere on the way.
Here's a minimal example where the invocation of iterchildren() triggers the
behaviour:

$ cat lxml_reraise.py
import sys
from lxml import etree

print "using lxml version", etree.__version__
root = etree.Element('root')

try:
    access = bool(sys.argv[1])
except IndexError:
    access = False

try:
    raise RuntimeError('Too much foo for bar')
except Exception, e:
    if access:
        print "children:", list(root.iterchildren())
    print sys.exc_info()

Run with lxml 2.1.5:

$ python2.4 lxml_reraise.py
using lxml version 2.1.5
(, , )
$ python2.4 lxml_reraise.py 1
using lxml version 2.1.5
children: []
(, , )

Run with lxml 2.2beta4:

$ ln -s ~/pydev/tmp/lxml-2.2beta4/build/lib.solaris-2.8-sun4u-2.4/lxml
$ python2.4 lxml_reraise.py
using lxml version 2.2.beta4
(, , )
$ python2.4 lxml_reraise.py 1
using lxml version 2.2.beta4
children: []
(None, None, None)

Now, I seem to remember some discussion of changes wrt exceptions for
lxml 2.2. Might this be an (unwanted) side effect of these changes?

Holger

From stefan_ml at behnel.de Tue Mar 3 14:29:36 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 3 Mar 2009 14:29:36 +0100 (CET)
Subject: [lxml-dev] (re-raising) exceptions problem in lxml 2.2beta4
In-Reply-To: <20090303103111.167440@gmx.net>
References: <20090303103111.167440@gmx.net>
Message-ID: <43909.213.61.181.86.1236086976.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

jholg at gmx.de wrote:
> I just ran into a problem with some code that re-raises exceptions,
> where accessing sys.exc_info() returns (None, None, None) instead
> of the expected most recent exception information.
> This seems to happen if "some operation" is performed on the tree

"some operation" being something that raises an exception internally, such
as StopIteration. So it's not really the "most recent exception" that you
get, but only the last exception that you caught in your frame.
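
A minimal sketch of one way to sidestep the effect described above, assuming
the goal is only to keep the caught exception available for later use: read
sys.exc_info() as the first statement of the except block, before touching
the tree. The names risky() and handle() are made up for illustration, and
list(iter([])) is merely a stand-in for a call that raises and swallows
StopIteration internally. Whether this covers every case in lxml 2.2beta4 is
not confirmed here.

```python
import sys


def risky():
    # Hypothetical operation that fails.
    raise RuntimeError('Too much foo for bar')


def handle():
    try:
        risky()
    except Exception:
        # Capture the (type, value, traceback) triple immediately,
        # before any other call gets a chance to disturb it.
        exc_info = sys.exc_info()
        # Stand-in for tree access: exhausting an iterator raises and
        # handles StopIteration behind the scenes.
        children = list(iter([]))
        return exc_info, children


exc_info, children = handle()
print("%s: %s" % (exc_info[0].__name__, exc_info[1]))
```

Holding on to the traceback object keeps the frames it references alive, so
it is usually worth dropping the captured triple once it has been logged or
re-raised.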
> Now, I seem to remember some discussion of changes wrt exceptions for
> lxml 2.2. Might this be an (unwanted) side effect of these changes?

Yes, it's related, and it's definitely a side-effect in Python 2. I wonder
what your example does in Py3...

Looks like Cython needs some version specific code here. Pretty hard to get
these things right in a portable way...

Stefan

From klizhentas at gmail.com Wed Mar 4 12:24:49 2009
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Wed, 4 Mar 2009 14:24:49 +0300
Subject: [lxml-dev] Lxml Crash
Message-ID: <6310a8f80903040324g3642f793hac7c580b1d03aeb6@mail.gmail.com>

Hi all,

Sometimes I get an exception that kills the Apache process. It happens
occasionally (actually it has happened once on my production site), so I
have no more logs at the moment.

I can only suspect that the crash happens when I am trying to replace the
node:

def replace(self,child,new_child):
    root = self.getroottree().getroot()
    index = self.index(child)
    if root._should_notify():
        old_child = deepcopy(child)
        self.insert(index,new_child)
        etree.ElementBase.remove(self,child)
        root._notify(NodeReplaced(old_child,new_child))
        return self[index]
    else:
        self.insert(index,new_child)
        etree.ElementBase.remove(self,child)
        return self[index]

The crash log is below:

*** glibc detected *** /usr/sbin/apache2: free(): invalid pointer: 0x08cd6eca ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6[0xb7e26a85]
/lib/tls/i686/cmov/libc.so.6(cfree+0x90)[0xb7e2a4f0]
/usr/lib/libxml2.so.2(xmlFreeNodeList+0x126)[0xa984d1e6]
/usr/lib/libxml2.so.2(xmlFreeNode+0x76)[0xa984d656]
/usr/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-linux-i686.egg/lxml/etree.so[0xa9992bf2]
/usr/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-linux-i686.egg/lxml/etree.so[0xa99b529f]

I will bring in more logs if the crash repeats, but I will appreciate any
ideas/thoughts/comments so I can quickly eliminate/workaround/prevent the
issue from happening again.
--
Regards, Alex

From stefan_ml at behnel.de Wed Mar 4 13:19:33 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 4 Mar 2009 13:19:33 +0100 (CET)
Subject: [lxml-dev] Lxml Crash
In-Reply-To: <6310a8f80903040324g3642f793hac7c580b1d03aeb6@mail.gmail.com>
References: <6310a8f80903040324g3642f793hac7c580b1d03aeb6@mail.gmail.com>
Message-ID: <53611.213.61.181.86.1236169173.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Alex Klizhentas wrote:
> sometimes i get exception killing apache process. It happens occasionally
> (acually it happened once on my production site), so I have no more logs
> up to the moment,
> [...]
> I will bring in more logs if crash repeats, but I will appreciate any
> ideas/thoughts/comments so I can quickly eliminate/workaround/prevent the
> issue from happening again.

One thing to note is that you are using lxml 2.2alpha1. There were plenty of
bugs that were fixed in 2.2 since then, including a couple of crash bugs.
I'd try to switch to 2.2beta4 ASAP.

http://codespeak.net/lxml/dev/changes-2.2beta4.html

> I can only suspect that crash happens when I am trying to
> replace the node:
>
> def replace(self,child,new_child):
>     root = self.getroottree().getroot()
>     index = self.index(child)
>     if root._should_notify():
>         old_child = deepcopy(child)
>         self.insert(index,new_child)
>         etree.ElementBase.remove(self,child)
>         root._notify(NodeReplaced(old_child,new_child))
>         return self[index]
>     else:
>         self.insert(index,new_child)
>         etree.ElementBase.remove(self,child)
>         return self[index]

Regarding this code, I assume that "self" is an ElementBase subtype.
I wonder why you didn't write it like this:

def replace(self,child,new_child):
    etree.ElementBase.replace(self, child, new_child)
    root = self.getroottree().getroot()
    if root._should_notify():
        root._notify(NodeReplaced(child, new_child))
    return new_child

BTW, is your tree protected against concurrent modification in any way? If
your environment (mod_python?) is configured to run requests in parallel,
concurrently replacing a child of the same parent may lead to crashes.

> crash log is below:
>
> *** glibc detected *** /usr/sbin/apache2: free(): invalid pointer:
> 0x08cd6eca ***
> ======= Backtrace: =========
> /lib/tls/i686/cmov/libc.so.6[0xb7e26a85]
> /lib/tls/i686/cmov/libc.so.6(cfree+0x90)[0xb7e2a4f0]
> /usr/lib/libxml2.so.2(xmlFreeNodeList+0x126)[0xa984d1e6]
> /usr/lib/libxml2.so.2(xmlFreeNode+0x76)[0xa984d656]
> /usr/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-linux-i686.egg/lxml/etree.so[0xa9992bf2]
> /usr/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-linux-i686.egg/lxml/etree.so[0xa99b529f]

All I can see here is that this happens when freeing a node or subtree. Not
much I can extract from that.

Stefan

From klizhentas at gmail.com Wed Mar 4 13:30:55 2009
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Wed, 4 Mar 2009 15:30:55 +0300
Subject: [lxml-dev] Lxml Crash
In-Reply-To: <53611.213.61.181.86.1236169173.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <6310a8f80903040324g3642f793hac7c580b1d03aeb6@mail.gmail.com> <53611.213.61.181.86.1236169173.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <6310a8f80903040430t1d5630f6h22d38c51e70ba0c6@mail.gmail.com>

OK, thanks for your suggestions - I'll apply the changes immediately.

As for concurrency: XML trees are not shared between threads, so it's
unlikely to be the root cause.

--
Regards, Alex

From paratribulations at free.fr Tue Mar 10 18:25:17 2009
From: paratribulations at free.fr (TP)
Date: Tue, 10 Mar 2009 18:25:17 +0100
Subject: [lxml-dev] the subelements of my tree are moving alone
Message-ID:

Hi everybody,

I have derived custom classes from ET._ElementTree and ET.ElementBase to
obtain a custom tree suited to my needs.

It works perfectly, but it seems that the nodes under the root node (the
subelements) move sometimes "alone". The tree structure is kept, but the
address of the elements in memory is changing. As the structure is kept, it
is not a problem for lxml use only: I can walk in the tree, doing what I
need.

But the problem is that I use this custom tree as the underlying data
structure for a PyQt custom QTreeWidget. In this widget, I use the
method "internalPointer()" of QModelIndex instances (as proposed in
chapter 16 of the book "Rapid GUI Programming with Python
and Qt" by Mark Summerfield (around p.500)).

The problem is that if the nodes move, the "internalPointer()" of Qt are not
up to date: I obtain segmentation faults.
Is this normal that nodes of the tree move in memory *alone*? Is this due to
the garbage collector? If yes, how to keep my pointers up to date?

Thanks in advance

--
python -c "print ''.join([chr(154 - ord(c)) for c in '*9(9&(18%.\
9&1+,\'Z4(55l4('])"

"When a distinguished but elderly scientist states that something is
possible, he is almost certainly right. When he states that something is
impossible, he is very probably wrong." (first law of AC Clarke)

From jholg at gmx.de Wed Mar 11 09:22:41 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 11 Mar 2009 09:22:41 +0100
Subject: [lxml-dev] the subelements of my tree are moving alone
In-Reply-To:
References:
Message-ID: <20090311082241.9850@gmx.net>

Hi,

> It works perfectly, but it seems that the nodes under the root node (the
> subelements) move sometimes "alone". The tree structure is kept, but the
> address of the elements in memory is changing. As the structure is kept, it
> is not a problem for lxml use only: I can walk in the tree, doing what I
> need.

That's true. lxml creates its elements on-the-fly on access, you can think
of them as access proxies to the underlying libxml2 tree. This means they go
away when no Python reference to them is kept.

> But the problem is that I use this custom tree as the underlying data
> structure for a PyQt custom QTreeWidget. In this widget, I use the
> method "internalPointer()" of QModelIndex instances (as proposed in the
> chapter 16 of book "Rapid GUI Programming with Python
> and Qt" by Mark Summerfield (around p.500)).
>
> The problem is that if the nodes move, the "internalPointer()" of Qt are not
> up to date: I obtain segmentation faults.
>
> Is this normal that nodes of the tree move in memory *alone*? Is this due to
> the garbage collector? If yes, how to keep my pointers up to date?
You could keep elements around by caching them, which is usually done for
performance tuning (trading memory for speed), like:

cache[root] = list(root.iter())

This caches the whole tree, see "Caching elements" in the objectify
performance section:
http://codespeak.net/lxml/performance.html#lxml-objectify

So essentially you'd need to keep a Python reference to each instantiated
element that you want to hand to PyQt. I wondered why PyQt doesn't keep
the Python reference itself, but alas it's just a weak reference:
http://www.mail-archive.com/pyqt at riverbankcomputing.com/msg16046.html

Holger

From paratribulations at free.fr Wed Mar 11 11:41:05 2009
From: paratribulations at free.fr (TP)
Date: Wed, 11 Mar 2009 11:41:05 +0100
Subject: [lxml-dev] the subelements of my tree are moving alone
References: <20090311082241.9850@gmx.net>
Message-ID: <23hk86-2s8.ln1@rama.fbx.proxad.net>

jholg at gmx.de wrote:
> So essentially you'd need to keep a Python reference to each instantiated
> element that you want to hand to PyQt. I wondered why PyQt doesn't keep
> the Python reference itself, but alas it's just a weak reference:
> http://www.mail-archive.com/pyqt at riverbankcomputing.com/msg16046.html

Thanks Holger and Stefan for your help. By keeping a reference to all
elements in the tree, it works perfectly.

Julien

From tennis at tripit.com Wed Mar 11 17:25:33 2009
From: tennis at tripit.com (Tennis Smith)
Date: Wed, 11 Mar 2009 09:25:33 -0700
Subject: [lxml-dev] Fun With Sequencing
Message-ID: <49B7E5FD.5050705@tripitinc.com>

Hi,

I am generating xml files (for test purposes) that will be validated
with a schema which uses "xs:sequence". The result is that my docs have
to have exactly the right sequence of elements or they will not validate.

It's going to be a pain to re-parse the schema file and then use the
munged data to sort my xml docs. So, is there some way in lxml I can
dump the schema's contents and use that to make sure I have everything
in order *before* I try to validate?

Failing that, are there any suggestions?

Thanks
-T

From stefan_ml at behnel.de Sat Mar 14 15:45:31 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 14 Mar 2009 15:45:31 +0100
Subject: [lxml-dev] Fun With Sequencing
Message-ID: <49BBC30B.6010706@behnel.de>

[and to the list]

Hi,

Tennis Smith wrote:
> I am generating xml files (for test purposes) that will be validated
> with a schema which uses "xs:sequence". The result is that my docs have
> to have exactly the right sequence of elements or it will not validate.

Could you say a bit more about your constraints? Why can't you just change
the schema, if it's purely for test purposes? And why can't you just look at
the schema once and generate the XML documents accordingly?

> Its going to be a pain to re-parse the schema file and then use the
> munged data to sort my xml docs. So, is there some way in lxml I can
> dump the schema's contents and use that to make sure I have everything
> in order *before* I try to validate?

No, and I doubt that there's a major use case for that. Validation is about
figuring out that something is wrong (and usually also what is wrong), not
about getting stuff fixed automatically.
Stefan

From stefan_ml at behnel.de Sat Mar 14 15:47:13 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 14 Mar 2009 15:47:13 +0100
Subject: [lxml-dev] the subelements of my tree are moving alone
Message-ID: <49BBC371.7070904@behnel.de>

[and to the list]

TP wrote:
> I have derived custom classes from ET._ElementTree and ET.ElementBase to
> obtain a custom tree suited to my needs.

It obviously depends on what you do with it, but I'm not so sure that
subclassing the _ElementTree class is a good idea. It's not wrong in
general, but maybe writing your own tree wrapper would be better.

> It works perfectly, but it seems that the nodes under the root node (the
> subelements) move sometimes "alone". The tree structure is kept, but the
> address of the elements in memory is changing. As the structure is kept, it
> is not a problem for lxml use only: I can walk in the tree, doing what I
> need.
>
> But the problem is that I use this custom tree as the underlying data
> structure for a PyQt custom QTreeWidget. In this widget, I use the
> method "internalPointer()" of QModelIndex instances (as proposed in the
> chapter 16 of book "Rapid GUI Programming with Python
> and Qt" by Mark Summerfield (around p.500)).
>
> The problem is that if the nodes move, the "internalPointer()" of Qt are not
> up to date: I obtain segmentation faults.
>
> Is this normal that nodes of the tree move in memory *alone*? Is this due to
> the garbage collector? If yes, how to keep my pointers up to date?

Here's some documentation:

http://codespeak.net/lxml/dev/element_classes.html#background-on-element-proxies

It's not only the "pointer", it's the Element instances that are replaced.
It may be that "internalPointer()" (never heard of it, but I assume it's
some kind of backpointing mechanism) does not respect Python's reference
counting, so that the Element object gets garbage collected.
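
To illustrate the lifetime issue in isolation, here is a small sketch that
uses a plain Python class and the weakref module as a stand-in. Proxy is a
hypothetical class, not lxml's proxy implementation, and the weak reference
only plays the role of a non-owning handle such as the one PyQt is said to
hold. It shows nothing more than this: an object reachable only through a
non-owning handle goes away, while one held in a list stays.

```python
import gc
import weakref


class Proxy(object):
    """Hypothetical stand-in for a Python-level Element proxy."""

    def __init__(self, tag):
        self.tag = tag


# Case 1: the only owning reference is dropped, the handle goes stale.
proxy = Proxy('child')
handle = weakref.ref(proxy)
del proxy
gc.collect()
lost = handle() is None

# Case 2: a list holds owning references, so the handles stay valid.
cache = [Proxy(tag) for tag in ('a', 'b')]
handles = [weakref.ref(p) for p in cache]
gc.collect()
kept = all(h() is not None for h in handles)

print("lost without cache: %s, kept with cache: %s" % (lost, kept))
```

The difference from the PyQt case is that a stale weakref simply returns
None, whereas a stale raw pointer is dereferenced and can crash the process.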
A simple way to work around this is to keep a reference to all Element
objects in the tree, initialised as

my_tree.element_cache = list(root_element.iter())

But keep in mind that this needs to be maintained on tree changes.

Stefan

From paratribulations at free.fr Tue Mar 17 11:13:17 2009
From: paratribulations at free.fr (TP)
Date: Tue, 17 Mar 2009 11:13:17 +0100
Subject: [lxml-dev] space in attribute name: xpath expression?
Message-ID:

Hi everybody,

It seems not possible to define with fromstring() or ET.XML a tree
containing attributes with spaces.
But it is possible by adding the attribute containing a space afterwards,
see the example below.

###################
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.etree as ET

root = ET.XML("data")
foo_elem = root.xpath( "//foo" )
foo_elem[0].set( "tu tu", "22" )
print ET.tostring( root )
###################

We obtain:
data

It seems a bad idea to have spaces in attributes. I have not found a way to
make an xpath request work, for example the two following ones yield an
error:

print root.xpath( "//*[@tu tu=22]" )
print root.xpath( "//*[@tu\ tu=22]" )

From another point of view, often we would like to define attribute names as
they are, i.e. english expressions with spaces. How do you proceed? Put
underscores in the attribute names, and then remove them when displaying in
the tree (for example in a graphical widget)? Or define the correspondence
between the attribute names and the english names in some part of the XML
file (for example, the attribute names could be tags, associated to some
text that would contain the english names)?

Thanks

From stefan_ml at behnel.de Tue Mar 17 11:41:12 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 17 Mar 2009 11:41:12 +0100 (CET)
Subject: [lxml-dev] space in attribute name: xpath expression?
In-Reply-To:
References:
Message-ID: <58418.213.61.181.86.1237286472.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

TP wrote:
> It seems not possible to define with fromstring() or ET.XML a tree
> containing attributes with spaces.

I do hope it isn't.

> But it is possible by adding the attribute containing a space afterwards,
> see the example below.
>
> ###################
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-
>
> import lxml.etree as ET
>
> root = ET.XML("data")
> foo_elem = root.xpath( "//foo" )
> foo_elem[0].set( "tu tu", "22" )
> print ET.tostring( root )
> ###################
>
> We obtain:
> data

Hmmm, ok, that looks like a bug to me. lxml should validate attribute names
on the way in, just like tag names are validated.

> From another point of view, often we would like to define attribute names
> as they are, i.e. english expressions with spaces.

How do you know that they will only ever be "english" expressions? What
about Farsi and Chinese?

> How do you proceed? Put
> underscores in the attribute names, and then remove them when displaying
> in the tree (for example in a graphical widget)?

It is a very good and common design choice to separate data from
representation. So these two are completely orthogonal. You can use '_' or
'-' to separate words, or you can use a prefixed MD5 hash for the attribute
name that maps to a separate name lookup table. Choices are endless.

> Or define the correspondance
> between the attribute names and the english names in some part of the XML
> file (for example, the attribute names could be tags, associated to some
> text that would contain the english names.

With "tags" you mean "references", I assume.
Maybe even references into a separate XML file (one per language) that
defines the presentational name. Without knowing enough about your
application, this sounds like a reasonable thing to do.

Stefan

From jholg at gmx.de Tue Mar 17 11:45:21 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 17 Mar 2009 11:45:21 +0100
Subject: [lxml-dev] space in attribute name: xpath expression?
In-Reply-To:
References:
Message-ID: <20090317104521.261050@gmx.net>

Hi,

> import lxml.etree as ET
>
> root = ET.XML("data")
> foo_elem = root.xpath( "//foo" )
> foo_elem[0].set( "tu tu", "22" )
> print ET.tostring( root )
> ###################

XML does not allow blanks in attribute names. At least since version 2.0
lxml disallows setting such names through the API:

>>> import lxml.etree as ET
>>>
>>> root = ET.XML("data")
>>> foo_elem = root.xpath( "//foo" )
>>> foo_elem[0].set( "tu tu", "22" )
Traceback (most recent call last):
  File "", line 1, in ?
  File "lxml.etree.pyx", line 646, in lxml.etree._Element.set (src/lxml/lxml.etree.c:9638)
  File "apihelpers.pxi", line 411, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:31508)
  File "apihelpers.pxi", line 1323, in lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:38843)
ValueError: Invalid attribute name u'tu tu'
>>> print ET.__version__
2.1.5
>>>

> From another point of view, often we would like to define attribute names as
> they are, i.e. english expressions with spaces. How do you proceed? Put
> underscores in the attribute names, and then remove them when displaying in
> the tree (for example in a graphical widget)? Or define the correspondance
> between the attribute names and the english names in some part of the XML
> file (for example, the attribute names could be tags, associated to some
> text that would contain the english names.

Yes, why not use a valid separator like _ or . and split words accordingly
for representation.
Of course, you'd have to make sure that your separator does not normally
show up in your expressions.

Holger

From paratribulations at free.fr Tue Mar 17 11:55:16 2009
From: paratribulations at free.fr (TP)
Date: Tue, 17 Mar 2009 11:55:16 +0100
Subject: [lxml-dev] space in attribute name: xpath expression?
References: <20090317104521.261050@gmx.net>
Message-ID:

jholg at gmx.de wrote:
>>>> foo_elem[0].set( "tu tu", "22" )
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "lxml.etree.pyx", line 646, in lxml.etree._Element.set
>   (src/lxml/lxml.etree.c:9638)
>   File "apihelpers.pxi", line 411, in lxml.etree._setAttributeValue
>   (src/lxml/lxml.etree.c:31508)
>   File "apihelpers.pxi", line 1323, in
>   lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:38843)
> ValueError: Invalid attribute name u'tu tu'
>
>>>> print ET.__version__
> 2.1.5

On my computer:

>>> print ET.__version__
1.3.6

(I use Kubuntu 8.04)

So the bug seems to have disappeared in the newer versions.

> Yes, why not use a valid separator like _ or . and split words accordingly
> for representation. Of course, you'd have to make sure that your separator
> does not normally show up in your expressions.

Thanks for your opinion on the subject.

Julien

From piotr.furman at webservice.pl Tue Mar 17 14:39:11 2009
From: piotr.furman at webservice.pl (Piotr Furman)
Date: Tue, 17 Mar 2009 13:39:11 +0000 (UTC)
Subject: [lxml-dev] iterparse and namespaces
Message-ID:

Hi,

I've got a small problem with iterparse and namespaces.
I have this code:

>>> from StringIO import StringIO
>>> from lxml import etree
>>> print etree.__version__
2.1.5
>>> xml = """12"""
>>> a1 = etree.iterparse(StringIO(xml), tag="a")
>>> a1.next()
Traceback (most recent call last):
  File "", line 1, in ?
  File "iterparse.pxi", line 515, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:77014)
StopIteration
>>> a2 = etree.iterparse(StringIO(xml), tag="{http://www.example.com}a")
>>> a2.next()
(u'end', )

Is it possible to get all tags "a" without passing the namespace? I mean, a2
works, but would it be possible to make a1 work too?

regards
Piotr Furman

From stefan_ml at behnel.de Tue Mar 17 15:13:28 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 17 Mar 2009 15:13:28 +0100 (CET)
Subject: [lxml-dev] iterparse and namespaces
In-Reply-To:
References:
Message-ID: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Piotr Furman wrote:
> I've got small problem with iterparse and namespaces. I have this code:
>
>>>> from StringIO import StringIO
>>>> from lxml import etree
>>>> print etree.__version__
> 2.1.5
>>>> xml = """>>> xmlns="http://www.example.com">12"""
>>>> a1 = etree.iterparse(StringIO(xml), tag="a")
>>>> a1.next()
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "iterparse.pxi", line 515, in lxml.etree.iterparse.__next__
>   (src/lxml/lxml.etree.c:77014)
> StopIteration
>>>> a2 = etree.iterparse(StringIO(xml), tag="{http://www.example.com}a")
>>>> a2.next()
> (u'end', )
>
> Is that possible to get all tags "a" without passing namespace? I mean, a2
> works, but could it be possible to make a1 working too?

..., tag="{*}a"

might work.
Stefan

From jholg at gmx.de Tue Mar 17 15:23:55 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 17 Mar 2009 15:23:55 +0100
Subject: [lxml-dev] iterparse and namespaces
In-Reply-To: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <20090317142355.261050@gmx.net>

Hi,

> > Is that possible to get all tags "a" without passing namespace? I mean, a2
> > works, but could it be possible to make a1 working too?
>
> ..., tag="{*}a"
>
> might work.

I thought so too but it doesn't :)

>>> for e in etree.iterparse(StringIO(xml), tag="{*}a"): print e
...

The wildcard works for element names, not for namespaces:

>>> for e in etree.iterparse(StringIO(xml), tag="{http://www.example.com}*"): print e
...
(u'end', )
(u'end', )
(u'end', )
>>>

Same holds for .iter().

Is there a usecase for "give me every element x from whatever namespace"?

Holger

From stefan_ml at behnel.de Tue Mar 17 15:48:54 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 17 Mar 2009 15:48:54 +0100 (CET)
Subject: [lxml-dev] iterparse and namespaces
In-Reply-To: <20090317142355.261050@gmx.net>
References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net>
Message-ID: <50366.213.61.181.86.1237301334.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

jholg at gmx.de wrote:
>> > Is that possible to get all tags "a" without passing namespace? I mean, a2
>> > works, but could it be possible to make a1 working too?
>>
>> ..., tag="{*}a"
>>
>> might work.
>
> I thought so too but it doesn't :)

Thanks for checking.
:) > The wildcard works for element names, not for namespaces: > >>>> for e in etree.iterparse(StringIO(xml), >>>> tag="{http://www.example.com}*"): print e > ... > (u'end', ) > (u'end', ) > (u'end', ) >>>> > > Same holds for .iter(). > > Is there a usecase for "give me every element x from whatever namespace"? Yep, I guess that's why it doesn't work. :) Maybe the OP can give us a clearer idea about the background of this request. Stefan From piotr.furman at webservice.pl Tue Mar 17 15:59:54 2009 From: piotr.furman at webservice.pl (Piotr Furman) Date: Tue, 17 Mar 2009 14:59:54 +0000 (UTC) Subject: [lxml-dev] iterparse and namespaces References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> Message-ID: gmx.de> writes: > Is there a usecase for "give me every element x from whatever namespace"? > > Holger Thanks for answer, my use case is that I have a xml file with only one namespace defined in root. I guess that if there were more namespaces in one file it wouldn't make sense, but as long as it's only one I just don't care about that and would have all specified elements. So I have at least two choices, either remove xmlns from files, or iterate over all elements and filter out those I don't need. PF From stefan_ml at behnel.de Tue Mar 17 16:12:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 17 Mar 2009 16:12:27 +0100 (CET) Subject: [lxml-dev] iterparse and namespaces In-Reply-To: References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> Message-ID: <42776.213.61.181.86.1237302747.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Piotr Furman wrote: > gmx.de> writes: >> Is there a usecase for "give me every element x from whatever >> namespace"? > > Thanks for answer, my use case is that I have a xml file with only one > namespace > defined in root. 
I guess that if there were more namespaces in one file it > wouldn't make sense, but as long as it's only one I just don't care about > that and would have all specified elements. uh? Then I really don't get it. If there is only one namespace that contains all elements, then why can't you just look for the tags in exactly that namespace? That will give you all tags with that name. Stefan From jcd at sdf.lonestar.org Tue Mar 17 16:20:26 2009 From: jcd at sdf.lonestar.org (J. Cliff Dyer) Date: Tue, 17 Mar 2009 11:20:26 -0400 Subject: [lxml-dev] iterparse and namespaces In-Reply-To: References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> Message-ID: <1237303226.18201.2.camel@aalcdl07> On Tue, 2009-03-17 at 14:59 +0000, Piotr Furman wrote: > gmx.de> writes: > > Is there a usecase for "give me every element x from whatever namespace"? > > > > Holger > > Thanks for answer, my use case is that I have a xml file with only one namespace > defined in root. I guess that if there were more namespaces in one file it > wouldn't make sense, but as long as it's only one I just don't care about that > and would have all specified elements. > > So I have at least two choices, either remove xmlns from files, or iterate over > all elements and filter out those I don't need. > > PF > If your concern is that the namespaces are unwieldy, you can also declare them so that you can use a more readable prefix. 
For example (taken from live code): ns = { 'mets': 'http://www.loc.gov/METS/', 'mods': 'http://www.loc.gov/mods/v3', } timestamp_set = excerpt_filestruct.xpath('mets:fptr[@FILEID="DIGITAL_ACCESS_COPY"]/mets:area', namespaces=ns) Cheers, Cliff From piotr.furman at webservice.pl Tue Mar 17 16:39:58 2009 From: piotr.furman at webservice.pl (Piotr Furman) Date: Tue, 17 Mar 2009 15:39:58 +0000 (UTC) Subject: [lxml-dev] iterparse and namespaces References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> <42776.213.61.181.86.1237302747.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel behnel.de> writes: > > Piotr Furman wrote: > > gmx.de> writes: > >> Is there a usecase for "give me every element x from whatever > >> namespace"? > > > > Thanks for answer, my use case is that I have a xml file with only one > > namespace > > defined in root. I guess that if there were more namespaces in one file it > > wouldn't make sense, but as long as it's only one I just don't care about > > that and would have all specified elements. > > uh? Then I really don't get it. If there is only one namespace that > contains all elements, then why can't you just look for the tags in > exactly that namespace? That will give you all tags with that name. > > Stefan Sure I can, but my real data is little bigger than this sample. Each element found with iterparse has many other tags I'd like to retrieve, using "iter" method. Here again, for each tag I would have to add namespace. If I do this in several lines code will be ugly. It would be also harder to maintain, if for some reason somebody would change xmlns one day. So it would be nice if iterparse could accept wildcard as namespace, but I see it can be solved another way.
PF From stefan_ml at behnel.de Tue Mar 17 16:49:02 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 17 Mar 2009 16:49:02 +0100 (CET) Subject: [lxml-dev] iterparse and namespaces In-Reply-To: References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> <42776.213.61.181.86.1237302747.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <52839.213.61.181.86.1237304942.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Piotr Furman wrote: > Stefan Behnel behnel.de> writes: >> Piotr Furman wrote: >> > gmx.de> writes: >> >> Is there a usecase for "give me every element x from whatever >> >> namespace"? >> > >> > Thanks for answer, my use case is that I have a xml file with only one >> > namespace >> > defined in root. I guess that if there were more namespaces in one >> file it >> > wouldn't make sense, but as long as it's only one I just don't care >> about >> > that and would have all specified elements. >> >> uh? Then I really don't get it. If there is only one namespace that >> contains all elements, then why can't you just look for the tags in >> exactly that namespace? That will give you all tags with that name. >> >> Stefan > > Sure I can, but my real data is little bigger than this sample. Each > element > found with iterparse has many other tags I'd like to retrieve, using > "iter" > method. Here again, for each tag I would have to add namespace. If I do > this in > several lines code will be ugly. It would be also harder to maintain, if > for some reason somebody would change xmlns one day. > > So it would be nice if iterparse could accept wildcard as namespace Note that this would not give you a namespace-free tag name on the element, so you'd still have to use qualified names in a couple of places. It's really best to assign the qualified names to variables and to work with those. 
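A minimal sketch of the "assign the qualified names to variables" approach suggested above. The sample document and the element names "a" and "b" are made up for illustration (the XML in the original post did not survive the archive), and the helper name collect_b_texts is hypothetical. The namespace URI is written once, so a later change of xmlns touches a single line.

```python
from io import BytesIO

from lxml import etree

# The namespace is defined in exactly one place.
NS = "http://www.example.com"

# Qualified tag names built once and reused everywhere.
# "a" and "b" are hypothetical element names for this sketch.
A_TAG = "{%s}a" % NS
B_TAG = "{%s}b" % NS

# Hypothetical sample data with a single default namespace on the root.
SAMPLE = (
    b'<root xmlns="http://www.example.com">'
    b"<a><b>1</b></a><a><b>2</b></a>"
    b"</root>"
)


def collect_b_texts(data):
    """Return the text of every <b> found below each <a> element."""
    values = []
    # iterparse only reports "end" events for elements matching A_TAG.
    for event, element in etree.iterparse(BytesIO(data), tag=A_TAG):
        # The same qualified-name variables work for .iter() as well.
        for b in element.iter(B_TAG):
            values.append(b.text)
    return values


if __name__ == "__main__":
    print(collect_b_texts(SAMPLE))
```

The tag names on the returned elements are still fully qualified, so any comparison against element.tag should use the same variables rather than a bare "a".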
Stefan From piotr.furman at webservice.pl Tue Mar 17 17:03:28 2009 From: piotr.furman at webservice.pl (Piotr Furman) Date: Tue, 17 Mar 2009 16:03:28 +0000 (UTC) Subject: [lxml-dev] iterparse and namespaces References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> <42776.213.61.181.86.1237302747.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <52839.213.61.181.86.1237304942.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel behnel.de> writes: > Note that this would not give you a namespace-free tag name on the > element, so you'd still have to use qualified names in a couple of places. > It's really best to assign the qualified names to variables and to work > with those. > > Stefan Agreed, something like ns = "http://www.example.com" etree.iterparse(xml, tag="{%s}a" % ns) will probably be the best way. Thanks for the answers. From ross at kallisti.us Tue Mar 17 20:47:13 2009 From: ross at kallisti.us (Ross Vandegrift) Date: Tue, 17 Mar 2009 15:47:13 -0400 Subject: [lxml-dev] iterparse and namespaces In-Reply-To: <50366.213.61.181.86.1237301334.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <38832.213.61.181.86.1237299208.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090317142355.261050@gmx.net> <50366.213.61.181.86.1237301334.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <20090317194713.GB5068@kallisti.us> On Tue, Mar 17, 2009 at 03:48:54PM +0100, Stefan Behnel wrote: > jholg at gmx.de wrote: > > Is there a usecase for "give me every element x from whatever namespace"? > > Yep, I guess that's why it doesn't work. :) While I realize this is a different problem than the original poster had... Suppose you have to interoperate with an XML generator that changes the namespace based on an unrelated version number of the supporting platform and you had no way of knowing what namespaces a document would use?
Further, you do know that the semantic content of the tags is unchanged. That situation led me to want an XML toolkit that would let me throw away namespace data - because stupid people have done stupid things with XML namespaces. And I have to live with it, whether it's right or not. I ended up solving the problem by search-and-replacing an XSLT sheet with heuristically gleaned version information and using that XSLT to create data structures I could actually do something with. Poetic justice, I'd say, that XML's structured approach can lead to a problem solvable only by ad-hoc parsing of a serialized XML doc :) (Though in retrospect, I think I could use lxml's nsmap members to glean the namespace information and build unversioned data structures without the really ugly intermediate transform) -- Ross Vandegrift ross at kallisti.us "If the fight gets hot, the songs get hotter. If the going gets tough, the songs get tougher." --Woody Guthrie From l at lrowe.co.uk Tue Mar 17 21:47:34 2009 From: l at lrowe.co.uk (Laurence Rowe) Date: Tue, 17 Mar 2009 21:47:34 +0100 Subject: [lxml-dev] iterparse and namespaces In-Reply-To: References: Message-ID: Hi, Use local-name() in an xpath: >>> doc = etree.XML('''''') >>> doc.xpath("//*[local-name() = 'a']") [, ] HTH, Laurence 2009/3/17 Piotr Furman : > Hi, > > I've got small problem with iterparse and namespaces. I have this code: > >>>> from StringIO import StringIO >>>> from lxml import etree >>>> print etree.__version__ > 2.1.5 >>>> xml = """12""" >>>> a1 = etree.iterparse(StringIO(xml), tag="a") >>>> a1.next() > Traceback (most recent call last): > File "", line 1, in ? > File "iterparse.pxi", line 515, in lxml.etree.iterparse.__next__ > (src/lxml/lxml.etree.c:77014) > StopIteration >>>> a2 = etree.iterparse(StringIO(xml), tag="{http://www.example.com}a") >>>> a2.next() > (u'end', ) > > Is that possible to get all tags "a" without passing namespace? I mean, a2 > works, but could it be possible to make a1 working too?
> > regards > Piotr Furman From bkline at rksystems.com Thu Mar 19 15:51:37 2009 From: bkline at rksystems.com (Bob Kline) Date: Thu, 19 Mar 2009 10:51:37 -0400 Subject: [lxml-dev] Possible bug Message-ID: <49C25BF9.8020108@rksystems.com> Before I dig into the work of producing a repro case, would the lxml developers be interested in a bug report if I confirm that the XSL/T parser which comes with the lxml package chokes on the serialized version of an XML tree assembled by the lxml's HTML parser when the original HTML document contains a comment which the XML spec doesn't like (because "--" appears inside the comment)? -- Bob Kline http://www.rksystems.com mailto:bkline at rksystems.com From stefan_ml at behnel.de Thu Mar 19 16:10:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Mar 2009 16:10:42 +0100 (CET) Subject: [lxml-dev] Possible bug In-Reply-To: <49C25BF9.8020108@rksystems.com> References: <49C25BF9.8020108@rksystems.com> Message-ID: <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Bob Kline wrote: > Before I dig into the work of producing a repro case, would the lxml > developers be interested in a bug report if I confirm that the XSL/T > parser which comes with the lxml package chokes on the serialized > version of an XML tree assembled by the lxml's HTML parser when the > original HTML document contains a comment which the XML spec doesn't > like (because "--" appears inside the comment)?
So what you do is: 1) parse an HTML document that contains "--" in a comment 2) serialise it to XML, which produces broken XML because of the comment value You were not clear about the rest, but I guess it was not: 3) parse it using an XML parser which does not fail 4) pass it to XSLT(), which then fails to initialise but rather 3) return the serialised XML from a custom document resolver while running an XSLT right? As an under-informed guess, I would assume step 2) to be the problem here, in which case there is not much lxml can do about it. I also doubt that libxml2 can do much here, as the problem is that you are serialising an HTML tree into XML syntax without any intermediate semantic adaptation. A good way to work around this would be to let the HTML parser remove all comments on the way in by passing "remove_comments=True" - unless you really need them in the document. Some of the other parser options might also be of interest for your use case. You can also use lxml.html.clean to remove some other potentially harmful content from the HTML file before passing it into an XSLT. Stefan From bkline at rksystems.com Thu Mar 19 19:37:03 2009 From: bkline at rksystems.com (Bob Kline) Date: Thu, 19 Mar 2009 14:37:03 -0400 Subject: [lxml-dev] Possible bug In-Reply-To: <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <49C290CF.5060403@rksystems.com> Stefan Behnel wrote: > Bob Kline wrote: > >> Before I dig into the work of producing a repro case, would the lxml >> developers be interested in a bug report if I confirm that the XSL/T >> parser which comes with the lxml package chokes on the serialized >> version of an XML tree assembled by the lxml's HTML parser when the >> original HTML document contains a comment which the XML spec doesn't >> like (because "--" appears inside the comment)? 
>> > > So what you do is: > > 1) parse an HTML document that contains "--" in a comment > 2) serialise it to XML, which produces broken XML because of the comment > value > > You were not clear about the rest, but I guess it was not: > Hi, Stefan. Thanks for your reply. Sorry for not being sufficiently clear. Here's what I'm doing: reader = urllib2.urlopen(urlForHtmlPage) htmlPage = reader.read() tree = etree.HTML(htmlPage) xmlDocStrings = [''] for child in tree: if not rejectThisNode(child): xmlDocStrings.append(etree.tostring(child)) xmlDocStrings.append('') xmlDoc = "".join(xmlDocStrings) fp = open(nameOfXsltFile) transform = etree.XSLT(etree.parse(fp)) newDoc = transform(etree.XML(xmlDoc)) # blows up here > 3) parse it using an XML parser which does not fail > 4) pass it to XSLT(), which then fails to initialise > > but rather > > 3) return the serialised XML from a custom document resolver while running > an XSLT right? > No custom document resolver involved, as you can see from the code above. I'm beginning to think I came to the wrong conclusion when reading the following passage in the lxml documentation: HTML parsing is similarly simple. The parsers have a recover keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return something usable without raising an exception. I assumed a different meaning for "something usable" than the behavior of the software appears to justify, thinking that if the result was not a tree which would serialize itself back into well-formed XML an exception would be thrown. That's not how it works, though, is it? 
-- Bob Kline http://www.rksystems.com mailto:bkline at rksystems.com From stefan_ml at behnel.de Thu Mar 19 22:35:56 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Mar 2009 22:35:56 +0100 Subject: [lxml-dev] Possible bug In-Reply-To: <49C290CF.5060403@rksystems.com> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <49C290CF.5060403@rksystems.com> Message-ID: <49C2BABC.1050202@behnel.de> Hi, Bob Kline wrote: > Sorry for not being sufficiently clear. Here's what I'm doing: > > reader = urllib2.urlopen(urlForHtmlPage) > htmlPage = reader.read() > tree = etree.HTML(htmlPage) Use a custom parser here, as in parser = etree.HTMLParser(remove_comments=True, remove_pis=True) tree = etree.fromstring(htmlPage, parser) > xmlDocStrings = [''] > for child in tree: > if not rejectThisNode(child): > xmlDocStrings.append(etree.tostring(child)) > xmlDocStrings.append('') > xmlDoc = "".join(xmlDocStrings) You do not need to serialise here. It's perfectly ok if you do this: xmlDocStrings = tree.makeelement("NewDoc") for child in tree[:]: if not rejectThisNode(child): remove_ugly_content(child) xmlDocStrings.append(child) newDoc = transform(xmlDocStrings) > fp = open(nameOfXsltFile) > transform = etree.XSLT(etree.parse(fp)) > newDoc = transform(etree.XML(xmlDoc)) # blows up here If you split the last line in two, I would assume that it will fail on root = etree.XML(xmlDoc) and not on newDoc = transform(root) Removing unwanted content before running the transform will fix this. >> 3) parse it using an XML parser which does not fail >> 4) pass it to XSLT(), which then fails to initialise >> >> but rather >> >> 3) return the serialised XML from a custom document resolver while >> running an XSLT right? > > No custom document resolver involved, as you can see from the code > above. 
I'm beginning to think I came to the wrong conclusion when > reading the following passage in the lxml documentation: > > HTML parsing is similarly simple. The parsers have a recover keyword > argument that the HTMLParser sets by default. It lets libxml2 try > its best to return something usable without raising an exception. > > I assumed a different meaning for "something usable" than the behavior > of the software appears to justify, thinking that if the result was not > a tree which would serialize itself back into well-formed XML an > exception would be thrown. That's not how it works, though, is it? The "something usable" just means that while it may not succeed to parse all of a broken document 'correctly' (whatever that means in this context), it will always return a document that is as complete as possible. Without the "recover" option, you would get an exception when a parse error occurs. Stefan From bkline at rksystems.com Thu Mar 19 23:40:39 2009 From: bkline at rksystems.com (Bob Kline) Date: Thu, 19 Mar 2009 18:40:39 -0400 Subject: [lxml-dev] Possible bug In-Reply-To: <49C2BABC.1050202@behnel.de> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <49C290CF.5060403@rksystems.com> <49C2BABC.1050202@behnel.de> Message-ID: <49C2C9E7.80502@rksystems.com> Stefan Behnel wrote: > The "something usable" just means that while it may not succeed to parse > all of a broken document 'correctly' (whatever that means in this context), > it will always return a document that is as complete as possible. Right. It just won't necessarily be an XML document. I believe I'll be able to work around the problems (though not necessarily by just blowing away the comments altogether). It just wasn't clear to me based on the language in the documentation whether you'd be interested in a bug report based on my original interpretation of that language. 
I see now that the answer to that original question is "no." Thanks! -- Bob Kline http://www.rksystems.com mailto:bkline at rksystems.com From stefan_ml at behnel.de Fri Mar 20 08:23:48 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Mar 2009 08:23:48 +0100 Subject: [lxml-dev] Possible bug In-Reply-To: <49C2C9E7.80502@rksystems.com> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <49C290CF.5060403@rksystems.com> <49C2BABC.1050202@behnel.de> <49C2C9E7.80502@rksystems.com> Message-ID: <49C34484.5050001@behnel.de> Bob Kline wrote: > Stefan Behnel wrote: >> The "something usable" just means that while it may not succeed to parse >> all of a broken document 'correctly' (whatever that means in this >> context), >> it will always return a document that is as complete as possible. > > Right. It just won't necessarily be an XML document. At least, it will always be a tree structure. Your case is really exceptional here, as it produces non well-formed content. > I believe I'll be able to work around the problems (though not > necessarily by just blowing away the comments altogether). It just > wasn't clear to me based on the language in the documentation whether > you'd be interested in a bug report based on my original interpretation > of that language. I see now that the answer to that original question > is "no." It's not really "no". It's just that this is a rare case and there is little one can do about it. But there's always space for better documentation - contributions very welcome. 
Stefan From bkline at rksystems.com Fri Mar 20 15:11:11 2009 From: bkline at rksystems.com (Bob Kline) Date: Fri, 20 Mar 2009 10:11:11 -0400 Subject: [lxml-dev] Possible bug In-Reply-To: <49C34484.5050001@behnel.de> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <49C290CF.5060403@rksystems.com> <49C2BABC.1050202@behnel.de> <49C2C9E7.80502@rksystems.com> <49C34484.5050001@behnel.de> Message-ID: <49C3A3FF.1080905@rksystems.com> Stefan Behnel wrote: > It's not really "no". It's just that this is a rare case and there is > little one can do about it. Well, you could do what I'm doing: collapse sequences of two or more hyphens to single hyphens, and drop leading and trailing hyphens (or prefix leading hyphens with a space and append a space to a trailing hyphen) in the comment text. > But there's always space for better > documentation - contributions very welcome. > Excellent. How about (following the sentence quoted earlier in this thread, beginning "It lets libxml2 try its best to return something usable ...."): The result, when serialized with etree.tostring(), will often (but not always) be a well-formed XML document. -- Bob Kline http://www.rksystems.com mailto:bkline at rksystems.com From stefan_ml at behnel.de Fri Mar 20 15:55:47 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Mar 2009 15:55:47 +0100 (CET) Subject: [lxml-dev] Possible bug In-Reply-To: <49C3A3FF.1080905@rksystems.com> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <49C290CF.5060403@rksystems.com> <49C2BABC.1050202@behnel.de> <49C2C9E7.80502@rksystems.com> <49C34484.5050001@behnel.de> <49C3A3FF.1080905@rksystems.com> Message-ID: <33434.213.61.181.86.1237560947.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Bob Kline wrote: > Stefan Behnel wrote: >> It's not really "no". 
It's just that this is a rare case and there is >> little one can do about it. > > Well, you could do what I'm doing: collapse sequences of two or more > hyphens to single hyphens, and drop leading and trailing hyphens (or > prefix leading hyphens with a space and append a space to a trailing > hyphen) in the comment text. Hmmm, yes, I could imagine using a SAX function that wraps the comments callback in the HTML parser. But that would require a separate parser option, as it could break code. There are many HTML templating languages that use comments for all sorts of stuff, so if lxml starts preprocessing them, I imagine that there will be some rather unfriendly user comments. BTW, if performance is not your sine-qua-non priority here, you can write your own parser target that does the same thing in Python space. >> But there's always space for better >> documentation - contributions very welcome. > > Excellent. How about (following the sentence quoted earlier in this > thread, beginning "It lets libxml2 try its best to return something > usable ...."): > > The result, when serialized with etree.tostring(), will often (but > not always) be a well-formed XML document. I'll update it. 
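A rough sketch of the comment clean-up Bob describes, done in Python on the already parsed tree instead of inside the parser. The function name sanitize_comments is made up for this example, and the sample markup is hypothetical. It collapses runs of hyphens, pads a leading or trailing hyphen with a space, and leaves everything else alone.

```python
import re

from lxml import etree


def sanitize_comments(root):
    """Rewrite comment text in place so it no longer contains "--"."""
    for comment in root.iter(etree.Comment):
        text = comment.text or ""
        # Collapse two or more hyphens into a single one.
        text = re.sub(r"-{2,}", "-", text)
        # Pad hyphens at the edges so the serialised comment
        # neither starts nor ends with a hyphen.
        if text.startswith("-"):
            text = " " + text
        if text.endswith("-"):
            text = text + " "
        comment.text = text
    return root


if __name__ == "__main__":
    # Hypothetical input with a double hyphen inside a comment.
    page = "<html><body><!-- a -- b --><p>x</p></body></html>"
    tree = sanitize_comments(etree.HTML(page))
    print(etree.tostring(tree))
```

This only touches comments below the element it is given, and it changes the comment content, so it is not suitable for templates that carry meaning in their comments.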
Stefan From l at lrowe.co.uk Fri Mar 20 21:38:25 2009 From: l at lrowe.co.uk (Laurence Rowe) Date: Fri, 20 Mar 2009 21:38:25 +0100 Subject: [lxml-dev] Possible bug In-Reply-To: <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <49C25BF9.8020108@rksystems.com> <57542.213.61.181.86.1237475442.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: 2009/3/19 Stefan Behnel : > Bob Kline wrote: >> Before I dig into the work of producing a repro case, would the lxml >> developers be interested in a bug report if I confirm that the XSL/T >> parser which comes with the lxml package chokes on the serialized >> version of an XML tree assembled by the lxml's HTML parser when the >> original HTML document contains a comment which the XML spec doesn't >> like (because "--" appears inside the comment)? > > So what you do is: > > 1) parse an HTML document that contains "--" in a comment > 2) serialise it to XML, which produces broken XML because of the comment > value This is not necessary, libxslt is perfectly happy to work on trees parsed by the HTMLParser, e.g. >>> doc = etree.parse(html_file, parser=etree.HTMLParser()) >>> transform = etree.XSLT(etree.parse(transform_file)) >>> result = transform(doc) Laurence From lists at cheimes.de Sat Mar 21 00:29:11 2009 From: lists at cheimes.de (Christian Heimes) Date: Sat, 21 Mar 2009 00:29:11 +0100 Subject: [lxml-dev] lxml 2.2beta4 - release candidate for 2.2 final In-Reply-To: <49A8105E.8020608@behnel.de> References: <49A8105E.8020608@behnel.de> Message-ID: Stefan Behnel wrote: > Hi all, > > here is another almost-final version of lxml 2.2. Call it a release > candidate, if you prefer. This release was necessary as the changelog was > getting way too long, and the (crash-)bugs that were fixed in this release > were too important to wait. So, updating is recommended. Dear Stefan! Cython 0.11 was released a week ago. When can we expect lxml 2.2.0? 
Christian From stefan_ml at behnel.de Sat Mar 21 17:01:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 21 Mar 2009 17:01:01 +0100 Subject: [lxml-dev] lxml 2.2 released Message-ID: <49C50F3D.4060800@behnel.de> Hi all, I'm proud to announce the release of lxml 2.2 final. This is a major new, stable and mature release that takes over the stable 2.x release series. All previous 2.x releases are now officially out of maintenance. It includes a large number of bug fixes and improvements (see below for a complete changelog) that make lxml 2.2 a lot more robust than the previous 2.1 and older releases. It is therefore generally worth upgrading (and it should not be too hard to do that). You can get it from: http://codespeak.net/lxml/ http://pypi.python.org/pypi/lxml/2.2 This release was built with Cython 0.11 final. Have fun, Stefan 2.2 (2009-03-21) Features added * Support for standalone flag in XML declaration through tree.docinfo.standalone and by passing standalone=True/False on serialisation. Bugs fixed * Crash when parsing an XML Schema with external imports from a filename. 2.2beta4 (2009-02-27) Features added * Support strings and instantiable Element classes as child arguments to the constructor of custom Element classes. * GZip compression support for serialisation to files and file-like objects. Bugs fixed * Deep-copying an ElementTree copied neither its sibling PIs and comments nor its internal/external DTD subsets. * Soupparser failed on broken attributes without values. * Crash in XSLT when overwriting an already defined attribute using xsl:attribute. * Crash bug in exception handling code under Python 3. This was due to a problem in Cython, not lxml itself. * lxml.html.FormElement._name() failed for non top-level forms. * TAG special attribute in constructor of custom Element classes was evaluated incorrectly. Other changes * Official support for Python 3.0.1. 
* Element.findtext() now returns an empty string instead of None for Elements without text content. 2.2beta3 (2009-02-17) Features added * XSLT.strparam() class method to wrap quoted string parameters that require escaping. Bugs fixed * Memory leak in XPath evaluators. * Crash when parsing indented XML in one thread and merging it with other documents parsed in another thread. * Setting the base attribute in lxml.objectify from a unicode string failed. * Fixes following changes in Python 3.0.1. * Minor fixes for Python 3. Other changes * The global error log (which is copied into the exception log) is now local to a thread, which fixes some race conditions. * More robust error handling on serialisation. 2.2beta2 (2009-01-25) Bugs fixed * Potential memory leak on exception handling. This was due to a problem in Cython, not lxml itself. * iter_links (and related link-rewriting functions) in lxml.html would interpret CSS like url("link") incorrectly (treating the quotation marks as part of the link). * Failing import on systems that have an io module. 2.2beta1 (2008-12-12) Features added * Allow lxml.html.diff.htmldiff to accept Element objects, not just HTML strings. Bugs fixed * Crash when using an XPath evaluator in multiple threads. * Fixed missing whitespace before Link:... in lxml.html.diff. Other changes * Export lxml.html.parse. 2.2alpha1 (2008-11-23) Features added * Support for XSLT result tree fragments in XPath/XSLT extension functions. * QName objects have new properties namespace and localname. * New options for exclusive C14N and C14N without comments. * Instantiating a custom Element classes creates a new Element. Bugs fixed * XSLT didn't inherit the parse options of the input document. * 0-bytes could slip through the API when used inside of Unicode strings. * With lxml.html.clean.autolink, links with balanced parenthesis, that end in a parenthesis, will be linked in their entirety (typical with Wikipedia links). 
From friedel at translate.org.za Tue Mar 24 11:39:48 2009 From: friedel at translate.org.za (F Wolff) Date: Tue, 24 Mar 2009 12:39:48 +0200 Subject: [lxml-dev] Spacing and the presence of xml:space="preserve" Message-ID: <1237891188.27279.19.camel@localhost> Hello all, We are currently using this expression to obtain a plain text version inside a node. For example: >>> from lxml import etree >>> string_xpath = etree.XPath("string()") >>> string_xpath(etree.fromstring(" asdf fdsa ")) ' asdf fdsa ' This works great and returns the string assuming xml:space="preserve", in other words, spacing is taken verbatim. We work on a file format where some of the spacing is very important (XLIFF). We generate such files with xml:space="preserve" in the necessary places. Not everybody generates such files, unfortunately, so we need to also handle the normalised versions. If I instead use the XPath function "normalize-space()", I can get the normalised spacing: 'asdf fdsa' but unfortunately it does this even if xml:space="preserve" is set: >>> string_xpath = etree.XPath("normalize-space()") >>> string_xpath(etree.fromstring(''' asdf fdsa ''')) 'asdf fdsa' Unfortunately, I don't see a way to get the correct version (normalised by default, but with white-space preserved if xml:space="preserve" is set). Do I have to handle the cases separately, or is there a way for lxml to help me by just doing the right thing? I could special case on the node, but it would be a bit harder to know if some xml:space directive was given higher up in the tree. Or am I missing something in XPath / lxml? Any help would be appreciated.
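For illustration, the "handle the cases separately" option could be sketched in plain Python as below. This is only one possible approach, not something lxml does by itself; the helper names effective_space and node_text and the sample document are invented for the example. It walks up the tree to find the nearest xml:space setting and picks string() or normalize-space() accordingly.

```python
from lxml import etree

# Qualified name of the xml:space attribute as lxml exposes it.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

string_xpath = etree.XPath("string()")
normalize_xpath = etree.XPath("normalize-space()")


def effective_space(node):
    """Return 'preserve' or 'default' from the nearest xml:space setting."""
    current = node
    while current is not None:
        value = current.get(XML_SPACE)
        if value in ("preserve", "default"):
            return value
        current = current.getparent()
    # No xml:space anywhere up the tree.
    return "default"


def node_text(node):
    """Verbatim text under xml:space="preserve", normalised otherwise."""
    if effective_space(node) == "preserve":
        return string_xpath(node)
    return normalize_xpath(node)


if __name__ == "__main__":
    # Hypothetical document: one subtree preserves spacing, one does not.
    doc = etree.fromstring(
        '<root><a xml:space="preserve"><b>  x   y </b></a>'
        "<c>  x   y </c></root>"
    )
    print(repr(node_text(doc[0][0])))
    print(repr(node_text(doc[1])))
```

The lookup costs one walk to the root per call, which may matter on deep trees. Note also that it applies a single setting to the whole string value of the node, so a nested element that switches xml:space back to "default" is not treated separately.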
Friedel Wolff -- Recently on my blog: http://translate.org.za/blogs/friedel/en/content/video-virtaals-functionality From stefan_ml at behnel.de Tue Mar 24 16:32:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Mar 2009 16:32:01 +0100 (CET) Subject: [lxml-dev] Spacing and the presence of xml:space="preserve" In-Reply-To: <1237891188.27279.19.camel@localhost> References: <1237891188.27279.19.camel@localhost> Message-ID: <58cde709a2ed2232a684e23bcc84e7d8.squirrel@groupware.dvs.informatik.tu-darmstadt.de> F Wolff wrote: > We are currently using this expression to obtain a plain text version > inside a node: > > For example: > >>> from lxml import etree > >>> string_xpath = etree.XPath("string()") > >>> string_xpath(etree.fromstring(" asdf fdsa ")) > ' asdf fdsa ' > > This works great and returns the string assuming xml:space="preserve", > in other words, spacing is taken verbatim. We work on a file format > where some of the spacing is very important (XLIFF). We generate such > files with xml:space="preserve" in the necessary places. Not everybody > generates such files, unfortunately, so we need to also handle the > normalised versions. If I rather use the XPath function > "normalize-space()", I can get the normalised spacing: > 'asdf fdsa' > > but unfortunately it does this even if xml:space="preserve" is set: > > >>> string_xpath = etree.XPath("normalize-space()") > >>> string_xpath(etree.fromstring(''' asdf fdsa ''')) > 'asdf fdsa' > > > Unfortunately, I don't see a way to get the correct version (normalised > by default, but with white-space preserved if xml:space="preserve" is > set). Do I have to handle the cases separately, or is there a way for > lxml to help me by just doing the right thing? I could special case on > the node, but it would be a bit harder to know if some xml:space > directive was given higher up in the tree. Here is what the XPath 1.0 spec says about normalize-space(): """ Function: string normalize-space(string?)
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node. """ So there is no reference to "xml:space" that would dictate a specific behaviour, neither for the context node nor for subtrees. But have you considered writing the required logic in XSLT instead of plain XPath or Python? The "mode" attribute on XSLT's templates should give you all that's needed here, and you'll still end up with a callable that returns a string (built entirely in C space), just a bit smarter this time. If you do this, please post the stylesheet. I think this might be interesting to others, too. Stefan From foolistbar at googlemail.com Tue Mar 24 23:42:08 2009 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Tue, 24 Mar 2009 22:42:08 +0000 Subject: [lxml-dev] html5lib tree builder in lxml 2.2 Message-ID: Hi, Getting around to actually looking at lxml 2.2's html5lib support, I note that it has its own treebuilder: I presume there was some reason (bugs?) that html5lib's own wasn't used. Would it be possible to get a patch for html5lib that would fix these issues (this'll need to be under the MIT license)? -- Geoffrey Sneddon From stefan_ml at behnel.de Wed Mar 25 22:44:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 25 Mar 2009 22:44:36 +0100 Subject: [lxml-dev] html5lib tree builder in lxml 2.2 In-Reply-To: References: Message-ID: <49CAA5C4.3010202@behnel.de> Geoffrey Sneddon wrote: > Getting around to actually looking at lxml 2.2's html5lib support, I > note that it has its own treebuilder: I presume there was some reason > (bugs?) that html5lib's own wasn't used. 
Armin Ronacher wrote that part, so he should know: http://comments.gmane.org/gmane.comp.python.lxml.devel/3848?set_lines=100000 It uses a subclass of html5lib's TreeBuilder, so it's not a rewrite or something in that order. > Would it be possible to get a > patch for html5lib that would fix these issues (this'll need to be under > the MIT license)? It's mainly about stuff that ET doesn't support, such as the DOCTYPE, or top-level comments. I don't know if the html5lib project is interested in that, but it shouldn't be too hard to add some conditional lxml specifics to their code. Stefan From jbarrios at technorati.com Thu Mar 26 17:45:11 2009 From: jbarrios at technorati.com (Jorge Barrios) Date: Thu, 26 Mar 2009 09:45:11 -0700 Subject: [lxml-dev] test failures, lxml-2.2 Message-ID: <74BA155D-B944-4C6F-A868-73CF764BC523@technorati.com> I am hoping that there's an easy fix. I couldn't find anything about it in the mailing list. Anyone know what the problem might be? Jorge root at a0110:/data3/lxml-2.2# uname -a Linux a0110 2.6.18-8.el5PAE #1 SMP Thu Mar 15 20:29:51 EDT 2007 i686 i686 i386 GNU/Linux root at a0110:/data3/lxml-2.2# make test python setup.py build_ext -i Building lxml version 2.2. Building with Cython 0.11. 
Using build configuration of libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /usr/ local/lib running build_ext python test.py -p -v TESTED VERSION: 2.2.0 Python: (2, 5, 2, 'final', 0) lxml.etree: (2, 2, 0, 0) libxml used: (2, 7, 3) libxml compiled: (2, 7, 3) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) 1010/1010 (100.0%): Doctest: xpathxslt.txt ---------------------------------------------------------------------- Ran 1010 tests in 29.973s OK PYTHONPATH=src: python selftest.py ********************************************************************** File "/var/data3/lxml-2.2/selftest.py", line 575, in selftest.encoding Failed example: serialize(elem) Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest.py", line 577, in selftest.encoding Failed example: serialize(elem, encoding="utf-8") Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest.py", line 579, in selftest.encoding Failed example: serialize(elem, encoding="us-ascii") Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest.py", line 581, in selftest.encoding Failed example: serialize(elem, encoding="iso-8859-1").lower() Expected: '\n' Got: '\n' ********************************************************************** 1 items had failures: 4 of 29 in selftest.encoding ***Test Failed*** 4 failures. 181 tests ok. 
PYTHONPATH=src: python selftest2.py ********************************************************************** File "/var/data3/lxml-2.2/selftest2.py", line 196, in selftest2.encoding Failed example: serialize(elem) Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest2.py", line 198, in selftest2.encoding Failed example: serialize(elem, "utf-8") Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest2.py", line 200, in selftest2.encoding Failed example: serialize(elem, "us-ascii") Expected: '' Got: '' ********************************************************************** File "/var/data3/lxml-2.2/selftest2.py", line 202, in selftest2.encoding Failed example: serialize(elem, "iso-8859-1").lower() Expected: '\n' Got: '\n' ********************************************************************** 1 items had failures: 4 of 29 in selftest2.encoding ***Test Failed*** 4 failures. 102 tests ok. root at a0110:/data3/lxml-2.2# From shigin at rambler-co.ru Thu Mar 26 18:38:32 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Thu, 26 Mar 2009 20:38:32 +0300 Subject: [lxml-dev] test failures, lxml-2.2 In-Reply-To: <74BA155D-B944-4C6F-A868-73CF764BC523@technorati.com> References: <74BA155D-B944-4C6F-A868-73CF764BC523@technorati.com> Message-ID: <1238089112.21767.1846.camel@atlas> On Thu, 26/03/2009 at 09:45 -0700, Jorge Barrios wrote: > I am hoping that there's an easy fix. I couldn't find anything about > it in the mailing list. Anyone know what the problem might be? > PYTHONPATH=src: python selftest.py > ********************************************************************** > File "/var/data3/lxml-2.2/selftest.py", line 575, in selftest.encoding > Failed example: > serialize(elem) > Expected: > '' > Got: > '' > ********************************************************************** It doesn't look like a problem.
It seems libxml2 now serializes these characters using the hexadecimal form of character references. The deserialized value would be the same. I don't know if libxml2 has a switch to restore the old behaviour, sorry. In [7]: etree.fromstring('').attrib['key'] Out[7]: u'\xe5\xf6\xf6<>' In [8]: etree.fromstring('').attrib['key'] Out[8]: u'\xe5\xf6\xf6<>' In [9]: k1 = etree.fromstring('').attrib['key'] In [10]: k2 = etree.fromstring('').attrib['key'] In [11]: k1 == k2 Out[11]: True From stefan_ml at behnel.de Fri Mar 27 11:34:24 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Mar 2009 11:34:24 +0100 (CET) Subject: [lxml-dev] [Question #65510]: How to set libxml:XML_PARSE_HUGE-option in lxml? In-Reply-To: <20090327081619.27850.74693.launchpad@palladium.canonical.com> References: <20090327081619.27850.74693.launchpad@palladium.canonical.com> Message-ID: <8a6cd6ae98d473bc5b245c6822c8fbbf.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, forwarding this to the mailing list where it belongs. bol wrote: > in libxml2 changelog: > > Daniel Veillard Sun Jan 18 15:06:05 CET > > * include/libxml/parserInternals.h SAX2.c: add a new define > XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, > the default is 10MB and can be removed with the HUGE parsing option Yes, this was changed in libxml2 2.7.x. > So how can I set this HUGE option in lxml? You currently can't, and I wonder if this should really be an option and what the default should be here. > At the moment I use a version of libxml2 that I modified myself (with > XML_MAX_TEXT_LENGTH set to 100MB), which solves my problem. But I hope to > find an lxml way to solve this. Could you say something about your use case? Stefan From stefan_ml at behnel.de Fri Mar 27 12:56:38 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Mar 2009 12:56:38 +0100 (CET) Subject: [lxml-dev] [Question #65510]: How to set libxml:XML_PARSE_HUGE-option in lxml?
In-Reply-To: <20090327105339.27850.32356.launchpad@palladium.canonical.com> References: <8a6cd6ae98d473bc5b245c6822c8fbbf.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090327105339.27850.32356.launchpad@palladium.canonical.com> Message-ID: bol wrote: > we are working with large text corpora (bigger than 10mb). > Lxml is used for splitting this corpora-xml-files and run via sockets (old > non-xml-using) binaries for i.e. pos-tagging or tokenizing. Ok, that gives you a) the bit of structure that you need and b) safe and portable encoding support (which I assume is critical here), so that's fine with me. After all, XML is used for all sorts of things these days... > The option XML_PARSE_HUGE should be as in libxml default off. That's what I was wondering about. It's (sort of) on by default if you use libxml2 2.6.x and 2.7.[012], but it's supposed to be off by default if you use libxml2 2.7.3 and later. That's outside of the control of lxml. So you would get one behaviour on one system and a different behaviour on another system, even with the same version of lxml. However, this is meant as a security measure to prevent traps like the billion laughs attack. Therefore, I do understand that a) most people won't notice and b) having it on by default seems like the right setting. Is there any opposition to keeping the enforced parser restrictions (limited tree depth and text node length) enabled by default in newer libxml2 versions, and to provide a parser switch for disabling them? The alternative would be to disable them by default on all libxml2 versions, and to provide a switch that enables them if libxml2 supports it. But a safe default sounds a lot better. Stefan From d.rothe at semantics.de Fri Mar 27 14:10:33 2009 From: d.rothe at semantics.de (Dirk Rothe) Date: Fri, 27 Mar 2009 14:10:33 +0100 Subject: [lxml-dev] [Question #65510]: How to set libxml:XML_PARSE_HUGE-option in lxml? 
In-Reply-To: References: <8a6cd6ae98d473bc5b245c6822c8fbbf.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090327105339.27850.32356.launchpad@palladium.canonical.com> Message-ID: > Is there any opposition to keeping the enforced parser restrictions > (limited tree depth and text node length) enabled by default in newer > libxml2 versions, and to provide a parser switch for disabling them? The > alternative would be to disable them by default on all libxml2 versions, > and to provide a switch that enables them if libxml2 supports it. But a > safe default sounds a lot better. I would go for the safe defaults, and document it properly ;) From bolsog at users.sf.net Fri Mar 27 15:06:44 2009 From: bolsog at users.sf.net (Andreas) Date: Fri, 27 Mar 2009 15:06:44 +0100 Subject: [lxml-dev] Version in trunk (svn:63387) doesn't compile on Fedora 9.x86_64 / Fedora 10.x86_64 Message-ID: Hi, the version in trunk seems broken for Fedora 9/10. $ uname -a Linux rechner1 2.6.27.19-78.2.30.fc9.x86_64 #1 SMP Tue Feb 24 19:44:45 EST 2009 x86_64 x86_64 x86_64 GNU/Linux and $ uname -a Linux rechner2 2.6.27.7-134.fc10.x86_64 #1 SMP Mon Dec 1 22:21:35 EST 2008 x86_64 x86_64 x86_64 GNU/Linux libxml2 2.7.3 Python 2.5.1 ---- snip ---- python setup.py build_ext -i Building lxml version 2.2-63198. Building with Cython 0.11. 
Using build configuration of libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /usr/lib64 running build_ext building 'lxml.etree' extension creating build creating build/temp.linux-x86_64-2.5 creating build/temp.linux-x86_64-2.5/src creating build/temp.linux-x86_64-2.5/src/lxml gcc -pthread -fno-strict-aliasing -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fPIC -I/usr/include/libxml2 -I/usr/include/python2.5 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.5/src/lxml/lxml.etree.o -w src/lxml/lxml.etree.c:3299: error: expected '=', ',', ';', 'asm' or '__attribute__' before '__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER' src/lxml/lxml.etree.c:3600: error: expected declaration specifiers or '...' before 'xsltLoadType' src/lxml/lxml.etree.c:3601: error: expected declaration specifiers or '...' before 'xsltLoadType' src/lxml/lxml.etree.c:110035: error: expected declaration specifiers or '...' before 'xsltLoadType' src/lxml/lxml.etree.c: In function '__pyx_f_4lxml_5etree__xslt_store_resolver_exception': src/lxml/lxml.etree.c:110095: error: '__pyx_v_c_type' undeclared (first use in this function) src/lxml/lxml.etree.c:110095: error: (Each undeclared identifier is reported only once src/lxml/lxml.etree.c:110095: error: for each function it appears in.) src/lxml/lxml.etree.c:110095: error: 'XSLT_LOAD_DOCUMENT' undeclared (first use in this function) src/lxml/lxml.etree.c: At top level: src/lxml/lxml.etree.c:110242: error: expected declaration specifiers or '...' before 'xsltLoadType' src/lxml/lxml.etree.c: In function '__pyx_f_4lxml_5etree__xslt_doc_loader': src/lxml/lxml.etree.c:110257: error: '__pyx_v_c_type' undeclared (first use in this function) src/lxml/lxml.etree.c:110257: error: 'XSLT_LOAD_DOCUMENT' undeclared (first use in this function) src/lxml/lxml.etree.c:110280: error: 'XSLT_LOAD_STYLESHEET'
undeclared (first use in this function) src/lxml/lxml.etree.c:110385: error: too many arguments to function '__pyx_f_4lxml_5etree__xslt_store_resolver_exception' src/lxml/lxml.etree.c: In function '__pyx_pf_4lxml_5etree_4XSLT___call__': src/lxml/lxml.etree.c:113230: error: 'xsltTransformContext' has no member named 'dict' src/lxml/lxml.etree.c:113241: error: 'xsltTransformContext' has no member named 'dict' src/lxml/lxml.etree.c:113253: error: 'xsltTransformContext' has no member named 'dict' src/lxml/lxml.etree.c:113262: error: 'xsltTransformContext' has no member named 'dict' src/lxml/lxml.etree.c: In function 'initetree': src/lxml/lxml.etree.c:147853: error: '__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER' undeclared (first use in this function) src/lxml/lxml.etree.c:147853: error: 'xsltDocDefaultLoader' undeclared (first use in this function) error: command 'gcc' failed with exit status 1 make: *** [inplace] Error 1 ---- snip --- Regards Andreas From stefan_ml at behnel.de Fri Mar 27 15:51:37 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Mar 2009 15:51:37 +0100 (CET) Subject: [lxml-dev] Version in trunk (svn:63387) doesn't compile on Fedora 9.x86_64 / Fedora 10.x86_64 In-Reply-To: References: Message-ID: <0d74b926ba41dd74e8776785b0d9d982.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, thanks for the report. Andreas wrote: > the version in trunk seems broken for Fedora 9/10. Note that the trunk version is almost exactly what was released as lxml 2.2. > libxml2 2.7.3 > Python 2.5.1 > [...] > Building with Cython 0.11. > Using build configuration of libxslt 1.1.24 > Building against libxml2/libxslt in the following directory: /usr/lib64 > [...] > src/lxml/lxml.etree.c:3299: error: expected '=', ',', ';', 'asm' or > '__attribute__' before '__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER' Line 3299 appears to be the first line that uses names from libxslt, so my guess is that something is wrong with your libxslt setup.
My first guess usually is that the libxslt-dev(el) package is missing, but you appear to have libxslt 1.1.24 installed, and lxml seems to find the "xslt-config" script which usually comes with libxslt-dev, so I'm a bit puzzled here. Maybe your default include path lacks /usr/lib64 (where libxslt/*.h gets installed)? Are you building this with 64bit Python and libraries? Stefan From stefan_ml at behnel.de Fri Mar 27 22:12:20 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Mar 2009 22:12:20 +0100 Subject: [lxml-dev] [Question #65510]: How to set libxml:XML_PARSE_HUGE-option in lxml? In-Reply-To: References: <8a6cd6ae98d473bc5b245c6822c8fbbf.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20090327105339.27850.32356.launchpad@palladium.canonical.com> Message-ID: <49CD4134.2020307@behnel.de> Dirk Rothe wrote: >> Is there any opposition to keeping the enforced parser restrictions >> (limited tree depth and text node length) enabled by default in newer >> libxml2 versions, and to provide a parser switch for disabling them? The >> alternative would be to disable them by default on all libxml2 versions, >> and to provide a switch that enables them if libxml2 supports it. But a >> safe default sounds a lot better. > > I would go for the safe defaults, and document it properly ;) There's a "huge_tree" option for now, defaulting to False. Let's see if it works out that way. https://codespeak.net/viewvc/?view=rev&revision=63399 Stefan From foolistbar at googlemail.com Sat Mar 28 21:09:36 2009 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sat, 28 Mar 2009 20:09:36 +0000 Subject: [lxml-dev] html5lib tree builder in lxml 2.2 In-Reply-To: <49CAA5C4.3010202@behnel.de> References: <49CAA5C4.3010202@behnel.de> Message-ID: <4A3F72E4-A200-4D00-AA99-C9A3F9FFB8BF@googlemail.com> On 25 Mar 2009, at 21:44, Stefan Behnel wrote: >> Would it be possible to get a >> patch for html5lib that would fix these issues (this'll need to be >> under >> the MIT license)? 
> > It's mainly about stuff that ET doesn't support, such as the > DOCTYPE, or > top-level comments. I don't know if the html5lib project is > interested in > that, but it shouldn't be too hard to add some conditional lxml > specifics > to their code. There is already a whole separate lxml treebuilder in html5lib. I'm in part wondering why that wasn't used verbatim, and if there are any issues with it fixed in lxml 2.2's treebuilder that a patch be made available under licensing terms acceptable to html5lib (I'd probably look more closely to see quite what was changed if I could actually copy changes safely with the licensing being such). -- Geoffrey Sneddon From stefan_ml at behnel.de Sun Mar 29 07:27:08 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 29 Mar 2009 07:27:08 +0200 Subject: [lxml-dev] html5lib tree builder in lxml 2.2 In-Reply-To: <4A3F72E4-A200-4D00-AA99-C9A3F9FFB8BF@googlemail.com> References: <49CAA5C4.3010202@behnel.de> <4A3F72E4-A200-4D00-AA99-C9A3F9FFB8BF@googlemail.com> Message-ID: <49CF06AC.1050103@behnel.de> Geoffrey Sneddon wrote: > On 25 Mar 2009, at 21:44, Stefan Behnel wrote: > >>> Would it be possible to get a >>> patch for html5lib that would fix these issues (this'll need to be under >>> the MIT license)? >> >> It's mainly about stuff that ET doesn't support, such as the DOCTYPE, or >> top-level comments. I don't know if the html5lib project is interested in >> that, but it shouldn't be too hard to add some conditional lxml specifics >> to their code. > > There is already a whole separate lxml treebuilder in html5lib. Ah, interesting. I assume it just wasn't there at the time. > I'm in > part wondering why that wasn't used verbatim, and if there are any > issues with it fixed in lxml 2.2's treebuilder that a patch be made > available under licensing terms acceptable to html5lib (I'd probably > look more closely to see quite what was changed if I could actually copy > changes safely with the licensing being such). 
Come on, lxml is BSD licensed. If html5lib is MIT licensed, I doubt that anyone would be mad enough to put hope into suing you if you edit a file in one while taking a glimpse at the other. From a quick look, it actually seems like the "etree_lxml" tree builder in html5lib has learned from the one in lxml.html already. So, please give it a review if you can. I wouldn't mind simply importing the html5lib one in lxml.html.html5parser if it's available. Stefan
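A minimal sketch of what such a conditional import could look like, assuming html5lib's treebuilders.getTreeBuilder("lxml") is the entry point for its lxml tree builder. The function name find_html5lib_lxml_builder() is hypothetical, and this is not how lxml.html.html5parser is actually written.

```python
def find_html5lib_lxml_builder():
    """Return html5lib's lxml tree builder class, or None if unavailable."""
    try:
        # Both html5lib and its lxml-backed builder are optional here.
        from html5lib import treebuilders
        return treebuilders.getTreeBuilder("lxml")
    except ImportError:
        return None


builder = find_html5lib_lxml_builder()
if builder is None:
    # Hypothetical fallback point: keep using a locally defined builder.
    pass
```

Returning None instead of raising lets the caller decide whether to fall back to its own builder or to report that html5lib support is missing.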