From jkrukoff at ltgc.com Tue Aug 1 05:13:15 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Mon, 31 Jul 2006 21:13:15 -0600 Subject: [lxml-dev] Copying an ElementTree doesn't work. Message-ID: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com> Can someone explain to me why when an ElementTree is copied, it's root element isn't copied? >>> import lxml.etree as etree >>> import copy >>> root = etree.XML( '' ) >>> tree = copy.copy( etree.ElementTree( root ) ) >>> tree.getroot( ) is None True I get the same behaviour with deepcopy as well. Am I just supposed to always be using Element s and not ElementTree s? I'm running lxml 1.0.2 on Python 2.4.3, if that matters. From jkrukoff at ltgc.com Tue Aug 1 05:33:52 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Mon, 31 Jul 2006 21:33:52 -0600 Subject: [lxml-dev] Segfault in lxml during element copy Message-ID: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> I've been working on an XML based middleware system written in python and lxml, and I've started experiencing a segfault problem with lxml just as it's being rolled out to the rest of the team. Embarrassing, you know? It looks like a double free problem, as the crash is always acompanied by a glibc message that looks like this: *** glibc detected *** free(): invalid pointer: 0x0813e1a4 *** I've tried to come up with a stripped down test case to repeat the problem, but have been unable to reproduce it except in the full application. It's not absolutely consistent, I'll have to run the same request 3 or 4 times before it crashes, but it always does, even while generating identical output from identical input for those 3 or 4 calls. I've tracked down the line it crashes at, and it's a simple copy called on an XML element: copied = copy.copy( element ) If I remove it, and operate on the source xml directly instead of copying it (it's really just a safety mechanism), it still crashes, just in more random locations. 
I'm running lxml 1.0.2, on Python 2.4.3, with libxml2 2.6.26 and libxslt 1.1.17 if it matters. The problem is reproducible on a coworkers machine, also running lxml 1.0.2 with slightly different minor revisions of the xml libraries. From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 07:25:29 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 07:25:29 +0200 Subject: [lxml-dev] Copying an ElementTree doesn't work. In-Reply-To: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com> References: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com> Message-ID: <44CEE5C9.7020100@gkec.informatik.tu-darmstadt.de> Hi John, John Krukoff wrote: > Can someone explain to me why when an ElementTree is copied, it's root > element isn't copied? > >>>> import lxml.etree as etree >>>> import copy >>>> root = etree.XML( '' ) >>>> tree = copy.copy( etree.ElementTree( root ) ) >>>> tree.getroot( ) is None > True > > I get the same behaviour with deepcopy as well. Am I just supposed to > always be using Element s and not ElementTree s? I'm running lxml > 1.0.2 on Python 2.4.3, if that matters. Copying ElementTrees is not currently implemented. The only reason to do it would be to avoid problems when people use it, there is no real gain. I do not even see why you would want to copy an ElementTree. As ElementTrees are immutable, the above is not different from this: tree = etree.ElementTree(root) I'll add __copy__ and __deepcopy__, though, so that the above problem will disappear. So, thanks for reporting this. 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 07:56:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 07:56:20 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> Message-ID: <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> Hi John, John Krukoff wrote: > I've been working on an XML based middleware system written in python > and lxml, and I've started experiencing a segfault problem with lxml > just as it's being rolled out to the rest of the team. Embarrassing, > you know? Sorry for that. > It looks like a double free problem, as the crash is always acompanied > by a glibc message that looks like this: > *** glibc detected *** free(): invalid pointer: 0x0813e1a4 *** *May* be a double free problem, yes. > I've tried to come up with a stripped down test case to repeat the > problem, but have been unable to reproduce it except in the full > application. It's not absolutely consistent, I'll have to run the same > request 3 or 4 times before it crashes, but it always does, even while > generating identical output from identical input for those 3 or 4 calls. > > I've tracked down the line it crashes at, and it's a simple copy > called on an XML element: > copied = copy.copy( element ) ?? You mean, you get the above error ('free(): invalid pointer') when you call this? Then I have no idea where that bug could come from. At least, it can't really be copy() that triggers it... BTW, in lxml, copy() is the same as deepcopy(). Read doc/compatibility.txt on this. > If I remove it, and operate on the source xml directly instead of > copying it (it's really just a safety mechanism), it still crashes, > just in more random locations. That's likely, yes. Looks like your XML tree became corrupted in some way, so when the broken part of it is accessed, it crashes. 
> I'm running lxml 1.0.2, on Python 2.4.3, with libxml2 2.6.26 and > libxslt 1.1.17 if it matters. The problem is reproducible on a > coworkers machine, also running lxml 1.0.2 with slightly different > minor revisions of the xml libraries. Ok. Thanks for reporting this. We had a report before about lxml crashing in certain bizarre and difficult to reproduce situations, so maybe this is the same bug. Given the information above, it would be really hard for us to try to reproduce the bug, so if you want to help, I can only ask you to try to strip down your program to a relevant portion that allows us to actually see the bug ourselves. Otherwise it will be near impossible to fix it. Thanks for reporting this, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 09:36:55 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 09:36:55 +0200 Subject: [lxml-dev] An intriguing behaviour of xpath in lxml In-Reply-To: References: Message-ID: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> Hi Agustin, Agust?n Villena wrote: > I already know that xpath(".") in the document node works, but is > beyond my understanding why xpath("/") is not implemented. Well, what would you expect it to return? The XPath spec says: """ / selects the document root (which is always the parent of the document element) """ The document element is returned by "/*", so it's the root element of the document in ElementTree. The "document root" itself is not available in the tree model provided by lxml. It /could/ be a possibility to deliberately diverge from the spec here and return the root element instead. So, maybe you can enlighten us with your use case, so that we can decide what implementation would fit here. 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 10:01:42 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 10:01:42 +0200 Subject: [lxml-dev] Copying an ElementTree doesn't work. In-Reply-To: <20060801014422.6bks9ygcu8gw0wkw@webmail.ltgc.com> References: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com> <44CEE5C9.7020100@gkec.informatik.tu-darmstadt.de> <20060801014422.6bks9ygcu8gw0wkw@webmail.ltgc.com> Message-ID: <44CF0A66.20203@gkec.informatik.tu-darmstadt.de> Hi John, John Krukoff wrote: > Quoting Stefan Behnel : >> John Krukoff wrote: >>> Can someone explain to me why when an ElementTree is copied, it's root >>> element isn't copied? >>> >>>>>> import lxml.etree as etree >>>>>> import copy >>>>>> root = etree.XML( '' ) >>>>>> tree = copy.copy( etree.ElementTree( root ) ) >>>>>> tree.getroot( ) is None >>> True >> >> As ElementTrees are immutable, the above is not different from this: >> >> tree = etree.ElementTree(root) >> >> I'll add __copy__ and __deepcopy__, though, so that the above problem >> will disappear. So, thanks for reporting this. > > For what it's worth, the use case is that I have an element tree that I > want to copy multiple times, before performing destructive changes to > the copies. Currently, copying the contents of an element tree to > another element tree is kind of clunky: > >>>> original = etree.ElementTree( etree.XML( '' ) ) >>>> copied = etree.ElementTree( copy.copy( original.getroot( ) ) ) > > which is why I was asking if the expected use is to always pass around > elements and wrap them with element trees only when it was convient to > use the element tree methods (XSLT being what I'm interested in). > > So, thanks, the fix will make this look a little less ugly. Ok, sure. Just for code clarity, you might still want to use deepcopy() instead of copy(), not everybody is necessarily aware of the fact that lxml implements them the same way. 
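A minimal sketch of the copy-then-modify pattern described above: deepcopy() the root element and wrap the result in a new ElementTree. The sample document is hypothetical, since the XML in the original messages was lost. The fallback import to the standard library's xml.etree.ElementTree is an assumption added only so the snippet also runs where lxml is not installed.

```python
import copy

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical document; the XML in the original message was stripped.
original = etree.ElementTree(etree.XML("<root><item>data</item></root>"))

# Deep-copy the root element, then wrap the copy in a fresh ElementTree.
copied = etree.ElementTree(copy.deepcopy(original.getroot()))

# A destructive change on the copy leaves the original untouched.
copied_root = copied.getroot()
copied_root.remove(copied_root[0])

print(len(original.getroot()), len(copied.getroot()))
```

Spelling out deepcopy() makes the intent visible to readers who do not know that lxml treats copy() and deepcopy() of an element the same way.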
Note also that copying an ElementTree actually now produces a shallow copy
of the ElementTree. The XML tree is not touched in this case.

Here is the patch, BTW, in case you want to apply it yourself. It will be in
lxml 1.0.3 and 1.1, which are expected not too late this month.

Stefan


Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (Revision 30633)
+++ src/lxml/etree.pyx (Arbeitskopie)
@@ -395,6 +395,15 @@
         """
         return self._context_node
 
+    def __copy__(self):
+        return ElementTree(self._context_node)
+
+    def __deepcopy__(self, memo):
+        if self._context_node is None:
+            return ElementTree()
+        else:
+            return ElementTree( self._context_node.__copy__() )
+
     property docinfo:
         """Information about the document provided by parser and DTD.
         This value is only defined for ElementTree objects based on the root node

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 10:17:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 10:17:24 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <20060801015204.2k8stoit344kwwww@webmail.ltgc.com>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com>
Message-ID: <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de>

Hi John,

John Krukoff wrote:
> Thanks for the response. Yeah, I know just how vague an error report
> this is. I was really hoping I was hitting something that someone else
> had already encountered. I've already wasted a day trying to strip the
> program down to just the lxml operations, and haven't been able to come
> up with a reduced set of the program that still causes the crash.

Try to think about the main treatments you apply to trees. Do you move
elements between trees? What happens to the source tree? Does the crash go
away if you keep a reference to it?
(maybe in a set or list) Do you keep cyclic references between objects that reference elements, i.e. is the Python cyclic garbage collector involved in cleaning up XML trees? If you use XSLT, can you reproduce the crash if you build the result tree (or a simpler one) by hand? Do you use XPath calls or extension functions? Are they required to trigger the crash? These kinds of bugs are mostly related to garbage collection and Python reference counting, so try to concentrate on code that results in freeing references to elements and trees. There is also a tool we commonly use to debug memory handling in lxml.etree. It's called "valgrind". doc/valgrind.txt contains a command line that allows you to run lxml with it. This gives you a stack trace when problems occur or when the program crashes that *might* give us a hint on what happened. In case you want to try, you can send me the output in private e-mail (preferably bzip2-ed or gzipped) so that I can take a look at it. > I'll spend another day on this, and see if I can't do better. Thanks, we really appreciate this kind of help. Stefan From faassen at infrae.com Tue Aug 1 11:13:13 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 11:13:13 +0200 Subject: [lxml-dev] lxml - exslt - regexp:match() In-Reply-To: <44CE477A.1030000@gkec.informatik.tu-darmstadt.de> References: <149473834@web.de> <44CE477A.1030000@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF1B29.1050609@infrae.com> Stefan Behnel wrote: [snip] > For comparison, I now implemented the examples from the page as unit tests, > which sadly showed that Python's regexps are incompatible with what EXSLT > requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only > the last "t" is returned for the group by re.findall(). So we can't claim > compatibility with EXSLT at this point. -- Note, though, that I never really > said it was compatible, it just builds on Python's re module. 
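The repeated-group behaviour mentioned here can be reproduced with Python's re module alone. This is a small sketch using the pattern and input quoted in the thread; the variable names are illustrative only. It also shows the variant with the repeat operator moved inside the group.

```python
import re

text = "test "

# Repeat operator outside the group: Python keeps only the last repetition.
last_only = re.findall(r"([a-z])+ ", text)

# Repeat operator moved inside the group: the group spans the whole word.
whole_word = re.findall(r"([a-z]+) ", text)

print(last_only)   # ['t']
print(whole_word)  # ['test']
```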
I still think > that's enough for a Python XML library. If it's not compatible, I think it should be invoked differently than in the EXSLT way. This way someone dropping in an EXSLT stylesheet with regexes doesn't have a half-working stylesheet but a completely and clearly failing stylesheet: lxml doesn't support the regexes. In addition, the path forward to getting the stylesheet working is clear: use the Python-based and deliberately incompatible regex facility instead, and rewrite the regexes. Regards, Martijn From faassen at infrae.com Tue Aug 1 11:15:53 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 11:15:53 +0200 Subject: [lxml-dev] An intriguing behaviour of xpath in lxml In-Reply-To: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF1BC9.7040102@infrae.com> Stefan Behnel wrote: > Hi Agustin, > > Agust?n Villena wrote: >> I already know that xpath(".") in the document node works, but is >> beyond my understanding why xpath("/") is not implemented. > > Well, what would you expect it to return? The XPath spec says: > > """ / selects the document root (which is always the parent of the > document element) """ > > The document element is returned by "/*", so it's the root element of > the document in ElementTree. The "document root" itself is not > available in the tree model provided by lxml. > > It /could/ be a possibility to deliberately diverge from the spec > here and return the root element instead. What about returning a root ElementTree? Then again, that is not the parent of the document element at present in our tree model, right? Or is it? Changing the getparent() behavior will have consequences we need to consider carefully. > So, maybe you can enlighten us with your use case, so that we can > decide what implementation would fit here. Yes, that would indeed be helpful. 
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 11:47:53 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 11:47:53 +0200 Subject: [lxml-dev] lxml - exslt - regexp:match() In-Reply-To: <44CF1B29.1050609@infrae.com> References: <149473834@web.de> <44CE477A.1030000@gkec.informatik.tu-darmstadt.de> <44CF1B29.1050609@infrae.com> Message-ID: <44CF2349.70602@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: > [snip] >> For comparison, I now implemented the examples from the page as unit >> tests, >> which sadly showed that Python's regexps are incompatible with what EXSLT >> requires. The Python RE "([a-z])+ " does not match "test " as in >> EXSLT, only >> the last "t" is returned for the group by re.findall(). So we can't claim >> compatibility with EXSLT at this point. -- Note, though, that I never >> really >> said it was compatible, it just builds on Python's re module. I still >> think >> that's enough for a Python XML library. > > If it's not compatible, I think it should be invoked differently than in > the EXSLT way. This way someone dropping in an EXSLT stylesheet with > regexes doesn't have a half-working stylesheet but a completely and > clearly failing stylesheet: lxml doesn't support the regexes. In > addition, the path forward to getting the stylesheet working is clear: > use the Python-based and deliberately incompatible regex facility > instead, and rewrite the regexes. Hmmm, I feel invited to disagree here. I reread the EXSLT spec on this topic and it does not contain any RE syntax specification and is rather unclear about what is required for compliance. It says this in the introduction of the RE module: """ For ease of implementation, the regular expressions used in this module currently use the Javascript regular expression syntax. 
""" while in the description of the functions, it mainly uses this wording: """ The second argument is a regular expression that follows the Javascript regular expression syntax. """ So, the way I read it, the "currently" does not seem to indicate a clear obligation to obey the actual RE syntax used in the spec. Especially the "ease of implementation" calls for a Python 're' implementation in lxml. :) I also believe that people using XML in a Python environment would rather expect regular expressions to be compatible with what they know from Python's re module (where they are pretty well defined) than with JavaScript expressions. So far, the differences only seem to show for repeated groups, so a large area of use cases is even compatible. BTW, the use case given in the EXSLT spec is easily rewritten by moving the RE repeat operator (+/*) into the group, so if portability is really required in this specific case, it can be achieved on the user side. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 12:02:03 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 12:02:03 +0200 Subject: [lxml-dev] An intriguing behaviour of xpath in lxml In-Reply-To: <44CF1BC9.7040102@infrae.com> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> Message-ID: <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> Hi Agustin, >> >> Agust?n Villena wrote: >>> I already know that xpath(".") in the document node works, but is >>> beyond my understanding why xpath("/") is not implemented. >> >> Well, what would you expect it to return? The XPath spec says: >> >> """ / selects the document root (which is always the parent of the >> document element) """ >> >> The document element is returned by "/*", so it's the root element of >> the document in ElementTree. 
The "document root" itself is not >> available in the tree model provided by lxml. >> >> It /could/ be a possibility to deliberately diverge from the spec >> here and return the root element instead. > > What about returning a root ElementTree? Then that would be the only special case that returns an ElementTree from an XPath expression, although there is currently no way to get an ElementTree passed /into/ an XPath expression. And XPath extension functions would have to start caring about this, too. > Then again, that is not the parent of the document element at present > in our tree model, right? Or is it? No. ElementTrees and Elements are different things that serve different purposes. > Changing the getparent() behavior will have consequences we need > to consider carefully. I dislike the idea of having different (incompatible) return values only to match a single special case. If we say we return an Element from a function, having a special case that can return an ElementTree is far from intuitive and pretty error prone. So, depending on the use case, we may consider a) leaving it as is b) raise a different exception to make the problem more understandable c) return None to avoid the exception (not really a good idea, but would match the behaviour of the getparent() function) d) return a node set with the root element (thus diverging from the spec) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 13:56:25 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 13:56:25 +0200 Subject: [lxml-dev] An intriguing behaviour of xpath in lxml In-Reply-To: References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> Hi Agustin. 
Agustin Villena wrote: > On 8/1/06, Stefan Behnel wrote: > >> Agust?n Villena wrote: > >>> I already know that xpath(".") in the document node works, but is > >>> beyond my understanding why xpath("/") is not implemented. > >> > >> Well, what would you expect it to return? The XPath spec says: > >> > >> """ / selects the document root (which is always the parent of the > >> document element) """ > >> > >> The document element is returned by "/*", so it's the root element of > >> the document in ElementTree. The "document root" itself is not > >> available in the tree model provided by lxml. > > So, depending on the use case, we may consider > a) leaving it as is > b) raise a different exception to make the problem more understandable > c) return None to avoid the exception (not really a good idea, but > would match > the behaviour of the getparent() function) > d) return a node set with the root element (thus diverging from the > spec) > Well, the use case es really simple. I'm engaged in a internal course > teaching XML technologies to my co-workers, and I choose lxml as the > best trade-off between easy of use and power. The problem arises when my > "students" begun toying with xpath... Surprisingly the most common first > case that they tried is xpath("/"), and the Exception really confuses > them, and me. ;) Nice trap. Guess I'd try that first, too. > IMHO the is to paths: > - return a node set with the root element .PRO: is intuitive, CONS: > diverges from the spec It has the advantage of actually returning /something/ useful. It also allows users to access the root ElementTree if they like and thus more or less does what can be expected. I mean, this is a rare case anyway and it is actually well defined, so it would be wrong to raise an exception and thus tell the user "you did something wrong". It's a valid XPath expression and therefore perfectly reasonable to use it. I'll just document the difference and that's it. 
What I now implemented is: if the document root is returned, find its first child and return it as part of a node set instead. If it's not found, it returns None in the node set, but that shouldn't normally happen. > Thanks for your feedback and the great lxml :) Stefan From faassen at infrae.com Tue Aug 1 14:53:36 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 14:53:36 +0200 Subject: [lxml-dev] An intriguing behaviour of xpath in lxml In-Reply-To: <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF4ED0.2040105@infrae.com> Stefan Behnel wrote: [snip] > What I now implemented is: if the document root is returned, find its first > child and return it as part of a node set instead. If it's not found, it > returns None in the node set, but that shouldn't normally happen. This worries me a little... How does that work when the ancestor axis is used? The spec says: """ the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of context node and the parent's parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node""" """ the ancestor-or-self axis contains the context node and the ancestors of the context node; thus, the ancestor axis will always include the root node """ Would that mean the current implementation creates double entries when these axes are used? That's not ideal. Note also: """ The root node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node. 
The root node also has as children processing instruction and comment nodes for processing instructions and comments that occur in the prolog and after the end of the document element """ Perhaps we should implement a special kind of node that represents the root node. It'd not occur in a normal ElementTree DOM, but it's there when you use XPath. It can be also serialized, just like an element, but would include the extra comments that may be there. Then again, we already diverge from strict XPath when we deal with attribute (we have no attribute node), or text (we have no text node). Diverging with root notes wouldn't be a disaster in that picture. That said, the root node is a lot more like an element than these other cases, in that a root node has children, just like element nodes. Regards, Martijn P.S. What do we do with namespace nodes by the way? From agustin.villena at gmail.com Tue Aug 1 15:17:48 2006 From: agustin.villena at gmail.com (=?ISO-8859-1?Q?Agust=EDn_Villena?=) Date: Tue, 01 Aug 2006 09:17:48 -0400 Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml? Message-ID: HI! I'm at the task of processing a bunch of digital signed XMLs. My problem is exemplified in this example: a) The original XMLs were enveloped in a container, that has a default namespace and a signature. The internal XMLs also have their own signature. 
I don't have access to these "envelopes" anymore.

Some Data

b) Sadly, a 3rd party software "extracted" the internal documents,
"forgetting" the envelope's default namespace, therefore invalidating the
doc's signatures.

Example of invalid extracted documents

Some Data

What was needed (xml 1)
--------------------------------------------------

Some Data

First question:
* Is there any way with lxml to add a default namespace to an existing
xml-tree?

Now, I'm trying to patch those messed-up XMLs, injecting the namespace in
the nodes that need to belong to the missing namespace, but the result
is ugly:

python code
-------------------------------------------------------------

from lxml import etree

NEW_NS = "http://www.example.org/example"

doc = etree.parse("no_ns_doc.xml")

# add namespace to the root node
doc.getroot().tag = "{%s}%s" % (NEW_NS, doc.getroot().tag)

# add namespace to the first child of the root node,
# since we don't want to touch the namespace of the
# Signature node
for elem in doc.getroot()[0].getiterator():
    elem.tag = "{%s}%s" % (NEW_NS, elem.tag)

doc.write("ns_patched_doc.xml")

result (xml 2)
-------------------------------------------------------------

Some Data ?

I know that xml1 and xml2 are semantically the same, but the customer wants
his XMLs as they appear in the xml 1 example, or with a less ugly prefix.
Is there any way to force a prettier prefix?

Any ideas?

Thanks
Agustin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ns_patched_doc.xml
Type: text/xml
Size: 221 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: add_ns_example.py
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment.diff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: no_ns_doc.xml Type: text/xml Size: 163 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment-0001.bin From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 15:24:33 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 15:24:33 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF4ED0.2040105@infrae.com> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> Message-ID: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> Martijn Faassen wrote: > Stefan Behnel wrote: > [snip] >> What I now implemented is: if the document root is returned, find its >> first >> child and return it as part of a node set instead. If it's not found, it >> returns None in the node set, but that shouldn't normally happen. > > This worries me a little... > > How does that work when the ancestor axis is used? Ah, right. That didn't actually work either so far [1.0.2]: >>> from lxml import etree >>> tree = etree.XML("") >>> tree[0].xpath("ancestor::node()") Traceback (most recent call last): NotImplementedError: Not yet implemented result node type: 9 Now it gives this: >>> tree[0].xpath("ancestor::node()") [, ] > Would that mean the current implementation creates double entries when > these axes are used? That's not ideal. True, I'd even call that pretty much broken, both in the old and new implementation. > What do we do with namespace nodes by the way? Well: >>> tree[0].xpath("namespace::*") Traceback (most recent call last): NotImplementedError: Not yet implemented result node type: 18 > Perhaps we should implement a special kind of node that represents the > root node. It'd not occur in a normal ElementTree DOM, but it's there > when you use XPath. 
It can be also serialized, just like an element, but > would include the extra comments that may be there. Hmmm, if we go for this kind of special casing, I'd rather return an ElementTree than another special element (that would need to be treated in custom element class lookup, etc.) > Then again, we already diverge from strict XPath when we deal with > attribute (we have no attribute node), or text (we have no text node). > Diverging with root notes wouldn't be a disaster in that picture. > > That said, the root node is a lot more like an element than these other > cases, in that a root node has children, just like element nodes. The xpath() function already has lots of possible return values, so that's just a few more. However, we still have to handle the case of the ancestor axis. As you stated correctly, the root node is not part of the ElementTree DOM. So what about just skipping it completely? Just return an empty node set for "/" and leave it out in "ancestor::node()". That also fits the getparent() method and the iterancestors() method. And after all, there /is/ no Element to be returned here. Another point is XInclude nodes that stayed in after calling xinclude(). I guess we can just ignore those, too. Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI). Any objections? 
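For readers following along, here is a hedged sketch of queries that avoid the document root node altogether, so they do not depend on how "/" or "ancestor::node()" end up being handled. It assumes a current lxml release. The sample document and its namespace URI are hypothetical, since the XML in the original messages was lost.

```python
from lxml import etree

# Hypothetical document; the XML in the original messages was stripped.
root = etree.XML('<root xmlns:x="http://www.example.org/x"><child/></root>')
child = root[0]

# "/*" selects the document element, whatever the context node is.
doc_elements = child.xpath("/*")

# "ancestor::*" matches element nodes only, so the document root node
# cannot show up in the result.
ancestors = child.xpath("ancestor::*")

# Namespace declarations in scope, as a prefix -> URI mapping.
ns_in_scope = child.nsmap

print([e.tag for e in doc_elements], [e.tag for e in ancestors], ns_in_scope)
```

The nsmap property is used here instead of the namespace axis because it is an ordinary Element attribute and does not rely on the return type proposed for namespace nodes.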
Stefan From agustin.villena at gmail.com Tue Aug 1 15:50:33 2006 From: agustin.villena at gmail.com (=?ISO-8859-1?Q?Agust=EDn_Villena?=) Date: Tue, 01 Aug 2006 09:50:33 -0400 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF5C29.8080200@gmail.com> Just my two cents: I'm not so expert in XPath, but that intrigues me is that a perfect valid (and maybe the first XPATH expression that anybody learns) is not valid in lxml. The problem not only happens in the doc node, but in any child. >>>child = doc.getroot()[0] >>>child.xpath("/") Not yet implemented result node type: 9 I remember a recent thread discusing absolute xpath queries in lxml. Is this another case of this issue? What was the thread's conclussion? Cheers Agustin Stefan Behnel escribi?: > > Martijn Faassen wrote: >> Stefan Behnel wrote: >> [snip] >>> What I now implemented is: if the document root is returned, find its >>> first >>> child and return it as part of a node set instead. If it's not found, it >>> returns None in the node set, but that shouldn't normally happen. >> This worries me a little... >> >> How does that work when the ancestor axis is used? > > Ah, right. That didn't actually work either so far [1.0.2]: > > >>> from lxml import etree > >>> tree = etree.XML("") > >>> tree[0].xpath("ancestor::node()") > Traceback (most recent call last): > NotImplementedError: Not yet implemented result node type: 9 > > Now it gives this: > > >>> tree[0].xpath("ancestor::node()") > [, ] > > >> Would that mean the current implementation creates double entries when >> these axes are used? That's not ideal. 
> > True, I'd even call that pretty much broken, both in the old and new > implementation. > > >> What do we do with namespace nodes by the way? > > Well: > > >>> tree[0].xpath("namespace::*") > Traceback (most recent call last): > NotImplementedError: Not yet implemented result node type: 18 > > >> Perhaps we should implement a special kind of node that represents the >> root node. It'd not occur in a normal ElementTree DOM, but it's there >> when you use XPath. It can be also serialized, just like an element, but >> would include the extra comments that may be there. > > Hmmm, if we go for this kind of special casing, I'd rather return an > ElementTree than another special element (that would need to be treated in > custom element class lookup, etc.) > > >> Then again, we already diverge from strict XPath when we deal with >> attribute (we have no attribute node), or text (we have no text node). >> Diverging with root notes wouldn't be a disaster in that picture. >> >> That said, the root node is a lot more like an element than these other >> cases, in that a root node has children, just like element nodes. > > The xpath() function already has lots of possible return values, so that's > just a few more. However, we still have to handle the case of the ancestor axis. > > As you stated correctly, the root node is not part of the ElementTree DOM. So > what about just skipping it completely? Just return an empty node set for "/" > and leave it out in "ancestor::node()". That also fits the getparent() method > and the iterancestors() method. And after all, there /is/ no Element to be > returned here. > > Another point is XInclude nodes that stayed in after calling xinclude(). I > guess we can just ignore those, too. > > Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI). > > Any objections? 
> > Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 16:09:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 16:09:01 +0200 Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml? In-Reply-To: References: Message-ID: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de> Agust?n Villena wrote: > I'm at the task of processing a bunch of digital signed XMLs. My problem > is exemplified in this example: > > a) The original XMLs were enveloped in a container, that has a default > namespace and a signature. The internal XMLs also have their own > signature. I doesn't have access to this "envelopes" anymore > > > > > Some Data > > > > > > > > > > > b) Sadly, a 3rd party software "extracted" the internal documents, > "forgetting" the envelope's default namespace, therefore inalidating the > doc's signatures > > Example of invalid extracted documents > > > Some Data > > > > > Too bad. > What was needed (xml 1) > -------------------------------------------------- > > > Some Data > > > > > > > First question: > * Is there any way with lxml to add a default namspace to an existing > xml-tree No. lxml is namespace aware, so if there is no namespace it will just think that's what was intended. The only way to change the namespace is to change the tag. > Now, I'm trying to patch those messed xmls, injecting the namespace in > the nodes that need to belong to the missing namespace, but the result > is ugly: > > python code > ------------------------------------------------------------- > > from lxml import etree > > NEW_NS = "http://www.example.org/example" > > doc = etree.parse("no_ns_doc.xml") no guarantee, but try adding this here: old_root = doc.getroot() new_root = old_root.makeelement("{http://www.example.org/example}root", nsmap={None : "http://www.example.org/example"}) new_root.append(old_root) then work on 'new_root' and update the tags as you did below. 
> #add namespace to the root node > doc.getroot().tag="{%s}%s" %(NEW_NS,doc.getroot().tag) > > #add namespace to the first child of the root node, > #since we don't want to touch the namespace of the > #Signature Node > for elem in doc.getroot()[0].getiterator(): > elem.tag="{%s}%s" %(NEW_NS,elem.tag) doc = etree.ElementTree( new_root[0] ) > doc.write("ns_patched_doc.xml") The append (i.e. move) operation above should fix the prefixes to match the ones defined in the new root element (i.e. None - the default prefix). Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 16:19:38 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 16:19:38 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF5C29.8080200@gmail.com> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> <44CF5C29.8080200@gmail.com> Message-ID: <44CF62FA.6090503@gkec.informatik.tu-darmstadt.de> Agustín Villena wrote: > Stefan Behnel wrote: >> As you stated correctly, the root node is not part of the ElementTree DOM. So >> what about just skipping it completely? Just return an empty node set for "/" >> and leave it out in "ancestor::node()". That also fits the getparent() method >> and the iterancestors() method. And after all, there /is/ no Element to be >> returned here. >> >> Another point is XInclude nodes that stayed in after calling xinclude(). I >> guess we can just ignore those, too. >> >> Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI). >> >> Any objections? > Just my two cents: > > I'm not an XPath expert, but what intrigues me is that a perfectly > valid expression (and maybe the first XPath expression that anybody learns) is not > valid in lxml. 
Well, we're just trying to make it valid (or rather: work). The problem is the mapping of XPath semantics to ElementTree semantics. > The problem not only happens in the doc node, but in any child. > > >>>child = doc.getroot()[0] > >>>child.xpath("/") > Not yet implemented result node type: 9 Sure. It's an absolute XPath expression, doesn't depend on the context node. > I remember a recent thread discusing absolute xpath queries in lxml. > Is this another case of this issue? No. This is different, as it does not return an Element. That's why I am proposing to map the result to an empty node set (i.e. list). That way, it gets a well defined Python representation that makes sense in the ElementTree context, where root nodes do not exist. So, you would get exactly those Elements you asked for. :) I committed this for now, so, if you want to take a look at it... Stefan From faassen at infrae.com Tue Aug 1 17:46:07 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 17:46:07 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF773F.9030408@infrae.com> Stefan Behnel wrote: > Martijn Faassen wrote: [snip] >> Perhaps we should implement a special kind of node that represents the >> root node. It'd not occur in a normal ElementTree DOM, but it's there >> when you use XPath. It can be also serialized, just like an element, but >> would include the extra comments that may be there. > > Hmmm, if we go for this kind of special casing, I'd rather return an > ElementTree than another special element (that would need to be treated in > custom element class lookup, etc.) 
Advantage of returning a non-ElementTree but something Element-like (like Comment and ProcessingInstruction) is that iteration and such works. It's a node that represents the root and can be serialized. >> Then again, we already diverge from strict XPath when we deal with >> attribute (we have no attribute node), or text (we have no text node). >> Diverging with root notes wouldn't be a disaster in that picture. >> >> That said, the root node is a lot more like an element than these other >> cases, in that a root node has children, just like element nodes. > > The xpath() function already has lots of possible return values, so that's > just a few more. However, we still have to handle the case of the ancestor axis. > > As you stated correctly, the root node is not part of the ElementTree DOM. So > what about just skipping it completely? Just return an empty node set for "/" > and leave it out in "ancestor::node()". That also fits the getparent() method > and the iterancestors() method. And after all, there /is/ no Element to be > returned here. Well, that gives one no way to access any comments surrounding the document library from XPath. Not a disaster, but still. Returning something Element-like sounds the most natural in this case, just like returning a string is most natural for attribute nodes. > Another point is XInclude nodes that stayed in after calling xinclude(). I > guess we can just ignore those, too. > > Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI). > > Any objections? Just URI would be sufficient, but no objection to also returning the prefix information. Regards, Martijn From agustin.villena at gmail.com Tue Aug 1 17:46:21 2006 From: agustin.villena at gmail.com (=?ISO-8859-15?Q?Agust=EDn_Villena?=) Date: Tue, 01 Aug 2006 11:46:21 -0400 Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml? 
In-Reply-To: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de> References: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de> Message-ID: <44CF774D.8010407@gmail.com> Well, testing your lines I now have this code: --------------------------- from lxml import etree NEW_NS = "http://www.example.org/example" def add_ns(node,nsURL): if type(node)==etree._Element: node.tag="{%s}%s" %(nsURL,node.tag) doc = etree.parse("no_ns_doc.xml") old_root = doc.getroot() new_root = old_root.makeelement("{%s}root" % (NEW_NS),nsmap={None : NEW_NS}) new_root.append(old_root) add_ns(old_root,NEW_NS) #add namespace to the first child of the root node, #since we don't want to touch the namespace of the #Signature Node for elem in old_root[0].getiterator(): add_ns(elem,NEW_NS) #up to this line, we have this new_root element: # new_doc = etree.ElementTree(new_root[0]) #All the children of new_root keep their namespace #in the new doc. But in the serialized text, this namespace disappears #is this a bug? new_doc.write("ns_patched_doc.xml") --------- serialized ---------- Some Data ---------- Too bad... Any ideas? Agustin ---------------- As you can see, it almost works! But when we move new_root's children into new_doc, they lose their Stefan Behnel wrote: > > Agustín Villena wrote: >> I'm working on processing a bunch of digitally signed XMLs. My problem >> is exemplified in this example: >> >> a) The original XMLs were enveloped in a container, that has a default >> namespace and a signature. The internal XMLs also have their own >> signature. I don't have access to these "envelopes" anymore >> >> >> >> >> Some Data >> >> >> >> >> >> >> >> >> >> >> b) Sadly, a 3rd party software "extracted" the internal documents, >> "forgetting" the envelope's default namespace, therefore invalidating the >> doc's signatures >> >> Example of invalid extracted documents >> >> >> Some Data >> >> >> >> >> > > Too bad. 
> > >> What was needed (xml 1) >> -------------------------------------------------- >> >> >> Some Data >> >> >> >> >> >> >> First question: >> * Is there any way with lxml to add a default namspace to an existing >> xml-tree > > No. lxml is namespace aware, so if there is no namespace it will just think > that's what was intended. The only way to change the namespace is to change > the tag. > > >> Now, I'm trying to patch those messed xmls, injecting the namespace in >> the nodes that need to belong to the missing namespace, but the result >> is ugly: >> >> python code >> ------------------------------------------------------------- >> >> from lxml import etree >> >> NEW_NS = "http://www.example.org/example" >> >> doc = etree.parse("no_ns_doc.xml") > > no guarantee, but try adding this here: > > old_root = doc.getroot() > new_root = old_root.makeelement("{http://www.example.org/example}root", > nsmap={None : "http://www.example.org/example"}) > new_root.append(old_root) > > then work on 'new_root' and update the tags as you did below. > >> #add namespace to the root node >> doc.getroot().tag="{%s}%s" %(NEW_NS,doc.getroot().tag) >> >> #add namespace to the first child of the root node, >> #since we don't want to touch de namespace of the >> #Signature Node >> for elem in doc.getroot()[0].getiterator(): >> elem.tag="{%s}%s" %(NEW_NS,elem.tag) > > doc = ElementTree( new_root[0] ) > >> doc.write("ns_patched_doc.xml") > > The append (i.e. move) operation above should fix the prefixes to match the > ones defined in the new root element (i.e. None - the default prefix). 
> > Stefan From faassen at infrae.com Tue Aug 1 17:56:28 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 17:56:28 +0200 Subject: [lxml-dev] running the tests on the trunk Message-ID: <44CF79AC.4090401@infrae.com> Hi there, I have trouble running the tests on the current trunk of lxml: Ran 556 tests in 2.333s FAILED (failures=1, errors=7) A lot of this seems to have to do with this attribute error while running the tests: AttributeError: 'module' object has no attribute 'iterparse' What's going on? Regards, Martijn From faassen at infrae.com Tue Aug 1 18:02:59 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 18:02:59 +0200 Subject: [lxml-dev] running the tests on the trunk In-Reply-To: <44CF79AC.4090401@infrae.com> References: <44CF79AC.4090401@infrae.com> Message-ID: <44CF7B33.4030901@infrae.com> Martijn Faassen wrote: > Hi there, > > I have trouble running the tests on the current trunk of lxml: > > Ran 556 tests in 2.333s > > FAILED (failures=1, errors=7) > > A lot of this seems to have to do with this attribute error while > running the tests: > > AttributeError: 'module' object has no attribute 'iterparse' > > What's going on? I think I figured it out: I need to upgrade my version of *ElementTree*. Regards, Martijn From faassen at infrae.com Tue Aug 1 18:05:49 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 18:05:49 +0200 Subject: [lxml-dev] running the tests on the trunk In-Reply-To: <44CF7B33.4030901@infrae.com> References: <44CF79AC.4090401@infrae.com> <44CF7B33.4030901@infrae.com> Message-ID: <44CF7BDD.1030201@infrae.com> Martijn Faassen wrote: > Martijn Faassen wrote: >> What's going on? > > I think I figured it out: I need to upgrade my version of *ElementTree*. 
Yup, that eliminated most problems, except for this failure in the doctests: File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 153, in resolvers.txt ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 153, in resolvers.txt Failed example: result = transform(honk_doc) Expected: Resolving url hoi:test as prefix honk ... failed Resolving url hoi:test as prefix hoi ... done Got: Resolving url hoi:test as prefix hoi ... done ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 165, in resolvers.txt Failed example: result = transform(normal_doc) Expected: Resolving url hoi:test as prefix honk ... failed Resolving url hoi:test as prefix hoi ... done Got: Resolving url hoi:test as prefix hoi ... done ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 192, in resolvers.txt Failed example: transform = etree.XSLT(honk_doc) Expected: Resolving url honk:test as prefix honk ... done Got: Resolving url honk:test as prefix hoi ... failed Resolving url honk:test as prefix honk ... done ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 194, in resolvers.txt Failed example: result = transform(normal_doc) Expected: Resolving url hoi:test as prefix honk ... failed Resolving url hoi:test as prefix hoi ... done Got: Resolving url hoi:test as prefix hoi ... 
done ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", line 199, in resolvers.txt Failed example: transform = etree.XSLT(honk_doc, access_control=ac) Expected: Resolving url honk:test as prefix honk ... done Got: Resolving url honk:test as prefix hoi ... failed Resolving url honk:test as prefix honk ... done From faassen at infrae.com Tue Aug 1 18:26:35 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 18:26:35 +0200 Subject: [lxml-dev] ElementTree comment behavior Message-ID: <44CF80BB.90404@infrae.com> Hi there, The whole XPath root node issue led me to investigate lxml's behavior with comment nodes, thinking that we might not do the right thing with mutation (as Comments subclass Element). However, it seems to behave rationally enough: >>> import lxml >>> from lxml import etree >>> c = etree.Comment('foo') >>> c.append(etree.Element('bar')) >>> len(c.getchildren()) 0 (I wonder what happens in the C tree though here.. cursory inspection of the tree.c code of libxml2 doesn't reveal special code to handle this case) Unfortunately, ElementTree behaves differently in this case! >>> from elementtree import ElementTree as etree2 >>> c = etree2.Comment('foo') >>> c.append(etree.Element('bar')) >>> len(c.getchildren()) 1 Evidently it allows child Elements to be added to comments. What to do in this case? 
Regards, Martijn From faassen at infrae.com Tue Aug 1 19:09:08 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 01 Aug 2006 19:09:08 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF773F.9030408@infrae.com> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> <44CF773F.9030408@infrae.com> Message-ID: <44CF8AB4.7040209@infrae.com> Martijn Faassen wrote: [snip] > Well, that gives one no way to access any comments surrounding the > document library from XPath. Not a disaster, but still. Returning > something Element-like sounds the most natural in this case, just like > returning a string is most natural for attribute nodes. I've just checked in a branch here: http://codespeak.net/svn/lxml/branch/lxml-xpathroot which experiments with adding a special XPath Root object. This root object only shows up when accessing / through XPath - there's no way to get to it using the normal ElementTree functionality. At first sight this implementation doesn't appear to be too difficult. I think this is a nicer solution than just not returning anything. Unfortunately, my changes also cause memory errors when running the test. It's possible this happens because we start stuffing our proxy in the _private of a XML_DOCUMENT_NODE, something that wasn't possible before, and we're probably not scanning for accurately in our deallocation logic. Don't have time to investigate this further now though, so I'll leave it in the branch for now. Feel free to investigate, Stefan. 
:) Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 19:09:51 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 19:09:51 +0200 Subject: [lxml-dev] ElementTree comment behavior In-Reply-To: <44CF80BB.90404@infrae.com> References: <44CF80BB.90404@infrae.com> Message-ID: <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de> Martijn Faassen wrote: > The whole XPath root node issue led me to investigate lxml's behavior > with comment nodes, thinking that we might not do the right thing with > mutation (as Comments subclass Element). However, it seems to behave > rationally enough: > > >>> import lxml > >>> from lxml import etree > >>> c = etree.Comment('foo') > >>> c.append(etree.Element('bar')) > >>> len(c.getchildren()) > 0 > > (I wonder what happens in the C tree though here.. cursory inspection of > the tree.c code of libxml2 doesn't reveal special code to handle this case) Well, this is how lxml currently implements _Comment.append(): def append(self, _Element element): pass Maybe it should rather raise an exception? > Unfortunately, ElementTree behaves differently in this case! > > >>> from elementtree import ElementTree as etree2 > >>> c = etree2.Comment('foo') > >>> c.append(etree.Element('bar')) > >>> len(c.getchildren()) > 1 > > Evidently it allows child Elements to be added to comments. > > What to do in this case? I personally find the behaviour of ET a bit bizarre here. What /is/ the element child of an XML comment? 
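A small sketch of the ElementTree side of this. It uses the xml.etree.ElementTree module from Python's standard library, which is an assumption on my part: the thread itself uses the standalone elementtree package. With that module, the appended child is kept on the comment object, but the serializer has nowhere to put it.

```python
import xml.etree.ElementTree as ET

c = ET.Comment("foo")
c.append(ET.Element("bar"))

# The child is stored on the Python object...
child_count = len(list(c))

# ...but serializing a comment only writes its text, so the child
# does not show up in the output.
serialized = ET.tostring(c)

print(child_count)
print(serialized)
```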
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 21:08:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 01 Aug 2006 21:08:20 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CF8AB4.7040209@infrae.com> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> <44CF773F.9030408@infrae.com> <44CF8AB4.7040209@infrae.com> Message-ID: <44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de> Martijn Faassen wrote: > Martijn Faassen wrote: > [snip] >> Well, that gives one no way to access any comments surrounding the >> document library from XPath. Not a disaster, but still. Note that this only applies to the return values of XPath calls. Inside the expression, you can do whatever XPath supports. So you can still navigate the brothers and sisters of the document root and return the one of them that you're interested in, without having to pass the root itself into Python. >> Returning >> something Element-like sounds the most natural in this case, just like >> returning a string is most natural for attribute nodes. I'm still not convinced that this should be Element-like. It's not an Element and it has no representation in the ElementTree world. > I've just checked in a branch here: > > http://codespeak.net/svn/lxml/branch/lxml-xpathroot > > which experiments with adding a special XPath Root object. This root > object only shows up when accessing / through XPath - there's no way to > get to it using the normal ElementTree functionality. At first sight > this implementation doesn't appear to be too difficult. I think this is > a nicer solution than just not returning anything. Ok, I can see what you did. 
You'd have to rewrite that after the merge of the CAPI branch, which changes loads of stuff under the hood and largely impacts element class lookup. So it would have to fit in there. > Unfortunately, my changes also cause memory errors when running the > test. It's possible this happens because we start stuffing our proxy in > the _private of a XML_DOCUMENT_NODE, something that wasn't possible > before, and we're probably not scanning for accurately in our > deallocation logic. Don't have time to investigate this further now > though, so I'll leave it in the branch for now. doc._private is currently only used in XSLT (which may already interfere when extension functions are used), but I'm not very happy with the idea of using xmlDoc like any other element node. It starts with the fact that we now have _Document and _Root sitting on the same xmlDoc structure. That unnecessarily complicates the cleanup procedure for what I call a rare special case. If we really want to put something Element-like in there, we may consider making it part of the _Document class, which already is unique for the document root. 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 07:27:59 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 07:27:59 +0200 Subject: [lxml-dev] running the tests on the trunk In-Reply-To: <44CF7BDD.1030201@infrae.com> References: <44CF79AC.4090401@infrae.com> <44CF7B33.4030901@infrae.com> <44CF7BDD.1030201@infrae.com> Message-ID: <44D037DF.9010204@gkec.informatik.tu-darmstadt.de> Martijn Faassen wrote: > that eliminated most problems, except for this failure in the doctests: > > File > "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", > line 153, in resolvers.txt > > ---------------------------------------------------------------------- > File > "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt", > line 153, in resolvers.txt > Failed example: > result = transform(honk_doc) > Expected: > Resolving url hoi:test as prefix honk ... failed > Resolving url hoi:test as prefix hoi ... done > Got: > Resolving url hoi:test as prefix hoi ... done > ---------------------------------------------------------------------- [snip] Ah, right. It's the tests that are broken here. I forgot that the resolvers are stored in a set and thus tested in arbitrary order (interesting that no one ever reported that for 1.0). So here they seem to use a different order that leads to different output. Guess I'll have to fix the tests here. Maybe the best way is to only let the resolver speak that succeeds, not the failed one(s) that were also tested. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 07:45:35 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 07:45:35 +0200 Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml? 
In-Reply-To: <44CF774D.8010407@gmail.com> References: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de> <44CF774D.8010407@gmail.com> Message-ID: <44D03BFF.80405@gkec.informatik.tu-darmstadt.de> Hi, Agustín Villena wrote: > Well, testing your lines I now have this code: > --------------------------- > from lxml import etree > > NEW_NS = "http://www.example.org/example" > > def add_ns(node,nsURL): > if type(node)==etree._Element: > node.tag="{%s}%s" %(nsURL,node.tag) > > doc = etree.parse("no_ns_doc.xml") > > old_root = doc.getroot() > new_root = old_root.makeelement("{%s}root" % (NEW_NS),nsmap={None : NEW_NS}) > new_root.append(old_root) > > add_ns(old_root,NEW_NS) > #add namespace to the first child of the root node, > #since we don't want to touch the namespace of the > #Signature Node > for elem in old_root[0].getiterator(): > add_ns(elem,NEW_NS) > > #up to this line, we have this new_root element: > # > > > new_doc = etree.ElementTree(new_root[0]) > #All the children of new_root keep their namespace > #in the new doc. But in the serialized text, this namespace disappears > #is this a bug? > new_doc.write("ns_patched_doc.xml") > > --------- > serialized > ---------- > > > Some Data > > > > > Hmm, ok, that didn't quite work. Maybe we should just add a helper function for namespace handling, as Martijn suggested a while ago. We could implement something like this: def reassignNamespacePrefixes(element_or_tree, prefixmap): """Traverse the tree and replace the prefixes in namespace declarations by the URI->prefix mapping defined by prefixmap. """ Question: how do we handle the case where a prefix is already used for a different namespace in the tree? 
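Until such a helper exists, one possible way around the serialization problem is to rebuild the tree rather than retag and move existing nodes. The sketch below assumes that rebuilding is acceptable and that the new root is the right place for the default namespace declaration. The name rebuild_in_namespace is hypothetical, not an lxml API, and it is a simplification: every element without a namespace is put into the new one, so a node that must stay untouched (such as the Signature node) would need extra handling.

```python
import copy
from lxml import etree

NEW_NS = "http://www.example.org/example"

def rebuild_in_namespace(elem, ns_uri, parent=None):
    """Return a copy of elem whose un-namespaced elements live in ns_uri.

    Hypothetical helper for illustration only.
    """
    # Elements that already carry a namespace keep their tag.
    if elem.tag.startswith("{"):
        tag = elem.tag
    else:
        tag = "{%s}%s" % (ns_uri, elem.tag)

    if parent is None:
        # Declare the default namespace once, on the new root.
        new = etree.Element(tag, attrib=dict(elem.attrib),
                            nsmap={None: ns_uri})
    else:
        # Children pick up the declaration that is already in scope.
        new = etree.SubElement(parent, tag, attrib=dict(elem.attrib))

    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        if isinstance(child.tag, str):
            rebuild_in_namespace(child, ns_uri, new)
        else:
            # Comments and processing instructions are copied as they are.
            new.append(copy.deepcopy(child))
    return new

# Made-up input standing in for an extracted document.
source = etree.XML("<doc><data>Some Data</data></doc>")
patched = rebuild_in_namespace(source, NEW_NS)
serialized = etree.tostring(patched)
print(serialized)
```

Because the new root carries the declaration itself, serializing it (or an ElementTree wrapped around it) does not depend on a discarded wrapper element.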
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 10:53:55 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 10:53:55 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> Message-ID: <44D06823.6030407@gkec.informatik.tu-darmstadt.de> Hi John, John Krukoff wrote: > Okay, I've managed to create a crashing test case that's down to a > reasonable number of lines of code. I don't think I can remove anything > else and still have it crash. Great, thanks for stripping this down. > There's even some very odd changes that stop it from crashing, such as > shortening "fieldset" to "f". Fortunately, narrowing this down allowed > me to create a workaround for the real program, so fixing this is no > longer so urgent for me. > > I've also attached the results of a valgrind run using the recommended > command line parameters on the test program. I didn't bother gzipping > it, because it's pretty small. > > Please let me know if this fails to crash for you. I have to run it > using "python test.py" instead of "./test.py" to see the glibc error. It 'nicely' crashes for me and I think I can tell where it comes from. We use a global dictionary in the parser that stores tag names, attribute values, etc. It mainly serves the purpose of reducing the number of expensive malloc calls and avoiding duplicated storage of constant strings. Normally, it works just fine, unless there are operations that create additional dictionaries, like XSLT. 
:( So what happens in your case, is: when you move the content of the XSLT result document over to the document you parsed, it will contain strings from two different dictionaries (I just verified that). When the documents are freed, libxml2 checks if the strings it frees are in the document dictionary, sees that it is not the case (as it came from a different dictionary) and then frees it. This leaves stale pointers in the second dictionary. It's too bad we can't control the dictionary created by libxslt for transformations, as it is automatically created and used when we request a transformation context. So we can't just replace the dictionary afterwards. I'm not quite sure what to do here. There are ways to fix this, but they can be expensive, so I'll just have to figure out which one to go. One solution could be to extend the deep traversal that follows moving a subtree to a different document. We could let it check if the dicts are the same, and if they are not, copy the strings stored in the source dictionary to the destination dictionary. As I said, this can be expensive but is a rare case as (so far) it only applies to partial XSLT results being moved around. On the other hand, this would also allow moving subtrees between threads (which use independent dictionaries as well), so maybe it's worth it... As this problem (currently) only appears in XSLT, a second way to handle it would be to replace the dictionary of the transformation context after initialisation, but /before/ running the transform. That way, there should be less content already stored in it that would have to be moved. While the second one sounds like the least expensive, maybe there are even better ways I did not think of. I'll take a look at it. 
Again, thanks for reporting this and for providing a test case, Stefan > ------------------------------------------------------------------------ > > import lxml.etree as etree > > definitionXml = etree.XML( ''' > > ''' ) > > definitionXml[ : ] = etree.XSLT( etree.XML( ''' > > > > > > > > > > > > > ''' ) )( definitionXml[ 0 ] ).getroot( )[ : ] > > # Segfault occurs on this line. > del definitionXml > > print "Didn't crash!" > ------------------------------------------------------------------------ > ==29947== Invalid free() / delete / delete[] > ==29947== at 0x401C0C3: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) > ==29947== by 0x45A161A: xmlFreeNodeList (in /usr/lib/libxml2.so.2.6.26) > ==29947== Address 0x48AD24C is 20 bytes inside a block of size 1,024 free'd > ==29947== at 0x401C0C3: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) > ==29947== by 0x46B3EE5: xmlDictFree (in /usr/lib/libxml2.so.2.6.26) From faassen at infrae.com Wed Aug 2 11:52:09 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 02 Aug 2006 11:52:09 +0200 Subject: [lxml-dev] Return values of XPath calls In-Reply-To: <44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de> References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> <44CF773F.9030408@infrae.com> <44CF8AB4.7040209@infrae.com> <44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de> Message-ID: <44D075C9.8070307@infrae.com> Stefan Behnel wrote: > > Martijn Faassen wrote: >> Martijn Faassen wrote: >> [snip] >>> Well, that gives one no way to access any comments surrounding the >>> document library from XPath. Not a disaster, but still. > > Note that this only applies to the return values of XPath calls. Inside the > expression, you can do whatever XPath supports. 
So you can still navigate the > brothers and sisters of the document root and return the one of them that > you're interested in, without having to pass the root itself into Python. Yes, naturally - it's not a disaster and it's only from XPath. >>> Returning >>> something Element-like sounds the most natural in this case, just like >>> returning a string is most natural for attribute nodes. > > I'm still not convinced that this should be Element-like. It's not an Element > and it has no representation in the ElementTree world. It has no representation in the ElementTree itself, but it's quite Element-like in that it has children. It's also Element-like in that it is relatively straightforward to implement it as a special kind of Element. :) >> I've just checked in a branch here: >> >> http://codespeak.net/svn/lxml/branch/lxml-xpathroot >> >> which experiments with adding a special XPath Root object. This root >> object only shows up when accessing / through XPath - there's no way to >> get to it using the normal ElementTree functionality. At first sight >> this implementation doesn't appear to be too difficult. I think this is >> a nicer solution than just not returning anything. > > Ok, I can see what you did. You'd have to rewrite that after the merge of the > CAPI branch, which changes loads of stuff under the hood and largely impacts > element class lookup. So it would have to fit in there. Okay, understood. I wasn't sure on the status of the CAPI branch. >> Unfortunately, my changes also cause memory errors when running the >> test. It's possible this happens because we start stuffing our proxy in >> the _private of a XML_DOCUMENT_NODE, something that wasn't possible >> before, and we're probably not scanning for accurately in our >> deallocation logic. Don't have time to investigate this further now >> though, so I'll leave it in the branch for now. 
> > doc._private is currently only used in XSLT (which may already interfere when > extension functions are used), but I'm not very happy with the idea of using > xmlDoc like any other element node. It starts with the fact that we now have > _Document and _Root sitting on the same xmlDoc structure. That unnecessarily > complicates the cleanup procedure for what I call a rare special case. Agreed. > If we really want to put something Element-like in there, we may consider > making it part of the _Document class, which already is unique for the > document root. Okay, that might make sense. I will study the _Document class and see whether we can come up with a design that is satisfactory. Thanks for the design feedback. :) This is driven by my desire to see some sensible return value when people evaluate the '/' XPath expression. Returning nothing is so... nothing, and if this is the first thing people tend to do then it might give them the impression lxml is misbehaving somehow. Regards, Martijn From faassen at infrae.com Wed Aug 2 11:55:47 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 02 Aug 2006 11:55:47 +0200 Subject: [lxml-dev] ElementTree comment behavior In-Reply-To: <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de> References: <44CF80BB.90404@infrae.com> <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de> Message-ID: <44D076A3.2000100@infrae.com> Stefan Behnel wrote: > > Martijn Faassen wrote: >> The whole XPath root node issue led me to investigate lxml's behavior >> with comment nodes, thinking that we might not do the right thing with >> mutation (as Comments subclass Element). However, it seems to behave >> rationally enough: >> >> >>> import lxml >> >>> from lxml import etree >> >>> c = etree.Comment('foo') >> >>> c.append(etree.Element('bar')) >> >>> len(c.getchildren()) >> 0 >> >> (I wonder what happens in the C tree though here.. 
cursory inspection of
>> the tree.c code of libxml2 doesn't reveal special code to handle this case)
>
> Well, this is how lxml currently implements _Comment.append():
>
>     def append(self, _Element element):
>         pass
>
> Maybe it should rather raise an exception?

Yeah, I realized this after I wrote the post. If we were to raise an exception, we'd be incompatible with ElementTree, but I wouldn't mind too much as this is a rather ridiculous operation anyway and people who do this in their code should actually know they're doing something weird.

Note that I apparently added no such method for other mutation operations such as 'insert'...

>> Unfortunately, ElementTree behaves differently in this case!
>>
>> >>> from elementtree import ElementTree as etree2
>> >>> c = etree2.Comment('foo')
>> >>> c.append(etree.Element('bar'))
>> >>> len(c.getchildren())
>> 1
>>
>> Evidently it allows child Elements to be added to comments.
>>
>> What to do in this case?
>
> I personally find the behaviour of ET a bit bizarre here. What /is/ the
> element child of an XML comment?

I think you're right in that it's bizarre. The reason it behaves this way might be convenience of implementation... I feel under no obligation to be compatible with ET here.

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 12:13:40 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 12:13:40 +0200
Subject: [lxml-dev] ElementTree comment behavior
In-Reply-To: <44D076A3.2000100@infrae.com>
References: <44CF80BB.90404@infrae.com> <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de> <44D076A3.2000100@infrae.com>
Message-ID: <44D07AD4.6020906@gkec.informatik.tu-darmstadt.de>

Martijn Faassen wrote:
> Stefan Behnel wrote:
>>
>> Martijn Faassen wrote:
>>> The whole XPath root node issue led me to investigate lxml's behavior
>>> with comment nodes, thinking that we might not do the right thing
>>> with mutation (as Comments subclass Element).
However, it seems to >>> behave rationally enough: >>> >>> >>> import lxml >>> >>> from lxml import etree >>> >>> c = etree.Comment('foo') >>> >>> c.append(etree.Element('bar')) >>> >>> len(c.getchildren()) >>> 0 >>> >>> (I wonder what happens in the C tree though here.. cursory inspection >>> of the tree.c code of libxml2 doesn't reveal special code to handle >>> this case) >> >> Well, this is how lxml currently implements _Comment.append(): >> >> def append(self, _Element element): >> pass >> >> Maybe it should rather raise an exception? > > Yeah, I realized this after I wrote the post. If we were to raise an > exception, we'd be incompatible with ElementTree, but I wouldn' mind too > much as this is a rather ridiculous operation anyway and people who do > this in their code should actually know they're doing something weird. > > Note that I apparently added no such method for other mutation > operations such as 'insert'... I added the method in the CAPI branch (also __setitem__ and __setslice__). The mutators now raise a TypeError. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 13:01:25 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 13:01:25 +0200 Subject: [lxml-dev] Some performance results with threads Message-ID: <44D08605.7090408@gkec.informatik.tu-darmstadt.de> Hi, I did a little testing on a dual processor linux machine with the current trunk. I had a very simple setup with a number of threads (8-16) that each created a separate parser, parsed a 3MB string or file and then ran a small XSLT on it. Parsing and XSLT are operations that free the GIL for the majority of their internal work. The outcome was that the system was always between 20% and 40% idle. So, there is a certain speedup in multi-processor environments, but don't expect too much, especially when adding more processors. 
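The setup described above could look roughly like the following sketch. The payload, stylesheet and thread count are placeholders and not the ones used for the measurements, and the calls assume a current lxml release:

```python
import threading
import time

from lxml import etree

# Placeholder stylesheet and input; the original test used a 3MB document.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <count><xsl:value-of select="count(//item)"/></count>
  </xsl:template>
</xsl:stylesheet>"""
PAYLOAD = b"<root>" + b"<item>x</item>" * 10000 + b"</root>"


def worker(results, index):
    # Each thread creates its own parser, parses the input and the
    # stylesheet, and runs the transformation.
    parser = etree.XMLParser()
    doc = etree.fromstring(PAYLOAD, parser)
    transform = etree.XSLT(etree.fromstring(STYLESHEET, parser))
    results[index] = str(transform(doc))


def run(thread_count=8):
    """Run the workers in parallel; return their outputs and the wall time."""
    results = [None] * thread_count
    threads = [threading.Thread(target=worker, args=(results, i))
               for i in range(thread_count)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    outputs, elapsed = run()
    print("%d threads finished in %.3f s" % (len(outputs), elapsed))
```

Comparing the wall time for different thread counts against a single-threaded run gives an idea of how much of the work actually runs outside the GIL.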
It shows that it makes sense to use threads on, say, a web server that has to serve other content in parallel (like static content), so that it can make use of a third of the processing time itself. But it will not get you 100% more throughput by doubling the number of processors. It looks like you should really expect less than a 50% speedup, depending on how much time your application actually spends in parsing, serialising, validating and XSLT. If your application does a lot of XML handling in Python code (like tree iteration etc.), the ratio can get close to 0, but if you have complex XSLTs or large schemas/documents to validate, the speedup can potentially be much higher. (I never thought I'd ever tell someone to rewrite code in XSLT to make it /faster/ ...) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 13:30:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 13:30:43 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <44D06823.6030407@gkec.informatik.tu-darmstadt.de> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> Message-ID: <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> Hi John, Stefan Behnel wrote: > John Krukoff wrote: >> Okay, I've managed to create a crashing test case that's down to a >> reasonable number of lines of code. I don't think I can remove anything >> else and still have it crash. > > It 'nicely' crashes for me and I think I can tell where it comes from. We use > a global dictionary in the parser that stores tag names, attribute values, > etc. It mainly serves the purpose of reducing the number of expensive malloc > calls and avoiding duplicated storage of constant strings. 
Normally, it works > just fine, unless there are operations that create additional dictionaries, > like XSLT. :( > > So what happens in your case, is: when you move the content of the XSLT result > document over to the document you parsed, it will contain strings from two > different dictionaries (I just verified that). When the documents are freed, > libxml2 checks if the strings it frees are in the document dictionary, sees > that it is not the case (as it came from a different dictionary) and then > frees it. This leaves stale pointers in the second dictionary. I attached a patch that is somewhat hacky and may not work in some situations. However, it should solve your crash for now and I will see if I can get something like this a bit cleaned up and merged into the next release (1.1). Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: xslt-dict-hack.patch Type: text/x-patch Size: 1244 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060802/40be118d/attachment-0001.bin From lxml at adhamh.com Wed Aug 2 17:19:48 2006 From: lxml at adhamh.com (Adhamh Findlay) Date: Wed, 02 Aug 2006 08:19:48 -0700 Subject: [lxml-dev] XML Schema: Getting more information on validation failures? Message-ID: <44D0C294.30902@adhamh.com> Hello, I'm new to lxml and I'm trying to get more information on why some validation is failing. Here is the code I am currently using: try: xmlschema.assertValid(xml_doc) except etree.DocumentInvalid: traceback.print_exc() print log print error.domain_name print error.type_name sys.exit() Here's the output I get: Traceback (most recent call last): File "./xml.py", line 42, in ? xmlschema.assertValid(xml_doc) File "etree.pyx", line 1624, in etree._Validator.assertValid DocumentInvalid: Document does not comply with schema Traceback (most recent call last): File "./xml.py", line 46, in ? 
print error.domain_name AttributeError: 'NoneType' object has no attribute 'domain_name' Is there any way to get more information than this? Thanks, Adhamh From faassen at infrae.com Wed Aug 2 18:12:59 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 02 Aug 2006 18:12:59 +0200 Subject: [lxml-dev] Some performance results with threads In-Reply-To: <44D08605.7090408@gkec.informatik.tu-darmstadt.de> References: <44D08605.7090408@gkec.informatik.tu-darmstadt.de> Message-ID: <44D0CF0B.1010209@infrae.com> Stefan Behnel wrote: > (I never thought I'd ever tell someone to rewrite > code in XSLT to make it /faster/ ...) In general if you can run a transformation using libxslt instead of a Python-based XML transformation algorithm, and the transformation is pretty 'natural' to XSLT, even on a single-threaded setup libxslt can speed things up. libxslt is a reasonably fast XSLT processor after all. Regards, Martijn From faassen at infrae.com Wed Aug 2 18:13:32 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 02 Aug 2006 18:13:32 +0200 Subject: [lxml-dev] Some performance results with threads In-Reply-To: <44D08605.7090408@gkec.informatik.tu-darmstadt.de> References: <44D08605.7090408@gkec.informatik.tu-darmstadt.de> Message-ID: <44D0CF2C.6070004@infrae.com> Stefan Behnel wrote: [snip info on multi-threaded use of lxml] Thanks for checking this out and letting us know, by the way. Good to know! 
Regards, Martijn From faassen at infrae.com Wed Aug 2 18:17:07 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 02 Aug 2006 18:17:07 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> Message-ID: <44D0D003.2080800@infrae.com> Stefan Behnel wrote: [XSLT segfaulting issue] > I attached a patch that is somewhat hacky and may not work in some situations. > However, it should solve your crash for now and I will see if I can get > something like this a bit cleaned up and merged into the next release (1.1). [code of patch] I believe this is very similar to the approach I took early on to ensure documents share their dictionaries, so who knows, we might be in luck and it's reliable. Hm, though I vaguely remember we already did that for XSLT too, so perhaps this is hacky in the place it's added, not the way it's done? Are there any cases you can think of where this would lead to problems? Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 18:14:09 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 18:14:09 +0200 Subject: [lxml-dev] XML Schema: Getting more information on validation failures? In-Reply-To: <44D0C294.30902@adhamh.com> References: <44D0C294.30902@adhamh.com> Message-ID: <44D0CF51.5020404@gkec.informatik.tu-darmstadt.de> Hi Adhamh, Adhamh Findlay wrote: > I'm new to lxml and I'm trying to get more information on why some > validation is failing. 
Here is the code I am currently using: > > try: > xmlschema.assertValid(xml_doc) > except etree.DocumentInvalid: > traceback.print_exc() > print log > print error.domain_name > print error.type_name > sys.exit() Here is an example on how to do this: http://codespeak.net/lxml/api.html#error-handling-on-exceptions It's more something like this: try: xmlschema.assertValid(xml_doc) except etree.DocumentInvalid, error: log = error.error_log print log print log[-1].domain_name print log[-1].type_name > Here's the output I get: > Traceback (most recent call last): > File "./xml.py", line 46, in ? > print error.domain_name > AttributeError: 'NoneType' object has no attribute 'domain_name' This is because you set "error" to None somewhere in your program. You can't really blame lxml for that... Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 18:22:37 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 02 Aug 2006 18:22:37 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <44D0D003.2080800@infrae.com> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> <44D0D003.2080800@infrae.com> Message-ID: <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de> Martijn Faassen schrieb: > Stefan Behnel wrote: > [XSLT segfaulting issue] >> I attached a patch that is somewhat hacky and may not work in some >> situations. >> However, it should solve your crash for now and I will see if I can get >> something like this a bit cleaned up and merged into the next release >> (1.1). 
>
> [code of patch]
>
> I believe this is very similar to the approach I took early on to ensure
> documents share their dictionaries, so who knows, we might be in luck
> and it's reliable. Hm, though I vaguely remember we already did that for
> XSLT too, so perhaps this is hacky in the place it's added, not the way
> it's done?
>
> Are there any cases you can think of where this would lead to problems?

The difference between changing the dict on the parser context and on the XSLT context is that the parser context does not use it before it is returned. libxslt *might* store stuff in it, depending on the stylesheet.

I filed a bug report on this and got an immediate "not a bug but a feature" from Daniel. The reason is that the transformation must not modify the stylesheet, so it just creates a sub-dictionary and is happy with that - unlike its users.

However, he also said that if I want to propose an API for it, I should ask on the list. Don't think I'll do it, though, as it's not worth much to have the final function-that-solves-all-your-problems added in 1.1.98 if we want to keep up support for 1.1.12...

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 3 18:02:34 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 03 Aug 2006 18:02:34 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
Message-ID: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de>

Hi all,

I have already mentioned that lxml 1.1 will feature an alternative API, lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but written in Pyrex. The implementation is now nearing completion, so that 1.1beta will hopefully find its way towards cheeseshop early next week.
It allows you to access XML in a data-binding like style, so that you can do this: >>> root=XML('HALLOWORLD') >>> print root.a.b.c.d, '--', root.a.b.c.d[1] HALLO -- WORLD A complete description is here: http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt objectify also features an additional path language (ObjectPath) based on the normal object attribute access scheme. It is implemented independent of the actual objectify API so that it can be used without switching the Element implementation over to 'objectify'. The language accepts expressions like the two used above, just written as strings or lists: * "root.a.{someNamespace}b.c.d" * "root.a.b.c.d[1]" * ".a.b.c" * ['root', '{otherNamespace}a'] * ['root', 'a', 'b', 'c', '{andAnotherNamespace}d[1]'] Here are a few timeit benchmarks: Setup: from lxml.elements.objectify import register, ObjectPath register() from lxml.etree import XML root = XML('') Normal Python object access tests for comparison: * root.a.b.c.d 10000 loops, best of 3: 16.3 usec per loop * root.a.b.c.d[0] 10000 loops, best of 3: 16.7 usec per loop * root.a.b.c.d[2] 100000 loops, best of 3: 18.4 usec per loop ObjectPath tests *without* parsing, i.e. 
timings of the call "path(root)" after an additional Setup as follows: * path = ObjectPath('root.a.b.c.d') 100000 loops, best of 3: 2.76 usec per loop * path = ObjectPath('root.a.b.c.d[0]') 100000 loops, best of 3: 2.77 usec per loop * path = ObjectPath('root.a.b.c.d[2]') 100000 loops, best of 3: 2.85 usec per loop Including parsing: * "path=ObjectPath('root.a.b.c.d'); path(root)" 10000 loops, best of 3: 27 usec per loop * "path=ObjectPath('root.a.b.c.d[2]'); path(root)" 10000 loops, best of 3: 29.7 usec per loop The same based on lists: * "path=ObjectPath(['root', 'a', 'b', 'c', 'd']); path(root)" 10000 loops, best of 3: 16.7 usec per loop * "path=ObjectPath(['root', 'a', 'b', 'c', 'd[2]']); path(root)" 10000 loops, best of 3: 18 usec per loop As you can see, the parser is not the fastest, especially for strings. It actually uses REs internally, as ObjectPath expressions are non trivial to parse (namespaces, indexes, ...). However, once the expression is parsed, element access is impressively fast, as it runs entirely in C. In the limited area of its applicability, it is even faster than full fledged XPath: * Setup: path=XPath('/root/a/b/c/d') Timing: "path(root)" 10000 loops, best of 3: 10.4 usec per loop * Timing: "path=XPath('/root/a/b/c/d'); path(root)" 10000 loops, best of 3: 44.8 usec per loop So I hope people find it useful. Stefan From faassen at infrae.com Thu Aug 3 19:51:15 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu, 03 Aug 2006 19:51:15 +0200 Subject: [lxml-dev] objectify, ObjectPath and Benchmarks In-Reply-To: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> Message-ID: <44D23793.5040402@infrae.com> Stefan Behnel wrote: > I have already mentioned that lxml 1.1 will feature an alternative API, > lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but > written in Pyrex. 
The implementation is now nearing completion, so that > 1.1beta will hopefully find its way towards cheeseshop early next week. > > It allows you to access XML in a data-binding like style, so that you can do this: > > >>> root=XML('HALLOWORLD') > >>> print root.a.b.c.d, '--', root.a.b.c.d[1] > HALLO -- WORLD > > A complete description is here: > http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt > > objectify also features an additional path language (ObjectPath) based on the > normal object attribute access scheme. It is implemented independent of the > actual objectify API so that it can be used without switching the Element > implementation over to 'objectify'. While I'm quite interested in these developments I'm afraid I'm going to ask some difficult questions here. This is not criticism of these developments per-se, but it's a question about what lxml is all about and how we want to present these new technologies to users. Module separation: I notice the ObjectPath language is implemented in the 'objectify' module, but this looks like it really should be a separate module, it being an independent extension to lxml that does not rely on the other objectify stuff, as you mention. Use cases: What is the underlying thought? When would you recommend people to use ObjectPath instead of XPath or the .find() syntax? Technical comment: I also see that the ObjectPath parser is implemented in a rather low-level Pyrex formulation. Since you say that this parser is slow anyway, wouldn't it make sense to maintain this as straight Python instead? It would also be nice if we could make this parser and a pure-python implementation available for ElementTree itself. Global switch for objectify: As I mentioned before I'm still quite worried about switching the entire world over to objectify with a single global call. I really think this should be specified by using a different tree constructor. 
It just sounds too dangerous to me to globally switch the behavior of the whole API.

In the 'classic' way of using the namespace registry, custom element classes are typically registered for particular elements in particular namespaces. Objectify however fundamentally alters the behavior of the entire system. I understood from your previous reply that you were working on ways to make this settable per-tree; did I understand that correctly? I'd recommend making it the normal way to invoke the objectify behavior, not global.

Now to the biggest item of my concern...

Nature of lxml: The addition of a different data-binding model and different path language specific to lxml worries me quite a bit as we're reinventing wheels here, something that was not the original idea of lxml. The original idea of lxml was to try to stick to an existing API (ElementTree) as much as possible, along with existing XML standards (XPath, for instance) and build things on top of existing underlying technology (libxml2 and libxslt). This idea is quite dear to me and I consider this to be one of the reasons lxml seems reasonably successful among developers: it does not make people learn too many new things, and tries to minimize the learning needed that's unique to lxml and no other system.

The objectify data binding model is however a fundamentally new data-binding API: instead of the Amara or gnosis.objectify API we've created our own version. There are good reasons for this, and ElementTree is of course not the end of XML representations for Python.

The question however arises whether these innovations should be maintained as core lxml... I'm worried we're offering developers too many alternatives here: two tree representations (elementtree and objectify), three path languages (.find(), XPath and ObjectPath), which includes two ways completely unique to lxml.

Could these new things be shipped in a separate package instead, at least for now?
I understand that the capi work, along with eggs, should make this relatively easy. We could even have it share the lxml namespace package, so it could still be called 'lxml.objectify' (and 'lxml.objectpath' as I'd suggest), or, alternatively, we could introduce a new 'lxmlext' namespace to maintain things like this. I'm quite concerned with how we present these to developers. I'd prefer a separate product identity, with a separate set of web pages (part of the larger lxml website but explicitly not described as 'core') and a separate packaging. Again, my questions and recommendations are not to discourage these developments. This kind of innovation certainly should be encouraged. I do worry about the proper place and the way these things are done. In the rush to innovate I don't want to lose track of the original goals of lxml. I sincerely hope we can work this out together. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 3 23:24:48 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 03 Aug 2006 23:24:48 +0200 Subject: [lxml-dev] objectify, ObjectPath and Benchmarks In-Reply-To: <44D23793.5040402@infrae.com> References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com> Message-ID: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de> Hi Martijn, thanks for your feedback, your questions are definitely worth asking. Martijn Faassen wrote: > Module separation: I notice the ObjectPath language is implemented in > the 'objectify' module, but this looks like it really should be a > separate module, it being an independent extension to lxml that does not > rely on the other objectify stuff, as you mention. > > Use cases: What is the underlying thought? When would you recommend > people to use ObjectPath instead of XPath or the .find() syntax? It's mainly meant to accompany objectify, that's why it's (currently) implemented in the same module. 
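To make the relation between the two concrete, here is a small sketch of attribute access next to the equivalent ObjectPath expressions. It assumes the module is importable as lxml.objectify, as in current lxml releases (at the time of this thread it lived in lxml.elements.objectify), and the sample document is reconstructed from the example output earlier in the thread, so treat it as illustrative:

```python
from lxml import objectify

# Sample document; an assumption based on the "HALLO -- WORLD" example.
root = objectify.fromstring(
    "<root><a><b><c><d>HALLO</d><d>WORLD</d></c></b></a></root>")

# Plain attribute access instantiates each element along the way.
by_attribute = root.a.b.c.d[1]

# An ObjectPath is parsed once and can then be applied repeatedly.
first = objectify.ObjectPath("root.a.b.c.d")
second = objectify.ObjectPath("root.a.b.c.d[1]")

print(first(root).text, "--", second(root).text)
print(by_attribute.text)
```

Both forms address the same elements; the path object only avoids re-parsing the expression and creating the intermediate Python proxies.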
The reason why I said it's independent is purely out of technical considerations. It uses the same semantics and the same idea behind the API, so it's very closely related at the semantic level. XPath and ElementPath do not have their own module either, BTW, although they are almost as different compared to each other as compared to ObjectPath. The latter borrows from both (namespaces from ET, indexes from XPath), as well as from Python's object access pattern (the dot separator). > Technical comment: I also see that the ObjectPath parser is implemented > in a rather low-level Pyrex formulation. Since you say that this parser > is slow anyway, wouldn't it make sense to maintain this as straight > Python instead? It would also be nice if we could make this parser and a > pure-python implementation available for ElementTree itself. I agree that it could be worth having it available for ET, too, that would extend ET in the same way this now extends lxml.etree. However, you would then want to have an objectify module for ET, also, as this is where the path semantics actually come from. Also, the parser is not /that/ slow in its current incarnation. It's actually almost twice as fast as the (admittedly much more complex) XPath parser of libxml2. I don't think a pure Python version could be anywhere close to that. Also, the parser is very closely tied into the evaluator, so writing one of them in pure Python would make both considerably slower. So the thing is, as long as ObjectPath is used as part of lxml's objectify API, it should be optimised for the internal implementation. After all, one of the main goals of ObjectPath is to avoid instantiating all elements along the path and instead traversing the tree in plain C. > Global switch for objectify: As I mentioned before I'm still quite > worried about switching the entire world over to objectify with a single > global call. I really think this should be specified by using a > different tree constructor. 
It just too sounds dangerous to me to > globally switch the behavior of the whole API. > > In the 'classic' way of using the namespace registry, custom element > classes are typically registered for particular elements in particular > namespaces. Objectify however fundamentally alters the behavior of the > entire system. I understood from your previous reply that you were > working on ways to this settable per-tre; did I understand that > correctly? I'd recommend making it the normal way to invoke the > objectify behavior, not global. Ok, sure. The lxml.elements.classlookup module has (amongst other things) a per-parser lookup implementation. I guess you'd want that to become the preferred way of using objectify and I think that's a good idea. Currently, the docs only present that as an alternative (4th paragraph): http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt That part could be rewritten to make the global registry the alternative. > Now to the biggest item of my concern... > > Nature of lxml: The addition of a different data-binding model and > different path language specific to lxml worries me quite a bit as we're > reinventing wheels here, something not the original idea of lxml. The > original idea of lxml was to try to stick to an existing API > (ElementTree) as much as possible, along with existing XML standards > (XPath, for instance) and build things on top of existing underlying > technology (libxml2 and libxslt). This idea is quite dear to me and I > consider this to be one of the reasons lxml seems reasonably succesful > among developers: it does not make people learn too many new things, and > tries to minimize the learning needed that's unique to lxml and no other > system. I see your point and I agree that this is desirable. After all, there is not that much new in objectify either. Most of the Element API stays the same as in ET. The object access pattern looks (and feels) like normal Python objects and clearly borrows from Amara. 
> The objectify data binding model is however a fundamentally new > data-binding API: instead of the Amara or gnosis.objectify API we've > created our own version. There are good reasons for this, and > ElementTree is of course not the end of XML representations for Python. The main reason why it does not aim to be Amara compatible is that it inherits from ElementTree. It does not /need/ all the things for which Amara had to invent its own API as all of that is already part of the ET API. So the reason why this is a new API is that it allows it to integrate with lxml.etree. > The question however arises whether these innovations should be > maintained as core lxml... I'm worried we're offering developers too > many alternatives here: two tree representations (elementtree and > objectify), three path languages (.find(), XPath and ObjectPath), which > includes two ways completely unique to lxml. > > Could these new things be shipped in a separate package instead, at > least for now? I understand that the capi work, along with eggs, should > make this relatively easy. We could even have it share the lxml > namespace package, so it could still be called 'lxml.objectify' (and > 'lxml.objectpath' as I'd suggest), or, alternatively, we could introduce > a new 'lxmlext' namespace to maintain things like this. I started with "lxml.elementlib", then it became "lxml.elements". The reason why I chose to put the new stuff into a subpackage (not only submodules) was that I wanted to separate it from the core lxml. :) I don't mind giving it a better name and I would not even mind separating the packages into different eggs. It's not a problem technically, even version dependencies could be handled by setuptools. So it's mainly a matter of presentation. For example, the classlookup module would then have to stay a part of lxml (or could even be merged into lxml.etree), while the objectify module could become a separate distribution. 
> I'm quite concerned with how we present these to developers. I'd prefer > a separate product identity, with a separate set of web pages (part of > the larger lxml website but explicitly not described as 'core') and a > separate packaging. Hmmm, that would really make it a separate product. Do you really think it's worth it? It still requires lxml.etree to run and shares most of the API, so, to learn objectify, you'd have to learn lxml.etree. It's just that objectify would be better hidden from people who only want to use lxml.etree. Isn't a subpackage enough for that purpose? Maybe call it lxml.objectify to make it clear that it's more or less at a comparable level as lxml.etree itself. > Again, my questions and recommendations are not to discourage these > developments. This kind of innovation certainly should be encouraged. I > do worry about the proper place and the way these things are done. In > the rush to innovate I don't want to lose track of the original goals of > lxml. > > I sincerely hope we can work this out together. So do I. It's definitely the right time to discuss this now, before the release of 1.1 (and preferably also before the release of 1.1beta, which is supposed to be feature complete). Thanks for bringing up this discussion. Regards, Stefan From faassen at infrae.com Fri Aug 4 10:14:28 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 04 Aug 2006 10:14:28 +0200 Subject: [lxml-dev] objectify, ObjectPath and Benchmarks In-Reply-To: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de> References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com> <44D269A0.3040700@gkec.informatik.tu-darmstadt.de> Message-ID: <44D301E4.50800@infrae.com> Hey Stefan, Thanks for your constructive reply! Stefan Behnel wrote: > Martijn Faassen wrote: [snip smaller issues] >> Technical comment: I also see that the ObjectPath parser is >> implemented in a rather low-level Pyrex formulation. 
Since you say >> that this parser is slow anyway, wouldn't it make sense to maintain >> this as straight Python instead? It would also be nice if we could >> make this parser and a pure-python implementation available for >> ElementTree itself. > > I agree that it could be worth having it available for ET, too, that > would extend ET in the same way this now extends lxml.etree. However, > you would then want to have an objectify module for ET, also, as this > is where the path semantics actually come from. Not necessarily so, but yeah, that makes sense. Anyway, I cannot require an objectify module for ET. :) Separating out the ObjectPath is not that important then, though technically it would be possible to keep the implementation separate. [snip explanation about parser performance] > So the thing is, as long as ObjectPath is used as part of lxml's > objectify API, it should be optimised for the internal > implementation. After all, one of the main goals of ObjectPath is to > avoid instantiating all elements along the path and instead > traversing the tree in plain C. Okay, makes sense. Main usecase of ObjectPath are what, then? Performance is one, the other being traversing the tree in an 'objectify' way? When would I pick it above XPath or elementpath? Is the main answer: when I'm using objectify? [snip my worries about global switch for objectify] > Ok, sure. The lxml.elements.classlookup module has (amongst other > things) a per-parser lookup implementation. I guess you'd want that > to become the preferred way of using objectify and I think that's a > good idea. Yes, that would be preferred. > Currently, the docs only present that as an alternative (4th > paragraph): > http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt > That part > could be rewritten to make the global registry the alternative. 
I think we should do that, or perhaps even not mention the global registry at all but briefly mentioning that you could do something to it to make the whole of lxml work that way for your entire program... We could also consider just offering an API to register globally at all so they don't become tempted. :) >> Now to the biggest item of my concern... >> >> Nature of lxml: The addition of a different data-binding model and >> different path language specific to lxml worries me quite a bit as >> we're reinventing wheels here, something not the original idea of >> lxml. The original idea of lxml was to try to stick to an existing >> API (ElementTree) as much as possible, along with existing XML >> standards (XPath, for instance) and build things on top of existing >> underlying technology (libxml2 and libxslt). This idea is quite >> dear to me and I consider this to be one of the reasons lxml seems >> reasonably succesful among developers: it does not make people >> learn too many new things, and tries to minimize the learning >> needed that's unique to lxml and no other system. > > I see your point and I agree that this is desirable. After all, there > is not that much new in objectify either. Most of the Element API > stays the same as in ET. The object access pattern looks (and feels) > like normal Python objects and clearly borrows from Amara. While there's not much new in objectify, and while I agree that we're borrowing (hopefully) the best ideas from other implementations, we are crossing into the territory of inventing a new XML Python tree API here. It's a somewhat grey area on how much we're inventing and how much people need to learn, but I think we're going far enough to stop and think for a bit nonetheless. >> The objectify data binding model is however a fundamentally new >> data-binding API: instead of the Amara or gnosis.objectify API >> we've created our own version. 
There are good reasons for this, and >> ElementTree is of course not the end of XML representations for >> Python. > > The main reason why it does not aim to be Amara compatible is that it > inherits from ElementTree. It does not /need/ all the things for > which Amara had to invent its own API as all of that is already part > of the ET API. So the reason why this is a new API is that it allows > it to integrate with lxml.etree. Yes, that's part of the 'good reasons' I mentioned. :) There is no debate that there are good reasons and that this is a valuable development. My concern is with its presentation to innocent new developers that start looking at lxml. What's the story we want to tell them? We have these two APIs, which are similar but not identical, and you should pick one over the other, when? >> The question however arises whether these innovations should be >> maintained as core lxml... I'm worried we're offering developers >> too many alternatives here: two tree representations (elementtree >> and objectify), three path languages (.find(), XPath and >> ObjectPath), which includes two ways completely unique to lxml. >> >> Could these new things be shipped in a separate package instead, at >> least for now? I understand that the capi work, along with eggs, >> should make this relatively easy. We could even have it share the >> lxml namespace package, so it could still be called >> 'lxml.objectify' (and 'lxml.objectpath' as I'd suggest), or, >> alternatively, we could introduce a new 'lxmlext' namespace to >> maintain things like this. > > I started with "lxml.elementlib", then it became "lxml.elements". The > reason why I chose to put the new stuff into a subpackage (not only > submodules) was that I wanted to separate it from the core lxml. :) Yes, I can see that. I think 'objectify' is a good name, though perhaps a bit worrying we clash with gnosis.objectify. 
> I don't mind giving it a better name and I would not even mind > separating the packages into different eggs. It's not a problem > technically, even version dependencies could be handled by > setuptools. So it's mainly a matter of presentation. For example, the > classlookup module would then have to stay a part of lxml (or could > even be merged into lxml.etree), while the objectify module could > become a separate distribution. I think it makes sense for classlookup to remain part of the core. The *facility* to create new databinding APIs for lxml should be core - I have no beef with that and think it's a very powerful feature. The actual implementation of a new databinding API on top of lxml I'd prefer to be outside of the core, however. >> I'm quite concerned with how we present these to developers. I'd >> prefer a separate product identity, with a separate set of web >> pages (part of the larger lxml website but explicitly not described >> as 'core') and a separate packaging. > > Hmmm, that would really make it a separate product. Do you really > think it's worth it? It still requires lxml.etree to run and shares > most of the API, so, to learn objectify, you'd have to learn > lxml.etree. Understood. I realize that objectify leans heavily on the ET API. Then again, it also strongly changes the experience. I'm not proposing new people come into objectify and then never have to learn about lxml.etree. I'm just trying to make sure that when people run into lxml, they don't have to spend a lot of mental bandwidth to worry about what objectify is, when to use it, etc. If it's clear to them it's there that it's not core, that they don't need to worry about it at all, and that it's there when they want it, that would help. So far, most or all of the things in lxml are at least potentially familiar to a newcomer, if they're familiar with various XML standards and ElementTree. The new bits are the APIs we invented to glue them all together. 
objectify alters that in the sense that it's not an API used to glue these things together and it's also not an API people can be familiar with when they come in from the outside. It's a gradual step in many ways, but I think a significant one. > It's just that objectify would be better hidden from people who only > want to use lxml.etree. I don't think 'hidden' is the right word. I'd like to give objectify prominence, while also making it very clear in a developer's mind that this is a separate development, heavily tied into lxml and part of the lxml projects, but not something you have to buy into when you use lxml core. > Isn't a subpackage enough for that purpose? Maybe call it > lxml.objectify to make it clear that it's more or less at a > comparable level as lxml.etree itself. I would prefer a clearly marked difference. If we call it 'lxml.objectify', but maintain it as an egg outside the core (lxml being the shared namespace package), we'll have a large step taken already. We don't need to necessarily split up the svn repository if we can generate both eggs independently from the same repository. We should also be careful in organizing our documentation and website to make clear that objectify is an extension to the core part, and that people do not have to worry about it when they come to lxml. I think we can do this so that objectify is not hidden, but also clearly separate from the core development. I realize that this is a hassle and it's on the edge of being worth it or not, but I think it'd be valuable. On a personal note, I'm going on a short trip and won't be able to communicate on this further until next week Thursday or Friday. Not too big a problem: I said what I wanted to say possibly too voluminously already. I'd be curious to see what other people's opinions are on these topics, so perhaps I'll see that when I get back. I also fully trust you'll make the right decisions if you want to proceed with a 1.1 beta release while I'm away.
Regards, Martijn From elephantum at yandex.ru Fri Aug 4 10:52:30 2006 From: elephantum at yandex.ru (Татаринов Андрей) Date: Fri, 04 Aug 2006 12:52:30 +0400 Subject: [lxml-dev] lxml goals Message-ID: <145471154681550@webmail5.yandex.ru> Hi, This is all very interesting, but the one thing I can't understand is what it has in common with lxml. In fact, for quite some time I have not understood the goal of the lxml project. At first it was "ElementTree on top of libxml2", then it became more and more bloated with ET-incompatible API, and now a program that uses lxml cannot easily be ported back to ElementTree. Maybe it's time to split it into an ET implementation and lxml-specific parts? Or to say "lxml is no longer just an ElementTree implementation, but a separate project with its own idioms"? 03.08.06, 20:02, Stefan Behnel : > Hi all, > I have already mentioned that lxml 1.1 will feature an alternative API, > lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but > written in Pyrex. The implementation is now nearing completion, so that > 1.1beta will hopefully find its way towards cheeseshop early next week. [...] > Stefan From faassen at infrae.com Fri Aug 4 11:45:11 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 04 Aug 2006 11:45:11 +0200 Subject: [lxml-dev] lxml goals In-Reply-To: <145471154681550@webmail5.yandex.ru> References: <145471154681550@webmail5.yandex.ru> Message-ID: <44D31727.8090706@infrae.com> Татаринов Андрей wrote: > This is all very interesting, but the one thing I can't understand > is what it has in common with lxml. > > In fact, for quite some time I have not understood the goal of the lxml > project. At first it was "ElementTree on top of libxml2", then it > became more and more bloated with ET-incompatible API, and now a program > that uses lxml cannot easily be ported back to ElementTree. I don't think it's fair to say that our API is ET-incompatible.
lxml's API is as compatible with ElementTree as we can make it, and we've expended quite some effort in making it so. We've *extended* the API to expose a host of features in libxml2 and libxslt. For instance, we expose namespace prefixes in lxml.etree where ET does not. I do not consider these extensions as bloat but as important functionality. The API also got extended with a facility to hook in custom element classes for particular elements. This is an extension to the ET model which due to its nature needs to be done in the core. I think this is a nice and powerful feature. Stefan has now built other facilities on top of this that are unique to lxml. This is where I asked about goals. > Maybe it's time to split it into an ET implementation and lxml-specific parts? > Or to say "lxml is no longer just an ElementTree implementation, but a > separate project with its own idioms"? lxml has always been *more* than just an ElementTree implementation; if it were just an ElementTree implementation there'd be no point in doing our work. It's an ElementTree implementation that exposes a host of XML technologies implemented in libxml2 and libxslt. It's a Python XML library with support for XPath, XSLT, Relax NG, and so on. The objectify extensions, yes, we could present as a separate project with its own idioms. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 11:39:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 11:39:11 +0200 Subject: [lxml-dev] objectify, ObjectPath and Benchmarks In-Reply-To: <44D301E4.50800@infrae.com> References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com> <44D269A0.3040700@gkec.informatik.tu-darmstadt.de> <44D301E4.50800@infrae.com> Message-ID: <44D315BF.7020204@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Main usecase of ObjectPath are what, then?
Performance is one, the other > being traversing the tree in an 'objectify' way? When would I pick it > above XPath or elementpath? Is the main answer: when I'm using objectify? Guess so. That would be another reason for leaving it inside the objectify module. I think the whole idea of ObjectPath is so much tied into objectify that it would not make sense to use one without the other. If you only want to use lxml.etree, you should be pretty well served with XPath. If, however, you want to use objectify, it's convenient to have a (fast) path language that matches the API. So, we should just separate etree and objectify and leave the rest as is. >> The lxml.elements.classlookup module has (amongst other >> things) a per-parser lookup implementation. I guess you'd want that >> to become the preferred way of using objectify and I think that's a >> good idea. > > Yes, that would be preferred. Ok, I'll fix it in the docs. > There is no > debate that there are good reasons and that this is a valuable > development. My concern is with its presentation to innocent new > developers that start looking at lxml. What's the story we want to tell > them? We have these two APIs, which are similar but not identical, and > you should pick one over the other, when? That should go into a FAQ entry, I guess. Something like this: Basically, they are two different approaches to XML: Python-like data-binding and a generic API for XML handling. * The ET API is more generic and does not require any knowledge about the XML structure that is treated. It supports more or less the entire XML infoset. Besides, it is very well suited for mixed and document-like content (including HTML). * The objectify API is very data centred and schema/structure focused. It does not support document-like XML (or HTML), but it's very convenient for handling Python(-like) data types stored in XML. 
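To make the contrast in the two points above a bit more concrete, here is a minimal sketch of the two access styles. It deliberately uses only the standard library's xml.etree.ElementTree; the `Bound` wrapper class, its `value` property and the sample document are hypothetical illustrations of the data-binding idea and are not how objectify is implemented or what its API looks like.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML = (
    "<order>"
    "<customer>Alice</customer>"
    "<item><name>Widget</name><qty>3</qty></item>"
    "</order>"
)

# Generic ET-style access: navigate by tag name, read .text,
# and convert the string to a Python type yourself.
root = ET.fromstring(XML)
qty_generic = int(root.find("item").find("qty").text)


class Bound:
    """Hypothetical data-binding wrapper (NOT objectify's implementation).

    Child elements are reached as attributes, and leaf text is
    converted to an int where possible.
    """

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so `name` is treated as a child tag name.
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        return Bound(child)

    @property
    def value(self):
        # Note: this property would shadow a child tag called "value".
        text = (self._element.text or "").strip()
        try:
            return int(text)
        except ValueError:
            return text


order = Bound(root)
qty_bound = order.item.qty.value
customer = order.customer.value
```

The generic style works for any XML, including mixed content, because nothing is assumed about the structure. The attribute style only pays off when the document looks like nested records with known tag names, which is the narrower scope described for the data-binding approach.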
So, objectify has a more convenient API in a smaller application scope, while ET is broadly applicable to everything that's XML. > I think 'objectify' is a good name, though perhaps > a bit worrying we clash with gnosis.objectify. What about calling it "objectic", then? Sounds similar, but still different enough to make it clear that it's not the same as gnosis.objectify or Amara. Google gives 862 hits on "objectic", and even 48 on "objectique". Not much of a chance to have a name clash with those. :) Then again, "objectify" has a meaning that pretty much fits its idea. Hmmm, I guess "objectify" is just fine as a name... > I think it makes sense for classlookup to remain part of the core. > The *facility* to create new databinding APIs for lxml should be core - > I have no beef with that and think it's a very powerful feature. The > actual implementation of a new databinding API on top of lxml I'd prefer > to be outside of the core, however. Understood. But then, classlookup is pretty lonely in lxml.elements. I should just merge it into lxml.etree. It's not much code and parts of it actually are already in etree (like the normal NS lookup and the per-parser stuff). > I realize that objectify leans heavily on the ET API. Then > again, it also strongly changes the experience. I'm not proposing new > people come into objectify and then never have to learn about > lxml.etree. I'm just trying to make sure that when people run into lxml, > they don't have to spend a lot of mental bandwidth to worry about what > objectify is, when to use it, etc. If it's clear to them it's there that > it's not core, that they don't need to worry about it at all, and that > it's there when they want it, that would help. > > So far, most or all of the things in lxml are at least potentially > familiar to a newcomer, if they're familiar with various XML standards > and ElementTree. The new bits are the APIs we invented to glue them all > together. 
objectify alters that in the sense that it's not an API used > to glue these things together and it's also not an API people can be > familiar with when they come in from the outside. It's a gradual step in > many ways, but I think a significant one. > > I'd like to give objectify > prominence, while also making it very clear in a developer's mind that > this is a separate development, heavily tied into lxml and part of the > lxml projects, but not something you have to buy into when you use lxml > core. > > I would prefer a clearly marked difference. If we call it > 'lxml.objectify', but maintain it as an egg outside the core (lxml being > the shared namespace package), we'll have a large step taken already. > We don't need to necessarily split up the svn repository if we can > generate both eggs independently from the same repository. Ok, I understand your concerns and I think they are valid. We should really give users easy guidelines through the package, so that they do not have to read tons of pages to understand where to /start/. That said, I believe that it's totally a good thing to provide different APIs on top of the same infrastructure. Things like parsing, XSLT, RNG, XPath, etc. work exactly the same way for all of them, so you only have to learn them once and can then freely choose the API that fits your current use case, without restarting from scratch and without any incompatibilities or differing capabilities of the library itself. 
So the proposal is: * merge lxml.elements.classlookup into lxml.etree * make both APIs stand side-by-side in the lxml package: lxml.etree and lxml.objectify * make it clear in the docs (and the FAQ) that they provide different APIs and how they differ, so that people can easily decide which suits their needs, without first needing to understand the details Not required for 1.1beta (but likely in 1.1): * build separate packages from setup.py: "lxml" and "lxml-objectify" (not too much of a big deal technically, BTW), where lxml-objectify requires lxml via setuptools. > We should also be careful in organizing our documentation and website to > make clear that objectify is an extension to the core part, and that > people do not have to worry about it when they come to lxml. I think we > can do this so that objectify is not hidden, but also clearly separate > from the core development. Sure. It already has its own page, which is somewhat similar to api.txt in spirit. So we should reorganise the doc section in main.txt to tell the users about both and how we see them in comparison. > On a personal note, I'm going on a short trip and won't be able to > communicate on this further until next week Thursday or Friday. I'll actually be almost away by then and come back at the end of August. So I'll try to get 1.1beta out early next week and 1.1 final when I come back (and find all those nasty little bugs reported on the list... :) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 12:24:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 12:24:43 +0200 Subject: [lxml-dev] lxml goals In-Reply-To: <145471154681550@webmail5.yandex.ru> References: <145471154681550@webmail5.yandex.ru> Message-ID: <44D3206B.6020409@gkec.informatik.tu-darmstadt.de> Hi Татаринов, Татаринов Андрей wrote: > In fact, for quite some time I have not understood the goal of the lxml > project.
At first it was "ElementTree on top of libxml2", Well, look closely. As cheeseshop puts it: """ lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more. """ So it *safely* *extends* the *ElementTree API* in a *pythonic* way. Those are the main goals: Be pythonic, safe, compatible with ET, and more comprehensive (named in no particular order). As for being pythonic, BTW, I think that objectify is one of the most pythonic ways of handling XML in Python. But maybe that's just me - and believe me, I'm biased... > then it became > more and more bloated with ET-incompatible API, and now a program that uses lxml > cannot easily be ported back to ElementTree. I acknowledge that you are not a native English speaker, but I'd still be a bit more careful with words like "bloated" and "incompatible". There are very few places where lxml is incompatible with ET, and I believe that these spots are there for very good reasons. Some differ in pure legacy design decisions that were originally taken by ET (like for processing instructions), others result from restrictions posed by libxml2 (like the single parent issue). And I would not say that lxml is bloated in any way. All that is in there is actually a) useful or b) helpful or c) for compatibility or d) for any combination of the three. Martijn and I have taken care (and are still taking care, as this discussion shows) that the API stays consistent in itself and as close to existing APIs as possible, major points of influence being the ET API and the Python language idioms. Sure, in such a large library, you will never require every bit for your application. But different applications have different requirements, and I think lxml serves quite a large set of requirements in the XML area by now.
And we are always concerned about keeping the specific subset required for an application easily accessible. > Maybe it's time to split it into an ET implementation and lxml-specific parts? Well, you can't just split it. Most of the API and its extensions are tightly integrated and do not work in isolation. That's not only a technical problem, it's rather a problem of API consistency. There are some parts that could be separated out, like the namespace registry and class lookup, for example. Now that we have the infrastructure for external modules in place, they could be moved to a separate module. However, that would break existing code and change the internal behaviour of lxml, which currently defaults to supporting namespace lookup. Too bad. That's one for compatibility, then. But there are not many things in lxml that come to my mind when I look for concerns like this... > say "lxml is no longer just an ElementTree implementation, but a separate > project with its own idioms"? Well, it never *was* "just an ET implementation", just as the cheeseshop quote suggests. And as for lxml.objectify, it was never meant to become core technology in lxml. It's a separate API that inherits from ET, lxml, Amara and Python as much as possible, but otherwise stands on its own. It's not bloating lxml either.
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 12:50:45 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 12:50:45 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> <44D0D003.2080800@infrae.com> <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de> Message-ID: <44D32685.8000207@gkec.informatik.tu-darmstadt.de> Stefan Behnel wrote: > The difference between changing the dict on the parser context and on the XSLT > context is that the parser context does not use it before it is returned. > libxslt *might* store stuff in it, depending on the stylesheet. Ok, I looked through the libxslt source and cannot find a place where this is actually the case. According to the inline comments in transform.c, libxslt is supposed to use the dict for XSLT 'key' handling, but it doesn't look like that's true. (yeah, well, libx*** and documentation...) I could not even find the word 'dict' in the file keys.c ... So, given that insight, I'm now somewhat convinced that the patch I sent is actually harmless. So I'll just merge it in for 1.1beta and see what we get.
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 13:12:05 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 13:12:05 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <44D32685.8000207@gkec.informatik.tu-darmstadt.de> References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> <44D0D003.2080800@infrae.com> <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de> <44D32685.8000207@gkec.informatik.tu-darmstadt.de> Message-ID: <44D32B85.2080304@gkec.informatik.tu-darmstadt.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> The difference between changing the dict on the parser context and on the XSLT >> context is that the parser context does not use it before it is returned. >> libxslt *might* store stuff in it, depending on the stylesheet. > > Ok, I looked through the libxslt source and cannot find a place where this is > actually the case. According to the inline comments in transform.c, libxslt is > supposed to use the dict for XSLT 'key' handling, but it doesn't look like > that's true. (yeah, well, libx*** and documentation...) I could not even find > the word 'dict' in the file keys.c ... > > So, given that insight, I'm now somewhat convinced that the patch I sent is > actually harmless. So I'll just merge it in for 1.1beta and see what we get. Right before committing, I noticed that the original patch actually introduces threading problems, so here is a new patch that fixes it The Right Way. Stefan -------------- next part -------------- A non-text attachment was scrubbed...
Name: xslt-dict-replace.patch Type: text/x-patch Size: 2773 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060804/55a8db7d/attachment.bin From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 15:26:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 15:26:11 +0200 Subject: [lxml-dev] News from the 2.5 front Message-ID: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> Hi, just wanted to send a note that lxml.etree compiles nicely under Python 2.5b3 (AMD64) using the patched Pyrex version here: http://codespeak.net/svn/lxml/pyrex/ The only problem I currently encounter is a bug in linecache in 2.5's stdlib that prevents the doctests from running. Once that's solved, we can see if those tests pass as well. Stefan From fdrake at gmail.com Fri Aug 4 15:30:23 2006 From: fdrake at gmail.com (Fred Drake) Date: Fri, 4 Aug 2006 09:30:23 -0400 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> Message-ID: <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> On 8/4/06, Stefan Behnel wrote: > just wanted to send a note that lxml.etree compiles nicely under Python 2.5b3 > (AMD64) using the patched Pyrex version here: Woohoo! Thanks for testing this! > The only problem I currently encounter is a bug in linecache in 2.5's stdlib > that prevents the doctests from running. Once that's solved, we can see if > those tests pass as well. If there's really a bug in linecache, be sure to report it against Python on SourceForge so we can get it dealt with. -Fred -- Fred L. Drake, Jr. "Every sin is the result of a collaboration." 
--Lucius Annaeus Seneca From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 16:18:06 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 16:18:06 +0200 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> Message-ID: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> Hi Fred, Fred Drake wrote: > On 8/4/06, Stefan Behnel wrote: >> The only problem I currently encounter is a bug in linecache in 2.5's >> stdlib that prevents the doctests from running. Once that's solved, we >> can see if those tests pass as well. > > If there's really a bug in linecache, be sure to report it against Python > on SourceForge so we can get it dealt with. Oh, well. I did report it and then almost instantly got a TYOF back. The problem was: lxml used its own version of doctest.py, which was no longer compatible with 2.5. I always wondered where that came from and what it was good for. Should have asked long ago, I guess... Anyway, now it's gone and there's only one minor error in the test runs. I'll check if I can fix it. It's exception related, so it may still be a bug in the patched Pyrex version. Stefan From fdrake at gmail.com Fri Aug 4 16:27:03 2006 From: fdrake at gmail.com (Fred Drake) Date: Fri, 4 Aug 2006 10:27:03 -0400 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> Message-ID: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> On 8/4/06, Stefan Behnel wrote: > Oh, well. I did report it and then almost instantly got a TYOF back. The TYOF == "That's your own fault" ??? 
> problem was: lxml used its own version of doctest.py, which was no longer > compatible with 2.5. I always wondered where that came from and what it was > good for. Should have asked long ago, I guess... Hmm. There's a separate version in zope.testing as well. I've no idea if that's compatible with 2.5; there's so many other things that fall over with 2.5 it doesn't seem worthwhile to ask. > Anyway, now it's gone and there's only one minor error in the test runs. I'll > check if I can fix it. It's exception related, so it may still be a bug in the > patched Pyrex version. Ok. Let me know if there's anything I can help with on the 2.5 front. -Fred -- Fred L. Drake, Jr. "Every sin is the result of a collaboration." --Lucius Annaeus Seneca From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 16:52:04 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 04 Aug 2006 16:52:04 +0200 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> Message-ID: <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> Fred Drake wrote: > On 8/4/06, Stefan Behnel wrote: >> Oh, well. I did report it and then almost instantly got a TYOF back. The > > TYOF == "That's your own fault" ??? Yup. :) >> problem was: lxml used its own version of doctest.py, which was no longer >> compatible with 2.5. I always wondered where that came from and what >> it was >> good for. Should have asked long ago, I guess... > > Hmm. There's a separate version in zope.testing as well. I've no > idea if that's compatible with 2.5; there's so many other things that > fall over with 2.5 it doesn't seem worthwhile to ask. 
Apparently, they changed some monkeypatching stuff related to the "getlines()" function in linecache.py. It now has a different signature. :-/

>> Anyway, now it's gone and there's only one minor error in the test
>> runs. I'll
>> check if I can fix it. It's exception related, so it may still be a
>> bug in the
>> patched Pyrex version.
>
> Ok. Let me know if there's anything I can help with on the 2.5 front.

Thanks for offering help, that's always appreciated. :)

I'll give it some more investigation first.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 18:01:21 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 18:01:21 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com> <44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D36F51.9050105@gkec.informatik.tu-darmstadt.de>

Stefan Behnel wrote:
> Martijn Faassen wrote:
>> Global switch for objectify: As I mentioned before I'm still quite
>> worried about switching the entire world over to objectify with a single
>> global call. I really think this should be specified by using a
>> different tree constructor. It just sounds too dangerous to me to
>> globally switch the behavior of the whole API.
>>
>> In the 'classic' way of using the namespace registry, custom element
>> classes are typically registered for particular elements in particular
>> namespaces. Objectify however fundamentally alters the behavior of the
>> entire system. I understood from your previous reply that you were
>> working on ways to make this settable per-tree; did I understand that
>> correctly? I'd recommend making it the normal way to invoke the
>> objectify behavior, not global.
>
> Ok, sure. The lxml.elements.classlookup module has (amongst other things) a
> per-parser lookup implementation.
I guess you'd want that to become the > preferred way of using objectify and I think that's a good idea. > > Currently, the docs only present that as an alternative (4th paragraph): > http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt > That part could be rewritten to make the global registry the alternative. Now that I started rewriting the doc section, I noticed that a per-parser setup will not be very satisfactory. It will not affect XML() and also not the trees built by hand using Element() etc., as both use and inherit the default parser. So the only way to get an objectify tree in that case is through the parser API. Once a parsed node is there, however, new subelements will inherit the parser lookup scheme. This makes the per-parser setup not useless, but a bit less beautiful... Any ideas how this could get a little nicer? Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Aug 5 15:11:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 05 Aug 2006 15:11:08 +0200 Subject: [lxml-dev] Request for comments: Removing lxml.etree's default support for namespace class support Message-ID: <44D498EC.50507@gkec.informatik.tu-darmstadt.de> Hi all, I know, breaking compatibility is a serious topic, so I'm putting this here for an open discussion. This change would only impact code that uses the namespace class lookup to supply custom element classes to lxml.etree. Other code would continue to work. Currently, lxml.etree does namespace lookup for custom element classes by default. This has been the case in the 0.9 and 1.0 series. Starting with lxml 1.1, etree will support not only custom classes, but also custom lookup schemes for these classes. It includes a generic fallback mechanism from one lookup scheme to another if the first one fails. This means that the default support for namespace class lookup is becoming redundant, as it is also supported by a public class that provides the namespace lookup scheme. 
Also, the current scheme does not support a fallback other than the default element class, so code that wants to use the namespace lookup with a different fallback is still required to re-register both.

To remove this redundancy, to speed up the default setup if namespace classes are /not/ used and (above all) to make the lookup API more accessible, I would like to remove the default for namespace lookup and replace it by the simplest possible mechanism that always returns the normal element classes. If namespace lookup support is needed, something like the following code would be required at setup time:

    from lxml import etree
    try:
        lookup = etree.ElementNamespaceClassLookup()
    except AttributeError:
        # lxml >= 0.9 and < 1.1 supports this by default
        pass
    else:
        # lxml >= 1.1 requires an explicit setup
        etree.setElementClassLookup(lookup)

This code block is backwards compatible with lxml 0.9 and lxml 1.0, so new code that requires namespace class lookup could continue to support lxml from version 0.9 on, while older code that uses namespace classes would have to be updated with the above code block to support lxml 1.1 and later. Doing this switch *now* makes the above code pretty short; later changes would require version checking and the like.

One of the main reasons for this change is that I would like to make the lookup mechanism explicit and visible. It is a global property that impacts the entire library. Users who do not need to install their own custom classes should not be bothered with it, i.e. should be able to ignore the lookup API, the Namespace class registry, etc. For those who need a different mechanism, I believe that the current default does not make it visible enough that (for example) the functionality of the "Namespace" class registry is disabled if you select a different class lookup mechanism.
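
To make the idea of "one lookup scheme falling back to another" concrete, here is a rough pure-Python sketch. It is only an illustration of the delegation pattern described in this proposal: the names DefaultLookup, NamespaceLookup, register() and lookup() are hypothetical and are not lxml's API.

```python
# Hypothetical sketch of a lookup chain with a fallback.
# None of these classes are part of lxml; they only model the idea.

class DefaultElement(object):
    """Stand-in for the normal element class."""


class DefaultLookup(object):
    """Simplest possible scheme: always returns one fixed class."""

    def __init__(self, element_class=DefaultElement):
        self.element_class = element_class

    def lookup(self, namespace, tag):
        return self.element_class


class NamespaceLookup(object):
    """Finds classes by (namespace, tag) and delegates to a fallback on a miss."""

    def __init__(self, fallback=None):
        self.fallback = fallback if fallback is not None else DefaultLookup()
        self._registry = {}

    def register(self, namespace, tag, element_class):
        self._registry[(namespace, tag)] = element_class

    def lookup(self, namespace, tag):
        element_class = self._registry.get((namespace, tag))
        if element_class is None:
            # Nothing registered for this name: ask the next scheme in the chain.
            return self.fallback.lookup(namespace, tag)
        return element_class


class HonkElement(DefaultElement):
    """Custom class registered for one specific tag."""


class MyDefault(DefaultElement):
    """Custom class used whenever nothing more specific is registered."""


lookup = NamespaceLookup(fallback=DefaultLookup(MyDefault))
lookup.register("http://example.org/ns", "honk", HonkElement)

registered = lookup.lookup("http://example.org/ns", "honk")
unregistered = lookup.lookup("http://example.org/ns", "other")
```

In this sketch a registered name yields its custom class and every other name falls through to the fallback's class, which is the behaviour the current default cannot offer without re-registering both.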
So the new custom class support would work like this:

* if no custom classes are used, no configuration is needed
* any support for custom classes requires setting up a lookup scheme
* changing the default class is done by creating and setting a default lookup scheme based on the new default classes
* using the namespace lookup requires setting the ns lookup scheme, which then enables lookups based on the global Namespace registry
* setting a per-parser lookup scheme enables delegation to the specific lookup registered with a parser, which in turn can deploy any of the available schemes and defaults to using the normal classes

I'm also considering replicating the Namespace registry locally in the ElementNamespaceClassLookup class. This would allow things like a per-parser namespace registry and the like. I think removing the default would also help in getting this cleaner.

I'm really interested in hearing opinions on this. I think the above compatibility code makes the switch trivial to do, but I would like to hear if there are other impacts of this change that I might not have thought of.

Stefan

From jkrukoff at ltgc.com Sat Aug 5 23:47:56 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Sat, 5 Aug 2006 15:47:56 -0600
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D32B85.2080304@gkec.informatik.tu-darmstadt.de>
Message-ID: <001801c6b8d8$d07bf870$051ea8c0@naomi>

> Right before committing, I noticed that the original patch actually
> introduces threading problems, so here is a new patch that fixes it The
> Right Way.
> Stefan

I attempted to apply this patch against the lxml 1.0.2 release version, and had no luck. Do I need to be pulling 1.1 from svn to get this fix?
--------- John Krukoff jkrukoff at ltgc.com From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 07:33:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 06 Aug 2006 07:33:08 +0200 Subject: [lxml-dev] Segfault in lxml during element copy In-Reply-To: <001801c6b8d8$d07bf870$051ea8c0@naomi> References: <001801c6b8d8$d07bf870$051ea8c0@naomi> Message-ID: <44D57F14.3000105@gkec.informatik.tu-darmstadt.de> Hi John, John Krukoff wrote: >> Right before committing, I noticed that the original patch actually >> introduces threading problems, so here is a new patch that fixes it The >> Right Way. > > I attempted to apply this patch against the lxml 1.0.2 release version, and > had no luck. Do I need to be pulling 1.1 from svn to get this fix? Ah, right, sorry. I had done so much work on 1.1 recently that I completely forgot that you are still using 1.0. 1.0 does not have threading support and I had to rewrite the patch to get it in. Here's a version against the current 1.0 branch that should apply cleanly against 1.0.2. I'll also release a 1.0.3 in a few days (preferably at the same time as 1.1beta to reduce the overhead for our egg maintainers). Stefan -------------- next part -------------- A non-text attachment was scrubbed... 
Name: xslt-crash.patch
Type: text/x-patch
Size: 3860 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060806/f546bb2a/attachment.bin

From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 12:09:28 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 12:09:28 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>

Stefan Behnel wrote:
>>> there's only one minor error in the test runs. I'll
>>> check if I can fix it. It's exception related, so it may still be a
>>> bug in the patched Pyrex version.
>> Ok. Let me know if there's anything I can help with on the 2.5 front.
>
> Thanks for offering help, that's always appreciated. :)
>
> I'll give it some more investigation first.

Ok, it was not a Pyrex bug. The problem is that lxml uses multiple inheritance in some exceptions and now that they are new style classes, it's no longer enough to call the constructor of the superclass directly. However, super() does not work for old style classes in 2.4, so I'm a bit challenged in getting this fixed in a backward compatible way.

This works nicely in Python 2.4:

    class Error(Exception):
        pass

    class LxmlError(Error):
        def __init__(self, *args):
            Error.__init__(self, *args)
            self.error_log = __copyGlobalErrorLog()

while Python 2.5 requires this:

    class LxmlError(Error):
        def __init__(self, *args):
            super(LxmlError, self).__init__(*args)
            self.error_log = __copyGlobalErrorLog()

which does not work for classic classes in 2.3/4. Does anyone have an idea how to fix this nicely?
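
For readers who want to see the difference between the two constructor styles, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the 2.4/2.5 versions discussed in this message (classic classes no longer exist there), and the class names DirectError and CooperativeError are made up for illustration; they are not lxml classes. The empty list stands in for the real error log.

```python
class Error(Exception):
    pass


class DirectError(Error):
    def __init__(self, *args):
        # Direct call: jumps straight to the base exception initialiser and
        # skips any other class that sits later in the subclass's MRO.
        Error.__init__(self, *args)
        self.error_log = []


class CooperativeError(Error):
    def __init__(self, *args):
        # Cooperative call: continues along the MRO of the actual instance,
        # so SyntaxError.__init__ also runs when it is mixed in below.
        super(CooperativeError, self).__init__(*args)
        self.error_log = []


# Mixing in the built-in SyntaxError, which stores its message in its own
# __init__ and uses it when the exception is converted to a string.
class DirectSyntaxError(DirectError, SyntaxError):
    pass


class CooperativeSyntaxError(CooperativeError, SyntaxError):
    pass


direct_message = str(DirectSyntaxError("some message"))
cooperative_message = str(CooperativeSyntaxError("some message"))

print("direct call:     ", direct_message)
print("cooperative call:", cooperative_message)
```

With the cooperative version the message survives; with the direct call SyntaxError never gets to record it, so the printed text is no longer the message that was passed in.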
Stefan From fdrake at gmail.com Sun Aug 6 18:22:07 2006 From: fdrake at gmail.com (Fred Drake) Date: Sun, 6 Aug 2006 12:22:07 -0400 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> Message-ID: <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> On 8/6/06, Stefan Behnel wrote: > Ok, it was not a Pyrex bug. The problem is that lxml uses multiple inheritance > in some exceptions and now that they are new style classes, it's no longer > enough to call the constructor of the superclass directly. Please explain in detail what problems you had with this approach. > However, super() > does not work for old style classes in 2.4, so I'm a bit challenged in getting > this fixed in a backward compatible way. > > This works nicely in Python 2.4: ... > while Python 2.5 requires this: ... > which does not work for classic classes in 2.3/4. Does anyone have an idea how > to fix this nicely? The Python 2.4 formulation should still work in Python 2.5. Direct calls to the superclass are not forbidden with new-style classes. -Fred -- Fred L. Drake, Jr. "Every sin is the result of a collaboration." 
--Lucius Annaeus Seneca From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 18:36:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 06 Aug 2006 18:36:58 +0200 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> Message-ID: <44D61AAA.20309@gkec.informatik.tu-darmstadt.de> Hi Fred, Fred Drake wrote: > On 8/6/06, Stefan Behnel wrote: >> Ok, it was not a Pyrex bug. The problem is that lxml uses multiple >> inheritance >> in some exceptions and now that they are new style classes, it's no >> longer >> enough to call the constructor of the superclass directly. > > Please explain in detail what problems you had with this approach. As I said, I'm using this: class Error(Exception): pass class LxmlError(Error): def __init__(self, *args): Error.__init__(self, *args) self.error_log = __copyGlobalErrorLog() What I did not say is that afterwards, I use this: class XPathError(LxmlError): pass class LxmlSyntaxError(LxmlError, SyntaxError): pass class XPathSyntaxError(LxmlSyntaxError, XPathError): pass So there is a 'cross inheritance' here in XPathSyntaxError, but even when I remove the XPathError inheritance, I get the same result as follows. I now call this in Pyrex: raise XPathSyntaxError, "some message" and what comes out at the end is: Traceback ... XPathSyntaxError: None Which is not quite what you'd expect. I assume what happens is that the MRO ends up not calling Exception.__init__ or something, which leads to not setting the message. 
The following works, however:

    class LxmlError(Error):
        def __init__(self, *args):
            super(LxmlError, self).__init__(*args)
            self.error_log = __copyGlobalErrorLog()

What I now did was to call either the super() stuff or __init__ depending on Error being a subtype of 'object' or not. I would prefer having a simpler solution, though.

Stefan

From fdrake at gmail.com Sun Aug 6 19:04:48 2006
From: fdrake at gmail.com (Fred Drake)
Date: Sun, 6 Aug 2006 13:04:48 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> <44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>

On 8/6/06, Stefan Behnel wrote:
> class LxmlSyntaxError(LxmlError, SyntaxError):
>     pass

Is that the built-in SyntaxError? Leave that out. It's really only intended to be used with Python-language syntax errors. Handling for any other syntax errors should use separate exceptions specific to the processing for that language.

Removing that, I get a reasonable error message for Python 2.4 and 2.5.

-Fred

--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration."
--Lucius Annaeus Seneca From luto at myrealbox.com Sun Aug 6 19:18:33 2006 From: luto at myrealbox.com (Andrew Lutomirski) Date: Sun, 6 Aug 2006 10:18:33 -0700 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> <44D61AAA.20309@gkec.informatik.tu-darmstadt.de> <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> Message-ID: On 8/6/06, Fred Drake wrote: > > On 8/6/06, Stefan Behnel > wrote: > > class LxmlSyntaxError(LxmlError, SyntaxError): > > pass > > Is that the built-in SyntaxError? Leave that out. It's really only > intended to be used with Python-language syntax errors. Handling for > any other syntax errors should use separate exceptions specific to the > processing for that language. I think that elementtree and cElementTree do the same thing. I don't like this behavior at all, though -- I spent quite awhile trying to find a syntax error in my code a couple days ago when the real error was in the XML input. --Andy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20060806/abbec615/attachment.htm From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 19:20:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 06 Aug 2006 19:20:12 +0200 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> <44D61AAA.20309@gkec.informatik.tu-darmstadt.de> <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> Message-ID: <44D624CC.6090203@gkec.informatik.tu-darmstadt.de> Hi Fred, Fred Drake wrote: > On 8/6/06, Stefan Behnel wrote: >> class LxmlSyntaxError(LxmlError, SyntaxError): >> pass > > Is that the built-in SyntaxError? Leave that out. It's really only > intended to be used with Python-language syntax errors. Handling for > any other syntax errors should use separate exceptions specific to the > processing for that language. Well, I'm not the one who put it there (and I definitely would not have used it in the first place). Thing is, lxml is heading for ElementTree compatibility and ElementTree raises a plain SyntaxError in the place where we raise LxmlSyntaxError. So removing the superclass would break compatibility to ET and also break existing code that depends on it... 
Stefan From fdrake at gmail.com Sun Aug 6 19:30:40 2006 From: fdrake at gmail.com (Fred Drake) Date: Sun, 6 Aug 2006 13:30:40 -0400 Subject: [lxml-dev] News from the 2.5 front In-Reply-To: <44D624CC.6090203@gkec.informatik.tu-darmstadt.de> References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> <44D61AAA.20309@gkec.informatik.tu-darmstadt.de> <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> <44D624CC.6090203@gkec.informatik.tu-darmstadt.de> Message-ID: <9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com> On 8/6/06, Stefan Behnel wrote: > Well, I'm not the one who put it there (and I definitely would not have used > it in the first place). Thing is, lxml is heading for ElementTree > compatibility and ElementTree raises a plain SyntaxError in the place where we > raise LxmlSyntaxError. So removing the superclass would break compatibility to > ET and also break existing code that depends on it... Ok, I see. The SyntaxError is used directly in the ElementPath module. ;-( There's not going to be a really clean way to do this, or at least I can't think of it off-hand. 
Here's what I came up with; it's probably similar to what you did:

===========================================
_newstyle_exceptions = isinstance(Exception, type)

class Error(Exception):
    pass

class LxmlError(Error):
    def __init__(self, *args):
        if _newstyle_exceptions:
            super(LxmlError, self).__init__(*args)
        else:
            Error.__init__(self, *args)
        self.error_log = []

class XPathError(LxmlError):
    pass

class LxmlSyntaxError(LxmlError, SyntaxError):
    pass

class XPathSyntaxError(LxmlSyntaxError, XPathError):
    pass

raise XPathSyntaxError, "some message"
===========================================

-Fred

--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration."
--Lucius Annaeus Seneca

From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 19:50:16 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 19:50:16 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com> <44D35F14.9090202@gkec.informatik.tu-darmstadt.de> <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de> <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com> <44D61AAA.20309@gkec.informatik.tu-darmstadt.de> <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com> <44D624CC.6090203@gkec.informatik.tu-darmstadt.de> <9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com>
Message-ID: <44D62BD8.3010505@gkec.informatik.tu-darmstadt.de>

Hi Fred,

Fred Drake wrote:
> There's not going to be a really clean way to do this, or at least I
> can't think of it off-hand.
Here's what I came up with; it's probably > similar to what you did: > > =========================================== > _newstyle_exceptions = isinstance(Exception, type) > > class LxmlError(Error): > def __init__(self, *args): > if _newstyle_exceptions: > super(LxmlError, self).__init__(*args) > else: > Error.__init__(self, *args) > self.error_log = [] Yup, that's about what I did, too. It's not that ugly, just a relatively small work around for a backwards compatibility problem. So I think I'll just live with it. Thanks for helping, Stefan From benno.luthiger at id.ethz.ch Mon Aug 7 18:48:07 2006 From: benno.luthiger at id.ethz.ch (Luthiger Stoll Benno) Date: Mon, 7 Aug 2006 18:48:07 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem Message-ID: Hello I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I install lxml using easy_install. I saw that this problem was discussed last month on this list. I scanned the mails addressing this issue, however, I could not find a solution. How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode? Regards, Benno From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Aug 7 20:20:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 07 Aug 2006 20:20:20 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem In-Reply-To: References: Message-ID: <44D78464.8000402@gkec.informatik.tu-darmstadt.de> Hi Benno, Luthiger Stoll Benno wrote: > I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I > install lxml using easy_install. I saw that this problem was discussed last > month on this list. I scanned the mails addressing this issue, however, I > could not find a solution. We do not provide eggs for Python installations that use 16 bit unicode (UCS2). The solution is therefore to compile lxml yourself. I assume you're on Linux, so that's not too much of an effort. 
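
On the question of how to tell a UCS2 build from a UCS4 build: one common check is sys.maxunicode, which on the Python 2.x interpreters discussed here is 65535 for a narrow (UCS2) build and 1114111 for a wide (UCS4) build. The helper name unicode_build below is made up for this sketch. Note that from Python 3.3 on the distinction no longer exists and the value is always 1114111.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter as a narrow (UCS2) or wide (UCS4) build.

    The limit is passed in as a parameter so the function can also be
    tried out with values reported by another interpreter.
    """
    return "UCS4" if maxunicode > 0xFFFF else "UCS2"


print("sys.maxunicode =", sys.maxunicode)
print("this interpreter looks like a", unicode_build(), "build")
```

A binary egg built against a UCS4 interpreter references PyUnicodeUCS4_* symbols, which a UCS2 interpreter cannot resolve; that mismatch is what produces the "undefined symbol" error reported above.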
http://codespeak.net/lxml/build.html > How can I test whether my python installation > (Python 2.3.5) is compiled with 2 bit unicode? Ah, 2 bit unicode? No, that's pretty unlikely... ;) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 8 21:59:26 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 08 Aug 2006 21:59:26 +0200 Subject: [lxml-dev] lxml 1.0.3 and 1.1beta on cheeseshop Message-ID: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de> Hi all, I finally managed to get 1.1beta out, right after releasing 1.0.3. 1.0.3 is a bug fix release. Since it fixes a crash in XSLT result handling, updating is recommended. 1.1beta is the last pre-release before the shiny new 1.1 series will take the lead. Despite the surprisingly short change log, it contains tons of changes under the hood and some major improvements in flexibility. It is the first lxml version to compile and run under Python 2.5 beta (3), comes with a C-API that makes it extensible by other Python C modules, and features an additional data-binding API on top of etree (objectify). For further information on the features of lxml 1.1, please refer to the HTML documentation in the source distribution or read the text files online: http://codespeak.net/svn/lxml/trunk/doc Note that lxml 1.1 requires a patched Pyrex version if you want to compile from non-release or modified sources. It is available here: http://codespeak.net/svn/lxml/pyrex This version of Pyrex supports Python 2.5 and public C-API generation, so it may be of interest to more than only lxml developers. As always, I'm happy about any egg contributions or bug reports that help in making lxml 1.1 the greatest Python XML tool ever. 
Have fun,
Stefan

Changelogs: (note that 1.1beta also contains the changes from 1.0.3)

1.1beta (2006-08-08)

Features added

* Unlock the GIL for deep copying documents and for XPath()
* Support for Python 2.5 beta
* New compact keyword argument for parsing read-only documents
* Support for parser options in iterparse()
* The namespace axis is supported in XPath and returns (prefix, URI) tuples
* The XPath expression "/" now returns an empty list instead of raising an exception
* XML-Object API on top of lxml (lxml.objectify)
* Customizable Element class lookup:
  o Support for externally provided lookup functions
  o lxml.elements.classlookup module implements different lookup mechanisms
* Support for processing instructions (ET-like, not compatible)
* Public C-level API for independent extension modules

Bugs fixed

* XPathSyntaxError now inherits from XPathError
* Threading race conditions in RelaxNG and XMLSchema
* Crash when mixing elements from XSLT results into other trees, concurrent XSLT is only allowed when the stylesheet was parsed in the main thread
* The EXSLT regexp:match function now works as defined (except for some differences in the regular expression syntax)
* Setting element.text to '' returned None on request, not the empty string
* iterparse() could crash on long XML files
* Creating documents no longer copies the parser for later URL resolving. For performance reasons, only a reference is kept. Resolver updates on the parser will now be reflected by documents that were parsed before the change. Although this should rarely become visible, it is a behavioral change from 1.0.
1.0.3 (2006-08-08)

Features added

* Element.replace(old, new) method to replace a subelement by another one

Bugs fixed

* Crash when mixing elements from XSLT results into other trees
* Copying/deepcopying did not work for ElementTree objects
* Setting an attribute to a non-string value did not raise an exception
* Element.remove() deleted the tail text from the removed Element

From ogrisel at nuxeo.com Wed Aug 9 17:24:31 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Wed, 09 Aug 2006 17:24:31 +0200
Subject: [lxml-dev] Google Analytics tagger script based on lxml
Message-ID:

Hi list,

Thanks to the neat HTMLParser feature in lxml I was able to quickly write a simple script to add Google Analytics tags at the end of static HTML files (generated from a REST source for instance).

Feel free to use it should you find it any useful:

http://champiland.homelinux.net/evogrid/code/evogrid.og.main/ga_tagger.py

NB: google analytics is a free as in beer web traffic analyser: http://www.google.com/analytics/

Best,

--
Olivier

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 9 18:59:35 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 09 Aug 2006 18:59:35 +0200
Subject: [lxml-dev] lxml 1.0.3 and 1.1beta on cheeseshop
In-Reply-To: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de>
References: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DA1476.4070209@gkec.informatik.tu-darmstadt.de>

Ah, well, never release too early...

Here is a patch against 1.1beta that fixes a couple of bugs in lxml.objectify, especially in the setattr() and addattr() methods of ObjectPath. Without the patch, you can't currently set attributes to Element values or lists. That's not too much of an issue, as you can still set them directly (without ObjectPath). But it's still annoying.

Guess that's what a beta release is there for...

Stefan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: objectify-setattr-bugs.patch Type: text/x-patch Size: 12242 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060809/4652d733/attachment-0001.bin From ogrisel at nuxeo.com Wed Aug 9 19:39:09 2006 From: ogrisel at nuxeo.com (Olivier Grisel) Date: Wed, 09 Aug 2006 19:39:09 +0200 Subject: [lxml-dev] ElementTree and lxml advertised by yahoo Message-ID: The lxml part is just a reference at the bottom of the page, but anyway that's still a good start :) http://developer.yahoo.com/python/python-xml.html#element -- Olivier From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 10 08:31:57 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 10 Aug 2006 08:31:57 +0200 Subject: [lxml-dev] Developer version of the web pages online Message-ID: <44DAD2DD.5090307@gkec.informatik.tu-darmstadt.de> Hi all, I thought it would be a good idea to advocate the current developer version of lxml a bit more. So I uploaded the web pages from the trunk to http://codespeak.net/lxml/dev/ Their differences are obviously generated by a script using lxml.etree. :) Stefan From Holger.Joukl at LBBW.de Thu Aug 10 14:00:57 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Thu, 10 Aug 2006 14:00:57 +0200 Subject: [lxml-dev] [1.1beta] lxml.objectify python2.3 compatibilty In-Reply-To: Message-ID: Hi, lxml.objectify crashes under python2.3: PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3 Python 2.3.4 (#6, Jul 20 2004, 11:09:38) [GCC 2.95.2 19991024 (release)] on sunos5 Type "help", "copyright", "credits" or "license" for more information. >>> import lxml.objectify Traceback (most recent call last): File "", line 1, in ? 
ImportError: ld.so.1: python2.3: fatal: relocation error: file
/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol
PyDict_Contains: referenced symbol not found
>>>

seems like PyDict_Contains is not available in python2.3:

$ elfdump /apps/pydev/gcc/3.4.4/bin/python2.4 |grep -i pydict_cont
[593] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[3487] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[593] PyDict_Contains
0 hjoukl at dev-a .../pytaf $ elfdump /apps/prod/bin/python2.3 |grep -i pydict_cont
1 hjoukl at dev-a .../pytaf $

Regards,
Holger

Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene
Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde,
verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann.
Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet.
Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden.
Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail
als Hardcopy an.

The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error, please
notify the sender immediately and then delete this e-mail. Any unauthorized
copying and transmission is forbidden. E-Mail transmission cannot be
guaranteed to be secure. If verification is required, please request a hard
copy version.
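
The elfdump check in the message above can also be approximated from inside
the interpreter. What follows is a minimal sketch and not part of the original
report: it assumes the ctypes module is available (it is in the standard
library from Python 2.5 on, so the 2.3 installation discussed here would need
it installed separately), and the helper name has_c_api_symbol is made up for
illustration.

```python
import ctypes


def has_c_api_symbol(name):
    """Return True if the running interpreter exports the C-API symbol `name`.

    ctypes.pythonapi wraps the interpreter's own library; looking up a
    symbol that is not exported raises AttributeError, which hasattr()
    turns into False.
    """
    return hasattr(ctypes.pythonapi, name)


if __name__ == "__main__":
    # PyDict_Contains was added to the C API in Python 2.4, which is why
    # an extension built against 2.4 fails to load under 2.3.
    print(has_c_api_symbol("PyDict_Contains"))
```

Running the same check under each interpreter shows which one can load an
extension module that references the symbol.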
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 10 14:28:41 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 10 Aug 2006 14:28:41 +0200
Subject: [lxml-dev] [1.1beta] lxml.objectify python2.3 compatibilty
In-Reply-To:
References:
Message-ID: <44DB2679.1070807@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> lxml.objectify crashes under python2.3:
>
> PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3
> Python 2.3.4 (#6, Jul 20 2004, 11:09:38)
> [GCC 2.95.2 19991024 (release)] on sunos5
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import lxml.objectify
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> ImportError: ld.so.1: python2.3: fatal: relocation error: file
> /apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol
> PyDict_Contains: referenced symbol not found

Besides the fact that you should not normally import modules that were compiled
for a different Python version - you're right, thanks. That one slipped through
accidentally.

Here's the patch.
Stefan


Index: src/lxml/objectify.pyx
===================================================================
--- src/lxml/objectify.pyx (revision 31226)
+++ src/lxml/objectify.pyx (working copy)
@@ -184,7 +184,7 @@
         if c_ns is NULL and tree._getNs(child._c_node) is not NULL:
             continue
         name = child._c_node.name
-        if not python.PyDict_Contains(children, name):
+        if python.PyDict_GetItem(children, name) is NULL:
             python.PyDict_SetItem(children, name, child)
     return children
Index: src/lxml/python.pxd
===================================================================
--- src/lxml/python.pxd (revision 31212)
+++ src/lxml/python.pxd (working copy)
@@ -52,7 +52,6 @@
     cdef void PyDict_Clear(object d)
     cdef object PyDict_Copy(object d)
     cdef Py_ssize_t PyDict_Size(object d)
-    cdef int PyDict_Contains(object d, object key)
     cdef object PySequence_List(object o)
     cdef object PySequence_Tuple(object o)

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 06:57:01 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 06:57:01 +0200
Subject: [lxml-dev] Request for comments: Removing lxml.etree's default support for namespace class support
In-Reply-To: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>

Hi all,

since there have been no reactions so far, I'll just extend my request a
little.

Stefan Behnel wrote:
> To remove this redundancy, to speed up the default setup if namespace classes
> are /not/ used and (above all) to make the lookup API more accessible, I would
> like to remove the default for namespace lookup and replace it by the simplest
> possible mechanism that always returns the normal element classes.
[...]
> One of the main reasons for this change is that I would like to make the
> lookup mechanism explicit and visible. It is a global property that impacts the
> entire library.
> Users who do not need to install their own custom classes
> should not be bothered with it, i.e. should be able to ignore the lookup API,
> the Namespace class registry, etc. For those who need a different mechanism, I
> believe that the current default does not make it visible enough that (for
> example) the functionality of the "Namespace" class registry is disabled if
> you select a different class lookup mechanism.

I thought about this some more and found that having a per-parser setup as
default would be pretty convenient and adds only an extremely small overhead
compared to the default class lookup. And what's even better, making the parser
lookup the default would remove the need to actually change the global lookup
scheme, which avoids problems with different modules using lxml (as is already
the case with objectify).

So, the second proposal for custom class lookup:

* if no custom classes are used, no configuration is needed
* any support for custom classes should be registered at the parser level

then, as before:

> * changing the default class is done by creating and setting a default
>   lookup scheme based on the new default classes
> * using the namespace lookup requires setting the ns lookup scheme, which
>   then enables lookups based on the global Namespace registry

[leaving out the original per-parser bit]

I think this really helps to make custom class support in lxml cleaner.

It would then be helpful to also extend the behaviour of the XML() and HTML()
factories to use the default parser *iff* it matches their requirements (i.e.
it *is* an XMLParser or HTMLParser respectively) and only if not, fall back to
the current behaviour of using their own parser. This allows registering a
lookup scheme for the default parser without losing these functions.

I'll just go and implement this on the trunk for now, so if there are any
comments or diverging interests, please speak up on the list.
Stefan

From Holger.Joukl at LBBW.de Fri Aug 11 09:19:03 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 09:19:03 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi,

when inheriting from the NumberElement base class, there is a defined
mechanism to set a text-to-pyval parser function using the
_setValueParser method.
Would it make sense to extend this well-defined mechanism to
the general DataElement class?

E.g. writing a custom datetime class looks something like this:

from lxml import objectify
from datetime import datetime
from dateutil.parser import parse
from dateutil import tz

# Unix epoch as datetime object
EPOCH = datetime(1970, 1, 1, 0, 0, 0, 0, tzinfo=tz.tzutc())

# FIXME: Should probably be tzlocal, but this crashes under python2.4:
# FIXME: ValueError: timestamp out of range for platform time_t when
# FIXME: trying to calculate with datetime values
# FIXME: This is due to changes in the time module, python2.3 just ignores it
#_DEFAULT_TZ=tz.tzlocal() # better??
# Problem is that this rule is true now but has undergone some changes;
# e.g. dst wasn't even invented until 1975 in Germany
_DEFAULT_TZ=tz.tzstr('MET-1MEST-2,M3.5.0/02:00:00,M10.5.0/03:00:00')

class _parsePrecedence:
    yearfirst = True
    dayfirst = False

def _findtz(name, offset):
    """Determine the timezone information as best as we can.
    Offset takes precedence over name.
    If neither offset nor tz name are given, fallback to use system local tz.
    """
    if offset:
        return tz.tzoffset(name, offset)
    if name:
        if name == 'UTC':
            return tz.tzutc()
        else:
            found_tz = tz.gettz(name)
            if found_tz:
                return found_tz
            else:
                return tz.tzstr(name)
    return _DEFAULT_TZ

class DatetimeElement(objectify.ObjectifiedDataElement):
    def __get(self):
        return _datetimeValueOf(self)
    pyval = property(__get)

    def _type(text):
        return _checkDatetime(text)
    _type = staticmethod(_type)

    def __add__(self, other):
        return _datetimeValueOf(self) + _datetimeValueOf(other)

    def __sub__(self, other):
        return _datetimeValueOf(self) - _datetimeValueOf(other)

    def __radd__(self, other):
        return _datetimeValueOf(other) + _datetimeValueOf(self)

    def __rsub__(self, other):
        return _datetimeValueOf(other) - _datetimeValueOf(self)

    def __cmp__(self, other):
        return cmp(_datetimeValueOf(self), _datetimeValueOf(other))

    def __str__(self):
        return str(self.pyval)

def _datetimeValueOf(obj):
    if isinstance(obj, DatetimeElement):
        return DatetimeElement._type(obj.text)
    return obj

def _checkDatetime(timestr):
    # parse raises ValueError if not successful
    return parse(timestr, tzinfos=_findtz,
                 yearfirst=_parsePrecedence.yearfirst,
                 dayfirst=_parsePrecedence.dayfirst)

def register():
    datetimeType = objectify.PyType('datetime', _checkDatetime,
                                    DatetimeElement)
    datetimeType.xmlSchemaTypes = ("datetime",)
    datetimeType.register()

The re-implementation of property pyval might be left out here, also the
_type staticmethod.
Maybe the __str__ method, too, if ObjectifiedDataElement changed its __str__
method to

    def __str__(self):
        return str(self.pyval)

What do you think?

Regards,
Holger

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 09:57:11 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 09:57:11 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To:
References:
Message-ID: <44DC3857.102@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> inheriting from the NumberElement base class there is a defined
> mechanism to set a text-to-pyval parser function using the
> _setValueParser method.
> Would it make sense to extend this well-defined mechanism to
> the general DataElement class?
[implementation of a date type]
> The re-implementation of property pyval might be left out here, also the
> _type staticmethod.
> Maybe the __str__ method, too if ObjectifiedDataElement changed its __str__
> method to
>     def __str__(self):
>         return str(self.pyval)

Writing str() in that way would not work in all cases. Just look at None,
__str__() must always return a string. So, when None is returned as pyval,
should __str__() return "" or "None"? Depends, right? What about numbers? Does
0 mean "0" or "False"?

We could introduce an intermediate "ParsableObjectifiedDataElement" or
something along those lines. I don't know if there's enough use for it,
though. It would only have 3-4 methods or so that don't do much. It's
different in NumberElement, where the entire number protocol is implemented.

BTW, I'm not opposed to integrating a date element class.
As it looks, yours is pretty far advanced by now, and it's even an external
Python module. I won't have the time to merge it before the end of the month,
but if you can get some of the FIXMEs out by then (no, *not* only the
comments :), we can see if we get it into 1.1 final.

Stefan

From philipp at weitershausen.de Fri Aug 11 10:15:36 2006
From: philipp at weitershausen.de (Philipp von Weitershausen)
Date: Fri, 11 Aug 2006 10:15:36 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
 <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
 <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
 <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
Message-ID: <44DC3CA8.8040807@weitershausen.de>

Fred Drake wrote:
>> problem was: lxml used its own version of doctest.py, which was no longer
>> compatible with 2.5. I always wondered where that came from and what it was
>> good for. Should have asked long ago, I guess...
>
> Hmm. There's a separate version in zope.testing as well. I've no
> idea if that's compatible with 2.5; there's so many other things that
> fall over with 2.5 it doesn't seem worthwhile to ask.

Jim, Tim, and others continuously improved Python's doctest for Zope. The
latest example is Benji's work on footnotes. AFAIK Zope's doctest was
regularly sync'ed with Python's, though. At least Python 2.4's doctest is good
enough for not having to ship your own version of it.

Philipp

From Holger.Joukl at LBBW.de Fri Aug 11 10:18:38 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 10:18:38 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC3857.102@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote on 11.08.2006 09:57:11:
> > The re-implementation of property pyval might be left out here, also the
> > _type staticmethod.
> > Maybe the __str__ method, too if ObjectifiedDataElement changed its __str__
> > method to
> >     def __str__(self):
> >         return str(self.pyval)
>
> Writing str() in that way would not work in all cases. Just look at None,
> __str__() must always return a string. So, when None is returned as pyval,
> should __str__() return "" or "None"? Depends, right? What about numbers? Does
> 0 mean "0" or "False"?

The NoneElement returns:

    def __str__(self):
        return "None"

with a pyval:

    property pyval:
        def __get__(self):
            return None

so no problem there. As for numbers, a pyval of 0 will result in "0" and a
pyval of True in "True". I don't actually see a problem here :-)

> We could introduce an intermediate "ParsableObjectifiedDataElement" or
> something in that line. I don't know if there's enough use for it, though. It
> would only have 3-4 methods or something that don't do much. It's different in
> NumberElement, where the entire number protocol is implemented.

I agree that another DataElement specialization would not be that useful here.

> BTW, I'm not opposed to integrating a date element class. As it looks, yours
> is pretty far advanced by now, and it's even an external Python module. I
> won't have the time to merge it before the end of the month, but if you can
> get some of the FIXME's out by then (no, *not* only the comments :), we can
> see if we get it into 1.1 final.

Yes, works like a charm. Note that it depends on the external dateutil module,
though. Without that, parsing and timezone handling become a nightmare.
As for the FIXME, I fear that there will be no clean solution other than
forcing the ObjectifiedDatetime user to register a _DEFAULT_TZ containing the
explicit dst rule. Date/time handling is evil.

Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good thing
imho). Is it possible to have a way to modify the text of the underlying
cnode from within a custom ObjectifiedDataElement class, e.g. in _init()?
I know this is possible when implementing this in pyrex, but for a
pure-python implementation?
The background is that for the ObjectifiedDatetime class I might optionally
want to change the .text to ISO format.

Regards,
Holger
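
The argument about deriving __str__ from pyval can be illustrated without
lxml. The sketch below uses plain-Python stand-ins, so the class names
(DataElementSketch, IntSketch, BoolSketch, NoneSketch) and the _parse hook are
hypothetical and are not lxml.objectify's actual classes or API. It only shows
what str(self.pyval) yields for the integer, boolean and None cases discussed
in the thread.

```python
class DataElementSketch(object):
    """Plain-Python stand-in for a data element that holds XML text."""

    def __init__(self, text):
        self.text = text

    def _parse(self, text):
        # Hypothetical hook: subclasses turn the text into a Python value.
        return text

    @property
    def pyval(self):
        return self._parse(self.text)

    def __str__(self):
        # String form derived from pyval, as proposed in the thread.
        return str(self.pyval)


class IntSketch(DataElementSketch):
    def _parse(self, text):
        return int(text)


class BoolSketch(DataElementSketch):
    def _parse(self, text):
        return text.strip().lower() in ("true", "1")


class NoneSketch(DataElementSketch):
    def _parse(self, text):
        return None


if __name__ == "__main__":
    for element in (IntSketch("0"), BoolSketch("true"), NoneSketch("")):
        print("%s -> %r" % (type(element).__name__, str(element)))
```

With this layout a subclass only supplies the text-to-value conversion, and
the string form follows from it. Whether that is the right default for every
data type is the open question in the discussion.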
From Holger.Joukl at LBBW.de Fri Aug 11 10:24:08 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 10:24:08 +0200
Subject: [lxml-dev] [objectify] DataElement function
In-Reply-To: <44DC3857.102@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi Stefan,

is it intentional/unavoidable that the element type returned
by DataElement is always ObjectifiedElement before it is put into
a parent element:

>>> what = objectify.DataElement(18)
>>> print what
value = '18' [ObjectifiedElement]
  * py:pytype = 'int'
>>> what = objectify.DataElement("hallo")
>>> print what
value = 'hallo' [ObjectifiedElement]
  * py:pytype = 'str'
>>> what = objectify.DataElement("17", _pytype="str")
>>> print what
value = '17' [ObjectifiedElement]
  * py:pytype = 'str'
>>> root = objectify.Element('root')
>>> root.what = what
>>> print root
root = None [ObjectifiedElement]
    what = '17' [StringElement]
      * py:pytype = 'str'
>>>

Holger
From faassen at infrae.com Fri Aug 11 10:59:08 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 10:59:08 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
 <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
 <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC46DC.4090307@infrae.com>

Stefan Behnel wrote:
> Hi Fred,
>
> Fred Drake wrote:
>> On 8/4/06, Stefan Behnel wrote:
>>> The only problem I currently encounter is a bug in linecache in 2.5's
>>> stdlib that prevents the doctests from running. Once that's solved, we
>>> can see if those tests pass as well.
>> If there's really a bug in linecache, be sure to report it against Python
>> on SourceForge so we can get it dealt with.
>
> Oh, well. I did report it and then almost instantly got a TYOF back. The
> problem was: lxml used its own version of doctest.py, which was no longer
> compatible with 2.5. I always wondered where that came from and what it was
> good for. Should have asked long ago, I guess...

I'm not sure I remember anymore; possibly the doctest module that ships
with Python 2.3 was too outdated or didn't support some features that I
wanted. I might've taken it from Zope 3; not sure.

Regards,

Martijn

From faassen at infrae.com Fri Aug 11 11:04:04 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:04:04 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To:
References:
Message-ID: <44DC4804.903@infrae.com>

Luthiger Stoll Benno wrote:
> Hello
>
> I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I install lxml using easy_install. I saw that this problem was discussed last month on this list.
> I scanned the mails addressing this issue, however, I could not find a solution.
> How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
>

A straightforward compile of Python will be 2 byte unicode, not 4 bytes.
Unfortunately most linux distributions ship with a 4 byte unicode
version of Python, and distutils/setuptools cannot distinguish between 4
bytes and 2 bytes unicode yet. We've passed this problem (which goes
beyond lxml) along to the setuptools developers, and they say "patches
welcome". :)

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 10:59:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 10:59:04 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To:
References:
Message-ID: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>

Holger Joukl wrote:
> Stefan Behnel wrote:
>>>     def __str__(self):
>>>         return str(self.pyval)
>>
>> Writing str() in that way would not work in all cases. Just look at None,
>> __str__() must always return a string. So, when None is returned as pyval,
>> should __str__() return "" or "None"? Depends, right? What about numbers?
>> Does 0 mean "0" or "False"?
>
> The NoneElement returns:
>     def __str__(self):
>         return "None"
>
> with a pyval:
>     property pyval:
>         def __get__(self):
>             return None

I know, I've written that code not too long ago. ;)

I was just trying to say that the gain is relatively low, as there are only
few simple methods that can be provided and many use cases still have to
reimplement some of them. So I don't see a noticeable improvement.

>> BTW, I'm not opposed to integrating a date element class. As it looks, yours
>> is pretty far advanced by now, and it's even an external Python module.
>
> Yes, works like a charm. Note that it depends on external dateutil module,
> though.
Which is this, I assume:

http://labix.org/python-dateutil

Ok, that's too bad. We can't rely on external modules for the lxml
distribution, at least not for something that's not strictly required for all
users.

> Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good
> thing imho). Is it possible to have a way to modify the text of the
> underlying cnode from within a custom ObjectifiedDataElement class, e.g. in
> _init()?
> I know this is possible when implementing this in pyrex, but for a
> pure-python implementation?
> The background is that for the ObjectifiedDatetime class I might optionally
> want to change the .text to ISO format.

Ah, good question. Not currently, I believe. But you're right, there might be
cases where it makes sense to update the text from a subclass...

Maybe adding a 'private' property '__text' might help here, or rather an
explicit setter function '__updateTextInPlace(self, text)' ?

Stefan

From Holger.Joukl at LBBW.de Fri Aug 11 11:18:54 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 11:18:54 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote on 11.08.2006 10:59:04:
> I was just trying to say that the gain is relatively low, as there are only
> few simple methods that can be provided and many use cases still have to
> reimplement some of them. So I don't see a noticeable improvement.

Probably the only gain would be for an objectify newbie, who would not need to
think too much about implementing the .pyval, __str__, ._type stuff.
But I would rather think of a doc patch then, to document this a little
more extensively (after my holidays :-)

> >> BTW, I'm not opposed to integrating a date element class. As it looks,
> >> yours is pretty far advanced by now, and it's even an external Python
> >> module.
> >
> > Yes, works like a charm.
> > Note that it depends on external dateutil module,
> > though.
>
> Which is this, I assume:
>
> http://labix.org/python-dateutil

Right.

> Ok, that's too bad, We can't rely on external modules for the lxml
> distribution, at least not for something that's not strictly required for all
> users.

Maybe we can fall back to the standard datetime mechanisms if dateutil isn't
installed, but then TZ handling and parsing will be far more
restricted/error-prone. Will think about that.

> > Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good
> > thing imho). Is it possible to have a way to modify the text of the
> > underlying cnode from within a custom ObjectifiedDataElement class, e.g.
> > in _init()?
> > I know this is possible when implementing this in pyrex, but for a
> > pure-python implementation?
> > The background is that for the ObjectifiedDatetime class I might optionally
> > want to change the .text to ISO format.
>
> Ah, good question. Not currently, I believe. But you're right, there might be
> cases where it makes sense to update the text from a subclass...
>
> Maybe adding a 'private' property '__text' might help here, or rather an
> explicit setter function '__updateTextInPlace(self, text)' ?

Something like this would be nice. And I still think not letting the user
change the text from the outside is a good thing, at least as long as changing
the .text might result in an object type <-> text value mismatch.

Holger

From faassen at infrae.com Fri Aug 11 11:30:35 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:30:35 +0200
Subject: [lxml-dev] Request for comments: Removing lxml.etree's default support for namespace class support
In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
 <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC4E3B.6000602@infrae.com>

Stefan Behnel wrote:
[snip]
> So, the second proposal for custom class lookup:
>
> * if no custom classes are used, no configuration is needed
> * any support for custom classes should be registered at the parser level

+1 for per-parser custom class lookup. So far no objections to the first
mail either. :)

Regards,

Martijn

From faassen at infrae.com Fri Aug 11 11:37:04 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:37:04 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44DC4804.903@infrae.com>
References: <44DC4804.903@infrae.com>
Message-ID: <44DC4FC0.6000402@infrae.com>

Hey,

[Benno]
>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?

In order to write our patch to fix distutils/setuptools, we actually
need an answer to Benno's question. Is there a straightforward way to
find this out, in Python code? A brief glance through 'sys' didn't lead
to an answer. A quick google likewise didn't seem to lead to anything so
far.
Perhaps we need to resort to devious unicode string manipulation that
behaves differently depending on the number of bytes your Python is
compiled with for unicode representation..

Or we could try asking Fredrik Lundh :).

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 11:44:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 11:44:24 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44DC4FC0.6000402@infrae.com>
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com>
Message-ID: <44DC5178.4000007@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> [Benno]
>>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
>
> In order to write our patch to fix distutils/setuptools, we actually
> need an answer to Benno's question. Is there a straightforward way to
> find this out, in Python code? A brief glance through 'sys' didn't lead
> to an answer. A quick google likewise didn't seem to lead to anything so
> far.

  >>> import sys; print sys.maxunicode
  1114111

on my UCS4 system. UCS2 systems cannot return values above 65535.

Stefan

From Holger.Joukl at LBBW.de Fri Aug 11 11:52:09 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 11:52:09 +0200
Subject: [lxml-dev] [objectify] root Element <-> tree problem
In-Reply-To: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi Stefan,

something is going wrong here:

>>> root = objectify.Element('root')
>>> root

>>> root.x = 1
>>> root.y = 2
>>> print root
root = None [ObjectifiedElement]
    x = 1 [IntElement]
    y = 2 [IntElement]
>>> root.getroottree().getroot()

>>> print root.getroottree().getroot()
root = None [ObjectifiedElement]
>>> root.getroottree()

>>> print root.getroottree().getroot().getroottree()

>>>

I'm not doing something wrong, am I?

Holger

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 12:11:10 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 12:11:10 +0200
Subject: [lxml-dev] [objectify] root Element <-> tree problem
In-Reply-To:
References:
Message-ID: <44DC57BE.1060300@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> something is going wrong here:
>
>>>> root = objectify.Element('root')
>>>> root
>
>>>> root.x = 1
>>>> root.y = 2
>>>> print root
> root = None [ObjectifiedElement]
>     x = 1 [IntElement]
>     y = 2 [IntElement]
>>>> root.getroottree().getroot()
>
>>>> print root.getroottree().getroot()
> root = None [ObjectifiedElement]
>>>> root.getroottree()
>
>>>> print root.getroottree().getroot().getroottree()
>
>
> I'm not doing something wrong, am I?

Nope. It was a premature optimisation with side-effects in current SVN. I
changed objectify.Element() to always reuse the same document, as the main use
case is to add these things to other documents anyway. Pretty bad idea. Now
that you mention it, there are actually a lot of problems with it.

Just reverted.
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 12:57:54 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 11 Aug 2006 12:57:54 +0200 Subject: [lxml-dev] [objectify] DataElement function In-Reply-To: References: Message-ID: <44DC62B2.6070709@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > is it intentional/unavoidable that the element type returned > by DataElement is always ObjectifiedElement That's not intentional. It's the fast path in _lookupElementClass that strikes here: if c_node.parent is NULL or not tree._isElement(c_node.parent): return ObjectifiedElement # if element has children => no data class if cetree.findChildForwards(c_node, 0) is not NULL: return ObjectifiedElement Only after that, it checks the attributes of the element that determine the element type. There are two ways to change that. * We could move the above code section behind the attribute tests * I thought about adding a C level function for creating new elements anyway. Something like that is already in etree, but it could be extended with an argument for an explicit lookup function (or element class) and made public. It's not as easy as it looks, though, as element objects are created in the elementFactory function, which would have to be adapted as well... Don't know if the second is really viable. The first is easier anyway... Stefan From Holger.Joukl at LBBW.de Fri Aug 11 13:16:21 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Fri, 11 Aug 2006 13:16:21 +0200 Subject: [lxml-dev] [objectify] DataElement function In-Reply-To: <44DC62B2.6070709@gkec.informatik.tu-darmstadt.de> Message-ID: lxml-dev-bounces at codespeak.net schrieb am 11.08.2006 12:57:54: > Hi Holger, > > Holger Joukl wrote: > > is it intentional/unavoidable that the element type returned > > by DataElement is always ObjectifiedElement > > That's not intentional. 
It's the fast path in _lookupElementClass that strikes > here: > > if c_node.parent is NULL or not tree._isElement(c_node.parent): > return ObjectifiedElement > > # if element has children => no data class > if cetree.findChildForwards(c_node, 0) is not NULL: > return ObjectifiedElement > > Only after that, it checks the attributes of the element that determine the > element type. > > There are two ways to change that. > > * We could move the above code section behind the attribute tests > > * I thought about adding a C level function for creating new elements anyway. > Something like that is already in etree, but it could be extended with an > argument for an explicit lookup function (or element class) and made public. > It's not as easy as it looks, though, as element objects are created in the > elementFactory function, which would have to be adapted as well... > > Don't know if the second is really viable. The first is easier anyway... If everything else works as is plus the mentioned thing works better, why not go for the simpler solution? It isn't a real problem at the moment as the DataElements I produce get promptly inserted into a father element and then behave nicely, but... Btw. Shouldn't the default Element class in _guessElementClass() become StringElement, to make this >>> root = objectify.fromstring("""""") >>> print root root = None [ObjectifiedElement] s = None [ObjectifiedElement] >>> finally result into StringElements for empty leaf elements? Holger
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 13:47:21 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 11 Aug 2006 13:47:21 +0200 Subject: [lxml-dev] [objectify] DataElement function In-Reply-To: References: Message-ID: <44DC6E49.2020306@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: >> Holger Joukl wrote: >>> is it intentional/unavoidable that the element type returned >>> by DataElement is always ObjectifiedElement >> >> That's not intentional. It's the fast path in _lookupElementClass that >> strikes here: >> >> if c_node.parent is NULL or not tree._isElement(c_node.parent): >> return ObjectifiedElement >> >> # if element has children => no data class >> if cetree.findChildForwards(c_node, 0) is not NULL: >> return ObjectifiedElement >> >> Only after that, it checks the attributes of the element that determine >> the element type. >> >> * We could move the above code section behind the attribute tests > > If everything else works as is plus the mentioned thing works better, why > not go for the simpler solution? Yup, did that. I also fixed a couple of problems related to different data types as I was at it. We don't currently have a way to check for the real Python types from PyType registered types, only string parsing is supported. However, the real types are passed to DataElement and must be treated similarly. It works for the standard Python types for now and also for custom types that provide a proper __str__() for conversion to XML data content. > Btw.
Shouldn't the default Element class in _guessElementClass() > become StringElement, to make this > >>>> root = objectify.fromstring("""""") >>>> print root > root = None [ObjectifiedElement] > s = None [ObjectifiedElement] > > finally result into StringElements for empty leaf elements? It only looks wrong if you call the element "s", I guess... :) But I changed it so that if the element has * no type annotation and * no children and * no text content then, if it * has an element as parent it defaults to StringElement * has no parent it defaults to ObjectifiedElement I think that makes sense. Stefan From faassen at infrae.com Mon Aug 14 12:27:24 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 14 Aug 2006 12:27:24 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem In-Reply-To: <44DC5178.4000007@gkec.informatik.tu-darmstadt.de> References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com> <44DC5178.4000007@gkec.informatik.tu-darmstadt.de> Message-ID: <44E0500C.6080908@infrae.com> Stefan Behnel wrote: > Hi Martijn, > > Martijn Faassen schrieb: >> [Benno] >>>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode? >> In order to write our patch to fix distutils/setuputils, we actually >> need an answer to Benno's question. Is there a straightforward way to >> find this out, in Python code? A brief glance through 'sys' didn't lead >> to an answer. A quick google likewise didn't seem to lead to anything so >> far. > > >>> import sys; print sys.maxunicode > 1114111 > > on my UCS4 system. UCS2 systems cannot return values above 65536. Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64 bits systems can still increase as more unicode codepoints get added, but looking for any value above 65536 should be a reliable way to distinguish UCS2 from UCS4. 
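A minimal sketch of the check discussed here, using only the standard sys module. The helper name unicode_storage_width is made up for illustration; it is not part of distutils, setuptools or lxml. Note that narrow (UCS2) builds report 65535 rather than 65536, and that Python 3.3 and later always report 1114111, so the distinction only matters on older interpreters.

```python
import sys


def unicode_storage_width(maxunicode=None):
    """Return 2 for a narrow (UCS2) build and 4 for a wide (UCS4) build.

    Hypothetical helper: the name and the optional argument are
    illustrative assumptions, not an existing API.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow builds cap sys.maxunicode at 0xFFFF (65535); wide builds
    # report 0x10FFFF (1114111), the largest Unicode code point.
    return 4 if maxunicode > 0xFFFF else 2


print(unicode_storage_width())
```

Passing the value explicitly, e.g. unicode_storage_width(65535), makes it possible to exercise both branches on a single interpreter.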
Regards, Martijn From fredrik at pythonware.com Mon Aug 14 13:06:43 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Mon, 14 Aug 2006 13:06:43 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com> Message-ID: Martijn Faassen wrote: >> >>> import sys; print sys.maxunicode >> 1114111 >> >> on my UCS4 system. UCS2 systems cannot return values above 65536. > > Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64 bits > systems can still increase as more unicode codepoints get added, but > looking for any value above 65536 should be a reliable way to > distinguish UCS2 from UCS4. the 1114111 value isn't the number of assigned code points; it's the largest code point that's ever (*) going to be used by Unicode. *) "BMP plus sixteen supplemental planes should be enough for anybody" From faassen at infrae.com Tue Aug 15 11:48:52 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 15 Aug 2006 11:48:52 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem In-Reply-To: References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com> Message-ID: <44E19884.9000007@infrae.com> Fredrik Lundh wrote: > Martijn Faassen wrote: > >>> >>> import sys; print sys.maxunicode >>> 1114111 >>> >>> on my UCS4 system. UCS2 systems cannot return values above 65536. >> Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64 bits >> systems can still increase as more unicode codepoints get added, but >> looking for any value above 65536 should be a reliable way to >> distinguish UCS2 from UCS4. > > the 1114111 value isn't the number of assigned code points; it's the largest code > point that's ever (*) going to be used by Unicode. 
> *) "BMP plus sixteen supplemental planes should be enough for anybody" Thanks for the info! Don't know what BMP is, and I only have a vague idea of the planes (I'll read the wikipedia article :), but using 4 bytes to store something that could be stored in less than 3 seems like a waste. :) Oh well, I imagine machines can deal better with 4 bytes, especially if they're 64 bits. Anyway, we'll see whether we can come up with a patch that convinces distutils to distinguish between the two. Regards, Martijn From fredrik at pythonware.com Tue Aug 15 12:27:50 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 15 Aug 2006 12:27:50 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com> <44E19884.9000007@infrae.com> Message-ID: Martijn Faassen wrote: >> *) "BMP plus sixteen supplemental planes should be enough for anybody" > > Thanks for the info! > > Don't know what BMP is, and I only have a vague idea of the planes (I'll > read the wikipedia article :) start here: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters From faassen at infrae.com Tue Aug 15 13:01:36 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 15 Aug 2006 13:01:36 +0200 Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem In-Reply-To: <44E19884.9000007@infrae.com> References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com> <44E19884.9000007@infrae.com> Message-ID: <44E1A990.7070609@infrae.com> Martijn Faassen wrote: [snip] > Oh well, I imagine > machines can deal better with 4 bytes, especially if they're 64 bits. Hah, silly, of course 32 bits is enough for 4 bytes. I knew that! 
:) Regards, Martijn From jkrukoff at ltgc.com Thu Aug 17 01:31:33 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Wed, 16 Aug 2006 17:31:33 -0600 Subject: [lxml-dev] Request for comments: Removing lxml.etree's default support for namespace class support In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de> References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de> <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de> Message-ID: <1155771094.11584.30.camel@localhost> On Fri, 2006-08-11 at 06:57 +0200, Stefan Behnel wrote: > I thought about this some more and found that having a per-parser setup as > default would be pretty convenient and is an extremely small overhead compared > to the default class lookup. First off, thanks for getting 1.0.3 out so quickly. Really helped with my problems, and replace has been a very handy convenience function. I was actually just getting ready to ask you for per-parser custom class support when I came across you already talking about implementing it. It's actually an essential feature for me to be able to take advantage of the custom element classes, as in my application (XML based middleware) both the middleware layer and the applications all handle XML and all exist in the same process. Right now, an application can only change the default element class if it's very careful to make sure to restore it so it doesn't screw up the middleware, and even that solution is going to be impossible once the architecture goes multi-threaded. So, yeah, I'm pretty excited about getting this feature. -- John Krukoff Land Title Guarantee Company From ashish.kulkarni at kalyptorisk.com Wed Aug 23 09:34:40 2006 From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni) Date: Wed, 23 Aug 2006 13:04:40 +0530 Subject: [lxml-dev] Building dynamically-linked lxml on windows using mingw32 Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com> Hello, I've successfully used ming32 to build lxml (dynamically linked). 
I was unable to get the static linking to work, because I was unable to get the VC++ 2003 Toolkit compiler and trying static linking with gcc gives lots of errors.

Step 1: Download and install Mingw from http://mingw.org.

Step 2: Start a command window and set the path to include MingW, e.g.

    set path=%path%;C:\mingw\bin

Step 3: Download the win32 libs from ftp://xmlsoft.org/libxml2/win32. You will need

    iconv-1.9.1.win32.zip
    libxml2-2.6.23.win32.zip
    libxslt-1.1.15.win32.zip
    zlib-1.2.3.win32.zip

Step 4: Follow the instructions in doc/build.txt for extraction, but use the following setupStaticBuild function instead of the one mentioned:

    def setupStaticBuild():
        "See doc/build.txt to make this work."
        cflags = [
            "-I..\\libxml2-2.6.23.win32\\include",
            "-I..\\libxslt-1.1.15.win32\\include",
            "-I..\\zlib-1.2.3.win32\\include",
            "-I..\\iconv-1.9.1.win32\\include"
            ]
        xslt_libs = [
            "..\\libxml2-2.6.23.win32\\bin\\libxml2.dll",
            "..\\libxslt-1.1.15.win32\\bin\\libxslt.dll",
            "..\\libxslt-1.1.15.win32\\bin\\libexslt.dll",
            "..\\iconv-1.9.1.win32\\bin\\iconv.dll",
            "..\\zlib-1.2.3.win32\\lib\\zlib.lib"
            ]
        result = (cflags, xslt_libs)
        return result

Yes, We ARE linking to DLLs directly as the export libraries are incomplete.

Step 5: Copy the 4 DLLs mentioned above to the src/lxml folder. Also, add this line towards the end of the file, just below the "packages = ['lxml']," line:

    package_data={'': ['*.dll']},

Step 6: To build the extension, use the following command:

    python setup.py build --c=mingw32 --static bdist_wininst

You should have an installer which uses lxml dynamically linked to the above DLLs. The installer size is around 1344kB, which is almost the same size you get via static linking. (as a comparison, lxml-1.0.2.win32-static-py2.4.exe is around 1266kB).

Hope this helps, Ashish

From faassen at infrae.com Wed Aug 23 16:44:34 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 23 Aug 2006 16:44:34 +0200 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms?
Message-ID: <44EC69D2.6080404@infrae.com> Hey, Compare: http://cheeseshop.python.org/pypi/lxml/1.0.2 with http://cheeseshop.python.org/pypi/lxml/1.0.3 and we see that 1.0.2 has support for lots of different platforms, including the nice static windows build, but 1.0.3 has not. In part this is my fault, as it appears I need to do various linux eggs, but a couple of more egg donations from others would be appreciated! The same story applies to 1.1 beta. Regards, Martijn From ashish.kulkarni at kalyptorisk.com Thu Aug 24 07:24:03 2006 From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni) Date: Thu, 24 Aug 2006 10:54:03 +0530 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com> Hello, I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is not a static build, but the DLLs are included in the distribution (as per my previous mail). http://puggy.symonds.net/~ashish/downloads/ Also, I couldn't build the lxml.objectify extension for 1.1beta: apparently there is no pyrex-generated C file in the source distribution. Thus the 1.1 beta builds have that extension disabled. Hope this helps, Ashish From faassen at infrae.com Thu Aug 24 11:37:59 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu, 24 Aug 2006 11:37:59 +0200 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com> References: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com> Message-ID: <44ED7377.6050909@infrae.com> Ashish Kulkarni wrote: > I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is > not a static build, but the DLLs are included in the distribution (as > per my previous mail). The experience to the end user is the same, I think, so this sounds good too. 
:) > http://puggy.symonds.net/~ashish/downloads/ > > Also, I couldn't build the lxml.objectify extension for 1.1beta: > apparently there is no pyrex-generated C file in the source > distribution. Thus the 1.1 beta builds have that extension disabled. Thanks! It's useful to know we don't have a pyrex generated C file in the source directory for the objectify stuff. I'll leave that to Stephan Behnel to correct, as he's more familiar with the build procedure than I am. Previously Steve Howe has been taking care of our windows builds, so I'm still hoping he'll chip in versions for 1.0.3 (and possibly 1.1beta) for the cheeseshop. If however he turns out to be busy, we'll be sure to get back to you again. And for people on Windows who want to continue now, your downloads are available. Thank you very much! Regards, Martijn From jkrukoff at ltgc.com Thu Aug 24 14:04:56 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Thu, 24 Aug 2006 06:04:56 -0600 Subject: [lxml-dev] Replace/copy related segfault in lxml Message-ID: <1156421097.17673.20.camel@localhost> So, I've been making extensive use of lxml 1.0.3, and have come across another crash bug. This one also appears to be related to subtree replacement. This is with libxml2 2.6.26, and I haven't tested with lxml 1.1 beta to see if the bug is present there. There is a simple workaround, which appears to be to avoid using the new replace function. This is the error the attached test program gives me: *** glibc detected *** double free or corruption (fasttop): 0x080daec8 *** However, minor differences in the location and amount of whitespace in the input data change the crash, to errors such as this: *** glibc detected *** corrupted double-linked list: 0x0813b9f8 *** -- John Krukoff Land Title Guarantee Company -------------- next part -------------- A non-text attachment was scrubbed... 
Name: test-replace.py Type: text/x-python Size: 520 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060824/fa120482/attachment.py From jkrukoff at ltgc.com Thu Aug 24 15:28:55 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Thu, 24 Aug 2006 07:28:55 -0600 Subject: [lxml-dev] No extend method on elements? Message-ID: <1156426135.17673.43.camel@localhost> I know ElementTree doesn't support it, but is there any chance of getting an extend method on Elements? It's an awfully useful list function, and my first try for replacement was: [ element.append( new ) for new in otherelement ] However, it looks like for large element lists, it's far faster to use slice assignment: element[ len( element ) : len( element ) ] = otherelement which was not the most intuitive way to do things for me. It'd be nice if -0 : -0 was a real slice... Is this really the best way, or am I missing something obvious? -- John Krukoff Land Title Guarantee Company From fredrik at pythonware.com Thu Aug 24 15:51:06 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Thu, 24 Aug 2006 15:51:06 +0200 Subject: [lxml-dev] No extend method on elements? References: <1156426135.17673.43.camel@localhost> Message-ID: John Krukoff wrote: >I know ElementTree doesn't support it, but is there any chance of > getting an extend method on Elements? ET 1.3 has an extend() method. > element[ len( element ) : len( element ) ] = otherelement shorter: element[len(element):] = otherelement From faassen at infrae.com Thu Aug 24 16:39:00 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu, 24 Aug 2006 16:39:00 +0200 Subject: [lxml-dev] Replace/copy related segfault in lxml In-Reply-To: <1156421097.17673.20.camel@localhost> References: <1156421097.17673.20.camel@localhost> Message-ID: <44EDBA04.9050509@infrae.com> John Krukoff wrote: > So, I've been making extensive use of lxml 1.0.3, and have come across > another crash bug. 
This one also appears to be related to subtree > replacement. > > This is with libxml2 2.6.26, and I haven't tested with lxml 1.1 beta to > see if the bug is present there. There is a simple workaround, which > appears to be to avoid using the new replace function. > > This is the error the attached test program gives me: > > *** glibc detected *** double free or corruption (fasttop): 0x080daec8 > *** > > However, minor differences in the location and amount of whitespace in > the input data change the crash, to errors such as this: > > *** glibc detected *** corrupted double-linked list: 0x0813b9f8 *** Hm, I'm on an ubuntu 6.06, python 2.4, libxml 2.6.24, lxml-1.0 branch from svn, and so far I cannot reproduce your problem by running your script. Trying the 1.0.3 release now, same platform, still cannot reproduce the crash. What platform are you on? I can find a problem I run this code using 'valgrind' to detect memory errors - I get exuberant warnings now. Looks like you're on to something.. valgrind doesn't report these warnings when the workaround is enabled instead. I'll try to look into this more deeply later. Regards, Martijn From ashish.kulkarni at kalyptorisk.com Fri Aug 25 07:07:45 2006 From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni) Date: Fri, 25 Aug 2006 10:37:45 +0530 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? In-Reply-To: <44ED7377.6050909@infrae.com> Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E92D@mailserver.kalyptorisk.com> Actually, now that lxml can be built with mingw32, one can do all the builds on linux itself. All you have to do is to build a mingw32 cross-compiler. http://www.mingw.org/MinGWiki/index.php/BuildMingwCross I've heard that a lot of projects use this approach to build win32 releases. So the official builds can at-least include the Mingw32 builds, until someone comes up with MSVC builds (which are almost always a bit faster). 
Hope this helps, Ashish -----Original Message----- From: Martijn Faassen [mailto:faassen at infrae.com] Sent: Thursday, August 24, 2006 3:08 PM To: Ashish Kulkarni Cc: lxml-dev at codespeak.net; howe at carcass.dhs.org Subject: Re: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? Ashish Kulkarni wrote: > I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is > not a static build, but the DLLs are included in the distribution (as > per my previous mail). The experience to the end user is the same, I think, so this sounds good too. :) > http://puggy.symonds.net/~ashish/downloads/ > > Also, I couldn't build the lxml.objectify extension for 1.1beta: > apparently there is no pyrex-generated C file in the source > distribution. Thus the 1.1 beta builds have that extension disabled. Thanks! It's useful to know we don't have a pyrex generated C file in the source directory for the objectify stuff. I'll leave that to Stephan Behnel to correct, as he's more familiar with the build procedure than I am. Previously Steve Howe has been taking care of our windows builds, so I'm still hoping he'll chip in versions for 1.0.3 (and possibly 1.1beta) for the cheeseshop. If however he turns out to be busy, we'll be sure to get back to you again. And for people on Windows who want to continue now, your downloads are available. Thank you very much! Regards, Martijn From jkrukoff at ltgc.com Fri Aug 25 11:31:52 2006 From: jkrukoff at ltgc.com (John Krukoff) Date: Fri, 25 Aug 2006 03:31:52 -0600 Subject: [lxml-dev] Replace/copy related segfault in lxml Message-ID: <004801c6c829$4d100e30$051ea8c0@naomi> >Hm, I'm on an ubuntu 6.06, python 2.4, libxml 2.6.24, lxml-1.0 branch >from svn, and so far I cannot reproduce your problem by running your script. > >Trying the 1.0.3 release now, same platform, still cannot reproduce the >crash. > >What platform are you on? 
> >I can find a problem I run this code using 'valgrind' to detect memory >errors - I get exuberant warnings now. Looks like you're on to >something.. valgrind doesn't report these warnings when the workaround >is enabled instead. > >I'll try to look into this more deeply later. > >Regards, > >Martijn > I'm on an up to date gentoo stable box, with fairly aggressive optimization settings. CFLAGS="-march=pentium4 -O3 -pipe -mfpmath=sse -fomit-frame-pointer" To be exact. The problem seems to be related to text node handling. I stripped the test case down to the bare minimum for my box, but if you're having trouble reproducing try to add more whitespace to the test data. Let me know if you can't reproduce the segfault, and I'll try to get it to crash on one of our redhat boxes. --------- John Krukoff jkrukoff at ltgc.com From faassen at infrae.com Fri Aug 25 12:50:14 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 25 Aug 2006 12:50:14 +0200 Subject: [lxml-dev] Replace/copy related segfault in lxml In-Reply-To: <004801c6c829$4d100e30$051ea8c0@naomi> References: <004801c6c829$4d100e30$051ea8c0@naomi> Message-ID: <44EED5E6.2050606@infrae.com> John Krukoff wrote: [snip] > > The problem seems to be related to text node handling. I stripped the test > case down to the bare minimum for my box, but if you're having trouble > reproducing try to add more whitespace to the test data. Let me know if you > can't reproduce the segfault, and I'll try to get it to crash on one of our > redhat boxes. Sorry I wasn't more clear in my previous mail, I actually intended to acknowledge your problem. Since valgrind complains it's clear there is a memory allocation problem somewhere, it just doesn't show up with some platforms and/or compilation settings. Thankfully we have valgrind; I only thought of using it halfway writing the mail back to you. :) So, to be clear: problem reproduced here, acknowledged, and need to work on a fix. 
Regards, Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 25 22:28:26 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 25 Aug 2006 22:28:26 +0200 Subject: [lxml-dev] Replace/copy related segfault in lxml In-Reply-To: <1156421097.17673.20.camel@localhost> References: <1156421097.17673.20.camel@localhost> Message-ID: <44EF5D6A.3080807@gkec.informatik.tu-darmstadt.de>

Hi John, John Krukoff wrote: > So, I've been making extensive use of lxml 1.0.3, and have come across > another crash bug. This one also appears to be related to subtree > replacement. Thanks for reporting this. It's a bug in the replace() method. The Python document reference (and thus the document itself) can be freed before copying the tail content from it. Here's a fix against the trunk that should also apply to 1.0.3. Please test it. Stefan

Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (revision 31246)
+++ src/lxml/etree.pyx (working copy)
@@ -797,9 +797,9 @@
         c_new_node = new_element._c_node
         c_new_next = c_new_node.next
         tree.xmlReplaceNode(c_old_node, c_new_node)
-        moveNodeToDocument(new_element, self._doc)
         _moveTail(c_new_next, c_new_node)
         _moveTail(c_old_next, c_old_node)
+        moveNodeToDocument(new_element, self._doc)
 
     # PROPERTIES
     property tag:

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 25 23:01:19 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 25 Aug 2006 23:01:19 +0200 Subject: [lxml-dev] No extend method on elements? In-Reply-To: References: <1156426135.17673.43.camel@localhost> Message-ID: <44EF651F.7020106@gkec.informatik.tu-darmstadt.de>

Fredrik Lundh wrote: > John Krukoff wrote: > >> I know ElementTree doesn't support it, but is there any chance of >> getting an extend method on Elements? > > ET 1.3 has an extend() method. That's good to know. Then I guess lxml 1.1 should have one, too.
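A rough illustration of the two spellings under discussion in this thread: slice assignment at the end of the child list, and an extend() method. It uses the standard library's xml.etree.ElementTree as a stand-in, which is an assumption made so the snippet runs without lxml installed. The element names and the children() helper are made up for the example.

```python
import xml.etree.ElementTree as ET


def children(parent):
    """Return the tag names of the direct children (illustrative helper)."""
    return [child.tag for child in parent]


# Slice assignment past the last child appends the new elements.
target = ET.fromstring("<root><a/></root>")
target[len(target):] = [ET.Element("b"), ET.Element("c")]

# extend() expresses the same operation directly.
other = ET.fromstring("<root><a/></root>")
other.extend([ET.Element("b"), ET.Element("c")])

print(children(target))
print(children(other))
```

Both trees end up with the same children in the same order; extend() simply states the intent more plainly than the slice form.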
>> element[ len( element ) : len( element ) ] = otherelement > > shorter: > > element[len(element):] = otherelement That's the "obvious" way of implementing it. So here's a quick and small patch against the trunk that adds the function to etree. Something like this will make it into 1.1. Stefan

Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (revision 31661)
+++ src/lxml/etree.pyx (working copy)
@@ -725,6 +725,11 @@
             # parent element has moved; change them too..
             moveNodeToDocument(element, self._doc)
 
+    def extend(self, elements):
+        """Extends the current children by the elements in the iterable.
+        """
+        self[python.PY_SSIZE_T_MAX:python.PY_SSIZE_T_MAX] = elements
+
     def clear(self):
         """Resets an element. This function removes all subelements, clears
         all attributes and sets the text and tail

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 07:57:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 30 Aug 2006 07:57:30 +0200 Subject: [lxml-dev] Building dynamically-linked lxml on windows using mingw32 In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com> References: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com> Message-ID: <44F528C9.4080502@gkec.informatik.tu-darmstadt.de>

Hi Ashish, Ashish Kulkarni wrote: > I've successfully used ming32 to build lxml (dynamically linked). Thanks for sharing your experience. It's always helpful to have this kind of info archived on the list so that others can find it. > I was > unable to get the static linking to work, because I was unable to get the > VC++ 2003 Toolkit compiler and trying static linking with gcc gives lots of > errors. That would be the expected behaviour, I guess. Even using newer MS compilers with the VC-2003 compiled Python interpreter does not work, from what I've heard. That's been discussed on python-dev for some other extensions a while ago.
Don't remember the result, though... Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 08:03:27 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 30 Aug 2006 08:03:27 +0200 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com> References: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com> Message-ID: <44F52A2F.9070902@gkec.informatik.tu-darmstadt.de> Hi Ashish, Ashish Kulkarni wrote: > I couldn't build the lxml.objectify extension for 1.1beta: apparently > there is no pyrex-generated C file in the source distribution. Right, my fault. It's fixed now (on the trunk), just needed an additional "objectify.c" entry in the MANIFEST.in file. You can build the file yourself if you install the patched Pyrex version as described in build.txt. > Thus the 1.1 beta builds have that extension disabled. That's ok, 1.1 final (and 1.0.4) will be out pretty soon, so it's enough if we have that working by then. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 08:09:04 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 30 Aug 2006 08:09:04 +0200 Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms? In-Reply-To: <44EC69D2.6080404@infrae.com> References: <44EC69D2.6080404@infrae.com> Message-ID: <44F52B80.9080500@gkec.informatik.tu-darmstadt.de> Hi, Martijn Faassen wrote: > we see that 1.0.2 has support for lots of different platforms, > including the nice static windows build, but 1.0.3 has not. It's summer holiday time, I guess that's the reason. Since there was a crash bug in 1.0.3, I'll release a 1.0.4 soon, so it's not too much of a problem if eggs are missing for 1.0.3. But since I then really, /really/ hope that that'll finally be the last 1.0 release necessary, I'll be as happy as Martijn to see egg contributions. Stefan