From jkrukoff at ltgc.com Tue Aug 1 05:13:15 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Mon, 31 Jul 2006 21:13:15 -0600
Subject: [lxml-dev] Copying an ElementTree doesn't work.
Message-ID: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com>
Can someone explain to me why, when an ElementTree is copied, its root
element isn't copied?
>>> import lxml.etree as etree
>>> import copy
>>> root = etree.XML( '' )
>>> tree = copy.copy( etree.ElementTree( root ) )
>>> tree.getroot( ) is None
True
I get the same behaviour with deepcopy as well. Am I just supposed to
always be using Elements and not ElementTrees? I'm running lxml
1.0.2 on Python 2.4.3, if that matters.
From jkrukoff at ltgc.com Tue Aug 1 05:33:52 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Mon, 31 Jul 2006 21:33:52 -0600
Subject: [lxml-dev] Segfault in lxml during element copy
Message-ID: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com>
I've been working on an XML based middleware system written in python
and lxml, and I've started experiencing a segfault problem with lxml
just as it's being rolled out to the rest of the team. Embarrassing,
you know?
It looks like a double free problem, as the crash is always accompanied
by a glibc message that looks like this:
*** glibc detected *** free(): invalid pointer: 0x0813e1a4 ***
I've tried to come up with a stripped down test case to repeat the
problem, but have been unable to reproduce it except in the full
application. It's not absolutely consistent, I'll have to run the same
request 3 or 4 times before it crashes, but it always does, even while
generating identical output from identical input for those 3 or 4 calls.
I've tracked down the line it crashes at, and it's a simple copy
called on an XML element:
copied = copy.copy( element )
If I remove it, and operate on the source xml directly instead of
copying it (it's really just a safety mechanism), it still crashes,
just in more random locations.
I'm running lxml 1.0.2, on Python 2.4.3, with libxml2 2.6.26 and
libxslt 1.1.17 if it matters. The problem is reproducible on a
coworker's machine, also running lxml 1.0.2 with slightly different
minor revisions of the xml libraries.
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 07:25:29 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 07:25:29 +0200
Subject: [lxml-dev] Copying an ElementTree doesn't work.
In-Reply-To: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com>
References: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com>
Message-ID: <44CEE5C9.7020100@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> Can someone explain to me why, when an ElementTree is copied, its root
> element isn't copied?
>
>>>> import lxml.etree as etree
>>>> import copy
>>>> root = etree.XML( '' )
>>>> tree = copy.copy( etree.ElementTree( root ) )
>>>> tree.getroot( ) is None
> True
>
> I get the same behaviour with deepcopy as well. Am I just supposed to
> always be using Elements and not ElementTrees? I'm running lxml
> 1.0.2 on Python 2.4.3, if that matters.
Copying ElementTrees is not currently implemented. The only reason to implement
it would be to avoid surprises like this one; there is no real gain otherwise.
I do not even see why you would want to copy an ElementTree.
As ElementTrees are immutable, the above is not different from this:
tree = etree.ElementTree(root)
I'll add __copy__ and __deepcopy__, though, so that the above problem will
disappear. So, thanks for reporting this.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 07:56:20 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 07:56:20 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com>
Message-ID: <44CEED04.8080300@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> I've been working on an XML based middleware system written in python
> and lxml, and I've started experiencing a segfault problem with lxml
> just as it's being rolled out to the rest of the team. Embarrassing,
> you know?
Sorry for that.
> It looks like a double free problem, as the crash is always accompanied
> by a glibc message that looks like this:
> *** glibc detected *** free(): invalid pointer: 0x0813e1a4 ***
*May* be a double free problem, yes.
> I've tried to come up with a stripped down test case to repeat the
> problem, but have been unable to reproduce it except in the full
> application. It's not absolutely consistent, I'll have to run the same
> request 3 or 4 times before it crashes, but it always does, even while
> generating identical output from identical input for those 3 or 4 calls.
>
> I've tracked down the line it crashes at, and it's a simple copy
> called on an XML element:
> copied = copy.copy( element )
?? You mean, you get the above error ('free(): invalid pointer') when you call
this? Then I have no idea where that bug could come from. At least, it can't
really be copy() that triggers it...
BTW, in lxml, copy() is the same as deepcopy(). Read doc/compatibility.txt on
this.
> If I remove it, and operate on the source xml directly instead of
> copying it (it's really just a safety mechanism), it still crashes,
> just in more random locations.
That's likely, yes. Looks like your XML tree became corrupted in some way, so
when the broken part of it is accessed, it crashes.
> I'm running lxml 1.0.2, on Python 2.4.3, with libxml2 2.6.26 and
> libxslt 1.1.17 if it matters. The problem is reproducible on a
> coworker's machine, also running lxml 1.0.2 with slightly different
> minor revisions of the xml libraries.
Ok. Thanks for reporting this. We had a report before about lxml crashing in
certain bizarre and difficult to reproduce situations, so maybe this is the
same bug. Given the information above, it would be really hard for us to try
to reproduce the bug, so if you want to help, I can only ask you to try to
strip down your program to a relevant portion that allows us to actually see
the bug ourselves. Otherwise it will be near impossible to fix it.
Thanks for reporting this,
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 09:36:55 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 09:36:55 +0200
Subject: [lxml-dev] An intriguing behaviour of xpath in lxml
In-Reply-To:
References:
Message-ID: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de>
Hi Agustin,
Agustín Villena wrote:
> I already know that xpath(".") in the document node works, but it is
> beyond my understanding why xpath("/") is not implemented.
Well, what would you expect it to return? The XPath spec says:
"""
/ selects the document root (which is always the parent of the document element)
"""
The document element is returned by "/*", so it's the root element of the
document in ElementTree. The "document root" itself is not available in the
tree model provided by lxml.
It /could/ be a possibility to deliberately diverge from the spec here and
return the root element instead.
So, maybe you can enlighten us with your use case, so that we can decide what
implementation would fit here.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 10:01:42 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 10:01:42 +0200
Subject: [lxml-dev] Copying an ElementTree doesn't work.
In-Reply-To: <20060801014422.6bks9ygcu8gw0wkw@webmail.ltgc.com>
References: <20060731211315.f0qc3sg30gccw4wk@webmail.ltgc.com> <44CEE5C9.7020100@gkec.informatik.tu-darmstadt.de>
<20060801014422.6bks9ygcu8gw0wkw@webmail.ltgc.com>
Message-ID: <44CF0A66.20203@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> Quoting Stefan Behnel :
>> John Krukoff wrote:
>>> Can someone explain to me why, when an ElementTree is copied, its root
>>> element isn't copied?
>>>
>>>>>> import lxml.etree as etree
>>>>>> import copy
>>>>>> root = etree.XML( '' )
>>>>>> tree = copy.copy( etree.ElementTree( root ) )
>>>>>> tree.getroot( ) is None
>>> True
>>
>> As ElementTrees are immutable, the above is not different from this:
>>
>> tree = etree.ElementTree(root)
>>
>> I'll add __copy__ and __deepcopy__, though, so that the above problem
>> will disappear. So, thanks for reporting this.
>
> For what it's worth, the use case is that I have an element tree that I
> want to copy multiple times, before performing destructive changes to
> the copies. Currently, copying the contents of an element tree to
> another element tree is kind of clunky:
>
>>>> original = etree.ElementTree( etree.XML( '' ) )
>>>> copied = etree.ElementTree( copy.copy( original.getroot( ) ) )
>
> which is why I was asking if the expected use is to always pass around
> elements and wrap them with element trees only when it was convenient to
> use the element tree methods (XSLT being what I'm interested in).
>
> So, thanks, the fix will make this look a little less ugly.
Ok, sure. Just for code clarity, you might still want to use deepcopy()
instead of copy(), since not everybody is necessarily aware of the fact that
lxml implements them the same way.
Note also that copying an ElementTree actually now produces a shallow copy of
the ElementTree. The XML tree is not touched in this case.
Here is the patch, BTW, in case you want to apply it yourself. It will be in
lxml 1.0.3 and 1.1, which are expected not too late this month.
Stefan
Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (Revision 30633)
+++ src/lxml/etree.pyx (Arbeitskopie)
@@ -395,6 +395,15 @@
         """
         return self._context_node
+    def __copy__(self):
+        return ElementTree(self._context_node)
+
+    def __deepcopy__(self, memo):
+        if self._context_node is None:
+            return ElementTree()
+        else:
+            return ElementTree( self._context_node.__copy__() )
+
     property docinfo:
         """Information about the document provided by parser and DTD. This
         value is only defined for ElementTree objects based on the root node
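
The use case discussed in this thread (copy a tree several times, then make
destructive changes to the copies) can be sketched roughly as follows. This is
only an illustration: it uses the standard library's xml.etree.ElementTree as a
stand-in for lxml.etree, the helper name copy_tree and the sample document are
hypothetical, and it relies on deepcopy() as recommended above.

```python
import copy
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def copy_tree(tree):
    """Hypothetical helper: wrap a deep copy of the root in a new ElementTree."""
    root = tree.getroot()
    if root is None:
        # An empty tree has nothing to copy.
        return ET.ElementTree()
    return ET.ElementTree(copy.deepcopy(root))


# The sample document is made up for this sketch.
original = ET.ElementTree(ET.fromstring("<root><child/></root>"))
copied = copy_tree(original)

# A destructive change to the copy leaves the original untouched.
copied.getroot().remove(copied.getroot()[0])
```

After the removal, the original root still has its child while the copied root
is empty, which is the property the use case depends on.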
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 10:17:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 10:17:24 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <20060801015204.2k8stoit344kwwww@webmail.ltgc.com>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de>
<20060801015204.2k8stoit344kwwww@webmail.ltgc.com>
Message-ID: <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> Thanks for the response. Yeah, I know just how vague an error report
> this is. I was really hoping I was hitting something that someone else
> had already encountered. I've already wasted a day trying to strip the
> program down to just the lxml operations, and haven't been able to come
> up with a reduced set of the program that still causes the crash.
Try to think about the main treatments you apply to trees. Do you move
elements between trees? What happens to the source tree? Does the crash go
away if you keep a reference to it? (maybe in a set or list)
Do you keep cyclic references between objects that reference elements, i.e. is
the Python cyclic garbage collector involved in cleaning up XML trees?
If you use XSLT, can you reproduce the crash if you build the result tree (or
a simpler one) by hand? Do you use XPath calls or extension functions? Are
they required to trigger the crash?
These kinds of bugs are mostly related to garbage collection and Python
reference counting, so try to concentrate on code that results in freeing
references to elements and trees.
There is also a tool we commonly use to debug memory handling in lxml.etree.
It's called "valgrind". doc/valgrind.txt contains a command line that allows
you to run lxml with it. This gives you a stack trace when problems occur or
when the program crashes that *might* give us a hint on what happened. In case
you want to try, you can send me the output in private e-mail (preferably
bzip2-ed or gzipped) so that I can take a look at it.
> I'll spend another day on this, and see if I can't do better.
Thanks, we really appreciate this kind of help.
Stefan
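
One way to check whether the cyclic garbage collector is involved, as asked
above, is to build the suspected reference cycle in isolation and see whether
the objects disappear only after an explicit gc.collect(). The following is a
rough sketch under several assumptions: it runs on CPython, it uses the standard
library's xml.etree.ElementTree in place of lxml.etree, and the Wrapper class is
a hypothetical stand-in for application objects that reference elements.

```python
import gc
import weakref
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


class Wrapper:
    """Hypothetical application object that holds on to an element."""

    def __init__(self, element):
        self.element = element
        self.partner = None


def cycle_survives_until_gc():
    """Return (alive_before, alive_after) around an explicit gc.collect()."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the experiment
    try:
        root = ET.fromstring("<root><child/></root>")
        a = Wrapper(root)
        b = Wrapper(root[0])
        a.partner, b.partner = b, a  # a and b now form a reference cycle
        probe = weakref.ref(a)
        del a, b, root  # reference counting alone cannot free the cycle
        alive_before = probe() is not None
        gc.collect()  # the cyclic collector frees the wrappers
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()


alive_before, alive_after = cycle_survives_until_gc()
```

If the wrappers are still alive before the explicit collection and gone after
it, the elements they hold are being released by the cyclic collector, which is
the situation worth testing against the crash.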
From faassen at infrae.com Tue Aug 1 11:13:13 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 11:13:13 +0200
Subject: [lxml-dev] lxml - exslt - regexp:match()
In-Reply-To: <44CE477A.1030000@gkec.informatik.tu-darmstadt.de>
References: <149473834@web.de>
<44CE477A.1030000@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF1B29.1050609@infrae.com>
Stefan Behnel wrote:
[snip]
> For comparison, I now implemented the examples from the page as unit tests,
> which sadly showed that Python's regexps are incompatible with what EXSLT
> requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only
> the last "t" is returned for the group by re.findall(). So we can't claim
> compatibility with EXSLT at this point. -- Note, though, that I never really
> said it was compatible, it just builds on Python's re module. I still think
> that's enough for a Python XML library.
If it's not compatible, I think it should be invoked differently than in
the EXSLT way. This way someone dropping in an EXSLT stylesheet with
regexes doesn't have a half-working stylesheet but a completely and
clearly failing stylesheet: lxml doesn't support the regexes. In
addition, the path forward to getting the stylesheet working is clear:
use the Python-based and deliberately incompatible regex facility
instead, and rewrite the regexes.
Regards,
Martijn
From faassen at infrae.com Tue Aug 1 11:15:53 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 11:15:53 +0200
Subject: [lxml-dev] An intriguing behaviour of xpath in lxml
In-Reply-To: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de>
References:
<44CF0497.6050106@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF1BC9.7040102@infrae.com>
Stefan Behnel wrote:
> Hi Agustin,
>
> Agustín Villena wrote:
>> I already know that xpath(".") in the document node works, but it is
>> beyond my understanding why xpath("/") is not implemented.
>
> Well, what would you expect it to return? The XPath spec says:
>
> """ / selects the document root (which is always the parent of the
> document element) """
>
> The document element is returned by "/*", so it's the root element of
> the document in ElementTree. The "document root" itself is not
> available in the tree model provided by lxml.
>
> It /could/ be a possibility to deliberately diverge from the spec
> here and return the root element instead.
What about returning a root ElementTree? Then again, that is not the
parent of the document element at present in our tree model, right? Or
is it? Changing the getparent() behavior will have consequences we need
to consider carefully.
> So, maybe you can enlighten us with your use case, so that we can
> decide what implementation would fit here.
Yes, that would indeed be helpful.
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 11:47:53 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 11:47:53 +0200
Subject: [lxml-dev] lxml - exslt - regexp:match()
In-Reply-To: <44CF1B29.1050609@infrae.com>
References: <149473834@web.de>
<44CE477A.1030000@gkec.informatik.tu-darmstadt.de>
<44CF1B29.1050609@infrae.com>
Message-ID: <44CF2349.70602@gkec.informatik.tu-darmstadt.de>
Hi Martijn,
Martijn Faassen wrote:
> Stefan Behnel wrote:
> [snip]
>> For comparison, I now implemented the examples from the page as unit
>> tests,
>> which sadly showed that Python's regexps are incompatible with what EXSLT
>> requires. The Python RE "([a-z])+ " does not match "test " as in
>> EXSLT, only
>> the last "t" is returned for the group by re.findall(). So we can't claim
>> compatibility with EXSLT at this point. -- Note, though, that I never
>> really
>> said it was compatible, it just builds on Python's re module. I still
>> think
>> that's enough for a Python XML library.
>
> If it's not compatible, I think it should be invoked differently than in
> the EXSLT way. This way someone dropping in an EXSLT stylesheet with
> regexes doesn't have a half-working stylesheet but a completely and
> clearly failing stylesheet: lxml doesn't support the regexes. In
> addition, the path forward to getting the stylesheet working is clear:
> use the Python-based and deliberately incompatible regex facility
> instead, and rewrite the regexes.
Hmmm, I feel invited to disagree here. I reread the EXSLT spec on this topic
and it does not contain any RE syntax specification and is rather unclear
about what is required for compliance. It says this in the introduction of the
RE module:
"""
For ease of implementation, the regular expressions used in this module
currently use the Javascript regular expression syntax.
"""
while in the description of the functions, it mainly uses this wording:
"""
The second argument is a regular expression that follows the Javascript
regular expression syntax.
"""
So, the way I read it, the "currently" does not seem to indicate a clear
obligation to obey the actual RE syntax used in the spec. Especially the "ease
of implementation" calls for a Python 're' implementation in lxml. :)
I also believe that people using XML in a Python environment would rather
expect regular expressions to be compatible with what they know from Python's
re module (where they are pretty well defined) than with JavaScript
expressions. So far, the differences only seem to show for repeated groups, so
a large area of use cases is even compatible. BTW, the use case given in the
EXSLT spec is easily rewritten by moving the RE repeat operator (+/*) into the
group, so if portability is really required in this specific case, it can be
achieved on the user side.
Stefan
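
The repeated-group difference described above can be reproduced with Python's
re module alone. This small sketch uses the pattern and input quoted in the
thread; the variable names are made up for the illustration.

```python
import re

text = "test "

# Repeat operator outside the group: the group keeps only its last
# iteration, so findall() reports just the final letter.
repeated_group = re.findall(r"([a-z])+ ", text)

# Repeat operator moved inside the group: the group spans the whole word.
repeat_inside = re.findall(r"([a-z]+) ", text)

print(repeated_group)  # ['t']
print(repeat_inside)   # ['test']
```

Moving the repeat operator into the group is the user-side rewrite mentioned
above for cases where the whole match is wanted.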
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 12:02:03 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 12:02:03 +0200
Subject: [lxml-dev] An intriguing behaviour of xpath in lxml
In-Reply-To: <44CF1BC9.7040102@infrae.com>
References:
<44CF0497.6050106@gkec.informatik.tu-darmstadt.de>
<44CF1BC9.7040102@infrae.com>
Message-ID: <44CF269B.8080905@gkec.informatik.tu-darmstadt.de>
Hi Martijn,
Martijn Faassen wrote:
> Stefan Behnel wrote:
>> Hi Agustin,
>>
>> Agustín Villena wrote:
>>> I already know that xpath(".") in the document node works, but it is
>>> beyond my understanding why xpath("/") is not implemented.
>>
>> Well, what would you expect it to return? The XPath spec says:
>>
>> """ / selects the document root (which is always the parent of the
>> document element) """
>>
>> The document element is returned by "/*", so it's the root element of
>> the document in ElementTree. The "document root" itself is not
>> available in the tree model provided by lxml.
>>
>> It /could/ be a possibility to deliberately diverge from the spec
>> here and return the root element instead.
>
> What about returning a root ElementTree?
Then that would be the only special case that returns an ElementTree from an
XPath expression, although there is currently no way to get an ElementTree
passed /into/ an XPath expression. And XPath extension functions would have to
start caring about this, too.
> Then again, that is not the parent of the document element at present
> in our tree model, right? Or is it?
No. ElementTrees and Elements are different things that serve different purposes.
> Changing the getparent() behavior will have consequences we need
> to consider carefully.
I dislike the idea of having different (incompatible) return values only to
match a single special case. If we say we return an Element from a function,
having a special case that can return an ElementTree is far from intuitive and
pretty error prone.
So, depending on the use case, we may consider
a) leaving it as is
b) raise a different exception to make the problem more understandable
c) return None to avoid the exception (not really a good idea, but would match
the behaviour of the getparent() function)
d) return a node set with the root element (thus diverging from the spec)
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 13:56:25 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 13:56:25 +0200
Subject: [lxml-dev] An intriguing behaviour of xpath in lxml
In-Reply-To:
References:
<44CF0497.6050106@gkec.informatik.tu-darmstadt.de>
<44CF1BC9.7040102@infrae.com>
<44CF269B.8080905@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF4169.5070603@gkec.informatik.tu-darmstadt.de>
Hi Agustin.
Agustin Villena wrote:
> On 8/1/06, Stefan Behnel wrote:
> >> Agustín Villena wrote:
> >>> I already know that xpath(".") in the document node works, but it is
> >>> beyond my understanding why xpath("/") is not implemented.
> >>
> >> Well, what would you expect it to return? The XPath spec says:
> >>
> >> """ / selects the document root (which is always the parent of the
> >> document element) """
> >>
> >> The document element is returned by "/*", so it's the root element of
> >> the document in ElementTree. The "document root" itself is not
> >> available in the tree model provided by lxml.
>
> So, depending on the use case, we may consider
> a) leaving it as is
> b) raise a different exception to make the problem more understandable
> c) return None to avoid the exception (not really a good idea, but
> would match
> the behaviour of the getparent() function)
> d) return a node set with the root element (thus diverging from the
> spec)
> Well, the use case is really simple. I'm running an internal course
> teaching XML technologies to my co-workers, and I chose lxml as the
> best trade-off between ease of use and power. The problem arose when my
> "students" began toying with xpath... Surprisingly, the most common first
> case that they tried was xpath("/"), and the exception really confuses
> them, and me.
;) Nice trap. Guess I'd try that first, too.
> IMHO there are two paths:
> - return a node set with the root element. PRO: it is intuitive. CON: it
> diverges from the spec
It has the advantage of actually returning /something/ useful. It also allows
users to access the root ElementTree if they like and thus more or less does
what can be expected.
I mean, this is a rare case anyway and it is actually well defined, so it
would be wrong to raise an exception and thus tell the user "you did something
wrong". It's a valid XPath expression and therefore perfectly reasonable to
use it. I'll just document the difference and that's it.
What I now implemented is: if the document root is returned, find its first
child and return it as part of a node set instead. If it's not found, it
returns None in the node set, but that shouldn't normally happen.
> Thanks for your feedback and the great lxml
:)
Stefan
From faassen at infrae.com Tue Aug 1 14:53:36 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 14:53:36 +0200
Subject: [lxml-dev] An intriguing behaviour of xpath in lxml
In-Reply-To: <44CF4169.5070603@gkec.informatik.tu-darmstadt.de>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de>
<44CF4169.5070603@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF4ED0.2040105@infrae.com>
Stefan Behnel wrote:
[snip]
> What I now implemented is: if the document root is returned, find its first
> child and return it as part of a node set instead. If it's not found, it
> returns None in the node set, but that shouldn't normally happen.
This worries me a little...
How does that work when the ancestor axis is used? The spec says:
"""
the ancestor axis contains the ancestors of the context node; the
ancestors of the context node consist of the parent of context node and
the parent's parent and so on; thus, the ancestor axis will always
include the root node, unless the context node is the root node"""
"""
the ancestor-or-self axis contains the context node and the ancestors of
the context node; thus, the ancestor axis will always include the root node
"""
Would that mean the current implementation creates double entries when
these axes are used? That's not ideal.
Note also:
"""
The root node is the root of the tree. A root node does not occur except
as the root of the tree. The element node for the document element is a
child of the root node. The root node also has as children processing
instruction and comment nodes for processing instructions and comments
that occur in the prolog and after the end of the document element
"""
Perhaps we should implement a special kind of node that represents the
root node. It'd not occur in a normal ElementTree DOM, but it's there
when you use XPath. It can be also serialized, just like an element, but
would include the extra comments that may be there.
Then again, we already diverge from strict XPath when we deal with
attribute (we have no attribute node), or text (we have no text node).
Diverging with root nodes wouldn't be a disaster in that picture.
That said, the root node is a lot more like an element than these other
cases, in that a root node has children, just like element nodes.
Regards,
Martijn
P.S. What do we do with namespace nodes by the way?
From agustin.villena at gmail.com Tue Aug 1 15:17:48 2006
From: agustin.villena at gmail.com (Agustín Villena)
Date: Tue, 01 Aug 2006 09:17:48 -0400
Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml?
Message-ID:
Hi!
I'm working on processing a batch of digitally signed XML documents. My problem
is illustrated by this example:
a) The original XML documents were enveloped in a container that has a default
namespace and a signature. The internal documents also have their own
signatures. I don't have access to these "envelopes" anymore.
Some Data
b) Sadly, a third-party program "extracted" the internal documents,
"forgetting" the envelope's default namespace and thereby invalidating the
documents' signatures.
Example of an invalid extracted document:
Some Data
What was needed (xml 1)
--------------------------------------------------
Some Data
First question:
* Is there any way with lxml to add a default namespace to an existing
xml-tree?
Now, I'm trying to patch those broken xmls by injecting the namespace into
the nodes that need to belong to the missing namespace, but the result
is ugly:
python code
-------------------------------------------------------------
from lxml import etree
NEW_NS = "http://www.example.org/example"
doc = etree.parse("no_ns_doc.xml")
# add the namespace to the root node
doc.getroot().tag = "{%s}%s" % (NEW_NS, doc.getroot().tag)
# add the namespace to the first child of the root node and its
# descendants, since we don't want to touch the namespace of the
# Signature node
for elem in doc.getroot()[0].getiterator():
    elem.tag = "{%s}%s" % (NEW_NS, elem.tag)
doc.write("ns_patched_doc.xml")
result (xml 2)
-------------------------------------------------------------
Some Data
?
I know that xml 1 and xml 2 are semantically the same, but the
customer wants his XMLs to appear as in the xml 1 example, or with
a less ugly prefix.
Is there any way to force the use of a prettier prefix?
Any ideas?
Thanks
Agustin
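
For the prefix question, one possible approach can be sketched with the standard
library's xml.etree.ElementTree, used here only as a stand-in because this
message does not show how lxml 1.0.2 would handle it. The sample document, the
"ex" prefix and the qualify() helper are all assumptions made for this sketch.
Note that register_namespace() changes a process-wide mapping.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

NEW_NS = "http://www.example.org/example"

# Ask the serializer to use the (hypothetical) prefix "ex" for this
# namespace instead of an auto-generated one such as "ns0".
ET.register_namespace("ex", NEW_NS)


def qualify(elem, namespace):
    """Hypothetical helper: put an unqualified tag into the given namespace."""
    if not elem.tag.startswith("{"):
        elem.tag = "{%s}%s" % (namespace, elem.tag)


# Made-up document: the real XML samples are not part of this message.
root = ET.fromstring(
    "<doc><body><item>Some Data</item></body><Signature/></doc>"
)

qualify(root, NEW_NS)
# Qualify the first child and its descendants; leave Signature untouched.
for elem in root[0].iter():
    qualify(elem, NEW_NS)

patched = ET.tostring(root, encoding="unicode")
print(patched)
```

This only controls how the prefix is spelled; whether the result matches the
originally signed bytes is a separate question that the sketch does not address.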
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ns_patched_doc.xml
Type: text/xml
Size: 221 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: add_ns_example.py
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment.diff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: no_ns_doc.xml
Type: text/xml
Size: 163 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060801/071630a5/attachment-0001.bin
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 15:24:33 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 15:24:33 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF4ED0.2040105@infrae.com>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de>
<44CF4169.5070603@gkec.informatik.tu-darmstadt.de>
<44CF4ED0.2040105@infrae.com>
Message-ID: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
Martijn Faassen wrote:
> Stefan Behnel wrote:
> [snip]
>> What I now implemented is: if the document root is returned, find its
>> first
>> child and return it as part of a node set instead. If it's not found, it
>> returns None in the node set, but that shouldn't normally happen.
>
> This worries me a little...
>
> How does that work when the ancestor axis is used?
Ah, right. That didn't actually work either so far [1.0.2]:
>>> from lxml import etree
>>> tree = etree.XML("")
>>> tree[0].xpath("ancestor::node()")
Traceback (most recent call last):
NotImplementedError: Not yet implemented result node type: 9
Now it gives this:
>>> tree[0].xpath("ancestor::node()")
[, ]
> Would that mean the current implementation creates double entries when
> these axes are used? That's not ideal.
True, I'd even call that pretty much broken, both in the old and new
implementation.
> What do we do with namespace nodes by the way?
Well:
>>> tree[0].xpath("namespace::*")
Traceback (most recent call last):
NotImplementedError: Not yet implemented result node type: 18
> Perhaps we should implement a special kind of node that represents the
> root node. It'd not occur in a normal ElementTree DOM, but it's there
> when you use XPath. It can be also serialized, just like an element, but
> would include the extra comments that may be there.
Hmmm, if we go for this kind of special casing, I'd rather return an
ElementTree than another special element (that would need to be treated in
custom element class lookup, etc.)
> Then again, we already diverge from strict XPath when we deal with
> attribute (we have no attribute node), or text (we have no text node).
> Diverging with root nodes wouldn't be a disaster in that picture.
>
> That said, the root node is a lot more like an element than these other
> cases, in that a root node has children, just like element nodes.
The xpath() function already has lots of possible return values, so that's
just a few more. However, we still have to handle the case of the ancestor axis.
As you stated correctly, the root node is not part of the ElementTree DOM. So
what about just skipping it completely? Just return an empty node set for "/"
and leave it out in "ancestor::node()". That also fits the getparent() method
and the iterancestors() method. And after all, there /is/ no Element to be
returned here.
Another point is XInclude nodes that stayed in after calling xinclude(). I
guess we can just ignore those, too.
Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI).
Any objections?
Stefan
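
The proposal above (leave the document root out of ancestor results, in line
with getparent() and iterancestors()) can be pictured with a small model. This
is a sketch only: it uses the standard library's xml.etree.ElementTree, which
has no getparent(), so a hand-built parent map plays that role. The helper names
and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def build_parent_map(root):
    """Map every element below root to its parent element."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(elem, parent_map):
    """Return the element ancestors of elem, nearest first.

    The walk stops at the root element: there is no document node in this
    tree model, so nothing above the root is ever reported.
    """
    result = []
    parent = parent_map.get(elem)
    while parent is not None:
        result.append(parent)
        parent = parent_map.get(parent)
    return result


root = ET.fromstring("<a><b><c/></b></a>")
parents = build_parent_map(root)

ancestor_tags = [e.tag for e in ancestors(root[0][0], parents)]
root_ancestors = ancestors(root, parents)

print(ancestor_tags)   # ['b', 'a']
print(root_ancestors)  # []
```

In this model the root element simply has no ancestors, which corresponds to
returning an empty node set for "/" and omitting the document root from
"ancestor::node()".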
From agustin.villena at gmail.com Tue Aug 1 15:50:33 2006
From: agustin.villena at gmail.com (Agustín Villena)
Date: Tue, 01 Aug 2006 09:50:33 -0400
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com>
<44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF5C29.8080200@gmail.com>
Just my two cents:
I'm no expert in XPath, but what intrigues me is that a perfectly
valid expression (and maybe the first XPath expression that anybody learns)
is not valid in lxml.
The problem happens not only on the document node, but on any child:
>>> child = doc.getroot()[0]
>>> child.xpath("/")
Not yet implemented result node type: 9
I remember a recent thread discussing absolute XPath queries in lxml.
Is this another case of that issue? What was the thread's conclusion?
Cheers
Agustin
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 16:09:01 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 16:09:01 +0200
Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml?
In-Reply-To:
References:
Message-ID: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de>
Agustín Villena wrote:
> I'm working on processing a bunch of digitally signed XML documents. My
> problem is illustrated by this example:
>
> a) The original XMLs were enveloped in a container that has a default
> namespace and a signature. The internal XMLs also have their own
> signature. I don't have access to these "envelopes" anymore
>
>
>
>
> Some Data
>
>
>
>
>
>
>
>
>
>
> b) Sadly, a 3rd party software "extracted" the internal documents,
> "forgetting" the envelope's default namespace, therefore invalidating the
> doc's signatures
>
> Example of invalid extracted documents
>
>
> Some Data
>
>
>
>
>
Too bad.
> What was needed (xml 1)
> --------------------------------------------------
>
>
> Some Data
>
>
>
>
>
>
> First question:
> * Is there any way with lxml to add a default namespace to an existing
> xml-tree?
No. lxml is namespace aware, so if there is no namespace it will just think
that's what was intended. The only way to change the namespace is to change
the tag.
> Now, I'm trying to patch those messed xmls, injecting the namespace in
> the nodes that need to belong to the missing namespace, but the result
> is ugly:
>
> python code
> -------------------------------------------------------------
>
> from lxml import etree
>
> NEW_NS = "http://www.example.org/example"
>
> doc = etree.parse("no_ns_doc.xml")
no guarantee, but try adding this here:
    old_root = doc.getroot()
    new_root = old_root.makeelement("{http://www.example.org/example}root",
                                    nsmap={None : "http://www.example.org/example"})
    new_root.append(old_root)
then work on 'new_root' and update the tags as you did below.
> #add namespace to the root node
> doc.getroot().tag="{%s}%s" %(NEW_NS,doc.getroot().tag)
>
> #add namespace to the first child of the root node,
> #since we don't want to touch the namespace of the
> #Signature Node
> for elem in doc.getroot()[0].getiterator():
>     elem.tag="{%s}%s" %(NEW_NS,elem.tag)
doc = etree.ElementTree( new_root[0] )
> doc.write("ns_patched_doc.xml")
The append (i.e. move) operation above should fix the prefixes to match the
ones defined in the new root element (i.e. None - the default prefix).
Stefan
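A minimal sketch of what "changing the tag" looks like in practice, assuming a recent lxml under Python 3. The XML literal is a hypothetical stand-in, not the signed document from this thread. Retagging does move every element into the namespace; as far as I can tell, lxml then declares that namespace under a generated prefix rather than as the default namespace, which is probably the "ugly" result described above.

```python
from lxml import etree

NEW_NS = "http://www.example.org/example"

# Hypothetical stand-in for one of the extracted, namespace-less documents.
root = etree.XML("<root><data>Some Data</data></root>")

# Qualify every plain element tag; comments and PIs have non-string tags.
for elem in root.iter():
    if isinstance(elem.tag, str) and not elem.tag.startswith("{"):
        elem.tag = "{%s}%s" % (NEW_NS, elem.tag)

serialized = etree.tostring(root)
print(serialized)
```

The elements end up in the right namespace either way; only the way the declaration is written in the serialized text differs from a default-namespace document.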
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 16:19:38 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 16:19:38 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF5C29.8080200@gmail.com>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
<44CF5C29.8080200@gmail.com>
Message-ID: <44CF62FA.6090503@gkec.informatik.tu-darmstadt.de>
Agustín Villena wrote:
> Stefan Behnel wrote:
>> As you stated correctly, the root node is not part of the ElementTree DOM. So
>> what about just skipping it completely? Just return an empty node set for "/"
>> and leave it out in "ancestor::node()". That also fits the getparent() method
>> and the iterancestors() method. And after all, there /is/ no Element to be
>> returned here.
>>
>> Another point is XInclude nodes that stayed in after calling xinclude(). I
>> guess we can just ignore those, too.
>>
>> Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI).
>>
>> Any objections?
> Just my two cents:
>
> I'm no expert in XPath, but what intrigues me is that a perfectly
> valid expression (and maybe the first XPath expression that anybody
> learns) is not valid in lxml.
Well, we're just trying to make it valid (or rather: work). The problem is the
mapping of XPath semantics to ElementTree semantics.
> The problem not only happens in the doc node, but in any child.
>
> >>>child = doc.getroot()[0]
> >>>child.xpath("/")
> Not yet implemented result node type: 9
Sure. It's an absolute XPath expression, doesn't depend on the context node.
> I remember a recent thread discusing absolute xpath queries in lxml.
> Is this another case of this issue?
No. This is different, as it does not return an Element. That's why I am
proposing to map the result to an empty node set (i.e. list). That way, it
gets a well defined Python representation that makes sense in the ElementTree
context, where root nodes do not exist. So, you would get exactly those
Elements you asked for. :)
I committed this for now, so, if you want to take a look at it...
Stefan
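For readers who want to try the mapping proposed in this thread, here is a small sketch. It assumes a recent lxml under Python 3, which as far as I know behaves as described here: "/" yields an empty list, the document node is left out of the ancestor axis, and namespace nodes come back as (prefix, URI) tuples. The document and the "urn:example:p" URI are made up for illustration.

```python
from lxml import etree

# Made-up document with one prefixed namespace declaration.
root = etree.XML('<root xmlns:p="urn:example:p"><child/></root>')
child = root[0]

# The document root has no ElementTree counterpart, so nothing is returned.
doc_node_result = child.xpath("/")

# Only real Elements show up on the ancestor axis.
ancestors = child.xpath("ancestor::node()")

# Namespace nodes are reported as (prefix, URI) tuples.
namespaces = child.xpath("namespace::*")

print(doc_node_result, ancestors, namespaces)
```

Note that the namespace axis may also report the implicit "xml" prefix, so it is safer to test for membership than to compare the whole list.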
From faassen at infrae.com Tue Aug 1 17:46:07 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 17:46:07 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com>
<44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF773F.9030408@infrae.com>
Stefan Behnel wrote:
> Martijn Faassen wrote:
[snip]
>> Perhaps we should implement a special kind of node that represents the
>> root node. It'd not occur in a normal ElementTree DOM, but it's there
>> when you use XPath. It can be also serialized, just like an element, but
>> would include the extra comments that may be there.
>
> Hmmm, if we go for this kind of special casing, I'd rather return an
> ElementTree than another special element (that would need to be treated in
> custom element class lookup, etc.)
Advantage of returning a non-ElementTree but something Element-like
(like Comment and ProcessingInstruction) is that iteration and such
works. It's a node that represents the root and can be serialized.
>> Then again, we already diverge from strict XPath when we deal with
>> attribute (we have no attribute node), or text (we have no text node).
>> Diverging with root nodes wouldn't be a disaster in that picture.
>>
>> That said, the root node is a lot more like an element than these other
>> cases, in that a root node has children, just like element nodes.
>
> The xpath() function already has lots of possible return values, so that's
> just a few more. However, we still have to handle the case of the ancestor axis.
>
> As you stated correctly, the root node is not part of the ElementTree DOM. So
> what about just skipping it completely? Just return an empty node set for "/"
> and leave it out in "ancestor::node()". That also fits the getparent() method
> and the iterancestors() method. And after all, there /is/ no Element to be
> returned here.
Well, that gives one no way to access any comments surrounding the
document library from XPath. Not a disaster, but still. Returning
something Element-like sounds the most natural in this case, just like
returning a string is most natural for attribute nodes.
> Another point is XInclude nodes that stayed in after calling xinclude(). I
> guess we can just ignore those, too.
>
> Ok, so what's missing? Namespaces. We can return them as tuple (prefix, URI).
>
> Any objections?
Just URI would be sufficient, but no objection to also returning the
prefix information.
Regards,
Martijn
From agustin.villena at gmail.com Tue Aug 1 17:46:21 2006
From: agustin.villena at gmail.com (Agustín Villena)
Date: Tue, 01 Aug 2006 11:46:21 -0400
Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml?
In-Reply-To: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de>
References:
<44CF607D.3080608@gkec.informatik.tu-darmstadt.de>
Message-ID: <44CF774D.8010407@gmail.com>
Well, testing your lines I now have this code:
---------------------------
from lxml import etree
NEW_NS = "http://www.example.org/example"
def add_ns(node,nsURL):
    if type(node)==etree._Element:
        node.tag="{%s}%s" %(nsURL,node.tag)
doc = etree.parse("no_ns_doc.xml")
old_root = doc.getroot()
new_root = old_root.makeelement("{%s}root" % (NEW_NS),nsmap={None : NEW_NS})
new_root.append(old_root)
add_ns(old_root,NEW_NS)
#add namespace to the first child of the root node,
#since we don't want to touch the namespace of the
#Signature Node
for elem in old_root[0].getiterator():
    add_ns(elem,NEW_NS)
#up to this line, we have this new_root element:
#
new_doc = etree.ElementTree(new_root[0])
#All the children of new_root keep their namespace
#in the new doc. But in the serialized text, this namespace disappears.
#Is this a bug?
new_doc.write("ns_patched_doc.xml")
---------
serialized
----------
Some Data
----------
Too bad...
Any ideas?
Agustin
----------------
As you can see, it almost works! But when we move new_root's
children into new_doc, they lose their
From faassen at infrae.com Tue Aug 1 17:56:28 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 17:56:28 +0200
Subject: [lxml-dev] running the tests on the trunk
Message-ID: <44CF79AC.4090401@infrae.com>
Hi there,
I have trouble running the tests on the current trunk of lxml:
Ran 556 tests in 2.333s
FAILED (failures=1, errors=7)
A lot of this seems to have to do with this attribute error while
running the tests:
AttributeError: 'module' object has no attribute 'iterparse'
What's going on?
Regards,
Martijn
From faassen at infrae.com Tue Aug 1 18:02:59 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 18:02:59 +0200
Subject: [lxml-dev] running the tests on the trunk
In-Reply-To: <44CF79AC.4090401@infrae.com>
References: <44CF79AC.4090401@infrae.com>
Message-ID: <44CF7B33.4030901@infrae.com>
Martijn Faassen wrote:
> Hi there,
>
> I have trouble running the tests on the current trunk of lxml:
>
> Ran 556 tests in 2.333s
>
> FAILED (failures=1, errors=7)
>
> A lot of this seems to have to do with this attribute error while
> running the tests:
>
> AttributeError: 'module' object has no attribute 'iterparse'
>
> What's going on?
I think I figured it out: I need to upgrade my version of *ElementTree*.
Regards,
Martijn
From faassen at infrae.com Tue Aug 1 18:05:49 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 18:05:49 +0200
Subject: [lxml-dev] running the tests on the trunk
In-Reply-To: <44CF7B33.4030901@infrae.com>
References: <44CF79AC.4090401@infrae.com> <44CF7B33.4030901@infrae.com>
Message-ID: <44CF7BDD.1030201@infrae.com>
Martijn Faassen wrote:
> Martijn Faassen wrote:
>> What's going on?
>
> I think I figured it out: I need to upgrade my version of *ElementTree*.
Yup, that eliminated most problems, except for this failure in the doctests:
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 153, in resolvers.txt
----------------------------------------------------------------------
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 153, in resolvers.txt
Failed example:
result = transform(honk_doc)
Expected:
Resolving url hoi:test as prefix honk ... failed
Resolving url hoi:test as prefix hoi ... done
Got:
Resolving url hoi:test as prefix hoi ... done
----------------------------------------------------------------------
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 165, in resolvers.txt
Failed example:
result = transform(normal_doc)
Expected:
Resolving url hoi:test as prefix honk ... failed
Resolving url hoi:test as prefix hoi ... done
Got:
Resolving url hoi:test as prefix hoi ... done
----------------------------------------------------------------------
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 192, in resolvers.txt
Failed example:
transform = etree.XSLT(honk_doc)
Expected:
Resolving url honk:test as prefix honk ... done
Got:
Resolving url honk:test as prefix hoi ... failed
Resolving url honk:test as prefix honk ... done
----------------------------------------------------------------------
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 194, in resolvers.txt
Failed example:
result = transform(normal_doc)
Expected:
Resolving url hoi:test as prefix honk ... failed
Resolving url hoi:test as prefix hoi ... done
Got:
Resolving url hoi:test as prefix hoi ... done
----------------------------------------------------------------------
File
"/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
line 199, in resolvers.txt
Failed example:
transform = etree.XSLT(honk_doc, access_control=ac)
Expected:
Resolving url honk:test as prefix honk ... done
Got:
Resolving url honk:test as prefix hoi ... failed
Resolving url honk:test as prefix honk ... done
From faassen at infrae.com Tue Aug 1 18:26:35 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 18:26:35 +0200
Subject: [lxml-dev] ElementTree comment behavior
Message-ID: <44CF80BB.90404@infrae.com>
Hi there,
The whole XPath root node issue led me to investigate lxml's behavior
with comment nodes, thinking that we might not do the right thing with
mutation (as Comments subclass Element). However, it seems to behave
rationally enough:
>>> import lxml
>>> from lxml import etree
>>> c = etree.Comment('foo')
>>> c.append(etree.Element('bar'))
>>> len(c.getchildren())
0
(I wonder what happens in the C tree though here.. cursory inspection of
the tree.c code of libxml2 doesn't reveal special code to handle this case)
Unfortunately, ElementTree behaves differently in this case!
>>> from elementtree import ElementTree as etree2
>>> c = etree2.Comment('foo')
>>> c.append(etree.Element('bar'))
>>> len(c.getchildren())
1
Evidently it allows child Elements to be added to comments.
What to do in this case?
Regards,
Martijn
From faassen at infrae.com Tue Aug 1 19:09:08 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 01 Aug 2006 19:09:08 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF773F.9030408@infrae.com>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
<44CF773F.9030408@infrae.com>
Message-ID: <44CF8AB4.7040209@infrae.com>
Martijn Faassen wrote:
[snip]
> Well, that gives one no way to access any comments surrounding the
> document library from XPath. Not a disaster, but still. Returning
> something Element-like sounds the most natural in this case, just like
> returning a string is most natural for attribute nodes.
I've just checked in a branch here:
http://codespeak.net/svn/lxml/branch/lxml-xpathroot
which experiments with adding a special XPath Root object. This root
object only shows up when accessing / through XPath - there's no way to
get to it using the normal ElementTree functionality. At first sight
this implementation doesn't appear to be too difficult. I think this is
a nicer solution than just not returning anything.
Unfortunately, my changes also cause memory errors when running the
test. It's possible this happens because we start stuffing our proxy in
the _private of a XML_DOCUMENT_NODE, something that wasn't possible
before, and we're probably not scanning for it accurately in our
deallocation logic. Don't have time to investigate this further now
though, so I'll leave it in the branch for now. Feel free to
investigate, Stefan. :)
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 19:09:51 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 19:09:51 +0200
Subject: [lxml-dev] ElementTree comment behavior
In-Reply-To: <44CF80BB.90404@infrae.com>
References: <44CF80BB.90404@infrae.com>
Message-ID: <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de>
Martijn Faassen wrote:
> The whole XPath root node issue led me to investigate lxml's behavior
> with comment nodes, thinking that we might not do the right thing with
> mutation (as Comments subclass Element). However, it seems to behave
> rationally enough:
>
> >>> import lxml
> >>> from lxml import etree
> >>> c = etree.Comment('foo')
> >>> c.append(etree.Element('bar'))
> >>> len(c.getchildren())
> 0
>
> (I wonder what happens in the C tree though here.. cursory inspection of
> the tree.c code of libxml2 doesn't reveal special code to handle this case)
Well, this is how lxml currently implements _Comment.append():
    def append(self, _Element element):
        pass
Maybe it should rather raise an exception?
> Unfortunately, ElementTree behaves differently in this case!
>
> >>> from elementtree import ElementTree as etree2
> >>> c = etree2.Comment('foo')
> >>> c.append(etree.Element('bar'))
> >>> len(c.getchildren())
> 1
>
> Evidently it allows child Elements to be added to comments.
>
> What to do in this case?
I personally find the behaviour of ET a bit bizarre here. What /is/ the
element child of an XML comment?
Stefan
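To make the ElementTree side of this concrete, here is a short sketch using the standard library's xml.etree.ElementTree under Python 3. That module descends from the elementtree package discussed above, but treating the two as equivalent here is my assumption. The comment accepts the child, yet the serializer only writes the comment text. strict_append() is a hypothetical helper showing the "raise an exception" alternative; it is not part of either library.

```python
import xml.etree.ElementTree as ET

comment = ET.Comment("foo")
comment.append(ET.Element("bar"))

# The child is stored on the comment node...
child_count = len(comment)
# ...but serialization only emits the comment text.
serialized = ET.tostring(comment)
print(child_count, serialized)


def strict_append(parent, child):
    """Hypothetical helper: refuse to add children to comments and PIs."""
    # Comments and processing instructions use a factory function as tag.
    if not isinstance(parent.tag, str):
        raise TypeError("this node cannot have children")
    parent.append(child)
```

Raising keeps the in-memory tree and the serialized output consistent, at the cost of no longer matching ElementTree's permissive behaviour.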
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 1 21:08:20 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 01 Aug 2006 21:08:20 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CF8AB4.7040209@infrae.com>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de>
<44CF773F.9030408@infrae.com> <44CF8AB4.7040209@infrae.com>
Message-ID: <44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de>
Martijn Faassen wrote:
> Martijn Faassen wrote:
> [snip]
>> Well, that gives one no way to access any comments surrounding the
>> document library from XPath. Not a disaster, but still.
Note that this only applies to the return values of XPath calls. Inside the
expression, you can do whatever XPath supports. So you can still navigate the
brothers and sisters of the document root and return the one of them that
you're interested in, without having to pass the root itself into Python.
>> Returning
>> something Element-like sounds the most natural in this case, just like
>> returning a string is most natural for attribute nodes.
I'm still not convinced that this should be Element-like. It's not an Element
and it has no representation in the ElementTree world.
> I've just checked in a branch here:
>
> http://codespeak.net/svn/lxml/branch/lxml-xpathroot
>
> which experiments with adding a special XPath Root object. This root
> object only shows up when accessing / through XPath - there's no way to
> get to it using the normal ElementTree functionality. At first sight
> this implementation doesn't appear to be too difficult. I think this is
> a nicer solution than just not returning anything.
Ok, I can see what you did. You'd have to rewrite that after the merge of the
CAPI branch, which changes loads of stuff under the hood and largely impacts
element class lookup. So it would have to fit in there.
> Unfortunately, my changes also cause memory errors when running the
> test. It's possible this happens because we start stuffing our proxy in
> the _private of a XML_DOCUMENT_NODE, something that wasn't possible
> before, and we're probably not scanning for it accurately in our
> deallocation logic. Don't have time to investigate this further now
> though, so I'll leave it in the branch for now.
doc._private is currently only used in XSLT (which may already interfere when
extension functions are used), but I'm not very happy with the idea of using
xmlDoc like any other element node. It starts with the fact that we now have
_Document and _Root sitting on the same xmlDoc structure. That unnecessarily
complicates the cleanup procedure for what I call a rare special case.
If we really want to put something Element-like in there, we may consider
making it part of the _Document class, which already is unique for the
document root.
Stefan
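A small sketch of the point about navigating inside the expression, assuming a recent lxml under Python 3 and a made-up document: the comment next to the root element can be selected directly, so the document root itself never has to be passed into Python.

```python
from lxml import etree

# Made-up document with a comment in front of the root element.
root = etree.XML("<!-- header note --><root><child/></root>")

# Absolute path: select the comment children of the document root.
top_level_comments = root[0].xpath("/comment()")
texts = [c.text for c in top_level_comments]
print(texts)
```

The same idea applies to processing instructions via "/processing-instruction()".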
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 07:27:59 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 07:27:59 +0200
Subject: [lxml-dev] running the tests on the trunk
In-Reply-To: <44CF7BDD.1030201@infrae.com>
References: <44CF79AC.4090401@infrae.com> <44CF7B33.4030901@infrae.com>
<44CF7BDD.1030201@infrae.com>
Message-ID: <44D037DF.9010204@gkec.informatik.tu-darmstadt.de>
Martijn Faassen wrote:
> that eliminated most problems, except for this failure in the doctests:
>
> File
> "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
> line 153, in resolvers.txt
>
> ----------------------------------------------------------------------
> File
> "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/resolvers.txt",
> line 153, in resolvers.txt
> Failed example:
> result = transform(honk_doc)
> Expected:
> Resolving url hoi:test as prefix honk ... failed
> Resolving url hoi:test as prefix hoi ... done
> Got:
> Resolving url hoi:test as prefix hoi ... done
> ----------------------------------------------------------------------
[snip]
Ah, right. It's the tests that are broken here. I forgot that the resolvers
are stored in a set and thus tested in arbitrary order (interesting that no
one ever reported that for 1.0). So here they seem to use a different order
that leads to different output. Guess I'll have to fix the tests here. Maybe
the best way is to only let the resolver speak that succeeds, not the failed
one(s) that were also tested.
Stefan
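The idea of letting only the successful resolver speak can be sketched in plain Python. This is a toy model, not the lxml resolver API: resolve() and the prefix set are hypothetical names, and the message format is copied from the doctest output quoted above.

```python
def resolve(url, prefixes, log):
    """Try each prefix in arbitrary (set) order; log only the match."""
    for prefix in prefixes:
        if url.startswith(prefix + ":"):
            log.append("Resolving url %s as prefix %s ... done" % (url, prefix))
            return prefix
        # Failed attempts stay silent, so iteration order cannot leak
        # into the output that a doctest compares against.
    return None


log = []
matched = resolve("hoi:test", {"honk", "hoi"}, log)
print(log)
```

Because a set has no guaranteed iteration order, any output produced for the non-matching resolvers would vary between runs or platforms; logging only the match makes the expected output deterministic.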
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 07:45:35 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 07:45:35 +0200
Subject: [lxml-dev] Inyecting a default XML namespace in an existing xml?
In-Reply-To: <44CF774D.8010407@gmail.com>
References: <44CF607D.3080608@gkec.informatik.tu-darmstadt.de>
<44CF774D.8010407@gmail.com>
Message-ID: <44D03BFF.80405@gkec.informatik.tu-darmstadt.de>
Hi,
Agustín Villena wrote:
> Well, testing your lines I now have this code:
> ---------------------------
> from lxml import etree
>
> NEW_NS = "http://www.example.org/example"
>
> def add_ns(node,nsURL):
>     if type(node)==etree._Element:
>         node.tag="{%s}%s" %(nsURL,node.tag)
>
> doc = etree.parse("no_ns_doc.xml")
>
> old_root = doc.getroot()
> new_root = old_root.makeelement("{%s}root" % (NEW_NS),nsmap={None : NEW_NS})
> new_root.append(old_root)
>
> add_ns(old_root,NEW_NS)
> #add namespace to the first child of the root node,
> #since we don't want to touch the namespace of the
> #Signature Node
> for elem in old_root[0].getiterator():
>     add_ns(elem,NEW_NS)
>
> #until this line, we have this new_root element :
> #
>
>
> new_doc = etree.ElementTree(new_root[0])
> #All the children of new_root keeps their namespace
> #in the new doc. But in the serialized text, this namespace disappears
> #is this a bug?
> new_doc.write("ns_patched_doc.xml")
>
> ---------
> serialized
> ----------
>
>
> Some Data
>
>
>
>
>
Hmm, ok, that didn't quite work. Maybe we should just add a helper function
for namespace handling, as Martijn suggested a while ago.
We could implement something like this:
def reassignNamespacePrefixes(element_or_tree, prefixmap):
    """Traverse the tree and replace the prefixes in namespace declarations by
    the URI->prefix mapping defined by prefixmap.
    """
Question: how do we handle the case where a prefix is already used for a
different namespace in the tree?
Stefan
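Until such a helper exists, one possible workaround is to rebuild the tree under a new root that declares the default namespace. The sketch below assumes a recent lxml under Python 3; rebuild_in_namespace() is a hypothetical name, the XML literal is a stand-in, and this has not been tested against real signed documents. Elements that already carry a namespace keep their qualified tag, and how their prefix gets serialized is left to lxml.

```python
import copy

from lxml import etree

NEW_NS = "http://www.example.org/example"


def rebuild_in_namespace(src, ns_uri, parent=None):
    """Copy `src`, moving un-namespaced elements into `ns_uri` (sketch)."""
    if not isinstance(src.tag, str):
        # Comments and processing instructions: append a deep copy.
        node = copy.deepcopy(src)
        if parent is not None:
            parent.append(node)
        return node
    tag = src.tag if src.tag.startswith("{") else "{%s}%s" % (ns_uri, src.tag)
    if parent is None:
        # Only the new root declares the default namespace.
        node = etree.Element(tag, attrib=dict(src.attrib), nsmap={None: ns_uri})
    else:
        node = etree.SubElement(parent, tag, attrib=dict(src.attrib))
    node.text, node.tail = src.text, src.tail
    for child in src:
        rebuild_in_namespace(child, ns_uri, node)
    return node


source = etree.XML("<root><data>Some Data</data></root>")
patched = rebuild_in_namespace(source, NEW_NS)
serialized = etree.tostring(patched)
print(serialized)
```

Creating the elements from scratch sidesteps the prefix question raised above, because the children find the default declaration on the new root instead of needing one of their own.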
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 10:53:55 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 10:53:55 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de>
<20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com>
Message-ID: <44D06823.6030407@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> Okay, I've managed to create a crashing test case that's down to a
> reasonable number of lines of code. I don't think I can remove anything
> else and still have it crash.
Great, thanks for stripping this down.
> There's even some very odd changes that stop it from crashing, such as
> shortening "fieldset" to "f". Fortunately, narrowing this down allowed
> me to create a workaround for the real program, so fixing this is no
> longer so urgent for me.
>
> I've also attached the results of a valgrind run using the recommended
> command line parameters on the test program. I didn't bother gzipping
> it, because it's pretty small.
>
> Please let me know if this fails to crash for you. I have to run it
> using "python test.py" instead of "./test.py" to see the glibc error.
It 'nicely' crashes for me and I think I can tell where it comes from. We use
a global dictionary in the parser that stores tag names, attribute values,
etc. It mainly serves the purpose of reducing the number of expensive malloc
calls and avoiding duplicated storage of constant strings. Normally, it works
just fine, unless there are operations that create additional dictionaries,
like XSLT. :(
So what happens in your case, is: when you move the content of the XSLT result
document over to the document you parsed, it will contain strings from two
different dictionaries (I just verified that). When the documents are freed,
libxml2 checks if the strings it frees are in the document dictionary, sees
that it is not the case (as it came from a different dictionary) and then
frees it. This leaves stale pointers in the second dictionary.
It's too bad we can't control the dictionary created by libxslt for
transformations, as it is automatically created and used when we request a
transformation context. So we can't just replace the dictionary afterwards.
I'm not quite sure what to do here. There are ways to fix this, but they can
be expensive, so I'll just have to figure out which one to go with.
One solution could be to extend the deep traversal that follows moving a
subtree to a different document. We could let it check if the dicts are the
same, and if they are not, copy the strings stored in the source dictionary to
the destination dictionary. As I said, this can be expensive but is a rare
case as (so far) it only applies to partial XSLT results being moved around.
On the other hand, this would also allow moving subtrees between threads
(which use independent dictionaries as well), so maybe it's worth it...
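
A toy model of the failure and of the proposed fix-up, for readers who want to
see the mechanism in isolation. This is only a sketch in plain Python: Heap,
StringDict, Document, move_nodes and run are hypothetical names invented here,
not libxml2 or lxml API, and the "pointers" are just integers. It assumes the
behaviour described above, namely that freeing a document skips strings owned
by its own dictionary and frees everything else directly.

```python
class Heap:
    """Toy allocator that detects invalid or double frees."""

    def __init__(self):
        self._live = {}
        self._next = 0

    def alloc(self, text):
        self._next += 1
        self._live[self._next] = text
        return self._next

    def free(self, ptr):
        if ptr not in self._live:
            raise RuntimeError("invalid free of pointer %d" % ptr)
        del self._live[ptr]

    def read(self, ptr):
        return self._live[ptr]


class StringDict:
    """Toy interning dictionary: one allocation per distinct string."""

    def __init__(self, heap):
        self.heap = heap
        self._by_text = {}

    def intern(self, text):
        if text not in self._by_text:
            self._by_text[text] = self.heap.alloc(text)
        return self._by_text[text]

    def owns(self, ptr):
        return ptr in self._by_text.values()

    def free(self):
        for ptr in self._by_text.values():
            self.heap.free(ptr)
        self._by_text.clear()


class Document:
    def __init__(self, strdict):
        self.dict = strdict
        self.names = []  # pointers to the tag names of this document's nodes

    def add_node(self, text):
        self.names.append(self.dict.intern(text))

    def free_nodes(self):
        # Strings owned by the document dictionary are skipped,
        # everything else is freed directly.
        for ptr in self.names:
            if not self.dict.owns(ptr):
                self.dict.heap.free(ptr)
        self.names = []


def move_nodes(source, target, fix_up):
    """Move all nodes; with fix_up, re-intern foreign strings in the target."""
    for ptr in source.names:
        if fix_up and not target.dict.owns(ptr):
            ptr = target.dict.intern(source.dict.heap.read(ptr))
        target.names.append(ptr)
    source.names = []


def run(fix_up):
    heap = Heap()
    parser_dict, xslt_dict = StringDict(heap), StringDict(heap)
    parsed, result = Document(parser_dict), Document(xslt_dict)
    parsed.add_node("definition")
    result.add_node("generated")
    move_nodes(result, parsed, fix_up)
    try:
        parsed.free_nodes()
        parser_dict.free()
        xslt_dict.free()  # stale pointer here unless the move fixed it up
    except RuntimeError as exc:
        return "crash: %s" % exc
    return "ok"


without_fix = run(fix_up=False)
with_fix = run(fix_up=True)
```

In this model run(False) ends in an invalid free when the second dictionary is
released, while run(True) completes, because the moved string was copied into
the destination dictionary during the traversal.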
As this problem (currently) only appears in XSLT, a second way to handle it
would be to replace the dictionary of the transformation context after
initialisation, but /before/ running the transform. That way, there should be
less content already stored in it that would have to be moved.
While the second one sounds like the least expensive, maybe there are even
better ways I did not think of. I'll take a look at it.
Again, thanks for reporting this and for providing a test case,
Stefan
> ------------------------------------------------------------------------
>
> import lxml.etree as etree
>
> definitionXml = etree.XML( '''
>
> ''' )
>
> definitionXml[ : ] = etree.XSLT( etree.XML( '''
>
>
>
>
>
>
>
>
>
>
>
>
> ''' ) )( definitionXml[ 0 ] ).getroot( )[ : ]
>
> # Segfault occurs on this line.
> del definitionXml
>
> print "Didn't crash!"
> ------------------------------------------------------------------------
> ==29947== Invalid free() / delete / delete[]
> ==29947== at 0x401C0C3: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
> ==29947== by 0x45A161A: xmlFreeNodeList (in /usr/lib/libxml2.so.2.6.26)
> ==29947== Address 0x48AD24C is 20 bytes inside a block of size 1,024 free'd
> ==29947== at 0x401C0C3: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
> ==29947== by 0x46B3EE5: xmlDictFree (in /usr/lib/libxml2.so.2.6.26)
From faassen at infrae.com Wed Aug 2 11:52:09 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 02 Aug 2006 11:52:09 +0200
Subject: [lxml-dev] Return values of XPath calls
In-Reply-To: <44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de>
References: <44CF0497.6050106@gkec.informatik.tu-darmstadt.de> <44CF1BC9.7040102@infrae.com> <44CF269B.8080905@gkec.informatik.tu-darmstadt.de> <44CF4169.5070603@gkec.informatik.tu-darmstadt.de> <44CF4ED0.2040105@infrae.com> <44CF5611.8040904@gkec.informatik.tu-darmstadt.de> <44CF773F.9030408@infrae.com>
<44CF8AB4.7040209@infrae.com>
<44CFA6A4.3030202@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D075C9.8070307@infrae.com>
Stefan Behnel wrote:
>
> Martijn Faassen wrote:
>> Martijn Faassen wrote:
>> [snip]
>>> Well, that gives one no way to access any comments surrounding the
>>> document library from XPath. Not a disaster, but still.
>
> Note that this only applies to the return values of XPath calls. Inside the
> expression, you can do whatever XPath supports. So you can still navigate the
> brothers and sisters of the document root and return the one of them that
> you're interested in, without having to pass the root itself into Python.
Yes, naturally - it's not a disaster and it's only from XPath.
>>> Returning
>>> something Element-like sounds the most natural in this case, just like
>>> returning a string is most natural for attribute nodes.
>
> I'm still not convinced that this should be Element-like. It's not an Element
> and it has no representation in the ElementTree world.
It has no representation in the ElementTree itself, but it's quite
Element-like in that it has children. It's also Element-like in that it
is relatively straightforward to implement it as a special kind of
Element. :)
>> I've just checked in a branch here:
>>
>> http://codespeak.net/svn/lxml/branch/lxml-xpathroot
>>
>> which experiments with adding a special XPath Root object. This root
>> object only shows up when accessing / through XPath - there's no way to
>> get to it using the normal ElementTree functionality. At first sight
>> this implementation doesn't appear to be too difficult. I think this is
>> a nicer solution than just not returning anything.
>
> Ok, I can see what you did. You'd have to rewrite that after the merge of the
> CAPI branch, which changes loads of stuff under the hood and largely impacts
> element class lookup. So it would have to fit in there.
Okay, understood. I wasn't sure on the status of the CAPI branch.
>> Unfortunately, my changes also cause memory errors when running the
>> test. It's possible this happens because we start stuffing our proxy in
>> the _private of a XML_DOCUMENT_NODE, something that wasn't possible
>> before, and we're probably not scanning for it accurately in our
>> deallocation logic. Don't have time to investigate this further now
>> though, so I'll leave it in the branch for now.
>
> doc._private is currently only used in XSLT (which may already interfere when
> extension functions are used), but I'm not very happy with the idea of using
> xmlDoc like any other element node. It starts with the fact that we now have
> _Document and _Root sitting on the same xmlDoc structure. That unnecessarily
> complicates the cleanup procedure for what I call a rare special case.
Agreed.
> If we really want to put something Element-like in there, we may consider
> making it part of the _Document class, which already is unique for the
> document root.
Okay, that might make sense. I will study the _Document class and see
whether we can come up with a design that is satisfactory. Thanks for
the design feedback. :)
This is driven by my desire to see some sensible return value when
people evaluate the '/' XPath expression. Returning nothing is so...
nothing, and if this is the first thing people tend to do then it might
give them the impression lxml is misbehaving somehow.
Regards,
Martijn
From faassen at infrae.com Wed Aug 2 11:55:47 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 02 Aug 2006 11:55:47 +0200
Subject: [lxml-dev] ElementTree comment behavior
In-Reply-To: <44CF8ADF.70805@gkec.informatik.tu-darmstadt.de>
References: <44CF80BB.90404@infrae.com>
<44CF8ADF.70805@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D076A3.2000100@infrae.com>
Stefan Behnel wrote:
>
> Martijn Faassen wrote:
>> The whole XPath root node issue led me to investigate lxml's behavior
>> with comment nodes, thinking that we might not do the right thing with
>> mutation (as Comments subclass Element). However, it seems to behave
>> rationally enough:
>>
>> >>> import lxml
>> >>> from lxml import etree
>> >>> c = etree.Comment('foo')
>> >>> c.append(etree.Element('bar'))
>> >>> len(c.getchildren())
>> 0
>>
>> (I wonder what happens in the C tree though here.. cursory inspection of
>> the tree.c code of libxml2 doesn't reveal special code to handle this case)
>
> Well, this is how lxml currently implements _Comment.append():
>
>     def append(self, _Element element):
>         pass
>
> Maybe it should rather raise an exception?
Yeah, I realized this after I wrote the post. If we were to raise an
exception, we'd be incompatible with ElementTree, but I wouldn't mind too
much as this is a rather ridiculous operation anyway and people who do
this in their code should actually know they're doing something weird.
Note that I apparently added no such method for other mutation
operations such as 'insert'...
>> Unfortunately, ElementTree behaves differently in this case!
>>
>> >>> from elementtree import ElementTree as etree2
>> >>> c = etree2.Comment('foo')
>> >>> c.append(etree.Element('bar'))
>> >>> len(c.getchildren())
>> 1
>>
>> Evidently it allows child Elements to be added to comments.
>>
>> What to do in this case?
>
> I personally find the behaviour of ET a bit bizarre here. What /is/ the
> element child of an XML comment?
I think you're right in that it's bizarre. The reason it behaves this
way might be convenience of implementation... I feel under no obligation
to be compatible with ET here.
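
For anyone who wants to try the ElementTree side of this, here is a small
sketch. It assumes the standard library's xml.etree.ElementTree, a descendant
of the standalone elementtree package used above, and uses len() instead of
getchildren(), which newer Python versions no longer provide.

```python
import xml.etree.ElementTree as ET

# A comment is an ordinary Element whose tag is the Comment factory,
# so nothing stops it from accepting children.
comment = ET.Comment("foo")
comment.append(ET.Element("bar"))

child_count = len(comment)
serialized = ET.tostring(comment)
```

The child is stored in memory but, as far as I can tell, it does not appear in
the serialised output, which supports the "convenience of implementation"
guess.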
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 12:13:40 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 12:13:40 +0200
Subject: [lxml-dev] ElementTree comment behavior
In-Reply-To: <44D076A3.2000100@infrae.com>
References: <44CF80BB.90404@infrae.com>
<44CF8ADF.70805@gkec.informatik.tu-darmstadt.de>
<44D076A3.2000100@infrae.com>
Message-ID: <44D07AD4.6020906@gkec.informatik.tu-darmstadt.de>
Martijn Faassen wrote:
> Stefan Behnel wrote:
>>
>> Martijn Faassen wrote:
>>> The whole XPath root node issue led me to investigate lxml's behavior
>>> with comment nodes, thinking that we might not do the right thing
>>> with mutation (as Comments subclass Element). However, it seems to
>>> behave rationally enough:
>>>
>>> >>> import lxml
>>> >>> from lxml import etree
>>> >>> c = etree.Comment('foo')
>>> >>> c.append(etree.Element('bar'))
>>> >>> len(c.getchildren())
>>> 0
>>>
>>> (I wonder what happens in the C tree though here.. cursory inspection
>>> of the tree.c code of libxml2 doesn't reveal special code to handle
>>> this case)
>>
>> Well, this is how lxml currently implements _Comment.append():
>>
>>     def append(self, _Element element):
>>         pass
>>
>> Maybe it should rather raise an exception?
>
> Yeah, I realized this after I wrote the post. If we were to raise an
> exception, we'd be incompatible with ElementTree, but I wouldn't mind too
> much as this is a rather ridiculous operation anyway and people who do
> this in their code should actually know they're doing something weird.
>
> Note that I apparently added no such method for other mutation
> operations such as 'insert'...
I added the method in the CAPI branch (also __setitem__ and __setslice__). The
mutators now raise a TypeError.
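
A rough illustration of that behaviour in plain Python, not the Pyrex code from
the CAPI branch: StrictComment is a hypothetical class built on the standard
library's xml.etree.ElementTree, and which mutators it blocks is my choice for
the sketch.

```python
import xml.etree.ElementTree as ET


class StrictComment(ET.Element):
    """Comment-like element whose mutators refuse to add children."""

    def __init__(self, text=None):
        super().__init__(ET.Comment)
        self.text = text

    def _reject(self, *args, **kwargs):
        raise TypeError("comments cannot have children")

    # Route every child-adding operation to the same rejection.
    append = insert = extend = __setitem__ = _reject


comment = StrictComment("foo")
try:
    comment.append(ET.Element("bar"))
    outcome = "appended"
except TypeError:
    outcome = "rejected"
```

Raising instead of silently passing makes the mistake visible at the call site,
at the cost of diverging from ElementTree as discussed above.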
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 13:01:25 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 13:01:25 +0200
Subject: [lxml-dev] Some performance results with threads
Message-ID: <44D08605.7090408@gkec.informatik.tu-darmstadt.de>
Hi,
I did a little testing on a dual processor linux machine with the current
trunk. I had a very simple setup with a number of threads (8-16) that each
created a separate parser, parsed a 3MB string or file and then ran a small
XSLT on it. Parsing and XSLT are operations that free the GIL for the majority
of their internal work.
The outcome was that the system was always between 20% and 40% idle. So, there
is a certain speedup in multi-processor environments, but don't expect too
much, especially when adding more processors. It shows that it makes sense to
use threads on, say, a web server that has to serve other content in parallel
(like static content), so that it can make use of a third of the processing
time itself. But it will not get you 100% more throughput by doubling the
number of processors.
It looks like you should really expect less than a 50% speedup, depending on
how much time your application actually spends in parsing, serialising,
validating and XSLT. If your application does a lot of XML handling in Python
code (like tree iteration etc.), the ratio can get close to 0, but if you have
complex XSLTs or large schemas/documents to validate, the speedup can
potentially be much higher. (I never thought I'd ever tell someone to rewrite
code in XSLT to make it /faster/ ...)
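
A back-of-the-envelope way to read these numbers. This is only a sketch:
Amdahl's law is my framing, not something measured here, and treating non-idle
time as useful work is an assumption. The function names are made up for the
example.

```python
def effective_processors(processors, idle_fraction):
    """Processors' worth of work done if the rest of the time is idle."""
    return processors * (1.0 - idle_fraction)


def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: speedup when only part of the run time parallelises."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


# 20% to 40% idle on a dual processor machine.
best_case = effective_processors(2, 0.2)
worst_case = effective_processors(2, 0.4)

# Half of the run time spent in GIL-free parsing and XSLT, two processors.
half_parallel = amdahl_speedup(0.5, 2)
```

Under this model, two processors stay below a 50% speedup as long as less than
two thirds of the run time is spent in code that releases the GIL.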
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 13:30:43 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 13:30:43 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D06823.6030407@gkec.informatik.tu-darmstadt.de>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com>
<44D06823.6030407@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de>
Hi John,
Stefan Behnel wrote:
> John Krukoff wrote:
>> Okay, I've managed to create a crashing test case that's down to a
>> reasonable number of lines of code. I don't think I can remove anything
>> else and still have it crash.
>
> It 'nicely' crashes for me and I think I can tell where it comes from. We use
> a global dictionary in the parser that stores tag names, attribute values,
> etc. It mainly serves the purpose of reducing the number of expensive malloc
> calls and avoiding duplicated storage of constant strings. Normally, it works
> just fine, unless there are operations that create additional dictionaries,
> like XSLT. :(
>
> So what happens in your case, is: when you move the content of the XSLT result
> document over to the document you parsed, it will contain strings from two
> different dictionaries (I just verified that). When the documents are freed,
> libxml2 checks if the strings it frees are in the document dictionary, sees
> that it is not the case (as it came from a different dictionary) and then
> frees it. This leaves stale pointers in the second dictionary.
I attached a patch that is somewhat hacky and may not work in some situations.
However, it should solve your crash for now and I will see if I can get
something like this a bit cleaned up and merged into the next release (1.1).
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xslt-dict-hack.patch
Type: text/x-patch
Size: 1244 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060802/40be118d/attachment-0001.bin
From lxml at adhamh.com Wed Aug 2 17:19:48 2006
From: lxml at adhamh.com (Adhamh Findlay)
Date: Wed, 02 Aug 2006 08:19:48 -0700
Subject: [lxml-dev] XML Schema: Getting more information on validation
failures?
Message-ID: <44D0C294.30902@adhamh.com>
Hello,
I'm new to lxml and I'm trying to get more information on why some
validation is failing. Here is the code I am currently using:
try:
    xmlschema.assertValid(xml_doc)
except etree.DocumentInvalid:
    traceback.print_exc()
    print log
    print error.domain_name
    print error.type_name
    sys.exit()
Here's the output I get:
Traceback (most recent call last):
  File "./xml.py", line 42, in ?
    xmlschema.assertValid(xml_doc)
  File "etree.pyx", line 1624, in etree._Validator.assertValid
DocumentInvalid: Document does not comply with schema
Traceback (most recent call last):
  File "./xml.py", line 46, in ?
    print error.domain_name
AttributeError: 'NoneType' object has no attribute 'domain_name'
Is there any way to get more information than this?
Thanks,
Adhamh
From faassen at infrae.com Wed Aug 2 18:12:59 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 02 Aug 2006 18:12:59 +0200
Subject: [lxml-dev] Some performance results with threads
In-Reply-To: <44D08605.7090408@gkec.informatik.tu-darmstadt.de>
References: <44D08605.7090408@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D0CF0B.1010209@infrae.com>
Stefan Behnel wrote:
> (I never thought I'd ever tell someone to rewrite
> code in XSLT to make it /faster/ ...)
In general if you can run a transformation using libxslt instead of a
Python-based XML transformation algorithm, and the transformation is
pretty 'natural' to XSLT, even on a single-threaded setup libxslt can
speed things up. libxslt is a reasonably fast XSLT processor after all.
Regards,
Martijn
From faassen at infrae.com Wed Aug 2 18:13:32 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 02 Aug 2006 18:13:32 +0200
Subject: [lxml-dev] Some performance results with threads
In-Reply-To: <44D08605.7090408@gkec.informatik.tu-darmstadt.de>
References: <44D08605.7090408@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D0CF2C.6070004@infrae.com>
Stefan Behnel wrote:
[snip info on multi-threaded use of lxml]
Thanks for checking this out and letting us know, by the way. Good to know!
Regards,
Martijn
From faassen at infrae.com Wed Aug 2 18:17:07 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 02 Aug 2006 18:17:07 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de>
<44D08CE3.3090308@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D0D003.2080800@infrae.com>
Stefan Behnel wrote:
[XSLT segfaulting issue]
> I attached a patch that is somewhat hacky and may not work in some situations.
> However, it should solve your crash for now and I will see if I can get
> something like this a bit cleaned up and merged into the next release (1.1).
[code of patch]
I believe this is very similar to the approach I took early on to ensure
documents share their dictionaries, so who knows, we might be in luck
and it's reliable. Hm, though I vaguely remember we already did that for
XSLT too, so perhaps this is hacky in the place it's added, not the way
it's done?
Are there any cases you can think of where this would lead to problems?
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 18:14:09 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 18:14:09 +0200
Subject: [lxml-dev] XML Schema: Getting more information on validation
failures?
In-Reply-To: <44D0C294.30902@adhamh.com>
References: <44D0C294.30902@adhamh.com>
Message-ID: <44D0CF51.5020404@gkec.informatik.tu-darmstadt.de>
Hi Adhamh,
Adhamh Findlay wrote:
> I'm new to lxml and I'm trying to get more information on why some
> validation is failing. Here is the code I am currently using:
>
> try:
>     xmlschema.assertValid(xml_doc)
> except etree.DocumentInvalid:
>     traceback.print_exc()
>     print log
>     print error.domain_name
>     print error.type_name
>     sys.exit()
Here is an example on how to do this:
http://codespeak.net/lxml/api.html#error-handling-on-exceptions
It's more something like this:
try:
    xmlschema.assertValid(xml_doc)
except etree.DocumentInvalid, error:
    log = error.error_log
    print log
    print log[-1].domain_name
    print log[-1].type_name
> Here's the output I get:
> Traceback (most recent call last):
>   File "./xml.py", line 46, in ?
>     print error.domain_name
> AttributeError: 'NoneType' object has no attribute 'domain_name'
This is because you set "error" to None somewhere in your program. You can't
really blame lxml for that...
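
The difference between the two snippets, reduced to something that runs without
lxml. DocumentInvalid and assert_valid below are hypothetical stand-ins, not
the lxml classes, and the sketch uses the newer "except ... as name" spelling
rather than the comma form shown above.

```python
class DocumentInvalid(Exception):
    """Stand-in for a validation error that carries an error log."""

    def __init__(self, message, error_log):
        super().__init__(message)
        self.error_log = error_log


def assert_valid(document):
    """Stand-in validator that always rejects its input."""
    raise DocumentInvalid("Document does not comply with schema",
                          ["first problem", "last problem"])


error = None  # an unrelated module-level name, as in the failing program


def report_unbound(document):
    try:
        assert_valid(document)
    except DocumentInvalid:
        # The exception was never bound, so this reads the global None.
        return error.error_log[-1]


def report_bound(document):
    try:
        assert_valid(document)
    except DocumentInvalid as caught:
        # The caught exception object carries the log.
        return caught.error_log[-1]


try:
    report_unbound("<doc/>")
    unbound_result = "no error"
except AttributeError:
    unbound_result = "AttributeError"

bound_result = report_bound("<doc/>")
```

The first function fails the same way as the original program; the second reads
the log from the exception that was actually raised.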
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 2 18:22:37 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 02 Aug 2006 18:22:37 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D0D003.2080800@infrae.com>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de>
<44D08CE3.3090308@gkec.informatik.tu-darmstadt.de>
<44D0D003.2080800@infrae.com>
Message-ID: <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de>
Martijn Faassen schrieb:
> Stefan Behnel wrote:
> [XSLT segfaulting issue]
>> I attached a patch that is somewhat hacky and may not work in some
>> situations.
>> However, it should solve your crash for now and I will see if I can get
>> something like this a bit cleaned up and merged into the next release
>> (1.1).
>
> [code of patch]
>
> I believe this is very similar to the approach I took early on to ensure
> documents share their dictionaries, so who knows, we might be in luck
> and it's reliable. Hm, though I vaguely remember we already did that for
> XSLT too, so perhaps this is hacky in the place it's added, not the way
> it's done?
>
> Are there any cases you can think of where this would lead to problems?
The difference between changing the dict on the parser context and on the XSLT
context is that the parser context does not use it before it is returned.
libxslt *might* store stuff in it, depending on the stylesheet.
I filed a bug report on this and got an immediate "not a bug but a feature" by
Daniel. The reason is that the transformation must not modify the stylesheet,
so it just creates a sub-dictionary and is happy with that - unlike its users.
However, he also said, if I want to propose an API for it, I should ask on the
list. Don't think I'll do it, though, as it's not worth much to have the final
function-that-solves-all-your-problems added in 1.1.98 if we want to keep up
support for 1.1.12...
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 3 18:02:34 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 03 Aug 2006 18:02:34 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
Message-ID: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de>
Hi all,
I have already mentioned that lxml 1.1 will feature an alternative API,
lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but
written in Pyrex. The implementation is now nearing completion, so that
1.1beta will hopefully find its way towards cheeseshop early next week.
It allows you to access XML in a data-binding like style, so that you can do this:
>>> root=XML('HALLOWORLD')
>>> print root.a.b.c.d, '--', root.a.b.c.d[1]
HALLO -- WORLD
A complete description is here:
http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt
objectify also features an additional path language (ObjectPath) based on the
normal object attribute access scheme. It is implemented independent of the
actual objectify API so that it can be used without switching the Element
implementation over to 'objectify'. The language accepts expressions like the
two used above, just written as strings or lists:
* "root.a.{someNamespace}b.c.d"
* "root.a.b.c.d[1]"
* ".a.b.c"
* ['root', '{otherNamespace}a']
* ['root', 'a', 'b', 'c', '{andAnotherNamespace}d[1]']
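
To make the syntax concrete, here is a minimal pure-Python sketch of how such
expressions could be split into steps and evaluated. It is not the Pyrex
implementation described in this mail: parse_path and resolve are hypothetical
names, the sketch runs on the standard library's ElementTree, the sample
document is made up, and it does not attempt any namespace inheritance between
steps.

```python
import re
import xml.etree.ElementTree as ET

# One path step: optional "{namespace}", a tag name, optional "[index]".
_STEP = re.compile(r"(\{[^}]*\})?([^.\[\]{}]+)(?:\[(-?\d+)\])?")


def _to_step(match):
    tag = (match.group(1) or "") + match.group(2)
    index = int(match.group(3)) if match.group(3) else 0
    return tag, index


def parse_path(path):
    """Return (steps, relative) for a string or list path."""
    if isinstance(path, str):
        relative = path.startswith(".")
        # finditer skips the "." separators between steps.
        return [_to_step(m) for m in _STEP.finditer(path)], relative
    steps = []
    for item in path:
        match = _STEP.fullmatch(item)
        if match is None:
            raise ValueError("bad path step: %r" % (item,))
        steps.append(_to_step(match))
    return steps, False


def resolve(root, path):
    """Walk from root along the path and return the element found."""
    steps, relative = parse_path(path)
    if not relative:
        # An absolute path names the root element in its first step.
        tag, _ = steps.pop(0)
        if root.tag != tag:
            raise ValueError("root element is %r, not %r" % (root.tag, tag))
    node = root
    for tag, index in steps:
        node = node.findall(tag)[index]  # direct children only
    return node


root = ET.fromstring(
    "<root><a><b><c><d>HALLO</d><d>WORLD</d></c></b></a></root>")
first = resolve(root, "root.a.b.c.d").text
second = resolve(root, ["root", "a", "b", "c", "d[1]"]).text
inner = resolve(root, ".a.b.c").tag
```

Parsing once and reusing the step list is what separates the "without parsing"
timings from the "including parsing" ones further down.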
Here are a few timeit benchmarks:
Setup:
from lxml.elements.objectify import register, ObjectPath
register()
from lxml.etree import XML
root = XML('')
Normal Python object access tests for comparison:
* root.a.b.c.d
10000 loops, best of 3: 16.3 usec per loop
* root.a.b.c.d[0]
10000 loops, best of 3: 16.7 usec per loop
* root.a.b.c.d[2]
100000 loops, best of 3: 18.4 usec per loop
ObjectPath tests *without* parsing, i.e. timings of the call "path(root)"
after an additional Setup as follows:
* path = ObjectPath('root.a.b.c.d')
100000 loops, best of 3: 2.76 usec per loop
* path = ObjectPath('root.a.b.c.d[0]')
100000 loops, best of 3: 2.77 usec per loop
* path = ObjectPath('root.a.b.c.d[2]')
100000 loops, best of 3: 2.85 usec per loop
Including parsing:
* "path=ObjectPath('root.a.b.c.d'); path(root)"
10000 loops, best of 3: 27 usec per loop
* "path=ObjectPath('root.a.b.c.d[2]'); path(root)"
10000 loops, best of 3: 29.7 usec per loop
The same based on lists:
* "path=ObjectPath(['root', 'a', 'b', 'c', 'd']); path(root)"
10000 loops, best of 3: 16.7 usec per loop
* "path=ObjectPath(['root', 'a', 'b', 'c', 'd[2]']); path(root)"
10000 loops, best of 3: 18 usec per loop
As you can see, the parser is not the fastest, especially for strings. It
actually uses REs internally, as ObjectPath expressions are non trivial to
parse (namespaces, indexes, ...). However, once the expression is parsed,
element access is impressively fast, as it runs entirely in C. In the limited
area of its applicability, it is even faster than full fledged XPath:
* Setup: path=XPath('/root/a/b/c/d')
Timing: "path(root)"
10000 loops, best of 3: 10.4 usec per loop
* Timing: "path=XPath('/root/a/b/c/d'); path(root)"
10000 loops, best of 3: 44.8 usec per loop
So I hope people find it useful.
Stefan
From faassen at infrae.com Thu Aug 3 19:51:15 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu, 03 Aug 2006 19:51:15 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D23793.5040402@infrae.com>
Stefan Behnel wrote:
> I have already mentioned that lxml 1.1 will feature an alternative API,
> lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but
> written in Pyrex. The implementation is now nearing completion, so that
> 1.1beta will hopefully find its way towards cheeseshop early next week.
>
> It allows you to access XML in a data-binding like style, so that you can do this:
>
> >>> root=XML('HALLOWORLD')
> >>> print root.a.b.c.d, '--', root.a.b.c.d[1]
> HALLO -- WORLD
>
> A complete description is here:
> http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt
>
> objectify also features an additional path language (ObjectPath) based on the
> normal object attribute access scheme. It is implemented independent of the
> actual objectify API so that it can be used without switching the Element
> implementation over to 'objectify'.
While I'm quite interested in these developments I'm afraid I'm going to
ask some difficult questions here. This is not criticism of these
developments per-se, but it's a question about what lxml is all about
and how we want to present these new technologies to users.
Module separation: I notice the ObjectPath language is implemented in
the 'objectify' module, but this looks like it really should be a
separate module, it being an independent extension to lxml that does not
rely on the other objectify stuff, as you mention.
Use cases: What is the underlying thought? When would you recommend
people to use ObjectPath instead of XPath or the .find() syntax?
Technical comment: I also see that the ObjectPath parser is implemented
in a rather low-level Pyrex formulation. Since you say that this parser
is slow anyway, wouldn't it make sense to maintain this as straight
Python instead? It would also be nice if we could make this parser and a
pure-python implementation available for ElementTree itself.
Global switch for objectify: As I mentioned before I'm still quite
worried about switching the entire world over to objectify with a single
global call. I really think this should be specified by using a
different tree constructor. It just sounds too dangerous to me to
globally switch the behavior of the whole API.
In the 'classic' way of using the namespace registry, custom element
classes are typically registered for particular elements in particular
namespaces. Objectify however fundamentally alters the behavior of the
entire system. I understood from your previous reply that you were
working on ways to make this settable per-tree; did I understand that
correctly? I'd recommend making it the normal way to invoke the
objectify behavior, not global.
Now to the biggest item of my concern...
Nature of lxml: The addition of a different data-binding model and
different path language specific to lxml worries me quite a bit as we're
reinventing wheels here, something not the original idea of lxml. The
original idea of lxml was to try to stick to an existing API
(ElementTree) as much as possible, along with existing XML standards
(XPath, for instance) and build things on top of existing underlying
technology (libxml2 and libxslt). This idea is quite dear to me and I
consider this to be one of the reasons lxml seems reasonably successful
among developers: it does not make people learn too many new things, and
tries to minimize the learning needed that's unique to lxml and no other
system.
The objectify data binding model is however a fundamentally new
data-binding API: instead of the Amara or gnosis.objectify API we've
created our own version. There are good reasons for this, and
ElementTree is of course not the end of XML representations for Python.
The question however arises whether these innovations should be
maintained as core lxml... I'm worried we're offering developers too
many alternatives here: two tree representations (elementtree and
objectify), three path languages (.find(), XPath and ObjectPath), which
includes two ways completely unique to lxml.
Could these new things be shipped in a separate package instead, at
least for now? I understand that the capi work, along with eggs, should
make this relatively easy. We could even have it share the lxml
namespace package, so it could still be called 'lxml.objectify' (and
'lxml.objectpath' as I'd suggest), or, alternatively, we could introduce
a new 'lxmlext' namespace to maintain things like this.
I'm quite concerned with how we present these to developers. I'd prefer
a separate product identity, with a separate set of web pages (part of
the larger lxml website but explicitly not described as 'core') and a
separate packaging.
Again, my questions and recommendations are not to discourage these
developments. This kind of innovation certainly should be encouraged. I
do worry about the proper place and the way these things are done. In
the rush to innovate I don't want to lose track of the original goals of
lxml.
I sincerely hope we can work this out together.
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 3 23:24:48 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 03 Aug 2006 23:24:48 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D23793.5040402@infrae.com>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de>
<44D23793.5040402@infrae.com>
Message-ID: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
Hi Martijn,
thanks for your feedback, your questions are definitely worth asking.
Martijn Faassen wrote:
> Module separation: I notice the ObjectPath language is implemented in
> the 'objectify' module, but this looks like it really should be a
> separate module, it being an independent extension to lxml that does not
> rely on the other objectify stuff, as you mention.
>
> Use cases: What is the underlying thought? When would you recommend
> people to use ObjectPath instead of XPath or the .find() syntax?
It's mainly meant to accompany objectify, that's why it's (currently)
implemented in the same module. The reason why I said it's independent is
purely out of technical considerations. It uses the same semantics and the
same idea behind the API, so it's very closely related at the semantic level.
XPath and ElementPath do not have their own module either, BTW, although they
are almost as different compared to each other as compared to ObjectPath. The
latter borrows from both (namespaces from ET, indexes from XPath), as well as
from Python's object access pattern (the dot separator).
> Technical comment: I also see that the ObjectPath parser is implemented
> in a rather low-level Pyrex formulation. Since you say that this parser
> is slow anyway, wouldn't it make sense to maintain this as straight
> Python instead? It would also be nice if we could make this parser and a
> pure-python implementation available for ElementTree itself.
I agree that it could be worth having it available for ET, too, that would
extend ET in the same way this now extends lxml.etree. However, you would then
want to have an objectify module for ET, also, as this is where the path
semantics actually come from.
Also, the parser is not /that/ slow in its current incarnation. It's actually
almost twice as fast as the (admittedly much more complex) XPath parser of
libxml2. I don't think a pure Python version could be anywhere close to that.
Also, the parser is very closely tied into the evaluator, so writing one of
them in pure Python would make both considerably slower.
So the thing is, as long as ObjectPath is used as part of lxml's objectify
API, it should be optimised for the internal implementation. After all, one of
the main goals of ObjectPath is to avoid instantiating all elements along the
path and instead traversing the tree in plain C.
> Global switch for objectify: As I mentioned before I'm still quite
> worried about switching the entire world over to objectify with a single
> global call. I really think this should be specified by using a
> different tree constructor. It just sounds too dangerous to me to
> globally switch the behavior of the whole API.
>
> In the 'classic' way of using the namespace registry, custom element
> classes are typically registered for particular elements in particular
> namespaces. Objectify however fundamentally alters the behavior of the
> entire system. I understood from your previous reply that you were
> working on ways to make this settable per-tree; did I understand that
> correctly? I'd recommend making it the normal way to invoke the
> objectify behavior, not global.
Ok, sure. The lxml.elements.classlookup module has (amongst other things) a
per-parser lookup implementation. I guess you'd want that to become the
preferred way of using objectify and I think that's a good idea.
Currently, the docs only present that as an alternative (4th paragraph):
http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt
That part could be rewritten to make the global registry the alternative.
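The difference between a per-parser and a global setting can be sketched with the standard library alone. This is only an analogy, not the lxml.elements.classlookup API: it uses ElementTree's `TreeBuilder(element_factory=...)` hook, and the names `DataElement` and `make_data_parser` are made up for illustration. Only trees built by the customised parser get the custom element class; every other parser in the program is unaffected.

```python
import xml.etree.ElementTree as ET


class DataElement(ET.Element):
    """Hypothetical custom element class with one convenience method."""

    def child_text(self, tag):
        child = self.find(tag)
        return None if child is None else child.text


def make_data_parser():
    # The element class is chosen per parser, not globally.
    builder = ET.TreeBuilder(element_factory=DataElement)
    return ET.XMLParser(target=builder)


xml = "<root><name>lxml</name></root>"

# Parsed with the customised parser: elements are DataElement instances.
custom_root = ET.fromstring(xml, parser=make_data_parser())

# Parsed with the default parser: plain ET.Element, unchanged behaviour.
plain_root = ET.fromstring(xml)
```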
> Now to the biggest item of my concern...
>
> Nature of lxml: The addition of a different data-binding model and
> different path language specific to lxml worries me quite a bit as we're
> reinventing wheels here, something not the original idea of lxml. The
> original idea of lxml was to try to stick to an existing API
> (ElementTree) as much as possible, along with existing XML standards
> (XPath, for instance) and build things on top of existing underlying
> technology (libxml2 and libxslt). This idea is quite dear to me and I
> consider this to be one of the reasons lxml seems reasonably successful
> among developers: it does not make people learn too many new things, and
> tries to minimize the learning needed that's unique to lxml and no other
> system.
I see your point and I agree that this is desirable. After all, there is not
that much new in objectify either. Most of the Element API stays the same as
in ET. The object access pattern looks (and feels) like normal Python objects
and clearly borrows from Amara.
> The objectify data binding model is however a fundamentally new
> data-binding API: instead of the Amara or gnosis.objectify API we've
> created our own version. There are good reasons for this, and
> ElementTree is of course not the end of XML representations for Python.
The main reason why it does not aim to be Amara compatible is that it inherits
from ElementTree. It does not /need/ all the things for which Amara had to
invent its own API as all of that is already part of the ET API. So the reason
why this is a new API is that it allows it to integrate with lxml.etree.
> The question however arises whether these innovations should be
> maintained as core lxml... I'm worried we're offering developers too
> many alternatives here: two tree representations (elementtree and
> objectify), three path languages (.find(), XPath and ObjectPath), which
> includes two ways completely unique to lxml.
>
> Could these new things be shipped in a separate package instead, at
> least for now? I understand that the capi work, along with eggs, should
> make this relatively easy. We could even have it share the lxml
> namespace package, so it could still be called 'lxml.objectify' (and
> 'lxml.objectpath' as I'd suggest), or, alternatively, we could introduce
> a new 'lxmlext' namespace to maintain things like this.
I started with "lxml.elementlib", then it became "lxml.elements". The reason
why I chose to put the new stuff into a subpackage (not only submodules) was
that I wanted to separate it from the core lxml. :)
I don't mind giving it a better name and I would not even mind separating the
packages into different eggs. It's not a problem technically, even version
dependencies could be handled by setuptools. So it's mainly a matter of
presentation. For example, the classlookup module would then have to stay a
part of lxml (or could even be merged into lxml.etree), while the objectify
module could become a separate distribution.
> I'm quite concerned with how we present these to developers. I'd prefer
> a separate product identity, with a separate set of web pages (part of
> the larger lxml website but explicitly not described as 'core') and a
> separate packaging.
Hmmm, that would really make it a separate product. Do you really think it's
worth it? It still requires lxml.etree to run and shares most of the API, so,
to learn objectify, you'd have to learn lxml.etree. It's just that objectify
would be better hidden from people who only want to use lxml.etree. Isn't a
subpackage enough for that purpose? Maybe call it lxml.objectify to make it
clear that it's more or less at a comparable level as lxml.etree itself.
> Again, my questions and recommendations are not to discourage these
> developments. This kind of innovation certainly should be encouraged. I
> do worry about the proper place and the way these things are done. In
> the rush to innovate I don't want to lose track of the original goals of
> lxml.
>
> I sincerely hope we can work this out together.
So do I. It's definitely the right time to discuss this now, before the
release of 1.1 (and preferably also before the release of 1.1beta, which is
supposed to be feature complete).
Thanks for bringing up this discussion.
Regards,
Stefan
From faassen at infrae.com Fri Aug 4 10:14:28 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 04 Aug 2006 10:14:28 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com>
<44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D301E4.50800@infrae.com>
Hey Stefan,
Thanks for your constructive reply!
Stefan Behnel wrote:
> Martijn Faassen wrote:
[snip smaller issues]
>> Technical comment: I also see that the ObjectPath parser is
>> implemented in a rather low-level Pyrex formulation. Since you say
>> that this parser is slow anyway, wouldn't it make sense to maintain
>> this as straight Python instead? It would also be nice if we could
>> make this parser and a pure-python implementation available for
>> ElementTree itself.
>
> I agree that it could be worth having it available for ET, too, that
> would extend ET in the same way this now extends lxml.etree. However,
> you would then want to have an objectify module for ET, also, as this
> is where the path semantics actually come from.
Not necessarily so, but yeah, that makes sense. Anyway, I cannot require
an objectify module for ET. :) Separating out the ObjectPath is not that
important then, though technically it would be possible to keep the
implementation separate.
[snip explanation about parser performance]
> So the thing is, as long as ObjectPath is used as part of lxml's
> objectify API, it should be optimised for the internal
> implementation. After all, one of the main goals of ObjectPath is to
> avoid instantiating all elements along the path and instead
> traversing the tree in plain C.
Okay, makes sense.
What are the main use cases of ObjectPath, then? Performance is one, the
other being traversing the tree in an 'objectify' way? When would I pick
it above XPath or elementpath? Is the main answer: when I'm using objectify?
[snip my worries about global switch for objectify]
> Ok, sure. The lxml.elements.classlookup module has (amongst other
> things) a per-parser lookup implementation. I guess you'd want that
> to become the preferred way of using objectify and I think that's a
> good idea.
Yes, that would be preferred.
> Currently, the docs only present that as an alternative (4th
> paragraph):
> http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt
> That part
> could be rewritten to make the global registry the alternative.
I think we should do that, or perhaps not even mention the global
registry at all beyond briefly noting that you could do something to it
to make the whole of lxml work that way for your entire program... We
could also consider just not offering an API to register globally at all,
so they don't become tempted. :)
>> Now to the biggest item of my concern...
>>
>> Nature of lxml: The addition of a different data-binding model and
>> different path language specific to lxml worries me quite a bit as
>> we're reinventing wheels here, something not the original idea of
>> lxml. The original idea of lxml was to try to stick to an existing
>> API (ElementTree) as much as possible, along with existing XML
>> standards (XPath, for instance) and build things on top of existing
>> underlying technology (libxml2 and libxslt). This idea is quite
>> dear to me and I consider this to be one of the reasons lxml seems
>> reasonably successful among developers: it does not make people
>> learn too many new things, and tries to minimize the learning
>> needed that's unique to lxml and no other system.
>
> I see your point and I agree that this is desirable. After all, there
> is not that much new in objectify either. Most of the Element API
> stays the same as in ET. The object access pattern looks (and feels)
> like normal Python objects and clearly borrows from Amara.
While there's not much new in objectify, and while I agree that we're
borrowing (hopefully) the best ideas from other implementations, we are
crossing into the territory of inventing a new XML Python tree API here.
It's a somewhat grey area on how much we're inventing and how much
people need to learn, but I think we're going far enough to stop and
think for a bit nonetheless.
>> The objectify data binding model is however a fundamentally new
>> data-binding API: instead of the Amara or gnosis.objectify API
>> we've created our own version. There are good reasons for this, and
>> ElementTree is of course not the end of XML representations for
>> Python.
>
> The main reason why it does not aim to be Amara compatible is that it
> inherits from ElementTree. It does not /need/ all the things for
> which Amara had to invent its own API as all of that is already part
> of the ET API. So the reason why this is a new API is that it allows
> it to integrate with lxml.etree.
Yes, that's part of the 'good reasons' I mentioned. :) There is no
debate that there are good reasons and that this is a valuable
development. My concern is with its presentation to innocent new
developers that start looking at lxml. What's the story we want to tell
them? We have these two APIs, which are similar but not identical, and
you should pick one over the other, when?
>> The question however arises whether these innovations should be
>> maintained as core lxml... I'm worried we're offering developers
>> too many alternatives here: two tree representations (elementtree
>> and objectify), three path languages (.find(), XPath and
>> ObjectPath), which includes two ways completely unique to lxml.
>>
>> Could these new things be shipped in a separate package instead, at
>> least for now? I understand that the capi work, along with eggs,
>> should make this relatively easy. We could even have it share the
>> lxml namespace package, so it could still be called
>> 'lxml.objectify' (and 'lxml.objectpath' as I'd suggest), or,
>> alternatively, we could introduce a new 'lxmlext' namespace to
>> maintain things like this.
>
> I started with "lxml.elementlib", then it became "lxml.elements". The
> reason why I chose to put the new stuff into a subpackage (not only
> submodules) was that I wanted to separate it from the core lxml. :)
Yes, I can see that. I think 'objectify' is a good name, though it's
perhaps a bit worrying that we clash with gnosis.objectify.
> I don't mind giving it a better name and I would not even mind
> separating the packages into different eggs. It's not a problem
> technically, even version dependencies could be handled by
> setuptools. So it's mainly a matter of presentation. For example, the
> classlookup module would then have to stay a part of lxml (or could
> even be merged into lxml.etree), while the objectify module could
> become a separate distribution.
I think it makes sense for classlookup to remain part of the core.
The *facility* to create new databinding APIs for lxml should be core -
I have no beef with that and think it's a very powerful feature. The
actual implementation of a new databinding API on top of lxml I'd prefer
to be outside of the core, however.
>> I'm quite concerned with how we present these to developers. I'd
>> prefer a separate product identity, with a separate set of web
>> pages (part of the larger lxml website but explicitly not described
>> as 'core') and a separate packaging.
>
> Hmmm, that would really make it a separate product. Do you really
> think it's worth it? It still requires lxml.etree to run and shares
> most of the API, so, to learn objectify, you'd have to learn
> lxml.etree.
Understood. I realize that objectify leans heavily on the ET API. Then
again, it also strongly changes the experience. I'm not proposing new
people come into objectify and then never have to learn about
lxml.etree. I'm just trying to make sure that when people run into lxml,
they don't have to spend a lot of mental bandwidth to worry about what
objectify is, when to use it, etc. If it's clear to them that
it's not core, that they don't need to worry about it at all, and that
it's there when they want it, that would help.
So far, most or all of the things in lxml are at least potentially
familiar to a newcomer, if they're familiar with various XML standards
and ElementTree. The new bits are the APIs we invented to glue them all
together. objectify alters that in the sense that it's not an API used
to glue these things together and it's also not an API people can be
familiar with when they come in from the outside. It's a gradual step in
many ways, but I think a significant one.
> It's just that objectify would be better hidden from people who only
> want to use lxml.etree.
I don't think 'hidden' is the right word. I'd like to give objectify
prominence, while also making it very clear in a developer's mind that
this is a separate development, heavily tied into lxml and part of the
lxml projects, but not something you have to buy into when you use lxml
core.
> Isn't a subpackage enough for that purpose? Maybe call it
> lxml.objectify to make it clear that it's more or less at a
> comparable level as lxml.etree itself.
I would prefer a clearly marked difference. If we call it
'lxml.objectify', but maintain it as an egg outside the core (lxml being
the shared namespace package), we'll have a large step taken already.
We don't need to necessarily split up the svn repository if we can
generate both eggs independently from the same repository.
We should also be careful in organizing our documentation and website to
make clear that objectify is an extension to the core part, and that
people do not have to worry about it when they come to lxml. I think we
can do this so that objectify is not hidden, but also clearly separate
from the core development.
I realize that this is a hassle and it's on the edge of being worth it
or not, but I think it'd be valuable.
On a personal note, I'm going on a short trip and won't be able to
communicate on this further until next week Thursday or Friday. Not too
big a problem: I said what I wanted to say, possibly too voluminously,
already. I'd be curious to see what other people's opinions are on these
topics, so perhaps I'll see that when I get back.
I also fully trust you'll make the right decisions if you want to
proceed with a 1.1 beta release while I'm away.
Regards,
Martijn
From elephantum at yandex.ru Fri Aug 4 10:52:30 2006
From: elephantum at yandex.ru (Татаринов Андрей)
Date: Fri, 04 Aug 2006 12:52:30 +0400
Subject: [lxml-dev] lxml goals
Message-ID: <145471154681550@webmail5.yandex.ru>
Hi,
This is all very interesting, but the one thing I can't understand is:
what does it have in common with lxml?
In fact, for quite some time I have not understood the goal of the lxml
project. At first it was "ElementTree on top of libxml2"; after that it
became more and more bloated with ET-incompatible API, and now a program
that uses lxml cannot easily be ported back to ElementTree.
Maybe it's time to split it into an ET implementation and lxml-specific
parts? Or to say "lxml is no longer just an ElementTree implementation,
but a separate project with its own idioms"?
03.08.06, 20:02, Stefan Behnel :
> Hi all,
> I have already mentioned that lxml 1.1 will feature an alternative API,
> lxml.elements.objectify, which is similar to Amara and gnosis.objectify, but
> written in Pyrex. The implementation is now nearing completion, so that
> 1.1beta will hopefully find its way towards cheeseshop early next week.
[...]
> Stefan
From faassen at infrae.com Fri Aug 4 11:45:11 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 04 Aug 2006 11:45:11 +0200
Subject: [lxml-dev] lxml goals
In-Reply-To: <145471154681550@webmail5.yandex.ru>
References: <145471154681550@webmail5.yandex.ru>
Message-ID: <44D31727.8090706@infrae.com>
Татаринов Андрей wrote:
> This is all very interesting, but the one thing I can't understand is:
> what does it have in common with lxml?
>
> In fact, for quite some time I have not understood the goal of the lxml
> project. At first it was "ElementTree on top of libxml2"; after that it
> became more and more bloated with ET-incompatible API, and now a program
> that uses lxml cannot easily be ported back to ElementTree.
I don't think it's fair to say that our API is ET-incompatible. lxml's
API is as compatible to ElementTree as we can make it, and we've
expended quite some effort in making it be so.
We've *extended* the API to expose a host of features in libxml2 and
libxslt. For instance, we expose namespace prefixes in lxml.etree where
ET does not. I do not consider these extensions as bloat but as
important functionality.
The API also got extended with a facility to hook in custom element
classes for particular elements. This is an extension to the ET model
which due to its nature needs to be done in the core. I think this is a
nice and powerful feature.
Stefan has now built other facilities on top of this that are unique to
lxml. This is where I asked about goals.
> Maybe it's time to split it into an ET implementation and lxml-specific
> parts? Or to say "lxml is no longer just an ElementTree implementation,
> but a separate project with its own idioms"?
lxml has always been *more* than just an ElementTree implementation; if
it were just an ElementTree implementation there'd be no point in doing
our work. It's an ElementTree implementation that exposes a host of XML
technologies implemented in libxml2 and libxslt. It's a Python XML
library with support for XPath, XSLT, Relax NG, and so on.
The objectify extensions, yes, we could present as a separate project
with its own idioms.
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 11:39:11 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 11:39:11 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D301E4.50800@infrae.com>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com>
<44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
<44D301E4.50800@infrae.com>
Message-ID: <44D315BF.7020204@gkec.informatik.tu-darmstadt.de>
Hi Martijn,
Martijn Faassen wrote:
> What are the main use cases of ObjectPath, then? Performance is one, the
> other being traversing the tree in an 'objectify' way? When would I pick
> it above XPath or elementpath? Is the main answer: when I'm using objectify?
Guess so. That would be another reason for leaving it inside the objectify
module. I think the whole idea of ObjectPath is so much tied into objectify
that it would not make sense to use one without the other. If you only want to
use lxml.etree, you should be pretty well served with XPath. If, however, you
want to use objectify, it's convenient to have a (fast) path language that
matches the API.
So, we should just separate etree and objectify and leave the rest as is.
>> The lxml.elements.classlookup module has (amongst other
>> things) a per-parser lookup implementation. I guess you'd want that
>> to become the preferred way of using objectify and I think that's a
>> good idea.
>
> Yes, that would be preferred.
Ok, I'll fix it in the docs.
> There is no
> debate that there are good reasons and that this is a valuable
> development. My concern is with its presentation to innocent new
> developers that start looking at lxml. What's the story we want to tell
> them? We have these two APIs, which are similar but not identical, and
> you should pick one over the other, when?
That should go into a FAQ entry, I guess. Something like this:
Basically, they are two different approaches to XML: Python-like data-binding
and a generic API for XML handling.
* The ET API is more generic and does not require any knowledge about the XML
structure that is treated. It supports more or less the entire XML infoset.
Besides, it is very well suited for mixed and document-like content (including
HTML).
* The objectify API is very data centred and schema/structure focused. It does
not support document-like XML (or HTML), but it's very convenient for handling
Python(-like) data types stored in XML.
So, objectify has a more convenient API in a smaller application scope, while
ET is broadly applicable to everything that's XML.
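The contrast between the two styles can be sketched with the standard library. The `Bound` wrapper below is hypothetical and much simpler than objectify (read-only, no type conversion, no namespaces); it only illustrates what "Python-like data-binding" means next to the generic ET calls.

```python
import xml.etree.ElementTree as ET


class Bound:
    """Hypothetical read-only wrapper: child elements become attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. child tag names.
        if name.startswith("_"):
            raise AttributeError(name)
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        return Bound(child)

    @property
    def text(self):
        return self._element.text


order = ET.fromstring(
    "<order><customer><name>Alice</name></customer><total>42</total></order>"
)

# Generic ET API: works for any XML, the structure is spelled out as a path.
name_via_et = order.find("customer/name").text

# Data-binding style: the known structure reads like plain Python objects.
name_via_binding = Bound(order).customer.name.text
total = int(Bound(order).total.text)
```

The wrapper only pays off when the structure is known in advance; for mixed or document-like content the generic calls remain the better fit, which matches the split described above.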
> I think 'objectify' is a good name, though it's perhaps
> a bit worrying that we clash with gnosis.objectify.
What about calling it "objectic", then? Sounds similar, but still different
enough to make it clear that it's not the same as gnosis.objectify or Amara.
Google gives 862 hits on "objectic", and even 48 on "objectique". Not much of
a chance to have a name clash with those. :)
Then again, "objectify" has a meaning that pretty much fits its idea. Hmmm, I
guess "objectify" is just fine as a name...
> I think it makes sense for classlookup to remain part of the core.
> The *facility* to create new databinding APIs for lxml should be core -
> I have no beef with that and think it's a very powerful feature. The
> actual implementation of a new databinding API on top of lxml I'd prefer
> to be outside of the core, however.
Understood. But then, classlookup is pretty lonely in lxml.elements. I should
just merge it into lxml.etree. It's not much code and parts of it actually are
already in etree (like the normal NS lookup and the per-parser stuff).
> I realize that objectify leans heavily on the ET API. Then
> again, it also strongly changes the experience. I'm not proposing new
> people come into objectify and then never have to learn about
> lxml.etree. I'm just trying to make sure that when people run into lxml,
> they don't have to spend a lot of mental bandwidth to worry about what
> objectify is, when to use it, etc. If it's clear to them that
> it's not core, that they don't need to worry about it at all, and that
> it's there when they want it, that would help.
>
> So far, most or all of the things in lxml are at least potentially
> familiar to a newcomer, if they're familiar with various XML standards
> and ElementTree. The new bits are the APIs we invented to glue them all
> together. objectify alters that in the sense that it's not an API used
> to glue these things together and it's also not an API people can be
> familiar with when they come in from the outside. It's a gradual step in
> many ways, but I think a significant one.
>
> I'd like to give objectify
> prominence, while also making it very clear in a developer's mind that
> this is a separate development, heavily tied into lxml and part of the
> lxml projects, but not something you have to buy into when you use lxml
> core.
>
> I would prefer a clearly marked difference. If we call it
> 'lxml.objectify', but maintain it as an egg outside the core (lxml being
> the shared namespace package), we'll have a large step taken already.
> We don't need to necessarily split up the svn repository if we can
> generate both eggs independently from the same repository.
Ok, I understand your concerns and I think they are valid. We should really
give users easy guidelines through the package, so that they do not have to
read tons of pages to understand where to /start/.
That said, I believe that it's totally a good thing to provide different APIs
on top of the same infrastructure. Things like parsing, XSLT, RNG, XPath, etc.
work exactly the same way for all of them, so you only have to learn them once
and can then freely choose the API that fits your current use case, without
restarting from scratch and without any incompatibilities or differing
capabilities of the library itself.
So the proposal is:
* merge lxml.elements.classlookup into lxml.etree
* make both APIs stand side-by-side in the lxml package: lxml.etree and
lxml.objectify
* make it clear in the docs (and the FAQ) that they provide different APIs and
how they differ, so that people can easily decide which suits their needs,
without first needing to understand the details
Not required for 1.1beta (but likely in 1.1):
* build separate packages from setup.py: "lxml" and "lxml-objectify" (not too
much of a big deal technically, BTW), where lxml-objectify requires lxml via
setuptools.
> We should also be careful in organizing our documentation and website to
> make clear that objectify is an extension to the core part, and that
> people do not have to worry about it when they come to lxml. I think we
> can do this so that objectify is not hidden, but also clearly separate
> from the core development.
Sure. It already has its own page, which is somewhat similar to api.txt in
spirit. So we should reorganise the doc section in main.txt to tell the users
about both and how we see them in comparison.
> On a personal note, I'm going on a short trip and won't be able to
> communicate on this further until next week Thursday or Friday.
I'll actually be almost away by then and come back at the end of August. So
I'll try to get 1.1beta out early next week and 1.1 final when I come back
(and find all those nasty little bugs reported on the list... :)
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 12:24:43 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 12:24:43 +0200
Subject: [lxml-dev] lxml goals
In-Reply-To: <145471154681550@webmail5.yandex.ru>
References: <145471154681550@webmail5.yandex.ru>
Message-ID: <44D3206B.6020409@gkec.informatik.tu-darmstadt.de>
Hi Татаринов,
Татаринов Андрей wrote:
> In fact, for quite some time I have not understood the goal of the lxml
> project. At first it was "ElementTree on top of libxml2",
Well, look closely. As cheeseshop puts it:
"""
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides
safe and convenient access to these libraries using the ElementTree API.
It extends the ElementTree API significantly to offer support for XPath,
RelaxNG, XML Schema, XSLT, C14N and much more.
"""
So it *safely* *extends* the *ElementTree API* in a *pythonic* way. Those are
the main goals: Be pythonic, safe, compatible to ET, and more comprehensive
(named in no particular order).
As for being pythonic, BTW, I think that objectify is one of the most pythonic
ways of handling XML in Python. But maybe that's just me - and believe me, I'm
biased...
> after that it became
> more and more bloated with ET-incompatible API, and now a program that uses
> lxml cannot easily be ported back to ElementTree.
I acknowledge that you are not a native English speaker, but I'd still be a
bit more careful with words like "bloated" and "incompatible". There are very
few places where lxml is incompatible to ET, and I believe that these spots
are there for very good reasons. Some differ in pure legacy design decisions
that were originally taken by ET (like for processing instructions), others
result from restrictions posed by libxml2 (like the single parent issue).
And I would not say that lxml is bloated in any way. All that is in there is
actually a) useful or b) helpful or c) for compatibility or d) for any
combination of the three. Martijn and I have taken care (and are still taking
care, as this discussion shows) that the API stays consistent in itself and as
close to existing APIs as possible, major points of influence being the ET API
and the Python language idioms.
Sure, in such a large library, you will never require every bit for your
application. But different applications have different requirements, and I
think lxml serves quite a large set of requirements in the XML area by now.
And we are always concerned about keeping the specific subset required for an
application easily accessible.
> May be it's time to split into ET-implementations and lxml-specific?
Well, you can't just split it. Most of the API and its extensions are tightly
integrated and do not work in separation. That's not only a technical problem,
it's rather a problem of API consistency.
There are some parts that could be separated out, like the namespace registry
and class lookup, for example. Now that we have the infrastructure for
external modules in place, it could be moved to a separate module. However,
that would break existing code and change the internal behaviour of lxml,
which currently defaults to support namespace lookup. Too bad. That's one for
compatibility, then. But there are not many things in lxml that come to my
mind when I look for concerns like this...
> say "lxml is no longer just an ElementTree implementation, but a separate
> project with its own idioms"?
Well, it never *was* "just an ET implementation", just as the cheeseshop quote
suggests. And as for lxml.objectify, it was never meant to become core
technology in lxml. It's a separate API that inherits from ET, lxml, Amara and
Python as much as possible, but otherwise stands on its own. It's not bloating
lxml either.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 12:50:45 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 12:50:45 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> <44D0D003.2080800@infrae.com>
<44D0D14D.5070708@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D32685.8000207@gkec.informatik.tu-darmstadt.de>
Stefan Behnel wrote:
> The difference between changing the dict on the parser context and on the XSLT
> context is that the parser context does not use it before it is returned.
> libxslt *might* store stuff in it, depending on the stylesheet.
Ok, I looked through the libxslt source and cannot find a place where this is
actually the case. According to the inline comments in transform.c, libxslt is
supposed to use the dict for XSLT 'key' handling, but it doesn't look like
that's true. (yeah, well, libx*** and documentation...) I could not even find
the word 'dict' in the file keys.c ...
So, given that insight, I'm now somewhat convinced that the patch I sent is
actually harmless. So I'll just merge it in for 1.1beta and see what we get.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 13:12:05 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 13:12:05 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D32685.8000207@gkec.informatik.tu-darmstadt.de>
References: <20060731213352.ymuh45jysoc4gk0s@webmail.ltgc.com> <44CEED04.8080300@gkec.informatik.tu-darmstadt.de> <20060801015204.2k8stoit344kwwww@webmail.ltgc.com> <44CF0E14.5090902@gkec.informatik.tu-darmstadt.de> <20060802005249.50c0a3jmv48g0ko0@webmail.ltgc.com> <44D06823.6030407@gkec.informatik.tu-darmstadt.de> <44D08CE3.3090308@gkec.informatik.tu-darmstadt.de> <44D0D003.2080800@infrae.com> <44D0D14D.5070708@gkec.informatik.tu-darmstadt.de>
<44D32685.8000207@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D32B85.2080304@gkec.informatik.tu-darmstadt.de>
Stefan Behnel wrote:
> Stefan Behnel wrote:
>> The difference between changing the dict on the parser context and on the XSLT
>> context is that the parser context does not use it before it is returned.
>> libxslt *might* store stuff in it, depending on the stylesheet.
>
> Ok, I looked through the libxslt source and cannot find a place where this is
> actually the case. According to the inline comments in transform.c, libxslt is
> supposed to use the dict for XSLT 'key' handling, but it doesn't look like
> that's true. (yeah, well, libx*** and documentation...) I could not even find
> the word 'dict' in the file keys.c ...
>
> So, given that insight, I'm now somewhat convinced that the patch I sent is
> actually harmless. So I'll just merge it in for 1.1beta and see what we get.
Right before committing, I noticed that the original patch actually introduces
threading problems, so here is a new patch that fixes it The Right Way.
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xslt-dict-replace.patch
Type: text/x-patch
Size: 2773 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060804/55a8db7d/attachment.bin
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 15:26:11 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 15:26:11 +0200
Subject: [lxml-dev] News from the 2.5 front
Message-ID: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
Hi,
just wanted to send a note that lxml.etree compiles nicely under Python 2.5b3
(AMD64) using the patched Pyrex version here:
http://codespeak.net/svn/lxml/pyrex/
The only problem I currently encounter is a bug in linecache in 2.5's stdlib
that prevents the doctests from running. Once that's solved, we can see if
those tests pass as well.
Stefan
From fdrake at gmail.com Fri Aug 4 15:30:23 2006
From: fdrake at gmail.com (Fred Drake)
Date: Fri, 4 Aug 2006 09:30:23 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
On 8/4/06, Stefan Behnel wrote:
> just wanted to send a note that lxml.etree compiles nicely under Python 2.5b3
> (AMD64) using the patched Pyrex version here:
Woohoo! Thanks for testing this!
> The only problem I currently encounter is a bug in linecache in 2.5's stdlib
> that prevents the doctests from running. Once that's solved, we can see if
> those tests pass as well.
If there's really a bug in linecache, be sure to report it against
Python on SourceForge so we can get it dealt with.
-Fred
--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 16:18:06 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 16:18:06 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
Message-ID: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
Hi Fred,
Fred Drake wrote:
> On 8/4/06, Stefan Behnel wrote:
>> The only problem I currently encounter is a bug in linecache in 2.5's
>> stdlib that prevents the doctests from running. Once that's solved, we
>> can see if those tests pass as well.
>
> If there's really a bug in linecache, be sure to report it against Python
> on SourceForge so we can get it dealt with.
Oh, well. I did report it and then almost instantly got a TYOF back. The
problem was: lxml used its own version of doctest.py, which was no longer
compatible with 2.5. I always wondered where that came from and what it was
good for. Should have asked long ago, I guess...
Anyway, now it's gone and there's only one minor error in the test runs. I'll
check if I can fix it. It's exception related, so it may still be a bug in the
patched Pyrex version.
Stefan
From fdrake at gmail.com Fri Aug 4 16:27:03 2006
From: fdrake at gmail.com (Fred Drake)
Date: Fri, 4 Aug 2006 10:27:03 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
On 8/4/06, Stefan Behnel wrote:
> Oh, well. I did report it and then almost instantly got a TYOF back. The
TYOF == "That's your own fault" ???
> problem was: lxml used its own version of doctest.py, which was no longer
> compatible with 2.5. I always wondered where that came from and what it was
> good for. Should have asked long ago, I guess...
Hmm. There's a separate version in zope.testing as well. I've no
idea if that's compatible with 2.5; there's so many other things that
fall over with 2.5 it doesn't seem worthwhile to ask.
> Anyway, now it's gone and there's only one minor error in the test runs. I'll
> check if I can fix it. It's exception related, so it may still be a bug in the
> patched Pyrex version.
Ok. Let me know if there's anything I can help with on the 2.5 front.
-Fred
--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 16:52:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 16:52:04 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
Message-ID: <44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
Fred Drake wrote:
> On 8/4/06, Stefan Behnel wrote:
>> Oh, well. I did report it and then almost instantly got a TYOF back. The
>
> TYOF == "That's your own fault" ???
Yup. :)
>> problem was: lxml used its own version of doctest.py, which was no longer
>> compatible with 2.5. I always wondered where that came from and what
>> it was
>> good for. Should have asked long ago, I guess...
>
> Hmm. There's a separate version in zope.testing as well. I've no
> idea if that's compatible with 2.5; there's so many other things that
> fall over with 2.5 it doesn't seem worthwhile to ask.
Apparently, they changed some monkeypatching stuff related to the "getlines()"
function in linecache.py. It now has a different signature. :-/
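To illustrate the signature change: as far as I can tell, linecache.getlines()
takes an additional optional module_globals argument starting with Python 2.5,
so a replacement function that only accepts a filename breaks. Below is a
minimal sketch of a replacement that accepts both call forms. The "<example "
filename prefix and the canned source line are made up for illustration and
are not necessarily what doctest.py uses.

```python
import linecache

# Keep a reference to the real function so that ordinary files still work.
_original_getlines = linecache.getlines


def patched_getlines(filename, module_globals=None):
    """Serve fake source for example pseudo-files, else defer to linecache."""
    # Hypothetical naming convention for in-memory example "files".
    if filename.startswith("<example "):
        return ["print('hello')\n"]
    # Pass module_globals through so the newer call form keeps working.
    return _original_getlines(filename, module_globals)
```

Callers that pass only a filename and callers that pass both arguments are
served by the same function, which is the property the old monkeypatch lacked.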
>> Anyway, now it's gone and there's only one minor error in the test
>> runs. I'll
>> check if I can fix it. It's exception related, so it may still be a
>> bug in the
>> patched Pyrex version.
>
> Ok. Let me know if there's anything I can help with on the 2.5 front.
Thanks for offering help, that's always appreciated. :)
I'll give it some more investigation first.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 4 18:01:21 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 04 Aug 2006 18:01:21 +0200
Subject: [lxml-dev] objectify, ObjectPath and Benchmarks
In-Reply-To: <44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
References: <44D21E1A.7020004@gkec.informatik.tu-darmstadt.de> <44D23793.5040402@infrae.com>
<44D269A0.3040700@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D36F51.9050105@gkec.informatik.tu-darmstadt.de>
Stefan Behnel wrote:
> Martijn Faassen wrote:
>> Global switch for objectify: As I mentioned before I'm still quite
>> worried about switching the entire world over to objectify with a single
>> global call. I really think this should be specified by using a
>> different tree constructor. It just sounds too dangerous to me to
>> globally switch the behavior of the whole API.
>>
>> In the 'classic' way of using the namespace registry, custom element
>> classes are typically registered for particular elements in particular
>> namespaces. Objectify however fundamentally alters the behavior of the
>> entire system. I understood from your previous reply that you were
>> working on ways to make this settable per-tree; did I understand that
>> correctly? I'd recommend making it the normal way to invoke the
>> objectify behavior, not global.
>
> Ok, sure. The lxml.elements.classlookup module has (amongst other things) a
> per-parser lookup implementation. I guess you'd want that to become the
> preferred way of using objectify and I think that's a good idea.
>
> Currently, the docs only present that as an alternative (4th paragraph):
> http://codespeak.net/svn/lxml/branch/capi/doc/objectify.txt
> That part could be rewritten to make the global registry the alternative.
Now that I started rewriting the doc section, I noticed that a per-parser
setup will not be very satisfactory. It will not affect XML() and also not the
trees built by hand using Element() etc., as both use and inherit the default
parser.
So the only way to get an objectify tree in that case is through the parser
API. Once a parsed node is there, however, new subelements will inherit the
parser lookup scheme.
This makes the per-parser setup not useless, but a bit less beautiful...
Any ideas how this could get a little nicer?
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Aug 5 15:11:08 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sat, 05 Aug 2006 15:11:08 +0200
Subject: [lxml-dev] Request for comments: Removing lxml.etree's default
support for namespace class support
Message-ID: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
Hi all,
I know, breaking compatibility is a serious topic, so I'm putting this here
for an open discussion. This change would only impact code that uses the
namespace class lookup to supply custom element classes to lxml.etree. Other
code would continue to work.
Currently, lxml.etree does namespace lookup for custom element classes by
default. This has been the case in the 0.9 and 1.0 series.
Starting with lxml 1.1, etree will support not only custom classes, but also
custom lookup schemes for these classes. It includes a generic fallback
mechanism from one lookup scheme to another if the first one fails. This means
that the default support for namespace class lookup is becoming redundant, as
it is also supported by a public class that provides the namespace lookup
scheme. Also, the current scheme does not support a fallback other than the
default element class, so code that wants to use the namespace lookup with a
different fallback is still required to re-register both.
To remove this redundancy, to speed up the default setup if namespace classes
are /not/ used and (above all) to make the lookup API more accessible, I would
like to remove the default for namespace lookup and replace it by the simplest
possible mechanism that always returns the normal element classes. If
namespace lookup support is needed, something like the following code would be
required at setup time:
    from lxml import etree
    try:
        lookup = etree.ElementNamespaceClassLookup()
    except AttributeError:
        # lxml >= 0.9 and < 1.1 supports this by default
        pass
    else:
        # lxml >= 1.1 requires an explicit setup
        etree.setElementClassLookup(lookup)
This code block is backwards compatible with lxml 0.9 and lxml 1.0, so new
code that requires namespace class lookup could continue to support lxml from
version 0.9 on, while older code that uses namespace classes would have to be
updated with the above code block to support lxml 1.1 and later. Doing this
switch *now* keeps the above code pretty short; later changes would require
version checking and the like.
One of the main reasons for this change is that I would like to make the
lookup mechanism explicit and visible. It is a global property that impacts the
entire library. Users who do not need to install their own custom classes
should not be bothered with it, i.e. should be able to ignore the lookup API,
the Namespace class registry, etc. For those who need a different mechanism, I
believe that the current default does not make it visible enough that (for
example) the functionality of the "Namespace" class registry is disabled if
you select a different class lookup mechanism.
So the new custom class support would work like this:
* if no custom classes are used, no configuration is needed
* any support for custom classes requires setting up a lookup scheme
* changing the default class is done by creating and setting a default
lookup scheme based on the new default classes
* using the namespace lookup requires setting the ns lookup scheme, which
then enables lookups based on the global Namespace registry
* setting a per-parser lookup scheme enables delegation to the specific
lookup registered with a parser, which in turn can deploy any of the
available schemes and defaults to using the normal classes
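The fallback idea behind these points can be sketched in plain Python. This is
only an illustration of the delegation pattern, not the lxml implementation:
all class and method names below (DefaultClassLookup, NamespaceClassLookup,
register, lookup, BaseElement, CustomDiv) are hypothetical.

```python
class BaseElement:
    """Stand-in for the normal element class."""


class CustomDiv(BaseElement):
    """Stand-in for a user-supplied custom element class."""


class DefaultClassLookup:
    """Simplest scheme: always return the same element class."""

    def __init__(self, element_class=BaseElement):
        self.element_class = element_class

    def lookup(self, namespace, tag):
        return self.element_class


class NamespaceClassLookup:
    """Look up classes by (namespace, tag); delegate to a fallback on a miss."""

    def __init__(self, fallback=None):
        self.fallback = fallback or DefaultClassLookup()
        self._registry = {}

    def register(self, namespace, tag, element_class):
        self._registry[(namespace, tag)] = element_class

    def lookup(self, namespace, tag):
        element_class = self._registry.get((namespace, tag))
        if element_class is None:
            # Nothing registered here: let the next scheme decide.
            return self.fallback.lookup(namespace, tag)
        return element_class


ns_lookup = NamespaceClassLookup()
ns_lookup.register("http://example.org/ns", "div", CustomDiv)
print(ns_lookup.lookup("http://example.org/ns", "div").__name__)
print(ns_lookup.lookup("http://example.org/ns", "span").__name__)
```

With no scheme configured, only the default lookup would be in play, which
matches the "no configuration needed" case in the first point.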
I'm also considering replicating the Namespace registry locally in the
ElementNamespaceClassLookup class. This would allow things like a per-parser
namespace registry and the like. I think removing the default would also help
in getting this cleaner.
I'm really interested in hearing opinions on this. I think the above
compatibility code makes the switch trivial to do, but I would like to hear if
there are other impacts of this change that I might not have thought of.
Stefan
From jkrukoff at ltgc.com Sat Aug 5 23:47:56 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Sat, 5 Aug 2006 15:47:56 -0600
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <44D32B85.2080304@gkec.informatik.tu-darmstadt.de>
Message-ID: <001801c6b8d8$d07bf870$051ea8c0@naomi>
> Right before committing, I noticed that the original patch actually
> introduces threading problems, so here is a new patch that fixes it The
> Right Way.
> Stefan
I attempted to apply this patch against the lxml 1.0.2 release version, and
had no luck. Do I need to be pulling 1.1 from svn to get this fix?
---------
John Krukoff
jkrukoff at ltgc.com
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 07:33:08 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 07:33:08 +0200
Subject: [lxml-dev] Segfault in lxml during element copy
In-Reply-To: <001801c6b8d8$d07bf870$051ea8c0@naomi>
References: <001801c6b8d8$d07bf870$051ea8c0@naomi>
Message-ID: <44D57F14.3000105@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
>> Right before committing, I noticed that the original patch actually
>> introduces threading problems, so here is a new patch that fixes it The
>> Right Way.
>
> I attempted to apply this patch against the lxml 1.0.2 release version, and
> had no luck. Do I need to be pulling 1.1 from svn to get this fix?
Ah, right, sorry. I had done so much work on 1.1 recently that I completely
forgot that you are still using 1.0.
1.0 does not have threading support and I had to rewrite the patch to get it
in. Here's a version against the current 1.0 branch that should apply cleanly
against 1.0.2. I'll also release a 1.0.3 in a few days (preferably at the same
time as 1.1beta to reduce the overhead for our egg maintainers).
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xslt-crash.patch
Type: text/x-patch
Size: 3860 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060806/f546bb2a/attachment.bin
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 12:09:28 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 12:09:28 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
Message-ID: <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
Stefan Behnel wrote:
>>> there's only one minor error in the test runs. I'll
>>> check if I can fix it. It's exception related, so it may still be a
>>> bug in the patched Pyrex version.
>> Ok. Let me know if there's anything I can help with on the 2.5 front.
>
> Thanks for offering help, that's always appreciated. :)
>
> I'll give it some more investigation first.
Ok, it was not a Pyrex bug. The problem is that lxml uses multiple inheritance
in some exceptions and now that they are new style classes, it's no longer
enough to call the constructor of the superclass directly. However, super()
does not work for old style classes in 2.4, so I'm a bit challenged in getting
this fixed in a backward compatible way.
This works nicely in Python 2.4:
    class Error(Exception): pass
    class LxmlError(Error):
        def __init__(self, *args):
            Error.__init__(self, *args)
            self.error_log = __copyGlobalErrorLog()
while Python 2.5 requires this:
    class LxmlError(Error):
        def __init__(self, *args):
            super(LxmlError, self).__init__(*args)
            self.error_log = __copyGlobalErrorLog()
which does not work for classic classes in 2.3/4. Does anyone have an idea how
to fix this nicely?
Stefan
From fdrake at gmail.com Sun Aug 6 18:22:07 2006
From: fdrake at gmail.com (Fred Drake)
Date: Sun, 6 Aug 2006 12:22:07 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
On 8/6/06, Stefan Behnel wrote:
> Ok, it was not a Pyrex bug. The problem is that lxml uses multiple inheritance
> in some exceptions and now that they are new style classes, it's no longer
> enough to call the constructor of the superclass directly.
Please explain in detail what problems you had with this approach.
> However, super()
> does not work for old style classes in 2.4, so I'm a bit challenged in getting
> this fixed in a backward compatible way.
>
> This works nicely in Python 2.4:
...
> while Python 2.5 requires this:
...
> which does not work for classic classes in 2.3/4. Does anyone have an idea how
> to fix this nicely?
The Python 2.4 formulation should still work in Python 2.5. Direct
calls to the superclass are not forbidden with new-style classes.
-Fred
--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 18:36:58 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 18:36:58 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
Message-ID: <44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
Hi Fred,
Fred Drake wrote:
> On 8/6/06, Stefan Behnel wrote:
>> Ok, it was not a Pyrex bug. The problem is that lxml uses multiple
>> inheritance
>> in some exceptions and now that they are new style classes, it's no
>> longer
>> enough to call the constructor of the superclass directly.
>
> Please explain in detail what problems you had with this approach.
As I said, I'm using this:
    class Error(Exception): pass
    class LxmlError(Error):
        def __init__(self, *args):
            Error.__init__(self, *args)
            self.error_log = __copyGlobalErrorLog()
What I did not say is that afterwards, I use this:
    class XPathError(LxmlError):
        pass
    class LxmlSyntaxError(LxmlError, SyntaxError):
        pass
    class XPathSyntaxError(LxmlSyntaxError, XPathError):
        pass
So there is a 'cross inheritance' here in XPathSyntaxError, but even when I
remove the XPathError inheritance, I get the same result as follows. I now
call this in Pyrex:
    raise XPathSyntaxError, "some message"
and what comes out at the end is:
    Traceback ...
    XPathSyntaxError: None
Which is not quite what you'd expect. I assume what happens is that the MRO
ends up not calling Exception.__init__ or something, which leads to not
setting the message. The following works, however:
    class LxmlError(Error):
        def __init__(self, *args):
            super(LxmlError, self).__init__(*args)
            self.error_log = __copyGlobalErrorLog()
What I now did was to call either the super() stuff or __init__ depending on
Error being a subtype of 'object' or not. I would prefer having a simpler
solution, though.
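The effect can be reproduced in plain Python. The sketch below assumes a
Python 3 interpreter (where all exceptions are new style, as in 2.5) and uses
an empty list in place of __copyGlobalErrorLog(). The class names
DirectInitError and CooperativeError are invented for the comparison.

```python
class Error(Exception):
    pass


class DirectInitError(Error, SyntaxError):
    def __init__(self, *args):
        # Direct call: resolves to Exception.__init__ through Error's own
        # lookup, so SyntaxError.__init__ never runs for this instance.
        Error.__init__(self, *args)
        self.error_log = []  # stand-in for __copyGlobalErrorLog()


class CooperativeError(Error, SyntaxError):
    def __init__(self, *args):
        # Cooperative call: follows the MRO of the instance, so
        # SyntaxError.__init__ gets its turn and can store the message.
        super().__init__(*args)
        self.error_log = []


print(str(DirectInitError("some message")))
print(str(CooperativeError("some message")))
```

The first line printed should not show the message (I would expect the same
"None" as in the traceback above), while the second one should. Both
instances still carry the message in their args tuple.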
Stefan
From fdrake at gmail.com Sun Aug 6 19:04:48 2006
From: fdrake at gmail.com (Fred Drake)
Date: Sun, 6 Aug 2006 13:04:48 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
<44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
On 8/6/06, Stefan Behnel wrote:
> class LxmlSyntaxError(LxmlError, SyntaxError):
>     pass
Is that the built-in SyntaxError? Leave that out. It's really only
intended to be used with Python-language syntax errors. Handling for
any other syntax errors should use separate exceptions specific to the
processing for that language.
Removing that, I get a reasonable error message for Python 2.4 and 2.5.
-Fred
--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
From luto at myrealbox.com Sun Aug 6 19:18:33 2006
From: luto at myrealbox.com (Andrew Lutomirski)
Date: Sun, 6 Aug 2006 10:18:33 -0700
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
<44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
Message-ID:
On 8/6/06, Fred Drake wrote:
>
> On 8/6/06, Stefan Behnel
> wrote:
> > class LxmlSyntaxError(LxmlError, SyntaxError):
> >     pass
>
> Is that the built-in SyntaxError? Leave that out. It's really only
> intended to be used with Python-language syntax errors. Handling for
> any other syntax errors should use separate exceptions specific to the
> processing for that language.
I think that elementtree and cElementTree do the same thing. I don't like
this behavior at all, though -- I spent quite a while trying to find a syntax
error in my code a couple days ago when the real error was in the XML input.
--Andy
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 19:20:12 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 19:20:12 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
<44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
Message-ID: <44D624CC.6090203@gkec.informatik.tu-darmstadt.de>
Hi Fred,
Fred Drake wrote:
> On 8/6/06, Stefan Behnel wrote:
>> class LxmlSyntaxError(LxmlError, SyntaxError):
>>     pass
>
> Is that the built-in SyntaxError? Leave that out. It's really only
> intended to be used with Python-language syntax errors. Handling for
> any other syntax errors should use separate exceptions specific to the
> processing for that language.
Well, I'm not the one who put it there (and I definitely would not have used
it in the first place). Thing is, lxml is heading for ElementTree
compatibility and ElementTree raises a plain SyntaxError in the place where we
raise LxmlSyntaxError. So removing the superclass would break compatibility to
ET and also break existing code that depends on it...
Stefan
From fdrake at gmail.com Sun Aug 6 19:30:40 2006
From: fdrake at gmail.com (Fred Drake)
Date: Sun, 6 Aug 2006 13:30:40 -0400
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D624CC.6090203@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
<44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
<44D624CC.6090203@gkec.informatik.tu-darmstadt.de>
Message-ID: <9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com>
On 8/6/06, Stefan Behnel wrote:
> Well, I'm not the one who put it there (and I definitely would not have used
> it in the first place). Thing is, lxml is heading for ElementTree
> compatibility and ElementTree raises a plain SyntaxError in the place where we
> raise LxmlSyntaxError. So removing the superclass would break compatibility to
> ET and also break existing code that depends on it...
Ok, I see. The SyntaxError is used directly in the ElementPath module. ;-(
There's not going to be a really clean way to do this, or at least I
can't think of it off-hand. Here's what I came up with; it's probably
similar to what you did:
===========================================
_newstyle_exceptions = isinstance(Exception, type)

class Error(Exception):
    pass

class LxmlError(Error):
    def __init__(self, *args):
        if _newstyle_exceptions:
            super(LxmlError, self).__init__(*args)
        else:
            Error.__init__(self, *args)
        self.error_log = []

class XPathError(LxmlError):
    pass

class LxmlSyntaxError(LxmlError, SyntaxError):
    pass

class XPathSyntaxError(LxmlSyntaxError, XPathError):
    pass

raise XPathSyntaxError, "some message"
===========================================
-Fred
--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Aug 6 19:50:16 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 06 Aug 2006 19:50:16 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
<44D35F14.9090202@gkec.informatik.tu-darmstadt.de>
<44D5BFD8.1050002@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608060922i17e2f8fbg8a22cea352b3f9a8@mail.gmail.com>
<44D61AAA.20309@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608061004t5a78412eh1dc9f2ed7ff14c8@mail.gmail.com>
<44D624CC.6090203@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608061030r509c69b9qf24533995e33276b@mail.gmail.com>
Message-ID: <44D62BD8.3010505@gkec.informatik.tu-darmstadt.de>
Hi Fred,
Fred Drake wrote:
> There's not going to be a really clean way to do this, or at least I
> can't think of it off-hand. Here's what I came up with; it's probably
> similar to what you did:
>
> ===========================================
> _newstyle_exceptions = isinstance(Exception, type)
>
> class LxmlError(Error):
>     def __init__(self, *args):
>         if _newstyle_exceptions:
>             super(LxmlError, self).__init__(*args)
>         else:
>             Error.__init__(self, *args)
>         self.error_log = []
Yup, that's about what I did, too. It's not that ugly, just a relatively small
work around for a backwards compatibility problem. So I think I'll just live
with it.
Thanks for helping,
Stefan
From benno.luthiger at id.ethz.ch Mon Aug 7 18:48:07 2006
From: benno.luthiger at id.ethz.ch (Luthiger Stoll Benno)
Date: Mon, 7 Aug 2006 18:48:07 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
Message-ID:
Hello
I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I install lxml using easy_install. I saw that this problem was discussed last month on this list.
I scanned the mails addressing this issue, however, I could not find a solution.
How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
Regards,
Benno
From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Aug 7 20:20:20 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 07 Aug 2006 20:20:20 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To:
References:
Message-ID: <44D78464.8000402@gkec.informatik.tu-darmstadt.de>
Hi Benno,
Luthiger Stoll Benno wrote:
> I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I
> install lxml using easy_install. I saw that this problem was discussed last
> month on this list. I scanned the mails addressing this issue, however, I
> could not find a solution.
We do not provide eggs for Python installations that use 16 bit unicode
(UCS2). The solution is therefore to compile lxml yourself. I assume you're on
Linux, so that's not too much of an effort.
http://codespeak.net/lxml/build.html
> How can I test whether my python installation
> (Python 2.3.5) is compiled with 2 bit unicode?
Ah, 2 bit unicode? No, that's pretty unlikely... ;)
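Joking aside, the build width can be read from sys.maxunicode: narrow (UCS2)
builds report 65535, wide (UCS4) builds report 1114111. Here is a small
sketch. The helper name unicode_build_width is made up, and note that
Python 3.3 and later always report the wide value.

```python
import sys


def unicode_build_width(maxunicode=None):
    """Return 'UCS2' or 'UCS4' for the given (or current) sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow builds cannot represent code points above 0xFFFF directly.
    if maxunicode > 0xFFFF:
        return "UCS4"
    return "UCS2"


print(unicode_build_width())
```

If this prints UCS2, the UCS4 eggs will fail with the undefined symbol error
quoted above, and building lxml from source is the way to go.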
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Aug 8 21:59:26 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 08 Aug 2006 21:59:26 +0200
Subject: [lxml-dev] lxml 1.0.3 and 1.1beta on cheeseshop
Message-ID: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de>
Hi all,
I finally managed to get 1.1beta out, right after releasing 1.0.3.
1.0.3 is a bug fix release. Since it fixes a crash in XSLT result handling,
updating is recommended.
1.1beta is the last pre-release before the shiny new 1.1 series will take the
lead. Despite the surprisingly short change log, it contains tons of changes
under the hood and some major improvements in flexibility. It is the first
lxml version to compile and run under Python 2.5 beta (3), comes with a C-API
that makes it extensible by other Python C modules, and features an additional
data-binding API on top of etree (objectify).
For further information on the features of lxml 1.1, please refer to the HTML
documentation in the source distribution or read the text files online:
http://codespeak.net/svn/lxml/trunk/doc
Note that lxml 1.1 requires a patched Pyrex version if you want to compile
from non-release or modified sources. It is available here:
http://codespeak.net/svn/lxml/pyrex
This version of Pyrex supports Python 2.5 and public C-API generation, so it
may be of interest to more than only lxml developers.
As always, I'm happy about any egg contributions or bug reports that help in
making lxml 1.1 the greatest Python XML tool ever.
Have fun,
Stefan
Changelogs:
(note that 1.1beta also contains the changes from 1.0.3)
1.1beta (2006-08-08)
Features added
* Unlock the GIL for deep copying documents and for XPath()
* Support for Python 2.5 beta
* New compact keyword argument for parsing read-only documents
* Support for parser options in iterparse()
* The namespace axis is supported in XPath and returns (prefix, URI)
  tuples
* The XPath expression "/" now returns an empty list instead of raising an
  exception
* XML-Object API on top of lxml (lxml.objectify)
* Customizable Element class lookup:
  o Support for externally provided lookup functions
  o lxml.elements.classlookup module implements different lookup
    mechanisms
* Support for processing instructions (ET-like, not compatible)
* Public C-level API for independent extension modules
Bugs fixed
* XPathSyntaxError now inherits from XPathError
* Threading race conditions in RelaxNG and XMLSchema
* Crash when mixing elements from XSLT results into other trees,
  concurrent XSLT is only allowed when the stylesheet was parsed in the
  main thread
* The EXSLT regexp:match function now works as defined (except for some
  differences in the regular expression syntax)
* Setting element.text to '' returned None on request, not the empty
  string
* iterparse() could crash on long XML files
* Creating documents no longer copies the parser for later URL resolving.
  For performance reasons, only a reference is kept. Resolver updates on
  the parser will now be reflected by documents that were parsed before
  the change. Although this should rarely become visible, it is a
  behavioral change from 1.0.
1.0.3 (2006-08-08)
Features added
* Element.replace(old, new) method to replace a subelement by another one
Bugs fixed
* Crash when mixing elements from XSLT results into other trees
* Copying/deepcopying did not work for ElementTree objects
* Setting an attribute to a non-string value did not raise an exception
* Element.remove() deleted the tail text from the removed Element
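
Two of the 1.0.3 entries are easy to try out. The following is only a sketch, assuming an installed lxml that includes these changes; the document and tag names are invented for the example.

```python
import copy

from lxml import etree

root = etree.XML("<root><old/><keep/></root>")

# Element.replace(old, new): swap one subelement for another in place.
new_child = etree.Element("new")
root.replace(root[0], new_child)
tags_after_replace = [child.tag for child in root]

# Deep-copying an ElementTree is expected to carry the root element along.
tree_copy = copy.deepcopy(etree.ElementTree(root))
copied_root = tree_copy.getroot()

if __name__ == "__main__":
    print(tags_after_replace)
    print(copied_root.tag)
```

The copied root is a separate element in a separate document, so changes to it should not show up in the original tree.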
From ogrisel at nuxeo.com Wed Aug 9 17:24:31 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Wed, 09 Aug 2006 17:24:31 +0200
Subject: [lxml-dev] Google Analytics tagger script based on lxml
Message-ID:
Hi list,
Thanks to the neat HTMLParser feature in lxml I was able to quickly write a
simple script to add Google Analytics tags at the end of static HTML files
(generated from a REST source for instance).
Feel free to use it should you find it any useful:
http://champiland.homelinux.net/evogrid/code/evogrid.og.main/ga_tagger.py
NB: google analytics is a free as in beer web traffic analyser:
http://www.google.com/analytics/
Best,
--
Olivier
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 9 18:59:35 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 09 Aug 2006 18:59:35 +0200
Subject: [lxml-dev] lxml 1.0.3 and 1.1beta on cheeseshop
In-Reply-To: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de>
References: <44D8ED1E.5000301@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DA1476.4070209@gkec.informatik.tu-darmstadt.de>
Ah, well, never release too early...
Here is a patch against 1.1beta that fixes a couple of bugs in lxml.objectify,
especially in the setattr() and addattr() methods of ObjectPath. Without the
patch, you can't currently set attributes to Element values or lists. That's
not too much of an issue, as you can still set them directly (without
ObjectPath). But it's still annoying.
Guess that's what a beta release is there for...
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: objectify-setattr-bugs.patch
Type: text/x-patch
Size: 12242 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060809/4652d733/attachment-0001.bin
From ogrisel at nuxeo.com Wed Aug 9 19:39:09 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Wed, 09 Aug 2006 19:39:09 +0200
Subject: [lxml-dev] ElementTree and lxml advertised by yahoo
Message-ID:
The lxml part is just a reference at the bottom of the page, but anyway that's
still a good start :)
http://developer.yahoo.com/python/python-xml.html#element
--
Olivier
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 10 08:31:57 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 10 Aug 2006 08:31:57 +0200
Subject: [lxml-dev] Developer version of the web pages online
Message-ID: <44DAD2DD.5090307@gkec.informatik.tu-darmstadt.de>
Hi all,
I thought it would be a good idea to advocate the current developer version of
lxml a bit more. So I uploaded the web pages from the trunk to
http://codespeak.net/lxml/dev/
Their differences are obviously generated by a script using lxml.etree. :)
Stefan
From Holger.Joukl at LBBW.de Thu Aug 10 14:00:57 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Thu, 10 Aug 2006 14:00:57 +0200
Subject: [lxml-dev] [1.1beta] lxml.objectify python2.3 compatibilty
In-Reply-To:
Message-ID:
Hi,
lxml.objectify crashes under python2.3:
PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3
Python 2.3.4 (#6, Jul 20 2004, 11:09:38)
[GCC 2.95.2 19991024 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.objectify
Traceback (most recent call last):
File "", line 1, in ?
ImportError: ld.so.1: python2.3: fatal: relocation error: file
/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol
PyDict_Contains: referenced symbol not found
>>>
Seems like PyDict_Contains is not available in python2.3:
$ elfdump /apps/pydev/gcc/3.4.4/bin/python2.4 |grep -i pydict_cont
[593] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[3487] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[593] PyDict_Contains
0 hjoukl at dev-a .../pytaf $ elfdump /apps/prod/bin/python2.3 |grep -i pydict_cont
1 hjoukl at dev-a .../pytaf $
Regards, Holger
Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene
Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde,
verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail
sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht
gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht
garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte
den Inhalt der E-Mail als Hardcopy an.
The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error,
please notify the sender immediately and then delete this e-mail. Any
unauthorized copying and transmission is forbidden. E-Mail transmission
cannot be guaranteed to be secure. If verification is required, please
request a hard copy version.
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Aug 10 14:28:41 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 10 Aug 2006 14:28:41 +0200
Subject: [lxml-dev] [1.1beta] lxml.objectify python2.3 compatibilty
In-Reply-To:
References:
Message-ID: <44DB2679.1070807@gkec.informatik.tu-darmstadt.de>
Hi Holger,
Holger Joukl wrote:
> lxml.objectify crashes under python2.3:
>
> PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3
> Python 2.3.4 (#6, Jul 20 2004, 11:09:38)
> [GCC 2.95.2 19991024 (release)] on sunos5
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import lxml.objectify
> Traceback (most recent call last):
> File "", line 1, in ?
> ImportError: ld.so.1: python2.3: fatal: relocation error: file
> /apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol
> PyDict_Contains: referenced symbol not found
Besides the fact that you should not normally import modules that were
compiled for a different Python version - you're right, thanks. That one
slipped through accidentally.
Here's the patch.
Stefan
Index: src/lxml/objectify.pyx
===================================================================
--- src/lxml/objectify.pyx (Revision 31226)
+++ src/lxml/objectify.pyx (Arbeitskopie)
@@ -184,7 +184,7 @@
if c_ns is NULL and tree._getNs(child._c_node) is not NULL:
continue
name = child._c_node.name
- if not python.PyDict_Contains(children, name):
+ if python.PyDict_GetItem(children, name) is NULL:
python.PyDict_SetItem(children, name, child)
return children
Index: src/lxml/python.pxd
===================================================================
--- src/lxml/python.pxd (Revision 31212)
+++ src/lxml/python.pxd (Arbeitskopie)
@@ -52,7 +52,6 @@
cdef void PyDict_Clear(object d)
cdef object PyDict_Copy(object d)
cdef Py_ssize_t PyDict_Size(object d)
- cdef int PyDict_Contains(object d, object key)
cdef object PySequence_List(object o)
cdef object PySequence_Tuple(object o)
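
In plain Python terms, the patched loop keeps the first child seen for each tag name. PyDict_Contains only appeared in the Python 2.4 C-API, while PyDict_GetItem (which returns NULL for a missing key) is also available in 2.3. A rough, stdlib-only sketch of the same logic follows; it uses xml.etree.ElementTree as a stand-in for the Pyrex code, the function name is invented here, and the namespace check from the patch context is left out.

```python
import xml.etree.ElementTree as ET


def first_children_by_name(parent):
    """Map each tag name to the first child element carrying that name."""
    children = {}
    for child in parent:
        name = child.tag
        # Plays the role of "PyDict_GetItem(children, name) is NULL":
        # only store the child if no entry exists for this name yet.
        if children.get(name) is None:
            children[name] = child
    return children


if __name__ == "__main__":
    root = ET.fromstring("<r><a id='1'/><b/><a id='2'/></r>")
    for name, child in sorted(first_children_by_name(root).items()):
        print(name, child.attrib)
```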
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 06:57:01 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 06:57:01 +0200
Subject: [lxml-dev] Request for comments: Removing lxml.etree's default
support for namespace class support
In-Reply-To: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Hi all,
Since there have been no reactions so far, I'll just extend my request a little.
Stefan Behnel wrote:
> To remove this redundancy, to speed up the default setup if namespace classes
> are /not/ used and (above all) to make the lookup API more accessible, I would
> like to remove the default for namespace lookup and replace it by the simplest
> possible mechanism that always returns the normal element classes.
[...]
> One of the main reasons for this change is that I would like to make the
> lookup mechanism explicit and visible. It is a global property that impacts the
> entire library. Users who do not need to install their own custom classes
> should not be bothered with it, i.e. should be able to ignore the lookup API,
> the Namespace class registry, etc. For those who need a different mechanism, I
> believe that the current default does not make it visible enough that (for
> example) the functionality of the "Namespace" class registry is disabled if
> you select a different class lookup mechanism.
I thought about this some more and found that having a per-parser setup as
default would be pretty convenient and is an extremely small overhead compared
to the default class lookup.
And what's even better, making the parser lookup the default would remove the
need to actually change the global lookup scheme, which avoids problems with
different modules using lxml (as is already the case with objectify).
So, the second proposal for custom class lookup:
* if no custom classes are used, no configuration is needed
* any support for custom classes should be registered at the parser level
then, as before:
> * changing the default class is done by creating and setting a default
> lookup scheme based on the new default classes
> * using the namespace lookup requires setting the ns lookup scheme, which
> then enables lookups based on the global Namespace registry
[leaving out the original per-parser bit]
I think this really helps in making custom class support in lxml cleaner. It
would then be helpful to also extend the behaviour of the XML() and HTML()
factories to use the default parser *iff* it matches their requirements (i.e.
it *is* an XMLParser or HTMLParser respectively) and only if not, fall back to
the current behaviour of using their own parser. This allows registering a
lookup scheme for the default parser without losing these functions.
I'll just go and implement this on the trunk for now, so if there are any
comments or diverging interests, please speak up on the list.
Stefan
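
A sketch of what registration at the parser level can look like. It uses the names found in later lxml releases (`ElementBase`, `ElementDefaultClassLookup`, `XMLParser.set_element_class_lookup()`), which may well differ from what was on the trunk at the time of this mail; `HonkElement` and the sample document are made up for the example.

```python
from lxml import etree


class HonkElement(etree.ElementBase):
    """Hypothetical custom element class, used only for this sketch."""

    @property
    def honking(self):
        return self.get("honking") == "true"


# The lookup scheme is attached to one parser, not to the whole library.
lookup = etree.ElementDefaultClassLookup(element=HonkElement)
custom_parser = etree.XMLParser()
custom_parser.set_element_class_lookup(lookup)

# Same input, two parsers: only the configured one yields the custom class.
custom_root = etree.fromstring('<root honking="true"/>', custom_parser)
plain_root = etree.fromstring('<root honking="true"/>', etree.XMLParser())

if __name__ == "__main__":
    print(type(custom_root).__name__, custom_root.honking)
    print(type(plain_root).__name__)
```

Code that never asks for custom classes keeps getting the normal element classes, which is the "no configuration needed" case from the list above.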
From Holger.Joukl at LBBW.de Fri Aug 11 09:19:03 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 09:19:03 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Message-ID:
Hi,
When inheriting from the NumberElement base class, there is a defined
mechanism to set a text-to-pyval parser function using the
_setValueParser method.
Would it make sense to extend this well-defined mechanism to
the general DataElement class?
E.g. writing a custom datetime class looks something like this:
from lxml import objectify
from datetime import datetime
from dateutil.parser import parse
from dateutil import tz

# Unix epoch as datetime object
EPOCH = datetime(1970, 1, 1, 0, 0, 0, 0, tzinfo=tz.tzutc())

# FIXME: Should probably be tzlocal, but this crashes under python2.4:
# FIXME: ValueError: timestamp out of range for platform time_t when
# FIXME: trying to calculate with datetime values
# FIXME: This is due to changes in the time module, python2.3 just ignores it
#_DEFAULT_TZ=tz.tzlocal() # better??
# Problem is that this rule is true now but has undergone some changes;
# e.g. dst wasn't even invented until 1975 in Germany
_DEFAULT_TZ=tz.tzstr('MET-1MEST-2,M3.5.0/02:00:00,M10.5.0/03:00:00')

class _parsePrecedence:
    yearfirst = True
    dayfirst = False

def _findtz(name, offset):
    """Determine the timezone information as best as we can.
    Offset takes precedence over name. If neither offset nor tz name are
    given, fallback to use system local tz.
    """
    if offset:
        return tz.tzoffset(name, offset)
    if name:
        if name == 'UTC':
            return tz.tzutc()
        else:
            found_tz = tz.gettz(name)
            if found_tz:
                return found_tz
            else:
                return tz.tzstr(name)
    return _DEFAULT_TZ

class DatetimeElement(objectify.ObjectifiedDataElement):
    def __get(self):
        return _datetimeValueOf(self)
    pyval = property(__get)
    def _type(text):
        return _checkDatetime(text)
    _type = staticmethod(_type)
    def __add__(self, other):
        return _datetimeValueOf(self) + _datetimeValueOf(other)
    def __sub__(self, other):
        return _datetimeValueOf(self) - _datetimeValueOf(other)
    def __radd__(self, other):
        return _datetimeValueOf(other) + _datetimeValueOf(self)
    def __rsub__(self, other):
        return _datetimeValueOf(other) - _datetimeValueOf(self)
    def __cmp__(self, other):
        return cmp(_datetimeValueOf(self), _datetimeValueOf(other))
    def __str__(self):
        return str(self.pyval)

def _datetimeValueOf(obj):
    if isinstance(obj, DatetimeElement):
        return DatetimeElement._type(obj.text)
    return obj

def _checkDatetime(timestr):
    # parse raises ValueError if not successful
    return parse(timestr, tzinfos=_findtz,
                 yearfirst=_parsePrecedence.yearfirst,
                 dayfirst=_parsePrecedence.dayfirst)

def register():
    datetimeType = objectify.PyType('datetime', _checkDatetime,
                                    DatetimeElement)
    datetimeType.xmlSchemaTypes = ("datetime",)
    datetimeType.register()
The re-implementation of property pyval might be left out here, also the
_type staticmethod.
Maybe the __str__ method, too, if ObjectifiedDataElement changed its __str__
method to
    def __str__(self):
        return str(self.pyval)
What do you think?
Regards,
Holger
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 09:57:11 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 09:57:11 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To:
References:
Message-ID: <44DC3857.102@gkec.informatik.tu-darmstadt.de>
Hi Holger,
Holger Joukl wrote:
> inheriting from the NumberElement base class there is a defined
> mechanism to set a text-to-pyval parser function using the
> _setValueParser method.
> Would it make sense to extend this well-defined mechanism to
> the general DataElement class?
[implementation of a date type]
> The re-implementation of property pyval might be left out here, also the
> _type staticmethod.
> Maybe the __str__ method, too if ObjectifiedDataElement changed its __str__
> method to
> def __str__(self):
> return str(self.pyval)
Writing str() in that way would not work in all cases. Just look at None,
__str__() must always return a string. So, when None is returned as pyval,
should __str__() return "" or "None"? Depends, right? What about numbers? Does
0 mean "0" or "False"?
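
To make the point concrete, a small stand-alone sketch in plain Python, not lxml code: `SketchDataElement` and the parser functions are hypothetical stand-ins for a data element and its text-to-pyval conversion.

```python
class SketchDataElement:
    """Hypothetical stand-in for a data element with a parsed pyval."""

    def __init__(self, text, parse):
        self.text = text
        self._parse = parse

    @property
    def pyval(self):
        return self._parse(self.text)

    def __str__(self):
        # The generic implementation under discussion.
        return str(self.pyval)


# Same generic __str__, three different text-to-pyval conversions.
empty = SketchDataElement("", lambda text: text or None)
zero = SketchDataElement("0", int)
flag = SketchDataElement("0", lambda text: bool(int(text)))

if __name__ == "__main__":
    for element in (empty, zero, flag):
        print(repr(element.text), "->", repr(str(element)))
```

The empty element prints as "None" rather than "", and the same text "0" prints as "0" or "False" depending on the conversion, so whether the generic `__str__` is acceptable depends on what each subclass wants to show.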
We could introduce an intermediate "ParsableObjectifiedDataElement" or
something in that line. I don't know if there's enough use for it, though. It
would only have 3-4 methods or something that don't do much. It's different in
NumberElement, where the entire number protocol is implemented.
BTW, I'm not opposed to integrating a date element class. As it looks, yours
is pretty far advanced by now, and it's even an external Python module. I
won't have the time to merge it before the end of the month, but if you can
get some of the FIXMEs out by then (no, *not* only the comments :), we can
see if we get it into 1.1 final.
Stefan
From philipp at weitershausen.de Fri Aug 11 10:15:36 2006
From: philipp at weitershausen.de (Philipp von Weitershausen)
Date: Fri, 11 Aug 2006 10:15:36 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com> <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
<9cee7ab80608040727s7bba48l7db39c1e055b8627@mail.gmail.com>
Message-ID: <44DC3CA8.8040807@weitershausen.de>
Fred Drake wrote:
>> problem was: lxml used its own version of doctest.py, which was no longer
>> compatible with 2.5. I always wondered where that came from and what it was
>> good for. Should have asked long ago, I guess...
>
> Hmm. There's a separate version in zope.testing as well. I've no
> idea if that's compatible with 2.5; there's so many other things that
> fall over with 2.5 it doesn't seem worthwhile to ask.
Jim, Tim, and others continuously improved Python's doctest for Zope. The
latest example is Benji's work on footnotes. AFAIK Zope's doctest was
regularly sync'ed with Python's, though.
At least Python 2.4's doctest is good enough for not having to ship your
own version of it.
Philipp
From Holger.Joukl at LBBW.de Fri Aug 11 10:18:38 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 10:18:38 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC3857.102@gkec.informatik.tu-darmstadt.de>
Message-ID:
Stefan Behnel wrote on 11.08.2006 09:57:11:
> > The re-implementation of property pyval might be left out here, also the
> > _type staticmethod.
> > Maybe the __str__ method, too if ObjectifiedDataElement changed its __str__
> > method to
> > def __str__(self):
> > return str(self.pyval)
>
> Writing str() in that way would not work in all cases. Just look at None,
> __str__() must always return a string. So, when None is returned as pyval,
> should __str__() return "" or "None"? Depends, right? What about numbers? Does
> 0 mean "0" or "False"?
The NoneElement returns:
    def __str__(self):
        return "None"
with a pyval:
    property pyval:
        def __get__(self):
            return None
so no problem there.
As for numbers, a pyval of 0 will result in "0" and a pyval of True in "True".
I don't actually see a problem here :-)
> We could introduce an intermediate "ParsableObjectifiedDataElement" or
> something in that line. I don't know if there's enough use for it, though. It
> would only have 3-4 methods or something that don't do much. It's different in
> NumberElement, where the entire number protocol is implemented.
I agree that another DataElement specialization would not be that useful here.
> BTW, I'm not opposed to integrating a date element class. As it looks, yours
> is pretty far advanced by now, and it's even an external Python module. I
> won't have the time to merge it before the end of the month, but if you can
> get some of the FIXMEs out by then (no, *not* only the comments :), we can
> see if we get it into 1.1 final.
Yes, works like a charm. Note that it depends on the external dateutil module,
though. Without that, parsing and timezone handling become a nightmare.
As for the FIXME, I fear that there will be no clean solution other than
forcing the ObjectifiedDatetime user to register a _DEFAULT_TZ containing
the explicit DST rule. Date/time handling is evil.
Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good thing
imho). Is it possible to have a way to modify the text of the underlying
cnode from within a custom ObjectifiedDataElement class, e.g. in _init()?
I know this is possible when implementing this in pyrex, but for a pure-python
implementation?
The background is that for the ObjectifiedDatetime class I might optionally
want to change the .text to ISO format.
Regards, Holger
From Holger.Joukl at LBBW.de Fri Aug 11 10:24:08 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 10:24:08 +0200
Subject: [lxml-dev] [objectify] DataElement function
In-Reply-To: <44DC3857.102@gkec.informatik.tu-darmstadt.de>
Message-ID:
Hi Stefan,
is it intentional/unavoidable that the element type returned
by DataElement is always ObjectifiedElement before it is put
into a parent element?
>>> what = objectify.DataElement(18)
>>> print what
value = '18' [ObjectifiedElement]
  * py:pytype = 'int'
>>> what = objectify.DataElement("hallo")
>>> print what
value = 'hallo' [ObjectifiedElement]
  * py:pytype = 'str'
>>> what = objectify.DataElement("17", _pytype="str")
>>> print what
value = '17' [ObjectifiedElement]
  * py:pytype = 'str'
>>> root = objectify.Element('root')
>>> root.what = what
>>> print root
root = None [ObjectifiedElement]
    what = '17' [StringElement]
      * py:pytype = 'str'
>>>
Holger
From faassen at infrae.com Fri Aug 11 10:59:08 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 10:59:08 +0200
Subject: [lxml-dev] News from the 2.5 front
In-Reply-To: <44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
References: <44D34AF3.70708@gkec.informatik.tu-darmstadt.de> <9cee7ab80608040630n5366170wdd1d101edb8fa300@mail.gmail.com>
<44D3571E.8080305@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC46DC.4090307@infrae.com>
Stefan Behnel wrote:
> Hi Fred,
>
> Fred Drake wrote:
>> On 8/4/06, Stefan Behnel wrote:
>>> The only problem I currently encounter is a bug in linecache in 2.5's
>>> stdlib that prevents the doctests from running. Once that's solved, we
>>> can see if those tests pass as well.
>> If there's really a bug in linecache, be sure to report it against Python
>> on SourceForge so we can get it dealt with.
>
> Oh, well. I did report it and then almost instantly got a TYOF back. The
> problem was: lxml used its own version of doctest.py, which was no longer
> compatible with 2.5. I always wondered where that came from and what it was
> good for. Should have asked long ago, I guess...
I'm not sure I remember anymore; possibly the doctest module that ships
with Python 2.3 was too outdated or didn't support some features that I
wanted. I might've taken it from Zope 3; not sure.
Regards,
Martijn
From faassen at infrae.com Fri Aug 11 11:04:04 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:04:04 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To:
References:
Message-ID: <44DC4804.903@infrae.com>
Luthiger Stoll Benno wrote:
> Hello
>
> I have this 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I install lxml using easy_install. I saw that this problem was discussed last month on this list.
> I scanned the mails addressing this issue, however, I could not find a solution.
> How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
>
A straightforward compile of Python will be 2 byte unicode, not 4 bytes.
Unfortunately most linux distributions ship with a 4 byte unicode
version of Python, and distutils/setuptools cannot distinguish between 4
bytes and 2 bytes unicode yet. We've passed this problem (which goes
beyond lxml) along to the setuptools developers, and they say "patches
welcome". :)
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 10:59:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 10:59:04 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To:
References:
Message-ID: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>
Holger Joukl wrote:
> Stefan Behnel wrote:
>>> def __str__(self):
>>> return str(self.pyval)
>>
>> Writing str() in that way would not work in all cases. Just look at None,
>> __str__() must always return a string. So, when None is returned as pyval,
>> should __str__() return "" or "None"? Depends, right? What about numbers?
>> Does 0 mean "0" or "False"?
>
> The NoneElement returns:
> def __str__(self):
> return "None"
>
> with a pyval:
> property pyval:
> def __get__(self):
> return None
I know, I've written that code not too long ago. ;)
I was just trying to say that the gain is relatively low, as there are only
few simple methods that can be provided and many use cases still have to
reimplement some of them. So I don't see a noticeable improvement.
>> BTW, I'm not opposed to integrating a date element class. As it looks, yours
>> is pretty far advanced by now, and it's even an external Python module.
>
> Yes, works like a charm. Note that it depends on external dateutil module,
> though.
Which is this, I assume:
http://labix.org/python-dateutil
OK, that's too bad. We can't rely on external modules for the lxml
distribution, at least not for something that's not strictly required for all
users.
> Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good
> thing imho). Is it possible to have a way to modify the text of the underlying
> cnode from within a custom ObjectifiedDataElement class, e.g. in _init()?
> I know this is possible when implementing this in pyrex, but for a
> pure-python implementation?
> The background is that for the ObjectifiedDatetime class I might optionally
> want to change the .text to ISO format.
Ah, good question. Not currently, I believe. But you're right, there might be
cases where it makes sense to update the text from a subclass...
Maybe adding a 'private' property '__text' might help here, or rather an
explicit setter function '__updateTextInPlace(self, text)' ?
Stefan
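
One thing that may be worth checking with the double-underscore spelling: in plain Python, a name like `__updateTextInPlace` is mangled per class, so a call written inside a subclass looks for a different attribute than the one the base class defines. The sketch below uses made-up pure-Python classes, not objectify itself, and the single-underscore variant is only an assumption about what would be easier for subclasses to call.

```python
class BaseDataElement:
    """Hypothetical base class with read-only text, for illustration only."""

    def __init__(self, text):
        self._text = text

    @property
    def text(self):
        # No setter: the text stays read-only from the outside.
        return self._text

    def __updateTextInPlace(self, text):
        # Stored on the class as _BaseDataElement__updateTextInPlace.
        self._text = text

    def _updateTextInPlace(self, text):
        # Single underscore: "for subclasses" by convention, not mangled.
        self._text = text


class IsoDateElement(BaseDataElement):
    def normalise_private(self):
        # Looked up as _IsoDateElement__updateTextInPlace, which does not
        # exist, so this raises AttributeError.
        self.__updateTextInPlace("2006-08-11")

    def normalise_protected(self):
        self._updateTextInPlace("2006-08-11")
```

With these classes, `IsoDateElement("11.08.2006").normalise_private()` fails with an AttributeError, while `normalise_protected()` updates the text and leaves `.text` read-only for other callers.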
From Holger.Joukl at LBBW.de Fri Aug 11 11:18:54 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 11:18:54 +0200
Subject: [lxml-dev] [objectify] writing custom DataElement subclasses
In-Reply-To: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>
Message-ID:
Stefan Behnel wrote on 11.08.2006 10:59:04:
> I was just trying to say that the gain is relatively low, as there are only
> few simple methods that can be provided and many use cases still have to
> reimplement some of them. So I don't see a noticeable improvement.
Probably the only gain would be for an objectify newbie, who wouldn't need to
think too much about implementing the .pyval, __str__, ._type stuff.
But I'd rather think about a doc patch then, just to document this a
little more extensively (after my holidays :-)
> >> BTW, I'm not opposed to integrating a date element class. As it looks,
> >> yours is pretty far advanced by now, and it's even an external Python
> >> module.
> >
> > Yes, works like a charm. Note that it depends on external dateutil module,
> > though.
>
> Which is this, I assume:
>
> http://labix.org/python-dateutil
Right.
> OK, that's too bad. We can't rely on external modules for the lxml
> distribution, at least not for something that's not strictly required for all
> users.
Maybe we can fall back to the standard datetime mechanisms if dateutil isn't
installed, but then TZ handling and parsing will be far more
restricted/error-prone.
Will think of that.
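
Such a fallback could be sketched roughly as follows. This is only an illustration written for a current Python (`datetime.fromisoformat` needs Python 3.7 or later); the function name `parse_timestamp` is made up, and the stdlib branch accepts ISO 8601 text only, which is exactly the restriction mentioned above.

```python
from datetime import datetime

try:
    # Preferred: dateutil's flexible parser, if it happens to be installed.
    from dateutil.parser import parse as _flexible_parse
except ImportError:
    _flexible_parse = None


def parse_timestamp(text):
    """Parse a timestamp, using dateutil when available."""
    if _flexible_parse is not None:
        return _flexible_parse(text)
    # Restricted fallback: the standard library handles ISO 8601 text only.
    return datetime.fromisoformat(text)


if __name__ == "__main__":
    print(parse_timestamp("2006-08-11T09:19:03"))
```

Input in other formats, or with time zone names, would still need dateutil or explicit handling.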
> > Btw.: ObjectifiedElement .text and .pyval are read-only (which is a good
> > thing imho). Is it possible to have a way to modify the text of the
> > underlying cnode from within a custom ObjectifiedDataElement class, e.g. in
> > _init()?
> > I know this is possible when implementing this in pyrex, but for a
> > pure-python implementation?
> > The background is that for the ObjectifiedDatetime class I might optionally
> > want to change the .text to ISO format.
>
> Ah, good question. Not currently, I believe. But you're right, there might be
> cases where it makes sense to update the text from a subclass...
>
> Maybe adding a 'private' property '__text' might help here, or rather an
> explicit setter function '__updateTextInPlace(self, text)' ?
Something like this would be nice. And I still think not letting the user change
the text from the outside is a good thing, at least as long as changing
the .text might result in an object type <-> text value mismatch.
Holger
From faassen at infrae.com Fri Aug 11 11:30:35 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:30:35 +0200
Subject: [lxml-dev] Request for comments: Removing lxml.etree's default
support for namespace class support
In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
<44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Message-ID: <44DC4E3B.6000602@infrae.com>
Stefan Behnel wrote:
[snip]
> So, the second proposal for custom class lookup:
>
> * if no custom classes are used, no configuration is needed
> * any support for custom classes should be registered at the parser level
+1 for per-parser custom class lookup. So far no objections to the first
mail either. :)
Regards,
Martijn
From faassen at infrae.com Fri Aug 11 11:37:04 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 11 Aug 2006 11:37:04 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44DC4804.903@infrae.com>
References:
<44DC4804.903@infrae.com>
Message-ID: <44DC4FC0.6000402@infrae.com>
Hey,
[Benno]
>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 byte unicode?
In order to write our patch to fix distutils/setuptools, we actually
need an answer to Benno's question. Is there a straightforward way to
find this out, in Python code? A brief glance through 'sys' didn't lead
to an answer. A quick google likewise didn't seem to lead to anything so
far.
Perhaps we need to resort to devious unicode string manipulation that
behaves differently depending on the amount of bytes your Python is
compiled with for unicode representation.. Or we could try asking
Fredrik Lundh :).
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 11:44:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 11:44:24 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44DC4FC0.6000402@infrae.com>
References: <44DC4804.903@infrae.com>
<44DC4FC0.6000402@infrae.com>
Message-ID: <44DC5178.4000007@gkec.informatik.tu-darmstadt.de>
Hi Martijn,
Martijn Faassen wrote:
> [Benno]
>>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 byte unicode?
>
> In order to write our patch to fix distutils/setuptools, we actually
> need an answer to Benno's question. Is there a straightforward way to
> find this out, in Python code? A brief glance through 'sys' didn't lead
> to an answer. A quick google likewise didn't seem to lead to anything so
> far.
>>> import sys; print sys.maxunicode
1114111
on my UCS4 system. UCS2 systems cannot return values above 65535.
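As a small sketch of how a build script might use this check: a narrow
(UCS2) build reports 0xFFFF (65535), a wide (UCS4) build reports 0x10FFFF
(1114111). The function name unicode_build is an assumption made up for this
example. Note that on Python 3.3 and later the value is always 1114111, so
the distinction only matters for older interpreters.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Return 'UCS2' for narrow builds and 'UCS4' for wide builds."""
    # Narrow builds cannot represent code points beyond the BMP directly.
    return "UCS4" if maxunicode > 0xFFFF else "UCS2"


print(unicode_build())
```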
Stefan
From Holger.Joukl at LBBW.de Fri Aug 11 11:52:09 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 11:52:09 +0200
Subject: [lxml-dev] [objectify] root Element <-> tree problem
In-Reply-To: <44DC46D8.20404@gkec.informatik.tu-darmstadt.de>
Message-ID:
Hi Stefan,
something is going wrong here:
>>> root = objectify.Element('root')
>>> root
>>> root.x = 1
>>> root.y = 2
>>> print root
root = None [ObjectifiedElement]
x = 1 [IntElement]
y = 2 [IntElement]
>>> root.getroottree().getroot()
>>> print root.getroottree().getroot()
root = None [ObjectifiedElement]
>>> root.getroottree()
>>> print root.getroottree().getroot().getroottree()
>>>
I'm not doing something wrong, am I?
Holger
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 12:11:10 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 12:11:10 +0200
Subject: [lxml-dev] [objectify] root Element <-> tree problem
In-Reply-To:
References:
Message-ID: <44DC57BE.1060300@gkec.informatik.tu-darmstadt.de>
Hi Holger,
Holger Joukl wrote:
> something is going wrong here:
>
>>>> root = objectify.Element('root')
>>>> root
>
>>>> root.x = 1
>>>> root.y = 2
>>>> print root
> root = None [ObjectifiedElement]
> x = 1 [IntElement]
> y = 2 [IntElement]
>>>> root.getroottree().getroot()
>
>>>> print root.getroottree().getroot()
> root = None [ObjectifiedElement]
>>>> root.getroottree()
>
>>>> print root.getroottree().getroot().getroottree()
>
>
> I'm not doing something wrong, am I?
Nope. Was a premature optimisation with side-effects in current SVN. I changed
objectify.Element() to always reuse the same document as the main use case is
to add these things to other documents anyway. Pretty bad idea. Now that you
said it, there are actually a lot of problems with it.
Just reverted.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 12:57:54 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 12:57:54 +0200
Subject: [lxml-dev] [objectify] DataElement function
In-Reply-To:
References:
Message-ID: <44DC62B2.6070709@gkec.informatik.tu-darmstadt.de>
Hi Holger,
Holger Joukl wrote:
> is it intentional/unavoidable that the element type returned
> by DataElement is always ObjectifiedElement
That's not intentional. It's the fast path in _lookupElementClass that strikes
here:
    if c_node.parent is NULL or not tree._isElement(c_node.parent):
        return ObjectifiedElement
    # if element has children => no data class
    if cetree.findChildForwards(c_node, 0) is not NULL:
        return ObjectifiedElement
Only after that, it checks the attributes of the element that determine the
element type.
There are two ways to change that.
* We could move the above code section behind the attribute tests
* I thought about adding a C level function for creating new elements anyway.
Something like that is already in etree, but it could be extended with an
argument for an explicit lookup function (or element class) and made public.
It's not as easy as it looks, though, as element objects are created in the
elementFactory function, which would have to be adapted as well...
Don't know if the second is really viable. The first is easier anyway...
Stefan
From Holger.Joukl at LBBW.de Fri Aug 11 13:16:21 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 11 Aug 2006 13:16:21 +0200
Subject: [lxml-dev] [objectify] DataElement function
In-Reply-To: <44DC62B2.6070709@gkec.informatik.tu-darmstadt.de>
Message-ID:
lxml-dev-bounces at codespeak.net wrote on 11.08.2006 12:57:54:
> Hi Holger,
>
> Holger Joukl wrote:
> > is it intentional/unavoidable that the element type returned
> > by DataElement is always ObjectifiedElement
>
> That's not intentional. It's the fast path in _lookupElementClass that
> strikes here:
>
>     if c_node.parent is NULL or not tree._isElement(c_node.parent):
>         return ObjectifiedElement
>
>     # if element has children => no data class
>     if cetree.findChildForwards(c_node, 0) is not NULL:
>         return ObjectifiedElement
>
> Only after that, it checks the attributes of the element that determine the
> element type.
>
> There are two ways to change that.
>
> * We could move the above code section behind the attribute tests
>
> * I thought about adding a C level function for creating new elements anyway.
> Something like that is already in etree, but it could be extended with an
> argument for an explicit lookup function (or element class) and made public.
> It's not as easy as it looks, though, as element objects are created in the
> elementFactory function, which would have to be adapted as well...
>
> Don't know if the second is really viable. The first is easier anyway...
If everything else works as is plus the mentioned thing works better, why
not go for the simpler solution?
It isn't a real problem at the moment as the DataElements I produce get
promptly inserted into a father element and then behave nicely, but...
Btw. Shouldn't the default Element class in _guessElementClass()
become StringElement, to make this
>>> root = objectify.fromstring("""""")
>>> print root
root = None [ObjectifiedElement]
s = None [ObjectifiedElement]
>>>
finally result into StringElements for empty leaf elements?
Holger
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 11 13:47:21 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 11 Aug 2006 13:47:21 +0200
Subject: [lxml-dev] [objectify] DataElement function
In-Reply-To:
References:
Message-ID: <44DC6E49.2020306@gkec.informatik.tu-darmstadt.de>
Hi Holger,
Holger Joukl wrote:
>> Holger Joukl wrote:
>>> is it intentional/unavoidable that the element type returned
>>> by DataElement is always ObjectifiedElement
>>
>> That's not intentional. It's the fast path in _lookupElementClass that
>> strikes here:
>>
>>     if c_node.parent is NULL or not tree._isElement(c_node.parent):
>>         return ObjectifiedElement
>>
>>     # if element has children => no data class
>>     if cetree.findChildForwards(c_node, 0) is not NULL:
>>         return ObjectifiedElement
>>
>> Only after that, it checks the attributes of the element that determine
>> the element type.
>>
>> * We could move the above code section behind the attribute tests
>
> If everything else works as is plus the mentioned thing works better, why
> not go for the simpler solution?
Yup, did that. I also fixed a couple of problems related to different data
types as I was at it. We don't currently have a way to check for the real
Python types from PyType registered types, only string parsing is supported.
However, the real types are passed to DataElement and must be treated
similarly. It works for the standard Python types for now and also for custom
types that provide a proper __str__() for conversion to XML data content.
> Btw. Shouldn't the default Element class in _guessElementClass()
> become StringElement, to make this
>
>>>> root = objectify.fromstring("""""")
>>>> print root
> root = None [ObjectifiedElement]
> s = None [ObjectifiedElement]
>
> finally result into StringElements for empty leaf elements?
It only looks wrong if you call the element "s", I guess... :)
But I changed it so that if the element has
* no type annotation and
* no children and
* no text content
then, if it
* has an element as parent it defaults to StringElement
* has no parent it defaults to ObjectifiedElement
I think that makes sense.
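To make the rule above concrete, here is a plain Python paraphrase of it.
This is not the actual Pyrex code in objectify; the function name and its
boolean parameters are assumptions made up for this sketch, and class names
are returned as strings so that it runs without lxml.

```python
def default_element_class(has_type_annotation, has_children, has_text,
                          has_element_parent):
    """Return the default class name, or None if the rule does not apply."""
    if has_type_annotation or has_children or has_text:
        # The element type is determined by other checks in this case.
        return None
    if has_element_parent:
        # Empty leaf element inside a tree.
        return "StringElement"
    # Empty element without a parent, e.g. a freshly created root.
    return "ObjectifiedElement"


print(default_element_class(False, False, False, True))
print(default_element_class(False, False, False, False))
```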
Stefan
From faassen at infrae.com Mon Aug 14 12:27:24 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Mon, 14 Aug 2006 12:27:24 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44DC5178.4000007@gkec.informatik.tu-darmstadt.de>
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com>
<44DC5178.4000007@gkec.informatik.tu-darmstadt.de>
Message-ID: <44E0500C.6080908@infrae.com>
Stefan Behnel wrote:
> Hi Martijn,
>
> Martijn Faassen wrote:
>> [Benno]
>>>> How can I test whether my python installation (Python 2.3.5) is compiled with 2 byte unicode?
>> In order to write our patch to fix distutils/setuptools, we actually
>> need an answer to Benno's question. Is there a straightforward way to
>> find this out, in Python code? A brief glance through 'sys' didn't lead
>> to an answer. A quick google likewise didn't seem to lead to anything so
>> far.
>
> >>> import sys; print sys.maxunicode
> 1114111
>
> on my UCS4 system. UCS2 systems cannot return values above 65535.
Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64-bit
systems can still increase as more unicode codepoints get added, but
looking for any value above 65535 should be a reliable way to
distinguish UCS2 from UCS4.
Regards,
Martijn
From fredrik at pythonware.com Mon Aug 14 13:06:43 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Mon, 14 Aug 2006 13:06:43 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de>
<44E0500C.6080908@infrae.com>
Message-ID:
Martijn Faassen wrote:
>> >>> import sys; print sys.maxunicode
>> 1114111
>>
>> on my UCS4 system. UCS2 systems cannot return values above 65535.
>
> Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64-bit
> systems can still increase as more unicode codepoints get added, but
> looking for any value above 65535 should be a reliable way to
> distinguish UCS2 from UCS4.
the 1114111 value isn't the number of assigned code points; it's the largest code
point that's ever (*) going to be used by Unicode.
*) "BMP plus sixteen supplemental planes should be enough for anybody"
From faassen at infrae.com Tue Aug 15 11:48:52 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 15 Aug 2006 11:48:52 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To:
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com>
Message-ID: <44E19884.9000007@infrae.com>
Fredrik Lundh wrote:
> Martijn Faassen wrote:
>
>>> >>> import sys; print sys.maxunicode
>>> 1114111
>>>
>>> on my UCS4 system. UCS2 systems cannot return values above 65535.
>> Ah, I'd missed that, thanks. I guess the sys.maxunicode on 64-bit
>> systems can still increase as more unicode codepoints get added, but
>> looking for any value above 65535 should be a reliable way to
>> distinguish UCS2 from UCS4.
>
> the 1114111 value isn't the number of assigned code points; it's the largest code
> point that's ever (*) going to be used by Unicode.
> *) "BMP plus sixteen supplemental planes should be enough for anybody"
Thanks for the info!
Don't know what BMP is, and I only have a vague idea of the planes (I'll
read the wikipedia article :), but using 4 bytes to store something that
could be stored in less than 3 seems like a waste. :) Oh well, I imagine
machines can deal better with 4 bytes, especially if they're 64 bits.
Anyway, we'll see whether we can come up with a patch that convinces
distutils to distinguish between the two.
Regards,
Martijn
From fredrik at pythonware.com Tue Aug 15 12:27:50 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 15 Aug 2006 12:27:50 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com>
<44E19884.9000007@infrae.com>
Message-ID:
Martijn Faassen wrote:
>> *) "BMP plus sixteen supplemental planes should be enough for anybody"
>
> Thanks for the info!
>
> Don't know what BMP is, and I only have a vague idea of the planes (I'll
> read the wikipedia article :)
start here:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters
From faassen at infrae.com Tue Aug 15 13:01:36 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 15 Aug 2006 13:01:36 +0200
Subject: [lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem
In-Reply-To: <44E19884.9000007@infrae.com>
References: <44DC4804.903@infrae.com> <44DC4FC0.6000402@infrae.com><44DC5178.4000007@gkec.informatik.tu-darmstadt.de> <44E0500C.6080908@infrae.com>
<44E19884.9000007@infrae.com>
Message-ID: <44E1A990.7070609@infrae.com>
Martijn Faassen wrote:
[snip]
> Oh well, I imagine
> machines can deal better with 4 bytes, especially if they're 64 bits.
Hah, silly, of course 32 bits is enough for 4 bytes. I knew that! :)
Regards,
Martijn
From jkrukoff at ltgc.com Thu Aug 17 01:31:33 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 16 Aug 2006 17:31:33 -0600
Subject: [lxml-dev] Request for comments: Removing lxml.etree's
default support for namespace class support
In-Reply-To: <44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
References: <44D498EC.50507@gkec.informatik.tu-darmstadt.de>
<44DC0E1D.5040301@gkec.informatik.tu-darmstadt.de>
Message-ID: <1155771094.11584.30.camel@localhost>
On Fri, 2006-08-11 at 06:57 +0200, Stefan Behnel wrote:
> I thought about this some more and found that having a per-parser setup as
> default would be pretty convenient and is an extremely small overhead compared
> to the default class lookup.
First off, thanks for getting 1.0.3 out so quickly. Really helped with
my problems, and replace has been a very handy convenience function.
I was actually just getting ready to ask you for per-parser custom class
support when I came across you already talking about implementing it.
It's actually an essential feature for me to be able to take advantage
of the custom element classes, as in my application (XML based
middleware) both the middleware layer and the applications all handle
XML and all exist in the same process. Right now, an application can
only change the default element class if it's very careful to make sure
to restore it so it doesn't screw up the middleware, and even that
solution is going to be impossible once the architecture goes
multi-threaded.
So, yeah, I'm pretty excited about getting this feature.
--
John Krukoff
Land Title Guarantee Company
From ashish.kulkarni at kalyptorisk.com Wed Aug 23 09:34:40 2006
From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni)
Date: Wed, 23 Aug 2006 13:04:40 +0530
Subject: [lxml-dev] Building dynamically-linked lxml on windows using mingw32
Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com>
Hello,
I've successfully used mingw32 to build lxml (dynamically linked). I was unable to get the static linking to work, because I was unable to get the VC++ 2003 Toolkit compiler and trying static linking with gcc gives lots of errors.
Step 1: Download and install Mingw from http://mingw.org.
Step 2: Start a command window and set the path to include MingW eg.
set path=%path%;C:\mingw\bin
Step 3: Download the win32 libs from ftp://xmlsoft.org/libxml2/win32. You will need
iconv-1.9.1.win32.zip
libxml2-2.6.23.win32.zip
libxslt-1.1.15.win32.zip
zlib-1.2.3.win32.zip
Step 4: Follow the instructions in doc/build.txt for extraction, but use the following setupStaticBuild function instead of the one mentioned:
def setupStaticBuild():
    "See doc/build.txt to make this work."
    cflags = [
        "-I..\\libxml2-2.6.23.win32\\include",
        "-I..\\libxslt-1.1.15.win32\\include",
        "-I..\\zlib-1.2.3.win32\\include",
        "-I..\\iconv-1.9.1.win32\\include"
        ]
    xslt_libs = [
        "..\\libxml2-2.6.23.win32\\bin\\libxml2.dll",
        "..\\libxslt-1.1.15.win32\\bin\\libxslt.dll",
        "..\\libxslt-1.1.15.win32\\bin\\libexslt.dll",
        "..\\iconv-1.9.1.win32\\bin\\iconv.dll",
        "..\\zlib-1.2.3.win32\\lib\\zlib.lib"
        ]
    result = (cflags, xslt_libs)
    return result
Yes, we ARE linking to the DLLs directly, as the export libraries are incomplete.
Step 5: Copy the 4 DLLs mentioned above to the src/lxml folder. Also, add this line towards the
end of the file, just below the "packages = ['lxml']," line:
package_data={'': ['*.dll']},
Step 6: To build the extension, use the following command:
python setup.py build --c=mingw32 --static bdist_wininst
You should have an installer which uses lxml dynamically linked to the above DLLs. The installer size is around 1344kB, which is almost the same size you get via static linking. (as a comparison, lxml-1.0.2.win32-static-py2.4.exe is around 1266kB).
Hope this helps,
Ashish
From faassen at infrae.com Wed Aug 23 16:44:34 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 23 Aug 2006 16:44:34 +0200
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms?
Message-ID: <44EC69D2.6080404@infrae.com>
Hey,
Compare:
http://cheeseshop.python.org/pypi/lxml/1.0.2
with
http://cheeseshop.python.org/pypi/lxml/1.0.3
and we see that 1.0.2 has support for lots of different platforms,
including the nice static windows build, but 1.0.3 has not.
In part this is my fault, as it appears I need to do various linux eggs,
but a couple of more egg donations from others would be appreciated!
The same story applies to 1.1 beta.
Regards,
Martijn
From ashish.kulkarni at kalyptorisk.com Thu Aug 24 07:24:03 2006
From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni)
Date: Thu, 24 Aug 2006 10:54:03 +0530
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various
platforms?
Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com>
Hello,
I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is not a static build, but the DLLs are included in the distribution (as per my previous mail).
http://puggy.symonds.net/~ashish/downloads/
Also, I couldn't build the lxml.objectify extension for 1.1beta: apparently there is no pyrex-generated C file in the source distribution. Thus the 1.1 beta builds have that extension disabled.
Hope this helps,
Ashish
From faassen at infrae.com Thu Aug 24 11:37:59 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu, 24 Aug 2006 11:37:59 +0200
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for
various platforms?
In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com>
References: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com>
Message-ID: <44ED7377.6050909@infrae.com>
Ashish Kulkarni wrote:
> I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is
> not a static build, but the DLLs are included in the distribution (as
> per my previous mail).
The experience to the end user is the same, I think, so this sounds good
too. :)
> http://puggy.symonds.net/~ashish/downloads/
>
> Also, I couldn't build the lxml.objectify extension for 1.1beta:
> apparently there is no pyrex-generated C file in the source
> distribution. Thus the 1.1 beta builds have that extension disabled.
Thanks!
It's useful to know we don't have a pyrex generated C file in the source
directory for the objectify stuff. I'll leave that to Stefan Behnel to
correct, as he's more familiar with the build procedure than I am.
Previously Steve Howe has been taking care of our windows builds, so I'm
still hoping he'll chip in versions for 1.0.3 (and possibly 1.1beta) for
the cheeseshop. If however he turns out to be busy, we'll be sure to get
back to you again. And for people on Windows who want to continue now,
your downloads are available. Thank you very much!
Regards,
Martijn
From jkrukoff at ltgc.com Thu Aug 24 14:04:56 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Thu, 24 Aug 2006 06:04:56 -0600
Subject: [lxml-dev] Replace/copy related segfault in lxml
Message-ID: <1156421097.17673.20.camel@localhost>
So, I've been making extensive use of lxml 1.0.3, and have come across
another crash bug. This one also appears to be related to subtree
replacement.
This is with libxml2 2.6.26, and I haven't tested with lxml 1.1 beta to
see if the bug is present there. There is a simple workaround, which
appears to be to avoid using the new replace function.
This is the error the attached test program gives me:
*** glibc detected *** double free or corruption (fasttop): 0x080daec8
***
However, minor differences in the location and amount of whitespace in
the input data change the crash, to errors such as this:
*** glibc detected *** corrupted double-linked list: 0x0813b9f8 ***
--
John Krukoff
Land Title Guarantee Company
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-replace.py
Type: text/x-python
Size: 520 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060824/fa120482/attachment.py
From jkrukoff at ltgc.com Thu Aug 24 15:28:55 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Thu, 24 Aug 2006 07:28:55 -0600
Subject: [lxml-dev] No extend method on elements?
Message-ID: <1156426135.17673.43.camel@localhost>
I know ElementTree doesn't support it, but is there any chance of
getting an extend method on Elements?
It's an awfully useful list function, and my first try for replacement
was:
[ element.append( new ) for new in otherelement ]
However, it looks like for large element lists, it's far faster to use
slice assignment:
element[ len( element ) : len( element ) ] = otherelement
which was not the most intuitive way to do things for me. It'd be nice
if -0 : -0 was a real slice...
Is this really the best way, or am I missing something obvious?
--
John Krukoff
Land Title Guarantee Company
From fredrik at pythonware.com Thu Aug 24 15:51:06 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 24 Aug 2006 15:51:06 +0200
Subject: [lxml-dev] No extend method on elements?
References: <1156426135.17673.43.camel@localhost>
Message-ID:
John Krukoff wrote:
>I know ElementTree doesn't support it, but is there any chance of
> getting an extend method on Elements?
ET 1.3 has an extend() method.
> element[ len( element ) : len( element ) ] = otherelement
shorter:
element[len(element):] = otherelement
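A runnable sketch of both variants follows. As an assumption for this
example it uses the standard library's xml.etree.ElementTree (which gained
extend() later on) so that it runs without lxml, and the helper name
extend_by_slice is made up. One difference to keep in mind: in lxml an
element has exactly one parent, so the children are moved out of the source
element, whereas the plain ElementTree implementation just shares the
references.

```python
import xml.etree.ElementTree as ET


def extend_by_slice(element, children):
    """Append all children at once using slice assignment."""
    # Assigning to the empty slice at the end appends every item.
    element[len(element):] = list(children)


source = ET.fromstring("<b><y/><z/></b>")

target = ET.fromstring("<a><x/></a>")
extend_by_slice(target, source)
tags_after_slice = [child.tag for child in target]

target2 = ET.fromstring("<a><x/></a>")
target2.extend(source)  # the list-like method asked for above
tags_after_extend = [child.tag for child in target2]

print(tags_after_slice, tags_after_extend)
```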
From faassen at infrae.com Thu Aug 24 16:39:00 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu, 24 Aug 2006 16:39:00 +0200
Subject: [lxml-dev] Replace/copy related segfault in lxml
In-Reply-To: <1156421097.17673.20.camel@localhost>
References: <1156421097.17673.20.camel@localhost>
Message-ID: <44EDBA04.9050509@infrae.com>
John Krukoff wrote:
> So, I've been making extensive use of lxml 1.0.3, and have come across
> another crash bug. This one also appears to be related to subtree
> replacement.
>
> This is with libxml2 2.6.26, and I haven't tested with lxml 1.1 beta to
> see if the bug is present there. There is a simple workaround, which
> appears to be to avoid using the new replace function.
>
> This is the error the attached test program gives me:
>
> *** glibc detected *** double free or corruption (fasttop): 0x080daec8
> ***
>
> However, minor differences in the location and amount of whitespace in
> the input data change the crash, to errors such as this:
>
> *** glibc detected *** corrupted double-linked list: 0x0813b9f8 ***
Hm, I'm on an ubuntu 6.06, python 2.4, libxml 2.6.24, lxml-1.0 branch
from svn, and so far I cannot reproduce your problem by running your script.
Trying the 1.0.3 release now, same platform, still cannot reproduce the
crash.
What platform are you on?
I can find a problem when I run this code using 'valgrind' to detect memory
errors - I get exuberant warnings now. Looks like you're on to
something.. valgrind doesn't report these warnings when the workaround
is enabled instead.
I'll try to look into this more deeply later.
Regards,
Martijn
From ashish.kulkarni at kalyptorisk.com Fri Aug 25 07:07:45 2006
From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni)
Date: Fri, 25 Aug 2006 10:37:45 +0530
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for
various platforms?
In-Reply-To: <44ED7377.6050909@infrae.com>
Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03E92D@mailserver.kalyptorisk.com>
Actually, now that lxml can be built with mingw32, one can do all the
builds on linux itself. All you have to do is to build a mingw32
cross-compiler.
http://www.mingw.org/MinGWiki/index.php/BuildMingwCross
I've heard that a lot of projects use this approach to build win32
releases. So the official builds can at least include the Mingw32
builds, until someone comes up with MSVC builds (which are almost always
a bit faster).
Hope this helps,
Ashish
-----Original Message-----
From: Martijn Faassen [mailto:faassen at infrae.com]
Sent: Thursday, August 24, 2006 3:08 PM
To: Ashish Kulkarni
Cc: lxml-dev at codespeak.net; howe at carcass.dhs.org
Subject: Re: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various
platforms?
Ashish Kulkarni wrote:
> I've built the 1.0.3 and 1.1beta installers/eggs using Mingw. It is
> not a static build, but the DLLs are included in the distribution (as
> per my previous mail).
The experience to the end user is the same, I think, so this sounds good
too. :)
> http://puggy.symonds.net/~ashish/downloads/
>
> Also, I couldn't build the lxml.objectify extension for 1.1beta:
> apparently there is no pyrex-generated C file in the source
> distribution. Thus the 1.1 beta builds have that extension disabled.
Thanks!
It's useful to know we don't have a pyrex generated C file in the source
directory for the objectify stuff. I'll leave that to Stefan Behnel to
correct, as he's more familiar with the build procedure than I am.
Previously Steve Howe has been taking care of our windows builds, so I'm
still hoping he'll chip in versions for 1.0.3 (and possibly 1.1beta) for
the cheeseshop. If however he turns out to be busy, we'll be sure to get
back to you again. And for people on Windows who want to continue now,
your downloads are available. Thank you very much!
Regards,
Martijn
From jkrukoff at ltgc.com Fri Aug 25 11:31:52 2006
From: jkrukoff at ltgc.com (John Krukoff)
Date: Fri, 25 Aug 2006 03:31:52 -0600
Subject: [lxml-dev] Replace/copy related segfault in lxml
Message-ID: <004801c6c829$4d100e30$051ea8c0@naomi>
>Hm, I'm on an ubuntu 6.06, python 2.4, libxml 2.6.24, lxml-1.0 branch
>from svn, and so far I cannot reproduce your problem by running your
>script.
>
>Trying the 1.0.3 release now, same platform, still cannot reproduce the
>crash.
>
>What platform are you on?
>
>I can find a problem when I run this code using 'valgrind' to detect memory
>errors - I get exuberant warnings now. Looks like you're on to
>something.. valgrind doesn't report these warnings when the workaround
>is enabled instead.
>
>I'll try to look into this more deeply later.
>
>Regards,
>
>Martijn
>
I'm on an up to date gentoo stable box, with fairly aggressive optimization
settings.
CFLAGS="-march=pentium4 -O3 -pipe -mfpmath=sse -fomit-frame-pointer"
To be exact.
The problem seems to be related to text node handling. I stripped the test
case down to the bare minimum for my box, but if you're having trouble
reproducing try to add more whitespace to the test data. Let me know if you
can't reproduce the segfault, and I'll try to get it to crash on one of our
redhat boxes.
---------
John Krukoff
jkrukoff at ltgc.com
From faassen at infrae.com Fri Aug 25 12:50:14 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri, 25 Aug 2006 12:50:14 +0200
Subject: [lxml-dev] Replace/copy related segfault in lxml
In-Reply-To: <004801c6c829$4d100e30$051ea8c0@naomi>
References: <004801c6c829$4d100e30$051ea8c0@naomi>
Message-ID: <44EED5E6.2050606@infrae.com>
John Krukoff wrote:
[snip]
>
> The problem seems to be related to text node handling. I stripped the test
> case down to the bare minimum for my box, but if you're having trouble
> reproducing it, try adding more whitespace to the test data. Let me know if you
> can't reproduce the segfault, and I'll try to get it to crash on one of our
> Red Hat boxes.
Sorry I wasn't clearer in my previous mail; I actually intended to
acknowledge your problem. Since valgrind complains, it's clear there is a
memory allocation problem somewhere; it just doesn't show up on some
platforms and/or with some compilation settings. Thankfully we have valgrind; I
only thought of using it halfway through writing the mail back to you. :)
So, to be clear: the problem is reproduced here and acknowledged, and we need
to work on a fix.
Regards,
Martijn
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 25 22:28:26 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 25 Aug 2006 22:28:26 +0200
Subject: [lxml-dev] Replace/copy related segfault in lxml
In-Reply-To: <1156421097.17673.20.camel@localhost>
References: <1156421097.17673.20.camel@localhost>
Message-ID: <44EF5D6A.3080807@gkec.informatik.tu-darmstadt.de>
Hi John,
John Krukoff wrote:
> So, I've been making extensive use of lxml 1.0.3, and have come across
> another crash bug. This one also appears to be related to subtree
> replacement.
Thanks for reporting this. It's a bug in the replace() method. The Python
document reference (and thus the document itself) can be freed before copying
the tail content from it. Here's a fix against the trunk that should also
apply to 1.0.3. Please test it.
Stefan
Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (Revision 31246)
+++ src/lxml/etree.pyx (Arbeitskopie)
@@ -797,9 +797,9 @@
c_new_node = new_element._c_node
c_new_next = c_new_node.next
tree.xmlReplaceNode(c_old_node, c_new_node)
- moveNodeToDocument(new_element, self._doc)
_moveTail(c_new_next, c_new_node)
_moveTail(c_old_next, c_old_node)
+ moveNodeToDocument(new_element, self._doc)
# PROPERTIES
property tag:
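
For readers following the patch above, here is a minimal sketch of the
element-level behaviour that the two _moveTail() calls preserve: tail text
travels with the element it follows, so after a replacement the new element
brings its own tail and the old element keeps its own. This is only an
illustration, not lxml's implementation. As an assumption made so that it runs
without lxml installed, it uses the standard library's xml.etree.ElementTree
and index assignment as a stand-in for lxml's root.replace(old, new); the
element names and texts are made up.

```python
import xml.etree.ElementTree as ET

# Tail text ("old tail", "new tail") is stored on the element it follows.
root = ET.XML("<root><old/>old tail<keep/></root>")
donor = ET.XML("<donor><new/>new tail</donor>")

old = root[0]
new = donor[0]

# Stand-in for lxml's root.replace(old, new). The standard library has no
# replace() method and does not unlink `new` from its previous parent,
# so both steps are done by hand here.
donor.remove(new)
root[0] = new

# Each element still carries its own tail after the swap.
tails = (root[0].tail, old.tail)
```

Note that this sketch cannot reproduce the crash itself: the bug was in the
order in which lxml's C-level code released the old document, which has no
counterpart in the pure element API shown here.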
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Aug 25 23:01:19 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 25 Aug 2006 23:01:19 +0200
Subject: [lxml-dev] No extend method on elements?
In-Reply-To:
References: <1156426135.17673.43.camel@localhost>
Message-ID: <44EF651F.7020106@gkec.informatik.tu-darmstadt.de>
Fredrik Lundh wrote:
> John Krukoff wrote:
>
>> I know ElementTree doesn't support it, but is there any chance of
>> getting an extend method on Elements?
>
> ET 1.3 has an extend() method.
That's good to know. Then I guess lxml 1.1 should have one, too.
>> element[ len( element ) : len( element ) ] = otherelement
>
> shorter:
>
> element[len(element):] = otherelement
That's the "obvious" way of implementing it. So here's a quick and small patch
against the trunk that adds the function to etree. Something like this will
make it into 1.1.
Stefan
Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (Revision 31661)
+++ src/lxml/etree.pyx (Arbeitskopie)
@@ -725,6 +725,11 @@
# parent element has moved; change them too..
moveNodeToDocument(element, self._doc)
+ def extend(self, elements):
+ """Extends the current children by the elements in the iterable.
+ """
+ self[python.PY_SSIZE_T_MAX:python.PY_SSIZE_T_MAX] = elements
+
def clear(self):
"""Resets an element. This function removes all subelements,
clears all attributes and sets the text and tail
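
As a small illustration of the equivalence discussed in this thread, the
sketch below appends children once by slice assignment at the end of the
element and once with extend(), then compares the results. It is written
against the standard library's xml.etree.ElementTree (an assumption, so that
it runs without lxml); the helper name extend_by_slice and the tag names are
hypothetical.

```python
import xml.etree.ElementTree as ET


def extend_by_slice(parent, elements):
    """Append elements by assigning to the empty slice at the end of parent."""
    parent[len(parent):] = list(elements)


root_a = ET.XML("<root><a/></root>")
root_b = ET.XML("<root><a/></root>")

# Separate element objects for each tree, so no element has two parents.
extend_by_slice(root_a, [ET.Element("b"), ET.Element("c")])
root_b.extend([ET.Element("b"), ET.Element("c")])

tags_a = [child.tag for child in root_a]
tags_b = [child.tag for child in root_b]
```

Both trees end up with the same children in the same order, which is why the
patch can implement extend() as a one-line slice assignment.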
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 07:57:30 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 30 Aug 2006 07:57:30 +0200
Subject: [lxml-dev] Building dynamically-linked lxml on windows using mingw32
In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com>
References: <2AB7346A3227A74BB97F9A0D79E3E65A03E87B@mailserver.kalyptorisk.com>
Message-ID: <44F528C9.4080502@gkec.informatik.tu-darmstadt.de>
Hi Ashish,
Ashish Kulkarni wrote:
> I've successfully used mingw32 to build lxml (dynamically linked).
Thanks for sharing your experience. It's always helpful to have this kind of
info archived on the list so that others can find it.
> I was
> unable to get the static linking to work, because I was unable to get the
> VC++ 2003 Toolkit compiler and trying static linking with gcc gives lots of
> errors.
That would be the expected behaviour, I guess. Even using newer MS compilers
with the VC-2003 compiled Python interpreter does not work, from what I've
heard. That's been discussed on python-dev for some other extensions a while
ago. Don't remember the result, though...
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 08:03:27 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 30 Aug 2006 08:03:27 +0200
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms?
In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com>
References: <2AB7346A3227A74BB97F9A0D79E3E65A03E8D6@mailserver.kalyptorisk.com>
Message-ID: <44F52A2F.9070902@gkec.informatik.tu-darmstadt.de>
Hi Ashish,
Ashish Kulkarni wrote:
> I couldn't build the lxml.objectify extension for 1.1beta: apparently
> there is no pyrex-generated C file in the source distribution.
Right, my fault. It's fixed now (on the trunk), just needed an additional
"objectify.c" entry in the MANIFEST.in file.
You can build the file yourself if you install the patched Pyrex version as
described in build.txt.
> Thus the 1.1 beta builds have that extension disabled.
That's ok, 1.1 final (and 1.0.4) will be out pretty soon, so it's enough if we
have that working by then.
Stefan
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Aug 30 08:09:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 30 Aug 2006 08:09:04 +0200
Subject: [lxml-dev] lxml 1.0.3 and lxml 1.1beta builds for various platforms?
In-Reply-To: <44EC69D2.6080404@infrae.com>
References: <44EC69D2.6080404@infrae.com>
Message-ID: <44F52B80.9080500@gkec.informatik.tu-darmstadt.de>
Hi,
Martijn Faassen wrote:
> we see that 1.0.2 has support for lots of different platforms,
> including the nice static windows build, but 1.0.3 has not.
It's summer holiday time, I guess that's the reason.
Since there was a crash bug in 1.0.3, I'll release a 1.0.4 soon, so it's not
too much of a problem if eggs are missing for 1.0.3. But since I then really,
/really/ hope that that'll finally be the last 1.0 release necessary, I'll be
as happy as Martijn to see egg contributions.
Stefan