From hjh at alterras.de Sun Sep 3 19:07:07 2006
From: hjh at alterras.de (Hans-Jürgen Hay)
Date: Sun, 03 Sep 2006 19:07:07 +0200
Subject: [lxml-dev] Pure Text Elements as XSLT results impossible?
Message-ID: <1157303227.14688.44.camel@hera.local>

XSLT transforms that should result in a text-only element silently
result in some kind of 'None' result class, with the text seemingly lost.
Probably caused by text elements being second class in ElementTree.

This seems to be very wrong, or am I?

Regards
Hans

from lxml import etree
xsl = """ TestText """
trans = etree.XSLT(etree.XML(xsl))
str(etree.tostring(trans(etree.XML(''))))

--->> 'None' instead of 'TestText'
----------------------------------------------------------------------
from lxml import etree
xsl = """ TestText """
trans = etree.XSLT(etree.XML(xsl))
etree.tostring(trans(etree.XML('')))

-->> ' TestText' (works)
----------------------------------------------------------------------
from lxml import etree
xsl = """ TestText """
trans = etree.XSLT(etree.XML(xsl))
etree.tostring(trans(etree.XML('')))

--->> '' ('TestText' in front missing)

From Holger.Joukl at LBBW.de Mon Sep 4 12:07:06 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Mon, 4 Sep 2006 12:07:06 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To: <44F52B80.9080500@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi,

depending on how one accesses objectified elements there can
be differences in the resulting element type:

>>> root = objectify.Element('root')
>>> sub = objectify.Element('root')
>>> root.sub = sub
>>> root.sub.x = 1
>>> del root.sub.x
>>> print root
root = None [ObjectifiedElement]
    sub = '' [StringElement]

This yields a StringElement root.sub because root.sub has no element
content: it has a parent element but no children.
Whereas

>>> root = objectify.Element('root')
>>> sub = objectify.Element('root')
>>> root.sub = sub
>>> root.sub.x = 1
>>> print root
root = None [ObjectifiedElement]
    sub = None [ObjectifiedElement]
        x = 1 [IntElement]
>>> del root.sub.x
>>> print root
root = None [ObjectifiedElement]
    sub = None [ObjectifiedElement]
>>>

yields an ObjectifiedElement root.sub because I already accessed root.sub
before deleting its child x, thus making it an ObjectifiedElement in the
etree node proxy because at that time it had children.

I'm not sure how to address this problem. For my use case it is desirable
for
- empty content leaf elements to be StringElements, just like it is today:
  E.g. when parsing from XML something like '' then s should be a
  StringElement (empty string, leaf node).
  Also, assigning an empty string in objectify should end up in a
  StringElement:
  >>> root.s = ''
  >>> print root
  root = None [ObjectifiedElement]
      s = '' [StringElement]
  >>>
- a "structural" element (this is what I use ObjectifiedElements for - they
  are supposed to potentially have children) to remain like it is even if
  its children get deleted

The problem also manifests in this use case:
>>> root = objectify.Element('root')
>>> root.sub = objectify.Element('whatever')
>>> print root
root = None [ObjectifiedElement]
    sub = '' [StringElement]
>>>
where I would rather have root.sub to be an ObjectifiedElement.
And I'm also the one to blame for the current behaviour because I proposed
parts of the class lookup order to Stefan :-)

Some thoughts:
- maybe disallow DataElements to have children, i.e. disabling __setattr__
  and the like for DataElements? Then ObjectifiedElements would need to have
  an accessible (string) pyvalue in contrast to current behaviour
- maybe change the time an object is actually registered in the node proxy?
- add an additional "structural" element class that is basically just an
  ObjectifiedElement but has an artificial pytype to make it retain its type
  and can be produced by a factory similar to objectify.DataElement?
- just not care about a StringElement acting as a structural element as it
  can currently have children too (though it supports the string API parts on
  top of the ObjectifiedElement basic API)?

Greetings,
Holger

The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error,
please notify the sender immediately and then delete this e-mail. Any
unauthorized copying and transmission is forbidden. E-Mail transmission
cannot be guaranteed to be secure. If verification is required, please
request a hard copy version.

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Sep 5 15:54:54 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 05 Sep 2006 15:54:54 +0200
Subject: [lxml-dev] Pure Text Elements as XSLT results impossible?
In-Reply-To: <1157303227.14688.44.camel@hera.local>
References: <1157303227.14688.44.camel@hera.local>
Message-ID: <44FD81AE.4060800@gkec.informatik.tu-darmstadt.de>

Hi Hans-Jürgen,

Hans-Jürgen Hay wrote:
> XSLT transforms that should result in a TEXT only Element silently
> result in some kind of 'None' result class with the text seemingly lost.
> Probably caused by Text Elements being second class in ElementTree.
>
> from lxml import etree
> xsl = """
>
>
>
> TestText
>
> """
> trans = etree.XSLT(etree.XML(xsl))
> str(etree.tostring(trans(etree.XML(''))))
>
> --->> 'None' instead of 'TestText'
> ----------------------------------------------------------------------
> from lxml import etree
> xsl = """
>
>
> TestText
>
> """
> trans = etree.XSLT(etree.XML(xsl))
> etree.tostring(trans(etree.XML('')))
>
> -->> ' TestText' (works)
> ----------------------------------------------------------------------
> from lxml import etree
> xsl = """
>
>
> TestText
>
> """
> trans = etree.XSLT(etree.XML(xsl))
> etree.tostring(trans(etree.XML('')))
>
> --->> '' ('TestText' in front missing)

Looks like you forgot to test

  str(trans(...))

which is the usage suggested by the documentation.

Stefan

From pawel at praterm.com.pl Tue Sep 5 17:26:15 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Tue, 05 Sep 2006 17:26:15 +0200
Subject: [lxml-dev] File name encoding problems
Message-ID: <44FD9717.30802@praterm.com.pl>

Hi!

I'm not sure if it's really an lxml problem, but it looks like it...
I'm using lxml version 1.0.3 (from the Debian package). My default system
encoding is ISO-8859-2. I have a simple program:

#!/usr/bin/python
# -*- coding: iso-8859-2 -*-

import lxml.etree
d = lxml.etree.parse('/tmp/?.xml')

The problem is with the letter '?' (or any other non-ASCII letter) in the
file name. When running the program I get something like this:

Traceback (most recent call last):
  File "./p.py", line 6, in ?
    d = lxml.etree.parse('/tmp/??d?.xml')
  File "etree.pyx", line 1615, in etree.parse
  File "parser.pxi", line 687, in etree._parseDocument
  File "apihelpers.pxi", line 343, in etree._utf8
AssertionError: All strings must be Unicode or ASCII

If I try to write the path using UTF-8 encoding, the file cannot be found
(because the path with the UTF-8 encoded name does not exist).

This did not happen with python 2.3 - the problem is only with python2.4.
There's a workaround - reading the file into a StringIO object and then
parsing the XML from that object. It works fine but it's silly.

I would be very grateful for any suggestions.

Paweł Pałucha

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Sep 5 21:48:32 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 05 Sep 2006 21:48:32 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To:
References:
Message-ID: <44FDD490.9000303@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> depending on how one accesses objectified elements there can
> be differences in the resulting element type:
>
>>>> root = objectify.Element('root')
>>>> sub = objectify.Element('root')
>>>> root.sub = sub
>>>> root.sub.x = 1
>>>> del root.sub.x
>>>> print root
> root = None [ObjectifiedElement]
>     sub = '' [StringElement]
>
> This yields a StringElement root.sub because root.sub has no element
> contents, does have a parent element but not any children.
> Whereas
>
>>>> root = objectify.Element('root')
>>>> sub = objectify.Element('root')
>>>> root.sub = sub
>>>> root.sub.x = 1
>>>> print root
> root = None [ObjectifiedElement]
>     sub = None [ObjectifiedElement]
>         x = 1 [IntElement]
>>>> del root.sub.x
>>>> print root
> root = None [ObjectifiedElement]
>     sub = None [ObjectifiedElement]
>
> yields an ObjectifiedElement root.sub because I already accessed root.sub
> before deleting its child x, thus making it an ObjectifiedElement in the
> etree node proxy because at that time it had children.

It's even worse:

  >>> print sub
  root = None [ObjectifiedElement]
  >>> root.sub = sub
  >>> print root
  root = None [ObjectifiedElement]
      sub = '' [StringElement]
  >>> root.sub.x = 1
  >>> print root
  root = None [ObjectifiedElement]
      sub = '' [StringElement]
          x = 1 [IntElement]

This is pretty wrong.
The thing that bothers me is that there should not actually be a permanent
Python reference to root.sub, which would normally mean that the object
should get recreated each time it is accessed. But as the last command
shows, that is not the case.

> I'm not sure how to address this problem. For my use case it is desirable
> for
[snip]
> - a "structural" element (this is what I use ObjectifiedElements for - they
> are supposed to potentially have children) to remain like it is even if
> its children get deleted

That's only a problem if you access the Python reference of the child itself
afterwards, which you normally wouldn't if it's a pure structural element.

> The problem also manifests in this use case:
>>>> root = objectify.Element('root')
>>>> root.sub = objectify.Element('whatever')
>>>> print root
> root = None [ObjectifiedElement]
>     sub = '' [StringElement]
> where I would rather have root.sub to be an ObjectifiedElement.

Sure, but I'd figure that's a rare use case anyway. And if you need it, there
are enough ways to get around it, from parsing to ObjectPath.

> Some thoughts:
> - maybe disallow DataElements to have children, i.e. disabling __setattr__
> and alike for DataElements? Then ObjectifiedElements would need to have
> an accessible (string) pyvalue in contrast to current behaviour

Not a good idea. In that case, things like this would potentially stop working:

  >>> root = objectify.Element('root')
  >>> root.sub = objectify.Element('whatever')
  >>> root.sub.sub = ...

Reason: as it stands now, root.sub would become a StringElement, which would
not accept any children.

> - maybe change the time an object is actually registered in the node proxy?

It's difficult to avoid instantiating element objects when setting and
modifying content. The main reason is that if we don't have a proxy, we have
to clean up the element ourselves, which means code duplication and/or a
tighter code coupling between etree and objectify.
> - add an additional "structural" element class that is basically just an
> ObjectifiedElement but has an artificial pytype to make it retain its type
> and can be produced by a factory similar to objectify.DataElement?

Hmm, we could potentially allow "ObjectifiedElement" as pytype, though I'd
prefer waiting for a really good reason to do that.

> - just not care about a StringElement acting as a structural element as it
> can currently have children too (though it supports the string API parts on
> top of the ObjectifiedElement basic API)?

That leads to the problem I pointed out at the top. What is your actual
reasoning for requiring that empty leaf elements should be StringElements? I
mean, you could always make them StringElements explicitly by setting

  >>> root.a.b.c.d = ''

and you can always explicitly access their string value with ".text".

If we removed that special case, leaf elements that contain strings would
always be StringElements and empty leaves and internal elements would always
be ObjectifiedElements.

That would not change the fact that elements keep their type as long as there
is a Python reference to them, but it would work in a few more cases than it
does now.

Stefan

From Holger.Joukl at LBBW.de Wed Sep 6 09:44:34 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Wed, 6 Sep 2006 09:44:34 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To: <44FDD490.9000303@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi Stefan,

lxml-dev-bounces at codespeak.net wrote on 05.09.2006 21:48:32:
>
> It's even worse:
>
> >>> print sub
> root = None [ObjectifiedElement]
> >>> root.sub = sub
> >>> print root
> root = None [ObjectifiedElement]
>     sub = '' [StringElement]
> >>> root.sub.x = 1
> >>> print root
> root = None [ObjectifiedElement]
>     sub = '' [StringElement]
>         x = 1 [IntElement]
>
> This is pretty wrong.
> The thing that bothers me is that there should not
> actually be a permanent Python reference to root.sub, which would normally
> mean that the object should get recreated each time it is accessed. But as the
> last command shows, that is not the case.
>
> [...]
>
> > The problem also manifests in this use case:
> >>>> root = objectify.Element('root')
> >>>> root.sub = objectify.Element('whatever')
> >>>> print root
> > root = None [ObjectifiedElement]
> >     sub = '' [StringElement]
> > where I would rather have root.sub to be an ObjectifiedElement.
>
> Sure, but I'd figure that's a rare use case anyway. And if you need it, there
> are enough ways to get around it, from parsing to ObjectPath.

This was just another way to describe the behaviour you pointed out above.
I want it to be an ObjectifiedElement because I know I'll put children in it
later.

> > Some thoughts:
> > - maybe disallow DataElements to have children, i.e. disabling __setattr__
> > and alike for DataElements? Then ObjectifiedElements would need to have
> > an accessible (string) pyvalue in contrast to current behaviour
>
> Not a good idea. In that case, things like this would potentially
> stop working:
>
> >>> root = objectify.Element('root')
> >>> root.sub = objectify.Element('whatever')
> >>> root.sub.sub = ...
>
> Reason: as it stands now, root.sub would become a StringElement, which would
> not accept any children.
>
>
> > - maybe change the time an object is actually registered in the node proxy?
>
> It's difficult to avoid instantiating element objects when setting and
> modifying content. The main reason is that if we don't have a proxy, we have
> to clean up the element ourselves, which means code duplication and/or a
> tighter code coupling between etree and objectify.
>
>
> > - add an additional "structural" element class that is basically just an
> > ObjectifiedElement but has an artificial pytype to make it retain its type
> > and can be produced by a factory similar to objectify.DataElement?
>
> Hmm, we could potentially allow "ObjectifiedElement" as pytype, though I'd
> prefer waiting for a really good reason to do that.

I know, it's not nice. But right now I can't think of another way to force
a leaf to be an ObjectifiedElement.

> > - just not care about a StringElement acting as a structural element as it
> > can currently have children too (though it supports the string API parts on
> > top of the ObjectifiedElement basic API)?
>
> That leads to the problem I pointed out at the top. What is your actual
> reasoning for requiring that empty leaf elements should be StringElements? I
> mean, you could always make them StringElements explicitly by setting
>
> >>> root.a.b.c.d = ''

Wouldn't this end up in d being an ObjectifiedElement if the logic (empty
leaves are StringElements) changed?

> and you can always explicitly access their String value with ".text".
>
> If we removed that special case, leaf elements that contain strings would
> always be StringElements and empty leaves and internal elements would always
> be ObjectifiedElements.
>
> That would not change the fact that elements keep their type as long as there
> is a Python reference to them, but it would work in a few more cases than it
> does now.
>

When parsing from XML I need 'some string' to behave like ''. For someone
processing the data "s" should always act like a (possibly empty) string.
Your solution would only work for me if ObjectifiedElement got a .pyval
attribute, too, and its .text was not None but rather '' if no text content
is in the node, and probably also needed the String API parts.
Much of this stems from the fact that ElementTree's elt.text returns None if
there is no element text instead of '' (but I guess this won't change :-)

Holger

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 6 11:27:43 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 06 Sep 2006 11:27:43 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To:
References:
Message-ID: <44FE948F.7030702@gkec.informatik.tu-darmstadt.de>

Holger Joukl wrote:
> Stefan Behnel wrote:
>> That leads to the problem I pointed out at the top. What is your actual
>> reasoning for requiring that empty leaf elements should be
>> StringElements? I mean, you could always make them StringElements
>> explicitly by setting
>>
>> >>> root.a.b.c.d = ''
>
> Wouldn't this end up in d being an ObjectifiedElement if the logic (empty
> leaves are StringElements) changed?

No. The value is an empty string, not an empty value. So there is text content
in there, it's just of length zero.

> When parsing from XML I need 'some string' to behave
> like
> ''.
> For someone processing the data "s" should always
> act like a (possibly empty) string.
> Your solution would only work for me if ObjectifiedElement got a .pyval
> attribute, too, and its .text was not None but rather '' if no text
> content is in the node, and probably also needed the String API parts.
> Much of this stems from the fact the ElementTree elt.text returns None if
> there is no element text instead of '' (but I guess this won't change :-)

It returns '' if the value is '' and it returns None if there is no value.
That already changed to adapt to ET's own behaviour. The parser sees ""
and "" as not having a value. So you will never get an empty string
back from a parsed tree. However, if you set it to '', lxml will continue to
return an empty string and objectify will determine that it is a
StringElement.

Maybe you could get by with wrapper functions that add the '' for leaves where
required?

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 6 11:37:34 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 06 Sep 2006 11:37:34 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <44FD9717.30802@praterm.com.pl>
References: <44FD9717.30802@praterm.com.pl>
Message-ID: <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de>

Hi Paweł,

first thing to note: lxml uses UTF-8 internally, also for filenames, as
libxml2 requires a char sequence for their representation. If your system
can't handle that, we'll have to figure out a way to make it work.

This part of lxml is not much tested, so it would be nice if you could help us
in getting this straight.

Paweł Pałucha wrote:
> I'm using lxml version 1.0.3 (from Debian package). My default system
> encoding is ISO-8859-2. I have a simple program:
>
> #!/usr/bin/python
> # -*- coding: iso-8859-2 -*-
>
> import lxml.etree
> d = lxml.etree.parse('/tmp/?.xml')

Ok, so you're using 8-bit encoded filenames.
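
The workaround raised in this thread, opening the file yourself and handing the parser a file object instead of a path, can be sketched as follows. This is only a minimal sketch under assumptions: it uses the standard library's xml.etree.ElementTree so that it runs without lxml installed (lxml.etree.parse() likewise accepts an open file object), and both the helper name parse_from_open_file() and the demo file name are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_from_open_file(path):
    """Parse XML by handing the parser an open file object, not the path."""
    # Binary mode: the parser decodes the document itself, so no text
    # decoding happens on the Python side.
    with open(path, "rb") as f:
        return ET.parse(f)


# Hypothetical demo file with a non-ASCII name in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "za\u017c\u00f3\u0142\u0107.xml")
    with open(path, "wb") as f:
        f.write(b"<root><child/></root>")
    tree = parse_from_open_file(path)

root_tag = tree.getroot().tag
child_count = len(tree.getroot())
```

With this approach only Python's own open() ever sees the file name, so the parser's file name handling is not involved at all.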
> Traceback (most recent call last):
>   File "./p.py", line 6, in ?
>     d = lxml.etree.parse('/tmp/??d?.xml')
>   File "etree.pyx", line 1615, in etree.parse
>   File "parser.pxi", line 687, in etree._parseDocument
>   File "apihelpers.pxi", line 343, in etree._utf8
> AssertionError: All strings must be Unicode or ASCII

Right, I guess that treating the filename with the _utf8() function is not the
right thing to do for 8-bit strings. We should have a separate way of treating
filenames. I'll look into it.

> If I try to write path using UTF-8 encoding, the file cannot be found
> (because the path with UTF-8 encoded name does not exist).

I assume you get the same file-not-found error if you pass the filename as
a unicode string?

  d = lxml.etree.parse(u'/tmp/?.xml')

> This did not happen with python 2.3 - the problem is only with
> python2.4. There's a workaround - reading file to StringIO object and
> then parsing XML from that object. It works fine but it's silly.

You can also pass an opened file object.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 6 13:09:34 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 06 Sep 2006 13:09:34 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de>
Message-ID: <44FEAC6E.1080604@gkec.informatik.tu-darmstadt.de>

Hi Paweł,

Stefan Behnel wrote:
> first thing to note: lxml uses UTF-8 internally, also for filenames, as
> libxml2 requires a char sequence for their representation. If your system
> can't handle that, we'll have to figure out a way to make it work.
>
> I guess that treating the filename with the _utf8() function is not the
> right thing to do for 8-bit strings. We should have a separate way of treating
> filenames. I'll look into it.

Here's a patch that might fix your problem. However, it's against the current trunk (i.e.
1.1 beta), as fixing this problem requires a behavioural change that will not
make it into 1.0. The web page has information on how to build lxml on Linux,
it's pretty easy.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: filename_encoding.patch
Type: text/x-patch
Size: 11597 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060906/f43db195/attachment.bin

From Holger.Joukl at LBBW.de Wed Sep 6 13:25:08 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Wed, 6 Sep 2006 13:25:08 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To: <44FE948F.7030702@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote on 06.09.2006 11:27:43:

> It returns '' if the value is '' and it returns None if there is no value.
> That already changed to adapt to ET's own behaviour. The parser sees ""
> and "" as not having a value. So you will never get an empty string
> back from a parsed tree. However, if you set it to '', lxml will continue to
> return an empty string and objectify will determine that it is a
> StringElement.
>
> Maybe you could get by with wrapper functions that add the '' for leafs where
> required?
>

Hm, I could of course "stringify" all empty leaves after parsing, given
that my users aren't accessing the etree/objectify APIs, e.g. fromstring(),
directly. But I'd have to iterate over the whole tree for this.

BTW: What do you think about adding .encode(...) to StringElement?

Something we've discussed before: Would it make sense to allow an
ObjectifiedElement instance to change its element.text internally,
e.g. in its _init() method?
Or do you think it is better to stay explicit, loop over the tree and
replace elements as needed?
My use case is the DatetimeElement class I'm using, where I will probably
want to change the text to ISO format datetime.

Holger

From oliver.schoenborn at utoronto.ca Wed Sep 6 17:00:49 2006
From: oliver.schoenborn at utoronto.ca (Oliver Schoenborn)
Date: Wed, 06 Sep 2006 11:00:49 -0400
Subject: [lxml-dev] help: lxml causes wxpython app to crash at exit
Message-ID: <44FEE2A1.60003@utoronto.ca>

Hi, just wondering if anyone else has had the same problem and what solution
they found:

   1. system is RHEL 4, with python 2.3, and latest wxpython from rpm on
      wxpython website; python and wxpython work fine (demo etc)
   2. latest lxml (http://codespeak.net/lxml/) 1.0.3 installed (with
      latest libxml2 and libxslt built from source) and also built lxml
      from source (with latest pyrex etc)
   3. when I open pyalamode (a wx app with a python shell window), then
      do "from lxml import etree" then exit, a seg fault occurs; the
      core dump says it happens during finalization of the python
      interpreter, possibly when terminating some thread.
   4. same thing happens from my own wx app (if I comment out the line that
      says "from lxml import etree" there is no segfault after exit)
   5. I can use lxml for validation etc from within the wx app, all is
      fine; the problem is *only* when exiting the wx app
   6.
      this segfault does NOT happen on Windows,
   7. it does NOT happen if I do the above from within python shell in xterm
   8. the problem does NOT happen when just do "import lxml", lxml.etree
      is the problem; etree is a binding to some C code in lxml/etree.c;
      this file is generated by pyrex

Upgrading to python 2.4 will not be easy so I want to make sure I
exhaust other solutions first.

Thanks,
Oliver

From pawel at praterm.com.pl Wed Sep 6 22:50:23 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Wed, 06 Sep 2006 22:50:23 +0200
Subject: [lxml-dev] Building debian packages
Message-ID: <44FF348F.6070605@praterm.com.pl>

Hi,

I've got a small patch against current SVN to file doc/build.txt:
* info about changing md5sum in *.dsc file
* command `dpkg -x ...` is incorrect - it should be `dpkg-source -x ...`

Paweł
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: build-deb.patch
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20060906/a858d812/attachment.diff

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 6 21:21:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 06 Sep 2006 21:21:24 +0200
Subject: [lxml-dev] help: lxml causes wxpython app to crash at exit
In-Reply-To: <44FEE2A1.60003@utoronto.ca>
References: <44FEE2A1.60003@utoronto.ca>
Message-ID: <44FF1FB3.3010305@gkec.informatik.tu-darmstadt.de>

Hi Oliver,

Oliver Schoenborn wrote:
> Hi, just wondering if anyone else has had same problem and what solution
> they found:
>
>    1. system is RHEL 4, with python 2.3, and latest wxpython from rpm on
>       wxpython website; python and wxpython work fine (demo etc)
>    2. latest lxml (http://codespeak.net/lxml/) 1.0.3 installed (with
>       latest libxml2 and libxslt built from source) and also built lxml
>       from source (with latest pyrex etc)
>    3.
>       when I open pyalamode (a wx app with a python shell window), then
>       do "from lxml import etree" then exit, a seg fault occurs; the
>       core dump says it happens during finalization of the python
>       interpreter, possibly when terminating some thread.
>    4. same thing happens from my own wx app (if I comment out the line that
>       says "from lxml import etree" there is no segfault after exit)
>    5. I can use lxml for validation etc from within the wx app, all is
>       fine; the problem is *only* when exiting the wx app
>    6. this segfault does NOT happen on Windows,
>    7. it does NOT happen if I do the above from within python shell in xterm
>    8. the problem does NOT happen when just do "import lxml", lxml.etree
>       is the problem; etree is a binding to some C code in lxml/etree.c;
>       this file is generated by pyrex
>
> Upgrading to python 2.4 will not be easy so I want to make sure I
> exhaust other solutions first.

You use a lot of C extensions and libraries and even compiled some of them
from source, so the problem can be virtually anywhere. Since I used wxPython
for a project once, I'd usually suspect the error in there first. The
experience at the time made me learn and appreciate Qt.

You can try running the Python interpreter under Valgrind control to see where
the segfault occurs. There is a command line in lxml/doc/valgrind.txt.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 6 22:28:39 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 06 Sep 2006 22:28:39 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To:
References:
Message-ID: <44FF2F77.1010106@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> Hm, I could of course "stringify" all empty leaves after parsing, given
> that my users aren't accessing the etree/objectify APIs e.g. fromstring()
> directly.

That's a matter of documentation.
The best way is to write a small Python wrapper
around the objectify module and have them import /that/.

  from lxml.objectify import *
  from lxml import objectify

  def fromstring(xml):
      return _fixItUp( objectify.fromstring(xml) )

> But I'd have to iterate over the whole tree for this.

Sure, it's much easier with the right class in place. And having all elements
instantiated during iteration isn't quite the most efficient thing ever. You
could reduce the effort with a smart XPath expression, though.

One thing that comes to my mind is that we could add support for replacing the
default type classes used by ObjectifyElementClassLookup. We could add keyword
arguments so that you could say

  lookup = ObjectifyElementClassLookup(StringElement=MyStringElementClass)

That would currently work for String-, None- and ObjectifiedElement only, as
the others use the data type registry. Maybe we should rather support
something like "default_data_class" and "default_tree_class" (and keep the
NoneElement, which is only used in a well defined case anyway).

Then again, what about "empty_class", "xsi_nil_class" and "tree_class"? Any
preference or comments?

> BTW: What do you think about adding .encode(...) to StringElement?

Python's string objects have 35 documented methods, most of which we could
implement (although some of them, like "index" and "find" already have a
different meaning in etree/objectify). If we consider implementing one, we
should rather have all of them in place. Don't know if it's worth it. As the
documentation says, if you want a real string, use ".text".

> Something we've discussed before: Would it make sense to allow an
> ObjectifiedElement instance to change its element.text internally,
> like e.g. in its _init() method?

I think it would be a good idea to add a method "__setText(s)" to
ObjectifiedDataElement. That would make it available to subclasses and at the
same time make it clear that it is *no* public API.
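
The wrapper idea above, a replacement fromstring() that post-processes the parsed tree, could look roughly like the following. This is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml.objectify so that it runs without lxml installed, and stringify_empty_leaves() is a hypothetical stand-in for the _fixItUp() helper, which is not spelled out in this thread. It sets the text of every childless element without text to '', so that such leaves report an empty string rather than None.

```python
import xml.etree.ElementTree as ET


def stringify_empty_leaves(root):
    """Give every childless element without text an empty-string text."""
    for el in root.iter():
        # len(el) counts child elements; text is None when the parser
        # saw no character data inside the element.
        if len(el) == 0 and el.text is None:
            el.text = ''
    return root


def fromstring(xml):
    """Drop-in wrapper: parse, then normalise empty leaves."""
    return stringify_empty_leaves(ET.fromstring(xml))


# Hypothetical sample document: s and v are empty leaves, u is structural.
root = fromstring('<root><s/><t>some string</t><u><v/></u></root>')
s_text = root.find('s').text      # empty leaf, now ''
u_text = root.find('u').text      # has a child, left as None
v_text = root.find('u/v').text    # nested empty leaf, now ''
```

Elements that have children are left untouched, which matches the wish in this thread that "structural" elements should not turn into strings. The cost is the full tree walk mentioned above.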
Stefan From Holger.Joukl at LBBW.de Fri Sep 8 10:22:16 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Fri, 8 Sep 2006 10:22:16 +0200 Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders In-Reply-To: <44FF2F77.1010106@gkec.informatik.tu-darmstadt.de> Message-ID: Hi Stefan, lxml-dev-bounces at codespeak.net schrieb am 06.09.2006 22:28:39: > One thing that comes to my mind is that we could add support for replacing the > default type classes used by ObjectifyElementClassLookup. We could add keyword > arguments so that you could say > > lookup = ObjectifyElementClassLookup(StringElement=MyStringElementClass) > > That would currently work for String-, None- and ObjectifiedElement only, as > the others use the data type registry. Maybe we should rather support > something like "default_data_class" and "default_tree_class" (and keep the > NoneElement, which is only used in a well defined case anyway). > > Then again, what about "empty_class", "xsi_nil_class" and "tree_class"? Any > preference or comments? I'm perfectly happy with the current solution except for setattr-ing a 'structural element' and wanting this to remain instead of becoming a StringElement. I don't quite see how a different default data class or different tree class achieve this? So I'm back to suggesting a TreeElement() factory (not the best name, maybe) returning an ObjectifiedElement with a new pytype='ObjectifiedElement' which keeps it from becoming a string. I think that's still nicer than "stringifying" every single empty leaf when parsing from XML. > > BTW: What do you think about adding .encode(...) to StringElement? > > Python's string objects have 35 documented methods, most of which we could > implement (although some of them, like "index" and "find" already have a > different meaning in etree/objectify). If we consider implementing one, we > should rather have all of them in place. Don't know if it's worth it. 
As the
> documentation says, if you want a real string, use ".text".

Guess you're right, let's keep things simple. I was worried about e.g. printing to stdout but then again someone should probably convert all data to unicode and then encode to his preferred encoding, as he'll have to deal with numbers and stuff anyway (which don't have an .encode method, either).

> > Something we've discussed before: Would it make sense to allow an
> > ObjectifiedElement instance to change its element.text internally,
> > like e.g. in its _init() method?
>
> I think it would be a good idea to add a method "__setText(s)" to
> ObjectifiedDataElement. That would make it available to subclasses and at the
> same time make it clear that it is *no* public API.

+1 for that. If somebody wants a user to be able to modify in place he can then also add a public method to his data classes.

Holger

The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Sep 7 17:48:30 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 07 Sep 2006 17:48:30 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <44FEBDBF.2020202@praterm.com.pl>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEAC6E.1080604@gkec.informatik.tu-darmstadt.de> <44FEBDBF.2020202@praterm.com.pl>
Message-ID: <45003F4E.8030403@gkec.informatik.tu-darmstadt.de>

Hi Paweł,

Paweł Pałucha wrote:
> Your patch is almost ok, I mean it produces:
>
>   File "p.py", line 5, in ?
>     d = lxml.etree.parse('/tmp/?.xml')
>   File "etree.pyx", line 1774, in etree.parse
>   File "parser.pxi", line 878, in etree._parseDocument
>   File "parser.pxi", line 798, in etree._parseDocFromFile
>   File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 583, in etree._handleParseResult
>   File "etree.pyx", line 195, in etree._ExceptionContext._raise_if_stored
>   File "parser.pxi", line 275, in etree._parser_resolve_from_python
>   File "apihelpers.pxi", line 492, in etree.funicode
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-8: invalid data
>
> but if I change line 275 in parser.pxi (etree._parser_resolve_from_python) from:
>     url = funicode(c_url)
> to:
>     url = c_url
>
> parsing works ok. Does it make sense?

It's not a perfect solution, as it may mean that the user ends up with a UTF-8 encoded URL to resolve. That's rarely a problem, as most URLs are plain ASCII anyway, except for filenames.

URLs are passed as UTF-8 encoded strings if they occurred in a document. The funicode() call is meant to handle that. Filenames are passed as byte strings encoded for the local filesystem if they were provided by the user and are UTF-8 encoded if they originate from a document, so that makes custom resolvers more difficult to handle.
At least, it's always a byte encoded string, it's just not obvious to the user which encoding was used...

As it works with ASCII URLs and also supports the case of locally encoded filenames, I think it's an acceptable tradeoff. I'll put it into the trunk.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Sep 7 18:19:19 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 07 Sep 2006 18:19:19 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <44FEACD8.1090805@praterm.com.pl>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl>
Message-ID: <45004687.80501@gkec.informatik.tu-darmstadt.de>

Hi,

Paweł Pałucha wrote:
> Stefan Behnel wrote:
>
>> first thing to note: lxml uses UTF-8 internally, also for filenames, as
>> libxml2 requires a char sequence for their representation. If your system
>> can't handle that, we'll have to figure out a way to make it work.
>
> Looking at the code of libxml2, filenames are passed as 'const char *' and not
> 'const xmlChar *'.

with "char*" == "xmlChar*", but that's not what I meant.

>> Right, I guess that treating the filename with the _utf8() function is not the
>> right thing to do for 8-bit strings. We should have a separate way of treating
>> filenames. I'll look into it.
>
> I'm not sure, but perhaps the only way is to leave this as user responsibility,
> and pass filenames without any modifications. There's no way to check what
> encoding is used for file name.

At least, there is sys.getfilesystemencoding(), which tells us what it *should* be.

> As for libc open() it's just a NULL terminated
> data. But I don't know if Python can handle it...

Python just passes it on.

>> I assume you get the same file-not-found error if you pass the filename as
>> unicode string?
>>
>>     d = lxml.etree.parse(u'/tmp/?.xml')
>
> Yes, btw - the file-not-found error is just "IOError%" - not very helpful message.

Might have been because of the same problem as the exception you got from funicode().

>>> This did not happen with python 2.3 - the problem is only with
>>> python2.4. There's a workaround - reading file to StringIO object and
>>> then parsing XML from that object. It works fine but it's silly.
>> You can also pass an opened file object.
>
> No ;-) That's strange but code like that:
>
>     f = open('/tmp/?.xml')
>     d = lxml.etree.parse(f)
>
> behaves the same way (assertion) - it looks like you check the file name
> remembered in file object.

Possibly just another side effect of funicode. :)

I fixed some more places where this was used, so here is a new patch.

Stefan

PS: please remember replying to the list also, so that this discussion gets a) archived and b) looked at by others.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: filename_encoding.patch
Type: text/x-patch
Size: 15945 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060907/aa9334c8/attachment-0001.bin

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Sep 7 19:47:57 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 07 Sep 2006 19:47:57 +0200
Subject: [lxml-dev] Building debian packages
In-Reply-To: <44FF348F.6070605@praterm.com.pl>
References: <44FF348F.6070605@praterm.com.pl>
Message-ID: <45005B4D.2060208@gkec.informatik.tu-darmstadt.de>

Czesc Paweł,

Paweł Pałucha wrote:
> Hi, I've got a small patch against current SVN to file doc/build.txt:
> * info about changing md5sum in *.dsc file
> * command `dpkg -x ...` is incorrect - it should be `dpkg-source -x ...`

Dzienki, I'll add it.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Sep 9 08:14:27 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sat, 09 Sep 2006 08:14:27 +0200
Subject: [lxml-dev] 1.0.4 released
Message-ID: <45025BC3.7080504@gkec.informatik.tu-darmstadt.de>

Hi all,

I just released lxml 1.0.4 to cheeseshop. It's a bug-fix release that fixes a crash bug in the replace() function. It also adds an extend() method to Element to make it more list-like.

1.1 final will still have to wait a little longer until the filename issue has settled. If I find the time, it will be out next week. I know, I keep saying things like that, but we had a couple of bug reports last week, so I'm happy that we can solve some more problems before the release.

Have fun,
Stefan

From pawel at praterm.com.pl Sat Sep 9 10:23:38 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Sat, 09 Sep 2006 10:23:38 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <45004687.80501@gkec.informatik.tu-darmstadt.de>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de>
Message-ID: <45027A0A.4090909@praterm.com.pl>

>
> I fixed some more places where this was used, so here is a new patch.
>
> Stefan

> --- src/lxml/xslt.pxi (Revision 31765)
> +++ src/lxml/xslt.pxi (Arbeitskopie)

Should I try to use this patch against Revision 31765 or better a current (32104) SVN Revision?

Paweł Pałucha

From pawel at praterm.com.pl Sat Sep 9 11:33:54 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Sat, 09 Sep 2006 11:33:54 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <45027A0A.4090909@praterm.com.pl>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl>
Message-ID: <45028A82.9030906@praterm.com.pl>

Paweł Pałucha wrote:
>> I fixed some more places where this was used, so here is a new patch.
>>
>> Stefan
>
>> --- src/lxml/xslt.pxi (Revision 31765)
>> +++ src/lxml/xslt.pxi (Arbeitskopie)
>
> Should I try to use this patch against Revision 31765 or better a current
> (32104) SVN Revision?

I tried both versions but they both have the same problem with 'url = funicode(c_url)' resulting in:

    d = lxml.etree.parse('/tmp/?.xml')

  File "p.py", line 5, in ?
    d = lxml.etree.parse('/tmp/?.xml')
  File "etree.pyx", line 1774, in etree.parse
  File "parser.pxi", line 878, in etree._parseDocument
  File "parser.pxi", line 798, in etree._parseDocFromFile
  File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 579, in etree._handleParseResult
  File "etree.pyx", line 195, in etree._ExceptionContext._raise_if_stored
  File "parser.pxi", line 275, in etree._parser_resolve_from_python
  File "apihelpers.pxi", line 492, in etree.funicode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-8: invalid data

When I try

    d = lxml.etree.parse('/tmp/?.xml'.encode('utf8'))

the result is:

  File "p.py", line 5, in ?
    d = lxml.etree.parse(('/tmp/?.xml').encode('utf8'))
  File "etree.pyx", line 1774, in etree.parse
  File "parser.pxi", line 878, in etree._parseDocument
  File "parser.pxi", line 798, in etree._parseDocFromFile
  File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 582, in etree._handleParseResult
  File "parser.pxi", line 548, in etree._raiseParseError
IOError: Error reading file '/tmp/??.xml': failed to load external entity "/tmp/??.xml"

The good news is that:

    f = open('/tmp/?.xml')
    d = lxml.etree.parse(f)

now works ok (at least at SVN 32104)

Paweł Pałucha

From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Sep 9 20:22:58 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sat, 09 Sep 2006 20:22:58 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <45028A82.9030906@praterm.com.pl>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl> <45028A82.9030906@praterm.com.pl>
Message-ID: <45030682.20301@gkec.informatik.tu-darmstadt.de>

Hi,

Paweł Pałucha wrote:
>>> I fixed some more places where this was used, so here is a new patch.
>>>
> I tried both versions but they both have the same problem with 'url =
> funicode(c_url)' resulting in:
>
>     d = lxml.etree.parse('/tmp/?.xml')
>
>   File "p.py", line 5, in ?
>     d = lxml.etree.parse('/tmp/?.xml')
>   File "etree.pyx", line 1774, in etree.parse
>   File "parser.pxi", line 878, in etree._parseDocument
>   File "parser.pxi", line 798, in etree._parseDocFromFile
>   File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 579, in etree._handleParseResult
>   File "etree.pyx", line 195, in etree._ExceptionContext._raise_if_stored
>   File "parser.pxi", line 275, in etree._parser_resolve_from_python
>   File "apihelpers.pxi", line 492, in etree.funicode
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-8: invalid data

As I said, the problem is that URLs can originate from libxml2 (as UTF-8) and from the user (as encoded 8-bit strings). In "_parser_resolve_from_python()", however, we cannot distinguish between the two cases and thus cannot easily find out about the encoding used. However, we have to if we want to provide a consistent interface to the user.

> When I try
>
>     d = lxml.etree.parse('/tmp/?.xml'.encode('utf8'))
>
> the result is:
>
>   File "p.py", line 5, in ?
>     d = lxml.etree.parse(('/tmp/?.xml').encode('utf8'))
>   File "etree.pyx", line 1774, in etree.parse
>   File "parser.pxi", line 878, in etree._parseDocument
>   File "parser.pxi", line 798, in etree._parseDocFromFile
>   File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 582, in etree._handleParseResult
>   File "parser.pxi", line 548, in etree._raiseParseError
> IOError: Error reading file '/tmp/??.xml': failed to load external entity
> "/tmp/??.xml"

That's the expected result, as the file you want to refer to does not have that name.

> The good news is that:
>
>     f = open('/tmp/?.xml')
>     d = lxml.etree.parse(f)
>
> now works ok (at least at SVN 32104)

Yes, I fixed that. And I'll have to see what I can do about the problem above.

I'm considering to distinguish between the case where the URL in question is the URL of the document that is being parsed and the case where a different URL (like a DTD) is requested. The second case usually means that the URL originated from libxml2 (i.e. is in UTF-8) and the first case means that the user is in charge (i.e. we'd better not recode the string). But I don't know if that's sufficient...

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Sep 9 20:51:31 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sat, 09 Sep 2006 20:51:31 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <45030682.20301@gkec.informatik.tu-darmstadt.de>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl> <45028A82.9030906@praterm.com.pl> <45030682.20301@gkec.informatik.tu-darmstadt.de>
Message-ID: <45030D33.207@gkec.informatik.tu-darmstadt.de>

Stefan Behnel wrote:
> Paweł Pałucha wrote:
>>>> I fixed some more places where this was used, so here is a new patch.
>>>>
>> I tried both versions but they both have the same problem with 'url =
>> funicode(c_url)' resulting in:
>>
>>     d = lxml.etree.parse('/tmp/?.xml')
>>
>>   File "p.py", line 5, in ?
>>     d = lxml.etree.parse('/tmp/?.xml')
>>   File "etree.pyx", line 1774, in etree.parse
>>   File "parser.pxi", line 878, in etree._parseDocument
>>   File "parser.pxi", line 798, in etree._parseDocFromFile
>>   File "parser.pxi", line 512, in etree._BaseParser._parseDocFromFile
>>   File "parser.pxi", line 579, in etree._handleParseResult
>>   File "etree.pyx", line 195, in etree._ExceptionContext._raise_if_stored
>>   File "parser.pxi", line 275, in etree._parser_resolve_from_python
>>   File "apihelpers.pxi", line 492, in etree.funicode
>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-8: invalid data
>
> As I said, the problem is that URLs can originate from libxml2 (as UTF-8) and
> from the user (as encoded 8-bit strings). In "_parser_resolve_from_python()",
> however, we cannot distinguish between the two cases and thus cannot easily
> find out about the encoding used. However, we have to if we want to provide a
> consistent interface to the user.
>
> I'm considering to distinguish between the case where the URL in question is
> the URL of the document that is being parsed and the case where a different
> URL (like a DTD) is requested. The second case usually means that the URL
> originated from libxml2 (i.e. is in UTF-8) and the first case means that the
> user is in charge (i.e. we'd better not recode the string). But I don't know
> if that's sufficient...

Still not sure, but I committed the following patch for now. Please test if it works for you.

Stefan

Index: src/lxml/parser.pxi
===================================================================
--- src/lxml/parser.pxi (Revision 32103)
+++ src/lxml/parser.pxi (Arbeitskopie)
@@ -271,12 +271,16 @@
     try:
         if c_url is NULL:
             url = None
+        elif c_context.myDoc is NULL or c_context.myDoc.URL is NULL:
+            # parsing a main document, so URL was passed verbatimly by user
+            url = c_url
         else:
+            # parsing a related document (DTD etc.) => UTF-8 encoded URL
             url = funicode(c_url)
         if c_pubid is NULL:
             pubid = None
         else:
-            pubid = funicode(c_pubid)
+            pubid = funicode(c_pubid) # always UTF-8
         doc_ref = context._resolvers.resolve(url, pubid, context)
         if doc_ref is None:

From pawel at praterm.com.pl Sat Sep 9 23:35:22 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Sat, 09 Sep 2006 23:35:22 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <45030D33.207@gkec.informatik.tu-darmstadt.de>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl> <45028A82.9030906@praterm.com.pl> <45030682.20301@gkec.informatik.tu-darmstadt.de> <45030D33.207@gkec.informatik.tu-darmstadt.de>
Message-ID: <20060909233522.k0dl19gaes8gsg40@www.szarp.com.pl>

Quoting Stefan Behnel :

> Still not sure, but I committed the following patch for now. Please
> test if it
> works for you.

>     try:
>         if c_url is NULL:
>             url = None
> +        elif c_context.myDoc is NULL or c_context.myDoc.URL is NULL:
> +            # parsing a main document, so URL was passed verbatimly by user
> +            url = c_url
>         else:
> +            # parsing a related document (DTD etc.) => UTF-8 encoded URL
>             url = funicode(c_url)

I will try it, but what about something like:

    try:
        url = funicode(c_url)
    except UnicodeDecodeError:
        url = c_url

?

Paweł

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Sep 10 09:32:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 10 Sep 2006 09:32:04 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <20060909233522.k0dl19gaes8gsg40@www.szarp.com.pl>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl> <45028A82.9030906@praterm.com.pl> <45030682.20301@gkec.informatik.tu-darmstadt.de> <45030D33.207@gkec.informatik.tu-darmstadt.de> <20060909233522.k0dl19gaes8gsg40@www.szarp.com.pl>
Message-ID: <4503BF74.40708@gkec.informatik.tu-darmstadt.de>

Paweł Pałucha wrote:
> Quoting Stefan Behnel :
>
>> Still not sure, but I committed the following patch for now. Please
>> test if it
>> works for you.
>
>>     try:
>>         if c_url is NULL:
>>             url = None
>> +        elif c_context.myDoc is NULL or c_context.myDoc.URL is NULL:
>> +            # parsing a main document, so URL was passed verbatimly
>> by user
>> +            url = c_url
>>         else:
>> +            # parsing a related document (DTD etc.) => UTF-8 encoded URL
>>             url = funicode(c_url)
>
> I will try it, but what about something like:
>
>     try:
>         url = funicode(c_url)
>     except UnicodeDecodeError:
>         url = c_url

what if the URL contains characters in a local encoding that the UTF-8 decoder can accidentally decode?
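
To make that concern concrete, here is a small sketch of how a "try UTF-8, fall back to the raw bytes" strategy can silently misread a locally encoded name. It is written for Python 3 and is not lxml code; `decode_with_fallback` is a hypothetical stand-in for the proposed try/except around funicode(), and the file names are made up for the example.

```python
def decode_with_fallback(raw):
    """Mimic the proposed fallback: decode as UTF-8, else keep the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


# A name the user encoded as ISO-8859-1; it really is the two
# characters U+00C3 and U+00A9 followed by ".xml".
intended = "\u00c3\u00a9.xml"
raw = intended.encode("iso-8859-1")  # the bytes C3 A9 2E 78 6D 6C

decoded = decode_with_fallback(raw)

# The bytes C3 A9 also form a valid UTF-8 sequence (U+00E9), so no
# exception is raised and the name is silently changed.
misread = decoded != intended

# A byte that is invalid as UTF-8 does trigger the fallback.
kept_raw = decode_with_fallback(b"\xb3.xml")
```

The fallback only helps when the local bytes happen to be invalid UTF-8; when they are valid by coincidence, the wrong name is returned without any error, which is why the origin of the URL has to be known instead of guessed.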

Stefan

From pawel at praterm.com.pl Sun Sep 10 12:12:43 2006
From: pawel at praterm.com.pl (Paweł Pałucha)
Date: Sun, 10 Sep 2006 12:12:43 +0200
Subject: [lxml-dev] File name encoding problems
In-Reply-To: <4503BF74.40708@gkec.informatik.tu-darmstadt.de>
References: <44FD9717.30802@praterm.com.pl> <44FE96DE.7030101@gkec.informatik.tu-darmstadt.de> <44FEACD8.1090805@praterm.com.pl> <45004687.80501@gkec.informatik.tu-darmstadt.de> <45027A0A.4090909@praterm.com.pl> <45028A82.9030906@praterm.com.pl> <45030682.20301@gkec.informatik.tu-darmstadt.de> <45030D33.207@gkec.informatik.tu-darmstadt.de> <20060909233522.k0dl19gaes8gsg40@www.szarp.com.pl> <4503BF74.40708@gkec.informatik.tu-darmstadt.de>
Message-ID: <4503E51B.9090407@praterm.com.pl>

Stefan Behnel wrote:
>> I will try it, but what about something like:
>>
>>     try:
>>         url = funicode(c_url)
>>     except UnicodeDecodeError:
>>         url = c_url
>
> what if the URL contains characters in a local encoding that the UTF-8 decoder
> can accidentally decode?

Probably you are right ;-)

Anyway, SVN version 32115 works just fine, both calling parse() with file name and with already opened file. Thanks a lot!

Paweł

From ashish.kulkarni at kalyptorisk.com Tue Sep 12 09:15:14 2006
From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni)
Date: Tue, 12 Sep 2006 12:45:14 +0530
Subject: [lxml-dev] Static win32 builds for lxml 1.0.4
Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03ED4A@mailserver.kalyptorisk.com>

Hello,

I finally got the VC++ Toolkit 2003 compiler, and have made static builds for lxml 1.0.4. They are available at:

http://puggy.symonds.net/~ashish/downloads/

Regards,
ashish

From nico at tekNico.net Tue Sep 12 12:46:08 2006
From: nico at tekNico.net (Nicola Larosa)
Date: Tue, 12 Sep 2006 12:46:08 +0200
Subject: [lxml-dev] Unicode munging in element tag and text
Message-ID: <45068FF0.8020500@tekNico.net>

Hi all,

thanks for a great library.
:-)

I found a rather peculiar behavior in Unicode object handling for element tag and text. It looks like they get converted to a plain string if they only contain ASCII chars, but not always. ElementTree instead always keeps them as Unicode objects.

>>> from lxml.etree import Element as lxElem
>>> from elementtree.ElementTree import Element as etElem

1) Let's first build an element from a Unicode object with ASCII chars; only ElementTree keeps it as Unicode:

>>> lx = lxElem(u'ascii')
>>> et = etElem(u'ascii')
>>> lx.tag
'ascii'
>>> et.tag
u'ascii'

while when the Unicode object contains non-ASCII chars, both libraries correctly keep it as Unicode:

>>> lx = lxElem(u'mòrèthànàscìì')
>>> et = etElem(u'mòrèthànàscìì')
>>> lx.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'

2) The same happens for the element text; ASCII:

>>> lx.text = u'ascii'
>>> et.text = u'ascii'
>>> lx.text
'ascii'
>>> et.text
u'ascii'

non-ASCII:

>>> lx.text = u'mòrèthànàscìì'
>>> et.text = u'mòrèthànàscìì'
>>> lx.text
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.text
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'

3) OTOH, when directly setting the element tag, lxml keeps the Unicode object too:

>>> lx.tag = u'ascii'
>>> et.tag = u'ascii'
>>> lx.tag
u'ascii'
>>> et.tag
u'ascii'

while both libraries keep working correctly when using non-ASCII chars:

>>> lx.tag = u'mòrèthànàscìì'
>>> et.tag = u'mòrèthànàscìì'
>>> lx.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'

This inconsistent behavior does not seem intentional. In my opinion, in the cases 1) and 2) lxml should work as it already does in the case 3), and as ElementTree always does.

Thanks again.

--
Nicola Larosa - http://www.tekNico.net/

There is more money being spent on breast implants and Viagra today than on Alzheimer's research.
This means that by 2040, there should be a large elderly population with perky boobs and huge erections and absolutely no recollection of what to do with them. -- David Icke, April 2006 From fredrik at pythonware.com Tue Sep 12 13:27:23 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 12 Sep 2006 13:27:23 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: <45068FF0.8020500@tekNico.net> References: <45068FF0.8020500@tekNico.net> Message-ID: Nicola Larosa wrote: > This inconsistent behavior does not seem intentional. In my opinion, in the > cases 1) and 2) lxml should work as it already does in the case 3), and as > ElementTree always does. in Python 2.X, Unicode strings are compatible with 8-bit ASCII-only strings, so the lxml.etree behaviour is perfectly acceptable. I see no reason to force an implementation that doesn't use Python objects for its internal storage to be forced to keep track of the original type. (especially not since the Unicode string type will disappear in Python 3.0; all strings will be able to hold Unicode data). From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Sep 12 18:23:39 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 12 Sep 2006 18:23:39 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: <45068FF0.8020500@tekNico.net> References: <45068FF0.8020500@tekNico.net> Message-ID: <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> Hi Nicola, Nicola Larosa wrote: > This inconsistent behavior does not seem intentional. In my opinion, in the > cases 1) and 2) lxml should work as it already does in the case 3), and as > ElementTree always does. At least under Python 2.x, lxml.etree will continue to return unicode or plain strings depending on their content. 
Internally, everything is stored as UTF-8, so this is for performance reasons as we can avoid unicode conversion for plain ASCII strings (which are very common, just think of numeric data, dates, etc.). This may change in Python 3.x, but then, there may be more to change, so that's not in our scope for now. Stefan From nico at tekNico.net Tue Sep 12 19:00:34 2006 From: nico at tekNico.net (Nicola Larosa) Date: Tue, 12 Sep 2006 19:00:34 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> References: <45068FF0.8020500@tekNico.net> <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> Message-ID: <4506E7B2.20808@tekNico.net> Stefan Behnel wrote: > At least under Python 2.x, lxml.etree will continue to return unicode or plain > strings depending on their content. Internally, everything is stored as UTF-8, > so this is for performance reasons as we can avoid unicode conversion for > plain ASCII strings (which are very common, just think of numeric data, dates, > etc.). Any benchmarks supporting this decision? -- Nicola Larosa - http://www.tekNico.net/ There is more money being spent on breast implants and Viagra today than on Alzheimer's research. This means that by 2040, there should be a large elderly population with perky boobs and huge erections and absolutely no recollection of what to do with them. -- David Icke, April 2006 From fredrik at pythonware.com Tue Sep 12 19:19:52 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 12 Sep 2006 19:19:52 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: <4506E7B2.20808@tekNico.net> References: <45068FF0.8020500@tekNico.net> <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> <4506E7B2.20808@tekNico.net> Message-ID: Nicola Larosa wrote: > Any benchmarks supporting this decision? 
Are you trying to use the "premature optimization is evil" argument against people who's spent more time than anyone else on optimizing Python's string subsystem? ;-) From nico at tekNico.net Wed Sep 13 00:28:31 2006 From: nico at tekNico.net (Nicola Larosa) Date: Wed, 13 Sep 2006 00:28:31 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: References: <45068FF0.8020500@tekNico.net> <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> <4506E7B2.20808@tekNico.net> Message-ID: <4507348F.7050001@tekNico.net> Fredrik Lundh wrote: > Are you trying to use the "premature optimization is evil" argument > against people who's spent more time than anyone else on optimizing > Python's string subsystem? ;-) I, for one, welcome our new Iceland Sprint overlords. ;-P -- Nicola Larosa - http://www.tekNico.net/ Many software developers have become hostage to the development frameworks that they utilise. In turn, many frameworks have made session state a fundamental building block of web development because it permits sloppy design. -- Alan Dean, April 2006 From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Sep 12 21:59:53 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 12 Sep 2006 21:59:53 +0200 Subject: [lxml-dev] Static win32 builds for lxml 1.0.4 In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03ED4A@mailserver.kalyptorisk.com> References: <2AB7346A3227A74BB97F9A0D79E3E65A03ED4A@mailserver.kalyptorisk.com> Message-ID: <450711B9.7080208@gkec.informatik.tu-darmstadt.de> Hi Ashish, Ashish Kulkarni wrote: > I finally got the VC++ Toolkit 2003 compiler, and have made static builds > for lxml 1.0.4. Great, I uploaded them. Thanks for contributing! 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 13 09:19:14 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 13 Sep 2006 09:19:14 +0200 Subject: [lxml-dev] Unicode munging in element tag and text In-Reply-To: <4506E7B2.20808@tekNico.net> References: <45068FF0.8020500@tekNico.net> <4506DF0B.5070604@gkec.informatik.tu-darmstadt.de> <4506E7B2.20808@tekNico.net> Message-ID: <4507B0F2.5080009@gkec.informatik.tu-darmstadt.de> Hi Nicola, Nicola Larosa wrote: > Stefan Behnel wrote: >> At least under Python 2.x, lxml.etree will continue to return unicode or plain >> strings depending on their content. Internally, everything is stored as UTF-8, >> so this is for performance reasons as we can avoid unicode conversion for >> plain ASCII strings (which are very common, just think of numeric data, dates, >> etc.). > > Any benchmarks supporting this decision? Pretty short question for a long answer. Your third point was that .tag returned the original type. This is done through caching the original input, which avoids some 95% of the work required to rebuild it on each access (last time I ran the benchmark, at least). This means, it is 95% faster if a program frequently accesses the same tag name. We could instead recreate a string for the result to make it fit the behaviour of .text, but why if it's not more than overhead? As Fredrik said, plain strings and unicode strings are compatible, no need to convert one into the other for normal string operations. It's actually your fault if you waste memory and processing time by passing a unicode string where a plain string would do. As for the first two points, skipping through a string to see if any non-ASCII characters are in there is trivial and fast (7-bit vs. 
8-bit), creating a plain string from it means allocating the same amount of memory, copying the string (which most likely is in the processor cache already by then) using a platform-optimised memcpy (or whatever, note that we already know the length of the string by then) and then creating a Python object for it. Converting it to unicode means allocating two or four times the memory, doing a per-character conversion step by step (from multi-byte UTF-8) and then creating a Python object for it.

I didn't do much benchmarking here, but given the "95%" result above (meaning, the majority of work is the actual string instantiation), I simply assume that avoiding the character conversion is worth it if ASCII content is frequent.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Sep 13 10:05:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 13 Sep 2006 10:05:04 +0200
Subject: [lxml-dev] lxml 1.1 released
Message-ID: <4507BBB0.4090903@gkec.informatik.tu-darmstadt.de>

Hi all,

I'm proud to announce the release of lxml 1.1. It is a major new release that builds upon the changes in 1.0.4 and introduces many new features compared to the 1.0 series. The main improvements are:

* threading support
* XPath axis iteration
* iterparse() and iterwalk()
* configurable Element class lookup methods
* lxml.objectify - a data binding API on top of lxml.etree

The complete changelog is below.

lxml 1.1 has been tested in an alpha and beta version and received various bug fixes before the final release. It is currently considered stable and ready for production use, whereas the 1.0 series is considered rock-stable and may still be the right thing to choose for more conservative environments.

Bug-fix releases and minor improvements to the 1.1 and 1.0 series will continue to become available as needed. That way, 1.1 will soon be considered as stable as 1.0 and become its main-line successor.
Given the many helpful egg contributions we received for past releases, I hope that 1.1 (and 1.0.4) will be as well supported in that regard to help in further increasing our community. Thanks to everyone who helped in getting 1.1 done, in getting bugs fixed and helping others in using it. Have fun, Stefan Changes since 1.0.4: 1.1 (2006-09-13) ================ Features added -------------- * Comments and processing instructions return '' and '' for repr() * Parsers are now the preferred (and default) place where element class lookup schemes should be registered. Namespace lookup is no longer supported by default. Bugs fixed ---------- * filenames with local 8-bit encoding were not supported * 1.1beta did not compile under Python 2.3 * ignore unknown 'pyval' attribute values in objectify * objectify.ObjectifiedElement.addattr() failed to accept Elements and Lists * objectify.ObjectPath.setattr() failed to accept Elements and Lists 1.1beta (2006-08-08) ==================== Features added -------------- * Support for Python 2.5 beta * Unlock the GIL for deep copying documents and for XPath() * New ``compact`` keyword argument for parsing read-only documents * Support for parser options in iterparse() * The ``namespace`` axis is supported in XPath and returns (prefix, URI) tuples * The XPath expression "/" now returns an empty list instead of raising an exception * XML-Object API on top of lxml (lxml.objectify) * Customizable Element class lookup: * different pre-implemented lookup mechanisms * support for externally provided lookup functions * Support for processing instructions (ET-like, not compatible) * Public C-level API for independent extension modules Bugs fixed ---------- * XPathSyntaxError now inherits from XPathError * Threading race conditions in RelaxNG and XMLSchema * Crash when mixing elements from XSLT results into other trees, concurrent XSLT is only allowed when the stylesheet was parsed in the main thread * The EXSLT ``regexp:match`` function now 
works as defined (except for some differences in the regular expression syntax) * Setting element.text to '' returned None on request, not the empty string * ``iterparse()`` could crash on long XML files * Creating documents no longer copies the parser for later URL resolving. For performance reasons, only a reference is kept. Resolver updates on the parser will now be reflected by documents that were parsed before the change. Although this should rarely become visible, it is a behavioral change from 1.0. 1.1alpha (2006-06-27) ===================== Features added -------------- * Module level ``iterwalk()`` function as 'iterparse' for trees * Module level ``iterparse()`` function similar to ElementTree (see documentation for differences) * Element.nsmap property returns a mapping of all namespace prefixes known at the Element to their namespace URI * Reentrant threading support in RelaxNG, XMLSchema and XSLT * Threading support in parsers and serializers: * All in-memory operations (tostring, parse(StringIO), etc.) 
free the GIL * File operations (on file names) free the GIL * Reading from file-like objects frees the GIL and reacquires it for reading * Serialisation to file-like objects is single-threaded (high lock overhead) * Element iteration over XPath axes: * Element.iterdescendants() iterates over the descendants of an element * Element.iterancestors() iterates over the ancestors of an element (from parent to parent) * Element.itersiblings() iterates over either the following or preceding siblings of an element * Element.iterchildren() iterates over the children of an element in either direction * All iterators support the ``tag`` keyword argument to restrict the generated elements * Element.getnext() and Element.getprevious() return the direct siblings of an element From ashish.kulkarni at kalyptorisk.com Thu Sep 14 07:07:24 2006 From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni) Date: Thu, 14 Sep 2006 10:37:24 +0530 Subject: [lxml-dev] Static win32 builds for lxml 1.1 In-Reply-To: <450711B9.7080208@gkec.informatik.tu-darmstadt.de> Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03EDC2@mailserver.kalyptorisk.com> Hello, I've made static win32 builds for lxml 1.1. However, I faced a few problems so I'm just mentioning them here. The builds are available at http://puggy.symonds.net/~ashish/downloads/ side-note: the static build is now 2.1MB, compared to the 1.2MB for 1.0.4. 1] the lxml.etree extension failed to compile etree.c src\lxml\etree.c(49440) : error C2137: empty character constant The code at the location is in function __Pyx_ImportModuleCApi, and looks like this: if (*t->s == '') Apparently, MSVC chokes on the raw character (gcc doesn't have a problem with it). Ideally such a character should not be present -- it should use something like '\x0', which is what I did. If this is correct, then it should be fixed either in the Pyrex code, the patched Pyrex or swig (wherever the problem originated). 
2] the lxml.objectify extension failed to link

The lxml.objectify extension is using the xmlNanoHTTP/xmlNanoFTP functions from libxml2, which in turn require linking to wsock32. The solution is to add that to the list of libraries.

# This is called if the '--static' option is passed
def setupStaticBuild():
    cflags = [
        "-I..\\libxml2-2.6.23.win32\\include",
        "-I..\\libxslt-1.1.15.win32\\include",
        "-I..\\zlib-1.2.3.win32\\include",
        "-I..\\iconv-1.9.1.win32\\include"
        ]
    xslt_libs = [
        "..\\libxml2-2.6.23.win32\\lib\\libxml2_a.lib",
        "..\\libxslt-1.1.15.win32\\lib\\libxslt_a.lib",
        "..\\libxslt-1.1.15.win32\\lib\\libexslt_a.lib",
        "..\\zlib-1.2.3.win32\\lib\\zlib.lib",
        "..\\iconv-1.9.1.win32\\lib\\iconv_a.lib",
        "wsock32.lib"  # required for xmlNano* methods
        ]
    result = (cflags, xslt_libs)
    return result

The documentation should ideally be updated to reflect this.

Regards,
ashish

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Sep 14 19:58:14 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 14 Sep 2006 19:58:14 +0200
Subject: [lxml-dev] Static win32 builds for lxml 1.1
In-Reply-To: <2AB7346A3227A74BB97F9A0D79E3E65A03EDC2@mailserver.kalyptorisk.com>
References: <2AB7346A3227A74BB97F9A0D79E3E65A03EDC2@mailserver.kalyptorisk.com>
Message-ID: <45099836.1000804@gkec.informatik.tu-darmstadt.de>

Hi Ashish,

Ashish Kulkarni wrote:
> I've made static win32 builds for lxml 1.1. However, I faced a few problems so I'm just mentioning them here.

Thanks for binaries and feedback! Both are very helpful.

> side-note: the static build is now 2.1MB, compared to the 1.2MB for 1.0.4.

That's the disadvantage of a static build: each binary is statically linked against all libraries, so now it's twice the size as we have lxml.etree and lxml.objectify as modules.
> 1] the lxml.etree extension failed to compile > > etree.c > src\lxml\etree.c(49440) : error C2137: empty character constant > > The code at the location is in function __Pyx_ImportModuleCApi, and looks like this: > > if (*t->s == '') Right. I found that in my Pyrex patch. It actually says "\0", but that's parsed and converted by Python already, so it ends up in the C source as 0-byte. Changing it to "\\0" will fix it (see patch below). > 2] the lxml.objectify extension failed to link > > The lxml.objectify extension is using the xmlNanoHTTP/xmlNanoFTP functions from libxml2, Not on my machine. I can't see them use any FTP functions whatsoever. Does etree depend on those on your machine, too? > which in turn require linking to wsock32. The solution is to add that to the list of libraries. > > # This is called if the '--static' option is passed > def setupStaticBuild(): > cflags = [ > "-I..\\libxml2-2.6.23.win32\\include", > "-I..\\libxslt-1.1.15.win32\\include", > "-I..\\zlib-1.2.3.win32\\include", > "-I..\\iconv-1.9.1.win32\\include" > ] > xslt_libs = [ > "..\\libxml2-2.6.23.win32\\lib\\libxml2_a.lib", > "..\\libxslt-1.1.15.win32\\lib\\libxslt_a.lib", > "..\\libxslt-1.1.15.win32\\lib\\libexslt_a.lib", > "..\\zlib-1.2.3.win32\\lib\\zlib.lib", > "..\\iconv-1.9.1.win32\\lib\\iconv_a.lib", > "wsock32.lib" # required for xmlNano* methods > ] > result = (cflags, xslt_libs) > return result > > The documentation should ideally be updated to reflect this. Not sure if that's really required. If etree doesn't use those functions, I can't see why objectify should. 
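As an aside, the escaping problem behind the C2137 error can be reproduced in plain Python. This is only a minimal sketch: the template strings and the helper name contains_nul are made up for illustration and are not taken from the Pyrex sources. It assumes, as described above, that the C code is kept in an ordinary (non-raw) Python string literal, so Python itself interprets \0 before the text is ever written to the .c file.

```python
# Illustration only: these template strings are hypothetical stand-ins for a
# C code template kept inside a Python source file.

# In a normal string literal Python interprets "\0" as a NUL character,
# so the generated C source would contain a raw 0-byte between the quotes.
BROKEN_TEMPLATE = "if (*t->s == '\0') continue;"

# Doubling the backslash keeps the two characters "\" and "0" in the output,
# which is what the C compiler needs to see.
FIXED_TEMPLATE = "if (*t->s == '\\0') continue;"

# A raw string literal produces the same text as the doubled backslash.
RAW_TEMPLATE = r"if (*t->s == '\0') continue;"


def contains_nul(text):
    """Return True if the text contains a literal NUL character."""
    return "\x00" in text


if __name__ == "__main__":
    for name, template in [("broken", BROKEN_TEMPLATE),
                           ("fixed", FIXED_TEMPLATE),
                           ("raw", RAW_TEMPLATE)]:
        print(name, repr(template), "NUL inside:", contains_nul(template))
```

A raw string literal would presumably work as well as doubling the backslash, as long as the template needs no other escape sequences.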
Stefan

Index: Pyrex/Compiler/Nodes.py
===================================================================
--- Pyrex/Compiler/Nodes.py (Revision 31073)
+++ Pyrex/Compiler/Nodes.py (Arbeitskopie)
@@ -4149,7 +4149,7 @@
 static int __Pyx_ImportModuleCApi(__Pyx_CApiTabEntry *t) {
     __Pyx_CApiTabEntry *api_t;
     while (t->s) {
-        if (*t->s == '\0')
+        if (*t->s == '\\0')
             continue; /* shortcut for erased string entries */
         api_t = %(API_TAB)s;
         while ((api_t->s) && (strcmp(api_t->s, t->s) < 0))

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Sep 14 19:29:07 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 14 Sep 2006 19:29:07 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To:
References:
Message-ID: <45099163.4050005@gkec.informatik.tu-darmstadt.de>

Hi Holger,

I didn't wait for this to settle for 1.1.0, but it can become available in 1.1.1 if we see it fit.

Holger Joukl wrote:
> Stefan Behnel wrote:
>> One thing that comes to my mind is that we could add support for
>> replacing the default type classes used by ObjectifyElementClassLookup.
>> We could add keyword arguments so that you could say
>>
>>   lookup = ObjectifyElementClassLookup(StringElement=MyStringElementClass)
>>
>> That would currently work for String-, None- and ObjectifiedElement only,
>> as the others use the data type registry. Maybe we should rather support
>> something like "default_data_class" and "default_tree_class" (and keep
>> the NoneElement, which is only used in a well defined case anyway).

I chose "tree_class" and "empty_data_class" now. I think that's sufficiently telling.

> I'm perfectly happy with the current solution except for setattr-ing a
> 'structural element' and wanting this to remain instead of becoming a
> StringElement. I don't quite see how a different default data class or
> different tree class achieve this?
Well, the idea is that you can change the default for empty data classes (remember that it's a pretty arbitrary decision to default to StringElement here) and also use subclasses of ObjectifiedElement for the tree structure. However, if you want StringElement in some cases and ObjectifiedElement in other cases, that's difficult to achieve at the Python level, as it would require passing information about the C node to allow taking the decision. > So I'm back to suggesting a TreeElement() factory (not the best name, > maybe) > returning an ObjectifiedElement with a new pytype='ObjectifiedElement' > which keeps it from becoming a string. I think that's still nicer than > "stringifying" every single empty leaf when parsing from XML. What about adding the attribute in objectify.Element()? You can't normally change the data value of an Element itself, so the only real reason why you would call objectify.Element() is to create a structural element (usually a root node). I called the corresponding pytype value "TREE" for now, I think it's unlikely that someone would use that as custom type name. Stefan From lxml at holloway.co.nz Fri Sep 15 00:13:29 2006 From: lxml at holloway.co.nz (Matthew Cruickshank) Date: Fri, 15 Sep 2006 10:13:29 +1200 Subject: [lxml-dev] Handling of ... Message-ID: <4509D409.3060809@holloway.co.nz> Hi, I've been writing this XML chain/pipeline - sort of a multi-pipeline version of Apache Cocoon and lxml has been great -- very fast! I'm having some problems with and I don't know how to read errors caused by that. They terminate the processing but I can't find the error message text (from within the tag) anywhere... it doesn't seem to be in error_log and no exception is thrown. Any ideas -- am I looking in the wrong place? I'm using to assert things about document structure and runtime parameters (things that couldn't be asserted in RelaxNG, I think). 
.Matthew Cruickshank http://docvert.org << MSWord to HTML or any XML From ashish.kulkarni at kalyptorisk.com Fri Sep 15 07:21:53 2006 From: ashish.kulkarni at kalyptorisk.com (Ashish Kulkarni) Date: Fri, 15 Sep 2006 10:51:53 +0530 Subject: [lxml-dev] Static win32 builds for lxml 1.1 In-Reply-To: <45099836.1000804@gkec.informatik.tu-darmstadt.de> Message-ID: <2AB7346A3227A74BB97F9A0D79E3E65A03EE05@mailserver.kalyptorisk.com> Hi, >> 2] the lxml.objectify extension failed to link >> >> The lxml.objectify extension is using the xmlNanoHTTP/xmlNanoFTP functions from libxml2, > >Not on my machine. I can't see them use any FTP functions whatsoever. Does >etree depend on those on your machine, too? > > Well, etree doesn't depend on it, but objectify does. I looked at the code, and probably it's because objectify includes some namespace URLs while etree doesn't. This may automatically be triggering the inclusion of the xmlNano* methods during a static build -- that's a guess, I don't know enough of lxml or libxml2 to be more certain :-) I've attached the build log of the compilation. Regards, Ashish ==================== BUILD LOG ==================== Building lxml version 1.1 *NOTE*: Trying to build without Pyrex, needs pre-generated 'src/lxml/etree.c' ! 
running build running build_py creating build creating build\lib.win32-2.4 creating build\lib.win32-2.4\lxml copying src\lxml\elementlib.py -> build\lib.win32-2.4\lxml copying src\lxml\sax.py -> build\lib.win32-2.4\lxml copying src\lxml\_elementpath.py -> build\lib.win32-2.4\lxml copying src\lxml\__init__.py -> build\lib.win32-2.4\lxml running build_ext building 'lxml.etree' extension creating build\temp.win32-2.4 creating build\temp.win32-2.4\Release creating build\temp.win32-2.4\Release\src creating build\temp.win32-2.4\Release\src\lxml C:\Program Files\Microsoft Visual C++ Toolkit 2003\bin\cl.exe /c /nologo /Ox /MD /W3 /GX /DNDEBUG -IC:\DevTools\python24\include -IC:\DevTools\python24\PC /Tcsrc/lxml/etree.c /Fobuild\temp.win32-2.4\Release\src/lxml/etree.obj -w -I..\libxml2-2.6.23.win32\include -I..\libxslt-1.1.15.win32\include -I..\zlib-1.2.3.win32\include -I..\iconv-1.9.1.win32\include cl : Command line warning D4025 : overriding '/W3' with '/w' etree.c C:\Program Files\Microsoft Visual C++ Toolkit 2003\bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\DevTools\python24\libs /LIBPATH:C:\DevTools\python24\PCBuild /EXPORT:initetree build\temp.win32-2.4\Release\src/lxml/etree.obj /OUT:build\lib.win32-2.4\lxml\etree.pyd /IMPLIB:build\temp.win32-2.4\Release\src/lxml\etree.lib ..\libxml2-2.6.23.win32\lib\libxml2_a.lib ..\libxslt-1.1.15.win32\lib\libxslt_a.lib ..\libxslt-1.1.15.win32\lib\libexslt_a.lib ..\zlib-1.2.3.win32\lib\zlib.lib ..\iconv-1.9.1.win32\lib\iconv_a.lib Creating library build\temp.win32-2.4\Release\src/lxml\etree.lib and object build\temp.win32-2.4\Release\src/lxml\etree.exp etree.obj : warning LNK4217: locally defined symbol _xmlFree imported in function ___pyx_f_5etree__bugFixURL etree.obj : warning LNK4217: locally defined symbol _xsltDocDefaultLoader imported in function _initetree etree.obj : warning LNK4217: locally defined symbol _xsltLibxsltVersion imported in function _initetree building 'lxml.objectify' extension C:\Program 
Files\Microsoft Visual C++ Toolkit 2003\bin\cl.exe /c /nologo /Ox /MD /W3 /GX /DNDEBUG -IC:\DevTools\python24\include -IC:\DevTools\python24\PC /Tcsrc/lxml/objectify.c /Fobuild\temp.win32-2.4\Release\src/lxml/objectify.obj -w -I..\libxml2-2.6.23.win32\include -I..\libxslt-1.1.15.win32\include -I..\zlib-1.2.3.win32\include -I..\iconv-1.9.1.win32\include cl : Command line warning D4025 : overriding '/W3' with '/w' objectify.c C:\Program Files\Microsoft Visual C++ Toolkit 2003\bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\DevTools\python24\libs /LIBPATH:C:\DevTools\python24\PCBuild /EXPORT:initobjectify build\temp.win32-2.4\Release\src/lxml/objectify.obj/OUT:build\lib.win32-2.4\lxml\objectify.pyd /IMPLIB:build\temp.win32-2.4\Release\src/lxml\objectify.lib ..\libxml2-2.6.23.win32\lib\libxml2_a.lib ..\libxslt-1.1.15.win32\lib\libxslt_a.lib ..\libxslt-1.1.15.win32\lib\libexslt_a.lib ..\zlib-1.2.3.win32\lib\zlib.lib ..\iconv-1.9.1.win32\lib\iconv_a.lib Creating library build\temp.win32-2.4\Release\src/lxml\objectify.lib and object build\temp.win32-2.4\Release\src/lxml\objectify.exp libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__WSAGetLastError at 0 referenced in function _socket_errno libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__WSACleanup at 0 referenced in function _xmlNanoHTTPCleanup libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__WSACleanup at 0 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__closesocket at 4 referenced in function _xmlNanoHTTPFreeCtxt libxml2_a.lib(nanoftp.obj) : error LNK2019: unresolved external symbol __imp__closesocket at 4 referenced in function _xmlNanoFTPNewCtxt libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__select at 20 referenced in function _xmlNanoHTTPSend libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__select at 20 libxml2_a.lib(nanohttp.obj) : 
error LNK2019: unresolved external symbol __imp__send at 16 referenced in function _xmlNanoHTTPSend libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__send at 16 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__recv at 16 referenced in function _xmlNanoHTTPRecv libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__recv at 16 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__getsockopt at 20 referenced in function _xmlNanoHTTPConnectAttempt libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol ___WSAFDIsSet at 8 referenced in function _xmlNanoHTTPConnectAttempt libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__connect at 12 referenced in function _xmlNanoHTTPConnectAttempt libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__connect at 12 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__ioctlsocket at 12 referenced in function _xmlNanoHTTPConnectAttempt libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__socket at 12 referenced in function _xmlNanoHTTPConnectAttempt libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__socket at 12 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__htons at 4 referenced in function _xmlNanoHTTPConnectHost libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__htons at 4 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__gethostbyname at 4 referenced in function _xmlNanoHTTPConnectHost libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol __imp__gethostbyname at 4 libxml2_a.lib(nanohttp.obj) : error LNK2019: unresolved external symbol __imp__WSAStartup at 8 referenced in function _xmlNanoHTTPInit libxml2_a.lib(nanoftp.obj) : error LNK2001: unresolved external symbol 
__imp__WSAStartup at 8 libxml2_a.lib(nanoftp.obj) : error LNK2019: unresolved external symbol __imp__listen at 8 referenced in function _xmlNanoFTPGetConnection libxml2_a.lib(nanoftp.obj) : error LNK2019: unresolved external symbol __imp__bind at 12 referenced in function _xmlNanoFTPGetConnection libxml2_a.lib(nanoftp.obj) : error LNK2019: unresolved external symbol __imp__getsockname at 12 referenced in function _xmlNanoFTPGetConnection build\lib.win32-2.4\lxml\objectify.pyd : fatal error LNK1120: 17 unresolved externals error: command '"C:\Program Files\Microsoft Visual C++ Toolkit 2003\bin\link.exe"' failed with exit status 1120 ==================== END ==================== From Holger.Joukl at LBBW.de Fri Sep 15 10:06:33 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Fri, 15 Sep 2006 10:06:33 +0200 Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders In-Reply-To: <45099163.4050005@gkec.informatik.tu-darmstadt.de> Message-ID: Hello Stefan, first of all congrats for bringing out the 1.1 release! We are currently working hard on basing our toolkit on lxml.objectify and so far it works like a charm. Especially the ObjectPath functionality and the ease of hooking custom element classes into the lookup mechanism is great. I really think anybody considering to use amara.bindery or gnosis.objectify should have a good look at lxml.objectify. Stefan Behnel schrieb am 14.09.2006 19:29:07: > Hi Holger, > > I didn't wait for this to settle for 1.1.0, but it can become available in > 1.1.1 if we see it fit. Not in a hurry :-) > Holger Joukl wrote: > > Stefan Behnel wrote: > >> One thing that comes to my mind is that we could add support for > >> replacing the default type classes used by ObjectifyElementClassLookup. 
> >> We could add keyword arguments so that you could say > >> > >> lookup = > > ObjectifyElementClassLookup(StringElement=MyStringElementClass) > >> That would currently work for String-, None- and ObjectifiedElement only, > >> as the others use the data type registry. Maybe we should rather support > >> something like "default_data_class" and "default_tree_class" (and keep > >> the NoneElement, which is only used in a well defined case anyway). > > I chose "tree_class" and "empty_data_class" now. I think that's sufficiently > telling. I'm still not quite sure what you mean by that. In my words: That will allow for customizing the behaviour when encountering - an empty leaf node (empty_data_class gets chosen) - an empty tree node = a node that contains no text but has children (tree_class gets chosen) This is to not force an objectify user to follow our arbitrary (though with good reason ;-) decision to use StringElement for empty leaves. Right? > However, if you want StringElement in some cases and ObjectifiedElement in > other cases, that's difficult to achieve at the Python level, as it would > require passing information about the C node to allow taking the decision. > > > > So I'm back to suggesting a TreeElement() factory (not the best name, > > maybe) > > returning an ObjectifiedElement with a new pytype='ObjectifiedElement' > > which keeps it from becoming a string. I think that's still nicer than > > "stringifying" every single empty leaf when parsing from XML. > > What about adding the attribute in objectify.Element()? You can't normally > change the data value of an Element itself, so the only real reason why you > would call objectify.Element() is to create a structural element (usually a > root node). > > I called the corresponding pytype value "TREE" for now, I think it's unlikely > that someone would use that as custom type name. > > Stefan Great, I'll try it out. But I'm still voting for a TreeElement() factory as I'll have to write s.th. 
like this anyway:

def TreeElement():
    return objectify.Element('tree', {objectify.PYTYPE_ATTRIBUTE: 'TREE'})

And such a factory would complement the objectify module interface nicely, given there's also the DataElement function, imho.

Btw. I've found that I rather often use
  ElementBase.__len__()
to get the child count/find out if an element has children. What do you think about adding a hasChildren() or countChildren() function to objectify?

Regards,
Holger

Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an.

The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

From Holger.Joukl at LBBW.de Fri Sep 15 11:45:03 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 15 Sep 2006 11:45:03 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To: <20060915.CcY.88727500@groupware>
Message-ID:

Hi Stefan,

"Stefan Behnel" wrote on 15.09.2006 11:11:24:
> > I really think anybody considering to use amara.bindery or gnosis.objectify
> > should have a good look at lxml.objectify.
>
> Hear, hear! :)
>
> Can we quote you on our web page?
>
> Like: "Google uses Python, but for critical stuff, my bank depends on lxml,
> because ..."
Yes you can, and I'd also be willing to write a short success story or s.th. like that. But let's just delay any quoting or success stories for now, until we have actually realized the success ;-) > > > I chose "tree_class" and "empty_data_class" now. I think that's > > > sufficiently telling. > > > > I'm still not quite sure what you mean by that. In my words: That will > > allow for > > customizing the behaviour when encountering > > - an empty leaf node (empty_data_class gets chosen) > > - an empty tree node = a node that contains no text but has children > > (tree_class gets chosen) > > Right. There are a few more rules: > has no parent -> tree > has xsi:nil attribute -> NullElement > has parsable type -> type class > > > > This is to not force an objectify user to follow our arbitrary (though with > > good reason ;-) decision to use StringElement for empty leaves. > > Right? > > For that, yes, and for easily replacing the inner tree classes. I think that > can be a pretty helpful thing if you want to extend the API. You can now > replace the type classes through PyType, and the inner tree class and the > default leaf class through the lookup mechanism. I think that's all you might > need to extend objectify. Understood, thanks. > > > What about adding the attribute in objectify.Element()? You can't > > > normally > > > change the data value of an Element itself, so the only real reason why > > > you > > > would call objectify.Element() is to create a structural element (usually > > > a root node). > > > > > > I called the corresponding pytype value "TREE" for now, I think it's > > > unlikely that someone would use that as custom type name. > > > > Great, I'll try it out. But I'm still voting for a TreeElement() factory > > as I'll have to write s.th. like this anyway: > > > > def TreeElement(): > > return objectify.Element('tree', {objectify.PYTYPE_ATTRIBUTE: 'TREE'}) > > No, that's redundant. Just use > > objectify.Element('tree') Fine. 
Just wanted to try it out and got this error:

>>> objectify.Element('foo', {'a': '1'})
Traceback (most recent call last):
  File "", line 1, in ?
  File "objectify.pyx", line 1481, in objectify.Element
NameError: python

I don't really understand that as python. is used all over the place. Does this have s.th. to do with python being cimported and so far having only been used in the cdef'ed stuff but now it shows up in a def'ed module function? I must admit I haven't actually looked very much at pyrex itself. Will now try to add an import python somewhere...

> I'll add a countchildren() method, then.
>
> Stefan

Thanks,
Holger

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Sep 15 17:42:56 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 15 Sep 2006 17:42:56 +0200
Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders
In-Reply-To:
References:
Message-ID: <450ACA00.2050401@gkec.informatik.tu-darmstadt.de>

Hi Holger,

Holger Joukl wrote:
> Stefan Behnel wrote:
>> Just use
>>
>>   objectify.Element('tree')
>
> Fine.
> > Just wanted to try it out and got this error: > >>>> objectify.Element('foo', {'a': '1'}) > Traceback (most recent call last): > File "", line 1, in ? > File "objectify.pyx", line 1481, in objectify.Element > NameError: python > > I don't really understand that as python. is used all over the > place. Pyrex gives you that when you use a C function that's not declared. In this case, it does not even exist. The real name is PyDict_Size(). Should be fixed now. DataElement had the same bug, BTW, I had copied the code from there. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Sep 15 17:57:31 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 15 Sep 2006 17:57:31 +0200 Subject: [lxml-dev] [objectify] inconsistent element types for different element access orders In-Reply-To: References: Message-ID: <450ACD6B.90105@gkec.informatik.tu-darmstadt.de> ## forgot to send this to the list also ## Hi Holger, Holger Joukl (Holger.Joukl at LBBW.de) wrote: > > first of all congrats for bringing out the 1.1 release! We are currently > > working hard on basing our toolkit on lxml.objectify and so far it works > > like a charm. Cool, that's good to know. > > Especially the ObjectPath functionality and the ease of hooking custom > > element classes into the lookup mechanism is great. :) Custom element classes were one of the first thing I implemented when I came to lxml. And I really like the way they now fit into the new lookup framework. > > I really think anybody considering to use amara.bindery or gnosis.objectify > > should have a good look at lxml.objectify. Hear, hear! :) Can we quote you on our web page? Like: "Google uses Python, but for critical stuff, my bank depends on lxml, because ..." >> > > I chose "tree_class" and "empty_data_class" now. I think that's >> > > sufficiently telling. > > > > I'm still not quite sure what you mean by that. 
In my words: That will
> > allow for
> > customizing the behaviour when encountering
> > - an empty leaf node (empty_data_class gets chosen)
> > - an empty tree node = a node that contains no text but has children
> >   (tree_class gets chosen)

Right. There are a few more rules:

has no parent -> tree
has xsi:nil attribute -> NullElement
has parsable type -> type class

> > This is to not force an objectify user to follow our arbitrary (though with
> > good reason ;-) decision to use StringElement for empty leaves.
> > Right?

For that, yes, and for easily replacing the inner tree classes. I think that
can be a pretty helpful thing if you want to extend the API. You can now
replace the type classes through PyType, and the inner tree class and the
default leaf class through the lookup mechanism. I think that's all you might
need to extend objectify.

>> > > What about adding the attribute in objectify.Element()? You can't
>> > > normally
>> > > change the data value of an Element itself, so the only real reason why
>> > > you
>> > > would call objectify.Element() is to create a structural element (usually
>> > > a root node).
>> > >
>> > > I called the corresponding pytype value "TREE" for now, I think it's
>> > > unlikely that someone would use that as custom type name.
> >
> > Great, I'll try it out. But I'm still voting for a TreeElement() factory
> > as I'll have to write s.th. like this anyway:
> >
> > def TreeElement():
> >     return objectify.Element('tree', {objectify.PYTYPE_ATTRIBUTE: 'TREE'})

No, that's redundant. Just use

    objectify.Element('tree')

As I said, the main reason to call Element() is to create a tree element. If
you want a data element, call DataElement(). So, no TreeElement() needed.

> > Btw. I've found that I rather often use
> > ElementBase.__len__()
> > to get the childcount/find out if an element has children.
> > What do you think about adding a hasChildren() or countChildren() function
> > to objectify?

Interesting.
I didn't realise there isn't really a way to find out about the children.
They are in dir(), but only together with all methods etc. There's
.getchildren(), but that builds all children, so it's less efficient than just
counting them.

A countchildren() method on ObjectifiedElement would match getchildren(), but
it also adds another name that cannot be used to look up children. Maybe
"countchildren" is a good one in that regard, though, as it's not really a good
name for an XML tag.

I'm -0 on haschildren(), though, as you can always call countchildren() if you
expect the number to be small (note that this is about data binding, so very
large documents are unlikely already) or use iterchildren().next() if you
expect it to be really large. Child traversal is so fast that it shouldn't make
too much of a difference if you count 100 children or only the first one. The
method call overhead may even be the dominating factor here...

I'll add a countchildren() method, then.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Sep 15 21:45:35 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 15 Sep 2006 21:45:35 +0200
Subject: [lxml-dev] Handling of ...
In-Reply-To: <4509D409.3060809@holloway.co.nz>
References: <4509D409.3060809@holloway.co.nz>
Message-ID: <450B02DF.10905@gkec.informatik.tu-darmstadt.de>

Hi,

Matthew Cruickshank wrote:
> I've been writing this XML chain/pipeline - sort of a multi-pipeline
> version of Apache Cocoon and lxml has been great -- very fast!

Thanks for sharing that. Sounds like an interesting project. Any chance it will
be open-source? Feel free to post to the list when there is anything to see.

> I'm having some problems with and I don't know how to read
> errors caused by that. They terminate the processing but I can't find
> the error message text (from within the tag) anywhere...
> it doesn't seem to be in error_log and no exception is thrown.

Right. I checked that.
It was actually a deeper bug that prevented XSLT error messages in general
from turning up in the local error log of XSLT objects. Should be fixed now
(both trunk and 1.1 branch). A side effect is that some of the error messages
from XSLT are much clearer now.

Another side effect of the fix is that lxml.etree no longer compiles without
libxslt (which was possible in 1.0 by commenting out the line
"include 'xslt.pxi'" in etree.pyx). Not too much of a drawback, though...

> Any ideas -- am I looking in the wrong place?
>
> I'm using to assert things about document structure and
> runtime parameters (things that couldn't be asserted in RelaxNG, I think).

Well, 'document structure' is pretty much the thing that RelaxNG was made for,
so I don't see why XSLT should be any better. And you might also consider
generating (or modifying, parametrising, ...) an RNG on the fly, e.g. through
an XSLT.

What do you mean by "runtime parameters"? XSLT parameters?

Stefan

From sidnei at enfoldsystems.com Sun Sep 17 04:13:08 2006
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Sat, 16 Sep 2006 23:13:08 -0300
Subject: [lxml-dev] Stylesheet Processing Instruction
Message-ID: <20060917021308.GK8537@cotia>

Hi there,

Someone asked me if lxml would handle a 'Stylesheet Processing Instruction',
which seems to be the way to embed the stylesheet into the XML to be
transformed, i.e., if you use the said instruction and open the XML in the
browser (IE and Firefox?) the browser automatically applies the transform.

Since the 'xsltproc' command also seems to do this, according to its man page,
I expected lxml to do so as well, but didn't actually try.

So, can anyone confirm/deny if it's supported?
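
For readers unfamiliar with it: the instruction in question is the
xml-stylesheet processing instruction, which sits in the document prolog and
names the stylesheet through pseudo-attributes such as type and href. The
sketch below only shows how such an instruction can be located and read. It
deliberately uses the standard library's xml.dom.minidom rather than lxml,
since whether lxml handles the instruction is exactly the open question here.
The function name find_stylesheet_pis, the sample document and the file name
style.xsl are made up for illustration.

```python
import re
from xml.dom import minidom


def find_stylesheet_pis(xml_text):
    """Return the pseudo-attributes of each top-level xml-stylesheet PI.

    Hypothetical helper for illustration only; it does not apply any
    stylesheet, it just reports what the document asks for.
    """
    doc = minidom.parseString(xml_text)
    found = []
    # Processing instructions in the prolog are direct children of the document.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The PI data looks like attributes but is plain text, so it is
            # parsed by hand. Simplification: double-quoted values only.
            pairs = re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data)
            found.append(dict(pairs))
    return found


sample = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>\n'
    '<doc><item>TestText</item></doc>'
)

print(find_stylesheet_pis(sample))
```

A tool that honours the instruction would then load the stylesheet named by
href and apply it to the document; that step is left out here on purpose.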
--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From faassen at infrae.com Mon Sep 18 13:13:18 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Mon, 18 Sep 2006 13:13:18 +0200
Subject: [lxml-dev] egg package upload commands
Message-ID: <450E7F4E.2030802@infrae.com>

Hi there,

I wonder whether I'm doing something wrong, as previously this worked. To
verify I'm not doing something stupid, could someone please confirm that this
is the proper sequence for uploading the lxml egg?

Download the tgz from codespeak.net. This contains the .c file that has been
created by our special version of Pyrex. Then unpack it, and:

$ python2.4 setup.py build
[go to lib dir, strip the .so files]
$ python2.4 setup.py bdist_egg
$ python2.4 setup.py upload

The last step fails with the following error:

error: No dist file created in earlier command

Given that I get this with another package too, this means one of two things:
I'm doing something wrong, or my version of setuptools has a bug...

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Sep 18 07:40:59 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 18 Sep 2006 07:40:59 +0200
Subject: [lxml-dev] XSLT profiling
Message-ID: <450E316B.9040202@gkec.informatik.tu-darmstadt.de>

Hi,

while I was looking through the libxslt API to check for stylesheet PI support,
I stumbled over the profiling support that I had completely forgotten about.
The result is that lxml.etree can now profile stylesheet runs.

If you pass 'profile_run=True' to the transform call, the new 'xslt_profile'
property of the result tree will contain an ElementTree with profiling data for
each template, similar to the following (a 1-template XSLT result):