From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 2 11:35:43 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 02 Oct 2006 11:35:43 +0200
Subject: [lxml-dev] lxml replace() deletes tail
In-Reply-To: <451D35D7.4040306@openplans.org>
References: <451D35D7.4040306@openplans.org>
Message-ID: <4520DD6F.7090206@gkec.informatik.tu-darmstadt.de>

Hi,

Chris Abraham wrote:
> We have a question about the etree.replace() function. We found that it
> doesn't preserve the tail text from the replaced node when inserting a
> new node. Perhaps this is the intended behavior, but, to us, it was
> unexpected. In the example below, notice how the "tail" text is deleted
> when the element is replaced:
>
> >>> tree = etree.HTML("text before textin tail")
> >>> newel = etree.HTML("new")
> >>> tree[0].replace(tree[0][0], newel[0][0])
> >>> etree.tostring(tree)
> 'text before new'

That *is* the expected behaviour. :)

When you replace the element "textin tail" with the element "new" you get
"new". Note that the tail is a property of the element, so it would rather
be unexpected if the replaced element copied its own tail over to the new
element.

You can always copy the tail from the original element by hand, in case you
need to.

Stefan

From faassen at infrae.com Tue Oct 3 15:22:21 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 03 Oct 2006 15:22:21 +0200
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <451BAEF8.1020101@infrae.com>
References: <451BAEF8.1020101@infrae.com>
Message-ID: <4522640D.1040105@infrae.com>

Hey,

I've just expanded on this and added some more explanation on my weblog,
here:

http://faassen.n--tree.net/blog/view/weblog/2006/10/03/0

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 4 09:14:38 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 04 Oct 2006 09:14:38 +0200
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <4522640D.1040105@infrae.com>
References: <451BAEF8.1020101@infrae.com> <4522640D.1040105@infrae.com>
Message-ID: <45235F5E.20307@gkec.informatik.tu-darmstadt.de>

Martijn Faassen wrote:
> I've just expanded on this and added some more explanation on my weblog,
> here:
>
> http://faassen.n--tree.net/blog/view/weblog/2006/10/03/0

Thanks, Martijn. Sounds pretty cool, but I'll have to see when I find the
time to look into it.
Stefan

From sidnei at enfoldsystems.com Wed Oct 4 15:02:12 2006
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Wed, 4 Oct 2006 10:02:12 -0300
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <45235F5E.20307@gkec.informatik.tu-darmstadt.de>
References: <451BAEF8.1020101@infrae.com> <4522640D.1040105@infrae.com> <45235F5E.20307@gkec.informatik.tu-darmstadt.de>
Message-ID: <20061004130212.GH4262@cotia>

On Wed, Oct 04, 2006 at 09:14:38AM +0200, Stefan Behnel wrote:
| Martijn Faassen wrote:
| > I've just expanded on this and added some more explanation on my weblog,
| > here:
| >
| > http://faassen.n--tree.net/blog/view/weblog/2006/10/03/0

Hey Martijn,

Would you like to add the buildout to the buildbot so it gets tested too?

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From faassen at infrae.com Wed Oct 4 17:33:03 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 04 Oct 2006 17:33:03 +0200
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <20061004130212.GH4262@cotia>
References: <451BAEF8.1020101@infrae.com> <4522640D.1040105@infrae.com> <45235F5E.20307@gkec.informatik.tu-darmstadt.de> <20061004130212.GH4262@cotia>
Message-ID: <4523D42F.6030401@infrae.com>

Sidnei da Silva wrote:
> On Wed, Oct 04, 2006 at 09:14:38AM +0200, Stefan Behnel wrote:
> | Martijn Faassen wrote:
> | > I've just expanded on this and added some more explanation on my weblog,
> | > here:
> | >
> | > http://faassen.n--tree.net/blog/view/weblog/2006/10/03/0
>
> Hey Martijn,
>
> Would you like to add the buildout to the buildbot so it gets tested too?

I've no idea how to do that and don't really have time to learn buildbot,
but if someone can do that then it might be a nice way to test a number of
versions of lxml against a number of versions of libxml2/libxslt
automatically. It'd essentially just take a bunch of different buildout.cfg
files with some download URLs and version numbers changed.

Regards,

Martijn

From sidnei at enfoldsystems.com Wed Oct 4 20:31:57 2006
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Wed, 4 Oct 2006 15:31:57 -0300
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <4523D42F.6030401@infrae.com>
References: <451BAEF8.1020101@infrae.com> <4522640D.1040105@infrae.com> <45235F5E.20307@gkec.informatik.tu-darmstadt.de> <20061004130212.GH4262@cotia> <4523D42F.6030401@infrae.com>
Message-ID: <20061004183157.GG4164@cotia>

On Wed, Oct 04, 2006 at 05:33:03PM +0200, Martijn Faassen wrote:
| I've no idea how to do that and don't really have time to learn
| buildbot, but if someone can do that then it might be a nice way to test
| a number of versions of lxml against a number of versions of
| libxml2/libxslt automatically.

I'm not asking you to do that. I'm asking if you want to see it
done. :)

| It'd essentially just take a bunch of
| different buildout.cfg files with some download URLs and version numbers
| changed.

Sounds good. I will take a look at that.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From Holger.Joukl at LBBW.de Thu Oct 5 14:03:50 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Thu, 5 Oct 2006 14:03:50 +0200
Subject: [lxml-dev] [objectify] StringElement: Implementing string methods, revisited
In-Reply-To: <451AA12F.2040801@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hi,

Stefan Behnel wrote on 27.09.2006 18:05:03:
>
> Maintenance is not my main concern. The problem is that we provide an
> incomplete interface here, so it's "kinda compatible, but not quite", which I
> consider worse than "no string methods there". I fear that the choice of
> methods may look too arbitrary to understand.
>
> But as I said, feel free to convince me.
>
> Stefan

I've experimented with that some more and came to think you're right.
It's more of a documentation problem than maintenance and it is a lot more
concise to have "wanna use string methods, use .pyval" than having a bunch
of supported and some unsupported string methods.

Greetings,
Holger

Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene
Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde,
verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail
sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht
gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht
garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte
den Inhalt der E-Mail als Hardcopy an.

The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error,
please notify the sender immediately and then delete this e-mail. Any
unauthorized copying and transmission is forbidden. E-Mail transmission
cannot be guaranteed to be secure. If verification is required, please
request a hard copy version.

From Holger.Joukl at LBBW.de Thu Oct 5 17:23:24 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Thu, 5 Oct 2006 17:23:24 +0200
Subject: [lxml-dev] [objectify] optimization issues
In-Reply-To:
Message-ID:

Hi,

I'm currently running into some optimization issues. Be warned, this post
is rather lengthy...

First some background: I'm experimenting with a custom objectified datetime
class based on Python's datetime that employs the dateutil.parser module to
detect if some element value is in a valid datetime format, i.e. the parse
function from dateutil.parser is used to implement the type_check for the
PyType type registry.

1) Invoking this parse method is quite expensive, so I want this to happen
rarely.
As I am using "recursive element dumping" as default I found that for every
__str__ call .pyval of the ObjectifiedDataElements in a tree is accessed,
which in turn triggers parsing for my custom datetime class.
As I don't really see a way to avoid this I propose the introduction of an
additional property "_pyval_repr" that can be overridden in subclasses,
which makes it possible to simply return element.text if getting .pyval is
expensive. Something like:

*** ORIG/lxml-1.1/src/lxml/objectify.pyx Wed Sep 27 09:18:30 2006
--- src/lxml/objectify.pyx Wed Oct 4 11:00:09 2006
***************
*** 484,489 ****
--- 484,493 ----
          def __get__(self):
              return textOf(self._c_node)

+     property _pyval_repr:
+         def __get__(self):
+             return self.pyval
+
      def __str__(self):
          return textOf(self._c_node) or ''

***************
*** 931,938 ****
  cdef object _dump(_Element element, int indent):
      indentstr = " " * indent
!     if hasattr(element, "pyval"):
!         value = element.pyval
      else:
          value = textOf(element._c_node)
      if value and not value.strip():
--- 935,942 ----
  cdef object _dump(_Element element, int indent):
      indentstr = " " * indent
!     if hasattr(element, "_pyval_repr"):
!         value = element._pyval_repr
      else:
          value = textOf(element._c_node)
      if value and not value.strip():

This can substantially speed up things for complicated type_check routines
(in my use case :)

2) Then, I figured to reduce the calls to ObjectifiedElement.__str__ in
general. I am using a custom logging module that involves a function that
converts its input arguments to strings, concatenates them and then writes
them out through the logger (which substitutes stdout) if the loglevel of
the caller meets the set loglevel for the output file/stdout.
As the conversion to strings is performed before any loglevel checking,
reversing this order leads to far fewer str() calls on the objects.
To my astonishment things actually slowed down massively, though.
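
The reordering described in 2) might look roughly like the sketch below. It
uses the standard logging module rather than the custom module mentioned
above (whose interface isn't shown in this post), and the names log_eager,
log_lazy and Noisy are made up purely for illustration.

```python
import logging

logger = logging.getLogger("sketch")
logger.addHandler(logging.NullHandler())  # swallow output, this is only a demo
logger.propagate = False
logger.setLevel(logging.WARNING)


class Noisy:
    """Hypothetical stand-in for an object whose __str__ is expensive."""

    calls = 0

    def __str__(self):
        Noisy.calls += 1
        return "noisy"


def log_eager(level, *args):
    # Old order: convert every argument, then let the logger drop the message.
    message = " ".join(str(arg) for arg in args)
    logger.log(level, message)


def log_lazy(level, *args):
    # New order: check the level first, so str() only runs for emitted messages.
    if logger.isEnabledFor(level):
        logger.log(level, " ".join(str(arg) for arg in args))
```

With the level set to WARNING, a DEBUG call through log_eager still invokes
__str__ on every argument, while log_lazy skips the conversion entirely.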
I tried to come up with a minimal example of what seems to happen, using
only lxml standard:

Runs slow:
==========
python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
print root.i
print root.f
print root.s
print root.d
""" "n = root.i; n = root.f; n = root.s; n = root.d"
17
238.3343
what
2006-03-03
10 loops -> 0.0102 secs
17
238.3343
what
2006-03-03
100 loops -> 0.101 secs
17
238.3343
what
2006-03-03
1000 loops -> 1.02 secs
17
238.3343
what
2006-03-03
17
238.3343
what
2006-03-03
17
238.3343
what
2006-03-03
raw times: 1.03 1.02 1.02
1000 loops, best of 3: 1.02 msec per loop

Runs fast:
==========
python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
print root
""" "n = root.i; n = root.f; n = root.s; n = root.d"
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
10 loops -> 0.00109 secs
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
100 loops -> 0.00928 secs
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
1000 loops -> 0.0897 secs
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
10000 loops -> 0.905 secs
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
root = None [ObjectifiedElement]
    i = 17 [IntElement]
    f = 238.33430000000001 [FloatElement]
    s = 'what' [StringElement]
    d = '2006-03-03' [StringElement]
raw times: 0.893 0.911 0.911
10000 loops, best of 3: 89.3 usec per loop

Recursively outputting root before accessing its child elements really
speeds things up, even though I accessed all elements in the slow example,
too.

Why is this? I'm clueless.

Holger

From faassen at infrae.com Thu Oct 5 18:30:41 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu, 05 Oct 2006 18:30:41 +0200
Subject: [lxml-dev] a buildout for lxml
In-Reply-To: <20061004183157.GG4164@cotia>
References: <451BAEF8.1020101@infrae.com> <4522640D.1040105@infrae.com> <45235F5E.20307@gkec.informatik.tu-darmstadt.de> <20061004130212.GH4262@cotia> <4523D42F.6030401@infrae.com> <20061004183157.GG4164@cotia>
Message-ID: <45253331.5040706@infrae.com>

Sidnei da Silva wrote:
> On Wed, Oct 04, 2006 at 05:33:03PM +0200, Martijn Faassen wrote:
> | I've no idea how to do that and don't really have time to learn
> | buildbot, but if someone can do that then it might be a nice way to test
> | a number of versions of lxml against a number of versions of
> | libxml2/libxslt automatically.
>
> I'm not asking you to do that. I'm asking if you want to see it
> done. :)

Oh, I misread "would you like to add the buildout to the buildbot?" in your
mail as me actually doing something. :) Of course I'd like it done! :)

> | It'd essentially just take a bunch of
> | different buildout.cfg files with some download URLs and version numbers
> | changed.
>
> Sounds good. I will take a look at that.

Great! Thanks.

Martijn

From achimkern at hirschmanngmbh.com Fri Oct 6 10:40:04 2006
From: achimkern at hirschmanngmbh.com (Achim Kern)
Date: Fri, 06 Oct 2006 10:40:04 +0200
Subject: [lxml-dev] Building Problems
In-Reply-To: <451A9FF6.3010108@gkec.informatik.tu-darmstadt.de> (behnel ml's message of "27 Sep 2006 15:59:50 UT")
References: <451A9FF6.3010108@gkec.informatik.tu-darmstadt.de>
Message-ID: <87vemxet23.fsf@hirschmanngmbh.com>

Hi Stefan,

thanks for your rapid answer. As I wasn't in the office until today, I
wasn't able to answer. Sorry for this.

behnel_ml at gkec.informatik.tu-darmstadt.de writes:
> Hi Achim,
>
> Achim Kern wrote:
>> while googling on how to write easier xml datastores with python I
>> just found your project. Especially the objectify module impressed
>> me. So to test things I wanted to install it. Unfortunately I can not
>> use the provided debian package as there is only one for version 1.03
>> not including the objectify extension. So I downloaded the source of
>> 1.1.1 from codespeak.net extracted it and that's what it did.
>>
>> # tar -xzf lxml-1.1.1.tgz
>> # ce lxml-1.1.1
>> # make clean test.
>> python setup.py build_ext -i
>> Building lxml version 1.1
>
> 1.1? Not 1.1.1?

I tried both but neither seemed to work for me.

>> running build_ext
>> python test.py -p -v
>
> You did build it, right? I assume this is a second try after already having
> built it once.

I assume not. :-(

> Did you do "make clean" in between? That removes the ".c" files, which means
> you need a special Pyrex version to rebuild it. See "doc/build.txt". If you
> only unpack the tgz and build from that, you should not need Pyrex as the ".c"
> files are included.
>
> Please retry the above with a clean setup and if that still fails, send a
> complete copy of your attempted commands and the resulting output to the list.

I tested it with a clean version which I downloaded and it builds like a
dream. I'm really not clear what happened the first time. Maybe it was
because I messed something up with the debian package which I had
installed.

Sorry for wasting your time.

Regards
Achim

From Holger.Joukl at LBBW.de Fri Oct 6 11:51:16 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 6 Oct 2006 11:51:16 +0200
Subject: [lxml-dev] [objectify] DataElement factory problem
In-Reply-To:
Message-ID:

Hi,

I ran into a problem using the objectify DataElement factory function.
When implementing an _init method in a derived ObjectifiedDataElement
class, it is impossible to access the element.text in _init because this
has not yet been set when _init gets called by _elementFactory.
I don't see a nice clean way to solve that. Maybe instrument
_elementFactory with an optional skip_init argument that allows for a
delayed manual call of _init in corner cases?
Holger

From Holger.Joukl at LBBW.de Fri Oct 6 17:24:24 2006
From: Holger.Joukl at LBBW.de (Holger Joukl)
Date: Fri, 6 Oct 2006 17:24:24 +0200
Subject: [lxml-dev] [objectify] optimization issues
In-Reply-To:
Message-ID:

Hi,

as a followup to my last post some more strange observations.
To find out why the call to str(root) aka objectify.dump(root) speeds up
things:

python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.dump(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.000898 secs
100 loops -> 0.00887 secs
1000 loops -> 0.0885 secs
10000 loops -> 0.887 secs
raw times: 0.893 0.899 0.903
10000 loops, best of 3: 89.3 usec per loop

I implemented a visit function that does nothing more than visit every
node:

def visit(_Element element not None):
    """Return a recursively generated string representation of an element.
    """
    _visit(element)

cdef object _visit(_Element element):
    for child in element.iterchildren():
        _visit(child)

But:

/apps/pydev/gcc/3.4.4/bin/python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.visit(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.0104 secs
100 loops -> 0.103 secs
1000 loops -> 1.04 secs
raw times: 1.04 1.02 1.03
1000 loops, best of 3: 1.02 msec per loop

This is actually much slower, again.
Now if I change the visit code to:

def visit(_Element element not None):
    """Return a recursively generated string representation of an element.
    """
    _visit(element)

cdef object _visit(_Element element):
    element.items() # my only addition
    for child in element.iterchildren():
        _visit(child)

Now it's fast, again:

python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.visit(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.000887 secs
100 loops -> 0.0087 secs
1000 loops -> 0.088 secs
10000 loops -> 0.874 secs
raw times: 0.876 0.865 0.87
10000 loops, best of 3: 86.5 usec per loop

All of this because of the additional element.items()???

I'm lost. Hope somebody can point out a serious misunderstanding of mine,
where my systematic testing error lies or come up with an actual
explanation :)

As I'm abroad next week I'll follow up on this Tuesday in a week.

Greetings,
Holger

From sidnei at enfoldsystems.com Fri Oct 6 17:57:33 2006
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Fri, 6 Oct 2006 12:57:33 -0300
Subject: [lxml-dev] 'dist' directory in Pyrex
Message-ID: <20061006155733.GD4491@cotia>

Is it intentional that there's a 'dist' directory in the lxml copy of
Pyrex? I suspect that it shouldn't be there (makes checkout needlessly
long).

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From sidnei at enfoldsystems.com Fri Oct 6 18:10:32 2006
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Fri, 6 Oct 2006 13:10:32 -0300
Subject: [lxml-dev] setup.py won't work with svn 1.4
Message-ID: <20061006161031.GE4491@cotia>

Subversion 1.4 changed the .svn/entries format. setup.py will break.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 9 11:04:00 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 09 Oct 2006 11:04:00 +0200
Subject: [lxml-dev] 'dist' directory in Pyrex
In-Reply-To: <20061006155733.GD4491@cotia>
References: <20061006155733.GD4491@cotia>
Message-ID: <452A1080.1040702@gkec.informatik.tu-darmstadt.de>

Sidnei da Silva wrote:
> Is it intentional that there's a 'dist' directory in the lxml copy of
> Pyrex? I suspect that it shouldn't be there (makes checkout needlessly
> long).

True, just removed the content.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 9 19:21:04 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 09 Oct 2006 19:21:04 +0200
Subject: [lxml-dev] [objectify] optimization issues
In-Reply-To:
References:
Message-ID: <452A8500.4000905@gkec.informatik.tu-darmstadt.de>

Hi Holger,

first of all: please create a new thread for a new topic instead of
responding to an existing message.
Most mail clients honour the "in reply to" hint in the header and sort the
reply into the old thread.

Then: what you observe are most likely GC 'issues'. The thing is: if the
element already exists as a Python object, it is reused, which is much
faster than creating a new one. So in the cases where your code runs
faster, you can assume that the object survived a larger portion of your
code without being re-instantiated. Especially recursive printing
instantiates the entire tree, so if the objects are not deleted directly
afterwards, this has a performance effect on code that runs afterwards.

Stefan

From cabraham at openplans.org Mon Oct 9 23:41:19 2006
From: cabraham at openplans.org (Chris Abraham)
Date: Mon, 09 Oct 2006 17:41:19 -0400
Subject: [lxml-dev] lxml and html encodings
Message-ID: <452AC1FF.3030403@openplans.org>

Hello,

We are getting some unexpected behavior when processing documents with a
Shift_JIS encoding.
We are trying to serialize an HTML document using an XSLT transform.
Our results don't agree with the FAQ:
http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings.
Please see the comments in the attached demo.py which reads in home.html
and demonstrates our problem.

Any ideas about this?

Thanks.
Chris
-------------- next part --------------
A non-text attachment was scrubbed...
Name: demo.py
Type: text/x-python
Size: 857 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20061009/40a644a3/attachment.py
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20061009/40a644a3/attachment.html From ianb at colorstudy.com Wed Oct 11 00:16:51 2006 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 10 Oct 2006 17:16:51 -0500 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <452AC1FF.3030403@openplans.org> References: <452AC1FF.3030403@openplans.org> Message-ID: <452C1BD3.3050804@colorstudy.com> Chris Abraham wrote: > Hello, > We are getting some unexpected behavior when processing documents with a > Shift_JIS encoding. > We are trying to serialize an HTML document using an XSLT transform. > Our results don't agree with the FAQ: > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings. > Please see the comments in the attached demo.py which reads in home.html > and demonstrates our problem. Does etree.HTML() pay any attention to ? I notice it generates that tag (through the XSL I assume), but the parser doesn't necessarily have the same logic. I think for HTML it is better if the encoding is determined before parsing, as there's several types of information that come into play. I think the FAQ entry doesn't really apply here, since it isn't really XML. This library probably has the best rules for determining encoding: http://chardet.feedparser.org/ -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org From ianb at colorstudy.com Wed Oct 11 00:22:29 2006 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 10 Oct 2006 17:22:29 -0500 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <452C1BD3.3050804@colorstudy.com> References: <452AC1FF.3030403@openplans.org> <452C1BD3.3050804@colorstudy.com> Message-ID: <452C1D25.4050403@colorstudy.com> Ian Bicking wrote: > I think for HTML it is better if the encoding is determined before > parsing, as there's several types of information that come into play. I > think the FAQ entry doesn't really apply here, since it isn't really > XML. 
This library probably has the best rules for determining encoding: > http://chardet.feedparser.org/ Actually, now that I look at this library it's probably more clever than necessary. Generally there should be good encoding information already present in the request, and you don't need heuristics like this to figure it out. Nevertheless, you should probably figure out decoding early, before parsing. To figure out the encoding specified in the tag, you should probably just use a regular expression (since you can't very well parse it to figure out how to decode it before you pass it to the parser). -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org From ltucker at openplans.org Wed Oct 11 16:10:21 2006 From: ltucker at openplans.org (Luke Tucker) Date: Wed, 11 Oct 2006 10:10:21 -0400 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <452C1BD3.3050804@colorstudy.com> References: <452AC1FF.3030403@openplans.org> <452C1BD3.3050804@colorstudy.com> Message-ID: <1160575821.19594.65.camel@ltucker.openplans.org> > Does etree.HTML() pay any attention to content="text/html; charset=Shift_JIS"> ? [...] > I think for HTML it is better if the encoding is determined before > parsing, as there's several types of information that come into play. I > think the FAQ entry doesn't really apply here, since it isn't really > XML. [...] I'm not certain. The FAQ entry says that using HTML unicode strings with charset meta tags also does not work. I thought that meant parsing via etree.HTML(). We can certainly extract the encoding and decode to a unicode string before calling the parser, but it seemed like we ought to get some clarification on the intended behavior as well. 
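
A rough sketch of that "extract the encoding and decode before calling the
parser" step, using only the standard library. The function names
sniff_charset and decode_html, the regular expression and the UTF-8
fallback are assumptions made for illustration; none of this is provided by
lxml or libxml2, and a real page may need more careful handling (HTTP
headers, BOMs, broken markup).

```python
import codecs
import re

# Matches charset=... inside a <meta ...> tag; deliberately simple.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is usable."""
    match = _META_CHARSET.search(raw[:4096])  # only look near the document start
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes to text using the sniffed (or default) charset."""
    return raw.decode(sniff_charset(raw, default), errors="replace")
```

The decoded text still contains the original charset declaration, so whether
the parser accepts it unchanged is exactly the open question in this thread.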
- Luke From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 12 18:48:54 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 12 Oct 2006 18:48:54 +0200 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <452AC1FF.3030403@openplans.org> References: <452AC1FF.3030403@openplans.org> Message-ID: <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> Hi, Chris Abraham wrote: > We are getting some unexpected behavior when processing documents with a > Shift_JIS encoding. > We are trying to serialize an HTML document using an XSLT transform. > Our results don't agree with the FAQ: > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings. > Please see the comments in the attached demo.py which reads in home.html > and demonstrates our problem. I looked into it and found that the behaviour of the libxml2 parser depends on the position of the tag. Your HTML is pretty broken in many regards. However, when you move the tag within and before any text (especially before the tag), it is treated correctly. I attached a modified HTML file that parses nicely and serialises into UTF-8. So, the right place to ask this question is on the libxml2 mailing list, not on the lxml mailing list. Stefan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20061012/cc9f832d/attachment.html From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 12 19:10:31 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 12 Oct 2006 19:10:31 +0200 Subject: [lxml-dev] [objectify] DataElement factory problem In-Reply-To: <OF548D5258.B8A6A19B-ONC12571FF.003554D2-C12571FF.00362226@LBBW.de> References: <OF548D5258.B8A6A19B-ONC12571FF.003554D2-C12571FF.00362226@LBBW.de> Message-ID: <452E7707.6020804@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > I ran into a problem using the objectify DataElement factory function. 
> When implementing an _init method in a derived ObjectifiedDataElement > class, it is impossible to access the element.text in _init because > this has not yet been set when _init gets called by _elementFactory. True, that's a problem. > Don't see a nice clean way to solve that. Maybe instrument > _elementFactory with an optional skip_init argument that allows for a > delayed manual call of _init in corner cases? Not a good idea, as it is rarely used. I already thought about adding a public C-API function for creating elements a while ago, that takes all necessary parameters including the text content. I think that's the cleanest solution. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 13 14:45:35 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 13 Oct 2006 14:45:35 +0200 Subject: [lxml-dev] [objectify] DataElement factory problem In-Reply-To: <OF548D5258.B8A6A19B-ONC12571FF.003554D2-C12571FF.00362226@LBBW.de> References: <OF548D5258.B8A6A19B-ONC12571FF.003554D2-C12571FF.00362226@LBBW.de> Message-ID: <452F8A6F.6090507@gkec.informatik.tu-darmstadt.de> Hi again, Holger Joukl wrote: > I ran into a problem using the objectify DataElement factory function. > When implementing an _init method in a derived ObjectifiedDataElement > class, it is impossible to access the element.text in _init because > this has not yet been set when _init gets called by _elementFactory. etree's C-API now has a new makeElement() function that creates an _Element straight through with everything it can carry: attributes, text, tail and a prefix mapping, either for an existing _Document or by creating a new document also. Objectify uses it to overcome the above problem. 
Stefan From cabraham at openplans.org Tue Oct 17 16:38:59 2006 From: cabraham at openplans.org (Chris Abraham) Date: Tue, 17 Oct 2006 10:38:59 -0400 Subject: [lxml-dev] problem with lxml and copy.deepcopy Message-ID: <4534EB03.6030402@openplans.org> Hi, I'm having a problem with performing a copy.deepcopy on a list of elements. I'm finding the etree._Comment elements get turned into None. Please see my test case: >>> a = etree.HTML('<html><body><p>hi</p> <!-- nice comment --> </body></html>') >>> b = a.xpath('//body/child::node()') >>> b [<Element p at 2b69f7d10d20>, ' ', <!-- nice comment -->, ' '] >>> import copy >>> c = copy.deepcopy(b) >>> c [<Element p at 2b69f7d1f7d0>, ' ', None, ' '] BTW, I'm using the CVS HEAD version of libxml2. Any ideas? Thanks, Chris From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Oct 17 19:12:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 17 Oct 2006 19:12:43 +0200 Subject: [lxml-dev] problem with lxml and copy.deepcopy In-Reply-To: <4534EB03.6030402@openplans.org> References: <4534EB03.6030402@openplans.org> Message-ID: <45350F0B.4020000@gkec.informatik.tu-darmstadt.de> Hi, Chris Abraham wrote: > I'm having a problem with performing a copy.deepcopy on a list of > elements. I'm finding the etree._Comment elements get turned into > None. Verified, thanks for reporting this. It's easy to reproduce like this: a = Comment("ONE") b = copy.deepcopy(a) The reason is that we create a new document internally and make the new element the root node. If it's a comment (or PI), however, libxml2 can't look it up right away with the normal call for the document root node, so we have to special case this (rare) use case. Fixed for 1.1 and trunk. 
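For anyone who wants to check their own installation, here is a minimal regression check for the case above. It is a sketch written for a current Python 3, not the test that went into lxml. The helper name copy_comment is made up for illustration, and the fallback to the standard library's ElementTree is an assumption so that the snippet also runs where lxml is not installed.

```python
import copy

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree is close enough for this check.
    import xml.etree.ElementTree as etree


def copy_comment(text):
    """Deep-copy a free-standing comment node and return the copy."""
    original = etree.Comment(text)
    return copy.deepcopy(original)


duplicate = copy_comment("ONE")
# Before the fix described above, the copy came back as None.
print(duplicate is not None, duplicate.text)
```

The same check can be applied to each item of an XPath result list, which is the situation from the original report.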
Stefan From cabraham at openplans.org Tue Oct 17 20:46:39 2006 From: cabraham at openplans.org (Chris Abraham) Date: Tue, 17 Oct 2006 14:46:39 -0400 Subject: [lxml-dev] problem with lxml and copy.deepcopy In-Reply-To: <45350F0B.4020000@gkec.informatik.tu-darmstadt.de> References: <4534EB03.6030402@openplans.org> <45350F0B.4020000@gkec.informatik.tu-darmstadt.de> Message-ID: <4535250F.6060802@openplans.org> Stefan, Thanks. This works well (and seems to have solved other problems I was having.) Chris Stefan Behnel wrote: > Hi, > > Chris Abraham wrote: > >> I'm having a problem with performing a copy.deepcopy on a list of >> elements. I'm finding the etree._Comment elements get turned into >> None. >> > > Verified, thanks for reporting this. It's easy to reproduce like this: > > a = Comment("ONE") > b = copy.deepcopy(a) > > > The reason is that we create a new document internally and make the new > element the root node. If it's a comment (or PI), however, libxml2 can't look > it up right away with the normal call for the document root node, so we have > to special case this (rare) use case. > > Fixed for 1.1 and trunk. > > Stefan From cabraham at openplans.org Tue Oct 17 21:47:10 2006 From: cabraham at openplans.org (Chris Abraham) Date: Tue, 17 Oct 2006 15:47:10 -0400 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> References: <452AC1FF.3030403@openplans.org> <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> Message-ID: <4535333E.4080807@openplans.org> Stefan, Thanks for this. Who should I contact to get the FAQ updated? http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings It states that lxml "will not parse" Python unicode strings that carry encoding info. But here we see that it does. 
Also in the APIs specific to lxml: http://codespeak.net/lxml/api.html "Similarly, you will get errors when you try the same with HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone." ...just a minor detail but thought it was worth following up on. Chris Stefan Behnel wrote: > Hi, > > Chris Abraham wrote: > >> We are getting some unexpected behavior when processing documents with a >> Shift_JIS encoding. >> We are trying to serialize an HTML document using an XSLT transform. >> Our results don't agree with the FAQ: >> http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings. >> Please see the comments in the attached demo.py which reads in home.html >> and demonstrates our problem. >> > > I looked into it and found that the behaviour of the libxml2 parser depends on > the position of the <meta> tag. Your HTML is pretty broken in many regards. > However, when you move the <meta> tag within <head> and before any text > (especially before the <title> tag), it is treated correctly. > > I attached a modified HTML file that parses nicely and serialises into UTF-8. > > So, the right place to ask this question is on the libxml2 mailing list, not > on the lxml mailing list. > > Stefan 
> From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 18 08:51:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 18 Oct 2006 08:51:20 +0200 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <4535333E.4080807@openplans.org> References: <452AC1FF.3030403@openplans.org> <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> <4535333E.4080807@openplans.org> Message-ID: <4535CEE8.6090107@gkec.informatik.tu-darmstadt.de> Hi Chris, Chris Abraham wrote: > Thanks for this. Who should I contact to get the FAQ updated? > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings Well, the FAQ isn't really wrong in what it says. In your case, the encoding information is simply not taken into account as it is in a totally wrong position. So it's more like the document did not contain any encoding information at all. Note that the HTML parser is not guaranteed to create correct HTML that is 'equivalent' to the broken HTML. It just tries its best, which may mean that some of the original content may get lost. And in this case, it's meta data that gets lost. Stefan From ltucker at openplans.org Wed Oct 18 17:00:33 2006 From: ltucker at openplans.org (Luke Tucker) Date: Wed, 18 Oct 2006 11:00:33 -0400 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <4535CEE8.6090107@gkec.informatik.tu-darmstadt.de> References: <452AC1FF.3030403@openplans.org> <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> <4535333E.4080807@openplans.org> <4535CEE8.6090107@gkec.informatik.tu-darmstadt.de> Message-ID: <1161183633.19594.103.camel@ltucker.openplans.org> Hey, I could be confused, but I think the issue chris is referring to here might be clouded by the bad HTML in the original message. Here's some behavior that, to me, doesn't appear to match up entirely with the FAQ (as far as where errors are produced) using fixed up HTML. 
>>> html = open('home2.html').read() >>> unicode = html.decode('Shift_JIS') >>> from lxml import etree >>> rh = etree.HTML(html) >>> uh = etree.HTML(unicode) >>> rh[0][1].text Traceback (most recent call last): File "<stdin>", line 1, in ? File "etree.pyx", line 859, in etree._Element.text.__get__ File "apihelpers.pxi", line 291, in etree._collectText File "apihelpers.pxi", line 552, in etree.funicode UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: unexpected code byte >>> uh[0][1].text u'\u30b3\u30df' It looked to me like uh = etree.HTML(unicode) in this case should produce errors (since it is unicode and contains a proper meta charset entry) and that rh should behave normally. Apologies if I'm simply confusing the issue further :) - Luke On Wed, 2006-10-18 at 08:51 +0200, Stefan Behnel wrote: > Hi Chris, > > Chris Abraham wrote: > > Thanks for this. Who should I contact to get the FAQ updated? > > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings > > Well, the FAQ isn't really wrong in what it says. In your case, the encoding > information is simply not taken into account as it is in a totally wrong > position. So it's more like the document did not contain any encoding > information at all. > > Note that the HTML parser is not guaranteed to create correct HTML that is > 'equivalent' to the broken HTML. It just tries its best, which may mean that > some of the original content may get lost. And in this case, it's meta data > that gets lost. > > Stefan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20061018/62ea6596/attachment.html From ianb at colorstudy.com Wed Oct 18 20:31:10 2006 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 18 Oct 2006 13:31:10 -0500 Subject: [lxml-dev] Cheese Shop link Message-ID: <453672EE.90509@colorstudy.com> Can you guys add some text like this to the Cheese Shop description: The in-development version of lxml can be found in the subversion repository at `http://codespeak.net/svn/lxml/trunk <http://codespeak.net/svn/lxml/trunk#egg=lxml-dev>`_ or installed with ``easy_install lxml==dev`` Thanks. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org From ianb at colorstudy.com Wed Oct 18 20:41:13 2006 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 18 Oct 2006 13:41:13 -0500 Subject: [lxml-dev] Cheese Shop link In-Reply-To: <453672EE.90509@colorstudy.com> References: <453672EE.90509@colorstudy.com> Message-ID: <45367549.3040608@colorstudy.com> Ian Bicking wrote: > Can you guys add some text like this to the Cheese Shop description: > > The in-development version of lxml can be found in the subversion > repository at `http://codespeak.net/svn/lxml/trunk > <http://codespeak.net/svn/lxml/trunk#egg=lxml-dev>`_ or installed with > ``easy_install lxml==dev`` Maybe a link to http://codespeak.net/svn/lxml/branch/lxml-1.1#egg=lxml-1.1bugfix would also be useful here. 
-- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 19 08:59:17 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 19 Oct 2006 08:59:17 +0200 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <1161183633.19594.103.camel@ltucker.openplans.org> References: <452AC1FF.3030403@openplans.org> <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> <4535333E.4080807@openplans.org> <4535CEE8.6090107@gkec.informatik.tu-darmstadt.de> <1161183633.19594.103.camel@ltucker.openplans.org> Message-ID: <45372245.5060502@gkec.informatik.tu-darmstadt.de> Hi, Luke Tucker wrote: > I could be confused, but I think the issue chris is referring > to here might be clouded by the bad HTML in the original > message. Sure, that's why I was referring him to the libxml2 mailing list. > Here's some behavior that, to me, doesn't appear to > match up entirely with the FAQ (as far as where errors are > produced) using fixed up HTML. > >>>> html = open('home2.html').read() >>>> unicode = html.decode('Shift_JIS') >>>> from lxml import etree >>>> rh = etree.HTML(html) >>>> uh = etree.HTML(unicode) >>>> rh[0][1].text > Traceback (most recent call last): > File "<stdin>", line 1, in ? > File "etree.pyx", line 859, in etree._Element.text.__get__ > File "apihelpers.pxi", line 291, in etree._collectText > File "apihelpers.pxi", line 552, in etree.funicode > UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: > unexpected code byte >>>> uh[0][1].text > u'\u30b3\u30df' > > It looked to me like uh = etree.HTML(unicode) in this case should > produce errors (since it is unicode and contains a proper meta > charset entry) and that rh should behave normally. Apologies if I'm > simply confusing the issue further :) Sorry, but your HTML is very broken, too. 
It has two <html> tags and two contradictory <meta> tags (saying both "us-ascii" and "shift_jis"), so don't expect libxml2's HTML parser to magically know what you really meant when you wrote it. That's like saying: Ok, I know this function only works for values from 1-5, so I'll put in a 99 and complain if it breaks. If you parse broken HTML and the parser doesn't handle it correctly, the reason is your broken HTML, really. If you think libxml2 should be able to parse this kind of non-HTML, please file a bug on the libxml2 parser. There is nothing lxml can do about it. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 19 09:30:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 19 Oct 2006 09:30:01 +0200 Subject: [lxml-dev] Cheese Shop link In-Reply-To: <45367549.3040608@colorstudy.com> References: <453672EE.90509@colorstudy.com> <45367549.3040608@colorstudy.com> Message-ID: <45372979.1020601@gkec.informatik.tu-darmstadt.de> Hi Ian, Ian Bicking wrote: > Ian Bicking wrote: >> Can you guys add some text like this to the Cheese Shop description: >> >> The in-development version of lxml can be found in the subversion >> repository at `http://codespeak.net/svn/lxml/trunk >> <http://codespeak.net/svn/lxml/trunk#egg=lxml-dev>`_ or installed with >> ``easy_install lxml==dev`` > > Maybe a link to > http://codespeak.net/svn/lxml/branch/lxml-1.1#egg=lxml-1.1bugfix would > also be useful here. I just added both, please check out if it works for you. Stefan From rwiker at gmail.com Thu Oct 19 10:53:20 2006 From: rwiker at gmail.com (Raymond Wiker) Date: Thu, 19 Oct 2006 10:53:20 +0200 Subject: [lxml-dev] doctype declarations in XML output Message-ID: <9cd322050610190153x2289d5f9h30ed870e0583f6c2@mail.gmail.com> Hi. I've just started using Python 2.5 and lxml.etree, and am very happy so far - with this combination, I have an easy-to-use XML solution that includes good XPath and XSLT support. 
One question: is it possible to add doctype definitions to the output? I have not been able to find anything about this in the documentation, and I cannot see any references to xmlCreateIntSubset or xmlNewDtd in the source code. These should be trivial to add, as far as I can see. From ltucker at openplans.org Thu Oct 19 16:48:35 2006 From: ltucker at openplans.org (Luke Tucker) Date: Thu, 19 Oct 2006 10:48:35 -0400 Subject: [lxml-dev] lxml and html encodings In-Reply-To: <45372245.5060502@gkec.informatik.tu-darmstadt.de> References: <452AC1FF.3030403@openplans.org> <452E71F6.6080803@gkec.informatik.tu-darmstadt.de> <4535333E.4080807@openplans.org> <4535CEE8.6090107@gkec.informatik.tu-darmstadt.de> <1161183633.19594.103.camel@ltucker.openplans.org> <45372245.5060502@gkec.informatik.tu-darmstadt.de> Message-ID: <1161269315.19594.123.camel@ltucker.openplans.org> hah jeez, erg. sorry to waste your time and thanks for your patience. Wasn't intending to suggest it should handle malformed stuff, just a mistake, but I can definitely understand what you're saying all the same. - Luke > Sorry, but your HTML is very broken, too. It has two <html> tags and two > contradictory <meta> tags (saying both "us-ascii" and "shift_jis"), so don't > expect libxml2's HTML parser to magically know what you really meant when you > wrote it. That's like saying: Ok, I know this function only works for values > from 1-5, so I'll put in a 99 and complain if it breaks. > > If you parse broken HTML and the parser doesn't handle it correctly, the > reason is your broken HTML, really. > > If you think libxml2 should be able to parse this kind of non-HTML, please > file a bug on the libxml2 parser. There is nothing lxml can do about it. > > Stefan 
> From ianb at colorstudy.com Thu Oct 19 18:03:09 2006 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 19 Oct 2006 11:03:09 -0500 Subject: [lxml-dev] Cheese Shop link In-Reply-To: <45372979.1020601@gkec.informatik.tu-darmstadt.de> References: <453672EE.90509@colorstudy.com> <45367549.3040608@colorstudy.com> <45372979.1020601@gkec.informatik.tu-darmstadt.de> Message-ID: <4537A1BD.3040401@colorstudy.com> Stefan Behnel wrote: > I just added both, please check out if it works for you. Thanks, they both work. For people wanting to use these in requirements, you have to be a little tricky. For instance: lxml==dev,1.2dev since lxml==dev will install 1.2dev, which doesn't actually satisfy the first requirement ("dev"). -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 20 09:49:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 20 Oct 2006 09:49:08 +0200 Subject: [lxml-dev] doctype declarations in XML output In-Reply-To: <9cd322050610190153x2289d5f9h30ed870e0583f6c2@mail.gmail.com> References: <9cd322050610190153x2289d5f9h30ed870e0583f6c2@mail.gmail.com> Message-ID: <45387F74.9040101@gkec.informatik.tu-darmstadt.de> Hi, Raymond Wiker wrote: > Hi. I've just started using Python 2.5 and lxml.etree, and am very > happy so far - with this combination, I have an easy-to-use XML > solution that includes good XPath and XSLT support. Always happy to hear that. > One question: is it possible to add doctype definitions to the output? > I have not been able to find anything about this in the documentation, > and I cannot see any references to xmlCreateIntSubset or xmlNewDtd in > the source code. These should be trivial to add, as far as I can see. There's not currently any support for that. However, we might consider making _Element.docinfo writable to achieve this. Any patches are welcome. 
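Until something like a writable docinfo exists, one workaround is to serialise the tree and prepend the declaration by hand. The sketch below does that in current Python 3. The helper name serialize_with_doctype and the XHTML doctype string are only illustrative, and the ElementTree fallback is an assumption so that the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree serialises this tree the same way.
    import xml.etree.ElementTree as etree


def serialize_with_doctype(root, doctype):
    """Return the serialised tree as text with `doctype` prepended."""
    body = etree.tostring(root, encoding="unicode")
    return doctype + "\n" + body


root = etree.Element("html")
etree.SubElement(root, "body").text = "hello"

doctype = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
result = serialize_with_doctype(root, doctype)
print(result)
```

This only changes the serialised text; the tree in memory still carries no DTD information, so it is no substitute for real docinfo support.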
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 20 14:22:26 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 20 Oct 2006 14:22:26 +0200 Subject: [lxml-dev] Failure to Compile on Windows In-Reply-To: <20060926025606.GP4356@cotia> References: <20060925220913.GL4356@cotia> <20060926025606.GP4356@cotia> Message-ID: <4538BF82.2060700@gkec.informatik.tu-darmstadt.de> Hi Sidnei, finally coming back to this. Sidnei da Silva wrote: > On Mon, Sep 25, 2006 at 07:09:13PM -0300, Sidnei da Silva wrote: > | I've got pretty far with setting up the dependencies and all but after > | that the compilation failed with this error. Anyone have a clue? Looks > | like it's missing some .h that 'nano http' depends on. > > FWIW, I've got past that. It was missing Ws2_32.lib. > > I've managed to build lxml with python trunk, but when running the > tests I've got a fatal error due to a negative refcount to a tuple > (!). > > I think I will disable running lxml tests with python trunk, unless > someone wants to digg into this. Thanks for setting this up. The fact that it's a tuple does not necessarily mean it's a Python problem. Could you come up with a stack trace or at least the test name that triggered it? Try running test.py -v. Stefan From sidnei at awkly.org Sat Oct 21 02:12:20 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Fri, 20 Oct 2006 21:12:20 -0300 Subject: [lxml-dev] Failure to Compile on Windows In-Reply-To: <4538BF82.2060700@gkec.informatik.tu-darmstadt.de> References: <20060925220913.GL4356@cotia> <20060926025606.GP4356@cotia> <4538BF82.2060700@gkec.informatik.tu-darmstadt.de> Message-ID: <20061021001220.GA4350@cotia> On Fri, Oct 20, 2006 at 02:22:26PM +0200, Stefan Behnel wrote: | Thanks for setting this up. The fact that it's a tuple does not necessarily | mean it's a Python problem. Could you come up with a stack trace or at least | the test name that triggered it? Try running test.py -v. 
There you have it: test_empty_parse (lxml.tests.test_errors.ErrorTestCase) ... ok test_XMLDTDID (lxml.tests.test_etree.ETreeOnlyTestCase) ... Fatal Python error: \pybots\slave\2.5.dasilva-x86\build\Objects\tupleobject.c:169 object at 013A91B8 has negative ref count -606348326 Stack Trace: ------------------------------------------------------------------------------- ntdll.dll!7c822583() > python25_d.dll!Py_FatalError(const char * msg=0x0021d274) Line 1552 C python25_d.dll!_Py_NegativeRefcount(const char * fname=0x1e2b7da8, int lineno=0x000000a9, _object * op=0x013a91b8) Line 193 + 0xc C python25_d.dll!tupledealloc(PyTupleObject * op=0x0144e478) Line 169 + 0x75 C python25_d.dll!_Py_Dealloc(_object * op=0x0144e478) Line 1928 + 0x7 C python25_d.dll!tupledealloc(PyTupleObject * op=0x01451ab8) Line 169 + 0x8a C python25_d.dll!_Py_Dealloc(_object * op=0x01451ab8) Line 1928 + 0x7 C etree_d.pyd!__pyx_tp_dealloc_5etree__IDDict(_object * o=0x013a94f8) Line 45102 + 0x73 C python25_d.dll!_Py_Dealloc(_object * op=0x013a94f8) Line 1928 + 0x7 C python25_d.dll!frame_dealloc(_frame * f=0x01470b68) Line 416 + 0x6a C python25_d.dll!_Py_Dealloc(_object * op=0x01470b68) Line 1928 + 0x7 C python25_d.dll!fast_function(_object * func=0x00922690, _object * * * pp_stack=0x0021d4dc, int n=0x00000001, int na=0x00000001, int nk=0x00000000) Line 3654 + 0x6 C python25_d.dll!call_function(_object * * * pp_stack=0x0021d4dc, int oparg=0x00000001) Line 3587 + 0x12 C python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x01337c68, int throwflag=0x00000000) Line 2271 C python25_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x00a70808, _object * globals=0x01337c68, _object * locals=0x00000000, _object * * args=0x013af60c, int argcount=0x00000002, _object * * kws=0x01449038, int kwcount=0x00000000, _object * * defs=0x00a7e7bc, int defcount=0x00000001, _object * closure=0x00000000) Line 2833 + 0xb C python25_d.dll!function_call(_object * func=0x00b04508, _object * arg=0x013af5f8, _object * kw=0x0144a380) Line 522 + 
0x40 C python25_d.dll!PyObject_Call(_object * func=0x00b04508, _object * arg=0x013af5f8, _object * kw=0x0144a380) Line 1858 + 0xf C python25_d.dll!ext_do_call(_object * func=0x00b04508, _object * * * pp_stack=0x0021da3c, int flags=0x00000003, int na=0x00000001, int nk=0x00000000) Line 3848 C python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x01337ac0, int throwflag=0x00000000) Line 2312 C python25_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x00a70868, _object * globals=0x01337ac0, _object * locals=0x00000000, _object * * args=0x013afd8c, int argcount=0x00000002, _object * * kws=0x00000000, int kwcount=0x00000000, _object * * defs=0x00000000, int defcount=0x00000000, _object * closure=0x00000000) Line 2833 + 0xb C python25_d.dll!function_call(_object * func=0x00b04560, _object * arg=0x013afd78, _object * kw=0x00000000) Line 522 + 0x40 C python25_d.dll!PyObject_Call(_object * func=0x00b04560, _object * arg=0x013afd78, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!instancemethod_call(_object * func=0x00b04560, _object * arg=0x013afd78, _object * kw=0x00000000) Line 2495 + 0x11 C python25_d.dll!PyObject_Call(_object * func=0x00b8a2b8, _object * arg=0x014e02a0, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!slot_tp_call(_object * self=0x016fe738, _object * args=0x014e02a0, _object * kwds=0x00000000) Line 4581 + 0x11 C python25_d.dll!PyObject_Call(_object * func=0x016fe738, _object * arg=0x014e02a0, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!do_call(_object * func=0x016fe738, _object * * * pp_stack=0x0021e250, int na=0x00000001, int nk=0x00000000) Line 3779 C python25_d.dll!call_function(_object * * * pp_stack=0x0021e250, int oparg=0x00000001) Line 3589 + 0xa C python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x01337918, int throwflag=0x00000000) Line 2271 C python25_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x00a75148, _object * globals=0x01337918, _object * locals=0x00000000, _object * * args=0x013d638c, int argcount=0x00000002, _object * * 
kws=0x01449028, int kwcount=0x00000000, _object * * defs=0x00000000, int defcount=0x00000000, _object * closure=0x00000000) Line 2833 + 0xb C python25_d.dll!function_call(_object * func=0x00b04b38, _object * arg=0x013d6378, _object * kw=0x0144a230) Line 522 + 0x40 C python25_d.dll!PyObject_Call(_object * func=0x00b04b38, _object * arg=0x013d6378, _object * kw=0x0144a23 python25_d.dll!PyObject_Call(_object * func=0x00b04b90, _object * arg=0x0176a538, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!instancemethod_call(_object * func=0x00b04b90, _object * arg=0x0176a538, _object * kw=0x00000000) Line 2495 + 0x11 C python25_d.dll!PyObject_Call(_object * func=0x00b8a138, _object * arg=0x0142c690, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!slot_tp_call(_object * self=0x01503dc8, _object * args=0x0142c690, _object * kwds=0x00000000) Line 4581 + 0x11 C python25_d.dll!PyObject_Call(_object * func=0x01503dc8, _object * arg=0x0142c690, _object * kw=0x00000000) Line 1858 + 0xf C python25_d.dll!do_call(_object * func=0x01503dc8, _object * * * pp_stack=0x0021efc4, int na=0x00000001, int nk=0x00000000) Line 3779 C python25_d.dll!call_function(_object * * * pp_stack=0x0021efc4, int oparg=0x00000001) Line 3589 + 0xa C python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x013369b8, int throwflag=0x00000000) Line 2271 C python25_d.dll!fast_function(_object * func=0x00922690, _object * * * pp_stack=0x0021f498, int n=0x00000002, int na=0x00000002, int nk=0x00000000) Line 3653 C python25_d.dll!call_function(_object * * * pp_stack=0x0021f498, int oparg=0x00000002) Line 3587 + 0x12 C python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x00a85120, int throwflag=0x00000000) Line 2271 C python25_d.dll!fast_function(_object * func=0x00922690, _object * * * pp_stack=0x0021f96c, int n=0x00000001, int na=0x00000001, int nk=0x00000000) Line 3653 C python25_d.dll!call_function(_object * * * pp_stack=0x0021f96c, int oparg=0x00000001) Line 3587 + 0x12 C 
python25_d.dll!PyEval_EvalFrameEx(_frame * f=0x009f6648, int throwflag=0x00000000) Line 2271 C python25_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x00a698c8, _object * globals=0x009f6648, _object * locals=0x009674d0, _object * * args=0x00000000, int argcount=0x00000000, _object * * kws=0x00000000, int kwcount=0x00000000, _object * * defs=0x00000000, int defcount=0x00000000, _object * closure=0x00000000) Line 2833 + 0xb C python25_d.dll!PyEval_EvalCode(PyCodeObject * co=0x00a698c8, _object * globals=0x009674d0, _object * locals=0x009674d0) Line 499 + 0x1f C python25_d.dll!run_mod(_mod * mod=0x00a3cc30, const char * filename=0x00923fe1, _object * globals=0x009674d0, _object * locals=0x009674d0, PyCompilerFlags * flags=0x0021ff2c, _arena * arena=0x00987e20) Line 1264 + 0x11 C python25_d.dll!PyRun_FileExFlags(_iobuf * fp=0x1027c898, const char * filename=0x00923fe1, int start=0x00000101, _object * globals=0x009674d0, _object * locals=0x009674d0, int closeit=0x00000001, PyCompilerFlags * flags=0x0021ff2c) Line 1250 + 0x1d C python25_d.dll!PyRun_SimpleFileExFlags(_iobuf * fp=0x1027c898, const char * filename=0x00923fe1, int closeit=0x00000001, PyCompilerFlags * flags=0x0021ff2c) Line 871 + 0x22 C python25_d.dll!PyRun_AnyFileExFlags(_iobuf * fp=0x1027c898, const char * filename=0x00923fe1, int closeit=0x00000001, PyCompilerFlags * flags=0x0021ff2c) Line 689 + 0x15 C python25_d.dll!Py_Main(int argc=0x00000003, char * * argv=0x00923fa0) Line 499 + 0x30 C python_d.exe!main(int argc=0x00000003, char * * argv=0x00923fa0) Line 23 + 0xe C python_d.exe!mainCRTStartup() Line 398 + 0x11 C kernel32.dll!77e523e5() ------------------------------------------------------------------------------- -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 21 22:14:56 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 21 Oct 2006 22:14:56 +0200 
Subject: [lxml-dev] Failure to Compile on Windows In-Reply-To: <20061021001220.GA4350@cotia> References: <20060925220913.GL4356@cotia> <20060926025606.GP4356@cotia> <4538BF82.2060700@gkec.informatik.tu-darmstadt.de> <20061021001220.GA4350@cotia> Message-ID: <453A7FC0.9080802@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > On Fri, Oct 20, 2006 at 02:22:26PM +0200, Stefan Behnel wrote: > | Thanks for setting this up. The fact that it's a tuple does not necessarily > | mean it's a Python problem. Could you come up with a stack trace or at least > | the test name that triggered it? Try running test.py -v. > > There you have it: > > test_empty_parse (lxml.tests.test_errors.ErrorTestCase) ... ok > test_XMLDTDID (lxml.tests.test_etree.ETreeOnlyTestCase) ... Fatal Python error: > \pybots\slave\2.5.dasilva-x86\build\Objects\tupleobject.c:169 object at 013A91B8 > has negative ref count -606348326 > > > Stack Trace: > ------------------------------------------------------------------------------- > ntdll.dll!7c822583() >> python25_d.dll!Py_FatalError(const char * msg=0x0021d274) Line 1552 C > python25_d.dll!_Py_NegativeRefcount(const char * fname=0x1e2b7da8, int lineno=0x000000a9, _object * op=0x013a91b8) Line 193 + 0xc C > python25_d.dll!tupledealloc(PyTupleObject * op=0x0144e478) Line 169 + 0x75 C > python25_d.dll!_Py_Dealloc(_object * op=0x0144e478) Line 1928 + 0x7 C > python25_d.dll!tupledealloc(PyTupleObject * op=0x01451ab8) Line 169 + 0x8a C > python25_d.dll!_Py_Dealloc(_object * op=0x01451ab8) Line 1928 + 0x7 C > etree_d.pyd!__pyx_tp_dealloc_5etree__IDDict(_object * o=0x013a94f8) Line 45102 + 0x73 C [...] Thanks. That wasn't the greatest code anyway, so thanks for pointing me at it. I couldn't reproduce the bug and didn't find anything suspicious under valgrind, so I just committed a cleaned-up version of some code parts that may have led to the problem and I hope that changes the refcount behaviour also. 
Could you retry with the current trunk version? Thanks again, Stefan From Holger.Joukl at LBBW.de Mon Oct 23 13:37:28 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Mon, 23 Oct 2006 13:37:28 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions Message-ID: <OF357AE853.D7E90F9C-ONC1257210.003D716B-C1257210.004004A8@LBBW.de> Hi, sorry for the inconvenience, I now put this into a new thread. And I'd have gotten back to that sooner but have been ill. >Then: what you observe are most likely GC 'issues'. The thing is: if the >element already exists as Python object, it is reused, which is much faster >then creating a new one. So in the cases where your code runs faster, you can >assume that the object survived a larger portion of your code without being >re-instantiated. I probably have some misunderstandings how the reuse of elements works. When I "visit" a node, like: >>> from lxml import etree >>> from lxml import objectify >>> parser = etree.XMLParser(remove_blank_text=True) >>> lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup()) >>> parser.setElementClassLookup(lookup) >>> objectify.setDefaultParser(parser) >>> objectify.enableRecursiveStr() >>> root = objectify.Element('root') >>> root.i = 17 >>> root.i <Element i at 1e94b0> >>> the Python Element object for "i" is being created. Will that Python Element be garbage-collected afterwards, if I do not explicitly delete "i" from the xml tree? I thought this element survived in the element proxy. >Especially recursive printing instantiates the entire tree, so if the objects >are not deleted directly afterwards, this has a performance effect on code >that runs afterwards. 
I see, but why would "manual access" of the nodes not have the same effect: Runs slow: ========== python2.4 -m timeit -v -s""" from lxml import etree from lxml import objectify parser = etree.XMLParser(remove_blank_text=True) lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup()) parser.setElementClassLookup(lookup) objectify.setDefaultParser(parser) objectify.enableRecursiveStr() root = objectify.Element('root') root.i = 17 root.f = 238.3343 root.s = 'what' root.d = '2006-03-03' print root.i print root.f print root.s print root.d """ "n = root.i; n = root.f; n = root.s; n = root.d" 17 238.3343 what 2006-03-03 10 loops -> 0.0102 secs 17 238.3343 what 2006-03-03 100 loops -> 0.101 secs 17 238.3343 what 2006-03-03 1000 loops -> 1.02 secs 17 238.3343 what 2006-03-03 17 238.3343 what 2006-03-03 17 238.3343 what 2006-03-03 raw times: 1.03 1.02 1.02 1000 loops, best of 3: 1.02 msec per loop Runs fast: ========== python2.4 -m timeit -v -s""" from lxml import etree from lxml import objectify parser = etree.XMLParser(remove_blank_text=True) lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup()) parser.setElementClassLookup(lookup) objectify.setDefaultParser(parser) objectify.enableRecursiveStr() root = objectify.Element('root') root.i = 17 root.f = 238.3343 root.s = 'what' root.d = '2006-03-03' print root """ "n = root.i; n = root.f; n = root.s; n = root.d" root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 10 loops -> 0.00109 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 100 loops -> 0.00928 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 1000 loops -> 0.0897 secs root = None [ObjectifiedElement] i = 17 
[IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 10000 loops -> 0.905 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] raw times: 0.893 0.911 0.911 10000 loops, best of 3: 89.3 usec per loop Recursively outputting root before accessing its child elements really speeds things up, even though I accessed all elements in the slow example, too. Why is this? I'm clueless. Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version. 
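Editorial illustration: the reply that follows attributes the timing gap to object lifetime, i.e. reference counting versus the cyclic garbage collector. The sketch below shows that distinction with plain Python objects only. `Node` and `Proxy` are hypothetical stand-ins, not lxml classes, and the behaviour shown is specific to CPython's reference counting.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an element proxy (not an lxml class)."""


class Proxy:
    """Hypothetical stand-in for an attribute proxy that points back at its node."""

    def __init__(self, node):
        self.node = node


def is_alive_after_release(make_cycle):
    """Create a Node, drop the only strong reference, and report whether
    the object still exists before and after an explicit collector run."""
    node = Node()
    if make_cycle:
        node.proxy = Proxy(node)   # node -> proxy -> node: a reference cycle
    probe = weakref.ref(node)      # observes the object without keeping it alive
    del node                       # drop the last strong reference
    alive_before_gc = probe() is not None
    gc.collect()                   # force the cyclic collector to run
    alive_after_gc = probe() is not None
    return alive_before_gc, alive_after_gc


gc.disable()  # keep automatic collector runs from interfering with the demo
try:
    plain = is_alive_after_release(make_cycle=False)
    cyclic = is_alive_after_release(make_cycle=True)
finally:
    gc.enable()

print("no cycle:", plain)
print("cycle:   ", cyclic)
```

On CPython the object without a cycle should be gone as soon as its last reference is dropped, while the one caught in a cycle should survive until the collector runs. An object that survives can be reused instead of being rebuilt, which is the kind of effect the timings above would be sensitive to.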
From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 23 22:53:51 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 23 Oct 2006 22:53:51 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <OF357AE853.D7E90F9C-ONC1257210.003D716B-C1257210.004004A8@LBBW.de> References: <OF357AE853.D7E90F9C-ONC1257210.003D716B-C1257210.004004A8@LBBW.de> Message-ID: <453D2BDF.6040804@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: >> Then: what you observe are most likely GC 'issues'. The thing is: if the >> element already exists as Python object, it is reused, which is much > faster >> then creating a new one. So in the cases where your code runs faster, you > can >> assume that the object survived a larger portion of your code without > being >> re-instantiated. > > I probably have some misunderstandings how the reuse of elements works. > When I "visit" a node, like: > >>>> root = objectify.Element('root') >>>> root.i = 17 >>>> root.i > <Element i at 1e94b0> > > the Python Element object for "i" is being created. > Will that Python Element be garbage-collected afterwards, if I do not > explicitly delete "i" > from the xml tree? I thought this element survived in the element proxy. > >> Especially recursive printing instantiates the entire tree, so if the > objects >> are not deleted directly afterwards, this has a performance effect on code >> that runs afterwards. > > I see, but why would "manual access" of the nodes not have the same effect. > Recursively outputting root before accessing its child elements > really speeds things up, even though I accessed all elements in > the slow example, too. > Why is this? I'm clueless. I think I can give an answer here. The difference lies in the two cleanup modes in the Python interpreter: GC and ref counting. 
Ref-counted objects disappear immediately after losing the last reference; however, when there are circular references between elements, the GC is required to clean them up. These objects can be garbage collected at any time, but they are usually kept until there is a good opportunity to clean them up, i.e. enough time has passed to merit the GC overhead or memory is filling up so it has to run. Now, the way recursive dumping is currently implemented instantiates an additional object for each element: _Attrib. This generates circular references between the element and its attribute proxy, which enforces use of the GC instead of the normal ref-count algorithm. So elements that were recursively printed stay alive until the next run of the GC. Elements that do not have an _Attrib dictionary proxy can be deleted when ref-counting them out. You should be able to reproduce the behaviour observed after recursive printing with elements for which you called ".attrib". You should not rely on either behaviour as this deals with implementation details in both lxml and Python. However, I would not object to patches that make the behaviour more predictable. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Oct 24 00:23:54 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 24 Oct 2006 00:23:54 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <OF357AE853.D7E90F9C-ONC1257210.003D716B-C1257210.004004A8@LBBW.de> References: <OF357AE853.D7E90F9C-ONC1257210.003D716B-C1257210.004004A8@LBBW.de> Message-ID: <453D40FA.7090802@gkec.informatik.tu-darmstadt.de> Hi again, I rewrote the current recursive string printing implementation to use a real iterator for attribute access, which also led to much shorter code in _Attrib (after a cleanup). This should remove the difference you see, although moving towards the slower variant. 
However, if it worked, this means that the elements are immediately garbage collected, which is the right thing to do from a memory perspective. Please test on your machine a) if the two code snippets still differ in performance and b) if the new implementation resulted in any noticeable slowdown. If you feel ambitious, take a look at the benchmark directory and try to come up with a new benchmark suite "bench_objectify.py". The benchmark framework makes new benchmarks extremely easy to write and the four test XML trees should be well suited for objectify already. Thanks, Stefan From Holger.Joukl at LBBW.de Tue Oct 24 14:13:37 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Tue, 24 Oct 2006 14:13:37 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <453D2BDF.6040804@gkec.informatik.tu-darmstadt.de> Message-ID: <OF7908AFE5.7DCB672E-ONC1257211.0042C976-C1257211.00435421@LBBW.de> Hello Stefan, Stefan Behnel <behnel_ml at gkec.informatik.tu-darmstadt.de> wrote on 23.10.2006 22:53:51: > I think I can give an answer here. The difference lies in the two cleanup > modes in the Python interpreter: GC and ref counting. Ref-counted objects > disappear immediately after losing the last reference; however, when there > are circular references between elements, the GC is required to clean them up. > These objects can be garbage collected at any time, but they are usually kept > until there is a good opportunity to clean them up, i.e. enough time has > passed to merit the GC overhead or memory is filling up so it has to run. > > Now, the way recursive dumping is currently implemented instantiates an > additional object for each element: _Attrib. This generates circular > references between the element and its attribute proxy which enforces use of > the GC instead of the normal ref-count algorithm. So elements that were > recursively printed stay alive until the next run of the GC. 
Elements that do > not have an _Attrib dictionary proxy can be deleted when ref-counting them out. > > You should be able to reproduce the behaviour observed after recursive > printing with elements for which you called ".attrib". Thanks for this good explanation! This of course also explains why my experiments with an additional "visit" function (a stripped down _dump, essentially) only worked when there was a call to element.items() included. Holger From Holger.Joukl at LBBW.de Tue Oct 24 15:10:05 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Tue, 24 Oct 2006 15:10:05 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <453D40FA.7090802@gkec.informatik.tu-darmstadt.de> Message-ID: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> Stefan Behnel <behnel_ml at gkec.informatik.tu-darmstadt.de> wrote on 24.10.2006 00:23:54: > Hi again, > > I rewrote the current recursive string printing implementation to use a real > iterator for attribute access, which also led to much shorter code in _Attrib > (after a cleanup). 
This should remove the difference you see, although moving > towards the slower variant. However, if it worked, this means that the > elements are immediately garbage collected, which is the right thing to do > from a memory perspective. > > Please test on your machine a) if the two code snippets still differ in > performance and b) if the new implementation resulted in any > noticeable slow down. I can confirm a) no performance difference between recursive element printing and "manual element access" any more b) no significant slowdown using the little timeit snippets for benchmarking. > If you feel ambitious, take a look at the benchmark directory and try to come > up with a new benchmark suite "bench_objectify.py". The benchmark framework > makes new benchmarks extremely easy to write and the four test XML trees > should be well suited for objectify already. Will take a look. Some more need for clarification: If I understand correctly the lxml element proxy only speeds up things if - I hold a python reference to the element object or - a circular reference to the element in question prevents it from being gc-ed To speed up my use case I could force-create and hold python references to every node before starting to operate on the tree. Would it also be possible to modify objectify in a way that the lifespan of the python _Element, once it has been instantiated, is tied to the existence of the underlying _c_node (xmlNode)? Holger From sidnei at enfoldsystems.com Tue Oct 24 22:02:16 2006 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 24 Oct 2006 17:02:16 -0300 Subject: [lxml-dev] Resolvers and open files Message-ID: <20061024200216.GH4388@cotia> I have a long-running process that uses a custom resolver to resolve a simple filename to a file relative to a pre-configured directory. It looks like this: class RelativeUrlResolver(etree.Resolver): def __init__(self, prefix): self.prefix = prefix def resolve(self, url, id, context): print "Resolving URL '%s'" % url if not url.startswith('http'): url = self.prefix + urllib.quote_plus(url) ssf = urllib.urlopen(url) if ssf is None: raise ValueError, 'could not resolve url: %r' % url return self.resolve_file(ssf, context) I'm creating the parser like this: parser = etree.XMLParser() parser.resolvers.add(RelativeUrlResolver(BASE)) (Where BASE = 'file:///path/to/some/dir') Now, the issue that's biting me is that it looks like the file is kept open after the processing has finished. The parser is re-created every time ATM, and goes 'out of scope' right after doing the transformation, so I would expect it all to be garbage collected, and the file to be closed. Do I need to do anything special to get this file to be closed? Thanks. 
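Editorial illustration: one way to avoid depending on when a handle gets garbage collected is to read the data and close the handle yourself before handing anything on. The sketch below uses only the Python 3 standard library (`urllib.request` instead of the Python 2 `urllib.urlopen` in the message above); `read_and_close` is a hypothetical helper name and is not part of lxml. Inside a resolver, the returned bytes could presumably be passed to `resolve_string()` instead of passing an open file to `resolve_file()`, but that wiring is an assumption and is not shown or tested here.

```python
import contextlib
import pathlib
import tempfile
import urllib.request


def read_and_close(url):
    """Return the bytes behind ``url``; the handle is closed before returning."""
    with contextlib.closing(urllib.request.urlopen(url)) as handle:
        return handle.read()


# Demo against a temporary file addressed through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "include.xml"
    path.write_bytes(b"<root/>")
    data = read_and_close(path.as_uri())

print(data)
```

The trade-off is that the whole resource is held in memory at once, which is usually acceptable for stylesheets and small includes.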
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 25 08:49:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 25 Oct 2006 08:49:08 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> References: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> Message-ID: <453F08E4.60606@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > To speed up my usecase I could force-create and hold python references to > every node before starting to operate on the tree. I added a FAQ entry on performance tweaking in objectify. If you find other things to add, I'd be happy to hear about them. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Oct 24 20:39:16 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 24 Oct 2006 20:39:16 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> References: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> Message-ID: <453E5DD4.1070905@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > Stefan Behnel wrote: >> Please test on your machine a) if the two code snippets still differ in >> performance and b) if the new implementation resulted in any >> noticeable slow down. > > I can confirm > a) no performance difference between recursive element printing and "manual > element access" any more > b) no significant slow down > using the little timeit snippets for benchmarking. Good, thanks. 
> Some more need for clarification: > If I understand correctly the lxml element proxy only speeds up things if > - I hold a python reference to the element object or > - a circular reference to the element in question prevents it from being > gc-ed Correct. However, as I said: do not rely on the second thing. GC runs are unpredictable (unless you run it by hand). > To speed up my usecase I could force-create and hold python references to > every node before starting to operate on the tree. ... the fastest approach likely being cache[root] = list(root.getiterator()) > Would it also be possible to modify objectify in a way that the lifespan of > the python _Element, once it has been instantiated, is tied to the > existence of the underlying _c_node (xmlNode)? Hmm, I don't know if that's a good thing in general. It eats substantially more memory than the C-tree does already. I mean, feel free to fill a cache like the above when XML comes in and delete it when it goes back out during processing. It should not be that much slower than doing it inside objectify, but it's simple enough to not require a dedicated API and it gives you absolute control over the trade-off between space and speed. Stefan From Holger.Joukl at LBBW.de Wed Oct 25 09:29:36 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Wed, 25 Oct 2006 09:29:36 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <453F08E4.60606@gkec.informatik.tu-darmstadt.de> Message-ID: <OFA28D453B.6D41DE64-ONC1257212.00286D88-C1257212.00295388@LBBW.de> Hi Stefan, Stefan Behnel <behnel_ml at gkec.informatik.tu-darmstadt.de> schrieb am 25.10.2006 08:49:08: > Hi Holger, > > Holger Joukl wrote: > > To speed up my usecase I could force-create and hold python references to > > every node before starting to operate on the tree. > > I added a FAQ entry on performance tweaking in objectify. If you find other > things to add, I'd be happy to hear about them. > > Stefan Very helpful! Thanks a lot. 
I think the first two hints might also be helpful for someone using classic lxml.etree. Holger From Holger.Joukl at LBBW.de Wed Oct 25 09:38:05 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Wed, 25 Oct 2006 09:38:05 +0200 Subject: [lxml-dev] [lxml][objectify] optimization of recursive object dumping Message-ID: <OF61C329FC.6FA1E12F-ONC1257212.00292C37-C1257212.002A1A2C@LBBW.de> Hi, I've posted this before but messed with the threads, so here it is again: (Note: patch line numbers might differ, this was based on 1.1 branch of 2 weeks ago, but I could of course update this and send a new patch) First some background: I'm experimenting with a custom objectified datetime class based on Python's datetime that employs the dateutil.parser module to detect if some element value is in a valid datetime format, i.e. the parse function from dateutil.parser is used to implement the type_check for the PyType type registry. 1) Invoking this parse method is quite expensive, so I want this to happen rarely. 
As I am using "recursive element dumping" as default I found that for every __str__ call .pyval of the ObjectifiedDataElements in a tree is accessed, which in turn triggers parsing for my custom datetime class. As I don't really see a way to avoid this I propose the introduction of an additional property "_pyval_repr" that can be overridden in subclasses, which makes it possible to simply return element.text, if getting .pyval is expensive. Something like: *** ORIG/lxml-1.1/src/lxml/objectify.pyx Wed Sep 27 09:18:30 2006 --- src/lxml/objectify.pyx Wed Oct 4 11:00:09 2006 *************** *** 484,489 **** --- 484,493 ---- def __get__(self): return textOf(self._c_node) + property _pyval_repr: + def __get__(self): + return self.pyval + def __str__(self): return textOf(self._c_node) or '' *************** *** 931,938 **** cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "pyval"): ! value = element.pyval else: value = textOf(element._c_node) if value and not value.strip(): --- 935,942 ---- cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "_pyval_repr"): ! value = element._pyval_repr else: value = textOf(element._c_node) if value and not value.strip(): This can substantially speed up things for complicated type_check routines (in my use case :) Holger From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 25 09:49:36 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 25 Oct 2006 09:49:36 +0200 Subject: [lxml-dev] Resolvers and open files In-Reply-To: <20061024200216.GH4388@cotia> References: <20061024200216.GH4388@cotia> Message-ID: <453F1710.8000205@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > I have a long-running process that uses a custom resolver to resolve a > simple filename to a file relative to a pre-configured directory. > Now, the issue that's biting me is that it looks like the file is kept > open after the processing has finished. Right, the resolver context that stores the temporary references was not cleaned up after use. Should be fixed on the trunk now. Please test it with your setup. Stefan From sidnei at enfoldsystems.com Wed Oct 25 15:33:03 2006 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 25 Oct 2006 10:33:03 -0300 Subject: [lxml-dev] Failure to Compile on Windows In-Reply-To: <453A7FC0.9080802@gkec.informatik.tu-darmstadt.de> References: <20060925220913.GL4356@cotia> <20060926025606.GP4356@cotia> <4538BF82.2060700@gkec.informatik.tu-darmstadt.de> <20061021001220.GA4350@cotia> <453A7FC0.9080802@gkec.informatik.tu-darmstadt.de> Message-ID: <20061025133303.GJ4388@cotia> On Sat, Oct 21, 2006 at 10:14:56PM +0200, Stefan Behnel wrote: | Thanks. That wasn't the greatest code anyway, so thanks for pointing me at it. 
| I couldn't reproduce the bug and didn't find anything suspicious under | valgrind, so I just committed a cleaned up version of some code parts that may | have lead to the problem and I hope that changes the refcount behaviour also. | | Could you retry with the current trunk version? Great! That seems to have fixed it. I do not get the refcount problem anymore. There are a couple failing tests, mainly due to calling os.remove() on an open file (that does not work on Windows). Ex: >>> f = open('/src/test.bat') >>> os.remove('/src/test.bat') Traceback (most recent call last): File "<stdin>", line 1, in ? OSError: [Errno 13] Permission denied: '/src/test.bat' -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at enfoldsystems.com Wed Oct 25 16:06:12 2006 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 25 Oct 2006 11:06:12 -0300 Subject: [lxml-dev] Resolvers and open files In-Reply-To: <453F1710.8000205@gkec.informatik.tu-darmstadt.de> References: <20061024200216.GH4388@cotia> <453F1710.8000205@gkec.informatik.tu-darmstadt.de> Message-ID: <20061025140612.GM4388@cotia> On Wed, Oct 25, 2006 at 09:49:36AM +0200, Stefan Behnel wrote: | Right, the resolver context that stores the temporary references was not | cleaned up after use. | | Should be fixed on the trunk now. Please test it with your setup. Ok, seems to work now! Thank you a lot! 
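Editorial illustration: the Windows test failures mentioned above come from calling os.remove() on a file that still has an open handle. A minimal sketch of the usual remedy, closing every handle before removal, using only the standard library; `write_then_remove` is a hypothetical helper name and not taken from the lxml test suite.

```python
import os
import tempfile


def write_then_remove(payload):
    """Write ``payload`` to a temporary file, read it back, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)                    # mkstemp returns an open descriptor; release it first
    with open(path, "wb") as out:   # closed when the block exits
        out.write(payload)
    with open(path, "rb") as src:   # closed when the block exits
        content = src.read()
    os.remove(path)                 # no handle is open any more at this point
    return content, os.path.exists(path)


content, still_there = write_then_remove(b"<a/>")
print(content, still_there)
```

Because each handle is released before os.remove() runs, the same code path should behave identically on Windows and on POSIX systems.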
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 25 18:20:48 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 25 Oct 2006 18:20:48 +0200 Subject: [lxml-dev] [lxml][objectify] optimization of recursive object dumping In-Reply-To: <OF61C329FC.6FA1E12F-ONC1257212.00292C37-C1257212.002A1A2C@LBBW.de> References: <OF61C329FC.6FA1E12F-ONC1257212.00292C37-C1257212.002A1A2C@LBBW.de> Message-ID: <453F8EE0.3090106@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > I'm experimenting with a custom objectified datetime class based on > Python's > datetime that employs the dateutil.parser module to detect if some element > value > is in a valid datetime format, i.e. the parse function from dateutil.parser > is used to implement the type_check for the PyType type registry. > > Invoking this parse method is quite expensive, so I want this to happen > rarely. As I am using "recursive element dumping" as default I found that > for every __str__ call .pyval of the ObjectifiedDataElements in a tree is > accessed, which in turn triggers parsing for my custom datetime class. But that should only happen for normal text content (well, and dates). Numbers should always be parsed first. > As I don't really see a way to avoid this I propose the introduction of > an additional property "_pyval_repr" that can be overridden in subclasses, > which makes it possible to simply return element.text, if getting .pyval > is expensive. Hmmm, I don't really like the idea of adding a new Python method only to optimise the debug output (which is what dump() is essentially meant for). I understand that you use this as default, but I don't think many people will rely on the performance of this function... Have you considered switching from "dump() by default" to "implement __str__() for all data types by hand"? 
There are not that many standard types... On the other hand, what if we did something like this: cdef object _dump(_Element element, int indent): indentstr = " " * indent if isinstance(element, ObjectifiedDataElement): value = str(element) else: ... Would that help? Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Oct 25 19:32:34 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 25 Oct 2006 19:32:34 +0200 Subject: [lxml-dev] [lxml][objectify] optimization of recursive object dumping In-Reply-To: <453F8EE0.3090106@gkec.informatik.tu-darmstadt.de> References: <OF61C329FC.6FA1E12F-ONC1257212.00292C37-C1257212.002A1A2C@LBBW.de> <453F8EE0.3090106@gkec.informatik.tu-darmstadt.de> Message-ID: <453F9FB2.9030306@gkec.informatik.tu-darmstadt.de> Hi again, Stefan Behnel wrote: > On the other hand, what if we did something like this: > > cdef object _dump(_Element element, int indent): > indentstr = " " * indent > if isinstance(element, ObjectifiedDataElement): > value = str(element) > else: > ... I wrote up a patch that could do the trick. Sadly, it requires a behavioural change in NumberElement to return repr(value) for str(). Not that beautiful. I'm just posting it to give you an idea about what looks like a viable approach to me. I'll try to get 1.1.2 out tomorrow, so unless I get convinced by then that this is a sufficiently solid idea, this will have to wait for the next release. Stefan -------------- next part -------------- A non-text attachment was scrubbed... 
Name: objectify-repr-dump.patch Type: text/x-patch Size: 2005 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20061025/4967224d/attachment.bin From Holger.Joukl at LBBW.de Thu Oct 26 08:59:28 2006 From: Holger.Joukl at LBBW.de (Holger Joukl) Date: Thu, 26 Oct 2006 08:59:28 +0200 Subject: [lxml-dev] [lxml][objectify] optimization of recursive object dumping In-Reply-To: <453F9FB2.9030306@gkec.informatik.tu-darmstadt.de> Message-ID: <OF4CC0387B.2E6021B1-ONC1257213.00262E8B-C1257213.0026911B@LBBW.de> Hi Stefan, Stefan Behnel <behnel_ml at gkec.informatik.tu-darmstadt.de> wrote on 25.10.2006 19:32:34: > I wrote up a patch that could do the trick. Sadly, it requires a behavioural > change in NumberElement to return repr(value) for str(). Not that beautiful. > I'm just posting it to give you an idea about what looks like a viable > approach to me. > > I'll try to get 1.1.2 out tomorrow, so unless I get convinced by then that > this is a sufficiently solid idea, this will have to wait for the > next release. I probably won't be able to look at anything before tomorrow, sorry. But as svn trunk usually works rock solid, I'm not that tied to official releases anyway :) Holger From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 26 09:02:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 26 Oct 2006 09:02:32 +0200 Subject: [lxml-dev] [lxml][objectify] optimization questions In-Reply-To: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> References: <OFCB53E900.4B5C4E75-ONC1257211.00432E84-C1257211.00487F65@LBBW.de> Message-ID: <45405D88.3010104@gkec.informatik.tu-darmstadt.de> Hi Holger, Holger Joukl wrote: > Stefan Behnel wrote: >> If you feel ambitious, take a look at the benchmark directory and try to >> come up with a new benchmark suite "bench_objectify.py". The benchmark >> framework >> makes new benchmarks extremely easy to write and the four test XML trees >> should be well suited for objectify already. > > Will take a look. Well, it wasn't quite that well suited after all. I added some smaller benchmarks myself and adapted the benchmark trees to simplify their use through objectify. Tree 3 still doesn't fit, but trees 1, 2 and 4 can be used for benchmarking. Just look at bench_objectify.py to see how that works. So, as you're testing anyway, I'd be happy if you could come up with some additional benchmarks. That way, we could put some more results up on the performance web page. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Oct 26 11:48:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 26 Oct 2006 11:48:11 +0200 Subject: [lxml-dev] lxml now ships with Pyrex included Message-ID: <4540845B.90904@gkec.informatik.tu-darmstadt.de> Hi, as lxml will most likely continue to depend on a non-release version of pyrex for quite a while, I decided to add the patched Pyrex version to the source distribution. It is part of lxml's SVN checkout, I only add the Pyrex directory to the project root in my release script. 
The setup.py script simply prepends that directory to the Python path so that the modified Pyrex is used automatically. This has a couple of consequences for users: * Distributors that want to apply patches to the lxml sources no longer have to check out Pyrex themselves and can shrink their build scripts to a simple run of setup.py. * Users can now modify and build lxml distributions without globally installing a patched Pyrex. They can also run "make clean" if something went wrong, without breaking the build afterwards, even if they have an unpatched Pyrex installed in their site-packages. I hope this helps in making lxml easier to build. Note that SVN users still have to install our Pyrex version, but I think that's acceptable. Now that this works, we could also remove the .c files to shrink the source distribution. Any objections? Stefan From sidnei at enfoldsystems.com Thu Oct 26 17:27:09 2006 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Thu, 26 Oct 2006 12:27:09 -0300 Subject: [lxml-dev] NameError on docloader.pxi Message-ID: <a7a2b76b0610260827x473f8c82j55af733f70de7462@mail.gmail.com> Hi there, I had spotted a NameError in docloader.pxi, but fixed it locally then forgot to send the patch. So here it is. 
It did bite me on the back when I tried to compile lxml-trunk on another box :)

Traceback:
  File "xslt.pxi", line 593, in etree._XSLTProcessingInstruction.parseXSL
  File "parser.pxi", line 889, in etree._parseDocument
  File "parser.pxi", line 893, in etree._parseDocumentFromURL
  File "parser.pxi", line 810, in etree._parseDocFromFile
  File "parser.pxi", line 522, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 591, in etree._handleParseResult
  File "etree.pyx", line 201, in etree._ExceptionContext._raise_if_stored
  File "parser.pxi", line 285, in etree._parser_resolve_from_python
  File "docloader.pxi", line 84, in etree._ResolverRegistry.resolve
  File "lxmlfilter.pyc", line 50, in resolve
  File "docloader.pxi", line 47, in etree.Resolver.resolve_file
NameError: _ParserInput

-- Sidnei da Silva http://www.enfoldsystems.com -------------- next part -------------- A non-text attachment was scrubbed... Name: docloader.diff Type: application/octet-stream Size: 867 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20061026/132fdfad/attachment.obj From sidnei at enfoldsystems.com Thu Oct 26 20:07:38 2006 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Thu, 26 Oct 2006 15:07:38 -0300 Subject: [lxml-dev] Test failures on Windows Message-ID: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> I think I've mentioned the other day the fact some failures are seen when running tests for trunk on Windows. Here's the log: http://tinyurl.com/yhd4hc If you want to follow the status of the builds, you can use this page. 
You can also force a build from there by clicking on the builder name (same line as 'changes' header): http://tinyurl.com/yykhz7 -- Sidnei da Silva http://www.enfoldsystems.com From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 27 09:08:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 27 Oct 2006 09:08:24 +0200 Subject: [lxml-dev] NameError on docloader.pxi In-Reply-To: <a7a2b76b0610260827x473f8c82j55af733f70de7462@mail.gmail.com> References: <a7a2b76b0610260827x473f8c82j55af733f70de7462@mail.gmail.com> Message-ID: <4541B068.6080407@gkec.informatik.tu-darmstadt.de> Sidnei da Silva wrote: > I had spotted a NameError in docloader.pxi, but fixed it locally then > forgot to send the patch. So here it is. It did bite me on the back > when I tried to compile lxml-trunk on another box :) Obvious bug, obvious patch. Thanks, Stefan From faassen at infrae.com Fri Oct 27 10:51:18 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 27 Oct 2006 10:51:18 +0200 Subject: [lxml-dev] lxml now ships with Pyrex included In-Reply-To: <4540845B.90904@gkec.informatik.tu-darmstadt.de> References: <4540845B.90904@gkec.informatik.tu-darmstadt.de> Message-ID: <4541C886.20406@infrae.com> Stefan Behnel wrote: > Now that this works, we could also remove the .c files to shrink the source > distribution. Any objections? Yes, I object to removing the .c files. the .c files allow a guaranteed sure build of lxml. There is absolutely no doubt which version of Pyrex is used here, as that's under our own control (otherwise there might be interesting import issues in play). Also important, it allows tools like easy_install to download and compile lxml fully automatically (such as in a buildout). I use this feature all the time. I don't know whether that can work if a Pyrex is bundled - the setup.py would likely become more complicated to account for PYTHONPATH manipulation, and that might possibly break easy_install. 
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 27 12:57:59 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 27 Oct 2006 12:57:59 +0200 Subject: [lxml-dev] lxml now ships with Pyrex included In-Reply-To: <4541C886.20406@infrae.com> References: <4540845B.90904@gkec.informatik.tu-darmstadt.de> <4541C886.20406@infrae.com> Message-ID: <4541E637.2000504@gkec.informatik.tu-darmstadt.de> Hi, Martijn Faassen wrote: > Stefan Behnel wrote: >> Now that this works, we could also remove the .c files to shrink the >> source distribution. Any objections? > > Yes, I object to removing the .c files. the .c files allow a guaranteed > sure build of lxml. There is absolutely no doubt which version of Pyrex > is used here, as that's under our own control (otherwise there might be > interesting import issues in play). I won't argue for it and it's fine to leave them in. It's only some 230K difference in the tgz (50% more), so if that prevents us from running into install problems ... > Also important, it allows tools like easy_install to download and > compile lxml fully automatically (such as in a buildout). I use this > feature all the time. I don't know whether that can work if a Pyrex is > bundled - the setup.py would likely become more complicated to account > for PYTHONPATH manipulation, and that might possibly break easy_install. No changes in setup.py are required. I only added Pyrex' package directory to the lxml root directory and setup.py imports it nicely from there. I tested that it builds with setuptools, so I don't quite see where this could interfere with buildout or easy_install. 
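A minimal sketch of the path-prepending idea described in this message, assuming the bundled package lives in a directory named `Pyrex` directly under the project root. The helper name `prepend_bundled_dir` is hypothetical and this is not lxml's actual setup.py:

```python
import os
import sys


def prepend_bundled_dir(project_root, package_name="Pyrex"):
    """Put project_root first on sys.path if it holds a bundled package.

    Hypothetical helper for illustration only. Returns True if sys.path
    was changed, False if no bundled copy was found.
    """
    bundled = os.path.join(project_root, package_name)
    if not os.path.isdir(bundled):
        # No bundled copy (for example a plain SVN checkout):
        # leave sys.path alone so an installed version is used.
        return False
    if project_root in sys.path:
        sys.path.remove(project_root)
    # Position 0 is searched before site-packages, so a later
    # "import Pyrex" would pick up the bundled copy first.
    sys.path.insert(0, project_root)
    return True
```

Because only the import path changes, the rest of a setup script could stay as it is, which matches the claim that no further changes were needed.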
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Oct 27 22:55:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 27 Oct 2006 22:55:11 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> Message-ID: <4542722F.9060900@gkec.informatik.tu-darmstadt.de> Hi, Sidnei da Silva wrote: > I think I've mentioned the other day the fact some failures are seen > when running tests for trunk on Windows. Here's the log: > > http://tinyurl.com/yhd4hc Thanks. Most of those are the usual Windows bugs that you can't delete an open file. I fixed some and just silenced the remaining two - shouldn't be too much of a problem if tiny temporary files are not deleted after running the test cases (which is a rare enough event anyway...) There's one problem left, though, and I don't have any idea where it might come from.

======================================================================
FAIL: test_xslt_parameters (lxml.tests.test_xslt.ETreeXSLTTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\pybots\slave\trunk.dasilva-x86\build\lib\unittest.py", line 260, in run
    testMethod()
  File "C:\pybots\slave\python-tool\lxml-trunk\src\lxml\tests\test_xslt.py", line 210, in test_xslt_parameters
    st.apply, tree)
  File "C:\pybots\slave\trunk.dasilva-x86\build\lib\unittest.py", line 326, in failUnlessRaises
    raise self.failureException, "%s not raised" % excName
AssertionError: XSLTApplyError not raised
----------------------------------------------------------------------

Test case being:

----------------------------------------------------------------------
    def test_xslt_parameters(self):
        tree = self.parse('<a><b>B</b><c>C</c></a>')
        style = self.parse('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*" />
  <xsl:template match="/">
    <foo><xsl:value-of select="$bar" /></foo>
  </xsl:template>
</xsl:stylesheet>''')

        st = etree.XSLT(style)
        res = st.apply(tree, bar="'Bar'")
        self.assertEquals('''\
<?xml version="1.0"?>
<foo>Bar</foo>
''',
                          st.tostring(res))
        # apply without needed parameter will lead to XSLTApplyError
        self.assertRaises(etree.XSLTApplyError,
                          st.apply, tree)
----------------------------------------------------------------------

Apparently, the comment is wrong here... I'll have to find some time to look into this, unless someone has an idea? Could anyone test this under Windows and maybe figure out what happens here? Sometimes this means that an unexpected exception is raised instead - or none at all? Stefan From sidnei at awkly.org Fri Oct 27 23:04:54 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Fri, 27 Oct 2006 18:04:54 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4542722F.9060900@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> Message-ID: <20061027210454.GA4460@cotia> On Fri, Oct 27, 2006 at 10:55:11PM +0200, Stefan Behnel wrote: | Thanks. Most of those are the usual Windows bugs that you can't delete an open | file. I fixed some and just silenced the remaining two - shouldn't be too much | of a problem if tiny temporary files are not deleted after running the test | cases (which is a rare enough event anyway...) Yeah, as long as they are tiny :) I will be running the tests constantly on that box. Maybe I can setup some task to clean up $TMP. | There's one problem left, though, and I don't have any idea where it might | come from.
|
| ======================================================================
| FAIL: test_xslt_parameters (lxml.tests.test_xslt.ETreeXSLTTestCase)
| ----------------------------------------------------------------------
| Traceback (most recent call last):
|   File "C:\pybots\slave\trunk.dasilva-x86\build\lib\unittest.py", line 260, in run
|     testMethod()
|   File "C:\pybots\slave\python-tool\lxml-trunk\src\lxml\tests\test_xslt.py",
| line 210, in test_xslt_parameters
|     st.apply, tree)
|   File "C:\pybots\slave\trunk.dasilva-x86\build\lib\unittest.py", line 326, in
| failUnlessRaises
|     raise self.failureException, "%s not raised" % excName
| AssertionError: XSLTApplyError not raised
| ----------------------------------------------------------------------
|
| Test case being:
|
| ----------------------------------------------------------------------
|     def test_xslt_parameters(self):
|         tree = self.parse('<a><b>B</b><c>C</c></a>')
|         style = self.parse('''\
| <xsl:stylesheet version="1.0"
|     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
|   <xsl:template match="*" />
|   <xsl:template match="/">
|     <foo><xsl:value-of select="$bar" /></foo>
|   </xsl:template>
| </xsl:stylesheet>''')
|
|         st = etree.XSLT(style)
|         res = st.apply(tree, bar="'Bar'")
|         self.assertEquals('''\
| <?xml version="1.0"?>
| <foo>Bar</foo>
| ''',
|                           st.tostring(res))
|         # apply without needed parameter will lead to XSLTApplyError
|         self.assertRaises(etree.XSLTApplyError,
|                           st.apply, tree)
| ----------------------------------------------------------------------
|
| Apparently, the comment is wrong here... | | I'll have to find some time to look into this, unless someone has an idea? | Could anyone test this under Windows and maybe figure out what happens here? | Sometimes this means that an unexpected exception is raised instead - or none | at all? I can look at that later today. 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at awkly.org Sat Oct 28 02:06:59 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Fri, 27 Oct 2006 21:06:59 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4542722F.9060900@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028000659.GD4460@cotia> | Apparently, the comment is wrong here... | | I'll have to find some time to look into this, unless someone has an idea? | Could anyone test this under Windows and maybe figure out what happens here? | Sometimes this means that an unexpected exception is raised instead - or none | at all? So here's what that line gives:

(Pdb) p st.tostring(st.apply(tree))
'<?xml version="1.0"?>\n<foo/>\n'

Looks like it just assumed the parameter was empty or something. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at awkly.org Sat Oct 28 03:26:12 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Fri, 27 Oct 2006 22:26:12 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4542722F.9060900@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028012612.GE4460@cotia> On Fri, Oct 27, 2006 at 10:55:11PM +0200, Stefan Behnel wrote: | Thanks. Most of those are the usual Windows bugs that you can't delete an open | file. I fixed some and just silenced the remaining two - shouldn't be too much | of a problem if tiny temporary files are not deleted after running the test | cases (which is a rare enough event anyway...)

Two issues here:

 - In Python2.4 it raises OSError instead of WindowsError. I guess
   that is one of the changes in Python2.5.

 - I believe that this might be a real bug that needs fixing.

Why it might be a bug:

 - I looked at the source in lxml and I see that this ends up calling
   xmlparser.xmlCtxtReadFile, which just delegates down to
   libxml2. Well, somewhere in there it seems like the file is read
   but not closed.

By trial-and-failure, I've come up with the attached patch, which fixes the failures on Windows. Someone more experienced should review this. careful-not-to-hide-the-dirt-under-the-rug'ly yours, -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
-------------- next part --------------
Index: lxml/xmlparser.pxd
===================================================================
--- lxml/xmlparser.pxd (revision 33825)
+++ lxml/xmlparser.pxd (working copy)
@@ -53,6 +53,7 @@
         int recovery
         int options
         xmlError lastError
+        xmlParserInput* input
         xmlNode* node
         xmlSAXHandler* sax
@@ -127,4 +128,5 @@
                              char* buffer)
     cdef xmlParserInput* xmlNewInputFromFile(xmlParserCtxt* ctxt,
                                              char* filename)
+    cdef xmlParserInput* inputPop(xmlParserCtxt* ctxt)
     cdef void xmlFreeInputStream(xmlParserInput* input)
Index: lxml/parser.pxi
===================================================================
--- lxml/parser.pxi (revision 33825)
+++ lxml/parser.pxi (working copy)
@@ -574,6 +574,9 @@
             tree.xmlFreeDoc(ctxt.myDoc)
         ctxt.myDoc = NULL
 
+    if ctxt.input is not NULL:
+        xmlparser.xmlFreeInputStream(xmlparser.inputPop(ctxt))
+
     if result is not NULL:
         if ctxt.wellFormed or recover:
             __GLOBAL_PARSER_CONTEXT.initDocDict(result)

From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 28 10:05:14 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 28 Oct 2006 10:05:14 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028012612.GE4460@cotia> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> Message-ID: 
<45430F3A.7050602@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > On Fri, Oct 27, 2006 at 10:55:11PM +0200, Stefan Behnel wrote: > | Thanks. Most of those are the usual Windows bugs that you can't delete an open > | file. I fixed some and just silenced the remaining two - shouldn't be too much > | of a problem if tiny temporary files are not deleted after running the test > | cases (which is a rare enough event anyway...) > > Two issues here: > > - In Python2.4 it raises OSError instead of WindowsError. I guess > that is one of the changes in Python2.5. Good to know. > - I believe that this might be a real bug that needs fixing. > > Why it might be a bug: > > - I looked at the source in lxml and I see that this ends up calling > xmlparser.xmlCtxtReadFile, which just delegates down to > libxml2. Well, somewhere in there it seems like the file is read > but not closed. You got me convinced. I think that's because we are using the context reusing calls (xmlCtxt*). They require calling xmlCtxtReset afterwards to clean up both the input stack and memory resources. This is normally called automatically when using the parser context the next time (which is why there never were any enduring side effects), but waiting for that has the temporal side effect of leaving the input stream open when passing control back to the user code. Now, the problem is, running xmlCtxtReset can currently segfault in some cases, so we can't just call it carelessly. I played with it a bit to figure out in which cases it can be called, but it doesn't look like we can safely call it in every case where it would make sense. Guess I'll file a bug report on it and try to come up with a work-around... 
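The timing issue described above can be illustrated with a toy model in plain Python. It does not use libxml2 at all; the class `ReusableContext` and its methods are hypothetical stand-ins for a reused parser context whose leftover input is only released the next time the context is used:

```python
import os
import tempfile


class ReusableContext:
    """Toy stand-in for a reusable parser context (not libxml2's API)."""

    def __init__(self):
        self._input = None  # input stream left over from the last parse

    def reset(self):
        # Release whatever the previous parse left behind.
        if self._input is not None:
            self._input.close()
            self._input = None

    def parse(self, filename, eager_reset=False):
        self.reset()  # lazy cleanup: happens only when the context is reused
        self._input = open(filename, "rb")
        data = self._input.read()
        if eager_reset:
            self.reset()  # cleanup before control returns to the caller
        return data

    def input_is_open(self):
        return self._input is not None and not self._input.closed


def demo():
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.write(fd, b"<a/>")
    os.close(fd)
    ctxt = ReusableContext()
    try:
        ctxt.parse(path)
        lazy = ctxt.input_is_open()   # stream is still open after parsing
        ctxt.parse(path, eager_reset=True)
        eager = ctxt.input_is_open()  # stream was closed before returning
    finally:
        ctxt.reset()
        os.remove(path)
    return lazy, eager
```

With the lazy variant, user code on Windows that tries to delete the file right after `parse()` returns would fail, because the handle is still held by the context.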
Stefan From sidnei at awkly.org Sat Oct 28 15:07:19 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 10:07:19 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028130719.GA4589@cotia> On Sat, Oct 28, 2006 at 10:05:14AM +0200, Stefan Behnel wrote: | You got me convinced. I think that's because we are using the context reusing | calls (xmlCtxt*). They require calling xmlCtxtReset afterwards to clean up | both the input stack and memory resources. This is normally called | automatically when using the parser context the next time (which is why there | never were any enduring side effects), but waiting for that has the temporal | side effect of leaving the input stream open when passing control back to the | user code. | | Now, the problem is, running xmlCtxtReset can currently segfault in some | cases, so we can't just call it carelessly. I played with it a bit to figure | out in which cases it can be called, but it doesn't look like we can safely | call it in every case where it would make sense. Guess I'll file a bug report | on it and try to come up with a work-around... Erm.... did you look at the attached patch? It just frees ctxt->input if its not NULL. I guess you're looking for a generic fix though. 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 28 15:16:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 28 Oct 2006 15:16:43 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028130719.GA4589@cotia> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> Message-ID: <4543583B.1020702@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > On Sat, Oct 28, 2006 at 10:05:14AM +0200, Stefan Behnel wrote: > | You got me convinced. I think that's because we are using the context reusing > | calls (xmlCtxt*). They require calling xmlCtxtReset afterwards to clean up > | both the input stack and memory resources. This is normally called > | automatically when using the parser context the next time (which is why there > | never were any enduring side effects), but waiting for that has the temporal > | side effect of leaving the input stream open when passing control back to the > | user code. > | > | Now, the problem is, running xmlCtxtReset can currently segfault in some > | cases, so we can't just call it carelessly. I played with it a bit to figure > | out in which cases it can be called, but it doesn't look like we can safely > | call it in every case where it would make sense. Guess I'll file a bug report > | on it and try to come up with a work-around... > > Erm.... did you look at the attached patch? It just frees ctxt->input > if its not NULL. I guess you're looking for a generic fix though. Not only generic. Pending open files is only a symptom here. 
The real problem is that none of the resources allocated for parsing is freed before you call the parser again (in which case new resources will be allocated right away). So popping the input streams fixes the windows problem, but calling xmlClearParserCtxt() after parsing would be the right thing to do - if it didn't crash. Stefan From sidnei at awkly.org Sat Oct 28 15:25:35 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 10:25:35 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4543583B.1020702@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028132535.GB4589@cotia> On Sat, Oct 28, 2006 at 03:16:43PM +0200, Stefan Behnel wrote: | Not only generic. Pending open files is only a symptom here. The real problem | is that none of the resources allocated for parsing is freed before you call | the parser again (in which case new resources will be allocated right away). | | So popping the input streams fixes the windows problem, but calling | xmlClearParserCtxt() after parsing would be the right thing to do - if it | didn't crash. Thanks for the explanation. That makes a lot of sense. A question though. Is the parser context expensive to allocate? Why not use xmlFreeParserCtxt() and allocate a new one instead of xmlResetParserCtxt()? 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 28 15:43:05 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 28 Oct 2006 15:43:05 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028132535.GB4589@cotia> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> Message-ID: <45435E69.6050609@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > A question though. Is the parser context expensive to allocate? Why > not use xmlFreeParserCtxt() and allocate a new one instead of > xmlResetParserCtxt()? It is pretty expensive to allocate. There are some hash-table allocations involved that are rather costly (that's why there is a function for resetting the context). In lxml, we try to reuse context objects wherever possible. Stefan From sidnei at awkly.org Sat Oct 28 16:01:00 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 11:01:00 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <45435E69.6050609@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028140100.GC4589@cotia> On Sat, Oct 28, 2006 at 03:43:05PM +0200, Stefan Behnel wrote: | It is pretty expensive to allocate. 
There are some hash-table allocations | involved that are rather costly (that's why there is a function for resetting | the context). In lxml, we try to reuse context objects wherever possible. One last question, do you have a small test that can reproduce the segfault or is it just random? I would like to spend some time tracking that one down. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 28 16:12:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 28 Oct 2006 16:12:30 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028140100.GC4589@cotia> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> Message-ID: <4543654E.1040006@gkec.informatik.tu-darmstadt.de> Hi, Sidnei da Silva wrote: > On Sat, Oct 28, 2006 at 03:43:05PM +0200, Stefan Behnel wrote: > | It is pretty expensive to allocate. There are some hash-table allocations > | involved that are rather costly (that's why there is a function for resetting > | the context). In lxml, we try to reuse context objects wherever possible. > > One last question, do you have a small test that can reproduce the > segfault or is it just random? I would like to spend some time > tracking that one down. No need to do that. It's in line 12837 in file parser.c. Apparently, the assumption that ctxt->spaceTab has always been initialised when calling xmlCtxtReset() is wrong. 
Here's my bug report: http://bugzilla.gnome.org/show_bug.cgi?id=366161 My current workaround is to do the NULL check myself and to initialise it the way libxml2 normally does before calling the reset. That's what I'd normally expect libxml2 to do... BTW, I don't know in which cases this field remains uninitialised, so if you want to add some more infos to the bug report, feel free to investigate. I only know that it definitely happens in the case where we call xmlCtxtReadFile(). It is possible that this requires previous runs of the parser, maybe even a failed run preceding the crash, can't tell... Stefan From sidnei at awkly.org Sat Oct 28 16:55:36 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 11:55:36 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4543654E.1040006@gkec.informatik.tu-darmstadt.de> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028145536.GD4589@cotia> On Sat, Oct 28, 2006 at 04:12:30PM +0200, Stefan Behnel wrote: | No need to do that. It's in line 12837 in file parser.c. Apparently, the | assumption that ctxt->spaceTab has always been initialised when calling | xmlCtxtReset() is wrong. Here's my bug report: | | http://bugzilla.gnome.org/show_bug.cgi?id=366161 | | My current workaround is to do the NULL check myself and to initialise it the | way libxml2 normally does before calling the reset. That's what I'd normally | expect libxml2 to do... Yes, that check is certainly missing. Note that htmlCtxtReset() does the check! I added that info to the bug report. 
| BTW, I don't know in which cases this field remains uninitialised, so if you | want to add some more infos to the bug report, feel free to investigate. I | only know that it definitely happens in the case where we call | xmlCtxtReadFile(). It is possible that this requires previous runs of the | parser, maybe even a failed run preceding the crash, can't tell... Maybe in spacePush, if memory allocation fails. That seems a bit odd though... unless you're really short on memory :) -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Oct 28 21:55:55 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 28 Oct 2006 21:55:55 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028145536.GD4589@cotia> References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> <20061028145536.GD4589@cotia> Message-ID: <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > On Sat, Oct 28, 2006 at 04:12:30PM +0200, Stefan Behnel wrote: > | the assumption that ctxt->spaceTab has always been initialised when calling > | xmlCtxtReset() is wrong. > | I don't know in which cases this field remains uninitialised, so if you > | want to add some more infos to the bug report, feel free to investigate. I > | only know that it definitely happens in the case where we call > | xmlCtxtReadFile(). It is possible that this requires previous runs of the > | parser, maybe even a failed run preceding the crash, can't tell... 
> > Maybe in spacePush, if memory allocation fails. That seems a bit odd > though... unless you're really short on memory :) I think I was on the wrong track. It's a bug in libxml2, but not the right one. The crash appears somewhere in the long doctest in api.txt, most likely in a place where errors are tested when parsing from a string. That's not what we are looking for. You said that your patch fixes the problem, are you certain about that? Because calling xmlCtxtReset() should do exactly that (and a lot more) and it doesn't seem to solve the problem - according to the buildbot. Stefan From sidnei at awkly.org Sat Oct 28 23:06:06 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 18:06:06 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> References: <20061028012612.GE4460@cotia> <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> <20061028145536.GD4589@cotia> <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> Message-ID: <20061028210606.GF4589@cotia> On Sat, Oct 28, 2006 at 09:55:55PM +0200, Stefan Behnel wrote: | I think I was on the wrong track. It's a bug in libxml2, but not the right | one. The crash appears somewhere in the long doctest in api.txt, most likely | in a place where errors are tested when parsing from a string. That's not what | we are looking for. Ok... but that's no excuse for the check in HTMLparser to not be done on parser. | You said that your patch fixes the problem, are you certain about that? | Because calling xmlCtxtReset() should do exactly that (and a lot more) and it | doesn't seem to solve the problem - according to the buildbot. 
Yes, it does solve the problem for me on Python 2.4, ie, OSError is not raised after applying my patch. -- -e Sidnei da Silva -e Enfold Systems http://enfoldsystems.com -e Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at awkly.org Sat Oct 28 23:17:33 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 18:17:33 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028210606.GF4589@cotia> References: <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> <20061028145536.GD4589@cotia> <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> <20061028210606.GF4589@cotia> Message-ID: <20061028211733.GG4589@cotia> | | You said that your patch fixes the problem, are you certain about that? | | Because calling xmlCtxtReset() should do exactly that (and a lot more) and it | | doesn't seem to solve the problem - according to the buildbot. I see that you only call xmlClearContext if spaceTab is not NULL. Maybe that's the issue. 
-- -e Sidnei da Silva -e Enfold Systems http://enfoldsystems.com -e Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Oct 29 00:05:40 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 29 Oct 2006 00:05:40 +0200 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <20061028211733.GG4589@cotia> References: <45430F3A.7050602@gkec.informatik.tu-darmstadt.de> <20061028130719.GA4589@cotia> <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> <20061028145536.GD4589@cotia> <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> <20061028210606.GF4589@cotia> <20061028211733.GG4589@cotia> Message-ID: <4543D434.1020901@gkec.informatik.tu-darmstadt.de> Hi Sidnei, Sidnei da Silva wrote: > | | You said that your patch fixes the problem, are you certain about that? > | | Because calling xmlCtxtReset() should do exactly that (and a lot more) and it > | | doesn't seem to solve the problem - according to the buildbot. > > I see that you only call xmlClearContext if spaceTab is not > NULL. Maybe that's the issue. No, I also tried initialising spaceTab by hand to always call reset(). No difference. So this is definitely not the problem. And it really makes me wonder how your patch can work if xmlClearParserCtxt() does not, because your code snippet is straight in there and I can't see a way it should not get executed if clear() is called. AFAICT, with the call to clear(), all input streams and memory resources should get freed after parsing, so that's all I wanted. And still the problem of a file descriptor staying open remains. I'll wait for the next buildbot run to see, but if that fails, I'll really get clueless... Note, BTW, that the error occurs in a finally block, so maybe it's already shadowing an exception? 
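Editorial illustration of the "shadowing" question above: when a cleanup step inside a `finally` block raises, that new exception is the one that propagates, and the error that triggered the cleanup is hidden behind it. The following is only a minimal sketch in modern Python 3, not the lxml test code; the two exception messages are made-up stand-ins. On Python 3 the original error stays reachable through `__context__`; on the Python 2.4 used in this thread it would simply be lost.

```python
def cleanup_masks_original():
    """Return the exception that escapes when cleanup in `finally` also fails."""
    try:
        try:
            # Stand-in for the real failure inside the test body.
            raise ValueError("original parse error")
        finally:
            # Stand-in for a failing cleanup step, e.g. removing a file
            # that is still held open on Windows.
            raise OSError("cleanup failed")
    except Exception as exc:
        return exc


exc = cleanup_masks_original()
print(type(exc).__name__)              # OSError
print(type(exc.__context__).__name__)  # ValueError
```

So an OSError reported from a `finally` block does not by itself prove that the test body succeeded; the first failure may be sitting underneath it.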
Stefan From sidnei at awkly.org Sun Oct 29 02:17:41 2006 From: sidnei at awkly.org (Sidnei da Silva) Date: Sat, 28 Oct 2006 22:17:41 -0300 Subject: [lxml-dev] Test failures on Windows In-Reply-To: <4543D434.1020901@gkec.informatik.tu-darmstadt.de> References: <4543583B.1020702@gkec.informatik.tu-darmstadt.de> <20061028132535.GB4589@cotia> <45435E69.6050609@gkec.informatik.tu-darmstadt.de> <20061028140100.GC4589@cotia> <4543654E.1040006@gkec.informatik.tu-darmstadt.de> <20061028145536.GD4589@cotia> <4543B5CB.90705@gkec.informatik.tu-darmstadt.de> <20061028210606.GF4589@cotia> <20061028211733.GG4589@cotia> <4543D434.1020901@gkec.informatik.tu-darmstadt.de> Message-ID: <20061029011741.GH4589@cotia> On Sun, Oct 29, 2006 at 12:05:40AM +0200, Stefan Behnel wrote: | No, I also tried initialising spaceTab by hand to always call reset(). No | difference. So this is definitely not the problem. And it really makes me | wonder how your patch can work if xmlClearParserCtxt() does not, because your | code snippet is straight in there and I can't see a way it should not get | executed if clear() is called. | | AFAICT, with the call to clear(), all input streams and memory resources | should get freed after parsing, so that's all I wanted. And still the problem | of a file descriptor staying open remains. I'll wait for the next buildbot run | to see, but if that fails, I'll really get clueless... FWIW, I reverted my patch and did a svn up, and your changes seem to work here. No segfault or anything. The buildbot seems to be in some funny state now, maybe due to some checkin on Python 2.5/trunk. 
--
-e Sidnei da Silva
-e Enfold Systems http://enfoldsystems.com
-e Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 30 16:28:06 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 30 Oct 2006 16:28:06 +0100
Subject: [lxml-dev] lxml 1.1.2 released
Message-ID: <45461A06.20909@gkec.informatik.tu-darmstadt.de>

Hi everyone,

lxml 1.1.2 finally made it to the cheeseshop.

http://cheeseshop.python.org/pypi/lxml

This is mainly a bugfix release for the stable 1.1 series, the changelog is below. As there were a number of important fixes, updating is recommended.

Eggs for x86-64 are uploaded already, and I'd love to see more eggs thrown in that direction.

Have fun,
Stefan


1.1.2 (2006-10-30)

Features added

* Data elements in objectify support repr(), which is now used by dump()
* Source distribution now ships with a patched Pyrex
* New C-API function makeElement() to create new elements with text, tail, attributes and namespaces
* Reuse original parser flags for XInclude
* Simplified support for handling XSLT processing instructions

Bugs fixed

* Parser resources were not freed before the next parser run
* Open files and XML strings returned by Python resolvers were not closed/freed
* Crash in the IDDict returned by XMLDTDID
* Copying Comments and ProcessingInstructions failed
* Memory leak for external URLs in _XSLTProcessingInstruction.parseXSL()
* Memory leak when garbage collecting tailed root elements
* HTML script/style content was not propagated to .text
* Show text xincluded between text nodes correctly in .text and .tail
* 'integer * objectify.StringElement' operation was not supported

From sidnei at awkly.org Mon Oct 30 18:12:33 2006
From: sidnei at awkly.org (Sidnei da Silva)
Date: Mon, 30 Oct 2006 14:12:33 -0300
Subject: [lxml-dev] Test failures on Windows
In-Reply-To: <20061028000659.GD4460@cotia>
References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com>
 <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028000659.GD4460@cotia>
Message-ID: <20061030171233.GA4630@cotia>

Hi Stefan,

On Fri, Oct 27, 2006 at 09:06:59PM -0300, Sidnei da Silva wrote:
| | Apparently, the comment is wrong here...
| |
| | I'll have to find some time to look into this, unless someone has an idea?
| | Could anyone test this under Windows and maybe figure out what happens here?
| | Sometimes this means that an unexpected exception is raised instead - or none
| | at all?
|
| So here's what that line gives:
|
| (Pdb) p st.tostring(st.apply(tree))
| '<?xml version="1.0"?>\n<foo/>\n'
|
| Looks like it just assumed the parameter was empty or something.

Did you have a chance to look at the XSLTApplyError not being raised? Does that test fail on Linux? Maybe it's an issue with the version of libxml2 that I'm using on the buildbot?

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Oct 30 18:28:56 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 30 Oct 2006 18:28:56 +0100
Subject: [lxml-dev] Test failures on Windows
In-Reply-To: <20061030171233.GA4630@cotia>
References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028000659.GD4460@cotia> <20061030171233.GA4630@cotia>
Message-ID: <45463658.8030801@gkec.informatik.tu-darmstadt.de>

Hi Sidnei,

Sidnei da Silva wrote:
> Hi Stefan,
>
> On Fri, Oct 27, 2006 at 09:06:59PM -0300, Sidnei da Silva wrote:
> | | Apparently, the comment is wrong here...
> | |
> | | I'll have to find some time to look into this, unless someone has an idea?
> | | Could anyone test this under Windows and maybe figure out what happens here?
> | | Sometimes this means that an unexpected exception is raised instead - or none
> | | at all?
> |
> | So here's what that line gives:
> |
> | (Pdb) p st.tostring(st.apply(tree))
> | '<?xml version="1.0"?>\n<foo/>\n'
> |
> | Looks like it just assumed the parameter was empty or something.
>
> Did you have a chance to look at the XSLTApplyError not being raised?

Yes, I looked into it and decided it's not critical enough to delay 1.1.2 even more.

> Does that test fail on Linux?

No, not on my machine.

> Maybe it's an issue with the version of
> libxml2 that I'm using on the buildbot?

I tried with libxml2 2.6.24 and 2.6.26. Both pass the test nicely. I also looked through the code path and couldn't find anything obvious that would behave differently on different systems. I have no idea why that test fails on the buildbot.

BTW, the buildbot logs (all of them) seem to be broken currently, don't know where that comes from.

Stefan

From sidnei at awkly.org Tue Oct 31 12:28:41 2006
From: sidnei at awkly.org (Sidnei da Silva)
Date: Tue, 31 Oct 2006 08:28:41 -0300
Subject: [lxml-dev] Test failures on Windows
In-Reply-To: <45463658.8030801@gkec.informatik.tu-darmstadt.de>
References: <a7a2b76b0610261107r3de69070j2eb9f443a0c39f3a@mail.gmail.com> <4542722F.9060900@gkec.informatik.tu-darmstadt.de> <20061028000659.GD4460@cotia> <20061030171233.GA4630@cotia> <45463658.8030801@gkec.informatik.tu-darmstadt.de>
Message-ID: <20061031112841.GA4594@cotia>

On Mon, Oct 30, 2006 at 06:28:56PM +0100, Stefan Behnel wrote:
| I tried with libxml2 2.6.24 and 2.6.26. Both pass the test nicely. I also
| looked through the code path and couldn't find anything obvious that would
| behave differently on different systems. I have no idea why that test fails on
| the buildbot.

I see that you added some extra output with the version numbers to the output.

http://tinyurl.com/ymbmrt

| BTW, the buildbot logs (all of them) seem to be broken currently, don't know
| where that comes from.

Yeah, the master got out of sync, somehow. It's working again now.
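Editorial illustration: to compare library versions between a developer machine and a build slave, as discussed above, something like the following sketch can be run on both. It assumes a current lxml release, which exposes version tuples such as `etree.LIBXML_VERSION` and `etree.LIBXML_COMPILED_VERSION`; it is not the output code that was added to the test runner in 2006, and the helper names `format_version` and `describe_versions` are made up for this example.

```python
def format_version(version_tuple):
    """Turn a version tuple such as (2, 6, 26) into the string '2.6.26'."""
    return ".".join(str(part) for part in version_tuple)


def describe_versions():
    """Return a dict of version strings, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": format_version(etree.LXML_VERSION),
        "libxml2 (runtime)": format_version(etree.LIBXML_VERSION),
        "libxml2 (compiled)": format_version(etree.LIBXML_COMPILED_VERSION),
        "libxslt (runtime)": format_version(etree.LIBXSLT_VERSION),
        "libxslt (compiled)": format_version(etree.LIBXSLT_COMPILED_VERSION),
    }


if __name__ == "__main__":
    versions = describe_versions()
    if versions is None:
        print("lxml is not installed")
    else:
        for name, value in sorted(versions.items()):
            print("%-20s %s" % (name, value))
```

A difference between the "runtime" and "compiled" lines is worth noting in a report, since it means lxml was built against one library version and is running against another.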
--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From dan at danposluns.com Tue Oct 31 22:55:28 2006
From: dan at danposluns.com (Dan Posluns)
Date: Tue, 31 Oct 2006 13:55:28 -0800
Subject: [lxml-dev] Can't build on Windows
Message-ID: <4547C650.70104@danposluns.com>

I'm not normally a Windows user so I'm at my wit's end here trying to get lxml to install. When I try easy_install I get the following:

C:\Python24\Scripts>easy_install lxml
Searching for lxml
Reading http://www.python.org/pypi/lxml/
Reading http://codespeak.net/lxml
Reading http://www.python.org/pypi/lxml/1.1.2
Best match: lxml 1.1.2
Downloading http://codespeak.net/lxml/lxml-1.1.2.tgz
Processing lxml-1.1.2.tgz
Running lxml-1.1.2\setup.py -q bdist_egg --dist-dir c:\docume~1\dposluns\locals~
1\temp\easy_install-vc3ufv\lxml-1.1.2\egg-dist-tmp-vxzlya
Building lxml version 1.1.2
warning: no files found matching 'etree.c' under directory 'src\lxml'
warning: no files found matching 'objectify.c' under directory 'src\lxml'
warning: no files found matching 'etree.h' under directory 'src\lxml'
warning: no files found matching 'etree_defs.h' under directory 'src\lxml'
warning: no files found matching 'pubkey.asc' under directory 'doc'
warning: no previously-included files found matching 'doc\pyrex.txt'
warning: no previously-included files found matching 'src\lxml\etree.pxi'
cl : Command line warning D4025 : overriding '/W3' with '/w'
cl : Command line warning D4029 : optimization is not available in the standard edition compiler
etree.c
c:\Documents and Settings\dposluns\Local Settings\Temp\easy_install-vc3ufv\lxml-
1.1.2\src\lxml\etree_defs.h(20) : fatal error C1083: Cannot open include file: '
libxml/xmlversion.h': No such file or directory
error: Setup script exited with error: command '"C:\Program Files\Microsoft Visu
al Studio .NET 2003\Vc7\bin\cl.exe"' failed with exit status 2

When I try David Sankel's technique to build the libraries statically, I manage to get as far as adding wsock32 (which took me long enough to figure out how to do - again, not a Windows programmer) before the build process borked out on me with the same errors. I was using the Windows distributions of libxml, libxslt, iconv and zlib from zlatkovic.com.

Can anyone help me figure this one out?

Thanks,

Dan.

--
Dan Posluns, B. Eng. & Scty. (Software Engineering and Society)
dan at danposluns.com - ICQ: 35758902
http://www.danposluns.com

"The great thing about being the only species on the planet that makes a distinction between right and wrong is that we get to make the rules up for ourselves as we go."
- Douglas Adams, Last Chance to See
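Editorial illustration: the fatal error C1083 in the build log above says that the compiler could not find `libxml/xmlversion.h` in any of its include directories. A quick way to check where (or whether) that header exists is sketched below. The helper name `find_header` and the example directory names are hypothetical; substitute the locations where the libxml2 and libxslt development files were actually unpacked.

```python
import os


def find_header(include_dirs, relative_path="libxml/xmlversion.h"):
    """Return the first existing path to the header, or None if it is missing."""
    parts = relative_path.split("/")
    for directory in include_dirs:
        candidate = os.path.join(directory, *parts)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical locations; adjust to the local installation.
    candidates = [
        r"C:\libs\libxml2\include",
        r"C:\libs\libxslt\include",
    ]
    found = find_header(candidates)
    if found is None:
        print("libxml/xmlversion.h not found in: " + ", ".join(candidates))
    else:
        print("found header at " + found)
```

If the header is found, the directory that contains the `libxml` folder is the one the compiler needs on its include path; the same kind of check applies to the `.lib` files and the library path.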