From icon at fedoraproject.org Thu Jan 3 16:29:48 2008
From: icon at fedoraproject.org (Konstantin Ryabitsev)
Date: Thu, 3 Jan 2008 10:29:48 -0500
Subject: [lxml-dev] Help with an error message
Message-ID:

Hi, everyone:

I'm having trouble with the following case. One of my automatic import
scripts takes data from one source and submits it to another as an XML
feed. Recently, it started failing because one of the entries contains
a null. The testcase is such:

from lxml.etree import Element
sourcestr = 'Contains a null: \x00'
unistr = unicode(sourcestr, 'utf-8')
elt = Element('foo').text = unistr

Running it will cause the following error:

Traceback (most recent call last):
  File "foo.py", line 6, in <module>
    elt = Element('foo').text = unistr
  File "etree.pyx", line 741, in etree._Element.text.__set__
  File "apihelpers.pxi", line 344, in etree._setNodeText
  File "apihelpers.pxi", line 648, in etree._utf8
AssertionError: All strings must be XML compatible, either Unicode or ASCII

Can someone suggest the best way to deal with this?

Kind regards,
--
Konstantin Ryabitsev
Montréal, Québec

From stefan_ml at behnel.de Thu Jan 3 17:30:22 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 03 Jan 2008 17:30:22 +0100
Subject: [lxml-dev] Help with an error message
In-Reply-To:
References:
Message-ID: <477D0D9E.3090205@behnel.de>

Hi,

Konstantin Ryabitsev wrote:
> I'm having trouble with the following case. One of my automatic import
> scripts takes data from one source and submits it to another as an XML
> feed. Recently, it started failing because one of the entries contains
> a null.
> The testcase is such:
>
> from lxml.etree import Element
> sourcestr = 'Contains a null: \x00'
> unistr = unicode(sourcestr, 'utf-8')
> elt = Element('foo').text = unistr
>
> Running it will cause the following error:
>
> Traceback (most recent call last):
>   File "foo.py", line 6, in <module>
>     elt = Element('foo').text = unistr
>   File "etree.pyx", line 741, in etree._Element.text.__set__
>   File "apihelpers.pxi", line 344, in etree._setNodeText
>   File "apihelpers.pxi", line 648, in etree._utf8
> AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> Can someone suggest the best way to deal with this?

My first question is: why do you need a '\x00' here? If you want to pass
binary data in XML, the best way is to use a safe encoding such as uuencode or
whatever. That should be part of your XML language spec/schema/...

Stefan

From azaroth at liverpool.ac.uk Thu Jan 3 17:33:33 2008
From: azaroth at liverpool.ac.uk (Rob Sanderson)
Date: Thu, 03 Jan 2008 16:33:33 +0000
Subject: [lxml-dev] Help with an error message
In-Reply-To: <477D0D9E.3090205@behnel.de>
References: <477D0D9E.3090205@behnel.de>
Message-ID: <1199378013.1868.18.camel@helmsdeep>

The null character makes the XML non-well-formed anyway.

The legal character ranges for XML (as per the spec, section 2.2):

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
         [#x10000-#x10FFFF]

Definitely no \x00!

So ... I would base64 encode any offending data, as suggested by Stefan.

Rob

On Thu, 2008-01-03 at 17:30 +0100, Stefan Behnel wrote:
> Konstantin Ryabitsev wrote:
> > I'm having trouble with the following case. One of my automatic import
> > scripts takes data from one source and submits it to another as an XML
> > feed. Recently, it started failing because one of the entries contains
> > a null.
> My first question is: why do you need a '\x00' here? If you want to pass
> binary data in XML, the best way is to use a safe encoding such as uuencode or
> whatever.
> That should be part of your XML language spec/schema/...

From stefan_ml at behnel.de Thu Jan 3 17:57:19 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 03 Jan 2008 17:57:19 +0100
Subject: [lxml-dev] Help with an error message
In-Reply-To: <1199378013.1868.18.camel@helmsdeep>
References: <477D0D9E.3090205@behnel.de> <1199378013.1868.18.camel@helmsdeep>
Message-ID: <477D13EF.4040809@behnel.de>

Hi,

Rob Sanderson wrote:
> The null character makes the XML non-well-formed anyway.
>
> The legal character ranges for XML (as per the spec, section 2.2):
>
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>          [#x10000-#x10FFFF]
>
> Definitely no \x00!

that's true. While you could get away on the XML /generator/ side with adding
an Entity (and lxml 2.0 will let you do that), this will just let you write
out broken XML that the recipient will not be able to parse:

>>> from lxml import etree as et
>>> el = et.Element("test")
>>> el.text = "mind the "
>>> el.append(et.Entity("#0"))
>>> xml = et.tostring(el)
'<test>mind the &#0;</test>'
>>> et.fromstring(xml)
Traceback (most recent call last):
lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 0, line 1, column 20

Maybe we should fix the Entity() factory here to prevent such misuse...

Stefan

From xkenneth at gmail.com Fri Jan 4 21:13:29 2008
From: xkenneth at gmail.com (Kenneth Miller)
Date: Fri, 4 Jan 2008 14:13:29 -0600
Subject: [lxml-dev] XML Schemas (XSD) and Objectification
Message-ID: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com>

All,

Is there any way to use an XSD file to generate an object in python using
Objectify?
Regards,
Ken

From jlovell at esd189.org Fri Jan 4 21:59:11 2008
From: jlovell at esd189.org (John Lovell)
Date: Fri, 4 Jan 2008 12:59:11 -0800
Subject: [lxml-dev] XML Schemas (XSD) and Objectification
In-Reply-To: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com>
References: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com>
Message-ID: <3A49C88789256B4AB33AC603DB6AF49B08E159@ZIRIA.esd189.org>

Ken:

While I do not know if you can do this using Objectify, you should look at
generateDS as a backup.

http://www.rexx.com/~dkuhlman/generateDS.html

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at esd189.org
www.esd189.org
Together We Can ...

-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Kenneth Miller
Sent: Friday, January 04, 2008 12:13 PM
To: lxml-dev at codespeak.net
Subject: [lxml-dev] XML Schemas (XSD) and Objectification

All,

Is there any way to use an XSD file to generate an object in python using
Objectify?

Regards,
Ken
_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

From stefan_ml at behnel.de Fri Jan 4 22:45:02 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 04 Jan 2008 22:45:02 +0100
Subject: [lxml-dev] XML Schemas (XSD) and Objectification
In-Reply-To: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com>
References: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com>
Message-ID: <477EA8DE.5090305@behnel.de>

Hi,

Kenneth Miller wrote:
> Is there any way to use an XSD file to generate an object in
> python using Objectify?

Hmm, lxml.objectify actually works "as is", based on the XML document itself
(i.e. an 'instance' of the schema), but without any schema interaction. What
are you trying to achieve? Type enforcement based on schema types?
Or do you mean 'generate an object' in the sense that you want to map an
objectify object to a plain Python object?

Stefan

From stefan_ml at behnel.de Sat Jan 5 10:02:03 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 05 Jan 2008 10:02:03 +0100
Subject: [lxml-dev] XML Schemas (XSD) and Objectification
In-Reply-To:
References: <15529934-AE65-40B9-9302-A741CF512B0E@gmail.com> <477EA8DE.5090305@behnel.de>
Message-ID: <477F478B.5090806@behnel.de>

Hi,

please reply also to the list.

Kenneth Miller wrote:
> On Jan 4, 2008, at 3:45 PM, Stefan Behnel wrote:
>> Kenneth Miller wrote:
>>> Is there any way to use an XSD file to generate an object in
>>> python using Objectify?
>>
>> Hmm, lxml.objectify actually works "as is", based on the XML document
>> itself (i.e. an 'instance' of the schema), but without any schema
>> interaction. What are you trying to achieve? Type enforcement based on
>> schema types?
>>
>> Or do you mean 'generate an object' in the sense that you want to map an
>> objectify object to a plain Python object?
>>
> I'd like to be able to simply create the objects defined by the schema
> as python objects.

:) repeating an answer doesn't always help in understanding it.

But I think what you mean is: you have a schema and you want to generate
source code for Python objects *in advance* to represent its document
instances.

That's not how lxml.objectify works. What lxml.objectify does, is: you give it
a document instance (no schema involved) and it will create Python objects for
you *at runtime* to represent the document. That's a slight difference, and
it's the reason why I asked back. It might or might not fit what you want to
achieve with these objects.

My bet is, if you validate your instance against the schema before you access
the document tree, there shouldn't be a difference in behaviour.
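
A minimal sketch of that validate-first approach, written for a current
Python 3 and lxml rather than the 2008 releases discussed in this thread. The
schema, the element names (order, product, quantity) and the helper
load_validated() are made up for illustration; etree.XMLSchema, its
assertValid() method and objectify.fromstring() are the lxml calls assumed
here.

```python
from lxml import etree, objectify

# Hypothetical schema: an <order> with a string and an integer child.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="product" type="xs:string"/>
        <xs:element name="quantity" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def load_validated(xml_bytes):
    """Parse with objectify, then validate before anyone touches the tree.

    Raises etree.DocumentInvalid if the instance does not match the schema.
    """
    root = objectify.fromstring(xml_bytes)
    schema.assertValid(root)
    return root


order = load_validated(
    b"<order><product>widget</product><quantity>3</quantity></order>"
)
# objectify infers Python types from the text content at runtime.
print(order.product.text, order.quantity.pyval + 1)
```

Note that the types here still come from objectify's own inference on the
document content, not from the schema; validation only guarantees that the
content has the shape the schema promises.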
Also, if you want to tie specific objects to certain parts of the document,
lxml will allow you to do that - just not with an /arbitrary/ Python object,
as it requires inheritance of lxml's base objects.

Have you looked at the examples on our web page?

http://codespeak.net/lxml/dev/objectify.html#the-lxml-objectify-api
http://codespeak.net/lxml/dev/objectify.html#python-data-types
http://codespeak.net/lxml/dev/element_classes.html

Stefan

From stefan_ml at behnel.de Sat Jan 5 20:39:02 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 05 Jan 2008 20:39:02 +0100
Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash on freebsd/amd64
In-Reply-To: <47769725.9050802@behnel.de>
References: <4772AFBE.8020801@behnel.de> <47769725.9050802@behnel.de>
Message-ID: <477FDCD6.9050904@behnel.de>

Hi Dmitri,

Stefan Behnel wrote:
> The way XSLT is implemented in lxml is a bit tricky, as libxslt makes some
> things hard to control that lxml uses in libxml2 for performance reasons. In
> particular, lxml uses a thread-local hash table for constant strings, which is
> much faster than a malloc() for each string that occurs in a document.
> However, libxslt doesn't honour this dictionary and creates its own one based
> on the stylesheet dictionary. The result is that the stylesheet can leak into
> the result document through string references that now point into the hash
> table of the stylesheet.
>
> There isn't a way in libxslt that would allow us to prevent this or to control
> the allocation. That's why I decided to restrict the execution of XSL
> transformations to threads that inherit the same hash table as the stylesheet,
> this should normally prevent any problems.

Here is a trivial patch (the one against xslt.pxi) that, instead of raising an
exception, copies the stylesheet into the current thread context, and thus
works around the current thread restrictions. It seems to work for me, any
chance you could give it a try?
In case it doesn't work reliably, could you additionally check the second
change (in parser.pxi)? It should restrict 'acceptable' hash tables to the
local thread, not including the main thread (as it did before).

Stefan

=== src/lxml/xslt.pxi
==================================================================
--- src/lxml/xslt.pxi (revision 3205)
+++ src/lxml/xslt.pxi (local)
@@ -373,7 +373,7 @@
         cdef xmlDoc* c_doc

         if not _checkThreadDict(self._c_style.doc.dict):
-            raise RuntimeError, "stylesheet is not usable in this thread"
+            return self.__copy__()(_input, profile_run=profile_run, **_kw)

         input_doc = _documentOrRaise(_input)
         root_node = _rootNodeOrRaise(_input)
=== src/lxml/parser.pxi
==================================================================
--- src/lxml/parser.pxi (revision 3205)
+++ src/lxml/parser.pxi (local)
@@ -132,8 +132,8 @@
     """Check that c_dict is either the local thread dictionary or the global
     parent dictionary.
     """
-    if __GLOBAL_PARSER_CONTEXT._c_dict is c_dict:
-        return 1 # main thread
+    #if __GLOBAL_PARSER_CONTEXT._c_dict is c_dict:
+    #    return 1 # main thread
     if __GLOBAL_PARSER_CONTEXT._getThreadDict(NULL) is c_dict:
         return 1 # local thread dict
     return 0

From stefan_ml at behnel.de Mon Jan 7 11:14:32 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Jan 2008 11:14:32 +0100
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local>
Message-ID: <4781FB88.900@behnel.de>

Hi Christof,

Höke, Christof wrote:
> You are the main developer for lxml, right?

Yep, but not the only one. :)

> I was trying the CSSSelect
> facility for a Python CSS library I am developing
> (http://code.google.com/p/cssutils/)

Cool.
I knew about cssutils, felt that its field of application was related to
cssselect (and lxml in general) but not with too much of an overlap - and
always thought it would be nice to have it working with lxml in some way.

> and I think there are some minor
> problems with "*" or "*|*" (I need to check again and I'll put them on the
> bug tracker then) but a question regarding support for pseudo selectors:
> Would it be possible to support stuff like :first-letter (currently not
> working is it not?) with Python XPath extension functions which should be
> able to do what XPath cannot? Are you maybe even working on it? I guess
> things like :first-line are problematic but other should be ok.

I'm not the primary person to ask here. cssselect was developed by Ian
Bicking, he knows best what works, what doesn't, and how to fix it. :)

> If I get the time I would try some things out and report back, this was
> just an idea that I had while playing with CSSSelector...

Go ahead, this is open source. Any help, testing and ideas are always
appreciated.

> Lxml is really great stuff BTW, it was actually quite simple using lxml and
> a CSSStyleSheet on a given HTML. (Not released yet but an example is in the
> SVN).

Great. In case there's anything we can do on lxml's side, please ask on the
list.

> Also the XPath extension facility is really great, I used Pyana until
> some time ago but now use lxml for most projects.

Competition is best when you win.
:)

Stefan

From ianb at colorstudy.com Mon Jan 7 17:55:42 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 07 Jan 2008 10:55:42 -0600
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <4781FB88.900@behnel.de>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de>
Message-ID: <4782598E.3000803@colorstudy.com>

Stefan Behnel wrote:
> Hi Christof,
>
> Höke, Christof wrote:
>> You are the main developer for lxml, right?
>
> Yep, but not the only one. :)
>
>
>> I was trying the CSSSelect
>> facility for a Python CSS library I am developing
>> (http://code.google.com/p/cssutils/)
>
> Cool. I knew about cssutils, felt that its field of application was related to
> cssselect (and lxml in general) but not with too much of an overlap - and
> always thought it would be nice to have it working with lxml in some way.

Yeah, it would be cool to be able to take a stylesheet and turn it into
style tags, to make the HTML relocatable without losing the style. That's
just one idea that has occurred to me in the past.

>> and I think there are some minor
>> problems with "*" or "*|*" (I need to check again and I'll put them on the
>> bug tracker then) but a question regarding support for pseudo selectors:
>> Would it be possible to support stuff like :first-letter (currently not
>> working is it not?) with Python XPath extension functions which should be
>> able to do what XPath cannot? Are you maybe even working on it? I guess
>> things like :first-line are problematic but other should be ok.

::first-letter is hard because it doesn't match any object in lxml. If it
returned a string like "A" it would be very much out of context (e.g., no
parent pointer), and it would be hard to do anything useful with it. To make
it useful I think it would require some new stringish object that also looked
nodeish (e.g., had a .getparent() method).
Though maybe an object like that should exist; something similar would be
needed for representing ranges.

::first-line, of course, depends on a rendering, so it's right out.

I haven't been doing any work on selectors recently. There are a couple
places where * doesn't work properly (and fixing it would probably require a
Python XPath function), though they should give an exception. If not, then
it's a bug of some sort.

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From ianb at colorstudy.com Mon Jan 7 18:50:32 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 07 Jan 2008 11:50:32 -0600
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local>
Message-ID: <47826668.90807@colorstudy.com>

Höke, Christof wrote:
>> ::first-letter is hard because it doesn't match any object in lxml.
>> If it returned a string like "A" it would be very much out of
>> context (e.g., no parent pointer), and it would be hard to do
>> anything useful with it. To make it useful I think it would
>> require some new stringish object that also looked nodeish (e.g.,
>> had a .getparent() method). Though maybe an object like that should
>> exist; something similar would be needed for representing ranges.
>
> What came to my mind was the DOM range spec stuff, but it is not
> really finished, is it? I was reading about it some years (!) ago I
> think in the Javascript Definitive Guide but I think it never went
> anywhere really.

It doesn't really matter too much, since lxml isn't that much like the DOM.
But the same use cases for the DOM range can apply to lxml.
> :first-letter should actually be element.text[0] I guess (which would
> be a string in lxml currently?), I don't really know the lxml API but
> would it be possible to define a subtype for element.text for this
> case? But you are right, a more general approach would certainly be
> better.

element.text is just a unicode string. Maybe we could have a method like
element.text_range(0, 1) that returns a subclass of unicode that also
happens to know something about its location. E.g.:

class ElementText(unicode):
    def __new__(cls, text, is_tail, range, parent):
        self = unicode.__new__(cls, text)
        self.is_tail = is_tail
        self.range = range
        self._parent = parent
        return self  # __new__ has to hand back the new instance
    def getparent(self):
        return self._parent
    def enclose_in_tag(self, el):
        """
        Enclose this text range in an element, like::

            span = Element('span')
            el.text_range(0, 1).enclose_in_tag(span)
        """
        parent = self.getparent()
        el.text = unicode(self)
        if self.is_tail:
            # the range lives in parent.tail: el becomes the next sibling
            el.tail = parent.tail[self.range[1]:]
            parent.tail = parent.tail[:self.range[0]]
            index = parent.getparent().index(parent)
            parent.getparent().insert(index+1, el)
        else:
            # the range lives in parent.text: el becomes the first child
            el.tail = parent.text[self.range[1]:]
            parent.text = parent.text[:self.range[0]]
            parent.insert(0, el)
        self._parent = el
        self.range = (0, len(self))
        self.is_tail = False

All untested, of course. A real sense of a range would be a bit more
difficult, as it involves lots of partial elements. But something like this
would be necessary to do that work.

Upon further thought, maybe subclassing unicode isn't the right thing --
perhaps it should really just wrap a string. Then perhaps you could just do,
say, el.text_range[:5], where el.text_range is a range object for all of its
text, and you could slice range objects to further break them down. But
dealing with changes to the element is tricky.
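
The core of the enclose-a-text-range idea can be tried without any new string
type. What follows is only a rough sketch under stated assumptions: it uses
the standard library's xml.etree.ElementTree in place of lxml (so there is no
getparent() and the parent is passed in explicitly), the function name
enclose_text_range is hypothetical, and only the .text case is handled, not
.tail.

```python
import xml.etree.ElementTree as ET


def enclose_text_range(parent, start, end, new_el):
    """Move parent.text[start:end] into new_el and make it the first child."""
    text = parent.text or ""
    new_el.text = text[start:end]
    # Whatever followed the range becomes the new element's tail ...
    new_el.tail = text[end:] or None
    # ... and whatever preceded it stays as the parent's leading text.
    parent.text = text[:start] or None
    parent.insert(0, new_el)
    return new_el


# Wrap the first letter of a paragraph, as ::first-letter would select it.
p = ET.fromstring("<p>Once upon a time</p>")
enclose_text_range(p, 0, 1, ET.Element("span"))
print(ET.tostring(p, encoding="unicode"))
# prints: <p><span>O</span>nce upon a time</p>
```

Because the new element is inserted at index 0, existing children keep their
order after the enclosed text, which matches where .text sits in the
ElementTree model.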
It's all a bit tricky ;)

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From stefan_ml at behnel.de Mon Jan 7 19:30:54 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Jan 2008 19:30:54 +0100
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local>
Message-ID: <47826FDE.9080703@behnel.de>

Hi,

Höke, Christof wrote:
>> From: Ian Bicking [mailto:ianb at colorstudy.com]
>> ::first-letter is hard
>> because it doesn't match any object in lxml. If it returned a string
>> like "A" it would be very much out of context (e.g., no parent pointer),
>> and it would be hard to do anything useful with it. To make it useful I
>> think it would require some new stringish object that also looked nodeish
>> (e.g., had a .getparent() method).

I considered that a while ago, as it would also be interesting for XPath in
general. However, currently, we use fast Python string creation functions to
serve the API level. At the time I deduced that changing that to the
instantiation of a custom string object would almost certainly slow things
down and complicate them, just to serve a rather special use case.

Although maybe I might want to take another look at that today...

> What came to my mind was the DOM range spec stuff, but it is not really
> finished, is it? I was reading about it some years (!) ago I think in the
> Javascript Definitive Guide but I think it never went anywhere really.

I had a discussion about that starting over at the XML-SIG list last summer.
http://permalink.gmane.org/gmane.comp.python.lxml.devel/2763

That was the first time I heard about DOM ranges and when I dug into that a
little deeper, I almost ran away screaming. IMVHO, that's an insane and
horribly complicated spec.

> :first-letter should actually be element.text[0] I guess (which would be a
> string in lxml currently?)

Yes.

> I don't really know the lxml API but would it
> be possible to define a subtype for element.text for this case? But you are
> right, a more general approach would certainly be better.

You could define a special string (and unicode) subtype for the result of an
XPath expression, which is determined independent of the Python API level. The
freedom is right there. However, it would mean you have to search and (in the
worst case) instantiate the parent Element to make sure it won't go away while
the string result exists. That's some overhead compared to a simple string
creation.

As I said, I might reconsider that, but I'm not very confident.

Stefan

From stefan_ml at behnel.de Mon Jan 7 20:56:11 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Jan 2008 20:56:11 +0100
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <47826668.90807@colorstudy.com>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com>
Message-ID: <478283DB.9090403@behnel.de>

Hi Ian,

Ian Bicking wrote:
> element.text is just a unicode string.

or a plain string.

> Maybe we could have a method
> like element.text_range(0, 1) that returns a subclass of unicode that
> also happens to know something about its location.

I prefer having the XPath string results be something like that.
I think that's the only case where you can 'spuriously' end up with a text
value and might want to know where it came from.

> class ElementText(unicode):

Maybe we should still keep up the str/unicode duality here. Although that will
be history with Python 3, it isn't now, and it is an integral part of the
current lxml API.

>     def __new__(cls, text, is_tail, range, parent):
>         self = unicode.__new__(cls, text)
>         self.is_tail = is_tail

Right, 'is_tail' should be in.

>         self.range = range

'range' would be the substring indices? I would prefer calculating as much as
possible on demand. Remember, most people will not use this object in any
other way than a plain string. That's why I'm so hesitant about instantiating
an Element object along the road.

>     def enclose_in_tag(self, el):
>         """
>         Enclose this text range in an element, like::
>
>             span = Element('span')
>             el.text_range(0, 1).enclose_in_tag(span)
>         """

Hmm, I'll have to think about that one. Not sure what the exact semantics
should be.

> Upon further thought, maybe subclassing unicode isn't the right thing --
> perhaps it should really just wrap a string.

No, it must be a 'real' string to avoid having to check for another special
case in the API (and likely in other places that we do not control).

> It's all a bit tricky ;)

I know.
:)

Stefan

From ianb at colorstudy.com Mon Jan 7 22:15:59 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 07 Jan 2008 15:15:59 -0600
Subject: [lxml-dev] cssselect and cssutils
In-Reply-To: <478283DB.9090403@behnel.de>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com> <478283DB.9090403@behnel.de>
Message-ID: <4782968F.2000406@colorstudy.com>

Stefan Behnel wrote:
> Hi Ian,
>
> Ian Bicking wrote:
>> element.text is just a unicode string.
>
> or a plain string.
>
>
>> Maybe we could have a method
>> like element.text_range(0, 1) that returns a subclass of unicode that
>> also happens to know something about its location.
>
> I prefer having the XPath string results be something like that. I think
> that's the only case where you can 'spuriously' end up with a text value and
> might want to know where it came from.
>
>
>> class ElementText(unicode):
>
> Maybe we should still keep up the str/unicode duality here. Although that will
> be history with Python 3, it isn't now, and it is an integral part of the
> current lxml API.

ElementUnicodeText and ElementStrText? Not very pretty :-P

>
>>     def __new__(cls, text, is_tail, range, parent):
>>         self = unicode.__new__(cls, text)
>>         self.is_tail = is_tail
>
> Right, 'is_tail' should be in.
>
>
>>         self.range = range
>
> 'range' would be the substring indices? I would prefer calculating as much as
> possible on demand. Remember, most people will not use this object in any
> other way than a plain string. That's why I'm so hesitant about instantiating
> an Element object along the road.

Range is the slice that is selected, which is necessary for manipulation
later (like enclose_in_tag). I'm not proposing this in any way replace text
and tail.
These don't feel quite like strings to me. Strings are interchangeable and
simple. These are located in a specific place. If you call text.capitalize(),
what does that do? Give you a capitalized view on the text? Give you a new
string that loses all sense of place? It doesn't feel like a string at all,
which is why I'm not sure it should even subclass from unicode/str. Or, for
that matter, get used in lots of different contexts.

Maybe having XPath return values makes it important to be fast. I'm not
sure. It doesn't seem that important that XPath return values be
particularly light. And I've found it problematic sometimes that
non-node XPath return values are just strings (though that's been more
an issue of attributes, where I'd like to know what attribute or element
the value belonged to). OTOH, something like XPath's string() really is
a string without any place. So it's all just kind of eclectic and awkward.

>
>>     def enclose_in_tag(self, el):
>>         """
>>         Enclose this text range in an element, like::
>>
>>             span = Element('span')
>>             el.text_range(0, 1).enclose_in_tag(span)
>>         """
>
> Hmm, I'll have to think about that one. Not sure what the exact semantics
> should be.

It occurred to me thinking about how you could actually do something useful
with ::first-letter, like:

def apply_style(doc, selector, style):
    for item in selector(doc):
        if isinstance(item, ElementText):
            el = Element('span')
            item.enclose_in_tag(el)
            item = el
        item.set('style', item.get('style', '') + '; ' + style)

This and the only other use case I currently have in my head for ranges
(highlighting a selection of the document) would use something like
enclose_in_tag.

I can't remember what I was doing when I wanted XPath attributes, except I
think it was matching something like @*, where the attribute name mattered
but I didn't want to query on it. I think I ended up selecting elements and
looping through the attributes instead. Maybe this was in some iteration of
the HTML cleaning code.

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From stefan_ml at behnel.de Tue Jan 8 12:45:33 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 08 Jan 2008 12:45:33 +0100
Subject: [lxml-dev] special string subclasses for XPath string results
In-Reply-To: <478333B6.6030906@behnel.de>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com> <478283DB.9090403@behnel.de> <4782968F.2000406@colorstudy.com> <478333B6.6030906@behnel.de>
Message-ID: <4783625D.8010307@behnel.de>

Hi again,

Stefan Behnel wrote:
> How do you instantiate a
> custom unicode subclass from a UTF-8 char*? You can't use the normal C-API
> functions, so I guess you'd have to instantiate a normal unicode object, then
> determine its length, and then build the custom subclass for the result length
> and copy the string over. That's ugly and it would certainly slow things down.

Ok, I looked at the Python source and found that this is partially special
cased already. All that is left to do is decode a unicode object from the
char* and instantiate the subclass with it. The copying will be done
internally.

So here are some performance numbers for Py2.5.1.
At first sight, this looks like we have a clear winner:

$ python -m timeit -s 'unicode("testtest")'
10000000 loops, best of 3: 0.0464 usec per loop

$ python -m timeit -s 'class t(unicode): pass' 't("testtest")'
1000000 loops, best of 3: 1.67 usec per loop

Now, a little more instantiating a subclass and copying the string:

$ python -m timeit -s 'class t(unicode): pass' -s 's=unicode("test" * 20)' 't(s)'
1000000 loops, best of 3: 0.794 usec per loop

$ python -m timeit -s 'class t(unicode): pass' -s 's=unicode("test" * 200)' 't(s)'
1000000 loops, best of 3: 1.09 usec per loop

$ python -m timeit -s 'class t(unicode): pass' -s 's=unicode("test" * 2000)' 't(s)'
100000 loops, best of 3: 6.22 usec per loop

Same for str:

$ python -m timeit -s 'class t(str): pass' -s 's="test" * 200' 't(s)'
1000000 loops, best of 3: 1.27 usec per loop

$ python -m timeit -s 'class t(str): pass' -s 's="test" * 2000' 't(s)'
100000 loops, best of 3: 7.03 usec per loop

Funny enough, this is actually slower than unicode on my machine.
As the following numbers show, however, the task at hand is clearly dominated
by decoding:

$ python -m timeit -s 'class t(unicode): pass' -s 's="test" * 200' 't(unicode(s, "utf-8"))'
100000 loops, best of 3: 5.23 usec per loop

$ python -m timeit -s 'class t(unicode): pass' -s 's="test" * 2000' 't(unicode(s, "utf-8"))'
10000 loops, best of 3: 41.2 usec per loop

Decoding by itself:

$ python -m timeit -s 's="test" * 2000' 'unicode(s, "utf-8")'
10000 loops, best of 3: 34.9 usec per loop

Even going straight through the C-API doesn't help much - 'decode' is a little
test module written in Cython for that purpose:

$ python -m timeit -s 'from decode import decode' -s 's="test" * 2000' 'decode(s)'
10000 loops, best of 3: 34.4 usec per loop

$ python -m timeit -s 'class t(unicode): pass' -s 'from decode import decode' -s 's="test" * 2000' 't(decode(s))'
10000 loops, best of 3: 40.7 usec per loop

We also shouldn't forget that we are talking microseconds here, so,
performance-wise, there is no reason why we shouldn't use a subclass,
especially after having just given the XPath engine a run.

I'll give it a try. Maybe this can even still go in for 2.0.

Stefan

From stefan_ml at behnel.de Tue Jan 8 09:26:30 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 08 Jan 2008 09:26:30 +0100
Subject: [lxml-dev] special string subclasses for XPath string results
In-Reply-To: <4782968F.2000406@colorstudy.com>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com> <478283DB.9090403@behnel.de> <4782968F.2000406@colorstudy.com>
Message-ID: <478333B6.6030906@behnel.de>

Hi Ian,

Ian Bicking wrote:
> Stefan Behnel wrote:
>> Maybe we should still keep up the str/unicode duality here.
Although >> that will >> be history with Python 3, it isn't now, and it is an integral part of the >> current lxml API. > > ElementUnicodeText and ElementStrText? Not very pretty :-P Well, fine, but that's how it is. Users will not have to deal with the classes anyway, they will just be used in the background. isinstance(result, unicode) will work just like before, as will isinstance(result, basestring). The difference is just what happens when you call str() on them, or when you pass them into Python's API, or... Some APIs are ignorant regarding str/unicode, others are not. We shouldn't deliberately break those that are not, that would just slow down *everything*. > Maybe having XPath return values makes it important to be fast. Not necessarily fast, but it shouldn't slow things down unnecessarily for stuff that most people won't use. I imagine that the tricky part is the case where it actually is a (non-ASCII) Unicode value. How do you instantiate a custom unicode subclass from a UTF-8 char*? You can't use the normal C-API functions, so I guess you'd have to instantiate a normal unicode object, then determine its length, and then build the custom subclass for the result length and copy the string over. That's ugly and it would certainly slow things down. > It doesn't seem that important that XPath return values be > particularly light. And I've found it problematic sometimes that > non-node XPath return values are just strings (though that's been more > an issue of attributes, where I'd like to know what attribute or element > the value belonged to). OTOH, something like XPath's string() really is > a string without any place. So it's all just kind of eclectic and awkward. There's only so much we can do anyway. If the libxml2 result is a string, there is no way we can figure out where it came from. So the result of string() will always be a normal string instance. 
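To make the idea concrete, here is a minimal sketch of a string subclass that remembers where it came from. It is written for Python 3 (so `str` stands in for `unicode`) and uses the standard library's `xml.etree.ElementTree` only to have an element to point at. The class name `ElementText`, its constructor arguments and the helper `describe` are assumptions made for this illustration; they are not the classes used in lxml.

```python
import xml.etree.ElementTree as ET


class ElementText(str):
    """Hypothetical string result that keeps a reference to its element."""

    def __new__(cls, value, parent=None, is_tail=False, attrname=None):
        self = super().__new__(cls, value)
        self._parent = parent
        self.is_tail = is_tail
        self.attrname = attrname
        return self

    @property
    def is_attribute(self):
        return self.attrname is not None

    @property
    def is_text(self):
        # Plain ".text" content: neither a tail nor an attribute value.
        return not self.is_tail and not self.is_attribute

    def getparent(self):
        return self._parent


def describe(result):
    """Report the origin of a result if it carries one."""
    if isinstance(result, str) and hasattr(result, "getparent"):
        if result.is_attribute:
            kind = "attribute"
        elif result.is_tail:
            kind = "tail"
        else:
            kind = "text"
        return "%s of <%s>: %s" % (kind, result.getparent().tag, result)
    return "plain string: %s" % result


if __name__ == "__main__":
    p = ET.Element("p")
    p.text = "hello"
    print(describe(ElementText(p.text, parent=p)))
    print(describe("hello"))
```

Because the subclass is still a real string, `isinstance()` checks and ordinary string operations keep working; only code that looks for `getparent()` sees a difference.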
The only place where we could change something would be the case where the expression selects a text node or an attribute (text() and @...). I don't even think we can support ranges here. They would normally result from the substring() function, right? I doubt that would return anything but a plain string value.

So to handle the result properly, you could do

    if isinstance(result, basestring):
        if hasattr(result, 'getparent'):
            print result.getparent().tag
            print result.is_text
            print result.is_tail
            print result.is_attribute
        else:
            print result

BTW, I would also add "is_text" in that case. It would seem weird to check "is_attribute" and "is_tail" before you can determine that it actually is the common case of a normal ".text" value. (Although "is_text" sounds a bit more general than what it means here...)

I actually think we are talking about lxml 2.1 stuff here. I'll try to get 2.0 out of the door and then we can see how we could implement these things.

Stefan

From dfedoruk at gmail.com Wed Jan 9 18:55:23 2008
From: dfedoruk at gmail.com (Dmitri Fedoruk)
Date: Wed, 9 Jan 2008 20:55:23 +0300
Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash on freebsd/amd64
In-Reply-To: <477FDCD6.9050904@behnel.de>
References: <4772AFBE.8020801@behnel.de> <47769725.9050802@behnel.de> <477FDCD6.9050904@behnel.de>
Message-ID:

Hello,

> Here is a trivial patch (the one against xslt.pxi)
> It seems to work for me, any chance you could give it a try?

Thank you for the patches, I'll apply them and see what happens next. The thing is that such an exception occurs very seldom and I cannot reproduce it.

Nevertheless, coming back to the thread subject. As we have managed to find out, it is indeed the deallocation problem. I've played around with the variable that caused the trouble, tried to make it global, for example - this changed only the position of the crash, but not the reason. When the memory has to be free'd, the crash happens.
Unfortunately I managed to reproduce this bug on 3 versions of FreeBSD 6.2 and on the i386 architecture too, which had never happened in 6 months of development.

But i386 is capable of running valgrind. So I got these errors:

==77394== Invalid free() / delete / delete[]
==77394==    at 0x3C03867F: free (in /usr/local/lib/valgrind/vgpreload_memcheck.so)
==77394==    by 0x3CF97668: xmlFreeNodeList (in /usr/X11R6/lib/libxml2.so.5)
==77394==    by 0x3CF974F0: xmlFreeProp (in /usr/X11R6/lib/libxml2.so.5)
==77394==    by 0x3CF9754F: xmlFreePropList (in /usr/X11R6/lib/libxml2.so.5)
==77394==  Address 0x3C9C5E8B is 743 bytes inside a block of size 1024 alloc'd
==77394==    at 0x3C038183: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck.so)
==77394==    by 0x3D02B4EE: xmlDictAddString (in /usr/X11R6/lib/libxml2.so.5)
==77394==    by 0x3D02BBEB: xmlDictLookup (in /usr/X11R6/lib/libxml2.so.5)
==77394==    by 0x3CF80425: xmlDetectSAX2 (in /usr/X11R6/lib/libxml2.so.5)

More than that, these messages were preceded by a bunch of errors from libpython itself:

==77390== Use of uninitialised value of size 4
==77390==    at 0x3C6D2AC9: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C735BA4: _PyObject_GC_Resize (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C711382: PyEval_EvalFrameEx (in /usr/X11R6/lib/libpython2.5.so)
==77390==
==77390== Invalid read of size 4
==77390==    at 0x3C6D2AAF: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C735BA4: _PyObject_GC_Resize (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C711382: PyEval_EvalFrameEx (in /usr/X11R6/lib/libpython2.5.so)
==77390== Conditional jump or move depends on uninitialised value(s)
==77390==    at 0x3C6D2AB8: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C735BA4: _PyObject_GC_Resize (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so)
==77390==    by 0x3C711382: PyEval_EvalFrameEx (in /usr/X11R6/lib/libpython2.5.so)

(repeated many times during the apache thread initialisation).

So this is not really an lxml problem... But maybe somebody has an idea? Right now I'm thinking of replacing mod_python with mod_fastcgi.

Thanks for your attention so far!

Dmitri

From dfedoruk at gmail.com Wed Jan 9 19:05:02 2008
From: dfedoruk at gmail.com (Dmitri Fedoruk)
Date: Wed, 9 Jan 2008 21:05:02 +0300
Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash on freebsd/amd64
In-Reply-To:
References: <4772AFBE.8020801@behnel.de> <47769725.9050802@behnel.de> <477FDCD6.9050904@behnel.de>
Message-ID:

Greetings once more,

An update to the previous message: never mind the different PIDs in the valgrind output, I just copied the wrong ones. The error messages from libpython2.5.so in the process with the invalid free\delete are the same as the ones I posted before.

> ==77394== Invalid free() / delete / delete[]
[omitted]
> More than that, these messages were preceded by a bunch of errors from
> the libpython itself:
> ==77390== Use of uninitialised value of size 4
[omitted]

Cheers,
Dmitri

From ianb at colorstudy.com Wed Jan 9 19:19:02 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Wed, 09 Jan 2008 12:19:02 -0600
Subject: [lxml-dev] special string subclasses for XPath string results
In-Reply-To: <478333B6.6030906@behnel.de>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com> <478283DB.9090403@behnel.de> <4782968F.2000406@colorstudy.com> <478333B6.6030906@behnel.de>
Message-ID: <47851016.6050905@colorstudy.com>

Stefan Behnel wrote:
>> It doesn't seem that important that XPath return
values be >> particularly light. And I've found it problematic sometimes that >> non-node XPath return values are just strings (though that's been more >> an issue of attributes, where I'd like to know what attribute or element >> the value belonged to). OTOH, something like XPath's string() really is >> a string without any place. So it's all just kind of eclectic and awkward. > > There's only so much we can do anyway. If the libxml2 result is a string, > there is no way we can figure out where it came from. So the result of > string() will always be a normal string instance. The only way where we could > change something would be the case where the expression selects a text node or > an attribute (text() and @...). I don't even think we can support ranges here. > They would normally result from the substring() function, right? I doubt that > would return anything but a plain string value. That's kind of why I think performance doesn't matter, because it won't even come into play most of the time. In relation to XPath, the one thing I would like is some representation of attributes. There's a backward compatible issue, but the underlying engine returns attributes as something different than normal text anyway, right? I think attributes are mostly a different use case than text ranges. For something like ::first-letter, I didn't really expect it to be possible to compile that to XPath. Instead it would have to be something like: def first_letter_selector(xpath_expr): def selector(el): result = xpath_expr(el) return result.text_range(0, 1) return selector For representing ranges I'd also like some text range (I don't have any immediate needs, so I'm personally in no rush here). But it's not something that would have to replace the current text/tail attributes. It's just that in some code it can be nice to have something similar to the DOM TextNode, and this kind of provides that (except more nicely I think, as it would be more like a view). 
Then the range might just be like:

class TextRange(object):
    def __init__(self, el, range, is_text):
        self.el = el
        assert range[0] >= 0
        self.range = range # (start, end) tuple
        self.is_text = is_text
    def __unicode__(self):
        start, end = self.range
        if end == 0:
            return ''
        if self.is_text:
            t = self.el.text
        else:
            t = self.el.tail
        if t is None:
            raise ValueError(...)
        if not isinstance(t, unicode):
            t = unicode(t, 'utf8') #?
        # compare against the stored end offset, not the builtin range
        if end > len(t):
            raise ValueError(
                "TextRange %r is invalid (element has been changed?)"
                % self)
        return t[start:end]
    def __repr__(self):
        if self.is_text:
            meth = 'text_range'
        else:
            meth = 'tail_range'
        return '%r.%s%s' % (self.el, meth, self.range)
    def getparent(self):
        return self.el
    # and other stuff that might be convenient...

No place where I am currently using text/tail do I really want this kind of behavior.

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From stefan_ml at behnel.de Wed Jan 9 19:43:26 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 09 Jan 2008 19:43:26 +0100
Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash on freebsd/amd64
In-Reply-To:
References: <4772AFBE.8020801@behnel.de> <47769725.9050802@behnel.de> <477FDCD6.9050904@behnel.de>
Message-ID: <478515CE.5050101@behnel.de>

Hi,

Dmitri Fedoruk wrote:
>> Here is a trivial patch (the one against xslt.pxi)
>> It seems to work for me, any chance you could give it a try?
> Thank you for the patches, I'll apply them and see what happens next.

Thanks.

> The thing is that such an exception occurs very seldom and I can not
> reproduce it.

You will still notice if it doesn't work. :)

> Nevertheless, coming back to the thread subject. As we have managed to
> find out, it is indeed the deallocation problem. I've played around
> with the variable that caused the trouble, tried to make it global,
> for example - this changed only the position of the crash, but not the
> reason. When the memory has to be free'd, the crash happens.
>
> Unfortunately I managed to reproduce this bug on 3 versions of
> FreeBSD 6.2 and on the i386 architecture too, which had never happened
> in 6 months of development.

It shouldn't be machine dependent. If it's there, it's in the code. Garbage collection and threading might work differently on different architectures, but that won't remove the actual problem.

> But i386 is capable of running valgrind. So I got these errors:
> ==77394== Invalid free() / delete / delete[]
> ==77394== at 0x3C03867F: free (in
> /usr/local/lib/valgrind/vgpreload_memcheck.so)
> ==77394== by 0x3CF97668: xmlFreeNodeList (in /usr/X11R6/lib/libxml2.so.5)
> ==77394== by 0x3CF974F0: xmlFreeProp (in /usr/X11R6/lib/libxml2.so.5)
> ==77394== by 0x3CF9754F: xmlFreePropList (in /usr/X11R6/lib/libxml2.so.5)
> ==77394== Address 0x3C9C5E8B is 743 bytes inside a block of size 1024 alloc'd
> ==77394== at 0x3C038183: malloc (in
> /usr/local/lib/valgrind/vgpreload_memcheck.so)
> ==77394== by 0x3D02B4EE: xmlDictAddString (in /usr/X11R6/lib/libxml2.so.5)
> ==77394== by 0x3D02BBEB: xmlDictLookup (in /usr/X11R6/lib/libxml2.so.5)
> ==77394== by 0x3CF80425: xmlDetectSAX2 (in /usr/X11R6/lib/libxml2.so.5)

Funny place for a malloc. Anyway, this is only a symptom. The problem is that the document or an XML node gets freed either while it's still in use, or by two independent parties (i.e. Python element proxies that refer to it). With a bug that occurs this seldom and a setup as complex as mod_python, it's really hard to narrow down the test case, so I don't know how far you could get here.

However, I'm currently chasing a (pretty old) bug myself. Maybe it's related already. Could you check with the current SVN trunk if that works better for you? Although the stack trace above gives me doubts...
> More than that, this messages were preceeded by a bunch of errors from > the libpython itself: > ==77390== Use of uninitialised value of size 4 > ==77390== at 0x3C6D2AC9: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C735BA4: _PyObject_GC_Resize (in > /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C711382: PyEval_EvalFrameEx (in > /usr/X11R6/lib/libpython2.5.so) > ==77390== > ==77390== Invalid read of size 4 > ==77390== at 0x3C6D2AAF: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C735BA4: _PyObject_GC_Resize (in > /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C711382: PyEval_EvalFrameEx (in > /usr/X11R6/lib/libpython2.5.so) > ==77390== Conditional jump or move depends on uninitialised value(s) > ==77390== at 0x3C6D2AB8: PyObject_Realloc (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C735BA4: _PyObject_GC_Resize (in > /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C6BE525: PyFrame_New (in /usr/X11R6/lib/libpython2.5.so) > ==77390== by 0x3C711382: PyEval_EvalFrameEx (in > /usr/X11R6/lib/libpython2.5.so) > (repeated many times during the apache thread initialisation). Hmmm, not sure what this means. Might be entirely unrelated. Valgrind uses a suppression file that drops a lot of false positives, maybe those are just false positives of mod_python. > So, this is not the lxml problem really... But maybe somebody has any idea? > Right now I'm thinking of opportunity to replace mod_python with mod_fastcgi . I have neither experience with mod_python nor with mod_fastcgi, sorry. 
Stefan

From stefan_ml at behnel.de Thu Jan 10 00:33:12 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 10 Jan 2008 00:33:12 +0100
Subject: [lxml-dev] special string subclasses for XPath string results
In-Reply-To: <47851016.6050905@colorstudy.com>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local> <4781EDAF.1010604@behnel.de> <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local> <4781FB88.900@behnel.de> <4782598E.3000803@colorstudy.com> <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local> <47826668.90807@colorstudy.com> <478283DB.9090403@behnel.de> <4782968F.2000406@colorstudy.com> <478333B6.6030906@behnel.de> <47851016.6050905@colorstudy.com>
Message-ID: <478559B8.5020703@behnel.de>

Hi Ian,

I have now implemented the basic behaviour on the trunk. It first crashed because of a bug in Cython 0.9.6.10b, hope my patch will find its way into a release soon.

Attributes and text nodes are now handled. String results (such as returned by the string() function) will remain plain strings - no way to recover here.

Ian Bicking wrote:
> For something like ::first-letter, I didn't really expect it to be
> possible to compile that to XPath. Instead it would have to be
> something like:
>
> def first_letter_selector(xpath_expr):
>     def selector(el):
>         result = xpath_expr(el)
>         return result.text_range(0, 1)
>     return selector

Something like that, yes. You could do

    for element in xpath_expr(el):
        for text in element.itertext():
            if text:
                return text[0]
    return None # or raise or ...

or wouldn't we have to return a list here? As in

    return [ text[0]
             for element in xpath_expr(el)
             for text in element.itertext()
             if text ]

> For representing ranges I'd also like some text range (I don't have any
> immediate needs, so I'm personally in no rush here). But it's not
> something that would have to replace the current text/tail attributes.
> It's just that in some code it can be nice to have something similar to > the DOM TextNode, and this kind of provides that (except more nicely I > think, as it would be more like a view). > > Then the range might just be like: > > class TextRange(object): [...] You could file this as a feature request in the bug tracker so that we won't forget about it. It's definitely not clear enough for 2.0. Stefan From stefan_ml at behnel.de Fri Jan 11 16:42:06 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 11 Jan 2008 16:42:06 +0100 Subject: [lxml-dev] lxml 2.0 beta1 released Message-ID: <47878E4E.5080800@behnel.de> Hi all, I finally managed to push lxml 2.0beta1 over to PyPI. This release marks the end of the four month alpha cycle of lxml 2.0. The last stable release series, lxml 1.3, saw the light of day more than six months ago. http://codespeak.net/lxml/dev/ http://pypi.python.org/pypi/lxml/2.0beta1 The complete changelog for beta1 and the 2.0 alpha series follows below. Apart from a number of important fixes and enhancements, this beta release also finalises the major API changes that make the difference between 1.x and 2.x. Incompatible changes after this release will require a very good motivation. As usual, compatible enhancements will always be embraced - as will be updates, clarifications and fixes for the documentation! Asking back helps. I expect beta1 to also be the last beta release before lxml 2.0 final (hopefully not in the sense that alpha4/5/6 were), so please test as much as you can to spot any remaining bugs and problems. Note that this release depends on a bug fix in Cython that will hopefully be released as Cython 0.9.6.11 in a couple of days. I attached the necessary patch for those who want work on the sources. Another thing: there was a security advisory on the libxml2 mailing list. 
To prevent DoS attacks, systems that parse XML from untrusted sources should be updated to libxml2 2.6.31 (or should apply the patch that is referenced in Daniel's post below). http://mail.gnome.org/archives/xml/2008-January/msg00036.html Sidnei, when you build the Windows binaries, could you please wait for libxml2 2.6.31 to become available as binaries as well? Hopefully, that won't take too long... Have fun, Stefan 2.0beta1 (2008-01-11) ===================== Features added -------------- * Parse-time XML schema validation (``schema`` parser keyword). * XPath string results of the ``text()`` function and attribute selection make their Element container accessible through a ``getparent()`` method. As a side-effect, they are now always unicode objects (even ASCII strings). * ``XSLT`` objects are usable in any thread - at the cost of a deep copy if they were not created in that thread. * Invalid entity names and character references will be rejected by the ``Entity()`` factory. * ``entity.text`` returns the textual representation of the entity, e.g. ``&``. Bugs fixed ---------- * XPath on ElementTrees could crash when selecting the virtual root node of the ElementTree. * Compilation ``--without-threading`` was buggy in alpha5/6. Other changes ------------- * Minor performance tweaks for Element instantiation and subelement creation 2.0alpha6 (2007-12-19) ====================== Features added -------------- * New properties ``position`` and ``code`` on ParseError exception (as in ET 1.3) Bugs fixed ---------- * Memory leak in the ``parse()`` function. * Minor bugs in XSLT error message formatting. * Result document memory leak in target parser. Other changes ------------- * Various places in the XPath, XSLT and iteration APIs now require keyword-only arguments. * The argument order in ``element.itersiblings()`` was changed to match the order used in all other iteration methods. The second argument ('preceding') is now a keyword-only argument. 
* The ``getiterator()`` method on Elements and ElementTrees was reverted to return an iterator as it did in lxml 1.x. The ET API specification allows it to return either a sequence or an iterator, and it traditionally returned a sequence in ET and an iterator in lxml. However, it is now deprecated in favour of the ``iter()`` method, which should be used in new code wherever possible. * The 'pretty printed' serialisation of ElementTree objects now inserts newlines at the root level between processing instructions, comments and the root tag. * A 'pretty printed' serialisation is now terminated with a newline. * Second argument to ``lxml.etree.Extension()`` helper is no longer required, third argument is now a keyword-only argument ``ns``. * ``lxml.html.tostring`` takes an ``encoding`` argument. 2.0alpha5 (2007-11-24) ====================== Features added -------------- * Rich comparison of ``element.attrib`` proxies. * ElementTree compatible TreeBuilder class. * Use default prefixes for some common XML namespaces. * ``lxml.html.clean.Cleaner`` now allows for a ``host_whitelist``, and two overridable methods: ``allow_embedded_url(el, url)`` and the more general ``allow_element(el)``. * Extended slicing of Elements as in ``element[1:-1:2]``, both in etree and in objectify * Resolvers can now provide a ``base_url`` keyword argument when resolving a document as string data. * When using ``lxml.doctestcompare`` you can give the doctest option ``NOPARSE_MARKUP`` (like ``# doctest: +NOPARSE_MARKUP``) to suppress the special checking for one test. Bugs fixed ---------- * Target parser failed to report comments. * In the ``lxml.html`` ``iter_links`` method, links in ```` tags weren't recognized. (Note: plugin-specific link parameters still aren't recognized.) Also, the ```` tag, though not standard, is now included in ``lxml.html.defs.special_inline_tags``. * Using custom resolvers on XSLT stylesheets parsed from a string could request ill-formed URLs. 
* With ``lxml.doctestcompare`` if you do ```` in your output, it will then be namespace-neutral (before the ellipsis was treated as a real namespace). Other changes ------------- * The module source files were renamed to "lxml.*.pyx", such as "lxml.etree.pyx". This was changed for consistency with the way Pyrex commonly handles package imports. The main effect is that classes now know about their fully qualified class name, including the package name of their module. * Keyword-only arguments in some API functions, especially in the parsers and serialisers. 2.0alpha4 (2007-10-07) ====================== Features added -------------- Bugs fixed ---------- * AttributeError in feed parser on parse errors Other changes ------------- * Tag name validation in lxml.etree (and lxml.html) now distinguishes between HTML tags and XML tags based on the parser that was used to parse or create them. HTML tags no longer reject any non-ASCII characters in tag names but only spaces and the special characters ``<>&/"'``. 2.0alpha3 (2007-09-26) ====================== Features added -------------- * Separate ``feed_error_log`` property for the feed parser interface. The normal parser interface and ``iterparse`` continue to use ``error_log``. * The normal parsers and the feed parser interface are now separated and can be used concurrently on the same parser instance. * ``fromstringlist()`` and ``tostringlist()`` functions as in ElementTree 1.3 * ``iterparse()`` accepts an ``html`` boolean keyword argument for parsing with the HTML parser (note that this interface may be subject to change) * Parsers accept an ``encoding`` keyword argument that overrides the encoding of the parsed documents. * New C-API function ``hasChild()`` to test for children * ``annotate()`` function in objectify can annotate with Python types and XSI types in one step. Accompanied by ``xsiannotate()`` and ``pyannotate()``. 
Bugs fixed ---------- * XML feed parser setup problem * Type annotation for unicode strings in ``DataElement()`` Other changes ------------- * lxml.etree now emits a warning if you use XPath with libxml2 2.6.27 (which can crash on certain XPath errors) * Type annotation in objectify now preserves the already annotated type by default to prevent loosing type information that is already there. 2.0alpha2 (2007-09-15) ====================== Features added -------------- * ``ET.write()``, ``tostring()`` and ``tounicode()`` now accept a keyword argument ``method`` that can be one of 'xml' (or None), 'html' or 'text' to serialise as XML, HTML or plain text content. * ``iterfind()`` method on Elements returns an iterator equivalent to ``findall()`` * ``itertext()`` method on Elements * Setting a QName object as value of the .text property or as an attribute will resolve its prefix in the respective context * ElementTree-like parser target interface as described in http://effbot.org/elementtree/elementtree-xmlparser.htm * ElementTree-like feed parser interface on XMLParser and HTMLParser (``feed()`` and ``close()`` methods) Bugs fixed ---------- * lxml failed to serialise namespace declarations of elements other than the root node of a tree * Race condition in XSLT where the resolver context leaked between concurrent XSLT calls Other changes ------------- * ``element.getiterator()`` returns a list, use ``element.iter()`` to retrieve an iterator (ElementTree 1.3 compatible behaviour) 2.0alpha1 (2007-09-02) ====================== Features added -------------- * Reimplemented ``objectify.E`` for better performance and improved integration with objectify. Provides extended type support based on registered PyTypes. * XSLT objects now support deep copying * New ``makeSubElement()`` C-API function that allows creating a new subelement straight with text, tail and attributes. 
* XPath extension functions can now access the current context node (``context.context_node``) and use a context dictionary (``context.eval_context``) from the context provided in their first parameter * HTML tag soup parser based on BeautifulSoup in ``lxml.html.ElementSoup`` * New module ``lxml.doctestcompare`` by Ian Bicking for writing simplified doctests based on XML/HTML output. Use by importing ``lxml.usedoctest`` or ``lxml.html.usedoctest`` from within a doctest. * New module ``lxml.cssselect`` by Ian Bicking for selecting Elements with CSS selectors. * New package ``lxml.html`` written by Ian Bicking for advanced HTML treatment. * Namespace class setup is now local to the ``ElementNamespaceClassLookup`` instance and no longer global. * Schematron validation (incomplete in libxml2) * Additional ``stringify`` argument to ``objectify.PyType()`` takes a conversion function to strings to support setting text values from arbitrary types. * Entity support through an ``Entity`` factory and element classes. XML parsers now have a ``resolve_entities`` keyword argument that can be set to False to keep entities in the document. * ``column`` field on error log entries to accompany the ``line`` field * Error specific messages in XPath parsing and evaluation NOTE: for evaluation errors, you will now get an XPathEvalError instead of an XPathSyntaxError. 
To catch both, you can except on ``XPathError`` * The regular expression functions in XPath now support passing a node-set instead of a string * Extended type annotation in objectify: new ``xsiannotate()`` function * EXSLT RegExp support in standard XPath (not only XSLT) Bugs fixed ---------- * lxml.etree did not check tag/attribute names * The XML parser did not report undefined entities as error * The text in exceptions raised by XML parsers, validators and XPath evaluators now reports the first error that occurred instead of the last * Passing '' as XPath namespace prefix did not raise an error * Thread safety in XPath evaluators Other changes ------------- * objectify.PyType for None is now called "NoneType" * ``el.getiterator()`` renamed to ``el.iter()``, following ElementTree 1.3 - original name is still available as alias * In the public C-API, ``findOrBuildNodeNs()`` was replaced by the more generic ``findOrBuildNodeNsPrefix`` * Major refactoring in XPath/XSLT extension function code * Network access in parsers disabled by default -------------- next part -------------- A non-text attachment was scrubbed... Name: cython-nogc-fix.patch Type: text/x-patch Size: 1422 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080111/45834943/attachment.bin From sidnei at enfoldsystems.com Fri Jan 11 16:47:00 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Fri, 11 Jan 2008 13:47:00 -0200 Subject: [lxml-dev] lxml 2.0 beta1 released In-Reply-To: <47878E4E.5080800@behnel.de> References: <47878E4E.5080800@behnel.de> Message-ID: On Jan 11, 2008 1:42 PM, Stefan Behnel wrote: > Sidnei, when you build the Windows binaries, could you please wait for libxml2 > 2.6.31 to become available as binaries as well? Hopefully, that won't take too > long... Note taken! 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From dfedoruk at gmail.com Fri Jan 11 18:38:12 2008 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Fri, 11 Jan 2008 20:38:12 +0300 Subject: [lxml-dev] lxml 2.0 beta1 released In-Reply-To: <47878E4E.5080800@behnel.de> References: <47878E4E.5080800@behnel.de> Message-ID: Hello, > I finally managed to push lxml 2.0beta1 over to PyPI. This release marks the > end of the four month alpha cycle of lxml 2.0. The last stable release series, > lxml 1.3, saw the light of day more than six months ago. That's great! Speaking about your last proposal abot the apache crash, where you've suggested to get the svn snapshot - is the needed code in this build already? I've applied the patch for xslt (which is obviously marked as "`XSLT`` objects are usable in any thread" feature) to the 1.3.5 version of the library. I'll upgrade all my machines to the new build and see what happens :) Cheers, Dmitri From chairos at gmail.com Sat Jan 12 05:14:32 2008 From: chairos at gmail.com (Jon Rosebaugh) Date: Fri, 11 Jan 2008 22:14:32 -0600 Subject: [lxml-dev] Installing lxml 2.0beta1 via easy_install requires Cython; also, question about lxml.html.clean.clean_html Message-ID: I attempted to install lxml 2.0beta1 via easy_install (easy_install lxml==2.0beta1), and it didn't work. After a bunch of experimentation, I discovered that the C files that are supposed to be present in the download were not present. After installing a patched version of Cython 0.9.6.10b (patched according to the directions I found on this list) lxml successfully installed. But I was very surprised at this requirement. Also, I'm not sure, but I think the lxml.html.clean.clean_html() function might not be working properly? I followed the example at http://codespeak.net/lxml/dev/lxmlhtml.html#cleaning-up-html but got different results. I expected this:
a link another link

a paragraph

secret EVIL!
of EVIL! Password: annoying EVIL! spam spam SPAM!
But got this:
a link another link

a paragraph

secret EVIL!
of EVIL! Password: annoying EVIL!spam spam SPAM!
From stefan_ml at behnel.de Sat Jan 12 09:46:36 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 12 Jan 2008 09:46:36 +0100 Subject: [lxml-dev] Installing lxml 2.0beta1 via easy_install requires Cython; also, question about lxml.html.clean.clean_html In-Reply-To: References: Message-ID: <47887E6C.4030009@behnel.de> Hi, Jon Rosebaugh wrote: > I attempted to install lxml 2.0beta1 via easy_install (easy_install > lxml==2.0beta1), and it didn't work. After a bunch of experimentation, > I discovered that the C files that are supposed to be present in the > download were not present. After installing a patched version of > Cython 0.9.6.10b (patched according to the directions I found on this > list) lxml successfully installed. Hmm, it shouldn't be that hard. The tgz I downloaded has the .c files, so installing without Cython should work just fine. I just removed my local Cython install and did an "easy_install lxml" (which downloaded, built and installed 2.0beta1) and also an "easy_install lxml-2.0beta1.tar.gz". Both worked just fine. Maybe you had an older version of Cython installed? If that's found, it will be used - and obviously fail. > Also, I'm not sure, but I think the lxml.html.clean.clean_html() > function might not be working properly? I followed the example at > http://codespeak.net/lxml/dev/lxmlhtml.html#cleaning-up-html but got > different results. I expected this: > > >
> > a link > another link >

a paragraph

>
secret EVIL!
> of EVIL! > Password: > annoying EVIL! > spam spam SPAM! > >
> > > > But got this: >
> > a link > another link >

a paragraph

>
secret EVIL!
> of EVIL! > > > Password: > annoying EVIL!spam spam SPAM! >
That one should work, too. I just ran lxmlhtml.txt as doctest (which admittedly wasn't included in the test suite before) and it just worked. Same for test_clean.txt. What's the version of libxml2 you are using? Can you try running the test suite and see if that works for you? Stefan From chairos at gmail.com Sat Jan 12 17:00:35 2008 From: chairos at gmail.com (Jon Rosebaugh) Date: Sat, 12 Jan 2008 10:00:35 -0600 Subject: [lxml-dev] Installing lxml 2.0beta1 via easy_install requires Cython; also, question about lxml.html.clean.clean_html In-Reply-To: <47887E6C.4030009@behnel.de> References: <47887E6C.4030009@behnel.de> Message-ID: On Jan 12, 2008 2:46 AM, Stefan Behnel wrote: > Hi, > > Jon Rosebaugh wrote: > > I attempted to install lxml 2.0beta1 via easy_install (easy_install > > lxml==2.0beta1), and it didn't work. After a bunch of experimentation, > > I discovered that the C files that are supposed to be present in the > > download were not present. After installing a patched version of > > Cython 0.9.6.10b (patched according to the directions I found on this > > list) lxml successfully installed. > > Hmm, it shouldn't be that hard. The tgz I downloaded has the .c files, so > installing without Cython should work just fine. I just removed my local > Cython install and did an "easy_install lxml" (which downloaded, built and > installed 2.0beta1) and also an "easy_install lxml-2.0beta1.tar.gz". Both > worked just fine. The tgz linked from the website (http://codespeak.net/lxml/dev/index.html#download -> http://codespeak.net/lxml/dev/lxml-2.0beta1.tgz) gives me a 404, so I used http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0beta1.tar.gz. When I tried just running 'easy_install lxml' without Cython installed, I got compilation errors which I was able to reproduce yesterday, but not today, so I dunno. Best guess I have is some environmental oddity related to macports that went away when I tried it in a new terminal window. 
(I tend to re-use the same eight over and over again.) > > Maybe you had an older version of Cython installed? If that's found, it will > be used - and obviously fail. Nope, I had never installed it before yesterday. > That one should work, too. I just ran lxmlhtml.txt as doctest (which > admittedly wasn't included in the test suite before) and it just worked. Same > for test_clean.txt. > > What's the version of libxml2 you are using? Can you try running the test > suite and see if that works for you? I used libxml2 2.6.30_0 and libxslt 1.1.22_0, both of which are the latest versions in macports. I tried running the test suite with 'make test' and 'python test.py', and got the same results. test_clean seems to pass, but I got the same strange result as I got yesterday when I try the example in the python interpreter. The test suite fails with 14 errors. jon at euterpe:/tmp/lxml-2.0beta1$ python test.py TESTED VERSION: 2.0.beta1 Python: (2, 5, 1, 'final', 0) lxml.etree: (2, 0, -99, 0) libxml used: (2, 6, 30) libxml compiled: (2, 6, 30) libxslt used: (1, 1, 22) libxslt compiled: (1, 1, 22) ====================================================================== ERROR: test_feed_parser_error_broken (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 3014, in test_feed_parser_error_broken ParseError = self.etree.ParseError AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_feed_parser_error_close_empty (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File 
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 3000, in test_feed_parser_error_close_empty ParseError = self.etree.ParseError AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_feed_parser_error_close_incomplete (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 3005, in test_feed_parser_error_close_incomplete ParseError = self.etree.ParseError AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_feed_parser_error_position (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 3028, in test_feed_parser_error_position ParseError = self.etree.ParseError AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_fromstringlist (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 523, in test_fromstringlist 
fromstringlist = self.etree.fromstringlist AttributeError: 'module' object has no attribute 'fromstringlist' ====================================================================== ERROR: test_fromstringlist_characters (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 531, in test_fromstringlist_characters fromstringlist = self.etree.fromstringlist AttributeError: 'module' object has no attribute 'fromstringlist' ====================================================================== ERROR: test_fromstringlist_single (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 538, in test_fromstringlist_single fromstringlist = self.etree.fromstringlist AttributeError: 'module' object has no attribute 'fromstringlist' ====================================================================== ERROR: test_iter (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 1467, in test_iter list(a.iter())) AttributeError: _ElementInterface instance has no attribute 'iter' ====================================================================== ERROR: test_parse_encoding_8bit_explicit 
(lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 2605, in test_parse_encoding_8bit_explicit self.assertRaises(self.etree.ParseError, AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_parse_encoding_8bit_override (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 2622, in test_parse_encoding_8bit_override self.assertRaises(self.etree.ParseError, AttributeError: 'module' object has no attribute 'ParseError' ====================================================================== ERROR: test_tostring_method_html (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 2376, in test_tostring_method_html tostring(html, method="html")) TypeError: tostring() got an unexpected keyword argument 'method' ====================================================================== ERROR: test_tostring_method_text (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File 
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 2393, in test_tostring_method_text tostring(a, method="text")) TypeError: tostring() got an unexpected keyword argument 'method' ====================================================================== ERROR: test_write_method_html (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 737, in test_write_method_html tree.write(f, method="html") TypeError: write() got an unexpected keyword argument 'method' ====================================================================== ERROR: test_write_method_text (lxml.tests.test_elementtree.ElementTreeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/private/tmp/lxml-2.0beta1/src/lxml/tests/test_elementtree.py", line 759, in test_write_method_text tree.write(f, method="text") TypeError: write() got an unexpected keyword argument 'method' ---------------------------------------------------------------------- Ran 1092 tests in 12.936s FAILED (errors=14) From stefan_ml at behnel.de Sat Jan 12 17:50:11 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 12 Jan 2008 17:50:11 +0100 Subject: [lxml-dev] broken output of lxml.html.clean.clean_html In-Reply-To: References: <47887E6C.4030009@behnel.de> Message-ID: <4788EFC3.2020303@behnel.de> Hi, Jon Rosebaugh wrote: > The tgz linked from the website > (http://codespeak.net/lxml/dev/index.html#download -> > 
http://codespeak.net/lxml/dev/lxml-2.0beta1.tgz) gives me a 404 Ah, thanks. I uploaded it to the /lxml directory and forgot to set a link from /lxml/dev... > When I tried just running 'easy_install lxml' without Cython > installed, I got compilation errors which I was able to reproduce > yesterday, but not today :) Well, good to know that it works now. Regarding the missing files, maybe you ran "make clean" somewhere in between your tests, that deletes the .c files (which are generated and usually expected to be in the way when you call "make clean" as a developer). >> What's the version of libxml2 you are using? Can you try running the test >> suite and see if that works for you? > > I used libxml2 2.6.30_0 and libxslt 1.1.22_0, both of which are the > latest versions in macports. ... and they should work just fine - except for tags, which are broken in libxml2 2.6.29/30 (and fixed in 2.6.31). But that wasn't your problem here. > I tried running the test suite with 'make test' and 'python test.py', > and got the same results. test_clean seems to pass, but I got the same > strange result as I got yesterday when I try the example in the python > interpreter. Ah, I think I know what happens. It's the special doctest support for HTML output. To compare the results in the doctest, we parse the expected output with the HTML parser, which also fixes the output that you see in the console and makes it usable HTML. So that keeps us from seeing that the cleanup actually produces garbage... I'll look into it. > The test suite fails with 14 errors. > ====================================================================== > ERROR: test_feed_parser_error_broken > (lxml.tests.test_elementtree.ElementTreeTestCase) [...] Those are fine, you don't have a suitable ElementTree version installed (lxml 2.0 heads for compatibility with ET 1.3, which is not released yet). I actually thought I had disabled those tests for older ET versions... 
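As a side note on the 14 errors above: one common way to keep such tests out of the way is to skip them when the installed ElementTree lacks the feature under test. The following is only a sketch of that idea and not lxml's actual test setup. The helper name `requires_et_feature` and the test class are made up for illustration, and the sketch relies on the standard library's `xml.etree.ElementTree` and on `unittest.skipUnless`, which only exists in later Python versions than the 2.5 used in this thread.

```python
import unittest
import xml.etree.ElementTree as ET


def requires_et_feature(name):
    """Hypothetical helper: skip a test unless the ET module exposes `name`."""
    return unittest.skipUnless(hasattr(ET, name),
                               "ElementTree lacks %r" % name)


class ElementTree13Tests(unittest.TestCase):
    @requires_et_feature("fromstringlist")
    def test_fromstringlist(self):
        # fromstringlist() is one of the ET 1.3 additions that was
        # reported missing in the output above.
        root = ET.fromstringlist(["<a>", "<b/>", "</a>"])
        self.assertEqual(root.tag, "a")
        self.assertEqual(len(root), 1)

    @requires_et_feature("no_such_function")
    def test_missing_feature(self):
        # The attribute does not exist, so this body is never executed.
        self.fail("should have been skipped")


def run_suite():
    """Run the test case and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ElementTree13Tests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With a check like this, a missing `ParseError` or `fromstringlist` shows up as a skipped test rather than as an AttributeError in the test report.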
From stefan_ml at behnel.de Sun Jan 20 11:30:34 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 20 Jan 2008 11:30:34 +0100 Subject: [lxml-dev] type of custom objects in XML-tree disappears In-Reply-To: <20080119175309.31119.84358.malonedeb@gangotri.ubuntu.com> References: <20080119175309.31119.84358.malonedeb@gangotri.ubuntu.com> <20080119175309.31119.84358.malonedeb@gangotri.ubuntu.com> Message-ID: <479322CA.4010200@behnel.de> Hi, I'm responding here as I don't think this is a bug. It should be discussed on the list. mh wrote: > I want to use lxml to build up an XML tree with custom > element objects. > > Background: I want to build up an XML-compatible syntax tree > and provide additional methods and some non XML relevant > attributes in the tree. > > lxml provides etree.XMLParser.setElementClassLookup(...) > and etree.XMLParser.makeelement(...) methods to build up > such a tree. > > It works fine to that point, but after some operations on the > tree, the inserted elements are still there, but some of them > have changed their type from my custom classes to etree.Element > and my additional methods and attributes are lost. > > I tried repeat that kind of error, but it's not easy to make > it repeatable, so I wrote a brute force script to provocate that > kind of error (look at the end of this message). I found this error > on Mac OS X, Windows and Linux and on lxml 1.3.6 and lxml > 2beta1 with Python 2.5. > > Maybe it's not exactly an explicit use case lxml was built for, but > maybe it's worth thinking about that one. > > ----- SNIP ----- > from lxml import etree > import random > import sys > > class MyElement(etree.ElementBase): > TAG="MyElement" > > class Generator: > def __init__(self): > self.__oFactory=etree.XMLParser() > self.__oTree=etree.ElementTree(etree.Element("root"), None) Note that the root element is not taking your element class setup into account here, so it's using the standard class. 
> def CreateElement(self, oClass): > self.__oFactory.setElementClassLookup(etree.ElementDefaultClassLookup(oClass)) > oNew=self.__oFactory.makeelement(oClass.TAG) > return oNew You do not have to call set_element_class_lookup() each time as it sticks with the parser. I'd rather write __init__ like this: parser = etree.XMLParser() parser.set_element_class_lookup(etree.ElementDefaultClassLookup(the_class)) self.__makeelement = parser.makeelement self.__tree = etree.ElementTree(self.__makeelement("root")) > def Run(self): > try: > for i in range(0,200): > self.Visit(self.__oTree.getroot()) > except Exception, oError: > etree.dump(self.__oTree.getroot()) > > def Visit(self, oElement): > if oElement.tag!="root": > if not isinstance(oElement, MyElement): > raise Exception("Failed") > nRandom=random.randint(0,2) > if nRandom==0: > oNew=self.CreateElement(MyElement) > oElement.append(oNew) > elif nRandom==1: > oNew=self.CreateElement(MyElement) > oElement.insert(0, oNew) > for oSub in oElement: > self.Visit(oSub) > > oGen=Generator() > oGen.Run() This seems to work just fine for me (which actually surprises me). The problem is that your root element does not know about the factory you are using for your other elements, but when Element proxy(!) objects are created for a tree, they are based on the factory that is associated with the document that holds the root element. So when element proxy objects were garbage collected and then are recreated on demand, they will use the standard factory, not yours. The way to fix your code is to do something like the code I posted above. Stefan From bkc at murkworks.com Mon Jan 21 03:03:33 2008 From: bkc at murkworks.com (Brad Clements) Date: Sun, 20 Jan 2008 21:03:33 -0500 Subject: [lxml-dev] Problem installing lxml-2.0beta1 onto production server Message-ID: <4793FD75.2080907@murkworks.com> I am installing lxml-2.0beta1 on python 2.3 on RHEL4. 
I brought my libxml2 and libxslt up to date: libxml2-2.6.30-1 libxslt-1.1.22-1 then easy-install -U lxml During the install process, I got this: easy_install-2.3 -U lxml Searching for lxml Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.0beta1 Downloading http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0beta1.tar.gz Processing lxml-2.0beta1.tar.gz Running lxml-2.0beta1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-SE9_WX/lxml-2.0beta1/egg-dist-tmp-onxV0q Building lxml version 2.0.beta1. NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available. warning: no previously-included files found matching 'doc/pyrex.txt' warning: no previously-included files found matching 'src/lxml/etree.pxi' File "build/bdist.linux-i686/egg/lxml/html/diff.py", line 323 start.extend(tag for name, pos, tag in tag_stack) ^ SyntaxError: invalid syntax File "/usr/local/lib/python2.3/site-packages/lxml-2.0beta1-py2.3-linux-i686.egg/lxml/html/diff.py", line 323 start.extend(tag for name, pos, tag in tag_stack) ^ SyntaxError: invalid syntax Adding lxml 2.0beta1 to easy-install.pth file Installed /usr/local/lib/python2.3/site-packages/lxml-2.0beta1-py2.3-linux-i686.egg Processing dependencies for lxml Finished processing dependencies for lxml And when I try to run an application that uses lxml, I get this: @400000004793fafb29d86124 Traceback (most recent call last): @400000004793fafb29e1c764 File "/misc/home/bkc/src/Python/MurkWorks/Rating/RatingServer.py", line 13, in ? @400000004793fafb29e2b994 from MurkWorks.Carriers.Rater import Rater, get_carrier_service_mapper @400000004793fafb29ee36fc File "/misc/home/bkc/src/Python/MurkWorks/Carriers/Rater.py", line 22, in ? 
@400000004793fafb29eee2dc import UPS, BAX, CCX, Estes, FedexFreight, FedexExpress, FedexGround, USPS, Yellow, Aaction, LandAir, FedexNational, CentralTransport @400000004793fafb29f3d094 File "/misc/home/bkc/src/Python/MurkWorks/Carriers/CentralTransport.py", line 14, in ? @400000004793fafb29f4a76c from lxml import etree @400000004793fafb29f55734 ImportError: /usr/local/lib/python2.3/site-packages/lxml-2.0beta1-py2.3-linux-i686.egg/lxml/etree.so: undefined symbol: PyDict_Contains On my development server I'm running alpha3, so I'm trying to to install that now: http://cheeseshop.python.org/packages/source/l/lxml/lxml-2.0alpha3.tar.gz But that fails:: NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available. warning: no previously-included files found matching 'doc/pyrex.txt' warning: no previously-included files found matching 'src/lxml/etree.pxi' src/lxml/etree.c:73843: error: `METH_COEXIST' undeclared here (not in a function) src/lxml/etree.c:73843: error: initializer element is not constant src/lxml/etree.c:73843: error: (near initialization for `__pyx_methods_5etree_QName[0].ml_flags') src/lxml/etree.c:73843: error: initializer element is not constant src/lxml/etree.c:73843: error: (near initialization for `__pyx_methods_5etree_QName[0]') same error with alpha4 Oops, and alpha5 is: Building lxml version 2.0.alpha5. NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available. 
warning: no previously-included files found matching 'doc/pyrex.txt' warning: no previously-included files found matching 'src/lxml/etree.pxi' File "build/bdist.linux-i686/egg/lxml/html/diff.py", line 323 start.extend(tag for name, pos, tag in tag_stack) ^ SyntaxError: invalid syntax File "/usr/local/lib/python2.3/site-packages/lxml-2.0alpha5-py2.3-linux-i686.egg/lxml/html/diff.py", line 323 start.extend(tag for name, pos, tag in tag_stack) ^ SyntaxError: invalid syntax Adding lxml 2.0alpha5 to easy-install.pth file Installed /usr/local/lib/python2.3/site-packages/lxml-2.0alpha5-py2.3-linux-i686.egg Processing dependencies for lxml==2.0alpha5 Finished processing dependencies for lxml==2.0alpha5 It looks like it's going to be a late night for me re-writing to not use lxml on the production server. -- Brad Clements, bkc at murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM: BKClements From albert.brandl at tttech.com Mon Jan 21 09:13:04 2008 From: albert.brandl at tttech.com (Albert Brandl) Date: Mon, 21 Jan 2008 09:13:04 +0100 Subject: [lxml-dev] Problem installing lxml-2.0beta1 onto production server In-Reply-To: <4793FD75.2080907@murkworks.com> References: <4793FD75.2080907@murkworks.com> Message-ID: <20080121081303.GA1393@tttech.com> On Sun, Jan 20, 2008 at 09:03:33PM -0500, Brad Clements wrote: > File "build/bdist.linux-i686/egg/lxml/html/diff.py", line 323 > start.extend(tag for name, pos, tag in tag_stack) > ^ > SyntaxError: invalid syntax This looks like your server uses an older Python version. "(tag for name, pos, tag in tag_stack)" is a generator expression, which was introduced in Python 2.4. 
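To illustrate the difference, here is a small sketch with made-up sample data (it is not the actual content of diff.py). A list comprehension, which older Python versions already understand, produces the same result as the generator expression; it simply builds the intermediate list before handing it to extend().

```python
# Hypothetical sample data: (name, position, tag) triples.
tag_stack = [("p", 0, "<p>"), ("b", 3, "<b>")]

# Generator expression: the form that triggers the SyntaxError on Python 2.3.
start_genexp = []
start_genexp.extend(tag for name, pos, tag in tag_stack)

# List comprehension: same result, with the list built up front.
start_listcomp = []
start_listcomp.extend([tag for name, pos, tag in tag_stack])
```

The only change is the extra pair of square brackets, so a rewrite of this kind does not alter what ends up in the list.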
Regards, Albert From dfedoruk at gmail.com Mon Jan 21 10:46:33 2008 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Mon, 21 Jan 2008 12:46:33 +0300 Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash : looks like fixed Message-ID: Hello everybody, I've been testing lxml2.0beta1 for a week already and it looks like the problem leading to apache2 crash disappeared. I've tried to reproduce the bug using the known situations which lead to crash, this did not happen. Our QA team made some more thorough tests, the crash did not happen. So I hope the bug is gone forever :) I guess I should thank Stefan for fixing whatever it was. Cheers, Dmitri From stefan_ml at behnel.de Mon Jan 21 15:46:21 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 Jan 2008 15:46:21 +0100 (CET) Subject: [lxml-dev] Problem installing lxml-2.0beta1 onto production server In-Reply-To: <4793FD75.2080907@murkworks.com> References: <4793FD75.2080907@murkworks.com> Message-ID: <26719.194.114.62.34.1200926781.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, > I am installing lxml-2.0beta1 on python 2.3 on RHEL4. I didn't check 2.3 compatibility for a while (Ubuntu doesn't have Python 2.3 anymore), looks like things got a bit out of hand here. I fixed etree for now, it should be in SVN by tonight (CET). You should be fine with that version. Note that building it will require Cython 0.9.6.11 or later. The doctestcompare stuff will not work with Python 2.3, so I had to disable the doctests for that version. Looks like there's still a bit more to do, though. 
Stefan From stefan_ml at behnel.de Mon Jan 21 20:18:47 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 Jan 2008 20:18:47 +0100 Subject: [lxml-dev] lxml \ libxslt \ libxml2 leads to apache 2 crash : looks like fixed In-Reply-To: References: Message-ID: <4794F017.1060009@behnel.de> Hi, Dmitri Fedoruk wrote: > I've been testing lxml2.0beta1 for a week already and it looks like > the problem leading to apache2 crash disappeared. I've tried to > reproduce the bug using the known situations which lead to crash, this > did not happen. Our QA team made some more thorough tests, the crash > did not happen. That's great news. These things are really nasty. lxml 2.0 has had a bit of internal reengineering, which also lead to a couple of structural fixes. Some of these might find their way back into 1.3 one day, although I'd have to look through the (long list of) changes first. I'm more concerned about getting 2.0 stable for the final release now. What might actually have done the trick here is the conservative "copy-if-foreign" XSLT change, which now copies the stylesheet if it was created in a different thread. This is definitely sub-optimal, but I'll look into that later. As usual: 1) make it work, 2) make it fast (if necessary). > So I hope the bug is gone forever :) I guess I should thank Stefan for > fixing whatever it was. You're welcome. Stefan From dsoulayrol at free.fr Tue Jan 22 13:48:49 2008 From: dsoulayrol at free.fr (David Soulayrol) Date: Tue, 22 Jan 2008 13:48:49 +0100 Subject: [lxml-dev] About objectify Message-ID: <1201006129.23443.15.camel@neodebianix.neotip.com> Hello, I wanted to give a try to objectify, and while reading documentation, I've seen that what I'd like to do involves using a schema for my XML file, and the xsi namespace to specify element types. 
Since I am discovering both the particularities of objectify and what exactly a schema is at the same time, I am having some trouble handling the whole thing, and I'm not sure I understand everything in the documentation. The main problem is that if I make an error, I do not yet know whether I should try to correct my schema, my XML file or my Python code. Does anyone have a simple example with a short XML file containing some elements, its schema, and the python code parsing the file using objectify with the xsi namespace to do type association ? Thanks, -- David. From jholg at gmx.de Tue Jan 22 14:54:42 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 22 Jan 2008 14:54:42 +0100 Subject: [lxml-dev] About objectify In-Reply-To: <1201006129.23443.15.camel@neodebianix.neotip.com> References: <1201006129.23443.15.camel@neodebianix.neotip.com> Message-ID: <20080122143205.234530@gmx.net> Hi David, > Does anyone have a simple example with a short XML file containing some > elements, its schema, and the python code parsing the file using > objectify with the xsi namespace to do type association ? > > I'm not quite sure I understand what you try to achieve, but lxml.objectify does not necessarily need a schema: $ cat simpleInstance.xml [markup stripped by the list archive] A string, hopefully. You can simply parse this using objectify: >>> from lxml import etree, objectify >>> root = objectify.parse("simpleInstance.xml").getroot() >>> print objectify.dump(root) root = None [ObjectifiedElement] s = 'A string, hopefully.' [StringElement] >>> print root.s A string, hopefully. >>> What you can do is validate this XML tree against a schema: $ cat simpleSchema.xsd [markup stripped by the list archive] >>> from lxml import etree, objectify >>> root = objectify.parse("simpleInstance.xml").getroot() >>> schema = etree.XMLSchema(objectify.parse("simpleSchema.xsd")) >>> schema.validate(root) True >>> No xsi:type information anywhere, so far.
What you currently cannot do is use the schema to *add* xsi:type attributes to the XML instance. Or, to put it another way, schema-validation does not add any type-information. No problem, though, if the XML contains xsi:type information: >>> from lxml import etree, objectify >>> root = objectify.parse("simpleInstance2.xml").getroot() >>> print objectify.dump(root) root = None [ObjectifiedElement] s = 'A string, hopefully.' [StringElement] * xsi:type = 'xsd:normalizedString' >>> print root.s A string, hopefully. >>> schema = etree.XMLSchema(objectify.parse("simpleSchema.xsd")) >>> schema.validate(root) True >>> If xsi:type information is available, it will be used to determine the lxml.objectify type representation of an element: Consider >>> root = objectify.fromstring("3") >>> print objectify.dump(root) root = None [ObjectifiedElement] s = 3 [IntElement] >>> vs. >>> root = objectify.fromstring(""" ... ... 3 ... """) >>> print objectify.dump(root) root = None [ObjectifiedElement] s = '3' [StringElement] * xsi:type = 'xsd:string' >>> HTH, Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080122/eb5c9536/attachment.htm From dsoulayrol at free.fr Tue Jan 22 15:45:53 2008 From: dsoulayrol at free.fr (David Soulayrol) Date: Tue, 22 Jan 2008 15:45:53 +0100 Subject: [lxml-dev] About objectify In-Reply-To: <20080122143205.234530@gmx.net> References: <1201006129.23443.15.camel@neodebianix.neotip.com> <20080122143205.234530@gmx.net> Message-ID: <1201013153.23443.27.camel@neodebianix.neotip.com> Le mardi 22 janvier 2008 à
14:54 +0100, jholg at gmx.de a écrit : > Hi David, > > > Does anyone have a simple example with a short XML file containing > > some > > elements, its schema, and the python code parsing the file using > > objectify with the xsi namespace to do type association ? > I'm not quite sure I understand what you try to achieve, > > but lxml.objectify does not necessarily need a schema: Hello. Thanks for your answer. I should have written I'd like to define a complex type, as stated in the end of the following chapter. http://codespeak.net/lxml/objectify.html#how-data-types-are-matched In my case, writing a type checker would be impossible - or very difficult. I understand I can also use a py:pytype attribute, but I'd like to try the xsi namespace if possible. -- David. From jholg at gmx.de Tue Jan 22 17:31:33 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 22 Jan 2008 17:31:33 +0100 Subject: [lxml-dev] About objectify In-Reply-To: <1201013153.23443.27.camel@neodebianix.neotip.com> References: <1201006129.23443.15.camel@neodebianix.neotip.com> <20080122143205.234530@gmx.net> <1201013153.23443.27.camel@neodebianix.neotip.com> Message-ID: <20080122165549.234510@gmx.net> Hi, > I should have written I'd like to define a complex type, as stated in > the end of the following chapter. > > http://codespeak.net/lxml/objectify.html#how-data-types-are-matched > Hm, it is not really a complex type (in XML Schema terms), but rather a custom simple data type. > In my case, writing a type checker would be impossible - or very > difficult. I understand I can also use a py:pytype attribute, but I'd > like to try the xsi namespace if possible. > Have you tried registering your custom type as an xmlSchemaType, as in: >>> my_strange_type.xmlSchemaTypes = ("myns:mytypename",) Not sure if this will work though, as lxml.objectify currently expects xsi:type information to contain xsd:hints, i.e. the xsi:type infos must come from the XML Schema namespace.
At least the DataElement() factory will choke:

>>> objectify.DataElement(3, _xsi="foo:bar")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "lxml.objectify.pyx", line 1708, in lxml.objectify.DataElement
ValueError: XSD types require the XSD namespace
>>>

Maybe you can achieve what you need by taking a look at the "Using custom
element classes in lxml" section?

Good luck,
H.

From dsoulayrol at free.fr Tue Jan 22 23:28:58 2008
From: dsoulayrol at free.fr (David Soulayrol)
Date: Tue, 22 Jan 2008 23:28:58 +0100
Subject: [lxml-dev] About objectify
In-Reply-To: <20080122165549.234510@gmx.net>
References: <1201006129.23443.15.camel@neodebianix.neotip.com>
    <20080122143205.234530@gmx.net>
    <1201013153.23443.27.camel@neodebianix.neotip.com>
    <20080122165549.234510@gmx.net>
Message-ID: <1201040938.3438.24.camel@localhost>

Hello,

Le mardi 22 janvier 2008 à 17:31 +0100, jholg at gmx.de a écrit :
> Hi,
>
> > I should have written I'd like to define a complex type, as stated
> > in the end of the following chapter.
> >
> > http://codespeak.net/lxml/objectify.html#how-data-types-are-matched
>
> Hm, it is not really a complex type (in XML Schema terms), but rather
> a custom simple data type.
>
> > In my case, writing a type checker would be impossible - or very
> > difficult. I understand I can also use a py:pytype attribute, but
> > I'd like to try the xsi namespace if possible.
>
> Have you tried registering your custom type as an xmlSchemaType, as
> in:
>
> >>> my_strange_type.xmlSchemaTypes = ("myns:mytypename",)

All this was a bit fuzzy to me.
I had the time to dig more, and here is a very simple file I wrote to run
some tests:

---%<------%<------%<------%<---
<site xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
      xsi:type='site'>

  <title xsi:type='title'>Achille 2.0</title>
  <value xsi:type='long'>2</value>
</site>
---%<------%<------%<------%<---

And below the code I've written to parse it:

---%<------%<------%<------%<---
from lxml import etree
from lxml import objectify

class Configuration(objectify.ObjectifiedDataElement):
    pass

class MyString(objectify.ObjectifiedDataElement):
    pass

configuration_type = objectify.PyType('configuration', None, Configuration)
string_type = objectify.PyType('MyString', None, MyString)
configuration_type.xmlSchemaTypes = ('site',)
string_type.xmlSchemaTypes = ('title',)
configuration_type.register()
string_type.register()

lParser = etree.XMLParser(remove_blank_text=True)
lLookup = objectify.ObjectifyElementClassLookup()
lParser.setElementClassLookup(lLookup)
lFile = open('test.xml', 'r')
lTree = etree.parse(lFile, lParser)
print objectify.dump(lTree.getroot())
---%<------%<------%<------%<---

Here is the result:

$ python ./try_objectify.py
site = None [ObjectifiedElement]
  * xsi:type = 'site'
    title = Achille 2.0 [MyString]
      * xsi:type = 'title'
    value = 2L [LongElement]
      * xsi:type = 'long'

So I managed to get some success, but here are some remaining questions.

- Why is the root element still an ObjectifiedElement instance ? It
  seems to me I applied the same rules for both of my defined types.

- Is there a way to specify the xsi:type in a schema sheet ? This
  question may sound stupid, but I'm still learning the XSD spec, and I
  wonder if objectify could rely entirely on the schema, without the need
  to add anything in the XML document itself.
> Good luck,

Thanks for your attention,
--
David Soulayrol

From himself at markus-hillebrand.de Wed Jan 23 08:03:11 2008
From: himself at markus-hillebrand.de (Markus Hillebrand)
Date: Wed, 23 Jan 2008 08:03:11 +0100
Subject: [lxml-dev] type of custom objects in XML-tree disappears
References: 20080119175309.31119.84358.malonedeb@gangotri.ubuntu.com
Message-ID: <4796E6AF.6030502@markus-hillebrand.de>

Thanks for your fast response - with your help I was able to make my
implementation work, but on the design side I'm not quite satisfied
with the solution.

You wrote "You do not have to call set_element_class_lookup()
each time as it sticks with the parser." ... - but that's exactly
what I want to do: I need a tree with objects of different(!)
classes. And I want to put additional data in those objects.

Yes, I read this one:

"There is one thing to know up front. Element classes must not have a
constructor, neither must there be any internal state (except for the
data stored in the underlying XML tree). Element instances are created
and garbage collected at need, so there is no way to predict when and
how often a constructor would be called."

on http://codespeak.net/lxml/element_classes.html

So for me, it seems that lxml is not designed to manage
objects of different classes in an XML tree. With a hack (calling
setElementClassLookup() before each creation of an element) I'm
able to create such trees and for some testcases it seems to work
fine - despite the nasty things I reported. But it's not quite
satisfying:

- it's not free of side-effects when I change the default
  setElementClassLookup ...
- I'm afraid of running into garbage collection bugs, like the one I
  reported.

Finally my question: would it be possible for lxml to support that
feature officially? For example by providing an explicit factory call
like etree.createElement(class)?
With best regards
Markus

From jholg at gmx.de Wed Jan 23 08:45:44 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 23 Jan 2008 08:45:44 +0100
Subject: [lxml-dev] About objectify
In-Reply-To: <1201040938.3438.24.camel@localhost>
References: <1201006129.23443.15.camel@neodebianix.neotip.com>
    <20080122143205.234530@gmx.net>
    <1201013153.23443.27.camel@neodebianix.neotip.com>
    <20080122165549.234510@gmx.net>
    <1201040938.3438.24.camel@localhost>
Message-ID: <20080123083048.131050@gmx.net>

Hi,

> Here is the result:
>
> $ python ./try_objectify.py
> site = None [ObjectifiedElement]
>   * xsi:type = 'site'
>     title = Achille 2.0 [MyString]
>       * xsi:type = 'title'
>     value = 2L [LongElement]
>       * xsi:type = 'long'
>
> So I managed to get some success, but here are some remaining questions.
>
> - Why is the root element still an ObjectifiedElement instance ? It
> seems to me I applied the same rules for both of my defined types.

Basically, when lxml parses an XML file/string, the underlying libxml2 is
used to build a DOM-like XML tree, i.e. a C data structure. On element
access, lxml creates a proxy object to represent the node in Python. Once
you have finished working with the node and deleted your Python references
to it, the proxy is free to be garbage-collected.

Now, objectify bases its element class lookup (i.e. which element class to
use for the Python proxy representation) on certain rules:

1. if element has children => no data class
2. if element is defined as xsi:nil, return NoneElement class
3. check for Python type hint
4. check for XML Schema type hint
5. guess element class

Therefore, the objectify class lookup will *always* choose
ObjectifiedElement if an element has children ("structural element"), as
opposed to a "data element".

You can beat this behaviour by using custom element class lookup (with
ObjectifyElementClassLookup as the fallback) based on attributes:

$ cat lxml_attributeBasedLookup.py
from lxml import etree, objectify

class Configuration(objectify.ObjectifiedElement):
    pass

class MyString(objectify.ObjectifiedDataElement):
    pass


# maps attribute values to element classes
xsitype_class_mapping = {
    "site": Configuration,
    "title": MyString,
    }

lookup = etree.AttributeBasedElementClassLookup(
    "{http://www.w3.org/2001/XMLSchema-instance}type",
    xsitype_class_mapping,
    objectify.ObjectifyElementClassLookup())

parser = etree.XMLParser()
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)

root = objectify.fromstring("""<site xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
      xsi:type='site'>

  <title xsi:type='title'>Achille 2.0</title>
  <value xsi:type='long'>2</value>
</site>
""")

print objectify.dump(root)
######################

$ python2.4 lxml_attributeBasedLookup.py
site = None [Configuration]
  * xsi:type = 'site'
    title = Achille 2.0 [MyString]
      * xsi:type = 'title'
    value = 2L [LongElement]
      * xsi:type = 'long'

> - Is there a way to specify the xsi:type in a schema sheet ? This
> question may sound stupid, but I'm still learning the XSD spec, and I
> wonder if objectify could rely entirely on the schema, without the need
> to add anything in the XML document itself.

You can define custom types in XML Schema. Probably the best start is the
XML Schema Primer, or the excellent tutorials of a certain Roger Costello
(I think the site is xfront.com).

Currently, I think lxml.objectify restricts itself to supporting the "xsd"
types as in http://www.w3.org/TR/xmlschema-2/ with regard to xsi:type
values, i.e. forcing them to come from the schema namespace.

You might be able to achieve what you need with what I've shown above,
beating the objectify lookup in lookup order.

For now, there is nothing like a "typifier" that takes an instance and a
schema and adds type information from the schema to the instance document.

Cheers,
Holger

From stefan_ml at behnel.de Wed Jan 23 11:24:02 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 23 Jan 2008 11:24:02 +0100
Subject: [lxml-dev] type of custom objects in XML-tree disappears
In-Reply-To: <4796E6AF.6030502@markus-hillebrand.de>
References: 20080119175309.31119.84358.malonedeb@gangotri.ubuntu.com
    <4796E6AF.6030502@markus-hillebrand.de>
Message-ID: <479715C2.4050708@behnel.de>

Hi,

Markus Hillebrand wrote:
> You wrote "You do not have to call set_element_class_lookup()
> each time as it sticks with the parser." ... - but that's exactly
> what I want to do: I need a tree with objects of different(!)
> classes.

That's perfectly fine, and there are ways to do that. You can write your own
lookup scheme based on XML attributes, namespace/tag, some general element
information or even full-fledged tree traversal.

http://codespeak.net/lxml/dev/element_classes.html

What will /not/ work is: merge elements from different trees into a tree that
has a different lookup scheme and then have them reappear in the new tree with
their original class - *except* if you keep Python references to each object,
which will prevent them from being garbage collected and thus from
re-evaluating the lookup on access. But you have to take care in this case
that tree modifications are reflected in the cache.

> And I want to put additional data in those objects.

You can do that as long as it is reflected in the underlying XML (e.g. through
attributes in a separate namespace). lxml.objectify does this for type
annotations, for example.

You can /not/ do that if you want to keep the state in the Python objects -
again, with the exception of keeping the Python objects alive.
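
A minimal sketch of that idea, under stated assumptions: the tag name
"item", the namespace URI and the "visits" property are made up for
illustration and do not come from this thread, and the code uses the
current method name set_element_class_lookup() instead of the older
setElementClassLookup() spelling. The class is picked per tag by a custom
lookup, and the extra data lives in a namespaced attribute of the XML node,
so it is still there when the Python proxy is discarded and recreated:

```python
from lxml import etree

# Hypothetical namespace for application state (an assumption, not an lxml name).
STATE_NS = "http://example.com/ns/state"
VISITS = "{%s}visits" % STATE_NS


class Item(etree.ElementBase):
    """Custom element class: no __init__, no instance state."""

    @property
    def visits(self):
        # Read the value back from the underlying XML attribute.
        return int(self.get(VISITS, "0"))

    @visits.setter
    def visits(self, value):
        # Store the value in the tree, not in the Python object.
        self.set(VISITS, str(value))


class ItemLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # Use Item for <item> elements; None means "use the default class".
        if node_type == "element" and name == "item":
            return Item
        return None


parser = etree.XMLParser()
parser.set_element_class_lookup(ItemLookup())

root = etree.fromstring("<root><item/><other/></root>", parser)
root[0].visits = 3          # written into the XML as a namespaced attribute
print(type(root[0]).__name__, root[0].visits)
print(etree.tostring(root))
```

Because the lookup is re-evaluated from the tree on every access, each
element gets its class back without any Python references being kept
alive.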

> So for me, it seems that lxml is not designed to manage
> objects of different classes in an XML tree.

It totally is, it just depends on how sophisticated your lookup scheme is.

> With a hack (calling
> setElementClassLookup() before each creation of an element)

You are assuming here that you can keep state in the Element objects, which in
this case means: their Python type.

> I'm
> able to create such trees and for some testcases it seems to work
> fine - despite the nasty things I reported. But it's not quite
> satisfying:
>
> - it's not free of side-effects when I change the default
>   setElementClassLookup ...

Which is discouraged anyway, but helpful in some I-know-what-I'm-doing cases
where you are sure you're the only one to play with this.

> Finally my question: would it be possible for lxml to support that
> feature officially? For example by providing an explicit factory call
> like etree.createElement(class)?

No, lxml will not keep state in its Element proxies. But again, objectify uses
something similar: it determines the Python type of an element value (string,
int, ...) and stores it as a namespaced attribute. When it has to determine
the Element class to use for such an element, it uses that information in the
class lookup. When serialising, you can choose to either keep these attributes
in or to "deannotate()" the tree first.
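
As a rough illustration of the annotate/deannotate round trip described
above (a sketch with a made-up two-element document, assuming a reasonably
recent lxml where tostring() returns bytes):

```python
from lxml import etree, objectify

# Made-up sample document, only for illustration.
root = objectify.fromstring("<root><s>3</s><t>text</t></root>")

# Store the guessed Python types as py:pytype attributes in the tree.
objectify.annotate(root)
annotated = etree.tostring(root)
print(annotated)

# Remove the type attributes again before serialising.
objectify.deannotate(root)
plain = etree.tostring(root)
print(plain)

# Without annotations, objectify falls back to guessing the type from the text.
print(type(root.s).__name__)
```

Note that deannotate() only removes the attributes by default; the
namespace declarations they used may remain in the serialised output.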

Stefan

From stefan_ml at behnel.de Wed Jan 23 14:12:24 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 23 Jan 2008 14:12:24 +0100
Subject: [lxml-dev] objectify type annotation based on an XML Schema
In-Reply-To: <20080123083048.131050@gmx.net>
References: <1201006129.23443.15.camel@neodebianix.neotip.com>
    <20080122143205.234530@gmx.net>
    <1201013153.23443.27.camel@neodebianix.neotip.com>
    <20080122165549.234510@gmx.net>
    <1201040938.3438.24.camel@localhost>
    <20080123083048.131050@gmx.net>
Message-ID: <47973D38.1040208@behnel.de>

Hi,

jholg at gmx.de wrote:
>> - Is there a way to specify the xsi:type in a schema sheet ? This
>> question may sound stupid, but I'm still learning the XSD spec, and I
>> wonder if objectify could rely entirely on the schema, without the need
>> to add anything in the XML document itself.
> For now, there is nothing like a "typifier" that takes an instance and a
> schema and adds type information from the schema to the instance document.

Someone should file a "wishlist" bug report on this, as it has been requested
a couple of times.

There would be ways to approach this. One is the "generateDS" tool by Dave
Kuhlman:

http://www.rexx.com/~dkuhlman/generateDS.html

We could try to extract the parser (or reimplement it with lxml) and run an
annotation instead of the class generation step.

We could also look a bit deeper into the internal handling of XML schema in
libxml2, there could be something to start from.

Stefan

From stefan_ml at behnel.de Wed Jan 23 16:50:39 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 23 Jan 2008 16:50:39 +0100
Subject: [lxml-dev] Restrict exception error log to local errors?
Message-ID: <4797624F.5010607@behnel.de>

Hi,

currently, the "error_log" property on exceptions raised by lxml represents a
frozen snapshot of the global error log, thus aggregating loads of errors that
have recently occurred, but that are not necessarily related to the problem
that led to the exception.
However, we could often restrict that to an error
log snapshot that is local to the operation that raised the exception.

Question: should this be changed? Is there a reason this should not be changed
for 2.0?

Stefan

From jholg at gmx.de Thu Jan 24 10:44:29 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 24 Jan 2008 10:44:29 +0100
Subject: [lxml-dev] Restrict exception error log to local errors?
In-Reply-To: <4797624F.5010607@behnel.de>
References: <4797624F.5010607@behnel.de>
Message-ID: <20080124094447.269830@gmx.net>

Hi,

> currently, the "error_log" property on exceptions raised by lxml
> represents a frozen snapshot of the global error log, thus aggregating
> loads of errors that have recently occurred, but that are not
> necessarily related to the problem that lead to the exception. However,
> we could often restrict that to an error log snapshot that is local to
> the operation that raised the exception.
>
> Question: should this be changed? Is there a reason this should not be
> changed for 2.0?

I for one am indifferent (since I never even noticed the aggregation
behaviour so far ;-)
If someone wants to have kind of an exception history, then there'll still
be the global error log (?).

Holger

From stefan_ml at behnel.de Thu Jan 24 13:09:59 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 24 Jan 2008 13:09:59 +0100
Subject: [lxml-dev] Restrict exception error log to local errors?
In-Reply-To: <20080124094447.269830@gmx.net>
References: <4797624F.5010607@behnel.de> <20080124094447.269830@gmx.net>
Message-ID: <47988017.7020504@behnel.de>

Hi,

jholg at gmx.de wrote:
>> currently, the "error_log" property on exceptions raised by lxml
>> represents a frozen snapshot of the global error log, thus aggregating
>> loads of errors that have recently occurred, but that are not
>> necessarily related to the problem that lead to the exception. However,
>> we could often restrict that to an error log snapshot that is local to
>> the operation that raised the exception.
>>
>> Question: should this be changed? Is there a reason this should not be
>> changed for 2.0?
>
> I for one am indifferent (since I never even noticed the aggregation
> behaviour so far ;-)

That's about what I was expecting. :)

> If someone wants to have kind of an exception history, then there'll still
> be the global error log (?).

... which is not currently accessible by itself.

From my POV, you would normally use the "error_log" property of the thing you
are working with (XSLT, RelaxNG, XPath, ...) rather than the exception. The
only case where it is interesting to access the log through the exception
would be the instantiation of such an object. And here, I don't think you'd be
interested in anything but the cause of the problem.

So, guess I'll just change it in general and make it another beta. :)

Stefan

From brunobg at gmail.com Thu Jan 24 21:17:12 2008
From: brunobg at gmail.com (Bruno Barberi Gnecco)
Date: Thu, 24 Jan 2008 18:17:12 -0200
Subject: [lxml-dev] Possible bug in xpath?
Message-ID: <4798F248.6050304@gmail.com>

Hi,

I think I may have run into a bug. I'm attaching sample code
to reproduce it. Instead of getting back just the '..',
I get the entire div content.

Since I can't tell if this bug is in lxml or one of the libs
it uses, I'm posting to this list instead of the bug tracker. My
setup: Python 2.5.1, lxml 1.3.6 and libxml2.so.2.6.28.

If OTOH I'm doing something stupid, just tell me :) Thanks a lot,

--
Bruno Barberi Gnecco
...the flaw that makes perfection perfect.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tmp.py
Type: text/x-python
Size: 537 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080124/29039c4a/attachment.py

From stefan_ml at behnel.de Thu Jan 24 22:02:05 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 24 Jan 2008 22:02:05 +0100
Subject: [lxml-dev] Possible bug in xpath?
In-Reply-To: <4798F248.6050304@gmail.com>
References: <4798F248.6050304@gmail.com>
Message-ID: <4798FCCD.20107@behnel.de>

Hi,

Bruno Barberi Gnecco wrote:
> I think I may have run into a bug. I'm attaching a sample code
> to reproduce it. Instead of getting back just the '..',
> I get the entire div content.
>
> If OTOH I'm doing something stupid, just tell me :) Thanks a lot,

tostring(el) and tounicode(el) serialise the Element object you pass, and the
.tail text of an Element is part of the Element object you are serialising.

Try calling

    tounicode( et.getroot()[0] )

That should give you the same output that you see in your XPath example.

Here's another example that might make it clear why this is so:

>>> import lxml.etree as et
>>> root = et.Element("test")
>>> root.text = "TEXT"
>>> et.tostring(root)
'<test>TEXT</test>'

>>> root.tail = "TAIL"
>>> et.tostring(root)
'<test>TEXT</test>TAIL'

>>> root.tail = None
>>> et.tostring(root)
'<test>TEXT</test>'

I also updated the FAQ entry on this topic.

http://codespeak.net/lxml/dev/FAQ.html#what-about-that-trailing-text-on-serialised-elements

Stefan

From dsoulayrol at free.fr Fri Jan 25 11:00:41 2008
From: dsoulayrol at free.fr (David Soulayrol)
Date: Fri, 25 Jan 2008 11:00:41 +0100
Subject: [lxml-dev] About objectify
In-Reply-To: <20080123083048.131050@gmx.net>
References: <1201006129.23443.15.camel@neodebianix.neotip.com>
    <20080122143205.234530@gmx.net>
    <1201013153.23443.27.camel@neodebianix.neotip.com>
    <20080122165549.234510@gmx.net>
    <1201040938.3438.24.camel@localhost>
    <20080123083048.131050@gmx.net>
Message-ID: <1201255241.5194.11.camel@localhost>

Le mercredi 23 janvier 2008 à 08:45 +0100, jholg at gmx.de a écrit :
> Hi,
>
> > - Why is the root element still an ObjectifiedElement instance ? It
> > seems to me I applied the same rules for both of my defined types.
>
> Now, objectify bases its element class lookup (i.e. which element
> class to use for the Python proxy representation) on certain rules:
>
> 1. if element has children => no data class
> 2. if element is defined as xsi:nil, return NoneElement class
> 3. check for Python type hint
> 4. check for XML Schema type hint
> 5. guess element class

Yes, it is in the documentation, which I should have read more
attentively. Sorry for that.

> # maps attribute values to element classes
> xsitype_class_mapping = {
>     "site": Configuration,
>     "title": MyString,
>     }
>
> lookup = etree.AttributeBasedElementClassLookup(
>     "{http://www.w3.org/2001/XMLSchema-instance}type",
>     xsitype_class_mapping,
>     objectify.ObjectifyElementClassLookup())
>
> parser = etree.XMLParser()
> parser.setElementClassLookup(lookup)
> objectify.setDefaultParser(parser)
>
> root = objectify.fromstring("""<site xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
>       xsi:type='site'>
>
>   <title xsi:type='title'>Achille 2.0</title>
>   <value xsi:type='long'>2</value>
> </site>
> """)
>
> print objectify.dump(root)

This makes sense now.

> > - Is there a way to specify the xsi:type in a schema sheet ?
This
> > question may sound stupid, but I'm still learning the XSD spec, and
> > I wonder if objectify could rely entirely on the schema, without the
> > need to add anything in the XML document itself.
>
> For now, there is nothing like a "typifier" that takes an instance and
> a schema and adds type information from the schema to the instance
> document.

Thanks for all the help.
--
David

From stefan_ml at behnel.de Fri Jan 25 20:51:45 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 25 Jan 2008 20:51:45 +0100
Subject: [lxml-dev] special string subclasses for XPath string results
In-Reply-To: <478559B8.5020703@behnel.de>
References: <57254BE3C7FAD845A0E402A369B0D0DD992E8E@gt1mail1.gt.e-7.local>
    <4781EDAF.1010604@behnel.de>
    <57254BE3C7FAD845A0E402A369B0D0DD992F92@gt1mail1.gt.e-7.local>
    <4781FB88.900@behnel.de>
    <4782598E.3000803@colorstudy.com>
    <57254BE3C7FAD845A0E402A369B0D0DD993058@gt1mail1.gt.e-7.local>
    <47826668.90807@colorstudy.com>
    <478283DB.9090403@behnel.de>
    <4782968F.2000406@colorstudy.com>
    <478333B6.6030906@behnel.de>
    <47851016.6050905@colorstudy.com>
    <478559B8.5020703@behnel.de>
Message-ID: <479A3DD1.7040006@behnel.de>

Hi,

one more note on this.

Stefan Behnel wrote:
> Attributes and text nodes are now handled. String results (such as returned by
> the string() function) will remain plain strings - no way to recover here.

I generalised this to make all XPath string results 'smart' subclasses of
either str or unicode. For the cases where we return strings that do not have
a clear origin (e.g. assembled by the XPath string() or concat() functions),
the getparent() method will just return None, as for the root Element of a
tree. That way, you can avoid doing any hasattr() stuff and just call
getparent() directly.

Stefan

From jeroen at keizerrijk.net Sat Jan 26 12:13:03 2008
From: jeroen at keizerrijk.net (Jeroen van Hilst)
Date: Sat, 26 Jan 2008 12:13:03 +0100
Subject: [lxml-dev] elements in gc.garbage - ok ?
Message-ID:

Hi,

While using lxml (which is a great tool!), I am experiencing some memory
issues. I have made a small piece of code that makes elements end up in
gc.garbage. I suspect this is why my program consumes lots of memory.

Can someone tell me if it is ok that this happens - or what the reason is ?

Thanks in advance.
- Jeroen

#======================
from lxml import etree
import gc

gc.set_debug(gc.DEBUG_LEAK)

for x in range(1,19):
    r = etree.Element('div')
    s = str(x)
    #if the attr is not touched there are no messages from gc
    r.attrib[s] = s

gc.collect()
if gc.garbage:
    print gc.garbage
#======================

I am using:
Windows XP SP2
Python 2.5.1.1
lxml 1.3.6
libxml2 2.6.28

begin 666 lxml_mem.py
M(U5S:6YG. at HC4'ET:&]N(#(N-2XQ+C$*(U=I;F1O=W, at 6% @4U @, at HC;'AM
M;" Q+C,N- at HC;&EB>&UL(#(N-BXR. H*9G)O;2!L>&UL(&EM<&]R="!E=')E
M9