From lists at cheimes.de Mon Feb 2 12:49:02 2009
From: lists at cheimes.de (Christian Heimes)
Date: Mon, 02 Feb 2009 12:49:02 +0100
Subject: [lxml-dev] etree.clear_error_log() causes segfaults
Message-ID:

This is a friendly word of warning! Don't call etree.clear_error_log()
from multiple threads.

We found this issue during stress tests of our CherryPy based
application. Every worker thread was calling etree.clear_error_log()
after the page was rendered. Apparently we hit some sort of race condition.

Backtrace:

C [etree.so+0x21977]
C [etree.so+0x22d48]
C [libxslt.so.1+0xdc17] xsltTransformError+0xf7
C [libxslt.so.1+0x839d]
C [libxslt.so.1+0xa5da] xsltParseStylesheetProcess+0x80a
C [libxslt.so.1+0x1efbc] xsltParseStylesheetInclude+0x1ac
C [libxslt.so.1+0xa0b4] xsltParseStylesheetProcess+0x2e4
C [libxslt.so.1+0xb4b9] xsltParseStylesheetImportedDoc+0x1e9
C [libxslt.so.1+0xb5b8] xsltParseStylesheetDoc+0x28
C [etree.so+0xe640c]
C [python2.5+0x4ff83]
C [python2.5+0x11e67] PyObject_Call+0x27

Christian

From shigin at rambler-co.ru Mon Feb 2 16:26:47 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Mon, 02 Feb 2009 18:26:47 +0300
Subject: [lxml-dev] String parameters to xslt transformation
Message-ID: <1233588407.5942.14.camel@atlas>

lxml lacks ways to apply an external parameter containing both single
and double quotes.

The patch adds `transform` method to XSLT object with `params` and
`strparams` argument. `params` works like `**kw` of `__call__` method
(i.e. you still need to surround string literals with quotes).

`strparams` are treated literally, so you do not need to use any
escaping or quotes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xslt-params.diff
Type: text/x-patch
Size: 6282 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090202/ad3a8772/attachment.bin

From Ronny.Pfannschmidt at gmx.de Wed Feb 4 17:50:40 2009
From: Ronny.Pfannschmidt at gmx.de (Ronny Pfannschmidt)
Date: Wed, 04 Feb 2009 17:50:40 +0100
Subject: [lxml-dev] altering the indent of the pretty output
Message-ID: <1233766240.25247.3.camel@klappe2>

Hi,

I'm currently porting gazpacho (a WYSIWYG GTK UI file editor) to lxml.
Unfortunately, the pretty printer prints with an indent of 2, and in order
to match the convention I need an indent of 4.

Is there any simple way to achieve pretty dumping to a file with an
indent of 4?

Regards
Ronny

From d.rothe at semantics.de Wed Feb 4 18:59:35 2009
From: d.rothe at semantics.de (Dirk Rothe)
Date: Wed, 04 Feb 2009 18:59:35 +0100
Subject: [lxml-dev] altering the indent of the pretty output
In-Reply-To: <1233766240.25247.3.camel@klappe2>
References: <1233766240.25247.3.camel@klappe2>
Message-ID:

On Wed, 04 Feb 2009 17:50:40 +0100, Ronny Pfannschmidt wrote:

> Hi,
>
> i'm currently porting gazpacho (a wysiwyg gtk ui file editor) to lxml,
> unfortunately the pretty printer prints with an indent of 2 and in order
> to match the convention i need an indent of 4
>
> is there any simple way to achieve pretty dumping to a file with an
> indent of 4?

You could adapt the following XSL Transformation:

from lxml import etree

prettyXSL = """ """

def prettyPrint(tree):
    xslt_doc = etree.fromstring(prettyXSL)
    # accept either an XML string or an already parsed tree
    if isinstance(tree, basestring):
        doc = etree.fromstring(tree)
    else:
        doc = tree
    transform = etree.XSLT(xslt_doc)
    return etree.tostring(transform(doc))

--dirk

From sergio at sergiomb.no-ip.org Wed Feb 4 20:04:20 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Wed, 04 Feb 2009 19:04:20 +0000
Subject: [lxml-dev] how knowing the types return by .xpath
Message-ID: <1233774260.21997.9.camel@segulix>

Example:

from lxml import etree

f = open(options.file).read()

hparser = etree.HTMLParser(encoding='utf-8', remove_comments=True)
etree_document = etree.HTML(f, parser=hparser)

elems = etree_document.xpath(strxpath)

for frags in elems:
    print type(frags)
    (...)

If strxpath is //h1/following-sibling::text(), this
prints
<type 'lxml.etree._ElementUnicodeResult'>

If strxpath is //div[@class="news"], this
prints
<type 'lxml.etree._Element'>

How do I write an "if" to detect the types?

thanks
--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090204/2fa63ff6/attachment-0001.bin

From stefan_ml at behnel.de Wed Feb 4 20:47:55 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 04 Feb 2009 20:47:55 +0100
Subject: [lxml-dev] etree.clear_error_log() causes segfaults
In-Reply-To:
References:
Message-ID: <4989F0EB.1050800@behnel.de>

Hi,

Christian Heimes wrote:
> This is a friendly word of warning! Don't call etree.clear_error_log()
> from multiple threads.
>
> We found this issue during stress tests of our CherryPy based
> application. Every worker thread was calling etree.clear_error_log()
> after the page was rendered. Apparently we hit some sort of race condition.

Thanks for the heads-up.
The global error log is actually not thread-local, so there's not much sense in calling the function above in threaded code - or even using the global log at all. The log that comes with API objects like XPath, XSLT and validators should work as expected, though. Stefan From stefan_ml at behnel.de Wed Feb 4 20:53:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Feb 2009 20:53:42 +0100 Subject: [lxml-dev] altering the indent of the pretty output In-Reply-To: <1233766240.25247.3.camel@klappe2> References: <1233766240.25247.3.camel@klappe2> Message-ID: <4989F246.7010502@behnel.de> Ronny Pfannschmidt wrote: > i'm currently porting gazpacho (a wysiwyg gtk ui file editor) to lxml, > unfortunately the pretty printer prints with an indent of 2 and in order > to match the convention i need an indent of 4 > > is there any simple way to archive pretty dumping to a file with an > indent of 4? Apart from the already proposed XSLT serialisation, libxml2 does have a way to set the indentation level. However, that is done globally at a per-thread level and isn't currently exposed at lxml's API level. I'd accept patches that support it for a single call to ET.write() and tostring(). Stefan From stefan_ml at behnel.de Wed Feb 4 20:56:40 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Feb 2009 20:56:40 +0100 Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: <1233774260.21997.9.camel@segulix> References: <1233774260.21997.9.camel@segulix> Message-ID: <4989F2F8.7090608@behnel.de> Hi, Sergio Monteiro Basto wrote: > Example: > > from lxml import etree > > f = open(options.file).read() > > hparser = etree.HTMLParser(encoding='utf-8', remove_comments=True) > etree_document = etree.HTML(f, parser=hparser) > > elems=etree_document.xpath(strxpath) > > for frags in elems: > print type (frags) > (...) 
> > if strxpath is equal //h1/following-sibling::text() > prints > > > if strxpath is equal //div[@class="news"] > prints > > > How I do a "if" to detected the types ? It's actually rare that the expected type isn't known in advance, but for this kind of use case, you can just test the type as usual, i.e. use isinstance() to check for basestring, float or list. Stefan From d.rothe at semantics.de Wed Feb 4 21:15:58 2009 From: d.rothe at semantics.de (Dirk Rothe) Date: Wed, 04 Feb 2009 21:15:58 +0100 Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: <4989F2F8.7090608@behnel.de> References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> Message-ID: On Wed, 04 Feb 2009 20:56:40 +0100, Stefan Behnel wrote: > Hi, > > Sergio Monteiro Basto wrote: >> Example: >> >> from lxml import etree >> >> f = open(options.file).read() >> >> hparser = etree.HTMLParser(encoding='utf-8', remove_comments=True) >> etree_document = etree.HTML(f, parser=hparser) >> >> elems=etree_document.xpath(strxpath) >> >> for frags in elems: >> print type (frags) >> (...) >> >> if strxpath is equal //h1/following-sibling::text() >> prints >> >> >> if strxpath is equal //div[@class="news"] >> prints >> >> >> How I do a "if" to detected the types ? > > It's actually rare that the expected type isn't known in advance, but for > this kind of use case, you can just test the type as usual, i.e. use > isinstance() to check for basestring, float or list. ..or scalar bools. 
I was quite surprised to see that the xpath function can return other types than lists: >>> tree.xpath('count(/) = 1') True From stefan_ml at behnel.de Wed Feb 4 21:20:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Feb 2009 21:20:53 +0100 Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> Message-ID: <4989F8A5.7050903@behnel.de> Dirk Rothe wrote: > On Wed, 04 Feb 2009 20:56:40 +0100, Stefan Behnel > wrote: >> It's actually rare that the expected type isn't known in advance, but for >> this kind of use case, you can just test the type as usual, i.e. use >> isinstance() to check for basestring, float or list. > > ..or scalar bools. True. See the docs for details. http://codespeak.net/lxml/xpathxslt.html#xpath-return-values Stefan From sergio at sergiomb.no-ip.org Wed Feb 4 21:25:19 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Wed, 04 Feb 2009 20:25:19 +0000 Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: <4989F2F8.7090608@behnel.de> References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> Message-ID: <1233779119.22959.9.camel@segulix> On Wed, 2009-02-04 at 20:56 +0100, Stefan Behnel wrote: > Hi, > > Sergio Monteiro Basto wrote: > > Example: > > > > from lxml import etree > > > > f = open(options.file).read() > > > > hparser = etree.HTMLParser(encoding='utf-8', remove_comments=True) > > etree_document = etree.HTML(f, parser=hparser) > > > > elems=etree_document.xpath(strxpath) > > > > for frags in elems: > > print type (frags) > > (...) > > > > if strxpath is equal //h1/following-sibling::text() > > prints > > > > > > if strxpath is equal //div[@class="news"] > > prints > > > > > > How I do a "if" to detected the types ? > > It's actually rare that the expected type isn't known in advance, but for > this kind of use case, you can just test the type as usual, i.e. 
use
> isinstance() to check for basestring, float or list.

elems (in the example) is a list, and each element of the list (elems)
could be an lxml.etree._ElementUnicodeResult or an lxml.etree._Element.

Yes, isinstance(lxml.etree._ElementUnicodeResult, basestring) is true
and
isinstance(lxml.etree._Element, basestring) is false,

which resolves my initial problem,

but how do I test for an lxml.etree._Element instance?

>
> Stefan

Many thanks,
--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090204/4a68f5e7/attachment.bin

From sergio at sergiomb.no-ip.org Thu Feb 5 03:23:29 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Thu, 05 Feb 2009 02:23:29 +0000
Subject: [lxml-dev] how knowing the types return by .xpath
In-Reply-To: <4989F8A5.7050903@behnel.de>
References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> <4989F8A5.7050903@behnel.de>
Message-ID: <1233800610.3145.64.camel@monteirov>

On Wed, 2009-02-04 at 21:20 +0100, Stefan Behnel wrote:
>
> True. See the docs for details.
>
> http://codespeak.net/lxml/xpathxslt.html#xpath-return-values

I read this before my first post.

> > elems=etree_document.xpath(strxpath)
> >
> > for frags in elems:
> > print type (frags)

When .xpath returns a list, the elements of the list can have different
types; I found at least two, etree._Element and
etree._ElementUnicodeResult.

How can I tell whether a variable is an etree._Element and not some other
object, for example etree._Entity?

Thanks,
--
Sérgio M.B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090205/d65096d3/attachment.bin From stefan_ml at behnel.de Thu Feb 5 09:14:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 5 Feb 2009 09:14:36 +0100 (CET) Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: <1233779119.22959.9.camel@segulix> References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> <1233779119.22959.9.camel@segulix> Message-ID: <63565.213.61.181.86.1233821676.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Sergio Monteiro Basto wrote: > isinstance(lxml.etree._ElementUnicodeResult, basestring) is true > and > isinstance(xml.etree._Element, basestring) is false > > which resolve my initial problem > > but what instance is xml.etree._Element ? etree.iselement(an_element) is True for elements, but also for PIs, comments and entities (if you configured the parser to leave entities in). isinstance(element.tag, basestring) is True only for Elements (and entities, IIRC). Stefan From stefan_ml at behnel.de Thu Feb 5 09:18:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 5 Feb 2009 09:18:25 +0100 (CET) Subject: [lxml-dev] how knowing the types return by .xpath In-Reply-To: <1233800610.3145.64.camel@monteirov> References: <1233774260.21997.9.camel@segulix> <4989F2F8.7090608@behnel.de> <4989F8A5.7050903@behnel.de> <1233800610.3145.64.camel@monteirov> Message-ID: <34226.213.61.181.86.1233821905.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Sergio Monteiro Basto wrote: >> > elems=etree_document.xpath(strxpath) >> > >> > for frags in elems: >> > print type (frags) > > when .xpath return a list, the elements of the list could have different > types, I found at least two , etree._Element and > etree._ElementUnicodeResult . You either get a single result (if you asked for it, e.g. as a function return value), or you get a list of results. 
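
To make the distinction concrete, here is a minimal sketch of the checks
mentioned in this thread. It assumes Python 3 (so `str` takes the place of
`basestring`); the helper names `classify` and `describe`, the labels they
return and the sample document are made up for illustration.

```python
from lxml import etree


def classify(item):
    """Return an ad hoc label for a single XPath result value."""
    if isinstance(item, str):
        # text() results and attribute values arrive as string subclasses
        return "string"
    if etree.iselement(item):
        # comments and processing instructions also pass iselement(),
        # but their .tag is not a string
        return "element" if isinstance(item.tag, str) else "special node"
    if isinstance(item, bool):
        return "bool"
    if isinstance(item, float):
        return "float"
    return "other"


def describe(node, expression):
    """Classify a scalar XPath result, or every item of a list result."""
    result = node.xpath(expression)
    if isinstance(result, list):
        return [classify(item) for item in result]
    return classify(result)


root = etree.fromstring(
    b'<div class="news"><h1>Title</h1>tail text<!-- note --></div>'
)
for expr in ('//h1/following-sibling::text()',
             '//div[@class="news"]',
             '//comment()',
             'count(//h1)',
             'count(/) = 1'):
    print(expr, '->', describe(root, expr))
```

The string check comes first because it is the cheapest way to separate
text results from nodes; the `.tag` test then follows the hint about
telling real elements apart from comments and PIs.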
In any case, the set of possible return types is the same. Stefan From stefan_ml at behnel.de Thu Feb 5 21:58:15 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 05 Feb 2009 21:58:15 +0100 Subject: [lxml-dev] String parameters to xslt transformation In-Reply-To: <1233588407.5942.14.camel@atlas> References: <1233588407.5942.14.camel@atlas> Message-ID: <498B52E7.6040906@behnel.de> Hi, thanks for the patch. Alexander Shigin wrote: > lxml lacks ways to apply an external parameter containing both single > and double quotes. > > The patch adds `transform` method to XSLT object with `params` and > `strparams` argument. `params` works like `**kw` of `__call__` method > (i.e. you still need to surround string literals with quotes). > > `strparams` are treated literally, so you do not need to use any > escaping or quotes. I thought about this a bit, and I dislike the idea of adding a new transform method only to support escaped parameters. I prefer having a function or method that does the escaping, so that you could do transform = etree.XSLT(...) result = transform(doc, string_var = transform.strparam("'hi'")) strparam() may return either an escaped string or a wrapper object that the transformation code special cases internally, not sure what's better. What do you think? Stefan From skyfex at gmail.com Thu Feb 5 23:31:06 2009 From: skyfex at gmail.com (Audun Wilhelmsen) Date: Thu, 5 Feb 2009 23:31:06 +0100 Subject: [lxml-dev] Templating Message-ID: <93064EEA-6404-4E94-853F-5F9798AAB95D@gmail.com> Hey I'd like to use lxml for creating a simple HTML templating engine, by implementing some custom tags and attributes. A simple example:

Would replace the text content of the tag with the variable foobar. But I'm not sure how to implement it using lxml. I've been using Genshi succesfully, by simply processing the tag stream before serializing. But I'd like to have the DOM-like capabilities and speed of lxml. What would be the best approach? I've been trying to use XPath to find all the elements and attributes with a gc: prefix, but with little success so far (maybe mixing namespaces with non-xml html doesn't work that well?). Audun Wilhelmsen Student, NTNU, Norway From robl at perfectworld.net Fri Feb 6 10:09:14 2009 From: robl at perfectworld.net (Robert Liebeskind) Date: Fri, 6 Feb 2009 10:09:14 +0100 Subject: [lxml-dev] Unable to solve a crash on Windows with LXML In-Reply-To: <44148.213.61.181.86.1232958645.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <3EA0022E-08E5-48DC-A020-EC3FF74C677B@perfectworld.net> <36880.213.61.181.86.1232634732.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <93FEBB0C-672E-40E8-919A-791352D0AAED@perfectworld.net> <44638.213.61.181.86.1232723604.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1681A8E9-F52C-4B92-A022-7264CF9681E7@perfectworld.net> <48989.213.61.181.86.1232733536.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <44148.213.61.181.86.1232958645.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <43CC4E67-C767-4FEA-A4A4-6A6D2E75FF8B@perfectworld.net> Hi Stefan, I have modified my code so that an etree never crosses threads. Now the etree is converted to text and then back to an etree in the new thread. This has resolved the issue. Does you know if lxml 2.2 have the same issue? Thanks for your help. Rob. On Jan 26, 2009, at 9:30 AM, Stefan Behnel wrote: > Hi, > > I'm CC-ing the list, I hope you don't mind. I think your description > is > abstract enough not to reveal anything about your application. 
> > Robert Liebeskind wrote: >> The trace you received was from v2.2 of lxml but we continue to >> experience >> the same issue with v2.5. We use XPath extensively. We do not use >> XSLT. > > I guess you meant 2.2beta1 and 2.1.5? > > >> 1. An etree is loaded from an xml file and the data displayed for >> the >> user. >> 2. The etree is modified as the result of user edits using a GUI > > I assume that this happens inside one thread. > > >> 3. The etree is the copied using copy.deepcopy() to etree2 >> 4. etree2 is passed via a queue to a thread in which it is further >> processed. > > Try copying the tree inside the target thread, (preferably) instead of > copying it inside another thread and passing it over. Trees inherit > state > from the thread that built them. Also, using a tree inside a thread > that > did not build it will result in some additional adaptation overhead. > > >> 5. etree2 is modfied as a result of processing in its own thread. >> during this processing >> additional trees/elems are fetched from disk and used to modify/ >> augment etree2. >> 6. etree2 is copied to etree3 >> 7. etree3 is sent for a additional processing in its own thread. >> 8. etree2 is copied to etree4 >> 9. etree 4 is sent for additional processing in its own thread. > > Same thing for 6/7 and 8/9. Copying the tree from inside the target > thread > will make things more stable. Even if multiple copying is not really > memory friendly, it's very fast in lxml, so as long as we are not > talking > about documents with several megabytes, and as long as this thing > really > runs on a multi processor machine, you should be fine even with a > work-around that copies the tree redundantly in both threads. > > >> at this point the initial thread is complete and tears down. >> the two additional spawned threads finish quickly and tear down as >> well. >> These processes will succeed quite often. They fail intermittently >> and result in a Windows Unhandled Exception. 
> > lxml.etree uses a per-thread dictionary that holds names of tags and > attributes. That's one of the reasons why it's so fast and memory > friendly. In the stack trace you showed me, it seems that a tree is > freed > in a different thread than the one that built it, but (for whatever > reason) some of it content is still linked to a dictionary of the > original > thread. In this case, the tree cleanup cannot detect that the name is > stored in a dictionary and will free it manually. When the originating > thread goes down, either before or after the thread that freed the > tree, > it will destroy the dictionary that stores the name, which results > in a > double free. > > Does that help for now? > > Stefan > > From stefan_ml at behnel.de Fri Feb 6 13:46:31 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 6 Feb 2009 13:46:31 +0100 (CET) Subject: [lxml-dev] Unable to solve a crash on Windows with LXML In-Reply-To: <43CC4E67-C767-4FEA-A4A4-6A6D2E75FF8B@perfectworld.net> References: <3EA0022E-08E5-48DC-A020-EC3FF74C677B@perfectworld.net> <36880.213.61.181.86.1232634732.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <93FEBB0C-672E-40E8-919A-791352D0AAED@perfectworld.net> <44638.213.61.181.86.1232723604.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1681A8E9-F52C-4B92-A022-7264CF9681E7@perfectworld.net> <48989.213.61.181.86.1232733536.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <44148.213.61.181.86.1232958645.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <43CC4E67-C767-4FEA-A4A4-6A6D2E75FF8B@perfectworld.net> Message-ID: <63712.213.61.181.86.1233924391.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Robert Liebeskind wrote: > I have modified my code so that an etree never crosses threads. > Now the etree is converted to text and then back to an etree in > the new thread. This has resolved the issue. That's an extreme measure, but I doubt that it's noticeably slower than copying the tree. 
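
A minimal sketch of that serialise-and-reparse hand-over, assuming a plain
`queue.Queue` between the threads. The `worker` function, the `processed`
attribute and the sample document are made up for illustration, and the
ElementTree fallback import is only there so the sketch also runs where
lxml is not installed.

```python
import queue
import threading

try:
    from lxml import etree
except ImportError:  # ElementTree-compatible fallback, for illustration only
    import xml.etree.ElementTree as etree


def worker(inbox, outbox):
    """Rebuild the tree inside this thread, modify it, send bytes back."""
    data = inbox.get()
    root = etree.fromstring(data)      # the tree is built in this thread
    root.set("processed", "yes")
    outbox.put(etree.tostring(root))   # only bytes leave the thread


inbox, outbox = queue.Queue(), queue.Queue()

original = etree.fromstring(b"<job><item>1</item></job>")
inbox.put(etree.tostring(original))    # hand over bytes, never the tree

thread = threading.Thread(target=worker, args=(inbox, outbox))
thread.start()
thread.join()

result = etree.fromstring(outbox.get())
print(etree.tostring(result))
```

No tree object ever crosses a thread boundary here, so each tree is built
and freed by the same thread.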
Serialising and parsing is *very* fast in lxml, so this becomes a safe and simple option. > Does you know if lxml 2.2 have the same issue? There were no changes between 2.1 and 2.2 that could make me expect anything else. I've resolved a lot of similar issues in the past, so the conditions under which crashes occur have become more and more obscure. It would be nice if we could fix this problem for 2.2 final. Could you come up with a simple setup that mimics what you described as your application flow? That would allow me to play with it myself. There's also a set of threading related tests in test_threading.py. IIRC, we are still missing one that tests a multi-thread XML pipeline as you use. Stefan From shigin at rambler-co.ru Fri Feb 6 14:07:44 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Fri, 06 Feb 2009 16:07:44 +0300 Subject: [lxml-dev] String parameters to xslt transformation In-Reply-To: <498B52E7.6040906@behnel.de> References: <1233588407.5942.14.camel@atlas> <498B52E7.6040906@behnel.de> Message-ID: <1233925664.10059.340.camel@atlas> Hi, Stefan, Stefan Behnel wrote: > strparam() may return either an escaped string or a wrapper object that the > transformation code special cases internally, not sure what's better. > > What do you think? I took another look at libxslt internals. There isn't any way to escape single or double quote in argument. A citation from libxslt/variables.c, xsltProcessUserParamInternal: * enclosed single quotes (double quotes). If the string which you want to * be treated literally contains both single and double quotes (e.g. Meet * at Joe's for "Twelfth Night" at 7 o'clock) then there is no suitable * quoting character. You cannot use ' or " inside the string * because the replacement of character entities with their equivalents is * done at a different stage of processing. The solution is to call * xsltQuoteUserParams or xsltQuoteOneUserParam. So, the only way is to create a special class like XSLTQuotedVariable. 
I've made a second version of the patch with `strparam()` interface. -------------- next part -------------- A non-text attachment was scrubbed... Name: xslt-params-2.diff Type: text/x-patch Size: 3693 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090206/2ac51b45/attachment.bin From stefan_ml at behnel.de Fri Feb 6 17:14:54 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 6 Feb 2009 17:14:54 +0100 (CET) Subject: [lxml-dev] String parameters to xslt transformation In-Reply-To: <1233925664.10059.340.camel@atlas> References: <1233588407.5942.14.camel@atlas> <498B52E7.6040906@behnel.de> <1233925664.10059.340.camel@atlas> Message-ID: <49664.213.61.181.86.1233936894.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Alexander Shigin wrote: > Stefan Behnel wrote: >> strparam() may return either an escaped string or a wrapper object that >> the >> transformation code special cases internally, not sure what's better. >> >> What do you think? > > I took another look at libxslt internals. There isn't any way to escape > single or double quote in argument. A citation from libxslt/variables.c, > xsltProcessUserParamInternal: > > * enclosed single quotes (double quotes). If the string which you want to > * be treated literally contains both single and double quotes (e.g. Meet > * at Joe's for "Twelfth Night" at 7 o'clock) then there is no suitable > * quoting character. You cannot use ' or " inside the string > * because the replacement of character entities with their equivalents is > * done at a different stage of processing. The solution is to call > * xsltQuoteUserParams or xsltQuoteOneUserParam. > > So, the only way is to create a special class like XSLTQuotedVariable. > I've made a second version of the patch with `strparam()` interface. Thanks! That looks much better. I'll fix up a couple of things and check if I can't get Cython to support classmethods for this purpose. If all works out, I'll add it for 2.2. 
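
For illustration, a minimal sketch of the interface under discussion,
assuming it is exposed as `etree.XSLT.strparam()` (which is where lxml
releases from 2.2 on provide it). The stylesheet and the `string_var`
parameter name are made up for the example.

```python
from lxml import etree

# Hypothetical stylesheet that simply echoes the parameter as text.
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="string_var"/>
  <xsl:template match="/"><xsl:value-of select="$string_var"/></xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.XML(XSL))
doc = etree.XML(b"<root/>")

# Contains both quote characters, so it cannot be written as an
# XPath string literal.
message = """Meet at Joe's for "Twelfth Night" at 7 o'clock"""

result = transform(doc, string_var=etree.XSLT.strparam(message))
print(str(result))
```

Plain keyword arguments are still evaluated as XPath expressions, so the
wrapper is only needed for values that should be taken literally.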
Stefan

From shigin at rambler-co.ru Fri Feb 6 17:34:35 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Fri, 06 Feb 2009 19:34:35 +0300
Subject: [lxml-dev] String parameters to xslt transformation
In-Reply-To: <49664.213.61.181.86.1233936894.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <1233588407.5942.14.camel@atlas> <498B52E7.6040906@behnel.de> <1233925664.10059.340.camel@atlas> <49664.213.61.181.86.1233936894.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <1233938075.10059.350.camel@atlas>

On Fri, 06/02/2009 at 17:14 +0100, Stefan Behnel wrote:
> Thanks! That looks much better. I'll fix up a couple of things and check
> if I can't get Cython to support classmethods for this purpose. If all
> works out, I'll add it for 2.2.

Oh, I totally forgot, but I think that `XSLT.strparam` is a bit of a
strange place to put an escaping method. It's only my point of view,
but I prefer

transform = etree.XSLT(...)
result = transform(doc, string_var=etree.strparam("hi"))

instead of

result = transform(doc, string_var=transform.strparam("hi"))

From stefan_ml at behnel.de Fri Feb 6 19:52:51 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Feb 2009 19:52:51 +0100
Subject: [lxml-dev] String parameters to xslt transformation
In-Reply-To: <1233938075.10059.350.camel@atlas>
References: <1233588407.5942.14.camel@atlas> <498B52E7.6040906@behnel.de> <1233925664.10059.340.camel@atlas> <49664.213.61.181.86.1233936894.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1233938075.10059.350.camel@atlas>
Message-ID: <498C8703.9010007@behnel.de>

Hi,

Alexander Shigin wrote:
> On Fri, 06/02/2009 at 17:14 +0100, Stefan Behnel wrote:
>> Thanks! That looks much better. I'll fix up a couple of things and check
>> if I can't get Cython to support classmethods for this purpose. If all
>> works out, I'll add it for 2.2.
>
> Oh, I totally forgot, but I think that `XSLT.strparam` is a bit of a
> strange place to put an escaping method.
It's only my point of view,
> but I prefer
> transform = etree.XSLT(...)
> result = transform(doc, string_var=etree.strparam("hi"))
> instead of
> result = transform(doc, string_var=transform.strparam("hi"))

I'm not sure. My thoughts were that the only purpose of that function will
be the use for XSLT string parameters, so you can either name it
"xsltstrparam()" or "xslt_string_parameter()", which sounds lengthy and
unwieldy, or you can move it into a namespace that makes it clear what it
does, i.e. "XSLT.strparam()". That way, you can easily reference it in a
local variable if you prefer a function, but it keeps the functionality
local to its intrinsic scope.

Stefan

From stefan_ml at behnel.de Fri Feb 6 21:32:39 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Feb 2009 21:32:39 +0100
Subject: [lxml-dev] Templating
In-Reply-To: <93064EEA-6404-4E94-853F-5F9798AAB95D@gmail.com>
References: <93064EEA-6404-4E94-853F-5F9798AAB95D@gmail.com>
Message-ID: <498C9E67.4080505@behnel.de>

Hi,

Audun Wilhelmsen wrote:
> I'd like to use lxml for creating a simple HTML templating engine,
> by implementing some custom tags and attributes.

Note that there are already a lot of template engines, including some that
work with lxml, such as webstring.

> A simple example:
>

> Would replace the text content of the tag with the variable foobar. > > But I'm not sure how to implement it using lxml. I've been using > Genshi succesfully, by simply processing the tag stream before > serializing. But I'd like to have the DOM-like capabilities and speed > of lxml. > > What would be the best approach? I've been trying to use XPath to find > all the elements and attributes with a gc: prefix, but with little > success so far (maybe mixing namespaces with non-xml html doesn't work > that well?). The HTML is (obviously) not namespace aware. But have you considered storing your templates as XHTML? That would allow you to use namespaces at will, without preventing you from serialising them to HTML. Also, if your templates are XML compatible (even without an XHTML namespace), you can just parse them with the XML parser instead of the HTML parser. Stefan From terry_n_brown at yahoo.com Fri Feb 6 22:44:03 2009 From: terry_n_brown at yahoo.com (Terry Brown) Date: Fri, 6 Feb 2009 15:44:03 -0600 Subject: [lxml-dev] Templating In-Reply-To: <498C9E67.4080505@behnel.de> References: <93064EEA-6404-4E94-853F-5F9798AAB95D@gmail.com> <498C9E67.4080505@behnel.de> Message-ID: <20090206154403.08dcdcfb@nrri.umn.edu> On Fri, 06 Feb 2009 21:32:39 +0100 Stefan Behnel wrote: > Hi, > > Audun Wilhelmsen wrote: > > I'd like to use lxml for creating a simple HTML templating engine, > > by implementing some custom tags and attributes. > > Note that there are already a lot of template engines, including some > that work with lxml, such as webstring. I've used Genshi with lxml, now I come to think of it. Genshi's python oriented. As webstring maybe also, I don't know. Cheers -Terry > > > A simple example: > >

> > Would replace the text content of the tag with the variable foobar. > > > > But I'm not sure how to implement it using lxml. I've been using > > Genshi succesfully, by simply processing the tag stream before > > serializing. But I'd like to have the DOM-like capabilities and > > speed of lxml. > > > > What would be the best approach? I've been trying to use XPath to > > find all the elements and attributes with a gc: prefix, but with > > little success so far (maybe mixing namespaces with non-xml html > > doesn't work that well?). > > The HTML is (obviously) not namespace aware. But have you considered > storing your templates as XHTML? That would allow you to use > namespaces at will, without preventing you from serialising them to > HTML. Also, if your templates are XML compatible (even without an > XHTML namespace), you can just parse them with the XML parser instead > of the HTML parser. > > Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From stefan_ml at behnel.de Fri Feb 6 22:57:48 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Feb 2009 22:57:48 +0100 Subject: [lxml-dev] Unable to solve a crash on Windows with LXML In-Reply-To: <63712.213.61.181.86.1233924391.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <3EA0022E-08E5-48DC-A020-EC3FF74C677B@perfectworld.net> <36880.213.61.181.86.1232634732.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <93FEBB0C-672E-40E8-919A-791352D0AAED@perfectworld.net> <44638.213.61.181.86.1232723604.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1681A8E9-F52C-4B92-A022-7264CF9681E7@perfectworld.net> <48989.213.61.181.86.1232733536.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <44148.213.61.181.86.1232958645.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <43CC4E67-C767-4FEA-A4A4-6A6D2E75FF8B@perfectworld.net> 
<63712.213.61.181.86.1233924391.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <498CB25C.9020103@behnel.de> Hi, Stefan Behnel wrote: > There's also a set of threading related tests in test_threading.py. IIRC, > we are still missing one that tests a multi-thread XML pipeline as you > use. Just a quick note that I added one now, and it's 'nicely' crashing on me. I'll see if I can investigate it a little. Stefan From stefan_ml at behnel.de Sat Feb 7 13:34:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Feb 2009 13:34:27 +0100 Subject: [lxml-dev] resolve_entities=False seems to have no effect In-Reply-To: <20090207014117.19989.62524.launchpad@gangotri.canonical.com> References: <20090207014117.19989.62524.launchpad@gangotri.canonical.com> Message-ID: <498D7FD3.1020906@behnel.de> > s = cStringIO.StringIO(""""She's the MAN!"""") > e = etree.parse(s,etree.XMLParser(resolve_entities=False)) Note that there's also etree.fromstring(). > etree.tostring(e) > '"She\'s the MAN!"' > > I would have expected resolve_entities=False to have prevented the > translation of, eg, " to ". The "resolve_entities" option is meant for entities defined in a DTD of which you want to keep the reference instead of the resolved value. The entities you mention are part of the XML spec, not of a DTD. > is there another way to prevent this behavior (or, if nothing else, > reverse it after the fact)? Well, what you get is well-formed XML. May I ask why you need the entity references in the output? 
Stefan From stefan_ml at behnel.de Sat Feb 7 14:35:04 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Feb 2009 14:35:04 +0100 Subject: [lxml-dev] resolve_entities=False seems to have no effect In-Reply-To: <20090207131254.5763.6575.launchpad@gandwana.canonical.com> References: <498D7FD3.1020906@behnel.de> <20090207131254.5763.6575.launchpad@gandwana.canonical.com> Message-ID: <498D8E08.4090708@behnel.de> Hi, I forwarded your question to the lxml mailing list, which is a much better place to discuss this as there are more people listening who might have an idea. http://comments.gmane.org/gmane.comp.python.lxml.devel/4359 usernamenumber wrote: >> Well, what you get is well-formed XML. May I ask why you need the entity >> references in the output? > > I am calculating checksums based on the combined contents of several > specific tags within a given document. The tool I am writing is designed > to replace a pre-existing tool, which did the same thing and stored > those checksums for comparison. The old tool does not convert entities, > so in order for it to not generate a slew of false-negative checksum > mismatches when we switch over, mine can't either. It's rarely easy to replace a tool if you are required to mimic the original quirks. The right way to do it is to calculate the checksums on the parsed in-memory tree rather than the serialised XML stream. The second best solution is to serialise to canonical XML (C14N) and to work on that. But having checksums depend on a byte stream as serialised by a specific tool is definitely not future proof. To emulate the old behaviour, you could maybe build the checksum from the in-memory tree and just replace all occurrences of "'" and '"' by their escaped equivalents (&apos; and &quot;) before using a text value. If your XML source documents consistently use the entity references everywhere, this should yield the same checksums. Does that help? 
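A rough sketch of the approach suggested above, for illustration only: take the text values from the parsed tree, put the two quote characters back into their escaped form, and hash the result. The helper names `escape_quotes` and `checksum_of_tags`, the tag name `q`, and the choice of MD5 are assumptions made for this example (the thread does not say which checksum the old tool used). The stdlib `xml.etree.ElementTree` is used as a stand-in; `lxml.etree` provides the same `fromstring()`, `iter()` and `.text` calls used here.

```python
import hashlib
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers the same calls


def escape_quotes(text):
    """Put back the quote entity references that the parser resolved."""
    return text.replace('"', "&quot;").replace("'", "&apos;")


def checksum_of_tags(root, tag_names):
    """Hash the re-escaped text of every element whose tag is in tag_names."""
    digest = hashlib.md5()  # assumption: the old tool's algorithm is not stated
    for tag in tag_names:
        for elem in root.iter(tag):
            # elem.text is None for empty elements, so fall back to ''
            digest.update(escape_quotes(elem.text or "").encode("utf-8"))
    return digest.hexdigest()


root = ET.fromstring("<doc><q>&quot;She&apos;s the MAN!&quot;</q></doc>")
print(checksum_of_tags(root, ["q"]))
```

This can only reproduce the old checksums if the source documents escape quotes consistently, and if any `&`, `<` or `>` in text values are given the same treatment.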
Stefan From stefan_ml at behnel.de Sat Feb 7 21:58:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Feb 2009 21:58:36 +0100 Subject: [lxml-dev] etree.clear_error_log() causes segfaults In-Reply-To: References: Message-ID: <498DF5FC.3080505@behnel.de> Hi, Christian Heimes wrote: > This is a friendly word of warning! Don't call etree.clear_error_log() > from multiple threads. > > We found this issue during stress tests of our CherryPy based > application. Every worker thread was calling etree.clear_error_log() > after the page was rendered. Apparently we hit some sort of race condition. > > Backtrace: > > C [etree.so+0x21977] > C [etree.so+0x22d48] > C [libxslt.so.1+0xdc17] xsltTransformError+0xf7 > C [libxslt.so.1+0x839d] > C [libxslt.so.1+0xa5da] xsltParseStylesheetProcess+0x80a > C [libxslt.so.1+0x1efbc] xsltParseStylesheetInclude+0x1ac > C [libxslt.so.1+0xa0b4] xsltParseStylesheetProcess+0x2e4 > C [libxslt.so.1+0xb4b9] xsltParseStylesheetImportedDoc+0x1e9 > C [libxslt.so.1+0xb5b8] xsltParseStylesheetDoc+0x28 > C [etree.so+0xe640c] > C [python2.5+0x4ff83] > C [python2.5+0x11e67] PyObject_Call+0x27 I think what happened here is that one thread cleared the log (my guess) while another one was just appending an error (as the stacktrace shows). The (somewhat radical) way to fix this is to make the global error log thread local. I think that's the way it should work anyway, especially since the error log is copied into exceptions. Otherwise, errors would leak into exceptions of other threads. I committed this change to SVN, it will be part of lxml 2.2. This means that it will be safe to call clear_error_log() from a thread in 2.2. 
Stefan From stefan_ml at behnel.de Sat Feb 7 22:17:41 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 07 Feb 2009 22:17:41 +0100 Subject: [lxml-dev] etree.clear_error_log() causes segfaults In-Reply-To: <498DF73F.4040500@cheimes.de> References: <498DF5FC.3080505@behnel.de> <498DF73F.4040500@cheimes.de> Message-ID: <498DFA75.3040907@behnel.de> Christian Heimes wrote: > I guess it was a relict from the time lxml.etree didn't clear the local > error log when an XSLT document was called. I think you are referring to the error log on XSLT objects, which is actually local to the object and cleared before each transformation. (A copy of) the global error log is attached to exceptions, so you will notice the difference when you do multiple transformations and one of the later ones raises an exception. I guess there might still be space left for improvements... Stefan From stefan_ml at behnel.de Sun Feb 8 11:17:33 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Feb 2009 11:17:33 +0100 Subject: [lxml-dev] altering the indent of the pretty output In-Reply-To: <1233766240.25247.3.camel@klappe2> References: <1233766240.25247.3.camel@klappe2> Message-ID: <498EB13D.5040207@behnel.de> Ronny Pfannschmidt wrote: > i'm currently porting gazpacho (a wysiwyg gtk ui file editor) to lxml, > unfortunately the pretty printer prints with an indent of 2 and in order > to match the convention i need an indent of 4 > > is there any simple way to archive pretty dumping to a file with an > indent of 4? I just remembered that Fredrik has an indentation function for ElementTree: http://effbot.org/zone/element-lib.htm Most of the stuff on that page should work unchanged with lxml.etree. 
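For readers who cannot reach that page, here is a minimal sketch of an in-place indentation helper in the style of the well-known ElementTree recipe, with the indent string as a parameter so that four spaces can be used. It is written from memory rather than copied from the page, the `step` parameter is an addition for this example, and it is shown with the stdlib `xml.etree.ElementTree`; it only touches `.text` and `.tail`, so it is expected (not verified here) to work the same way on `lxml.etree` elements.

```python
import xml.etree.ElementTree as ET  # stand-in; only .text/.tail are used


def indent(elem, level=0, step="    "):
    """Add whitespace to .text/.tail in place so the tree serialises
    with one element per line, indented by `step` per nesting level."""
    i = "\n" + level * step
    if len(elem):
        # whitespace before the first child
        if not elem.text or not elem.text.strip():
            elem.text = i + step
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level + 1, step)
        # the last child is followed by the parent's closing tag
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i


root = ET.fromstring("<a><b><c/></b></a>")
indent(root)
print(ET.tostring(root, encoding="unicode"))
```

After calling `indent(root)`, serialise without `pretty_print`, since the whitespace is already part of the tree. Note that the helper modifies the tree, so text that consists only of whitespace is overwritten.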
Stefan From stefan_ml at behnel.de Tue Feb 17 22:57:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 17 Feb 2009 22:57:05 +0100 Subject: [lxml-dev] lxml 2.2beta3 released Message-ID: <499B32B1.3050305@behnel.de> Hi all, I'm happy to release the third (and hopefully last) beta of lxml 2.2. This release features some major bug fixes, so upgrading is highly recommended. http://pypi.python.org/pypi/lxml/2.2beta3 http://codespeak.net/lxml/dev/ It was generated with revision 1716 of the Cython 0.11 development branch, which was hardened against various memory leaks using Dag Seljebotn's ref-nanny. This release also marks the end of the lxml 2.1 maintenance cycle. As I stated a while ago, I think that the stability of the current trunk makes backporting fixes to older release series no longer worth the effort. The only reason why this beta version was not released as lxml 2.2 final is that I want to wait for an official Cython release series to go with lxml 2.2. Therefore a note to distributors: please start shipping lxml 2.2 as soon as possible. Note that while this release does not officially support Python 3.0, it should work well in that environment in almost all situations. It was tested with Python 3.0.1. Hope you like it, Stefan 2.2beta3 (2009-02-17) Features added * XSLT.strparam() class method to wrap quoted string parameters that require escaping. Bugs fixed * Memory leak in XPath evaluators. * Crash when parsing indented XML in one thread and merging it with other documents parsed in another thread. * Setting the base attribute in lxml.objectify from a unicode string failed. * Fixes following changes in Python 3.0.1. * Minor fixes for Python 3. Other changes * The global error log (which is copied into the exception log) is now local to a thread, which fixes some race conditions. * More robust error handling on serialisation. 
From stefan_ml at behnel.de Thu Feb 19 18:48:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Feb 2009 18:48:05 +0100 Subject: [lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities? In-Reply-To: <20090219144142.10497.96201.launchpad@potassium.ubuntu.com> References: <20090219144142.10497.96201.launchpad@potassium.ubuntu.com> Message-ID: <499D9B55.1070705@behnel.de> usernamenumber wrote: > I am porting a perl/SAX tool to python/lxml. Ideally, given the same > input, the new tool should produce the same output as the old tool. In > fact, it introduces a number of problems for me if this is not the case. It's always bad style to make applications depend on a specific XML serialisation done by a specific tool. That's exactly what canonical XML (C14N) was designed for. > One annoying problem I am encountering is that SAX seems to store unicode > entity IDs in hex, whereas lxml uses decimal, regardless of what value is > used in the input: > > >>> import lxml.etree as etree > >>> example_sax_output = "Copyright &#xA9; 2009 Foocorp, Inc" # Note: xA9 > >>> e = etree.fromstring(example_sax_output) > >>> etree.tostring(e) > Copyright &#169; 2009 Foocorp, Inc # Note: 169 > > Is it possible to avoid this without doing something horribly kludgey > like going through the output with a regex search and manually > converting the values to hex? There isn't a straightforward way to do that. Decimal character references were chosen for compatibility with ElementTree, which uses "xmlcharrefreplace". However, if you have a bit of memory and do not care too much about raw performance, you can do this: # Python 2.6 from itertools import imap unicode_xml = etree.tostring(tree, encoding=unicode) bytes_xml = b''.join(chr(c) if c < 0x80 else b'&#x%X;' % c for c in imap(ord, unicode_xml)) There's also a separate serialiser API in libxml2 that happens to output hex entities. However, that's not used for backward compatibility reasons. 
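On Python 3, where `imap` and the `unicode` type no longer exist, roughly the same idea might look like the sketch below. It works on already serialised text, which would come from `etree.tostring(tree, encoding='unicode')`; the helper name `hex_charrefs` is made up for this example, and the input string is a literal so that the snippet runs without lxml installed.

```python
def hex_charrefs(unicode_xml):
    """Encode serialised XML text to ASCII bytes, writing every
    non-ASCII character as a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch)
        for ch in unicode_xml
    ).encode("ascii")


# In real use the argument would be the text serialisation of a tree.
print(hex_charrefs("<p>Copyright \u00a9 2009 Foocorp, Inc</p>"))
# prints b'<p>Copyright &#xA9; 2009 Foocorp, Inc</p>'
```

Like the Python 2.6 version, this builds the whole document in memory twice, so it trades memory and speed for simplicity.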
Stefan From stefan_ml at behnel.de Thu Feb 19 18:51:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Feb 2009 18:51:53 +0100 Subject: [lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities? In-Reply-To: <499D9B55.1070705@behnel.de> References: <20090219144142.10497.96201.launchpad@potassium.ubuntu.com> <499D9B55.1070705@behnel.de> Message-ID: <499D9C39.7060404@behnel.de> Stefan Behnel wrote: > usernamenumber wrote: >> I am porting a perl/SAX tool to python/lxml. Ideally, given the same >> input, the new tool should produce the same output as the old tool. In >> fact, it introduces a number of problems for me if this is not the case. > > It's always bad style to make applications depend on a specific XML > serialisation done by a specific tool. That's exactly what canonical XML > (C14N) was designed for. And, as a matter of fact, C14N uses hex charrefs: http://www.w3.org/TR/xml-c14n.html#Example-Chars So maybe you should take a look at that. http://codespeak.net/lxml/api.html#write-c14n-on-elementtree Stefan From stefan_ml at behnel.de Thu Feb 19 20:50:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Feb 2009 20:50:50 +0100 Subject: [lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities? In-Reply-To: <20090219192526.32698.20391.launchpad@gangotri.canonical.com> References: <499D9C39.7060404@behnel.de> <20090219192526.32698.20391.launchpad@gangotri.canonical.com> Message-ID: <499DB81A.3040704@behnel.de> usernamenumber wrote: > Thanks very much for the assistance, Stefan. You are a great help! As it > turns out, write_c14n() actually uses yet another method of rendering > entities, (\xc2\xa9 as opposed to &#xA9;) Ah, right. Sure, it serialises to UTF-8, which doesn't require Unicode character escaping. What you see is just what the Python prompt makes of the byte series on output. 
> so it looks like I may have > to just suck it up and deal with the output of my port being slightly > different from that of the original tool (I don't think it's worth the > extra processing to translate all the entities after the fact). But > being able to write out to C14N (which I hadn't known about before now), > might at least be able to avoid this problem in the future. Wise choice. > I do have one other question coming from this: I can find functions for > writing out c14n content for ElementTree objects, but nothing for > rendering an Element (the result of etree.fromstring(), for example) in > this way. Am I missing something, or if I am working with a string do I > just need to load it into a StringIO and run etree.parse() on it? You can get an ElementTree either by calling parse() or by wrapping an Element in it, i.e. tree = etree.ElementTree(root_node) http://codespeak.net/lxml/tutorial.html#the-elementtree-class http://effbot.org/zone/element.htm#reading-and-writing-xml-files http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree-class Stefan From aloys.baillet at gmail.com Fri Feb 20 05:10:41 2009 From: aloys.baillet at gmail.com (Aloys Baillet) Date: Fri, 20 Feb 2009 15:10:41 +1100 Subject: [lxml-dev] Behaviour change in findtext Message-ID: Hi, I was planning on upgrading to a recent version of lxml but found that our code was failing in numerous places with None objects found in unexpected places. In lxml 2+ the findtext method will ignore the default and return None if the element is found but the text is empty. In elementtree and lxml before 2 the findtext method would never return None, if the element is found but empty it would return the default. Unfortunately I am unable to tell which exact version of lxml introduced that change. And I am not sure if this change is by design, but I would say it is not. A proposed fix (if it is a bug...) is included in the attached unified diff. 
Cheers, Aloys -- Aloys Baillet Research & Development - Animal Logic -- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090220/53f93165/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: findtext_default_not_used_when_text_empty.diff Type: application/octet-stream Size: 366 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090220/53f93165/attachment.obj From stefan_ml at behnel.de Fri Feb 20 08:26:24 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Feb 2009 08:26:24 +0100 Subject: [lxml-dev] Behaviour change in findtext In-Reply-To: References: Message-ID: <499E5B20.5050209@behnel.de> Hi, Aloys Baillet wrote: > I was planning on upgrading to a recent version of lxml but found that our > code was failing in numerous places with None objects found in unexpected > places. > In lxml 2+ the findtext method will ignore the default and return None if > the element is found but the text is empty. Thanks, this change was introduced in ElementTree 1.3 and lxml 2.0. Note that _elementpath.py is mostly a copy of ElementPath.py in ET, except for some minor adaptations and Py3 fixes. > In elementtree and lxml before 2 the findtext method would never return > None, if the element is found but empty it would return the default. This is not true. ET 1.2 (and thus lxml <= 1.3) returned an empty string instead, which wasn't necessarily the default either. So, for ET 1.2 compatibility, it should return an empty string if the text is empty, and the 'default' value (which is None if not passed!) when the element is not found. I wonder why the default is None, though. If the function is supposed to avoid checks on user side by always returning a string value, the default should be the empty string as well. 
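The ET 1.2 semantics described above (empty string for a found but empty element, the default only when nothing matches) can be checked with a few lines. This sketch uses the stdlib `xml.etree.ElementTree`, whose `findtext()` is documented to behave this way; what a given lxml release returns depends on the version, as discussed in this thread. The element names are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><empty/><full>text</full></root>")

# Element found, but without text content: empty string, not the default.
found_empty = root.findtext("empty", default="fallback")

# Element found with text content: the text itself.
found_full = root.findtext("full", default="fallback")

# No matching element: the default, which is None unless one is passed.
missing = root.findtext("missing", default="fallback")
missing_no_default = root.findtext("missing")

print(repr(found_empty), repr(found_full), repr(missing), repr(missing_no_default))
```

Code that has to run against several ElementTree implementations can stay on the safe side by writing `root.findtext(path) or ''` instead of relying on either behaviour.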
Plus, lxml.etree knows the difference between an empty string text value ('') and no text content (None). So this would blur things in one place while keeping them transparent in all others. Fredrik, do you have any comments on this? Stefan From paratribulations at free.fr Fri Feb 20 14:54:03 2009 From: paratribulations at free.fr (TP) Date: Fri, 20 Feb 2009 14:54:03 +0100 Subject: [lxml-dev] how to inherit from ElementBase? Message-ID: Hi everybody, For my application, I try to inherit the ElementBase class of lxml. I have read the following page: http://codespeak.net/lxml/dev/api/lxml.etree.ElementBase-class.html "Class ElementBase: The public Element class. All custom Element classes must inherit from this one. To create an Element, use the Element() factory." So I have tried to define a new class deriving from ET.ElementBase, and set the factory (__metaclass__) to ET.Element. You will find my code below. I obtain: $ p test_element_factory.py Traceback (most recent call last): File "test_element_factory.py", line 3, in class NewElement( ET.ElementBase ): File "etree.pyx", line 1844, in etree.Element File "apihelpers.pxi", line 129, in etree._makeElement File "apihelpers.pxi", line 116, in etree._makeElement File "etree.pyx", line 362, in etree._Document._setNodeNamespaces File "apihelpers.pxi", line 651, in etree._utf8 TypeError: Argument must be string or unicode. What is the problem? Thanks in advance, Julien ############################## import lxml.etree as ET class NewElement( ET.ElementBase ): __metaclass__ = ET.Element def _init( self , *args , **kwargs ): super( NewElement, self ).__init__( self , *args , **kwargs ) a = NewElement( "root" ) print a ############################## -- python -c "print ''.join([chr(154 - ord(c)) for c in '*9(9&(18%.\ 9&1+,\'Z4(55l4('])" "When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong." 
(first law of AC Clarke) From stefan_ml at behnel.de Fri Feb 20 17:31:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Feb 2009 17:31:36 +0100 Subject: [lxml-dev] how to inherit from ElementBase? In-Reply-To: References: Message-ID: <499EDAE8.2010502@behnel.de> Hi, TP wrote: > For my application, I try to inherit the ElementBase class of lxml. > I have read the following page: > > http://codespeak.net/lxml/dev/api/lxml.etree.ElementBase-class.html Does this page serve you better? http://codespeak.net/lxml/element_classes.html Stefan From stefan_ml at behnel.de Fri Feb 20 17:42:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Feb 2009 17:42:36 +0100 Subject: [lxml-dev] how to inherit from ElementBase? In-Reply-To: <499EDAE8.2010502@behnel.de> References: <499EDAE8.2010502@behnel.de> Message-ID: <499EDD7C.3080100@behnel.de> Stefan Behnel wrote: > Does this page serve you better? > > http://codespeak.net/lxml/element_classes.html ... or actually rather this one if you use lxml 2.2beta: http://codespeak.net/lxml/dev/element_classes.html Stefan From douglas at openplans.org Fri Feb 20 21:10:04 2009 From: douglas at openplans.org (Douglas Mayle) Date: Fri, 20 Feb 2009 15:10:04 -0500 Subject: [lxml-dev] LXML utf-8 problem... Message-ID: Hi all, Unfortunately, I'm running into an error that I thought I had licked before. I've running lxml 2.1.2 on OS X and python 2.5. I have a 'str' object that contains html with utf-8 bytes and a utf-8 encoding specified by the directive, which should be properly handled, to my understanding, but is not: douglas$ python Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import lxml.html >>> lxml.html.parse(u'

\xa9

'.encode('utf-8')) Traceback (most recent call last): File "", line 1, in File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/ tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/ __init__.py", line 651, in parse File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/ lxml.etree.c:25269) File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/ lxml/lxml.etree.c:63768) File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:64012) File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/ lxml/lxml.etree.c:63169) File "parser.pxi", line 969, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461) File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c: 56751) File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/ lxml/lxml.etree.c:57595) File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/ lxml/lxml.etree.c:56936) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128) >>> Why is ascii being used as a codec? It's properly identified in the string. It's a valid character (in this case a copyright symbol). What can I do? From stefan_ml at behnel.de Fri Feb 20 21:37:44 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 20 Feb 2009 21:37:44 +0100 Subject: [lxml-dev] LXML utf-8 problem... In-Reply-To: References: Message-ID: <499F1498.6050400@behnel.de> Hi, Douglas Mayle wrote: > Unfortunately, I'm running into an error that I thought I had licked > before. I've running lxml 2.1.2 on OS X and python 2.5. I have a > 'str' object that contains html with utf-8 bytes and a utf-8 encoding > specified by the directive, which should be properly handled, to my > understanding, but is not: > > douglas$ python > Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16) > [GCC 4.0.1 (Apple Inc. 
build 5465)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> import lxml.html > >>> lxml.html.parse(u' >

\xa9

'.encode('utf-8')) > Traceback (most recent call last): > [...] > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position > 53: ordinal not in range(128) :) The error message is a bit misleading here. parse() takes a file name as argument, which in your case is a UTF-8 encoded byte sequence. When lxml.etree tries to parse, it fails to find the file and thus tries to raise an error. It then fails as it cannot format the error message. Haven't tried, but it should work with 2.2. Stefan From sergio at sergiomb.no-ip.org Sat Feb 21 06:56:26 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Sat, 21 Feb 2009 05:56:26 +0000 Subject: [lxml-dev] LXML utf-8 problem... In-Reply-To: References: Message-ID: <1235195786.3774.112.camel@segulix> On Fri, 2009-02-20 at 15:10 -0500, Douglas Mayle wrote: > Hi all, > Unfortunately, I'm running into an error that I thought I had licked > before. I've running lxml 2.1.2 on OS X and python 2.5. I have a > 'str' object that contains html with utf-8 bytes and a utf-8 encoding > specified by the directive, which should be properly handled, to my > understanding, but is not: > > douglas$ python > Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16) > [GCC 4.0.1 (Apple Inc. build 5465)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> import lxml.html > >>> lxml.html.parse(u' >

\xa9

'.encode('utf-8')) > Traceback (most recent call last): > File "", line 1, in > File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/ > tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/ > __init__.py", line 651, in parse > File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/ > lxml.etree.c:25269) > File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/ > lxml/lxml.etree.c:63768) > File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL > (src/lxml/lxml.etree.c:64012) > File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/ > lxml/lxml.etree.c:63169) > File "parser.pxi", line 969, in > lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461) > File "parser.pxi", line 538, in > lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c: > 56751) > File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/ > lxml/lxml.etree.c:57595) > File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/ > lxml/lxml.etree.c:56936) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position > 53: ordinal not in range(128) > >>> > > Why is ascii being used as a codec? It's properly identified in the > string. It's a valid character (in this case a copyright symbol). > What can I do? if is what I think could be a problem with python it self ! This code : content = urllib.urlopen(url).read(-1) content = content.decode('cp1252') print content With one page with enconding windows-1252, I print to stdout and I see it well but if I put it on a pipe , like : python getcontent.py | grep something, gives the error that you mention. don't ask me why but adding .encode('utf-8') content = content.decode('cp1252').encode('utf-8') fixes this problem . hope that can help , regards. -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090221/26c45f49/attachment.bin From douglas at openplans.org Sat Feb 21 15:11:12 2009 From: douglas at openplans.org (Douglas Mayle) Date: Sat, 21 Feb 2009 09:11:12 -0500 Subject: [lxml-dev] LXML utf-8 problem... In-Reply-To: <1235195786.3774.112.camel@segulix> References: <1235195786.3774.112.camel@segulix> Message-ID: <9114B7F7-2AA8-42F4-8A7C-37D6C0F22B61@openplans.org> Actually, after digging further, I found out that it's a problem with the error reporting mechanisms in lxml. If you have unicode data inside of of a 'str' type object (which is normal for many html and xml documents) then the lxml error reporting incorrectly decodes the string while trying to spit out an error, which causes a new error that masks the original error. As mentioned earlier in this thread, it should be fixed in the newest version of lxml. In any case, I copied code from elsewhere in my program and forgot to switch from parse (which takes a filename or url) to fromstring(which takes text data). parse was spitting out an error because it didn't receive a filename, and that error was mixed with the incorrectly decoded data of the filename which caused a new error... Doug On Feb 21, 2009, at 12:56 AM, Sergio Monteiro Basto wrote: > On Fri, 2009-02-20 at 15:10 -0500, Douglas Mayle wrote: >> Hi all, >> Unfortunately, I'm running into an error that I thought I had licked >> before. I've running lxml 2.1.2 on OS X and python 2.5. I have a >> 'str' object that contains html with utf-8 bytes and a utf-8 encoding >> specified by the directive, which should be properly handled, to my >> understanding, but is not: >> >> douglas$ python >> Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16) >> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin >> Type "help", "copyright", "credits" or "license" for more >> information. 
>>>>> import lxml.html >>>>> lxml.html.parse(u'>>

\xa9

'.encode('utf-8')) >> Traceback (most recent call last): >> File "", line 1, in >> File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/ >> tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/ >> __init__.py", line 651, in parse >> File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/ >> lxml.etree.c:25269) >> File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/ >> lxml/lxml.etree.c:63768) >> File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL >> (src/lxml/lxml.etree.c:64012) >> File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/ >> lxml/lxml.etree.c:63169) >> File "parser.pxi", line 969, in >> lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c: >> 60461) >> File "parser.pxi", line 538, in >> lxml.etree._ParserContext._handleParseResultDoc (src/lxml/ >> lxml.etree.c: >> 56751) >> File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/ >> lxml/lxml.etree.c:57595) >> File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/ >> lxml/lxml.etree.c:56936) >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position >> 53: ordinal not in range(128) >>>>> >> >> Why is ascii being used as a codec? It's properly identified in the >> string. It's a valid character (in this case a copyright symbol). >> What can I do? > > if is what I think could be a problem with python it self ! > This code : > content = urllib.urlopen(url).read(-1) > content = content.decode('cp1252') > print content > > With one page with enconding windows-1252, I print to stdout and I see > it well but if I put it on a pipe , like : > python getcontent.py | grep something, > gives the error that you mention. > > don't ask me why but adding .encode('utf-8') > > content = content.decode('cp1252').encode('utf-8') > > fixes this problem . > > hope that can help , regards. > -- > S?rgio M. B. 
From stefan_ml at behnel.de Sat Feb 21 19:09:24 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 21 Feb 2009 19:09:24 +0100 Subject: [lxml-dev] Behaviour change in findtext In-Reply-To: <368a5cd50902210930x6c8d7448o4c88217f7fc32786@mail.gmail.com> References: <499E5B20.5050209@behnel.de> <368a5cd50902210930x6c8d7448o4c88217f7fc32786@mail.gmail.com> Message-ID: <49A04354.2030501@behnel.de> Hi Fredrik, thanks for the clarification. Fredrik Lundh wrote: > Not sure - that you can get None back from findtext when the element > is there looks like an accidental change when the ElementPath engine > was rewritten. I think I'll consider that a bug in findtext. I thought so, too. > As for distinguishing between and That's not what I meant, although that actually is the result when you serialise with or without an empty string value. A parsed empty element will always have its .text set to None in lxml.etree, regardless of the way the parser saw it. I rather meant the difference between users setting el.text = None and el.text = '' in the code. In the second case, lxml.etree creates a text node with an empty string in the underlying libxml2 tree. That way, it can return the expected result on later requests. This is actually compatible with ET, which (obviously) also remembers what the user set as value. You can think of the above as an emulation of the ET behaviour, but also as a way to prevent surprised faces on user side when you see el.text = '' for i in range(10): el.text += 'xyz' fail mysteriously. > the ET specification allows an implementation to use either > None or an empty string for the text and tail attributes in either > case to simplify the tree building. However, an application shouldn't > abuse this - an XML producer should be free to use either form to > indicate an empty element, and application code should use "truth > testing" when necessary, when inspecting the text/tail attributes of a > given element. I fully agree. 
> And I think findtext should be reverted to the 1.2 > behaviour - just add an to the suitable place in ElementPath, > and leave the rest as is. That's what I did for lxml 2.2. It just makes findtext() simpler to use. Stefan From aloys.baillet at gmail.com Mon Feb 23 07:03:27 2009 From: aloys.baillet at gmail.com (Aloys Baillet) Date: Mon, 23 Feb 2009 17:03:27 +1100 Subject: [lxml-dev] Behaviour change in findtext In-Reply-To: <49A04354.2030501@behnel.de> References: <499E5B20.5050209@behnel.de> <368a5cd50902210930x6c8d7448o4c88217f7fc32786@mail.gmail.com> <49A04354.2030501@behnel.de> Message-ID: Hi Fredrik and Stefan, Thanks a lot for your feedback. I'm happy that you consider that change a bug, and even happier that you have already fixed it! Looking forward to the lxml 2.2 release... And thanks for your hard work on ET and lxml! Cheers, Aloys -- Aloys Baillet Research & Development - Animal Logic -- From paratribulations at free.fr Mon Feb 23 10:05:57 2009 From: paratribulations at free.fr (TP) Date: Mon, 23 Feb 2009 10:05:57 +0100 Subject: [lxml-dev] how to inherit from ElementBase? References: <499EDAE8.2010502@behnel.de> Message-ID: Stefan Behnel wrote: > Does this page serve you better? > > http://codespeak.net/lxml/element_classes.html Thanks a lot. Julien -- python -c "print ''.join([chr(154 - ord(c)) for c in '*9(9&(18%.\ 9&1+,\'Z4(55l4('])" "When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong."
(first law of AC Clarke) From jholg at gmx.de Mon Feb 23 16:47:59 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 23 Feb 2009 16:47:59 +0100 Subject: [lxml-dev] Some praise for lxml in ibm developerworks Message-ID: <20090223154759.193830@gmx.net> Just stumbled on this: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ Might've been mentioned before here (?), but I haven't been able to follow too closely these days... Holger From stefan_ml at behnel.de Tue Feb 24 20:15:09 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 24 Feb 2009 20:15:09 +0100 Subject: [lxml-dev] Some praise for lxml in ibm developerworks In-Reply-To: <20090223154759.193830@gmx.net> References: <20090223154759.193830@gmx.net> Message-ID: <49A4473D.3070506@behnel.de> Hi, jholg at gmx.de wrote: > http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ Thanks for pointing to that, I hadn't seen it yet. Apart from the few minor inaccuracies that you only spot when you've actually written the code behind what's described, I find it a very good article that's nicely written. It's also nice to see that lxml's documentation can't be that bad after all. I remember writing down many of the tweaks and hints that the article mentions, so she must actually have read it! ;) I also find the conclusion nicely heart-warming, starting with this sentence: "Many software products come with the pick-two caveat, meaning that you must choose only two: speed, flexibility, or readability. When used carefully, lxml can provide all three."
Stefan From faassen at startifact.com Wed Feb 25 18:08:46 2009 From: faassen at startifact.com (Martijn Faassen) Date: Wed, 25 Feb 2009 18:08:46 +0100 Subject: [lxml-dev] Some praise for lxml in ibm developerworks In-Reply-To: <49A4473D.3070506@behnel.de> References: <20090223154759.193830@gmx.net> <49A4473D.3070506@behnel.de> Message-ID: Hey, Stefan Behnel wrote: [snip] > I also find the conclusion nicely heart-warming, starting with this sentence: > > "Many software products come with the pick-two caveat, meaning that you > must choose only two: speed, flexibility, or readability. When used > carefully, lxml can provide all three." I found that article a few weeks ago, I hadn't realized you hadn't seen it yet! I also thought that conclusion a very nice statement. :) Regards, Martijn From faassen at startifact.com Wed Feb 25 20:06:27 2009 From: faassen at startifact.com (Martijn Faassen) Date: Wed, 25 Feb 2009 20:06:27 +0100 Subject: [lxml-dev] thread-related crash when using xslt Message-ID: Hi there, Attached is a small tarball that demonstrates code that crashes when the code is run in a thread but doesn't crash when it is run stand-alone. I isolated the specific XSLT + XML combination that seems to trigger this crash. I suspect it has to do with passing an XSLT object to a thread. I run this with lxml 2.1.5 in Python 2.4, libxml2 2.6.32 and libxslt 1.1.24 By the way, the FAQ implies that passing an XSLT object into a thread will slow things down (probably as the XSLT would be re-interpreted). Is that still true in the current codebase? I had the impression from previous discussions that this would change. Regards, Martijn -------------- next part -------------- A non-text attachment was scrubbed... 
Name: thread_crash.tgz Type: application/x-compressed-tar Size: 1058 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090225/bf0c4d93/attachment.bin From ovnicraft at gmail.com Thu Feb 26 01:57:30 2009 From: ovnicraft at gmail.com (Ovnicraft) Date: Wed, 25 Feb 2009 19:57:30 -0500 Subject: [lxml-dev] Standalone declaration In-Reply-To: References: <49761F0B.7050305@behnel.de> Message-ID: 2009/1/20 Ovnicraft > > > 2009/1/20 Stefan Behnel > > Hi, >> >> Ovnicraft wrote: >> > I created an XML document and now I want to make a >> > standalone declaration in my structure; how can I do it? >> >> There isn't currently a way to set the flag programmatically, but you can >> just parse in the declaration like this: >> >> doc = etree.fromstring( >> '') >> root = doc.getroot() > > getroot(): this attribute doesn't exist, so how can I add the standalone flag? > >> root[:] = your_content_elements > > > I added that header with etree.tostring(root, encoding='iso-8859-1') > > but can I add a flag with a standalone value? > > Which instruction creates flags in a node? > > regards, > > >> >> Stefan >> >> >> > > > -- > [b]question = (to) ? be : !be; .[/b] > -- [b]question = (to) ? be : !be; .[/b] From stefan_ml at behnel.de Thu Feb 26 08:29:17 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Feb 2009 08:29:17 +0100 Subject: [lxml-dev] Standalone declaration In-Reply-To: References: <49761F0B.7050305@behnel.de> Message-ID: <49A644CD.30403@behnel.de> Ovnicraft wrote: >> 2009/1/20 Stefan Behnel >>> Ovnicraft wrote: >>>> I created an XML document and now I want to make a >>>> standalone declaration in my structure; how can I do it?
>>> There isn't currently a way to set the flag programmatically, but you can >>> just parse in the declaration like this: >>> >>> doc = etree.fromstring( >>> '') >>> root = doc.getroot() >>> root[:] = your_content_elements > > getroot(): this attribute doesn't exist Sorry, make that

    doc = etree.parse(StringIO(
        ''))
    root = doc.getroot()

or

    root = etree.fromstring(
        '')

>> I added that header with etree.tostring(root, encoding='iso-8859-1') >> but can I add a flag with a standalone value? Not currently. Please file a wishlist bug in lxml's bug tracker. I think a flag on the _ElementTree class's docinfo property would work here. >> Which instruction creates flags in a node? This is not about changing nodes, but about changing document features. Stefan From stefan_ml at behnel.de Thu Feb 26 09:24:21 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Feb 2009 09:24:21 +0100 Subject: [lxml-dev] thread-related crash when using xslt In-Reply-To: References: Message-ID: <49A651B5.8050401@behnel.de> Hi Martijn, Martijn Faassen wrote: > Attached is a small tarball that demonstrates code that crashes when the > code is run in a thread but doesn't crash when it is run stand-alone. I > isolated the specific XSLT + XML combination that seems to trigger this > crash. I suspect it has to do with passing an XSLT object to a thread. I've seen enough of these all over the place to consider this possible. ;) I'll look into this as soon as I get to it. I was about to release another beta anyway - the latest changelog has gotten longer than I expected, and I really love being able to say that lxml is now fully Py3 compatible. So I'll see if I can get this to work before putting out a 2.2beta4. The still-future Cython 0.11 has also matured a lot by now, so it's worth another release.
> I run this with lxml 2.1.5 in Python 2.4, libxml2 2.6.32 and libxslt > 1.1.24 Just in case, if the crash is related to transformation errors, you might want to try with 2.2beta3, or even with the trunk, if you also install the latest trunk Cython (sorry for that). > By the way, the FAQ implies that passing an XSLT object into a thread > will slow things down (probably as the XSLT would be re-interpreted). Is > that still true in the current codebase? I had the impression from > previous discussions that this would change. Yes, the (ugly) code section that this statement was referring to was killed somewhere in 2.1.x. I removed the paragraph from the FAQ and also clarified a couple of other things while at it. lxml now even has a working test case for passing trees along a thread pipeline, so the safety of threading really has improved a lot lately. It's impressively hard to get these things right. Threads are just plain evil. Their only excuse in lxml is that XML handling is often I/O expensive and can involve major time consuming operations inside libxml2 and libxslt (XSLT is really a great candidate for that). So freeing the GIL when we know we are about to do most of our work outside of the Python interpreter gets you pretty far. Stefan From stefan_ml at behnel.de Thu Feb 26 10:07:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Feb 2009 10:07:01 +0100 Subject: [lxml-dev] thread-related crash when using xslt In-Reply-To: <49A651B5.8050401@behnel.de> References: <49A651B5.8050401@behnel.de> Message-ID: <49A65BB5.6010307@behnel.de> Hi, one more note on this: Stefan Behnel wrote: > Martijn Faassen wrote: >> By the way, the FAQ implies that passing an XSLT object into a thread >> will slow things down (probably as the XSLT would be re-interpreted). Is >> that still true in the current codebase? I had the impression from >> previous discussions that this would change. 
> > Yes, the (ugly) code section that this statement was referring to was > killed somewhere in 2.1.x. I removed the paragraph from the FAQ and also > clarified a couple of other things while at it. I should mention that there is /still/ some overhead involved when you mix documents from different threads here (as everywhere in lxml), including the stylesheet itself. However, as this also runs with the GIL released, your gain on multi-processor machines will still be higher than the overhead. YMMV, as usual, so profiling is always a good idea. :) Stefan From stefan_ml at behnel.de Fri Feb 27 12:41:43 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Feb 2009 12:41:43 +0100 Subject: [lxml-dev] thread-related crash when using xslt In-Reply-To: References: Message-ID: <49A7D177.4080207@behnel.de> Hi, Martijn Faassen wrote: > Attached is a small tarball that demonstrates code that crashes when the > code is run in a thread but doesn't crash when it is run stand-alone. I > isolated the specific XSLT + XML combination that seems to trigger this > crash. I suspect it has to do with passing an XSLT object to a thread. Ok, this is plain evil. What you do here is this: ... top-row ... Note how the attribute value is changed after being set. In libxslt, this leads to a result tree update that removes the old attribute and replaces it by the new one. In your case, the stylesheet that was parsed outside the thread inherits the name dict from the main thread, while the input document inherits the one from the worker thread that executes this function:

    def render(id, xml, stylesheet):
        doc = etree.parse(StringIO(xml))
        result_tree = stylesheet(doc)

So the first "class" attribute name comes from the stylesheet dict and gets stored in the result document that inherits the thread dict of the input document.
When it is overwritten and deleted, it is looked up in the thread dict, is not found there, and thus free()-ed, although it continues to 'live' in the stylesheet dict. This must really be the only place in XSLT where the result document is not only created incrementally but where its existing content gets overwritten. For now, I really do not know how to work around this. There can only be one dict for the result document, but the original attribute can come from the stylesheet or the input document (or even the current thread dict where the XSLT is executed), and the dict lookup happens from deep inside libxslt. I'm very open to ideas. Stefan From stefan_ml at behnel.de Fri Feb 27 13:36:33 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Feb 2009 13:36:33 +0100 Subject: [lxml-dev] thread-related crash when using xslt In-Reply-To: <49A7D177.4080207@behnel.de> References: <49A7D177.4080207@behnel.de> Message-ID: <49A7DE51.4010606@behnel.de> Stefan Behnel wrote: > Martijn Faassen wrote: >> Attached is a small tarball that demonstrates code that crashes when the >> code is run in a thread but doesn't crash when it is run stand-alone. I >> isolated the specific XSLT + XML combination that seems to trigger this >> crash. I suspect it has to do with passing an XSLT object to a thread. > > Ok, this is plain evil. What you do here is this: > > ... > > top-row > ... > > Note how the attribute value is changed after being set. In libxslt, this > leads to a result tree update that removes the old attribute and replaces > it by the new one. Here is a minimal fix for the problem. There may be special cases where this might not work (my guess would be custom XSLT elements), but at least it works safely in this case. 
Stefan

=== src/lxml/xslt.pxi
==================================================================
--- src/lxml/xslt.pxi (revision 5056)
+++ src/lxml/xslt.pxi (local)
@@ -486,7 +486,15 @@
             _destroyFakeDoc(input_doc._c_doc, c_doc)
             python.PyErr_NoMemory()

-        initTransformDict(transform_ctxt)
+        # using the stylesheet dict is safer than using a possibly
+        # unrelated dict from the current thread. Almost all
+        # non-input tag/attr names will come from the stylesheet
+        # anyway.
+        if transform_ctxt.dict is not NULL:
+            xmlparser.xmlDictFree(transform_ctxt.dict)
+        transform_ctxt.dict = self._c_style.doc.dict
+        xmlparser.xmlDictReference(transform_ctxt.dict)
+
         xslt.xsltSetCtxtParseOptions(
             transform_ctxt, input_doc._parser._parse_options)

From faassen at startifact.com Fri Feb 27 16:50:08 2009 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 27 Feb 2009 16:50:08 +0100 Subject: [lxml-dev] thread-related crash when using xslt In-Reply-To: <49A7D177.4080207@behnel.de> References: <49A7D177.4080207@behnel.de> Message-ID: <8928d4e90902270750k6f07fb43j71310aaa4595996f@mail.gmail.com> Hey Stefan, On Fri, Feb 27, 2009 at 12:41 PM, Stefan Behnel wrote: > Martijn Faassen wrote: >> Attached is a small tarball that demonstrates code that crashes when the >> code is run in a thread but doesn't crash when it is run stand-alone. I >> isolated the specific XSLT + XML combination that seems to trigger this >> crash. I suspect it has to do with passing an XSLT object to a thread. > > Ok, this is plain evil. What you do here is this: > > ... > > top-row > ... I didn't do it, or if I did do it it was years ago and I don't remember! :) [snip] > So the first "class" attribute name comes from the stylesheet dict and gets > stored in the result document that inherits the thread dict of the input > document.
When it is overwritten and deleted, it is looked up in the thread > dict, is not found there, and thus free()-ed, although it continues to > 'live' in the stylesheet dict. Ugh! FYI I've worked around the problem in the original application (Silva) by having a thread-local XSLT stylesheet for each thread now. This seems to resolve the actual crash in the application and has a minimal performance impact as far as I can see. Given Silva's history with thread-related issues with XSLT such a general workaround might be the best way forward, though it does mean you'll see less thread related bug reports coming from that direction. :) I see however that you thought up a fix in the reply, which is good news for people coming after me. :) Regards, Martijn From stefan_ml at behnel.de Fri Feb 27 17:10:06 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Feb 2009 17:10:06 +0100 Subject: [lxml-dev] lxml 2.2beta4 - release candidate for 2.2 final Message-ID: <49A8105E.8020608@behnel.de> Hi all, here is another almost-final version of lxml 2.2. Call it a release candidate, if you prefer. This release was necessary as the changelog was getting way too long, and the (crash-)bugs that were fixed in this release were too important to wait. So, updating is recommended. http://codespeak.net/lxml/dev/ http://pypi.python.org/pypi/lxml/2.2beta4 I'm very happy to announce that this is the first release that fully supports Python 3. The previous releases suffered from an annoying crash bug in Cython that was uncovered by the exception handling changes in Py3. This means that lxml now supports any officially released Python version from 2.3.x through 3.0.1. Note that the PDF documentation was broken in the last betas due to a recent system update on my side. It's back up for beta4. This release was built with revision "1796:cb0f315bb4f5" of the Cython 0.11 development branch. 
Have fun,
Stefan

2.2beta4 (2009-02-27)

Features added

* Support strings and instantiable Element classes as child arguments to the constructor of custom Element classes.
* GZip compression support for serialisation to files and file-like objects.

Bugs fixed

* Deep-copying an ElementTree copied neither its sibling PIs and comments nor its internal/external DTD subsets.
* Soupparser failed on broken attributes without values.
* Crash in XSLT when overwriting an already defined attribute using xsl:attribute.
* Crash bug in exception handling code under Python 3. This was due to a problem in Cython, not lxml itself.
* lxml.html.FormElement._name() failed for non top-level forms.
* TAG special attribute in constructor of custom Element classes was evaluated incorrectly.

Other changes

* Official support for Python 3.0.1.
* Element.findtext() now returns an empty string instead of None for Elements without text content.
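
The findtext() change in the changelog above, together with the el.text discussion earlier in the thread, can be illustrated with a short sketch. It assumes lxml 2.2 or later; where lxml is not installed it falls back to the standard library's xml.etree.ElementTree, which the thread describes as behaving the same way on these points. The element names (`root`, `empty`, `full`, `missing`, `el`) and the helper `has_text` are made up for the example and do not come from the original messages.

```python
# Sketch only: element names and has_text() are invented for illustration.
try:
    from lxml import etree  # assumes lxml >= 2.2
except ImportError:
    import xml.etree.ElementTree as etree  # ET shows the same behaviour here

root = etree.fromstring("<root><empty/><full>text</full></root>")

# A parsed empty element has its .text set to None ...
empty = root.find("empty")
parsed_text = empty.text

# ... but findtext() returns '' for an element that exists without text,
# and None (the default) only when no element matches the path.
found_empty = root.findtext("empty")
found_missing = root.findtext("missing")

# An explicitly assigned empty string is remembered, so += works afterwards.
el = etree.Element("el")
el.text = ""
for i in range(10):
    el.text += "xyz"


def has_text(element):
    """Truth-test .text so that None and '' are treated alike."""
    return bool(element.text)
```

Application code that only needs to know whether an element carries text can use a truth test such as `has_text(empty)` and so stay independent of whether a given implementation reports None or an empty string.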