From dpavlekovic at gmail.com Mon May 1 22:29:04 2006
From: dpavlekovic at gmail.com (Dean Pavlekovic)
Date: Mon May 1 22:29:45 2006
Subject: [lxml-dev] HTMLParser behavior - ill formed HTML
Message-ID:

Hi all,

My name is Dean and I've recently joined lxml-dev to watch how things are going, most notably since the HTML parser has been added to the trunk. We (my colleague and I) have been using lxml in the past for processing of rather large DocBook type XML docs split across multiple files and in mixed namespaces without any problems and the experiences with lxml have been nothing but good, so let me first congratulate the developers on a great work lxml is.

Lately I've looked into using the lxml trunk revision for analyzing some HTML (not well-formed). I've stumbled upon a behavior that I'm not sure is intended. When I feed ill-formed HTML to a parser via

> doc=etree.parse('afile.html', parser=etree.HTMLParser()) #recover=True by default

an exception is raised and the function yields no result. When I use libxml2 directly:

> doc=libxml2.htmlParseFile('afile.html', None)

libxml2 prints out some errors/warnings but I _do_ get a reference to a document, which can normally be used. I've also tried to use the procedure as in parser.pxi i.e. htmlCreateMemoryParserCtxt() and htmlCtxtReadDoc() with the same results.

So I tracked this behavior down to calling _handleParseResult(pctxt, result, NULL) at the end of parse* methods in HTMLParser in parser.pxi, note the 'if.ctxt.wellFormed' part: where the document is destroyed if libxml2's context wellFormed flag is not set. I checked this by calling libxml2 htmlCtxtReadDoc() directly on that document and indeed the wellFormed flag turned out to be 0.

Now, shouldn't the HTMLParser also return the document reference in this case if the recover=True flag is specified, since libxml apparently does not have problems with that?
I've checked this by modifying _handleParseResult with an 'accept_ill_formed' argument. If that flag is set, ctxt.wellFormed is ignored. I also modified the _handleParseResult calls in HTMLParser's parse... methods to specify the accept flag if 'recover' was set in the constructor. This turned out to work just fine and I was able to navigate the document with xpath, even the errors (using '&' in href attributes) were corrected.

So the question would be: is the present behavior correct due to something I'm possibly missing? I think it should depend on whether the RECOVER flags have been specified or not.

Best regards,
Dean

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 2 07:45:05 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue May 2 07:45:45 2006
Subject: [lxml-dev] HTMLParser behavior - ill formed HTML
In-Reply-To:
References:
Message-ID: <4456F1E1.9050508@gkec.informatik.tu-darmstadt.de>

Hi Dean,

Dean Pavlekovic wrote:
> experiences with lxml have been nothing but good, so let me first
> congratulate the developers on a great work lxml is.

Thanks! :)

> When I feed an ill-formed HTML to a parser an exception is raised and the
> function yields no result. When I use libxml2 directly, libxml2 prints out
> some errors/warnings but I _do_ get a reference to a document, which can
> normally be used.
>
> So I tracked this behavior down to calling _handleParseResult(pctxt,
> result, NULL) at the end of parse* methods in HTMLParser in parser.pxi,
> where the document is destroyed if libxml2's context wellFormed flag is not
> set. I checked this by calling libxml2 htmlCtxtReadDoc() directly on that
> document and indeed the wellFormed flag turned out to be 0.
>
> Now, shouldn't the HTMLParser also return the document reference in this
> case if recover=True flag is specified since libxml apparently does not
> have problems with that.

It's absolutely reasonable to do that.
My guess is that libxml2 will always try to return either NULL or a correct and usable in-memory structure no matter how broken and incomplete the parsed data was. So if it returns anything but NULL, that should be usable.

I changed the trunk to always accept ill-formed results if the recover option is set and no lxml-internal errors were raised. Please try if that helps. I couldn't come up with a sufficiently short example of broken HTML where this problem occurs, so I couldn't test it. The examples that were tested so far can be found in src/lxml/tests/test_htmlparser.py.

Thanks for the report,
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 2 20:49:49 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue May 2 20:50:28 2006
Subject: [lxml-dev] 'docinfo' property on ElementTree
Message-ID: <4457A9CD.2040403@gkec.informatik.tu-darmstadt.de>

Hi all,

I updated the trunk to provide ElementTree objects with access to the document information provided by the parser: DOCTYPE, XML version and original encoding. Paul Everitt had some use cases related to the HTML parser, but I think it's generally a good idea to make this kind of information available.

The new API works as follows:

  >>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
  >>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  >>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
  >>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
  >>> xhtml = xml_header + doctype_string + '<html><body></body></html>'

  >>> et = lxml.etree.parse(StringIO(xhtml))
  >>> docinfo = et.docinfo
  >>> print docinfo.public_id
  -//W3C//DTD XHTML 1.0 Transitional//EN
  >>> print docinfo.system_url
  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
  >>> docinfo.doctype == doctype_string
  True
  >>> print docinfo.xml_version
  1.0
  >>> print docinfo.encoding
  ascii

This is backed by a DocInfo object that you can also instantiate on an ElementTree (or Element) by hand. The docinfo property just does it for you.
Any of the attributes above may be None if the information is not available.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 3 09:23:06 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 3 09:23:48 2006
Subject: [lxml-dev] HTMLParser behavior - ill formed HTML
In-Reply-To:
References: <4456F1E1.9050508@gkec.informatik.tu-darmstadt.de>
Message-ID: <44585A5A.6030401@gkec.informatik.tu-darmstadt.de>

Hi Dean,

Dean Pavlekovic wrote:
> On 5/2/06, Stefan Behnel wrote:
>
>> I changed the trunk to always accept ill-formed results if the recover
>> option is set and no lxml-internal errors were raised. Please try if that
>> helps. I couldn't come up with a sufficiently short example of broken
>> HTML where this problem occurs, so I couldn't test it. The examples that
>> were tested so far can be found in src/lxml/tests/test_htmlparser.py.
>
> Unfortunately it's not working and I'm sorry but I didn't have the time to
> look at it deeper. It appears that the value of options is changed
> somewhere in libxml context during the htmlCtxtReadFile. I printed out the
> value of parser's self._parse_options before (97, that is
> XML_PARSE_RECOVER_FLAG set) and the value of ctxt.options after the
> htmlCtxtReadFile call which was 96 meaning the flag was reset by libxml.

Interesting. Could you tell me what version of libxml2 you are using (see below)? My guess is that it's older than 2.6.21. Libxml2 copies the options by hand, so if the RECOVER option is unknown, it will not turn up in the context options. That makes me wonder why recovery worked for Paul on 2.6.16 in the first place...

Anyway. I changed the trunk to pass the option explicitly to _handleParseResult so that we no longer rely on libxml2. Please test if this works for you now. I'd be glad if you could come up with a short piece of broken HTML code that triggers the "not well formed" case in recovery mode.
That would allow us to set up a test case to check that it actually works (and keeps working).

To check which versions you are using, I added module attributes "LXML_VERSION", "LIBXML_VERSION" and "LIBXSLT_VERSION" that carry tuples containing the respective versions used by lxml, e.g. (2, 6, 23) for "2.6.23". "LIBXML_COMPILED_VERSION" and "LIBXSLT_COMPILED_VERSION" show against which versions lxml was compiled.

BTW, these attributes are mainly meant for debugging purposes. Since they will first (officially) appear in lxml 1.0, code that wants to use them anyway will have to check if they are available (hasattr or try-except) before accessing them. All of them appeared at the same time, so if one is there, the others will be there, too.

> Btw. is there a reason that the XML_... enums are assigned an explicit
> value in xmlparser.pxd and the same is not done for HTML_... enums in
> htmlparser.pxd? I'm not familiar with Pyrex and whether it parses c/c++ .h
> files to match values...

Pyrex doesn't care about the values of enums and C uses the .h files directly, so, since they are just copy&pasted into the .pxd files, they sometimes have numbers and sometimes not.

Stefan

From dpavlekovic at gmail.com Wed May 3 23:17:41 2006
From: dpavlekovic at gmail.com (Dean Pavlekovic)
Date: Wed May 3 23:18:21 2006
Subject: [lxml-dev] HTMLParser behavior - ill formed HTML
In-Reply-To: <44585A5A.6030401@gkec.informatik.tu-darmstadt.de>
References: <4456F1E1.9050508@gkec.informatik.tu-darmstadt.de> <44585A5A.6030401@gkec.informatik.tu-darmstadt.de>
Message-ID:

Hello Stefan,

Now, after running with your new patches, it works well!

And about the previous:

> Interesting. Could you tell me what version of libxml2 you are using (see
> below)? My guess is that it's older than 2.6.21. Libxml2 copies the options by
> hand, so if the RECOVER option is unknown, it will not turn up in the context
> options. That makes me wonder why recovery worked for Paul on 2.6.16 in the
> first place...
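A minimal sketch of the availability check described above (hasattr or try-except), written for a current Python rather than the 2006 one. The helper name lxml_versions is made up for illustration; only the five attribute names come from the message. It returns None when lxml is not installed and maps a name to None when that attribute is missing.

```python
def lxml_versions():
    """Return lxml's version tuples by name, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        # lxml itself is not installed, so there is nothing to report.
        return None
    names = (
        "LXML_VERSION",
        "LIBXML_VERSION",
        "LIBXML_COMPILED_VERSION",
        "LIBXSLT_VERSION",
        "LIBXSLT_COMPILED_VERSION",
    )
    # getattr with a default plays the role of the hasattr check:
    # attributes that do not exist in this lxml show up as None.
    return {name: getattr(etree, name, None) for name in names}


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    for name, value in versions.items():
        print(name, value)
```

Because the values are tuples, a comparison such as versions["LIBXML_VERSION"] >= (2, 6, 21) works directly once the value is known not to be None.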
I am using libxml2 version 2.6.21 (on Ubuntu: libxml2 2.6.21-0ubuntu1 package):

lrwxrwxrwx 1 root root 17 2006-05-01 21:31 /usr/lib/libxml2.so -> libxml2.so.2.6.21

I've confirmed this after applying your patches: LIBXML_VERSION and LIBXML_COMPILED_VERSION both equate to (2, 6, 21).

Just to confirm the behavior from my last email, I changed the HTMLParser by adding some print statements before and after the htmlCtxtReadFile call (before your latest patches)

---
cdef xmlDoc* _parseDocFromFile(self, char* c_filename) except NULL:
    ...
    self._initContext(pctxt)
    print 'In _parseDocFromFile - before htmlCtxtReadFile: self._parse_options=%s' % self._parse_options
    result = htmlparser.htmlCtxtReadFile(
        pctxt, c_filename, NULL, self._parse_options)
    print 'In _parseDocFromFile - after htmlCtxtReadFile: ctxt.options=%s' % pctxt.options
    self._error_log.disconnect()
    return _handleParseResult(pctxt, result, c_filename)
---

and if I run this script - lxmltest1.py (files attached):

from lxml import etree
doc = etree.parse('httest.html', parser=etree.HTMLParser(recover=True))
etree.dump(doc)

the result is:

dean@boycie:~/work/main/oldstuff/dean/re$ python lxmltest1.py
In _parseDocFromFile - before htmlCtxtReadFile: self._parse_options=97 <<<<< here
In _parseDocFromFile - after htmlCtxtReadFile: ctxt.options=96 <<<< here
Traceback (most recent call last):
  File "lxmltest1.py", line 2, in ?
    doc = etree.parse('httest.html', parser=etree.HTMLParser(recover=True))
  File "etree.pyx", line 1401, in etree.parse
  File "parser.pxi", line 489, in etree._parseDocument
  File "parser.pxi", line 464, in etree._parseDocFromFile
  File "parser.pxi", line 437, in etree.HTMLParser._parseDocFromFile
  File "parser.pxi", line 177, in etree._handleParseResult
etree.XMLSyntaxError: htmlParseEntityRef: expecting ';'

(the error is because of unescaped &-s in href attribute value)

Next I've made a C test using 'plain' libxml2 (pls. see lxmltest2.c), and this behavior was confirmed. So I guess this is libxml2 issue.
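The two option values printed above can be decoded by hand. A small sketch, assuming the bit values of libxml2's htmlParserOption enum (HTML_PARSE_RECOVER = 1, HTML_PARSE_NOERROR = 32, HTML_PARSE_NOWARNING = 64); the decode helper is hypothetical and only knows these three flags.

```python
# Bit values as defined in libxml2's htmlParserOption enum.
FLAGS = {
    "HTML_PARSE_RECOVER": 1 << 0,    # 1
    "HTML_PARSE_NOERROR": 1 << 5,    # 32
    "HTML_PARSE_NOWARNING": 1 << 6,  # 64
}


def decode(options):
    """Return the names of the known flags that are set in a bitmask."""
    return [name for name, bit in FLAGS.items() if options & bit]


before = 97  # self._parse_options passed to htmlCtxtReadFile
after = 96   # ctxt.options read back afterwards

print(decode(before))
print(decode(after))
print(decode(before ^ after))  # the bits that differ between the two
```

The only bit that differs between 97 and 96 is the lowest one, which is the recover flag.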
Or it may be something specific to my local setup if you don't manage to reproduce it...

( gcc -o lxmltest2 -I/usr/include/libxml2 -lxml2 lxmltest2.c)

Although it's an odd feature/bug, hope it's useful to know about it :-/

Best regards,
Dean

PS. The httest.html should be in cp1250 encoding (Windows east european).

> Anyway. I changed the trunk to pass the option explicitly to
> _handleParseResult so that we no longer rely on libxml2. Please test if this
> works for you now. I'd be glad if you could come up with a short piece of
> broken HTML code that triggers the "not well formed" case in recovery mode.
> That would allow us to set up a test case to check that it actually works (and
> keeps working).
>
> To check which versions you are using, I added module attributes
> "LXML_VERSION", "LIBXML_VERSION" and "LIBXSLT_VERSION" that carry tuples
> containing the respective versions used by lxml, e.g. (2, 6, 23) for "2.6.23".
> "LIBXML_COMPILED_VERSION" and "LIBXSLT_COMPILED_VERSION" show against which
> versions lxml was compiled.
>
> BTW, these attributes are mainly meant for debugging purposes. Since they will
> first (officially) appear in lxml 1.0, code that wants to use them anyway will
> have to check if they are available (hasattr or try-except) before accessing
> them. All of them appeared at the same time, so if one is there, the others
> will be there, too.
>
>
> > Btw. is there a reason that the XML_... enums are assigned an explicit
> > value in xmlparser.pxd and the same is not done for HTML_... enums in
> > htmlparser.pxd? I'm not familiar with Pyrex and whether it parses c/c++ .h
> > files to match values...
>
> Pyrex doesn't care about the values of enums and C uses the .h files directly,
> so, since they are just copy&pasted into the .pxd files, they sometimes have
> numbers and sometimes not.
>
> Stefan
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxmltest1.py
Type: text/x-python
Size: 124 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060503/a71a1752/lxmltest1-0001.py
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20060503/a71a1752/httest-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxmltest2.c
Type: text/x-csrc
Size: 613 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060503/a71a1752/lxmltest2-0001.c

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 4 07:27:32 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu May 4 07:28:10 2006
Subject: [lxml-dev] HTMLParser behavior - ill formed HTML
In-Reply-To:
References: <4456F1E1.9050508@gkec.informatik.tu-darmstadt.de> <44585A5A.6030401@gkec.informatik.tu-darmstadt.de>
Message-ID: <445990C4.9020202@gkec.informatik.tu-darmstadt.de>

Hi Dean,

Dean Pavlekovic wrote:
> Now, after running with your new patches, it works well!

Fine, then we have found a work-around that works on older libxml2 versions.

>> Libxml2 copies the options by hand, so if the RECOVER option is unknown,
>> it will not turn up in the context options.
>
> I am using libxml2 version 2.6.21 (on Ubuntu: libxml2 2.6.21-0ubuntu1
> package): /usr/lib/libxml2.so -> libxml2.so.2.6.21 I've confirmed this
> after applying your patches LIBXML_VERSION and LIBXML_COMPILED_VERSION both
> equate to (2, 6, 21)

Thanks. I took a second look at it.
The version does not matter, the respective code in libxml2 2.6.21 to 2.6.23 (current) looks like this:

    if (options & HTML_PARSE_RECOVER) {
        ctxt->recovery = 1;
    } else
        ctxt->recovery = 0;
    if (options & HTML_PARSE_COMPACT) {
        ctxt->options |= HTML_PARSE_COMPACT;
        options -= HTML_PARSE_COMPACT;
    }

So, as opposed to most other options, the RECOVER option is not copied at all and not even removed from the original options to show that it was accepted (as is written in the docs). I'll file a bug report on it. The work-around will just stay in lxml as is.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 7 09:43:29 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun May 7 09:44:08 2006
Subject: [lxml-dev] Remarks on implementing iterparse()
Message-ID: <445DA521.7060606@gkec.informatik.tu-darmstadt.de>

Hi all,

since I won't have the time to implement iterparse() any time soon, here's a proposal on how it should be implemented, in case someone wants to take a shot at it.

"iterparse" will be (or will return) an iterable object, let's call it IterParse for clarity. A class is basically the only way of implementing iterators in Pyrex.

For the internal SAX part, IterParse will likely work a lot like lxml.sax.ElementTreeContentHandler. We'd need a custom wrapper to the default libxml2 SAX handler to intercept the parse events (this means implementing C helper functions for the SAX events) /after/ they were processed by libxml2. See xmlSAXVersion (SAX2) on how to retrieve the SAX2 default parser structure.

IterParse should pass chunks into the parser and buffer the events it receives. When its __next__() method is called, it returns one event or passes new chunks until there is an event to return. This is needed as IterParse has to convert between libxml2 push (SAX) and Python pull (iter).
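The push-to-pull conversion described in the last paragraph can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard library's expat binding as a stand-in for the libxml2 SAX layer, the class name PullEvents is made up, and it yields tag names where the real thing would yield elements of the tree being built.

```python
import io
from collections import deque
from xml.parsers import expat


class PullEvents:
    """Iterate over (event, tag) pairs while feeding a push parser in chunks."""

    def __init__(self, source, chunk_size=16):
        self._source = source
        self._chunk_size = chunk_size
        self._events = deque()  # events received but not yet handed out
        self._done = False
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self._on_start
        self._parser.EndElementHandler = self._on_end

    def _on_start(self, tag, attrs):
        self._events.append(("start", tag))

    def _on_end(self, tag):
        self._events.append(("end", tag))

    def __iter__(self):
        return self

    def __next__(self):
        # Push chunks until an event is buffered or the input is exhausted.
        while not self._events and not self._done:
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)  # tell the parser we are done
                self._done = True
        if not self._events:
            raise StopIteration
        return self._events.popleft()


events = list(PullEvents(io.BytesIO(b"<root><a/><b>text</b></root>")))
print(events)
```

One chunk may produce several events or none at all, which is why the buffer sits between the parser callbacks and __next__().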
As for the input to the libxml2 parser, there are two possible ways: one is to pass data chunks in through xmlParseChunk and the other is to use xmlCreateIOParserCtxt and implement xmlInputReadCallback (xmlio.h) to have libxml2 request data by itself.

Python events (start, end, start-ns, end-ns) are created as follows:

* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call (passed in arguments "prefix"/"URI" and the char* array "namespaces"). They must be stored on a stack to build the respective "end-ns" events.

* "start" is somewhat tricky, as it would be a bad idea to allow modifications of the XML structure during that iterator cycle. Maybe it's enough to document that, but there may be ways to crash lxml with certain tree operations. Note also that care has to be taken to prevent Python from garbage collecting the element before the "end" event. The best way to do that is to store a Python reference to that element on a stack.

* "end" is simple then: pop the element from the stack and return it.

That's all I can come up with so far. So, if anyone is interested in taking a look at it, I'd be glad to hear about it. :)

Stefan

From faassen at infrae.com Mon May 8 12:39:52 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Mon May 8 12:39:40 2006
Subject: [lxml-dev] Remarks on implementing iterparse()
In-Reply-To: <445DA521.7060606@gkec.informatik.tu-darmstadt.de>
References: <445DA521.7060606@gkec.informatik.tu-darmstadt.de>
Message-ID: <445F1FF8.7000709@infrae.com>

Hi Stefan,

Haven't read your whole proposal yet, but I believe that libxml2 also offers a newer 'reader' interface besides the SAX interface that we may want to consider for implementing iterparse. It's based on the C# xmlReader interface and uses an iterator based approach already. It might therefore be a better match for iterparse() implementation than SAX.
Unfortunately xmlsoft.org looks unreachable at the moment, but I found a slide on it:

http://veillard.com/Talks/2003Guadec/slide5-1.html

Regards,
Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 8 13:04:44 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon May 8 13:04:50 2006
Subject: [lxml-dev] Remarks on implementing iterparse()
In-Reply-To: <445F1FF8.7000709@infrae.com>
References: <445DA521.7060606@gkec.informatik.tu-darmstadt.de> <445F1FF8.7000709@infrae.com>
Message-ID: <445F25CC.7040400@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> Haven't read your whole proposal yet, but I believe that libxml2 also
> offers a newer 'reader' interface besides the SAX interface that we may
> want to consider for implementing iterparse. It's based on the C#
> xmlReader interface and uses an iterator based approach already. It
> might therefore be a better match for iterparse() implementation than SAX.

Yup, I considered that after I had checked that libxml2's SAX parser builds a tree step-by-step exactly the way iterparse wants it. What I did not like about XmlTextReader in this context:

* the interface forces us to do everything on our own: build node instances, add attributes, etc.

* "Note, however that the node instance returned by the Expand() call is only valid until the next Read() operation." (xmlreader.html) - segfault included!

* readers have an "expand" command that expands the entire subtree of the current node to retrieve a node reference. iterparse does neither want this nor need this.

So, I'm pretty convinced it's easier to use SAX the way I proposed. iterparse is so SAX-like that implementing it on top of a tree-building SAX parser should be easiest.

> Unfortunately xmlsoft.org looks unreachable at the moment,

I usually go for file:///usr/share/doc/packages/libxml2-devel/html/ in these cases. :)

There's a file "xmlreader.html" in there, which describes the interface to a certain extent.
Regards,
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 8 16:58:52 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon May 8 16:57:55 2006
Subject: [lxml-dev] I/O benchmarks
Message-ID: <445F5CAC.1010105@gkec.informatik.tu-darmstadt.de>

Hi everyone,

have you ever wondered why you should use lxml instead of cElementTree?

Here is why. :)

Stefan

-------------- next part --------------
Preparing test suites and trees ...
Running benchmark on lxe, ET, cET

Setup times for trees in seconds:
lxe:     --     S-     U-     -A     SA     UA
     T1: 0.1180 0.1158 0.1153 0.1178 0.1157 0.1173
     T2: 0.1186 0.1202 0.1207 0.1233 0.1236 0.1250
     T3: 0.0323 0.0252 0.0250 0.0489 0.0495 0.0492
     T4: 0.0005 0.0005 0.0005 0.0010 0.0010 0.0010
ET :     --     S-     U-     -A     SA     UA
     T1: 0.2305 0.2887 0.2193 0.2662 0.2872 0.2259
     T2: 0.3038 0.3442 0.2823 0.3140 0.2364 0.4051
     T3: 0.0534 0.0572 0.0523 0.0583 0.0553 0.0829
     T4: 0.0010 0.0008 0.0007 0.0009 0.0008 0.0008
cET:     --     S-     U-     -A     SA     UA
     T1: 0.0369 0.0353 0.0371 0.0341 0.0354 0.0345
     T2: 0.0370 0.0364 0.0361 0.0371 0.0369 0.0358
     T3: 0.0090 0.0091 0.0090 0.0125 0.0174 0.0235
     T4: 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001

lxe: tostring_utf16 (S- T1 ) 25.0344 24.7066 24.6755 msec/pass, best: 24.6755
ET : tostring_utf16 (S- T1 ) 715.0391 668.6494 668.2270 msec/pass, best: 668.2270
cET: tostring_utf16 (S- T1 ) 631.4080 634.0269 629.3236 msec/pass, best: 629.3236
lxe: tostring_utf16 (U- T1 ) 25.3342 25.4860 25.1003 msec/pass, best: 25.1003
ET : tostring_utf16 (U- T1 ) 666.5964 670.9383 664.8268 msec/pass, best: 664.8268
cET: tostring_utf16 (U- T1 ) 628.6623 628.9270 636.1863 msec/pass, best: 628.6623
lxe: tostring_utf16 (S- T2 ) 38.5328 27.5526 28.3839 msec/pass, best: 27.5526
ET : tostring_utf16 (S- T2 ) 696.8423 697.2251 698.0101 msec/pass, best: 696.8423
cET: tostring_utf16 (S- T2 ) 655.5692 652.2847 653.9454 msec/pass, best: 652.2847
lxe: tostring_utf16 (U- T2 ) 26.2363 26.8365 26.6303 msec/pass, best: 26.2363
ET : tostring_utf16 (U- T2 ) 698.3589 697.7249 698.7069 msec/pass, best: 697.7249
cET: tostring_utf16 (U- T2 ) 652.6794 652.6194 652.4247 msec/pass, best: 652.4247
lxe: tostring_utf16 (S- T3 ) 2.9159 2.9798 3.0483 msec/pass, best: 2.9159
ET : tostring_utf16 (S- T3 ) 90.1106 90.2723 90.6772 msec/pass, best: 90.1106
cET: tostring_utf16 (S- T3 ) 75.9803 75.7933 75.6844 msec/pass, best: 75.6844
lxe: tostring_utf16 (U- T3 ) 2.9326 2.9501 3.8947 msec/pass, best: 2.9326
ET : tostring_utf16 (U- T3 ) 90.8508 90.2919 90.4195 msec/pass, best: 90.2919
cET: tostring_utf16 (U- T3 ) 75.8780 75.6118 75.6281 msec/pass, best: 75.6118
lxe: tostring_utf16 (S- T4 ) 0.1184 0.1246 0.1174 msec/pass, best: 0.1174
ET : tostring_utf16 (S- T4 ) 4.0772 4.1340 4.0832 msec/pass, best: 4.0772
cET: tostring_utf16 (S- T4 ) 8.6555 3.7591 3.8131 msec/pass, best: 3.7591
lxe: tostring_utf16 (U- T4 ) 0.1150 0.1142 0.1470 msec/pass, best: 0.1142
ET : tostring_utf16 (U- T4 ) 4.1377 4.1255 4.0694 msec/pass, best: 4.0694
cET: tostring_utf16 (U- T4 ) 8.5306 3.7888 3.8399 msec/pass, best: 3.7888
lxe: tostring_utf8 (S- T1 ) 20.1342 20.8913 22.0570 msec/pass, best: 20.1342
ET : tostring_utf8 (S- T1 ) 658.5292 659.8311 659.2222 msec/pass, best: 658.5292
cET: tostring_utf8 (S- T1 ) 613.7725 616.7276 615.3487 msec/pass, best: 613.7725
lxe: tostring_utf8 (U- T1 ) 22.3571 21.6062 22.3468 msec/pass, best: 21.6062
ET : tostring_utf8 (U- T1 ) 658.4980 658.8312 659.7990 msec/pass, best: 658.4980
cET: tostring_utf8 (U- T1 ) 619.8824 620.7493 618.3270 msec/pass, best: 618.3270
lxe: tostring_utf8 (S- T2 ) 22.8893 23.1579 22.3014 msec/pass, best: 22.3014
ET : tostring_utf8 (S- T2 ) 695.0778 686.7107 686.9514 msec/pass, best: 686.7107
cET: tostring_utf8 (S- T2 ) 644.8464 645.2512 645.4735 msec/pass, best: 644.8464
lxe: tostring_utf8 (U- T2 ) 21.8574 21.5462 21.9939 msec/pass, best: 21.5462
ET : tostring_utf8 (U- T2 ) 689.5708 689.1687 685.5371 msec/pass, best: 685.5371
cET: tostring_utf8 (U- T2 ) 643.3239 644.8895 641.5675 msec/pass, best: 641.5675
lxe: tostring_utf8 (S- T3 ) 2.3862 2.2847 2.3411 msec/pass, best: 2.2847
ET : tostring_utf8 (S- T3 ) 91.7382 89.4661 90.1840 msec/pass, best: 89.4661
cET: tostring_utf8 (S- T3 ) 76.4675 74.7183 74.7682 msec/pass, best: 74.7183
lxe: tostring_utf8 (U- T3 ) 2.3286 2.4161 2.3287 msec/pass, best: 2.3286
ET : tostring_utf8 (U- T3 ) 89.9778 92.0483 91.0223 msec/pass, best: 89.9778
cET: tostring_utf8 (U- T3 ) 74.8817 74.8538 74.9363 msec/pass, best: 74.8538
lxe: tostring_utf8 (S- T4 ) 0.1028 0.1006 0.1058 msec/pass, best: 0.1006
ET : tostring_utf8 (S- T4 ) 8.9541 4.1052 4.0422 msec/pass, best: 4.0422
cET: tostring_utf8 (S- T4 ) 8.5588 3.7166 3.7450 msec/pass, best: 3.7166
lxe: tostring_utf8 (U- T4 ) 0.1178 0.1000 0.1040 msec/pass, best: 0.1000
ET : tostring_utf8 (U- T4 ) 4.0836 4.0845 4.0257 msec/pass, best: 4.0257
cET: tostring_utf8 (U- T4 ) 8.5069 3.7946 3.7505 msec/pass, best: 3.7505
lxe: tostring_utf8_unicode_XML (S- T1 ) 217.3858 194.6370 196.9607 msec/pass, best: 194.6370
ET : tostring_utf8_unicode_XML (S- T1 ) 1044.2196 983.0496 1006.1930 msec/pass, best: 983.0496
cET: tostring_utf8_unicode_XML (S- T1 ) 684.9565 679.0308 674.1449 msec/pass, best: 674.1449
lxe: tostring_utf8_unicode_XML (U- T1 ) 203.8817 203.9280 200.4173 msec/pass, best: 200.4173
ET : tostring_utf8_unicode_XML (U- T1 ) 981.1272 977.5449 978.9895 msec/pass, best: 977.5449
cET: tostring_utf8_unicode_XML (U- T1 ) 653.3920 655.3704 651.0385 msec/pass, best: 651.0385
lxe: tostring_utf8_unicode_XML (S- T2 ) 208.8770 211.5672 210.4380 msec/pass, best: 208.8770
ET : tostring_utf8_unicode_XML (S- T2 ) 1021.3773 1020.3279 1041.1114 msec/pass, best: 1020.3279
cET: tostring_utf8_unicode_XML (S- T2 ) 688.2031 681.9181 680.2590 msec/pass, best: 680.2590
lxe: tostring_utf8_unicode_XML (U- T2 ) 209.5904 210.3374 209.5674 msec/pass, best: 209.5674
ET : tostring_utf8_unicode_XML (U- T2 ) 1027.9991 1022.0539 1022.0318 msec/pass, best: 1022.0318
cET: tostring_utf8_unicode_XML (U- T2 ) 681.4484 678.8596 680.9469 msec/pass, best: 678.8596
lxe: tostring_utf8_unicode_XML (S- T3 ) 12.5315 12.0217 11.9403 msec/pass, best: 11.9403
ET : tostring_utf8_unicode_XML (S- T3 ) 202.9581 203.0447 204.2288 msec/pass, best: 202.9581
cET: tostring_utf8_unicode_XML (S- T3 ) 82.3561 82.4286 82.1372 msec/pass, best: 82.1372
lxe: tostring_utf8_unicode_XML (U- T3 ) 11.6045 11.8548 11.7764 msec/pass, best: 11.6045
ET : tostring_utf8_unicode_XML (U- T3 ) 201.2770 201.0283 202.3813 msec/pass, best: 201.0283
cET: tostring_utf8_unicode_XML (U- T3 ) 82.3942 82.6146 82.3579 msec/pass, best: 82.3579
lxe: tostring_utf8_unicode_XML (S- T4 ) 5.3620 0.4832 0.5012 msec/pass, best: 0.4832
ET : tostring_utf8_unicode_XML (S- T4 ) 6.4547 6.3800 6.3898 msec/pass, best: 6.3800
cET: tostring_utf8_unicode_XML (S- T4 ) 8.8447 3.9611 4.0390 msec/pass, best: 3.9611
lxe: tostring_utf8_unicode_XML (U- T4 ) 5.2781 0.4689 0.4798 msec/pass, best: 0.4689
ET : tostring_utf8_unicode_XML (U- T4 ) 6.5955 6.4135 6.4196 msec/pass, best: 6.4135
cET: tostring_utf8_unicode_XML (U- T4 ) 8.7307 4.0647 4.0202 msec/pass, best: 4.0202
lxe: write_utf8_parse_stringIO (S- T1 ) 198.0141 188.3298 190.1872 msec/pass, best: 188.3298
ET : write_utf8_parse_stringIO (S- T1 ) 1143.8117 1176.7961 1152.1253 msec/pass, best: 1143.8117
cET: write_utf8_parse_stringIO (S- T1 ) 814.6583 847.3877 810.7611 msec/pass, best: 810.7611
lxe: write_utf8_parse_stringIO (U- T1 ) 194.5397 195.0845 194.7964 msec/pass, best: 194.5397
ET : write_utf8_parse_stringIO (U- T1 ) 1153.3167 1142.9091 1146.2344 msec/pass, best: 1142.9091
cET: write_utf8_parse_stringIO (U- T1 ) 810.8662 808.3027 811.9326 msec/pass, best: 808.3027
lxe: write_utf8_parse_stringIO (S- T2 ) 205.9781 202.8390 202.2034 msec/pass, best: 202.2034
ET : write_utf8_parse_stringIO (S- T2 ) 1195.9166 1195.7337 1193.5966 msec/pass, best: 1193.5966
cET: write_utf8_parse_stringIO (S- T2 ) 846.5763 851.8866 848.6367 msec/pass, best: 846.5763
lxe: write_utf8_parse_stringIO (U- T2 ) 203.5371 202.8143 204.4644 msec/pass, best: 202.8143
ET : write_utf8_parse_stringIO (U- T2 ) 1218.4967 1245.6663 1255.3381 msec/pass, best: 1218.4967
cET: write_utf8_parse_stringIO (U- T2 ) 858.2268 849.1432 858.5724 msec/pass, best: 849.1432
lxe: write_utf8_parse_stringIO (S- T3 ) 17.0564 11.5748 12.5815 msec/pass, best: 11.5748
ET : write_utf8_parse_stringIO (S- T3 ) 235.0894 234.2171 233.4405 msec/pass, best: 233.4405
cET: write_utf8_parse_stringIO (S- T3 ) 113.0575 112.7246 112.9978 msec/pass, best: 112.7246
lxe: write_utf8_parse_stringIO (U- T3 ) 11.1004 11.0245 11.3229 msec/pass, best: 11.0245
ET : write_utf8_parse_stringIO (U- T3 ) 238.6545 243.0529 251.3970 msec/pass, best: 238.6545
cET: write_utf8_parse_stringIO (U- T3 ) 122.8126 115.6941 120.9536 msec/pass, best: 115.6941
lxe: write_utf8_parse_stringIO (S- T4 ) 0.4456 0.4642 0.4494 msec/pass, best: 0.4456
ET : write_utf8_parse_stringIO (S- T4 ) 8.0726 8.1274 7.9427 msec/pass, best: 7.9427
cET: write_utf8_parse_stringIO (S- T4 ) 6.0532 5.0335 5.0780 msec/pass, best: 5.0335
lxe: write_utf8_parse_stringIO (U- T4 ) 0.4778 0.4506 0.4657 msec/pass, best: 0.4506
ET : write_utf8_parse_stringIO (U- T4 ) 7.4578 7.3894 8.0762 msec/pass, best: 7.3894
cET: write_utf8_parse_stringIO (U- T4 ) 6.2156 5.5292 5.0383 msec/pass, best: 5.0383

From fredrik at pythonware.com Mon May 8 20:10:15 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Mon May 8 20:12:23 2006
Subject: [lxml-dev] Re: Remarks on implementing iterparse()
References: <445DA521.7060606@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote:

> * "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call
> (passed in arguments "prefix"/"URI" and the char* array "namespaces"). They
> must be stored on a stack to build the respective "end-ns" events.

footnote: ET guarantees that start-ns and end-ns events nest properly.
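The nesting guarantee can be observed with the iterparse() of today's standard library ElementTree, which is used here only as a stand-in for the ET of the thread. In this small sketch the sample document and the variable names are made up; a stack of open prefixes is popped on every "end-ns" event, the same bookkeeping the quoted proposal describes.

```python
import io
import xml.etree.ElementTree as ET

XML = (b'<root xmlns:a="urn:a">'
       b'<a:child xmlns:b="urn:b"><b:leaf/></a:child>'
       b'</root>')

stack = []  # namespace declarations that are currently open
trace = []  # (event, prefix) pairs in the order they were seen

for event, payload in ET.iterparse(io.BytesIO(XML),
                                   events=("start-ns", "end-ns")):
    if event == "start-ns":
        prefix, uri = payload  # start-ns carries a (prefix, uri) tuple
        stack.append((prefix, uri))
        trace.append((event, prefix))
    else:
        # end-ns carries no payload, so the prefix comes from the stack.
        prefix, uri = stack.pop()
        trace.append((event, prefix))

print(trace)
print("open declarations left:", len(stack))
```

If the events did not nest, the prefix popped on "end-ns" would not be the one that is actually going out of scope.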
I don't know how libxml2 handles this, but the SAX specification explicitly says that end events may appear out of order:

    For elements with multiple namespace declarations, the
    startPrefixMapping() calls won't necessarily nest with the
    endPrefixMapping() because those endPrefixMapping() calls may be
    made in any order.

assuming that libxml2 isn't doing something really strange here, using a stack should take care of this.

From matt-lists at reprocessed.org Tue May 9 15:25:26 2006
From: matt-lists at reprocessed.org (Matt Patterson)
Date: Tue May 9 15:26:06 2006
Subject: [lxml-dev] The difference between str(xslt_result) and xslt_result.write()
Message-ID: <1950FDB6-58BD-47B1-BEE4-9AE5EE9A6684@reprocessed.org>

Hi,

Is there a difference between the output from lxml.etree._XSLTResultTree.__str__() and lxml.etree._XSLTResultTree.write()?

I'm trying to chase some output whitespace issues around, and I'm wondering if there's a difference in how serialisation is handled.

Also, str(_XSLTResultTree) is failing for me with errors like "'ascii' codec can't encode character u'\xa9' in position 1608: ordinal not in range(128)" because I have unicode characters in my documents...

lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I can't use a unicode coercion to UTF-8 or something like that...

All in all though, I'm really enjoying lxml: I spent a long time working with libxml & libxslt's standard python interfaces and they are much more of a pain than lxml!
Thanks,

Matt

--
Matt Patterson | Design & Code | http://www.reprocessed.org/

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 9 15:39:19 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue May 9 15:39:57 2006
Subject: [lxml-dev] The difference between str(xslt_result) and xslt_result.write()
In-Reply-To: <1950FDB6-58BD-47B1-BEE4-9AE5EE9A6684@reprocessed.org>
References: <1950FDB6-58BD-47B1-BEE4-9AE5EE9A6684@reprocessed.org>
Message-ID: <44609B87.1060701@gkec.informatik.tu-darmstadt.de>

Hi Matt,

Matt Patterson wrote:
> Is there a difference between the output from
> lxml.etree._XSLTResultTree.__str__() and
> lxml.etree._XSLTResultTree.write()?

Yes. str() knows about the output method chosen in the stylesheet
(xsl:output), write() doesn't. If you call write(), you will end up with the
XML tree serialization you requested in the call arguments. If you call str(),
you will get the serialized result you requested in the XSL transform.

> Also, str(_XSLTResultTree) is failing for me with errors like "'ascii'
> codec can't encode character u'\xa9' in position 1608: ordinal not in
> range(128)" because I have unicode characters in my documents...

Then you have likely forgotten to set an output encoding in your stylesheet.

> lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I
> can't use a unicode coercion to UTF-8 or something like that...

I don't think __unicode__ would make sense here, given the fact that
stylesheets determine the output encoding. Python unicode strings usually have
a different encoding than the one you specify in your stylesheet. If you're in
doubt, 'UTF-8' is commonly a good choice in lxml, as it's the encoding we use
internally.

> All in all though, I'm really enjoying lxml: I spent a long time working
> with libxml & libxslt's standard python interfaces and they are much more of
> a pain than lxml!
I guess "much more of a pain" is meant in a positive sense here, although it
sounds somewhat tainted due to the actual extent to which libxml2's bindings
really are a pain... :)

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 09:21:13 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 09:21:56 2006
Subject: [lxml-dev] The difference between str(xslt_result) and xslt_result.write()
In-Reply-To: <44609B87.1060701@gkec.informatik.tu-darmstadt.de>
References: <1950FDB6-58BD-47B1-BEE4-9AE5EE9A6684@reprocessed.org> <44609B87.1060701@gkec.informatik.tu-darmstadt.de>
Message-ID: <44619469.7000507@gkec.informatik.tu-darmstadt.de>

Hi again,

sorry, I was partially mistaken in my last post. You have actually found a bug.

Stefan Behnel wrote:
> Matt Patterson wrote:
>> Also, str(_XSLTResultTree) is failing for me with errors like "'ascii'
>> codec can't encode character u'\xa9' in position 1608: ordinal not in
>> range(128)" because I have unicode characters in my documents...
>
> Then you have likely forgotten to set an output encoding in your stylesheet.

Actually, you most likely have /not/ forgotten to do so. lxml was mishandling
the case where the output encoding is not compatible with UTF-8. A safe
work-around is to always use UTF-8 here, although the bug will be fixed in the
next release.

>> lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I
>> can't use a unicode coercion to UTF-8 or something like that...
>
> I don't think __unicode__ would make sense here, given the fact that
> stylesheets determine the output encoding.

Since this problem is based on a bug, this gets me closer to the point of
accepting that __unicode__ makes sense here. Otherwise, there would be no other
way to retrieve a unicode string from a stylesheet result - except for recoding
by hand after calling str(), which is rather ugly.

The question is how to make this play nicely.
We know the requested output encoding from the stylesheet, so when the user
calls unicode() on the result, she/he is actually requesting a recoding, which
is not always efficient. But then, that's the user's fault.

Another thing is that the serialized result may have an XML encoding
declaration. To be correct, we have to remove it in this case, as the encoding
information is only correctly provided by the unicode string semantics. This
may additionally mean that we have to copy the majority of the string (as
unicode objects!).

So, I believe the best solution is to document that UTF-8 is the best choice
as an output encoding in that case and otherwise leave it to the Python
codecs. If users want to use other encodings that are not supported by Python,
they will get a sensible exception automatically.

I changed it on the trunk for now, but if there are any proposals or
objections to this, I'd like to hear about them.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 11:39:29 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 11:40:10 2006
Subject: [lxml-dev] Python unicode string support in lxml
Message-ID: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de>

Hi all,

I had a discussion with Fredrik lately and it convinced me (though not
Fredrik) that it would be a good idea to improve the support for Python
unicode strings in lxml.etree. I think that unicode strings are the most
comfortable way of doing XML I/O from/to strings in Python, so I added support
for simply calling unicode() on XML nodes and ElementTrees.
It behaves just like tostring(), but always returns Python unicode strings:

  >>> from lxml import etree
  >>> uxml = u' \uf8d1 + \uf8d2 '
  >>> root = etree.XML(uxml)
  >>> unicode(root)
  u' \uf8d1 + \uf8d2 '
  >>> el = etree.Element("test")
  >>> unicode(el)
  u''
  >>> subel = etree.SubElement(el, "subtest")
  >>> unicode(el)
  u''
  >>> unicode( etree.ElementTree(el) )
  u''

Note that ElementTree does not support this at all. It will raise a parser
exception in the XML() call in the second line and return the same generic
strings for unicode() as it does for str().

There is a longer doctest in

http://codespeak.net/svn/lxml/trunk/doc/api.txt

that explains this in more detail.

As usual: any comments appreciated.

Stefan

From howe at carcass.dhs.org Wed May 10 11:52:57 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed May 10 11:53:34 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de>
Message-ID: <1365266204.20060510065257@carcass.dhs.org>

Hello Stefan,

Wednesday, May 10, 2006, 6:39:29 AM, you wrote:

> I had a discussion with Fredrik lately and it convinced me (though not
> Fredrik) that it would be a good idea to improve the support for Python
> unicode strings in lxml.etree. I think that unicode strings are the most
> comfortable way of doing XML I/O from/to strings in Python, so I added support
> for simply calling unicode() on XML nodes and ElementTrees. It behaves just
> like tostring(), but always returns Python unicode strings:
> >>> from lxml import etree
> >>> uxml = u' \uf8d1 + \uf8d2 '
> >>> root = etree.XML(uxml)
> >>> unicode(root)
> u' \uf8d1 + \uf8d2 '
> >>> el = etree.Element("test")
> >>> unicode(el)
> u''
> >>> subel = etree.SubElement(el, "subtest")
> >>> unicode(el)
> u''
> >>> unicode( etree.ElementTree(el) )
> u''
> Note that ElementTree does not support this at all.
It will raise a parser
> exception in the XML() call in the second line and return the same generic
> strings for unicode() as it does for str().
> There is a longer doctest in
> http://codespeak.net/svn/lxml/trunk/doc/api.txt
> that explains this in more detail.
> As usual: any comments appreciated.

Shouldn't this be implemented as etree.tounicode() or something like
that instead ? This will be more intuitive since there is the tostring()
method. And since str(root) will return something like "''", I would also
expect unicode(root) to behave like that.

Whatever the calling method gets named, it's a great feature, thanks.

--
Best regards,
Steve mailto:howe@carcass.dhs.org

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 12:08:10 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 12:08:52 2006
Subject: [lxml-dev] lxml 0.9.2 released
Message-ID: <4461BB8A.3090708@gkec.informatik.tu-darmstadt.de>

Hello everyone,

I just uploaded another bug fix release of the 0.9 branch to the Cheese shop.

http://cheeseshop.python.org/pypi/lxml
http://cheeseshop.python.org/packages/source/l/lxml/lxml-0.9.2.tar.gz

This will (hopefully) be the last 0.9.x release before 1.0. It fixes a number
of more or less annoying bugs and adds a number of smaller features. All the
major, new, fancy features will go into the 1.0 release.
Regards,
Stefan

Features added

* Speedup for Element.makeelement(): the new element now reuses the original
  libxml2 document instead of creating a new empty one
* Speedup for reversed() iteration over element children (Py2.4+ only)
* ElementTree compatible QName class
* RelaxNG and XMLSchema now accept any Element, not only ElementTrees

Bugs fixed

* str(xslt_result) was broken for XSLT output other than UTF-8
* Memory leak if write_c14n fails to write the file after conversion
* Crash in XMLSchema and RelaxNG when passing non-schema documents
* Memory leak in RelaxNG() when RelaxNGParseError is raised

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 12:23:33 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 12:24:13 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <1365266204.20060510065257@carcass.dhs.org>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org>
Message-ID: <4461BF25.3070601@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
>> simply calling unicode() on XML nodes and ElementTrees. It behaves just
>> like tostring(), but always returns Python unicode strings:
>>
>>   >>> el = etree.Element("test")
>>   >>> unicode(el)
>>   u''
>
>> As usual: any comments appreciated.
>
> Shouldn't this be implemented as etree.tounicode() or something like
> that instead ? This will be more intuitive since there is the tostring()
> method. And since str(root) will return something like "' 8413144>'", I
> would also expect unicode(root) to behave like that.

Actually, I had first implemented it as "etree.tounicode()" and then switched
to plain "unicode()" as I thought /that/ would be more intuitive...

Note that _XSLTResultTree already supports str() and now also supports
unicode() for the same thing (but unicode). I may let myself get convinced
that this is different, though.
I'm not sure which is better. Maybe "tounicode()" really prevents people from
thinking it should behave as str() - as you do. It's trivial to change, but
let me wait to see if other people have similar feelings on this.

So, don't consider this feature stable for now.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 12:49:59 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 12:50:39 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <4461BF25.3070601@gkec.informatik.tu-darmstadt.de>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de>
Message-ID: <4461C557.90609@gkec.informatik.tu-darmstadt.de>

Hi again,

Stefan Behnel wrote:
> Steve Howe wrote:
>> Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
>>> simply calling unicode() on XML nodes and ElementTrees. It behaves just
>>> like tostring(), but always returns Python unicode strings:
>>>
>>>   >>> el = etree.Element("test")
>>>   >>> unicode(el)
>>>   u''
>>> As usual: any comments appreciated.
>> Shouldn't this be implemented as etree.tounicode() or something like
>> that instead ? This will be more intuitive since there is the tostring()
> method.

I think there's one good argument that fits independent of the question which
is more intuitive: extensibility. The unicode() function is fixed and does not
allow us to extend the call parameters to support things like
"prettyprint=True" keyword arguments.

And that's already a difference to "_XSLTResultTree.__unicode__": that API
will never need to be extended as it is configured through xsl:output.

Ok, so I'm convinced that our home-grown tounicode() is better. I'll fix it on
the trunk.
Stefan

From elephantum at cyberzoo.ru Wed May 10 12:57:09 2006
From: elephantum at cyberzoo.ru (Andrey Tatarinov)
Date: Wed May 10 12:57:53 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <4461BF25.3070601@gkec.informatik.tu-darmstadt.de>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de>
Message-ID: <1147258629.7243.19.camel@zoo.yandex.ru>

On Wed, 2006-05-10 at 12:23 +0200, Stefan Behnel wrote:
> Actually, I had first implemented it as "etree.tounicode()" and then switched
> to plain "unicode()" as I thought /that/ would be more intuitive...

That would be intuitive if str(Element) would return string presentation
of XML, but not '', otherwise that is inconsistent behavior, that's worse
than just lack of shortcut.

From howe at carcass.dhs.org Wed May 10 13:13:08 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed May 10 13:13:44 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <4461C557.90609@gkec.informatik.tu-darmstadt.de>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de> <4461C557.90609@gkec.informatik.tu-darmstadt.de>
Message-ID: <1819198740.20060510081308@carcass.dhs.org>

Hello Stefan,

Wednesday, May 10, 2006, 7:49:59 AM, you wrote:

> Stefan Behnel wrote:
>> Steve Howe wrote:
>>> Wednesday, May 10, 2006, 6:39:29 AM, you wrote:
>>>> simply calling unicode() on XML nodes and ElementTrees. It behaves just
>>>> like tostring(), but always returns Python unicode strings:
>>>>
>>>>   >>> el = etree.Element("test")
>>>>   >>> unicode(el)
>>>>   u''
>>>> As usual: any comments appreciated.
>>> Shouldn't this be implemented as etree.tounicode() or something like
>>> that instead ? This will be more intuitive since there is the tostring()
>> method.
> I think there's one good argument that fits independent of the question which
> is more intuitive: extensibility. The unicode() function is fixed and does not
> allow us to extend the call parameters to support things like
> "prettyprint=True" keyword arguments.
> And that's already a difference to "_XSLTResultTree.__unicode__": that API
> will never need to be extended as it is configured through xsl:output.
> Ok, so I'm convinced that our home-grown tounicode() is better. I'll fix it on
> the trunk.

Well thought. I think it would not hurt, however, if unicode() calls
.tounicode() with default params, *if* str() behaves the same way. in
fact, there is no point in printing at all - is
there ? That would bring the best of both worlds, the
extensibility/compatibility of .tostr()/.unicode(), and the intuitive
str()/unicode() call.

Besides, repr(root) will be printing the same as str(root) does today,
in case someone really wants that...

--
Best regards,
Steve mailto:howe@carcass.dhs.org

From fredrik at pythonware.com Wed May 10 14:40:07 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 14:41:58 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org>
Message-ID:

Steve Howe wrote:

> Whatever the calling method gets named, its a great feature, thanks.

so what's your use case?

(I hope you're aware that the XML file format is defined in terms of
encoded data, not as sequences of Unicode code points, and that XML
encoding involves more than just character sets; there's no such thing
as an "XML document in a Unicode string")

stefan's argument is basically "we should add it because we can", which
is a rather lousy way to design software.

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 15:31:09 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 15:31:51 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org>
Message-ID: <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de>

Hi Fredrik,

Fredrik Lundh wrote:
> there's no such thing as an "XML document in a Unicode string"

Well, there's things like "XML documents in files", "XML documents in HTTP"
and "XML documents in SMTP", so why not "XML documents in Unicode strings"?

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 15:37:53 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 15:38:36 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <1819198740.20060510081308@carcass.dhs.org>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de> <4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org>
Message-ID: <4461ECB1.8010301@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> I think it would not hurt, however, if unicode() calls
> .tounicode() with default params, *if* str() behaves the same way. in
> fact, there is no point in printing at all - is
> there ? That would bring the best of both worlds, the
> extensibility/compatibility of .tostr()/.unicode(), and the intuitive
> str()/unicode() call.

You know, that was exactly the first thing that came to my mind when I thought
about it: who needs those stupid str() results anyway? :)

I then, however, ducked away from the holy cow.
So, your proposal is to change the current behaviour of str()/unicode() into
this:

  str(element) == etree.tostring(element)
  unicode(element) == etree.tounicode(element)

and to make repr(element) the obvious replacement.

The problem is that current code (for lxml or ElementTree) may rely on the
fact that str() is a simple thing to call on an element and that it does *not*
do anything recursively. The above modification changes the runtime complexity
of these calls. That can really make a difference in that case. Imagine
debugging (or logging) output where someone adds a str(element) to see what is
currently dealt with or to trace the way some element takes through a
processing chain.

So, since the above change is only a minor improvement compared to calling
tounicode/tostring directly (as few as 2 characters if you do the respective
import), I'm -0.5 on breaking ElementTree compatibility in these cases.

Stefan

From faassen at infrae.com Wed May 10 16:39:44 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed May 10 16:40:05 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <1819198740.20060510081308@carcass.dhs.org>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de> <4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org>
Message-ID: <4461FB30.3060103@infrae.com>

Steve Howe wrote:
[snip]
> Well thought. I think it would not hurt, however, if unicode() calls
> .tounicode() with default params, *if* str() behaves the same way. in
> fact, there is no point in printing at all - is
> there ? That would bring the best of both worlds, the
> extensibility/compatibility of .tostr()/.unicode(), and the intuitive
> str()/unicode() call.

I'm not sure what behavior exactly is being proposed here, but I
strongly urge lxml not to print serialized XML instead of an element
automatically.
It's too implicit and during debugging it might get very annoying to see
massive amounts of XML if you just want to see what elements you have
in, say, a list, currently. Anyway, that concerns repr() more than
str(), but I'm still worried. I'd suggest sticking to whatever behavior
ElementTree has in this area.

Regards,

Martijn

From faassen at infrae.com Wed May 10 16:42:10 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed May 10 16:42:31 2006
Subject: [lxml-dev] Python unicode string support in lxml
In-Reply-To: <4461ECB1.8010301@gkec.informatik.tu-darmstadt.de>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461BF25.3070601@gkec.informatik.tu-darmstadt.de> <4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org> <4461ECB1.8010301@gkec.informatik.tu-darmstadt.de>
Message-ID: <4461FBC2.4010706@infrae.com>

Stefan Behnel wrote:
[snip]
> So, since the above change is only a minor improvement compared to calling
> tounicode/tostring directly (as few as 2 characters if you do the respective
> import), I'm -0.5 on breaking ElementTree compatibility in these cases.

-1 on breaking ElementTree compatibility.

tounicode() explicit behavior seems better to me than __unicode__. Let's be
very careful with implicit behavior in the area of string creation -
explicit is better here. With implicit behavior, it's just too easy for
a developer to do something wrong with encodings and then get very confused.
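
To make the cost difference discussed in this thread concrete, here is a
small sketch. It assumes the standard library's xml.etree.ElementTree under
a current Python 3, not lxml, and the tree size is arbitrary. str() on an
element yields a short generic description however large the tree is, while
an explicit serialization call walks every child:

```python
import xml.etree.ElementTree as ET

# Build a tree with many children, as a log statement might encounter.
root = ET.Element("root")
for i in range(1000):
    ET.SubElement(root, "item").text = str(i)

short_form = str(root)                              # generic description only
serialized = ET.tostring(root, encoding="unicode")  # walks the whole tree

print(short_form)
print(len(short_form), len(serialized))
```

A debugging statement that prints the element therefore stays cheap, and the
full serialization only happens where it is asked for by name.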

Regards,

Martijn

From fredrik at pythonware.com Wed May 10 16:55:07 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 16:56:53 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote:

> Fredrik Lundh wrote:
> > there's no such thing as an "XML document in a Unicode string"
>
> Well, there's things like "XML documents in files", "XML documents in HTTP"
> and "XML documents in SMTP", so why not "XML documents in Unicode strings"?

files, HTTP, and SMTP all deal with bytes (or if you prefer, octets).

a Python Unicode string doesn't contain bytes; it contains a sequence of
Unicode code points, which are indexes into an abstract character space.

a Python Unicode string doesn't have an encoding.

XML serialization is all about converting between the XML infoset (which
contains sequences of abstract code points) and the XML file format (which
contains bytes). an XML file is a bunch of bytes, not a bunch of code
points. storing a bunch of bytes as a bunch of code points is simply not
a very good idea, and is a great way to make people who don't understand
Unicode to write XML applications that will break when exposed to
non-ASCII text.

From fredrik at pythonware.com Wed May 10 16:58:09 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 17:01:29 2006
Subject: [lxml-dev] Re: Re[2]: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461BF25.3070601@gkec.informatik.tu-darmstadt.de><4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org>
Message-ID:

Steve Howe wrote:

> there is no point in printing at all - is
> there ?

depends on how large XML files you work with, of course.
I prefer an API that forces me to be a bit more explicit than a plain
"print" before dumping 10 megabytes of stuff to the console...

From faassen at infrae.com Wed May 10 17:20:33 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed May 10 17:20:56 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org>
Message-ID: <446204C1.6080703@infrae.com>

Fredrik Lundh wrote:
> Steve Howe wrote:
>
>>Whatever the calling method gets named, its a great feature, thanks.
>
> so what's your use case?
> (I hope you're aware that the XML file format is defined in terms of
> encoded data, not as sequences of Unicode code points, and that XML
> encoding involves more than just character sets; there's no such thing
> as an "XML document in a Unicode string")

For fun let's look at the XML spec and see whether we can get some
answers there. The spec says:

  The mechanism for encoding character code points into bit patterns MAY
  vary from entity to entity. All XML processors MUST accept the UTF-8 and
  UTF-16 encodings of Unicode 3.1

It also says:

  In the absence of external character encoding information (such as MIME
  headers), parsed entities which are stored in an encoding other than
  UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text
  Declaration) containing an encoding declaration:

  ...

  [...] In the absence of information provided by an external transport
  protocol (e.g. HTTP or MIME), it is a fatal error for an entity
  including an encoding declaration to be presented to the XML processor
  in an encoding other than that named in the declaration, or for an
  entity which begins with neither a Byte Order Mark nor an encoding
  declaration to use an encoding other than UTF-8. Note that since ASCII
  is a subset of UTF-8, ordinary ASCII entities do not strictly need an
  encoding declaration.

Confusingly in the first part it talks about 'stored in an encoding
other than..' and later on it talks about "information provided by an
external transport protocol". Still, my interpretation would be that in
the case of Python unicode strings, we *do* have a form of 'external
character encoding information'. So, in the presence of such external
information, this means that the encoding declaration is *not* necessary
in the document (and in fact I'd say it shouldn't be there in case of
XML in unicode strings).

Whether it's useful in practical applications to have the ability to
store XML in Python unicode strings is an interesting debate. In the
case of in-memory XML processors it might simplify matters if you can
just treat any text everywhere as unicode. At least, it'd simplify
combining XML text with non-XML text somehow. (You'd prefer to use the
ElementTree API for such manipulation though. :)

On the other hand, in the lxml implementation it'll be slower than
actually dealing with XML as UTF-8, as that's what libxml2 will be able
to parse most quickly. So we could argue that encouraging the above
usage pattern is going to lead to less than optimal performance. I don't
consider that a big problem as fast performance is still available, though.

I'm fine with a tounicode() output function (I'd be more worried about
the unicode(), but I'm glad that idea got revoked already). I also don't
see harm in accepting unicode input into the XML() function. I see that
it fails in case an encoding is expressed in the XML itself, so that's
good. So, +1 to the current set of changes.
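
A small sketch of the encoding-declaration point made above, assuming the
standard library's xml.etree.ElementTree under a current Python 3 rather
than the lxml trunk discussed here, with a made-up sample element.
Serializing to a text string leaves the encoding declaration out, while
serializing to bytes in an encoding other than UTF-8 or ASCII puts one in so
that a parser can decode the bytes again:

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "caf\u00e9"   # one non-ASCII character

# Text output: the string itself carries the characters, no declaration.
as_text = ET.tostring(root, encoding="unicode")

# Byte output in Latin-1: the declaration names the encoding used.
as_latin1 = ET.tostring(root, encoding="iso-8859-1")

# The declaration is what lets a parser turn the bytes back into text.
round_trip = ET.fromstring(as_latin1)

print(as_text)
print(as_latin1)
print(round_trip.text)
```

This covers only the serialization side. How lxml's XML() treats a unicode
string that contains a declaration is described in the message above and is
not reproduced by this sketch.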

Regards,

Martijn

From faassen at infrae.com Wed May 10 17:45:08 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed May 10 17:45:28 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de>
Message-ID: <44620A84.4040304@infrae.com>

Fredrik Lundh wrote:
> Stefan Behnel wrote:
>
>>Fredrik Lundh wrote:
>>
>>>there's no such thing as an "XML document in a Unicode string"
>>
>>Well, there's things like "XML documents in files", "XML documents in HTTP"
>>and "XML documents in SMTP", so why not "XML documents in Unicode strings"?
>
> files, HTTP, and SMTP all deal with bytes (or if you prefer, octets).
>
> a Python Unicode string doesn't contain bytes; it contains a sequence of
> Unicode code points, which are indexes into an abstract character space.
>
> a Python Unicode string doesn't have an encoding.

The XML specification does not restrict how parsed entities encode code
points into bit patterns.

In the end the Python unicode string *does* encode code points into
bit patterns in memory (in a way which has nice properties for indexing
characters). It's very clear how to get to unicode code points from bit
patterns given a unicode string, as they're more or less identical, and
that's why normally we don't care about how Python stores unicode
internally.

The Python unicode string therefore seems to me to be a legitimate
source of XML data. See also my reply to your previous mail where I
actually quote the XML spec to back this up. :)

> XML serialization is all about converting between the XML infoset (which
> contains sequences of abstract code points) and the XML file format (which
> contains bytes). an XML file is a bunch of bytes, not a bunch of code
> points.
storing a bunch of bytes as a bunch of code points is simply not
> a very good idea, and is a great way to make people who don't understand
> Unicode to write XML applications that will break when exposed to
> non-ASCII text.

XML is more than just a file format. The XML spec is careful to talk
about 'entities'. It recognizes that an entity can exist in numerous
encodings, and that the encoding information can be in the entity (the
encoding declaration), but that they can also be externally specified.

I agree that it is a valid argument against this API that people who do
not understand unicode are going to make even more mistakes when using this.
Having reviewed the API, I think the chances that people will get even
more confused are relatively minor, though. The API as defined now
refuses to guess in all cases:

* when XML() is presented with a unicode string, it's clear what to do,
  unless that string also contains an encoding declaration. In that case,
  the system refuses to guess and an exception is raised.

* .tounicode() needs to be called explicitly in order to get unicode
  form of XML.

and that's it.

A naive user would just open an XML file and pass that into the XML()
function (or use the file access functions), and that will work (if the
encoding declaration in the XML is correct). A naive user would also use
tostring() as they don't know all that unicode stuff. I think naive
users therefore aren't any worse off than before.

Regards,

Martijn

From fredrik at pythonware.com Wed May 10 17:56:43 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 17:57:56 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <446204C1.6080703@infrae.com>
Message-ID:

Martijn Faassen wrote:

> Whether it's useful in practical applications to have the ability to
> store XML in Python unicode strings is an interesting debate.
In the
> case of in-memory XML processors it might simplify matters if you can
> just treat any text everywhere as unicode.

the mapping between Unicode text in the infoset and serialized XML involves
more things than just the Unicode-to-byte encoding, so that "simplification"
is far from obvious.

> At least, it'd simplify combining XML text with non-XML text somehow.

what exactly is "XML text", and why would you want to combine that with
non-XML text?

again, what's the use case ?

From fredrik at pythonware.com Wed May 10 18:07:47 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 18:10:59 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com>
Message-ID:

Martijn Faassen wrote:

> * .tounicode() needs to be called explicitly in order to get unicode
> form of XML.

and the user must then make sure to treat the output carefully, if he's
doing anything with it at all. because it's not really Unicode; it just looks
as if it is.

> and that's it.

far from it. the real problem appears when you want to write the resulting
bytes-encoded-in-Unicode string to a file, socket, or some other
byte-oriented output device. what do you need to do to make this work, and
what happens if you don't ?

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 18:36:51 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 18:37:37 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de>
Message-ID: <446216A3.8060401@gkec.informatik.tu-darmstadt.de>

Fredrik Lundh wrote:
> files, HTTP, and SMTP all deal with bytes (or if you prefer, octets).
> > a Python Unicode string doesn't contain bytes; it contains a sequence of > Unicode code points, which are indexes into an abstract character space. Ok, so then that means that unicode strings are completely unparsable. A standards-compliant XML API should raise an error when it is asked to parse a sequence of unicode code points. Let's see... >>> from elementtree.ElementTree import XML >>> XML(u"") What? I didn't put any bytes in there? Where did the element come from? > a Python Unicode string doesn't have an encoding. Well, it does, internally. And it's even well-defined across the whole platform. > XML serialization is all about converting between the XML infoset (which > contains sequences of abstract code points) and the XML file format (which > contains bytes). an XML file is a bunch of bytes, not a bunch of code > points. storing a bunch of bytes as a bunch of code points is simply not > a very good idea, and is a great way to make people who don't understand > Unicode to write XML applications that will break when exposed to non- > ASCII text. You're definitely the first to tell me that using unicode makes people write programs that break for non-ascii text... Stefan From faassen at infrae.com Wed May 10 18:48:21 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed May 10 18:48:40 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> Message-ID: <44621955.9040606@infrae.com> Fredrik Lundh wrote: > Martijn Faassen wrote: > > >>* .tounicode() needs to be called explicitly in order to get unicode >>form of XML. > > and the user must then make sure to treat the output carefully, if he's > doing anything with it at all. because it's not really Unicode; it just looks > as if it is. Why is it "not really Unicode"? >>and that's it. 
> > > far from it. the real problem appears when you want to write the resulting > bytes-encoded-in-Unicode string to a file, socket, or some other byte- > oriented output device. what do you need to do to make this work, and > what happens if you don't ? When you try to write a unicode string to a byte-oriented devise, you'll have to encode, like always when you write a unicode string. Possibly you're pointing out the issue of the encoding header - if you would encode the string to latin-1 and save it, say, there'd be a problem as the XML does not carry along its encoding information in any encoding header. Regards, Martijn From fredrik at pythonware.com Wed May 10 19:01:30 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 19:03:13 2006 Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <446216A3.8060401@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > > a Python Unicode string doesn't contain bytes; it contains a sequence of > > Unicode code points, which are indexes into an abstract character space. > > Ok, so then that means that unicode strings are completely unparsable. A > standards-compliant XML API should raise an error when it is asked to parse a > sequence of unicode code points. Let's see... > > >>> from elementtree.ElementTree import XML > >>> XML(u"") > > > What? I didn't put any bytes in there? Where did the element come from? the CPython interpreter uses a default encoding, and attempts to *encode* Unicode strings using this encoding when you pass them to an interface that expects bytes. if that doesn't work, the function won't even get called; instead, you'll get a "can't encode" exception: >>> XML(u"") Traceback (most recent call last): File "", line 1, in ? 
File "", line 67, in XML UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) still think that XML supports Unicode ? or are you saying that the subset of Unicode that happens to be ASCII is a good enough subset ? > > a Python Unicode string doesn't have an encoding. > > Well, it does, internally. And it's even well-defined across the whole platform. that's an implementation detail. a Python implementation may use whatever representation it wants on the inside. on the outside, there's no encoding (in the traditional sense); all there is is a sequence of Unicode code points. > > XML serialization is all about converting between the XML infoset (which > > contains sequences of abstract code points) and the XML file format (which > > contains bytes). an XML file is a bunch of bytes, not a bunch of code > > points. storing a bunch of bytes as a bunch of code points is simply not > > a very good idea, and is a great way to make people who don't understand > > Unicode to write XML applications that will break when exposed to non- > > ASCII text. > > You're definitely the first to tell me that using unicode makes people write > programs that break for non-ascii text... using Unicode with interfaces that expect bytes will break, if the Unicode string contains the wrong things. for example, >>> XML(u"") Traceback (most recent call last): File "", line 1, in ? File "", line 67, in XML UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) and >>> f = open("file", "wb") >>> f.write(u"föö") Traceback (most recent call last): File "", line 1, in ? UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128) and so on. which means that >>> f = open("file.xml", "wb") >>> f.write(ET.tounicode(tree)) will sometimes work, and sometimes fail, and sometimes generate broken XML files, depending on the data. 
while >>> f = open("file.xml", "wb") >>> f.write(ET.tostring(tree)) will always do the right thing. From faassen at infrae.com Wed May 10 19:03:17 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed May 10 19:03:37 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <446204C1.6080703@infrae.com> Message-ID: <44621CD5.2040904@infrae.com> Fredrik Lundh wrote: > Martijn Faassen wrote: > > >>Whether it's useful in practical applications to have the ability to >>store XML in Python unicode strings is an interesting debate. In the >>case of in-memory XML processors it might simplify matters if you can >>just treat any text everywhere as unicode. > > > the mapping between Unicode text in the infoset and serialized XML involves > more things than just the Unicode-to-byte encoding, so that "simplification" > is far from obvious. You mean escaped unicode entities? If you want to turn it back into the infoset, you pass it into XML(), right? >>At least, it'd simplify combining XML text with non-XML text somehow. > > > what exactly is "XML text", and why would you want to combine that with > non-XML text? again, what's the use case ? I can come up with a few: * quick and dirty applications that mess about with the XML text on a textual level. I agree there are usually better ways to do the same thing in a clear way (XSLT, ElementTree API). * web applications that use unicode inside (Zope 3, Silva on Zope 2) that want to present XML in a web page. In Zope 3, HTTP response text is initially a unicode string before it's encoded to UTF-8 and sent out to the network. (requests variables are converted to unicode automatically as well) In Zope 3, I'd need the XML encoded as a unicode string in order to put it on a web page. Putting something on a web page typically means combining it with HTML. 
Normally you'd need an extra escaping run for the <, > and such first, of course, which is in fact an excellent candidate for the 'quick and dirty' application above that's not easily solved another way.

* more generally, any application that uses a user interface framework that's unicode-aware. (it might be worthwhile to investigate what Java UI toolkits do. Java uses unicode strings everywhere and presumably also in the UI API, so how do they display XML text?)

I think the main use cases are in the area of XML being displayed in the context of a UI environment that's unicode native. This means that support for unicode in XML() is less necessary than 'tounicode()' (though I'm probably missing use cases), but since you already support the former in ElementTree as Stefan pointed out, we're following suit. :)

Regards,

Martijn

From fredrik at pythonware.com Wed May 10 19:33:29 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed May 10 19:35:00 2006
Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <446204C1.6080703@infrae.com> <44621CD5.2040904@infrae.com>
Message-ID:

Martijn Faassen wrote:

> * quick and dirty applications that mess about with the XML text on a
> textual level. I agree there are usually better ways to do the same
> thing in a clear way (XSLT, ElementTree API).

or messing with the XML encoded text on the textual level. UTF-8 is carefully designed to allow things like this, of course...

> * web applications that use unicode inside (Zope 3, Silva on Zope 2)
> that want to present XML in a web page. In Zope 3, HTTP response text is
> initially a unicode string before it's encoded to UTF-8 and sent out to
> the network.

so how do you return images from Zope 3 ?
I can buy that a framework might automagically encode Unicode strings as UTF-8 byte strings, but forcing the use of Unicode sounds like a really lousy idea to me.

> * more generally, any application that uses a user interface framework
> that's unicode-aware. (it might be worthwhile to investigate what Java
> UI toolkits do. Java uses unicode strings everywhere and presumably also
> in the UI API, so how do they display XML text?)

I doubt the set of applications that displays XML files as text is even noticeable compared to the set of applications that displays text from the XML infoset...

> I think the main use cases are in the area of XML being displayed in the
> context of a UI environment that's unicode native.

which, frankly, means that the use case is almost nonexistent.

> This means that
> support for unicode in XML() is less necessary than 'tounicode()'
> (though I'm probably missing use cases), but since you already support
> the former in ElementTree as Stefan pointed out

he's confused: the XML() function does not support Unicode (see my followup mail).

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 19:49:54 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed May 10 19:50:41 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
In-Reply-To: <44621955.9040606@infrae.com>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> <44621955.9040606@infrae.com>
Message-ID: <446227C2.9030805@gkec.informatik.tu-darmstadt.de>

Martijn Faassen wrote:
> Fredrik Lundh wrote:
>> the real problem appears when you want to write the resulting
>> bytes-encoded-in-Unicode

It's not "bytes-encoded-in-unicode". It's Python unicode. That's well defined. Everything inside the Python interpreter knows how to deal with that.
> string to a file, socket, or some other byte- >> oriented output device. what do you need to do to make this work, and >> what happens if you don't ? > > When you try to write a unicode string to a byte-oriented devise, you'll > have to encode, like always when you write a unicode string. > > Possibly you're pointing out the issue of the encoding header - if you > would encode the string to latin-1 and save it, say, there'd be a > problem as the XML does not carry along its encoding information in any > encoding header. But then, that's just like taking a letter out of an envelope and saying "Hey! How did that get here?" Stefan From howe at carcass.dhs.org Wed May 10 19:55:50 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 19:56:26 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> Message-ID: <58589037.20060510145550@carcass.dhs.org> Hello Fredrik, Wednesday, May 10, 2006, 9:40:07 AM, you wrote: > so what's your use case? I think it's obvious: any place where there is XML data represented as unicode and not as plain ASCII. > (I hope you're aware that the XML file format is defined in terms of en- > coded data, not as sequences of Unicode code points, and that XML > encoding involves more than just character sets; there's no such thing > as an "XML document in a Unicode string") Yes, I am aware of the XML spec, thank you. > stefan's argument is basically "we should add it because we can", which > is a rather lousy way to design software. I don't think that was his argument and didn't find your comment very elegant either... 
-- 
Best regards,
 Steve mailto:howe@carcass.dhs.org

From howe at carcass.dhs.org Wed May 10 20:07:21 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed May 10 20:07:54 2006
Subject: [lxml-dev] Re: Re[2]: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461BF25.3070601@gkec.informatik.tu-darmstadt.de><4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org>
Message-ID: <51237660.20060510150721@carcass.dhs.org>

Hello Fredrik,

Wednesday, May 10, 2006, 11:58:09 AM, you wrote:

> depends on how large XML files you work with, of course. I prefer
> an API that forces me to be a bit more explicit than a plain "print"
> before dumping 10 megabytes of stuff to the console...

I understand, but I think a programmer would be expecting str(root) to print the string representation of the tree, just like he calls str(int), and see "1" instead of "". That breaks Pythonic behaviour.

Large dumping will happen also when printing *any* large text dump to the screen, and just as Python won't "protect" you from such a dumping, I don't see a point in doing it. If a programmer does that dumping once, I think he should be smart enough to press Ctrl+C and change his code.

Anyway, I don't see that as bad design, just as a matter of taste, and I'm happy with either way - and I agree it would probably be bad to break ElementTree compatibility even if we disagree about something. I would have designed it to have both str() and .tostring() support, having the first calling the second.

As I said, other Python types use str() to convert from their native types to strings, so I think that should be used also with Elements and ElementTrees instead of the repr() output - when they want that, they should use that function.

From the Python documentation:

repr( object)
    Return a string containing a printable representation of an object. (...)
str( [object]) Return a string containing a nicely printable representation of an object. (...) -- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Wed May 10 20:15:43 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 20:16:24 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> Message-ID: <452176724.20060510151543@carcass.dhs.org> Hello Fredrik, Wednesday, May 10, 2006, 1:07:47 PM, you wrote: > far from it. the real problem appears when you want to write the resulting > bytes-encoded-in-Unicode string to a file, socket, or some other byte- > oriented output device. what do you need to do to make this work, and > what happens if you don't ? It will happen the same that happens to any unicode object in Python: you should encode it to some str form before doing such a thing, or should have called .tostring() in the first place. Your argument seems to be the same as "do not support unicode at all, it will end up as a bytes sequence in the disk anyway". There are cases for using str, and cases for using unicode. Not all data will be promptly serialized; some will be processed (as unicode) or even printed into an unicode console. -- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Wed May 10 20:23:41 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 20:24:21 2006 Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <446216A3.8060401@gkec.informatik.tu-darmstadt.de> Message-ID: <12026582.20060510152341@carcass.dhs.org> Hello Fredrik, Wednesday, May 10, 2006, 2:01:30 PM, you wrote: [...] 
> and so on. which means that >>>> f = open("file.xml", "wb") >>>> f.write(ET.tounicode(tree)) > will sometimes work, and sometimes fail, and sometimes generate broken XML > files, depending on the data. while >>>> f = open("file.xml", "wb") >>>> f.write(ET.tostring(tree)) > will always do the right thing. ...agreed, *if* the "right thing" is serializing. If I want to process that unicode data, I would have to encode it as unicode, then process it. And for a large string, that would at a lot of resources, not to mention all the trouble involved. As I pointed out: that are places for using .tounicode(), and places for using .tostring(). -- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Wed May 10 20:26:27 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 20:27:09 2006 Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <446204C1.6080703@infrae.com> <44621CD5.2040904@infrae.com> Message-ID: <77952733.20060510152627@carcass.dhs.org> Hello Fredrik, Wednesday, May 10, 2006, 2:33:29 PM, you wrote: [...] > I can buy that a framework might automagically encode Unicode strings > as UTF-8 byte strings, but forcing the use of Unicode sounds like a really > lousy idea to me. I still don't understand the "forcing" thing here. Does anyone want to get rid of the .tostring() call ? Or just provide an alternative .tounicode() call for those who want unicode returned ? 
-- Best regards, Steve mailto:howe@carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 20:40:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed May 10 20:41:10 2006 Subject: [lxml-dev] Re: Re[2]: Python unicode string support in lxml In-Reply-To: <51237660.20060510150721@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461BF25.3070601@gkec.informatik.tu-darmstadt.de><4461C557.90609@gkec.informatik.tu-darmstadt.de> <1819198740.20060510081308@carcass.dhs.org> <51237660.20060510150721@carcass.dhs.org> Message-ID: <44623398.2080601@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > I understand, but I think a programmer would be expecting str(root) to > print the string representation of the tree, just like he calls > str(int), and see "1" instead of "". > [...] > Anyway, I don't see that as > bad design, just as a taste matter, and I'm happy with either way - and > I agree it would probably be bad to break ElementTree compatibility even > if we disagree about something. I would have designed it to have both > str() and .tostring() support, having the first calling the second. > [...] > From the Python documentation: > > repr( object) > Return a string containing a printable representation of an object. > (...) > > str( [object]) > Return a string containing a nicely printable representation of an > object. (...) It's definitely a matter of taste. It's also the question what exactly is meant by "object" here: the Element? The entire tree of the Element? Sadly, that makes a huge difference... Regarding unicode() vs. tounicode(), I think both ideas are intuitive in a way and neither of them has clear advantages. So it's the holy cow that makes the difference. 
(Also: there should be one - and preferably only one - way of doing it) Regarding tounicode() or not: I can understand that Fredrik has objections to using Unicode to carry XML in general. But I really don't see why you should try to actively prevent users from efficiently getting a straight unicode string out of the API if they want it. Python distinguishes between str and unicode - we can't just go "oh well, it shouldn't have been that way, so we won't support it". We can always add a docstring to "tostring" and "tounicode" saying that the first is preferable for serialization to files. But then, what is "write" for? Stefan From fredrik at pythonware.com Wed May 10 20:46:28 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 20:47:27 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461EB1D.5020802@gkec.informatik.tu-darmstadt.de><446216A3.8060401@gkec.informatik.tu-darmstadt.de> <12026582.20060510152341@carcass.dhs.org> Message-ID: Steve Howe wrote: > ...agreed, *if* the "right thing" is serializing. If I want to process > that unicode data, I would have to encode it as unicode Unicode is not an encoding, it's a text model used by, among others, XML's infoset, and Python's Unicode string type. Encoding Unicode as Unicode doesn't make sense, unless you're confusing encoding with serialization. And *any* conversion between XML infoset (which is the XML information model) and the XML file representation is serialization; Stefan's "tounicode" function serializes to UTF-16 or UCS-4, depending on platform, and stuffs the result into the internal buffer of a Python Unicode string. > then process it. As I just pointed out, if you want to process serialized XML, nothing keeps you from doing that on the byte stream (be it UTF-8 or ASCII or whatever). 
*Why* you would want to process the serialized form of an XML infoset instead of the actual infoset is still an open question. (I know people who've written XML-to-SGML post-processors for ET, but they don't count ;-) The right way to solve that kind of problems is to use a custom serializer, like the one in Kid). > And for a large string, that would at a lot of resources, not to > mention all the trouble involved. As I pointed out: that are places for > using .tounicode(), and places for using .tostring(). You keep saying this, but Martijn is the only one who's attempted to list some use cases. I'm pretty sure he made them all up on the spot; I'm still waiting for some real-life cases. From fredrik at pythonware.com Wed May 10 20:52:26 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 20:53:31 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><446204C1.6080703@infrae.com><44621CD5.2040904@infrae.com> <77952733.20060510152627@carcass.dhs.org> Message-ID: Steve Howe wrote: > > I can buy that a framework might automagically encode Unicode strings > > as UTF-8 byte strings, but forcing the use of Unicode sounds like a really > > lousy idea to me. > > I still don't understand the "forcing" thing here. Does anyone want to > get rid of the .tostring() call ? Or just provide an alternative > .tounicode() call for those who want unicode returned ? It helps if you read the posts you quote: Martijn's example was a web frame- work used Unicode strings for HTTP responses, and encoded it as UTF-8 on the way out. Such a framework won't be able to handle output from the current ET serializer, but it won't work with images, external templating systems, resources read from disk, preformatted resources, etc, either. 
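
The bytes-versus-text distinction argued over in this thread can be sketched with the ElementTree module in a current Python 3 standard library. This is only an illustration under that assumption: it does not use the lxml trunk or ElementTree 1.2 discussed here, tounicode() is stood in for by tostring(..., encoding="unicode"), and the document and variable names are made up. Note that Python 3 refuses to write text to a byte sink, where Python 2 would have tried an implicit ASCII encode.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document containing non-ASCII text.
root = ET.fromstring("<doc>f\u00f6\u00f6</doc>")

# Text serialization (a str of code points) versus byte serialization.
text_form = ET.tostring(root, encoding="unicode")
bytes_form = ET.tostring(root, encoding="utf-8")

# A byte-oriented sink, standing in for a file opened with "wb".
sink = io.BytesIO()

# Writing the text form directly is rejected; it has to be encoded first.
try:
    sink.write(text_form)
    wrote_text = True
except TypeError:
    wrote_text = False

# The byte form can be written as-is and parsed back again.
sink.write(bytes_form)
roundtrip = ET.fromstring(sink.getvalue())
```

If the text form were encoded by hand with something other than UTF-8, the caller would also have to add a matching encoding declaration, which is the concern raised earlier in the thread.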
From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 20:55:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed May 10 20:56:10 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: <44621955.9040606@infrae.com> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> <44621955.9040606@infrae.com> Message-ID: <4462371C.2040700@gkec.informatik.tu-darmstadt.de> Martijn Faassen wrote: > Fredrik Lundh wrote: >> the real problem appears when you want to write the resulting >> bytes-encoded-in-Unicode string to a file, socket, or some other byte- >> oriented output device. what do you need to do to make this work, and >> what happens if you don't ? > > Possibly you're pointing out the issue of the encoding header - if you > would encode the string to latin-1 and save it, say, there'd be a > problem as the XML does not carry along its encoding information in any > encoding header. Hmmm, I just noticed that we don't do that anyway (we rely on libxml2 here): >>> from lxml.etree import XML, tostring, tounicode >>> tostring( XML("") ) '' >>> tostring( XML(""), encoding="UTF-8" ) '' >>> tounicode( XML("") ) u'' That's very consistent as far as lxml is concerned. ElementTree handles this a bit different, though: >>> from elementtree.ElementTree import XML, tostring >>> tostring(XML("")) '' >>> tostring(XML(""), encoding="UTF-8") "\n" This admittedly makes sense when you have the intention of handing that string to someone else. 
Stefan From fredrik at pythonware.com Wed May 10 21:11:45 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 21:13:15 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> <44621955.9040606@infrae.com> <4462371C.2040700@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > That's very consistent as far as lxml is concerned. ElementTree handles this a > bit different, though: > > >>> from elementtree.ElementTree import XML, tostring > >>> tostring(XML("")) > '' > >>> tostring(XML(""), encoding="UTF-8") > "\n" > > This admittedly makes sense when you have the intention of handing that string > to someone else. that's a wart, though: the current ET serializer outputs the header as soon as you use a non-default encoding (and the default encoding is us-ascii). iirc, ET 1.3 adds an "xml_declaration" option which can be set to None (the default): old "it depends on the encoding" behaviour a true value: always include, with version and encoding a false value (except None): never include feel free to emulate ET 1.3 here. From fredrik at pythonware.com Wed May 10 21:25:20 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 21:26:26 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> Message-ID: Steve Howe wrote: > > so what's your use case? > > I think it's obvious: any place where there is XML data represented as > unicode and not as plain ASCII. Huh? Have you noticed that tostring takes an optional encoding argument? 
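
As a rough illustration of the encoding argument and the declaration behaviour described above, the following sketch uses the ElementTree module shipped with Python 3.8 or later, where tostring() accepts both encoding and xml_declaration. Whether this matches exactly what was planned for ET 1.3 is an assumption, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing non-ASCII text.
root = ET.fromstring("<doc>f\u00f6\u00f6</doc>")

# Default: us-ascii bytes, non-ASCII characters become character
# references, and no XML declaration is written.
default_bytes = ET.tostring(root)

# A non-default encoding: the declaration is included automatically.
latin1_bytes = ET.tostring(root, encoding="iso-8859-1")

# A true value always includes the declaration (Python 3.8+ keyword).
forced = ET.tostring(root, xml_declaration=True)

# A false value other than None never includes it.
suppressed = ET.tostring(root, encoding="iso-8859-1", xml_declaration=False)

for label, data in [("default", default_bytes), ("latin-1", latin1_bytes),
                    ("forced", forced), ("suppressed", suppressed)]:
    print(label, data)
```

Leaving xml_declaration at None keeps the "it depends on the encoding" behaviour, so existing callers see no change.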
From howe at carcass.dhs.org Wed May 10 21:33:18 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed May 10 21:34:00 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461EB1D.5020802@gkec.informatik.tu-darmstadt.de><446216A3.8060401@gkec.informatik.tu-darmstadt.de> <12026582.20060510152341@carcass.dhs.org>
Message-ID: <1134976844.20060510163318@carcass.dhs.org>

Hello Fredrik,

Wednesday, May 10, 2006, 3:46:28 PM, you wrote:

> Unicode is not an encoding, it's a text model used by, among others, XML's
> infoset, and Python's Unicode string type. Encoding Unicode as Unicode
> doesn't make sense, unless you're confusing encoding with serialization.

I mean I would have to str.encode() the string. I know what Unicode is.

> As I just pointed out, if you want to process serialized XML, nothing keeps
> you from doing that on the byte stream (be it UTF-8 or ASCII or whatever).

Doesn't "resource saving" have a consideration here ?

> *Why* you would want to process the serialized form of an XML infoset
> instead of the actual infoset is still an open question.
> (I know people who've written XML-to-SGML post-processors for ET, but
> they don't count ;-) The right way to solve that kind of problems is to
> use a custom serializer, like the one in Kid).
> You keep saying this, but Martijn is the only one who's attempted to
> list some use cases. I'm pretty sure he made them all up on the spot;
> I'm still waiting for some real-life cases.

Ok:

1) say you want to search "föö" on the XML, but it could have any case sense, such as FÖÖ.

   uxml = etree.tounicode(root).lower()
   if uxml.find('föö') > -1:
       print 'found'

2) To print unicode directly into the console instead of a string:

   print etree.tounicode(root)

3) Provide unicode data directly to a native unicode database such as Berkeley DBXML, which uses UTF-8 for all its operations:

   uxml = etree.tounicode(root)
   mgr = XmlManager()
   uc = mgr.createUpdateContext()
   container = mgr.createContainer("test.dbxml")
   container.putDocument('mydoc', uxml, uc)

In general, in all situations where you will have to encode() the output from etree.tostring(), it's much better to have that value given already as unicode. The point of etree.tounicode() is avoiding an unnecessary, resource-wasting .encode() call. And if you don't want unicode, use etree.tostring(). What is so mysterious here ?

-- 
Best regards,
 Steve mailto:howe@carcass.dhs.org

From howe at carcass.dhs.org Wed May 10 21:46:13 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed May 10 21:46:55 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org>
Message-ID: <205615288.20060510164613@carcass.dhs.org>

Hello Fredrik,

Wednesday, May 10, 2006, 4:25:20 PM, you wrote:

> Steve Howe wrote:
>> I think it's obvious: any place where there is XML data represented as
>> unicode and not as plain ASCII.
> Huh? Have you noticed that tostring takes an optional encoding argument?

Won't that waste exactly the same resources as this ?

   xml = etree.tostring(element).encode(encoding)

For a large xml, this would more than double the memory requirements to do that processing, when it could be returned directly as a unicode object.
-- 
Best regards,
 Steve mailto:howe@carcass.dhs.org

From tseaver at palladion.com Wed May 10 21:48:27 2006
From: tseaver at palladion.com (Tres Seaver)
Date: Wed May 10 21:49:28 2006
Subject: [lxml-dev] Re: Python unicode string support in lxml
In-Reply-To:
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><446204C1.6080703@infrae.com><44621CD5.2040904@infrae.com> <77952733.20060510152627@carcass.dhs.org>
Message-ID: <4462438B.6020801@palladion.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Fredrik Lundh wrote:
> Steve Howe wrote:
>
>
>>>I can buy that a framework might automagically encode Unicode strings
>>>as UTF-8 byte strings, but forcing the use of Unicode sounds like a really
>>>lousy idea to me.
>>
>>I still don't understand the "forcing" thing here. Does anyone want to
>>get rid of the .tostring() call ? Or just provide an alternative
>>.tounicode() call for those who want unicode returned ?
>
>
> It helps if you read the posts you quote: Martijn's example was a web frame-
> work used Unicode strings for HTTP responses, and encoded it as UTF-8 on
> the way out.

Right. Zope3 does this for any "text" (i.e. Unicode) responses. If the response body is "bytes" (an encoded string of some sort), it doesn't do that processing: it is then the application's job to have set the correct encoding into the 'Content-type' header.

> Such a framework won't be able to handle output from the current ET serializer,
> but it won't work with images, external templating systems, resources read from
> disk, preformatted resources, etc, either.

The values so obtained are all "bytes" and not "text" in the Zope3 world.

Tres.
- -- =================================================================== Tres Seaver +1 202-558-7113 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEYkOL+gerLs4ltQ4RAlhqAKCmQQgtdZi34Uglz0VYASTRraViygCeKZyz H4Fg6t57jFYzObNWrp+wiKQ= =QWBN -----END PGP SIGNATURE----- From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 22:01:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed May 10 22:02:31 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <205615288.20060510164613@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> Message-ID: <446246A7.4060100@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Wednesday, May 10, 2006, 4:25:20 PM, you wrote: >> Have you noticed that tostring takes an optional encoding argument? > > Won't that waste exactly the same resources as this ? > > xml = etree.tostring(element).encode(encoding) > > For a large xml, this would more then double the memory requirements to > do that processing, when it could be returned directly as an unicode > object. Careful, this is more or less how tounicode() is currently implemented (although at the libxml2 level). It currently serializes to UTF-8 (which, at least, is pretty fast in libxml2, as all strings are already UTF-8) and then calls the Python API function to convert from UTF-8 to Python unicode in one run (which is also pretty efficient). It's difficult to do otherwise, as libxml2 and Python have independent memory management, so we can't just mange pointers here. 
Note also that libxml2 uses a dynamically adapted output buffer, so it likely uses more memory during serialization than absolutely necessary. So, while the idea of the API is that it's more efficient (which it still is), the gain may not be as big as expected. But since tostring uses the same mechanism (and thus suffers from the same problem), the gain in overhead is still about 1/3 if the result is required as unicode. Stefan From fredrik at pythonware.com Wed May 10 22:07:10 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed May 10 22:08:21 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461EB1D.5020802@gkec.informatik.tu-darmstadt.de><446216A3.8060401@gkec.informatik.tu-darmstadt.de><12026582.20060510152341@carcass.dhs.org> <1134976844.20060510163318@carcass.dhs.org> Message-ID: Steve Howe wrote: > > As I just pointed out, if you want to process serialized XML, nothing keeps > > you from doing that on the byte stream (be it UTF-8 or ASCII or whatever). > > Doesn't "resource saving" have a consideration here ? what's "resource saving" by using a slower serialization model that needs more memory ? > 1) say you want to search "föö" on the XML, but it could have any case > sense, such as FÖÖ. > > uxml = etree.tounicode(root).lower() > if uxml.find('föö') > -1: > print 'found' why would you do this on the serialized document, rather than on the infoset ? how would you generalize the above to handle arbitrary strings ? what about surrogates ? > 2) To print unicode directly into the console instead of a string: > > print etree.tounicode(root) that's not portable, of course. Python cannot print arbitrary Unicode to stdout on all platforms. it has no trouble printing ASCII to stdout... 
> > 3) Provide unicode data directly to a native unicode database such as > > Berkeley DBXML, which uses UTF-8 for all its operations: > > uxml = etree.tounicode(root) > mgr = XmlManager() > uc = mgr.createUpdateContext() > container = mgr.createContainer("test.dbxml") > container.putDocument('mydoc', uxml, uc) according to the DBXML documentation, it expects well-formed XML, not necessarily "UTF-8", and definitely not "unicode". have you tried the above with non-ASCII data? with latin-1 data serialized as "iso-8859-1" ? what does sys.getdefaultencoding() return on your machine ? From howe at carcass.dhs.org Wed May 10 22:33:15 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 22:34:04 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <446246A7.4060100@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> Message-ID: <34119542.20060510173315@carcass.dhs.org> Hello Stefan, Wednesday, May 10, 2006, 5:01:43 PM, you wrote: > Careful, this is more or less how tounicode() is currently implemented > (although at the libxml2 level). It currently serializes to UTF-8 (which, at > least, is pretty fast in libxml2, as all strings are already UTF-8) and then > calls the Python API function to convert from UTF-8 to Python unicode in one > run (which is also pretty efficient). It's difficult to do otherwise, as > libxml2 and Python have independent memory management, so we can't just mange > pointers here. > Note also that libxml2 uses a dynamically adapted output buffer, so it likely > uses more memory during serialization than absolutely necessary. > So, while the idea of the API is that it's more efficient (which it still is), > the gain may not be as big as expected. 
But since tostring uses the same > mechanism (and thus suffers from the same problem), the gain in overhead is > still about 1/3 if the result is required as unicode. I was thinking lxml would return the data encoded as unicode, in the same format Python uses, and thus the gain would be more dramatic. In this case, I think you should judge how much more efficient that is than calling .tostring(encoding) and implement it if the gain is reasonable. -- Best regards, Steve mailto:howe@carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 10 23:00:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed May 10 23:00:59 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <34119542.20060510173315@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> Message-ID: <4462545B.1040502@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Wednesday, May 10, 2006, 5:01:43 PM, you wrote: > >> Careful, this is more or less how tounicode() is currently implemented >> (although at the libxml2 level). It currently serializes to UTF-8 (which, at >> least, is pretty fast in libxml2, as all strings are already UTF-8) and then >> calls the Python API function to convert from UTF-8 to Python unicode in one >> run (which is also pretty efficient). It's difficult to do otherwise, as >> libxml2 and Python have independent memory management, so we can't just mange >> pointers here. > >> Note also that libxml2 uses a dynamically adapted output buffer, so it likely >> uses more memory during serialization than absolutely necessary. > >> So, while the idea of the API is that it's more efficient (which it still is), >> the gain may not be as big as expected.
But since tostring uses the same >> mechanism (and thus suffers from the same problem), the gain in overhead is >> still about 1/3 if the result is required as unicode. > I was thinking lxml would return the data encoded as unicode, in the > same format Python uses, and thus the gain would be more dramatic. I guess you mean libxml2 here, not lxml. Given the above procedure, I don't think it's a big difference in speed if libxml2 encodes to native Python (from internal UTF-8 data) or if Python does that from libxml2 serialized UTF-8 data. In any case, we'd have to copy the buffer to get it into Python. I assume that the libxml2->UTF8->Python approach is already the most memory friendly order in most cases, as UTF-8 tends to be (much) shorter than 32bit unicode (which the Python interpreter *may* use, although it *may* also be 16bit). So generating everything in UTF-8 and then expanding it to unicode actually saves RAM compared to copying from unicode to unicode. > In this case, I think you should judge how more efficient that is then > calling .tostring(encoding) and implement if the gain is reasonable. Sorry, I don't understand what you mean here. This is all done at the C-level: serialization and conversion. If you did the same at the Python level, it cannot be faster or less memory intensive. But you would still have to copy the string before you pass it back through the API. So doing the conversion /as/ the copy operation is the most efficient way. 
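The two-step procedure described above (serialize to UTF-8, then decode the whole buffer once into a unicode object) can be sketched at the Python level. This is only an illustration, not lxml's C-level code: it assumes a current Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the helper name to_text_via_utf8 is made up for this example.

```python
import xml.etree.ElementTree as ET


def to_text_via_utf8(element):
    """Serialize to UTF-8 bytes first, then decode the whole buffer once.

    Hypothetical helper: it mimics the order of operations described in
    the thread (UTF-8 serialization, then a single conversion step).
    """
    utf8_bytes = ET.tostring(element, encoding="utf-8")  # intermediate byte string
    return utf8_bytes.decode("utf-8")                    # the conversion is also a copy


# "f\u00f6\u00f6" is the non-ASCII sample word used in the thread.
root = ET.fromstring("<doc><item>f\u00f6\u00f6</item></doc>")

via_bytes = to_text_via_utf8(root)
direct = ET.tostring(root, encoding="unicode")  # text result in one call

# Both routes produce the same text; only the intermediate objects differ.
print(via_bytes == direct)
```

At the Python level the first route keeps the byte string alive until the decode has finished, which is the extra buffer the thread is arguing about.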
Stefan From howe at carcass.dhs.org Wed May 10 23:01:53 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 23:02:34 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461EB1D.5020802@gkec.informatik.tu-darmstadt.de><446216A3.8060401@gkec.informatik.tu-darmstadt.de><12026582.20060510152341@carcass.dhs.org> <1134976844.20060510163318@carcass.dhs.org> Message-ID: <1104541588.20060510180153@carcass.dhs.org> Hello Fredrik, Wednesday, May 10, 2006, 5:07:10 PM, you wrote: > what's "resource saving" by using a slower serialization model that needs > more memory ? In the first place, I was thinking lxml would be able to return a unicode object directly in the Python internal format, and that's where the resource saving was expected to come from. If it cannot handle that, there is no point in implementing it, indeed. > why would you do this on the serialized document, rather than on > the infoset ? how would you generalize the above to handle arbitrary > strings ? what about surrogates ? For any reason the user wants. That was just an example. A text editor handling unicode is an example. As I said, I just wanted to avoid an extra .encode() call which would work with two buffers in memory. > that's not portable, of course. Python cannot print arbitrary Unicode > to stdout on all platforms. it has no trouble printing ASCII to stdout... "Not portable" is not an argument. Python supports lots of other non-portable APIs. > according to the DBXML documentation, it expects well-formed XML, not > necessarily "UTF-8", and definitely not "unicode". have you tried the above > with non-ASCII data? with latin-1 data serialized as "iso-8859-1" ? what > does sys.getdefaultencoding() return on your machine ? I can't do those tests right now, sorry, but it should be 'ascii'.
DBXML expects NodeStorage containers to be UTF-8 (or plain ASCII), and the XQuery interfaces support only UTF8. Anyway, as I pointed out several times, I just want to avoid having a string in memory, then creating another UTF-8 object - it's unnecessary if you wanted unicode from the start. I'm sure you understand it's important to have encoding support since .tostring() supports it - but in an inefficient way due to implementation issues. -- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Wed May 10 23:08:41 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 23:10:12 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <4462545B.1040502@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> Message-ID: <954805581.20060510180841@carcass.dhs.org> Hello Stefan, Wednesday, May 10, 2006, 6:00:11 PM, you wrote: >>> Careful, this is more or less how tounicode() is currently implemented >>> (although at the libxml2 level). It currently serializes to UTF-8 (which, at >>> least, is pretty fast in libxml2, as all strings are already UTF-8) and then >>> calls the Python API function to convert from UTF-8 to Python unicode in one >>> run (which is also pretty efficient). It's difficult to do otherwise, as >>> libxml2 and Python have independent memory management, so we can't just mange >>> pointers here. >> >>> Note also that libxml2 uses a dynamically adapted output buffer, so it likely >>> uses more memory during serialization than absolutely necessary. >> >>> So, while the idea of the API is that it's more efficient (which it still is), >>> the gain may not be as big as expected.
But since tostring uses the same >>> mechanism (and thus suffers from the same problem), the gain in overhead is >>> still about 1/3 if the result is required as unicode. >> I was thinking lxml would return the data encoded as unicode, in the >> same format Python uses, and thus the gain would be more dramatic. > I guess you mean libxml2 here, not lxml. Given the above procedure, I don't > think it's a big difference in speed if libxml2 encodes to native Python (from > internal UTF-8 data) or if Python does that from libxml2 serialized UTF-8 > data. In any case, we'd have to copy the buffer to get it into Python. > I assume that the libxml2->UTF8->Python approach is already the most memory > friendly order in most cases, as UTF-8 tends to be (much) shorter than 32bit > unicode (which the Python interpreter *may* use, although it *may* also be > 16bit). So generating everything in UTF-8 and then expanding it to unicode > actually saves RAM compared to copying from unicode to unicode. >> In this case, I think you should judge how more efficient that is then >> calling .tostring(encoding) and implement if the gain is reasonable. > Sorry, I don't understand what you mean here. This is all done at the C-level: > serialization and conversion. If you did the same at the Python level, it > cannot be faster or less memory intensive. But you would still have to copy > the string before you pass it back through the API. So doing the conversion > /as/ the copy operation is the most efficient way. I meant lxml. I thought it could serialize the input stream from lxml into a Python unicode object without having the whole string in memory, doing it in chunks instead of retrieving a huge buffer, then converting it to unicode - just that. 
-- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Wed May 10 23:12:04 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed May 10 23:17:04 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <954805581.20060510180841@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> Message-ID: <783474576.20060510181204@carcass.dhs.org> Hello Steve, Wednesday, May 10, 2006, 6:08:41 PM, you wrote: > I meant lxml. I thought it could serialize the input stream from lxml > into a Python unicode object without having the whole string in memory, > doing it in chunks instead of retrieving a huge buffer, then converting > it to unicode - just that. Sorry, I meant "it could serialize the input stream from libxml2..." -- Best regards, Steve mailto:howe@carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 06:26:34 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 06:27:10 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <4461EB1D.5020802@gkec.informatik.tu-darmstadt.de> <44620A84.4040304@infrae.com> <44621955.9040606@infrae.com> <4462371C.2040700@gkec.informatik.tu-darmstadt.de> Message-ID: <4462BCFA.1080409@gkec.informatik.tu-darmstadt.de> Hi Fredrik, Fredrik Lundh wrote: > the current ET serializer outputs the > header as soon as you use a non-default encoding (and the default > encoding is us-ascii). Yup, I understood that from the tests. 
:) > iirc, ET 1.3 adds an "xml_declaration" option which can be set to > > None (the default): old "it depends on the encoding" behaviour > a true value: always include, with version and encoding > a false value (except None): never include > > feel free to emulate ET 1.3 here. Ok, we do that now. Would you know about any other API changes that we should take into consideration? Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 06:31:19 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 06:31:55 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: <1134976844.20060510163318@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><4461EB1D.5020802@gkec.informatik.tu-darmstadt.de><446216A3.8060401@gkec.informatik.tu-darmstadt.de> <12026582.20060510152341@carcass.dhs.org> <1134976844.20060510163318@carcass.dhs.org> Message-ID: <4462BE17.8090400@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > 1) say you want to search "föö" on the XML, but it could have any case > sense, such as FÖÖ. > > uxml = etree.tounicode(root).lower() > if uxml.find('föö') > -1: > print 'found' That's maybe not the best example as serialization already involves tree traversal. There is not much point in serializing to search a string in the .text field.
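The point about searching the tree rather than its serialization can be sketched as follows. This is a minimal illustration, assuming Python 3 and the standard library's xml.etree.ElementTree in place of lxml; contains_text is a hypothetical helper, and it deliberately ignores matches that span several elements.

```python
import xml.etree.ElementTree as ET


def contains_text(root, needle):
    """Case-insensitive search over the text and tail of every element.

    Hypothetical helper: it walks the tree directly, so markup such as
    tag names never matches, unlike a search on the serialized string.
    """
    needle = needle.casefold()
    for element in root.iter():
        for chunk in (element.text, element.tail):
            if chunk and needle in chunk.casefold():
                return True
    return False


# Upper-case sample word from the thread, written with escapes.
root = ET.fromstring("<doc><p>some F\u00d6\u00d6 here</p><p>bar</p></doc>")

print(contains_text(root, "f\u00f6\u00f6"))  # matches the text content
print(contains_text(root, "doc"))            # tag names are not searched
```

No intermediate string of the whole document is built, which is the traversal argument made above.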
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 06:41:46 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 06:42:23 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <783474576.20060510181204@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> Message-ID: <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: >> I meant lxml. I thought it could serialize the input stream from lxml >> into a Python unicode object without having the whole string in memory, >> doing it in chunks instead of retrieving a huge buffer, then converting >> it to unicode - just that. > Sorry, I meant "it could serialize the input stream from libxml2..." Ok, I get it now. Yes, it *could* do that. But that's much more work than the way it is now. That would involve writing a libxml2 I/O writer ourselves, accept UTF-8 data from the traversal process and then pass that to Python's converter step-by-step. I guess that's really for a future version. If anyone finds out that this is really needed, we may decide to implement it that way - under the same API. Note that the current implementation is efficient for the way it works (it's even likely (unverified) a bit faster than the approach above if RAM is there). The above would be a different optimisation, for space. 
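The step-by-step conversion mentioned above can be illustrated in pure Python. This is only a sketch under assumptions: it uses the standard library's incremental UTF-8 decoder from the codecs module, a hand-made list of byte chunks stands in for what a custom libxml2 I/O writer would deliver, and decode_in_chunks is a made-up name. It is not how lxml works today.

```python
import codecs


def decode_in_chunks(chunks):
    """Decode an iterable of UTF-8 byte chunks one chunk at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for chunk in chunks:
        # An incomplete multi-byte sequence at the end of a chunk is
        # buffered by the decoder until the next chunk completes it.
        pieces.append(decoder.decode(chunk))
    pieces.append(decoder.decode(b"", final=True))
    # Joining the pieces is itself a copy, so this saves the large byte
    # buffer but not the final allocation of the result string.
    return "".join(pieces)


data = "<doc>f\u00f6\u00f6</doc>".encode("utf-8")
# Split in the middle of the first two-byte character on purpose.
chunks = [data[:7], data[7:]]

print(decode_in_chunks(chunks) == data.decode("utf-8"))
```

The design trade-off is the one named in the thread: less peak memory for the serialized bytes, at the cost of more bookkeeping per chunk.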
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 07:59:44 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 08:00:23 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> Message-ID: <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> Hi Steve, Stefan Behnel wrote: > Steve Howe wrote: >>> I meant lxml. I thought it could serialize the input stream from lxml >>> into a Python unicode object without having the whole string in memory, >>> doing it in chunks instead of retrieving a huge buffer, then converting >>> it to unicode - just that. >> Sorry, I meant "it could serialize the input stream from libxml2..." > > Ok, I get it now. Yes, it *could* do that. But that's much more work than the > way it is now. That would involve writing a libxml2 I/O writer ourselves, > accept UTF-8 data from the traversal process and then pass that to Python's > converter step-by-step. I just noticed that it's even worse. The PyUnicode_DecodeUTF8Stateful function I had in mind (which is also only available in Python 2.4) doesn't allow us to grow the unicode string, so it would also require copying. 
So the only way I currently see to get the memory consumption down to about the size of the result string is letting libxml2 do the conversion, write the resulting chunks into a Python memory buffer through a custom libxml2 I/O writer, growing the buffer ourselves as needed (which likely also involves allocating more memory than necessary) and then somehow tricking a unicode object into using it (PyUnicode_FromUnicode seems to allow you to create an empty unicode object). So I really think it's worth waiting for a use case that shows how doubling the memory for unicode string serialization keeps someone from using lxml. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 08:47:10 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 08:47:48 2006 Subject: [lxml-dev] tounicode(), again Message-ID: <4462DDEE.2000702@gkec.informatik.tu-darmstadt.de> Hi all, we had a lengthy discussion yesterday and I guess we found a few use cases where tounicode() makes sense and a few counter-arguments why it might not be a good idea to expose that API at a similarly visible place as tostring(). I'm still convinced that it's a good idea to have that API, but as one of the arguments was that "people who don't understand unicode" (PeWDUUs) would be more likely to write broken code, I added this paragraph to api.txt, in the section that describes the unicode support of lxml. """ Note that the unicode strings returned by ``tounicode()`` never have an XML declaration and therefore do not specify an encoding. This makes it possible to pass them back into the lxml parsers. However, you may have to add a declaration yourself if you want to serialize such a unicode string to a byte stream later. In contrast, the ``tostring()`` function automatically adds a declaration as needed that reflects the encoding of the returned byte string. 
""" I hope that makes it clear enough for PeWDUUs what the advantage of using tostring() over tounicode() is and that you have to take care what you do with unicode strings. So, I propose leaving the API (and implementation) just as it is now. Regards, Stefan From howe at carcass.dhs.org Thu May 11 08:56:04 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu May 11 08:56:46 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> Message-ID: <86168992.20060511035604@carcass.dhs.org> Hello Stefan, Thursday, May 11, 2006, 2:59:44 AM, you wrote: > I just noticed that it's even worse. The PyUnicode_DecodeUTF8Stateful > function I had in mind (which is also only available in Python 2.4) doesn't > allow us to grow the unicode string, so it would also require copying. > So the only way I currently see to get the memory consumption down to about > the size of the result string is letting libxml2 do the conversion, write the > resulting chunks into a Python memory buffer through a custom libxml2 I/O > writer, growing the buffer ourselves as needed (which likely also involves > allocating more memory than necessary) and then somehow tricking a unicode > object into using it (PyUnicode_FromUnicode seems to allow you to create an > empty unicode object). 
> So I really think it's worth waiting for a use case that shows how doubling > the memory for unicode string serialization keeps someone from using lxml. Although it is not urgent, there is a common case where scripts run on servers with limited memory - it's typical to see 32 or 64 MB on some VPS servers. If the source is around 11Mb, we'll spend, on my FreeBSD 6.1 system (monitoring with "top"):

3260K - python interpreter
7988K - above + from lxml import etree
36812K - above + loaded tree
47528K - above + str
90084K - above + unicode

The commands I ran were:

>>> import cElementTree
>>> a = cElementTree.parse('a.xml')
>>> b = cElementTree.tostring(a.getroot())
>>> c = unicode(b)

By the way, the Element -> str conversion is *really* slow, it took almost a minute. And 11Mb is not such a huge size, and there is nothing else loaded in python. The test XML source is ASCII only; I would expect larger sizes for more non-ASCII chars. For web servers processing documents, often in threads, this could be a huge memory waster. At least the str() step could be avoided if what you mentioned could be implemented. Just for fun, let's see how cElementTree behaves on the same system:

3260K - python interpreter
6360K - above + import ElementTree
32804K - above + loaded tree
75920K - above + str
116M - above + unicode

On cElementTree, the Element -> str operation is *much* faster, about 10s, but I did not benchmark them. It is interesting to see that it uses much more memory, however. So, if I did everything right, loading this test 11Mb xml file as a unicode string will at some point use 90Mb of memory under lxml or 116Mb under cElementTree.
-- Best regards, Steve mailto:howe@carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 10:05:28 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 10:06:10 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <86168992.20060511035604@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> Message-ID: <4462F048.3070801@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > server limited memory - its typical to see 32 ou 64mb on some VPS > servers. If the source has like 11Mb, we'll spend, on my FreeBSD 6.1 > system (monitoring with "top"): > > 3260K - python interpreter > 7988K - above + from lxml import etree > 36812K - above + loaded tree > 47528K - above + str > 90084K - above + unicode > > The commands I ran were: > > >>> import cElementTree > >>> a = cElementTree.parse('a.xml') > >>> b = cElementTree.tostring(a.getroot()) > >>> c = unicode(b) Good idea, although not necessarily the absolute benchmark setup. 
I ran this on my machine:

3948K - Python interpreter
5532K - + from lxml import parse, tounicode
137M - + a = parse("big.xml") - [max: 156M]
180M - + c = tounicode(a.getroot()) - [max: 190M]

3948K - Python interpreter
5544K - + from lxml import parse, tostring
137M - + a = parse("big.xml") - [max: 156M]
148M - + b = tostring(a.getroot(), 'UTF-8') - [max: 153M]
190M - + c = unicode(b, 'UTF-8')
180M - + del b

Ok, well, that actually looks like both were exactly identical in terms of memory usage. I also tried that with cElementTree:

3948K - Python interpreter
6352K - + from cElementTree import parse, tostring
92M - + a = parse("big.xml")
137M - + b = tostring(a.getroot(), 'UTF-8') - [max: 150M]
180M - + c = unicode(b, 'UTF-8')
170M - + del b

The main reason for the big difference is that I'm on a 64bit machine (I assume you're on 32bit?). That doubles the size of pointers, and libxml2 uses tons of them (char*, doubly-linked trees, hash tables, ...).

> By the way, the Element -> str converting is *really* slow, took almost
> a minute.

I hope you meant (c)ElementTree, right? I posted some pretty interesting benchmark results on that lately. You can really watch how memory usage increases MB by MB... If you meant lxml, you should redo the test and make sure there was no swapping involved. These kinds of benchmarks should always read from RAM.

> And 11Mb is not such a huge size, and there is nothing more
> loaded on python. The case xml source is ascii only; I would expect
> larger sizes for more non-ascii chars. For web servers processing
> documents, many time in threads, this could be a huge memory waster. At
> least the str() step could be avoided if what you mentioned could be
> implemented.

Hmm, not really. The main memory hog is the unicode itself. If you waste 32bits for an ASCII character, that's 25 empty bits per character!
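The size argument above is easy to check with a little arithmetic. The sketch below assumes Python 3 and uses the fixed-width utf-16-le and utf-32-le encodings as a model for the 16-bit and 32-bit unicode builds discussed in the thread; current Python versions store strings differently, so this shows the 2006 trade-off rather than today's memory usage.

```python
# ASCII-only sample, similar in spirit to the ASCII test file in the thread.
ascii_text = "<item>plain ascii</item>" * 1000

utf8_size = len(ascii_text.encode("utf-8"))      # 1 byte per ASCII character
ucs2_size = len(ascii_text.encode("utf-16-le"))  # 2 bytes per character
ucs4_size = len(ascii_text.encode("utf-32-le"))  # 4 bytes per character

print(utf8_size, ucs2_size, ucs4_size)

# An ASCII character needs 7 bits, so a 32-bit slot leaves 32 - 7 unused.
unused_bits_per_char = 32 - 7
print(unused_bits_per_char)
```

For ASCII-heavy XML the wide representation is therefore two to four times the size of the UTF-8 serialization, whatever the serializer does.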
> So, if I did everything right, using cElementTree to load this test 11Mb > xml file as a unicode string will at a point use 90Mb of memory under > lxml or 116Mb under cElementTree. It's a little closer on my side. Still, what do we learn? Unicode strings are bad for large amounts of ASCII data and huge serializations should be done to files. Anything else? Changing the way serialization works will only change the results marginally. The in-memory tree itself is so huge that UTF-8 serialization only takes about an eighth of its size in additional memory (a quarter on your side). That's not really something to worry about, I'd say. Stefan From faassen at infrae.com Thu May 11 11:36:31 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu May 11 11:36:49 2006 Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de> <1365266204.20060510065257@carcass.dhs.org> <446204C1.6080703@infrae.com> <44621CD5.2040904@infrae.com> Message-ID: <4463059F.80608@infrae.com> Fredrik Lundh wrote: > Martijn Faassen wrote: > > >>* quick and dirty applications that mess about with the XML text on a >>textual level. I agree there are usually better ways to do the same >>thing in a clear way (XSLT, ElementTree API). > > or messing with the XML encoded text on the textual level. UTF-8 is care- > fully designed to allow things like this, of course... >>* web applications that use unicode inside (Zope 3, Silva on Zope 2) >>that want to present XML in a web page. In Zope 3, HTTP response text is >>initially a unicode string before it's encoded to UTF-8 and sent out to >>the network. > > > so how do you return images from Zope 3 ? It's a slightly different case; the image is a binary object by itself, and Zope 3 must take special measures somewhere so it doesn't try to encode. I'm talking about inclusion in a template (for instance a form).
> I can buy that a framework might automagically encode Unicode strings > as UTF-8 byte strings, but forcing the use of Unicode sounds like a really > lousy idea to me. I think the idea that Zope translates human-readable text to unicode (which most strings are) and back to UTF-8 again is a really great idea. It makes applications in Zope 3 unicode-aware without the user having to take special action, and remarkably free of unicode errors. It is possible (though I don't know the details) to force output of the whole page to be non-unicode already, and that's useful in special cases if you want XML over HTTP. In that case I wouldn't want to spit out unicode and have it recoded for efficiency reasons. >>* more generally, any application that uses a user interface framework >>that's unicode-aware. (it might be worthwhile to investigate that Java >>UI toolkits do. Java uses unicode strings everywhere and presumably also >>in the UI api, so how do they display XML text?) > > I doubt the set of applications that displays XML files as text is even > noticable compared to the set of applications that displays text from > the XML infoset... >>I think the main use cases are in the area of XML being displayed in the >>context of a UI environment that's unicode native. > > which, frankly, means that the use case is almost nonexistent. It's almost nonexistent, but not quite, and I have run into that very usecase a number of times in the last five years, most recently with SilvaFlexibleXML. I'm sure web UIs for XML databases also have this problem, and I've seen one for eXist (in Java) among other things. The question would be whether this usecase is strong enough to weigh against the drawbacks of tounicode(). 
>>This means that >>support for unicode in XML() is less necessary than 'tounicode()' >>(though I'm probably missing use cases), but since you already support >>the former in ElementTree as Stefan pointed out > > he's confused: the XML() function does not support Unicode (see my > followup mail). Ah, good point. Perhaps it'd be worthwhile in ElementTree to do an assert for plain strings when that function is called then, as it gives the appearance that passing unicode strings into XML() works sometimes, and doesn't work other times. It demonstrates the same behavior you show with f.write(etree.tounicode(..)); it sometimes appears to work but sometimes doesn't, depending on the contents of the string. You point this out as a problem in another mail. Regards, Martijn From faassen at infrae.com Thu May 11 11:39:22 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu May 11 11:39:37 2006 Subject: [lxml-dev] Re: Python unicode string support in lxml In-Reply-To: References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><446204C1.6080703@infrae.com><44621CD5.2040904@infrae.com> <77952733.20060510152627@carcass.dhs.org> Message-ID: <4463064A.5040609@infrae.com> Fredrik Lundh wrote: [snip] > Such a framework won't be able to handle output from the current ET serializer, > but it won't work with images, external templating systems, resources read from > disk, preformatted resources, etc, either. Most of these are resources by themselves. I'm talking about the use case where XML content is mixed with template content, for instance in a form or for display of some XML on a web page. Obviously the Zope 3 publisher is capable of handling images, but I wouldn't want to have to change my whole web application to work with encoded strings just because I want to display an XML snippet on my web pages.
Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 11:47:38 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu May 11 11:48:17 2006
Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree
Message-ID: <4463083A.90405@gkec.informatik.tu-darmstadt.de>

Hi all,

I just ran a slight variant (doesn't print, builds list) of Uche's OT benchmark on

* ElementTree 1.2.6
* cElementTree 1.0.5
* lxml (trunk/SVN 27065, timings for 0.9.2 are similar)

The code I used is attached, it runs against the 3.3M ot.xml file from http://www.ibiblio.org/bosak/xml/eg/religion.2.00.xml.zip

Here's the (somewhat bogus) output on my machine (bench, result, time):

bench_ET               120   1.59
bench_cET              120   0.31
bench_lxml_findall     120   0.32
bench_lxml_xpath       120   0.33
bench_lxml_xpath_all   120   0.26

I should note that lxml's findall() uses ElementTree's ElementPath implementation.

Here are the real numbers from timeit:

# python -m timeit -s "from otbench import *" "bench_ET()"
10 loops, best of 3: 1.36 sec per loop

# python -m timeit -s "from otbench import *" "bench_cET()"
10 loops, best of 3: 260 msec per loop

# python -m timeit -s "from otbench import *" "bench_lxml_findall()"
10 loops, best of 3: 295 msec per loop

# python -m timeit -s "from otbench import *" "bench_lxml_xpath()"
10 loops, best of 3: 261 msec per loop

# python -m timeit -s "from otbench import *" "bench_lxml_xpath_all()"
10 loops, best of 3: 214 msec per loop

I think it's pretty interesting how close the timings of cET.findall() and lxml.xpath() are.

Regards,
Stefan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: otbench.py Type: text/x-python Size: 1381 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060511/d80b8f1e/otbench.py From faassen at infrae.com Thu May 11 11:48:10 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu May 11 11:48:24 2006 Subject: [lxml-dev] Re: tounicode(), again In-Reply-To: <4462DDEE.2000702@gkec.informatik.tu-darmstadt.de> References: <4462DDEE.2000702@gkec.informatik.tu-darmstadt.de> Message-ID: <4463085A.3050808@infrae.com> Stefan Behnel wrote: > Hi all, > > we had a lengthy discussion yesterday and I guess we found a few use cases > where tounicode() makes sense and a few counter-arguments why it might not be > a good idea to expose that API at a similarly visible place as tostring(). > > I'm still convinced that it's a good idea to have that API, but as one of the > arguments was that "people who don't understand unicode" (PeWDUUs) would be > more likely to write broken code, I added this paragraph to api.txt, in the > section that describes the unicode support of lxml. > > """ > Note that the unicode strings returned by ``tounicode()`` never have an XML > declaration and therefore do not specify an encoding. This makes it possible > to pass them back into the lxml parsers. However, you may have to add a > declaration yourself if you want to serialize such a unicode string to a byte > stream later. In contrast, the ``tostring()`` function automatically adds a > declaration as needed that reflects the encoding of the returned byte string. > """ > > I hope that makes it clear enough for PeWDUUs what the advantage of using > tostring() over tounicode() is and that you have to take care what you do with > unicode strings. Maybe we want to alter this to something like this: """ Normally you use tostring() with an encoding argument (typically UTF-8) to create XML, which is typically a stream of bytes. You can then safely save it to a file, pass it over the network, etc. 
If you're not sure about the way to go, use tostring(). Using tostring() with UTF-8 is also typically faster. In some exceptional use cases it might be useful to obtain XML in a Python unicode string, in which case you can use tounicode(). Only use this if you are confident in your understanding of Python unicode and that your application needs serialized XML in a Python unicode string. """ this way we relate it to use cases, and make clear that tostring() is the way to go for most people. This way people who do not understand what's up with unicode still get a clear hint that they're not supposed to use tounicode(), and that it's even faster not to do so. :) Regards, Martijn From faassen at infrae.com Thu May 11 12:21:29 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu May 11 12:21:44 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <4463083A.90405@gkec.informatik.tu-darmstadt.de> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> Message-ID: <44631029.5090900@infrae.com> Stefan Behnel wrote: > I just ran a slight variant (doesn't print, builds list) of Uche's OT benchmark on And another difference is that you don't actually measure the overhead of the Python interpreter startup, correct? :) > * ElementTree 1.2.6 > * cElementTree 1.0.5 > * lxml (trunk/SVN 27065, timings for 0.9.2 are similar) > > The code I used is attached, it runs against the 3.3M ot.xml file from > http://www.ibiblio.org/bosak/xml/eg/religion.2.00.xml.zip [snip] > I think it's pretty interesting how close the timings of cET.findall() and > lxml.xpath() are. Cool. Impressive for cET.findall(), as it's using a Python implementation of the search algorithm - the same one as used by ElementTree, last I checked. I'm pleasantly surprised lxml_findall() now appears to have reached parity with cET in this test; it used to be that cET was quite a bit faster in my old measurements.
Can you identify which tuning effort had this effect, or is this due to a slightly different benchmark? Last I did a similar check we were still half the speed: http://faassen.n--tree.net/blog/view/weblog/2005/01/17/0 Heh, the Uche quote on that page has been proven wrong, right? :) This reminds me, I was talking to someone who was interested in getting a function that would just give the first xpath result by the way - equivalent to find(), I think. Often you know you'll find just one thing, and you want it to return that, instead of having to grab things from the resulting list of nodes. Of course underneath it'd still do the same search, so this would be purely a convenience API, not a performance gain. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 12:59:27 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 13:00:08 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44631029.5090900@infrae.com> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> Message-ID: <4463190F.5070705@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > This reminds me, I was talking to someone who was interested in getting > a function that would just give the first xpath result by the way - > equivalent to find(), I think. Often you know you find just one thing, > and you want it to return that, instead of having to grap things from > the resulting list of nodes. Of course underneath it'd still do the same > search, so this would be purely a convenience API, not a performance gain. Yeah, sadly, libxml2 doesn't have any routines for stopping XPath once a result has been matched. But then, XPath is pretty complex, how do you actually know when you have a result that will be passed through to the caller? But then, that's basically what _elementpath does, too: find everything and return the first result.
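The "find everything and return the first result" approach described above can be sketched roughly as follows. This is only an illustration: `first_match` is a hypothetical helper name, and the standard library's ElementTree `findall()` is used so the snippet is self-contained. With lxml, a call to the element's `xpath()` method would take the place of `findall()`.

```python
import xml.etree.ElementTree as ET


def first_match(element, path):
    """Hypothetical convenience helper: run the full search, then
    return only the first hit, or None if nothing matched."""
    matches = element.findall(path)  # the complete search still happens
    return matches[0] if matches else None


doc = ET.fromstring("<root><item id='a'/><item id='b'/></root>")
first = first_match(doc, "item")       # first <item> in document order
missing = first_match(doc, "nothing")  # no match, so None
```

As noted in the thread, this is a convenience only. The search underneath is unchanged, so there is no performance gain over taking the first element of the result list by hand.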
I wouldn't mind having something like xpath1() to return the first hit only. On the other hand, are there any substantial differences in the ElementTree Path syntax with respect to an XPath subset that should keep us from having findall() etc. call XPath directly? Stefan From elephantum at cyberzoo.ru Thu May 11 13:05:22 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Thu May 11 13:06:06 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <4463190F.5070705@gkec.informatik.tu-darmstadt.de> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <4463190F.5070705@gkec.informatik.tu-darmstadt.de> Message-ID: <1147345522.6436.4.camel@zoo.yandex.ru> On Thu, 2006-05-11 at 12:59 +0200, Stefan Behnel wrote: > Hi Martijn, > > Martijn Faassen wrote: > > This reminds me, I was talking to someone who was interested in getting > > a function that would just give the first xpath result by the way - > > equivalent to find(), I think. Often you know you find just one thing, > > and you want it to return that, instead of having to grap things from > > the resulting list of nodes. Of course underneath it'd still do the same > > search, so this would be purely a convenience API, not a performance gain. > > > Yeah, sadly, libxml2 doesn't have any routines for stopping XPath once a > result has been matched. But then, XPath is pretty complex, how do you > actually know when you have a result that will be passed through to the caller? > > But then, that's basically what _elementpath does, too: find everything and > return the first result. I wouldn't mind having something like xpath1() to > return the first hit only. > > On the other hand, are there any substantial differences in the ElementTree > Path syntax with respect to an XPath subset that should keep us from having > findall() etc. call XPath directly? I've already implemented it some time ago in branch xpath-find. you can take a look. 
there were no failed tests, as far as I remember. From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 13:12:45 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 13:13:25 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44631029.5090900@infrae.com> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> Message-ID: <44631C2D.7050505@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> I just ran a slight variant (doesn't print, builds list) of Uche's OT >> benchmark on > > And another difference is that you don't actually measure the overhead > of the Python interpret startup, correct? :) Uh, well, was that supposed to be in there, too? :) > I'm pleasantly surprised lxml_findall() now > appears to have reached parity with cET in this test; it used to be that > cET was quite a bit faster in my old measurements. Can you identify > which tuning effort had this effect, or this due to a slightly different > benchmark? Last I did a similar check we were still half the speed: I'm not quite sure. I changed a lot of bits everywhere, in the XPath code, the proxy code and the Element creation code. Guess it was a mixture of all of them. There's quite a bit of fast-paths in there now that make a difference when you ask for a lot of elements. BTW, note that there is even an element class lookup for ns/name involved in each element creation. Thus, the tests would yield similar timings with custom per-tag Python classes for elements. That's another thing ElementTree can't give you. > http://faassen.n--tree.net/blog/view/weblog/2005/01/17/0 > > Heh, the Uche quote on that page has been proven wrong, right? :) Totally. What does that guy know anything about, anyway? :] (uh, he's not listening, is he?) :) I'd actually like to see something about lxml on "xml.com". 
lxml has been in a pretty usable state for quite a while now and is even nearing feature-completeness. Maybe we should just get out 1.0 and send Uche a friendly mail. *wink* Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 13:59:14 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 13:59:55 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <1147345522.6436.4.camel@zoo.yandex.ru> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <4463190F.5070705@gkec.informatik.tu-darmstadt.de> <1147345522.6436.4.camel@zoo.yandex.ru> Message-ID: <44632712.30102@gkec.informatik.tu-darmstadt.de> Hi Andrey, Andrey Tatarinov schrieb: > On Thu, 2006-05-11 at 12:59 +0200, Stefan Behnel wrote: >> are there any substantial differences in the ElementTree >> Path syntax with respect to an XPath subset that should keep us from having >> findall() etc. call XPath directly? > > I've already implemented it some time ago in branch xpath-find. > > you can take a look. Right, the Clark notation. From what I see in your implementation, that's basically the same as the etree.ETXPath wrapper for the XPath class. So I'll check if we can replace it and throw _elementpath.py out. Depends on the performance, though. ETXPath also uses RegExps to split up the path expression. > there were no failed tests, as far as I remember. Yeah, well, that's not the surest thing for something as complex as XPath. We don't test expressions, just the API. Maybe the ET self-tests help out here. 
Thanks for the hint, Stefan From howe at carcass.dhs.org Thu May 11 14:31:51 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu May 11 14:32:35 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <4462F048.3070801@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> Message-ID: <1041322718.20060511093151@carcass.dhs.org> Hello Stefan, Thursday, May 11, 2006, 5:05:28 AM, you wrote: [...] > The main reason for the big difference is that I'm on a 64bit machine (I > assume you're on 32bit?). That doubles the size of pointers, and libxml2 uses > tons of them (char*, double-linked trees, hash-tables, ...). Yes, 32 bits, FreeBSD 6.1. > I hope you meant (c)ElementTree, right? I posted some pretty interesting > benchmark results on that lately. You can really look how memory usage > increases MB by MB... > If you meant lxml, you should redo the test and make sure there was no > swapping involved. These kind of benchmarks should always read from RAM. No, I meant lxml, and yes, I could have made it read from RAM, but I think it did swap. It was not a very controlled test, I admit, just something quick I did at my Python prompt. I just ran the test again and the results were similar. There is plenty of RAM available, however. > It's a little closer on my side. Still, what do we learn?
Unicode strings are > bad for large amounts of ASCII data and huge serializations should be done to > files. Anything else? > Changing the way serialization works will only change the results marginally. > The in-memory tree itself is so huge that UTF-8 serialization only takes about > an eighths of its size in additional memory (a 4th on your side). That's not > really something to worry about, I'd say. That is not something so important, indeed. It would be nice if it was something easy to implement, but not otherwise. This was the main reason I was interested about .tounicode(). -- Best regards, Steve mailto:howe@carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 14:43:16 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 14:43:55 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44632712.30102@gkec.informatik.tu-darmstadt.de> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <4463190F.5070705@gkec.informatik.tu-darmstadt.de> <1147345522.6436.4.camel@zoo.yandex.ru> <44632712.30102@gkec.informatik.tu-darmstadt.de> Message-ID: <44633164.4000205@gkec.informatik.tu-darmstadt.de> Hi again, Stefan Behnel wrote: > Andrey Tatarinov wrote: >> On Thu, 2006-05-11 at 12:59 +0200, Stefan Behnel wrote: >>> are there any substantial differences in the ElementTree >>> Path syntax with respect to an XPath subset that should keep us from having >>> findall() etc. call XPath directly? >> I've already implemented it some time ago in branch xpath-find. >> >> you can take a look. > > > Right, the Clark notation. From what I see in your implementation, that's > basically the same as the etree.ETXPath wrapper for the XPath class. > > So I'll check if we can replace it and throw _elementpath.py out. Depends on > the performance, though. ETXPath also uses RegExps to split up the path > expression. 
>
>> there were no failed tests, as far as I remember.
>
> Yeah, well, that's not the surest thing for something as complex as XPath. We
> don't test expressions, just the API. Maybe the ET self-tests help out here.

... and they do:

------------------------------------------
**********************************************************************
File "/home/me/source/Python/lxml/lxml-HEAD/selftest.py", line 224, in selftest.bad_find
Failed example:
    elem.findall("/tag")
Expected:
    Traceback (most recent call last):
    SyntaxError: cannot use absolute path on element
Got:
    []
**********************************************************************
File "/home/me/source/Python/lxml/lxml-HEAD/selftest.py", line 227, in selftest.bad_find
Failed example:
    elem.findall("../tag")
Expected:
    Traceback (most recent call last):
    SyntaxError: unsupported path syntax (..)
Got:
    []
**********************************************************************
File "/home/me/source/Python/lxml/lxml-HEAD/selftest.py", line 230, in selftest.bad_find
Failed example:
    elem.findall("section//")
Expected:
    Traceback (most recent call last):
    SyntaxError: path cannot end with //
Got:
    Traceback (most recent call last):
      File "/home/me/source/Python/lxml/lxml-HEAD/src/doctest.py", line 1256, in __run
        compileflags, 1) in test.globs
      File "", line 1, in ?
        elem.findall("section//")
      File "etree.pyx", line 909, in etree._Element.findall
      File "xpath.pxi", line 215, in etree.ETXPath.__init__
      File "xpath.pxi", line 171, in etree.XPath.__init__
      File "xpath.pxi", line 65, in etree.XPathEvaluatorBase._raise_parse_error
    XPathSyntaxError: Error in xpath expression.
**********************************************************************
File "/home/me/source/Python/lxml/lxml-HEAD/selftest.py", line 233, in selftest.bad_find
Failed example:
    elem.findall("tag[tag]")
Expected:
    Traceback (most recent call last):
    SyntaxError: expected path separator ([)
Got:
    []
**********************************************************************
1 items had failures:
   4 of   5 in selftest.bad_find
------------------------------------------

That's a pretty good rate... :)

Hmm, everything else passes, but those are gonna be hard to emulate without a real pre-parser (which would cost us performance for compat-only). XPath is just too powerful to strip it down easily...

Guess we should just leave it as is for now. Calling _elementpath is sufficiently fast by now, writing our own parser is not worth it.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 14:55:47 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu May 11 14:56:26 2006
Subject: [lxml-dev] Re: Re: Python unicode string support in lxml
In-Reply-To: <1041322718.20060511093151@carcass.dhs.org>
References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org>
Message-ID: <44633453.8020101@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> Thursday, May 11, 2006, 5:05:28 AM, you wrote:
>> I hope you meant (c)ElementTree, right?
I posted some pretty interesting >> benchmark results on that lately. You can really look how memory usage >> increases MB by MB... >> If you meant lxml, you should redo the test and make sure there was no >> swapping involved. These kind of benchmarks should always read from RAM. > > No, I meant lxml, and yes, I could have made it read from ram, but I > think it did swap. It was not a very controlled test, I admit, just > something quick I made on my python prompt. I just ran the test again > and the results were similar. There are is plenty of ram available, > however. Hmm, interesting. Could you run the I/O tests from the benchmark suite (trunk version) and post the results? My results here are that lxml is about 20-50 times faster on serialization than cET or ET. I would be surprised if that was so much different on your machine. Try: cd lxml python bench.py -i -a tostring_utf8 tostring_utf16 tostring_utf8_unicode_XML write_utf8_parse_stringIO (the latter all in one line, '-i' adds 'src' to the PYTHONPATH, '-a' runs with lxml, cET and ET if installed) It's gonna take a while and the output is rather lengthy. 
The benchmarks run this, which is more or less what we talk about here:

----------------------------------
@with_text(text=True, utext=True)
def bench_tostring_utf8(self, root):
    self.etree.tostring(root, 'UTF-8')

@with_text(text=True, utext=True)
def bench_tostring_utf16(self, root):
    self.etree.tostring(root, 'UTF-16')

@with_text(text=True, utext=True)
def bench_tostring_utf8_unicode_XML(self, root):
    xml = unicode(self.etree.tostring(root, 'UTF-8'), 'UTF-8')
    self.etree.XML(xml)

@with_text(text=True, utext=True)
def bench_write_utf8_parse_stringIO(self, root):
    f = StringIO()
    self.etree.ElementTree(root).write(f, 'UTF-8')
    f.seek(0)
    self.etree.parse(f)
----------------------------------

Thanks,
Stefan

From faassen at infrae.com Thu May 11 16:25:45 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu May 11 16:26:00 2006
Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree
In-Reply-To: <44633164.4000205@gkec.informatik.tu-darmstadt.de>
References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <4463190F.5070705@gkec.informatik.tu-darmstadt.de> <1147345522.6436.4.camel@zoo.yandex.ru> <44632712.30102@gkec.informatik.tu-darmstadt.de> <44633164.4000205@gkec.informatik.tu-darmstadt.de>
Message-ID: <44634969.6050806@infrae.com>

Stefan Behnel wrote:
[snip]
> Guess we should just leave it as is for now. Calling _elementpath is
> sufficiently fast by now, writing our own parser is not worth it.

Right, I was going to answer in this thread with the same conclusion I drew before, which was exactly yours. Trying to emulate the .find behavior on top of XPath is not worth doing in my opinion. There are the parsing issues you mention, and the danger of introducing subtle incompatibilities (I'd need to look up the old thread to check what things I found out then). It's probably more worthwhile to invest that energy in speeding up find by writing a native implementation.
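One of the parsing differences mentioned above can be reproduced with the standard library's ElementTree, as a small sketch. The document below is made up for illustration, and only the absolute-path case from the self-test is shown, since that is the one the ElementPath implementation in current Python versions still rejects on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the ET self-tests.
root = ET.fromstring("<body><tag/><section><tag/></section></body>")

# A relative path works as usual and finds the direct child only.
relative_hits = root.findall("tag")

# ElementPath refuses an absolute path on an element, whereas a plain
# XPath engine would accept "/tag" and simply return its result set.
try:
    root.findall("/tag")
    absolute_path_rejected = False
except SyntaxError:
    absolute_path_rejected = True
```

An XPath-backed findall() would have to detect such expressions itself before handing them to the XPath engine, which is the pre-parsing cost discussed in the thread.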
:) Regards, Martijn From faassen at infrae.com Thu May 11 16:35:17 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu May 11 16:35:30 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44631C2D.7050505@gkec.informatik.tu-darmstadt.de> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <44631C2D.7050505@gkec.informatik.tu-darmstadt.de> Message-ID: <44634BA5.4070100@infrae.com> Stefan Behnel wrote: > Martijn Faassen wrote: > >>Stefan Behnel wrote: >> >>>I just ran a slight variant (doesn't print, builds list) of Uche's OT >>>benchmark on >> >>And another difference is that you don't actually measure the overhead >>of the Python interpret startup, correct? :) > > Uh, well, was that supposed to be in there, too? :) It was in Uche's published benchmark if I recall correctly. Fredrik and I slightly disagreed. :) >>I'm pleasantly surprised lxml_findall() now >>appears to have reached parity with cET in this test; it used to be that >>cET was quite a bit faster in my old measurements. Can you identify >>which tuning effort had this effect, or this due to a slightly different >>benchmark? Last I did a similar check we were still half the speed: > > I'm not quite sure. I changed a lot of bits everywhere, in the XPath code, the > proxy code and the Element creation code. Guess it was a mixture of all of > them. There's quite a bit of fast-paths in there now that make a difference > when you ask for a lot of elements. .findall() does ask for lots of elements, so that might be helping then. Pretty good - I didn't expect there was such a gain to be made now, and we've got cElementTree parity now for this performance measurement. Perhaps we should collect all this benchmarking and check them, and then write some article for the lxml website... It'd be a bit of work to make that a solid piece of text, of course. People do pick up on benchmark figures in a rather lazy way sometimes. 
Last year, as I was developing lxml, I honestly said so when it was slower than cElementTree in my limited measurements, and I saw that referred to later as "According to Martijn Faassen, libxml might not be that fast with python anyway". For future google searchers: I didn't actually say that! lxml (and libxml2) is plenty fast with Python! So, with a benchmark page on the lxml site, we might get "Stefan Behnel says lxml is faster than anything all the time!" in people's heads instead. Note for future google searchers: Stefan never actually said that! I just made it up! But, lxml is plenty fast with Python! :) > BTW, note that there is even an element class lookup for ns/name involved in > each element creation. Thus, the tests would yield similar timings with custom > per-tag Python classes for elements. That's another thing ElementTree can't > give you. >>http://faassen.n--tree.net/blog/view/weblog/2005/01/17/0 >> >>Heh, the Uche quote on that page has been proven wrong, right? :) > > Totally. What does that guy know anything about, anyway? :] > (uh, he's not listening, is he?) :) :) I like lots of what Uche's done, it's just we had a silly debate about benchmarks. Benchmarks unfortunately tend to invite such discussions. > I'd actually like to see something about lxml on "xml.com". lxml has been in a > pretty usable state for quite a while now and is even nearing > feature-completeness. Maybe we should just get out 1.0 and send Uche a > friendly mail. *wink* Yes. Unfortunately xml.com slightly changed its focus since last year, which appears to mean fewer Python-related articles. Still, it's worth a shot contacting him.
Regards, Martijn From howe at carcass.dhs.org Thu May 11 19:24:25 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu May 11 19:25:08 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <44633453.8020101@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org> <44633453.8020101@gkec.informatik.tu-darmstadt.de> Message-ID: <1031057011.20060511142425@carcass.dhs.org> Hello Stefan, Thursday, May 11, 2006, 9:55:47 AM, you wrote: > Hmm, interesting. Could you run the I/O tests from the benchmark suite (trunk > version) and post the results? My results here are that lxml is about 20-50 > times faster on serialization than cET or ET. I would be surprised if that was > so much different on your machine. [...] The results are attached, supporting that lxml is faster, but I suspect the slowdown happens only on very large xml files - the larger, the worse. How large is the xml stream in this test? Remember I tested on an 11Mb file. This is probably related to the way Python allocates and handles strings - appending is slow and expensive. -- Best regards, Steve mailto:howe@carcass.dhs.org -------------- next part -------------- A non-text attachment was scrubbed...
Name: bench.log Type: application/octet-stream Size: 11456 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060511/e2e8d737/bench-0001.obj From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 19:35:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 19:35:59 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44634BA5.4070100@infrae.com> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <44631C2D.7050505@gkec.informatik.tu-darmstadt.de> <44634BA5.4070100@infrae.com> Message-ID: <446375C5.1000203@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Perhaps we should collect all this benchmarking and check them, and then > write some article for the lxml website... It'd be a bit of work to make > that a solid piece of text, of course. Sure, getting numbers is easy. Making sense out of them and making them understandable to others is the trick. You could play with bench.py a bit (it's really easy) and see what other parts of the API would be interesting to test. That way, we'd get a broader idea about what is competitive or faster and what isn't. I wrote many of the benchmarks to see how badly lxml behaves in the spots where I knew it would be bad. So, few of them show where it excels. > So, with a benchmark page on the lxml site, we might get "Stefan Behnel > says lxml is faster than anything all the time!" in people's heads Sure. That's true anyway, right? 
;) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 11 22:37:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu May 11 22:38:04 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <1031057011.20060511142425@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org> <44633453.8020101@gkec.informatik.tu-darmstadt.de> <1031057011.20060511142425@carcass.dhs.org> Message-ID: <4463A077.4070200@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > The results attached, supporting that lxml is faster Just like on my side. > but I suspect the > slowdown happens only on very large xml files - the larger, the worst. > How large is the xml stream on this test ? Remember I test on a 11Mb > file. This is probably related to the way Python allocates and handle > strings - appending is slow and expensive. Admittedly, the largest was only about 1M. Otherwise, the benchmarks would take too long to run, especially on ET. I have now changed bench.py to use longer strings; that should not make a difference in most tests but should give us better numbers for tree copying and serialization. You can also now pass the options -l and -L (large or LARGE trees). Anyway, it can't be related to Python. Python just gets a char* and a size and can then happily allocate its final buffer and memcpy the data into it.
No appending at all. Maybe it's libxml2 then, but I really wouldn't know why... If you want, you can run the modified bench script again, and if you have enough RAM, you can pass the -L option to see if that makes a difference. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 08:00:29 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 08:02:16 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44631029.5090900@infrae.com> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> Message-ID: <4464247D.8050707@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Cool. Impressive for cET.findall(), as it's using a Python > implementation of the search algorithm - the same one as used by > ElementTree, last I checked. Actually that's not really surprising. cET has the Python objects readily available and only accesses them. We have to generate them on the fly. And ElementPath is simple enough to be fast. That's like comparing a spoon with a swiss knife. > I'm pleasantly surprised lxml_findall() now > appears to have reached parity with cET in this test; it used to be that > cET was quite a bit faster in my old measurements. Can you identify > which tuning effort had this effect, or this due to a slightly different > benchmark? I figured out what it was. _elementpath.py calls element.getiterator(). lxml originally collected all children in a list to emulate that. Now it has a real iterator implementation. It could be even faster, as _elementpath.py nicely asks for the tag it is looking for. Currently, lxml filters /behind/ the iterator, so all elements are still generated (and we're still close to parity!). Maybe I should check what difference it makes if we filter in plain C...
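(A minimal sketch of the two filtering strategies described above, using the standard library's xml.etree.ElementTree as a stand-in for lxml. The helper names build_tree, filter_behind_iterator and filter_inside_iterator are made up for illustration, and iter() is the current spelling of getiterator(); this only shows the difference in where the tag test happens, not how lxml implements it.)

```python
import xml.etree.ElementTree as ET


def build_tree(n_items=1000):
    """Build a small tree that alternates between two tag names."""
    root = ET.Element("root")
    for i in range(n_items):
        parent = ET.SubElement(root, "item" if i % 2 else "other")
        ET.SubElement(parent, "child").text = str(i)
    return root


def filter_behind_iterator(root, tag):
    # Every element is handed to Python first; non-matching ones
    # are thrown away afterwards.
    return [el for el in root.iter() if el.tag == tag]


def filter_inside_iterator(root, tag):
    # The iterator is told which tag to look for, so it only yields
    # matching elements.
    return list(root.iter(tag))


root = build_tree()
slow = filter_behind_iterator(root, "item")
fast = filter_inside_iterator(root, "item")
print(len(slow), len(fast))
```

Both calls return the same elements in the same order; the second one simply avoids creating and testing the elements nobody asked for.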
Stefan From fredrik at pythonware.com Wed May 10 22:27:09 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri May 12 08:48:49 2006 Subject: [lxml-dev] Re: Re: Re: Python unicode string support in lxml References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org><58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> Message-ID: Steve Howe wrote: > > Huh? Have you noticed that tostring takes an optional encoding argument? > > Won't that waste exactly the same resources as this ? > > xml = etree.tostring(element).encode(encoding) tostring returns encoded data. did you mean xml = etree.tounicode(element).encode(encoding) ? if so, the answer is no -- the serializer encodes the infoset piece by piece, using different approaches for different parts of the infoset (at least that's what the ET serializer does; not sure about lxml). there's some overhead from cStringIO, though, but that should be far from the 3x/5x worst-case overhead in your example. 
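(A small sketch of the two call patterns being compared, again with the standard library's xml.etree.ElementTree standing in for lxml. In current ElementTree the unicode serialization is spelled tostring(..., encoding="unicode") rather than tounicode(); the sample document and variable names are assumptions for illustration.)

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
ET.SubElement(root, "p").text = "caf\u00e9"

# Let the serializer encode while it writes: the result is already bytes.
encoded_directly = ET.tostring(root, encoding="iso-8859-1")

# Serialize to a unicode string first, then encode the whole string
# in a second step.
text = ET.tostring(root, encoding="unicode")
encoded_afterwards = text.encode("iso-8859-1")

print(type(encoded_directly).__name__, type(text).__name__)
```

The first form hands the target encoding to the serializer; the second builds an intermediate unicode string of the whole document before encoding it.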
(and for western users, the worst case is quite often the typical case) From howe at carcass.dhs.org Fri May 12 10:04:15 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri May 12 10:06:00 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <4463A077.4070200@gkec.informatik.tu-darmstadt.de> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org> <44633453.8020101@gkec.informatik.tu-darmstadt.de> <1031057011.20060511142425@carcass.dhs.org> <4463A077.4070200@gkec.informatik.tu-darmstadt.de> Message-ID: <1695956808.20060512050415@carcass.dhs.org> Hello Stefan, Thursday, May 11, 2006, 5:37:11 PM, you wrote: > Admittedly, the largest was only about 1M. Otherwise, the benchmarks would > take too long to run, especially on ET. I changed bench.py to use longer > strings now, that should not make a difference in most tests but give us > better numbers of tree copying and serialization. You can also now pass the > options -l and -L (large or LARGE trees). > Anyway, it can't be related to Python. Python just get's a char* and a size > and can then happily allocate its final buffer to memcpy it. No appending at all. > Maybe it's libxml2 then, but I really wouldn't know why... > If you want, you can run the modified bench script again, and if you have > enough RAM, you can pass the -L option to see if that makes a difference. 
Sure, anything I can help with. I've run the tests with "-L", and they took quite a while to perform and even crashed after a point. Since I'm in a hurry lately I did not have the time to see why, but the results are attached. Note that some test results were really slow compared to ET and cET. Ex:

lxe: tostring_utf16 (SA T3 ) 80626.3903 msec/pass, best of ( 81157.4800 80631.1504 80626.3903 )
cET: tostring_utf16 (SA T3 ) 3305.6618 msec/pass, best of ( 3305.6618 3332.7984 3310.7507 )
ET : tostring_utf16 (SA T3 ) 3413.7482 msec/pass, best of ( 3418.9271 3413.7482 3415.6650 )

lxe: tostring_utf8 (UA T3 ) 37834.8396 msec/pass, best of ( 37834.8396 37970.8700 37908.7146 )
cET: tostring_utf8 (UA T3 ) 2880.5753 msec/pass, best of ( 2880.5753 2886.5763 2885.0215 )
ET : tostring_utf8 (UA T3 ) 2981.1059 msec/pass, best of ( 3000.1362 2981.1059 2988.5129 )

The server is at your disposal if you want to use it. -- Best regards, Steve mailto:howe@carcass.dhs.org From howe at carcass.dhs.org Fri May 12 10:09:10 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri May 12 10:10:27 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <1695956808.20060512050415@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org> <44633453.8020101@gkec.informatik.tu-darmstadt.de> <1031057011.20060511142425@carcass.dhs.org>
<4463A077.4070200@gkec.informatik.tu-darmstadt.de> <1695956808.20060512050415@carcass.dhs.org> Message-ID: <1553877080.20060512050910@carcass.dhs.org> Hello, Sorry, I forgot the attachment - here it is. -- Best regards, Steve mailto:howe@carcass.dhs.org -------------- next part -------------- A non-text attachment was scrubbed... Name: bench.log Type: application/octet-stream Size: 9395 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060512/cfa845b2/bench-0001.obj From camior at gmail.com Fri May 12 11:34:23 2006 From: camior at gmail.com (David Sankel) Date: Fri May 12 11:35:02 2006 Subject: [lxml-dev] Building lxml on windows: a guide. Message-ID: Note: This is my first post on this list. There was an absence of windows binaries for lxml on the net (all the sites that had older versions were down). Unfortunately, there wasn't enough information to make the process of building on windows easy. I wrote this guide to solve both these problems. I hope that the information here can at worst help out others trying to build on windows and at best get included in the documentation and maintenance. I also have a windows installer created for lxml 0.9.1 that is statically built with iconv-1.9.1, libxml2 2.6.23, libxslt 1.1.15, and zlib 1.2.3. If there is some ftp site or something I can upload it to, in order to give wide access to it, please let me know what to do. David Sankel Building lxml on windows =================== First you'll need to download the latest version of all the required files. Download them all to the same directory. * lxml: Available from http://codespeak.net/lxml/ * iconv, libxml2, libxslt, and zlib are all available from xmlsoft.org. The place to go on the ftp site is ftp://xmlsoft.org/libxml2/win32.
Your directory should now have something like the following files in it:

    iconv-1.9.1.win32.zip
    libxml2-2.6.23.win32.zip
    libxslt-1.1.15.win32.zip
    lxml-0.9.1.tgz
    zlib-1.2.3.win32.zip

Now extract each of those files in the _same_ directory. Now you should have something like this:

    iconv-1.9.1.win32/
    iconv-1.9.1.win32.zip
    libxml2-2.6.23.win32/
    libxml2-2.6.23.win32.zip
    libxslt-1.1.15.win32/
    libxslt-1.1.15.win32.zip
    lxml-0.9.1/
    lxml-0.9.1.tgz
    zlib-1.2.3.win32/
    zlib-1.2.3.win32.zip

Go to the lxml-0.9.1 directory and edit setup.py. There should be a section that looks like this::

    ext_modules = [ Extension(
        "lxml.etree",
        sources = sources,
        extra_compile_args = ['-w'] + flags('xslt-config --cflags'),
        extra_link_args = flags('xslt-config --libs')
        )],

Change it to this (Warning: make sure you are using version numbers that correspond to your downloads) (Note: the _a portion of the library names means that we are statically linking. If you want to use DLLs (why?), link to the DLL version of the libraries)::

    ext_modules = [ Extension(
        "lxml.etree",
        sources = sources,
        extra_compile_args = [
            "-w",
            "-I..\\libxml2-2.6.23.win32\\include",
            "-I..\\libxslt-1.1.15.win32\\include",
            "-I..\\zlib-1.2.3.win32\\include",
            "-I..\\iconv-1.9.1.win32\\include"
            ],
        extra_link_args = [
            "..\\libxml2-2.6.23.win32\\lib\\libxml2_a.lib",
            "..\\libxslt-1.1.15.win32\\lib\\libxslt_a.lib",
            "..\\zlib-1.2.3.win32\\lib\\zlib.lib",
            "..\\iconv-1.9.1.win32\\lib\\iconv_a.lib"
            ]
        )],

Now you should be able to use setup.py and everything should work well. "python setup.py bdist_wininst" will create a windows installer in the pkg directory. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20060512/94726f64/attachment.htm From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 11:50:51 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 11:51:27 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: References: Message-ID: <44645A7B.7030304@gkec.informatik.tu-darmstadt.de> Hi David, David Sankel wrote: > There was an absence of windows > binaries for lxml on the net (all the sites that had older versions were > down). "www.python.org" is rarely down: http://www.python.org/pypi/lxml/0.9.1 That's not the most flashy recent version, but it will do (it's not even two months old). And it has windows binaries (which we do not have for 0.9.2 yet as it was released fairly recently). The easiest way to install is by passing "lxml==0.9.1" to "easy_install". Even if you don't want to install EasyInstall, you can still download the egg and unzip it to a suitable directory by hand. We have had about 90 downloads of the egg so far without any feedback, so I guess there are not too many problems with it. Stefan From faassen at infrae.com Fri May 12 13:27:57 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri May 12 13:28:03 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: References: Message-ID: <4464713D.90008@infrae.com> Hi David, David Sankel wrote: > This is my first post on this list. There was an absence of windows > binaries > for lxml on the net (all the sites that had older versions were down). > Unfortunately, there wasn't enough information to make the process of > building on windows easy. I wrote this guide to solve both these > problems. > I > hope that the information here can at worst help out others trying to build > on windows and at best get included in the documentation and maintenance. Thanks! I'll check this in.
> I also have a windows installer created for lxml 0.9.1 that is staticly > built with iconv-1.9.1, libxml2 2.6.23, libxslt 1.1.15, and zlib 1.2.3. If > there is some ftp site or something I can upload it to, in order to give > wide access to it, please let me know what to do. Do I understand that you can just install this, import lxml and everything works, without having to worry about installing libxml2 on windows? That'd be great! If you drop me a version I shall put it online in the Python cheeseshop. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 13:34:31 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 13:35:07 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: <4464713D.90008@infrae.com> References: <4464713D.90008@infrae.com> Message-ID: <446472C7.4020702@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > David Sankel wrote: > >> This is my first post on this list. There was an absence of windows >> binaries >> for lxml on the net (all the sites that had older versions were down). >> Unfortunately, there wasn't enough information to make the process of >> building on windows easy. I wrote this guide to solve both these >> problems. > >> I >> hope that the information here can at worst help out others trying to >> build >> on windows and at best get included in the documentation and maintenance. > > Thanks! I'll check this in. Please, don't. Last thing I heard was that lxml compiles just fine on Windows. I really don't see why we should tell people to install libraries in weird places and mess around in setup.py. Stefan From faassen at infrae.com Fri May 12 14:13:50 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri May 12 14:13:56 2006 Subject: [lxml-dev] Building lxml on windows: a guide. 
In-Reply-To: <446472C7.4020702@gkec.informatik.tu-darmstadt.de> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> Message-ID: <44647BFE.8040100@infrae.com> Stefan Behnel wrote: > Hi Martijn, [snip] >>>I >>>hope that the information here can at worst help out others trying to >>>build >>>on windows and at best get included in the documentation and maintenance. >> >>Thanks! I'll check this in. > > Please, don't. Last thing I heard was that lxml compiles just fine on Windows. > I really don't see why we should tell people to install libraries in weird > places and mess around in setup.py. What weird places? There's nothing weird with these instructions that I can see. *Does* the setup.py just work on windows? xslt-config must be installed on the path, and things just work then? In addition, these instructions allow you, as far as I understand, to create a *static* version of lxml, including libxml2 and the like. That's pretty neat. No more separately downloading the libxml2 libraries; you just get them included. Anyway, if lxml compiles fine on Windows with the current instructions, we should consider the following steps: * what was it that tripped up David? We should look into finding out and adding a bit to the documentation so people don't trip up in the future. * I'd like to publish the static version of lxml for Windows, as that makes deploying lxml on Windows that much easier. * If we do that, it'd be nice if we had instructions on how to build the static version (which David provides). I won't check in anything until we figure those out. I just think we should not ignore David's contribution either. So, David, what was it that stopped you from building lxml on Windows? Regards, Martijn From faassen at infrae.com Fri May 12 14:27:47 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri May 12 14:27:52 2006 Subject: [lxml-dev] Building lxml on windows: a guide.
In-Reply-To: <446472C7.4020702@gkec.informatik.tu-darmstadt.de> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> Message-ID: <44647F43.4000100@infrae.com> Hey, Note that, through my fault, the installation documentation on the website was slightly out of date. I've updated it just now, so David, could you check whether this would've helped you get unstuck when trying to build lxml on windows? If not, could you suggest how to modify it? Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 15:04:53 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 15:05:36 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: <44647BFE.8040100@infrae.com> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> <44647BFE.8040100@infrae.com> Message-ID: <446487F5.2010006@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > *Does* the setup.py just work on windows? xslt-config must be installed > on the path, and things just work then? There are instructions on how to install those on the libxml2-for-Windows homepage. Remember that you don't normally need to build lxml, so the PATH doesn't matter. lxml is a C-extension, that's not as easy as a Python script. That's why we try to keep users from having to build it themselves. People who want the one-command install should just go with easy_install, which works perfectly on any half-way well-configured system. All that's needed in addition is installing the libxml2/etc. libraries (as their homepage describes). > In addition, these instructions allow you, as far as I understand, to > create a *static* version of lxml, including libxml2 and the like. I really don't mind a static version, despite the ugliness-factor. As I already stated a while ago, Windows heavily lacks any package management.
(That's an absolute killer argument against Windows BTW, but, well, it's Windows...) Compiling a static version is easy, it's just that the way to do it is not portable, so we can't put it into setup.py. > * what was it that tripped up David? The main problem seemed to be that he didn't find the binaries - which might have been because of the outdated install docs and because we don't have a Windows egg for 0.9.2 yet. I'm sorry for that. > * I'd like to publish the static version of lxml for Windows, as that > makes deploying lxml on Windows that much more easy. Sure, if someone provides a static egg - one more to put on cheeseshop. Stefan From faassen at infrae.com Fri May 12 15:15:07 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri May 12 15:15:12 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: <446487F5.2010006@gkec.informatik.tu-darmstadt.de> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> <44647BFE.8040100@infrae.com> <446487F5.2010006@gkec.informatik.tu-darmstadt.de> Message-ID: <44648A5B.201@infrae.com> Hey Stefan, Stefan Behnel wrote: > Martijn Faassen wrote: > >>*Does* the setup.py just work on windows? xslt-config must be installed >>on the path, and things just work then? > > There are instructions on how to install those on the libxml2-for-Windows > homepage. Remember that you don't normally need to build lxml, so the PATH > doesn't matter. lxml is a C-extension, that's not as easy as a Python script. > That's why we try to keep users from having to build it themselves. People who > want the one-command install should just go with easy_install, which works > perfectly on any half-way well-configured system. All that's needed in > addition is installing the libxml2/etc. libraries (as their homepage describes). Right, we should separate the compilation discussion from the installation discussion. If you want to install lxml on Windows, use a compiled version.
Let's talk about compilation here. The question remains whether setup.py just works on Windows or whether the people who built the windows versions had to hack it. If the latter, it'd be nice to document these instructions for future reference. If everything does work with the plain setup.py, we should figure out what tripped up David as he apparently needed to hack things. >>In addition, these instructions allow you, as far as I understand, to >>create a *static* version of lxml, including libxml2 and the like. > I really don't mind a static version, despite the ugliness-factor. As I > already stated a while ago, Windows heavily lacks any package management. > (That's an absolute killer argument against Windows BTW, but, well, it's > Windows...) > > Compiling a static version is easy, it's just that the way to do is not > portable, so we can't put it into setup.py. Right, so people typically have to hack their setup.py to do so, I guess. We can still document it somewhere. >>* what was it that tripped up David? > > The main problem seemed to be that he didn't find the binaries - which might > have been because of the outdated install docs and because we don't have a > Windows egg for 0.9.2 yet. I'm sorry for that. Perhaps David can provide a windows installer for 0.9.2 that includes the static libraries? >>* I'd like to publish the static version of lxml for Windows, as that >>makes deploying lxml on Windows that much more easy. > > Sure, if someone provides a static egg - one more to put on cheeseshop. Windows installer should be easy. But eggs? Is it possible to have a static egg and a non-static egg both? How would people choose? Should we switch to the static procedure on windows altogether? What do people think? Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 15:26:41 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 15:27:17 2006 Subject: [lxml-dev] Building lxml on windows: a guide. 
In-Reply-To: <44648A5B.201@infrae.com> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> <44647BFE.8040100@infrae.com> <446487F5.2010006@gkec.informatik.tu-darmstadt.de> <44648A5B.201@infrae.com> Message-ID: <44648D11.5020903@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> if someone provides a static egg - one more to put on cheeseshop. > > Windows installer should be easy. But eggs? Is it possible to have a > static egg and a non-static egg both? How would people choose? Should we > switch to the static procedure on windows altogether? What do people think? I was thinking about that, too. I guess all-in-one installers are mainly for new users. People who want to upgrade will likely prefer independent downloads. So, I think it's better to have a full-fledged installer and a dynamic egg. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 16:27:59 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 16:28:36 2006 Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree In-Reply-To: <44634969.6050806@infrae.com> References: <4463083A.90405@gkec.informatik.tu-darmstadt.de> <44631029.5090900@infrae.com> <4463190F.5070705@gkec.informatik.tu-darmstadt.de> <1147345522.6436.4.camel@zoo.yandex.ru> <44632712.30102@gkec.informatik.tu-darmstadt.de> <44633164.4000205@gkec.informatik.tu-darmstadt.de> <44634969.6050806@infrae.com> Message-ID: <44649B6F.6000201@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > It's probably more worthwhile to invest that energy in speeding up find > by writing a native implementation. :) Possibly. 
Although, well, I guess this is fast enough for now:

# python -m timeit -s "from otbench import *" "bench_lxml_findall()"
10 loops, best of 3: 252 msec per loop

# python -m timeit -s "from otbench import *" "bench_cET()"
10 loops, best of 3: 258 msec per loop

:) Just for the records: cET 1.0.5 vs. lxml trunk, SVN 27133. (hint: I rewrote the ElementDepthFirstIterator returned by getiterator() to support tag-filtering directly.) Have fun, Stefan From camior at gmail.com Fri May 12 17:16:54 2006 From: camior at gmail.com (David Sankel) Date: Fri May 12 17:17:32 2006 Subject: [lxml-dev] Building lxml on windows: a guide. In-Reply-To: <44648D11.5020903@gkec.informatik.tu-darmstadt.de> References: <4464713D.90008@infrae.com> <446472C7.4020702@gkec.informatik.tu-darmstadt.de> <44647BFE.8040100@infrae.com> <446487F5.2010006@gkec.informatik.tu-darmstadt.de> <44648A5B.201@infrae.com> <44648D11.5020903@gkec.informatik.tu-darmstadt.de> Message-ID: To clarify your questions. I'm running windows and want to get things installed as quickly as possible (binaries++).

* I went to the website and clicked on the lxml 0.9.1 link in the news, saw that it was a direct source download and canceled.
* I scrolled down, scanning for some sort of binary version and clicked the "lxml at the Python cheeseshop" link. (Note: it wasn't immediately obvious that here is where binaries would be)
* Saw that there were no windows binaries there. [Made assumption that there were no windows binaries available from the developers.]
* Looked at where I got my previous version from my notes. It was from http://carcass.dhs.org/ and that site no longer exists.

At this point I figured I'd have to go with doing it by source.

* Downloaded lxml 0.9.1 from the link in the news (I guess the 0.9.2 news wasn't up yet)
* python setup.py build::

    src\lxml\etree.c(9) : fatal error C1083: Cannot open include file: 'libxml/encoding.h': No such file or directory
    error: command '"c:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin\cl.exe"' failed with exit status 2

* Opened README.txt
* Opened INSTALL.txt
* Skipped down to "Installation on Windows" section as recommended in the first paragraph.
* Read that it would be a hassle to compile :-(
* Read that binaries are available on http://carcass.dhs.org/ :-)
* Found that that site was down :-( [That was unhelpful, so I figured I'd try to get it to work myself]
* Looked to xmlsoft.org to get libxml2 binaries.
* Found that that distribution included includes and everything :-)
* That distribution didn't have the xslt-config binary that the normal setup.py required.
* Hacked away at the setup.py so it'd work.
* Found that the distribution from xmlsoft.org had static binaries :-D (It was a _huge_ pain requiring my developers to download these extra dlls before)

(Note: I have heard of python eggs, but haven't gotten around to reading about it yet. I'm not sure if it would have saved me some steps from above.) -- > Note that due to my fault the installation documentation on the website > was slightly up to date. I've updated it just now, so David, could you > check whether this would've helped you getting unstuck trying to build > lxml on windows? Unfortunately the new instructions wouldn't have helped because I used the docs in the source distribution to try to figure out how to install it after not finding a windows binary. > If not, could you suggest how to modify it? For the main page, you could change the news releases to something like this: lxml 0.9.2 (sourcelink, binarieslink) released (changes for 0.9.2) > Windows installer should be easy. But eggs? Is it possible to have a > static egg and a non-static egg both? How would people choose?
Should we > switch to the static procedure on windows altogether? What do people think? Having gone from the dll version installer to the static version installer, I can't say enough about how much easier it is to deploy the static version. > Perhaps David can provide a windows installer for 0.9.2 that includes > the static libraries? Sure, no problem. David From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 17:50:09 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri May 12 17:50:49 2006 Subject: [lxml-dev] Re: Re: Python unicode string support in lxml In-Reply-To: <1695956808.20060512050415@carcass.dhs.org> References: <4461B4D1.4090103@gkec.informatik.tu-darmstadt.de><1365266204.20060510065257@carcass.dhs.org> <58589037.20060510145550@carcass.dhs.org> <205615288.20060510164613@carcass.dhs.org> <446246A7.4060100@gkec.informatik.tu-darmstadt.de> <34119542.20060510173315@carcass.dhs.org> <4462545B.1040502@gkec.informatik.tu-darmstadt.de> <954805581.20060510180841@carcass.dhs.org> <783474576.20060510181204@carcass.dhs.org> <4462C08A.6070405@gkec.informatik.tu-darmstadt.de> <4462D2D0.2050305@gkec.informatik.tu-darmstadt.de> <86168992.20060511035604@carcass.dhs.org> <4462F048.3070801@gkec.informatik.tu-darmstadt.de> <1041322718.20060511093151@carcass.dhs.org> <44633453.8020101@gkec.informatik.tu-darmstadt.de> <1031057011.20060511142425@carcass.dhs.org> <4463A077.4070200@gkec.informatik.tu-darmstadt.de> <1695956808.20060512050415@carcass.dhs.org> Message-ID: <4464AEB1.2070405@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Thursday, May 11, 2006, 5:37:11 PM, you wrote: >> If you want, you can run the modified bench script again, and if you have >> enough RAM, you can pass the -L option to see if that makes a difference. > > Sure, anything I can help with. I've ran the tests with "-L", and they > took quite a while to perform and even crashed after a point. I think the crash might be related to a bug I fixed lately. 
Maybe your version didn't have that (you passed the '-i' option to run it
against the working directory version, right?)

> Since I'm
> in a hurry lately I did not have the time to see why, but the results are
> attached. See that some test results were really slow compared to ET
> and cET. Ex:
>
> lxe: tostring_utf16 (SA T3 ) 80626.3903 msec/pass, best of ( 81157.4800 80631.1504 80626.3903 )
> cET: tostring_utf16 (SA T3 ) 3305.6618 msec/pass, best of ( 3305.6618 3332.7984 3310.7507 )
> ET : tostring_utf16 (SA T3 ) 3413.7482 msec/pass, best of ( 3418.9271 3413.7482 3415.6650 )
>
> lxe: tostring_utf8 (UA T3 ) 37834.8396 msec/pass, best of ( 37834.8396 37970.8700 37908.7146 )
> cET: tostring_utf8 (UA T3 ) 2880.5753 msec/pass, best of ( 2880.5753 2886.5763 2885.0215 )
> ET : tostring_utf8 (UA T3 ) 2981.1059 msec/pass, best of ( 3000.1362 2981.1059 2988.5129 )

That absolutely looks like your system hit the hard disk. So, it would be
interesting to have some hints about memory usage during these two benchmarks.

> The server is at your disposal if you want to use it.

Thanks for the offer. I may ask Martijn first if they don't have similar
facilities at infrae. Might be easier. I'll be away until Thursday, but I may
still come back to the offer, thanks.
Stefan

From faassen at infrae.com Fri May 12 18:03:01 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Fri May 12 18:03:07 2006
Subject: [lxml-dev] Uche's OT-benchmark on lxml and (c)ElementTree
In-Reply-To: <44649B6F.6000201@gkec.informatik.tu-darmstadt.de>
References: <4463083A.90405@gkec.informatik.tu-darmstadt.de>
	<44631029.5090900@infrae.com>
	<4463190F.5070705@gkec.informatik.tu-darmstadt.de>
	<1147345522.6436.4.camel@zoo.yandex.ru>
	<44632712.30102@gkec.informatik.tu-darmstadt.de>
	<44633164.4000205@gkec.informatik.tu-darmstadt.de>
	<44634969.6050806@infrae.com>
	<44649B6F.6000201@gkec.informatik.tu-darmstadt.de>
Message-ID: <4464B1B5.5020509@infrae.com>

Stefan Behnel wrote:
> Hi Martijn,
>
> Martijn Faassen wrote:
>
>>It's probably more worthwhile to invest that energy in speeding up find
>>by writing a native implementation. :)
>
>
> Possibly. Although, well, I guess this is fast enough for now:
>
> # python -m timeit -s "from otbench import *" "bench_lxml_findall()"
> 10 loops, best of 3: 252 msec per loop
>
> # python -m timeit -s "from otbench import *" "bench_cET()"
> 10 loops, best of 3: 258 msec per loop
>
> :)
>
> Just for the record: cET 1.0.5 vs. lxml trunk, SVN 27133.

Way cool! Yes, definitely sounds fast enough for now. :)

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 18:03:35 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri May 12 18:04:12 2006
Subject: [lxml-dev] Building lxml on windows: a guide.
In-Reply-To:
References: <4464713D.90008@infrae.com>
	<446472C7.4020702@gkec.informatik.tu-darmstadt.de>
	<44647BFE.8040100@infrae.com>
	<446487F5.2010006@gkec.informatik.tu-darmstadt.de>
	<44648A5B.201@infrae.com>
	<44648D11.5020903@gkec.informatik.tu-darmstadt.de>
Message-ID: <4464B1D7.6070800@gkec.informatik.tu-darmstadt.de>

Hi David,

thanks for the long explanation. That's interesting to read.

David Sankel wrote:
> To clarify your questions.
I'm running windows and want to get things
> installed as quickly as possible (binaries++).
>
> * I went to the website and clicked on lxml 9.0.1 link in the news,
>   saw that it was direct source download and canceled.

True, I always wondered if it was a good idea to have those links in the news
section... :)

I removed them and left them only under "downloads".

> * I scrolled down, scanning for some sort of binary version and
>   clicked the "lxml at the Python cheeseshop" link. (Note: it wasn't
>   immediately obvious that here is where binaries would be)

That should be clear now.

> * Saw that there was no windows binaries there.
>   [Made assumption that there were no windows binaries available from
>   the developers.]

Not for 0.9.2 yet, only for 0.9 and 0.9.1. Note that "windows binaries" are
distributed in "egg" form as for most other platforms. (should be clear from
the "win32" part in the file name).

> * Looked at where I got my previous version from my notes. It was from
>   http://carcass.dhs.org/ and that site no longer exists.

That's ok, that was a temporary solution. The web page is already fixed.

> At this point I figured I'd have to go with doing it by source.

:)

> * Downloaded lxml 9.0.1 from link in the news (I guess 9.0.2 news wasn't
>   up yet)

Exactly.

> * python setup.py build::
>
>     src\lxml\etree.c(9) : fatal error C1083: Cannot open include file:
>     'libxml/encoding.h': No such file or directory
>     error: command '"c:\Program Files\Microsoft Visual Studio .NET
>     2003\Vc7\bin\cl.exe"' failed with exit status 2

You didn't read the installation instructions on the web page. The link is
more visible now.

> * Opened README.txt
> * Opened INSTALL.txt
> * Skipped down to "Installation on Windows" section as recommended in
>   the first paragraph.

That section is gone now that we have eggs.

> * Read that it would be a hassle to compile :-(

That's why we have eggs.
:)

> * Read that binaries are available on http://carcass.dhs.org/ :-)
> * Found that that site was down :-(

See above.

> [That was unhelpful, so I figured I'd try to get it to work myself]
>
> * Looked to xmlsoft.org to get lxml binaries.
> * Found that that distribution included includes and everything :-)

Sure.

> * That distribution didn't have the xslt-config binary that the normal
>   setup.py required.

It's not a binary, more of a script. But that's good to know. Maybe Steve can
tell us if there was anything he did to make it work without it.

> * Hacked away at the setup.py so it'd work.
> * Found that the distribution from xmlsoft.org had static binaries :-D

As Martijn said, a static installer would be nice to have for windows users.

> (It was a _huge_ pain requiring my developers to download these extra
> dlls before)

You should tell them to use easy_install. Doesn't help with DLLs, but with
everything that's Python. lxml works nicely with it.

> (Note: I have heard of python eggs, but haven't gotten around to
> reading about it yet. I'm not sure if it would have saved me some
> steps from above.)

Python eggs are the best way to distribute and use various versions and
binaries of a Python package. See the EasyInstall link on the lxml install
page.

> Unfortunately the new instructions wouldn't have helped because I used
> the docs in the source distribution to try to figure out how to
> install it after not finding a windows binary.

The web page is built from the text files in doc/.

> For the main page, you could change the news releases to something like
> this:
> lxml 0.9.2 (sourcelink, binarieslink) released (changes for 0.9.2)

I prefer referring to cheeseshop here. "binarieslink" is a bit misleading for
a multi-platform thing.

> Having gone from the dll version installer to the static version
> installer, I can't say enough about how much easier it is to deploy
> the static version.

It's static. Where's the problem?
>> Perhaps David can provide a windows installer for 0.9.2 that includes
>> the static libraries?
>
> Sure, no problem.

Cool, please send it to Martijn so that he can copy it to cheeseshop.

Martijn, please also rebuild the web pages again.

Thanks for the feedback,
Stefan

From ogrisel at nuxeo.com Fri May 12 18:24:48 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Fri May 12 18:26:15 2006
Subject: [lxml-dev] Re: Building lxml on windows: a guide.
In-Reply-To: <4464B1D7.6070800@gkec.informatik.tu-darmstadt.de>
References: <4464713D.90008@infrae.com>
	<446472C7.4020702@gkec.informatik.tu-darmstadt.de>
	<44647BFE.8040100@infrae.com>
	<446487F5.2010006@gkec.informatik.tu-darmstadt.de>
	<44648A5B.201@infrae.com>
	<44648D11.5020903@gkec.informatik.tu-darmstadt.de>
	<4464B1D7.6070800@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote:
> Hi David,
>
> thanks for the long explanation. That's interesting to read.
>
> David Sankel wrote:
>> To clarify your questions. I'm running windows and want to get things
>> installed as quickly as possible (binaries++).
>>
>> * I went to the website and clicked on lxml 9.0.1 link in the news,
>>   saw that it was direct source download and canceled.
>
> True, I always wondered if it was a good idea to have those links in the news
> section... :)
>
> I removed them and left them only under "downloads".

What about replacing the News section directly by the download section with
additional links to the CHANGES file? I find the current front page too long
and too redundant.

The "easy_install lxml" command should be mentioned on top of the download
section so that people first think to use it before getting tempted to click
on one of the source distrib links.

--
Olivier

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 12 18:49:36 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri May 12 18:50:13 2006
Subject: [lxml-dev] Re: Building lxml on windows: a guide.
In-Reply-To:
References: <4464713D.90008@infrae.com>
	<446472C7.4020702@gkec.informatik.tu-darmstadt.de>
	<44647BFE.8040100@infrae.com>
	<446487F5.2010006@gkec.informatik.tu-darmstadt.de>
	<44648A5B.201@infrae.com>
	<44648D11.5020903@gkec.informatik.tu-darmstadt.de>
	<4464B1D7.6070800@gkec.informatik.tu-darmstadt.de>
Message-ID: <4464BCA0.8020002@gkec.informatik.tu-darmstadt.de>

Hi Olivier,

Olivier Grisel wrote:
> Stefan Behnel wrote:
>> Hi David,
>>
>> thanks for the long explanation. That's interesting to read.
>>
>> David Sankel wrote:
>>> To clarify your questions. I'm running windows and want to get things
>>> installed as quickly as possible (binaries++).
>>>
>>> * I went to the website and clicked on lxml 9.0.1 link in the news,
>>>   saw that it was direct source download and canceled.
>>
>> True, I always wondered if it was a good idea to have those links in
>> the news
>> section... :)
>>
>> I removed them and left them only under "downloads".
>
> What about replacing the News section directly by the download section
> with additional links to the CHANGES file? I find the current front page
> too long and too redundant.

Good idea. done.

> The "easy_install lxml" command should be mentioned on top of the
> download section so that people first think to use it before getting
> tempted to click on one of the source distrib links.

Already did that in my last commit. :)

Stefan

From nslater at gmail.com Sun May 14 18:25:05 2006
From: nslater at gmail.com (Noah Slater)
Date: Sun May 14 18:25:39 2006
Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options
Message-ID: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com>

Hello all,

Firstly thanks for such a great library. For a python module I am worryingly
excited reading through the documentation.
I have had a play around with etree.* and have replaced almost 50% of the
code in my project thus far to use this - but one concern remains:

I cannot figure out from either the docs, snippets of source files I
have read or google how I am meant to prevent XSLT from writing to the
file system.

Using xsltproc from the command line I can use the "--nowrite" and
"--nomkdir" switches and would ideally like to be able to replicate
this functionality.

Thanks so much,
Noah

--
"Creativity can be a social contribution, but only in so
far as society is free to use the results." - R. Stallman

From nslater at gmail.com Sun May 14 20:43:00 2006
From: nslater at gmail.com (Noah Slater)
Date: Sun May 14 20:43:35 2006
Subject: [lxml-dev] XMl Processing Instructions
Message-ID: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>

Hello,

When I use the write method of the ElementTree class, why does it strip
out the XML processing instructions?

I would like my documents to start with the processing instruction so
I can specify encodings other than UTF-8.

Thanks,
Noah

--
"Creativity can be a social contribution, but only in so
far as society is free to use the results." - R. Stallman

From ogrisel at nuxeo.com Sun May 14 20:54:32 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Sun May 14 20:55:32 2006
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
Message-ID:

Hi list,

As I found the current style a bit "too big" I refactored the css to get this:

http://champiland.homelinux.net/lxml-newstyle/

If you like it I can check that in the svn. I enclose a patch to list the
changes.

To Stefan: your work on mail.txt has been done only in branch/lxml-0.9.x and
not in trunk. Is it intentional? + it would be nice if your changes could go
online. Who is in charge of the publishing process?
Regards,

--
Olivier
-------------- next part --------------
Index: style.css
===================================================================
--- style.css (revision 27205)
+++ style.css (working copy)
@@ -1,33 +1,62 @@
 body {
+  /* CSS Hack for IE that does not respect the "margin: auto" rule at the
+   * document level */
+  text-align: center;
+  padding: 1em;
+}
+
+div.document {
+  width: 45em;
+  font: 13px Arial, Verdana, Helvetica, sans-serif;
+  margin: 1em auto 1em auto;
+  background-color: white;
+  color: #222;
+  text-align: left;
+}
+
+h1.title {
   background: url(http://codespeak.net/img/codespeak1b.png) no-repeat;
-  font: 120% Arial, Verdana, Helvetica, sans-serif;
-  border: 0;
-  margin: 0.5em 0em 0.5em 0.5em;
-  padding: 0 0 0 145px;
+  padding: 20px 0 0 180px;
+  height: 60px;
+  font-size: 200%;
 }
 
-a {
-  text-decoration: underline;
-  background-color: transparent;
+h1, h2, h3 {
+  color: #333;
+  font-weight: bold;
 }
 
-p {
-  /*margin: 0.5em 0em 1em 0em;*/
-  text-align: left;
-  line-height: 1.5em;
-  margin: 0.5em 0em 0em 0em;
+h1 {
+  font-size: 120%;
 }
 
-p a {
-  text-decoration: underline;
+h2 {
+  font-size: 110%;
 }
 
+h3 {
+  font-size: 105%;
+}
 
-p a:active {
-  color: Red;
+a, a:visited {
   background-color: transparent;
+  font-weight: bold;
+  color: Black;
+  text-decoration: none;
 }
 
+a:active {
+  color: Red;
+  text-decoration: underline;
+}
+
+p {
+  /*margin: 0.5em 0em 1em 0em;*/
+  text-align: justify;
+  line-height: 1.5em;
+  margin: 0.5em 0em 0em 0em;
+}
+
 hr {
   clear: both;
   height: 1px;
@@ -35,10 +64,8 @@
   background-color: transparent;
 }
 
-
-ul {
+ul {
   line-height: 1.5em;
-  /*list-style-image: url("bullet.gif"); */
   margin-left: 1em;
 }
 
@@ -47,28 +74,21 @@
   margin-left: 0em;
 }
 
-ul a, ol a {
-  text-decoration: underline;
-}
-
 blockquote {
   font-family: Times, "Times New Roman", serif;
   font-style: italic;
-  font-size: 120%;
 }
 
 code {
-  font-size: 120%;
   color: Black;
-  /*background-color: #dee7ec;*/
   background-color: #cccccc;
+  font-family: Courier, monospace;
 }
 
 pre {
-  font-size: 120%;
-  padding: 1em;
+  padding: 0.5em;
   border: 1px solid #8cacbb;
   color: Black;
-  background-color: #dee7ec;
   background-color: #cccccc;
+  font-family: Courier, monospace;
 }
Index: publish.py
===================================================================
--- publish.py (revision 27205)
+++ publish.py (working copy)
@@ -1,9 +1,12 @@
-import os, sys
+import os, shutil, sys
 
 def publish(dirname, lxml_path, release):
     if not os.path.exists(dirname):
         os.mkdir(dirname)
-    stylesheet_url = 'http://codespeak.net/lxml/style.css'
+    stylesheet_url = 'style.css'
+
+    shutil.copy(stylesheet_url, dirname)
+
     for name in ['main.txt', 'intro.txt', 'api.txt', 'compatibility.txt',
                  'extensions.txt', 'namespace_extensions.txt', 'sax.txt']:
         path = os.path.join(lxml_path, 'doc', name)
@@ -22,10 +25,10 @@
                 os.path.join(dirname, 'index.html'))
 
 def rest2html(source_path, dest_path, stylesheet_url):
-
-    command = ('rest2html --stylesheet=%s %s > %s' %
+
+    command = ('rest2html --stylesheet=%s --link-stylesheet %s > %s' %
                (stylesheet_url, source_path, dest_path))
     os.system(command)
-
+
 if __name__ == '__main__':
     publish(sys.argv[1], sys.argv[2], sys.argv[3])

From faassen at infrae.com Mon May 15 10:41:07 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Mon May 15 10:40:51 2006
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
In-Reply-To:
References:
Message-ID: <44683EA3.4010006@infrae.com>

Olivier Grisel wrote:
> As I found the current style a bit "too big" I refactored the css to get
> this:
>
> http://champiland.homelinux.net/lxml-newstyle/
>
> If you like it I can check that in the svn. I enclose a patch to list
> the changes.

Cool, I like them! Feel free to check them in, unless Stefan objects.

> To Stefan: your work on mail.txt has been done only in branch/lxml-0.9.x
> and not in trunk. Is it intentional? + it would be nice if your changes
> could go online. Who is in charge of the publishing process?
I am, but I haven't been able to keep up with Stefan as well as I should. I
think I should arrange for some other people to have access to the lxml web
directory as well; running the website generation script isn't hard so I can
easily explain what I do.

Regards,

Martijn

From ogrisel at nuxeo.com Mon May 15 20:56:30 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Mon May 15 20:57:45 2006
Subject: [lxml-dev] Re: New CSS style for the lxml website @ codespeak
In-Reply-To: <44683EA3.4010006@infrae.com>
References: <44683EA3.4010006@infrae.com>
Message-ID:

Martijn Faassen wrote:
> Olivier Grisel wrote:
>> As I found the current style a bit "too big" I refactored the css to
>> get this:
>>
>> http://champiland.homelinux.net/lxml-newstyle/
>>
>> If you like it I can check that in the svn. I enclose a patch to list
>> the changes.
>
> Cool, I like them! Feel free to check them in, unless Stefan objects.

Done.

>> To Stefan: your work on mail.txt has been done only in
>> branch/lxml-0.9.x and not in trunk. Is it intentional? + it would be
>> nice if your changes could go online. Who is in charge of the
>> publishing process?
>
> I am, but I haven't been able to keep up with Stefan as well as I
> should. I think I should arrange for some other people to have access to
> the lxml web directory as well; running the website generation script
> isn't hard so I can easily explain what I do.

Yes, I have used that script after having removed the hardcoded reference to
the style.css url to be able to run the script on my personal webserver.

--
Olivier

From faassen at infrae.com Tue May 16 11:14:01 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Tue, 16 May 2006 11:14:01 +0200
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
In-Reply-To:
References: <44683EA3.4010006@infrae.com>
Message-ID: <446997D9.405@infrae.com>

Hey,

Stefan, shall I rerun the script so we get the new layout?
Does the 0.9 branch have the changes to the text you made as well or is this
only the trunk?

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 16:05:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 16:05:24 +0200
Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options
In-Reply-To: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com>
References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com>
Message-ID: <4469DC24.9080607@gkec.informatik.tu-darmstadt.de>

Hi Noah,

Noah Slater wrote:
> I cannot figure out from either the docs, snippets of source files I
> have read or google how I am meant to prevent XSLT from writing to the
> file system.
>
> Using xsltproc from the command line I can use the "--nowrite" and
> "--nomkdir" switches and would ideally like to be able to replicate
> this functionality.

I guess you're referring to the security framework in libxslt:

http://xmlsoft.org/XSLT/html/libxslt-security.html

This is not currently wrapped in lxml. I do not know what exactly it is meant
to do, though. How can you create files in the current XSLT implementation?

If you can find a case where having this left out represents a security risk,
we may consider wrapping that part of the API to close it.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 16:29:03 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 16:29:03 +0200
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
In-Reply-To:
References:
Message-ID: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de>

Hi Olivier,

Olivier Grisel wrote:
> As I found the current style a bit "too big" I refactored the css to get
> this:
> http://champiland.homelinux.net/lxml-newstyle/
> If you like it I can check that in the svn. I enclose a patch to list
> the changes.

Thanks for working on that.
However, I generally dislike web pages that try to tell me what size I should
have used for the browser window. For example, it doesn't always work for the
code examples, which can exceed the provisioned width (although this may mean
we'd better fix the examples).

And I find it very irritating that headlines and links look the same. I'd be
happy if you could fix that. The original style wasn't that bad, although it
may be enough to underline the links in the new style.

> To Stefan: your work on mail.txt has been done only in branch/lxml-0.9.x
> and not in trunk. Is it intentional?

As the branch represents the latest official version (both of lxml and the web
pages), I find it more important to have the web pages up-to-date there. I
wanted to merge the changes into the trunk later on, but I had conflicts, so I
decided to wait until after my vacation.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 16:34:27 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 16:34:27 +0200
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
In-Reply-To: <446997D9.405@infrae.com>
References: <44683EA3.4010006@infrae.com> <446997D9.405@infrae.com>
Message-ID: <4469E2F3.2070808@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> Stefan, shall I rerun the script so we get the new layout?

I'd prefer seeing some changes as I mentioned in my previous post, before
going online with the new pages. The right place to update the styles and all
that is currently the 0.9.x branch, as it contains the current version of the
web site.

> Does the 0.9
> branch have the changes to the text you made as well or is this only the
> trunk?

The web pages in the 0.9.x branch should be consistent with version 0.9.2. The
pages in the trunk contain changes that reflect code changes in the trunk,
however, some of the design updates from the branch are not yet in the trunk.
I'll merge them as soon as I find the time.
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 16:47:05 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 16:47:05 +0200
Subject: [lxml-dev] XMl Processing Instructions
In-Reply-To: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>
References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>
Message-ID: <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de>

Hi Noah,

Noah Slater wrote:
> When I use the write method of the ElementTree class, why does it strip
> out the XML processing instructions?
>
> I would like my documents to start with the processing instruction so
> I can specify encodings other than UTF-8.

Hmm, I didn't verify this, although I actually thought lxml produced a
declaration here. If not, this should be considered a bug, as it is likely
inconsistent with ElementTree. I guess this is the same problem as for
tostring(), which only started having the expected behaviour fairly recently.

I'll see what I can do about that and try to fix it on the SVN trunk as soon
as I find the time.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 16:58:34 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 16:58:34 +0200
Subject: [lxml-dev] XMl Processing Instructions
In-Reply-To: <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de>
References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>
	<4469E5E9.3000205@gkec.informatik.tu-darmstadt.de>
Message-ID: <4469E89A.10409@gkec.informatik.tu-darmstadt.de>

Hi again,

Stefan Behnel wrote:
> Noah Slater wrote:
>> When I use the write method of the ElementTree class, why does it strip
>> out the XML processing instructions?
>>
>> I would like my documents to start with the processing instruction so
>> I can specify encodings other than UTF-8.
>
> I guess this is the same problem as for
> tostring(), which only started having the expected behaviour fairly recently.
Yes, it /is/ the same problem. You will also notice problems when you
serialise trees to XML byte streams containing 0-bytes. Both problems have
been fixed on the trunk recently, but after the release of 0.9.2.

Please use the Subversion trunk for now, until we have decided if it's worth
releasing a 0.9.3 before we have 1.0 ready.

http://codespeak.net/svn/lxml/trunk

Stefan

From ogrisel at nuxeo.com Tue May 16 16:56:32 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Tue, 16 May 2006 16:56:32 +0200
Subject: [lxml-dev] New CSS style for the lxml website @ codespeak
In-Reply-To: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de>
References: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote:
> Hi Olivier,
>
> Olivier Grisel wrote:
>> As I found the current style a bit "too big" I refactored the css to get
>> this:
>> http://champiland.homelinux.net/lxml-newstyle/
>> If you like it I can check that in the svn. I enclose a patch to list
>> the changes.
>
> Thanks for working on that. However, I generally dislike web pages that try to
> tell me what size I should have used for the browser window. For example, it
> doesn't always work for the code examples, which can exceed the provisioned
> width (although this may mean we'd better fix the examples).

The style I wrote takes care of having the main content width remain
approximately constant in terms of number of characters. Try to resize the
fonts in your browser window with ctrl-scroll wheel to see what I mean.

I tend to dislike websites with a lot of textual content that spread it
horizontally, with more than 20 words on a line. That makes it really
difficult to read. Most blog or newspaper sites with textual content do not
spread the text too much horizontally, so as not to hurt the reader's eyes.

The examples should be resized to 79 chars max (as all source code should) in
order to both enhance readability and match the CSS style assumptions.
If you still disagree, I can set the column width to some 80% value for
instance instead of an "em" based value.

> And I find it very irritating that headlines and links look the same. I'd be
> happy if you could fix that. The original style wasn't that bad, although it
> may be enough to underline the links in the new style.

I find underlined links a bit too heavy as well but I agree headlines and
links are too similar. I'll give it another try tonight.

>> To Stefan: your work on mail.txt has been done only in branch/lxml-0.9.x
>> and not in trunk. Is it intentional?
>
> As the branch represents the latest official version (both of lxml and the web
> pages), I find it more important to have the web pages up-to-date there. I
> wanted to merge the changes into the trunk later on, but I had conflicts, so I
> decided to wait until after my vacation.

Ok.

--
Olivier

From nslater at gmail.com Tue May 16 18:50:34 2006
From: nslater at gmail.com (Noah Slater)
Date: Tue, 16 May 2006 17:50:34 +0100
Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options
In-Reply-To: <4469DC24.9080607@gkec.informatik.tu-darmstadt.de>
References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com>
	<4469DC24.9080607@gkec.informatik.tu-darmstadt.de>
Message-ID: <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com>

Hi Stefan,

Thanks for the reply!

> I guess you're referring to the security framework in libxslt:
>
> http://xmlsoft.org/XSLT/html/libxslt-security.html

Looks like that could be the thing, though I wouldn't know for sure as I am
not familiar with the underlying API to libxslt.

> I do not know what exactly it is meant to do, though. How can you create files
> in the current XSLT implementation?

I think it may be an extension to XSLT that libxslt implements. I use this
when I am chunking my DocBook documents.
See: http://www.sagehill.net/docbookxsl/Chunking.html

The DocBook stylesheets generate multiple files and will create them (and the
dirs) if necessary.

> If you can find a case where having this left out represents a security risk,
> we may consider wrapping that part of the API to close it.

Yes I have. My application accepts arbitrary XSLT files from users to
transform content. While my application does not chunk output in the manner
described above, I have tested it with a stylesheet that chunks output, and
the lxml bindings do in fact create the files, as I would expect with the
libxslt bindings. They do not however create directories, which is
inconsistent with the standard API.

Either way I would like to be able to disable this as it opens up the
possibility for users to write arbitrary files to the file system.

I hope this better explains things.

Thanks,
Noah

--
"Creativity can be a social contribution, but only in so
far as society is free to use the results." - R. Stallman

From nslater at gmail.com Tue May 16 18:51:16 2006
From: nslater at gmail.com (Noah Slater)
Date: Tue, 16 May 2006 17:51:16 +0100
Subject: [lxml-dev] XMl Processing Instructions
In-Reply-To: <4469E89A.10409@gkec.informatik.tu-darmstadt.de>
References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>
	<4469E5E9.3000205@gkec.informatik.tu-darmstadt.de>
	<4469E89A.10409@gkec.informatik.tu-darmstadt.de>
Message-ID: <9ea1c1180605160951p5f0015f1ma26a7292c202516d@mail.gmail.com>

Hello,

If I understand your email correctly the behaviour I describe is not
intentional and will be fixed shortly?

Just to clarify - I can parse any file, but when I serialise I lose
any processing instructions. This includes the
declaration.

This also happens with the ResultTree (?) when I transform using XSLT.

As an example, I use the DocBook XSLT stylesheets to transform DocBook
XML. This can often set various things up with the processing
instructions - character encoding being the most important.
When I perform these transformations using ElementTree I lose this
information.

As a workaround at the moment I am using lxml.etree to do the
transformations using UTF-8 as the encoding. I am then using libxml2
and libxslt to transform the serialized document bytestream a second
time with the only operation being converting from UTF-8 to another
(variable) character encoding.

This feels quite hackish - and to be honest the whole point of me
moving to lxml was because I find the libxml2 and libxslt bindings
hateful.

To summarise, in an ideal world I would like to be able to transform a
document using XSLT specifying an encoding at transformation time and
have the ResultTree serialise with all processing instructions intact.
Additionally I would like to be able to access these programmatically
- which I don't think is possible at the moment.

I hope all this makes sense.

Thanks,
Noah

From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 19:12:02 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Tue, 16 May 2006 19:12:02 +0200
Subject: [lxml-dev] XMl Processing Instructions
In-Reply-To: <9ea1c1180605160942n3ff25079n397211131cc2caa9@mail.gmail.com>
References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com>
	<4469E5E9.3000205@gkec.informatik.tu-darmstadt.de>
	<4469E89A.10409@gkec.informatik.tu-darmstadt.de>
	<9ea1c1180605160942n3ff25079n397211131cc2caa9@mail.gmail.com>
Message-ID: <446A07E2.4060606@gkec.informatik.tu-darmstadt.de>

Hi Noah,

Noah Slater wrote:
> If I understand your email correctly the behaviour I describe is not
> intentional and will be fixed shortly?

It has been fixed in the developer version, but there is not yet a release
that has the fix.

> Just to clarify - I can parse any file, but when I serialise I lose
> any processing instructions. This includes the
> declaration.

I understood that.

> This also happens with the ResultTree (?) when I transform using XSLT.
> > As an example, I use the DocBook XSLT stylesheets to transform DocBook > XML. This can often set various things up with the processing > instructions - character encoding being the most important. > > When I perform these transformations using ElementTree I loose this > information. > > As a work around at the moment I am using lxml.etree to do the > transformations using UTF-8 as the encoding. I am then using libxml2 > and libxslt to transform the serialized document bytestream a second > time with the only operation being converting from UTF-8 to another > (variable) character encoding. > > This feels quite hackish Well, it /is/ hackish. :) > - and to be honest the whole point of me > moving to lxml was because I find the libxml2 and libxslt bindings > hateful. That's one of the main reasons that drive us in writing lxml - right after users saying "thank you, it's great and helps us do stuff!" :) > To summarise, in an ideal world I would like to be able to transform a > document using XSLT specifying an encoding at transformation time and > have the ResultTree serialise with all processing instructions intact. > Additionally I would like to be able to access these programmatically > - which I don't think is possible at the moment. That's also a feature of the developer version that will eventually become lxml 1.0. See the 'docinfo' feature described here: http://codespeak.net/svn/lxml/trunk/doc/api.txt See here for a complete list of changes since 0.9.2: http://codespeak.net/svn/lxml/trunk/CHANGES.txt > I hope all this makes sense. It does. Thanks for the report. Please use the current SVN trunk version for now, lxml 1.0 is expected to be released this month. See http://codespeak.net/lxml/installation.html and the bottom of the download section in http://codespeak.net/lxml/ on how to check it out of the subversion repository and compile it. 
Stefan From ogrisel at nuxeo.com Tue May 16 23:02:44 2006 From: ogrisel at nuxeo.com (Olivier Grisel) Date: Tue, 16 May 2006 23:02:44 +0200 Subject: [lxml-dev] New CSS style for the lxml website @ codespeak In-Reply-To: References: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de> Message-ID: Olivier Grisel wrote: > Stefan Behnel wrote: >> And I find it very irritating that headlines and links look the same. I'd be >> happy if you could fix that. The original style wasn't that bad, although it >> may be enough to underline the links in the new style. > > I find underlined links a bit too heavy as well but I agree headlines and links > are too similar. I'll give it another try tonight. Done: http://champiland.homelinux.net/lxml-newstyle/ Headlines are a bit bigger, in gray and use Helvetica instead of Arial. -- Olivier From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 16 23:13:28 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 16 May 2006 23:13:28 +0200 Subject: [lxml-dev] New CSS style for the lxml website @ codespeak In-Reply-To: References: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de> Message-ID: <446A4078.9080007@gkec.informatik.tu-darmstadt.de> Salut Olivier, Olivier Grisel wrote: > Olivier Grisel wrote: >> Stefan Behnel wrote: >>> And I find it very irritating that headlines and links look the same. I'd be >>> happy if you could fix that. The original style wasn't that bad, although it >>> may be enough to underline the links in the new style. >> I find underlined links a bit too heavy as well but I agree headlines and links >> are too similar. I'll give it another try tonight. > > Done: > > http://champiland.homelinux.net/lxml-newstyle/ > > Headlines are a bit bigger, in gray and use Helvetica instead of Arial. Thanks, that's *much* better. One little thing: headlines get underlined when the mouse touches them.
That's the infamous " References: <4469E1AF.5040204@gkec.informatik.tu-darmstadt.de> <446A4078.9080007@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel a ?crit : > Salut Olivier, Hallo Stefan :) >> http://champiland.homelinux.net/lxml-newstyle/ >> >> Headlines are a bit bigger, in gray and use Helvetica instead of Arial. > > > Thanks, that's *much* better. One little thing: headlines get underlined when > the mouse touches them. That's the infamous " > So, if you want to commit it to the branch now, you're welcome. Please tell > Martijn to upload the pages when you're done. Done. All changes occurred in lxml/www/style.css and lxml/www/publish.py thus above the branch/trunk separation. So, Martjin you can upload the new style website by using the script as follows: $ cd lxml/www $ python publish.py target_dir ../branches/lxml-0.9.x 0.9.2 (As far as I understand how it works ...) -- Olivier From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 01:30:52 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 01:30:52 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> Message-ID: <446A60AC.70504@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> I guess you're referring to the security framework in libxslt: >> http://xmlsoft.org/XSLT/html/libxslt-security.html > > Looks like that could be the thing, though I wouldn't know for sure as > am not familiar with the underlying API to libxslt. > >> I do not know what exactly it is meant to do, though. How can you >> create files in the current XSLT implementation? > > I think it may be an extension to XSLT that libxslt implements. 
I use > this when I am chunking my DocBook documents. See: > > http://www.sagehill.net/docbookxsl/Chunking.html > > The DocBook stylesheets generate multiple files and will create them > (and the dirs) if necessary. I looked through that a bit. It seems to use EXSLT:document() and these things, but I wonder why that works in 0.9.2 (which I assume you tested it with?). Anyway, this is pretty much untested functionality and not currently expected to work in any sensible way. > My application accepts arbitrary XSLT files from users to > transform content. We already had a discussion recently about this-not-being-a-good-idea as you cannot easily prevent the stylesheet from eating up your CPU cycles. XSLT is turing-complete, so you can use it to find prime-factors, search for ET (no pun intended), etc., even if you manage to keep it from filling up your hard disk or reading your password files. > While my application does not chunk output in the > manner described above I have tested it with a stylesheet that chunks > output and the lxml binding do in fact create the files as I would > expect with the libxslt bindings. They do not however create > directories, which is inconsistent with the standard API. Not really inconsistent, as this is an API of xsltproc, not libxslt. It rather should not do that at all... > Either way I > would like to be able to disable this as it opens up the possibility > for users to write arbitrary files to the file system. I gave it a try and implemented a new API for that. Look at the bottom of http://codespeak.net/svn/lxml/branch/xslt-access-control/doc/resolvers.txt to see how to use it. Note that the second part (everything below "BROKEN FROM HERE") does not currently work, likely due to problems with libxslt. If I can't get that working, it will be removed. 
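For readers who want to see what such a restriction looks like in code, here is a minimal sketch. It assumes a later lxml release, in which this feature is exposed as etree.XSLTAccessControl and handed to etree.XSLT through the access_control keyword; the API on the 2006 branch may have differed, so resolvers.txt remains the reference. The stylesheet and the helper name make_restricted_transform are made up for illustration.

```python
from lxml import etree

# Illustrative stylesheet: wraps the text of <doc> in an <out> element.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/doc"/></out>
  </xsl:template>
</xsl:stylesheet>"""


def make_restricted_transform(stylesheet_bytes):
    """Compile a stylesheet that may not write files, create
    directories or touch the network while it runs.

    Hypothetical helper, not part of lxml itself.
    """
    access = etree.XSLTAccessControl(
        write_file=False,     # no output files from the stylesheet
        create_dir=False,     # no directory creation
        read_network=False,   # no remote document loading
        write_network=False,  # no remote writes
    )
    return etree.XSLT(etree.XML(stylesheet_bytes), access_control=access)


transform = make_restricted_transform(STYLESHEET)
result = transform(etree.XML(b"<doc>hello</doc>"))
print(str(result))
```

Note that this only limits file and network access. It does nothing about a stylesheet that burns CPU time, which is the other concern raised above.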
Stefan From howe at carcass.dhs.org Wed May 17 08:38:48 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed, 17 May 2006 03:38:48 -0300 Subject: [lxml-dev] lxml crash Message-ID: <472582143.20060517033848@carcass.dhs.org> Hello all, I know that the code is wrong, but this crash is happening on my FreeBSD 6.1 and Win XP systems, with lxml 0.9.2: [carcass@/home/howe] python Python 2.4.3 (#2, Apr 17 2006, 14:29:19) [GCC 3.4.4 [FreeBSD] 20050518] on freebsd6 Type "help", "copyright", "credits" or "license" for more information. >>> from lxml import etree >>> a = etree.Element('a') >>> a.append(None) Exception exceptions.TypeError: 'Argument must not be None.' in 'etree._raiseIfNone' ignored zsh: segmentation fault python It should be more friendly by just displaying the message and not crashing, right ? By the way, is there any way to get the lxml version from code ? Some module '__version__' attribute ? Shouldn't it be there ? Something like: lxml.__version__ -- Best regards, Steve mailto:howe at carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 08:46:50 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 08:46:50 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <446A60AC.70504@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> Message-ID: <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> Hi again, Stefan Behnel wrote: >>> I guess you're referring to the security framework in libxslt: >>> http://xmlsoft.org/XSLT/html/libxslt-security.html > > I gave it a try and implemented a new API for that. Look at the bottom of > http://codespeak.net/svn/lxml/branch/xslt-access-control/doc/resolvers.txt > to see how to use it.
Note that the second part (everything below "BROKEN FROM > HERE") does not currently work, likely due to problems with libxslt. If I > can't get that working, it will be removed. I removed the more fine-grained control mechanisms as they are redundant with (and less general than) the custom resolver support. The branch is merged into the trunk now. See the bottom of http://codespeak.net/svn/lxml/trunk/doc/resolvers.txt for an explanation. There is also a doctest on this. Noah, I'd be glad if you could test it and report back if it works as expected. I do not know if the directory creation business works now (i.e. if directories /are/ created). Martijn, this should also finally answer your question about XSLT security. Regards, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 10:10:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 10:10:32 +0200 Subject: [lxml-dev] new "doc/build.txt" on building lxml from sources Message-ID: <446ADA78.7040104@gkec.informatik.tu-darmstadt.de> Hi all, I added a "doc/build.txt" apart from the simplified "INSTALL.txt" to help interested users build lxml from sources (and to prevent normal distrib users from caring about it). It also contains David's procedure on building lxml statically on Windows. http://codespeak.net/svn/lxml/trunk/doc/build.txt It's both in the trunk and the 0.9 branch, so, Martijn, please update the web site from the branch. 
Regards, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 11:10:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 11:10:11 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <472582143.20060517033848@carcass.dhs.org> References: <472582143.20060517033848@carcass.dhs.org> Message-ID: <446AE873.3020808@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > I know that the code is wring, but this crash is happening on my FreeBSD > 6.1 and Win XP systems, with lxml 0.9.2: > > [carcass@/home/howe] python > Python 2.4.3 (#2, Apr 17 2006, 14:29:19) > [GCC 3.4.4 [FreeBSD] 20050518] on freebsd6 > Type "help", "copyright", "credits" or "license" for more information. >>>> from lxml import etree >>>> a = etree.Element('a') >>>> a.append(None) > Exception exceptions.TypeError: 'Argument must not be None.' in 'etree._raiseIfNone' ignored > zsh: segmentation fault python > > It should be more friendly by just displaying the message and not > crashing, right ? Definitely. Actually, that's already fixed on the trunk due to a different change a while ago. I didn't even know this bug existed, otherwise I would have applied it to the 0.9 branch also... Another thing related to this: In most of its API functions, ElementTree raises an AssertionError on None, while lxml raises a TypeError. I'll change a couple of other places to make it consistent. That breaks ElementTree compatibility a bit more, but I think no one should rely on code raising an AssertionError when wrong argument types are passed... > By the way, is there any way to get the lxml version from code ? Some > module '__version__' attribute ? Shouldn't it be there ? Something like: > > lxml.__version__ Sure, good idea. Actually, lxml 1.0 will have even more. You can ask it for the versions of libxml2 and libxslt that it was compiled with and that it runs with. 
All versions are represented as int tuples so that you don't have to parse a string to find out if you're running version X.X or later. But I copied the lxml version string also to __version__, I think that's a sufficiently common place to look for it. Thanks, Stefan From faassen at infrae.com Wed May 17 13:20:22 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 17 May 2006 13:20:22 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> Message-ID: <446B06F6.2070604@infrae.com> Stefan Behnel wrote: [snip] > Martijn, this should also finally answer your question about XSLT security. Cool! I was following this thread; seems there was something in my wondering about XSLT security after all. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 13:39:46 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 13:39:46 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <446B06F6.2070604@infrae.com> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <446B06F6.2070604@infrae.com> Message-ID: <446B0B82.1080707@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: > [snip] > >> Martijn, this should also finally answer your question about XSLT >> security. > > Cool! 
I was following this thread; seems there was something in my > wondering about XSLT security after all. True. Now that we have XSLTAccessControl, I will also enable the remaining libxslt extra features. The current trunk does not currently enable the output elements "output", "write" and "document", and also not the debug element. If the access control works as expected, enabling them should not do any harm, as long as the user takes care to disable file access if necessary. Regards, Stefan From faassen at infrae.com Wed May 17 14:19:52 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 17 May 2006 14:19:52 +0200 Subject: [lxml-dev] new "doc/build.txt" on building lxml from sources In-Reply-To: <446ADA78.7040104@gkec.informatik.tu-darmstadt.de> References: <446ADA78.7040104@gkec.informatik.tu-darmstadt.de> Message-ID: <446B14E8.7010208@infrae.com> Stefan Behnel wrote: > Hi all, > > I added a "doc/build.txt" apart from the simplified "INSTALL.txt" to help > interested users build lxml from sources (and to prevent normal distrib users > from caring about it). It also contains David's procedure on building lxml > statically on Windows. > > http://codespeak.net/svn/lxml/trunk/doc/build.txt > > It's both in the trunk and the 0.9 branch, so, Martijn, please update the web > site from the branch. Done now. Shiny new website is online! :) Note that I made a few minor fixes to the documentation plus some additions to CREDITS.txt. This should be merged into the trunk still. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 15:21:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 15:21:30 +0200 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st Message-ID: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> Hallo everyone, just to keep us from pushing back release dates, Martijn and I have fixed the release date of lxml 1.0 to June 1st. 
The (long) list of changes for 1.0 is here: http://codespeak.net/svn/lxml/trunk/CHANGES.txt This means for you: If you want a bug-free release ready to install, then come and help us by testing the trunk and reporting any bugs you can find. http://codespeak.net/svn/lxml/trunk Build instructions are here: http://codespeak.net/lxml/build.html It also really helps us if you read the docs and report any difficulties you encounter. The in-development documentation is here: http://codespeak.net/svn/lxml/trunk/doc/ especially http://codespeak.net/svn/lxml/trunk/doc/api.txt http://codespeak.net/svn/lxml/trunk/doc/resolvers.txt and a little bit of http://codespeak.net/svn/lxml/trunk/doc/extensions.txt Any help is appreciated to make lxml 1.0 the best XML tool for Python - ever. :) Have fun, Stefan From howe at carcass.dhs.org Wed May 17 18:29:23 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed, 17 May 2006 13:29:23 -0300 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <446B0B82.1080707@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <446B06F6.2070604@infrae.com> <446B0B82.1080707@gkec.informatik.tu-darmstadt.de> Message-ID: <128266801.20060517132923@carcass.dhs.org> Hello Stefan, Wednesday, May 17, 2006, 6:10:11 AM, you wrote: > Definitely. Actually, that's already fixed on the trunk due to a different > change a while ago. I didn't even know this bug existed, otherwise I would > have applied it to the 0.9 branch also... Oh, I was not using trunk, thanks. > Another thing related to this: > In most of its API functions, ElementTree raises an AssertionError on None, > while lxml raises a TypeError. I'll change a couple of other places to make it > consistent. 
That breaks ElementTree compatibility a bit more, but I think no > one should rely on code raising an AssertionError when wrong argument types > are passed... Yes, this is something unpythonic, too - Python raises TypeError just as you implemented: >>> float(None) Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: float() argument must be a string or a number Something that could be done to keep compatibility with both models is using a derived exception such as (I know the name is terrible): class LXMLInvalidArgument(TypeError, AssertionError): pass Or we could ask Fredrik if he intends to change it on ElementTree... > Sure, good idea. Actually, lxml 1.0 will have even more. You can ask it for > the versions of libxml2 and libxslt that it was compiled with and that it runs > with. All versions are represented as int tuples so that you don't have to > parse a string to find out if you're running version X.X or later. Nice: this is actually what the pybsddb interface does, too. > But I copied the lxml version string also to __version__, I think that's a > sufficiently common place to look for it. Sure, that is a good idea. -- Best regards, Steve mailto:howe at carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 17 19:53:02 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 17 May 2006 19:53:02 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <446AE873.3020808@gkec.informatik.tu-darmstadt.de> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> Message-ID: <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Wednesday, May 17, 2006, 6:10:11 AM, you wrote: >> In most of its API functions, ElementTree raises an AssertionError on >> None, while lxml raises a TypeError. I'll change a couple of other places >> to make it consistent.
That breaks ElementTree compatibility a bit more, >> but I think no one should rely on code raising an AssertionError when >> wrong argument types are passed... > > Something that could be done to keep compatibility with both models is > using a derived exception such as (I know the name is terrible): > > class LXMLInvalidArgument(TypeError, AssertionError): pass I thought about that, too, since we already do that for SyntaxError. However, currently most of these errors are generated from Pyrex itself through the function signatures. And I don't see why we should change that only to provide a non-intuitive exception for a rare case. Note that we have to check the type anyway in most cases (type casts etc.), so we would double the checks for a not very beautiful result. > Or we could ask Fedrik if he intends to change it on ElementTree... Sure, go, try it. He's working on ET 1.3, so that would be the right time to do it. I'm not very confident he'll like it, though... >> Sure, good idea. Actually, lxml 1.0 will have even more. You can ask it >> for the versions of libxml2 and libxslt that it was compiled with and >> that it runs with. All versions are represented as int tuples so that you >> don't have to parse a string to find out if you're running version X.X or >> later. One thing to add: I also appended the SVN revision number if available. That makes it possible to distinguish between 'official' versions from source distributions and those generated from an SVN checkout. So, when you compile from the trunk, setup.py will look for the ".svn" directory and build a version string like "0.9.2-27455". >> But I copied the lxml version string also to __version__, I think that's >> a sufficiently common place to look for it. It's actually in lxml.etree.__version__ now. Copying it to lxml.__version__ would require us to automatically import etree from lxml, which we should not do without reason. 
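The version information described above can be queried roughly as follows. This is a sketch that assumes the attribute names found in later lxml releases (LXML_VERSION, LIBXML_VERSION, LIBXML_COMPILED_VERSION, LIBXSLT_VERSION, LIBXSLT_COMPILED_VERSION); the names on the 2006 trunk may have differed. The helpers version_report and libxml2_at_least are hypothetical names used only for this example.

```python
from lxml import etree


def version_report():
    """Collect the int-tuple versions that lxml.etree exposes."""
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (running)": etree.LIBXML_VERSION,
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (running)": etree.LIBXSLT_VERSION,
        "libxslt (compiled)": etree.LIBXSLT_COMPILED_VERSION,
    }


def libxml2_at_least(*minimum):
    """Tuple comparison, so no version string has to be parsed."""
    return etree.LIBXML_VERSION >= tuple(minimum)


report = version_report()
for name, version in sorted(report.items()):
    print(name, version)

# The plain string form lives in lxml.etree, not in the lxml package.
print("lxml.etree.__version__:", etree.__version__)
```

Comparing tuples is what makes a check such as libxml2_at_least(2, 6, 20) reliable, where comparing the strings "2.10" and "2.9" would give the wrong answer.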
Stefan From howe at carcass.dhs.org Wed May 17 22:16:56 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed, 17 May 2006 17:16:56 -0300 Subject: [lxml-dev] lxml crash In-Reply-To: <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> Message-ID: <1968986390.20060517171656@carcass.dhs.org> Hello Stefan, Wednesday, May 17, 2006, 2:53:02 PM, you wrote: > I thought about that, too, since we already do that for SyntaxError. However, > currently most of these errors are generated from Pyrex itself through the > function signatures. And I don't see why we should change that only to provide > a non-intuitive exception for a rare case. Note that we have to check the type > anyway in most cases (type casts etc.), so we would double the checks for a > not very beautiful result. Yes, it would not be very nice. Does it happen on other places ? Anyway, that should not be so important if it's documented. The best thing to happen would be a ElementTree change. I wonder what's the reason for raising an AssertionError instead of TypeError... > Sure, go, try it. He's working on ET 1.3, so that would be the right time to > do it. I'm not very confident he'll like it, though... Me neither... :) He already didn't comment about the unpythonic str(element) behavior on ElementTree. >>> Sure, good idea. Actually, lxml 1.0 will have even more. You can ask it >>> for the versions of libxml2 and libxslt that it was compiled with and >>> that it runs with. All versions are represented as int tuples so that you >>> don't have to parse a string to find out if you're running version X.X or >>> later. > One thing to add: I also appended the SVN revision number if available. That > makes it possible to distinguish between 'official' versions from source > distributions and those generated from an SVN checkout. 
So, when you compile > from the trunk, setup.py will look for the ".svn" directory and build a > version string like "0.9.2-27455". Oh, great. Even better. > It's actually in lxml.etree.__version__ now. Copying it to lxml.__version__ > would require us to automatically import etree from lxml, which we should not > do without reason. I would think it's more intuitive to look for that on the topmost module (lxml only), but ElementTree also exports it (but named as VERSION on a submodule:) >>> from elementtree.ElementTree import * >>> elementtree.ElementTree.VERSION '1.2.6' Will you want to maintain compatibility with the VERSION attribute aswell ? -- Best regards, Steve mailto:howe at carcass.dhs.org From howe at carcass.dhs.org Wed May 17 22:18:29 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Wed, 17 May 2006 17:18:29 -0300 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st In-Reply-To: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> Message-ID: <33159625.20060517171829@carcass.dhs.org> Hello Stefan, Wednesday, May 17, 2006, 10:21:30 AM, you wrote: [...] > Any help is appreciated to make lxml 1.0 the best XML tool for Python - ever. Just let me know the day before and I'll provide the usual eggs. If you need anything else, please tell me, and I'll be happy in helping. -- Best regards, Steve mailto:howe at carcass.dhs.org From iny+news at iki.fi Thu May 18 06:24:31 2006 From: iny+news at iki.fi (Ilpo =?iso-8859-1?Q?Nyyss=F6nen?=) Date: Thu, 18 May 2006 07:24:31 +0300 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel writes: > Hallo everyone, > > just to keep us from pushing back release dates, Martijn and I have fixed the > release date of lxml 1.0 to June 1st. 
> > The (long) list of changes for 1.0 is here: > http://codespeak.net/svn/lxml/trunk/CHANGES.txt Does this include pretty printing? Is it possible not to strip the declaration? Or do I have to continue patching lxml for my use? Pretty printing makes XML readable and I don't want to develop any software that uses XML without it. -- Ilpo Nyyss?nen # biny # /* :-) */ From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 06:38:51 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 06:38:51 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <9ea1c1180605171334q28386e73k553108337149ad26@mail.gmail.com> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605171334q28386e73k553108337149ad26@mail.gmail.com> Message-ID: <446BFA5B.9020006@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> Noah, I'd be glad if you could test it and report back if it works as >> expected. I do not know if the directory creation business works now >> (i.e. if >> directories /are/ created). > > Hi, sorry I can't test until I have a .deb package to install - so I > am guessing I will have to wait until your changes arrive in Debian > unstable. That won't be before 1.0 then, I assume. It's actually not that hard to build Debian packages once they are part of Debian. Get the source .deb via apt and unpack it. Then you can replace the included lxml sources with the current SVN trunk (you may have to make a tgz from it), make sure the Debian package description has the version number you want and then rebuild it (IIRC, the command for that is "dpkg-rebuild" or something in that line). 
I don't have a Debian system to test, so I can't tell you what exactly you have to do. But there's an even simpler way. You can check out lxml from SVN and build it in-place in the source directory (see doc/build.txt on how to do that). To use it in your program, you can call Python like this: PYTHONPATH=/path/to/lxml-dir/src python myprogram.py To make sure it's the right version, use PYTHONPATH=/path/to/lxml-dir/src python -c \ 'import lxml.etree ; print lxml.etree.__version__' (the '\' makes it one line when you copy and paste it) Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 06:44:36 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 06:44:36 +0200 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st In-Reply-To: References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> Message-ID: <446BFBB4.1050909@gkec.informatik.tu-darmstadt.de> Hi Ilpo, Ilpo Nyyssönen wrote: > Stefan Behnel writes: >> just to keep us from pushing back release dates, Martijn and I have fixed the >> release date of lxml 1.0 to June 1st. > > Does this include pretty printing? No. > Is it possible not to strip the declaration? If you mean what I think, then that's been fixed. If you mean something else, you may have to explain it. > Or do I have to continue patching lxml for my use? Maybe, depending on why you patch it and how. This is open-source software. If you have a patch that adds a feature you need and you want to stop patching it yourself, send the patch to the mailing list to have it included. Then we will discuss it and see if we can include it or what else we can do to add the missing feature. As long as we don't know what the missing feature is, we can't get it included.
Stefan From iny+news at iki.fi Thu May 18 07:11:31 2006 From: iny+news at iki.fi (Ilpo =?iso-8859-1?Q?Nyyss=F6nen?=) Date: Thu, 18 May 2006 08:11:31 +0300 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446BFBB4.1050909@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel writes: >> Does this include pretty printing? >> Is it possible not to strip the declaration? [...] >> Or do I have to continue patching lxml for my use? > > Maybe, depending on why you patch it and how. This is open-source > software. If you have a patch that adds a feature you need and have > an interest in stopping to patch it yourself, you send a patch to > the mailing list to have it included. Then we will discuss it and > see if we can include it or what else we can do to add the missing > feature. Both of these were in a patch by someone else earlier, try google("lxml pretty print") for example. Of course that probably won't apply any more as is. -- Ilpo Nyyssönen # biny # /* :-) */ From philipp at weitershausen.de Thu May 18 09:27:40 2006 From: philipp at weitershausen.de (Philipp von Weitershausen) Date: Thu, 18 May 2006 09:27:40 +0200 Subject: [lxml-dev] Lxml 1.0 will be released on Thursday, June 1st In-Reply-To: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> Message-ID: <446C21EC.5010709@weitershausen.de> Stefan Behnel wrote: > Hallo everyone, > > just to keep us from pushing back release dates, Martijn and I have fixed the > release date of lxml 1.0 to June 1st. > > The (long) list of changes for 1.0 is here: > http://codespeak.net/svn/lxml/trunk/CHANGES.txt > > > This means for you: > > If you want a bug-free release ready to install, then come and help us by > testing the trunk and reporting any bugs you can find. How about a 1.0beta release just to broaden the install base for this testing and bug hunting?
Philipp From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 09:46:46 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 09:46:46 +0200 Subject: [lxml-dev] pretty printing revisited In-Reply-To: References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446BFBB4.1050909@gkec.informatik.tu-darmstadt.de> Message-ID: <446C2666.9050801@gkec.informatik.tu-darmstadt.de> Ilpo Nyyssönen wrote: > Stefan Behnel writes: >>> Does this include pretty printing? >>> Is it possible not to strip the declaration? >>> Or do I have to continue patching lxml for my use? >> >> Maybe, depending on why you patch it and how. This is open-source >> software. If you have a patch that adds a feature you need and have >> an interest in stopping to patch it yourself, you send a patch to >> the mailing list to have it included. Then we will discuss it and >> see if we can include it or what else we can do to add the missing >> feature. > > Both of these were in a patch by someone else earlier, try > google("lxml pretty print") for example. Of course that probably won't > apply any more as is. I know about that patch, it was written by Geert Jansen resp. Patrick Wagstrom. And it will definitely not apply to 1.0. I rewrote the patch as simply as possible. The trunk now has support for the "pretty_print" keyword we discussed at that time. I preferred the keyword over a more general "XMLWriter" class approach, after Fredrik told me that ET 1.3 will have an "xml_declaration" keyword in the tostring function, so this is more consistent. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 09:55:04 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 09:55:04 +0200 Subject: [lxml-dev] 0.9.9 as a beta release?
In-Reply-To: <446C21EC.5010709@weitershausen.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> Message-ID: <446C2858.3000603@gkec.informatik.tu-darmstadt.de> Hi Philipp, Philipp von Weitershausen wrote: > Stefan Behnel wrote: >> If you want a bug-free release ready to install, then come and help us by >> testing the trunk and reporting any bugs you can find. > > How about a 1.0beta release just to broaden the install base for this > testing and bug hunting? I also thought about that. We could release a 0.9.9 right away, so that people can give us feedback more easily, without having to run Pyrex and the like. However, I would then prefer having a single pre-release only before 1.0. Martijn, any objections to that? Steve, Olivier, Georges - could you please be ready to provide eggs for the beta release, so that we don't lose too much time before the release in June? A big thanks in advance to our helping hands, Stefan From howe at carcass.dhs.org Thu May 18 10:00:46 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu, 18 May 2006 05:00:46 -0300 Subject: [lxml-dev] 0.9.9 as a beta release? In-Reply-To: <446C2858.3000603@gkec.informatik.tu-darmstadt.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> Message-ID: <624920785.20060518050046@carcass.dhs.org> Hello Stefan, Thursday, May 18, 2006, 4:55:04 AM, you wrote: > I also thought about that. We could release a 0.9.9 right away, so that people > can give us feedback more easily, without having to run Pyrex and the like. > However, I would then prefer having a single pre-release only before 1.0. > Martijn, any objections to that? > Steve, Olivier, Georges - could you please be ready to provide eggs for the > beta release, so that we don't lose too much time before the release in June? Sure, when should we build them ? Is the current trunk the beta release ?
-- Best regards, Steve mailto:howe at carcass.dhs.org From philipp at weitershausen.de Thu May 18 10:07:53 2006 From: philipp at weitershausen.de (Philipp von Weitershausen) Date: Thu, 18 May 2006 10:07:53 +0200 Subject: [lxml-dev] 0.9.9 as a beta release? In-Reply-To: <624920785.20060518050046@carcass.dhs.org> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> Message-ID: <446C2B59.5040706@weitershausen.de> Steve Howe wrote: > Hello Stefan, > > Thursday, May 18, 2006, 4:55:04 AM, you wrote: > >> I also thought about that. We could release a 0.9.9 right away, so that people >> can give us feedback more easily, without having to run Pyrex and the like. >> However, I would then prefer having a single pre-release only before 1.0. > >> Martijn, any objections to that? > >> Steve, Olivier, Georges - could you please be ready to provide eggs for the >> beta release, so that we don't loose too much time before the release in June? > > Sure, when should we build them ? Is the current trunk the beta release > ? I would advise making a tag. I would also advise NOT to call it 0.9.9 as this suggests some offspring from the 0.9.x line. Just call it 1.0beta, this is a very common naming scheme, even in Python :) Philipp From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 10:06:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 10:06:30 +0200 Subject: [lxml-dev] 0.9.9 as a beta release? 
In-Reply-To: <624920785.20060518050046@carcass.dhs.org> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> Message-ID: <446C2B06.9040703@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Hello Stefan, > > Thursday, May 18, 2006, 4:55:04 AM, you wrote: > >> I also thought about that. We could release a 0.9.9 right away, so that people >> can give us feedback more easily, without having to run Pyrex and the like. >> However, I would then prefer having a single pre-release only before 1.0. > >> Martijn, any objections to that? > >> Steve, Olivier, Georges - could you please be ready to provide eggs for the >> beta release, so that we don't lose too much time before the release in June? > > Sure, when should we build them ? Is the current trunk the beta release > ? That was too fast. :) When we release, I will upload a tar.gz to cheeseshop and send a mail to the list. That way, it's a clean release from clean sources. Stefan From howe at carcass.dhs.org Thu May 18 10:11:53 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu, 18 May 2006 05:11:53 -0300 Subject: [lxml-dev] 0.9.9 as a beta release? In-Reply-To: <446C2B06.9040703@gkec.informatik.tu-darmstadt.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> <446C2B06.9040703@gkec.informatik.tu-darmstadt.de> Message-ID: <98170436.20060518051153@carcass.dhs.org> Hello Stefan, Thursday, May 18, 2006, 5:06:30 AM, you wrote: > When we release, I will upload a tar.gz to cheeseshop and send a mail to the > list. That way, it's a clean release from clean sources. Ok, I'll be waiting for that. I should be travelling until Monday, however. If it gets released before then, I'll do it when I arrive back.
-- Best regards, Steve mailto:howe at carcass.dhs.org From apaku at gmx.de Thu May 18 10:16:46 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Thu, 18 May 2006 10:16:46 +0200 Subject: [lxml-dev] What are the equivalents of nowrite, nomkdir options In-Reply-To: <446BFA5B.9020006@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605171334q28386e73k553108337149ad26@mail.gmail.com> <446BFA5B.9020006@gkec.informatik.tu-darmstadt.de> Message-ID: <20060518081646.GA5244@morpheus.apaku.dnsalias.org> On 18.05.06 06:38:51, Stefan Behnel wrote: > Noah Slater wrote: > >> Noah, I'd be glad if you could test it and report back if it works as > >> expected. I do not know if the directory creation business works now > >> (i.e. if > >> directories /are/ created). > > > > Hi, sorry I can't test until I have a .deb package to install - so I > > am guessing I will have to wait until your changes arrive in Debian > > unstable. > > That won't be before 1.0 then, I assume. > > It's actually not that hard to build Debian packages once they are part of > Debian. Get the source .deb via apt and unpack it. Then you can replace the > included lxml sources with the current SVN trunk (you may have to make a tgz > from it), make sure the Debian package description has the version number you > want and then rebuild it (IIRC, the command for that is "dpkg-rebuild" or > something in that line). I don't have a Debian system to test, so I can't tell > you what exactly you have to do. Let me jump in here. 
The procedure would roughly be:

apt-get source lxml
remove the unpacked directory
tar.gz the trunk version and replace the orig.tar.gz that lies in the directory
do dpkg-source -x lxml-...dsc and cd into the new directory
dch -i and put a comment in there like "use trunk version"; this will increase the Debian version number so apt/dpkg don't get confused
dpkg-buildpackage -rfakeroot -us -uc

dpkg-buildpackage may tell you that some dependencies are missing; you can either install them manually or run apt-get build-dep lxml.

That'll give you debs in the parent directory, which can be installed using dpkg -i. I did this with trunk and the s2-coder branch before 0.9 was released and it worked well. Andreas -- If your life was a horse, you'd have to shoot it. From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 10:40:39 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 10:40:39 +0200 Subject: [lxml-dev] How to build Debian packages from SVN sources In-Reply-To: <20060518081646.GA5244@morpheus.apaku.dnsalias.org> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605171334q28386e73k553108337149ad26@mail.gmail.com> <446BFA5B.9020006@gkec.informatik.tu-darmstadt.de> <20060518081646.GA5244@morpheus.apaku.dnsalias.org> Message-ID: <446C3307.7050105@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 18.05.06 06:38:51, Stefan Behnel wrote: >> It's actually not that hard to build Debian packages once they are
part of >> Debian. Get the source .deb via apt and unpack it. Then you can replace the >> included lxml sources with the current SVN trunk (you may have to make a tgz >> from it), make sure the Debian package description has the version number you >> want and then rebuild it (IIRC, the command for that is "dpkg-rebuild" or >> something in that line). I don't have a Debian system to test, so I can't tell >> you what exactly you have to do. > > Let me jump in here. The procedure would roughly be: > > apt-get source lxml > remove the unpacked directory > tar.gz the trunk version and replace the orig.tar.gz that lies in the > directory > do dpkg -x lxml-...dsc and cd into the new directory > dch -i and put a comment in there like "use trunk version", this will > increase the debian version number so apt/dpkg don't get confused > dpkg-buildpackage -rfakeroot -us -uc > > Eventually dpkg-buildpackage will tell you that some dependecies are > missing, you can either install them manually or run apt-get build-dep > lxml > > That'll give you deb's in the parent directory which can be installed > using dpkg -i. > > I did this with trunk and s2-coder branch before 0.9 was released and > it worked well. Thanks for sharing that. I added a section for it in "doc/build.txt". Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 10:57:54 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 10:57:54 +0200 Subject: [lxml-dev] 0.9.9 as a beta release? 
In-Reply-To: <446C2B59.5040706@weitershausen.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> <446C2B59.5040706@weitershausen.de> Message-ID: <446C3712.9010708@gkec.informatik.tu-darmstadt.de> Hi Philipp, Philipp von Weitershausen wrote: > Steve Howe wrote: >> Hello Stefan, >> >> Thursday, May 18, 2006, 4:55:04 AM, you wrote: >> >>> I also thought about that. We could release a 0.9.9 right away, so that people >>> can give us feedback more easily, without having to run Pyrex and the like. >>> However, I would then prefer having a single pre-release only before 1.0. >>> Martijn, any objections to that? >>> Steve, Olivier, Georges - could you please be ready to provide eggs for the >>> beta release, so that we don't loose too much time before the release in June? >> Sure, when should we build them ? Is the current trunk the beta release >> ? > > I would advise making a tag. I would also advise NOT to call it 0.9.9 as > this suggests some offspring from the 0.9.x line. Just call it 1.0beta, > this is a very common naming scheme, even in Python :) Normally, yes. The thing is that lxml currently uses a "numbers-only" versioning scheme and I'd prefer keeping it that way, especially since the version will be accessible as int tuple in 1.0. So, "1.0.beta" will not work that well, as it will become something like (1,0,"beta",0) >>> print (1,0,"beta",0) < (1,0,0,0) False is not quite the expected result. As a work-around, you could make it (1,0,-1,0) and special case the version string parser to represent "beta" as -1. I think that's a good idea. Any objections? 
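
[Editorial sketch] The work-around proposed above could look roughly like this. It is only an illustration under stated assumptions: `parse_version` is a hypothetical helper, not lxml's actual version parser, and the fixed width of four components is taken from the (1,0,-1,0) example in the message.

```python
def parse_version(version_string, width=4):
    """Turn '1.0.beta' into (1, 0, -1, 0).

    Hypothetical helper for illustration only, not part of lxml.
    """
    parts = []
    for item in version_string.split("."):
        if item == "beta":
            parts.append(-1)  # a beta must sort below the final release
        else:
            parts.append(int(item))
    # Pad with zeros so every version tuple has the same length.
    parts.extend([0] * (width - len(parts)))
    return tuple(parts)


beta = parse_version("1.0.beta")
final = parse_version("1.0")
print(beta, final, beta < final)
```

With all-integer tuples the comparison orders the beta before the final release. Note that under Python 3, comparing (1,0,"beta",0) with (1,0,0,0) raises a TypeError instead of returning False, so the integer form is the safer one either way.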
Stefan From nslater at gmail.com Thu May 18 11:10:10 2006 From: nslater at gmail.com (Noah Slater) Date: Thu, 18 May 2006 10:10:10 +0100 Subject: [lxml-dev] How to build Debian packages from SVN sources In-Reply-To: <446C3307.7050105@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605140925k605592e4g9873f2f99b56d799@mail.gmail.com> <4469DC24.9080607@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160950k2a7fe3a6u25de679636648e40@mail.gmail.com> <446A60AC.70504@gkec.informatik.tu-darmstadt.de> <446AC6DA.1070707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605171334q28386e73k553108337149ad26@mail.gmail.com> <446BFA5B.9020006@gkec.informatik.tu-darmstadt.de> <20060518081646.GA5244@morpheus.apaku.dnsalias.org> <446C3307.7050105@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605180210y6c39c53fg114c07019abf8414@mail.gmail.com> Yes, thank you. I shall experiment with this sometime over the next few days and let you know how I get on. Thanks, Noah On 5/18/06, Stefan Behnel wrote: > Hi Andreas, > > Andreas Pakulat wrote: > > On 18.05.06 06:38:51, Stefan Behnel wrote: > >> It's actually not that hard to build Debian packages once they are part of > >> Debian. Get the source .deb via apt and unpack it. Then you can replace the > >> included lxml sources with the current SVN trunk (you may have to make a tgz > >> from it), make sure the Debian package description has the version number you > >> want and then rebuild it (IIRC, the command for that is "dpkg-rebuild" or > >> something in that line). I don't have a Debian system to test, so I can't tell > >> you what exactly you have to do. > > > > Let me jump in here. 
The procedure would roughly be: > > > > apt-get source lxml > > remove the unpacked directory > > tar.gz the trunk version and replace the orig.tar.gz that lies in the > > directory > > do dpkg -x lxml-...dsc and cd into the new directory > > dch -i and put a comment in there like "use trunk version", this will > > increase the debian version number so apt/dpkg don't get confused > > dpkg-buildpackage -rfakeroot -us -uc > > > > Eventually dpkg-buildpackage will tell you that some dependecies are > > missing, you can either install them manually or run apt-get build-dep > > lxml > > > > That'll give you deb's in the parent directory which can be installed > > using dpkg -i. > > > > I did this with trunk and s2-coder branch before 0.9 was released and > > it worked well. > > Thanks for sharing that. I added a section for it in "doc/build.txt". > > Stefan > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From philipp at weitershausen.de Thu May 18 11:57:05 2006 From: philipp at weitershausen.de (Philipp von Weitershausen) Date: Thu, 18 May 2006 11:57:05 +0200 Subject: [lxml-dev] 0.9.9 as a beta release? In-Reply-To: <446C3712.9010708@gkec.informatik.tu-darmstadt.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> <446C2B59.5040706@weitershausen.de> <446C3712.9010708@gkec.informatik.tu-darmstadt.de> Message-ID: <446C44F1.4040705@weitershausen.de> Stefan Behnel wrote: > Hi Philipp, > > Philipp von Weitershausen wrote: >> Steve Howe wrote: >>> Hello Stefan, >>> >>> Thursday, May 18, 2006, 4:55:04 AM, you wrote: >>> >>>> I also thought about that. 
We could release a 0.9.9 right away, so that people >>>> can give us feedback more easily, without having to run Pyrex and the like. >>>> However, I would then prefer having a single pre-release only before 1.0. >>>> Martijn, any objections to that? >>>> Steve, Olivier, Georges - could you please be ready to provide eggs for the >>>> beta release, so that we don't loose too much time before the release in June? >>> Sure, when should we build them ? Is the current trunk the beta release >>> ? >> I would advise making a tag. I would also advise NOT to call it 0.9.9 as >> this suggests some offspring from the 0.9.x line. Just call it 1.0beta, >> this is a very common naming scheme, even in Python :) > > Normally, yes. The thing is that lxml currently uses a "numbers-only" > versioning scheme and I'd prefer keeping it that way, especially since the > version will be accessible as int tuple in 1.0. > > So, "1.0.beta" will not work that well, as it will become something like > (1,0,"beta",0) > > >>> print (1,0,"beta",0) < (1,0,0,0) > False > > is not quite the expected result. setuptools gets this right, though. > As a work-around, you could make it (1,0,-1,0) and special case the version > string parser to represent "beta" as -1. I think that's a good idea. Any > objections? No idea why you need this number comparison. As said, setuptools gets this right anyways. Philipp From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 12:09:17 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 12:09:17 +0200 Subject: [lxml-dev] 0.9.9 as a beta release? 
In-Reply-To: <446C44F1.4040705@weitershausen.de> References: <446B235A.6050607@gkec.informatik.tu-darmstadt.de> <446C21EC.5010709@weitershausen.de> <446C2858.3000603@gkec.informatik.tu-darmstadt.de> <624920785.20060518050046@carcass.dhs.org> <446C2B59.5040706@weitershausen.de> <446C3712.9010708@gkec.informatik.tu-darmstadt.de> <446C44F1.4040705@weitershausen.de> Message-ID: <446C47CD.8090005@gkec.informatik.tu-darmstadt.de> Hi Philipp, Philipp von Weitershausen wrote: > Stefan Behnel wrote: >> As a work-around, you could make it (1,0,-1,0) and special case the version >> string parser to represent "beta" as -1. I think that's a good idea. Any >> objections? > > No idea why you need this number comparison. It's meant for version specific code. If we have a bug (or a new feature) somewhere and code needs it to work or can work around it, it should be able to do things like if lxml.etree.LXML_VERSION < (1,0,2): workAroundBugOrComplain() > As said, setuptools gets this right anyways. We should not force software that uses lxml to rely on setuptools. Too many people do not have it installed. It's not required to run lxml, even ordinary users can install lxml from RPMs, .deb or windows installers. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 13:28:40 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 13:28:40 +0200 Subject: [lxml-dev] lxml 1.0.beta is on cheeseshop Message-ID: <446C5A68.2020608@gkec.informatik.tu-darmstadt.de> Hi all, lxml 1.0.beta is available from cheeseshop. This is expected to be the only beta release before the glorious 1.0 will arrive on June 1st (this year). http://cheeseshop.python.org/pypi/lxml The changelog is long and the time-to-market for 1.0 is short, so, as usual, we are happy to receive your bug reports. 
:) Have fun, Stefan From ogrisel at nuxeo.com Thu May 18 15:23:43 2006 From: ogrisel at nuxeo.com (Olivier Grisel) Date: Thu, 18 May 2006 15:23:43 +0200 Subject: [lxml-dev] Bugs in 1.0.beta Message-ID: Hi list, On my box (Ubuntu Linux Dapper Drake on i686 / python2.4) all tests pass but running the bench scripts yields some problems: >>> lxml.etree.LIBXML_COMPILED_VERSION (2, 6, 24) >>> lxml.etree.LIBXSLT_COMPILED_VERSION (1, 1, 15) xpath_extensions_old and xslt_extensions_old trigger the following exceptions: lxe: xpath_extensions_old (-- T1 ) Traceback (most recent call last): File "bench.py", line 652, in ? result = run_bench(bench, *benchmark_setup) File "bench.py", line 606, in run_bench method_call(*args) File "bench.py", line 453, in bench_xpath_extensions_old xpath = self.etree.XPath("child(.)", extensions=extensions) File "xpath.pxi", line 166, in etree.XPath.__init__ File "xpath.pxi", line 53, in etree.XPathEvaluatorBase.__init__ File "xpath.pxi", line 17, in etree._XPathContext.__init__ File "extensions.pxi", line 45, in etree._BaseContext.__init__ TypeError: unindexable object lxe: xslt_extensions_old (-- T1 ) Traceback (most recent call last): File "bench.py", line 652, in ? result = run_bench(bench, *benchmark_setup) File "bench.py", line 606, in run_bench method_call(*args) File "bench.py", line 479, in bench_xslt_extensions_old transform = self.etree.XSLT(tree, extensions) File "xslt.pxi", line 265, in etree.XSLT.__init__ File "xslt.pxi", line 189, in etree._XSLTContext.__init__ File "extensions.pxi", line 45, in etree._BaseContext.__init__ ValueError: unpack sequence of wrong size And append_element and replace_children triggers segmentation faults. The first one crashed after child . I don't have time to dig further right now. I'll have more time tonight. 
Best, -- Olivier From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 15:44:16 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 15:44:16 +0200 Subject: [lxml-dev] Bugs in 1.0.beta In-Reply-To: References: Message-ID: <446C7A30.90702@gkec.informatik.tu-darmstadt.de> Hi Olivier, Olivier Grisel wrote: > On my box (Ubuntu Linux Dapper Drake on i686 / python2.4) all tests pass but > running the bench scripts yields some problems: > > >>> lxml.etree.LIBXML_COMPILED_VERSION > (2, 6, 24) > >>> lxml.etree.LIBXSLT_COMPILED_VERSION > (1, 1, 15) > > xpath_extensions_old and xslt_extensions_old trigger the following exceptions: [snip] Ah, right, I forgot to fix those. Luckily, it's the benchmarks that are broken here, not lxml. > And append_element and replace_children triggers segmentation faults. > The first one crashed after child . Thanks, I can reproduce those. I'll see what I can come up with. T'was a good idea to release a beta, after all... Thanks for the reports! Stefan From fredrik at pythonware.com Thu May 18 23:11:55 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Thu, 18 May 2006 23:11:55 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <1968986390.20060517171656@carcass.dhs.org> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> Message-ID: Steve Howe wrote: > I wonder what's the reason for raising an AssertionError instead > of TypeError... because that's what "assert" raises: http://pyref.infogami.com/assert > Me neither... :) He already didn't comment about the unpythonic > str(element) behavior on ElementTree. there's nothing "unpythonic" about str(element) in ET. an Element is an Element, not a part of an XML file. all XML-specific behaviour is provided by the ElementTree wrapper, and related helpers. 
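
[Editorial sketch] The point about str(element) can be illustrated as follows. This is an assumption-laden example: it uses the ElementTree module that ships with current Python (`xml.etree.ElementTree`) rather than lxml, and the element names are made up.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "child").text = "hello"

# str() is not overridden, so it falls back to repr():
# it describes the object and does not serialize it.
as_text = str(root)

# Serialization has to be requested explicitly.
as_xml = ET.tostring(root)

print(as_text)
print(as_xml)
```

The first line printed is an object description of the form `<Element 'root' at 0x...>`; only the explicit `tostring()` call yields XML.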
From behnel_ml at gkec.informatik.tu-darmstadt.de Thu May 18 23:59:38 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 18 May 2006 23:59:38 +0200 Subject: [lxml-dev] Bugs in 1.0.beta In-Reply-To: References: Message-ID: <446CEE4A.5040701@gkec.informatik.tu-darmstadt.de> Hi Olivier, Olivier Grisel wrote: > On my box (Ubuntu Linux Dapper Drake on i686 / python2.4) all tests pass but > running the bench scripts yields some problems: [snip] > append_element and replace_children triggers segmentation faults. > The first one crashed after child . It took me a while to track down this bug, as it was absolutely not where the benchmark code suggested. The test case that triggers it is:

>>> a = Element('a')
>>> b = copy.deepcopy(a)
>>> b.append( Element('c') )
>>> del b

The problem lies in the modified Element.__copy__() method I wrote. Instead of creating a new document for the copied tree, it tried to copy the original document structure to keep its settings. I now found that this also requires copying the pointer to the parser dictionary; otherwise libxml2 ends up freeing still-in-use entries from it when deallocating the copied nodes... Things like this tend to remind me of one of the major reasons for having lxml: *one* place to get this stuff right is quite enough... To prevent further bugs like this, I wrote a new helper function _copyDoc that copies the document and sets the dict, and a convenience function _copyDocRoot that makes a specific node the new root node in the copied document (using fakeRootDoc before copying). The bug is now fixed on the trunk. I also updated a couple of places in xslt.pxi where documents are copied, just in case...
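
[Editorial sketch] The test case from this message can be wrapped as a small regression check. This is a minimal sketch, not the test that went into lxml: the function name `copy_then_append` is made up, and the fallback to the standard library's ElementTree is an assumption so that the snippet also runs where lxml is not installed.

```python
import copy

try:
    from lxml.etree import Element  # the library under discussion
except ImportError:
    # Stand-in offering the same Element API.
    from xml.etree.ElementTree import Element


def copy_then_append():
    """Deep-copy an element, extend the copy, then drop the copy."""
    a = Element("a")
    b = copy.deepcopy(a)
    b.append(Element("c"))
    children_of_a = [child.tag for child in a]
    children_of_b = [child.tag for child in b]
    del b  # deallocating the copy is the step that used to crash
    return children_of_a, children_of_b


print(copy_then_append())
```

The original must stay untouched by changes to the copy, and deleting the copy must not bring down the interpreter.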
Stefan From howe at carcass.dhs.org Fri May 19 04:30:47 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Thu, 18 May 2006 23:30:47 -0300 Subject: [lxml-dev] lxml crash In-Reply-To: References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> Message-ID: <533778638.20060518233047@carcass.dhs.org> Hello Fredrik, Thursday, May 18, 2006, 6:11:55 PM, you wrote: >> I wonder what's the reason for raising an AssertionError instead >> of TypeError... > because that's what "assert" raises: > http://pyref.infogami.com/assert I know assert statements raise that, but no other Python function raises AssertionError on wrong parameters, right ? This is debug code. I mentioned an example where TypeError is raised instead of AssertionError. Is there a reason for not following the Python convention ? What other Python functions raise AssertionError instead of TypeError ? See:

>>> float(None)
Traceback (most recent call last):
  File "", line 1, in ?
TypeError: float() argument must be a string or a number

And about str(Element): the Python manual says it should "Return a string containing a nicely printable representation of an object", but it is returning the same as repr(). But since the library is yours (and don't get me wrong, I love its API), you do as you wish.
-- Best regards, Steve mailto:howe at carcass.dhs.org From fredrik at pythonware.com Fri May 19 08:48:26 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 19 May 2006 08:48:26 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <533778638.20060518233047@carcass.dhs.org> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> <533778638.20060518233047@carcass.dhs.org> Message-ID: Steve Howe wrote: > I know assert statements raise that, but no other Python function raises > AssertionError on wrong parameters, right ? This is debug code. I > mentioned an example where TypeError is raised instead of > AssertionError. Is there a reason for not following the Python > convention ? What other Python functions raise AssertionError instead of > TypeError ? any function that uses assert, which is the most efficient way to add *optional* assertions to code written in *Python*. if you're writing user code that relies on catching assertion errors, you need to fix your code. >>>> float(None) > Traceback (most recent call last): > File "", line 1, in ? > TypeError: float() argument must be a string or a number so? why should a Python implementation of a library have to suffer because some built-in function is written in C? > And about str(Element): the Python manual says it should "Return a > string containing a nicely printable representation of an object", but > it is returning the same as repr(). which is perfectly okay, and perfectly pythonic according to the same documentation (after all, str() maps to repr(), if you don't override things). it's in fact rather unpythonic to use str() for serialization of non- trivial objects. serialization should be explicit, not implicit. 
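
[Editorial sketch] The distinction argued in this thread can be shown in a few lines. Both functions below are hypothetical and are not taken from ET or lxml: one guards its argument with an optional `assert`, the other with an explicit check that always raises TypeError.

```python
def append_checked_by_assert(items, element):
    # Optional check: removed entirely when Python runs with -O.
    assert isinstance(element, str), "expected a string"
    items.append(element)


def append_checked_explicitly(items, element):
    # Always-on check: raises TypeError whether or not -O is used.
    if not isinstance(element, str):
        raise TypeError("expected a string, got %s" % type(element).__name__)
    items.append(element)


def error_name(func, value):
    """Return the name of the exception raised by func, or None."""
    try:
        func([], value)
    except (AssertionError, TypeError) as exc:
        return type(exc).__name__
    return None


print(error_name(append_checked_by_assert, None))
print(error_name(append_checked_explicitly, None))
```

Run normally, the first call reports AssertionError and the second TypeError. Run with `python -O`, the assert disappears and the first function silently accepts the bad value, which is why code that relies on catching AssertionError is fragile.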
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 19 09:01:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 19 May 2006 09:01:58 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> <533778638.20060518233047@carcass.dhs.org> Message-ID: <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> Hi Fredrik, Fredrik Lundh wrote: > Steve Howe wrote: > >> I know assert statements raise that, but no other Python function raises >> AssertionError on wrong parameters, right ? This is debug code. I >> mentioned an example where TypeError is raised instead of >> AssertionError. Is there a reason for not following the Python >> convention ? What other Python functions raise AssertionError instead of >> TypeError ? > > any function that uses assert, which is the most efficient way to add > *optional* assertions to code written in *Python*. [snip] > why should a Python implementation of a library have to suffer > because some built-in function is written in C? Ok, so, if I understand that right, using AssertionError is an internal optimisation. It is only used to allow switching it off for performance reasons. (Note that you can't do that in C code.) So, for exactly the same reason, lxml will continue to raise TypeError for both None values and other invalid argument types. The test is done internally by Pyrex anyway and therefore the cheapest solution. According to your arguments, code that catches assertions is broken anyway, so this is not even an incompatibility. That's good news. >> And about str(Element): the Python manual says it should "Return a >> string containing a nicely printable representation of an object", but >> it is returning the same as repr(). 
> > which is perfectly okay, and perfectly pythonic according to the same > documentation (after all, str() maps to repr(), if you don't override > things). I also think that's ok. We already had this discussion and lxml follows ET here. Regards, Stefan From fredrik at pythonware.com Fri May 19 10:06:57 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 19 May 2006 10:06:57 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> <533778638.20060518233047@carcass.dhs.org> <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > Ok, so, if I understand that right, using AssertionError is an internal > optimisation. It is only used to allow switching it off for performance > reasons. (Note that you can't do that in C code.) well, you can, but I don't think you need to (see below). > So, for exactly the same reason, lxml will continue to raise TypeError for > both None values and other invalid argument types. The test is done internally > by Pyrex anyway and therefore the cheapest solution. that's perfectly okay. if not else, we could view lxml as a library that implements the ET interface with assertions disabled -- after all, in general, a piece of code that generates an assertion error if you use it incorrectly is free to do whatever it wants if you disable assertions. > According to your arguments, code that catches assertions is broken anyway, so > this is not even an incompatibility. exactly. 
(the only problem is doctest-based test suites, but I guess we have to live with that, at least until someone gets around to write an ET validator (or I get around to finish the one I started...)) From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 19 12:40:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 19 May 2006 12:40:12 +0200 Subject: [lxml-dev] lxml/ET/cET performance comparison in doc/performance.txt Message-ID: <446DA08C.70002@gkec.informatik.tu-darmstadt.de> Hi all, I've re-run the benchmark suite (which works nicely now) and written up some documentation about the performance of lxml in comparison to ET/cET. It's in http://codespeak.net/svn/lxml/trunk/doc/performance.txt If anyone is interested in extending the benchmark suite over other parts of the API or in adding to the comparison, I'd be glad to see some input. :) Have fun, Stefan From tseaver at palladion.com Fri May 19 18:11:17 2006 From: tseaver at palladion.com (Tres Seaver) Date: Fri, 19 May 2006 12:11:17 -0400 Subject: [lxml-dev] lxml crash In-Reply-To: <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> <533778638.20060518233047@carcass.dhs.org> <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> Message-ID: <446DEE25.5050700@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stefan Behnel wrote: > Hi Fredrik, > > Fredrik Lundh wrote: > >>Steve Howe wrote: >> >> >>>I know assert statements raise that, but no other Python function raises >>>AssertionError on wrong parameters, right ? This is debug code. I >>>mentioned an example where TypeError is raised instead of >>>AssertionError. Is there a reason for not following the Python >>>convention ? What other Python functions raise AssertionError instead of >>>TypeError ? 
>> >>any function that uses assert, which is the most efficient way to add >>*optional* assertions to code written in *Python*. > > [snip] > >>why should a Python implementation of a library have to suffer >>because some built-in function is written in C? > > > Ok, so, if I understand that right, using AssertionError is an internal > optimisation. It is only used to allow switching it off for performance > reasons. (Note that you can't do that in C code.) > > So, for exactly the same reason, lxml will continue to raise TypeError for > both None values and other invalid argument types. The test is done internally > by Pyrex anyway and therefore the cheapest solution. > > According to your arguments, code that catches assertions is broken anyway, so > this is not even an incompatibility. > > That's good news. I think the assertion was that catching AssertionError is broken, because the assertion failure is a *programming* error, not a *runtime* error. The TypeErrors raised from Pyrex probably fit the same way. More framework-y code sometimes ends up having to catch such errors, either for backwards compatibility or to allow for things like broken plugins. Tres. - -- =================================================================== Tres Seaver +1 202-558-7113 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEbe4l+gerLs4ltQ4RArEKAJ98L2UYoQ/lXXScXx2p3IwXSu5/aQCeNYLG RnMvKP5qOg/GXoPojvqcOPw= =4rgO -----END PGP SIGNATURE----- From nslater at gmail.com Sat May 20 04:02:55 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 03:02:55 +0100 Subject: [lxml-dev] Encoding Issues Message-ID: <9ea1c1180605191902u5d2f0d9k38953007a8217997@mail.gmail.com> Hello, I notice that I can pass an encoding parameter to an elementtree's write method. 
Two things: 1) When lxml supports XML processing instructions, will these be updated accordingly? I notice that the DocBook XSL will update meta-equiv HTML elements according to the value I pass in here - which is good, except for the following... 2) Why does this let me make up encodings such as "Noahs-Cool-Encoding" without raising an exception? If I was to take a guess I would imagine this falls back on UTF-8 but this is a bug IMO. In case you are wondering I am in the process of writing an XML-based HTTP publishing framework and lxml will be sitting at the very core of how I handle document conversion/manipulation/transformation. The problem with encoding lies within my content negotiation module which will (in a pythonic manner IMO) try to transform the document with each encoding specified in the Accept-Charset header of the client request. If the transform raises an exception we move on to the next one. My previous way of working would raise an exception if I tried an encoding it didn't recognise. While I am aware of how to look up encoding names using the python standard library - I am not sure if this correlates 100% with lxml and additionally I don't feel this extra step should be necessary. Thanks so much. Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 20 10:45:49 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 20 May 2006 10:45:49 +0200 Subject: [lxml-dev] Encoding Issues In-Reply-To: <9ea1c1180605191902u5d2f0d9k38953007a8217997@mail.gmail.com> References: <9ea1c1180605191902u5d2f0d9k38953007a8217997@mail.gmail.com> Message-ID: <446ED73D.1010909@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I notice that I can pass an encoding parameter to an elementtree's write method.
:) http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree-class > Two things: > > 1) When lxml supports xml processing instructions, will these be > updated accordingly? lxml doesn't currently support processing instructions. If you meant the XML declaration, then: yes, starting with 1.0.beta. > 2) Why does this let me make up encodings such as > "Noahs-Cool-Encoding" without raising an exception. If I was to take a > guess I would imagine this falls back on UTF-8 but these is a bug IMO. True, guess we should ask libxml2 to parse the encoding and raise an exception if it is not known. Since it's already parsed a couple of times, that's not too much of a problem. > In case you are wondering I am in the process of writing an XMl based > HTTP publishing framework and lxml will be sitting at the very core of > how I handle document conversion/manipulation/transformation. Interesting. Feel free to post a URL to the list in case it becomes available online. > The problem with encoding lies within my content negotiation module > which will (in a pythonic manner IMO) try to transform the document > with each encoding specified in the Accept-Charset header of the > client request. If the transform raises an exception we move on to the > next one. Sure, sounds sensible. Although most likely a commonly accepted encoding such as UTF-8 should be fine in most cases. As an optimisation, you can check if it's in the acceptance list and only if it's not accepted, fall back to checking one after the other. > While I am aware of how to look up encoding names using the python > standard library - I am not sure if this correlates 100% with lxml and > additionally I don't feel this extra step should be necessary. Well, it is necessary because we can only rely on encodings known in libxml2 (which uses iconv, so that's most of the encodings you will ever come across). 
And except for the UCS4 bug, libxml2 is pretty good in guessing what encoding was meant, so as long as no one finds a discrepancy in what Python understands and what libxml2 handles, I don't see a reason for changing anything here. I'll make sure we raise an exception for unknown encodings, though. Stefan From nslater at gmail.com Sat May 20 15:45:50 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 14:45:50 +0100 Subject: [lxml-dev] XMl Processing Instructions In-Reply-To: <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com> <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605200645r2b109068mdcdad621e802e0dd@mail.gmail.com> Hi guys, Sorry, just noticed this now: > > I would like my documents to start with the processing instruction so > > I can specify encodings other than UTF-8. > > Hmm, I didn't verify this, although I actually thought lxml produced a > declaration here. If not, this should be considered a bug, as it is likely > inconsistent with ElementTree. I guess this is the same problem as for > tostring(), which only started having the expected behaviour fairly recently. I disagree on your last point - I think tostring's utility comes from it's standalone nature - i.e. no XML declaration, PIs etc. While I think the write/write_c14n methods on an ElementTree should produce the PIs (XML declaration included) I do not think that simple Element serialisation should include an XML declaration. I am not sure about how other people use it, but in my case I am using etree.tostring to generate and analyse XML fragments in isolation - I would not want an XML declaration messing things up. I intend to post some code to this list in the next few days, which by coincidence will demonstrate my particular use case for tostring. I hope this makes sense. 
Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From nslater at gmail.com Sat May 20 15:48:09 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 14:48:09 +0100 Subject: [lxml-dev] XMl Processing Instructions In-Reply-To: <446A07E2.4060606@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com> <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de> <4469E89A.10409@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160942n3ff25079n397211131cc2caa9@mail.gmail.com> <446A07E2.4060606@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605200648y432ae017ifb41ca5628a6d6bd@mail.gmail.com> Hi Stefan, > > To summarise, in an ideal world I would like to be able to transform a > > document using XSLT specifying an encoding at transformation time and > > have the ResultTree serialise with all processing instructions intact. > > Additionally I would like to be able to access these programmatically > > - which I don't think is possible at the moment. > > That's also a feature of the developer version that will eventually become > lxml 1.0. See the 'docinfo' feature described here: > http://codespeak.net/svn/lxml/trunk/doc/api.txt Sorry if I am being a pain, but can I just clarify what you meant by this. Were you also indicating that programmatic access to PIs would be available in lxml 1.0? If not, is this on the time line? Thanks so much, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman From nslater at gmail.com Sat May 20 15:55:51 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 14:55:51 +0100 Subject: [lxml-dev] Encoding Issues In-Reply-To: <446ED73D.1010909@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605191902u5d2f0d9k38953007a8217997@mail.gmail.com> <446ED73D.1010909@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605200655p5d718cfbkc6fc38fe0e1b28aa@mail.gmail.com> Hi Stefan, > http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree-class Yeah, but help() and dir() is so much more fun don't you think? ;) > Interesting. Feel free to post a URL to the list in case it becomes available > online. Without a doubt, my software is being developed under a GNU GPL licence and I intend to distribute it far and wide. :) I just wanted to say again, thanks for this great software! Regards, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 20 16:14:03 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 20 May 2006 16:14:03 +0200 Subject: [lxml-dev] XMl Processing Instructions In-Reply-To: <9ea1c1180605200645r2b109068mdcdad621e802e0dd@mail.gmail.com> References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com> <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de> <9ea1c1180605200645r2b109068mdcdad621e802e0dd@mail.gmail.com> Message-ID: <446F242B.5010408@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> > I would like my documents to start with the processing instruction so >> > I can specify encodings other than UTF-8. >> >> Hmm, I didn't verify this, although I actually thought lxml produced a >> declaration here. If not, this should be considered a bug, as it is >> likely >> inconsistent with ElementTree. 
I guess this is the same problem as for >> tostring(), which only started having the expected behaviour fairly >> recently. > > I disagree on your last point - I think tostring's utility comes from > it's standalone nature - i.e. no XML declaration, PIs etc. While I > think the write/write_c14n methods on an ElementTree should produce > the PIs (XML declaration included) I do not think that simple Element > serialisation should include an XML declaration. tostring() and write() now produce XML declarations just as ElementTree does. You can switch them off for tostring() by passing "xml_declaration=False", which is consistent with ET 1.3 (as Fredrik told me). Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 20 16:18:02 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 20 May 2006 16:18:02 +0200 Subject: [lxml-dev] XMl Processing Instructions In-Reply-To: <9ea1c1180605200648y432ae017ifb41ca5628a6d6bd@mail.gmail.com> References: <9ea1c1180605141143s2401e3bdrda0f5c0d276a3c55@mail.gmail.com> <4469E5E9.3000205@gkec.informatik.tu-darmstadt.de> <4469E89A.10409@gkec.informatik.tu-darmstadt.de> <9ea1c1180605160942n3ff25079n397211131cc2caa9@mail.gmail.com> <446A07E2.4060606@gkec.informatik.tu-darmstadt.de> <9ea1c1180605200648y432ae017ifb41ca5628a6d6bd@mail.gmail.com> Message-ID: <446F251A.1010903@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater schrieb: >> > To summarise, in an ideal world I would like to be able to transform a >> > document using XSLT specifying an encoding at transformation time and >> > have the ResultTree serialise with all processing instructions intact. >> > Additionally I would like to be able to access these programmatically >> > - which I don't think is possible at the moment. >> >> That's also a feature of the developer version that will eventually >> become >> lxml 1.0. 
See the 'docinfo' feature described here: >> http://codespeak.net/svn/lxml/trunk/doc/api.txt > > Sorry if I am being a pain, but can I just clarify what you meant by > this. Were you also indicating that programmatic access to PIs would > be available in lxml 1.0? No. All functionality of 1.0 is described in the documentation and already available in 1.0.beta. > If not, is this on the time line? No. Feel free to provide a patch that implements ProcessingInstruction in an ET compatible way. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 20 18:11:21 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 20 May 2006 18:11:21 +0200 Subject: [lxml-dev] Can't change text in comments In-Reply-To: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> References: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> Message-ID: <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I have noticed that setting the 'text' of a Comment does not seem to > change anything when the Comment is serialised. Oh well, /that/ was an old bug you found there. The code in that area dates from SVN revision 8082 - we're at 27509 now. Changing the text of a comment was never implemented, although ET supports it. The reason (I believe) why Martijn did not implement it at the time (unless he just forgot) is that changing comment texts in libxml2 is a bit tricky. There are no libxml2 functions for doing that and it requires checking the document dictionary to see if the original comment string isn't still in use anywhere else. Great place to add some more segfaults. I implemented this and actually found a couple of other bugs that I fixed (including the test cases that relied on these bugs). What bothers me is that lxml is consistent with ElementTree in that it adds whitespace around comment texts. I have no idea why ElementTree does that in the first place. AFAICT, this happily breaks things like SSI. 
Maybe Fredrik can enlighten us here? Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 20 18:45:39 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 20 May 2006 18:45:39 +0200 Subject: [lxml-dev] Can't change text in comments In-Reply-To: <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> Message-ID: <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> Hi all, Stefan Behnel wrote: > What bothers me is that > lxml is consistent with ElementTree in that it adds whitespace around comment > texts. I have no idea why ElementTree does that in the first place. AFAICT, > this happily breaks things like SSI. I took a second look at this and found that lxml can't actually support this. Since the parser does not ignore comments (as ET does) and since we don't serialise on our own, we can't make sure that we always add spaces around the comment text. We can do that through the API calls to Comment(), but that would make things inconsistent compared to parsed trees. So, the only solution I can see is to be incompatible with ET here and not add spaces around comment texts. This means that

    >>> c = Comment("test")

will result in "<!--test-->" in the serialised XML data, as opposed to ElementTree's "<!-- test -->". On the other hand, accessing the .text attribute will be identical in both:

    >>> c.text
    'test'
    >>> c.text = "TEST"
    >>> c.text
    'TEST'

So, the only problem is serialisation here. I personally believe that the lxml way of doing it is better, since it does not modify the comment provided by parser or user. Unless someone can convince me of the opposite, this will be the way lxml 1.0 will work then.
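The behaviour described above can be tried out with the `xml.etree.ElementTree` module in today's Python standard library, which reports itself as version 1.3. This is a sketch under that assumption and not lxml's own code: serialising a comment leaves its text unpadded, while `.text` stays readable and writable.

```python
import xml.etree.ElementTree as ET

# Create a comment node and serialise it to a str (no XML declaration).
c = ET.Comment("test")
serialised = ET.tostring(c, encoding="unicode")

# Changing .text is reflected the next time the comment is serialised.
c.text = "TEST"
updated = ET.tostring(c, encoding="unicode")

if __name__ == "__main__":
    print(serialised)
    print(updated)
```

No whitespace is added around the comment text on output, so what was put into the comment is what comes back out.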
Stefan From nslater at gmail.com Sat May 20 19:40:11 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 18:40:11 +0100 Subject: [lxml-dev] Can't change text in comments In-Reply-To: <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605201040g4df31abap35fdf5676ab10461@mail.gmail.com> > So, the only problem is serialisation here. I personally believe that the lxml > way of doing it is better, since it does not modify the comment provided by > parser or user. Unless someone can convince me of the opposite, this will be > the way lxml 1.0 will work then. +1 -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From nslater at gmail.com Sat May 20 23:32:09 2006 From: nslater at gmail.com (Noah Slater) Date: Sat, 20 May 2006 22:32:09 +0100 Subject: [lxml-dev] ElementTree pretty printing (serialisation) Message-ID: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> Hello again, My constant spamming of this list has finally paid off and I have something to show for all my questions. I have attached the source of a fairly advanced pretty printing serialiser for ElementTree (and ResultTree) objects. The script (after `chmod 755`) can be called from the command line like so:

    ./prettyprint.py [DOCUMENT]

The script can also be imported and used like so:

    import sys
    from lxml import etree
    import prettyprint

    document = etree.parse(...)
    serialiser = prettyprint.ElementTreeSerialiser()
    serialiser.write(document, sys.stdout)

I am new to python, and even newer to lxml and the ElementTree API. As a consequence I may be missing some obvious optimisations. This is only my first stab at this problem and I would love any feedback you care to offer.
Is pretty printing something that is on the lxml time line - and if not, would my method demonstrated here interest you from an implementation point of view? Thanks so much for your time. Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman -------------- next part -------------- A non-text attachment was scrubbed... Name: prettyprint.py Type: text/x-python Size: 8825 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060520/ec3d2198/attachment-0001.py From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 21 08:52:16 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 21 May 2006 08:52:16 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> Message-ID: <44700E20.7020707@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > My constant spamming of this list has finally paid of and I have > something to show for all my questions. > > I have attached the source of a fairly advanced pretty printing > serialiser for ElementTree (and ResultTree) objects. Thanks for sharing this. > The script (after `chmod 755`) can be called from the command line like so: > ./prettyprint.py [DOCUMENT] Admittedly, that is shorter than this: python -c 'from lxml.etree import ElementTree as et; \ et("myfile.xml").write("mynewfile.xml", pretty_print=True)' > Is pretty printing something that is on the lxml time line It's in 1.0.beta. http://codespeak.net/svn/lxml/trunk/CHANGES.txt > - and if not, would my method demonstrated here interest you from an > implementation point of view? Ok, I looked through it and the only difference I could see compared to the pretty_print keyword is that you also wrap the data (.text) unless prevented by the list in 'preformated_elements'. 
You also use a hook for treating the serialised byte stream, although I think it's a bad idea to do this for splitting elements between attributes. So, my impression is that you are duplicating the pretty print code that we already have in lxml. I really think you should decide which way you go: serialising 'by hand' or treating the XML byte stream. If you want to work on the byte stream, you might consider using the pretty printer of libxml2 and then check each line if it is already short enough before treating it. Splitting long lines at whitespace is a pretty simple thing to do. If you want to serialize by walking the tree, then you should do that completely at the element level, preferably with code that also works for the original ElementTree. So, I think there are a lot of possible simplifications for your code. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 21 10:08:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 21 May 2006 10:08:08 +0200 Subject: [lxml-dev] XMLFormatter revisited Message-ID: <44701FE8.6040002@gkec.informatik.tu-darmstadt.de> Hi all, coming back to what I proposed a while ago, we currently have this:

    class XMLParser:
        def __init__(self, **options):
            self.options = options

    class HTMLParser:
        ...

    doc = ET.parse(source, parser=XMLParser(configuration))

At the time, I brought up an equivalent for output formatting:

    class XMLFormatter:
        ...

That would give us a nice, symmetric API for input and output options. And it allows you to use subclasses to provide different formats:

    class XMLPrettyPrinter(XMLFormatter):
        def __init__(self):
            self._pretty_print = True

    class XHTMLFormatter(XMLFormatter):
        def __init__(self):
            self._xhtml = True

    xml = ET.tostring(element, formatter=XMLPrettyPrinter())

    ET.parse("myfile.xhtml").write("out.xhtml", formatter=XHTMLFormatter())

After Noah's latest approach in that direction, I think we should adopt this API for 1.0.
A version 1.0 is supposed to provide a stable and somewhat future-proof API, and adding keyword arguments to various API functions all the time is not quite what I call future-proof. One reason is that you can't test for the availability of keyword arguments, so adding features at that level is difficult to handle for code that wants to support them as an option. So, I propose to replace the current pretty_print keyword (which only appeared in the beta version anyway) with a new XMLPrettyPrinter class and to provide new features preferably at a subclass level of XMLFormatter (e.g. XHTMLFormatter, etc.). An alternative name would be XMLSerializer, maybe that's more general. We could also leave the pretty_print keyword in as a shortcut, but that would obviously make things a bit more complicated internally. Maybe pretty printing will just become a general keyword of the XMLFormatter class, I guess that would make sense. I will have to check how to make these classes nicely usable internally, but that's a minor problem. We already use a lot of xmlBuffer code, so that should give us a common ground for this API. If there are no objections, I'll start getting my hands on this next week, so if anyone has an opinion on this, please speak up soon. Stefan From fredrik at pythonware.com Sun May 21 13:14:38 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Sun, 21 May 2006 13:14:38 +0200 Subject: [lxml-dev] Can't change text in comments In-Reply-To: <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > So, the only solution I can see is to be incompatible with ET here and not add > spaces around comment texts. +1. 
I'll fix this in 1.3 (or whatever the post-1.2 release will be called) From nslater at gmail.com Sun May 21 18:52:43 2006 From: nslater at gmail.com (Noah Slater) Date: Sun, 21 May 2006 17:52:43 +0100 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <44700E20.7020707@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> Hi, > > Is pretty printing something that is on the lxml time line > > It's in 1.0.beta. Oh, I feel a little silly now - I had no idea it was already included. I took this opportunity to easy_install the 1.0beta version. > Ok, I looked through it and the only difference I could see compared to the > pretty_print keyword is that you also wrap the data (.text) unless prevented > by the list in 'preformated_elements'. Not true - I have examined the lxml pretty print output and there is quite some difference. In fact, with the exception of a few simplistic documents with no textual element content I cannot see what effect pretty_print has. > You also use a hook for treating the serialised byte stream, although I think > it's a bad idea to do this for splitting elements between attributes. Why do you think this is a bad idea? I was a little hesitant about doing it in the first place because technically it is no longer XML processing, but string processing. As safe as I think my code is, it does feel like there ought to be a better way of doing it. Or do you think it is a bad idea because you don't think element tags should be wrapped on a per-attribute basis? > So, my impression is that you are duplicating the pretty print code that we > already have in lxml. I really think you should decide which way you go: > serialising 'by hand' or treating the XML byte stream. Like I stated above - I still cannot see what pretty_print is actually doing.
I do know that it is unsuitable for my purposes because of its inability to word wrap and indent element tags (which is the definition of pretty printing IMHO). In addition, I am curious why you think the combination of ElementTree navigation and byte stream manipulation is a bad one. I would love to wrap element tags on an attribute basis in a programmatic way - but string processing seemed like my only option. > If you want to work on the byte stream, you might consider using the pretty > printer of libxml2 and then check each line if it is already short enough > before treating it. Splitting long lines at whitespace is a pretty simple > thing to do. You lost me here... I tried using help() and google but got lost trying to find a reference to the libxml2 pretty printer you speak of. > If you want to serialize by walking the tree, then you should do that > completely at the element level, preferably with code that also works for the > original ElementTree. Sorry, could you clarify your point here? I am easily confused. :) Thanks for the feedback Stefan! Regards, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 22 06:43:21 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 22 May 2006 06:43:21 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> Message-ID: <44714169.1050402@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I have examined the lxml pretty print output and there is > quite some difference.
> In fact, with the exception of a few simplistic
> documents with no textual element content I cannot see what effect
> pretty_print has.

Interesting. Are you suggesting that this feature is actually not working for you? As far as I understand, you say that you get the normal one-line output when there is textual content in the XML? Because I cannot reproduce that:

    >>> print tostring(XML("test test test"), pretty_print=True)
    test test test

This is absolutely the expected result. May I ask what version of libxml2 you are using?

>> You also use a hook for treating the serialised byte stream, although
>> I think it's a bad idea to do this for splitting elements between attributes.
>
> Why do you think this is a bad idea?

A better place to do this would be a filter in the form of a file-like object, as this is much more generic and memory efficient. Something like

    class FileFilter(object):
        def __init__(self, out_file):
            self.out_file = out_file

        def write(self, data):
            new_data = data  # treat data here
            self.out_file.write(new_data)

This is a totally generic approach that does not rely on lxml and works with any XML stream (as long as you take care of encodings), even when copied directly from a file or things like that.

But I still think that it would be better to do such adaptations based on walking the XML tree rather than the byte stream. One reason is that you are duplicating considerations about encodings and parsing that you wouldn't have in the XML infoset. You can always write your own serialiser based on element.getiterator(). Also, feel free to take a look at how ElementTree does it.

> I was a little hesitant about doing it in the first place because
> technically it is no longer XML processing, but string processing. As
> safe as I think my code is, it does feel like there aught to be a
> better way of doing it.

How do you deal with, say, a UTF-16 encoded XML byte stream?

> Like I stated above - I still cannot see what pretty_print is actually
> doing.
I do know that it is unsuitable for my purposes because of it's > inability to word wrap and indent elements tags (which is the > definition of pretty printing IMHO). Well, it does indent element tags on my side, which (IMHO) is the definition of XML pretty printing. How is it supposed to know that adding whitespace to the data between tags does not break anything? >> If you want to work on the byte stream, you might consider using the >> pretty >> printer of libxml2 and then check each line if it is already short enough >> before treating it. Splitting long lines at whitespace is a pretty simple >> thing to do. > > You lost me here... I tried using help() and google but got lost > trying to find a reference to the libxml2 pretty printer you speak of. I meant the pretty_print option. Set it to true and use the above file filter approach by looking for "\n". But make sure the serialized result is in UTF-8 or unicode or another byte format that's readily usable by Python in a portable way. Stefan From faassen at infrae.com Mon May 22 10:34:42 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 22 May 2006 10:34:42 +0200 Subject: [lxml-dev] lxml crash In-Reply-To: <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> References: <472582143.20060517033848@carcass.dhs.org> <446AE873.3020808@gkec.informatik.tu-darmstadt.de> <446B62FE.1060907@gkec.informatik.tu-darmstadt.de> <1968986390.20060517171656@carcass.dhs.org> <533778638.20060518233047@carcass.dhs.org> <446D6D66.6080801@gkec.informatik.tu-darmstadt.de> Message-ID: <447177A2.4050009@infrae.com> Stefan Behnel wrote: [snip] > So, for exactly the same reason, lxml will continue to raise TypeError for > both None values and other invalid argument types. The test is done internally > by Pyrex anyway and therefore the cheapest solution. This is why I chose to use TypeError instead of AssertionError originally in lxml - it was the simplest thing to do and came for free. 
Performance is an incidental happy side effect. :) Regards, Martijn From faassen at infrae.com Mon May 22 10:37:47 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 22 May 2006 10:37:47 +0200 Subject: [lxml-dev] Can't change text in comments In-Reply-To: <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605200741t6c4b47b3r29f25055159c7f5f@mail.gmail.com> <446F3FA9.2090509@gkec.informatik.tu-darmstadt.de> <446F47B3.5050409@gkec.informatik.tu-darmstadt.de> Message-ID: <4471785B.6080201@infrae.com> Stefan Behnel wrote: > Hi all, > > Stefan Behnel wrote: >> What bothers me is that >> lxml is consistent with ElementTree in that it adds whitespace around comment >> texts. I have no idea why ElementTree does that in the first place. AFAICT, >> this happily breaks things like SSI. > > I took a second look at this and found that lxml can't actually support this. > Since the parser does not ignore comments (as ET does) and since we don't > serialise on our own, we can't make sure that we always add spaces around the > comment text. We can do that through the API calls to Comment(), but that > would make things inconsistent compared to parsed trees. Right, I recall looking into that long ago and coming to the same conclusion. I should've marked that in the compatibility file. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 22 16:07:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 22 May 2006 16:07:43 +0200 Subject: [lxml-dev] XMLFormatter revisited In-Reply-To: <44701FE8.6040002@gkec.informatik.tu-darmstadt.de> References: <44701FE8.6040002@gkec.informatik.tu-darmstadt.de> Message-ID: <4471C5AF.6020900@gkec.informatik.tu-darmstadt.de> Hi all, Stefan Behnel wrote: > class XMLParser: > def __init__(self, **options): > self.options = options > > class XMLFormatter: > ... > > That would give us a nice, symmetric API for input and output options. Oh, well. 
Just as usual, it's not as easy as it seems at first sight. I found that such an API does not work quite that well, neither for the integration with ET, nor with the implementation on top of libxml2. libxml2 supports the xmlSave... API, which has some nice features for formatting XML. However, it also has a number of bugs and some side effects that make it ugly to integrate into a nicely ET compatible API. One of these nice design decisions was to output a '\n' at the end of the XML output. While this is not too much of a problem when saving XML to a file, it is rather ugly when writing to StringIOs and strings, and it's unluckily non trivial to remove this character from an encoded string. The alternative would be to use an API that can't write XML declarations. Great! So I do not currently see a way to support both tostring() and ET.write() on top of the xmlSave* functions. However, I think it would still be nice to have an API that allows some more fine-tuning of the output, like character-entity conversion or hooks into the serialization process. For this, a separate class XMLSerializer looks like the way to go. I have implemented a simple incarnation of such a beast in the "xmlsave" branch, however, I'm not sufficiently satisfied with it to make it part of lxml 1.0. It's a separate feature anyway, so there is no hurry to integrate it. But if someone wants to take a look at it... 
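As a rough illustration of why the trailing '\n' is awkward to drop once the output is encoded, here is a minimal sketch. It is not how lxml or libxml2 handle this; `strip_trailing_newline` is a hypothetical helper, and it assumes the caller knows the encoding the serialiser used.

```python
def strip_trailing_newline(data, encoding):
    """Drop one trailing newline from encoded XML by going through text.

    Hypothetical helper: assumes `encoding` is the encoding that was
    actually used to produce `data`.
    """
    text = data.decode(encoding)
    if text.endswith("\n"):
        text = text[:-1]
    return text.encode(encoding)


# In UTF-16 (big endian) the newline is the two bytes 0x00 0x0A.
encoded = "<root/>\n".encode("utf-16-be")

# Stripping at the byte level removes only the 0x0A and leaves a stray 0x00.
naive = encoded.rstrip(b"\n")

# Going through text removes the whole character.
clean = strip_trailing_newline(encoded, "utf-16-be")
```

The round trip through text costs a decode and an encode of the whole result, which may be part of why doing this inside a generic tostring() is unattractive.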
Regards,
Stefan

From nslater at gmail.com Mon May 22 22:09:10 2006
From: nslater at gmail.com (Noah Slater)
Date: Mon, 22 May 2006 21:09:10 +0100
Subject: [lxml-dev] ElementTree pretty printing (serialisation)
In-Reply-To: <44714169.1050402@gkec.informatik.tu-darmstadt.de>
References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <44714169.1050402@gkec.informatik.tu-darmstadt.de>
Message-ID: <9ea1c1180605221309p53ef3045nf281000ec8589cfb@mail.gmail.com>

Hello,

> Interesting. Are you suggesting that this feature is actually not working for you?

Kind of - I can get it to indent simple documents. I have attached a sample file - if you run this through etree.tostring with pretty printing enabled it doesn't alter the output in any way (with the exception of converting some chars to entities, obviously).

> This is absolutely the expected result. May I ask what version of libxml2 you
> are using?

2.6.24.dfsg-1

> But I still think that it would be better to do such adaptations based on
> walking the XML tree rather than the byte stream.

I am as much as I can - but it is not possible to wrap element attributes programmatically - thus I have to rely on byte stream post-processing.

> duplicating considerations about encodings and parsing that you wouldn't have
> in the XML infoset.

> How do you deal with, say, a UTF-16 encoded XML byte stream?

The code I submitted seems to handle UTF-16 just fine, try throwing a UTF-16 document at it.

Why would you think this was an issue? I18n issues are really very important to me - so I really need to understand this one... still trying to get my head around character encodings in general.

I use a few regexes that search for '<' '>' '"' and '=' characters.

Is this not safe across all encoded byte streams? These characters seem to match just fine using ASCII, LATIN-1, UTF-8 and UTF-16.
If that is so - would it make sense to decode the byte stream to a Unicode object to perform string operations before encoding back to the original charset requested?

> Well, it does indent element tags on my side, which (IMHO) is the definition
> of XML pretty printing. How is it supposed to know that adding whitespace to
> the data between tags does not break anything?

Okay... a few points on this one.

I am still unable to figure out the rules it is using. As I previously mentioned - it does not seem to alter the document I have attached to this email.

Secondly, your last point confuses me a little. The very act of indenting tags requires the addition of XML text nodes to the document tree - so by virtue of the fact you have implemented a pretty printer you are already adding white space to the document. Or am I missing something?

All white space is significant in XML - so once you have decided to alter the document, the actual wrapping and indentation styles you choose are by the by.

I suppose in this way, the only real difference between my pretty printer and the one built into lxml is the ability to control which element types are altered.

Please feel free to knock me back into line... I'm probably missing something embarrassingly obvious.

Thanks!

Noah :)

--
"Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman

-------------- next part --------------
A non-text attachment was scrubbed...
Name: document.utf-8.xml Type: text/xml Size: 432 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060522/94b109c2/attachment-0001.bin From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 23 07:30:04 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 23 May 2006 07:30:04 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605221309p53ef3045nf281000ec8589cfb@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <44714169.1050402@gkec.informatik.tu-darmstadt.de> <9ea1c1180605221309p53ef3045nf281000ec8589cfb@mail.gmail.com> Message-ID: <44729DDC.6020608@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> Interesting. Are you suggesting that this feature is actually not >> working for you? > > Kind of - I can get it to indent simple documents. I have attached a > sample file - if you run this through etree.tostring with pretty > printing enabled it doesn't alter the output in any way (with the > exception of converting some chars to entities, obviously). I tried the file and I guess the problem is the whitespace it contains. I guess libxml2 will simply refuse to alter your data if it can't distinguish between relevant and ignorable whitespace. It will only add new whitespace text nodes. Note also that you likely parsed it without DTD. That prevents libxml2 from knowing where whitespace matters and where it doesn't. >> But I still think that it would be better to do such adaptations based on >> walking the XML tree rather than the byte stream. > > I am as much as I can - but it is not possible to wrap element > attributes programmaticaly - thus I have to relly on byte stream > post-processing. If you implement a custom tree-walking serialiser, you have to write one attribute after the other anyway. 
So just check if the next one fits into the line and otherwise add a newline+indent first. How is that impossible? > The code I submitted seems to handle UTF-16 just fine, try throwing a > UTF-16 document at it. > > Why would you think this was an issue? I18n issues are really very > important to me - so I really need to understand this one... still > trying to get my head around character encodings in general. > > I use a few regex's that search for '<' '>' '"' and '=' characters. > > Is this not safe across all encoded byte streams? This characters seem > to match just fine using ASCII, LATIN-1, UTF-8 and UTF-16. Not every encoding makes the ASCII characters '<' etc. readily visible at a byte level. UTF-8 is perfect here, the ASCII-derived ISO-8859 charsets are also nice, but I guess EBCDIC is pretty much resistant and some Asian encoding will likely be, too. > If that is so - would it make sense to decode the byte stream to a > Unicode object to perform string operations before encoding back to > the original charset requested? Use UTF-8, that's fast and perfectly suited for that purpose. > Secondly, your last point confuses me a little. The very act of > indenting tags requires the addition of XML text node to the document > tree - so by virtue of the fact you have implemented a pretty printer > you are already adding white space to the document. Of am I missing > something? > All white space is significant in XML - so once you have decided to > alter the document, the actual wrapping and indentation styles you > choose are by the by. I guess the rule is: If you don't know what the document is supposed to look like, 'adding whitespace nodes' is (most likely) less harmful than 'changing data between tags'. > I suppose in this way, the only real difference between my pretty > printer and the one built into lxml is the ability to control which > element types are altered. libxml2 knows about the docbook namespace, too. 
Just try:

    >>> from lxml.etree import parse, tostring, XMLParser
    >>> tree = parse("document.utf-8.xml", XMLParser(load_dtd=True))
    >>> tree.write("testout.xml", encoding="UTF-8", pretty_print=True)

That will nicely pretty print the document you sent me.

Stefan

From nslater at gmail.com Tue May 23 23:44:58 2006
From: nslater at gmail.com (Noah Slater)
Date: Tue, 23 May 2006 22:44:58 +0100
Subject: [lxml-dev] ElementTree pretty printing (serialisation)
In-Reply-To: <44729DDC.6020608@gkec.informatik.tu-darmstadt.de>
References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <44714169.1050402@gkec.informatik.tu-darmstadt.de> <9ea1c1180605221309p53ef3045nf281000ec8589cfb@mail.gmail.com> <44729DDC.6020608@gkec.informatik.tu-darmstadt.de>
Message-ID: <9ea1c1180605231444s435f453fo9207d0fd9e82c2b8@mail.gmail.com>

Hi Stefan,

Thank you for a great reply - I appreciate the effort you're going to.

> I tried the file and I guess the problem is the whitespace it contains. I
> guess libxml2 will simply refuse to alter your data if it can't distinguish
> between relevant and ignorable whitespace. It will only add new whitespace
> text nodes.

Aha, I see what you mean. This is a limitation with libxml2 IMO - quite different from xml.dom.minidom.toprettyxml and xml.dom.ext.PrettyPrint

> Note also that you likely parsed it without DTD. That prevents libxml2 from
> knowing where whitespace matters and where it doesn't.

Wow, old skool - I am using Relax NG with my documents, aren't DTDs deprecated? :)

> If you implement a custom tree-walking serialiser, you have to write one
> attribute after the other anyway. So just check if the next one fits into the
> line and otherwise add a newline+indent first. How is that impossible?

I can be simple sometimes - this angle never occurred to me! Thanks so much for pointing it out - your suggestion makes perfect sense.
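The suggestion quoted above (write attributes one at a time and start a new, indented line whenever the next one would not fit) can be sketched roughly as follows. This is only an illustration and not code from lxml or from either poster: `start_tag` and `serialise` are hypothetical names, the sample document is made up, and the sketch assumes no namespaces, no comments and no mixed content. It uses the standard library ElementTree so that it runs without lxml.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def start_tag(elem, indent, width, empty):
    """Build the start tag as a list of lines, breaking between attributes."""
    lines = []
    current = indent + "<" + elem.tag
    hang = " " * len(current)      # continuation lines align under the first attribute
    on_line = 0                    # attributes already placed on the current line
    for name, value in elem.attrib.items():
        attr = " %s=%s" % (name, quoteattr(value))
        if on_line and len(current) + len(attr) > width:
            lines.append(current)  # next attribute does not fit: start a new line
            current, on_line = hang, 0
        current += attr
        on_line += 1
    lines.append(current + ("/>" if empty else ">"))
    return lines


def serialise(elem, level=0, width=60, step="  "):
    """Serialise elem to a list of lines (sketch: element-only or text-only content)."""
    indent = step * level
    text = (elem.text or "").strip()
    children = list(elem)
    has_tail = any((child.tail or "").strip() for child in children)
    if children and (text or has_tail):
        # Adding whitespace here would change the data, so refuse instead.
        raise ValueError("mixed content is not handled by this sketch")
    lines = start_tag(elem, indent, width, empty=not (text or children))
    if children:
        for child in children:
            lines.extend(serialise(child, level + 1, width, step))
        lines.append("%s</%s>" % (indent, elem.tag))
    elif text:
        lines[-1] += "%s</%s>" % (escape(text), elem.tag)
    return lines


# Made-up sample document, only for demonstration.
root = ET.fromstring(
    '<doc><link href="http://example.org/a" title="A fairly long title" '
    'rel="next"/><p>text</p></doc>'
)
output = "\n".join(serialise(root, width=40))
print(output)
```

Because the wrapping decision is made per attribute on the tree, no regular expressions over the encoded byte stream are needed; the result can be encoded once at the end.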
The idea of writing my own serialiser does daunt me a little - I was hoping to piggy back on someone else's efforts (at least that way I would be more certain the code is conformant... heh). Do you have any ideas on where to start on this? Is there some baseclass I could extend? Is this possible with SAX (which I know nothing about) - so many questions. > Not every encoding makes the ASCII characters '<' etc. readily visible at a > byte level. UTF-8 is perfect here, the ASCII-derived ISO-8859 charsets are > also nice, but I guess EBCDIC is pretty much resistant and some Asian encoding > will likely be, too. Please pardon my naivety - you are of course 100% correct on this issue. > I guess the rule is: If you don't know what the document is supposed to look > like, 'adding whitespace nodes' is (most likely) less harmful than 'changing > data between tags'. Hmm... not sure I agree on that, but over all I guess I take your point. Kind of irrelevant in my case however as I am writing a serialiser for my self and hence know exactly what is and isn't significant whitespace. Heh. > libxml2 knows about the docbook namespace, too. Just try: As I mentioned before - does this work with Relax NG? That would be amazing! Once again, thank you for your continued help and for such a great package. Noah :) -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 08:00:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 08:00:12 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605231444s435f453fo9207d0fd9e82c2b8@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <44714169.1050402@gkec.informatik.tu-darmstadt.de> <9ea1c1180605221309p53ef3045nf281000ec8589cfb@mail.gmail.com> <44729DDC.6020608@gkec.informatik.tu-darmstadt.de> <9ea1c1180605231444s435f453fo9207d0fd9e82c2b8@mail.gmail.com> Message-ID: <4476996C.7010102@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> I tried the file and I guess the problem is the whitespace it contains. I >> guess libxml2 will simply refuse to alter your data if it can't >> distinguish >> between relevant and ignorable whitespace. It will only add new >> whitespace >> text nodes. > > Aha, I see what you mean. This as a limitation with libxml2 IMO - No, absolutely not. If your library modifies your /data/ when you tell it to do non-intrusive indenting, that's just wrong. It doesn't break HTML (which is rendered for people anyway), but it breaks more or less everything else. >> Note also that you likely parsed it without DTD. That prevents libxml2 >> from knowing where whitespace matters and where it doesn't. > > Wow, old skool - I am using Relax NG with my documents, aren't DTD > deprecated? :) Don't think so, but who cares? This is not about validation, only about access to structural information - all that must be known is which tags will never contain (textual) data so that whitespace can be added to these without breaking the 'real' data. >> If you implement a custom tree-walking serialiser, you have to write one >> attribute after the other anyway. 
So just check if the next one fits >> into the >> line and otherwise add a newline+indent first. How is that impossible? > > I can be simple sometimes - this angle never occurred to me! Thanks so > much for pointing it out - your suggestion makes perfect sense. The > idea of writing my own serialiser does daunt me a little - I was > hoping to piggy back on someone else's efforts (at least that way I > would be more certain the code is conformant... heh). > > Do you have any ideas on where to start on this? Is there some > baseclass I could extend? Is this possible with SAX (which I know > nothing about) - so many questions. SAX is one way of doing it. It mimics a parser and is therefore pretty well suited for serialization, but tends to require loads of code. If you prefer the ElementTree API, look at Element.getiterator(), which is even extremely fast in lxml (there should be code examples on the web). Note also that the original ElementTree library has a Python implemented serializer. You should look at it before writing your own one. >> libxml2 knows about the docbook namespace, too. Just try: > > As I mentioned before - does this work with Relax NG? That would be > amazing! Technically, it could, but there is no reason why libxml2 should support it. DTDs are available for virtually any well-known document type, including the docbook type you are using. Setting the "load_dtd" keyword on the parser should be relatively cheap (note that there is a separate "dtd_validation" keyword for validation). It relies on libxml2's catalog feature, though, and on the relevant DTDs to be installed on the system. You may want to read about DTDs, catalogs and XML Infosets for further information about the topics involved. Stefan From howe at carcass.dhs.org Fri May 26 09:42:29 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri, 26 May 2006 04:42:29 -0300 Subject: [lxml-dev] findall()/xpath() differences ? 
Message-ID: <428731466.20060526044229@carcass.dhs.org> Hello all, How compatible are the findall() and xpath() methods ? findall() don't seem to handle more complicated XPath expressions. Why there is a difference between what they can handle ? I would expect findall() to be the same as xpath(), but searching from the context node and always returning a list of items, making it compatible with ET's. An example: Python 2.4.3 (#2, Apr 17 2006, 14:29:19) [GCC 3.4.4 [FreeBSD] 20050518] on freebsd6 Type "help", "copyright", "credits" or "license" for more information. >>> from lxml import etree >>> root = etree.XML(''' ... ... ''') >>> print root.xpath('obj[not(@name)]') [] >>> print root.findall('obj[not(@name)]') Traceback (most recent call last): File "", line 1, in ? File "etree.pyx", line 937, in etree._Element.findall File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 193, in findall return _compile(path).findall(element) File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 171, in _compile p = Path(path) File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 87, in __init__ raise SyntaxError( SyntaxError: expected path separator ([) -- Best regards, Steve mailto:howe at carcass.dhs.org From elephantum at cyberzoo.ru Fri May 26 09:50:42 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 26 May 2006 11:50:42 +0400 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <428731466.20060526044229@carcass.dhs.org> References: <428731466.20060526044229@carcass.dhs.org> Message-ID: <1148629843.2208.1.camel@zoo.yandex.ru> That's a great chance to start findall vs xpath battle again =) I think lxml should eliminate .xpath method and implement .find* methods through libxml2 xpath support. 
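For readers who want to try the difference being discussed, a small sketch follows. The sample document is made up (the one in the original post did not survive the archive), and `unnamed_objs` is a hypothetical helper. It uses xpath() where the element offers it, as lxml elements do, and otherwise falls back to a plain findall() with a Python-side filter, which should also work with the original ElementTree.

```python
from xml.etree import ElementTree as ET

# Made-up sample document, only for demonstration.
root = ET.fromstring(
    '<root><obj id="1" name="a"/><obj id="2"/><obj id="3" name="b"/></root>'
)


def unnamed_objs(element):
    """Return the <obj> children of element that have no name attribute."""
    if hasattr(element, "xpath"):
        # lxml elements: full XPath is available.
        return element.xpath("obj[not(@name)]")
    # ElementTree-compatible path: simple child lookup, then filter in Python.
    return [obj for obj in element.findall("obj") if obj.get("name") is None]


result_ids = [obj.get("id") for obj in unnamed_objs(root)]
print(result_ids)
```

Swapping the import for `from lxml import etree as ET` should exercise the xpath() branch instead, assuming lxml is installed.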
On Fri, 2006-05-26 at 04:42 -0300, Steve Howe wrote: > Hello all, > > How compatible are the findall() and xpath() methods ? findall() don't seem > to handle more complicated XPath expressions. Why there is a difference > between what they can handle ? > > I would expect findall() to be the same as xpath(), but searching from > the context node and always returning a list of items, making it > compatible with ET's. > > An example: > > Python 2.4.3 (#2, Apr 17 2006, 14:29:19) > [GCC 3.4.4 [FreeBSD] 20050518] on freebsd6 > Type "help", "copyright", "credits" or "license" for more information. > >>> from lxml import etree > >>> root = etree.XML(''' > ... > ... ''') > >>> print root.xpath('obj[not(@name)]') > [] > >>> print root.findall('obj[not(@name)]') > Traceback (most recent call last): > File "", line 1, in ? > File "etree.pyx", line 937, in etree._Element.findall > File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 193, in findall > return _compile(path).findall(element) > File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 171, in _compile > p = Path(path) > File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 87, in __init__ > raise SyntaxError( > SyntaxError: expected path separator ([) > From howe at carcass.dhs.org Fri May 26 09:56:07 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri, 26 May 2006 04:56:07 -0300 Subject: [lxml-dev] findall()/xpath() differences ? 
In-Reply-To: <1148629843.2208.1.camel@zoo.yandex.ru> References: <428731466.20060526044229@carcass.dhs.org> <1148629843.2208.1.camel@zoo.yandex.ru> Message-ID: <636742356.20060526045607@carcass.dhs.org> Hello Andrey, Friday, May 26, 2006, 4:50:42 AM, you wrote: > That's a great chance to start findall vs xpath battle again =) > I think lxml should eliminate .xpath method and implement .find* methods > through libxml2 xpath support. I really do not want to start any battles. There must be a good reason for the differences, and I just would like to know what they are and I think they should be documented... Anyway even if for any reasons it decides to get rid of the xpath() method, it should remain as alias for findall() to keep compatibility with older code. -- Best regards, Steve mailto:howe at carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 10:00:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 10:00:08 +0200 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <428731466.20060526044229@carcass.dhs.org> References: <428731466.20060526044229@carcass.dhs.org> Message-ID: <4476B588.2060805@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > How compatible are the findall() and xpath() methods ? findall() don't seem > to handle more complicated XPath expressions. Why there is a difference > between what they can handle ? > > I would expect findall() to be the same as xpath(), but searching from > the context node and always returning a list of items, making it > compatible with ET's. Well, it /is/ compatible with ET's. That is the main reason why it does not support full XPath expressions. Its expressions follow the documentation from the ElementTree library. What would be the advantage of not being ET compatible here? Is there anything you can do with findall(), find() and findtext() that you couldn't do with xpath() if you wanted to? 
Note, BTW, that both are similarly fast for similar expressions. If you wanted more speed, you'd go for pre-parsed XPath expressions anyway. IMHO, the only two reasons why these three functions are there are 1) they are ET compatible 2) they are simple We had the discussion pop up a few times if implementing findall() through xpath() would be a good idea. It was generally agreed (and demonstrated in code) that this would too easily break ET compatibility, which was not considered worth it. Stefan From howe at carcass.dhs.org Fri May 26 10:09:30 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri, 26 May 2006 05:09:30 -0300 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <4476B588.2060805@gkec.informatik.tu-darmstadt.de> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> Message-ID: <1163682329.20060526050930@carcass.dhs.org> Hello Stefan, Friday, May 26, 2006, 5:00:08 AM, you wrote: > Well, it /is/ compatible with ET's. That is the main reason why it does not > support full XPath expressions. Its expressions follow the documentation from > the ElementTree library. Perhaps *too* compatible... :) > What would be the advantage of not being ET compatible here? Is there anything > you can do with findall(), find() and findtext() that you couldn't do with > xpath() if you wanted to? Note, BTW, that both are similarly fast for similar > expressions. If you wanted more speed, you'd go for pre-parsed XPath > expressions anyway. > IMHO, the only two reasons why these three functions are there are > 1) they are ET compatible > 2) they are simple > We had the discussion pop up a few times if implementing findall() through > xpath() would be a good idea. It was generally agreed (and demonstrated in > code) that this would too easily break ET compatibility, which was not > considered worth it. Ok, reason is compatibility. Two points: 1) Shouldn't it be clearly documented ? 
2) Since xpath() supports a superset of the expressions findall() does, isn't the compatibility ensured ? Or does findall() support anything xpath() does not ? It makes no sense to cripple etree's findall() in order to support only what ET's findall() does.

--
Best regards,
Steve
mailto:howe at carcass.dhs.org

From elephantum at cyberzoo.ru Fri May 26 10:14:44 2006
From: elephantum at cyberzoo.ru (Andrey Tatarinov)
Date: Fri, 26 May 2006 12:14:44 +0400
Subject: [lxml-dev] findall()/xpath() differences ?
In-Reply-To: <4476B588.2060805@gkec.informatik.tu-darmstadt.de>
References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de>
Message-ID: <1148631284.2208.14.camel@zoo.yandex.ru>

On Fri, 2006-05-26 at 10:00 +0200, Stefan Behnel wrote:
> We had the discussion pop up a few times if implementing findall() through
> xpath() would be a good idea. It was generally agreed (and demonstrated in
> code) that this would too easily break ET compatibility, which was not
> considered worth it.

As far as I remember nobody just cared enough, and broken compatibility was only in cases where Fredrik was testing incompleteness of his 'semi-xpath' implementation, for example testing that '//' is an invalid expression, or that there could not be '[..]' selectors after a node name. In cases where the useful functionality was tested - there were no failures.

From elephantum at cyberzoo.ru Fri May 26 10:14:56 2006
From: elephantum at cyberzoo.ru (Andrey Tatarinov)
Date: Fri, 26 May 2006 12:14:56 +0400
Subject: [lxml-dev] findall()/xpath() differences ?
In-Reply-To: <636742356.20060526045607@carcass.dhs.org> References: <428731466.20060526044229@carcass.dhs.org> <1148629843.2208.1.camel@zoo.yandex.ru> <636742356.20060526045607@carcass.dhs.org> Message-ID: <1148631296.2208.16.camel@zoo.yandex.ru> On Fri, 2006-05-26 at 04:56 -0300, Steve Howe wrote: > Hello Andrey, > > Friday, May 26, 2006, 4:50:42 AM, you wrote: > > > That's a great chance to start findall vs xpath battle again =) > > > I think lxml should eliminate .xpath method and implement .find* methods > > through libxml2 xpath support. > I really do not want to start any battles. There must be a good reason > for the differences, and I just would like to know what they are and > I think they should be documented... > > Anyway even if for any reasons it decides to get rid of the xpath() > method, it should remain as alias for findall() to keep compatibility > with older code. Ok, at the moment .find* methods are served by the same code as in ElementTree, so .find* methods behave exactly like ElementTree's one. .xpath method is lxml's own implementation, somewhere inconsistent with ElementTree's .find. AFAIR: - there is other namespace declaration convention, - full XPath support, - and different behavior on absolute paths (I think that's the place where ElementTree is broken). From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 10:44:52 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 10:44:52 +0200 Subject: [lxml-dev] findall()/xpath() differences ? 
In-Reply-To: <1163682329.20060526050930@carcass.dhs.org> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> Message-ID: <4476C004.8010202@gkec.informatik.tu-darmstadt.de> Steve Howe wrote: > Friday, May 26, 2006, 5:00:08 AM, you wrote: >> IMHO, the only two reasons why these three functions are there are > >> 1) they are ET compatible >> 2) they are simple > >> We had the discussion pop up a few times if implementing findall() through >> xpath() would be a good idea. It was generally agreed (and demonstrated in >> code) that this would too easily break ET compatibility, which was not >> considered worth it. > Ok, reason is compatibility. Two points: > > 1) Shouldn't it be clearly documented ? Well, regarding documentation, lxml has (unofficially) always said: "we let Fredrik write the documentation, and only if we must (or want to) do it differently, we document it ourselves." ElementTree's find*() methods are documented, so all we add is "lxml supports full XPath expressions through the xpath() function". > 2) Since xpath() supports a > superset of the expressions findall() does, isn't the compatibility > ensured ? No, it's not a superset at all. findall() uses '{namespace}tag' notation, which is absolutely invalid in XPath. lxml has an ETXPath class that allows you to do this, but calling that for the general XPath case is just overhead, as we would still be trying to extract namespaces from it instead of passing it straight into libxml2's parser. > It makes > no sense to cripple etree's findall() in order to support only what > ET's findall() does. It wouldn't make sense if it wasn't for compatibility. Currently, you can exchange code between lxml, ElementTree and cElementTree with relatively little extra consideration. And I mean in all directions. Making more functions incompatible (without convincing reasons) is just calling for trouble.
("he, lxml didn't raise an exception on this expression!!") The reasons for leaving it as is are: 1) it works 2) it is 100% compatible now and trivial to keep compatible 3) it is not trivial to reimplement without breaking compatibility 4) it makes things slower to change it, as it requires parsing the expression twice (once in lxml, once in libxml2) and it's not faster to evaluate it. The reasons to change it are: 1) it supports different expressions than xpath(), which is documented (although perhaps not clearly so) and the reason why there is an xpath() method. Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state. Stefan From faassen at infrae.com Fri May 26 11:42:09 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 26 May 2006 11:42:09 +0200 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <4476C004.8010202@gkec.informatik.tu-darmstadt.de> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> Message-ID: <4476CD71.3020505@infrae.com> Stefan Behnel wrote: [snip] > Honestly, unless there are good reasons to do it, I'm absolutely +1 for > keeping the current state. +1 for keeping it the way it works too. I followed the same reasoning when I first made it work the way it does. :) We might indeed want to put a small section in our documentation mentioning why we did this. Might even make for a good start to a FAQ. :) Regards, Martijn From tseaver at palladion.com Fri May 26 13:01:10 2006 From: tseaver at palladion.com (Tres Seaver) Date: Fri, 26 May 2006 07:01:10 -0400 Subject: [lxml-dev] findall()/xpath() differences ? 
In-Reply-To: <4476C004.8010202@gkec.informatik.tu-darmstadt.de> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> Message-ID: <4476DFF6.10703@palladion.com> Stefan Behnel wrote: > Honestly, unless there are good reasons to do it, I'm absolutely +1 for > keeping the current state. Agreed, this is a no-brainer. If your application needs to be compatible with ET/cET, use the compatibility API. If it needs full XPath support, then use the native XPath API. I can't even see why we are talking about changing such a simple story. Tres. -- =================================================================== Tres Seaver +1 202-558-7113 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 15:06:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 15:06:12 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements Message-ID: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> Hi, I implemented a getpath() method for the ElementTree class that returns an XPath expression for a node. While working out test cases for it, however, I realized that the semantics of evaluating absolute XPath expressions (/...) on elements were not clear at all in the current implementation. ET does not allow absolute expressions in Element.findall() and raises a SyntaxError instead.
I think we should do the same for Element.xpath() to prevent mistakes like this: >>> a = etree.Element("a") >>> b = etree.SubElement(a, "b") >>> d0 = etree.SubElement(b, "d") >>> c = etree.SubElement(a, "c") >>> d1 = etree.SubElement(c, "d") >>> d2 = etree.SubElement(c, "d") >>> c.xpath("//d") The reasoning is that Elements do not have a root and therefore no absolute starting point for XPath. Only ElementTrees provide the required semantics, so it's perfectly valid to do this instead: >>> ElementTree(c).xpath("//d") Imagine the case where you have many ElementTrees wrapping various elements in a tree. Which one should be the starting point? Remember that documents and their absolute root node are not exposed through the API. The use case that brought me there is this: >>> tree = etree.ElementTree(c) >>> print tree.getpath(d2) /c/d[2] >>> tree.xpath(tree.getpath(d2)) == [d2] # fails! Intuitively, this should work. However, the current implementation fails here, as it starts searching at 'a' rather than 'c' and thus finds nothing. To fix this, we have to switch the root node during XPath evaluation. Doing this for ElementTree.xpath() is ok, but doing this for Element.xpath() also is impossible, as it breaks relative expressions like "..". So I decided to simply special case XPath expressions starting with '/' and raise exceptions for them. I know that this is not sufficient, as absolute paths can be hidden in things like "*[/a]" or "a|/a". But it's hopefully enough to make users aware and to prevent common mistakes. I also added a note in the documentation that the result of absolute expressions is undefined for Elements, I think that's the right way of saying it. I post this here as there will likely be code that breaks because of this change. I already found two test cases in the test suite that used this. It's just too easy to get wrong, so lxml is better off by raising exceptions where it can than just ignoring this problem. 
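[Editor's illustration] The ElementTree behaviour referred to above ("ET does not allow absolute expressions in Element.findall() and raises a SyntaxError instead") can be observed with the ElementTree module that ships with current Python versions. This is only a sketch against the standard library's xml.etree.ElementTree, not against lxml or the 2006 ElementTree release; the helper name absolute_path_rejected is made up for this example.

```python
import xml.etree.ElementTree as ET

# Same tree shape as in the message above: a/b/d and a/c/d, a/c/d
a = ET.Element("a")
b = ET.SubElement(a, "b")
d0 = ET.SubElement(b, "d")
c = ET.SubElement(a, "c")
d1 = ET.SubElement(c, "d")
d2 = ET.SubElement(c, "d")


def absolute_path_rejected(element, path):
    """Hypothetical helper: True if findall() refuses the path with a SyntaxError."""
    try:
        element.findall(path)
    except SyntaxError:
        return True
    return False


# An absolute expression on a plain Element is refused ...
rejected = absolute_path_rejected(c, "//d")

# ... while the relative form searches below the element itself.
relative_hits = c.findall(".//d")
```

With this tree, the relative search starting at c only sees the two d elements below c, which is the unambiguous reading the message argues for.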
Stefan From faassen at infrae.com Fri May 26 16:28:01 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 26 May 2006 16:28:01 +0200 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <4476DFF6.10703@palladion.com> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> <4476DFF6.10703@palladion.com> Message-ID: <44771071.9010106@infrae.com> Tres Seaver wrote: > Stefan Behnel wrote: > >> Honestly, unless there are good reasons to do it, I'm absolutely +1 for >> keeping the current state. > > Agreed, this is a no-brainer. If your application needs to be > compatible with ET/cET, use the compatibility API. If it needs full > XPath support, then use the native XPath API. I can't even see why we > are talking about changing such a simple story. I think it really counts as a FAQ by now; I've seen this come up on the list at least 3 times. Regards, Martijn From fredrik at pythonware.com Fri May 26 16:39:40 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 26 May 2006 16:39:40 +0200 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <4476DFF6.10703@palladion.com> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> <4476DFF6.10703@palladion.com> Message-ID: Tres Seaver wrote: >> Honestly, unless there are good reasons to do it, I'm absolutely +1 for >> keeping the current state. > > Agreed, this is a no-brainer. If your application needs to be > compatible with ET/cET, use the compatibility API. If it needs full > XPath support, then use the native XPath API.
and as usual, if someone finds a glaring difference between how findall handles a given xpath pattern and how xpath handles it (clarke notation issues aside), it's probably a bug in ET. From faassen at infrae.com Fri May 26 16:51:01 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 26 May 2006 16:51:01 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> Message-ID: <447715D5.7010802@infrae.com> Stefan Behnel wrote: > I implemented a getpath() method for the ElementTree class that returns an > XPath expression for a node. While working out test cases for it, however, I > realized that the semantics of evaluating absolute XPath expressions (/...) on > elements were not clear at all in the current implementation. > > ET does not allow absolute expressions in Element.findall() and raises a > SyntaxError instead. I think we should do the same for Element.xpath() to > prevent mistakes like this: > > >>> a = etree.Element("a") > >>> b = etree.SubElement(a, "b") > >>> d0 = etree.SubElement(b, "d") > >>> c = etree.SubElement(a, "c") > >>> d1 = etree.SubElement(c, "d") > >>> d2 = etree.SubElement(c, "d") > > >>> c.xpath("//d") > > The reasoning is that Elements do not have a root and therefore no absolute > starting point for XPath. Only ElementTrees provide the required semantics, so > it's perfectly valid to do this instead: > > >>> ElementTree(c).xpath("//d") > > Imagine the case where you have many ElementTrees wrapping various elements in > a tree. Which one should be the starting point? Remember that documents and > their absolute root node are not exposed through the API. Maybe we should expose the absolute root of documents in the API? XPath is defined on the document level. We could define the xpath() function to work in the context of the underlying document when / is used. 
Conceptually for XPath there *is* an underlying document with a certain structure. We can try to paper that over with hacky XPath parsing and exceptions and pretend there is not, but it's going to lead to more confusion than just exposing this concept in the API. > The use case that brought me there is this: > > >>> tree = etree.ElementTree(c) > >>> print tree.getpath(d2) > /c/d[2] > >>> tree.xpath(tree.getpath(d2)) == [d2] # fails! > > Intuitively, this should work. However, the current implementation fails here, > as it starts searching at 'a' rather than 'c' and thus finds nothing. To fix > this, we have to switch the root node during XPath evaluation. Doing this for > ElementTree.xpath() is ok, but doing this for Element.xpath() also is > impossible, as it breaks relative expressions like "..". But c isn't the root of the tree in all this. I think again it would be much better if we exposed the real underlying tree here, and only return xpath expressions generated from the real root. > So I decided to simply special case XPath expressions starting with '/' and > raise exceptions for them. I know that this is not sufficient, as absolute > paths can be hidden in things like "*[/a]" or "a|/a". But it's hopefully > enough to make users aware and to prevent common mistakes. I also added a note > in the documentation that the result of absolute expressions is undefined for > Elements, I think that's the right way of saying it. I think that instead of going this way, we need to step back for a minute. libxml2 has documents with trees. ElementTree has, potentially, as many trees as there are nodes. xpath works on libxml2 documents. The libxml2 story is going to leak into the ElementTree abstraction inevitably - such as your expressions *[/a], and so on. I think instead of trying to protect the ElementTree abstraction by incomplete checks to prevent 'common mistakes', we need to rethink what we want to expose in the lxml abstractions in the first place. 
> I post this here as there will likely be code that breaks because of this > change. I already found two test cases in the test suite that used this. It's > just too easy to get wrong, so lxml is better off by raising exceptions where > it can than just ignoring this problem. I don't consider this code to be wrong. That's why we had cases in the test suite that tested for it. Since then you reworked the code to be more like ElementTree in the usage of the ElementTree class, but this stuff is going to shine through nonetheless. Can't we expose a method getdocument() on Elements which will expose the underlying document as an ElementTree instance, and then define XPath's / to work from that always? We can then clearly define xpath() and getpath() in terms of getdocument(). Of course the behavior of getdocument() may be hard to predict for a user. Is this really true, or is getdocument() always going to be the thing created with Element() that wasn't appended or otherwise placed under another one? We have a getparent() method too after all, so we're hardly hiding the existence of the true libxml2 document in our abstraction. Regards, Martijn From elephantum at cyberzoo.ru Fri May 26 16:54:03 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 26 May 2006 18:54:03 +0400 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> <4476DFF6.10703@palladion.com> Message-ID: <1148655243.2208.19.camel@zoo.yandex.ru> On Fri, 2006-05-26 at 16:39 +0200, Fredrik Lundh wrote: > Tres Seaver wrote: > > >> Honestly, unless there are good reasons to do it, I'm absolutely +1 for > >> keeping the current state. > > > > Agreed, this is a no broainer. If your application needs to be > > compatible with ET/cET, use the compatibility API. 
If it needs full > > XPath support, then use the native XPath API. > > and as usual, if someone finds a glaring difference between how findall > handles a given xpath pattern and how xpath handles it (clarke notation > issues aside), it's probably a bug in ET. this is a bug in ET: In [1]: import elementtree.ElementTree as et In [2]: a = et.Element('a') In [3]: b = et.Element('b') In [4]: a.append(b) In [5]: tree = et.ElementTree(a) In [6]: tree.find('/a') # should be In [7]: tree.find('/b') # should be None Out[7]: From howe at carcass.dhs.org Fri May 26 17:07:03 2006 From: howe at carcass.dhs.org (Steve Howe) Date: Fri, 26 May 2006 12:07:03 -0300 Subject: [lxml-dev] findall()/xpath() differences ? In-Reply-To: <4476C004.8010202@gkec.informatik.tu-darmstadt.de> References: <428731466.20060526044229@carcass.dhs.org> <4476B588.2060805@gkec.informatik.tu-darmstadt.de> <1163682329.20060526050930@carcass.dhs.org> <4476C004.8010202@gkec.informatik.tu-darmstadt.de> Message-ID: <648202939.20060526120703@carcass.dhs.org> Hello Stefan, Friday, May 26, 2006, 5:44:52 AM, you wrote: [...] > Honestly, unless there are good reasons to do it, I'm absolutely +1 for > keeping the current state. Me too, don't worry. I forgot about the way ET handles namespaces and that is incompatible enough so that a compatibility function should be kept. I'll just use xpath() and there is no problem about it. 
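[Editor's illustration] The namespace difference mentioned above is that ElementTree-style paths put the namespace URI in braces ('{namespace}tag'), whereas XPath only knows prefixes that are mapped to URIs outside the expression. A minimal sketch using the standard library's xml.etree.ElementTree follows; the namespace URI and the prefix "p" are made up for the example, and the `namespaces` argument shown is the standard library's, used here only as an analogy for a prefix mapping.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # made-up namespace URI, for illustration only

root = ET.fromstring(
    '<doc xmlns:x="%s"><x:item>1</x:item><item>2</item></doc>' % NS
)

# ElementTree style: the namespace URI is written in braces inside the path.
clark_hits = root.findall("{%s}item" % NS)

# Prefix style: the path uses a prefix, the mapping is passed separately.
prefix_hits = root.findall("p:item", namespaces={"p": NS})

# A bare tag name only matches the element that is in no namespace.
plain_hits = root.findall("item")
```

The braces form is not valid XPath syntax, which is why the two notations cannot simply share one expression parser.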
-- Best regards, Steve mailto:howe at carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 17:48:19 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 17:48:19 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: <447715D5.7010802@infrae.com> References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> Message-ID: <44772343.8070505@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> Imagine the case where you have many ElementTrees wrapping various >> elements in >> a tree. Which one should be the starting point? Remember that >> documents and >> their absolute root node are not exposed through the API. > > Maybe we should expose the absolute root of documents in the API? I don't think this helps. We have ElementTrees that already fulfil exactly the need of representing rooted XML trees. And having ElementTrees that are mostly like other ElementTrees except that they always reference a special element in the document that potentially is not in other ElementTrees of the same document but can be referenced by et.xpath() from any of them ... I don't think that makes things more understandable. You shouldn't forget that you can always append the context node of an ElementTree to another element. Is this supposed to change the result of an xpath() call on the unmodified ElementTree? This would introduce some really hard to debug side-effects. So, in a way, it introduces unpredictable behaviour either way. It's just that the transition gets it closer to the ElementTree API. >> >>> tree = etree.ElementTree(c) >> >>> print tree.getpath(d2) >> /c/d[2] >> >>> tree.xpath(tree.getpath(d2)) == [d2] # fails! >> >> Intuitively, this should work. However, the current implementation >> fails here, >> as it starts searching at 'a' rather than 'c' and thus finds nothing. 
>> To fix >> this, we have to switch the root node during XPath evaluation. Doing >> this for >> ElementTree.xpath() is ok, but doing this for Element.xpath() also is >> impossible, as it breaks relative expressions like "..". > > But c isn't the root of the tree in all this. Well, it is the root of the ElementTree object. When I call xpath() on that tree, I really expect the root of the tree to be the reference point for absolute expressions. > libxml2 has documents with trees. ElementTree has, potentially, as many > trees as there are nodes. xpath works on libxml2 documents. The libxml2 > story is going to leak into the ElementTree abstraction inevitably - such > as your expressions *[/a], and so on. But that expression only leaks on Elements. It works as expected on ElementTrees. > I think instead of trying to protect the ElementTree abstraction by > incomplete checks to prevent 'common mistakes', we need to rethink what > we want to expose in the lxml abstractions in the first place. All I'm saying is that absolute expressions on Elements do not make sense anyway, so we should clearly mark them as invalid and do our best to prevent their use. If some of them leak, that's mainly for performance reasons. >> I post this here as there will likely be code that breaks because of this >> change. I already found two test cases in the test suite that used >> this. It's >> just too easy to get wrong, so lxml is better off by raising >> exceptions where >> it can than just ignoring this problem. > > I don't consider this code to be wrong. That's why we had cases in the > test suite that tested for it. Since then you reworked the code to be > more like ElementTree in the usage of the ElementTree class, but this > stuff is going to shine through nonetheless. > > Can't we expose a method getdocument() on Elements which will expose the > underlying document as an ElementTree instance, and then define XPath's > / to work from that always? 
We can then clearly define xpath() and > getpath() in terms of getdocument(). Originally, I implemented getpath() as Element.getpath(). I revoked that because it doesn't make sense in the context of the ElementTree API. It only makes sense when you have an ElementTree that you refer to. So, now the call is ElementTree.getpath(element). I think it's the same for absolute expressions in xpath(). They just don't make sense on Elements. > Of course the behavior of getdocument() may be hard to predict for a > user. Is this really true, or is getdocument() always going to be the > thing created with Element() that wasn't appended or otherwise placed > under another one? We have a getparent() method too after all, so we're > hardly hiding the existence of the true libxml2 document in our > abstraction. We have getparent() because we do not allow Elements to have multiple parents. However, we do allow trees (or documents) to have multiple root contexts (via ElementTree). Everything in lxml works with ElementTrees by now and uses the correct context node when you pass one in. This includes XSLT, RelaxNG, XMLSchema and the XPath class. I don't see why xpath() should be the only exception. Stefan From faassen at infrae.com Fri May 26 19:32:49 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 26 May 2006 19:32:49 +0200 Subject: [lxml-dev] building the trunk and test failures Message-ID: <44773BC1.90700@infrae.com> Hi there, I just tried building the trunk: Python 2.4, Pyrex 0.9.4.1, libxml2 2.6.21 The first thing I noticed is that compilation of the C source code (with 'Make') seems to take extremely long in comparison to the past. What changed to cause that? The next thing is that I get some test failures (included below). The api.txt and the extensions.txt test failures both seem to be triggered by the same problem. What could be going on here? Different behavior between different versions of libxml2? 
Regards, Martijn python test.py -p -v 459/540 ( 85.0%): Doctest: extensions.txt ====================================================================== ERROR: test_docinfo_public (lxml.tests.test_etree.ETreeOnlyTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python2.4/unittest.py", line 260, in run testMethod() File "/home/faassen/working/lxml-trunk/src/lxml/tests/test_etree.py", line 456, in test_docinfo_public tree = etree.parse(StringIO(xml)) File "etree.pyx", line 1483, in etree.parse File "parser.pxi", line 665, in etree._parseDocument File "parser.pxi", line 688, in etree._parseMemoryDocument File "parser.pxi", line 602, in etree._parseDoc File "parser.pxi", line 343, in etree._BaseParser._parseDoc File "parser.pxi", line 423, in etree._handleParseResult File "parser.pxi", line 394, in etree._raiseParseError XMLSyntaxError: switching encoding : no input ====================================================================== FAIL: Doctest: api.txt ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python2.4/unittest.py", line 260, in run testMethod() File "/home/faassen/working/lxml-trunk/src/doctest.py", line 2187, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for api.txt File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 0 ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 82, in api.txt Failed example: et = etree.parse(StringIO(xhtml)) Exception raised: Traceback (most recent call last): File "/home/faassen/working/lxml-trunk/src/doctest.py", line 1256, in __run compileflags, 1) in test.globs File "", line 1, in ? 
et = etree.parse(StringIO(xhtml)) File "etree.pyx", line 1483, in etree.parse File "parser.pxi", line 665, in etree._parseDocument File "parser.pxi", line 688, in etree._parseMemoryDocument File "parser.pxi", line 602, in etree._parseDoc File "parser.pxi", line 343, in etree._BaseParser._parseDoc File "parser.pxi", line 423, in etree._handleParseResult File "parser.pxi", line 394, in etree._raiseParseError XMLSyntaxError: switching encoding : no input ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 84, in api.txt Failed example: print docinfo.public_id Expected: -//W3C//DTD XHTML 1.0 Transitional//EN Got: -//W3C//DTD HTML 4.0 Transitional//EN ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 86, in api.txt Failed example: print docinfo.system_url Expected: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd Got: http://www.w3.org/TR/REC-html40/loose.dtd ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 88, in api.txt Failed example: docinfo.doctype == doctype_string Expected: True Got: False ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 91, in api.txt Failed example: print docinfo.xml_version Expected: 1.0 Got: None ---------------------------------------------------------------------- File "/home/faassen/working/lxml-trunk/src/lxml/tests/../../../doc/api.txt", line 93, in api.txt Failed example: print docinfo.encoding Expected: ascii Got: None ---------------------------------------------------------------------- Ran 459 tests in 1.695s FAILED (failures=1, errors=1) From iny+news at iki.fi Fri May 26 20:02:44 2006 From: iny+news at iki.fi (Ilpo 
Nyyssönen) Date: Fri, 26 May 2006 21:02:44 +0300 Subject: [lxml-dev] absolute XPath expressions on Elements References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel writes: > I think it's the same for absolute expressions in xpath(). They just > don't make sense on Elements. Why? Please, look a bit closer at XPath expressions and what you can do with them. You have things like axes. You can search in many other directions than just towards children. To make the most of an XPath expression, it needs to have some context node AND some root. How can you give the context node to the xpath evaluation, if the method is on the document side? From my point of view the same xpath method needs to be able to evaluate both absolute and relative expressions. Think about implementing something like XSLT, we define blocks that get a context node. Then from those blocks we can access the whole document both with absolute and relative expressions with the same method. It just needs to work and it just needs to know both the root and the context node. -- Ilpo Nyyssönen # biny # /* :-) */ From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 20:05:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 20:05:24 +0200 Subject: [lxml-dev] building the trunk and test failures In-Reply-To: <44773BC1.90700@infrae.com> References: <44773BC1.90700@infrae.com> Message-ID: <44774364.1010800@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > I just tried building the trunk: > > Python 2.4, Pyrex 0.9.4.1, libxml2 2.6.21 > > The first thing I noticed is that compilation of the C source code (with > 'Make') seems to take extremely long in comparison to the past. What > changed to cause that? It's the error reporting stuff that introduced tons of constants.
Hmm, you should have noticed that last time you built it. I remember that you tested the trunk before releasing 0.9... > The next thing is that I get some test failures (included below). The > api.txt and the extensions.txt test failures both seem to be triggered > by the same problem. What could be going on here? Different behavior > between different versions of libxml2? I guess so. That's pretty unfortunate. I'm using libxml2 2.6.24 here. It looks like a problem in switching the encoding. Most likely, the test cases can be fixed by one of the attached patches. Could you please try them independently and tell me if they work? They are only fixing the symptoms, not the cause. I'll try to find out in which libxml2 version the bug was fixed. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: enc-patch1.diff Type: text/x-patch Size: 1779 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060526/80c42cf0/attachment.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: enc-patch2.diff Type: text/x-patch Size: 1767 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060526/80c42cf0/attachment-0001.bin From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 20:24:43 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 26 May 2006 20:24:43 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> Message-ID: <447747EB.2010101@gkec.informatik.tu-darmstadt.de> Hi Ilpo, Ilpo Nyyssönen wrote: > Stefan Behnel writes: >> I think it's the same for absolute expressions in xpath(). They just >> don't make sense on Elements. > > Why? Please, look a bit closer to XPath expressions and what you can > do with them. You have things like axes.
You can search to many other > directions too than just to children. Sure, that's relative expressions, which are perfectly fine in the context of elements. If you read my post, you will see that this was one of my concerns. > To make most use from a XPath it needs to have some context node AND > some root. How can you give the context node to the xpath evaluation, > if the method is in the document side? What do you mean? You either have a relative expression in which case you have a context node. Or it's an absolute expression in which case it does not have a context node. In the first case, call it either on an Element or ElementTree. In the second case, call it on an ElementTree. > From my point of view the same xpath method needs to be able to > evaluate both absolute and relative expressions. Then tell me: what does it mean to evaluate an absolute XPath expression against an element? What is the point in having a context node in that case? Can you come up with an absolute XPath expression that references a context node? > Think about implementing something like XSLT, we define blocks that > get a context node. Then from those blocks we can access the whole > document both with absolute and relative expressions with the same > method. > > It just needs to work and it just needs to know both the root and the > context node. But then why would you want to call the absolute expression on the context node? What's wrong with evaluating it against some ElementTree that represents the entire document? Sorry, I'm a little confused. Could you go into some more detail with your arguments? 
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri May 26 20:54:56 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 26 May 2006 20:54:56 +0200
Subject: [lxml-dev] absolute XPath expressions on Elements
In-Reply-To: <447715D5.7010802@infrae.com>
References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com>
Message-ID: <44774F00.6030606@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> Can't we expose a method getdocument() on Elements which will expose the
> underlying document as an ElementTree instance

I thought about this some more. I'm not opposed to this idea. It makes sense in the context of libxml2. It's well defined and matches the getparent() method. I personally prefer a name like "getroottree()", as "document" is not used in the API so far.

>, and then define XPath's
> / to work from that always? We can then clearly define xpath() and
> getpath() in terms of getdocument().

Not getpath(), which only works on ElementTrees anyway. This only concerns Element.xpath() then. ElementTree.xpath() will continue to switch root nodes, whereas Element.xpath() will use the element as context for relative expressions and the root tree as context for absolute expressions.

> Of course the behavior of getdocument() may be hard to predict for a
> user. Is this really true, or is getdocument() always going to be the
> thing created with Element() that wasn't appended or otherwise placed
> under another one?

"element.getroottree()" will always return an ElementTree rooted in the root node of the document that contains the element. How is that for a definition?
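
A minimal sketch of the semantics defined above, assuming an lxml build that already provides getroottree(). The sample document and all variable names are made up for illustration and do not come from this thread:

```python
from lxml import etree

# Hypothetical sample document, not taken from the thread.
root = etree.fromstring(
    "<doc><section><para>one</para><para>two</para></section></doc>"
)
para = root[0][1]  # the second <para> element

# getroottree() returns an ElementTree rooted at the document's root node.
tree = para.getroottree()
root_tag = tree.getroot().tag

# Relative expression: the element itself is the context node.
siblings = para.xpath("preceding-sibling::para")

# Absolute expression: evaluated against the root tree, even though
# xpath() is called on an element deep inside the document.
all_paras = para.xpath("/doc/section/para")

# getpath() lives on the ElementTree, not on the element.
path = tree.getpath(para)
same_node = tree.xpath(path)

print(root_tag, [p.text for p in siblings], [p.text for p in all_paras], path)
```

The point of the design is that a caller holding only an element can evaluate an expression without first checking whether it is relative or absolute.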
Stefan From iny+news at iki.fi Sat May 27 06:17:18 2006 From: iny+news at iki.fi (Ilpo =?iso-8859-1?Q?Nyyss=F6nen?=) Date: Sat, 27 May 2006 07:17:18 +0300 Subject: [lxml-dev] absolute XPath expressions on Elements References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> <447747EB.2010101@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel writes: > What do you mean? You either have a relative expression in which > case you have a context node. Or it's an absolute expression in > which case it does not have a context node. > > In the first case, call it either on an Element or ElementTree. In > the second case, call it on an ElementTree. So as I don't know whether it is relative or absolute (it was given to me by someone else via API), I need to evaluate it always in ElementTree? How does the ElementTree know the context node? Also, if I currently only pass Element to a method, where does it get the ElementTree? Or are you saying that I should pass both? > Then tell me: what does it mean to evaluate an absolute XPath > expression against an element? The same as it would be to evaluate it in the document the element belongs to. > What is the point in having a context node in that case? Can you > come up with an absolute XPath expression that references a context > node? It is not about it using it. It is about generic interface. I want to evaluate XPath expressions and I don't want to start looking whether those are relative or absolute. 
-- Ilpo Nyyss?nen # biny # /* :-) */ From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 27 06:28:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 27 May 2006 06:28:08 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> <447747EB.2010101@gkec.informatik.tu-darmstadt.de> Message-ID: <4477D558.4060901@gkec.informatik.tu-darmstadt.de> Hi Ilpo, Ilpo Nyyss?nen wrote: > Stefan Behnel writes: > >> What do you mean? You either have a relative expression in which >> case you have a context node. Or it's an absolute expression in >> which case it does not have a context node. >> >> In the first case, call it either on an Element or ElementTree. In >> the second case, call it on an ElementTree. > > So as I don't know whether it is relative or absolute (it was given to > me by someone else via API), I need to evaluate it always in > ElementTree? That was the idea, yes. I admit that it may be tricky to figure out the difference if you can't control the source of XPath expressions. >> Then tell me: what does it mean to evaluate an absolute XPath >> expression against an element? > > The same as it would be to evaluate it in the document the element > belongs to. That doesn't make sense in the ElementTree API. Elements do not have a root except for themselves. >> What is the point in having a context node in that case? Can you >> come up with an absolute XPath expression that references a context >> node? > > It is not about it using it. It is about generic interface. I want to > evaluate XPath expressions and I don't want to start looking whether > those are relative or absolute. Ok, I get your point. Actually, it's already changed in the trunk. I implemented the what Martijn proposed. 
We now have a "getroottree()" method on elements that returns an ElementTree for the root of the document that the element is in. We then define the evaluation of absolute expressions against elements as an evaluation against this elementtree. This is a sensible extension to the API that makes sense in the context of lxml/libxml2. Stefan From iny+news at iki.fi Sat May 27 06:29:01 2006 From: iny+news at iki.fi (Ilpo =?iso-8859-1?Q?Nyyss=F6nen?=) Date: Sat, 27 May 2006 07:29:01 +0300 Subject: [lxml-dev] absolute XPath expressions on Elements References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> <447747EB.2010101@gkec.informatik.tu-darmstadt.de> Message-ID: OK, and I need to thank you for changing the API back. Now you can add the getpath to the Element too? -- Ilpo Nyyss?nen # biny # /* :-) */ From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 27 06:34:06 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 27 May 2006 06:34:06 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44772343.8070505@gkec.informatik.tu-darmstadt.de> <447747EB.2010101@gkec.informatik.tu-darmstadt.de> Message-ID: <4477D6BE.6020501@gkec.informatik.tu-darmstadt.de> Hi Ilpo, Ilpo Nyyss?nen wrote: > OK, and I need to thank you for changing the API back. Now you can > add the getpath to the Element too? You can read the doc section describing XPath support here: http://codespeak.net/svn/lxml/trunk/doc/api.txt Stefan From paul at zeapartners.org Sat May 27 08:01:10 2006 From: paul at zeapartners.org (Paul Everitt) Date: Sat, 27 May 2006 08:01:10 +0200 Subject: [lxml-dev] INPUT: Iniitial FAQ list Message-ID: Howdy all. This is a thread to collect ideas for an initial FAQ list. 
Right now we have one on the table: """ How compatible are the findall() and xpath() methods ? findall() don't seem to handle more complicated XPath expressions. Why there is a difference between what they can handle ? """ Other ideas? --Paul From behnel_ml at gkec.informatik.tu-darmstadt.de Sat May 27 12:30:02 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 27 May 2006 12:30:02 +0200 Subject: [lxml-dev] INPUT: Iniitial FAQ list In-Reply-To: References: Message-ID: <44782A2A.9070308@gkec.informatik.tu-darmstadt.de> Hi Paul, hi all, :) Paul Everitt wrote: > This is a thread to collect ideas for an initial FAQ list. I've written up an initial FAQ file. http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt Any suggestions, additions and patches to this file can go into this thread. Stefan From nslater at gmail.com Sun May 28 03:51:30 2006 From: nslater at gmail.com (Noah Slater) Date: Sun, 28 May 2006 02:51:30 +0100 Subject: [lxml-dev] Extending lxml.etree._ElementTree Message-ID: <9ea1c1180605271851i789acbbbnb90efed2df133db@mail.gmail.com> Hello again, I would like to extend the _ElementTree class to include some handy methods - write_xml/write_xhtml/write_html are examples. In addition, this should enable me to return extended ElementTrees instead of ResultTrees after an XSLT transform. What would be the best way to go about this. While I can extend the base class I am struggling to understand how I can replace etree.parse() to return these objects. While I am on the subject, what is the reason for ResultTree objects as a pose to ElementTree. Why make the distinction to the user? Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman From bashautomation at gmail.com Sun May 28 04:15:36 2006 From: bashautomation at gmail.com (Scott Haeger) Date: Sat, 27 May 2006 22:15:36 -0400 Subject: [lxml-dev] segfault on Windows Message-ID: <3d8ae71c0605271915x6c539feau48084cd15deddd0f@mail.gmail.com> I am trying to port an application to Windows. It works with no problems on Linux. On Windows, lxml parses a large file with no problems on the main thread. However, I am getting a segfault when it parses the same file from a spawned thread. Has anyone seen the same thing. Python 2.4.2 lxml 1.0 beta from binary built on May 18 Let me know if there is anything else I can provide. Thanks, Scott Haeger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20060527/8936216e/attachment.htm From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 06:44:46 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 06:44:46 +0200 Subject: [lxml-dev] segfault from thread on Windows In-Reply-To: <3d8ae71c0605271915x6c539feau48084cd15deddd0f@mail.gmail.com> References: <3d8ae71c0605271915x6c539feau48084cd15deddd0f@mail.gmail.com> Message-ID: <44792ABE.1040107@gkec.informatik.tu-darmstadt.de> Hi Scott, Scott Haeger wrote: > I am trying to port an application to Windows. It works with no > problems on Linux. On Windows, lxml parses a large file with no > problems on the main thread. However, I am getting a segfault when it > parses the same file from a spawned thread. Has anyone seen the same > thing. Threading has never been officially tested. However, we would love to see it work, so your feedback is much appreciated. Note that there are a few places in lxml where functions and classes explicitly state what you must not do within threads (we know that at least). One example is that you must not use the default parser from different threads, which might already be the problem in your case. 
You must create an independent parser for each thread. This is mainly done for performance reasons. Note that you can .copy() parsers to keep the initial configuration. It's interesting that you do not have the same problem under Linux, though... > Python 2.4.2 > lxml 1.0 beta from binary built on May 18 > > Let me know if there is anything else I can provide. It would be great if you could come up with a short code snippet that shows the problem. Most likely, the problem is not the long file, but the threading itself, so you can try to replace the file parsing by a short XML string. We do not currently have test cases for threading, but it would be very helpful if we had a number of tests that could help us in making sure threading works out-of-the-box. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 07:29:52 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 07:29:52 +0200 Subject: [lxml-dev] Extending lxml.etree._ElementTree In-Reply-To: <9ea1c1180605271851i789acbbbnb90efed2df133db@mail.gmail.com> References: <9ea1c1180605271851i789acbbbnb90efed2df133db@mail.gmail.com> Message-ID: <44793550.8020606@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I would like to extend the _ElementTree class to include some handy > methods - write_xml/write_xhtml/write_html are examples. > > What would be the best way to go about this. While I can extend the > base class I am struggling to understand how I can replace > etree.parse() to return these objects. I would create either global functions or an external class (say, ElementTreeWriter) to implement this kind of functionality. There is no easy way to replace the internal _ElementTree class. Note that there is nothing you could do from a Python subclass that you couldn't do from an external function. > In addition, this should enable me to return extended ElementTrees > instead of ResultTrees after an XSLT transform. 
May I ask why you would want to do that? > While I am on the subject, what is the reason for ResultTree objects > as a pose to ElementTree. Why make the distinction to the user? There is a related FAQ entry, near the bottom of the file. http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt Feel free to ask back if it is unclear or doesn't answer your questions. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 10:16:24 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 10:16:24 +0200 Subject: [lxml-dev] building the trunk and test failures In-Reply-To: <44773BC1.90700@infrae.com> References: <44773BC1.90700@infrae.com> Message-ID: <44795C58.8040505@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > I just tried building the trunk: > > Python 2.4, Pyrex 0.9.4.1, libxml2 2.6.21 > > The next thing is that I get some test failures (included below). The > api.txt and the extensions.txt test failures both seem to be triggered > by the same problem. What could be going on here? Different behavior > between different versions of libxml2? > > ERROR: test_docinfo_public (lxml.tests.test_etree.ETreeOnlyTestCase) > Traceback (most recent call last): > XMLSyntaxError: switching encoding : no input I investigated this. There was a major cleanup in the parser code between 2.6.22 and 2.6.23. It seems to have fixed a bug in xmlCtxtReadDoc that prevented earlier versions from switching the encoding correctly. The problem seemed to be the setup of the parser buffer. I worked around this problem by replacing the calls to xmlCtxtReadDoc by xmlCtxtReadMemory. libxml2 can't handle strings longer than MAX_INT anyway, so passing the length as an int (which is the main difference) is not a problem here. BTW, etree now raises an exception when longer strings are passed, but I guess it will take a while until people will notice. This requires Python 2.5 and 64bit systems anyway. 
Most likely, it even got a little faster for long strings, as xmlCtxtReadDoc calls strlen() internally, which is superfluous for Python strings. Stefan From nslater at gmail.com Sun May 28 13:45:58 2006 From: nslater at gmail.com (Noah Slater) Date: Sun, 28 May 2006 12:45:58 +0100 Subject: [lxml-dev] Extending lxml.etree._ElementTree In-Reply-To: <44793550.8020606@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605271851i789acbbbnb90efed2df133db@mail.gmail.com> <44793550.8020606@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605280445h15d264f0h4d57b654421f20b3@mail.gmail.com> Hi Stefan, > > In addition, this should enable me to return extended ElementTrees > > instead of ResultTrees after an XSLT transform. > > May I ask why you would want to do that? I wanted to have the same help functions bound to both the ElementTree and ResultTree objects - but I have now opted to have helper classes instead. Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From nslater at gmail.com Sun May 28 13:47:12 2006 From: nslater at gmail.com (Noah Slater) Date: Sun, 28 May 2006 12:47:12 +0100 Subject: [lxml-dev] Segmentation Fault Message-ID: <9ea1c1180605280447o702af0adkbc52a375f99b2a72@mail.gmail.com> Hello guys, I have found a segfault bug with lxml - attached is a file which reproduces it. Hope this helps. Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman -------------- next part -------------- A non-text attachment was scrubbed... 
Name: segfault.py Type: text/x-python Size: 97 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060528/e8ed5d56/attachment.py From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 16:49:35 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 16:49:35 +0200 Subject: [lxml-dev] Segmentation Fault In-Reply-To: <9ea1c1180605280447o702af0adkbc52a375f99b2a72@mail.gmail.com> References: <9ea1c1180605280447o702af0adkbc52a375f99b2a72@mail.gmail.com> Message-ID: <4479B87F.4040806@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I have found a segfault bug with lxml > >>> print etree._Element('article') This is totally not a bug. _Element etc. are classes private to the module, as the leading underscore commonly indicates. Maybe we should try to hide them from users, but then, they already are non public by the fact that they are not documented at all. You are not supposed to use them. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 18:30:52 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 18:30:52 +0200 Subject: [lxml-dev] segfault from thread on Windows In-Reply-To: <3d8ae71c0605280757w141a0325y60472e1a95c4a747@mail.gmail.com> References: <3d8ae71c0605271915x6c539feau48084cd15deddd0f@mail.gmail.com> <44792ABE.1040107@gkec.informatik.tu-darmstadt.de> <3d8ae71c0605280757w141a0325y60472e1a95c4a747@mail.gmail.com> Message-ID: <4479D03C.2010208@gkec.informatik.tu-darmstadt.de> Hi Scott, Scott Haeger wrote: > Here is a small code snippet that demostrates my problem. As you mentioned, I can't reproduce any problems with this under Linux. Couldn't test it under Windows. > The code > works just fine as it is written. However, python crashes if you > comment out the parseFile( ) call from the main thread. I assume that what happens here is that libxml2 misses its thread-local setup. 
> This seems to
> be contrary to what you described about using the default parser from
> different threads.
>
> def parseFile():
>     s = "entryaentryb"
>     sio = StringIO.StringIO (s)
>     tree = etree.parse(sio)
>     tree.write(sys.stdout)
>     sys.stdout.flush()
>
> parseFile()
> t = thread.start_new_thread(parseFile, ())

I understand what you did now. The current way to do this would be something like:

    def parseFileFromThread():
        etree.initThread() # needs the current trunk to work
        thread_local_parser = etree.XMLParser()
        s = "entryaentryb"
        sio = StringIO.StringIO(s)
        tree = etree.parse(sio, thread_local_parser)
        tree.write(sys.stdout)

    parseFile()
    t1 = thread.start_new_thread(parseFileFromThread, ())
    t2 = thread.start_new_thread(parseFileFromThread, ())

1.0.beta did not have an initThread() function yet (it had an initThreadLogging function, which did part of what the new initThread does). However, I noticed that it is needed to set up some thread-local libxml2 options. This means that you have to call this function from every newly created thread. It doesn't do any harm to call it from the main thread, though, so you can also call the above parseFileFromThread from the main thread.

So, hopefully, your problem is solved in the current trunk. I'd be glad if you could test that with the above modifications to your code. I guess I'll have to write up some documentation on this, but it's good to have some feedback first.

Stefan

From nslater at gmail.com Sun May 28 19:55:52 2006
From: nslater at gmail.com (Noah Slater)
Date: Sun, 28 May 2006 18:55:52 +0100
Subject: [lxml-dev] Best Way To Strip A Namespace
Message-ID: <9ea1c1180605281055x46a6865blb9ffe324444c0217@mail.gmail.com>

Hello,

Say I have the following document:

baz>

After I have parsed it into an element tree, how do I change it in a foolproof manner to lose the root namespace so it becomes:

baz>

This is important for processing my DocBook sources as the XSLT stylesheets cannot handle the namespace on the root element.
Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Sun May 28 20:17:34 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 28 May 2006 20:17:34 +0200 Subject: [lxml-dev] Best Way To Strip A Namespace In-Reply-To: <9ea1c1180605281055x46a6865blb9ffe324444c0217@mail.gmail.com> References: <9ea1c1180605281055x46a6865blb9ffe324444c0217@mail.gmail.com> Message-ID: <4479E93E.309@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > Say I have the following document: > > > baz> > > > After I have parsed it into an element tree, how do I change it in a > fool proof manner to loose the root namespace so it becomes: > > > baz> > Have you tried setting the tag name to the local name (without namespace)? Stefan From nslater at gmail.com Sun May 28 20:25:49 2006 From: nslater at gmail.com (Noah Slater) Date: Sun, 28 May 2006 19:25:49 +0100 Subject: [lxml-dev] Best Way To Strip A Namespace In-Reply-To: <4479E93E.309@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605281055x46a6865blb9ffe324444c0217@mail.gmail.com> <4479E93E.309@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605281125i3ff50d67sbc192a34294e672c@mail.gmail.com> Hi Stefan, This does not seem to work, instead it is changing the local name but leaves the namespace alone. I have attached an example for you to look at. Thanks, Noah On 28/05/06, Stefan Behnel wrote: > Hi Noah, > > Noah Slater wrote: > > Say I have the following document: > > > > > > baz> > > > > > > After I have parsed it into an element tree, how do I change it in a > > fool proof manner to loose the root namespace so it becomes: > > > > > > baz> > > > > Have you tried setting the tag name to the local name (without namespace)? > > Stefan > > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman -------------- next part -------------- A non-text attachment was scrubbed... Name: strip_ns.py Type: text/x-python Size: 460 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060528/6ea04ecb/attachment.py From nslater at gmail.com Mon May 29 04:04:28 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 03:04:28 +0100 Subject: [lxml-dev] Bug with attribute mangling when adding child elements Message-ID: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> Hello, I noticed today that certain child attribute values are mangled when you try to insert/append child elements onto parent elements. I have written a test script for this as usual. I'm probably missing something... again. :) Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman -------------- next part -------------- A non-text attachment was scrubbed... Name: xml_id_funnyness.py Type: text/x-python Size: 727 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060529/a5571da7/attachment.py From paul at zeapartners.org Mon May 29 09:00:56 2006 From: paul at zeapartners.org (Paul Everitt) Date: Mon, 29 May 2006 09:00:56 +0200 Subject: [lxml-dev] INPUT: Iniitial FAQ list In-Reply-To: <44782A2A.9070308@gkec.informatik.tu-darmstadt.de> References: <44782A2A.9070308@gkec.informatik.tu-darmstadt.de> Message-ID: <447A9C28.9020707@zeapartners.org> Stefan Behnel wrote: > Hi Paul, hi all, :) > > Paul Everitt wrote: >> This is a thread to collect ideas for an initial FAQ list. > > I've written up an initial FAQ file. > > http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt > > Any suggestions, additions and patches to this file can go into this thread. Looks like you already handled the question under discussion, as well. I'll try to keep an eye out for frequently asked questions as well and update that file. 
--Paul From nslater at gmail.com Mon May 29 14:09:50 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 13:09:50 +0100 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> Message-ID: <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> Hello, I have just noticed that if you set the attributes after you have added to the tree the weird bug does not happen. Attached is a modified version of the file. Hope this helps someone.... Noah On 29/05/06, Noah Slater wrote: > Hello, > > I noticed today that certain child attribute values are mangled when > you try to insert/append child elements onto parent elements. > > I have written a test script for this as usual. > > I'm probably missing something... again. :) > > Thanks, > Noah > > -- > "Creativity can be a social contribution, but only in so > far as society is free to use the results." - R. Stallman > > > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman -------------- next part -------------- A non-text attachment was scrubbed... Name: xml_id_funnyness.py Type: text/x-python Size: 728 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060529/fbeb0e75/attachment-0001.py From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 15:49:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 15:49:08 +0200 Subject: [lxml-dev] Threading Message-ID: <447AFBD4.6010902@gkec.informatik.tu-darmstadt.de> Hi all, I looked a bit at the threading issues and found that there is not much to gain from using threads anyway. Calling code will suffer from the GIL, so it would be up to lxml to release the GIL on long-running operations to benefit from threading. 
However, releasing the GIL is non-trivial, as most of these operations can call back into the Python API and the interpreter: parsing calls resolvers, as does XSLT. XPath can call extension functions, even the error messages end up in Python objects and lists, etc.

So, there are only very few places that could easily be wrapped by ALLOW_THREADS:

* everything that traverses trees (which is so fast that the gain may be eaten by the locking overhead)

and, if error messages can somehow be made to work without requiring the GIL:

* serialisation to memory or 'real' files
* validation (which currently does not support custom resolvers)

An alternative would be to always create separate Python threads for things like error handling and resolving, but that would require always releasing the GIL whenever these things *might* get called.

So, it would be relatively easy to release the GIL in the ElementDepthFirstIterator to have a thread speedup for findall() etc., but everything else will be real work. And since the major overhead is in parsers and serialisers, I don't think it's worth changing this bit just to say "you can gain from threading".

Considering this, I changed the FAQ entry on threading to simply state that threading is not supported, with a short explanation that the gain would be marginal without major changes in lxml.
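
Independently of whether threads bring any speedup, the rule stated earlier on this list (one independent parser per thread, never a parser shared between threads) can be sketched as below. The function and variable names are hypothetical, the XML string is made up, and the stdlib fallback import is only there so the sketch runs without lxml. It says nothing about how thread-safe any particular lxml release is:

```python
import threading

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib ElementTree API is close enough here
    import xml.etree.ElementTree as etree

XML = "<root><entry>a</entry><entry>b</entry></root>"  # made-up input

results = {}
results_lock = threading.Lock()

def parse_in_thread(name):
    # Each thread builds its own parser instead of sharing one.
    parser = etree.XMLParser()
    root = etree.fromstring(XML, parser)
    texts = [entry.text for entry in root]
    with results_lock:
        results[name] = texts

threads = [
    threading.Thread(target=parse_in_thread, args=("worker-%d" % i,))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Only plain strings cross the thread boundary here; the parsed elements stay inside the thread that created them.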
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 16:36:49 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 16:36:49 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> Message-ID: <447B0701.2050000@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > I have examined the lxml pretty print output and there is > quite some difference. In fact, with the exception of a few simplistic > documents with no textual element content I cannot see what effect > pretty_print has. I had to take another look at this while I was rewriting a part of the API for thread clean-ness. The easiest way to get your parsed document formatted is this: >>> parser = etree.XMLParser(ignore_blanks=True) >>> tree = etree.parse(file, parser) >>> tree.write(newfile, pretty_print=True) The "ignore_blanks" option is new in the trunk and removes blank text nodes from the parsed tree. This allows libxml2 to add new white space for indentation without conflicting with left-overs from the original document. No DTD parsing needed. Stefan From nslater at gmail.com Mon May 29 16:42:56 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 15:42:56 +0100 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <447B0701.2050000@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <447B0701.2050000@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605290742q7fe45508he0eba6fbabf5bf96@mail.gmail.com> Hi Stefan, Thanks for that! 
:) As it happens, I have now moved away from lxml for my pretty printing. Instead, especially given my desire to output well formed HTML/XHTML, I have written an ElementTree wrapper around uTidylib which takes care of pretty much anything I could possibly imagine. Thanks again, Noah On 29/05/06, Stefan Behnel wrote: > Hi Noah, > > Noah Slater wrote: > > I have examined the lxml pretty print output and there is > > quite some difference. In fact, with the exception of a few simplistic > > documents with no textual element content I cannot see what effect > > pretty_print has. > > I had to take another look at this while I was rewriting a part of the API for > thread clean-ness. The easiest way to get your parsed document formatted is this: > > >>> parser = etree.XMLParser(ignore_blanks=True) > >>> tree = etree.parse(file, parser) > >>> tree.write(newfile, pretty_print=True) > > The "ignore_blanks" option is new in the trunk and removes blank text nodes > from the parsed tree. This allows libxml2 to add new white space for > indentation without conflicting with left-overs from the original document. No > DTD parsing needed. > > Stefan > > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman

From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 16:50:30 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Mon, 29 May 2006 16:50:30 +0200
Subject: [lxml-dev] Best Way To Strip A Namespace
In-Reply-To: <9ea1c1180605281125i3ff50d67sbc192a34294e672c@mail.gmail.com>
References: <9ea1c1180605281055x46a6865blb9ffe324444c0217@mail.gmail.com> <4479E93E.309@gkec.informatik.tu-darmstadt.de> <9ea1c1180605281125i3ff50d67sbc192a34294e672c@mail.gmail.com>
Message-ID: <447B0A36.8010104@gkec.informatik.tu-darmstadt.de>

Hi Noah,

Noah Slater wrote:
> On 28/05/06, Stefan Behnel wrote:
>> Noah Slater wrote:
>> > Say I have the following document:
>> >
>> >
>> > baz>
>> >
>> >
>> > After I have parsed it into an element tree, how do I change it in a
>> > foolproof manner to lose the root namespace so it becomes:
>> >
>> >
>> > baz>
>> >
>>
>> Have you tried setting the tag name to the local name (without
>> namespace)?
>
> This does not seem to work, instead it is changing the local name but
> leaves the namespace alone. I have attached an example for you to look
> at.
>
> #!/usr/bin/env python
>
> import sys
>
> from lxml import etree
>
> root_name = 'article'
> namespace_uri = 'http://docbook.org/ns/docbook'
>
> namespace_map = {None: 'http://docbook.org/ns/docbook'}
>
> root_element = etree.Element("{%s}%s" % (namespace_uri, root_name), nsmap=namespace_map)
>
> # Stefan, this is what I interpreted from your email:
> root_element.tag = 'essay'
>
> element_tree = etree.ElementTree(root_element)
>
> element_tree.write(sys.stdout)
> sys.stdout.write('\n')

Ok, I tried this and it failed. Nice.

I fixed this, so lxml now removes the namespace reference from the element when the tag is set to a name without a namespace. However, there is more involved than you might think, as it still leaves the namespace declaration in the document. Finding out if the namespace is still in use is rather expensive...

Anyway, the problem itself is fixed on the trunk.
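
The suggestion discussed above (set each tag to its local name) can be sketched for a whole tree as follows. This is an illustration only, not the fix that went into the trunk. The helper name strip_namespace and the sample title element are hypothetical, and the stdlib fallback import is only there so the sketch runs without lxml. As noted above, the namespace declaration itself may still show up when the tree is serialised:

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib ElementTree API is close enough here
    import xml.etree.ElementTree as etree

DOCBOOK_NS = "http://docbook.org/ns/docbook"

def strip_namespace(root, namespace_uri):
    """Rewrite every tag in namespace_uri to its local name, in place."""
    prefix = "{%s}" % namespace_uri
    for element in root.iter():
        tag = element.tag
        # Comments and processing instructions have non-string tags.
        if isinstance(tag, str) and tag.startswith(prefix):
            element.tag = tag[len(prefix):]
    return root

# Made-up DocBook-like input; the original example in the thread was lost.
root = etree.fromstring(
    '<article xmlns="%s"><title>baz</title></article>' % DOCBOOK_NS
)
strip_namespace(root, DOCBOOK_NS)
tags = [element.tag for element in root.iter()]
print(tags)
```

Elements in other namespaces are left untouched, which keeps the helper safe for documents that mix namespaces.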
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 16:52:29 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 16:52:29 +0200 Subject: [lxml-dev] ElementTree pretty printing (serialisation) In-Reply-To: <9ea1c1180605290742q7fe45508he0eba6fbabf5bf96@mail.gmail.com> References: <9ea1c1180605201432l2351f877sebe4ff92fea8c888@mail.gmail.com> <44700E20.7020707@gkec.informatik.tu-darmstadt.de> <9ea1c1180605210952w595b3bccj5579a0380cafc939@mail.gmail.com> <447B0701.2050000@gkec.informatik.tu-darmstadt.de> <9ea1c1180605290742q7fe45508he0eba6fbabf5bf96@mail.gmail.com> Message-ID: <447B0AAD.6050404@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > As it happens, I have now moved away from lxml for my pretty printing. > Instead, especially given my desire to output well formed HTML/XHTML, > I have written an ElementTree wrapper around uTidylib which takes care > of pretty much anything I could possibly imagine. Hmmm, you know about the HTMLParser in lxml, don't you? Stefan From faassen at infrae.com Mon May 29 18:07:07 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 29 May 2006 18:07:07 +0200 Subject: [lxml-dev] building the trunk and test failures In-Reply-To: <44774364.1010800@gkec.informatik.tu-darmstadt.de> References: <44773BC1.90700@infrae.com> <44774364.1010800@gkec.informatik.tu-darmstadt.de> Message-ID: <447B1C2B.6030204@infrae.com> Stefan Behnel wrote: > Hi Martijn, > > Martijn Faassen wrote: >> I just tried building the trunk: >> >> Python 2.4, Pyrex 0.9.4.1, libxml2 2.6.21 >> >> The first thing I noticed is that compilation of the C source code (with >> 'Make') seems to take extremely long in comparison to the past. What >> changed to cause that? > > It's the error reporting stuff that introduced tons of constants. > > Hmm, you should have noticed that last time you built it. I remember that you > tested the trunk before releasing 0.9... 
I probably wasn't paying as much attention while compiling. :) Oddly enough building it just now seemed faster than my memory of last week's attempt. I don't have enough C zen to figure out whether this could be avoided somehow. Regards, Martijn From faassen at infrae.com Mon May 29 18:08:14 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 29 May 2006 18:08:14 +0200 Subject: [lxml-dev] building the trunk and test failures In-Reply-To: <44795C58.8040505@gkec.informatik.tu-darmstadt.de> References: <44773BC1.90700@infrae.com> <44795C58.8040505@gkec.informatik.tu-darmstadt.de> Message-ID: <447B1C6E.1030404@infrae.com> Stefan Behnel wrote: > Hi Martijn, > > Martijn Faassen wrote: >> I just tried building the trunk: >> >> Python 2.4, Pyrex 0.9.4.1, libxml2 2.6.21 >> >> The next thing is that I get some test failures (included below). The >> api.txt and the extensions.txt test failures both seem to be triggered >> by the same problem. What could be going on here? Different behavior >> between different versions of libxml2? >> >> ERROR: test_docinfo_public (lxml.tests.test_etree.ETreeOnlyTestCase) >> Traceback (most recent call last): >> XMLSyntaxError: switching encoding : no input > > I investigated this. There was a major cleanup in the parser code between > 2.6.22 and 2.6.23. It seems to have fixed a bug in xmlCtxtReadDoc that > prevented earlier versions from switching the encoding correctly. The problem > seemed to be the setup of the parser buffer. Thanks for investigating this. I can confirm no more test failures on my checkout now. 
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 18:10:11 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 18:10:11 +0200 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> Message-ID: <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: > On 29/05/06, Noah Slater wrote: >> I noticed today that certain child attribute values are mangled when >> you try to insert/append child elements onto parent elements. >> >> I have written a test script for this as usual. > > I have just noticed that if you set the attributes after you have > added to the tree the weird bug does not happen. Thanks for reporting this. It looks like a bug in libxml2, as it only happens for "xml:id", not for other attributes. I guess the bug is in xmlReconciliateNs, but that's really just guessing. A simpler way to reproduce this is: ------------------------------- root = etree.Element('element') subelement = etree.Element('subelement') subelement.set("{http://www.w3.org/XML/1998/namespace}id", "foo") root.append(subelement) print etree.tostring(root) ------------------------------- What surprises me most is that ElementTree (1.2.6) has a similar bug. It returns this for subelement.attrib after appending: {'{}id': 'foo'} I'll file a bug report on libxml2 anyway. 
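For anyone re-running the reproduction on a current Python, here is a sketch of the same steps using the standard library's xml.etree.ElementTree (an assumption: the report above concerns lxml.etree and the separate elementtree 1.2.6 package, neither of which is used here). With the standard library the attribute key should come back unchanged after the append.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.Element("element")
subelement = ET.Element("subelement")
subelement.set(XML_ID, "foo")   # set the attribute before appending
root.append(subelement)

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
print(subelement.attrib)
```

Swapping the import for lxml.etree should exercise the code path discussed in this thread, though the result will depend on the lxml and libxml2 versions in use.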
Stefan From faassen at infrae.com Mon May 29 18:14:38 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 29 May 2006 18:14:38 +0200 Subject: [lxml-dev] absolute XPath expressions on Elements In-Reply-To: <44774F00.6030606@gkec.informatik.tu-darmstadt.de> References: <4476FD44.7070708@gkec.informatik.tu-darmstadt.de> <447715D5.7010802@infrae.com> <44774F00.6030606@gkec.informatik.tu-darmstadt.de> Message-ID: <447B1DEE.3050908@infrae.com> Stefan Behnel wrote: > Martijn Faassen wrote: >> Can't we expose a method getdocument() on Elements which will expose the >> underlying document as an ElementTree instance > > I thought about this some more. I'm not opposed to this idea. It makes sense in > the context of libxml2. It's well defined and matches the getparent() method. > > I personally prefer a name like "getroottree()", as "document" is not used in > the API so far. Heh, I was just checking this thread again and prepared to argue some more, but I'm glad to see I don't need to. :) Great! >> , and then define XPath's >> / to work from that always? We can then clearly define xpath() and >> getpath() in terms of getdocument(). > > Not getpath(), which only works on ElementTrees anyway. Right, that makes sense. > This only regards Element.xpath() then. ElementTree.xpath() will continue to > switch root nodes, whereas Element.xpath() will use the element as context for > relative expressions and the root tree as context for absolute expressions. Understood. >> Of course the behavior of getdocument() may be hard to predict for a >> user. Is this really true, or is getdocument() always going to be the >> thing created with Element() that wasn't appended or otherwise placed >> under another one? > > "element.getroottree()" will always return an ElementTree rooted in the root > node of the document that contains the element. > > How is that for a definition? Sounds fine.
You could also define 'root node' a bit better by saying "it's what you get when you walk up the parent chain". I was more thinking along the lines how complex it is for the programmer to reason about the tree this way. I think it's not too difficult to identify the root node. Of course, ElementTree proper doesn't have such a concept really, but we definitely do and it shows up in quite a few places. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 18:21:59 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 18:21:59 +0200 Subject: [lxml-dev] building the trunk and test failures In-Reply-To: <447B1C2B.6030204@infrae.com> References: <44773BC1.90700@infrae.com> <44774364.1010800@gkec.informatik.tu-darmstadt.de> <447B1C2B.6030204@infrae.com> Message-ID: <447B1FA7.8000000@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> Martijn Faassen wrote: >>> I just tried building the trunk: >>> >>> The first thing I noticed is that compilation of the C source code >>> (with 'Make') seems to take extremely long in comparison to the past. >>> What changed to cause that? >> >> It's the error reporting stuff that introduced tons of constants. > > Oddly enough building it just now seemed faster than my memory of last > week's attempt. Well, that's how things change over the weekend. :) It's faster because I reimplemented the constants stuff. Now you can copy&paste them directly from libxml2's xmlerror.h into a multi-line string and it's parsed at module init time. That's marginally slower on import, but makes etree much smaller and compilation much faster. 
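The "constants in a multi-line string, parsed at module init time" approach can be sketched roughly as follows. This is an illustration of the general technique, not lxml's actual code: the "NAME = value" line format, the parse_constants() helper and the five sample entries are all assumptions made for the example.

```python
import re

# Illustrative excerpt in the spirit of libxml2's xmlerror.h. The names and
# values are placeholders for this sketch, pasted as plain text.
_ERROR_CONSTANTS = """
XML_ERR_OK = 0
XML_ERR_INTERNAL_ERROR = 1
XML_ERR_NO_MEMORY = 2
XML_ERR_DOCUMENT_START = 3
XML_ERR_DOCUMENT_EMPTY = 4
"""

# One constant per line: an upper-case identifier, '=', and an integer.
_LINE = re.compile(r"^\s*([A-Z][A-Z0-9_]*)\s*=\s*(\d+)\s*,?\s*$")


def parse_constants(text):
    """Parse 'NAME = value' lines into name->value and value->name dicts."""
    by_name = {}
    by_value = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # skip blank lines and anything unrecognised
        name, value = match.group(1), int(match.group(2))
        by_name[name] = value
        by_value.setdefault(value, name)
    return by_name, by_value


# Runs once, when the module is imported.
ERROR_BY_NAME, ERROR_BY_VALUE = parse_constants(_ERROR_CONSTANTS)

print(ERROR_BY_NAME["XML_ERR_NO_MEMORY"])
print(ERROR_BY_VALUE[4])
```

The trade-off is the one described above: a small parsing cost at import time in exchange for far less generated source to compile.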
Stefan From faassen at infrae.com Mon May 29 18:51:35 2006 From: faassen at infrae.com (Martijn Faassen) Date: Mon, 29 May 2006 18:51:35 +0200 Subject: [lxml-dev] credits update: Stefan is the maintainer Message-ID: <447B2697.9050802@infrae.com> Hi there, I've just updated the CREDITS.txt of lxml to reflect what has been the situation for a long time already: Stefan Behnel is currently the main developer of lxml and its maintainer, and my own role as creator and initial maintainer has diminished somewhat. I'm not disappearing and will remain involved in lxml development for the foreseeable future, but Stefan calls the shots, as he took the responsibility - this is as it should be in open source. I recognized what was going on in December and in personal communication then gave Stefan the keys to the lxml car. He's been doing a great job with it since, and as the 1.0 release is pending we thought we should announce this officially. I personally think Stefan's involvement is an absolutely awesome development. The most obvious reason why this is great is of course Stefan's qualities as a developer and project leader. Less obvious but also very exciting to me is that it makes lxml into a true open source project in the community sense. lxml's future is no longer dependent on a single person, namely myself. Stefan has shown great skill in both sides of managing a successful open source project: the code side and the people side. lxml has now become one of the most powerful XML libraries in the Python world, in the area of feature support as well as in the area of performance. We came a long way in the year and almost-a-half since my initial breakthrough in getting the lxml ball rolling. On the more personal front I also met a few people I'm happy to know now, got to speak about lxml at conferences, and of course got a great XML library out of it too. Thanks to all contributors, and of course especially to Stefan, in making this possible!
Regards, Martijn From buro at petr.com Mon May 29 18:53:39 2006 From: buro at petr.com (Petr van Blokland) Date: Mon, 29 May 2006 18:53:39 +0200 Subject: [lxml-dev] Defining name spaces in lxml/xslt Message-ID: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> Hi, new to lxml I may be overlooking some hints in the doc files. I am trying to define a name space (myspace) in Python, but I don't know by which method my name space object is getting called from lxml/xslt. class AbcElement(ElementBase): def ###METHODNAME###(self): return 'Output of abc' ns = Namespace('http://xml.petr.com/xpyth3/ns/myspace') ns['abc'] = AbcElement xslt = ''' < myspace:abc/> ''' Where ###METHODNAME### I don't know. I am using lxml 1.0 beta Suggestions? Kind regards, Petr van Blokland ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From nslater at gmail.com Mon May 29 18:54:39 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 17:54:39 +0100 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write Message-ID: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> Hi, Could we get the difference between ResultTree.__str__ and ResultTree.write properly documented somewhere ([1]?) on the website or via docstrings etc as there only seems to be one reference to it that I can find [2] By the way, the ability to change "output" parameters with write is an excellent addition to this package - such a time saver for me! :) Thanks, Noah [1] http://codespeak.net/lxml/api.html [2] http://codespeak.net/pipermail/lxml-dev/2006-May.txt -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. 
Stallman From nslater at gmail.com Mon May 29 19:03:17 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 18:03:17 +0100 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write In-Reply-To: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> References: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> Message-ID: <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> Oh, and a question: How do I force the output method (Such as "html" or "txt") from a ResultTree object? Thanks, Noah p.s. Congratulations Stefan! :) On 29/05/06, Noah Slater wrote: > Hi, > > Could we get the difference between ResultTree.__str__ and > ResultTree.write properly documented somewhere ([1]?) on the website > or via docstrings etc as there only seems to be one reference to it > that I can find [2] > > By the way, the ability to change "output" parameters with write is an > excellent addition to this package - such a time saver for me! :) > > Thanks, > Noah > > [1] http://codespeak.net/lxml/api.html > [2] http://codespeak.net/pipermail/lxml-dev/2006-May.txt > > -- > "Creativity can be a social contribution, but only in so > far as society is free to use the results." - R. Stallman > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R.
Stallman From fredrik at pythonware.com Mon May 29 21:23:22 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Mon, 29 May 2006 21:23:22 +0200 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > What surprises me most is that ElementTree (1.2.6) has a similar bug. It > returns this for subelement.attrib after appending: > > {'{}id': 'foo'} that's somewhat surprising, given that ET doesn't do any mangling by itself; that's entirely up to the serialization code. here's what I get on my machine: >>> e = ET.Element("subelement") >>> e.set("{http://www.w3.org/XML/1998/namespace}id", "foo") >>> e >>> ET.tostring(e) '' >>> e.attrib {'{http://www.w3.org/XML/1998/namespace}id': 'foo'} From nslater at gmail.com Mon May 29 21:45:55 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 20:45:55 +0100 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605291245o2678a080j9fdf20021951c9a9@mail.gmail.com> Verified using python-elementtree v.1.2.6-8 on Debian PPC (Sid): >>> import elementtree.ElementTree as ET >>> >>> element = ET.Element('element') >>> subelement = ET.Element('subelement') >>> subelement.set('{http://www.w3.org/XML/1998/namespace}id', 'foo') >>> element.append(subelement) >>> ET.dump(element) >>> subelement.attrib {'{http://www.w3.org/XML/1998/namespace}id': 'foo'} On 29/05/06, Fredrik Lundh wrote: > Stefan Behnel wrote: > > > What surprises me most is that ElementTree (1.2.6) has a similar bug.
It > > returns this for subelement.attrib after appending: > > > > {'{}id': 'foo'} > > that's somewhat surprising, given that ET doesn't do any mangling by > itself; that's entirely up to the serialization code. > > here's what I get on my machine: > > >>> e = ET.Element("subelement") > >>> e.set("{http://www.w3.org/XML/1998/namespace}id", "foo") > >>> e > > >>> ET.tostring(e) > '' > >>> e.attrib > {'{http://www.w3.org/XML/1998/namespace}id': 'foo'} > > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 23:03:56 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 23:03:56 +0200 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write In-Reply-To: <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> References: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> Message-ID: <447B61BC.4090806@gkec.informatik.tu-darmstadt.de> Noah Slater wrote: > Oh, and a question: > > How do I force the output method (Such as "html" or "txt") from a > ResultTree object? By using a sensible xsl:output element in the stylesheet? > On 29/05/06, Noah Slater wrote: >> Hi, >> >> Could we get the difference between ResultTree.__str__ and >> ResultTree.write properly documented somewhere ([1]?) on the website >> or via docstrings etc as there only seems to be one reference to it >> that I can find [2] >> >> By the way, the ability to change "output" parameters with write is an >> excellent addition to this package - such a time saver for me! :) How's this for a starter? 
http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 23:13:08 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 23:13:08 +0200 Subject: [lxml-dev] Defining name spaces in lxml/xslt In-Reply-To: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> References: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> Message-ID: <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > I am trying to define a name space (myspace) in Python, but I don't know > by which method my name space object is getting called from lxml/xslt. > > class AbcElement(ElementBase): > def ###METHODNAME###(self): > return 'Output of abc' > > ns = Namespace('http://xml.petr.com/xpyth3/ns/myspace') > ns['abc'] = AbcElement > > > xslt = ''' xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > xmlns:abc="http://xml.petr.com/xpyth3/xpath/ns/myspace" > xsl:extension-element-prefixes='myspace' > > > > > < myspace:abc/> > > > ''' > > > Where ###METHODNAME### I don't know. > I am using lxml 1.0 beta "Implementing namespaces" is not quite what you think it is. It has nothing to do with XSLT. XSLT extension elements are not currently implemented in lxml (and require a rather heavy piece of code to be written). I assume you have read the documentation. http://codespeak.net/lxml/namespace_extensions.html It is mainly designed for writing custom APIs on top of XML. A good example for what you can do with it is the MathML implementation MathDOM. http://mathdom.sourceforge.net/ lxml will not "execute" the custom elements in any operation. (although it might be interesting to think about what you could do with this...) 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon May 29 23:22:56 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 29 May 2006 23:22:56 +0200 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> Message-ID: <447B6630.6090802@gkec.informatik.tu-darmstadt.de> Hi Fredrik, sorry for the confusion. Fredrik Lundh wrote: > Stefan Behnel wrote: >> What surprises me most is that ElementTree (1.2.6) has a similar bug. It >> returns this for subelement.attrib after appending: >> >> {'{}id': 'foo'} > > that's somewhat surprising, given that ET doesn't do any mangling by > itself; that's entirely up to the serialization code. > > here's what I get on my machine: > > >>> e = ET.Element("subelement") > >>> e.set("{http://www.w3.org/XML/1998/namespace}id", "foo") > >>> e > > >>> ET.tostring(e) > '' > >>> e.attrib > {'{http://www.w3.org/XML/1998/namespace}id': 'foo'} I get that, too. I had forgotten a tiny 'self.' in the test case, which let it use etree instead of ElementTree. Oh well, imports ... Sorry, Stefan From nslater at gmail.com Mon May 29 23:51:05 2006 From: nslater at gmail.com (Noah Slater) Date: Mon, 29 May 2006 22:51:05 +0100 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write In-Reply-To: <447B61BC.4090806@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> <447B61BC.4090806@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605291451u51248e3byd7e44dc3307ffac3@mail.gmail.com> Hello Stefan, > > How do I force the output method (Such as "html" or "txt") from a > > ResultTree object? > > By using a sensible xsl:output element in the stylesheet? 
Interesting response. If I had total control over the stylesheets I would do this, but even then it would still not offer a solution for me. As I have mentioned before I am writing an HTTP publication framework. All resources are stored and manipulated internally as DocBook. They are transformed towards the end of the request processing into a resource representation. The type of resource representation is determined using HTTP content negotiation. The result of the negotiation could request gzip'ed PDF, Shift-JIS encoded plain text or regular ISO 8859-1 encoded XHTML 1.1. Either way, I have a plethora of XSLT stylesheets which can perform these transformations at my request. The best part about using lxml for my purposes is the ability to use ResultTree.write with an encoding to get the desired result. This saves me from constructing a post-processing XSLT stylesheet in memory (as I did when I was using the libxsl bindings) just to change the character set at run-time. Now you see, here is my actual problem - the use case if you will - for being able to control the output method at runtime via ResultTree.write: If I want to convert my DocBook file to HTML 4.01 I face a bit of a problem. HTML 4.01 should not have any PIs (Such as the declaration), so this rules out any ability to declare the character encoding. Oh no! What do we do? Without declaring the character encoding in the file not only does it fail to validate as HTML (check with the W3C validation service) but it reduces the interoperability of my application because clients and UAs have to guess character encoding - which is a Bad Thing. You may come back with two responses to this: "Well who cares if you don't validate? It's only a small error." It may only be HTML, but I don't want Tim Bray to call me a bozo. [1] "Why don't you specify the correct encoding in the HTTP headers?" I already do. But what happens when the user saves the document to disk?
What happens when the UA otherwise copies the file to a local repository? The character encoding is lost - never to be found again. Bozo time! But wait! There is a solution, just specify the output method as "html" in the stylesheet, that should do it. Let's test with xsltproc from the command line. Yup, we get this in the HTML head generated for us: Okay, cool - let's test in my app. Oh wait, it's gone! Time to look through all the documentation. Starting off with [2] doesn't help me much however, so I continued searching. 15 minutes with google later and I find out it's because of a difference between ResultTree.write and ResultTree.__str__ So it would seem that I have two options. I can serve documents up in different character encodings, but only as XHTML. I just lost 90% of the web who can't view XHTML properly. [3] Alternatively I could serve up documents in HTML and other formats, but only in UTF-8 (or another fixed encoding) which kinda sucks. Some clients may not be able to handle UTF-8, so I just lost some more of my audience. I could always serve up using ASCII or some other variant - but I just lost over half the world's population as potential users. > http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt Once again, thank you for pointing this out - I had read it before as it happens. Unfortunately I don't have the source of lxml checked out - I'm sure I must have it on my system somewhere - but I don't know, and don't care, where. My point was that I, presumably like most other users, turn to the website for documentation which is sadly missing this information. I propose that even if it helps just one other person save those 15 minutes googling we should have this FAQ on the web with some obvious way into it from the home page. Thank you for your time.
Regards, Noah [1] http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim [2] http://codespeak.net/lxml/api.html [3] http://hixie.ch/advocacy/xhtml -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 06:49:22 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 06:49:22 +0200 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write In-Reply-To: <9ea1c1180605291451u51248e3byd7e44dc3307ffac3@mail.gmail.com> References: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> <447B61BC.4090806@gkec.informatik.tu-darmstadt.de> <9ea1c1180605291451u51248e3byd7e44dc3307ffac3@mail.gmail.com> Message-ID: <447BCED2.5070001@gkec.informatik.tu-darmstadt.de> Hi Noah, Noah Slater wrote: >> > How do I force the output method (Such as "html" or "txt") from a >> > ResultTree object? >> >> By using a sensible xsl:output element in the stylesheet? > > Interesting response. > > If I had total control over the stylesheets I would do this, but even > then it would still not offer a solution for me. > > The type of resource representation is determined using HTTP content > negotiation. > > The result of the negotiation could request gzip'ed PDF, Shift-JIS > encoded plain text or regular ISO 8859-1 encoded XHTML 1.1. > > Either way, I have a plethora of XSLT stylesheets which can perform > these transformations at my request. The best part about using lxml > for my purposes is the ability to use ResultTree.write with an > encoding to get the desired result. The result of an XSLT is an ElementTree. Feel free to do with it what you like. It's no problem to specify the output method in a second XSLT step. This is untested: >>> transform = XSLT(bigTreeToDoSomeTransformation) >>> tohtml = XSLT(XML('''\ ... ... ... ... ... ... ... 
''')) >>> print tohtml(transform(doc)) The little stylesheets are small enough to be a) generated on the fly b) cached c) quickly compiled by XSLT() after adapting the xsl:output element of a base XSL document (note that XSLT copies the document internally, so this works). Feel free to find out which is fastest. If you're working on a publishing framework, not being able to run a stylesheet cascade on a document would be pretty much of a no-go in my eyes. > This saves me from constructing a post-processing XSLT stylesheet in > memory (as I did when I was using the libxsl bindings) just to change > the character set at run-time. I think that's a pretty quick thing to do with lxml. > Now you see, here is my actual problem - the use case if you will - > for being able to control the output method at runtime via > ResultTree.write: > > If I want to convert my DocBook file to HTML 4.01 I face a bit of a > problem. HTML 4.01 should not have any PIs (Such as the > declaration), this this rules out any ability to declare the character > encoding. Right, libxml2 actually has a special HTML output API, but that's not currently wrapped by lxml. I don't know what it does exactly, though. http://xmlsoft.org/html/libxml-HTMLtree.html > Oh no! What do we do? Without declaring the character encoding in the > file not only does it fail to validate as HTML (check with the W3C > validation service) but it reduces the interoperability of my > application because clients and UAs have to guess character encoding - > which is a Bad Thing. > > But wait! There is a solution, just specify the the output method as > "html" in the stylesheet, that should do it. Let's test with xsltproc > from the command line. > > Yup, we get this in the HTML head generated for us: > > > > Okay, cool - let's test in my app. > > Oh wait, it's gone! 
15 minutes with google later and I find out it's > because of a difference between ResultTree.write and ResultTree.__str__ and I guess you meant that you used write() instead of str(). Because "str() does what you want"^TM. :) I think you still have some easy choices: post-process with an output stylesheet, keep multiple slightly different stylesheets in memory that you generate at startup time, modify the ResultTree before serializing (I never tested this, but I wouldn't know why it should not work), ... >> http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt > > Once again, thank you for pointing this out - I had read it before as > it happens. Unfortunately I don't have the source of lxml checked out > - I'm sure I must have it on my system somewhere - but I don't know, > and don't care, where. My point was that I, presumably like most other > users, turn to the website for documentation which is sadly missing > this information. Yeah, Martijn is sometimes a bit slow with updating the web page, but this time it's not really his fault. the FAQ is constantly evolving and since 1.0 is pretty close, we'll update the web page in one step when it comes out. > I propose that even if it helps just one other person save those 15 > minutes googling we should have this FAQ on the web with some obvious > way into it from the home page. The "obvious way" is a link that's in the pages of the trunk. Just wait a few days and it will be online. Since you're working with a recent version anyway, you should not hesitate to refer to the SVN version of the documentation, though. Remember that the official current version is still 0.9.2, which the online documentation describes. 
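For the archive, a minimal sketch of the two-step idea from earlier in this message: a tiny identity stylesheet whose only purpose is to carry the xsl:output settings, applied to the result of the first transformation. The input document, the TO_HTML name and the chosen encoding are made up for illustration, and the sketch assumes a current lxml in which str() on an XSLT result honours xsl:output, as described above. It has not been checked against the 2006 trunk.

```python
from lxml import etree

# Hypothetical stand-in for the first-stage result; in practice this
# would be something like transform(doc).
first_stage = etree.XML(
    "<html><head><title>t</title></head><body><p>hi</p></body></html>"
)

# Identity stylesheet: copies everything unchanged and only exists to
# select the output method and encoding at serialization time.
TO_HTML = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="ISO-8859-1"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>'''))

result = TO_HTML(first_stage)

# str() serializes according to the xsl:output element of TO_HTML.
html_text = str(result)
print(html_text)
```

Building one such stylesheet per output method at startup and caching the compiled XSLT objects should keep the per-request cost to a single extra transform.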
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 07:08:35 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 07:08:35 +0200 Subject: [lxml-dev] credits update: Stefan is the maintainer In-Reply-To: <447B2697.9050802@infrae.com> References: <447B2697.9050802@infrae.com> Message-ID: <447BD353.5010302@gkec.informatik.tu-darmstadt.de> Hi Martijn, thanks a lot, it's an honour to become lead developer in a project as great as lxml. Thank you for writing it in the first place. I wouldn't have started it, but when I joined it, it was so far advanced that it was a pleasure to start playing with it. I think that really qualifies for true OpenSource. I'm happy to hear that you keep with the project, as your input and feedback has always been very valuable to me and my work. And to additionally have someone who's advocating it is really worth something. :) Thank you for everything you put into the project and the community around it. Stefan From paul at zeapartners.org Tue May 30 08:23:29 2006 From: paul at zeapartners.org (Paul Everitt) Date: Tue, 30 May 2006 08:23:29 +0200 Subject: [lxml-dev] credits update: Stefan is the maintainer In-Reply-To: <447B2697.9050802@infrae.com> References: <447B2697.9050802@infrae.com> Message-ID: <447BE4E1.6030005@zeapartners.org> I'd like to second Martijn's praise as well. Stefan's work over the last 6-9 months has been one of the most seriously impressive things I've seen in open source in a while: smart work, seriously done, helpful replies to questions, etc. A nice combination of diligence, innovation, and fun. I *really* enjoy using lxml and thinking about how design is influenced by the new features. (I'm going in the same direction Stefan did with the XPathModel stuff he put on Berlios.) It's really fun to put the new stuff to use. 
I'd also like to second Stefan's praise of Martijn, who seems to excel at launching major projects with deep impact that attract committed contributors who become leaders. I can only apologize for not being a better contributor myself. I can't possibly do Pyrex or libxml2 stuff, but I can collect FAQs, help write meaningful tests, update the website, answer questions, etc. Perhaps all of us should show our appreciation to Martijn and Stefan by trying, as a group, to match one of their individual efforts. --Paul Martijn Faassen wrote: > Hi there, > > I've just updated the CREDITS.txt of lxml to reflect what had become the > situation for a long time already: Stefan Behnel is currently the main > developer of lxml and its maintainer, and my own role as creator and > initial maintainer has diminished somewhat. I'm not disappearing and > will remain involved in lxml development for the foreseeable future, but > Stefan calls the shots, as he took the responsibility - this is as it > should be in open source. > > I recognized what was going on in december and in personal communication > then gave Stefan the keys to the lxml car. He's been doing a great job > with it since, and as the 1.0 release is pending we thought we should > announce this officially. > > I personally think Stefan's involvement is an absolutely awesome > development. The most obvious reason why this is great is of course > Stefan's qualities as a developer and project leader. Less obvious but > also very exciting to me is that it makes lxml into a true open source > project in the community sense. lxml's future is no longer dependent on > a single person, namely myself. Stefan has shown great skill in both > sides of managing a successful open source project: the code side and > the people side. > > lxml has now become one of the most powerful XML libraries in the Python > world, in the area of feature support as well as in the area of > performance. 
We came a long way in the year and almost-a-half since my > initial breakthrough in getting the lxml ball rolling. On the more > personal front I also met a few people I'm happy to know now, got to > speak about lxml at conferences, and of course got a great XML library > out of it too. Thanks to all contributors, and of course especially to > Stefan, in making this possible! > > Regards, > > Martijn From johnny at johnnydebris.net Tue May 30 13:23:11 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Tue, 30 May 2006 13:23:11 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) Message-ID: <447C2B1F.2000505@johnnydebris.net> Hi! The past couple of months I kept running into some bug. For some reason the namespace information of certain nodes got lost after moving the nodes from one document to another. I kept trying to isolate the problem, but without success, until this week I found that the reason the information gets lost seems to have to do with the document getting garbage collected even though some nodes are still referred to. With this in mind I managed to write the following demonstration snippet: #-------------------------------------------------------------------------- from lxml import etree s1 = '' tree1 = etree.fromstring(s1) btag = tree1.xpath('//b')[0] del tree1 s2 = '' tree2 = etree.fromstring(s2) tree2.append(btag) # this produces crap... print etree.tostring(tree2) #-------------------------------------------------------------------------- The problem seems to be the 'del' statement: this seems to free the document instance or something, even though there's still a reference to one of its nodes. Serializing the node later on will result in garbage. When the 'del' line is commented out, the XML is produced properly. I hope this is enough information for you guys to track the bug: I'm no C programmer myself, so I don't dare diving into the Pyrex, and also I may have made some errors in my explanation... 
If you need more information, let me know. Cheers, Guido From nslater at gmail.com Tue May 30 13:58:41 2006 From: nslater at gmail.com (Noah Slater) Date: Tue, 30 May 2006 12:58:41 +0100 Subject: [lxml-dev] Difference between ResultTree.__str__ and ResultTree.write In-Reply-To: <447BCED2.5070001@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605290954h25595ea7g668facb29a7002ed@mail.gmail.com> <9ea1c1180605291003u42d5517bv736352be86358fe@mail.gmail.com> <447B61BC.4090806@gkec.informatik.tu-darmstadt.de> <9ea1c1180605291451u51248e3byd7e44dc3307ffac3@mail.gmail.com> <447BCED2.5070001@gkec.informatik.tu-darmstadt.de> Message-ID: <9ea1c1180605300458g7c77b77dk7cdb7f7eda0dbc2a@mail.gmail.com> Thanks for your help, as always, Stefan. I read and understood all that you said. However, I am still left wondering whether this is something we will ever be able to do via ResultTree.write. Is this a hard problem? Friendly, Noah :) -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman From faassen at infrae.com Tue May 30 15:06:25 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 30 May 2006 15:06:25 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? Message-ID: <447C4351.7050804@infrae.com> Hi there, Since upgrading to a new version of valgrind quite a while ago, valgrind gets spammy, even with the Python suppressions in place. I suspect this is because valgrind is getting more picky.
I'm getting stuff like this: ==10199== Conditional jump or move depends on uninitialised value(s) ==10199== at 0x1B8F4FD1: (within /lib/ld-2.3.5.so) ==10199== by 0x1B8EA4AA: (within /lib/ld-2.3.5.so) ==10199== by 0x4111F18F: (within /lib/tls/libc-2.3.5.so) ==10199== by 0x1B8EF026: (within /lib/ld-2.3.5.so) ==10199== by 0x4111FB85: _dl_open (in /lib/tls/libc-2.3.5.so) ==10199== by 0x4118FD32: (within /lib/tls/libdl-2.3.5.so) ==10199== by 0x1B8EF026: (within /lib/ld-2.3.5.so) ==10199== by 0x41190486: (within /lib/tls/libdl-2.3.5.so) ==10199== by 0x4118FDB0: dlopen (in /lib/tls/libdl-2.3.5.so) ==10199== by 0x80DCE40: _PyImport_GetDynLoadFunc (in /usr/bin/python2.4) ==10199== by 0x80D2622: _PyImport_LoadDynamicModule (in /usr/bin/python2.4) ==10199== by 0x80D223F: (within /usr/bin/python2.4) and generally ld related output, presumably something to do with dynamic linking. This makes it harder to pick out potentially real memory issues. I'm wondering whether others get this same behavior with valgrind and have it fixed, or whether everybody is stuck with it. I looked for newer suppressions in the Python repository at some point, but couldn't find it then. My valgrind version is 3.0.1 Regards, Martijn From faassen at infrae.com Tue May 30 15:10:15 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 30 May 2006 15:10:15 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? In-Reply-To: <447C4351.7050804@infrae.com> References: <447C4351.7050804@infrae.com> Message-ID: <447C4437.4020806@infrae.com> Hey, Also note that running valgrind over the lxml testsuite now reports quite a few problems -- invalid reads of content previously freed by libxml2, for instance. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 15:50:44 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 15:50:44 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? 
In-Reply-To: <447C4437.4020806@infrae.com> References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com> Message-ID: <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Hey, > > Also note that running valgrind over the lxml testsuite now reports > quite a few problems -- invalid reads of content previously freed by > libxml2, for instance. I hope you're refering to the test_attribute_xmlns_move test. That's the xml:id bug that Noah found. I'm working on that, but it's rather tricky. See here: http://bugzilla.gnome.org/show_bug.cgi?id=343302 Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 16:47:10 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 16:47:10 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) In-Reply-To: <447C2B1F.2000505@johnnydebris.net> References: <447C2B1F.2000505@johnnydebris.net> Message-ID: <447C5AEE.4030608@gkec.informatik.tu-darmstadt.de> Hi, Johnny deBris wrote: > The past couple of months I kept running into some bug. For some reason > the namespace information of certain nodes got lost after moving the > nodes from one document to another. I kept trying to isolate the > problem, but without success, until this week I found that the reason > the information gets lost seems to have to do with the document getting > garbage collected even though some nodes are still referred to. > > With this in mind I managed to write the following demonstration snippet: > > #-------------------------------------------------------------------------- > > from lxml import etree > > s1 = '' > tree1 = etree.fromstring(s1) > btag = tree1.xpath('//b')[0] > del tree1 > > s2 = '' > tree2 = etree.fromstring(s2) > tree2.append(btag) > > # this produces crap... 
> print etree.tostring(tree2) > > #-------------------------------------------------------------------------- > > The problem seems to be the 'del' statement: this seems to free the > document instance or something, even though there's still a reference to > one of its nodes. Serializing the node later on will result in garbage. > When the 'del' line is commented out, the XML is produced properly. > > I hope this is enough information for you guys to track the bug: I'm no > C programmer myself, so I don't dare diving into the Pyrex, and also I > may have made some errors in my explanation... If you need more > information, let me know. Thanks for reporting this and thanks for taking the time to provide a test. This seems to be related to a bug that was recently discovered, but this test case tells me that it's actually different than I thought. I'll have another look at it. Thanks, Stefan From buro at petr.com Tue May 30 17:22:58 2006 From: buro at petr.com (Petr van Blokland) Date: Tue, 30 May 2006 17:22:58 +0200 Subject: [lxml-dev] Defining name spaces in lxml/xslt In-Reply-To: <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> References: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> Message-ID: <3A7A4B1D-B4CA-4473-8A5F-114B18B8871C@petr.com> Hi Stefan, Ah, that explains it. I was searching for a feature that was not there. I'll have to think about another strategy then. This has been working with a full-python XSL parser we wrote ourselves, but lxml/libxml2 is about 10 times faster, so it is very tempting to switch. I could solve the problem by only using xpath in combination with xsl:value-of. It might work (although not as readable as tags). Trying it, though, I found that < and > returned by an external xpath function are also escaped. Any idea how to avoid that?
Regards, Petr van Blokland ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From faassen at infrae.com Tue May 30 17:34:39 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 30 May 2006 17:34:39 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? In-Reply-To: <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com> <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> Message-ID: <447C660F.6010505@infrae.com> Stefan Behnel wrote: > Hi Martijn, > > Martijn Faassen wrote: >> Hey, >> >> Also note that running valgrind over the lxml testsuite now reports >> quite a few problems -- invalid reads of content previously freed by >> libxml2, for instance. > > I hope you're referring to the test_attribute_xmlns_move test. That's the > xml:id bug that Noah found. I'm working on that, but it's rather tricky. See here: > > http://bugzilla.gnome.org/show_bug.cgi?id=343302 I indeed get problems with test_attribute_xmlns_move, but also with test_module_HTML_unicode. I also appear to get a whole bunch of problems near the end of the test run (possibly when it's running doctests? not sure..). Note that test_attribute_xmlns_move currently also fails when I run the tests. Are you checking with valgrind, by the way? What do you do with the suppressions? Or is valgrind not an option for you and is this something I should do regularly for you?
Regards, Martijn From johnny at johnnydebris.net Tue May 30 17:36:17 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Tue, 30 May 2006 17:36:17 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) In-Reply-To: <447C5AEE.4030608@gkec.informatik.tu-darmstadt.de> References: <447C2B1F.2000505@johnnydebris.net> <447C5AEE.4030608@gkec.informatik.tu-darmstadt.de> Message-ID: <447C6671.3060204@johnnydebris.net> Stefan Behnel wrote: >I'll have another look at it. > > > Very cool, thanks! Cheers, Guido From faassen at infrae.com Tue May 30 17:54:22 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 30 May 2006 17:54:22 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? In-Reply-To: <447C660F.6010505@infrae.com> References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com> <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> <447C660F.6010505@infrae.com> Message-ID: <447C6AAE.2010402@infrae.com> Martijn Faassen wrote: > Stefan Behnel wrote: >> Hi Martijn, >> >> Martijn Faassen wrote: >>> Hey, >>> >>> Also note that running valgrind over the lxml testsuite now reports >>> quite a few problems -- invalid reads of content previously freed by >>> libxml2, for instance. >> I hope you're refering to the test_attribute_xmlns_move test. That's the >> xml:id bug that Noah found. I'm working on that, but it's rather tricky. See here: >> >> http://bugzilla.gnome.org/show_bug.cgi?id=343302 > > I indeed get problems with test_attribute_xmlns_move, but also with > test_module_HTML_unicode. I see you did a fix; the last one is now gone. > I also appear to get a whole bunch of problems > near the end of the test run (possbily when it's running doctests? not > sure..). Note that test_attribute_xmlns_move currently also fails when I > run the tests. I think I have been looking wrong and mistook the error summary for errors at the end. 
There are some reports at the end but they look to have something to do with the Python interpreter again, not lxml in particular: ==31495== Invalid read of size 4 ==31495== at 0x41129379: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x410E9121: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x410E9211: tdestroy (in /lib/tls/libc-2.3.5.so) ==31495== by 0x4112971B: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x41129D51: __libc_freeres (in /lib/tls/libc-2.3.5.so) ==31495== by 0x1B8FC68A: _vgw_freeres (vg_preloaded.c:62) ==31495== by 0x41045785: exit (in /lib/tls/libc-2.3.5.so) ==31495== by 0x80D8468: (within /usr/bin/python2.4) ==31495== by 0x80D8645: PyErr_PrintEx (in /usr/bin/python2.4) ==31495== by 0x80D9143: PyRun_SimpleFileExFlags (in /usr/bin/python2.4) ==31495== by 0x8055A05: Py_Main (in /usr/bin/python2.4) ==31495== by 0x4102DEBF: __libc_start_main (in /lib/tls/libc-2.3.5.so) ==31495== Address 0xC is not stack'd, malloc'd or (recently) free'd ==31495== ==31495== Process terminating with default action of signal 11 (SIGSEGV) ==31495== Access not within mapped region at address 0xC ==31495== at 0x41129379: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x410E9121: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x410E9211: tdestroy (in /lib/tls/libc-2.3.5.so) ==31495== by 0x4112971B: (within /lib/tls/libc-2.3.5.so) ==31495== by 0x41129D51: __libc_freeres (in /lib/tls/libc-2.3.5.so) ==31495== by 0x1B8FC68A: _vgw_freeres (vg_preloaded.c:62) ==31495== by 0x41045785: exit (in /lib/tls/libc-2.3.5.so) ==31495== by 0x80D8468: (within /usr/bin/python2.4) ==31495== by 0x80D8645: PyErr_PrintEx (in /usr/bin/python2.4) ==31495== by 0x80D9143: PyRun_SimpleFileExFlags (in /usr/bin/python2.4) ==31495== by 0x8055A05: Py_Main (in /usr/bin/python2.4) ==31495== by 0x4102DEBF: __libc_start_main (in /lib/tls/libc-2.3.5.so) Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 18:04:38 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: 
Tue, 30 May 2006 18:04:38 +0200 Subject: [lxml-dev] improved valgrind suppressions for Python? In-Reply-To: <447C660F.6010505@infrae.com> References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com> <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> <447C660F.6010505@infrae.com> Message-ID: <447C6D16.5000206@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> Martijn Faassen wrote: >>> Also note that running valgrind over the lxml testsuite now reports >>> quite a few problems -- invalid reads of content previously freed by >>> libxml2, for instance. >> >> I hope you're refering to the test_attribute_xmlns_move test. That's the >> xml:id bug that Noah found. I'm working on that, but it's rather >> tricky. See here: >> >> http://bugzilla.gnome.org/show_bug.cgi?id=343302 > > I indeed get problems with test_attribute_xmlns_move I know, we had two bug reports for the problem that is now covered by that test. > but also with test_module_HTML_unicode. Don't you ever do updates? That was fixed at least half an hour ago! :) > I also appear to get a whole bunch of problems > near the end of the test run (possbily when it's running doctests? not > sure..). Note that test_attribute_xmlns_move currently also fails when I > run the tests. Don't know about those, If you want to check, feel free. > Are you checking with valgrind, by the way? What do you do with the > supression? Or is valgrind not an option for you and is this something I > should do regularly for you? I do from time to time, not regularly. The fix above slipped through since the last valgrind run. It was actually for fixing a few memory leaks, but I added one call too much. I'm stuffed with work currently, so if you want to run some tests and check them, go ahead. 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 21:35:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 21:35:32 +0200 Subject: [lxml-dev] Bug with attribute mangling when adding child elements In-Reply-To: <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> References: <9ea1c1180605281904v64b36f60me4b736aa80d8c4f8@mail.gmail.com> <9ea1c1180605290509t7ab6f0f9we73ee9392f00770@mail.gmail.com> <447B1CE3.5090807@gkec.informatik.tu-darmstadt.de> Message-ID: <447C9E84.3060400@gkec.informatik.tu-darmstadt.de> Hi, Stefan Behnel wrote: > Noah Slater wrote: >> On 29/05/06, Noah Slater wrote: >>> I noticed today that certain child attribute values are mangled when >>> you try to insert/append child elements onto parent elements. >>> >>> I have written a test script for this as usual. >> I have just noticed that if you set the attributes after you have >> added to the tree the weird bug does not happen. > > Thanks for reporting this. It looks like a bug in libxml2, as it only happens > for "xml:id", not for other attributes. I guess the bug is in > xmlReconciliateNs, but that's really just guessing. Well, it was /related/ to xmlReconciliateNs, but it was not a bug in libxml2, and it was only by chance that it showed up for xml:id. lxml was running xmlReconciliateNs /after/ cleaning up the document references in the Python elements, which means it freed the document once all references were gone, before trying to access its namespaces. Great. That bug was in there for ages, since SVN revision 12568! Fixed in the trunk.
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 21:38:03 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 21:38:03 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) In-Reply-To: <447C2B1F.2000505@johnnydebris.net> References: <447C2B1F.2000505@johnnydebris.net> Message-ID: <447C9F1B.3070801@gkec.informatik.tu-darmstadt.de> Hi, Johnny deBris wrote: > The past couple of months I kept running into some bug. For some reason > the namespace information of certain nodes got lost after moving the > nodes from one document to another. I kept trying to isolate the > problem, but without success, until this week I found that the reason > the information gets lost seems to have to do with the document getting > garbage collected even though some nodes are still referred to. [...] > I hope this is enough information for you guys to track the bug: I'm no > C programmer myself, so I don't dare diving into the Pyrex, and also I > may have made some errors in my explanation... If you need more > information, let me know. Thanks a lot. Bug is fixed in the trunk. Stefan From apaku at gmx.de Tue May 30 21:46:24 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Tue, 30 May 2006 21:46:24 +0200 Subject: [lxml-dev] element of an xpath evaluation Message-ID: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> Hi, very recently I needed a program that evaluates a xpath and displays the "result" graphically in a tree of the xml file that was used. I found XPath Explorer a java application, however it does quite some more stuff than I need and there certain things that just don't work as I want. I figured I could very easily provide something similar using lxml and PyQt4. However if I want to highlight the tree node that the xpath matches I have a "problem" when the xpath matches attributes or text nodes. 
So the question is: Is there a way using lxml to find out to which element a certain non-element result of an xpath evaluation belongs? Andreas -- You now have Asian Flu. From behnel_ml at gkec.informatik.tu-darmstadt.de Tue May 30 22:10:06 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 30 May 2006 22:10:06 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> Message-ID: <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > very recently I needed a program that evaluates a xpath and displays the > "result" graphically in a tree of the xml file that was used. I found > XPath Explorer a java application, however it does quite some more stuff > than I need and there are certain things that just don't work as I want. > > I figured I could very easily provide something similar using lxml and > PyQt4. That sounds pretty interesting. Please post a link when you have something usable. > However if I want to highlight the tree node that the xpath > matches I have a "problem" when the xpath matches attributes or text > nodes. So the question is: Is there a way using lxml to find out to > which element a certain non-element result of an xpath evaluation > belongs? Not straight away. Both are returned as strings, so you lose the information about where they came from. You can try to run a second XPath expression to find the result text or attribute value in the tree, but that's bound to fail if text data is not unique (which is pretty likely for attributes).
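A minimal sketch of one way around the uniqueness problem, assuming the expression can be rewritten by hand: select the elements that carry the attribute (or the text) instead of the values themselves, and read the value from each element afterwards. The sample document and all names are invented for illustration. The sketch uses the standard library's xml.etree.ElementTree so that it runs without lxml; the same findall() call should work on lxml.etree elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a deliberately non-unique attribute value.
root = ET.fromstring('<root><item id="a"/><item id="a"/><other/></root>')

# Instead of asking for the attribute values (//item/@id), ask for the
# elements that have the attribute. Each hit is a distinct element
# object, so duplicate values are no longer ambiguous.
owners = root.findall(".//item[@id]")

# The value is then read from the element that owns it.
pairs = [(el.tag, el.get("id")) for el in owners]
print(pairs)
```

For text results the same applies: select the element and read its .text, rather than selecting text() directly.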
A 'stupid idea' would be to fiddle with the XPath expression and add a function call after each traversed node that stores the element in a list. You could then trace the evaluation path. Something like "a/b//c[true()]/d/text()" -> "a[store(.)]/b[store(.)]//c[(true()) and store(.)]/d[store(.)]/text()" But that would require you to 'parse' the expression, I don't know if you can get that done with regexps... Stefan From johnny at johnnydebris.net Tue May 30 22:42:01 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Tue, 30 May 2006 22:42:01 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) In-Reply-To: <447C9F1B.3070801@gkec.informatik.tu-darmstadt.de> References: <447C2B1F.2000505@johnnydebris.net> <447C9F1B.3070801@gkec.informatik.tu-darmstadt.de> Message-ID: <447CAE19.8070102@johnnydebris.net> Stefan Behnel wrote: > > Thanks a lot. Bug is fixed in the trunk. > Thank *you*! This helps me a lot! Cheers, Guido From apaku at gmx.de Tue May 30 23:45:53 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Tue, 30 May 2006 23:45:53 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> Message-ID: <20060530214553.GC31918@morpheus.apaku.dnsalias.org> On 30.05.06 22:10:06, Stefan Behnel wrote: > Andreas Pakulat wrote: > > very recently I needed a program that evaluates a xpath and displays the > > "result" graphically in a tree of the xml file that was used. I found > > XPath Explorer a java application, however it does quite some more stuff > > than I need and there certain things that just don't work as I want. > > > > I figured I could very easily provide something similar using lxml and > > PyQt4. > > That sounds pretty interesting. Please post a link when you have something usable. I will. 
> > However if I want to highlight the tree node that the xpath > > matches I have a "problem" when the xpath matches attributes or text > > nodes. So the question is: Is there a way using lxml to find out to > > which element a certain non-element result of an xpath evaluation > > belongs? > > Not straight away. Both are returned as strings, so you loose the information > where it came from. Yeah, tell me ;-) > A 'stupid idea' would be to fiddle with the XPath expression and add a > function call after each traversed node that stores the element in a list. You > could then trace the evaluation path. > > Something like > "a/b//c[true()]/d/text()" > -> > "a[store(.)]/b[store(.)]//c[(true()) and store(.)]/d[store(.)]/text()" > > But that would require you to 'parse' the expression, I don't know if you can > get that done with regexps... uh oh. No. I guess I'll go with PyXML and it's dom-Model then. That'll give me a proper AttrNode for the attributes. Thanks for the clarification. Andreas -- You possess a mind not merely twisted, but actually sprained. From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 10:17:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 31 May 2006 10:17:58 +0200 Subject: [lxml-dev] Faster iteration and deallocation Message-ID: <447D5136.6040300@gkec.informatik.tu-darmstadt.de> Hi all, I know, we already had our last beta and 1.0 is pretty close, but I decided to do some more major clean ups anyway and found a lot of code duplication between functions that iterate over the tree. I replaced that by a set of BEGIN/END C macros, which has three effects: * Tree iteration now uses the same code throughout lxml * It's much shorter and thus (hopefully) easier to understand * It's about 20-30% faster than the previous recursive implementations I took another somewhat experimental step and removed the _NodeBase superclass from _Attrib to make it a plain Python object that sits on an _Element. 
This allowed for another major clean up and a huge simplification of the proxy code, since there is only one proxy type left. I know that these changes impact pretty much every part of lxml and I hope they did not introduce any of those beautiful little bugs which I already squeezed so many of recently. Since they simplify and merge a lot of code, I hope they also make the code easier to maintain. According to the test cases, everything works nicely as before and I will give valgrind a run to see if there's anything suspicious left. It would be a great help if I could get some feedback on the stability of the modified trunk version in other applications to see if the test cases speak the truth. To do this without installing it, checkout the trunk and compile it (make clean inplace). You can then add the 'src' directory to the PYTHONPATH and try running your programs with it. Any feedback is appreciated. Stefan From johnny at johnnydebris.net Wed May 31 10:44:33 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Wed, 31 May 2006 10:44:33 +0200 Subject: [lxml-dev] Bug in memory management (I guess ;) In-Reply-To: <447C9F1B.3070801@gkec.informatik.tu-darmstadt.de> References: <447C2B1F.2000505@johnnydebris.net> <447C9F1B.3070801@gkec.informatik.tu-darmstadt.de> Message-ID: <447D5771.1070906@johnnydebris.net> Stefan Behnel wrote: >Thanks a lot. Bug is fixed in the trunk. > > > I can confirm this: everything seems to work perfectly. Thanks again! 

Cheers,

Guido

From howe at carcass.dhs.org Wed May 31 10:58:28 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed, 31 May 2006 05:58:28 -0300
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
Message-ID: <1553372610.20060531055828@carcass.dhs.org>

Hello Stefan,

Wednesday, May 31, 2006, 5:17:58 AM, you wrote:

> Hi all,

> I know, we already had our last beta and 1.0 is pretty close, but I decided to
> do some more major clean ups anyway and found a lot of code duplication
> between functions that iterate over the tree.

> I replaced that by a set of BEGIN/END C macros, which has three effects:

> * Tree iteration now uses the same code throughout lxml
> * It's much shorter and thus (hopefully) easier to understand
> * It's about 20-30% faster than the previous recursive implementations

> I took another somewhat experimental step and removed the _NodeBase superclass
> from _Attrib to make it a plain Python object that sits on an _Element. This
> allowed for another major clean up and a huge simplification of the proxy
> code, since there is only one proxy type left.

> I know that these changes impact pretty much every part of lxml and I hope
> they did not introduce any of those beautiful little bugs which I already
> squeezed so many of recently. Since they simplify and merge a lot of code, I
> hope they also make the code easier to maintain.

> According to the test cases, everything works nicely as before and I will give
> valgrind a run to see if there's anything suspicious left.

> It would be a great help if I could get some feedback on the stability of the
> modified trunk version in other applications to see if the test cases speak
> the truth. To do this without installing it, checkout the trunk and compile it
> (make clean inplace). You can then add the 'src' directory to the PYTHONPATH
> and try running your programs with it.

> Any feedback is appreciated.

What about a beta 2? Will help test those new changes made recently,
which were not few. Do you have a planned date for the 1.0 release?

-- 
Best regards,
 Steve mailto:howe at carcass.dhs.org

From faassen at infrae.com Wed May 31 11:18:19 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 31 May 2006 11:18:19 +0200
Subject: [lxml-dev] improved valgrind suppressions for Python?
In-Reply-To: <447C6D16.5000206@gkec.informatik.tu-darmstadt.de>
References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com>
 <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de>
 <447C660F.6010505@infrae.com>
 <447C6D16.5000206@gkec.informatik.tu-darmstadt.de>
Message-ID: <447D5F5B.305@infrae.com>

Stefan Behnel wrote:
[snip]
>> Are you checking with valgrind, by the way? What do you do with the
>> suppression? Or is valgrind not an option for you and is this something I
>> should do regularly for you?
>
> I do from time to time, not regularly. The fix above slipped through since the
> last valgrind run. It was actually for fixing a few memory leaks, but I added
> one call too many.

How are the suppressions working for you?

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 15:04:28 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 31 May 2006 15:04:28 +0200
Subject: [lxml-dev] filtering namespaces in findall
Message-ID: <447D945C.1070209@gkec.informatik.tu-darmstadt.de>

Hi,

since I was rewriting the tree iterator anyway, I stumbled over the fact that
ElementTree's getiterator() can filter for any tag ("*"), but not for any tag
within a namespace ("{namespace}*"). Since this is just a cheap and
straightforward little extension to the node filter (and a pretty nice feature
to have), I added that to lxml and hope for inclusion in a future ElementTree
version also (or maybe a hint why Fredrik assumes this to be inappropriate).

Note that it's not currently supported in findall, though, as we use the same
implementation as ElementTree there.

Stefan

From ogrisel at nuxeo.com Wed May 31 15:23:03 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Wed, 31 May 2006 15:23:03 +0200
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <1553372610.20060531055828@carcass.dhs.org>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
Message-ID:

Steve Howe wrote:

>> It would be a great help if I could get some feedback on the stability of the
>> modified trunk version in other applications to see if the test cases speak
>> the truth. To do this without installing it, checkout the trunk and compile it
>> (make clean inplace). You can then add the 'src' directory to the PYTHONPATH
>> and try running your programs with it.
>
>> Any feedback is appreciated.

I just did a `cd lxml/trunk && svn up && make test && make bench` and all went
well and I'm now successfully using it for some simple parsing / xpath related
tests (on a 32bit linux box).

> What about a beta 2? Will help test those new changes made recently,
> which were not few. Do you have a planned date for the 1.0 release?

+1. Georges told me a couple of days ago that a "make test" was broken on Mac
OSX for 1.0.beta but he did not have time to report it on the list so it might be
a good idea to have a success report on MacOSX before releasing 1.0 final.

-- 
Olivier

From gracinet at nuxeo.com Wed May 31 15:32:04 2006
From: gracinet at nuxeo.com (Georges Racinet)
Date: Wed, 31 May 2006 15:32:04 +0200
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To:
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
Message-ID: <1ed0a857dfd4e7b88cb3228e3399d299@nuxeo.com>

On 31 May 2006, at 15:23, Olivier Grisel wrote:

> Steve Howe wrote:
>
>>> It would be a great help if I could get some feedback on the stability of the
>>> modified trunk version in other applications to see if the test cases speak
>>> the truth. To do this without installing it, checkout the trunk and compile it
>>> (make clean inplace). You can then add the 'src' directory to the PYTHONPATH
>>> and try running your programs with it.
>>
>>> Any feedback is appreciated.
>
> I just did a `cd lxml/trunk && svn up && make test && make bench` and all went
> well and I'm now successfully using it for some simple parsing / xpath related
> tests (on a 32bit linux box).
>
>> What about a beta 2? Will help test those new changes made recently,
>> which were not few. Do you have a planned date for the 1.0 release?
>
> +1. Georges told me a couple of days ago that a "make test" was broken on Mac
> OSX for 1.0.beta but he did not have time to report it on the list so it might be
> a good idea to have a success report on MacOSX before releasing 1.0 final.

That's right. In fact, I hadn't enough time to check if my setup was
faulty or not.
I'll try again more thoroughly in a few days and keep the list posted
(if no one has done it before, of course).

> --
> Olivier

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 15:39:07 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 31 May 2006 15:39:07 +0200
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <1553372610.20060531055828@carcass.dhs.org>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
Message-ID: <447D9C7B.8010708@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> Wednesday, May 31, 2006, 5:17:58 AM, you wrote:
>> I know, we already had our last beta and 1.0 is pretty close, but I decided to
>> do some more major clean ups anyway [ tree iteration and proxy cleanup ]
>> I know that these changes impact pretty much every part of lxml and I hope
>> they did not introduce any of those beautiful little bugs which I already
>> squeezed so many of recently. Since they simplify and merge a lot of code, I
>> hope they also make the code easier to maintain.
>
>> It would be a great help if I could get some feedback on the stability of the
>> modified trunk version in other applications to see if the test cases speak
>> the truth. To do this without installing it, checkout the trunk and compile it
>> (make clean inplace). You can then add the 'src' directory to the PYTHONPATH
>> and try running your programs with it.
>
> What about a beta 2? Will help test those new changes made recently,
> which were not few.

Most of them were bug fixes that are now checked by test cases and confirmed by
those who found them. Apart from the two things I mentioned above, there are
very few new features.
It's the first time since 0.8 that we have a longer fixed-bugs list than
new-features list in the ChangeLog. But that's maybe just because we had a beta
in-between... :)

So, I don't know, I kinda feel tempted to just throw out a 1.0 rather than a new
beta. Since both of the bigger changes above are mainly simplifications and code
cleanup, I don't really think it's worth bothering people with a second beta.

I also checked a few of my own lxml applications to see if they work as
expected and I was pretty shocked how fast they were now. I just clicked on
'load file' in "Slow" and the complete GUI was just there, instantaneously.
That's pretty baffling if you don't expect it. I mean, Slow uses all sorts of
weird XML/XPath/XSLT stuff to work on its internal XML model. It's not supposed
to be fast! :)

MathDOM's also working nicely and its little test suite is also just breezing by.

I really don't see that many things that could go wrong with 1.0.

> Do you have a planned date for the 1.0 release?

Yep, tomorrow. :)

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 16:00:56 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 31 May 2006 16:00:56 +0200
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <1ed0a857dfd4e7b88cb3228e3399d299@nuxeo.com>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
 <1ed0a857dfd4e7b88cb3228e3399d299@nuxeo.com>
Message-ID: <447DA198.7030009@gkec.informatik.tu-darmstadt.de>

Hi George,

Georges Racinet wrote:
> On 31 May 2006, at 15:23, Olivier Grisel wrote:
>> Steve Howe wrote:
>>> What about a beta 2? Will help test those new changes made recently,
>>> which were not few.
>>
>> +1.
>> Georges told me a couple of days ago that a "make test" was broken on Mac
>> OSX for 1.0.beta but he did not have time to report it on the list so it might be
>> a good idea to have a success report on MacOSX before releasing 1.0 final.
>
> That's right. In fact, I hadn't enough time to check if my setup was
> faulty or not. I'll try again more thoroughly in a few days and keep the
> list posted (if no one has done it before, of course).

Ok, I guess that's a reason to postpone. If we have a 1.0, it should compile as
nicely as possible on the broadest possible range of machines. It's a pretty
crappy thing to bother users with a bug-fix version that only fixes a
compilation bug on one platform...

Admittedly, I made a couple of changes in setup.py, e.g. for linking against
libexslt and to provide a fill-in space for a static linker configuration, so
maybe that broke something in your local setup.

Anyway, I'd be happy if you could give us some feedback soon, so that we can
get 1.0 out of the door. I'll send you a clean tar.gz so that you can try from
a real pre-release version.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 18:19:24 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 31 May 2006 18:19:24 +0200
Subject: [lxml-dev] filtering namespaces in findall
In-Reply-To: <447D945C.1070209@gkec.informatik.tu-darmstadt.de>
References: <447D945C.1070209@gkec.informatik.tu-darmstadt.de>
Message-ID: <447DC20C.90608@gkec.informatik.tu-darmstadt.de>

Hi all,

Stefan Behnel wrote:
> ElementTree's getiterator() can filter for any tag ("*"), but not for any tag
> within a namespace ("{namespace}*").
>
> Note that it's not currently supported in findall, though, as we use the same
> implementation as ElementTree there.

Actually, it *does* work in findall. ET's _elementpath.py nicely passes the
pattern on to getiterator(), so that lxml now supports it in getiterator() and
the find*() methods.
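
For illustration, the "{namespace}*" filter in the find*() methods might be
used roughly as in the minimal sketch below. The namespace URIs
(urn:example:a, urn:example:b), the element names and the helper
tags_in_namespace() are made up for the example and are not part of lxml. The
import falls back to the standard library's ElementTree, which, to my
knowledge, accepts the same wildcard in find paths as of Python 3.8.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib ElementTree (Python 3.8+) accepts the same
    # "{namespace}*" wildcard in find paths.
    import xml.etree.ElementTree as etree

# Hypothetical document that mixes two made-up namespaces.
XML = (
    '<root xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    '<a:x/><b:y/><a:z><a:w/></a:z>'
    '</root>'
)


def tags_in_namespace(root, namespace):
    """Return the tags of all descendants of `root` that are in `namespace`."""
    # ".//" searches all descendants; "{namespace}*" matches any local name
    # within that one namespace.
    pattern = ".//{%s}*" % namespace
    return [el.tag for el in root.findall(pattern)]


root = etree.fromstring(XML)
print(tags_in_namespace(root, "urn:example:a"))
```

Without the leading ".//", the same pattern only looks at direct children.
In current lxml releases the pattern can also be handed to the tree iterator
itself, e.g. root.iter("{urn:example:a}*"), iter() being the newer spelling
of getiterator().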

Stefan

From howe at carcass.dhs.org Wed May 31 18:55:57 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed, 31 May 2006 13:55:57 -0300
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <447D9C7B.8010708@gkec.informatik.tu-darmstadt.de>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
 <447D9C7B.8010708@gkec.informatik.tu-darmstadt.de>
Message-ID: <682780274.20060531135557@carcass.dhs.org>

Hello Stefan,

Wednesday, May 31, 2006, 10:39:07 AM, you wrote:

> Yep, tomorrow. :)

Ok, please send a message when you freeze the trunk so I can compile
the Windows and FreeBSD eggs.

-- 
Best regards,
 Steve mailto:howe at carcass.dhs.org

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed May 31 19:13:09 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 31 May 2006 19:13:09 +0200
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <682780274.20060531135557@carcass.dhs.org>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
 <447D9C7B.8010708@gkec.informatik.tu-darmstadt.de>
 <682780274.20060531135557@carcass.dhs.org>
Message-ID: <447DCEA5.5010906@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> Ok, please send a message when you freeze the trunk so I can compile
> the Windows and FreeBSD eggs.

It would be better to compile them from the source tgz. If you compile from an
SVN checkout, setup.py will pick up the SVN revision.

I'll upload the signed tgz to cheeseshop and announce it on the list, as
usual. I can also send a message to you, George and Olivier first.

Note that I'm not sure we'll really make it for tomorrow.

Stefan

From howe at carcass.dhs.org Wed May 31 19:35:35 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Wed, 31 May 2006 14:35:35 -0300
Subject: [lxml-dev] Faster iteration and deallocation
In-Reply-To: <447DCEA5.5010906@gkec.informatik.tu-darmstadt.de>
References: <447D5136.6040300@gkec.informatik.tu-darmstadt.de>
 <1553372610.20060531055828@carcass.dhs.org>
 <447D9C7B.8010708@gkec.informatik.tu-darmstadt.de>
 <682780274.20060531135557@carcass.dhs.org>
 <447DCEA5.5010906@gkec.informatik.tu-darmstadt.de>
Message-ID: <151346173.20060531143535@carcass.dhs.org>

Hello Stefan,

Wednesday, May 31, 2006, 2:13:09 PM, you wrote:

> Steve Howe wrote:
>> Ok, please send a message when you freeze the trunk so I can compile
>> the Windows and FreeBSD eggs.

> It would be better to compile them from the source tgz. If you compile from an
> SVN checkout, setup.py will pick up the SVN revision.

> I'll upload the signed tgz to cheeseshop and announce it on the list, as
> usual. I can also send a message to you, George and Olivier first.

> Note that I'm not sure we'll really make it for tomorrow.

Ok, whenever it happens :)

-- 
Best regards,
 Steve mailto:howe at carcass.dhs.org