From stefan_ml at behnel.de Thu May 1 12:15:33 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 01 May 2008 12:15:33 +0200 Subject: [lxml-dev] lxml 2.0.5 released Message-ID: <48199845.4@behnel.de> Hi all, lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series. Have fun, Stefan 2.0.5 (2008-05-01) Bugs fixed * Resolving to a filename in custom resolvers didn't work. * lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently. * Memory leak in Schematron with libxml2 >= 2.6.31. From klizhentas at gmail.com Thu May 1 20:14:19 2008 From: klizhentas at gmail.com (Alex Klizhentas) Date: Thu, 1 May 2008 22:14:19 +0400 Subject: [lxml-dev] Custom Elements question In-Reply-To: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> Message-ID: <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> Hi All, Got a question: I've extended the ElementBase object using the approach described in the tutorial, but SubElement does not work as desired: class NodeBase(etree.ElementBase): def append(self,child): print "aaa" return etree.ElementBase.append(self,child) etree.SubElement(root,"child") #no "aaa" printed OK, but when taking your code to the module: def SubElement(parent, tag, attrib={}, **extra): attrib = attrib.copy() attrib.update(extra) element = parent.makeelement(tag, attrib) parent.append(element) return element SubElement(root,"child") # "aaa" is here! and overriding def makeelement(self, tag, attrib): return Node(tag, attrib) in the NodeBase just does not help, Any advice will be appreciated, Alex -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080501/200b8175/attachment.htm From stefan_ml at behnel.de Thu May 1 20:28:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 01 May 2008 20:28:12 +0200 Subject: [lxml-dev] Custom Elements question In-Reply-To: <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> Message-ID: <481A0BBC.7040000@behnel.de> Hi, Alex Klizhentas wrote: > I've extended the ElementBase object using the approach described in the > tutorial, but SubElement does not work as desired: > > class NodeBase(etree.ElementBase): > def append(self,child): > print "aaa" > return etree.ElementBase.append(self,child) > > etree.SubElement(root,"child") #no "aaa" printed That's because SubElement() does not call .append(). > OK, but when taking your code to the module: > > def SubElement(parent, tag, attrib={}, **extra): > attrib = attrib.copy() > attrib.update(extra) > element = parent.makeelement(tag, attrib) > parent.append(element) > return element > > SubElement(root,"child") # "aaa" is here! As expected, as you call .append() explicitly here. > and overriding > def makeelement(self, tag, attrib): > return Node(tag, attrib) > > in the NodeBase just does not help, SubElement() does not call .makeelement() either. It's implemented in plain C. Could you explain a bit why you want to do this and how your .append() differs from the normal append code? 
Stefan

From klizhentas at gmail.com Thu May 1 21:11:38 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Thu, 1 May 2008 23:11:38 +0400
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <481A0BBC.7040000@behnel.de>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> <481A0BBC.7040000@behnel.de>
Message-ID: <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com>

Thanks for the comments,

The idea behind this is to allow the XML tree to notify observers when its contents change: when a node is added, removed or moved.

That's why I'm going to override the ElementBase members so that they notify observers when certain actions are performed.

Everything works fine, except that the useful SubElement function did not work as expected. Now you've clarified things.

Thanks,
Alex

2008/5/1 Stefan Behnel :

> Hi,
>
> Alex Klizhentas wrote:
> > I've extended the ElementBase object using the approach described in the
> > tutorial, but SubElement does not work as desired:
> >
> > class NodeBase(etree.ElementBase):
> >     def append(self,child):
> >         print "aaa"
> >         return etree.ElementBase.append(self,child)
> >
> > etree.SubElement(root,"child") #no "aaa" printed
>
> That's because SubElement() does not call .append().
>
>
> > OK, but when taking your code to the module:
> >
> > def SubElement(parent, tag, attrib={}, **extra):
> >     attrib = attrib.copy()
> >     attrib.update(extra)
> >     element = parent.makeelement(tag, attrib)
> >     parent.append(element)
> >     return element
> >
> > SubElement(root,"child") # "aaa" is here!
>
> As expected, as you call .append() explicitly here.
>
>
> > and overriding
> > def makeelement(self, tag, attrib):
> >     return Node(tag, attrib)
> >
> > in the NodeBase just does not help,
>
> SubElement() does not call .makeelement() either. It's implemented in
> plain C.
> Could you explain a bit why you want to do this and how your .append() > differs > from the normal append code? > > Stefan > -- Regards, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080501/0c565b12/attachment.htm From stefan_ml at behnel.de Fri May 2 08:49:39 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 02 May 2008 08:49:39 +0200 Subject: [lxml-dev] Custom Elements question In-Reply-To: <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> <481A0BBC.7040000@behnel.de> <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com> Message-ID: <481AB983.9020604@behnel.de> Alex Klizhentas wrote: >> Alex Klizhentas wrote: >>> I've extended the ElementBase object using the approach described in the >>> tutorial, but SubElement does not work as desired: >>> >>> class NodeBase(etree.ElementBase): >>> def append(self,child): >>> print "aaa" >>> return etree.ElementBase.append(self,child) >>> >>> etree.SubElement(root,"child") #no "aaa" printed >> That's because SubElement() does not call .append(). >> >> >>> OK, but when taking your code to the module: >>> >>> def SubElement(parent, tag, attrib={}, **extra): >>> attrib = attrib.copy() >>> attrib.update(extra) >>> element = parent.makeelement(tag, attrib) >>> parent.append(element) >>> return element > > The idea behind this is to allow the XML tree to notify observers when it's > contents are changed: the node is added, removed or moved. > > That's why I'm going to override the ElementBase members so that they will > notify observers on the certain actions performed. > > Everything works fine, except this usefult SubElement function that did not > work as expected, now you've clarified the things, Ah, sure. 
Then it's best to use a pure Python implementation of SubElement instead, as the one above. Stefan From stefan_ml at behnel.de Fri May 2 16:30:24 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 02 May 2008 16:30:24 +0200 Subject: [lxml-dev] Custom Elements question In-Reply-To: <481A0BBC.7040000@behnel.de> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> <481A0BBC.7040000@behnel.de> Message-ID: <481B2580.9070803@behnel.de> Hi, another bit of reasoning here. Stefan Behnel wrote: > Alex Klizhentas wrote: >> I've extended the ElementBase object using the approach described in the >> tutorial, but SubElement does not work as desired: >> >> class NodeBase(etree.ElementBase): >> def append(self,child): >> print "aaa" >> return etree.ElementBase.append(self,child) >> >> etree.SubElement(root,"child") #no "aaa" printed > > That's because SubElement() does not call .append(). [...] > SubElement() does not call .makeelement() either. It's implemented in plain C. One important reason is that this allows lxml.etree to append the new libxml2 node at the C level *before* the decision is taken which Python class should be used to represent it. This might have an impact on the class lookup if it considers the parental relation when taking its decision (lxml.objectify does that, for example). But that's the only difference I can see between etree.SubElement() and your Python implementation. And you could even work around it by doing something like this: def SubElement(parent, tag, attrib={}, **extra): attrib = attrib.copy() attrib.update(extra) element = parent.makeelement(tag, attrib) parent.append(element) del element return parent[-1] However, you might want to avoid that if you know you won't need it, e.g. when using the "namespace" or "default" lookup scheme. 
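
A minimal sketch of the pure Python route discussed above, written for current Python 3 and lxml rather than the Python 2 code in the thread. The `events` list is a hypothetical stand-in for real observer callbacks, and wiring `NodeBase` in through `ElementDefaultClassLookup` is an assumption about how the custom class gets installed (it corresponds to the "default" lookup scheme mentioned above).

```python
from lxml import etree

# Hypothetical observer log; a real application would call its observers here.
events = []


class NodeBase(etree.ElementBase):
    def append(self, child):
        # Record the change, then delegate to the normal append.
        events.append(("append", child.tag))
        super().append(child)


def SubElement(parent, tag, attrib=None, **extra):
    """Pure Python SubElement that goes through the overridable methods."""
    attrib = dict(attrib or {})  # avoids the shared mutable default argument
    attrib.update(extra)
    element = parent.makeelement(tag, attrib)
    parent.append(element)
    return element


# Make the parser hand out NodeBase instances for all elements.
parser = etree.XMLParser()
parser.set_element_class_lookup(etree.ElementDefaultClassLookup(element=NodeBase))
root = etree.fromstring("<root/>", parser)

child = SubElement(root, "child", name="x")
print(events)
```

Because this version calls `parent.append()` explicitly, the overridden method runs, which `etree.SubElement()` would bypass.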
Stefan From stefan_ml at behnel.de Fri May 2 19:16:34 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 02 May 2008 19:16:34 +0200 Subject: [lxml-dev] threading fixed :) Message-ID: <481B4C72.1040604@behnel.de> Hi, there has been a long-standing issue in the threading support in lxml, combined with the per-thread string hash table we use for libxml2. Here is a simple example of a sure crasher: ------------------------------- import threading import lxml.etree as et xml = "" main_root = et.XML("") def run_thread(): thread_root = et.XML(xml) main_root.append(thread_root[0]) del thread_root # deletes the document thread = threading.Thread(target=run_thread) thread.start() thread.join() print et.tostring(main_root) ------------------------------- This crashes, because the thread parses the XML fragment into its own dictionary and stores the tag name "threadtag" there. Then it appends the "threadtag" element to a tree in the main program, which uses a different dict. When it deletes the "thread_root", the document will be deleted as well, and the (ref-counted) thread dictionary that contains the string "threadtag" will be freed when the thread terminates. The main program then crashes when it accesses the no longer available tag name in the corrupted document. The solution I came up with today is actually quite simple. We have to traverse the subtree anyway to update the document references and to fix the namespace declarations. So it's only one step more to also fix the name pointers by looking them up in the target dictionary and re-assigning the names. This is only required when we really have two different dicts, which is easy to decide. So there isn't even a performance impact if you only use a single thread or if you do not move subtrees between threads. And the added overhead when you need this is really small. 
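
For anyone who wants to try this against a fixed build: the XML literals in the example above were lost in the archive, so the following is only a sketch that assumes a root element with a single `threadtag` child and an empty main document. The element name `root` is a guess; only `threadtag` is named in the text. It uses Python 3 syntax.

```python
import threading
from lxml import etree as et

# Assumed stand-ins for the XML literals missing from the archived example.
xml = "<root><threadtag/></root>"
main_root = et.XML("<root/>")


def run_thread():
    # Parse in the worker thread, then move the child into the main tree.
    thread_root = et.XML(xml)
    main_root.append(thread_root[0])
    del thread_root  # drops the last reference to the source document


thread = threading.Thread(target=run_thread)
thread.start()
thread.join()

result = et.tostring(main_root)
print(result)
```

On a version with the name fix-up described above, the moved element should serialise normally after the thread has ended.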
I will release a new beta of 2.1 soon that will have this change, and it would be very helpful if people who currently use threaded code that exchanges (i.e. deep copies) tree fragments between threads could check if this works for them (i.e. if code that crashes under 2.0 if you remove the deep copying works under 2.1). If it proves to fix the problem, I will backport it to 2.0 also. Read: the more feedback I get, the faster this will be fixed in 2.0. :) Stefan From klizhentas at gmail.com Fri May 2 19:21:19 2008 From: klizhentas at gmail.com (Alex Klizhentas) Date: Fri, 2 May 2008 21:21:19 +0400 Subject: [lxml-dev] Custom Elements question In-Reply-To: <481B2580.9070803@behnel.de> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> <481A0BBC.7040000@behnel.de> <481B2580.9070803@behnel.de> Message-ID: <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com> Thanks Stefan, All the nodes in that tree should have the same type, that's why the default class lookup scheme for parser works fine. BTW, I have one more question, to set the xml:id i use the following construct: def xml_id(v): # helper function to create name space attributes return {'{http://www.w3.org/XML/1998/namespace}id': v} and the following construct: N.child1("text",xml_id("some_id")) following the examples from the site. to get the id I use: class NodeBase(etree.ElementBase): ... def get_node_id(self,id): searched = self.find(".//*[@{ http://www.w3.org/XML/1998/namespace}id='%s']"%(id,)) if searched is None: raise NodeNotFoundError(id) return searched I have two questions: 1. what way is faster to get the element by Id? should I use find or xpath to achieve the better performance? 2. is there a way to set xml:id using xml - prefix? Thanks, Alex 2008/5/2 Stefan Behnel : > Hi, > > another bit of reasoning here. 
> > Stefan Behnel wrote: > > Alex Klizhentas wrote: > >> I've extended the ElementBase object using the approach described in > the > >> tutorial, but SubElement does not work as desired: > >> > >> class NodeBase(etree.ElementBase): > >> def append(self,child): > >> print "aaa" > >> return etree.ElementBase.append(self,child) > >> > >> etree.SubElement(root,"child") #no "aaa" printed > > > > That's because SubElement() does not call .append(). > [...] > > SubElement() does not call .makeelement() either. It's implemented in > plain C. > > One important reason is that this allows lxml.etree to append the new > libxml2 > node at the C level *before* the decision is taken which Python class > should > be used to represent it. This might have an impact on the class lookup if > it > considers the parental relation when taking its decision (lxml.objectify > does > that, for example). > > But that's the only difference I can see between etree.SubElement() and > your > Python implementation. And you could even work around it by doing > something > like this: > > def SubElement(parent, tag, attrib={}, **extra): > attrib = attrib.copy() > attrib.update(extra) > element = parent.makeelement(tag, attrib) > parent.append(element) > del element > return parent[-1] > > However, you might want to avoid that if you know you won't need it, e.g. > when > using the "namespace" or "default" lookup scheme. > > Stefan > > -- Regards, Alex -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080502/e5f1c6a6/attachment-0001.htm From stefan_ml at behnel.de Fri May 2 19:42:09 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 02 May 2008 19:42:09 +0200 Subject: [lxml-dev] Custom Elements question In-Reply-To: <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com> References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com> <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com> <481A0BBC.7040000@behnel.de> <481B2580.9070803@behnel.de> <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com> Message-ID: <481B5271.8000101@behnel.de> Hi, Alex Klizhentas wrote: > I have one more question, to set the xml:id i use the following construct: > > def xml_id(v): > # helper function to create name space attributes > return {'{http://www.w3.org/XML/1998/namespace}id': v} > > and the following construct: > > N.child1("text",xml_id("some_id")) > > following the examples from the site. > > to get the id I use: > > class NodeBase(etree.ElementBase): > ... > def get_node_id(self,id): > searched = self.find(".//*[@{ > http://www.w3.org/XML/1998/namespace}id='%s']"%(id,)) > if searched is None: > raise NodeNotFoundError(id) > return searched > > I have two questions: > > 1. what way is faster to get the element by Id? should I use find or xpath > to achieve the better performance? timeit will tell you that. But it really depends on the data. element.find() stops short after the first hit, so that's probably faster on average if the document is large. OTOH, XPath() is implemented in C and could easily beat the Python code behind find(".. at attr...") for smaller documents... Try this: find_id = etree.ETXPath( ".//*[@{http://www.w3.org/XML/1998/namespace}id=$id]") ... def get_node_id(self,id): el = find_id(self, id=id) > 2. is there a way to set xml:id using xml - prefix? No, but if you know you run single-threaded, you can reuse the attrib dict and just change the value. 
That's faster than recreating it each time. Stefan From stefan_ml at behnel.de Fri May 2 20:48:28 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 02 May 2008 20:48:28 +0200 Subject: [lxml-dev] lxml 2.1beta2 released Message-ID: <481B61FC.7050300@behnel.de> Hi all, I'm happy to announce the release of lxml 2.1 beta2. It features a couple of enhancements and fixes over the first beta. The main improvement is the much more robust threading support, which makes it a lot easier to move subtrees back and forth between threads. It is described in more detail here: http://permalink.gmane.org/gmane.comp.python.lxml.devel/3571 Please report back on the list (preferably in reply to the above thread) if you notice a difference to lxml 2.0 with your code. Have fun, Stefan 2.1beta2 (2008-05-02) Features added * All parse functions in lxml.html take a parser keyword argument. * lxml.html has a new parser class XHTMLParser and a module attribute xhtml_parser that provide XML parsers that are pre-configured for the lxml.html package. Bugs fixed * Moving a subtree from a document created in one thread into a document of another thread could crash when the rest of the source document is deleted while the subtree is still in use. * Passing an nsmap when creating an Element will no longer strip redundantly defined namespace URIs. This prevented the definition of more than one prefix for a namespace on the same Element. Other changes * If the default namespace is redundantly defined with a prefix on the same Element, the prefix will now be preferred for subelements and attributes. This allows users to work around a problem in libxml2 where attributes from the default namespace could serialise without a prefix even when they appear on an Element with a different namespace (i.e. they would end up in the wrong namespace). 
From mharper3 at uiuc.edu Sun May 4 03:17:49 2008 From: mharper3 at uiuc.edu (mharper3 at uiuc.edu) Date: Sat, 3 May 2008 20:17:49 -0500 (CDT) Subject: [lxml-dev] (no subject) Message-ID: <20080503201749.BHR33134@expms5.cites.uiuc.edu> Hi lxml-dev: I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal reproduction) code: import lxml.etree wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/ context = lxml.etree.iterparse(wiki_xml_filename, events=("end")) for action, elem in context: pass The crash usually occurs about halfway through the file (around 3,000,000) The same code runs on smaller mediawiki xml files (200 mb) without error. I only get this error for this very large xml file (in this case about 13gb uncompressed). I had no trouble parsing the same file with the python standard library sax parser, but it is much slower and I don't like its api. I'm using libxml2-2.6.32 (also used earlier versions), python 2.5.2, python-lxml 2.0.5 (also tried earlier versions), Kubuntu 8.04 with 2.6.24 kernel (also tested on opensuse 10.3 with earlier kernel). Some of the exceptions are MemoryErrors. The machine running the code has 4gb of ram. The kernel does not appear to significantly hit the swap during the run. 
Here are the errors:

*** glibc detected *** python: free(): invalid pointer: 0x08220a15 ***
Aborted

Also:

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/traceback.py", line 241, in <module>
    def print_last(limit=None, file=None):
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

...

and also (slightly different)

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/tempfile.py", line 33, in <module>
    from random import Random as _Random
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Sometimes I just get 'Segmentation fault' from the shell, and sometimes it just hangs indefinitely.
and finally (cStringIO):

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 36, in apport_excepthook
    from cStringIO import StringIO
ImportError: /usr/lib/python2.5/lib-dynload/cStringIO.so: failed to map segment from shared object: Permission denied

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Any direction on tracking down the source is greatly appreciated!

-- Marc

From stefan_ml at behnel.de Sun May 4 07:34:50 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 07:34:50 +0200
Subject: [lxml-dev] (no subject)
In-Reply-To: <20080503201749.BHR33134@expms5.cites.uiuc.edu>
References: <20080503201749.BHR33134@expms5.cites.uiuc.edu>
Message-ID: <481D4AFA.8020401@behnel.de>

Hi,

mharper3 at uiuc.edu wrote:
> I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal reproduction) code:
>
>
> import lxml.etree
>
> wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
> context = lxml.etree.iterparse(wiki_xml_filename, events=("end"))
> for action, elem in context:
>     pass
>
>
> The crash usually occurs about halfway through the file (around
> 3,000,000) The same code runs on smaller mediawiki xml files (200 mb)
> without error.
I only get this error for this very large xml file (in this > case about 13gb uncompressed). I had no trouble parsing the same file with > the python standard library sax parser, but it is much slower and I don't > like its api. > > Some of the exceptions are MemoryErrors. The machine running the code has > 4gb of ram. The kernel does not appear to significantly hit the swap during > the run. iterparse() builds a tree in memory, so parsing a 13gb file on a 4gb RAM machine will fail - *unless* you clean up the parts of the tree that you no longer need. Something like for action, elem in context: if elem.tag == "page": # handle page elem.clear() elif elem.tag in tag_names_of_ancestors_of_page_elements: elem.clear() might work for you. BTW, you can also parse the gzip compressed file directly, might even be faster. Stefan From stefan_ml at behnel.de Sun May 4 11:02:07 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 04 May 2008 11:02:07 +0200 Subject: [lxml-dev] [Fwd: Re: (no subject)] Message-ID: <481D7B8F.9060603@behnel.de> [Forwarding to the list ...] From: Stefan -- Thanks so much for the quick response. I did consider that the tree was being built in memory, but the documentation seems to suggest that is not the case. Specifically the language in the tutorial (http://codespeak.net/lxml/tutorial.html) in both the sections 'incremental parsing' and 'event-driven parsing' seem to suggest using iterparse to access without retaining the tree in memory. I see now that the documentation says otherwise for iterparse, as you pointed out. If you don't mind, why does the iterator retain the tree in memory? I would suspect otherwise from the 'natural' behavior of iterators/generators in general, though that may be an invalid assumption. (i.e. I would parse the entire tree into memory if I thought that I had enough memory to do so; otherwise I would _incrementally_ parse it.) 
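
The cleanup suggested earlier in this message can be sketched on a small in-memory document. This is only an illustration under stated assumptions: the `mediawiki` root name and the page contents are made up, `events` is passed as a one-element tuple `("end",)`, and `tag="page"` restricts the events to the elements of interest. Besides calling `clear()`, it also drops already-processed siblings, because `clear()` empties an element but leaves the (now empty) element attached to its parent.

```python
import io
from lxml import etree

# Tiny stand-in for the large dump; the element names are assumptions.
pages = b"".join(b"<page><title>p%d</title></page>" % i for i in range(5))
xml = b"<mediawiki>" + pages + b"</mediawiki>"

titles = []
root = None
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="page"):
    titles.append(elem.findtext("title"))  # handle the page
    elem.clear()                           # free the page's own content
    root = elem.getparent()
    # Remove earlier siblings so the parent does not keep collecting
    # empty <page/> elements.
    while elem.getprevious() is not None:
        del root[0]

print(titles)
```

Only elements that come before the current one are removed, so the parser never loses a node it still needs.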
More specifically, I don't want to ignore any parts of the xml file in this specific instance, so a ParserTarget is not the correct solution. Your suggestion to use clear() works for me; maybe it should be made explicit in the tutorial that memory is not cleared unless clear() is called. The only mention in the tutorial is iterparse "also allows to clear() or modify the content of an Element to save memory". My mistake was to assume that the 'used' elements would be freed without an explicit call to do so as the iterator progressed. Again, thank you for your quick reply! -- Marc From stefan_ml at behnel.de Sun May 4 11:02:49 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 04 May 2008 11:02:49 +0200 Subject: [lxml-dev] [Fwd: Re: (no subject)] Message-ID: <481D7BB9.7000201@behnel.de> [Forwarding to the list...] From: Also, adding elem.clear() into the loop still eventually leads to a memory error, just much later. This should be clearing every element, so I'm not quite sure if I understand what clear() actually does. Should I segment the file into smaller pieces so that the tree is unloaded as each piece finishes? I apologize if my questions are trivial. I appreciate your responses greatly. -- Marc From stefan_ml at behnel.de Sun May 4 12:18:42 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 04 May 2008 12:18:42 +0200 Subject: [lxml-dev] saving memory with iterparse() In-Reply-To: <481D7B8F.9060603@behnel.de> References: <481D7B8F.9060603@behnel.de> Message-ID: <481D8D82.8060503@behnel.de> Hi, Stefan Behnel wrote: > From: > Thanks so much for the quick response. I did consider that the tree was being > built in memory, but the documentation seems to suggest that is not the case. > Specifically the language in the tutorial > (http://codespeak.net/lxml/tutorial.html) in both the sections 'incremental > parsing' and 'event-driven parsing' seem to suggest using iterparse to access > without retaining the tree in memory. 
It actually says: """ two event-driven parser interfaces, one that generates parser events while building the tree (``iterparse``), and one that does not build the tree at all, and instead calls feedback methods on a target object in a SAX-like fashion. """ but I added a new example now that shows how to save memory. http://codespeak.net/lxml/tutorial.html#event-driven-parsing > If you don't mind, why does the > iterator retain the tree in memory? I would suspect otherwise from the > 'natural' behavior of iterators/generators in general, though that may be an > invalid assumption. [...] > My mistake was to assume that the > 'used' elements would be freed without an explicit call to do so as the > iterator progressed. The question is: how should iterparse() know when you no longer need a subtree? The end event for a parent always comes after the end events of all its children and you might still access the whole subtree when you handle the parent. > (i.e. I would parse the entire tree into memory if I > thought that I had enough memory to do so; otherwise I would _incrementally_ > parse it.) The docs actually use two terms: "incremental parsing" and "event-driven parsing". Incremental parsing is used for feeding data into the parser one chunk at a time, while event-driven parsing means you also get back one parser event at a time. If you have an idea how to present this better, I take patches: http://codespeak.net/svn/lxml/trunk/doc/tutorial.txt Stefan From stefan_ml at behnel.de Sun May 4 12:53:34 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 04 May 2008 12:53:34 +0200 Subject: [lxml-dev] parsing a large file with iterparse() In-Reply-To: <481D7BB9.7000201@behnel.de> References: <481D7BB9.7000201@behnel.de> Message-ID: <481D95AE.1030609@behnel.de> Hi, Stefan Behnel wrote: > From: > > Also, adding > > elem.clear() > > into the loop still eventually leads to a memory error, just much later. 
This > should be clearing every element, so I'm not quite sure if I understand what > clear() actually does. According to the docs: """ clear() Resets an element. This function removes all subelements, clears all attributes and sets the text and tail properties to None. """ So it does not remove the element itself. I don't know what your XML looks like, but if it's something like ... * a zillion and you handle the end event of the element and clear() it, you still end up with a tree that has a zillion empty children. I see two choices in this case. There is cElementTree, which has the same API and allows you to clear the root element. http://effbot.org/zone/element-iterparse.htm#incremental-parsing This does not work in lxml as you cannot delete elements that are still required by the tree traversal of the parser (i.e. parents and following siblings). But you can try this in lxml: for action, elem in context: if elem.tag == "page": # handle page elem.clear() # remove all previous siblings parent = elem.getparent() previous_sibling = elem.getprevious() while previous_sibling is not None: parent.remove(previous_sibling) previous_sibling = elem.getprevious() BTW, if you only look for "page" tags and do the sibling cleanup as above, you can just pass tag="page" to iterparse(). Stefan From stefan_ml at behnel.de Mon May 5 17:46:56 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 5 May 2008 17:46:56 +0200 (CEST) Subject: [lxml-dev] lxml - addition of argument to control namespace declaration serializtion In-Reply-To: References: Message-ID: <38095.194.114.62.38.1210002416.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, these things are best discussed on the list. Grimes, David wrote: > In 1.3.5-1.3.6 timeframe, there was a patch introduced to > _writeNodeToBuffer() from serialization.pxi which forced namespace > declarations from parent nodes to be serialized onto the sub-tree root > node. 
In general, and with respect to XML standards, this make a great > deal of sense (so you don't have prefixed elements/attributes without the > corresponding namespace declaration). > > But, the application I've been building essentially takes an XML document > and makes template-string blocks of text out of various sub-trees, to be > later combined back into a full document using __getitem__ substitution in > the form of "%(token)s" string formatting. > > The nsdecl patch of 1.3.5/6 causes interesting behaviour when the sub-tree > being rendered is done in, for example, a loop - one "formatting > operation" per iteration. Also interesting is when many such sub-trees > are combined to form a document which (in my case) we know will have the > declarations on the ultimate root node. What you mean is that we actually make a copy of a non-root node and then copy over the namespace declarations of the ancestors. You say "interesting behaviour". Does that refer to the performance overhead or is there a 'real' problem you see? Looking at the code now, I see some potential optimisations, so if it's just the performance, here's a (trunk) patch that should give a bit of relief. > So ... I've got a patch I'm using in my local build which adds a keyword > argument "nsdecl=True" to tostring(), tounicode() and tofilelike() - these > are all the places which make use of the _writeNodeToBuffer() machinery. > I can spin the patch against any 2.0.x or 2.1.x source tree. > > The argument defaults to True, to maintain backward compatibility, but can > be provided as False to get <= 1.3.4 behaviour. > > Would you consider accepting this patch? At first glance: no. I do not think there is general interest for a serialisation that is not ns well-formed. You seem to have a rather special use case here. I'm not even sure you have to do what you describe based on serialised XML fragments. You might be able to do something like that with subtrees. But that's very close to guessing. 
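
A small sketch of the behaviour under discussion, with a made-up namespace URI and element names: serialising a non-root element carries over the namespace declaration from its ancestor, so the fragment is namespace well-formed on its own.

```python
from lxml import etree

# The namespace URI and the element names are illustrative assumptions.
root = etree.fromstring(
    '<a:root xmlns:a="http://example.com/a"><a:child>text</a:child></a:root>'
)

# Serialise only the child; the declaration from the root comes along.
fragment = etree.tostring(root[0])
print(fragment)
```

Gluing many such fragments back together as strings therefore repeats the declaration on every fragment, which is the effect described in the original request.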
Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: copy-node-namespaces.patch Type: application/octet-stream Size: 786 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080505/711237cc/attachment.obj From kris at cs.ucsb.edu Tue May 6 03:32:56 2008 From: kris at cs.ucsb.edu (kris) Date: Mon, 05 May 2008 18:32:56 -0700 Subject: [lxml-dev] generative building of xml? Message-ID: <1210037576.13243.63.camel@loup.ece.ucsb.edu> I am generating, processing and eventually serializing several XML streams. I was wondering if this was possible to do with lxml? Here's the setup. I've got several databases generating XML content (which can be quite large), I really want to be able to process the database record progressively generating XML and sending out on its own stream. An aggregator/filter (elsewhere) will read the streams and parse them processing similar members and generate a new stream based on the combined streams.

    DB1    DB2   DB3     Core database
    XML    XML   XML     XML generation
    WS     WS    WS      delivery over a stream using generator
     |      |     |
     +------+-----+
           AGG           Parse and match incoming streams (iterparse)
           XML
           WS            send resulting merge as XML using generator.

So the questions: 1. Does anybody have a recipe to build a recursive generator using Element? 2. Given the above generator, is there any such thing as a generator version of etree.tostring? -- Kristian Kvilekval kris at cs.ucsb.edu http://www.cs.ucsb.edu/~kris w:805-636-1599 h:504-9756 From friedel at translate.org.za Tue May 6 11:09:47 2008 From: friedel at translate.org.za (F Wolff) Date: Tue, 06 May 2008 11:09:47 +0200 Subject: [lxml-dev] Error reporting not clear Message-ID: <1210064987.7179.30.camel@localhost> Hallo Stefan and other lxml people. I had a bug report which I traced to an invalid XML file. The error message given by the parser was however not optimally useful.
The file is available here (zipped): http://bugs.locamotion.org/attachment.cgi?id=132 and a description of the problem here: http://bugs.locamotion.org/show_bug.cgi?id=384 It might or might not be interesting to improve this error reporting, so I thought I'd mention it. Keep well Friedel From stefan_ml at behnel.de Tue May 6 18:14:51 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 6 May 2008 18:14:51 +0200 (CEST) Subject: [lxml-dev] Error reporting not clear In-Reply-To: <1210064987.7179.30.camel@localhost> References: <1210064987.7179.30.camel@localhost> Message-ID: <56049.194.114.62.38.1210090491.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, F Wolff wrote: > I had a bug report which I traced to an invalid XML file. The error > message given by the parser was however not optimally useful. The file > is available here (zipped): > http://bugs.locamotion.org/attachment.cgi?id=132 > > and a description of the problem here: > http://bugs.locamotion.org/show_bug.cgi?id=384 > > It might or might not be interesting to improve this error reporting, so > I thought I'd mention it. The error comes from libxml2 as is. You can check the error log to see if that is the only error that the parser reports, or if there are other errors that might be more important. http://codespeak.net/lxml/api.html#error-handling-on-exceptions If you feel that lxml selects the wrong message from the error log, please provide a list of errors as an example. The lxml version is also important in this context, as there were improvements in the not so distant past. Stefan From usernamenumber at gmail.com Wed May 7 15:13:04 2008 From: usernamenumber at gmail.com (Brad Smith) Date: Wed, 7 May 2008 09:13:04 -0400 Subject: [lxml-dev] Querying valid children of an element? Message-ID: Hello, I just discovered lxml and am pretty excited about it.
There is one thing I'm having trouble figuring out how to do, though, if it's even possible: I am writing a tool that translates xml tags mixed with a wiki-like shorthand into full xml. It would be helpful to be able to sanity-check the mix of explicit tags and implicit tags I'm deriving from the shorthand by querying our DTD along the lines: "Is element foo legal within element bar" Same for CDATA. Is this possible using lxml? If not, is it possible using anything else? The best I've been able to come up with so far is to assemble a tree of dummy nodes in the proposed order and then validate it, but this seems wasteful. Thanks in advance for any help offered, --Brad -- ~ Second Shift: An original, serialized audio adventure ~ http://www.secondshiftpodcast.com From stefan_ml at behnel.de Thu May 8 09:22:04 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 May 2008 09:22:04 +0200 Subject: [lxml-dev] generative building of xml? In-Reply-To: <1210037576.13243.63.camel@loup.ece.ucsb.edu> References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> Message-ID: <4822AA1C.30604@behnel.de> Hi, kris wrote: > I am generating, processing and eventually serializing > several XML streams. I was wondering if this was possible > to do with lxml? Probably, although lxml is not designed for pipelined XML processing (any better than SAX, that is). It also depends on how your XML looks like. If it's from a database, it's probably something simple like ... ... ... That shouldn't cause too many problems, you can use the (SAX-like) target parser to copy it into a simple Python container class, use that inside your program, merge all of those objects into a single stream at some point and then generate a new XML stream from that. > Here's the setup. I've got several databases > generating XML content (which can be quite large), I really want > to be able to process the database record progressively > generating XML and sending out on its own stream. 
> > An aggregator/filter (elsewhere) will read the streams > and parse them processing similar members and generate > a new stream based on the combined streams. > > DB1 DB2 DB3 Core database > XML XML XML XML genaration > WS WS WS delivery over a stream using generator A generator? Interesting. Why not just a file-like object? If the interface is a generator (yielding strings, I assume), then you will have to use the feed parser interface to copy the data into the parser, otherwise, you can just use one thread per DB connection and have it read and parse the data for you. > 2. Given the above generator, is there any such > thing as a generator version etree.tostring? Nothing keeps you from yielding "", followed by the serialised stream entries (call tostring() on each separately), followed by a "". Stefan From stefan_ml at behnel.de Thu May 8 09:33:17 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 May 2008 09:33:17 +0200 Subject: [lxml-dev] Querying valid children of an element? In-Reply-To: References: Message-ID: <4822ACBD.6020302@behnel.de> Hi, Brad Smith wrote: > I just discovered lxml and am pretty excited about it. :) > I am writing a tool that translates xml tags mixed with a wiki-like > shorthand into full xml. It would be helpful to be able to > sanity-check the mix of explicit tags and implicit tags I'm deriving > from the shorthand by querying our DTD along the lines: "Is element > foo legal within element bar" Same for CDATA. > > Is this possible using lxml? If not, is it possible using anything > else? You could define your grammar in a way that is easily usable for you in your program and then generate a DTD from that. > The best I've been able to come up with so far is to assemble a > tree of dummy nodes in the proposed order and then validate it, but > this seems wasteful. Why? Don't you expect your users to get it right most of the time anyway? Why don't you just assemble the complete result tree and validate that? 
Is your program working on the tree itself or some other data representation? This might be of interest: http://codespeak.net/lxml/element_classes.html Stefan From kris at cs.ucsb.edu Thu May 8 20:03:46 2008 From: kris at cs.ucsb.edu (kris) Date: Thu, 08 May 2008 11:03:46 -0700 Subject: [lxml-dev] generative building of xml? In-Reply-To: <4822AA1C.30604@behnel.de> References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> <4822AA1C.30604@behnel.de> Message-ID: <1210269826.12910.28.camel@loup.ece.ucsb.edu> On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote: > Hi, > Probably, although lxml is not designed for pipelined XML processing (any > better than SAX, that is). > > It also depends on how your XML looks like. If it's from a database, it's > probably something simple like > > > > ... > ... > > ... > > > That shouldn't cause too many problems, you can use the (SAX-like) target > parser to copy it into a simple Python container class, use that inside your > program, merge all of those objects into a single stream at some point and > then generate a new XML stream from that. > > > > Here's the setup. I've got several databases > > generating XML content (which can be quite large), I really want > > to be able to process the database record progressively > > generating XML and sending out on its own stream. > > > > An aggregator/filter (elsewhere) will read the streams > > and parse them processing similar members and generate > > a new stream based on the combined streams. > > > > DB1 DB2 DB3 Core database > > XML XML XML XML genaration > > WS WS WS delivery over a stream using generator > > A generator? Interesting. Why not just a file-like object? I was thinking of a generator because I am feeding this to a stream that works with/on generators .. The databases are returning a top-k queries as xml files. Each DB keeps generating its best hits as a stream the aggregator sorts them and send them to the client. 
I would like to propagate the query all the way to the component databases using generators to minimize the work each one does. > If the interface is a generator (yielding strings, I assume), then you will > have to use the feed parser interface to copy the data into the parser, > otherwise, you can just use one thread per DB connection and have it read and > parse the data for you. > > > > 2. Given the above generator, is there any such > > thing as a generator version etree.tostring? > > Nothing keeps you from yielding "", followed by the serialised stream > entries (call tostring() on each separately), followed by a "". Unfortunately it is a tree structure.. I would like to visit the tree in something like; yield "" yield ' ' yield ' ' yield ' ' ... yield ' > Stefan -- Kristian Kvilekval kris at cs.ucsb.edu http://www.cs.ucsb.edu/~kris w:805-636-1599 h:504-9756 From jeff at ocjtech.us Thu May 8 21:30:16 2008 From: jeff at ocjtech.us (Jeffrey Ollie) Date: Thu, 8 May 2008 14:30:16 -0500 Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4 Message-ID: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> Has anyone built lxml 2.0.5 on RHEL 4 or CentOS 4? When I submit it to the Fedora/EPEL buildsystem I get the following error: libxml/schematron.h: No such file or directory I don't have direct access to a RHEL/CentOS 4 box so I can't do much more debugging until I do get one set up. libxml2 is at version 2.6.16 in RHEL/CentOS 4. The full build log is here: http://buildsys.fedoraproject.org/logs/fedora-4-epel/38964-python-lxml-2.0.5-1.el4/ppc/build.log Jeff From usernamenumber at gmail.com Thu May 8 22:27:02 2008 From: usernamenumber at gmail.com (Brad Smith) Date: Thu, 8 May 2008 16:27:02 -0400 Subject: [lxml-dev] Querying valid children of an element? In-Reply-To: <20080508114453.13762360@mbook.local> References: <4822ACBD.6020302@behnel.de> <20080508114453.13762360@mbook.local> Message-ID: To clarify about what I'm doing.
The goal is to have a shorthand language (not entirely tag-based) that is easier for subject matter experts to learn than docbook, which can then be converted into full docbook once they've written a first draft. So, to illustrate one aspect of it, instead of writing foomaster example... $ foomaster [OPTIONS] They can write * foomaster example... ** $ foomaster [OPTIONS] As you can see, the translation process consists of not just converting asterisks into the appropriate combination of itemizedlists and listitems, but also protecting cdata within paras where necessary. In the first one, the interpreter sees that isn't allowed inside , which is its cue to try inserting a . is allowed within , so it does not insert a para. Making that determination is what I'm trying to find the best approach for. Currently I use a function like this:

    def validateAppend(parent,child):
        parent.append(child)
        if not dtd.validate(parent):
            dbg("Appending %s to %s failed DTD validation" % (child.tag,parent.tag))
            del(parent[-1])
            return False
        return True

This works but, like I said, is not terribly efficient, so I just wanted to see if there was another method for making the determination. --Brad On Thu, May 8, 2008 at 11:44 AM, Mike Meyer wrote: > On Thu, 08 May 2008 09:33:17 +0200 Stefan Behnel wrote: > >> > I am writing a tool that translates xml tags mixed with a wiki-like >> > shorthand into full xml. It would be helpful to be able to >> > sanity-check the mix of explicit tags and implicit tags I'm deriving >> > from the shorthand by querying our DTD along the lines: "Is element >> > foo legal within element bar" Same for CDATA. >> > >> > Is this possible using lxml? If not, is it possible using anything >> > else? >> >> You could define your grammar in a way that is easily usable for you in your >> program and then generate a DTD from that. > > Are you really using DTDs, and not using that as a catchall for the > various Schema languages?
> > If so, then you might consider switching to a modern schema > language. RelaxNG lets you write regular expressions for CDATA, which > ought to work with wiki-like "tags", and I wouldn't be surprised to > find that Schematron is turing complete. > > -- ~ Second Shift: An original, serialized audio adventure ~ http://www.secondshiftpodcast.com From stefan_ml at behnel.de Fri May 9 10:35:22 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 09 May 2008 10:35:22 +0200 Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4 In-Reply-To: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> References: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> Message-ID: <48240CCA.9040306@behnel.de> Hi, Jeffrey Ollie wrote: > Has anyone built lxml 2.0.5 on RHEL 4 or CentOS 4? When I submit it > to the Fedora/EPEL buildsystem I get the following error: > > libxml/schematron.h: No such file or directory > > I don't have direct access to a RHEL/CentOS 4 box so I can't do much > more debugging until I do get one set up. libxml2 is at version > 2.6.16 in RHEL/CentOS 4. That's too old anyway. lxml > 1.3.x requires libxml2 2.6.21 (although I think 2.0.x still states it works with 2.6.20, which the above error proves wrong...) Two choices: stay with lxml 1.3 or build your own libxml2 as well. Stefan From stefan_ml at behnel.de Fri May 9 10:47:14 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 09 May 2008 10:47:14 +0200 Subject: [lxml-dev] generative building of xml? 
In-Reply-To: <1210269826.12910.28.camel@loup.ece.ucsb.edu> References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> <4822AA1C.30604@behnel.de> <1210269826.12910.28.camel@loup.ece.ucsb.edu> Message-ID: <48240F92.4070704@behnel.de> Hi, kris wrote: > On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote: >> If the interface is a generator (yielding strings, I assume), then you will >> have to use the feed parser interface to copy the data into the parser, >> otherwise, you can just use one thread per DB connection and have it read and >> parse the data for you. >> >>> 2. Given the above generator, is there any such >>> thing as a generator version etree.tostring? >> Nothing keeps you from yielding "", followed by the serialised stream >> entries (call tostring() on each separately), followed by a "". > > Unfortunately it is a tree structure.. I would like to visit the tree > in something like; > > yield "" > yield ' ' > yield ' ... > yield ' yield ' ' > yield ' ' > ... > yield ' I think that's a bad idea, as you lose semantics that you will need to recover in each generator step. My approach would be: let the databases write file-like streams (a socket or whatever), attach an iterparse() thread to each of them, copy the data of each entry to a container object (or maybe just use iterparse() with lxml.objectify), merge the container objects into a single stream in a thread-safe way and serialise the resulting stream of entries to an XML stream, maybe even manually, as I suggested. Stefan From jeff at ocjtech.us Fri May 9 13:54:36 2008 From: jeff at ocjtech.us (Jeffrey Ollie) Date: Fri, 9 May 2008 06:54:36 -0500 Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4 In-Reply-To: <48240CCA.9040306@behnel.de> References: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> <48240CCA.9040306@behnel.de> Message-ID: <935ead450805090454m47dc97a7kf8211f2b35ac3854@mail.gmail.com> On Fri, May 9, 2008 at 3:35 AM, Stefan Behnel wrote: > > That's too old anyway.
lxml > 1.3.x requires libxml2 2.6.21 (although I think > 2.0.x still states it works with 2.6.20, which the above error proves wrong...) > > Two choices: stay with lxml 1.3 or build your own libxml2 as well. Doh, my sleep deprived-brain thought that I had built an earlier version of lxml 2.0.x for RHEL 4, but I guess the last build was 1.3.6. Thanks for the wake up call! Jeff From bba at inbox.com Fri May 9 16:10:06 2008 From: bba at inbox.com (Ben) Date: Fri, 9 May 2008 06:10:06 -0800 Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option) Message-ID: Hello I'm writing some code to check whether our daily backups worked. Backup Exec stores its results in XML files. Sometimes bad characters - or maybe it is binary data - ends up in these XML files and then lxml chokes: C:\>python sb-lxml.py Traceback (most recent call last): File "sb-lxml.py", line 5, in Xml = etree.parse(XmlFileName) File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062) File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088) File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337) File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584) File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115) File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023) File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861) File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285) lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95 The offending line looks like this (not sure if the bad characters will make it through the email): Directory not found. 
Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\????\\VIC-ve\TT\miscellaneous and its subdirectories. Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):

    ##################################
    Xml = etree.parse(XmlFileName)
    print Xml.findtext(".//end_time")
    print Xml.findtext(".//engine_completion_status")
    ##############################

The code works fine unless there are invalid characters in the file, and I am happy for any suggestion, because the bit I'm interested in is always near the end of the xml file, and there should be a way to get it reliably regardless of the gunk elsewhere in the file (or that's what I hope) Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get this:

    C:\>python sb-lxml.py
    Traceback (most recent call last):
      File "sb-lxml.py", line 9, in
        print Xml.findtext(".//end_time")
      File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
      File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
    AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

    XmlFileName = r'c:/BEX03194.xml'
    parser = etree.XMLParser(recover=True)
    Xml = etree.parse(StringIO(XmlFileName), parser)
    print Xml.findtext(".//end_time")
    print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but can't see how I should be doing it. Any suggestion, including how to circumvent/work around the problem is most welcome.
From stefan_ml at behnel.de Fri May 9 16:42:16 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 09 May 2008 16:42:16 +0200 Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option) In-Reply-To: References: Message-ID: <482462C8.2020108@behnel.de> Hi, Ben wrote:

> Xml = etree.parse(XmlFileName)
> ##############################
> XmlFileName = r'c:/BEX03194.xml'
> parser = etree.XMLParser(recover=True)
> Xml = etree.parse(StringIO(XmlFileName), parser)

Not sure if this is just a "find-a-short-example" error, but you parse the filename, not the file here. This should read

    Xml = etree.parse(XmlFileName, parser)

> Also, I've tried the 'recover' parser option, but I'm doing something wrong, > because I get this: > > C:\>python sb-lxml.py > Traceback (most recent call last): > File "sb-lxml.py", line 9, in > print Xml.findtext(".//end_time") > File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext > (src/lxml/lxml.etree.c:15354) > File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot > (src/lxml/lxml.etree.c:14116) > AssertionError: ElementTree not initialized, missing root I guess that happens when the parser "recover"s from not finding any XML at all. Maybe we should still raise an exception in this case instead of returning an empty ElementTree. This is really an extreme case of broken data... Stefan From bba at inbox.com Fri May 9 17:15:31 2008 From: bba at inbox.com (Ben) Date: Fri, 9 May 2008 07:15:31 -0800 Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option) In-Reply-To: <482462C8.2020108@behnel.de> References: Message-ID: > Stefan wrote: > > Not sure if this is just a "find-a-short-example" error, but you parse > the filename, not the file here.
This should read > > Xml = etree.parse(XmlFileName, parser) (LOL) This is indeed a "find-a-short-example" error - which is what you use when you are a sysadmin. Now it works and gets me past the invalid characters too. Thanks for lxml From aryeh at bigfoot.com Fri May 9 18:26:22 2008 From: aryeh at bigfoot.com (Arye) Date: Fri, 9 May 2008 18:26:22 +0200 Subject: [lxml-dev] validation with multiple XSD files Message-ID: Hello all, I would like to do some schema validation and started with the instructions in: http://codespeak.net/lxml/dev/validation.html#xmlschema This all works great. Now I would like to extend this to an XSD file that includes many other files. In other words I have a directory of XSD files that I would like to use. The include statement looks like this (the included file is referenced by its name): ... ... some types defined in "base.xsd" are used here I am new to lxml so sorry in advance if the question does not make sense. Regards, Arye. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080509/eabd6f62/attachment.htm From jlovell at esd189.org Fri May 9 18:38:08 2008 From: jlovell at esd189.org (John Lovell) Date: Fri, 9 May 2008 09:38:08 -0700 Subject: [lxml-dev] validation with multiple XSD files In-Reply-To: References: Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A22D4@ZIRIA.esd189.org> Arye: I had a similar problem and this is how I handled it. http://messagesleuth.svn.sourceforge.net/viewvc/messagesleuth/trunk/xsd/xsd2one.py?view=markup I didn't ask the group so others may have a better or more full-featured approach. John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.esd189.org Together We Can ...
________________________________ From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Arye Sent: Friday, May 09, 2008 9:26 AM To: lxml-dev at codespeak.net Subject: [lxml-dev] validation with multiple XSD files Hello all, I would like to so some schema validation and started with the instructions in : http://codespeak.net/lxml/dev/validation.html#xmlschema This all works great. Now I would like to extend this to a XSD file that includes many other files. In other words I have a directory of XSD files that I would like to use. The include statement look like this (the included file is referenced by its name): ... ... some types defined in "base.xsd" are used here I am new to lxml so sorry in advance if the question does not make sense. Regards, Arye. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080509/818099fa/attachment-0001.htm From kumar.mcmillan at gmail.com Sat May 10 23:46:00 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sat, 10 May 2008 16:46:00 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? Message-ID: I know this has been discussed over and over but I'm writing to see if anyone has made a breakthrough yet. The problem of course is that Leopard's builtin libxml2 and libxslt are too old for lxml 2.0. You have to build libxml2 either from source or use a port. 
There is currently a problem with the libxml2 port, but the workaround is going fine for me: http://trac.macports.org/ticket/15230 (I know because postgres built just fine and I have some tests exercising psycopg2 as well) So after updating my libxml2 to 2.6.31 and libxslt to 1.1.23 and updating my $PATH so that the new xml2-config and xslt-config can be found, I can build lxml *without errors* but I see these warnings:

    $ sudo easy_install lxml-2.0.5.tgz
    Processing lxml-2.0.5.tgz
    Running lxml-2.0.5/setup.py -q bdist_egg --dist-dir /tmp/easy_install-3azY8e/lxml-2.0.5/egg-dist-tmp-t80esG
    Building lxml version 2.0.5.
    NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available.
    Using build configuration of libxslt 1.1.23
    ld: warning in /opt/local/lib/libxslt.dylib, file is not of required architecture
    ld: warning in /opt/local/lib/libexslt.dylib, file is not of required architecture
    ld: warning in /opt/local/lib/libxml2.dylib, file is not of required architecture
    [... and more like this ...]
    ...
    Finished processing dependencies for lxml==2.0.5

What doesn't make sense is these files seem fine to me:

    $ file -L /opt/local/lib/libxslt.dylib
    /opt/local/lib/libxslt.dylib: Mach-O dynamically linked shared library i386
    $ file -L /opt/local/lib/libexslt.dylib
    /opt/local/lib/libexslt.dylib: Mach-O dynamically linked shared library i386

I was having similar trouble like this on Tiger and I had test cases in my own test suite that would consistently segfault. On Leopard, those same test cases were *not* segfaulting but now I have some different test cases that are consistently segfaulting. The segfault looks like this in the crash log:

    Exception Type: EXC_BAD_ACCESS (SIGBUS)
    Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000008
    Crashed Thread: 0

    Thread 0 Crashed:
    0 libxml2.2.dylib 0x90d39419 xmlDictLookup + 360
    1 libxml2.2.dylib 0x025626e4 xmlXPathCompExprAdd + 212
    2 libxml2.2.dylib 0x025709c6 xmlXPathCompPathExpr + 3910
    [etc....]
Setting my dyld path (like suggested in the docs, export DYLD_LIBRARY_PATH=/opt/local/lib:/usr/lib) *does* make my test cases run without segfault so I'm assuming what's happening is lxml is using the older dylibs at runtime. This is a really lame way to fix the problem! Specifically, my svn binaries do not like this dylib setting, producing errors like: $ svn ls dyld: lazy symbol binding failed: Symbol not found: _iconv_open Referenced from: /usr/lib/libaprutil-1.0.dylib Expected in: /opt/local/lib/libiconv.2.dylib [etc] (This is slightly odd since I included /usr/lib but whatever.) *sigh* Next, I tried doing a static build of lxml by setting STATIC_LIBRARY_DIRS = ['/opt/local/lib'] in setup.py and running: python setup.py bdist_egg --static --with-xml2-config=/opt/local/bin/xml2-config --with-xslt-config=/opt/local/bin/xslt-config I had to fiddle with gcc to get this to build but otherwise it built fine and installed ok but I did not see any difference. Still consistent segfaults that are fixed by setting the dyld path. Now I'm out of ideas. Does anyone have another suggestion? Until then I have a stupid bash file that I have to source anytime I want to work on lxml. -Kumar From stefan_ml at behnel.de Sun May 11 09:01:01 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 11 May 2008 09:01:01 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: Message-ID: <482699AD.4030905@behnel.de> Hi Kumar, you ask why this is so hard? Simple answer: because no-one has contributed a way so far to make it easier. We had lots of reports about stuff not working and almost as many work-arounds, but no-one came up with a patch that would allow building lxml reliably at least on a subset of Mac-OS systems. And I just cannot believe that there is no-one amongst the Mac-OS-X users who knows how to use distutils to build a binary extension. 
Or at least someone who knows how to build C code statically against a C library. From my POV, Mac-OS seems to lack three things that make this problem non-trivial. It doesn't have a standard package management system. Neither does it have something like the Linux Standard Base, which dictates where newly installed things belong. And it doesn't seem to support "rpath", which would allow a binary to say "I know where my dependencies come from". Or at least distutils don't support that on Mac. So everything I could try here on Linux to make it work better is bound to fail. Kumar McMillan wrote: > I know this has been discussed over and over but I'm writing to see if > anyone has made a breakthrough yet. The problem of course is that > Leopard's builtin libxml2 and libxslt are too old for lxml 2.0. You > have to build libxml2 either from source or use a port. [lots of important details skipped to keep this at a higher level for now] > Next, I tried doing a static build of lxml by setting > STATIC_LIBRARY_DIRS = ['/opt/local/lib'] in setup.py and running: > > python setup.py bdist_egg --static > --with-xml2-config=/opt/local/bin/xml2-config > --with-xslt-config=/opt/local/bin/xslt-config > > I had to fiddle with gcc to get this to build but otherwise it built > fine and installed ok but I did not see any difference. Still > consistent segfaults that are fixed by setting the dyld path. This is because the --static switch was made specifically for static building on Windows, which has even less support for package management or even half-decent software installation. It just doesn't support Mac-OS as no-one ever told me how to support it. If you want this to run, let's make a deal. Here is a patch (against the trunk, but should work with 2.0.x) that lets --static require setting the STATIC_*_DIRS variables only on Windows, which should result in reading the directories from xml2-config/xslt-config if the hard-coded setup is not provided.
Given your above example, this should be the right thing to do. Now, please look at the function "libraries()" in setupinfo.py and fix it up for Mac-OS-X (and for whatever sys.platform calls it) to find the correct static libraries in these directories. If you get it to run reliably on your system, just with your above command line, I'll make sure it gets into 2.0.6. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: setupinfo.py-static-non-windows.patch Type: text/x-patch Size: 1871 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080511/dd5990b2/attachment.bin From mwm-keyword-lxml.9112b8 at mired.org Sun May 11 20:48:04 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sun, 11 May 2008 14:48:04 -0400 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <482699AD.4030905@behnel.de> References: <482699AD.4030905@behnel.de> Message-ID: <20080511144804.69429df6@bhuda.mired.org> On Sun, 11 May 2008 09:01:01 +0200 Stefan Behnel wrote: > you ask why this is so hard? Simple answer: because no-one has contributed a > way so far to make it easier. Gee, I had no trouble at all doing this last week (the release of Oracle library bits for Intel OS-X means it's now desirable). I installed macports, did a self-update, then installed py25-lxml. It installed python2.5.2 and the versions of libxml2 and libxslt that were in macports as part of the process. Installing cx_Oracle after that was more work. > We had lots of reports about stuff not working and almost as many > work-arounds, but no-one came up with a patch that would allow building lxml > reliably at least on a subset of Mac-OS systems. And I just cannot believe > that there is no-one amongst the Mac-OS-X users who knows how to use distutils > to build a binary extension. Or at least someone who knows how to build C code > statically against a C library. 
I'm sorry, but my experience is that binary distributions make the problems *worse*, not better - at least if you require multiple different components to be installed. You have to make sure the components all agree about the builds of any libraries they have in common, and unless you have a coordinated build, that just doesn't happen. After all, I could build a binary distribution of lxml from macports, but to use it, you'd have to have the macports versions of python, libxml2 and libxslt. If you've got that, it's probably easier to install the macports version than it is to download and install whatever I might build. I could use ports to build a binary package with all those things in it - is there anyone who really wants that? I started working with lxml last year, when the latest version was 1.3.3. Since updating the software after deployment would be a traumatic operation (a single "instance" of the application uses about 40 cores spread across 10 systems and two SANs, and we typically run three instances), I wanted the latest stable versions of everything. I looked at five different systems, and on only two was getting that combination sane: OSX using macports, and FreeBSD, mostly because they had the latest versions in the ports system when I went to look. The three GNU/Linux systems either had old versions of Python, of the xml libraries, and lxml was either old or missing. So I wound up doing initial development on OSX and FreeBSD while we dealt with the GNU/Linux platform we were speced to run on. Grabbing binaries only half worked. Replacing the installed system tools on GNU/linux is a recipe for disaster, and our sysadmins correctly refused to do so. Which means the LSB is no help at all. The sysadmins found a binary build of python 2.5, and installed that. 
We then grabbed the lxml rpm from PyPI, and installed that - only it wouldn't run, because it had been built against a version of Python that was compiled with a python shared library, and the version we had hadn't been. I eventually wound up building everything - Python, libxml2, libxslt, lxml and cx_Oracle - by hand to run our installation on, and providing a carefully tailored environment to run things in. Which is what I did on OSX and FreeBSD, except their ports systems make building from sources trivial, and FreeBSD doesn't need the tailored environment. Updating those two systems is trivial. Updating the GNU/Linux systems is several days' worth of work just to get to the point where I have something to give to operations. > From my POV, Mac-OS seems to lack three things that make this problem > non-trivial. It doesn't have a standard package management system. Neither > does it have something like the Linux Standard Base, which dictates where > newly installed things belong. And it doesn't seem to support "rpath", which > would allow a binary to say "I know where my dependencies come from". Or at > least distutils don't support that on Mac. So everything I could try here on > Linux to make it work better is bound to fail. Providing a binary distribution for *any* system that includes libraries that are "too old" in the base system is going to be a major pain. Everyone runs into this on OSX, because they didn't update those libraries in 10.5 for some unknown reason (I've filed bug ID #5926693 at adc about this). Those of us building for corporate environments - where we always run on out of date platforms, because we can't get corporate approval to use a new one before it becomes out of date - run into it all the time. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org From kumar.mcmillan at gmail.com Mon May 12 00:49:18 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sun, 11 May 2008 17:49:18 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <482699AD.4030905@behnel.de> References: <482699AD.4030905@behnel.de> Message-ID: Stefan, thanks for all the info On Sun, May 11, 2008 at 2:01 AM, Stefan Behnel wrote: > From my POV, Mac-OS seems to lack three things that make this problem > non-trivial. It doesn't have a standard package management system. Neither > does it have something like the Linux Standard Base, which dictates where > newly installed things belong. And it doesn't seem to support "rpath", which > would allow a binary to say "I know where my dependencies come from". Or at > least distutils don't support that on Mac. So everything I could try here on > Linux to make it work better is bound to fail. I don't have experience building native OS X applications but I've done a little more research into the problem and I think it is specifically this: "/usr/lib/libxml2.2.dylib uses two-level namespace, meaning that the Foundation framework will always use this one instead of yours" -- from http://0xced.blogspot.com/2006/07/dealing-with-outdated-open-source-libs.html What is two-level namespacing? Good question. I haven't quite figured that out yet but as the blog post suggests, you can "flatten" it at runtime by setting DYLD_FORCE_FLAT_NAMESPACE=1 And, by golly, this actually works -- that is, with it set in my shell, my test cases that would otherwise segfault run smoothly. Also, this doesn't screw up my lib paths like setting DYLD_LIBRARY_PATH does (the conflict with subversion went away!). From more googling it does appear, however, that setting this var might confuse some apps that do rely on two-level namespacing.
So far *my* problems have gone away (well, besides this being a kludge) but I guess I'll have to keep an eye on it. The dyld manual didn't help me understand this any better: http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/dyld.1.html > > If you want this to run, let's make a deal. Here is a patch (against the > trunk, but should work with 2.0.x) that lets --static require setting the > STATIC_*_DIRS variables only on Windows, which should result in reading the > directories from xml2-config/xslt-config if the hard-coded setup is not > provided. Given your above example, this should be the right thing to do. Now, > please look at the function "libraries()" in setupinfo.py and fix it up for > Mac-OS-X (and for whatever sys.platform calls it) to find the correct static > libraries in these directories. If you get it to run reliably on your system, > just with your above command line, I'll make sure it gets into 2.0.6. I'm willing to do whatever I can to contribute a better Mac OS X build process for lxml. However, I'm not experienced with using ext_modules in python and am having a hard time following your suggestions. You say your patch removed the enforcement of STATIC_*_DIRS but that was never a problem. in fact, that seems to confuse gcc when building with --static since it produces orphaned -I args (no directory attached) Next, you suggest to adjust the sys.platform checks. sys.platform always equals "darwin" on OS X but where would I want to make adjustments? I don't understand what this is doing in libraries() : if sys.platform in ('win32',): libs = ['%s_a' % lib for lib in libs] if I add "darwin" to the list, I get the error: ld: library not found for -lxslt_a whereas -lxslt is the correct arg (just like on linux). In my /opt/local/lib dir I have libxslt.dylib, libxslt.la, libexslt.dylib, libexslt.la, etc. I tried changing the above list comprehension to generate .la names but that didn't work either (still said library not found). 
I'm still not clear on how to statically link the libxml libraries and that's the first step to solving the problem. If anyone has done this, please let me know and I'll have another go at it. Maybe I need to use libtool to produce static versions then link to those. More googling suggests it *is* possible ;) http://lists.apple.com/archives/Unix-porting/2006/Aug/msg00012.html gcc man pages are not helping me. -Kumar From kumar.mcmillan at gmail.com Mon May 12 01:00:26 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sun, 11 May 2008 18:00:26 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080511144804.69429df6@bhuda.mired.org> References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: On Sun, May 11, 2008 at 1:48 PM, Mike Meyer wrote: > On Sun, 11 May 2008 09:01:01 +0200 > Stefan Behnel wrote: > >> you ask why this is so hard? Simple answer: because no-one has contributed a >> way so far to make it easier. > > Gee, I had no trouble at all doing this last week (the release of > Oracle library bits for Intel OS-X means it's now desirable). I > installed macports, did a self-update, then installed py25-lxml. It > installed python2.5.2 and the versions of libxml2 and libxslt that > were in macports as part of the process. The build of lxml doesn't fail and you probably won't see any errors unless you are using xpath. In fact, running selftest.py after building passes for me (I'm not sure if that runs all tests or not) but I do get a consistent segfault in my program. Looking at the macport of py25-lxml I don't see any flags that would indicate they have accomplished statically linking the new libxml libs. I don't like to use ports of python modules because /opt/local/bin/python doesn't mix well with a Framework python installation from my experience. 
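A quick way to check which libxml2 and libxslt a given Python process actually loaded, sketched here purely as an illustration: recent lxml releases expose the compile-time and run-time library versions as tuples on lxml.etree (LIBXML_COMPILED_VERSION, LIBXML_VERSION and the LIBXSLT_* equivalents). Whether every 2.0.x build carries all four is an assumption, and version_report is a made-up helper name, not part of lxml.

```python
def version_report(name, compiled, runtime):
    """Describe whether the library version loaded at run time matches
    the version the extension module was compiled against.

    Both versions are tuples such as (2, 6, 32).
    """
    compiled, runtime = tuple(compiled), tuple(runtime)
    status = "match" if compiled == runtime else "MISMATCH"
    return "%s: compiled against %s, running with %s (%s)" % (
        name,
        ".".join(str(part) for part in compiled),
        ".".join(str(part) for part in runtime),
        status,
    )


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        # lxml is optional for this sketch; the helper above works without it.
        print("lxml is not importable in this interpreter")
    else:
        print(version_report("libxml2",
                             etree.LIBXML_COMPILED_VERSION,
                             etree.LIBXML_VERSION))
        print(version_report("libxslt",
                             etree.LIBXSLT_COMPILED_VERSION,
                             etree.LIBXSLT_VERSION))
```

A MISMATCH line on Leopard would be consistent with the system copy in /usr/lib being loaded instead of the newer build, although it does not by itself prove that this is what causes the segfaults.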
From mwm-keyword-lxml.9112b8 at mired.org Mon May 12 01:26:48 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sun, 11 May 2008 19:26:48 -0400 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: <20080511192648.28fcf02d@bhuda.mired.org> On Sun, 11 May 2008 18:00:26 -0500 "Kumar McMillan" wrote: > On Sun, May 11, 2008 at 1:48 PM, Mike Meyer wrote: > > On Sun, 11 May 2008 09:01:01 +0200 > > Stefan Behnel wrote: > > > >> you ask why this is so hard? Simple answer: because no-one has contributed a > >> way so far to make it easier. > > > > Gee, I had no trouble at all doing this last week (the release of > > Oracle library bits for Intel OS-X means it's now desirable). I > > installed macports, did a self-update, then installed py25-lxml. It > > installed python2.5.2 and the versions of libxml2 and libxslt that > > were in macports as part of the process. > > The build of lxml doesn't fail and you probably won't see any errors > unless you are using xpath. In fact, running selftest.py after > building passes for me (I'm not sure if that runs all tests or not) > but I do get a consistent segfault in my program. Well, we make fairly heavy use of xpath (we use it to extract millions of records/minute in our ETL system, plus provide default attributes in the xml config file), so if it's a problem, I'm sure I'll see it. The few tests I've run so far worked fine. Care to provide an example that breaks? > Looking at the macport of py25-lxml I don't see any flags that would > indicate they have accomplished statically linking the new libxml > libs. I don't like to use ports of python modules because > /opt/local/bin/python doesn't mix well with a Framework python > installation from my experience. 
That's always a problem when you start building your version of languages in the base system - you probably can't use the platform-specific modules that are in the base system's language installation. I can't get to any of the rpm-related python modules on RHEL with my custom python installed. Fortunately, I don't need access to either the rpm libraries or the mac Python frameworks in my applications. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org From kumar.mcmillan at gmail.com Mon May 12 02:23:09 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sun, 11 May 2008 19:23:09 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080511192648.28fcf02d@bhuda.mired.org> References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> <20080511192648.28fcf02d@bhuda.mired.org> Message-ID: On Sun, May 11, 2008 at 6:26 PM, Mike Meyer wrote: > > Well, we make fairly heavy use of xpath (we use it to extract millions > of records/minute in our ETL system, plus provide default attributes > in the xml config file), so if it's a problem, I'm sure I'll see > it. The few tests I've run so far worked fine. huh, yeah it does seem like you'd see a crash. Maybe the py25-lxml port gains some advantages from getting built within the macports environment somehow. > Care to provide an > example that breaks? unfortunately, I don't think I have one, not something that is decoupled from the app I'm working on anyway. The app I'm working on makes heavy use of lxml.html to spider through the web, uses xpath() here and there, and the test cases use xpaths for assertions. However, I see the segfault in strange places. For example, if I run all tests at once (I'm using nose) then I usually don't see a segfault. But if I run test cases by themselves I will generally see a segfault.
And if I do, it is a consistent segfault. Looking at the crash log I can see that it's on an xpath lookup (I posted this earlier). However, to make matters worse, the test cases I can trigger segfaults in generally do not seem to touch any of the xpath code :/ Nonetheless, all the workarounds I've mentioned stop the segfaults. From stefan_ml at behnel.de Mon May 12 10:41:13 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 10:41:13 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080511144804.69429df6@bhuda.mired.org> References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: <482802A9.9070809@behnel.de> Hi, Mike Meyer wrote: > On Sun, 11 May 2008 09:01:01 +0200 > Stefan Behnel wrote: > >> you ask why this is so hard? Simple answer: because no-one has contributed a >> way so far to make it easier. > > Gee, I had no trouble at all doing this last week (the release of > Oracle library bits for Intel OS-X means it's now desirable). I > installed macports, did a self-update, then installed py25-lxml. It > installed python2.5.2 and the versions of libxml2 and libxslt that > were in macports as part of the process. Installing cx_Oracle after > that was more work. > >> We had lots of reports about stuff not working and almost as many >> work-arounds, but no-one came up with a patch that would allow building lxml >> reliably at least on a subset of Mac-OS systems. And I just cannot believe >> that there is no-one amongst the Mac-OS-X users who knows how to use distutils >> to build a binary extension. Or at least someone who knows how to build C code >> statically against a C library. > > I'm sorry, but my experience is that binary distributions make the > problems *worse*, not better I wasn't talking about distributing binaries. I meant: someone has to provide a way to configure the compiler so that it builds lxml statically on Mac-OS. 
Stefan

From stefan_ml at behnel.de Mon May 12 11:04:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 12 May 2008 11:04:44 +0200
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To:
References: <482699AD.4030905@behnel.de>
Message-ID: <4828082C.3020202@behnel.de>

Hi,

Kumar McMillan wrote:
> I don't have experience building native OS X applications

See? That seems to be a general problem amongst Mac-OS users. If no-one using that platform knows how to build a C program, how am I supposed to know it?

> What is two-level namespacing?

*shrug*, I prefer an automatic static build on Mac-OS anyway.

> You say your patch removed the enforcement of STATIC_*_DIRS but that was
> never a problem.

It was, as it requires manual interaction by users that should only be required in stupid "who-needs-a-system-compiler-anyway" environments like Windows.

> in fact, that seems to confuse gcc when building
> with --static since it produces orphaned -I args (no directory
> attached)

It just disables the requirement for setting the variables. It doesn't configure anything so far. The config has to come from xml2-config and xslt-config.

> Next, you suggest to adjust the sys.platform checks. sys.platform
> always equals "darwin" on OS X

Ok, then the function will likely look something like this:

    def libraries():
        if sys.platform in ('win32', 'darwin'):
            libs = ['libxslt', 'libexslt', 'libxml2', 'iconv']
        else:
            libs = ['xslt', 'exslt', 'xml2', 'z', 'm']
        if OPTION_STATIC:
            if sys.platform in ('win32',):
                libs = ['%s_a' % lib for lib in libs]
            elif sys.platform in ('darwin',):
                libs = ['%s.a' % lib for lib in libs]
        if sys.platform in ('win32',):
            libs.extend(['zlib', 'WS2_32'])
        return libs

Minus some changes for libiconv and libz.

> but where would I want to make
> adjustments?
I don't understand what this is doing in libraries() : > > if sys.platform in ('win32',): > libs = ['%s_a' % lib for lib in libs] > > if I add "darwin" to the list, I get the error: > ld: library not found for -lxslt_a The static libraries are called xxx_a in Windows. If someone can figure out what they are called on Mac-OS, I can fill it in myself. > whereas -lxslt is the correct arg (just like on linux). In my > /opt/local/lib dir I have libxslt.dylib, libxslt.la, libexslt.dylib, > libexslt.la, etc. I tried changing the above list comprehension to > generate .la names but that didn't work either (still said library not > found). Hmmm, on Linux, the static libraries are called "libxml2.a" etc. Can you find anything like that on your system? Stefan From kumar.mcmillan at gmail.com Mon May 12 17:15:24 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Mon, 12 May 2008 10:15:24 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <4828082C.3020202@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> Message-ID: On Mon, May 12, 2008 at 4:04 AM, Stefan Behnel wrote: > Kumar McMillan wrote: > > I don't have experience building native OS X applications > > See? That seems to be a general problem amongst Mac-OS users. Most people "use" computers they don't build them. Your work is greatly appreciated! :) > > What is two-level namespacing? > > *shrug*, I prefer an automatic static build on Mac-OS anyway. me too, I think that would be the right solution. > > in fact, that seems to confuse gcc when building > > with --static since it produces orphaned -I args (no directory > > attached) > > It just disables the requirement for setting the variables. It doesn't > configure anything so far. The config has to come from xml2-config and > xslt-config. something is going wrong then with --static because I get "Python.h not found" errors and the gcc command looked something like this: gcc ... 
-I -I/path/to/python/headers notice the orphaned -I call where, afaict, STATIC_INCLUDE_DIRS was previously getting inserted. Just a theory. > Hmmm, on Linux, the static libraries are called "libxml2.a" etc. Can you find > anything like that on your system? OK, I dug up some more dirt. The problem with the macport of libxml2 is that it doesn't build static libraries. From the port file itself, I now see: --disable-static, doh! But, yeah, I think if I build my own with --disable-shared and then point to that dir as an include this might work. And I assume I will probably get a libxml2.a file out of that build. But is that a feasible end user solution? That is, I'm not convinced this will make great strides in solving the lxml runtime problem where it uses the wrong version of libxml2 / libxslt. Kumar From stefan_ml at behnel.de Mon May 12 18:26:44 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 18:26:44 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080512113934.6c774076@mbook.local> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> Message-ID: <48286FC4.6030309@behnel.de> Hi, Mike Meyer wrote: > Apple's official position is that static linking of > applications is unsupported. They don't provide static versions of any > of the system libraries. > > Likewise, macports doesn't provide static libraries for the libraries > it installs, and the docs don't hint at anyway to get it to do so. Great! Now that would have been too easy anyway, wouldn't it? :-/ Thanks for the infos. Now, anyone for a plan B? Stefan From kumar.mcmillan at gmail.com Mon May 12 18:37:35 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Mon, 12 May 2008 11:37:35 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? 
In-Reply-To: <48286FC4.6030309@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> Message-ID: On Mon, May 12, 2008 at 11:26 AM, Stefan Behnel wrote: > > Likewise, macports doesn't provide static libraries for the libraries > > it installs, and the docs don't hint at anyway to get it to do so. > > Great! Now that would have been too easy anyway, wouldn't it? :-/ > > Thanks for the infos. Now, anyone for a plan B? It looks to me like the typical way to do this in an OS X app is to compile your static libs then bundle them with your application (and as Mike pointed out, Apple does not recommend this). Obviously there is a ram penalty for that (the custom lib). I don't see lxml distributing static libs just for OS X :) The best thing I can think of is to get --static working for libxml2.a files and then I can submit to you the steps I took to build my static libs from source (assuming I can get that all to work). Would that be useful? If it proves too cumbersome I might just continue to use the DYLD_FORCE_FLAT_NAMESPACE var at runtime even though that's bound to bite me someday. Unlike Mike I am fortunate enough not to be using lxml in *production* on OS X ... yet also misfortunate enough to be the only one who sees segfaults :( K From stefan_ml at behnel.de Mon May 12 20:39:57 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 20:39:57 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> Message-ID: <48288EFD.7060207@behnel.de> Kumar McMillan wrote: > The best thing I can think of is to get --static working for libxml2.a > files and then I can submit to you the steps I took to build my static > libs from source (assuming I can get that all to work). 
I think a buildout will help here, as previously proposed a couple of times.

http://pypi.python.org/pypi/zc.buildout
http://pypi.python.org/pypi/zc.recipe.cmmi

We can configure it to build only the static versions of libxml2 and libxslt, and then build against those.

------------------------
[libxml2]
recipe = zc.recipe.cmmi
url = http://ftp.gnome.org/pub/GNOME/sources/libxml2/2.6/libxml2-2.6.32.tar.gz
extra_options = --without-python --enable-shared --enable-static

[libxslt]
recipe = zc.recipe.cmmi
url = http://ftp.gnome.org/pub/GNOME/sources/libxslt/1.1/libxslt-1.1.22.tar.bz2
extra_options = --with-libxml-prefix=${buildout:directory}/parts/libxml2/
                --without-python --disable-shared --enable-static

[lxml]
recipe = zc.recipe.egg:custom
egg = lxml
include-dirs = ${buildout:directory}/parts/libxml2/include/libxml2
               ${buildout:directory}/parts/libxslt/include
library-dirs = ${buildout:directory}/parts/libxml2/lib
               ${buildout:directory}/parts/libxslt/lib
------------------------

lxml's setup.py would then need to be changed to automatically compile statically on the Mac-OS platform. Although maybe we should only do that if buildout is running (sys.modules?).

Stefan

From xkenneth at gmail.com Mon May 12 23:19:32 2008
From: xkenneth at gmail.com (Kenneth Miller)
Date: Mon, 12 May 2008 16:19:32 -0500
Subject: [lxml-dev] Looking for general insight.
Message-ID: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com>

All,

Any opinions on my problem would be greatly appreciated. I've got a large pre-defined XML schema, tons of data types etc. I want to be able to create python objects from the schemas and traffic these objects in and out of some sort of a database. Could I perhaps create these objects using lxml and extend lxml to use zope persistence?

Regards,
Kenneth Miller

From jlovell at esd189.org Mon May 12 23:54:24 2008
From: jlovell at esd189.org (John Lovell)
Date: Mon, 12 May 2008 14:54:24 -0700
Subject: [lxml-dev] Looking for general insight.
In-Reply-To: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com> References: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A22D8@ZIRIA.esd189.org> Kenneth: What you ask is not easy. However, I can point you at a few things that might be helpful. First a clarification. When you say, "create python objects from the schemas and traffic these objects in and out of some sort of a database" do you mean python classes or lxml trees filled with random data (or something else)? For python classes from XML schemas check out: http://www.rexx.com/~dkuhlman/generateDS.html For lxml trees filled with random data check out: http://messagesleuth.svn.sourceforge.net/viewvc/messagesleuth/trunk/xsd2 data.py?revision=6&view=markup Note: For this you will need to pay attention to the other MessageSleuth libraries it uses. Note: Realize that this supports a subset of XML Schema operators. Note: While I am proud of most of this code (and it consistently meets my needs) I believe randstr.py can generate invalid strings under certain conditions. Good luck, John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.esd189.org Together We Can ... -----Original Message----- From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Kenneth Miller Sent: Monday, May 12, 2008 2:20 PM To: lxml-dev at codespeak.net Subject: [lxml-dev] Looking for general insight. All, Any opinions on my problem would be greatly appreciated. I've got a large pre-defined XML schema, tons of data types etc. I want to be able to create python objects from the schemas and traffic these objects in and out of some sort of a database. Could I perhaps create these objects using lxml and extend lxml to use zope persistence? 
Regards, Kenneth Miller _______________________________________________ lxml-dev mailing list lxml-dev at codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev From mike at it-loops.com Tue May 13 07:58:05 2008 From: mike at it-loops.com (maru) Date: Tue, 13 May 2008 07:58:05 +0200 Subject: [lxml-dev] =?utf-8?q?_Re=3A__install_lxml_2=2E0=2E5_on_Mac_OS_X_L?= =?utf-8?q?eopard_-_why_is_itso_hard=3F?= In-Reply-To: <48288EFD.7060207@behnel.de> References: <48288EFD.7060207@behnel.de> Message-ID: On Mon, 12 May 2008 20:39:57 +0200, Stefan Behnel wrote: > lxml's setup.py would then need to be changed to automatically compile > statically on the Mac-OS platform. Although maybe we should only do that > if buildout is running (sys.modules?). Please leave the dynamic build as default option since it makes building universal libraries so much easier. A static build if buildout is used would be better in my opinion. Kind regards, Michael From kumar.mcmillan at gmail.com Tue May 13 17:41:58 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Tue, 13 May 2008 10:41:58 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <48288EFD.7060207@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> <48288EFD.7060207@behnel.de> Message-ID: On Mon, May 12, 2008 at 1:39 PM, Stefan Behnel wrote: > Kumar McMillan wrote: > > The best thing I can think of is to get --static working for libxml2.a > > files and then I can submit to you the steps I took to build my static > > libs from source (assuming I can get that all to work). > > I think a buildout will help here, as previously proposed a couple of times. ah, yes, excellent idea. I've started fiddling with it and have libxml2/libxslt building static libs no problem; this might just work. 
It looks like zc.recipe.egg isn't going to cut it though, as I can't find a way to pass in custom setup.py flags like --static (which I think is still needed to find libxml2.a, etc). I found collective.recipe.distutils which might work. I found some issues with it already but patching as I go. More as it happens - Kumar > > http://pypi.python.org/pypi/zc.buildout > http://pypi.python.org/pypi/zc.recipe.cmmi > > We can configure it to build only the static versions of libxml2 and libxslt, > and then build against those. > > ------------------------ > [libxml2] > recipe = zc.recipe.cmmi > url = http://ftp.gnome.org/pub/GNOME/sources/libxml2/2.6/libxml2-2.6.32.tar.gz > extra_options = --without-python --enable-shared --enable-static > > [libxslt] > recipe = zc.recipe.cmmi > url = http://ftp.gnome.org/pub/GNOME/sources/libxslt/1.1/libxslt-1.1.22.tar.bz2 > extra_options = --with-libxml-prefix=${buildout:directory}/parts/libxml2/ > --without-python --disable-shared --enable-static > > [lxml] > recipe = zc.recipe.egg:custom > egg = lxml > include-dirs = ${buildout:directory}/parts/libxml2/include/libxml2 > ${buildout:directory}/parts/libxslt/include > library-dirs = ${buildout:directory}/parts/libxml2/lib > ${buildout:directory}/parts/libxslt/lib > ------------------------ > > lxml's setup.py would then need to be changed to automatically compile > statically on the Mac-OS platform. Although maybe we should only do that if > buildout is running (sys.modules?). > > Stefan > From stefan_ml at behnel.de Tue May 13 07:25:40 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 13 May 2008 07:25:40 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? 
In-Reply-To:
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de>
Message-ID: <48292654.7030802@behnel.de>

Hi,

Kumar McMillan wrote:
> something is going wrong then with --static because I get "Python.h
> not found" errors and the gcc command looked something like this:
>
> gcc ... -I -I/path/to/python/headers

That's a bug. Here is a patch.

Stefan

=== setupinfo.py ==================================================================
--- setupinfo.py (revision 4206)
+++ setupinfo.py (local)
@@ -15,8 +15,11 @@
 PACKAGE_PATH = "src/lxml/"

 def env_var(name):
-    value = os.getenv(name, '')
-    return value.split(os.pathsep)
+    value = os.getenv(name)
+    if value:
+        return value.split(os.pathsep)
+    else:
+        return []

 def ext_modules(static_include_dirs, static_library_dirs, static_cflags):
     if CYTHON_INSTALLED:

From kumar.mcmillan at gmail.com Wed May 14 06:54:46 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Tue, 13 May 2008 23:54:46 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <48292654.7030802@behnel.de>
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de>
Message-ID:

On Tue, May 13, 2008 at 12:25 AM, Stefan Behnel wrote:
> > gcc ... -I -I/path/to/python/headers
>
> That's a bug. Here is a patch.

closer! thanks for that fix. That got all the -I includes in order.

Next up, I'm pretty sure I need to pass -static to libtool so that it honors the -lxml2.a (without -static, it says xml2.a -- lib not found).
My idea for this was:

    export LDFLAGS='-static'

and I got:

    gcc -arch i386 -arch ppc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -g -bundle -undefined dynamic_lookup -static build/temp.macosx-10.3-i386-2.5/src/lxml/lxml.etree.o -L/Users/kumar/src/lxml-2.0/parts/libxml2/lib -L/Users/kumar/src/lxml-2.0/parts/libxslt/lib -lxslt.a -lexslt.a -lxml2.a -lz.a -lm.a -o build/lib.macosx-10.3-i386-2.5/lxml/etree.so
    ld_classic: incompatible flag -bundle used (must specify "-dynamic" to be used)

so ... how do I stop it from adding -bundle? Ideas for another approach?

From stefan_ml at behnel.de Wed May 14 08:01:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 14 May 2008 08:01:56 +0200
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To:
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de>
Message-ID: <482A8054.4060603@behnel.de>

Hi,

Kumar McMillan wrote:
> On Tue, May 13, 2008 at 12:25 AM, Stefan Behnel wrote:
>> > gcc ... -I -I/path/to/python/headers
>>
>> That's a bug. Here is a patch.
>
> closer! thanks for that fix. That got all the -I includes in order.
>
> Next up, I'm pretty sure I need to pass -static to libtool so that it
> honors the -lxml2.a (without -static, it says xml2.a -- lib not
> found).

It's not "-lxml2.a" but a plain "/path/to/libxml2.a" as parameter to link it in just like the normal lxml.etree.o object file that was just compiled.
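The idea of handing the archive to the linker as a plain file can be sketched in Python. This is only an illustration under stated assumptions, not lxml's setupinfo.py: static_archives is a hypothetical helper that searches a list of directories for lib<name>.a and returns the full paths, and the default library names are simply the ones mentioned in this thread.

```python
import os


def static_archives(library_dirs, names=("xslt", "exslt", "xml2")):
    """Return the full path to lib<name>.a for every name in `names`.

    Directories are searched in the given order and the first hit wins.
    Raises IOError if an archive cannot be found in any directory.
    """
    archives = []
    for name in names:
        filename = "lib%s.a" % name
        for directory in library_dirs:
            candidate = os.path.join(directory, filename)
            if os.path.isfile(candidate):
                archives.append(candidate)
                break
        else:
            # The inner loop finished without finding the archive.
            raise IOError("static library %s not found in %r"
                          % (filename, list(library_dirs)))
    return archives
```

The resulting list could then be passed to a distutils Extension through its extra_objects argument, which takes files such as object files or static libraries that go to the linker as-is, instead of listing the names under libraries (which is what produces the -l flags).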
Stefan From x at jwp.name Tue May 13 19:35:30 2008 From: x at jwp.name (James William Pye) Date: Tue, 13 May 2008 17:35:30 +0000 (UTC) Subject: [lxml-dev] Help with an error message References: <477D0D9E.3090205@behnel.de> Message-ID: Stefan Behnel behnel.de> writes: > Konstantin Ryabitsev wrote: > > Traceback (most recent call last): > > File "foo.py", line 6, in > > elt = Element('foo').text = unistr > > File "etree.pyx", line 741, in etree._Element.text.__set__ > > File "apihelpers.pxi", line 344, in etree._setNodeText > > File "apihelpers.pxi", line 648, in etree._utf8 > > AssertionError: All strings must be XML compatible, either Unicode or ASCII > > > > Can someone suggest the best way to deal with this? > > My first question is: why do you need a '\x00' here? If you want to pass > binary data in XML, the best way is to use a safe encoding such as uuencode or > whatever. That should be part of your XML language spec/schema/... I just ran into this myself. In my case, having the NULL was not desired, rather I wanted to see a raw '\x00' to appear in the string(ie, the literal backslash sequence, *not* the NULL character). It would be nice if lxml would be more explicit about the problem: raise ValueError("NULL characters are not allowed in XML strings") That is: How I am supposed to derive that a NULL character was causing that AssertionError from the given string? 
(It wasn't until I found this message that I understood what I was doing wrong) From stefan_ml at behnel.de Wed May 14 18:16:26 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 14 May 2008 18:16:26 +0200 (CEST) Subject: [lxml-dev] Help with an error message In-Reply-To: References: <477D0D9E.3090205@behnel.de> Message-ID: <27846.194.114.62.39.1210781786.squirrel@groupware.dvs.informatik.tu-darmstadt.de> James William Pye wrote: > It would be nice if lxml would be more explicit about the problem: > > raise ValueError("NULL characters are not allowed in XML strings") > > That is: How I am supposed to derive that a NULL character was causing > that > AssertionError from the given string? (It wasn't until I found this > message that I understood what I was doing wrong) Ok, what about: "All strings must be XML compatible: Unicode or ASCII, no NULL bytes" ? Stefan From rogerpatterson at gmail.com Thu May 15 06:21:52 2008 From: rogerpatterson at gmail.com (roger patterson) Date: Wed, 14 May 2008 21:21:52 -0700 Subject: [lxml-dev] html entities and lxml.html.ElementSoup In-Reply-To: <482B7C87.10800@aya.yale.edu> References: <482B7C87.10800@aya.yale.edu> Message-ID: <1200dfce0805142121q16c7fa30t148830146c932f02@mail.gmail.com> Hi Viksit, What you typed was correct, except you have to note that lxml.html.soupparser.convert_tree(soup) returns a *list* of root elements, so you can't just do a lxml.etree.tostring() on the list. Depending on your HTML, choosing the first element will probably work. I have moved to the trunk now, so am working well with the new lxml.html.soupparser. But if you're stuck on that branch, then that work-around worked for me. Hope it works for you! cheers -Roger 2008/5/14 Viksit Gaur : > Hi there, > >>Roger Patterson wrote: >>> I'm getting an interesting situation. When using the very cool >>> ElementSoup add-on to lxml.html with certain source-html files that >>> already encode entities (eg. 
£), using the ElementSoup.parse() >>> messes up the entities. > > I'm running into the same problem. > >>It looks like it's not the parse(), but rather the serialisation. What >> >happens >>is that the entity references end up in the /text/ content, which is >> >clearly >>wrong as it leads to re-escaping of the references on the way out. > >>> What I'm currently doing to solve this is first parsing it with >>> BeautifulSoup(html, convertEntities="html"), then calling >>> ElementSoup.convert_tree(soup). This work-around works fine, but I >>> thought I'd bring it to your attention. > > Did you mean something of the sort, > > soup = BeautifulSoup(doc, convertEntities="html") > root = lxml.html.soupparser.convert_tree(soup) > > Because I get an error of the form: > > File "lxml.etree.pyx", line 2491, in lxml.etree.tostring > (src/lxml/lxml.etree.c:21792) > TypeError: Type 'list' cannot be serialized. > > > >>ElementSoup should do that for you. I fixed it on the trunk. > >>Stefan > > Unfortunately, I can't switch to lxml trunk. Would it be possible for you to > point me to the code change in lxml so I can patch it myself? > > Thanks and Cheers, > Viksit > From kumar.mcmillan at gmail.com Thu May 15 06:40:58 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Wed, 14 May 2008 23:40:58 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <482A8054.4060603@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de> Message-ID: Hello again On Wed, May 14, 2008 at 1:01 AM, Stefan Behnel wrote: >> Next up, I'm pretty sure I need to pass -static to libtool so that it >> honors the -lxml2.a (without -static, it says xml2.a -- lib not >> found). > > It's not "-lxml2.a" but a plain "/path/to/libxml2.a" as parameter to link it > in just like the normal lxml.etree.o object file that was just compiled. 
when I tried the plain paths it says library cannot be found. But I've discovered that building with -static is a dead end. It seems that Apple all but disallows static linking completely: http://developer.apple.com/qa/qa2001/qa1118.html HOWEVER after blood, sweat, and some tears (kidding) this is *all* I needed, it seems: export CFLAGS="-flat_namespace" ...no static builds libxml2 libs, no buildout recipe. I just set that and ran: python setup.py bdist_egg --with-xml2-config=/opt/local/bin/xml2-config --with-xslt-config=/opt/local/bin/xslt-config which uses the libxml2 and etc. installed by ports. In fact, as long as /opt/local/bin is on my path that should work without having to set paths (i.e. from easy_install). All my tests that were segfaulting are now passing. This appears to be the exact same behavior I got by setting DYLD_FORCE_FLAT_NAMESPACE at runtime but without the side affect of applying itself to anything else running in my shell ;) so, I'm thinking this is just two lines of code added to cflags() ... if sys.platform in ('darwin',): result.append('-flat_namespace') Do you want a patch that also includes the adjustments to --static when not windows? I don't think they are necessary anymore. Actually, using --static on darwin should probably raise an error "Apple says no" ;) -Kumar From stefan_ml at behnel.de Thu May 15 13:03:17 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 15 May 2008 13:03:17 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de> Message-ID: <482C1875.4070508@behnel.de> Hi Kumar, Kumar McMillan wrote: > so, I'm thinking this is just two lines of code added to cflags() ... > > if sys.platform in ('darwin',): > result.append('-flat_namespace') That's cool, thanks. I added it to the trunk and to the 2.0 branch. 
Let's see if Mac users get along with 2.0.6 then... Thanks for the effort! Stefan From kumar.mcmillan at gmail.com Fri May 16 04:47:11 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Thu, 15 May 2008 21:47:11 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <482C1875.4070508@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de> <482C1875.4070508@behnel.de> Message-ID: On Thu, May 15, 2008 at 6:03 AM, Stefan Behnel wrote: >> if sys.platform in ('darwin',): >> result.append('-flat_namespace') > > That's cool, thanks. I added it to the trunk and to the 2.0 branch. excellent > Let's see > if Mac users get along with 2.0.6 then... > > Thanks for the effort! sure, no problem. I researched this a bit more. It seems that people generally consider -flat_namespace a bad "hack," something to keep in mind. However, this seems to be because a few libraries take advantage of -twolevel_namespace (the default gcc behavior as of OS X 10.3 or something) so your binaries may cause other linked libs to behave wrong. The only specific example I could find of one that uses two level namespaces was OpenGL, but maybe there are others. Anyway, for lxml's purposes *I think* it is OK to use -flat_namespace since there aren't many other libs involved. Let's roll with it. 
This is what etree links to : $ otool -l path/to/lxml/etree.so [snip] Load command 7 cmd LC_LOAD_DYLIB cmdsize 56 name /opt/local/lib/libxslt.1.dylib (offset 24) time stamp 2 Wed Dec 31 18:00:02 1969 current version 3.23.0 compatibility version 3.0.0 Load command 8 cmd LC_LOAD_DYLIB cmdsize 56 name /opt/local/lib/libexslt.0.dylib (offset 24) time stamp 2 Wed Dec 31 18:00:02 1969 current version 9.13.0 compatibility version 9.0.0 Load command 9 cmd LC_LOAD_DYLIB cmdsize 56 name /opt/local/lib/libxml2.2.dylib (offset 24) time stamp 2 Wed Dec 31 18:00:02 1969 current version 9.32.0 compatibility version 9.0.0 Load command 10 cmd LC_LOAD_DYLIB cmdsize 52 name /opt/local/lib/libz.1.dylib (offset 24) time stamp 2 Wed Dec 31 18:00:02 1969 current version 1.2.3 compatibility version 1.0.0 Load command 11 cmd LC_LOAD_DYLIB cmdsize 52 name /usr/lib/libSystem.B.dylib (offset 24) time stamp 2 Wed Dec 31 18:00:02 1969 current version 88.3.6 compatibility version 1.0.0 ... so unless libSystem.B.dylib somehow would be tripped up by -flat_namespace I think all should be good. BTW, when I add those two lines all tests pass for me (they passed before but, hey, still a good sign) : Index: setupinfo.py =================================================================== --- setupinfo.py (revision 54771) +++ setupinfo.py (working copy) @@ -136,6 +136,8 @@ for possible_cflag in possible_cflags: if not possible_cflag.startswith('-I'): result.append(possible_cflag) + if sys.platform in ('darwin',): + result.append('-flat_namespace') return result def define_macros(): -Kumar From vik.list.nutch at gmail.com Fri May 16 04:58:41 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Thu, 15 May 2008 19:58:41 -0700 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? Message-ID: <482CF861.2010306@gmail.com> Hi all, I was wondering - what would be the most efficient method to access all the elements in the DOM tree, in some order, using lxml.etree? 
The methods I currently see in the docs return a class like ElementDepthfirstIterator or iterwalk, which have 2 issues - 1) The first has a flat representation of the tree, so I lose child/parent structure 2) Things like iterwalk do return "start" and "end" actions - but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree when iterwalk itself is running? Or perhaps there is some method I've missed completely? Quick note on what I'm trying to do - graphically represent the DOM structure of a page using a library like networkX.. Cheers, Viksit From stefan_ml at behnel.de Fri May 16 11:14:59 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 16 May 2008 11:14:59 +0200 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? In-Reply-To: <482CF861.2010306@gmail.com> References: <482CF861.2010306@gmail.com> Message-ID: <482D5093.7060303@behnel.de> Hi, Viksit Gaur wrote: > 2) Things like iterwalk do return "start" and "end" actions - but > instead of first doing an iterwalk and then parsing the results, is > there a better way to construct the tree when iterwalk itself is running? I don't understand what you mean here. Are you modifying the tree during the iteration? Or do you think of some kind of pipelining? Stefan From vik.list.nutch at gmail.com Fri May 16 11:28:39 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Fri, 16 May 2008 02:28:39 -0700 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? In-Reply-To: <482D5093.7060303@behnel.de> References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> Message-ID: <482D53C7.1060701@gmail.com> Hi, Stefan Behnel wrote: > Hi, > > Viksit Gaur wrote: >> 2) Things like iterwalk do return "start" and "end" actions - but >> instead of first doing an iterwalk and then parsing the results, is >> there a better way to construct the tree when iterwalk itself is running? 
> > I don't understand what you mean here. Are you modifying the tree during the > iteration? Or do you think of some kind of pipelining? Hmm. The problem I face was a method to assign a unique ID to each element on the page. Lets say I construct an iterwalk object. But, during this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID to each element). I'm not sure how to do this, without extending the etree.so file inside which iterwalk is implemented.. Cheers, Viksit > > Stefan > From stefan_ml at behnel.de Fri May 16 11:56:56 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 16 May 2008 11:56:56 +0200 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? In-Reply-To: <482D53C7.1060701@gmail.com> References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com> Message-ID: <482D5A68.2010107@behnel.de> Viksit Gaur wrote: > The problem I face was a method to assign a unique ID to each > element on the page. > > Lets say I construct an iterwalk object. But, during this phase, I would > like to not only build the tree, but also add some of my own information > to each node (such as a unique ID to each element). I still don't understand what you mean with "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that in case you are parsing. Stefan From Dennis.Benzinger at gmx.net Fri May 16 12:28:42 2008 From: Dennis.Benzinger at gmx.net (Dennis Benzinger) Date: Fri, 16 May 2008 12:28:42 +0200 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? 
In-Reply-To: <482D5A68.2010107@behnel.de> References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com> <482D5A68.2010107@behnel.de> Message-ID: <482D61DA.8040609@gmx.net> Am 16.05.2008 11:56, Stefan Behnel schrieb: > > Viksit Gaur wrote: >> The problem I face was a method to assign a unique ID to each >> element on the page. >> >> Lets say I construct an iterwalk object. But, during this phase, I would >> like to not only build the tree, but also add some of my own information >> to each node (such as a unique ID to each element). > > I still don't understand what you mean with "build the tree". You can't > construct a tree and run iterwalk at the same time. iterparse() will do that > in case you are parsing. > [...] I think he is talking about his own tree. The tree he is building to visualize the structure of the XML data. HTH, Dennis Benzinger From stefan_ml at behnel.de Fri May 16 12:46:38 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 16 May 2008 12:46:38 +0200 Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure? In-Reply-To: <482D61DA.8040609@gmx.net> References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com> <482D5A68.2010107@behnel.de> <482D61DA.8040609@gmx.net> Message-ID: <482D660E.4010303@behnel.de> Hi, Dennis Benzinger wrote: > Am 16.05.2008 11:56, Stefan Behnel schrieb: >> Viksit Gaur wrote: >>> The problem I face was a method to assign a unique ID to each >>> element on the page. >>> >>> Lets say I construct an iterwalk object. But, during this phase, I would >>> like to not only build the tree, but also add some of my own information >>> to each node (such as a unique ID to each element). >> I still don't understand what you mean with "build the tree". You can't >> construct a tree and run iterwalk at the same time. iterparse() will do that >> in case you are parsing. >> [...] > > I think he is talking about his own tree. 
The tree he is building to > visualize the structure of the XML data. Ok, but if it's that, then I don't understand why iterating over the tree and adding an id attribute to each node won't do the job. Stefan From cz at gocept.com Fri May 16 14:21:27 2008 From: cz at gocept.com (Christian Zagrodnick) Date: Fri, 16 May 2008 14:21:27 +0200 Subject: [lxml-dev] bug: objectify removes text on replace()? Message-ID: Hi, with lxml 2.0.4 I get text removed when I replace a node. The text after the replaced node vanishes.... ----------------------- import lxml.objectify import lxml.etree xml = lxml.objectify.fromstring( 'before bazafter baz') print lxml.etree.tostring(xml, pretty_print=True) print 50*'-' baz = xml['bar']['baz'] xml['bar'].replace(baz, lxml.objectify.E.holler()) print lxml.etree.tostring(xml, pretty_print=True) ----------------- Prints out: before bazafter baz -------------------------------------------------- before baz Thanks, -- Christian Zagrodnick ? cz at gocept.com gocept gmbh & co. kg ? forsterstra?e 29 ? 06112 halle (saale) ? germany http://gocept.com ? tel +49 345 1229889 4 ? fax +49 345 1229889 1 Zope and Plone consulting and development From stefan_ml at behnel.de Fri May 16 14:41:44 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 16 May 2008 14:41:44 +0200 Subject: [lxml-dev] bug: objectify removes text on replace()? In-Reply-To: References: Message-ID: <482D8108.1080804@behnel.de> Hi, Christian Zagrodnick wrote: > with lxml 2.0.4 I get text removed when I replace a node. The text > after the replaced node vanishes... You mean the .tail property of the node that you replace. http://codespeak.net/lxml/tutorial.html#elements-contain-text When you replace the node, it takes its tail with it. Stefan From cz at gocept.com Fri May 16 15:20:43 2008 From: cz at gocept.com (Christian Zagrodnick) Date: Fri, 16 May 2008 15:20:43 +0200 Subject: [lxml-dev] bug: objectify removes text on replace()? 
References: <482D8108.1080804@behnel.de> Message-ID: On 2008-05-16 14:41:44 +0200, Stefan Behnel said: > Hi, > > Christian Zagrodnick wrote: >> with lxml 2.0.4 I get text removed when I replace a node. The text >> after the replaced node vanishes... > > You mean the .tail property of the node that you replace. > > http://codespeak.net/lxml/tutorial.html#elements-contain-text > > When you replace the node, it takes its tail with it. Hrr. I'm too DOMified. Sorry :) -- -- Christian Zagrodnick ? cz at gocept.com gocept gmbh & co. kg ? forsterstra?e 29 ? 06112 halle (saale) ? germany http://gocept.com ? tel +49 345 1229889 4 ? fax +49 345 1229889 1 Zope and Plone consulting and development From stefan_ml at behnel.de Sun May 18 21:24:42 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 18 May 2008 21:24:42 +0200 Subject: [lxml-dev] first lessons learned while porting lxml to Py3 Message-ID: <4830827A.8050304@behnel.de> Hi, since we had a lengthy discussion on whether or not non-prefixed byte strings should automatically mutate into unicode strings when compiled for Py3, here are some initial lessons from my first attempt to port lxml. My first approach was (obviously) to import unicode_literals from __future__. This failed miserably, and even showed a couple of further bugs in Cython. :) I then chose the route to explicitly prepend unicode strings with 'u', as I wanted to keep my source compilable with older Cython versions that do not support the 'b' prefix. Currently, I have changed about 700 lines this way in a quick walk-through, and now I'm searching the places where this was the wrong thing to do. :) Most important evidence found: it's definitely non-trivial in a lot of places to decide what has to be unicode and what doesn't. It's non-trivial for me, and definitely not easier for Cython. One important place where I ended up with a lot of trivial changes are docstrings. 
Here, I would give an almost 100% chance that the user meant a unicode string if it's not prefixed. The remaining cases, e.g. where some external tool may require binary data for some kind of configuration or analysis are rare enough to just ignore them. For exactly this reason (I think), the doctest module in Py3 ignores docstrings that are not unicode. This might be a place where an automatic conversion might make sense (although, if it's the only place, that would be some funny string semantics...) Another important place are exception messages. Here, I'd give a real 100% for string literals, as their only purpose is to be human readable. A field where I really had to take care is when working with byte sequences. For example, lxml has a couple of places where strings are converted into UTF-8 and then passed into re.findall() or re.sub(). When substituting, the replacement string obviously has to be a byte string, too. I also found a bug in the Py3 re module when working with byte strings in one specific case. There are actually quite a number of places where strings are built as byte strings by combining and formatting literals, and then converted to a char*. Another place where automatic conversion must not happen. So, while still on the way, my first real-world impression meets my original opinion. There are definitely a lot of unprefixed strings in my own code that are meant to be unicode strings. Simply switching their type in Py3 will fix a lot of them, but at the same time break many others. The things that it fixes are the trivial parts: docstrings and exceptions. Almost everything else really were byte strings, and some were non-trivial things that need real work. If I can choose, I opt for going through this once and then having code that correctly distinguishes between byte strings and unicode strings in *both* Py2 and Py3, instead of additionally having to deal with changing string semantics for identical code in different environments. 
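The byte-sequence case mentioned above (strings converted to UTF-8 and then passed to re.sub()) can be illustrated with a minimal sketch. The function names and the tag-stripping pattern are made up for the example and are not taken from lxml; the point is only that under Python 3 the pattern, the replacement and the data all have to be byte strings.

```python
import re


def strip_tags_utf8(text):
    """Encode text to UTF-8, strip markup with a bytes regex, decode again."""
    data = text.encode("utf-8")
    # Pattern and replacement are both byte strings, matching the bytes input.
    cleaned = re.sub(b"<[^>]+>", b"", data)
    return cleaned.decode("utf-8")


def mixing_fails():
    """Return True if a unicode replacement is rejected for bytes data."""
    try:
        # The replacement "-" is a unicode string, the rest is bytes.
        re.sub(b"<[^>]+>", "-", b"<b>x</b>")
    except TypeError:
        return True
    return False
```

Under Python 2 an unprefixed replacement literal is itself a byte string, so code of this kind only breaks once unprefixed literals silently become unicode.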
We might think about a way to simplify the transition from unprefixed docstrings and exception messages to unicode strings. As it currently stands, everything else is definitely out of scope for any automatism. Stefan From stefan_ml at behnel.de Sun May 18 23:19:00 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 18 May 2008 23:19:00 +0200 Subject: [lxml-dev] first lessons learned while porting lxml to Py3 In-Reply-To: <4830827A.8050304@behnel.de> References: <4830827A.8050304@behnel.de> Message-ID: <48309D44.9080603@behnel.de> Sorry, wrong list. This was supposed to go to the Cython list... [but yes, there will be lxml for Python 3, and pretty soon] Stefan Behnel wrote: > since we had a lengthy discussion on whether or not non-prefixed byte strings > should automatically mutate into unicode strings when compiled for Py3, here > are some initial lessons from my first attempt to port lxml. [...] From matan at xipus.lxmldev.ninio.org Tue May 20 15:18:16 2008 From: matan at xipus.lxmldev.ninio.org (Matan Ninio) Date: Tue, 20 May 2008 13:18:16 +0000 (UTC) Subject: [lxml-dev] problem\bug in xpath compare() with text in tail Message-ID: This may be just my (limited) understanding of XPath and XML, but I'm getting a strange problem when I try to use XPath to search for specific strings in a file. Specifically, when I use "//*[contains(text(),'needle')]" to look for elements with "needle" in their text, it only works when the string appears in the "text" part, but not when it's in the "tail" part. So: e=etree.HTML("<html><body>inbody<div>text</div>tail</body></html>") e.xpath("//text()") ['inbody', 'text', 'tail'] e.xpath("//*[contains(text(),'text')]//text()") ['text'] ---- works fine, but e.xpath("//*[contains(text(),'tail')]//text()") [] ---- does not. Is it just that I need to use a different function/attribute for the tail (instead of text())? Is this a bug? Is there a workaround? Using lxml 2.0.5, via Mac "port". Thanks! From sam.kuper at uclmail.net Thu May 22 01:52:08 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Thu, 22 May 2008 00:52:08 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree Message-ID: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> Dear lovely lxmlves, Yesterday I tried to parse a large file, the Open Directory Project's links document, available here . The process went like this: 1) Unzipped the file using 7-zip. No errors reported. 2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file. 3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape. 4) Opened a command prompt, navigated to the directory containing the file, and started Python. 5) Entered: from lxml import etree 6) Entered: doc = open ('content.rdf.u8.xml', 'r') 7) Entered: docParsed = etree.parse(doc) Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that amount, with Windows Task Manager showing well under 10% CPU load from Python. Still, I figured it might take a while to parse, so I left it overnight.
In the morning, I found the following error message immediately underneath the command I'd entered in step 7: Traceback (most recent call last): File "", line 1, in File "lxml.etree.pyx", line 2520, in lxml.etree.parse File "parser.pxi", line 1331, in lxml.etree._parseDocument File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc File "parser.pxi", line 536, in lxml.etree._handleParseResult File "parser.pxi", line 478, in lxml.etree._raiseParseError lxml.etree.XMLSyntaxError: Memory allocation failed : building node I hope that's meaningful to someone, and that perhaps I might be able to get some suggestions about how to parse the file on my PC. Also, I was thinking of trying to parse the file on a virtual server that only has 64M of RAM. I don't mind if the VPS takes a day or two, as long as the code to make it work is fairly straightforward. So any suggestions about that option would be helpful too. Many thanks, Sam --- Macbook 2.13GHz with 2GB RAM Windows Vista Home Premium via Leopard BootCamp ActivePython 2.5.1 lxml installed via lxml-2.0.3-py2.5-win32.egg (this was the most up-to-date egg that was available last time I checked, which was about a week or two ago) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080522/45a75b4d/attachment.htm From sam.kuper at uclmail.net Thu May 22 02:17:05 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Thu, 22 May 2008 01:17:05 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> Message-ID: <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> Hmm, 64M might be unfeasibly low. Let's say 128M. Anyway, if I did go with this option, it would probably be on one of the cheaper of thesemachines (or something similar somewhere else), which seem like potentially an inexpensive resource for doing offline data-munging. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080522/b16101f9/attachment.htm From sam.kuper at uclmail.net Thu May 22 02:20:59 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Thu, 22 May 2008 01:20:59 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> Message-ID: <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> Gosh, this is turning into a really fragmented post; apologies. I meant to add to the first post that once parsed, my intention was to run a fairly simple XSL transform on the document, to extract a copy of each of the URLs it contains. Probably something like this:

ODP URLs

Thanks for your patience; I'm still relatively new at this stuff, Sam -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080522/195906c0/attachment.htm From stefan_ml at behnel.de Thu May 22 09:39:14 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 22 May 2008 09:39:14 +0200 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> Message-ID: <48352322.2030901@behnel.de> Hi, Sam Kuper wrote: > Dear lovely lxmlves, > Yesterday I tried to parse a large file, the Open Directory Project's links > document, available here . The > process went like this: > > 1) Unzipped the file using 7-zip. No errors reported. > 2) Renamed the file by adding a .xml extension, mainly so Windows (see my > spec below) would recognise it as an XML file. > 3) Had a look at the file in Oxygen's large document viewer. It took a few > minutes to load, but everything looked shipshape. > 4) Opened a command prompt, navigated to the directory containing the file, > and started Python. > 5) Entered: from lxml import etree > 6) Entered: doc = open ('content.rdf.u8.xml', 'r') > 7) Entered: docParsed = etree.parse(doc) lxml can parse from a gzipped XML file, no need to do step 1) and 6), just do docParsed = etree.parse('content.rdf.u8.xml.gz') or even docParsed = etree.parse('http://rdf.dmoz.org/rdf/content.rdf.u8.gz') BTW, if you do 6) it should read doc = open ('content.rdf.u8.xml', 'rb') mind the 'rb' at the end. > Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up > to around 96% (fair enough, it's a big document) and the Windows UI became > sluggish. It didn't crash, and the RAM usage stabilised around that amount, > with Windows Task Manager showing well under 10% CPU load from Python. 
That means your machine was heavily swapping. The in-memory tree of libxml2 is much larger than the serialised document itself, so if it doesn't fit into RAM, parsing the tree into memory will not make you happy, especially not with 64/128MB... > Traceback (most recent call last): > File "", line 1, in > File "lxml.etree.pyx", line 2520, in lxml.etree.parse > File "parser.pxi", line 1331, in lxml.etree._parseDocument > File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument > File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike > File "parser.pxi", line 850, in > lxml.etree._BaseParser._parseDocFromFilelike > File "parser.pxi", line 452, in > lxml.etree._ParserContext._handleParseResultDoc > File "parser.pxi", line 536, in lxml.etree._handleParseResult > File "parser.pxi", line 478, in lxml.etree._raiseParseError > lxml.etree.XMLSyntaxError: Memory allocation failed : building node Your operating system stopped allowing it to allocate more memory and it didn't even crash, it just gave you an exception. Isn't that cool? :) (although I wouldn't generally rely on that ...) Stefan From stefan_ml at behnel.de Thu May 22 10:19:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 22 May 2008 10:19:37 +0200 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> Message-ID: <48352C99.7040901@behnel.de> Hi, Sam Kuper wrote: > Gosh, this is turning into a really fragmented post; apologies. I meant to > add to the first post that once parsed, my intention was to run a fairly > simple XSL transform on the document, to extract a copy of each of the URLs > it contains. Probably something like this: > > > > > >

> [XSLT stylesheet stripped by the list archive; only its title, "ODP URLs", survives]
That is a problem that can be solved with extremely little memory. Take a look at the (SAX-like) target parser interface, which does not build a tree; your target object just receives callbacks while parsing: http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface Write a parser target class that keeps track of being inside or outside the "Topic" tag (start/end), and whenever you find a "link" tag while inside a "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib dictionary and write it into a hand-generated HTML stream like the one you used above. Stefan From vik.list.nutch at gmail.com Fri May 23 11:26:00 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Fri, 23 May 2008 02:26:00 -0700 Subject: [lxml-dev] DOM tree intersection/comparison? Message-ID: <48368DA8.5040602@gmail.com> Hi all, I was wondering if there's a currently implemented way to find out the common elements between 2 DOM trees? If not (I couldn't find any obvious classes or functions) - what would you recommend as the best method to do so? Using iterparse/iterwalk on two pages and then doing a side-by-side comparison looks like a naive method. Btw, when I say comparison, the basic aim is to figure out DOM element sequences or subtrees that are common. For instance, a page with
and
would have div->a as a common element, amongst other things.. Cheers, Viksit From stefan_ml at behnel.de Fri May 23 12:24:13 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 23 May 2008 12:24:13 +0200 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <48368DA8.5040602@gmail.com> References: <48368DA8.5040602@gmail.com> Message-ID: <48369B4D.9040204@behnel.de> Hi, Viksit Gaur wrote: > I was wondering if there's a currently implemented way to find out the > common elements between 2 DOM trees? [...] > >
> Have a look at lxml.html.diff, might come close to what you want. Stefan From vik.list.nutch at gmail.com Fri May 23 13:00:44 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Fri, 23 May 2008 04:00:44 -0700 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <48369B4D.9040204@behnel.de> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> Message-ID: <4836A3DC.5060807@gmail.com> Hi Stefan, Stefan Behnel wrote: > > Have a look at lxml.html.diff, might come close to what you want. Thanks for the prompt pointer - I don't think this meets my requirements however. I was looking for something which would basically give me an intersection of 2 trees that was a subtree.. I had a couple of further questions actually.. I see there's a DFS iterator for elements, but is there a way to do a breadth first iteration through the tree? I thought maybe I could do a comparison of elements at the same level (eg. html -> hr, a, div1, div2 etc) and (div1 ->a, hr, b, br) - sort of cluster these elements based on which level they are at in the tree. Looking for the source, it appears that this would be handled by the C kernel that lxml uses - which means any modifications to the base code must be made in C? Or maybe there's an easier method to do this? Cheers Viksit > > Stefan > > From mwm-keyword-lxml.9112b8 at mired.org Fri May 23 16:46:39 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Fri, 23 May 2008 10:46:39 -0400 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <4836A3DC.5060807@gmail.com> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> <4836A3DC.5060807@gmail.com> Message-ID: <20080523104639.689ab655@bhuda.mired.org> On Fri, 23 May 2008 04:00:44 -0700 Viksit Gaur wrote: > Thanks for the prompt pointer - I don't think this meets my requirements > however. I was looking for something which would basically give me an > intersection of 2 trees that was a subtree.. 
I've written some code to diff two xml trees. The real issue is that "the differences between two trees" isn't really well defined. I.e. - does order of children matter? Not for attribute nodes, and maybe not for other nodes, depending on the application. What about whitespace? Same answer - some of it yes, some of it depends on the application. Look at a modern diff's different options for whitespace handling, then fold in XML's newline handling to see how nasty that can get. FWIW, I'm not sure you get a "subtree" - more like forest. Or maybe it depends on exactly what you mean by "differences". I.e. - if an attribute changed value and that was the only difference, I wanted that attribute pulled out. I could see where you might define things so that the difference was the largest common subtree, or some such. > Or maybe there's an easier method to do this? I dealt with my issues by deciding on a canonical character string representation that gave me lots of lines, then feeding that representation to a string differ. The standard canonical forms don't quite work, because they (correctly) assume that order of attributes don't matter, but they will when you diff them with a string. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org From stefan_ml at behnel.de Sat May 24 11:55:39 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 11:55:39 +0200 Subject: [lxml-dev] Python 3 changes in lxml 2.1 Message-ID: <4837E61B.7000707@behnel.de> Hi, as it currently seems, lxml 2.1 will support Python 2.6 and Python 3 out of the box. While fixing up lxml 2.1beta to make this work, I found a couple of things that I needed to change. Here's an (incomplete) list, so that people can start shouting at me for breaking their code. 
;) One major thing that changed is that the API will now always return unicode strings for non byte stream data (.text, .tag, namespaces, ...), whereas it continues to return a byte string for plain ASCII data in Py2. Two things have become a bit quirky now. We currently return a subclass of ElementTree from XSLT, and you can call str(tree) on it to get the result. Returning a byte string here raises an exception in Py3, so that str(result) now behaves as unicode(result) did before, i.e. it returns a Python unicode string. To get the expected result as a byte string, people will have to use the new buffer protocol instead (memoryview&friends). This also means that bytes(xslt_result) will work as expected. Sadly, this means that there isn't a way to get the result in a portable way. I'm thinking about adding a .tobytes() method, but I'm not sure this is really helpful. The second quirk is serialisation to a unicode string. Instead of tostring(root, encoding=unicode) you now have to write tostring(root, encoding=str) so this requires source adaptation. Then again, this is (hopefully) a rare usage anyway and most Python code will require Py3 changes anyway. Haven't checked, but the 2to3 tool should normally take care of this. The ugliest problem I found so far is with doctests. There just isn't a way to write a Py2/Py3 portable doctest that accepts exactly a byte string or unicode strings as output, as both look different in Py2 and Py3. Also, exception names are now fully qualified, so that tracebacks look different. Tons of failing tests for nothing... Stefan From vik.list.nutch at gmail.com Sat May 24 12:36:36 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Sat, 24 May 2008 03:36:36 -0700 Subject: [lxml-dev] DOM tree intersection/comparison? 
In-Reply-To: <20080523104639.689ab655@bhuda.mired.org> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> <4836A3DC.5060807@gmail.com> <20080523104639.689ab655@bhuda.mired.org> Message-ID: <4837EFB4.9090609@gmail.com> Hey Mike, Mike Meyer wrote: > > I've written some code to diff two xml trees. The real issue is that > "the differences between two trees" isn't really well defined. I.e. - > does order of children matter? Not for attribute nodes, and maybe not > for other nodes, depending on the application. What about whitespace? > Same answer - some of it yes, some of it depends on the > application. Look at a modern diff's different options for whitespace > handling, then fold in XML's newline handling to see how nasty that > can get. Thanks for pointing out some interesting questions - I had thought of a couple, but I was counting on others not being too relevant to what I was doing.. Whitespace diffs are actually really bad - and I guess unicode is not going to sit pretty with the mix if I ever have to move to multi-lingual support. > > FWIW, I'm not sure you get a "subtree" - more like forest. Or maybe it > depends on exactly what you mean by "differences". I.e. - if an > attribute changed value and that was the only difference, I wanted > that attribute pulled out. I could see where you might define things > so that the difference was the largest common subtree, or some such. The latter was what I was aiming for. Mostly, I'm not trying to compute an intersection between 2 trees, as much as constructing a compressed representation of them. Cheers, Viksit From vik.list.nutch at gmail.com Sat May 24 12:43:14 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Sat, 24 May 2008 03:43:14 -0700 Subject: [lxml-dev] Efficient way to search element attribs? 
Message-ID: <4837F142.40901@gmail.com> Hi all, When building a tree, I add a particular attribute to each element (eg: ID), and was wondering if there are any methods I could use to search the tree for tags that have this attribute? I looked at find(), findall() and the xpath functions - but they seem to help only when you know the tag name itself. Of course, one way might be to do a manual depth first or breadth first search from the root downwards, recursively.. But that would become expensive pretty soon. Thanks, Viksit From stefan_ml at behnel.de Sat May 24 13:42:32 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 13:42:32 +0200 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <4836A3DC.5060807@gmail.com> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> <4836A3DC.5060807@gmail.com> Message-ID: <4837FF28.8090001@behnel.de> Hi, Viksit Gaur wrote: > I see there's a DFS > iterator for elements, but is there a way to do a breadth first > iteration through the tree? there's no API for it, but BFS shouldn't be that hard to do in Python. Actually, if deque() supported appending during iteration, it would be totally trivial. But there should be recipes on the web. Stefan From stefan_ml at behnel.de Sat May 24 13:48:11 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 13:48:11 +0200 Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: References: Message-ID: <4838007B.3050900@behnel.de> Hi, Matan Ninio wrote: > This may be a just my (limited) understanding of Xpath and XML, but i'm getting > a strange problem when I try to use xpath to search for specific strings in a > file. specifically, when I use "\\*[compare(text(),"needle")]" to look for > elements with "needle" in their text, it only works when the strings appears in > the "text" part, but not when its in the "tail" part. So: > > e=etree.HTML("inbody
text
tail") > > e.xpath("//text()") > ['inbody', 'text', 'tail'] > > e.xpath("//*[contains(text(),'text')]//text()") > ['text'] > > ---- works fine, but > > e.xpath("//*[contains(text(),'tail')]//text()") > [] > > ---- does not. > > is it just that I need to use a different function/attribute for the tail > (instead of text())? The tail text is not inside the element, so it's non-trivial to search for it in XPath. You can either iterate over all nodes and check .tail yourself, or do this (untested) to reduce the overhead on the Python side: for el in e.xpath("//*[contains(following-sibling::text(),'tail')]"): if 'tail' in el.tail: ... Do some testing to find out which is faster for your data. Stefan From stefan_ml at behnel.de Sat May 24 13:52:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 13:52:05 +0200 Subject: [lxml-dev] Efficient way to search element attribs? In-Reply-To: <4837F142.40901@gmail.com> References: <4837F142.40901@gmail.com> Message-ID: <48380165.1020403@behnel.de> Hi, Viksit Gaur wrote: > When building a tree, I add a particular attribute to each element (eg: > ID), and was wondering if there are any methods I could use to search > the tree for tags that have this attribute? > > I looked at find(), findall() and the xpath functions - but they seem to > help only when you know the tag name itself. You should read a bit on XPath, or at least on the ElementPath syntax. http://effbot.org/zone/element-xpath.htm The '*' wildcard allows you to search for any element tag. All iterators in lxml.etree support it, BTW. 
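A minimal sketch of the wildcard search described above, written for Python 3 and assuming the attribute is called ID as in the question. The sample document and the variable names are made up for illustration. The import falls back to the standard library's ElementTree, which shares this part of the API, if lxml is not installed, and attribute predicates in findall() are assumed to be supported by the installed version.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib module is an acceptable stand-in
    import xml.etree.ElementTree as etree

# Hypothetical sample tree; only some elements carry the ID attribute.
root = etree.fromstring(
    '<root ID="r1"><a ID="a1"><b/></a><c><d ID="d1"/></c></root>'
)

# ElementPath: any descendant, whatever its tag, that has an ID attribute.
tagged = root.findall(".//*[@ID]")
print([el.tag for el in tagged])

# The same idea with a plain iterator: '*' matches every element,
# including the root itself, and the attribute check happens in Python.
with_id = [el.get("ID") for el in root.iter("*") if el.get("ID") is not None]
print(with_id)
```

Note that ".//*" only looks at descendants, so the root element shows up in the iterator variant but not in the findall() result. With lxml itself, root.xpath("//*[@ID]") should be another way to express the same search.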
Stefan From sam.kuper at uclmail.net Sat May 24 15:25:52 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Sat, 24 May 2008 14:25:52 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <48352C99.7040901@behnel.de> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> Message-ID: <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> Dear Stefan, I've tried the method you've suggested below, but it isn't quite working for me. It may be that I've misunderstood your suggestion. I'll explain what I've tried. Here is my python program, extract_links_dmoz.py: from lxml import etree infile = open("content.example.xml", "r") infile.seek(0) outfile = open("output_test001.txt", "w") class EchoTarget(): def start(self, tag, attrib): if tag.endswith("xternalPage"): line = attrib["about"] if line != "": outfile.write(line+"\n") print line def close(self): return "closed!" parser = etree.XMLParser(target = EchoTarget()) result = etree.XML(infile.read(), parser) This uses the short, example RDF file at http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed content.example.xml), and works fine. When I view the output_test001.txt file, it contains one URL per line, which is exactly what I want for now. However, if I change the program to read content.rdf.u8.xml (i.e. the full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) instead of content.example.xml , then when I run the program I get the following error: Traceback (most recent call last): File "extract_links_dmoz.py", line 26, in result = etree.XML(infile.read(), parser) MemoryError Any help you (or others) can offer would be greatly appreciated. 
Many thanks, Sam 2008/5/22 Stefan Behnel : > Hi, > > Sam Kuper wrote: > > Gosh, this is turning into a really fragmented post; apologies. I meant > to > > add to the first post that once parsed, my intention was to run a fairly > > simple XSL transform on the document, to extract a copy of each of the > URLs > > it contains. Probably something like this: > > > > > > > > > > > >

> > [XSLT stylesheet stripped by the list archive; only its title, "ODP URLs", survives]
> > That is a problem that can be solved with extremely little memory. Take a > look > at the (SAX-like) target parser interface, which will not build a tree and > instead just receive callbacks while parsing: > > http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface > > Write a parser target class that keeps track of being inside or outside the > "Topic" tag (start/end), and whenever you find a "link" tag while inside a > "Topic" tag, look for a "{whatever-namespace}resource" attribute in the > attrib > dictionary and and write it into a hand-generated HTML stream like the one > you > used above. > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080524/2372fd88/attachment-0001.htm From sam.kuper at uclmail.net Sat May 24 16:19:32 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Sat, 24 May 2008 15:19:32 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805240718x128bafbfic80ff5c4addd755c@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> <4126b3450805240718x128bafbfic80ff5c4addd755c@mail.gmail.com> Message-ID: <4126b3450805240719r57ae9dc4r14f5e55a7cda888f@mail.gmail.com> To add to the message below, I've just tried running a much simpler program that doesn't call lxml to see if the memory error is a Python/environment one rather than being due to lxml. It turns out to be: >>> infile = open("content.rdf.u8.xml", "r") >>> print infile.read() Traceback (most recent call last): File "", line 1, in MemoryError Ok, so clearly Python isn't happy to read content.rdf.u8.xml in one go. 
The normal workaround for processing large text files piece by piece seems to be either to set a byte limit on how much is read at once, or to read the file line by line. However, neither of those will work in this case because they won't produce well-formed XML that the target parser interface can handle (correct me if I'm wrong). I'm sure there must be a fairly easy solution to this, but it's eluding me. All assistance greatly appreciated! Sam 2008/5/24 Sam Kuper : > Dear Stefan, > > I've tried the method you've suggested below, but it isn't quite working > for me. It may be that I've misunderstood your suggestion. I'll explain what > I've tried. Here is my python program, extract_links_dmoz.py: > > from lxml import etree > infile = open("content.example.xml", "r") > infile.seek(0) > outfile = open("output_test001.txt", "w") > class EchoTarget(): > def start(self, tag, attrib): > if tag.endswith("xternalPage"): > line = attrib["about"] > if line != "": > outfile.write(line+"\n") > print line > def close(self): > return "closed!" > parser = etree.XMLParser(target = EchoTarget()) > result = etree.XML(infile.read(), parser) > > This uses the short, example RDF file at > http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed > content.example.xml), and works fine. When I view the output_test001.txt > file, it contains one URL per line, which is exactly what I want for now. > > However, if I change the program to read content.rdf.u8.xml (i.e. the > full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz) > instead of content.example.xml , then when I run the program I get the > following error: > > Traceback (most recent call last): > File "extract_links_dmoz.py", line 26, in > result = etree.XML(infile.read(), parser) > MemoryError > > Any help you (or others) can offer would be greatly appreciated. 
> > Many thanks, > > Sam > > 2008/5/22 Stefan Behnel : > >> Hi, >> >> Sam Kuper wrote: >> > Gosh, this is turning into a really fragmented post; apologies. I meant >> to >> > add to the first post that once parsed, my intention was to run a fairly >> > simple XSL transform on the document, to extract a copy of each of the >> URLs >> > it contains. Probably something like this: >> > >> > >> > >> > >> > >> >

>> > [XSLT stylesheet stripped by the list archive; only its title, "ODP URLs", survives]
>> >> That is a problem that can be solved with extremely little memory. Take a >> look >> at the (SAX-like) target parser interface, which will not build a tree and >> instead just receive callbacks while parsing: >> >> http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface >> >> Write a parser target class that keeps track of being inside or outside >> the >> "Topic" tag (start/end), and whenever you find a "link" tag while inside a >> "Topic" tag, look for a "{whatever-namespace}resource" attribute in the >> attrib >> dictionary and and write it into a hand-generated HTML stream like the one >> you >> used above. >> >> Stefan >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080524/53e27859/attachment.htm From sam.kuper at uclmail.net Sat May 24 20:40:27 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Sat, 24 May 2008 19:40:27 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805240719r57ae9dc4r14f5e55a7cda888f@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> <4126b3450805240718x128bafbfic80ff5c4addd755c@mail.gmail.com> <4126b3450805240719r57ae9dc4r14f5e55a7cda888f@mail.gmail.com> Message-ID: <4126b3450805241140he1f380fla35ff0868e069c6b@mail.gmail.com> I'm still trying to find a good way to process big xml files (i.e. xml files around as large as the available RAM, or larger). My last post to the mailing list asked if there might be a way to do this by processing fragments of the document at a time. I now realise that lxml's feed parser is intended for this sort of task (correct me if I'm wrong). 
So I'm trying to learn to use it, but I find its behaviour a little odd. Not sure if this is a bug or not, so I'm posting to the list for advice. When I run: from lxml import etree class EchoTarget: def start(self, tag, attrib): print "start", tag, attrib def end(self, tag): print "end", tag def data(self, data): print "data", repr(data) def close(self): print "close" return "closed!" parser = etree.XMLParser(target = EchoTarget()) parser.feed("foo") I get the result: start something {} data u'foo' end something But when I run: from lxml import etree class EchoTarget: def start(self, tag, attrib): print "start", tag, attrib def end(self, tag): print "end", tag def data(self, data): print "data", repr(data) def close(self): print "close" return "closed!" parser = etree.XMLParser(target = EchoTarget()) parser.feed("foo") nothing gets sent to stdout. Isn't that weird? I think so. I would have expected it to give the same result as the program above. I'd be very thankful if anyone can shed some light on the matter for me. Many thanks, Sam 2008/5/24 Sam Kuper : > To add to the message below, I've just tried running a much simpler program > that doesn't call lxml to see if the memory error is a Python/environment > one rather than being due to lxml. It turns out to be: > > >>> infile = open("content.rdf.u8.xml", "r") > >>> print infile.read() > Traceback (most recent call last): > File "", line 1, in > MemoryError > > Ok, so clearly Python isn't happy to read content.rdf.u8.xml in one go. The > normal workaround for processing large text files piece by piece seems to be > either to set a byte limit on how much is read at once, or to read the file > line by line. However, neither of those will work in this case because they > won't produce well-formed XML that the target parser interface can handle > (correct me if I'm wrong). > > I'm sure there must be a fairly easy solution to this, but it's eluding me. > All assistance greatly appreciated! 
> > Sam > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080524/bd741aaf/attachment.htm From stefan_ml at behnel.de Sat May 24 20:55:01 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 20:55:01 +0200 Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> Message-ID: <48386485.4000408@behnel.de> Hi, please keep the list involved. matan ninio wrote: > Is there some good place to look for information about XPath? Search for "xpath tutorial" ? Stefan From stefan_ml at behnel.de Sat May 24 21:00:11 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 24 May 2008 21:00:11 +0200 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> Message-ID: <483865BB.9010001@behnel.de> Hi, did you read my other post? Sam Kuper wrote: > result = etree.XML(infile.read(), parser) make that result = etree.parse("thefile.xml", parser) and consider reading the parser docs on the web page. 
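A rough sketch of what that could look like, in Python 3 and under a few assumptions: the element and attribute names ("ExternalPage", "about") are taken from the program posted earlier in the thread, while the class name, the helper function and the sample data are invented for illustration. The import falls back to the standard library's ElementTree, which offers the same target interface, if lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib module is an acceptable stand-in
    import xml.etree.ElementTree as etree


class LinkCollector:
    """Parser target that records the 'about' attribute of ExternalPage tags."""

    def __init__(self):
        self.urls = []

    def start(self, tag, attrib):
        # Tags arrive as '{namespace}ExternalPage', so compare the suffix only.
        if tag.endswith("ExternalPage"):
            url = attrib.get("about")
            if url:
                self.urls.append(url)

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.urls


def collect_links(source):
    """Parse a file name or file-like object without building a tree."""
    target = LinkCollector()
    parser = etree.XMLParser(target=target)
    # parse() reads from the source itself, so the document never has to be
    # loaded into one big Python string first.
    etree.parse(source, parser)
    return target.urls


# Hypothetical stand-in for the real file; a file name works the same way.
sample = io.BytesIO(
    b'<RDF xmlns="http://dmoz.org/rdf/">'
    b'<Topic><link/></Topic>'
    b'<ExternalPage about="http://example.org/a"/>'
    b'<ExternalPage about="http://example.org/b"/>'
    b'</RDF>'
)
links = collect_links(sample)
print(links)
```

The important difference from the earlier program is that the parser is handed the source rather than the result of infile.read(), so memory use should stay roughly constant regardless of the file size.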
Stefan From sam.kuper at uclmail.net Sat May 24 21:09:45 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Sat, 24 May 2008 20:09:45 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <483865BB.9010001@behnel.de> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> <483865BB.9010001@behnel.de> Message-ID: <4126b3450805241209wb5fb82bqe7c1650962d773ed@mail.gmail.com> Dear Stefan, I did read your other post, but using the file name directly when calling the parser didn't work for me. Here is what I tried: from lxml import etree outfile = open("output_test001.txt", "w") class EchoTarget(): def start(self, tag, attrib): if tag.endswith("xternalPage"): line = attrib["about"] if line != "": outfile.write(line+"\n") print line def close(self): return "closed!" parser = etree.XMLParser(target = EchoTarget()) result = etree.XML("content.example.xml", parser) This gives the following error: Traceback (most recent call last): File "extract_links_dmoz005.py", line 15, in result = etree.XML("content.example.xml", parser) File "lxml.etree.pyx", line 2358, in lxml.etree.XML File "parser.pxi", line 1354, in lxml.etree._parseMemoryDocument File "parser.pxi", line 1243, in lxml.etree._parseDoc File "parser.pxi", line 795, in lxml.etree._BaseParser._parseDoc File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResultDoc File "parser.pxi", line 478, in lxml.etree._raiseParseError lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1 I have been reading the docs, but I'm new to processing XML in Python, so I don't find them all that easy to understand. I think I'm improving, though :) Thanks for your patience. 
Best, Sam 2008/5/24 Stefan Behnel : > Hi, > > did you read my other post? > > Sam Kuper wrote: > > result = etree.XML(infile.read(), parser) > > make that > > result = etree.parse("thefile.xml", parser) > > and consider reading the parser docs on the web page. > > Stefan > -- http://five.sentenc.es | http://tinyurl.com/3x9se4 -- Mr Sam Pablo Kuper BSc MRI Research Assistant Darwin Correspondence Project Cambridge University Library West Road Cambridge CB3 9DR spk30 at cam.ac.uk Office: +44 (0)1223 333008 Mobile: +44 (0) 7971858176 www.darwinproject.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080524/f549f5af/attachment-0001.htm From john at nmt.edu Sat May 24 21:46:00 2008 From: john at nmt.edu (John W. Shipman) Date: Sat, 24 May 2008 13:46:00 -0600 (MDT) Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: <48386485.4000408@behnel.de> References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> <48386485.4000408@behnel.de> Message-ID: matan ninio wrote: +-- | Is there some good place to look for information about XPath? +-- If I might recommend my modest XSLT reference: http://www.nmt.edu/tcc/help/pubs/xslt/ It has a section on XPath. John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center, Speare 119, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john ``Let's go outside and commiserate with nature.'' --Dave Farber From rwiker at gmail.com Sat May 24 22:09:53 2008 From: rwiker at gmail.com (Raymond Wiker) Date: Sat, 24 May 2008 22:09:53 +0200 Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> <48386485.4000408@behnel.de> Message-ID: <596C30C4-1AE5-48F6-B174-06867F3441D9@gmail.com> On May 24, 2008, at 21:46 , John W. 
Shipman wrote: > matan ninio wrote: > > +-- > | Is there some good place to look for information about XPath? > +-- > > If I might recommend my modest XSLT reference: > > http://www.nmt.edu/tcc/help/pubs/xslt/ > > It has a section on XPath. There's also some good stuff on http://www.zvon.org. From matan at xipus.lxmldev.ninio.org Sat May 24 22:41:24 2008 From: matan at xipus.lxmldev.ninio.org (Matan Ninio) Date: Sat, 24 May 2008 20:41:24 +0000 (UTC) Subject: [lxml-dev] problem\bug in xpath compare() with text in tail References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> <48386485.4000408@behnel.de> <596C30C4-1AE5-48F6-B174-06867F3441D9@gmail.com> Message-ID: Raymond Wiker <rwiker at gmail.com> writes: > > On May 24, 2008, at 21:46 , John W. Shipman wrote: > > > matan ninio wrote: > > > > +-- > > | Is there some good place to look for information about XPath? > > +-- > > > > If I might recommend my modest XSLT reference: > > > > http://www.nmt.edu/tcc/help/pubs/xslt/ > > > > It has a section on XPath. > > There's also some good stuff on http://www.zvon.org. > Thanks for the links. I have already read several of them, including the very nice one on zvon.org mentioned above. But I have yet to find the bit of information I'm missing. Why does the behavior of "text()" change to exclude tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"? What does the "text()" function *actually* do? I can see that if an element were to have more than one text value, the meaning of "contains(text()," may be unclear. But then why is the //text() version actually pulling out the tail elements? This thread is somewhat off-topic. I am new to this list, so I really don't know if it's considered acceptable to discuss such topics here. If not, I apologize and will take this elsewhere.
Thanks again, Matan From stefan_ml at behnel.de Sun May 25 08:08:40 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 25 May 2008 08:08:40 +0200 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805241209wb5fb82bqe7c1650962d773ed@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> <483865BB.9010001@behnel.de> <4126b3450805241209wb5fb82bqe7c1650962d773ed@mail.gmail.com> Message-ID: <48390268.5040508@behnel.de> Hi, RMP! :) Sam Kuper wrote: > result = etree.XML("content.example.xml", parser) > > 2008/5/24 Stefan Behnel: >> result = etree.parse("thefile.xml", parser) See the difference? Please read http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files and http://codespeak.net/lxml/parsing.html Stefan From rwiker at gmail.com Sun May 25 12:14:36 2008 From: rwiker at gmail.com (Raymond Wiker) Date: Sun, 25 May 2008 12:14:36 +0200 Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> <48386485.4000408@behnel.de> <596C30C4-1AE5-48F6-B174-06867F3441D9@gmail.com> Message-ID: On May 24, 2008, at 22:41 , Matan Ninio wrote: > Raymond Wiker gmail.com> writes: > >> >> On May 24, 2008, at 21:46 , John W. Shipman wrote: >> >>> matan ninio wrote: >>> >>> +-- >>> | Is there some good place to look for information about XPath? >>> +-- >>> >>> If I might recommend my modest XSLT reference: >>> >>> http://www.nmt.edu/tcc/help/pubs/xslt/ >>> >>> It has a section on XPath. >> >> There's also some good stuff on http://www.zvon.org. >> > > > > Thanks for the links. 
I have already read several of them, > including the very > nice one in zvon.org mentioned above. But I have yet to find the bit of > information I'm missing. Why does the behavior of "text()" change > to exclude > tail elements when moving from "//text()" to "// > *[contains(text(),'ABC')]"? > What does the "text()" function *actually* do? I can see that if an > element > were to have more than one text value, the meaning of > "contains(text()," may be > unclear. But then why is the //text() version actually pulling out > the tail > elements? text() is a node test that matches XML text nodes - it is not a function that returns the concatenation of text nodes under a specific element. Thus, //text() returns all text nodes in a tree. If you want to return all text nodes that contain the string "ABC", the correct test might be something like "//text()[contains(., 'ABC')]" From stefan_ml at behnel.de Sun May 25 12:42:54 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 25 May 2008 12:42:54 +0200 Subject: [lxml-dev] problem\bug in xpath compare() with text in tail In-Reply-To: References: <4838007B.3050900@behnel.de> <3DCA2674-7BC4-4EAC-8AFE-0B2CF77DA5EA@xipus.lxmldev.ninio.org> <48386485.4000408@behnel.de> <596C30C4-1AE5-48F6-B174-06867F3441D9@gmail.com> Message-ID: <483942AE.8010604@behnel.de> Hi, while XPath might be considered somewhat off-topic for ElementTree, I find your question about text() and .tail very on-topic for lxml. ElementTree does not expose the concept of a "text node" to Python space, so having them appear in XPath is somewhat ugly. Also, note that the parser may decide to split long text content or content that contains entities into multiple text nodes, so "text()" is not even guaranteed to return a text node that contains the complete ".text" value of a node. That makes it a somewhat fragile concept in XPath. If you want to test for .text and .tail reliably, it is easiest to do it in Python space.
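A minimal sketch of such a Python-space check, assuming the goal is to find every element whose own .text or .tail contains a given substring. It uses the standard library's xml.etree.ElementTree, which shares the .text/.tail model with lxml.etree; the helper name find_text_or_tail and the sample markup are made up for illustration and are not part of either library.

```python
import xml.etree.ElementTree as ET


def find_text_or_tail(root, needle):
    """Return (tag, where) pairs for elements whose .text or .tail contains needle.

    Hypothetical helper for illustration; not part of lxml or ElementTree.
    """
    hits = []
    for el in root.iter():  # document order, root element included
        if el.text and needle in el.text:
            hits.append((el.tag, "text"))
        if el.tail and needle in el.tail:
            hits.append((el.tag, "tail"))
    return hits


# Made-up sample: "tail" is the .tail of <p>, not the .text of any element.
root = ET.fromstring("<body>inbody<p>text</p>tail</body>")

print(find_text_or_tail(root, "text"))
print(find_text_or_tail(root, "tail"))
print(find_text_or_tail(root, "inbody"))
```

Because the check looks at each element's own .text and .tail separately, it does not depend on how the parser splits content into text nodes.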
Look at the "siblings" example I gave in my first reply. Note also that most XPath string functions can work on node content, so for example: //*[contains(., 'ABC')] succeeds for any node where 'ABC' exists in the concatenated string value of the node and its children (but not in the .tail text of the node itself): >>> e=et.HTML("inbody
text
tail") >>> e.xpath("//*[contains(., 'text')]") [, , ] >>> e.xpath("//*[contains(., 'tail')]") [, ] Matan Ninio wrote: > Why dose the behavior of "text()" change to exclude > tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"? > What does the "text()" function *actually* do? "//text()" will get you /any/ text node in the tree, regardless of its position. "text()" is a node test that succeeds for all text nodes. "//*[contains(text(),'ABC')]" will get you the element that has a text node as direct child that contains the string "ABC". However, apparently, this only works for the first text node: >>> e = et.HTML("inbody
text
tail") >>> e.xpath("//*[contains(text(), 'tail')]") [] >>> e.xpath("//*[contains(text(), 'inbody')]") [] Not sure if this is in line with the XPath spec - might be a problem in libxml2. Although: > I can see that if an element > where to have more then one text value, the meaning of "contains(text()," may be > unclear. I would accept that as an explanation. :) Stefan From sam.kuper at uclmail.net Sun May 25 16:07:25 2008 From: sam.kuper at uclmail.net (Sam Kuper) Date: Sun, 25 May 2008 15:07:25 +0100 Subject: [lxml-dev] Trouble parsing large XML document with ElementTree In-Reply-To: <4126b3450805250706n303bb453lbc0f2c604ec3ccfc@mail.gmail.com> References: <4126b3450805211652q6f05adbfn9a71cfad80849c67@mail.gmail.com> <4126b3450805211717k68d3d281k5d01a0d8861543a2@mail.gmail.com> <4126b3450805211720k4c78f60dr889599a7e1db235f@mail.gmail.com> <48352C99.7040901@behnel.de> <4126b3450805240625n34d867bdx6543ce2f29d2f67a@mail.gmail.com> <483865BB.9010001@behnel.de> <4126b3450805241209wb5fb82bqe7c1650962d773ed@mail.gmail.com> <48390268.5040508@behnel.de> <4126b3450805250706n303bb453lbc0f2c604ec3ccfc@mail.gmail.com> Message-ID: <4126b3450805250707o66f1b86fgb94b610aec3f21c6@mail.gmail.com> Stefan, My apologies and thanks - my eyes obviously weren't working so well yesterday! The program's running smoothly now (aside from some "'ascii' codec can't encode character" exceptions that I'm "try/except"ing until I've understood how to handle them with lxml; I realise the docs cover this but I (still) don't quite understand them all yet). Thanks again. Gratefully, Sam 2008/5/25 Stefan Behnel : > > result = etree.XML("content.example.xml", parser) > > result = etree.parse("thefile.xml", parser) > > See the difference? > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080525/cd53b2dc/attachment.htm From stefan_ml at behnel.de Mon May 26 22:06:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 26 May 2008 22:06:05 +0200 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <4837FF28.8090001@behnel.de> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> <4836A3DC.5060807@gmail.com> <4837FF28.8090001@behnel.de> Message-ID: <483B182D.1070407@behnel.de> Stefan Behnel wrote: > Viksit Gaur wrote: >> I see there's a DFS >> iterator for elements, but is there a way to do a breadth first >> iteration through the tree? > > there's no API for it, but BFS shouldn't be that hard to do in Python. > Actually, if deque() supported appending during iteration, it would be totally > trivial. But there should be recipes on the web. I added a simple BFS recipe to the iteration section of api.txt:

>>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
>>> print(etree.tostring(root, pretty_print=True, encoding=unicode))
<root>
  <a>
    <b/>
    <c/>
  </a>
  <d>
    <e/>
  </d>
</root>

>>> queue = deque([root])
>>> while queue:
...    el = queue.popleft()  # pop next element
...    queue.extend(el)      # append its children
...    print(el.tag)
root
a
d
b
c
e

Stefan From vik.list.nutch at gmail.com Tue May 27 09:34:09 2008 From: vik.list.nutch at gmail.com (Viksit Gaur) Date: Tue, 27 May 2008 00:34:09 -0700 Subject: [lxml-dev] DOM tree intersection/comparison? In-Reply-To: <483B182D.1070407@behnel.de> References: <48368DA8.5040602@gmail.com> <48369B4D.9040204@behnel.de> <4836A3DC.5060807@gmail.com> <4837FF28.8090001@behnel.de> <483B182D.1070407@behnel.de> Message-ID: <483BB971.4080402@gmail.com> Hey, Stefan Behnel wrote: > Stefan Behnel wrote: > > I added a simple BFS recipe to the iteration section of api.txt: Oh, that's a neat way of doing it. I didn't know about deque - was using lists and pops! Thanks for the pointer. Cheers, Viksit

>
> >>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
> >>> print(etree.tostring(root, pretty_print=True, encoding=unicode))
> <root>
>   <a>
>     <b/>
>     <c/>
>   </a>
>   <d>
>     <e/>
>   </d>
> </root>
>
> >>> queue = deque([root])
> >>> while queue:
> ...    el = queue.popleft()  # pop next element
> ...    queue.extend(el)      # append its children
> ...    print(el.tag)
> root
> a
> d
> b
> c
> e
>
>
> Stefan
>
>
>

From deepali.hd at sonata-software.com Fri May 30 10:05:26 2008 From: deepali.hd at sonata-software.com (Deepali H. Dhabade) Date: Fri, 30 May 2008 13:35:26 +0530 Subject: [lxml-dev] Regarding the ElementTree API Message-ID: <15AA0DBB99BE714D91F7E2E9E76800B401C0391E@NANDIMSG.SONATA.LOCAL> ElementTree API does not support namespace prefixes. Is there any newer ElementTree API which gives full support for namespace prefixes? I need this because I am working on openxml+python support. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080530/089b8601/attachment-0001.htm From stefan_ml at behnel.de Fri May 30 10:17:56 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 30 May 2008 10:17:56 +0200 Subject: [lxml-dev] Regarding the ElementTree API In-Reply-To: <15AA0DBB99BE714D91F7E2E9E76800B401C0391E@NANDIMSG.SONATA.LOCAL> References: <15AA0DBB99BE714D91F7E2E9E76800B401C0391E@NANDIMSG.SONATA.LOCAL> Message-ID: <483FB834.4050300@behnel.de> Hi, Deepali H. Dhabade wrote: > ElementTree API does not support namespace prefixes. That's a bit simplistic. > Is there any newer > ElementTree API which gives full support for namespace prefixes? The ElementTree API handles prefixes correctly, although it doesn't give users much control over what happens with them. lxml.etree is a bit more open here. http://codespeak.net/lxml/tutorial.html#namespaces What else do you need? Stefan From faassen at startifact.com Fri May 30 16:51:11 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 30 May 2008 16:51:11 +0200 Subject: [lxml-dev] first lessons learned while porting lxml to Py3 In-Reply-To: <48309D44.9080603@behnel.de> References: <4830827A.8050304@behnel.de> <48309D44.9080603@behnel.de> Message-ID: Stefan Behnel wrote: > Sorry, wrong list. This was supposed to go to the Cython list... > > [but yes, there will be lxml for Python 3, and pretty soon] It's interesting to hear about here anyway. I'm glad to hear that you'll try to be clearer about how unicode works for the Python 2.x version of lxml too. I think this is similar to the way the migration path enables this for plain Python code in Python 2.6. I do hope that lxml for Python 2.x can be maintained and extended for the foreseeable future though; I'm sitting on a vast mountain of codebases that aren't going to Python 3.x in a hurry.
Regards, Martijn From faassen at startifact.com Fri May 30 16:55:24 2008 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 30 May 2008 16:55:24 +0200 Subject: [lxml-dev] Python 3 changes in lxml 2.1 In-Reply-To: <4837E61B.7000707@behnel.de> References: <4837E61B.7000707@behnel.de> Message-ID: Hi there, Stefan Behnel wrote: [snip] > The ugliest problem I found so far is with doctests. There just isn't a way to > write a Py2/Py3 portable doctest that accepts exactly a byte string or unicode > strings as output, as both look different in Py2 and Py3. Also, exception > names are now fully qualified, so that tracebacks look different. Tons of > failing tests for nothing... I wonder whether doctest can be extended/adapted so it can normalize strings. I think by the way it might be valuable to bring up this issue on the Py3K mailing list. The doctest module is after all in the standard library, and perhaps people can think up a way to break less. Regards, Martijn From stefan_ml at behnel.de Fri May 30 17:24:31 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 30 May 2008 17:24:31 +0200 Subject: [lxml-dev] first lessons learned while porting lxml to Py3 In-Reply-To: References: <4830827A.8050304@behnel.de> <48309D44.9080603@behnel.de> Message-ID: <48401C2F.3020500@behnel.de> Hi, Martijn Faassen wrote: > I'm glad to hear that you'll try being more clear with how unicode works > for the Python 2.x version of lxml too. I think this is similar to the > way the migration path enables this for plain Python code in Python 2.6. There will be little changes when running in Py2. It will still accept byte strings at the API level and return them for plain ASCII values. This only changes under Py3, where you will always get a unicode string back for .tag, .text, etc. I'm not even planning to block passing byte strings as tag name, although that will become really rare for Python code running in Py3. 
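As a small illustration of the Py3 behaviour described above, here is a sketch that parses a byte string and checks that .tag and .text come back as unicode (str) objects. It uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that lxml.etree behaves the same way under Python 3; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Byte string input, as it would arrive from a file or a socket.
data = b"<root><child>caf\xc3\xa9</child></root>"  # UTF-8 encoded "cafe" with an accented e

root = ET.fromstring(data)
child = root[0]

# Under Python 3 both values are unicode strings, never bytes.
print(type(root.tag).__name__, repr(root.tag))
print(type(child.text).__name__, repr(child.text))
```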
> I do hope that lxml for Python 2.x can be maintained and extended for > the foreseeable future though; I'm sitting on a vast mountain of > codebases that aren't going to Python 3.x in a hurry. There will only be a single code base. We ported the code that Cython generates in a way that makes it compile from Py2.3 to Py3.0 without changes, and I'm planning to continue the support for 2.3 as long as possible. We are planning a new release of Cython shortly after the release of 3.0/2.6 beta1. Stefan From stefan_ml at behnel.de Sat May 31 15:18:55 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 31 May 2008 15:18:55 +0200 Subject: [lxml-dev] Python 3 changes in lxml 2.1 In-Reply-To: References: <4837E61B.7000707@behnel.de> Message-ID: <4841503F.3030108@behnel.de> Martijn Faassen wrote: > I wonder whether doctest can be extended/adapted so it can normalize > strings. You can use lib2to3 to convert doctests before running them. That works well in most cases, but isn't trivial to set up either and you may still have to change your tests to enable an automated conversion. A straight 2to3 option in the doctest module would be nice anyway. What I did for lxml was to change most doctests to the more explicit Py3 syntax (or actually a bit of a mix of both worlds) and use a couple of regular expressions to fix them up before passing them to doctest. Not ideal, but it works well enough for now. Stefan From stefan_ml at behnel.de Sat May 31 18:30:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 31 May 2008 18:30:58 +0200 Subject: [lxml-dev] lxml 2.0.6 released Message-ID: <48417D42.9090805@behnel.de> Hi, lxml 2.0.6 is on PyPI. This is a bug fix only release for the stable 2.0 series. As a long-standing threading problem was solved, updating is generally recommended, although it should not affect currently working code. It should, however, make it possible to run lxml threaded under mod_python and friends. Feedback is welcome.
This release should also make the life easier for MacOS-X users. Have fun, Stefan 2.0.6 (2008-05-31) Features added Bugs fixed * Incorrect evaluation of el.find("tag[child]"). * Windows build was broken. * Moving a subtree from a document created in one thread into a document of another thread could crash when the rest of the source document is deleted while the subtree is still in use. * Rare crash when serialising to a file object with certain encodings. Other changes * lxml should now build without problems on MacOS-X. From tseaver at palladion.com Sat May 31 19:36:47 2008 From: tseaver at palladion.com (Tres Seaver) Date: Sat, 31 May 2008 13:36:47 -0400 Subject: [lxml-dev] first lessons learned while porting lxml to Py3 In-Reply-To: <48401C2F.3020500@behnel.de> References: <4830827A.8050304@behnel.de> <48309D44.9080603@behnel.de> <48401C2F.3020500@behnel.de> Message-ID: <48418CAF.5010800@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stefan Behnel wrote: > Hi, > > Martijn Faassen wrote: >> I'm glad to hear that you'll try being more clear with how unicode works >> for the Python 2.x version of lxml too. I think this is similar to the >> way the migration path enables this for plain Python code in Python 2.6. > > There will be little changes when running in Py2. It will still accept byte > strings at the API level and return them for plain ASCII values. This only > changes under Py3, where you will always get a unicode string back for .tag, > .text, etc. > > I'm not even planning to block passing byte strings as tag name, although that > will become really rare for Python code running in Py3. > > >> I do hope that lxml for Python 2.x can be maintained and extended for >> the forseeable future though; I'm sitting on a vast mountain of >> codebases that aren't going to Python 3.x in a hurry. > > There will only be a single code base. 
We ported the code that Cython > generates in a way that makes it compile from Py2.3 to Py3.0 without changes, > and I'm planning to continue the support for 2.3 as long as possible. Thank you! /me heaves a huge sigh of relief. Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFIQYyv+gerLs4ltQ4RAoaYAJ9tfnofSKDkniA2KV7mPa4AUg7UhACfQoIy 5zcGRMv37Fu4ZWIEmC8E5v0= =hp1x -----END PGP SIGNATURE-----