From stefan_ml at behnel.de Thu May 1 12:15:33 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 01 May 2008 12:15:33 +0200
Subject: [lxml-dev] lxml 2.0.5 released
Message-ID: <48199845.4@behnel.de>

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0
series.

Have fun,
Stefan


2.0.5 (2008-05-01)

Bugs fixed

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let
  some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.

From klizhentas at gmail.com Thu May 1 20:14:19 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Thu, 1 May 2008 22:14:19 +0400
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
Message-ID: <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>

Hi All,

Got a question:

I've extended the ElementBase object using the approach described in the
tutorial, but SubElement does not work as desired:

    class NodeBase(etree.ElementBase):
        def append(self,child):
            print "aaa"
            return etree.ElementBase.append(self,child)

    etree.SubElement(root,"child") #no "aaa" printed

OK, but when taking your code to the module:

    def SubElement(parent, tag, attrib={}, **extra):
        attrib = attrib.copy()
        attrib.update(extra)
        element = parent.makeelement(tag, attrib)
        parent.append(element)
        return element

    SubElement(root,"child") # "aaa" is here!

and overriding

    def makeelement(self, tag, attrib):
        return Node(tag, attrib)

in the NodeBase just does not help,

Any advice will be appreciated,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080501/200b8175/attachment.htm

From stefan_ml at behnel.de Thu May 1 20:28:12 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 01 May 2008 20:28:12 +0200
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
Message-ID: <481A0BBC.7040000@behnel.de>

Hi,

Alex Klizhentas wrote:
> I've extended the ElementBase object using the approach described in the
> tutorial, but SubElement does not work as desired:
>
> class NodeBase(etree.ElementBase):
>     def append(self,child):
>         print "aaa"
>         return etree.ElementBase.append(self,child)
>
> etree.SubElement(root,"child") #no "aaa" printed

That's because SubElement() does not call .append().

> OK, but when taking your code to the module:
>
> def SubElement(parent, tag, attrib={}, **extra):
>     attrib = attrib.copy()
>     attrib.update(extra)
>     element = parent.makeelement(tag, attrib)
>     parent.append(element)
>     return element
>
> SubElement(root,"child") # "aaa" is here!

As expected, as you call .append() explicitly here.

> and overriding
> def makeelement(self, tag, attrib):
>     return Node(tag, attrib)
>
> in the NodeBase just does not help,

SubElement() does not call .makeelement() either. It's implemented in plain C.

Could you explain a bit why you want to do this and how your .append() differs
from the normal append code?

Stefan

From klizhentas at gmail.com Thu May 1 21:11:38 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Thu, 1 May 2008 23:11:38 +0400
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <481A0BBC.7040000@behnel.de>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
 <481A0BBC.7040000@behnel.de>
Message-ID: <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com>

Thanks for the comments,

The idea behind this is to allow the XML tree to notify observers when its
contents are changed: a node is added, removed or moved.

That's why I'm going to override the ElementBase members so that they will
notify observers when certain actions are performed.

Everything works fine, except this useful SubElement function that did not
work as expected; now you've clarified things,

Thanks
Alex

2008/5/1 Stefan Behnel :

> Hi,
>
> Alex Klizhentas wrote:
> > I've extended the ElementBase object using the approach described in the
> > tutorial, but SubElement does not work as desired:
> >
> > class NodeBase(etree.ElementBase):
> >     def append(self,child):
> >         print "aaa"
> >         return etree.ElementBase.append(self,child)
> >
> > etree.SubElement(root,"child") #no "aaa" printed
>
> That's because SubElement() does not call .append().
>
>
> > OK, but when taking your code to the module:
> >
> > def SubElement(parent, tag, attrib={}, **extra):
> >     attrib = attrib.copy()
> >     attrib.update(extra)
> >     element = parent.makeelement(tag, attrib)
> >     parent.append(element)
> >     return element
> >
> > SubElement(root,"child") # "aaa" is here!
>
> As expected, as you call .append() explicitly here.
>
>
> > and overriding
> > def makeelement(self, tag, attrib):
> >     return Node(tag, attrib)
> >
> > in the NodeBase just does not help,
>
> SubElement() does not call .makeelement() either. It's implemented in
> plain C.
> Could you explain a bit why you want to do this and how your .append()
> differs from the normal append code?
>
> Stefan
>

--
Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080501/0c565b12/attachment.htm

From stefan_ml at behnel.de Fri May 2 08:49:39 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 May 2008 08:49:39 +0200
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
 <481A0BBC.7040000@behnel.de>
 <6310a8f80805011211p1a17e25dt647f065c31a9200a@mail.gmail.com>
Message-ID: <481AB983.9020604@behnel.de>

Alex Klizhentas wrote:
>> Alex Klizhentas wrote:
>>> I've extended the ElementBase object using the approach described in the
>>> tutorial, but SubElement does not work as desired:
>>>
>>> class NodeBase(etree.ElementBase):
>>>     def append(self,child):
>>>         print "aaa"
>>>         return etree.ElementBase.append(self,child)
>>>
>>> etree.SubElement(root,"child") #no "aaa" printed
>> That's because SubElement() does not call .append().
>>
>>
>>> OK, but when taking your code to the module:
>>>
>>> def SubElement(parent, tag, attrib={}, **extra):
>>>     attrib = attrib.copy()
>>>     attrib.update(extra)
>>>     element = parent.makeelement(tag, attrib)
>>>     parent.append(element)
>>>     return element
>
> The idea behind this is to allow the XML tree to notify observers when its
> contents are changed: a node is added, removed or moved.
>
> That's why I'm going to override the ElementBase members so that they will
> notify observers when certain actions are performed.
>
> Everything works fine, except this useful SubElement function that did not
> work as expected; now you've clarified things,

Ah, sure.
Then it's best to use a pure Python implementation of SubElement
instead, such as the one above.

Stefan

From stefan_ml at behnel.de Fri May 2 16:30:24 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 May 2008 16:30:24 +0200
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <481A0BBC.7040000@behnel.de>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
 <481A0BBC.7040000@behnel.de>
Message-ID: <481B2580.9070803@behnel.de>

Hi,

another bit of reasoning here.

Stefan Behnel wrote:
> Alex Klizhentas wrote:
>> I've extended the ElementBase object using the approach described in the
>> tutorial, but SubElement does not work as desired:
>>
>> class NodeBase(etree.ElementBase):
>>     def append(self,child):
>>         print "aaa"
>>         return etree.ElementBase.append(self,child)
>>
>> etree.SubElement(root,"child") #no "aaa" printed
>
> That's because SubElement() does not call .append().
[...]
> SubElement() does not call .makeelement() either. It's implemented in plain C.

One important reason is that this allows lxml.etree to append the new libxml2
node at the C level *before* the decision is taken which Python class should
be used to represent it. This might have an impact on the class lookup if it
considers the parental relation when taking its decision (lxml.objectify does
that, for example).

But that's the only difference I can see between etree.SubElement() and your
Python implementation. And you could even work around it by doing something
like this:

    def SubElement(parent, tag, attrib={}, **extra):
        attrib = attrib.copy()
        attrib.update(extra)
        element = parent.makeelement(tag, attrib)
        parent.append(element)
        del element
        return parent[-1]

However, you might want to avoid that if you know you won't need it, e.g. when
using the "namespace" or "default" lookup scheme.
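
[Editorial sketch] The difference discussed in this thread can be reproduced with a few lines, assuming a current lxml and Python 3. The names `ObservedNode`, `sub_element` and the `events` list are hypothetical and are not taken from Alex's code; the class lookup setup is one possible way to obtain instances of a custom class, not necessarily the one he used.

```python
from lxml import etree

# Hypothetical observer log; the thread does not show the real observer design.
events = []


class ObservedNode(etree.ElementBase):
    """Custom element class that records every call to append()."""

    def append(self, child):
        events.append(("append", self.tag, child.tag))
        return etree.ElementBase.append(self, child)


def sub_element(parent, tag, attrib=None, **extra):
    """Pure-Python stand-in for etree.SubElement() that goes through
    parent.append(), so an overridden append() is honoured."""
    attrib = dict(attrib or {})
    attrib.update(extra)
    element = parent.makeelement(tag, attrib)
    parent.append(element)
    return element


# Let the parser create ObservedNode instances for all elements.
parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=ObservedNode))
root = etree.fromstring("<root/>", parser)

etree.SubElement(root, "silent")  # compiled code path, append() is bypassed
sub_element(root, "noisy")        # Python code path, append() is called
print(events)
```

Only the second call shows up in `events`, which matches Stefan's explanation that `etree.SubElement()` calls neither `.append()` nor `.makeelement()`.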

Stefan

From stefan_ml at behnel.de Fri May 2 19:16:34 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 May 2008 19:16:34 +0200
Subject: [lxml-dev] threading fixed :)
Message-ID: <481B4C72.1040604@behnel.de>

Hi,

there has been a long-standing issue in the threading support in lxml,
combined with the per-thread string hash table we use for libxml2. Here is a
simple example of a sure crasher:

-------------------------------
import threading
import lxml.etree as et

xml = ""
main_root = et.XML("")

def run_thread():
    thread_root = et.XML(xml)
    main_root.append(thread_root[0])
    del thread_root # deletes the document

thread = threading.Thread(target=run_thread)
thread.start()
thread.join()

print et.tostring(main_root)
-------------------------------

This crashes, because the thread parses the XML fragment into its own
dictionary and stores the tag name "threadtag" there. Then it appends the
"threadtag" element to a tree in the main program, which uses a different
dict. When it deletes the "thread_root", the document will be deleted as well,
and the (ref-counted) thread dictionary that contains the string "threadtag"
will be freed when the thread terminates. The main program then crashes when
it accesses the no longer available tag name in the corrupted document.

The solution I came up with today is actually quite simple. We have to
traverse the subtree anyway to update the document references and to fix the
namespace declarations. So it's only one step more to also fix the name
pointers by looking them up in the target dictionary and re-assigning the
names. This is only required when we really have two different dicts, which is
easy to decide. So there isn't even a performance impact if you only use a
single thread or if you do not move subtrees between threads. And the added
overhead when you need this is really small.
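
[Editorial sketch] The XML string literals in the archived example above appear to have been lost, so it cannot be run as shown. The version below is a reconstruction for Python 3 under stated assumptions: only the tag name "threadtag" comes from the message, while the `<root>` wrapper elements are guesses. With an lxml release that contains the fix described here, it should run without crashing.

```python
import threading

from lxml import etree

# Assumed literals: only "threadtag" is named in the message,
# the <root> elements are placeholders chosen for this sketch.
xml = "<root><threadtag/></root>"
main_root = etree.XML("<root/>")


def run_thread():
    thread_root = etree.XML(xml)      # parsed inside the worker thread
    main_root.append(thread_root[0])  # move the subtree into the main tree
    del thread_root                   # drop the worker's own document


thread = threading.Thread(target=run_thread)
thread.start()
thread.join()

# Accessing the moved element from the main thread is the step that
# crashed under 2.0 according to the description above.
result = etree.tostring(main_root)
print(result)
```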

I will release a new beta of 2.1 soon that will have this change, and it would
be very helpful if people who currently use threaded code that exchanges (i.e.
deep copies) tree fragments between threads could check if this works for them
(i.e. if code that crashes under 2.0 if you remove the deep copying works
under 2.1). If it proves to fix the problem, I will backport it to 2.0 also.
Read: the more feedback I get, the faster this will be fixed in 2.0. :)

Stefan

From klizhentas at gmail.com Fri May 2 19:21:19 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Fri, 2 May 2008 21:21:19 +0400
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <481B2580.9070803@behnel.de>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
 <481A0BBC.7040000@behnel.de>
 <481B2580.9070803@behnel.de>
Message-ID: <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com>

Thanks Stefan,

All the nodes in that tree should have the same type, that's why the default
class lookup scheme for parser works fine.

BTW, I have one more question. To set the xml:id I use the following
construct:

    def xml_id(v):
        # helper function to create name space attributes
        return {'{http://www.w3.org/XML/1998/namespace}id': v}

and the following construct:

    N.child1("text",xml_id("some_id"))

following the examples from the site.

To get the id I use:

    class NodeBase(etree.ElementBase):
        ...
        def get_node_id(self,id):
            searched = self.find(".//*[@{http://www.w3.org/XML/1998/namespace}id='%s']"%(id,))
            if searched is None:
                raise NodeNotFoundError(id)
            return searched

I have two questions:

1. what way is faster to get the element by Id? should I use find or xpath
   to achieve the better performance?
2. is there a way to set xml:id using xml - prefix?

Thanks,
Alex

2008/5/2 Stefan Behnel :

> Hi,
>
> another bit of reasoning here.
>
> Stefan Behnel wrote:
> > Alex Klizhentas wrote:
> >> I've extended the ElementBase object using the approach described in the
> >> tutorial, but SubElement does not work as desired:
> >>
> >> class NodeBase(etree.ElementBase):
> >>     def append(self,child):
> >>         print "aaa"
> >>         return etree.ElementBase.append(self,child)
> >>
> >> etree.SubElement(root,"child") #no "aaa" printed
> >
> > That's because SubElement() does not call .append().
> [...]
> > SubElement() does not call .makeelement() either. It's implemented in
> > plain C.
>
> One important reason is that this allows lxml.etree to append the new
> libxml2 node at the C level *before* the decision is taken which Python
> class should be used to represent it. This might have an impact on the
> class lookup if it considers the parental relation when taking its
> decision (lxml.objectify does that, for example).
>
> But that's the only difference I can see between etree.SubElement() and
> your Python implementation. And you could even work around it by doing
> something like this:
>
>     def SubElement(parent, tag, attrib={}, **extra):
>         attrib = attrib.copy()
>         attrib.update(extra)
>         element = parent.makeelement(tag, attrib)
>         parent.append(element)
>         del element
>         return parent[-1]
>
> However, you might want to avoid that if you know you won't need it, e.g.
> when using the "namespace" or "default" lookup scheme.
>
> Stefan
>
>

--
Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080502/e5f1c6a6/attachment-0001.htm

From stefan_ml at behnel.de Fri May 2 19:42:09 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 May 2008 19:42:09 +0200
Subject: [lxml-dev] Custom Elements question
In-Reply-To: <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com>
References: <6310a8f80805010338l584fab31nf99ea15c7461ceb6@mail.gmail.com>
 <6310a8f80805011114x2a17221cjb00704dc9dd5af1@mail.gmail.com>
 <481A0BBC.7040000@behnel.de>
 <481B2580.9070803@behnel.de>
 <6310a8f80805021021p12780fa7y9852028222caff06@mail.gmail.com>
Message-ID: <481B5271.8000101@behnel.de>

Hi,

Alex Klizhentas wrote:
> I have one more question, to set the xml:id i use the following construct:
>
> def xml_id(v):
>     # helper function to create name space attributes
>     return {'{http://www.w3.org/XML/1998/namespace}id': v}
>
> and the following construct:
>
> N.child1("text",xml_id("some_id"))
>
> following the examples from the site.
>
> to get the id I use:
>
> class NodeBase(etree.ElementBase):
>     ...
>     def get_node_id(self,id):
>         searched = self.find(".//*[@{
> http://www.w3.org/XML/1998/namespace}id='%s']"%(id,))
>         if searched is None:
>             raise NodeNotFoundError(id)
>         return searched
>
> I have two questions:
>
> 1. what way is faster to get the element by Id? should I use find or xpath
> to achieve the better performance?

timeit will tell you that. But it really depends on the data. element.find()
stops short after the first hit, so that's probably faster on average if the
document is large. OTOH, XPath() is implemented in C and could easily beat the
Python code behind find("..@attr...") for smaller documents...

Try this:

    find_id = etree.ETXPath(
        ".//*[@{http://www.w3.org/XML/1998/namespace}id=$id]")
    ...
    def get_node_id(self,id):
        el = find_id(self, id=id)

> 2. is there a way to set xml:id using xml - prefix?

No, but if you know you run single-threaded, you can reuse the attrib dict and
just change the value.
That's faster than recreating it each time.

Stefan

From stefan_ml at behnel.de Fri May 2 20:48:28 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 May 2008 20:48:28 +0200
Subject: [lxml-dev] lxml 2.1beta2 released
Message-ID: <481B61FC.7050300@behnel.de>

Hi all,

I'm happy to announce the release of lxml 2.1 beta2. It features a couple of
enhancements and fixes over the first beta.

The main improvement is the much more robust threading support, which makes it
a lot easier to move subtrees back and forth between threads. It is described
in more detail here:

http://permalink.gmane.org/gmane.comp.python.lxml.devel/3571

Please report back on the list (preferably in reply to the above thread) if
you notice a difference to lxml 2.0 with your code.

Have fun,
Stefan


2.1beta2 (2008-05-02)

Features added

* All parse functions in lxml.html take a parser keyword argument.
* lxml.html has a new parser class XHTMLParser and a module attribute
  xhtml_parser that provide XML parsers that are pre-configured for the
  lxml.html package.

Bugs fixed

* Moving a subtree from a document created in one thread into a document of
  another thread could crash when the rest of the source document is deleted
  while the subtree is still in use.
* Passing an nsmap when creating an Element will no longer strip redundantly
  defined namespace URIs. This prevented the definition of more than one
  prefix for a namespace on the same Element.

Other changes

* If the default namespace is redundantly defined with a prefix on the same
  Element, the prefix will now be preferred for subelements and attributes.
  This allows users to work around a problem in libxml2 where attributes from
  the default namespace could serialise without a prefix even when they
  appear on an Element with a different namespace (i.e. they would end up in
  the wrong namespace).

From mharper3 at uiuc.edu Sun May 4 03:17:49 2008
From: mharper3 at uiuc.edu (mharper3 at uiuc.edu)
Date: Sat, 3 May 2008 20:17:49 -0500 (CDT)
Subject: [lxml-dev] (no subject)
Message-ID: <20080503201749.BHR33134@expms5.cites.uiuc.edu>

Hi lxml-dev:

I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following
(minimal reproduction) code:

    import lxml.etree

    wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
    context = lxml.etree.iterparse(wiki_xml_filename, events=("end",))
    for action, elem in context:
        pass

The crash usually occurs about halfway through the file (around 3,000,000).
The same code runs on smaller mediawiki xml files (200 mb) without error. I
only get this error for this very large xml file (in this case about 13gb
uncompressed). I had no trouble parsing the same file with the python standard
library sax parser, but it is much slower and I don't like its api.

I'm using libxml2-2.6.32 (also used earlier versions), python 2.5.2,
python-lxml 2.0.5 (also tried earlier versions), Kubuntu 8.04 with 2.6.24
kernel (also tested on opensuse 10.3 with earlier kernel).

Some of the exceptions are MemoryErrors. The machine running the code has 4gb
of ram. The kernel does not appear to significantly hit the swap during the
run.

Here are the errors:

*** glibc detected *** python: free(): invalid pointer: 0x08220a15 ***
Aborted

Also:

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/traceback.py", line 241, in <module>
    def print_last(limit=None, file=None):
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

...

and also (slightly different)

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/tempfile.py", line 33, in <module>
    from random import Random as _Random
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Sometimes I just get 'Segmentation fault' from the shell, and sometimes it
just hangs indefinitely.

and finally (cStringIO):

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 36, in apport_excepthook
    from cStringIO import StringIO
ImportError: /usr/lib/python2.5/lib-dynload/cStringIO.so: failed to map segment from shared object: Permission denied

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Any direction on tracking down the source is greatly appreciated!

-- Marc

From stefan_ml at behnel.de Sun May 4 07:34:50 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 07:34:50 +0200
Subject: [lxml-dev] (no subject)
In-Reply-To: <20080503201749.BHR33134@expms5.cites.uiuc.edu>
References: <20080503201749.BHR33134@expms5.cites.uiuc.edu>
Message-ID: <481D4AFA.8020401@behnel.de>

Hi,

mharper3 at uiuc.edu wrote:
> I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal reproduction) code:
>
>
> import lxml.etree
>
> wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
> context = lxml.etree.iterparse(wiki_xml_filename, events=("end",))
> for action, elem in context:
>     pass
>
>
> The crash usually occurs about halfway through the file (around
> 3,000,000) The same code runs on smaller mediawiki xml files (200 mb)
> without error.
> I only get this error for this very large xml file (in this
> case about 13gb uncompressed). I had no trouble parsing the same file with
> the python standard library sax parser, but it is much slower and I don't
> like its api.
>
> Some of the exceptions are MemoryErrors. The machine running the code has
> 4gb of ram. The kernel does not appear to significantly hit the swap during
> the run.

iterparse() builds a tree in memory, so parsing a 13gb file on a 4gb RAM
machine will fail - *unless* you clean up the parts of the tree that you no
longer need. Something like

    for action, elem in context:
        if elem.tag == "page":
            # handle page
            elem.clear()
        elif elem.tag in tag_names_of_ancestors_of_page_elements:
            elem.clear()

might work for you.

BTW, you can also parse the gzip compressed file directly, might even be
faster.

Stefan

From stefan_ml at behnel.de Sun May 4 11:02:07 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 11:02:07 +0200
Subject: [lxml-dev] [Fwd: Re: (no subject)]
Message-ID: <481D7B8F.9060603@behnel.de>

[Forwarding to the list ...]

From:

Stefan --

Thanks so much for the quick response. I did consider that the tree was being
built in memory, but the documentation seems to suggest that is not the case.
Specifically the language in the tutorial
(http://codespeak.net/lxml/tutorial.html) in both the sections 'incremental
parsing' and 'event-driven parsing' seem to suggest using iterparse to access
without retaining the tree in memory. I see now that the documentation says
otherwise for iterparse, as you pointed out.

If you don't mind, why does the iterator retain the tree in memory? I would
suspect otherwise from the 'natural' behavior of iterators/generators in
general, though that may be an invalid assumption. (i.e. I would parse the
entire tree into memory if I thought that I had enough memory to do so;
otherwise I would _incrementally_ parse it.)
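
[Editorial sketch] The effect of the clear() advice in the reply above can be checked on a small in-memory document. This assumes a current lxml and Python 3; the tiny `<mediawiki>` document and the helper `count_remaining` are stand-ins invented for illustration, not the real dump or anyone's actual code. It compares clearing each `<page>` alone with additionally deleting siblings that were already handled.

```python
import io

from lxml import etree

# Hypothetical stand-in for the real dump: five <page> elements under one root.
DOC = (
    b"<mediawiki>"
    + b"".join(b"<page><title>%d</title></page>" % i for i in range(5))
    + b"</mediawiki>"
)


def count_remaining(prune_siblings):
    """Parse DOC with iterparse(), clear every <page> after handling it and
    return (pages handled, children still attached to the root)."""
    handled = 0
    root = None
    for event, elem in etree.iterparse(
            io.BytesIO(DOC), events=("end",), tag="page"):
        handled += 1      # "handle" the page here
        elem.clear()      # drops children, text and attributes only
        if prune_siblings:
            # Also delete the already processed siblings before this one.
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        root = elem.getparent()
    return handled, len(root)


print(count_remaining(prune_siblings=False))
print(count_remaining(prune_siblings=True))
```

With clear() alone, every emptied `<page>` element stays attached to the root, so memory use still grows with the number of pages, just more slowly. Deleting the processed siblings keeps the number of attached children constant.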

More specifically, I don't want to ignore any parts of the xml file in this
specific instance, so a ParserTarget is not the correct solution. Your
suggestion to use clear() works for me; maybe it should be made explicit in
the tutorial that memory is not cleared unless clear() is called. The only
mention in the tutorial is iterparse "also allows to clear() or modify the
content of an Element to save memory". My mistake was to assume that the
'used' elements would be freed without an explicit call to do so as the
iterator progressed.

Again, thank you for your quick reply!

-- Marc

From stefan_ml at behnel.de Sun May 4 11:02:49 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 11:02:49 +0200
Subject: [lxml-dev] [Fwd: Re: (no subject)]
Message-ID: <481D7BB9.7000201@behnel.de>

[Forwarding to the list...]

From:

Also, adding

    elem.clear()

into the loop still eventually leads to a memory error, just much later. This
should be clearing every element, so I'm not quite sure if I understand what
clear() actually does.

Should I segment the file into smaller pieces so that the tree is unloaded as
each piece finishes?

I apologize if my questions are trivial. I appreciate your responses greatly.

-- Marc

From stefan_ml at behnel.de Sun May 4 12:18:42 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 12:18:42 +0200
Subject: [lxml-dev] saving memory with iterparse()
In-Reply-To: <481D7B8F.9060603@behnel.de>
References: <481D7B8F.9060603@behnel.de>
Message-ID: <481D8D82.8060503@behnel.de>

Hi,

Stefan Behnel wrote:
> From:
> Thanks so much for the quick response. I did consider that the tree was being
> built in memory, but the documentation seems to suggest that is not the case.
> Specifically the language in the tutorial
> (http://codespeak.net/lxml/tutorial.html) in both the sections 'incremental
> parsing' and 'event-driven parsing' seem to suggest using iterparse to access
> without retaining the tree in memory.

It actually says:

"""
two event-driven parser interfaces, one that generates parser events while
building the tree (``iterparse``), and one that does not build the tree at
all, and instead calls feedback methods on a target object in a SAX-like
fashion.
"""

but I added a new example now that shows how to save memory.

http://codespeak.net/lxml/tutorial.html#event-driven-parsing

> If you don't mind, why does the
> iterator retain the tree in memory? I would suspect otherwise from the
> 'natural' behavior of iterators/generators in general, though that may be an
> invalid assumption.
[...]
> My mistake was to assume that the
> 'used' elements would be freed without an explicit call to do so as the
> iterator progressed.

The question is: how should iterparse() know when you no longer need a
subtree? The end event for a parent always comes after the end events of all
its children and you might still access the whole subtree when you handle the
parent.

> (i.e. I would parse the entire tree into memory if I
> thought that I had enough memory to do so; otherwise I would _incrementally_
> parse it.)

The docs actually use two terms: "incremental parsing" and "event-driven
parsing". Incremental parsing is used for feeding data into the parser one
chunk at a time, while event-driven parsing means you also get back one parser
event at a time.

If you have an idea how to present this better, I take patches:

http://codespeak.net/svn/lxml/trunk/doc/tutorial.txt

Stefan

From stefan_ml at behnel.de Sun May 4 12:53:34 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 04 May 2008 12:53:34 +0200
Subject: [lxml-dev] parsing a large file with iterparse()
In-Reply-To: <481D7BB9.7000201@behnel.de>
References: <481D7BB9.7000201@behnel.de>
Message-ID: <481D95AE.1030609@behnel.de>

Hi,

Stefan Behnel wrote:
> From:
>
> Also, adding
>
>     elem.clear()
>
> into the loop still eventually leads to a memory error, just much later.
> This should be clearing every element, so I'm not quite sure if I
> understand what clear() actually does.

According to the docs:

"""
clear()
Resets an element. This function removes all subelements, clears all
attributes and sets the text and tail properties to None.
"""

So it does not remove the element itself. I don't know what your XML looks
like, but if it's something like

    ... * a zillion

and you handle the end event of the element and clear() it, you still end up
with a tree that has a zillion empty children.

I see two choices in this case. There is cElementTree, which has the same API
and allows you to clear the root element.

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

This does not work in lxml as you cannot delete elements that are still
required by the tree traversal of the parser (i.e. parents and following
siblings). But you can try this in lxml:

    for action, elem in context:
        if elem.tag == "page":
            # handle page
            elem.clear()
            # remove all previous siblings
            parent = elem.getparent()
            previous_sibling = elem.getprevious()
            while previous_sibling is not None:
                parent.remove(previous_sibling)
                previous_sibling = elem.getprevious()

BTW, if you only look for "page" tags and do the sibling cleanup as above, you
can just pass tag="page" to iterparse().

Stefan

From stefan_ml at behnel.de Mon May 5 17:46:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 5 May 2008 17:46:56 +0200 (CEST)
Subject: [lxml-dev] lxml - addition of argument to control namespace declaration serializtion
In-Reply-To:
References:
Message-ID: <38095.194.114.62.38.1210002416.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

these things are best discussed on the list.

Grimes, David wrote:
> In 1.3.5-1.3.6 timeframe, there was a patch introduced to
> _writeNodeToBuffer() from serialization.pxi which forced namespace
> declarations from parent nodes to be serialized onto the sub-tree root
> node.
> In general, and with respect to XML standards, this makes a great
> deal of sense (so you don't have prefixed elements/attributes without the
> corresponding namespace declaration).
>
> But, the application I've been building essentially takes an XML document
> and makes template-string blocks of text out of various sub-trees, to be
> later combined back into a full document using __getitem__ substitution in
> the form of "%(token)s" string formatting.
>
> The nsdecl patch of 1.3.5/6 causes interesting behaviour when the sub-tree
> being rendered is done in, for example, a loop - one "formatting
> operation" per iteration. Also interesting is when many such sub-trees
> are combined to form a document which (in my case) we know will have the
> declarations on the ultimate root node.

What you mean is that we actually make a copy of a non-root node and then copy
over the namespace declarations of the ancestors.

You say "interesting behaviour". Does that refer to the performance overhead
or is there a 'real' problem you see? Looking at the code now, I see some
potential optimisations, so if it's just the performance, here's a (trunk)
patch that should give a bit of relief.

> So ... I've got a patch I'm using in my local build which adds a keyword
> argument "nsdecl=True" to tostring(), tounicode() and tofilelike() - these
> are all the places which make use of the _writeNodeToBuffer() machinery.
> I can spin the patch against any 2.0.x or 2.1.x source tree.
>
> The argument defaults to True, to maintain backward compatibility, but can
> be provided as False to get <= 1.3.4 behaviour.
>
> Would you consider accepting this patch?

At first glance: no. I do not think there is general interest for a
serialisation that is not ns well-formed. You seem to have a rather special
use case here. I'm not even sure you have to do what you describe based on
serialised XML fragments. You might be able to do something like that with
subtrees. But that's very close to guessing.
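
[Editorial sketch] The serialisation behaviour under discussion can be observed directly, assuming a current lxml and Python 3. The document and the namespace URI `http://example.org/t` are made up for illustration. Serialising a subtree on its own copies the namespace declaration it needs from its ancestor onto the subtree root.

```python
from lxml import etree

# Made-up document: the prefix "t" is declared on the root only.
root = etree.XML(
    '<doc xmlns:t="http://example.org/t"><t:block>text</t:block></doc>')
block = root[0]

whole = etree.tostring(root)      # declaration appears once, on <doc>
fragment = etree.tostring(block)  # declaration is repeated on <t:block>

print(whole)
print(fragment)
```

Joining many such fragments back together with string formatting therefore repeats the declaration on every fragment. The result is still namespace well-formed, which is the property Stefan does not want to give up.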
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: copy-node-namespaces.patch
Type: application/octet-stream
Size: 786 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080505/711237cc/attachment.obj

From kris at cs.ucsb.edu Tue May 6 03:32:56 2008
From: kris at cs.ucsb.edu (kris)
Date: Mon, 05 May 2008 18:32:56 -0700
Subject: [lxml-dev] generative building of xml?
Message-ID: <1210037576.13243.63.camel@loup.ece.ucsb.edu>

I am generating, processing and eventually serializing
several XML streams. I was wondering if this was possible
to do with lxml?

Here's the setup. I've got several databases
generating XML content (which can be quite large), I really want
to be able to process the database record progressively
generating XML and sending out on its own stream.

An aggregator/filter (elsewhere) will read the streams
and parse them processing similar members and generate
a new stream based on the combined streams.

DB1    DB2   DB3     Core database
XML    XML   XML     XML generation
WS     WS    WS      delivery over a stream using generator
|      |     |
+------+-----+
      AGG            Parse and match incoming streams (iterparse)
      XML
      WS             send resulting merge as XML using generator.

So the questions:

1. Does anybody have a recipe to build a recursive generator using Element?

2. Given the above generator, is there any such
thing as a generator version etree.tostring?

--
Kristian Kvilekval
kris at cs.ucsb.edu http://www.cs.ucsb.edu/~kris w:805-636-1599 h:504-9756

From friedel at translate.org.za Tue May 6 11:09:47 2008
From: friedel at translate.org.za (F Wolff)
Date: Tue, 06 May 2008 11:09:47 +0200
Subject: [lxml-dev] Error reporting not clear
Message-ID: <1210064987.7179.30.camel@localhost>

Hallo Stefan and other lxml people.

I had a bug report which I traced to an invalid XML file. The error
message given by the parser was however not optimally useful.
The file
is available here (zipped):
http://bugs.locamotion.org/attachment.cgi?id=132

and a description of the problem here:
http://bugs.locamotion.org/show_bug.cgi?id=384

It might or might not be interesting to improve this error reporting, so
I thought I'll mention it.

Keep well
Friedel

From stefan_ml at behnel.de Tue May 6 18:14:51 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 6 May 2008 18:14:51 +0200 (CEST)
Subject: [lxml-dev] Error reporting not clear
In-Reply-To: <1210064987.7179.30.camel@localhost>
References: <1210064987.7179.30.camel@localhost>
Message-ID: <56049.194.114.62.38.1210090491.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

F Wolff wrote:
> I had a bug report which I traced to an invalid XML file. The error
> message given by the parser was however not optimally useful. The file
> is available here (zipped):
> http://bugs.locamotion.org/attachment.cgi?id=132
>
> and a description of the problem here:
> http://bugs.locamotion.org/show_bug.cgi?id=384
>
> It might or might not be interesting to improve this error reporting, so
> I thought I'll mention it.

The error comes from libxml2 as is. You can check the error log to see if that
is the only error that the parser reports, or if there are other errors that
might be more important.

http://codespeak.net/lxml/api.html#error-handling-on-exceptions

If you feel that lxml selects the wrong message from the error log, please
provide a list of errors as an example. The lxml version is also important in
this context, as there were improvements in the not-so-distant past.

Stefan

From usernamenumber at gmail.com Wed May 7 15:13:04 2008
From: usernamenumber at gmail.com (Brad Smith)
Date: Wed, 7 May 2008 09:13:04 -0400
Subject: [lxml-dev] Querying valid children of an element?
Message-ID:

Hello,

I just discovered lxml and am pretty excited about it.
There is one thing I'm having trouble figuring out how to do, though, if it's even possible: I am writing a tool that translates xml tags mixed with a wiki-like shorthand into full xml. It would be helpful to be able to sanity-check the mix of explicit tags and implicit tags I'm deriving from the shorthand by querying our DTD along the lines: "Is element foo legal within element bar" Same for CDATA. Is this possible using lxml? If not, is it possible using anything else? The best I've been able to come up with so far is to assemble a tree of dummy nodes in the proposed order and then validate it, but this seems wasteful. Thanks in advance for any help offered, --Brad -- ~ Second Shift: An original, serialized audio adventure ~ http://www.secondshiftpodcast.com From stefan_ml at behnel.de Thu May 8 09:22:04 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 May 2008 09:22:04 +0200 Subject: [lxml-dev] generative building of xml? In-Reply-To: <1210037576.13243.63.camel@loup.ece.ucsb.edu> References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> Message-ID: <4822AA1C.30604@behnel.de> Hi, kris wrote: > I am generating, processing and eventually serializing > several XML streams. I was wondering if this was possible > to do with lxml? Probably, although lxml is not designed for pipelined XML processing (any better than SAX, that is). It also depends on how your XML looks like. If it's from a database, it's probably something simple like ... ... ... That shouldn't cause too many problems, you can use the (SAX-like) target parser to copy it into a simple Python container class, use that inside your program, merge all of those objects into a single stream at some point and then generate a new XML stream from that. > Here's the setup. I've got several databases > generating XML content (which can be quite large), I really want > to be able to process the database record progressively > generating XML and sending out on its own stream. 
> > An aggregator/filter (elsewhere) will read the streams > and parse them processing similar members and generate > a new stream based on the combined streams. > > DB1 DB2 DB3 Core database > XML XML XML XML genaration > WS WS WS delivery over a stream using generator A generator? Interesting. Why not just a file-like object? If the interface is a generator (yielding strings, I assume), then you will have to use the feed parser interface to copy the data into the parser, otherwise, you can just use one thread per DB connection and have it read and parse the data for you. > 2. Given the above generator, is there any such > thing as a generator version etree.tostring? Nothing keeps you from yielding "", followed by the serialised stream entries (call tostring() on each separately), followed by a "". Stefan From stefan_ml at behnel.de Thu May 8 09:33:17 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 May 2008 09:33:17 +0200 Subject: [lxml-dev] Querying valid children of an element? In-Reply-To: References: Message-ID: <4822ACBD.6020302@behnel.de> Hi, Brad Smith wrote: > I just discovered lxml and am pretty excited about it. :) > I am writing a tool that translates xml tags mixed with a wiki-like > shorthand into full xml. It would be helpful to be able to > sanity-check the mix of explicit tags and implicit tags I'm deriving > from the shorthand by querying our DTD along the lines: "Is element > foo legal within element bar" Same for CDATA. > > Is this possible using lxml? If not, is it possible using anything > else? You could define your grammar in a way that is easily usable for you in your program and then generate a DTD from that. > The best I've been able to come up with so far is to assemble a > tree of dummy nodes in the proposed order and then validate it, but > this seems wasteful. Why? Don't you expect your users to get it right most of the time anyway? Why don't you just assemble the complete result tree and validate that? 
Is your program working on the tree itself or some other data representation? This might be of interest: http://codespeak.net/lxml/element_classes.html Stefan From kris at cs.ucsb.edu Thu May 8 20:03:46 2008 From: kris at cs.ucsb.edu (kris) Date: Thu, 08 May 2008 11:03:46 -0700 Subject: [lxml-dev] generative building of xml? In-Reply-To: <4822AA1C.30604@behnel.de> References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> <4822AA1C.30604@behnel.de> Message-ID: <1210269826.12910.28.camel@loup.ece.ucsb.edu> On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote: > Hi, > Probably, although lxml is not designed for pipelined XML processing (any > better than SAX, that is). > > It also depends on how your XML looks like. If it's from a database, it's > probably something simple like > > > > ... > ... > > ... > > > That shouldn't cause too many problems, you can use the (SAX-like) target > parser to copy it into a simple Python container class, use that inside your > program, merge all of those objects into a single stream at some point and > then generate a new XML stream from that. > > > > Here's the setup. I've got several databases > > generating XML content (which can be quite large), I really want > > to be able to process the database record progressively > > generating XML and sending out on its own stream. > > > > An aggregator/filter (elsewhere) will read the streams > > and parse them processing similar members and generate > > a new stream based on the combined streams. > > > > DB1 DB2 DB3 Core database > > XML XML XML XML genaration > > WS WS WS delivery over a stream using generator > > A generator? Interesting. Why not just a file-like object? I was thinking of a generator because I am feeding this to a stream that works with/on generators .. The databases are returning a top-k queries as xml files. Each DB keeps generating its best hits as a stream the aggregator sorts them and send them to the client. 
I would like to propagate the query all the way to the component
databases using generators to minimize the work each one does.

> If the interface is a generator (yielding strings, I assume), then you will
> have to use the feed parser interface to copy the data into the parser,
> otherwise, you can just use one thread per DB connection and have it read and
> parse the data for you.
>
>
> > 2. Given the above generator, is there any such
> > thing as a generator version etree.tostring?
>
> Nothing keeps you from yielding "", followed by the serialised stream
> entries (call tostring() on each separately), followed by a "".

Unfortunately it is a tree structure.. I would like to visit the tree
in something like;

yield ""
yield ' '
yield ' '
yield ' '
...
yield '
> Stefan

--
Kristian Kvilekval
kris at cs.ucsb.edu http://www.cs.ucsb.edu/~kris w:805-636-1599 h:504-9756

From jeff at ocjtech.us Thu May 8 21:30:16 2008
From: jeff at ocjtech.us (Jeffrey Ollie)
Date: Thu, 8 May 2008 14:30:16 -0500
Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4
Message-ID: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com>

Has anyone built lxml 2.0.5 on RHEL 4 or CentOS 4? When I submit it
to the Fedora/EPEL buildsystem I get the following error:

libxml/schematron.h: No such file or directory

I don't have direct access to a RHEL/CentOS 4 box so I can't do much
more debugging until I do get one set up. libxml2 is at version
2.6.16 in RHEL/CentOS 4.

The full build log is here:

http://buildsys.fedoraproject.org/logs/fedora-4-epel/38964-python-lxml-2.0.5-1.el4/ppc/build.log

Jeff

From usernamenumber at gmail.com Thu May 8 22:27:02 2008
From: usernamenumber at gmail.com (Brad Smith)
Date: Thu, 8 May 2008 16:27:02 -0400
Subject: [lxml-dev] Querying valid children of an element?
In-Reply-To: <20080508114453.13762360@mbook.local>
References: <4822ACBD.6020302@behnel.de> <20080508114453.13762360@mbook.local>
Message-ID:

To clarify about what I'm doing.
The goal is to have a shorthand
language (not entirely tag-based) that is easier for subject matter
experts to learn than docbook, which can then be converted into full
docbook once they've written a first draft.

So, to illustrate one aspect of it, instead of writing

foomaster example... $ foomaster [OPTIONS]

They can write

* foomaster example...
** $ foomaster [OPTIONS]

As you can see, the translation process consists of not just
converting asterisks into the appropriate combination of itemizedlists
and listitems, but also protecting cdata within paras where necessary.
In the first one, the interpreter sees that isn't allowed inside ,
which is its cue to try inserting a . is allowed within , so it does
not insert a para. Making that determination is what I'm trying to
find the best approach for.

Currently I use a function like this:

    def validateAppend(parent,child):
        parent.append(child)
        if not dtd.validate(parent):
            dbg("Appending %s to %s failed DTD validation" % (child.tag,parent.tag))
            del(parent[-1])
            return False
        return True

This works but, like I said, is not terribly efficient, so I just
wanted to see if there was another method for making the
determination.

--Brad

On Thu, May 8, 2008 at 11:44 AM, Mike Meyer wrote:
> On Thu, 08 May 2008 09:33:17 +0200 Stefan Behnel wrote:
>
>> > I am writing a tool that translates xml tags mixed with a wiki-like
>> > shorthand into full xml. It would be helpful to be able to
>> > sanity-check the mix of explicit tags and implicit tags I'm deriving
>> > from the shorthand by querying our DTD along the lines: "Is element
>> > foo legal within element bar" Same for CDATA.
>> >
>> > Is this possible using lxml? If not, is it possible using anything
>> > else?
>>
>> You could define your grammar in a way that is easily usable for you in your
>> program and then generate a DTD from that.
>
> Are you really using DTDs, and not using that as a catchall for the
> various Schema languages?
> > If so, then you might consider switching to a modern schema > language. RelaxNG lets you write regular expressions for CDATA, which > ought to work with wiki-like "tags", and I wouldn't be surprised to > find that Schematron is turing complete. > > -- ~ Second Shift: An original, serialized audio adventure ~ http://www.secondshiftpodcast.com From stefan_ml at behnel.de Fri May 9 10:35:22 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 09 May 2008 10:35:22 +0200 Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4 In-Reply-To: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> References: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> Message-ID: <48240CCA.9040306@behnel.de> Hi, Jeffrey Ollie wrote: > Has anyone built lxml 2.0.5 on RHEL 4 or CentOS 4? When I submit it > to the Fedora/EPEL buildsystem I get the following error: > > libxml/schematron.h: No such file or directory > > I don't have direct access to a RHEL/CentOS 4 box so I can't do much > more debugging until I do get one set up. libxml2 is at version > 2.6.16 in RHEL/CentOS 4. That's too old anyway. lxml > 1.3.x requires libxml2 2.6.21 (although I think 2.0.x still states it works with 2.6.20, which the above error proves wrong...) Two choices: stay with lxml 1.3 or build your own libxml2 as well. Stefan From stefan_ml at behnel.de Fri May 9 10:47:14 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 09 May 2008 10:47:14 +0200 Subject: [lxml-dev] generative building of xml? 
In-Reply-To: <1210269826.12910.28.camel@loup.ece.ucsb.edu>
References: <1210037576.13243.63.camel@loup.ece.ucsb.edu> <4822AA1C.30604@behnel.de> <1210269826.12910.28.camel@loup.ece.ucsb.edu>
Message-ID: <48240F92.4070704@behnel.de>

Hi,

kris wrote:
> On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote:
>> If the interface is a generator (yielding strings, I assume), then you will
>> have to use the feed parser interface to copy the data into the parser,
>> otherwise, you can just use one thread per DB connection and have it read and
>> parse the data for you.
>>
>>> 2. Given the above generator, is there any such
>>> thing as a generator version etree.tostring?
>> Nothing keeps you from yielding "", followed by the serialised stream
>> entries (call tostring() on each separately), followed by a "".
>
> Unfortunately it is a tree structure.. I would like to visit the tree
> in something like;
>
> yield ""
> yield ' '
> yield ' ...
> yield ' yield ' '
> yield ' '
> ...
> yield '

I think that's a bad idea, as you lose semantics that you will need to
recover in each generator step.

My approach would be: let the databases write file-like streams (a socket or
whatever), attach an iterparse() thread to each of them, copy the data of each
entry to a container object (or maybe just use iterparse() with
lxml.objectify), merge the container objects into a single stream in a thread
safe way and serialise the resulting stream of entries to an XML stream, maybe
even manually, as I suggested.

Stefan

From jeff at ocjtech.us Fri May 9 13:54:36 2008
From: jeff at ocjtech.us (Jeffrey Ollie)
Date: Fri, 9 May 2008 06:54:36 -0500
Subject: [lxml-dev] Building lxml 2.0.5 on RHEL/CentOS 4
In-Reply-To: <48240CCA.9040306@behnel.de>
References: <935ead450805081230ga6b654fo7e07f1c7a03dbf60@mail.gmail.com> <48240CCA.9040306@behnel.de>
Message-ID: <935ead450805090454m47dc97a7kf8211f2b35ac3854@mail.gmail.com>

On Fri, May 9, 2008 at 3:35 AM, Stefan Behnel wrote:
>
> That's too old anyway.
lxml > 1.3.x requires libxml2 2.6.21 (although I think
> 2.0.x still states it works with 2.6.20, which the above error proves wrong...)
>
> Two choices: stay with lxml 1.3 or build your own libxml2 as well.

Doh, my sleep-deprived brain thought that I had built an earlier
version of lxml 2.0.x for RHEL 4, but I guess the last build was
1.3.6. Thanks for the wake up call!

Jeff

From bba at inbox.com Fri May 9 16:10:06 2008
From: bba at inbox.com (Ben)
Date: Fri, 9 May 2008 06:10:06 -0800
Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)
Message-ID:

Hello

I'm writing some code to check whether our daily backups worked. Backup Exec
stores its results in XML files. Sometimes bad characters - or maybe it is
binary data - end up in these XML files and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make
it through the email):

Directory not found.
Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\????\\VIC-ve\TT\miscellaneous and its subdirectories.

Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):

##################################
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##############################

The code works fine unless there are invalid characters in the file, and I am
happy for any suggestion, because the bit I'm interested in is always near the
end of the xml file, and there should be a way to get it reliably regardless
of the gunk elsewhere in the file (or that's what I hope)

Also, I've tried the 'recover' parser option, but I'm doing something wrong,
because I get this:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but can't see how I should be
doing it.

Any suggestion, including how to circumvent/work around the problem is most
welcome.

From stefan_ml at behnel.de Fri May 9 16:42:16 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 09 May 2008 16:42:16 +0200
Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)
In-Reply-To:
References:
Message-ID: <482462C8.2020108@behnel.de>

Hi,

Ben wrote:
> Xml = etree.parse(XmlFileName)
> ##############################
> XmlFileName = r'c:/BEX03194.xml'
> parser = etree.XMLParser(recover=True)
> Xml = etree.parse(StringIO(XmlFileName), parser)

Not sure if this is just a "find-a-short-example" error, but you parse
the filename, not the file here. This should read

    Xml = etree.parse(XmlFileName, parser)

> Also, I've tried the 'recover' parser option, but I'm doing something wrong,
> because I get this:
>
> C:\>python sb-lxml.py
> Traceback (most recent call last):
>   File "sb-lxml.py", line 9, in <module>
>     print Xml.findtext(".//end_time")
>   File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext
> (src/lxml/lxml.etree.c:15354)
>   File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot
> (src/lxml/lxml.etree.c:14116)
> AssertionError: ElementTree not initialized, missing root

I guess that happens when the parser "recover"s from not finding any XML at
all. Maybe we should still raise an exception in this case instead of
returning an empty ElementTree. This is really an extreme case of broken
data...

Stefan

From bba at inbox.com Fri May 9 17:15:31 2008
From: bba at inbox.com (Ben)
Date: Fri, 9 May 2008 07:15:31 -0800
Subject: [lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)
In-Reply-To: <482462C8.2020108@behnel.de>
References:
Message-ID:

> Stefan wrote:
>
> Not sure if this is just a "find-a-short-example" error, but you parse
> the filename, not the file here.
This should read
>
> Xml = etree.parse(XmlFileName, parser)

(LOL) This is indeed a "find-a-short-example" error - which is what you use
when you are a sysadmin. Now it works and gets me past the invalid characters
too.

Thanks for lxml

From aryeh at bigfoot.com Fri May 9 18:26:22 2008
From: aryeh at bigfoot.com (Arye)
Date: Fri, 9 May 2008 18:26:22 +0200
Subject: [lxml-dev] validation with multiple XSD files
Message-ID:

Hello all,

I would like to do some schema validation and started with the instructions
in:
http://codespeak.net/lxml/dev/validation.html#xmlschema

This all works great.

Now I would like to extend this to an XSD file that includes many other files.
In other words I have a directory of XSD files that I would like to use.

The include statement looks like this (the included file is referenced by its
name):

... ... some types defined in "base.xsd" are used here

I am new to lxml so sorry in advance if the question does not make sense.

Regards,

Arye.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080509/eabd6f62/attachment.htm

From jlovell at esd189.org Fri May 9 18:38:08 2008
From: jlovell at esd189.org (John Lovell)
Date: Fri, 9 May 2008 09:38:08 -0700
Subject: [lxml-dev] validation with multiple XSD files
In-Reply-To:
References:
Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A22D4@ZIRIA.esd189.org>

Arye:

I had a similar problem and this is how I handled it.

http://messagesleuth.svn.sourceforge.net/viewvc/messagesleuth/trunk/xsd/xsd2one.py?view=markup

I didn't ask the group so others may have a better or more full featured
approach.

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.esd189.org
Together We Can ...
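
[Editorial sketch, not part of either message above.] For the question about an
XSD that includes other files by name, here is a minimal sketch that assumes
lxml resolves a relative schemaLocation against the location of the schema
document when that document is parsed from a path. The file names base.xsd and
main.xsd, the type name Code, the element name item and the helper
validate_with_include() are all made up for illustration. This is not John's
xsd2one.py script and has not been checked against lxml 2.0.5.

```python
import os
import shutil
import tempfile

try:
    from lxml import etree
except ImportError:  # lxml not installed; the sketch below cannot run
    etree = None

# Hypothetical stand-ins for the poster's schema files.
BASE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Code">
    <xs:restriction base="xs:string">
      <xs:length value="3"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

MAIN_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="base.xsd"/>
  <xs:element name="item" type="Code"/>
</xs:schema>
"""


def validate_with_include(xml_text):
    """Write both schemas into one directory, load main.xsd by path, validate."""
    tmpdir = tempfile.mkdtemp()
    try:
        for name, text in (("base.xsd", BASE_XSD), ("main.xsd", MAIN_XSD)):
            with open(os.path.join(tmpdir, name), "w") as f:
                f.write(text)
        # Assumption: parsing from a path gives the schema document a base
        # URL, so the relative schemaLocation is looked up next to main.xsd.
        schema = etree.XMLSchema(etree.parse(os.path.join(tmpdir, "main.xsd")))
        return schema.validate(etree.fromstring(xml_text))
    finally:
        shutil.rmtree(tmpdir)


if etree is not None:
    print(validate_with_include("<item>abc</item>"))
    print(validate_with_include("<item>toolong</item>"))
```

If the main schema is instead parsed from a string, there is no directory to
resolve "base.xsd" against, so loading it from its real location on disk is
the simpler thing to try first.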
________________________________ From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Arye Sent: Friday, May 09, 2008 9:26 AM To: lxml-dev at codespeak.net Subject: [lxml-dev] validation with multiple XSD files Hello all, I would like to so some schema validation and started with the instructions in : http://codespeak.net/lxml/dev/validation.html#xmlschema This all works great. Now I would like to extend this to a XSD file that includes many other files. In other words I have a directory of XSD files that I would like to use. The include statement look like this (the included file is referenced by its name): ... ... some types defined in "base.xsd" are used here I am new to lxml so sorry in advance if the question does not make sense. Regards, Arye. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080509/818099fa/attachment-0001.htm From kumar.mcmillan at gmail.com Sat May 10 23:46:00 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sat, 10 May 2008 16:46:00 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? Message-ID: I know this has been discussed over and over but I'm writing to see if anyone has made a breakthrough yet. The problem of course is that Leopard's builtin libxml2 and libxslt are too old for lxml 2.0. You have to build libxml2 either from source or use a port. 
There is currently a problem with the libxml2 port, but the workaround
is going fine for me: http://trac.macports.org/ticket/15230 (I know
because postgres built just fine and I have some tests exercising
psycopg2 as well)

So after updating my libxml2 to 2.6.31 and libxslt to 1.1.23 and
updating my $PATH so that the new xml2-config and xslt-config can be
found, I can build lxml *without errors* but I see these warnings:

$ sudo easy_install lxml-2.0.5.tgz
Processing lxml-2.0.5.tgz
Running lxml-2.0.5/setup.py -q bdist_egg --dist-dir /tmp/easy_install-3azY8e/lxml-2.0.5/egg-dist-tmp-t80esG
Building lxml version 2.0.5.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c' needs to be available.
Using build configuration of libxslt 1.1.23
ld: warning in /opt/local/lib/libxslt.dylib, file is not of required architecture
ld: warning in /opt/local/lib/libexslt.dylib, file is not of required architecture
ld: warning in /opt/local/lib/libxml2.dylib, file is not of required architecture
[... and more like this ...]
...
Finished processing dependencies for lxml==2.0.5

What doesn't make sense is these files seem fine to me:

$ file -L /opt/local/lib/libxslt.dylib
/opt/local/lib/libxslt.dylib: Mach-O dynamically linked shared library i386
$ file -L /opt/local/lib/libexslt.dylib
/opt/local/lib/libexslt.dylib: Mach-O dynamically linked shared library i386

I was having similar trouble like this on Tiger and I had test cases
in my own test suite that would consistently segfault. On Leopard,
those same test cases were *not* segfaulting but now I have some
different test cases that are consistently segfaulting. The segfault
looks like this in the crash log:

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000008
Crashed Thread: 0

Thread 0 Crashed:
0 libxml2.2.dylib 0x90d39419 xmlDictLookup + 360
1 libxml2.2.dylib 0x025626e4 xmlXPathCompExprAdd + 212
2 libxml2.2.dylib 0x025709c6 xmlXPathCompPathExpr + 3910
[etc....]
Setting my dyld path (like suggested in the docs, export DYLD_LIBRARY_PATH=/opt/local/lib:/usr/lib) *does* make my test cases run without segfault so I'm assuming what's happening is lxml is using the older dylibs at runtime. This is a really lame way to fix the problem! Specifically, my svn binaries do not like this dylib setting, producing errors like: $ svn ls dyld: lazy symbol binding failed: Symbol not found: _iconv_open Referenced from: /usr/lib/libaprutil-1.0.dylib Expected in: /opt/local/lib/libiconv.2.dylib [etc] (This is slightly odd since I included /usr/lib but whatever.) *sigh* Next, I tried doing a static build of lxml by setting STATIC_LIBRARY_DIRS = ['/opt/local/lib'] in setup.py and running: python setup.py bdist_egg --static --with-xml2-config=/opt/local/bin/xml2-config --with-xslt-config=/opt/local/bin/xslt-config I had to fiddle with gcc to get this to build but otherwise it built fine and installed ok but I did not see any difference. Still consistent segfaults that are fixed by setting the dyld path. Now I'm out of ideas. Does anyone have another suggestion? Until then I have a stupid bash file that I have to source anytime I want to work on lxml. -Kumar From stefan_ml at behnel.de Sun May 11 09:01:01 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 11 May 2008 09:01:01 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: Message-ID: <482699AD.4030905@behnel.de> Hi Kumar, you ask why this is so hard? Simple answer: because no-one has contributed a way so far to make it easier. We had lots of reports about stuff not working and almost as many work-arounds, but no-one came up with a patch that would allow building lxml reliably at least on a subset of Mac-OS systems. And I just cannot believe that there is no-one amongst the Mac-OS-X users who knows how to use distutils to build a binary extension. 
Or at least someone who knows how to build C code
statically against a C library.

From my POV, Mac-OS seems to lack three things that make this problem
non-trivial. It doesn't have a standard package management system. Neither
does it have something like the Linux Standard Base, which dictates where
newly installed things belong. And it doesn't seem to support "rpath", which
would allow a binary to say "I know where my dependencies come from". Or at
least distutils don't support that on Mac. So everything I could try here on
Linux to make it work better is bound to fail.

Kumar McMillan wrote:
> I know this has been discussed over and over but I'm writing to see if
> anyone has made a breakthrough yet. The problem of course is that
> Leopard's builtin libxml2 and libxslt are too old for lxml 2.0. You
> have to build libxml2 either from source or use a port.

[lots of important details skipped to keep this at a higher level for now]

> Next, I tried doing a static build of lxml by setting
> STATIC_LIBRARY_DIRS = ['/opt/local/lib'] in setup.py and running:
>
> python setup.py bdist_egg --static
> --with-xml2-config=/opt/local/bin/xml2-config
> --with-xslt-config=/opt/local/bin/xslt-config
>
> I had to fiddle with gcc to get this to build but otherwise it built
> fine and installed ok but I did not see any difference. Still
> consistent segfaults that are fixed by setting the dyld path.

This is because the --static switch was made specifically for static building
on Windows, which has even less support for package management or even
half-decent software installation. It just doesn't support Mac-OS as no-one
ever told me how to support it.

If you want this to run, let's make a deal. Here is a patch (against the
trunk, but should work with 2.0.x) that lets --static require setting the
STATIC_*_DIRS variables only on Windows, which should result in reading the
directories from xml2-config/xslt-config if the hard-coded setup is not
provided.
Given your above example, this should be the right thing to do. Now, please look at the function "libraries()" in setupinfo.py and fix it up for Mac-OS-X (and for whatever sys.platform calls it) to find the correct static libraries in these directories. If you get it to run reliably on your system, just with your above command line, I'll make sure it gets into 2.0.6. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: setupinfo.py-static-non-windows.patch Type: text/x-patch Size: 1871 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080511/dd5990b2/attachment.bin From mwm-keyword-lxml.9112b8 at mired.org Sun May 11 20:48:04 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sun, 11 May 2008 14:48:04 -0400 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <482699AD.4030905@behnel.de> References: <482699AD.4030905@behnel.de> Message-ID: <20080511144804.69429df6@bhuda.mired.org> On Sun, 11 May 2008 09:01:01 +0200 Stefan Behnel wrote: > you ask why this is so hard? Simple answer: because no-one has contributed a > way so far to make it easier. Gee, I had no trouble at all doing this last week (the release of Oracle library bits for Intel OS-X means it's now desirable). I installed macports, did a self-update, then installed py25-lxml. It installed python2.5.2 and the versions of libxml2 and libxslt that were in macports as part of the process. Installing cx_Oracle after that was more work. > We had lots of reports about stuff not working and almost as many > work-arounds, but no-one came up with a patch that would allow building lxml > reliably at least on a subset of Mac-OS systems. And I just cannot believe > that there is no-one amongst the Mac-OS-X users who knows how to use distutils > to build a binary extension. Or at least someone who knows how to build C code > statically against a C library. 
I'm sorry, but my experience is that binary distributions make the problems *worse*, not better - at least if you require multiple different components to be installed. You have to make sure the components all agree about the builds of any libraries they have in common, and unless you have a coordinated build, that just doesn't happen. After all, I could build a binary distribution of lxml from macports, but to use it, you'd have to have the macports versions of python, libxml2 and libxslt. If you've got that, it's probably easier to install the macports version than it is to download and install whatever I might build. I could use ports to build a binary package with all those things in it - is there anyone who really wants that? I started working with lxml last year, when the latest version was 1.3.3. Since updating the software after deployment would be a traumatic operation (a single "instance" of the application uses about 40 cores spread across 10 systems and two SANs, and we typically run three instances), I wanted the latest stable versions of everything. I looked at five different systems, and on only two was getting that combination sane: OSX using macports, and FreeBSD, mostly because they had the latest versions in the ports system when I went to look. The three GNU/Linux systems either had old versions of Python, of the xml libraries, and lxml was either old or missing. So I wound up doing initial development on OSX and FreeBSD while we dealt with the GNU/Linux platform we were speced to run on. Grabbing binaries only half worked. Replacing the installed system tools on GNU/linux is a recipe for disaster, and our sysadmins correctly refused to do so. Which means the LSB is no help at all. The sysadmins found a binary build of python 2.5, and installed that. 
We then grabbed the lxml rpm from PyPI, and installed that - only it wouldn't run, because it had been built against a version of Python that was compiled with a python shared library, and the version we had hadn't been.

I eventually wound up building everything - Python, libxml2, libxslt, lxml and cx_Oracle - by hand to run our installation on, and providing a carefully tailored environment to run things in. Which is what I did on OSX and FreeBSD, except their ports systems make building from sources trivial, and FreeBSD doesn't need the tailored environment. Updating those two systems is trivial. Updating the GNU/Linux systems is several days' worth of work just to get to the point where I have something to give to operations.

> From my POV, Mac-OS seems to lack three things that make this problem
> non-trivial. It doesn't have a standard package management system. Neither
> does it have something like the Linux Standard Base, which dictates where
> newly installed things belong. And it doesn't seem to support "rpath", which
> would allow a binary to say "I know where my dependencies come from". Or at
> least distutils don't support that on Mac. So everything I could try here on
> Linux to make it work better is bound to fail.

Providing a binary distribution for *any* system that includes libraries that are "too old" in the base system is going to be a major pain. Everyone runs into this on OSX, because they didn't update those libraries in 10.5 for some unknown reason (I've filed bug ID #5926693 at adc about this). Those of us building for corporate environments - where we always run on out of date platforms, because we can't get corporate approval to use a new one before it becomes out of date - run into it all the time.

http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

From kumar.mcmillan at gmail.com Mon May 12 00:49:18 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Sun, 11 May 2008 17:49:18 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <482699AD.4030905@behnel.de>
References: <482699AD.4030905@behnel.de>
Message-ID:

Stefan, thanks for all the info

On Sun, May 11, 2008 at 2:01 AM, Stefan Behnel wrote:
> From my POV, Mac-OS seems to lack three things that make this problem
> non-trivial. It doesn't have a standard package management system. Neither
> does it have something like the Linux Standard Base, which dictates where
> newly installed things belong. And it doesn't seem to support "rpath", which
> would allow a binary to say "I know where my dependencies come from". Or at
> least distutils don't support that on Mac. So everything I could try here on
> Linux to make it work better is bound to fail.

I don't have experience building native OS X applications but I've done a little more research into the problem and I think it is specifically this:

"/usr/lib/libxml2.2.dylib uses two-level namespace, meaning that the Foundation framework will always use this one instead of yours"
-- from http://0xced.blogspot.com/2006/07/dealing-with-outdated-open-source-libs.html

What is two-level namespacing? Good question. I haven't quite figured that out yet but as the blog post suggests, you can "flatten" it at runtime by setting

DYLD_FORCE_FLAT_NAMESPACE=1

And, by golly, this actually works -- that is, with it set in my shell, the test cases that would otherwise segfault run smoothly. Also, this doesn't screw up my lib paths like setting DYLD_LIBRARY_PATH does (the conflict with subversion went away!). From more googling it does appear however that setting this var might confuse some apps that do rely on two-level namespacing.
So far *my* problems have gone away (well, besides this being a kludge) but I guess I'll have to keep an eye on it. The dyld manual didn't help me understand this any better: http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/dyld.1.html > > If you want this to run, let's make a deal. Here is a patch (against the > trunk, but should work with 2.0.x) that lets --static require setting the > STATIC_*_DIRS variables only on Windows, which should result in reading the > directories from xml2-config/xslt-config if the hard-coded setup is not > provided. Given your above example, this should be the right thing to do. Now, > please look at the function "libraries()" in setupinfo.py and fix it up for > Mac-OS-X (and for whatever sys.platform calls it) to find the correct static > libraries in these directories. If you get it to run reliably on your system, > just with your above command line, I'll make sure it gets into 2.0.6. I'm willing to do whatever I can to contribute a better Mac OS X build process for lxml. However, I'm not experienced with using ext_modules in python and am having a hard time following your suggestions. You say your patch removed the enforcement of STATIC_*_DIRS but that was never a problem. in fact, that seems to confuse gcc when building with --static since it produces orphaned -I args (no directory attached) Next, you suggest to adjust the sys.platform checks. sys.platform always equals "darwin" on OS X but where would I want to make adjustments? I don't understand what this is doing in libraries() : if sys.platform in ('win32',): libs = ['%s_a' % lib for lib in libs] if I add "darwin" to the list, I get the error: ld: library not found for -lxslt_a whereas -lxslt is the correct arg (just like on linux). In my /opt/local/lib dir I have libxslt.dylib, libxslt.la, libexslt.dylib, libexslt.la, etc. I tried changing the above list comprehension to generate .la names but that didn't work either (still said library not found). 
I'm still not clear on how to statically link the libxml libraries and that's the first step to solving the problem. If anyone has done this, please let me know and I'll have another go at it. Maybe I need to use libtool to produce static versions then link to those. More googling suggests it *is* possible ;) http://lists.apple.com/archives/Unix-porting/2006/Aug/msg00012.html gcc man pages are not helping me. -Kumar From kumar.mcmillan at gmail.com Mon May 12 01:00:26 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Sun, 11 May 2008 18:00:26 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080511144804.69429df6@bhuda.mired.org> References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: On Sun, May 11, 2008 at 1:48 PM, Mike Meyer wrote: > On Sun, 11 May 2008 09:01:01 +0200 > Stefan Behnel wrote: > >> you ask why this is so hard? Simple answer: because no-one has contributed a >> way so far to make it easier. > > Gee, I had no trouble at all doing this last week (the release of > Oracle library bits for Intel OS-X means it's now desirable). I > installed macports, did a self-update, then installed py25-lxml. It > installed python2.5.2 and the versions of libxml2 and libxslt that > were in macports as part of the process. The build of lxml doesn't fail and you probably won't see any errors unless you are using xpath. In fact, running selftest.py after building passes for me (I'm not sure if that runs all tests or not) but I do get a consistent segfault in my program. Looking at the macport of py25-lxml I don't see any flags that would indicate they have accomplished statically linking the new libxml libs. I don't like to use ports of python modules because /opt/local/bin/python doesn't mix well with a Framework python installation from my experience. 
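
The runtime mix-up described in this thread (lxml built against a newer libxml2/libxslt but loading the older system copies) can be checked from Python. The following is only a minimal sketch: it assumes the installed lxml exposes the version tuples `LIBXML_COMPILED_VERSION`, `LIBXML_VERSION`, `LIBXSLT_COMPILED_VERSION` and `LIBXSLT_VERSION` on `lxml.etree`, and the helper names `describe_mismatch` and `check_lxml_libraries` are made up for illustration. Matching versions do not prove that the two-level namespace issue is absent; a mismatch is merely a strong hint.

```python
def describe_mismatch(name, compiled, runtime):
    """Return a message if build-time and run-time versions differ, else None."""
    if tuple(compiled) == tuple(runtime):
        return None

    def fmt(version):
        return ".".join(str(part) for part in version)

    return "%s: built against %s but running with %s" % (
        name, fmt(compiled), fmt(runtime))


def check_lxml_libraries():
    """Collect mismatch messages for libxml2 and libxslt as seen by lxml."""
    try:
        from lxml import etree
    except ImportError:
        return ["lxml could not be imported"]
    pairs = [
        ("libxml2", etree.LIBXML_COMPILED_VERSION, etree.LIBXML_VERSION),
        ("libxslt", etree.LIBXSLT_COMPILED_VERSION, etree.LIBXSLT_VERSION),
    ]
    messages = [describe_mismatch(n, c, r) for n, c, r in pairs]
    return [m for m in messages if m is not None]


if __name__ == "__main__":
    report = check_lxml_libraries() or ["build-time and run-time versions agree"]
    for message in report:
        print(message)
```

Running this once with and once without the DYLD_* variables set should show whether the environment changes which library gets loaded.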
From mwm-keyword-lxml.9112b8 at mired.org Mon May 12 01:26:48 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sun, 11 May 2008 19:26:48 -0400 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: <20080511192648.28fcf02d@bhuda.mired.org> On Sun, 11 May 2008 18:00:26 -0500 "Kumar McMillan" wrote: > On Sun, May 11, 2008 at 1:48 PM, Mike Meyer wrote: > > On Sun, 11 May 2008 09:01:01 +0200 > > Stefan Behnel wrote: > > > >> you ask why this is so hard? Simple answer: because no-one has contributed a > >> way so far to make it easier. > > > > Gee, I had no trouble at all doing this last week (the release of > > Oracle library bits for Intel OS-X means it's now desirable). I > > installed macports, did a self-update, then installed py25-lxml. It > > installed python2.5.2 and the versions of libxml2 and libxslt that > > were in macports as part of the process. > > The build of lxml doesn't fail and you probably won't see any errors > unless you are using xpath. In fact, running selftest.py after > building passes for me (I'm not sure if that runs all tests or not) > but I do get a consistent segfault in my program. Well, we make fairly heavy use of xpath (we use it to extract millions of records/minute in our ETL system, plus provide default attributes in the xml config file), so if it's a problem, I'm sure I'll see it. The few tests I've run so far worked fine. Care to provide an example that breaks? > Looking at the macport of py25-lxml I don't see any flags that would > indicate they have accomplished statically linking the new libxml > libs. I don't like to use ports of python modules because > /opt/local/bin/python doesn't mix well with a Framework python > installation from my experience. 
That's always a problem when you start building your version of languages in the base system - you probably can't use the platform-specific modules that are in the base system's language installation. I can't get to any of the rpm-related python modules on RHEL with my custom python installed.

Fortunately, I don't need access to either the rpm libraries or the mac Python frameworks in my applications.

http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

From kumar.mcmillan at gmail.com Mon May 12 02:23:09 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Sun, 11 May 2008 19:23:09 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <20080511192648.28fcf02d@bhuda.mired.org>
References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> <20080511192648.28fcf02d@bhuda.mired.org>
Message-ID:

On Sun, May 11, 2008 at 6:26 PM, Mike Meyer wrote:
>
> Well, we make fairly heavy use of xpath (we use it to extract millions
> of records/minute in our ETL system, plus provide default attributes
> in the xml config file), so if it's a problem, I'm sure I'll see
> it. The few tests I've run so far worked fine.

huh, yeah it does seem like you'd see a crash. Maybe the py25-lxml port gains some advantages from getting built within the macports environment somehow.

> Care to provide an
> example that breaks?

unfortunately, I don't think I have one, not something that is decoupled from the app I'm working on anyway. The app I'm working on makes heavy use of lxml.html to spider through the web, uses xpath() here and there, and the test cases use xpaths for assertions. However, I see the segfault in strange places. For example, if I run all tests at once (I'm using nose) then I usually don't see a segfault. But if I run test cases by themselves I will generally see a segfault.
And if I do, it is a consistent segfault. Looking at the crash log I can see that it's on an xpath lookup (I posted this earlier). However, to make matters worse, the test cases I can trigger segfaults in generally do not seem to touch any of the xpath code :/ Nonetheless, all the workarounds I've mentioned stop the segfaults. From stefan_ml at behnel.de Mon May 12 10:41:13 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 10:41:13 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080511144804.69429df6@bhuda.mired.org> References: <482699AD.4030905@behnel.de> <20080511144804.69429df6@bhuda.mired.org> Message-ID: <482802A9.9070809@behnel.de> Hi, Mike Meyer wrote: > On Sun, 11 May 2008 09:01:01 +0200 > Stefan Behnel wrote: > >> you ask why this is so hard? Simple answer: because no-one has contributed a >> way so far to make it easier. > > Gee, I had no trouble at all doing this last week (the release of > Oracle library bits for Intel OS-X means it's now desirable). I > installed macports, did a self-update, then installed py25-lxml. It > installed python2.5.2 and the versions of libxml2 and libxslt that > were in macports as part of the process. Installing cx_Oracle after > that was more work. > >> We had lots of reports about stuff not working and almost as many >> work-arounds, but no-one came up with a patch that would allow building lxml >> reliably at least on a subset of Mac-OS systems. And I just cannot believe >> that there is no-one amongst the Mac-OS-X users who knows how to use distutils >> to build a binary extension. Or at least someone who knows how to build C code >> statically against a C library. > > I'm sorry, but my experience is that binary distributions make the > problems *worse*, not better I wasn't talking about distributing binaries. I meant: someone has to provide a way to configure the compiler so that it builds lxml statically on Mac-OS. 
Stefan

From stefan_ml at behnel.de Mon May 12 11:04:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 12 May 2008 11:04:44 +0200
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To:
References: <482699AD.4030905@behnel.de>
Message-ID: <4828082C.3020202@behnel.de>

Hi,

Kumar McMillan wrote:
> I don't have experience building native OS X applications

See? That seems to be a general problem amongst Mac-OS users. If no-one using that platform knows how to build a C program, how am I supposed to know it?

> What is two-level namespacing?

*shrug*, I prefer an automatic static build on Mac-OS anyway.

> You say your patch removed the enforcement of STATIC_*_DIRS but that was
> never a problem.

It was, as it requires manual interaction by users that should only be required in stupid "who-needs-a-system-compiler-anyway" environments like Windows.

> in fact, that seems to confuse gcc when building
> with --static since it produces orphaned -I args (no directory
> attached)

It just disables the requirement for setting the variables. It doesn't configure anything so far. The config has to come from xml2-config and xslt-config.

> Next, you suggest to adjust the sys.platform checks. sys.platform
> always equals "darwin" on OS X

Ok, then the function will likely look something like this:

    def libraries():
        if sys.platform in ('win32', 'darwin'):
            libs = ['libxslt', 'libexslt', 'libxml2', 'iconv']
        else:
            libs = ['xslt', 'exslt', 'xml2', 'z', 'm']
        if OPTION_STATIC:
            if sys.platform in ('win32',):
                libs = ['%s_a' % lib for lib in libs]
            elif sys.platform in ('darwin',):
                libs = ['%s.a' % lib for lib in libs]
        if sys.platform in ('win32',):
            libs.extend(['zlib', 'WS2_32'])
        return libs

Minus some changes for libiconv and libz.

> but where would I want to make
> adjustments?
I don't understand what this is doing in libraries() : > > if sys.platform in ('win32',): > libs = ['%s_a' % lib for lib in libs] > > if I add "darwin" to the list, I get the error: > ld: library not found for -lxslt_a The static libraries are called xxx_a in Windows. If someone can figure out what they are called on Mac-OS, I can fill it in myself. > whereas -lxslt is the correct arg (just like on linux). In my > /opt/local/lib dir I have libxslt.dylib, libxslt.la, libexslt.dylib, > libexslt.la, etc. I tried changing the above list comprehension to > generate .la names but that didn't work either (still said library not > found). Hmmm, on Linux, the static libraries are called "libxml2.a" etc. Can you find anything like that on your system? Stefan From kumar.mcmillan at gmail.com Mon May 12 17:15:24 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Mon, 12 May 2008 10:15:24 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <4828082C.3020202@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> Message-ID: On Mon, May 12, 2008 at 4:04 AM, Stefan Behnel wrote: > Kumar McMillan wrote: > > I don't have experience building native OS X applications > > See? That seems to be a general problem amongst Mac-OS users. Most people "use" computers they don't build them. Your work is greatly appreciated! :) > > What is two-level namespacing? > > *shrug*, I prefer an automatic static build on Mac-OS anyway. me too, I think that would be the right solution. > > in fact, that seems to confuse gcc when building > > with --static since it produces orphaned -I args (no directory > > attached) > > It just disables the requirement for setting the variables. It doesn't > configure anything so far. The config has to come from xml2-config and > xslt-config. something is going wrong then with --static because I get "Python.h not found" errors and the gcc command looked something like this: gcc ... 
-I -I/path/to/python/headers notice the orphaned -I call where, afaict, STATIC_INCLUDE_DIRS was previously getting inserted. Just a theory. > Hmmm, on Linux, the static libraries are called "libxml2.a" etc. Can you find > anything like that on your system? OK, I dug up some more dirt. The problem with the macport of libxml2 is that it doesn't build static libraries. From the port file itself, I now see: --disable-static, doh! But, yeah, I think if I build my own with --disable-shared and then point to that dir as an include this might work. And I assume I will probably get a libxml2.a file out of that build. But is that a feasible end user solution? That is, I'm not convinced this will make great strides in solving the lxml runtime problem where it uses the wrong version of libxml2 / libxslt. Kumar From stefan_ml at behnel.de Mon May 12 18:26:44 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 18:26:44 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <20080512113934.6c774076@mbook.local> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> Message-ID: <48286FC4.6030309@behnel.de> Hi, Mike Meyer wrote: > Apple's official position is that static linking of > applications is unsupported. They don't provide static versions of any > of the system libraries. > > Likewise, macports doesn't provide static libraries for the libraries > it installs, and the docs don't hint at anyway to get it to do so. Great! Now that would have been too easy anyway, wouldn't it? :-/ Thanks for the infos. Now, anyone for a plan B? Stefan From kumar.mcmillan at gmail.com Mon May 12 18:37:35 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Mon, 12 May 2008 11:37:35 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? 
In-Reply-To: <48286FC4.6030309@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> Message-ID: On Mon, May 12, 2008 at 11:26 AM, Stefan Behnel wrote: > > Likewise, macports doesn't provide static libraries for the libraries > > it installs, and the docs don't hint at anyway to get it to do so. > > Great! Now that would have been too easy anyway, wouldn't it? :-/ > > Thanks for the infos. Now, anyone for a plan B? It looks to me like the typical way to do this in an OS X app is to compile your static libs then bundle them with your application (and as Mike pointed out, Apple does not recommend this). Obviously there is a ram penalty for that (the custom lib). I don't see lxml distributing static libs just for OS X :) The best thing I can think of is to get --static working for libxml2.a files and then I can submit to you the steps I took to build my static libs from source (assuming I can get that all to work). Would that be useful? If it proves too cumbersome I might just continue to use the DYLD_FORCE_FLAT_NAMESPACE var at runtime even though that's bound to bite me someday. Unlike Mike I am fortunate enough not to be using lxml in *production* on OS X ... yet also misfortunate enough to be the only one who sees segfaults :( K From stefan_ml at behnel.de Mon May 12 20:39:57 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 12 May 2008 20:39:57 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> Message-ID: <48288EFD.7060207@behnel.de> Kumar McMillan wrote: > The best thing I can think of is to get --static working for libxml2.a > files and then I can submit to you the steps I took to build my static > libs from source (assuming I can get that all to work). 
I think a buildout will help here, as previously proposed a couple of times.

http://pypi.python.org/pypi/zc.buildout
http://pypi.python.org/pypi/zc.recipe.cmmi

We can configure it to build only the static versions of libxml2 and libxslt, and then build against those.

------------------------
[libxml2]
recipe = zc.recipe.cmmi
url = http://ftp.gnome.org/pub/GNOME/sources/libxml2/2.6/libxml2-2.6.32.tar.gz
extra_options = --without-python --enable-shared --enable-static

[libxslt]
recipe = zc.recipe.cmmi
url = http://ftp.gnome.org/pub/GNOME/sources/libxslt/1.1/libxslt-1.1.22.tar.bz2
extra_options = --with-libxml-prefix=${buildout:directory}/parts/libxml2/
                --without-python --disable-shared --enable-static

[lxml]
recipe = zc.recipe.egg:custom
egg = lxml
include-dirs = ${buildout:directory}/parts/libxml2/include/libxml2
               ${buildout:directory}/parts/libxslt/include
library-dirs = ${buildout:directory}/parts/libxml2/lib
               ${buildout:directory}/parts/libxslt/lib
------------------------

lxml's setup.py would then need to be changed to automatically compile statically on the Mac-OS platform. Although maybe we should only do that if buildout is running (sys.modules?).

Stefan

From xkenneth at gmail.com Mon May 12 23:19:32 2008
From: xkenneth at gmail.com (Kenneth Miller)
Date: Mon, 12 May 2008 16:19:32 -0500
Subject: [lxml-dev] Looking for general insight.
Message-ID: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com>

All,

Any opinions on my problem would be greatly appreciated. I've got a large pre-defined XML schema, tons of data types etc. I want to be able to create python objects from the schemas and traffic these objects in and out of some sort of a database. Could I perhaps create these objects using lxml and extend lxml to use zope persistence?

Regards,
Kenneth Miller

From jlovell at esd189.org Mon May 12 23:54:24 2008
From: jlovell at esd189.org (John Lovell)
Date: Mon, 12 May 2008 14:54:24 -0700
Subject: [lxml-dev] Looking for general insight.
In-Reply-To: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com> References: <1950A20E-6FF3-414F-8DE0-01471669EBD4@gmail.com> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A22D8@ZIRIA.esd189.org> Kenneth: What you ask is not easy. However, I can point you at a few things that might be helpful. First a clarification. When you say, "create python objects from the schemas and traffic these objects in and out of some sort of a database" do you mean python classes or lxml trees filled with random data (or something else)? For python classes from XML schemas check out: http://www.rexx.com/~dkuhlman/generateDS.html For lxml trees filled with random data check out: http://messagesleuth.svn.sourceforge.net/viewvc/messagesleuth/trunk/xsd2 data.py?revision=6&view=markup Note: For this you will need to pay attention to the other MessageSleuth libraries it uses. Note: Realize that this supports a subset of XML Schema operators. Note: While I am proud of most of this code (and it consistently meets my needs) I believe randstr.py can generate invalid strings under certain conditions. Good luck, John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.esd189.org Together We Can ... -----Original Message----- From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Kenneth Miller Sent: Monday, May 12, 2008 2:20 PM To: lxml-dev at codespeak.net Subject: [lxml-dev] Looking for general insight. All, Any opinions on my problem would be greatly appreciated. I've got a large pre-defined XML schema, tons of data types etc. I want to be able to create python objects from the schemas and traffic these objects in and out of some sort of a database. Could I perhaps create these objects using lxml and extend lxml to use zope persistence? 
Regards, Kenneth Miller _______________________________________________ lxml-dev mailing list lxml-dev at codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev From mike at it-loops.com Tue May 13 07:58:05 2008 From: mike at it-loops.com (maru) Date: Tue, 13 May 2008 07:58:05 +0200 Subject: [lxml-dev] =?utf-8?q?_Re=3A__install_lxml_2=2E0=2E5_on_Mac_OS_X_L?= =?utf-8?q?eopard_-_why_is_itso_hard=3F?= In-Reply-To: <48288EFD.7060207@behnel.de> References: <48288EFD.7060207@behnel.de> Message-ID: On Mon, 12 May 2008 20:39:57 +0200, Stefan Behnel wrote: > lxml's setup.py would then need to be changed to automatically compile > statically on the Mac-OS platform. Although maybe we should only do that > if buildout is running (sys.modules?). Please leave the dynamic build as default option since it makes building universal libraries so much easier. A static build if buildout is used would be better in my opinion. Kind regards, Michael From kumar.mcmillan at gmail.com Tue May 13 17:41:58 2008 From: kumar.mcmillan at gmail.com (Kumar McMillan) Date: Tue, 13 May 2008 10:41:58 -0500 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? In-Reply-To: <48288EFD.7060207@behnel.de> References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <20080512113934.6c774076@mbook.local> <48286FC4.6030309@behnel.de> <48288EFD.7060207@behnel.de> Message-ID: On Mon, May 12, 2008 at 1:39 PM, Stefan Behnel wrote: > Kumar McMillan wrote: > > The best thing I can think of is to get --static working for libxml2.a > > files and then I can submit to you the steps I took to build my static > > libs from source (assuming I can get that all to work). > > I think a buildout will help here, as previously proposed a couple of times. ah, yes, excellent idea. I've started fiddling with it and have libxml2/libxslt building static libs no problem; this might just work. 
It looks like zc.recipe.egg isn't going to cut it though, as I can't find a way to pass in custom setup.py flags like --static (which I think is still needed to find libxml2.a, etc). I found collective.recipe.distutils which might work. I found some issues with it already but patching as I go. More as it happens - Kumar > > http://pypi.python.org/pypi/zc.buildout > http://pypi.python.org/pypi/zc.recipe.cmmi > > We can configure it to build only the static versions of libxml2 and libxslt, > and then build against those. > > ------------------------ > [libxml2] > recipe = zc.recipe.cmmi > url = http://ftp.gnome.org/pub/GNOME/sources/libxml2/2.6/libxml2-2.6.32.tar.gz > extra_options = --without-python --enable-shared --enable-static > > [libxslt] > recipe = zc.recipe.cmmi > url = http://ftp.gnome.org/pub/GNOME/sources/libxslt/1.1/libxslt-1.1.22.tar.bz2 > extra_options = --with-libxml-prefix=${buildout:directory}/parts/libxml2/ > --without-python --disable-shared --enable-static > > [lxml] > recipe = zc.recipe.egg:custom > egg = lxml > include-dirs = ${buildout:directory}/parts/libxml2/include/libxml2 > ${buildout:directory}/parts/libxslt/include > library-dirs = ${buildout:directory}/parts/libxml2/lib > ${buildout:directory}/parts/libxslt/lib > ------------------------ > > lxml's setup.py would then need to be changed to automatically compile > statically on the Mac-OS platform. Although maybe we should only do that if > buildout is running (sys.modules?). > > Stefan > From stefan_ml at behnel.de Tue May 13 07:25:40 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 13 May 2008 07:25:40 +0200 Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard? 
In-Reply-To:
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de>
Message-ID: <48292654.7030802@behnel.de>

Hi,

Kumar McMillan wrote:
> something is going wrong then with --static because I get "Python.h
> not found" errors and the gcc command looked something like this:
>
> gcc ... -I -I/path/to/python/headers

That's a bug. Here is a patch.

Stefan

=== setupinfo.py ==================================================================
--- setupinfo.py (revision 4206)
+++ setupinfo.py (local)
@@ -15,8 +15,11 @@
 PACKAGE_PATH = "src/lxml/"
 
 def env_var(name):
-    value = os.getenv(name, '')
-    return value.split(os.pathsep)
+    value = os.getenv(name)
+    if value:
+        return value.split(os.pathsep)
+    else:
+        return []
 
 def ext_modules(static_include_dirs, static_library_dirs, static_cflags):
     if CYTHON_INSTALLED:

From kumar.mcmillan at gmail.com Wed May 14 06:54:46 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Tue, 13 May 2008 23:54:46 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <48292654.7030802@behnel.de>
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de>
Message-ID:

On Tue, May 13, 2008 at 12:25 AM, Stefan Behnel wrote:
> > gcc ... -I -I/path/to/python/headers
>
> That's a bug. Here is a patch.

closer! thanks for that fix. That got all the -I includes in order.

Next up, I'm pretty sure I need to pass -static to libtool so that it honors the -lxml2.a (without -static, it says xml2.a -- lib not found).
My idea for this was:

export LDFLAGS='-static'

and I got:

gcc -arch i386 -arch ppc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -g -bundle -undefined dynamic_lookup -static build/temp.macosx-10.3-i386-2.5/src/lxml/lxml.etree.o -L/Users/kumar/src/lxml-2.0/parts/libxml2/lib -L/Users/kumar/src/lxml-2.0/parts/libxslt/lib -lxslt.a -lexslt.a -lxml2.a -lz.a -lm.a -o build/lib.macosx-10.3-i386-2.5/lxml/etree.so
ld_classic: incompatible flag -bundle used (must specify "-dynamic" to be used)

so ... how do I stop it from adding -bundle? Ideas for another approach?

From stefan_ml at behnel.de Wed May 14 08:01:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 14 May 2008 08:01:56 +0200
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To:
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de>
Message-ID: <482A8054.4060603@behnel.de>

Hi,

Kumar McMillan wrote:
> On Tue, May 13, 2008 at 12:25 AM, Stefan Behnel wrote:
>> > gcc ... -I -I/path/to/python/headers
>>
>> That's a bug. Here is a patch.
>
> closer! thanks for that fix. That got all the -I includes in order.
>
> Next up, I'm pretty sure I need to pass -static to libtool so that it
> honors the -lxml2.a (without -static, it says xml2.a -- lib not
> found).

It's not "-lxml2.a" but a plain "/path/to/libxml2.a" as parameter to link it in just like the normal lxml.etree.o object file that was just compiled.
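
A minimal sketch of that advice, under stated assumptions: the helper name `static_archive_paths` is hypothetical and is not part of lxml's setupinfo.py, and it assumes the static archives follow the usual `lib<name>.a` naming inside the given library directories. It only resolves names to full paths; how those paths reach the linker is left open.

```python
import os


def static_archive_paths(library_dirs, names):
    """Map names such as 'xml2' to full paths such as /prefix/lib/libxml2.a.

    Each name is looked up in library_dirs in order; the first hit wins.
    """
    found = []
    for name in names:
        filename = "lib%s.a" % name
        for directory in library_dirs:
            candidate = os.path.join(directory, filename)
            if os.path.isfile(candidate):
                found.append(candidate)
                break
        else:
            # No directory contained the archive for this name.
            raise ValueError(
                "no static archive %s in %r" % (filename, list(library_dirs)))
    return found
```

One plausible way to use the result is to pass the list as `extra_objects` to the distutils `Extension`, so the archives are handed to the linker as plain files next to the compiled object file instead of as `-l` options.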
Stefan

From x at jwp.name Tue May 13 19:35:30 2008
From: x at jwp.name (James William Pye)
Date: Tue, 13 May 2008 17:35:30 +0000 (UTC)
Subject: [lxml-dev] Help with an error message
References: <477D0D9E.3090205@behnel.de>
Message-ID:

Stefan Behnel behnel.de> writes:
> Konstantin Ryabitsev wrote:
> > Traceback (most recent call last):
> >   File "foo.py", line 6, in
> >     elt = Element('foo').text = unistr
> >   File "etree.pyx", line 741, in etree._Element.text.__set__
> >   File "apihelpers.pxi", line 344, in etree._setNodeText
> >   File "apihelpers.pxi", line 648, in etree._utf8
> > AssertionError: All strings must be XML compatible, either Unicode or ASCII
> >
> > Can someone suggest the best way to deal with this?
>
> My first question is: why do you need a '\x00' here? If you want to pass
> binary data in XML, the best way is to use a safe encoding such as uuencode or
> whatever. That should be part of your XML language spec/schema/...

I just ran into this myself. In my case, having the NULL was not desired, rather I wanted to see a raw '\x00' to appear in the string (i.e., the literal backslash sequence, *not* the NULL character).

It would be nice if lxml would be more explicit about the problem:

raise ValueError("NULL characters are not allowed in XML strings")

That is: How I am supposed to derive that a NULL character was causing that AssertionError from the given string?
(It wasn't until I found this message that I understood what I was doing wrong)

From stefan_ml at behnel.de Wed May 14 18:16:26 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 14 May 2008 18:16:26 +0200 (CEST)
Subject: [lxml-dev] Help with an error message
In-Reply-To:
References: <477D0D9E.3090205@behnel.de>
Message-ID: <27846.194.114.62.39.1210781786.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

James William Pye wrote:
> It would be nice if lxml would be more explicit about the problem:
>
> raise ValueError("NULL characters are not allowed in XML strings")
>
> That is: How I am supposed to derive that a NULL character was causing
> that
> AssertionError from the given string? (It wasn't until I found this
> message that I understood what I was doing wrong)

Ok, what about:

"All strings must be XML compatible: Unicode or ASCII, no NULL bytes"

?

Stefan

From rogerpatterson at gmail.com Thu May 15 06:21:52 2008
From: rogerpatterson at gmail.com (roger patterson)
Date: Wed, 14 May 2008 21:21:52 -0700
Subject: [lxml-dev] html entities and lxml.html.ElementSoup
In-Reply-To: <482B7C87.10800@aya.yale.edu>
References: <482B7C87.10800@aya.yale.edu>
Message-ID: <1200dfce0805142121q16c7fa30t148830146c932f02@mail.gmail.com>

Hi Viksit,

What you typed was correct, except you have to note that lxml.html.soupparser.convert_tree(soup) returns a *list* of root elements, so you can't just do a lxml.etree.tostring() on the list. Depending on your HTML, choosing the first element will probably work.

I have moved to the trunk now, so am working well with the new lxml.html.soupparser. But if you're stuck on that branch, then that work-around worked for me. Hope it works for you!

cheers
-Roger

2008/5/14 Viksit Gaur :
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation. When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg.
£), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>>It looks like it's not the parse(), but rather the serialisation. What
>> >happens
>>is that the entity references end up in the /text/ content, which is
>> >clearly
>>wrong as it leads to re-escaping of the references on the way out.
>
>>> What I'm currently doing to solve this is first parsing it with
>>> BeautifulSoup(html, convertEntities="html"), then calling
>>> ElementSoup.convert_tree(soup). This work-around works fine, but I
>>> thought I'd bring it to your attention.
>
> Did you mean something of the sort,
>
> soup = BeautifulSoup(doc, convertEntities="html")
> root = lxml.html.soupparser.convert_tree(soup)
>
> Because I get an error of the form:
>
> File "lxml.etree.pyx", line 2491, in lxml.etree.tostring
> (src/lxml/lxml.etree.c:21792)
> TypeError: Type 'list' cannot be serialized.
>
>
>
>>ElementSoup should do that for you. I fixed it on the trunk.
>
>>Stefan
>
> Unfortunately, I can't switch to lxml trunk. Would it be possible for you to
> point me to the code change in lxml so I can patch it myself?
>
> Thanks and Cheers,
> Viksit
>

From kumar.mcmillan at gmail.com Thu May 15 06:40:58 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Wed, 14 May 2008 23:40:58 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <482A8054.4060603@behnel.de>
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de>
Message-ID:

Hello again

On Wed, May 14, 2008 at 1:01 AM, Stefan Behnel wrote:
>> Next up, I'm pretty sure I need to pass -static to libtool so that it
>> honors the -lxml2.a (without -static, it says xml2.a -- lib not
>> found).
>
> It's not "-lxml2.a" but a plain "/path/to/libxml2.a" as parameter to link it
> in just like the normal lxml.etree.o object file that was just compiled.
when I tried the plain paths it says library cannot be found. But I've discovered that building with -static is a dead end. It seems that Apple all but disallows static linking completely: http://developer.apple.com/qa/qa2001/qa1118.html

HOWEVER after blood, sweat, and some tears (kidding) this is *all* I needed, it seems:

export CFLAGS="-flat_namespace"

...no static builds libxml2 libs, no buildout recipe. I just set that and ran:

python setup.py bdist_egg --with-xml2-config=/opt/local/bin/xml2-config --with-xslt-config=/opt/local/bin/xslt-config

which uses the libxml2 and etc. installed by ports. In fact, as long as /opt/local/bin is on my path that should work without having to set paths (i.e. from easy_install).

All my tests that were segfaulting are now passing. This appears to be the exact same behavior I got by setting DYLD_FORCE_FLAT_NAMESPACE at runtime but without the side effect of applying itself to anything else running in my shell ;)

so, I'm thinking this is just two lines of code added to cflags() ...

if sys.platform in ('darwin',):
    result.append('-flat_namespace')

Do you want a patch that also includes the adjustments to --static when not windows? I don't think they are necessary anymore. Actually, using --static on darwin should probably raise an error "Apple says no" ;)

-Kumar

From stefan_ml at behnel.de Thu May 15 13:03:17 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 15 May 2008 13:03:17 +0200
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To:
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de>
Message-ID: <482C1875.4070508@behnel.de>

Hi Kumar,

Kumar McMillan wrote:
> so, I'm thinking this is just two lines of code added to cflags() ...
>
> if sys.platform in ('darwin',):
> result.append('-flat_namespace')

That's cool, thanks. I added it to the trunk and to the 2.0 branch.
Let's see if Mac users get along with 2.0.6 then...

Thanks for the effort!

Stefan

From kumar.mcmillan at gmail.com Fri May 16 04:47:11 2008
From: kumar.mcmillan at gmail.com (Kumar McMillan)
Date: Thu, 15 May 2008 21:47:11 -0500
Subject: [lxml-dev] install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?
In-Reply-To: <482C1875.4070508@behnel.de>
References: <482699AD.4030905@behnel.de> <4828082C.3020202@behnel.de> <48292654.7030802@behnel.de> <482A8054.4060603@behnel.de> <482C1875.4070508@behnel.de>
Message-ID:

On Thu, May 15, 2008 at 6:03 AM, Stefan Behnel wrote:
>> if sys.platform in ('darwin',):
>> result.append('-flat_namespace')
>
> That's cool, thanks. I added it to the trunk and to the 2.0 branch.

excellent

> Let's see
> if Mac users get along with 2.0.6 then...
>
> Thanks for the effort!

sure, no problem.

I researched this a bit more. It seems that people generally consider -flat_namespace a bad "hack," something to keep in mind. However, this seems to be because a few libraries take advantage of -twolevel_namespace (the default gcc behavior as of OS X 10.3 or something) so your binaries may cause other linked libs to behave wrong. The only specific example I could find of one that uses two level namespaces was OpenGL, but maybe there are others.

Anyway, for lxml's purposes *I think* it is OK to use -flat_namespace since there aren't many other libs involved. Let's roll with it.
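
The two-line change quoted above can be tried in isolation. This is a minimal sketch, assuming a simplified cflags() that receives the candidate flags and the platform name as arguments; the real function in setupinfo.py obtains them differently, so treat the signature as hypothetical.

```python
import sys


def cflags(possible_cflags, platform=sys.platform):
    """Drop -I include flags and add the Darwin-only flag (simplified sketch)."""
    result = []
    for possible_cflag in possible_cflags:
        if not possible_cflag.startswith('-I'):
            result.append(possible_cflag)
    # The two lines under discussion: only Mac OS X gets -flat_namespace.
    if platform in ('darwin',):
        result.append('-flat_namespace')
    return result


print(cflags(['-I/opt/local/include', '-O2'], platform='darwin'))
print(cflags(['-I/usr/include', '-O2'], platform='linux2'))
```

Passing the platform explicitly makes it easy to check both branches on any machine.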
This is what etree links to :

$ otool -l path/to/lxml/etree.so
[snip]
Load command 7
  cmd LC_LOAD_DYLIB
  cmdsize 56
  name /opt/local/lib/libxslt.1.dylib (offset 24)
  time stamp 2 Wed Dec 31 18:00:02 1969
  current version 3.23.0
  compatibility version 3.0.0
Load command 8
  cmd LC_LOAD_DYLIB
  cmdsize 56
  name /opt/local/lib/libexslt.0.dylib (offset 24)
  time stamp 2 Wed Dec 31 18:00:02 1969
  current version 9.13.0
  compatibility version 9.0.0
Load command 9
  cmd LC_LOAD_DYLIB
  cmdsize 56
  name /opt/local/lib/libxml2.2.dylib (offset 24)
  time stamp 2 Wed Dec 31 18:00:02 1969
  current version 9.32.0
  compatibility version 9.0.0
Load command 10
  cmd LC_LOAD_DYLIB
  cmdsize 52
  name /opt/local/lib/libz.1.dylib (offset 24)
  time stamp 2 Wed Dec 31 18:00:02 1969
  current version 1.2.3
  compatibility version 1.0.0
Load command 11
  cmd LC_LOAD_DYLIB
  cmdsize 52
  name /usr/lib/libSystem.B.dylib (offset 24)
  time stamp 2 Wed Dec 31 18:00:02 1969
  current version 88.3.6
  compatibility version 1.0.0

... so unless libSystem.B.dylib somehow would be tripped up by -flat_namespace I think all should be good.

BTW, when I add those two lines all tests pass for me (they passed before but, hey, still a good sign) :

Index: setupinfo.py
===================================================================
--- setupinfo.py (revision 54771)
+++ setupinfo.py (working copy)
@@ -136,6 +136,8 @@
     for possible_cflag in possible_cflags:
         if not possible_cflag.startswith('-I'):
             result.append(possible_cflag)
+    if sys.platform in ('darwin',):
+        result.append('-flat_namespace')
     return result

 def define_macros():

-Kumar

From vik.list.nutch at gmail.com Fri May 16 04:58:41 2008
From: vik.list.nutch at gmail.com (Viksit Gaur)
Date: Thu, 15 May 2008 19:58:41 -0700
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
Message-ID: <482CF861.2010306@gmail.com>

Hi all,

I was wondering - what would be the most efficient method to access all the elements in the DOM tree, in some order, using lxml.etree?
The methods I currently see in the docs return a class like ElementDepthfirstIterator or iterwalk, which have 2 issues -

1) The first has a flat representation of the tree, so I lose child/parent structure
2) Things like iterwalk do return "start" and "end" actions - but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree when iterwalk itself is running?

Or perhaps there is some method I've missed completely?

Quick note on what I'm trying to do - graphically represent the DOM structure of a page using a library like networkX..

Cheers,
Viksit

From stefan_ml at behnel.de Fri May 16 11:14:59 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 16 May 2008 11:14:59 +0200
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
In-Reply-To: <482CF861.2010306@gmail.com>
References: <482CF861.2010306@gmail.com>
Message-ID: <482D5093.7060303@behnel.de>

Hi,

Viksit Gaur wrote:
> 2) Things like iterwalk do return "start" and "end" actions - but
> instead of first doing an iterwalk and then parsing the results, is
> there a better way to construct the tree when iterwalk itself is running?

I don't understand what you mean here. Are you modifying the tree during the iteration? Or do you think of some kind of pipelining?

Stefan

From vik.list.nutch at gmail.com Fri May 16 11:28:39 2008
From: vik.list.nutch at gmail.com (Viksit Gaur)
Date: Fri, 16 May 2008 02:28:39 -0700
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
In-Reply-To: <482D5093.7060303@behnel.de>
References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de>
Message-ID: <482D53C7.1060701@gmail.com>

Hi,

Stefan Behnel wrote:
> Hi,
>
> Viksit Gaur wrote:
>> 2) Things like iterwalk do return "start" and "end" actions - but
>> instead of first doing an iterwalk and then parsing the results, is
>> there a better way to construct the tree when iterwalk itself is running?
>
> I don't understand what you mean here. Are you modifying the tree during the
> iteration? Or do you think of some kind of pipelining?

Hmm. The problem I face was a method to assign a unique ID to each element on the page.

Lets say I construct an iterwalk object. But, during this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID to each element). I'm not sure how to do this, without extending the etree.so file inside which iterwalk is implemented..

Cheers,
Viksit

>
> Stefan
>

From stefan_ml at behnel.de Fri May 16 11:56:56 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 16 May 2008 11:56:56 +0200
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
In-Reply-To: <482D53C7.1060701@gmail.com>
References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com>
Message-ID: <482D5A68.2010107@behnel.de>

Viksit Gaur wrote:
> The problem I face was a method to assign a unique ID to each
> element on the page.
>
> Lets say I construct an iterwalk object. But, during this phase, I would
> like to not only build the tree, but also add some of my own information
> to each node (such as a unique ID to each element).

I still don't understand what you mean with "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that in case you are parsing.

Stefan

From Dennis.Benzinger at gmx.net Fri May 16 12:28:42 2008
From: Dennis.Benzinger at gmx.net (Dennis Benzinger)
Date: Fri, 16 May 2008 12:28:42 +0200
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
In-Reply-To: <482D5A68.2010107@behnel.de>
References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com> <482D5A68.2010107@behnel.de>
Message-ID: <482D61DA.8040609@gmx.net>

On 16.05.2008 11:56, Stefan Behnel wrote:
>
> Viksit Gaur wrote:
>> The problem I face was a method to assign a unique ID to each
>> element on the page.
>>
>> Lets say I construct an iterwalk object. But, during this phase, I would
>> like to not only build the tree, but also add some of my own information
>> to each node (such as a unique ID to each element).
>
> I still don't understand what you mean with "build the tree". You can't
> construct a tree and run iterwalk at the same time. iterparse() will do that
> in case you are parsing.
> [...]

I think he is talking about his own tree. The tree he is building to visualize the structure of the XML data.

HTH,
Dennis Benzinger

From stefan_ml at behnel.de Fri May 16 12:46:38 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 16 May 2008 12:46:38 +0200
Subject: [lxml-dev] Efficient methods to build a tree out of HTML structure?
In-Reply-To: <482D61DA.8040609@gmx.net>
References: <482CF861.2010306@gmail.com> <482D5093.7060303@behnel.de> <482D53C7.1060701@gmail.com> <482D5A68.2010107@behnel.de> <482D61DA.8040609@gmx.net>
Message-ID: <482D660E.4010303@behnel.de>

Hi,

Dennis Benzinger wrote:
> On 16.05.2008 11:56, Stefan Behnel wrote:
>> Viksit Gaur wrote:
>>> The problem I face was a method to assign a unique ID to each
>>> element on the page.
>>>
>>> Lets say I construct an iterwalk object. But, during this phase, I would
>>> like to not only build the tree, but also add some of my own information
>>> to each node (such as a unique ID to each element).
>> I still don't understand what you mean with "build the tree". You can't
>> construct a tree and run iterwalk at the same time. iterparse() will do that
>> in case you are parsing.
>> [...]
>
> I think he is talking about his own tree.
The tree he is building to
> visualize the structure of the XML data.

Ok, but if it's that, then I don't understand why iterating over the tree and adding an id attribute to each node won't do the job.

Stefan

From cz at gocept.com Fri May 16 14:21:27 2008
From: cz at gocept.com (Christian Zagrodnick)
Date: Fri, 16 May 2008 14:21:27 +0200
Subject: [lxml-dev] bug: objectify removes text on replace()?
Message-ID:

Hi,

with lxml 2.0.4 I get text removed when I replace a node. The text after the replaced node vanishes....

-----------------------
import lxml.objectify
import lxml.etree

xml = lxml.objectify.fromstring(
    'before bazafter baz')
print lxml.etree.tostring(xml, pretty_print=True)
print 50*'-'
baz = xml['bar']['baz']
xml['bar'].replace(baz, lxml.objectify.E.holler())
print lxml.etree.tostring(xml, pretty_print=True)
-----------------

Prints out:

before bazafter baz
--------------------------------------------------
before baz

Thanks,
--
Christian Zagrodnick · cz at gocept.com
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 4 · fax +49 345 1229889 1
Zope and Plone consulting and development

From stefan_ml at behnel.de Fri May 16 14:41:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 16 May 2008 14:41:44 +0200
Subject: [lxml-dev] bug: objectify removes text on replace()?
In-Reply-To:
References:
Message-ID: <482D8108.1080804@behnel.de>

Hi,

Christian Zagrodnick wrote:
> with lxml 2.0.4 I get text removed when I replace a node. The text
> after the replaced node vanishes...

You mean the .tail property of the node that you replace.

http://codespeak.net/lxml/tutorial.html#elements-contain-text

When you replace the node, it takes its tail with it.

Stefan

From cz at gocept.com Fri May 16 15:20:43 2008
From: cz at gocept.com (Christian Zagrodnick)
Date: Fri, 16 May 2008 15:20:43 +0200
Subject: [lxml-dev] bug: objectify removes text on replace()?
References: <482D8108.1080804@behnel.de>
Message-ID:

On 2008-05-16 14:41:44 +0200, Stefan Behnel said:

> Hi,
>
> Christian Zagrodnick wrote:
>> with lxml 2.0.4 I get text removed when I replace a node. The text
>> after the replaced node vanishes...
>
> You mean the .tail property of the node that you replace.
>
> http://codespeak.net/lxml/tutorial.html#elements-contain-text
>
> When you replace the node, it takes its tail with it.

Hrr. I'm too DOMified. Sorry :)

--
--
Christian Zagrodnick · cz at gocept.com
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 4 · fax +49 345 1229889 1
Zope and Plone consulting and development

From stefan_ml at behnel.de Sun May 18 21:24:42 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 18 May 2008 21:24:42 +0200
Subject: [lxml-dev] first lessons learned while porting lxml to Py3
Message-ID: <4830827A.8050304@behnel.de>

Hi,

since we had a lengthy discussion on whether or not non-prefixed byte strings should automatically mutate into unicode strings when compiled for Py3, here are some initial lessons from my first attempt to port lxml.

My first approach was (obviously) to import unicode_literals from __future__. This failed miserably, and even showed a couple of further bugs in Cython. :)

I then chose the route to explicitly prepend unicode strings with 'u', as I wanted to keep my source compilable with older Cython versions that do not support the 'b' prefix. Currently, I have changed about 700 lines this way in a quick walk-through, and now I'm searching the places where this was the wrong thing to do. :)

Most important evidence found: it's definitely non-trivial in a l