From chris0wj at gmail.com Sat Aug 1 13:48:39 2009
From: chris0wj at gmail.com (Chris Wj)
Date: Sat, 1 Aug 2009 07:48:39 -0400
Subject: [lxml-dev] Splitting an xml file.
Message-ID: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com>

I'm looking for the best way to split an XML file with many children into multiple files with the same parent tags but individual children.

Example: Turn this...

file1.xml
a whole bunch of stuff... child 1 stuff... child 2 stuff...

Into 2 files...

file1a.xml
a whole bunch of stuff... child 1 stuff...

file1b.xml
a whole bunch of stuff... child 2 stuff...

Should I use lxml.etree to find the line numbers and then just use file operations? Do you think that is the most efficient way?

From shigin at rambler-co.ru Tue Aug 4 21:22:33 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Tue, 04 Aug 2009 23:22:33 +0400
Subject: [lxml-dev] Splitting an xml file.
In-Reply-To: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com>
References: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com>
Message-ID: <1249413753.11605.13.camel@dervish>

On Sat, 01/08/2009 at 07:48 -0400, Chris Wj wrote:
> I'm looking for the best way to split an xml file with many children
> into multiple files with the same parent tags but individual children.
...
> Should I use lxml.etree to find the line numbers and then just use
> file operations? You guys think that is most efficient?

I think the simplest way to split the file is to remove all 'child' elements from the parsed document and then serialize the document once per child, with only that child attached.

In [3]: parsed = etree.parse('q.xml')
In [4]: root = parsed.getroot()
In [5]: childs = parsed.findall('child')
In [6]: for child in childs: root.remove(child)
In [7]: for num, child in enumerate(childs):
   ...:     root.append(child)
   ...:     f = codecs.open('file1%s.xml' % num, 'w', encoding='utf-8')
   ...:     f.write(etree.tounicode(parsed))
   ...:     f.close()
   ...:     root.remove(child)

This solution works incorrectly with tail text or with nodes that follow the 'child' nodes. I don't know if that is critical for you, but the following XML will be split the wrong way.

....
1234
....
....

If you need to split big XML files, it's much better to use the SAX interface. But a SAX reader/writer is much harder to implement.

From chris0wj at gmail.com Tue Aug 4 22:16:49 2009
From: chris0wj at gmail.com (Chris Wj)
Date: Tue, 4 Aug 2009 16:16:49 -0400
Subject: [lxml-dev] Splitting an xml file.
In-Reply-To: <1249413753.11605.13.camel@dervish>
References: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com> <1249413753.11605.13.camel@dervish>
Message-ID: <3a0f5ffd0908041316o3a2621ffxabf2beab5cdbd0d8@mail.gmail.com>

What about XSLT, can I use that to accomplish the task?

From kris at cs.ucsb.edu Wed Aug 5 00:44:10 2009
From: kris at cs.ucsb.edu (kristian kvilekval)
Date: Tue, 04 Aug 2009 15:44:10 -0700
Subject: [lxml-dev] Key error on del attribute?
Message-ID: <1249425850.27934.78.camel@loup.ece.ucsb.edu>

We need to delete an attribute on an Element node, but we are receiving a strange exception.

> a=etree.Element('a', z='1', x='2')
> a.attrib['x']
'2'
> del a.attrib['x']
> del a.attrib['x']
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (3059, 0))

We could add a call to has_key; however, we expect a simple KeyError exception to be raised.

From jlovell at nwesd.org Wed Aug 5 01:23:13 2009
From: jlovell at nwesd.org (John Lovell)
Date: Tue, 4 Aug 2009 16:23:13 -0700
Subject: [lxml-dev] Key error on del attribute?
In-Reply-To: <1249425850.27934.78.camel@loup.ece.ucsb.edu>
References: <1249425850.27934.78.camel@loup.ece.ucsb.edu>
Message-ID:

On Ubuntu 9.04 I get a KeyError thrown. Can you provide a list of versions like the one below?

python: 2.6.2
lxml.etree: (2, 1, 5, 0)
libxml used: (2, 6, 32)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

This should help...
http://codespeak.net/lxml/2.0/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.nwesd.org
Together We Can ...
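For readers following the thread, a minimal sketch of the behaviour under discussion: deleting a missing attribute is expected to raise a plain KeyError, which can be caught instead of calling has_key first. The helper name drop_attribute is hypothetical and not part of lxml, and the fallback to the standard library's ElementTree is an assumption made only so the sketch runs where lxml is not installed.

```python
# Sketch only: falls back to the standard library if lxml is not installed.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def drop_attribute(element, name):
    """Hypothetical helper: delete an attribute and report whether it existed."""
    try:
        del element.attrib[name]
    except KeyError:
        # The attribute was already gone; this is the exception the
        # original poster expected to see.
        return False
    return True


a = etree.Element('a', z='1', x='2')
first = drop_attribute(a, 'x')    # attribute present, gets removed
second = drop_attribute(a, 'x')   # attribute missing, KeyError is caught
print(first, second, sorted(a.attrib.keys()))
```

Catching the exception keeps the common case (attribute present) to a single lookup; testing with `name in element.attrib` beforehand would work just as well.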
From kris at cs.ucsb.edu Wed Aug 5 01:34:08 2009
From: kris at cs.ucsb.edu (kristian kvilekval)
Date: Tue, 04 Aug 2009 16:34:08 -0700
Subject: [lxml-dev] Key error on del attribute?
In-Reply-To:
References: <1249425850.27934.78.camel@loup.ece.ucsb.edu>
Message-ID: <1249428848.27934.94.camel@loup.ece.ucsb.edu>

On Tue, 2009-08-04 at 16:16 -0700, John Lovell wrote:
> On Ubuntu 9.04 I get a KeyError thrown. Can you provide a list of versions like the below?
>
> python: 2.6.2
> lxml.etree: (2, 1, 5, 0)
> libxml used: (2, 6, 32)
> libxml compiled: (2, 6, 32)
> libxslt used: (1, 1, 24)
> libxslt compiled: (1, 1, 24)

Bizarre... you're right, it works in plain python. It's the error parsing in ipython that runs into trouble. Not sure if the bug is in ipython or lxml, but no matter.

lxml.etree: (2, 1, 5, 0)
libxml used: (2, 6, 32)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

--------------------------------------------------------------------
Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
Type "copyright", "credits" or "license" for more information.

IPython 0.8.4 -- An enhanced Interactive Python.
In [3]: a=etree.Element('a', z='1', x='2') In [4]: del a.attrib['x'] In [5]: del a.attrib['x'] ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (3059, 0)) ------------------------------------------------- $ python Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32) [GCC 4.3.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> a=etree.Element('a', z='1', x='2') Traceback (most recent call last): File "", line 1, in NameError: name 'etree' is not defined >>> from lxml import etree >>> a=etree.Element('a', z='1', x='2') >>> del a.attrib['x'] >>> del a.attrib['x'] Traceback (most recent call last): File "", line 1, in File "lxml.etree.pyx", line 1857, in lxml.etree._Attrib.__delitem__ (src/lxml/lxml.etree.c:18787) File "apihelpers.pxi", line 435, in lxml.etree._delAttribute (src/lxml/lxml.etree.c:31747) KeyError: 'x' > This should help... > http://codespeak.net/lxml/2.0/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do > Thanks, From piet at cs.uu.nl Wed Aug 5 04:33:26 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Wed, 5 Aug 2009 04:33:26 +0200 Subject: [lxml-dev] Key error on del attribute? In-Reply-To: <1249428848.27934.94.camel@loup.ece.ucsb.edu> References: <1249425850.27934.78.camel@loup.ece.ucsb.edu> <1249428848.27934.94.camel@loup.ece.ucsb.edu> Message-ID: <19064.61302.617505.125663@Cochabamba.local> With iPython 0.9.1 on Python 2.6.2 it just works: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ipython-0.9.1-py2.6.egg/IPython/Magic.py:38: DeprecationWarning: the sets module is deprecated from sets import Set Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) Type "copyright", "credits" or "license" for more information. IPython 0.9.1 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. 
help -> Python's own help system. object? -> Details about 'object'. ?object also works, ?? prints more. In [1]: from lxml import etree In [2]: a=etree.Element('a', z='1', x='2') In [3]: del a.attrib['x'] In [4]: del a.attrib['x'] --------------------------------------------------------------------------- KeyError Traceback (most recent call last) /Users/piet/Mail/ in () /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.3-fat.egg/lxml/etree.so in lxml.etree._Attrib.__delitem__ (src/lxml/lxml.etree.c:42562)() /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.3-fat.egg/lxml/etree.so in lxml.etree._delAttribute (src/lxml/lxml.etree.c:13933)() KeyError: 'x' -- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From bl8cki at gmail.com Wed Aug 5 14:21:05 2009 From: bl8cki at gmail.com (bl8cki) Date: Wed, 5 Aug 2009 15:21:05 +0300 Subject: [lxml-dev] iterating xpath? Message-ID: I was searching the api and found things like iterfind, but it seems that this work with ElementPath I would like to do something like iterxpath. Is there any way to achieve this? Thanks a lot! From lei at ipac.caltech.edu Wed Aug 5 18:48:57 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Wed, 05 Aug 2009 09:48:57 -0700 Subject: [lxml-dev] lxml2.2 doctype missing Message-ID: <4A79B7F9.4060109@ipac.caltech.edu> I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2. Is this bug still not fixed in lxml 2.2 ? -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From shigin at rambler-co.ru Wed Aug 5 20:26:41 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Wed, 05 Aug 2009 22:26:41 +0400 Subject: [lxml-dev] Splitting an xml file. 
In-Reply-To: <3a0f5ffd0908041316o3a2621ffxabf2beab5cdbd0d8@mail.gmail.com>
References: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com> <1249413753.11605.13.camel@dervish> <3a0f5ffd0908041316o3a2621ffxabf2beab5cdbd0d8@mail.gmail.com>
Message-ID: <1249496801.11605.70.camel@dervish>

On Tue, 04/08/2009 at 16:16 -0400, Chris Wj wrote:
> What about xslt, can I use that to accomplish the task?

I've never used XSLT's ability to produce multiple output files. I briefly reviewed the XSLT specification and couldn't find how to do it.

Instead, you can use an XSLT parameter and produce different output by changing the 'keep' param. For example, you can use xsltproc with the example file q.xslt.

$ xsltproc --param keep 2 q.xslt q.xml

=== q.xslt ===
======

This solution has another issue: I don't know how to find out the position numbers. The following XML has 'child' elements in positions 2 and 4.

1234
....
....
....

From herve.cauwelier at free.fr Thu Aug 6 12:32:45 2009
From: herve.cauwelier at free.fr (Hervé Cauwelier)
Date: Thu, 06 Aug 2009 12:32:45 +0200
Subject: [lxml-dev] xpath going crazy
Message-ID: <4A7AB14D.2090602@free.fr>

Hi,

Consider the following XML document: http://pastebin.ca/1520331

This is an ODF presentation produced by OpenOffice.org, assumed to be a valid XML document.

Now I type this:

>>> from lxml import etree
>>> t = etree.parse('content.xml')
>>> ns = {'draw': "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"}
>>> t.xpath('//draw:frame', namespaces=ns)
[, ]

There are indeed two frames in the document.

>>> t.xpath('//draw:frame[0]', namespaces=ns)
[]

The position counting starts at 1 in XPath so this is expected.

>>> t.xpath('//draw:frame[1]', namespaces=ns)
[<Element {urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}frame at 7f604469eec0>, <Element {urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}frame at 7f604469ee68>]

I get the two elements at once.

>>> t.xpath('//draw:frame[2]', namespaces=ns)
[]

I can't get the second element.

The same thing happens when asking the root instead of the tree.

I know my XPath knowledge is limited, but I don't think I'm making any wrong assumption.
>>> print "lxml.etree: ", etree.LXML_VERSION
lxml.etree: (2, 2, 2, 0)
>>> print "libxml used: ", etree.LIBXML_VERSION
libxml used: (2, 7, 3)
>>> print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
libxml compiled: (2, 7, 3)
>>> print "libxslt used: ", etree.LIBXSLT_VERSION
libxslt used: (1, 1, 24)
>>> print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
libxslt compiled: (1, 1, 24)

Thanks for your insight,

Hervé

From jq at qdevelop.de Thu Aug 6 14:05:42 2009
From: jq at qdevelop.de (Jens Quade)
Date: Thu, 6 Aug 2009 14:05:42 +0200
Subject: [lxml-dev] xpath going crazy
In-Reply-To: <4A7AB14D.2090602@free.fr>
References: <4A7AB14D.2090602@free.fr>
Message-ID: <8BDC1B54-7065-411E-B08A-77FB39631EC4@qdevelop.de>

On 06.08.2009, at 12:32, Hervé Cauwelier wrote:

> Hi,
>
> Consider the following XML document: http://pastebin.ca/1520331
>
> This is an ODF presentation produced by OpenOffice.org, assumed to be a
> valid XML document.

> The position counting starts at 1 in XPath so this is expected.
>
>>>> t.xpath('//draw:frame[1]', namespaces=ns)
> [<Element {urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}frame at 7f604469eec0>,
> <Element {urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}frame at 7f604469ee68>]
>
> I get the two elements at once.
>
>>>> t.xpath('//draw:frame[2]', namespaces=ns)
> []
>
> I can't get the second element.

You ask for all draw:frame elements that are the first in their specific context. //draw:frame[1] only omits the draw:frame elements that have a draw:frame among their preceding siblings.

If you look at the XPath results in e.g. Oxygen, it is easy to see.

(//draw:frame)[1]

should do what you want.
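The difference between the two expressions can be sketched on a small stand-in document. The sample XML and the `name` attributes below are invented for illustration (the pastebin content is not reproduced here); only the draw namespace URI is taken from the thread.

```python
from lxml import etree

NS = {'draw': 'urn:oasis:names:tc:opendocument:xmlns:drawing:1.0'}

# Hypothetical stand-in for the ODF file: two frames, each the only
# draw:frame child of its own parent.
DOC = (
    '<doc xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0">'
    '<page><draw:frame name="f1"/></page>'
    '<page><draw:frame name="f2"/></page>'
    '</doc>'
)
root = etree.fromstring(DOC)


def names(expr):
    """Evaluate an XPath expression and return the matched 'name' values."""
    return [el.get('name') for el in root.xpath(expr, namespaces=NS)]


# The predicate applies per parent: every frame that is first among its siblings.
per_parent_first = names('//draw:frame[1]')
per_parent_second = names('//draw:frame[2]')

# Parenthesised: the predicate applies to the whole node-set in document order.
first_overall = names('(//draw:frame)[1]')
second_overall = names('(//draw:frame)[2]')

print(per_parent_first, per_parent_second, first_overall, second_overall)
```

In this sketch both frames are "first" within their parent, which matches the symptom reported above: `[1]` returns both and `[2]` returns nothing until the path is parenthesised.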
(only the first of all //draw:frame in the document)

From herve.cauwelier at free.fr Thu Aug 6 15:56:05 2009
From: herve.cauwelier at free.fr (Hervé Cauwelier)
Date: Thu, 06 Aug 2009 15:56:05 +0200
Subject: [lxml-dev] xpath going crazy
In-Reply-To: <8BDC1B54-7065-411E-B08A-77FB39631EC4@qdevelop.de>
References: <4A7AB14D.2090602@free.fr> <8BDC1B54-7065-411E-B08A-77FB39631EC4@qdevelop.de>
Message-ID: <4A7AE0F5.8090406@free.fr>

Jens Quade wrote:
> You ask for all draw:frame-Elements that are the first in their specific
> context.
> //draw:frame[1] only omits all draw:frame that have a draw:frame in
> their preceding-siblings.
>
> If you look at the XPath-results in e.g. Oxygen, it is easy to see.
>
> (//draw:frame)[1]
>
> should do what you want. (only the first of all //draw:frame in the
> document)

Thanks for the quick reply. I fixed my expressions.

Hervé

From stefan_ml at behnel.de Thu Aug 6 20:36:43 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 20:36:43 +0200
Subject: [lxml-dev] Jython and XPointer support
In-Reply-To: <1248352207.20640.107.camel@tttdal>
References: <1248352207.20640.107.camel@tttdal>
Message-ID: <4A7B22BB.4020601@behnel.de>

Hi,

Daniel Albeseder wrote:
> I wonder if there is, or will be any Jython support of lxml? We are
> currently evaluating several possibilities to support several
> XML-features in Python and Jython together. This includes XPath, XSLT,
> XPointer. The last one is supported by libxml2 but I have not found any
> support inside lxml. Have I missed something? Will there be XPointer
> support in lxml in the future?

That XPointer part of libxml2's API is not currently wrapped. If anyone writes the code, I'll be happy to include it.
> About Jython: There is of course the possibility to use JNA to access > the libxml2 library natively, but I do not comprehend how to access lxml > from within Jython, since it is not a "normal" shared library, but a > shared object created for CPython in Cython. Does anyone have any hint > about that? As lxml is written in Cython, the generated C code is heavily tied into CPython. So you will not be able to access lxml 'natively' from Jython. If you can afford to run a CPython interpreter next to Jython, tools like JPype may work for you. But I don't know of any library that provides portable support for XSLT for both CPython and Jython. Stefan From stefan_ml at behnel.de Thu Aug 6 20:41:30 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 06 Aug 2009 20:41:30 +0200 Subject: [lxml-dev] lxml In-Reply-To: <5d8599fb0907231519k2b00fe81v7dd1bc177801bf9f@mail.gmail.com> References: <5d8599fb0907231519k2b00fe81v7dd1bc177801bf9f@mail.gmail.com> Message-ID: <4A7B23DA.9020601@behnel.de> Yassin Ezbakhe wrote: > Hi, I'm using Python 2.6.2 (Windows) and lxml 2.2.2. > > When I run the following piece of code, it gets stuck. > > import lxml.etree as et > s = "" > xml = et.fromstring(s) > print xml[0] # prints element a > print xml[2] # prints element e > print xml[3] # it should raise an out of range exception, but it gets stuck > > What is the problem? In the original ElementTree implementation, I get an > IndexError. Works for me: Python 3.1 (r31:73572, Jun 28 2009, 21:07:35) [GCC 4.3.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import lxml.etree as et >>> print(et.__version__) 2.2.2 >>> s = "" >>> xml = et.fromstring(s) >>> print(xml[0]) # prints element a >>> print(xml[2]) # prints element e >>> print(xml[3]) # it should raise an out of range exception, but it gets stuck Traceback (most recent call last): File "", line 1, in File "lxml.etree.pyx", line 961, in lxml.etree._Element.__getitem__ (src/lxml/lxml.etree.c:33945) IndexError: list index out of range Stefan From stefan_ml at behnel.de Thu Aug 6 20:46:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 06 Aug 2009 20:46:01 +0200 Subject: [lxml-dev] Getting 'user-visible' text from HTML In-Reply-To: <20090724110516.500d74e0@nrri.umn.edu> References: <20090724110516.500d74e0@nrri.umn.edu> Message-ID: <4A7B24E9.4030502@behnel.de> Terry Brown wrote: > On Fri, 24 Jul 2009 14:30:03 +0000 (UTC) > Adam Nelson wrote: > >> Is there a shortcut method (or even a pasted script) that allows lxml >> to get all the 'user-visible' text? > > doc.xpath("//text()") should return a list of every piece of text in > the html. ... whereas doc.xpath("string()") will return the text content as a plain string. You can also serialise the document to plain text, as shown here: http://codespeak.net/lxml/tutorial.html#serialisation Note the unicode string serialisation at the end of that section. Stefan From stefan_ml at behnel.de Thu Aug 6 20:52:06 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 06 Aug 2009 20:52:06 +0200 Subject: [lxml-dev] id function of xpath and parseid In-Reply-To: <1248684566.20640.137.camel@tttdal> References: <1248684566.20640.137.camel@tttdal> Message-ID: <4A7B2656.7080505@behnel.de> Daniel Albeseder wrote: > I wonder why there is no parseid function inside the objectify module? > Is there a reason, why this is only in etree? No, it's just too rarely used to become duplicated. You can pass a parser to parseid(), which you can set up to use objectify API. 
http://codespeak.net/lxml/objectify.html#advanced-element-class-lookup

Stefan

From stefan_ml at behnel.de Thu Aug 6 20:53:09 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 20:53:09 +0200
Subject: [lxml-dev] id function of xpath and parseid
In-Reply-To: <1248687797.20640.143.camel@tttdal>
References: <1248684566.20640.137.camel@tttdal> <1248687797.20640.143.camel@tttdal>
Message-ID: <4A7B2695.4010108@behnel.de>

Daniel Albeseder wrote:
> On Mon, 2009-07-27 at 10:49 +0200, Daniel Albeseder wrote:
>
>> Additionally I wonder, how the `id` function of XPath does work with
>> lxml. I created a schema-aware parser, which reads an XML-file, where
>> some attributes are declared as xs:ID inside the schema. However the
>> `xpath` method always returns an empty list of nodes, even if the IDs
>> given are inside the XML.
>
> I just found in the archive, that it only seems to work with either an
> xml:id attribute or a given DTD. However since XML Schema is the modern
> way to restrict XML-files, and XML schema also has the xsd:ID attribute
> type, I see no reason, why this does not work.
>
> The XPath specification does talk about DTDs, that's correct, but since
> this specification is older than the XML Schema document and XML schema
> is designed to be a superset of DTD (and even to replace it), it sounds
> strange, that some features only work for DTDs.

Well, as you said: XMLSchema didn't exist when XPath 1.0 was defined. I assume the definition of id() is different for XPath 2.0, but that's not supported by libxml2.
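As a rough illustration of the case the quoted message says does work, the sketch below calls the XPath id() function against an xml:id attribute, with no DTD or schema involved. The element names and ID values are made up for the example, and the behaviour shown relies on libxml2 registering xml:id values at parse time, which is an assumption about the installed library versions rather than something stated in this thread.

```python
from lxml import etree

# Hypothetical document: one element carries xml:id, another merely
# mentions the same value in an ordinary attribute.
root = etree.fromstring(
    '<catalog>'
    '<item xml:id="a1">first</item>'
    '<item ref="a1">second</item>'
    '</catalog>'
)

# id() looks the value up in the document's ID table, so only the
# element that declared it through xml:id is expected to match.
by_id = root.xpath("id('a1')")
missing = root.xpath("id('nope')")

print([el.text for el in by_id], missing)
```

An attribute typed as xs:ID only by an XML Schema is not entered into that table, which is why the schema-aware parser described above gets an empty result.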
Stefan

From stefan_ml at behnel.de Thu Aug 6 21:28:04 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 21:28:04 +0200
Subject: [lxml-dev] Strange segmentation fault if class inherited from objectify.ObjectifiedElement
In-Reply-To: <1249039073.20640.371.camel@tttdal>
References: <1249039073.20640.371.camel@tttdal>
Message-ID: <4A7B2EC4.4020005@behnel.de>

Daniel Albeseder wrote:
> I just tried this and got a segmentation fault :-(
>
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from lxml import objectify
>>>> from lxml import etree
>>>>
>>>> print etree.LXML_VERSION, etree.LIBXML_VERSION
> (2, 2, 2, 0) (2, 6, 32)
>>>> class test (objectify.ObjectifiedElement) :
> ...     pass
> ...
>>>> good = objectify.Element ("abc")
>>>> print type (good), repr (good)
>
>>>> bad = test ("abc")
> Segmentation fault

http://codespeak.net/lxml/element_classes.html

Stefan

From stefan_ml at behnel.de Thu Aug 6 21:30:12 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 21:30:12 +0200
Subject: [lxml-dev] Whitespace foiling pretty_print - any fix?
In-Reply-To: <204abd770907301047t1c805ab8k30ce8e7b6821e9cf@mail.gmail.com>
References: <204abd770907301047t1c805ab8k30ce8e7b6821e9cf@mail.gmail.com>
Message-ID: <4A7B2F44.10506@behnel.de>

B Wooster wrote:
> If an XML element has whitespace along with sub elements, pretty_print
> does not work well.
>
> This is probably an XML whitespace issue, but is there any way to work
> around this issue?
>
> Here's example code and the output it prints, and what I would like to get:
>
> from lxml import etree as ET
> root = ET.XML("""
> test
> """)
> ET.SubElement(root, "c")
> ET.SubElement(root, "d")
> print ET.tostring(root, pretty_print=True)
> # This prints:
> """
> test
>
> """
> # Would like:
> """
> test
>
>
>
> """

http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output

Stefan

From stefan_ml at behnel.de Thu Aug 6 21:42:50 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 21:42:50 +0200
Subject: [lxml-dev] automatic attribute unicode decode?
In-Reply-To: <4A731041.4020908@free.fr>
References: <4A731041.4020908@free.fr>
Message-ID: <4A7B323A.4010308@behnel.de>

Hervé Cauwelier wrote:
> I'm quite puzzled by the following excerpt:
>
> >>> from lxml import etree
> >>> r = etree.fromstring(''
> >>> r.attrib
> {'titi': 'ascii', 'toto': u'fran\xe7ais', 'tata': '1'}
>
> In a bare document with no encoding declaration, lxml has decoded itself
> a string that did not match the ascii table (what heuristic did it
> use?).

No heuristic. It follows the XML specification in that the absence of an XML declaration defines the encoding as UTF-8. I assume your console was set to UTF-8 when you typed the above?

> Now I have three attributes of two different types. I wonder why
> the integer was not decoded. ;-)
>
> I actually found this in a real-world document with encoding and
> namespaces (An ODF xml part).
>
> Is this a bug to report and how to circumvent it?

Definitely not a bug. What would be the behaviour you expected instead?

Stefan

From stefan_ml at behnel.de Thu Aug 6 21:54:14 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 21:54:14 +0200
Subject: [lxml-dev] Splitting an xml file.
In-Reply-To: <3a0f5ffd0908041316o3a2621ffxabf2beab5cdbd0d8@mail.gmail.com>
References: <3a0f5ffd0908010448y19bcdbacv1d219b623f0fb36f@mail.gmail.com> <1249413753.11605.13.camel@dervish> <3a0f5ffd0908041316o3a2621ffxabf2beab5cdbd0d8@mail.gmail.com>
Message-ID: <4A7B34E6.2070609@behnel.de>

Chris Wj wrote:
> On Tue, Aug 4, 2009 at 3:22 PM, Alexander Shigin wrote:
>
>> On Sat, 01/08/2009 at 07:48 -0400, Chris Wj wrote:
>>> I'm looking for the best way to split an xml file with many children
>>> into multiple files with the same parent tags but individual children.
>> ...
>>> Should I use lxml.etree to find the line numbers and then just use
>>> file operations? You guys think that is most efficient?
>>
>> I think that the simplest way to split the file is to remove all 'child'
>> element from parsed document and serialize document with different
>> childs.
>>
>> In [3]: parsed = etree.parse('q.xml')
>> In [4]: root = parsed.getroot()
>> In [5]: childs = parsed.findall('child')
>> In [6]: for child in childs: root.remove(child)
>> In [7]: for num, child in enumerate(childs):
>>    ...:     root.append(child)
>>    ...:     f = codecs.open('file1%s.xml' % num, 'w', encoding='utf-8')
>>    ...:     f.write(etree.tounicode(parsed))
>>    ...:     f.close()
>>    ...:     root.remove(child)

Note that serialising to a unicode string and then using the codecs module to encode to UTF-8 is very inefficient. Instead, pass encoding='UTF-8' to tostring() and use the 'wb' mode when opening the file.

> What about xslt, can I use that to accomplish the task?

http://www.exslt.org/exsl/elements/document/index.html

However, the above should be simple enough in Python, so doing the same in XSLT sounds like overkill to me.

Stefan

From stefan_ml at behnel.de Thu Aug 6 21:56:11 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 06 Aug 2009 21:56:11 +0200
Subject: [lxml-dev] iterating xpath?
In-Reply-To:
References:
Message-ID: <4A7B355B.9010802@behnel.de>

bl8cki wrote:
> I was searching the api and found things like iterfind, but it seems
> that this work with ElementPath
> I would like to do something like iterxpath. Is there any way to achieve this?

No, that's not supported by libxml2.

Stefan

From mike_mp at zzzcomputing.com Fri Aug 7 20:06:18 2009
From: mike_mp at zzzcomputing.com (Michael Bayer)
Date: Fri, 7 Aug 2009 14:06:18 -0400
Subject: [lxml-dev] setuptools issues with python2.6 maint
Message-ID: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com>

My apologies for dumping a bad build on the list here, but googling returned absolutely nothing for this one. I haven't had this issue before, so it may be related to my usage of the latest python 2.6 maintenance branch, which I got from http://svn.python.org/projects/python/branches/release26-maint .

It builds fine if I run a straight distutils build without setuptools being installed. Otherwise I get the output below. Tested against 2.1.5, 2.2, and 2.2.2; any ideas are appreciated.

[root at stageassets lxml-2.2.2]# /usr/local/bin/python setup.py install
Building lxml version 2.2.2.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /usr/lib running install running build running build_py running build_ext running install_lib creating /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/sax.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/pyclasslookup.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/ElementInclude.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/etree.so -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/_elementpath.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/__init__.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/usedoctest.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/doctestcompare.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/cssselect.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/builder.py -> /usr/local/lib/python2.6/site-packages/lxml copying build/lib.linux-i686-2.6/lxml/objectify.so -> /usr/local/lib/python2.6/site-packages/lxml creating /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/diff.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/defs.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/_setmixin.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/_diffcommand.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/__init__.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/usedoctest.py -> 
/usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/ElementSoup.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/_html5builder.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/builder.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/formfill.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/html5parser.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/soupparser.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/clean.py -> /usr/local/lib/python2.6/site-packages/lxml/html copying build/lib.linux-i686-2.6/lxml/html/_dictmixin.py -> /usr/local/lib/python2.6/site-packages/lxml/html byte-compiling /usr/local/lib/python2.6/site-packages/lxml/sax.py to sax.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/pyclasslookup.py to pyclasslookup.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/ElementInclude.py to ElementInclude.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/_elementpath.py to _elementpath.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/__init__.py to __init__.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/usedoctest.py to usedoctest.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/doctestcompare.py to doctestcompare.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/cssselect.py to cssselect.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/builder.py to builder.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/diff.py to diff.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/defs.py to defs.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/_setmixin.py to _setmixin.pyc 
byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/_diffcommand.py to _diffcommand.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/__init__.py to __init__.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/usedoctest.py to usedoctest.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/ElementSoup.py to ElementSoup.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/_html5builder.py to _html5builder.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/builder.py to builder.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/formfill.py to formfill.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/html5parser.py to html5parser.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/soupparser.py to soupparser.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/clean.py to clean.pyc byte-compiling /usr/local/lib/python2.6/site-packages/lxml/html/_dictmixin.py to _dictmixin.pyc running install_egg_info Writing /usr/local/lib/python2.6/site-packages/lxml-2.2.2-py2.6.egg-info [root at stageassets lxml-2.2.2]# /var/web/hosts/fanfeedr.com/snapscores.com/bin/python setup.py install Building lxml version 2.2.2. NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. 
Using build configuration of libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /usr/lib running install running bdist_egg running egg_info writing src/lxml.egg-info/PKG-INFO writing top-level names to src/lxml.egg-info/top_level.txt writing dependency_links to src/lxml.egg-info/dependency_links.txt reading manifest file 'src/lxml.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'src/lxml.egg-info/SOURCES.txt' installing library code to build/bdist.linux-i686/egg running install_lib running build_py running build_ext Traceback (most recent call last): File "setup.py", line 116, in **extra_options File "/usr/local/lib/python2.6/distutils/core.py", line 152, in setup dist.run_commands() File "/usr/local/lib/python2.6/distutils/dist.py", line 975, in run_commands self.run_command(cmd) File "/usr/local/lib/python2.6/distutils/dist.py", line 995, in run_command cmd_obj.run() File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/install.py", line 76, in run File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/install.py", line 96, in do_egg_install File "/usr/local/lib/python2.6/distutils/cmd.py", line 333, in run_command self.distribution.run_command(command) File "/usr/local/lib/python2.6/distutils/dist.py", line 995, in run_command cmd_obj.run() File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/bdist_egg.py", line 174, in run File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/bdist_egg.py", line 161, in call_command File "/usr/local/lib/python2.6/distutils/cmd.py", line 333, in run_command self.distribution.run_command(command) File "/usr/local/lib/python2.6/distutils/dist.py", line 995, in run_command cmd_obj.run() File 
"/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/install_lib.py", line 20, in run File "/usr/local/lib/python2.6/distutils/command/install_lib.py", line 113, in build self.run_command('build_ext') File "/usr/local/lib/python2.6/distutils/cmd.py", line 333, in run_command self.distribution.run_command(command) File "/usr/local/lib/python2.6/distutils/dist.py", line 995, in run_command cmd_obj.run() File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/build_ext.py", line 46, in run File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line 340, in run self.build_extensions() File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line 449, in build_extensions self.build_extension(ext) File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/build_ext.py", line 175, in build_extension File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line 460, in build_extension ext_path = self.get_ext_fullpath(ext.name) File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line 633, in get_ext_fullpath filename = self.get_ext_filename(modpath[-1]) File "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/build_ext.py", line 85, in get_ext_filename KeyError: 'etree' From mike_mp at zzzcomputing.com Fri Aug 7 20:13:05 2009 From: mike_mp at zzzcomputing.com (Michael Bayer) Date: Fri, 7 Aug 2009 14:13:05 -0400 Subject: [lxml-dev] setuptools issues with python2.6 maint In-Reply-To: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com> References: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com> Message-ID: I would add that I successfully built/installed psycopg2 and PIL with the same Python install, so its not that my distutils/easy_install is "broken" across the 
board for compiler builds... the issue is specific to lxml.

From ndudfield at gmail.com Sat Aug 8 06:25:17 2009
From: ndudfield at gmail.com (Nicholas Dudfield)
Date: Sat, 08 Aug 2009 14:25:17 +1000
Subject: [lxml-dev] Catalog for entities such as &nbsp; for XHTML parser.
In-Reply-To:
References: <4A63669C.4050404@behnel.de> <4A642134.7070206@behnel.de> <4A644E4B.4070309@behnel.de> <4A647835.9040303@behnel.de>
Message-ID: <4A7CFE2D.1070302@gmail.com>

List,

Please excuse me if this question has been answered but I couldn't
find anything on the list archives that spelled it out for dummies.

My usage situation is this:

* I'm using windows
* I'm parsing xhtml with the xhtml parser
* I'm calling lxml from within a python extensible editor.

My problem:
* Parsing failures due to `unknown` entities, even quite common
  ones such as &nbsp;

  eg. XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 11

How can I set up an external file with common entity definitions that
I can pass as an argument to the parser constructor?
I read something about a `catalog` but the only docs I could find on
it assumed *nix.

If someone could help out with a code snippet example I would be very
much appreciative.

Cheers.

From ndudfield at gmail.com Sat Aug 8 07:28:08 2009
From: ndudfield at gmail.com (Nicholas Dudfield)
Date: Sat, 08 Aug 2009 15:28:08 +1000
Subject: [lxml-dev] Catalog for entities such as &nbsp; for XHTML parser.
In-Reply-To: <4A7CFE2D.1070302@gmail.com>
References: <4A63669C.4050404@behnel.de> <4A642134.7070206@behnel.de> <4A644E4B.4070309@behnel.de> <4A647835.9040303@behnel.de> <4A7CFE2D.1070302@gmail.com>
Message-ID: <4A7D0CE8.3060202@gmail.com>

Nicholas Dudfield wrote:
> List,
>
> Please excuse me if this question has been answered but I couldn't
> find anything on the list archives that spelled it out for dummies.
>
> My usage situation is this:
>
> * I'm using windows
> * I'm parsing xhtml with the xhtml parser
> * I'm calling lxml from within a python extensible editor.
>
> My problem:
> * Parsing failures due to `unknown` entities, even quite common
> ones such as &nbsp;
>
> eg. XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 11
>
> How can I set up an external file with common entity definitions that
> I can pass as an argument to the parser constructor?
> I read something about a `catalog` but the only docs I could find on
> it assumed *nix.
>
> If someone could help out with a code snippet example I would be very
> much appreciative.
>
> Cheers.
>

Passing `resolve_entities=False` to the parser constructor ought to work
for my case.

There seems to be a bug related to this in the feed interface. If you
feed the whole document in one go it will honor the constructor, however
if you pass it `chunks` ( as you typically would ) it fails.

I have attached some test cases. For better or worse they are written to
all `pass` proving `errors` using assertRaises.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lxmltest.py
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20090808/491a0b80/attachment.diff

From stefan_ml at behnel.de Sat Aug 8 15:53:42 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 08 Aug 2009 15:53:42 +0200
Subject: [lxml-dev] Bug with whitespace in namespaces
In-Reply-To:
References:
Message-ID: <4A7D8366.8060003@behnel.de>

Christian Zagrodnick wrote:
> it is possible to create invalid XML with lxml:
>
>>>> import lxml.etree
>>>> import lxml.objectify
>>>> xml = lxml.objectify.XML('')
>>>> xml.set('{a b}c', 'foo') # This should fail!
>>>> lxml.etree.tostring(xml)
> ''
>>>> lxml.objectify.fromstring(lxml.etree.tostring(xml))
> Traceback (most recent call last):
> ...
> File "parser.pxi", line 625, in lxml.etree._handleParseResult > (src/lxml/lxml.etree.c:64741) > File "parser.pxi", line 565, in lxml.etree._raiseParseError > (src/lxml/lxml.etree.c:64084) > lxml.etree.XMLSyntaxError: xmlns:ns0: 'a b' is not a valid URI, line 1, > column 13 Well, URI checking is actually a new feature in libxml2 2.7 (IIRC), that's why it wasn't used before. Newer libxml2 versions are strict about RFC 3986 syntax, so I agree that it would make sense to also check namespace URIs on the way in. This should go into lxml 2.3. Stefan From stefan_ml at behnel.de Sat Aug 8 17:55:48 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 17:55:48 +0200 Subject: [lxml-dev] Bug with whitespace in namespaces In-Reply-To: <4A7D8366.8060003@behnel.de> References: <4A7D8366.8060003@behnel.de> Message-ID: <4A7DA004.6080300@behnel.de> Stefan Behnel wrote: > Christian Zagrodnick wrote: >> it is possible to create invalid XML with lxml: >> >>>>> import lxml.etree >>>>> import lxml.objectify >>>>> xml = lxml.objectify.XML('') >>>>> xml.set('{a b}c', 'foo') # This should fail! >>>>> lxml.etree.tostring(xml) >> '' >>>>> lxml.objectify.fromstring(lxml.etree.tostring(xml)) >> Traceback (most recent call last): >> ... >> File "parser.pxi", line 625, in lxml.etree._handleParseResult >> (src/lxml/lxml.etree.c:64741) >> File "parser.pxi", line 565, in lxml.etree._raiseParseError >> (src/lxml/lxml.etree.c:64084) >> lxml.etree.XMLSyntaxError: xmlns:ns0: 'a b' is not a valid URI, line 1, >> column 13 > > Well, URI checking is actually a new feature in libxml2 2.7 (IIRC), that's > why it wasn't used before. Newer libxml2 versions are strict about RFC 3986 > syntax, so I agree that it would make sense to also check namespace URIs on > the way in. > > This should go into lxml 2.3. Fixed on the trunk. 
Stefan From belred at gmail.com Sat Aug 8 18:21:40 2009 From: belred at gmail.com (Bryan) Date: Sat, 8 Aug 2009 09:21:40 -0700 Subject: [lxml-dev] xml vulnerability Message-ID: <38f48f590908080921t78510de7q2587c160b2b1415@mail.gmail.com> we use this library at our work, and i was asked to find out if xml parsers in lxml are affected for the following xml vulnerability? http://voices.washingtonpost.com/securityfix/2009/08/researchers_xml_security_flaw.html thanks, bryan From stefan_ml at behnel.de Sat Aug 8 18:45:56 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 18:45:56 +0200 Subject: [lxml-dev] xml vulnerability In-Reply-To: <38f48f590908080921t78510de7q2587c160b2b1415@mail.gmail.com> References: <38f48f590908080921t78510de7q2587c160b2b1415@mail.gmail.com> Message-ID: <4A7DABC4.7070002@behnel.de> Bryan wrote: > we use this library at our work, and i was asked to find out if xml > parsers in lxml are affected for the following xml vulnerability? > > http://voices.washingtonpost.com/securityfix/2009/08/researchers_xml_security_flaw.html This article contains mostly underinformed journalist rubbish, but when you follow the link at the end of the article: https://www.cert.fi/en/reports/2009/vulnerability2009085.html you get to the CERT advisory that hints on what the problem is (possible crashes or DoS attacks related to character decoding and parsing) and states which parsers were found to be vulnerable. lxml is based on libxml2, which is not on that list (whereas pyexpat is, so the stdlib ElementTree is vulnerable, for example). As usual, this doesn't mean libxml2/lxml is bug-free, uncrackable software, just that it's not on the list for this problem. If you need more information regarding this issue, please ask on the libxml2 mailing list. 
Stefan From stefan_ml at behnel.de Sat Aug 8 19:06:19 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 19:06:19 +0200 Subject: [lxml-dev] xml vulnerability In-Reply-To: <4A7DABC4.7070002@behnel.de> References: <38f48f590908080921t78510de7q2587c160b2b1415@mail.gmail.com> <4A7DABC4.7070002@behnel.de> Message-ID: <4A7DB08B.7040607@behnel.de> Stefan Behnel wrote: > If you need more information regarding this issue, please ask on the > libxml2 mailing list. On a related note: if you care about parsing XML from untrusted sources, it's best to use libxml2 2.7.x, as it's less vulnerable to XML bombs due to size limitations inside the parser (which are enabled by default). Stefan From stefan_ml at behnel.de Sat Aug 8 19:31:32 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 19:31:32 +0200 Subject: [lxml-dev] lxml2.2 doctype missing In-Reply-To: <4A79B7F9.4060109@ipac.caltech.edu> References: <4A79B7F9.4060109@ipac.caltech.edu> Message-ID: <4A7DB674.50001@behnel.de> Mary Lei wrote: > I noticed that the xhtml converted from > the parse tree has doctype missing. > I am using lxml 2.2. > > Is this bug still not fixed in lxml 2.2 ? In order to convince others that this is a bug, you might want to provide some more information. Could you present a short code snippet that shows what you do and the (unexpected) result you get? Stefan From stefan_ml at behnel.de Sat Aug 8 19:38:02 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 19:38:02 +0200 Subject: [lxml-dev] setuptools issues with python2.6 maint In-Reply-To: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com> References: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com> Message-ID: <4A7DB7FA.2090400@behnel.de> Michael Bayer wrote: > My apologies for dumping a bad build on the list here, but googling > returned absolutely nothing for this one. 
I haven't had this issue before > so it may be related to my usage of the latest python 2.6 mainentance > branch which I got from > http://svn.python.org/projects/python/branches/release26-maint . > > It builds fine if I run a straight distutils build without setuptools > being installed. Otherwise I get the below - tested against 2.1.5, 2.2, > and 2.2.2, any ideas are appreciated. > > root at stageassets lxml-2.2.2]# /usr/local/bin/python setup.py install [...] > running build_ext > Traceback (most recent call last): [...] > File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line > 449, in build_extensions > self.build_extension(ext) > File > "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/build_ext.py", > line 175, in build_extension > File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line > 460, in build_extension > ext_path = self.get_ext_fullpath(ext.name) > File "/usr/local/lib/python2.6/distutils/command/build_ext.py", line > 633, in get_ext_fullpath > filename = self.get_ext_filename(modpath[-1]) > File > "/var/web/hosts/fanfeedr.com/snapscores.com/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/build_ext.py", > line 85, in get_ext_filename > KeyError: 'etree' No idea, never seen this before. I use the same setuptools version under Py2.6.2, and it works perfectly well. Have you tried the bdist_egg target instead of a mere "install"? Also, the way setuptools patches into distutils makes it quite possible that newer Python releases introduce incompatibilities, so maybe there's an issue over there. Stefan From stefan_ml at behnel.de Sat Aug 8 20:17:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 08 Aug 2009 20:17:42 +0200 Subject: [lxml-dev] Catalog for entities such as   for XHTML parser. 
In-Reply-To: <4A7D0CE8.3060202@gmail.com>
References: <4A63669C.4050404@behnel.de> <4A642134.7070206@behnel.de> <4A644E4B.4070309@behnel.de> <4A647835.9040303@behnel.de> <4A7CFE2D.1070302@gmail.com> <4A7D0CE8.3060202@gmail.com>
Message-ID: <4A7DC146.8080907@behnel.de>

Hi,

Nicholas Dudfield wrote:
> Nicholas Dudfield wrote:
>> My usage situation is this:
>>
>> * I'm using windows
>> * I'm parsing xhtml with the xhtml parser
>> * I'm calling lxml from within a python extensible editor.
>>
>> My problem:
>> * Parsing failures due to `unknown` entities, even quite common
>> ones such as &nbsp;
>>
>> eg. XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 11
>>
>> How can I set up an external file with common entity definitions that
>> I can pass as an argument to the parser constructor?

You can let the parser load the DTD by setting load_dtd=True. lxml will not
load DTDs by default, and if there is no DTD, the parser will fail on unknown
entity references.

Also, lxml will not access the network by default, so unless you use a
catalog, you must also pass no_network=False. Note that this may slow down
parsing considerably, as each document requires loading the DTD from the
network first.

>> I read something about a `catalog` but the only docs I could find on
>> it assumed *nix.

You need to set the XML_CATALOG_FILES environment variable to a
space-separated list of catalog files.

http://xmlsoft.org/catalog.html

I have no idea how to install or manage XML catalogs under Windows, though.

> There seems to be a bug related to this in the feed interface. If you
> feed the whole document in one go it will honor the constructor, however
> if you pass it `chunks` ( as you typically would ) it fails.
>
> I have attached some test cases. For better or worse they are written to
> all `pass` proving `errors` using assertRaises.

This sounds like a bug to me. Could you file a bug report?

https://bugs.launchpad.net/lxml

Thanks!
Stefan

From ndudfield at gmail.com Sun Aug 9 04:36:12 2009
From: ndudfield at gmail.com (Nicholas Dudfield)
Date: Sun, 09 Aug 2009 12:36:12 +1000
Subject: [lxml-dev] Catalog for entities such as &nbsp; for XHTML parser.
In-Reply-To: <4A7DC146.8080907@behnel.de>
References: <4A63669C.4050404@behnel.de> <4A642134.7070206@behnel.de> <4A644E4B.4070309@behnel.de> <4A647835.9040303@behnel.de> <4A7CFE2D.1070302@gmail.com> <4A7D0CE8.3060202@gmail.com> <4A7DC146.8080907@behnel.de>
Message-ID: <4A7E361C.30101@gmail.com>

> There seems to be a bug related to this in the feed interface. If you
> feed the whole document in one go it will honor the constructor, however
> if you pass it `chunks` ( as you typically would ) it fails.
>
> I have attached some test cases. For better or worse they are written to
> all `pass` proving `errors` using assertRaises.
>
>>> This sounds like a bug to me. Could you file a bug report?
>>> https://bugs.launchpad.net/lxml
>>> Thanks!
>>> Stefan

Of course :)

I just signed up to Launchpad and filed the report with test cases (modified
from the ones I sent earlier to the list, so that the buggy behaviour now
`fails`).

There was also a possible bug I noticed in relation to using XPath
searches for text() when a parser was initiated with `strip_cdata=False`.
I'll have a look into that now and see if I can write a test case that
consistently exposes the bug.

I also noticed a fault in the css parser (used for
lxml.cssselect.css_to_xpath) which can put the interpreter in an infinite
loop, but IIRC the bug was already mentioned on the list.
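
A minimal sketch of the parser options discussed in this thread, written for a current Python 3 and lxml rather than the 2009 setup. The sample document and the helper names (`DOC`, `parse_default`, `parse_keeping_entities`, `dtd_loading_parser`) are made up for illustration; only the `XMLParser` keyword arguments `resolve_entities`, `load_dtd` and `no_network` come from the messages above. It assumes the document carries a DOCTYPE that points at an external DTD, as XHTML usually does, and it does not touch the feed interface where the reported bug lives.

```python
from lxml import etree

# Hypothetical XHTML sample. The DOCTYPE names an external DTD that is never
# fetched here, so '&nbsp;' stays undeclared from the parser's point of view.
DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><p>a&nbsp;b</p></body></html>'
)


def parse_default(data):
    """Parse with default options; the undeclared entity is an error."""
    return etree.fromstring(data, etree.XMLParser())


def parse_keeping_entities(data):
    """Keep entity references as they are instead of resolving them."""
    return etree.fromstring(data, etree.XMLParser(resolve_entities=False))


def dtd_loading_parser():
    """Parser that is allowed to load the DTD, over the network if needed.

    Only constructed on demand: using it can be slow, as noted in the thread.
    """
    return etree.XMLParser(load_dtd=True, no_network=False)


# Record whether the default parser rejects the document.
try:
    parse_default(DOC)
    default_failed = False
except etree.XMLSyntaxError:
    default_failed = True

# With resolve_entities=False the reference survives a parse/serialise cycle.
serialized = etree.tostring(parse_keeping_entities(DOC))
```

With `resolve_entities=False` the reference is kept as an entity node, so it is written back out as `&nbsp;` rather than as a U+00A0 character. Whether that is acceptable depends on what consumes the tree afterwards; loading the DTD (ideally through a local catalog) is the option to reach for when the resolved character is needed.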
From lei at ipac.caltech.edu Mon Aug 10 20:30:27 2009
From: lei at ipac.caltech.edu (Mary Lei)
Date: Mon, 10 Aug 2009 11:30:27 -0700
Subject: [lxml-dev] lxml2.2 doctype missing
In-Reply-To: <4A7DB674.50001@behnel.de>
References: <4A79B7F9.4060109@ipac.caltech.edu> <4A7DB674.50001@behnel.de>
Message-ID: <4A806743.4040503@ipac.caltech.edu>

here is an example:

#!/bin/sh
# next line restarts python \
"exec" "python" "-O" "$0" "$@"

import urllib
import urllib2
import urlparse
import os
import sys, getopt, difflib
import re
import string

version = sys.version_info
if version < (2,6):
    # only major.minor are formatted; version_info has five fields
    print "Need python version 2.6 or better, %s.%s too old!" % version[:2]
else:
    print "python version: ", version

from lxml.html import parse,submit_form,fromstring,tostring
import lxml.html
from lxml import etree
from StringIO import StringIO

url = "http://nsted.ipac.caltech.edu"
try:
    rc = urllib2.urlopen(url)
    contents = rc.read()
    rc.close()
except urllib2.HTTPError,e:
    print "Error: Page not found",e
    sys.exit(1)
except urllib2.URLError,e:
    print "Error: Connection refused ",e
    sys.exit(1)

print "contents-------------\n"+contents[0:300]
root = fromstring(contents)
fd = open ("tempfile", "w")
fd.write(contents)
fd.close()

# parse the saved page and serialise the root element, as HTML and as XML
root = parse("tempfile").getroot()
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True)
print "htmlstr--------------\n"+htmlstr[0:300]
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True,\
    include_meta_content_type=False,method='xml')
print "htmlstr1-------------\n"+htmlstr[0:300]
try:
    print root.docinfo.doctype
except AttributeError,e:
    print e

tree = etree.parse(StringIO(""""""))
print "doctype",tree.docinfo.doctype

Output:

python version:  (2, 6, 2, 'final', 0)
contents------------- has original doctype
Welcome to NStED</tit
htmlstr-------------- from lxml tostring, no doctype
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Welcome to NStED
Welcome to NStED
fd.write(contents)
> fd.close()
>
> root =
parse("tempfile").getroot() > htmlstr = lxml.html.tostring(root,\ > encoding="iso-8859-1",pretty_print=True) > ## htmlstr-------------- from lxml tostring, no doctype > htmlstr = lxml.html.tostring(root,\ > encoding="iso-8859-1",pretty_print=True,\ > include_meta_content_type=False,method='xml') > ## htmlstr1------------- from lxml tostring, no doctype, convert as xml > > tree = etree.parse(StringIO("""""")) > print "doctype",tree.docinfo.doctype > ## doctype <---- this one is ok > > So it was in contents from urlopen but missing > in lxml fromstring and then tostring. > Am I missing something ? Yes. When you tell lxml to serialise an element, you get the element and nothing but that. If you want doctype declarations, DTDs, processing instructions and the like (i.e. stuff that doesn't belong to the element itself), you must wrap the element in an ElementTree and serialise that. Stefan From mike_mp at zzzcomputing.com Mon Aug 10 21:48:48 2009 From: mike_mp at zzzcomputing.com (Michael Bayer) Date: Mon, 10 Aug 2009 15:48:48 -0400 Subject: [lxml-dev] setuptools issues with python2.6 maint In-Reply-To: <4A7DB7FA.2090400@behnel.de> References: <543dc5d5c9a7062c5ed361e669c17d20.squirrel@www.geekisp.com> <4A7DB7FA.2090400@behnel.de> Message-ID: <159c0df5ffd364411bf21a6d9edd3574.squirrel@www.geekisp.com> Stefan Behnel wrote: > > > No idea, never seen this before. I use the same setuptools version under > Py2.6.2, and it works perfectly well. > > Have you tried the bdist_egg target instead of a mere "install"? > > Also, the way setuptools patches into distutils makes it quite possible > that newer Python releases introduce incompatibilities, so maybe there's > an > issue over there. > I was running build_ext in most cases. Didn't try bdist_egg. Anyway my dependency on that version of python is over for now, so if it is in fact an issue with py2.6+, we'll all find out soon enough. thanks for the help. 
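
Stefan's doctype answer earlier in this stretch of the thread (serialise the ElementTree, not the bare element) can be sketched as follows. This is an illustration in current Python 3 with `lxml.etree`, not the poster's Python 2 / `lxml.html` script; the sample document and the variable names are assumptions made up for the example.

```python
from io import BytesIO

from lxml import etree

# Hypothetical XHTML sample with a DOCTYPE; the DTD itself is never loaded.
DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>t</title></head><body/></html>'
)

tree = etree.parse(BytesIO(DOC))
root = tree.getroot()

# Serialising the element yields the element and nothing else.
element_only = etree.tostring(root)

# Serialising the tree the element belongs to also writes document-level
# items such as the DOCTYPE declaration.
whole_document = etree.tostring(root.getroottree())

# The declaration is still available on the tree's docinfo.
doctype = tree.docinfo.doctype
```

`root.getroottree()` is the way back from an element to its document when only the element was kept around, which is the situation after calling `.getroot()` as in the script above.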
From manu3d at gmail.com Tue Aug 11 13:03:08 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Tue, 11 Aug 2009 12:03:08 +0100 Subject: [lxml-dev] default namespace and xpath evaluation Message-ID: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> Hi everybody, I can't seem to find a more compact way to do this: nsDict = {"default":anElement.nsmap[None]} aChildElement = anElement.xpath("/default:elem", namespaces=nsDict )[0] Specifically, I would have thought that if the element is in the default namespace the simple string: aChildElement = anElement.xpath("/elem")[0] would be sufficient, but the element is not found - probably appropriately. However, passing the element's nsmap such as in: aChildElement = anElement.xpath("/elem" ,namespaces=anElement.nsmap)[0] results in the following error message: Traceback (most recent call last): File "", line 1, in File "lxml.etree.pyx", line 1314, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:38871) File "xpath.pxi", line 245, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:106924) File "xpath.pxi", line 117, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:105514) File "xpath.pxi", line 55, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:104808) File "extensions.pxi", line 77, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:96771) TypeError: empty namespace prefix is not supported in XPath Am I missing a nicer way of doing this? Manu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090811/7810d69e/attachment.htm From chris at simplistix.co.uk Tue Aug 11 18:10:06 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 11 Aug 2009 17:10:06 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X Message-ID: <4A8197DE.70800@simplistix.co.uk> Hi All, I'm getting the following error when trying to install: Building lxml version 2.2. 
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. Using build configuration of libxslt 1.1.11 Building against libxml2/libxslt in the following directory: /usr/lib src/lxml/lxml.etree.c:169:31:src/lxml/lxml.etree.c:169:31: error: libxml/schematron.h: No such file or directory error: libxml/schematron.h: No such file or directory src/lxml/lxml.etree.c:135067: error: dereferencing pointer to incomplete type src/lxml/lxml.etree.c:135068: error: dereferencing pointer to incomplete type src/lxml/lxml.etree.c: At top level: src/lxml/lxml.etree.c:135174: error: invalid application of 'sizeof' to incomplete type 'struct __pyx_obj_4lxml_5etree__ParserSchemaValidationContext' lipo: can't figure out the architecture type of: /var/tmp//ccxExHVU.out error: Setup script exited with error: command 'gcc' failed with exit status 1 What am I doing wrong? cheers, Chris From stefan_ml at behnel.de Tue Aug 11 20:27:12 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 11 Aug 2009 20:27:12 +0200 Subject: [lxml-dev] default namespace and xpath evaluation In-Reply-To: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> References: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> Message-ID: <4A81B800.1080707@behnel.de> Hi, Emanuele D'Arrigo wrote: > I can't seem to find a more compact way to do this: > > nsDict = {"default":anElement.nsmap[None]} > aChildElement = anElement.xpath("/default:elem", namespaces=nsDict > )[0] I wonder about the use case. Why would you want to look for an element of which you do not know the namespace URI in advance? > Specifically, I would have thought that if the element is in the > default namespace the simple string: > > aChildElement = anElement.xpath("/elem")[0] > > would be sufficient, but the element is not found - probably appropriately. Yes. 
http://codespeak.net/lxml/FAQ.html#how-can-i-specify-a-default-namespace-for-xpath-expressions > However, passing the element's nsmap such as in: > > aChildElement = anElement.xpath("/elem" > ,namespaces=anElement.nsmap)[0] Note that the nsmap of an element does not necessarily contain a definition of the namespace that you are looking for inside a subtree. > results in the following error message: > > Traceback (most recent call last): > File "", line 1, in > File "lxml.etree.pyx", line 1314, in lxml.etree._Element.xpath > (src/lxml/lxml.etree.c:38871) > File "xpath.pxi", line 245, in lxml.etree.XPathElementEvaluator.__init__ > (src/lxml/lxml.etree.c:106924) > File "xpath.pxi", line 117, in lxml.etree._XPathEvaluatorBase.__init__ > (src/lxml/lxml.etree.c:105514) > File "xpath.pxi", line 55, in lxml.etree._XPathContext.__init__ > (src/lxml/lxml.etree.c:104808) > File "extensions.pxi", line 77, in lxml.etree._BaseContext.__init__ > (src/lxml/lxml.etree.c:96771) > TypeError: empty namespace prefix is not supported in XPath In addition to the above FAQ, there is also: http://codespeak.net/lxml/FAQ.html#how-can-i-find-out-which-namespace-prefixes-are-used-in-a-document which gives a bit more background. Stefan From stefan_ml at behnel.de Wed Aug 12 09:09:31 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Aug 2009 09:09:31 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A8197DE.70800@simplistix.co.uk> References: <4A8197DE.70800@simplistix.co.uk> Message-ID: <4A826AAB.8070601@behnel.de> Chris Withers wrote: > I'm getting the following error when trying to install: > > Building lxml version 2.2. > NOTE: Trying to build without Cython, pre-generated > 'src/lxml/lxml.etree.c' needs to be available. 
> Using build configuration of libxslt 1.1.11 > Building against libxml2/libxslt in the following directory: /usr/lib > src/lxml/lxml.etree.c:169:31:src/lxml/lxml.etree.c:169:31: error: > libxml/schematron.h: No such file or directory > error: libxml/schematron.h: No such file or directory The libxml2 installation is (or at least the header files are) too old. http://codespeak.net/lxml/build.html#building-lxml-on-macos-x Stefan From manu3d at gmail.com Wed Aug 12 09:59:25 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Wed, 12 Aug 2009 08:59:25 +0100 Subject: [lxml-dev] default namespace and xpath evaluation In-Reply-To: <4A81B800.1080707@behnel.de> References: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> <4A81B800.1080707@behnel.de> Message-ID: <915dc91d0908120059r48315094yade0f3f55270b99f@mail.gmail.com> 2009/8/11 Stefan Behnel > Emanuele D'Arrigo wrote: > > I can't seem to find a more compact way to do this: > > > > nsDict = {"default":anElement.nsmap[None]} > > aChildElement = anElement.xpath("/default:elem", > namespaces=nsDict > > )[0] > > I wonder about the use case. Why would you want to look for an element of > which you do not know the namespace URI in advance? I do know its namespace: it's the default namespace. In the xml document it's something like rather than . For this reason I find it peculiar that I have to first create an arbitrary namespace and then use xpath("/arbitrary:elem"). Intuitively I would have expected xpath("/elem") to be enough. Thank you for pointing me to the FAQ though (and sorry I didn't check it myself first): http://codespeak.net/lxml/FAQ.html#how-can-i-specify-a-default-namespace-for-xpath-expressions in turn it pointed me to a common misuse of default namespaces, illustrated here: http://www.edankert.com/defaultnamespaces.html That was my problem. I should probably not use a default namespaces anyway. 
XML documents with the potential for tags from multiple namespaces are more
readable anyway if all tags have a namespace.

Thank you again.

Manu

From lists at cheimes.de Wed Aug 12 12:51:22 2009
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 12 Aug 2009 12:51:22 +0200
Subject: [lxml-dev] libxml2 crash on 64bit Ubuntu and solution
Message-ID:

Dear lxml users,

My blog post may save some of you several hours of debugging. If you
compile libxml2 yourself on a 64bit Ubuntu system you are going to run
into the same problem.

http://lipyrary.blogspot.com/2009/08/libxml2-crash-on-64bit-ubuntu.html

HTH
Christian

From libxml at bestley.co.uk Tue Aug 11 22:30:53 2009
From: libxml at bestley.co.uk (Mark Bestley)
Date: Tue, 11 Aug 2009 21:30:53 +0100
Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X
References: <4A8197DE.70800@simplistix.co.uk>
Message-ID:

Chris Withers writes:
> Hi All,
>
> I'm getting the following error when trying to install:
>
> Building lxml version 2.2.
> NOTE: Trying to build without Cython, pre-generated
> 'src/lxml/lxml.etree.c' needs to be available.
> Using build configuration of libxslt 1.1.11 > Building against libxml2/libxslt in the following directory: /usr/lib > src/lxml/lxml.etree.c:169:31:src/lxml/lxml.etree.c:169:31: error: > libxml/schematron.h: No such file or directory > error: libxml/schematron.h: No such file or directory > > > > src/lxml/lxml.etree.c:135067: error: dereferencing pointer to incomplete > type > src/lxml/lxml.etree.c:135068: error: dereferencing pointer to incomplete > type > src/lxml/lxml.etree.c: At top level: > src/lxml/lxml.etree.c:135174: error: invalid application of 'sizeof' to > incomplete type 'struct > __pyx_obj_4lxml_5etree__ParserSchemaValidationContext' > lipo: can't figure out the architecture type of: /var/tmp//ccxExHVU.out > error: Setup script exited with error: command 'gcc' failed with exit > status 1 > > What am I doing wrong? > You are using the libxslt supplied by Apple. You need a newer one. I got libxml and libxslt from macports . Fink would probably provide another war or look in the last month or two on this list I think someone produced a binary build of lxml for OSX -- Mark From stefan_ml at behnel.de Wed Aug 12 14:35:45 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Aug 2009 14:35:45 +0200 Subject: [lxml-dev] default namespace and xpath evaluation In-Reply-To: <915dc91d0908120059r48315094yade0f3f55270b99f@mail.gmail.com> References: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> <4A81B800.1080707@behnel.de> <915dc91d0908120059r48315094yade0f3f55270b99f@mail.gmail.com> Message-ID: <4A82B721.2070409@behnel.de> Emanuele D'Arrigo wrote: > 2009/8/11 Stefan Behnel > >> Emanuele D'Arrigo wrote: >>> I can't seem to find a more compact way to do this: >>> >>> nsDict = {"default":anElement.nsmap[None]} >>> aChildElement = anElement.xpath("/default:elem", >> namespaces=nsDict >>> )[0] >> I wonder about the use case. Why would you want to look for an element of >> which you do not know the namespace URI in advance? 
> > > I do know its namespace: it's the default namespace. In the xml document > it's something like rather than . This sounds like you are confusing namespaces (i.e. URIs) and prefixes (which are a document internal work-around for readability reasons). Prefixes can be defined and redefined all over the place in a document. Even the default namespace can be redefined at any element. Defining XPath based on the prefixes used in a particular document would render it completely unusable. Stefan From manu3d at gmail.com Wed Aug 12 14:42:48 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Wed, 12 Aug 2009 13:42:48 +0100 Subject: [lxml-dev] default namespace and xpath evaluation In-Reply-To: <4A82B721.2070409@behnel.de> References: <915dc91d0908110403ub68b31dx6257d618bf3c046c@mail.gmail.com> <4A81B800.1080707@behnel.de> <915dc91d0908120059r48315094yade0f3f55270b99f@mail.gmail.com> <4A82B721.2070409@behnel.de> Message-ID: <915dc91d0908120542l5e53296kc49bb605b9053f0d@mail.gmail.com> 2009/8/12 Stefan Behnel > Prefixes can be defined and redefined all over the place in a document. > Even the default namespace can be redefined at any element. Defining XPath > based on the prefixes used in a particular document would render it > completely unusable. A-ha! Thank you, I didn't know that! I thought namespaces were defined once and for all in the root of the document! It all makes sense now. Thank you again! Manu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090812/382024e4/attachment.htm From chris at simplistix.co.uk Wed Aug 12 15:33:30 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Wed, 12 Aug 2009 14:33:30 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: References: <4A8197DE.70800@simplistix.co.uk> Message-ID: <4A82C4AA.4090601@simplistix.co.uk> Mark Bestley wrote: > I got libxml and libxslt from macports . 
> Fink would probably provide another war or look in the last month or two > on this list I think someone produced a binary build of lxml for OSX If whoever did that could do a bdist_egg for python 2.6 and give it to Stefan to put on PyPI, that would be perfect... Any chance of that happening? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From stefan_ml at behnel.de Wed Aug 12 16:42:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Aug 2009 16:42:42 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A82C4AA.4090601@simplistix.co.uk> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> Message-ID: <4A82D4E2.9030606@behnel.de> Chris Withers wrote: > Mark Bestley wrote: >> I got libxml and libxslt from macports . >> Fink would probably provide another war or look in the last month or two >> on this list I think someone produced a binary build of lxml for OSX > > If whoever did that could do a bdist_egg for python 2.6 and give it to > Stefan to put on PyPI, that would be perfect... So, you're not using MacOS-X 10.5, I assume? Stefan From chris at simplistix.co.uk Wed Aug 12 17:32:17 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Wed, 12 Aug 2009 16:32:17 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A82D4E2.9030606@behnel.de> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> Message-ID: <4A82E081.6090707@simplistix.co.uk> Stefan Behnel wrote: > Chris Withers wrote: >> Mark Bestley wrote: >>> I got libxml and libxslt from macports . >>> Fink would probably provide another war or look in the last month or two >>> on this list I think someone produced a binary build of lxml for OSX >> If whoever did that could do a bdist_egg for python 2.6 and give it to >> Stefan to put on PyPI, that would be perfect... 
> > So, you're not using MacOS-X 10.5, I assume? 10.4.11, but what does that have to do with not wanting to fight this fight myself? If someone else can get the compile working, if they can provide a binary egg, surely that will alleviate the need for other Mac users to go through the pain each time? cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From stefan_ml at behnel.de Wed Aug 12 18:01:11 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 12 Aug 2009 18:01:11 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A82E081.6090707@simplistix.co.uk> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> Message-ID: <4A82E747.5030708@behnel.de> Chris Withers wrote: > Stefan Behnel wrote: >> Chris Withers wrote: >>> If whoever did that could do a bdist_egg for python 2.6 and give it >>> to Stefan to put on PyPI, that would be perfect... >> >> So, you're not using MacOS-X 10.5, I assume? > > 10.4.11, but what does that have to do with not wanting to fight this > fight myself? The difference is that we have binaries for 10.5: http://pypi.python.org/pypi/lxml/2.2.2 Stefan From nick.lang at propylon.com Wed Aug 12 18:16:03 2009 From: nick.lang at propylon.com (Nick Lang) Date: Wed, 12 Aug 2009 11:16:03 -0500 Subject: [lxml-dev] Inserting Element before Tails Message-ID: <4A82EAC3.7050006@propylon.com> Hello, I am new to lxml. I see from the website that it says it supports Mixed-Content. Though I have question with this mixed content support. Say for example I have the following xml: this is BOLD If I were to get a list of the children of it would _only_ contain the element correct? The .tail of would be: "this is", right? 
So the question I have is this: If I wanted to add an element before and after I would do so like this (or similarly): .insert(0, etree.Element("new")) The result of this insert would leave me with: This is BOLD Is it possible to insert before the tail of ? IE, I would want something like this. before: this is BOLD after: this is BOLD I hope this is clear. Thanks Nick From herve.cauwelier at free.fr Wed Aug 12 18:51:35 2009 From: herve.cauwelier at free.fr (Hervé Cauwelier) Date: Wed, 12 Aug 2009 18:51:35 +0200 Subject: [lxml-dev] Inserting Element before Tails In-Reply-To: <4A82EAC3.7050006@propylon.com> References: <4A82EAC3.7050006@propylon.com> Message-ID: <4A82F317.9030804@free.fr> Nick Lang wrote: > Hello, > > I am new to lxml. I see from the website that it says it supports > Mixed-Content. Though I have question with this mixed content support. > > Say for example I have the following xml: > > this is BOLD > > > If I were to get a list of the children of it would _only_ contain > the element correct? You'll get elements only, so yes only here. > The .tail of would be: "this is", right? No, this is the text (attribute ".text"). The tail is None because there is no content between the closing and the closing . I think you need to read the tutorial closely and try the examples to get familiar with the terminology. http://codespeak.net/lxml/tutorial.html > So the question I have is this: > > If I wanted to add an element before and after I would do so > like this (or similarly): > .insert(0, etree.Element("new") > > The result of this insert would leave me with: This is />BOLD > > Is it possible to insert before the tail of ? IE, I would want > something like this. > before: > this is BOLD > > after: > this is BOLD Keep your current code and just move the text of the element to the tail of the element. Hervé 
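A minimal sketch of the suggestion above, for readers following the thread. The tag names in the original messages were lost when the archive was scrubbed, so the document used here (a p element containing a b element) is an assumption, as is the element name "new". The fallback import is only there so the snippet also runs without lxml; the text/tail model is the same in the standard library's ElementTree.

```python
# Prefer lxml; the standard library's ElementTree shares the text/tail model.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical document: the real tag names were stripped from the archive,
# so <p> and <b> are assumed here purely for illustration.
root = etree.fromstring("<p>this is <b>BOLD</b></p>")

# Insert the new element as the first child. At this point "this is "
# is still root.text, so it is serialized *before* the new element.
new = etree.Element("new")
root.insert(0, new)

# Move the parent's leading text into the tail of the new element,
# so that the text now follows the new element instead of preceding it.
new.tail = root.text
root.text = None

print(etree.tostring(root))
```

The only design point is that text preceding the first child belongs to the parent's .text, while text following an element belongs to that element's .tail, so "inserting before the text" means handing the text over from one attribute to the other.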
From l at lrowe.co.uk Wed Aug 12 19:07:39 2009 From: l at lrowe.co.uk (Laurence Rowe) Date: Wed, 12 Aug 2009 18:07:39 +0100 Subject: [lxml-dev] Behaviour of the push parser in recover mode on encountering errors Message-ID: Over at http://bugzilla.gnome.org/show_bug.cgi?id=569131 I reported what I thought was a bug in HTMLParser but on closer inspection appears to be an incorrect assumption on my part (and that of lxml) when dealing with errors returned by the push parser interface. With the libxml2 bindings, I am able to parse invalid html using the push parser: >>> import libxml2 >>> options = libxml2.HTML_PARSE_RECOVER | libxml2.HTML_PARSE_NONET >>> p = libxml2.htmlCreatePushParser(None, "", 0, "test") >>> p.ctxtUseOptions(options) 0 >>> bad1 = '''

\n'''
>>> p.htmlParseChunk(bad1, len(bad1), 0)
test:1: HTML parser error : Unexpected end tag : p

^
76
>>> good = '''
foo
\n'''
>>> p.htmlParseChunk(good, len(good), 0)
76
>>> p.htmlParseChunk("", 0, 1)
76
>>> print p.doc().serialize()

foo

But with lxml, the parser is reset on encountering an error:

>>> from lxml.etree import HTMLParser, dump
>>> p = HTMLParser(recover=True)
>>> bad1 = '''

\n'''
>>> p.feed(bad1)
Traceback (most recent call last):
  File "", line 1, in ?
  File "parser.pxi", line 1093, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:61114)
  File "parser.pxi", line 534, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:56605)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57504)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56902)
XMLSyntaxError: Unexpected end tag : p, line 1, column 19
>>> good = '''
foo
\n'''
>>> p.feed(good)
>>> elem = p.close()

And previous state is lost:

>>> dump(elem)
foo
In fact, I'm unable to retrieve any state from the parser unless it is reset: >>> p.feed(bad1) Traceback (most recent call last): File "", line 1, in ? File "parser.pxi", line 1093, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:61114) File "parser.pxi", line 534, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:56605) File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57504) File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56902) XMLSyntaxError: Unexpected end tag : p, line 1, column 19 >>> p.close() Traceback (most recent call last): File "", line 1, in ? File "parser.pxi", line 1113, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:61239) XMLSyntaxError: no element found So in my view, the behaviour here is not helpful. When a parser is created with recover=True it should not raise errors, so allowing incremental parsing of invalid html. Laurence From chris at simplistix.co.uk Thu Aug 13 09:19:13 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Thu, 13 Aug 2009 08:19:13 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A82E747.5030708@behnel.de> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> Message-ID: <4A83BE71.7050907@simplistix.co.uk> Stefan Behnel wrote: > > Chris Withers wrote: >> Stefan Behnel wrote: >>> Chris Withers wrote: >>>> If whoever did that could do a bdist_egg for python 2.6 and give it >>>> to Stefan to put on PyPI, that would be perfect... >>> So, you're not using MacOS-X 10.5, I assume? >> 10.4.11, but what does that have to do with not wanting to fight this >> fight myself? > > The difference is that we have binaries for 10.5: > > http://pypi.python.org/pypi/lxml/2.2.2 Oh, I see... So no, still on 10.4... 
Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From stefan_ml at behnel.de Thu Aug 13 09:58:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Aug 2009 09:58:50 +0200 Subject: [lxml-dev] Behaviour of the push parser in recover mode on encountering errors In-Reply-To: References: Message-ID: <4A83C7BA.4020004@behnel.de> Laurence Rowe wrote: > When a parser is > created with recover=True it should not raise errors, so allowing > incremental parsing of invalid html. I agree, this is a bug. There is a bit of code in parser.pxi that handles the recovery flag in the error case, but before doing that, it already stops short when encountering an error. Fixed on the trunk. Stefan From commissarster at gmail.com Thu Aug 13 09:51:58 2009 From: commissarster at gmail.com (commissar wu) Date: Thu, 13 Aug 2009 15:51:58 +0800 Subject: [lxml-dev] lxml bug Message-ID: Hi everyone, lxml is very good and I like it. But I recently ran into a little trouble. When I use lxml to parse the contents of the URL (http://www.dtzww.cn/files/article/fulltext/23/23208.html), lxml blocks and does not raise an exception. CPU utilization is at 100%. My environment is lxml-2.2.2, ubuntu-8.04-amd64-server, python-2.5.2. My code is as follows: import lxml.html as htmltool import urllib url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html" f = urllib.urlopen(url) data = f.read() doc = htmltool.document_fromstring(data) ## <--- blocks here Looking forward to your reply. commissar your friend -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090813/7719c32f/attachment.htm From stefan_ml at behnel.de Thu Aug 13 10:02:31 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Aug 2009 10:02:31 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A83BE71.7050907@simplistix.co.uk> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> Message-ID: <4A83C897.6020004@behnel.de> Chris Withers wrote: > Stefan Behnel wrote: >> Chris Withers wrote: >>> Stefan Behnel wrote: >>>> Chris Withers wrote: >>>>> If whoever did that could do a bdist_egg for python 2.6 and give it >>>>> to Stefan to put on PyPI, that would be perfect... >>>> So, you're not using MacOS-X 10.5, I assume? >>> 10.4.11, but what does that have to do with not wanting to fight this >>> fight myself? >> >> The difference is that we have binaries for 10.5: >> >> http://pypi.python.org/pypi/lxml/2.2.2 > > Oh, I see... > > So no, still on 10.4... Did you try the 10.5 egg? It doesn't have that many dependencies, so it might even work... If they don't: Stefan (E.), any chance to get binaries that run on 10.4? Stefan From stefan_ml at behnel.de Thu Aug 13 10:08:22 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Aug 2009 10:08:22 +0200 Subject: [lxml-dev] lxml bug In-Reply-To: References: Message-ID: <4A83C9F6.3010604@behnel.de> commissar wu wrote: > Hi:everyone,lxml is very good, I like it .But I recently encountered a > little trouble.I use lxml to parse the contents of the url( > http://www.dtzww.cn/files/article/fulltext/23/23208.html),the lxml is been > blocking,and don't rasie exception. The CPU utilization rate is 100%. > > My environment is lxml-2.2.2. 
ubutnu-8.04-amd64-server python-2.5.2 > > My code is fellow: > > import lxml.html as htmltool > import urlib > > url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html" > f = urllib.urlopen(url) > data = f.read() > > doc = htmltool.document_fromstring(data) ## <--- Block this I can reproduce this, although I didn't look into it any deeper yet. This works for me, though: import lxml.html as htmltool url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html" doc = htmltool.parse(url) Stefan From commissarster at gmail.com Thu Aug 13 10:22:19 2009 From: commissarster at gmail.com (commissar wu) Date: Thu, 13 Aug 2009 16:22:19 +0800 Subject: [lxml-dev] lxml bug In-Reply-To: <4A83C9F6.3010604@behnel.de> References: <4A83C9F6.3010604@behnel.de> Message-ID: 2009/8/13 Stefan Behnel > > commissar wu wrote: > > Hi:everyone,lxml is very good, I like it .But I recently encountered a > > little trouble.I use lxml to parse the contents of the url( > > http://www.dtzww.cn/files/article/fulltext/23/23208.html),the lxml is > been > > blocking,and don't rasie exception. The CPU utilization rate is 100%. > > > > My environment is lxml-2.2.2. ubutnu-8.04-amd64-server python-2.5.2 > > > > My code is fellow: > > > > import lxml.html as htmltool > > import urlib > > > > url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html" > > f = urllib.urlopen(url) > > data = f.read() > > > > doc = htmltool.document_fromstring(data) ## <--- Block this > > I can reproduce this, although I didn't look into it any deeper yet. > > This works for me, though: > > import lxml.html as htmltool > url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html" > doc = htmltool.parse(url) > > Stefan > Oh,Stefan,thank you. you are right,htmltool.parse is ok. But,why the document_fromstring can not work ?and the lxml.html.parse and lxml.html.document_fromstring are not the same used in a way? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090813/f43ff905/attachment.htm From stefan_ml at behnel.de Thu Aug 13 10:47:14 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 13 Aug 2009 10:47:14 +0200 Subject: [lxml-dev] lxml bug In-Reply-To: References: <4A83C9F6.3010604@behnel.de> Message-ID: <4A83D312.2040404@behnel.de> commissar wu wrote: > But,why the document_fromstring can not work ? Sorry, I forgot to mention it: please file a bug report. https://bugs.launchpad.net/lxml Thanks! Stefan From gael at gawel.org Thu Aug 13 14:16:54 2009 From: gael at gawel.org (Gael Pasgrimaud) Date: Thu, 13 Aug 2009 14:16:54 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A82E747.5030708@behnel.de> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> Message-ID: <7911b3bb0908130516x21c02613xeaecf2a79b064f36@mail.gmail.com> On Wed, Aug 12, 2009 at 6:01 PM, Stefan Behnel wrote: > > > Chris Withers wrote: >> Stefan Behnel wrote: >>> Chris Withers wrote: >>>> If whoever did that could do a bdist_egg for python 2.6 and give it >>>> to Stefan to put on PyPI, that would be perfect... >>> >>> So, you're not using MacOS-X 10.5, I assume? >> >> 10.4.11, but what does that have to do with not wanting to fight this >> fight myself? > > The difference is that we have binaries for 10.5: > > http://pypi.python.org/pypi/lxml/2.2.2 > I still dont understand why my OSX 10.5 always want to compile lxml. gawel:~/tmp% virtualenv test New python executable in test/bin/python Installing setuptools............done. 
gawel:~/tmp% cd test && source bin/activate (test)gawel:~/tmp/test% easy_install lxml Searching for lxml Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.2.2 Downloading http://codespeak.net/lxml/lxml-2.2.2.tgz ^Cinterrupted (test)gawel:~/tmp/test% easy_install http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-macosx-10.5-i386.egg Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-macosx-10.5-i386.egg Processing lxml-2.2.2-py2.6-macosx-10.5-i386.egg creating /Users/gawel/tmp/test/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.5-i386.egg Extracting lxml-2.2.2-py2.6-macosx-10.5-i386.egg to /Users/gawel/tmp/test/lib/python2.6/site-packages Adding lxml 2.2.2 to easy-install.pth file Installed /Users/gawel/tmp/test/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.5-i386.egg Processing dependencies for lxml==2.2.2 Searching for lxml==2.2.2 Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.2.2 Downloading http://codespeak.net/lxml/lxml-2.2.2.tgz Processing lxml-2.2.2.tgz Running lxml-2.2.2/setup.py -q bdist_egg --dist-dir /var/folders/6W/6W87x5wkH1ObxROpTXJbTk+++TI/-Tmp-/easy_install-mOILMS/lxml-2.2.2/egg-dist-tmp-uH7pE4 Building lxml version 2.2.2. NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. Using build configuration of libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /usr/local/lib In file included from src/lxml/lxml.etree.c:139: ... 
> Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From nospamus at gmail.com Fri Aug 14 16:01:17 2009 From: nospamus at gmail.com (Bryan Hughes) Date: Fri, 14 Aug 2009 10:01:17 -0400 Subject: [lxml-dev] Validating against an empty element Message-ID: <4badce440908140701vdad41d0yc7d4113238882d09@mail.gmail.com> I have the following empty SubElement in my XML file: This value stores one of the 50 US states. When I attempt to validate it against my XSD, I receive the following error: Element 'State': [facet 'pattern'] The value '' is not accepted by the pattern Here is the code in my XSD. Can someone shed some light on why this might be failing? The "State" type is listed as a non-required field (minOccurs=0), so I'm stumped why this error is popping up. FWIW, I've ever tried adding "\s" to my regex pattern, but it still raises the same exception. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090814/c03c8c09/attachment.htm From jlovell at nwesd.org Fri Aug 14 18:49:41 2009 From: jlovell at nwesd.org (John Lovell) Date: Fri, 14 Aug 2009 09:49:41 -0700 Subject: [lxml-dev] Validating against an empty element In-Reply-To: <4badce440908140701vdad41d0yc7d4113238882d09@mail.gmail.com> References: <4badce440908140701vdad41d0yc7d4113238882d09@mail.gmail.com> Message-ID: Bryan: is an occurrence of element 'State.' Its value is ''. The way you have written your rules if 'State' exists then it must have on of the following values. If you want to allow for an empty string then you must make that an approved value. Hope this helps, John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.nwesd.org Together We Can ... 
________________________________ From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Bryan Hughes Sent: Friday, August 14, 2009 7:01 AM To: lxml-dev at codespeak.net Subject: [lxml-dev] Validating against an empty element I have the following empty SubElement in my XML file: This value stores one of the 50 US states. When I attempt to validate it against my XSD, I receive the following error: Element 'State': [facet 'pattern'] The value '' is not accepted by the pattern Here is the code in my XSD. Can someone shed some light on why this might be failing? The "State" type is listed as a non-required field (minOccurs=0), so I'm stumped why this error is popping up. FWIW, I've ever tried adding "\s" to my regex pattern, but it still raises the same exception. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090814/d7b278de/attachment.htm From stefan_ml at behnel.de Fri Aug 14 21:46:37 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 14 Aug 2009 21:46:37 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A85A5C9.5030901@urheberrecht.org> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> <4A83C897.6020004@behnel.de> <4A85A5C9.5030901@urheberrecht.org> Message-ID: <4A85BF1D.3090602@behnel.de> Hi, Pascal Oberndoerfer wrote: > Sorry for sending this large file to you directly, but I thought > it might be of use. And I don't have a clue with regard to pypi... > > Built on i386, running 10.4.11, with MacPython.org distro, using: > "python setup.py bdist_egg --static-deps". > > Untested, as I did this after "python setup.py install --static-deps". Thanks a lot! 
I put it here: http://codespeak.net/lxml/lxml-2.2.2-py2.5-macosx-10.3-i386.egg so that others can test it before I move it over to PyPI. Stefan From stefan_ml at behnel.de Fri Aug 14 22:31:52 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 14 Aug 2009 22:31:52 +0200 Subject: [lxml-dev] current trunk includes static build for libiconv Message-ID: <4A85C9B8.1070809@behnel.de> Hi, since there were problems also with libiconv on MacOS recently, I added libiconv to the list of libraries that "--static-deps" builds. I'd be happy to get some feedback on this to see if it actually works for those who reported problems. http://codespeak.net/lxml/build.html#subversion http://codespeak.net/lxml/build.html#building-lxml-on-macos-x Note that the trunk builds as lxml version "2.3dev" now. Copying over the buildlibxml.py script to another lxml version should also work. http://codespeak.net/svn/lxml/trunk/buildlibxml.py Stefan From chris at simplistix.co.uk Sat Aug 15 11:34:56 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 15 Aug 2009 10:34:56 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A83C897.6020004@behnel.de> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> <4A83C897.6020004@behnel.de> Message-ID: <4A868140.9030302@simplistix.co.uk> Stefan Behnel wrote: > Did you try the 10.5 egg? It doesn't have that many dependencies, so it > might even work... Because I use automated tools (buildout in this case) I can't manually substitute in a "wrong" egg like this... > If they don't: Stefan (E.), any chance to get binaries that run on 10.4? 
That would be great :-) Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Sat Aug 15 14:09:40 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 15 Aug 2009 13:09:40 +0100 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A85BF1D.3090602@behnel.de> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> <4A83C897.6020004@behnel.de> <4A85A5C9.5030901@urheberrecht.org> <4A85BF1D.3090602@behnel.de> Message-ID: <4A86A584.5050100@simplistix.co.uk> Stefan Behnel wrote: >> Built on i386, running 10.4.11, with MacPython.org distro, using: >> "python setup.py bdist_egg --static-deps". >> >> Untested, as I did this after "python setup.py install --static-deps". > > Thanks a lot! > > I put it here: > > http://codespeak.net/lxml/lxml-2.2.2-py2.5-macosx-10.3-i386.egg If this is for MacOS 10.4, why does it say 10.3? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From jamie at artefact.org.nz Thu Aug 20 08:15:17 2009 From: jamie at artefact.org.nz (Jamie Norrish) Date: Thu, 20 Aug 2009 18:15:17 +1200 Subject: [lxml-dev] getname() method on 'smart' attribute string values Message-ID: <1250748917.7169.14.camel@atman> I have a situation where I want to find any attributes in a document that contain a certain value, and then change that value and record the fact that the new value is the result of such a change. In order to track these changes, I am populating a dictionary keyed by the parent element, with each value being another dictionary keyed by attribute name with a value of the list of parts of the attribute that have been changed. 
So: get_attrs = etree.XPath('//@*[contains(concat(" ", ., " "), concat(" #", $old_id, " "))]') for attribute in get_attrs(element, old_id=some_string): element = attribute.getparent() And then I discover that in fact I have to use element.attrib.items() and search through for which attribute matched. It would be much easier if the attribute 'smart' string result from the XPath evaluation had a method which would specify *which* attribute of the parent it was the value of. Or am I missing a better, existing way of doing this? Note that while there is in fact a finite set of attribute names I need to check, it's a potentially expanding set, and I'd rather not have to touch the code when that expansion happens. Jamie -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090820/c35706ea/attachment-0001.pgp From stefan_ml at behnel.de Thu Aug 20 10:37:32 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 20 Aug 2009 10:37:32 +0200 Subject: [lxml-dev] getname() method on 'smart' attribute string values In-Reply-To: <1250748917.7169.14.camel@atman> References: <1250748917.7169.14.camel@atman> Message-ID: <4A8D0B4C.60206@behnel.de> Hi, Jamie Norrish wrote: > I have a situation where I want to find any attributes in a document > that contain a certain value, and then change that value and record the > fact that the new value is the result of such a change. In order to > track these changes, I am populating a dictionary keyed by the parent > element, with each value being another dictionary keyed by attribute > name with a value of the list of parts of the attribute that have been > changed. 
So: > > get_attrs = etree.XPath('//@*[contains(concat(" ", ., " "), concat(" #", > $old_id, " "))]') > > for attribute in get_attrs(element, old_id=some_string): > element = attribute.getparent() > > And then I discover that in fact I have to use element.attrib.items() > and search through for which attribute matched. It would be much easier > if the attribute 'smart' string result from the XPath evaluation had a > method which would specify *which* attribute of the parent it was the > value of. I like that. It would be an attribute, though, maybe "attrname", to make it clear that the name is a) fixed at the time it's found and b) only available for attributes, as elements have their own way of providing the tag name (which is not fixed). Stefan From lei at ipac.caltech.edu Thu Aug 20 19:01:57 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Thu, 20 Aug 2009 10:01:57 -0700 Subject: [lxml-dev] how to do file upload via submit_form Message-ID: <4A8D8185.7040604@ipac.caltech.edu> I have to support form post. Since I am using lxml for form requests, how do I set up the file upload part ? Thanks. -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From stefan_ml at behnel.de Fri Aug 21 11:32:39 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 21 Aug 2009 11:32:39 +0200 Subject: [lxml-dev] getname() method on 'smart' attribute string values In-Reply-To: <1250846887.27171.99.camel@atman> References: <1250748917.7169.14.camel@atman> <4A8D0B4C.60206@behnel.de> <1250846887.27171.99.camel@atman> Message-ID: <4A8E69B7.4040205@behnel.de> Jamie Norrish wrote: > On Thu, 2009-08-20 at 10:37 +0200, Stefan Behnel wrote: > >> I like that. It would be an attribute, though, maybe "attrname", to make it >> clear that the name is a) fixed at the time it's found and b) only >> available for attributes, as elements have their own way of providing the >> tag name (which is not fixed). > > Sounds good to me! 
You can give it a try on the trunk, if you like. https://codespeak.net/viewvc/?view=rev&revision=67010 http://codespeak.net/lxml/build.html Stefan From jamie at artefact.org.nz Fri Aug 21 11:28:07 2009 From: jamie at artefact.org.nz (Jamie Norrish) Date: Fri, 21 Aug 2009 21:28:07 +1200 Subject: [lxml-dev] getname() method on 'smart' attribute string values In-Reply-To: <4A8D0B4C.60206@behnel.de> References: <1250748917.7169.14.camel@atman> <4A8D0B4C.60206@behnel.de> Message-ID: <1250846887.27171.99.camel@atman> On Thu, 2009-08-20 at 10:37 +0200, Stefan Behnel wrote: > I like that. It would be an attribute, though, maybe "attrname", to make it > clear that the name is a) fixed at the time it's found and b) only > available for attributes, as elements have their own way of providing the > tag name (which is not fixed). Sounds good to me! Jamie -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090821/4047476e/attachment.pgp From p.oberndoerfer at urheberrecht.org Fri Aug 21 19:47:59 2009 From: p.oberndoerfer at urheberrecht.org (=?ISO-8859-1?Q?=22Dr=2E_Pascal_Obernd=F6rfer=22?=) Date: Fri, 21 Aug 2009 19:47:59 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A86A584.5050100@simplistix.co.uk> References: <4A8197DE.70800@simplistix.co.uk> <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> <4A83C897.6020004@behnel.de> <4A85A5C9.5030901@urheberrecht.org> <4A85BF1D.3090602@behnel.de> <4A86A584.5050100@simplistix.co.uk> Message-ID: <4A8EDDCF.8070509@urheberrecht.org> Chris Withers schrieb: > Stefan Behnel wrote: >>> Built on i386, running 10.4.11, with MacPython.org distro, using: >>> "python setup.py bdist_egg 
--static-deps". >>> >>> Untested, as I did this after "python setup.py install --static-deps". >> >> Thanks a lot! >> >> I put it here: >> >> http://codespeak.net/lxml/lxml-2.2.2-py2.5-macosx-10.3-i386.egg > > If this is for MacOS 10.4, why does it say 10.3? To be honest: I don't know. All I can say is that I have already seen this on some occasions (i.e. building on 10.4 and the resulting egg being named 10.3). Sorry if this is not of any help... Pascal > Chris > From mike at it-loops.com Sat Aug 22 01:51:49 2009 From: mike at it-loops.com (Michael Guntsche) Date: Sat, 22 Aug 2009 01:51:49 +0200 Subject: [lxml-dev] problems trying to install lxml 2.2 on Mac OS X In-Reply-To: <4A8EDDCF.8070509@urheberrecht.org> References: <4A82C4AA.4090601@simplistix.co.uk> <4A82D4E2.9030606@behnel.de> <4A82E081.6090707@simplistix.co.uk> <4A82E747.5030708@behnel.de> <4A83BE71.7050907@simplistix.co.uk> <4A83C897.6020004@behnel.de> <4A85A5C9.5030901@urheberrecht.org> <4A85BF1D.3090602@behnel.de> <4A86A584.5050100@simplistix.co.uk> <4A8EDDCF.8070509@urheberrecht.org> Message-ID: <20090821235148.GA16153@gibson.comsick.at> On Fri, Aug 21, 2009 at 07:47:59PM +0200, "Dr. Pascal Oberndörfer" wrote: > To be honest: I don't know. All I can say is that I have already seen > this on some occasions (i.e. building on 10.4 and the resulting egg > being named 10.3). Sorry if this is not of any help... This more or less just means that it should also work on 10.3.9. It has something to do with the MACOSX_DEPLOYMENT_TARGET that Python was built with, AFAIR.
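One way to see which deployment target a given Python was built with, and which platform tag it reports, is to ask the standard library. This is a minimal sketch, not what was run in this thread: it assumes a Python that ships the "sysconfig" module (2.7 or later, so newer than the 2.5 discussed here), and the helper name describe_build_platform is made up. On non-Mac systems the deployment target is simply None.

```python
import sysconfig


def describe_build_platform():
    """Return the platform tag and the macOS deployment target of this Python."""
    return {
        # Platform tag of the running interpreter, e.g. "macosx-10.3-i386".
        "platform": sysconfig.get_platform(),
        # Deployment target recorded when Python itself was built (None off macOS).
        "deployment_target": sysconfig.get_config_var("MACOSX_DEPLOYMENT_TARGET"),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_build_platform().items()):
        print("%s: %s" % (key, value))
```

If the tag says 10.3 on a 10.4 machine, that would be consistent with the interpreter itself having been built to target 10.3.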
Kind regards, Michael Guntsche From jamie at artefact.org.nz Sun Aug 23 11:00:51 2009 From: jamie at artefact.org.nz (Jamie Norrish) Date: Sun, 23 Aug 2009 21:00:51 +1200 Subject: [lxml-dev] getname() method on 'smart' attribute string values In-Reply-To: <4A8E69B7.4040205@behnel.de> References: <1250748917.7169.14.camel@atman> <4A8D0B4C.60206@behnel.de> <1250846887.27171.99.camel@atman> <4A8E69B7.4040205@behnel.de> Message-ID: <1251018051.6506.49.camel@atman> On Fri, 2009-08-21 at 11:32 +0200, Stefan Behnel wrote: > You can give it a try on the trunk, if you like. Perfect, thank you! Jamie -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090823/49d7e0bf/attachment.pgp From p.oberndoerfer at urheberrecht.org Sun Aug 23 13:52:35 2009 From: p.oberndoerfer at urheberrecht.org (Pascal Oberndoerfer) Date: Sun, 23 Aug 2009 13:52:35 +0200 Subject: [lxml-dev] current trunk includes static build for libiconv In-Reply-To: <4A85C9B8.1070809@behnel.de> References: <4A85C9B8.1070809@behnel.de> Message-ID: <4A912D83.9080407@urheberrecht.org> Stefan Behnel schrieb: > Hi, > > since there were problems also with libiconv on MacOS recently, I added > libiconv to the list of libraries that "--static-deps" builds. > > I'd be happy to get some feedback on this to see if it actually works for > those who reported problems. > > http://codespeak.net/lxml/build.html#subversion > http://codespeak.net/lxml/build.html#building-lxml-on-macos-x > > Note that the trunk builds as lxml version "2.3dev" now. Copying over the > buildlibxml.py script to another lxml version should also work. > > http://codespeak.net/svn/lxml/trunk/buildlibxml.py > > Stefan I copied the new 'buildlibxml.py' into a clean lxml-2.2.2 directory and started a build with '-- static-deps'. 
Everything seems to work fine (libiconv, libxml2, and libxslt build nicely) except for some minor errors like: - 'make[3]: [install-data-local] Error 71 (ignored)' - 'make[2]: [xsltproc.html] Error 4 (ignored)' Unfortunately -- after installing -- I get this ImportError on doing 'import lxml.etree': > ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml-2.2.2-py2.5-macosx-10.3-ppc.egg/lxml/etree.so, 2): Symbol not found: _libiconv_close > > Referenced from: /Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml-2.2.2-py2.5-macosx-10.3-ppc.egg/lxml/etree.so > > Expected in: dynamic lookup I wonder whether libiconv isn't linked correctly during one of the last steps (the lxml build?). Where could I look for this? As the problem is AFAICT only related to the PPC platform (and if running MacOS X 10.4.x?), would it make sense to build libiconv statically only 'if platform.processor() == 'powerpc' and major_version == 8:'? This could possibly help avoid any side effects on Intel or 10.5 systems. Just a thought... Thanks a lot! Pascal From lei at ipac.caltech.edu Mon Aug 24 22:45:27 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Mon, 24 Aug 2009 13:45:27 -0700 Subject: [lxml-dev] a solution to file upload Message-ID: <4A92FBE7.1040001@ipac.caltech.edu> On 8/20/09 I posted the following: I have to support form post. Since I am using lxml for form requests, how do I set up the file upload part? I now have a solution for Python 2.6: 1. I adopted Fabien SEISEN's urllib2_file.py. When my code detects that the form request is a file upload request, instead of using submit_form, my code switches to calling the urllib2.urlopen HTTP handler in urllib2_file.py and thus puts out the same multipart/form-data format for file upload as is done via the web browser.
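For readers unfamiliar with the format mentioned above, here is a rough sketch of what a multipart/form-data body looks like when built by hand. It is not the code from urllib2_file.py; it is a simplified illustration written for current Python 3 rather than the Python 2.6 of this thread, the function name encode_multipart is made up, and it does not escape quotes in field names or file names.

```python
import io
import uuid


def encode_multipart(fields, files):
    """Encode form fields and files as a multipart/form-data body.

    fields: dict mapping field name -> text value
    files:  dict mapping field name -> (filename, bytes content, content type)
    Returns (body as bytes, value for the Content-Type header).
    """
    boundary = uuid.uuid4().hex
    delimiter = ("--" + boundary + "\r\n").encode("ascii")
    body = io.BytesIO()

    # Plain form fields: one part each, value follows a blank line.
    for name, value in fields.items():
        body.write(delimiter)
        header = 'Content-Disposition: form-data; name="%s"\r\n\r\n' % name
        body.write(header.encode("utf-8"))
        body.write(value.encode("utf-8") + b"\r\n")

    # File fields: same layout, plus a filename and a per-part Content-Type.
    for name, (filename, content, content_type) in files.items():
        body.write(delimiter)
        header = (
            'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
            "Content-Type: %s\r\n\r\n" % (name, filename, content_type)
        )
        body.write(header.encode("utf-8"))
        body.write(content + b"\r\n")

    # Closing delimiter marks the end of the body.
    body.write(("--" + boundary + "--\r\n").encode("ascii"))
    return body.getvalue(), "multipart/form-data; boundary=" + boundary
```

The returned body and header value could then be handed to an HTTP client, for example as the data and the Content-Type header of a urllib.request.Request.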
I did have to include an error handler for urllib2_file.py:

    # Special case for StringIO
    try:  # <--- catch
        if fd.__module__ in ("StringIO", "cStringIO"):  # <-- fd may have no __module__
            name = k
            fd.seek(0, 2)  # EOF
            file_size = fd.tell()
            fd.seek(0)  # START
        else:
            file_size = os.fstat(fd.fileno())[stat.ST_SIZE]
    except AttributeError:  # <-- catch exception here
        file_size = os.fstat(fd.fileno())[stat.ST_SIZE]

-- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From ted at milo.com Mon Aug 24 23:23:15 2009 From: ted at milo.com (Ted Dziuba) Date: Mon, 24 Aug 2009 14:23:15 -0700 Subject: [lxml-dev] a solution to file upload In-Reply-To: <4A92FBE7.1040001@ipac.caltech.edu> References: <4A92FBE7.1040001@ipac.caltech.edu> Message-ID: <6451ccbf0908241423u6504c8b5qeeee8a8c3a561eb8@mail.gmail.com> Have you looked at mechanize? It does almost all form automation you could ever want. It parses content with BeautifulSoup, though, so parse trees may look different than in lxml. Ted On Mon, Aug 24, 2009 at 1:45 PM, Mary Lei wrote: > 8/20/09 I posted the following: > I have to support form post. > Since I am using lxml for form > requests, how do I set up the file upload > part ? > > I now have a solution for python 2.6: > 1. I adopted Fabien SEISEN's > urllib2_file.py. When my code > detects that the form request is a file > upload request, instead of using > submit_form, my code switches to calling > urllib2.urlopen http handler in urllib2_file.py > and thus puts out the same multipart form/data > format for file upload as done via the web browser.
> I did have to include an error handler for > urllib2_file.py: > > # Special case for StringIO > try: <--- catch > if fd.__module__ in ("StringIO", "cStringIO"): <-- has no > __module__ > name = k > fd.seek(0, 2) # EOF > file_size = fd.tell() > fd.seek(0) # START > else: > file_size = os.fstat(fd.fileno())[stat.ST_SIZE] > except AttributeError: <-- catch exception here > file_size = os.fstat(fd.fileno())[stat.ST_SIZE] > > -- > Mary Lei > > Software Testing > IPAC-NExScl > > Rm: KS-233 > MS: 220-6 > Phone: 395-1998 > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > -- Ted Dziuba Co-Founder and Engineer Milo.com, Inc. 165 University Avenue Palo Alto, CA, 94301 http://milo.com Cell: (609)-665-2639 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090824/c6470646/attachment-0001.htm From lei at ipac.caltech.edu Tue Aug 25 00:08:17 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Mon, 24 Aug 2009 15:08:17 -0700 Subject: [lxml-dev] a solution to file upload In-Reply-To: <6451ccbf0908241423u6504c8b5qeeee8a8c3a561eb8@mail.gmail.com> References: <4A92FBE7.1040001@ipac.caltech.edu> <6451ccbf0908241423u6504c8b5qeeee8a8c3a561eb8@mail.gmail.com> Message-ID: <4A930F51.9040903@ipac.caltech.edu> Thanks for the info. There is ClientForm also. But at this point in my project I am not going to include too many packages as I won't have a job in October. The project manager is nervous about being dependent on too many external packages and I won't have adequate time to research them out. I am finding limitations with lxml but so far things are working. Ted Dziuba wrote: > Have you looked at mechanize? It does almost all form automation you > could ever want. It parses content with BeautifulSoup, though, so parse > trees may look different than in lxml. 
> > Ted > > On Mon, Aug 24, 2009 at 1:45 PM, Mary Lei > wrote: > > 8/20/09 I posted the following: > I have to support form post. > Since I am using lxml for form > requests, how do I set up the file upload > part ? > > I now have a solution for python 2.6: > 1. I adopted Fabien SEISEN's > urllib2_file.py. When my code > detects that the form request is a file > upload request, instead of using > submit_form, my code switches to calling > urllib2.urlopen http handler in urllib2_file.py > and thus puts out the same multipart form/data > format for file upload as done via the web browser. > I did have to include an error handler for > urllib2_file.py: > > # Special case for StringIO > try: <--- catch > if fd.__module__ in ("StringIO", "cStringIO"): <-- has no > __module__ > name = k > fd.seek(0, 2) # EOF > file_size = fd.tell() > fd.seek(0) # START > else: > file_size = os.fstat(fd.fileno())[stat.ST_SIZE] > except AttributeError: <-- catch exception here > file_size = os.fstat(fd.fileno())[stat.ST_SIZE] > > -- > Mary Lei > > Software Testing > IPAC-NExScl > > Rm: KS-233 > MS: 220-6 > Phone: 395-1998 > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > -- > Ted Dziuba > Co-Founder and Engineer > > Milo.com, Inc. > 165 University Avenue > Palo Alto, CA, 94301 > http://milo.com > > Cell: (609)-665-2639 > -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From ndudfield at gmail.com Tue Aug 25 01:46:32 2009 From: ndudfield at gmail.com (Nicholas Dudfield) Date: Tue, 25 Aug 2009 09:46:32 +1000 Subject: [lxml-dev] getname() method on 'smart' attribute string values In-Reply-To: References: Message-ID: <4A932658.9000408@gmail.com> > You can give it a try on the trunk, if you like. 
> > https://codespeak.net/viewvc/?view=rev&revision=67010 > > http://codespeak.net/lxml/build.html > > Stefan > I also need this functionality, along with a bugfix from a revision newer than the stable version 2.2.2 that is available for Windows. I heard libxml2/lxml is a PITA to build on Windows, so being inexperienced I won't attempt it before ruling out alternatives. Is there a build bot with dist zips of the latest revisions available anywhere for Windows? From philipp.reichmuth+gmane at gmail.com Fri Aug 28 16:22:35 2009 From: philipp.reichmuth+gmane at gmail.com (Philipp Reichmuth) Date: Fri, 28 Aug 2009 16:22:35 +0200 Subject: [lxml-dev] Python 3.1 binary on Windows? Message-ID: <1n6jo1parl8po.iyhfdc2pakhw.dlg@40tude.net> Hi, maybe it's just me being stupid for overlooking something, but are there Windows binaries built for Python 3.1 out there? Philipp