From cJ-lxml at zougloub.eu Mon Nov 1 02:46:23 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Sun, 31 Oct 2010 21:46:23 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <1286350813.29843.587028.camel@atlas>
References: <20100903161533.20e66719@Bidule.intranet.cs> <1286350813.29843.587028.camel@atlas>
Message-ID: <20101031214623.34690eba@zougloub.eu>

On Wed, 06 Oct 2010 11:40:13 +0400 Alexander Shigin wrote:

> It looks like the usage of XSLT.strparam solves your problem. Please
> look at attached patch.

Thanks a lot Alexander, it works.

Regards,

--
cJ

From cJ-lxml at zougloub.eu Mon Nov 1 13:38:59 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Mon, 1 Nov 2010 08:38:59 -0400
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
Message-ID: <20101101083859.39339551@zougloub.eu>

Hi,

libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).

xsltproc --maxdepth 10000 ....

At the moment, lxml does not touch this value.

Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?

Regards,

--
cJ

From stefan_ml at behnel.de Mon Nov 1 17:14:30 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 01 Nov 2010 17:14:30 +0100
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <20101101083859.39339551@zougloub.eu>
References: <20101101083859.39339551@zougloub.eu>
Message-ID: <4CCEE766.6000905@behnel.de>

Jérôme Carretero, 01.11.2010 13:38:
> libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
> and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
> xsltproc --maxdepth 10000 ....
>
> At the moment, lxml does not touch this value.
>
> Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
Such a local property doesn't work well for a global setting, and global
settings are always evil. Isn't there a per-call setting for this?

Stefan

From Althial at gmx.net Mon Nov 1 18:48:52 2010
From: Althial at gmx.net (Althial at gmx.net)
Date: Mon, 01 Nov 2010 18:48:52 +0100
Subject: [lxml-dev] External Ids with DTD class
In-Reply-To:
References:
Message-ID: <20101101174852.43870@gmx.net>

Hi,

I want to use lxml to validate fragments of xhtml but setting up the parser is driving me nuts.

from lxml import etree
dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XHTML V4.2//EN")

That's taken from the webpage's tutorial and this

dtd = etree.DTD(external_id = "-//W3C//DTD XHTML 1.0 Strict//EN")

is what I'd like to do. Result:

DTDParseError: failed to load external entity "-//W3C//DTD XHTML 1.0 Strict//EN"

Now I realize this looks like some setup problem on my side. I guess my system is simply lacking the catalog entries so the DTD can't be found. But the documentation (of lxml) says nothing more on this issue.

I'm working with Ubuntu 10.04 and all my /usr/share/sgml/docbook/dtd directory contains only xml which itself holds versions 4 to 4.5 - but nothing with XHTML.

All I want is some basic validation to XHTML 1.0 STRICT. Is that really so hard to set up? :-(

Amnu

From jholg at gmx.de Tue Nov 2 09:40:23 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 02 Nov 2010 09:40:23 +0100
Subject: [lxml-dev] (no subject)
Message-ID: <20101102084023.275460@gmx.net>

Hi,

just stumbled upon this:

http://stackoverflow.com/questions/3103661

In short: Should we consider this a bug:

>>> root = etree.fromstring("""
... 206
...
...
...
...
...
... """)
>>> root['{}duration']
Traceback (most recent call last):
  File "", line 1, in ?
  File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
  File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
AttributeError: no such child: {http://api.example.com}duration
>>>

?

Looks like there is no way to get at a no-namespace child element apart from working around this e.g. using xpath.

Holger

From svetlyak.40wt at gmail.com Tue Nov 2 10:07:33 2010
From: svetlyak.40wt at gmail.com (Alexander Artemenko)
Date: Tue, 2 Nov 2010 12:07:33 +0300
Subject: [lxml-dev] (no subject)
In-Reply-To: <20101102084023.275460@gmx.net>
References: <20101102084023.275460@gmx.net>
Message-ID:

Hi!

On Tue, Nov 2, 2010 at 11:40 AM, wrote:
> Hi,
>
> just stumbled upon this:
>
> http://stackoverflow.com/questions/3103661
>
> In short: Should we consider this a bug:
>
>>>> root = etree.fromstring("""
> ... 206
> ...
> ...     ...
> ...
> ...
> ... """)
>>>> root['{}duration']
> Traceback (most recent call last):
>  File "", line 1, in ?
>  File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>  File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
> AttributeError: no such child: {http://api.example.com}duration
>>>>

This is not a bug, because you MUST specify namespaces for the
duration, because this element is in the scope of the 'ns2'
namespaces. See http://www.w3.org/TR/xml-names/#scoping for details.

--
Alexander Artemenko (a.k.a.
Svetlyak 40wt)
Blog: http://aartemenko.com
Photos: http://svetlyak.ru
Jabber: svetlyak.40wt at gmail.com

From stefan_ml at behnel.de Tue Nov 2 10:30:38 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Nov 2010 10:30:38 +0100
Subject: [lxml-dev] (no subject)
In-Reply-To:
References: <20101102084023.275460@gmx.net>
Message-ID: <4CCFDA3E.5020600@behnel.de>

Alexander Artemenko, 02.11.2010 10:07:
> On Tue, Nov 2, 2010 at 11:40 AM, jholg wrote:
>> just stumbled upon this:
>>
>> http://stackoverflow.com/questions/3103661
>>
>> In short: Should we consider this a bug:
>>
>> >>> root = etree.fromstring("""
>> ...206
>> ...
>> ......
>> ...
>> ...
>> ... """)
>> >>> root['{}duration']
>> Traceback (most recent call last):
>> File "", line 1, in ?
>> File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>> File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
>> AttributeError: no such child: {http://api.example.com}duration
>
> This is not a bug, because you MUST specify namespaces for the
> duration, because this element is in the scope of the 'ns2'
> namespaces. See http://www.w3.org/TR/xml-names/#scoping for details.

The spec says in 6.2:

"""
If there is a default namespace declaration in scope, the expanded name
corresponding to an unprefixed element name has the URI of the default
namespace as its namespace name. If there is no default namespace
declaration in scope, the namespace name has no value.
"""

So, in the above case, "the namespace name has no value", which is just
fine. Although rare, this *is* a problem.

Personally, I think I would have expected "root['{}duration']" to work, but
I haven't looked into it any deeper yet. It might be worth special casing
this in lxml.objectify.
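
The scoping rule quoted from section 6.2 can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml.objectify (an assumption, made so that it runs without lxml installed; lxml.etree provides the same find() call). The document is a hypothetical stand-in for the one scrubbed from the original post: the root is in the http://api.example.com namespace through the ns2 prefix, and the unprefixed duration child has no namespace because no default namespace is declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: prefixed root, unprefixed child, no default namespace.
XML = (
    '<ns2:track xmlns:ns2="http://api.example.com">'
    '<duration>206</duration>'
    '</ns2:track>'
)

root = ET.fromstring(XML)

# The root carries the namespace bound to the ns2 prefix.
root_tag = root.tag

# The child is NOT in that namespace: a prefix declaration does not apply to
# unprefixed element names (only a default namespace declaration would).
qualified = root.find('{http://api.example.com}duration')
unqualified = root.find('duration')

print(root_tag)
print(qualified)
print(unqualified.text)
```

Looking the child up by its plain tag name is the kind of workaround mentioned earlier in the thread; whether objectify should accept the '{}duration' spelling is the open question here.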
Stefan

From cJ-lxml at zougloub.eu Thu Nov 4 13:42:01 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Thu, 4 Nov 2010 08:42:01 -0400
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <4CCEE766.6000905@behnel.de>
References: <20101101083859.39339551@zougloub.eu> <4CCEE766.6000905@behnel.de>
Message-ID: <20101104084201.309144d6@zougloub.eu>

On Mon, 01 Nov 2010 17:14:30 +0100
Stefan Behnel wrote:

> Jérôme Carretero, 01.11.2010 13:38:
> > libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
> > and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
> > xsltproc --maxdepth 10000 ....
> >
> > At the moment, lxml does not touch this value.
> >
> > Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
>
> Such a local property doesn't work well for a global setting, and global
> settings are always evil. Isn't there a per-call setting for this?

Attached is a libxslt patch that makes the max template depth an attribute of the transform context and not a global variable.

Comments ?

Regards,

--
cJ

PS: the patch was applied onto the master branch of Diego's git://gitorious.org/libxslt/libxslt.git mirror, I was too lazy to dig up the real libxslt repository.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: libxslt-unglobalize-maxdepth.patch
Type: text/x-patch
Size: 4293 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20101104/88174fb9/attachment-0001.bin

From stefan_ml at behnel.de Thu Nov 4 14:05:25 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 04 Nov 2010 14:05:25 +0100
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <20101104084201.309144d6@zougloub.eu>
References: <20101101083859.39339551@zougloub.eu> <4CCEE766.6000905@behnel.de> <20101104084201.309144d6@zougloub.eu>
Message-ID: <4CD2AF95.7060607@behnel.de>

Jérôme Carretero, 04.11.2010 13:42:
> On Mon, 01 Nov 2010 17:14:30 +0100
> Stefan Behnel wrote:
>
>> Jérôme Carretero, 01.11.2010 13:38:
>>> libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
>>> and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
>>> xsltproc --maxdepth 10000 ....
>>>
>>> At the moment, lxml does not touch this value.
>>>
>>> Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
>>
>> Such a local property doesn't work well for a global setting, and global
>> settings are always evil. Isn't there a per-call setting for this?
>
> Attached is a libxslt patch that makes the max template depth an attribute of the transform context and not a global variable.
>
> Comments ?

Wrong list. ;)

But to get it accepted, I think you will have to keep the old interface in
addition to the new one.

Stefan

From Marc.Graff at VerizonWireless.com Thu Nov 4 19:41:24 2010
From: Marc.Graff at VerizonWireless.com (Graff, Marc)
Date: Thu, 4 Nov 2010 14:41:24 -0400
Subject: [lxml-dev] Need feedback on Memory Errors
Message-ID: <20101104185239.429A1282B9D@codespeak.net>

I just finished an app that parses a large xml file "FeedA" and appends another smaller file fragmentB to the tree from FeedA for an xpath specified parent node.
All seems fine when processing a file less than 500MB but anything larger results in one of two errors.

All libs were built from src in my home dir and LD_LIBRARY_PATH reflects the home dir lib. Not sure if that will distort the following lib details

lxml.etree: (2, 2, 8, 0)
libxml used: (2, 7, 7)
libxml compiled: (2, 6, 23)
libxslt used: (1, 1, 26)
libxslt compiled: (1, 1, 15)

There should be ample memory. This is running on a Solaris M5000 with 96GB of memory and ulimit is unlimited. The FeedA test file contains valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file tests.

Just over 500MB and the app returns a MemoryError on the serializer.pxi. Attached is the full error trace from the captured exception. The parse obj has (huge_tree=True). I didn't expect this to make a difference since the error is in the serialization of the output but tried anyway.

File "serializer.pxi", line 133, in lxml.etree._tostring (src/lxml/lxml.etree.c:79345)
MemoryError

Anything over 1.5GB and it core dumps (first error in the stack trapped in libxml2). Attached is the stack and mappings from mdb. Including in case it is related:

libc.so.1`strlen+0x50(39e730, 3ceef0, 1e3718, 755, 1e3724, 756)
etree.so`__pyx_f_4lxml_5etree_13_BaseErrorLog__receive+0xcc(397490, 3ceef0, feff5d24, 74e, 2, 74e)
etree.so`__pyx_f_4lxml_5etree__forwardError+0x6c(fef081e0, 3ceef0, d9d64, fed303a0, ff1303bc, 1)
libxml2.so.2.7.7`__xmlRaiseError+0x2c4(fef07980, fee9c4c8, 397490, 3ced70, ffffffff, 1)
libxml2.so.2.7.7`xmlErrMemory+0xa4(3ced70, fedc1e70, d9d64, ff0566a8, ff1303bc, ff13a558)
libxml2.so.2.7.7`xmlSAX2TextNode+0x2e0(0, 3f5616, 7, 1, 6, feddb774)

I am new to python so any help/suggestions would be greatly appreciated.

Thanks
Marc

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lxml_error_message.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20101104/fa85bee7/attachment.txt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corelog.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20101104/fa85bee7/attachment-0001.txt

From stefan_ml at behnel.de Thu Nov 4 23:01:17 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 04 Nov 2010 23:01:17 +0100
Subject: [lxml-dev] Need feedback on Memory Errors
In-Reply-To: <20101104185239.429A1282B9D@codespeak.net>
References: <20101104185239.429A1282B9D@codespeak.net>
Message-ID: <4CD32D2D.6010606@behnel.de>

Hi,

Graff, Marc, 04.11.2010 19:41:
> I just finished an app that parses a large xml file "FeedA" and appends
> another smaller file fragmentB to the tree from FeedA for an xpath
> specified parent node. All seems fine when processing a file less than
> 500MB but anything larger results in one of two errors.

You may not be aware of it, but this is huge. If that's just the size of
the serialised XML, this means that the in-memory tree representation is
several times that size, easily 10x or more. Depending on the text-to-tag
ratio in the content, it may well reach the size of your available memory.
Check the size of the Python process while it's building the tree, prstat
is your friend.

> All libs were built from src in my home dir and LD_LIBRARY_PATH reflects
> the home dir lib. Not sure if that will distort the following lib
> details
>
> lxml.etree: (2, 2, 8, 0)
>
> libxml used: (2, 7, 7)
>
> libxml compiled: (2, 6, 23)
>
> libxslt used: (1, 1, 26)
>
> libxslt compiled: (1, 1, 15)

Try to build against the libraries that you use at runtime. lxml has
several bug work-arounds and compile time adaptations for the various
library versions. A major discrepancy between the version used at compile
time and runtime, such as in your case, may have unexpected side effects.
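
A small sketch of how such a compile-time/runtime discrepancy could be detected from Python. The describe_mismatch helper is a hypothetical name introduced here for illustration; the version attributes read inside check_lxml_libraries are the ones lxml.etree exposes, and that function is only defined, not called, so the example runs without lxml installed.

```python
def describe_mismatch(name, compiled, runtime):
    """Return a warning string if the two version tuples differ, else None."""
    if tuple(compiled) != tuple(runtime):
        return "%s: compiled against %s but running with %s" % (
            name,
            ".".join(str(part) for part in compiled),
            ".".join(str(part) for part in runtime),
        )
    return None


def check_lxml_libraries():
    """Compare the library versions lxml was built with against those loaded now."""
    from lxml import etree  # imported lazily; requires lxml to be installed
    pairs = [
        ("libxml2", etree.LIBXML_COMPILED_VERSION, etree.LIBXML_VERSION),
        ("libxslt", etree.LIBXSLT_COMPILED_VERSION, etree.LIBXSLT_VERSION),
    ]
    warnings = [describe_mismatch(*pair) for pair in pairs]
    return [w for w in warnings if w is not None]


# The version numbers reported in this thread:
print(describe_mismatch("libxml2", (2, 6, 23), (2, 7, 7)))
print(describe_mismatch("libxslt", (1, 1, 15), (1, 1, 26)))
```

Calling check_lxml_libraries() at application start-up and logging anything it returns is one way to notice a mismatch before it shows up as an obscure failure.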
You can pass the path to the configuration scripts (xml2-config and
xslt-config in the bin directories of the install destinations) using the
XML2_CONFIG and XSLT_CONFIG environment variables.

> There should be ample memory. This is running on a Solaris M5000 with
> 96GB of memory and ulimit is unlimited. The FeedA test file contains
> valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file
> tests.
>
> Just over 500MB and the app returns a MemoryError on the serializer.pxi.

The serialiser needs to reallocate additional memory step by step while
it's doing its work. Normally, the OS handles this by enlarging the
allocated area and without copying. However, if the available memory runs
low, memory fragmentation may trigger the allocation of a completely new
memory area of very large size to copy the previously allocated memory
into, which may easily fail since memory is low already. So even if there
is some memory left in the system, it may not be enough to satisfy the
memory allocation scheme at hand. Remember that your output alone is
500-3000 MB in one single piece of memory, and libxml2 can't know in
advance that it will need that much.

So, please monitor the memory consumption of the process. If you are really
running out of memory, one thing you can try is to switch to cElementTree
(xml.etree.cElementTree). It has a somewhat lighter memory footprint which
may just be enough to make a difference here (although likely not for 3GB
of XML). It also has fewer features than lxml.etree (a bit less so in
Py2.7/ET1.3), but currently, your only real problem seems to be the memory
requirement.

Stefan

From Marc.Graff at VerizonWireless.com Fri Nov 5 16:14:29 2010
From: Marc.Graff at VerizonWireless.com (Graff, Marc)
Date: Fri, 5 Nov 2010 11:14:29 -0400
Subject: [lxml-dev] Need feedback on Memory Errors
In-Reply-To:
References: <20101104185239.429A1282B9D@codespeak.net>
Message-ID: <20101105151505.24CA9282BF5@codespeak.net>

The runtime vs.
compile time lib difference went unnoticed (I missed the 500 lb. gorilla) until my ride home last night, even though it was right in front of me. The long ride home is often when things that elude me come together. I was concerned I had opened myself up to justifiably harsh scrutiny. Thanks for kindly confirming.

Also thanks for the helpful insights. I generally run top to see what my code is using but will add prstat to my monitoring. I will recompile with the correct env vars and retest.

I was considering the possible memory footprint of the current implementation but wanted to finish version 1. I will try altering with cElementTree and compare to the current code. I am also going to investigate an event driven parse_and_append approach since lxml provides such a mechanism and I believe that could reduce memory usage drastically.

Thanks for the very useful feedback and have a good weekend.

Marc

-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Stefan Behnel
Sent: Thursday, November 04, 2010 6:01 PM
To: Graff, Marc
Cc: lxml-dev at codespeak.net
Subject: Re: [lxml-dev] Need feedback on Memory Errors

Hi,

Graff, Marc, 04.11.2010 19:41:
> I just finished an app that parses a large xml file "FeedA" and appends
> another smaller file fragmentB to the tree from FeedA for an xpath
> specified parent node. All seems fine when processing a file less than
> 500MB but anything larger results in one of two errors.

You may not be aware of it, but this is huge. If that's just the size of
the serialised XML, this means that the in-memory tree representation is
several times that size, easily 10x or more. Depending on the text-to-tag
ratio in the content, it may well reach the size of your available memory.
Check the size of the Python process while it's building the tree, prstat
is your friend.

> All libs were built from src in my home dir and LD_LIBRARY_PATH reflects
> the home dir lib.
> Not sure if that will distort the following lib
> details
>
> lxml.etree: (2, 2, 8, 0)
>
> libxml used: (2, 7, 7)
>
> libxml compiled: (2, 6, 23)
>
> libxslt used: (1, 1, 26)
>
> libxslt compiled: (1, 1, 15)

Try to build against the libraries that you use at runtime. lxml has
several bug work-arounds and compile time adaptations for the various
library versions. A major discrepancy between the version used at compile
time and runtime, such as in your case, may have unexpected side effects.
You can pass the path to the configuration scripts (xml2-config and
xslt-config in the bin directories of the install destinations) using the
XML2_CONFIG and XSLT_CONFIG environment variables.

> There should be ample memory. This is running on a Solaris M5000 with
> 96GB of memory and ulimit is unlimited. The FeedA test file contains
> valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file
> tests.
>
> Just over 500MB and the app returns a MemoryError on the serializer.pxi.

The serialiser needs to reallocate additional memory step by step while
it's doing its work. Normally, the OS handles this by enlarging the
allocated area and without copying. However, if the available memory runs
low, memory fragmentation may trigger the allocation of a completely new
memory area of very large size to copy the previously allocated memory
into, which may easily fail since memory is low already. So even if there
is some memory left in the system, it may not be enough to satisfy the
memory allocation scheme at hand. Remember that your output alone is
500-3000 MB in one single piece of memory, and libxml2 can't know in
advance that it will need that much.

So, please monitor the memory consumption of the process. If you are really
running out of memory, one thing you can try is to switch to cElementTree
(xml.etree.cElementTree). It has a somewhat lighter memory footprint which
may just be enough to make a difference here (although likely not for 3GB
of XML).
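
As a rough sketch of the event-driven direction mentioned in this thread, the example below streams through a document with iterparse and discards each record once it has been handled, so the whole tree is never held in memory. It uses the standard library's xml.etree.ElementTree (lxml.etree offers an iterparse with a similar interface); the feed and item element names and the count_items helper are hypothetical, and writing the modified output incrementally is a separate problem that is not shown.

```python
import io
import xml.etree.ElementTree as ET


def count_items(source, tag):
    """Stream through source and count <tag> elements without keeping them."""
    count = 0
    # iterparse yields (event, element) pairs; "end" fires once an element
    # and all of its children have been read completely.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)      # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            root.clear()         # drop finished children so memory stays flat
    return count


# A small in-memory stand-in for a large feed file (hypothetical data).
feed = io.BytesIO(b"<feed>" + b"<item><name>x</name></item>" * 1000 + b"</feed>")
total = count_items(feed, "item")
print(total)
```

The same loop works on a file path or an open file object instead of the BytesIO buffer.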
It also has fewer features than lxml.etree (a bit less so in
Py2.7/ET1.3), but currently, your only real problem seems to be the memory
requirement.

Stefan
_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

From ra.ravi.rav at gmail.com Sat Nov 6 18:44:12 2010
From: ra.ravi.rav at gmail.com (Ravi)
Date: Sat, 6 Nov 2010 23:14:12 +0530
Subject: [lxml-dev] etree.tostring() cannot handle Unicode
Message-ID:

With reference to the bug report
https://bugs.launchpad.net/lxml/+bug/671885 I found that
etree.tostring() cannot handle Unicode. It is giving me the
UnicodeEncodeError.

Thanks.

From stefan_ml at behnel.de Sat Nov 6 19:37:08 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Nov 2010 19:37:08 +0100
Subject: [lxml-dev] etree.tostring() cannot handle Unicode
In-Reply-To:
References:
Message-ID: <4CD5A054.2010701@behnel.de>

Ravi, 06.11.2010 18:44:
> With reference to the bug report
> https://bugs.launchpad.net/lxml/+bug/671885 I found that
> etree.tostring() cannot handle Unicode. It is giving me the
> UnicodeEncodeError.

etree.tostring() handles unicode (whatever you mean by that) nicely, so the
issue is most likely with your own code. Please provide an example that
shows what your concrete problem is.

Stefan

From Paul.Wray at det.nsw.edu.au Wed Nov 10 01:03:44 2010
From: Paul.Wray at det.nsw.edu.au (Wray, Paul)
Date: Wed, 10 Nov 2010 11:03:44 +1100
Subject: [lxml-dev] Correcting 'simple' broken HTML
Message-ID: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>

Background: I need a paragraph-insertion algorithm for use with Internet Explorer's webbrowser component, as the use of execCommand('InsertParagraph') or pasteHTML gives unpredictable results.
I have developed a simple algorithm (~50 lines of Python) that gives acceptable results for the simplest cases, but I have no confidence that it covers all cases.

I thought that I could use lxml to correct the HTML when a paragraph is inserted in the middle of text (pipe character represents caret position):

Original:

xxx|yyy

Insert para:

xxx

yyy

Fixed:

xxx

yyy

This simplest case works OK, but I was surprised to find that this fails, when breaking a line within an inline element: Original:

wwwxxx|yyyzzz

Insert para:

wwwxxx

yyyzzz

Expected output:

wwwxxx

yyyzzz

Actual output from lxml.etree and lxml.html:

wwwxxx

yyyzzz

So it seems that both lxml.etree and lxml.html are tolerant of a paragraph as the child of an inline element. When I use recover=False for lxml.etree parser, there is no exception raised.

My questions:

* Am I expecting too much, or missing something? I think that the above is a simple case of broken HTML.
* Can anyone point me to a tried and true line-breaking algorithm for lxml?

Code and output follows.

-------------------------------------------------------------
Test Code

from lxml import etree, html
from StringIO import StringIO

print 'lxml version', etree.LXML_VERSION
print 'libxml version', etree.LIBXML_VERSION

badhtml = '

wwwxxx

yyyzzz

'

print 'With lxml.etree:'
parser = etree.HTMLParser(recover=False)
tree = etree.parse(StringIO(badhtml), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method='html')
print result

print 'With lxml.html:'
parsed = html.fragment_fromstring(badhtml)
print html.tostring(parsed)

---------------------------------------------------------
Output

lxml version (2, 2, 2, 0)
libxml version (2, 7, 2)

With lxml.etree:

wwwxxx

yyy

zzz

With lxml.html:

wwwxxx

yyy

zzz

Paul

From stefan_ml at behnel.de Wed Nov 10 06:45:53 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 10 Nov 2010 06:45:53 +0100
Subject: [lxml-dev] Correcting 'simple' broken HTML
In-Reply-To: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
References: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
Message-ID: <4CDA3191.8010708@behnel.de>

Wray, Paul, 10.11.2010 01:03:
> I thought that I could use lxml to correct the HTML when a paragraph
> is inserted in the middle of text (pipe character represents caret
> position):
>
> Original:

xxx|yyy

> Insert para:

xxx

yyy

> Fixed:

xxx

yyy

>
> This simplest case works OK, but I was surprised to find that this
> fails, when breaking a line within an inline element:
>
> Original:

wwwxxx|yyyzzz

> Insert para:

wwwxxx

yyyzzz

> Expected output:

wwwxxx

yyyzzz

> Actual output from lxml.etree and lxml.html: >

wwwxxx

yyyzzz

This is not the result of a valid in-memory tree, so it is impossible that
the serialiser produces this. Here is what I get:

>>> import lxml.html as h
>>> h.tostring(h.fromstring("

wwwxxx

yyyzzz

")) '

wwwxxx

yyyzzz

'

This is with lxml 2.3 trunk, but the version shouldn't matter.

Stefan

From stefan_ml at behnel.de Wed Nov 10 06:50:52 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 10 Nov 2010 06:50:52 +0100
Subject: [lxml-dev] Correcting 'simple' broken HTML
In-Reply-To: <4CDA3191.8010708@behnel.de>
References: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win> <4CDA3191.8010708@behnel.de>
Message-ID: <4CDA32BC.2010005@behnel.de>

Stefan Behnel, 10.11.2010 06:45:
> Wray, Paul, 10.11.2010 01:03:
>> I thought that I could use lxml to correct the HTML when a paragraph
>> is inserted in the middle of text (pipe character represents caret
>> position):
>>
>> Original:

xxx|yyy

>> Insert para:

xxx

yyy

>> Fixed:

xxx

yyy

>>
>> This simplest case works OK, but I was surprised to find that this
>> fails, when breaking a line within an inline element:
>>
>> Original:

wwwxxx|yyyzzz

>> Insert para:

wwwxxx

yyyzzz

>> Expected output:

wwwxxx

yyyzzz

>> Actual output from lxml.etree and lxml.html: >>

wwwxxx

yyyzzz

>
> This is not the result of a valid in-memory tree, so it is impossible that
> the serialiser produces this. Here is what I get:
>
> >>> import lxml.html as h
> >>> h.tostring(h.fromstring("

wwwxxx

yyyzzz

")) > '

wwwxxx

yyyzzz

'

Rereading your post, you actually misspelled the output above and quoted it
correctly further down:

wwwxxx

yyy

zzz

This obviously *is* a possible serialisation, although invalid HTML - it's
'p' inside of 'b'. The problem is most likely this:

libxml version (2, 7, 2)

Use a newer libxml2 version, 2.7.7 and later are good choices.

Stefan

From john at nmt.edu Sat Nov 13 17:37:44 2010
From: john at nmt.edu (John W. Shipman)
Date: Sat, 13 Nov 2010 09:37:44 -0700 (MST)
Subject: [lxml-dev] Asking for advice - python lxml (fwd)
Message-ID:

Sorry, I don't do sysadmin. Forwarding to the mailing list.

John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 146, Socorro, NM 87801, (575) 835-5950, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber

---------- Forwarded message ----------
Date: Sat, 13 Nov 2010 13:37:47 +0000
From: Peter Lom
To: "tcc-doc at nmt.edu"
Subject: Asking for advice - python lxml

Hi John and others in lxml world,

I wonder if you can advise about the problem in installing lxml on Solaris?
This is my desperate move to save the first attempt in an international company located in Ireland to use python and also lxml for a Canadian client.

I developed an application around lxml on the dev box (our sysadmin was able to build python 2.6.2 with lxml on Solaris 10 with some obstacles from sources but all is OK).
BTW it performs about 15x faster than xml processing using Ruby.

The client production system does not allow access to repositories and as a consequence our sysadmin cannot prepare a full build for them!
The man claimed 10 hours lost in trying to do so and failed.

What can be done?

Many thanks
Peter Lom, Melbourne

From denis-bz-py at t-online.de Tue Nov 16 13:03:08 2010
From: denis-bz-py at t-online.de (denis)
Date: Tue, 16 Nov 2010 12:03:08 +0000 (UTC)
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
Message-ID:

Folks,
has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ?
A simple API along the lines of csv would be nice:

doc = openxmllib.openXmlDocument( path= "...xlsx" )
for row in doc:
    for col in row:
        # num / string

(Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.)

Thanks, cheers
-- denis

From jholg at gmx.de Tue Nov 16 14:42:57 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 16 Nov 2010 14:42:57 +0100
Subject: [lxml-dev] Asking for advice - python lxml (fwd)
In-Reply-To:
References:
Message-ID: <20101116134257.195330@gmx.net>

Hi,

> I developed an application around lxml on the dev box (our sysadmin was
> able to build python 2.6.2 with lxml on Solaris 10 with some obstacles from
> sources but all is OK).
> BTW it performs about 15x faster than xml processing using Ruby.
>
> The client production system does not allow access to repositories and as
> a consequence our sysadmin cannot prepare a full build for them!
> The man claimed 10 hours lost in trying to do so and failed.
>
> What can be done?

What do you mean by "does not allow access to repositories"?
Can't the build be done on the dev box and then be packaged and shipped to the production client? Anyway, you'd face the same problem with whatever software you chose to install to the production client, right? Not 100% sure I understand the problem but I guess you'll have to * (pre-) build python & lxml (+ libxml2/libxslt) & your application on the dev box * package python, lxml & your application on the dev box, e.g. as eggs or maybe as sunpkgs (or as tarballs) * ship your packages to the client production system * install your packages on the client production system Or, if you have some kind of build recipe e.g. involving pip/easy_install/zc.buildout or whatever that should be run on the client production system for deployment, you'd have to adapt the recipe to using "local repositories". E.g. not installing from PyPi but from local eggs. Holger -- GRATIS! Movie-FLAT mit ?ber 300 Videos. Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome From jholg at gmx.de Tue Nov 16 15:15:02 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 16 Nov 2010 15:15:02 +0100 Subject: [lxml-dev] read .xlsx spreadsheets with lxml ? In-Reply-To: References: Message-ID: <20101116141502.195330@gmx.net> Hi, > Folks, > has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ? > A simple API along the lines of csv would be nice: > > doc = openxmllib.openXmlDocument( path= "...xlsx" ) > for row in doc: > for col in row: # num / string > > (Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.) Well, the lxml APIs are simple enough for handling the XML *inside* the .xlsx zip archive. Don't know how complicated the structure of the file itself can get. Here's the lxml.objectify notion: $ unzip Foo.xlsx $ python -i -c 'from lxml import etree, objectify' >>> root = objectify.parse("./tmp/Foo/xl/worksheets/sheet1.xml").getroot() >>> print root.tag {http://schemas.openxmlformats.org/spreadsheetml/2006/main}worksheet >>> for row in root.sheetData.row: ... 
...     for c in row.c:
...         print "%s: %s" % (c.get('r'), c.v)
...
A1: 0
B1: 1
C1: 2
A2: 1
B2: 2
C2: 3
A3: 4
B3: 5
C3: 6
>>>

Holger

From terry_n_brown at yahoo.com Tue Nov 16 15:06:09 2010
From: terry_n_brown at yahoo.com (Terry Brown)
Date: Tue, 16 Nov 2010 08:06:09 -0600
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
In-Reply-To:
References:
Message-ID: <20101116080609.5ac3367c@nrri.umn.edu>

On Tue, 16 Nov 2010 12:03:08 +0000 (UTC) denis wrote:

> Folks,
> has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ?
> A simple API along the lines of csv would be nice:

No. First I'd try http://pypi.python.org/pypi/xlrd - I haven't used it for .xlsx, but it works well for .xls and I think it also supports .xlsx. Small code example below.

Cheers -Terry

import xlrd
from collections import defaultdict

filename = "MasterDatabase.xls"
book = xlrd.open_workbook(filename)
cnt = defaultdict(lambda: 0)
for sheet in book.sheets():
    print("{0.name:>20s} {0.nrows}".format(sheet))
sheet0 = book.sheet_by_index(0)
for row in range(sheet0.nrows):
    cnt[sheet0.cell(row,0).value] += 1

From denis-bz-py at t-online.de Tue Nov 16 17:59:43 2010
From: denis-bz-py at t-online.de (denis)
Date: Tue, 16 Nov 2010 16:59:43 +0000 (UTC)
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
References:
Message-ID:

Thanks Holger, Thanks Terry,

I was really looking for someone who's *used* lxml (or ...) on big Microsoft xlsx spreadsheets.
I gather from http://en.wikipedia.org/wiki/Office_Open_XML that the format is messy --
    Part 1 (Fundamentals and Markup Language Reference)
    This part has 5560 pages ?!
By the way, xlrd 0.7.1 ->

  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/xlrd/__init__.py", line 429, in open_workbook
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/xlrd/__init__.py", line 1545, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/xlrd/__init__.py", line 1539, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'PK\x03\x04\x14\x00\x06\x00'

cheers
  -- denis

From stefan_ml at behnel.de Wed Nov 17 18:28:08 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 17 Nov 2010 18:28:08 +0100
Subject: [lxml-dev] explicit child tag notation for empty namespace URI in lxml.objectify
In-Reply-To: <4CCFDA3E.5020600@behnel.de>
References: <20101102084023.275460@gmx.net> <4CCFDA3E.5020600@behnel.de>
Message-ID: <4CE410A8.8070203@behnel.de>

Stefan Behnel, 02.11.2010 10:30:
> Alexander Artemenko, 02.11.2010 10:07:
>> On Tue, Nov 2, 2010 at 11:40 AM, jholg wrote:
>>> just stumbled upon this:
>>>
>>> http://stackoverflow.com/questions/3103661
>>>
>>> In short: Should we consider this a bug:
>>>
>>> >>> root = etree.fromstring("""
>>> ...206
>>> ...
>>> ......
>>> ...
>>> ...
>>> ... """)
>>> >>> root['{}duration']
>>> Traceback (most recent call last):
>>> File "", line 1, in ?
>>> File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>>> File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
>>> AttributeError: no such child: {http://api.example.com}duration
>>
>> This is not a bug, because you MUST specify namespaces for the
>> duration, because this element is in the scope of the 'ns2'
>> namespaces.
>> See http://www.w3.org/TR/xml-names/#scoping for details.
>
> The spec says in 6.2:
>
> """
> If there is a default namespace declaration in scope, the expanded name
> corresponding to an unprefixed element name has the URI of the default
> namespace as its namespace name. If there is no default namespace
> declaration in scope, the namespace name has no value.
> """
>
> So, in the above case, "the namespace name has no value", which is just
> fine. Although rare, this *is* a problem. Personally, I think I would have
> expected "root['{}duration']" to work, but I haven't looked into it any
> deeper yet. It might be worth special casing this in lxml.objectify.

I've committed a fix for 2.3 that lets lxml.objectify accept "{}tag" as explicitly meaning "tag" with an empty namespace URI.

Stefan

From kj at rdprojekt.pl Thu Nov 18 11:15:54 2010
From: kj at rdprojekt.pl (Krzysztof Jakubczyk)
Date: Thu, 18 Nov 2010 11:15:54 +0100
Subject: [lxml-dev] Schema validation - no file position
Message-ID: <4CE4FCDA.5080000@rdprojekt.pl>

Hi,

I'm trying to validate a document using XmlSchema. It works but the exception received (etree.XMLSyntaxError) has no information about file position - exc.position is (0,0). Is this correct behaviour?

From kj at rdprojekt.pl Thu Nov 18 11:15:11 2010
From: kj at rdprojekt.pl (Krzysztof Jakubczyk)
Date: Thu, 18 Nov 2010 11:15:11 +0100
Subject: [lxml-dev] Schema validation
Message-ID: <4CE4FCAF.8040506@rdprojekt.pl>

Hi,

I'm trying to validate a document using XmlSchema. It works but the exception received (etree.XMLSyntaxError) has no information about location - exc.position is (0,0). Is this correct behaviour?
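
For readers hitting the same question, one possible way to get at line numbers is to parse first and validate afterwards, then read the validator's error log. This is only a sketch under stated assumptions: the schema and document below are made up for illustration, the helper name `validation_errors` is hypothetical, and it validates a fully parsed tree rather than validating at parse time, so it does not by itself answer the question for streaming parsers.

```python
from io import BytesIO

from lxml import etree

# Hypothetical schema and document, invented for this illustration only.
SCHEMA = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="count" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

DOC = b"""<root>
  <count>1</count>
  <count>oops</count>
</root>"""


def validation_errors(schema_bytes, doc_bytes):
    """Return (line, message) pairs for every schema violation found."""
    schema = etree.XMLSchema(etree.fromstring(schema_bytes))
    doc = etree.parse(BytesIO(doc_bytes))
    if schema.validate(doc):
        return []
    # Each log entry records where in the source document the problem is.
    return [(entry.line, entry.message) for entry in schema.error_log]


errors = validation_errors(SCHEMA, DOC)
for line, message in errors:
    print("line %d: %s" % (line, message))
```

The trade-off is memory: the whole document is held as a tree before validation, which may not suit very large inputs.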
From m.parrucci at unibo.it Fri Nov 19 18:33:55 2010
From: m.parrucci at unibo.it (Matteo Parrucci)
Date: Fri, 19 Nov 2010 17:33:55 +0000
Subject: [lxml-dev] HTMLParser and \r converted in html entity
Message-ID: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>

Hi,
I'm new here; I subscribed because I encountered a strange behavior in lxml:
Is it normal that "\r" followed by "\n" in html code get converted in " " entity using HTMLParser?
the strange behavior is reproduced in the example that follows.

import lxml.etree
g='>\r\n>\r\n\r\ntitle\r\n'
lxml.etree.tostring(lxml.etree.fromstring(g, parser=lxml.etree.HTMLParser()))

OUTPUT:
' xmlns="http://www.w3.org/1999/xhtml"> \ntitle \n'

Thanks in advance
Matteo Parrucci
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20101119/cd576919/attachment.htm

From stefan_ml at behnel.de Sat Nov 20 16:58:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 20 Nov 2010 16:58:19 +0100
Subject: [lxml-dev] HTMLParser and \r converted in html entity
In-Reply-To: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
References: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
Message-ID: <4CE7F01B.2080409@behnel.de>

Matteo Parrucci, 19.11.2010 18:33:
> I'm new here; I subscribed because I encountered a strange behavior in lxml:
> Is it normal that "\r" followed by "\n" in html code get converted in " " entity using HTMLParser?
> the strange behavior is reproduced in the example that follows.
>
> import lxml.etree
> g='>\r\n>\r\n\r\ntitle\r\n'
> lxml.etree.tostring(lxml.etree.fromstring(g, parser=lxml.etree.HTMLParser()))

This is an XHTML document, you shouldn't parse it using the HTML parser. Use the XML parser instead.
> OUTPUT:
> ' xmlns="http://www.w3.org/1999/xhtml"> \ntitle \n'

Default encoding for serialisation is ASCII, which escapes all non-ASCII characters (although I wonder why it should escape line endings...). If you want a different encoding, use the "encoding" parameter.

Stefan

From piet at vanoostrum.org Sat Nov 20 21:09:12 2010
From: piet at vanoostrum.org (Piet van Oostrum)
Date: Sat, 20 Nov 2010 16:09:12 -0400
Subject: [lxml-dev] HTMLParser and \r converted in html entity
In-Reply-To: <4CE7F01B.2080409@behnel.de>
References: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it> <4CE7F01B.2080409@behnel.de>
Message-ID: <19688.10984.533379.514521@cochabamba.vanoostrum.org>

Stefan Behnel wrote:

> Default encoding for serialisation is ASCII, which escapes all
> non-ASCII characters (although I wonder why it should escape line
> endings...). If you want a different encoding, use the "encoding"
> parameter.

\r isn't supposed to be a line ending *in a string*, I suppose. In a file it is (at least on Windows), but it disappears as soon as it is read as text.
--
Piet van Oostrum
Cochabamba. URL: http://pietvanoostrum.com/
Fair Trade home goods now at http://www.zylja.com

From crucialfelix at gmail.com Tue Nov 23 11:14:19 2010
From: crucialfelix at gmail.com (felix)
Date: Tue, 23 Nov 2010 11:14:19 +0100
Subject: [lxml-dev] Compile failure
In-Reply-To: <4CCBDC02.3080305@behnel.de>
References: <4CCBDC02.3080305@behnel.de>
Message-ID:

On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel wrote:

> felix, 26.10.2010 15:28:
>
>> According to this:
>> http://codespeak.net/lxml/build.html
>>
>> we should avoid installing Cython
>>
>> but using easy_install to build fails saying the cython generated file is
>> missing
>
> I doubt that it's failing because of that. However, you didn't provide the
> output of the build, so I can't guess what happened that actually made the
> build fail.
sorry, that output had scrolled off by the time I realized I should submit a report.
I have another server so fortunately I can fail there and show you.

crucial at crucial-systems:~/working/lxml$ python setup.py build
/home/crucial/working/lxml/versioninfo.py:53: UserWarning: unrecognized .svn/entries format; skipping /home/crucial/working/lxml/
  warn("unrecognized .svn/entries format; skipping "+base)
Building lxml version 2.3.beta1.
*NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.*
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib
running build
running build_py
running build_ext
building 'lxml.etree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/libxml2 -I/usr/include/python2.6 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.6/src/lxml/lxml.etree.o -w
gcc: src/lxml/lxml.etree.c: No such file or directory
gcc: no input files
error: command 'gcc' failed with exit status 1

> The latest build instructions for the SVN trunk are in the SVN trunk as
> "doc/build.txt", or *(not always completely up-to-date)* here:

exactly

>> *but then I succeeded with the old sudo easy_install lxml*
>>
>> because now I have Cython
>
> Again, I doubt that this is the reason.

sudo easy_install lxml
failed before

after installing Cython it says it uses Cython (not Trying to build without Cython) and it worked.

nothing else having changed I thought it was a reasonable guess that it worked because it used Cython because Cython is installed.

> Stefan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20101123/7777e0de/attachment.htm

From jholg at gmx.de Tue Nov 23 11:35:07 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 23 Nov 2010 11:35:07 +0100
Subject: [lxml-dev] Compile failure
In-Reply-To:
References: <4CCBDC02.3080305@behnel.de>
Message-ID: <20101123103507.100470@gmx.net>

Hi,

> On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel wrote:
>
> > felix, 26.10.2010 15:28:
> >
> >> According to this:
> >> http://codespeak.net/lxml/build.html
> >>
> >> we should avoid installing Cython
> >>
> >> but using easy_install to build fails saying the cython generated file is
> >> missing

Note that it says

"""
Since we distribute the Cython-generated .c files with lxml *releases*, however, you do not need Cython to build lxml from the normal *release* sources.
"""

So the Cython-generated .c files are not in an SVN checkout but should be in the release packages.

> > Again, I doubt that this is the reason.
>
> sudo easy_install lxml
> failed before
>
> after installing Cython it says it uses Cython (not Trying to build without
> Cython) and it worked.
>
> nothing else having changed I thought it was a reasonable guess that it
> worked because it used Cython because Cython is installed.

I just checked the 2.3beta1 (source) package on pypi and it does contain the .c files:

-rw-r--r-- 1000/1000 7102827 Sep 6 09:31 2010 lxml-2.3beta1/src/lxml/lxml.etree.c
-rw-r--r-- 1000/1000 1318011 Sep 6 09:33 2010 lxml-2.3beta1/src/lxml/lxml.objectify.c

Holger
From stefan_ml at behnel.de Tue Nov 23 15:38:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 23 Nov 2010 15:38:35 +0100
Subject: [lxml-dev] SyntaxErrors with Python 3
In-Reply-To: <201009102244.24301.Arfrever.FTA@gmail.com>
References: <201007200303.15139.Arfrever.FTA@gmail.com> <4C455380.3020905@behnel.de> <201007251715.35395.Arfrever.FTA@gmail.com> <201009102244.24301.Arfrever.FTA@gmail.com>
Message-ID: <4CEBD1EB.8040908@behnel.de>

Arfrever Frehtes Taifersar Arahesis, 10.09.2010 22:44:
> 2010-07-25 17:14:53 Arfrever Frehtes Taifersar Arahesis wrote:
>> 2010-07-20 09:42:56 Stefan Behnel wrote:
>>> Arfrever Frehtes Taifersar Arahesis, 20.07.2010 03:02:
>>>> LXML r76211 generally supports Python 3, but there are still some SyntaxErrors.
>>>> [snip]
>>>
>>> Thanks. Only 2 or 3 of those are relevant to Py3, but I'll see if I can fix
>>> them. A patch could easily speed this up, BTW.
>>
>> I'm attaching the partial patch.
>
> Could this patch be committed?

Done.

Stefan

From stefan_ml at behnel.de Thu Nov 25 10:48:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Nov 2010 10:48:55 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CE4FCDA.5080000@rdprojekt.pl>
References: <4CE4FCDA.5080000@rdprojekt.pl>
Message-ID: <4CEE3107.9000301@behnel.de>

Hi,

Krzysztof Jakubczyk, 18.11.2010 11:15:
> I'm trying to validate a document using XmlSchema. It works but the
> exception received (etree.XMLSyntaxError) has no information about file
> position- exc.position is (0,0). Is this correct behaviour?

I wonder why you get an "XMLSyntaxError" in the first place. This means that there's an error while parsing your document.

Could you show us the code that you use for parsing and validation?
Stefan

From kj at rdprojekt.pl Thu Nov 25 11:00:52 2010
From: kj at rdprojekt.pl (Krzysztof Jakubczyk)
Date: Thu, 25 Nov 2010 11:00:52 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CEE3107.9000301@behnel.de>
References: <4CE4FCDA.5080000@rdprojekt.pl> <4CEE3107.9000301@behnel.de>
Message-ID: <4CEE33D4.9030400@rdprojekt.pl>

On 2010-11-25 10:48, Stefan Behnel wrote:
> Hi,
>
> Krzysztof Jakubczyk, 18.11.2010 11:15:
>> I'm trying to validate a document using XmlSchema. It works but the
>> exception received (etree.XMLSyntaxError) has no information about file
>> position- exc.position is (0,0). Is this correct behaviour?
>
> I wonder why you get an "XMLSyntaxError" in the first place. This
> means that there's an error while parsing your document.
>
> Could you show us the code that you use for parsing and validation?
>
> Stefan

Hmm... I get the error because the document I validate is invalid - it doesn't match the Xml Schema. This behavior is correct.
My problem is that the error doesn't contain information about position of the error - it's hard to find source of the error in a big file.
my code is the following:

def validate(schemaContent, dataStream):
    schema = etree.XMLSchema(etree.fromstring(schemaContent))
    for event, elem in etree.iterparse(dataStream, schema=schema):
        elem.clear()
        while elem.getprevious() is not None:
            if not elem.getparent() is None:
                del elem.getparent()[0]

regards,
kj

From stefan_ml at behnel.de Thu Nov 25 11:14:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Nov 2010 11:14:33 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CEE33D4.9030400@rdprojekt.pl>
References: <4CE4FCDA.5080000@rdprojekt.pl> <4CEE3107.9000301@behnel.de> <4CEE33D4.9030400@rdprojekt.pl>
Message-ID: <4CEE3709.5010203@behnel.de>

Krzysztof Jakubczyk, 25.11.2010 11:00:
> On 2010-11-25 10:48, Stefan Behnel wrote:
>> Krzysztof Jakubczyk, 18.11.2010 11:15:
>>> I'm trying to validate a document using XmlSchema. It works but the
>>> exception received (etree.XMLSyntaxError) has no information about file
>>> position- exc.position is (0,0). Is this correct behaviour?
>>
>> I wonder why you get an "XMLSyntaxError" in the first place. This
>> means that there's an error while parsing your document.
>>
>> Could you show us the code that you use for parsing and validation?
>
> I get the error because the document I validate is invalid - it doesn't
> match the Xml Schema. This behavior is correct.
> My problem is that the error doesn't contain information about position
> of the error - it's hard to find source of the error in a big file.
>
> my code is the following:
>
> def validate(schemaContent, dataStream):
>     schema = etree.XMLSchema(etree.fromstring(schemaContent))
>     for event, elem in etree.iterparse(dataStream, schema=schema):

Now, this reveals two important hints that you didn't provide in your original post: you are validating at parse time, and you are using iterparse(). For me, that totally changes the place in the code to look at.

I'll see if I can come up with something.
Stefan

From chris at simplistix.co.uk Mon Nov 29 21:21:37 2010
From: chris at simplistix.co.uk (Chris Withers)
Date: Mon, 29 Nov 2010 20:21:37 +0000
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
In-Reply-To:
References:
Message-ID: <4CF40B51.8040700@simplistix.co.uk>

On 16/11/2010 16:59, denis wrote:
> Thanks Holger, Thanks Terry,
>
> I was really looking for someone who's *used* lxml (or ...)
> on big Microsoft xlsx spreadsheets.

John Machin over on the python-excel group has done just this. He has some alpha code that I know he'd like to see merged into the xlrd code base but he's looking for some serious testers.

Follow the birdy on www.python-excel.org for group joining...

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
           - http://www.simplistix.co.uk
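
As a rough illustration of the csv-like iteration asked about in this thread, the worksheet XML can be streamed straight out of the .xlsx zip archive without unpacking it to disk. This is a minimal sketch, not the xlrd code mentioned above: the helper names `iter_cells` and `make_demo_xlsx` are hypothetical, the default sheet path `xl/worksheets/sheet1.xml` is assumed to match the layout seen earlier in the thread, and cells that refer to the shared-strings table are returned as raw index values rather than resolved text.

```python
import io
import zipfile

from lxml import etree

# Namespace of the worksheet XML, as printed earlier in the thread.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def iter_cells(xlsx_file, sheet_path="xl/worksheets/sheet1.xml"):
    """Yield (cell reference, raw value text) pairs, one cell at a time."""
    with zipfile.ZipFile(xlsx_file) as archive:
        with archive.open(sheet_path) as sheet:
            # Only <c> (cell) elements trigger events; each is cleared
            # once it has been reported to keep memory use low.
            for _event, cell in etree.iterparse(sheet, tag="{%s}c" % NS):
                yield cell.get("r"), cell.findtext("{%s}v" % NS)
                cell.clear()


def make_demo_xlsx():
    """Build a tiny in-memory stand-in for an .xlsx file (demo data only)."""
    sheet = (
        '<worksheet xmlns="%s"><sheetData>'
        '<row r="1"><c r="A1"><v>0</v></c><c r="B1"><v>1</v></c></row>'
        '<row r="2"><c r="A2"><v>1</v></c><c r="B2"><v>2</v></c></row>'
        '</sheetData></worksheet>' % NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("xl/worksheets/sheet1.xml", sheet)
    buf.seek(0)
    return buf


cells = list(iter_cells(make_demo_xlsx()))
for ref, value in cells:
    print("%s: %s" % (ref, value))
```

Passing a real file name instead of the in-memory buffer should work the same way, since `zipfile.ZipFile` accepts either.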