From cJ-lxml at zougloub.eu Mon Nov 1 02:46:23 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Sun, 31 Oct 2010 21:46:23 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <1286350813.29843.587028.camel@atlas>
References: <20100903161533.20e66719@Bidule.intranet.cs>
<1286350813.29843.587028.camel@atlas>
Message-ID: <20101031214623.34690eba@zougloub.eu>
On Wed, 06 Oct 2010 11:40:13 +0400
Alexander Shigin wrote:
> It looks like the usage of XSLT.strparam solves your problem. Please
> look at attached patch.
Thanks a lot Alexander, it works.
Regards,
--
cJ
From cJ-lxml at zougloub.eu Mon Nov 1 13:38:59 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Mon, 1 Nov 2010 08:38:59 -0400
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
Message-ID: <20101101083859.39339551@zougloub.eu>
Hi,
libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
xsltproc --maxdepth 10000 ....
At the moment, lxml does not touch this value.
Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
Regards,
--
cJ
From stefan_ml at behnel.de Mon Nov 1 17:14:30 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 01 Nov 2010 17:14:30 +0100
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <20101101083859.39339551@zougloub.eu>
References: <20101101083859.39339551@zougloub.eu>
Message-ID: <4CCEE766.6000905@behnel.de>
Jérôme Carretero, 01.11.2010 13:38:
> libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
> and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
> xsltproc --maxdepth 10000 ....
>
> At the moment, lxml does not touch this value.
>
> Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
Such a local property doesn't work well for a global setting, and global
settings are always evil. Isn't there a per-call setting for this?
Stefan
From Althial at gmx.net Mon Nov 1 18:48:52 2010
From: Althial at gmx.net (Althial at gmx.net)
Date: Mon, 01 Nov 2010 18:48:52 +0100
Subject: [lxml-dev] External Ids with DTD class
In-Reply-To:
References:
Message-ID: <20101101174852.43870@gmx.net>
Hi,
I want to use lxml to validate fragments of xhtml but setting up the parser is driving me nuts.
from lxml import etree
dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XHTML V4.2//EN")
That's taken from the webpage's tutorial and this
dtd = etree.DTD(external_id = "-//W3C//DTD XHTML 1.0 Strict//EN")
is what I'd like to do. Result:
DTDParseError: failed to load external entity "-//W3C//DTD XHTML 1.0 Strict//EN"
Now I realize this looks like some setup problem on my side. I guess my system is simply lacking the catalog entries so the DTD can't be found. But the documentation (of lxml) says nothing more on this issue.
I'm working with Ubuntu 10.04, and my /usr/share/sgml/docbook/dtd directory contains only an xml subdirectory, which itself holds versions 4 to 4.5, but nothing with XHTML.
All I want is some basic validation to XHTML 1.0 STRICT. Is that really so hard to set up? :-(
Amnu
From jholg at gmx.de Tue Nov 2 09:40:23 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 02 Nov 2010 09:40:23 +0100
Subject: [lxml-dev] (no subject)
Message-ID: <20101102084023.275460@gmx.net>
Hi,
just stumbled upon this:
http://stackoverflow.com/questions/3103661
In short: Should we consider this a bug:
>>> root = etree.fromstring("""
... 206
...
... ...
...
...
... """)
>>> root['{}duration']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
  File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
AttributeError: no such child: {http://api.example.com}duration
>>>
?
Looks like there is no way to get at a no-namespace child element apart from working around this e.g. using xpath.
Holger
From svetlyak.40wt at gmail.com Tue Nov 2 10:07:33 2010
From: svetlyak.40wt at gmail.com (Alexander Artemenko)
Date: Tue, 2 Nov 2010 12:07:33 +0300
Subject: [lxml-dev] (no subject)
In-Reply-To: <20101102084023.275460@gmx.net>
References: <20101102084023.275460@gmx.net>
Message-ID:
Hi!
On Tue, Nov 2, 2010 at 11:40 AM, wrote:
> Hi,
>
> just stumbled upon this:
>
> http://stackoverflow.com/questions/3103661
>
> In short: Should we consider this a bug:
>
>>>> root = etree.fromstring("""
> ... 206
> ...
> ... ...
> ...
> ...
> ... """)
>>>> root['{}duration']
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>   File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
> AttributeError: no such child: {http://api.example.com}duration
>>>>
This is not a bug: you MUST specify the namespace for
duration, because this element is in the scope of the 'ns2'
namespace. See http://www.w3.org/TR/xml-names/#scoping for details.
--
Alexander Artemenko (a.k.a. Svetlyak 40wt)
Blog: http://aartemenko.com
Photos: http://svetlyak.ru
Jabber: svetlyak.40wt at gmail.com
From stefan_ml at behnel.de Tue Nov 2 10:30:38 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Nov 2010 10:30:38 +0100
Subject: [lxml-dev] (no subject)
In-Reply-To:
References: <20101102084023.275460@gmx.net>
Message-ID: <4CCFDA3E.5020600@behnel.de>
Alexander Artemenko, 02.11.2010 10:07:
> On Tue, Nov 2, 2010 at 11:40 AM, jholg wrote:
>> just stumbled upon this:
>>
>> http://stackoverflow.com/questions/3103661
>>
>> In short: Should we consider this a bug:
>>
>> >>> root = etree.fromstring("""
>> ...206
>> ...
>> ......
>> ...
>> ...
>> ... """)
>> >>> root['{}duration']
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>>   File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
>> AttributeError: no such child: {http://api.example.com}duration
>
> This is not a bug, because you MUST specify namespaces for the
> duration, because this element is in the scope of the 'ns2'
> namespaces. See http://www.w3.org/TR/xml-names/#scoping for details.
The spec says in 6.2:
"""
If there is a default namespace declaration in scope, the expanded name
corresponding to an unprefixed element name has the URI of the default
namespace as its namespace name. If there is no default namespace
declaration in scope, the namespace name has no value.
"""
So, in the above case, "the namespace name has no value", which is just
fine. Although rare, this *is* a problem. Personally, I think I would have
expected "root['{}duration']" to work, but I haven't looked into it any
deeper yet. It might be worth special casing this in lxml.objectify.
Stefan
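
To make the scoping rule quoted above concrete, here is a minimal sketch using the standard-library ElementTree API rather than lxml.objectify. The sample document is made up for illustration (the XML from the original post did not survive in the archive); only the namespace URI is taken from the traceback.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the root is in the 'ns2' namespace, but there is
# no *default* namespace declaration, so the unprefixed <duration> child
# is in no namespace at all.
DOC = (
    '<ns2:response xmlns:ns2="http://api.example.com">'
    '<duration>206</duration>'
    '</ns2:response>'
)

root = ET.fromstring(DOC)

# The root carries the namespace in its expanded name ...
root_tag = root.tag

# ... while the child has to be looked up by its plain, unqualified name.
plain = root.find("duration")
qualified = root.find("{http://api.example.com}duration")

print(root_tag)
print(plain.text)
print(qualified)
```

lxml.objectify appears to resolve child lookups in the parent's namespace by default, which would explain why the traceback above reports {http://api.example.com}duration as missing.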
From cJ-lxml at zougloub.eu Thu Nov 4 13:42:01 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Thu, 4 Nov 2010 08:42:01 -0400
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <4CCEE766.6000905@behnel.de>
References: <20101101083859.39339551@zougloub.eu> <4CCEE766.6000905@behnel.de>
Message-ID: <20101104084201.309144d6@zougloub.eu>
On Mon, 01 Nov 2010 17:14:30 +0100
Stefan Behnel wrote:
> Jérôme Carretero, 01.11.2010 13:38:
> > libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
> > and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
> > xsltproc --maxdepth 10000 ....
> >
> > At the moment, lxml does not touch this value.
> >
> > Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
>
> Such a local property doesn't work well for a global setting, and global
> settings are always evil. Isn't there a per-call setting for this?
Attached is a libxslt patch that makes the max template depth an attribute of the transform context and not a global variable.
Comments ?
Regards,
--
cJ
PS: the patch was applied on top of the master branch of Diego's git://gitorious.org/libxslt/libxslt.git mirror; I was too lazy to dig up the real libxslt repository.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: libxslt-unglobalize-maxdepth.patch
Type: text/x-patch
Size: 4293 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20101104/88174fb9/attachment-0001.bin
From stefan_ml at behnel.de Thu Nov 4 14:05:25 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 04 Nov 2010 14:05:25 +0100
Subject: [lxml-dev] XSLT - xsltMaxDepth setting
In-Reply-To: <20101104084201.309144d6@zougloub.eu>
References: <20101101083859.39339551@zougloub.eu> <4CCEE766.6000905@behnel.de>
<20101104084201.309144d6@zougloub.eu>
Message-ID: <4CD2AF95.7060607@behnel.de>
Jérôme Carretero, 04.11.2010 13:42:
> On Mon, 01 Nov 2010 17:14:30 +0100
> Stefan Behnel wrote:
>
>> Jérôme Carretero, 01.11.2010 13:38:
>>> libxslt uses a xsltMaxDepth variable (global...) to limit recursion,
>>> and I used to increase it when processing some big files (for instance, DocBook containing tables spanning over dozens of pages).
>>> xsltproc --maxdepth 10000 ....
>>>
>>> At the moment, lxml does not touch this value.
>>>
>>> Maybe providing a lxml.etree.XSLT.maxdepth property would not be too complicated ?
>>
>> Such a local property doesn't work well for a global setting, and global
>> settings are always evil. Isn't there a per-call setting for this?
>
> Attached is a libxslt patch that makes the max template depth an attribute of the transform context and not a global variable.
>
> Comments ?
Wrong list. ;)
But to get it accepted, I think you will have to keep the old interface in
addition to the new one.
Stefan
From Marc.Graff at VerizonWireless.com Thu Nov 4 19:41:24 2010
From: Marc.Graff at VerizonWireless.com (Graff, Marc)
Date: Thu, 4 Nov 2010 14:41:24 -0400
Subject: [lxml-dev] Need feedback on Memory Errors
Message-ID: <20101104185239.429A1282B9D@codespeak.net>
I just finished an app that parses a large XML file "FeedA" and appends
another, smaller file fragmentB to the tree from FeedA under an
XPath-specified parent node. All seems fine when processing a file smaller
than 500MB, but anything larger results in one of two errors.
All libs were built from src in my home dir and LD_LIBRARY_PATH reflects
the home dir lib. Not sure if that will distort the following lib
details
lxml.etree: (2, 2, 8, 0)
libxml used: (2, 7, 7)
libxml compiled: (2, 6, 23)
libxslt used: (1, 1, 26)
libxslt compiled: (1, 1, 15)
There should be ample memory. This is running on a Solaris M5000 with
96GB of memory and unlimit is unlimited. The FeedA test file contains
valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file
tests.
Just over 500MB and the app raises a MemoryError in serializer.pxi.
Attached is the full error trace from the captured exception. The parser
object has huge_tree=True set. I didn't expect this to make a difference,
since the error is in the serialization of the output, but tried anyway.
File "serializer.pxi", line 133, in lxml.etree._tostring
(src/lxml/lxml.etree.c:79345)
MemoryError
Anything over 1.5GB and it core dumps (the first error in the stack is trapped
in libxml2). Attached are the stack and mappings from mdb. Including the top
frames here in case they are related:
libc.so.1`strlen+0x50(39e730, 3ceef0, 1e3718, 755, 1e3724, 756)
etree.so`__pyx_f_4lxml_5etree_13_BaseErrorLog__receive+0xcc(397490, 3ceef0, feff5d24, 74e, 2, 74e)
etree.so`__pyx_f_4lxml_5etree__forwardError+0x6c(fef081e0, 3ceef0, d9d64, fed303a0, ff1303bc, 1)
libxml2.so.2.7.7`__xmlRaiseError+0x2c4(fef07980, fee9c4c8, 397490, 3ced70, ffffffff, 1)
libxml2.so.2.7.7`xmlErrMemory+0xa4(3ced70, fedc1e70, d9d64, ff0566a8, ff1303bc, ff13a558)
libxml2.so.2.7.7`xmlSAX2TextNode+0x2e0(0, 3f5616, 7, 1, 6, feddb774)
I am new to python so any help/suggestions would be greatly appreciated.
Thanks
Marc
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lxml_error_message.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20101104/fa85bee7/attachment.txt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corelog.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20101104/fa85bee7/attachment-0001.txt
From stefan_ml at behnel.de Thu Nov 4 23:01:17 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 04 Nov 2010 23:01:17 +0100
Subject: [lxml-dev] Need feedback on Memory Errors
In-Reply-To: <20101104185239.429A1282B9D@codespeak.net>
References: <20101104185239.429A1282B9D@codespeak.net>
Message-ID: <4CD32D2D.6010606@behnel.de>
Hi,
Graff, Marc, 04.11.2010 19:41:
> I just finished an app that parses a large xml file "FeedA" and appends
> another smaller file fragmentB to the tree from FeedA for an xpath
> specified parent node. All seems fine when processing a file less than
> 500MB but anything larger results in one of two errors.
You may not be aware of it, but this is huge. If that's just the size of
the serialised XML, this means that the in-memory tree representation is
several times that size, easily 10x or more. Depending on the text-to-tag
ratio in the content, it may well reach the size of your available memory.
Check the size of the Python process while it's building the tree, prstat
is your friend.
> All libs were built from src in my home dir and LD_LIBRARY_PATH reflects
> the home dir lib. Not sure if that will distort the following lib
> details
>
> lxml.etree: (2, 2, 8, 0)
>
> libxml used: (2, 7, 7)
>
> libxml compiled: (2, 6, 23)
>
> libxslt used: (1, 1, 26)
>
> libxslt compiled: (1, 1, 15)
Try to build against the libraries that you use at runtime. lxml has
several bug work-arounds and compile time adaptations for the various
library versions. A major discrepancy between the version used at compile
time and runtime, such as in your case, may have unexpected side effects.
You can pass the path to the configuration scripts (xml2-config and
xslt-config in the bin directories of the install destinations) using the
XML2_CONFIG and XSLT_CONFIG environment variables.
> There should be ample memory. This is running on a Solaris M5000 with
> 96GB of memory and unlimit is unlimited. The FeedA test file contains
> valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file
> tests.
>
> Just over 500MB and the app returns a MemoryError on the serializer.pxi.
The serialiser needs to reallocate additional memory step by step while
it's doing its work. Normally, the OS handles this by enlarging the
allocated area and without copying. However, if the available memory runs
low, memory fragmentation may trigger the allocation of a completely new
memory area of very large size to copy the previously allocated memory
into, which may easily fail since memory is low already. So even if there
is some memory left in the system, it may not be enough to satisfy the
memory allocation scheme at hand. Remember that your output alone is
500-3000 MB in one single piece of memory, and libxml2 can't know in
advance that it will need that much.
So, please monitor the memory consumption of the process. If you are really
running out of memory, one thing you can try is to switch to cElementTree
(xml.etree.cElementTree). It has a somewhat lighter memory footprint which
may just be enough to make a difference here (although likely not for 3GB
of XML). It also has fewer features than lxml.etree (the gap is a bit smaller in
Py2.7/ET1.3), but currently, your only real problem seems to be the memory
requirement.
Stefan
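
One way to sidestep the single large output buffer described above is to serialise straight to a file instead of building the whole document as a string first. The following is only a sketch using the standard-library ElementTree API, whose write() method lxml.etree also provides; the element names and the output path are placeholders, and whether this is enough for a multi-gigabyte tree is untested here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small stand-in tree; a real feed would come from ET.parse().
root = ET.Element("feed")
for i in range(3):
    ET.SubElement(root, "item").text = str(i)

# Placeholder output location for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "merged.xml")

# write() serialises to the file piece by piece, so the complete output
# never has to exist as one contiguous string in memory, unlike tostring().
ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# Read the result back in to confirm that it is well-formed.
reloaded = ET.parse(out_path).getroot()
print([item.text for item in reloaded])
```

This only addresses the output buffer; the in-memory tree itself is still as large as before.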
From Marc.Graff at VerizonWireless.com Fri Nov 5 16:14:29 2010
From: Marc.Graff at VerizonWireless.com (Graff, Marc)
Date: Fri, 5 Nov 2010 11:14:29 -0400
Subject: [lxml-dev] Need feedback on Memory Errors
In-Reply-To:
References: <20101104185239.429A1282B9D@codespeak.net>
Message-ID: <20101105151505.24CA9282BF5@codespeak.net>
The runtime vs. compile-time lib difference went unnoticed (I missed the
500 lb. gorilla) until my ride home last night, even though it was
right in front of me. The long ride home is often when things that
elude me come together. I was concerned I had opened myself up to
justifiably harsh scrutiny. Thanks for kindly confirming. Also thanks
for the helpful insights. I generally run top to see what my code is
using but will add prstat to my monitoring. I will recompile with
the correct env vars and retest.
I was considering the possible memory footprint of the current
implementation but wanted to finish version 1. I will try switching to
cElementTree and compare it to the current code. I am also going to
investigate an event-driven parse-and-append approach, since lxml provides
such a mechanism and I believe that could reduce memory usage
drastically.
Thanks for the very useful feedback and have a good weekend.
Marc
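
As a rough illustration of the event-driven approach mentioned above, the sketch below uses iterparse() from the standard-library ElementTree module; lxml.etree offers an iterparse() with the same basic interface. The tag names and the helper function are hypothetical, and the appending step is left out.

```python
import io
import xml.etree.ElementTree as ET

def iter_elements(source, tag):
    """Yield every completed element named `tag`, then release its content."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop text, attributes and children once the caller is done,
            # so that processed records do not pile up in memory.
            elem.clear()

# Tiny in-memory stand-in for a large feed file.
feed = io.BytesIO(b"<feed><item>a</item><item>b</item><item>c</item></feed>")
seen = [elem.text for elem in iter_elements(feed, "item")]
print(seen)
```

Note that the emptied elements stay attached to their parent, so a very long feed still grows slowly unless they are also removed from it.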
From ra.ravi.rav at gmail.com Sat Nov 6 18:44:12 2010
From: ra.ravi.rav at gmail.com (Ravi)
Date: Sat, 6 Nov 2010 23:14:12 +0530
Subject: [lxml-dev] etree.tostring() cannot handle Unicode
Message-ID:
With reference to the bug report
https://bugs.launchpad.net/lxml/+bug/671885 I found that
etree.tostring() cannot handle Unicode. It raises a
UnicodeEncodeError.
Thanks.
From stefan_ml at behnel.de Sat Nov 6 19:37:08 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Nov 2010 19:37:08 +0100
Subject: [lxml-dev] etree.tostring() cannot handle Unicode
In-Reply-To:
References:
Message-ID: <4CD5A054.2010701@behnel.de>
Ravi, 06.11.2010 18:44:
> With reference to the bug report
> https://bugs.launchpad.net/lxml/+bug/671885 I found that
> etree.tostring() cannot handle Unicode. It is giving me the
> UnicodeEncodeError.
etree.tostring() handles unicode (whatever you mean by that) nicely, so the
issue is most likely with your own code. Please provide an example that
shows what your concrete problem is.
Stefan
From Paul.Wray at det.nsw.edu.au Wed Nov 10 01:03:44 2010
From: Paul.Wray at det.nsw.edu.au (Wray, Paul)
Date: Wed, 10 Nov 2010 11:03:44 +1100
Subject: [lxml-dev] Correcting 'simple' broken HTML
Message-ID: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
Background: I need a paragraph-insertion algorithm for use with Internet
Explorer's webbrowser component, as the use of
execCommand('InsertParagraph') or pasteHTML gives unpredictable results.
I have developed a simple algorithm (~50 lines of Python) that gives
acceptable results for simplest cases, but I have no confidence that it
covers all cases.
I thought that I could use lxml to correct the HTML when a paragraph
is inserted in the middle of text (pipe character represents caret
position):
Original: xxx|yyy
Insert para: xxx
yyy
Fixed: xxx
yyy
This simplest case works OK, but I was surprised to find that this
fails, when breaking a line within an inline element:
Original: wwwxxx|yyyzzz
Insert para: wwwxxx
yyyzzz
Expected output: wwwxxx
yyyzzz
Actual output from lxml.etree and lxml.html:
wwwxxx
yyyzzz
So it seems that both lxml.etree and lxml.html are tolerant of a
paragraph as the child of an inline element. When I use recover=False
for lxml.etree parser, there is no exception raised.
My questions:
* Am I expecting too much, or missing something? I think that the above
is a simple case of broken HTML.
* Can anyone point me to a tried and true line-breaking algorithm for
lxml?
Code and output follows.
-------------------------------------------------------------
Test Code
from lxml import etree, html
from StringIO import StringIO
print 'lxml version', etree.LXML_VERSION
print 'libxml version', etree.LIBXML_VERSION
badhtml = 'wwwxxx
yyyzzz
'
print 'With lxml.etree:'
parser = etree.HTMLParser(recover=False)
tree = etree.parse(StringIO(badhtml), parser)
result = etree.tostring(tree.getroot(), pretty_print=True,
method='html')
print result
print 'With lxml.html:'
parsed = html.fragment_fromstring(badhtml)
print html.tostring(parsed)
---------------------------------------------------------
Output
lxml version (2, 2, 2, 0)
libxml version (2, 7, 2)
With lxml.etree:
wwwxxx
yyy
zzz
With lxml.html:
wwwxxx
yyy
zzz
Paul
From stefan_ml at behnel.de Wed Nov 10 06:45:53 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 10 Nov 2010 06:45:53 +0100
Subject: [lxml-dev] Correcting 'simple' broken HTML
In-Reply-To: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
References: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
Message-ID: <4CDA3191.8010708@behnel.de>
Wray, Paul, 10.11.2010 01:03:
> I thought that I could use lxml to correct the html to when a paragraph
> is inserted in the middle of text (pipe character represents caret
> position):
>
> Original:xxx|yyy
> Insert para:xxx
yyy
> Fixed:xxx
yyy
>
> This simplest case works OK, but I was surprised to find that this
> fails, when breaking a line within an inline element:
>
> Original:wwwxxx|yyyzzz
> Insert para:wwwxxx
yyyzzz
> Expected output:wwwxxx
yyyzzz
> Actual output from lxml.etree and lxml.html:
> wwwxxx
yyyzzz
This is not the result of a valid in-memory tree, so it is impossible that
the serialiser produces this. Here is what I get:
>>> import lxml.html as h
>>> h.tostring(h.fromstring("wwwxxx
yyyzzz
"))
''
This is with lxml 2.3 trunk, but the version shouldn't matter.
Stefan
From stefan_ml at behnel.de Wed Nov 10 06:50:52 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 10 Nov 2010 06:50:52 +0100
Subject: [lxml-dev] Correcting 'simple' broken HTML
In-Reply-To: <4CDA3191.8010708@behnel.de>
References: <03426FC7F879734CA8026B6814131E05059AF9DA@otfexchange1.western_sydney.det.win>
<4CDA3191.8010708@behnel.de>
Message-ID: <4CDA32BC.2010005@behnel.de>
Stefan Behnel, 10.11.2010 06:45:
> Wray, Paul, 10.11.2010 01:03:
>> I thought that I could use lxml to correct the html to when a paragraph
>> is inserted in the middle of text (pipe character represents caret
>> position):
>>
>> Original:xxx|yyy
>> Insert para:xxx
yyy
>> Fixed:xxx
yyy
>>
>> This simplest case works OK, but I was surprised to find that this
>> fails, when breaking a line within an inline element:
>>
>> Original:wwwxxx|yyyzzz
>> Insert para:wwwxxx
yyyzzz
>> Expected output:wwwxxx
yyyzzz
>> Actual output from lxml.etree and lxml.html:
>> wwwxxx
yyyzzz
>
> This is not the result of a valid in-memory tree, so it is impossible that
> the serialiser produces this. Here is what I get:
>
> >>> import lxml.html as h
> >>> h.tostring(h.fromstring("wwwxxx
yyyzzz
"))
> ''
Rereading your post, you actually misspelled the output above and quoted it
correctly further down:
wwwxxx
yyy
zzz
This obviously *is* a possible serialisation, although invalid HTML - it's
'p' inside of 'b'.
The problem is most likely this:
libxml version (2, 7, 2)
Use a newer libxml2 version, 2.7.7 and later are good choices.
Stefan
From john at nmt.edu Sat Nov 13 17:37:44 2010
From: john at nmt.edu (John W. Shipman)
Date: Sat, 13 Nov 2010 09:37:44 -0700 (MST)
Subject: [lxml-dev] Asking for advice - python lxml (fwd)
Message-ID:
Sorry, I don't do sysadmin. Forwarding to the mailing list.
John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 146, Socorro, NM 87801, (575) 835-5950, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber
---------- Forwarded message ----------
Date: Sat, 13 Nov 2010 13:37:47 +0000
From: Peter Lom
To: "tcc-doc at nmt.edu"
Subject: Asking for advice - python lxml
Hi John and others in lxml world,
I wonder if you can advise about the problem in installing lxml on Solaris?
This is my desperate move to save the first attempt in an international company located in Ireland to use python and also lxml for a Canadian client.
I developed an application around lxml on the dev box (our sysadmin was able to build Python 2.6.2 with lxml from sources on Solaris 10, with some obstacles, but all is OK).
BTW, it performs about 15x faster than XML processing using Ruby.
The client production system does not allow access to repositories, and as a consequence our sysadmin cannot prepare a full build for them!
He says he lost 10 hours trying to do so and failed.
What can be done?
Many thanks
Peter Lom, Melbourne
From denis-bz-py at t-online.de Tue Nov 16 13:03:08 2010
From: denis-bz-py at t-online.de (denis)
Date: Tue, 16 Nov 2010 12:03:08 +0000 (UTC)
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
Message-ID:
Folks,
has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ?
A simple API along the lines of csv would be nice:
doc = openxmllib.openXmlDocument( path= "...xlsx" )
for row in doc:
    for col in row: # num / string
(Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.)
Thanks, cheers
-- denis
From jholg at gmx.de Tue Nov 16 14:42:57 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 16 Nov 2010 14:42:57 +0100
Subject: [lxml-dev] Asking for advice - python lxml (fwd)
In-Reply-To:
References:
Message-ID: <20101116134257.195330@gmx.net>
Hi,
> I developed an application around lxml on the dev box (our sysadmin was
> able to build python 2.6.2 with lxml on Solaris 10 with some obstacles from
> sources but all is OK) .
> BTW it performs about 15x faster than xml processing using Ruby
>
> The client production system does not allow access to repositories and as
> a consequence our sysadmin cannot prepare a full build for them!
> The man claimed 10 hours lost in trying to do so and failed.
>
> What can be done?
What do you mean by "does not allow access to repositories"?
Can't the build be done on the dev box and then be packaged and shipped to the production client?
Anyway, you'd face the same problem with whatever software you chose to install to the production client, right?
Not 100% sure I understand the problem but I guess you'll have to
* (pre-) build python & lxml (+ libxml2/libxslt) & your application on the dev box
* package python, lxml & your application on the dev box, e.g. as eggs or maybe as sunpkgs (or as tarballs)
* ship your packages to the client production system
* install your packages on the client production system
Or, if you have some kind of build recipe, e.g. involving pip/easy_install/zc.buildout, that should be run on the client production system for deployment, you'd have to adapt the recipe to use "local repositories".
E.g. installing not from PyPI but from local eggs.
Holger
From jholg at gmx.de Tue Nov 16 15:15:02 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 16 Nov 2010 15:15:02 +0100
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
In-Reply-To:
References:
Message-ID: <20101116141502.195330@gmx.net>
Hi,
> Folks,
> has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ?
> A simple API along the lines of csv would be nice:
>
> doc = openxmllib.openXmlDocument( path= "...xlsx" )
> for row in doc:
>     for col in row: # num / string
>
> (Background: Mac Openoffice chokes on an xlsx with > 65536 rows, grr.)
Well, the lxml APIs are simple enough for handling the XML *inside* the .xlsx zip archive. Don't know how complicated the structure of the file itself can get.
Here's the lxml.objectify notion:
$ unzip Foo.xlsx
$ python -i -c 'from lxml import etree, objectify'
>>> root = objectify.parse("./tmp/Foo/xl/worksheets/sheet1.xml").getroot()
>>> print root.tag
{http://schemas.openxmlformats.org/spreadsheetml/2006/main}worksheet
>>> for row in root.sheetData.row:
...     for c in row.c:
...         print "%s: %s" % (c.get('r'), c.v)
...
A1: 0
B1: 1
C1: 2
A2: 1
B2: 2
C2: 3
A3: 4
B3: 5
C3: 6
>>>
Holger
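P.S. The manual unzip step is not strictly needed. Here is a minimal sketch, assuming the first worksheet lives at xl/worksheets/sheet1.xml as in the session above. The helper name iter_cells is made up for illustration, and the fallback import is only there so the sketch also runs where lxml is not installed. It reads the sheet straight out of the archive with iterparse() and clears each row once it has been handled, which may help with very long sheets:

```python
import io
import zipfile

try:
    from lxml import etree
except ImportError:  # stdlib fallback, offers a compatible iterparse()
    import xml.etree.ElementTree as etree

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_cells(xlsx, sheet="xl/worksheets/sheet1.xml"):
    """Yield (cell reference, raw <v> text) pairs from one worksheet.

    `xlsx` is a path or a file-like object; `sheet` is an assumed default.
    """
    with zipfile.ZipFile(xlsx) as archive:
        with archive.open(sheet) as stream:
            for _event, elem in etree.iterparse(stream, events=("end",)):
                if elem.tag == NS + "row":
                    for c in elem.findall(NS + "c"):
                        yield c.get("r"), c.findtext(NS + "v")
                    elem.clear()  # free the cells of the finished row
```

Usage would be along the lines of "for ref, value in iter_cells('Foo.xlsx'): ...". Note that for string cells (t="s") the <v> text is, as far as I understand the format, an index into xl/sharedStrings.xml rather than the string itself, so this yields raw values only.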
From terry_n_brown at yahoo.com Tue Nov 16 15:06:09 2010
From: terry_n_brown at yahoo.com (Terry Brown)
Date: Tue, 16 Nov 2010 08:06:09 -0600
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
In-Reply-To:
References:
Message-ID: <20101116080609.5ac3367c@nrri.umn.edu>
On Tue, 16 Nov 2010 12:03:08 +0000 (UTC)
denis wrote:
> Folks,
> has anyone read spreadsheets, .xlsx aka excel-2007, with lxml ?
> A simple API along the lines of csv would be nice:
No. First I'd try http://pypi.python.org/pypi/xlrd - I haven't used it for .xlsx, but it works well for .xls and I think it also supports .xlsx.
Small code example below.
Cheers -Terry
import xlrd
from collections import defaultdict
filename = "MasterDatabase.xls"
book = xlrd.open_workbook(filename)
# list every sheet with its row count
for sheet in book.sheets():
    print("{0.name:>20s} {0.nrows}".format(sheet))
# count how often each value occurs in the first column of the first sheet
cnt = defaultdict(int)
sheet0 = book.sheet_by_index(0)
for row in range(sheet0.nrows):
    cnt[sheet0.cell(row, 0).value] += 1
From denis-bz-py at t-online.de Tue Nov 16 17:59:43 2010
From: denis-bz-py at t-online.de (denis)
Date: Tue, 16 Nov 2010 16:59:43 +0000 (UTC)
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
References:
Message-ID:
Thanks Holger, Thanks Terry,
I was really looking for someone who's *used* lxml (or ...)
on big Microsoft xlsx spreadsheets.
I gather from http://en.wikipedia.org/wiki/Office_Open_XML
that the format is messy --
Part 1 (Fundamentals and Markup Language Reference)
This part has 5560 pages
?!
By the way, xlrd 0.7.1 ->
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
/site-packages/xlrd/__init__.py", line 429, in open_workbook
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
/site-packages/xlrd/__init__.py", line 1545, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
/site-packages/xlrd/__init__.py", line 1539, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file:
Expected BOF record; found 'PK\x03\x04\x14\x00\x06\x00'
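(The 'PK\x03\x04' at the start is the ZIP signature, so xlrd 0.7.1 is evidently being handed the zipped .xlsx container rather than a BIFF .xls file. A quick check along those lines, as a sketch using only the standard library; the function name is mine:)

```python
import io
import zipfile


def looks_like_xlsx_container(path_or_file):
    """Return True if the input is a ZIP archive (as .xlsx files are).

    A BIFF .xls file is not a ZIP archive, so this returns False for it.
    It does not prove that a ZIP archive really contains a spreadsheet.
    """
    return zipfile.is_zipfile(path_or_file)
```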
cheers
-- denis
From stefan_ml at behnel.de Wed Nov 17 18:28:08 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 17 Nov 2010 18:28:08 +0100
Subject: [lxml-dev] explicit child tag notation for empty namespace URI
in lxml.objectify
In-Reply-To: <4CCFDA3E.5020600@behnel.de>
References: <20101102084023.275460@gmx.net>
<4CCFDA3E.5020600@behnel.de>
Message-ID: <4CE410A8.8070203@behnel.de>
Stefan Behnel, 02.11.2010 10:30:
> Alexander Artemenko, 02.11.2010 10:07:
>> On Tue, Nov 2, 2010 at 11:40 AM, jholg wrote:
>>> just stumbled upon this:
>>>
>>> http://stackoverflow.com/questions/3103661
>>>
>>> In short: Should we consider this a bug:
>>>
>>> >>> root = etree.fromstring("""
>>> ...206
>>> ...
>>> ......
>>> ...
>>> ...
>>> ... """)
>>> >>> root['{}duration']
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in ?
>>> File "lxml.objectify.pyx", line 284, in lxml.objectify.ObjectifiedElement.__getitem__ (src/lxml/lxml.objectify.c:3345)
>>> File "lxml.objectify.pyx", line 484, in lxml.objectify._lookupChildOrRaise (src/lxml/lxml.objectify.c:5347)
>>> AttributeError: no such child: {http://api.example.com}duration
>>
>> This is not a bug, because you MUST specify namespaces for the
>> duration, because this element is in the scope of the 'ns2'
>> namespaces. See http://www.w3.org/TR/xml-names/#scoping for details.
>
> The spec says in 6.2:
>
> """
> If there is a default namespace declaration in scope, the expanded name
> corresponding to an unprefixed element name has the URI of the default
> namespace as its namespace name. If there is no default namespace
> declaration in scope, the namespace name has no value.
> """
>
> So, in the above case, "the namespace name has no value", which is just
> fine. Although rare, this *is* a problem. Personally, I think I would have
> expected "root['{}duration']" to work, but I haven't looked into it any
> deeper yet. It might be worth special casing this in lxml.objectify.
I've committed a fix for 2.3 that lets lxml.objectify accept "{}tag" as
explicitly meaning "tag" with an empty namespace URI.
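A minimal sketch of the intended behaviour, assuming lxml 2.3 or later. The document below is a made-up stand-in for the one in the Stack Overflow question and reuses only the namespace URI from the traceback:

```python
from lxml import objectify

# Hypothetical document: a namespaced root with a child in no namespace.
root = objectify.fromstring(
    '<ns2:result xmlns:ns2="http://api.example.com">'
    '<duration>206</duration>'
    '</ns2:result>'
)

# Plain attribute access inherits the parent's namespace, so it looks for
# {http://api.example.com}duration and does not find the child.
try:
    root.duration
    found_by_attribute = True
except AttributeError:
    found_by_attribute = False

# "{}tag" explicitly asks for the tag with an empty namespace URI.
duration = root['{}duration']
```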
Stefan
From kj at rdprojekt.pl Thu Nov 18 11:15:54 2010
From: kj at rdprojekt.pl (Krzysztof Jakubczyk)
Date: Thu, 18 Nov 2010 11:15:54 +0100
Subject: [lxml-dev] Schema validation - no file position
Message-ID: <4CE4FCDA.5080000@rdprojekt.pl>
Hi,
I'm trying to validate a document using XMLSchema. It works, but the
exception received (etree.XMLSyntaxError) has no information about the file
position: exc.position is (0,0). Is this correct behaviour?
From m.parrucci at unibo.it Fri Nov 19 18:33:55 2010
From: m.parrucci at unibo.it (Matteo Parrucci)
Date: Fri, 19 Nov 2010 17:33:55 +0000
Subject: [lxml-dev] HTMLParser and \r converted in &#13; html entity
Message-ID: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
Hi,
I'm new here; I subscribed because I encountered a strange behavior in lxml:
Is it normal that "\r" followed by "\n" in HTML code gets converted into a "&#13;" entity when using HTMLParser?
The strange behavior is reproduced in the example that follows.
import lxml.etree
g='>\r\n>\r\n\r\ntitle\r\n'
lxml.etree.tostring(lxml.etree.fromstring(g, parser=lxml.etree.HTMLParser()))
OUTPUT:
' xmlns="http://www.w3.org/1999/xhtml">
\ntitle
\n'
Thanks in advance
Matteo Parrucci
From stefan_ml at behnel.de Sat Nov 20 16:58:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 20 Nov 2010 16:58:19 +0100
Subject: [lxml-dev] HTMLParser and \r converted in &#13; html entity
In-Reply-To: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
References: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
Message-ID: <4CE7F01B.2080409@behnel.de>
Matteo Parrucci, 19.11.2010 18:33:
> I'm new here; I subscribed because I encountered a strange behavior in lxml:
> Is it normal that "\r" followed by "\n" in html code get converted in "&#13;" entity using HTMLParser?
> the strange behavior is reproduced in the example that follows.
>
> import lxml.etree
> g='>\r\n>\r\n\r\ntitle\r\n'
> lxml.etree.tostring(lxml.etree.fromstring(g, parser=lxml.etree.HTMLParser()))
This is an XHTML document, you shouldn't parse it using the HTML parser.
Use the XML parser instead.
> OUTPUT:
> ' xmlns="http://www.w3.org/1999/xhtml">
\ntitle
\n'
Default encoding for serialisation is ASCII, which escapes all non-ASCII
characters (although I wonder why it should escape line endings...). If you
want a different encoding, use the "encoding" parameter.
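A minimal sketch of that suggestion, with a made-up XHTML snippet standing in for the original input: the default XML parser normalises "\r\n" to "\n" while parsing, as the XML spec requires, so no carriage return is left for the serialiser to escape.

```python
from lxml import etree

# Hypothetical XHTML input with Windows line endings.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\r\n'
    '<head>\r\n<title>title</title>\r\n</head>\r\n'
    '</html>'
)

# fromstring() uses the XML parser unless another parser is passed in.
root = etree.fromstring(xhtml)

# An explicit encoding replaces the ASCII default of tostring().
serialised = etree.tostring(root, encoding="UTF-8")
```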
Stefan
From piet at vanoostrum.org Sat Nov 20 21:09:12 2010
From: piet at vanoostrum.org (Piet van Oostrum)
Date: Sat, 20 Nov 2010 16:09:12 -0400
Subject: [lxml-dev] HTMLParser and \r converted in &#13; html entity
In-Reply-To: <4CE7F01B.2080409@behnel.de>
References: <412A64CC3920A44E8ABEA2995B09D246D198@E10-MBX2-CS.personale.dir.unibo.it>
<4CE7F01B.2080409@behnel.de>
Message-ID: <19688.10984.533379.514521@cochabamba.vanoostrum.org>
Stefan Behnel wrote:
> Default encoding for serialisation is ASCII, which escapes all
> non-ASCII characters (although I wonder why it should escape line
> endings...). If you want a different encoding, use the "encoding"
> parameter.
\r isn't supposed to be a line ending *in a string*, I suppose. In a file it is (at least on Windows), but it disappears as soon as it is read as text.
--
Piet van Oostrum
Cochabamba. URL: http://pietvanoostrum.com/
Nu Fair Trade woonartikelen op http://www.zylja.com
From crucialfelix at gmail.com Tue Nov 23 11:14:19 2010
From: crucialfelix at gmail.com (felix)
Date: Tue, 23 Nov 2010 11:14:19 +0100
Subject: [lxml-dev] Compile failure
In-Reply-To: <4CCBDC02.3080305@behnel.de>
References:
<4CCBDC02.3080305@behnel.de>
Message-ID:
On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel wrote:
> felix, 26.10.2010 15:28:
>
> According to this:
>> http://codespeak.net/lxml/build.html
>>
>> we should avoid installing Cython
>>
>> but using easy_install to build fails saying the cython generated file is
>> missing
>>
>
> I doubt that it's failing because of that. However, you didn't provide the
> output of the build, so I can't guess what happened that actually made the
> build fail.
Sorry, that output had scrolled off by the time I realized I should submit a
report. I have another server, so fortunately I can reproduce the failure there and show you.
crucial at crucial-systems:~/working/lxml$ python setup.py build
/home/crucial/working/lxml/versioninfo.py:53: UserWarning: unrecognized
.svn/entries format; skipping /home/crucial/working/lxml/
warn("unrecognized .svn/entries format; skipping "+base)
Building lxml version 2.3.beta1.
*NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c'
needs to be available.*
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib
running build
running build_py
running build_ext
building 'lxml.etree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall
-Wstrict-prototypes -fPIC -I/usr/local/include/libxml2
-I/usr/include/python2.6 -c src/lxml/lxml.etree.c -o
build/temp.linux-x86_64-2.6/src/lxml/lxml.etree.o -w
gcc: src/lxml/lxml.etree.c: No such file or directory
gcc: no input files
error: command 'gcc' failed with exit status 1
> The latest build instructions for the SVN trunk are in the SVN trunk as
> "doc/build.txt", or *(not always completely up-to-date)* here:
>
exactly
>
> *but then I succeeded with the old sudo easy_install lxml*
>>
>>
>> because now I have Cython
>>
>
> Again, I doubt that this is the reason.
>
"sudo easy_install lxml" failed before.
After installing Cython, the build says it uses Cython (instead of "Trying to
build without Cython"), and it worked.
With nothing else having changed, I thought it was a reasonable guess that it
worked because it used Cython, since Cython is now installed.
>
> Stefan
>
From jholg at gmx.de Tue Nov 23 11:35:07 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 23 Nov 2010 11:35:07 +0100
Subject: [lxml-dev] Compile failure
In-Reply-To:
References:
<4CCBDC02.3080305@behnel.de>
Message-ID: <20101123103507.100470@gmx.net>
Hi,
> On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel
> wrote:
>
> > felix, 26.10.2010 15:28:
> >
> > According to this:
> >> http://codespeak.net/lxml/build.html
> >>
> >> we should avoid installing Cython
> >>
> >> but using easy_install to build fails saying the cython generated file is
> >> missing
Note that it says
"""
Since we distribute the Cython-generated .c files with lxml *releases*, however, you do not need Cython to build lxml from the normal *release* sources.
"""
So the Cython-generated .c files are not in an SVN checkout but should be in the release packages.
> > Again, I doubt that this is the reason.
> >
>
> sudo easy_install lxml
> failed before
>
> after installing Cython it says it uses Cython (not Trying to build without
> Cython) and it worked.
>
> nothing else having changed I thought it was a reasonable guess that it
> worked because it used Cython because Cython is installed.
I just checked the 2.3beta1 (source) package on pypi and it does contain the .c files:
-rw-r--r-- 1000/1000 7102827 Sep 6 09:31 2010 lxml-2.3beta1/src/lxml/lxml.etree.c
-rw-r--r-- 1000/1000 1318011 Sep 6 09:33 2010 lxml-2.3beta1/src/lxml/lxml.objectify.c
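The same check can be scripted. This is only a sketch using the standard library; the function name is made up, and it assumes the source package has already been downloaded as a tarball:

```python
import tarfile


def generated_c_files(sdist_path):
    """Return the Cython-generated C sources found in an sdist tarball."""
    wanted = ("src/lxml/lxml.etree.c", "src/lxml/lxml.objectify.c")
    # tarfile.open() detects the compression (gzip, bz2, ...) by itself.
    with tarfile.open(sdist_path) as archive:
        return [name for name in archive.getnames() if name.endswith(wanted)]
```

An empty result would suggest the tarball is not a regular release package, in which case Cython is needed to generate the files.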
Holger
From stefan_ml at behnel.de Tue Nov 23 15:38:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 23 Nov 2010 15:38:35 +0100
Subject: [lxml-dev] SyntaxErrors with Python 3
In-Reply-To: <201009102244.24301.Arfrever.FTA@gmail.com>
References: <201007200303.15139.Arfrever.FTA@gmail.com> <4C455380.3020905@behnel.de> <201007251715.35395.Arfrever.FTA@gmail.com>
<201009102244.24301.Arfrever.FTA@gmail.com>
Message-ID: <4CEBD1EB.8040908@behnel.de>
Arfrever Frehtes Taifersar Arahesis, 10.09.2010 22:44:
> 2010-07-25 17:14:53 Arfrever Frehtes Taifersar Arahesis wrote:
>> 2010-07-20 09:42:56 Stefan Behnel wrote:
>>> Arfrever Frehtes Taifersar Arahesis, 20.07.2010 03:02:
>>>> LXML r76211 generally supports Python 3, but there are still some SyntaxErrors.
>>> > [snip]
>>>
>>> Thanks. Only 2 or 3 of those are relevant to Py3, but I'll see if I can fix
>>> them. A patch could easily speed this up, BTW.
>>
>> I'm attaching the partial patch.
>
> Could this patch be committed?
Done.
Stefan
From stefan_ml at behnel.de Thu Nov 25 10:48:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Nov 2010 10:48:55 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CE4FCDA.5080000@rdprojekt.pl>
References: <4CE4FCDA.5080000@rdprojekt.pl>
Message-ID: <4CEE3107.9000301@behnel.de>
Hi,
Krzysztof Jakubczyk, 18.11.2010 11:15:
> I'm trying to validate a document using XmlSchema. It works but the
> exception received (etree.XMLSyntaxError) has no information about file
> position- exc.position is (0,0). Is this correct behaviour?
I wonder why you get an "XMLSyntaxError" in the first place. This means
that there's an error while parsing your document.
Could you show us the code that you use for parsing and validation?
Stefan
From kj at rdprojekt.pl Thu Nov 25 11:00:52 2010
From: kj at rdprojekt.pl (Krzysztof Jakubczyk)
Date: Thu, 25 Nov 2010 11:00:52 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CEE3107.9000301@behnel.de>
References: <4CE4FCDA.5080000@rdprojekt.pl> <4CEE3107.9000301@behnel.de>
Message-ID: <4CEE33D4.9030400@rdprojekt.pl>
On 2010-11-25 10:48, Stefan Behnel wrote:
> Hi,
>
> Krzysztof Jakubczyk, 18.11.2010 11:15:
>> I'm trying to validate a document using XmlSchema. It works but the
>> exception received (etree.XMLSyntaxError) has no information about file
>> position- exc.position is (0,0). Is this correct behaviour?
>
> I wonder why you get an "XMLSyntaxError" in the first place. This
> means that there's an error while parsing your document.
>
> Could you show us the code that you use for parsing and validation?
>
> Stefan
Hmm...
I get the error because the document I validate is invalid - it doesn't
match the Xml Schema. This behavior is correct.
My problem is that the error doesn't contain information about position
of the error - it's hard to find source of the error in a big file.
My code is the following:
from lxml import etree
def validate(schemaContent, dataStream):
    schema = etree.XMLSchema(etree.fromstring(schemaContent))
    for event, elem in etree.iterparse(dataStream, schema=schema):
        elem.clear()
        # drop already-processed siblings to keep memory usage low
        while elem.getprevious() is not None:
            if elem.getparent() is not None:
                del elem.getparent()[0]
regards,
kj
From stefan_ml at behnel.de Thu Nov 25 11:14:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Nov 2010 11:14:33 +0100
Subject: [lxml-dev] Schema validation - no file position
In-Reply-To: <4CEE33D4.9030400@rdprojekt.pl>
References: <4CE4FCDA.5080000@rdprojekt.pl> <4CEE3107.9000301@behnel.de>
<4CEE33D4.9030400@rdprojekt.pl>
Message-ID: <4CEE3709.5010203@behnel.de>
Krzysztof Jakubczyk, 25.11.2010 11:00:
> On 2010-11-25 10:48, Stefan Behnel wrote:
>> Krzysztof Jakubczyk, 18.11.2010 11:15:
>>> I'm trying to validate a document using XmlSchema. It works but the
>>> exception received (etree.XMLSyntaxError) has no information about file
>>> position- exc.position is (0,0). Is this correct behaviour?
>>
>> I wonder why you get an "XMLSyntaxError" in the first place. This
>> means that there's an error while parsing your document.
>>
>> Could you show us the code that you use for parsing and validation?
>
> I get the error because the document I validate is invalid - it doesn't
> match the Xml Schema. This behavior is correct.
> My problem is that the error doesn't contain information about position
> of the error - it's hard to find source of the error in a big file.
>
> my code is the following:
>
> def validate(schemaContent, dataStream):
>     schema = etree.XMLSchema(etree.fromstring(schemaContent))
>     for event, elem in etree.iterparse(dataStream, schema=schema):
Now, this reveals two important hints that you didn't provide in your
original post: you are validating at parse time, and you are using
iterparse(). For me, that totally changes the place in the code to look at.
I'll see if I can come up with something.
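In the meantime, if the document fits into memory, validating after parsing instead of during iterparse() might be a workaround: the schema's error_log then carries line numbers. A rough sketch under that assumption (the function name is just for illustration, and it gives up the streaming behaviour of your version):

```python
from io import BytesIO

from lxml import etree


def validation_errors(schema_content, data_stream):
    """Return a list of (line, message) pairs, empty if the document is valid."""
    schema = etree.XMLSchema(etree.fromstring(schema_content))
    doc = etree.parse(data_stream)  # loads the whole tree into memory
    if schema.validate(doc):
        return []
    return [(entry.line, entry.message) for entry in schema.error_log]
```

Something like validation_errors(schemaContent, BytesIO(data)) should then point at the offending lines.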
Stefan
From chris at simplistix.co.uk Mon Nov 29 21:21:37 2010
From: chris at simplistix.co.uk (Chris Withers)
Date: Mon, 29 Nov 2010 20:21:37 +0000
Subject: [lxml-dev] read .xlsx spreadsheets with lxml ?
In-Reply-To:
References:
Message-ID: <4CF40B51.8040700@simplistix.co.uk>
On 16/11/2010 16:59, denis wrote:
> Thanks Holger, Thanks Terry,
>
> I was really looking for someone who's *used* lxml (or ...)
> on big Microsoft xlsx spreadsheets.
John Machin over on the python-excel group has done just this.
He has some alpha code that I know he'd like to see merged into the xlrd
code base but he's looking for some serious testers.
Follow the birdy on www.python-excel.org for group joining...
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk