From jholg at gmx.de Tue Oct 2 17:58:52 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 02 Oct 2007 17:58:52 +0200
Subject: [lxml-dev] trunk schematron tests core dump (was: annotate,
pyannotate, xsiannotate)
In-Reply-To: <20070921142905.315500@gmx.net>
References: <46DAE416.1080807@behnel.de> <20070904100001.277890@gmx.net>
<20070906161412.136940@gmx.net> <46E82DC5.4010101@behnel.de>
<20070913081704.138210@gmx.net> <46EEC836.9030603@behnel.de>
<20070918072107.19040@gmx.net> <46EF8F6B.8040403@behnel.de>
<20070918085758.19050@gmx.net> <20070919112409.271040@gmx.net>
<46F11E1D.70309@behnel.de> <20070919134327.17290@gmx.net>
<46F37F62.40307@behnel.de> <20070921092340.311080@gmx.net>
<20070921142905.315500@gmx.net>
Message-ID: <20071002155852.130440@gmx.net>
Hi,
> > > Schematron uses XPath a lot, so I wouldn't be surprised if this was
> > > related to
> > > the XPath bug in libxml2 2.6.27. Is there any chance you could switch
> to
> [...]
> Unfortunately, using the latest & greatest libxml2/libxslt (2.6.33/1.1.22)
> doesn't solve the problem for me.
I'm trying to get some sensible information but am having real problems with debugging: the line number information I see is just plain wrong, even though I compiled with debugging enabled. For example:
(gdb) info source
Current source file is src/lxml/etree.c
Compilation directory is /home/lb54320/pydev/LXML/lxml/
Located in /home/lb54320/pydev/LXML/lxml/src/lxml/etree.c
Contains 90795 lines.
Source language is c.
Compiled with stabs debugging format.
(gdb) b etree.c:70850
No line 70850 in file "src/lxml/etree.c".
(gdb)
No idea what I'm doing wrong here, at the moment.
So the info on the crash does not get much better than that backtrace at the moment:
Program received signal SIGSEGV, Segmentation fault.
0xff0b3218 in strlen () from /usr/lib/libc.so.1
(gdb) bt
#0 0xff0b3218 in strlen () from /usr/lib/libc.so.1
#1 0xff106530 in _doprnt () from /usr/lib/libc.so.1
#2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1
#3 0xfe23df04 in __xmlRaiseError () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
#4 0xfe3e717c in xmlSchematronPErr () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
#5 0xfe3e9878 in xmlSchematronParse () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
#6 0xfe68dfdc in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x1b30f0,
__pyx_args=0x1db670, __pyx_kwds=0x0) at src/lxml/etree.c:5663
What I can see, though, is that using the same schematron schema with xmllint does not crash:
0 $ cat invalid_empty.xst
0 $ python2.4 -i -c 'from lxml import etree; print etree.LIBXML_VERSION; schema = etree.Schematron(etree.parse("invalid_empty.xst"))'
(2, 6, 30)
Segmentation Fault (core dumped)
whereas
$ /apps/pydev/bin/xmllint --schematron invalid_empty.xst foo.xml --version
/apps/pydev/bin/xmllint: using libxml version 20630
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
invalid_empty.xst:1: element schema: Schemas parser error : The schematron document 'invalid_empty.xst' has no pattern
Schematron schema invalid_empty.xst failed to compile
Holger
From jg307 at cam.ac.uk Tue Oct 2 22:33:06 2007
From: jg307 at cam.ac.uk (James Graham)
Date: Tue, 02 Oct 2007 21:33:06 +0100
Subject: [lxml-dev] Tag name validation and HTML
In-Reply-To: <46FBBEBD.7030308@behnel.de>
References: <46FBA0D3.6010700@cam.ac.uk> <46FBBEBD.7030308@behnel.de>
Message-ID: <4702AB02.6080300@cam.ac.uk>
Stefan Behnel wrote:
> James Graham wrote:
>> The development branch of lxml 2 appears to restrict the characters that may
>> appear in a tag name. Whilst this may be appropriate for XML, it does not match
>> the behavior of all common HTML UAs and, as such, does not match the current
>> draft of the HTML 5 spec [1].
>
> This is actually not as simple as it might seem. The Element factory cannot
> distinguish between XML and HTML tags, so it cannot switch off validation for
> a particular tag. So the conservative solution would be to actually follow the
> HTML5 spec, as it is a superset of the XML spec, an extremely broad one even.
> But then there's not much left that you could honestly call validation. Also,
> I would still want to restrict ":" in tag names, as this has been a source of
> problems way too often. So that would just leave spaces and any of ":/>" as
> invalid characters in tag names.
The : thing is difficult because HTML UAs are expected to deal with : in
the tag name and there is content in the wild that depends on this being
accepted; MS Office produces "HTML" containing tags like <o:p>, for
example. Since I, and I guess others too, want to use lxml to process
random content that may have colons in the tag names, hard failure for
this case is a problem. To make matters worse it is possible that the
HTML spec will change in the future to introduce some sort of
namespacing feature which may or may not use colons.
Given all of this I would prefer it if it were possible to have an
HTML-specific mode with much more liberal rules than the XML mode. This
could then be adapted to support any namespacing features HTML grows in
the future. For example, if one could do something like
import lxml.html
lxml.html.Element("o:p")
where lxml.html.Element would be just like lxml.etree.Element but
without XML-specific validity checks. I guess there might be serious
practical difficulties with that exact solution, but I think the general
idea of being able to flag an element as following HTML rules or XML
rules would be more user-friendly than having a set of rules that
neither matches the XML nor the HTML model correctly.
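To make the idea of two rule sets concrete, here is a minimal sketch. The function name and the `html` flag are hypothetical and are not part of lxml; the character sets follow the rule floated earlier in the thread (reject whitespace and any of ":/>"), with the liberal mode letting the colon through. Real XML name rules are considerably stricter than this.

```python
import re

# Hypothetical helper for illustration only; this is not an lxml function.

# Characters the liberal (HTML) mode would still reject: whitespace, "/" and ">".
_HTML_FORBIDDEN = re.compile(r"[\s/>]")

# The strict mode additionally rejects ":", as proposed in the thread.
_XML_FORBIDDEN = re.compile(r"[\s/>:]")


def is_valid_tag_name(name, html=False):
    """Return True if `name` passes the simplified check for the chosen mode."""
    if not name:
        return False
    forbidden = _HTML_FORBIDDEN if html else _XML_FORBIDDEN
    return forbidden.search(name) is None


print(is_valid_tag_name("o:p"))             # strict mode: False
print(is_valid_tag_name("o:p", html=True))  # liberal mode: True
```

A flag on one factory and a separate factory per mode would both work; the point is only that the caller states which rules apply.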
--
"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conner Oberst
From l.oluyede at gmail.com Wed Oct 3 15:31:16 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Wed, 3 Oct 2007 15:31:16 +0200
Subject: [lxml-dev] Namespace serialization patch
Message-ID: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com>
I had the same problem Anders Bruun Olsen had in this thread:
http://comments.gmane.org/gmane.comp.python.lxml.devel/2924
What I'd like to know is whether I have to wait for the 2.0 release to use it
(using the alpha is not an option AFAIK), or whether you plan to release an
interim 1.3.x version with that patch applied.
Thanks
--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair
From anders at bruun-olsen.net Wed Oct 3 20:28:48 2007
From: anders at bruun-olsen.net (Anders Bruun Olsen)
Date: Wed, 03 Oct 2007 20:28:48 +0200
Subject: [lxml-dev] Namespace serialization patch
In-Reply-To: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com>
References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com>
Message-ID: <4703DF60.4040108@bruun-olsen.net>
Lawrence Oluyede wrote:
> I had the same problem Anders Bruun Olsen had in this thread:
> http://comments.gmane.org/gmane.comp.python.lxml.devel/2924
> What I'd like to know if I have to wait for 2.0 completion (using the
> alpha is not an option AFAIK) to use it or you plan to release an
> interim 1.3.x version with that patch applied.
Building LXML from SVN is really rather straightforward and of course
includes the fixes for that particular problem as well as others.
See the download page for instructions on building from SVN.
--
Anders
From l.oluyede at gmail.com Wed Oct 3 21:47:19 2007
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Wed, 3 Oct 2007 21:47:19 +0200
Subject: [lxml-dev] Namespace serialization patch
In-Reply-To: <4703DF60.4040108@bruun-olsen.net>
References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com>
<4703DF60.4040108@bruun-olsen.net>
Message-ID: <9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com>
> Building LXML from SVN is really rather straightforward and of course
> includes the fixes for that particular problem as well as others.
> See the download page for instructions on building from SVN.
I, personally, don't have a problem with that but AFAIK at work using
the SVN version is a lesser option than using the 2.0alpha.
--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair
From mwm-keyword-lxml.9112b8 at mired.org Wed Oct 3 22:50:16 2007
From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer)
Date: Wed, 3 Oct 2007 16:50:16 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
Message-ID: <20071003165016.104d2caf@bhuda.mired.org>
I'm getting crashes - by which I mean the python process is
segfaulting and, with some tweaking of GNU/Linux, leaving me a core
file - while using lxml to parse data.
Versions:
OS: RHEL 5
Python: 2.5.1 (custom built).
lxml: 1.3.3
libxml: 2.6.26 (both compiled and built)
libxslt: 1.1.17
[Yes, I know those are a bit out of date, but we had to give our
client host requirements months ago, and those were current at the
time, and changing them is a non-trivial process, and I've already
started on it, but I'd rather not do that if I can avoid it....]
Rebuilding python with OPTS=-g (I set that for the lxml build as
well), I can get a "where" output that points at lxml:
#0 0x00002aaaaf906c3a in rename ()
from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#1 0x00002aaaaf906be7 in rename ()
from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#2 0x00002aaaaf8ebdfe in rename ()
from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#3 0x00002aaaaf966a5c in findOrBuildNodeNs ()
from /usr/local/lib/python2.5/site-packages/lxml/etree.so
The first problem is that this isn't reliably repeatable. I've got test data
that will make it happen, but I have to feed that data through the
system a few thousand times. This is part of a database ETL system,
parsing data from the XML to load into the database. If I feed it the
exact same data over and over again, it'll work 9999 times out of ten
thousand - but then fail the ten-thousandth time with a segfault.
While this might not seem like a big deal, we're planning on
processing hundreds of thousands of documents a day, so we're talking
about having an instance of the process die tens of times a day. So I
sorta need to fix it.
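One way to keep a rare crash from taking down the whole job while the cause is being hunted is to run each unit of work in a child process and check how it ended. The sketch below is an assumption about how that could look, not something from this setup: it is written for a current Python 3 on a POSIX system, `run_isolated` is a made-up name, and the child code is a stand-in for the real parsing step.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a fresh interpreter and report how the child ended.

    Returns (ok, detail): ok is True on a clean exit; otherwise detail names
    the signal that killed the child (POSIX) or gives the exit status.
    """
    result = subprocess.run([sys.executable, "-c", code])
    if result.returncode == 0:
        return True, "ok"
    if result.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        return False, signal.Signals(-result.returncode).name
    return False, "exit status %d" % result.returncode


if __name__ == "__main__":
    # Stand-in for "parse one document"; a real worker would call lxml here.
    print(run_isolated("print('parsed one document')"))
```

A child killed by a segfault would be reported as `(False, 'SIGSEGV')`, so the parent can log the offending input and retry instead of dying with it.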
The document is straightforward: it starts with a meta element with a
set of attributes, and then has a lot of data elements, all the same
type, all with the same attributes (give or take an optional one), and
I just use document.xpath to find the elements, and then read off
their attribute values to save to a database load file.
Hints on how to proceed - setting things up so I can use gdb on the
lxml sources, for instance - would be greatly appreciated. If this
looks like a bug that's been fixed if I update one or more libraries,
that would be great information (i.e. - I can use it to get all the
libraries updated). Anything else that you think I oughta know would
be nice as well.
The sample document is almost half a megabyte (and might be
proprietary). If you'd like to look at it, drop me a line.
thanks,
http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
From lists.steve at arachnedesign.net Thu Oct 4 00:50:50 2007
From: lists.steve at arachnedesign.net (Steve Lianoglou)
Date: Wed, 3 Oct 2007 18:50:50 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071003165016.104d2caf@bhuda.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org>
Message-ID: <7BFF4FC4-72EC-419C-A1A6-0C333F435B09@arachnedesign.net>
> I'm getting crashes - by which I mean the python process is
> segfaulting and, with some tweaking of GNU/Linux, leaving me a core
> file - while using lxml to parse data.
>
> Versions:
>
> OS: RHEL 5
> Python: 2.5.1 (custom built).
> lxml: 1.3.3
> libxml: 2.6.26 (both compiled and built)
> libxslt: 1.1.17
As an aside (addendum?, whatever ..) I recently got nailed w/
segfaults and bus errors that seemed to not be 100% reproducible on
OS X.
I built lxml against:
libxml 2.6.30
libxslt 1.1.22
python2.5.1(and python2.4.4)
lxml 1.3.4
(all using MacPorts)
My code was basically generating large(-ish -- though really not much
bigger than 4 megs or so) documents like so (inspired from
ElementTree examples):
import lxml.etree as ET
root = ET.Element('graph', **root_attribs)
node = ET.SubElement(root, 'node', id='something', label=name)
ET.SubElement(node, 'att', name='pvalue', type='real', value=pval)
...
The nesting level wouldn't ever really go more than 3 or 4 children
deep.
Anyway, I know there was talk about lxml crashing w/ the default OS X
xml libs, but here's the case when I'm using the newer ones.
I don't know if this is the same issue as Mike's having, but since
this just happened to me and I haven't been able to smoke it out, I'm
bringing it up here (in the meantime I've switched to elementtree and
the same code works fine (if not slower)).
I will try to create a minimal test case after one of my deadlines
passes to help smoke this out better. (I also don't know whether a minimal
test case will help: is it possible that it's a function of the size
of the XML doc that I'm trying to build?)
Thanks,
-steve
From etiffany at alum.mit.edu Thu Oct 4 01:43:01 2007
From: etiffany at alum.mit.edu (Eric Tiffany)
Date: Wed, 03 Oct 2007 19:43:01 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <7BFF4FC4-72EC-419C-A1A6-0C333F435B09@arachnedesign.net>
Message-ID:
On OS X, you might actually be using the system libs rather than the newer
libs (in /opt/local/lib, if you are using MacOSPorts, for example). I had
lots of segfault problems until I realized that even though lxml was
claiming it was running with the newer libs, the info was only based on what
it was built with. At least, that's what it seemed like.
Anyway, all my (segfault) problems went away when I exported
DYLD_LIBRARY_PATH=/opt/local/lib
into the environment where python was running.
Actually, python was running zope/plone, but I think this problem could be
similar to yours.
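A quick way to check for this kind of mismatch is to compare the version lxml was compiled against with the version it finds at run time. The sketch below assumes a current lxml release, where `etree.LIBXML_COMPILED_VERSION` and `etree.LIBXML_VERSION` are both available; the two helper functions are made up for illustration.

```python
def format_version(version):
    """Turn a version tuple such as (2, 6, 30) into '2.6.30'."""
    return ".".join(str(part) for part in version)


def compare_libxml_versions(compiled, runtime):
    """Describe whether the build-time and run-time libxml2 versions agree."""
    if tuple(compiled) == tuple(runtime):
        return "libxml2 %s at build time and at run time" % format_version(runtime)
    return "built against libxml2 %s but running with %s" % (
        format_version(compiled), format_version(runtime))


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        print(compare_libxml_versions(etree.LIBXML_COMPILED_VERSION,
                                      etree.LIBXML_VERSION))
```

If the two versions differ, the dynamic linker is picking up a different libxml2 than the one lxml was built with, which is the situation described above.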
ET
On 10/3/07 6:50 PM, "Steve Lianoglou" wrote:
>> I'm getting crashes - by which I mean the python process is
>> segfaulting and, with some tweaking of GNU/Linux, leaving me a core
>> file - while using lxml to parse data.
>>
>> Versions:
>>
>> OS: RHEL 5
>> Python: 2.5.1 (custom built).
>> lxml: 1.3.3
>> libxml: 2.6.26 (both compiled and built)
>> libxslt: 1.1.17
>
> As an aside (addendum?, whatever ..) I recently got nailed w/
> segfaults and bus errors that seemed to not be 100% reproducible on
> OS X.
>
> I built lxml against:
>
> libxml 2.6.30
> libxslt 1.1.22
> python2.5.1(and python2.4.4)
> lxml 1.3.4
> (all using MacPorts)
>
> My code was basically generating large(-ish -- though really not much
> bigger than 4 megs or so) documents like so (inspired from
> ElementTree examples):
>
> import lxml.etree as ET
> root = ET.Element('graph', **root_attribs)
> ET.SubElement(root, 'node', id='something', label=name)
> ET.SubElement(node, 'att', name='pvalue', type='real', value=pval)
> ...
>
> The nesting level wouldn't ever really go more than 3 or 4 children
> deep.
>
> Anyway, I know there was talk about lxml crashing w/ the default OS X
> xml libs, but here's the case when I'm using the newer ones.
>
> I don't know if this is the same issue as Mike's having, but since
> this just happened to me and I haven't been able to smoke it out, I'm
> bringing it up here (in the meantime I've switched to elementtree and
> the same code works fine (if not slower)).
>
> I will try to create a minimal test case after one of my deadlines
> pass to help smoke this out better(also, I don't know if the minimal
> test case will help, is it possible that it's a function of the size
> of the xml doc that I'm trying to build?)
>
> Thanks,
> -steve
--
________________________________________________
Eric Tiffany | +1 413-458-3743
etiffany at alum.mit.edu | +1 413-627-1778 mobile
From lists.steve at arachnedesign.net Thu Oct 4 01:50:11 2007
From: lists.steve at arachnedesign.net (Steve Lianoglou)
Date: Wed, 3 Oct 2007 19:50:11 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To:
References:
Message-ID: <9AD07B43-4E11-43A2-AE9E-6E7060F8A5F2@arachnedesign.net>
> On OS X, you might actually be using the system libs rather than
> the newer
> libs (in /opt/local/lib, if you are using MacOSPorts, for
> example). I had
> lots of segfault problems until I realized that even though lxml was
> claiming it was running with the newer libs, the info was only
> based on what
> it was built with. At least, that's what it seemed like.
>
> Anyway, all my (segfault) problems went away when I exported
>
> DYLD_LIBRARY_PATH=/opt/local/lib
>
> Into the environment where python was running.
Hmm .. interesting.
I was playing with DYLD_LIBRARY_PATH, but I thought that had to be
set during compile time (of lxml).
Even though ... through my hunting on the intarweb, I came across a
suggestion to use `otool` to see what libs were being used. So I
tried like so:
$ otool -L /opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/python2.4/site-packages/lxml/etree.so
/opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/python2.4/site-packages/lxml/etree.so:
    /opt/local/lib/libxslt.1.dylib (compatibility version 3.0.0, current version 3.22.0)
    /opt/local/lib/libexslt.0.dylib (compatibility version 9.0.0, current version 9.13.0)
    /opt/local/lib/libxml2.2.dylib (compatibility version 9.0.0, current version 9.30.0)
    /opt/local/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.3)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.9)
The fact that the xml libs in /opt/local were the ones being
referenced made me think that those are the ones it would use ... is
that right? Looking at that closer, I do see
``/usr/lib/libSystem.B.dylib``, which is the OS X default, but I honestly
don't know what it's responsible for ...
-steve
From etiffany at alum.mit.edu Thu Oct 4 04:06:11 2007
From: etiffany at alum.mit.edu (Eric Tiffany)
Date: Wed, 03 Oct 2007 22:06:11 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <9AD07B43-4E11-43A2-AE9E-6E7060F8A5F2@arachnedesign.net>
Message-ID:
Check the man page for dyld, which notes
DYLD_LIBRARY_PATH
This is a colon separated list of directories that contain
libraries. The dynamic linker searches these directories before
it searches the default locations for libraries. It allows you
to test new versions of existing libraries.
For each library that a program uses, the dynamic linker looks
for it in each directory in DYLD_LIBRARY_PATH in turn. If it
still can't find the library, it then searches
DYLD_FALLBACK_FRAMEWORK_PATH and DYLD_FALLBACK_LIBRARY_PATH in turn.
Use the -L option to otool(1) to discover the frameworks and
shared libraries that the executable is linked against.
I think otool is telling you what libs the .so would *like* to use, but the
environment will tell dyld where to look at runtime. At least, that's the
way I interpret it. Anyway, my segfaults and bus errors stopped.
ET
On 10/3/07 7:50 PM, "Steve Lianoglou" wrote:
>> On OS X, you might actually be using the system libs rather than
>> the newer
>> libs (in /opt/local/lib, if you are using MacOSPorts, for
>> example). I had
>> lots of segfault problems until I realized that even though lxml was
>> claiming it was running with the newer libs, the info was only
>> based on what
>> it was built with. At least, that's what it seemed like.
>>
>> Anyway, all my (segfault) problems went away when I exported
>>
>> DYLD_LIBRARY_PATH=/opt/local/lib
>>
>> Into the environment where python was running.
>
> Hmm .. interesting.
>
> I was playing with DYLD_LIBRARY_PATH, but I thought that had to be
> set during compile time (of lxml).
>
> Even though ... through my hunting on the intarweb, I came across a
> suggestion to use `otool` to see what libs were being used. So I
> tried like so:
>
> $ otool -L /opt/local/Library/Frameworks/Python.framework/Versions/
> Current/lib/python2.4/site-packages/lxml/etree.so
> /opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/
> python2.4/site-packages/lxml/etree.so:
> /opt/local/lib/libxslt.1.dylib (compatibility version 3.0.0,
> current version 3.22.0)
> /opt/local/lib/libexslt.0.dylib (compatibility version
> 9.0.0, current version 9.13.0)
> /opt/local/lib/libxml2.2.dylib (compatibility version 9.0.0,
> current version 9.30.0)
> /opt/local/lib/libz.1.dylib (compatibility version 1.0.0,
> current version 1.2.3)
> /usr/lib/libSystem.B.dylib (compatibility version 1.0.0,
> current version 88.3.9)
>
> The fact that the xml libs in /opt/local were the ones being
> referenced made me think that those are the ones it would use ... is
> that right? Looking at that closer, I do see ``/usr/lib/
> libSystem.B.dylib``which is OS X default, but honestly don't know
> what it's responsible for ...
>
> -steve
--
____________________________________________________
Eric Tiffany | eric at projectliberty.org
Interop Tech Lead | +1 413-458-3743
Liberty Alliance | +1 413-627-1778 mobile
From rocarras at gmail.com Thu Oct 4 15:51:43 2007
From: rocarras at gmail.com (Roberto Carrasco)
Date: Thu, 4 Oct 2007 09:51:43 -0400
Subject: [lxml-dev] Problem with lxml library running on Windows
Message-ID: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
Hi:
We have an issue with the lxml library running on Windows.
We are trying to read an XML document from a string over and over, but the
program crashes in the while loop. We suspect the problem is that we cannot
call the function etree.parse too many times when we are reading an XML
document from a string. The code crashes when the program reads an XML
document repeatedly.
The issue is specific to Windows, because in a Linux environment there is no
problem executing it.
We are trying to execute the piece of code shown below in this environment:
- Windows XP Service Pack 2
- Python 2.5
- lxml 1.3.4 and 2.0 alpha 3
The question is: what are we doing wrong? Or is this a problem with the
library running on Windows?
# -*- coding: UTF-8 -*-
from lxml import etree
from StringIO import StringIO

if __name__ == "__main__":
    document=""" 1-32006-03-13
08:44:52SANTIAGOPUENTE13/03/2006RobertinMANZANO2006-03-10
15:52:29"""
    j=0
    while 1:
        print j
        j+=1
        #tree = etree.parse(StringIO(docRauco0))
        tree = etree.fromstring(document)
        images_url = tree.xpath('//link[@rel="media"][@href]')
        image_url_name=images_url[0].attrib['href']
--
Regards,
Roberto
From jholg at gmx.de Fri Oct 5 14:00:41 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 05 Oct 2007 14:00:41 +0200
Subject: [lxml-dev] Re: Tag name validation and HTML
Message-ID: <20071005120041.180700@gmx.net>
Hi,
> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.
You'd get errors when parsing such stuff with the XML parser:
>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5
but not with the HTML parser:
>>> etree.HTML
>>> etree.HTML("""<o:p>foo</o:p>""")
>>>
So the parsers already distinguish between HTML and XML, but the API does not, e.g. when creating elements.
For my use case, I must *rely* on producing valid XML through the API, so making things more liberal potentially breaks my system. That's because I need to pickle (i.e. serialize) tree content and reparse it somewhere else. If I am allowed to produce invalid XML, some data receiver will choke on my data.
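The round-trip concern can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, purely because that module does not validate tag names and so can stand in for a hypothetical liberal factory; it is written for a current Python 3.

```python
import xml.etree.ElementTree as ET

# The standard library's ElementTree does not validate tag names, so it
# stands in here for a hypothetical "liberal" element factory.
element = ET.Element("o:p")
element.text = "foo"
serialized = ET.tostring(element)

try:
    ET.fromstring(serialized)
    reparsed = True
except ET.ParseError as error:
    # The prefix "o" was never bound to a namespace, so the parser refuses it.
    reparsed = False
    print("receiver rejected %r: %s" % (serialized, error))
```

The element is created and serialized without complaint, and the failure only shows up on the receiving side when the data is reparsed as XML.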
> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
>
> import lxml.html
> lxml.html.Element("o:p")
>
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks. I guess there might be serious
> practical difficulties with that exact solution, but I think the general
> idea of being able to flag an element as following HTML rules or XML
> rules would be more user-friendly than having a set of rules that
> neither matches the XML nor the HTML model correctly.
Sounds better to me than introducing some mixed set of rules. And I don't even think that it's difficult to implement, though it might mean introducing another public factory or some sort of switch on Element().
Holger
From stefan_ml at behnel.de Sat Oct 6 19:22:28 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 19:22:28 +0200
Subject: [lxml-dev] trunk schematron tests core dump
In-Reply-To: <20071002155852.130440@gmx.net>
References: <46DAE416.1080807@behnel.de>
<20070904100001.277890@gmx.net> <20070906161412.136940@gmx.net>
<46E82DC5.4010101@behnel.de> <20070913081704.138210@gmx.net>
<46EEC836.9030603@behnel.de> <20070918072107.19040@gmx.net>
<46EF8F6B.8040403@behnel.de> <20070918085758.19050@gmx.net>
<20070919112409.271040@gmx.net> <46F11E1D.70309@behnel.de>
<20070919134327.17290@gmx.net> <46F37F62.40307@behnel.de>
<20070921092340.311080@gmx.net> <20070921142905.315500@gmx.net>
<20071002155852.130440@gmx.net>
Message-ID: <4707C454.8020302@behnel.de>
jholg at gmx.de wrote:
>>>> Schematron uses XPath a lot, so I wouldn't be surprised if this was
>>>> related to
>>>> the XPath bug in libxml2 2.6.27. Is there any chance you could switch
>> to
>> [...]
>> Unfortunately, using the latest & greatest libxml2/libxslt (2.6.33/1.1.22)
>> doesn't solve the problem for me.
>
> I'm trying to get some sensible information but have real problems with debugging, as I'm seeing line number information that is just plain wrong, though compiling with debugging on and everything, the likes of:
>
> (gdb) info source
> Current source file is src/lxml/etree.c
> Compilation directory is /home/lb54320/pydev/LXML/lxml/
> Located in /home/lb54320/pydev/LXML/lxml/src/lxml/etree.c
> Contains 90795 lines.
> Source language is c.
> Compiled with stabs debugging format.
> (gdb) b etree.c:70850
> No line 70850 in file "src/lxml/etree.c".
> (gdb)
Never seen that before. I assume you did a clean build before that? Maybe gdb
doesn't get along with the source line references in the comments of the
generated C file?
> So the info on the crash does not get much better than that backtrace at the moment:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0xff0b3218 in strlen () from /usr/lib/libc.so.1
> (gdb) bt
> #0 0xff0b3218 in strlen () from /usr/lib/libc.so.1
> #1 0xff106530 in _doprnt () from /usr/lib/libc.so.1
> #2 0xff108730 in vsnprintf () from /usr/lib/libc.so.1
> #3 0xfe23df04 in __xmlRaiseError () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
> #4 0xfe3e717c in xmlSchematronPErr () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
> #5 0xfe3e9878 in xmlSchematronParse () from /apps/pydev/debug/dmalloc/lib//libxml2.so.2
> #6 0xfe68dfdc in __pyx_f_5etree_10Schematron___init__ (__pyx_v_self=0x1b30f0,
> __pyx_args=0x1db670, __pyx_kwds=0x0) at src/lxml/etree.c:5663
>
>
> What I can see, though, is that using the same schematron schema with xmllint does not crash:
> 0 $ cat invalid_empty.xst
>
>
> 0 $ python2.4 -i -c 'from lxml import etree; print etree.LIBXML_VERSION; schema = etree.Schematron(etree.parse("invalid_empty.xst"))'
> (2, 6, 30)
> Segmentation Fault (core dumped)
>
> whereas
>
> $ /apps/pydev/bin/xmllint --schematron invalid_empty.xst foo.xml --version
> /apps/pydev/bin/xmllint: using libxml version 20630
> compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
> invalid_empty.xst:1: element schema: Schemas parser error : The schematron document 'invalid_empty.xst' has no pattern
> Schematron schema invalid_empty.xst failed to compile
>
>
xmllint has a different error reporting setup, that might make the difference.
Anyway, error reporting in Schematron is pretty basic, and I remember working
around that at the time. I'll have to take a deeper look into it when I find
the time.
Stefan
From stefan_ml at behnel.de Sat Oct 6 19:28:47 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 19:28:47 +0200
Subject: [lxml-dev] Tag name validation and HTML
In-Reply-To: <4702AB02.6080300@cam.ac.uk>
References: <46FBA0D3.6010700@cam.ac.uk> <46FBBEBD.7030308@behnel.de>
<4702AB02.6080300@cam.ac.uk>
Message-ID: <4707C5CF.9080102@behnel.de>
Hi,
James Graham wrote:
> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.
Ok, so I understand that HTML tags must be treated differently from XML tags.
> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
>
> import lxml.html
> lxml.html.Element("o:p")
>
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks.
This absolutely makes sense to me. I'll have to look into the details of an
implementation though, since tag name validation is currently done in
lxml.etree.Element, which is simply reused by the Python-implemented
lxml.html. So we'd have to provide some kind of Python-level API for this.
Stefan
From stefan_ml at behnel.de Sat Oct 6 19:33:07 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 19:33:07 +0200
Subject: [lxml-dev] prefix mappings
In-Reply-To: <46FEAF82.60501@antwerpen.be>
References: <46FEAF82.60501@antwerpen.be>
Message-ID: <4707C6D3.1080001@behnel.de>
FnH wrote:
> I would like to generate the following serialization:
>
>
>
>
[...]
> In order to solve this I think it would be a good idea to allow (or take
> into account) prefix mappings on non root nodes as well. The output I'd
> like could then be achieved by the following code snippet:
>
> a = Element("{foo}a", nsmap={None:"foo"})
> a.append(Element("{bar}b", nsmap={None:"bar"}))
>>> from lxml.etree import Element, tostring
>>> a = Element("{foo}a", nsmap={None:"foo"})
>>> a.append(Element("{bar}b", nsmap={None:"bar"}))
>>> print tostring(a, pretty_print=True)
This is on lxml 2.0 alpha, lxml 1.3 should work alike.
Stefan
From stefan_ml at behnel.de Sat Oct 6 19:39:02 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 19:39:02 +0200
Subject: [lxml-dev] Namespace serialization patch
In-Reply-To: <9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com>
References: <9eebf5740710030631u6b8af7f0y8ce10d6f91252b8d@mail.gmail.com> <4703DF60.4040108@bruun-olsen.net>
<9eebf5740710031247q63751d59v9eb8c2c3e3c1cf22@mail.gmail.com>
Message-ID: <4707C836.8070307@behnel.de>
Lawrence Oluyede wrote:
>> Building LXML from SVN is really rather straightforward and of course
>> includes the fixes for that particular problem as well as others.
>> See the download page for instructions on building from SVN.
>
> I, personally, don't have a problem with that but AFAIK at work using
> the SVN version is a lesser option than using the 2.0alpha.
There will be a new release soon, but I can't currently tell when exactly.
However, the patch in the current 1.3 branch (which reflects the stable 1.3
series) will definitely go in there, so you should be just fine with using an
unofficial branch build for now, or a patched 1.3 build (which would currently
be exactly the same anyway).
Stefan
From stefan_ml at behnel.de Sat Oct 6 22:21:28 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Oct 2007 22:21:28 +0200
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
Message-ID: <4707EE48.7000704@behnel.de>
Hi,
as expected, I cannot reproduce your problem on Linux.
Roberto Carrasco wrote:
> We have an issue with libxml library running on Windows.
> We are trying to read an xml document from a string over and over but the
> program crashes in the while loop. We suspect the problem is that we cannot
> run the function etree.parse too many times when we are reading an XML
> document from a string.
lxml.etree actually optimises parsing from a StringIO object into parsing via
fromstring() - or rather its internal implementation. So I can't see how this
would make a difference.
> We are trying to execute the piece of code shown below in this environment:
>
> - Windows XP Service Pack 2
> - Python 2.5
> - lxml 1.3.4 and 2.0 alpha 3
You are using the pre-built binaries from PyPI, right? I'm not currently sure
which version of libxml2 they use, but should be 2.6.28 or later.
> The code crashes when the program reads an XML document repeatedly.
> The issue is on Windows because in a Linux environment there is no problem
> executing it.
>
> The question is: what are we doing wrong? Or is this a problem with the
> library running on Windows?
>
> # -*- coding: UTF-8 -*-
> from lxml import etree
> from StringIO import StringIO
>
> if __name__ == "__main__":
>
>     document="""
> 1-3
> 2006-03-13
> 08:44:52
> SANTIAGO
> PUENTE
> type="string">13/03/2006
> type="string">Robertin
> MANZANO
>
> 2006-03-10
> 15:52:29
>
>
>
> """
>
>     j=0
>     while 1:
>         print j
>         j+=1
>
>         #tree = etree.parse(StringIO(docRauco0))
>         tree = etree.fromstring(document)
>         images_url = tree.xpath('//link[@rel="media"][@href]')
>         image_url_name=images_url[0].attrib['href']
Just to mention it, you could simplify this to
images_url_names = tree.xpath('//link[@rel="media"]/@href')
Regarding your problem - instead of this line:
image_url_name=images_url[0].attrib['href']
could you try this instead, to see if it still crashes:
image_url_name=images_url[0].get('href')
Apart from that, I would need some debugging information to understand what's
happening here. While there are differences between the behaviour of libxml2
under Linux and Windows, I don't currently see any that could cause the above
code to fail.
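For anyone who wants to try the suggestion in isolation, here is a minimal sketch of the parse-and-read loop. The sample document and the function names are made up for illustration (the original document did not survive in the archive), the loop is bounded rather than infinite, and the code falls back to the standard library's ElementTree when lxml is not installed, which is why it uses findall() rather than xpath():

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API, which is compatible here
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the document from the report.
SAMPLE = (
    '<feed>'
    '<link rel="media" href="http://example.org/a.png"/>'
    '<link rel="self" href="http://example.org/feed"/>'
    '<link rel="media"/>'
    '</feed>'
)


def media_hrefs(xml_text):
    """Parse xml_text and return the href of every link with rel="media"."""
    root = etree.fromstring(xml_text)
    hrefs = []
    for link in root.findall('.//link[@rel="media"]'):
        href = link.get("href")  # returns None instead of raising KeyError
        if href is not None:
            hrefs.append(href)
    return hrefs


def parse_repeatedly(xml_text, rounds=1000):
    """Re-parse the same string many times, as the reported loop does."""
    result = None
    for _ in range(rounds):
        result = media_hrefs(xml_text)
    return result
```

Using get() rather than attrib[...] also means a link without an href is skipped instead of raising an exception.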
Stefan
From stefan_ml at behnel.de Sun Oct 7 07:14:33 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 07 Oct 2007 07:14:33 +0200
Subject: [lxml-dev] lxml 2.0alpha4 released
Message-ID: <47086B39.4030802@behnel.de>
Hi all,
I just released a 4th alpha version of lxml 2.0 to PyPI. It hopefully sets an
end to the tag name validation problems by distinguishing between HTML tags
and XML tags based on the associated parser, i.e. either the one that parsed
it or the one that created the element through its "makeelement" method. Note
that the Element factory of lxml.etree uses the XMLParser by default, while
the factory in lxml.html uses the HTMLParser, and thus allows HTML tag names.
Everyone who bumped into and/or reported problems with this, please verify
that this provides a viable solution to you.
Have fun,
Stefan
2.0alpha4 (2007-10-07)
Features added
Bugs fixed
* AttributeError in feed parser on parse errors
Other changes
* Tag name validation in lxml.etree (and lxml.html) now distinguishes
between HTML tags and XML tags based on the parser that was used to
parse or create them. HTML tags no longer reject any non-ASCII
characters in tag names but only spaces and the special characters
<>&/'"
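The relaxed HTML rule from the last changelog entry can be pictured with a small pure-Python sketch. This is not lxml's implementation: the function name is hypothetical, and rejecting empty names and all whitespace (not just the space character) are assumptions made for the illustration:

```python
# Hypothetical illustration of the relaxed HTML tag-name rule; not lxml code.
FORBIDDEN_HTML_TAG_CHARS = set("<>&/'\"")


def looks_like_valid_html_tag(name):
    """Return True if name is non-empty, has no whitespace and none of <>&/'"."""
    if not name:
        # Assumption: an empty tag name is never acceptable.
        return False
    return not any(
        ch.isspace() or ch in FORBIDDEN_HTML_TAG_CHARS for ch in name
    )
```

Under this rule a colon or a non-ASCII letter in a tag name is accepted, while a space or any of the listed special characters is not.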
From Michael.Pechal at silabs.com Sun Oct 7 21:30:34 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Sun, 7 Oct 2007 14:30:34 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
Message-ID: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
Hello,
I am new to XML and I have found lxml.objectify to be very useful. I am
using XML to store register settings. I use the register mnemonic as
the tag. I use custom attributes to store additional information, such
as address, description, apply, etc. I am also using the _pytype and
_xsi attributes. I am using the binary install of lxml 1.3.4 on WinXP
running Python 2.5.1.
My main problem is that assigning a new value to an
objectify.DataElement destroys the existing attribute list. My current
workaround is to retrieve the attributes with the items() call, assign
the new value, and then reapply attributes with set() method for each
pair in the items dict. I dug through the API documentation and I did
not see a way around this issue. Am I missing something here?
I thought about subclassing DataElement and then I scanned the SVN
development change list. I saw some discussion about preserving _pytype
or _xsi attributes, but does this include ALL attributes? If so, I will
proceed with a build from the latest SVN copy. How stable are dev
versions? Are there automated acceptance tests (unittest) that gate the
check in? I may just use my workaround until 1.3.5 arrives.
Another issue I noticed is that if I specify _xsi='int', the _pytype
attribute will be 'long' instead of 'integer', so I am forced to use
_pytype='integer' for all integer data elements. Also, if you run
objectify.annotate(), the integer becomes a long type again. Annotate
should look to the _xsi or even pyval type. Has this been fixed? This
is not really an issue for me, since I always keep the list annotated.
The objectify API documentation was helpful. As a new user, I had a few
problems with save and retrieve from file. I would suggest updating the
objectify API document to provide a full example of saving to and
loading from a file. I have provided a test case from my unittest code
below that may be useful for the documentation:
# ------------------------------------------------------------------------
def testFileSaveAndLoad(self):
    """ Save to XML file, then reload and compare data. """
    # note the self.objRoot is created in the setUp() method
    tofile = etree.tostring(self.objRoot, pretty_print=True)
    xmlFH = open('test.xml', 'w')
    xmlFH.write(tofile)
    xmlFH.close()

    parser = etree.XMLParser(remove_blank_text=True)
    lookup = objectify.ObjectifyElementClassLookup()
    parser.setElementClassLookup(lookup)
    tree = etree.parse('test.xml', parser)
    root = tree.getroot()  # crucial step, as parse() doesn't return the root

    fromfile = etree.tostring(root, pretty_print=True)
    self.assertEqual(tofile, fromfile)
Also on the documentation front, there is a failure with help() on the
objectify module:
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import objectify
>>> from lxml import etree
>>> help(objectify)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python25\lib\site.py", line 346, in __call__
return pydoc.help(*args, **kwds)
File "C:\Python25\lib\pydoc.py", line 1645, in __call__
self.help(request)
File "C:\Python25\lib\pydoc.py", line 1689, in help
else: doc(request, 'Help on %s:')
File "C:\Python25\lib\pydoc.py", line 1481, in doc
pager(title % desc + '\n\n' + text.document(object, name))
File "C:\Python25\lib\pydoc.py", line 324, in document
if inspect.ismodule(object): return self.docmodule(*args)
File "C:\Python25\lib\pydoc.py", line 1070, in docmodule
inspect.getclasstree(classlist, 1), name)]
File "C:\Python25\lib\inspect.py", line 656, in getclasstree
for parent in c.__bases__:
TypeError: 'functools.partial' object is not iterable
>>> objectify
Note that help(etree) works fine.
Thanks,
Michael
This email and any attachments thereto may contain private, confidential,
and privileged material for the sole use of the intended recipient. Any
review, copying, or distribution of this email (or any attachments thereto)
by others is strictly prohibited. If you are not the intended recipient,
please contact the sender immediately and permanently delete the original
and any copies of this email and any attachments thereto.
From stefan_ml at behnel.de Mon Oct 8 09:47:07 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 09:47:07 +0200
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
In-Reply-To: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
Message-ID: <4709E07B.5000201@behnel.de>
Hi,
thanks for sharing your impressions.
Michael Pechal wrote:
> I am new to XML and I have found lxml.objectify to be very useful. I am
> using XML to store register settings. I use the register mnemonic as
> the tag. I use custom attributes to store additional information, such
> as address, description, apply, etc.
This sounds a bit like attribute misuse. If you are not tied to a specific XML
language, consider storing this information in the XML structure rather than
XML attributes, just as you would in a Python object.
> I am also using the _pytype and
> _xsi attributes. I am using the binary install of lxml 1.3.4 on WinXP
> running Python 2.5.1.
>
> My main problem is that assigning a new value to an
> objectify.DataElement destroys the existing attribute list.
That's intentional. It's like when you assign a Python value to an object
attribute. The old value will be lost in that case, including all of its own
attributes. Note that I'm not talking about XML at all here, this is plain
Python object behaviour, which is what objectify mimics.
> I thought about subclassing DataElement
DataElement is not a class, it's a factory function. So you can write a
wrapper but you cannot subclass it.
> and then I scanned the SVN
> development change list. I saw some discussion about preserving _pytype
> or _xsi attributes, but does this include ALL attributes?
No, these two are special (or rather their namespaced XML attributes).
> If so, I will
> proceed with a build from the latest SVN copy. How stable are dev
> versions?
There are currently two actively maintained branches: the 1.3 branch for the
stable 1.3 series (basically, everything that gets committed here will be in a
future 1.3.x release), and the current trunk for the future 2.0 series, which
is currently in alpha status. This means: some functionality and some APIs
are not stable yet and there may still be incompatible changes to come if
their value for lxml 2.0 is considered high enough to break current code.
The 2.0 web pages are also online:
http://codespeak.net/lxml/dev/
> Are there automated acceptance tests (unittest) that gate the
> check in?
Sure, check out the test suite that comes with the source distribution. It's
pretty extensive by now.
http://codespeak.net/lxml/build.html#running-the-tests-and-reporting-errors
There is also a benchmarking suite that might be of interest to you.
http://codespeak.net/lxml/performance.html
> I may just use my workaround until 1.3.5 arrives.
No 1.3.x release will ever change the above behaviour, and it won't change for
2.0 either.
> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements.
You're mixing names here, so I'm not quite sure what exactly you are doing.
Make sure you are distinguishing between Python type names and XSI type names
in your code. In general, XSI types are more accurate, so you might want to
prefer them.
The Python type "long" maps to the XSI type "integer" and various other XSI
types. Only the small XSI integer types like "int" or "short" map to a Python int.
> Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.
This has been changed in 2.0, which uses annotations quite a bit more
naturally. The current behaviour in 1.3 will not change, unless it's
considered a bug that should be fixed.
> The objectify API documentation was helpful. As a new user, I had a few
> problems with save and retrieve from file.
That's because this is done through lxml.etree rather than objectify directly.
You will almost certainly need both to work with objectify anyway, so it's
worth skimming through the lxml.etree tutorial.
That said, the objectify documentation is starting to get rather lengthy.
Maybe we should start focussing it a bit, also towards users that do not know
lxml.etree or ElementTree before looking at objectify.
> I would suggest updating the
> objectify API document to provide a full example of saving to and
> loading from a file. I have provided a test case from my unittest code
> below that may be useful for the documentation:
Thanks. This is a question of duplicating documentation versus making it
easily accessible. We also use our documentation as doctests, where accessing
files is not as straightforward as it should look.
> Also on the documentation front, there is a failure with help() on the
> objectify module. :
>
> >>> help(objectify)
>
> Traceback (most recent call last):
> TypeError: 'functools.partial' object is not iterable
>
> Note that help(etree) works fine.
Ah, I wasn't aware of that. However, it seems to be more of a problem in
help() itself rather than objectify. I'll have to investigate this one day or
another...
Stefan
From jholg at gmx.de Mon Oct 8 09:59:01 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 08 Oct 2007 09:59:01 +0200
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
In-Reply-To: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
Message-ID: <20071008075901.13870@gmx.net>
Hi Michael,
> My main problem is that assigning a new value to an
> objectify.DataElement destroys the existing attribute list. My current
> workaround is to retrieve the attributes with the items() call, assign
> the new value, and then reapply attributes with set() method for each
> pair in the items dict. I dug through the API documentation and I did
> not see a way around this issue. Am I missing something here?
objectify maps element data to simple python builtins, so it treats them as immutable, i.e. you cannot modify element.text. This is intentional and not going to change (I hope :)
With 2.0alpha, you could however do this to keep your attributes:
>>> msg.a = 12345
>>> msg.a.set("foo", "bar")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 12345 [IntElement]
      * py:pytype = 'int'
      * foo = 'bar'
>>> msg.a = objectify.DataElement("changeMe", attrib=dict(msg.a.attrib))
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 'changeMe' [StringElement]
      * py:pytype = 'str'
      * foo = 'bar'
>>>
Note that the foo attribute remains intact, whilst the py:pytype gets corrected to something that fits the element value.
> I thought about subclassing DataElement and then I scanned the SVN
> development change list. I saw some discussion about preserving _pytype
> or _xsi attributes, but does this include ALL attributes? If so, I will
> proceed with a build from the latest SVN copy. How stable are dev
> versions? Are there automated acceptance tests (unittest) that gate the
> check in? I may just use my workaround until 1.3.5 arrives.
Generally I'd say the dev versions are still very stable with regard to robustness, but of course feature-wise they can be in flux.
> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements. Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.
Please try 2.0alpha, the behaviour with regard to py:pytype/xsi:type has been tweaked a little, and some parts now behave more "natural". There's also now a triplet of annotation functions (pyannotate, xsiannotate, annotate) that give you fine-grained control of annotation. The most prominent change is that you get auto-pytypification now:
>>> msg.a = 999
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 999 [IntElement]
      * py:pytype = 'int'
>>>
Here, the actual Python type of the RVAL of the assignment gets taken into account now.
Regarding your example, explicitly setting _xsi="int" gives
>>> msg.a = objectify.DataElement(8, _xsi="int")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 8 [IntElement]
      * py:pytype = 'int'
      * xsi:type = 'xsd:int'
>>>
I do think it's just the same with 1.3, so I think you might have mixed something up here.
However, specifying _xsi="integer" will result in:
>>> msg.a = objectify.DataElement(8, _xsi="integer")
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    a = 8L [LongElement]
      * py:pytype = 'long'
      * xsi:type = 'xsd:integer'
>>>
This is due to the XML Schema type system, where an XML Schema integer is
not restricted to 32 bits, like a Python int (still is), see
http://www.w3.org/TR/xmlschema-2/
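To make the size argument concrete, here is a small sketch based on the value ranges that the XML Schema datatypes spec gives for a few integer types. The helper name and the choice of candidate types are assumptions for illustration; this is not how objectify picks a type:

```python
# Value ranges of some bounded XML Schema integer types.
# xsd:integer itself has no upper or lower bound.
XSD_RANGES = [
    ("short", -2**15, 2**15 - 1),
    ("int", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_xsd_integer_type(value):
    """Return the narrowest listed XSD type name that can hold value."""
    for name, low, high in XSD_RANGES:
        if low <= value <= high:
            return name
    # Anything larger only fits the unbounded xsd:integer.
    return "integer"
```

A value declared as xsd:integer may therefore exceed any fixed-width integer, which is why it is mapped to the unbounded Python long.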
Holger
From faassen at startifact.com Mon Oct 8 14:47:35 2007
From: faassen at startifact.com (Martijn Faassen)
Date: Mon, 08 Oct 2007 14:47:35 +0200
Subject: [lxml-dev] did windows binary versions ever get removed from the
cheeseshop?
Message-ID:
Hi there,
To start off, I'm not 100% sure on this, so I'm just checking.
I thought at some stage I had a working windows installation of my
software that was using lxml 1.3 binaries (not 1.3.1 or something, just
1.3). When I tried again today it didn't work anymore, and instead had
to start using lxml 1.3.4 (for instance).
Is it possible that someone for some reason removed the versions for 1.3
from the cheeseshop? If so, my general recommendation would be never to
do this, even if the packages are broken somehow. A release is a
release, and people might be depending on it. For this reason, never
remove release files, and also never overwrite release files.
Anyway, I'm not at all sure this actually happened with lxml, but I'm
just writing this to make sure it won't happen. :)
Regards,
Martijn
From stefan_ml at behnel.de Mon Oct 8 19:51:38 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 19:51:38 +0200
Subject: [lxml-dev] did windows binary versions ever get removed from
the cheeseshop?
In-Reply-To:
References:
Message-ID: <470A6E2A.8050602@behnel.de>
Martijn Faassen wrote:
> To start off, I'm not 100% sure on this, so I'm just checking.
>
> I thought at some stage I had a working windows installation of my
> software that was using lxml 1.3 binaries (not 1.3.1 or something, just
> 1.3). When I tried again today it didn't work anymore, and instead had
> to start using lxml 1.3.4 (for instance).
>
> Is it possible that someone for some reason removed the versions for 1.3
> from the cheeseshop? If so, my general recommendation would be never to
> do this, even if the packages are broken somehow. A release is a
> release, and people might be depending on it. For this reason, never
> remove release files, and also never overwrite release files.
>
> Anyway, I'm not at all sure this actually happened with lxml, but I'm
> just writing this to make sure it won't happen. :)
Thanks for the warning. However, I didn't remove anything myself and I
wouldn't know why Sidnei should have. I'm not sure but I have a feeling that
we never had any Windows binaries for 1.3...
Anyway, I agree that releases should stay where they were uploaded. There are
always reasons why you would want to go back to or compare/test with older
versions. Note that the "Index of Packages" even lists them all. I actually do
that by hand after each release - distutils/PyPI doesn't seem to have a way to
say: "don't hide other releases after an upload".
Stefan
From Michael.Pechal at silabs.com Mon Oct 8 21:37:31 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Mon, 8 Oct 2007 14:37:31 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
In-Reply-To: <4709E07B.5000201@behnel.de>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
<4709E07B.5000201@behnel.de>
Message-ID: <6DD7584058DDB24193C9049940B45FFB018885E0@EXCAUS001.silabs.com>
Stefan,
Thank you for your prompt reply. I found the Epydoc URL
(http://codespeak.net/lxml/dev/api/lxml.objectify-module.html) which
provided more information. I did not see a direct link to here from the
lxml page or perhaps I missed it?
I have provided a few responses below.
Regards,
Michael
-----Original Message-----
From: lxml-dev-bounces at codespeak.net
[mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Stefan Behnel
Sent: Monday, October 08, 2007 2:47 AM
To: Michael Pechal
Cc: lxml-dev at codespeak.net
Subject: Re: [lxml-dev] Issues with objectify.ObjectifiedElement:
assignment, attribute handling, and documentation
Hi,
thanks for sharing your impressions.
Michael Pechal wrote:
> I am new to XML and I have found lxml.objectify to be very useful. I am
> using XML to store register settings. I use the register mnemonic as
> the tag. I use custom attributes to store additional information, such
> as address, description, apply, etc.
This sounds a bit like attribute misuse. If you are not tied to a specific XML
language, consider storing this information in the XML structure rather than
XML attributes, just as you would in a Python object.
[MP] I suppose I was being a little lazy here. I am retrofitting an
existing cPickle implementation with custom data classes. Only a few
parts of the data model will require the attributes, so it won't be too
painful to create sub-elements for what I am now trying to store as
attributes.
> I thought about subclassing DataElement
DataElement is not a class, it's a factory function. So you can write a
wrapper but you cannot subclass it.
[MP] Thanks for the clarification. I should have stated
"objectify.IntElement".
> Are there automated acceptance tests (unittest) that gate the
> check in?
Sure, check out the test suite that comes with the source distribution. It's
pretty extensive by now.
http://codespeak.net/lxml/build.html#running-the-tests-and-reporting-errors
There is also a benchmarking suite that might be of interest to you.
http://codespeak.net/lxml/performance.html
[MP] Very nice!
> Another issue I noticed is that if I specify _xsi='int', the _pytype
> attribute will be 'long' instead of 'integer', so I am forced to use
> _pytype='integer' for all integer data elements.
You're mixing names here, so I'm not quite sure what exactly you are doing.
Make sure you are distinguishing between Python type names and XSI type names
in your code. In general, XSI types are more accurate, so you might want to
prefer them.
The Python type "long" maps to the XSI type "integer" and various other XSI
types. Only the small XSI integer types like "int" or "short" map to a Python int.
[MP] Ah, I see. I misunderstood the various formats for the _xsi
attribute. I should have used _xsi='int' or 'short'. Thanks for the
clarification.
>>> e = objectify.DataElement(1, _xsi='integer')
>>> type(e.pyval)
<type 'long'>
>>> e = objectify.DataElement(1, _xsi='int')
>>> type(e.pyval)
<type 'int'>
>>> e = objectify.DataElement(1, _xsi='short')
>>> type(e.pyval)
<type 'int'>
> Also, if you run
> objectify.annotate(), the integer becomes a long type again. Annotate
> should look to the _xsi or even pyval type. Has this been fixed? This
> is not really an issue for me, since I always keep the list annotated.
This has been changed in 2.0, which uses annotations quite a bit more
naturally. The current behaviour in 1.3 will not change, unless it's
considered a bug that should be fixed.
[MP] With the correct _xsi attribute, annotate() correctly restores the
pytype attribute to 'int'.
>>> e = objectify.DataElement(1, _xsi='int')
>>> e.items()
[('{http://codespeak.net/lxml/objectify/pytype}pytype', 'int'),
('{http://www.w3.org/2001/XMLSchema-instance}type', 'short')]
>>> objectify.deannotate(e)
>>> e.items()
[]
>>> objectify.annotate(e)
>>> e.items()
[('{http://codespeak.net/lxml/objectify/pytype}pytype', 'int')]
> I would suggest updating the
> objectify API document to provide a full example of saving to and
> loading from a file. I have provided a test case from my unittest code
> below that may be useful for the documentation:
Thanks. This is a question of duplicating documentation versus making it
easily accessible. We also use our documentation as doctests, where accessing
files is not as straightforward as it should look.
[MP] I would just clarify that the etree.parse() call returns an
etree._ElementTree object, while the getroot() call returns an
objectify.ObjectifiedElement for direct use. This took about ten
minutes to uncover, so it was not a huge problem. I just like to
jump in and run with examples to see how far I can get before I have to
roll up my sleeves and review the details of a new module. :) I will
spend some time reviewing the etree documentation as well.
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> lookup = objectify.ObjectifyElementClassLookup()
>>> parser.setElementClassLookup(lookup)
>>> tree = etree.parse('codec.xml', parser)
>>> type(tree)
>>> tree.reg_config
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'etree._ElementTree' object has no attribute 'reg_config'
>>> root = tree.getroot()
>>> type(root)
>>> root.reg_config
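The tree-versus-root distinction can be reproduced in a few self-contained lines. This is only a sketch: it uses plain etree elements rather than objectify (with objectify, the lookup would be installed on the parser as in the session above), the helper name is made up, and it falls back to the standard library's ElementTree if lxml is not installed:

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is compatible for this sketch
    import xml.etree.ElementTree as etree


def save_and_reload(root):
    """Write root to a temporary XML file, parse it back, return the new root."""
    data = etree.tostring(root)  # serialised XML
    handle, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(handle, "wb") as xml_file:
            xml_file.write(data)
        tree = etree.parse(path)  # an ElementTree wrapper, not an element
        return tree.getroot()     # the root element itself
    finally:
        os.remove(path)


# Hypothetical data loosely modelled on the register settings in this thread.
codec = etree.Element("codec")
etree.SubElement(codec, "reg_config").text = "1"
reloaded = save_and_reload(codec)
```

Child access such as reloaded.find("reg_config") works on the element returned by getroot(), not on the tree object returned by parse().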
From stefan_ml at behnel.de Mon Oct 8 21:58:09 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 21:58:09 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071003165016.104d2caf@bhuda.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org>
Message-ID: <470A8BD1.6060305@behnel.de>
Hi,
sorry for the late reply, I was on vacation last week and am just catching up
with my e-mail.
Mike Meyer wrote:
> I'm getting crashes - by which I mean the python process is
> segfaulting and, with some tweaking of GNU/Linux, leaving me a core
> file - while using lxml to parse data.
>
> Versions:
>
> OS: RHEL 5
> Python: 2.5.1 (custom built).
> lxml: 1.3.3
> libxml: 2.6.26 (both compiled and built)
> libxslt: 1.1.17
>
> Yes, I know those are a bit out of date
They should work, though.
> Rebuilding python with OPTS=-g (I set that for the lxml build as
> well), I can get a "where" output that points at lxml:
>
>
> #0 0x00002aaaaf906c3a in rename ()
> from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #1 0x00002aaaaf906be7 in rename ()
> from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #2 0x00002aaaaf8ebdfe in rename ()
> from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #3 0x00002aaaaf966a5c in findOrBuildNodeNs ()
> from /usr/local/lib/python2.5/site-packages/lxml/etree.so
>
> The first problem is that this isn't repeatable. I've got test data
> that will make it happen, but I have to feed that data through the
> system a few thousand times. This is part of a database ETL system,
> parsing data from the XML to load into the database. If I feed it the
> exact same data over and over again, it'll work 9999 times out of ten
> thousand - but then fail that ten thousandth time with a segfault.
Are those the real numbers? The 10000, I mean? That would explain a *lot*.
lxml.etree currently has a hard limit for namespace prefix generation (the
"nsXX" bit), which happens to be (an arbitrary) 10000 *per document*.
Admittedly, the resulting behaviour is far from robust and you seem to have
triggered a case where this number matters. I attached a patch (against the
trunk) that switches the counter to a Python long instead, which is only bound
by available memory.
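The idea of an unbounded prefix counter can be sketched in pure Python. This is neither the attached patch nor lxml's code, just an illustration with a hypothetical helper name; it only assumes the "nsXX" naming mentioned above:

```python
import itertools


def make_prefix_generator(used_prefixes=()):
    """Yield 'ns0', 'ns1', ... without an upper limit, skipping used prefixes."""
    used = set(used_prefixes)
    # itertools.count() never stops, and Python integers do not overflow.
    for number in itertools.count():
        prefix = "ns%d" % number
        if prefix not in used:
            used.add(prefix)
            yield prefix
```

With such a generator the 10001st prefix is simply "ns10000", so no fixed per-document ceiling is involved.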
> The document is straightforward: it starts with a meta element with a
> set of attributes, and then has a lot of data elements, all the same
> type, all with the same attributes (give or take an optional one), and
> I just use document.xpath to find the elements, and then read off
> their attribute values to save to a database load file.
>
> Hints on how to proceed - setting things up so I can use gdb on the
> lxml sources, for instance - would be greatly appreciated.
A way to work around this is to *not* reuse documents. You mention a "meta
element", so I guess you use a single document and keep adding namespaced
elements to it. That lets the counter overflow, as the namespaces must declare
and adapt their prefixes when being added to an existing document. You can
print the "prefix" attribute of elements to see how the numbers go up. I don't
know your code, so I can't be more specific. Please ask back if you need any
further hints on what you can do to avoid this in general.
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: python-namespace-prefix-counter.patch
Type: text/x-diff
Size: 2622 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20071008/babff101/attachment-0001.bin
From Michael.Pechal at silabs.com Mon Oct 8 22:55:33 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Mon, 8 Oct 2007 15:55:33 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
In-Reply-To: <20071008075901.13870@gmx.net>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com>
<20071008075901.13870@gmx.net>
Message-ID: <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com>
Holger,
> Note that the foo attribute remains intact, whilst the py:pytype gets
> corrected to s.th. that fits the element value.
[MP] Thanks for your suggestion. I will follow Stefan's advice and
create separate elements versus using the attribute list. Thus, I would
have the following structure in my codec.xml file:
codec (TREE)
reg_config (TREE)
filter_type (TREE)
value (int)
desc (str)
apply (bool)
default (int)
This should be faster and cleaner, as I am not abusing the attribute
list. I also don't have to worry about attribute type conversions,
since all attributes are strings. I can access the pyval property, so
no conversion is required in my data handler accessor methods.
Phase one involves translating everything into XML (currently custom
data classes with cPickling). Phase two entails developing an XSD file
to validate the XML. I imagine the schema validation will be cleaner
with the separate elements versus the bloated attribute list.
I have a better understanding of the spirit of lxml, but I have much to
learn regarding XML and XSLT in general. I will perform more background
reading before troubling this list again with basic questions. :)
> Regarding your example, explicitly setting _xsi="int" gives
> >>> msg.a = objectify.DataElement(8, _xsi="int")
> >>> print objectify.dump(msg)
> msg = None [ObjectifiedElement]
> a = 8 [IntElement]
> * py:pytype = 'int'
> * xsi:type = 'xsd:int'
> >>>
[MP] Thanks for the clarification. I need to use _xsi="int". I will
review the XML schema link that you provided.
Regards,
Michael
This email and any attachments thereto may contain private, confidential,
and privileged material for the sole use of the intended recipient. Any
review, copying, or distribution of this email (or any attachments thereto)
by others is strictly prohibited. If you are not the intended recipient,
please contact the sender immediately and permanently delete the original
and any copies of this email and any attachments thereto.
From stefan_ml at behnel.de Mon Oct 8 23:13:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Oct 2007 23:13:22 +0200
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
In-Reply-To: <6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com>
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <20071008075901.13870@gmx.net>
<6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com>
Message-ID: <470A9D72.4060407@behnel.de>
Michael Pechal wrote:
> Phase one involves translating everything into XML (currently custom
> data classes with cPickling).
Doesn't tostring() or ElementTree(root).write() do what you want? I don't see
why you would go through pickling here...
http://effbot.org/elementtree/elementtree-elementtree.htm
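As a rough sketch of both calls: the element names below follow the structure quoted earlier in the thread, the values are invented, and the fallback import is only there so the snippet also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls
    import xml.etree.ElementTree as etree

# Hypothetical data mirroring the codec/reg_config/filter_type structure.
codec = etree.Element("codec")
reg_config = etree.SubElement(codec, "reg_config")
filter_type = etree.SubElement(reg_config, "filter_type")
etree.SubElement(filter_type, "value").text = "3"
etree.SubElement(filter_type, "desc").text = "low pass"

# Serialise the tree to a byte string ...
xml_bytes = etree.tostring(codec)

# ... or write it to a file-like object (a file name works as well).
buffer = io.BytesIO()
etree.ElementTree(codec).write(buffer)

# Reading the written data back yields the same values.
restored = etree.fromstring(buffer.getvalue())
print(restored.findtext("reg_config/filter_type/desc"))
```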
> Phase two entails developing an XSD file to validate the XML.
Unless you are very firm with XML Schema and/or have good tool support, I
generally suggest writing a RelaxNG schema instead (preferably in the "compact
syntax" aka RNC), which is easy to write, read and understand and is well
supported by lxml/libxml2. It also supports the XSD datatypes and can be
translated into an XML Schema via tools like trang.
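To give an idea of what validation looks like from Python, here is a small sketch. The schema is a made-up fragment for a single "filter_type" element, written in the RelaxNG XML syntax rather than RNC, and it uses the XSD datatypes mentioned above.

```python
from lxml import etree

# Hypothetical RelaxNG schema (XML syntax) using XSD datatypes.
SCHEMA = """
<element name="filter_type"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="value"><data type="int"/></element>
  <element name="desc"><text/></element>
  <element name="apply"><data type="boolean"/></element>
</element>
"""

relaxng = etree.RelaxNG(etree.fromstring(SCHEMA))

good = etree.fromstring(
    "<filter_type><value>3</value><desc>low pass</desc>"
    "<apply>true</apply></filter_type>")
bad = etree.fromstring(
    "<filter_type><value>three</value><desc>low pass</desc>"
    "<apply>true</apply></filter_type>")

# validate() returns True or False; details end up in relaxng.error_log.
good_ok = relaxng.validate(good)
bad_ok = relaxng.validate(bad)
print(good_ok, bad_ok)
```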
Stefan
From agustin.villena at gmail.com Mon Oct 8 23:42:13 2007
From: agustin.villena at gmail.com (=?ISO-8859-1?Q?Agust=EDn_Villena?=)
Date: Mon, 08 Oct 2007 17:42:13 -0400
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To: <4707EE48.7000704@behnel.de>
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com>
<4707EE48.7000704@behnel.de>
Message-ID:
Hi!
I tested a simplified version of the code (attached to this post) on 2 versions of
Windows, with different results:
Python version:
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
lxml version:
LIBXML_COMPILED_VERSION: (2, 6, 28)
LIBXML_VERSION: (2, 6, 28)
LIBXSLT_COMPILED_VERSION: (1, 1, 19)
LIBXSLT_VERSION: (1, 1, 19)
LXML_VERSION: (1, 3, 4, 0)
For the same version of Python and lxml, the code:
- Does not crash on
Microsoft Windows Vista Ultimate
Version: 6.0.6000 build 6000
- Crashes after 137 iterations on
Microsoft Windows XP Professional
Version: 5.1.2600 Service Pack 2 Build 2600
The generated error signature is:
AppName: python.exe
AppVer: 0.0.0.0
ModName: etree.pyd
ModVer: 0.0.0.0
Offset: 00010c90
Attached to this post is the error report generated for Microsoft after
the crash
Cheers
Agustin
Stefan Behnel wrote:
> Hi,
>
> as expected, I cannot reproduce your problem on Linux.
>
>
> Roberto Carrasco wrote:
>> We have an issue with libxml library running on Windows.
>> We are trying to read an xml document from a string over and over but the
>> program crashes in the while loop. We suspect the problem is that we cannot
>> run the function etree.parse too much times when we are reading a xml
>> document from a string.
>
> lxml.etree actually optimises parsing from a StringIO object into parsing via
> fromstring() - or rather its internal implementation. So I can't see how this
> would make a difference.
>
>
>> We are trying to execute the piece of code shown below in this environment:
>>
>> - Windows XP Service Pack 2
>> - Python 2.5
>> - lxml 1.3.4 and 2.0 alpha 3
>
> You are using the pre-built binaries from PyPI, right? I'm not currently sure
> which version of libxml2 they use, but should be 2.6.28 or later.
>
>
>> The code crashes when the program read a xml document repeatedly.
>> The issue is on Windows becuase on an Linux environment there is no problem
>> excecuting it.
>>
>> The question is: what we are doing wrong? or is this a problem with the
>> library running on Windows?
>>
>> # -*- coding: UTF-8 -*-
>> from lxml import etree
>> from StringIO import StringIO
>>
>> if __name__ == "__main__":
>>
>> document="""
>> 1-3
>> 2006-03-13
>> 08:44:52
>> SANTIAGO
>> PUENTE
>> > type="string">13/03/2006
>> > type="string">Robertin
>> MANZANO
>>
>> 2006-03-10
>> 15:52:29
>>
>>
>>
>> """
>>
>> j=0
>> while 1:
>> print j
>> j+=1
>>
>> #tree = etree.parse(StringIO(docRauco0))
>> tree = etree.fromstring(document)
>> images_url = tree.xpath('//link[@rel="media"][@href]')
>> image_url_name=images_url[0].attrib['href']
>
> Just to mention it, you could simplify this to
>
> images_url_names = tree.xpath('//link[@rel="media"]/@href')
>
>
> Regarding your problem - instead of this line:
>
> image_url_name=images_url[0].attrib['href']
>
> could you try this instead, to see if it still crashes:
>
> image_url_name=images_url[0].get('href')
>
>
> Apart from that, I would need some debugging information to understand what's
> happening here. While there are differences between the behaviour of libxml2
> under Linux and Windows, I don't currently see any that could cause the above
> code to fail.
>
> Stefan
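For reference, the loop with the suggested .get() access can be sketched as follows. The feed document is a made-up stand-in for the original one; only the "link", "rel" and "href" names come from the posted XPath expression. findall() is used instead of xpath() here so that the snippet also runs with the standard library's ElementTree.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the document that is parsed over and over.
DOCUMENT = """<entry>
  <link rel="media" href="http://example.org/photo.jpg"/>
  <link rel="alternate" href="http://example.org/story.html"/>
</entry>"""


def media_urls(xml_text):
    """Parse the string and return the href of every media link."""
    tree = etree.fromstring(xml_text)
    links = tree.findall(".//link[@rel='media']")
    # .get() reads the attribute directly, without going through .attrib
    return [link.get("href") for link in links]


# Parse the same string repeatedly, as the failing loop does.
for _ in range(1000):
    urls = media_urls(DOCUMENT)

print(urls)
```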
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 2551_appcompat.txt
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20071008/225338ee/attachment.txt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lxml_crash_windows.py
Url: http://codespeak.net/pipermail/lxml-dev/attachments/20071008/225338ee/attachment.diff
From Michael.Pechal at silabs.com Tue Oct 9 00:38:15 2007
From: Michael.Pechal at silabs.com (Michael Pechal)
Date: Mon, 8 Oct 2007 17:38:15 -0500
Subject: [lxml-dev] Issues with objectify.ObjectifiedElement: assignment,
attribute handling, and documentation
References: <6DD7584058DDB24193C9049940B45FFB0188858F@EXCAUS001.silabs.com> <20071008075901.13870@gmx.net>
<6DD7584058DDB24193C9049940B45FFB018885F7@EXCAUS001.silabs.com>
<470A9D72.4060407@behnel.de>
Message-ID: <6DD7584058DDB24193C9049940B45FFB0188860E@EXCAUS001.silabs.com>
Stefan,
> Doesn't tostring() or ElementTree(root).write() do what you want? I don't
> see why you would go through pickling here...
> http://effbot.org/elementtree/elementtree-elementtree.htm
They work very well! What I was trying to say is that I currently use
custom python classes that are persisted via cPickle. Phase one
involves replacing the data model with lxml.objectify and all of its
superior power. So, goodbye cPickled data classes and hello
lxml.objectify! In the past, I have leveraged cPickle, ConfigParser, or a
custom parser. I have wanted to leverage XML for some time but the
learning curve is steep. Now lxml.objectify has come to my rescue.
My tool is based on MVC design. I have converted the data model to an
objectified tree and I have a unittest wrapper to exercise the data
model. Before I update the controller methods for tree access, I wanted
to finalize the XML structure. I just need to refactor the data model
and "do it right" with more elements versus hacking the attribute list.
Then, phase one will be complete.
> > Phase two entails developing an XSD file to validate the XML.
> Unless you are very firm with XML Schema and/or have good tool support, I
> generally suggest writing a RelaxNG schema instead (preferably in the
> "compact syntax" aka RNC), which is easy to write, read and understand and
> is well supported by lxml/libxml2. It also supports the XSD datatypes and
> can be translated into an XML Schema via tools like trang.
Thanks for the advice. I will explore RelaxNG schema first. We are out
of licenses for Altova XMLSpy 2007 and it is pricy! I found Editix
(http://www.editix.com/) for $85. It is cross-platform (*nix, OS X and
Windows). The documentation lists Schema Generator (DTD, W3C XML
Schema, XML Relax NG) from XML documents. When I am serious about
schema work, I will try out the shareware version.
Regards,
Michael
From kf9150 at gmail.com Tue Oct 9 01:07:00 2007
From: kf9150 at gmail.com (Kelie)
Date: Mon, 8 Oct 2007 23:07:00 +0000 (UTC)
Subject: [lxml-dev] is there a binary windows installer for lxml 2.0alpha4
release?
Message-ID:
as subject. thanks.
From stefan_ml at behnel.de Tue Oct 9 12:39:19 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 09 Oct 2007 12:39:19 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071008172121.4a6acd8b@bhuda.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de>
<20071008172121.4a6acd8b@bhuda.mired.org>
Message-ID: <470B5A57.4030102@behnel.de>
Hi,
ok, that wasn't the problem then (it's still good to have it fixed, though).
Mike Meyer wrote:
> A master process reads in a a couple of config files, and parses and
> checks them against a schema, and then possibly plugs in some default
> attribute values. It then forks two processes:
>
> 1) Uses http to get xml documents from a remote server. These are the
> ones I described; they have a meta element and then a data element
> containing "row "elements, with the actual values in the attributes
> to "row" elements. This process uses iterparse to pull one value
> from the meta element, and then saves the entire thing to disk.
That's the process that fails, right?
Can you find out if it's the iterparse() or something else that fails here?
Using valgrind is usually a great way to find out what's going wrong. It will
make the run a lot slower, but it should print some helpful infos when it
crashes. Run it like this:
valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
python yourscript.py
preferably only on the process that crashes.
> I.e. - the only documents that gets reused a lot is the schema, which
> are built, passed to RelaxNG, and then used to validate each of those
> thousands of documents.
That's ok.
> architecture makes things a little convoluted, but the basic path is
> something like:
>
> data = urlopen(....)
> try:
> parsed = fromstring(data.read())
parse(data) should do, BTW.
> if not schema.validate(parsed):
> handle_broken_document(parsed=parsed)
> for node in parsed.xpath('Types/Type'):
> d = dict(node.attrib):
> save_for_db(d)
> for node in parsed.xpath('AltTypes/AltType'):
> d = dict(node.attrib):
> save_for_db(d)
> for node in parsed.xpath('MoreTypes/MoreType'):
> d = dict(node.attrib):
> save_for_db(d)
That's pretty straightforward code, I don't see any risk here. But I'm
wondering which of the two processes actually fails now - you're presenting
this one, but from your previous posts I thought it was the other one that crashed.
> I tried turning of the parsing - which pretty much makes everything
> else do nothing but pass around the raw data - and got no failures. I
> also tried turning off just the validation, so that the work is still
> getting done - and got failures.
Hmmmm, are those failures related to validation errors?
Just in case it's the second process that fails (the XPath one), it could be
worth testing if using the XPath() class instead of the xpath() method works
better. That might give us a hint on where the problem comes from. It should
also be faster, BTW.
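A minimal sketch of the difference, assuming documents shaped like the ones in your description (the "data" wrapper and the "id" attribute are invented for the example):

```python
from lxml import etree

# Compile the expression once ...
find_types = etree.XPath("Types/Type")

# Hypothetical input documents.
documents = [
    "<data><Types><Type id='1'/><Type id='2'/></Types></data>",
    "<data><Types><Type id='3'/></Types></data>",
]

ids = []
for text in documents:
    root = etree.fromstring(text)
    # ... and call it like a function on each parsed document.
    for node in find_types(root):
        ids.append(node.get("id"))

print(ids)
```

With root.xpath("Types/Type") the expression would be handed over as a string on every call, whereas the XPath object is prepared only once.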
Stefan
From stefan_ml at behnel.de Tue Oct 9 13:00:05 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 09 Oct 2007 13:00:05 +0200
Subject: [lxml-dev] is there a binary windows installer for lxml
2.0alpha4 release?
In-Reply-To:
References:
Message-ID: <470B5F35.7090601@behnel.de>
Kelie wrote:
> as subject. thanks.
1) According to PyPI: no.
2) According to me: not yet, wait for Sidnei to upload it to PyPI (see 1).
Stefan
From stefan_ml at behnel.de Tue Oct 9 13:02:11 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 09 Oct 2007 13:02:11 +0200
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To:
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de>
Message-ID: <470B5FB3.80708@behnel.de>
Agustín Villena wrote:
> For the same version of python and lxml
>
> - Doesn't crashes in
> Microsoft Windows Vista Ultimate
> Version: 6.0.6000 build 6000
>
> - Crashes after 137 iterations
> Microsoft Windows XP Profesional
> Version: 5.1.2600 Service Pack 2 Build 2600
Hmm, but then I can't see how this is supposed to be a problem with lxml. I
mean, if the only difference is the code that Microsoft puts below the runtime
environment, I would just go and ask Microsoft what they did wrong (or what
they fixed in Vista to make it work).
Stefan
From agustin.villena+gmane at gmail.com Tue Oct 9 13:28:09 2007
From: agustin.villena+gmane at gmail.com (Agustin Villena)
Date: Tue, 09 Oct 2007 07:28:09 -0400
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To: <470B5FB3.80708@behnel.de>
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de>
<470B5FB3.80708@behnel.de>
Message-ID:
I agree.
Nonetheless, WinXP SP2 is too relevant a platform to be ignored :(
And lxml is the fastest XPath alternative for Python.
Do you know whether the Windows build system for the lxml egg is available in
the source tree? It would be a very good base for me to debug and reproduce
the problem.
Does it use pre-compiled libxml libraries, or custom-compiled ones?
Thanks
Agustin
Stefan Behnel wrote:
> Agustín Villena wrote:
>> For the same version of python and lxml
>>
>> - Doesn't crashes in
>> Microsoft Windows Vista Ultimate
>> Version: 6.0.6000 build 6000
>>
>> - Crashes after 137 iterations
>> Microsoft Windows XP Profesional
>> Version: 5.1.2600 Service Pack 2 Build 2600
>
> Hmm, but then I can't see how this is supposed to be a problem with lxml. I
> mean, if the only difference is the code that Microsoft puts below the runtime
> environment, I would just go and ask Microsoft what they did wrong (or what
> they fixed in Vista to make it work).
>
> Stefan
From stefan_ml at behnel.de Tue Oct 9 15:02:29 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 09 Oct 2007 15:02:29 +0200
Subject: [lxml-dev] Problem with lxml library running on Windows
In-Reply-To:
References: <1b8955fe0710040651n44263544ie6350260f8d29c7a@mail.gmail.com> <4707EE48.7000704@behnel.de> <470B5FB3.80708@behnel.de>
Message-ID: <470B7BE5.8000404@behnel.de>
Agustin Villena wrote:
> Nonetheless, WinXP SP2 is a very relevant platform to be ignored :(
Sadly, yes.
> Do you know Is avalidable the lxml's egg windows build system in the
> source tree?, this can be a very good base to me to debug and reproduce
> the problem
> Did it uses pre-compiled libxml libraries? Or custom compiled libraries?
AFAIK (Sidnei will know better), it should compile with MSVC 2003 like this:
http://codespeak.net/lxml/build.html#static-linking-on-windows
You might also have success with MinGW (setup.py --compiler=mingw32).
If you need help, please ask back on the list.
Stefan
From mwm-keyword-lxml.9112b8 at mired.org Tue Oct 9 21:01:56 2007
From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer)
Date: Tue, 9 Oct 2007 15:01:56 -0400
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <470B5A57.4030102@behnel.de>
References: <20071003165016.104d2caf@bhuda.mired.org>
<470A8BD1.6060305@behnel.de>
<20071008172121.4a6acd8b@bhuda.mired.org>
<470B5A57.4030102@behnel.de>
Message-ID: <20071009150156.2a90e40a@bhuda.mired.org>
On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel wrote:
> ok, that wasn't the problem then (it's still good to have it fixed, though).
> Mike Meyer wrote:
> > A master process reads in a a couple of config files, and parses and
> > checks them against a schema, and then possibly plugs in some default
> > attribute values. It then forks two processes:
> >
> > 1) Uses http to get xml documents from a remote server. These are the
> > ones I described; they have a meta element and then a data element
> > containing "row "elements, with the actual values in the attributes
> > to "row" elements. This process uses iterparse to pull one value
> > from the meta element, and then saves the entire thing to disk.
> That's the process that fails, right?
No, it's the second process, that uses xpath expressions to find
elements to pull the attribute values from, that fails.
> Can you find out if it's the iterparse() or something else that fails here?
Well, I did try isolating parts of the parsing process. The problem
appears to be in the attribute extraction code.
Basically, I have a routine that I pass an xpath expression to, and a
list of attributes I want values for from those elements. I was being
clever (probably too clever), and letting lxml provide a dictionary,
using dict to make a copy of it (i.e. - the "d = dict(node.attrib)"
line), and then playing games with sets to remove extra keys and add
empty strings for missing attributes. If I just create an empty
dictionary and plug empty strings into it for all the keys, the
problem goes away.
So I rewrote that code with something a bit more straightforward:
    d = dict()
    for key in keys:
        d[key] = node.get(key, '')
and again, I haven't been able to recreate the problem.
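Wrapped into a helper, the idea looks roughly like this sketch. The function name, the sample element and the attribute names are made up for illustration; the fallback import only lets it run without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def attributes_as_dict(node, keys, default=""):
    """Return a plain dict with exactly the requested keys.

    Missing attributes get the default value, extra ones are ignored.
    """
    return dict((key, node.get(key, default)) for key in keys)


# Hypothetical element: 'extra' is not wanted, 'colour' is missing.
node = etree.fromstring('<Type name="a" size="3" extra="x"/>')
row = attributes_as_dict(node, ["name", "size", "colour"])
print(row)
```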
The rest of this is probably irrelevant at this point. I've got code
that appears to be working, and things to try if it doesn't work. If
you'd like to continue chasing this, let me know if there's anything I
can do to help.
> Using valgrind is usually a great way to find out what's going wrong. It will
> make the run a lot slower, but it should print some helpful infos when it
> crashes. Run it like this:
>
> valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
> python yourscript.py
>
> preferably only on the process that crashes.
I've got this. I get errors from the Python parser and oracle
libraries (uninitialized values). Then errors from lxml that look like
the gdb "where" output: it just points through etree.so, but adds that
it's doing an invalid read of size 8 or 4 (didn't have the size
before, but this should cause the segfaults). These all seem to be
followed by an error that says
Address 0x4D31450 is 8 bytes inside a block of size 120 free'd
And then traces back through vg_replace_malloc.c, then xmlFreeNodeList
in libxml2 a couple of times, and then back to etree.so.
> > architecture makes things a little convoluted, but the basic path is
> > something like:
> >
> > data = urlopen(....)
> > try:
> > parsed = fromstring(data.read())
>
> parse(data) should do, BTW.
Yeah, I know. But the urlopen happens in a different process (and
host, for that matter) than the parsing code. That got lost in the
simplification.
Note that I changed this - I'm actually using the "findall" method,
not the "xpath" method, to find the elements of interest. All values
passed to findall are paths as indicated, though.
> > if not schema.validate(parsed):
> > handle_broken_document(parsed=parsed)
> > for node in parsed.findall('Types/Type'):
> > d = dict(node.attrib):
> > save_for_db(d)
> > for node in parsed.findall('AltTypes/AltType'):
> > d = dict(node.attrib):
> > save_for_db(d)
> > for node in parsed.findall('MoreTypes/MoreType'):
> > d = dict(node.attrib):
> > save_for_db(d)
>
> That's pretty straight forward code, I don't see any risk here. But I'm
> wondering which of the two processes actually fails now - you're presenting
> this one, but from your previous posts I though it was the other one that crashed.
> > I tried turning of the parsing - which pretty much makes everything
> > else do nothing but pass around the raw data - and got no failures. I
> > also tried turning off just the validation, so that the work is still
> > getting done - and got failures.
> Hmmmm, are those failures related to validation errors?
Nope. I have files without validation errors that cause failures,
whereas I haven't caught the one test file that does validate causing
problems.
> Just in case it's the second process that fails (the XPath one), it could be
> worth testing if using the XPath() class instead of the xpath() method works
> better. That might give us a hint on where the problem comes from. It should
> also be faster, BTW.
I should have thought of that myself. Faster is good, so I went ahead
and made this change. Haven't tried it in the dict(node.attrib)
version, though.
--
Mike Meyer http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
From stefan_ml at behnel.de Wed Oct 10 09:03:04 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 10 Oct 2007 09:03:04 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071009150156.2a90e40a@bhuda.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> <470B5A57.4030102@behnel.de>
<20071009150156.2a90e40a@bhuda.mired.org>
Message-ID: <470C7928.7000307@behnel.de>
Mike Meyer wrote:
> On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel wrote:
>> Can you find out if it's the iterparse() or something else that fails here?
>
> Well, I did try isolating parts of the parsing process. The problem
> appears to be in the attribute extraction code.
>
> Basically, I have a routine that I pass an xpath expression to, and a
> list of attributes I want values for from those elements. I was being
> clever (probably to clever), and letting lxml provide a dictionary,
> using dict to make a copy of it (i.e. - the "d = dict(node.attrib)"
> line),
That should work though. You should also be able to safely do
d = dict(node.items())
or something in that line, which should even be faster as it avoids the
intermediate attrib proxy and iterator creation steps. If you want to be more
selective, a generator expression will do.
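Something along these lines, as a sketch with an invented sample element and attribute names:

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical element with three attributes.
node = etree.fromstring('<Type name="a" size="3" extra="x"/>')

# Copy all attributes without going through node.attrib.
everything = dict(node.items())

# Copy only selected attributes with a generator expression.
wanted = ("name", "size")
selected = dict((key, value) for key, value in node.items()
                if key in wanted)

print(everything)
print(selected)
```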
> and then playing game with sets to remove extra keys and add
> empty strings for missing attributes. If I just create an empty
> dictionary and plug empty strings into it for all the keys, the
> problem goes away.
>
> So I rewrote that code with something a bit more straightforward:
>
>     d = dict()
>     for key in keys:
>         d[key] = node.get(key, '')
>
> and again, I haven't been able to recreate the problem.
Hmmmm, this sounds like a deallocation problem then. Calling .attrib creates a
dict-like Proxy that adds a cyclic reference to the underlying Element, so
this changes the garbage collection behaviour. Things have been going astray a
couple of times already here, as this is really hard to get right for the tons
and tons of possible use cases (involving threading race conditions and what
not). Though I was pretty sure that 1.3.2+ didn't suffer from anything like
that anymore and the attrib stuff should actually have been fixed in 1.2
already AFAIR.
>> Using valgrind is usually a great way to find out what's going wrong. It will
>> make the run a lot slower, but it should print some helpful infos when it
>> crashes. Run it like this:
>>
>> valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
>> python yourscript.py
>>
>> preferably only on the process that crashes.
>
> I've got this. I get errors from the Python parser and oracle
> libraries (uninitialized values). Then errors from lxml that look like
> the gdb "where" output: it just points through etree.so, but adds that
> it's doing an invalid read of size 8 or 4 (didn't have the size
> before, but this should cause the segfaults). These all seem to be
> followed by an error that says
> Address 0x4D31450 is 8 bytes inside a block of size 120 free'd
> And then traces back through vg_replace_malloc.c, then xmlFreeNodeList
> in libxml2 a couple of times, and then back to etree.so.
Then it is a deallocation problem. Apparently, the XML nodes it accesses were
already freed before - that's what's great about valgrind: it tells you what
last happened to the memory that it now fails to access, so you can figure out
why it was freed in the first place.
Could you send me the output?
Stefan
From tillea at rki.de Thu Oct 11 09:56:24 2007
From: tillea at rki.de (Andreas Tille)
Date: Thu, 11 Oct 2007 09:56:24 +0200 (CEST)
Subject: [lxml-dev] Beginner question
Message-ID:
Hi,
I'm sorry to start with this beginner question. Yesterday I stumbled over
lxml and I think it is a really great tool, exactly what I always wanted,
but I'm afraid I need a kick start. I'm trying to parse some XML files that
are used as a transport medium between different databases. We use a self-defined
XSD schema. The XML file looks like this:
...
With the code that I adopted from the tutorial
    for event, elem in etree.iterparse(infile, events=("start")):
        if event == "start":
            print "start:", etree.tostring(elem, pretty_print=True)
            print "--->", elem.tag
I got something like:
...
start:
---> {http://www3.rki.de/ns/agi/ibs/2007/T06/report}source
start:
---> {http://www3.rki.de/ns/rki/base/ct/2007/T03}software
start
---> {http://www3.rki.de/ns/agi/ibs/2007/T06/report}source
...
the elements as a whole with children on the one hand but I have no
idea how to finally access the values like 'idSource="NRZ Berlin" '
nor do I have an idea how to get rid of the default name space that
is prepended before the tags. I would rather like to access the tag
called "source" (without the default name space) or "ct:software"
with the shortcut of the name space.
I also found the very interesting objectify method at
http://codespeak.net/lxml/objectify.html
but I finally have no idea how to use that in the parser because
the page just describes creating objects (or did I miss something?)
Sorry for my ignorance in case things should be obvious from reading
the docs.
Kind regards
Andreas.
--
http://fam-tille.de
From stefan_ml at behnel.de Thu Oct 11 13:49:59 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 11 Oct 2007 13:49:59 +0200
Subject: [lxml-dev] Beginner question
In-Reply-To:
References:
Message-ID: <470E0DE7.2080600@behnel.de>
Andreas Tille wrote:
> I'm sorry to start with this beginner question.
Everyone's a beginner right from the start. :)
> With the code that I adopted from the tutorial
>
> for event, elem in etree.iterparse(infile, events=("start")):
> if event == "start":
> print "start:", etree.tostring(elem, pretty_print=True)
> print "--->", elem.tag
The "start" event only guarantees that the Element itself is complete, but its
children may or may not be parsed yet. Use the "end" event if you need to
access the children.
BTW, testing for event == "start" if you already restricted the events to
("start",) is redundant.
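A small sketch of the "end" variant, with a made-up document instead of your real file:

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical input; a real file name or file object works the same way.
XML = (b"<report><source idSource='NRZ Berlin'>"
       b"<software>x</software></source></report>")

# Note the trailing comma: ("end",) is a tuple, ("end") is just a string.
seen = []
for event, elem in etree.iterparse(io.BytesIO(XML), events=("end",)):
    # At "end", the element and all of its children are complete.
    seen.append((elem.tag, len(elem)))

print(seen)
```

The innermost element is reported first, since its end tag is the first one the parser reaches.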
> idea how to finally access the values like 'idSource="NRZ Berlin" '
That would be an attribute. Read the tutorial on this.
http://codespeak.net/lxml/tutorial.html#elements-carry-attributes
> nor do I have an idea how to get rid of the default name space that
> is prepended before the tags. I would rather like to access the tag
> called "source" (without the default name space)
But there *is* a namespace, so how would you distinguish it from a plain
"source" tag without namespace?
If it's just for brevity, you can always use string constants.
> or "ct:software" with the shortcut of the name space.
Who guarantees that the namespace prefix ("ct") is used in all data files?
Your code would stop working if it wasn't...
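To illustrate both points in one sketch: the namespace URI below is a placeholder, not your real one, and the two documents differ only in the prefix they use.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical namespace URI, kept in a string constant for brevity.
REPORT_NS = "http://example.org/ns/report"
SOURCE = "{%s}source" % REPORT_NS

# The same data, once with an explicit prefix and once with a default namespace.
doc_a = etree.fromstring(
    '<r:report xmlns:r="http://example.org/ns/report">'
    '<r:source idSource="A"/></r:report>')
doc_b = etree.fromstring(
    '<report xmlns="http://example.org/ns/report">'
    '<source idSource="B"/></report>')

# The qualified tag name finds the element regardless of the prefix.
ids = [doc.find(SOURCE).get("idSource") for doc in (doc_a, doc_b)]
print(ids)
```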
> I also found the very interesting objectify method at
> http://codespeak.net/lxml/objectify.html
> but I finally have no idea how to use that in the parser because
> the page just describes creating objects (or did I missed something?)
http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify
iterparse() also returns a (special) parser, so the setup of the lookup scheme
should work alike. I never tried it, but this should work:
parser = etree.iterparse(source_file, remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
parser.setElementClassLookup(lookup)
for event, element in parser:
...
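If the document fits into memory, it may be simpler to skip iterparse() and let objectify parse the data itself. The following is only a sketch: the namespace URI, the "sample" child and its value are invented, only "source" and "idSource" are taken from your mail.

```python
from lxml import objectify

# Hypothetical document with a default namespace.
XML = (b'<report xmlns="http://example.org/ns/report">'
       b'<source idSource="NRZ Berlin"><sample>42</sample></source>'
       b'</report>')

root = objectify.fromstring(XML)

# Children are reachable as Python attributes; the lookup uses the
# namespace of the parent element.
source = root.source
id_source = source.get("idSource")

# Leaf values are converted to Python types on access.
next_sample = source.sample.pyval + 1

print(id_source, next_sample)
```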
Stefan
From tillea at rki.de Thu Oct 11 14:35:18 2007
From: tillea at rki.de (Andreas Tille)
Date: Thu, 11 Oct 2007 14:35:18 +0200 (CEST)
Subject: [lxml-dev] Beginner question
In-Reply-To: <470E0DE7.2080600@behnel.de>
References:
<470E0DE7.2080600@behnel.de>
Message-ID:
On Thu, 11 Oct 2007, Stefan Behnel wrote:
>> for event, elem in etree.iterparse(infile, events=("start")):
>> if event == "start":
>> print "start:", etree.tostring(elem, pretty_print=True)
>> print "--->", elem.tag
>
> The "start" event only guarantees that the Element itself is complete, but its
> children may or may not be parsed yet. Use the "end" event if you need to
> access the children.
Does this mean the usage of
etree.iterparse(infile, events=("end"))
would be what I really want?
> BTW, testing for event == "start" if you already restricted the events to
> ("start",) is redundant.
Right. The condition was a leftover from some other tests ...
>> idea how to finally access the values like 'idSource="NRZ Berlin" '
>
> That would be an attribute. Read the tutorial on this.
>
> http://codespeak.net/lxml/tutorial.html#elements-carry-attributes
Ahhh, elem.get(attribute) did the trick. Thanks.
> But there *is* a namespace, so how would you distinguish it from a plain
> "source" tag without namespace?
>
> If it's just for brevity, you can always use string constants.
I decided on
    if elem.tag.endswith('}source'):
        source = elem.get("idSource")
because for practical reasons I can be sure that I'm in the default
name space.
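A slightly more general form of that check can be sketched as a helper that strips the namespace part from a tag. The helper name and the sample document are made up; note that it deliberately ignores which namespace a tag is in.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def local_name(tag):
    """Strip a leading '{namespace}' part from a tag name, if present."""
    return tag.rsplit("}", 1)[-1]


# Hypothetical document with a default namespace.
root = etree.fromstring(
    '<report xmlns="http://example.org/ns/report">'
    '<source idSource="NRZ Berlin"/></report>')

sources = [elem.get("idSource") for elem in root
           if local_name(elem.tag) == "source"]
print(sources)
```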
>> or "ct:software" with the shortcut of the name space.
>
> Who guarantees that the namespace prefix ("ct") is used in all data files?
> Your code would stop working if it wasn't...
It would not have validated in the first place if the ct were missing where
it is used here. But I see your argument and can cope with it. I just
thought I might have missed something in the API that would enable me to
use shortcuts.
> http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify
>
> iterparse() also returns a (special) parser, so the setup of the lookup scheme
> should work alike. I never tried it, but this should work:
>
> parser = etree.iterparse(source_file, remove_blank_text=True)
>
> lookup = objectify.ObjectifyElementClassLookup()
> parser.setElementClassLookup(lookup)
>
> for event, element in parser:
>     ...
Well, using the code:
for event, element in parser:
    print "element: ", etree.tostring(element, pretty_print=True)
gives, for instance:
element:
element:
element:
Here I also wonder how to obtain the attribute idSource from the source tag,
for instance.
Many thanks for the hint at the beginning, which brought me quite a step forward.
Andreas.
--
http://fam-tille.de
From stefan_ml at behnel.de Thu Oct 11 16:04:35 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 11 Oct 2007 16:04:35 +0200
Subject: [lxml-dev] Beginner question
In-Reply-To:
References: <470E0DE7.2080600@behnel.de>
Message-ID: <470E2D73.60904@behnel.de>
Andreas Tille wrote:
> On Thu, 11 Oct 2007, Stefan Behnel wrote:
>
>>> for event, elem in etree.iterparse(infile, events=("start")):
>>>     if event == "start":
>>>         print "start:", etree.tostring(elem, pretty_print=True)
>>>         print "--->", elem.tag
>> The "start" event only guarantees that the Element itself is complete, but its
>> children may or may not be parsed yet. Use the "end" event if you need to
>> access the children.
>
> Does this mean the usage of
> etree.iterparse(infile, events=("end"))
> would be what I really want?
Depends on what you want, but likely yes. Note that ("end",) is the default
anyway.
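A minimal sketch of the "end" event in practice. The sample document, its namespace URI and the SOURCE_TAG constant are invented for illustration, and the fallback import assumes the stdlib ElementTree API, which matches lxml for the calls used here:

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API behaves the same for iterparse/get/tag.
    import xml.etree.ElementTree as etree

# Hypothetical sample data; only the idSource value comes from the thread.
XML = (
    b'<ct:root xmlns:ct="http://example.org/ct">'
    b'<ct:source idSource="NRZ Berlin"><ct:software>demo</ct:software></ct:source>'
    b'</ct:root>'
)

# A string constant keeps the fully qualified tag name short in the loop.
SOURCE_TAG = "{http://example.org/ct}source"

sources = []
# "end" fires once the element and all of its children have been parsed.
for event, elem in etree.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == SOURCE_TAG:
        # Record the attribute and the number of children seen so far.
        sources.append((elem.get("idSource"), len(elem)))

print(sources)
```

With the "end" event the child count is reliable; with "start" it may or may not be, as noted above.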
>>> or "ct:software" with the shortcut of the name space.
>> Who guarantees that the namespace prefix ("ct") is used in all data files?
>> Your code would stop working if it wasn't...
>
> It would not validate before if the ct would be missing in the place where
> it is used here.
Why not? You could use "humptydumpty:software" as long as you associated
"humptydumpty" with the right namespace. And your XML document could define
1000 prefixes for the same namespace and then use a different prefix for each
tag. And it would validate just fine, as the namespace would be correct.
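To illustrate the point, here is a small sketch with an invented namespace URI. Three documents use three different prefixes (including none at all) for the same namespace, and the parsed tag name is identical in each case. The fallback import again assumes the stdlib ElementTree API:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree reports tags in the same {uri}name form.
    import xml.etree.ElementTree as etree

NS = "http://example.org/ct"  # hypothetical namespace URI

doc_a = etree.fromstring('<ct:software xmlns:ct="%s"/>' % NS)
doc_b = etree.fromstring('<humptydumpty:software xmlns:humptydumpty="%s"/>' % NS)
doc_c = etree.fromstring('<software xmlns="%s"/>' % NS)

# The prefix is only a document-local alias; the tag carries the URI itself.
tags = {doc_a.tag, doc_b.tag, doc_c.tag}
print(tags)
```

This is why matching on the "{uri}name" form is robust, while matching on "ct:software" is not.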
>> http://codespeak.net/lxml/objectify.html#setting-up-lxml-objectify
>>
>> iterparse() also returns a (special) parser, so the setup of the lookup scheme
>> should work alike. I never tried it, but this should work:
>>
>> parser = etree.iterparse(source_file, remove_blank_text=True)
>>
>> lookup = objectify.ObjectifyElementClassLookup()
>> parser.setElementClassLookup(lookup)
>>
>> for event, element in parser:
>>     ...
>
> Well, when using the code:
>
> for event, element in parser:
>     print "element: ", etree.tostring(element, pretty_print=True)
>
> gives for instance:
>
> element:
> element:
>
>
> element:
>
> I here also wonder how to obtain the attribute idSource from the source tag
> for instance.
Same attribute access as before, just the child access API is different, as
described in the objectify docs.
Stefan
From felwert at uni-bremen.de Thu Oct 11 16:26:45 2007
From: felwert at uni-bremen.de (Frederik Elwert)
Date: Thu, 11 Oct 2007 16:26:45 +0200
Subject: [lxml-dev] [Spam: 5.001 ] Re: Beginner question
In-Reply-To:
References:
<470E0DE7.2080600@behnel.de>
Message-ID: <1192112805.8247.12.camel@FredDesk>
Am Donnerstag, den 11.10.2007, 14:35 +0200 schrieb Andreas Tille:
> On Thu, 11 Oct 2007, Stefan Behnel wrote:
> > But there *is* a namespace, so how would you distinguish it from a plain
> > "source" tag without namespace?
> >
> > If it's just for brevity, you can always use string constants.
>
> I decided for
>
> if elem.tag.endswith('}source'):
>     source = elem.get("idSource")
>
> because for practical reasons I can be sure that I'm in the default
> name space.
If you want it a bit less "dirty" and more XMLish, you could use
local-name() from XPath:
lname = etree.XPath('local-name()')
if lname(elem) == 'source':
    source = elem.get('idSource')
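A self-contained version of the same idea might look like this. The sample documents and the namespace URI are made up for illustration, and the sketch needs lxml itself, since it relies on etree.XPath:

```python
from lxml import etree

# Compiled XPath expression that returns the tag name without its namespace.
lname = etree.XPath('local-name()')

# Hypothetical sample document using a default namespace.
root = etree.fromstring(
    '<root xmlns="http://example.org/ct">'
    '<source idSource="NRZ Berlin"/>'
    '<other/>'
    '</root>'
)

found = [elem.get('idSource') for elem in root if lname(elem) == 'source']

# Caveat: local-name() ignores the namespace entirely, so an element
# without any namespace matches as well.
plain = etree.fromstring('<source/>')
matches_plain = lname(plain) == 'source'
```

Note that this does not distinguish a namespaced "source" from a plain one, so it is only safe when the namespace is known from context.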
Cheers,
Frederik
From stefan_ml at behnel.de Thu Oct 11 17:18:57 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 11 Oct 2007 17:18:57 +0200
Subject: [lxml-dev] Dealing with segfaults in lxml?
In-Reply-To: <20071010104245.361b359c@mbook.mired.org>
References: <20071003165016.104d2caf@bhuda.mired.org> <470A8BD1.6060305@behnel.de> <20071008172121.4a6acd8b@bhuda.mired.org> <470B5A57.4030102@behnel.de> <20071009150156.2a90e40a@bhuda.mired.org> <470C7928.7000307@behnel.de>
<20071010104245.361b359c@mbook.mired.org>
Message-ID: <470E3EE1.3050707@behnel.de>
Mike Meyer wrote:
> On Wed, 10 Oct 2007 09:03:04 +0200 Stefan Behnel wrote:
>> d = dict(node.items())
>>
>> or something along those lines, which should even be faster as it avoids the
>> intermediate attrib proxy and iterator creation steps. If you want to be more
>> selective, a generator expression will do.
>
> I tried the node.items() variation, and that was still causing
> segfaults.
Then it's still different than I thought. If all you change is this line:
d = dict(node.attrib)
and you get segfaults with this:
d = dict(node.items())
but not with this:
d = dict()
for key in keys:
    d[key] = node.get(key, '?')
I really can't extract anything meaningful from that. The complete valgrind
trace would be helpful.
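For comparison, the three variants side by side on a made-up element (the attribute names and the keys list are invented; the fallback import assumes the stdlib ElementTree API). On a healthy build all three should simply produce dictionaries:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: attrib, items() and get() behave alike in the stdlib.
    import xml.etree.ElementTree as etree

node = etree.fromstring('<node a="1" b="2"/>')  # hypothetical sample element

# Variant 1: copy through the attrib mapping.
d1 = dict(node.attrib)

# Variant 2: copy from the (name, value) pairs directly.
d2 = dict(node.items())

# Variant 3: look up selected keys one by one, with a default for missing ones.
keys = ["a", "b", "c"]
d3 = dict()
for key in keys:
    d3[key] = node.get(key, '?')

print(d1, d2, d3)
```

The first two copy every attribute, while the third only covers the listed keys and fills in the default for those that are absent.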
Stefan
From bkc at murkworks.com Mon Oct 15 20:18:46 2007
From: bkc at murkworks.com (Brad Clements)
Date: Mon, 15 Oct 2007 14:18:46 -0400
Subject: [lxml-dev] custom resolver, why does system url start with XSLT:?
Message-ID: <4713AF06.3060408@murkworks.com>
I have a project (XSL based TAL) that has used libxml2 and libxslt for a
couple of years. I have a custom resolver that has worked "ok" with this.
Now I have converted the project to use lxml. I am creating a parser and
adding my resolver.
when my resolver gets called, the URIs are weirdly mangled like this:
resolve url 'XSLT:///xml/navigation.xml' id None ctext
resolve url '/xml/carrier_payables_navigation.xml' id None ctext
resolve url 'XSLT:///services/+payment_accounts' id None ctext
(the 2nd one is not mangled, looks ok to me)
What's the story with XSLT:// being stuck on the front of the system urls?
I don't see that happen when I use libxml2 directly.
I tried looking through the lxml source to find this, but I couldn't
find it in docloader, parser, or xslt.
where is the XSLT scheme coming from, is it lxml or libxslt?
Why is it being inserted?
The last example url, comes from using document() in a stylesheet, (the
converted form of this:)
I expect my resolver to get a system url of '/services/+payment_accounts'
The 2nd example url above (the non-mangled one), also comes from a
document call, like this:
Maybe the difference is due to one document() using a constant string,
the other using a variable..?
I am using 2.0 alpha4 and 2.0 alpha3 (two different systems, same
problem). I can't see how to find out from lxml which versions of libxml2
and libxslt I am using.
(the .xsl that is the converted form of the above TAL statements has
these statements)
Thanks for any suggestions!
--
Brad Clements, bkc at murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements
From stefan_ml at behnel.de Tue Oct 16 08:50:05 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 16 Oct 2007 08:50:05 +0200
Subject: [lxml-dev] custom resolver,
why does system url start with XSLT:?
In-Reply-To: <4713AF06.3060408@murkworks.com>
References: <4713AF06.3060408@murkworks.com>
Message-ID: <47145F1D.2050302@behnel.de>
Brad Clements wrote:
> I have a project (XSL based TAL) that has used libxml2 and libxslt for a
> couple of years. I have a custom resolver that has worked "ok" with this.
>
> Now I have converted the project to use lxml. I am creating a parser and
> adding my resolver.
>
> when my resolver gets called, the URIs are weirdly mangled like this:
>
> resolve url 'XSLT:///xml/navigation.xml' id None ctext
>
>
> resolve url '/xml/carrier_payables_navigation.xml' id None ctext
>
>
> resolve url 'XSLT:///services/+payment_accounts' id None ctext
>
>
> (the 2nd one is not mangled, looks ok to me)
>
> What's the story with XSLT:// being stuck on the front of the system urls?
>
> The last example url, comes from using document() in a stylesheet, (the
> converted form of this:)
>
>