From aymeric.augustin at polytechnique.org Sat Jan 2 20:57:02 2010
From: aymeric.augustin at polytechnique.org (Aymeric Augustin)
Date: Sat, 2 Jan 2010 20:57:02 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
Message-ID: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
Hello,
lxml.etree.parse is able to load gzipped XML files directly, but
lxml.etree.iterparse is not. See below for an interactive session
demonstrating the problem on debian stable. Is it the expected
behavior, or is it a bug?
The documentation does address this point, it says only:
> lxml can parse from a local file, an HTTP URL or an FTP URL. It
> also auto-detects and reads gzip-compressed XML files (.gz).
Context: I'm handling hundreds of GB-sized files. It would be nice to
store them gzipped and have lxml decompress them on the fly, without
any specific Python code.
Thanks!
% python
Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip, sys
>>> from lxml import etree
>>> print etree.__version__
2.1.1
Let's create a gzipped XML file:
>>> gzip.open('test.xml.gz', 'wb').write('')
etree.parse is able to load it:
>>> tree = etree.parse('test.xml.gz')
>>> tree.write(sys.stdout); print
etree.iterparse crashes:
>>> ctx = etree.iterparse('test.xml.gz')
>>> list(ctx)
Traceback (most recent call last):
File "", line 1, in
File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__
(src/lxml/lxml.etree.c:73245)
File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/
lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
etree.iterparse accepts the ungzipped file:
>>> ctx = etree.iterparse(gzip.open('test.xml.gz', 'rb'))
>>> list(ctx)
[(u'end', ), (u'end', )]
--
Aymeric Augustin.
From stefan_ml at behnel.de Sun Jan 3 08:25:51 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 03 Jan 2010 08:25:51 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
Message-ID: <4B40467F.7020305@behnel.de>
Aymeric Augustin, 02.01.2010 20:57:
> lxml.etree.parse is able to load gzipped XML files directly, but
> lxml.etree.iterparse is not.
> [...]
> The documentation does address this point, it says only:
>> lxml can parse from a local file, an HTTP URL or an FTP URL. It
>> also auto-detects and reads gzip-compressed XML files (.gz).
Right, there should be a note in the iterparse docs also. The input support
in iterparse() is a lot simpler than that. It doesn't support URLs either.
Due to the inner workings of iterparse, all of this isn't trivial to add,
as lxml would have to detect and apply the correct reading mechanism itself
(e.g. by building up a decompression step for libxml2 manually). Even
detecting the compression would require opening the file and reading from
it first. Now imagine named pipes and system streams, which you cannot just
reopen afterwards...
It might be possible to detect GzipFile objects and bypass them, but that
would already be a difference to the normal parse() behaviour.
> Context: I'm handling hundreds of GB-sized files. It would be nice to
> store them gzipped and have lxml decompress them on the fly, without
> any specific Python code.
The way to do that is currently by passing through the gzip module. You can
also try using a pipe to an externally started gzip process. I frequently
use this on 64-bit multicore Sun machines where the system-provided gzip is
incredibly fast, much faster than Python's gzip module.
Stefan
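
A minimal sketch of the "pass it through the gzip module" approach described above. The helper name iter_gzipped_tags is hypothetical, and the fallback to the standard library's ElementTree is an assumption made only so the sketch runs where lxml is not installed; the thread itself is about lxml.etree.iterparse.

```python
import gzip

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module offers a compatible iterparse().
    import xml.etree.ElementTree as etree


def iter_gzipped_tags(path):
    """Yield the tag of every element in a gzip-compressed XML file.

    iterparse() receives an already-decompressing file object, so it
    never has to detect the compression itself.
    """
    with gzip.open(path, "rb") as stream:
        for event, element in etree.iterparse(stream):  # 'end' events only
            yield element.tag
            # Release the element's content once it has been handled,
            # which matters for GB-sized inputs.
            element.clear()
```

Clearing each element after use is a design choice for large inputs, not something the thread prescribes.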
From aymeric.augustin at polytechnique.org Sun Jan 3 11:00:41 2010
From: aymeric.augustin at polytechnique.org (Aymeric Augustin)
Date: Sun, 3 Jan 2010 11:00:41 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To: <4B40467F.7020305@behnel.de>
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
<4B40467F.7020305@behnel.de>
Message-ID:
On 3 janv. 10, at 08:25, Stefan Behnel wrote:
> Right, there should be a note in the iterparse docs also. The input
> support in iterparse() is a lot simpler than that. It doesn't
> support URLs either.
>
> Due to the inner workings of iterparse, all of this isn't trivial
> to add, as lxml would have to detect and apply the correct reading
> mechanism itself (e.g. by building up a decompression step for
> libxml2 manually). Even detecting the compression would require
> opening the file and reading from it first. Now imagine named pipes
> and system streams, which you cannot just reopen afterwards...
OK, thanks for the explanation.
>> Context: I'm handling hundreds of GB-sized files. It would be nice
>> to store them gzipped and have lxml decompress them on the fly,
>> without any specific Python code.
>
> The way to do that is currently by passing through the gzip module.
> You can also try using a pipe to an externally started gzip
> process. I frequently use this on 64-bit multicore Sun machines
> where the system-provided gzip is incredibly fast, much faster
> than Python's gzip module.
I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the
pipe. On a Debian with a Core 2 Duo, it's faster by 10% than using
the gzip module. The performance gain comes from gunzipping and
parsing in parallel; the overall resource consumption (user + system)
is nearly identical.
So for now I'll stick with the gzip module.
--
Aymeric Augustin.
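
The pipe variant tried above could look roughly like this. It is a sketch under assumptions: the function name and the `command` parameter are hypothetical, `gzip -dc` is assumed to be on the PATH, and the stdlib fallback import is only there so the sketch runs without lxml.

```python
import subprocess

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback with a compatible iterparse().
    import xml.etree.ElementTree as etree


def count_elements_via_pipe(path, command=("gzip", "-dc")):
    """Count elements while a separate process does the decompression."""
    # The child process writes decompressed XML to its stdout, which
    # iterparse() reads like any other file object.
    proc = subprocess.Popen(list(command) + [path], stdout=subprocess.PIPE)
    count = 0
    try:
        for event, element in etree.iterparse(proc.stdout):
            count += 1
            element.clear()
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("decompression command exited with %d" % returncode)
    return count
```

Decompression and parsing then run in two processes, which is where the parallelism mentioned above comes from.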
From stefan_ml at behnel.de Sun Jan 3 15:14:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 03 Jan 2010 15:14:35 +0100
Subject: [lxml-dev] Should lxml.etree.iterparse support gzipped files?
In-Reply-To:
References: <9AF1F17C-3852-4DD1-9214-FB407B5CD06B@polytechnique.org>
<4B40467F.7020305@behnel.de>
Message-ID: <4B40A64B.60302@behnel.de>
Aymeric Augustin, 03.01.2010 11:00:
> I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the
> pipe. On a Debian with a Core 2 Duo, it's faster by 10% than using the
> gzip module. The performance gain comes from gunzipping and parsing in
> parallel; the overall resource consumption (user + system) is nearly
> identical.
>
> So for now I'll stick with the gzip module.
Sounds reasonable. You can also try to adjust the gzip buffer size and see
if that reduces the overhead.
Stefan
From lists at zopyx.com Wed Jan 6 14:33:57 2010
From: lists at zopyx.com (Andreas Jung)
Date: Wed, 06 Jan 2010 14:33:57 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
Message-ID: <4B449145.8060602@zopyx.com>
Hi there,
given a structure like
foo
blather
bar
blather
fox
How can I parse the content of <p> into a flat list like
['foo', , 'bar', , 'fox']
?
Andreas
--
ZOPYX Limited \ zopyx group
Charlottenstr. 37/1 \ The full-service network for your
D-72070 Tübingen \ Python, Zope and Plone projects
www.zopyx.com, info at zopyx.com \ www.zopyxgroup.com
------------------------------------------------------------------------
E-Publishing, Python, Zope & Plone development, Consulting
From stefan_ml at behnel.de Thu Jan 7 14:15:41 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 07 Jan 2010 14:15:41 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
In-Reply-To: <4B449145.8060602@zopyx.com>
References: <4B449145.8060602@zopyx.com>
Message-ID: <4B45DE7D.9050903@behnel.de>
Andreas Jung, 06.01.2010 14:33:
> given a structure like
>
>
> foo
> blather
> bar
> blather
> fox
>
>
> How can I parse the content of <p> into a flat list like
>
> ['foo', , 'bar', , 'fox']
for p in root.iter(tag='p'):
    flat_list = []
    if p.text:
        flat_list.append(p.text)
    for el in p:
        flat_list.append(el)
        if el.tail:
            flat_list.append(el.tail)
    print(flat_list)
Stefan
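
The loop above, as a self-contained sketch. The sample document is an assumption (the markup of the original example did not survive in the archive, so the <b> and <i> children are invented), the helper names are hypothetical, and the stdlib fallback import is only there so it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback; .text and .tail behave the same way.
    import xml.etree.ElementTree as etree


def flatten(parent):
    """Return the parent's direct content as a flat list of str and Element."""
    items = []
    if parent.text:            # text before the first child
        items.append(parent.text)
    for child in parent:
        items.append(child)
        if child.tail:         # text between this child and the next one
            items.append(child.tail)
    return items


def describe(items):
    """Replace each Element by a readable '<tag>' marker for printing."""
    return [i if isinstance(i, str) else "<%s>" % i.tag for i in items]


# Hypothetical sample input.
root = etree.fromstring(
    "<root><p>foo<b>blather</b>bar<i>blather</i>fox</p></root>")
for p in root.iter("p"):
    print(describe(flatten(p)))  # ['foo', '<b>', 'bar', '<i>', 'fox']
```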
From d.rothe at semantics.de Thu Jan 7 22:38:46 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Thu, 07 Jan 2010 22:38:46 +0100
Subject: [lxml-dev] 'text', 'tail' handling with nested markup
In-Reply-To: <4B45DE7D.9050903@behnel.de>
References: <4B449145.8060602@zopyx.com> <4B45DE7D.9050903@behnel.de>
Message-ID:
On Thu, 07 Jan 2010 14:15:41 +0100, Stefan Behnel
wrote:
>
> Andreas Jung, 06.01.2010 14:33:
>> given a structure like
>>
>>
>> foo
>> blather
>> bar
>> blather
>> fox
>>
>>
>> How can I parse the content of <p> into a flat list like
>>
>> ['foo', , 'bar', , 'fox']
>
> for p in root.iter(tag='p'):
>     flat_list = []
>     if p.text:
>         flat_list.append(p.text)
>     for el in p:
>         flat_list.append(el)
>         if el.tail:
>             flat_list.append(el.tail)
>     print(flat_list)
Another variant with xpath:
flat_list = root.xpath('//p/node()')
or if root is the <p> element
flat_list = root.xpath('node()')
--dirk
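
A short sketch of the XPath variant, using the same invented sample document as an assumption (the <b> and <i> children are not from the original post). XPath is specific to lxml here, so there is no stdlib fallback.

```python
from lxml import etree

# Hypothetical sample input.
root = etree.fromstring(
    "<root><p>foo<b>blather</b>bar<i>blather</i>fox</p></root>")


def describe(nodes):
    """Text nodes come back as str subclasses, elements as Element objects."""
    return [n if isinstance(n, str) else "<%s>" % n.tag for n in nodes]


# From anywhere in the tree: all child nodes of every <p>.
from_root = root.xpath("//p/node()")

# From the <p> element itself: its own child nodes.
p = root.find("p")
from_p = p.xpath("node()")

print(describe(from_root))  # ['foo', '<b>', 'bar', '<i>', 'fox']
```

Note that node() also matches comments and processing instructions, so real documents may need an extra filter.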
From ygingras at ygingras.net Sun Jan 10 21:16:31 2010
From: ygingras at ygingras.net (Yannick Gingras)
Date: Sun, 10 Jan 2010 15:16:31 -0500
Subject: [lxml-dev] Looking for performance tips for soupparser
In-Reply-To: <4B3D20CB.4000305@behnel.de>
References: <200912311111.20018.ygingras@ygingras.net>
<4B3D20CB.4000305@behnel.de>
Message-ID: <201001101516.31343.ygingras@ygingras.net>
On December 31, 2009, Stefan Behnel wrote:
> > Would any of you have some tips to share on speeding things up with
> > soupparser? How hard would it be to make elements conform to the
> > pickling protocol?
>
> I'd use the normal HTML parser instead, and only fall back to using the
> soupparser when things go really wrong (whatever that means in your case).
>
> Another thing you can do (assuming that caching is helpful in your case),
> is to parse the documents using soupparser and serialise them into the
> cache. Then parse them from the cache using the normal HTML parser
> (preferably with "recover=False") when you need them. A serialise-parse
> cycle is several times faster than a new parser run of BeautifulSoup, so if
> you need the documents multiple times, this will speed things up.
I implemented both ideas and it resulted in at least a 10-fold speedup.
Thanks a lot!
--
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer
From b.tarde at gmail.com Mon Jan 11 22:04:34 2010
From: b.tarde at gmail.com (Peter Baker)
Date: Mon, 11 Jan 2010 16:04:34 -0500
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
Message-ID: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
I've just recently been getting acquainted with lxml. Thanks to the
developers: it's great!
There's just one thing (so far) that I haven't been able to do. In an
XSLT transformation, I can't figure out what's going on with
xsl:message when there is no terminate="yes" attribute. Command-line
processors like xsltproc print the message to stderr. With libxslt you
can capture it with a callback function. But I can't figure out how to
display such messages with lxml.
For just a little background, my app often needs to do very long
transforms: up to a half hour, though a minute is more typical. It
seems important to be able to spit out warnings and messages to mark
the progress of the transformation. So reading an error log after the
transform is over is not what I want.
Sorry if the answer is an obvious one. It's hard to search a forum's
archive for "message," and the other keywords I've been able to think
of haven't given me the answer in either the documentation or the
forum.
Thanks in advance,
Peter Baker
From stefan_ml at behnel.de Tue Jan 12 09:05:44 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 12 Jan 2010 09:05:44 +0100
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
In-Reply-To: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
References: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
Message-ID: <4B4C2D58.1030604@behnel.de>
Hi,
Peter Baker, 11.01.2010 22:04:
> I've just recently been getting acquainted with lxml. Thanks to the
> developers: it's great!
Thanks :)
> There's just one thing (so far) that I haven't been able to do. In an
> XSLT transformation, I can't figure out what's going on with
> xsl:message when there is no terminate="yes" attribute. Command-line
> processors like xsltproc print the message to stderr. With libxslt you
> can capture it with a callback function. But I can't figure out how to
> display such messages with lxml.
>
> For just a little background, my app often needs to do very long
> transforms: up to a half hour, though a minute is more typical. It
> seems important to be able to spit out warnings and messages to mark
> the progress of the transformation. So reading an error log after the
> transform is over is not what I want.
I never tried that, but you should be able to read the error_log also
during the transformation (you obviously need a reference to the running
XSLT object to do that).
You shouldn't read the log from a separate thread, though. I'm not sure if
that works, but if it works, I should consider it a bug (I'll have to check
that).
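
As a minimal sketch of where xsl:message output ends up: the stylesheet and message text below are made up for illustration. The log is read here only after the transform has returned; whether it can be polled while the transform is still running is exactly the part that was not tried.

```python
from lxml import etree

# Hypothetical stylesheet that emits one non-terminating message.
stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:message>STARTING</xsl:message>
    <out/>
  </xsl:template>
</xsl:stylesheet>''')

transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<doc/>"))

# Each log entry carries the message text plus position information.
for entry in transform.error_log:
    print("line %s: %s" % (entry.line, entry.message))
```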
A different way is to use a dedicated extension element to export the
message, instead of the generic xsl:message tag. That would allow you to do
whatever you want in Python code, e.g. use the logging package, and also to
provide more information than just a plain message string.
http://codespeak.net/lxml/extensions.html#xslt-extension-elements
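
A sketch of the extension element idea, under assumptions: the namespace URI, the element name, the logger name and the class name are all hypothetical. The element's execute() method runs while the transform is in progress, so messages are logged as they occur.

```python
import logging

from lxml import etree

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("xslt.progress")   # hypothetical logger name

PROGRESS_NS = "urn:example:progress"       # hypothetical namespace URI


class ProgressMessage(etree.XSLTExtension):
    """Forward the literal text of <progress:message> to Python logging."""

    def execute(self, context, self_node, input_node, output_parent):
        text = (self_node.text or "").strip()
        log.info("%s (processing <%s>)", text, input_node.tag)


stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:progress="urn:example:progress"
    extension-element-prefixes="progress">
  <xsl:template match="/doc">
    <out><xsl:apply-templates select="item"/></out>
  </xsl:template>
  <xsl:template match="item">
    <progress:message>item done</progress:message>
    <done/>
  </xsl:template>
</xsl:stylesheet>''')

extensions = {(PROGRESS_NS, "message"): ProgressMessage()}
transform = etree.XSLT(stylesheet, extensions=extensions)
result = transform(etree.XML("<doc><item/><item/></doc>"))
```

Because the handler is ordinary Python, it could just as well update a progress bar or attach more context than a plain message string.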
And a third approach would be to divert the global thread error log (i.e.
the output of everything libxml2/libxslt does in the current thread) to
Python's logging package.
http://codespeak.net/lxml/api/lxml.etree.PyErrorLog-class.html
http://codespeak.net/lxml/api/lxml.etree-module.html#use_global_python_log
However, this doesn't appear to be used very much, so it hasn't been
exercised as much as other parts of the API. It was even broken for a long
time before 2.2.3 until a user finally noticed (this feature isn't easy to
test as it obviously has the side-effect of diverting the error output).
Just looking through the code now revealed a couple of spots where the API
could be improved - I guess that should be done for 2.3.
HTH - maybe not "one perfect way to do it", but at least a couple of paths
to explore. :)
> Sorry if the answer is an obvious one. It's hard to search a forum's
> archive for "message," and the other keywords I've been able to think
> of haven't given me the answer in either the documentation or the
> forum.
I'd look for "xsl:message" given that the prefix is so commonly used. But I
don't remember having seen any related threads on this list so far.
Stefan
From ashish.vyas at motorola.com Tue Jan 12 10:33:56 2010
From: ashish.vyas at motorola.com (VYAS ASHISH M-NTB837)
Date: Tue, 12 Jan 2010 17:33:56 +0800
Subject: [lxml-dev] FW: lxml 2.2.4 on python3.1, Windows XP gives importerror
Message-ID: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>
Dear All
I have Python 3.1 installed on Windows XP and it works fine.
I downloaded lxml 2.2.4 (lxml-2.2.4.win32-py3.1.exe) from pypi.
When I try:
from lxml import etree
I get:
ImportError: DLL load failed: This application has failed to start
because the application configuration is incorrect. Reinstalling the
application may fix this problem.
For information: 'import lxml' works fine.
After reinstalling python3.1 also the error message is the same. Any
help is appreciated!
Regards,
Ashish Vyas
From dakota at brokenpipe.ru Tue Jan 12 14:05:43 2010
From: dakota at brokenpipe.ru (Marat Dakota)
Date: Tue, 12 Jan 2010 16:05:43 +0300
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To: <4B3A7F41.6000304@behnel.de>
References:
<4AF7C843.6050509@behnel.de>
<4AF923A2.8010006@behnel.de>
<4AF9852D.3020408@behnel.de>
<4AF992C9.7090400@behnel.de>
<4B3A7F41.6000304@behnel.de>
Message-ID:
>
> Thanks a lot, it looks reasonable at first glance and I'll take a closer
> look as soon as I get to it. If it works well, it should make it into 2.3.
>
Is there a roadmap date for the 2.3 release?
> Could you add a couple of tests to src/lxml/tests/test_xslt.py? That would
> help in making sure that it keeps working as expected even if I find that I
> need to rework the patch.
>
I've added tests, renamed variables to fit your code better, and added the
possibility to evaluate an extension element's content directly to
_AppendOnlyElementProxy as well as to _Element. I'm satisfied with the code
now. I wonder what you will say about it.
> Also, it's best to send patches as a readable attachment rather than
> inline. Mail programs tend to reformat text and it's easy to lose empty
> trailing lines etc.
>
The patch is attached. Can't wait to see it in trunk :)
> Thanks for pulling this out!
And thank you for making such a nice and useful thing!
--
Marat
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxml-2.2.4.patch
Type: application/octet-stream
Size: 7125 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100112/2f83eef6/attachment-0001.obj
From b.tarde at gmail.com Tue Jan 12 14:51:15 2010
From: b.tarde at gmail.com (Peter Baker)
Date: Tue, 12 Jan 2010 08:51:15 -0500
Subject: [lxml-dev] Output of xsl:message when terminate is not yes
In-Reply-To: <4B4C2D58.1030604@behnel.de>
References: <64e117cb1001111304p244d92c7ie87fca1cfee7a24@mail.gmail.com>
<4B4C2D58.1030604@behnel.de>
Message-ID: <64e117cb1001120551o45fd6b79scca71cff088b09b2@mail.gmail.com>
On Tue, Jan 12, 2010 at 3:05 AM, Stefan Behnel wrote:
>
> I never tried that, but you should be able to read the error_log also during
> the transformation (you obviously need a reference to the running XSLT
> object to do that).
>
> You shouldn't read the log from a separate thread, though. I'm not sure if
> that works, but if it works, I should consider it a bug (I'll have to check
> that).
>
Stefan,
Thanks for the quick reply. I'm not partial to extension elements, and
your third method sounds iffy, so I'd like to try the first. But
looking at the API it is not clear to me how I get at the error_log
*while the transform is running.* (I'm still learning to think
Pythonically: my current task is converting from a Bash script that
calls xsltproc to Python that uses the lxml API.) Here's a
snippet from my current (probably un-Pythonic) code:
from lxml import etree

xgfdoc = etree.parse(self.xgffile)
styledoc = etree.parse(self.xslfile)
transform = etree.XSLT(styledoc)
newparams = dict()
for k in self.params.keys():
    newparams[k] = "'" + self.params[k] + "'"
try:
    result = transform(xgfdoc, **newparams)
except etree.XSLTApplyError as msg:
    raise XGFCompilationError(msg)
Peter
From sridharr at activestate.com Tue Jan 12 19:02:38 2010
From: sridharr at activestate.com (Sridhar Ratnakumar)
Date: Tue, 12 Jan 2010 10:02:38 -0800
Subject: [lxml-dev] Instructions to build on Windows 64-bit?
In-Reply-To: <4B3A55DD.1030004@activestate.com>
References: <4B3A55DD.1030004@activestate.com>
Message-ID: <4B4CB93E.5070903@activestate.com>
On 12/29/2009 11:17 AM, Sridhar Ratnakumar wrote:
> I noticed that the lxml PyPI page provides 64-bit Windows installers
> [http://pypi.python.org/pypi/lxml/2.2.4 ; lxml-*amd64.exe]. I assume
> they are statically linked with the libxml/xslt libraries.
>
> In the interest of providing 64-bit binaries in PyPM
> [pypm.activestate.com], may I know how these binaries are built? I tried
> buildlibxml.py which fails at several steps; and the compiled libraries
provided at ftp.zlatkovic.com are 32-bit only.
Any response?
I ask because this would enable us to provide builds for lxml via PyPM:
http://pypm.activestate.com/list-l.html#lxml
Currently the 64-bits are missing. If the person who made the amd64
installers could respond to this, that would be great. It will make
http://codespeak.net/lxml/installation.html#installation-in-activepython
just work on the 64-bit systems.
Specifically I am asking for the 64-bit version of
http://codespeak.net/lxml/build.html#static-linking-on-windows
-srid
From sidnei.da.silva at gmail.com Tue Jan 12 20:01:24 2010
From: sidnei.da.silva at gmail.com (Sidnei da Silva)
Date: Tue, 12 Jan 2010 17:01:24 -0200
Subject: [lxml-dev] Instructions to build on Windows 64-bit?
In-Reply-To: <4B4CB93E.5070903@activestate.com>
References: <4B3A55DD.1030004@activestate.com>
<4B4CB93E.5070903@activestate.com>
Message-ID:
On Tue, Jan 12, 2010 at 4:02 PM, Sridhar Ratnakumar
wrote:
> Any response?
>
> I ask because this would enable us to provide builds for lxml via PyPM:
>
>   http://pypm.activestate.com/list-l.html#lxml
>
> Currently the 64-bits are missing. If the person who made the amd64
> installers could respond to this, that would be great. It will make
> http://codespeak.net/lxml/installation.html#installation-in-activepython
> just work on the 64-bit systems.
>
> Specifically I am asking for the 64-bit version of
> http://codespeak.net/lxml/build.html#static-linking-on-windows
I've used the binaries from:
http://pecl2.php.net/downloads/php-windows-builds/php-libs/
I also had to modify the 'libraries' function in setupinfo.py since
those binaries are slightly different. I don't remember the details,
but something about a '_a' postfix IIRC.
-- Sidnei
From mike.maccana at gmail.com Wed Jan 13 21:40:26 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Wed, 13 Jan 2010 20:40:26 +0000
Subject: [lxml-dev] Python module to make / modify / create docx files
Message-ID: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>
Hi LXML folks,
Just a short note to say I've made a Python module to create, modify, query
and extract text from Microsoft Word 2007 docx files - using everyone's
favorite Python XML module (that's LXML of course).
If you're interested, check out
http://github.com/mikemaccana/python-docx for a full feature list.
Mike
From martin at ozoneonline.com Wed Jan 13 21:27:16 2010
From: martin at ozoneonline.com (Martin Fisher)
Date: Wed, 13 Jan 2010 12:27:16 -0800
Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard...
Message-ID: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
Hi Guys,
I think I've searched diligently but can't find a good solution:
I'm seeing the following error when loading docx...
ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
Expected in: flat namespace
in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
I used easy_install to install lxml and macports to load libxml2,libxslt after cleaning/removing all versions.
Any clues where to look will be appreciated.
Thanks
Martin
Ozone
Online
Martin Fisher
Title: CTO
phone: 415-692-4182
email: martin at ozoneonline.com
fax: 415-771-5530
From stefan_ml at behnel.de Thu Jan 14 09:52:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 14 Jan 2010 09:52:33 +0100
Subject: [lxml-dev] Python module to make / modify / create docx files
In-Reply-To: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>
References: <73d18a591001131240w142a3a3amf8f3b79e5d050738@mail.gmail.com>
Message-ID: <4B4EDB51.4090001@behnel.de>
MacCana, 13.01.2010 21:40:
> Just a short note to say I've made a Python module to create, modify, query
> and extract text from Microsoft Word 2007 docx files - using everyone's
> favorite Python XML module (that's LXML of course).
>
> If you're interested, check out
> http://github.com/mikemaccana/python-docx for a full feature list.
Thanks for sharing that, I'll add a link to the lxml 'who uses it' FAQ.
Stefan
From mike.maccana at gmail.com Sat Jan 16 23:44:48 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Sat, 16 Jan 2010 22:44:48 +0000
Subject: [lxml-dev] Namespaces on attribute values
Message-ID: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
Hi Stefan and others,
I'm currently creating a particularly tricky element from a string, using:
etree.fromstring('''2010-01-01T21:07:00Z''')
which works fine. However I'd like to make the element manually. This bit is
causing me trouble:
xsi:type="dcterms:W3CDTF"
Is 'dcterms' a namespace on an attribute value? If so, how can I set it in
LXML? I haven't seen that before, and I can't find much documentation
online. I'm quite comfortable with namespaces on tags and elements. Here's
what I currently have:
bar = etree.Element('{http://purl.org/dc/terms/}' + 'created')
bar.set('{http://www.w3.org/2001/XMLSchema-instance}' + 'type',
        'dcterms:W3CDTF')
bar.text = '2010-01-01T21:07:00Z'
Alas the app that parses my XML doesn't like it - though the fromstring() is
fine. Any way I can set the namespace on attribute value?
Thanks very much for any help,
Mike
From stefan_ml at behnel.de Sun Jan 17 08:48:23 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 08:48:23 +0100
Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard...
In-Reply-To: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
References: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
Message-ID: <4B52C0C7.3090302@behnel.de>
Martin Fisher, 13.01.2010 21:27:
> I think I've searched diligently but can't find a good solution:
>
> I'm seeing the following error when loading docx...
>
> ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
> Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
> Expected in: flat namespace
> in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
That's a symbol from libxml2 that it can't find. Something's wrong with
your installation.
> I used easy_install to install lxml and macports to load libxml2,libxslt after cleaning/removing all versions.
It shouldn't use macports to provide the libraries. See here for
installation instructions:
http://codespeak.net/lxml/installation.html#installation
Stefan
From stefan_ml at behnel.de Sun Jan 17 08:53:02 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 08:53:02 +0100
Subject: [lxml-dev] FW: lxml 2.2.4 on python3.1,
Windows XP gives importerror
In-Reply-To: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>
References: <7C57DB58C81FB64CA979EC6C4DB73E7A03E70C92@ZMY16EXM67.ds.mot.com>
Message-ID: <4B52C1DE.8010400@behnel.de>
VYAS ASHISH M-NTB837, 12.01.2010 10:33:
> I have Python 3.1 installed on Windows XP and it works fine.
> I downloaded lxml 2.2.4 (lxml-2.2.4.win32-py3.1.exe) from pypi.
>
> When I try:
> from lxml import etree
> I get:
> ImportError: DLL load failed: This application has failed to start
> because the application configuration is incorrect. Reinstalling the
> application may fix this problem.
I can't extract much information from that error message, except that the
lxml.etree module failed to load.
Do others experience the same problem with the 3.1 installer?
> For information: 'import lxml' works fine.
That's because it doesn't load any DLLs.
> After reinstalling python3.1 also the error message is the same. Any
> help is appreciated!
Did you try reinstalling lxml?
Stefan
From stefan_ml at behnel.de Sun Jan 17 09:00:01 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 09:00:01 +0100
Subject: [lxml-dev] Namespaces on attribute values
In-Reply-To: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
Message-ID: <4B52C381.1050304@behnel.de>
Mike MacCana, 16.01.2010 23:44:
> Hi Stefan and others,
>
> I'm currently creating a particularly tricky element from a string, using:
>
> etree.fromstring(''' xsi:type="dcterms:W3CDTF">2010-01-01T21:07:00Z''')
>
> which works fine. However I'd like to make the element manually. This bit is
> causing me trouble:
>
> xsi:type="dcterms:W3CDTF"
>
> is 'dcterms' a namespace on a attribute value? If so, how can I set it in
> LXML? I haven't seen that before, and I can't find much documentation
> online. I'm quite comfortable with namespaces on tags and elements. Here's
> what I currently have:
>
> bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
> bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
> 'dcterms:W3CDTF')
> bar.text = '2010-01-01T21:07:00Z'
>
> Alas the app that parses my XML doesn't like it - though the fromstring() is
> fine. Any way I can set the namespace on attribute value?
It's not a namespace (i.e. a URI), it's just the associated prefix. You
have to extract it from the Element you set the namespace on ('prefix'
attribute) or use an 'nsmap' to specify a prefix for the namespace.
http://codespeak.net/lxml/tutorial.html#namespaces
If you don't provide a prefix yourself, lxml.etree will use a default name
like 'ns0', which doesn't correspond with the 'dcterms' you use in your value.
Stefan
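
A minimal sketch of the 'nsmap' route described above. The variable names are illustrative; the two URIs are the ones from the original post. The value "dcterms:W3CDTF" is plain text as far as lxml is concerned, so the point is only to make sure the element declares the 'dcterms' prefix that the value refers to.

```python
from lxml import etree

DCTERMS = "http://purl.org/dc/terms/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Choose the prefixes explicitly instead of letting lxml pick 'ns0', 'ns1'.
nsmap = {"dcterms": DCTERMS, "xsi": XSI}

created = etree.Element("{%s}created" % DCTERMS, nsmap=nsmap)
# The attribute *name* is namespaced; the *value* is an ordinary string
# that happens to start with a prefix declared on this element.
created.set("{%s}type" % XSI, "dcterms:W3CDTF")
created.text = "2010-01-01T21:07:00Z"

print(created.prefix)  # dcterms
print(etree.tostring(created).decode())
```

The consuming application is then expected to resolve the prefix in the value against the declarations in scope, which is why it has to match.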
From mike.maccana at gmail.com Sun Jan 17 12:19:16 2010
From: mike.maccana at gmail.com (Mike MacCana)
Date: Sun, 17 Jan 2010 11:19:16 +0000
Subject: [lxml-dev] Namespaces on attribute values
In-Reply-To: <4B52C381.1050304@behnel.de>
References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
<4B52C381.1050304@behnel.de>
Message-ID: <73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com>
On Sun, Jan 17, 2010 at 8:00 AM, Stefan Behnel wrote:
>
> Mike MacCana, 16.01.2010 23:44:
> > Hi Stefan and others,
> >
> > I'm currently creating a particularly tricky element from a string,
> using:
> >
> > etree.fromstring(''' > xsi:type="dcterms:W3CDTF">2010-01-01T21:07:00Z''')
> >
> > which works fine. However I'd like to make the element manually. This bit
> is
> > causing me trouble:
> >
> > xsi:type="dcterms:W3CDTF"
> >
> > is 'dcterms' a namespace on a attribute value? If so, how can I set it in
> > LXML? I haven't seen that before, and I can't find much documentation
> > online. I'm quite comfortable with namespaces on tags and elements.
> Here's
> > what I currently have:
> >
> > bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
> > bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
> > 'dcterms:W3CDTF')
> > bar.text = '2010-01-01T21:07:00Z'
> >
> > Alas the app that parses my XML doesn't like it - though the fromstring()
> is
> > fine. Any way I can set the namespace on attribute value?
>
> It's not a namespace (i.e. a URI), it's just the associated prefix. You
> have to extract it from the Element you set the namespace on ('prefix'
> attribute) or use an 'nsmap' to specify a prefix for the namespace.
>
> http://codespeak.net/lxml/tutorial.html#namespaces
>
> If you don't provide a prefix yourself, lxml.etree will use a default name
> like 'ns0', which doesn't correspond with the 'dcterms' you use in your
> value.
>
> Stefan
>
>
Thanks - I should clarify that my aim is to set the right namespace on the
attribute's value - ns0 is fine if it points to the right URI. My question
is, how can I set the right namespace on what appears to be an attributes
*value*?
I've read the namespace tutorial, and understand:
- Namespace of tag itself is "http://purl.org/dc/terms/". Cool.
- Namespace of the attribute is "http://www.w3.org/2001/XMLSchema-instance".
Cool.
But how can I set a namespace (preferably directly or via a prefix) on an
attribute value (not the attribute)? Trying:
bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', '{
http://purl.org/dc/terms/}'+'W3CDTF')
Doesn't seem to give me any luck either...
Mike
From gilles.lenfant at gmail.com Sun Jan 17 12:44:44 2010
From: gilles.lenfant at gmail.com (Gilles Lenfant)
Date: Sun, 17 Jan 2010 12:44:44 +0100
Subject: [lxml-dev] Problems loading lxml in MacOS 2.6 Snow Leopard...
In-Reply-To: <4B52C0C7.3090302@behnel.de>
References: <4EC939AE-7FF1-4FF4-877A-F67A9DFCE193@ozoneonline.com>
<4B52C0C7.3090302@behnel.de>
Message-ID: <7c3325691001170344u4b5313d2u99d419d09662c54a@mail.gmail.com>
BTW (perhaps off topic, but...)
Does anybody know how to do the equivalent of this in a buildout?
"STATIC_DEPS=true easy_install lxml"
Do I need to write a recipe for this, or is there an out-of-the-box method I didn't find?
Thanks in advance.
--
Gilles Lenfant
2010/1/17 Stefan Behnel :
>
> Martin Fisher, 13.01.2010 21:27:
>> I think I've searched diligently but can't find a good solution:
>>
>> I'm seeing the following error when loading docx...
>>
>> ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
>>   Referenced from: /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
>>   Expected in: flat namespace
>>  in /Library/Python/2.6/site-packages/lxml-2.2.4-py2.6-macosx-10.6-universal.egg/lxml/etree.so
>
> That's a symbol from libxml2 that it can't find. Something's wrong with
> your installation.
>
>
>> I used easy_install to install lxml and macports to load libxml2,libxslt after cleaning/removing all versions.
>
> It shouldn't use macports to provide the libraries. See here for
> installation instructions:
>
> http://codespeak.net/lxml/installation.html#installation
>
> Stefan
From stefan_ml at behnel.de Sun Jan 17 15:00:56 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 17 Jan 2010 15:00:56 +0100
Subject: [lxml-dev] Namespaces on attribute values
In-Reply-To: <73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com>
References: <73d18a591001161444p71e6d715ld840ab49b05d58b6@mail.gmail.com>
<4B52C381.1050304@behnel.de>
<73d18a591001170319j100fbf1ei7f2d449cbee924a3@mail.gmail.com>
Message-ID: <4B531818.1030406@behnel.de>
Mike MacCana, 17.01.2010 12:19:
> On Sun, Jan 17, 2010 at 8:00 AM, Stefan Behnel wrote:
>> Mike MacCana, 16.01.2010 23:44:
>>> xsi:type="dcterms:W3CDTF"
>>>
>>> is 'dcterms' a namespace on a attribute value? If so, how can I set it in
>>> LXML? I haven't seen that before, and I can't find much documentation
>>> online. I'm quite comfortable with namespaces on tags and elements.
>>> Here's what I currently have:
>>>
>>> bar = etree.Element('{http://purl.org/dc/terms/}'+'created')
>>> bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
>>> 'dcterms:W3CDTF')
>>> bar.text = '2010-01-01T21:07:00Z'
>>>
>>> Alas the app that parses my XML doesn't like it - though the fromstring()
>> is
>>> fine. Any way I can set the namespace on attribute value?
>>
>> It's not a namespace (i.e. a URI), it's just the associated prefix. You
>> have to extract it from the Element you set the namespace on ('prefix'
>> attribute)
I might have been unclear here. Elements have a 'prefix' attribute that
gives you the prefix they use for their namespace.
>> http://codespeak.net/lxml/tutorial.html#namespaces
>>
>> If you don't provide a prefix yourself, lxml.etree will use a default name
>> like 'ns0', which doesn't correspond with the 'dcterms' you use in your
>> value.
>
> Thanks - I should clarify that my aim is to set the right namespace on the
> attribute's value - ns0 is fine if it points to the right URI. My question
> is, how can I set the right namespace on what appears to be an attributes
> *value*?
>
> I've read the namespace tutorial, and understand:
> - Namespace of tag itself is "http://purl.org/dc/terms/". Cool.
> - Namespace of the attribute is "http://www.w3.org/2001/XMLSchema-instance".
> Cool.
> But how can I set a namespace (preferably directly or via a prefix) on an
> attribute value (not the attribute)? Trying:
>
> bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type', '{
> http://purl.org/dc/terms/}'+'W3CDTF')
There is a feature I had almost forgotten about, I'm not even sure it's
documented anywhere (doc patches welcome). You can do this:
bar.set('{http://www.w3.org/2001/XMLSchema-instance}'+'type',
etree.QName('http://purl.org/dc/terms/', 'W3CDTF'))
'QName' means 'qualified name' and works wherever lxml.etree accepts a tag
name, with attribute values as a special feature for exactly your use case.
Note that lxml.etree cannot keep the prefix used in the (now plain text)
attribute value up to date during tree changes, so if (e.g.) you append the
element above to a tree that already defines the namespace under a
different prefix, the attribute value will not get updated and may lose
its meaning if the namespace declarations get reassigned.
Stefan
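A minimal sketch of the QName approach described above, combined with an
'nsmap' so that the prefix is 'dcterms' rather than a generated one. The
variable names (DCTERMS, XSI, xsi_type, by_hand) are only illustrative and
not part of the original code:

```python
from lxml import etree

DCTERMS = "http://purl.org/dc/terms/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Declaring the prefixes up front keeps lxml from inventing 'ns0'.
bar = etree.Element("{%s}created" % DCTERMS,
                    nsmap={"dcterms": DCTERMS, "xsi": XSI})

# A QName used as an attribute value is resolved to 'prefix:local' text,
# based on the namespace declarations visible from the element.
bar.set("{%s}type" % XSI, etree.QName(DCTERMS, "W3CDTF"))
bar.text = "2010-01-01T21:07:00Z"

xsi_type = bar.get("{%s}type" % XSI)

# The same value built by hand from the element's own 'prefix' attribute.
by_hand = "%s:W3CDTF" % bar.prefix
```

As noted above, the resulting value is plain text, so it is not updated if
the element is later moved under different namespace declarations.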
From dakota at brokenpipe.ru Sun Jan 17 21:12:01 2010
From: dakota at brokenpipe.ru (Marat Dakota)
Date: Sun, 17 Jan 2010 23:12:01 +0300
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To:
References:
<4AF923A2.8010006@behnel.de>
<4AF9852D.3020408@behnel.de>
<4AF992C9.7090400@behnel.de>
<4B3A7F41.6000304@behnel.de>
Message-ID:
Hi,
I wonder if you've noticed my last letter with patch and questions...
--
Marat
From stefan_ml at behnel.de Mon Jan 18 08:21:16 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 18 Jan 2010 08:21:16 +0100
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To:
References:
<4AF923A2.8010006@behnel.de>
<4AF9852D.3020408@behnel.de>
<4AF992C9.7090400@behnel.de>
<4B3A7F41.6000304@behnel.de>
Message-ID: <4B540BEC.5090900@behnel.de>
Marat Dakota, 17.01.2010 21:12:
> I wonder if you've noticed my last letter with patch and questions...
Sorry! Yes, I noticed it, but didn't have the time to reply at the time. I
haven't looked at it yet, but I definitely will. As I said, the last one
looked good already, so I'll see that I get it applied as soon as I get to it.
Thanks!
Stefan
From jholg at gmx.de Mon Jan 18 11:55:28 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 18 Jan 2010 11:55:28 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron
support to lxml
Message-ID: <20100118105528.89660@gmx.net>
Hi,
having played around with schematron a bit more I propose some changes to the current trunk additions for iso-schematron support:
The "validation criteria", i.e. if the validation result is True or False, is currently exposed on a module level:
# svrl result accessors
svrl_validation_errors = _etree.XPath(
'//svrl:failed-assert', namespaces={'svrl': SVRL_NS})
So you can customize the criteria globally.
With schematron however you can use "assert" as well as "report" tests, and also categorize the tests using "flag" and "role" attributes that will show up in the resulting svrl xml document (I do not currently understand the intended difference between the two, but that's another story).
So I think it may well be possible that what's interpreted as a validation error depends very much on how one designs the schematron schema, and it may be helpful to be able to customize validation outcome per validator instance.
This speaks for pulling the result accessor into the Schematron class, probably as a class attribute that can be overridden on an instance level.
The same might make sense for the iso-schematron implementation xsl transformation steps.
Opinions?
Holger
From stefan_ml at behnel.de Mon Jan 18 12:01:15 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 18 Jan 2010 12:01:15 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron
support to lxml
In-Reply-To: <20100118105528.89660@gmx.net>
References: <20100118105528.89660@gmx.net>
Message-ID: <4B543F7B.8000803@behnel.de>
jholg at gmx.de, 18.01.2010 11:55:
> having played around with schematron a bit more I propose some changes to the current trunk additions for iso-schematron support:
>
> The "validation criteria", i.e. if the validation result is True or False, is currently exposed on a module level:
>
> # svrl result accessors
> svrl_validation_errors = _etree.XPath(
> '//svrl:failed-assert', namespaces={'svrl': SVRL_NS})
>
> So you can customize the criteria globally.
>
> With schematron however you can use "assert" as well as "report" tests, and also categorize the tests using "flag" and "role" attributes that will show up in the resulting svrl xml document (I do not currently understand the intended difference between the two, but that's another story).
> So I think it may well be possible that what's interpreted as a validation error depends very much on how one designs the schematron schema, and it may be helpful to be able to customize validation outcome per validator instance.
>
> This speaks for pulling the result accessor into the Schematron class, probably as a class attribute that can be overridden on an instance level.
>
> The same might make sense for the iso-schematron implementation xsl transformation steps.
Sounds like a much better interface. Any interesting global options would
be better overridden by subtyping the validator class, so class attributes
make sense to me.
Stefan
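The design discussed above could look roughly like the following sketch. The
class names SvrlResultChecker and StrictChecker, the method is_valid() and
the sample SVRL document are hypothetical and only illustrate the idea of a
class attribute that subtypes (or instances) can override. This is not the
code in trunk:

```python
from lxml import etree

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"


class SvrlResultChecker(object):
    """Hypothetical validator front-end working on an SVRL result tree."""

    # Class-level default: only failed assertions count as errors.
    validation_errors = etree.XPath(
        '//svrl:failed-assert', namespaces={'svrl': SVRL_NS})

    def is_valid(self, svrl_doc):
        # The accessor is looked up on the instance first, then the class.
        return not self.validation_errors(svrl_doc)


class StrictChecker(SvrlResultChecker):
    # A subtype that also treats successful "report" tests as errors.
    validation_errors = etree.XPath(
        '//svrl:failed-assert | //svrl:successful-report',
        namespaces={'svrl': SVRL_NS})


# Made-up SVRL output that contains a report but no failed assertion.
report_only = etree.fromstring(
    '<svrl:schematron-output xmlns:svrl="%s">'
    '<svrl:successful-report test="x"/>'
    '</svrl:schematron-output>' % SVRL_NS)

default_result = SvrlResultChecker().is_valid(report_only)
strict_result = StrictChecker().is_valid(report_only)
```

With this layout, a single validator can also be customised by assigning a
different XPath object to validation_errors on the instance.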
From jholg at gmx.de Mon Jan 18 12:48:42 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 18 Jan 2010 12:48:42 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron
support to lxml
In-Reply-To: <4B543F7B.8000803@behnel.de>
References: <20100118105528.89660@gmx.net> <4B543F7B.8000803@behnel.de>
Message-ID: <20100118114842.89660@gmx.net>
Hi Stefan,
> > This speaks for pulling the result accessor into the Schematron class,
> probably as a class attribute that can be overridden on an instance level.
> >
> > The same might make sense for the iso-schematron implementation xsl
> transformation steps.
>
> Sounds like a much better interface. Any interesting global options would
> be better overridden by subtyping the validator class, so class attributes
> make sense to me.
I'll change that, then. Do you prefer me making the changes on the iso-schematron branch or directly in trunk?
Holger
From stefan_ml at behnel.de Mon Jan 18 12:52:54 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 18 Jan 2010 12:52:54 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron
support to lxml
In-Reply-To: <20100118114842.89660@gmx.net>
References: <20100118105528.89660@gmx.net> <4B543F7B.8000803@behnel.de>
<20100118114842.89660@gmx.net>
Message-ID: <4B544B96.50803@behnel.de>
jholg at gmx.de, 18.01.2010 12:48:
>>> This speaks for pulling the result accessor into the Schematron class,
>> probably as a class attribute that can be overridden on an instance level.
>>> The same might make sense for the iso-schematron implementation xsl
>> transformation steps.
>>
>> Sounds like a much better interface. Any interesting global options would
>> be better overridden by subtyping the validator class, so class attributes
>> make sense to me.
>
> I'll change that, then. Do you prefer me making the changes on the iso-schematron branch or directly in trunk?
From my POV, the branch is basically dead after the merge, so please change
the trunk only.
Stefan
From animator333 at yahoo.com Fri Jan 22 10:02:12 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Fri, 22 Jan 2010 14:32:12 +0530 (IST)
Subject: [lxml-dev] lxml newbie objectify & subclassing
Message-ID: <11593.3219.qm@web94914.mail.in2.yahoo.com>
Hi,
This is my first post. I have used Python and XML before, but only on a very small scale. This time we are developing a fairly large
application where primary data storage is based on XML. I have been reading the docs on the site and lxml is looking quite
promising, especially the "objectify" module. Just to start with, here is my first question:
from lxml import etree
from lxml import objectify

class Attribute(objectify.ObjectifiedDataElement):
    """"""
    def __init__(self):
        objectify.ObjectifiedDataElement.__init__(self)
        self.set("datatype", "")
        self.set("range", "0.,1.")

    def asXml(self):
        return etree.tostring(self, method="xml", pretty_print=True)

#-------------------------------------------------------------------------------
class FloatAttribute(Attribute):
    """"""
    def __init__(self, tag="float", value=0.):
        Attribute.__init__(self)
        self.tag = tag
        self.set("datatype", "float")
        etree.SubElement(self, "float").text = str(value)

f = FloatAttribute()
print f.asXml()
The above code crashes. What's wrong here?
Prashant
Python 2.6.2
wxPython 2.8.10.1
lxml 2.2.4
XP 32
From animator333 at yahoo.com Fri Jan 22 10:14:38 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Fri, 22 Jan 2010 14:44:38 +0530 (IST)
Subject: [lxml-dev] lxml newbie xpath as connections
Message-ID: <169717.40674.qm@web94904.mail.in2.yahoo.com>
Hi,
This is regarding xpath and making some virtual connections in a xml file.
txt="""
0.10.20.30.40.50.60.10.20.30.40.50.6
/node/diffuse/color[1]/blue
"""
o = objectify.fromstring(txt)
Considering the above example, once is parsed, I could get the value of connection>output using
>> print o.connection.output
but it is just text.
Is it possible to define connection>output in such a way that, once parsed, connection>output refers to the element
that its text path points to?
If not at parse time, then later using xpath. I tried:
>> r = o.xpath(str(o.connection.output))
>> r.pyval
I am getting an empty list. However, if I try:
>> r = o.xpath("/node/emission/color/red")
>> print r
r is a list containing two values, one for each "red" of "color". How do I get precisely:
/node/emission/color[0]/red
Thanks
Prashant
Python 2.6.2
wxPython 2.8.10.1
lxml 2.2.4
XP 32
From jholg at gmx.de Fri Jan 22 10:25:36 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 22 Jan 2010 10:25:36 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <11593.3219.qm@web94914.mail.in2.yahoo.com>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
Message-ID: <20100122092536.155750@gmx.net>
> from lxml import etree
> from lxml import objectify
>
> class Attribute(objectify.ObjectifiedDataElement):
> """"""
> def __init__(self):
> objectify.ObjectifiedDataElement.__init__(self)
> self.set("datatype", "")
> self.set("range", "0.,1.")
>
> def asXml(self):
> return etree.tostring(self, method="xml", pretty_print=True)
>
> #-------------------------------------------------------------------------------
>
>
> class FloatAttribute(Attribute):
> """"""
> def __init__(self, tag="float", value=0.):
> Attribute.__init__(self)
> self.tag = tag
> self.set("datatype", "float")
> etree.SubElement(self, "float").text = str(value)
>
>
> f = FloatAttribute()
> print f.asXml()
>
> The above code crashes. What's wrong here?
Please take a look at
http://codespeak.net/lxml/element_classes.html
especially:
http://codespeak.net/lxml/element_classes.html#element-initialization
Note that Elements get instantiated through the Element()/DataElement factory functions.
Holger
From jholg at gmx.de Fri Jan 22 10:29:39 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 22 Jan 2010 10:29:39 +0100
Subject: [lxml-dev] lxml newbie xpath as connections
In-Reply-To: <169717.40674.qm@web94904.mail.in2.yahoo.com>
References: <169717.40674.qm@web94904.mail.in2.yahoo.com>
Message-ID: <20100122092939.155740@gmx.net>
> I am getting an empty list. How ever if I try:
> >> r = o.xpath("/node/emission/color/red")
> >> print r
>
> r is list containing two values for each "red" of "color". How do I
> precisely get:
>
> /node/emission/color[0]/red
>
Please read up on XPath. Indexing starts with 1 in XPath so use
/node/emission/color[1]/red
Holger
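A small sketch of the indexing point above. The XML document is made up for
illustration, since the original example did not survive in the archive:

```python
from lxml import etree

# Hypothetical document with two <color> elements, each holding a <red>.
root = etree.fromstring(
    "<node><emission>"
    "<color><red>0.1</red></color>"
    "<color><red>0.4</red></color>"
    "</emission></node>")

# Without a predicate, every matching <red> is returned.
both = root.xpath("/node/emission/color/red")

# XPath positions start at 1, so [1] selects the first <color>.
first = root.xpath("/node/emission/color[1]/red")

# [0] never matches, because no node has position 0.
none = root.xpath("/node/emission/color[0]/red")
```

Note that xpath() returns a list even for a single match, so the element
itself is first[0] on the Python side, where indexing starts at 0.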
From stefan_ml at behnel.de Fri Jan 22 12:31:37 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 12:31:37 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <20100122092536.155750@gmx.net>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
<20100122092536.155750@gmx.net>
Message-ID: <4B598C99.5030202@behnel.de>
jholg at gmx.de, 22.01.2010 10:25:
>
>> from lxml import etree
>> from lxml import objectify
>>
>> class Attribute(objectify.ObjectifiedDataElement):
>> """"""
>> def __init__(self):
>> objectify.ObjectifiedDataElement.__init__(self)
>> self.set("datatype", "")
>> self.set("range", "0.,1.")
>>
>> def asXml(self):
>> return etree.tostring(self, method="xml", pretty_print=True)
>>
>> #-------------------------------------------------------------------------------
>>
>>
>> class FloatAttribute(Attribute):
>> """"""
>> def __init__(self, tag="float", value=0.):
>> Attribute.__init__(self)
>> self.tag = tag
>> self.set("datatype", "float")
>> etree.SubElement(self, "float").text = str(value)
>>
>>
>> f = FloatAttribute()
>> print f.asXml()
>>
>> The above code crashes. What's wrong here?
>
> Please take a look at
> http://codespeak.net/lxml/element_classes.html
> especially:
> http://codespeak.net/lxml/element_classes.html#element-initialization
>
> Note that Elements get instantiated through the Element()/DataElement factory functions.
This is true in general, however, the above works in lxml.etree and should
also work in lxml.objectify (and definitely shouldn't just crash).
Stefan
From stefan_ml at behnel.de Fri Jan 22 12:36:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 12:36:19 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <11593.3219.qm@web94914.mail.in2.yahoo.com>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
Message-ID: <4B598DB3.20800@behnel.de>
Hi,
Prashant Saxena, 22.01.2010 10:02:
> This is my first post. I have used python and xml earlier but on a very small scale. This time we are developing a fairly large
> application where primary data storage is based on xml. I have been reading the docs on the site and lxml is looking quite
> promising, specially the "objectify" module. Just to start with here is the first question:
>
> from lxml import etree
> from lxml import objectify
>
> class Attribute(objectify.ObjectifiedDataElement):
> """"""
> def __init__(self):
> objectify.ObjectifiedDataElement.__init__(self)
> self.set("datatype", "")
> self.set("range", "0.,1.")
>
> def asXml(self):
> return etree.tostring(self, method="xml", pretty_print=True)
>
> #-------------------------------------------------------------------------------
>
> class FloatAttribute(Attribute):
> """"""
> def __init__(self, tag="float", value=0.):
> Attribute.__init__(self)
> self.tag = tag
> self.set("datatype", "float")
> etree.SubElement(self, "float").text = str(value)
>
>
> f = FloatAttribute()
> print f.asXml()
>
> The above code crashes. What's wrong here?
>
> Prashant
> Python 2.6.2
> wxPython 2.8.10.1
> lxml 2.2.4
> XP 32
Thanks for the report, I can reproduce this. There seems to be an
unexpected interaction between the __init__ method of ElementBase and the
API of ObjectifiedElement.
This needs to be made more robust (at least). Could you open a ticket in
the bug tracker?
A work-around (and actually the expected usage) is to instantiate the
elements through the factories, as Holger suggested.
Thanks,
Stefan
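A minimal sketch of the factory-based work-around, assuming a plain helper
function instead of a subclass with __init__. The helper name
make_float_attribute() is hypothetical; objectify.Element() is the factory
mentioned above:

```python
from lxml import objectify


def make_float_attribute(tag="float", value=0.0):
    """Hypothetical helper: build the element through the factory."""
    elem = objectify.Element(tag)
    elem.set("datatype", "float")
    elem.set("range", "0.,1.")
    # Attribute assignment creates a <float> child that holds the value.
    elem.float = value
    return elem


f = make_float_attribute(value=0.25)
```

The child value is then available as a Python float through f.float.pyval,
so no manual string conversion is needed on the reading side.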
From stefan_ml at behnel.de Fri Jan 22 14:14:36 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 14:14:36 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <101858.8586.qm@web94906.mail.in2.yahoo.com>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
<4B598DB3.20800@behnel.de>
<101858.8586.qm@web94906.mail.in2.yahoo.com>
Message-ID: <4B59A4BC.7080901@behnel.de>
Prashant Saxena, 22.01.2010 13:53:
> The bug is reported in bug tracker.
> BTW, the code below is not working because of a bug or this feature is not been implemented?
It's a bug because of a feature that is implemented in lxml.etree but not
in lxml.objectify, so it's a bit of both. ;)
> As of now, I would prefer to use as it is more pythonic & easy to implement, compare to methods explained
> here:
> http://codespeak.net/lxml/element_classes.html#element-initialization
Not sure what you mean exactly. The ElementBase class is only meant for
subtyping, not for direct usage. Could you provide some background about
your use case?
Stefan
From jholg at gmx.de Fri Jan 22 15:06:41 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 22 Jan 2010 15:06:41 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <4B598C99.5030202@behnel.de>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
<20100122092536.155750@gmx.net> <4B598C99.5030202@behnel.de>
Message-ID: <20100122140641.155740@gmx.net>
Hi,
> >> The above code crashes. What's wrong here?
> >
> > Please take a look at
> > http://codespeak.net/lxml/element_classes.html
> > especially:
> > http://codespeak.net/lxml/element_classes.html#element-initialization
> >
> > Note that Elements get instantiated through the Element()/DataElement
> factory functions.
> This is true in general, however, the above works in lxml.etree and should
> also work in lxml.objectify (and definitely shouldn't just crash).
Looks like I'm missing something. What about
"
There is one thing to know up front. Element classes *must not* have an __init__ or __new__ method.
"
? Has this requirement been relaxed?
Holger
From jholg at gmx.de Fri Jan 22 15:07:58 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 22 Jan 2010 15:07:58 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <180543.6188.qm@web94913.mail.in2.yahoo.com>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
<20100122092536.155750@gmx.net>
<180543.6188.qm@web94913.mail.in2.yahoo.com>
Message-ID: <20100122140758.155730@gmx.net>
(cc-ing the list)
-------- Original-Nachricht --------
> Datum: Fri, 22 Jan 2010 16:11:38 +0530 (IST)
> Von: Prashant Saxena
> An: jholg at gmx.de
> Betreff: Re: [lxml-dev] lxml newbie objectify & subclassing
> Thanks for a quick reply.
>
>
> Why I need sub classing is because the application needs various types of
> custom data types(elements), such as vector, matrix,
> color etc. A collections of these data types is a node. These attributes
> are created from custom xml files at run time, stored in a node, value of an
> element is changed/edited using front end gui and then node is saved to
> disk(xml format). You can again load the node from disk create all the
> attribute at run time.
>
> Considering above scenario, the simplest attribute that represents a float
> element is created from class.
>
> Instead of using , If I use
> , there are no errors and code is working fine.
>
> The only draw back is that I have to convert string to pydata types myself
> to hook with the gui, which is not difficult at all.
>
> If you do have some suggestions then please let me know.
>
> Thanks
>
> Prashant
>
>
>
>
> ----- Original Message ----
> From: "jholg at gmx.de"
> To: Prashant Saxena ; lxml-dev at codespeak.net
> Sent: Fri, 22 January, 2010 2:55:36 PM
> Subject: Re: [lxml-dev] lxml newbie objectify & subclassing
>
>
>
> > from lxml import etree
> > from lxml import objectify
> >
> > class Attribute(objectify.ObjectifiedDataElement):
> > """"""
> > def __init__(self):
> > objectify.ObjectifiedDataElement.__init__(self)
> > self.set("datatype", "")
> > self.set("range", "0.,1.")
> >
> > def asXml(self):
> > return etree.tostring(self, method="xml", pretty_print=True)
> >
> >
> #-------------------------------------------------------------------------------
> >
> >
> > class FloatAttribute(Attribute):
> > """"""
> > def __init__(self, tag="float", value=0.):
> > Attribute.__init__(self)
> > self.tag = tag
> > self.set("datatype", "float")
> > etree.SubElement(self, "float").text = str(value)
> >
> >
> > f = FloatAttribute()
> > print f.asXml()
> >
> > The above code crashes. What's wrong here?
>
> Please take a look at
> http://codespeak.net/lxml/element_classes.html
> especially:
> http://codespeak.net/lxml/element_classes.html#element-initialization
>
> Note that Elements get instantiated through the Element()/DataElement
> factory functions.
>
> Holger
From stefan_ml at behnel.de Fri Jan 22 15:41:59 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 15:41:59 +0100
Subject: [lxml-dev] lxml newbie objectify & subclassing
In-Reply-To: <20100122140641.155740@gmx.net>
References: <11593.3219.qm@web94914.mail.in2.yahoo.com>
<20100122092536.155750@gmx.net> <4B598C99.5030202@behnel.de>
<20100122140641.155740@gmx.net>
Message-ID: <4B59B937.7030703@behnel.de>
jholg at gmx.de, 22.01.2010 15:06:
>>>> The above code crashes. What's wrong here?
>>> Please take a look at
>>> http://codespeak.net/lxml/element_classes.html
>>> especially:
>>> http://codespeak.net/lxml/element_classes.html#element-initialization
>>>
>>> Note that Elements get instantiated through the Element()/DataElement
>> factory functions.
>
>
>> This is true in general, however, the above works in lxml.etree and should
>> also work in lxml.objectify (and definitely shouldn't just crash).
>
> Looks like I'm missing something. What about
> "
> There is one thing to know up front. Element classes *must not* have an __init__ or __new__ method.
> "
> ? Has this requirement been relaxed?
Sort-of. It works if you call the __init__ method of the superclass first
thing in your subtype, and it will actually do The Right Thing, i.e. create
a new element.
It will still not get called when lxml.etree instantiates an element proxy
behind the scenes, so the warning is still true in the sense that __init__
may not have been called on an Element proxy when lxml.etree returns it. As
long as you only change the XML element in __init__ (as in the code that
Prashant presented) and do not keep any local state in the class, you're
fine, though.
However, it crashes in this specific case because the way the __init__
method in ElementBase is implemented accesses the object before it is
initialised, so this *is* a bug.
I'm not sure how to fix this yet, but I'm considering adding more safety
checks to the API in general. I'll see.
Stefan
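A sketch of the pattern described above for plain lxml.etree (not
lxml.objectify, where the crash occurs): the subtype calls the superclass
__init__ first and afterwards only modifies the XML element, without keeping
local state. The class name and the TAG value are illustrative:

```python
from lxml import etree


class FloatAttribute(etree.ElementBase):
    TAG = "float"  # tag name to use instead of the class name

    def __init__(self, value=0.0):
        # Must come first: this is what actually creates the XML element.
        etree.ElementBase.__init__(self)
        # Only the XML content is changed here, no Python-level state.
        self.set("datatype", "float")
        self.text = str(value)


f = FloatAttribute(0.5)
serialized = etree.tostring(f)
```

As explained above, __init__ is not called when lxml.etree creates a proxy
for an existing element behind the scenes, so anything stored on self
outside the XML tree would not be reliable.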
From stefan_ml at behnel.de Fri Jan 22 23:00:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 22 Jan 2010 23:00:19 +0100
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To:
References:
<4AF7C843.6050509@behnel.de>
<4AF923A2.8010006@behnel.de>
<4AF9852D.3020408@behnel.de>
<4AF992C9.7090400@behnel.de>
<4B3A7F41.6000304@behnel.de>
Message-ID: <4B5A1FF3.7040703@behnel.de>
Marat Dakota, 12.01.2010 14:05:
>> Thanks a lot, it looks reasonable at first glance and I'll take a closer
>> look as soon as I get to it. If it works well, it should make it into 2.3.
>
> Is there a roadmap date for 2.3 release?
Not yet, no.
>> Could you add a couple of tests to src/lxml/tests/test_xslt.py? That would
>> help in making sure that it keeps working as expected even if I find that I
>> need to rework the patch.
>>
>
> I've added tests, I've also renamed variables to fit your code better and
> added possibility to evaluate extension element's content directly to
> _AppendOnlyElementProxy as well as to _Element. It looks like I'm satisfied
> with the code now. I wonder what will you say about it.
Hmm, and did you *run* the tests? The test code actually contains obvious
errors (such as non well-formed XML), so I wonder how you tested it at all.
After fixing the tests, they even crash on my machine. So, sorry, but this
patch isn't in an acceptable state.
Could you please open up a ticket on launchpad for this? That would make it
easier to track the progress of this patch.
Stefan
From animator333 at yahoo.com Sat Jan 23 16:20:28 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Sat, 23 Jan 2010 20:50:28 +0530 (IST)
Subject: [lxml-dev] pretty_print with tail
Message-ID: <80093.83351.qm@web94909.mail.in2.yahoo.com>
Hi,
A test example prints:
0.231kFloat0.326kFloat0.921kFloatkColor
I am using:
etree.tostring(root, method="xml", pretty_print=True)
Maybe because every element/child has a tail text.
Is it possible to format the output in this way:
0.231kFloat
0.326kFloat
0.921kFloat
kColor
Thanks
Prashant
Python 2.6.2
lxml 2.2.4
wxPython 2.8.10.1
XP 32
From stefan_ml at behnel.de Sat Jan 23 17:53:50 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 23 Jan 2010 17:53:50 +0100
Subject: [lxml-dev] pretty_print with tail
In-Reply-To: <80093.83351.qm@web94909.mail.in2.yahoo.com>
References: <80093.83351.qm@web94909.mail.in2.yahoo.com>
Message-ID: <4B5B299E.2080403@behnel.de>
Prashant Saxena, 23.01.2010 16:20:
> A test example prints:
> 0.231kFloat0.326kFloat0.921kFloatkColor
>
> I am using:
> etree.tostring(root, method="xml", pretty_print=True)
>
> May be because every element/children has a tail text.
>
> Is it possible to format the output in this way:
>
>
>
> 0.231kFloat
> 0.326kFloat
> 0.921kFloat
> kColor
>
http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Stefan
From animator333 at yahoo.com Sat Jan 23 18:56:20 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Sat, 23 Jan 2010 23:26:20 +0530 (IST)
Subject: [lxml-dev] pretty_print with tail
In-Reply-To: <4B5B299E.2080403@behnel.de>
References: <80093.83351.qm@web94909.mail.in2.yahoo.com>
<4B5B299E.2080403@behnel.de>
Message-ID: <246745.27582.qm@web94903.mail.in2.yahoo.com>
Prashant Saxena, 23.01.2010 16:20:
> A test example prints:
> 0.231kFloat0.326kFloat0.921kFloatkColor
>
> I am using:
> etree.tostring(root, method="xml", pretty_print=True)
>
> May be because every element/children has a tail text.
>
> Is it possible to format the output in this way:
>
>
>
> 0.231kFloat
> 0.326kFloat
> 0.921kFloat
> kColor
>
http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Parsing an XML file from disk that was written as above is not a problem; the output is preserved as is.
What I am interested in is writing it to disk in that form.
from lxml import etree
root = etree.Element("emmision")
color = etree.SubElement(root, "color")
color.tail = "kColor"
red = etree.SubElement(color, "red")
green = etree.SubElement(color, "green")
blue = etree.SubElement(color, "blue")
red.text = "0.231"
green.text = "0.326"
blue.text = "0.291"
red.tail = "kFloat"
green.tail = "kFloat"
blue.tail = "kFloat"
print etree.tostring(root, method="xml", pretty_print=True)
This code prints everything on a single line, which is hard to read. Do I have to write a custom function to parse the string and
print it as needed?
Prashant
Python 2.6.2
lxml 2.2.4
From animator333 at yahoo.com Sat Jan 23 19:01:43 2010
From: animator333 at yahoo.com (Prashant Saxena)
Date: Sat, 23 Jan 2010 23:31:43 +0530 (IST)
Subject: [lxml-dev] conditional pretty_print
Message-ID: <333445.56710.qm@web94912.mail.in2.yahoo.com>
Hi,
In short:
While printing,
1. Ignore *all* attributes(keys()) of every element.
2. Ignore *certain* attributes(keys()) of every element.
3. Ignore *certain* attributes(keys()) of element with *tag* .
Prashant
Python 2.6.2
lxml 2.2.4
From stefan_ml at behnel.de Sun Jan 24 12:37:06 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 24 Jan 2010 12:37:06 +0100
Subject: [lxml-dev] pretty_print with tail
In-Reply-To: <246745.27582.qm@web94903.mail.in2.yahoo.com>
References: <80093.83351.qm@web94909.mail.in2.yahoo.com> <4B5B299E.2080403@behnel.de>
<246745.27582.qm@web94903.mail.in2.yahoo.com>
Message-ID: <4B5C30E2.8000800@behnel.de>
Prashant Saxena, 23.01.2010 18:56:
>
> Prashant Saxena, 23.01.2010 16:20:
>> A test example prints:
>> 0.231kFloat0.326kFloat0.921kFloatkColor
>>
>> I am using:
>> etree.tostring(root, method="xml", pretty_print=True)
>>
>> May be because every element/children has a tail text.
>>
>> Is it possible to format the output in this way:
>>
>>
>>
>> 0.231kFloat
>> 0.326kFloat
>> 0.921kFloat
>> kColor
>>
>
> http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
> [...]
> This code prints every thing in a single line & it's hard to read. Do I have to write a custom function to parse the string and
> print as needed?
See the last paragraph of the section I linked above.
Stefan
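
A minimal sketch of one manual approach, not taken from the FAQ itself: since the tail text here carries data and cannot simply be dropped, the line breaks can be appended to the existing text and tails by hand. The helper name break_after_tails and the two-space indent are assumptions made for illustration, and note that the added whitespace becomes part of the tail text when the file is parsed again. Written for Python 3.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib Element API is sufficient for this sketch
    import xml.etree.ElementTree as etree


def break_after_tails(root, indent="  "):
    """Append line breaks by hand while keeping each element's tail text."""
    def walk(elem, level):
        children = list(elem)
        if children:
            # line break and indentation before the first child
            elem.text = (elem.text or "") + "\n" + indent * (level + 1)
        for position, child in enumerate(children):
            walk(child, level + 1)
            is_last = position == len(children) - 1
            next_level = level if is_last else level + 1
            # keep the tail data, then start a new line after it
            child.tail = (child.tail or "") + "\n" + indent * next_level
    walk(root, 0)


root = etree.Element("emmision")
color = etree.SubElement(root, "color")
color.tail = "kColor"
for name, value in (("red", "0.231"), ("green", "0.326"), ("blue", "0.291")):
    channel = etree.SubElement(color, name)
    channel.text = value
    channel.tail = "kFloat"

break_after_tails(root)
output = etree.tostring(root).decode("ascii")
print(output)
```

The pretty_print option is deliberately left off here, because the whitespace is already placed explicitly.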
From stefan_ml at behnel.de Sun Jan 24 19:36:42 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 24 Jan 2010 19:36:42 +0100
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To: <4B540BEC.5090900@behnel.de>
References: <4AF923A2.8010006@behnel.de> <4AF9852D.3020408@behnel.de> <4AF992C9.7090400@behnel.de> <4B3A7F41.6000304@behnel.de>
<4B540BEC.5090900@behnel.de>
Message-ID: <4B5C933A.4010803@behnel.de>
Hi,
Stefan Behnel, 18.01.2010 08:21:
> Marat Dakota, 17.01.2010 21:12:
>> I wonder if you've noticed my last letter with patch and questions...
>
> Sorry! Yes, I noticed it, but didn't have the time to reply at the time. I
> haven't looked at it yet, but I definitely will. As I said, the last one
> looked good already, so I'll see that I get it applied as soon as I get to it.
I have committed an extended version of the patch to the trunk. Please
review the new API to see if it works for you.
https://codespeak.net/viewvc/?view=rev&revision=70799
Stefan
From dakota at brokenpipe.ru Mon Jan 25 09:04:36 2010
From: dakota at brokenpipe.ru (Marat Dakota)
Date: Mon, 25 Jan 2010 11:04:36 +0300
Subject: [lxml-dev] Evaluate XSLT directives inside custom XSLT
extension elements
In-Reply-To: <4B5C933A.4010803@behnel.de>
References:
<4AF9852D.3020408@behnel.de>
<4AF992C9.7090400@behnel.de>
<4B3A7F41.6000304@behnel.de>
<4B540BEC.5090900@behnel.de> <4B5C933A.4010803@behnel.de>
Message-ID:
>
> I have committed an extended version of the patch to the trunk. Please
> review the new API to see if it works for you.
>
> https://codespeak.net/viewvc/?view=rev&revision=70799
Thanks so much! I'll celebrate it :)
From stefan_ml at behnel.de Mon Jan 25 12:49:29 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 25 Jan 2010 12:49:29 +0100
Subject: [lxml-dev] conditional pretty_print
In-Reply-To: <333445.56710.qm@web94912.mail.in2.yahoo.com>
References: <333445.56710.qm@web94912.mail.in2.yahoo.com>
Message-ID: <4B5D8549.4020300@behnel.de>
Prashant Saxena, 23.01.2010 19:01:
> In short:
Too short, I guess.
> While printing,
>
> 1. Ignore *all* attributes(keys()) of every element.
> 2. Ignore *certain* attributes(keys()) of every element.
> 3. Ignore *certain* attributes(keys()) of element with *tag* .
If the above is intended to describe a custom serialisation scheme, I
assume you want to use this scheme to serialise an XML tree, right?
Two ways to do that:
1) strip all unwanted information from the tree before serialising
2) roll your own serialiser (IIRC, the ElementTree docs mention this somewhere)
Which way is better for you is mostly dependent on whether you want an
opt-in or opt-out solution, I guess.
Stefan
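
A minimal sketch of option 1 (stripping before serialising), assuming the three cases from the original question. The function name strip_attributes and its parameters are hypothetical, chosen only for this illustration; the tree is copied first so the original keeps its attributes.

```python
import copy

try:
    from lxml import etree
except ImportError:  # the stdlib Element API is sufficient for this sketch
    import xml.etree.ElementTree as etree


def strip_attributes(root, drop_all=False, drop_keys=(), drop_by_tag=None):
    """Return a copy of the tree with the unwanted attributes removed."""
    drop_by_tag = drop_by_tag or {}
    stripped = copy.deepcopy(root)  # leave the original tree untouched
    for elem in stripped.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        if drop_all:
            elem.attrib.clear()
            continue
        for key in list(drop_keys) + list(drop_by_tag.get(elem.tag, ())):
            elem.attrib.pop(key, None)
    return stripped


root = etree.fromstring('<scene id="1" note="x"><light id="2" note="y"/></scene>')
no_attrs = strip_attributes(root, drop_all=True)                    # case 1
no_notes = strip_attributes(root, drop_keys=["note"])               # case 2
light_ids = strip_attributes(root, drop_by_tag={"light": ["id"]})   # case 3
print(etree.tostring(no_notes))
```

This is the opt-out variant; an opt-in variant would instead keep only a whitelist of keys.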
From optilude+lists at gmail.com Thu Jan 28 14:52:21 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Thu, 28 Jan 2010 21:52:21 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
Message-ID:
Hi,
I'm trying to use lxml to conditionally insert an <esi:include /> tag
into an HTML document.
The document is parsed with the HTML parser and manipulated in various
ways. At one point, I search for a node ('placeholder') and want to
replace it with something that renders to:
The code I used looks like this:
nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
root.nsmap.update(nsmap) # root is the <html> element
esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
esiNode.set('src', url)
placeholder.addnext(esiNode) # placeholder is later removed
There are two problems with this:
- The xmlns:esi ends up on the <esi:include /> tag instead of the HTML
root. Varnish doesn't like this apparently.
- The <esi:include /> tag is not self-closing when rendered with
html.tostring (using etree.tostring is not really an option as other
things are going on which want html rendering). Varnish likes this even
less.
Thus:
What can I do to push the namespace declaration up to the top node
('root') and make the tag self-closing?
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From stefan_ml at behnel.de Thu Jan 28 17:04:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 28 Jan 2010 17:04:19 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References:
Message-ID: <4B61B583.50003@behnel.de>
Hi Martin,
Martin Aspeli, 28.01.2010 14:52:
> I'm trying to use lxml to conditionally insert an tag
> into an HTML document.
First problem: HTML is not namespace aware - namespaces in HTML are
underdefined at best (and they certainly were not well defined back in
2001, when the ESI spec appeared).
> The document is parsed with the HTML parser and manipulated in various
> ways. At one point, I search for a node ('placeholder') and want to
> replace it with something that renders to:
>
>
>
> The code I used looks like this:
>
> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
> root.nsmap.update(nsmap) # root is the element
Updating the nsmap property has no effect. I've updated the docstring
appropriately.
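
A short sketch of what "no effect" means in practice, assuming current lxml behaviour where the nsmap property hands back a freshly built dict on each access:

```python
from lxml import etree

ESI_NS = "http://www.edge-delivery.org/esi/1.0"

root = etree.Element("html")
root.nsmap.update({"esi": ESI_NS})       # only changes a throwaway dict
still_missing = "esi" not in root.nsmap  # the element itself is unchanged

# Prefixes have to be declared when the element is created instead.
declared = etree.Element("html", nsmap={"esi": ESI_NS})
print(still_missing, declared.nsmap)
```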
> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
> esiNode.set('src', url)
> placeholder.addnext(esiNode) # placeholder is later removed
>
> There are two problems with this:
>
> - The xmlns:esi ends up on the tag instead of the HTML
> root. Varnish doesn't like this apparently.
As I said, namespaces in HTML...
To move the namespace declaration to the top-level element, you can create
a new 'html' root element that has it and move the nodes over, e.g.
new_root = etree.Element('html', nsmap=nsmap)
new_root[:] = root[:] # or copy.deepcopy(root)[:]
I think it would be nice to allow an 'nsmap' parameter in the
cleanup_namespaces() function. Its namespace declarations would then get
added to the element it runs on before starting the cleanup process. That
would be a 2.3 feature, though.
I don't think adding support for changing 'el.nsmap' would be a good idea,
as changing namespace prefixes is actually a rather non-trivial process.
This should be requested explicitly at a well selected step in the code
(usually just before serialisation, when prefixes become interesting).
> - The tag is not self-closing when rendered with the
> html.tostring (using etree.tostring is not really an option as other
> things are going on which want html rendering). Varnish likes this even
> less.
Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
know that it's supposed to be self-closing. I think you only have three
options here:
1) fix Varnish
2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
3) close the tag through byte string substitution *after* serialisation
If you choose to go with 2), you may consider converting the stream back to
plain HTML *after* processing the esi tags, using an additional
parse-serialise cycle (or an external tool like xmllint or tidy).
Stefan
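
A minimal sketch of the new-root approach described above, under the assumption of a small stand-in document; the input markup and the "/tile" URL are made up for illustration. Note that the new root lives in a fresh document, so a doctype of the original is not carried over.

```python
from lxml import etree, html

ESI_NS = "http://www.edge-delivery.org/esi/1.0"
nsmap = {"esi": ESI_NS}

root = html.fromstring(
    '<html lang="en"><head><title>t</title></head><body><p>x</p></body></html>')

# New root that carries the namespace declaration.
new_root = etree.Element("html", nsmap=nsmap)
new_root.attrib.update(root.attrib)   # copy lang="en" and friends
new_root[:] = root[:]                 # moves the children, root is left empty

# The prefix is already declared on an ancestor, so it is reused here.
esi = etree.SubElement(new_root.find("body"), "{%s}include" % ESI_NS)
esi.set("src", "/tile")

out = etree.tostring(new_root).decode("ascii")
print(out)
```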
From optilude+lists at gmail.com Fri Jan 29 01:00:37 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 08:00:37 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B61B583.50003@behnel.de>
References: <4B61B583.50003@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Hi Martin,
>
> Martin Aspeli, 28.01.2010 14:52:
>> I'm trying to use lxml to conditionally insert an tag
>> into an HTML document.
>
> First problem: HTML is not namespace aware - namespaces in HTML are
> underdefined at best (and they certainly were not well defined back in
> 2001, when the ESI spec appeared).
>
>
>> The document is parsed with the HTML parser and manipulated in various
>> ways. At one point, I search for a node ('placeholder') and want to
>> replace it with something that renders to:
>>
>>
>>
>> The code I used looks like this:
>>
>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
>> root.nsmap.update(nsmap) # root is the element
>
> Updating the nsmap property has no effect. I've updated the docstring
> appropriately.
Ok, thanks.
>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>> esiNode.set('src', url)
>> placeholder.addnext(esiNode) # placeholder is later removed
>>
>> There are two problems with this:
>>
>> - The xmlns:esi ends up on the tag instead of the HTML
>> root. Varnish doesn't like this apparently.
>
> As I said, namespaces in HTML...
>
> To move the namespace declaration to the top-level element, you can create
> a new 'html' root element that has it and move the nodes over, e.g.
>
> new_root = etree.Element('html', nsmap=nsmap)
> new_root[:] = root[:] # or copy.deepcopy(root)[:]
How (in)efficient is this?
> I think it would be nice to allow an 'nsmap' parameter in the
> cleanup_namespaces() function. Its namespace declarations would then get
> added to the element it runs on before starting the cleanup process. That
> would be a 2.3 feature, though.
Ok.
> I don't think adding support for changing 'el.nsmap' would be a good idea,
> as changing namespace prefixes is actually a rather non-trivial process.
> This should be requested explicitly at a well selected step in the code
> (usually just before serialisation, when prefixes become interesting).
Agree. I'd be happy to pass something to the serialiser about namespaces.
>> - The tag is not self-closing when rendered with the
>> html.tostring (using etree.tostring is not really an option as other
>> things are going on which want html rendering). Varnish likes this even
>> less.
>
> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
> know that it's supposed to be self-closing. I think you only have three
> options here:
Is there no way to make it aware of it? Seems this should be
configurable (or monkey-patch-able) somewhere...
> 1) fix Varnish
> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
It probably would. What's that look like?
> 3) close the tag through byte string substitution *after* serialisation
Yipes. If I do that, I'll just do the entire tag through such a
substitution to be honest and not use lxml at all.
> If you choose to go with 2), you may consider converting the stream back to
> plain HTML *after* processing the esi tags, using an additional
> parse-serialise cycle (or an external tool like xmllint or tidy).
That sounds pretty bad for performance. :(
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From l at lrowe.co.uk Fri Jan 29 02:47:14 2010
From: l at lrowe.co.uk (Laurence Rowe)
Date: Fri, 29 Jan 2010 01:47:14 +0000
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de>
Message-ID:
2010/1/29 Martin Aspeli :
> Stefan Behnel wrote:
>> Hi Martin,
>>
>> Martin Aspeli, 28.01.2010 14:52:
>>> I'm trying to use lxml to conditionally insert an tag
>>> into an HTML document.
>>
>> First problem: HTML is not namespace aware - namespaces in HTML are
>> underdefined at best (and they certainly were not well defined back in
>> 2001, when the ESI spec appeared).
>>
>>
>>> The document is parsed with the HTML parser and manipulated in various
>>> ways. At one point, I search for a node ('placeholder') and want to
>>> replace it with something that renders to:
>>>
>>>
>>>
>>> The code I used looks like this:
>>>
>>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
>>> root.nsmap.update(nsmap) # root is the element
>>
>> Updating the nsmap property has no effect. I've updated the docstring
>> appropriately.
>
> Ok, thanks.
>
>>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>>> esiNode.set('src', url)
>>> placeholder.addnext(esiNode) # placeholder is later removed
>>>
>>> There are two problems with this:
>>>
>>> - The xmlns:esi ends up on the tag instead of the HTML
>>> root. Varnish doesn't like this apparently.
>>
>> As I said, namespaces in HTML...
>>
>> To move the namespace declaration to the top-level element, you can create
>> a new 'html' root element that has it and move the nodes over, e.g.
>>
>> new_root = etree.Element('html', nsmap=nsmap)
>> new_root[:] = root[:] # or copy.deepcopy(root)[:]
>
> How (in)efficient is this?
>
>> I think it would be nice to allow an 'nsmap' parameter in the
>> cleanup_namespaces() function. Its namespace declarations would then get
>> added to the element it runs on before starting the cleanup process. That
>> would be a 2.3 feature, though.
>
> Ok.
>
>> I don't think adding support for changing 'el.nsmap' would be a good idea,
>> as changing namespace prefixes is actually a rather non-trivial process.
>> This should be requested explicitly at a well selected step in the code
>> (usually just before serialisation, when prefixes become interesting).
>
> Agree. I'd be happy to pass something to the serialiser about namespaces.
>
>>> - The tag is not self-closing when rendered with the
>>> html.tostring (using etree.tostring is not really an option as other
>>> things are going on which want html rendering). Varnish likes this even
>>> less.
>>
>> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
>> know that it's supposed to be self-closing. I think you only have three
>> options here:
>
> Is there no way to make it aware of it? Seems this should be
> configurable (or monkey-patch-able) somewhere...
>
>> 1) fix Varnish
>> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
>
> It probably would. What's that look like?
>
>> 3) close the tag through byte string substitution *after* serialisation
>
> Yipes. If I do that, I'll just do the entire tag through such a
> substitution to be honest and not use lxml at all.
>
>> If you choose to go with 2), you may consider converting the stream back to
>> plain HTML *after* processing the esi tags, using an additional
>> parse-serialise cycle (or an external tool like xmllint or tidy).
>
> That sounds pretty bad for performance. :(
>
> Martin
FWIW, the only way I've found to get good xhtml output from html
parsing is with an xsl like the following...
This triggers the xml output mode to produce valid xhtml. If
et.docinfo.public_id and et.docinfo.system_url could be set somehow
then I'm sure it would work without the transform. (The relevant code
is at the top of libxml2/xmlsave.c - basically so long as you have one
of the xhtml public ids or system urls you'll get the right output).
Laurence
From optilude+lists at gmail.com Fri Jan 29 03:26:48 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 10:26:48 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B61B583.50003@behnel.de>
References: <4B61B583.50003@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
> know that it's supposed to be self-closing. I think you only have three
> options here:
>
> 1) fix Varnish
> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
> 3) close the tag through byte string substitution *after* serialisation
>
> If you choose to go with 2), you may consider converting the stream back to
> plain HTML *after* processing the esi tags, using an additional
> parse-serialise cycle (or an external tool like xmllint or tidy).
I've just tried this with serialization using lxml.etree.tostring
instead of lxml.html.tostring. Unfortunately, I'm still getting an
open-close tag pair instead of a self-closed tag. Any idea what I may be
doing wrong?
esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
esiNode.set('src', tileHref)
esiNode.text = None
tilePlaceholderNode.addnext(esiNode)
toRemove.append(tilePlaceholderNode)
Output:
...
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From optilude+lists at gmail.com Fri Jan 29 04:38:40 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 11:38:40 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B61B583.50003@behnel.de>
References: <4B61B583.50003@behnel.de>
Message-ID:
Stefan Behnel wrote:
> To move the namespace declaration to the top-level element, you can create
> a new 'html' root element that has it and move the nodes over, e.g.
>
> new_root = etree.Element('html', nsmap=nsmap)
> new_root[:] = root[:] # or copy.deepcopy(root)[:]
I tried this (self-closing tag issue notwithstanding), like so:
root = tree.getroot()
nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
newRoot = etree.Element('html', nsmap=nsmap)
newRoot.attrib.update(root.attrib.items())
newRoot[:] = copy.deepcopy(root)[:]
tree._setroot(newRoot)
Unfortunately, I've now lost the doctype. :( The head of the page looks
like:
Intriguingly, the <esi:include /> tag now self-closes. :-) However,
Firefox is showing an empty page.
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From stefan_ml at behnel.de Fri Jan 29 08:36:10 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 08:36:10 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de>
Message-ID: <4B628FEA.2030905@behnel.de>
Martin Aspeli, 29.01.2010 04:38:
> Stefan Behnel wrote:
>
>> To move the namespace declaration to the top-level element, you can create
>> a new 'html' root element that has it and move the nodes over, e.g.
>>
>> new_root = etree.Element('html', nsmap=nsmap)
>> new_root[:] = root[:] # or copy.deepcopy(root)[:]
>
> I tried this (self-closing tag issue notwithstanding), like so:
>
> root = tree.getroot()
> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
> newRoot = etree.Element('html', nsmap=nsmap)
> newRoot.attrib.update(root.attrib.items())
> newRoot[:] = copy.deepcopy(root)[:]
> tree._setroot(newRoot)
>
> Unfortunately, I've now lost the doctype. :( The head of the page looks
> like:
>
> xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
You can also create the element using the parser:
newRoot = etree.XML('''
''')
Sadly, doctype setting isn't currently as easy as it could be...
> Intriguingly, the tag now self-closes. :-) However,
> Firefox is showing an empty page.
May or may not be due to the missing doctype.
Stefan
From stefan_ml at behnel.de Fri Jan 29 08:47:27 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 08:47:27 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de>
Message-ID: <4B62928F.5050400@behnel.de>
Martin Aspeli, 29.01.2010 03:26:
> Stefan Behnel wrote:
>
>> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
>> know that it's supposed to be self-closing. I think you only have three
>> options here:
>>
>> 1) fix Varnish
>> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
>> 3) close the tag through byte string substitution *after* serialisation
>>
>> If you choose to go with 2), you may consider converting the stream back to
>> plain HTML *after* processing the esi tags, using an additional
>> parse-serialise cycle (or an external tool like xmllint or tidy).
>
> I've just tried this with serialization using lxml.etree.tostring
> instead of lxml.html.tostring. Unfortunately, I'm still getting an
> open-close tag pair instead of a self-closed tag. Any idea what I may be
> doing wrong?
>
> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
> esiNode.set('src', tileHref)
> esiNode.text = None
Setting the .text to None is redundant as this is a new element. Otherwise,
doing that should be enough to erase all text content.
> tilePlaceholderNode.addnext(esiNode)
> toRemove.append(tilePlaceholderNode)
I guess I would have used parent.replace(old,new) here.
> Output:
>
>
Works for me. Could you send me a complete code snippet where it doesn't
work for you?
Stefan
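
A small sketch of the parent.replace(old, new) variant, assuming a made-up document with two placeholders. The XPath call returns a plain list, so the tree can be modified while looping over the results. The tail is copied over by hand on the assumption that replace() leaves the old element's tail with the old element.

```python
from lxml import etree

root = etree.fromstring(
    '<div><img class="mceTile"/><p/><img class="mceTile"/></div>')

for old in root.xpath("//img[@class='mceTile']"):
    new = etree.Element("span")
    new.tail = old.tail                  # keep any text following the placeholder
    old.getparent().replace(old, new)    # swap in place, no separate removal step

tags = [child.tag for child in root]
print(tags)
```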
From optilude+lists at gmail.com Fri Jan 29 09:15:50 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 16:15:50 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62928F.5050400@behnel.de>
References:
<4B61B583.50003@behnel.de>
<4B62928F.5050400@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Martin Aspeli, 29.01.2010 03:26:
>> Stefan Behnel wrote:
>>
>>> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
>>> know that it's supposed to be self-closing. I think you only have three
>>> options here:
>>>
>>> 1) fix Varnish
>>> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
>>> 3) close the tag through byte string substitution *after* serialisation
>>>
>>> If you choose to go with 2), you may consider converting the stream back to
>>> plain HTML *after* processing the esi tags, using an additional
>>> parse-serialise cycle (or an external tool like xmllint or tidy).
>> I've just tried this with serialization using lxml.etree.tostring
>> instead of lxml.html.tostring. Unfortunately, I'm still getting an
>> open-close tag pair instead of a self-closed tag. Any idea what I may be
>> doing wrong?
>>
>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>> esiNode.set('src', tileHref)
>> esiNode.text = None
>
> Setting the .text to None is redundant as this is a new element. Otherwise,
> doing that should be enough to erase all text content.
It was an act of desperation. :)
>> tilePlaceholderNode.addnext(esiNode)
>> toRemove.append(tilePlaceholderNode)
>
> I guess I would have used parent.replace(old,new) here.
I didn't do this for two reasons:
1. In some cases (though not here) I'm replacing one placeholder with
multiple nodes.
2. This code appears within a loop that's manipulating the tree for
each of multiple elements matched with an XPath expression. I thought
deleting a node mid-iteration would cause problems.
>> Output:
>>
>>
>
> Works for me. Could you send me a complete code snippet where it doesn't
> work for you?
How much work are you willing to put in? :-)
I can give you a Plone buildout that will set up everything and talk you
through the steps to reproduce. It's not very hard, it just requires a
few steps. I won't bother explaining it if you don't have half an hour
to chase it down, though. :)
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From d.rothe at semantics.de Fri Jan 29 09:57:33 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Fri, 29 Jan 2010 09:57:33 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
Message-ID:
During XPath evaluation in XSL transformations it's possible to use functions
from http://www.exslt.org/ (so [5] does indeed match the element there).
During plain XPath evaluation it's only possible to use the standard XPath/XSLT
functions. Is there a chance to enable the EXSLT functions for lxml
XPath evaluations as well?
=========================================================
In [1]: from lxml import etree
In [2]: from StringIO import StringIO
In [3]: tree = etree.parse(StringIO(''))
In [4]: print tree.xpath('/a[@b=concat("1","2")]')
[]
In [5]: print tree.xpath('/a[@b=str:split("12 34")]')
[]
=========================================================
--dirk
From stefan_ml at behnel.de Fri Jan 29 09:59:14 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 09:59:14 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de>
Message-ID: <4B62A362.40502@behnel.de>
Martin Aspeli, 29.01.2010 01:00:
> Stefan Behnel wrote:
>> Martin Aspeli, 28.01.2010 14:52:
>>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>>> esiNode.set('src', url)
>>> placeholder.addnext(esiNode) # placeholder is later removed
>>>
>>> There are two problems with this:
>>>
>>> - The xmlns:esi ends up on the tag instead of the HTML
>>> root. Varnish doesn't like this apparently.
>> As I said, namespaces in HTML...
>>
>> To move the namespace declaration to the top-level element, you can create
>> a new 'html' root element that has it and move the nodes over, e.g.
>>
>> new_root = etree.Element('html', nsmap=nsmap)
>> new_root[:] = root[:] # or copy.deepcopy(root)[:]
>
> How (in)efficient is this?
It's about linear in the number of elements in your tree, plus the number
of direct children for the move operation. Maybe not the most efficient
thing to do, but usually pretty fast. Certainly a lot faster than you could
ever get your own hand-rolled serialiser in Python, for instance. You can
compare the absolute numbers on this page:
http://codespeak.net/lxml/performance.html#parsing-and-serialising
http://codespeak.net/lxml/performance.html#merging-different-sources
>> I don't think adding support for changing 'el.nsmap' would be a good idea,
>> as changing namespace prefixes is actually a rather non-trivial process.
>> This should be requested explicitly at a well selected step in the code
>> (usually just before serialisation, when prefixes become interesting).
>
> Agree. I'd be happy to pass something to the serialiser about namespaces.
I know, that's how ET's serialiser works. Can't work for lxml, though. The
serialiser in libxml2 can only write out what is there.
It could work for a doctype, though. Support for passing that verbatim
into the serialiser would be a nice feature.
>>> - The tag is not self-closing when rendered with the
>>> html.tostring (using etree.tostring is not really an option as other
>>> things are going on which want html rendering). Varnish likes this even
>>> less.
>> Well, the HTML serialiser doesn't know the "esi:include" tag, so it can't
>> know that it's supposed to be self-closing. I think you only have three
>> options here:
>
> Is there no way to make it aware of it? Seems this should be
> configurable (or monkey-patch-able) somewhere...
Monkey-patching isn't all that easy in libxml2, though...
Not that it can't work for C code, it's just not that portable - nor
particularly safe... ;-)
>> 1) fix Varnish
>> 2) serialise to XHTML instead of HTML (assuming that Varnish supports that)
>
> It probably would. What's that look like?
Depends on your input. If it's HTML, there's an html_to_xhtml() function in
lxml.html that can do the conversion for you. And the serialiser can always
be chosen using the 'method' argument (that's basically the difference
between lxml.etree.tostring() and lxml.html.tostring()).
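
A minimal sketch of the two points above, using a made-up input document: the same HTML-parsed tree serialised with the HTML and the XML serialiser, followed by the html_to_xhtml() conversion, which moves the element tags into the XHTML namespace.

```python
from lxml import etree, html

doc = html.fromstring("<html><body><p>Hi<br>there</p></body></html>")

as_html = html.tostring(doc)                  # HTML serialiser, writes <br>
as_xml = etree.tostring(doc)                  # XML serialiser, writes <br/>
also_xml = html.tostring(doc, method="xml")   # same serialiser via 'method'

html.html_to_xhtml(doc)                       # in-place namespace conversion
print(as_html, as_xml, doc.tag)
```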
>> 3) close the tag through byte string substitution *after* serialisation
>
> Yipes. If I do that, I'll just do the entire tag through such a
> substitution to be honest and not use lxml at all.
It's rather safe, though. The exact string to replace would be
"></esi:include>", which won't appear that easily in your content. Doing
the parsing and replacing manually on the input is a lot more fragile.
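
A sketch of option 3, assuming the serialiser emitted the empty element as an open/close pair. The helper name self_close_esi and the sample bytes are made up for illustration.

```python
def self_close_esi(serialised):
    """Turn empty <esi:include ...></esi:include> pairs into self-closing tags."""
    return serialised.replace(b"></esi:include>", b"/>")


page = b'<body><esi:include src="/tile"></esi:include><p>text</p></body>'
fixed = self_close_esi(page)
print(fixed)
```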
>> If you choose to go with 2), you may consider converting the stream back to
>> plain HTML *after* processing the esi tags, using an additional
>> parse-serialise cycle (or an external tool like xmllint or tidy).
>
> That sounds pretty bad for performance. :(
Don't underestimate the speed of a tool that was made for the job.
Stefan
From optilude+lists at gmail.com Fri Jan 29 10:10:54 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 17:10:54 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62928F.5050400@behnel.de>
References:
<4B61B583.50003@behnel.de>
<4B62928F.5050400@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Works for me. Could you send me a complete code snippet where it doesn't
> work for you?
Okay, here's a pseudo-doctest that illustrates the problem:
First, we create a simple document. We use the HTML parser here, because
we don't necessarily trust the input being 100% valid XHTML, even though
the doctype says so.
>>> from lxml import etree, html
>>> doc = """\
...
...
...
...
...
...
... """
>>> inputTree = html.fromstring(doc)
We are going to replace the <img /> tag with an <esi:include /> tag. We
find it via an XPath:
>>> placeholderXPath = etree.XPath(
...     "//img[contains(concat(' ', normalize-space(@class), ' '), ' mceTile ')]")
>>> matched = list(placeholderXPath(inputTree))
>>> matchedNode = matched[0]
We then create the ESI node. At this point, it's nice and self-closing.
Note that we use the etree.tostring() method, since we want XHTML output.
>>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
>>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>>> esiNode.set('src', matchedNode.get('alt'))
>>> print etree.tostring(esiNode)
Now we connect it to the parent:
>>> matchedNode.getparent().replace(matchedNode, esiNode)
At this point it's all over:
>>> print etree.tostring(esiNode)
And sure enough:
>>> print etree.tostring(inputTree)
It's also interesting to note that this suddenly has the xmlns
declaration twice.
Any ideas would be highly welcome. I've tried to play with different
ways to construct the ESI tag, and different placements for the
placeholder (e.g. outside the
tag), but it's all the same. It also
doesn't seem to make any difference whether I parse with
etree.fromstring() or html.fromstring() (in the real code I'm actually
feeding an HTMLParser).
As soon as I insert it into the parent tree, the tag stops self closing.
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book
From stefan_ml at behnel.de Fri Jan 29 10:18:52 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 10:18:52 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de>
Message-ID: <4B62A7FC.80607@behnel.de>
Martin Aspeli, 29.01.2010 09:15:
> Stefan Behnel wrote:
>> Setting the .text to None is redundant as this is a new element.
>
> It was an act of desperation. :)
I know. ;)
>>> tilePlaceholderNode.addnext(esiNode)
>>> toRemove.append(tilePlaceholderNode)
>> I guess I would have used parent.replace(old,new) here.
>
> I didn't do this for two reasons:
>
> 1. In some cases (though not here) I'm replacing one placeholder with
> multiple nodes.
Another nice feature: support a sequence as replacement. :)
Although that requirement is basically satisfied with slice replacements,
so I guess that won't make it in for now.
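A slice replacement could look roughly like this. The sketch uses the standard library's xml.etree.ElementTree so that it runs anywhere; the assumption is that lxml.etree elements accept the same slice assignment, since they follow the ElementTree list API. The element names are invented.

```python
import xml.etree.ElementTree as ET

parent = ET.fromstring("<div><p>before</p><img alt='x'/><p>after</p></div>")
placeholder = parent.find("img")

# Two replacement nodes for the single placeholder.
first = ET.Element("span")
first.text = "one"
second = ET.Element("em")
second.text = "two"

# Locate the placeholder among its siblings, then overwrite that
# one-element slice with the whole sequence of replacements.
index = list(parent).index(placeholder)
parent[index:index + 1] = [first, second]

print(ET.tostring(parent, encoding="unicode"))
```

With lxml the parent would normally come from placeholder.getparent() rather than being known up front.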
> 2. This code appears within a loop that's manipulating the tree for
> each of multiple elements matched with an XPath expression. I thought
> deleting a node mid-iteration would cause problems.
XPath returns a list of nodes, so you are no longer iterating over the tree
structure in this case. Ripping stuff out should be absolutely safe here.
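A small sketch of why this is safe, assuming a recent lxml. The document and the replacement element are invented for the example; the point is only that the loop walks a list that was built before any modification.

```python
from lxml import etree

root = etree.fromstring(
    '<div><img class="mceTile" alt="a.html"/><p>text</p>'
    '<img class="mceTile" alt="b.html"/></div>'
)

# xpath() evaluates eagerly and hands back a plain Python list.
matches = root.xpath("//img[@class='mceTile']")
assert isinstance(matches, list)

for node in matches:
    replacement = etree.Element("span")
    replacement.set("src", node.get("alt"))
    # Replacing (or removing) the matched node does not disturb the loop,
    # because the loop walks the list, not the live tree.
    node.getparent().replace(node, replacement)

tags = [child.tag for child in root]
print(tags)
```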
>>> Output:
>>>
>>>
>> Works for me. Could you send me a complete code snippet where it doesn't
>> work for you?
>
> How much work are you willing to put in? :-)
>
> I can give you a Plone buildout that will set up everything and talk you
> through the steps to reproduce.
LOL! :)
"You know, I have this huge pile of code here, but it's really easy to set
up and then all you have to do is a tiny bit of debugging. It's easy! It
really is! I can't believe you don't want to feel the fun to try it!"
Honestly, could you try to come up with a little example that injects
namespaced XML content into a small HTML page, and that shows that the XML
serialiser behaves unexpectedly? Shouldn't be hard to write...
Stefan
From optilude+lists at gmail.com Fri Jan 29 10:22:44 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 17:22:44 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62A7FC.80607@behnel.de>
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de>
<4B62A7FC.80607@behnel.de>
Message-ID:
Stefan Behnel wrote:
>> 1. In some cases (though not here) I'm replacing one placeholder with
>> multiple nodes.
>
> Another nice feature: support a sequence as replacement. :)
>
> Although that requirement is basically satisfied with slice replacements,
> so I guess that won't make it in for now.
Could you elaborate an example?
>> 2. This code appears within a loop that's manipulating the tree for
>> each of multiple elements matched with an XPath expression. I thought
>> deleting a node mid-iteration would cause problems.
>
> XPath returns a list of nodes, so you are no longer iterating over the tree
> structure in this case. Ripping stuff out should be absolutely safe here.
Cool! Less code.
>>>> Output:
>>>>
>>>>
>>> Works for me. Could you send me a complete code snippet where it doesn't
>>> work for you?
>> How much work are you willing to put in? :-)
>>
>> I can give you a Plone buildout that will set up everything and talk you
>> through the steps to reproduce.
>
> LOL! :)
>
> "You know, I have this huge pile of code here, but it's really easy to set
> up and then all you have to do is a tiny bit of debugging. It's easy! It
> really is! I can't believe you don't want to feel the fun to try it!"
That's why I asked. :-p
> Honestly, could you try to come up with a little example that injects
> namespaced XML content into a small HTML page, and that shows that the XML
> serialiser behaves unexpectedly? Shouldn't be hard to write...
See my other mail. I got a minimal example that's bombing out for me.
Martin
From stefan_ml at behnel.de Fri Jan 29 11:08:56 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 11:08:56 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de>
Message-ID: <4B62B3B8.60206@behnel.de>
Martin Aspeli, 29.01.2010 10:10:
> here's a pseudo-doctest that illustrates the problem:
>
> First, we create a simple document. We use the HTML parser here, because
> we don't necessarily trust the input being 100% valid XHTML, even though
> the doctype says so.
I think that's the main problem. If you parse XHTML using the HTML parser,
you lose information due to the fact that namespaces are not well-defined
for HTML. I'd *always* try with the XML parser first.
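"XML parser first" can be sketched as a fallback chain, assuming a recent lxml. The helper name parse_markup and both sample inputs are hypothetical; a failed XML parse costs extra time, so this is only a sketch of the idea, not a tuned solution.

```python
from lxml import etree, html

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_markup(text):
    """Parse as XML when the input is well-formed, else fall back to HTML."""
    try:
        # The XML parser keeps namespaces intact.
        return etree.fromstring(text)
    except etree.XMLSyntaxError:
        # The HTML parser tolerates broken markup but drops namespace info.
        return html.fromstring(text)


strict = parse_markup('<html xmlns="%s"><body><p>ok</p></body></html>' % XHTML_NS)
sloppy = parse_markup("<p>not closed<br>still fine")

print(strict.tag)
print(sloppy.tag)
```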
However, according to your last comment, it seems you have tried the XML
parser already...
> >>> from lxml import etree, html
> >>> doc = """\
> ... "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> ... xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
> ...
> ...
alt="./target.html" />
> ...
> ...
> ... """
> >>> inputTree = html.fromstring(doc)
>
> We are going to replace the tag with an tag. We
> find it via an XPath:
>
> >>> placeholderXPath = etree.XPath("//img[contains(concat(' ',
> normalize-space(@class), ' '), ' mceTile ')]")
Perfect use case for lxml.cssselect. :)
> >>> matched = list(placeholderXPath(inputTree))
As I said, XPath returns a *list*.
Personally, I'd love to have it return an iterable, but libxml2 doesn't
easily give you that. IIRC, there's some limited support for this (it works
for certain patterns), but that would need some serious wrapping effort
with non-trivial memory management.
> >>> matchedNode = matched[0]
>
> We then create the ESI node. At this point, it's nice and self-closing.
> Note that we use the etree.tostring() method, since we want XHTML output.
>
> >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
> >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
> >>> esiNode.set('src', matchedNode.get('alt'))
>
> >>> print etree.tostring(esiNode)
> src="./target.html"/>
Ok so far.
> Now we connect it to the parent:
>
> >>> matchedNode.getparent().replace(matchedNode, esiNode)
>
> At this point it's all over:
>
> >>> print etree.tostring(esiNode)
> src="./target.html">
Ah, this is because it is now part of an HTML document, so the HTML
semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?)
that provided 'better' support for HTML serialisation by taking into
account the document context. Looks like this strikes here.
I just looked it up in the sources, recent 2.7.x versions of libxml2 have
added a way to override this behaviour again, but lxml doesn't do this yet.
IIRC, it wasn't trivial at the time - I think it required going through a
different serialisation function or something.
> And sure enough:
>
> >>> print etree.tostring(inputTree)
> xmlns:esi="http://www.edge-delivery.org/esi/1.0"
> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
> xml:lang="en">
>
src="./target.html">
>
>
> It's also interesting to note that this suddenly has the xmlns
> declaration twice.
... namespaces in HTML ...
> Any ideas would be highly welcome. I've tried to play with different
> ways to construct the ESI tag, and different placements for the
> placeholder (e.g. outside the
tag), but it's all the same. It also
> doesn't seem to make any difference whether I parse with
> etree.fromstring() or html.fromstring() (in the real code I'm actually
> feeding an HTMLParser).
It *should* make a difference, but from your example I can see that it
doesn't. No idea why. I'll have a closer look later.
Stefan
From stefan_ml at behnel.de Fri Jan 29 11:11:47 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 11:11:47 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
In-Reply-To:
References:
Message-ID: <4B62B463.4040701@behnel.de>
Dirk Rothe, 29.01.2010 09:57:
> During XPath Evaluations in XSL-Transformations it's possible to use Stuff
> from http://www.exslt.org/ (so [5] does indeed match the element).
> During XPath Evaluations it's only possible to use standard XPath/XSLT
> Functions. Is there a chance to enable the functions from exslt for lxml
> XPath evaluations as well?
>
> =========================================================
> In [1]: from lxml import etree
>
> In [2]: from StringIO import StringIO
>
> In [3]: tree = etree.parse(StringIO(''))
>
> In [4]: print tree.xpath('/a[@b=concat("1","2")]')
> []
>
> In [5]: print tree.xpath('/a[@b=str:split("12 34")]')
> []
> =========================================================
They should be enabled. But you have to specify the namespace of the
function you use.
http://codespeak.net/lxml/xpathxslt.html#xpath
Stefan
From d.rothe at semantics.de Fri Jan 29 11:29:30 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Fri, 29 Jan 2010 11:29:30 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
In-Reply-To: <4B62B463.4040701@behnel.de>
References: <4B62B463.4040701@behnel.de>
Message-ID:
On Fri, 29 Jan 2010 11:11:47 +0100, Stefan Behnel
wrote:
>
> Dirk Rothe, 29.01.2010 09:57:
>> During XPath Evaluations in XSL-Transformations it's possible to use
>> Stuff
>> from http://www.exslt.org/ (so [5] does indeed match the element).
>> During XPath Evaluations it's only possible to use standard XPath/XSLT
>> Functions. Is there a chance to enable the functions from exslt for lxml
>> XPath evaluations as well?
>>
>> =========================================================
>> In [1]: from lxml import etree
>>
>> In [2]: from StringIO import StringIO
>>
>> In [3]: tree = etree.parse(StringIO(''))
>>
>> In [4]: print tree.xpath('/a[@b=concat("1","2")]')
>> []
>>
>> In [5]: print tree.xpath('/a[@b=str:split("12 34")]')
>> []
>> =========================================================
>
> They should be enabled. But you have to specify the namespace of the
> function you use.
>
> http://codespeak.net/lxml/xpathxslt.html#xpath
Ah, sorry. I should have checked this.
--dirkse
From optilude+lists at gmail.com Fri Jan 29 11:51:29 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 18:51:29 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62B3B8.60206@behnel.de>
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de>
<4B62B3B8.60206@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Martin Aspeli, 29.01.2010 10:10:
>> here's a pseudo-doctest that illustrates the problem:
>>
>> First, we create a simple document. We use the HTML parser here, because
>> we don't necessarily trust the input being 100% valid XHTML, even though
>> the doctype says so.
>
> I think that's the main problem. If you parse XHTML using the HTML parser,
> you lose information due to the fact that namespaces are not well-defined
> for HTML. I'd *always* try with the XML parser first.
This code is being used in a post-processing step for output from Plone.
Performance is important, so trial-and-error like this is probably
undesirable. And even then, this would need to work for documents parsed
with the HTML parser. The output being transformed could include
not-quite-well-formed XHTML from content-managed pages. That's the
attraction of lxml in the first place - it can deal with somewhat-crap
output. ;)
> However, according to your last comment, it seems you have tried the XML
> parser already...
I just re-confirmed it. If the whole thing is parsed with
etree.fromstring (and lxml.html is not used anywhere) it still doesn't
close.
>> >>> from lxml import etree, html
>> >>> doc = """\
>> ...> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> ...> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
>> ...
>> ...
> alt="./target.html" />
>> ...
>> ...
>> ... """
>> >>> inputTree = html.fromstring(doc)
>>
>> We are going to replace the tag with an tag. We
>> find it via an XPath:
>>
>> >>> placeholderXPath = etree.XPath("//img[contains(concat(' ',
>> normalize-space(@class), ' '), ' mceTile ')]")
>
> Perfect use case for lxml.cssselect. :)
Well, I got the XPath from css2xpath.appspot.com which uses the same
algorithm I think.
>> >>> matched = list(placeholderXPath(inputTree))
>
> As I said, XPath returns a *list*.
Cool, thanks.
> Personally, I'd love to have it return an iterable, but libxml2 doesn't
> easily give you that. IIRC, there's some limited support for this (it works
> for certain patterns), but that would need some serious wrapping effort
> with non-trivial memory management.
Changing it in a future release may be risky if it returns a list now.
>> >>> matchedNode = matched[0]
>>
>> We then create the ESI node. At this point, it's nice and self-closing.
>> Note that we use the etree.tostring() method, since we want XHTML output.
>>
>> >>> nsmap = {'esi': 'http://www.edge-delivery.org/esi/1.0'}
>> >>> esiNode = etree.Element("{%s}include" % nsmap['esi'], nsmap=nsmap)
>> >>> esiNode.set('src', matchedNode.get('alt'))
>>
>> >>> print etree.tostring(esiNode)
>> > src="./target.html"/>
>
> Ok so far.
>
>
>> Now we connect it to the parent:
>>
>> >>> matchedNode.getparent().replace(matchedNode, esiNode)
>>
>> At this point it's all over:
>>
>> >>> print etree.tostring(esiNode)
>> > src="./target.html">
>
> Ah, this is because it is now part of an HTML document, so the HTML
> semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?)
> that provided 'better' support for HTML serialisation by taking into
> account the document context. Looks like this strikes here.
>
> I just looked it up in the sources, recent 2.7.x versions of libxml2 have
> added a way to override this behaviour again, but lxml doesn't do this yet.
> IIRC, it wasn't trivial at the time - I think it required going through a
> different serialisation function or something.
Makes sense, sorta, but I would've thought this was a matter for
serialisation, not parsing? Even when parsing as HTML, I'm using
etree.tostring() to serialise.
>> And sure enough:
>>
>> >>> print etree.tostring(inputTree)
>> > xmlns:esi="http://www.edge-delivery.org/esi/1.0"
>> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
>> xml:lang="en">
>>
> src="./target.html">
>>
>>
>> It's also interesting to note that this suddenly has the xmlns
>> declaration twice.
>
> ... namespaces in HTML ...
Yeah, yeah. XHTML. ;-)
>> Any ideas would be highly welcome. I've tried to play with different
>> ways to construct the ESI tag, and different placements for the
>> placeholder (e.g. outside the
tag), but it's all the same. It also
>> doesn't seem to make any difference whether I parse with
>> etree.fromstring() or html.fromstring() (in the real code I'm actually
>> feeding an HTMLParser).
>
> It *should* make a difference, but from your example I can see that it
> doesn't. No idea why. I'll have a closer look later.
I appreciate it!
Martin
From stefan_ml at behnel.de Fri Jan 29 13:16:59 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 13:16:59 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de>
Message-ID: <4B62D1BB.1050408@behnel.de>
Martin Aspeli, 29.01.2010 11:51:
> Stefan Behnel wrote:
>> Personally, I'd love to have it return an iterable, but libxml2 doesn't
>> easily give you that. IIRC, there's some limited support for this (it works
>> for certain patterns), but that would need some serious wrapping effort
>> with non-trivial memory management.
>
> Changing it in a future release may be risky if it returns a list now.
Obviously. It would rather become a new method on the XPath class, like
xpath.iterfind(el).
>>> ...>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>> ...>> xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
>>> ...
>>> ...
>> alt="./target.html" />
>>> ...
>>> ...
>> [...]
>>> >>> matchedNode.getparent().replace(matchedNode, esiNode)
>>> >>> print etree.tostring(esiNode)
>>> >> src="./target.html">
>>
>> Ah, this is because it is now part of an HTML document, so the HTML
>> semantics interfere. I remember a not-so-old change in libxml2 (2.7.x?)
>> that provided 'better' support for HTML serialisation by taking into
>> account the document context. Looks like this strikes here.
>>
>> I just looked it up in the sources, recent 2.7.x versions of libxml2 have
>> added a way to override this behaviour again, but lxml doesn't do this yet.
>> IIRC, it wasn't trivial at the time - I think it required going through a
>> different serialisation function or something.
>
> Makes sense, sorta, but I would've thought this was a matter for
> serialisation, not parsing? Even when parsing as HTML, I'm using
> etree.tostring() to serialise.
I read through the libxml2 sources a bit more. It's not confusing HTML at
all, it's even smarter than I thought. It looks at the *doctype* of the
document that is being serialised and then applies special XHTML formatting
rules. :o)
http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137
Stefan
From optilude+lists at gmail.com Fri Jan 29 14:27:28 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 21:27:28 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62D1BB.1050408@behnel.de>
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de>
<4B62D1BB.1050408@behnel.de>
Message-ID:
Stefan Behnel wrote:
> I read through the libxml2 sources a bit more. It's not confusing HTML at
> all, it's even smarter than I thought. It looks at the *doctype* of the
> document that is being serialised and then applies special XHTML formatting
> rules. :o)
But... XHTML says empty tags can self-close as far as I know. And even
then, this is in a different namespace.
> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137
My C fu is weak. Any hints in there I'm missing?
Martin
From stefan_ml at behnel.de Fri Jan 29 15:32:42 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 15:32:42 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de>
Message-ID: <4B62F18A.2040105@behnel.de>
Martin Aspeli, 29.01.2010 14:27:
> Stefan Behnel wrote:
>
>> I read through the libxml2 sources a bit more. It's not confusing HTML at
>> all, it's even smarter than I thought. It looks at the *doctype* of the
>> document that is being serialised and then applies special XHTML formatting
>> rules. :o)
>
> But... XHTML says empty tags can self-close as far as I know.
Sure. I just pointed you to the code that formats the output.
Given that the DOCTYPE plays the card here, you may also consider keeping
the DOCTYPE out of the tree and prepending it after the serialisation.
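One way to keep the DOCTYPE out of the tree is to cut it off the input string before parsing and glue it back on after serialising. This is a rough sketch using only the standard library; the helper names are invented, and it assumes the DOCTYPE is the first thing in the input (no XML declaration before it) and has no internal subset.

```python
import re

# Matches a leading DOCTYPE declaration without an internal subset.
DOCTYPE_RE = re.compile(r"\s*(<!DOCTYPE[^>\[]*>)\s*", re.IGNORECASE)


def split_doctype(markup):
    """Return (doctype, rest); doctype is '' when none is found at the start."""
    match = DOCTYPE_RE.match(markup)
    if match is None:
        return "", markup
    return match.group(1), markup[match.end():]


def prepend_doctype(doctype, serialised):
    """Put the saved DOCTYPE back in front of the serialised document."""
    return doctype + "\n" + serialised if doctype else serialised


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
)

doctype, body = split_doctype(page)
# ... parse `body`, modify the tree, serialise it back to a string ...
result = prepend_doctype(doctype, body)
print(result)
```

Since the parser never sees the DOCTYPE, the serialiser has no doctype to inspect when it picks its formatting rules.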
>> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137
>
> My C fu is weak. Any hints in there I'm missing?
This, for example:
http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451
The rule that bites you here is in line 1452. If the element uses a
namespace prefix, it will not become self-closing. I have no idea about the
reasoning behind such a rule, but if you are interested, I'd go straight to
the libxml2 mailing list and ask.
There's also line 1414
http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414
which emits a default namespace declaration for the XHTML namespace
regardless of the existing declarations. Certainly space left for enhancements.
IIRC, the XHTML formatting is rather new, may have been added in the 2.7
line. You'll have a good chance of being heard if you propose some sensible
improvements to it.
Stefan
From mateusz-lists at ant.gliwice.pl Fri Jan 29 15:49:16 2010
From: mateusz-lists at ant.gliwice.pl (Mateusz Korniak)
Date: Fri, 29 Jan 2010 15:49:16 +0100
Subject: [lxml-dev] Extending //r/text()
Message-ID: <201001291549.16846.mateusz-lists@ant.gliwice.pl>
Hi !
I am using lxml and find it really great !
Is it possible to write extension functions, as described at
http://codespeak.net/lxml/extensions.html
that can be used like text() in
//r/text() ?
Although I have defined "testf"
ns = lxml.etree.FunctionNamespace(None)
ns['testf'] = testf
from running:
res = test_root.xpath("//r[testf()]")
res = test_root.xpath("//r/text()")
res = test_root.xpath("//r/testf()")
I get:
lxml.etree.XPathEvalError: Invalid expression
when executing "//r/testf()"
Thanks in advance and regards !
--
Mateusz Korniak
From optilude+lists at gmail.com Fri Jan 29 16:09:32 2010
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Fri, 29 Jan 2010 23:09:32 +0800
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62F18A.2040105@behnel.de>
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de>
<4B62F18A.2040105@behnel.de>
Message-ID:
Stefan Behnel wrote:
> Given that the DOCTYPE plays the card here, you may also consider keeping
> the DOCTYPE out of the tree and prepending it after the serialisation.
That's going to be pretty tricky, but I guess we can try.
I wonder what the side-effect of this may be, though. Presumably, the
DOCTYPE detection is there for a reason.
>>> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n2137
>> My C fu is weak. Any hints in there I'm missing?
>
> This, for example:
>
> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1451
>
> The rule that bites you here is in line 1452. If the element uses a
> namespace prefix, it will not become self-closing. I have no idea about the
> reasoning behind such a rule, but if you are interested, I'd go straight to
> the libxml2 mailing list and ask.
>
> There's also line 1414
>
> http://git.gnome.org/browse/libxml2/tree/xmlsave.c#n1414
>
> which emits a default namespace declaration for the XHTML namespace
> regardless of the existing declarations. Certainly space left for enhancements.
>
> IIRC, the XHTML formatting is rather new, may have been added in the 2.7
> line. You'll have a good chance of being heard if you propose some sensible
> improvements to it.
Good to know. I'm not sure I know how to formulate the needed changes
except by re-stating the problem I'm having here, though. It'd probably
help if I understood the purpose of the special formatting better. I
naively thought that XHTML = XML and wouldn't need any magic. :)
Martin
From stefan_ml at behnel.de Fri Jan 29 16:27:09 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 16:27:09 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To: <4B62A362.40502@behnel.de>
References:
<4B61B583.50003@behnel.de>
<4B62A362.40502@behnel.de>
Message-ID: <4B62FE4D.7090504@behnel.de>
[replying to myself]
Stefan Behnel, 29.01.2010 09:59:
> The serialiser in libxml2 can only write out what is there.
>
> It could work for a doctype, though. Support for passing that verbatim
> into the serialiser would be a nice feature.
There we go:
https://codespeak.net/viewvc/?view=rev&revision=70976
>>> xml = '\n'
>>> tree = etree.parse(StringIO(xml))
>>> print(etree.tostring(tree))
>>> print(etree.tostring(tree,
... doctype=''))
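For reference, a self-contained sketch of the same call, assuming an lxml release that includes the doctype keyword of etree.tostring(). The document and the DOCTYPE string are invented placeholders, not the ones used in the session above.

```python
from io import BytesIO

from lxml import etree

tree = etree.parse(BytesIO(b"<root/>"))

# Without the keyword, only what is in the tree gets written out.
plain = etree.tostring(tree)

# With the keyword, the given string is written in front of the document.
with_doctype = etree.tostring(tree, doctype='<!DOCTYPE root SYSTEM "root.dtd">')

print(plain)
print(with_doctype)
```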
Stefan
From stefan_ml at behnel.de Fri Jan 29 16:45:17 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 16:45:17 +0100
Subject: [lxml-dev] Building an ESI tag with lxml
In-Reply-To:
References: <4B61B583.50003@behnel.de> <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> <4B62F18A.2040105@behnel.de>
Message-ID: <4B63028D.2030603@behnel.de>
Martin Aspeli, 29.01.2010 16:09:
> Stefan Behnel wrote:
>> Given that the DOCTYPE plays the card here, you may also consider keeping
>> the DOCTYPE out of the tree and prepending it after the serialisation.
>
> That's going to be pretty tricky, but I guess we can try.
>
> I wonder what the side-effect of this may be, though. Presumably, the
> DOCTYPE detection is there for a reason.
:) I think it's because I complained about one of the early 2.7.x versions
breaking lxml's serialisation completely, so Daniel eventually added some
"do what I mean" work-around to call the right functions in absence of a
specific configuration (which lxml can't pass as the API it uses doesn't
allow it ...)
Don't expect everything in libxml2 to be well designed from the ground up.
It was grown over years and has become a crucial part of the GNU/GNOME/...
infrastructure. It naturally carries quite a bit of backwards compatibility
with it, in both API and functionality. It certainly has its edges.
Discussing new stuff to move it in the right direction is almost always
worth it.
>> IIRC, the XHTML formatting is rather new, may have been added in the 2.7
>> line. You'll have a good chance of being heard if you propose some sensible
>> improvements to it.
>
> Good to know. I'm not sure I know how to formulate the needed changes
> except by re-stating the problem I'm having here, though. It'd probably
> help if I understood the purpose of the special formatting better. I
> naively thought that XHTML = XML and wouldn't need any magic. :)
I wouldn't call that naive. Just go and ask.
Stefan
From stefan_ml at behnel.de Fri Jan 29 16:52:01 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jan 2010 16:52:01 +0100
Subject: [lxml-dev] Extending //r/text()
In-Reply-To: <201001291549.16846.mateusz-lists@ant.gliwice.pl>
References: <201001291549.16846.mateusz-lists@ant.gliwice.pl>
Message-ID: <4B630421.7090604@behnel.de>
Mateusz Korniak, 29.01.2010 15:49:
> I am using lxml and find it really great !
Happy to hear that. :)
> Is it possible to extend like
> http://codespeak.net/lxml/extensions.html
> functions which are similar to
> //r/text() ?
>
> Although I have defined "testf"
>
> ns = lxml.etree.FunctionNamespace(None)
> ns['testf'] = testf
>
> from running:
>
> res = test_root.xpath("//r[testf()]")
> res = test_root.xpath("//r/text()")
> res = test_root.xpath("//r/testf()")
>
> I get:
> lxml.etree.XPathEvalError: Invalid expression
>
> when executing "//r/testf()"
So that's only for the last expression, right? It's different in that it
doesn't have anything to match on. "text()" is a special function in XPath
that matches any text node. This special property can't be replaced with an
extension function.
However, you didn't write anything about your use case. There may be other
ways to do what you want, such as "myfunc( //r/node() )".
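A sketch of that alternative, assuming a recent lxml: the extension function receives the matched text nodes as an argument instead of acting as a path step. The function name shout and the sample document are hypothetical.

```python
from lxml import etree


def shout(context, texts):
    """Extension function: join the given text nodes in upper case."""
    # `texts` arrives as a list of string results from the //r/text() argument.
    return " ".join(text.upper() for text in texts)


# Register the function in the default (empty) function namespace.
ns = etree.FunctionNamespace(None)
ns["shout"] = shout

root = etree.fromstring("<doc><r>one</r><r>two</r></doc>")

# The node selection happens inside the argument, not as a path step.
result = root.xpath("shout(//r/text())")
print(result)
```

A function call is fine as a whole expression, as an argument, or inside a predicate; it just cannot stand where a node test such as text() is expected.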
Stefan
From d.rothe at semantics.de Sat Jan 30 16:14:44 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Sat, 30 Jan 2010 16:14:44 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
In-Reply-To:
References: <4B62B463.4040701@behnel.de>
Message-ID:
On Fri, 29 Jan 2010 11:29:30 +0100, Dirk Rothe
wrote:
> On Fri, 29 Jan 2010 11:11:47 +0100, Stefan Behnel
> wrote:
>
>>
>> Dirk Rothe, 29.01.2010 09:57:
>>> During XPath Evaluations in XSL-Transformations it's possible to use
>>> Stuff
>>> from http://www.exslt.org/ (so [5] does indeed match the element).
>>> During XPath Evaluations it's only possible to use standard XPath/XSLT
>>> Functions. Is there a chance to enable the functions from exslt for
>>> lxml
>>> XPath evaluations as well?
>>>
>>> =========================================================
>>> In [1]: from lxml import etree
>>>
>>> In [2]: from StringIO import StringIO
>>>
>>> In [3]: tree = etree.parse(StringIO(''))
>>>
>>> In [4]: print tree.xpath('/a[@b=concat("1","2")]')
>>> []
>>>
>>> In [5]: print tree.xpath('/a[@b=str:split("12 34")]')
>>> []
>>> =========================================================
>>
>> They should be enabled. But you have to specify the namespace of the
>> function you use.
>>
>> http://codespeak.net/lxml/xpathxslt.html#xpath
>
> Ah, sorry. I should have checked this.
Hmm, but it's not working:
=========================================================
In [9]: print tree.xpath("/a[@b=str:split('12 34')]", namespaces={'str':
"http://exslt.org/strings"})
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (23303, 0))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (24793, 0))
---------------------------------------------------------------------------
XPathEvalError Traceback (most recent call last)
D:\vls-trunk\server\bin\ in ()
d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd
in lxml.etree._ElementTree.xpath (sr
c/lxml/lxml.etree.c:41699)()
d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd
in lxml.etree.XPathDocumentEvaluator
.__call__ (src/lxml/lxml.etree.c:103472)()
d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd
in lxml.etree._XPathEvaluatorBase._h
andle_result (src/lxml/lxml.etree.c:102330)()
d:\vls-trunk\environment\python25\lib\site-packages\lxml-2.2.2-py2.5-win32.egg\lxml\etree.pyd
in lxml.etree._XPathEvaluatorBase._r
aise_eval_error (src/lxml/lxml.etree.c:102153)()
XPathEvalError: Unregistered function
From stefan_ml at behnel.de Sat Jan 30 17:10:25 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 30 Jan 2010 17:10:25 +0100
Subject: [lxml-dev] exslt functions in xpath expressions
In-Reply-To:
References: <4B62B463.4040701@behnel.de>
Message-ID: <4B6459F1.9090600@behnel.de>
Dirk Rothe, 30.01.2010 16:14:
> In [9]: print tree.xpath("/a[@b=str:split('12 34')]", namespaces={'str':
> "http://exslt.org/strings"})
>[...]
> XPathEvalError: Unregistered function
You're right, they are currently only available to XSLT. It seems that at
least the date, math, sets and string functions can be enabled in plain
XPath, but only from libxslt 1.1.25 onwards. That version was released on
2009-09-17, so it's fairly recent.
http://xmlsoft.org/XSLT/EXSLT/html/libexslt-exslt.html
Could you file a feature request for this in the bug tracker? I should be
able to add support in lxml 2.3.
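A hedged sketch of how such a call might be probed from Python, assuming an lxml build where EXSLT functions are registered for plain XPath when their namespace is declared. It uses math:max from the EXSLT math namespace rather than the str:split call above, and falls back to None when the function is not registered; the sample document is invented.

```python
from lxml import etree

EXSLT_MATH = "http://exslt.org/math"

root = etree.fromstring("<values><v>3</v><v>7</v><v>5</v></values>")


def libxslt_new_enough():
    """True when the linked libxslt is at least 1.1.25."""
    return etree.LIBXSLT_VERSION >= (1, 1, 25)


try:
    # The prefix must be mapped to the EXSLT namespace for the lookup to work.
    largest = root.xpath("math:max(/values/v)", namespaces={"math": EXSLT_MATH})
except etree.XPathEvalError:
    # Raised as "Unregistered function" when EXSLT is not available in XPath.
    largest = None

print(libxslt_new_enough(), largest)
```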
Stefan
From richardbp+lxml at gmail.com Sun Jan 31 05:06:56 2010
From: richardbp+lxml at gmail.com (Richard Baron Penman)
Date: Sun, 31 Jan 2010 15:06:56 +1100
Subject: [lxml-dev] ElementTree 1.3a xpath position broken?
Message-ID:
hello,
I am after xpath support for an application running on Google App Engine,
which unfortunately rules out lxml.
According to this document (http://effbot.org/zone/element-xpath.htm) the
development version of ElementTree 1.3a has additional support for xpath,
which covers my use cases.
From my tests I found attributes and child nodes work:
>>> from elementtree import ElementTree
>>> tree = ElementTree.fromstring('')
>>> print list(tree.findall('.//*[@class="test"]'))
[]
>>> print list(tree.findall('.//b[c]'))
[]
However tag positions appear to be broken:
>>> print list(tree.findall('.//b[1]')) # should return b element
[]
Have I missed something? Suggestions?
regards,
Richard
From stefan_ml at behnel.de Sun Jan 31 07:44:09 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 31 Jan 2010 07:44:09 +0100
Subject: [lxml-dev] ElementTree 1.3a xpath position broken?
In-Reply-To:
References:
Message-ID: <4B6526B9.9060007@behnel.de>
Richard Baron Penman, 31.01.2010 05:06:
> I am after xpath support for an application running on Google App Engine,
> which unfortunately rules out lxml.
Yeah, I know. That's one of the reasons I never found a use for the GAE on
my side. That also makes your e-mail somewhat misplaced on this list. ;)
> According to this document (http://effbot.org/zone/element-xpath.htm) the
> development version of ElementTree 1.3a has additional support for xpath,
Careful. It has *extended* the supported *subset* of XPath, compared to
what ET 1.2 has. The ElementPath implementation was also completely
rewritten, and it's what lxml has used since 2.0.
> which covers my use cases.
Apparently not. You can look at the sources to see what it supports. It's
really quite short and simple.
http://svn.effbot.org/public/elementtree-1.3/elementtree/ElementPath.py
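Besides reading the source, the supported subset can be probed directly. The sketch below uses xml.etree.ElementTree from the Python standard library, which descends from ElementTree 1.3, rather than the 1.3a snapshot discussed here, so results may differ from that snapshot. The sample document is an assumption, since the one from the original post did not survive archiving.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, chosen so that each query has exactly one match.
doc = ET.fromstring('<a><b class="test"><c/></b><b/></a>')

# The three kinds of predicate tried in the original post.
by_attribute = doc.findall('.//*[@class="test"]')  # attribute value test
by_child = doc.findall('.//b[c]')                  # "has a child named c"
by_position = doc.findall('.//b[1]')               # first b within its parent

for label, found in [("attribute", by_attribute),
                     ("child", by_child),
                     ("position", by_position)]:
    print(label, [element.tag for element in found])
```

On a current Python 3, all three queries should select the first b element; on older ElementPath versions the positional query is the one to watch.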
You may also want to look for "xpath" on this page:
http://effbot.org/zone/element-index.htm
That should get you here:
http://sourceforge.net/projects/pdis-xpath/
I never tried it, but it's been recently updated, so it looks like it's
still maintained.
> From my tests I found attributes and child nodes work:
>
>>>> from elementtree import ElementTree
>>>> tree = ElementTree.fromstring(' class="test">')
>>>> print list(tree.findall('.//*[@class="test"]'))
> []
>>>> print list(tree.findall('.//b[c]'))
> []
>
>
> However tag positions appear to be broken:
>>>> print list(tree.findall('.//b[1]')) # should return b element
> []
That shouldn't be hard to add. You just have to make sure it only counts
elements within the same parent, so you may have to add the selector in
more than one place. I guess that's why Fredrik didn't add it while he was
at it.
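The "count within the same parent" rule can be sketched independently of ElementPath. This is not Fredrik's code or the eventual fix, just a standalone illustration with a made-up helper name and sample document: the counter is reset for every parent, so position 1 means "first matching child of each parent", not "first match in the whole tree".

```python
import xml.etree.ElementTree as ET


def find_nth_per_parent(root, tag, position):
    """Return elements named `tag` that are the position-th such child of their parent."""
    matches = []
    for parent in root.iter():      # every element is a potential parent
        count = 0                   # restart counting for each parent
        for child in parent:
            if child.tag == tag:
                count += 1
                if count == position:
                    matches.append(child)
                    break           # at most one match per parent
    # Note: results are grouped by parent, which is not always document order.
    return matches


# Hypothetical document with b elements under two different parents.
doc = ET.fromstring('<a><b id="1"/><b id="2"/><x><b id="3"/></x></a>')
first_ids = [b.get("id") for b in find_nth_per_parent(doc, "b", 1)]
second_ids = [b.get("id") for b in find_nth_per_parent(doc, "b", 2)]
print(first_ids, second_ids)
```

A global counter would return only the element with id 1 for position 1; the per-parent counter also returns the one with id 3, which is what './/b[1]' means in XPath.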
Stefan
From richardbp+lxml at gmail.com Sun Jan 31 14:56:01 2010
From: richardbp+lxml at gmail.com (Richard Baron Penman)
Date: Mon, 1 Feb 2010 00:56:01 +1100
Subject: [lxml-dev] ElementTree 1.3a xpath position broken?
In-Reply-To: <4B6526B9.9060007@behnel.de>
References:
<4B6526B9.9060007@behnel.de>
Message-ID:
hi Stefan,
thanks very much for your reply.
> > I am after xpath support for an application running on Google App Engine,
> > which unfortunately rules out lxml.
>
> Yeah, I know. That's one of the reasons I never found a use for the GAE on
> my side. That also makes your e-mail somewhat misplaced on this list. ;)
>
Hopefully the lxml feature request goes somewhere:
http://code.google.com/p/googleappengine/issues/detail?id=18
Can you recommend an alternative for discussing ElementTree?
I tried emailing Fredrik earlier but didn't get a response and the
ElementTree repository hasn't been committed to since 2007.
> http://sourceforge.net/projects/pdis-xpath/
>
> I never tried it, but it's been recently updated, so it looks like it's
> still maintained.
>
That project does look promising; however, it doesn't yet support "//" or "..".
> > However tag positions appear to be broken:
> >>>> print list(tree.findall('.//b[1]')) # should return b element
> > []
>
> That shouldn't be hard to add. You just have to make sure it only counts
> elements within the same parent, so you may have to add the selector in
> more than one place. I guess that's why Fredrik didn't add it while he was
> at it.
>
I found it was half implemented and finished it off. There is some elegant
code in ElementPath.py but it needs refactoring...
Richard