From dkuhlman at rexx.com Thu Apr 1 22:41:33 2010
From: dkuhlman at rexx.com (Dave Kuhlman)
Date: Thu, 1 Apr 2010 13:41:33 -0700
Subject: [lxml-dev] Temporary data attached to custom subclasses
In-Reply-To: <4BB3091A.5090708@behnel.de>
References: <20100329214830.GA21855@cutter.rexx.com> <4BB3091A.5090708@behnel.de>
Message-ID: <20100401204133.GA76933@cutter.rexx.com>

> Date: Wed, 31 Mar 2010 10:34:34 +0200
> From: Stefan Behnel
> To: lxml-dev at codespeak.net
> Subject: Re: [lxml-dev] Temporary data attached to custom subclasses
>
> Dave Kuhlman, 29.03.2010 23:48:
> > I've been using the custom subclasses capability of lxml. It's
> > slick.
> >
> > I do, however, miss the ability to attach temporary data to the
> > ElementBase subclasses. (see the warnings under "Element
> > initialization" at http://codespeak.net/lxml/element_classes.html)
> >
> > I can, as suggested by the docs, add attributes or children to the
> > underlying etree.Element, but that means that I'd have to strip
> > that temporary data off when I want to serialize the tree.
>
> As long as your tree doesn't change, the easiest solution is to keep a
> reference to all Elements ("list(root.iter())") and then just store the
> data in the proxy instances. They are guaranteed not to change as long as
> there is a live reference to them.
>
> If your tree changes, you can still try to add new Elements to your
> keep-alive list to get the same behaviour, but you may need to take a
> little more care when you remove elements, so that you only remove them
> from the keep-alive list when you are sure they'll get discarded.
>

Stefan -

Thanks for this suggestion. The keep-alive list/set seems like a
good solution for my needs.

Another point about this -- The documentation you point at has the
following in section titled "Element initialization":

    "There is one thing to know up front. Element classes must not
    have an __init__ or __new__ method. There should not be any
    internal state either, except for the data stored in the
    underlying XML tree."

The above suggests that there is no solution such as the one you
suggest. And so, someone like me, with a little less brain-power,
is unlikely to think of that solution.

You might want to add your two paragraphs (above) or something like
the following:

    "If you really must store temporary data on an element that you
    do not want serialized, then you should put any nodes which
    must be persistent on a keep-alive list (or other container),
    since they are guaranteed not to change as long as there is a
    live reference to them."

Something like that might save you from having to answer this
question yet again at some time in the future.

And, a last point: for some purposes, instead of:

    keep_alive = list(root.iter())

the following might be better:

    keep_alive = set(root.iterdescendants())
    keep_alive.add(root)

because:

1. iterdescendants() plus adding root puts all nodes into
   keep_alive.

2. A set should give faster look-up, check for membership, etc.

Thanks again for your help with this. And, thanks even more for
lxml. It's a super tool.

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

From stefan_ml at behnel.de Thu Apr 1 23:14:08 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 01 Apr 2010 23:14:08 +0200
Subject: [lxml-dev] Temporary data attached to custom subclasses
In-Reply-To: <20100401204133.GA76933@cutter.rexx.com>
References: <20100329214830.GA21855@cutter.rexx.com> <4BB3091A.5090708@behnel.de> <20100401204133.GA76933@cutter.rexx.com>
Message-ID: <4BB50CA0.4050401@behnel.de>

Dave Kuhlman, 01.04.2010 22:41:
> You might want to add your two paragraphs (above) or something like
> the following:
>
>     "If you really must store temporary data on an element that you
>     do not want serialized, then you should put any nodes which
>     must be persistent on a keep-alive list (or other container),
>     since they are guaranteed not to change as long as there is a
>     live reference to them."
>
> Something like that might save you from having to answer this
> question yet again at some time in the future.

Thanks, I'll add something like that to the docs.

> And, a last point: for some purposes, instead of:
>
>     keep_alive = list(root.iter())
>
> the following might be better:
>
>     keep_alive = set(root.iterdescendants())
>     keep_alive.add(root)
>
> because:
>
> 1. iterdescendants() plus adding root puts all nodes into
>    keep_alive.

Then that shouldn't be any different from

    keep_alive = set(root.iter())

The only reason why there *is* an iterdescendants() is that iter() yields
all nodes in the subtree, including the root itself.

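For readers following this thread, a minimal sketch of the keep-alive
approach might look like the following. The class name Item, the sample
document and the attribute name "visited" are made up for illustration;
only the set(root.iter()) idea comes from the messages above.

```python
from lxml import etree


class Item(etree.ElementBase):
    """Hypothetical custom element class; deliberately no __init__/__new__."""


# Ask the parser to hand out Item proxies for every element.
parser = etree.XMLParser()
parser.set_element_class_lookup(etree.ElementDefaultClassLookup(element=Item))

root = etree.fromstring("<root><a/><b><c/></b></root>", parser)

# iter() yields the root itself plus all descendants, so this should hold
# the same proxies as set(root.iterdescendants()) with root added.
keep_alive = set(root.iter())
descendants_plus_root = set(root.iterdescendants())
descendants_plus_root.add(root)

# Temporary data is stored on the Python proxy objects, not in the XML tree.
for element in keep_alive:
    element.visited = False
root[0].visited = True

# The temporary attribute is not part of the serialized tree.
serialized = etree.tostring(root)
```

As long as keep_alive is referenced, root[0] should return the same proxy
each time, so the value stored on it stays available. Once the set is
dropped, the proxies (and the data on them) may be discarded.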
Stefan

From dkuhlman at rexx.com Thu Apr 1 23:26:34 2010
From: dkuhlman at rexx.com (Dave Kuhlman)
Date: Thu, 1 Apr 2010 14:26:34 -0700
Subject: [lxml-dev] Temporary data attached to custom subclasses
In-Reply-To: <4BB50CA0.4050401@behnel.de>
References: <20100329214830.GA21855@cutter.rexx.com> <4BB3091A.5090708@behnel.de> <20100401204133.GA76933@cutter.rexx.com> <4BB50CA0.4050401@behnel.de>
Message-ID: <20100401212633.GA77955@cutter.rexx.com>

On Thu, Apr 01, 2010 at 11:14:08PM +0200, Stefan Behnel wrote:
> Dave Kuhlman, 01.04.2010 22:41:
> >
> >     keep_alive = set(root.iterdescendants())
> >     keep_alive.add(root)
> >
> > because:
> >
> > 1. iterdescendants() plus adding root puts all nodes into
> >    keep_alive.
>
> Then that shouldn't be any different from
>
>     keep_alive = set(root.iter())
>
> The only reason why there *is* an iterdescendants() is that iter() yields
> all nodes in the subtree, including the root itself.
>

Stefan -

You are right. My mistake. I thought I had done a test with
iter(), but I must have confused myself somehow.

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

From stefan_ml at behnel.de Mon Apr 5 10:41:40 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 05 Apr 2010 10:41:40 +0200
Subject: [lxml-dev] lxml iterparse generator not returning anything
In-Reply-To:
References:
Message-ID: <4BB9A244.8020708@behnel.de>

Joe Sarre, 30.03.2010 16:53:
> I'm finding that when using iterparse, the generator always throws StopIteration immediately, without returning any data.

The only comment I can give on this one is that I've never seen this
before. I'd first try to make sure it's really not a problem in your setup.

Stefan From wichert at wiggy.net Mon Apr 5 10:52:18 2010 From: wichert at wiggy.net (Wichert Akkerman) Date: Mon, 05 Apr 2010 10:52:18 +0200 Subject: [lxml-dev] adding a namespace In-Reply-To: <4BA911E9.4030800@behnel.de> References: <4BA235A5.7070507@wiggy.net> <4BA911E9.4030800@behnel.de> Message-ID: <4BB9A4C2.80708@wiggy.net> On 3/23/10 20:09 , Stefan Behnel wrote: > Simon showed you a way, but apart from that, it's a missing feature. > Changing namespace mappings is nothing that the ElementTree API needs to > care about, and lxml clearly lacks a good way to do it. > > Could you file a ticket on the bug tracker? This should be doable for 2.3. Most certainly: https://bugs.launchpad.net/lxml/+bug/555602 Wichert. From wichert at wiggy.net Mon Apr 5 10:58:41 2010 From: wichert at wiggy.net (Wichert Akkerman) Date: Mon, 05 Apr 2010 10:58:41 +0200 Subject: [lxml-dev] downloads-a-plenty on launchpad page? Message-ID: <4BB9A641.2040407@wiggy.net> I just noticed https://launchpad.net/lxml/ really likes you to download the lxml 2.2 release. So much in fact it has that download listed 129 times. I suspect that isn't intentional? :) Wichert. From wichert at wiggy.net Mon Apr 5 11:00:16 2010 From: wichert at wiggy.net (Wichert Akkerman) Date: Mon, 05 Apr 2010 11:00:16 +0200 Subject: [lxml-dev] downloads-a-plenty on launchpad page? In-Reply-To: <4BB9A641.2040407@wiggy.net> References: <4BB9A641.2040407@wiggy.net> Message-ID: <4BB9A6A0.9040101@wiggy.net> On 4/5/10 10:58 , Wichert Akkerman wrote: > I just noticed https://launchpad.net/lxml/ really likes you to download > the lxml 2.2 release. So much in fact it has that download listed 129 > times. I suspect that isn't intentional? :) At least it is consistent: looking at https://launchpad.net/lxml/+download this appears to happen for all lxml releases. Wichert. 
From sidnei.da.silva at gmail.com Tue Apr 6 20:28:36 2010 From: sidnei.da.silva at gmail.com (Sidnei da Silva) Date: Tue, 6 Apr 2010 15:28:36 -0300 Subject: [lxml-dev] downloads-a-plenty on launchpad page? In-Reply-To: <4BB9A6A0.9040101@wiggy.net> References: <4BB9A641.2040407@wiggy.net> <4BB9A6A0.9040101@wiggy.net> Message-ID: On Mon, Apr 5, 2010 at 6:00 AM, Wichert Akkerman wrote: > On 4/5/10 10:58 , Wichert Akkerman wrote: >> I just noticed https://launchpad.net/lxml/ really likes you to download >> the lxml 2.2 release. So much in fact it has that download listed 129 >> times. I suspect that isn't intentional? :) > > At least it is consistent: looking at > https://launchpad.net/lxml/+download this appears to happen for all lxml > releases. Seems like it only happens for lxml. I brought it up with the Launchpad team, they are looking into it. -- Sidnei From sgg at ci.uchicago.edu Tue Apr 6 21:41:28 2010 From: sgg at ci.uchicago.edu (Stephen Graham) Date: Tue, 6 Apr 2010 15:41:28 -0400 Subject: [lxml-dev] installing lxml on MacOS Message-ID: <1F4D9F37-B9AA-440B-91CE-E185F2E1180D@ci.uchicago.edu> I am trying to go down the learning curve on lxml. I tried to follow the install instructions to install lxml on my MacOS As root, I did: STATIC_DEPS=true easy_install lxml The install chugged away, but eventually failed: ... 
Undefined symbols for architecture i386: "_gzdirect", referenced from: ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o) ld: symbol(s) not found for architecture i386 collect2: ld returned 1 exit status Undefined symbols for architecture ppc: "_gzdirect", referenced from: ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o) ld: symbol(s) not found for architecture ppc collect2: ld returned 1 exit status lipo: can't open input file: /var/tmp//ccsmIZu8.out (No such file or directory) make[2]: *** [xmllint] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 Traceback (most recent call last): File "/usr/bin/easy_install", line 8, in load_entry_point('setuptools==0.6c7', 'console_scripts', 'easy_install')() File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 1670, in main with_ei_usage(lambda: File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 1659, in with_ei_usage return f() File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 1674, in distclass=DistributionWithoutHelpCommands, **kw File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ python2.5/distutils/core.py", line 151, in setup dist.run_commands() File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ python2.5/distutils/dist.py", line 974, in run_commands self.run_command(cmd) File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ python2.5/distutils/dist.py", line 994, in run_command cmd_obj.run() File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 211, in run self.easy_install(spec, not self.no_deps) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 446, in easy_install return 
self.install_item(spec, dist.location, tmpdir, deps) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 471, in install_item dists = self.install_eggs(spec, download, tmpdir) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 655, in install_eggs return self.build_and_install(setup_script, setup_base) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 930, in build_and_install self.run_setup(setup_script, setup_base, args) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/command/easy_install.py", line 919, in run_setup run_setup(setup_script, args) File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/sandbox.py", line 27, in run_setup lambda: execfile( File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/sandbox.py", line 63, in run return func() File "/System/Library/Frameworks/Python.framework/Versions/2.5/ Extras/lib/python/setuptools/sandbox.py", line 29, in {'__file__':setup_script, '__name__':'__main__'} File "setup.py", line 119, in File "/tmp/easy_install-XVh5wp/lxml-2.2.6/setupinfo.py", line 50, in ext_modules File "/tmp/easy_install-XVh5wp/lxml-2.2.6/buildlibxml.py", line 208, in build_libxml2xslt File "/tmp/easy_install-XVh5wp/lxml-2.2.6/buildlibxml.py", line 158, in call_subprocess Exception: Command "make" returned code 2 Any thing obvious jump out at anyone that I might have missed? 
thanks in advance sgg Steve Graham sgg at ci.uchicago.edu From dlindquist at arkayne.com Wed Apr 7 18:56:20 2010 From: dlindquist at arkayne.com (David Lindquist) Date: Wed, 7 Apr 2010 09:56:20 -0700 Subject: [lxml-dev] parse timeout Message-ID: Hello, I have to parse a series of URLs, some of which might hang for an unacceptable length of time. I cannot figure out how to add a timeout: import socket from lxml.html import parse socket.setdefaulttimeout(10) doc = parse('http://example.com/hang_for_a_long_time') # this might hang indefinitely Is there some other way to add a timeout, short of recreating the parse function using urllib2? Thanks, David Lindquist From optilude+lists at gmail.com Thu Apr 8 05:03:18 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Thu, 08 Apr 2010 11:03:18 +0800 Subject: [lxml-dev] lxml crashes parsing anything on SuSE Linux Enterprise Server 11 Message-ID: Hi, I'm trying to get lxml to work on SLES11 (x86_64). Much pain and suffering has been had so far to come to the point where lxml is most likely the culprit of all kinds of random crashes. I'm using the system-installed Python 2.6. I'd prefer a non-system Python, but it's nearly impossible to compile one from source that also has ssl and hashlib support, so I've given up. I've got two test scenarios that both produce the same result: - System Python 2.6 using an rpm-installed lxml - Same Python in a virtualenv --no-site-packages with lxml installed via easy_install The system packages we have installed are: $ zypper search libxml2 libxslt libexslt lxml Loading repository data... Reading installed packages... 
S | Name             | Summary                                      | Type
--+------------------+----------------------------------------------+-----------
  | libxml2          | A Library to Manipulate XML Files            | srcpackage
i | libxml2          | A Library to Manipulate XML Files            | package
i | libxml2-32bit    | A Library to Manipulate XML Files            | package
i | libxml2-devel    | Include Files and Libraries mandatory for -> | package
  | libxml2-doc      | A Library to Manipulate XML Files            | package
i | libxml2-python   | Python Bindings for libxml2                  | package
  | libxml2-python   | Python Bindings for libxml2                  | srcpackage
  | libxslt          | XSL Transformation Library                   | srcpackage
i | libxslt          | XSL Transformation Library                   | package
i | libxslt-32bit    | XSL Transformation Library                   | package
i | libxslt-devel    | Include Files and Libraries mandatory for -> | package
  | libxslt-python   | Python Bindings for libxslt                  | package
  | libxslt-python   | Python Bindings for libxslt                  | srcpackage
  | perl-XML-LibXSLT | XML::LibXSLT Perl Module                     | package
  | perl-XML-LibXSLT | XML::LibXSLT Perl Module                     | srcpackage
i | python-lxml      | A Pythonic Binding for the libxml2 and lib-> | package
  | python-lxml      | A Pythonic Binding for the libxml2 and lib-> | srcpackage
  | python-lxml-doc  | Documentation for python-lxml Package        | package
i | slessp0-libxml2  | Security update for libxml2                  | patch

The ones with 'i' are installed.

When trying to parse anything, the result is a segfault. gdb says:

$ gdb ./bin/python
GNU gdb (GDB; SUSE Linux Enterprise 11) 6.8.50.20081120-cvs
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
...
(no debugging symbols found) (gdb) run Starting program: /home/osc/tmp/lxml-env/bin/python (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) [Thread debugging using libthread_db enabled] (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) Python 2.6 (r26:66714, Feb 21 2009, 02:16:04) [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2 Type "help", "copyright", "credits" or "license" for more information. (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) >>> import lxml.etree >>> lxml.etree.parse('') Program received signal SIGSEGV, Segmentation fault. 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 #1 0x00007ffff5b04757 in __xmlParserInputBufferCreateFilename () from /usr/lib64/libxml2.so.2 #2 0x00007ffff5b4bd66 in xmlParseCatalogFile () from /usr/lib64/libxml2.so.2 #3 0x00007ffff5b4bff3 in ?? () from /usr/lib64/libxml2.so.2 #4 0x00007ffff5b4c51d in ?? () from /usr/lib64/libxml2.so.2 #5 0x00007ffff5b4d0bf in xmlACatalogResolve () from /usr/lib64/libxml2.so.2 #6 0x00007ffff5b04b43 in ?? () from /usr/lib64/libxml2.so.2 #7 0x00007ffff5b055cf in ?? 
() from /usr/lib64/libxml2.so.2 #8 0x00007ffff62e557c in __pyx_f_4lxml_5etree__local_resolver ( __pyx_v_c_url=0x806670 "", __pyx_v_c_pubid=0x0, __pyx_v_c_context=0x80c350) at src/lxml/lxml.etree.c:63618 #9 0x00007ffff5b04bef in xmlLoadExternalEntity () from /usr/lib64/libxml2.so.2 #10 0x00007ffff5af30df in xmlCtxtReadFile () from /usr/lib64/libxml2.so.2 #11 0x00007ffff62e85d2 in __pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile (__pyx_v_self=0x7ffff7e695a8, __pyx_v_c_filename=0x7ffff7f29744 "") at src/lxml/lxml.etree.c:68146 #12 0x00007ffff627d6cb in __pyx_f_4lxml_5etree__parseDocFromFile ( __pyx_v_filename8=0x7ffff7f29720, __pyx_v_parser=0x7ffff7e695a8) at src/lxml/lxml.etree.c:71175 #13 0x00007ffff627d9d9 in __pyx_f_4lxml_5etree__parseDocumentFromURL ( __pyx_v_url=0x7fffffffdc70, __pyx_v_parser=0x700001c4f) at src/lxml/lxml.etree.c:72080 ---Type to continue, or q to quit--- #14 0x00007ffff6311b78 in __pyx_f_4lxml_5etree__parseDocument ( __pyx_v_source=0x7ffff7f29720, __pyx_v_parser=0x7ffff7dac470, __pyx_v_base_url=0x7ffff7dac470) at src/lxml/lxml.etree.c:71797 #15 0x00007ffff6313656 in __pyx_pf_4lxml_5etree_parse ( __pyx_self=, __pyx_args=, __pyx_kwds=) at src/lxml/lxml.etree.c:49958 #16 0x00007ffff7b15f28 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0 #17 0x00007ffff7b1b6c0 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0 #18 0x00007ffff7b13822 in PyEval_EvalCode () from /usr/lib64/libpython2.6.so.1.0 #19 0x00007ffff7b34b13 in ?? 
() from /usr/lib64/libpython2.6.so.1.0 #20 0x00007ffff7b3711c in PyRun_InteractiveOneFlags () from /usr/lib64/libpython2.6.so.1.0 #21 0x00007ffff7b372d6 in PyRun_InteractiveLoopFlags () from /usr/lib64/libpython2.6.so.1.0 #22 0x00007ffff7b357b6 in PyRun_AnyFileExFlags () from /usr/lib64/libpython2.6.so.1.0 #23 0x00007ffff7b40dba in Py_Main () from /usr/lib64/libpython2.6.so.1.0 #24 0x00007ffff6e94586 in __libc_start_main () from /lib64/libc.so.6 #25 0x00000000004006e9 in _start () (gdb) I've tried to compile a custom libxml2, libxslt and zlib and build against those, but not had any luck. I may've been doing that wrong, though. Any help greatly appreciated! Martin From stefan_ml at behnel.de Thu Apr 8 08:42:42 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 Apr 2010 08:42:42 +0200 Subject: [lxml-dev] lxml crashes parsing anything on SuSE Linux Enterprise Server 11 In-Reply-To: References: Message-ID: <4BBD7AE2.90604@behnel.de> Martin Aspeli, 08.04.2010 05:03: > >>> import lxml.etree > >>> lxml.etree.parse('') Note that you should pass a filename here. > Program received signal SIGSEGV, Segmentation fault. > 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 > (gdb) bt > #0 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 > #1 0x00007ffff5b04757 in __xmlParserInputBufferCreateFilename () > from /usr/lib64/libxml2.so.2 > #2 0x00007ffff5b4bd66 in xmlParseCatalogFile () from > /usr/lib64/libxml2.so.2 > #3 0x00007ffff5b4bff3 in ?? () from /usr/lib64/libxml2.so.2 > #4 0x00007ffff5b4c51d in ?? () from /usr/lib64/libxml2.so.2 > #5 0x00007ffff5b4d0bf in xmlACatalogResolve () from /usr/lib64/libxml2.so.2 > #6 0x00007ffff5b04b43 in ?? () from /usr/lib64/libxml2.so.2 > #7 0x00007ffff5b055cf in ?? 
() from /usr/lib64/libxml2.so.2 > #8 0x00007ffff62e557c in __pyx_f_4lxml_5etree__local_resolver ( > __pyx_v_c_url=0x806670 "", __pyx_v_c_pubid=0x0, > __pyx_v_c_context=0x80c350) at src/lxml/lxml.etree.c:63618 > #9 0x00007ffff5b04bef in xmlLoadExternalEntity () from > /usr/lib64/libxml2.so.2 > #10 0x00007ffff5af30df in xmlCtxtReadFile () from /usr/lib64/libxml2.so.2 > #11 0x00007ffff62e85d2 in > __pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile > (__pyx_v_self=0x7ffff7e695a8, __pyx_v_c_filename=0x7ffff7f29744 > "") This happens during XML catalog resolution, maybe there's a broken catalog configuration on the system that libxml2 stumbles over. http://xmlsoft.org/catalog.html > I've tried to compile a custom libxml2, libxslt and zlib and build > against those, but not had any luck. I may've been doing that wrong, > though. You are using the system version of libxml2/libxslt, right? Which versions are installed? You have provided incorrect input in your example above (which, I assume, was just an example), but it certainly shouldn't make it crash, and it doesn't crash for me. I don't know how old the system libraries are, so you could try to build with --static-deps to have it download the newest versions and build statically against them. If that doesn't work, please provide the error output. Stefan From stefan_ml at behnel.de Thu Apr 8 08:53:21 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 08 Apr 2010 08:53:21 +0200 Subject: [lxml-dev] parse timeout In-Reply-To: References: Message-ID: <4BBD7D61.50905@behnel.de> David Lindquist, 07.04.2010 18:56: > I have to parse a series of URLs, some of which might hang for an > unacceptable length of time. I cannot figure out how to add a timeout: > > import socket > socket.setdefaulttimeout(10) This is a pure Python-level setting and has no effect on the C-level socket used internally by libxml2. In the libxml2 source, there's a static variable 'timeout' in nanohttp.c that (I assume) controls the timeout. 
It's set to one minute and I don't see a way to change it from outside.
You can decrease that value in the sources to see if that works for you.

> from lxml.html import parse
> doc = parse('http://example.com/hang_for_a_long_time') # this might hang
> indefinitely
>
> Is there some other way to add a timeout, short of recreating the parse
> function using urllib2?

Given that it's rather trivial to do that, I'd just take that route.

Stefan

From stefan_ml at behnel.de Thu Apr 8 09:03:39 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 08 Apr 2010 09:03:39 +0200
Subject: [lxml-dev] installing lxml on MacOS
In-Reply-To: <1F4D9F37-B9AA-440B-91CE-E185F2E1180D@ci.uchicago.edu>
References: <1F4D9F37-B9AA-440B-91CE-E185F2E1180D@ci.uchicago.edu>
Message-ID: <4BBD7FCB.2020807@behnel.de>

Stephen Graham, 06.04.2010 21:41:
> I am trying to go down the learning curve on lxml.
>
> I tried to follow the install instructions to install lxml on my MacOS
>
> As root, I did:
> STATIC_DEPS=true easy_install lxml
>
> The install chugged away, but eventually failed:
>
> ...
>
> Undefined symbols for architecture i386:
> "_gzdirect", referenced from:
> ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o)
> ld: symbol(s) not found for architecture i386
> collect2: ld returned 1 exit status
> Undefined symbols for architecture ppc:
> "_gzdirect", referenced from:
> ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o)
> ld: symbol(s) not found for architecture ppc
> collect2: ld returned 1 exit status
> lipo: can't open input file: /var/tmp//ccsmIZu8.out (No such file or
> directory)

I'm so happy no-one forces me to use MacOS-X.

My guess is that you are missing some compiler flag. The options are set in
buildlibxml.py. Maybe others can give better advice here, but I'd suggest
you play with them a little to see if they can be made to work for you.

It's also possible that some library (like zlib?) conflicts with your
Python installation.
I assume you use some package management system? Stefan From optilude+lists at gmail.com Thu Apr 8 09:21:16 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Thu, 08 Apr 2010 15:21:16 +0800 Subject: [lxml-dev] lxml crashes parsing anything on SuSE Linux Enterprise Server 11 In-Reply-To: <4BBD7AE2.90604@behnel.de> References: <4BBD7AE2.90604@behnel.de> Message-ID: Stefan Behnel wrote: > Martin Aspeli, 08.04.2010 05:03: >> >>> import lxml.etree >> >>> lxml.etree.parse('') > > Note that you should pass a filename here. Right. It's actually doing a filename in the real code. I just had it in my mind that .parse() could take either, so when I did the interactive example, I thought I'd make it easier to reproduce. Passing a filename (from the current directory) gives the same error, though. >> Program received signal SIGSEGV, Segmentation fault. >> 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 >> (gdb) bt >> #0 0x00007ffff6ef2ab9 in strncmp () from /lib64/libc.so.6 >> #1 0x00007ffff5b04757 in __xmlParserInputBufferCreateFilename () >> from /usr/lib64/libxml2.so.2 >> #2 0x00007ffff5b4bd66 in xmlParseCatalogFile () from >> /usr/lib64/libxml2.so.2 >> #3 0x00007ffff5b4bff3 in ?? () from /usr/lib64/libxml2.so.2 >> #4 0x00007ffff5b4c51d in ?? () from /usr/lib64/libxml2.so.2 >> #5 0x00007ffff5b4d0bf in xmlACatalogResolve () from /usr/lib64/libxml2.so.2 >> #6 0x00007ffff5b04b43 in ?? () from /usr/lib64/libxml2.so.2 >> #7 0x00007ffff5b055cf in ?? 
() from /usr/lib64/libxml2.so.2 >> #8 0x00007ffff62e557c in __pyx_f_4lxml_5etree__local_resolver ( >> __pyx_v_c_url=0x806670 "", __pyx_v_c_pubid=0x0, >> __pyx_v_c_context=0x80c350) at src/lxml/lxml.etree.c:63618 >> #9 0x00007ffff5b04bef in xmlLoadExternalEntity () from >> /usr/lib64/libxml2.so.2 >> #10 0x00007ffff5af30df in xmlCtxtReadFile () from /usr/lib64/libxml2.so.2 >> #11 0x00007ffff62e85d2 in >> __pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile >> (__pyx_v_self=0x7ffff7e695a8, __pyx_v_c_filename=0x7ffff7f29744 >> "") > > This happens during XML catalog resolution, maybe there's a broken catalog > configuration on the system that libxml2 stumbles over. > > http://xmlsoft.org/catalog.html Wouldn't surprise me. :) /etc/xml/catalog contains: suse-catalog.xml exists and looks ok to my eye. There are a few other things in /etc/xml as well. > > I've tried to compile a custom libxml2, libxslt and zlib and build > > against those, but not had any luck. I may've been doing that wrong, > > though. > > You are using the system version of libxml2/libxslt, right? Which versions > are installed? libxml2: 2.7.1-10.9.1 libxslt: 1.1.24-19.15 both installed via the SuSE rpms, yes. > You have provided incorrect input in your example above (which, I assume, > was just an example), but it certainly shouldn't make it crash, and it > doesn't crash for me. > > I don't know how old the system libraries are, so you could try to build > with --static-deps to have it download the newest versions and build > statically against them. If that doesn't work, please provide the error output. What are the exact steps to do this correctly? I tried a few different ways from reading the build docs and kept getting compilation errors. I don't have the logs right here, but... - What should I download? The .tar.gz from PyPI? - What is the exact Python command to run? - Do I need to set any environment variables or pass any other options? 
At least if you tell me the steps I can try to reproduce reliably. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From l.oluyede at gmail.com Fri Apr 9 15:46:47 2010 From: l.oluyede at gmail.com (Lawrence Oluyede) Date: Fri, 9 Apr 2010 15:46:47 +0200 Subject: [lxml-dev] iterparse segfaults misteriously Message-ID: Yesterday my colleague and I noticed a series of apparently random deaths of apache processes and we found out it was lxml's fault. We slimmed down the use case to the following lines of code: xml = "Test" open("/tmp/test.xml", "w").write(xml) from lxml import etree list(etree.iterparse(open("/tmp/test.xml"), events=("start", "end"))) This crashes badly on one of our staging machines but not at all on our development computers. We thought it was due to a slight difference in libxml2/lxml versions but we tried all the configurations and it keeps on crashing with the following exception: Traceback (most recent call last): File "", line 1, in File "iterparse.pxi", line 471, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:91652) TypeError: 'str' object doesn't support slice deletion (we had "TypeError: 'NoneType' object doesn't support slice deletion" on the actual code but the only difference is that it's using StringIO instead of a file) We debugged postmortem and this is what GDB is saying: #0 __pyx_pf_4lxml_5etree_9iterparse___next__ (__pyx_v_self=0x2b78ff1af1d0) at src/lxml/lxml.etree.c:86607 #1 0x00002b78fa892dec in listextend (self=0x2b78ffa15518, b=) at Objects/listobject.c:823 #2 0x00002b78fa89336a in list_init (self=0x2b78ffa15518, args=, kw=) at Objects/listobject.c:2391 #3 0x00002b78fa8b2338 in type_call (type=, args=0x2b78ffa233d0, kwds=0x0) at Objects/typeobject.c:436 #4 0x00002b78fa869df3 in PyObject_Call (func=0x0, arg=0x307194faf0, kw=0x0) at Objects/abstract.c:1861 #5 0x00002b78fa8e701a in PyEval_EvalFrameEx (f=0x110a2f10, throwflag=) at 
Python/ceval.c:3823 #6 0x00002b78fa8ead0b in PyEval_EvalCodeEx (co=0x2b78fac1e0a8, globals=, locals=, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2875 #7 0x00002b78fa8ead52 in PyEval_EvalCode (co=0x0, globals=0x307194faf0, locals=0x0) at Python/ceval.c:514 #8 0x00002b78fa90d991 in PyRun_FileExFlags (fp=0x11031010, filename=0x7fffb027d7ee "test_lxml.py", start=, globals=0x11054490, locals=0x11054490, closeit=1, flags=0x7fffb027cad0) at Python/pythonrun.c:1273 #9 0x00002b78fa90dc18 in PyRun_SimpleFileExFlags (fp=, filename=0x7fffb027d7ee "test_lxml.py", closeit=1, flags=0x7fffb027cad0) at Python/pythonrun.c:879 #10 0x00002b78fa9167f9 in Py_Main (argc=1, argv=) at Modules/main.c:532 #11 0x000000307161d8b4 in __libc_start_main () from /lib64/libc.so.6 #12 0x0000000000400629 in _start () Any ideas? We tried with lxml 2.2.0, lxml 2.2.4 and lxml 2.2.6 against libxml2 2.7.2 and libxml2 2.7.7 and libxslt 1.1.4 and libxslt 1.1.26 Python version is 2.5.4 on a Centos 5 64bit -- Lawrence Oluyede [eng] http://oluyede.org - http://twitter.com/lawrenceoluyede [ita] http://www.neropercaso.it [flickr] http://www.flickr.com/photos/rhymes From bboissin at gmail.com Mon Apr 12 19:10:26 2010 From: bboissin at gmail.com (Benoit Boissinot) Date: Mon, 12 Apr 2010 19:10:26 +0200 Subject: [lxml-dev] Bug with HTMLParser() Message-ID: The HTMLParser() seems to strip any whitespace after , for example: http://paste.pocoo.org/show/201052/ In [25]: etree.tostring(etree.fromstring('Article 1er bis (nouveau)', etree.HTMLParser())) Out[25]: 'Article 1erbis (nouveau)' In [26]: etree.tostring(etree.fromstring('Article 1er bis (nouveau)')) Out[26]: 'Article 1er bis (nouveau)' Is it a bug with lxml? or with libxml2? 
regards, Benoit From stefan_ml at behnel.de Tue Apr 13 06:55:36 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 13 Apr 2010 06:55:36 +0200 Subject: [lxml-dev] Bug with HTMLParser() In-Reply-To: References: Message-ID: <4BC3F948.2060408@behnel.de> Benoit Boissinot, 12.04.2010 19:10: > The HTMLParser() seems to strip any whitespace after, for example: > http://paste.pocoo.org/show/201052/ > > In [25]: etree.tostring(etree.fromstring('Article > 1er bis (nouveau)', > etree.HTMLParser())) > Out[25]: 'Article 1erbis > (nouveau)' > > In [26]: etree.tostring(etree.fromstring('Article > 1er bis (nouveau)')) > Out[26]: 'Article 1er bis > (nouveau)' > > Is it a bug with lxml? or with libxml2? Thanks for the report, looks like you already found the libxml2 mailing list. It's common to report errors there by reproducing them using xmllint, so in this case xmllint --html inputfile.html would show you if the problem also shows with plain libxml2. Stefan From stefan_ml at behnel.de Tue Apr 13 08:06:52 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 13 Apr 2010 08:06:52 +0200 Subject: [lxml-dev] iterparse segfaults misteriously In-Reply-To: References: Message-ID: <4BC409FC.4060709@behnel.de> Lawrence Oluyede, 09.04.2010 15:46: > Yesterday my colleague and I noticed a series of apparently random > deaths of apache processes and we found out it was lxml's fault. > We slimmed down the use case to the following lines of code: > > xml = "Test" > open("/tmp/test.xml", "w").write(xml) > from lxml import etree > list(etree.iterparse(open("/tmp/test.xml"), events=("start", "end"))) > > This crashes badly on one of our staging machines but not at all on > our development computers. We thought it was due to a slight > difference in libxml2/lxml versions but we tried all the configurations Ok, sounds like this will be tricky to reproduce then... 
> it keeps on crashing with the following exception:
>
> Traceback (most recent call last):
>   File "", line 1, in
>   File "iterparse.pxi", line 471, in lxml.etree.iterparse.__next__
> (src/lxml/lxml.etree.c:91652)
> TypeError: 'str' object doesn't support slice deletion
>
> (we had "TypeError: 'NoneType' object doesn't support slice deletion"
> on the actual code but the only difference is that it's using StringIO
> instead of a file)

Looks weird. What version of lxml do the above line numbers refer to?

> We debugged postmortem and this is what GDB is saying:
>
> #0  __pyx_pf_4lxml_5etree_9iterparse___next__
> (__pyx_v_self=0x2b78ff1af1d0) at src/lxml/lxml.etree.c:86607
> #1  0x00002b78fa892dec in listextend (self=0x2b78ffa15518, b=<value
> optimized out>) at Objects/listobject.c:823
>[...]
> Any ideas?

Never seen this before, but I'll look into it as soon as I know what to look
at.

Stefan

From bboissin at gmail.com  Tue Apr 13 09:48:44 2010
From: bboissin at gmail.com (Benoit Boissinot)
Date: Tue, 13 Apr 2010 09:48:44 +0200
Subject: [lxml-dev] Bug with HTMLParser()
In-Reply-To: <4BC3F948.2060408@behnel.de>
References: <4BC3F948.2060408@behnel.de>
Message-ID: <20100413074844.GJ9550@pirzuine>

On Tue, Apr 13, 2010 at 06:55:36AM +0200, Stefan Behnel wrote:
> Benoit Boissinot, 12.04.2010 19:10:
> >
> >Is it a bug with lxml? or with libxml2?
>
> Thanks for the report, looks like you already found the libxml2
> mailing list. It's common to report errors there by reproducing them
> using xmllint, so in this case
>
>     xmllint --html inputfile.html
>
> would show you if the problem also shows with plain libxml2.

Thanks for the hint! (maybe it could be on the website, for those trying to
determine if something is a bug with lxml or libxml2).
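
For readers of the archive, a rough sketch of the check Stefan describes: feed the same markup to lxml's HTMLParser and to `xmllint --html`, then compare. If both show the same oddity, the behaviour most likely comes from libxml2 rather than lxml. The sample markup and the helper names `parse_with_lxml` and `parse_with_xmllint` are made up for illustration (the markup from the original report did not survive archiving), and the code assumes Python 3 with lxml installed. `xmllint` is optional and is skipped when it is not on the PATH.

```python
import os
import shutil
import subprocess
import tempfile

from lxml import etree

# Hypothetical sample input; the markup from the original report is not
# preserved in this archive.
SAMPLE = "<p>Article 1<sup>er</sup> bis (nouveau)</p>"


def parse_with_lxml(markup):
    """Parse with lxml's HTMLParser and serialize the resulting tree."""
    root = etree.fromstring(markup, etree.HTMLParser())
    return etree.tostring(root).decode("ascii")


def parse_with_xmllint(markup):
    """Run the same input through `xmllint --html`.

    Returns None when the xmllint binary is not installed.
    """
    if shutil.which("xmllint") is None:
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(markup)
        path = f.name
    try:
        result = subprocess.run(
            ["xmllint", "--html", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        return result.stdout
    finally:
        os.remove(path)


lxml_result = parse_with_lxml(SAMPLE)
xmllint_result = parse_with_xmllint(SAMPLE)
print("lxml:   ", lxml_result)
print("xmllint:", xmllint_result)
```

The two outputs will not match byte for byte (xmllint adds a DOCTYPE and line breaks), so the comparison is meant to be done by eye on the part of the markup that looks wrong.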
cheers,

Benoit

--
:wq

From l.oluyede at gmail.com  Tue Apr 13 11:19:48 2010
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Tue, 13 Apr 2010 11:19:48 +0200
Subject: [lxml-dev] iterparse segfaults misteriously
In-Reply-To: <4BC409FC.4060709@behnel.de>
References: <4BC409FC.4060709@behnel.de>
Message-ID:

On Tue, Apr 13, 2010 at 8:06 AM, Stefan Behnel wrote:
> Ok, sounds like this will be tricky to reproduce then...

Yes, I even have problems trying to reproduce that.

>> it keeps on crashing with the following exception:
>>
>> Traceback (most recent call last):
>>   File "", line 1, in
>>   File "iterparse.pxi", line 471, in lxml.etree.iterparse.__next__
>> (src/lxml/lxml.etree.c:91652)
>> TypeError: 'str' object doesn't support slice deletion
>>
>> (we had "TypeError: 'NoneType' object doesn't support slice deletion"
>> on the actual code but the only difference is that it's using StringIO
>> instead of a file)
>
> Looks weird. What version of lxml do the above line numbers refer to?

All of them, but I cannot reproduce it again. It just segfaults.

>> We debugged postmortem and this is what GDB is saying:
>>
>> #0  __pyx_pf_4lxml_5etree_9iterparse___next__
>> (__pyx_v_self=0x2b78ff1af1d0) at src/lxml/lxml.etree.c:86607
>> #1  0x00002b78fa892dec in listextend (self=0x2b78ffa15518, b=<value
>> optimized out>) at Objects/listobject.c:823
>> [...]
>> Any ideas?
>
> Never seen this before, but I'll look into it as soon as I know what to look
> at.
>

That's tough, we don't really know more than that. The odd thing is that it
seems to depend on some weird user configuration, because on the same machine
it works just fine for other users. I checked with ldd whether etree.so was
linked in a weird way, but it looks fine. I checked LD_LIBRARY_PATH and all
the users have the same paths.
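
Since the crash seems to depend on the user account, one thing that may be worth comparing between a working and a failing user is which lxml module actually gets imported, and which libxml2/libxslt versions it was compiled against versus the ones loaded at runtime. A minimal sketch, assuming Python 3 and an importable lxml; `library_versions` is a hypothetical helper name, while the version constants are attributes of `lxml.etree`:

```python
import sys

from lxml import etree


def library_versions():
    """Collect the module path and library versions seen by this process."""
    return {
        "Python": tuple(sys.version_info[:3]),
        "module file": etree.__file__,
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


versions = library_versions()
for name, value in versions.items():
    print("%-20s: %s" % (name, value))
```

Running this once per user and diffing the output should show whether the accounts really load the same `etree` module and the same shared libraries.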
-- Lawrence Oluyede [eng] http://oluyede.org - http://twitter.com/lawrenceoluyede [ita] http://www.neropercaso.it [flickr] http://www.flickr.com/photos/rhymes From stefan_ml at behnel.de Tue Apr 13 13:44:26 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 13 Apr 2010 13:44:26 +0200 Subject: [lxml-dev] iterparse segfaults misteriously In-Reply-To: References: <4BC409FC.4060709@behnel.de> Message-ID: <4BC4591A.9030902@behnel.de> Lawrence Oluyede, 13.04.2010 11:19: >>> it keeps on crashing with the following exception: >>> >>> Traceback (most recent call last): >>> File "", line 1, in >>> File "iterparse.pxi", line 471, in lxml.etree.iterparse.__next__ >>> (src/lxml/lxml.etree.c:91652) >>> TypeError: 'str' object doesn't support slice deletion >>> >>> (we had "TypeError: 'NoneType' object doesn't support slice deletion" >>> on the actual code but the only difference is that it's using StringIO >>> instead of a file) >> >> Looks weird. What version of lxml do the above line numbers refer to? > > All of them, but I cannot reproduce it again. It just segfaults. Sorry, I really can't do anything if I can neither reproduce the crash nor look at the code. So I need at least the exact C source file that the above stack trace came from, which means that I need to know the release version that shipped those sources. >>> We debugged postmortem and this is what GDB is saying: >>> >>> #0 __pyx_pf_4lxml_5etree_9iterparse___next__ >>> (__pyx_v_self=0x2b78ff1af1d0) at src/lxml/lxml.etree.c:86607 >>> #1 0x00002b78fa892dec in listextend (self=0x2b78ffa15518, b=>> optimized out>) at Objects/listobject.c:823 >>> [...] >>> Any ideas? >> >> Never seen this before, but I'll look into it as soon as I know what to look >> at. > > That's though, we don't really know more than that. The odd stuff is > that it seems to depend on some weird > user configuration because on the same machine and different users > works just fine. 
I checked with ldd if etree.so was
> linked in a weird way but it looks fine. I checked the LD_LIBRARY_PATH
> and all the users have the same paths.

Can you at least tell me what source version the GDB crash trace was taken
from?

Stefan

From l.oluyede at gmail.com  Tue Apr 13 17:55:12 2010
From: l.oluyede at gmail.com (Lawrence Oluyede)
Date: Tue, 13 Apr 2010 17:55:12 +0200
Subject: [lxml-dev] iterparse segfaults misteriously
In-Reply-To: <4BC4591A.9030902@behnel.de>
References: <4BC409FC.4060709@behnel.de> <4BC4591A.9030902@behnel.de>
Message-ID:

On Tue, Apr 13, 2010 at 1:44 PM, Stefan Behnel wrote:
> Sorry, I really can't do anything if I can neither reproduce the crash nor
> look at the code. So I need at least the exact C source file that the above
> stack trace came from, which means that I need to know the release version
> that shipped those sources.

lxml.etree:         (2, 2, 6, 0)
libxml used:        (2, 7, 2)
libxml compiled:    (2, 7, 7)
libxslt used:       (1, 1, 24)
libxslt compiled:   (1, 1, 26)

The etree.c is the one shipped with lxml 2.2.6
(we tried to generate it installing Cython ourselves but nothing changed)

--
Lawrence Oluyede
[eng] http://oluyede.org - http://twitter.com/lawrenceoluyede
[ita] http://www.neropercaso.it
[flickr] http://www.flickr.com/photos/rhymes

From stefan_ml at behnel.de  Tue Apr 13 18:57:21 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 13 Apr 2010 18:57:21 +0200
Subject: [lxml-dev] iterparse segfaults misteriously
In-Reply-To:
References: <4BC409FC.4060709@behnel.de> <4BC4591A.9030902@behnel.de>
Message-ID: <4BC4A271.603@behnel.de>

Lawrence Oluyede, 13.04.2010 17:55:
> On Tue, Apr 13, 2010 at 1:44 PM, Stefan Behnel wrote:
>> Sorry, I really can't do anything if I can neither reproduce the crash nor
>> look at the code. So I need at least the exact C source file that the above
>> stack trace came from, which means that I need to know the release version
>> that shipped those sources.
>
> lxml.etree:         (2, 2, 6, 0)
> libxml used:        (2, 7, 2)
> libxml compiled:    (2, 7, 7)
> libxslt used:       (1, 1, 24)
> libxslt compiled:   (1, 1, 26)

Going back from a compiled libxml2 version of 2.7.7 to a runtime version of
2.7.2 is not the ideal environment for tracking down a runtime crash problem
(maybe for getting one but not for tracking it down). Same for the libxslt
version.

It may not apply here, but in general, lxml.etree contains certain
work-arounds for library bugs that are enabled at compile time. Compiling
against a newer version than what is used at runtime may disable them.

> The etree.c is the one shipped with lxml 2.2.6

Ok, in that file, the line in the first stack trace contains a C comment.
The line in the second stack trace contains an if-test of a local variable,
which is certainly nothing that can segfault in C. While I wouldn't swear
that the line before that can't segfault (given the weird Python stack
traces you showed), I'd first make sure you really use exactly the
unmodified release sources for the build.

> (we tried to generate it installing Cython ourselves but nothing changed)

I assume you used Cython 0.11.3 for that? Generating the sources yourself is
not a good idea when dealing with crashes: the released sources are the ones
that were tested and that everyone else uses.

Stefan

From sergio at sergiomb.no-ip.org  Wed Apr 14 00:59:50 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Tue, 13 Apr 2010 23:59:50 +0100
Subject: [lxml-dev] bug in HTMLParser with broken html
Message-ID: <1271199590.3667.8.camel@segulix>

Hi,

I found another problem with lxml.html

In [1]: from lxml import html

In [2]: html.tostring(html.fromstring("""
1125.
1124.
""")) Out[2]: '
1125.
1124.
' the ?! firebug result : < /tbody>
1125.
1124.
xmllint --html file also does this wrongly!

Thanks,
--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3159 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100413/e68c9fe9/attachment.bin

From stefan_ml at behnel.de  Wed Apr 14 08:32:38 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 14 Apr 2010 08:32:38 +0200
Subject: [lxml-dev] bug in HTMLParser with broken html
In-Reply-To: <1271199590.3667.8.camel@segulix>
References: <1271199590.3667.8.camel@segulix>
Message-ID: <4BC56186.2090901@behnel.de>

Sergio Monteiro Basto, 14.04.2010 00:59:
> I found other problem with lxml.html
>
> In [1]: from lxml import html
>
> In [2]: html.tostring(html.fromstring("""
class=menu href="1125.pdf"> 1125.
href="1124.pdf"> 1124.
""")) > > Out[2]: '
> 1125.
1124. >
' > > the
and ?! Interesting that you call this a "problem with lxml.html". It's more of a problem in the HTML input, don't you think? > firebug result : > > > < > /tbody>
1125. >
1124. >
>
> xmllint --html file also done this wrongly !

"wrong" is clearly the wrong word here. "Different than firebug" is more
correct, and I bet it's not the only one that behaves "different than
firebug" for a certain piece of non-HTML input. If you feel that this is a
bug in libxml2, please file a bug report there or report it on their
mailing list.

Stefan

From chef at ghum.de  Wed Apr 14 15:04:57 2010
From: chef at ghum.de (Massa, Harald Armin)
Date: Wed, 14 Apr 2010 15:04:57 +0200
Subject: [lxml-dev] getting &#8364; instead of &euro; from tostring
Message-ID:

import lxml.html
a=lxml.html.fragment_fromstring("

we need 100 &euro; yeah

") print lxml.html.tostring(a)

we need 100 &#8364; yeah

What can I do to get "named entities", such as &euro; &auml; &ouml; etc.?
(I would even be willing to enumerate and pass the 15 or so I really need.)

Those are much more human-readable in the unlikely case of debugging.

best wishes,

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100414/f097f42e/attachment.htm

From sergio at sergiomb.no-ip.org  Wed Apr 14 19:19:39 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Wed, 14 Apr 2010 18:19:39 +0100
Subject: [lxml-dev] bug in HTMLParser with broken html
In-Reply-To: <4BC56186.2090901@behnel.de>
References: <1271199590.3667.8.camel@segulix> <4BC56186.2090901@behnel.de>
Message-ID: <1271265579.9695.16.camel@segulix>

On Wed, 2010-04-14 at 08:32 +0200, Stefan Behnel wrote:
> Sergio Monteiro Basto, 14.04.2010 00:59:
> > I found other problem with lxml.html
> >
> > In [1]: from lxml import html
> >
> > In [2]: html.tostring(html.fromstring("""
> class=menu href="1125.pdf"> 1125.
> href="1124.pdf"> 1124.
""")) > > > > Out[2]: '
> > 1125.
1124. > >
' > > > > the
> and ?! > > Interesting that you call this a "problem with lxml.html". It's more of a > problem in the HTML input, don't you think? Well I trust in lxml to fix the input. see http://www.voicenews.ca/ > > > > firebug result : > > > > > > < > > /tbody>
1125. > >
1124. > >
> > > > xmllint --html file also done this wrongly ! > > "wrong" is clearly the wrong word here. "Different than firebug" is more > correct, and I bet it's not the only one that behaves "different than > firebug" for a certain piece of non-HTML input. "wrong" is the word " " is a wrong fix to this HTML input. The correct is clearly , is how html is render by any browser . > If you feel that this is a > bug in libxml2, please file a bug report there or report it on their > mailing list. I follow yours tip in xmllint --html file So this HTML input is not fixable, by lxml.html ? Thanks, -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100414/c162cf0f/attachment.bin From stefan_ml at behnel.de Thu Apr 15 07:42:21 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 15 Apr 2010 07:42:21 +0200 Subject: [lxml-dev] bug in HTMLParser with broken html In-Reply-To: <1271265579.9695.16.camel@segulix> References: <1271199590.3667.8.camel@segulix> <4BC56186.2090901@behnel.de> <1271265579.9695.16.camel@segulix> Message-ID: <4BC6A73D.2070000@behnel.de> Sergio Monteiro Basto, 14.04.2010 19:19: > On Wed, 2010-04-14 at 08:32 +0200, Stefan Behnel wrote: >> If you feel that this is a >> bug in libxml2, please file a bug report there or report it on their >> mailing list. > > I follow yours tip in xmllint --html file > > So this HTML input is not fixable, by lxml.html ? Well, if xmllint produces the same result, then there is nothing lxml can do about it. Please contact the libxml2 project. 

Stefan

From sergio at sergiomb.no-ip.org  Tue Apr 20 03:40:46 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Tue, 20 Apr 2010 02:40:46 +0100
Subject: [lxml-dev] getting &#8364; instead of &euro; from tostring
In-Reply-To:
References:
Message-ID: <1271727646.6178.8.camel@segulix>

On Wed, 2010-04-14 at 15:04 +0200, Massa, Harald Armin wrote:
>
> import lxml.html
>
> a=lxml.html.fragment_fromstring("

we need 100 &euro; yeah

") > > print lxml.html.tostring(a) >

we need 100 &#8364; yeah

print lxml.html.tostring(a, method="html", encoding="utf-8")

this works for me

(...)
> What can I do to get "named entitites", as &euro; &auml; &ouml; etc?
> (Would even be willing to enumerate and pass the 15 or so I really
> need)

Out of curiosity: I have been using a function htmldecode, which doesn't
work on all types of strings but works well with unicode, so basically I
always use it like this:

    htmldecode( unicode(text,'utf-8') ).encode('utf-8')

I have made some improvements to this script, which I have not published yet
because it is not tested (...)

cat htmldecode.py

    #-*- coding: utf-8 -*-
    import re
    from htmlentitydefs import name2codepoint

    # This pattern matches a character entity reference (a decimal numeric
    # references, a hexadecimal numeric reference, or a named reference).
    charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

    def htmldecode(text):
        """Decode HTML entities in the given text."""
        if type(text) is unicode:
            uchr = unichr
        else:
            uchr = lambda value: value > 255 and unichr(value) or chr(value)

        def entitydecode(match, uchr=uchr):
            entity = match.group(1)
            if entity.startswith('#x'):
                return uchr(int(entity[2:], 16))
            elif entity.startswith('#'):
                return uchr(int(entity[1:]))
            elif entity in name2codepoint:
                return uchr(name2codepoint[entity])
            else:
                return match.group(0)

        return charrefpat.sub(entitydecode, text)

Regards,
--
Sérgio M. B.

From chef at ghum.de  Thu Apr 22 10:46:30 2010
From: chef at ghum.de (Massa, Harald Armin)
Date: Thu, 22 Apr 2010 10:46:30 +0200
Subject: [lxml-dev] getting &#8364; instead of &euro; from tostring
In-Reply-To: <1271727646.6178.8.camel@segulix>
References: <1271727646.6178.8.camel@segulix>
Message-ID:

Sergio,

> import lxml.html
> >
> > a=lxml.html.fragment_fromstring("

we need 100 &euro; yeah

") > > > > print lxml.html.tostring(a) > >

we need 100 &#8364; yeah

> > print lxml.html.tostring(a, method="html", encoding="utf-8")
> > this works for me (...)
> >

Yeah, works for me too. It just puts out the &euro; as €, which is fine as
long as I do not have to communicate with legacy systems crashing on that
code :)

I would rather not go over the text with another replace from € to &euro;,
as that would eat up all the speed I have won from using lxml :)

> > What can I do to get "named entitites", as &euro; &auml; &ouml; etc?
> > (Would even be willing to enumerate and pass the 15 or so I really
> > need)
> >

Thanks,

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality

From stefan_ml at behnel.de  Thu Apr 22 11:27:20 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 22 Apr 2010 11:27:20 +0200
Subject: [lxml-dev] getting &#8364; instead of &euro; from tostring
In-Reply-To:
References: <1271727646.6178.8.camel@segulix>
Message-ID: <4BD01678.4000501@behnel.de>

Massa, Harald Armin, 22.04.2010 10:46:
>> import lxml.html
>>>
>>> a=lxml.html.fragment_fromstring("

we need 100 &euro; yeah

") >>> >>> print lxml.html.tostring(a) >>>

we need 100 &#8364; yeah

>> >> print lxml.html.tostring(a, method="html", encoding="utf-8")
>> >> this works for me (...)
>> >> yeah, works for me too - just puts out the &euro; as € , which is fine as
> long as I do not have to communicate with legacy systems crashing on that
> code :)

Well, do you? Maybe they don't expect UTF-8 input then?

How do you "communicate" with a legacy system using HTML?

Stefan

From chef at ghum.de  Thu Apr 22 11:41:43 2010
From: chef at ghum.de (Massa, Harald Armin)
Date: Thu, 22 Apr 2010 11:41:43 +0200
Subject: [lxml-dev] getting &#8364; instead of &euro; from tostring
In-Reply-To: <4BD01678.4000501@behnel.de>
References: <1271727646.6178.8.camel@segulix> <4BD01678.4000501@behnel.de>
Message-ID:

Stefan,

> import lxml.html
>>>
>>>>
>>>> a=lxml.html.fragment_fromstring("

we need 100 &euro; yeah

") >>>> >>>> print lxml.html.tostring(a) >>>>

we need 100 &#8364; yeah

>>>>
>>> >>> print lxml.html.tostring(a, method="html", encoding="utf-8")
>>> >>> this works for me (...)
>>> >>> yeah, works for me too - just puts out the &euro; as € , which is fine as
>>>
>> long as I do not have to communicate with legacy systems crashing on that
>> code :)
>>
>
> Well, do you? Maybe they don't expect UTF-8 input then?
>
> How do you "communicate" with a legacy system using HTML?
>
>

You are totally right: they do not expect UTF-8 input, nor any of the
encodings that would allow a "€". That's why I am limited to something way
below UTF-8. Sorry, I should have stated that more clearly.

"Communicating" as in "storing that string in a visible but untouched field
of the legacy system".

That's why I was asking for "encode € as &euro;" instead of &#8364;. That
code is mainly machine-mangled, but SOMETIMES somebody will look at it, and
"200.000 &euro;" is readable for a non-techie, while "200.000 &#8364;" will
raise questions.

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality

From c-dilla at gmx.ch  Tue Apr 27 14:01:53 2010
From: c-dilla at gmx.ch (Michael Konietzny)
Date: Tue, 27 Apr 2010 14:01:53 +0200
Subject: [lxml-dev] examine XSD xml schema
Message-ID: <4BD6D231.4030802@gmx.ch>

Hello,

I would like to examine an XSD schema in Python. Currently I'm using lxml,
which does its job very, very well when it only has to validate a document
against the schema. But I want to know what's inside the schema and access
its elements in the usual lxml manner.

The schema:

| |

The lxml code to load the schema is (simplified):

    xsd_file_handle = open(self._xsd_file, 'rb')
    xsd_text = xsd_file_handle.read()
    schema_document = etree.fromstring(xsd_text, base_url=xmlpath)
    xmlschema = etree.XMLSchema(schema_document)

I'm then able to use schema_document (which is an etree._Element) to go
through the schema as an XML document. But since etree.fromstring (at least
it seems like that) expects a plain XML document, the xsd:include elements
are not processed.

The problem is currently solved by parsing the first schema document, then
loading the included schemas and inserting their children one by one into
the main document by hand:

    BASE_URL = "/xml/"
    schema_document = etree.fromstring(xsd_text, base_url=BASE_URL)
    tree = schema_document.getroottree()

    schemas = []
    for schemaChild in schema_document.iterchildren():
        if schemaChild.tag.endswith("include"):
            try:
                h = open(os.path.join(BASE_URL, schemaChild.get("schemaLocation")), "r")
                s = etree.fromstring(h.read(), base_url=BASE_URL)
                schemas.append(s)
            except Exception as ex:
                print "failed to load schema: %s" % ex
            finally:
                h.close()
            # remove the element
            self._schema_document.remove(schemaChild)

    for s in schemas:
        # inside
        for sChild in s:
            schema_document.append(sChild)

What I'm asking for is an idea of how to solve this in a more common way.
I've already searched for other schema parsers in Python, but so far there
was nothing that would fit this case.

Greetings,
Michael

From usernamenumber at gmail.com  Wed Apr 28 20:22:43 2010
From: usernamenumber at gmail.com (Brad Smith)
Date: Wed, 28 Apr 2010 14:22:43 -0400
Subject: [lxml-dev] Adding internal DTD entities?
Message-ID:

Hello all,

First, thanks to those responsible for lxml, it has made processing xml in
python an almost-fun experience!
;)

I have only found one thing I need to be able to do, but of which I can't
seem to find any examples. I have entities defined in a file that is shared
by all of the files in a document (which are xi:include'ed into the main
doc). Thus, I want each document to begin with something like:

    %common_entities; ]>

The problem is that some of the files in the doc are auto-generated by a
python/lxml pre-processing script, and I can't figure out how to have lxml
insert that entity definition into the doctype declaration.

Any suggestions here would be greatly appreciated.

Thanks!
--Brad

From drobbins at funtoo.org  Thu Apr 29 19:16:55 2010
From: drobbins at funtoo.org (Daniel Robbins)
Date: Thu, 29 Apr 2010 11:16:55 -0600
Subject: [lxml-dev] using xpath to grab and *modify* a sub-element?
Message-ID:

Hi All,

I am new to lxml, but so far it seems that xpath functions/methods
return *copies* of a SubElement, and thus any modifications I make to
the returned element does not affect the master element.

This is quite fine for retrieving data, but what if I want to use
xpath to find SubElements, that I then want to modify in some way, so
that my changes are reflected in the master Element?

Any short code snippet or link to an example would be appreciated.
I've been searching but haven't found any yet.

Thanks and Regards,

-Daniel Robbins

From terry_n_brown at yahoo.com  Thu Apr 29 20:15:13 2010
From: terry_n_brown at yahoo.com (Terry Brown)
Date: Thu, 29 Apr 2010 13:15:13 -0500
Subject: [lxml-dev] using xpath to grab and *modify* a sub-element?
In-Reply-To:
References:
Message-ID: <20100429131513.6751fb1e@nrri.umn.edu>

On Thu, 29 Apr 2010 11:16:55 -0600
Daniel Robbins wrote:

> I am new to lxml, but so far it seems that xpath functions/methods
> return *copies* of a SubElement, and thus any modifications I make to
> the returned element does not affect the master element.

I would expect them to return references to the Elements they find. Not sure
about attribute values (i.e.
xpaths ending in @foo), that would require some kind of wrapper, but if your
xpath returns Elements I'm sure they're references, not copies. So you can
modify in place.

Cheers -Terry

From friedel at translate.org.za  Thu Apr 29 22:18:10 2010
From: friedel at translate.org.za (F Wolff)
Date: Thu, 29 Apr 2010 22:18:10 +0200
Subject: [lxml-dev] installing lxml on MacOS
In-Reply-To: <1F4D9F37-B9AA-440B-91CE-E185F2E1180D@ci.uchicago.edu>
References: <1F4D9F37-B9AA-440B-91CE-E185F2E1180D@ci.uchicago.edu>
Message-ID: <1272572290.9033.16335.camel@localhost.localdomain>

On Tue, 2010-04-06 at 15:41 -0400, Stephen Graham wrote:
> I am trying to go down the learning curve on lxml.
>
> I tried to follow the install instructions to install lxml on my MacOS
>
> As root, I did:
> STATIC_DEPS=true easy_install lxml
>
> The install chugged away, but eventually failed:
>
> ...
>
> Undefined symbols for architecture i386:
> "_gzdirect", referenced from:
> ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o)
> ld: symbol(s) not found for architecture i386
> collect2: ld returned 1 exit status
> Undefined symbols for architecture ppc:
> "_gzdirect", referenced from:
> ___xmlParserInputBufferCreateFilename in libxml2.a(xmlIO.o)
> ld: symbol(s) not found for architecture ppc
> collect2: ld returned 1 exit status
> lipo: can't open input file: /var/tmp//ccsmIZu8.out (No such file or
> directory)
> make[2]: *** [xmllint] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all] Error 2
> Traceback (most recent call last):
> File "/usr/bin/easy_install", line 8, in
> load_entry_point('setuptools==0.6c7', 'console_scripts',
> 'easy_install')()
> File "/System/Library/Frameworks/Python.framework/Versions/2.5/
> Extras/lib/python/setuptools/command/easy_install.py", line 1670, in
> main
> with_ei_usage(lambda:
> File "/System/Library/Frameworks/Python.framework/Versions/2.5/
> Extras/lib/python/setuptools/command/easy_install.py", line 1659, in
> with_ei_usage
> return f()
> File
"/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 1674, in > > distclass=DistributionWithoutHelpCommands, **kw > File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ > python2.5/distutils/core.py", line 151, in setup > dist.run_commands() > File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ > python2.5/distutils/dist.py", line 974, in run_commands > self.run_command(cmd) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/ > python2.5/distutils/dist.py", line 994, in run_command > cmd_obj.run() > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 211, in run > self.easy_install(spec, not self.no_deps) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 446, in > easy_install > return self.install_item(spec, dist.location, tmpdir, deps) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 471, in > install_item > dists = self.install_eggs(spec, download, tmpdir) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 655, in > install_eggs > return self.build_and_install(setup_script, setup_base) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 930, in > build_and_install > self.run_setup(setup_script, setup_base, args) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/command/easy_install.py", line 919, in > run_setup > run_setup(setup_script, args) > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/sandbox.py", line 27, in run_setup > lambda: execfile( > File 
"/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/sandbox.py", line 63, in run > return func() > File "/System/Library/Frameworks/Python.framework/Versions/2.5/ > Extras/lib/python/setuptools/sandbox.py", line 29, in > {'__file__':setup_script, '__name__':'__main__'} > File "setup.py", line 119, in > File "/tmp/easy_install-XVh5wp/lxml-2.2.6/setupinfo.py", line 50, > in ext_modules > File "/tmp/easy_install-XVh5wp/lxml-2.2.6/buildlibxml.py", line > 208, in build_libxml2xslt > File "/tmp/easy_install-XVh5wp/lxml-2.2.6/buildlibxml.py", line > 158, in call_subprocess > Exception: Command "make" returned code 2 > > > Any thing obvious jump out at anyone that I might have missed? > > thanks in advance > sgg > Steve Graham Hi Steve I just saw this same issue on someone else's computer (OSX 10.5). Did you ever manage solve this? Thank you for any help. Friedel -- Recently on my blog: http://translate.org.za/blogs/friedel/en/content/how-should-we-do-high-contrast-application-icons