From andrepleblanc at gmail.com Wed Aug 4 01:23:22 2010
From: andrepleblanc at gmail.com (Andre LeBlanc)
Date: Tue, 3 Aug 2010 19:23:22 -0400
Subject: [lxml-dev] xmlns declarations when using tostring
Message-ID:

Hi,

I'm using lxml to parse an ODT document, at one point I need to use
tostring() on an element, but when I do, the string it generates contains
about a dozen xmlns declarations that were at the top of the original
document. is there ANY way to get a string representation of what a tag
actually looks like in the file?

From donn.ingle at gmail.com Wed Aug 4 06:28:36 2010
From: donn.ingle at gmail.com (donn)
Date: Wed, 04 Aug 2010 06:28:36 +0200
Subject: [lxml-dev] Walking linked-trees
Message-ID: <4C58EC74.5060804@gmail.com>

I am using SVG trees which can 'touch' one-another by a node.
The node has an 'href' attribute which can point to another node
*within* the current tree, or to a node in an *external* tree.

Given that I have a Forest (a dict) of pre-parsed Element Trees, how can
I perform a walk of Forest['start'] such that it follows the links into
the other trees (and back)?
I have a solution now that relies on recursion, but I am worried about
the recursion depth problem -- svg trees can get quite deep.

context = etree.iterwalk( Forest['start'], events=('start',) )

for a,e in context:
    if e.tag == "use":
        href = extract(e)
        other_tree = Forest[href]
        ?? somehow alter context so it steps into other_tree

I imagine the trees touching at certain branches and want the walk to
continue into the other tree and then back again into the first -- as if
the link across were simply part of the first tree for a moment.

Been racking my flea-brain for ways to *not* use recursion, like
coroutines or some kind of mega-stack which split and joins.

Any ideas?
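
One possible shape for the "mega-stack" idea, as a minimal sketch rather
than a tested lxml recipe: keep an explicit stack of iterators and push the
linked tree's root whenever a 'use' element turns up. The sketch runs on the
standard library's xml.etree.ElementTree so that it needs no lxml install;
the walk_linked name, the forest layout and the convention that 'href'
holds a key of the forest dict are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def walk_linked(forest, start_key):
    """Yield (tree_key, element) pairs in document order, following links.

    Hypothetical helper: 'forest' maps keys to root elements, and the
    'href' of a 'use' element is assumed to be a key of 'forest'.
    """
    # An explicit stack of (tree_key, iterator) pairs replaces recursion.
    stack = [(start_key, iter([forest[start_key]]))]
    while stack:
        key, pending = stack[-1]
        elem = next(pending, None)
        if elem is None:
            stack.pop()  # level exhausted, resume the level below it
            continue
        yield key, elem
        # Queue the element's own children ...
        stack.append((key, iter(elem)))
        # ... and, on top of them, the linked tree, so it is visited first.
        if elem.tag == "use" and elem.get("href") in forest:
            href = elem.get("href")
            stack.append((href, iter([forest[href]])))


forest = {
    "start": ET.fromstring('<root><sub><use href="other"/><x/></sub></root>'),
    "other": ET.fromstring("<other><foo/></other>"),
}
visited = [(key, elem.tag) for key, elem in walk_linked(forest, "start")]
```

The stack is an ordinary list, so the depth of the trees is not limited by
Python's recursion limit. There is no guard against links that form a
cycle; remembering the hrefs already followed would be one way to add it.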

\d

From jholg at gmx.de Wed Aug 4 09:37:18 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 04 Aug 2010 09:37:18 +0200
Subject: [lxml-dev] xmlns declarations when using tostring
In-Reply-To:
References:
Message-ID: <20100804073718.182280@gmx.net>

Hi,

> I'm using lxml to parse an ODT document, at one point I need to use
> tostring() on an element, but when I do, the string it generates contains
> about a dozen xmlns declarations that were at the top of the original
> document. is there ANY way to get a string representation of what a tag
> actually looks like in the file?

That depends on whether the namespaces are needed in the element context.
If they are not, you might achieve what you want by copying the element
for isolation and using etree.cleanup_namespaces():

>>> from lxml import etree, objectify
>>> import copy
>>> root = objectify.Element('root')
>>> root.i = 12345
>>> print etree.tostring(root, pretty_print=True)
12345
>>> i = copy.deepcopy(root.i)
>>> print etree.tostring(i, pretty_print=True)
12345
>>> etree.cleanup_namespaces(i) # does not remove xmlns:py as this is still needed
>>> print etree.tostring(i, pretty_print=True)
12345
>>> del i.attrib['{http://codespeak.net/lxml/objectify/pytype}pytype']
>>> etree.cleanup_namespaces(i) # will now remove xmlns:py declaration
>>> print etree.tostring(i, pretty_print=True)
12345
>>>

Holger

From jholg at gmx.de Wed Aug 4 13:55:09 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 04 Aug 2010 13:55:09 +0200
Subject: [lxml-dev] Walking linked-trees
In-Reply-To: <4C58EC74.5060804@gmail.com>
References: <4C58EC74.5060804@gmail.com>
Message-ID: <20100804115509.182270@gmx.net>

Hi,

> I am using SVG trees which can 'touch' one-another by a node.
> The node has an 'href' attribute which can point to another node
> *within* the current tree, or to a node in an *external* tree.
>
> Given that I have a Forest (a dict) of pre-parsed Element Trees, how can
> I perform a walk of Forest['start'] such that it follows the links into
> the other trees (and back)?
> I have a solution now that relies on recursion, but I am worried about
> the recursion depth problem -- svg trees can get quite deep.

But tree depth is not really the matter here wrt the recursion, rather
the number of trees in the "forest" that reference each other?

> context = etree.iterwalk( Forest['start'], events=('start',) )
>
> for a,e in context:
>     if e.tag == "use":
>         href = extract(e)
>         other_tree = Forest[href]
>         ?? somehow alter context so it steps into other_tree
>
> I imagine the trees touching at certain branches and want the walk to
> continue into the other tree and then back again into the first -- as if
> the link across were simply part of the first tree for a moment.
>
> Been racking my flea-brain for ways to *not* use recursion, like
> coroutines or some kind of mega-stack which split and joins.
>
> Any ideas?

You could use a context stack, something along the lines of:

context = etree.iterwalk(root)
contextstack = []
while context:
    try:
        a, e = context.next()
        print "event: %s element: %s (from %s)" % (a, e.tag, e.getroottree().getroot().tag)
        if e.tag == "use":
            href = e.get('href')
            referenced = forest[href]
            for ref in referenced:
                # suspend the current walk; pop() resumes the most recent one
                contextstack.append(context)
                context = etree.iterwalk(ref)
    except StopIteration:
        try:
            context = contextstack.pop()
        except IndexError:
            break

This outputs for me (using lxml.objectify out of personal convenience
preference):

>>> E = objectify.E
>>> root = E.root(E.sub(E.use(href='1'), E.subsub(E.x(1), E.y(2), E.z(3))))
>>> other = E.other(E.foo(E.bar(E.moo(E.z(4)))))
>>> forest = {'0': root, '1': other}
>>>
>>> print etree.tostring(root, pretty_print=True)
1 2 3
>>> print etree.tostring(other, pretty_print=True)
4
>>>
>>> context = etree.iterwalk(root)
>>> contextstack = []
>>> while context:
...     try:
...         a, e = context.next()
...         print "event: %s element: %s (from %s)" % (a, e.tag, e.getroottree().getroot().tag)
...         if e.tag == "use":
...             href = e.get('href')
...             referenced = forest[href]
...             for ref in referenced:
...                 contextstack.append(context)
...                 context = etree.iterwalk(ref)
...
...     except StopIteration:
...         try:
...             context = contextstack.pop()
...         except IndexError:
...             break
...
...
event: end element: use (from root)
event: end element: z (from other)
event: end element: moo (from other)
event: end element: bar (from other)
event: end element: foo (from other)
event: end element: other (from other)
event: end element: x (from root)
event: end element: y (from root)
event: end element: z (from root)
event: end element: subsub (from root)
event: end element: sub (from root)
event: end element: root (from root)
>>>

Holger

From donn.ingle at gmail.com Wed Aug 4 15:26:07 2010
From: donn.ingle at gmail.com (donn)
Date: Wed, 04 Aug 2010 15:26:07 +0200
Subject: [lxml-dev] Walking linked-trees
In-Reply-To: <20100804115509.182270@gmx.net>
References: <4C58EC74.5060804@gmail.com> <20100804115509.182270@gmx.net>
Message-ID: <4C596A6F.3080105@gmail.com>

Holger, thanks for the work and effort. It will take me a little while to
parse it through my head and test it out, but it's great to have
something to work with!

Best,
\d

From stefan_ml at behnel.de Mon Aug 9 08:40:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 09 Aug 2010 08:40:35 +0200
Subject: [lxml-dev] decoding unicode strings
In-Reply-To: <4C51427C.7060206@arskom.com.tr>
References: <4C51427C.7060206@arskom.com.tr>
Message-ID: <4C5FA2E3.8020906@behnel.de>

Burak Arslan, 29.07.2010 10:57:
> the lxml.etree.XMLID function does not accept unicode strings when the
> xml declaration tag is present at the beginning of the xml document.
> however, not all soap clients send the xml declaration, so sometimes i
> must rely on information in http headers to decode the string.
>
> my solution was this:
>
> try:
>     root, xmlids = etree.XMLID(xml_string.decode(http_charset))
> except ValueError,e:
>     logger.debug('%s -- falling back to str decoding.' % (e))
>     root, xmlids = etree.XMLID(xml_string)
>
> is this the proper way to check whether an xml document candidate has an
> xml declaration at the beginning?

The correct way to do it is to pass a parser that uses an explicitly
defined encoding. However, this parameter is currently missing from the
XMLID() functions. You can use parseid() instead, which accepts this
argument. I also fixed this in SVN so that the upcoming 2.3 release will
support the 'parser' parameter in XMLID() as well.

Stefan

From sakshichawla12354 at gmail.com Fri Aug 13 13:01:26 2010
From: sakshichawla12354 at gmail.com (sakshi chawla)
Date: Fri, 13 Aug 2010 16:31:26 +0530
Subject: [lxml-dev] installation of lxml module
Message-ID:

I am using python2.6 on ubuntu9.04. I have downloaded lxml-2.2.6.tgz
module. How can I install or include this module in python so that I can
use it? what should I do now?

From jkrukoff at ltgc.com Fri Aug 13 17:14:38 2010
From: jkrukoff at ltgc.com (John Krukoff)
Date: Fri, 13 Aug 2010 09:14:38 -0600
Subject: [lxml-dev] installation of lxml module
In-Reply-To:
References:
Message-ID: <1281712478.3497.3.camel@localhost>

On Fri, 2010-08-13 at 16:31 +0530, sakshi chawla wrote:
> I am using python2.6 on ubuntu9.04. I have downloaded lxml-2.2.6.tgz
> module. How can I install or include this module in python so that I
> can use it? what should I do now?

Did you try the standard source install instructions yet?
http://codespeak.net/lxml/build.html

--
John Krukoff
Land Title Guarantee Company
jkrukoff at ltgc.com

From bruceq at hammondranch.com Tue Aug 17 20:35:02 2010
From: bruceq at hammondranch.com (Bruce Q Hammond)
Date: Tue, 17 Aug 2010 11:35:02 -0700
Subject: [lxml-dev] easy_install lxml
Message-ID:

Hi folks,

I am trying to build lxml Python 3.1.2 on a fresh out of the box Snow
Leopard Mac OS X machine.

sudo STATIC_DEPS=true easy_install-3.1 lxml

which eventually dies with the output.

Building lxml version 2.2.7.
Latest version of libxml2 is 2.7.7
Traceback (most recent call last):
  File "/usr/local/bin/easy_install-3.1", line 9, in
    load_entry_point('distribute==0.6.14', 'console_scripts', 'easy_install-3.1')()
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 1855, in main
    with_ei_usage(lambda:
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 1836, in with_ei_usage
    return f()
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 1859, in
    distclass=DistributionWithoutHelpCommands, **kw
  File "/usr/local/lib/python3.1/distutils/core.py", line 149, in setup
    dist.run_commands()
  File "/usr/local/lib/python3.1/distutils/dist.py", line 919, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.1/distutils/dist.py", line 938, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 342, in run
    self.easy_install(spec, not self.no_deps)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 582, in easy_install
    return self.install_item(spec, dist.location, tmpdir, deps)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 612, in install_item
    dists = self.install_eggs(spec, download, tmpdir)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 802, in install_eggs
    return self.build_and_install(setup_script, setup_base)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 1079, in build_and_install
    self.run_setup(setup_script, setup_base, args)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/command/easy_install.py", line 1068, in run_setup
    run_setup(setup_script, args)
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/sandbox.py", line 30, in run_setup
    lambda: exec(compile(open(
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/sandbox.py", line 71, in run
    return func()
  File "/usr/local/lib/python3.1/site-packages/distribute-0.6.14-py3.1.egg/setuptools/sandbox.py", line 33, in
    {'__file__':setup_script, '__name__':'__main__'})
  File "setup.py", line 119, in
    return 'install' in sys.argv[1:] or _easy_install_marker()
  File "/tmp/easy_install-sP6BxP/lxml-2.2.7/setupinfo.py", line 50, in ext_modules
  File "/tmp/easy_install-sP6BxP/lxml-2.2.7/buildlibxml.py", line 170, in build_libxml2xslt
  File "/tmp/easy_install-sP6BxP/lxml-2.2.7/buildlibxml.py", line 36, in download_libxml2
  File "/tmp/easy_install-sP6BxP/lxml-2.2.7/buildlibxml.py", line 76, in download_library
NameError: global name 'urljoin' is not defined

Any ideas for what is going on / how to resolve it?

Thanks,
Bruce Q Hammond

From stefan_ml at behnel.de Wed Aug 18 14:36:06 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 18 Aug 2010 14:36:06 +0200
Subject: [lxml-dev] easy_install lxml
In-Reply-To:
References:
Message-ID: <4C6BD3B6.1040009@behnel.de>

Bruce Q Hammond, 17.08.2010 20:35:
> I am trying to build lxml Python 3.1.2 on a fresh out of the box Snow Leopard Mac OS X machine.
>
> sudo STATIC_DEPS=true easy_install-3.1 lxml
>
> which eventually dies with the output.
>
>
> Building lxml version 2.2.7.
> Latest version of libxml2 is 2.7.7
> Traceback (most recent call last):
> [...]
> File "/tmp/easy_install-sP6BxP/lxml-2.2.7/buildlibxml.py", line 76, in download_library
> NameError: global name 'urljoin' is not defined
>
>
>
> Any ideas for what is going on / how to resolve it?

Here's a fix. Looks like this hasn't been run on Py3.1 for a while. (The
trunk and the 2.3-pre releases are ok, BTW).

Stefan


Index: buildlibxml.py
===================================================================
--- buildlibxml.py (revision 71671)
+++ buildlibxml.py (working copy)
@@ -5,7 +5,7 @@
     from urlparse import urlsplit, urljoin
     from urllib import urlretrieve
 except ImportError:
-    from urllib.parse import urlsplit
+    from urllib.parse import urlsplit, urljoin
     from urllib.request import urlretrieve

 ## Routines to download and build libxml2/xslt:

From hywelm.jones at talk21.com Sat Aug 21 13:33:04 2010
From: hywelm.jones at talk21.com (Hywel Jones)
Date: Sat, 21 Aug 2010 11:33:04 +0000 (GMT)
Subject: [lxml-dev] Fw: Install lxml problem
Message-ID: <242414.99735.qm@web87108.mail.ird.yahoo.com>

I'm using Windows XP but failing to install lxml. I'd appreciate any
advice. I'm a Python novice so it will have to be simple. Thanks in
advance.

Here's what I got.
I can see it says "make sure the development packages of libxml2 and
libxslt are installed" and they're not, but the installation instructions
specifically say that easy_install should take care of that!

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Python26>easy_install lxml
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.2.7
Downloading http://codespeak.net/lxml/lxml-2.2.7.tgz
Processing lxml-2.2.7.tgz
Running lxml-2.2.7\setup.py -q bdist_egg --dist-dir c:\docume~1\hywel\locals~1\temp\easy_install-amxrxs\lxml-2.2.7\egg-dist-tmp-rs_jj5
Building lxml version 2.2.7.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
ERROR: 'xslt-config' is not recognized as an internal or external command,
operable program or batch file.

** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
warning: no files found matching 'lxml.etree.c' under directory 'src\lxml'
warning: no files found matching 'lxml.objectify.c' under directory 'src\lxml'
warning: no files found matching 'lxml.etree.h' under directory 'src\lxml'
warning: no files found matching 'lxml.etree_api.h' under directory 'src\lxml'
warning: no files found matching 'etree_defs.h' under directory 'src\lxml'
warning: no files found matching 'pubkey.asc' under directory 'doc'
warning: no files found matching 'tagpython*.png' under directory 'doc'
error: Setup script exited with error: Unable to find vcvarsall.bat

From liweijian.hust at gmail.com Tue Aug 24 11:50:29 2010
From: liweijian.hust at gmail.com (hookits)
Date: Tue, 24 Aug 2010 09:50:29 +0000 (UTC)
Subject: [lxml-dev] Detect JSON format by lxml?
Message-ID:

Hi, everybody.

I want to write a python script to do some JavaScript-Hijacking
detection. (http://www.net-security.org/article.php?id=995)

I know I can detect it by the response header Content-Type. However, some
web applications send JSON data with a Content-Type of plain text or
HTML, not JSON.

There are 2 types of JSON response that may face the JavaScript-Hijacking
attack:

#############case 1##############
_callback({SOME_JSON_DATA});

#############case 2##############
var userAddress={SOME_JSON_DATA};
try{_callback(userAddress);}catch(e){}

It seems hard to filter these 2 cases out of so many http responses.

I am wondering whether any of lxml's functions can do such detection.

Thanks in advance.

From lists at zopyx.com Tue Aug 24 17:37:44 2010
From: lists at zopyx.com (Andreas Jung)
Date: Tue, 24 Aug 2010 15:37:44 +0000 (UTC)
Subject: [lxml-dev] Please upload your sdists on PyPI
Message-ID:

Hi there,

could you please make _all_ releases of lxml available on PyPI.

All buildouts that haven't used a pinned version broke over the last days
due to the unavailability of codespeak, since 2.3.0a2 has no releases.

Thanks,
Andreas

From sridharr at activestate.com Wed Aug 25 01:03:04 2010
From: sridharr at activestate.com (Sridhar Ratnakumar)
Date: Tue, 24 Aug 2010 16:03:04 -0700
Subject: [lxml-dev] Please upload your sdists on PyPI
In-Reply-To:
References:
Message-ID: <4C744FA8.8050505@activestate.com>

On 8/24/2010 8:37 AM, Andreas Jung wrote:
> Hi there,
>
> could you please make _all_ releases of lxml available on PyPI.
>
> All buildout that haven't used a pinned version broke over the last days due the
> unavailability of codespeak since 2.3.0a2 has no releases.

Hmm, shouldn't buildout use 2.2.7, that is in fact hosted in PyPI?

2.3alpha2 *was* once released to PyPI, but it was deleted later on:
http://article.gmane.org/gmane.comp.python.lxml.devel/5496/match=2.3alpha

At least for me "pip install lxml" downloads 2.2.7, not 2.3*.

-srid

From lists at zopyx.com Wed Aug 25 06:53:50 2010
From: lists at zopyx.com (Andreas Jung)
Date: Wed, 25 Aug 2010 06:53:50 +0200
Subject: [lxml-dev] Please upload your sdists on PyPI
In-Reply-To: <4C744FA8.8050505@activestate.com>
References: <4C744FA8.8050505@activestate.com>
Message-ID:

2010/8/25 Sridhar Ratnakumar

> On 8/24/2010 8:37 AM, Andreas Jung wrote:
>
>> Hi there,
>>
>> could you please make _all_ releases of lxml available on PyPI.
>>
>> All buildout that haven't used a pinned version broke over the last days
>> due the
>> unavailability of codespeak since 2.3.0a2 has no releases.
>>
>
> Hmm, shouldn't buildout use 2.2.7, that is in fact hosted in PyPI?
>
> 2.3alpha2 *was* once released to PyPI, but it was deleted later on:
> http://article.gmane.org/gmane.comp.python.lxml.devel/5496/match=2.3alpha
>

Released is released!

> At least for me "pip install lxml" downloads 2.2.7, not 2.3*.
>

We don't care about pip - incomplete or badly maintained releases broke
the buildouts of many people...please ensure that the lxml releases are
clean.

Thanks,
Andreas

> -srid

From stefan_ml at behnel.de Wed Aug 25 08:02:40 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 25 Aug 2010 08:02:40 +0200
Subject: [lxml-dev] Please upload your sdists on PyPI
In-Reply-To:
References: <4C744FA8.8050505@activestate.com>
Message-ID: <4C74B200.6070209@behnel.de>

Andreas Jung, 25.08.2010 06:53:
> 2010/8/25 Sridhar Ratnakumar
>> On 8/24/2010 8:37 AM, Andreas Jung wrote:
>>> could you please make _all_ releases of lxml available on PyPI.
>>>
>>> All buildout that haven't used a pinned version broke over the last days
>>> due the
>>> unavailability of codespeak since 2.3.0a2 has no releases.

At least setuptools handles this as expected, as does pip, as reported.

Sounds like buildout could also be fixed.

>> Hmm, shouldn't buildout use 2.2.7, that is in fact hosted in PyPI?
>>
>> 2.3alpha2 *was* once released to PyPI, but it was deleted later on:
>> http://article.gmane.org/gmane.comp.python.lxml.devel/5496/match=2.3alpha
>
> Released is released!

Well, yes, but I didn't put the sdists up on PyPI exactly because I
didn't want users to install it by default. Otherwise, setuptools would
grab the latest distribution regardless of the advertised state of
stability.

In the current state of affairs, users will continue to get the latest
stable release when they do not specify a version. I think that's what
most people want. Developer releases are not meant for production.

Stefan

From lists at zopyx.com Wed Aug 25 08:08:58 2010
From: lists at zopyx.com (Andreas Jung)
Date: Wed, 25 Aug 2010 08:08:58 +0200
Subject: [lxml-dev] Please upload your sdists on PyPI
In-Reply-To: <4C74B200.6070209@behnel.de>
References: <4C744FA8.8050505@activestate.com> <4C74B200.6070209@behnel.de>
Message-ID: <4C74B37A.7050402@zopyx.com>

Stefan Behnel wrote:
> Andreas Jung, 25.08.2010 06:53:
>> 2010/8/25 Sridhar Ratnakumar
>>> On 8/24/2010 8:37 AM, Andreas Jung wrote:
>>>> could you please make _all_ releases of lxml available on PyPI.
>>>>
>>>> All buildout that haven't used a pinned version broke over the last
>>>> days
>>>> due the
>>>> unavailability of codespeak since 2.3.0a2 has no releases.
>
> At least setuptools handles this as expected, as does pip, as reported.
>
> Sounds like buildout could also be fixed.
>
>
>>> Hmm, shouldn't buildout use 2.2.7, that is in fact hosted in PyPI?
>>>
>>> 2.3alpha2 *was* once released to PyPI, but it was deleted later on:
>>> http://article.gmane.org/gmane.comp.python.lxml.devel/5496/match=2.3alpha
>>>
>>
>> Released is released!
>
> Well, yes, but I didn't put the sdists up on PyPI exactly because I
> didn't want users to install it by default. Otherwise, setuptools would
> grab the latest distribution regardless of the advertised state of
> stability.
>
> In the current state of affairs, users will continue to get the latest
> stable release when they do not specify a version. I think that's what
> most people want. Developer releases are not meant for production.

Once again: the 2.3.0a2 release had no release files and the default
behavior of setuptools or whatever is involved is to follow the homepage
URL at http://codespeak.net/lxml. I don't think that buildout
is doing something special here since it used setuptools transparently
afaik.

Andreas

From stefan_ml at behnel.de Wed Aug 25 08:20:29 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 25 Aug 2010 08:20:29 +0200
Subject: [lxml-dev] Please upload your sdists on PyPI
In-Reply-To: <4C74B37A.7050402@zopyx.com>
References: <4C744FA8.8050505@activestate.com> <4C74B200.6070209@behnel.de> <4C74B37A.7050402@zopyx.com>
Message-ID: <4C74B62D.9090906@behnel.de>

Andreas Jung, 25.08.2010 08:08:
> Stefan Behnel wrote:
>> At least setuptools handles this as expected, as does pip, as reported.
>>
>> In the current state of affairs, users will continue to get the latest
>> stable release when they do not specify a version. I think that's what
>> most people want. Developer releases are not meant for production.
>
> Once again: the 2.3.0a2 release had no release files and the default
> behavior of setuptools or whatever is involved is to follow the homepage
> URL at http://codespeak.net/lxml. I don't think that buildout
> is doing something special here since it used setuptools transparently
> afaik.

Well, I didn't try it, but quite obviously, it does. Plain setuptools
works for me and others.

Note that the 2.3 pages do now have a download URL, that's what seems to
do the trick for setuptools.

Personally, I think the right fix would be a client-side download option
that depends on the "Development Status" category. Setuptools, buildout &
friends should prefer the most stable version by default, instead of just
taking any "latest" version they find.
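
As a rough illustration of "prefer the most stable version by default",
here is a minimal sketch. It is not how setuptools, buildout or pip
actually pick a release, and it keys on the version string rather than on
the "Development Status" category, which is a simplification. The
pick_default name and the rule that only purely numeric versions count as
stable are assumptions made for the example.

```python
import re

# Hypothetical rule: a version counts as stable when it consists only of
# dot-separated numbers, so "2.3alpha2" or "2.3.0a2" are pre-releases.
STABLE = re.compile(r"\d+(\.\d+)*")


def pick_default(versions):
    """Return the highest stable version, or None if there is none."""
    stable = [v for v in versions if STABLE.fullmatch(v)]
    if not stable:
        return None
    # Compare numerically so that "2.2.10" sorts above "2.2.7".
    return max(stable, key=lambda v: tuple(int(part) for part in v.split(".")))


available = ["2.2.6", "2.2.7", "2.3alpha2"]
chosen = pick_default(available)
```

With the three versions discussed in this thread the alpha is skipped and
2.2.7 is chosen; a user who wants the alpha would still have to ask for it
explicitly.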

Stefan

From jlong at iarc.uaf.edu Thu Aug 26 22:06:33 2010
From: jlong at iarc.uaf.edu (James Long)
Date: Thu, 26 Aug 2010 20:06:33 +0000 (UTC)
Subject: [lxml-dev] Invalid Schematron
References:
Message-ID:

masetto writes:
>
> Hi all, i'm trying to validate an OVAL (http://oval.mitre.org) xml document
> against it's schematron rules
> (http://oval.mitre.org/language/version5.7/ovaldefinition/schematron/oval-definitions-schematron.sch)
> Following the manual (http://codespeak.net/lxml/validation.html) i wrote the
> following piece of code:
> from lxml import etree
> rule = open("oval-definitions-schematron.sch")
> defs = open("oval.xml")
> sct_doc = etree.parse(rule)
> try:
>     schematron = etree.Schematron(sct_doc)
> except etree.SchematronParseError, e:
>     print e.args
>     print e.error_log
>     print e.message
> >snip<
> but i got the following error(s):
> Document is not a valid Schematron schema
> oval-definitions-schematron.sch:34:0:ERROR:SCHEMASP:SCHEMAP_NOROOT:
> Expecting a pattern element instead of phase (repeated N times)
> oval-definitions-schematron.sch:1547:0:ERROR:SCHEMASP:SCHEMAP_NOROOT:
> Failed to compile context expression oval-def:objects/*/*[@datatype='binary']|oval-def:states/*/*[@datatype='binary']|oval-def:states/*/*
> (repeated N times)
> Can you help me?
> Thanks---Masetto

I'm having the same problem. In my case I get the error

Failed to compile context expression //*[gmd:identificationInfo/gmd:MD_DataIdentification]

in the schematron rule:

Thanks,
Jim

From stefan_ml at behnel.de Fri Aug 27 06:59:51 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 27 Aug 2010 06:59:51 +0200
Subject: [lxml-dev] Invalid Schematron
In-Reply-To:
References:
Message-ID: <4C774647.3030304@behnel.de>

masetto, 16.07.2010 14:07:
> i'm trying to validate an OVAL (http://oval.mitre.org) xml document against
> it's schematron rules
> Following the manual (http://codespeak.net/lxml/validation.html) i wrote the
> following piece of code:
>
> from lxml import etree
>
> rule = open("oval-definitions-schematron.sch")
> defs = open("oval.xml")
>
> sct_doc = etree.parse(rule)
> try:
>     schematron = etree.Schematron(sct_doc)
> except etree.SchematronParseError, e:
>     print e.args
>     print e.error_log
>     print e.message
>
> doc = etree.parse(defs)
> print schematron.validate(doc)
>
> but i got the following error(s):
>
> Document is not a valid Schematron schema
> oval-definitions-schematron.sch:34:0:ERROR:SCHEMASP:SCHEMAP_NOROOT:
> Expecting a pattern element instead of phase (repeated N times)
> ...
> oval-definitions-schematron.sch:1547:0:ERROR:SCHEMASP:SCHEMAP_NOROOT: Failed
> to compile context expression
> oval-def:objects/*/*[@datatype='binary']|oval-def:states/*/*[@datatype='binary']|oval-def:states/*/*
> (repeated N times)

Try to validate the document with libxml2's xmllint tool on the command
line. If that fails, too, try lxml 2.3, which has true ISO Schematron
support in the "lxml.isoschematron" package.

Stefan

From dpritsos at extremepro.gr Mon Aug 30 16:34:06 2010
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Mon, 30 Aug 2010 17:34:06 +0300
Subject: [lxml-dev] Question about etree vs html
Message-ID: <4C7BC15E.3060900@extremepro.gr>

Hello,

I am Dimitrios Pritsos and I am working on a WebCrawler. In order to
analyse the pages that I am getting while crawling I am using lxml.
However, I cannot tell the difference between lxml.html and lxml.etree
when it comes to XHTML parsing. In particular I am confused about what to
use from the variety of options lxml provides. Moreover, the
documentation is a bit misleading. Let me be more specific.

Firstly, I've seen that lxml.html has been developed in Python and is in
fact a shortcut for extracting several common pieces of information from
an HTML page instead of building your own paths and xpaths, similarly to
the XML() and HTML() shortcuts. In addition, all of these shortcuts use
HTML() (i.e. the HTMLParser()). Unfortunately it took me a few days to
realize this, and I found the answer here:
http://zdar.trinet.as/doc/python-lxml-2.0.11/doc/html/api/lxml-module.html
because no documentation clarifies this. Not even the one by John W.
Shipman, which is the best for newbies like me.

However, in the documentation (found in
http://codespeak.net/lxml/lxmldoc-2.2.7.pdf) there is a statement that
says "Note that XHTML is best parsed as XML, parsing it with the HTML
parser can lead to unexpected results". Considering that, using
lxml.etree is the best choice for the www, right, because a great variety
of web pages are in XHTML and not HTML markup. On the other hand
lxml.html has all the good stuff. So, what exactly is going on here,
which library should I use, or how could I combine them so as not to lose
any information from the pages?

After several tests, over several days, I found that different "parsing"
functions give different results, and different tostring() calls (from
html or etree) again give different results even for the same
ElementTree. So, why is that? No documentation found for this either.
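
Part of the difference can be seen without any HTML parser at all. The
following is a minimal sketch using the standard library's
xml.etree.ElementTree, on the assumption that lxml.etree behaves the same
way for these particular calls: parse() returns an ElementTree whose root
has to be fetched with getroot(), fromstring() returns the root Element
directly, and an XML parser keeps the XHTML namespace as part of every
tag name. The sample document and the variable names are made up for
illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document in the XHTML namespace.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Testing</p></body>"
    "</html>"
)

# parse() reads from a file-like object and returns an ElementTree ...
tree = ET.parse(io.StringIO(xhtml))
# ... whose root Element has to be asked for explicitly.
root_from_tree = tree.getroot()

# fromstring() returns the root Element directly.
root = ET.fromstring(xhtml)

# An XML parser keeps the namespace as part of each tag name.
root_tag = root.tag

# Collect all text below the root, similar in spirit to text_content().
text = "".join(root.itertext())
```

Methods that belong to elements are therefore only reachable after
getroot() when the document came from parse(), and an XML-parsed XHTML
page has to be searched with namespace-qualified tag names.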

In general lxml seems really great to me. However, because of the limited
documentation, sometimes you cannot tell what is what, and everything
just seems a different path to the same thing, but this is not the case
as far as I can tell from my tests. So, in practice it is totally
different. For example try this:

>>> xhtmlsrc = '

Testing
'
>>> lxml.etree.tostring(lxml.html.soupparser.fromstring(xhtmlsrc))
'DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" "http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

Testing
'
>>> lxml.etree.tostring(lxml.html.fromstring(xhtmlsrc))
'

Testing
'
>>> lxml.etree.tostring(lxml.html.parse(StringIO(xhtmlsrc)))
'\n

Testing
'

/Why do the above give different results when, based on the
documentation, they are supposed to give the same result/?

>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc))
>>> xhtmltree.test_content()
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'test_content'
>>>
>>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc)
>>> xhtmltree.text_content()
'Testing'
>>>

/Again, why is there this different result when it is not reported in the
documentation?/

So could you please advise me what I should do?

And one more question: when I am using XMLParser(), which DTD is used for
building the ElementTree? In the case of HTMLParser() I can tell it is
HTML 4.0 because this is what I get when I am doing this:

>>> xhtmlsrc2 = '

Testing
' >>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc2)) >>> lxml.html.tostring(xhtmltree) '\n

Testing
' >>> PLEASE I NEED SOME HELP HERE! Best Regards, Dimitrios -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100830/f71cba0d/attachment-0001.htm From chris at simplistix.co.uk Mon Aug 30 19:14:50 2010 From: chris at simplistix.co.uk (Chris Withers) Date: Mon, 30 Aug 2010 18:14:50 +0100 Subject: [lxml-dev] Please upload your sdists on PyPI In-Reply-To: <4C74B62D.9090906@behnel.de> References: <4C744FA8.8050505@activestate.com> <4C74B200.6070209@behnel.de> <4C74B37A.7050402@zopyx.com> <4C74B62D.9090906@behnel.de> Message-ID: <4C7BE70A.8080903@simplistix.co.uk> Stefan Behnel wrote: > Personally, I think the right fix would be a client-side download option > that depends on the "Development Status" category. Setuptools, buildout & > friends should prefer the most stable version by default, instead of just > taking any "latest" version they find. This is true for new releases of buildout. It's also configurable in your buildout.cfg. Andreas is just likely stuck in the stone age ;-) Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From cswiggett at knowledgemosaic.com Tue Aug 31 00:13:40 2010 From: cswiggett at knowledgemosaic.com (Clif Swiggett) Date: Mon, 30 Aug 2010 15:13:40 -0700 Subject: [lxml-dev] DOCTYPE converted to

tag Message-ID: I'm using lxml 2.2.2. The input html looks like (note the DOCTYPE directive is mangled): I parse with the lxml.html.HTMLParser as follows: import lxml.html as html tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=True)) print html.tostring(tree, pretty_print=True) and it produces the following output:

foo">

I'm not sure why it is converting the DOCTYPE tag to a

tag. Any ideas? Thanks for your help. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100830/27824551/attachment.htm From jholg at gmx.de Tue Aug 31 09:21:27 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 31 Aug 2010 09:21:27 +0200 Subject: [lxml-dev] Question about etree vs html In-Reply-To: <4C7BC15E.3060900@extremepro.gr> References: <4C7BC15E.3060900@extremepro.gr> Message-ID: <20100831072127.288820@gmx.net> Hi, > >>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc)) > >>> xhtmltree.test_content() > Traceback (most recent call last): > File "", line 1, in > AttributeError: 'lxml.etree._ElementTree' object has no attribute > 'test_content' > >>> > >>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc) > >>> xhtmltree.text_content() > 'Testing' > >>> > > /Again why there is this deferent result when it documentation it is not > reported?/ You are missing the distinction between Elements and ElementTrees: http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files >>> xhtmlsrc = '


Testing
'
>>>
>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc))
>>> type(xhtmltree)
<type 'lxml.etree._ElementTree'>
>>>
>>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc)
>>> type(xhtmltree)
<class 'lxml.html.HtmlElement'>
>>>

ElementTree and Element have a different API.

I can't really comment on your other questions as I've never used lxml.html.

Holger
--
GMX DSL SOMMER-SPECIAL: Surf & Phone Flat 16.000 für nur 19,99 €/mtl.!* http://portal.gmx.net/de/go/dsl

From jholg at gmx.de Tue Aug 31 09:40:15 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 31 Aug 2010 09:40:15 +0200 Subject: [lxml-dev] DOCTYPE converted to

tag In-Reply-To: References: Message-ID: <20100831074015.288760@gmx.net> Hi, > > > > > > > > > > > > > > > I parse with the lxml.html.HTMLParser as follows: > > > > import lxml.html as html > > tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=True)) > > print html.tostring(tree, pretty_print=True) > > > > and it produces the following output: > > > > > >

foo"> > > > > > >

> >
> >

Trying this without recover=True:

>>> tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=False))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/apps/pydev/pytaf/solaris-2.8/2010-Q2/lib/python2.4/site-packages/lxml/html/__init__.py", line 601, in fromstring
    return document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/apps/pydev/pytaf/solaris-2.8/2010-Q2/lib/python2.4/site-packages/lxml/html/__init__.py", line 511, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71106)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67875)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
lxml.etree.XMLSyntaxError: Unfinished PubidLiteral, line 1, column 24

Removing the character "~" from the doctype declaration:

>>> inhtml = """
...
...
...
...
...
... """
>>>
... tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=False))
>>> print html.tostring(tree, pretty_print=True)
>>>

Maybe this character isn't allowed or carries special meaning? Should probably read the spec...

Holger
--
GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT!
Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From stefan_ml at behnel.de Tue Aug 31 08:25:41 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 31 Aug 2010 08:25:41 +0200 Subject: [lxml-dev] Fw: Install lxml problem In-Reply-To: <242414.99735.qm@web87108.mail.ird.yahoo.com> References: <242414.99735.qm@web87108.mail.ird.yahoo.com> Message-ID: <4C7CA065.4070907@behnel.de> Hywel Jones, 21.08.2010 13:33: > I'm using Windows XP but failing to install lxml. I'd appreciate any advice. > I'm a Python novice so it will have to be simple. Thanks in advance. Here's what > I got. I can see it says "make sure the development packages of libxml2 > and libxslt are installed" and they're not but the installation > instructions specifically say that the easy_install should take care of that! Well, they would if you had a C compiler installed, which apparently you don't. Please use a pre-built binary release, there should be one for 2.2.6 (or an older version). Stefan From nospamus at gmail.com Tue Aug 31 14:37:54 2010 From: nospamus at gmail.com (Bryan Hughes) Date: Tue, 31 Aug 2010 08:37:54 -0400 Subject: [lxml-dev] Code hangs when calling etree.XMLSchema Message-ID: When I pass in a valid XSD file into this line of code, the application hangs: logging.debug("start") xmlschema_ = etree.XMLSchema(xsd_file) logging.debug("finish") I've wrapped that piece of code with debug logging to confirm that it hangs. In other words, my logs will print "start", but do not print "finish". Unfortunately, no exceptions are thrown -- it simply does nothing once that line of code executes. Any thoughts on what may be happening? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100831/4a16f13a/attachment.htm From dpritsos at extremepro.gr Tue Aug 31 15:43:55 2010 From: dpritsos at extremepro.gr (Dimitrios Pritsos) Date: Tue, 31 Aug 2010 16:43:55 +0300 Subject: [lxml-dev] Question about etree vs html In-Reply-To: <20100831072127.288820@gmx.net> References: <4C7BC15E.3060900@extremepro.gr> <20100831072127.288820@gmx.net> Message-ID: <4C7D071B.9040802@extremepro.gr> On 31/08/10 10:21, jholg at gmx.de wrote: > Hi, > > >> >>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc)) >> >>> xhtmltree.test_content() >> Traceback (most recent call last): >> File "", line 1, in >> AttributeError: 'lxml.etree._ElementTree' object has no attribute >> 'test_content' >> >>> >> >>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc) >> >>> xhtmltree.text_content() >> 'Testing' >> >>> >> >> /Again why there is this deferent result when it documentation it is not >> reported?/ >> > You are missing the distinction between Elements and ElementTrees: > > http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files > > Thank you for this tutorial! >>>> xhtmlsrc = '

Testing
' >>>> >>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc)) >>>> type(xhtmltree) >>>> > > >>>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc) >>>> type(xhtmltree) >>>> > > >>>> > ElementTree and Element have a different API. > > I can't really comment on your other questions as I've never used lxml.html. >

As I have seen in the lxml.html internals, while lxml.etree is based on libxml2, lxml.html is a shortcut (written in Python) for the common functions one would otherwise have to build on one's own when passing the HTMLParser() to etree.parse() or etree.fromstring() as an "external" parser (i.e. not the default XMLParser).

All of my questions above boil down to one: which parser should I use to get an ElementTree of the XHTML files I am downloading for further analysis? The XMLParser (with load_dtd=True for using the DTD when parsing, recover=True, no_network=False) or the HTMLParser? Which will give me a proper ElementTree for XHTML files which are not exactly HTML 4.0 (but really close)?

The reason I am looking into this in detail is that, as I said, the documentation states that an HTML parser might return an ElementTree which is not the proper one when it has to deal with XHTML, and that in that case it is better to use an XML parser.

> Holger >

Thank you very much, Holger, for your instant response!

Dimitrios

From cswiggett at knowledgemosaic.com Tue Aug 31 18:36:04 2010 From: cswiggett at knowledgemosaic.com (Clif Swiggett) Date: Tue, 31 Aug 2010 09:36:04 -0700 Subject: [lxml-dev] DOCTYPE converted to

tag In-Reply-To: <20100831074015.288760@gmx.net> References: <20100831074015.288760@gmx.net> Message-ID: Thanks Holger - Yes, it seems related to having a tilde (~) in the DOCTYPE, and appears to be a bug. I submitted a report to https://bugs.launchpad.net/lxml. All the best, Clif > -----Original Message----- > From: jholg at gmx.de [mailto:jholg at gmx.de] > Sent: Tuesday, August 31, 2010 12:40 AM > To: Clif Swiggett; lxml-dev at codespeak.net > Subject: Re: [lxml-dev] DOCTYPE converted to

tag > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I parse with the lxml.html.HTMLParser as follows: > > > > > > > > import lxml.html as html > > > > tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=True)) > > > > print html.tostring(tree, pretty_print=True) > > > > > > > > and it produces the following output: > > > > > > > > > > > >

foo"> > > > > > > > > > > > >

> > > > > > > > > > Trying this without recover=True: > > >>> tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=False)) > Traceback (most recent call last): > File "", line 1, in ? > File "/apps/pydev/pytaf/solaris-2.8/2010-Q2/lib/python2.4/site- > packages/lxml/html/__init__.py", line 601, in fromstring > return document_fromstring(html, parser=parser, base_url=base_url, > **kw) > File "/apps/pydev/pytaf/solaris-2.8/2010-Q2/lib/python2.4/site- > packages/lxml/html/__init__.py", line 511, in document_fromstring > value = etree.fromstring(html, parser, **kw) > File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring > (src/lxml/lxml.etree.c:48634) > File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument > (src/lxml/lxml.etree.c:72245) > File "parser.pxi", line 1424, in lxml.etree._parseDoc > (src/lxml/lxml.etree.c:71106) > File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc > (src/lxml/lxml.etree.c:67875) > File "parser.pxi", line 539, in > lxml.etree._ParserContext._handleParseResultDoc > (src/lxml/lxml.etree.c:64257) > File "parser.pxi", line 625, in lxml.etree._handleParseResult > (src/lxml/lxml.etree.c:65178) > File "parser.pxi", line 565, in lxml.etree._raiseParseError > (src/lxml/lxml.etree.c:64521) > lxml.etree.XMLSyntaxError: Unfinished PubidLiteral, line 1, column 24 > > Removing the character "~" from the doctype declaration: > > >>> inhtml = """ > ... > ... > ... > ... > ... > ... """ > >>> > ... tree = html.fromstring(inhtml, parser=html.HTMLParser(recover=False)) > >>> print html.tostring(tree, pretty_print=True) > > > > >>> > > Maybe this character isn't allowed or carries special meaning? Should > probably read the spec... > > Holger > -- > GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! > Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01
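
On the open question of whether "~" is allowed: the XML 1.0 specification restricts public identifiers to the PubidChar production, and the tilde is not in that set, which would be consistent with the "Unfinished PubidLiteral" error above. The sketch below checks a public identifier against that production. The helper name invalid_pubid_chars and both example identifiers are made up for illustration (the DOCTYPE from this thread did not survive in the archive), and whether the HTML parser applies exactly this rule is an assumption, not something confirmed here.

```python
import re

# PubidChar production from the XML 1.0 specification:
#   #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
_PUBID_CHAR = re.compile(r"[ \r\na-zA-Z0-9'()+,./:=?;!*#@$_%-]")


def invalid_pubid_chars(public_id):
    """Return the characters of public_id that are not valid PubidChars.

    Hypothetical helper, written only for this illustration.
    """
    return [ch for ch in public_id if not _PUBID_CHAR.fullmatch(ch)]


# A well-formed public identifier yields an empty list.
print(invalid_pubid_chars("-//W3C//DTD XHTML 1.0 Transitional//EN"))

# A made-up identifier containing a tilde reports the offending character.
print(invalid_pubid_chars("-//EXAMPLE//DTD ~demo//EN"))
```

Running a check like this over a crawled page's DOCTYPE before parsing would at least flag which documents are likely to need recover=True.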