From dkuhlman at rexx.com Wed Sep 1 01:08:56 2010
From: dkuhlman at rexx.com (Dave Kuhlman)
Date: Tue, 31 Aug 2010 16:08:56 -0700
Subject: [lxml-dev] Access to ElementTree for XML schema
Message-ID: <20100831230856.GA75749@cutter.rexx.com>

I'm looking for a way to get access to an etree._ElementTree that
represents an XML schema document in which the xsd:include and
xsd:import elements have been recursively expanded.

When I create an instance of etree.XMLSchema, libxml2 expands the
underlying C tree for the schema. Am I right about that? If so, is
there a way for me to get an etree._ElementTree that wraps that
underlying C tree? Or, perhaps to have a way to create an
etree._ElementTree from the XMLSchema object?

If that document is not available, I suppose that I am asking for a new
feature that enables us to retrieve the processed and expanded etree
Document from an etree.XMLSchema object.

Or, is there already some other way to get an XML schema document tree
in which the include and import elements have been (recursively)
expanded?

The reason I'm asking for this -- I process XML schema documents, and
I think we should encourage other Python hackers to do so, too. This
(new) feature would enable lxml to support that. I'm trying to
implement this capability myself in Python using lxml, but my
implementation still has bugs and I'm sure that libxml2 does it better
than I can.

Thanks for help.
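As a rough illustration of the Python-side approach described above, the sketch below inlines xsd:include targets into the including schema's tree. It is not how libxml2 does it and it is not an lxml feature: the expand_includes helper, the load callback and the in-memory docs mapping are hypothetical names made up for this example. It deliberately ignores xsd:import, xsd:redefine, target-namespace coercion and relative URL resolution.

```python
from lxml import etree

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XSD_INCLUDE = "{%s}include" % XSD_NS


def expand_includes(schema_root, load, seen=None):
    """Replace each top-level xsd:include with the children of its target.

    `load` is a hypothetical callback that maps a schemaLocation string to
    the root element of the parsed target schema. `seen` records locations
    that were already inlined, which guards against include cycles.
    """
    if seen is None:
        seen = set()
    # findall() returns a list, so removing elements while looping is safe.
    for include in schema_root.findall(XSD_INCLUDE):
        location = include.get("schemaLocation")
        position = schema_root.index(include)
        schema_root.remove(include)
        if location in seen:
            continue  # already inlined elsewhere; just drop the include
        seen.add(location)
        included_root = expand_includes(load(location), load, seen)
        # Inserting an element into another tree moves it there.
        for offset, child in enumerate(list(included_root)):
            schema_root.insert(position + offset, child)
    return schema_root


# Hypothetical in-memory "files" standing in for schemas on disk.
docs = {
    "main.xsd": (
        b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
        b'<xs:include schemaLocation="types.xsd"/>'
        b'<xs:element name="a" type="T"/>'
        b'</xs:schema>'
    ),
    "types.xsd": (
        b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
        b'<xs:simpleType name="T"><xs:restriction base="xs:string"/>'
        b'</xs:simpleType>'
        b'</xs:schema>'
    ),
}


def load(location):
    """Parse one of the in-memory schemas and return its root element."""
    return etree.fromstring(docs[location])


# Mark the entry document as seen so an include cycle cannot re-inline it.
expanded = expand_includes(load("main.xsd"), load, seen={"main.xsd"})
print(etree.tostring(expanded, pretty_print=True).decode("ascii"))
```

The callback keeps file access out of the helper, so the same function could be pointed at real files or URLs; resolving schemaLocation relative to the including document would have to be added there.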
- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

From ab at rdprojekt.pl Wed Sep 1 09:48:46 2010
From: ab at rdprojekt.pl (Adam Bielański)
Date: Wed, 01 Sep 2010 09:48:46 +0200
Subject: [lxml-dev] No information about error location in file when validating against XMLSchema
In-Reply-To: <4C348E49.3090006@rdprojekt.pl>
References: <4C348E49.3090006@rdprojekt.pl>
Message-ID: <4C7E055E.1050407@rdprojekt.pl>

Hello,

I'm running "for x in etree.iterparse(open('NotValidFile.xml'), schema=schema)", where 'NotValidFile.xml' is a file that won't validate against the provided schema. Iterating over it raises etree.XMLSyntaxError, which is expected and OK. But the value of the 'position' attribute of that error is (0,0) and the value of the 'offset' attribute is None.

Is there a way to get at least the line number at which the error occurred? Is the current behaviour a bug in etree.iterparse, or is it expected behaviour that errors raised by etree.iterparse don't tell anything about their position in the XML file?

Regards,
Adam Bielański.

From Sean.Mccully at turner.com Wed Sep 1 22:40:48 2010
From: Sean.Mccully at turner.com (Mccully, Sean)
Date: Wed, 1 Sep 2010 16:40:48 -0400
Subject: [lxml-dev] python-lxml-dpg
Message-ID:

I just recently installed the python-lxml-dbg modules and am trying to import etree_d, but I get a strange error: undefined symbol: _Py_RefTotal. I didn't see any information elsewhere.

>>> from lxml import etree_d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /usr/lib/python2.6/dist-packages/lxml/etree_d.so: undefined symbol: _Py_RefTotal
>>>

Python 2.6.

Any help? Thanks!!

Sean McCully
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100901/43e99c58/attachment.htm

From sergio at sergiomb.no-ip.org Thu Sep 2 02:28:43 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Thu, 02 Sep 2010 01:28:43 +0100
Subject: [lxml-dev] Question about etree vs html
In-Reply-To: <4C7BC15E.3060900@extremepro.gr>
References: <4C7BC15E.3060900@extremepro.gr>
Message-ID: <1283387323.15534.13.camel@segulix>

On Mon, 2010-08-30 at 17:34 +0300, Dimitrios Pritsos wrote:
> I am Dimitrios Pritsos and I am working on a WebCrawler. In order to
> analyse the pages that I am getting while crawling I am using lxml.
> However I cannot tell the difference of lxml.html and lxml.etree when
> coming to the XHTML parsing. In particular I am confused of what to
> use from the variety of options lxml is providing.

Hi,

I think lxml.html and lxml.etree do the same, but html has some methods specific to HTML, like .head, and html just has tostring, which is etree.HTMLParser(), while etree has more parsers.

I'm developing a kind of WebCrawler too, but the problems of parsing bad HTML fall to libxml2, not here. lxml is just a wrapper around libxml2 and libxslt (which are written in C) for Python.

Cheers,
--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3293 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100902/caa484bf/attachment.bin

From stefan_ml at behnel.de Thu Sep 2 09:11:13 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Sep 2010 09:11:13 +0200
Subject: [lxml-dev] python-lxml-dpg
In-Reply-To:
References:
Message-ID: <4C7F4E11.8050305@behnel.de>

Mccully, Sean, 01.09.2010 22:40:
> I just recently installed the python-lxml-dbg modules and am trying to import etree_d, but I get a strange error: undefined symbol: _Py_RefTotal. I didn't see any information elsewhere.
>
> >>> from lxml import etree_d
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: /usr/lib/python2.6/dist-packages/lxml/etree_d.so: undefined symbol: _Py_RefTotal
>
> Python 2.6.

Well, yes, but which one? You need to run the debug version of Python 2.6, not the normal runtime.

Note that this is a Linux distribution related issue, not an lxml one.

Stefan

From stefan_ml at behnel.de Thu Sep 2 14:14:23 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Sep 2010 14:14:23 +0200
Subject: [lxml-dev] lxml 2.2.8 released
Message-ID: <4C7F951F.3000403@behnel.de>

Hi,

lxml 2.2.8 is up on PyPI. It is a minor bugfix release for the stable 2.2 series.

Have fun,

Stefan

2.2.8 (2010-09-02)

Bugs fixed

* Crash in newer libxml2 versions when moving elements between documents
  that had attributes on replaced XInclude nodes.
* Import fix for urljoin in Python 3.1+.

From stefan_ml at behnel.de Fri Sep 3 20:25:46 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 03 Sep 2010 20:25:46 +0200
Subject: [lxml-dev] Access to ElementTree for XML schema
In-Reply-To: <20100831230856.GA75749@cutter.rexx.com>
References: <20100831230856.GA75749@cutter.rexx.com>
Message-ID: <4C813DAA.3050902@behnel.de>

Dave Kuhlman, 01.09.2010 01:08:
> I'm looking for a way to get access to an etree._ElementTree that
> represents an XML schema document in which the xsd:include and
> xsd:import elements have been recursively expanded.
>
> When I create an instance of etree.XMLSchema, libxml2 expands the
> underlying C tree for the schema. Am I right about that?

The best ways to find out are to a) read the libxml2 source code or b) add a little debug code that dumps the schema document to a file *after* parsing. Just go ahead, the XML Schema code in lxml is pretty short.

> If so, is
> there a way for me to get an etree._ElementTree that wraps that
> underlying C tree? Or, perhaps to have a way to create an
> etree._ElementTree from the XMLSchema object?
*If* the tree is available as a normal XML tree, it is trivial to copy it and wrap it in an ElementTree, sure.

> Or, is there already some other way to get an XML schema document tree
> in which the include and import elements have been (recursively)
> expanded?

No. However, is it really that hard to implement the algorithm for that in Python space? Admittedly, XML Schema is a severely complex format, but the import rules are definitely not the most complex part of the spec.

> The reason I'm asking for this -- I process XML schema documents, and
> I think we should encourage other Python hackers to do so, too. This
> (new) feature would enable lxml to support that. I'm trying to
> implement this capability myself in Python using lxml, but my
> implementation still has bugs and I'm sure that libxml2 does it better
> than I can.

I'm pretty sure it handles imports and includes as specified. It does have a few remaining quirks for certain less common XML Schema features, but all in all, it works pretty well and is spec compliant.

Stefan

From cJ-lxml at zougloub.eu Fri Sep 3 22:15:33 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Fri, 3 Sep 2010 16:15:33 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
Message-ID: <20100903161533.20e66719@Bidule.intranet.cs>

Hi all,

Using a python-based build tool, I decided to use lxml directly instead of calling xsltproc in order to transform docbook sources. The docbook sources contain olinks (links between documents). Everything works fine when using xsltproc.
I tried the following:

def xsltproc(xml_filename, xslt_filename, **kw):
    parser = etree.XMLParser()
    xml_doc = etree.parse(xml_filename, parser)
    print xml_doc.getroot().base
    xml_doc.xinclude()
    xslt_doc = etree.parse(xslt_filename)
    ac = etree.XSLTAccessControl(read_network=True, write_file=True, read_file=True, create_dir=True)
    transform = etree.XSLT(xslt_doc, access_control=ac)
    result_tree = transform(xml_doc, **kw)
    res = etree.tostring(result_tree, pretty_print=True, encoding="utf-8", xml_declaration=True)
    print "Transform\n", transform.error_log
    print "Parser\n", parser.error_log
    if not res:
        raise Exception("pouet")
    return res

kw = {
    "olink.base.uri" : "doc.html",
    "collect.xref.targets" : "yes",
    "targets.filename" : "doc.html.db",
    "target.database.document" : "olinkdb-html.xml",
}
res = xsltproc("doc.xml", ".../xsl/xhtml/docbook.xsl", **kw)
with open("pouet", "w") as f:
    f.write(res)

This code fails silently (Exception("pouet") is raised).
If I comment out the line with collect.xref.targets, a document is correctly output. But the olinks are screwed. Somehow the xslt transformer is not able to read the olink information of other documents, by reading the pointers in olinkdb-html.xml.

It seems that the silent failure is a bug, and I may be missing some stuff in order to convert the documents properly.

More complete test cases available upon request.

Thank you all,

--
cJ

From stefan_ml at behnel.de Fri Sep 3 22:21:18 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 03 Sep 2010 22:21:18 +0200
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <20100903161533.20e66719@Bidule.intranet.cs>
References: <20100903161533.20e66719@Bidule.intranet.cs>
Message-ID: <4C8158BE.4010203@behnel.de>

Jérôme Carretero, 03.09.2010 22:15:
> This code fails silently (Exception("pouet") is raised).
> If I comment out the line with collect.xref.targets, a document is correctly output.
>[...]
> It seems that the silent failure is a bug, and I may be missing some stuff in order to convert the documents properly.

Did you check the error_log of the XSLT object to see if it reports any errors?

Stefan

From cJ-lxml at zougloub.eu Fri Sep 3 22:31:06 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Fri, 3 Sep 2010 16:31:06 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <4C8158BE.4010203@behnel.de>
References: <20100903161533.20e66719@Bidule.intranet.cs> <4C8158BE.4010203@behnel.de>
Message-ID: <20100903163106.7f1fd31f@Bidule.intranet.cs>

On Fri, 03 Sep 2010 22:21:18 +0200
Stefan Behnel wrote:

> Jérôme Carretero, 03.09.2010 22:15:
> > This code fails silently (Exception("pouet") is raised).
> > If I comment out the line with collect.xref.targets, a document is correctly output.
> >[...]
> > It seems that the silent failure is a bug, and I may be missing some stuff in order to convert the documents properly.
>
> Did you check the error_log of the XSLT object to see if it reports any errors?
>
> Stefan

In my listing I have:

print "Transform\n", transform.error_log
print "Parser\n", parser.error_log

And the printout is empty (except for the strings, of course).

Regards,

--
cJ

From stefan_ml at behnel.de Fri Sep 3 22:33:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 03 Sep 2010 22:33:55 +0200
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <20100903163106.7f1fd31f@Bidule.intranet.cs>
References: <20100903161533.20e66719@Bidule.intranet.cs> <4C8158BE.4010203@behnel.de> <20100903163106.7f1fd31f@Bidule.intranet.cs>
Message-ID: <4C815BB3.1010603@behnel.de>

Jérôme Carretero, 03.09.2010 22:31:
> On Fri, 03 Sep 2010 22:21:18 +0200
> Stefan Behnel wrote:
>
>> Jérôme Carretero, 03.09.2010 22:15:
>>> This code fails silently (Exception("pouet") is raised).
>>> If I comment out the line with collect.xref.targets, a document is correctly output.
>>> [...]
>>> It seems that the silent failure is a bug, and I may be missing some stuff in order to convert the documents properly.
>>
>> Did you check the error_log of the XSLT object to see if it reports any errors?
>
> In my listing I have:
> print "Transform\n", transform.error_log
> print "Parser\n", parser.error_log
>
> And the printout is empty (except for the strings, of course).

Ok, next questions then: which versions of lxml, libxml2 and libxslt are you using, and what command line do you use to do the same thing in libxslt?

Stefan

From cJ-lxml at zougloub.eu Fri Sep 3 22:46:33 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Fri, 3 Sep 2010 16:46:33 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <4C815BB3.1010603@behnel.de>
References: <20100903161533.20e66719@Bidule.intranet.cs> <4C8158BE.4010203@behnel.de> <20100903163106.7f1fd31f@Bidule.intranet.cs> <4C815BB3.1010603@behnel.de>
Message-ID: <20100903164633.33a63c8e@Bidule.intranet.cs>

On Fri, 03 Sep 2010 22:33:55 +0200
Stefan Behnel wrote:

> Ok, next questions then: which version of lxml, libxml2 and libxslt are you
> using, and what command line do you use to do the same thing in libxslt?
>
> Stefan

Python 2.6.5 (release26-maint, Aug 3 2010, 17:34:54)
[GCC 4.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> print "lxml.etree: ", etree.LXML_VERSION
lxml.etree: (2, 2, 6, 0)
>>> print "libxml used: ", etree.LIBXML_VERSION
libxml used: (2, 7, 7)
>>> print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
libxml compiled: (2, 7, 7)
>>> print "libxslt used: ", etree.LIBXSLT_VERSION
libxslt used: (1, 1, 26)
>>> print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
libxslt compiled: (1, 1, 26)

/usr/bin/xsltproc --maxdepth 10000 --nonet --xinclude --stringparam olink.base.uri "doc.html" --stringparam collect.xref.targets "yes" --stringparam targets.filename "doc.html.db" --stringparam target.database.document "olinkdb-html.xml" /path/to/the/xsl/xhtml/docbook.xsl.xsl doc.xml > doc.html

Regards,

--
cJ

From stefan_ml at behnel.de Sat Sep 4 21:54:47 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 04 Sep 2010 21:54:47 +0200
Subject: [lxml-dev] Code hangs when calling etree.XMLSchema
In-Reply-To:
References:
Message-ID: <4C82A407.2070700@behnel.de>

Bryan Hughes, 31.08.2010 14:37:
> When I pass in a valid XSD file into this line of code, the application
> hangs:
>
> logging.debug("start")
> xmlschema_ = etree.XMLSchema(xsd_file)
> logging.debug("finish")
>
> I've wrapped that piece of code with debug logging to confirm that it
> hangs. In other words, my logs will print "start", but do not print
> "finish". Unfortunately, no exceptions are thrown -- it simply does nothing
> once that line of code executes.

Could you run it under gdb and send in a stack trace from the point where it hangs?

Also, what lxml/library versions are you using? And what system is this on?

Stefan

From terry_n_brown at yahoo.com Sat Sep 4 22:09:44 2010
From: terry_n_brown at yahoo.com (Terry Brown)
Date: Sat, 4 Sep 2010 13:09:44 -0700 (PDT)
Subject: [lxml-dev] install problems in Windows
Message-ID: <981581.24112.qm@web34401.mail.mud.yahoo.com>

Hi, thanks for the wonderful product, use it all the time.
Given the inability of Python's etree to find a parent from an element (am I missing something?), I recently used lxml in a simple program, not thinking people would try and run it on Windows. But they did.

c:\python26\python\scripts\easy_install lxml==2.2.4

seemed to work fine for python 2.6. We got nothing working for python 2.7 and 3.1.

Without installing a C compiler in Windows... is there an easy way to get lxml in windows in 2.7/3.1 currently? Is it just a matter of waiting for the right binary installer to be made?

Current plan is to rewrite the program to not use lxml, it's trivial enough that that's no big deal, but a shame if it's unnecessary work.

Thanks,
Terry

From stefan_ml at behnel.de Mon Sep 6 10:05:09 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 06 Sep 2010 10:05:09 +0200
Subject: [lxml-dev] install problems in Windows
In-Reply-To: <981581.24112.qm@web34401.mail.mud.yahoo.com>
References: <981581.24112.qm@web34401.mail.mud.yahoo.com>
Message-ID: <4C84A0B5.3090202@behnel.de>

Terry Brown, 04.09.2010 22:09:
> Given the inability of Python's etree to find a parent from an element (am I missing something?)

There's a recipe that simply caches the parents for each element in a dict. It gets a bit tedious when you start modifying the document, though.

> I recently used lxml in a simple program, not thinking people would try and run it on Windows. But they did.

Tell me about it...

> c:\python26\python\scripts\easy_install lxml==2.2.4
>
> seemed to work fine for python 2.6. We got nothing working for python 2.7 and 3.1.

It looks like there is a 2.2.4 binary distro available for 3.1, but I don't know (and have my doubts) if it works. 2.2.8 should work in general, but there is no Windows binary currently. It's best to use the 2.3 series with Python 3.1 and later.

There is currently no Windows binary for 2.7 that I know of.

> Without installing a C compiler in Windows... is there an easy way to get lxml in windows in 2.7/3.1 currently?
> Is it just a matter of waiting for the right binary installer to be made?

CC-ing Sidnei here, who commonly builds them. Maybe he can provide a 2.2.8 binary distro with the latest libxml2/libxslt in a somewhat timely fashion, and maybe even binaries for the latest 2.3 alpha on Py3.1.

I tried myself, but so far, I could neither manage to install MSVC under Wine, nor does it seem to be trivial to build libxml2/libxslt with MinGW under Wine. Tricky business. I'll keep trying to get that working, but don't expect a breakthrough anytime soon.

Stefan

From stefan_ml at behnel.de Mon Sep 6 11:31:40 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 06 Sep 2010 11:31:40 +0200
Subject: [lxml-dev] lxml 2.3 beta 1 released
Message-ID: <4C84B4FC.9060909@behnel.de>

Hi,

I'm happy to announce the first beta release of lxml 2.3. This is a bug-fix release for the upcoming 2.3 series that mostly follows the fixes in lxml 2.2.8. Please consider testing this version if you currently use 2.2.x, which will continue to receive only important bug fixes.

Note that the source distribution is available on PyPI, so your default lxml installation process may choose to use it. If you do not want that, choose an explicit version, e.g. "easy_install lxml==2.2.8".

Sidnei, Pascal, please provide binary builds for this release, preferably using libxml2 2.7.7 (or 2.7.3 alternatively) and libxslt 1.1.26.

Based on the current state of affairs, I'm expecting to release lxml 2.3 final in mid October, so there's still enough time to find bugs. :)

This release was built using Cython 0.13 final.

Have fun,

Stefan

2.3beta1 (2010-09-06)

Bugs fixed

* Crash in newer libxml2 versions when moving elements between documents
  that had attributes on replaced XInclude nodes.
* XMLID() function was missing the optional parser and base_url parameters.
* Searching for wildcard tags in iterparse() was broken in Py3.
* lxml.html.open_in_browser() didn't work in Python 3 due to the use of os.tempnam.
  It now takes an optional 'encoding' parameter.

From sidnei at awkly.org Mon Sep 6 16:21:18 2010
From: sidnei at awkly.org (Sidnei da Silva)
Date: Mon, 6 Sep 2010 10:21:18 -0400
Subject: [lxml-dev] install problems in Windows
In-Reply-To: <4C84A0B5.3090202@behnel.de>
References: <981581.24112.qm@web34401.mail.mud.yahoo.com> <4C84A0B5.3090202@behnel.de>
Message-ID:

On Mon, Sep 6, 2010 at 4:05 AM, Stefan Behnel wrote:
> CC-ing Sidnei here, who commonly builds them. Maybe he can provide a 2.2.8
> binary distro with the latest libxml2/libxslt in a somewhat timely fashion,
> and maybe even binaries for the latest 2.3 alpha on Py3.1.

Should lxml < 2.3 be built for Python >= 3.0? I've tried that in the past, but since you mention it likely doesn't work, I might just drop it.

> I tried myself, but so far, I could neither manage to install MSVC under
> Wine, nor does it seem to be trivial to build libxml2/libxslt with MinGW
> under Wine. Tricky business. I'll keep trying to get that working, but don't
> expect a breakthrough anytime soon.

I've shared the scripts I use to set up the environment, which BTW don't require MSVC. If you use them you can build both 32 and 64 bit versions with only the Windows SDK installed.

I'll do my best to provide binaries this week, just clarify which versions of lxml should be built for which versions of Python, as I asked above.

-- Sidnei

From stefan_ml at behnel.de Mon Sep 6 16:54:35 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 06 Sep 2010 16:54:35 +0200
Subject: [lxml-dev] install problems in Windows
In-Reply-To:
References: <981581.24112.qm@web34401.mail.mud.yahoo.com> <4C84A0B5.3090202@behnel.de>
Message-ID: <4C8500AB.90501@behnel.de>

Sidnei da Silva, 06.09.2010 16:21:
> On Mon, Sep 6, 2010 at 4:05 AM, Stefan Behnel wrote:
>> CC-ing Sidnei here, who commonly builds them.
>> Maybe he can provide a 2.2.8
>> binary distro with the latest libxml2/libxslt in a somewhat timely fashion,
>> and maybe even binaries for the latest 2.3 alpha on Py3.1.
>
> Should lxml < 2.3 be built for Python >= 3.0? I've tried that in the
> past, but since you mention it likely doesn't work, I might just drop
> it.

lxml 2.2.8 should generally work with Py3.[01], although it may have some quirks. So, if you can manage to build it for 3.1 and if it tests "mostly" fine, I think we should upload it. If not, well, then not.

I'm almost sure it won't work with Py3.2, though. I'm fine with keeping that for lxml 2.3+ entirely.

>> I tried myself, but so far, I could neither manage to install MSVC under
>> Wine, nor does it seem to be trivial to build libxml2/libxslt with MinGW
>> under Wine. Tricky business. I'll keep trying to get that working, but don't
>> expect a breakthrough anytime soon.
>
> I've shared the scripts I use to set up the environment, which BTW
> don't require MSVC. If you use them you can build both 32 and 64 bit
> versions with only the Windows SDK installed.

I know, but that still seems to require Windows. If I can get it working with Wine/MinGW, that would simplify things a lot. I'm already getting closer.

> I'll do my best to provide binaries this week

Cool, thanks!

Stefan

From cJ-lxml at zougloub.eu Mon Sep 6 19:53:15 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Mon, 6 Sep 2010 13:53:15 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <20100903164633.33a63c8e@Bidule.intranet.cs>
References: <20100903161533.20e66719@Bidule.intranet.cs> <4C8158BE.4010203@behnel.de> <20100903163106.7f1fd31f@Bidule.intranet.cs> <4C815BB3.1010603@behnel.de> <20100903164633.33a63c8e@Bidule.intranet.cs>
Message-ID: <20100906135315.784092c8@zougloub.eu>

On 09/03/2010 04:46 PM, Jérôme Carretero wrote:
> ...
Hi Stefan,

A self-contained test case is available at:

git clone git://git.zougloub.eu/docbook_testcase

A makefile calls xsltproc and a make.py file tentatively uses lxml to produce html documents from two simple docbook files.

HTH,

--
cJ

From l at lrowe.co.uk Tue Sep 7 16:59:32 2010
From: l at lrowe.co.uk (Laurence Rowe)
Date: Tue, 7 Sep 2010 15:59:32 +0100
Subject: [lxml-dev] xsl:output method and encoding ignored
Message-ID:

XSLT enables you to set output method (xml or html) and encoding using the xsl:output element. However, there seems to be no way to use this information from lxml, the default settings from the write method or the tostring function seem to take precedence. For example with the following XSL:

>>> from lxml import etree
>>> import sys
>>> identity = etree.XSLT(etree.parse('identity.xsl'))
>>> identity(etree.HTML('
')).write(sys.stdout)
Whereas with xsltproc: $ echo "
" | xsltproc --html identity.xsl -
Does anyone have any suggestions on how to deal with this? Laurence From stefan_ml at behnel.de Tue Sep 7 17:42:54 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 07 Sep 2010 17:42:54 +0200 Subject: [lxml-dev] xsl:output method and encoding ignored In-Reply-To: References: Message-ID: <4C865D7E.4000002@behnel.de> Laurence Rowe, 07.09.2010 16:59: > XSLT enables you to set output method (xml or html) and encoding using > the xsl:output element. However, there seems to be no way to use this > information from lxml, the default settings from the write method or > the tostring function seem to take precedence. For example with the > following XSL: > > > > media-type="text/html" encoding="utf-8"/> > > > > > >>>> from lxml import etree >>>> import sys >>>> identity = etree.XSLT(etree.parse('identity.xsl')) >>>> identity(etree.HTML('
')).write(sys.stdout) >
> > Whereas with xsltproc: > > $ echo "
" | xsltproc --html identity.xsl - >
>
> Does anyone have any suggestions on how to deal with this?

http://codespeak.net/lxml/xpathxslt#xslt-result-objects

Stefan

From cJ-lxml at zougloub.eu Tue Sep 7 15:31:37 2010
From: cJ-lxml at zougloub.eu (Jérôme Carretero)
Date: Tue, 7 Sep 2010 09:31:37 -0400
Subject: [lxml-dev] XSLT: Issues encountered when transforming docbook
In-Reply-To: <20100906135315.784092c8@zougloub.eu>
References: <20100903161533.20e66719@Bidule.intranet.cs> <4C8158BE.4010203@behnel.de> <20100903163106.7f1fd31f@Bidule.intranet.cs> <4C815BB3.1010603@behnel.de> <20100903164633.33a63c8e@Bidule.intranet.cs> <20100906135315.784092c8@zougloub.eu>
Message-ID: <20100907093137.491c037d@Bidule.intranet.cs>

On Mon, 6 Sep 2010 13:53:15 -0400
Jérôme Carretero wrote:

> On 09/03/2010 04:46 PM, Jérôme Carretero wrote:
> > ...

Out of curiosity I tried trunk r76925 and have the same behavior.

--
cJ

PS: this was just a pretext to send another mail, to maybe have an answer.

From p.oberndoerfer at urheberrecht.org Wed Sep 8 21:58:28 2010
From: p.oberndoerfer at urheberrecht.org ("Pascal Oberndörfer")
Date: Wed, 8 Sep 2010 21:58:28 +0200
Subject: [lxml-dev] lxml 2.3 beta 1 released
In-Reply-To: <4C84B4FC.9060909@behnel.de>
References: <4C84B4FC.9060909@behnel.de>
Message-ID: <0c52e60d6de96d2448c3ada960ebf9be.squirrel@mail.urheberrecht.org>

> Hi,
>
> I'm happy to announce the first beta release of lxml 2.3. This is a
> bug-fix
> release for the upcoming 2.3 series that mostly follows the fixes in lxml
> 2.2.8. Please consider testing this version if you currently use 2.2.x,
> which will continue to receive only important bug fixes.
> [...]
>
> Sidnei, Pascal, please provide binary builds for this release, preferably
> using libxml2 2.7.7 (or 2.7.3 alternatively) and libxslt 1.1.26.

Done, used libxml2 2.7.7 and libxslt 1.1.26.

> Based on the current state of affairs, I'm expecting to release lxml 2.3
> final in mid October, so there's still enough time to find bugs.
:)
>
> This release was built using Cython 0.13 final.
>
> Have fun,
>
> Stefan

From arfrever.fta at gmail.com Fri Sep 10 22:44:23 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Fri, 10 Sep 2010 22:44:23 +0200
Subject: [lxml-dev] SyntaxErrors with Python 3
In-Reply-To: <201007251715.35395.Arfrever.FTA@gmail.com>
References: <201007200303.15139.Arfrever.FTA@gmail.com> <4C455380.3020905@behnel.de> <201007251715.35395.Arfrever.FTA@gmail.com>
Message-ID: <201009102244.24301.Arfrever.FTA@gmail.com>

2010-07-25 17:14:53 Arfrever Frehtes Taifersar Arahesis napisał(a):
> 2010-07-20 09:42:56 Stefan Behnel napisał(a):
> > Arfrever Frehtes Taifersar Arahesis, 20.07.2010 03:02:
> > > LXML r76211 generally supports Python 3, but there are still some SyntaxErrors.
> > > [snip]
> >
> > Thanks. Only 2 or 3 of those are relevant to Py3, but I'll see if I can fix
> > them. A patch could easily speed this up, BTW.
>
> I'm attaching the partial patch.

Could this patch be committed?

--
Arfrever Frehtes Taifersar Arahesis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100910/b361e976/attachment.pgp

From sidnei at awkly.org Sun Sep 12 17:25:22 2010
From: sidnei at awkly.org (Sidnei da Silva)
Date: Sun, 12 Sep 2010 11:25:22 -0400
Subject: [lxml-dev] install problems in Windows
In-Reply-To: <4C8500AB.90501@behnel.de>
References: <981581.24112.qm@web34401.mail.mud.yahoo.com> <4C84A0B5.3090202@behnel.de> <4C8500AB.90501@behnel.de>
Message-ID:

On Mon, Sep 6, 2010 at 10:54 AM, Stefan Behnel wrote:
> lxml 2.2.8 should generally work with Py3.[01], although it may have some
> quirks. So, if you can manage to build it for 3.1 and if it tests "mostly"
> fine, I think we should upload it. If not, well, then not.

It does. Uploaded.
> I'm almost sure it won't work with Py3.2, though. I'm fine with keeping that
> for lxml 2.3+ entirely.

It didn't. The error I got was:

http://paste.ubuntu.com/492635/

-- Sidnei

From arfrever.fta at gmail.com Sun Sep 12 22:18:35 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Sun, 12 Sep 2010 22:18:35 +0200
Subject: [lxml-dev] lxml 2.3 beta 1 released
In-Reply-To: <4C84B4FC.9060909@behnel.de>
References: <4C84B4FC.9060909@behnel.de>
Message-ID: <201009122219.44284.Arfrever.FTA@gmail.com>

Tests of lxml 2.3 beta 1 fail with Python 3.

Output of test.py:

Traceback (most recent call last):
  File "test.py", line 602, in <module>
    exitcode = main(sys.argv)
  File "test.py", line 565, in main
    test_cases = get_test_cases(test_files, cfg, tracer=tracer)
  File "test.py", line 267, in get_test_cases
    module = import_module(file, cfg, tracer=tracer)
  File "test.py", line 210, in import_module
    mod = __import__(modname)
  File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/src/lxml/html/tests/test_elementsoup.py", line 11, in <module>
    class SoupParserTestCase(HelperTestCase):
  File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/src/lxml/html/tests/test_elementsoup.py", line 12, in SoupParserTestCase
    from lxml.html import soupparser
  File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/src/lxml/html/soupparser.py", line 108, in <module>
    from htmlentitydefs import name2codepoint
ImportError: No module named htmlentitydefs

Output of selftest.py:

 * Running selftest.py
**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 568, in selftest.encoding
Failed example:
    serialize(elem, encoding="utf-8")
Expected:
    '\xc3\xa5\xc3\xb6\xc3\xb6<>'
Got:
    '???<>'
**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 579, in selftest.encoding
Failed
example:
    serialize(elem, encoding="utf-8")
Expected:
    ''
Got:
    ''
**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 311, in selftest.parsefile
Failed example:
    tree.write(sys.stdout)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/python3.1/doctest.py", line 1246, in __run
        compileflags, 1), test.globs)
      File "", line 1, in
        tree.write(sys.stdout)
      File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526)
      File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245)
      File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390)
      File "serializer.pxi", line 378, in lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117)
    TypeError: string argument expected, got 'bytes'
**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 319, in selftest.parsefile
Failed example:
    tree.write(sys.stdout)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/python3.1/doctest.py", line 1246, in __run
        compileflags, 1), test.globs)
      File "", line 1, in
        tree.write(sys.stdout)
      File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526)
      File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245)
      File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390)
      File "serializer.pxi", line 378, in lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117)
    TypeError: string argument expected, got 'bytes'
**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 350, in selftest.parseliteral
Failed example:
ElementTree.ElementTree(element).write(sys.stdout) Exception raised: Traceback (most recent call last): File "/usr/lib64/python3.1/doctest.py", line 1246, in __run compileflags, 1), test.globs) File "", line 1, in ElementTree.ElementTree(element).write(sys.stdout) File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526) File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245) File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390) File "serializer.pxi", line 378, in lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117) TypeError: string argument expected, got 'bytes' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 353, in selftest.parseliteral Failed example: ElementTree.ElementTree(element).write(sys.stdout) Exception raised: Traceback (most recent call last): File "/usr/lib64/python3.1/doctest.py", line 1246, in __run compileflags, 1), test.globs) File "", line 1, in ElementTree.ElementTree(element).write(sys.stdout) File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526) File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245) File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390) File "serializer.pxi", line 378, in lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117) TypeError: string argument expected, got 'bytes' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 361, in selftest.parseliteral Failed example: print(ElementTree.tostring(element)) Expected: text Got: b'text' ********************************************************************** File 
"/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 521, in selftest.writestring Failed example: ElementTree.tostring(elem) Expected: 'text' Got: b'text' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest.py", line 524, in selftest.writestring Failed example: ElementTree.tostring(elem) Expected: 'text' Got: b'text' ********************************************************************** 4 items had failures: 2 of 29 in selftest.encoding 2 of 6 in selftest.parsefile 3 of 10 in selftest.parseliteral 2 of 4 in selftest.writestring ***Test Failed*** 9 failures. 175 tests ok. Output of selftest2.py: ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest2.py", line 187, in selftest2.encoding Failed example: serialize(elem, "utf-8") Expected: '\xc3\xa5\xc3\xb6\xc3\xb6<>' Got: '???<>' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest2.py", line 198, in selftest2.encoding Failed example: serialize(elem, "utf-8") Expected: '' Got: '' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest2.py", line 121, in selftest2.parsefile Failed example: tree.write(sys.stdout) Exception raised: Traceback (most recent call last): File "/usr/lib64/python3.1/doctest.py", line 1246, in __run compileflags, 1), test.globs) File "", line 1, in tree.write(sys.stdout) File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526) File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245) File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390) File "serializer.pxi", line 378, in 
lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117) TypeError: string argument expected, got 'bytes' ********************************************************************** File "/var/tmp/portage/dev-python/lxml-2.3_beta1/work/lxml-2.3beta1/selftest2.py", line 128, in selftest2.parsefile Failed example: tree.write(sys.stdout) Exception raised: Traceback (most recent call last): File "/usr/lib64/python3.1/doctest.py", line 1246, in __run compileflags, 1), test.globs) File "", line 1, in tree.write(sys.stdout) File "lxml.etree.pyx", line 1850, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:44526) File "serializer.pxi", line 478, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:90245) File "lxml.etree.pyx", line 281, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8390) File "serializer.pxi", line 378, in lxml.etree._FilelikeWriter.write (src/lxml/lxml.etree.c:89117) TypeError: string argument expected, got 'bytes' ********************************************************************** 2 items had failures: 2 of 29 in selftest2.encoding 2 of 4 in selftest2.parsefile ***Test Failed*** 4 failures. 102 tests ok. -- Arfrever Frehtes Taifersar Arahesis -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part. Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100912/e78296ef/attachment-0001.pgp From p.oberndoerfer at urheberrecht.org Mon Sep 13 11:22:17 2010 From: p.oberndoerfer at urheberrecht.org (Pascal) Date: Mon, 13 Sep 2010 09:22:17 +0000 (UTC) Subject: [lxml-dev] lxml 2.3 beta 1 released References: <4C84B4FC.9060909@behnel.de> <201009122219.44284.Arfrever.FTA@gmail.com> Message-ID: Arfrever Frehtes Taifersar Arahesis gmail.com> writes: > > Tests of lxml 2.3 beta 1 fail with Python 3. > I can confirm similar results for selftest.py and selftest2.py on Mac. 
Never got test.py to work on Mac though.

Cheers,
Pascal

From l at lrowe.co.uk Mon Sep 13 16:59:40 2010
From: l at lrowe.co.uk (Laurence Rowe)
Date: Mon, 13 Sep 2010 15:59:40 +0100
Subject: [lxml-dev] Finding the media-type and method of an XSLT
Message-ID:

Thanks for your earlier help. Now that I'm using str(result) to output the result tree, I'd like to set an appropriate Content-Type header.

The <xsl:output> element allows one to specify the output method (xml or html) and media-type of the result tree. From libxslt one can interrogate these values on the transform with:

    xsltStylesheetPtr transform;
    transform->mediaType;
    transform->encoding;
    transform->method;

In lxml I can find the encoding from the result tree with result.docinfo.encoding, but how do I find the method and mediaType?

Laurence

From flyaflyaa at gmail.com Tue Sep 14 14:39:45 2010
From: flyaflyaa at gmail.com (flya flya)
Date: Tue, 14 Sep 2010 20:39:45 +0800
Subject: [lxml-dev] lxml can't work with html5lib
Message-ID:

I installed lxml 2.3beta1 and html5lib 0.90 on Python 2.6.6. When I try the code from http://codespeak.net/lxml/html5parser.html

>>> from lxml.html import tostring, html5parser
>>> tostring(html5parser.fromstring("
foo")) get the follow error message: Traceback (most recent call last): File "", line 1, in tostring(html5parser.fromstring("
foo")) File "C:\Python26\lib\site-packages\lxml\html\html5parser.py", line 137, in fromstring guess_charset=guess_charset) File "C:\Python26\lib\site-packages\lxml\html\html5parser.py", line 54, in document_fromstring return parser.parse(html, useChardet=guess_charset).getroot() File "build\bdist.win32\egg\html5lib\html5parser.py", line 211, in parse parseMeta=parseMeta, useChardet=useChardet) File "build\bdist.win32\egg\html5lib\html5parser.py", line 111, in _parse self.mainLoop() File "build\bdist.win32\egg\html5lib\html5parser.py", line 179, in mainLoop self.phase.processStartTag(token) File "build\bdist.win32\egg\html5lib\html5parser.py", line 578, in processStartTag self.parser.phase.processStartTag(token) File "build\bdist.win32\egg\html5lib\html5parser.py", line 616, in processStartTag self.insertHtmlElement() File "build\bdist.win32\egg\html5lib\html5parser.py", line 595, in insertHtmlElement self.tree.insertRoot(impliedTagToken("html", "StartTag")) File "C:\Python26\lib\site-packages\lxml\html\_html5builder.py", line 91, in insertRoot root_element = self.elementClass(name) File "build\bdist.win32\egg\html5lib\treebuilders\etree.py", line 31, in __init__ namespace)) File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 1563, in Element v = html_parser.makeelement(*args, **kw) File "parser.pxi", line 861, in lxml.etree._BaseParser.makeelement (src/lxml/lxml.etree.c:74517) File "apihelpers.pxi", line 120, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12246) File "apihelpers.pxi", line 1464, in lxml.etree._getNsTag (src/lxml/lxml.etree.c:23600) File "apihelpers.pxi", line 1358, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22519) TypeError: Argument must be string or unicode. 
From dkuhlman at rexx.com Wed Sep 15 00:38:48 2010 From: dkuhlman at rexx.com (Dave Kuhlman) Date: Tue, 14 Sep 2010 15:38:48 -0700 Subject: [lxml-dev] Access to ElementTree for XML schema In-Reply-To: <4C813DAA.3050902@behnel.de> References: <20100831230856.GA75749@cutter.rexx.com> <4C813DAA.3050902@behnel.de> Message-ID: <20100914223848.GA9565@cutter.rexx.com> On Fri, Sep 03, 2010 at 08:25:46PM +0200, Stefan Behnel wrote: > > Dave Kuhlman, 01.09.2010 01:08: > >I'm looking for a way to get access to an etree._ElementTree that > >represents an XML schema document in which the xsd:include and > >xsd:import elements have been recursively expanded. > > > >When I create an instance of etree.XMLSchema, libxml2 expands the > >underlying C tree for the schema. Am I right about that? > > The best ways to find out are to a) read the libxml2 source code or b) add > a little debug code that dumps the schema document to a file *after* > parsing. > > Just go ahead, the XML Schema code in lxml is pretty short. > I've looked. I'll look at it a bit more. Seems like I'll need to learn more about Cython. > > >If so, is > >there a way for me to get an etree._ElementTree that wraps that > >underlying C tree? Or, perhaps to have a way to create an > >etree._ElementTree from the XMLSchema object? > > *If* the tree is available as a normal XML tree, it is trivial to copy it > and wrap it in an ElementTree, sure. > > > >Or, is there already some other way to get an XML schema document tree > >in which the include and import elements have been (recursively) > >expanded? > > No. However, is it really that hard to implement the algorithm for that in > Python space? Admittedly, XML Schema is a severely complex format, but the > import rules are definitely not the most complex part of the spec. > Well, I *believe* that I've done implemented it, now. (That work was part of the delay in this response.) But, I worry that there is some detail that I've gotten wrong. 
Anyway, the new implementation of this is in the file process_includes.py, which is part of the generateDS.py distribution. If anyone needs it, you can find it here:

    http://www.rexx.com/~dkuhlman/generateDS.html

Thanks for the help with this.

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

From eugene.vandenbulke at gmail.com Wed Sep 15 08:08:44 2010
From: eugene.vandenbulke at gmail.com (Eugene Van den Bulke)
Date: Wed, 15 Sep 2010 16:08:44 +1000
Subject: [lxml-dev] lxml.html.submit_form and unicode values
Message-ID:

Hi,

Slowly but surely learning more about lxml.html by using it to do some scraping.

I encountered a unicode problem trying to submit the following form.
Which can be found under the Questions link of http://www.assemblee-nationale.fr/13/tribun/fiches_id/267457.asp#P3

=====
UnicodeEncodeError                        Traceback (most recent call last)

/Users/eugene/Documents/Dev/parlorama/code/ in ()

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc in submit_form(form, extra_values, open_http)
    819     if open_http is None:
    820         open_http = open_http_urllib
--> 821     return open_http(form.method, form.action, values)
    822
    823 def open_http_urllib(method, url, values):

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc in open_http_urllib(method, url, values)
    836         data = None
    837     else:
--> 838         data = urlencode(values)
    839     return urlopen(url, data)
    840

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.pyc in urlencode(query, doseq)
   1267         for k, v in query:
   1268             k = quote_plus(str(k))
-> 1269             v = quote_plus(str(v))
   1270             l.append(k + '=' + v)
   1271     else:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 6: ordinal not in range(128)
=====

I tried to address the problem by encoding the values in the form fields as suggested here: http://mail.python.org/pipermail/tutor/2007-May/054340.html

but in a Python shell doing

>>> form.fields['id_auteur']
u'Aboud \xc9lie'
>>> form.fields['id_auteur'] = form.fields['id_auteur'].encode('utf-8')
[...]
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Would welcome advice or guidance ... if I want to make urlopen "happy" I am "displeasing" ElementTree :(

Thanks for your help,

--
EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?
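
The traceback above shows urlencode() calling str() on a unicode value, which falls back to the ASCII codec. One way around it, sketched here under assumptions, is to leave the form fields as unicode (which ElementTree requires) and encode only the submitted (name, value) pairs to the page's charset just before URL-encoding. The helper name encode_form_values is hypothetical, the charset is assumed to be ISO-8859-1 for this page, and the sketch uses Python 3's urllib.parse rather than the Python 2.6 urllib from the traceback.

```python
from urllib.parse import urlencode  # Python 3; the thread used Python 2's urllib.urlencode


def encode_form_values(values, charset):
    """Return (name, value) pairs with text values encoded to bytes in `charset`.

    Hypothetical helper, not part of lxml: the form fields themselves stay
    unicode, only the copy handed to urlencode() is encoded.
    """
    return [
        (name, value.encode(charset) if isinstance(value, str) else value)
        for name, value in values
    ]


# The field value from the shell session above, as a list of pairs.
pairs = [("id_auteur", "Aboud \xc9lie")]
print(urlencode(encode_form_values(pairs, "ISO-8859-1")))
```

With lxml, the same encoding step would presumably live in a custom opener passed to submit_form via its open_http argument, so the tree is never given byte strings.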
From vojta.rylko at seznam.cz Wed Sep 15 08:52:10 2010
From: vojta.rylko at seznam.cz (Vojtěch Rylko)
Date: Wed, 15 Sep 2010 08:52:10 +0200
Subject: [lxml-dev] lxml.html.submit_form and unicode values
In-Reply-To:
References:
Message-ID: <4C906D1A.7030406@seznam.cz>

Hi,

there should be a tutorial on UnicodeEncodeError at codespeak.net/lxml, because it is the most problematic and confusing problem in lxml (even compared with namespaces)...

Also, the values do not need to be encoded as UTF-8, as in the example below, but as ISO-8859-1 (see assemblee-nationale.fr/...).

Vojta

On 15.9.2010 8:08, Eugene Van den Bulke wrote:
> Hi,
>
> Slowly but surely learning more about lxml.html by using it to do some
> scraping.
>
> I encountered a unicode problem trying to submit the following form.
>
>
action="http://recherche2.assemblee-nationale.fr/resultats_tribun.jsp" > id="Lien1"> > > > > >
> > Which can be found under the Questions link of > http://www.assemblee-nationale.fr/13/tribun/fiches_id/267457.asp#P3 > > ===== > UnicodeEncodeError Traceback (most recent call last) > > /Users/eugene/Documents/Dev/parlorama/code/ in() > > /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc > in submit_form(form, extra_values, open_http) > 819 if open_http is None: > 820 open_http = open_http_urllib > --> 821 return open_http(form.method, form.action, values) > 822 > 823 def open_http_urllib(method, url, values): > > /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml/html/__init__.pyc > in open_http_urllib(method, url, values) > 836 data = None > 837 else: > --> 838 data = urlencode(values) > 839 return urlopen(url, data) > 840 > > /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.pyc > in urlencode(query, doseq) > 1267 for k, v in query: > 1268 k = quote_plus(str(k)) > -> 1269 v = quote_plus(str(v)) > 1270 l.append(k + '=' + v) > 1271 else: > > UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in > position 6: ordinal not in range(128) > ===== > > I tried to address the problem by encoding the values in the form > fields as suggested here : > http://mail.python.org/pipermail/tutor/2007-May/054340.html > > but in a python shell doing > >>>> form.fields['id_auteur'] > u'Aboud \xc9lie' >>>> form.fields['id_auteur'] = form.fields['id_auteur'].encode('utf-8') > [...] > ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes > > Would welcome advice or guidance ... 
if I want to make urlopen "happy"
> I am "displeasing" ElementTree :(
>
> Thanks for your help,
>

From eugene.vandenbulke at gmail.com Wed Sep 15 13:51:03 2010
From: eugene.vandenbulke at gmail.com (Eugene Van den Bulke)
Date: Wed, 15 Sep 2010 21:51:03 +1000
Subject: [lxml-dev] lxml.html.submit_form and unicode values
In-Reply-To: <4C906D1A.7030406@seznam.cz>
References: <4C906D1A.7030406@seznam.cz>
Message-ID:

Thanks for pointing out my charset mistake (not at the heart of my problem though).

If you have any expertise with lxml.html + form + unicode, do you think writing an alternate opener and passing it to submit_form using the open_http keyword would be the best way to go about it?

--
EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?

From eugene.vandenbulke at gmail.com Wed Sep 15 15:00:17 2010
From: eugene.vandenbulke at gmail.com (Eugene Van den Bulke)
Date: Wed, 15 Sep 2010 23:00:17 +1000
Subject: [lxml-dev] lxml.html.submit_form and unicode values
In-Reply-To:
References: <4C906D1A.7030406@seznam.cz>
Message-ID:

> If you have any expertise with lxml.html + form + unicode, do you
> think writing an alternate opener and passing it to submit_form
> using the open_http keyword would be the best way to go about it?

It seems to do the job:

    def open_http(method, url, values):
        from lxml.html import open_http_urllib
        # Encode each value to the page's charset so urlencode() gets byte strings.
        values = [(k, v.encode('ISO-8859-1')) for k, v in values]
        return open_http_urllib(method, url, values)

May not be the most elegant solution but could be useful to someone else.

--
EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?

From nikita.vetoshkin at gmail.com Fri Sep 17 08:48:15 2010
From: nikita.vetoshkin at gmail.com (Vetoshkin Nikita)
Date: Fri, 17 Sep 2010 12:48:15 +0600
Subject: [lxml-dev] Problem with utf-8 and etree
Message-ID:

Not sure if it's a bug, so here I am.
Test case: from lxml import etree e = etree.Element("Test") l = open("1.txt").read() e.text = l # <-- ValueError: All strings must be XML compatible... etree.fromstring("""\r\n""" + l + "") # <-- works 1.txt file is attached. My setup: lxml.etree: (2, 3, -99, 0) libxml used: (2, 7, 7) libxml compiled: (2, 7, 7) libxslt used: (1, 1, 26) libxslt compiled: (1, 1, 26) -------------- next part -------------- ??????? ???????? ???????,10244,FAILURE,1000315,222,FAILURE,1,FAILURE,got_session_id,,None,nauss_1284700193_968_100,2010-09-17 11:09:54.341000 ??????? ???????? ???????,10244,FAILURE,1000315,222,FAILURE,2,FAILURE,got_session_id,,None,nauss_1284700226_566_104,2010-09-17 11:10:26.825000 ??????? ???????? ???????,10244,FAILURE,1000315,222,FAILURE,3,FAILURE,got_session_id,,None,nauss_1284700259_68_108,2010-09-17 11:10:59.419000 ??????? ???????? ???????,10244,FAILURE,1000315,222,FAILURE,4,FAILURE,got_session_id,,None,nauss_1284700291_677_112,2010-09-17 11:11:32.013000 ??????? ???????? ???????,10244,FAILURE,1000315,222,FAILURE,5,FAILURE,got_session_id,,None,nauss_1284700324_168_116,2010-09-17 11:12:04.497000 From dpritsos at extremepro.gr Mon Sep 20 13:46:46 2010 From: dpritsos at extremepro.gr (Dimitrios Pritsos) Date: Mon, 20 Sep 2010 14:46:46 +0300 Subject: [lxml-dev] Concurrent Programming and lxml but recipe for Newbies like me - any solution? In-Reply-To: References: Message-ID: <4C9749A6.2000701@extremepro.gr> Hello to all, I am working for a while with lxml in a Concurrent Programming Environment. I have tested to pass an HtmlElement tree and failed in both case of MultiProcessing and MultiThreading setup. I use "threading" and "multiprocessing" native modules. I found in FAQ that at least in MultiThreading case it should work properly when passing a tree or element between Threads. However it is not working. 
More specifically it seems that is passing the HtmlElement Part of python but not the part of C because (or something like that anyway): writing this : isinstance(xhtml_tree, lxml.html.HtmlElement) => returns True in both Threads 1.the one that Got the Tree from a Queue.Queue() and 2. the one that puts the Puts the Tree to the Queue() but few lines later writing: xhtml.xpath("//text()") => does nothing, while writing : xpath(xhtml) where xpath = lxml.etree.XPath("//text()") => raises an error that the input is not an HtmlElement Even if Python says that it is ( isinstance() ). So, I guess there is a problem with python and C parts of the etree objects. Is it? however returning the tree from an Evenlet's greenThread it is working fine ( I guess that is because greenThread is nothing more than a Coroutine) Do you have any Idea why lxml is not working in the concurrent setup (even in Threading) while it should? However I would prefer to have it working on MultiProcessing env Best Regards, Dimitrios From dpritsos at extremepro.gr Mon Sep 20 16:18:59 2010 From: dpritsos at extremepro.gr (Dimitrios Pritsos) Date: Mon, 20 Sep 2010 17:18:59 +0300 Subject: [lxml-dev] lxml-dev Digest, Vol 72, Issue 4 In-Reply-To: References: Message-ID: <4C976D53.70202@extremepro.gr> On 20/09/10 14:46, lxml-dev-request at codespeak.net wrote: > Send lxml-dev mailing list submissions to > lxml-dev at codespeak.net > > To subscribe or unsubscribe via the World Wide Web, visit > http://codespeak.net/mailman/listinfo/lxml-dev > or, via email, send a message with subject or body 'help' to > lxml-dev-request at codespeak.net > > You can reach the person managing the list at > lxml-dev-owner at codespeak.net > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of lxml-dev digest..." > > > Today's Topics: > > 1. Re: lxml 2.3 beta 1 released (Pascal) > 2. Finding the media-type and method of an XSLT (Laurence Rowe) > 3. 
lxml can't work with html5lib (flya flya) > 4. Re: Access to ElementTree for XML schema (Dave Kuhlman) > 5. lxml.html.submit_form and unicode values (Eugene Van den Bulke) > 6. Re: lxml.html.submit_form and unicode values (Vojt?ch Rylko) > 7. Re: lxml.html.submit_form and unicode values > (Eugene Van den Bulke) > 8. Re: lxml.html.submit_form and unicode values > (Eugene Van den Bulke) > 9. Problem with utf-8 and etree (Vetoshkin Nikita) > > 10. Concurrent Programming and lxml but recipe for Newbies like > me - any solution? (Dimitrios Pritsos) > Ok I have a correction to make in 10. I posted earlier: In fact the Threading Setup is working Fine STILL the MultiProcessing Set up is NOT. One problem that occured is that Queue.Queue() with in a Manager() object has some kind of problem to trasfere Trees, While Queue.Queue() works fine with Threads. Any Idea how the MultiProcessing setup could work ? From saul at ag-projects.com Tue Sep 21 17:57:14 2010 From: saul at ag-projects.com (=?ISO-8859-1?Q?Sa=FAl_Ibarra_Corretg=E9?=) Date: Tue, 21 Sep 2010 17:57:14 +0200 Subject: [lxml-dev] Hashing objectify.StringElement Message-ID: <4C98D5DA.2080103@ag-projects.com> Hi all, It's my first question on the list and I tried to find the answer but couldn't so if my question has indeed been answered before just point me there please :-) I'm updating some API which used to use etree.Element objects to objectify because I need to pickle this items and from what I've read so far, this seems like the way to go. At some point in the code elements are used as dictionary keys and this used to work before, but now I get a TypeError telling me that 'lxml.objectify.StringElement' is unhashable. However, 'lxml.objectify.ObjectifiedElement' class is hashable. Also, I found that the __hash__ function seems to be implemented, but for some reason calling hash() or trying to use StringElement as a dictionary key will give this error. Is this a bug? A feature? Can I get around this somehow? 
Thanks in advance and kind regards,

PS: I'm using version 2.2.6-1 on Debian squeeze.

--
Saúl Ibarra Corretgé
AG Projects

From saul at ag-projects.com Tue Sep 21 18:28:20 2010
From: saul at ag-projects.com (Saúl Ibarra Corretgé)
Date: Tue, 21 Sep 2010 18:28:20 +0200
Subject: [lxml-dev] Hashing objectify.StringElement
In-Reply-To: <4C98D5DA.2080103@ag-projects.com>
References: <4C98D5DA.2080103@ag-projects.com>
Message-ID: <4C98DD24.3070501@ag-projects.com>

On 09/21/2010 05:57 PM, Saúl Ibarra Corretgé wrote:
> Hi all,
>
> It's my first question on the list and I tried to find the answer but
> couldn't so if my question has indeed been answered before just point me
> there please :-)
>
> I'm updating some API which used to use etree.Element objects to
> objectify because I need to pickle this items and from what I've read so
> far, this seems like the way to go.
>
> At some point in the code elements are used as dictionary keys and this
> used to work before, but now I get a TypeError telling me that
> 'lxml.objectify.StringElement' is unhashable. However,
> 'lxml.objectify.ObjectifiedElement' class is hashable.
>
> Also, I found that the __hash__ function seems to be implemented, but
> for some reason calling hash() or trying to use StringElement as a
> dictionary key will give this error.
>
> Is this a bug? A feature? Can I get around this somehow?
>
> Thanks in advance and kind regards,
>
> PS: I'm using version 2.2.6-1 on Debian squeeze.
>

Fun, I just found the answer right after sending the email :-S

So it's already fixed in trunk :-)

http://comments.gmane.org/gmane.comp.python.lxml.devel/5339

--
Saúl Ibarra Corretgé
AG Projects From rzurad at gmail.com Tue Sep 21 21:31:35 2010 From: rzurad at gmail.com (Richard Zurad) Date: Tue, 21 Sep 2010 12:31:35 -0700 Subject: [lxml-dev] Attribute Whitelist in lxml.html.clean Message-ID: Greetings, I'm currently working on a project that requires an attribute whitelist for the clean_html method in lxml.html.clean (more specifically, the __call__ method of the Cleaner object). I did some digging and found a post to this mailing list from about two years ago that asked the same thing: http://article.gmane.org/gmane.comp.python.lxml.devel/3875/ While I agree that it would be nice to link an attribute whitelist to a list of attributes that are only valid for a given tag, a solution like that seems a bit overkill for what my project requires as we are only cleaning subtrees of the HTML DOM where we know exactly what we're expecting and can narrow things down to a very small whitelist of tags and attributes. Also, to only restrict attributes to standards-compliant attributes per tag puts a bit of a damper on some web development trends where developers define custom attributes for the sake of their own requirements (adding an in-line editor, etc) I've implemented a simplistic fix that adds the ability to whitelist attributes. The addition is a simple five lines of code to the __call__ method of the Cleaner object in lxml.html.clean. It simply checks if the keyword argument whitelist_args (a list of valid args for any tag) was passed in to __init__. if it was, then on line 257 of clean.py, instead of using defs.safe_attrs, we use the supplied whitelist (although if both safe_attrs_only and whitelist_args is passed to the constructor, throw a ValueError, as it doesn't make sense to supply both a whitelist and to specify to use the feedparser attribute whitelist) I'd like to submit a patch to lxml with this code, however there is a design decision that must be made. 
In the __call__ method where we do javascript sanitization if the contructor was called with javascript=True (or the default behavior), there is code that simply skips javascript sanitization on attributes if the safe_attrs_only is True since the feedparser whitelist does not include event attributes. This leads to a question of what to do if the object is instantiated with, for example, an attribute whitelist that includes 'onchange'? Would it be safe to assume that if an event attribute is passed in as part of the whitelist, we want to allow javascript on that attribute and that attribute only? Or should we raise an error because it doesn't make sense to pass in an event attribute in the whitelist along with javascript=True? Or should we just silently delete the attribute, ignoring the whitelist? For the project I'm working on, this situation won't arise. However, since I'd like to submit a patch to lxml for potential inclusion of this feature in a future release, I'd like to make sure that the concept behind it is in line with the rest of the lxml mantra. Thanks, -Rich -- http://www.greyboxware.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100921/598c3254/attachment.htm From agroszer at gmail.com Thu Sep 23 12:42:15 2010 From: agroszer at gmail.com (Adam GROSZER) Date: Thu, 23 Sep 2010 12:42:15 +0200 Subject: [lxml-dev] lxml-2.2.8-py2.5-win32.egg is missing from pypi Message-ID: <1591557964.20100923124215@gmail.com> Hello, lxml-2.2.8-py2.5-win32.egg is missing from pypi. lxml 2.2.8 is a dependency for the upcoming ZTK 1.0 release. Would someone please do the egg? -- Best regards, Adam GROSZER mailto:agroszer at gmail.com -- Quote of the day: The truest test of independent judgment is being able to dislike someone who admires us, and to admire someone who dislikes us. - Sydney J. 
Harris

From agroszer at gmail.com Thu Sep 23 12:50:58 2010
From: agroszer at gmail.com (Adam GROSZER)
Date: Thu, 23 Sep 2010 12:50:58 +0200
Subject: [lxml-dev] automatic windows eggs for lxml
Message-ID: <1749906635.20100923125058@gmail.com>

Hello,

With the help of the Zope Foundation I built a bot that does Windows binary eggs of some Zope packages. Using that infrastructure it should be rather easy to do all sorts of Windows binary eggs for lxml too.

Are you interested in that?

--
Best regards,
Adam GROSZER mailto:agroszer at gmail.com
--
Quote of the day: If you realize that you aren't as wise today as you thought you were yesterday, you're wiser today. - Michigan Presbyterian Church

From sidnei.da.silva at canonical.com Thu Sep 23 15:40:45 2010
From: sidnei.da.silva at canonical.com (Sidnei da Silva)
Date: Thu, 23 Sep 2010 10:40:45 -0300
Subject: [lxml-dev] automatic windows eggs for lxml
In-Reply-To: <1749906635.20100923125058@gmail.com>
References: <1749906635.20100923125058@gmail.com>
Message-ID:

On Thu, Sep 23, 2010 at 7:50 AM, Adam GROSZER wrote:
> Hello,
>
> With the help of the Zope Foundation I built a bot that does Windows
> binary eggs of some Zope packages.
> Using that infrastructure it should be rather easy to do all sorts of
> Windows binary eggs for lxml too.
>
> Are you interested in that?

+1?

-- Sidnei

From agroszer at gmail.com Thu Sep 23 15:45:30 2010
From: agroszer at gmail.com (Adam GROSZER)
Date: Thu, 23 Sep 2010 15:45:30 +0200
Subject: [lxml-dev] automatic windows eggs for lxml
In-Reply-To:
References: <1749906635.20100923125058@gmail.com>
Message-ID: <1137040418.20100923154530@gmail.com>

Hello Sidnei,

Is the page http://codespeak.net/lxml/build.html / "Static linking on Windows" up to date? Are there any other instructions on how to build the eggs?
Thursday, September 23, 2010, 3:40:45 PM, you wrote:

SdS> On Thu, Sep 23, 2010 at 7:50 AM, Adam GROSZER wrote:
>> Hello,
>>
>> With the help of the Zope Foundation I built a bot that does Windows
>> binary eggs of some Zope packages.
>> Using that infrastructure it should be rather easy to do all sorts of
>> Windows binary eggs for lxml too.
>>
>> Are you interested in that?

SdS> +1?

SdS> -- Sidnei

--
Best regards,
Adam GROSZER mailto:agroszer at gmail.com
--
Quote of the day: I don't have any solution, but I certainly admire the problem.

From sidnei.da.silva at canonical.com Thu Sep 23 16:20:42 2010
From: sidnei.da.silva at canonical.com (Sidnei da Silva)
Date: Thu, 23 Sep 2010 11:20:42 -0300
Subject: [lxml-dev] automatic windows eggs for lxml
In-Reply-To: <1137040418.20100923154530@gmail.com>
References: <1749906635.20100923125058@gmail.com> <1137040418.20100923154530@gmail.com>
Message-ID:

On Thu, Sep 23, 2010 at 10:45 AM, Adam GROSZER wrote:
> Hello Sidnei,
>
> Is the page http://codespeak.net/lxml/build.html / "Static linking on
> Windows" up to date?

It should work with those instructions, yes.

> Are there any other instructions on how to build the eggs?

I have a set of batch scripts to set up the environment, so that I can build with the Platform SDK only, without needing Visual Studio installed. There are also some batch scripts to build libxml/libxslt/zlib.

http://people.canonical.com/~sidnei/lxml/

-- Sidnei

From agroszer at gmail.com Thu Sep 23 16:27:22 2010
From: agroszer at gmail.com (Adam GROSZER)
Date: Thu, 23 Sep 2010 16:27:22 +0200
Subject: [lxml-dev] automatic windows eggs for lxml
In-Reply-To:
References: <1749906635.20100923125058@gmail.com> <1137040418.20100923154530@gmail.com>
Message-ID: <1079563684.20100923162722@gmail.com>

Hello Sidnei,

I'll try that on winbot once I find some time.
Thursday, September 23, 2010, 4:20:42 PM, you wrote: SdS> On Thu, Sep 23, 2010 at 10:45 AM, Adam GROSZER wrote: >> Hello Sidnei, >> >> Is the page http://codespeak.net/lxml/build.html / "Static linking on >> Windows" up to date? SdS> It should work with those instructions, yes. >> Are there any other instructions how to build the eggs? SdS> I have a set of batch scripts to setup the environment, so that I can SdS> build with the Platform SDK only, without needing Visual Studio SdS> installed. There's also some batch scripts to build SdS> libxml/libxslt/zlib. SdS> http://people.canonical.com/~sidnei/lxml/ SdS> -- Sidnei -- Best regards, Adam GROSZER mailto:agroszer at gmail.com -- Quote of the day: Make input easy to proofread From maxkisselew at googlemail.com Fri Sep 24 01:51:28 2010 From: maxkisselew at googlemail.com (Max Kisselew) Date: Fri, 24 Sep 2010 01:51:28 +0200 Subject: [lxml-dev] Performance gets bad when parsing xml with namespaces In-Reply-To: References: Message-ID: Hello, recently I discovered a problem with lxml/LibXML2. I guess it's likely that the problem comes from Libxml2, but I hope you can help me despite that. I'm working on a university project where I use Python and lxml for XML parsing and processing. At first there were no namespace definitions in the XML files we used, but recently the format has changed slightly and some namespace definitions were added. Here is the XML format as it was in the beginning: IMS, Uni Stuttgart European Medicines Agency EMEA/H/C/471 [...] Wegen European Medicines [...] Wegen [...] And here is the XML with the recently added namespace definitions: IMS, Uni Stuttgart European Medicines Agency EMEA/H/C/471 [...] Wegen European Medicines [...] Wegen [...] I wanted to extract all the content from the elements. In the XML file without the namespace definitions that takes just a moment (less than 30 seconds). But when I tried to perform the same on the new file with namespaces, it took much longer, more than 30 minutes (!).
The XML file was about 7 MB. Since the same problem occurs when one tries to parse the XML file with the LibXML2 binding for Perl, I guess the problem comes from LibXML2 itself. It is also strange that the performance problem seems to grow with the number of tags to be parsed. The first 10 000 tags only need about a second, but when we parse the first 20 000 tags, it takes 21 seconds! Do you have any idea about the cause of this problem and how it could be solved? Thanks, Max From sridharr at activestate.com Tue Sep 28 01:20:57 2010 From: sridharr at activestate.com (Sridhar Ratnakumar) Date: Mon, 27 Sep 2010 16:20:57 -0700 Subject: [lxml-dev] automatic windows eggs for lxml In-Reply-To: References: <1749906635.20100923125058@gmail.com> <1137040418.20100923154530@gmail.com> Message-ID: On 2010-09-23, at 7:20 AM, Sidnei da Silva wrote: > I have a set of batch scripts to setup the environment, so that I can > build with the Platform SDK only, without needing Visual Studio > installed. There's also some batch scripts to build > libxml/libxslt/zlib. > > http://people.canonical.com/~sidnei/lxml/ Hi Sidnei, lxml's setupinfo.py invokes a program called "xslt-config" and fails, around line "lib_versions = get_library_versions()". Your libxslt64.bat does not seem to provide this program. Does your private build produce an xslt-config executable, or did you somehow manage to work around setupinfo.py's use of xslt-config for version detection? Assume static linking for this question. Thanks.
-srid From sidnei.da.silva at canonical.com Tue Sep 28 02:44:08 2010 From: sidnei.da.silva at canonical.com (Sidnei da Silva) Date: Mon, 27 Sep 2010 21:44:08 -0300 Subject: [lxml-dev] automatic windows eggs for lxml In-Reply-To: References: <1749906635.20100923125058@gmail.com> <1137040418.20100923154530@gmail.com> Message-ID: On Mon, Sep 27, 2010 at 8:20 PM, Sridhar Ratnakumar wrote: > Hi Sidnei, > > lxml's setupinfo.py invokes a program called "xslt-config" and fails, around line "lib_versions = get_library_versions()". Your libxslt64.bat does not seem to provide this program. Does your private build produce an xslt-config executable, or did you somehow manage to work around setupinfo.py's use of xslt-config for version detection? > > Assume static linking for this question. By using lxml{64}.bat, the right environment variables are set, so that even though xslt-config is not around the build still proceeds successfully. -- Sidnei From magawake at gmail.com Tue Sep 28 13:43:48 2010 From: magawake at gmail.com (Mag Gam) Date: Tue, 28 Sep 2010 07:43:48 -0400 Subject: [lxml-dev] PCDATA question In-Reply-To: References: Message-ID: Also, I do have a DTD file. However, I am not sure how to validate it against my xml file On Tue, Sep 28, 2010 at 7:40 AM, Mag Gam wrote: > I am trying to parse an XML file, but I keep getting this error: > lxml.etree.XMLSyntaxError: PCDATA invalid Char value 27, line 29053, column 2111 > > By looking at this character, it does look weird. There is an escape > sequence of some sort in my file. Is there a way to include this or > convert this bad character into something useful? > From magawake at gmail.com Tue Sep 28 13:40:36 2010 From: magawake at gmail.com (Mag Gam) Date: Tue, 28 Sep 2010 07:40:36 -0400 Subject: [lxml-dev] PCDATA question Message-ID: I am trying to parse an XML file, but I keep getting this error: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 27, line 29053, column 2111 By looking at this character, it does look weird.
There is an escape sequence of some sort in my file. Is there a way to include this or convert this bad character into something useful? From Tim.Arnold at sas.com Tue Sep 28 17:22:33 2010 From: Tim.Arnold at sas.com (Tim Arnold) Date: Tue, 28 Sep 2010 15:22:33 +0000 Subject: [lxml-dev] exploring a relaxng schema Message-ID: <3AA0EA4F99BA8C4F89E32C90DF945E0E180CAADE@MERCMBX03D.na.SAS.com> Hi, I'd like to generate a list of legal child elements given a parent element, using the DocBook 5 RELAX NG schema. Is it possible to do this with lxml? Something like this is what I'd like to eventually get/write: somefunc('varname','inlineequation')# returns False somefunc('varname','inlinemediaobject')# returns True thanks, --Tim From cfbearden at gmail.com Tue Sep 28 22:43:57 2010 From: cfbearden at gmail.com (Chuck Bearden) Date: Tue, 28 Sep 2010 15:43:57 -0500 Subject: [lxml-dev] PCDATA question In-Reply-To: References: Message-ID: On Tue, Sep 28, 2010 at 6:43 AM, Mag Gam wrote: > Also, I do have a DTD file. However, I am not sure how to validate it > against my xml file The documentation on the lxml website is excellent: There are instructions for validating XML against DTDs, and RNG & XML schemas. The website seems to be down now; otherwise I would send you a link to the relevant part.
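For reference, a minimal sketch of DTD validation with lxml, assuming the `etree.DTD` class; the tiny DTD and the element names `a`, `b` and `c` are made up for illustration and have nothing to do with the poster's actual files:

```python
import io
from lxml import etree

# Hypothetical DTD: <a> must contain exactly one empty <b> child.
dtd = etree.DTD(io.StringIO("<!ELEMENT a (b)>\n<!ELEMENT b EMPTY>"))

valid_doc = etree.XML("<a><b/></a>")
invalid_doc = etree.XML("<a><c/></a>")

# validate() returns a boolean instead of raising an exception.
is_valid = dtd.validate(valid_doc)
is_invalid = dtd.validate(invalid_doc)

# error_log holds the reasons for the most recent failed validation.
errors = [entry.message for entry in dtd.error_log.filter_from_errors()]
```

For a DTD stored on disk, the `StringIO` object would presumably be replaced by the file name or an open file object.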
Hope this helps, Chuck From cfbearden at gmail.com Tue Sep 28 22:47:06 2010 From: cfbearden at gmail.com (Chuck Bearden) Date: Tue, 28 Sep 2010 15:47:06 -0500 Subject: [lxml-dev] exploring a relaxng schema In-Reply-To: <3AA0EA4F99BA8C4F89E32C90DF945E0E180CAADE@MERCMBX03D.na.SAS.com> References: <3AA0EA4F99BA8C4F89E32C90DF945E0E180CAADE@MERCMBX03D.na.SAS.com> Message-ID: On Tue, Sep 28, 2010 at 10:22 AM, Tim Arnold wrote: > Hi, > I'd like to generate a list of legal child elements given a parent element, > using the DocBook 5 RELAX NG schema. Is it possible to do this with lxml? > > Something like this is what I'd like to eventually get/write: > somefunc('varname','inlineequation')# returns False > somefunc('varname','inlinemediaobject')# returns True > > thanks, > --Tim Yes, it is, though if the schema is split across multiple inclusions you should probably simplify the schema first as per the specification. When I worked with RNG, I used lxml for this kind of thing quite a bit, though mostly by means of XPath rather than the ElementTree API. Best wishes, Chuck From cfbearden at gmail.com Tue Sep 28 22:42:17 2010 From: cfbearden at gmail.com (Chuck Bearden) Date: Tue, 28 Sep 2010 15:42:17 -0500 Subject: [lxml-dev] PCDATA question In-Reply-To: References: Message-ID: On Tue, Sep 28, 2010 at 6:40 AM, Mag Gam wrote: > I am trying to parse an XML file, but I keep getting this error: > lxml.etree.XMLSyntaxError: PCDATA invalid Char value 27, line 29053, column 2111 > > By looking at this character, it does look weird. There is an escape > sequence of some sort in my file. Is there a way to include this or > convert this bad character into something useful? Evidently you have an ESCAPE character (hex 1b, perhaps as "&#x1b;" or "&#27;") in your not-quite-XML document. XML processors (including the libraries that lxml uses) are not required to accept this character, and you should probably assume that they won't. See for a list of characters XML processors are required to accept.
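One common workaround, sketched here as an assumption rather than something proposed in this thread, is to drop or replace the forbidden characters before the text reaches the parser. The helper name `strip_invalid_xml_chars` is hypothetical; the character ranges follow the XML 1.0 `Char` production. Note that this loses information, so it only suits cases where the control characters carry no meaning:

```python
import re

# Matches every character NOT allowed by the XML 1.0 "Char" production,
# which includes ESC (U+001B) and most other C0 control characters.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with characters that XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# The ESC character is removed; tab and newline would be kept.
cleaned = strip_invalid_xml_chars("before\x1bafter")
```

The cleaning has to be applied to the decoded text of the file before it is handed to the parser; it cannot help once parsing has already failed.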
I don't know of a way to escape this character short of encoding it as base64 and ensuring that downstream processors know to unescape it when they de-XMLize the data. Hope this helps, Chuck From Tim.Arnold at sas.com Wed Sep 29 16:24:44 2010 From: Tim.Arnold at sas.com (Tim Arnold) Date: Wed, 29 Sep 2010 14:24:44 +0000 Subject: [lxml-dev] exploring a relaxng schema In-Reply-To: References: <3AA0EA4F99BA8C4F89E32C90DF945E0E180CAADE@MERCMBX03D.na.SAS.com> Message-ID: <3AA0EA4F99BA8C4F89E32C90DF945E0E180CD4E0@MERCMBX03D.na.SAS.com> > -----Original Message----- > From: Chuck Bearden [mailto:cfbearden at gmail.com] > Sent: Tuesday, September 28, 2010 4:47 PM > To: Tim Arnold > Cc: lxml-dev at codespeak.net > Subject: Re: [lxml-dev] exploring a relaxng schema > > On Tue, Sep 28, 2010 at 10:22 AM, Tim Arnold wrote: > > Hi, > > I'd like to generate a list of legal child elements given a parent > > element, using the DocBook 5 RELAX NG schema. Is it possible to do this > with lxml? > > > > Something like this is what I'd like to eventually get/write: > > somefunc('varname','inlineequation')# returns False > > somefunc('varname','inlinemediaobject')# returns True > > > > thanks, > > --Tim > > Yes, it is, though if the schema is split across multiple inclusions you should > probably simplify the schema first as per the specification. When I worked > with RNG, I used lxml for this kind of thing quite a bit, though mostly by > means of XPath rather than the ElementTree API. > > Best wishes, > Chuck Hi Chuck, Thanks for the suggestion. The schema isn't split across multiple inclusions, so at least that helps. If you have any code lying around that shows an example of how you used XPath for finding whether a particular child is a legal subelement of its parent, I sure would appreciate it. thanks!
--Tim Arnold From ovnicraft at gmail.com Wed Sep 29 19:10:26 2010 From: ovnicraft at gmail.com (Ovnicraft) Date: Wed, 29 Sep 2010 12:10:26 -0500 Subject: [lxml-dev] Error with etree.XML(mystr) what i cant identify Message-ID: Hi folks, trying with etree.XML(mystr) I get this error: http://openerp.pastebin.com/rPF9fKYZ My string is http://openerp.pastebin.com/CbU4y04Y; I can't identify the problem. Waiting for your help. Regards, -- Cristian Salamea @ovnicraft From terry_n_brown at yahoo.com Wed Sep 29 19:18:34 2010 From: terry_n_brown at yahoo.com (Terry Brown) Date: Wed, 29 Sep 2010 12:18:34 -0500 Subject: [lxml-dev] Error with etree.XML(mystr) what i cant identify In-Reply-To: References: Message-ID: <20100929121834.49adcab5@nrri.umn.edu> On Wed, 29 Sep 2010 12:10:26 -0500 Ovnicraft wrote: > Hi folks, trying with etree.XML(mystr) I get this error: > http://openerp.pastebin.com/rPF9fKYZ My string is > http://openerp.pastebin.com/CbU4y04Y; I can't identify the problem. > Waiting for your help. attrs=\'{\'readonly\':[(\'state\',\'=\',\'valid\')]}\' should maybe be attrs="\'{\'readonly\':[(\'state\',\'=\',\'valid\')]}\'" From terry_n_brown at yahoo.com Wed Sep 29 19:17:24 2010 From: terry_n_brown at yahoo.com (Terry Brown) Date: Wed, 29 Sep 2010 12:17:24 -0500 Subject: [lxml-dev] Error with etree.XML(mystr) what i cant identify In-Reply-To: References: Message-ID: <20100929121724.62a847c5@nrri.umn.edu> On Wed, 29 Sep 2010 12:10:26 -0500 Ovnicraft wrote: > Hi folks, trying with etree.XML(mystr) I get this error: > http://openerp.pastebin.com/rPF9fKYZ My string is > http://openerp.pastebin.com/CbU4y04Y; I can't identify the problem. > Waiting for your help.
> > Regards, tidy -xml says line 1 column 1743 - Warning: unexpected or duplicate quote mark line 1 column 2108 - Warning: unexpected or duplicate quote mark From ovnicraft at gmail.com Wed Sep 29 20:00:03 2010 From: ovnicraft at gmail.com (Ovnicraft) Date: Wed, 29 Sep 2010 13:00:03 -0500 Subject: [lxml-dev] Error with etree.XML(mystr) what i cant identify In-Reply-To: <20100929121834.49adcab5@nrri.umn.edu> References: <20100929121834.49adcab5@nrri.umn.edu> Message-ID: On Wed, Sep 29, 2010 at 12:18 PM, Terry Brown wrote: > On Wed, 29 Sep 2010 12:10:26 -0500 > Ovnicraft wrote: > > > Hi folks, trying with etree.XML(mystr) I get this error: > > http://openerp.pastebin.com/rPF9fKYZ My string is > > http://openerp.pastebin.com/CbU4y04Y; I can't identify the problem. > > Waiting for your help. > > > attrs=\'{\'readonly\':[(\'state\',\'=\',\'valid\')]}\' > > should maybe be > > attrs="\'{\'readonly\':[(\'state\',\'=\',\'valid\')]}\'" > Thanks, it is fixed. Regards, -- Cristian Salamea @ovnicraft From magawake at gmail.com Thu Sep 30 12:25:36 2010 From: magawake at gmail.com (Mag Gam) Date: Thu, 30 Sep 2010 06:25:36 -0400 Subject: [lxml-dev] PCDATA question In-Reply-To: References: Message-ID: Can you try sending me the relevant part? On Tue, Sep 28, 2010 at 4:43 PM, Chuck Bearden wrote: > On Tue, Sep 28, 2010 at 6:43 AM, Mag Gam wrote: >> Also, I do have a DTD file. However, I am not sure how to validate it >> against my xml file > > The documentation on the lxml website is excellent: > > There are instructions for validating XML against DTDs, and RNG & XML > schemas. The website seems to be down now; otherwise I would send you > a link to the relevant part.
> > Hope this helps, > Chuck > From stefan_ml at behnel.de Thu Sep 30 17:34:35 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 30 Sep 2010 17:34:35 +0200 Subject: [lxml-dev] Problem with utf-8 and etree In-Reply-To: References: Message-ID: <4CA4AE0B.5090308@behnel.de> Vetoshkin Nikita, 17.09.2010 08:48: > Not sure if it's a bug, so here I am. > Test case: > > from lxml import etree > e = etree.Element("Test") > l = open("1.txt").read() > e.text = l #<-- ValueError: All strings must be XML compatible... I assume you are using Python 2? In that case, you get back a byte string when reading the file. Since lxml.etree can't know which encoding the byte sequence in that string uses, it rejects the string. You must decode it to Unicode first. Stefan From stefan_ml at behnel.de Thu Sep 30 17:47:05 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 30 Sep 2010 17:47:05 +0200 Subject: [lxml-dev] Performance gets bad when parsing xml with namespaces In-Reply-To: References: Message-ID: <4CA4B0F9.40407@behnel.de> Max Kisselew, 24.09.2010 01:51: > recently I discovered a problem with lxml/LibXML2. > I guess it's likely that the problem comes from Libxml2. What version are you using? If it's a 2.6.x version, try 2.7 instead. > I'm working on a university project where I use Python and lxml for > xml parsing and processing. First there were no namespace definitions > in the xml files > we used but recently the format has slightly changed and some namespace > definitions were added. Here the xml format as it was in the beginning: > > > > > IMS, Uni Stuttgart > > > European Medicines Agency > EMEA/H/C/471 [...] > Wegen > > European > Medicines > [...] > Wegen > > > [...] > > > > > > And here the xml with the recently added namespace definitions: > > > > > IMS, Uni Stuttgart > > > European Medicines Agency > EMEA/H/C/471 [...] > Wegen > > European > Medicines > [...] > Wegen > > > [...] > > > > > > I wanted to extract all the content from the elements. 
In the xml > file without the namespace definitions that takes just a moment (less > that 30 seconds). > But when I tried to perform the same on the new file with namespaces, it > took much longer, more that 30 minutes (!). The xml file was about 7 MB. 7 MB is pretty small, so I'm surprised about that difference (although 30 seconds sounds pretty long already). Did you try to declare all three namespaces on the root element using different prefixes, instead of redeclaring them without a prefix all over the place? What code do you use for parsing and searching? There are many ways to do the above in lxml.etree, and some of them are much faster than others. Have a look here: http://codespeak.net/lxml/performance.html and especially here: http://codespeak.net/lxml/performance.html#a-longer-example Stefan From cfbearden at gmail.com Thu Sep 30 17:47:37 2010 From: cfbearden at gmail.com (Chuck Bearden) Date: Thu, 30 Sep 2010 10:47:37 -0500 Subject: [lxml-dev] PCDATA question In-Reply-To: References: Message-ID: On Thu, Sep 30, 2010 at 5:25 AM, Mag Gam wrote: > Can you try sending me the relevant part? > On Tue, Sep 28, 2010 at 4:43 PM, Chuck Bearden wrote: >> On Tue, Sep 28, 2010 at 6:43 AM, Mag Gam wrote: >>> Also, I do have a DTD file. However, I am not sure how to validate it >>> against my xml file >> >> The documentation on the lxml website is excellent: >> >> ? >> >> There are instructions for validating XML against DTDs, and RNG & XML >> schemas. The website seems to be down now; otherwise I would send you >> a link to the relevant part. >> >> Hope this helps, >> Chuck >> > From nikita.vetoshkin at gmail.com Thu Sep 30 20:32:45 2010 From: nikita.vetoshkin at gmail.com (Vetoshkin Nikita) Date: Fri, 1 Oct 2010 00:32:45 +0600 Subject: [lxml-dev] Fwd: Problem with utf-8 and etree In-Reply-To: References: <4CA4AE0B.5090308@behnel.de> Message-ID: Thanks for your reply, Stefan. You're right, I'm using Python2.6. 
Doesn't lxml assume UTF-8 encoding, or is there any way to tell it about the encoding? The problem I'm solving: the Firebird DBMS via SQLAlchemy returns Unicode strings, which I convert to UTF-8 byte strings for the csv module. After that I want to send such a string as a text element, so I would have to convert it back to Unicode, then serialize it to UTF-8 and actually send it. That's no good. I dug a bit and found that lxml checks whether the characters fit UTF-8, but I couldn't find the wrong one in my string. 2010/9/30 Stefan Behnel : > Vetoshkin Nikita, 17.09.2010 08:48: >> >> Not sure if it's a bug, so here I am. >> Test case: >> >> from lxml import etree >> e = etree.Element("Test") >> l = open("1.txt").read() >> e.text = l #<-- ValueError: All strings must be XML compatible... > > I assume you are using Python 2? In that case, you get back a byte string > when reading the file. Since lxml.etree can't know which encoding the byte > sequence in that string uses, it rejects the string. You must decode it to > Unicode first. > > Stefan > From scott.smith at primal.com Thu Sep 30 21:26:01 2010 From: scott.smith at primal.com (Scott Smith) Date: Thu, 30 Sep 2010 19:26:01 +0000 Subject: [lxml-dev] DLL issue with win32 static builds of lxml Message-ID: <303CBE02A7EFC949B6A71CBF4FC86B752BFC95BD5E@MBX2.EXCHPROD.USA.NET> Hi I'm looking at upgrading our Python installations from 2.6(.2) to 2.7 and upgrading the packages we use to the latest versions. We use Python and the mod_wsgi module in Apache 2.2 for our application server and easy_install to install egg packages. I'm currently testing on a Windows 7 machine. The current egg version of lxml that we're using (before upgrading) is: lxml-2.2.4-py2.6-win32.egg In trying to upgrade things, I've found that all the later builds of lxml cause a DLL problem through Apache.
For example, all of these builds (from PyPI) cause the problem: lxml-2.2.8-py2.6-win32.egg lxml-2.2.8-py2.7-win32.egg lxml-2.2beta1-py2.6-win32.egg lxml-2.3beta1-py2.6-win32.egg lxml-2.3beta1-py2.7-win32.egg (Note that there are no static win32 builds available for 2.2.5-2.2.7, or I can't find them) The error thrown in the Apache logs is: mod_wsgi (pid=4812): Target WSGI script 'C:/wsgi/handler.py' cannot be loaded as Python module. mod_wsgi (pid=4812): Exception occurred processing WSGI script 'C:/wsgi/handler.py'. Traceback (most recent call last): File "C:/wsgi/handler.py", line 2, in <module> from lxml import etree ImportError: DLL load failed: The specified module could not be found. The same file works in Python 2.6 if lxml 2.2.4 is installed, but fails for any higher version of lxml (in either Python 2.6 or 2.7). I've been able to trace the problem down to the builds not being able to find the msvcr90.dll file properly. I've tried modifying path variables and other changes, but can't figure out what the problem is. A couple of other things to note: - Using just "import lxml" doesn't cause the error, just "from lxml import etree" - The problem only occurs through Apache. If I write a script that references etree and call it through the command line, the problem doesn't occur. Does anyone know what I need to do? What changed after the 2.2.4 lxml builds? Thanks for any help you can give. Scott Smith
From stefan_ml at behnel.de Thu Sep 30 22:43:17 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 30 Sep 2010 22:43:17 +0200 Subject: [lxml-dev] Fwd: Problem with utf-8 and etree In-Reply-To: References: <4CA4AE0B.5090308@behnel.de> Message-ID: <4CA4F665.9050606@behnel.de> Vetoshkin Nikita, 30.09.2010 20:32: > 2010/9/30 Stefan Behnel: >> Vetoshkin Nikita, 17.09.2010 08:48: >>> >>> Not sure if it's a bug, so here I am. >>> Test case: >>> >>> from lxml import etree >>> e = etree.Element("Test") >>> l = open("1.txt").read() >>> e.text = l #<-- ValueError: All strings must be XML compatible... >> >> I assume you are using Python 2? In that case, you get back a byte string >> when reading the file. Since lxml.etree can't know which encoding the byte >> sequence in that string uses, it rejects the string. You must decode it to >> Unicode first. > > You're right, I'm using Python2.6. Doesn't lxml assume UTF-8 encoding No, it doesn't. That's a feature that keeps you from making mistakes. Hence the error. > or is there any way to tell about encoding? I'm solving the problem: > Firebird DBMS via SQLAlchemy returns Unicode string, then I convert > them to bytestrings in UTF-8 for csv module, after that I want to send > that string as a text element and convert it back to Unicode, and that > serialize to UTF-8 and actually send. As I said, decode it before passing it to lxml. It will do the right thing on serialisation.
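A minimal sketch of that advice, assuming the bytes are known to be UTF-8 (the sample data is made up, and the fallback import is only there so the snippet also runs where lxml is not installed):

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is close enough for this sketch
    import xml.etree.ElementTree as etree

# Byte string as it might come back from a file or from the csv step
# (hypothetical sample data, UTF-8 encoded).
raw = "Gr\u00fc\u00dfe".encode("utf-8")

e = etree.Element("Test")
# Decode explicitly: only the caller knows which encoding the bytes use.
e.text = raw.decode("utf-8")

# Serialisation re-encodes the Unicode text into the requested encoding.
serialized = etree.tostring(e, encoding="UTF-8")
```

The point of the design is that the decode step happens once, at the boundary where the encoding is known, and everything inside the tree stays Unicode until serialisation.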
Stefan From sidnei.da.silva at canonical.com Thu Sep 30 22:47:59 2010 From: sidnei.da.silva at canonical.com (Sidnei da Silva) Date: Thu, 30 Sep 2010 17:47:59 -0300 Subject: [lxml-dev] DLL issue with win32 static builds of lxml In-Reply-To: <303CBE02A7EFC949B6A71CBF4FC86B752BFC95BD5E@MBX2.EXCHPROD.USA.NET> References: <303CBE02A7EFC949B6A71CBF4FC86B752BFC95BD5E@MBX2.EXCHPROD.USA.NET> Message-ID: On Thu, Sep 30, 2010 at 4:26 PM, Scott Smith wrote: > Does anyone know what I need to do? What changed after the 2.2.4 lxml > builds? There's actually quite some things that changed. The laptop where I built the 2.2.4 version was stolen so I had to do everything from scratch. The version of libxml2 was changed as well, and things like that. But the fact it works from the command line is suspicious. I've had issues with mod_python not finding modules, but that was before mod_wsgi existed. I'd rule out any issues in your environment before anything else. -- Sidnei