From stefan_ml at behnel.de Fri Aug 1 08:47:07 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 01 Aug 2008 08:47:07 +0200
Subject: [lxml-dev] Segfault on XSLT/XPath undefined variable error.
In-Reply-To: <1217522521.4123.183.camel@jmk>
References: <1217375154.4123.138.camel@jmk> <48900199.5020806@behnel.de> <1217451261.4123.163.camel@jmk> <489150A6.3060307@behnel.de> <1217522521.4123.183.camel@jmk>
Message-ID: <4892B16B.5010903@behnel.de>

Hi,

John Krukoff wrote:
> Okay, can only get it to crash when first signing a document using
> libxmlsec, so I suppose I'll simply assume that the two libraries use
> the error log in incompatible ways.

could you check if this patch makes it work better for you? It basically
restricts XSLT error logging to the lifetime of an XSL transformation.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: temporary-xslt-logging.patch
Type: text/x-patch
Size: 2078 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080801/cb3738e5/attachment.bin

From azaroth at liverpool.ac.uk Fri Aug 1 17:16:21 2008
From: azaroth at liverpool.ac.uk (Dr R. Sanderson)
Date: Fri, 1 Aug 2008 16:16:21 +0100 (BST)
Subject: [lxml-dev] Python 3.0 Support
In-Reply-To: <47878E4E.5080800@behnel.de>
References: <47878E4E.5080800@behnel.de>
Message-ID:

Back in May, Stefan wrote:
> [but yes, there will be lxml for Python 3, and pretty soon]

Any news on the Py3k front?

(I'm in the process of scoping out just how hard it's going to be to
update our code to 3000, starting with the dependencies)

Many Thanks!

Rob

From stefan_ml at behnel.de Sun Aug 3 08:41:50 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 03 Aug 2008 08:41:50 +0200
Subject: [lxml-dev] Python 3.0 Support
In-Reply-To:
References: <47878E4E.5080800@behnel.de>
Message-ID: <4895532E.1050501@behnel.de>

Hi,

Dr R.
Sanderson wrote:
> Back in May, Stefan wrote:
>> [but yes, there will be lxml for Python 3, and pretty soon]
>
> Any news on the Py3k front?

It's there in general, so you can compile lxml under Py3 and run your code
against it for pure testing purposes.

However, due to changes in Py3.0 beta2, you can get crashes in the exception
handling code that Cython generates. There seem to be slight changes in the
way exceptions interact with the frame cleanup in Py3 now. And Cython does
not use frames at all but emulates them, apparently not well enough for the
latest Py3 beta...

I'm working on fixing this, but I don't know when this will be done. It may
take a couple of weeks, and will require a new source release of 2.1.x.

Stefan

From azaroth at liverpool.ac.uk Sun Aug 3 17:18:19 2008
From: azaroth at liverpool.ac.uk (Dr R. Sanderson)
Date: Sun, 3 Aug 2008 16:18:19 +0100 (BST)
Subject: [lxml-dev] Python 3.0 Support
In-Reply-To: <4895532E.1050501@behnel.de>
References: <47878E4E.5080800@behnel.de> <4895532E.1050501@behnel.de>
Message-ID:

>>> [but yes, there will be lxml for Python 3, and pretty soon]
>> Any news on the Py3k front?
> It's there in general, so you can compile lxml under Py3 and run your code
> against it for pure testing purposes.

Fantastic :) And the thinko that was causing my problem is that
fromstring() is all lowercase not fromString(). Duh. Haven't run into
any of the crashes yet.

> However, due to changes in Py3.0 beta2, you can get crashes in the exception
> [...]
> I'm working on fixing this, but I don't know when this will be done. It may
> take a couple of weeks, and will require a new source release of 2.1.x.

No problem!
Many thanks for the prompt reply,

Rob

From jjl at pobox.com Mon Aug 4 14:10:16 2008
From: jjl at pobox.com (John J Lee)
Date: Mon, 4 Aug 2008 13:10:16 +0100 (BST)
Subject: [lxml-dev] Passing UTF-8 bytestrings to lxml
Message-ID:

Hi

Apologies in advance if this is the wrong list -- I'm suggesting a change
to lxml, so I guess this is the right place...

I'm working on some existing code that makes use of both unicode objects
and UTF-8 encoded bytestring objects (both of which sometimes contain
non-ASCII characters). I'm making changes to the code to ensure that it
supports the unicode character set. Unfortunately, it's not practical to
change all of the code to use unicode objects (partly because there's a
lot of code, and partly because fixing that would probably entail fixing
PyGTK to return unicode objects instead of UTF-8 encoded bytestrings).
So, the plan is to live with both unicode and UTF-8 encoded bytestrings,
and to ensure Python's default encoding is always set to UTF-8. I'm sure
the wisdom of that approach could be debated (!), but I hope that somebody
will be kind enough to answer the following question anyway :-)

Looking at the code, it seems that changing function _utf8 in
apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
seems to work.

1. Will what I'm doing subtly break lxml in some way if I make use of
this patched lxml in my own code?

2. Should lxml be changed in this way? If it's considered important to
avoid accidentally passing non-ASCII bytestrings to lxml, would it be
acceptable to add a global switch to enable accepting UTF-8 encoded
bytestrings?
Thanks for any help

(This patch is against lxml 1.3.6, but this function in SVN trunk is very
similar)

--- apihelpers.pxi.orig 2008-08-04 12:52:34.000000000 +0100
+++ apihelpers.pxi 2008-08-04 12:41:57.000000000 +0100
@@ -640,13 +640,12 @@
 
 cdef object _utf8(object s):
     if python.PyString_Check(s):
-        assert not isutf8py(s), \
-            "All strings must be XML compatible, either Unicode or ASCII"
+        assert isutf8py(s) != -1, \
+            "All strings must either unicode objects or UTF-8"
     elif python.PyUnicode_Check(s):
-        # FIXME: we should test these strings, too ...
         s = python.PyUnicode_AsUTF8String(s)
         assert isutf8py(s) != -1, \
-            "All strings must be XML compatible, either Unicode or ASCII"
+            "All strings must be either unicode objects or UTF-8"
     else:
         raise TypeError, "Argument must be string or unicode."
     return s

John

From stefan_ml at behnel.de Mon Aug 4 16:07:33 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 4 Aug 2008 16:07:33 +0200 (CEST)
Subject: [lxml-dev] Passing UTF-8 bytestrings to lxml
In-Reply-To:
References:
Message-ID: <57778.213.61.181.86.1217858853.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

John J Lee wrote:
> Apologies in advance if this is the wrong list -- I'm suggesting a change
> to lxml, so I guess this is the right place...

We only have one mailing list, so this is definitely the right place.

> Looking at the code, it seems that changing function _utf8 in
> apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> seems to work.

The internal encoding used by libxml2 is UTF-8, so I don't expect any
problems when you pass in UTF-8 directly - as long as you can make sure
that it's really a valid UTF-8 byte sequence.

> 2. Should lxml be changed in this way? If it's considered important to
> avoid accidentally passing non-ASCII bytestrings to lxml

I consider that important, yes.
The support for ASCII byte strings is a pure convenience as ASCII names
are extremely common in XML *and* they are compatible with unicode strings
in Python 2.x. Allowing anything other than ASCII here would open the door
for all sorts of hard to track down encoding problems, as you would no
longer get an exception when you accidentally pass ISO encoded non-ASCII
strings, for example.

Note that when lxml runs under Python 3, it will not allow you to pass
byte strings into the API at all (except for parsing, obviously).

> would it be
> acceptable to add a global switch to enable accepting UTF-8 encoded
> bytestrings?

Global switches are always a bad thing. And I don't like the idea of
accepting UTF-8 encoded strings at the API level and returning them as
unicode strings (and: no, I would not allow returning UTF-8 encoded
strings from the API).

So I guess the answer is a pretty straight no.

Stefan

PS: Regarding your actual problem: it's best to decode data directly when
your code gets its hands on it, and to encode as late as possible, i.e. on
the way out. I remember that when I worked with Qt3, I used two helper
functions to wrap (arguments of) Qt functions that accepted or returned
strings, so that I could work with clean Python unicode strings in the
rest of my code. That's the best advice I can give you. Besides, if PyGTK
worked like lxml, you wouldn't have this problem in the first place.

From jjl at pobox.com Mon Aug 4 20:02:12 2008
From: jjl at pobox.com (John J Lee)
Date: Mon, 04 Aug 2008 19:02:12 +0100
Subject: [lxml-dev] Passing UTF-8 bytestrings to lxml
In-Reply-To: <57778.213.61.181.86.1217858853.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <57778.213.61.181.86.1217858853.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <1217872932.3251.1266930759@webmail.messagingengine.com>

On Mon, 4 Aug 2008 16:07:33 +0200 (CEST), "Stefan Behnel" said:
[...]
> > Looking at the code, it seems that changing function _utf8 in
> > apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> > be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> > seems to work.
>
> The internal encoding used by libxml2 is UTF-8, so I don't expect any
> problems when you pass in UTF-8 directly - as long as you can make sure
> that it's really a valid UTF-8 byte sequence.

Thanks for this, it's very helpful.

I have a follow-up question, though. On discovering the fact that unicode
strings containing non-ASCII characters don't hash to the same value as
their UTF-8 equivalent bytestring (despite the fact that, for example,
they compare equal, when the default encoding is set to UTF-8), I'm having
second thoughts about my mixed-str-and-unicode scheme:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> hash(u"\xa3")
-610773982
>>> hash(u"\xa3".encode("utf-8"))
1195450215
>>> d = {}
>>> d[u"\xa3"] = 1
>>> d[u"\xa3".encode("utf-8")] = 2
>>> len(d)
2

FWIW, that fact is documented here:

http://www.python.org/dev/peps/pep-0100/

"Comparison & Hash Value"

So, my question: were I also to change the function funicode (also in
apihelpers.pxi) to return UTF-8 bytestrings, would lxml always return
UTF-8 bytestring objects from all of its API calls? Again, this seems to
work with a quick test, but I wonder whether there are cases where
funicode() is not called.
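
One way to sidestep the hash mismatch shown in the session above is to
normalise every key to a single string type before it goes into a dict.
A minimal sketch, assuming a hypothetical helper named to_text (it is not
part of lxml or of the code under discussion) and a Python version where
bytes and text are distinct types:

```python
# Sketch only: `to_text` is a hypothetical helper, not an lxml function.
def to_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding byte strings first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


pound_text = u"\xa3"
pound_bytes = pound_text.encode("utf-8")

# Mixing both types as keys gives two separate entries.
mixed = {}
mixed[pound_text] = 1
mixed[pound_bytes] = 2

# Normalising first collapses them into one entry.
normalised = {}
normalised[to_text(pound_text)] = 1
normalised[to_text(pound_bytes)] = 2
```

The same idea works in the other direction (encoding every key to UTF-8
bytes); what matters is that only one representation is ever used as a key.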
The patch I'm thinking of would be something like this:

--- apihelpers.pxi.orig 2008-08-04 12:52:34.000000000 +0100
+++ apihelpers.pxi 2008-08-04 18:40:57.000000000 +0100
@@ -623,30 +623,18 @@
     return is_non_ascii
 
 cdef object funicode(char* s):
-    cdef Py_ssize_t slen
-    cdef char* spos
-    cdef char c
-    spos = s
-    c = spos[0]
-    while c != c'\0':
-        if c & 0x80:
-            break
-        spos = spos + 1
-        c = spos[0]
-    slen = spos - s
-    if c != c'\0':
-        return python.PyUnicode_DecodeUTF8(s, slen+cstd.strlen(spos), NULL)
-    return python.PyString_FromStringAndSize(s, slen)
+    if s is NULL:
+        return python.PyString_FromString("")
+    return python.PyString_FromString(s)
 
 cdef object _utf8(object s):
     if python.PyString_Check(s):
-        assert not isutf8py(s), \
-            "All strings must be XML compatible, either Unicode or ASCII"
+        assert isutf8py(s) != -1, \
+            "All strings must either unicode objects or UTF-8"
     elif python.PyUnicode_Check(s):
-        # FIXME: we should test these strings, too ...
         s = python.PyUnicode_AsUTF8String(s)
         assert isutf8py(s) != -1, \
-            "All strings must be XML compatible, either Unicode or ASCII"
+            "All strings must be either unicode objects or UTF-8"
     else:
         raise TypeError, "Argument must be string or unicode."
     return s

(I'm not requesting this patch be applied to lxml, just hoping to get some
help re whether this will do what I hope it will.)

[...]
> Global switches are always a bad thing. And I don't like the idea of
> accepting UTF-8 encoded strings at the API level and returning them as
> unicode strings (and: no, I would not allow returning UTF-8 encoded
> strings from the API).
>
> So I guess the answer is a pretty straight no.

Fair enough :-)

> PS: Regarding your actual problem: it's best to decode data directly when
> your code gets its hands on it, and to encode as late as possible, i.e.
[...]

Sure, that's the usual principle. In my case, I'm consciously looking for
a practical hack as a way of working with existing code.
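
The usual principle (decode on the way in, encode on the way out) can be
sketched as a pair of thin wrappers around a byte-string API. This is only
an illustration: legacy_set_label and legacy_get_label are hypothetical
stand-ins for a toolkit call that speaks UTF-8 bytes; they are not PyGTK,
Qt or lxml functions.

```python
# Sketch only: the two `legacy_*` functions imitate a toolkit API that
# accepts and returns UTF-8 encoded byte strings. They are hypothetical.
_stored_label = b""


def legacy_set_label(raw):
    global _stored_label
    if not isinstance(raw, bytes):
        raise TypeError("legacy API expects UTF-8 bytes")
    _stored_label = raw


def legacy_get_label():
    return _stored_label


def set_label(text):
    """Encode as late as possible: text in, bytes passed to the toolkit."""
    legacy_set_label(text.encode("utf-8"))


def get_label():
    """Decode as early as possible: bytes from the toolkit, text out."""
    return legacy_get_label().decode("utf-8")


set_label(u"caf\xe9")
label = get_label()
```

With wrappers like these, the rest of the program only ever sees text
strings, so len(), strip() and dict lookups behave consistently.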
Also, though, there are dissenters who argue in favour of encoding to
UTF-8 as early as possible (and recoding as late as possible). That view
seems self-consistent to me. Note that e.g. "x in y" tests still work fine
if you pick UTF-8. The only problems then are with things like len(),
.strip(), .upper(), etc, but those can be solved by using a different
len() function, and using functions instead of methods for strip, upper,
etc. It's a minority view, of course.

Thanks again,

John

From stefan_ml at behnel.de Mon Aug 4 20:10:59 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 04 Aug 2008 20:10:59 +0200
Subject: [lxml-dev] Passing UTF-8 bytestrings to lxml
In-Reply-To: <1217872932.3251.1266930759@webmail.messagingengine.com>
References: <57778.213.61.181.86.1217858853.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1217872932.3251.1266930759@webmail.messagingengine.com>
Message-ID: <48974633.90608@behnel.de>

Hi,

John J Lee wrote:
> So, my question: were I also to change the function funicode (also in
> apihelpers.pxi) to return UTF-8 bytestrings, would lxml always return
> UTF-8 bytestring objects from all of its API calls?

funicode() is a very central function that is called whenever a UTF-8 byte
sequence is to be converted to a Python string. I won't give you a
guarantee that everything will work if you change it, but at least I don't
see a major problem at first sight.
Stefan

From niels at bjerre.net Tue Aug 5 16:50:14 2008
From: niels at bjerre.net (Niels Bjerre)
Date: Tue, 5 Aug 2008 14:50:14 +0000 (UTC)
Subject: [lxml-dev] Transform parameter variables
References: <6c6bd9260807271620s1c0239bel667b396afca73731@mail.gmail.com> <488D5E5F.8000408@behnel.de> <6c6bd9260807280218m6f6204dib28f53a55f36511d@mail.gmail.com> <34735.213.61.181.86.1217243787.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel <stefan_ml at behnel.de> writes:

>
> Hi,
>
> it's good practice to
>
> a) reply to the list,
> b) avoid top-posting and
> c) read what people post.
>
> Niels Bjerre wrote:
> > Thank You for your response
> >
> > But I have a problem with the_dict:
> > These 2 statements don't give the same result.
> >
> > 1. newdoc = transform(places.myplaces, area="'3751'")
> > 2. newdoc = transform(places.myplaces, {'area':'\"\'3751\'\"'})
> >
> > 1. is passed to xslt:
> > 2. is ignored
> >
> > I have tried with
> > {'area':'\'3751\''} and others
>
> The last line will work, but as I wrote before:
>
> > 2008/7/28 Stefan Behnel <stefan_ml at behnel.de>
> >> You can pass more than one keyword parameter, as everywhere in Python.
> >> If you want to pass them from a dictionary, do the usual
> >>
> >> result = transform(xml_tree, **the_dict)
> >>
> >> trick.
>
> Note the two stars before "the_dict". This is standard Python syntax for
> expanding a mapping into keyword arguments.
>
> Stefan
>

I'm sorry - still no luck passing a dictionary as extension parameter.

The stylesheet has a parameter:

The parameter is picked up in the transformation if I use:
transform(doc, area="'3751'")
but not when I use the_dict
transform(doc, {'area':'\"\'3751\'\"'}) or any other variant of a
dictionary or a dict_variable I can think of!
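
For reference, the ** expansion mentioned in the quoted text above is
plain Python and can be tried without lxml. A minimal sketch, assuming a
stand-in function named transform that only records what it receives (it
is not lxml's XSLT object, whose actual signature may differ):

```python
# Sketch only: `transform` is a hypothetical stand-in that returns the
# keyword parameters it was given. It is not the lxml XSLT callable.
def transform(doc, *args, **params):
    return params


the_dict = {"area": "'3751'"}

# An explicit keyword argument and a ** expanded dict are equivalent.
by_keyword = transform("doc", area="'3751'")
by_expansion = transform("doc", **the_dict)

# A dict passed positionally is just one more positional argument,
# so in this stand-in it never reaches the keyword parameters.
by_position = transform("doc", the_dict)
```

In this stand-in the positional call ends up with no parameters at all,
which matches the "is ignored" behaviour reported above.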
Any suggestions are most welcome

Niels

From jholg at gmx.de Tue Aug 5 17:24:55 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 05 Aug 2008 17:24:55 +0200
Subject: [lxml-dev] Transform parameter variables
In-Reply-To:
References: <6c6bd9260807271620s1c0239bel667b396afca73731@mail.gmail.com> <488D5E5F.8000408@behnel.de> <6c6bd9260807280218m6f6204dib28f53a55f36511d@mail.gmail.com> <34735.213.61.181.86.1217243787.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <20080805152538.203110@gmx.net>

> > Note the two stars before "the_dict". This is standard Python syntax for
> > expanding a mapping into keyword arguments.
> >
> > Stefan
> >
>
> I'm Sorry - still no luck passing a dictionary as extentions parameter
> The stylesheet has a parameter:
>
>
> The parameter is picked up in the transformation if I use:
> transform(doc, area="'3751'")
> but not when I use the_dict
> transform(doc, {'area':'\"\'3751\'\"'}) or any other variant of a
> dictionary or
> a dict_variable I can think of!
>

Try

transform(doc, **{'area':"3751"})

Note the two stars, read up on python syntax on function calling and
keyword parameters.

From sidnei at enfoldsystems.com Wed Aug 6 01:52:32 2008
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Tue, 5 Aug 2008 20:52:32 -0300
Subject: [lxml-dev] Odd looking exception
Message-ID:

Hi there,

I've just got an exception while upgrading to lxml 2.1.1. Still trying
to find out if it's some backwards-incompatible change.
But what triggered my attention is that the exception seems to have tried
to display the line number from etree.c but failed:

  File "enfold\lxml\transform.pyo", line 429, in transform
  File "xslt.pxi", line 399, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:%u)
XSLTParseError: Cannot parse stylesheet

This is on Windows FWIW.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From sidnei at enfoldsystems.com Wed Aug 6 04:17:15 2008
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Tue, 5 Aug 2008 23:17:15 -0300
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
Message-ID:

Hi there,

I've got a reproducible failing test when using custom resolvers and
relative xsl:import. Basically, if I use resolve_string() or let the
default resolver do its work, everything works fine. If I use
resolve_file though, the *next* uri to be resolved will have a relative
(to the previous uri resolved) filename, and then there's not enough
information available to compute the full URI.

This happens with lxml 2.1.1. I am pretty sure it didn't happen with
lxml 1.3.x series (which is what I was using before).

I'm attaching the problematic test and related files. For reference,
this is the problematic code:

...
def resolve(self, uri, id, ctx):
    print uri
    # return None # works
    # return self.resolve_string(open(uri, 'r').read(), ctx) # works
    return self.resolve_file(open(uri, 'r'), ctx) # fails
...

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxml-test.tar.gz
Type: application/x-gzip
Size: 1163 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080805/31c16fd5/attachment.bin

From stefan_ml at behnel.de Wed Aug 6 08:43:43 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 06 Aug 2008 08:43:43 +0200
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To:
References:
Message-ID: <4899481F.4020008@behnel.de>

Hi,

I'll look into this when I find the time, but as a quick comment on your
code:

Sidnei da Silva wrote:
> ...
> def resolve(self, uri, id, ctx):
>     print uri
>     # return None # works
>     # return self.resolve_string(open(uri, 'r').read(), ctx) # works
>     return self.resolve_file(open(uri, 'r'), ctx) # fails
> ...
This should read (mind the 'rb'):

    return self.resolve_file(open(uri, 'rb'), ctx)

There's also

    return self.resolve_filename(uri, ctx)

which tends to be a lot more efficient (at least before 2.1) and works for
file names and HTTP/FTP URLs.

Stefan

From niels at bjerre.net Wed Aug 6 08:46:18 2008
From: niels at bjerre.net (Niels Bjerre)
Date: Wed, 6 Aug 2008 06:46:18 +0000 (UTC)
Subject: [lxml-dev] Transform parameter variables
References: <6c6bd9260807271620s1c0239bel667b396afca73731@mail.gmail.com> <488D5E5F.8000408@behnel.de> <6c6bd9260807280218m6f6204dib28f53a55f36511d@mail.gmail.com> <34735.213.61.181.86.1217243787.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20080805152538.203110@gmx.net>
Message-ID:

jholg at gmx.de writes:

> > > Note the two stars before "the_dict". This is standard Python syntax for
> > > expanding a mapping into keyword arguments.
> > >
> > > Stefan
> >
> > I'm Sorry - still no luck passing a dictionary as extentions parameter
> > The stylesheet has a parameter:
> >
> > The parameter is picked up in the transformation if I use:
> > transform(doc, area="'3751'")
> > but not when I use the_dict
> > transform(doc, {'area':'\"\'3751\'\"'}) or any other variant of a
> > dictionary or a dict_variable I can think of!
>
> Try
>
> transform(doc, **{'area':"3751"})
>
> Note the two stars, read up on python syntax on function calling and
> keyword parameters.
>

Resolved! Thank You

From cesar.ortiz at gmail.com Wed Aug 6 16:51:24 2008
From: cesar.ortiz at gmail.com (Cesar Ortiz)
Date: Wed, 6 Aug 2008 16:51:24 +0200
Subject: [lxml-dev] FAIL: test_parser_target_entity
Message-ID: <90255a70808060751x490b8780l3201de0521ca6850@mail.gmail.com>

Hi all,

I am new to lxml, but I have been using libxml2 for a while.
In my system I've got libxml2 2.6.26 and libxslt 1.1.17, and I tried to
install lxml. I could not install with easy_install (because of my
installation, I think), so I tried to install from source, and it worked.

After that I ran the tests and one failed:

[phe1246 at pandora lxml]$ make test
python setup.py build_ext -i
Building lxml version 2.2.alpha1-56897.
Building with Cython 0.9.8.
Using build configuration of libxslt 1.1.17
Building against libxml2/libxslt in the following directory: /home/phe1246/install//lib
running build_ext
building 'lxml.etree' extension
gcc -pthread -shared build/temp.linux-i686-2.4/src/lxml/lxml.etree.o -L/home/phe1246/install//lib -lxslt -lexslt -lxml2 -lz -lm -o src/lxml/etree.so
building 'lxml.objectify' extension
gcc -pthread -shared build/temp.linux-i686-2.4/src/lxml/lxml.objectify.o -L/home/phe1246/install//lib -lxslt -lexslt -lxml2 -lz -lm -o src/lxml/objectify.so
python test.py -p -v
TESTED VERSION: 2.2.alpha1-56897
    Python: (2, 4, 3, 'final', 0)
    lxml.etree: (2, 2, -199, 56897)
    libxml used: (2, 6, 26)
    libxml compiled: (2, 6, 26)
    libxslt used: (1, 1, 17)
    libxslt compiled: (1, 1, 17)
970/970 (100.0%): Doctest: xpathxslt.txt
======================================================================
FAIL: test_parser_target_entity (lxml.tests.test_elementtree.ETreeTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.4/unittest.py", line 260, in run
    testMethod()
  File "/home/phe1246/software/lxml/src/lxml/tests/test_elementtree.py", line 3417, in test_parser_target_entity
    events)
  File "/usr/lib/python2.4/unittest.py", line 333, in failUnlessEqual
    raise self.failureException, \
AssertionError: ['start-root', 'start-sub', 'end-sub', 'start-sub',
'data-this is an entity', 'end-sub', 'start-sub', 'end-sub', 'end-root'] !=
['start-root', 'start-sub', 'end-sub', 'start-sub', u'data-this is an
entityan entity', 'end-sub', 'start-sub', 'end-sub',
'end-root']

----------------------------------------------------------------------
Ran 970 tests in 33.731s

FAILED (failures=1)
make: *** [test_inplace] Error 1

Should I worry about it? I didn't see anything related in the mailing
list archive.

Thanks in advance.

NB: By the way, is it possible to get a .tar.gz version from any place?

From stefan_ml at behnel.de Wed Aug 6 20:31:43 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 06 Aug 2008 20:31:43 +0200
Subject: [lxml-dev] FAIL: test_parser_target_entity
In-Reply-To: <90255a70808060751x490b8780l3201de0521ca6850@mail.gmail.com>
References: <90255a70808060751x490b8780l3201de0521ca6850@mail.gmail.com>
Message-ID: <4899EE0F.8020005@behnel.de>

Hi,

Cesar Ortiz wrote:
> In my system I've got libxml2 2.6.26 and libxslt 1.1.17, and I tried to
> install lxml.
>
> [phe1246 at pandora lxml]$ make test
> python setup.py build_ext -i
> Building lxml version 2.2.alpha1-56897.
> Building with Cython 0.9.8.

No need to install Cython, BTW.
> ======================================================================
> FAIL: test_parser_target_entity (lxml.tests.test_elementtree.ETreeTestCase)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/usr/lib/python2.4/unittest.py", line 260, in run
>     testMethod()
>   File "/home/phe1246/software/lxml/src/lxml/tests/test_elementtree.py",
> line 3417, in test_parser_target_entity
>     events)
>   File "/usr/lib/python2.4/unittest.py", line 333, in failUnlessEqual
>     raise self.failureException, \
> AssertionError: ['start-root', 'start-sub', 'end-sub', 'start-sub',
> 'data-this is an entity', 'end-sub', 'start-sub', 'end-sub', 'end-root'] !=
> ['start-root', 'start-sub', 'end-sub', 'start-sub', u'data-this is an
> entityan entity', 'end-sub', 'start-sub', 'end-sub', 'end-root']
>
> ----------------------------------------------------------------------

That's a known bug in libxml2 <= 2.6.26. I generally recommend using
2.6.28 or later, although this will only hit you if you use the target
parser together with entities (as the test shows).

> NB: By the way, is it possible to get a .tar.gz version from any place?

... of lxml? Now I have to wonder how you actually installed it. :)

You'll find it on the web site and on PyPI.

Stefan

From jkrukoff at ltgc.com Wed Aug 6 20:55:27 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Wed, 06 Aug 2008 12:55:27 -0600
Subject: [lxml-dev] Segfault on XSLT/XPath undefined variable error.
In-Reply-To: <4892B16B.5010903@behnel.de>
References: <1217375154.4123.138.camel@jmk> <48900199.5020806@behnel.de> <1217451261.4123.163.camel@jmk> <489150A6.3060307@behnel.de> <1217522521.4123.183.camel@jmk> <4892B16B.5010903@behnel.de>
Message-ID: <1218048927.25651.11.camel@jmk>

On Fri, 2008-08-01 at 08:47 +0200, Stefan Behnel wrote:
> Hi,
>
> John Krukoff wrote:
> > Okay, can only get it to crash when first signing a document using
> > libxmlsec, so I suppose I'll simply assume that the two libraries use
> > the error log in incompatible ways.
>
> could you check if this patch makes it work better for you? It basically
> restricts XSLT error logging to the lifetime of an XSL transformation.
>
> Stefan
>

I still need to compile lxml with -ggdb, where do I stick that in the
setup.py/makefile?

Interestingly, this is after I've switched to calling an external C#
program to do my xml signing, and am no longer using libxmlsec. But,
anyway, still crashed with the patch for me:

Core was generated by `/usr/bin/python -tt ./Adapter.py'.
Program terminated with signal 11, Segmentation fault.
#0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so
(gdb) bt
#0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so
#1 0xb774bb12 in __pyx_f_4lxml_5etree__receiveXSLTError () from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so
#2 0xb76baef9 in xsltPrintErrorContext () from /usr/lib/libxslt.so.1
#3 0xb76bb091 in xsltTransformError () from /usr/lib/libxslt.so.1
#4 0xb76dd434 in xsltValueOf () from /usr/lib/libxslt.so.1
#5 0xb76da5ba in ?? () from /usr/lib/libxslt.so.1
#6 0x0878e718 in ?? ()
#7 0x08432c60 in ?? ()
#8 0x0878ec78 in ?? ()
#9 0x0878f6f8 in ?? ()
#10 0x00000000 in ??
()

Fortunately, I've been able to simplify my crash conditions somewhat, so
the valgrind log is significantly shorter. Looks like I'll need to find
some time to work on that test case after all.

--
John Krukoff
Land Title Guarantee Company
-------------- next part --------------
==3894== Memcheck, a memory error detector.
==3894== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==3894== Using LibVEX rev 1732, a library for dynamic binary translation.
==3894== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==3894== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==3894== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==3894== For more details, rerun with: -v
==3894==
==3894== My PID = 3894, parent PID = 10649. Prog and args are:
==3894==    python
==3894==    -E
==3894==    -tt
==3894==    ./Adapter.py
==3894==
==3894== Invalid read of size 4
==3894==    at 0x40153D9: (within /lib/ld-2.6.1.so)
==3894==    by 0x4006337: (within /lib/ld-2.6.1.so)
==3894==    by 0x4008217: (within /lib/ld-2.6.1.so)
==3894==    by 0x40118C3: (within /lib/ld-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x40112CD: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A6C4C: (within /lib/libdl-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A70EB: (within /lib/libdl-2.6.1.so)
==3894==    by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so)
==3894==    by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0)
==3894==    by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0)
==3894==  Address 0x44C1A9C is 36 bytes inside a block of size 39 alloc'd
==3894==    at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==3894==    by 0x4007715: (within /lib/ld-2.6.1.so)
==3894==    by 0x4008156: (within /lib/ld-2.6.1.so)
==3894==    by 0x40118C3: (within /lib/ld-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x40112CD: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A6C4C:
(within /lib/libdl-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A70EB: (within /lib/libdl-2.6.1.so)
==3894==    by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so)
==3894==    by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0)
==3894==    by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0)
==3894==
==3894== Invalid read of size 4
==3894==    at 0x40153F0: (within /lib/ld-2.6.1.so)
==3894==    by 0x4006337: (within /lib/ld-2.6.1.so)
==3894==    by 0x4008217: (within /lib/ld-2.6.1.so)
==3894==    by 0x40118C3: (within /lib/ld-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x40112CD: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A6C4C: (within /lib/libdl-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A70EB: (within /lib/libdl-2.6.1.so)
==3894==    by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so)
==3894==    by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0)
==3894==    by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0)
==3894==  Address 0x44C8C78 is 40 bytes inside a block of size 42 alloc'd
==3894==    at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==3894==    by 0x4007715: (within /lib/ld-2.6.1.so)
==3894==    by 0x4008156: (within /lib/ld-2.6.1.so)
==3894==    by 0x40118C3: (within /lib/ld-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x40112CD: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A6C4C: (within /lib/libdl-2.6.1.so)
==3894==    by 0x400D891: (within /lib/ld-2.6.1.so)
==3894==    by 0x41A70EB: (within /lib/libdl-2.6.1.so)
==3894==    by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so)
==3894==    by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0)
==3894==    by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0)
==3894==
==3894== Invalid read of size 4
==3894==    at 0x40153F0: (within /lib/ld-2.6.1.so)
==3894==    by 0x4006337: (within /lib/ld-2.6.1.so)
==3894==    by
0x4008217: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== Address 0x44D18E8 is 24 bytes inside a block of size 25 alloc'd ==3894== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x40087B8: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4015407: (within /lib/ld-2.6.1.so) ==3894== by 0x4006337: (within /lib/ld-2.6.1.so) ==3894== by 0x4008217: (within /lib/ld-2.6.1.so) ==3894== by 0x40118C3: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==3894== by 0x481E1C6: (within /usr/lib/libcrypto.so.0.9.8) ==3894== by 0x481F14F: DSO_load (in /usr/lib/libcrypto.so.0.9.8) ==3894== Address 0x44F410C is 12 bytes inside a block of size 13 alloc'd ==3894== at 0x4022760: malloc (in 
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x40087B8: (within /lib/ld-2.6.1.so) ==3894== by 0x40118C3: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==3894== by 0x481E1C6: (within /usr/lib/libcrypto.so.0.9.8) ==3894== by 0x481F14F: DSO_load (in /usr/lib/libcrypto.so.0.9.8) ==3894== by 0x4893829: COMP_zlib (in /usr/lib/libcrypto.so.0.9.8) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x40153D9: (within /lib/ld-2.6.1.so) ==3894== by 0x4006337: (within /lib/ld-2.6.1.so) ==3894== by 0x4008217: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== Address 0x4A5EF14 is 20 bytes inside a block of size 22 alloc'd ==3894== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x40087B8: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==3894== 
==3894== Invalid read of size 4 ==3894== at 0x40153F0: (within /lib/ld-2.6.1.so) ==3894== by 0x40078CB: (within /lib/ld-2.6.1.so) ==3894== by 0x4007E71: (within /lib/ld-2.6.1.so) ==3894== by 0x4008471: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== Address 0x441D328 is 8 bytes inside a block of size 9 alloc'd ==3894== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x4007715: (within /lib/ld-2.6.1.so) ==3894== by 0x4007855: (within /lib/ld-2.6.1.so) ==3894== by 0x4007E71: (within /lib/ld-2.6.1.so) ==3894== by 0x4008471: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4015407: (within /lib/ld-2.6.1.so) ==3894== by 0x4006337: (within /lib/ld-2.6.1.so) ==3894== by 0x4008217: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== Address 0x4434C04 is 28 bytes 
inside a block of size 29 alloc'd ==3894== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x40087B8: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4B33616: __pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3F1CF: __pyx_f_4lxml_5etree__appendChild (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B163C7: __pyx_pf_4lxml_5etree_8_Element_append (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x40F7FD9: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8285: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8285: (within /usr/lib/libpython2.5.so.1.0) ==3894== Address 0x4A89380 is 80 bytes inside a block of size 88 free'd ==3894== at 0x402237A: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x4CBA69A: xmlFreeDoc (in /usr/lib/libxml2.so.2.6.31) ==3894== by 0x4B1094E: 
__pyx_pf_4lxml_5etree_9_Document___dealloc__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BB831E: __pyx_tp_dealloc_4lxml_5etree__Document (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B337F9: __pyx_f_4lxml_5etree__updateProxyDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B33544: __pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3F1CF: __pyx_f_4lxml_5etree__appendChild (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B163C7: __pyx_pf_4lxml_5etree_8_Element_append (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x40F7FD9: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4B33616: __pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3F1CF: __pyx_f_4lxml_5etree__appendChild (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3ED88: __pyx_f_4lxml_5etree__replaceSlice (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B15607: __pyx_pf_4lxml_5etree_8_Element___setitem__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BB8C08: __pyx_mp_ass_subscript_4lxml_5etree__Element (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4060DDB: PyObject_SetItem (in /usr/lib/libpython2.5.so.1.0) ==3894== by 
0x40F920E: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F2244: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8285: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== Address 0x4A935F0 is 80 bytes inside a block of size 88 free'd ==3894== at 0x402237A: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x4CBA69A: xmlFreeDoc (in /usr/lib/libxml2.so.2.6.31) ==3894== by 0x4B1094E: __pyx_pf_4lxml_5etree_9_Document___dealloc__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BB831E: __pyx_tp_dealloc_4lxml_5etree__Document (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B337F9: __pyx_f_4lxml_5etree__updateProxyDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B33544: __pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3F1CF: __pyx_f_4lxml_5etree__appendChild (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B3ED88: __pyx_f_4lxml_5etree__replaceSlice (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B15607: __pyx_pf_4lxml_5etree_8_Element___setitem__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BB8C08: __pyx_mp_ass_subscript_4lxml_5etree__Element (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4060DDB: PyObject_SetItem (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F920E: (within /usr/lib/libpython2.5.so.1.0) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4B33616: 
__pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B17447: __pyx_pf_4lxml_5etree_8_Element_replace (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x409A67D: PyCFunction_Call (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8189: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8285: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8285: (within /usr/lib/libpython2.5.so.1.0) ==3894== Address 0x4F175C0 is 80 bytes inside a block of size 88 free'd ==3894== at 0x402237A: free (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x4CBA69A: xmlFreeDoc (in /usr/lib/libxml2.so.2.6.31) ==3894== by 0x4B1094E: __pyx_pf_4lxml_5etree_9_Document___dealloc__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BB831E: __pyx_tp_dealloc_4lxml_5etree__Document (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B337F9: __pyx_f_4lxml_5etree__updateProxyDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B333E7: __pyx_f_4lxml_5etree_moveNodeToDocument (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B17447: __pyx_pf_4lxml_5etree_8_Element_replace (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x409A67D: PyCFunction_Call (in /usr/lib/libpython2.5.so.1.0) ==3894== 
by 0x40F8189: (within /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F4E76: PyEval_EvalFrameEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F666C: PyEval_EvalCodeEx (in /usr/lib/libpython2.5.so.1.0) ==3894== by 0x40F8510: (within /usr/lib/libpython2.5.so.1.0) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x40153C3: (within /lib/ld-2.6.1.so) ==3894== by 0x4006337: (within /lib/ld-2.6.1.so) ==3894== by 0x4008217: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A787C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== Address 0x53D46E0 is 48 bytes inside a block of size 51 alloc'd ==3894== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==3894== by 0x40087B8: (within /lib/ld-2.6.1.so) ==3894== by 0x400BF55: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x400C14F: (within /lib/ld-2.6.1.so) ==3894== by 0x401191E: (within /lib/ld-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x40112CD: (within /lib/ld-2.6.1.so) ==3894== by 0x41A787C: (within /lib/libdl-2.6.1.so) ==3894== by 0x400D891: (within /lib/ld-2.6.1.so) ==3894== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==3894== by 0x41A78F7: dlopen (in /lib/libdl-2.6.1.so) ==3894== ==3894== Invalid read of size 4 ==3894== at 0x4B4C740: __pyx_f_4lxml_5etree__forwardError (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B4CB11: __pyx_f_4lxml_5etree__receiveXSLTError (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4C2BEF8: xsltPrintErrorContext (in 
/usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C2C090: xsltTransformError (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4E433: xsltValueOf (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4B5B9: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4BF5A: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4AFFA: xsltProcessOneNode (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C504C9: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C50A46: xsltApplyStylesheetUser (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4BA8D33: __pyx_f_4lxml_5etree_4XSLT__run_transform (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BA6DDF: __pyx_pf_4lxml_5etree_4XSLT___call__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== Address 0x706D6F43 is not stack'd, malloc'd or (recently) free'd ==3894== ==3894== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==3894== Access not within mapped region at address 0x706D6F43 ==3894== at 0x4B4C740: __pyx_f_4lxml_5etree__forwardError (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4B4CB11: __pyx_f_4lxml_5etree__receiveXSLTError (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4C2BEF8: xsltPrintErrorContext (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C2C090: xsltTransformError (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4E433: xsltValueOf (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4B5B9: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4BF5A: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C4AFFA: xsltProcessOneNode (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C504C9: (within /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4C50A46: xsltApplyStylesheetUser (in /usr/lib/libxslt.so.1.1.24) ==3894== by 0x4BA8D33: __pyx_f_4lxml_5etree_4XSLT__run_transform (in 
/usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== by 0x4BA6DDF: __pyx_pf_4lxml_5etree_4XSLT___call__ (in /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so) ==3894== ==3894== ERROR SUMMARY: 84 errors from 12 contexts (suppressed: 2539 from 7) ==3894== malloc/free: in use at exit: 24,884,229 bytes in 21,299 blocks. ==3894== malloc/free: 142,314 allocs, 121,015 frees, 3,738,145,080 bytes allocated. ==3894== For counts of detected errors, rerun with: -v ==3894== searching for pointers to 21,299 not-freed blocks. ==3894== checked 24,614,252 bytes. ==3894== ==3894== LEAK SUMMARY: ==3894== definitely lost: 240,366 bytes in 2,990 blocks. ==3894== possibly lost: 372,325 bytes in 628 blocks. ==3894== still reachable: 24,271,538 bytes in 17,681 blocks. ==3894== suppressed: 0 bytes in 0 blocks. ==3894== Rerun with --leak-check=full to see details of leaked memory. From jkrukoff at ltgc.com Wed Aug 6 21:23:45 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Wed, 06 Aug 2008 13:23:45 -0600 Subject: [lxml-dev] Segfault on XSLT/XPath undefined variable error. In-Reply-To: <1218048927.25651.11.camel@jmk> References: <1217375154.4123.138.camel@jmk> <48900199.5020806@behnel.de> <1217451261.4123.163.camel@jmk> <489150A6.3060307@behnel.de> <1217522521.4123.183.camel@jmk> <4892B16B.5010903@behnel.de> <1218048927.25651.11.camel@jmk> Message-ID: <1218050625.25651.13.camel@jmk> On Wed, 2008-08-06 at 12:55 -0600, John Krukoff wrote: > Interestingly, this is after I've switched to calling an external C# > program to do my xml signing, and am no longer using libxmlsec. Whoops, I apologize for the misinformation. It does, in fact, only crash when using libxmlsec, as expected. I had a spurious import in my new code that was pulling in the xmlsig python bindings, even though I was no longer using them. 
-- John Krukoff Land Title Guarantee Company From cesar.ortiz at gmail.com Thu Aug 7 00:59:55 2008 From: cesar.ortiz at gmail.com (Cesar Ortiz) Date: Thu, 7 Aug 2008 00:59:55 +0200 Subject: [lxml-dev] FAIL: test_parser_target_entity In-Reply-To: <4899EE0F.8020005@behnel.de> References: <90255a70808060751x490b8780l3201de0521ca6850@mail.gmail.com> <4899EE0F.8020005@behnel.de> Message-ID: <90255a70808061559ne608357l87fc8fff617a4024@mail.gmail.com> I got it from subversion ;). I did not see the tar.gz from the site. I will have a look again.... Thank you for the quick answer. -- Cesar On Wed, Aug 6, 2008 at 8:31 PM, Stefan Behnel wrote: > Hi, > > Cesar Ortiz wrote: > > In my system I've got libxml2 2.6.26 and libxslt 1.1.17, and I tried to > > install lxml. > > > > [phe1246 at pandora lxml]$ make test > > python setup.py build_ext -i > > Building lxml version 2.2.alpha1-56897. > > Building with Cython 0.9.8. > > No need to install Cython, BTW. > > > > ====================================================================== > > FAIL: test_parser_target_entity > (lxml.tests.test_elementtree.ETreeTestCase) > > ---------------------------------------------------------------------- > > Traceback (most recent call last): > > File "/usr/lib/python2.4/unittest.py", line 260, in run > > testMethod() > > File "/home/phe1246/software/lxml/src/lxml/tests/test_elementtree.py", > > line 3417, in test_parser_target_entity > > events) > > File "/usr/lib/python2.4/unittest.py", line 333, in failUnlessEqual > > raise self.failureException, \ > > AssertionError: ['start-root', 'start-sub', 'end-sub', 'start-sub', > > 'data-this is an entity', 'end-sub', 'start-sub', 'end-sub', 'end-root'] > != > > ['start-root', 'start-sub', 'end-sub', 'start-sub', u'data-this is an > > entityan entity', 'end-sub', 'start-sub', 'end-sub', 'end-root'] > > > > ---------------------------------------------------------------------- > > That's a known bug in libxml2 <= 2.6.26.
I generally recommend using > 2.6.28 or > later, although this will only hit you if you use the target parser > together > with entities (as the test shows). > > > > NB: By the way, is it possible to get a .tar.gz version from any place? > > ... of lxml? Now I have to wonder how you actually installed it. :) > > You'll find it on the web site and on PyPI. > > Stefan > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080807/4740647c/attachment.htm From sergio at sergiomb.no-ip.org Thu Aug 7 01:27:13 2008 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Thu, 07 Aug 2008 00:27:13 +0100 Subject: [lxml-dev] FAIL: test_parser_target_entity In-Reply-To: <90255a70808061559ne608357l87fc8fff617a4024@mail.gmail.com> References: <90255a70808060751x490b8780l3201de0521ca6850@mail.gmail.com> <4899EE0F.8020005@behnel.de> <90255a70808061559ne608357l87fc8fff617a4024@mail.gmail.com> Message-ID: <1218065233.2846.21.camel@monteirov> Fedora 8 and Fedora 9 have libxml2-2.6.32 and libxslt-1.1.24, and they work nicely. IMHO you should update libxml2 and libxslt. BTW: I am another happy user of lxml, a big thanks to the lxml team! On Thu, 2008-08-07 at 00:59 +0200, Cesar Ortiz wrote: > I got it from subversion ;). > I did not see the tar.gz from the site. I will have a look again.... > > Thank you for the quick answer. > > -- Cesar > > On Wed, Aug 6, 2008 at 8:31 PM, Stefan Behnel > wrote: > Hi, > > Cesar Ortiz wrote: > > In my system I've got libxml2 2.6.26 and libxslt 1.1.17, and > I tried to > > install lxml. > > > > > [phe1246 at pandora lxml]$ make test > > python setup.py build_ext -i > > Building lxml version 2.2.alpha1-56897. > > Building with Cython 0.9.8. > > > No need to install Cython, BTW.
> > > > > ====================================================================== > > FAIL: test_parser_target_entity > (lxml.tests.test_elementtree.ETreeTestCase) > > > ---------------------------------------------------------------------- > > Traceback (most recent call last): > > File "/usr/lib/python2.4/unittest.py", line 260, in run > > testMethod() > > File > "/home/phe1246/software/lxml/src/lxml/tests/test_elementtree.py", > > line 3417, in test_parser_target_entity > > events) > > File "/usr/lib/python2.4/unittest.py", line 333, in > failUnlessEqual > > raise self.failureException, \ > > AssertionError: ['start-root', 'start-sub', 'end-sub', > 'start-sub', > > 'data-this is an entity', 'end-sub', 'start-sub', 'end-sub', > 'end-root'] != > > ['start-root', 'start-sub', 'end-sub', 'start-sub', > u'data-this is an > > entityan entity', 'end-sub', 'start-sub', 'end-sub', > 'end-root'] > > > > > ---------------------------------------------------------------------- > > > That's a known bug in libxml2 <= 2.6.26. I generally recommend > using 2.6.28 or > later, although this will only hit you if you use the target > parser together > with entities (as the test shows). > > > > NB: By the way, is it possible to get a .tar.gz version from > any place? > > > ... of lxml? Now I have to wonder how you actually installed > it. :) > > You'll find it on the web site and on PyPI. > > Stefan > > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- Sérgio M.B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080807/2dac27df/attachment.bin From lxml-dev at mlists.thewrittenword.com Thu Aug 7 10:26:44 2008 From: lxml-dev at mlists.thewrittenword.com (Gary V.
Vaughan)
Date: Thu, 7 Aug 2008 08:26:44 +0000
Subject: [lxml-dev] combining target parser class with DTD validation
Message-ID: <20080807082640.GA79728@thor.il.thewrittenword.com>

Hi,

I have an xml.sax based parser that works like this:

            .-----.                        ,-------------------.
file.xml -> | gpp | -> preprocessed.xml -> |saxexts.make_parser| --.
            `-----'                        `-------------------'   |
                                           ,-------------------.   |
                 custom class hierarchy <- |sax DocumentHandler| <-'
                                           `-------------------'

I'm in the process of converting all of this from xml.sax to lxml.etree.
The DocumentHandler function is very complex, but well debugged, so I'd
really like to convert it to work with an etree.XMLParser target keyword,
which is straightforward enough (changing startElement to start,
characters to data etc) to avoid churning the handler function and the
custom class hierarchy it builds. So far so good...

Also, performance is important, so I'm passing the output of gpp
(general pre-processor) to the parser with a feed function as gpp is
running on file.xml.

The main driver behind moving the project to lxml is to perform DTD
validation on 'preprocessed.xml', but setting a target keyword in
etree.XMLParser turns off DTD validation at parse time :(

Also, with dtd_validation=True, even xml documents with no DOCTYPE
declaration throw an exception. Since I have a zillion files that
I'd like to migrate gradually, while interoperating with other users
that haven't installed the lxml based parser yet, I only want to do
validation when there is a DOCTYPE declaration. Older files don't
have it, so I'd like to ignore the missing DTD reference on those
until they are upgraded.
My question is: what is the cleanest/fastest way to combine
  (i) passing input to the parser with a feed function
  (ii) reusing most of the sax DocumentHandler with a target class
  (iii) performing DTD validation on the fly during parsing
  (iv) but skipping validation if there is no DOCTYPE declaration

I've ended up implementing an lxml based validating parser to replace
the above like this:

    # input file
    fh = open (gpp_input, 'r')

    # For backwards compatibility, skip dtd validation when there
    # is no DOCTYPE declaration:
    match = re.compile ('^ References: <20080807082640.GA79728@thor.il.thewrittenword.com>
Message-ID: <56104.213.61.181.86.1218110809.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

Gary V. Vaughan wrote:
> I have an xml.sax based parser that works like this:
>
>             .-----.                        ,-------------------.
> file.xml -> | gpp | -> preprocessed.xml -> |saxexts.make_parser| --.
>             `-----'                        `-------------------'   |
>                                            ,-------------------.   |
>                  custom class hierarchy <- |sax DocumentHandler| <-'
>                                            `-------------------'

Excellent picture.

> The main driver behind moving the project to lxml is to perform DTD
> validation on 'preprocessed.xml', but setting a target keyword in
> etree.XMLParser turns off DTD validation at parse time :(

Hmm, I never tried that. If it doesn't work, I would assume that the
validation is done after passing the callbacks through the SAX interface
of libxml2, which lxml connects to directly, i.e.

                         -> DTD validation -> normal tree builder
input -> parser -> SAX <
                         -> lxml's target parser

(I didn't verify this, though, so I might be mistaken. Maybe it's easy to
enable validation, maybe it's not...)

> Also, with dtd_validation=True, even xml documents with no DOCTYPE
> declaration throw an exception. Since I have a zillion files that
> I'd like to migrate gradually, while interoperating with other users
> that haven't installed the lxml based parser yet, I only want to do
> validation when there is a DOCTYPE declaration.
Older files don't
> have it, so I'd like to ignore the missing DTD reference on those
> until they are upgraded.

Hmmm, I wasn't aware of that, but it seems you have to have a DOCTYPE in
your XML to use a validating parser. I understand that that can be
undesirable.

> My question is: what is the cleanest/fastest way to combine
>   (i) passing input to the parser with a feed function
>   (ii) reusing most of the sax DocumentHandler with a target class
>   (iii) performing DTD validation on the fly during parsing
>   (iv) but skipping validation if there is no DOCTYPE declaration
>
> I've ended up implementing an lxml based validating parser to replace
> the above like this:
>
>     # input file
>     fh = open (gpp_input, 'r')
>
>     # For backwards compatibility, skip dtd validation when there
>     # is no DOCTYPE declaration:
>     match = re.compile ('^ fh.seek(0)
>
>     # prepare XML parser to read data
>     parser = etree.XMLParser (dtd_validation=(match != None))
>
>     gpp_r, my_w = os.pipe ()
>     my_r, gpp_w = os.pipe ()
>     gpp = os.fork ()
>     if gpp == 0:
>         ...
>         # set up pipes to gpp stdin and stdout
>         ...
>
>     while fds:
>         ...
>         # collect gpp stdout with select
>         ...
>
>         parser.feed (gpp_output)
>
>     # get the etree
>     tree = parser.close ()
>
>     # walk the etree and fire synthetic sax events
>     xmlh = old_sax_DocumentHandler ()
>     context = etree.iterwalk (tree, events=("start", "end"))
>     for action, element in context:
>         if action == 'start':
>             xmlh.startElement (element.tag, element.attrib)
>             if element.text and hasattr (xmlh, 'characters'):
>                 xmlh.characters (element.text)
>         elif action == 'end':
>             xmlh.endElement (element.tag)
>             if element.tail and hasattr (xmlh, 'characters'):
>                 xmlh.characters (element.tail)

You are aware of lxml.sax? Although your code is fairly short and special,
so I guess it's fair enough to just use this. It might even be faster than
lxml.sax after all...
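
The tree walk quoted above can be tried in isolation. What follows is only a sketch under stated assumptions: `RecordingHandler` and `fire_sax_events` are hypothetical names, not part of lxml or of the original handler, and a plain recursive walk stands in for the `iterwalk` loop. It falls back to the standard library's ElementTree when lxml is not installed. For namespaced documents, `lxml.sax.saxify()` is the library route; it drives the `startElementNS`/`endElementNS` callbacks instead.

```python
try:
    from lxml import etree  # preferred when installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, enough for this sketch

from xml.sax.handler import ContentHandler


class RecordingHandler(ContentHandler):
    """Hypothetical stand-in for the existing SAX handler: records events."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("data", content))


def fire_sax_events(element, handler):
    """Replay an already parsed element tree as SAX-style callbacks.

    Assumes the tree holds only elements and text (no comments or
    processing instructions, which would need separate handling).
    """
    handler.startElement(element.tag, element.attrib)
    if element.text:
        handler.characters(element.text)
    for child in element:
        fire_sax_events(child, handler)
        # Tail text belongs to the parent's content, after the child closes.
        if child.tail:
            handler.characters(child.tail)
    handler.endElement(element.tag)


root = etree.fromstring("<root><sub>text</sub>tail</root>")
handler = RecordingHandler()
fire_sax_events(root, handler)
```

Walking a finished tree this way costs a second pass over the document, which is the overhead discussed in this thread; the benefit is that the parser itself can stay a plain validating parser.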
> It works well enough, but it feels kludgy to manually peek into each
> xml file and look for a DOCTYPE at the start; and since I have to walk
> the tree once while building it and again when calling the handler
> function, I'm sure it is slower than it could be.

I know, threads are often frowned upon in Python, but given that
lxml.etree frees the GIL for all sorts of C-level and I/O operations, they
can really help you here to speed up your overall processing of your
"zillions" of files.

A couple of ideas for improvements:

1) if there is a way to let gpp read its input file directly through a
command line option - do that. It keeps your Python interpreter from
wasting GIL time on copying data from the outside world back into it.

2) connect lxml's parser directly to gpp's output pipe and run it in a
separate thread.

3) try to do post-parsing validation instead of on-the-fly validation by
validating with tree.docinfo.externalDTD (not sure if that's faster,
YMMV). Again, do this in a thread.

Your application looks heavily I/O bound, so even if you do things in a
tree instead of on-the-way-in, doing things in parallel will give you a
big speed-up.

Hope this helps.

Stefan

From ivanov.maxim at gmail.com Fri Aug 8 06:54:57 2008
From: ivanov.maxim at gmail.com (Max Ivanov)
Date: Fri, 8 Aug 2008 08:54:57 +0400
Subject: [lxml-dev] Some HTML target processing issues
Message-ID:

Hi all!

I've attached a small piece of code.
lxml target parsing has some problems from my point of view.

1) I use lxml.html.HTMLParser which should handle unknown HTML tags
since it uses lxml.html.HtmlElementClassLookup which contains this code
in its lookup function:
"if node_type == 'element': return
self._element_classes.get(name.lower(), HtmlElement)". If I understand
it right, then even unknown tags should be handled properly.
But I still get an error at the end of the code: lxml.etree.XMLSyntaxError: Tag noindex invalid, line 266, column 17 I don't understand why, I hope someone can give me some good advice :) 2) Even if the whole process fails, etree.fromstring continues to call target methods (start, end, comment etc...) even after an invalid tag has appeared. It's ok, it's some sort of fault tolerance. But why does it not call target.close() at the end? Instead of that it raises an exception. If document processing continues even after an error, then call target.close() too! Maybe it's better to pass all errors that occurred to the close function, so the target could decide what to do. 3) lxml should stop processing when the target raises an exception. Nowadays it's just ignored and everything continues. From stefan_ml at behnel.de Fri Aug 8 07:18:25 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 08 Aug 2008 07:18:25 +0200 Subject: [lxml-dev] Some HTML target processing issues In-Reply-To: References: Message-ID: <489BD721.2050703@behnel.de> Hi, Max Ivanov wrote: > I've attached small piece of code. No. Anyway, HTML target parsing (or rather: target parsing with the "recover" option) is rarely used, so you might have run into a bug due to lack of testing. > lxml target parsing has some problems from my point of view. > > 1) I use lxml.html.HTMLParser which should handle unknown HTML tags > since it uses lxml.html.HtmlElementClassLookup which contains this code > in its lookup function: > "if node_type == 'element': return > self._element_classes.get(name.lower(), HtmlElement)". If I understand > it right, then even unknown tags should be handled properly. But I > still get an error at the end of the code: lxml.etree.XMLSyntaxError: Tag > noindex invalid, line 266, column 17 You mix two different things here. The error you get comes from the parser, whereas the lookup is called by the machinery that wraps an already parsed XML node as an Element (i.e. much later). 
> 2) Even if the whole process fails, etree.fromstring continues to call > target methods (start, end, comment etc...) even after an invalid tag > has appeared. It's ok, it's some sort of fault tolerance. But > why does it not call target.close() at the end? Instead of that it > raises an exception. Sounds like a bug to me. When you parse with recovery enabled, it should also finish gracefully for the parser target. > Maybe it's better to pass all errors that occurred > to the close function, so the target could decide what to do. That's not part of the API. Besides, you can find the errors (and warnings) in the error log. > 3) lxml should stop processing when the target raises an exception. Nowadays > it's just ignored and everything continues. Might be another problem related to "recover" parsing, or a general problem. I'll look into it when I find the time. Can you come up with a patch with a couple of simple test cases for src/lxml/tests/test_htmlparser.py that show the three problems you describe? That usually makes them easier (read: faster) to fix. There are some target parser test cases in test_etree.py and test_elementtree.py that you can look at for inspiration. 
Stefan From stefan_ml at behnel.de Fri Aug 8 07:57:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 08 Aug 2008 07:57:05 +0200 Subject: [lxml-dev] [Bug 255800] parser.feed(t) + parser.feed('\n') != parser.feed(t+'\n') In-Reply-To: <20080808073407.7ef2cbb7@g10.unix> References: <20080807165147.2420.76578.malonedeb@gandwana.canonical.com> <20080807171305.30382.88958.malone@gangotri.canonical.com> <20080808073407.7ef2cbb7@g10.unix> Message-ID: <489BE031.9090301@behnel.de> (forwarding this to the list) Vladimir Vcelak wrote: > On Thu, 07 Aug 2008 19:13:04 +0200 (CEST) Stefan Behnel wrote: >> Vladimir Vcelak wrote: >>> >>> t = '<a>x</a>' >>> >>> parser = etree.XMLParser(target = EchoTarget()) >>> >>> parser.feed(t) ; parser.feed('\n') >>> start a {} >>> data u'x' >>> end a >>> >>> >>> parser = etree.XMLParser(target = EchoTarget()) >>> >>> parser.feed(t + '\n') >>> (prints nothing!!) >> >> The parser is free to start/stop/continue parsing at any time. As long >> as the callback sequence in both cases is the same after >> close(), I don't see the bug here. > > Thank you for your answer, but I can't depend on calling close(). My idea is to use XML for dialogue between Server and Client: > > SERVER: > from lxml import etree > > class MyParser: > def start(): > level += 1 > def end(): > level -= 1 > if level == 0: > end_command = True > > socket = create_TCP_server(port) > while True: > s = socket.accept() > end_command = False > parser = etree.XMLParser(target = MyTarget()) > while not end_command: > parser.feed(s.readline()) Have you actually tried calling close() at this point? It might even work, and it will raise an exception if the command wasn't well-formed. BTW, I didn't check, but creating the parser once outside the loop and making sure you call close() after each iteration might actually be enough. 
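The point about close() can be illustrated with a small sketch. It uses the standard library's `xml.etree.ElementTree.XMLParser` as a stand-in, since it exposes the same target interface (`start`/`data`/`end`/`close`, plus `feed()` and `close()` on the parser); whether lxml delivers callbacks at the same moments is not shown here. `RecordingTarget` and `parse_in_chunks` are hypothetical names:

```python
from xml.etree.ElementTree import XMLParser


class RecordingTarget:
    """Parser target that records callbacks instead of printing them."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def data(self, text):
        # Merge adjacent text chunks: a parser may deliver text in pieces.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + text)
        else:
            self.events.append(("data", text))

    def end(self, tag):
        self.events.append(("end", tag))

    def close(self):
        # Whatever the target returns here is returned by parser.close().
        return self.events


def parse_in_chunks(chunks):
    """Feed the chunks one by one, then close the parser."""
    parser = XMLParser(target=RecordingTarget())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


split = parse_in_chunks(["<a>x</a>", "\n"])
joined = parse_in_chunks(["<a>x</a>\n"])
print(split == joined)
```

Only the sequence observed after close() is compared. When each callback fires between feed() calls is left to the parser, which is the behaviour described above.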
> answer = execute(command) > s.write(answer) > > > > CLIENT: > socket = connect_to(server, port) > socket.write(" ..........\n") > answer = socket.read() > socket.write(" ..........\n") > answer = socket.read() > .... > > or CLIENT for debug: > telnet server port > .... > ... > > answer > .... > ... > > answer From stefan_ml at behnel.de Fri Aug 8 08:25:03 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 08 Aug 2008 08:25:03 +0200 Subject: [lxml-dev] Segfault on XSLT/XPath undefined variable error. In-Reply-To: <1218048927.25651.11.camel@jmk> References: <1217375154.4123.138.camel@jmk> <48900199.5020806@behnel.de> <1217451261.4123.163.camel@jmk> <489150A6.3060307@behnel.de> <1217522521.4123.183.camel@jmk> <4892B16B.5010903@behnel.de> <1218048927.25651.11.camel@jmk> Message-ID: <489BE6BF.8090606@behnel.de> Hi, John Krukoff wrote: > On Fri, 2008-08-01 at 08:47 +0200, Stefan Behnel wrote: >> Hi, >> >> John Krukoff wrote: >>> Okay, can only get it to crash when first signing a document using >>> libxmlsec, so I suppose I'll simply assume that the two libraries use >>> the error log in incompatible ways. >> could you check if this patch makes it work better for you? It basically >> restricts XSLT error logging to the lifetime of an XSL transformation. >> >> Stefan > > I still need to compile lxml with -ggdb, where do I stick that in the > setup.py/makefile? Pass the "CFLAGS" env variable when calling setup.py, as in CFLAGS="-O -ggdb" make clean inplace > still crashed with the patch for me: > > Core was generated by `/usr/bin/python -tt ./Adapter.py'. > Program terminated with signal 11, Segmentation fault. 
> #0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > (gdb) bt > #0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > #1 0xb774bb12 in __pyx_f_4lxml_5etree__receiveXSLTError () > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > #2 0xb76baef9 in xsltPrintErrorContext () from /usr/lib/libxslt.so.1 > #3 0xb76bb091 in xsltTransformError () from /usr/lib/libxslt.so.1 > #4 0xb76dd434 in xsltValueOf () from /usr/lib/libxslt.so.1 > #5 0xb76da5ba in ?? () from /usr/lib/libxslt.so.1 > #6 0x0878e718 in ?? () > #7 0x08432c60 in ?? () > #8 0x0878ec78 in ?? () > #9 0x0878f6f8 in ?? () > #10 0x00000000 in ?? () > > Fortunately, I've been able to simplify my crash conditions somewhat, so > the valgrind log is significantly shorter. You can generally strip the "... within /lib/ld-2.6..." entries from it. > Looks like I'll need to find some time to work on that test case after > all. Please do, and also re-run your valgrind test with -ggdb. There seem to be some interesting problems in there (the "moveNodeToDocument()" sections), even if only crashes later. Seeing the line numbers here would really be helpful. Stefan From stefan_ml at behnel.de Fri Aug 8 14:01:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 8 Aug 2008 14:01:37 +0200 (CEST) Subject: [lxml-dev] Some HTML target processing issues In-Reply-To: References: <489BD721.2050703@behnel.de> Message-ID: <53981.213.61.181.86.1218196897.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Hi, please keep the list involved. Max Ivanov wrote: > Then how could I add tolerance to unknown tag into HTMLParser? You can't change the parser. It already parses with the "recover" option, so it tries to keep going as long as possible. 
The problem here is that when you use a target parser, it currently raises an exception at the end if errors occurred during the parsing. It *might* be better to disable that based on the recover option, but I'll have to look into that. >> Can you come up with a patch with a couple of simple test cases for >> src/lxml/tests/test_htmlparser.py that show the three problems you >> describe? >> That usually makes them easier (read: faster) to fix. There are some >> target >> parser test cases in test_etree.py and test_elementtree.py that you can >> look at for inspiration. > > Thx, I'll try to write tests, but I've never done it before. It looks > quite clear, but I've no idea how to run tests itself. It's pretty easy. Each test has a method in the test case class that will be called by the test runner. Reading a few of the existing test methods should get you going. There is a script "test.py" in the root directory that you can call to run the tests ("make test" does that, for example). It will walk through the directory hierarchy and collect all test classes it finds into a unit test suite (based on the unittest module), and then run them. Try "python test.py -vv" to get some verbose output. Stefan From jkrukoff at ltgc.com Fri Aug 8 17:55:10 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Fri, 08 Aug 2008 09:55:10 -0600 Subject: [lxml-dev] Converting from XML to HTML parsed trees. Message-ID: <1218210910.25651.48.camel@jmk> I have some XML data already parsed into an lxml ElementTree. Is there any easy way to reparse that using lxml.html, or is the only way to do it to serialize the XML to a string and reparse using one of the lxml.html parsers? -- John Krukoff Land Title Guarantee Company From stefan_ml at behnel.de Fri Aug 8 18:30:49 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 08 Aug 2008 18:30:49 +0200 Subject: [lxml-dev] Converting from XML to HTML parsed trees. 
In-Reply-To: <1218210910.25651.48.camel@jmk> References: <1218210910.25651.48.camel@jmk> Message-ID: <489C74B9.60500@behnel.de> Hi, John Krukoff wrote: > I have some XML data already parsed into an lxml ElementTree. Is there > any easy way to reparse that using lxml.html, or is the only way to do > it to serialize the XML to a string and reparse using one of the > lxml.html parsers? In 2.1, lxml.html has two functions html_to_xhtml() and xhtml_to_html() that might do what you want, but they will not change the tree API into the one of lxml.html. There are two ways to do that: 1) lxml's parser and serialiser are so fast that it might actually be fast enough to serialise into HTML (method="html") and parse using lxml.html. 2) Create a new Element using lxml.html and append all children of your original root Element to it. They will then inherit the lxml.html API from their root. In this case, you have to make sure that you let go of all references to these Elements, though, as the Element proxy objects will keep their API as long as they stay alive. Just try both to see what works best for you. Stefan From Peter.Santoro at po.state.ct.us Fri Aug 8 18:39:40 2008 From: Peter.Santoro at po.state.ct.us (Santoro, Peter) Date: Fri, 8 Aug 2008 12:39:40 -0400 Subject: [lxml-dev] schema validation support Message-ID: <9FB57EBF8B1BE64E85408F356B3DBC510197ACB6@DRS-EX100.drs-h-main.drs.state.ct.us> I have used xerces/java, in the past, to do xml work without schema validation. As I now prefer to use python, I'm considering using lxml to validate xml instance documents against associated schemas. So far, I'm impressed with my initial testing. Thank you for making this software available! I'm curious if anyone has compared/benchmarked lxml/libxml2 schema validation support with what's available in Apache's xerces project? What, if any, limitations to the lxml/libxml2 schema validation support have others hit that I should be aware of? 
What, if any, limitations to the xerces schema validation support have others hit that I should be aware of? I'm asking these questions, because I recently noticed on the underlying libxml2 library's web site states that the xml schema support is incomplete (http://xmlsoft.org/index.html) and that it is being finished up (http://xmlsoft.org/news.html). To be fair, it appears that there are limitations with xerces schema validation support (http://xerces.apache.org/xerces-c/schema.html), too. Thank you, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080808/8ecf7466/attachment.htm From stefan_ml at behnel.de Sat Aug 9 14:19:00 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 09 Aug 2008 14:19:00 +0200 Subject: [lxml-dev] Odd looking exception In-Reply-To: References: Message-ID: <489D8B34.3040602@behnel.de> Hi, Sidnei da Silva wrote: > I've just got an exception while upgrading to lxml 2.1.1. Still trying > to find out if it's some backwards-incompatible change. But what > triggered my attention is that the exception seems to have tried to > display the line number from etree.c but failed: > > File "enfold\lxml\transform.pyo", line 429, in transform > File "xslt.pxi", line 399, in lxml.etree.XSLT.__init__ > (src/lxml/lxml.etree.c:%u) > XSLTParseError: Cannot parse stylesheet > > This is on Windows FWIW. Might be worth it. Does anyone know if there is a platform dependent difference regarding PyString_FromFormat("%u", some_int) ? At least the Python docs say that the processing of the format string will stop short when it encounters an unknown format, so that might be what's happening here. I noticed that Cython actually uses a normal (signed) int here, so "%d" would have been the correct format anyway. I changed that, so we'll see if that makes it work for you in the next release (I don't think it's worth bothering before that). 
Stefan From stefan_ml at behnel.de Sun Aug 10 08:14:13 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 10 Aug 2008 08:14:13 +0200 Subject: [lxml-dev] schema validation support In-Reply-To: <9FB57EBF8B1BE64E85408F356B3DBC510197ACB6@DRS-EX100.drs-h-main.drs.state.ct.us> References: <9FB57EBF8B1BE64E85408F356B3DBC510197ACB6@DRS-EX100.drs-h-main.drs.state.ct.us> Message-ID: <489E8735.9020807@behnel.de> Hi, Santoro, Peter wrote: > I have used xerces/java, in the past, to do xml work without schema > validation. As I now prefer to use python, I'm considering using lxml > to validate xml instance documents against associated schemas. So far, > I'm impressed with my initial testing. Thank you for making this > software available! :) > I'm curious if anyone has compared/benchmarked lxml/libxml2 schema > validation support with what's available in Apache's xerces project? Here's a very old benchmark, no idea how they compare today. http://xmlbench.sourceforge.net/results/benchmark200402/index.html If you can come up with some meaningful numbers yourself, please post them either here or on the libxml2 list. I would expect others to be interested, too. > What, if any, limitations to the lxml/libxml2 schema validation support > have others hit that I should be aware of? While, theoretically, I know that there are still some incompletely supported schema constructions, I know of none that you should be particularly aware of. Maybe others can comment here who actually /did/ encounter anything in practice, but a better place to ask (and search) is the libxml2 list. > I'm asking these questions, because I recently noticed on the underlying > libxml2 library's web site states that the xml schema support is > incomplete (http://xmlsoft.org/index.html) and that it is being finished > up (http://xmlsoft.org/news.html). To be fair, it appears that there > are limitations with xerces schema validation support > (http://xerces.apache.org/xerces-c/schema.html), too. 
Given the complexity of the XML Schema standard, I would be surprised if there was really an implementation that works 100% for all possible schemas. Note that you can't actually prove that by testing. All you can prove is that there are problems, not that there are none. That's why compliance test suites can never be called complete. Stefan From jkrukoff at ltgc.com Mon Aug 11 20:29:51 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Mon, 11 Aug 2008 12:29:51 -0600 Subject: [lxml-dev] Segfault on XSLT/XPath undefined variable error. In-Reply-To: <489BE6BF.8090606@behnel.de> References: <1217375154.4123.138.camel@jmk> <48900199.5020806@behnel.de> <1217451261.4123.163.camel@jmk> <489150A6.3060307@behnel.de> <1217522521.4123.183.camel@jmk> <4892B16B.5010903@behnel.de> <1218048927.25651.11.camel@jmk> <489BE6BF.8090606@behnel.de> Message-ID: <1218479391.25651.85.camel@jmk> On Fri, 2008-08-08 at 08:25 +0200, Stefan Behnel wrote: > Hi, > > John Krukoff wrote: > > On Fri, 2008-08-01 at 08:47 +0200, Stefan Behnel wrote: > >> Hi, > >> > >> John Krukoff wrote: > >>> Okay, can only get it to crash when first signing a document using > >>> libxmlsec, so I suppose I'll simply assume that the two libraries use > >>> the error log in incompatible ways. > >> could you check if this patch makes it work better for you? It basically > >> restricts XSLT error logging to the lifetime of an XSL transformation. > >> > >> Stefan > > > > I still need to compile lxml with -ggdb, where do I stick that in the > > setup.py/makefile? > > Pass the "CFLAGS" env variable when calling setup.py, as in > > CFLAGS="-O -ggdb" make clean inplace > > > > still crashed with the patch for me: > > > > Core was generated by `/usr/bin/python -tt ./Adapter.py'. > > Program terminated with signal 11, Segmentation fault. 
> > #0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () > > > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > > (gdb) bt > > #0 0xb774b740 in __pyx_f_4lxml_5etree__forwardError () > > > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > > #1 0xb774bb12 in __pyx_f_4lxml_5etree__receiveXSLTError () > > > > from /usr/lib/python2.5/site-packages/lxml-2.1.1-py2.5-linux-i686.egg/lxml/etree.so > > #2 0xb76baef9 in xsltPrintErrorContext () from /usr/lib/libxslt.so.1 > > #3 0xb76bb091 in xsltTransformError () from /usr/lib/libxslt.so.1 > > #4 0xb76dd434 in xsltValueOf () from /usr/lib/libxslt.so.1 > > #5 0xb76da5ba in ?? () from /usr/lib/libxslt.so.1 > > #6 0x0878e718 in ?? () > > #7 0x08432c60 in ?? () > > #8 0x0878ec78 in ?? () > > #9 0x0878f6f8 in ?? () > > #10 0x00000000 in ?? () > > > > Fortunately, I've been able to simplify my crash conditions somewhat, so > > the valgrind log is significantly shorter. > > You can generally strip the "... within /lib/ld-2.6..." entries from it. > > > > Looks like I'll need to find some time to work on that test case after > > all. > > Please do, and also re-run your valgrind test with -ggdb. There seem to be > some interesting problems in there (the "moveNodeToDocument()" sections), even > if only crashes later. Seeing the line numbers here would really be helpful. > > Stefan Okay, I can get a segfault with this minimal test case, where variable.xslt is the previously sent XSLT sheet which references an undefined parameter in an xpath expression: Python 2.5.2 (r252:60911, Jul 31 2008, 15:38:58) [GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from lxml import etree >>> import xmlsig >>> etree.XSLT( etree.parse( 'variable.xslt' ) )( etree.XML( '' ) ) xmlXPathCompiledEval: evaluation failed Segmentation fault However, changing the import order causes the crash to go away, I assume due to order of initialization in the logging code: ~/Projects/Gizmo/www/Samples 17$ python Python 2.5.2 (r252:60911, Jul 31 2008, 15:38:58) [GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import xmlsig >>> from lxml import etree >>> etree.XSLT( etree.parse( 'variable.xslt' ) )( etree.XML( '' ) ) Traceback (most recent call last): File "", line 1, in File "xslt.pxi", line 529, in lxml.etree.XSLT.__call__ (src/lxml/lxml.etree.c:91033) lxml.etree.XSLTApplyError: XPath evaluation returned no result. Here is the backtrace from the crash with debugging info: Core was generated by `python'. Program terminated with signal 11, Segmentation fault. #0 0xb78ce53b in __pyx_f_4lxml_5etree__forwardError (__pyx_v_c_log_handler=, __pyx_v_error=0xbfba01c8) at src/lxml/lxml.etree.c:40855 40855 ((struct __pyx_vtabstruct_4lxml_5etree__BaseErrorLog *)__pyx_v_log_handler->__pyx_vtab)->_receive(__pyx_v_log_handler, __pyx_v_error); (gdb) bt #0 0xb78ce53b in __pyx_f_4lxml_5etree__forwardError (__pyx_v_c_log_handler=, __pyx_v_error=0xbfba01c8) at src/lxml/lxml.etree.c:40855 #1 0xb78ce79c in __pyx_f_4lxml_5etree__receiveXSLTError (__pyx_v_c_log_handler=0xb75fcf04, __pyx_v_msg=0xb7880450 "%s: file %s line %d element %s\n") at src/lxml/lxml.etree.c:41371 #2 0xb7859ef9 in xsltPrintErrorContext () from /usr/lib/libxslt.so.1 #3 0xb785a091 in xsltTransformError () from /usr/lib/libxslt.so.1 #4 0xb787c434 in xsltValueOf () from /usr/lib/libxslt.so.1 #5 0xb78795ba in ?? () from /usr/lib/libxslt.so.1 #6 0x0812bdb8 in ?? () #7 0x08127100 in ?? () #8 0x08124b60 in ?? () #9 0x08108eb0 in ?? () #10 0x00000000 in ?? () Attached is the valgrind log from the crash. Hope this helps. 
All of these are with the XSLT error logging patch you sent along previously. -- John Krukoff Land Title Guarantee Company -------------- next part -------------- ==16392== Memcheck, a memory error detector. ==16392== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al. ==16392== Using LibVEX rev 1732, a library for dynamic binary translation. ==16392== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP. ==16392== Using valgrind-3.2.3, a dynamic binary instrumentation framework. ==16392== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al. ==16392== For more details, rerun with: -v ==16392== ==16392== My PID = 16392, parent PID = 15505. Prog and args are: ==16392== python ==16392== ==16392== Invalid read of size 4 ==16392== at 0x40153F0: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within /lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x40118C3: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0) ==16392== by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0) ==16392== Address 0x44B5EE0 is 40 bytes inside a block of size 43 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x4007715: (within /lib/ld-2.6.1.so) ==16392== by 0x4008156: (within /lib/ld-2.6.1.so) ==16392== by 0x40118C3: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) 
==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0) ==16392== by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x40153D9: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within /lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== Address 0x44B6244 is 20 bytes inside a block of size 22 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x40087B8: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x4015407: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within /lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x40118C3: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) 
==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0) ==16392== by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0) ==16392== Address 0x44CC434 is 76 bytes inside a block of size 79 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x4007715: (within /lib/ld-2.6.1.so) ==16392== by 0x4008156: (within /lib/ld-2.6.1.so) ==16392== by 0x40118C3: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== by 0x4126245: _PyImport_GetDynLoadFunc (in /usr/lib/libpython2.5.so.1.0) ==16392== by 0x4113698: _PyImport_LoadDynamicModule (in /usr/lib/libpython2.5.so.1.0) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x4015407: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within /lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== Address 0x44CD144 is 12 bytes inside a block of size 15 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x40087B8: (within /lib/ld-2.6.1.so) ==16392== by 
0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x40153F0: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within /lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== Address 0x44CD4C8 is 24 bytes inside a block of size 25 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x40087B8: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x40153C3: (within /lib/ld-2.6.1.so) ==16392== by 0x4006337: (within 
/lib/ld-2.6.1.so) ==16392== by 0x4008217: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== Address 0x4D80A78 is 32 bytes inside a block of size 33 alloc'd ==16392== at 0x4022760: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so) ==16392== by 0x40087B8: (within /lib/ld-2.6.1.so) ==16392== by 0x400BF55: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x400C14F: (within /lib/ld-2.6.1.so) ==16392== by 0x401191E: (within /lib/ld-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x40112CD: (within /lib/ld-2.6.1.so) ==16392== by 0x41A6C4C: (within /lib/libdl-2.6.1.so) ==16392== by 0x400D891: (within /lib/ld-2.6.1.so) ==16392== by 0x41A70EB: (within /lib/libdl-2.6.1.so) ==16392== by 0x41A6B80: dlopen (in /lib/libdl-2.6.1.so) ==16392== ==16392== Invalid read of size 4 ==16392== at 0x483853B: __pyx_f_4lxml_5etree__forwardError (lxml.etree.c:40855) ==16392== by 0x483879B: __pyx_f_4lxml_5etree__receiveXSLTError (lxml.etree.c:41371) ==16392== by 0x48DEEF8: xsltPrintErrorContext (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48DF090: xsltTransformError (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x4901433: xsltValueOf (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FE5B9: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FEF5A: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FDFFA: xsltProcessOneNode (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x49034C9: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x4903A46: xsltApplyStylesheetUser (in 
/usr/lib/libxslt.so.1.1.24) ==16392== by 0x4839B90: __pyx_f_4lxml_5etree_4XSLT__run_transform (lxml.etree.c:91598) ==16392== by 0x4853E42: __pyx_pf_4lxml_5etree_4XSLT___call__ (lxml.etree.c:90587) ==16392== Address 0x706D6F43 is not stack'd, malloc'd or (recently) free'd ==16392== ==16392== Process terminating with default action of signal 11 (SIGSEGV) ==16392== Access not within mapped region at address 0x706D6F43 ==16392== at 0x483853B: __pyx_f_4lxml_5etree__forwardError (lxml.etree.c:40855) ==16392== by 0x483879B: __pyx_f_4lxml_5etree__receiveXSLTError (lxml.etree.c:41371) ==16392== by 0x48DEEF8: xsltPrintErrorContext (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48DF090: xsltTransformError (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x4901433: xsltValueOf (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FE5B9: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FEF5A: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x48FDFFA: xsltProcessOneNode (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x49034C9: (within /usr/lib/libxslt.so.1.1.24) ==16392== by 0x4903A46: xsltApplyStylesheetUser (in /usr/lib/libxslt.so.1.1.24) ==16392== by 0x4839B90: __pyx_f_4lxml_5etree_4XSLT__run_transform (lxml.etree.c:91598) ==16392== by 0x4853E42: __pyx_pf_4lxml_5etree_4XSLT___call__ (lxml.etree.c:90587) ==16392== ==16392== ERROR SUMMARY: 17 errors from 7 contexts (suppressed: 623 from 7) ==16392== malloc/free: in use at exit: 1,970,208 bytes in 3,841 blocks. ==16392== malloc/free: 11,248 allocs, 7,407 frees, 4,586,049 bytes allocated. ==16392== For counts of detected errors, rerun with: -v ==16392== searching for pointers to 3,841 not-freed blocks. ==16392== checked 2,393,756 bytes. ==16392== ==16392== LEAK SUMMARY: ==16392== definitely lost: 0 bytes in 0 blocks. ==16392== possibly lost: 27,020 bytes in 82 blocks. ==16392== still reachable: 1,943,188 bytes in 3,759 blocks. ==16392== suppressed: 0 bytes in 0 blocks. 
==16392== Rerun with --leak-check=full to see details of leaked memory.

From bkc at murkworks.com Wed Aug 13 07:29:36 2008
From: bkc at murkworks.com (Brad Clements)
Date: Wed, 13 Aug 2008 01:29:36 -0400
Subject: [lxml-dev] problems with document(''), possibly thread related
Message-ID: <48A27140.2060005@murkworks.com>

I have a stylesheet that uses document('') to reference itself.

The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10

However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it does not work.

I then wrote a simple test case (thinking.. aha, I'll report this error), but of course the test case functions correctly.

I've spent 4 hours working on this tonight, I'm pooped, and going nuts.

basically given an xml document whose root element is "" and a stylesheet that has:

From within the threaded wsgi app, the output I get from this is "root", but from the test case and from xsltproc, I get "xsl:stylesheet"

My code is more or less like this:

    ss_parser = etree.XMLParser(load_dtd=True)
    ss_parser.resolvers.add(Resolver())
    stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser, base_url='http://mystylesheet.xsl')
    stylesheet = etree.XSLT(stylesheet_doc)

    doc_parser = etree.XMLParser(load_dtd=True)
    doc_parser.resolvers.add(Resolver())
    xml_doc = etree.fromstring(xml_src, doc_parser, base_url='http://myfile.xml')

however base_url is some real value when called from wsgi, it's threaded, and my Resolver.resolve method does get called in the wsgi app, but not from the test app.

Before I give up, can someone suggest ways in which using lxml from within a threaded app might somehow "break" resolving document(''), but non-threaded it works ok?

I don't think I'm using the same parser object for the stylesheet and xml document, the real wsgi code is a tad complicated.
However the stylesheet and xml document should be parsed and used within the same thread (which just happens to not be the main thread) I believe this works ok on lxml 1.1.2, but I've already updated my code to use 'base_url' and so forth and I'm too worn out to change all that code just to test a theory. So .. any ideas on what could cause this? thanks for any suggestions.. -- Brad Clements, bkc at murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM: BKClements From bkc at murkworks.com Thu Aug 14 04:20:59 2008 From: bkc at murkworks.com (Brad Clements) Date: Wed, 13 Aug 2008 22:20:59 -0400 Subject: [lxml-dev] problems with document(''), possibly thread related - LXML 'BUG' In-Reply-To: <48A27140.2060005@murkworks.com> References: <48A27140.2060005@murkworks.com> Message-ID: <48A3968B.3000207@murkworks.com> Brad Clements wrote: > I have a stylesheet that uses document('') to reference itself. > > The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10 > > However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it > does not work. > Now that I've had some sleep and another hour of google time, I have been able to recreate the problem in a test program. The big clue came from this old thread from 2006: http://article.gmane.org/gmane.comp.python.lxml.devel/1083/match=document Basically that post makes me think that the document('') problem is related to base_url passed to fromstring() In that, when document('') is processed, the base_url is used to look up the stylesheet's canonical "URL", and then that URL is used to retrieve the xml document tree that represents the stylesheet. The problem here is that base_url could be wrong.. It could be the same value as some other document. In fact, I can recreate the problem by setting base_url to the same value for both the xml source and the stylesheet source. My understanding of the reason for base_url was just so that resolvers would have a basis for resolving relative lookups. 
That is certainly how I use base_url ... as the only mechanism to set the URL that is passed to my custom resolver.

After spending more than 5 hours troubleshooting this "problem" with document(''), I'm going to say that this is a design flaw in lxml. I'm thinking that using base_url as a way to get back the original stylesheet XML was convenient for the lxml developers, but has left a big undocumented pitfall for lxml users.

The only documentation I could find on the website about base_url is on http://codespeak.net/lxml/parsing.html#parsers where no mention is made about the requirement to NOT use the same base_url for different documents.

Of course, I could be wrong here and I don't want to get anyone upset by making invalid claims.

My test case program is shown below. When base_url is the same value for both the stylesheet and the xml document, document('') fails in the stylesheet. If base_url is different, it works.

--------------- test.py -----------
# demonstrate problem with self-reference stylesheet in lxml
# problem occurs when base_url is the same for both the stylesheet and
# the xml document.

from lxml import etree

class Resolver(etree.Resolver):
    def __init__(self):
        # initialise the etree.Resolver base class
        super(Resolver, self).__init__()

    def resolve(self, URL, ID, ctxt):
        # log every lookup, then decline so that lxml falls back to its default
        print("RESOLVE URL %r" % (URL, ))
        return None

stylesheet_src = """ Hi!
xf model id:
expected value is: location-selector-model
"""

xml_src = """ """

def test():
    ss_parser = etree.XMLParser(load_dtd=True)
    ss_parser.resolvers.add(Resolver())
    # same base_url as the xml document below: this triggers the problem
    stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser, base_url='http://myfile.xml')
    stylesheet = etree.XSLT(stylesheet_doc)

    doc_parser = etree.XMLParser(load_dtd=True)
    doc_parser.resolvers.add(Resolver())
    xml_doc = etree.fromstring(xml_src, doc_parser, base_url='http://myfile.xml')

    print("%s" % stylesheet(xml_doc))

if __name__ == "__main__":
    test()

-- 
Brad Clements, bkc at murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM: BKClements

From stefan_ml at behnel.de Thu Aug 14 20:11:45 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 14 Aug 2008 20:11:45 +0200
Subject: [lxml-dev] problems with document(''), possibly thread related - LXML 'BUG'
In-Reply-To: <48A3968B.3000207@murkworks.com>
References: <48A27140.2060005@murkworks.com> <48A3968B.3000207@murkworks.com>
Message-ID: <48A47561.4020607@behnel.de>

Hi,

Brad Clements wrote:
> when document('') is processed, the base_url is used to look up
> the stylesheet's canonical "URL", and then that URL is used to retrieve
> the xml document tree that represents the stylesheet.

Yes, it's common to look up a document by its URL. That's an optimisation used by libxslt, too, so if you assign the same URL to different documents, you will run into problems, whether lxml does this or not.

> The problem here is that base_url could be wrong.. It could be the same
> value as some other document. In fact, I can recreate the problem by
> setting base_url to the same value for both the xml source and the
> stylesheet source.

You are deliberately lying to lxml, and still expect it to be so kind to do the right thing regardless?

> My understanding of the reason for base_url was just so that resolvers
> would have a basis for resolving relative lookups. That is certainly how
> I use base_url ... as the only mechanism to set the URL that is passed
> to my custom resolver.
Yes, that's one way of using it. Others may use it differently. > this is a design flaw in lxml. I'm thinking that using base_url as a way > to get back the original stylesheet XML was convenient for the lxml > developers, but has left a big undocumented pitfall for lxml users. And it's easy to work around by providing unique URLs for each document. If you think the documentation should be improved, please submit a patch. > The only documentation I could find on the website about base_url is on > http://codespeak.net/lxml/parsing.html#parsers where no mention is made > about the requirement to NOT use the same base_url for different documents. It sounds to me like the misunderstanding here is largely based on what the "base URL" of a document is. It's the URL that defines the origin of the document. Assuming that you will get the same document when you re-read its URL is not that a stupid idea, IMHO. Otherwise, the XSLT processor would have to re-parse a document each time it encounters a document() reference. That would really hurt performance. > My test case program is shown below, > when base_url is the same value for both the stylesheet and the xml > document, then document('') fails in the stylesheet. > If base_url is different, it works. I agree that separate documentation paragraphs in the parser documentation, the resolver documentation, and the XSLT documentation would help here. Maybe you can write up something? Stefan From bkc at murkworks.com Thu Aug 14 21:07:15 2008 From: bkc at murkworks.com (Brad Clements) Date: Thu, 14 Aug 2008 15:07:15 -0400 Subject: [lxml-dev] problems with document(''), possibly thread related - LXML 'BUG' In-Reply-To: <48A47561.4020607@behnel.de> References: <48A27140.2060005@murkworks.com> <48A3968B.3000207@murkworks.com> <48A47561.4020607@behnel.de> Message-ID: <48A48263.3000703@murkworks.com> Stefan Behnel wrote: > > You are deliberately lying to lxml, and still expect it to be so kind to do > the right thing regardless? 
> Well, I didn't realize I was lying.. :-( > It sounds to me like the misunderstanding here is largely based on what the > "base URL" of a document is. It's the URL that defines the origin of the > document. Assuming that you will get the same document when you re-read its > URL is not that a stupid idea, IMHO. Otherwise, the XSLT processor would have > to re-parse a document each time it encounters a document() reference. That > would really hurt performance. > I agree with what you say. However it's a "surprise" to find that document('') is affected this way. document('') is "expected" to always mean "the current stylesheet" no matter what URL you named the stylesheet with. Could this be improved by having etree.XSLT attach the stylesheet doc to the returned stylesheet object, or is this too hard and tangled up inside libxslt? Is there any documentation on the internal URL caching mechanism? Is the "cache" shared between parsers? Between threads? If I use from_string(base_url="xyz") somewhere, then from a different parser have a stylesheet that does document('xyz'), will my resolver get called, or the document that was generated from_string be used instead? How long are documents and their URLs "cached"? My WSGI code is generating stylesheets "on the fly" based on web requests, so I need to know more about the implementation details of the URL/document caching mechanism. 
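For stylesheets generated on the fly, one way to keep base URLs from clashing is to mint a fresh one for every parsed document. A minimal sketch, assuming Python with lxml installed; the names unique_base_url and parse_pair and the example.invalid host are made up for illustration and are not part of lxml:

```python
import itertools

# Process-wide serial number used to build distinct URLs (illustrative only).
_serial = itertools.count(1)


def unique_base_url(kind, prefix="http://example.invalid/generated"):
    """Return a base URL that has not been handed out before in this process."""
    return "%s/%s-%d.xml" % (prefix, kind, next(_serial))


def parse_pair(stylesheet_src, xml_src):
    """Parse a stylesheet and a document, giving each its own base URL."""
    # lxml is imported lazily so the URL helper above can be used on its own.
    from lxml import etree

    stylesheet_doc = etree.fromstring(
        stylesheet_src, base_url=unique_base_url("stylesheet"))
    xml_doc = etree.fromstring(
        xml_src, base_url=unique_base_url("document"))
    return etree.XSLT(stylesheet_doc), xml_doc
```

A real application would more likely derive the URL from the request that produced the stylesheet, so that relative lookups still make sense to its resolver.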
Thanks -- Brad Clements, bkc at murkworks.com (315)268-1000 http://www.murkworks.com AOL-IM: BKClements From stefan_ml at behnel.de Fri Aug 15 08:52:03 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 15 Aug 2008 08:52:03 +0200 Subject: [lxml-dev] problems with document(''), possibly thread related - LXML 'BUG' In-Reply-To: <48A48263.3000703@murkworks.com> References: <48A27140.2060005@murkworks.com> <48A3968B.3000207@murkworks.com> <48A47561.4020607@behnel.de> <48A48263.3000703@murkworks.com> Message-ID: <48A52793.1010601@behnel.de> Hi, Brad Clements wrote: > document('') is "expected" to always mean "the current stylesheet" no > matter what URL you named the stylesheet with. Could this be improved > by having etree.XSLT attach the stylesheet doc to the returned > stylesheet object, or is this too hard and tangled up inside libxslt? The thing is that when a stylesheet says document(''), libxslt will resolve that URL relative to the stylesheet URL (i.e. replace it with that URL) and then ask lxml about that URL. So the only way to see that the stylesheet was meant is to compare the requested URL to the one of the stylesheet. That is identical to the case that you say document("the stylesheet url"). lxml handles this directly without calling a user provided resolver. > Is there any documentation on the internal URL caching mechanism? Is the > "cache" shared between parsers? Between threads? It's local to a single XSLT call. As long as all documents that participate in your XSL transformation (including the stylesheet itself) have unique URLs, you will be safe. > If I use from_string(base_url="xyz") somewhere, then from a different > parser have a stylesheet that does document('xyz'), will my resolver get > called, or the document that was generated from_string be used instead? The only document URLs that will not be requested through your resolver are the one of the stylesheet and the one of the document that is being transformed. 
Everything else will be requested before it is added to the cache.

> My WSGI code is generating stylesheets "on the fly" based on web
> requests, so I need to know more about the implementation details of the
> URL/document caching mechanism.

Giving each of them a unique base URL should work in any case.

Stefan

From p-santoro at sbcglobal.net Fri Aug 15 12:26:26 2008
From: p-santoro at sbcglobal.net (Peter)
Date: Fri, 15 Aug 2008 06:26:26 -0400
Subject: [lxml-dev] instance from schema
Message-ID: <48A559D2.4040904@sbcglobal.net>

I'm looking for a freely available tool that can generate xml instance documents from an xml schema. I'm aware of Sun's xmlgen, but I cannot find a valid web link to it. Does anyone have a current link to Sun's xmlgen or know of a similar tool?

When writing such a tool, wouldn't you need to load the schema as an xml document, walk the tree, and output appropriate instance xml (probably not trivial, given xml schema's complexity)?

Thank you,
Peter

From jholg at gmx.de Fri Aug 15 12:56:51 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 15 Aug 2008 12:56:51 +0200
Subject: [lxml-dev] instance from schema
In-Reply-To: <48A559D2.4040904@sbcglobal.net>
References: <48A559D2.4040904@sbcglobal.net>
Message-ID: <20080815110319.26410@gmx.net>

Hi,

> I'm looking for a freely available tool that can generate xml instance
> documents from an xml schema. I'm aware of Sun's xmlgen, but I cannot
> find a valid web link to it. Does anyone have a current link to Sun's
> xmlgen or know of a similar tool?

If you mean random valid sample documents:

Seems like eclipse has tools for such tasks: http://www.eclipse.org/webtools/community/tutorials/XMLWizards/XMLWizards.html

You can definitely do the like with oXygen xml editor, but this is not free software; although you can get a temporary evaluation licence.

generateDS lets you create data structures from a given schema which you can then populate and serialize:

http://www.rexx.com/~dkuhlman/generateDS.html

I don't know if this can be easily used to create valid random sample data.

> When writing such a tool, wouldn't you need to load the schema as an xml
> document, walk the tree, and output appropriate instance xml (probably
> not trivial, given xml schema's complexity)?

I'd say so. An alternative might be to implement such functionality using an elaborate XSLT.

Regarding lxml, you could of course create an instance document through the lxml API(s) and then validate.

Holger

-- 
Psst! Have you already seen the cool GMX MultiMessenger video? The one for all: http://www.gmx.net/de/go/messenger03
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080815/88d877b8/attachment.htm

From stefan_ml at behnel.de Mon Aug 18 09:04:58 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 18 Aug 2008 09:04:58 +0200
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: References: Message-ID: <48A91F1A.4070405@behnel.de>

Hi,

Sidnei da Silva wrote:
> I've got a reproducible failing test when using custom resolvers and
> relative xsl:import. Basically, if I use resolve_string() or let the
> default resolver do its work, everything works fine. If I use
> resolve_file though, the *next* uri to be resolved will have a
> relative (to the previous uri resolved) filename, and then there's not
> enough information available to compute the full URI.

The problem here is that lxml currently only takes the .name attribute of a file object to determine the file name. The right thing to do would be to use os.path.abspath(f.name) instead to retrieve the absolute path name of the file.

Using resolve_filename(filename) instead of resolve_file(open(filename)) provides a more efficient workaround, though.
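The workaround above can be sketched as a small resolver that hands lxml an absolute file name through resolve_filename() rather than an open file object. This is only an illustration: base_dir, absolute_path and make_local_resolver are hypothetical names, and the lookup policy (every reference is tried relative to one directory) is an assumption, not something lxml prescribes.

```python
import os.path


def absolute_path(filename, base_dir):
    """Turn a (possibly relative) file name into an absolute path."""
    return os.path.abspath(os.path.join(base_dir, filename))


def make_local_resolver(base_dir):
    """Build a resolver that serves files found below base_dir."""
    # lxml is imported lazily so the path helper can be used on its own.
    from lxml import etree

    class LocalFileResolver(etree.Resolver):
        def resolve(self, url, public_id, context):
            path = absolute_path(url, base_dir)
            if not os.path.isfile(path):
                # Not ours: returning None lets other resolvers try.
                return None
            # Pass the absolute name on instead of an open file object.
            return self.resolve_filename(path, context)

    return LocalFileResolver()
```

The resolver would be registered in the usual way, for example with parser.resolvers.add(make_local_resolver("/path/to/stylesheets")), where the directory is whatever the application uses.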
Stefan From sidnei at enfoldsystems.com Mon Aug 18 21:21:30 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 18 Aug 2008 16:21:30 -0300 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: <48A91F1A.4070405@behnel.de> References: <48A91F1A.4070405@behnel.de> Message-ID: On Mon, Aug 18, 2008 at 4:04 AM, Stefan Behnel wrote: > The problem here is that lxml currently only takes the .name attribute of a > file object to determine the file name. The right thing to do would be to use > os.path.abspath(f.name) instead to retrieve the absolute path name of the file. > > Using resolve_filename(filename) instead of resolve_file(open(filename)) > provides a more efficient work around, though. Okay, I think that solves one of the problems, when the filename is local. I suspect it won't solve the other problem I had, which was when the file comes from urlopen(). Since the files are relatively small though I can keep using resolve_string() for now. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From stefan_ml at behnel.de Mon Aug 18 22:16:53 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Aug 2008 22:16:53 +0200 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: References: <48A91F1A.4070405@behnel.de> Message-ID: <48A9D8B5.7050500@behnel.de> Hi, Sidnei da Silva wrote: > On Mon, Aug 18, 2008 at 4:04 AM, Stefan Behnel wrote: >> The problem here is that lxml currently only takes the .name attribute of a >> file object to determine the file name. The right thing to do would be to use >> os.path.abspath(f.name) instead to retrieve the absolute path name of the file. >> >> Using resolve_filename(filename) instead of resolve_file(open(filename)) >> provides a more efficient work around, though. > > Okay, I think that solves one of the problems, when the filename is > local. 
I suspect it won't solve the other problem I had, which was > when the file comes from urlopen(). Since the files are relatively > small though I can keep using resolve_string() for now. I'm working on fixing this. However, do you really need something from urlopen that lxml can't handle by itself? Any "HTTP GET" or "FTP get" request should work when passed as an encoded URL. HTTP POSTs and request options won't work, but they are pretty rarely used when requesting pages. HTTPS also won't work, which I would consider more important. Stefan From sidnei at enfoldsystems.com Mon Aug 18 22:45:29 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 18 Aug 2008 17:45:29 -0300 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: <48A9D8B5.7050500@behnel.de> References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> Message-ID: On Mon, Aug 18, 2008 at 5:16 PM, Stefan Behnel wrote: > I'm working on fixing this. However, do you really need something from urlopen > that lxml can't handle by itself? Any "HTTP GET" or "FTP get" request should > work when passed as an encoded URL. HTTP POSTs and request options won't work, > but they are pretty rarely used when requesting pages. HTTPS also won't work, > which I would consider more important. Yes, HTTPS and I'm also using code that handles caching. So, I overly simplified here. I actually use something that resembles urlopen but that might return an open file handle (or something that resembles a file, like a StringIO) if the requested url is cached locally. 
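A rough sketch of that kind of setup, assuming a simple in-memory cache: the resolver returns the cached text through resolve_string() and passes the requested URL along as base_url, so that later relative references have something to resolve against. UrlCache and make_caching_resolver are hypothetical names and not lxml API, and the URL is assumed to be absolute already.

```python
class UrlCache(object):
    """Hypothetical in-memory cache mapping absolute URLs to document text."""

    def __init__(self):
        self._docs = {}

    def store(self, url, text):
        self._docs[url] = text

    def lookup(self, url):
        # Returns None when the URL has not been cached.
        return self._docs.get(url)


def make_caching_resolver(cache):
    """Build a resolver that answers from the cache and declines otherwise."""
    # lxml is imported lazily so the cache class can be used on its own.
    from lxml import etree

    class CachingResolver(etree.Resolver):
        def resolve(self, url, public_id, context):
            text = cache.lookup(url)
            if text is None:
                return None  # fall through to the next resolver
            # Keep the source URL attached to the string content.
            return self.resolve_string(text, context, base_url=url)

    return CachingResolver()
```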
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From jlovell at esd189.org Mon Aug 18 22:50:38 2008 From: jlovell at esd189.org (John Lovell) Date: Mon, 18 Aug 2008 13:50:38 -0700 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: <48A9D8B5.7050500@behnel.de> References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org> Stefan: If you find a currently maintained pure Python way to add SSL, please let us know. John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.nwesd.org Together We Can ... -----Original Message----- From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Stefan Behnel Sent: Monday, August 18, 2008 1:17 PM To: Sidnei da Silva Cc: ML-Lxml-dev Subject: Re: [lxml-dev] Failure with custom resolvers and relative xsl:import Hi, Sidnei da Silva wrote: > On Mon, Aug 18, 2008 at 4:04 AM, Stefan Behnel wrote: >> The problem here is that lxml currently only takes the .name >> attribute of a file object to determine the file name. The right >> thing to do would be to use >> os.path.abspath(f.name) instead to retrieve the absolute path name of the file. >> >> Using resolve_filename(filename) instead of >> resolve_file(open(filename)) provides a more efficient work around, though. > > Okay, I think that solves one of the problems, when the filename is > local. I suspect it won't solve the other problem I had, which was > when the file comes from urlopen(). Since the files are relatively > small though I can keep using resolve_string() for now. I'm working on fixing this. However, do you really need something from urlopen that lxml can't handle by itself? Any "HTTP GET" or "FTP get" request should work when passed as an encoded URL. 
HTTP POSTs and request options won't work, but they are pretty rarely used when requesting pages. HTTPS also won't work, which I would consider more important.

Stefan
_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

From stefan_ml at behnel.de Tue Aug 19 12:26:38 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 19 Aug 2008 12:26:38 +0200 (CEST)
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de>
Message-ID: <64235.213.61.181.86.1219141598.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Sidnei da Silva wrote:
> Yes, HTTPS and I'm also using code that handles caching. So, I overly
> simplified here. I actually use something that resembles urlopen but
> that might return an open file handle (or something that resembles a
> file, like a StringIO) if the requested url is cached locally.

If you return a custom object, you can give it a "filename" attribute or a "geturl()" method (as provided by urlopen() 'files'). lxml will recognise that. A StringIO object can't pass on its filename or source URL, so you have to provide a meaningful base_url from your resolver in this case.

Stefan

From stefan_ml at behnel.de Tue Aug 19 12:38:25 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 19 Aug 2008 12:38:25 +0200 (CEST)
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org>
References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org>
Message-ID: <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

John Lovell wrote:
> If you find a currently maintained pure Python way to add SSL, please
> let us know.
Hmmm, isn't it enough to use a custom resolver to intercept only "https://..." requests (return None for everything else) and to redirect them to your preferred Python HTTPS client library? I'm not sure implementing this directly in lxml makes sense, as it wouldn't allow using certificates and these things. But it shouldn't be too hard to do on your own. Stefan From jlovell at esd189.org Tue Aug 19 17:07:06 2008 From: jlovell at esd189.org (John Lovell) Date: Tue, 19 Aug 2008 08:07:06 -0700 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org> <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org> Stefan: So what is your "preferred Python HTTPS client library?" I recently tried very hard and ultimately failed at handling https cross platform with Python. Thanks for all you do, John W. Lovell Web Applications Engineer Northwest Educational Service District 1601 R Avenue Anacortes, WA 98221 (360) 299-4086 jlovell at nwesd.org www.nwesd.org Together We Can ... -----Original Message----- From: Stefan Behnel [mailto:stefan_ml at behnel.de] Sent: Tuesday, August 19, 2008 3:38 AM To: John Lovell Cc: ML-Lxml-dev Subject: Re: [lxml-dev] Failure with custom resolvers and relative xsl:import John Lovell wrote: > If you find a currently maintained pure Python way to add SSL, please > let us know. Hmmm, isn't it enough to use a custom resolver to intercept only "https://..." requests (return None for everything else) and to redirect them to your preferred Python HTTPS client library? I'm not sure implementing this directly in lxml makes sense, as it wouldn't allow using certificates and these things. 
But it shouldn't be too hard to do on your own.

Stefan

From stefan_ml at behnel.de Tue Aug 19 17:39:19 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 19 Aug 2008 17:39:19 +0200 (CEST)
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org>
References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org> <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org>
Message-ID: <60653.213.61.181.86.1219160359.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

John Lovell wrote:
> So what is your "preferred Python HTTPS client library?" I recently
> tried very hard and ultimately failed at handling https cross platform
> with Python.

I never needed to try (I'd start with urllib2 anyway...), but this sounds like a question for comp.lang.python rather than this list.

Stefan

From etiffany at alum.mit.edu Tue Aug 19 17:49:03 2008
From: etiffany at alum.mit.edu (Eric Tiffany)
Date: Tue, 19 Aug 2008 11:49:03 -0400
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org>
Message-ID:

I have used M2Crypto for clientTLS connections from MacOS, Ubuntu, and Vista, all using the same codebase (including LXML) running under Plone. Works brilliantly.

ET

On 8/19/08 11:07 AM, "John Lovell" wrote:
> Stefan:
>
> So what is your "preferred Python HTTPS client library?" I recently
> tried very hard and ultimately failed at handling https cross platform
> with Python.
>
> Thanks for all you do,
>
> John W. Lovell
> Web Applications Engineer
> Northwest Educational Service District
> 1601 R Avenue
> Anacortes, WA 98221
> (360) 299-4086
> jlovell at nwesd.org
>
> www.nwesd.org
> Together We Can ...
> > > -----Original Message----- > From: Stefan Behnel [mailto:stefan_ml at behnel.de] > Sent: Tuesday, August 19, 2008 3:38 AM > To: John Lovell > Cc: ML-Lxml-dev > Subject: Re: [lxml-dev] Failure with custom resolvers and relative > xsl:import > > John Lovell wrote: >> If you find a currently maintained pure Python way to add SSL, please >> let us know. > > Hmmm, isn't it enough to use a custom resolver to intercept only > "https://..." requests (return None for everything else) and to redirect > them to your preferred Python HTTPS client library? > > I'm not sure implementing this directly in lxml makes sense, as it > wouldn't allow using certificates and these things. But it shouldn't be > too hard to do on your own. > > Stefan > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- ________________________________________________ Eric Tiffany | +1 413-458-3743 etiffany at alum.mit.edu | +1 413-627-1778 mobile From stefan_ml at behnel.de Tue Aug 19 18:02:06 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 19 Aug 2008 18:02:06 +0200 (CEST) Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: <20080819114425.13d8d006@mbook.local> References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org> <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org> <60653.213.61.181.86.1219160359.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20080819114425.13d8d006@mbook.local> Message-ID: <37198.213.61.181.86.1219161726.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Mike Meyer wrote: > If your build includes SSL support, then https is in the standard > library. If not, then not. Ah, sure, that's a portability issue then. 
> Unfortunately, the same is *not* true for server side SSL.

... for which there is tlslite, at least I know that one. From a quick look at their web site, it's supposed to work with httplib also. I never used it either, but a couple of blog entries seem to be happy with it.

Stefan

From jlovell at esd189.org Tue Aug 19 18:16:32 2008
From: jlovell at esd189.org (John Lovell)
Date: Tue, 19 Aug 2008 09:16:32 -0700
Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import
In-Reply-To: <37198.213.61.181.86.1219161726.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <48A91F1A.4070405@behnel.de> <48A9D8B5.7050500@behnel.de> <3A49C88789256B4AB33AC603DB6AF49B011A2404@ZIRIA.esd189.org> <39571.213.61.181.86.1219142305.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <3A49C88789256B4AB33AC603DB6AF49B011A2406@ZIRIA.esd189.org> <60653.213.61.181.86.1219160359.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20080819114425.13d8d006@mbook.local> <37198.213.61.181.86.1219161726.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <3A49C88789256B4AB33AC603DB6AF49B011A2408@ZIRIA.esd189.org>

I wound up using tlslite. The only issue was Windows, for which there wasn't an installer for Python 2.5, and it seemed I needed a version of DevStudio (that I don't have) to build one. I tried manually copying the files and still no joy.

Anyway, back to lxml stuff.

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.nwesd.org
Together We Can ...

-----Original Message-----
From: Stefan Behnel [mailto:stefan_ml at behnel.de]
Sent: Tuesday, August 19, 2008 9:02 AM
To: Mike Meyer
Cc: John Lovell; ML-Lxml-dev
Subject: Re: [lxml-dev] Failure with custom resolvers and relative xsl:import

Mike Meyer wrote:
> If your build includes SSL support, then https is in the standard
> library. If not, then not.

Ah, sure, that's a portability issue then.
> Unfortunately, the same is *not* true for server side SSL. ... for which there is tlslite, at least I know that one. From a quick look at their web site, it's supposed to work with httplib also. I never used it either, but a couple of blog entries seem to be happy with it. Stefan From stefan_ml at behnel.de Tue Aug 19 21:45:36 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 19 Aug 2008 21:45:36 +0200 Subject: [lxml-dev] Failure with custom resolvers and relative xsl:import In-Reply-To: References: <48A91F1A.4070405@behnel.de> Message-ID: <48AB22E0.3050104@behnel.de> Hi, Sidnei da Silva wrote: > On Mon, Aug 18, 2008 at 4:04 AM, Stefan Behnel wrote: >> The problem here is that lxml currently only takes the .name attribute of a >> file object to determine the file name. The right thing to do would be to use >> os.path.abspath(f.name) instead to retrieve the absolute path name of the file. >> >> Using resolve_filename(filename) instead of resolve_file(open(filename)) >> provides a more efficient work around, though. > > Okay, I think that solves one of the problems, when the filename is > local. I suspect it won't solve the other problem I had, which was > when the file comes from urlopen(). Since the files are relatively > small though I can keep using resolve_string() for now. here's a patch that tries making filenames absolute when users pass a file-like object. This only works for file objects (and maybe compatible objects), but "file://..." URLs should be provided as absolute URLs anyway. Is there anything that's missing? You mentioned problems with urlopen(), but didn't elaborate on them any further. Stefan -------------- next part -------------- A non-text attachment was scrubbed... 
Name: absolute-filenames.patch Type: text/x-patch Size: 3484 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080819/cf24f3c5/attachment-0001.bin From jholg at gmx.de Wed Aug 20 17:09:09 2008 From: jholg at gmx.de (Holger Joukl) Date: Wed, 20 Aug 2008 17:09:09 +0200 Subject: [lxml-dev] objectify.ObjectPath("wrongroot.a.b")(root, None) should not raise an exception Message-ID: <20080820151743.203090@gmx.net> Hi, ?using an absolute ObjectPath with a wrong root element objectify.ObjectPath("wrongroot.a.b")(root, None) currently raises ValueError even if a default is given: ?>>> root = objectify.fromstring("
23") >>> objectify.ObjectPath("root.a.b")(root, None) 23 >>> objectify.ObjectPath("wrongroot.a.b")(root, None) Traceback (most recent call last): ? File "", line 1, in ? ? File "objectpath.pxi", line 53, in lxml.objectify.ObjectPath.__call__ ? File "objectpath.pxi", line 197, in lxml.objectify._findObjectPath ValueError: root element does not match: need wrongroot, got root >>> ?Whereas ObjectPath returns a given default if any other element of the path is not in the tree: ?>>> objectify.ObjectPath("root.wronga.b")(root, None) >>> ?I'll change this in trunk to gracefully return the default, and add some tests (unless someone stops me... :) ?Holger? ? -- Psssst! Schon das coole Video vom GMX MultiMessenger gesehen? Der Eine f?r Alle: http://www.gmx.net/de/go/messenger03 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080820/a16ed6d3/attachment.htm From stefan_ml at behnel.de Wed Aug 20 09:07:55 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 20 Aug 2008 09:07:55 +0200 Subject: [lxml-dev] namespace strangeness in lxml 1.1 In-Reply-To: <1214931259.10505.48.camel@localhost.localdomain> References: <1214856697.868.38.camel@localhost.localdomain> <4869BEB6.9000609@behnel.de> <1214931259.10505.48.camel@localhost.localdomain> Message-ID: <48ABC2CB.1020404@behnel.de> Hi, coming back to this after a while... Eric Jahn wrote: > On Tue, 2008-07-01 at 07:20 +0200, Stefan Behnel wrote: >> If you want it to replace the namespace by a resolved prefix, use >> >> type = etree.QName(NS2 + "...") > > No, I don't want the prefix resolved the the url, so I guess my only > option is to do something like the following and just pass the type > value a string with the namespace prefix explicity stated: > > child1 = etree.SubElement(root,NS2 + "secondelement", nsmap=NSMAP, type > = "NS2:someattribute") I think you misunderstood my example (and apparently didn't try it on your side). 
Isn't this what you wanted: >>> import lxml.etree as et >>> root = et.XML('') >>> root[0].set("type", et.QName("{http://my/ns}tname")) >>> et.tostring(root) '' This has been working for quite a while now. Stefan From stefan_ml at behnel.de Wed Aug 20 20:50:26 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 20 Aug 2008 20:50:26 +0200 Subject: [lxml-dev] Fwd: [xml] Security fix for libxml2 Message-ID: <48AC6772.3010409@behnel.de> FYI -------- Original-Message -------- Subject: [xml] Security fix for libxml2 Date: Wed, 20 Aug 2008 19:00:51 +0200 From: Daniel Veillard To: xml at gnome.org Bad news, when checking against recursive entities expansion problem back when it was made official (c.f. the billion laught attack circa 2004) I had checked for the normal recursion, but when happening in an attribute value the resource consumption is way faster and the recursion detection in place is not sufficient to catch the problem. Basically when this happen within an attribute just checking for a recursion depth is not sufficient, and the only good method I could find was to count the number of entities replacement taking place while parsing a given document, and drop parsing after half a million substitution. I think it's a fair default process and what the patches below implements for various libxml2 versions, but i can understand that in some case that may be problematic. So i intend in the next release (2.7.0 hopefully available soon) to add a parser flag removing the hardcoded limits (there is also a maximum document depth in place). Distributions have been made aware of the problem for a couple of weeks and updates should be available soon from normal update channels I'm updating SVN with the fix too, Daniel -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: libxml2-2.6.32-billion_laught.patch Url: http://codespeak.net/pipermail/lxml-dev/attachments/20080820/5bd0f960/attachment-0003.diff -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: libxml2-2.6.26-billion_laught.patch Url: http://codespeak.net/pipermail/lxml-dev/attachments/20080820/5bd0f960/attachment-0004.diff -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: libxml2-2.6.16-billion_laught.patch Url: http://codespeak.net/pipermail/lxml-dev/attachments/20080820/5bd0f960/attachment-0005.diff From stefan_ml at behnel.de Wed Aug 20 20:55:10 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 20 Aug 2008 20:55:10 +0200 Subject: [lxml-dev] objectify.ObjectPath("wrongroot.a.b")(root, None) should not raise an exception In-Reply-To: <20080820151743.203090@gmx.net> References: <20080820151743.203090@gmx.net> Message-ID: <48AC688E.3020608@behnel.de> Hi, Holger Joukl wrote: > using an absolute ObjectPath with a wrong root element currently raises > ValueError even if a default is given: > [...] > Whereas ObjectPath returns a given default if any other > element of the path is not in the tree: > I'll change this in trunk to gracefully return the default, and add some > tests Please do. Stefan From jholg at gmx.de Thu Aug 21 08:24:06 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 21 Aug 2008 08:24:06 +0200 Subject: [lxml-dev] objectify.ObjectPath("wrongroot.a.b")(root, None) should not raise an exception In-Reply-To: <48AC688E.3020608@behnel.de> References: <20080820151743.203090@gmx.net> <48AC688E.3020608@behnel.de> Message-ID: <20080821062406.203120@gmx.net> > >I'll change this in trunk to gracefully return the default, and add some > > tests > > Please do. > ?Committed revision 57527, ?Holger? -- GMX startet ShortView.de. Hier findest Du Leute mit Deinen Interessen! 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080821/be7ef0c6/attachment.htm

From nguyenkhanhduy at gmail.com Thu Aug 21 17:43:34 2008
From: nguyenkhanhduy at gmail.com (Nguyễn Khánh Duy)
Date: Thu, 21 Aug 2008 22:43:34 +0700
Subject: [lxml-dev] Undefined preference when building lxml xsltLibxsltVersion
Message-ID:

I am trying to build lxml for python 2.3 on windows, using mingw32. I've
successfully downloaded and installed libxslt and libxml2, put the include
header files and the lib files where they need to be, but when I build lxml
the following error appears:

Am I missing any .lib files? libxslt, libxml2, libexslt, iconv and zlib are
all in the python 2.3 libs folder.

D:\LXML\lxml-2.1.1>python setup.py build -cmingw32
Building lxml version 2.1.1.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
ERROR: 'xslt-config' is not recognized as an internal or external command,
operable program or batch file.
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
running build
running build_py
running build_ext
building 'lxml.etree' extension
writing build\temp.win32-2.3\Release\src\lxml\etree.def
C:\MinGW\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32-2.3\Release\src\lxml\lxml.etree.o build\temp.win32-2.3\Release\src\lxml\etree.def -LD:\Python23\libs -LD:\Python23\PCBuild -llibxslt -llibexslt -llibxml2 -liconv -lzlib -lWS2_32 -lpython23 -o build\lib.win32-2.3\lxml\etree.pyd
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x67e20): undefined reference to `xsltProcessOneNode'
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x83946): undefined reference to `xsltLibxsltVersion'
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x839e2): undefined reference to `xsltDocDefaultLoader'
collect2: ld returned 1 exit status
error: command 'gcc' failed with exit status 1

-- 
Nguyen Khanh Duy
nguyenkhanhduy at gmail.com

From stefan_ml at behnel.de Thu Aug 21 19:11:06 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 21 Aug 2008 19:11:06 +0200
Subject: [lxml-dev] Undefined preference when building lxml xsltLibxsltVersion
In-Reply-To:
References:
Message-ID: <48ADA1AA.1040402@behnel.de>

Hi,

Nguyễn Khánh Duy wrote:
> ERROR: 'xslt-config' is not recognized as an internal or external
> command, operable program or batch file.
>
> ** make sure the development packages of libxml2 and libxslt are installed **

You need to have the "xslt-config" script in your PATH. It comes with libxslt,
but in your case, lxml's setup.py script can't find it.
Stefan

From nguyenkhanhduy at gmail.com Thu Aug 21 19:28:23 2008
From: nguyenkhanhduy at gmail.com (Nguyễn Khánh Duy)
Date: Fri, 22 Aug 2008 00:28:23 +0700
Subject: [lxml-dev] Undefined preference when building lxml xsltLibxsltVersion
In-Reply-To: <48ADA1AA.1040402@behnel.de>
References: <48ADA1AA.1040402@behnel.de>
Message-ID:

This is another thing I was wondering about. I built libxslt using MinGW32
and MSYS on Windows. After that, I checked the bin directory and saw a file
named xslt-config there; it's a shell script. I have set the PATH to include
the directory, and using MSYS I can execute it normally from anywhere. But
when I try to build lxml with "python setup.py build -cmingw32" that message
still shows.

====
> You need to have the "xslt-config" script in your PATH. It comes with libxslt,
> but in your case, lxml's setup.py script can't find it.

-- 
Nguyen Khanh Duy
nguyenkhanhduy at gmail.com

From nguyenkhanhduy at gmail.com Thu Aug 21 19:52:25 2008
From: nguyenkhanhduy at gmail.com (Nguyễn Khánh Duy)
Date: Fri, 22 Aug 2008 00:52:25 +0700
Subject: [lxml-dev] Undefined preference when building lxml xsltLibxsltVersion
In-Reply-To:
References: <48ADA1AA.1040402@behnel.de>
Message-ID:

I don't think there's a problem with xslt-config. By hacking through
setupinfo.py, you can find ways around using xslt-config and let setup.py
work just fine. I just need to know: which file did I miss to have those
references?
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x67e20): undefined reference to `xsltProcessOneNode'
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x83946): undefined reference to `xsltLibxsltVersion'
build\temp.win32-2.3\Release\src\lxml\lxml.etree.o:lxml.etree.c:(.text+0x839e2): undefined reference to `xsltDocDefaultLoader'

-- 
Nguyen Khanh Duy
nguyenkhanhduy at gmail.com

From nguyenkhanhduy at gmail.com Fri Aug 22 11:17:36 2008
From: nguyenkhanhduy at gmail.com (Nguyễn Khánh Duy)
Date: Fri, 22 Aug 2008 16:17:36 +0700
Subject: [lxml-dev] lxml built for python2.3 on windows
Message-ID:

Does anyone have the binaries for lxml on windows for python 2.3 ? I could
not build it and cannot find the prebuilt anywhere.

-- 
Nguyen Khanh Duy
nguyenkhanhduy at gmail.com

From dsoulayrol at free.fr Fri Aug 22 17:50:10 2008
From: dsoulayrol at free.fr (David Soulayrol)
Date: Fri, 22 Aug 2008 17:50:10 +0200
Subject: [lxml-dev] ElementTree docinfo attribute
Message-ID: <1219420210.20981.29.camel@neodebianix.neotip.com>

Hello,

I can read at
http://article.gmane.org/gmane.comp.python.lxml.devel/1106/match=docinfo

"a DocInfo object that you can also instantiate on an ElementTree (or
Element) by hand."

But I also can read from _ElementTree help :

"docinfo: Information about the document provided by parser and DTD.
This value is only defined for ElementTree objects based on the root
node of a parsed document (e.g. those returned by the parse
functions)."

In any case, the docinfo property of ElementTree seems read only. So
please tell me, is it possible to fill the docinfo property of an
ElementTree by hand, or is it necessary to use the etree parse method ?

Thanks.
-- 
David.
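
For the parser route mentioned in the question, a minimal sketch: docinfo is filled in when the document comes from the parser, so the DOCTYPE has to be part of the serialized input. The public ID, system URL, root tag and the helper name build_document are all made up for illustration, and the sketch assumes a current lxml under Python 3 rather than the 2008 release discussed here.

```python
from io import BytesIO

from lxml import etree

# Hypothetical identifiers, chosen only for this example.
PUBLIC_ID = "-//EXAMPLE//DTD Example 1.0//EN"
SYSTEM_URL = "http://example.invalid/example.dtd"


def build_document(root_tag, public_id, system_url):
    """Serialize a DOCTYPE plus an empty root element, then parse it back."""
    source = '<!DOCTYPE %s PUBLIC "%s" "%s"><%s/>' % (
        root_tag, public_id, system_url, root_tag)
    # The default parser does not load the external DTD, so nothing is fetched.
    return etree.parse(BytesIO(source.encode("ascii")))


tree = build_document("root", PUBLIC_ID, SYSTEM_URL)
print(tree.docinfo.public_id)
print(tree.docinfo.system_url)
print(tree.docinfo.doctype)
```

Children and namespace declarations can then be added to tree.getroot() as usual; only the DOCTYPE needs to go through the parser in this approach.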
From ivanov.maxim at gmail.com Sat Aug 23 08:58:55 2008
From: ivanov.maxim at gmail.com (Max Ivanov)
Date: Sat, 23 Aug 2008 10:58:55 +0400
Subject: [lxml-dev] Some HTML target processing issues
In-Reply-To: <53981.213.61.181.86.1218196897.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <489BD721.2050703@behnel.de> <53981.213.61.181.86.1218196897.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

Here is one test for problem with not calling targets' close() method
when XMLSyntaxError is raised during SAX-like parsing even with
recover=True. This is addition to test_htmlparser.py:


    def test_module_target_on_raise_stop(self):
        class Target(object):
            def __init__(self, res):
                self.res = res
            def start(self, tag, attrib):
                pass
            def end(self, tag):
                pass
            def close(self):
                self.res.append(True)

        result = []
        parser = self.etree.HTMLParser(target=Target(result), recover=True)
        parse = self.etree.parse
        f = BytesIO(self.broken_html_str)
        self.assertRaises(self.etree.XMLSyntaxError,
                          parse, f, parser)
        self.assertEqual(result[-1],True)

From ivanov.maxim at gmail.com Sat Aug 23 09:05:14 2008
From: ivanov.maxim at gmail.com (Max Ivanov)
Date: Sat, 23 Aug 2008 11:05:14 +0400
Subject: [lxml-dev] .text_content() should leave spaces. Tests included
Message-ID:

Hi!

I've run into another strange behaviour. lxml.html.HTMLParser
produces html elements with similar API as Etree elements, but with
some additions. One of them is .text_content() method. Some quote from
docs: "Returns the text content of the element, including the text
content of its children, with no markup."

So according to description it transforms
"element1element2" to "element1element2".
Notice the lack of space between contents of two elements. From my
point of view, that makes this method quite useless, it would be
better if it produced "element1 element2" from same string.
Here is a test for test_htmlparser.py:

    def test_html_text_content(self):
        from lxml.html import HTMLParser
        element = self.etree.HTML(self.html_str, parser=HTMLParser())
        self.assertEquals(element.text_content(),"test page title")

From stefan_ml at behnel.de Sat Aug 23 09:25:36 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 23 Aug 2008 09:25:36 +0200
Subject: [lxml-dev] Some HTML target processing issues
In-Reply-To:
References: <489BD721.2050703@behnel.de> <53981.213.61.181.86.1218196897.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <48AFBB70.3090801@behnel.de>

Hi,

Max Ivanov wrote:
> Here is one test for problem with not calling targets' close() method
> when XMLSyntaxError is raised during SAX-like parsing even with
> recover=True. This is addition to test_htmlparser.py:
>
>
>     def test_module_target_on_raise_stop(self):
>         class Target(object):
>             def __init__(self, res):
>                 self.res = res
>             def start(self, tag, attrib):
>                 pass
>             def end(self, tag):
>                 pass
>             def close(self):
>                 self.res.append(True)
>
>         result = []
>         parser = self.etree.HTMLParser(target=Target(result), recover=True)
>         parse = self.etree.parse
>         f = BytesIO(self.broken_html_str)
>         self.assertRaises(self.etree.XMLSyntaxError,
>                           parse, f, parser)
>         self.assertEqual(result[-1],True)

Thanks, could you file a bug report in the launchpad bug tracker so that this
doesn't get lost?

Stefan

From stefan_ml at behnel.de Sat Aug 23 09:32:50 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 23 Aug 2008 09:32:50 +0200
Subject: [lxml-dev] .text_content() should leave spaces. Tests included
In-Reply-To:
References:
Message-ID: <48AFBD22.3030803@behnel.de>

Hi,

Max Ivanov wrote:
> I've run into another strange behaviour.

I wouldn't call that "strange behaviour". What you want is a new feature.

> lxml.html.HTMLParser
> produces html elements with similar API as Etree elements, but with
> some additions. One of them is .text_content() method.
Some quote from > docs: "Returns the text content of the element, including the text > content of its children, with no markup." > > So according to description it transforms > "element1element2" to "element1element2". > Notice the lack of space between contents of two elements. Exactly as in the HTML source, I would say. Given your specific example, I don't think a browser would display it any different. > From my > point of view, that's make this method quite useless, it would be > better if it produce "element1 element2" from same string. Here is a > test fro test_htmlparser.py: > > def test_html_text_content(self): > from lxml.html import HTMLParser > element = self.etree.HTML(self.html_str, parser=HTMLParser()) > self.assertEquals(element.text_content(),"test page title") That would be wrong, as it alters the content while collecting it. I agree that a few additional features could help targeting new use cases. For example, the method could be smart about
tags and replace them with "\n". But that would be optional behaviour enabled by a keyword argument. Feel free to provide a patch. Stefan From ivanov.maxim at gmail.com Sat Aug 23 09:57:19 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Sat, 23 Aug 2008 11:57:19 +0400 Subject: [lxml-dev] .text_content() should leave spaces. Tests included In-Reply-To: <48AFBD22.3030803@behnel.de> References: <48AFBD22.3030803@behnel.de> Message-ID: >> So according to description it transforms >> "element1element2" to "element1element2". >> Notice the lack of space between contents of two elements. > > Exactly as in the HTML source, I would say. Given your specific example, I > don't think a browser would display it any different. > Maybe examples are not suitable here. but .text_content() on "test

page title

" displaying "testpage title" instead of "test page title" is definitely wrong. Imagine what would happen with with multiple td's and tr's - it'll transform it to one big word without spaces. Do you think that it is correct?. Easiest way will be but spaces between content of any two tags and keep all other symbols between tags. >Feel free to provide a patch. text_method is an alias for XPath("string()"). But I didn't find any description of just plain string() function, everything I found is an "string text()" which according to wikipedia returns text content of elements only one level lower. So I don't understand how all that works =) From mwm-keyword-lxml.9112b8 at mired.org Sat Aug 23 10:34:36 2008 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sat, 23 Aug 2008 04:34:36 -0400 Subject: [lxml-dev] .text_content() should leave spaces. Tests included In-Reply-To: References: <48AFBD22.3030803@behnel.de> Message-ID: <20080823043436.780120c7@bhuda.mired.org> On Sat, 23 Aug 2008 11:57:19 +0400 "Max Ivanov" wrote: > >> So according to description it transforms > >> "element1element2" to "element1element2". > >> Notice the lack of space between contents of two elements. > > > > Exactly as in the HTML source, I would say. Given your specific example, I > > don't think a browser would display it any different. > > > Maybe examples are not suitable here. but .text_content() on > "test

page > title

" displaying "testpage title" instead of "test > page title" is definitely wrong. Imagine what would happen with >
with multiple td's and tr's - it'll transform it to one big > word without spaces. Do you think that it is correct?. Easiest way > will be but spaces between content of any two tags and keep all other > symbols between tags. Easiest way to what? Fix this broken behavior? But it'll break the correct behavior where inline tags are used to change the rendering of elements in a word (like bleen). If you want it to look like what a browser might render, you want to put spaces between block elements but not inline elements. Of course, whether a particular tag is inline or not can be changed by whatever style sheets are in use. And title - well, it's contents aren't rendered in the contents of the page at all. So maybe they should just vanish? http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org From ivanov.maxim at gmail.com Sat Aug 23 12:27:45 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Sat, 23 Aug 2008 14:27:45 +0400 Subject: [lxml-dev] HTMLParser encoding Message-ID: If there is no meta tag with defined document encoding, how HTMLParser converts text data into Unicode? Does it contain some encoding detection machinery? From stefan_ml at behnel.de Sat Aug 23 13:55:53 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 23 Aug 2008 13:55:53 +0200 Subject: [lxml-dev] ElementTree docinfo attribute In-Reply-To: <1219420210.20981.29.camel@neodebianix.neotip.com> References: <1219420210.20981.29.camel@neodebianix.neotip.com> Message-ID: <48AFFAC9.7020909@behnel.de> Hi, David Soulayrol wrote: > I can read at > http://article.gmane.org/gmane.comp.python.lxml.devel/1106/match=docinfo > > "a DocInfo object that you can also instantiate on an ElementTree (or > Element) by hand." > > But I also can read from _ElementTree help : > > "docinfo: Information about the document provided by parser and DTD. 
> This value is only defined for ElementTree objects based on the root > node of a parsed document (e.g. those returned by the parse > functions)." I just added ", not for trees that were built manually." at the end, that should make it clearer. > In any case, the docinfo property of ElementTree seems read only. So > please tell me, is it possible to fill the docinfo property of an > ElementTree by hand, or is it necessary to use the etree parse method ? What information would you like to add, and for what purpose? Stefan From stefan_ml at behnel.de Sat Aug 23 14:02:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 23 Aug 2008 14:02:05 +0200 Subject: [lxml-dev] HTMLParser encoding In-Reply-To: References: Message-ID: <48AFFC3D.3090600@behnel.de> Hi, Max Ivanov wrote: > If there is no meta tag with defined document encoding, how HTMLParser > converts text data into Unicode? Does it contain some encoding > detection machinery? Yes, but that's implemented in libxml2 and I don't know much about the details. There are some ways to help it, though, in case it gets it wrong. If you can provide the proper encoding (e.g. as provided through HTTP, MIME or some other source), you can pass it to the parser when you create it. Or, you can decode the data to a unicode string and pass that to the parser. Stefan From ivanov.maxim at gmail.com Sat Aug 23 14:52:33 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Sat, 23 Aug 2008 16:52:33 +0400 Subject: [lxml-dev] HTMLParser encoding In-Reply-To: <48AFFC3D.3090600@behnel.de> References: <48AFFC3D.3090600@behnel.de> Message-ID: >> If there is no meta tag with defined document encoding, how HTMLParser >> converts text data into Unicode? Does it contain some encoding >> detection machinery? > > Yes, but that's implemented in libxml2 and I don't know much about the > details. There are some ways to help it, though, in case it gets it wrong. If > you can provide the proper encoding (e.g. 
as provided through HTTP, MIME or > some other source), you can pass it to the parser when you create it. Or, you > can decode the data to a unicode string and pass that to the parser. I plan to user chardet module (http://chardet.feedparser.org/) to detect charset if no meta tag is present. chardet needs untouched text for proper detection, I couldn't pass to it unicode text from element.text ot .text_content() also I couldnt pass plain text full of tags since it make chardet return wrong results. Is there any way to restore original text from element.text or text_content()? From eric at ejahn.net Sun Aug 24 05:51:04 2008 From: eric at ejahn.net (Eric Jahn) Date: Sat, 23 Aug 2008 23:51:04 -0400 Subject: [lxml-dev] namespace strangeness in lxml 1.1 In-Reply-To: <48ABC2CB.1020404@behnel.de> References: <1214856697.868.38.camel@localhost.localdomain> <4869BEB6.9000609@behnel.de> <1214931259.10505.48.camel@localhost.localdomain> <48ABC2CB.1020404@behnel.de> Message-ID: <1219549865.29355.0.camel@localhost.localdomain> On Wed, 2008-08-20 at 09:07 +0200, Stefan Behnel wrote: > >>> import lxml.etree as et > >>> root = et.XML('') > >>> root[0].set("type", et.QName("{http://my/ns}tname")) > >>> et.tostring(root) > '' Stefan, this worked, thank you! -Eric From richardbp+lxml at gmail.com Sun Aug 24 13:19:18 2008 From: richardbp+lxml at gmail.com (Richard Baron Penman) Date: Sun, 24 Aug 2008 21:19:18 +1000 Subject: [lxml-dev] Text obscured by subelement Message-ID: hello, I have a document with a format like this: text1text2text3text4text5 I want to extract 'text1text3text5' from but the text attribute returns just 'text1'. Here is an example: from lxml import html doc = html.fromstring('text1text2text3text4text5') print doc.text # 'text1' print doc.tail # '' print doc.text_content() # 'text1text2text3text4text5' for child in doc: child.drop_tree() print doc.text # 'text1text3text5' >From the example you can see I can get what I want by first dropping the subelements. 
Is there a better way to access this text? regards, Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080824/254de6f4/attachment.htm From jjl at pobox.com Mon Aug 25 00:03:13 2008 From: jjl at pobox.com (John J Lee) Date: Sun, 24 Aug 2008 23:03:13 +0100 (BST) Subject: [lxml-dev] Text obscured by subelement In-Reply-To: References: Message-ID: On Sun, 24 Aug 2008, Richard Baron Penman wrote: > > I have a document with a format like this: > text1text2text3text4text5 > > I want to extract 'text1text3text5' from but the text attribute > returns just 'text1'. Here is an example: > > from lxml import html > doc = html.fromstring('text1text2text3text4text5') [...] >> From the example you can see I can get what I want by first dropping the > subelements. > Is there a better way to access this text? [...] I only have 1.3.6 installed, so don't have the HTML support, but you want to use the .tail of the b elements I think. With the XML API: from lxml.etree import fromstring doc = fromstring('text1text2text3text4text5') b1, b2 = doc.getchildren() print doc.text + b1.tail + b2.tail John From piet at cs.uu.nl Mon Aug 25 00:16:56 2008 From: piet at cs.uu.nl (Piet van Oostrum) Date: Mon, 25 Aug 2008 00:16:56 +0200 Subject: [lxml-dev] Text obscured by subelement In-Reply-To: References: Message-ID: <18609.56792.103575.459000@cochabamba.local> >>>>> John J Lee (JJL) wrote: >JJL> On Sun, 24 Aug 2008, Richard Baron Penman wrote: >>> >>> I have a document with a format like this: >>> text1text2text3text4text5 >>> >>> I want to extract 'text1text3text5' from but the text attribute >>> returns just 'text1'. Here is an example: >>> >>> from lxml import html >>> doc = html.fromstring('text1text2text3text4text5') >JJL> [...] >>>> From the example you can see I can get what I want by first dropping the >>> subelements. >>> Is there a better way to access this text? >JJL> [...] 
>JJL> I only have 1.3.6 installed, so don't have the HTML support, but you want >JJL> to use the .tail of the b elements I think. With the XML API: >JJL> from lxml.etree import fromstring >JJL> doc = fromstring('text1text2text3text4text5') >JJL> b1, b2 = doc.getchildren() >JJL> print doc.text + b1.tail + b2.tail print doc.text+''.join(c.tail for c in doc.getchildren()) -- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From ivanov.maxim at gmail.com Mon Aug 25 00:21:29 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Mon, 25 Aug 2008 02:21:29 +0400 Subject: [lxml-dev] Encoding again Message-ID: Hi again! I'm unable to solve encoding problems myself, so I ask again here, hope someone have solution. Is there any way to force lxml to make element.text and element.tail to be exactly the same as in original text, without any encoding manipulation? Or to restore them to original state, i.e. maybe somewhere inside lxml there is a var which contain original encoding, so I could do elelemt.text.encode('...').? 
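
On the "Text obscured by subelement" thread above: the markup in the quoted examples did not survive archiving, so the sketch below assumes a <p> parent holding two <b> children, which matches the "b elements" John mentions. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml.etree follows the same .text/.tail model. The helper name own_text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the archive stripped the tags from the original example.
doc = ET.fromstring("<p>text1<b>text2</b>text3<b>text4</b>text5</p>")


def own_text(element):
    """Join the element's leading text with the tail text of each child.

    Text inside the children themselves ("text2", "text4") is skipped.
    .text and .tail are None when empty, hence the `or ""` guards.
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


print(own_text(doc))  # text1text3text5
```

Unlike the drop_tree() workaround from the original question, this leaves the tree unmodified.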
From stefan_ml at behnel.de Mon Aug 25 04:36:29 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Aug 2008 04:36:29 +0200 Subject: [lxml-dev] Text obscured by subelement In-Reply-To: <18609.56792.103575.459000@cochabamba.local> References: <18609.56792.103575.459000@cochabamba.local> Message-ID: <48B21AAD.8070601@behnel.de> Piet van Oostrum wrote: >>>>>> John J Lee (JJL) wrote: >> JJL> doc = fromstring('text1text2text3text4text5') >> JJL> b1, b2 = doc.getchildren() >> JJL> print doc.text + b1.tail + b2.tail > > print doc.text+''.join(c.tail for c in doc.getchildren()) print doc.text+''.join(c.tail for c in doc) Stefan From stefan_ml at behnel.de Mon Aug 25 04:42:08 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Aug 2008 04:42:08 +0200 Subject: [lxml-dev] Encoding again In-Reply-To: References: Message-ID: <48B21C00.8070700@behnel.de> Hi, Max Ivanov wrote: > Is there any way to force lxml to make element.text and element.tail > to be exactly the same as in original text, without any encoding > manipulation? Or to restore them to original state, i.e. maybe > somewhere inside lxml there is a var which contain original encoding, > so I could do elelemt.text.encode('...').? I'm not sure I understand what you want, but in case you want lxml.etree to return the encoded byte string instead of the unicode string: no, there is no switch to do that. I have no idea why you would want to do that, though. The original encoding is stored in the docinfo property of the ElementTree of the document. Stefan From dsoulayrol at free.fr Mon Aug 25 09:48:24 2008 From: dsoulayrol at free.fr (David Soulayrol) Date: Mon, 25 Aug 2008 09:48:24 +0200 Subject: [lxml-dev] ElementTree docinfo attribute In-Reply-To: <48AFFAC9.7020909@behnel.de> References: <1219420210.20981.29.camel@neodebianix.neotip.com> <48AFFAC9.7020909@behnel.de> Message-ID: <1219650504.6540.3.camel@neodebianix.neotip.com> Le samedi 23 ao?t 2008 ? 
13:55 +0200, Stefan Behnel a ?crit : > Hi, > > David Soulayrol wrote: > > I can read at > > http://article.gmane.org/gmane.comp.python.lxml.devel/1106/match=docinfo > > > > "a DocInfo object that you can also instantiate on an ElementTree (or > > Element) by hand." > > > > But I also can read from _ElementTree help : > > > > "docinfo: Information about the document provided by parser and DTD. > > This value is only defined for ElementTree objects based on the root > > node of a parsed document (e.g. those returned by the parse > > functions)." > > I just added ", not for trees that were built manually." at the end, that > should make it clearer. > > > > In any case, the docinfo property of ElementTree seems read only. So > > please tell me, is it possible to fill the docinfo property of an > > ElementTree by hand, or is it necessary to use the etree parse method ? > > What information would you like to add, and for what purpose? I have a method which takes a NS map, a root tag and DTD info and should return a correct document with all these info set. I was wondering if the only way to do this was to write everything inside a string and then call the parser with StringIO. -- David. From stefan_ml at behnel.de Mon Aug 25 10:31:25 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Aug 2008 10:31:25 +0200 (CEST) Subject: [lxml-dev] ElementTree docinfo attribute In-Reply-To: <1219650504.6540.3.camel@neodebianix.neotip.com> References: <1219420210.20981.29.camel@neodebianix.neotip.com> <48AFFAC9.7020909@behnel.de> <1219650504.6540.3.camel@neodebianix.neotip.com> Message-ID: <39945.213.61.181.86.1219653085.squirrel@groupware.dvs.informatik.tu-darmstadt.de> David Soulayrol wrote: > I have a method which takes a NS map, a root tag and DTD info and should > return a correct document with all these info set. I was wondering if > the only way to do this was to write everything inside a string and then > call the parser with StringIO. 
Ok, so the problem you have is how to set the DTD reference after creating the root Element. There isn't currently an API to do that, so you have to call the parser to create the right setting in the first place. I wouldn't mind making the public_id and system_url properties in ElementTree.docinfo (DocInfo class) writable. Patches appreciated. Stefan From stefan_ml at behnel.de Mon Aug 25 17:22:29 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Aug 2008 17:22:29 +0200 (CEST) Subject: [lxml-dev] Encoding again In-Reply-To: References: <48B21C00.8070700@behnel.de> Message-ID: <54973.213.61.181.86.1219677749.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Max Ivanov wrote: >> Max Ivanov wrote: >>> Is there any way to force lxml to make element.text and element.tail >>> to be exactly the same as in original text, without any encoding >>> manipulation? Or to restore them to original state, i.e. maybe >>> somewhere inside lxml there is a var which contains the original encoding, >>> so I could do element.text.encode('...').? >> >> I'm not sure I understand what you want, but in case you want lxml.etree >> to >> return the encoded byte string instead of the unicode string: no, there >> is no >> switch to do that. I have no idea why you would want to do that, though. >> >> The original encoding is stored in the docinfo property of the >> ElementTree of >> the document. > > Ok, I'll explain it in python since my English isn't ok for this task > =) I've attached a simple test case. None of the assertions there pass. I can't test it right now, but this might work for you. I just provided the parser with the right encoding information. Note that your "HTML document" does not specify an encoding, so I assume that the parser just expects it to be latin-1 or some other plain byte encoding, and reads the bytes as they come in. To be clear: it's the document that's broken here, not the parser. 
Note that you can also pass unicode strings into the parser, so if you manage to decode your HTML data into correct unicode, the parser will do the right thing. Stefan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: charset.py Url: http://codespeak.net/pipermail/lxml-dev/attachments/20080825/e40f35b4/attachment.diff From ivanov.maxim at gmail.com Mon Aug 25 22:15:33 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Tue, 26 Aug 2008 00:15:33 +0400 Subject: [lxml-dev] Encoding again In-Reply-To: <54973.213.61.181.86.1219677749.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <48B21C00.8070700@behnel.de> <54973.213.61.181.86.1219677749.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: > I can't test it right now, but this might work for you. I just provided > the parser with the right encoding information. Note that your "HTML > document" does not specify an encoding, so I assume that the parser just > expects it to be latin-1 or some other plain byte encoding, and reads the > bytes as they come in. To be clear: it's the document that's broken here, > not the parser. Yes indeed. I understand that the document is broken, but that's the case - I have to process even broken HTML pages. What's more, lxml does a lot of heavy lifting to make processing of broken HTML much easier. I'm talking about another step in that direction. There are lots of pages in the Russian segment of the internet with no charset specified. All of them contain lots of symbols with codes > 128. Do you agree that if you pass some data, it is reasonable to assume that it would return exactly the same data? Nowadays we have: origdata = 'some string with codes >128 (national chars)' xml = ''+origdata+'' .... parsing it with lxml.... rettext = doc.text_content() isinstance(rettext, unicode) #TRUE! but original text was not unicode. 
#ok, converting original text to unicode to compare unidata = origdata.decode('original encoding') origdata == doc.text_content() #FALSE! lxml makes garbage from our text. XML is all about tags and attribs, so why does lxml affect the content of elements? It should leave it as is if it doesn't know what to do with it ( == there is no charset information, so it is unable to detect it) Ok in some cases we could do rettext.encode('iso-8859-1') which converts unicode string to single-byte string leaving bytes the same ( ==unicode string is being read as byte array). But imagine what would happen if original data contains " " symbol? In rettext there will be one correct unicode symbol, and when we'll try to convert it to single byte string with iso-8859-1 hack it will be converted to wrong symbol! > Note that you can also pass unicode strings into the parser, so if you > manage to decode your HTML data into correct unicode, the parser will do > the right thing. That's what I'm trying to do. But first I need to throw out all tags, leave only tag content, because the charset detector would get confused if there are lots of ASCII symbols and few national symbols. doc.text_content() is an ideal way to do that, but now it is unusable for this task From stefan_ml at behnel.de Tue Aug 26 08:58:50 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 26 Aug 2008 08:58:50 +0200 Subject: [lxml-dev] Encoding again In-Reply-To: References: <48B21C00.8070700@behnel.de> <54973.213.61.181.86.1219677749.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <48B3A9AA.8080004@behnel.de> Hi, Max Ivanov wrote: >> I can't test it right now, but this might work for you. I just provided >> the parser with the right encoding information. Note that your "HTML >> document" does not specify an encoding, so I assume that the parser just >> expects it to be latin-1 or some other plain byte encoding, and reads the >> bytes as they come in. 
To be clear: it's the document that's broken here, >> not the parser. > > Yes indeed. I understand that the document is broken, but that's the case > - I have to process even broken HTML pages. What's more, lxml does a lot > of heavy lifting to make processing of broken HTML much easier. I'm > talking about another step in that direction. There are lots of pages in > the Russian segment of the internet with no charset specified. All of them > contain lots of symbols with codes > 128. Do you agree that if you > pass some data, it is reasonable to assume that it would return > exactly the same data? What you pass is a byte stream of unknown encoding. What you get back is a tree with well defined characters. Isn't that great enough? > Nowadays we have: > > origdata = 'some string with codes >128 (national chars)' > xml = ''+origdata+'' > .... parsing it with lxml.... > rettext = doc.text_content() > isinstance(rettext, unicode) #TRUE! but original text was not unicode. The "text" you are talking about was a sequence of bytes. Now it is a sequence of characters. It may not be the sequence you expect, because the document does not provide any hints about what the characters it describes with its byte sequences are (how do /you/ know it's really Bulgarian characters?), so they may be Latin-1, they may be UTF-8, they may be Cyrillic, they may be EBCDIC. I showed you two ways to make it the right sequence of characters in my last post, in case you have enough information to figure out the encoding with your own code. > #ok, converting original text to unicode to compare > unidata = origdata.decode('original encoding') > origdata == doc.text_content() #FALSE! lxml makes garbage from our text. No, it doesn't. It makes well-defined characters from ambiguous bytes. Please try to understand the difference between an encoded byte sequence and a Unicode character sequence before you blame tools that deploy Unicode correctly. 
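[Editorial sketch] The distinction between an encoded byte sequence and a Unicode character sequence discussed above can be illustrated with the standard library alone. This is a minimal Python 3 sketch, not code from the thread: the sample word, the helper name decode_candidates and the list of candidate encodings are all illustrative assumptions. It shows that the same bytes yield different characters depending on the encoding assumed, and that decoding as Latin-1 and encoding back returns the original bytes only because Latin-1 maps every byte value to the code point with the same number.

```python
def decode_candidates(raw, encodings):
    """Decode the same bytes under several candidate encodings (hypothetical helper)."""
    return {name: raw.decode(name) for name in encodings}


original_text = "Привет"               # illustrative text, not taken from the thread
raw = original_text.encode("koi8-r")   # the bytes as they would sit in the file

# One byte sequence, three different character sequences.
views = decode_candidates(raw, ["koi8-r", "cp1251", "iso-8859-1"])
for name, text in views.items():
    print(name, ascii(text))

# The Latin-1 round trip gives the original bytes back, which is why the
# encode('iso-8859-1') hack works as long as every character is below U+0100.
roundtrip = views["iso-8859-1"].encode("iso-8859-1")
print(roundtrip == raw)

# It stops working as soon as the text contains a character outside Latin-1.
try:
    "\u2014".encode("iso-8859-1")
except UnicodeEncodeError:
    print("U+2014 cannot be encoded as Latin-1")
```

Only the decoding that matches the encoding the bytes were written in gives back the intended text; the other results are equally valid character sequences, just not the ones the author meant.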
> Ok in some cases we could do rettext.encode('iso-8859-1') which > converts unicode string to single-byte string leaving bytes the same ( > ==unicode string is being read as byte array). That's a pretty ugly hack, I hope you know that. > But imagine what would happen if original data contains " " > symbol? In rettext there will be one correct unicode symbol, and when > we'll try to convert it to single byte string with iso-8859-1 hack it > will be converted to wrong symbol! Ah, so you already know that it's an ugly hack. Fine. :) >> Note that you can also pass unicode strings into the parser, so if you >> manage to decode your HTML data into correct unicode, the parser will do >> the right thing. > That's what I'm trying to do. But first I need to throw out all tags, > leave only tag content, because the charset detector would get confused if > there are lots of ASCII symbols and few national symbols. Then it's not a good-enough encoding detector. You really shouldn't blame the encoding detector in libxml2 for not being able to detect an ambiguous encoding, if the tool you prefer fails in the same way. If you want to remove all tags from the input byte sequence just to detect its encoding, you can use a regular expression like b"<[^>]*>". Should be good enough for that purpose. Stefan From ivanov.maxim at gmail.com Tue Aug 26 11:09:15 2008 From: ivanov.maxim at gmail.com (Max Ivanov) Date: Tue, 26 Aug 2008 13:09:15 +0400 Subject: [lxml-dev] Encoding again In-Reply-To: <48B3A9AA.8080004@behnel.de> References: <48B21C00.8070700@behnel.de> <54973.213.61.181.86.1219677749.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <48B3A9AA.8080004@behnel.de> Message-ID: > What you pass is a byte stream of unknown encoding. What you get back is a > tree with well defined characters. Isn't that great enough? In some cases (original text in ASCII) there are well defined characters, in other cases it is garbage. 
Why couldn't you just leave the content inside tags as is when the original encoding is unknown and the parser is unable to detect it from the data (no tag for example)? I'm just asking for a new keyword argument which disables any processing of unknown byte streams inside tags. That would make lxml more useful in a wider range of situations. >> origdata = 'some string with codes >128 (national chars)' >> xml = ''+origdata+'' >> .... parsing it with lxml.... >> rettext = doc.text_content() >> isinstance(rettext, unicode) #TRUE! but original text was not unicode. > > The "text" you are talking about was a sequence of bytes. Now it is a sequence > of characters. It may not be the sequence you expect, because the document > does not provide any hints about what the characters it describes with its > byte sequences are (how do /you/ know it's really Bulgarian characters?), so > they may be Latin-1, they may be UTF-8, they may be Cyrillic, they may be EBCDIC. That's what I'm talking about! If nobody knows what the content actually is, then leave it as is, as the original byte stream. Why does lxml now assume that the input stream is unicode? Nobody told it that. If lxml doesn't know the encoding then it should just process tags and attribs and build the tree; lxml doesn't need to know the correct encoding to do that, and any unknown data should be left untouched! That's the simplest rule I could ever imagine - if you don't know what it is, and you don't need it for your task, then avoid any processing of that data; it's up to the user how to handle it later. > I showed you two ways to make it the right sequence of characters in my last > post, in case you have enough information to figure out the encoding with your > own code. All of them require finding out the encoding of the text before parsing it with lxml. I'll talk about that later in this message. 
> >> #ok, converting original text to unicode to compare >> unidata = origdata.decode('original encoding') >> origdata == doc.text_content() #FALSE! 
lxml makes garbage from our text. > > No, it doesn't. It makes well-defined characters from ambiguous bytes. Please > try to understand the difference between an encoded byte sequence and a > Unicode character sequence before you blame tools that deploy Unicode correctly. Why do you call them well-defined characters? What is so "well" about them? > Then it's not a good-enough encoding detector. You really shouldn't blame the > encoding detector in libxml2 for not being able to detect an ambiguous > encoding, if the tool you prefer fails in the same way. I've found out from various experiments that the libxml2 encoding detector is based on tags in the source (meta, " etc). chardet, which I use, is implemented in Python, so I can't feed it a large amount of data because that takes a lot of time. If I feed it just 1-2 KB of raw source, I can't guarantee that there will be enough national characters for proper detection. That's why I clean the source first, then take a small chunk of data and then feed parser with it. > If you want to remove all tags from the input byte sequence just to detect its > encoding, you can use a regular expression like b"<[^>]*>". Should be good > enough for that purpose.
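[Editorial sketch] The tag-stripping suggestion quoted above can be sketched as follows, in Python 3 and using only the standard library. The function name text_sample, the 2048-byte default limit and the sample page are assumptions made for illustration; only the regular expression itself comes from the thread. Replacing each tag with a space rather than nothing is also a choice made here, so that words separated only by markup do not run together. The pattern is deliberately naive: it does not treat comments, script bodies or a ">" inside an attribute value specially, which should matter little when the goal is only a sample for encoding detection.

```python
import re

# The pattern suggested in the thread, compiled for bytes input.
TAG_RE = re.compile(rb"<[^>]*>")


def text_sample(raw_html, limit=2048):
    """Strip tags from raw bytes and return at most `limit` bytes of the rest.

    Hypothetical helper: works on undecoded bytes, so it needs no knowledge
    of the document's encoding beyond it being ASCII-compatible.
    """
    stripped = TAG_RE.sub(b" ", raw_html)
    collapsed = b" ".join(stripped.split())   # squeeze runs of ASCII whitespace
    return collapsed[:limit]


# Illustrative input: a page stored as windows-1251 bytes with no charset declared.
page = "<html><body><p>Привет, <b>мир</b>!</p></body></html>".encode("cp1251")
sample = text_sample(page)
print(sample)
```

The resulting sample could then be handed to a detector such as chardet, and the detected encoding used to decode the full page before parsing; that step is left out here.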