From ianb at colorstudy.com Mon Dec 1 20:32:06 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 01 Dec 2008 13:32:06 -0600 Subject: [lxml-dev] pyquery In-Reply-To: References: Message-ID: <49343BB6.1040103@colorstudy.com> Olivier Lauzanne wrote: > Hello, > > First thanks for lxml it's great. > But I miss an interface on top of it. Something like jquery > or hpricot . > Is there any work in progress to go toward something like that in python ? > > Missing a jquery like API in python, I started reproducing the jquery > API in python by using lxml and released it a few days ago : pyquery > Some of this overlaps with what lxml.html already does, and some would already be appropriate there. jQuery is a bit unusual in a Python context, because it only deals with sets of elements. But it's not unreasonable. Some things in jQuery are a result of Javascript, where the equivalent in Python would use a different syntax. For instance: >>> p.attr("id") 'hello' >>> p.attr("id", "plop") [] Would more typically be: >>> p.attrib['id'] 'hello' >>> p.attrib['id'] = 'plop' Javascript just doesn't have anything like __getitem__/__setitem__, and doesn't really have getters and setters (at least on many browsers) so it also has to use functions to get and set values. Also note you don't allow things like p.attr('id', None), which should be valid (probably meaning an attribute deletion). Of course if you have CSS patches to CSSSelect (e.g., for :first -- though I thought that worked?) it would be good to have them in lxml directly. Or if there are patches to make it easier to subclass CSSSelector, that'd be fine too -- there's a number of useful extensions to selectors in jQuery (e.g., input:checkbox), but it'd be nice to keep CSSSelect itself more strictly CSS 3. The $() constructor is also overloaded to do a lot more than selection, but that's kind of out of style for Python -- alternate class methods would be preferable. 
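To make the attribute-access comparison above concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml.etree follows the same Element.attrib mapping interface. The sample markup is made up, and using `del` for removal is only one possible reading of what attr('id', None) might mean.

```python
import xml.etree.ElementTree as ET

# Made-up sample element, for illustration only.
p = ET.fromstring('<p id="hello">text</p>')

# Reading and writing go through the mapping interface of .attrib.
original_id = p.attrib['id']
p.attrib['id'] = 'plop'
updated_id = p.get('id')

# Deleting the attribute is an ordinary mapping deletion.
del p.attrib['id']
removed_id = p.get('id')   # get() returns None for a missing attribute

print(original_id, updated_id, removed_id)
```

The getter, the setter and the deletion are three different syntactic forms here, which is the contrast with a single overloaded attr() function.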
You also seem to be using lxml.etree in places where lxml.html would definitely be better. E.g., for setting .html:

    children = lxml.html.fragments_fromstring(html)
    if children and isinstance(children[0], basestring):
        parent.text = children.pop(0)
    else:
        parent.text = None
    parent[:] = children

Also to get the HTML contents, (parent.text or '')+''.join(tostring(el) for el in parent). I'm sure there are several other things.

-- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From ianb at colorstudy.com Mon Dec 1 21:32:57 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 01 Dec 2008 14:32:57 -0600
Subject: [lxml-dev] Question on clean_html
In-Reply-To: <1c1ea3180811292203m76aac375w8d654a6d107805df@mail.gmail.com>
References: <1c1ea3180811292203m76aac375w8d654a6d107805df@mail.gmail.com>
Message-ID: <493449F9.5060701@colorstudy.com>

Brian Neal wrote:
> Hi,
>
> I would like to use lxml to remove all tags except 'a' tags. Is this possible?
>
> I don't seem to understand the arguments to the Cleaner class. What
> does allow_tags do?
>
> I tried this:
>
>>>> c = Cleaner(allow_tags=('a',), remove_unknown_tags=False)
>>>> print c.clean_html('Hi')
> Hi
>
> Do I instead have to list all the tags I don't want, except for 'a',
> in a remove_tags keyword argument?
>
> Any hints? Thank you.

There's not really a way to do this with the Cleaner I'm afraid. (Hrm... I really need to clean up the options there, as they overlap in lots of weird ways and are confusing.) The method .drop_tag could help here, like (untested):

    for el in list(doc.iter()):
        if el.tag not in ['a']:
            el.drop_tag()

I'm not 100% sure what happens if you modify the tree in place like this, though I think list() will make it work.
-- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org From stefan_ml at behnel.de Tue Dec 2 09:23:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 2 Dec 2008 09:23:05 +0100 (CET) Subject: [lxml-dev] Question on clean_html In-Reply-To: <493449F9.5060701@colorstudy.com> References: <1c1ea3180811292203m76aac375w8d654a6d107805df@mail.gmail.com> <493449F9.5060701@colorstudy.com> Message-ID: <45243.213.61.181.86.1228206185.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Ian Bicking wrote: > for el in list(doc.iter()): > if el.tag not in ['a']: > el.drop_tag() > > I'm not 100% sure what happens if you modify the tree in place like > this, though I think list() will make it work. It will at least refuse to drop the root element. Running through list(root.iterdescendants()) should work, though, although the above will definitely not result in a valid HTML document. If you are really only interested in a couple of tags without a meaningful structure, you should collect them in a list rather than cutting everything else out of the document (which is quite costly). Stefan From olauzanne at gmail.com Tue Dec 2 14:23:30 2008 From: olauzanne at gmail.com (Olivier Lauzanne) Date: Tue, 2 Dec 2008 14:23:30 +0100 Subject: [lxml-dev] pyquery In-Reply-To: <49343BB6.1040103@colorstudy.com> References: <49343BB6.1040103@colorstudy.com> Message-ID: On Mon, Dec 1, 2008 at 8:32 PM, Ian Bicking wrote: > Olivier Lauzanne wrote: > >> Hello, >> >> First thanks for lxml it's great. >> But I miss an interface on top of it. Something like jquery < >> http://jquery.com> or hpricot > >. >> Is there any work in progress to go toward something like that in python ? >> >> Missing a jquery like API in python, I started reproducing the jquery API >> in python by using lxml and released it a few days ago : pyquery < >> http://pypi.python.org/pypi/pyquery> >> > > Some of this overlaps with what lxml.html already does, and some would > already be appropriate there. 
jQuery is a bit unusual in a Python context, > because it only deals with sets of elements. But it's not unreasonable. > In lxml.html, it seems there is very specific code for each html tag. I think the css query approach is more powerfull and simple. And it can provide a similar enough api. Instead of doing p.inputs you just do p('input'). Dealing with sets of elements is something that I came to love about jquery. And I don't think it's actually unpythonic in any way. It's just a different approach. It's just like getting an element of a string gives you a string back and not a character. > > Some things in jQuery are a result of Javascript, where the equivalent in > Python would use a different syntax. For instance: > > >>> p.attr("id") > 'hello' > >>> p.attr("id", "plop") > [] > > Would more typically be: > > >>> p.attrib['id'] > 'hello' > >>> p.attrib['id'] = 'plop' > > Javascript just doesn't have anything like __getitem__/__setitem__, and > doesn't really have getters and setters (at least on many browsers) so it > also has to use functions to get and set values. Also note you don't allow > things like p.attr('id', None), which should be valid (probably meaning an > attribute deletion). > attr('id', None) doesn't work, but it doesn't work in jquery either, there actually is a method called removeAttr for that purpose. You're right, jquery isn't always perfectly pythonic, it doesn't use setters, and method names use the hungarian notation which isn't pythonic and which I don't like. But it is object oriented (very much so) and allow "streamed" method application, calling method over method over method on the same object, which you can't do if you use a python setter. Also jquery misses a method to access the full html string of a tag (you can only access innerHtml) which sucks. On the other hand it is has the advantage of being simple, well known, used and documented API. So it felt like it would already be good to replicate it. 
Also reproducing the jquery API has the advantage of making it trivial to move a functionality in a web application from server to client, or client to server. And then if people started using it and if there was a consensus that it should be changed it could always be done then. But I'm open enough if you have a vision of a better API, but it would have to be a significantly better API to compensate for the fact of not using a well known API. > > Of course if you have CSS patches to CSSSelect (e.g., for :first -- though > I thought that worked?) it would be good to have them in lxml directly. Or > if there are patches to make it easier to subclass CSSSelector, that'd be > fine too -- there's a number of useful extensions to selectors in jQuery > (e.g., input:checkbox), but it'd be nice to keep CSSSelect itself more > strictly CSS 3. The $() constructor is also overloaded to do a lot more > than selection, but that's kind of out of style for Python -- alternate > class methods would be preferable. > I don't have patches yet, but I have seen where they can be done. I was planning on monkey-patching, I perfectly agree that CSSSelect should remain standard compliant. I'll check if I can do something cleaner than monkey-patching. > > You also seem to be using lxml.etree in places where lxml.html would > definitely be better. E.g., for setting .html: > > children = lxml.html.fragments_fromstring(html) > if children and isinstance(children[0], basestring): > parent.text = children.pop(0) > else: > parent.text = None > parent[:] = children > > Also to get the HTML contents, (parent.text or '')+''.join(tostring(el) for > el in parent). I'm sure there's several other things. > Thanks for the info, I'll look into it. pyquery was the occasion for me to learn lxml so I may have overlooked some more things. 
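The expression quoted above for getting an element's contents can be tried in isolation. This is only a sketch: it uses the standard library's xml.etree.ElementTree so that it runs without lxml, on the assumption that the serialiser writes each child's tail text after the child (lxml.html.tostring does the same by default). The helper name inner_markup and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

def inner_markup(parent):
    """Return the parent's leading text plus each child serialised with its tail."""
    return (parent.text or '') + ''.join(
        ET.tostring(el, encoding='unicode') for el in parent)

# Made-up sample: text before the child, text inside it, and tail text after it.
div = ET.fromstring('<div>Hello <b>big</b> world</div>')
contents = inner_markup(div)
print(contents)

# An element with no text and no children yields an empty string.
empty = inner_markup(ET.fromstring('<div/>'))
```

The `or ''` guard matters because .text is None, not an empty string, when an element has no leading text.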
Also jquery hacks are a common practice when working on complex applications, you can't understand the logic of the application (or just don't want to modify it) so you just hack the modification in another layer on top of the application, this layer can be javasscript but I think it's kind of the same idea that is used in deliverance. I would like to have a wsgi application where I could do some quick hacks like that on server side, maybe in deliverance or in its own wsgi middleware. What do you think ? Thanks for your answer, Olivier Lauzanne -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081202/00851283/attachment.htm From ianb at colorstudy.com Wed Dec 3 19:40:42 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 03 Dec 2008 12:40:42 -0600 Subject: [lxml-dev] pyquery In-Reply-To: References: <49343BB6.1040103@colorstudy.com> Message-ID: <4936D2AA.7090106@colorstudy.com> Olivier Lauzanne wrote: > > > On Mon, Dec 1, 2008 at 8:32 PM, Ian Bicking > wrote: > > Olivier Lauzanne wrote: > > Hello, > > First thanks for lxml it's great. > But I miss an interface on top of it. Something like jquery > or hpricot > . > > Is there any work in progress to go toward something like that > in python ? > > Missing a jquery like API in python, I started reproducing the > jquery API in python by using lxml and released it a few days > ago : pyquery > > > Some of this overlaps with what lxml.html already does, and some > would already be appropriate there. jQuery is a bit unusual in a > Python context, because it only deals with sets of elements. But > it's not unreasonable. > > > In lxml.html, it seems there is very specific code for each html tag. I > think the css query approach is more powerfull and simple. And it can > provide a similar enough api. Instead of doing p.inputs you just do > p('input'). In most cases there's something distinct about those attributes. 
For instance p.inputs gives you special form fields. If course p.cssselect('input,select,textarea') also works (and if you don't mind a honking long XPath query you could do that too). > Dealing with sets of elements is something that I came to love about > jquery. And I don't think it's actually unpythonic in any way. It's just > a different approach. It's just like getting an element of a string > gives you a string back and not a character. Well... it is unpythonic in that sets and items are treated differently in Python (except the oddball case of strings, as you mention). It's more a question of whether it is justifiably unpythonic... and I'm not disputing that it can be. > Some things in jQuery are a result of Javascript, where the > equivalent in Python would use a different syntax. For instance: > > >>> p.attr("id") > 'hello' > >>> p.attr("id", "plop") > [] > > Would more typically be: > > >>> p.attrib['id'] > 'hello' > >>> p.attrib['id'] = 'plop' > > Javascript just doesn't have anything like __getitem__/__setitem__, > and doesn't really have getters and setters (at least on many > browsers) so it also has to use functions to get and set values. > Also note you don't allow things like p.attr('id', None), which > should be valid (probably meaning an attribute deletion). > > > attr('id', None) doesn't work, but it doesn't work in jquery either, > there actually is a method called removeAttr for that purpose. Well, it would be easy to make it work, just don't use None as your sentinel. > You're right, jquery isn't always perfectly pythonic, it doesn't use > setters, and method names use the hungarian notation which isn't > pythonic and which I don't like. But it is object oriented (very much > so) and allow "streamed" method application, calling method over method > over method on the same object, which you can't do if you use a python > setter. Also jquery misses a method to access the full html string of a > tag (you can only access innerHtml) which sucks. 
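On the sentinel point above, a minimal sketch of the idea in plain Python. The Node class and the _MISSING name are hypothetical stand-ins, not pyquery's actual code: a private sentinel object marks "no value passed", so that an explicit None is free to mean "remove the attribute".

```python
_MISSING = object()  # hypothetical private sentinel, distinct from None


class Node:
    """Hypothetical stand-in for a jQuery-like wrapper around one element."""

    def __init__(self, **attrs):
        self.attrib = dict(attrs)

    def attr(self, name, value=_MISSING):
        if value is _MISSING:
            # One-argument form: act as a getter.
            return self.attrib.get(name)
        if value is None:
            # Explicit None: delete the attribute if it is present.
            self.attrib.pop(name, None)
        else:
            self.attrib[name] = value
        # Two-argument form returns the wrapper so calls can be chained.
        return self


p = Node(id='hello')
first = p.attr('id')
p.attr('id', 'plop').attr('title', 'x')
second = p.attr('id')
p.attr('id', None)
print(first, second, p.attrib)
```

The identity test `value is _MISSING` is what keeps attr('id') and attr('id', None) distinguishable.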
There's a very small (4-line?) outerHtml plugin for jquery, BTW. > On the other hand it is has the advantage of being simple, well known, > used and documented API. So it felt like it would already be good to > replicate it. Also reproducing the jquery API has the advantage of > making it trivial to move a functionality in a web application from > server to client, or client to server. And then if people started using > it and if there was a consensus that it should be changed it could > always be done then. But I'm open enough if you have a vision of a > better API, but it would have to be a significantly better API to > compensate for the fact of not using a well known API. I think there are arguably places where setters and getters are just simpler and look nicer. I guess I see the jQuery technique for these specifically as a way of turning a deficiency in Javascript (lack of getters and setters) into an advantage (chaining)... but I'm not sure it's enough of an advantage to make it worth it. For instance, el.html and el.html = '...' seems nicer to me than el.html() and el.html('...'), and all you lose is the ability to do something like el.html('...').attr('foo', 'bar'), and that doesn't seem like such a big thing. Also there's two APIs: jQuery and lxml. There's some advantage to reusing the lxml APIs as well, I think, so that for instance el.attrib and el.get().attrib are the same. (I'm not sure you actually implemented .get()?) It might be good, or it might be sloppy, to actually support both APIs to the degree they don't overlap (e.g., .attr vs. .attrib). > Of course if you have CSS patches to CSSSelect (e.g., for :first -- > though I thought that worked?) it would be good to have them in lxml > directly. Or if there are patches to make it easier to subclass > CSSSelector, that'd be fine too -- there's a number of useful > extensions to selectors in jQuery (e.g., input:checkbox), but it'd > be nice to keep CSSSelect itself more strictly CSS 3. 
The $() > constructor is also overloaded to do a lot more than selection, but > that's kind of out of style for Python -- alternate class methods > would be preferable. > > > I don't have patches yet, but I have seen where they can be done. I was > planning on monkey-patching, I perfectly agree that CSSSelect should > remain standard compliant. I'll check if I can do something cleaner than > monkey-patching. Probably some of the functions would have to turn into methods of a class, and then you'd subclass that to add custom selectors and XPath translations of those selectors. > You also seem to be using lxml.etree in places where lxml.html would > definitely be better. E.g., for setting .html: > > children = lxml.html.fragments_fromstring(html) > if children and isinstance(children[0], basestring): > parent.text = children.pop(0) > else: > parent.text = None > parent[:] = children > > Also to get the HTML contents, (parent.text or > '')+''.join(tostring(el) for el in parent). I'm sure there's > several other things. > > > Thanks for the info, I'll look into it. pyquery was the occasion for me > to learn lxml so I may have overlooked some more things. > > Also jquery hacks are a common practice when working on complex > applications, you can't understand the logic of the application (or just > don't want to modify it) so you just hack the modification in another > layer on top of the application, this layer can be javasscript but I > think it's kind of the same idea that is used in deliverance. I would > like to have a wsgi application where I could do some quick hacks like > that on server side, maybe in deliverance or in its own wsgi middleware. > What do you think ? 
Yeah, that could be possible -- people have asked for the ability to do arbitrary code-based transitions in Deliverance -- for the reasons you describe, like not wanting to touch the underlying application -- and this would probably be a very comfortable technique for people, especially if they are more front-end oriented. Like people have asked for the ability to do something that I guess would be expressed like doc('ul#menu li').prepend('>), when they want some kind of text separators in a list.

-- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From nepi at gmx.ch Thu Dec 4 12:46:34 2008
From: nepi at gmx.ch (Daniel Jirku)
Date: Thu, 04 Dec 2008 12:46:34 +0100
Subject: [lxml-dev] html encoding
Message-ID: <20081204114634.241670@gmx.net>

hi...

My problem is, I suppose, well known, but I couldn't find any solution through my searches...

I have a regular html link with a ? and an &. When I print the variable in Python, it looks fine (like: http://www.somelink.com/site.html?param1=test&param2=hello), BUT when I add it to my root xml element with:

    adId1 = etree.SubElement(tagAd, "originalAdUrl")
    adId1.text = adUrl

and then later write the xml to a file with this:

    toStringValue = etree.tostring(xmlTagRoot, encoding="utf-8", method="xml", xml_declaration=True, pretty_print=True)
    ...

the tag has as its value the link with an &amp; instead of & !! How can I use the correct signs for persistent storage in an xml file...?

thank you very much..

From d.rothe at semantics.de Thu Dec 4 12:57:26 2008
From: d.rothe at semantics.de (Dirk Rothe)
Date: Thu, 04 Dec 2008 12:57:26 +0100
Subject: [lxml-dev] html encoding
In-Reply-To: <20081204114634.241670@gmx.net>
References: <20081204114634.241670@gmx.net>
Message-ID:

On Thu, 04 Dec 2008 12:46:34 +0100, Daniel Jirku wrote:

> hi...
>
> My problem is, I suppose, well known, but I couldn't find any solution
> through my searches...
>
> I have a regular html link with a ? and an &. When I print the variable in
> Python, it looks fine (like:
> http://www.somelink.com/site.html?param1=test&param2=hello), BUT when I
> add it to my root xml element with:
> adId1 = etree.SubElement(tagAd, "originalAdUrl")
> adId1.text = adUrl
>
> and then later write the xml to a file with this:
> toStringValue = etree.tostring(xmlTagRoot, encoding="utf-8",
> method="xml", xml_declaration=True, pretty_print=True)
> ...
>
> the tag has as its value the link with an &amp; instead of & !!
> How can I use the correct signs for persistent storage in an xml file...?

The XML processor has correctly escaped your "&" character. A bare "&" is not allowed in XML text, so it has to be written as "&amp;" in the file. If you deserialise (aka load) the file with an XML parser of your choice, it will restore your "&" character.

see http://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_entity_references

--dirk

From filip.salomonsson at gmail.com Thu Dec 4 15:00:16 2008
From: filip.salomonsson at gmail.com (Filip Salomonsson)
Date: Thu, 4 Dec 2008 15:00:16 +0100
Subject: [lxml-dev] Tracking down bugs
Message-ID: <2f334ccd0812040600v56900797ja6861cfdca7cd0f@mail.gmail.com>

lxml occasionally goes crashing on me, with an error message and backtrace like the one below. Usually, the same program runs just fine if I try again. What can I do to track down the cause?

*** glibc detected *** python: free(): invalid pointer: 0x0a0388cb ***
======= Backtrace: =========
/lib/libc.so.6[0x1d3b16]
/lib/libc.so.6(cfree+0x90)[0x1d7070]
/usr/lib/libxml2.so.2(xmlNodeSetName+0xa5)[0x68fc0c5]
/home/stp02/salo/lib/python/lxml-2.1.2-py2.4-linux-i686.egg/lxml/etree.so[0x3b988f]
/usr/lib/libpython2.4.so.1.0[0x5437d83]
/usr/lib/libpython2.4.so.1.0(PyObject_GenericSetAttr+0x104)[0x5455b84]
/usr/lib/libpython2.4.so.1.0(PyObject_SetAttr+0xbb)[0x54561db]
/usr/lib/libpython2.4.so.1.0(PyEval_EvalFrame+0x2d51)[0x548cd91]
/usr/lib/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x896)[0x548fc76]
/usr/lib/libpython2.4.so.1.0(PyEval_EvalCode+0x63)[0x548fd03]
/usr/lib/libpython2.4.so.1.0[0x54acad8]
/usr/lib/libpython2.4.so.1.0(PyRun_SimpleFileExFlags+0x198)[0x54ae1e8]
/usr/lib/libpython2.4.so.1.0(PyRun_AnyFileExFlags+0x7a)[0x54ae8ca]
/usr/lib/libpython2.4.so.1.0(Py_Main+0xb85)[0x54b52d5]
python(main+0x32)[0x80485b2]
/lib/libc.so.6(__libc_start_main+0xdc)[0x180dec]
python[0x80484c1]

-- filip salomonsson

From stefan_ml at behnel.de Thu Dec 4 15:37:41 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 4 Dec 2008 15:37:41 +0100 (CET)
Subject: [lxml-dev] Tracking down bugs
In-Reply-To: <2f334ccd0812040600v56900797ja6861cfdca7cd0f@mail.gmail.com>
References: <2f334ccd0812040600v56900797ja6861cfdca7cd0f@mail.gmail.com>
Message-ID: <58011.213.61.181.86.1228401461.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

thanks for the report.

Filip Salomonsson wrote:
> lxml occasionally goes crashing on me, with an error message and
> backtrace like the one below. Usually, the same program runs just fine
> if I try again. What can I do to track down the cause?

First of all, use the latest released (!) version of lxml and build it without having Cython installed and with debug symbols enabled (add "-g3" to your CFLAGS) so that the stack trace shows line numbers that can be looked up in the released sources.
Then, install valgrind and try to make your program crash under valgrind control. Valgrind usually gives pretty accurate information about the origin of a bad memory location, so that's a lot more informative than a bare stack trace. There is a command line in the Makefile (make valtest) that you can use to find the appropriate options. Stripping down your program to a shorter snippet that shows the crash more or less reliably is a very good way to track down the circumstances that are required to trigger the problem. In the best case, you end up with a short test case that we can add to lxml's test suite to make sure the problem never comes back. Lastly, if you know how to use gdb and can investigate a bit more, any further hints that you find can be helpful in tracking down the problem. Thanks, Stefan From filip.salomonsson at gmail.com Thu Dec 4 15:58:51 2008 From: filip.salomonsson at gmail.com (Filip Salomonsson) Date: Thu, 4 Dec 2008 15:58:51 +0100 Subject: [lxml-dev] Tracking down bugs In-Reply-To: <58011.213.61.181.86.1228401461.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <2f334ccd0812040600v56900797ja6861cfdca7cd0f@mail.gmail.com> <58011.213.61.181.86.1228401461.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <2f334ccd0812040658u29320bc5p98ec97fe725c7aaf@mail.gmail.com> On Thu, Dec 4, 2008 at 15:37, Stefan Behnel wrote: > > First of all, use the latest released (!) version of lxml and build it > without having Cython installed and with debug symbols enabled (add "-g3" > to your CFLAGS) > Then, install valgrind and try to make your program crash under valgrind > control. Excellent; thanks! I'll try that and see what I can find. (Hadn't noticed 2.1.3, really - I'll go get that regardless.) > Stripping down your program to a shorter snippet that shows the crash more > or less reliably is a very good way to track down the circumstances that > are required to trigger the problem. 
Sure will - if and when I'm able to provoke a repeatable crash. It all seems annoyingly random so far. > Lastly, if you know how to use gdb and can investigate a bit more, any > further hints that you find can be helpful in tracking down the problem. I'm afraid gdb is mostly black voodoo magic to me so far, but I'll probably try to dive into it if I get to the point where that seems useful. Thank you, Filip (If I didn't have lxml, I think my job would be really tedious, so thanks for that too!) From olauzanne at gmail.com Fri Dec 5 16:27:26 2008 From: olauzanne at gmail.com (Olivier Lauzanne) Date: Fri, 5 Dec 2008 16:27:26 +0100 Subject: [lxml-dev] pyquery In-Reply-To: <4936D2AA.7090106@colorstudy.com> References: <49343BB6.1040103@colorstudy.com> <4936D2AA.7090106@colorstudy.com> Message-ID: I released pyquery 0.2 with a much more complete API. http://pypi.python.org/pypi/pyquery On Wed, Dec 3, 2008 at 7:40 PM, Ian Bicking wrote: > Olivier Lauzanne wrote: > >> >> >> On Mon, Dec 1, 2008 at 8:32 PM, Ian Bicking > ianb at colorstudy.com>> wrote: >> >> Olivier Lauzanne wrote: >> >> Hello, >> >> First thanks for lxml it's great. >> But I miss an interface on top of it. Something like jquery >> or hpricot >> . >> >> Is there any work in progress to go toward something like that >> in python ? >> >> Missing a jquery like API in python, I started reproducing the >> jquery API in python by using lxml and released it a few days >> ago : pyquery >> >> >> Some of this overlaps with what lxml.html already does, and some >> would already be appropriate there. jQuery is a bit unusual in a >> Python context, because it only deals with sets of elements. But >> it's not unreasonable. >> >> >> In lxml.html, it seems there is very specific code for each html tag. I >> think the css query approach is more powerfull and simple. And it can >> provide a similar enough api. Instead of doing p.inputs you just do >> p('input'). 
>> > > In most cases there's something distinct about those attributes. For > instance p.inputs gives you special form fields. If course > p.cssselect('input,select,textarea') also works (and if you don't mind a > honking long XPath query you could do that too). > > Dealing with sets of elements is something that I came to love about >> jquery. And I don't think it's actually unpythonic in any way. It's just a >> different approach. It's just like getting an element of a string gives you >> a string back and not a character. >> > > Well... it is unpythonic in that sets and items are treated differently in > Python (except the oddball case of strings, as you mention). It's more a > question of whether it is justifiably unpythonic... and I'm not disputing > that it can be. > > Some things in jQuery are a result of Javascript, where the >> equivalent in Python would use a different syntax. For instance: >> >> >>> p.attr("id") >> 'hello' >> >>> p.attr("id", "plop") >> [] >> >> Would more typically be: >> >> >>> p.attrib['id'] >> 'hello' >> >>> p.attrib['id'] = 'plop' >> >> Javascript just doesn't have anything like __getitem__/__setitem__, >> and doesn't really have getters and setters (at least on many >> browsers) so it also has to use functions to get and set values. >> Also note you don't allow things like p.attr('id', None), which >> should be valid (probably meaning an attribute deletion). >> >> >> attr('id', None) doesn't work, but it doesn't work in jquery either, there >> actually is a method called removeAttr for that purpose. >> > > Well, it would be easy to make it work, just don't use None as your > sentinel. > It works in the 0.2 version that I just released. > > > You're right, jquery isn't always perfectly pythonic, it doesn't use >> setters, and method names use the hungarian notation which isn't pythonic >> and which I don't like. 
But it is object oriented (very much so) and allow >> "streamed" method application, calling method over method over method on the >> same object, which you can't do if you use a python setter. Also jquery >> misses a method to access the full html string of a tag (you can only access >> innerHtml) which sucks. >> > > There's a very small (4-line?) outerHtml plugin for jquery, BTW. > Cool. > > > On the other hand it is has the advantage of being simple, well known, >> used and documented API. So it felt like it would already be good to >> replicate it. Also reproducing the jquery API has the advantage of making it >> trivial to move a functionality in a web application from server to client, >> or client to server. And then if people started using it and if there was a >> consensus that it should be changed it could always be done then. But I'm >> open enough if you have a vision of a better API, but it would have to be a >> significantly better API to compensate for the fact of not using a well >> known API. >> > > I think there are arguably places where setters and getters are just > simpler and look nicer. I guess I see the jQuery technique for these > specifically as a way of turning a deficiency in Javascript (lack of getters > and setters) into an advantage (chaining)... but I'm not sure it's enough of > an advantage to make it worth it. > > For instance, el.html and el.html = '...' seems nicer to me than el.html() > and el.html('...'), and all you lose is the ability to do something like > el.html('...').attr('foo', 'bar'), and that doesn't seem like such a big > thing. > You're right. But I still think that the fact of being compatible with a known API is good. > > Also there's two APIs: jQuery and lxml. There's some advantage to reusing > the lxml APIs as well, I think, so that for instance el.attrib and > el.get().attrib are the same. (I'm not sure you actually implemented > .get()?) > No this get is not implemented yet. 
It seems that it's in jQuery only for backward compatibility http://docs.jquery.com/Core/get > > It might be good, or it might be sloppy, to actually support both APIs to > the degree they don't overlap (e.g., .attr vs. .attrib). > Gael Pasgrimaud started contributing to pyquery (and he contributed a lot !) and he created a more pythonic API for the attributes alongside the jQuery one. > > Of course if you have CSS patches to CSSSelect (e.g., for :first -- >> though I thought that worked?) it would be good to have them in lxml >> directly. Or if there are patches to make it easier to subclass >> CSSSelector, that'd be fine too -- there's a number of useful >> extensions to selectors in jQuery (e.g., input:checkbox), but it'd >> be nice to keep CSSSelect itself more strictly CSS 3. The $() >> constructor is also overloaded to do a lot more than selection, but >> that's kind of out of style for Python -- alternate class methods >> would be preferable. >> >> >> I don't have patches yet, but I have seen where they can be done. I was >> planning on monkey-patching, I perfectly agree that CSSSelect should remain >> standard compliant. I'll check if I can do something cleaner than >> monkey-patching. >> > > Probably some of the functions would have to turn into methods of a class, > and then you'd subclass that to add custom selectors and XPath translations > of those selectors. > Didn't had time for it yet, but I'll look into it. > > > You also seem to be using lxml.etree in places where lxml.html would >> definitely be better. E.g., for setting .html: >> >> children = lxml.html.fragments_fromstring(html) >> if children and isinstance(children[0], basestring): >> parent.text = children.pop(0) >> else: >> parent.text = None >> parent[:] = children >> >> Also to get the HTML contents, (parent.text or >> '')+''.join(tostring(el) for el in parent). I'm sure there's >> several other things. >> >> >> Thanks for the info, I'll look into it. 
pyquery was the occasion for me to >> learn lxml so I may have overlooked some more things. >> >> Also jquery hacks are a common practice when working on complex >> applications, you can't understand the logic of the application (or just >> don't want to modify it) so you just hack the modification in another layer >> on top of the application, this layer can be javascript but I think it's >> kind of the same idea that is used in deliverance. I would like to have a >> wsgi application where I could do some quick hacks like that on server side, >> maybe in deliverance or in its own wsgi middleware. What do you think ? >> > > Yeah, that could be possible -- people have asked for the ability to do > arbitrary code-based transitions in Deliverance -- for the reasons you > describe, like not wanting to touch the underlying application -- and this > would probably be a very comfortable technique for people, especially if > they are more front-end oriented. Like people have asked for the ability to > do something that I guess would be expressed like doc('ul#menu > li').prepend('>), when they want some kind of text separators in a list. > Gael also created an api for getting urls from wsgi applications so I think pyquery is getting really close to something that is actually usable :) - Olivier Lauzanne -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081205/de331fa8/attachment-0001.htm From gael at gawel.org Mon Dec 8 16:17:44 2008 From: gael at gawel.org (Gael Pasgrimaud) Date: Mon, 8 Dec 2008 16:17:44 +0100 Subject: [lxml-dev] buildlibxml problems Message-ID: <7911b3bb0812080717q3f408333h51a7d655af3feed5@mail.gmail.com> Hi, I have a problem installing lxml with Cython and --static-deps. The problem is that you don't keep the original os.environ in place during subprocess so gcc is not found during ./configure of libxml2.
I've found a workaround and you can find my patch attached (buildlibxml.py.patch) Regards, -- Gael -------------- next part -------------- A non-text attachment was scrubbed... Name: buildlibxml.py.patch Type: application/octet-stream Size: 1020 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20081208/59b8298e/attachment.obj From jeandaniel.browne at gmail.com Tue Dec 9 13:13:20 2008 From: jeandaniel.browne at gmail.com (Jean Daniel) Date: Tue, 9 Dec 2008 13:13:20 +0100 Subject: [lxml-dev] extracting .text strings systematically in unicode Message-ID: Hello, I am working on a small XML to SQL application. Input attribute values and text fields usually are unicode but not always. They are fed into the attributes of an object which only accepts unicode input and raises an exception if the data is an 'str' instead (said object is a storm persisted class). My problem seems to be that lxml extracts a text element either as an 'str' or a 'unicode', depending on the text element, as shown in these code snippets: from lxml.etree import XML type( XML('element').text ) type( XML('élément').text ) So far, it seems that my only choice is to 'cast' every extraction of the xml doc to unicode, which is cumbersome and does not seem necessary. Example : self.name = unicode( element.get('name') ) for child in element: setattr(self, child.tag, unicode( child.text ) ) Is there a switch in the lxml module to make the strings of the xml document appear predictably as unicode even if the string can be represented as a simple 'str'?
Thank you, From stefan_ml at behnel.de Tue Dec 9 19:00:09 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Dec 2008 19:00:09 +0100 Subject: [lxml-dev] extracting .text strings systematically in unicode In-Reply-To: References: Message-ID: <493EB229.1010005@behnel.de> Hi, Jean Daniel wrote: > Is there a switch in the lxml module to make the strings of the xml > document appears predictably as unicode even is the string can be > represented a simple 'str'? No, that's the way ElementTree works (and lxml is ET compatible). This is mainly for performance reasons, since ASCII strings are extremely common in XML. Creating a plain ASCII str is more memory efficient and a lot faster than creating a unicode object, and in Py2 it behaves the same in almost all situations (except in APIs that specifically test for unicode objects as input). You can either switch to Py3.0 where lxml always returns unicode strings, or you can stick to casting the string yourself. BTW, it's faster to do u""+s than to do unicode(s) although it might be considered less readable. It has the advantage of raising an exception for non-strings, though. Stefan From jlovell at nwesd.org Tue Dec 9 19:11:12 2008 From: jlovell at nwesd.org (John Lovell) Date: Tue, 9 Dec 2008 10:11:12 -0800 Subject: [lxml-dev] extracting .text strings systematically in unicode In-Reply-To: <493EB229.1010005@behnel.de> References: <493EB229.1010005@behnel.de> Message-ID: The first one is the one the raises an exception for non-strings? John ---- You can either switch to Py3.0 where lxml always returns unicode strings, or you can stick to casting the string yourself. BTW, it's faster to do u""+s than to do unicode(s) although it might be considered less readable. It has the advantage of raising an exception for non-strings, though. 
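The difference between the two spellings can be sketched in Python 3 terms, where every text string is already unicode. This is only an illustration and not part of lxml: to_text is a hypothetical helper, and the UTF-8 default is an assumption. Like u""+s, it refuses non-strings instead of silently converting them the way unicode(s) (or str(s) in Python 3) would.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text; decode bytes; reject anything else.

    Hypothetical helper: mirrors the strictness of u"" + s rather than
    the permissive conversion done by unicode(s) / str(s).
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # Assumes the byte string is UTF-8 unless told otherwise.
        return value.decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Usage: strings pass through, bytes are decoded, other objects raise.
name = to_text("element")
decoded = to_text(b"\xc3\xa9l\xc3\xa9ment")
```

Passing None (for example the .text of an empty element) raises TypeError here, which makes a missing value visible instead of storing the string 'None'.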
Stefan From stefan_ml at behnel.de Tue Dec 9 19:14:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Dec 2008 19:14:12 +0100 Subject: [lxml-dev] extracting .text strings systematically in unicode In-Reply-To: References: <493EB229.1010005@behnel.de> Message-ID: <493EB574.9060903@behnel.de> John Lovell wrote: > The first one is the one the raises an exception for non-strings? Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01) [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> u""+1 Traceback (most recent call last): File "", line 1, in TypeError: coercing to Unicode: need string or buffer, int found Stefan From jcd at sdf.lonestar.org Tue Dec 9 19:21:24 2008 From: jcd at sdf.lonestar.org (J. Clifford Dyer) Date: Tue, 09 Dec 2008 13:21:24 -0500 Subject: [lxml-dev] extracting .text strings systematically in unicode In-Reply-To: References: <493EB229.1010005@behnel.de> Message-ID: <1228846884.13228.4.camel@sohp-laptop> > On Tue, 2008-12-09 at 10:11 -0800, John Lovell wrote: > The first one is the one the raises an exception for non-strings? > > John > Yes: Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49) [GCC 4.3.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> unicode(1) u'1' >>> u""+1 Traceback (most recent call last): File "", line 1, in TypeError: coercing to Unicode: need string or buffer, int found >>> From stefan_ml at behnel.de Tue Dec 9 19:23:39 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Dec 2008 19:23:39 +0100 Subject: [lxml-dev] extracting .text strings systematically in unicode In-Reply-To: <493EB574.9060903@behnel.de> References: <493EB229.1010005@behnel.de> <493EB574.9060903@behnel.de> Message-ID: <493EB7AB.6020300@behnel.de> Stefan Behnel wrote: > John Lovell wrote: >> The first one is the one the raises an exception for non-strings? 
> > Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01) > [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> u""+1 > Traceback (most recent call last): > File "", line 1, in > TypeError: coercing to Unicode: need string or buffer, int found Or to present something more lxml related (session edited for readability): Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01) [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import lxml.etree as et >>> root = et.fromstring("") >>> root.tag 'a' >>> unicode(root.tag) u'a' >>> u""+root.tag u'a' >>> root[0].tag >>> unicode(root[0].tag) u'' >>> u""+root[0].tag Traceback (most recent call last): File "", line 1, in TypeError: coercing to Unicode: need string or buffer, \ builtin_function_or_method found Stefan From sergio at sergiomb.no-ip.org Tue Dec 9 21:41:27 2008 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Tue, 09 Dec 2008 20:41:27 +0000 Subject: [lxml-dev] html encoding In-Reply-To: References: <20081204114634.241670@gmx.net> Message-ID: <1228855287.24579.4.camel@segulix> when str is the html I use:

htmldecode( unicode(str,'utf-8') ).encode('utf-8')

import re
from htmlentitydefs import name2codepoint

# This pattern matches a character entity reference (a decimal numeric
# reference, a hexadecimal numeric reference, or a named reference).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def htmldecode(text):
    """Decode HTML entities in the given text."""
    if type(text) is unicode:
        uchr = unichr
    else:
        uchr = lambda value: value > 255 and unichr(value) or chr(value)

    def entitydecode(match, uchr=uchr):
        entity = match.group(1)
        if entity.startswith('#x'):
            return uchr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return uchr(int(entity[1:]))
        elif entity in name2codepoint:
            return uchr(name2codepoint[entity])
        else:
            return match.group(0)

    return charrefpat.sub(entitydecode, text)

On Thu, 2008-12-04 at 12:57 +0100, Dirk Rothe wrote: > On Thu, 04 Dec 2008 12:46:34 +0100, Daniel Jirku wrote: > > > hi... > > > > My problem is i suppose well known, but i couldn't find any solution > > through my searches... > > > > I have a regular html link with ? and an &. When i print the variable in > > Python, it looks fine... (like: > > http://www.somelink.com/site.html?param1=test&param2=hello), BUT when i > > add it to my root xml element with: > > adId1 = etree.SubElement(tagAd, "originalAdUrl") > > adId1.text = adUrl > > > > and then later write the xml to a file with this: > > toStringValue = etree.tostring(xmlTagRoot, encoding="utf-8", > > method="xml", xml_declaration=True, pretty_print=True) > > ... > > > > the tag has as its value the link with an &amp; instead of & !! > > How can i use the correct signs for persistent storage in a xml file...? > > The XML Processor has correctly escaped your "&" character. If you > deserialise (aka load) the file with a XML Parser of your choice, it will > restore your "&" character. > > see > http://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_entity_references > > --dirk > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- Sérgio M. B. -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20081209/ad54337c/attachment.bin From putualbertolee at yahoo.com Wed Dec 10 20:39:12 2008 From: putualbertolee at yahoo.com (I Putu Alberto Lee) Date: Wed, 10 Dec 2008 11:39:12 -0800 (PST) Subject: [lxml-dev] Yet (another) XML-verifier utility, Xpectador Message-ID: <411863.13432.qm@web63405.mail.re1.yahoo.com> Hi, I just implemented an XML-verifier utility for the purpose of my work. It's written in python, and uses lxml library. With pleasure I'd like to share with you; hopefully you will find it useful. Basically it allows you to state your expectations over "what should appear" and "what should not appear" in the XML-document that you'd like to validate, using the same format as that XML-document-to-be-verified. I make the source code available here: http://www.box.net/shared/5hs5ynq75k (a .py file) -- just three functions that make up the recursive checkings. More info is available at: http://jananuraga.blogspot.com/2008/12/xpectador-10.html Feel free to use and modify it under creative-common attribution generic 2.5 license: http://creativecommons.org/licenses/by/2.5/ . At least please put a link to my blog in your version / port of xpectador. Saludos, Raka -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081210/7a0b821f/attachment-0001.htm From putualbertolee at yahoo.com Wed Dec 10 21:00:28 2008 From: putualbertolee at yahoo.com (I Putu Alberto Lee) Date: Wed, 10 Dec 2008 12:00:28 -0800 (PST) Subject: [lxml-dev] Yet (another) XML-verifier utility, Xpectador Message-ID: <925376.7687.qm@web63406.mail.re1.yahoo.com> Hi, Sorry I made a mistake, the correct link to the py file is http://www.box.net/shared/u55s51p9hf Best regards, Raka --- On Wed, 12/10/08, I Putu Alberto Lee wrote: From: I Putu Alberto Lee Subject: Yet (another) XML-verifier utility, Xpectador To: lxml-dev at codespeak.net Date: Wednesday, December 10, 2008, 2:39 PM Hi, I just implemented an XML-verifier utility for the purpose of my work. It's written in python, and uses lxml library. With pleasure I'd like to share with you; hopefully you will find it useful. Basically it allows you to state your expectations over "what should appear" and "what should not appear" in the XML-document that you'd like to validate, using the same format as that XML-document-to-be-verified. I make the source code available here: http://www.box.net/shared/5hs5ynq75k (a .py file) -- just three functions that make up the recursive checkings. More info is available at: http://jananuraga.blogspot.com/2008/12/xpectador-10.html Feel free to use and modify it under creative-common attribution generic 2.5 license: http://creativecommons.org/licenses/by/2.5/ . At least please put a link to my blog in your version / port of xpectador. Saludos, Raka -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081210/7c32de32/attachment.htm From ianb at colorstudy.com Fri Dec 12 18:18:18 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 12 Dec 2008 11:18:18 -0600 Subject: [lxml-dev] [Fwd: [Ian Bicking: a blog] Comment: "lxml: an underappreciated web scraping library"] Message-ID: <49429CDA.30409@colorstudy.com> Any ideas on this guy's installation issues? -------- Original Message ------- New comment on your post #85 "lxml: an underappreciated web scraping library" Author : john aman Thanks to all. @Ian, @Tres: The first real problem? My stupidity. python2.5-dev not installed. I should have landed here much sooner but for Ubuntu makes installing packages so damn easy. 2nd problem - running at 64 bit is a bit like warp speed to this MSDOS 2.0 old timer. Years of DOS and windoze development may also have corrupted my neural network. Now I get to this: (env)john at ibex:~$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1' ... Downloading libxslt into libs/libxslt-1.1.24.tar.gz Unpacking libxslt-1.1.24.tar.gz into build/tmp Running "./configure --without-python --disable-dependency-tracking --disable-shared --prefix=/tmp/easy_install-Am4yKQ/lxml-2.2alpha1/build/tmp/libxml2" in build/tmp/libxml2-2.7.2 checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu ... lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig" /usr/bin/install -c -m 644 'libxslt.pc' '/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig/libxslt.pc' /usr/bin/install -c -m 644 'libexslt.pc' '/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig/libexslt.pc' make[2]: Leaving directory `/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxslt-1.1.24' make[1]: Leaving directory `/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxslt-1.1.24' NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. 
Using build configuration of libxml2 2.7.2 and libxslt 1.1.24 Building against libxml2/libxslt in the following directory: /tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib ************************ /usr/bin/ld: /tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/libexslt.a(exslt.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC ************************ /tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/libexslt.a: could not read symbols: Bad value collect2: ld returned 1 exit status error: Setup script exited with error: command 'gcc' failed with exit status 1 where error is marked *********** It seems the linker is failing to link exslt.o into libexslt.a: (exslt.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC Oh, did I mention I'm new to 64 bit OS? Perhaps someone can point out how to do all the steps the wget/tar/.configure/make way. Too old to try to figure out something so easy_install._ Not knocking easy_install - this is a first time failure for me with that program. From stefan_ml at behnel.de Fri Dec 12 18:50:30 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 12 Dec 2008 18:50:30 +0100 Subject: [lxml-dev] [Fwd: [Ian Bicking: a blog] Comment: "lxml: an underappreciated web scraping library"] In-Reply-To: <49429CDA.30409@colorstudy.com> References: <49429CDA.30409@colorstudy.com> Message-ID: <4942A466.1020102@behnel.de> Hi, Ian Bicking wrote: > -------- Original Message ------- > Author : john aman > It seems the linker is failing to link exslt.o into libexslt.a: > (exslt.o): relocation R_X86_64_32 against `a local symbol' can not be > used when making a shared object; recompile with -fPIC Try doing what the error message says: pass "-fPIC" to the compiler, i.e. as part of CFLAGS. 
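One way to hand such a flag to a child build process from Python is sketched below. It copies the parent environment (so PATH, and with it gcc, stays visible) and appends the flag to CFLAGS. The helper name build_environment is hypothetical, and whether lxml's build script forwards CFLAGS to the libxml2/libxslt configure runs is an assumption, not something this thread confirms.

```python
import os


def build_environment(extra_cflags, base_env=None):
    """Return a copy of the environment with extra_cflags appended to CFLAGS.

    Hypothetical helper: base_env defaults to the current os.environ, so
    variables such as PATH are kept for the child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("CFLAGS", "")
    # Keep any flags that were already set and add the new ones after them.
    env["CFLAGS"] = (existing + " " + extra_cflags).strip()
    return env


# Sketch of use (not executed here):
#   import subprocess
#   subprocess.call(["./configure", "--disable-shared"],
#                   env=build_environment("-fPIC"))
```

Building a new dictionary instead of modifying os.environ in place keeps the change local to the one subprocess call.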
Stefan From stefan_ml at behnel.de Fri Dec 12 18:56:36 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 12 Dec 2008 18:56:36 +0100 Subject: [lxml-dev] Yet (another) XML-verifier utility, Xpectador In-Reply-To: <411863.13432.qm@web63405.mail.re1.yahoo.com> References: <411863.13432.qm@web63405.mail.re1.yahoo.com> Message-ID: <4942A5D4.8000605@behnel.de> Hi, I Putu Alberto Lee wrote: > I just implemented an XML-verifier utility for the purpose of my work. > It's written in python, and uses lxml library. > > Basically it allows you to state your expectations over "what should > appear" and "what should not appear" in the XML-document that you'd like > to validate, using the same format as that XML-document-to-be-verified. Have you looked at examplotron as a schema language? It looks a bit like what you are doing here. http://examplotron.org/ You can also use it with xml as it's implemented in XSLT. Stefan From stefan_ml at behnel.de Fri Dec 12 19:16:20 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 12 Dec 2008 19:16:20 +0100 Subject: [lxml-dev] buildlibxml problems In-Reply-To: <7911b3bb0812080717q3f408333h51a7d655af3feed5@mail.gmail.com> References: <7911b3bb0812080717q3f408333h51a7d655af3feed5@mail.gmail.com> Message-ID: <4942AA74.309@behnel.de> Gael Pasgrimaud wrote: > I have a problem to install lxml with Cython and --static-deps. > > The problem is that you dont keep the original os.environ in place > during subprocess so gcc is not found during ./configure of libxml2. > I've found a workaround and you can find my patch attached Updated the build script. Thanks! Stefan From stefan_ml at behnel.de Sat Dec 13 00:37:13 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 13 Dec 2008 00:37:13 +0100 Subject: [lxml-dev] lxml 2.0.11, lxml 2.1.4 and 2.2beta1 released - a note on future releases Message-ID: <4942F5A9.6020105@behnel.de> Hi, lxml 2.0.11, 2.1.4 and 2.2beta1 are on PyPI. 
The first two are pure bug fix releases for a crash bug that can occur when using a single XPath evaluator concurrently from multiple threads. lxml 2.2beta1 adds a couple of further fixes to this. Changelog below. In its current state, I consider lxml stable and mature enough not to require separate long-living release branches. The current release is the end of the 2.0 series. Unless we find major reasons for users not to switch after the release of 2.2 final, 2.1 will soon go out of maintenance. Stefan 2.2beta1 (2008-12-12) Features added * Allow lxml.html.diff.htmldiff to accept Element objects, not just HTML strings. Bugs fixed * Crash when using an XPath evaluator in multiple threads. * Fixed missing whitespace before Link:... in lxml.html.diff. Other changes * Export lxml.html.parse. 2.1.4 (2008-12-12) Bugs fixed * Crash when using an XPath evaluator in multiple threads. 2.0.11 (2008-12-12) Bugs fixed * Crash when using an XPath evaluator in multiple threads. From olauzanne at gmail.com Sun Dec 14 21:13:52 2008 From: olauzanne at gmail.com (Olivier Lauzanne) Date: Sun, 14 Dec 2008 21:13:52 +0100 Subject: [lxml-dev] pyquery In-Reply-To: References: <49343BB6.1040103@colorstudy.com> <4936D2AA.7090106@colorstudy.com> Message-ID: Hello, I have implemented some of the classes that are present in jQuery but not in the css standard. Extending the classes seems like the right way to do it. You can check it out here. http://www.bitbucket.org/olauzanne/pyquery/src/tip/pyquery/cssselectpatch.py cssselect should remain standard compliant but I could make a patch so that using the jQuery pseudo classes is an option of the CSSSelector class. I don't have a strong opinion about doing it or not doing it. What do you think ? -- Olivier Lauzanne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081214/d26ae2e9/attachment.htm From stefan_ml at behnel.de Sun Dec 14 22:15:20 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 14 Dec 2008 22:15:20 +0100 Subject: [lxml-dev] pyquery In-Reply-To: References: <49343BB6.1040103@colorstudy.com> <4936D2AA.7090106@colorstudy.com> Message-ID: <49457768.4090106@behnel.de> Hi, Olivier Lauzanne wrote: > I have implemented some of the classes that are present in jQuery but not in > the css standard. Extending the classes seems like the right way to do it. > You can check it out here. > > http://www.bitbucket.org/olauzanne/pyquery/src/tip/pyquery/cssselectpatch.py > > cssselect should remain standard compliant but I could make a patch so that > using the jQuery pseudo classes is an option of the CSSSelector class. I > don't have a strong opinion about doing it or not doing it. What do you > think ? Would it make sense to make this a separate module like "jqselect" (which would inherit most of the API of cssselect) instead of patching into cssselect itself? Stefan From olauzanne at gmail.com Sun Dec 14 23:04:03 2008 From: olauzanne at gmail.com (Olivier Lauzanne) Date: Sun, 14 Dec 2008 23:04:03 +0100 Subject: [lxml-dev] pyquery In-Reply-To: <49457768.4090106@behnel.de> References: <49343BB6.1040103@colorstudy.com> <4936D2AA.7090106@colorstudy.com> <49457768.4090106@behnel.de> Message-ID: On Sun, Dec 14, 2008 at 10:15 PM, Stefan Behnel wrote: > Hi, > > Olivier Lauzanne wrote: > > I have implemented some of the classes that are present in jQuery but not > in > > the css standard. Extending the classes seems like the right way to do > it. > > You can check it out here. > > > > > http://www.bitbucket.org/olauzanne/pyquery/src/tip/pyquery/cssselectpatch.py > > > > cssselect should remain standard compliant but I could make a patch so > that > > using the jQuery pseudo classes is an option of the CSSSelector class. 
I > > don't have a strong opinion about doing it or not doing it. What do you > > think ? > > Would it make sense to make this a separate module like "jqselect" (which > would inherit most of the API of cssselect) instead of patching into > cssselect itself? > > Stefan > > It's possible but not simple. I would have to add a parameter on most functions down to parse_simple_selector for it to use some specific Pseudo and Function classes. Or else I would have to do some major refactoring : putting most functions in a class and using self.Pseudo and self.Function instead of Pseudo and Function. That would be the nicest I think. Then the JQueryPseudo and JQueryFunction classes could be put in a different file. Anyway I have to patch cssselect if I don't want to duplicate code (which I don't want), either with a monkey patch or by modifying cssselect. The simplest is definitely monkey patching. But it's not the safest in the long term. Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20081214/aa1a8ec5/attachment-0001.htm From pythonalex at email.it Mon Dec 15 11:11:27 2008 From: pythonalex at email.it (pythonalex at email.it) Date: Mon, 15 Dec 2008 11:11:27 +0100 Subject: [lxml-dev] Strange ValueError Message-ID: Hello everyone. I'm Alessio Esposito and collaborate with the Institute of Cybernetics of the CNR in Pozzuoli (NA) - Italy. I have the latest version of lxml (lxml 2.1.4). I need to have something like this: ... something something ... ... ... And this is the python code: ... e_octapy = SubElement(rootelement, 'octapyelement') e_octapy.set('{%s}schemaLocation' % NS_XSI, 'http://octapycms.remuna.org/schema.xsd') for sub_elem in struct: e = SubElement( e_octapy, sub_elem ) e.text = 'something' ... The variable sub_elem contains strings such as octapy:first, octapy:second etc. The problem is that the latest version of lxml does not accept that sub_elem contains ':'.
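A name such as octapy:first is a prefix plus a local name, and ElementTree-style APIs expect the namespace URI in braces instead of the prefix. The sketch below shows that notation with the standard library's xml.etree.ElementTree, used here only so the example runs without extra packages; lxml.etree accepts the same '{uri}name' tag form. The namespace URI is a placeholder, since the real one for the octapy prefix is not given in this thread, and qname and build_document are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Placeholder URI: the actual namespace bound to "octapy" is an assumption.
OCTAPY_NS = "http://example.org/octapy"


def qname(namespace, local_name):
    """Build a '{uri}local' tag name instead of a 'prefix:local' one."""
    return "{%s}%s" % (namespace, local_name)


def build_document(local_names):
    """Create a small tree whose children live in the octapy namespace."""
    # Ask the serializer to write this URI with the "octapy" prefix.
    ET.register_namespace("octapy", OCTAPY_NS)
    root = ET.Element("rootelement")
    parent = ET.SubElement(root, qname(OCTAPY_NS, "octapyelement"))
    for name in local_names:
        child = ET.SubElement(parent, qname(OCTAPY_NS, name))
        child.text = "something"
    return root


root = build_document(["first", "second"])
xml_text = ET.tostring(root, encoding="unicode")
```

The prefix then appears only in the serialized output; in the tree itself every tag carries the full URI.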
This is a part of the terminal output: Module lxml.etree, line 2342, in lxml.etree.SubElement (src/lxml/lxml.etree.c:%u) Module lxml.etree, line 160, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:%u) Module lxml.etree, line 1311, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:%u) ValueError: Invalid tag name u'octapy:first' The previous version of lxml (lxml 2.1) with the same code gave no problems. How can I fix this? Kind regards, Alessio. From stefan_ml at behnel.de Mon Dec 15 11:30:52 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 15 Dec 2008 11:30:52 +0100 (CET) Subject: [lxml-dev] Strange ValueError In-Reply-To: References: Message-ID: <48587.213.61.181.86.1229337052.squirrel@groupware.dvs.informatik.tu-darmstadt.de> [replying to all...] pythonalex at email.it wrote: > e = SubElement( e_octapy, sub_elem ) > > The problem is that the latest version of lxml does not accept that > sub_elem contains ':' . You have to provide the namespace URI instead of the prefix. http://codespeak.net/lxml/tutorial.html#namespaces > The previous version of lxml (lxml 2.1) with the same code gave no > problems. I doubt that. Tag validation has been in there for a while. Stefan From d.rothe at semantics.de Tue Dec 16 10:39:34 2008 From: d.rothe at semantics.de (Dirk Rothe) Date: Tue, 16 Dec 2008 10:39:34 +0100 Subject: [lxml-dev] xslt coverage Message-ID: Hi, are there any possibilities to get something like a line coverage from a xsl-transformation? That would be pretty useful, if integrated in a test-code-coverage run.
--dirk From stefan_ml at behnel.de Tue Dec 16 12:42:35 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 16 Dec 2008 12:42:35 +0100 (CET) Subject: [lxml-dev] xslt coverage In-Reply-To: References: Message-ID: <44884.213.61.181.86.1229427755.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Dirk Rothe wrote: > are there any possibilities to get something like a line coverage from a > xsl-transformation. That would be pretty useful, if integrated in a > test-code-coverage run. There's the profiler (pass "profile_run=True" to the XSLT run), but it won't give you line coverage. http://codespeak.net/lxml/api/lxml.etree.XSLT-class.html#__call__ Stefan From pythonalex at email.it Mon Dec 15 12:32:33 2008 From: pythonalex at email.it (Alessio Esposito) Date: Mon, 15 Dec 2008 11:32:33 +0000 (UTC) Subject: [lxml-dev] Strange ValueError References: <48587.213.61.181.86.1229337052.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: Thank you for the rapidity of the response, I will follow your indications. On the question of the version I can guarantee you that before there were no problems. Thanks Again, Alessio. From d.rothe at semantics.de Tue Dec 16 17:55:51 2008 From: d.rothe at semantics.de (Dirk Rothe) Date: Tue, 16 Dec 2008 17:55:51 +0100 Subject: [lxml-dev] xslt coverage In-Reply-To: <44884.213.61.181.86.1229427755.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <44884.213.61.181.86.1229427755.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: On Tue, 16 Dec 2008 12:42:35 +0100, Stefan Behnel wrote: > Dirk Rothe wrote: >> are there any possibilities to get something like a line coverage from a >> xsl-transformation. That would be pretty useful, if integrated in a >> test-code-coverage run. > > There's the profiler (pass "profile_run=True" to the XSLT run), but it > won't give you line coverage. > > http://codespeak.net/lxml/api/lxml.etree.XSLT-class.html#__call__ jep, I've seen that. 
Would be quite a task to generate a "function-level" report by extracting the template-matches. (with line numbers, respecting xsl:include ...) One would have to assume full coverage of all the conditional branches. Hmm, maybe I will try it some day :). thnx, dirk From sidnei at enfoldsystems.com Thu Dec 18 17:31:20 2008 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Thu, 18 Dec 2008 14:31:20 -0200 Subject: [lxml-dev] lxml 2.0.11, lxml 2.1.4 and 2.2beta1 released - a note on future releases In-Reply-To: <4942F5A9.6020105@behnel.de> References: <4942F5A9.6020105@behnel.de> Message-ID: On Fri, Dec 12, 2008 at 9:37 PM, Stefan Behnel wrote: > Hi, > > lxml 2.0.11, 2.1.4 and 2.2beta1 are on PyPI. The first two are pure bug fix > releases for a crash bug that can occur when using a single XPath evaluator > concurrently from multiple threads. lxml 2.2beta1 adds a couple of further > fixes to this. Changelog below. Binaries for Windows have been uploaded to PyPI for Python 2.4, 2.5 and 2.6, in both egg and distutils-based installer formats. That totals to 18 releases (2 flavors x 3 Python releases x 3 lxml releases). :) > In its current state, I consider lxml stable and mature enough not to > require separate long-living release branches. The current release is the > end of the 2.0 series. Unless we find major reasons for users not to switch > after the release of 2.2 final, 2.1 will soon go out of maintenance. That will certainly be a welcome change for me! 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 Skype zopedc From paulsen at orbiteam.de Thu Dec 18 18:03:16 2008 From: paulsen at orbiteam.de (Volker Paulsen) Date: Thu, 18 Dec 2008 18:03:16 +0100 Subject: [lxml-dev] lxml 2.1.4/2.2beta1 Solaris 9 segv in test-suite Message-ID: <20081218170315.GA24502@mail.orbiteam.de> Hi, I just compiled lxml-2.1.4 (and lxml-2.2beta) with gcc 4.2.4 against - libxml2-2.7.2 - libxslt-1.1.24 Unfortunately the test "test_schematron_invalid_schema_empty" causes a segmentation violation with Python 2.5 and Python 2.6; Please find a gdb backtrace for Python 2.6 and lxml-2.1.4 (and lxml-2.2beta) attached. Regards, Volker Paulsen -- OrbiTeam Software GmbH & Co. KG http://www.orbiteam.de/ () Ascii Ribbon Campaign /\ Support plain text e-mail -------------- next part -------------- ############################################################################ ### lxml 2.2beta1 ############################################################################ gdb /opt/python/bin/python2.6 GNU gdb 6.7 Copyright (C) 2007 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "sparc-sun-solaris2.9"... (gdb) run test.py -p -v Starting program: /opt/sfw/python/bin/python2.6 test.py -p -v warning: Temporarily disabling breakpoints for unloaded shared library "/usr/lib/ld.so.1" TESTED VERSION: 2.2.beta1 Python: (2, 6, 1, 'final', 0) lxml.etree: (2, 2, -99, 0) libxml used: (2, 7, 2) libxml compiled: (2, 7, 2) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) 837/986 ( 84.9%): test_schematron_invalid_schema_empty (...chematronTestCase) Program received signal SIGSEGV, Segmentation fault. 
0xfeeb467c in strlen () from /usr/lib/libc.so.1
(gdb) bt
#0 0xfeeb467c in strlen () from /usr/lib/libc.so.1
#1 0xfef07588 in _doprnt () from /usr/lib/libc.so.1
#2 0xfef095f8 in vsnprintf () from /usr/lib/libc.so.1
#3 0xfe8319dc in __xmlRaiseError () from /usr/local/lib/libxml2.so.2
#4 0xfe95fd58 in xmlSchematronParse () from /usr/local/lib/libxml2.so.2
#5 0xfeb3d8a4 in __pyx_pf_4lxml_5etree_10Schematron___init__ (__pyx_v_self=0x9d52b0, __pyx_args=, __pyx_kwds=) at src/lxml/lxml.etree.c:111543
#6 0xff21a2ec in type_call (type=0xf800, args=0x9bfc70, kwds=0x9ea9c0) at Objects/typeobject.c:745
#7 0xff1b7310 in PyObject_Call (func=0xfebdbef0, arg=0x9bfc70, kw=0x9ea9c0) at Objects/abstract.c:2487
#8 0xff264d14 in PyEval_EvalFrameEx (f=0x3aae90, throwflag=) at Python/ceval.c:3978
#9 0xff26a468 in PyEval_EvalCodeEx (co=0x161410, globals=, locals=, args=0x423138, argcount=4, kws=0x423138, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#10 0xff268120 in PyEval_EvalFrameEx (f=0x422fe8, throwflag=) at Python/ceval.c:3775
#11 0xff269414 in PyEval_EvalFrameEx (f=0x417258, throwflag=) at Python/ceval.c:3765
#12 0xff26a468 in PyEval_EvalCodeEx (co=0x161140, globals=, locals=, args=0x8b062c, argcount=2, kws=0x37bc08, kwcount=0, defs=0x16835c, defcount=1, closure=0x0) at Python/ceval.c:2942
#13 0xff1e9598 in function_call (func=0x16b2b0, arg=0x8b0620, kw=0x9c44b0) at Objects/funcobject.c:524
#14 0xff1b7310 in PyObject_Call (func=0x16b2b0, arg=0x8b0620, kw=0x9c44b0) at Objects/abstract.c:2487
#15 0xff264d14 in PyEval_EvalFrameEx (f=0x4170e8, throwflag=) at Python/ceval.c:3978
#16 0xff26a468 in PyEval_EvalCodeEx (co=0x161188, globals=, locals=, args=0x8b060c, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#17 0xff1e94c4 in function_call (func=0x16b2f0, arg=0x8b05f8, kw=0x0) at Objects/funcobject.c:524
#18 0xff1b7310 in PyObject_Call (func=0x16b2f0, arg=0x8b05f8, kw=0x0) at Objects/abstract.c:2487
#19 0xff1c88e0 in
instancemethod_call (func=0x16b2f0, arg=0x8b05f8, kw=0x0) at Objects/classobject.c:2579
#20 0xff1b7310 in PyObject_Call (func=0x473698, arg=0x8ac9b0, kw=0x0) at Objects/abstract.c:2487
#21 0xff223fd0 in slot_tp_call (self=, args=0x8ac9b0, kwds=0x0) at Objects/typeobject.c:5368
#22 0xff1b7310 in PyObject_Call (func=0x7e0250, arg=0x8ac9b0, kw=0x0) at Objects/abstract.c:2487
#23 0xff266488 in PyEval_EvalFrameEx (f=0x8dcab0, throwflag=) at Python/ceval.c:3890
#24 0xff26a468 in PyEval_EvalCodeEx (co=0x161920, globals=, locals=, args=0x8a9eec, argcount=2, kws=0x58d5b8, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#25 0xff1e9598 in function_call (func=0x16b7b0, arg=0x8a9ee0, kw=0x8b44b0) at Objects/funcobject.c:524
#26 0xff1b7310 in PyObject_Call (func=0x16b7b0, arg=0x8a9ee0, kw=0x8b44b0) at Objects/abstract.c:2487
#27 0xff264d14 in PyEval_EvalFrameEx (f=0x8dc940, throwflag=) at Python/ceval.c:3978
#28 0xff26a468 in PyEval_EvalCodeEx (co=0x161968, globals=, locals=, args=0x827c74, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#29 0xff1e94c4 in function_call (func=0x16b7f0, arg=0x827c60, kw=0x0) at Objects/funcobject.c:524
#30 0xff1b7310 in PyObject_Call (func=0x16b7f0, arg=0x827c60, kw=0x0) at Objects/abstract.c:2487
#31 0xff1c88e0 in instancemethod_call (func=0x16b7f0, arg=0x827c60, kw=0x0) at Objects/classobject.c:2579
#32 0xff1b7310 in PyObject_Call (func=0x473760, arg=0x8ac650, kw=0x0) at Objects/abstract.c:2487
#33 0xff223fd0 in slot_tp_call (self=, args=0x8ac650, kwds=0x0) at Objects/typeobject.c:5368
#34 0xff1b7310 in PyObject_Call (func=0x8907b0, arg=0x8ac650, kw=0x0) at Objects/abstract.c:2487
#35 0xff266488 in PyEval_EvalFrameEx (f=0x8dbd40, throwflag=) at Python/ceval.c:3890
#36 0xff269414 in PyEval_EvalFrameEx (f=0x12c778, throwflag=) at Python/ceval.c:3765
#37 0xff269414 in PyEval_EvalFrameEx (f=0x1130d0, throwflag=) at Python/ceval.c:3765
#38 0xff26a468 in PyEval_EvalCodeEx (co=0xb4c80,
globals=, locals=, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#39 0xff26a7e0 in PyEval_EvalCode (co=0xb4c80, globals=0x431e0, locals=0x431e0) at Python/ceval.c:515
#40 0xff2931d0 in PyRun_FileExFlags (fp=0xfef401f4, filename=0xffbff262 "test.py", start=, globals=0x431e0, locals=0x431e0, closeit=, flags=0xffbff05c) at Python/pythonrun.c:1330
#41 0xff2934ec in PyRun_SimpleFileExFlags (fp=0xfef401f4, filename=0xffbff262 "test.py", closeit=1, flags=0xffbff05c) at Python/pythonrun.c:926
#42 0xff29ff8c in Py_Main (argc=4, argv=0xffbff0d4) at Modules/main.c:597
#43 0x000105a0 in _start ()

############################################################################
### lxml 2.1.4
############################################################################

gdb /opt/python/bin/python2.6
GNU gdb 6.7
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.9"...
(gdb) run test.py -p -v
Starting program: /opt/sfw/python/bin/python2.6 test.py -p -v
warning: Temporarily disabling breakpoints for unloaded shared library "/usr/lib/ld.so.1"
TESTED VERSION: 2.1.4
    Python:           (2, 6, 1, 'final', 0)
    lxml.etree:       (2, 1, 4, 0)
    libxml used:      (2, 7, 2)
    libxml compiled:  (2, 7, 2)
    libxslt used:     (1, 1, 24)
    libxslt compiled: (1, 1, 24)

828/976 ( 84.8%): test_schematron_invalid_schema_empty (...chematronTestCase)
Program received signal SIGSEGV, Segmentation fault.
0xfeeb467c in strlen () from /usr/lib/libc.so.1
(gdb) bt
#0 0xfeeb467c in strlen () from /usr/lib/libc.so.1
#1 0xfef07588 in _doprnt () from /usr/lib/libc.so.1
#2 0xfef095f8 in vsnprintf () from /usr/lib/libc.so.1
#3 0xfe8319dc in __xmlRaiseError () from /usr/local/lib/libxml2.so.2
#4 0xfe95fd58 in xmlSchematronParse () from /usr/local/lib/libxml2.so.2
#5 0xfeb6fd10 in __pyx_pf_4lxml_5etree_10Schematron___init__ (__pyx_v_self=0x9c6eb8, __pyx_args=, __pyx_kwds=) at src/lxml/lxml.etree.c:102835
#6 0xff21a2ec in type_call (type=0xf800, args=0x9acc30, kwds=0xa03660) at Objects/typeobject.c:745
#7 0xff1b7310 in PyObject_Call (func=0xfebec810, arg=0x9acc30, kw=0xa03660) at Objects/abstract.c:2487
#8 0xff264d14 in PyEval_EvalFrameEx (f=0x3aa168, throwflag=) at Python/ceval.c:3978
#9 0xff26a468 in PyEval_EvalCodeEx (co=0x160410, globals=, locals=, args=0x41dd80, argcount=4, kws=0x41dd80, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#10 0xff268120 in PyEval_EvalFrameEx (f=0x41dc30, throwflag=) at Python/ceval.c:3775
#11 0xff269414 in PyEval_EvalFrameEx (f=0x417518, throwflag=) at Python/ceval.c:3765
#12 0xff26a468 in PyEval_EvalCodeEx (co=0x160140, globals=, locals=, args=0x8a5c6c, argcount=2, kws=0x37e910, kwcount=0, defs=0x16735c, defcount=1, closure=0x0) at Python/ceval.c:2942
#13 0xff1e9598 in function_call (func=0x16a2b0, arg=0x8a5c60, kw=0x9c8150) at Objects/funcobject.c:524
#14 0xff1b7310 in PyObject_Call (func=0x16a2b0, arg=0x8a5c60, kw=0x9c8150) at Objects/abstract.c:2487
#15 0xff264d14 in PyEval_EvalFrameEx (f=0x410dc8, throwflag=) at Python/ceval.c:3978
#16 0xff26a468 in PyEval_EvalCodeEx (co=0x160188, globals=, locals=, args=0x8a5c4c, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#17 0xff1e94c4 in function_call (func=0x16a2f0, arg=0x8a5c38, kw=0x0) at Objects/funcobject.c:524
#18 0xff1b7310 in PyObject_Call (func=0x16a2f0, arg=0x8a5c38, kw=0x0) at Objects/abstract.c:2487
#19 0xff1c88e0 in
instancemethod_call (func=0x16a2f0, arg=0x8a5c38, kw=0x0) at Objects/classobject.c:2579
#20 0xff1b7310 in PyObject_Call (func=0x474508, arg=0x8a3910, kw=0x0) at Objects/abstract.c:2487
#21 0xff223fd0 in slot_tp_call (self=, args=0x8a3910, kwds=0x0) at Objects/typeobject.c:5368
#22 0xff1b7310 in PyObject_Call (func=0x7da350, arg=0x8a3910, kw=0x0) at Objects/abstract.c:2487
#23 0xff266488 in PyEval_EvalFrameEx (f=0x8d7ca8, throwflag=) at Python/ceval.c:3890
#24 0xff26a468 in PyEval_EvalCodeEx (co=0x160920, globals=, locals=, args=0x8a5564, argcount=2, kws=0x594af0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#25 0xff1e9598 in function_call (func=0x16a7b0, arg=0x8a5558, kw=0x8a9660) at Objects/funcobject.c:524
#26 0xff1b7310 in PyObject_Call (func=0x16a7b0, arg=0x8a5558, kw=0x8a9660) at Objects/abstract.c:2487
#27 0xff264d14 in PyEval_EvalFrameEx (f=0x8d7b38, throwflag=) at Python/ceval.c:3978
#28 0xff26a468 in PyEval_EvalCodeEx (co=0x160968, globals=, locals=, args=0x7e5864, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#29 0xff1e94c4 in function_call (func=0x16a7f0, arg=0x7e5850, kw=0x0) at Objects/funcobject.c:524
#30 0xff1b7310 in PyObject_Call (func=0x16a7f0, arg=0x7e5850, kw=0x0) at Objects/abstract.c:2487
#31 0xff1c88e0 in instancemethod_call (func=0x16a7f0, arg=0x7e5850, kw=0x0) at Objects/classobject.c:2579
#32 0xff1b7310 in PyObject_Call (func=0x4745d0, arg=0x8a35b0, kw=0x0) at Objects/abstract.c:2487
#33 0xff223fd0 in slot_tp_call (self=, args=0x8a35b0, kwds=0x0) at Objects/typeobject.c:5368
#34 0xff1b7310 in PyObject_Call (func=0x88a750, arg=0x8a35b0, kw=0x0) at Objects/abstract.c:2487
#35 0xff266488 in PyEval_EvalFrameEx (f=0x8d6f38, throwflag=) at Python/ceval.c:3890
#36 0xff269414 in PyEval_EvalFrameEx (f=0x12c778, throwflag=) at Python/ceval.c:3765
#37 0xff269414 in PyEval_EvalFrameEx (f=0x1130d0, throwflag=) at Python/ceval.c:3765
#38 0xff26a468 in PyEval_EvalCodeEx (co=0xb4c80,
globals=, locals=, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2942
#39 0xff26a7e0 in PyEval_EvalCode (co=0xb4c80, globals=0x431e0, locals=0x431e0) at Python/ceval.c:515
#40 0xff2931d0 in PyRun_FileExFlags (fp=0xfef401f4, filename=0xffbff25a "test.py", start=, globals=0x431e0, locals=0x431e0, closeit=, flags=0xffbff054) at Python/pythonrun.c:1330
#41 0xff2934ec in PyRun_SimpleFileExFlags (fp=0xfef401f4, filename=0xffbff25a "test.py", closeit=1, flags=0xffbff054) at Python/pythonrun.c:926
#42 0xff29ff8c in Py_Main (argc=4, argv=0xffbff0cc) at Modules/main.c:597
#43 0x000105a0 in _start ()

From azaroth at liverpool.ac.uk Mon Dec 22 14:14:57 2008
From: azaroth at liverpool.ac.uk (Dr R. Sanderson)
Date: Mon, 22 Dec 2008 13:14:57 +0000 (GMT)
Subject: [lxml-dev] Memory Leak 2.1.1 -> 2.1.2
In-Reply-To: <4895532E.1050501@behnel.de>
References: <47878E4E.5080800@behnel.de> <4895532E.1050501@behnel.de>
Message-ID:

Hi all,

I'm working on a script to replicate it, but using 2.1.2 or more recent
results in not freeing any memory when parsing multiple documents in
quick succession. The changelog says there was a memory issue fixed, so
perhaps this introduced the bug at the same time?

I've seen (but not consistently) the lxml memory allocation failed:
growing buffer message. Normally it just runs my machine out of memory.

Rob

From azaroth at liverpool.ac.uk Mon Dec 22 15:33:10 2008
From: azaroth at liverpool.ac.uk (Dr R. Sanderson)
Date: Mon, 22 Dec 2008 14:33:10 +0000 (GMT)
Subject: [lxml-dev] Memory Leak 2.1.1 -> 2.1.2
In-Reply-To:
References: <47878E4E.5080800@behnel.de> <4895532E.1050501@behnel.de>
Message-ID:

The actual code is below, but I've got it so that it inflates Very
Quickly...

[cheshire at edhellond jstor]$ ./memory.py
UID        PID  PPID  C      SZ     RSS PSR STIME TTY          TIME CMD
cheshire  1778  1154  0    5861   14204   1 14:25 pts/2    00:00:00 /home/cheshire/install/bin/python -i ./memory.py
0
cheshire  1778  1154 99   20753   73820   1 14:25 pts/2    00:00:01 /home/cheshire/install/bin/python -i ./memory.py
238
cheshire  1778  1154 99  140239  551556   1 14:25 pts/2    00:00:08 /home/cheshire/install/bin/python -i ./memory.py
483
cheshire  1778  1154 99  245972  974616   1 14:25 pts/2    00:00:14 /home/cheshire/install/bin/python -i ./memory.py
734
cheshire  1778  1154 99  319488 1268656   1 14:25 pts/2    00:00:24 /home/cheshire/install/bin/python -i ./memory.py
1269

E.g., after parsing 1269 documents (on average 250k each) it's using a
total of 1.5 gigabytes of memory. This also happens in 2.1.1.

I've used guppy/hpy to check that it's not Python-level code. Putting a
hp.heap() call in the loop shows the only difference to be the for
loop's frame, per iteration.

The actual production code works in 2.1.1, but has a lot more xpaths and
then a serialization phase in the loop as well.

Code, with comments:

----------------------------
# imports needed by the snippet; parse, db and session come from the
# surrounding application and are not shown here
import os
import commands

def build_journal(jrnl):
    global nparse
    # Search for journal descriptions
    q = parse('c3.idx-id-journal exact "%s"' % jrnl)
    rs = db.search(session, q)
    # step through matches
    for rsi in rs:
        nparse += 1
        # fetch record out of storage, use etree.XML(data) to parse
        rec = rsi.fetch_record(session)
        # process_xpath passes through directly to node.xpath()
        try:
            year = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/year/text()')[0]
            month = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/month/text()')[0]
            day = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/day/text()')[0]
        except:
            rsi._ymd = (0,0,0)
            del rec
            continue
        rsi._ymd = (year, month, day)
        del rec
    # sort list based on date
    rs._list.sort(key=lambda x: x._ymd)
    del rs

nparse = 0
# scan through all journal identifiers
q = parse('c3.idx-id-journal exact ""')
jids = db.scan(session, q, 1000000)
# get OS memory usage stats
pid = os.getpid()
cmd = "ps -F -p %s" % pid
print commands.getoutput(cmd)
print nparse
# and try to build
for j in jids[100:]:
    build_journal(j[0])
    print commands.getoutput(cmd).split('\n')[1]
    print nparse
----------------------------------------

Help?

Rob

On Mon, 22 Dec 2008, Dr R. Sanderson wrote:
>
> Hi all,
>
> I'm working on a script to replicate it, but using 2.1.2 or more recent
> results in not freeing any memory when parsing multiple documents in
> quick succession. The changelog says there was a memory issue fixed, so
> perhaps this introduced the bug at the same time?
>
> I've seen (but not consistently) the lxml memory allocation failed:
> growing buffer message. Normally it just runs my machine out of memory.
>
> Rob
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>

From douglas at openplans.org Tue Dec 23 01:42:41 2008
From: douglas at openplans.org (Douglas Mayle)
Date: Mon, 22 Dec 2008 19:42:41 -0500
Subject: [lxml-dev] lxml forms and cookies...
Message-ID: <92A882AB-D88C-435C-A441-FFA7FA5AF61D@openplans.org>

Hey everyone, I've been trying to use the lxml forms with client
cookies to handle html logins. Using cookielib, I'm able to manually
login to a page using either urllib or urllib2 and have it work. The
moment I try to use lxml.html.submit_form(), however, it fails. I've
tried sniffing the packets, and it turns out that lxml is never
sending cookies, which makes me guess that lxml is using neither
urllib nor urllib2. How do I use cookies with lxml?

Doug

From stefan_ml at behnel.de Tue Dec 23 07:38:52 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 23 Dec 2008 07:38:52 +0100
Subject: [lxml-dev] lxml 2.1.4/2.2beta1 Solaris 9 segv in test-suite
In-Reply-To: <20081218170315.GA24502@mail.orbiteam.de>
References: <20081218170315.GA24502@mail.orbiteam.de>
Message-ID: <4950877C.9020600@behnel.de>

Hi,

thanks for the report.

Volker Paulsen wrote:
> I just compiled lxml-2.1.4 (and lxml-2.2beta)
> with gcc 4.2.4 against
>
> - libxml2-2.7.2
> - libxslt-1.1.24
>
> Unfortunately the test "test_schematron_invalid_schema_empty" causes a
> segmentation violation with Python 2.5 and Python 2.6;
>
> Please find a gdb backtrace for Python 2.6 and lxml-2.1.4 (and
> lxml-2.2beta) attached.

I don't think I've seen this before, might be specific to Solaris. From
the stack trace, it's not certain that the problem is in lxml, as the
error is handled purely inside libxml2 up to that point. I'd say you're
safe if you don't use schematron (which most people won't run into
anyway).

Could you try to reproduce this with 'xmllint' (comes with libxml2) and
the empty schema given by the test case? That would allow us to see if
it's a problem with libxml2.

Thanks,
Stefan

From jjl at pobox.com Tue Dec 23 13:50:47 2008
From: jjl at pobox.com (John J Lee)
Date: Tue, 23 Dec 2008 12:50:47 +0000 (GMT)
Subject: [lxml-dev] lxml forms and cookies...
In-Reply-To: <92A882AB-D88C-435C-A441-FFA7FA5AF61D@openplans.org>
References: <92A882AB-D88C-435C-A441-FFA7FA5AF61D@openplans.org>
Message-ID:

On Mon, 22 Dec 2008, Douglas Mayle wrote:
> Hey everyone, I've been trying to use the lxml forms with client
> cookies to handle html logins. Using cookielib, I'm able to manually
> login to a page using either urllib or urllib2 and have it work. The
> moment I try to use lxml.html.submit_form(), however, it fails. I've
> tried sniffing the packets, and it turns out that lxml is never
> sending cookies, which makes me guess that lxml is using neither
> urllib nor urllib2. How do I use cookies with lxml?

lxml.html.submit_form() has an open_http parameter:

    import urllib
    import urllib2
    import urlparse

    import lxml.html


    def url_with_query(url, values):
        parts = urlparse.urlparse(url)
        rest, (query, frag) = parts[:-2], parts[-2:]
        return urlparse.urlunparse(rest + (urllib.urlencode(values), None))


    def make_open_http():
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        opener.addheaders = []  # pretend we're a human -- don't do this

        def open_http(method, url, values={}):
            if method == "POST":
                return opener.open(url, urllib.urlencode(values))
            else:
                return opener.open(url_with_query(url, values))

        return open_http


    open_http = make_open_http()
    tree = lxml.html.fromstring(open_http("GET", "http://python.org").read())
    form = tree.forms[0]
    form.fields["q"] = "lxml"
    submit_values = {"submit": form.fields["submit"]}
    response = lxml.html.submit_form(form, extra_values=submit_values,
                                     open_http=open_http)
    html = response.read()
    doc = lxml.html.fromstring(html)
    lxml.html.open_in_browser(doc)


John

From optilude at gmx.net Wed Dec 31 02:39:02 2008
From: optilude at gmx.net (Martin Aspeli)
Date: Wed, 31 Dec 2008 01:39:02 +0000
Subject: [lxml-dev] Working with
Message-ID:

Hi,

I'd like to quickly/efficiently get a list of all xml-stylesheet
processing instructions in a given document.

I have managed to find it via root_tree.getprevious(), but it seems I
need to search through the siblings here to find the xml-stylesheet PI
if indeed there is one.

I'm using the HTML parser.

Is there a more natural API?

Also, serialising using lxml.html.tostring() seems to lose the
xml-stylesheet PI. Is this by design?

Cheers,
Martin

--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone.
See http://martinaspeli.net/plone-book

From optilude at gmx.net Wed Dec 31 13:08:15 2008
From: optilude at gmx.net (Martin Aspeli)
Date: Wed, 31 Dec 2008 12:08:15 +0000
Subject: [lxml-dev] Working with
In-Reply-To:
References:
Message-ID:

Martin Aspeli wrote:
> Hi,
>
> I'd like to quickly/efficiently get a list of all xml-stylesheet
> processing instructions in a given document.
>
> I have managed to find it via root_tree.getprevious(), but it seems I
> need to search through the siblings here to find the xml-stylesheet PI
> if indeed there is one.
>
> I'm using the HTML parser.
>
> Is there a more natural API?
>
> Also, serialising using lxml.html.tostring() seems to lose the
> xml-stylesheet PI. Is this by design?

Mmmm.... and another thing: once I get the HtmlProcessingInstruction
node, how can I get the value of its pseudo-attributes (href and type,
in this case)? The attr dict is empty...

Martin

--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone.
See http://martinaspeli.net/plone-book

From stefan_ml at behnel.de Wed Dec 31 13:17:26 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 31 Dec 2008 13:17:26 +0100
Subject: [lxml-dev] Working with
In-Reply-To:
References:
Message-ID: <495B62D6.9020505@behnel.de>

Hi,

Martin Aspeli wrote:
> I'd like to quickly/efficiently get a list of all xml-stylesheet
> processing instructions in a given document.

    # itersiblings(preceding=True) walks backwards from the root element,
    # so the matches are reversed to get document order; reversed() needs
    # a sequence, hence the list
    reversed([ el for el in root.itersiblings(preceding=True)
               if el.tag is etree.ProcessingInstruction
               and el.target == "xml-stylesheet" ])

> Also, serialising using lxml.html.tostring() seems to lose the
> xml-stylesheet PI. Is this by design?

You need to wrap the root element in an ElementTree and serialise that.

Stefan
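
A minimal sketch of the two suggestions in the reply above, assuming a
document parsed with the XML parser (lxml.etree) rather than the HTML
parser used in the question. The helper name stylesheet_pis and the
sample document are made up for illustration. It collects the
xml-stylesheet processing instructions in front of the root element,
then serialises both the bare root element and the wrapping ElementTree
so the two results can be compared:

```python
from lxml import etree

# Hypothetical sample document: two stylesheet PIs ahead of the root element.
SOURCE = (
    b'<?xml-stylesheet type="text/css" href="a.css"?>'
    b'<?xml-stylesheet type="text/css" href="b.css"?>'
    b'<root><child/></root>'
)


def stylesheet_pis(root):
    """Return the xml-stylesheet PIs preceding ``root`` in document order."""
    # itersiblings(preceding=True) walks backwards from the root element,
    # so collect the matches in a list and reverse it afterwards.
    found = [
        el for el in root.itersiblings(preceding=True)
        if el.tag is etree.ProcessingInstruction
        and el.target == "xml-stylesheet"
    ]
    found.reverse()
    return found


root = etree.fromstring(SOURCE)
pis = stylesheet_pis(root)
pi_texts = [pi.text for pi in pis]

# Serialise the element on its own, then the ElementTree that wraps it.
root_only = etree.tostring(root)
whole_document = etree.tostring(etree.ElementTree(root))
```

Here pi_texts holds the raw content of each PI as one string, so the
href and type pseudo-attributes would still have to be picked out of
that text.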