From shwg.geng at gmail.com  Thu Dec  2 06:02:42 2010
From: shwg.geng at gmail.com (Shaung)
Date: Thu, 2 Dec 2010 14:02:42 +0900
Subject: [lxml-dev] How to register extension functions for XSL transformations
Message-ID:

Hello all,

I am trying to convert an XML document into HTML.

Input:

Expected output:
SomeVar's Description

The xsl file I am using:
nothing

and in python:

class MyControlAdapter:
    def GetDescFor(self, _, name):
        # foo is the module holding a list of variables
        return foo.get_variable(name).get_desc()

ext_mod = MyControlAdapter()
funcs = ('GetDescFor',)
exts = etree.Extension(ext_mod, funcs, ns='myctl')

transform = etree.XSLT(xslt_doc, extensions=exts)
rslt = transform(input_doc)

But I always get an exception:
lxml.etree.XSLTApplyError: XPath evaluation returned no result.

So am I doing it wrong? If so, how can I get it done without modifying the
XSLT contents?
I am pretty new to XSLT, so my whole approach might be wrong, but I have
too little knowledge to know how wrong I am...

Any help would be greatly appreciated.
(Sorry for my English)

Shaung

From jholg at gmx.de  Thu Dec  2 10:35:52 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 02 Dec 2010 10:35:52 +0100
Subject: [lxml-dev] How to register extension functions for XSL transformations
In-Reply-To:
References:
Message-ID: <20101202093552.69500@gmx.net>

Hi,

> class MyControlAdapter:
>     def GetDescFor(self, _, name):
>         # foo is the module holding a list of variables
>         return foo.get_variable(name).get_desc()
>
> ext_mod = MyControlAdapter()
> funcs = ('GetDescFor',)
> exts = etree.Extension(ext_mod, funcs, ns='myctl')
>
> transform = etree.XSLT(xslt_doc, extensions=exts)
> rslt = transform(input_doc)
>
> But I always get an exception:
> lxml.etree.XSLTApplyError: XPath evaluation returned no result.
>
> So am I doing it wrong? If so, how can I get it done without modifying the
> XSLT contents?
> I am pretty new to XSLT, so my whole approach might be wrong, but I have
> too little knowledge to know how wrong I am...

You're confusing namespaces and namespace prefixes. lxml generally uses
namespaces, not prefixes, which is The Right Way. This should help you on:

ext_mod = MyControlAdapter()
funcs = ('GetDescFor',)
exts = etree.Extension(ext_mod, funcs, ns='urn:MyControlAdapter')
transform = etree.XSLT(xslt_doc, extensions=exts)
rslt = transform(input_doc)

Holger

From dpritsos at extremepro.gr  Thu Dec  2 12:17:58 2010
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Thu, 02 Dec 2010 13:17:58 +0200
Subject: [lxml-dev] lxml-dev Digest, Vol 75, Issue 1
In-Reply-To:
References:
Message-ID: <4CF78066.8050706@extremepro.gr>

Hello guys,

I am sorry that I am sending this as a response, but there are two issues
I'd like to point out:

1. There is a memory leak when using lxml.html.parse (or etree) constantly
in a loop. In particular, creating etrees in a loop leaves the trees there
and does not delete them properly when you reuse the same Python variable
to store the results. For now I haven't tried to resolve it because module
re (regular expression) is just fine for URL extraction, however I would
prefer the use of XPath for extracting a variety of links more easily from
a coding point of view. Plus I think that the overhead of tree building is
not so much (I don't know for sure though).

2. Speaking of XPath for URL extraction, I think that lxml.html has some
issues in URL extraction (this is what I think from reading the code of
this module). And the question is why not use XPath to make the code twice
smaller and twice neater (i.e. cleaner and well formed - I hope my
vocabulary is correct), maybe faster too.

Best Regards,

Dimitrios

On 12/02/2010 11:35 AM, lxml-dev-request at codespeak.net wrote:
> Send lxml-dev mailing list submissions to
>     lxml-dev at codespeak.net
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     http://codespeak.net/mailman/listinfo/lxml-dev
> or, via email, send a message with subject or body 'help' to
>     lxml-dev-request at codespeak.net
>
> You can reach the person managing the list at
>     lxml-dev-owner at codespeak.net
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lxml-dev digest..."
>
>
> Today's Topics:
>
>    1. Re: Compile failure (felix)
>    2. Re: Compile failure (jholg at gmx.de)
>    3. Re: SyntaxErrors with Python 3 (Stefan Behnel)
>    4. Re: Schema validation - no file position (Stefan Behnel)
>    5. Re: Schema validation - no file position (Krzysztof Jakubczyk)
>    6. Re: Schema validation - no file position (Stefan Behnel)
>    7. Re: read .xlsx spreadsheets with lxml ? (Chris Withers)
>    8. How to register extension functions for XSL transformations
>       (Shaung)
>    9. Re: How to register extension functions for XSL
>       transformations (jholg at gmx.de)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 23 Nov 2010 11:14:19 +0100
> From: felix
> Subject: Re: [lxml-dev] Compile failure
> To: Stefan Behnel
> Cc: lxml-dev at codespeak.net
> Message-ID:
> Content-Type: text/plain; charset="utf-8"
>
> On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel wrote:
>
>> felix, 26.10.2010 15:28:
>>
>> According to this:
>>> http://codespeak.net/lxml/build.html
>>>
>>> we should avoid installing Cython
>>>
>>> but using easy_install to build fails saying the cython generated file is
>>> missing
>>>
>> I doubt that it's failing because of that. However, you didn't provide the
>> output of the build, so I can't guess what happened that actually made the
>> build fail.
>
> sorry, that output had scrolled off by the time I realized I should submit a
> report. I have another server so fortunately I can fail there and show you.
>
> crucial at crucial-systems:~/working/lxml$ python setup.py build
> /home/crucial/working/lxml/versioninfo.py:53: UserWarning: unrecognized
> .svn/entries format; skipping /home/crucial/working/lxml/
>   warn("unrecognized .svn/entries format; skipping "+base)
> Building lxml version 2.3.beta1.
> *NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c'
> needs to be available.*
> Using build configuration of libxslt 1.1.26
> Building against libxml2/libxslt in the following directory: /usr/lib
> running build
> running build_py
> running build_ext
> building 'lxml.etree' extension
> gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall
> -Wstrict-prototypes -fPIC -I/usr/local/include/libxml2
> -I/usr/include/python2.6 -c src/lxml/lxml.etree.c -o
> build/temp.linux-x86_64-2.6/src/lxml/lxml.etree.o -w
> gcc: src/lxml/lxml.etree.c: No such file or directory
> gcc: no input files
> error: command 'gcc' failed with exit status 1
>
>
>> The latest build instructions for the SVN trunk are in the SVN trunk as
>> "doc/build.txt", or *(not always completely up-to-date)* here:
>>
> exactly
>
>
>> *but then I succeeded with the old sudo easy_install lxml*
>>>
>>> because now I have Cython
>>>
>> Again, I doubt that this is the reason.
>>
> sudo easy_install lxml
> failed before
>
> after installing Cython it says it uses Cython (not Trying to build without
> Cython) and it worked.
>
> nothing else having changed I thought it was a reasonable guess that it
> worked because it used Cython because Cython is installed.
>
>> Stefan
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: http://codespeak.net/pipermail/lxml-dev/attachments/20101123/7777e0de/attachment-0001.htm
>
> ------------------------------
>
> Message: 2
> Date: Tue, 23 Nov 2010 11:35:07 +0100
> From: jholg at gmx.de
> Subject: Re: [lxml-dev] Compile failure
> To: felix
> Cc: lxml-dev at codespeak.net
> Message-ID:<20101123103507.100470 at gmx.net>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
>> On Sat, Oct 30, 2010 at 10:49 AM, Stefan Behnel wrote:
>>
>>> felix, 26.10.2010 15:28:
>>>
>>> According to this:
>>>> http://codespeak.net/lxml/build.html
>>>>
>>>> we should avoid installing Cython
>>>>
>>>> but using easy_install to build fails saying the cython generated file is
>>>> missing
> Note that it says
> """
> Since we distribute the Cython-generated .c files with lxml *releases*,
> however, you do not need Cython to build lxml from the normal *release*
> sources.
> """
>
> So the Cython-generated .c files are not in an SVN checkout but should be
> in the release packages.
>
>
>>> Again, I doubt that this is the reason.
>>>
>> sudo easy_install lxml
>> failed before
>>
>> after installing Cython it says it uses Cython (not Trying to build without
>> Cython) and it worked.
>>
>> nothing else having changed I thought it was a reasonable guess that it
>> worked because it used Cython because Cython is installed.
> I just checked the 2.3beta1 (source) package on pypi and it does contain
> the .c files:
>
> -rw-r--r-- 1000/1000 7102827 Sep  6 09:31 2010 lxml-2.3beta1/src/lxml/lxml.etree.c
> -rw-r--r-- 1000/1000 1318011 Sep  6 09:33 2010 lxml-2.3beta1/src/lxml/lxml.objectify.c
>
>
> Holger

From stefan_ml at behnel.de  Thu Dec  2 13:48:22 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Dec 2010 13:48:22 +0100
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF78066.8050706@extremepro.gr>
References: <4CF78066.8050706@extremepro.gr>
Message-ID: <4CF79596.30206@behnel.de>

Dimitrios Pritsos, 02.12.2010 12:17:
> I am sorry that I am sending this as a response

No need to do so if you want to start a new topic. Just send a message
directly to the list address. Replies are for replying.

> There is a memory leak when using lxml.html.parse (or etree) constantly
> in a loop. In particular, creating etrees in a loop leaves the trees there
> and does not delete them properly when you reuse the same Python variable
> to store the results.

I can reproduce this. I'll take a look ASAP.

Stefan

From stefan_ml at behnel.de  Thu Dec  2 13:53:28 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Dec 2010 13:53:28 +0100
Subject: [lxml-dev] URL extraction from HTML documents
In-Reply-To: <4CF78066.8050706@extremepro.gr>
References: <4CF78066.8050706@extremepro.gr>
Message-ID: <4CF796C8.3030006@behnel.de>

Dimitrios Pritsos, 02.12.2010 12:17:
> module re (regular expression) is just fine for
> URL extraction, however I would prefer the use of XPath for extracting a
> variety of links more easily from a coding point of view.

Sure, lxml.html has specific support for extracting URLs from parsed
documents.

> Plus I think that the overhead of tree building is not so much (I don't
> know for sure though).

Likely slower than re, but also likely fast enough.
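
For illustration only, and not part of the thread: a minimal sketch of what
the URL extraction support mentioned above could look like next to an XPath
query over the same tree. It assumes Python 3 and a current lxml; the sample
markup and the two helper names are made up for this example, and the XPath
expression only covers a/@href and img/@src, whereas iterlinks() knows about
more link-carrying attributes.

```python
import lxml.html

# Hypothetical sample document, used only for this illustration.
SAMPLE = """
<html><body>
  <a href="/docs/index.html">Docs</a>
  <img src="logo.png">
  <a href="http://example.org/">Example</a>
</body></html>
"""


def links_via_iterlinks(html_text):
    """Collect links with lxml.html's built-in iterlinks()."""
    doc = lxml.html.fromstring(html_text)
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [link for _element, _attribute, link, _pos in doc.iterlinks()]


def links_via_xpath(html_text):
    """Collect links with an explicit XPath expression."""
    doc = lxml.html.fromstring(html_text)
    # Only the attributes named here are found; extend the union as needed.
    return [str(url) for url in doc.xpath("//a/@href | //img/@src")]


if __name__ == "__main__":
    print(links_via_iterlinks(SAMPLE))
    print(links_via_xpath(SAMPLE))
```

On this sample both helpers find the same three links; they would diverge as
soon as a document carries links in places the XPath expression does not name.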

> Speaking of XPath for URL extraction, I think that lxml.html has some
> issues in URL extraction (this is what I think from reading the code of
> this module).

Such as ... ?

> And the question is why not use XPath to make the code twice smaller and
> twice neater (i.e. cleaner and well formed - I hope my vocabulary is
> correct), maybe faster too.

Maybe. If you want to provide a patch that simplifies the code and back it
with sufficient evidence that it's at least as fast as before and doesn't
degrade functionality, I'll be happy to accept it.

Stefan

From shwg.geng at gmail.com  Thu Dec  2 14:18:25 2010
From: shwg.geng at gmail.com (Shaung)
Date: Thu, 2 Dec 2010 22:18:25 +0900
Subject: [lxml-dev] How to register extension functions for XSL transformations
In-Reply-To:
References: <20101202093552.69500@gmx.net>
Message-ID:

On Thu, Dec 2, 2010 at 6:35 PM, jholg at gmx.de wrote:
> Hi,
>
> You're confusing namespaces and namespace prefixes. lxml generally uses
> namespaces, not prefixes, which is The Right Way. This should help you on:
>
> ext_mod = MyControlAdapter()
> funcs = ('GetDescFor',)
> exts = etree.Extension(ext_mod, funcs, ns='urn:MyControlAdapter')
> transform = etree.XSLT(xslt_doc, extensions=exts)
> rslt = transform(input_doc)
>
> Holger
>

This solved my problem. Thanks a lot!

Shaung

From dpritsos at extremepro.gr  Thu Dec  2 16:35:23 2010
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Thu, 02 Dec 2010 17:35:23 +0200
Subject: [lxml-dev] URL extraction from HTML documents
In-Reply-To: <4CF796C8.3030006@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF796C8.3030006@behnel.de>
Message-ID: <4CF7BCBB.5050105@extremepro.gr>

On 12/02/2010 02:53 PM, Stefan Behnel wrote:
> Dimitrios Pritsos, 02.12.2010 12:17:
>> module re (regular expression) is just fine for
>> URL extraction, however I would prefer the use of XPath for extracting a
>> variety of links more easily from a coding point of view.
>
> Sure, lxml.html has specific support for extracting URLs from parsed
> documents.
>
But what about the memory leak? I am sorry if there is a solution
already. However, I believe that this is not intuitive at all (I mean
the whole tree staying in memory like garbage and not being replaced). I
don't think that I am experienced enough to fix this.
>
>> Plus I think that the overhead of tree building is not so much (I don't
>> know for sure though).
>
> Likely slower than re, but also likely fast enough.
>
>
>> Speaking of XPath for URL extraction, I think that lxml.html has some
>> issues in URL extraction (this is what I think from reading the code of
>> this module).
>
> Such as ... ?
>
I just think it is harder to keep up with all the definitions of HTML 4.0
(XHTML 1.0, 1.1 etc.) and keep the code up-to-date. XPath (I think) will
be more general. Just that :)
>
>> And the question is why not use XPath to make the code twice smaller and
>> twice neater (i.e. cleaner and well formed - I hope my vocabulary is
>> correct), maybe faster too.
>
> Maybe. If you want to provide a patch that simplifies the code and
> back it with sufficient evidence that it's at least as fast as before
> and doesn't degrade functionality, I'll be happy to accept it.

As for lxml.html, I think I can send something XPath based that is also
multiprocessing/multi-threaded (optionally). But it still needs some
work and I don't have enough time to finish it right now, because this
is a critical phase for my PhD and job.
>
> Stefan

Thank you very much for your fast response!

Dimitrios

From stefan_ml at behnel.de  Thu Dec  2 17:07:48 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Dec 2010 17:07:48 +0100
Subject: [lxml-dev] URL extraction from HTML documents
In-Reply-To: <4CF7BCBB.5050105@extremepro.gr>
References: <4CF78066.8050706@extremepro.gr> <4CF796C8.3030006@behnel.de> <4CF7BCBB.5050105@extremepro.gr>
Message-ID: <4CF7C454.4060804@behnel.de>

Dimitrios Pritsos, 02.12.2010 16:35:
> But what about the memory leak?

See my other mail.

> On 12/02/2010 02:53 PM, Stefan Behnel wrote:
>> Dimitrios Pritsos, 02.12.2010 12:17:
>>> Speaking of XPath for URL extraction, I think that lxml.html has some
>>> issues in URL extraction (this is what I think from reading the code of
>>> this module).
>>
>> Such as ... ?
>>
> I just think it is harder to keep up with all the definitions of HTML 4.0
> (XHTML 1.0, 1.1 etc.) and keep the code up-to-date. XPath (I think) will
> be more general.

How would that be more general? The expressions would simply select what
the code currently selects as well. Could you provide an example of what
you have in mind?

>>> And the question is why not use XPath to make the code twice smaller and
>>> twice neater (i.e. cleaner and well formed - I hope my vocabulary is
>>> correct), maybe faster too.
>>
>> Maybe. If you want to provide a patch that simplifies the code and
>> back it with sufficient evidence that it's at least as fast as before
>> and doesn't degrade functionality, I'll be happy to accept it.
>
> As for lxml.html, I think I can send something XPath based that is also
> multiprocessing/multi-threaded (optionally).

I don't think multi-threading (and especially not multiprocessing) makes
any sense here. They should get applied at the document level, not within
a single document.
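
For illustration only, and not part of the thread: a minimal sketch of what
parallelism at the document level could look like, assuming Python 3 and a
current lxml. The helper names and the sample documents are hypothetical.
Each task parses one complete document with its own parser, and only plain
strings are passed back between threads.

```python
from concurrent.futures import ThreadPoolExecutor

import lxml.html


def extract_links(html_text):
    """Parse one complete document and return its <a href> values."""
    # One parser per call, so that no parser object is shared between threads.
    parser = lxml.html.HTMLParser()
    doc = lxml.html.fromstring(html_text, parser=parser)
    return [str(href) for href in doc.xpath("//a/@href")]


def extract_links_from_many(documents, max_workers=4):
    """Run extract_links() over many documents, one document per task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands back the results in the order of the input documents.
        return list(pool.map(extract_links, documents))


if __name__ == "__main__":
    # Hypothetical sample documents.
    pages = [
        '<html><body><a href="a.html">A</a></body></html>',
        '<html><body><a href="b.html">B</a><a href="c.html">C</a></body></html>',
    ]
    for links in extract_links_from_many(pages):
        print(links)
```

The unit of work here is a whole document; nothing inside a single tree is
split across threads.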

Stefan

From Marc.Graff at VerizonWireless.com  Thu Dec  2 19:49:10 2010
From: Marc.Graff at VerizonWireless.com (Graff, Marc)
Date: Thu, 2 Dec 2010 13:49:10 -0500
Subject: [lxml-dev] URL extraction from HTML documents
In-Reply-To:
References: <4CF78066.8050706@extremepro.gr> <4CF796C8.3030006@behnel.de>
Message-ID: <20101202184915.4D600282B9E@codespeak.net>

I have only leveraged lxml for one project so I am no expert. If your
code is small enough and you are not prohibited from doing so, can you
provide it so others can understand what you are doing and better
determine the memory leak you encountered and its possible cause?

Thanks.

From stefan_ml at behnel.de  Thu Dec  2 20:11:22 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Dec 2010 20:11:22 +0100
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF79596.30206@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF79596.30206@behnel.de>
Message-ID: <4CF7EF5A.9090706@behnel.de>

Stefan Behnel, 02.12.2010 13:48:
> Dimitrios Pritsos, 02.12.2010 12:17:
>> I am sorry that I am sending this as a response
>
> No need to do so if you want to start a new topic. Just send a message
> directly to the list address. Replies are for replying.
>
>
>> There is a memory leak when using lxml.html.parse (or etree) constantly
>> in a loop. In particular, creating etrees in a loop leaves the trees there
>> and does not delete them properly when you reuse the same Python variable
>> to store the results.
>
> I can reproduce this. I'll take a look ASAP.

It's easily reproducible.
I can parse a document repeatedly in a loop using lxml.html.parse() and
see the memory consumption of the Python process grow. I reproduced it with
2.3-pre, don't know if 2.2 suffers from the same problem. I'll see about
that once I've figured out what happens.

It's only a problem with the HTML parser, and it's not related to
lxml.html. This is enough to reproduce it:

from lxml import etree

p = etree.HTMLParser()
while True:
    etree.parse("somefile.html", p)

Stefan

From stefan_ml at behnel.de  Thu Dec  2 22:13:46 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 02 Dec 2010 22:13:46 +0100
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF7EF5A.9090706@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF79596.30206@behnel.de> <4CF7EF5A.9090706@behnel.de>
Message-ID: <4CF80C0A.2070803@behnel.de>

Stefan Behnel, 02.12.2010 20:11:
> Stefan Behnel, 02.12.2010 13:48:
>> Dimitrios Pritsos, 02.12.2010 12:17:
>>> I am sorry that I am sending this as a response
>>
>> No need to do so if you want to start a new topic. Just send a message
>> directly to the list address. Replies are for replying.
>>
>>
>>> There is a memory leak when using lxml.html.parse (or etree) constantly
>>> in a loop. In particular, creating etrees in a loop leaves the trees there
>>> and does not delete them properly when you reuse the same Python variable
>>> to store the results.
>>
>> I can reproduce this. I'll take a look ASAP.
>
> It's easily reproducible. I can parse a document repeatedly in a loop using
> lxml.html.parse() and see the memory consumption of the Python process
> grow. I reproduced it with 2.3-pre, don't know if 2.2 suffers from the same
> problem. I'll see about that once I've figured out what happens.
>
> It's only a problem with the HTML parser, and it's not related to
> lxml.html. This is enough to reproduce it:
>
> from lxml import etree
>
> p = etree.HTMLParser()
> while True:
>     etree.parse("somefile.html", p)

I think it may be an issue with libxml2. The memory consumption seems to be
stable with 2.7.7 and 2.7.8 but not with my system's 2.7.6.

What's the version you use? Could you try the latest one?

http://codespeak.net/lxml/dev/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

Stefan

From dpritsos at extremepro.gr  Thu Dec  2 22:34:49 2010
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Thu, 02 Dec 2010 23:34:49 +0200
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF80C0A.2070803@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF79596.30206@behnel.de> <4CF7EF5A.9090706@behnel.de> <4CF80C0A.2070803@behnel.de>
Message-ID: <4CF810F9.9080007@extremepro.gr>

On 02/12/10 23:13, Stefan Behnel wrote:
> Stefan Behnel, 02.12.2010 20:11:
>> Stefan Behnel, 02.12.2010 13:48:
>>> Dimitrios Pritsos, 02.12.2010 12:17:
>>>> I am sorry that I am sending this as a response
>>>
>>> No need to do so if you want to start a new topic. Just send a message
>>> directly to the list address. Replies are for replying.
>>>
>>>
>>>> There is a memory leak when using lxml.html.parse (or etree) constantly
>>>> in a loop. In particular, creating etrees in a loop leaves the trees
>>>> there and does not delete them properly when you reuse the same Python
>>>> variable to store the results.
>>>
>>> I can reproduce this. I'll take a look ASAP.
>>
>> It's easily reproducible. I can parse a document repeatedly in a loop
>> using lxml.html.parse() and see the memory consumption of the Python
>> process grow. I reproduced it with 2.3-pre, don't know if 2.2 suffers
>> from the same problem. I'll see about that once I've figured out what
>> happens.
>>
>> It's only a problem with the HTML parser, and it's not related to
>> lxml.html.
>> This is enough to reproduce it:
>>
>> from lxml import etree
>>
>> p = etree.HTMLParser()
>> while True:
>>     etree.parse("somefile.html", p)
>
> I think it may be an issue with libxml2. The memory consumption seems
> to be stable with 2.7.7 and 2.7.8 but not with my system's 2.7.6.
>
> What's the version you use? Could you try the latest one?
>
> http://codespeak.net/lxml/dev/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do
>
>
> Stefan

OK, I will check it and let you all know as soon as possible.

Cheers!

Dimitrios

From dpritsos at extremepro.gr  Thu Dec  2 22:44:56 2010
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Thu, 02 Dec 2010 23:44:56 +0200
Subject: [lxml-dev] URL extraction from HTML documents
In-Reply-To:
References: <4CF78066.8050706@extremepro.gr> <4CF796C8.3030006@behnel.de>
Message-ID: <4CF81358.7030800@extremepro.gr>

On 02/12/10 20:49, Graff, Marc wrote:
> I have only leveraged lxml for one project so I am no expert. If your
> code is small enough and you are not prohibited from doing so, can you
> provide it so others can understand what you are doing and how to make a
> better determination of the memory leak you encountered and possible
> cause?
>
> Thanks.

I think that I sent a sample case, as a response to Stefan's sample code
for reproducing the memory leak.

regards

Dimitrios

From jholg at gmx.de  Fri Dec  3 10:13:24 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 03 Dec 2010 10:13:24 +0100
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF80C0A.2070803@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF79596.30206@behnel.de> <4CF7EF5A.9090706@behnel.de> <4CF80C0A.2070803@behnel.de>
Message-ID: <20101203091324.121800@gmx.net>

Hi,

> > It's only a problem with the HTML parser, and it's not related to
> > lxml.html. This is enough to reproduce it:
> >
> > from lxml import etree
> >
> > p = etree.HTMLParser()
> > while True:
> >     etree.parse("somefile.html", p)
>
> I think it may be an issue with libxml2. The memory consumption seems to be
> stable with 2.7.7 and 2.7.8 but not with my system's 2.7.6.
>
> What's the version you use? Could you try the latest one?

FWIW, memory looks stable here (vintage version):

python2.4 -i -c 'from lxml import etree; print etree.__version__; print "%s (%s) - %s (%s)" % (etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION, etree.LIBXSLT_VERSION, etree.LIBXSLT_COMPILED_VERSION)'
2.2.6
(2, 6, 32) ((2, 6, 32)) - (1, 1, 23) ((1, 1, 23))
>>> p = etree.HTMLParser()
>>> while True:
...     doc = etree.parse("/ae/data/tmp/hjoukl/lxml-codespeak.htm", p)
...

Holger

From lists at cheimes.de  Fri Dec 10 02:35:04 2010
From: lists at cheimes.de (Christian Heimes)
Date: Fri, 10 Dec 2010 02:35:04 +0100
Subject: [lxml-dev] Test coverage and profiling for XSL(T) code
Message-ID: <4D0183C8.7070005@cheimes.de>

Hello!

We are using Ned Batchelder's excellent coverage tool [1] to measure the
code coverage of our unit test suite. Now I'm looking for a way to get a
coverage report for our large XSL(T) code base.
Is there any way to see which lines of an XSL file are executed with lxml?
The word "coverage" is mentioned at [2] for version 2.7.0 but I can't find
any additional information.

In a slightly related matter, how can I profile the speed of our
templates? I'd like to search for hot spots and optimize them. In my
experience with Python optimization, only a profiler is able to detect
hot spots during runtime. Now I'd like to do the same with XSL code.

Thanks,
Christian

[1] http://pypi.python.org/pypi/coverage
[2] http://xmlsoft.org/xml.html

From stefan_ml at behnel.de  Fri Dec 10 02:43:21 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 10 Dec 2010 02:43:21 +0100
Subject: [lxml-dev] Test coverage and profiling for XSL(T) code
In-Reply-To: <4D0183C8.7070005@cheimes.de>
References: <4D0183C8.7070005@cheimes.de>
Message-ID: <4D0185B9.1010107@behnel.de>

Christian Heimes, 10.12.2010 02:35:
> We are using Ned Batchelder's excellent coverage tool [1] to measure the
> code coverage of our unit test suite. Now I'm looking for a way to get a
> coverage report for our large XSL(T) code base. Is there any way to see
> which lines of an XSL file are executed with lxml? The word "coverage" is
> mentioned at [2] for version 2.7.0 but I can't find any additional
> information.
>
> In a slightly related matter, how can I profile the speed of our
> templates? I'd like to search for hot spots and optimize them. In my
> experience with Python optimization, only a profiler is able to detect
> hot spots during runtime. Now I'd like to do the same with XSL code.

Does this help?
http://codespeak.net/lxml/xpathxslt.html#profiling Stefan From d.rothe at semantics.de Fri Dec 10 08:32:50 2010 From: d.rothe at semantics.de (Dirk Rothe) Date: Fri, 10 Dec 2010 08:32:50 +0100 Subject: [lxml-dev] Test coverage and profiling for XSL(T) code In-Reply-To: <4D0185B9.1010107@behnel.de> References: <4D0183C8.7070005@cheimes.de> <4D0185B9.1010107@behnel.de> Message-ID: On Fri, 10 Dec 2010 02:43:21 +0100, Stefan Behnel wrote: > Christian Heimes, 10.12.2010 02:35: >> We are using Ned Batchelder's excellent coverage tool [1] to measure the >> code coverage of our unit test suite. Now I'm looking for a way to get a >> coverage report for our large XSL(T) code base. Is there any way to see >> which lines of a XSL file are executed with lxml? The word "coverage" is >> mentioned at [2] for version 2.7.0 but I can't find any additional >> information. See: http://article.gmane.org/gmane.comp.python.lxml.devel/4255/match=line+coverage >> In a slightly related matter, how can I profile the speed of our >> templates? I like to search for hot spots and optimize them. In my >> experience with Python optimization, only a profiler is able to detect >> hot spots during runtime. In my experience studying logfiles can help as well ;) From henke at mac.se Tue Dec 14 14:18:35 2010 From: henke at mac.se (Henrik) Date: Tue, 14 Dec 2010 14:18:35 +0100 Subject: [lxml-dev] Using cssselect with XHTMLParser Message-ID: <7AD878F9-DBD1-4093-8BCB-56519F646EEB@mac.se> Hello list, I have a problem with using cssselect method on my xhtml elements. I.e xhtml = u""" Foo

Test ©

"""
parser = lxml.html.XHTMLParser(recover=True)
soup = lxml.html.fromstring(xhtml, parser=parser)
h1s = soup.body.cssselect("h1")
print h1s " => This returns an empty list []

Changing from XHTMLParser(recover=True) to HTMLParser(recover=True) makes the code work again, and h1s returns a list with an Element.

How do I get cssselect to work when parsing with XHTMLParser?

Cheers,
Henrik

From stefan_ml at behnel.de Tue Dec 14 15:57:22 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 14 Dec 2010 15:57:22 +0100
Subject: [lxml-dev] Using cssselect with XHTMLParser
In-Reply-To: <7AD878F9-DBD1-4093-8BCB-56519F646EEB@mac.se>
References: <7AD878F9-DBD1-4093-8BCB-56519F646EEB@mac.se>
Message-ID: <4D0785D2.7030503@behnel.de>

Henrik, 14.12.2010 14:18:
> I have a problem with using cssselect method on my xhtml elements.
>
> xhtml = u"""
> > > Foo > > > > > > >
>
> >

Test©

>
>
> > > """
>
> parser = lxml.html.XHTMLParser(recover=True)
> soup = lxml.html.fromstring(xhtml, parser=parser)
> h1s = soup.body.cssselect("h1")
> print h1s " => This returns an empty list []

CSSSelect is namespace aware. Since XHTML documents live in a namespace, you need to use fully qualified tag names.

http://codespeak.net/lxml/cssselect.html#namespaces

Alternatively, use the xhtml_to_html() function in lxml.html to remove the namespaces from the tag names in your document.

Stefan

From henke at mac.se Fri Dec 17 11:05:22 2010
From: henke at mac.se (Henrik)
Date: Fri, 17 Dec 2010 11:05:22 +0100
Subject: [lxml-dev] Using cssselect with XHTMLParser
In-Reply-To:
References: <7AD878F9-DBD1-4093-8BCB-56519F646EEB@mac.se>
Message-ID:

On 14 dec 2010, at 15:59, Stefan Behnel wrote:

> Henrik, 14.12.2010 14:18:
>> I have a problem with using cssselect method on my xhtml elements.
>>
>> xhtml = u"""
>> >> >> Foo >> >> >> >> >> >> >>
>>
>> >>

Test©

>>
>>
>> >> >> """
>>
>> parser = lxml.html.XHTMLParser(recover=True)
>> soup = lxml.html.fromstring(xhtml, parser=parser)
>> h1s = soup.body.cssselect("h1")
>> print h1s " => This returns an empty list []
>
> CSSSelect is namespace aware. Since XHTML documents live in a namespace, you need to use fully qualified tag names.
>
> http://codespeak.net/lxml/cssselect.html#namespaces
>
> Alternatively, use the xhtml_to_html() function in lxml.html to remove the namespaces from the tag names in your document.
>
> Stefan

Hello,

I've been using xhtml_to_html() to resolve some of my issues, but I can't figure out how to write a cssselect() query with a namespace prefix. How should I write it to find the <h1> tag in an XHTML document without using xhtml_to_html()?

I've tried the full cssselect('http://www.w3.org/1999/xhtml | h1') but nothing works :)

Can I set this default namespace as a class variable or something?

Cheers,
Henrik

From art at embed.ly Mon Dec 20 21:54:48 2010
From: art at embed.ly (Art Gibson)
Date: Mon, 20 Dec 2010 15:54:48 -0500
Subject: [lxml-dev] lxml.html text_content() returns html for escaped html?
Message-ID:

In the code snippet below, the text_content() call returns html instead of the escaped html. Is this the desired behaviour?

from lxml.html import fromstring
code_block = '&lt;a href="" title=""&gt;'
doc = fromstring(code_block)
doc.text_content()
#returns <a href="" title="">, instead of the escaped version

Thanks for any help,
Art

From d.rothe at semantics.de Tue Dec 21 00:04:40 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Tue, 21 Dec 2010 00:04:40 +0100
Subject: [lxml-dev] lxml.html text_content() returns html for escaped html?
In-Reply-To:
References:
Message-ID:

On Mon, 20 Dec 2010 21:54:48 +0100, Art Gibson wrote:

> In the code snippet below, the text_content() call returns html
> instead of the escaped html. Is this the desired behaviour?
>
> from lxml.html import fromstring
> code_block = '&lt;a href="" title=""&gt;'
> doc = fromstring(code_block)
> doc.text_content()
> #returns <a href="" title="">, instead of the escaped version
>
> Thanks for any help,

Yes, I would expect that. It is the text of the text nodes. You are leaving XML-Escaping-Space.

From jholg at gmx.de Tue Dec 28 17:23:47 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 28 Dec 2010 17:23:47 +0100
Subject: [lxml-dev] xpath: retrieve attribute names
Message-ID: <20101228162347.137960@gmx.net>

Hi,

is there a convenient way to retrieve element attribute values and *names* using xpath?
I haven't yet found an XPath (1.0) expression and believe this is impossible in XPath 1.0.

It is of course trivial to get at the attribute values:

>>> print etree.__version__
2.2.6
>>> print etree.LIBXML_VERSION
(2, 6, 32)
>>> print etree.LIBXSLT_VERSION
(1, 1, 23)
>>>
>>> root = objectify.Element('root')
>>> root.a = 3
>>> root.a.set('foo', 'bar')
>>> root.a.set('doo', 'woop')
>>> print etree.tostring(root, pretty_print=True)
3
>>> root.xpath('//@*')
['TREE', 'int', 'bar', 'woop']
>>>

All I can come up with for getting at attribute names are ugly beasts like

>>> [ (elt.xpath('name(@*[$i])', i=i+1), elt.xpath('@*[$i]', i=i+1)[0]) for i in range(elt.xpath('count(@*)')) ]
[('py:pytype', 'int'), ('foo', 'bar'), ('doo', 'woop')]

I also thought about using exslt trickery, but now I'm a bit confused about its possible usage in XPath:

>>> root.xpath("dyn:map(//*, name())")
'root'
>>> root.xpath('str:split("12, 13", ",")')
','

Seems like this isn't possible?

Anyhow:
How about adding a getname() method to XPath smart string results, analogous to the getparent() functionality? This would
* return the (ns-qualified) attribute name for attribute results
* return the (ns-qualified) element tag for element results

It seems to me that having getparent() isn't of much use for attribute results if you want to get at the attribute's name.

Holger
--
New: GMX De-Mail - as easy as e-mail, as secure as a letter!
Reserve your De-Mail address now: http://portal.gmx.net/de/go/demail

From jholg at gmx.de Wed Dec 29 09:28:54 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 29 Dec 2010 09:28:54 +0100
Subject: [lxml-dev] xpath: retrieve attribute names
In-Reply-To: <4D1A2DF4.7080602@gmail.com>
References: <20101228162347.137960@gmx.net> <4D1A2DF4.7080602@gmail.com>
Message-ID: <20101229082854.137980@gmx.net>

(cc-ing the list)

Hi,

> hi,
> maybe not too convenient but this works:
>
> >>> x = etree.XML('')
> >>> x.xpath('@*')
> ['1', '2']
> >>> x.xpath('name(@*)')
> 'test'
> >>> x.xpath('name(@*[1])')
> 'test'
> >>> x.xpath('name(@*[2])')
> 'a2'

As I don't know the number of attributes up front, that'll lead to a solution similar to what I posted originally:

> >>>> [ (elt.xpath('name(@*[$i])', i=i+1), elt.xpath('@*[$i]', i=i+1)[0])
> for i in range(elt.xpath('count(@*)')) ]
> > [('py:pytype', 'int'), ('foo', 'bar'), ('doo', 'woop')]
> >

Best regards,
Holger

From stefan_ml at behnel.de Wed Dec 29 10:00:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 29 Dec 2010 10:00:55 +0100
Subject: [lxml-dev] xpath: retrieve attribute names
In-Reply-To: <20101228162347.137960@gmx.net>
References: <20101228162347.137960@gmx.net>
Message-ID: <4D1AF8C7.5040906@behnel.de>

jholg at gmx.de, 28.12.2010 17:23:
> is there a convenient way to retrieve element attribute values and *names*
> using xpath?

As part of an XPath expression? I don't know. But see below.

> >>> [ (elt.xpath('name(@*[$i])', i=i+1), elt.xpath('@*[$i]', i=i+1)[0]) for i in range(elt.xpath('count(@*)')) ]
> [('py:pytype', 'int'), ('foo', 'bar'), ('doo', 'woop')]

I didn't benchmark it, but I have my doubts that this is faster than

[ el.attrib for el in root.xpath('//*[@*]') ]

It's certainly less readable.
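
For readers following along, here is a minimal sketch of the two routes discussed in this thread for getting attribute names together with their values. It assumes a reasonably recent lxml; the sample document and the variable names `by_element` and `by_result` are made up for illustration. The second route relies on the `attrname` property of XPath string results, which, according to the thread, is only available from lxml 2.3 on.

```python
from lxml import etree

# Hypothetical sample document, not taken from the thread.
root = etree.XML('<root><a foo="bar" doo="woop">3</a><b/></root>')

# Route 1: select the elements that carry attributes and read .attrib,
# which maps attribute names to values.
by_element = [dict(el.attrib) for el in root.xpath('//*[@*]')]
print(by_element)

# Route 2: select the attribute values directly. Each result is a
# "smart string" that remembers the attribute it came from (lxml 2.3+).
by_result = [(value.attrname, str(value)) for value in root.xpath('//@*')]
print(by_result)
```

Route 1 keeps names and values grouped per element; route 2 yields one flat list of pairs across the whole document, and `getparent()` on each result leads back to the owning element if that is needed.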
> I also thought about using exslt trickery, but now I'm a bit confused about its possible usage in XPath:
>
> >>> root.xpath("dyn:map(//*, name())")
> 'root'
> >>> root.xpath('str:split("12, 13", ",")')
> ','

I haven't used these in a while. Don't you have to map the prefixes to their namespaces? (although I'd expect an exception if they are undefined).

> Anyhow:
> How about adding a getname() method to XPath smart string results, analogous to the getparent() functionality? This would
> * return the (ns-qualified) attribute name for attribute results
> * return the (ns-qualified) element tag for element results
>
> It seems to me that having getparent() isn't of much use for attribute results if you want to get at the attribute's name.

Since 2.3, there is an "attrname" property on the smart string results of attribute values. Element text results don't need this as you can pass through getparent().

http://codespeak.net/lxml/dev/api/lxml.etree._ElementUnicodeResult-class.html

Stefan

From jholg at gmx.de Wed Dec 29 11:02:23 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 29 Dec 2010 11:02:23 +0100
Subject: [lxml-dev] xpath: retrieve attribute names
In-Reply-To: <4D1AF8C7.5040906@behnel.de>
References: <20101228162347.137960@gmx.net> <4D1AF8C7.5040906@behnel.de>
Message-ID: <20101229100223.219160@gmx.net>

Hi Stefan,

> > >>> [ (elt.xpath('name(@*[$i])', i=i+1), elt.xpath('@*[$i]', i=i+1)[0])
> for i in range(elt.xpath('count(@*)')) ]
> > [('py:pytype', 'int'), ('foo', 'bar'), ('doo', 'woop')]
>
> I didn't benchmark it, but I have my doubts that this is faster than
>
> [ el.attrib for el in root.xpath('//*[@*]') ]
>
> It's certainly less readable.

Definitely.

> > I also thought about using exslt trickery, but now I'm a bit confused
> about its possible usage in XPath:
> >
> > >>> root.xpath("dyn:map(//*, name())")
> > 'root'
> > >>> root.xpath('str:split("12, 13", ",")')
> > ','
>
> I haven't used these in a while. Don't you have to map the prefixes to
> their namespaces? (although I'd expect an exception if they are
> undefined).

>>> root.xpath('str:split("12, 13", ",")')
','
>>> root.xpath('str:split("12, 13", ",")', namespaces={'str': 'http://exslt.org/strings'})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "lxml.etree.pyx", line 1317, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:37255)
  File "xpath.pxi", line 289, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:104028)
  File "xpath.pxi", line 212, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:103335)
  File "xpath.pxi", line 197, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:103158)
lxml.etree.XPathEvalError: Unregistered function
>>>

Looks like these namespaces are automagically recognized. But I just realized that maybe this doesn't work due to my libxslt version. From the 2.3alpha1 change log:

* During regular XPath evaluation, various EXSLT functions are available within their namespace when using libxslt 1.1.26 or later.

I'm running on libxslt 1.1.23.

> Since 2.3, there is an "attrname" property on the smart string results of
> attribute values. Element text results don't need this as you can pass
> through getparent().
>
> http://codespeak.net/lxml/dev/api/lxml.etree._ElementUnicodeResult-class.html

Ah, didn't spot this. Just what I was looking for!

Holger

From grzegorz.slusarek at sensisoft.com Thu Dec 30 13:50:26 2010
From: grzegorz.slusarek at sensisoft.com (Grzegorz Ślusarek)
Date: Thu, 30 Dec 2010 13:50:26 +0100
Subject: [lxml-dev] no error position or line number when walidating against XMLSchema using iterparse
Message-ID: <20101230135026.6e55c06c@skynet>

Hello

I'm trying to use iterparse to validate big xml document using XmlSchema.
Generally it works, but when validating malformed XML, the exception I receive doesn't contain any information about the file position. Specifically, the exception's position attribute is (0, 0), offset is None, and the line number is just None. That makes it very hard to say what's wrong with the XML, especially when it's a big file (300 MB). Is this a bug, or is it the correct behaviour? I found that this question was already asked on this mailing list, but there wasn't any response.

By the way, this happens with lxml 2.2.8 and Python 2.6 (I also tried lxml 2.3 beta and got the same result). Below is the piece of code I use to do the validation:

from lxml.etree import XMLSchema, XMLSyntaxError, iterparse

schema = XMLSchema(file=schemapath)
context = iterparse(file(filepath, 'r'), events=('end',), schema=schema)
try:
    for event, elem in context:
        elem.clear()
        while elem.getprevious() is not None:
            if elem.getparent() is not None:
                del elem.getparent()[0]
except XMLSyntaxError, e:
    .......

Regards,
Grzegorz Ślusarek
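
As a self-contained illustration of the setup described in this post, the following is a rough sketch of streaming validation with iterparse and an XML Schema, written for Python 3 and a current lxml. The schema, the sample document and the helper name `validate_stream` are invented for the example. It does not answer the open question of whether the caught exception carries a usable position for schema errors; it only shows where those attributes would be read.

```python
from io import BytesIO
from lxml import etree

# Tiny illustrative schema: <items> holding one or more integer <item>s.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="items">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validate_stream(source, schema):
    """Stream-parse `source` while validating against `schema`.

    Returns (elements_seen, error); error is None if parsing finished.
    """
    seen = 0
    try:
        for _event, elem in etree.iterparse(source, events=("end",), schema=schema):
            seen += 1
            elem.clear()
            # Drop siblings that were already processed to keep memory flat.
            while elem.getprevious() is not None:
                del elem.getparent()[0]
    except etree.XMLSyntaxError as exc:
        # exc.position and exc.lineno are the attributes the post refers to;
        # whether they are filled in for schema errors is the open question.
        return seen, exc
    return seen, None


schema = etree.XMLSchema(etree.XML(XSD))
document = BytesIO(b"<items><item>1</item><item>2</item></items>")
seen, error = validate_stream(document, schema)
print(seen, error)
```

For a real file, `validate_stream(open(path, "rb"), schema)` would take the place of the in-memory `BytesIO` object. The loop drops the extra `getparent()` check from the posted code, on the assumption that an element with a previous sibling also has a parent.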