[lxml-dev] Ask for help about lxml usage

qhlonline qhlonline at 163.com
Tue May 12 05:56:23 CEST 2009


Hi, all
   I am an experimental lxml user. From the FAQ at http://codespeak.net/lxml/FAQ.html I understand that lxml supports multi-threaded parsing without holding the GIL. I wrote a multi-threaded parsing program and ran it on an eight-core machine, but total CPU usage was only 180%, about 20% of each core. When I used libxml2 directly, it ran much faster and used more than 50% of each core.

   My goal is to parse HTML files on disk and extract specific tags together with their related data, such as attributes and text. I do not need to create, update, or delete a DOM tree, and I do not need XPath. How can I make my HTML parsing program run faster? I have used the SAX-like target parser to parse an HTML file, but the speed is not good enough. iterparse could not parse the HTML file either (I set the "html=True" parameter): the parser reported that my HTML file had a misplaced DOCTYPE declaration, even though the page was fetched from a popular website and does conform to the HTML standard.

   I now have many more HTML files to process, so I want to speed up parsing with multiple threads. My questions are: Does lxml release the GIL completely when parsing from memory and from files on disk? How can I make my multi-threaded program run faster on a multi-core machine? Can I change the lxml source to skip unneeded operations and improve my program's performance?
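
   For reference, the kind of setup described above can be sketched roughly as follows. This is only a minimal illustration, not my actual program: the names TagCollector, extract_tags and extract_many are made up for this example, and the default tag "a" and the worker count of 8 are assumptions. It uses lxml's HTMLParser with a parser target, so no tree is built, and a thread pool to process several files.

```python
from concurrent.futures import ThreadPoolExecutor

from lxml import etree


class TagCollector:
    """Hypothetical parser target: records wanted tags, attributes and text."""

    def __init__(self, wanted):
        self.wanted = set(wanted)
        self.found = []   # finished entries: (tag, attributes, text)
        self._open = []   # stack of wanted elements that are still open

    def start(self, tag, attrib):
        if tag in self.wanted:
            self._open.append([tag, dict(attrib), []])

    def data(self, text):
        # Text is credited to every wanted element that is currently open.
        for entry in self._open:
            entry[2].append(text)

    def end(self, tag):
        if self._open and self._open[-1][0] == tag:
            name, attrs, chunks = self._open.pop()
            self.found.append((name, attrs, "".join(chunks)))

    def close(self):
        # The value returned here is what parser.close() hands back.
        return self.found


def extract_tags(html_bytes, wanted=("a",)):
    """Parse one HTML document from memory without building a tree."""
    parser = etree.HTMLParser(target=TagCollector(wanted))
    parser.feed(html_bytes)
    return parser.close()


def extract_many(paths, wanted=("a",), workers=8):
    """Parse several files in a thread pool, one fresh parser per file."""
    def work(path):
        with open(path, "rb") as f:
            return extract_tags(f.read(), wanted)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, paths))
```

   One thing I am unsure about with this approach: the start/data/end callbacks are ordinary Python methods, so they need the GIL each time they run, which may be part of why the threads do not scale.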
                                            Thanks a lot.
                                            Yours sincerely
