[lxml-dev] Ask for help about lxml usage
Stefan Behnel
stefan_ml at behnel.de
Tue May 12 10:14:27 CEST 2009
Hi,
qhlonline wrote:
> Hi all, I am an experimental lxml user. From the site
> http://codespeak.net/lxml/FAQ.html I know that Python can support
> multi-threaded parsing without the GIL. I have tried to write a
> multi-threaded parsing program to run on an eight-core computer, but the
> total CPU usage was only 180%; only about 20% of each core was used.
I never tested lxml on a machine with more than two cores myself (simply
because I don't have one). But it always depends on your code how much
parallelism you get. The figures above lead me to assume that less than 20%
of the work is spent in parsing.
The comments below are rather generic, based on the information you provided.
If you want better suggestions, it would help to see a bit of your code to
understand which parts of lxml's API you are using, and how.
> But when I
> tried libxml2 directly, it ran much faster, and more than 50% of
> each CPU core was used.
That's because libxml2 (I assume you are referring to the Python bindings?)
does a lot less on each call, so if it frees the GIL, I doubt that there is
any case where it has to reclaim it during the parsing. Depending on your
code, there may be a reason lxml has to. Also, if you use lxml.html instead
of lxml.etree, you add another Python (i.e. GIL locked) layer on top of that.
> My goal is to parse an HTML file on disk to
> get specific HTML tags and their related data, like attributes and
> text. I will not use DOM tree creation, update, delete, or XPath
> operations. So how can my HTML parsing program run faster? I have used
> the target (SAX-like) parser to parse an HTML file, but the speed is not
> good enough.
That's because freeing the GIL in the target parser (I don't remember if
that's actually done or not) would induce too much overhead (one GIL
acquire-release cycle per element or text node!) and actually hurt the
performance.
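
For reference, here is a minimal sketch of what a parser target looks like.
The class name LinkCollector and the sample markup are made up for
illustration; only the start/end/data/close callback names come from lxml's
target interface. Every one of these callbacks is Python code, so the parser
needs the GIL each time it calls one of them.

```python
from lxml import etree


class LinkCollector:
    """Hypothetical parser target that records the href of every <a> tag."""

    def __init__(self):
        self.links = []

    def start(self, tag, attrib):
        # Called for every opening tag; runs in Python, so it needs the GIL.
        if tag == "a" and "href" in attrib:
            self.links.append(attrib["href"])

    def end(self, tag):
        pass  # closing tags are not needed for this extraction

    def data(self, data):
        pass  # text content is ignored here

    def close(self):
        # The return value of close() becomes the result of the parse.
        links, self.links = self.links, []
        return links


parser = etree.HTMLParser(target=LinkCollector())
parser.feed('<html><body><a href="/x">x</a><a href="/y">y</a></body></html>')
links = parser.close()
```

No tree is built here, which saves memory, but the per-callback cost is the
reason this interface does not parallelise well across threads.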
> iterparse() can't parse the HTML file either (I have set the
> "html=True" parameter).
Parsing HTML with iterparse() should work (so I'm interested in an example
that fails). But don't expect too much parallelism there. iterparse() is
not made for highly parallel parsing, as it calls into Python a lot. It
does not even free the GIL during parsing, only during file access.
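
A small sketch of the kind of iterparse() usage meant here, assuming an lxml
version whose iterparse() accepts html=True. The helper name iter_tags and
the default tag names are made up for illustration.

```python
from lxml import etree


def iter_tags(path, wanted=("a", "img")):
    """Yield (tag, attributes, text) for each wanted element in an HTML file.

    Hypothetical helper: the loop body runs in Python for every element,
    which is why iterparse() offers little thread-level parallelism.
    """
    for _event, element in etree.iterparse(path, events=("end",), html=True):
        if element.tag in wanted:
            yield element.tag, dict(element.attrib), (element.text or "").strip()
```

Passing the file name (rather than an open file object) lets the library do
the file access itself.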
> The parser said my HTML file had a misplaced
> DOCTYPE declaration, but this web page was fetched from a popular website
> and truly conforms to the HTML protocol.
I assume you validated it against the W3C validator?
I'm not sure, but at a quick glance at the code, it may be possible that
iterparse() doesn't set the "recover" option for HTML, so your results with
broken HTML may not be as expected.
> Now there are more HTML files
> to process, so I want to speed up the parsing with multi-threaded
> processing. My question is whether lxml frees the GIL completely when
> parsing from memory and from disk files?
It does that, yes. But note that you need to pass a filename or URL to get
maximum parallelism, not a file(-like) object, as that is read in Python space.
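
As a rough sketch of that advice: each worker thread hands a path string to
etree.parse(), so the file is read and parsed inside libxml2 rather than
through a Python file object. The function names count_links and parse_all
and the worker count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

from lxml import etree


def count_links(path):
    """Parse one HTML file by path and return the number of <a> elements."""
    parser = etree.HTMLParser()       # a fresh parser for each call
    tree = etree.parse(path, parser)  # path string, not an open file object
    return sum(1 for _ in tree.iter("a"))


def parse_all(paths, workers=8):
    """Parse the given files in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_links, paths))
```

Only plain integers leave the worker threads here, so no parsed tree is
shared between threads. Any work done on the tree after parsing still runs
under the GIL.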
> How can my multi-threaded program run faster on a multi-core computer?
The usual answer in Python is: use multiple processes instead of multiple
threads. If you have many files that are treated independently anyway,
using a process pool of eight processes should really get you close to 100%
parallel code. If the idea is to extract small amounts of data from HTML
pages and aggregate them somehow, I doubt that there is a way to beat
separate processes that write their output into a common
pipe/queue/database/whatever.
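
A minimal sketch of the process pool approach, using the standard library's
multiprocessing.Pool, which passes results back to the parent process for
you. The function names extract_links and extract_all are made up, and
extracting href values is only a stand-in for whatever data you actually
need.

```python
from multiprocessing import Pool

from lxml import etree


def extract_links(path):
    """Worker: parse one HTML file and return (path, list of href values)."""
    tree = etree.parse(path, etree.HTMLParser())
    return path, [a.get("href") for a in tree.iter("a") if a.get("href")]


def extract_all(paths, processes=8):
    """Spread the files over a pool of worker processes and collect results.

    Each process has its own interpreter and its own GIL, so the Python-level
    extraction code runs in parallel as well as the parsing.
    """
    with Pool(processes=processes) as pool:
        return dict(pool.imap_unordered(extract_links, paths))
```

Call extract_all() from under an "if __name__ == '__main__':" guard in your
script, since some platforms start worker processes by re-importing the main
module.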
This is actually an important thing to understand: Threads are good for
avoiding I/O latency when your problem is I/O bound. They are not good for
parallel computations, i.e. CPU bound tasks.
Stefan