[lxml-dev] Ask for help about lxml usage

Stefan Behnel stefan_ml at behnel.de
Wed May 13 08:32:56 CEST 2009


[forwarding this to the list]

qhlonline wrote:
> Thank you very much for your kind help. I have tried to trace the lxml
> source code to find out whether lxml is simply a Python wrapper
> interface for HTML/XML parsing. If so, once my program has crossed the
> C/Python interface (e.g. xmlDoc* htmlCtxtReadMemory in htmlparser.pxd),
> the remaining work should run in parallel on different CPU cores,
> because it is using libxml2. I had also naively thought that
> eliminating the DOM tree creation would save time, but it seems that
> this job is done by the SAX event processors of libxml2 if we are using
> the default parser.

It's not naive. The thing is just that this is a lot more efficient for
single-threaded programs than for multi-threaded ones. That's not an
obvious difference, and it's also not advertised in the docs. Most users
don't care too much about the exact high-performance characteristics, and
it's not like there's an obvious bottleneck in lxml.etree that you have to
warn people about. It's more the general "Python code is single-threaded"
kind of thing, and it all depends on how you use it, so you have to
benchmark your own code anyway.
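
A minimal sketch of such a benchmark, not taken from the original mail. It
times the same parsing job with different thread counts. The helper names
(count_elements, time_threads) and the sample document are made up for
illustration, and the fallback to the standard library's ElementTree is an
assumption added only so that the sketch runs where lxml is not installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which offers the same
    # fromstring()/iter() calls used below.
    import xml.etree.ElementTree as etree


def count_elements(data):
    """Parse one XML document (bytes) and count the nodes in its tree."""
    root = etree.fromstring(data)
    return sum(1 for _ in root.iter())


def time_threads(docs, workers):
    """Parse all docs with a pool of `workers` threads.

    Returns (elapsed_seconds, list_of_counts).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_elements, docs))
    return time.perf_counter() - start, counts


if __name__ == "__main__":
    # Hypothetical workload: 200 copies of a small generated document.
    doc = b"<root>" + b"<item>text</item>" * 1000 + b"</root>"
    docs = [doc] * 200
    for workers in (1, 2, 4):
        elapsed, _ = time_threads(docs, workers)
        print(f"{workers} thread(s): {elapsed:.3f} s")
```

The timings depend on the machine, the documents and how much of the work
happens in Python code, so the numbers only mean something for your own
input data.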


> I don't know the Cython language, and some of the new types and
> declarations of variables and functions in the .pxi and .pxd files
> give me a headache.

You can ignore most of the little stuff that you don't understand. The
main idea is just that you can switch freely between C and Python in the
code, so depending on how fluent you are in both, some code sections may
be less obvious than others.


> But my instructor said that since the multi-threaded parsing program
> with lxml saves time on a two-core CPU machine (that is true, it runs
> nearly 20% faster with 2 threads on a two-core machine), it should run
> even better with more threads on a better machine. Now, with your
> help, I know that a multi-process program may do better. Thank you
> again!

Usually a lot faster, as you avoid any implicit concurrency issues. In
multi-process programs, synchronisation only happens where you explicitly
do it. Things are a lot more implicit and subtle in multi-threading.
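
A minimal sketch of the multi-process variant, not taken from the original
mail. Each worker process parses its documents independently, and only the
input bytes and the integer results cross the process boundary, which is
the one place where data is exchanged explicitly. The helper names
(count_elements, count_in_processes) are made up for illustration, and the
fallback to the standard library's ElementTree is an assumption added only
so that the sketch runs where lxml is not installed.

```python
import multiprocessing

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which offers the same
    # fromstring()/iter() calls used below.
    import xml.etree.ElementTree as etree


def count_elements(data):
    """Parse one XML document (bytes) and count the nodes in its tree."""
    root = etree.fromstring(data)
    return sum(1 for _ in root.iter())


def count_in_processes(docs, workers=2):
    """Parse all docs in a pool of `workers` separate processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        # The documents are sent to the workers and the counts are sent
        # back; the parsed trees never leave the worker that built them.
        return pool.map(count_elements, docs)


if __name__ == "__main__":
    # Hypothetical workload: 8 copies of a small generated document.
    doc = b"<root>" + b"<item>text</item>" * 1000 + b"</root>"
    print(count_in_processes([doc] * 8, workers=2))
```

Returning small results such as counts or extracted strings keeps the
transfer cost between processes low. Sending whole parsed trees back
would require serialising them again.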

I'll comment on your program in your other mail. Please keep the list
involved next time.

Stefan


