[lxml-dev] Some benchmarks
Mike Meyer
mwm-keyword-lxml.9112b8 at mired.org
Mon Mar 10 22:02:48 CET 2008
On Mon, 10 Mar 2008 21:15:58 +0100 Stefan Behnel <stefan_ml at behnel.de> wrote:
> > I also tried allocating random strings until the size increased, to see
> > if there was lots of allocated but free memory (the unused amount is an
> > estimate, as I'm unsure what the exact internal representation of a list
> > of strings is). The results were peculiar:
> >
> >                    VSZ      RSS (used)
> > lxml          :  26952 /  26211 (unused: 5)
> > bs            :  83408 /  82156 (unused: 0)
> > html5_cet     :  55640 /  54745 (unused: 19)
> > html5_et      :  65712 /  64946 (unused: 14)
> > html5_lxml    :  50072 /  48986 (unused: 134)
> > html5_minidom : 195372 / 192914 (unused: 14)
> > html5_simple  :  99772 /  97999 (unused: 17)
> > lxml_bs       : 104644 /  73037 (unused: 31783)
> > htmlparser    :   4448 /   4433 (unused: 19)
> >
> > I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots
> > of free memory left over at the end. I am surprised that the others
> > don't, at least html5_lxml should be similar I'd think (though I guess
> > if you take into account the unused memory then html5_lxml and lxml_bs
> > are similar).
>
> That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use
> Python's memory management, so memory that is freed by the parser is really
> freed to the OS, not just left as a growing interpreter heap.
Not necessarily. libxml2 uses the C library's malloc/free.
Historically, on Unix systems the C library's free doesn't return
memory to the OS, but keeps it in an internal heap for later reuse.
Systems that are not Unix tend to do otherwise, which creates some
confusion for people moving from those systems to Unix.
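
A rough way to check what a particular system does is to call the C
library's malloc/free directly and watch the resident size of the
process. The sketch below is only an illustration, not how the numbers
above were measured. It assumes a POSIX system where ctypes can load
the C library, and it reads /proc/self/statm, which is Linux-specific.
The names rss_kib and malloc_free_rss are made up for this example.
Whether the last number drops back depends on the allocator and its
trim settings, so no particular result is implied.

```python
import ctypes
import ctypes.util
import os
import resource

# Load the C library so that malloc/free bypass Python's own allocator.
# find_library may return None; CDLL(None) then uses the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]


def rss_kib():
    """Return the resident set size of this process in KiB (Linux)."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") // 1024
    except OSError:
        # Fallback without /proc: this is the *peak* RSS, so it never
        # drops, and its unit is platform-dependent (KiB on Linux).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def malloc_free_rss(blocks=50_000, size=1024):
    """Allocate, touch, and free many small blocks; report RSS each time."""
    before = rss_kib()
    ptrs = [libc.malloc(size) for _ in range(blocks)]
    for p in ptrs:
        if p:  # skip NULL in the unlikely case malloc failed
            ctypes.memset(p, 1, size)  # touch the pages so they become resident
    allocated = rss_kib()
    for p in ptrs:
        libc.free(p)  # free(NULL) is a no-op
    after_free = rss_kib()
    return before, allocated, after_free


if __name__ == "__main__":
    print("RSS KiB before / allocated / after free: %d / %d / %d"
          % malloc_free_rss())
```

If the third number stays close to the second, the freed memory is
still held in the process heap rather than handed back to the OS.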
> > I haven't looked hard (yet, at least). The example on the ElementSoup
> > page parses *slightly* better with BS, but lxml parses it very similarly
> > to how html5lib parses it, which I'd consider the better standard.
> > html5lib has the advantage of being a kind of standard.
> >
> > If I had a good collection of crappy HTML, that would probably be an
> > interesting test to see how differently html5lib, BS, and lxml parse it.
> > I'm not sure where to find a good collection like that. Maybe
> > html5lib's tests, I guess.
>
> There seem to be a fair amount of HTML browser compliance test suites on the
> web, but I didn't find any test suites for broken HTML at a first glance.
I think Google has a nice collection of broken HTML :-).
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.