[lxml-dev] Some benchmarks

Stefan Behnel stefan_ml at behnel.de
Tue Mar 11 08:41:09 CET 2008


Hi,

Mike Meyer wrote:
> Not necessarily. libxml2 uses the C library's
> free/malloc. Historically, on Unix systems the C library's free/malloc
> don't return the memory to the OS, but keep it in an internal
> heap. Systems that are not Unix tend to do otherwise, creating some
> confusion for people moving from those systems to Unix.

I tend to consider libc a part of the OS. But technically you are right and it
even makes a difference here.
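
For anyone who wants to watch this effect on their own machine, here is a
rough sketch rather than a benchmark. It assumes a Linux-style
/proc/self/statm (it returns None elsewhere), and the names resident_kib()
and measure() are made up for illustration. Note that CPython puts its own
small-object allocator on top of malloc, so the numbers reflect both layers,
and what you see will depend on the libc and its version.

```python
import os


def resident_kib():
    """Return the current resident set size in KiB, or None if unavailable.

    Assumption: a Linux-style /proc/self/statm whose second field is the
    number of resident pages.
    """
    try:
        with open("/proc/self/statm") as f:
            pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return pages * os.sysconf("SC_PAGE_SIZE") // 1024


def measure(n_objects=200000):
    """Allocate many small objects, free them, and sample memory three times."""
    before = resident_kib()
    data = [str(i) * 10 for i in range(n_objects)]  # many small allocations
    peak = resident_kib()
    del data  # drop the last reference; the allocator decides what happens next
    after = resident_kib()
    return before, peak, after


if __name__ == "__main__":
    print("before / peak / after (KiB):", measure())
```

If "after" stays close to "peak", the freed memory is still held by the
process rather than handed back to the OS.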


>>> I haven't looked hard (yet, at least).  The example on the ElementSoup
>>> page parses *slightly* better with BS, but lxml parses it very similarly
>>> to how html5lib parses it, which I'd consider the better standard.
>>> html5lib has the advantage of being a kind of standard.
>>>
>>> If I had a good collection of crappy HTML, that would probably be an
>>> interesting test to see how differently html5lib, BS, and lxml parse it.
>>>  I'm not sure where to find a good collection like that.  Maybe
>>> html5lib's tests, I guess.
>> There seem to be a fair number of HTML browser compliance test suites on the
>> web, but I didn't find any test suites for broken HTML at first glance.
> 
> I think Google has a nice collection of broken HTML  :-).

Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just
don't know how to write a Google search query for broken HTML pages... :)

Anyway, I'm not sure they actually keep the broken HTML pages around. I would
expect them to send them through a sanitizer before doing anything else with
them (including local caching).

Stefan

