[lxml-dev] Some benchmarks
Stefan Behnel
stefan_ml at behnel.de
Tue Mar 11 08:41:09 CET 2008
Hi,
Mike Meyer wrote:
> Not necessarily. libxml2 uses the C library's
> free/malloc. Historically, on Unix systems the C library's free/malloc
> don't return the memory to the OS, but keep it in an internal
> heap. Systems that are Not Unix tend to do otherwise, creating some
> confusion for people moving from those systems to Unix.
I tend to consider libc a part of the OS. But technically you are right and it
even makes a difference here.
>>> I haven't looked hard (yet, at least). The example on the ElementSoup
>>> page parses *slightly* better with BS, but lxml parses it very similarly
>>> to how html5lib parses it, which I'd consider the better standard.
>>> html5lib has the advantage of being a kind of standard.
>>>
>>> If I had a good collection of crappy HTML, that would probably be an
>>> interesting test to see how differently html5lib, BS, and lxml parse it.
>>> I'm not sure where to find a good collection like that. Maybe
>>> html5lib's tests, I guess.
>> There seem to be a fair number of HTML browser compliance test suites on the
>> web, but I didn't find any test suites for broken HTML at first glance.
>
> I think google has a nice collection of broken html :-).
Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just
don't know how to write a Google search query for broken HTML pages... :)
Anyway, I'm not sure they actually keep the broken HTML pages around. I would
expect them to send them through a sanitizer before doing anything else with
them (including local caching).
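
For what it's worth, the comparison suggested above could be sketched roughly
like this. This is only an illustration under a few assumptions: the helper
names (stdlib_outline, compare, disagreements, PARSERS) are made up for this
example and are not part of lxml, and the standard library's html.parser
merely stands in for html5lib and BS so that the sketch runs without extra
dependencies. lxml is plugged in only if it happens to be installed.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Collect the names of the start tags in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def stdlib_outline(markup):
    # html.parser is only a tokenizer: it reports tags as written
    # and does not repair the document structure.
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.tags


def compare(markup, parsers):
    """Run every parser in `parsers` (name -> callable) on `markup`."""
    results = {}
    for name, parse in parsers.items():
        try:
            results[name] = parse(markup)
        except Exception as exc:  # a parser giving up is a result, too
            results[name] = "failed: %r" % (exc,)
    return results


def disagreements(results):
    """Group parser names by the outline they produced."""
    groups = {}
    for name, outline in results.items():
        groups.setdefault(repr(outline), []).append(name)
    return groups


# Hypothetical registry; further parsers would be added the same way.
PARSERS = {"html.parser": stdlib_outline}

try:
    import lxml.html

    def lxml_outline(markup):
        root = lxml.html.fromstring(markup)
        # Skip comments and processing instructions, whose .tag is not a string.
        return [el.tag for el in root.iter() if isinstance(el.tag, str)]

    PARSERS["lxml"] = lxml_outline
except ImportError:
    pass  # lxml not installed; compare whatever is available


if __name__ == "__main__":
    broken = "<p>one<b>two</p><p>three"
    for outline, names in disagreements(compare(broken, PARSERS)).items():
        print(names, outline)
```

Comparing only the order of tag names is a deliberately crude measure. It
ignores nesting, attributes and text, so two parsers can agree here and still
build different trees.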
Stefan