[lxml-dev] Proposal: Better html5lib Support

Stefan Behnel stefan_ml at behnel.de
Sat Jul 12 21:54:26 CEST 2008


Hi,

Armin Ronacher wrote:
> Lately I've been working a lot with html5lib, which has a tree builder that
> can generate an lxml tree, which is awesome :-)

:)


> There are, however, a few inconveniences in html5lib's lxml support, mostly
> because the html5lib API is quite complex to use. I've seen that there is
> Beautiful Soup parser support in html5lib, so why not move the html5lib
> tree builder into an lxml.html.html5 module or so that provides the same
> API as lxml.html (that is, `fragment_fromstring`, `document_fromstring`,
> etc.)?

I do not use html5lib myself, but I'll happily take patches if you can make
it more convenient to use.
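
For illustration, a minimal sketch of what such a wrapper could look like. It
assumes html5lib's top-level `parse()` and `parseFragment()` functions with
their "lxml" tree builder; the two function names are simply borrowed from
lxml.html, and the module they would live in is hypothetical. Note that the
elements returned are plain lxml elements, not lxml.html ones.

```python
def document_fromstring(html):
    """Parse a complete HTML document with html5lib; return the lxml root."""
    # Imported lazily: html5lib (and lxml, which its "lxml" tree builder
    # needs) are optional third-party dependencies.
    import html5lib

    # With treebuilder="lxml", parse() returns an lxml ElementTree.
    # namespaceHTMLElements=False gives plain tags such as "html" and "body"
    # instead of tags in the XHTML namespace.
    tree = html5lib.parse(html, treebuilder="lxml",
                          namespaceHTMLElements=False)
    return tree.getroot()


def fragments_fromstring(html):
    """Parse an HTML fragment; return a list of elements and text strings."""
    import html5lib

    # With the lxml tree builder, parseFragment() returns a list holding the
    # top-level elements, plus leading or trailing text as plain strings.
    return html5lib.parseFragment(html, treebuilder="lxml",
                                  namespaceHTMLElements=False)
```

Usage would then mirror lxml.html, e.g.
`document_fromstring("<!DOCTYPE html><p>hi").findtext("body/p")`.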


> There is another small problem with html5lib and lxml interoperability:
> the HTML5 doctype ("<!DOCTYPE HTML>"), which lxml naturally cannot handle.

Does the "cannot handle" result in any visible problems?


> I know that lxml is an XML library after all, but maybe support for this
> special doctype could be added.

This is something that is handled at the level of libxml2 and the system-wide
catalogs. Check the catalogs on your system to see if there is anything that
resembles that doctype. Maybe it can be added.
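
To see what lxml currently makes of the doctype, something like the following
sketch may help. It only relies on lxml.html's `document_fromstring()` and the
`docinfo` property of the resulting tree; the helper name `describe_doctype`
is made up for this example, and what it reports will depend on the libxml2
version installed.

```python
def describe_doctype(html):
    """Return what lxml reports about the doctype of an HTML string."""
    import lxml.html  # third-party dependency, imported lazily

    root = lxml.html.document_fromstring(html)
    docinfo = root.getroottree().docinfo
    return {
        "doctype": docinfo.doctype,        # serialised doctype, "" if none
        "public_id": docinfo.public_id,    # None if the doctype has none
        "system_url": docinfo.system_url,  # None if the doctype has none
    }
```

Calling `describe_doctype("<!DOCTYPE HTML><html><body></body></html>")` and
comparing the result with that of an HTML 4 document shows whether the short
doctype survives parsing.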

Stefan


