[lxml-dev] Proposal: Better html5lib Support
Armin Ronacher
armin.ronacher at active-4.com
Sat Jul 12 18:47:12 CEST 2008
Hi,
I've been working a lot with html5lib lately. It has a tree builder that
can generate an lxml tree, which is awesome :-)
There are, however, a few inconveniences in html5lib's lxml support, mostly
because the html5lib API is quite complex to use. I've seen that there is
BeautifulSoup parser support in html5lib, so why not move the html5lib tree
builder into an lxml.html.html5 module or so that provides the same API as
the html module (that is, `fragment_fromstring`, `document_fromstring`,
etc.)?
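To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. It assumes a recent html5lib with the lxml tree builder available. The function names follow the ones mentioned above, but the bodies, the helper `_make_parser` and `fragments_fromstring` are illustrative guesses, not an existing lxml module.

```python
import html5lib
from html5lib import treebuilders


def _make_parser():
    # Hypothetical helper: build an html5lib parser that emits lxml nodes.
    # namespaceHTMLElements=False keeps tag names plain ("p" rather than
    # "{http://www.w3.org/1999/xhtml}p").
    return html5lib.HTMLParser(
        tree=treebuilders.getTreeBuilder("lxml"),
        namespaceHTMLElements=False,
    )


def document_fromstring(html):
    """Parse a complete document and return its root <html> element."""
    result = _make_parser().parse(html)
    # The builder may hand back an ElementTree or the root element itself,
    # so accept both.
    return result.getroot() if hasattr(result, "getroot") else result


def fragments_fromstring(html):
    """Parse a fragment; returns a list of elements and text strings."""
    return _make_parser().parseFragment(html)


def fragment_fromstring(html):
    """Parse a fragment that must contain exactly one element."""
    elements = [c for c in fragments_fromstring(html)
                if not isinstance(c, str)]
    if len(elements) != 1:
        raise ValueError("expected exactly one element in the fragment")
    return elements[0]
```

With something like this, parsing a document is a single call, for example `document_fromstring("<p>Hello</p>")`, with no tree builder setup on the caller's side.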
html5lib is currently the most advanced HTML parsing module for Python I
know about and it is able to deal with most HTML the same way popular
browsers do.
There is another small problem with html5lib and lxml interoperability:
the HTML5 doctype ("<!DOCTYPE HTML>"), which lxml naturally cannot handle.
I know that lxml is an XML library after all, but maybe support for this
special doctype could be added.
Regards,
Armin