[lxml-dev] Proposal: Better html5lib Support

Armin Ronacher armin.ronacher at active-4.com
Sat Jul 12 23:14:50 CEST 2008


Stefan Behnel <stefan_ml <at> behnel.de> writes:

> > There are however a few inconveniences in the html5lib lxml support.  Mostly
> > because the html5lib API is quite complex to use and I've
> > seen that there is a beautiful soup parser support in html5lib, so why not
> > move the html5lib tree builder into an lxml.html.html5 module or so that
> > provides the same API as the html (that is `fragment_fromstring`,
> > `document_fromstring`, etc.)
> 
> I do not use html5lib myself, but I'm happily taking patches if you can fix it
> up in a more convenient way.

I'll happily create a patch :-)

> > There is another small problem with html5lib and lxml interoperability that
> > is the HTML5 doctype ("<!DOCTYPE HTML>") that lxml naturally cannot handle.
> 
> Does the "cannot handle" result in any visible problems?

This document::

    <!doctype html>
    <title>foo</title>
    <p>blah

comes out as follows when serialized with lxml.etree.tostring::

    <!DOCTYPE html PUBLIC "" "">
    ...

Not a big deal, as I'm not writing out data as a whole document, and if I
did, I would write it as HTML4.  I think the HTML5 doctype is not a valid XML
doctype, but HTML5 as a serialization format is not really XML either.  For
HTML5 serialization one would have to use the html5lib serializer anyway, and
that could add a workaround for lxml.
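
To make the workaround idea concrete, here is a rough sketch of what such a
post-processing step might look like.  The helper name `with_html5_doctype` is
made up for illustration and is not part of lxml or html5lib; it only assumes
that the serialized document is available as a string and that the doctype, if
present, is the first thing in it.

```python
import re

# Hypothetical helper, not an lxml or html5lib API.
# Matches one leading doctype declaration plus surrounding whitespace.
# Assumption: the doctype contains no ">" inside quoted strings and no
# internal subset, which holds for the mangled output shown above.
_DOCTYPE_RE = re.compile(r"\A\s*<!DOCTYPE[^>]*>\s*", re.IGNORECASE)

HTML5_DOCTYPE = "<!DOCTYPE html>"


def with_html5_doctype(serialized):
    """Return `serialized` with any leading doctype replaced by the HTML5 one."""
    # Drop the existing doctype (if any), then prepend the HTML5 doctype.
    body = _DOCTYPE_RE.sub("", serialized, count=1)
    return HTML5_DOCTYPE + "\n" + body


if __name__ == "__main__":
    mangled = '<!DOCTYPE html PUBLIC "" "">\n<html><head><title>foo</title></head></html>'
    print(with_html5_doctype(mangled))
```

The result of lxml.etree.tostring, decoded to a string, could be passed
through a function like this.  A serializer-level fix would be cleaner than
string patching, so treat this as a stopgap only.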


Regards,
Armin


