[lxml-dev] Proposal: Better html5lib Support
Armin Ronacher
armin.ronacher at active-4.com
Sat Jul 12 23:14:50 CEST 2008
Stefan Behnel <stefan_ml <at> behnel.de> writes:
> > There are, however, a few inconveniences in the html5lib lxml support, mostly
> > because the html5lib API is quite complex to use. I've seen that there is
> > Beautiful Soup parser support in html5lib, so why not move the html5lib tree
> > builder into an lxml.html.html5 module or so that provides the same API as
> > lxml.html (that is, `fragment_fromstring`, `document_fromstring`, etc.)?
>
> I do not use html5lib myself, but I'm happily taking patches if you can fix it
> up in a more convenient way.
I'll happily create a patch :-)
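To make the idea concrete, here is a rough sketch of what such a wrapper could look like. It assumes html5lib is installed together with its "lxml" tree builder; the function names mirror lxml.html, but the bodies are only illustrative and not the actual patch:

```python
import html5lib


def document_fromstring(html):
    """Parse a complete document and return its root <html> element."""
    # treebuilder="lxml" makes html5lib build an lxml tree directly;
    # namespaceHTMLElements=False keeps plain tag names such as "p".
    tree = html5lib.parse(html, treebuilder="lxml", namespaceHTMLElements=False)
    return tree.getroot()


def fragments_fromstring(html):
    """Parse a fragment into a list of elements (and possibly text strings)."""
    return html5lib.parseFragment(
        html, treebuilder="lxml", namespaceHTMLElements=False
    )


def fragment_fromstring(html):
    """Parse a fragment that must consist of exactly one element."""
    children = fragments_fromstring(html)
    if len(children) != 1 or isinstance(children[0], str):
        raise ValueError("expected exactly one element in the fragment")
    return children[0]


root = document_fromstring("<!doctype html>\n<title>foo</title>\n<p>blah")
print(root.tag, root.findtext("head/title"))
```

The point of the thin wrapper is that callers never have to touch the html5lib parser or tree builder classes themselves.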
> > There is another small problem with html5lib and lxml interoperability:
> > the HTML5 doctype ("<!DOCTYPE HTML>"), which lxml naturally cannot handle.
>
> Does the "cannot handle" result in any visible problems?
This document::

    <!doctype html>
    <title>foo</title>
    <p>blah

Comes out as (lxml.etree.tostring)::

    <!DOCTYPE html PUBLIC "" "">
    ...
Not a big deal, as I'm not writing out data as a whole document, and if I did,
I would write it as HTML4. I think the HTML5 doctype is not a valid XML doctype,
but HTML5 as a serialization format is not really XML either. For HTML5
serialization one would have to use the html5lib serializer anyway, and that
could add a workaround for lxml.
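For illustration, a minimal sketch of a workaround on the lxml side rather than in the html5lib serializer. It assumes an lxml release whose `tostring` accepts a `doctype` keyword; the helper name `tostring_html5` is made up for this example:

```python
from lxml import etree

HTML5_DOCTYPE = "<!DOCTYPE html>"


def tostring_html5(root):
    """Serialize an element as HTML with the short HTML5 doctype in front."""
    # An explicit doctype= replaces whatever doctype the tree itself carries.
    return etree.tostring(
        root, method="html", doctype=HTML5_DOCTYPE, encoding="unicode"
    )


root = etree.fromstring(
    "<html><head><title>foo</title></head><body><p>blah</p></body></html>"
)
print(tostring_html5(root))
```

This only affects the serialized output; the parsed tree and its stored doctype are left untouched.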
Regards,
Armin