[lxml-dev] Proposal: Better html5lib Support

Stefan Behnel stefan_ml at behnel.de
Sun Jul 13 06:57:05 CEST 2008


Hi,

Armin Ronacher wrote:
> Stefan Behnel <stefan_ml <at> behnel.de> writes:
>>> There is another small problem with html5lib and lxml interoperability that
>>> is the HTML5 doctype ("<!DOCTYPE HTML>") that lxml naturally cannot handle.
>> Does the "cannot handle" result in any visible problems?
> This document::
> 
>     <!doctype html>
>     <title>foo</title>
>     <p>blah
> 
> Comes out as (lxml.etree.tostring)::
> 
>     <!DOCTYPE html PUBLIC "" "">
>     ...

We are actually serialising the DOCTYPE ourselves. Try this patch.

I'm not sure whether <!DOCTYPE html> is actually allowed in SGML; I haven't
found anything on that so far. If it isn't, I'll have to see whether I can
restrict the impact of the patch to this specific case.
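
To illustrate the idea only: the following is a plain-Python sketch, not the
attached patch and not lxml's actual serialiser code. The helper name
serialize_doctype and its arguments are hypothetical. It assumes the fix
amounts to leaving out the PUBLIC/SYSTEM parts when both identifiers are
empty, so that the short HTML5 form survives a round trip.

```python
def serialize_doctype(name, public_id=None, system_id=None):
    """Build a DOCTYPE declaration, omitting identifiers that are empty.

    Hypothetical helper for illustration; identifiers containing double
    quotes are not handled here.
    """
    if not name:
        # No root name recorded, so there is nothing to declare.
        return ""
    parts = ["<!DOCTYPE ", name]
    if public_id:
        parts.append(' PUBLIC "%s"' % public_id)
        if system_id:
            parts.append(' "%s"' % system_id)
    elif system_id:
        parts.append(' SYSTEM "%s"' % system_id)
    # With both identifiers empty we fall through to the bare form.
    parts.append(">")
    return "".join(parts)


if __name__ == "__main__":
    # Empty identifiers, as produced when parsing <!doctype html>.
    print(serialize_doctype("html", "", ""))
    # A classic HTML 4.01 doctype keeps both identifiers.
    print(serialize_doctype(
        "html",
        "-//W3C//DTD HTML 4.01//EN",
        "http://www.w3.org/TR/html4/strict.dtd",
    ))
```

Treating empty strings and None the same way is a deliberate choice in this
sketch: the output quoted above suggests that the identifiers arrive as empty
strings rather than as missing values.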

Note that you will need Cython 0.9.8 installed to build a patched lxml.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: html5-doctype.patch
Type: text/x-patch
Size: 1781 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080713/1f011c61/attachment.bin 

