[lxml-dev] Proposal: Better html5lib Support

Stefan Behnel stefan_ml at behnel.de
Mon Jul 14 21:30:05 CEST 2008


Hi,

Armin Ronacher wrote:
> Stefan Behnel <stefan_ml <at> behnel.de> writes:
> 
>> I do not use html5lib myself, but I'm happily taking patches if you can fix it
>> up in a more convenient way.
> I created a patch now: http://paste.pocoo.org/show/79376/

Thanks!


> That however has two disadvantages.  For one it extends the lxml etree builder
> in a pretty ugly way but that could probably be improved,

I'll take a look at it as soon as I find the time.


> and it also creates
> etree.Comment objects and not lxml.html.HtmlComment objects.  The same problem
> exists with the soupparser, mainly because there is no way to generate
> HtmlComment objects without creating a segfault.

Yes. This isn't really a bug, though: you should create comments with the
Comment factory, not by instantiating the _Comment or HtmlComment classes
directly. It does seem to be a common misconception, especially among new
users. This behaviour will change in lxml 2.2, where calling an Element class
will itself create a new Element.


> (The only way is to use html.fromstring
> with the comment there, but that's an ugly hack).

Using the etree.Comment() factory is just fine and will do the right thing.
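
A minimal sketch of what I mean, assuming a plain etree tree; the element
name and the comment text are only placeholders for illustration:

```python
from lxml import etree

# Build a small tree by hand (placeholder element name).
root = etree.Element("div")

# Create the comment through the factory function rather than by
# instantiating the _Comment or HtmlComment classes directly.
comment = etree.Comment(" generated by html5lib ")
root.append(comment)

# Comment nodes carry the factory itself as their tag, which is the
# usual way to tell them apart from ordinary elements.
is_comment = comment.tag is etree.Comment

serialized = etree.tostring(root)
```

A tree builder can use the same call wherever it needs to emit a comment
node.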

Stefan

