[lxml-dev] lxml and html encodings

Stefan Behnel behnel_ml at gkec.informatik.tu-darmstadt.de
Thu Oct 19 08:59:17 CEST 2006


Hi,

Luke Tucker wrote:
> I could be confused, but I think the issue chris is referring 
> to here might be clouded by the bad HTML in the original 
> message.

Sure, that's why I was referring him to the libxml2 mailing list.


> Here's some behavior that, to me, doesn't appear to 
> match up entirely with the FAQ (as far as where errors are 
> produced) using fixed up HTML. 
> 
> >>> html = open('home2.html').read()
> >>> unicode = html.decode('Shift_JIS')
> >>> from lxml import etree
> >>> rh = etree.HTML(html)
> >>> uh = etree.HTML(unicode)
> >>> rh[0][1].text
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "etree.pyx", line 859, in etree._Element.text.__get__
>   File "apihelpers.pxi", line 291, in etree._collectText
>   File "apihelpers.pxi", line 552, in etree.funicode
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: unexpected code byte
> >>> uh[0][1].text
> u'\u30b3\u30df'
> 
> It looked to me like uh = etree.HTML(unicode) in this case should 
> produce errors (since it is unicode and contains a proper meta 
> charset entry) and that rh should behave normally. Apologies if I'm
> simply confusing the issue further :) 

Sorry, but your HTML is very broken, too. It has two <html> tags and two
contradictory <meta> tags (one declaring "us-ascii", the other "shift_jis"), so
don't expect libxml2's HTML parser to magically know what you really meant when
you wrote it. That's like saying: OK, I know this function only works for values
from 1 to 5, so I'll pass in 99 and complain when it breaks.
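
If you want to check a document for this kind of contradiction before handing
it to the parser, something along these lines might help. It is only a rough
sketch using Python 3's standard html.parser module, not anything lxml or
libxml2 provides. The names DeclarationAudit and audit, and the sample markup,
are made up for illustration; the sample merely imitates the problem described
above and is not the actual home2.html.

```python
from html.parser import HTMLParser


class DeclarationAudit(HTMLParser):
    """Count <html> start tags and collect charset names declared in <meta>."""

    def __init__(self):
        super().__init__()
        self.html_tags = 0
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "html":
            self.html_tags += 1
        elif tag == "meta":
            if attrs.get("charset"):
                # Short form: <meta charset="...">
                self.charsets.append(attrs["charset"].strip().lower())
            elif (attrs.get("http-equiv") or "").lower() == "content-type":
                # Long form: <meta http-equiv="Content-Type" content="...; charset=...">
                content = (attrs.get("content") or "").lower()
                if "charset=" in content:
                    name = content.split("charset=", 1)[1].split(";")[0].strip()
                    self.charsets.append(name)


def audit(markup):
    """Return (number of <html> start tags, list of declared charsets)."""
    parser = DeclarationAudit()
    parser.feed(markup)
    parser.close()
    return parser.html_tags, parser.charsets


# Hypothetical markup with the two defects named above.
sample = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">'
    '</head>'
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
    '</head><body></body></html>'
)

html_tags, charsets = audit(sample)
print(html_tags, charsets)
```

More than one <html> tag, or more than one distinct charset in the list, means
the document should be fixed before you expect any parser to guess the encoding.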

If you parse broken HTML and the parser doesn't handle it correctly, the
reason is your broken HTML, really.

If you think libxml2 should be able to parse this kind of non-HTML, please
file a bug on the libxml2 parser. There is nothing lxml can do about it.
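
For what it's worth, the byte 0x83 in the traceback is what you would expect to
see if Shift_JIS bytes end up being decoded as UTF-8. The snippet below is a
small sketch in plain Python 3 (no lxml involved) that reproduces only that
symptom, using the two characters from the quoted session. It does not claim to
show what libxml2 does internally.

```python
# The text that the unicode-based parse returned in the quoted session.
text = u'\u30b3\u30df'

# Encode it the way a Shift_JIS file on disk would store it.
raw = text.encode('shift_jis')

# Decoding with the matching codec round-trips cleanly.
roundtrip = raw.decode('shift_jis')

# Decoding the same bytes as UTF-8 fails on the very first byte.
try:
    raw.decode('utf-8')
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte; indexing bytes gives an int.
    failure = (exc.start, raw[exc.start])

print(repr(raw), roundtrip == text, failure)
```

The first byte of the Shift_JIS encoding is 0x83, which is not a valid start
byte in UTF-8, hence an error at position 0.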

Stefan

