[lxml-dev] lxml and html encodings

Luke Tucker ltucker at openplans.org
Wed Oct 18 17:00:33 CEST 2006


Hey,

I could be confused, but I think the issue Chris is referring 
to here might be clouded by the bad HTML in the original 
message. Here's some behavior, using fixed-up HTML, that to me 
doesn't appear to match up entirely with the FAQ (as far as 
where errors are produced). 

>>> html = open('home2.html').read()
>>> unicode = html.decode('Shift_JIS')
>>> from lxml import etree
>>> rh = etree.HTML(html)
>>> uh = etree.HTML(unicode)
>>> rh[0][1].text
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 859, in etree._Element.text.__get__
  File "apihelpers.pxi", line 291, in etree._collectText
  File "apihelpers.pxi", line 552, in etree.funicode
UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: unexpected code byte
>>> uh[0][1].text
u'\u30b3\u30df'

Going by the FAQ, I expected uh = etree.HTML(unicode) to produce 
errors in this case (since the input is unicode and contains a 
proper meta charset entry) and rh to behave normally. Instead, it 
is rh that fails and uh that returns the expected text. Apologies 
if I'm simply confusing the issue further :) 
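
A minimal sketch of what the traceback seems to show, using only the
standard library and no lxml at all. It assumes (this is a guess from
the error message, not something confirmed in this thread) that the
text node of rh still holds raw Shift_JIS bytes, which are then decoded
as UTF-8. The variable names are made up for illustration; the two
characters are the ones returned by uh[0][1].text above.

```python
# Hypothetical reproduction of the decode failure, independent of lxml.
# Assumption: the failing text node contains the Shift_JIS bytes of the
# same two katakana characters that uh[0][1].text returned.
text = "\u30b3\u30df"

# Encode the characters the way the original file stored them.
raw = text.encode("shift_jis")

# Decoding those bytes as UTF-8 fails on the very first byte (0x83),
# which matches "byte 0x83 in position 0" in the traceback.
try:
    raw.decode("utf-8")
    decoded_ok = True
    bad_pos = None
except UnicodeDecodeError as exc:
    decoded_ok = False
    bad_pos = exc.start

# Decoding with the encoding the document actually uses works fine.
round_trip = raw.decode("shift_jis")
```

If that reading is right, the bytes were never converted from Shift_JIS
in the rh case, which would fit the meta charset entry not being picked
up for the byte string input.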

- Luke 


On Wed, 2006-10-18 at 08:51 +0200, Stefan Behnel wrote:
> Hi Chris,
> 
> Chris Abraham wrote:
> > Thanks for this. Who should I contact to get the FAQ updated?
> > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
> 
> Well, the FAQ isn't really wrong in what it says. In your case, the encoding
> information is simply not taken into account as it is in a totally wrong
> position. So it's more like the document did not contain any encoding
> information at all.
> 
> Note that the HTML parser is not guaranteed to create correct HTML that is
> 'equivalent' to the broken HTML. It just tries its best, which may mean that
> some of the original content may get lost. And in this case, it's meta data
> that gets lost.
> 
> Stefan