[lxml-dev] lxml and html encodings
Luke Tucker
ltucker at openplans.org
Wed Oct 18 17:00:33 CEST 2006
Hey,
I could be confused, but I think the issue Chris is referring
to here might be clouded by the bad HTML in the original
message. Using fixed-up HTML, here's some behavior that, to me,
doesn't appear to match the FAQ entirely, at least as far as
where errors are produced:
>>> html = open('home2.html').read()
>>> unicode = html.decode('Shift_JIS')
>>> from lxml import etree
>>> rh = etree.HTML(html)
>>> uh = etree.HTML(unicode)
>>> rh[0][1].text
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 859, in etree._Element.text.__get__
  File "apihelpers.pxi", line 291, in etree._collectText
  File "apihelpers.pxi", line 552, in etree.funicode
UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: unexpected code byte
>>> uh[0][1].text
u'\u30b3\u30df'
Going by the FAQ, I expected the opposite: uh = etree.HTML(unicode)
should produce errors here (the input is a unicode string that
contains a proper meta charset entry), and rh should behave normally.
Apologies if I'm simply confusing the issue further :)
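
For what it's worth, the byte 0x83 in the traceback is what you would expect if Shift_JIS bytes were being decoded as UTF-8. The sketch below uses only the standard library (no lxml) and assumes the two characters returned by uh[0][1].text stand in for the page text; the helper name first_utf8_error is made up for illustration, and this says nothing about what lxml does internally.

```python
# Assumption: these two characters (taken from the session above) stand in
# for the text of the page.
text = u'\u30b3\u30df'


def first_utf8_error(raw):
    """Return (offending byte, position) if raw is not valid UTF-8, else None.

    Hypothetical helper, not part of lxml.
    """
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        # bytearray() makes indexing yield an int on both Python 2 and 3.
        return bytearray(raw)[exc.start], exc.start
    return None


# Encode the text the way the file on disk is assumed to be encoded.
raw = text.encode('shift_jis')

error = first_utf8_error(raw)
if error is not None:
    print('byte %s at position %d is not valid UTF-8' % (hex(error[0]), error[1]))

# Decoding with the right codec round-trips cleanly.
print(raw.decode('shift_jis') == text)
```

So the text content of rh looks like it was stored as raw Shift_JIS bytes and only later treated as UTF-8 when .text was read, though that is a guess from the traceback rather than something I have checked in the source.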
- Luke
On Wed, 2006-10-18 at 08:51 +0200, Stefan Behnel wrote:
> Hi Chris,
>
> Chris Abraham wrote:
> > Thanks for this. Who should I contact to get the FAQ updated?
> > http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
>
> Well, the FAQ isn't really wrong in what it says. In your case, the encoding
> information is simply not taken into account as it is in a totally wrong
> position. So it's more like the document did not contain any encoding
> information at all.
>
> Note that the HTML parser is not guaranteed to create correct HTML that is
> 'equivalent' to the broken HTML. It just tries its best, which may mean that
> some of the original content may get lost. And in this case, it's meta data
> that gets lost.
>
> Stefan