[lxml-dev] LXML utf-8 problem...
Stefan Behnel
stefan_ml at behnel.de
Fri Feb 20 21:37:44 CET 2009
Hi,
Douglas Mayle wrote:
> Unfortunately, I'm running into an error that I thought I had licked
> before. I'm running lxml 2.1.2 on OS X and Python 2.5. I have a
> 'str' object that contains html with utf-8 bytes and a utf-8 encoding
> specified by the directive, which should be properly handled, to my
> understanding, but is not:
>
> douglas$ python
> Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import lxml.html
> >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
> Traceback (most recent call last):
> [...]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
:)
The error message is a bit misleading here. parse() takes a file name as its
argument, so your UTF-8 encoded byte string is treated as a file name, not as
markup. When lxml.etree tries to parse it, it fails to find the file and tries
to raise an error. It then fails again because it cannot format the error
message, which is the UnicodeDecodeError you see.
I haven't tried it, but it should work with 2.2.
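
For illustration, here is a minimal sketch of two ways to hand in-memory data to the parser instead of a file name. It is written for Python 3 and assumes the markup is well-formed XML, as the snippet in the report is. The stdlib fallback import is only there so the sketch runs without lxml installed, since ElementTree draws the same line between parse() and fromstring(). The variable names are made up for the example. For real-world HTML, lxml.html.fromstring() plays the same role as etree.fromstring() does here.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Stdlib fallback; it offers the same two entry points.
    import xml.etree.ElementTree as etree

# The document from the report, as UTF-8 encoded bytes.
data = ('<?xml version="1.0" encoding="utf-8"?>'
        '<html><body><p>\xa9</p></body></html>').encode('utf-8')

# fromstring() parses the bytes themselves and returns the root element.
root = etree.fromstring(data)

# parse() expects a file name or a file-like object, so wrap the bytes.
tree = etree.parse(io.BytesIO(data))

# Both routes decode the UTF-8 bytes to the same text.
copyright_sign = root.find('body/p').text
same_text = tree.getroot().find('body/p').text
```

Either route keeps the byte string from being interpreted as a path, which is what triggers the failure described above.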
Stefan