[lxml-dev] LXML utf-8 problem...
Douglas Mayle
douglas at openplans.org
Fri Feb 20 21:10:04 CET 2009
Hi all,
Unfortunately, I'm running into an error that I thought I had licked
before. I'm running lxml 2.1.2 on OS X with Python 2.5. I have a
'str' object that contains HTML with utf-8 bytes and a utf-8 encoding
specified by the XML declaration. To my understanding this should be
handled properly, but it is not:
douglas$ python
Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/__init__.py", line 651, in parse
  File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/lxml.etree.c:25269)
  File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:63768)
  File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:64012)
  File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:63169)
  File "parser.pxi", line 969, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56751)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57595)
  File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56936)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
>>>
Why is ascii being used as a codec? The encoding is properly declared
in the string, and the character is valid (in this case a copyright
symbol). What can I do?
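
As a diagnostic aid, here is a minimal standard-library sketch (written
for Python 3, not the Python 2.5 used in the session above) that checks
where an ASCII decode of the same byte string fails. It does not call
lxml at all; the names markup, data, failed_at and bad_byte are only
illustrative. If the offset matches the traceback, that would suggest
the whole argument, markup included, is what gets decoded as ASCII.

```python
# Rebuild the markup from the session above and encode it as UTF-8.
markup = u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'
data = markup.encode('utf-8')

failed_at = None
bad_byte = None

# Decoding the entire byte string as ASCII stops at the first non-ASCII byte.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    failed_at = exc.start
    bad_byte = data[exc.start:exc.start + 1]

print(failed_at, repr(bad_byte))  # prints: 53 b'\xc2'
```

The offset 53 is the first byte of the encoded copyright symbol, the
same position the traceback reports. One possible reading, which is an
assumption and not confirmed anywhere in this message, is that
lxml.html.parse() treats its argument as a filename or URL (the
traceback passes through _parseDocumentFromURL and _parseDocFromFile)
and only hits the ASCII codec while building the error message for that
"filename". If so, passing the bytes to lxml.html.fromstring(), or
wrapping them in a file-like object before calling parse(), may be
worth trying.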