[lxml-dev] LXML utf-8 problem...
Douglas Mayle
douglas at openplans.org
Sat Feb 21 15:11:12 CET 2009
Actually, after digging further, I found out that it's a problem with
the error reporting mechanism in lxml. If you have non-ASCII data (for
example UTF-8 bytes) inside a 'str' object, which is normal for many
HTML and XML documents, then lxml's error reporting incorrectly decodes
the string while building the error message, which raises a new error
that masks the original one. As mentioned earlier in this thread,
it should be fixed in the newest version of lxml. In any case, I
copied code from elsewhere in my program and forgot to switch from
parse (which takes a filename or URL) to fromstring (which takes text
data). parse raised an error because it didn't receive a filename,
and that error message included the incorrectly decoded "filename"
(really my HTML), which caused the new error...
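For anyone who wants to see the masking error in isolation, here is a
minimal sketch in Python 3 that needs no lxml at all. It only assumes
that the error path decodes the bytes as ASCII, which is my reading of
the traceback and not something I checked in the lxml source. The
`data` value mirrors the string from my original report, assuming the
wrapped XML declaration ended in `?>`.

```python
# Python 3 sketch: reproduce the *masking* error without lxml.
# `data` mirrors the bytes from the original report (assumption: the
# XML declaration ended in "?>" before the mail client wrapped it).
data = ('<?xml version="1.0" encoding="utf-8"?>'
        '<html><body><p>\xa9</p></body></html>').encode('utf-8')

try:
    # Assumption: this is roughly what the error reporting did with
    # the byte string it was handed as a "filename".
    data.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The copyright sign is encoded as the two bytes 0xc2 0xa9; the first
# of them is what the ASCII codec trips over.
print(failure)
```

The real fix on my side was simply to call lxml.html.fromstring on the
data instead of lxml.html.parse.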
Doug
On Feb 21, 2009, at 12:56 AM, Sergio Monteiro Basto wrote:
> On Fri, 2009-02-20 at 15:10 -0500, Douglas Mayle wrote:
>> Hi all,
>> Unfortunately, I'm running into an error that I thought I had licked
>> before. I'm running lxml 2.1.2 on OS X and Python 2.5. I have a
>> 'str' object that contains html with utf-8 bytes and a utf-8 encoding
>> specified by the directive, which should be properly handled, to my
>> understanding, but is not:
>>
>> douglas$ python
>> Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
>> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
>> Type "help", "copyright", "credits" or "license" for more
>> information.
>> >>> import lxml.html
>> >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/
>> tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/
>> __init__.py", line 651, in parse
>> File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/
>> lxml.etree.c:25269)
>> File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/
>> lxml/lxml.etree.c:63768)
>> File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL
>> (src/lxml/lxml.etree.c:64012)
>> File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/
>> lxml/lxml.etree.c:63169)
>> File "parser.pxi", line 969, in
>> lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:
>> 60461)
>> File "parser.pxi", line 538, in
>> lxml.etree._ParserContext._handleParseResultDoc (src/lxml/
>> lxml.etree.c:
>> 56751)
>> File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/
>> lxml/lxml.etree.c:57595)
>> File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/
>> lxml/lxml.etree.c:56936)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
>> 53: ordinal not in range(128)
>> >>>
>>
>> Why is ascii being used as a codec? It's properly identified in the
>> string. It's a valid character (in this case a copyright symbol).
>> What can I do?
>
> If it is what I think, it could be a problem with Python itself!
> This code:
> content = urllib.urlopen(url).read(-1)
> content = content.decode('cp1252')
> print content
>
> With a page encoded as windows-1252, I print to stdout and see it
> fine, but if I put it in a pipe, like:
> python getcontent.py | grep something
> it gives the error that you mention.
>
> Don't ask me why, but adding .encode('utf-8'):
>
> content = content.decode('cp1252').encode('utf-8')
>
> fixes this problem.
>
> Hope that helps. Regards,
> --
> Sérgio M. B.
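On the piped-output case quoted above: as far as I understand it,
Python 2 has no encoding for stdout when it is a pipe and falls back
to ASCII, so printing non-ASCII unicode fails unless you encode it
yourself. Below is a rough Python 3 sketch of that effect. It uses an
in-memory stream in place of the real pipe, and `write_text` is a
hypothetical helper made up for this illustration.

```python
import io

def write_text(text, encoding):
    """Write `text` through a text stream that uses `encoding`.

    Hypothetical helper: the BytesIO stands in for a pipe, and the
    wrapper stands in for a stdout with the given encoding.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# Bytes as they might arrive from a windows-1252 page.
page = b"caf\xe9 \xa9".decode("cp1252")

try:
    write_text(page, "ascii")      # the "piped" case: ASCII only
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 works for any text, which is why the explicit
# .encode('utf-8') in the quoted message makes the failure go away.
utf8_bytes = write_text(page, "utf-8")
print(ascii_ok, utf8_bytes)
```

Note that this is an encode error on output, so it is a different
problem from the decode error I hit inside lxml.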