[lxml-dev] LXML utf-8 problem...

Sergio Monteiro Basto sergio at sergiomb.no-ip.org
Sat Feb 21 06:56:26 CET 2009


On Fri, 2009-02-20 at 15:10 -0500, Douglas Mayle wrote:
> Hi all,
> 	Unfortunately, I'm running into an error that I thought I had licked  
> before.  I'm running lxml 2.1.2 on OS X and python 2.5.  I have a  
> 'str' object that contains html with utf-8 bytes and a utf-8 encoding  
> specified by the directive, which should be properly handled, to my  
> understanding, but is not:
> 
> douglas$ python
> Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>  >>> import lxml.html
>  >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/ 
> tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/ 
> __init__.py", line 651, in parse
>    File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/ 
> lxml.etree.c:25269)
>    File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/ 
> lxml/lxml.etree.c:63768)
>    File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL  
> (src/lxml/lxml.etree.c:64012)
>    File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/ 
> lxml/lxml.etree.c:63169)
>    File "parser.pxi", line 969, in  
> lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461)
>    File "parser.pxi", line 538, in  
> lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c: 
> 56751)
>    File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/ 
> lxml/lxml.etree.c:57595)
>    File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/ 
> lxml/lxml.etree.c:56936)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position  
> 53: ordinal not in range(128)
>  >>>
> 
> Why is ascii being used as a codec?  It's properly identified in the  
> string.  It's a valid character (in this case a copyright symbol).   
> What can I do?

If this is what I think it is, it could be a problem with Python itself!
Take this code:
import urllib
content = urllib.urlopen(url).read(-1)
content = content.decode('cp1252')
print content
 
With a page encoded in windows-1252, printing to stdout on a terminal
displays fine, but if I send the output through a pipe, like:
python getcontent.py | grep something
it gives an 'ascii' codec error like the one you mention.

I'm not sure why, but adding .encode('utf-8'):

content = content.decode('cp1252').encode('utf-8')

fixes this problem.
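
A likely explanation, as far as I understand it: on Python 2, when stdout is a pipe rather than a terminal, sys.stdout.encoding is None, so print falls back to the 'ascii' codec for unicode objects and fails on any non-ASCII character (as a UnicodeEncodeError in that case). Encoding to bytes yourself sidesteps the implicit conversion. Below is a minimal sketch of that idea, not a definitive fix. The sample bytes and the helper name to_pipe_bytes are hypothetical, and an in-memory io.BytesIO stands in for the pipe.

```python
# -*- coding: utf-8 -*-
import io


def to_pipe_bytes(text, encoding="utf-8"):
    """Encode decoded text explicitly so the result is safe to write to a pipe.

    Hypothetical helper for illustration; it only wraps text.encode().
    """
    return text.encode(encoding)


# Hypothetical sample: a copyright sign as a windows-1252 page would send it.
raw = b"\xa9 2009 Example"
content = raw.decode("cp1252")      # bytes -> text

# This is roughly what an implicit 'ascii' conversion attempts, and why it fails.
try:
    content.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Explicit encoding: the byte stream (standing in for the pipe) gets UTF-8.
pipe = io.BytesIO()
pipe.write(to_pipe_bytes(content + u"\n"))
```

In a real script the encoded bytes would go to stdout instead of the BytesIO object, so the same output works whether stdout is a terminal or a pipe, provided the reader expects UTF-8.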

Hope that helps. Regards,
-- 
Sérgio M. B.