[kupu-dev] UnicodeDecodeError with umlauts in image title

Duncan Booth duncan.booth at suttoncourtenay.org.uk
Fri Jul 4 09:36:02 CEST 2008


Tim Terlegård <tim.terlegard at valentinewebsystems.se> wrote:

> Hi kupuers,
> 
> I get an error when I have an image with a title that contains umlauts
> and use that image inside a document with caption enabled.
> 
> The error is triggered by the transform in html2captioned.py on these  
> lines:
> 
>      if isinstance(data, str):
>          data = data.decode('utf8')
>      html = IMAGE_PATTERN.sub(replaceImage, data)
> 
> replaceImage returns utf8, so data should also be utf8, otherwise the
> sub() method will fail when there are umlauts involved.
> 
> Things work if I remove the conversion to unicode on the line above.
> I'm not sure why the conversion to unicode was added some months ago.
> I have changed the tests to use umlauts and removed the conversion to
> unicode. The tests pass. Should I commit this or is there something
> I'm missing?
> 
> /Tim
> 
No, don't commit that.

You haven't said which version of Plone you are using. The problem is
that some versions of Plone return unicode here and some return byte
strings, so the code has to work in both situations. However, it is
important that the regular expression works on unicode. You should
never do manipulations on utf8-encoded strings, as it is possible
(albeit unlikely) that the regex could mess up parts of multi-byte
encoded characters. That's why the decode (and, later on, an encode)
are present.
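
To illustrate the multi-byte point, here is a minimal sketch. It is not
code from kupu, and it is written for Python 3, where bytes plays the
role of the old str and str the role of unicode. The helper first_chars
is a hypothetical name made up for this example.

```python
import re

raw = 'f\u00fcr'.encode('utf8')   # b'f\xc3\xbcr': the u-umlaut takes two bytes

# Regex on the encoded bytes: '.' matches one byte, so the match stops
# in the middle of the two-byte character.
broken = re.match(b'.{2}', raw).group(0)

def first_chars(data, n):
    """Hypothetical helper: decode first, run the regex on text, then
    encode again only if the caller passed bytes."""
    was_bytes = isinstance(data, bytes)
    text = data.decode('utf8') if was_bytes else data
    result = re.match('.{0,%d}' % n, text, re.DOTALL).group(0)
    return result.encode('utf8') if was_bytes else result
```

Here broken holds a byte sequence that can no longer be decoded as
utf8, while first_chars(raw, 2) returns the intact first two characters
whether it is given bytes or text.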

If on your system replaceImage is returning utf8 then the fix should be 
to ensure that it decodes its result before returning. Probably change:

                        return template(**d)

to:

                        result = template(**d)
                        if isinstance(result, str):
                            result = result.decode('utf8')
                        return result
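
For anyone who wants to try the idea in isolation, a self-contained
sketch follows. It assumes Python 3 (bytes instead of the old str), and
IMG, to_text, template, replace_image and caption are simplified,
hypothetical stand-ins, not the real pattern or functions from
html2captioned.py.

```python
import re

# Hypothetical, much simplified pattern; the real IMAGE_PATTERN differs.
IMG = re.compile(r'<img title="([^"]*)"\s*/>')

def to_text(value, encoding='utf8'):
    """Decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def template(**d):
    # Stand-in for a template that returns utf8 bytes on some systems.
    html = '<dl><dt><img title="%(title)s"/></dt><dd>%(title)s</dd></dl>' % d
    return html.encode('utf8')

def replace_image(match):
    d = {'title': match.group(1)}
    # Decode the template result so sub() only ever sees text.
    return to_text(template(**d))

def caption(data):
    # Accept either bytes or text, as different callers may pass either.
    return IMG.sub(replace_image, to_text(data))
```

With this shape, caption() gives the same text result for bytes and for
text input, and the pattern, the input and the replacement are all the
same type when sub() runs.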




