[kupu-dev] nbsp tags disappear, breakin' paragraphs

Duncan Booth duncan.booth at suttoncourtenay.org.uk
Mon Feb 12 09:21:50 CET 2007


bernhard <g.bernhard at akbild.ac.at> wrote:

> Entering &nbsp; in html view and switching back to 'normal' kupu view  
> i found no &nbsp; any more; I looked into the dom using Firebug - no  
> &nbsp;

It doesn't appear as &nbsp; in the DOM. It will appear as character #xA0 
which, since it displays as a space, is quite hard to see. I found I had 
to do 'copy html' from Firebug and paste into an editor set to display 
hex codes for non-ASCII characters before I could verify that the 
non-breaking space was still there.
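
If no hex-capable editor is to hand, a few lines of Python can make the 
character visible. This is only a sketch: the show_non_ascii name is made 
up for illustration, and the sample string stands in for whatever was 
copied out of Firebug.

```python
def show_non_ascii(text):
    """Return text with every non-ASCII character shown as a backslash escape."""
    # The 'backslashreplace' error handler turns U+00A0 into the visible
    # four-character sequence \xa0 instead of raising an error.
    return text.encode("ascii", "backslashreplace").decode("ascii")


# Stand-in for markup pasted from Firebug's 'copy html'.
pasted = "<p>one\u00a0two</p>"
print(show_non_ascii(pasted))
```

An ordinary space is left alone, so any \xa0 in the output marks a real 
non-breaking space.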

There is a trivial 'fix' which should get you up and running: edit 
common/kupueditor.js, find this.escapeEntities, remove the 'return xml;' 
line and uncomment the 4-line return statement. That will entitise 
every non-ASCII character.

    this.escapeEntities = function(xml) {
        // XXX: temporarily disabled
        return xml;
        // Escape non-ascii characters as entities.
//         return xml.replace(/[^\r\n -\177]/g,
//             function(c) {
//             return '&#'+c.charCodeAt(0)+';';
//         });
    };

Unfortunately, doing that will break text indexing. An alternative 
'fix' would be:

    this.escapeEntities = function(xml) {
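        // Escape only the non-breaking space. A regex with the g flag is
        // needed: a string pattern would replace just the first match.
        xml = xml.replace(/\xa0/g, '&nbsp;');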
        return xml;
    };

which just escapes the non-breaking space.

> 
> If you want to have a serious i18n aware catalog you will have to use  
> TextIndexNG - and TextIndexNG knows how to deal with html entities  
> (hopefully); I would like to know if it is hard to locate the code  
> where the entities are dropped - is it zope, plone or kupu? As long  
> as it is not Python we definitively can handle :-P  People are unable  
> to edit the image modules i put in otherwise.
> 
Kupu doesn't entitise what it saves because Plone (either currently or 
in the past, but I think it is still a problem) doesn't handle entities 
properly. When you save a document, Plone uses PortalTransforms to 
convert the HTML to plain text before passing the plain text to 
TextIndexNG. The transform looks like this (Plone 2.5.2 version):

from Products.PortalTransforms.libtransforms.retransform import retransform

class html_to_text(retransform):
    inputs  = ('text/html',)
    output = 'text/plain'

def register():
    # XXX convert entites with htmlentitydefs.name2codepoint ?
    return html_to_text("html_to_text",
                       ('<script [^>]>.*</script>(?im)', ' '),
                       ('<style [^>]>.*</style>(?im)', ' '),
                       ('<head [^>]>.*</head>(?im)', ' '),
                       ('(?im)</?(font|em|i|strong|b)(?=\W)[^>]*>', ''),
                       ('<[^>]*>(?i)(?m)', ' '),
                       )

Note the XXX comment, which has been there for donkey's years. So what 
happens is that x&eacute;y (or x&#xe9;y, which is what Kupu used to 
convert it to) is left unchanged and TextIndexNG indexes the separate 
words x, eacute, y (or x, xe9, y).
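
The effect can be reproduced outside Plone with plain re. The sketch 
below makes two loud assumptions: it applies only the last, catch-all tag 
pattern from the transform, and it uses a naive \w+ splitter as a 
stand-in for TextIndexNG's real word splitter, which is not reproduced 
here. The function names are hypothetical.

```python
import re


def strip_tags(html):
    """Replace every tag with a space, as the catch-all pattern does."""
    return re.sub(r"(?i)<[^>]*>", " ", html)


def naive_words(text):
    """Hypothetical splitter: each run of word characters is one term."""
    return re.findall(r"\w+", text)


# A named entity, a numeric entity, and the raw character, in that order.
for html in ("<p>x&eacute;y</p>", "<p>x&#xe9;y</p>", "<p>x\u00e9y</p>"):
    print(naive_words(strip_tags(html)))
```

Only the third input, where the accented letter is stored as a plain 
character, survives as a single word.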

I guess maybe I should attempt to do the conversion for some things like 
nbsp and leave accented letters unchanged.
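
On the Plone side, one possible reading of the XXX comment is to decode 
entities in the plain text once the tags have been stripped. What follows 
is a rough sketch under stated assumptions, not the PortalTransforms 
code: it uses Python 3's html.entities.name2codepoint (the successor to 
htmlentitydefs), and the decode_entities helper is hypothetical.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7b; style references.
ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _decode(match):
    ref = match.group(1)
    if ref[0] == "#":
        # Numeric reference: hexadecimal if it starts with #x, else decimal.
        code = int(ref[2:], 16) if ref[1] in "xX" else int(ref[1:])
    elif ref in name2codepoint:
        code = name2codepoint[ref]
    else:
        # Unknown entity name: leave the text exactly as it was.
        return match.group(0)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Code point outside the Unicode range: leave it untouched.
        return match.group(0)


def decode_entities(text):
    """Replace character references in already tag-stripped text."""
    return ENTITY.sub(_decode, text)


print(decode_entities(" x&eacute;y "))
```

Decoding after the tags are stripped, rather than before, avoids turning 
an escaped &lt;b&gt; in the text back into something that looks like 
markup.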


