[kupu-dev] nbsp tags disappear, breakin' paragraphs

bernhard g.bernhard at akbild.ac.at
Mon Feb 12 10:11:35 CET 2007


Dear Duncan!

Now I _do_ owe you one...
Kupu is now dealing with 'nbsp' entities exactly as it is supposed to.
I'd suggest making

     this.escapeEntities = function(xml) {
         xml = xml.replace('\xa0', '&nbsp;');
         return xml;
     };

the default behaviour; indexing should work perfectly well with it. If
there is some service like 'fleurop' or 'www.bierher.at' on your side of
the screen, I am really willing to send you a 'surprise'...
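
One caveat with the snippet above: in JavaScript, replace() with a plain string pattern only touches the first match, so a document with several non-breaking spaces would probably need a global regular expression. A minimal sketch, assuming a standalone helper (the name escapeNbsp is hypothetical and not part of kupu):

```javascript
// Hypothetical standalone variant of escapeEntities; not part of kupu itself.
// A string pattern in replace() only touches the first match, so a
// global regular expression is used to catch every U+00A0.
function escapeNbsp(xml) {
    return xml.replace(/\xa0/g, '&nbsp;');
}

var sample = 'a\xa0b\xa0c';
console.log(sample.replace('\xa0', '&nbsp;')); // first occurrence only
console.log(escapeNbsp(sample));               // every occurrence
```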

Who said Mondays are bad?

Very Best Regards,
Gogo.
<g.bernhard at akbild.ac.at>


On 12.02.2007, at 09:21, Duncan Booth wrote:

> bernhard <g.bernhard at akbild.ac.at> wrote:
>
>> Entering &nbsp; in HTML view and switching back to 'normal' kupu view,
>> I found no &nbsp; any more; I looked into the DOM using Firebug - no
>> &nbsp;
>
> It doesn't appear as &nbsp; in the DOM. It will appear as character #xA0
> which, since it displays as a space, is quite hard to see. I found I had
> to do 'copy html' from Firebug and paste into an editor set to display
> hex codes for non-ASCII characters before I could verify that the
> non-break space was still there.
>
> There is a trivial 'fix' which should get you up and running: edit
> common/kupueditor.js, find this.escapeEntities, remove the 'return xml;'
> line and uncomment the 4-line return statement. That will entitise
> everything.
>
>     this.escapeEntities = function(xml) {
>         // XXX: temporarily disabled
>         return xml;
>         // Escape non-ascii characters as entities.
>         // return xml.replace(/[^\r\n -\177]/g,
>         //     function(c) {
>         //         return '&#'+c.charCodeAt(0)+';';
>         //     });
>     };
>
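To illustrate what the commented-out return statement would do once enabled, here is a minimal standalone sketch. The function name entitiseAll is hypothetical, and the character class is written with \x7f in place of the octal \177; both denote the same character.

```javascript
// Hypothetical standalone version of the commented-out kupu code.
function entitiseAll(xml) {
    // Anything other than CR, LF and the range space..DEL is replaced
    // by a decimal numeric character reference.
    return xml.replace(/[^\r\n -\x7f]/g, function (c) {
        return '&#' + c.charCodeAt(0) + ';';
    });
}

console.log(entitiseAll('caf\xe9\xa0au lait'));
```

Plain ASCII text passes through untouched, while both the accented letter and the non-breaking space come out as numeric references, which is why this variant affects indexing of accented words as well.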
> Unfortunately, doing that will break text indexing. An alternative
> 'fix' would be:
>
>     this.escapeEntities = function(xml) {
>         // XXX: temporarily disabled
>         xml = xml.replace('\xa0', '&nbsp;');
>         return xml;
>     };
>
> which just escapes the non break space.
>
>>
>> If you want to have a serious i18n aware catalog you will have to use
>> TextIndexNG - and TextIndexNG knows how to deal with html entities
>> (hopefully); I would like to know if it is hard to locate the code
>> where the entities are dropped - is it zope, plone or kupu? As long
>> as it is not Python we definitively can handle :-P  People are unable
>> to edit the image modules i put in otherwise.
>>
> Kupu doesn't entitise what it saves because Plone (either currently or
> in the past, but I think it is still a problem) doesn't handle entities
> properly. When you save a document, Plone uses PortalTransforms to
> convert the HTML to plain text before passing the plain text to
> TextIndexNG. The transform looks like this (Plone 2.5.2 version):
>
> from Products.PortalTransforms.libtransforms.retransform import retransform
>
> class html_to_text(retransform):
>     inputs  = ('text/html',)
>     output = 'text/plain'
>
> def register():
>     # XXX convert entites with htmlentitydefs.name2codepoint ?
>     return html_to_text("html_to_text",
>                        ('<script [^>]>.*</script>(?im)', ' '),
>                        ('<style [^>]>.*</style>(?im)', ' '),
>                        ('<head [^>]>.*</head>(?im)', ' '),
>                        ('(?im)</?(font|em|i|strong|b)(?=\W)[^>]*>', ''),
>                        ('<[^>]*>(?i)(?m)', ' '),
>                        )
>
> Note the XXX comment which has been there for donkey's years. So what
> happens is that x&eacute;y (or x&#xe9;y which is what kupu used to
> convert it to) is left unchanged and TextIndexNG indexes separate words
> x, eacute, y (or x, xe9, y).
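
The effect described above can be reproduced with a rough JavaScript approximation. This is only a sketch: stripTags mimics the last rule of the transform, and naiveWords is a hypothetical stand-in for an indexer's word splitter, not TextIndexNG's actual splitting logic.

```javascript
// Approximation of the final html_to_text rule: replace every tag with a space.
function stripTags(html) {
    return html.replace(/<[^>]*>/g, ' ');
}

// Hypothetical stand-in for an indexer's word splitter: split on anything
// that is not an ASCII letter or digit. TextIndexNG's real splitter differs.
function naiveWords(text) {
    return text.split(/[^A-Za-z0-9]+/).filter(function (w) {
        return w.length > 0;
    });
}

console.log(naiveWords(stripTags('<p>x&eacute;y</p>'))); // named entity
console.log(naiveWords(stripTags('<p>x&#xe9;y</p>')));   // numeric entity
```

Since the tag-stripping step never decodes entities, the entity name or hex code survives into the plain text and ends up as a word of its own.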
>
> I guess maybe I should attempt to do the conversion for some things like
> nbsp and leave accented letters unchanged.
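
A selective conversion of that kind might look roughly like the sketch below. It is written in JavaScript purely for illustration; the real change would presumably live in the Python transform. The helper name nbspToSpace is hypothetical, and the set of nbsp spellings it recognises is an assumption.

```javascript
// Hypothetical helper: turn the common spellings of a non-breaking space
// into a plain space and leave every other entity untouched.
function nbspToSpace(html) {
    // Matches the named entity, the decimal and hex references,
    // and the raw U+00A0 character (case-insensitively).
    return html.replace(/&nbsp;|&#160;|&#x0*a0;|\xa0/gi, ' ');
}

console.log(nbspToSpace('a&nbsp;b&#160;c&#xA0;d x&eacute;y'));
```

Accented letters such as &eacute; pass through unchanged here, so they would still need separate handling before indexing.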
>
> _______________________________________________
> kupu-dev mailing list
> kupu-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/kupu-dev


