[kupu-dev] nbsp tags disappear, breakin' paragraphs -> supplemental

bernhard g.bernhard at akbild.ac.at
Mon Feb 12 10:42:30 CET 2007


Hello again!

supplemental:
xml.replace with a string pattern would just replace the *first* occurrence of a match...

     this.escapeEntities = function(xml) {
         // split/join rewrites every U+00A0, not only the first one
         xml = xml.split('\xa0').join('&nbsp;');
         return xml;
     };

is maybe better; now it rocks.
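
To illustrate the difference, here is a minimal sketch that should run in any JavaScript engine. The sample string is made up, and the '&nbsp;' replacement follows Duncan's suggestion quoted below:

```javascript
// U+00A0 is the non-breaking space the DOM holds in place of &nbsp;.
const input = 'a\xa0b\xa0c';

// With a string pattern, replace() stops after the first match.
const firstOnly = input.replace('\xa0', '&nbsp;');

// split/join rewrites every occurrence...
const viaSplitJoin = input.split('\xa0').join('&nbsp;');

// ...and so does a regular expression with the global flag.
const viaRegex = input.replace(/\xa0/g, '&nbsp;');

console.log(firstOnly);
console.log(viaSplitJoin);
console.log(viaRegex);
```

The regular expression with the g flag is probably the more common idiom; split/join does the same job without any regex escaping concerns.
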
Gogo.


On 12.02.2007, at 10:11, bernhard wrote:

> Dear Duncan!
>
> Now I _do_ owe you one...
> Kupu is dealing with 'nbsp' entities now exactly as it is supposed
> to.  I'd suggest making
>
>      this.escapeEntities = function(xml) {
>          xml = xml.replace('\xa0', '&nbsp;');
>          return xml;
>      };
>
> the default behaviour; indexing should work perfectly well with it. If
> there is some service such as 'fleurop' or 'www.bierher.at' on your side
> of the screen, I am really willing to send you a 'surprise'...
>
> Who said Mondays are bad?
>
> Very Best Regards,
> Gogo.
> <g.bernhard at akbild.ac.at>
>
>
> On 12.02.2007, at 09:21, Duncan Booth wrote:
>
>> bernhard <g.bernhard at akbild.ac.at> wrote:
>>
>>> Entering &nbsp; in html view and switching back to 'normal' kupu view
>>> I found no &nbsp; any more; I looked into the DOM using Firebug - no
>>> &nbsp;
>>
>> It doesn't appear as &nbsp; in the DOM. It will appear as character
>> #xA0 which, since it displays as a space, is quite hard to see. I found
>> I had to do 'copy html' from Firebug and paste into an editor set to
>> display hex codes for non-ASCII characters before I could verify that
>> the non-break space was still there.
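
A quick way to make such characters visible without a hex editor is to print the text with every non-ASCII character escaped. This is only a sketch; showNonAscii is a made-up helper name, not part of Kupu or Firebug:

```javascript
// Hypothetical debugging helper: show each non-ASCII UTF-16 code unit
// as a \uXXXX escape so that U+00A0 no longer looks like a plain space.
function showNonAscii(text) {
    return text.replace(/[^\x00-\x7f]/g, function (c) {
        return '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0');
    });
}

const visible = showNonAscii('a\xa0b');
console.log(visible);
```

Something like showNonAscii(document.body.innerHTML) in a browser console would presumably give the same view of the live DOM.
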
>>
>> There is a trivial 'fix' which should get you up and running: edit
>> common/kupueditor.js, find this.escapeEntities, remove the 'return xml;'
>> line and uncomment the 4-line return statement. That will entitise
>> everything.
>>
>>     this.escapeEntities = function(xml) {
>>         // XXX: temporarily disabled
>>         return xml;
>>         // Escape non-ascii characters as entities.
>> //         return xml.replace(/[^\r\n -\177]/g,
>> //             function(c) {
>> //             return '&#'+c.charCodeAt(0)+';';
>> //         });
>>     };
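
For reference, a standalone sketch of what the uncommented version would do. It uses the same expression as the commented-out lines, except that \x7f is written in place of the octal \177; the sample string is an assumption:

```javascript
// Replace every character outside CR, LF and the range space..DEL
// with a decimal numeric character reference.
function escapeEntities(xml) {
    return xml.replace(/[^\r\n -\x7f]/g, function (c) {
        return '&#' + c.charCodeAt(0) + ';';
    });
}

// e-acute (U+00E9) followed by a non-breaking space (U+00A0)
const escaped = escapeEntities('caf\xe9\xa0au lait');
console.log(escaped);
```

Note that charCodeAt works on UTF-16 code units, so a character outside the Basic Multilingual Plane would come out as two separate references.
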
>>
>> Unfortunately, doing that will break text indexing. An alternative
>> 'fix' would be:
>>
>>     this.escapeEntities = function(xml) {
>>         // XXX: temporarily disabled
>>         xml = xml.replace('\xa0', '&nbsp;');
>>         return xml;
>>     };
>>
>> which just escapes the non break space.
>>
>>>
>>> If you want to have a serious i18n-aware catalog you will have to use
>>> TextIndexNG - and TextIndexNG knows how to deal with html entities
>>> (hopefully); I would like to know if it is hard to locate the code
>>> where the entities are dropped - is it zope, plone or kupu? As long
>>> as it is not Python we definitely can handle it :-P  People are unable
>>> to edit the image modules I put in otherwise.
>>>
>> Kupu doesn't entitise what it saves because Plone (either currently or
>> in the past, but I think it is still a problem) doesn't handle entities
>> properly. When you save a document Plone uses PortalTransforms to
>> convert the html to plain text before passing the plain text to
>> TextIndexNG. The transform looks like this (Plone 2.5.2 version):
>>
>> from Products.PortalTransforms.libtransforms.retransform import retransform
>>
>> class html_to_text(retransform):
>>     inputs  = ('text/html',)
>>     output = 'text/plain'
>>
>> def register():
>>     # XXX convert entites with htmlentitydefs.name2codepoint ?
>>     return html_to_text("html_to_text",
>>                        ('<script [^>]>.*</script>(?im)', ' '),
>>                        ('<style [^>]>.*</style>(?im)', ' '),
>>                        ('<head [^>]>.*</head>(?im)', ' '),
>>                        ('(?im)</?(font|em|i|strong|b)(?=\W)[^>]*>', ''),
>>                        ('<[^>]*>(?i)(?m)', ' '),
>>                        )
>>
>> Note the XXX comment which has been there for donkey's years. So what
>> happens is that x&eacute;y (or x&#xe9;y which is what kupu used to
>> convert it to) is left unchanged and TextIndexNG indexes separate words
>> x, eacute, y (or x, xe9, y).
>>
>> I guess maybe I should attempt to do the conversion for some things like
>> nbsp and leave accented letters unchanged.
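
The effect described above can be reproduced with a small sketch. Everything in it is a hypothetical stand-in: stripTags only mimics the last rule of the transform, words is a naive splitter rather than TextIndexNG's real one, and the entity table is deliberately tiny:

```javascript
// Stand-in for the final transform rule: every tag becomes a space.
function stripTags(html) {
    return html.replace(/<[^>]*>/g, ' ');
}

// Naive word splitter (an assumption, not TextIndexNG's algorithm):
// break on anything that is not an ASCII alphanumeric or a Latin letter.
function words(text) {
    return text.split(/[^0-9A-Za-z\u00c0-\u024f]+/).filter(Boolean);
}

// Tiny illustrative table; a real one would cover all named entities.
const ENTITIES = { nbsp: '\xa0', eacute: '\xe9', amp: '&', lt: '<', gt: '>' };

// Decode named, decimal and hexadecimal references; unknown names stay as-is.
function decodeEntities(text) {
    return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g,
        function (match, body) {
            if (body[0] === '#') {
                const code = body[1] === 'x'
                    ? parseInt(body.slice(2), 16)
                    : parseInt(body.slice(1), 10);
                return String.fromCodePoint(code);
            }
            return Object.prototype.hasOwnProperty.call(ENTITIES, body)
                ? ENTITIES[body]
                : match;
        });
}

const html = '<p>x&eacute;y</p>';
const undecoded = words(stripTags(html));
const decoded = words(decodeEntities(stripTags(html)));
console.log(undecoded, decoded);
```

Decoding after the tags are stripped keeps a decoded &lt; from being mistaken for markup. Without the decoding step the entity name ends up as a word of its own, which matches the x, eacute, y result described above.
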
>>
>> _______________________________________________
>> kupu-dev mailing list
>> kupu-dev at codespeak.net
>> http://codespeak.net/mailman/listinfo/kupu-dev
