[kupu-dev] nbsp tags disappear, breakin' paragraphs
Duncan Booth
duncan.booth at suttoncourtenay.org.uk
Mon Feb 12 09:21:50 CET 2007
bernhard <g.bernhard at akbild.ac.at> wrote:
> Entering in html view and switching back to 'normal' kupu view
> i found no &nbsp; any more; I looked into the dom using Firebug - no &nbsp;
>
It doesn't appear as &nbsp; in the DOM. It will appear as character #xA0
which, since it displays as a space, is quite hard to see. I found I had
to do 'copy html' from firebug and paste into an editor set to display
hex codes for non-ascii characters before I could verify that the
non-break space was still there.
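If you don't have such an editor to hand, a few lines of Python can do the same check. This is only a sketch: the function name show_non_ascii and the sample string are made up for illustration, and it assumes you have pasted the copied html into a Python string.

```python
def show_non_ascii(text):
    """Return text with every non-ASCII character replaced by a visible escape."""
    # 'backslashreplace' turns U+00A0 into the four characters \xa0
    return text.encode("ascii", "backslashreplace").decode("ascii")


# Hypothetical sample: a paragraph containing a non-break space
sample = "<p>one\u00a0two</p>"
print(show_non_ascii(sample))  # <p>one\xa0two</p>
```

Any \xa0 in the output is a non-break space that survived the round trip.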
There is a trivial 'fix' which should get you up and running: edit
common/kupueditor.js, find this.escapeEntities, remove the 'return xml;'
line and uncomment the four-line return statement. That will entitise
everything.
this.escapeEntities = function(xml) {
    // XXX: temporarily disabled
    return xml;
    // Escape non-ascii characters as entities.
    // return xml.replace(/[^\r\n -\177]/g,
    //     function(c) {
    //         return '&#'+c.charCodeAt(0)+';';
    //     });
};
Unfortunately, doing that will break text indexing. An alternative
'fix' would be:
this.escapeEntities = function(xml) {
    // Entitise only the non-break space (the /g flag replaces every
    // occurrence, not just the first).
    xml = xml.replace(/\xa0/g, '&nbsp;');
    return xml;
};
which just escapes the non break space.
>
> If you want to have a serious i18n aware catalog you will have to use
> TextIndexNG - and TextIndexNG knows how to deal with html entities
> (hopefully); I would like to know if it is hard to locate the code
> where the entities are dropped - is it zope, plone or kupu? As long
> as it is not Python we definitively can handle :-P People are unable
> to edit the image modules i put in otherwise.
>
Kupu doesn't entitise what it saves because Plone (either currently or
in the past, but I think it is still a problem) doesn't handle entities
properly. When you save a document Plone uses PortalTransforms to
convert the html to plain text before passing the plain text to
TextIndexNG. The transform looks like (Plone 2.5.2 version):
from Products.PortalTransforms.libtransforms.retransform import retransform

class html_to_text(retransform):
    inputs = ('text/html',)
    output = 'text/plain'

def register():
    # XXX convert entites with htmlentitydefs.name2codepoint ?
    return html_to_text("html_to_text",
                        ('<script [^>]>.*</script>(?im)', ' '),
                        ('<style [^>]>.*</style>(?im)', ' '),
                        ('<head [^>]>.*</head>(?im)', ' '),
                        ('(?im)</?(font|em|i|strong|b)(?=\W)[^>]*>', ''),
                        ('<[^>]*>(?i)(?m)', ' '),
                        )
Note the XXX comment which has been there for donkey's years. So what
happens is that x&eacute;y (or x&#xe9;y, which is what kupu used to
convert it to) is left unchanged and TextIndexNG indexes the separate
words x, eacute, y (or x, xe9, y).
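To make the effect concrete, here is a rough sketch in modern Python (html.entities is the Python 3 name for htmlentitydefs). None of it is Plone or TextIndexNG code: strip_tags is a simplified stand-in for the transform, naive_words is a hypothetical splitter that only approximates what an indexer might do, and decode_entities is one possible reading of what the XXX comment hints at.

```python
import re
from html.entities import name2codepoint


def strip_tags(html):
    # Simplified stand-in for html_to_text: replace every tag with a space.
    return re.sub(r"<[^>]*>", " ", html)


def naive_words(text):
    # Hypothetical splitter: runs of word characters. The real TextIndexNG
    # splitter is more elaborate; this only illustrates the failure.
    return re.findall(r"\w+", text)


def decode_entities(text):
    # Replace named and numeric character references with their characters.
    def repl(match):
        ref = match.group(1)
        if ref.startswith(("#x", "#X")):
            return chr(int(ref[2:], 16))
        if ref.startswith("#"):
            return chr(int(ref[1:]))
        if ref in name2codepoint:
            return chr(name2codepoint[ref])
        return match.group(0)  # unknown name: leave untouched

    return re.sub(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);", repl, text)


html = "<p>x&eacute;y</p>"
print(naive_words(strip_tags(html)))                   # ['x', 'eacute', 'y']
print(naive_words(decode_entities(strip_tags(html))))  # ['xéy']
```

Decoding after the tags are stripped keeps the word in one piece, at least for a splitter that treats accented letters as word characters.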
I guess maybe I should attempt to do the conversion for some things like
nbsp and leave accented letters unchanged.