From sisi at foei.org Thu Apr 12 14:33:16 2007
From: sisi at foei.org (sisi)
Date: Thu, 12 Apr 2007 14:33:16 +0200
Subject: [kupu-dev] Kupu says links are bad when they aren't
Message-ID: <461E270C.3060305@foei.org>
Hi all,
We're trying to get our content migrated from our flat html site into a
plone site, and while using the kupu relative path to uid tool we've
come up against some strange problems.
Our plone site settings are:
Zope 2.9.6-final, Plone 2.5.2, python 2.4.4, linux, kupu 1.4 (svn, trunk
- Revision: 41990)
All files have been dropped in using both FTP and WebDAV (separate
experiments).
Link checker in Kupu erroneously flags some links (relative, site
internal) as bad. Some common elements are:
. involving GIFs (and sometimes jpgs, we think)
. link is often a combination of anchor and img like:
. after editing (but making no changes) with Kupu and saving, these
links become:
(the anchor point has
silently been removed)
This means that while kupu will flag these links as bad (even though
they are not, and they are within either href or src), as soon as we
have saved a page and rerun the kupu links script, the links get UID'ed
and no longer marked as bad by kupu. So it seems that one small element
that kupu does not like on our pages is causing it to ignore all the
relative paths on that page.
Since we have close to 4500 pages we cannot save every page in our
migrated content and then rerun the links scripts. At least, we'd rather
not!
Other types (non specific) are also involved, but not as frequently.
Furthermore, because Kupu believes these links are bad, it will not
convert them to UID.
Point of note is that, for some unknown reason, the content type
registry was empty on first loading the files. This peculiarity has been
noted by several people on the net, and is easily fixed by uninstalling
and reinstalling the ATContentTypes product. After fixing this, the
problem went away for many of the PDFs, for instance, but still persists
(even after repopulating the folders via FTP or WebDAV) for many GIFs
(and still some jpegs and pdfs and normal href links).
If anyone has any ideas about what might be causing this problem please
send them on. We think there are 3900 pages with bad links on them, and
each page has several bad links. One thing that is throwing us off the
scent is that some pages have a mixture of resolved uids and relative
paths after running the scripts.
One question we have is:
Where is the code that identifies links for checking? Is it in kupu or
is kupu calling it from a function or something in Plone? Because we'd
like to look at the code and make some progress that way but we can't
find it, using all our ninja powers of grep and find etc :-)
Cheers,
sisi
--
# sisi nutt # extranet coordinator
# Friends of the Earth International
# PO Box 19199 # 1000 GD Amsterdam # The Netherlands
# Tel 31 20 6221369 # Fax 31 20 6392181 # http://www.foei.org
# email sisi at foei.org # skype foei_sisi
From duncan.booth at suttoncourtenay.org.uk Thu Apr 12 17:32:04 2007
From: duncan.booth at suttoncourtenay.org.uk (Duncan Booth)
Date: Thu, 12 Apr 2007 15:32:04 +0000 (UTC)
Subject: [kupu-dev] Kupu says links are bad when they aren't
References: <461E270C.3060305@foei.org>
Message-ID:
sisi wrote:
> One question we have is:
> Where is the code that identifies links for checking? Is it in kupu or
> is kupu calling it from a function or something in Plone? Because we'd
> like to look at the code and make some progress that way but we can't
> find it, using all our ninja powers of grep and find etc :-)
>
Aha. I think I can answer that question.
Look in Products/kupu/plone/html2captioned.py
Specifically the class Migration:
object_check checks the links in a single object.
classifyLink tries to figure out whether the link is relative, absolute,
uid based or broken.
What I would suggest is that you put some print statements in checklink:
e.g. print link, base, classification
and then run zope foregrounded so you can see how it categorises each link.
If you disagree with what it's doing, email me with details of what you
think it should be doing instead.
If the problem is that it isn't finding links at all (i.e. checklink isn't
being called when it should be called) then prints or breakpoints near the
line:
newdata = LINK_PATTERN.sub(checklink, data)
may help. data is the original value of the field, newdata is the updated
value after doing the substitution to fix up the links. The regular
expression matching could well be broken.
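
As a rough illustration of the debugging approach described above (not
kupu's actual code), the sketch below shows how a callback passed to a
compiled pattern's sub() can print every link it is handed. It is written
for current Python 3 rather than the Python 2.4 mentioned earlier in the
thread. LINK_PATTERN here is a simplified stand-in for the real pattern,
and classify(), check_links() and the 'resolveuid/' prefix test are
assumptions made purely for the example.

```python
import re

# Hypothetical, simplified stand-in for kupu's LINK_PATTERN; the real
# pattern in html2captioned.py is different.
LINK_PATTERN = re.compile(r'\b(?:href|src)="(?P<link>[^"]*)"', re.IGNORECASE)


def classify(link):
    """Toy classifier for illustration only; kupu's classifyLink does more."""
    if not link.strip():
        return 'bad'
    if link.startswith('resolveuid/'):  # assumed UID link prefix
        return 'uid'
    if re.match(r'[a-z][a-z0-9+.-]*:', link, re.IGNORECASE):
        return 'external'
    return 'internal'


def check_links(data, base, log=print):
    """Run the pattern over data and log how each link is classified."""
    def checklink(match):
        link = match.group('link')
        classification = classify(link)
        log('link classification: %s, %s, %s' % (link, base, classification))
        return match.group(0)  # leave the markup unchanged

    # sub() calls checklink once per match, so a link that never shows
    # up in the log was never matched by the pattern.
    return LINK_PATTERN.sub(checklink, data)
```

If a link that is visibly present in the page never appears in the log,
the pattern rather than the classifier is the likely culprit.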
Duncan
From sisi at foei.org Thu Apr 12 19:16:30 2007
From: sisi at foei.org (sisi)
Date: Thu, 12 Apr 2007 19:16:30 +0200
Subject: [kupu-dev] Kupu says links are bad when they aren't
In-Reply-To:
References: <461E270C.3060305@foei.org>
Message-ID: <461E696E.3030202@foei.org>
Duncan Booth wrote:
> sisi wrote:
>
>> One question we have is:
>> Where is the code that identifies links for checking? Is it in kupu or
>> is kupu calling it from a function or something in Plone? Because we'd
>> like to look at the code and make some progress that way but we can't
>> find it, using all our ninja powers of grep and find etc :-)
>>
> Aha. I think I can answer that question.
>
> Look in Products/kupu/plone/html2captioned.py
> Specifically the class Migration:
>
> object_check checks the links in a single object.
> classifyLink tries to figure out whether the link is relative, absolute,
> uid based or broken.
>
> What I would suggest is that you put some print statements in checklink:
> e.g. print link, base, classification
> and then run zope foregrounded so you can see how it categorises each link.
> If you disagree with what it's doing, email me with details of what you
> think it should be doing instead.
>
>
> If the problem is that it isn't finding links at all (i.e. checklink isn't
> being called when it should be called) then prints or breakpoints near the
> line:
>
> newdata = LINK_PATTERN.sub(checklink, data)
>
> may help. data is the original value of the field, newdata is the updated
> value after doing the substitution to fix up the links. The regular
> expression matching could well be broken.
>
> Duncan
Wow, we think we've cracked it, thanks Duncan!
So, here is what we think:
Kupu is failing on new lines. One of our migration processes was to run
our html code through html-tidy, which was adding newline characters
inside tags to improve layout.
The errors from kupu looked like this (we inserted the print statements
you recommended):
link classification: ../index.html,
http://84.243.219.191:8080/foeiMultisiteFolder/foei/www/publications/link/100/e23.html,
internal
link classification:
,
http://84.243.219.191:8080/foeiMultisiteFolder/foei/www/publications/link/100/e23.html,
bad
(The link should appear directly after the heading 'link classification:',
but as you can see, there are only newlines.)
Anyway, after some testing this seems to be at least most of our
problem. This explains why it seemed so random, and also why kupu was
solving the problem when we saved a page: it was stripping out those new
lines.
Could you modify your regular expression matching to cope with new lines
dotted around in the html? If you don't have time for this we will try
to run our html through another script to get rid of those new lines,
but we think they're pretty common so maybe it would be worth taking
this into consideration for kupu, as the layout of the HTML text is
independent of the meaning. I am pasting an example bit of code that the
kupu line code chokes on below.
Thanks again,
sisi
From duncan.booth at suttoncourtenay.org.uk Fri Apr 13 16:07:37 2007
From: duncan.booth at suttoncourtenay.org.uk (Duncan Booth)
Date: Fri, 13 Apr 2007 14:07:37 +0000 (UTC)
Subject: [kupu-dev] Kupu says links are bad when they aren't
References: <461E270C.3060305@foei.org>
<461E696E.3030202@foei.org>
Message-ID:
sisi wrote:
> Could you modify your regular expression matching to cope with new lines
> dotted around in the html? If you don't have time for this we will try
> to run our html through another script to get rid of those new lines,
> but we think they're pretty common so maybe it would be worth taking
> this into consideration for kupu, as the layout of the HTML text is
> independent of the meaning. I am pasting an example bit of code that the
> kupu line code chokes on below.
>
> Thanks again,
> sisi
>
>
"../../images/linkhead.gif" width="200" height=
> "76" alt="link" border="0">
>
Ugh. I didn't know you could put spaces around the '=' in an HTML attribute
value: you can't of course in XML, but HTML isn't XML.
Ok, I checked in something which should fix this.
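
The change itself is not shown in this thread. As a rough illustration
only, the sketch below contrasts a pattern that requires the attribute
name, '=' and the quoted value to be adjacent with one that tolerates
whitespace (including line breaks) around the '='. Both patterns, the
find_links() helper and the opening of the sample tag are assumptions for
the example; only the image path and the trailing attributes come from
the quoted markup above.

```python
import re

# Strict: no whitespace allowed between the attribute name, '=' and the value.
STRICT = re.compile(r'\b(?:href|src)="([^"]*)"', re.IGNORECASE)

# Tolerant: \s* permits spaces or newlines on either side of '='.
TOLERANT = re.compile(r'\b(?:href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)

# Markup in the style html-tidy produced, with a line break after '='.
# The '<img src=' opening is assumed; it is not visible in the thread.
SAMPLE = ('<img src=\n"../../images/linkhead.gif" width="200" height=\n'
          '"76" alt="link" border="0">')


def find_links(pattern, html):
    """Return every href/src value the given pattern can find."""
    return pattern.findall(html)


if __name__ == '__main__':
    print(find_links(STRICT, SAMPLE))
    print(find_links(TOLERANT, SAMPLE))
```

The strict pattern finds nothing in the sample, while the tolerant one
recovers the image path, which matches the symptom of links being read as
bare newlines.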
From sisi at foei.org Wed Apr 18 12:45:13 2007
From: sisi at foei.org (sisi)
Date: Wed, 18 Apr 2007 12:45:13 +0200
Subject: [kupu-dev] Kupu says links are bad when they aren't
In-Reply-To:
References: <461E270C.3060305@foei.org> <461E696E.3030202@foei.org>
Message-ID: <4625F6B9.8090105@foei.org>
Duncan Booth wrote:
> sisi wrote:
>>
> "../../images/linkhead.gif" width="200" height=
>> "76" alt="link" border="0">
>>
>
> Ugh. I didn't know you could put spaces around the '=' in an HTML attribute
> value: you can't of course in XML, but HTML isn't XML.
>
> Ok, I checked in something which should fix this.
and it worked! (thanks so so much...)
We also found that kupu was not ignoring things in comments, which
caused us a bit of a headache. Somehow a spacer gif got stuck in a
comment in a template at some point in the last five years, and kupu
complained about it: we had not copied the gif across because it was
not used.
We got around the issue, but it might be good if kupu ignored links
inside of HTML comments.
Though of course our use case might be a little crazy, having had five
years' worth of volunteers hacking about in our html willy nilly. I'm not
sure if too many other sites would have links inside of their comments :-/
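
One way this could be approached, sketched here under assumptions rather
than taken from kupu: remove comment blocks from the markup before
scanning it for links. The COMMENT and LINK patterns and the
links_outside_comments() helper are hypothetical names for the example,
and a regular expression is only an approximation of real HTML comment
parsing.

```python
import re

# Non-greedy match of a whole comment; DOTALL lets it span several lines.
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)

# Simplified link pattern, assumed for this example only.
LINK = re.compile(r'\b(?:href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)


def links_outside_comments(html):
    """Return href/src values, skipping anything inside HTML comments."""
    visible = COMMENT.sub('', html)
    return LINK.findall(visible)


if __name__ == '__main__':
    page = ('<p><a href="page.html">x</a></p>\n'
            '<!-- <img src="spacer.gif"> -->\n'
            '<img src="logo.gif">')
    print(links_outside_comments(page))
```

With this in place, a commented-out spacer image would no longer be
reported, while links in the live markup are still checked.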
Thanks again, saved our bacon!
sisi