[kupu-dev] Kupu says links are bad when they aren't
sisi
sisi at foei.org
Thu Apr 12 19:16:30 CEST 2007
Duncan Booth wrote:
> sisi <sisi at foei.org> wrote:
>
>> One question we have is:
>> Where is the code that identifies links for checking? Is it in kupu or
>> is kupu calling it from a function or something in Plone? Because we'd
>> like to look at the code and make some progress that way but we can't
>> find it, using all our ninja powers of grep and find etc :-)
>>
> Aha. I think I can answer that question.
>
> Look in Products/kupu/plone/html2captioned.py
> Specifically the class Migration:
>
> object_check checks the links in a single object.
> classifyLink tries to figure out whether the link is relative, absolute,
> uid based or broken.
>
> What I would suggest is that you put some print statements in checklink:
> e.g. print link, base, classification
> and then run zope foregrounded so you can see how it categorises each link,
> and if you disagree with what it's doing, email me with details of what you
> think it should be doing instead.
>
>
> If the problem is that it isn't finding links at all (i.e. checklink isn't
> being called when it should be called) then prints or breakpoints near the
> line:
>
> newdata = LINK_PATTERN.sub(checklink, data)
>
> may help. data is the original value of the field, newdata is the updated
> value after doing the substitution to fix up the links. The regular
> expression matching could well be broken.
>
> Duncan
Wow, we think we've cracked it, thanks Duncan!
So, here is what we think:
Kupu is failing on newlines. One of our migration steps was to run
our HTML through html-tidy, which was adding newline characters
inside tags to improve layout.
The errors from kupu looked like this (we inserted the print statements
you recommended):
link classification: ../index.html,
http://84.243.219.191:8080/foeiMultisiteFolder/foei/www/publications/link/100/e23.html,
internal
link classification:
,
http://84.243.219.191:8080/foeiMultisiteFolder/foei/www/publications/link/100/e23.html,
bad
(Directly after the heading 'link classification:' there should be the
link itself, but as you can see, it's just a newline.)
Anyway, after some testing this seems to be at least most of our
problem. It explains why the failures seemed so random, and also why kupu
was solving the problem when we saved a page: it was stripping out those
newlines.
Could you modify your regular expression matching to cope with newlines
dotted around in the HTML? If you don't have time for this we will try
to run our HTML through another script to get rid of those newlines,
but we think they're pretty common, so maybe it would be worth taking
this into consideration for kupu, since the layout of the HTML text is
independent of its meaning. I am pasting below an example bit of code
that the kupu link-checking code chokes on.
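To make the failure concrete, here is a minimal Python sketch of what we
think is going on. Both patterns are our own stand-ins (NAIVE and TOLERANT
are hypothetical names), not kupu's actual LINK_PATTERN, which we have not
copied here. The naive one assumes the attribute value follows the '='
directly; the tolerant one allows whitespace, including newlines, around
the '='.

```python
import re

# Markup as html-tidy left it: newlines inserted after "src=" and "height=".
SAMPLE = ('<a href="../index.html"><img src=\n'
          '"../../images/linkhead.gif" width="200" height=\n'
          '"76" alt="link" border="0"></a>')

# Hypothetical naive pattern (an assumption, NOT kupu's real LINK_PATTERN):
# the quote is optional and the value must start right after "=".
NAIVE = re.compile(
    r'(?P<attr>href|src)=(?P<quote>["\']?)(?P<url>[^"\'>]*)',
    re.IGNORECASE)

# Hypothetical tolerant pattern: optional whitespace (newlines included)
# on either side of "=", and a required, matching pair of quotes.
TOLERANT = re.compile(
    r'(?P<attr>href|src)\s*=\s*(?P<quote>["\'])(?P<url>.*?)(?P=quote)',
    re.IGNORECASE | re.DOTALL)


def find_links(pattern, html):
    """Return the raw text each match captured as the link."""
    return [m.group('url') for m in pattern.finditer(html)]


if __name__ == '__main__':
    # repr() makes a captured bare newline visible in the output.
    print([repr(u) for u in find_links(NAIVE, SAMPLE)])
    print([repr(u) for u in find_links(TOLERANT, SAMPLE)])
```

With the naive pattern the second "link" comes out as a bare newline, which
matches the empty-looking entry in our print output; the tolerant pattern
recovers the image path. Whether kupu's own expression fails for the same
reason is only our guess.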
Thanks again,
sisi
<a href="../index.html"><img src=
"../../images/linkhead.gif" width="200" height=
"76" alt="link" border="0"></a>
--
# sisi nutt # extranet coordinator
# Friends of the Earth International
# PO Box 19199 # 1000 GD Amsterdam # The Netherlands
# Tel 31 20 6221369 # Fax 31 20 6392181 # http://www.foei.org
# email sisi at foei.org # skype foei_sisi