[ftputil] Ftputil seems really slow to distinguish file and folders

Stefan Schwarzer sschwarzer at sschwarzer.net
Fri May 1 10:52:54 CEST 2009


Hi Nicola,

On 2009-04-30 23:50, MailingList SVR wrote:
>> What I like about your reports is that you always provide
>> concrete examples with working test code. Great! :-)
> 
> and your answers are ever rich of quality valuable infos, thanks!

Thanks a lot :-)

>> Running this with ftputil 2.4 on my computer takes about
>> 25 minutes. When increasing the cache size to 2000, the code
>> runs in about 40 seconds. :)
> 
> works in about 40 seconds on my box too, this time is acceptable,
> however on the same directory a standard ftpclient give the directory
> listing in few seconds. I haven't look at ftputil code but host.listdir
> is fast such as filezilla and company (2-3 seconds) what other tasks
> ftputil does in about 35 seconds?

First, I'd like to add that if I leave the print statements
out of the loop, i.e.

>>> def f():
...   for i in lista:
...     isd = host.path.isdir(folder+i)
...     isf = host.path.isfile(folder+i)

I'm down to 25 seconds (with a cache size of 2000). :)

Using only one cache access per name (as listdir does) in the
loop, I get the loop done in 11 seconds. The following function,
including connecting to the server, runs in about 13 seconds:

>>> def h():
...   host = ftputil.FTPHost('ftp.nluug.nl','anonymous','pippo@pippo.com')
...   host.stat_cache.resize(2000)
...   for i in host.listdir(folder):
...     s = host.lstat(folder + i)
>>> %time h()  # built into IPython
CPU times: user 9.38 s, sys: 0.08 s, total: 9.46 s
Wall time: 12.59 s
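
If you need the file/directory distinction rather than the raw stat
result, the same single lookup per name can supply it. The following is
only a sketch: classify_entries is a made-up helper name, and it
assumes that the st_mode field of the lstat result can be inspected
with the standard stat module. Because lstat does not follow symbolic
links, links end up in the "others" list here, which may differ from
what isdir/isfile report.

```python
import stat


def classify_entries(host, folder):
    """Split the names in `folder` into dirs, files and others.

    `host` is expected to behave like an ftputil.FTPHost instance
    (listdir, lstat and path.join are used). Only one lstat call,
    and hence one cache access, is made per name.
    """
    dirs, files, others = [], [], []
    for name in host.listdir(folder):
        mode = host.lstat(host.path.join(folder, name)).st_mode
        if stat.S_ISDIR(mode):
            dirs.append(name)
        elif stat.S_ISREG(mode):
            files.append(name)
        else:
            # Symbolic links and anything else the server lists.
            others.append(name)
    return dirs, files, others
```

Usage would be something like dirs, files, others =
classify_entries(host, folder) after resizing the cache as above.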

However, listdir (needing 1.2 seconds for the directory) only
_stores_ values in the cache while lstat _retrieves_ the info. So
I suspect the retrieval is significantly slower than the storage.

A little test:

>>> def g():
...   for i in lista:
...     s = host.lstat(folder+i)
>>> %prun g()  # built into IPython
         11358799 function calls in 54.934 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5661503   25.974    0.000   38.745    0.000 lrucache.py:110(__cmp__)
     2386   15.896    0.007   54.641    0.023 {_heapq.heapify}
  5661503   12.771    0.000   12.771    0.000 {cmp}
...

The total time of the other calls seems negligible, so ftputil's
own code seems rather innocent. ;-)
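
The profile works out to roughly 2370 comparisons per heapify call
(5661503 / 2386), which would fit a heap of about 2000 entries being
rebuilt on each lookup. The sketch below only illustrates that cost
pattern; it is not lrucache's actual code, and CountingEntry and both
function names are invented for the illustration. It counts how many
comparisons a single heapq.heapify call makes on a list of a given
size (Python 3's heapq compares with __lt__ rather than __cmp__).

```python
import heapq


class CountingEntry(object):
    """Heap entry that counts how often it is compared."""

    comparisons = 0

    def __init__(self, priority):
        self.priority = priority

    def __lt__(self, other):
        CountingEntry.comparisons += 1
        return self.priority < other.priority


def comparisons_per_heapify(size):
    """Return the number of comparisons one heapify call needs."""
    # Descending priorities, so the list is not already a min-heap.
    entries = [CountingEntry(size - i) for i in range(size)]
    CountingEntry.comparisons = 0
    heapq.heapify(entries)
    return CountingEntry.comparisons


def estimated_comparisons(size, lookups):
    """Estimate the total comparisons if every lookup re-heapifies."""
    return lookups * comparisons_per_heapify(size)
```

The point of the exercise: the cost of one heapify grows linearly with
the number of cached entries, so paying it on every lookup makes the
whole loop quadratic in the directory size, whereas a dictionary lookup
stays roughly constant.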

If you're writing an end-user client and need to give the user a
directory listing fast, including the stat'ed information for
each file/dir, the following - untested - approach _might_ work:

- Write a cache class with an interface like that in
  ftp_stat_cache.py. However, use a raw Python dictionary to
  store the stat results. This cache will have no (automatic)
  means to prune old entries, but see below. You can use the
  present cache class as a template.

- Derive a stat class from ftp_stat._Stat:

  class MyStat(ftp_stat._Stat):
      def __init__(self, *args, **kwargs):
          super(MyStat, self).__init__(*args, **kwargs)
          self._lstat_cache = MyCache()

- Derive a class from FTPHost:

  class MyFTPHost(ftputil.FTPHost):
      def __init__(self, *args, **kwargs):
          super(MyFTPHost, self).__init__(*args, **kwargs)
          self._stat = MyStat(self)
          self.stat_cache = self._stat._lstat_cache

- To get a listing, instantiate your custom FTPHost class. Clear
  the cache with host.stat_cache.clear() before retrieving a
  directory listing. Also clear the cache explicitly after you no
  longer need the directory data.
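
A dictionary-backed cache along the lines of the first item could look
like the sketch below. It is just as untested against ftputil as the
rest of the idea. The method names (enable, disable, resize, clear,
invalidate and the container methods) are an assumption about what the
cache interface needs; please compare them with ftp_stat_cache.py
before relying on them, including which exception a cache miss should
raise (a plain KeyError is used here).

```python
class MyCache(object):
    """Unbounded stat cache backed by a plain dictionary.

    Hypothetical sketch: there is no automatic pruning, so the
    caller has to call clear() when the data is no longer needed.
    """

    def __init__(self):
        self._cache = {}
        self._enabled = True

    def enable(self):
        self._enabled = True

    def disable(self):
        self._enabled = False

    def resize(self, new_size):
        # A dictionary has no size limit, so there is nothing to do.
        pass

    def clear(self):
        self._cache.clear()

    def invalidate(self, path):
        # Forget a single path; a missing path is not an error.
        self._cache.pop(path, None)

    def __getitem__(self, path):
        if not self._enabled:
            raise KeyError(path)
        return self._cache[path]

    def __setitem__(self, path, stat_result):
        if self._enabled:
            self._cache[path] = stat_result

    def __contains__(self, path):
        return self._enabled and path in self._cache

    def __len__(self):
        return len(self._cache)
```

Storing and retrieving an entry are then both single dictionary
operations, so the lookup cost no longer depends on how many entries
the cache holds.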

Keep in mind that if your software isn't interactive, you most
probably don't need to worry about tuning at all. [1] ... And
_with_ interactivity, I just measured: Nautilus 2.24.1 needs
about 8 seconds to show the directory listing, so I think the
13 seconds I got with the pure lstat calls (see above) are not so
bad! Also remember that most users will rarely deal with
directories this large. For most directories, you won't notice
any difference.

If you nevertheless tried out the above idea, I (and maybe other
readers of the list) would be very thankful if you shared your
results. :-)

> It is possible to have a fastest
> listing or the code is already optimized?

As far as I remember, I haven't optimized the code at all because
I haven't had a use case yet where the code was too slow for me. I
don't know how much tuning has gone into lrucache [2], though.

[1] Please read
    http://sschwarzer.com/download/optimization_europython2006.pdf
    if you haven't already done so. :)

[2] http://pypi.python.org/pypi/lrucache/0.2 (It's listed there
    as alpha code, but when I asked over two years ago, the
    author replied that it had been used successfully in several
    projects and that it had comprehensive unit tests. I haven't
    had any complaints either.)

Best regards,
Stefan

