[ftputil] Stat'ing a Whole Directory

Stefan Schwarzer sschwarzer at sschwarzer.net
Fri Jul 21 01:17:14 CEST 2006


Hi Dan,

On 2006-07-20 20:39, Dan Milstein wrote:
> c) Low-level/high-level
>
> A thought: yes, returning a list of stat results feels a bit low-
> level.  Esp since the reason I need the stat results and not just the
> names is to do things like isfile() and isdir(), which I can then
> only do by explicitly calling stat.S_ISDIR(stat_result.st_mode).  Blech.

Hm, so I currently don't think that this is the way to go.
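
To make the comparison concrete, the explicit check on raw stat
results would look roughly like the sketch below. It is only an
illustration: it uses a local directory with `os.listdir` and
`os.lstat` as a stand-in for a remote listing, and the function
name `classify` is made up for this example.

```python
import os
import stat


def classify(directory):
    """Return (dir_names, file_names) for `directory` from raw stat results."""
    dir_names, file_names = [], []
    for name in sorted(os.listdir(directory)):
        # Local stand-in for a stat result obtained from the server.
        stat_result = os.lstat(os.path.join(directory, name))
        if stat.S_ISDIR(stat_result.st_mode):
            dir_names.append(name)
        elif stat.S_ISREG(stat_result.st_mode):
            file_names.append(name)
    return dir_names, file_names
```

Every caller has to repeat the `stat.S_ISDIR(...)` dance, which is
what `isdir()` and `isfile()` already hide.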

> What might be nice, but would move just a bit away from the os.path
> analogy, would be to make ftphost.path be able to run
> isfile()/isdir()/etc on both path names and also FTPFile objects (or
> something like FTPFile objects -- see below).

What I would like to avoid is to add an interface that basically
does the same as an already present interface. Look at the
current Python way to get the length of a string: len(s) . A
similar interface, regarding the purpose, might be s.len() . Note
that I don't say that the former way is better, just that there
shouldn't be an _additional_ way if there is already one.

> If you are interested, what I'd like to do is add a method to ftphost:
>
> getfiles(path)

Should the files be opened in text or binary mode? Should they be
opened for reading or writing?

At the moment, this could be implemented like this (below),
assuming you want files to be read in binary mode (untested code,
beware):

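def getfiles(path):
    # `host` is an existing FTPHost instance.
    files = []
    for name in host.listdir(path):
        # listdir() returns bare names, so join them with `path`,
        # not with the current directory.
        complete_path = host.path.join(path, name)
        if host.path.isfile(complete_path):
            files.append(host.file(complete_path, 'rb'))
    return files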

or, assuming this is a method of `FTPHost`,

def getfiles(self, path):
    files = []
    for name in self.listdir(path):
        # listdir() returns bare names, so join them with `path`,
        # not with the current directory.
        complete_path = self.path.join(path, name)
        if self.path.isfile(complete_path):
            files.append(self.file(complete_path, 'rb'))
    return files

(I only wrote this code down to see how much effort could be
saved by having the new method. Another criterion is the actual
usefulness.)

> Which returns a list of FTPFile-like objects (with stat and path
> information hidden in them).  They would lazily support the file
> interface, but would only connect to the remote server if someone
> actually reads or writes their contents.  And then I would extend
> ftp_path.isdir()/isfile()/etc to handle those objects as well as path
> names.
>
> So that, if you need to operate on a directory, to say, copy all
> files, and recurse over all subdirectories, you could do something like:
>
> def copy_all_files(path):
>      for f in ftphost.getfiles(path):
>          if ftphost.path.isdir(f):
>             copy_all_files(f.path)
>          else:
>             ftphost.copyfileobj(f, file(f.name))

The `path` and `name` attributes are a bit tricky. Should `path`
be the absolute path of the file/directory on the server, and
`name` just the tail of that path, similar to what
`os.path.basename(path)` returns?

If yes, your code iterates recursively through a directory on the
server and copies all the files into a single directory on the
client's local host (actually without closing the files).

Let me try an alternative implementation which uses the current
`FTPHost` class (again, untested).

def copy_all_files(host, path):
    for dir_path, dir_names, file_names in host.walk(path):
        for file_name in file_names:
            absolute_path = host.path.join(dir_path, file_name)
            host.download(absolute_path, file_name)
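
Like your version, this puts every file into a single local
directory. If the remote layout should be preserved instead, a
variant could look like the sketch below (also untested against
a real server). It assumes that `host.walk` yields directory
paths that start with `remote_root` and use "/" as separator, and
it calls `host.download(source, target)` as above; depending on
the ftputil version, a transfer mode argument may be needed as
well. The name `copy_tree` is made up for this example.

```python
import os


def copy_tree(host, remote_root, local_root):
    """Download all files below `remote_root`, mirroring the directories."""
    for dir_path, dir_names, file_names in host.walk(remote_root):
        # Path of the current remote directory relative to the start.
        relative_dir = dir_path[len(remote_root):].lstrip("/")
        if relative_dir:
            local_dir = os.path.join(local_root, *relative_dir.split("/"))
        else:
            local_dir = local_root
        # Create the matching local directory before downloading into it.
        os.makedirs(local_dir, exist_ok=True)
        for file_name in file_names:
            remote_path = host.path.join(dir_path, file_name)
            host.download(remote_path, os.path.join(local_dir, file_name))
```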

> Which feels fairly clean and intuitive,

My code feels about as clean and intuitive, I hope. :-)

> and would behave pretty
> nicely in terms of how it hits the remote server (whereas, with the
> current implementation, the lib is fetching the entire directory
> listing over and over again for every call to isdir()).

Of course, the _current_ performance isn't good; that's why we
started to discuss caching, after all. However, I would like to
improve the performance _for the current interface_, not add an
additional interface and leave it to the client to pick the right
interface to get good performance. In that case, the "server"
(ftputil) code is lazy, shifting the responsibility for
optimization to every client (user of ftputil). IMHO, it should
be the other way around: ftputil should improve its performance
while keeping its interface, thus making things easy for users of
the module.
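
To illustrate the idea (not how ftputil implements or will
implement it), here is a minimal sketch of caching behind an
unchanged `isdir()`/`isfile()` interface. The class name
`CachingPath` and the `fetch_listing` callable are hypothetical;
cache invalidation and the root directory are ignored.

```python
import posixpath


class CachingPath:
    """Hypothetical helper: answers isdir()/isfile() from cached listings."""

    def __init__(self, fetch_listing):
        # `fetch_listing(directory)` is an assumed callable that returns a
        # dict mapping each entry name to True (directory) or False (file).
        self._fetch_listing = fetch_listing
        self._cache = {}

    def _entry(self, path):
        directory, name = posixpath.split(path)
        if directory not in self._cache:
            # Only the first lookup in a directory fetches the listing.
            self._cache[directory] = self._fetch_listing(directory)
        # None means the name is not present in the listing.
        return self._cache[directory].get(name)

    def isdir(self, path):
        return self._entry(path) is True

    def isfile(self, path):
        return self._entry(path) is False
```

Client code keeps calling `isdir()` and `isfile()` as before; the
listing for a directory is fetched only once, no matter how many
of its entries are tested.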

Of course, there's a tradeoff: If ftputil doesn't expose the
means for its optimizations, clients have to live with that; they
won't be able (or at least not as easily) to fiddle with ftputil
to gain better performance. On the other hand, that's the same
kind of tradeoff we make when we choose Python (less control,
easier to use) over C (potentially more control, more difficult
to use).

Best wishes
Stefan

