[ftputil] Stat'ing a Whole Directory

Dan Milstein danmil at comcast.net
Fri Jul 21 18:01:58 CEST 2006


Stefan,

Thanks for the detailed response.  I'll make one more bid for adding  
something like getfiles (or statdir), and then I'll trundle along on  
my own ;-)

In terms of the performance / making client code solve it / caching:  
IMHO, caching is at least as nasty from a client perspective as  
adding a somewhat-redundant method.  With caching, the client has to  
be aware of what the caching state is, and how to flush it, in order  
to get their code to behave consistently.  There are opportunities  
for the client programmer to make subtle mistakes which fail silently  
(e.g., in testing everything works fine, but in certain obscure  
cases, a client call returns an invalid version of a file).
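
To make the "fails silently" worry concrete, here is a toy sketch of a cache in front of a size lookup. Everything in it (CachingStat, invalidate, the remote_sizes dict) is made up for illustration and is not ftputil code; the point is only that the stale read raises no error.

```python
class CachingStat(object):
    """Toy cache in front of a stat-like lookup (hypothetical names)."""

    def __init__(self, lookup):
        self._lookup = lookup      # callable that "asks the server"
        self._cache = {}

    def getsize(self, path):
        if path not in self._cache:
            self._cache[path] = self._lookup(path)
        return self._cache[path]

    def invalidate(self, path=None):
        # The client has to know this exists and when to call it.
        if path is None:
            self._cache.clear()
        else:
            self._cache.pop(path, None)


remote_sizes = {"report.txt": 100}       # pretend server state
cached = CachingStat(remote_sizes.__getitem__)

before = cached.getsize("report.txt")
remote_sizes["report.txt"] = 250         # file changes on the server
stale = cached.getsize("report.txt")     # old value comes back, no error
cached.invalidate("report.txt")
fresh = cached.getsize("report.txt")     # correct only after the flush
```

Nothing in the second getsize() call tells the caller the answer is out of date, which is the kind of mistake I'd expect to slip through testing.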

Given that you want to keep the current interface (and I say right  
on ;-),  I guess my argument is that adding caching and then exposing  
methods to flush or keep the cache in sync is not keeping the current  
interface.  And if you're going to change it, adding a  
comprehensible, non-stateful method feels simpler to me.

I'm all for making interfaces hide the complexity of implementations,  
but, when the implementation involves remote network access, I very  
much like an extra method or two which allows the client to make  
their own decisions about performance tradeoffs in a simple way.   
Just as the DB-API cursor interface has fetchone(), fetchmany() and fetchall().
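
For reference, this is roughly what that choice looks like with a DB-API cursor, sketched here with the sqlite3 module. The table name and rows are invented for the example.

```python
import sqlite3

# Hypothetical table and rows, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("a.txt", 10), ("b.txt", 20), ("c.txt", 30), ("d.txt", 40)],
)

cur = conn.execute("SELECT name, size FROM files ORDER BY name")
first = cur.fetchone()      # one row at a time
batch = cur.fetchmany(2)    # a batch whose size the caller picks
rest = cur.fetchall()       # everything that is left
conn.close()
```

All three methods answer the same question; the caller chooses how much to pull per call, which is the kind of simple performance knob I have in mind.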

In terms of writing getfiles() using the current interface: I think I  
didn't explain something clearly enough.  The results returned from  
getfiles() would *not* be simple FTPFiles, because ftp_path.isfoo()  
can't return information about FTPFiles.  They'd be objects which  
hold both an FTPFile (only opened if it was read from), and also a  
stat_result.  And then ftp_path.isfoo() would have to grow the  
ability to know what it was passed and handle it intelligently.   
Which would make ftp_path.isfoo() subtly different from os.path.isfoo(),
but in a way which I think people would quickly catch onto.

Actually, what I'd do is make FTPFiles behave this way all the time  
(meaning, they always have or are able to obtain a stat_result).  But  
again, you can do that lazily, and it's hidden from the user.  From  
the user perspective, all they need to learn is: there's a getfiles()  
method as well as a file() method, and ftp_path methods can operate  
on both pathnames and FTPFile objects.  And then a note on  
performance says "path.isfoo(path) methods contact the server on  
every call, consider using getfiles() if you just need file  
information".
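
A rough sketch of what I mean, not a proposal for the actual implementation: FileEntry, StatResult, the opener callable and the remote_stat callable are all hypothetical names, and the "server" is faked with an in-memory buffer.

```python
import io
import stat
from collections import namedtuple

# Hypothetical stand-in for a stat result; only the fields used here.
StatResult = namedtuple("StatResult", "st_mode st_size st_mtime")


class FileEntry(object):
    """Hypothetical getfiles() result: a name plus its stat data.

    The remote file is opened only when its contents are requested.
    """

    def __init__(self, name, stat_result, opener):
        self.name = name
        self.stat_result = stat_result
        self._opener = opener      # callable that opens the remote file
        self._file = None

    def read(self):
        if self._file is None:     # lazy: connect on first use only
            self._file = self._opener(self.name)
        return self._file.read()


def isdir(path_or_entry, remote_stat):
    """Accept a path name or a FileEntry.

    remote_stat is a hypothetical callable that contacts the server.
    """
    if isinstance(path_or_entry, FileEntry):
        mode = path_or_entry.stat_result.st_mode    # no server contact
    else:
        mode = remote_stat(path_or_entry).st_mode   # one remote call
    return stat.S_ISDIR(mode)


# Usage with fake data; nothing here talks to a real server.
opened = []

def fake_opener(name):
    opened.append(name)
    return io.BytesIO(b"hello")

def fake_remote_stat(path):
    raise AssertionError("would contact the server for %r" % path)

docs = FileEntry("docs", StatResult(stat.S_IFDIR | 0o755, 0, 0), fake_opener)
notes = FileEntry("notes.txt", StatResult(stat.S_IFREG | 0o644, 5, 0),
                  fake_opener)

docs_is_dir = isdir(docs, fake_remote_stat)     # answered from stat data
notes_is_dir = isdir(notes, fake_remote_stat)
opened_before_read = list(opened)               # nothing opened yet
contents = notes.read()                         # first real "connection"
```

The isinstance check is the whole cost of the overloading: path names keep behaving as they do today, and entry objects answer from data they already carry.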

Yes, my toy example is easy to replicate with walk(). And there are  
some important issues with modes and name/path attributes which I  
haven't fully thought through.  Perhaps a better toy example (and  
this more closely models what I'm doing in my app), would be to  
display to a user an annotated directory listing of a single directory:

for f in host.getfiles(path):
    if host.path.isdir(f):
        print "%s (directory)" % f.name
    elif host.path.islink(f):
        print "%s (link)" % f.name
    else:
        print "%s (size: %s, last_modified: %s)" % (
            f.name, host.path.getsize(f), host.path.getmtime(f))

You could do this using listdir, but all those calls to host.path.foo()
would be deadly.
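
To put a number on "deadly", here is a toy model that counts simulated round trips. CountingHost is a made-up stand-in, not ftputil's real code; it assumes, as discussed in this thread, that each per-path query re-fetches the directory listing, and its getfiles() is the hypothetical method I'm arguing for.

```python
import stat


class CountingHost(object):
    """Toy stand-in for an FTP host; NOT ftputil's implementation."""

    def __init__(self, listing):
        self._listing = listing      # name -> (mode, size)
        self.fetches = 0

    def _fetch_listing(self):
        self.fetches += 1            # one simulated round trip
        return dict(self._listing)

    def listdir(self):
        return sorted(self._fetch_listing())

    def isdir(self, name):
        return stat.S_ISDIR(self._fetch_listing()[name][0])

    def getsize(self, name):
        return self._fetch_listing()[name][1]

    def getfiles(self):
        # Hypothetical: one fetch, all information returned together.
        return sorted(self._fetch_listing().items())


LISTING = {
    "docs": (stat.S_IFDIR | 0o755, 0),
    "a.txt": (stat.S_IFREG | 0o644, 120),
    "b.txt": (stat.S_IFREG | 0o644, 300),
}

# Style 1: listdir() plus per-name calls.
host = CountingHost(LISTING)
for name in host.listdir():
    if not host.isdir(name):
        host.getsize(name)
per_file_fetches = host.fetches

# Style 2: a single getfiles() call.
host = CountingHost(LISTING)
sizes = [size for name, (mode, size) in host.getfiles()
         if not stat.S_ISDIR(mode)]
bulk_fetches = host.fetches
```

With three entries the first style already costs six fetches against one, and the gap grows with every extra file and every extra isfoo() call per file.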

Having a getfiles() method would allow the client code to efficiently  
do things like: sync an entire tree in both directions; search for  
files with recent modification dates; list all files over a certain  
size, etc.  Without having to make a remote call for every file in  
the tree, and without any worries about cache consistency.  In  
general, it gives the client the ability to efficiently obtain and  
act on completely up-to-date file *information* en masse.  And the  
extra complexity cost to the interface for ftputil is adding one  
method to ftphost, and overloading the various ftp_path methods.   
Which, to me, feels like a win.
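
As a sketch of the kind of client code this enables, assume getfiles() handed back records with a name and the usual stat fields. The Entry tuple and the sample listing are hypothetical; the filters are plain local computation with no further server contact.

```python
import stat
from collections import namedtuple

# Hypothetical record for one directory entry plus its stat data.
Entry = namedtuple("Entry", "name st_mode st_size st_mtime")


def files_larger_than(entries, min_size):
    """Names of regular files bigger than min_size bytes."""
    return [e.name for e in entries
            if stat.S_ISREG(e.st_mode) and e.st_size > min_size]


def modified_since(entries, cutoff):
    """Names of entries whose mtime is at or after cutoff (epoch secs)."""
    return [e.name for e in entries if e.st_mtime >= cutoff]


# Made-up listing standing in for the result of one getfiles() call.
entries = [
    Entry("docs", stat.S_IFDIR | 0o755, 4096, 1000),
    Entry("small.txt", stat.S_IFREG | 0o644, 200, 2000),
    Entry("big.iso", stat.S_IFREG | 0o644, 5000000, 3000),
]

big = files_larger_than(entries, 1000)
recent = modified_since(entries, 2000)
```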

And now, if you still don't want it, I'll stop bugging you ;-)

-Dan


On Jul 20, 2006, at 7:17 PM, Stefan Schwarzer wrote:

> Hi Dan,
>
> On 2006-07-20 20:39, Dan Milstein wrote:
>> c) Low-level/high-level
>>
>> A thought: yes, returning a list of stat results feels a bit low-
>> level.  Esp since the reason I need the stat results and not just the
>> names is to do things like isfile() and isdir(), which I can then
>> only do by explicitly calling stat.S_ISDIR(stat_result.st_mode).   
>> Blech.
>
> hm, so I currently don't think that this is the way to go.
>
>> What might be nice, but would move just a bit away from the os.path
>> analogy, would be to make ftphost.path be able to run
>> isfile()/isdir()/etc on both path names and also FTPFile objects (or
>> something like FTPFile objects -- see below).
>
> What I would like to avoid is to add an interface that basically
> does the same as an already present interface. Look at the
> current Python way to get the length of a string: len(s) . A
> similar interface, regarding the purpose, might be s.len() . Note
> that I don't say that the former way is better, just that there
> shouldn't be an _additional_ way if there is already one.
>
>> If you are interested, what I'd like to do is add a method to  
>> ftphost:
>>
>> getfiles(path)
>
> Should the files be opened in text or binary mode? Should they be
> opened for reading or writing?
>
> At the moment, this could be implemented like this (below),
> assuming you want files to be read in binary mode (untested code,
> beware):
>
> def getfiles(path):
>     files = []
>     for name in host.listdir(path):
>         # join with `path`: listdir() returns names relative to it
>         complete_path = host.path.join(path, name)
>         if host.path.isfile(complete_path):
>             files.append(host.file(complete_path, 'rb'))
>     return files
>
> or, assuming this is a method of `FTPHost`,
>
> def getfiles(self, path):
>     files = []
>     for name in self.listdir(path):
>         # join with `path`: listdir() returns names relative to it
>         file_path = self.path.join(path, name)
>         if self.path.isfile(file_path):
>             files.append(self.file(file_path, 'rb'))
>     return files
>
> (I only wrote this code down to see how much effort could be
> saved by having the new method. Another criterion is the actual
> usefulness.)
>
>> Which returns a list of FTPFile-like objects (with stat and path
>> information hidden in them).  They would lazily support the file
>> interface, but would only connect to the remote server if someone
>> actually reads or writes their contents.  And then I would extend
>> ftp_path.isdir()/isfile()/etc to handle those objects as well as path
>> names.
>>
>> So that, if you need to operate on a directory, to say, copy all
>> files, and recurse over all subdirectories, you could do something  
>> like:
>>
>> def copy_all_files(path):
>>      for f in ftphost.getfiles(path):
>>          if ftphost.path.isdir(f):
>>             copy_all_files(f.path)
>>          else:
>>             ftphost.copyfileobj(f, file(f.name))
>
> The `path` and `name` attributes are a bit tricky. Should `path`
> be the absolute path of the file/directory of the server, and
> `name` just the tail of the path, similar to what
> `os.path.basename(path)` returns?
>
> If yes, your code iterates recursively through a directory on the
> server and copies all the files into a single directory on the
> client's local host (actually without closing the files).
>
> Let me try an alternative implementation which uses the current
> `FTPHost` class (again, untested).
>
> def copy_all_files(host, path):
>     for dir_path, dir_names, file_names in host.walk(path):
>         for file_name in file_names:
>             absolute_path = host.path.join(dir_path, file_name)
>             host.download(absolute_path, file_name)
>
>> Which feels fairly clean and intuitive,
>
> My code feels about as clean and intuitive, I hope. :-)
>
>> and would behave pretty
>> nicely in terms of how it hits the remote server (whereas, with the
>> current implementation, the lib is fetching the entire directory
>> listing over and over again for every call to isdir()).
>
> Of course, the _current_ performance isn't good, that's why we
> started to discuss caching, after all. However, I would like to
> improve the performance _for the current interface_, not add an
> additional interface and let the client deal with the right
> interface to improve performance. In that case, the "server"
> (ftputil) code is lazy, shifting the responsibility of
> optimization to every client (user of ftputil). IMHO, it should
> be the other way around: ftputil should improve its performance
> while keeping its interface, thus making it easy for users of the
> module.
>
> Of course, there's a tradeoff: If ftputil doesn't expose the
> means for its optimizations, clients have to cope with it; they
> won't be able (or not as easily) to fiddle with ftputil to gain
> better performance. On the other hand, that's the same kind of
> tradeoff when we choose Python (less control, easier to use) over
> C (potentially more control, more difficult to use).
>
> Best wishes
> Stefan


