From hraban at fiee.net Wed Oct 13 15:51:44 2010
From: hraban at fiee.net (Henning Hraban Ramm)
Date: Wed, 13 Oct 2010 15:51:44 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
Message-ID:

Hello Stefan and all the silent folks,

I run into a strange problem:

* If the current dir is the root of my FTP account
* and I start ftputil's walk (host.walk or host.path.walk)
* with a root of '.',

my script hangs _or_ breaks at the point where it encounters files or
directories whose names contain non-ASCII chars.

I still don't understand, when I'll get a traceback and when not.

The traceback I get with ftputil 2.4.2 is:

  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 549, in __call_with_parser_retry
    result = method(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 421, in _real_listdir
    loop_path = self._path.join(path, stat_result._st_name)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 41: ordinal not in range(128)

In ftputil 2.5b it is:

  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 565, in __call_with_parser_retry
    result = method(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 437, in _real_listdir
    loop_path = self._path.join(path, stat_result._st_name)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b

(In this case some file names contain '?'.)

With a root of '/', './' or every other absolute or relative path I
tried, it works (I get everything as byte strings, regardless what
their encoding might once have been, and that's expected FTP behaviour).
Since the real problem seems to be in posixpath, I don't know if it's
possible to work around this in ftputil.

Somewhat confused greetlings from Lake Constance!
Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)

From sschwarzer at sschwarzer.net Sun Oct 17 18:02:22 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Sun, 17 Oct 2010 18:02:22 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
In-Reply-To:
References:
Message-ID: <4CBB1E0E.5090408@sschwarzer.net>

Hi Henning,

Thanks for reporting!

On 2010-10-13 15:51, Henning Hraban Ramm wrote:
> Hello Stefan and all the silent folks,

They're silent, indeed. ;-)

> I run into a strange problem:
>
> * If the current dir is the root of my FTP account
> * and I start ftputil's walk (host.walk or host.path.walk)
> * with a root of '.',

'.' or u'.' or either?

> my script hangs _or_ breaks at the point where it encounters files or
> directories whose names contain non-ASCII chars.

I believe by "hang" you mean it blocks and can only be
stopped by pressing ? With "break" you mean the
traceback, don't you?

> I still don't understand, when I'll get a traceback and when not.

Does that mean you can't reproduce the bug on a "constant"
file system or does it mean you can reproduce it, but just
don't know what causes it?

> The traceback I get with ftputil 2.4.2 is:
>
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/site-packages/ftputil/ftp_stat.py", line 549, in
> __call_with_parser_retry
>     result = method(*args, **kwargs)
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/site-packages/ftputil/ftp_stat.py", line 421, in _real_listdir
>     loop_path = self._path.join(path, stat_result._st_name)
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/posixpath.py", line 70, in join
>     path += '/' + b
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position
> 41: ordinal not in range(128)
>
> In ftputil 2.5b it is:
>
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/site-packages/ftputil/ftp_stat.py", line 565, in
> __call_with_parser_retry
>     result = method(*args, **kwargs)
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/site-packages/ftputil/ftp_stat.py", line 437, in _real_listdir
>     loop_path = self._path.join(path, stat_result._st_name)
>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/posixpath.py", line 70, in join
>     path += '/' + b

The last part is missing here, but I assume, you're getting
a `UnicodeDecodeError` as well? With the same byte offset in
the same file name?

> (In this case some file names contain '?'.)
>
> With a root of '/', './' or every other absolute or relative path I
> tried, it works (I get everything as byte strings, regardless what
> their encoding might once have been, and that's expected FTP behaviour).

I think so, too.

> Since the real problem seems to be in posixpath, I don't know if it's
> possible to work around this in ftputil.

We'll see. ... Can you prepare an archive file (zip, tar.gz
or similar) which contains a minimum file system which leads
to such tracebacks or hangs?

What's your actual code using ftputil? Do you have some
working snippet which causes the bug?
Does the bug occur if you do something like

    for dir, dirnames, filenames in ftp_host.walk("."):
        pass

?

Does it make any difference if the problematic file is in
the starting directory ("top") of the `os.walk` scan or not?

Henning, if you like, you can file a bug in the Trac-System.
http://ftputil.sschwarzer.net/trac/newticket ; please login
with user/password ftputiluser/ftputil .

Stefan

From hraban at fiee.net Sun Oct 17 21:32:59 2010
From: hraban at fiee.net (Henning Hraban Ramm)
Date: Sun, 17 Oct 2010 21:32:59 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
In-Reply-To: <4CBB1E0E.5090408@sschwarzer.net>
References: <4CBB1E0E.5090408@sschwarzer.net>
Message-ID:

On 2010-10-17 at 18:02, Stefan Schwarzer wrote:

>> * If the current dir is the root of my FTP account
>> * and I start ftputil's walk (host.walk or host.path.walk)
>> * with a root of '.',
>
> '.' or u'.' or either?

Checking again, my report proves wrong:
It works with '.', but fails with *every* unicode root.

>> my script hangs _or_ breaks at the point where it encounters files or
>> directories whose names contain non-ASCII chars.
>
> I believe by "hang" you mean it blocks and can only be
> stopped by pressing ? With "break" you mean the
> traceback, don't you?

Indeed.

>> I still don't understand, when I'll get a traceback and when not.
>
> Does that mean you can't reproduce the bug on a "constant"
> file system or does it mean you can reproduce it, but just
> don't know what causes it?

I can't reproduce blocking. It occurred irregularly and now it doesn't
any more. Maybe it was some network issue.

>>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
>> python2.6/site-packages/ftputil/ftp_stat.py", line 565, in
>> __call_with_parser_retry
>>     result = method(*args, **kwargs)
>>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
>> python2.6/site-packages/ftputil/ftp_stat.py", line 437, in
>> _real_listdir
>>     loop_path = self._path.join(path, stat_result._st_name)
>>   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
>> python2.6/posixpath.py", line 70, in join
>>     path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position
41: ordinal not in range(128)

> The last part is missing here, but I assume, you're getting
> a `UnicodeDecodeError` as well? With the same byte offset in
> the same file name?

Yes, exactly the same. Sorry for cutting it off.

>> Since the real problem seems to be in posixpath, I don't know if it's
>> possible to work around this in ftputil.
>
> We'll see. ... Can you prepare an archive file (zip, tar.gz
> or similar) which contains a minimum file system which leads
> to such tracebacks or hangs?

Don't need any archive for that:

    import ftputil
    host = ftputil.FTPHost('ftp.xxx.de', 'xxx', 'xxx')
    for (lroot, dirs, files) in host.walk(u'.'):
        print lroot, dirs, files

If you think it's a problem with one server, I can send you the
account data in private mail. (It's semi-public, but not enough for a
public mailing list.)

> What's your actual code using ftputil? Do you have some
> working snippet which causes the bug? Does the bug occur if
> you do something like
>
>     for dir, dirnames, filenames in ftp_host.walk("."):
>         pass
>
> ?

Yes, see above.

> Does it make any difference if the problematic file is in
> the starting directory ("top") of the `os.walk` scan or not?

No. It just fails earlier ;-)

> Henning, if you like, you can file a bug in the Trac-System.
> http://ftputil.sschwarzer.net/trac/newticket ; please login
> with user/password ftputiluser/ftputil .

Hm, I don't know if it's actually an error:
If I give ftputil a unicode root, it tries to read file names as
unicode and fails if they aren't valid unicode but bytes in some
encoding (including utf-8).
Since you'll never get Python unicode strings from an FTP server (I
guess), you can consider it an error of ftputil.

Perhaps it would be enough if you write a warning in the docs?

In my case I know that most files in one directory tree are in
Latin-1, while most files in another tree on the same server are in
UTF-8, depending on the users and their FTP clients.
(I can't educate them to only use valid filenames...) So not even an
encoding setting for the FTPHost would help.

Greetlings from Lake Constance!
Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)

From sschwarzer at sschwarzer.net Thu Oct 21 00:20:54 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Thu, 21 Oct 2010 00:20:54 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
In-Reply-To:
References: <4CBB1E0E.5090408@sschwarzer.net>
Message-ID: <4CBF6B46.9070507@sschwarzer.net>

Hi Henning,

Thanks for looking into this bug.

On 2010-10-17 21:32, Henning Hraban Ramm wrote:
> On 2010-10-17 at 18:02, Stefan Schwarzer wrote:
> Checking again, my report proves wrong:
> It works with '.', but fails with *every* unicode root.
> [...]
>
>     import ftputil
>     host = ftputil.FTPHost('ftp.xxx.de', 'xxx', 'xxx')
>     for (lroot, dirs, files) in host.walk(u'.'):
>         print lroot, dirs, files

I can reproduce the problem!

> Hm, I don't know if it's actually an error:
> If I give ftputil a unicode root, it tries to read file names as
> unicode and fails if they aren't valid unicode but bytes in some
> encoding (including utf-8).
> Since you'll never get Python unicode strings from an FTP server (I
> guess), you can consider it an error of ftputil.

I guess you're right. As far as I know, FTP has no concept
of encodings, it just reads and writes bytes.
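
The failure discussed here can be reproduced without an FTP server.
What follows is only a minimal sketch in Python 3, not code from
ftputil or posixpath: `py2_style_join` is a hypothetical helper that
imitates what Python 2 did implicitly when a unicode root such as u'.'
was concatenated with a byte-string file name, namely decoding the
bytes with the default ASCII codec. The Latin-1 file name is made up
for illustration.

```python
def py2_style_join(root, name):
    """Hypothetical helper: join two path parts roughly the way Python 2's
    posixpath.join behaved when text and byte strings were mixed."""
    if isinstance(root, str) and isinstance(name, bytes):
        # Python 2 promoted the byte string to unicode with the default
        # ASCII codec. This implicit step is the one that fails.
        name = name.decode("ascii")
    separator = "/" if isinstance(root, str) else b"/"
    return root + separator + name


# A Latin-1 encoded name containing byte 0xdf, as a server might send it.
latin1_name = b"Stra\xdfe"

# Byte-string root: the parts are only concatenated, nothing is decoded.
print(py2_style_join(b".", latin1_name))

# Text root (the u'.' case): the implicit ASCII decode raises.
try:
    py2_style_join(".", latin1_name)
except UnicodeDecodeError as error:
    print("unicode root fails:", error)
```

This matches the observation that a byte-string root works while every
unicode root fails as soon as a non-ASCII name shows up.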

> Perhaps it would be enough if you write a warning in the docs?

I think, the "right" thing to do would be to fail as early
as possible if a unicode argument is passed into ftputil for
a directory or file name on the server. Maybe something
along the lines of:

    def some_method(self, directory):
        # raise an `UnicodeEncodeError` as soon as possible
        str(directory)
        ...

This way, 7-bit ASCII unicode strings and any byte strings
will still work as arguments, but other arguments will fail.

And yes, there should be a note on this in the docs. :-)

> In my case I know that most files in one directory tree are in
> Latin-1, while most files in an other tree on the same server are in
> UTF-8, depending on the users and their FTP clients.
> (I can't educate them to only use valid filenames...) So not even an
> encoding setting for the FTPHost would help.

As you imply, ftputil can't know the encoding, so I think
the safest thing to do is to fail loudly. I'll file a
ticket.

Stefan

From sschwarzer at sschwarzer.net Thu Oct 21 00:29:37 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Thu, 21 Oct 2010 00:29:37 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
In-Reply-To: <4CBF6B46.9070507@sschwarzer.net>
References: <4CBB1E0E.5090408@sschwarzer.net> <4CBF6B46.9070507@sschwarzer.net>
Message-ID: <4CBF6D51.4010607@sschwarzer.net>

Hi Henning,

On 2010-10-21 00:20, Stefan Schwarzer wrote:
>     def some_method(self, directory):
>         # raise an `UnicodeEncodeError` as soon as possible
>         str(directory)
>         ...
>
> This way, 7-bit ASCII unicode strings and any byte strings
> will still work as arguments, but other arguments will fail.

Nonsense. :) Of course that won't help if the argument is
ASCII (like u".") but the encoding problem happens on the
way "down" the tree.

But a variant of the idea might work:

    def some_method(self, directory):
        directory = str(directory)
        ...

That way, we should receive byte strings for the directories
and files.
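
The conversion idea can be tried in isolation. This is a hedged sketch
in Python 3, where text and bytes are separate types; `as_byte_path` is
a hypothetical name, not part of ftputil. It relies on the fact that
Python 2's `str(u"...")` encodes with the ASCII codec, so the rough
Python 3 counterpart is an explicit `.encode("ascii")`.

```python
def as_byte_path(path):
    """Hypothetical helper: return `path` as a byte string, or raise
    UnicodeEncodeError if it is text that is not plain ASCII."""
    if isinstance(path, bytes):
        # Byte strings pass through untouched, whatever their encoding.
        return path
    # Rough Python 3 counterpart of Python 2's `str(unicode_object)`.
    return path.encode("ascii")


# An ASCII-only text root becomes a byte string, so later joins with
# byte-string names from the server never trigger an implicit decode.
print(as_byte_path("."))

# Byte strings in an unknown encoding are accepted as they are.
print(as_byte_path(b"Stra\xdfe"))

# Non-ASCII text fails loudly and early.
try:
    as_byte_path("Stra\u00dfe")
except UnicodeEncodeError as error:
    print("rejected:", error)
```

The difference to the first variant is that the converted value is kept
and used for the rest of the call, instead of only being checked.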

Other suggestions are welcome, too.

Stefan

From hraban at fiee.net Thu Oct 21 08:21:39 2010
From: hraban at fiee.net (Henning Hraban Ramm)
Date: Thu, 21 Oct 2010 08:21:39 +0200
Subject: [ftputil] UnicodeDecodeError with walk in current directory
In-Reply-To: <4CBF6D51.4010607@sschwarzer.net>
References: <4CBB1E0E.5090408@sschwarzer.net> <4CBF6B46.9070507@sschwarzer.net> <4CBF6D51.4010607@sschwarzer.net>
Message-ID: <37609E8C-E612-416F-BAC0-60EE63720409@fiee.net>

On 2010-10-21 at 00:29, Stefan Schwarzer wrote:

> On 2010-10-21 00:20, Stefan Schwarzer wrote:
>>     def some_method(self, directory):
>>         # raise an `UnicodeEncodeError` as soon as possible
>>         str(directory)
>>         ...
>>
>> This way, 7-bit ASCII unicode strings and any byte strings
>> will still work as arguments, but other arguments will fail.
>
> Nonsense. :) Of course that won't help if the argument is
> ASCII (like u".") but the encoding problem happens on the
> way "down" the tree.
>
> But a variant of the idea might work:
>
>     def some_method(self, directory):
>         directory = str(directory)
>         ...
>
> That way, we should receive byte strings for the directories
> and files.

Sounds good to me. Thank you!

Greetlings from Lake Constance!
Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)

From oao2005 at gmail.com Fri Oct 22 16:55:23 2010
From: oao2005 at gmail.com (Matthieu Bizien)
Date: Fri, 22 Oct 2010 16:55:23 +0200
Subject: [ftputil] Patch to walk
Message-ID:

Hi everyone, I discovered ftputil because I wanted to index ftp servers. So
I have to list directories with walk. But the current code makes too many
requests (1 per file). In a directory with thousands of files it is
unusable.

So here is a patch which uses retrlines("LIST"). It makes only one request
per dir.

    def walk(self, top, topdown=True, onerror=None):
        """
        Iterate over directory tree and return a tuple (dirpath,
        dirnames, filenames) on each iteration, like the `os.walk`
        function (see http://docs.python.org/lib/os-file-dir.html ).
        """
        # Hack with retrlines("LIST").
        # The following code is copied from `os.walk` in Python 2.4
        # and adapted to ftputil.
        dirs, nondirs = [], []
        def callback(line):
            # Example of a line:
            # drwxrwxrwx 2 ftp nogroup 159744 Oct 22 04:48 Upload
            # There is neither . nor .. in the listing.
            # Ugly hack, must check that it is correct. It works with proftpd.
            name = line[55:]
            if line.startswith("d"):
                dirs.append(name)
            elif not line.startswith("l"):  # No links
                nondirs.append(name)
        try:
            raw = self._session.retrlines("LIST " + top, callback)
        except ftp_error.FTPOSError as err:
            # Bind the exception so that `onerror` receives it.
            if onerror is not None:
                onerror(err)
            return
        if topdown:
            yield top, dirs, nondirs
        for name in dirs:
            path = self.path.join(top, name)
            for item in self.walk(path, topdown, onerror):
                yield item
        if not topdown:
            yield top, dirs, nondirs

Matthieu Bizien

From sschwarzer at sschwarzer.net Fri Oct 22 19:45:18 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Fri, 22 Oct 2010 19:45:18 +0200
Subject: [ftputil] Patch to walk
In-Reply-To:
References:
Message-ID: <4CC1CDAE.4030308@sschwarzer.net>

Hi Matthieu,

On 2010-10-22 16:55, Matthieu Bizien wrote:
> Hi everyone, I discovered ftputil because I wanted to index ftp servers. So
> I have to list directories with walk. But the current code makes too many
> requests (1 per file). In a directory with thousands of files it is
> unusable.

Thank you for investigating the problem and providing the
patch! :)

I assume you're referring to the `isdir` and `islink` calls
in the implementation of `walk`.

When ftputil reads a directory, it tries to collect all the
"stat" information in a cache. If later you ask if an item
is a directory, this information is supposed to be fetched
from the cache, not the server.

What probably causes you trouble here is the number of items
in each directory.
The thing is that ftputil uses an LRU (least
recently used) cache and by default its size is 1000
entries. That means if ftputil reads a directory with over
1000 entries the newer entries will replace the older ones
and they won't be available anymore when an `isdir` call is
made on the first entries, which have been removed already.

Fortunately, you can increase the cache size by:

    ftp_host = ftputil.FTPHost(...)
    # Can be called anytime with a larger or smaller value
    # than the current size.
    ftp_host.stat_cache.resize(10000)
    ...

Could you please try that with a number higher than the
maximum number of items you expect in the largest directory?
Does it work better?

Caching is also discussed in ftputil's documentation:
http://ftputil.sschwarzer.net/trac/wiki/Documentation#local-caching-of-file-system-information

You're not the first person running into this issue. I
think I'll add a note to the FAQ section of the
documentation.

I know that caching isn't necessarily the best solution for
some problems, but I guess it works for most scenarios. The
most serious issue is that you don't know in advance how
many entries you will need.(*) The nice thing about the
caching is its implementation on a quite low level, so many
uses benefit from it without having to optimize a lot of
methods individually.

If there are any more problems, please ask. :-)

(*) I'm starting to think about an optional auto-resize
feature which will be used while reading a directory. :-)

Matthieu, was the server you had problems with a public
server with anonymous access? If that's the case, could you
please give me the host name and the directory you started
your `walk` call in? That might help me to experiment with
caching modifications as mentioned above.
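
The eviction effect described in this message can be illustrated
offline. The sketch below is an assumption-laden toy in Python 3, not
ftputil's actual cache: `TinyLRUCache` and `cache_listing` are
hypothetical names, and only the `resize` method mirrors a call that
appears in the message. It fills a cache of 1000 entries with a
1500-item "directory" and checks which entries survive.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Hypothetical stand-in for a stat cache; not ftputil's implementation."""

    def __init__(self, size):
        self.size = size
        self._entries = OrderedDict()

    def _evict(self):
        # Drop the least recently used entries until the cache fits.
        while len(self._entries) > self.size:
            self._entries.popitem(last=False)

    def __setitem__(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        self._evict()

    def __getitem__(self, key):
        value = self._entries[key]
        # Reading an entry marks it as recently used.
        self._entries.move_to_end(key)
        return value

    def __contains__(self, key):
        return key in self._entries

    def resize(self, new_size):
        self.size = new_size
        self._evict()


def cache_listing(cache, item_count):
    """Simulate caching stat results for one large directory listing."""
    for index in range(item_count):
        cache["/dir/file%d" % index] = "stat result %d" % index


small = TinyLRUCache(size=1000)
cache_listing(small, 1500)
# The first 500 entries were pushed out before anyone asked about them.
print("/dir/file0" in small, "/dir/file1499" in small)

large = TinyLRUCache(size=1000)
large.resize(10000)
cache_listing(large, 1500)
# With enough room, the whole listing stays cached.
print("/dir/file0" in large)
```

In the small cache, a later `isdir`-style lookup for the early entries
would miss and, in a real client, cost another request to the server.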

Stefan

From sschwarzer at sschwarzer.net Sun Oct 24 14:14:16 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Sun, 24 Oct 2010 14:14:16 +0200
Subject: [ftputil] [ANN] ftputil 2.5 released
Message-ID: <4CC42318.1090009@sschwarzer.net>

ftputil 2.5 is now available from
http://ftputil.sschwarzer.net/download .

Changes since version 2.4.2
---------------------------

- As announced over a year ago [1], the `xreadlines` method for
  FTP file objects has been removed, and exceptions can no longer be
  accessed via the `ftputil` namespace. Only use `ftp_error` to access
  the exceptions.

  The distribution contains a small tool `find_deprecated_code.py` to
  scan a directory tree for the deprecated uses. Invoke the program
  with the `--help` option to see a description.

- Upload and download methods now accept a `callback` argument to do
  things during a transfer. Modification time comparisons in
  `upload_if_newer` and `download_if_newer` now consider the timestamp
  precision of the remote file which may lead to some unnecessary
  transfers. These can be avoided by waiting at least a minute between
  calls of `upload_if_newer` (or `download_if_newer`) for the same
  file. See the documentation for details [2].

- The `FTPHost` class got a `keep_alive` method. It should be used
  carefully though, not routinely. Please read the description [3] in
  the documentation.

- Several bugs were fixed [4-7].

- The source code was restructured. The tests are now in a `test`
  subdirectory and are no longer part of the release archive. You can
  still get them via the source repository. Licensing matters have
  been moved to a common `LICENSE` file.

What is ftputil?
----------------

ftputil is a high-level FTP client library for the Python programming
language. ftputil implements a virtual file system for accessing FTP
servers, that is, it can generate file-like objects for remote files.
The library supports many functions similar to those in the os,
os.path and shutil modules.
ftputil has convenience functions for
conditional uploads and downloads, and handles FTP clients and servers
in different timezones.

Read the documentation at
http://ftputil.sschwarzer.net/documentation .

License
-------

ftputil is Open Source software, released under the revised BSD
license (see http://www.opensource.org/licenses/bsd-license.php ).

[1] http://codespeak.net/pipermail/ftputil/2009q1/000256.html
[2] http://ftputil.sschwarzer.net/trac/wiki/Documentation#uploading-and-downloading-files
[3] http://ftputil.sschwarzer.net/trac/wiki/Documentation#keep-alive
[4] http://ftputil.sschwarzer.net/trac/ticket/44
[5] http://ftputil.sschwarzer.net/trac/ticket/46
[6] http://ftputil.sschwarzer.net/trac/ticket/47
[7] http://ftputil.sschwarzer.net/trac/ticket/51

Stefan

From sschwarzer at sschwarzer.net Sun Oct 24 14:20:48 2010
From: sschwarzer at sschwarzer.net (Stefan Schwarzer)
Date: Sun, 24 Oct 2010 14:20:48 +0200
Subject: [ftputil] Patch to walk
In-Reply-To: <4CC1CDAE.4030308@sschwarzer.net>
References: <4CC1CDAE.4030308@sschwarzer.net>
Message-ID: <4CC424A0.40700@sschwarzer.net>

Hello,

On 2010-10-22 19:45, Stefan Schwarzer wrote:
> You're not the first person running into this issue. I
> think I'll add a note to the FAQ section of the
> documentation.

I've added said note to the documentation for ftputil 2.5.

Stefan