I need to download a large number of files over FTP with Python. Each file is 0.3 to 1.5 GB, and there are about 200 to 300 files in total; after downloading, the files need further processing. I am using Python's ftplib, but I have run into a problem: the download sometimes hangs and some files never finish. I tried tuning the TCP keepalive settings, but it did not help much.
Code example:
import ftplib
import logging
import os
import socket
from contextlib import closing

with closing(ftplib.FTP()) as ftp:
    try:
        ftp.connect(self.host, self.port, 30 * 60)  # 30 min timeout
        # print(ftp.getwelcome())
        ftp.login(self.login, self.passwd)
        ftp.set_pasv(True)
        # Enable TCP keepalive on the control socket.
        ftp.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)
        ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
        with open(local_filename, 'w+b') as f:
            res = ftp.retrbinary('RETR %s' % orig_filename, f.write)
        if not res.startswith('226 Transfer complete'):
            logging.error('Download of file {0} is not complete.'.format(orig_filename))
            os.remove(local_filename)
            return None
        os.rename(local_filename, self.storage + filename + file_ext)
        ftp.rename(orig_filename, orig_filename + '.copied')
        return filename + file_ext
    except Exception:
        logging.exception('Error during download from FTP')
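One likely reason the keepalive tuning had little effect: the options above are set on ftp.sock, which is the FTP control connection, while the long-running transfer happens on a separate data connection. Also, TCP_KEEPIDLE and TCP_KEEPINTVL are Linux-only constants. A more portable sketch (the helper name enable_keepalive is mine, for illustration) guards them:

import socket

def enable_keepalive(sock, idle=60, interval=75, count=5):
    """Enable TCP keepalive on a socket, guarding the Linux-only options."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, 'TCP_KEEPIDLE'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, 'TCP_KEEPINTVL'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, 'TCP_KEEPCNT'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)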
Problem details:
- Downloading a single file usually takes 7 to 15 minutes.
- The FTP server log always shows the file as fully transferred, yet the client hangs.
- This does not happen every time, only occasionally.
Questions:
- Is this caused by the connection being dropped?
- How can I monitor the download and reconnect when the connection drops?
2. Solution
Since I could not find any good advice or code examples, I implemented a solution myself, using some ideas from the Stack Overflow community, and published the code on GitHub (pyFTPclient).
Key points of the solution:
- An FTPClient class wraps the FTP connection. It inherits from ftplib.FTP and adds monitoring of download progress, reconnecting on timeout or disconnect, and reporting the current download speed.
- After a disconnect, the download resumes from the point where it broke off, if the FTP server supports it (see the REST sketch after this list).
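Whether resuming is possible can be probed up front. A minimal sketch, assuming an already logged-in ftplib.FTP object named ftp (the helper name supports_resume is mine, not part of pyFTPclient); servers that support resuming accept the REST command:

import ftplib

def supports_resume(ftp):
    """Return True if the server accepts REST (resume from an offset)."""
    try:
        ftp.voidcmd('TYPE I')   # REST offsets are only well-defined in binary mode
        ftp.sendcmd('REST 0')   # a 350 reply means the server accepts REST
        return True
    except ftplib.all_errors:
        return False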
Code example:
import ftplib
import logging
import os
import time

def download_file(self, orig_filename, local_filename, timeout=30 * 60, chunk_size=1024 * 256):
    """
    Downloads a file from the FTP server.

    Args:
        orig_filename: The name of the file to download on the FTP server.
        local_filename: The name of the file to save on the local computer.
        timeout: The timeout in seconds for the FTP connection.
        chunk_size: The size of the chunks to use when downloading the file.

    Returns:
        The name of the downloaded file if successful, None otherwise.
    """
    with self.ftp_connect(timeout=timeout) as ftp:
        try:
            # Record the start time for the speed-reporting helper.
            self._start_time = time.time()
            with open(local_filename, 'w+b') as f:
                # Monitor the download progress and reconnect if necessary.
                self._monitor_download(ftp, f, orig_filename, chunk_size)
            # Move the downloaded file into the storage directory.
            os.rename(local_filename, self.storage + orig_filename)
            return orig_filename
        except Exception:
            logging.exception('Error during download from FTP')
            return None
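The ftp_connect helper used above is not part of ftplib. A minimal sketch, assuming it is a context-manager method on the same class, reusing the connection attributes from the question:

from contextlib import contextmanager

@contextmanager
def ftp_connect(self, timeout):
    """Hypothetical helper: open a logged-in FTP connection and close it
    when the with block exits."""
    ftp = ftplib.FTP()
    ftp.connect(self.host, self.port, timeout)
    ftp.login(self.login, self.passwd)
    ftp.set_pasv(True)
    try:
        yield ftp
    finally:
        ftp.close()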
def _monitor_download(self, ftp, f, orig_filename, chunk_size):
    """
    Monitors the download progress and reconnects if necessary.

    Args:
        ftp: The FTPClient object.
        f: The file object to write the downloaded data to.
        orig_filename: The name of the file to download on the FTP server.
        chunk_size: The size of the chunks to use when downloading the file.
    """
    # Binary mode is required for SIZE and for meaningful REST offsets.
    ftp.voidcmd('TYPE I')
    # Get the total file size so progress can be tracked.
    file_size = ftp.size(orig_filename)
    downloaded_size = 0
    # Loop until the file is completely downloaded; after a dropped
    # connection, RETR is reissued with a REST offset to resume.
    while downloaded_size < file_size:
        try:
            ftp.voidcmd('TYPE I')
            # Open the data connection, resuming at the current offset
            # (REST is only sent when there is something to resume).
            conn = ftp.transfercmd('RETR %s' % orig_filename,
                                   rest=downloaded_size or None)
            while True:
                # Read raw bytes from the data socket; the files are
                # binary, so nothing is decoded.
                data = conn.recv(chunk_size)
                if not data:
                    # The data connection closed: either the transfer
                    # finished or the link dropped. The outer loop
                    # decides by comparing sizes.
                    break
                f.write(data)
                downloaded_size += len(data)
                self._print_download_speed(downloaded_size, file_size)
            conn.close()
            ftp.voidresp()  # consume the end-of-transfer reply
        except (ftplib.error_temp, EOFError, OSError):
            # The connection has been lost. Reconnect and continue downloading.
            self._reconnect_and_continue_download(ftp, orig_filename, downloaded_size, chunk_size)
        except Exception:
            # An unknown error occurred. Log it and re-raise.
            logging.exception('Error during download from FTP')
            raise
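The _reconnect_and_continue_download helper referenced above is not shown in the original. A minimal sketch, assuming the class keeps the connection parameters from the question (self.host, self.port, self.login, self.passwd):

def _reconnect_and_continue_download(self, ftp, orig_filename, downloaded_size, chunk_size):
    """Hypothetical helper: re-establish the control connection after a
    drop. The loop in _monitor_download reissues RETR with a REST offset,
    so this only needs to log in again; orig_filename and chunk_size are
    unused here and kept only to match the caller's signature."""
    logging.warning('Connection lost at %d bytes, reconnecting...', downloaded_size)
    ftp.connect(self.host, self.port)
    ftp.login(self.login, self.passwd)
    ftp.set_pasv(True)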
def _print_download_speed(self, downloaded_size, file_size):
    """
    Prints the current download progress and speed.

    Args:
        downloaded_size: The size of the downloaded data in bytes.
        file_size: The total size of the file in bytes.
    """
    # Average speed in KB/s since the transfer started; assumes
    # self._start_time was set before the download began.
    download_speed = downloaded_size / (time.time() - self._start_time) / 1024
    print('Downloaded {:.1f}% at {:.2f} KB/s'.format(
        100.0 * downloaded_size / file_size, download_speed))
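For reference, hypothetical usage of the class; the constructor and the way the attributes are set are assumptions, since the original only shows self.host, self.port, self.login, self.passwd, and self.storage:

client = FTPClient()  # hypothetical constructor
client.host, client.port = 'ftp.example.com', 21
client.login, client.passwd = 'user', 'secret'
client.storage = '/data/downloads/'
result = client.download_file('big_file.bin', '/tmp/big_file.part')
if result:
    print('Saved as', result)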
Note that this is a relatively complex solution; read it carefully and make sure you understand it before using it.