Downloading Large Files over FTP with Python (with Monitoring and Reconnection)

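I need to download a large number of files over FTP with Python: roughly 200 to 300 files, each between 0.3 and 1.5 GB. After downloading, the files need some further processing. I am using Python's ftplib library, but ran into a problem: sometimes a download hangs and some files never finish. I tried adjusting the KEEPALIVE settings to fix this, but it did not help much.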

Code example:

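import ftplib
import logging
import os
import socket
from contextlib import closing

# Body of a method on the downloader class; self.host, self.port, self.login,
# self.passwd and self.storage are set elsewhere.
with closing(ftplib.FTP()) as ftp:
    try:
        ftp.connect(self.host, self.port, 30 * 60)  # 30-minute timeout
        # print(ftp.getwelcome())
        ftp.login(self.login, self.passwd)
        ftp.set_pasv(True)
        # Enable TCP keepalive on the control connection.
        # TCP_KEEPINTVL and TCP_KEEPIDLE are not available on every platform.
        ftp.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)
        ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
        with open(local_filename, 'w+b') as f:
            res = ftp.retrbinary('RETR %s' % orig_filename, f.write)

        # Check the reply after the file is closed so that it can be removed safely.
        # Only the 226 code is checked; the text after it varies between servers.
        if not res.startswith('226'):
            logging.error('Download of file {0} is not complete.'.format(orig_filename))
            os.remove(local_filename)
            return None

        os.rename(local_filename, self.storage + filename + file_ext)
        ftp.rename(orig_filename, orig_filename + '.copied')

        return filename + file_ext

    except Exception:
        logging.exception('Error during download from FTP')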
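The keepalive calls in the snippet above rely on TCP_KEEPIDLE and TCP_KEEPINTVL, which the socket module does not expose on every platform. Below is a minimal sketch of a more defensive variant. The helper name enable_keepalive and its parameters are hypothetical, and the defaults simply mirror the 60 and 75 second values used above.

```python
import socket


def enable_keepalive(sock, idle=60, interval=75, count=None):
    """Turn on TCP keepalive and apply the tuning options this platform offers.

    Returns the names of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ['SO_KEEPALIVE']

    tuning = (
        ('TCP_KEEPIDLE', idle),       # seconds of idle time before the first probe
        ('TCP_KEEPINTVL', interval),  # seconds between probes
        ('TCP_KEEPCNT', count),       # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None or value is None:
            continue  # not exposed on this platform, or not requested
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
        except OSError:
            pass  # the constant exists but the OS rejected it

    return applied
```

With ftplib this would be called as enable_keepalive(ftp.sock) after connecting. Note that ftp.sock is the control connection, so these settings do not apply to the separate data connection opened for each transfer.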

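Problem details:

  • Downloading one file usually takes 7 to 15 minutes.
  • The FTP server's log always shows the file as fully downloaded, but the client hangs.
  • It does not happen every time, only occasionally.

Questions:

  • Is this caused by a dropped connection?
  • How can I monitor the download and reconnect when the connection drops?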

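2. Solution

Since I could not find any good suggestions or code examples, I implemented my own solution. It draws on ideas from the Stack Overflow community, and the code is on GitHub (pyFTPclient).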

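Key points of the solution:

  • An FTPClient class wraps the FTP connection. It inherits from ftplib.FTP and adds download progress monitoring, reconnection on timeout or disconnect, and display of the current download speed.
  • After a disconnect, the download resumes from the point where it broke off (if the FTP server supports it).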
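Resuming from the break point can be illustrated with the rest argument of ftplib's retrbinary, which sends a REST command before RETR. The sketch below is not taken from pyFTPclient. The function names resume_offset and download_with_resume are hypothetical, and it assumes the server supports REST and that the partial local file is an intact prefix of the remote file.

```python
import os


def resume_offset(local_path):
    """Return how many bytes are already on disk (0 if the file does not exist)."""
    try:
        return os.path.getsize(local_path)
    except OSError:
        return 0


def download_with_resume(ftp, remote_name, local_path, blocksize=256 * 1024):
    """Fetch remote_name, continuing from an existing partial local file.

    `ftp` is expected to be a connected, logged-in ftplib.FTP instance.
    Returns the server's final reply string.
    """
    offset = resume_offset(local_path)
    # Append when resuming so the existing bytes are kept; otherwise start fresh.
    mode = 'ab' if offset else 'wb'
    with open(local_path, mode) as f:
        # rest=None sends no REST command; rest=N asks the server to start at byte N.
        return ftp.retrbinary('RETR ' + remote_name, f.write, blocksize,
                              rest=offset or None)
```

Because the offset comes from the file on disk, this also covers the case where the whole script was restarted, not only a reconnect within one run.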

Code example:

import ftplib
import logging
import os
import time

# The functions below are methods of the FTP client class. self.ftp_connect() is
# assumed to be a context manager, defined elsewhere in the class, that yields a
# connected and logged-in ftplib.FTP instance.

def download_file(self, orig_filename, local_filename, timeout=30 * 60, chunk_size=1024 * 256):
    """
    Downloads a file from the FTP server.

    Args:
        orig_filename: The name of the file to download on the FTP server.
        local_filename: The name of the file to save on the local computer.
        timeout: The timeout in seconds for the FTP connection.
        chunk_size: The number of bytes to read per chunk.

    Returns:
        The name of the downloaded file if successful, None otherwise.
    """

    with self.ftp_connect(timeout=timeout) as ftp:
        try:
            with open(local_filename, 'w+b') as f:
                # Monitor the download progress and reconnect if necessary.
                self._monitor_download(ftp, f, orig_filename, chunk_size)

            # Move the finished file into the storage directory under its original name.
            os.rename(local_filename, self.storage + orig_filename)

            return orig_filename

        except Exception:
            logging.exception('Error during download from FTP')
            return None

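def _monitor_download(self, ftp, f, orig_filename, chunk_size, max_attempts=5):
    """
    Downloads a file in chunks, reconnecting and resuming if the transfer breaks.
    Requires the ftplib, logging and time modules.

    Args:
        ftp: A connected, logged-in ftplib.FTP (or subclass) instance.
        f: The binary file object to write the downloaded data to.
        orig_filename: The name of the file to download on the FTP server.
        chunk_size: The number of bytes to read per chunk.
        max_attempts: How many consecutive attempts without progress to allow.
    """

    # Switch to binary mode first; many servers refuse SIZE in ASCII mode.
    ftp.voidcmd('TYPE I')
    file_size = ftp.size(orig_filename)
    if file_size is None:
        raise ftplib.error_reply('Server did not report a size for ' + orig_filename)

    downloaded_size = 0
    failed_attempts = 0
    need_reconnect = False
    self._start_time = time.time()  # read by _print_download_speed

    # Loop until the file is completely downloaded.
    while downloaded_size < file_size:
        attempt_start = downloaded_size
        try:
            if need_reconnect:
                # connect() without arguments reuses the host, port and timeout of
                # the previous connection. The credentials are assumed to be stored
                # on the client as self.login / self.passwd, as in the first example.
                ftp.close()
                ftp.connect()
                ftp.login(self.login, self.passwd)
                ftp.voidcmd('TYPE I')
                need_reconnect = False

            # rest=N sends REST N, asking the server to resume at byte N.
            conn = ftp.transfercmd('RETR ' + orig_filename, rest=downloaded_size or None)
            try:
                while True:
                    data = conn.recv(chunk_size)  # raw bytes; binary data must not be decoded
                    if not data:
                        break  # the server closed the data connection
                    f.write(data)
                    downloaded_size += len(data)
                    self._print_download_speed(downloaded_size, file_size)
            finally:
                conn.close()

            # Read the final "226" reply from the control connection.
            ftp.voidresp()

        except (OSError, EOFError, ftplib.error_temp):
            # Timeout, dropped connection, or a temporary (4xx) server error.
            logging.warning('Transfer of %s interrupted at byte %d', orig_filename, downloaded_size)
            need_reconnect = True

        if downloaded_size > attempt_start:
            failed_attempts = 0
        else:
            # Nothing arrived in this attempt; give up after too many in a row.
            failed_attempts += 1
            if failed_attempts >= max_attempts:
                raise ftplib.error_temp('No progress after %d attempts' % max_attempts)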

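def _print_download_speed(self, downloaded_size, file_size):
    """
    Prints the progress and the average download speed since the transfer started.

    Args:
        downloaded_size: The size of the downloaded data in bytes.
        file_size: The total size of the file in bytes.
    """

    # self._start_time is expected to be set (with time.time()) when the transfer starts.
    elapsed = time.time() - self._start_time
    if elapsed <= 0:
        return  # avoid dividing by zero on the very first chunk

    # Calculate the average download speed in KB/s and the completed percentage.
    download_speed = downloaded_size / elapsed / 1024
    percent = 100.0 * downloaded_size / file_size if file_size else 100.0

    print('Downloaded {:.1f}%, average speed: {:.2f} KB/s'.format(percent, download_speed))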
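The progress monitoring listed among the key points can be separated from the transfer itself. Here is a minimal sketch of a stall detector that only does the bookkeeping. The class name StallDetector and its methods are hypothetical and not part of ftplib or pyFTPclient, and the clock parameter exists only so the logic can be exercised without waiting.

```python
import time


class StallDetector:
    """Tracks received bytes and reports when no data has arrived for too long."""

    def __init__(self, stall_timeout, clock=time.monotonic):
        self._stall_timeout = stall_timeout  # seconds without data before "stalled"
        self._clock = clock
        self._started = clock()
        self._last_progress = self._started
        self.total_bytes = 0

    def feed(self, chunk):
        """Record one received chunk; can be called from a download callback."""
        if chunk:
            self.total_bytes += len(chunk)
            self._last_progress = self._clock()

    def is_stalled(self):
        """True if no data has arrived within the configured timeout."""
        return self._clock() - self._last_progress > self._stall_timeout

    def average_speed(self):
        """Average throughput in bytes per second since the detector was created."""
        elapsed = self._clock() - self._started
        return self.total_bytes / elapsed if elapsed > 0 else 0.0
```

One way to use it, under these assumptions, is to call feed() from the same callback that writes each chunk to disk and to poll is_stalled() from a separate monitoring thread, which can then decide to abort and reconnect.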

Note that this is a relatively complex solution; read and understand it carefully before using it.