Reading the Last 1 MB of a Large Text File in Python


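When working with large text files, you sometimes need only the last part of the file. If the file is very large (for example, over 4 GB), reading all of it at once can exhaust memory.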

find_str = "ERROR"
file = open(file_directory)
# readlines() loads every line of the file into memory before slicing
last_few_lines = file.readlines()[-20:]

error = False

for line in last_few_lines:
    if find_str in line:
        error = True

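The code above tries to read the last 20 lines of the file, but readlines() first builds a list of every line, so on a very large file it can run out of memory.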

2. Solutions

One solution is to call file.seek() with os.SEEK_END to jump to a position 1 MB before the end of the file, then read from there to the end.

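import os

find_str = "ERROR"
error = False
# Binary mode is required: text-mode files cannot seek relative to the end
with open(file_directory, 'rb') as file:
    size = file.seek(0, os.SEEK_END)       # file size in bytes
    file.seek(max(size - 1024 * 1024, 0))  # back up 1 MB, or to the start if the file is smaller
    # read() returns bytes here, so encode the search string before comparing
    if find_str.encode() in file.read():
        error = True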

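file.seek() moves the read position to 1 MB before the end of the file, and read() then loads only that final 1 MB. Memory use stays bounded no matter how large the file is, and the last part of the file is still available for searching.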
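Seeking to an arbitrary byte offset usually lands in the middle of a line, so the first piece of the chunk is probably incomplete. The following is a minimal sketch of how the last 20 lines could be recovered from the final chunk. The function name last_lines_from_tail and its parameters are hypothetical, and it assumes the file is UTF-8 encoded and that the requested lines fit inside one chunk.

```python
import os

def last_lines_from_tail(path, n=20, chunk_size=1024 * 1024, encoding="utf-8"):
    """Return up to the last n lines found in the final chunk_size bytes of path."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)      # seek() returns the new position, i.e. the file size
        start = max(size - chunk_size, 0)  # clamp so small files are read from the beginning
        f.seek(start)
        data = f.read()

    lines = data.splitlines()
    # If the read started mid-file, the first piece may be a partial line: drop it
    if start > 0 and lines:
        lines = lines[1:]
    # Decode only the lines that are kept; undecodable bytes are replaced, not raised
    return [line.decode(encoding, errors="replace") for line in lines[-n:]]
```

If the last n lines together are longer than chunk_size, this returns fewer than n lines, so the chunk size should be chosen with the expected line length in mind.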

Another solution is a small tail function, modeled on the Unix tail command and built on collections.deque.

from collections import deque

def tail(fn, n):
    with open(fn) as fin:
        return list(deque(fin, n))

print(tail('/tmp/lines.txt', 20))

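Because the deque is created with a maximum length of n, it keeps only the most recent n lines as the file is iterated, so the whole file is never held in memory at once. The file is still read from start to finish, however, which can be slow for very large files.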
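The deque-based helper has to pass over every line from the start of the file. When only the end matters, one alternative is to read fixed-size blocks backwards from the end until enough newlines have been collected. This is a sketch rather than a drop-in replacement: the name tail_from_end and the block_size parameter are hypothetical, lines are returned as bytes, and "\n" line endings are assumed.

```python
import os

def tail_from_end(path, n, block_size=4096):
    """Return the last n lines of path as bytes, reading blocks backwards from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)  # start at the end of the file
        data = b""
        # More than n newlines guarantees that the first of the n lines is complete
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data  # prepend the earlier block
    return data.splitlines()[-n:]
```

Only the blocks near the end of the file are read, so the running time depends on the length of the last n lines rather than on the size of the file.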

The last solution is to iterate over the file object directly.

with open(file_directory) as file:
    for line in file:
        if find_str in line:
            error = True
            break  # stop at the first match

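Iterating over a file object yields one line at a time, so the whole file is never loaded into memory. Unlike the previous approaches, this searches the entire file rather than only its end.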
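Line-by-line iteration keeps memory low only when individual lines are reasonably short. For files that may contain very long lines, or no newlines at all, a search over fixed-size chunks is one possible alternative. The sketch below uses a hypothetical function name, contains_bytes, and takes the search string as bytes; it carries a small overlap between chunks so a match that straddles a chunk boundary is not missed.

```python
def contains_bytes(path, needle, chunk_size=1024 * 1024):
    """Return True if needle (bytes) occurs anywhere in path, reading chunk_size bytes at a time."""
    if not needle:
        return True
    overlap = len(needle) - 1
    carry = b""  # trailing bytes of the previous window
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # reached end of file without a match
            window = carry + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes so a match spanning two chunks is still found
            carry = window[-overlap:] if overlap else b""
```

For example, contains_bytes('path/to/file.txt', b'ERROR') holds at most about one chunk in memory at a time, regardless of how the file is split into lines.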

3. Code Examples

The following example uses file.seek() to read the last 1 MB of a file:

import os

def read_last_mb(file_directory, find_str):
  """
  Read the last 1MB of a file.

  Args:
    file_directory: The path to the file.
    find_str: The string to search for.

  Returns:
    A boolean indicating whether the string was found.
  """

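  with open(file_directory, 'rb') as file:
    # Seek relative to the end; clamp to 0 so files under 1 MB do not raise OSError
    size = file.seek(0, os.SEEK_END)
    file.seek(max(size - 1024 * 1024, 0))
    last_mb = file.read()

  # The file was read as bytes, so encode the search string before comparing
  return find_str.encode('utf-8') in last_mb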


if __name__ == '__main__':
  file_directory = 'path/to/file.txt'
  find_str = 'ERROR'

  found = read_last_mb(file_directory, find_str)

  if found:
    print('The string was found in the last 1MB of the file.')
  else:
    print('The string was not found in the last 1MB of the file.')

The following example uses the deque-based tail function to read the last n lines of a file:

from collections import deque

def tail(fn, n):
  """
  Read the last n lines of a file.

  Args:
    fn: The path to the file.
    n: The number of lines to read.

  Returns:
    A list of the last n lines of the file.
  """

  with open(fn) as fin:
    return list(deque(fin, n))


if __name__ == '__main__':
  file_directory = 'path/to/file.txt'
  n = 20

  last_lines = tail(file_directory, n)

  print('The last {} lines of the file are:'.format(n))
  for line in last_lines:
    # Each line already ends with a newline, so suppress the one print() adds
    print(line, end='')

The following example reads a file line by line using the file object:

def read_file(file_directory, find_str):
  """
  Read a file line by line.

  Args:
    file_directory: The path to the file.
    find_str: The string to search for.

  Returns:
    A boolean indicating whether the string was found.
  """

  found = False

  with open(file_directory) as file:
    for line in file:
      if find_str in line:
        found = True
        break

  return found


if __name__ == '__main__':
  file_directory = 'path/to/file.txt'
  find_str = 'ERROR'

  found = read_file(file_directory, find_str)

  if found:
    print('The string was found in the file.')
  else:
    print('The string was not found in the file.')