How can I read words from a large file in Python with low memory use?

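We need to iterate over the words in a file, but the file may be huge (over 1TB) and its lines may be very long (it may even consist of a single line). The words are all English, so each one is of reasonable size. We therefore don't want to load the whole file, or even a whole line, into memory.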

We have some working code, but it can crash when a line is too long (beyond roughly 3GB on my machine).

import re

def words(file):
    # iterating a file yields whole lines, so one huge line is read in full
    for line in file:
        words = re.split(r"\W+", line)
        for w in words:
            word = w.lower()
            if word != '':
                yield word

Can you tell me how to rewrite this iterator function so that it doesn't hold more in memory than it needs to?

2. Solution

Method 1: Buffered reading. Instead of reading line by line, read the file in fixed-size chunks:

import re

def words(file, buffersize=2048):
    buffer = ''
    # file must be opened in text mode: read() returns '' at end of file
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word

    # whatever remains after the last chunk is the final word, if any
    if buffer:
        yield buffer.lower()

I'm using the callable-and-sentinel version of the iter() function to keep reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
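
For comparison, here is a minimal sketch of the two ways to read fixed-size chunks: the two-argument iter() form and an explicit while loop. The helper name read_chunks and the sample text are made up for illustration. Note that the sentinel has to match what read() returns at end of file: '' for a text-mode file, b'' for a binary-mode one.

```python
import io

def read_chunks(file, buffersize=2048):
    """Yield successive chunks using an explicit while loop (hypothetical helper)."""
    while True:
        chunk = file.read(buffersize)
        if not chunk:  # an empty result signals end of file
            break
        yield chunk

text = "alpha beta gamma"

# Two-argument iter(): call the lambda repeatedly until it returns the sentinel ''.
demo_file = io.StringIO(text)
with_iter = list(iter(lambda: demo_file.read(5), ''))

# The same chunks, produced by the while-loop helper.
with_loop = list(read_chunks(io.StringIO(text), 5))
```

Both approaches should produce the same chunks; the iter() form simply moves the end-of-file check into the sentinel comparison.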

If you are using Python 3.3 or later, you can use generator delegation:

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        # delegate to the generator expression instead of looping explicitly
        yield from (w.lower() for w in words if w)

    if buffer:
        yield buffer.lower()
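
The step that is easiest to get wrong is the chunk boundary: a chunk can end in the middle of a word, which is why the last element of each split is held back in buffer and prepended to the next chunk. The sketch below repeats the generator so that it runs on its own and feeds it deliberately tiny chunks; the sample strings and chunk sizes are arbitrary choices for illustration.

```python
import io
import re

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        parts = re.split(r"\W+", buffer + chunk)
        # The last element is either an unfinished word or '' (when the
        # chunk ended on a separator), so carry it over to the next chunk.
        buffer = parts.pop()
        yield from (w.lower() for w in parts if w)
    if buffer:
        yield buffer.lower()

# With 4-character chunks, "hello" is split as "hell" + "o wo" and
# "world" as "o wo" + "rld"; the buffer stitches both back together.
split_words = list(words(io.StringIO("hello world"), 4))

# A chunk that ends on punctuation leaves an empty buffer behind.
punctuated = list(words(io.StringIO("Hi there."), 3))
```

Memory use stays bounded by the chunk size plus the length of the longest run of word characters, which is small when the words are ordinary English.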

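Method 2: mmap. mmap memory-maps a file so that the operating system pages its contents in on demand. It can be used to read the words in a file as follows: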

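import mmap
import re

# open in binary mode and map read-only; a plain 'r' handle cannot back
# the default writable mapping
with open('large_file.txt', 'rb') as file:
    with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as data:
        # finditer scans the mapping lazily; data.read().split() would
        # copy the entire file into memory
        for match in re.finditer(rb"\w+", data):
            word = match.group().lower().decode('ascii')
            # do something with the word
            pass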
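
A lazy variant of the mmap approach can be wrapped in a generator so that it can be used like the buffered words() function. This is only a sketch: words_mmap is a hypothetical helper name, it assumes ASCII words, and mapping a file larger than the address space assumes a 64-bit platform. The temporary file exists only to make the example runnable.

```python
import mmap
import os
import re
import tempfile

def words_mmap(path):
    """Yield lowercase words from the file at path via a read-only mapping."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # In a bytes pattern, \w matches only ASCII letters, digits and '_',
            # so decoding each match as ASCII is safe.
            for match in re.finditer(rb"\w+", data):
                yield match.group().lower().decode('ascii')

# Small self-contained demonstration with a throwaway file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(b"Hello, mmap World!\n")
    result = list(words_mmap(path))
finally:
    os.remove(path)
```

Because re.finditer walks the mapping incrementally, only the pages currently being scanned need to be resident, and no full list of words is built unless the caller asks for one.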

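Method 3: Pandas. Pandas is an open-source library that provides data structures and operations for data analysis and manipulation. It can be used to read the words in a file: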

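import pandas as pd

# read the file into a Pandas DataFrame
# note: this loads the whole file into memory and expects every line to
# contain the same number of whitespace-separated fields
df = pd.read_csv('large_file.txt', sep=r'\s+', header=None, dtype=str)

# flatten the DataFrame rows into a single list of words
words = [w for row in df.values.tolist() for w in row]

for word in words:
    # do something with the word
    pass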

Demo: a small chunk size shows that the buffered words() generator from Method 1 works as expected:

>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
... 
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur