🐍 Want to process a million lines of logs? You need more than a `for` loop: you need generators, the feature that turns Python into the nemesis of memory hogs!
✅ Goals of this article
- Understand how `yield` and generator functions work
- Understand "lazy evaluation" and what memory optimization really means
- Build a streaming analyzer for a million-line log file
- Use `itertools` to handle infinite streams and grouping tricks
🧠 1. What is a generator?
Regular function vs. generator function:
def classic():
    return [1, 2, 3]

def gen():
    yield 1
    yield 2
    yield 3
Usage:
for x in gen():
    print(x)
Or:
g = gen()
print(next(g)) # 1
print(next(g)) # 2
print(next(g)) # 3
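The generator only holds three values here, so one more call to `next()` raises `StopIteration`; that exception is exactly the signal a `for` loop uses to know when to stop. A minimal sketch:

try:
    next(g)
except StopIteration:
    print("generator exhausted")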
✅ Advantages of generators:
- Values are produced one at a time instead of all at once, so memory usage stays tiny
- Execution can pause and resume, which makes generators ideal for large data streams
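To make the memory claim concrete, here is a minimal sketch comparing a list comprehension with the equivalent generator expression. Note that `sys.getsizeof` only measures the container or generator object itself, not every element, but the contrast is still striking:

import sys

squares_list = [n * n for n in range(1_000_000)]  # one million values built up front
squares_gen = (n * n for n in range(1_000_000))   # nothing computed until you iterate

print(sys.getsizeof(squares_list))  # several megabytes just for the list object
print(sys.getsizeof(squares_gen))   # roughly a couple hundred bytes, regardless of range size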
🛠 2. The right way to read a million-line file
First, simulate a log file (you can generate it ahead of time):
with open("biglog.txt", "w") as f:
    for i in range(1_000_000):
        f.write(f"line {i}\n")
The wrong way:
lines = open("biglog.txt").readlines()  # loads the entire file at once and eats a huge amount of memory
The recommended way, reading with a generator:
def read_lines(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line.strip()
Usage:
for line in read_lines("biglog.txt"):
    if "99999" in line:
        print("🎯 Found it:", line)
        break
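If you would rather measure than trust, the standard-library `tracemalloc` module can report peak memory for both approaches. A minimal sketch (the exact numbers depend on your machine and Python version):

import tracemalloc

tracemalloc.start()
lines = open("biglog.txt").readlines()  # eager: whole file resident in memory
print("readlines peak bytes:", tracemalloc.get_traced_memory()[1])
del lines
tracemalloc.stop()

tracemalloc.start()
total = sum(1 for _ in read_lines("biglog.txt"))  # lazy: one line at a time
print("generator peak bytes:", tracemalloc.get_traced_memory()[1])
tracemalloc.stop()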
🔄 3. Building a "lazy" keyword searcher
def search_keyword(filename, keyword):
    for line in read_lines(filename):
        if keyword in line:
            yield line
Example usage:
results = search_keyword("biglog.txt", "12345")
for i, line in enumerate(results):
    print(f"Hit {i + 1}: {line}")
    if i == 4:
        break  # take only the first 5 hits
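Because `search_keyword` returns a generator, you can also let `itertools.islice` (introduced in the next section) do the 'first five' slicing instead of breaking out of the loop by hand; a minimal equivalent sketch:

from itertools import islice

for i, line in enumerate(islice(search_keyword("biglog.txt", "12345"), 5), start=1):
    print(f"Hit {i}: {line}")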
🧩 4. Taming "infinite streams" and grouping with itertools
import itertools
# infinite stream
def infinite_counter():
    n = 0
    while True:
        yield n
        n += 1

for i in itertools.islice(infinite_counter(), 5, 10):
    print(i)  # 5 6 7 8 9
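Worth knowing: `itertools` already ships an infinite counter, `itertools.count`, so the hand-rolled `infinite_counter` above is mainly there to show the mechanics. A minimal sketch producing the same output:

for i in itertools.islice(itertools.count(start=5), 5):
    print(i)  # 5 6 7 8 9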
# grouping tool: groupby
from itertools import groupby

data = ["apple", "apple", "banana", "banana", "banana", "cat"]
for key, group in groupby(data):
    print(key, list(group))
Output:
apple ['apple', 'apple']
banana ['banana', 'banana', 'banana']
cat ['cat']
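One caveat: `groupby` only groups consecutive equal items, so unsorted data has to be sorted first or the same key will show up in several groups. A minimal sketch, reusing the `groupby` imported above:

messy = ["cat", "apple", "banana", "apple", "banana", "banana"]
for key, group in groupby(sorted(messy)):
    print(key, len(list(group)))
# apple 2
# banana 3
# cat 1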
🚀 Hands-on project: a keyword counter that handles huge files
from collections import defaultdict

def count_keywords(filename, keywords):
    result = defaultdict(int)
    for line in read_lines(filename):
        for kw in keywords:
            if kw in line:
                result[kw] += 1
    return result
Usage:
counts = count_keywords("biglog.txt", ["error", "warn", "success"])
print(counts)
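Note that the simulated biglog.txt from section 2 only contains lines like "line 0", so none of these keywords will actually match. To make the demo meaningful, you could regenerate the file with random log levels; a minimal sketch using the standard library:

import random

levels = ["error", "warn", "success", "info"]
with open("biglog.txt", "w") as f:
    for i in range(1_000_000):
        f.write(f"{random.choice(levels)} message {i}\n")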
💡 Extension tasks
- Implement batched reading (process 1,000 lines at a time); a starting sketch appears below
- Use `yield from` to merge multiple file streams; see the same sketch below
- Pair the reader with `tqdm` for progress-bar feedback
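Here is a hedged starting point for the first two tasks. The helper names `read_in_batches` and `merge_files` are made up for this sketch and build on the `read_lines` generator defined earlier:

from itertools import islice

def read_in_batches(filename, batch_size=1000):
    # yield lists of up to batch_size stripped lines at a time
    it = read_lines(filename)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

def merge_files(*filenames):
    # chain several file streams into one lazy stream with yield from
    for name in filenames:
        yield from read_lines(name)

for batch in read_in_batches("biglog.txt"):
    print(len(batch))  # process up to 1,000 lines per iteration
    break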
🧠 One-sentence takeaway
Stop reading entire files into memory at once. `yield` is the workhorse of big-data processing: what you are writing is not just a function, it is the baton in a relay race of data.