Abstract: Python offers three main approaches to concurrency: threads, processes, and async IO. The GIL makes choosing between them subtle. This article uses real benchmark numbers to help you pick the right one for each scenario.
The GIL: The Topic You Can't Avoid

Python's Global Interpreter Lock (GIL) guarantees that only one thread executes Python bytecode at a time. The consequences:

- CPU-bound tasks: multithreading gives essentially no speedup
- IO-bound tasks: multithreading still works (the GIL is released while waiting on IO)
```python
import threading
import time

# CPU-bound: two threads are actually slower than one
def cpu_task():
    total = 0
    for i in range(10_000_000):
        total += i

# Single thread
start = time.time()
cpu_task()
cpu_task()
print(f'single thread: {time.time() - start:.2f}s')  # ~1.8s

# Two threads
start = time.time()
t1 = threading.Thread(target=cpu_task)
t2 = threading.Thread(target=cpu_task)
t1.start(); t2.start()
t1.join(); t2.join()
print(f'two threads: {time.time() - start:.2f}s')  # ~2.0s (slower!)
```
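The flip side is worth seeing once too. When a task spends its time waiting rather than computing, the same two-thread pattern does help. A minimal sketch, using `time.sleep` as a stand-in for a blocking IO call (sleeping, like waiting on a socket, releases the GIL):

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # stand-in for a blocking IO call; the GIL is released here

# Serial: the waits add up (~0.4s)
start = time.time()
io_task()
io_task()
serial = time.time() - start

# Two threads: the waits overlap (~0.2s)
start = time.time()
t1 = threading.Thread(target=io_task)
t2 = threading.Thread(target=io_task)
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start
```

Same thread code as the CPU example above; only the nature of the work changed.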
The Three Concurrency Models Compared

Multithreading (threading)
```python
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

def download(url):
    resp = requests.get(url, timeout=10)
    return len(resp.content)

urls = ['https://httpbin.org/delay/1'] * 10

# Option 1: manage threads by hand (note: return values are discarded)
threads = []
for url in urls:
    t = threading.Thread(target=download, args=(url,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# Option 2: thread pool (recommended) -- also collects the return values
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(download, urls))
Multiprocessing (multiprocessing)
```python
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def heavy_compute(n):
    """CPU-bound task."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

numbers = [10_000_000] * 8

# Option 1: Pool
with Pool(processes=4) as pool:
    results = pool.map(heavy_compute, numbers)

# Option 2: ProcessPoolExecutor (same interface as the thread pool)
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(heavy_compute, numbers))
Async IO (asyncio)
```python
import asyncio
import aiohttp

async def download(session, url):
    async with session.get(url) as resp:
        return len(await resp.read())

async def main():
    urls = ['https://httpbin.org/delay/1'] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [download(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

results = asyncio.run(main())
```
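`gather` launches every task at once; with thousands of URLs you usually want a cap on in-flight requests. The standard pattern is an `asyncio.Semaphore`. A sketch with `asyncio.sleep` standing in for `session.get` so it runs without aiohttp:

```python
import asyncio

async def fetch(sem, i):
    async with sem:                # at most 3 coroutines inside at once
        await asyncio.sleep(0.01)  # stand-in for the actual HTTP request
        return i * 2

async def main():
    sem = asyncio.Semaphore(3)
    tasks = [fetch(sem, i) for i in range(10)]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

All ten tasks are created up front, but only three make progress past the semaphore at any moment; `gather` still returns results in input order.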
Benchmarks

IO-bound: downloading 100 pages
```python
import time
from concurrent.futures import ThreadPoolExecutor
import asyncio
import aiohttp
import requests

urls = ['https://httpbin.org/delay/0.1'] * 100

# Serial
start = time.time()
for url in urls:
    requests.get(url)
serial_time = time.time() - start

# Threads
start = time.time()
with ThreadPoolExecutor(20) as ex:
    list(ex.map(requests.get, urls))
thread_time = time.time() - start

# Async
async def fetch(session, url):
    async with session.get(url) as resp:  # context manager releases the connection
        await resp.read()

async def async_test():
    async with aiohttp.ClientSession() as s:
        await asyncio.gather(*[fetch(s, u) for u in urls])

start = time.time()
asyncio.run(async_test())
async_time = time.time() - start
```
| Model | Time | Speedup |
|---|---|---|
| Serial | 12.5s | 1x |
| Threads (20) | 0.8s | 15x |
| Async IO (100) | 0.3s | 40x |
| Processes (4) | 3.5s | 3.5x |

For IO-bound work: async > threads >> processes > serial
CPU-bound: computing large factorials

```python
import math

def compute(n):
    return len(str(math.factorial(n)))

numbers = [100000] * 8
```
| Model | Time | Speedup |
|---|---|---|
| Serial | 16.2s | 1x |
| Threads (4) | 16.8s | 0.96x (slower) |
| Processes (4) | 4.5s | 3.6x |

For CPU-bound work: processes >> serial ≈ threads
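The table above comes from a harness along these lines (with `n` shrunk here so it finishes quickly; the absolute times depend on your hardware and core count):

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor

def compute(n):
    # Digit count of n! -- pure-Python CPU work, no IO
    return len(str(math.factorial(n)))

def bench(numbers):
    start = time.time()
    serial = [compute(n) for n in numbers]
    serial_time = time.time() - start

    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as ex:
        parallel = list(ex.map(compute, numbers))
    process_time = time.time() - start
    return serial, parallel, serial_time, process_time

if __name__ == '__main__':
    serial, parallel, st, pt = bench([5_000] * 8)
    assert serial == parallel  # same answers, different wall-clock time
```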
A Decision Tree for Choosing

What kind of task is it?
├── IO-bound (network requests, file IO, database queries)
│   ├── Need maximum throughput → asyncio + aiohttp
│   ├── Simplicity first → ThreadPoolExecutor
│   └── Existing sync code you don't want to rewrite → ThreadPoolExecutor
│
├── CPU-bound (numeric computation, image processing, data analysis)
│   ├── Pure-Python computation → multiprocessing
│   ├── NumPy/Pandas → their internals already release the GIL, so threads work too
│   └── Can use a C extension → GIL-releasing C extension + threads
│
└── Mixed
    └── Multiple processes, each running an async IO loop
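The "GIL-releasing C extension" branch is less exotic than it sounds: `hashlib` in the standard library, for instance, releases the GIL while hashing buffers larger than roughly 2 KiB, so plain threads parallelize it. A sketch:

```python
import hashlib
import threading

# Large buffers -> hashlib releases the GIL during the C hashing loop
data = [b'x' * 5_000_000 for _ in range(4)]
digests = [None] * len(data)

def hash_one(i):
    digests[i] = hashlib.sha256(data[i]).hexdigest()

threads = [threading.Thread(target=hash_one, args=(i,)) for i in range(len(data))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With buffers this size the four hashes genuinely run in parallel on separate cores, no processes needed.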
In Practice: A Mixed Workload

A crawler is the classic mixed workload: downloading is IO-bound, parsing is CPU-bound:
```python
import asyncio
import aiohttp
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def parse_html(html):
    """CPU-bound: parse the HTML (runs in a child process)."""
    soup = BeautifulSoup(html, 'lxml')
    return {
        'title': soup.title.string if soup.title else '',
        'links': len(soup.find_all('a')),
        'text_length': len(soup.get_text()),
    }

async def fetch(session, url):
    """IO-bound: download a page (async)."""
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    # Process pool for the CPU-bound parsing
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as process_pool:
        async with aiohttp.ClientSession() as session:
            # Download all pages concurrently on the event loop
            htmls = await asyncio.gather(*[fetch(session, url) for url in urls])
            # Parse in parallel across processes
            tasks = [
                loop.run_in_executor(process_pool, parse_html, html)
                for html in htmls
            ]
            results = await asyncio.gather(*tasks)
    return results
```
Thread Safety

Threads share memory, so watch out for data races:
```python
import threading
from queue import Queue

# ❌ Unsafe
counter = 0
def increment():
    global counter
    for _ in range(100000):
        counter += 1  # not atomic!

# ✅ Protect with a lock
lock = threading.Lock()
counter = 0
def safe_increment():
    global counter
    for _ in range(100000):
        with lock:
            counter += 1

# ✅ Better: use a Queue and avoid shared state entirely
def worker(q, results):
    while True:
        item = q.get()
        if item is None:  # sentinel value: stop
            break
        result = process(item)  # process() is whatever per-item work you need
        results.put(result)
        q.task_done()
```
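Fleshed out into a runnable sketch: `process()` here is a placeholder for real per-item work, and `None` serves as the shutdown sentinel, one per worker:

```python
import threading
from queue import Queue

def process(item):
    return item * 10  # placeholder for real per-item work

def worker(q, results):
    while True:
        item = q.get()
        if item is None:   # sentinel: this worker is done
            q.task_done()
            break
        results.put(process(item))
        q.task_done()

q, results = Queue(), Queue()
workers = [threading.Thread(target=worker, args=(q, results)) for _ in range(3)]
for t in workers:
    t.start()
for item in range(5):
    q.put(item)
for _ in workers:
    q.put(None)            # one sentinel per worker thread
for t in workers:
    t.join()
collected = sorted(results.get() for _ in range(results.qsize()))
```

No lock anywhere: `Queue` does its own locking internally, which is exactly why it beats hand-rolled shared state.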
Summary

- IO-bound → asyncio first, threads second
- CPU-bound → multiprocessing, unless your hot loop already lives in a GIL-releasing C extension
- Mixed → processes + async IO
- For simple cases, use concurrent.futures: the thread and process pools share one interface, so switching between them is easy
- With threads, mind thread safety; prefer a Queue over a Lock where you can

Pick the right concurrency model and the speedup is an order of magnitude. Pick the wrong one and you gain nothing but complexity.