Python Multithreading vs. Multiprocessing: A Guide to Choosing a Concurrency Model


Abstract: Python offers three main approaches to concurrency: multithreading, multiprocessing, and async IO. The GIL makes the choice subtle. This article uses real benchmark numbers to help you make the right call in each scenario.

The GIL: The Unavoidable Topic

Python's Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at any given moment. This means:

  • CPU-bound tasks: multithreading provides almost no speedup
  • IO-bound tasks: multithreading still helps (the GIL is released while waiting on IO)

import threading
import time

# CPU-bound: multithreading is actually slower
def cpu_task():
    total = 0
    for i in range(10_000_000):
        total += i

# Single-threaded
start = time.time()
cpu_task()
cpu_task()
print(f'Single-threaded: {time.time() - start:.2f}s')  # ~1.8s

# Two threads
start = time.time()
t1 = threading.Thread(target=cpu_task)
t2 = threading.Thread(target=cpu_task)
t1.start(); t2.start()
t1.join(); t2.join()
print(f'Two threads: {time.time() - start:.2f}s')  # ~2.0s (slower!)
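
The flip side also holds: for IO-bound work the GIL is released while a thread waits, so the waits overlap. A minimal sketch, with time.sleep standing in for a network call:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # stands in for a network request or disk read

# Serial: four waits back to back -> ~0.8s
start = time.time()
for _ in range(4):
    io_task()
serial = time.time() - start

# Threaded: four waits overlap -> ~0.2s
start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print(f'Serial: {serial:.2f}s, threaded: {threaded:.2f}s')
```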

Comparing the Three Concurrency Approaches

Multithreading (threading)

import threading
import requests
from concurrent.futures import ThreadPoolExecutor

def download(url):
    resp = requests.get(url, timeout=10)
    return len(resp.content)

urls = ['https://httpbin.org/delay/1'] * 10

# Option 1: manage threads manually
threads = []
for url in urls:
    t = threading.Thread(target=download, args=(url,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# Option 2: thread pool (recommended)
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(download, urls))

Multiprocessing (multiprocessing)

from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def heavy_compute(n):
    """CPU密集型任务"""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

numbers = [10_000_000] * 8

# Option 1: process pool
with Pool(processes=4) as pool:
    results = pool.map(heavy_compute, numbers)

# Option 2: ProcessPoolExecutor (uniform interface)
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(heavy_compute, numbers))

Async IO (asyncio)

import asyncio
import aiohttp

async def download(session, url):
    async with session.get(url) as resp:
        return len(await resp.read())

async def main():
    urls = ['https://httpbin.org/delay/1'] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [download(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
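
In real use you usually cap how many requests run at once rather than firing all of them. A hedged sketch of that pattern with asyncio.Semaphore; the URL list is hypothetical and asyncio.sleep stands in for the actual `session.get` call:

```python
import asyncio

async def download(sem, url):
    async with sem:               # at most N downloads in flight at once
        await asyncio.sleep(0.1)  # placeholder for the real network IO
        return url

async def main():
    urls = [f'https://example.com/{i}' for i in range(20)]
    sem = asyncio.Semaphore(5)    # concurrency cap
    return await asyncio.gather(*(download(sem, u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 20
```

gather preserves input order, so results line up with urls even though completion order varies.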

Benchmarks

IO-bound: downloading 100 web pages

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import asyncio
import aiohttp
import requests

urls = ['https://httpbin.org/delay/0.1'] * 100

# Serial
start = time.time()
for url in urls:
    requests.get(url)
serial_time = time.time() - start

# Threads
start = time.time()
with ThreadPoolExecutor(20) as ex:
    list(ex.map(lambda u: requests.get(u), urls))
thread_time = time.time() - start

# Async
async def async_test():
    async with aiohttp.ClientSession() as s:
        tasks = [s.get(u) for u in urls]
        return await asyncio.gather(*tasks)

start = time.time()
asyncio.run(async_test())
async_time = time.time() - start

Method          Time    Speedup
Serial          12.5s   1x
Threads (20)    0.8s    15x
Async IO (100)  0.3s    40x
Processes (4)   3.5s    3.5x

For IO-bound work: async > threads >> processes > serial

CPU-bound: computing eight large factorials

import math

def compute(n):
    return len(str(math.factorial(n)))

numbers = [100000] * 8

Method         Time    Speedup
Serial         16.2s   1x
Threads (4)    16.8s   0.96x (slower)
Processes (4)  4.5s    3.6x

For CPU-bound work: processes >> serial ≈ threads

Decision Tree

What kind of task do you have?
├── IO-bound (network requests, file IO, database queries)
│   ├── Need maximum throughput → asyncio + aiohttp
│   ├── Simplicity first → ThreadPoolExecutor
│   └── Existing sync code you don't want to rewrite → ThreadPoolExecutor
│
├── CPU-bound (math, image processing, data analysis)
│   ├── Pure-Python computation → multiprocessing
│   ├── NumPy/Pandas → they release the GIL internally, so threads work too
│   └── Can use a C extension → GIL-releasing C extension + threads
│
└── Mixed
    └── Multiple processes, each running async IO internally
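
The NumPy and C-extension branches work because C code can drop the GIL while it computes. The standard library's hashlib is a convenient, dependency-free illustration: its digests release the GIL for large buffers, so a thread pool can hash in parallel. A sketch (timings vary by machine; the data here is arbitrary):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(data):
    # hashlib releases the GIL while hashing large buffers,
    # so these C-level computations can run on multiple cores
    return hashlib.sha256(data).hexdigest()

blobs = [bytes([i]) * 1_000_000 for i in range(8)]  # eight 1 MB buffers

with ThreadPoolExecutor(max_workers=4) as ex:
    hashes = list(ex.map(digest, blobs))

print(len(hashes))  # 8
```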

Hands-On: A Mixed Workload

A crawler is the classic mixed workload: downloading is IO-bound, parsing is CPU-bound:

import asyncio
import aiohttp
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def parse_html(html):
    """CPU密集:解析HTML(在子进程中执行)"""
    soup = BeautifulSoup(html, 'lxml')
    return {
        'title': soup.title.string if soup.title else '',
        'links': len(soup.find_all('a')),
        'text_length': len(soup.get_text()),
    }

async def fetch(session, url):
    """IO密集:下载页面(异步)"""
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    # Process pool for the CPU-bound parsing step
    process_pool = ProcessPoolExecutor(max_workers=4)
    loop = asyncio.get_running_loop()
    
    async with aiohttp.ClientSession() as session:
        # Download all pages asynchronously
        htmls = await asyncio.gather(*[fetch(session, url) for url in urls])
        
        # Parse in parallel in the process pool
        tasks = [
            loop.run_in_executor(process_pool, parse_html, html)
            for html in htmls
        ]
        results = await asyncio.gather(*tasks)
    
    process_pool.shutdown()
    return results

# On platforms that spawn worker processes (Windows, macOS), guard the entry point:
# if __name__ == '__main__':
#     results = asyncio.run(main(urls))

Thread Safety

Threads share memory, so watch out for data races:

import threading

# ❌ Unsafe
counter = 0
def increment():
    global counter
    for _ in range(100000):
        counter += 1  # not an atomic operation!

# ✅ Protect with a lock
lock = threading.Lock()
counter = 0
def safe_increment():
    global counter
    for _ in range(100000):
        with lock:
            counter += 1

# ✅ Even better: avoid shared state with a Queue
from queue import Queue

def worker(q, results):
    while True:
        item = q.get()
        if item is None:        # sentinel: shut this worker down
            break
        result = process(item)  # process() is a placeholder for your task function
        results.put(result)
        q.task_done()
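
A runnable version of the pattern above, with a hypothetical `process` that squares each item and None as the shutdown sentinel:

```python
import threading
from queue import Queue

def process(item):
    return item * item  # hypothetical task: square each item

def worker(q, results):
    while True:
        item = q.get()
        if item is None:   # sentinel: shut this worker down
            break
        results.put(process(item))
        q.task_done()

q, results = Queue(), Queue()
threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    q.put(i)
for _ in threads:
    q.put(None)            # one sentinel per worker
for t in threads:
    t.join()

out = sorted(results.get() for _ in range(10))
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the queue is FIFO, every real item is consumed before the sentinels, so no work is lost at shutdown.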

Summary

  • IO-bound → asyncio first, threads second
  • CPU-bound → multiprocessing is effectively the only option (because of the GIL)
  • Mixed → processes + async IO
  • For simple cases use concurrent.futures: uniform interface, easy to switch
  • Mind thread safety with threads; prefer a Queue over a Lock when you can

Pick the right concurrency model and the speedup is measured in orders of magnitude. Pick the wrong one and you get no speedup, just added complexity.