⚙️ Want your crawler, downloader, or scripts to run 5x faster? Then you need to master threads (Thread) and processes (Process) in Python. This post walks you through building practical concurrency tools: no jargon overload, straight to the projects!
✅ Goals of this post
- Understand when to use threading vs. multiprocessing
- Build a concurrent downloader (multithreading) and a batch CPU-heavy processor (multiprocessing)
- Learn progress control, data sharing, and task management
🧠 1. Threads & Processes: the difference in plain terms
| Trait | Thread | Process |
|---|---|---|
| Best for | I/O-bound work (e.g. network requests) | CPU-bound work (e.g. image processing) |
| Memory | Shared memory | Separate memory |
| Creation cost | Low | High |
| GIL | Constrained by the GIL | Not constrained |
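One practical consequence of the "shared memory" row: all threads see the same objects, so they can append to one shared list. A minimal sketch (the `work` function and `results` list are just illustrative names):

```python
import threading

results = []  # lives in the parent; every thread sees this same list

def work(n):
    results.append(n * n)  # list.append is atomic in CPython, so this is safe

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]
```

With multiprocessing, each child would append to its own copy and the parent's list would stay empty; sharing state across processes requires primitives like `multiprocessing.Queue` or a `Manager` list.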
🧵 2. threading in practice: a multithreaded crawler/downloader
```python
import threading
import requests

urls = [
    "https://www.baidu.com",
    "https://juejin.cn",
    "https://zhihu.com",
    "https://python.org"
]

def fetch(url):
    print(f"🌐 Fetching {url}")
    resp = requests.get(url, timeout=10)  # always set a timeout on network calls
    print(f"✅ Done {url}, length: {len(resp.text)}")

threads = []
for u in urls:
    t = threading.Thread(target=fetch, args=(u,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("🎉 All tasks finished")
```
🧠 3. multiprocessing in practice: batch CPU-bound work
For example, computing Fibonacci numbers:
```python
from multiprocessing import Pool

def fib(n):
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    with Pool(4) as p:
        results = p.map(fib, [30, 31, 32, 33])
    print(results)
```
👉 Output: `[832040, 1346269, 2178309, 3524578]`
🔄 4. A cleaner style with concurrent.futures
```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    print(f"Fetching {url}")
    return requests.get(url, timeout=10).status_code

urls = [
    "https://www.baidu.com",
    "https://www.zhihu.com",
    "https://juejin.cn"
]

with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(fetch, urls)

for code in results:
    print("Status code:", code)
```
🔐 5. Thread locks and data sharing (avoiding data races)
```python
import threading

count = 0
lock = threading.Lock()

def add():
    global count
    for _ in range(100000):
        with lock:  # only one thread may increment at a time
            count += 1

threads = [threading.Thread(target=add) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Final count =", count)  # always 1000000 thanks to the lock
```
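An alternative to a lock around a shared global: have threads hand results to a `queue.Queue`, which does its own locking internally. A small sketch (the `worker` function is illustrative):

```python
import threading
import queue

q = queue.Queue()

def worker(n):
    q.put(n * n)  # Queue.put is thread-safe; no explicit lock needed

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue once all producers have finished
collected = sorted(q.get() for _ in range(q.qsize()))
print(collected)  # [0, 1, 4, 9, 16]
```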
📦 Hands-on project: a multithreaded batch image downloader
```python
import os
import threading
import requests

img_urls = [
    "https://picsum.photos/300/300",
    "https://picsum.photos/400/400",
    "https://picsum.photos/500/500"
]

os.makedirs("imgs", exist_ok=True)

def download_img(i, url):
    r = requests.get(url, timeout=10)
    with open(f"imgs/img_{i}.jpg", "wb") as f:
        f.write(r.content)
    print(f"✅ Downloaded: img_{i}.jpg")

threads = []
for i, u in enumerate(img_urls):
    t = threading.Thread(target=download_img, args=(i, u))
    threads.append(t)
    t.start()

for t in threads:
    t.join()
```
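To report per-task progress (one of the goals above), submit each download individually and consume results with `as_completed`. This offline sketch swaps the real `requests.get` call for a hypothetical `fake_download` so it runs without a network:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(i):
    # Hypothetical stand-in for requests.get + file write, so the sketch runs offline
    return f"img_{i}.jpg"

done = []
with ThreadPoolExecutor(max_workers=3) as ex:
    futures = {ex.submit(fake_download, i): i for i in range(3)}
    for fut in as_completed(futures):  # yields each future as soon as it finishes
        name = fut.result()
        done.append(name)
        print(f"✅ Finished {name} ({len(done)}/{len(futures)})")

print(sorted(done))  # ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']
```

Unlike `executor.map`, `submit` + `as_completed` gives you results in completion order, which is what a progress bar or retry scheduler needs.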
💡 Challenges to try
- Use multiprocessing.Pool to write a video transcoding script (multi-core speedup)
- Automatically retry failed downloads (with retrying or tenacity)
- Build a thread pool + queue task dispatcher
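For the retry challenge, here is a hand-rolled sketch using only the standard library (libraries like retrying and tenacity wrap this same pattern in a decorator); `flaky` and its failure count are purely illustrative:

```python
import time

def retry(fn, attempts=3, delay=0.1):
    """Call fn(); on exception, retry up to `attempts` times total."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay)

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = retry(flaky)
print(result)  # "ok", after two failed attempts
```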
🧠 One-line summary
Use threading to speed up I/O-bound crawlers and multiprocessing to squeeze every CPU core. Master concurrency, and you can write Python tools that are both fast and powerful.