Main Topics
Why concurrent programming?
Diagram: how concurrency speeds up a program
How to choose among multiprocessing, multithreading, and coroutines
What are CPU-bound and IO-bound tasks?
Comparing multiprocessing, multithreading, and coroutines, and how to choose among them
Why is Python slow?
Diagram: how the GIL works
Why does the GIL exist?
How to work around the limitations of the GIL
How to create threads in Python
Code
Crawling cnblogs with multiple threads
import blog_spider
import threading
import time

def single_thread():
    print("single_thread begin")
    for url in blog_spider.urls:
        blog_spider.craw(url)
    print("single_thread end")

def multi_thread():
    print("multi_thread begin")
    # Create one thread per URL, start them all, then wait for all to finish.
    threads = []
    for url in blog_spider.urls:
        threads.append(
            threading.Thread(target=blog_spider.craw, args=(url,))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi_thread end")

if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single thread cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi thread cost:", end - start, "seconds")
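The examples in these notes import a helper module blog_spider that is not shown in this section. A minimal sketch of what it plausibly contains, based on how urls, craw, and parse are used throughout these notes (an assumption, not the original module), with requests and BeautifulSoup standing in for the real implementation:

import requests
from bs4 import BeautifulSoup

# 50 paginated list URLs on cnblogs (assumed; adjust to the real target site)
urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 51)]

def craw(url):
    # Fetch one page and return its HTML text.
    r = requests.get(url)
    return r.text

def parse(html):
    # Extract (link, title) pairs from a list page.
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]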
Producer-consumer crawler in Python
The producer-consumer model for crawlers
Producer-consumer crawler example
import queue
import blog_spider
import time
import random
import threading

def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    # Producer: take a URL from url_queue, download it, put the HTML on html_queue.
    while True:
        url = url_queue.get()
        html = blog_spider.craw(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"craw {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))

def do_parse(html_queue: queue.Queue, fout):
    # Consumer: take HTML from html_queue, parse it, write results to a file.
    while True:
        html = html_queue.get()
        results = blog_spider.parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, f"results.size={len(results)}",
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))

if __name__ == "__main__":
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in blog_spider.urls:
        url_queue.put(url)

    # Producer threads
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f"craw{idx}")
        t.start()

    # Consumer threads
    fout = open("02.data.txt", "w")
    for idx in range(3):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
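Note that both worker loops run forever, so this script never exits on its own. A minimal sketch (not part of the original notes) of one common shutdown pattern, using Queue.task_done()/join() plus one sentinel value per worker:

import queue
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        if item is None:          # sentinel: this worker should stop
            q.task_done()
            break
        # ... process item here ...
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for item in range(10):
    q.put(item)
q.join()                          # block until every item has been processed
for _ in threads:
    q.put(None)                   # one sentinel per worker
for t in threads:
    t.join()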
Thread safety
How to make code thread-safe
Concrete code example
import threading
import time

lock = threading.Lock()

class Account:
    def __init__(self, balance):
        self.balance = balance

def draw(account, amount):
    # The lock makes check-then-deduct atomic, so the sleep below cannot
    # let another thread sneak in between the check and the deduction.
    with lock:
        if account.balance >= amount:
            time.sleep(0.1)
            print(threading.current_thread().name, "withdraw succeeded")
            account.balance -= amount
            print(threading.current_thread().name, "balance:", account.balance)
        else:
            print(threading.current_thread().name, "withdraw failed: insufficient balance")

if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name="ta", target=draw, args=(account, 800))
    tb = threading.Thread(name="tb", target=draw, args=(account, 800))
    ta.start()
    tb.start()
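For contrast, here is a sketch of the same withdrawal without the lock (consistent with the point of this section, though not shown verbatim in the notes): the time.sleep(0.1) inside the balance check forces a thread switch at the worst possible moment, so both threads pass the check before either deducts.

import threading
import time

class Account:
    def __init__(self, balance):
        self.balance = balance

def draw_unsafe(account, amount):
    # No lock: both threads can pass this check before either deducts,
    # because sleep() forces a thread switch inside the critical section.
    if account.balance >= amount:
        time.sleep(0.1)
        account.balance -= amount
        print(threading.current_thread().name, "balance:", account.balance)

account = Account(1000)
ta = threading.Thread(name="ta", target=draw_unsafe, args=(account, 800))
tb = threading.Thread(name="tb", target=draw_unsafe, args=(account, 800))
ta.start()
tb.start()
ta.join()
tb.join()
# Typically ends with balance == -600: both withdrawals "succeeded".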
Thread pools
Why use a thread pool?
Benefits of using a thread pool
Usage syntax of ThreadPoolExecutor
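In outline, ThreadPoolExecutor offers two usage patterns (a generic sketch; func and args_list are placeholders, not names from the notes):

from concurrent.futures import ThreadPoolExecutor, as_completed

def func(x):              # placeholder task
    return x * x

args_list = [1, 2, 3, 4]

# Pattern 1: map — results come back in the same order as the inputs.
with ThreadPoolExecutor() as pool:
    results = pool.map(func, args_list)
    print(list(results))              # [1, 4, 9, 16]

# Pattern 2: submit — returns a Future per task; as_completed yields
# futures in the order they finish, not the order they were submitted.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(func, arg) for arg in args_list]
    for future in as_completed(futures):
        print(future.result())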
Concrete code example
import concurrent.futures
import blog_spider

# craw: download every page with a pool of threads
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(blog_spider.craw, blog_spider.urls)
    htmls = list(zip(blog_spider.urls, htmls))
    for url, html in htmls:
        print(url, len(html))
print("craw over")

# parse: submit one parse task per page, consume results as they complete
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(blog_spider.parse, html)
        futures[future] = url

    # for future, url in futures.items():
    #     print(url, future.result())

    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())
print("parse over")
Using a thread pool in a web service
1. Architecture and characteristics of web services
2. Speeding things up with ThreadPoolExecutor
05 flask_thread_pool.py
3. Using a thread pool to speed up a web service in Python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import flask

app = flask.Flask(__name__)

# Create the pool once, at startup, and reuse it across requests.
pool = ThreadPoolExecutor()

def read_file():
    time.sleep(0.1)
    return "file result"

def read_db():
    time.sleep(0.1)
    return "db result"

def read_api():
    time.sleep(0.1)
    return "api result"

@app.route("/")
def index():
    # Kick off the three IO calls concurrently, then wait for all of them.
    result_file = pool.submit(read_file)
    result_db = pool.submit(read_db)
    result_api = pool.submit(read_api)
    return json.dumps({
        "result_file": result_file.result(),
        "result_db": result_db.result(),
        "result_api": result_api.result(),
    })

if __name__ == "__main__":
    app.run()
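With the three reads submitted concurrently, a request to http://127.0.0.1:5000/ (Flask's default address) should take roughly 0.1 s, the longest single sleep, instead of about 0.3 s if the three reads ran one after another.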
Speeding up programs with the multiprocessing module
1. Why multiprocessing? Mainly for CPU-bound work; IO-bound work is better served by multithreading
2. Overview of multiprocessing concepts
3. Multithreading is a poor fit for CPU-intensive computation
4. Concrete code example
import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

PRIMES = [112272535095293] * 100

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def single_thread():
    for number in PRIMES:
        is_prime(number)

def multi_thread():
    with ThreadPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

def multi_process():
    with ProcessPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single_thread, cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi_thread, cost:", end - start, "seconds")

    start = time.time()
    multi_process()
    end = time.time()
    print("multi_process, cost:", end - start, "seconds")
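Because is_prime is pure CPU work, the GIL prevents the thread pool from using more than one core, so multi_thread typically runs no faster than single_thread (often slightly slower, due to thread-switching overhead), while multi_process scales with the number of CPU cores. Exact timings depend on the machine.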
Using a process pool to speed up a Flask service in Python
import json
import math

import flask
from concurrent.futures import ProcessPoolExecutor

app = flask.Flask(__name__)

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

@app.route("/is_prime/<numbers>")
def api_is_prime(numbers):
    number_list = [int(x) for x in numbers.split(",")]
    results = process_pool.map(is_prime, number_list)
    return json.dumps(dict(zip(number_list, results)))

if __name__ == "__main__":
    # The pool must be created here, after all module-level functions are
    # defined and inside the __main__ guard: worker processes re-import this
    # module, so creating the pool any earlier would fail.
    process_pool = ProcessPoolExecutor()
    app.run()
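As a usage example: requesting http://127.0.0.1:5000/is_prime/3,4,5 returns {"3": true, "4": false, "5": true}, with the primality checks distributed across the worker processes.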
Building a concurrent crawler with Python async IO
Introduction to Python's async IO library: asyncio
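Before the crawler, a minimal sketch of the asyncio programming model (illustrative, not from the original notes): coroutines are declared with async def, give up control at await points, and are driven by a single-threaded event loop.

import asyncio

async def say_after(delay, message):
    await asyncio.sleep(delay)   # yields control while "waiting"
    print(message)

async def main():
    # Both coroutines run concurrently on one thread:
    await asyncio.gather(
        say_after(1, "hello"),
        say_after(1, "world"),
    )

asyncio.run(main())  # total time ~1 s, not 2 s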
Multithreading vs. coroutines
1. Runtime with coroutines
2. Runtime with multithreading
Using a semaphore to limit crawler concurrency in async IO
What is a semaphore?
Controlling the number of concurrent coroutines in code
Hayo's answer
Yes, a Semaphore can be used to control the number of coroutines that may access a shared resource.
A Semaphore is a counter whose main purpose is to limit how many threads (or coroutines) can access a shared resource at the same time.
It is initialized with the number of permits; each acquire decrements the counter by 1, and each release increments it by 1.
This caps the overall concurrency.
For example, if a resource allows at most 5 coroutines to access it at once, a Semaphore can be used like this (pseudocode; spawn stands for any coroutine launcher):

sem = Semaphore(5)

def access_resource():
    sem.acquire()
    # critical section: access the shared resource
    ...
    sem.release()

# launch several coroutines that access the resource
for i in range(10):
    spawn(access_resource)

In the code above, although 10 coroutines are created, the Semaphore ensures that at most 5 of them access the shared resource at any given moment.
So the answer is yes: a Semaphore can effectively control the number of coroutines that access a shared resource.
Crawler example
import asyncio
import time

import aiohttp
import blog_spider

# Allow at most 10 coroutines to crawl at the same time.
semaphore = asyncio.Semaphore(10)

async def async_craw(url):
    async with semaphore:
        print("craw url:", url)
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text()
                # Artificial delay so the effect of the semaphore is visible.
                await asyncio.sleep(10)
                print(f"craw url: {url}, {len(result)}")

loop = asyncio.get_event_loop()
tasks = [loop.create_task(async_craw(url)) for url in blog_spider.urls]

start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print(f"use time seconds: {end - start}")
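A note on API style: get_event_loop() plus run_until_complete is the legacy way to drive coroutines; on Python 3.7+ the idiomatic entry point is asyncio.run with asyncio.gather. A sketch reusing async_craw and blog_spider from above (caveat: before Python 3.10, asyncio.Semaphore bound itself to the loop current at creation time, so under asyncio.run the semaphore should be created inside the running coroutine):

import asyncio

async def main():
    # gather schedules all coroutines concurrently and waits for them all.
    await asyncio.gather(*(async_craw(url) for url in blog_spider.urls))

asyncio.run(main())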