Learning Multithreading and Multiprocessing

Main topics

Why concurrent programming?

How concurrency speeds up a program

Choosing among multiprocessing, multithreading, and coroutines

What are CPU-bound and IO-bound tasks?
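Roughly, a CPU-bound task spends its time computing while an IO-bound task spends its time waiting. A minimal sketch of the distinction (the sleep here is a stand-in for real network or disk IO):

```python
import math
import time


def cpu_bound_task():
    # time dominated by computation on the CPU
    return sum(math.sqrt(i) for i in range(100_000))


def io_bound_task():
    # time dominated by waiting (simulated here with sleep)
    time.sleep(0.1)
    return "done"
```

Threads help the IO-bound case because waiting releases the GIL; the CPU-bound case needs multiple processes to run in parallel.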

Comparing multithreading, multiprocessing, and coroutines, and how to choose among them

Why is it slow?

How the GIL works

Why the GIL exists

How to work around the GIL's limitations

How to create threads in Python

Code

Crawling cnblogs with multiple threads:

import blog_spider
import threading
import time


def single_thread():
    print("single_thread begin")
    for url in blog_spider.urls:
        blog_spider.craw(url)
    print("single_thread end")


def multi_thread():
    print("multi_thread begin")
    threads = []
    for url in blog_spider.urls:
        threads.append(
            threading.Thread(target=blog_spider.craw, args=(url,))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi_thread end")


if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single thread cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi thread cost:", end - start, "seconds")
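The examples in this post all assume a small blog_spider module providing urls, craw, and parse. The source doesn't show it, so here is one hypothetical minimal sketch (the CSS class name in parse is an assumption, not the real cnblogs markup):

```python
import re
import urllib.request

# list page URLs to crawl (page count is an assumption)
urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 51)]


def craw(url):
    # download one page and return its HTML
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def parse(html):
    # pull (href, title) pairs out of post-title links;
    # the class name here is hypothetical
    pattern = r'<a class="post-item-title" href="([^"]+)"[^>]*>([^<]+)</a>'
    return re.findall(pattern, html)
```

A real implementation would more likely use requests and BeautifulSoup; this stdlib version is just enough to make the examples runnable.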

A producer-consumer crawler in Python

The producer-consumer model for crawlers

Producer-consumer crawler example:
import queue
import blog_spider
import time
import random
import threading


def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    while True:
        url = url_queue.get()
        html = blog_spider.craw(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"craw {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))


def do_parse(html_queue: queue.Queue, fout):
    while True:
        html = html_queue.get()
        results = blog_spider.parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, f"results.size={len(results)}",
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))


if __name__ == "__main__":
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in blog_spider.urls:
        url_queue.put(url)
    # producer threads
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f"craw{idx}")
        t.start()
    # consumer threads
    fout = open("02.data.txt", "w")
    for idx in range(3):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
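The pattern works because queue.Queue is thread-safe: get and put need no extra locking. Note that the while True loops above never terminate, so the program must be killed by hand. A minimal self-contained sketch of the same hand-off, using a sentinel value for a clean shutdown:

```python
import queue
import threading

q = queue.Queue()
results = []


def producer():
    for i in range(5):
        q.put(i)
    q.put(None)  # sentinel tells the consumer to stop


def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 10)


tp = threading.Thread(target=producer)
tc = threading.Thread(target=consumer)
tp.start(); tc.start()
tp.join(); tc.join()
print(results)  # [0, 10, 20, 30, 40]
```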

Thread safety

How to make code thread-safe

Concrete code example:
import threading
import time

lock = threading.Lock()


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw(account, amount):
    with lock:
        if account.balance >= amount:
            # the sleep forces a thread switch, which exposes the race if the lock is removed
            time.sleep(0.1)
            print(threading.current_thread().name, "withdrawal succeeded")
            account.balance -= amount
            print(threading.current_thread().name, "balance:", account.balance)
        else:
            print(threading.current_thread().name, "withdrawal failed: insufficient balance")


if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name="ta", target=draw, args=(account, 800))
    tb = threading.Thread(name="tb", target=draw, args=(account, 800))

    ta.start()
    tb.start()
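For contrast, here is what happens when the lock is removed: the sleep between the balance check and the subtraction lets both threads pass the check, so the account ends up overdrawn instead of the second withdrawal being refused.

```python
import threading
import time


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw_unsafe(account, amount):
    # no lock: both threads can pass the balance check before either subtracts
    if account.balance >= amount:
        time.sleep(0.1)  # forces a thread switch between check and subtract
        account.balance -= amount


account = Account(1000)
ta = threading.Thread(target=draw_unsafe, args=(account, 800))
tb = threading.Thread(target=draw_unsafe, args=(account, 800))
ta.start(); tb.start()
ta.join(); tb.join()
print(account.balance)  # -600: both 800-yuan withdrawals went through
```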

Thread pools

Why use a thread pool?

Benefits of using a thread pool

ThreadPoolExecutor usage syntax

Concrete code example:
import concurrent.futures
import blog_spider

# craw
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(blog_spider.craw, blog_spider.urls)
    htmls = list(zip(blog_spider.urls, htmls))
    for url, html in htmls:
        print(url, len(html))
print("craw over")

# parse
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(blog_spider.parse, html)
        futures[future] = url
    # alternative: iterate in submission order
    # for future, url in futures.items():
    #     print(url, future.result())
    # iterate in completion order:
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())
print("parse over")
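The difference between the two iteration styles above can be demonstrated directly: map yields results in submission order, while as_completed yields futures as they finish, which matters when task durations vary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    # longer inputs take longer, so completion order differs from submission order
    time.sleep(n / 10)
    return n


with ThreadPoolExecutor() as pool:
    # map: results come back in submission order
    map_order = list(pool.map(work, [3, 1, 2]))
    # as_completed: futures come back in completion order
    futures = [pool.submit(work, n) for n in [3, 1, 2]]
    done_order = [f.result() for f in as_completed(futures)]

print(map_order)   # [3, 1, 2]
print(done_order)  # [1, 2, 3]
```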


Using a thread pool in a web service

1. Architecture and characteristics of web services

2. Using ThreadPoolExecutor to speed things up (05 flask_thread_pool.py)

3. Speeding up a web service with a thread pool in Python

import json
import time
from concurrent.futures import ThreadPoolExecutor

import flask

app = flask.Flask(__name__)
# initialize the pool once, at module level
pool = ThreadPoolExecutor()


def read_file():
    time.sleep(0.1)
    return "file result"


def read_db():
    time.sleep(0.1)
    return "db result"


def read_api():
    time.sleep(0.1)
    return "api result"


@app.route("/")
def index():
    # submit all three IO tasks so their waits overlap
    result_file = pool.submit(read_file)
    result_db = pool.submit(read_db)
    result_api = pool.submit(read_api)

    return json.dumps({
        "result_file": result_file.result(),
        "result_db": result_db.result(),
        "result_api": result_api.result()
    })


if __name__ == "__main__":
    app.run()
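The payoff of submitting all three tasks before calling result() is that the three 0.1s waits overlap: the request takes roughly 0.1s instead of 0.3s. A Flask-free sketch of the same timing argument:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io():
    # stand-in for a 0.1s file/db/api call
    time.sleep(0.1)
    return "ok"


pool = ThreadPoolExecutor()
start = time.time()
# submit first, collect results afterwards, so the waits run concurrently
futures = [pool.submit(slow_io) for _ in range(3)]
results = [f.result() for f in futures]
elapsed = time.time() - start
print(results, round(elapsed, 2))  # total is ~0.1s, not ~0.3s
```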

Speeding up programs with the multiprocessing module

1. Why multiprocessing? Mainly for CPU-bound work; for IO-bound work use multithreading

2. Multiprocessing overview

3. Multithreading is a poor fit for CPU-intensive computation

4. Concrete code example:

import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

PRIMES = [112272535095293] * 100


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def single_thread():
    for number in PRIMES:
        is_prime(number)


def multi_thread():
    with ThreadPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)


def multi_process():
    with ProcessPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)


if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single_thread, cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi_thread, cost:", end - start, "seconds")

    start = time.time()
    multi_process()
    end = time.time()
    print("multi_process, cost:", end - start, "seconds")

Using a process pool to speed up a Flask service in Python

import json
import math

import flask
from concurrent.futures import ProcessPoolExecutor

app = flask.Flask(__name__)


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


@app.route("/is_prime/<numbers>")
def api_is_prime(numbers):
    number_list = [int(x) for x in numbers.split(",")]
    results = process_pool.map(is_prime, number_list)
    return json.dumps(dict(zip(number_list, results)))


if __name__ == "__main__":
    # the pool must be created here, inside the __main__ guard,
    # because worker processes re-import this module when they start
    process_pool = ProcessPoolExecutor()
    app.run()

A concurrent crawler with Python async IO

Python's async IO library: asyncio
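The asyncio details in this section were in the slides; the core pattern can be sketched in a few lines. Each await hands control back to the event loop, so many waits overlap on a single thread:

```python
import asyncio


async def fetch(i):
    # 'await' yields to the event loop while this task waits
    await asyncio.sleep(0.1)
    return i * 2


async def main():
    # ten 0.1s waits overlap on one thread, finishing in ~0.1s total
    return await asyncio.gather(*(fetch(i) for i in range(10)))


results = asyncio.run(main())
print(results)
```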

Comparing multithreading and coroutines

1. Coroutine running time

2. Multithreading running time

Using a semaphore to limit crawler concurrency in async IO

What is a semaphore?

Controlling the number of coroutines in code

Hayo's answer:

Yes, a Semaphore can be used to limit how many coroutines may access a shared resource.

A Semaphore is a counter whose main purpose is to control the number of threads (or coroutines) that may access a shared resource at once.

It is initialized with the number of permits; each acquire decrements the counter by 1, and each release increments it by 1. This caps the overall concurrency.

For example, if a resource allows at most 5 coroutines to access it at the same time, a Semaphore can be used like this:

import asyncio

sem = asyncio.Semaphore(5)

async def access_resource():
    async with sem:  # acquire on entry, release on exit
        # critical section: access the shared resource
        await asyncio.sleep(0.1)

async def main():
    # create 10 coroutines that all want the resource
    await asyncio.gather(*(access_resource() for _ in range(10)))

asyncio.run(main())

In the code above, 10 coroutines are created, but because of the Semaphore at most 5 of them can be inside the critical section at any moment.

So the answer is yes: a Semaphore effectively controls how many coroutines may access a shared resource at once.

Crawler example:

import asyncio
import time

import aiohttp

import blog_spider

semaphore = asyncio.Semaphore(10)


async def async_craw(url):
    async with semaphore:
        print("craw url:", url)
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text()
                # artificial delay so the 10-at-a-time semaphore limit is visible
                await asyncio.sleep(10)
                print(f"craw url: {url}, {len(result)}")


loop = asyncio.get_event_loop()
tasks = [loop.create_task(async_craw(url)) for url in blog_spider.urls]

start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print(f"use time seconds: {end - start}")