Python Concurrent Programming: Scraping Images with Coroutines

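For background on multithreading, see the article "Python Basics: Multithreading and Thread Pools". For web page parsing, see "Python Web Scraping: An Introduction to XPath".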

If you enjoy this, feel free to follow my WeChat official account: just search for feelwow in WeChat.

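Coroutines

Introduction to coroutines

For IO-bound tasks, coroutines are an alternative to multithreading. Coroutines, sometimes called micro-threads, run inside a single thread. In Python, a coroutine can be thought of as a function written by the programmer that runs asynchronously: when it reaches an IO operation, that piece of work is suspended, and the program as a whole is not blocked but carries on with other work.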

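Coroutines differ from threads. Threads are scheduled, switched, and destroyed by the operating system in kernel mode, whereas coroutines live in user mode and are scheduled and switched by the program itself. When a coroutine is suspended, its execution state (register context and stack) is saved elsewhere and restored when it resumes, much like a generator. Because the operating system is not involved, switching is cheap: there is no thread context-switch overhead, so coroutines can run very efficiently. And because everything runs in one thread and control changes hands only at explicit suspension points, two coroutines never touch the same data at the same instant, so the locks that multithreaded code needs are usually unnecessary. As a result, for IO-bound work, coroutines can be used almost anywhere threads can, with lower resource usage and comparable speed.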
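
To make the single-thread point concrete, here is a minimal sketch that records the thread ID seen by several coroutines running concurrently. The names worker, main, and thread_ids are illustrative and not part of the original article.

```python
import asyncio
import threading


async def worker(seen_thread_ids):
    # Hand control back to the event loop, as a real IO wait would.
    await asyncio.sleep(0.01)
    seen_thread_ids.add(threading.get_ident())


async def main():
    seen_thread_ids = set()
    # Run five coroutines concurrently on the same event loop.
    await asyncio.gather(*(worker(seen_thread_ids) for _ in range(5)))
    return seen_thread_ids


thread_ids = asyncio.run(main())
print(len(thread_ids))  # 1: every coroutine ran in the same thread
```

All five coroutines report the same thread ID, which is why no operating-system thread switch is involved when they take turns.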

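Here is a loose analogy to help explain what a coroutine is:

A bank typically handles business such as debit cards, credit cards, deposits, withdrawals, wealth management, and loans. A bank employee (the programmer, the person writing the code) asks Xiao Ming what he has come for. Say he wants a wealth management product: the lobby manager takes him to a counter, where a clerk walks him through some products. When the introduction is over, it turns out Xiao Ming forgot his ID card (a blocking point, equivalent to IO). The clerk tells him to go home and fetch it, and he leaves. The counter cannot sit idle in the meantime, so the clerk starts serving the next customer. By the time that customer is nearly done, Xiao Ming is back. The clerk recognizes him as the one who forgot his ID, and because the products have already been explained (a key property of coroutines: the state of a running function is preserved), the clerk simply picks up where they left off and continues until Xiao Ming's business is finished.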
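
The analogy can be turned into a small sketch. Here asyncio.sleep stands in for the trip home to fetch the ID card, and the function name serve, the log list, and the delays are all assumptions made for illustration.

```python
import asyncio

log = []


async def serve(customer, fetch_id_seconds):
    log.append(f"{customer}: products explained")
    products_explained = True  # local state that must survive the suspension
    # The customer goes to fetch the ID card; the clerk is free to serve others.
    await asyncio.sleep(fetch_id_seconds)
    # Execution resumes here with the local state intact.
    assert products_explained
    log.append(f"{customer}: business finished")


async def main():
    await asyncio.gather(
        serve("Xiao Ming", 0.2),       # long wait
        serve("next customer", 0.05),  # short wait
    )


asyncio.run(main())
for line in log:
    print(line)
```

The log shows the clerk starting with Xiao Ming, serving the next customer to completion while he is away, and then finishing Xiao Ming's business without repeating the introduction.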

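Use cases for coroutines

Web crawlers
File reading
Web frameworks
Database queries

This article introduces coroutines through web scraping, with image crawling and video crawling as the examples (the video part is left to a follow-up article).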

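How to use coroutines

Common keywords

async: placed before a function definition to declare it a coroutine function, for example: async def func()
await: marks a point where the coroutine may be suspended; it must be followed by an awaitable, that is, a coroutine object, a Task, or a Future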
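
As a quick illustration of the two keywords, the following sketch shows that calling a coroutine function only creates a coroutine object, and that await accepts both a coroutine object and a Task. The function name greet is made up for this example.

```python
import asyncio
import inspect


async def greet():
    return "hello"


async def main():
    coro = greet()                       # calling does not run the body yet
    is_coro = asyncio.iscoroutine(coro)
    from_coro = await coro               # await a coroutine object
    task = asyncio.create_task(greet())
    from_task = await task               # a Task is awaitable too
    return is_coro, from_coro, from_task


is_coro_func = inspect.iscoroutinefunction(greet)
outcome = asyncio.run(main())
print(is_coro_func)  # True
print(outcome)       # (True, 'hello', 'hello')
```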

Defining a coroutine function

Use the async keyword to declare that a function is a coroutine function:

import asyncio

async def func():
    print('hello')
    await asyncio.sleep(1)  # stand-in for any awaitable, e.g. await do_something()
    print('world')

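Running a coroutine

Python offers three main mechanisms, described in turn below.

Run a top-level main() with asyncio.run(). The usual pattern is to write the asynchronous logic first, then wrap it in a top-level main() coroutine and hand that to asyncio.run().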

import asyncio


async def func():
    await asyncio.sleep(2)
    print('hello world')

# Define a main() function
async def main():
    await func()
    
asyncio.run(main())
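
One detail worth knowing: asyncio.run() returns whatever the top-level coroutine returns. A minimal sketch, using a shorter sleep than the example above and an illustrative variable name message:

```python
import asyncio


async def func():
    await asyncio.sleep(0.1)
    return "hello world"


async def main():
    # The value returned by main() becomes the return value of asyncio.run().
    return await func()


message = asyncio.run(main())
print(message)  # hello world
```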

asyncio.create_task() wraps a coroutine in an asyncio Task and schedules it to run concurrently

import asyncio


async def func():
    await asyncio.sleep(2)
    print('hello world')


async def main():
    task = asyncio.create_task(func())
    await task  # wait for the task to finish


asyncio.run(main())
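
With a single task the benefit is not visible, so here is a sketch that compares awaiting two coroutines one after another with scheduling them both through asyncio.create_task() first. The function names sequential and concurrent and the 0.2 second delay are assumptions for the demonstration.

```python
import asyncio
import time


async def func(delay):
    await asyncio.sleep(delay)


async def sequential():
    start = time.perf_counter()
    await func(0.2)  # the second call starts only after the first finishes
    await func(0.2)
    return time.perf_counter() - start


async def concurrent():
    start = time.perf_counter()
    t1 = asyncio.create_task(func(0.2))  # both tasks are scheduled up front
    t2 = asyncio.create_task(func(0.2))
    await t1
    await t2
    return time.perf_counter() - start


seq_time = asyncio.run(sequential())
con_time = asyncio.run(concurrent())
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential version takes roughly the sum of the two delays, while the task-based version takes roughly the longer of the two.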

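A Future object is a placeholder that receives the result of an asynchronous operation

I won't go into it here; see the official documentation. Note that as of Python 3.10, some of the Future-related usage in asyncio has been deprecated.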
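
For readers who want a feel for the placeholder idea anyway, here is a very small sketch. It creates an empty Future on the running loop, arranges for a result to be filled in shortly afterwards with loop.call_later, and awaits it. Application code rarely needs to create Futures by hand, so treat this as an illustration only.

```python
import asyncio


async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()   # empty placeholder, no result yet
    # Simulate some other piece of code producing the result a bit later.
    loop.call_later(0.05, future.set_result, "done")
    pending_before = not future.done()
    value = await future            # suspends until set_result() is called
    return pending_before, value


future_demo = asyncio.run(main())
print(future_demo)  # (True, 'done')
```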

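Workflow for running coroutines

Now that we know how to run a coroutine function, what does the overall workflow look like? The pattern is nearly always the same and can be broken into three steps:

Create a list of tasks and add each coroutine to it
Run the tasks in the list concurrently and suspend until they finish
Start the top-level coroutine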

import asyncio


async def func(number):
    print(f"current number: {number}")
    await asyncio.sleep(1)


async def main():
    tasks = []
    t1 = asyncio.create_task(func(1))
    t2 = asyncio.create_task(func(2))
    t3 = asyncio.create_task(func(3))
    tasks.append(t1)
    tasks.append(t2)
    tasks.append(t3)
    await asyncio.wait(tasks)


if __name__ == '__main__':
    asyncio.run(main())

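Return values from coroutines

Sometimes we need the return value of a coroutine function to check how it went. Here are two ways to get it.

We reuse the code above with small changes.

Using asyncio.wait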

import asyncio


async def func(number):
    print(f"current number: {number}")
    await asyncio.sleep(1)
    return f"current number: {number}"


async def main():
    tasks = []
    t1 = asyncio.create_task(func(1))
    t2 = asyncio.create_task(func(2))
    t3 = asyncio.create_task(func(3))
    tasks.append(t1)
    tasks.append(t2)
    tasks.append(t3)
    done, pending = await asyncio.wait(tasks)
    print(done)

if __name__ == '__main__':
    asyncio.run(main())

Printing done gives:

{<Task finished name='Task-2' coro=<func() done, defined at /home/dogfei/PycharmProjects/pythonProject/爬虫/协程/test.py:4> result='current number: 1'>, <Task finished name='Task-3' coro=<func() done, defined at /home/dogfei/PycharmProjects/pythonProject/爬虫/协程/test.py:4> result='current number: 2'>, <Task finished name='Task-4' coro=<func() done, defined at /home/dogfei/PycharmProjects/pythonProject/爬虫/协程/test.py:4> result='current number: 3'>}

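As you can see, done is a set of tasks, and each task's return value is stored in its result, so we can collect the results by iterating over the set and calling result() on each task: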

import asyncio


async def func(number):
    print(f"current number: {number}")
    await asyncio.sleep(1)
    return f"current number: {number}"


async def main():
    tasks = []
    t1 = asyncio.create_task(func(1))
    t2 = asyncio.create_task(func(2))
    t3 = asyncio.create_task(func(3))
    tasks.append(t1)
    tasks.append(t2)
    tasks.append(t3)
    done, pending = await asyncio.wait(tasks)
    for res in done:  # iterate over the finished tasks to read their results
        print(res.result())

if __name__ == '__main__':
    asyncio.run(main())
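
The second value returned by asyncio.wait is empty in the examples above because the default is to wait for every task. The sketch below, with made-up delays, passes return_when=asyncio.FIRST_COMPLETED so that one task is still pending when wait returns.

```python
import asyncio


async def func(number, delay):
    await asyncio.sleep(delay)
    return f"current number: {number}"


async def main():
    tasks = [
        asyncio.create_task(func(1, 0.05)),  # fast
        asyncio.create_task(func(2, 0.5)),   # slow
    ]
    # Return as soon as any one task has finished.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    first_results = [t.result() for t in done]
    pending_count = len(pending)
    for t in pending:
        t.cancel()  # we no longer need the slow task
    # Give the cancelled tasks a chance to finish cancelling.
    await asyncio.gather(*pending, return_exceptions=True)
    return first_results, pending_count


wait_demo = asyncio.run(main())
print(wait_demo)  # (['current number: 1'], 1)
```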

Using asyncio.gather

import asyncio


async def func(number):
    print(f"current number: {number}")
    await asyncio.sleep(1)
    return f"current number: {number}"


async def main():
    tasks = []
    t1 = asyncio.create_task(func(1))
    t2 = asyncio.create_task(func(2))
    t3 = asyncio.create_task(func(3))
    tasks.append(t1)
    tasks.append(t2)
    tasks.append(t3)
    result = await asyncio.gather(*tasks)  # use gather instead of wait
    print(result)

if __name__ == '__main__':
    asyncio.run(main())

Output:

current number: 1
current number: 2
current number: 3
['current number: 1', 'current number: 2', 'current number: 3']

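result is a list, so again we just iterate over it to read the values.

The difference between asyncio.wait and asyncio.gather is that wait returns sets of tasks, so the results come back in no particular order, whereas gather returns a list whose order matches the order in which the tasks were passed in.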
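
The ordering difference is easiest to see when the tasks finish in the reverse of the order they were submitted. In this sketch the delays are chosen (arbitrarily) so that task 3 finishes first and task 1 last.

```python
import asyncio


async def func(number, delay):
    await asyncio.sleep(delay)
    return number


async def main():
    # Task 1 finishes last, task 3 finishes first.
    delays = {1: 0.15, 2: 0.1, 3: 0.05}

    # gather: results follow the order of the arguments, not completion order.
    gathered = await asyncio.gather(*(func(n, d) for n, d in delays.items()))

    # wait: done is a set, so sort explicitly when the order matters.
    tasks = [asyncio.create_task(func(n, d)) for n, d in delays.items()]
    done, pending = await asyncio.wait(tasks)
    waited = sorted(t.result() for t in done)
    return gathered, type(done).__name__, waited


order_demo = asyncio.run(main())
print(order_demo)  # ([1, 2, 3], 'set', [1, 2, 3])
```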

Crawler examples: single-threaded, multithreaded, and coroutine-based

Single-threaded

import requests
import time
import os
import base64
from lxml import etree


def get_img_urls(url):
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    et = etree.HTML(resp.text)
    res = et.xpath('//img[@class="rich_pages wxw-img js_insertlocalimg"]/@data-src')
    return res


def save_img(content):
    if not os.path.exists("tmp"):
        os.makedirs("tmp")
    with open(f"tmp/{time.time()}.jpg", "wb") as f:
        f.write(content)


def get_content(url):
    content = requests.get(url).content
    save_img(content)


def main():
    b64_2_url = 'aHR0cHM6Ly9tcC53ZWl4aW4ucXEuY29tL3MvMzFBaGRlUy12bFpMNmktbDFqMkdvdw=='
    base_url = base64.b64decode(b64_2_url).decode('utf-8')
    urls = get_img_urls(base_url)
    for url in urls:
        get_content(url)


if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)

Result: 10.119770050048828 (seconds)

Multithreaded

import requests
import time
import os
import base64
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def get_img_urls(url):
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    et = etree.HTML(resp.text)
    res = et.xpath('//img[@class="rich_pages wxw-img js_insertlocalimg"]/@data-src')
    return res


def save_img(content):
    # exist_ok avoids an error when several threads try to create the directory at once
    os.makedirs("tmp", exist_ok=True)
    with open(f"tmp/{time.time()}.jpg", "wb") as f:
        f.write(content)


def get_content(url):
    content = requests.get(url).content
    save_img(content)


def main():
    b64_2_url = 'aHR0cHM6Ly9tcC53ZWl4aW4ucXEuY29tL3MvMzFBaGRlUy12bFpMNmktbDFqMkdvdw=='
    base_url = base64.b64decode(b64_2_url).decode('utf-8')
    urls = get_img_urls(base_url)
    with ThreadPoolExecutor() as pool:
        for url in urls:
            pool.submit(get_content, url)


if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)

Result: 1.640012502670288 (seconds)

Coroutines

import requests
import asyncio
import aiohttp
import aiofiles
import time
import os
import base64
from lxml import etree


def get_img_urls(url):
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    et = etree.HTML(resp.text)
    res = et.xpath('//img[@class="rich_pages wxw-img js_insertlocalimg"]/@data-src')
    return res


async def save_img(content):
    if not os.path.exists("tmp"):
        os.makedirs("tmp")
    async with aiofiles.open(f"tmp/{time.time()}.jpg", "wb") as f:
        await f.write(content)


async def get_content(url, session):
    async with session.get(url) as resp:
        content = await resp.content.read()
        await save_img(content)


async def main():
    b64_2_url = 'aHR0cHM6Ly9tcC53ZWl4aW4ucXEuY29tL3MvMzFBaGRlUy12bFpMNmktbDFqMkdvdw=='
    base_url = base64.b64decode(b64_2_url).decode('utf-8')
    urls = get_img_urls(base_url)
    async with aiohttp.ClientSession() as session:
        tasks = [
            asyncio.create_task(get_content(url, session))
            for url in urls
        ]
        await asyncio.wait(tasks)


if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())
    end = time.time()
    print(end - start)

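Total run time: 1.4556291103363037 (seconds)

Comparing the three runs, the single-threaded version is clearly the slowest, while the coroutine and multithreaded versions take about the same time, with the coroutine version slightly ahead in this run.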
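
The timings above depend on the network, so they will vary from run to run. To reproduce the general pattern without any network access, here is a sketch that replaces each download with a fixed simulated delay. N, DELAY, and all function names are assumptions for this demonstration and do not come from the crawler code.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N = 10        # number of simulated downloads (assumed)
DELAY = 0.05  # simulated latency per download in seconds (assumed)


def fake_download(_):
    time.sleep(DELAY)  # blocking wait, playing the role of requests.get


async def fake_download_async(_):
    await asyncio.sleep(DELAY)  # non-blocking wait, playing the role of aiohttp


def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_sequential():
    for i in range(N):
        fake_download(i)


def run_threads():
    with ThreadPoolExecutor(max_workers=N) as pool:
        list(pool.map(fake_download, range(N)))


def run_coroutines():
    async def main():
        await asyncio.gather(*(fake_download_async(i) for i in range(N)))
    asyncio.run(main())


timings = {
    "sequential": timed(run_sequential),
    "threads": timed(run_threads),
    "coroutines": timed(run_coroutines),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```

The sequential run should take about N times DELAY, while the threaded and coroutine runs should each take roughly one DELAY plus some startup overhead.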

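Crawling a short video

Crawling a video is more involved: it requires parsing the page URLs, handling the encryption and decryption of the video segment files, and merging the segments into a single file. I plan to cover it in a separate article, so stay tuned.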
