Async Web Crawler

Original article: zhuanlan.zhihu.com

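async: the async/await syntax frees asynchronous code from the older yield-based style.

async generally goes in front of a function definition (async def) or in front of a with or for statement (async with, async for). It signals that the function or block contains asynchronous operations.

await goes in front of a specific operation to mark it as asynchronous: the coroutine pauses at that point until the result is ready.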
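
As a minimal sketch of the two keywords (the names `fetch_number` and `main` are made up for illustration and do not come from the crawler in this post), `async def` declares a coroutine and `await` suspends it until the awaited operation finishes:

```python
import asyncio


async def fetch_number(n, delay):
    # `await` hands control back to the event loop while this coroutine waits.
    await asyncio.sleep(delay)
    return n * 2


async def main():
    # Both coroutines wait at the same time instead of one after the other.
    return await asyncio.gather(fetch_number(1, 0.2), fetch_number(2, 0.2))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.gather` returns the results in the order the coroutines were passed in, regardless of which one finishes first.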

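aiohttp recommends ClientSession as the main interface for making requests. A ClientSession keeps cookies and related objects alive across multiple requests.
A session has to be closed once you are done with it, and closing it is itself an asynchronous operation, which is why the session is always opened with async with.
For any of this to actually run, the coroutines have to be scheduled on an event loop, so the program needs an asyncio event loop to run the tasks on.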
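
The reason `async with` is needed can be sketched without any network access. `FakeSession` below is a hypothetical stand-in written for this illustration, not part of aiohttp; it only mimics the idea that both the request and the cleanup are awaitable. `asyncio.run` (Python 3.7+) creates the event loop, runs the coroutine on it, and closes the loop afterwards.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; not part of aiohttp."""

    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Closing is itself awaitable, which is why `async with` is required.
        await self.close()

    async def close(self):
        await asyncio.sleep(0)  # pretend to release connections
        self.closed = True

    async def get(self, url):
        await asyncio.sleep(0)  # pretend to do network I/O
        return f"body of {url}"


async def main():
    async with FakeSession() as session:
        body = await session.get("http://example.com")
    # By the time the block exits, __aexit__ has awaited close().
    return body, session.closed


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A plain `with` could not do this, because its `__exit__` hook is a regular function and cannot await the cleanup.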

import random
import aiohttp
import asyncio
import async_timeout
from urllib.parse import urljoin, urldefrag  # urldefrag(url) returns the URL without its fragment identifier, plus the fragment as a separate string;
                                             # if the URL has no fragment, it returns the URL unchanged and an empty string.


from fake_useragent import UserAgent


ua = UserAgent()  # generates random User-Agent strings

root_url = "http://www.woshipm.com"  # site root; used to resolve relative links and to filter discovered URLs
page = random.randint(1, 50)
crawled_urls = []
url_hub = [
    "http://www.woshipm.com/category/it/page/%s" % page,
    "http://www.woshipm.com/category/pd/page/%s" % page,
    "http://www.woshipm.com/category/pmd/page/%s" % page,
]
headers = {'user-agent': ua.random}



async def get_body(url):
    async with aiohttp.ClientSession() as session:
        try:
            async with async_timeout.timeout(10):  # recent async_timeout releases require `async with`
                async with session.get(url, headers=headers) as response:
                    if response.status == 200:
                        html = await response.text()
                        return {'error': '', 'html': html}
                    else:
                        return {'error': response.status, 'html': ''}
        except Exception as err:
            return {'error': err, 'html': ''}


async def handle_task(task_id, work_queue):
    # Each worker keeps pulling URLs until the queue is empty at the moment it checks.
    while not work_queue.empty():
        queue_url = await work_queue.get()
        if queue_url not in crawled_urls:
            crawled_urls.append(queue_url)
            body = await get_body(queue_url)
            if not body['error']:
                for new_url in get_urls(body['html']):
                    # Only follow links that contain the site root and have not been crawled yet.
                    if root_url in new_url and new_url not in crawled_urls:
                        work_queue.put_nowait(new_url)
            else:
                print(f"Error: {body['error']} - {queue_url}")


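def remove_fragment(url):
    # Drop the "#fragment" part so the same page is not queued twice.
    pure_url, _ = urldefrag(url)
    return pure_url

def get_urls(html):
    # Crude link extraction: normalize quotes, then take whatever follows each href=" up to the closing quote.
    new_urls = [url.split('"')[0] for url in str(html).replace("'",'"').split('href="')[1:]]
    # Resolve relative links against root_url and strip any fragment.
    return [urljoin(root_url, remove_fragment(new_url)) for new_url in new_urls]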


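async def main():
    q = asyncio.Queue()
    for url in url_hub:
        q.put_nowait(url)
    # Run three workers concurrently against the shared queue.
    # (Passing bare coroutines to asyncio.wait no longer works on Python 3.11+.)
    await asyncio.gather(*(handle_task(task_id, q) for task_id in range(3)))


if __name__ == "__main__":
    asyncio.run(main())  # creates the event loop, runs main(), then closes the loop
    for u in crawled_urls:
        print(u)
    print('-'*30)
    print(len(crawled_urls))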
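
The URL helpers used in the listing can be tried on their own. The sketch below assumes the site root is `http://www.woshipm.com`, as the start URLs suggest; the sample links and the `same_site` helper are hypothetical and only illustrate an optional stricter filter. It also shows why a substring check such as `root_url in new_url` lets third-party share links through: the site's address appears inside their query string.

```python
from urllib.parse import urldefrag, urljoin, urlparse

base = "http://www.woshipm.com"  # assumed site root for this sketch

# urldefrag returns the URL without its "#..." part, plus the fragment itself.
url, fragment = urldefrag("/pd/1796218.html#comments")

# urljoin resolves a relative link against the base; absolute links pass through.
absolute = urljoin(base, url)

share_link = (
    "http://service.weibo.com/share/share.php"
    "?url=http://www.woshipm.com/pd/1796218.html"
)


def same_site(link, root=base):
    """Hypothetical stricter filter: compare hosts instead of substrings."""
    return urlparse(link).netloc == urlparse(root).netloc


print(url, fragment, absolute)
print(base in share_link)      # substring test accepts the Weibo share link
print(same_site(share_link))   # host comparison rejects it
```

Whether the stricter check is wanted depends on the goal; for a crawler meant to stay on one site, comparing hosts is usually the safer choice.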

The output looks like this:

Error: 404 - http://www.woshipm.com/css/specification.css
Error: 404 - http://www.woshipm.com/index.html
Error: 404 - http://www.woshipm.com/ios.html
Error: 404 - http://www.woshipm.com/pad.html
Error: 404 - http://www.woshipm.com/watch.html
Error: 404 - http://www.woshipm.com/computer.html
Error: 404 - http://www.woshipm.com/display.html
Error: - http://www.woshipm.com/operate/1320593.html
Error: - http://www.woshipm.com/u/820704
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=阻力设计的正确使用方法&url=http://www.woshipm.com/pd/1796218.html&pic=http://image.woshipm.com/wp-files/2019/01/7H2rbZbxmqQLeRJ8A2ZS.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=做好这三步,你的小程序离“爆款”就不远了&url=http://www.woshipm.com/operate/1173103.html&pic=http://image.woshipm.com/wp-files/2018/07/Jl8RHbHuXGgJB8RhJWMu.jpg
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=四个主题,看Facebook的产品设计师是如何思考(2)&url=http://www.woshipm.com/pd/632210.html&pic=http://image.woshipm.com/wp-files/2017/04/bGVAR1UtTmchObgRZR8t.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=物业APP业务流程设计(1):物业报修流程&url=http://www.woshipm.com/pd/1810695.html&pic=http://image.woshipm.com/wp-files/2019/01/IEF1kov6qUlCmvY21pxE.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=线下课程丨产品总监如何搭建一只强大的队伍,只需做好这三点&url=http://www.woshipm.com/active/1775564.html&pic=http://image.woshipm.com/wp-files/2018/12/Hiq5TpaqhNsLgLrcxxza.jpg
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=线上课程丨掌握这些硬核运营能力,年底跳槽时老板还会嫌你经验不够?&url=http://www.woshipm.com/active/1762495.html&pic=http://image.woshipm.com/wp-files/2018/12/ZIglURapwXS9E6sSP9Xg.png
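
Several entries above show nothing between "Error:" and the URL. One plausible explanation (an assumption; the post does not say) is that those requests raised an exception whose message is empty, such as the timeout raised when the 10-second limit expires. The sketch below shows how such an exception renders in an f-string, and how `!r` keeps the exception type visible; the example URL is a placeholder.

```python
import asyncio

err = asyncio.TimeoutError()

# str(err) is empty, so only "Error:" and the URL are visible.
plain = f"Error: {err} - http://example.com"

# repr(err) includes the exception class name.
detailed = f"Error: {err!r} - http://example.com"

print(plain)
print(detailed)
```

Storing `repr(err)` instead of `err` in the error dictionary would be one way to make these log lines easier to diagnose.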