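async/await: this syntax frees asynchronous code from the older yield-based coroutine style.
async is placed in front of a function definition (async def) or a block statement (async with, async for) to indicate that the function or block contains asynchronous operations.
await is placed in front of a specific operation to mark it as asynchronous: the current coroutine is suspended until that operation completes.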
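As a minimal sketch of the two keywords, the example below uses asyncio.sleep as a stand-in for a slow operation. The names fetch_number and add_numbers are made up for illustration and do not come from the crawler.

```python
import asyncio


async def fetch_number(delay: float, value: int) -> int:
    # "await" suspends this coroutine until the awaited operation finishes,
    # letting the event loop run other work in the meantime.
    await asyncio.sleep(delay)
    return value


async def add_numbers() -> int:
    # A function that uses "await" must itself be declared with "async def".
    first = await fetch_number(0.01, 1)
    second = await fetch_number(0.01, 2)
    return first + second


total = asyncio.run(add_numbers())
print(total)
```

Calling add_numbers() on its own only creates a coroutine object; nothing runs until it is awaited or handed to an event loop, here via asyncio.run.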
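aiohttp recommends ClientSession as the main interface for making requests. A ClientSession keeps cookies and related object state across multiple requests.
A session must be closed once you are done with it. Closing a session is itself an asynchronous operation, which is why you use the async with keyword every time.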
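To show why a plain with is not enough, here is a rough sketch built on a hypothetical stand-in class, FakeSession. It is not part of aiohttp; it only mimics an object whose close() is a coroutine. The point is that async with awaits the exit hook, so the asynchronous cleanup runs when the block ends.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; not part of aiohttp."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        # Closing is itself asynchronous (a real session may wait on sockets).
        await asyncio.sleep(0)
        self.closed = True

    async def __aenter__(self) -> "FakeSession":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # "async with" awaits this method on exit, even if the body raised.
        await self.close()


async def use_session() -> FakeSession:
    async with FakeSession() as session:
        assert not session.closed  # still open inside the block
    return session


session = asyncio.run(use_session())
print(session.closed)
```

A regular with statement calls __exit__ synchronously and cannot await anything, which is why objects with asynchronous cleanup expose __aenter__ and __aexit__ instead.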
For the program to actually run, these coroutines must be added to an event loop: create an asyncio event loop and then schedule the tasks on it.
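The following is one possible sketch of that step, assuming Python 3.7 or later, where asyncio.run creates the event loop, runs a coroutine to completion, and closes the loop. The worker coroutine and its arguments are invented for illustration.

```python
import asyncio


async def worker(name: str, delay: float) -> str:
    # Stand-in for real work such as an HTTP request.
    await asyncio.sleep(delay)
    return name


async def main() -> list:
    # gather schedules the coroutines as tasks on the running loop
    # and returns their results in the order they were passed in.
    return await asyncio.gather(
        worker("a", 0.05),
        worker("b", 0.05),
        worker("c", 0.05),
    )


results = asyncio.run(main())
print(results)
```

Because the three workers wait concurrently, the whole run takes roughly as long as the slowest worker rather than the sum of all delays.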
import asyncio
import random
from urllib.parse import urljoin, urldefrag

import aiohttp
import async_timeout
from fake_useragent import UserAgent

# urldefrag: if the URL contains a fragment identifier, it returns a modified
# URL without the fragment, plus the fragment as a separate string. If the URL
# has no fragment, it returns the unmodified URL and an empty string.

ua = UserAgent()  # generates random user-agent strings

root_url = "http://www.woshipm.com"
page = random.randint(1, 50)
crawled_urls = []
url_hub = [
    "http://www.woshipm.com/category/it/page/%s" % page,
    "http://www.woshipm.com/category/pd/page/%s" % page,
    "http://www.woshipm.com/category/pmd/page/%s" % page,
]
headers = {'user-agent': ua.random}


async def get_body(url):
    """Fetch a page; return its HTML, or the error that prevented the fetch."""
    async with aiohttp.ClientSession() as session:
        try:
            # Recent async_timeout releases require "async with" here.
            async with async_timeout.timeout(10):
                async with session.get(url, headers=headers) as response:
                    if response.status == 200:
                        html = await response.text()
                        return {'error': '', 'html': html}
                    else:
                        return {'error': response.status, 'html': ''}
        except Exception as err:
            return {'error': err, 'html': ''}


async def handle_task(task_id, work_queue):
    """Worker: take URLs off the queue, crawl them, and enqueue new links."""
    while not work_queue.empty():
        queue_url = await work_queue.get()
        if queue_url not in crawled_urls:
            crawled_urls.append(queue_url)
            body = await get_body(queue_url)
            if not body['error']:
                for new_url in get_urls(body['html']):
                    if root_url in new_url and new_url not in crawled_urls:
                        work_queue.put_nowait(new_url)
            else:
                print(f"Error: {body['error']} - {queue_url}")


def remove_fragment(url):
    pure_url, frag = urldefrag(url)
    return pure_url


def get_urls(html):
    """Extract every href value from the HTML and resolve it against root_url."""
    new_urls = [url.split('"')[0] for url in str(html).replace("'", '"').split('href="')[1:]]
    return [urljoin(root_url, remove_fragment(new_url)) for new_url in new_urls]


async def main():
    q = asyncio.Queue()
    for url in url_hub:
        q.put_nowait(url)
    # Run three workers concurrently against the shared queue.
    await asyncio.gather(*(handle_task(task_id, q) for task_id in range(3)))


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs main(), and closes the loop.
    asyncio.run(main())
    for u in crawled_urls:
        print(u)
    print('-' * 30)
    print(len(crawled_urls))

The output is as follows:
Error: 404 - http://www.woshipm.com/css/specification.css
Error: 404 - http://www.woshipm.com/index.html
Error: 404 - http://www.woshipm.com/ios.html
Error: 404 - http://www.woshipm.com/pad.html
Error: 404 - http://www.woshipm.com/watch.html
Error: 404 - http://www.woshipm.com/computer.html
Error: 404 - http://www.woshipm.com/display.html
Error: - http://www.woshipm.com/operate/1320593.html
Error: - http://www.woshipm.com/u/820704
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=阻力设计的正确使用方法&url=http://www.woshipm.com/pd/1796218.html&pic=http://image.woshipm.com/wp-files/2019/01/7H2rbZbxmqQLeRJ8A2ZS.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=做好这三步,你的小程序离“爆款”就不远了&url=http://www.woshipm.com/operate/1173103.html&pic=http://image.woshipm.com/wp-files/2018/07/Jl8RHbHuXGgJB8RhJWMu.jpg
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=四个主题,看Facebook的产品设计师是如何思考(2)&url=http://www.woshipm.com/pd/632210.html&pic=http://image.woshipm.com/wp-files/2017/04/bGVAR1UtTmchObgRZR8t.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=物业APP业务流程设计(1):物业报修流程&url=http://www.woshipm.com/pd/1810695.html&pic=http://image.woshipm.com/wp-files/2019/01/IEF1kov6qUlCmvY21pxE.png
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=线下课程丨产品总监如何搭建一只强大的队伍,只需做好这三点&url=http://www.woshipm.com/active/1775564.html&pic=http://image.woshipm.com/wp-files/2018/12/Hiq5TpaqhNsLgLrcxxza.jpg
Error: - http://service.weibo.com/share/share.php?appkey=2775287854&title=线上课程丨掌握这些硬核运营能力,年底跳槽时老板还会嫌你经验不够?&url=http://www.woshipm.com/active/1762495.html&pic=http://image.woshipm.com/wp-files/2018/12/ZIglURapwXS9E6sSP9Xg.png
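The link normalization that the crawler performs with urldefrag and urljoin can be checked in isolation. The sketch below assumes the base URL is http://www.woshipm.com (taken from the start URLs above); the normalize helper and the sample links are illustrative only.

```python
from urllib.parse import urljoin, urldefrag

ROOT_URL = "http://www.woshipm.com"  # assumed base URL for this illustration


def normalize(href: str) -> str:
    # Drop the "#fragment" part, then resolve the link against the base URL.
    pure_url, _fragment = urldefrag(href)
    return urljoin(ROOT_URL, pure_url)


# A relative link with a fragment becomes an absolute URL without it.
print(normalize("/pd/1796218.html#comments"))

# A bare file name is resolved against the site root.
print(normalize("index.html"))

# Without a fragment, urldefrag returns the URL unchanged and an empty string.
print(tuple(urldefrag("http://www.woshipm.com/it")))
```

This also suggests why paths such as /index.html show up in the error list: any relative href found in a page is joined onto the site root, whether or not that path exists on the server.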