AsHTTP:50 行代码 60 秒爬完电影天堂 5000 部电影

901 阅读1分钟

项目地址

github.com/gaojiuli/as…

安装

pip install git+https://github.com/gaojiuli/ashttp

简介

AsHTTP 是一款面向未来的,纯异步的 HTTP 库,比aiohttp更快,因为解析器采用了httptools,以最快的速度发送最多的请求。同时内置了HTML解析模块,让爬虫开发快速、运行更快速。

优势

基于AsHTTP 的爬虫案例

import time

import ashttp as http

movie_list = []


async def get_list(url):
    r = await http.get(url)
    html = await r.html()
    for l in html.find("a.ulink"):
        movie_list.append({
            "link": l.href,
            "title": l.text
        })

async def get_detail(movie):
    r = await http.get(movie['link'])
    html = await r.html()
    download_link = html.find_one("#Zoom table a")
    movie.update({"download_link": download_link.text})


async def main():
    print("Downloading...")
    tasks = [get_list(f"https://www.ygdy8.net/html/gndy/dyzz/list_23_{i + 1}.html") for i in range(197)]
    await asyncio.gather(*tasks)
    tasks = [get_detail(i) for i in movie_list]
    await asyncio.gather(*tasks)
    print("Down!")


if __name__ == '__main__':
    import asyncio

    loop = asyncio.get_event_loop()
    sem = asyncio.Semaphore(500)  # limit concurrent
    s = time.time()
    loop.run_until_complete(main())
    loop.close()
    print("Time Usage", time.time() - s)
    print("Movie Count:", len(movie_list))
    print()
    print("First Movie:", movie_list[0])

打印结果

Downloading...
Down!
Time Usage 50.032444477081299
Movie Count: 4913