Scrapy: A Spider That Never Stops


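In some cases we want a spider that never stops, for example to monitor a news site continuously, or to watch for content matching a set of keywords by crawling search results. By default, however, Scrapy closes a spider as soon as it becomes idle, that is, when it has no pending requests left, and the program exits. To keep it alive, connect a handler to the spider_idle signal, schedule a new Request from that handler, and raise the DontCloseSpider exception defined by Scrapy. Here is the code: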

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class TestSpider(Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }
    url = 'https://www.baidu.com'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        instance = super().from_crawler(crawler, *args, **kwargs)
        # Run spider_idle() whenever Scrapy fires the spider_idle signal.
        crawler.signals.connect(instance.spider_idle, signal=signals.spider_idle)
        return instance

    def start_requests(self):
        yield Request(
            self.url, dont_filter=True,
        )

    def next_request(self):
        req = Request(
            self.url,
            callback=self.parse,
            dont_filter=True,
        )
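        # This call matches the Scrapy releases available in 2019. Newer
        # releases (2.6 and later, as far as I know) deprecate the spider
        # argument; there, call self.crawler.engine.crawl(req) instead.
        self.crawler.engine.crawl(req, spider=self)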

    def parse(self, response):
        self.logger.info('crawled: %s', response.url)

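    def spider_idle(self):
        # Called each time the engine runs out of pending requests.
        self.next_request()
        self.logger.info('spider idled.')
        # Veto the shutdown so the spider stays open.
        raise DontCloseSpider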

After starting it, you can see that the spider keeps running:

2019-07-02 16:24:30 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:30 [test] INFO: spider idled.
2019-07-02 16:24:31 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:31 [test] INFO: spider idled.
2019-07-02 16:24:32 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:32 [test] INFO: spider idled.
2019-07-02 16:24:33 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:33 [test] INFO: spider idled.
2019-07-02 16:24:35 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:35 [test] INFO: spider idled.
2019-07-02 16:24:36 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:36 [test] INFO: spider idled.
2019-07-02 16:24:37 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:37 [test] INFO: spider idled.
2019-07-02 16:24:38 [test] INFO: crawled: https://www.baidu.com
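
To make the control flow behind this log easier to follow, here is a minimal sketch of the idle mechanism in plain Python, without Scrapy. It is a toy model, not Scrapy's actual engine: ToyEngine, make_idle_handler and the local DontCloseSpider class are hypothetical names invented for illustration, and the max_steps cap exists only so the demo always terminates. It shows the two possible outcomes of an idle event: the handler schedules more work and raises the exception, so the loop continues, or the handler returns normally, so the loop ends.

```python
from collections import deque


class DontCloseSpider(Exception):
    """Stand-in for scrapy.exceptions.DontCloseSpider in this toy model."""


class ToyEngine:
    """Simplified model of an engine loop (hypothetical, not Scrapy's code)."""

    def __init__(self, idle_handler):
        self.queue = deque()
        self.idle_handler = idle_handler
        self.processed = []

    def crawl(self, request):
        # Schedule a request, loosely mirroring engine.crawl().
        self.queue.append(request)

    def run(self, max_steps=100):
        for _ in range(max_steps):  # safety cap so the demo always terminates
            if self.queue:
                self.processed.append(self.queue.popleft())
                continue
            # The queue is empty, so the spider is idle: ask the handler.
            try:
                self.idle_handler(self)
            except DontCloseSpider:
                continue  # the handler vetoed the shutdown; keep looping
            break  # nobody objected, so the spider closes
        return self.processed


def make_idle_handler(pending):
    """Build a handler that feeds one pending item per idle event."""
    pending = deque(pending)

    def on_idle(engine):
        if pending:
            engine.crawl(pending.popleft())
            raise DontCloseSpider
        # Returning without raising lets the engine stop.

    return on_idle


engine = ToyEngine(make_idle_handler(["news", "sports", "tech"]))
result = engine.run()
print(result)  # ['news', 'sports', 'tech']
```

The spider in this post corresponds to a handler that always schedules a request and always raises, which is why it never stops. A handler like on_idle above, which stops raising once its source is exhausted, gives a spider that shuts down on its own when there is nothing left to fetch.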