Sometimes we want a spider that never stops, for example to continuously monitor a news site, or to keep monitoring content matching a set of keywords through a search interface. By default, Scrapy shuts a spider down as soon as it goes idle, so what we need to do is listen for the spider_idle signal, raise Scrapy's DontCloseSpider exception, and keep issuing new Requests. Here is the code:
from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class TestSpider(Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }
    url = 'https://www.baidu.com'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        instance = super(TestSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Listen for the spider_idle signal so we can step in before Scrapy closes the spider.
        crawler.signals.connect(instance.spider_idle, signal=signals.spider_idle)
        return instance

    def start_requests(self):
        yield Request(
            self.url, dont_filter=True,
        )

    def next_request(self):
        req = Request(
            self.url,
            callback=self.parse,
            dont_filter=True,
        )
        # Feed the new request directly to the engine so the crawl keeps going.
        self.crawler.engine.crawl(req, spider=self)

    def parse(self, response):
        self.logger.info('crawled: %s', response.url)

    def spider_idle(self):
        # Schedule another request, then raise DontCloseSpider to keep the spider alive.
        self.next_request()
        self.logger.info('spider idled.')
        raise DontCloseSpider
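A note on compatibility: the call crawler.engine.crawl(req, spider=self) matches the Scrapy API current when this was written. This is an internal engine API, and much newer Scrapy releases have changed its signature, so check the documentation for the version you actually run.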
Run the spider and you can see it keeps running continuously:
2019-07-02 16:24:30 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:30 [test] INFO: spider idled.
2019-07-02 16:24:31 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:31 [test] INFO: spider idled.
2019-07-02 16:24:32 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:32 [test] INFO: spider idled.
2019-07-02 16:24:33 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:33 [test] INFO: spider idled.
2019-07-02 16:24:35 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:35 [test] INFO: spider idled.
2019-07-02 16:24:36 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:36 [test] INFO: spider idled.
2019-07-02 16:24:37 [test] INFO: crawled: https://www.baidu.com
2019-07-02 16:24:37 [test] INFO: spider idled.
2019-07-02 16:24:38 [test] INFO: crawled: https://www.baidu.com
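The same pattern also covers the keyword-monitoring case mentioned at the start. Below is a minimal sketch under assumptions: the keyword list and the search URL template are made-up placeholders, and KeywordMonitorSpider is a hypothetical name, not part of the original code. Each time the spider goes idle it pulls the next keyword and requests the corresponding search page:

from itertools import cycle

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class KeywordMonitorSpider(Spider):
    # Hypothetical spider: keywords and search_url are placeholders for illustration.
    name = 'keyword_monitor'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }
    keywords = ['scrapy', 'python']                      # placeholder keyword list
    search_url = 'https://www.baidu.com/s?wd={keyword}'  # placeholder search URL template

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        instance = super(KeywordMonitorSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Endless iterator over the keywords; each idle cycle pulls the next one.
        instance.keyword_iter = cycle(instance.keywords)
        crawler.signals.connect(instance.spider_idle, signal=signals.spider_idle)
        return instance

    def start_requests(self):
        yield self.next_request()

    def next_request(self):
        keyword = next(self.keyword_iter)
        return Request(
            self.search_url.format(keyword=keyword),
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        self.logger.info('crawled: %s', response.url)

    def spider_idle(self):
        # Same trick as above: push a new request into the engine and keep the spider alive.
        # The spider argument matches the Scrapy API used in the original example.
        self.crawler.engine.crawl(self.next_request(), spider=self)
        raise DontCloseSpider

With dont_filter=True on every request, the duplicate filter never drops the repeated search URLs, and DOWNLOAD_DELAY keeps the polling rate polite.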