Spider Development Workflow: Subclassing the Spider Base Class

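**General form: class ClassName(BaseClass); here: class New_Spider(scrapy.Spider)**
The Spider base class provides three things: the interface that the Scrapy engine calls, utility methods for the user, and attributes the user can access.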

Override the name attribute inherited from the Spider base class

Every Spider has its own unique name, which is needed when launching it: run **scrapy crawl** followed by that name.
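To illustrate why the name has to be unique, here is a minimal sketch of name-based lookup using plain Python classes. `BaseSpider`, `NewsSpider`, `BooksSpider`, and `find_spider` are hypothetical stand-ins written for this example; they are not Scrapy classes, and Scrapy's own spider loading is more involved than this.

```python
class BaseSpider:
    """Hypothetical stand-in for a spider base class (not scrapy.Spider)."""
    name = None  # subclasses are expected to override this


class NewsSpider(BaseSpider):
    name = "news"  # overrides the inherited class attribute


class BooksSpider(BaseSpider):
    name = "books"


def find_spider(name, candidates):
    """Return the single class whose `name` attribute equals `name`."""
    matches = [cls for cls in candidates if cls.name == name]
    if len(matches) != 1:
        # Zero matches or duplicate names both make the lookup ambiguous.
        raise KeyError(
            f"expected exactly one spider named {name!r}, found {len(matches)}"
        )
    return matches[0]


spider_cls = find_spider("books", [NewsSpider, BooksSpider])
print(spider_cls.__name__)
```

If two classes shared the same name, the lookup in this sketch would fail, which is the reason each Spider should keep a name of its own.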

Set the initial crawl points

Set the start_urls attribute to specify the pages where crawling begins. The relevant methods of the Spider base class are:
```python
def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

def make_requests_from_url(self, url):
    """ This method is deprecated. """
    warnings.warn(
        "Spider.make_requests_from_url method is deprecated: "
        "it will be removed and not be called by the default "
        "Spider.start_requests method in future Scrapy releases. "
        "Please override Spider.start_requests method instead."
    )
    return Request(url, dont_filter=True)
```
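The base-class attribute start_urls is a list. The base-class method start_requests, together with make_requests_from_url, iterates over the entries of start_urls. Because it uses yield, calling it returns a generator, and that generator produces Request objects one at a time, one for each URL.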
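The generator behaviour can be tried without installing Scrapy. The sketch below is a simplified stand-in for the pattern shown above, not Scrapy's implementation: `FakeRequest`, `MiniSpider`, `NewSpider`, and the example.com URLs are all assumptions made up for this illustration.

```python
from collections import namedtuple

# Stand-in for scrapy's Request, used only so the sketch runs without Scrapy.
FakeRequest = namedtuple("FakeRequest", ["url", "dont_filter"])


class MiniSpider:
    """Hypothetical simplified base class mirroring the pattern above."""
    name = None
    start_urls = []

    def start_requests(self):
        # `yield` makes this a generator function: the loop body does not
        # run until the caller starts iterating.
        for url in self.start_urls:
            yield FakeRequest(url, dont_filter=True)


class NewSpider(MiniSpider):
    # Overriding the two class attributes is all the subclass needs to do.
    name = "new_spider"
    start_urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
    ]


requests = NewSpider().start_requests()
first = next(requests)       # the first request object is built only now
remaining = list(requests)   # drain whatever is left

print(first.url)
print(len(remaining))
```

Calling `start_requests()` does no work by itself; each request is created only when the consumer asks for the next item, so requests are produced on demand rather than all at once.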