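**class ClassName(BaseClass): class New_Spider(scrapy.Spider)**
The Spider base class provides three things: the interface that the Scrapy engine calls, utility methods for users, and attributes that users can access.
Override the name attribute inherited from the Spider base class
Every Spider has its own unique name, which is needed when starting the Spider: it is the argument passed to **scrapy crawl**.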
Set the initial crawl points
The pages where crawling begins are specified by setting the start_urls attribute.

    def start_requests(self):
        cls = self.__class__
        if not self.start_urls and hasattr(self, 'start_url'):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)")
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated: "
            "it will be removed and not be called by the default "
            "Spider.start_requests method in future Scrapy releases. "
            "Please override Spider.start_requests method instead."
        )
        return Request(url, dont_filter=True)

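The base-class attribute start_urls is a list. The base-class method start_requests, together with make_requests_from_url, iterates over the members of start_urls. Because it uses yield, calling it returns a generator that produces Request objects one after another.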
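The generator behaviour can be tried without Scrapy installed. The sketch below is a simplified model, not Scrapy's code: `Request` and `MiniSpider` are hypothetical stand-ins that keep only the non-deprecated branch of `start_requests`, and the URLs are placeholders.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Request:
    """Hypothetical stand-in for scrapy.Request with only the fields used here."""
    url: str
    dont_filter: bool = False


class MiniSpider:
    """Hypothetical stand-in for scrapy.Spider, reduced to the start_urls logic."""
    start_urls: List[str] = []

    def start_requests(self) -> Iterator[Request]:
        # Because of `yield`, calling this method returns a generator;
        # no Request is built until the caller iterates over it.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)


class NewSpider(MiniSpider):
    # Placeholder URLs, for illustration only.
    start_urls = ["https://example.com/page1", "https://example.com/page2"]


request_gen = NewSpider().start_requests()  # a generator; nothing produced yet
first = next(request_gen)                   # builds the first Request
remaining = list(request_gen)               # drains the rest of the generator
```

Requests are produced lazily, one per iteration step, so the caller can pull them as needed instead of receiving a full list up front.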