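This is day 25 of my participation in the Juejin Daily Update Program (August writing challenge).

Preface

Hi everyone, 辣条哥 here. Today's topic is Scrapy, an application framework written for crawling websites and extracting structured data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing. It uses the Twisted asynchronous networking library to handle network communication.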

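Contents

Preface
1. Introduction
2. Components: 2.1 Downloader middleware; 2.2 Spider middleware
3. Project commands: 3.1 Create a project; 3.2 cd into the project; 3.3 Run the project; 3.4 Configuration in settings
4. The interactive shell: 4.1 Target data requirements; 4.2 Spider file; 4.3 items file; 4.4 pipelines file; 4.5 settings file
5. Project notes
6. scrapy shell
7. Selectors
8. items file
9. pipelines file
10. settings file
11. Scrapy shell
12. Scrapy selectors
13. Nested selectors
14. scrapy.Spider
15. logger
16. from_crawler
17. start_requests(): issuing the first requests
18. parse: the default callback

1. Introduction

Scrapy is an efficient, structured web scraping framework written in pure Python.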


Why use it:
1. It lets us concentrate our effort on requests and parsing.
2. It is what enterprise-level projects require.

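Installation

Scrapy supports Python 2.7 and Python 3.4 or later (this reflects the releases current when the article was written; recent Scrapy releases run on Python 3 only). Python packages can be installed globally (system-wide) or in user space.

Execution flow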

The main components are: spiders, items, the engine, the scheduler, the downloader, item pipelines, and middleware.

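Data flow: Scrapy's architecture diagram shows the framework's components and the data flow that takes place inside the system (drawn as red arrows). The data flow in Scrapy is controlled by the execution engine and proceeds as follows: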

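1. The engine gets the initial requests from the spider.
2. The engine puts the requests into the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to crawl to the engine.
4. The engine sends the request to the downloader, passing through all the downloader middlewares in turn.
5. Once the page has been downloaded, the downloader returns a response containing the page data, which again passes through all the downloader middlewares.
6. The engine receives the response from the downloader and sends it to the spider for parsing, passing through all the spider middlewares.
7. The spider processes the response, extracts items and generates new requests, and sends them to the engine.
8. The engine sends the processed items to the item pipelines, sends the new requests to the scheduler, and asks for the next request.
9. The process repeats until the scheduler has no more requests.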
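The loop described above can be pictured with a toy model. This is only a sketch in plain Python, not Scrapy's implementation: `FAKE_SITE`, `download`, `parse`, and `run` are made-up names, the "downloader" reads from a dictionary instead of the network, and middlewares are left out.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, links found on the page).
FAKE_SITE = {
    "page1": ("First page", ["page2"]),
    "page2": ("Second page", []),
}


def download(url):
    """Stand-in for the downloader: return a 'response' for a URL."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider callback: yield one item, then follow-up requests."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}


def run(start_urls):
    """Toy engine: move requests and items between the other components."""
    scheduler = deque(start_urls)  # initial requests are queued in the scheduler
    stored = []                    # stands in for the item pipeline
    seen = set(start_urls)
    while scheduler:               # stop once the scheduler is empty
        url = scheduler.popleft()  # the scheduler hands over the next request
        response = download(url)   # the downloader returns a response
        for result in parse(response):  # the spider parses the response
            if "item" in result:
                stored.append(result["item"])        # items go to the pipeline
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])  # new requests are scheduled
    return stored


items = run(["page1"])
print(items)
```

In real Scrapy the same hand-offs happen asynchronously on top of Twisted, so many requests can be in flight at once.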

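2. Components

Scrapy Engine: the engine controls the data flow between all components of the system and triggers events when certain actions occur.

Scheduler: the scheduler receives requests from the engine and enqueues them so it can hand them back when the engine asks for them later.

Downloader: the downloader fetches web pages and feeds them to the engine, which in turn feeds them to the spiders.

Spider: spiders are custom classes written by the user to parse responses and extract from them either data or further requests to follow.

Item pipeline: the item pipeline processes data after a spider has extracted it. Typical tasks include cleaning, validation, and persistence (such as storing the data in a database).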
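To make the item pipeline's three typical tasks concrete, here is a minimal sketch that runs without Scrapy installed. The class name `ValidateAndStorePipeline`, the output path, and the local `DropItem` class are assumptions for illustration (in a real project you would raise `scrapy.exceptions.DropItem`); the `film_name` and `score` fields are borrowed from the example later in this article.

```python
import json
import os
import tempfile


class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""


class ValidateAndStorePipeline:
    """Clean and validate each item, then append it to a JSON Lines file."""

    def __init__(self, path):
        self.path = path  # hypothetical output path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Cleaning: strip surrounding whitespace from every value.
        record = {key: str(value).strip() for key, value in dict(item).items()}
        # Validation: refuse items without a title.
        if not record.get("film_name"):
            raise DropItem("missing film_name")
        # Persistence: one JSON object per line.
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand, in the order Scrapy calls these hooks during a crawl.
out_path = os.path.join(tempfile.mkdtemp(), "films.jl")
pipeline = ValidateAndStorePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"film_name": " The Shawshank Redemption ", "score": "9.7"}, spider=None)
try:
    pipeline.process_item({"film_name": "", "score": "1.0"}, spider=None)
except DropItem as exc:
    print("dropped:", exc)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as handle:
    saved = [json.loads(line) for line in handle]
print(saved)
```

Returning the item from `process_item` matters: it is what lets any later pipeline in the chain receive the same item.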

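2.1 Downloader middleware

Downloader middlewares are hooks that sit between the engine and the downloader. They process requests passing from the engine to the downloader and responses passing from the downloader back to the engine. Use a downloader middleware if you need to do any of the following:

process a request just before it is sent to the downloader (that is, right before Scrapy sends the request to the website);
change a received response before it is passed to a spider;
send a new request instead of passing the received response to a spider;
pass a response to a spider without fetching a web page;
silently drop some requests.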
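As an illustration of the first use case, the sketch below fills in a User-Agent header before a request goes out. The class name `DefaultUserAgentMiddleware` and the `demo-bot/0.1` value are made up, and the request is a `SimpleNamespace` stand-in rather than a real `scrapy.Request`, so the example runs without Scrapy. The `process_request(request, spider)` hook and its "return `None` to continue" convention are how Scrapy downloader middlewares are written.

```python
from types import SimpleNamespace


class DefaultUserAgentMiddleware:
    """Downloader middleware sketch: fill in a User-Agent before the request is sent."""

    def __init__(self, user_agent="demo-bot/0.1"):  # hypothetical default value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # None tells Scrapy to keep processing this request


# Stand-in request object with a plain dict for headers; not a real scrapy.Request.
fake_request = SimpleNamespace(url="https://example.com", headers={})
result = DefaultUserAgentMiddleware().process_request(fake_request, spider=None)
print(result, fake_request.headers)
```

In a project, a class like this would be switched on through the `DOWNLOADER_MIDDLEWARES` setting that appears in the settings file later in this article.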

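2.2 Spider middleware

Spider middlewares are hooks that sit between the engine and the spiders. They can process the responses going in and the items and requests coming out. Use a spider middleware if you need to:

post-process the requests or items produced by spider callbacks;
post-process start_requests;
handle spider exceptions;
call the errback instead of the callback for some requests, based on the response content.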
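The first use case, post-processing what a callback produces, might look like the following sketch. `RequireScoreMiddleware` is a hypothetical name, items are plain dicts, and a string stands in for a `scrapy.Request`; only the `process_spider_output(response, result, spider)` hook itself is part of Scrapy's spider middleware interface.

```python
class RequireScoreMiddleware:
    """Spider middleware sketch: drop scraped items that have no score."""

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and requests yielded by the callback.
        for element in result:
            if isinstance(element, dict) and not element.get("score"):
                continue  # discard incomplete items
            yield element  # pass everything else through unchanged


callback_output = [
    {"film_name": "A", "score": "9.7"},
    {"film_name": "B", "score": ""},
    "a-request-placeholder",  # stands in for a scrapy.Request
]
kept = list(RequireScoreMiddleware().process_spider_output(None, callback_output, None))
print(kept)
```

Requests pass through untouched here, so the crawl itself is unaffected; only incomplete items are filtered out.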

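3. Project commands

3.1 Create a project

scrapy startproject <project_name> [project_dir]

Note: "<>" marks a required argument and "[]" an optional one.

scrapy startproject db

The project directory and the package inside it are both named db.

3.2 cd into the project and generate a spider

scrapy genspider [options] <name> <domain>

scrapy genspider example example.com

The spider is created under the project's spiders directory; example is the spider (and file) name and example.com is the domain to crawl.

3.3 Run the project

scrapy crawl <spider name>

Pay attention to the overall workflow here.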

3.4 Configuration in settings

Set ROBOTSTXT_OBEY and DEFAULT_REQUEST_HEADERS:

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

4. The interactive shell

scrapy shell <url> (the start URL)

This fetches the response our project would receive, so XPath expressions can be tested against it.

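4.1 Target data requirements

Information on 250 movies.
For each movie: the title, the director information (which may include the cast), and the rating.
Save the movie information directly to a local file.
Save the movie information through an item pipeline.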

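4.2 Spider file

# -*- coding: utf-8 -*-
import json

import scrapy

4.3 items file

import scrapy


class DbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    film_name = scrapy.Field()
    director_name = scrapy.Field()
    score = scrapy.Field()

4.4 pipelines file

import json


class DbPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened
        self.f = open("film_pipe.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        json_data = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(json_data)
        return item

    def close_spider(self, spider):
        # called when the spider is closed
        self.f.close()  # close the file

4.5 settings file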

# -*- coding: utf-8 -*-

# Scrapy settings for db project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     docs.scrapy.org/en/latest/t…

BOT_NAME = 'db'

SPIDER_MODULES = ['db.spiders']
NEWSPIDER_MODULE = 'db.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'db (+www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See docs.scrapy.org/en/latest/t…
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

# Enable or disable spider middlewares
# See docs.scrapy.org/en/latest/t…
#SPIDER_MIDDLEWARES = {
#    'db.middlewares.DbSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See docs.scrapy.org/en/latest/t…
#DOWNLOADER_MIDDLEWARES = {
#    'db.middlewares.DbDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See docs.scrapy.org/en/latest/t…
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See docs.scrapy.org/en/latest/t…
ITEM_PIPELINES = {
    'db.pipelines.DbPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See docs.scrapy.org/en/latest/t…
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See docs.scrapy.org/en/latest/t…
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. Project notes

In a new project's settings file the default is ROBOTSTXT_OBEY = True, meaning the crawler obeys robots.txt. If that prevents the data from being crawled, change it to ROBOTSTXT_OBEY = False.

In settings, some sites require a User-Agent header before they return any data (the crawler poses as a browser client).
In settings, the item pipeline must be enabled before data is passed on to the pipelines file.
In items, the corresponding fields must be declared, and data is passed around in Item objects (much as MySQL needs its columns defined before data can be written).

6. scrapy shell

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)  # the scrapy module
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002624C415F98>  # the crawler object
[s]   item       {}  # the item object
[s]   request    <GET movie.douban.com/top250>  # the request object
[s]   response   <200 movie.douban.com/top250>  # the response object
[s]   settings   <scrapy.settings.Settings object at 0x000002624C415EB8>  # the settings
[s]   spider     <DefaultSpider 'default' at 0x2624c8ed3c8>  # the spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)  # fetch a response by URL
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects  # fetch a response from a request object
[s]   shelp()           Shell help (print this help)  # list the available commands
[s]   view(response)    View response in a browser  # open the response in a local browser

7. Selectors

html_str="""

肖申克的救赎 / The Shawshank Redemption / 月黑高飞(港) / 刺激1995(台) [可播放]

导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...
1994 / 美国 / 犯罪剧情

9.7 1980500人评价

                        <p class="quote">
                            <span class="inq">希望让人自由。</span>
                        </p>
                </div>
            </div>
        </div>

"""
from scrapy.selector import Selector

# 1. Build a Selector from the text argument
selc_text = Selector(text=html_str)

print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])
print(selc_text.xpath('./body/div[@class="info"]//div/a/span/text()').extract()[0])
print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract_first())

# 2. Build a Selector from a response
from scrapy.http import HtmlResponse

response = HtmlResponse(url="http://www.example.com", body=html_str.encode())
Selector(response=response)

print(response.selector.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])
print(response.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])

# 3. Nested expressions: css, xpath and re can be chained freely on a selector
print(response.css("a").xpath('./span[1]/text()').extract()[0])
print(response.css("a").xpath('./span[1]/text()').re("的..")[0])
print(response.css("a").xpath('./span[1]/text()').re_first("的.."))


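Scraping secondary pages and passing data along (movies)

1. Detail pages (secondary pages) are handled by a dedicated callback, here the get_detail method:

def get_detail(self, response):
    pass

2. The key to passing data between callbacks and merging it is the meta parameter.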

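Spider file

# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DbItem  # a "safe" dictionary: only declared fields are accepted


class Db250Spider(scrapy.Spider):  # inherit from the base spider class
    name = 'db250'  # spider name; required and must be unique
    # allowed_domains = ['movie.douban.com']  # optional; if omitted, any domain is allowed
    start_urls = ['https://movie.douban.com/top250']  # initial URLs; required
    page_num = 0

    def parse(self, response):  # parse callback: handles the response data
        node_list = response.xpath('//div[@class="info"]')
        for node in node_list:
            # movie title
            film_name = node.xpath("./div/a/span/text()").extract()[0]
            # director information
            director_name = node.xpath("./div/p/text()").extract()[0].strip()
            # rating
            score = node.xpath('./div/div/span[@property="v:average"]/text()').extract()[0]

            # store through the pipeline
            item_pipe = DbItem()  # create a DbItem and use it like a dict
            item_pipe['film_name'] = film_name
            item_pipe['director_name'] = director_name
            item_pipe['score'] = score
            # yield item_pipe
            # print("movie info", dict(item_pipe))

            # movie synopsis: request the detail page and carry the item along in meta
            detail_url = node.xpath('./div/a/@href').extract()[0]
            yield scrapy.Request(detail_url, callback=self.get_detail, meta={"info": item_pipe})

        # request the next page
        # build the URL
        self.page_num += 1
        if self.page_num == 4:
            return
        page_url = "movie.douban.com/top250?star…"
        yield scrapy.Request(page_url)

    def get_detail(self, response):
        item = DbItem()
        # parse the detail page response
        # 1. meta comes back together with the response
        # 2. read it through response.meta
        # 3. merge it into the new item with update()
        info = response.meta["info"]
        item.update(info)
        # synopsis text
        description = response.xpath('//div[@id="link-report"]//span[@property="v:summary"]/text()').extract()[0].strip()
        # print('description', description)

        item["description"] = description  # DbItem must declare a description field for this assignment to work
        # save through the pipeline
        yield item

# Target data: the movie information plus the synopsis, which lives in the source of the secondary page.
# Request flow: visit the first-level page, extract the movie information and the secondary-page URL,
# visit that URL, and extract the synopsis from the secondary page.

# Storage issue: results arrive in no particular order, so meta is used to keep each movie's data together.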

8. items file

import scrapy

9. pipelines file

import json


class DbPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened
        self.f = open("film_pipe.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        json_data = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(json_data)
        return item

    def close_spider(self, spider):
        # called when the spider is closed
        self.f.close()  # close the file

10. settings file

Most of the comments have been removed here.

# -*- coding: utf-8 -*-

# Scrapy settings for db project

BOT_NAME = 'db'

SPIDER_MODULES = ['db.spiders']
NEWSPIDER_MODULE = 'db.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'db (+www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:

# Configure item pipelines
# See docs.scrapy.org/en/latest/t…
ITEM_PIPELINES = {
    'db.pipelines.DbPipeline': 300,
}

11. Scrapy shell

scrapy shell is a debugging tool.

In the project directory, run scrapy shell movie.dou…com/top250; it prints the list of available objects and shortcuts.

scrapy shell automatically loads the configuration in settings, so the robots.txt policy, request headers and so on all apply, and the request it sends gets the correct response.

Shortcuts: shelp(), fetch(url[, redirect=True]), fetch(request), view(response)
Scrapy objects: crawler, spider, request, response, settings

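12. Scrapy selectors

Scrapy provides a parsing mechanism built on the lxml library. Its objects are called selectors because they "select" the parts of an HTML document specified by XPath or CSS expressions. The Scrapy selector API is very small and very simple.

Selectors provide two methods for selecting elements:

xpath(): uses XPath syntax
css(): uses CSS selector syntax

Shortcuts: response.xpath() and response.css(). Both return a list of selectors.

Extracting text:

selector.extract() returns a list of strings
selector.extract_first() returns the text of the first selector, or None if nothing matched

13. Nested selectors

Sometimes selecting an element takes several chained calls to the selection methods (.xpath() or .css()):

response.css('img').xpath('@src')

Selector also has a .re() method, which extracts data with a regular expression. It returns a list of strings, so nothing further can be chained after it. It is usually called after xpath() or css() to filter the text. re_first() returns the first matching string.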


14. scrapy.Spider

The spider's name: name

A string that defines the name of this spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. It is the most important spider attribute, and it is required.

Start URLs: start_urls

The list of URLs where the spider begins to crawl. The first pages downloaded will therefore be the ones listed here. Subsequent Requests are generated successively from the data contained in the start URLs.

Custom settings: custom_settings

Settings that override the project-wide configuration when this spider runs. They must be defined as a class attribute, because the settings are updated before the spider is instantiated.

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

15. logger
