Installation
$ pip install Scrapy
$ scrapy version
Scrapy 2.11.0
Tutorial
Project
scrapy startproject myproject [project_dir]
$ scrapy startproject Article
You can start your first spider with:
cd Article
scrapy genspider example example.com
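Directory structure
Article/
    scrapy.cfg            # deploy configuration file
    Article/              # the project's Python module; import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where the spiders live
            __init__.py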
spider
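Creating a spider from the command line
scrapy genspider [-t template] <name> <domain>
Run this command from inside the project directory.
It is only a convenience shortcut that creates a spider from a predefined template; it is not the only way to create one. You can also write the spider source file yourself instead of using this command.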
$ scrapy genspider cnblogs_pick https://www.cnblogs.com/pick/
Created spider 'cnblogs_pick' using template 'basic' in module:
Article.spiders.cnblogs_pick
import scrapy


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        pass
Creating a spider manually
Create cnblogs_pick.py in the project's spiders/ directory:
import scrapy


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"

    def start_requests(self):
        urls = [
            'https://www.cnblogs.com/pick/',
            'https://www.cnblogs.com/pick/#p2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass
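- name: identifies the spider. It must be unique within a project; two different spiders cannot share the same name.
- start_requests(): must return an iterable of requests (a list of requests or a generator function) from which the spider begins to crawl. Subsequent requests are generated successively from these initial requests.
- parse(): the default callback, called to handle the response downloaded for each request.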
Running the spider
scrapy crawl <name>
$ scrapy crawl cnblogs_pick
... (omitted for brevity)
2024-04-02 18:30:06 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-02 18:30:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6134,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 320042,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'elapsed_time_seconds': 1.119223,
'file_count': 18,
'file_status_count/downloaded': 1,
'file_status_count/uptodate': 17,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 2, 10, 30, 6, 447372, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 1058812,
'httpcompression/response_count': 20,
'item_scraped_count': 13,
'log_count/DEBUG': 102,
'log_count/ERROR': 6,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 19,
'scheduler/dequeued/memory': 19,
'scheduler/enqueued': 19,
'scheduler/enqueued/memory': 19,
'start_time': datetime.datetime(2024, 4, 2, 10, 30, 5, 328149, tzinfo=datetime.timezone.utc)}
2024-04-02 18:30:06 [scrapy.core.engine] INFO: Spider closed (finished)
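Extracting data
The best way to work out how to extract data with Scrapy is to try selectors in the Scrapy shell. Always wrap the URL in quotes when running the shell from the command line; otherwise a URL that contains & will not work. On Windows, use double quotes.
scrapy shell <url>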
$ scrapy shell "https://juejin.cn/post/7351399718909689883"
2024-04-02 18:42:51 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: Article)
2024-04-02 18:42:51 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.3 19 Sep 2023), cryptography 41.0.4, Platform Windows-10-10.0.22621-SP0
2024-04-02 18:42:51 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-02 18:42:51 [asyncio] DEBUG: Using selector: SelectSelector
2024-04-02 18:42:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-02 18:42:51 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-04-02 18:42:51 [scrapy.extensions.telnet] INFO: Telnet Password: a06767df8f23b127
2024-04-02 18:42:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2024-04-02 18:42:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Article',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'Article.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['Article.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-02 18:42:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-02 18:42:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-02 18:42:51 [scrapy.middleware] INFO: Enabled item pipelines:
['Article.pipelines.ArticleImagesPipeline',
'Article.pipelines.MysqlPipeline',
'Article.pipelines.ArticlePipeline']
2024-04-02 18:42:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-02 18:42:51 [scrapy.core.engine] INFO: Spider opened
2024-04-02 18:42:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://juejin.cn/post/7351399718909689883> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000018794BFC280>
[s] item {}
[s] request <GET https://juejin.cn/post/7351399718909689883>
[s] response <200 https://juejin.cn/post/7351399718909689883>
[s] settings <scrapy.settings.Settings object at 0x0000018794BFC0D0>
[s] spider <DefaultSpider 'default' at 0x18796f62160>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
>>> response.xpath('//span[@class="name"]/text()').get()
'\n 一觀者也\n '
>>> response.xpath('//span[@class="name"]/text()').getall()
['\n 一觀者也\n ', '\n 一觀者也\n ']
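The strings returned by get() and getall() above still carry the newlines and spaces from the page markup. A minimal sketch of tidying them in plain Python follows; the helper name clean_texts is hypothetical and not part of Scrapy, and the sample list is copied from the shell output above.

```python
def clean_texts(values):
    """Strip surrounding whitespace and drop strings that end up empty."""
    return [v.strip() for v in values if v.strip()]


# Sample values copied from the getall() output shown above.
raw = ['\n 一觀者也\n ', '\n 一觀者也\n ']
print(clean_texts(raw))
```

The same stripping could be done inside a spider callback before yielding the item.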
Extracting data in the spider
import scrapy


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        # Select every post entry on the listing page
        posts = response.xpath('//article[@class="post-item"]')
        for post in posts:
            # URL of the detail page
            detail_url = post.xpath('.//a[@class="post-item-title"]/@href').extract_first("")
            # URL of the image
            image_url = post.xpath('.//p[@class="post-item-summary"]//img/@src').extract_first("")
            yield {
                "detail_url": detail_url,
                "image_url": image_url,
            }

    def parse_detail(self, response):
        pass
2024-04-02 19:08:36 [scrapy.core.engine] INFO: Spider opened
2024-04-02 19:08:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-02 19:08:36 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-02 19:08:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cnblogs.com/pick/> (referer: None)
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/fanzhidongyzby/p/18075179/langchain', 'image_url': 'https://pic.cnblogs.com/face/405877/20190929150121.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/whuanle/p/18045341', 'image_url': 'https://pic.cnblogs.com/face/1315495/20240304085002.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/tcjiaan/p/18012397', 'image_url': 'https://pic.cnblogs.com/face/367389/20220306190127.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/CodeBlogMan/p/17983370', 'image_url': 'https://pic.cnblogs.com/face/2458865/20240117173742.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/ZOMI/p/17561719.html', 'image_url': 'https://pic.cnblogs.com/face/1078349/20211112102413.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/thisiswhy/p/17559808.html', 'image_url': 'https://pic.cnblogs.com/face/1820785/20200106125509.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/t-bar/p/17359545.html', 'image_url': ''}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/xuanyuan/p/17373994.html', 'image_url': 'https://pic.cnblogs.com/face/659280/20210228194243.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/InCerry/p/about-dotnet-auto-apm-instru-impl.html', 'image_url': 'https://pic.cnblogs.com/face/997046/20180730152032.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/boycelee/p/17324600.html', 'image_url': 'https://pic.cnblogs.com/face/765838/20240121172459.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/xbotter/p/semantic_kernel_introduction.html', 'image_url': 'https://pic.cnblogs.com/face/758442/20221218155328.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/coco1s/p/17312405.html', 'image_url': 'https://pic.cnblogs.com/face/608782/20160411131806.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/mylibs/p/production-accident-0002.html', 'image_url': 'https://pic.cnblogs.com/face/1925794/20200120103339.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/cgli/p/17179590.html', 'image_url': 'https://pic.cnblogs.com/face/u160088.jpg?id=11215338'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/eventhorizon/p/17170301.html', 'image_url': 'https://pic.cnblogs.com/face/1201123/20180325141750.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/linvanda/p/17120205.html', 'image_url': 'https://pic.cnblogs.com/face/1997761/20200503230955.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/sheng-jie/p/17100467.html', 'image_url': 'https://pic.cnblogs.com/face/577140/20170107145656.png'}
2024-04-02 19:08:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/makemylife/p/17063284.html', 'image_url': 'https://pic.cnblogs.com/face/2487169/20210801221514.png'}
2024-04-02 19:08:37 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-02 19:08:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 221,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15687,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.625946,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 2, 11, 8, 37, 322028, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 74533,
'httpcompression/response_count': 1,
'item_scraped_count': 18,
'log_count/DEBUG': 22,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 4, 2, 11, 8, 36, 696082, tzinfo=datetime.timezone.utc)}
2024-04-02 19:08:37 [scrapy.core.engine] INFO: Spider closed (finished)
Storing data
scrapy crawl <name> -O <name>.json
This generates a JSON file. The -O option overwrites any existing file, whereas -o appends to it.
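One reason -o can be awkward with .json output is that appending a second JSON array to a file that already holds one leaves a file that is no longer valid JSON. The sketch below imitates that with plain strings and does not call Scrapy; the sample records are made up for illustration. A line-oriented format such as JSON Lines (.jsonl) does not have this problem.

```python
import json

# Two runs, each serialised as its own JSON array (made-up sample data).
first_run = json.dumps([{"detail_url": "https://example.com/a"}])
second_run = json.dumps([{"detail_url": "https://example.com/b"}])

# Appending the second array after the first yields "[...]\n[...]".
appended = first_run + "\n" + second_run


def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(first_run))
print(is_valid_json(appended))

# With one JSON object per line, every line parses on its own.
json_lines = "\n".join(json.dumps(r) for r in [{"n": 1}, {"n": 2}])
records = [json.loads(line) for line in json_lines.splitlines()]
print(records)
```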
$ scrapy crawl cnblogs_pick -O cnblogs_pick.json
2024-04-03 10:13:12 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: Article)
2024-04-03 10:13:12 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.3 19 Sep 2023), cryptography 41.0.4, Platform Windows-10-10.0.22621-SP0
2024-04-03 10:13:12 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 10:13:12 [asyncio] DEBUG: Using selector: SelectSelector
2024-04-03 10:13:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 10:13:12 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-04-03 10:13:12 [scrapy.extensions.telnet] INFO: Telnet Password: fa4f7f9af8c6b5b8
2024-04-03 10:13:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2024-04-03 10:13:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Article',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'Article.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['Article.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 10:13:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 10:13:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 10:13:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 10:13:13 [scrapy.core.engine] INFO: Spider opened
2024-04-03 10:13:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-03 10:13:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-03 10:13:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cnblogs.com/pick/> (referer: None)
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/fanzhidongyzby/p/18075179/langchain', 'image_url': 'https://pic.cnblogs.com/face/405877/20190929150121.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/whuanle/p/18045341', 'image_url': 'https://pic.cnblogs.com/face/1315495/20240304085002.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/tcjiaan/p/18012397', 'image_url': 'https://pic.cnblogs.com/face/367389/20220306190127.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/CodeBlogMan/p/17983370', 'image_url': 'https://pic.cnblogs.com/face/2458865/20240117173742.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/ZOMI/p/17561719.html', 'image_url': 'https://pic.cnblogs.com/face/1078349/20211112102413.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/thisiswhy/p/17559808.html', 'image_url': 'https://pic.cnblogs.com/face/1820785/20200106125509.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/t-bar/p/17359545.html', 'image_url': ''}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/xuanyuan/p/17373994.html', 'image_url': 'https://pic.cnblogs.com/face/659280/20210228194243.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/InCerry/p/about-dotnet-auto-apm-instru-impl.html', 'image_url': 'https://pic.cnblogs.com/face/997046/20180730152032.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/boycelee/p/17324600.html', 'image_url': 'https://pic.cnblogs.com/face/765838/20240121172459.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/xbotter/p/semantic_kernel_introduction.html', 'image_url': 'https://pic.cnblogs.com/face/758442/20221218155328.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/coco1s/p/17312405.html', 'image_url': 'https://pic.cnblogs.com/face/608782/20160411131806.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/mylibs/p/production-accident-0002.html', 'image_url': 'https://pic.cnblogs.com/face/1925794/20200120103339.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/cgli/p/17179590.html', 'image_url': 'https://pic.cnblogs.com/face/u160088.jpg?id=11215338'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/eventhorizon/p/17170301.html', 'image_url': 'https://pic.cnblogs.com/face/1201123/20180325141750.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/linvanda/p/17120205.html', 'image_url': 'https://pic.cnblogs.com/face/1997761/20200503230955.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/sheng-jie/p/17100467.html', 'image_url': 'https://pic.cnblogs.com/face/577140/20170107145656.png'}
2024-04-03 10:13:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cnblogs.com/pick/>
{'detail_url': 'https://www.cnblogs.com/makemylife/p/17063284.html', 'image_url': 'https://pic.cnblogs.com/face/2487169/20210801221514.png'}
2024-04-03 10:13:14 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-03 10:13:14 [scrapy.extensions.feedexport] INFO: Stored json feed (18 items) in: cnblogs.json
2024-04-03 10:13:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 221,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15418,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.431993,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 3, 2, 13, 14, 138355, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 74528,
'httpcompression/response_count': 1,
'item_scraped_count': 18,
'log_count/DEBUG': 22,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 4, 3, 2, 13, 13, 706362, tzinfo=datetime.timezone.utc)}
2024-04-03 10:13:14 [scrapy.core.engine] INFO: Spider closed (finished)
Crawling example
Getting the pagination link
<div class="pager">
<a href="/pick/" class="p_1 current" onclick="aggSite.loadCategoryPostList(1,20);buildPaging(1);return false;">
1
</a>
<a href="/pick/2/" class="p_2 middle" onclick="aggSite.loadCategoryPostList(2,20);buildPaging(2);return false;">
2
</a>
<a href="/pick/3/" class="p_3 middle" onclick="aggSite.loadCategoryPostList(3,20);buildPaging(3);return false;">
3
</a>
<span class="ellipsis">
···
</span>
<a href="/pick/82/" class="p_82 last" onclick="aggSite.loadCategoryPostList(82,20);buildPaging(82);return false;">
82
</a>
<a href="/pick/2/" onclick="aggSite.loadCategoryPostList(2,20);buildPaging(2);return false;">
>
</a>
</div>
$ scrapy shell https://www.cnblogs.com/pick/
2024-04-03 11:17:03 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: Article)
2024-04-03 11:17:03 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.3 19 Sep 2023), cryptography 41.0.4, Platform Windows-10-10.0.22621-SP0
2024-04-03 11:17:03 [scrapy.utils.spider] ERROR: More than one spider can handle: <GET https://www.cnblogs.com/pick/> - cnblogs, cnblogs_pick
2024-04-03 11:17:03 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 11:17:03 [asyncio] DEBUG: Using selector: SelectSelector
2024-04-03 11:17:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 11:17:03 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-04-03 11:17:03 [scrapy.extensions.telnet] INFO: Telnet Password: 3d8a532f590c2d94
2024-04-03 11:17:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2024-04-03 11:17:03 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Article',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'Article.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['Article.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 11:17:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 11:17:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 11:17:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 11:17:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-03 11:17:03 [scrapy.core.engine] INFO: Spider opened
2024-04-03 11:17:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cnblogs.com/pick/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000015FF5D49220>
[s] item {}
[s] request <GET https://www.cnblogs.com/pick/>
[s] response <200 https://www.cnblogs.com/pick/>
[s] settings <scrapy.settings.Settings object at 0x0000015FF5D49070>
[s] spider <DefaultSpider 'default' at 0x15ff4c382b0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.xpath('//div[@class="pager"]/a[contains(text(),">")]/@href').extract_first("")
'/pick/2/'
Recursively following pagination
import scrapy
from scrapy import Request
from urllib import parse
from Article.items import CnblogsPickItem


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        # Select every post entry on the listing page
        posts = response.xpath('//article[@class="post-item"]')
        for post in posts:
            # URL of the detail page
            detail_url = post.xpath('.//a[@class="post-item-title"]/@href').extract_first("")
            # URL of the image
            image_url = post.xpath('.//p[@class="post-item-summary"]//img/@src').extract_first("")
            # Hand the detail page to Scrapy for download; keep image_url empty when
            # there is no image, because urljoin() with "" would return the page URL
            yield Request(
                url=parse.urljoin(response.url, detail_url),
                meta={'image_url': parse.urljoin(response.url, image_url) if image_url else ''},
                callback=self.parse_detail,
            )
        # Hand the next page to Scrapy
        next_page = response.xpath('//div[@class="pager"]/a[contains(text(),">")]/@href').extract_first("")
        # extract_first("") returns "" (never None) when nothing matches
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        pass
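After extracting the data, parse() looks for the link to the next page, builds a full absolute URL with urljoin() (since the links can be relative), and yields a new request for the next page, registering itself as the callback to handle data extraction for that page and to keep the crawl going through all the pages.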
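To make the role of urljoin() concrete, here is a small standalone sketch using the standard library's urllib.parse.urljoin with the listing URL from this document as the base. The relative href is the one found in the pager markup; the absolute URL is taken from the crawl output above.

```python
from urllib.parse import urljoin

base = "https://www.cnblogs.com/pick/"

# A root-relative href, as found in the pager markup.
next_page = urljoin(base, "/pick/2/")
# An already absolute URL is returned unchanged.
absolute = urljoin(base, "https://www.cnblogs.com/whuanle/p/18045341")
# An empty href resolves to the base URL itself.
empty = urljoin(base, "")

print(next_page)
print(absolute)
print(empty)
```

The last case is why an empty extraction result should be checked before joining: the join silently produces the current page URL.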
Recursively extracting pagination and detail-page data
import scrapy
from scrapy import Request
from urllib import parse
from Article.items import CnblogsPickItem


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        # Select every post entry on the listing page
        posts = response.xpath('//article[@class="post-item"]')
        for post in posts:
            # URL of the detail page
            detail_url = post.xpath('.//a[@class="post-item-title"]/@href').extract_first("")
            # URL of the image
            image_url = post.xpath('.//p[@class="post-item-summary"]//img/@src').extract_first("")
            # Hand the detail page to Scrapy for download; keep image_url empty when
            # there is no image, because urljoin() with "" would return the page URL
            yield Request(
                url=parse.urljoin(response.url, detail_url),
                meta={'image_url': parse.urljoin(response.url, image_url) if image_url else ''},
                callback=self.parse_detail,
            )
        # Hand the next page to Scrapy
        next_page = response.xpath('//div[@class="pager"]/a[contains(text(),">")]/@href').extract_first("")
        # extract_first("") returns "" (never None) when nothing matches
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        cnblogs_item = CnblogsPickItem()
        # URL of the detail page
        source_url = response.url
        title = response.xpath("//title/text()").extract_first("")
        content = response.xpath('//div[@id="cnblogs_post_body"]').extract_first("")
        # Value passed in through meta
        if response.meta.get('image_url', ''):
            cnblogs_item['image_url'] = [response.meta.get('image_url', '')]
        else:
            cnblogs_item['image_url'] = []
        cnblogs_item['title'] = title
        cnblogs_item['content'] = content
        cnblogs_item['source_url'] = source_url
        yield cnblogs_item
Spider
scrapy.Spider
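This is the simplest spider, and the one that every other spider must inherit from. It provides a default start_requests() implementation.
name
A string that defines the name of the spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique.
allowed_domains
An optional list of strings containing the domains that the spider is allowed to crawl.
start_urls
The list of URLs the spider starts crawling from when no particular URLs are specified.
items
Spiders return the extracted data as items.
Item objects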
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class CnblogsPickItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    image_url = scrapy.Field()
    image_path = scrapy.Field()
    source_url = scrapy.Field()
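Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through the pipeline components in the order given by their configured priorities.
process_item()
This method is called for every item pipeline component.
def process_item(self, item, spider):
close_spider()
This method is called when the spider is closed.
def close_spider(self, spider):
Example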
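import codecs
import json


class JsonImagesPipeline:
    def __init__(self):
        # Append mode: each run adds lines to the existing file.
        self.file = codecs.open('article.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # The hook must be named close_spider() for Scrapy to call it on the pipeline.
        self.file.close()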
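Settings
Scrapy settings let you customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
Populating the settings
Settings can be populated through different mechanisms, each with a different precedence. In decreasing order of precedence:
- Command-line options (highest precedence)
- Settings per spider
- Project settings module
- Default settings per command
- Default global settings (lowest precedence)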
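The precedence list can be pictured as a chain of lookups in which the first layer that defines a key wins. The sketch below models that idea with collections.ChainMap; it is only an analogy and not how Scrapy stores settings internally, and the values in each layer are hypothetical.

```python
from collections import ChainMap

# Hypothetical values for each layer, highest precedence first.
command_line = {"LOG_FILE": "scrapy.log"}
per_spider = {"COOKIES_ENABLED": True}
project = {"COOKIES_ENABLED": False, "DOWNLOAD_DELAY": 2.5}
command_defaults = {}
global_defaults = {"DOWNLOAD_DELAY": 0, "ROBOTSTXT_OBEY": False, "LOG_FILE": None}

# ChainMap searches the mappings from left to right and returns the first hit.
settings = ChainMap(command_line, per_spider, project, command_defaults, global_defaults)

print(settings["LOG_FILE"])         # taken from the command-line layer
print(settings["COOKIES_ENABLED"])  # the per-spider value beats the project value
print(settings["DOWNLOAD_DELAY"])   # the project value beats the global default
print(settings["ROBOTSTXT_OBEY"])   # falls through to the global default
```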
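Command-line options
Arguments provided on the command line take the highest precedence and override any other option. You can override one or more settings with the -s (or --set) command-line option.
scrapy crawl myspider -s LOG_FILE=scrapy.log
Settings per spider
Spiders can define their own settings, which take precedence over and override the project settings. They do so by setting the custom_settings attribute: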
import scrapy
from scrapy import Request


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]
    custom_settings = {
        'COOKIES_ENABLED': True,
    }
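Project settings module
The project settings module is the standard configuration file of a Scrapy project, where most custom settings are populated. For a standard Scrapy project, this means the settings.py file created for the project.
Default settings per command
Each Scrapy tool command can have its own default settings, which override the global defaults. These custom command settings are specified in the default_settings attribute of the command class.
Default global settings
The global defaults are located in the scrapy.settings.default_settings module.
Built-in settings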
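ROBOTSTXT_OBEY
Default: False
If enabled, Scrapy respects robots.txt policies.
DOWNLOAD_DELAY
Default: 0
The minimum number of seconds to wait between two consecutive requests to the same domain.
Decimal numbers are supported. For example, to send at most one request every 2.5 seconds (4 requests every 10 seconds):
DOWNLOAD_DELAY = 2.5
When CONCURRENT_REQUESTS_PER_IP is non-zero, the delay is enforced per IP address instead of per domain.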
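The arithmetic behind the "4 requests every 10 seconds" figure can be checked with a tiny helper. The function name max_requests is hypothetical, and the calculation is a rough upper bound that assumes a fixed delay and ignores concurrency limits and any other throttling.

```python
def max_requests(window_seconds, download_delay):
    """Upper bound on requests to one domain in a time window, given a fixed delay."""
    if download_delay <= 0:
        raise ValueError("download_delay must be positive to bound the request rate")
    return int(window_seconds // download_delay)


print(max_requests(10, 2.5))   # the example from the text: 4 requests per 10 seconds
print(max_requests(60, 0.25))  # a 250 ms delay allows at most 240 requests per minute
```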
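ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use and their order. The order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower values are processed before higher values.
ITEM_PIPELINES = {
    'Article.pipelines.ArticleImagesPipeline': 1,
    'Article.pipelines.MysqlPipeline': 2,
    'Article.pipelines.ArticlePipeline': 300,
}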
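To see what "lower values first" means for the dict above, the following sketch sorts the pipeline paths by their order value. The helper processing_order is hypothetical; it only illustrates the ordering rule and is not Scrapy's own loading code.

```python
# The same pipelines as above, deliberately listed out of order.
ITEM_PIPELINES = {
    "Article.pipelines.ArticlePipeline": 300,
    "Article.pipelines.ArticleImagesPipeline": 1,
    "Article.pipelines.MysqlPipeline": 2,
}


def processing_order(pipelines):
    """Return the pipeline paths sorted by their order value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


for path in processing_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

The resulting order matches the "Enabled item pipelines" list in the scrapy shell log earlier in this document.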
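Downloading and processing files and images
Scrapy provides reusable item pipelines for downloading files attached to a particular item. These pipelines share some functionality and structure (they are referred to as media pipelines), but typically you will use either the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:
- Avoid re-downloading media that was downloaded recently
- Specify where to store the media
The Images Pipeline has a few extra functions for processing images:
- Convert all downloaded images to a common format (JPG) and mode (RGB)
- Thumbnail generation
- Check image width/height to make sure they meet a minimum constraint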
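Files pipeline
- In a spider, you scrape an item and put the URLs of the desired files into a file_urls field.
- The item is returned from the spider and goes to the item pipeline.
- When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item remains locked at that particular pipeline stage until the files have finished downloading (or have failed for some reason).
- When the files are downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped URL (taken from the file_urls field), the file checksum, and the file status. The files in the list keep the same order as the original file_urls field. If a file fails to download, an error is logged and the file is not present in the files field.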
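The shape of the files field described in the last step can be sketched without running a crawl. The code below builds an entry by hand: the path follows the documented convention of naming stored files after the SHA-1 hash of their URL, while the extension handling is a simplification and an assumption. The helper names, the example URL, and the fake file body are all made up for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sketch_file_path(url):
    """Approximate the default storage path: full/<sha1 of the URL><extension>."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    extension = PurePosixPath(url).suffix  # simplified; ignores query strings
    return f"full/{digest}{extension}"


def sketch_result(url, content):
    """Build a dict shaped like one entry of the files field."""
    return {
        "url": url,                                    # original URL from file_urls
        "path": sketch_file_path(url),                 # path relative to FILES_STORE
        "checksum": hashlib.md5(content).hexdigest(),  # hash of the file contents
        "status": "downloaded",
    }


# A plain dict stands in for the scraped item.
item = {"file_urls": ["https://example.com/report.pdf"]}
item["files"] = [sketch_result(u, b"fake file body") for u in item["file_urls"]]
print(item["files"][0]["path"])
```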
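Images pipeline
Using the ImagesPipeline is a lot like using the FilesPipeline, except that the default field names are different: you use image_urls for the image URLs of an item, and it populates an images field with information about the downloaded images. The advantage of using the ImagesPipeline for image files is that you can configure some extra functions, such as generating thumbnails and filtering images by size. The Images Pipeline requires Pillow 7.1.0 or greater, which is used for thumbnailing and for normalizing images to JPEG/RGB format.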
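Enabling the pipeline
To enable a media pipeline, you must first add it to the project's ITEM_PIPELINES setting.
For the Images Pipeline, use:
ITEM_PIPELINES = {'Article.pipelines.ArticleImagesPipeline': 1}
For the Files Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
You can also use the Files and Images Pipelines at the same time.
Then set the target storage setting to a valid value that will be used for storing the downloaded files or images. Otherwise the pipeline remains disabled, even if you include it in the ITEM_PIPELINES setting.
For the Files Pipeline, set FILES_STORE:
FILES_STORE = '/path/to/valid/dir'
For the Images Pipeline, set IMAGES_STORE:
IMAGES_STORE = '/path/to/valid/dir'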
Example
import os

image_path = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(image_path, "images")
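For the case where both media pipelines are used together, a settings.py fragment might look like the sketch below. It uses Scrapy's built-in pipeline classes; the order values and the directory paths are placeholders chosen for illustration.

```python
# settings.py (sketch): enable the built-in images and files pipelines together.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "scrapy.pipelines.files.FilesPipeline": 2,
}

# Placeholder directories. Each pipeline stays disabled unless its own
# store setting points at a valid location.
FILES_STORE = "/path/to/valid/dir/files"
IMAGES_STORE = "/path/to/valid/dir/images"
```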
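Usage example
To use a media pipeline, first enable it. Then, if a spider returns an item object with the URLs field (file_urls for the files pipeline, image_urls for the images pipeline), the pipeline puts the results under the corresponding result field (files or images).
When using an item type whose fields are defined beforehand, you must define both the URLs field and the results field. For example, when using the images pipeline, the item must define both the image_urls and the images fields:
import scrapy


class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()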
If you want to use a different field name for the URLs key or the results key, you can override it.
For the files pipeline, set FILES_URLS_FIELD and/or FILES_RESULT_FIELD:
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
For the images pipeline, set IMAGES_URLS_FIELD and/or IMAGES_RESULT_FIELD:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
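Extending the media pipelines
item_completed
item_completed(results, item, info)
The FilesPipeline.item_completed() method is called when all file requests for a single item have completed (either finished downloading or failed for some reason).
The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.
Below is an example of the item_completed() method that stores the downloaded file path in the image_path item field, or sets it to '' if nothing was downloaded.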
from scrapy.pipelines.images import ImagesPipeline


class ArticleImagesPipeline(ImagesPipeline):
    # Override item_completed to record where the downloaded image was stored
    def item_completed(self, results, item, info):
        if 'image_url' in item:
            image_path = ''
            # results is a list of (success, image_info) tuples
            for success, image_info in results:
                if success:
                    image_path = image_info['path']
            # Store the path on the item ('' if nothing was downloaded)
            item['image_path'] = image_path
        return item
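The loop above relies on the shape of the results argument: a list of 2-tuples whose first element says whether the download succeeded. The sketch below runs the same extraction on a hand-written results value. The URLs, path, and checksum are made up, and a plain exception stands in for the Twisted Failure that Scrapy passes for failed downloads.

```python
# Hypothetical `results` value for an item with two image URLs.
# On success the second element is a dict describing the download;
# on failure Scrapy passes a Twisted Failure (a plain exception here).
results = [
    (True, {"url": "https://example.com/a.png", "path": "full/a.jpg",
            "checksum": "aaa", "status": "downloaded"}),
    (False, RuntimeError("download failed")),
]

# Collect the storage paths of the successful downloads only.
image_paths = [info["path"] for ok, info in results if ok]
# Keep the last successful path, or '' when nothing was downloaded,
# which matches the behaviour of the loop in the pipeline above.
image_path = image_paths[-1] if image_paths else ""
print(image_path)
```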
Asynchronous database insertion
from twisted.enterprise import adbapi


class MysqlTwistPipeline:
    @classmethod
    def from_settings(cls, settings):
        from MySQLdb.cursors import DictCursor
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=DictCursor,
            use_unicode=True,
        )
        # Twisted connection pool: queries run in a thread pool, off the reactor thread
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparms)
        return cls(dbpool)

    def __init__(self, dbpool):
        self.dbpool = dbpool

    def process_item(self, item, spider):
        # Run the insert asynchronously; Scrapy waits on the returned Deferred
        d = self.dbpool.runInteraction(self.do_insert, item)
        d.addErrback(self.handle_error, item, spider)
        d.addBoth(self.handle_complete, item, spider)
        return d

    def handle_error(self, failure, item, spider):
        # Print the error
        print(failure)

    def handle_complete(self, result, item, spider):
        # Print the result
        print(result)
        # Return the item so that later pipelines receive it instead of None
        return item

    def do_insert(self, cursor, item):
        # Perform the actual insert.
        # Build the SQL statement for this item type and insert it into MySQL.
        insert_sql = """insert into cnblogs_pick(title, content, image_url, image_path,source_url) VALUES (%s, %s, %s, %s, %s)"""
        params = list()
        params.append(item.get('title', '')[0:10])
        params.append(item.get('content', ''))
        params.append(item.get('image_url', ''))
        params.append(item.get('image_path', ''))
        params.append(item.get('source_url', ''))
        cursor.execute(insert_sql, tuple(params))
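The parameterized insert in do_insert can be tried without a MySQL server by swapping in the standard-library sqlite3 module. This is a stand-in, not the pipeline itself: sqlite3 uses `?` placeholders where MySQLdb uses `%s`, the table schema is an assumption based on the column names in the statement above, and the sample item is made up.

```python
import sqlite3

# Same column list as the INSERT statement above, with sqlite3 placeholders.
INSERT_SQL = (
    "INSERT INTO cnblogs_pick(title, content, image_url, image_path, source_url) "
    "VALUES (?, ?, ?, ?, ?)"
)
FIELDS = ("title", "content", "image_url", "image_path", "source_url")


def do_insert(cursor, item):
    """Build the parameter tuple from the item and run the insert."""
    params = tuple(item.get(field, "") for field in FIELDS)
    cursor.execute(INSERT_SQL, params)


conn = sqlite3.connect(":memory:")
# Assumed schema: one TEXT column per field.
conn.execute(
    "CREATE TABLE cnblogs_pick(title TEXT, content TEXT, image_url TEXT, "
    "image_path TEXT, source_url TEXT)"
)
# Hypothetical item; missing fields default to ''.
item = {"title": "Hello", "source_url": "https://www.cnblogs.com/pick/"}
do_insert(conn.cursor(), item)
conn.commit()
rows = conn.execute("SELECT title, image_path, source_url FROM cnblogs_pick").fetchall()
print(rows)
```

Passing the values as parameters, instead of formatting them into the SQL string, leaves quoting and escaping to the database driver.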
ItemLoader
Using ItemLoader
To use an ItemLoader, you must first instantiate it.
Collected values are stored as lists.
Without ItemLoader
import scrapy
from scrapy import Request
from urllib import parse
from Article.items import CnblogsPickItem


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        # Select every post entry on the listing page
        posts = response.xpath('//article[@class="post-item"]')
        for post in posts:
            # URL of the detail page
            detail_url = post.xpath('.//a[@class="post-item-title"]/@href').extract_first("")
            # URL of the cover image ('' if the post has none)
            image_url = post.xpath('.//p[@class="post-item-summary"]//img/@src').extract_first("")
            if image_url:
                image_url = parse.urljoin(response.url, image_url)
            # Hand the detail page to Scrapy for download
            yield Request(url=parse.urljoin(response.url, detail_url), meta={'image_url': image_url}, callback=self.parse_detail)
        # Follow the next page, if there is one
        next_page = response.xpath('//div[@class="pager"]/a[contains(text(),">")]/@href').extract_first("")
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        cnblogs_item = CnblogsPickItem()
        # URL of the detail page
        source_url = response.url
        title = response.xpath("//title/text()").extract_first("")
        content = response.xpath('//div[@id="cnblogs_post_body"]').extract_first("")
        # Read the value passed through meta
        if response.meta.get('image_url', ''):
            cnblogs_item['image_url'] = [response.meta.get('image_url', '')]
        else:
            cnblogs_item['image_url'] = []
        cnblogs_item['title'] = title
        cnblogs_item['content'] = content
        cnblogs_item['source_url'] = source_url
        yield cnblogs_item
With ItemLoader
import scrapy
from scrapy import Request
from urllib import parse
from Article.items import CnblogsPickItem
from scrapy.loader import ItemLoader


class CnblogsPickSpider(scrapy.Spider):
    name = "cnblogs_pick"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/pick/"]

    def parse(self, response):
        # Select every post entry on the listing page
        posts = response.xpath('//article[@class="post-item"]')
        for post in posts:
            # URL of the detail page
            detail_url = post.xpath('.//a[@class="post-item-title"]/@href').extract_first("")
            # URL of the cover image ('' if the post has none)
            image_url = post.xpath('.//p[@class="post-item-summary"]//img/@src').extract_first("")
            if image_url:
                image_url = parse.urljoin(response.url, image_url)
            # Hand the detail page to Scrapy for download
            yield Request(url=parse.urljoin(response.url, detail_url), meta={'image_url': image_url}, callback=self.parse_detail)
        # Follow the next page, if there is one
        next_page = response.xpath('//div[@class="pager"]/a[contains(text(),">")]/@href').extract_first("")
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        # The manual field assignments of the version without ItemLoader
        # are replaced by the loader calls below.
        item_loader = ItemLoader(item=CnblogsPickItem(), response=response)
        item_loader.add_xpath('title', "//title/text()")
        item_loader.add_xpath('content', '//div[@id="cnblogs_post_body"]')
        item_loader.add_value('source_url', response.url)
        item_loader.add_value('image_url', response.meta.get('image_url', ''))
        cnblogs_item = item_loader.load_item()
        yield cnblogs_item
Declaring input and output processors
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags


def filter_price(value):
    # Non-numeric values fall through and return None, which drops them
    if value.isdigit():
        return value


class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(separator=','),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
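To show what these processors do to the collected lists, here is a dependency-free sketch. The functions `map_compose` and `take_first` are simplified stand-ins written for this example, not the itemloaders implementations, the regex-based `remove_tags` only approximates w3lib's function, and the input values are made up.

```python
import re


def remove_tags(value):
    """Rough stand-in for w3lib.html.remove_tags (regex-based, illustration only)."""
    return re.sub(r"<[^>]+>", "", value)


def filter_price(value):
    # Non-numeric values fall through and return None, which drops them.
    if value.isdigit():
        return value


def map_compose(*functions):
    """Apply each function to every value in turn, discarding None results."""
    def process(values):
        for function in functions:
            values = [out for out in map(function, values) if out is not None]
        return values
    return process


def take_first(values):
    """Return the first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value


# Hypothetical extracted data: a loader always collects values in a list.
raw_names = ["<b>Desk</b>", "<i>oak</i>"]
raw_prices = ["<span>n/a</span>", "<span>120</span>"]

name = ",".join(map_compose(remove_tags)(raw_names))
price = take_first(map_compose(remove_tags, filter_price)(raw_prices))
print(name, price)
```

The input processor runs on every collected value, while the output processor reduces the resulting list to the final field value.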
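Downloader middleware
The downloader middleware is a framework of hooks into Scrapy's request/response processing, used to globally alter Scrapy's requests and responses. Middlewares are enabled through the DOWNLOADER_MIDDLEWARES setting.
DOWNLOADER_MIDDLEWARES default: {}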
Activation
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}
To disable a built-in middleware, you must set its value to None in DOWNLOADER_MIDDLEWARES in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "Article.middlewares.RandomUserAgentMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
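The effect of the None entry can be sketched by merging the project setting with an excerpt of Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE). Only three of the built-in entries are listed here, and the merge below is a simplified model of how the final middleware list is derived, not Scrapy's own code.

```python
# Excerpt of Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE).
BASE = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
}
# Project setting from the example above.
DOWNLOADER_MIDDLEWARES = {
    "Article.middlewares.RandomUserAgentMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}

# Project entries override the defaults; a value of None disables a middleware.
merged = {**BASE, **DOWNLOADER_MIDDLEWARES}
enabled = sorted(
    (path for path, order in merged.items() if order is not None),
    key=merged.get,
)
print(enabled)
```

The built-in UserAgentMiddleware no longer appears in the result, and the custom middleware takes a slot between the remaining built-ins according to its order value.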
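Built-in downloader middleware
COOKIES_ENABLED
Default: True
Whether to enable the cookies middleware. If disabled, no cookies are sent to web servers.
Set this according to the target site. If a spider needs a value that conflicts with the global setting, override it in that spider.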
Custom middleware
RandomUserAgentMiddleware
UserAgentPool.py
Because modules such as fake-useragent are no longer being updated, a local list of user agents is used instead.
import random


class UserAgentPool:
    def user_agent_list(self):
        user_agent_pool = [  # User-Agent pool
# Cent Browser 4.3.9.248, Chromium 86.0.4240.198
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36', # 2021.01
# Cent Browser 5.0.1002.295, Chromium 102.0.5005.167
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.167 Safari/537.36', # 2022.12
# Edge
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.38', # 2021.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.44', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.47', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36 Edg/94.0.992.31', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36 Edg/94.0.992.37', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.47', # 2021.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.44', # 2021.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36 Edg/96.0.1054.29', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.41', # 2021.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55', # 2022.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43', # 2022.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.55', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.56', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.30', # 2022.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.39', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36 Edg/99.0.1150.46', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36 Edg/99.0.1150.52', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36 Edg/99.0.1150.55', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29', # 2022.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36 Edg/100.0.1185.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36 Edg/100.0.1185.39', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36 Edg/101.0.1210.32', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39', # 2022.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.30', # 2022.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.39', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.44', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36 Edg/103.0.1264.44', # 2022.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.47', # 2022.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.63', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27', # 2022.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.42', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.53', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.34', # 2022.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.37', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.47', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.24', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.26', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.35', # 2022.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.52', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46', # 2022.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.76', # 2023.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.55', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.78', # 2023.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.41', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.46', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.49', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.57', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63', # 2023.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.69', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.41', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.43', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.51', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.54', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.34', # 2023.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.46', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.58', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.64', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.68', # 2023.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.50', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.37', # 2023.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.41', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.51', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.58', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.67', # 2023.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.79', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.82', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.86', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.183', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200', # 2023.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.62', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.69', # 2023.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.76', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.81', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.31', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.41', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.43',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.47', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.55', # 2023.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.60', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.46', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.57', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.61', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.69', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.76', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.44', # 2023.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.58', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.72', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.93', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.97', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.61', # 2023.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.77', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.89', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.91', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.121', # 2024.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.133', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.144', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.83', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.98', # 2024.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.106', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.110', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.112', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.128', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.52', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.59', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.63', # 2024.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.66', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.80', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.2365.92', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.53', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.65',
# Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36', # 2021.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36', # 2021.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36', # 2021.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36', # 2022.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36', # 2022.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36', # 2022.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36', # 2022.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36', # 2022.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36', # 2022.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36', # 2022.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36', # 2022.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.54 Safari/537.36', # 2022.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.127 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.91 Safari/537.36', # 2022.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.103 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.119 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.63 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.88 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.106 Safari/537.36', # 2022.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.107 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.122 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.72 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.95 Safari/537.36', # 2022.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.99 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.100 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.125 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.75 Safari/537.36', # 2023.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.120 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.78 Safari/537.36', # 2023.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.104 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.105 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.178 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.180 Safari/537.36', # 2023.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.64 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.65 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.111 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.112 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.147 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.50 Safari/537.36', # 2023.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.87 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.121 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.138 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5672.64 Safari/537.36', # 2023.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5672.93 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5672.127 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.91 Safari/537.36', # 2023.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.110 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.134 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.198 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.199 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.99 Safari/537.36', # 2023.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.102 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.110 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.171 Safari/537.36', # 2023.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.172 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.97 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.111 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.112 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.141 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.180 Safari/537.36', # 2023.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.188 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.63 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.89 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.92 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.132 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.149 Safari/537.36', # 2023.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.150 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.71 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.89 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.118 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.6045.106 Safari/537.36', # 2023.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.6045.124 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.6045.160 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.6045.200 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.63 Safari/537.36', # 2023.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.71 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.110 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.130 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.131 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.200 Safari/537.36', # 2024.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.216 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.217 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.224 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.225 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.86 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.140 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.161 Safari/537.36', # 2024.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.185 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.58 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.70 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.95 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.112 Safari/537.36', # 2024.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.129 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.6312.59 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.6312.86 Safari/537.36',
# Chrome Beta
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.41 Safari/537.36', # 2021.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.17 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.32 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.40 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.49 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.18 Safari/537.36', # 2021.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.27 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.35 Safari/537.36', # 2021.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
# Firefox
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0', # 2021.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0', # 2021.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0', # 2021.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0', # 2021.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0', # 2022.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0', # 2022.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0', # 2022.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0', # 2022.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0', # 2022.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0', # 2022.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0', # 2022.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0', # 2022.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0', # 2022.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0', # 2022.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0', # 2022.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0', # 2022.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0', # 2022.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0', # 2023.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0', # 2023.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0', # 2023.03
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0', # 2023.04
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0', # 2023.05
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0', # 2023.06
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0', # 2023.07
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0', # 2023.08
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0', # 2023.09
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', # 2023.10
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0', # 2023.11
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0', # 2023.12
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0', # 2024.01
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0', # 2024.02
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', # 2024.03
]
return user_agent_pool
def get_user_agent(self):
    # Pick one entry at random and return it, so callers can use it as a header value.
    user_agent_pool = self.user_agent_list()
    return random.choice(user_agent_pool)
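The pool can be exercised on its own, outside Scrapy, as a quick sanity check. The sketch below is a trimmed-down stand-in that keeps only two entries from the list above; it assumes `get_user_agent` hands the chosen string back to the caller, which is what a middleware needs in order to place it in a request header.

```python
import random


class UserAgentPool:
    """Trimmed-down stand-in for the pool above (two entries only)."""

    def user_agent_list(self):
        return [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.6312.86 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
        ]

    def get_user_agent(self):
        # Return the random choice so the caller can use it as a header value.
        return random.choice(self.user_agent_list())


pool = UserAgentPool()
ua = pool.get_user_agent()
```

Each call draws independently, so consecutive requests may or may not receive the same string.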
settings.py
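DOWNLOADER_MIDDLEWARES = {
    # Enable the custom middleware at priority 543.
    "Article.middlewares.RandomUserAgentMiddleware": 543,
    # None disables Scrapy's built-in User-Agent middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}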
middlewares.py
from Article.Tools.UserAgentPool import UserAgentPool
class RandomUserAgentMiddleware:
    '''
    Rotate the User-Agent header at random.
    '''
    def __init__(self, crawler):
        super().__init__()
        # fake_useragent fetches its data from a remote endpoint and can fail on
        # network errors, so reading the pool from a local list is more reliable.
        self.ua = UserAgentPool()
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)
    def process_request(self, request, spider):
        # setdefault only fills the header when the request has not set one already.
        request.headers.setdefault('User-Agent', self.ua.get_user_agent())
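One detail of `process_request` worth noting: `setdefault` fills in the header only when the request does not already carry one, so a spider that sets its own `User-Agent` on a request keeps that value. The sketch below illustrates this with a plain `dict` standing in for `request.headers` and a hypothetical two-entry `POOL`; both are assumptions made for illustration, not part of the project code.

```python
import random

# Hypothetical two-entry pool standing in for UserAgentPool.
POOL = ['agent-a', 'agent-b']


def apply_random_user_agent(headers):
    # Same call shape as in process_request: set the header only if it is absent.
    headers.setdefault('User-Agent', random.choice(POOL))
    return headers


fresh = apply_random_user_agent({})
preset = apply_random_user_agent({'User-Agent': 'custom-agent'})
```

Here `fresh` receives one of the pool entries, while `preset` keeps `custom-agent`. If every request should be overwritten regardless, assigning `request.headers['User-Agent'] = ...` in the middleware would be the alternative.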
AutoThrottle extension
Settings
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.