I'm still a Scrapy beginner myself, so if anything below is wrong, please don't hold back; let me know in the comments. Much appreciated!
At the time of writing, the latest Scrapy release is 2.5.0.
Alright, enough preamble; let's get started!
Command-line help
Almost every command-line tool ships with built-in usage help; it's practically an industry default, and Scrapy is no exception.
$ scrapy -h
Scrapy 2.5.0 - project: scrapybot
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
startproject
Creates a new Scrapy project, generating the standard project skeleton automatically.
Command format
$ scrapy startproject <project_name> [project_dir]  # project_dir is optional; if omitted, a folder named after the project becomes the project directory
Command example
$ scrapy startproject example
New Scrapy project 'example', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
D:\WorkSpace\Personal\my-scrapy\example
You can start your first spider with:
cd example
scrapy genspider example example.com
$ ls
example readme.md venv
The generated project structure looks roughly as follows.
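This is the layout Scrapy's standard project template produces (the exact file list may vary slightly between versions):
example/
├── scrapy.cfg            # deploy/config file marking the project root
└── example/              # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # your spiders live here
        └── __init__.py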
genspider
Generates a new spider from a predefined template. This is very handy and can speed up development considerably, provided you have a good set of templates; scrapy genspider -l lists the built-in ones (basic, crawl, csvfeed, xmlfeed).
Command format
$ scrapy genspider [-t template_name] <spider_name> <domain>  # the template is optional; the default basic template is used if omitted
Command example
$ cd example/example/spiders/
$ scrapy genspider exampleSpider example.spider.com
Created spider 'exampleSpider' using template 'basic' in module:
example.spiders.exampleSpider
The generated spider looks like this.
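A sketch of what the basic template produces (the generated class name may differ slightly):
import scrapy


class ExamplespiderSpider(scrapy.Spider):
    name = 'exampleSpider'
    allowed_domains = ['example.spider.com']
    start_urls = ['http://example.spider.com/']

    def parse(self, response):
        # the template leaves the parsing logic to you
        pass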
crawl
Runs a spider. It looks a bit like runspider, but crawl only works inside a proper Scrapy project structure and refers to the spider by name.
Command format
$ scrapy crawl <spider_name>  # the name we gave the spider when creating it with genspider above
Command example
$ scrapy crawl exampleSpider
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:50:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-25 01:50:59 [scrapy.crawler] INFO: Overridden settings:
...
'start_time': datetime.datetime(2021, 7, 24, 17, 51, 0, 206683)}
2021-07-25 01:51:02 [scrapy.core.engine] INFO: Spider closed (finished)
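One tip: by default the scraped items aren't saved anywhere. The -o option appends them to a feed file, with the format inferred from the extension; for example:
$ scrapy crawl exampleSpider -o items.json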
runspider
This command also runs a spider, but it takes a standalone spider file, i.e. a single Spider outside any project.
Command format
$ scrapy runspider <spider_file>.py
Command example
$ scrapy runspider exampleSpider.py
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:54:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
'start_time': datetime.datetime(2021, 7, 24, 17, 54, 24, 908097)}
2021-07-25 01:54:31 [scrapy.core.engine] INFO: Spider closed (finished)
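For this to work, the file just needs to contain a Spider subclass; a minimal self-contained sketch of what such a file might look like (URL and field name are made up):
# exampleSpider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # yield plain dicts; no project, items.py, or pipelines required
        yield {'title': response.css('title::text').get()}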
bench
Benchmark test. It runs a simple built-in spider to benchmark Scrapy. I wasn't sure at first what it actually measures: Scrapy spins up a local dummy web server on the fly and crawls its links as fast as possible, so the numbers show the raw crawl throughput (pages per minute) your hardware can sustain, an upper bound that real spiders won't reach, since they do actual parsing work.
Command example
$ scrapy bench
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 01:58:19 [scrapy.extensions.logstats] INFO: Crawled 90 pages (at 5400 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:20 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 6360 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:21 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 5340 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:22 [scrapy.extensions.logstats] INFO: Crawled 369 pages (at 5040 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:23 [scrapy.extensions.logstats] INFO: Crawled 433 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:24 [scrapy.extensions.logstats] INFO: Crawled 513 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:25 [scrapy.extensions.logstats] INFO: Crawled 593 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:26 [scrapy.extensions.logstats] INFO: Crawled 657 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:27 [scrapy.extensions.logstats] INFO: Crawled 721 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:28 [scrapy.extensions.logstats] INFO: Crawled 785 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
...
'start_time': datetime.datetime(2021, 7, 24, 17, 58, 18, 691354)}
2021-07-25 01:58:29 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
check
Checks your spiders using contracts, somewhat like a static check for code: it catches broken spiders before you run a full crawl. Contracts are declared in a callback's docstring, as shown in the sketch after the example below.
Command format
$ scrapy check [-l] <spider>
Command example
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
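A contract is just a few @-annotations in a callback's docstring; a minimal sketch (the URL and field names are hypothetical):
def parse(self, response):
    """Check that this callback scrapes what we expect from a sample page.

    @url http://www.example.com/sample.html
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Price
    """
scrapy check fetches the @url and verifies the callback's output against the @returns and @scrapes lines.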
list
Lists all available spiders in the current project.
Command format
$ scrapy list
Command example
$ scrapy list
hotList
edit
Edits a spider; fine for a quick temporary tweak. It opens the spider's code in an editor.
Command format
$ scrapy edit <spider>
Command example
$ scrapy edit hotList
'%s' is not recognized as an internal or external command,
operable program or batch file.
# It seems there's no default editor configured on my machine, hence the error
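The editor comes from Scrapy's EDITOR setting, which defaults to the EDITOR environment variable, so pointing that at a real editor should fix it; a sketch:
$ set EDITOR=notepad   # Windows cmd; use export EDITOR=vim on Linux/macOS
$ scrapy edit hotList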
fetch
Fetches a page using the Scrapy downloader, so you see exactly what Scrapy would see.
Command format
$ scrapy fetch <url>
# supported options
--spider=SPIDER: fetch the page using the given spider, useful for checking that a spider's settings take effect
--headers: print the HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects
Command example
$ scrapy fetch https://www.baidu.com
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 02:16:17 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<html><!--STATUS OK--><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="Cache-control" content="no-cache" /><meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no"/><style type="text/css">body {margin: 0;text-align: center;font-size: 14px;font-family: Arial,Helvetica,LiHei Pro Medium;color: #262626;}form {position: relative;margin: 12px 15px 91px;height: 41px;}img {border: 0}.wordWrap{margin-right: 85px;}#word {background-color: #FFF;border: 1px solid #6E6E6E;color: #000;font-size: 18px;height: 27px;padding: 6px;width: 100%;-webkit-appearance: none;-webkit-border-radius: 0;border-radius: 0;}.bn {background-color: #F5F5F5;border: 1px solid #787878;font-size: 16px;
# prints the HTML of the Baidu homepage
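To inspect just the headers, combine --headers with the global --nolog flag (available on every Scrapy command) to silence the log:
$ scrapy fetch --nolog --headers https://www.baidu.com
# prints the request and response headers instead of the body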
view
Has Scrapy download a page and open it in your browser. The official docs point out that spiders sometimes 'see' a page differently from a regular user, so this is a quick way to check whether the content you're after is actually there for the crawler.
Command format
$ scrapy view <url>
# supported options
--spider=SPIDER: fetch the page using the given spider
--no-redirect: do not follow HTTP 3xx redirects
Command example
$ scrapy view https://www.baidu.com
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 02:20:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
# ...
It saved the page to a local file and opened that, and it really does look different from the Baidu we normally see!
shell
Another command you'll use a lot while developing spiders. Unlike fetch, it drops you into an interactive Python console (or runs a one-off snippet via -c) for testing how to parse a page.
Command format
$ scrapy shell [url]
Command example
$ scrapy shell --nolog -c '(response.status, response.url)' https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25
(200, 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0')
# Note how &pageSize=25 vanished from the URL above: & is special to the shell, so URLs containing it should be quoted
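Without -c you land in an interactive console with the response already downloaded; a quick sketch of a typical session:
$ scrapy shell --nolog https://www.baidu.com
>>> response.status                       # the response object is prepared for you
200
>>> response.css('title::text').get()     # experiment with selectors interactively
>>> fetch('https://www.baidu.com/more/')  # fetch a different URL in the same session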
parse
Fetches the given URL and parses it with the spider that handles it; commonly used to test whether your parsing code actually works.
Command format
$ scrapy parse <url> [options]
# supported options
--spider=SPIDER: bypass spider autodetection and force use of specific spider
-a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid json string. Example: --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid json string. Example: --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don’t show scraped items
--nolinks: don’t show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file
Command example
$ scrapy parse https://blog.csdn.net/rank/list --spider=hotList
...
2021-07-25 02:27:06 [scrapy.core.engine] INFO: Spider closed (finished)
>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[]
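Empty Scraped Items and Requests here usually just mean the callback yielded nothing for this page. When debugging a particular callback, -c forces it explicitly (parse_item is a hypothetical callback name):
$ scrapy parse --spider=hotList -c parse_item https://blog.csdn.net/rank/list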
settings
Prints Scrapy settings values. Inside a project it shows the project's value; otherwise it shows Scrapy's built-in default.
Command format
$ scrapy settings [options]
Command example
$ scrapy settings --get BOT_NAME
csdnHot
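Any setting can be queried the same way; for instance, the default download delay (this example and its output come from the official docs):
$ scrapy settings --get DOWNLOAD_DELAY
0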