What Commands Does the New Scrapy Have? | Python Theme Month


This article is taking part in the "Python Theme Month" event; see the event link for details.

阿晨 is still new to Scrapy, so if anything here is wrong, I'd be grateful if the experts could point it out in the comments below. Many thanks!

As of this writing, the latest version of Scrapy is 2.5.0.

Alright, enough talk. Let's begin!

Command-line help

Virtually every command-line tool ships with built-in usage help; it is practically an industry convention, and Scrapy is no exception.

$ scrapy -h
Scrapy 2.5.0 - project: scrapybot

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

startproject

Creates a new Scrapy project and generates the standard Scrapy project skeleton automatically.

Command format

$ scrapy startproject <project_name> [project_dir] # project_dir is optional; if omitted, a folder with the same name as the project is created

Command example

$ scrapy startproject example
New Scrapy project 'example', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
    D:\WorkSpace\Personal\my-scrapy\example

You can start your first spider with:
    cd example
    scrapy genspider example example.com

$ ls
example  readme.md  venv

The generated project structure looks roughly like this:

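For reference, this is the skeleton startproject generates (as described in the Scrapy docs; the inline comments are mine):

example/
    scrapy.cfg            # deploy configuration file
    example/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py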

genspider

Generates a new spider from a predefined template. This is very useful and can greatly speed up spider development, provided you have a good set of predefined templates.

Command format

$ scrapy genspider [-t template_name] <spider_name> <domain> # the template name is optional; if omitted, the default template (basic) is used

Command example

$ cd example/example/spiders/
$ scrapy genspider exampleSpider example.spider.com
Created spider 'exampleSpider' using template 'basic' in module:
  example.spiders.exampleSpider
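To see which templates are available, pass -l:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed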

The generated spider looks like this:

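Based on the 'basic' template bundled with Scrapy 2.5, the generated file should look roughly like this (the class name is derived automatically from the spider name):

import scrapy


class ExamplespiderSpider(scrapy.Spider):
    name = 'exampleSpider'
    allowed_domains = ['example.spider.com']
    start_urls = ['http://example.spider.com/']

    def parse(self, response):
        pass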

crawl

Runs a spider. It may look similar to runspider, but crawl requires the spider to live inside a recognized Scrapy project structure.

Command format

$ scrapy crawl <spider_name> # the spider name we chose above when running genspider

Command example

$ scrapy crawl exampleSpider
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:50:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-25 01:50:59 [scrapy.crawler] INFO: Overridden settings:
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 51, 0, 206683)}
2021-07-25 01:51:02 [scrapy.core.engine] INFO: Spider closed (finished)
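crawl also accepts spider arguments and can export scraped items directly via the standard -a and -o flags; for example (the category argument and output file here are hypothetical):

$ scrapy crawl exampleSpider -a category=books -o items.json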

runspider

This command also runs a spider, but it can execute a standalone spider file that lives outside a project, i.e. a self-contained Spider.

Command format

$ scrapy runspider <spider_file.py>

Command example

$ scrapy runspider exampleSpider.py
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:54:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 54, 24, 908097)}
2021-07-25 01:54:31 [scrapy.core.engine] INFO: Spider closed (finished)
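For reference, a minimal self-contained spider that runspider can execute might look like this (quotes.toscrape.com is a public scraping sandbox; the file name is arbitrary):

# standalone_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote on the listing page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

Run it with scrapy runspider standalone_spider.py; no project is needed.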

bench

Benchmarking. Scrapy ships with a simple benchmarking suite: it spawns a local HTTP server and runs a built-in spider against it at maximum speed, following generated links without doing any real parsing.

The pages/min figures in the log give a rough upper bound on the crawl throughput your hardware and Scrapy installation can sustain, which makes a handy baseline when tuning a real spider.

Command example

$ scrapy bench
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 01:58:19 [scrapy.extensions.logstats] INFO: Crawled 90 pages (at 5400 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:20 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 6360 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:21 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 5340 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:22 [scrapy.extensions.logstats] INFO: Crawled 369 pages (at 5040 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:23 [scrapy.extensions.logstats] INFO: Crawled 433 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:24 [scrapy.extensions.logstats] INFO: Crawled 513 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:25 [scrapy.extensions.logstats] INFO: Crawled 593 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:26 [scrapy.extensions.logstats] INFO: Crawled 657 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:27 [scrapy.extensions.logstats] INFO: Crawled 721 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:28 [scrapy.extensions.logstats] INFO: Crawled 785 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 58, 18, 691354)}
2021-07-25 01:58:29 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

check

Runs contract checks on your spiders, somewhat like a static check: contracts are simple assertions declared in a callback's docstring, and check executes them as lightweight tests to catch spider mistakes early.

Command format

$ scrapy check [-l] <spider>

Command example

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
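For context, contracts live in a callback's docstring. A minimal sketch (the spider name, URL, and scraped field are hypothetical):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        """Contracts declared here are executed by `scrapy check`.

        @url http://quotes.toscrape.com/
        @returns items 1 25
        @returns requests 0 0
        @scrapes text
        """
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

check fetches the @url, runs the callback, and verifies the @returns and @scrapes assertions.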

list

Lists all spiders available in the current project.

Command format

$ scrapy list

Command example

$ scrapy list
hotList

edit

Opens a spider's source file in an editor, which is handy for quick tweaks. The editor is taken from the EDITOR environment variable or, if unset, Scrapy's EDITOR setting.

Command format

$ scrapy edit <spider>

Command example

$ scrapy edit hotList
'%s' is not recognized as an internal or external command,
operable program or batch file.
# 阿晨's machine apparently has no usable editor configured, hence the error
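A likely fix is to configure an editor explicitly, either through the EDITOR environment variable or Scrapy's EDITOR setting (Notepad here is just an example choice):

# settings.py
# EDITOR is a standard Scrapy setting; when unset, Scrapy falls back to
# the EDITOR environment variable
EDITOR = 'notepad'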

fetch

Fetches a URL using the Scrapy downloader.

Command format

$ scrapy fetch <url>
# Supported options
--spider=SPIDER: fetch the page using the given spider (useful for checking how that spider sees it)
--headers: print the response's HTTP headers instead of the body
--no-redirect: do not follow HTTP 3xx redirects

Command example

$ scrapy fetch https://www.baidu.com
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 02:16:17 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<html><!--STATUS OK--><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="Cache-control" content="no-cache" /><meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no"/><style type="text/css">body {margin: 0;text-align: center;font-size: 14px;font-family: Arial,Helvetica,LiHei Pro Medium;color: #262626;}form {position: relative;margin: 12px 15px 91px;height: 41px;}img {border: 0}.wordWrap{margin-right: 85px;}#word {background-color: #FFF;border: 1px solid #6E6E6E;color: #000;font-size: 18px;height: 27px;padding: 6px;width: 100%;-webkit-appearance: none;-webkit-border-radius: 0;border-radius: 0;}.bn {background-color: #F5F5F5;border: 1px solid #787878;font-size: 16px;
# the HTML of the Baidu home page is printed (truncated here)

view

Downloads a page with Scrapy and opens it in the browser, so you can see the page exactly as Scrapy's downloader received it.

The official docs point out that a spider sometimes sees a different page than a regular user does (for example, when content is rendered by JavaScript), so view is a quick way to check whether a page can be scraped as expected.

Command format

$ scrapy view <url>
# Supported options
--spider=SPIDER: fetch the page using the given spider
--no-redirect: do not follow HTTP 3xx redirects

Command example

$ scrapy view https://www.baidu.com
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 02:20:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
# ...

Scrapy saved the page to a local file and opened it, and it really does look quite different from the Baidu we normally see!


shell

Another command you will use constantly while developing spiders. The difference from fetch is that shell drops you into an interactive console (or, with -c, evaluates a snippet of code) with the response already loaded, so you can experiment with parsing the page.

Command format

$ scrapy shell [url]

Command example

$ scrapy shell --nolog -c '(response.status, response.url)' https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25
(200, 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0')
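Note that in the example above the unquoted & was interpreted by the shell, which is why pageSize=25 was silently dropped from the request (the printed URL ends at page=0). Quote the URL to keep every query parameter:

$ scrapy shell --nolog -c '(response.status, response.url)' 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25'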

parse

Fetches the given URL and parses it with the spider that handles it, printing the results; commonly used to verify that your parsing code works.

Command format

$ scrapy parse <url> [options]
# Supported options
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta passed to the callback request; must be a valid JSON string, e.g. --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments passed to the callback; must be a valid JSON string, e.g. --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file

Command example

$ scrapy parse https://blog.csdn.net/rank/list --spider=hotList
...
2021-07-25 02:27:06 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items  ------------------------------------------------------------
[]

# Requests  -----------------------------------------------------------------
[]
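Combining a few of these options, a test run against the same spider might look like this (the flags are the standard ones listed above; the URL is the one from the example):

$ scrapy parse --spider=hotList -c parse -d 2 -v https://blog.csdn.net/rank/list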

settings

Prints the values of Scrapy settings.

Command format

$ scrapy settings [options]

Command example

$ scrapy settings --get BOT_NAME
csdnHot
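Assuming the project doesn't override it, numeric settings can be read the same way; DOWNLOAD_DELAY defaults to 0:

$ scrapy settings --get DOWNLOAD_DELAY
0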
