Install the Scrapy-related libraries:
# Reference installation order:
zope.interface
pyopenssl
twisted
lxml
scrapy
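With a current pip, installing Scrapy alone normally pulls in zope.interface, pyOpenSSL, Twisted and lxml as dependencies, so a single command is usually enough; installing them one by one in the order above is a fallback for when a dependency fails to build on Windows:
pip install scrapy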
Check the Scrapy installation:
(venv) PS G:\Python_pj\Scrapy_vevn_04> scrapy -h
Scrapy 2.8.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
Create a Scrapy project:
(venv) PS G:\Python_pj\Scrapy_vevn_04> scrapy startproject ADtest
New Scrapy project 'ADtest', using template directory 'G:\Python_pj\Scrapy_vevn_04\venv\Lib\site-packages\scrapy\templates\project', created in:
G:\Python_pj\Scrapy_vevn_04\ADtest
You can start your first spider with:
cd ADtest
scrapy genspider example example.com
(venv) PS G:\Python_pj\Scrapy_vevn_04>
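For reference, the generated project roughly follows the standard Scrapy project layout:
ADtest/
    scrapy.cfg            # deploy configuration, used later by scrapyd-deploy
    ADtest/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py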
Create Scrapy spiders:
- Quick creation
scrapy genspider baidu baidu.com
- Create from a specified template
scrapy genspider -t basic tencent tencent.com
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapy genspider baidu baidu.com
Created spider 'baidu' using template 'basic' in module:
ADtest.spiders.baidu
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapy genspider -t basic tencent tencent.com
Created spider 'tencent' using template 'basic' in module:
ADtest.spiders.tencent
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapy list
baidu
tencent
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest>
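The basic template only produces a skeleton; ADtest/spiders/baidu.py should look roughly like this (the exact quoting and URL scheme vary slightly between Scrapy versions):
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://baidu.com/"]

    def parse(self, response):
        # the generated skeleton leaves the parsing logic to you
        pass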
Test the Scrapy spider:
- Run the spider
scrapy crawl baidu
2023-03-27 16:18:23 [scrapy.core.engine] INFO: Spider opened
2023-03-27 16:18:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-27 16:18:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-03-27 16:18:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://baidu.com/robots.txt> (referer: None)
2023-03-27 16:18:23 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://baidu.com/>
2023-03-27 16:18:23 [scrapy.core.engine] INFO: Closing spider (finished)
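The request to http://baidu.com/ is refused because the project template sets ROBOTSTXT_OBEY = True by default. For a quick test you can relax this in ADtest/ADtest/settings.py (a sketch; the User-Agent string is just an example):
# ADtest/ADtest/settings.py (excerpt)
ROBOTSTXT_OBEY = False  # stop filtering requests through robots.txt
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # optional browser-like UA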
Install the scrapy-redis library:
- Install the package with pip
pip install scrapy-redis
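scrapy-redis adds a Redis-backed scheduler and duplicate filter for distributed crawling. A minimal settings.py sketch, assuming a local Redis instance on the default port:
# ADtest/ADtest/settings.py (excerpt) - minimal scrapy-redis wiring (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"            # shared scheduler in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared request dedup
SCHEDULER_PERSIST = True                                   # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"                       # assumed local Redis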
scrapyd deployment:
- Server-side deployment environment
pip install scrapyd
- Client-side deployment tool
pip install scrapyd-client
Start the service (run it on the server side and keep it running):
- dbs folder
Created on the server side when the service starts (it stores the project databases)
(venv) PS G:\Python_pj\Scrapy_vevn_04> scrapyd
2023-03-27T16:36:06+0800 [-] Loading G:\Python_pj\Scrapy_vevn_04\venv\lib\site-packages\scrapyd\txapp.py...
2023-03-27T16:36:06+0800 [-] Basic authentication disabled as either `username` or `password` is unset
2023-03-27T16:36:06+0800 [-] Scrapyd web console available at http://127.0.0.1:6800/
2023-03-27T16:36:06+0800 [-] Loaded.
2023-03-27T16:36:06+0800 [twisted.application.app.AppLogger#info] twistd 22.10.0 (G:\Python_pj\Scrapy_vevn_04\venv\Scripts\python.exe 3.10.5) starting up.
2023-03-27T16:36:06+0800 [twisted.application.app.AppLogger#info] reactor class: twisted.internet.selectreactor.SelectReactor.
2023-03-27T16:36:06+0800 [-] Site starting on 6800
2023-03-27T16:36:06+0800 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x000001B5F3BE3970>
2023-03-27T16:36:06+0800 [Launcher] Scrapyd 1.4.1 started: max_proc=80, runner='scrapyd.runner'
Client-side deployment
- Check that the client tool works
scrapyd-deploy -h
(venv) PS G:\Python_pj\Scrapy_vevn_04> scrapyd-deploy -h
usage: scrapyd-deploy [-h] [-p PROJECT] [-v VERSION] [-l] [-a] [-d] [-L TARGET] [--egg FILE] [--build-egg FILE] [--include-dependencies] [TARGET]
Deploy Scrapy project to Scrapyd server
positional arguments:
TARGET
options:
-h, --help show this help message and exit
-p PROJECT, --project PROJECT
the project name in the TARGET
-v VERSION, --version VERSION
the version to deploy. Defaults to current timestamp
-l, --list-targets list available targets
-a, --deploy-all-targets
deploy all targets
-d, --debug debug mode (do not remove build dir)
-L TARGET, --list-projects TARGET
list available projects in the TARGET
--egg FILE use the given egg, instead of building it
--build-egg FILE only build the egg, don't deploy it
--include-dependencies
include dependencies from requirements.txt in the egg
(venv) PS G:\Python_pj\Scrapy_vevn_04>
- Deploy command
scrapyd-deploy
# fails because the url in scrapy.cfg is still commented out (fix shown after the error below)
(venv) PS G:\Python_pj\Scrapy_vevn_04> cd ADtest
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapyd-deploy
Unknown target: default
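"Unknown target: default" means the url line under [deploy] in ADtest/scrapy.cfg is still commented out. Uncommenting it points the default target at the local scrapyd; the file then looks roughly like this:
# ADtest/scrapy.cfg (sketch)
[settings]
default = ADtest.settings

[deploy]
url = http://localhost:6800/
project = ADtest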
- With the url enabled, packaging and deployment succeed
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapyd-deploy
Packing version 1679907159
Deploying to project "ADtest" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "PS2022ZYSKWXTZ", "status": "ok", "project": "ADtest", "version": "1679907159", "spiders": 2}
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest>
Run the deployment test again
- Rename the project (edit the project key in scrapy.cfg, see the excerpt after the output below), redeploy, and check the result
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest> scrapyd-deploy
Packing version 1679907735
Deploying to project "ADtest_001" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "PS2022ZYSKWXTZ", "status": "ok", "project": "ADtest_001", "version": "1679907735", "spiders": 2}
(venv) PS G:\Python_pj\Scrapy_vevn_04\ADtest>
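The rename is just an edit to the project key in the same [deploy] section, e.g.:
# ADtest/scrapy.cfg (excerpt)
[deploy]
url = http://localhost:6800/
project = ADtest_001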
Additional notes
- Packaging on the client side requires this extra library:
pip install pywin32
- After the client finishes packaging, some new files/folders appear next to the project; the source code folder itself is not affected
- After a deployment completes, the server side updates the contents of its dbs and eggs folders
Running multiple scrapyd instances on one server
- Create several folders and put a copy of the configuration file in each one
Note: copy the file out; do not modify the original
- Change the port number in each copied configuration file
Default port: 6800
- Start a service from each folder path to test, e.g.:
scrapyd -d G:/Python_pj/Scrapy_vevn_03/mingsen
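A sketch of the relevant part of each copied config (scrapyd also reads a scrapyd.conf in its working directory; the port value here is just an example for a second instance):
# scrapyd.conf (excerpt)
[scrapyd]
bind_address = 127.0.0.1
http_port    = 6801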
Gerapy, a visual management tool
1. Switch between Chinese and English
2. Configured through configuration files
3. Supports dynamic configuration: hosts, services, and spiders can be configured after the service has started
4. Clean, friendly interface
5. Based on Django
6. Online editing of code files
7. All command-line operations are wrapped
8. Multi-step startup procedure:
a. gerapy init
b. gerapy migrate
c. gerapy runserver
- Install Gerapy
pip install gerapy
- Prerequisite: scrapyd must be running properly
(venv) PS G:\Python_pj\Scrapy_vevn_03> scrapyd
2023-03-28T11:51:55+0800 [-] Loading G:\Python_pj\Scrapy_vevn_03\venv\lib\site-packages\scrapyd\txapp.py...
2023-03-28T11:51:56+0800 [-] Basic authentication disabled as either `username` or `password` is unset
2023-03-28T11:51:56+0800 [-] Scrapyd web console available at http://127.0.0.1:6800/
2023-03-28T11:51:56+0800 [-] Loaded.
2023-03-28T11:51:56+0800 [twisted.application.app.AppLogger#info] twistd 22.10.0 (G:\Python_pj\Scrapy_vevn_03\venv\Scripts\python.exe 3.10.5) starting up.
2023-03-28T11:51:56+0800 [twisted.application.app.AppLogger#info] reactor class: twisted.internet.selectreactor.SelectReactor.
2023-03-28T11:51:56+0800 [-] Site starting on 6800
2023-03-28T11:51:56+0800 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x000002084DD4F100>
2023-03-28T11:51:56+0800 [Launcher] Scrapyd 1.4.1 started: max_proc=80, runner='scrapyd.runner'
Initialize: gerapy init
(venv) PS G:\Python_pj\Scrapy_vevn_03> gerapy init
Initialized workspace gerapy
(venv) PS G:\Python_pj\Scrapy_vevn_03>
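The init command creates a gerapy workspace folder; roughly, Scrapy projects you want to manage go under its projects subdirectory so they show up in the web UI:
gerapy/
    projects/    # place Scrapy projects here to manage and deploy them from the UI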
Migrate: gerapy migrate
# cd into the gerapy workspace and run the migration, which writes all of the models into the database file
(venv) PS G:\Python_pj\Scrapy_vevn_03> cd .\gerapy\
(venv) PS G:\Python_pj\Scrapy_vevn_03\gerapy> gerapy migrate
Start the service: gerapy runserver
(venv) PS G:\Python_pj\Scrapy_vevn_03\gerapy> gerapy runserver
Watching for file changes with StatReloader
Performing system checks...
INFO - 2023-03-28 12:01:46,329 - process: 14800 - scheduler.py - gerapy.server.core.scheduler - 105 - scheduler - successfully synced task with jobs with force
System check identified no issues (0 silenced).
March 28, 2023 - 12:01:46
Django version 2.2.28, using settings 'gerapy.server.server.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.
# Stop the service: Ctrl+C in the terminal
Create a username and password: gerapy createsuperuser
(venv) PS G:\Python_pj\Scrapy_vevn_03\gerapy> gerapy createsuperuser
Username (leave blank to use 'admin'): admin
Email address:
Password:
Password (again):
This password is too short. It must contain at least 8 characters.
This password is too common.
This password is entirely numeric.
Bypass password validation and create user anyway? [y/N]: y
Superuser created successfully.
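With the superuser created, start the service again and sign in at the web console with that account to configure hosts (the scrapyd instances), projects, and tasks from the browser:
gerapy runserver
# then open http://127.0.0.1:8000/ and log in with the account created above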