Reading API Data with a Scrapy Crawler and Saving It to MySQL


1. Generating the Scrapy Code

  • Install the dependency
pip install scrapy
  • Create a project
scrapy startproject scrapy_demo
  • Generate a spider
scrapy genspider geega www.xxx.com
  • Directory structure
scrapy_demo
├── scrapy_demo
│   ├── items.py       # data model definitions
│   ├── middlewares.py # middleware file; all middlewares are configured here
│   ├── pipelines.py   # pipeline file; custom pipeline logic lives here, e.g. the logic for saving to a database
│   ├── settings.py    # project settings file; custom external configuration goes here
│   └── spiders        # Spider package; the parsing code we write lives here
└── scrapy.cfg        # deployment configuration for the project

1.1 Scrapy Components

  • Engine (Scrapy Engine): handles the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
  • Scheduler: accepts Requests sent over by the engine, organizes and enqueues them in a defined order, and hands them back to the engine when the engine asks for them.
  • Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them to the Spider for processing.
  • Spiders: analyze all Responses, extract the data needed for the Item fields, and submit any follow-up URLs to the engine, which feeds them back into the Scheduler.
  • Item Pipeline: where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
  • Downloader middlewares: components for custom extension of the download functionality.
  • Spider middlewares: components for custom extension and handling of the communication between the engine and the Spider.
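
The interplay of these components can be pictured with a tiny, self-contained toy loop. This is purely an illustration of the data flow, not Scrapy's actual code: a deque stands in for the Scheduler, a stub function for the Downloader, and a generator callback for the Spider.

```python
from collections import deque

def downloader(request):                 # stand-in for a real HTTP fetch
    return {"url": request, "body": f"page for {request}"}

def spider_parse(response):              # stand-in spider callback
    yield {"item": response["body"]}     # an extracted item
    if response["url"] == "page-1":
        yield "page-2"                   # a follow-up request (a URL string)

scheduler = deque(["page-1"])            # the start request
items = []
while scheduler:
    request = scheduler.popleft()        # engine asks the scheduler for the next request
    response = downloader(request)       # downloader fetches it
    for result in spider_parse(response):
        if isinstance(result, dict):
            items.append(result)         # items go on to the item pipeline
        else:
            scheduler.append(result)     # new requests go back to the scheduler

print(items)
```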

2. Scrapy Code Implementation

2.1 Defining the Scraped Fields in items.py

Define the fields to scrape, along with any other fields you need, in the items.py file.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyGeegaItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    content = scrapy.Field()
    file_id = scrapy.Field()
    sort = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()

2.2 Writing the Scraping Logic in the Spider File

import json
import random
import time
from datetime import datetime

import scrapy

from scrapy_geega import items


class GeegaSpider(scrapy.Spider):
    name = 'geega'
    allowed_domains = ['www.xxx.com']
    start_url = 'https://www.xxx.com/basic-proxy/rest/posts/list'
    idle_time = random.randint(5, 10)
    max_page = 10
    page_size = 20

    def start_requests(self):
        time.sleep(self.idle_time)
        # range is end-exclusive, so go to max_page + 1 to request every page
        for i in range(1, self.max_page + 1):
            data = {
                'pageNo': i,
                'pageSize': self.page_size,
                'filter': {
                    'category': 'ARTICLE',
                    'tagIds': []
                }
            }
            yield scrapy.Request(url=self.start_url,
                                 method="POST",
                                 body=json.dumps(data),
                                 headers={'Content-Type': 'application/json'},
                                 callback=self.parse,
                                 dont_filter=True)

    def parse(self, response):
        res = json.loads(response.body)
        if not res['success']:
            print("Request failed:", response.url)
            return
        try:
            item = items.ScrapyGeegaItem()
            data = res['data']['list']
            for index in range(len(data)):
                print('title', index + 1, data[index]['title'])
                item['id'] = data[index]['id']
                item['title'] = data[index]['title']
                item['summary'] = data[index]['introduction']
                item['content'] = data[index]['introduction']
                item['file_id'] = data[index]['authorAvatar']
                item['sort'] = data[index]['id']
                item['create_time'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                item['update_time'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                yield item
        except Exception as e:
            print("Error: %s" % e)
            return
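
Because parse() only operates on the JSON body, the field mapping can be sanity-checked offline without running Scrapy at all. The sketch below feeds a hand-built response body through the same mapping, using a plain dict in place of ScrapyGeegaItem; the field names (introduction, authorAvatar, ...) are the ones the spider reads, while the sample values are invented:

```python
import json
from datetime import datetime

# Hand-built response body mimicking the shape parse() expects
body = json.dumps({
    "success": True,
    "data": {
        "total": 1,
        "list": [{
            "id": 101,
            "title": "Hello Scrapy",
            "introduction": "A short summary",
            "authorAvatar": "avatar-101.png",
        }],
    },
})

res = json.loads(body)
rows = []
if res["success"]:
    for entry in res["data"]["list"]:
        rows.append({
            "id": entry["id"],
            "title": entry["title"],
            "summary": entry["introduction"],   # same mapping as the spider
            "content": entry["introduction"],
            "file_id": entry["authorAvatar"],
            "sort": entry["id"],
            "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "update_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        })

print(rows[0]["title"])
```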

2.3 Starting the Spider from the Command Line

scrapy crawl <spider>
  • Here <spider> is the value of the name attribute in our spider file; within a Scrapy project we can view the available spiders with the scrapy list command
D:\yangzhen\spider\scrapy_geega>scrapy list
geega

  • Start the crawl:
scrapy crawl geega
  • Save the crawl results in JSON format:
scrapy crawl geega -o result.json

2.4 Starting the Spider from Code

Create a new main.py file in the project root:

from scrapy.cmdline import execute
import os
import sys

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'geega'])

2.5 Saving Data to MySQL

2.5.1 pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import copy

import pymysql
from twisted.enterprise import adbapi


class ScrapyGeegaPipeline:

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        """
        函数名固定,会被scrapy调用,直接可用settings的值
        数据库建立连接
        :param settings:
        :return:
        """
        db_params = dict(
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            cursorclass=pymysql.cursors.DictCursor
            # Cursor type: to get records back as dicts, set the cursor
            # parameter to pymysql.cursors.DictCursor.
        )
        # Twisted performs the inserts asynchronously: connect through a
        # ConnectionPool, backed by pymysql (or MySQLdb)
        dbpool = adbapi.ConnectionPool('pymysql', **db_params)
        # Return an instance built with the connection pool
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to make the MySQL insert asynchronous: runInteraction
        # runs the SQL through the connection pool and returns a Deferred.
        # The item is deep-copied because the spider reuses one Item instance.
        async_item = copy.deepcopy(item)
        query = self.dbpool.runInteraction(self.do_insert, async_item)
        # Attach error handling
        query.addErrback(self.handle_error)
        return item

    def do_insert(self, cursor, item):
        # Insert into the database; no explicit commit is needed, twisted commits automatically
        insert_sql = """
         INSERT INTO blog_article (id,title,summary,content,tag_ids,click_count,collect_count,file_id,admin_id,
         is_original,user_id,citation,is_publish,sort_id,sort,status,create_time,update_time)
         VALUES (%s, %s, %s, %s, NULL, 0, 0, NULL , NULL, 1, NULL, NULL, 1, NULL, %s, 1, %s, %s)
        """
        _id = item['id']
        title = item['title']
        summary = item['summary']
        content = item['content']
        sort = item['sort']
        create_time = item['create_time']
        update_time = item['update_time']
        try:
            cursor.execute(insert_sql, (_id, title, summary, content, sort, create_time, update_time))
            print('OK', item['id'])
        except Exception as e:
            print("Insert failed: %s" % e)

    def handle_error(self, failure):
        # Print the failure raised during the asynchronous insert
        print(failure)
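
The INSERT above targets a MySQL table named blog_article whose full schema the article does not show. As a quick offline check that the column list and the placeholders line up, the same statement can be rehearsed against an in-memory SQLite table; the schema below is a guessed minimal one covering only the columns named in the INSERT, and MySQL's %s placeholders become SQLite's ?:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
# Guessed minimal schema: just the columns the INSERT statement names
conn.execute("""
    CREATE TABLE blog_article (
        id INTEGER, title TEXT, summary TEXT, content TEXT,
        tag_ids TEXT, click_count INTEGER, collect_count INTEGER,
        file_id TEXT, admin_id INTEGER, is_original INTEGER,
        user_id INTEGER, citation TEXT, is_publish INTEGER,
        sort_id INTEGER, sort INTEGER, status INTEGER,
        create_time TEXT, update_time TEXT
    )
""")

# Same column list and constants as the pipeline; ? instead of %s
insert_sql = """
    INSERT INTO blog_article (id, title, summary, content, tag_ids,
        click_count, collect_count, file_id, admin_id, is_original,
        user_id, citation, is_publish, sort_id, sort, status,
        create_time, update_time)
    VALUES (?, ?, ?, ?, NULL, 0, 0, NULL, NULL, 1, NULL, NULL, 1, NULL,
        ?, 1, ?, ?)
"""
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
conn.execute(insert_sql, (101, "Hello Scrapy", "summary", "content", 101, now, now))
row = conn.execute("SELECT id, title, status FROM blog_article").fetchone()
print(row)
```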

2.5.2 Enabling the Pipeline

To make your own pipeline take effect, register it in the settings.py file:

# Scrapy settings for scrapy_geega project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_geega'

SPIDER_MODULES = ['scrapy_geega.spiders']
NEWSPIDER_MODULE = 'scrapy_geega.spiders'

MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DBNAME = "mu_db"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_geega (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_geega.middlewares.ScrapyGeegaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_geega.middlewares.ScrapyGeegaDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable our own pipeline
ITEM_PIPELINES = {
   'scrapy_geega.pipelines.ScrapyGeegaPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'