Saving multiple kinds of data with the Scrapy framework


While parsing a response object, some of the extracted data may itself be a new, requestable URL. How do we send a request to such an extracted URL and fetch its data as well?

Below I walk through a case study: scraping the ranking data from QingTing FM and saving the cover images by requesting their image URLs.

Requirement: extract URLs from the response object, send requests to those URLs too, and extract their data.
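In Scrapy this is done by yielding a new `scrapy.Request` from `parse` and pointing its `callback` at another parsing method. As a rough mental model of how the engine processes what the callbacks yield (plain Python, no Scrapy; the in-memory "pages" below are made up for illustration), it looks like this:

```python
# Toy model of Scrapy's request chaining: callbacks yield either data
# items or new "requests", and the engine keeps scheduling requests
# until the queue is empty. Fake pages stand in for HTTP responses.
PAGES = {
    "/rank": ["/detail/1", "/detail/2"],   # listing page yields detail URLs
    "/detail/1": "data-1",                 # detail pages yield final data
    "/detail/2": "data-2",
}

def parse_rank(url):
    for detail_url in PAGES[url]:
        # Yield a follow-up "request": a (url, callback) pair
        yield ("request", detail_url, parse_detail)

def parse_detail(url):
    yield ("item", PAGES[url], None)

def crawl(start_url, start_callback):
    items = []
    queue = [(start_url, start_callback)]
    while queue:
        url, callback = queue.pop(0)
        for kind, payload, cb in callback(url):
            if kind == "request":
                queue.append((payload, cb))   # schedule the new URL
            else:
                items.append(payload)         # collect extracted data
    return items

print(crawl("/rank", parse_rank))  # ['data-1', 'data-2']
```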

Steps:

  1. Create the QingTing FM project: `scrapy startproject fm`

  2. Go into the fm folder and generate the qingting spider: `scrapy genspider qingting m.qingting.fm/rank/`

  3. Edit spiders/qingting.py

First, open the generated spiders/qingting.py and enter the following code:

```python
import os
import sys

import scrapy
from scrapy import cmdline
from scrapy.http import HtmlResponse

# Make the project package importable when running this file directly
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


class QingtingSpider(scrapy.Spider):
    name = 'qingting'
    allowed_domains = ['qingting.fm', 'pic.qtfm.cn']
    start_urls = ['https://m.qingting.fm/rank/']

    def parse(self, response: HtmlResponse, **kwargs):
        a_list = response.xpath("//div[@class='rank-list']/a")

        for a_temp in a_list:
            rank_number = a_temp.xpath("./div[@class='badge']//text()").extract_first()
            img_url = a_temp.xpath("./img/@src").extract_first()
            title = a_temp.xpath("./div[@class='content']/div[@class='title']/text()").extract_first()
            desc = a_temp.xpath("./div[@class='content']/div[@class='desc']/text()").extract_first()
            play_number = a_temp.xpath(".//div[@class='info-item'][1]/span/text()").extract_first()

            # Protocol-relative src values ("//pic.qtfm.cn/...") need a scheme
            if img_url and img_url.startswith('//'):
                img_url = 'https:' + img_url

            # Yield the parsed info item
            yield {
                'type': 'info',
                'rank_number': rank_number,
                'img_url': img_url,
                'title': title,
                'desc': desc,
                'play_number': play_number
            }

            # Request the image; pass the title to the callback via cb_kwargs
            if img_url and title:
                yield scrapy.Request(
                    img_url,
                    callback=self.parse_image,
                    cb_kwargs={"image_name": title}
                )

    # Parse the image response: yield the raw bytes as an item
    def parse_image(self, response, image_name):
        yield {
            'type': 'image',
            "image_name": image_name + ".png",
            "image_content": response.body
        }


if __name__ == '__main__':
    cmdline.execute('scrapy crawl qingting'.split())
```
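Manually prefixing `https:` works for scheme-relative `src` values; equivalently, the standard library's `urljoin` (which `response.urljoin` wraps in Scrapy) resolves them against the page URL, and handles path-relative links too:

```python
from urllib.parse import urljoin

base = "https://m.qingting.fm/rank/"

# Scheme-relative URL: inherits https from the base page
print(urljoin(base, "//pic.qtfm.cn/cover.png"))
# https://pic.qtfm.cn/cover.png

# Path-relative URL: resolved against the base path as well
print(urljoin(base, "../channels/1"))
# https://m.qingting.fm/channels/1
```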

Next, enter the following code in pipelines.py:

```python
import os

import pymongo


class FmPipeline:

    def open_spider(self, spider):
        # One MongoDB connection for the whole crawl, not one per item
        self.mongo_client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.mongo_client['py_spider']['qingtingFM']

    def close_spider(self, spider):
        self.mongo_client.close()

    def process_item(self, item, spider):
        type_ = item.get('type')

        # Save image bytes into the download folder under the project root
        if type_ == 'image':
            base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
            download_path = os.path.join(base_dir, "download")
            os.makedirs(download_path, exist_ok=True)

            image_name = item.get("image_name")
            image_content = item.get("image_content")

            if image_name and image_content:
                img_path = os.path.join(download_path, image_name)
                with open(img_path, "wb") as f:
                    f.write(image_content)
                print("Image saved:", image_name)

        # Save info items to MongoDB
        elif type_ == 'info':
            # Insert a copy so MongoDB's _id is not added to the item itself
            self.collection.insert_one(dict(item))
            print("Record inserted:", item.get('title'))

        return item
```
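One caveat with using the show title as the filename: titles can contain characters that are illegal in file names (`/ \ : * ? " < > |`), which would make the `open()` call fail. A small sanitizing helper (my own addition, not part of the original pipeline) avoids this:

```python
import re

def safe_filename(name: str, replacement: str = "_") -> str:
    """Replace characters that are illegal in Windows/Unix file names.

    Hypothetical helper: apply it to image_name before building img_path.
    """
    return re.sub(r'[\\/:*?"<>|]', replacement, name).strip()

print(safe_filename('late/night: "talk"?'))  # late_night_ _talk__
```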

Then, in the project's settings.py:

1. Set ROBOTSTXT_OBEY to False.

2. Uncomment DEFAULT_REQUEST_HEADERS and add a browser User-Agent.

3. Uncomment ITEM_PIPELINES:

```python
ITEM_PIPELINES = {
    'fm.pipelines.FmPipeline': 300,
}
```
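For the DEFAULT_REQUEST_HEADERS step, a typical block might look like the following (the User-Agent string is only an example; substitute whatever your own browser sends):

```python
# settings.py -- example headers; the UA string below is illustrative
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
}
```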

Finally, run spiders/qingting.py (its `__main__` block executes `scrapy crawl qingting`). A download folder will appear under the fm project directory containing the saved images, and the parsed info records will be in MongoDB.