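While parsing a response object, the extracted data may itself be a new, reachable URL. How do you send a request to that extracted URL and retrieve its data?
Below I walk through an example: scraping ranking data from Qingting FM and saving each cover image by requesting its image URL.
Requirement: extract a URL from the response object, send a request to that URL as well, and then extract its data.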
Steps:
- Create the Qingting FM project: scrapy startproject fm
- Change into the fm folder and create the qingting spider: scrapy genspider qingting m.qingting.fm/rank/
- Edit spiders/qingting.py
First, open qingting.py in the project's spiders folder and enter the following code:
import os
import sys

import scrapy
from scrapy import cmdline
from scrapy.http import HtmlResponse

# Add the parent of the spiders directory to sys.path so this file can be run directly.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


class QingtingSpider(scrapy.Spider):
    name = 'qingting'
    allowed_domains = ['qingting.fm', 'pic.qtfm.cn']
    start_urls = ['https://m.qingting.fm/rank/']

    def parse(self, response: HtmlResponse, **kwargs):
        # Each <a> under the rank list is one ranked program.
        a_list = response.xpath("//div[@class='rank-list']/a")
        for a_temp in a_list:
            rank_number = a_temp.xpath("./div[@class='badge']//text()").extract_first()
            img_url = a_temp.xpath("./img/@src").extract_first()
            title = a_temp.xpath("./div[@class='content']/div[@class='title']/text()").extract_first()
            desc = a_temp.xpath("./div[@class='content']/div[@class='desc']/text()").extract_first()
            play_number = a_temp.xpath(".//div[@class='info-item'][1]/span/text()").extract_first()

            # Protocol-relative URLs ("//host/path") need a scheme before they can be requested.
            if img_url and img_url.startswith('//'):
                img_url = 'https:' + img_url

            # Submit the text information to the pipeline.
            yield {
                'type': 'info',
                'rank_number': rank_number,
                'img_url': img_url,
                'title': title,
                'desc': desc,
                'play_number': play_number
            }

            # Request the image; the title is passed on to the callback as the file name.
            if img_url and title:
                yield scrapy.Request(
                    img_url,
                    callback=self.parse_image,
                    cb_kwargs={"image_name": title}
                )

    # Handle the image response: pass the raw bytes on to the pipeline.
    def parse_image(self, response, image_name):
        yield {
            'type': 'image',
            "image_name": image_name + ".png",
            "image_content": response.body
        }


if __name__ == '__main__':
    cmdline.execute('scrapy crawl qingting'.split())
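The spider handles protocol-relative image URLs by prepending 'https:' by hand. As a more general sketch, the standard library's urljoin resolves protocol-relative, root-relative, and already absolute URLs against the page URL. The helper name absolutize and the sample paths below are hypothetical and only for illustration. Inside a Scrapy callback, response.urljoin(url) performs the same resolution with the response's own URL as the base.

```python
from urllib.parse import urljoin


def absolutize(page_url, extracted_url):
    """Resolve an extracted, possibly relative URL against the page URL.

    Hypothetical helper for illustration; not part of the original spider.
    """
    if not extracted_url:
        return None
    return urljoin(page_url, extracted_url.strip())


page = "https://m.qingting.fm/rank/"
# The paths below are made-up examples, not real Qingting FM resources.
print(absolutize(page, "//pic.qtfm.cn/cover.jpg"))    # protocol-relative
print(absolutize(page, "/channels/123"))              # root-relative
print(absolutize(page, "https://pic.qtfm.cn/a.png"))  # already absolute
```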
Next, enter the following code in pipelines.py:
import os

import pymongo


class FmPipeline:
    def open_spider(self, spider):
        # Open one MongoDB connection for the whole crawl instead of one per item.
        self.mongo_client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.mongo_client['py_spider']['qingtingFM']

    def close_spider(self, spider):
        self.mongo_client.close()

    def process_item(self, item, spider):
        type_ = item.get('type')

        # Save the image.
        if type_ == 'image':
            # Always save into the download folder under the current project.
            base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
            download_path = os.path.join(base_dir, "download")
            os.makedirs(download_path, exist_ok=True)

            image_name = item.get("image_name")
            image_content = item.get("image_content")
            if image_name and image_content:
                img_path = os.path.join(download_path, image_name)
                with open(img_path, "wb") as f:
                    f.write(image_content)
                print("Image saved:", image_name)

        # Save the text data to MongoDB.
        elif type_ == 'info':
            self.collection.insert_one(item)
            print("Data inserted:", item.get('title'))

        return item
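The pipeline uses the program title directly as the file name, so a title containing a character such as "/" or "?" could make open() fail or write to an unexpected path. A minimal sketch of one way to guard against that follows. The helper safe_filename, its replacement rules, and the 100-character limit are assumptions made for illustration, not part of the original code.

```python
import re


def safe_filename(title, extension=".png", max_length=100):
    """Turn a title into a file name usable on common file systems.

    Hypothetical helper; the replacement rules are an assumption.
    """
    # Replace path separators, control whitespace, and characters Windows forbids.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    # Truncate long titles and fall back to a placeholder if nothing is left.
    cleaned = cleaned[:max_length] or "untitled"
    return cleaned + extension


print(safe_filename("A/B: C?"))
print(safe_filename("  ...  "))
```

In the spider, this could be applied where the image item is built, for example by passing safe_filename(image_name) instead of image_name + ".png". Two titles that clean to the same name would still overwrite each other.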
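Next, make the following changes in settings.py:
1. Set ROBOTSTXT_OBEY to False.
2. Uncomment DEFAULT_REQUEST_HEADERS and add a User-Agent header that mimics a browser.
3. Uncomment ITEM_PIPELINES and point it at the pipeline class (replace your_project_name with your project's name, which is fm in this example):
ITEM_PIPELINES = { 'your_project_name.pipelines.FmPipeline': 300, }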
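Put together, the three settings might look like the excerpt below. This is a sketch: the User-Agent string and the other header values are assumed examples, and the module path assumes the project was created with scrapy startproject fm.

```python
# settings.py (excerpt); adapt the values marked as assumptions

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Headers sent with every request; the User-Agent value is an assumed example.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}

# Enable the pipeline; "fm" assumes the project name used in this example.
ITEM_PIPELINES = {
    "fm.pipelines.FmPipeline": 300,
}
```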
Finally, run qingting.py. A download folder will appear under the fm folder, and the images are saved inside it.