Teaching You to Scrape Girl Pics: Capturing the "Goddess" (Part 3)
Preface
In the previous part we crawled the images and the albums' basic info; this time we build on that and crawl the girls' (the models') basic info as well.
Coding
In the previous part the spider's start URLs were set through the start_urls attribute; this time we'll switch to a more flexible way of setting both the start URLs and their callbacks.
Handling the Start URLs
Opening up the framework's source, we find this snippet:
async def process_start_urls(self):
    """
    Handle the start URLs
    :return: an async iterator of requests
    """
    for url in self.start_urls:
        yield await self.get(url=url, callback=self.parse, metadata=self.metadata)
As you can see, internally the framework fires off the first batch of requests through the process_start_urls method. This is the method we'll override, so that our spider can crawl both post info and girl info at the same time.
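To make the mechanism concrete, here is a toy model of what the engine presumably does with that async generator. This is NOT hssp's real internals; ToySpider and engine are made up purely for illustration:

import asyncio

class ToySpider:
    start_urls = ["https://example.com/a", "https://example.com/b"]

    async def get(self, url, callback=None):
        # stand-in for hssp's self.get: just describe the would-be request
        return (url, getattr(callback, "__name__", None))

    async def process_start_urls(self):
        for url in self.start_urls:
            yield await self.get(url=url, callback=None)

async def engine(spider):
    # the engine iterates the async generator and schedules each yielded request
    async for task in spider.process_start_urls():
        print("scheduling:", task)

asyncio.run(engine(ToySpider()))
# scheduling: ('https://example.com/a', None)
# scheduling: ('https://example.com/b', None)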
We rework process_start_urls like this:
async def process_start_urls(self):
    """
    Handle the start URLs
    """
    post_urls = [f'https://www.xsnvshen.com/album/?p={i}' for i in range(1, 2)]
    girl_urls = [f'https://www.xsnvshen.com/girl//?p={i}' for i in range(1, 2)]
    for url in post_urls:
        yield await self.get(url=url, callback=self.parse_post_list)
    for url in girl_urls:
        yield await self.get(url=url, callback=self.parse_girl_list)
Here we set up two groups of URLs, posts and girls, and give each group its own callback.
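One thing worth noting: range(1, 2) yields only 1, so each list is limited to its first page for now; widening the range is all it takes to cover more pages. Plain Python, nothing framework-specific:

>>> [f'https://www.xsnvshen.com/album/?p={i}' for i in range(1, 2)]
['https://www.xsnvshen.com/album/?p=1']
>>> [f'https://www.xsnvshen.com/album/?p={i}' for i in range(1, 4)]
['https://www.xsnvshen.com/album/?p=1', 'https://www.xsnvshen.com/album/?p=2', 'https://www.xsnvshen.com/album/?p=3']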
Handling the Lists
Now let's write the handler for the girl list. First, let's just try the post list's selector on it; who knows, it might work.
It turns out it works perfectly, so there's no need for two separate list handlers. We can reuse a single one and tell posts and girls apart through metadata. The modified code:
async def process_start_urls(self):
    """
    Handle the start URLs
    """
    post_urls = [f'https://www.xsnvshen.com/album/?p={i}' for i in range(1, 2)]
    girl_urls = [f'https://www.xsnvshen.com/girl//?p={i}' for i in range(1, 2)]
    for url in post_urls:
        yield await self.get(url=url, callback=self.parse_list, metadata={"type": "post"})
    for url in girl_urls:
        yield await self.get(url=url, callback=self.parse_list, metadata={"type": "girl"})

async def parse_list(self, response: Response):
    """
    Parse the post and girl list pages
    Args:
        response: the response
    """
    sp_type = response.metadata.get('type')
    urls = response.css('.itemimg::attr(href)').getall()
    for url in response.to_url(urls):
        if sp_type == "post":
            yield await self.get(url=url, callback=self.parse_post)
        elif sp_type == "girl":
            yield await self.get(url=url, callback=self.parse_girl)
We pass down whether we're crawling posts or girls via metadata, and at the end call a different callback depending on the type.
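The if/elif dispatch is fine for two types; if more list types show up later, a small lookup table keeps parse_list from growing a branch per type. A sketch of that variant (same behavior, not a change the article actually makes):

async def parse_list(self, response: Response):
    # Sketch: dispatch via a dict instead of if/elif; a new list type
    # then only needs a new entry in `callbacks`.
    callbacks = {"post": self.parse_post, "girl": self.parse_girl}
    callback = callbacks.get(response.metadata.get('type'))
    if not callback:
        return
    urls = response.css('.itemimg::attr(href)').getall()
    for url in response.to_url(urls):
        yield await self.get(url=url, callback=callback)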
Crawling the Girls
In the Item we only wrote an extraction rule for the intro in advance; all the other info is shown on the page as a table, so we have to handle it ourselves:
The plan is to turn that table into a dict and then save each value into the corresponding Item field. Looking at the table:
All the info lives under .entry-baseInfo-bd li, with the key in bas-title and the value in bas-cont. With this pattern figured out, we can write the following code:
girl_info = {
    "名字": "name",
    "中文名": "chinese_name",
    "英文名": "english_name",
    "生日": "birthday",
    "星座": "constellation",
    "三围": "size_3",
    "出生": "addr",
    "属相": "chinese_zodiac",
    "身高": "height",
    "体重": "weight",
    "职业": "job"
}
async def parse_girl(self, response: Response):
    """
    Parse a girl's details
    Args:
        response: the response
    """
    girl_item = await GirlItem.extract(response=response)
    girl_item.web_src = response.url
    girl_item.girl_id = int(response.url.split('/')[-1])
    for info in response.css('.entry-baseInfo-bd li'):
        value = info.css('.bas-cont::text').get()
        key = info.css('.bas-title::text').get()
        # map the Chinese table label to the GirlItem field name
        key = girl_info.get(key)
        if key:
            girl_item[key] = value
    yield girl_item
girl_info maps the table's Chinese field names to the field names of our GirlItem; after converting, we just store the values into girl_item.
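To see the table-to-dict step in isolation, here is a minimal standalone sketch using parsel, the library whose .css()/.get() selector API the Response object here exposes (presumably parsel under the hood). The HTML snippet is a hand-written approximation of the page's structure, not a copy of it:

from parsel import Selector

# hand-written approximation of the profile table's markup
html = """
<ul class="entry-baseInfo-bd">
  <li><span class="bas-title">身高</span><span class="bas-cont">168CM</span></li>
  <li><span class="bas-title">体重</span><span class="bas-cont">46KG</span></li>
</ul>
"""

sel = Selector(text=html)
for li in sel.css('.entry-baseInfo-bd li'):
    label = li.css('.bas-title::text').get()   # e.g. "身高"
    value = li.css('.bas-cont::text').get()    # e.g. "168CM"
    field = girl_info.get(label)               # Chinese label -> item field
    print(field, value)
# height 168CM
# weight 46KG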
Saving the Data
At this point the girls' and the posts' basic info has all been crawled; next up is saving it. As before, we use MongoDB:
if __name__ == '__main__':
    client = AsyncIOMotorClient("mongodb://ip:port", serverSelectionTimeoutMS=5000)
    db = client.xiuse
    main()
async def process_item(self, item: Union[PostItem, GirlItem]):
    """
    Process an item
    Args:
        item: a post item or a girl item
    """
    # posts and girls go to different collections, each keyed by its own ID
    if isinstance(item, PostItem):
        collection = db['post']
        id_name = 'post_id'
    else:
        collection = db['girl']
        id_name = 'girl_id'
    db_data = await collection.find_one({id_name: item[id_name]})
    if not db_data:
        item.created_at = datetime.now().isoformat()
        result = await collection.insert_one(item.copy())
        if isinstance(item, PostItem):
            self.logger.info(f"Inserted post: {item.title} {result.inserted_id}")
        else:
            self.logger.info(f"Inserted girl: {item.name} {result.inserted_id}")
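One caveat: find_one followed by insert_one leaves a small race window when several pages are processed concurrently, since two coroutines can both pass the existence check before either inserts. A unique index lets MongoDB enforce the dedup itself. A minimal sketch of that variant, a drop-in for the body of process_item assuming the same motor collection objects as above:

from pymongo.errors import DuplicateKeyError

# Sketch: the unique index makes MongoDB reject duplicate IDs atomically,
# closing the race between find_one and insert_one.
await collection.create_index(id_name, unique=True)
try:
    item.created_at = datetime.now().isoformat()
    result = await collection.insert_one(item.copy())
except DuplicateKeyError:
    pass  # this post/girl ID is already stored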
Run Results
Full Code
from datetime import datetime
from typing import Union

from hssp import Spider, Response, Setting
from hssp.item import Item, Field
from motor.motor_asyncio import AsyncIOMotorClient


class PostItem(Item):
    # Title
    title = Field(css_select='h1 a::text')
    # Description
    desc = Field(css_select='.longConWhite div[style]::text')
    # Images
    images = Field(css_select='img[id^="imglist_"]::attr(src)', many=True)
    # Tags
    tags = Field(css_select='.post-tags a::text', many=True)
    # Outfit
    dress = Field()
    # Style
    style = Field()
    # Physical traits
    signs = Field()
    # Scene
    scene = Field()
    # Region
    regional = Field()
    # Agency
    organization = Field()
    # Album ID
    post_id = Field()
    # Girl ID
    girl_id = Field()
    # Source URL
    web_src = Field()
    # Insert time
    created_at = Field()


class GirlItem(Item):
    # Intro
    desc = Field(css_select='.star2-intro-bd p::text')
    # Chinese name
    chinese_name = Field()
    # English name
    english_name = Field()
    # Name
    name = Field()
    # Birthday
    birthday = Field()
    # Occupation
    job = Field()
    # Measurements
    size_3 = Field()
    # Birthplace
    addr = Field()
    # Height
    height = Field()
    # Weight
    weight = Field()
    # Zodiac sign
    constellation = Field()
    # Chinese zodiac
    chinese_zodiac = Field()
    # Region
    regional = Field()
    # Figure
    figure = Field()
    # Physical traits
    signs = Field()
    # Group
    group = Field()
    # Girl ID
    girl_id = Field()
    # Source URL
    web_src = Field()
    # Insert time
    created_at = Field()


girl_info = {
    "名字": "name",
    "中文名": "chinese_name",
    "英文名": "english_name",
    "生日": "birthday",
    "星座": "constellation",
    "三围": "size_3",
    "出生": "addr",
    "属相": "chinese_zodiac",
    "身高": "height",
    "体重": "weight",
    "职业": "job"
}


class XiuSeSpider(Spider):
    async def process_start_urls(self):
        """
        Handle the start URLs
        """
        post_urls = [f'https://www.xsnvshen.com/album/?p={i}' for i in range(1, 2)]
        girl_urls = [f'https://www.xsnvshen.com/girl//?p={i}' for i in range(1, 2)]
        for url in post_urls:
            yield await self.get(url=url, callback=self.parse_list, metadata={"type": "post"})
        for url in girl_urls:
            yield await self.get(url=url, callback=self.parse_list, metadata={"type": "girl"})

    async def parse_list(self, response: Response):
        """
        Parse the post and girl list pages
        Args:
            response: the response
        """
        sp_type = response.metadata.get('type')
        urls = response.css('.itemimg::attr(href)').getall()
        for url in response.to_url(urls):
            if sp_type == "post":
                yield await self.get(url=url, callback=self.parse_post)
            elif sp_type == "girl":
                yield await self.get(url=url, callback=self.parse_girl)

    async def parse_post(self, response: Response):
        """
        Parse a post's details
        Args:
            response: the response
        """
        post_item = await PostItem.extract(response=response)
        post_item.web_src = response.url
        post_item.post_id = int(response.url.split('/')[-1])
        # the girl's ID comes from the third breadcrumb link
        post_item.girl_id = int(response.css('.show-topmbx a::attr(href)').getall()[2].split('/')[-1])
        yield post_item

    async def parse_girl(self, response: Response):
        """
        Parse a girl's details
        Args:
            response: the response
        """
        girl_item = await GirlItem.extract(response=response)
        girl_item.web_src = response.url
        girl_item.girl_id = int(response.url.split('/')[-1])
        for info in response.css('.entry-baseInfo-bd li'):
            value = info.css('.bas-cont::text').get()
            key = info.css('.bas-title::text').get()
            # map the Chinese table label to the GirlItem field name
            key = girl_info.get(key)
            if key:
                girl_item[key] = value
        yield girl_item

    async def process_item(self, item: Union[PostItem, GirlItem]):
        """
        Process an item
        Args:
            item: a post item or a girl item
        """
        # posts and girls go to different collections, each keyed by its own ID
        if isinstance(item, PostItem):
            collection = db['post']
            id_name = 'post_id'
        else:
            collection = db['girl']
            id_name = 'girl_id'
        db_data = await collection.find_one({id_name: item[id_name]})
        if not db_data:
            item.created_at = datetime.now().isoformat()
            result = await collection.insert_one(item.copy())
            if isinstance(item, PostItem):
                self.logger.info(f"Inserted post: {item.title} {result.inserted_id}")
            else:
                self.logger.info(f"Inserted girl: {item.name} {result.inserted_id}")


def main():
    setting = Setting()
    XiuSeSpider.start(setting=setting)


if __name__ == '__main__':
    client = AsyncIOMotorClient("mongodb://107.182.25.232:58269", serverSelectionTimeoutMS=5000)
    db = client.xiuse
    main()
Conclusion
After this article we can crawl all the post and girl info in one pass, but the extended info is still missing, as is incremental crawling; we'll continue with those in the next part.