Experiment 2
2.2 Approach
2.2.1 settings.py
- Lift the robots.txt restriction
ROBOTSTXT_OBEY = False
- Set the path for saved images
IMAGES_STORE = r'.\images'  # directory where downloaded files are stored
- Enable the pipeline
ITEM_PIPELINES = {
'weatherSpider.pipelines.WeatherspiderPipeline': 300,
}
- Set the default request headers
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36',
}
2.2.2 items.py
- Define the fields to scrape
class WeatherspiderItem(scrapy.Item):
    number = scrapy.Field()
    pic_url = scrapy.Field()
2.2.3 wt_Spider.py
- Send the initial request
def start_requests(self):
    yield scrapy.Request(self.start_url, callback=self.parse)
- Collect every `a` tag on the page
def parse(self, response):
    html = response.text
    urlList = re.findall('<a href="(.*?)" ', html, re.S)
    for url in urlList:
        try:
            yield scrapy.Request(url, callback=self.picParse)
        except Exception as e:
            print("err:", e)
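Note that this regex only captures an `href` value when a space follows the closing quote, and it returns relative links verbatim (in a real spider those would need `response.urljoin` before being requested). A quick check against made-up HTML (the markup below is invented for illustration, not taken from the target site):

```python
import re

# Sample page fragment invented for illustration; the real markup will differ.
html = ('<a href="http://example.com/day1.html" class="x">Day 1</a>'
        '<a href="/day2.html" title="d2">Day 2</a>')

urlList = re.findall('<a href="(.*?)" ', html, re.S)
print(urlList)  # relative links like /day2.html come back as-is
```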
- Request each of the pages those `a` tags point to, find every image on them, and yield the results
def picParse(self, response):
    imgList = re.findall(r'<img.*?src="(.*?)"', response.text, re.S)
    for k in imgList:
        if self.total > 102:  # stop once enough images have been collected
            return
        try:
            item = WeatherspiderItem()
            item['pic_url'] = k
            item['number'] = self.total
            self.total += 1
            yield item
        except Exception:
            pass  # skip any image that fails to build an item
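The counting logic above can be tried in isolation as a plain generator; the function name and sample URLs below are illustrative, not part of the project:

```python
import re

def pick_images(html, start=0, limit=102):
    """Mimic picParse's counting: yield items until the counter passes `limit`."""
    total = start
    for k in re.findall(r'<img.*?src="(.*?)"', html, re.S):
        if total > limit:
            return
        yield {'pic_url': k, 'number': total}
        total += 1

# Five fake image tags; starting the counter at 100 shows the cutoff kicking in.
html = ''.join(f'<img src="http://example.com/{i}.png">' for i in range(5))
items = list(pick_images(html, start=100))
print([it['number'] for it in items])  # only numbers up to the limit are yielded
```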
- As with saving to a database, all data handling belongs in pipelines.py; in other words, the pipeline itself still has to issue the download requests.
2.2.4 pipelines.py
- Import the settings
from scrapy import Request
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline
from weatherSpider.settings import IMAGES_STORE as images_store  # read the configured save path

settings = get_project_settings()
- Write the save function
def get_media_requests(self, item, info):
    image_url = item["pic_url"]
    yield Request(image_url)
- One possible optimization: rename the files when saving them.
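One way to do that is to override `file_path`, the ImagesPipeline hook that decides where each downloaded image lands. A minimal sketch, assuming the item's `number` field drives the name; the helper name and zero-padding scheme are illustrative choices, not from the original project:

```python
import os
from urllib.parse import urlparse

def numbered_name(number, url, default_ext='.jpg'):
    """Build a filename like '007.png' from the item number and the URL's extension.
    Illustrative helper, not part of the original project."""
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    return f'{number:03d}{ext}'

print(numbered_name(7, 'http://example.com/pics/cloud.png'))  # 007.png

# Inside the pipeline it could be wired up by overriding file_path
# (a real ImagesPipeline hook; the item is passed as a keyword argument):
#
# class WeatherspiderPipeline(ImagesPipeline):
#     def file_path(self, request, response=None, info=None, *, item=None):
#         return numbered_name(item['number'], request.url)
```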