Scraping weather-site images with the Scrapy framework


This is day 2 of my participation in the November Writing Challenge. For event details, see: The Last Writing Challenge of 2021.

Experiment 2

2.2 Approach

2.2.1 settings.py

  • Disable the robots.txt restriction
ROBOTSTXT_OBEY = False
  • Set the path where images are saved
IMAGES_STORE = r'.\images'  # directory where downloaded files are stored
  • Enable the pipeline
ITEM_PIPELINES = {
    'weatherSpider.pipelines.WeatherspiderPipeline': 300,
}
  • Set the default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36',
}

2.2.2 items.py

  • Define the fields to scrape
class WeatherspiderItem(scrapy.Item):
    number = scrapy.Field()
    pic_url = scrapy.Field()

2.2.3 wt_Spider.py

  • Send the initial request
    def start_requests(self):        
      yield scrapy.Request(self.start_url, callback=self.parse)
  • Collect the href of every <a> tag on the page
    def parse(self, response):
        html = response.text
        urlList = re.findall('<a href="(.*?)" ', html, re.S)
        for url in urlList:
            try:
                yield scrapy.Request(url, callback=self.picParse)
            except Exception as e:
                print("err:", e)
  • Request each linked page in turn, find all the images on it, and yield them
    def picParse(self, response):
        imgList = re.findall(r'<img.*?src="(.*?)"', response.text, re.S)
        for k in imgList:
            if self.total > 102:
                return
            try:
                item = WeatherspiderItem()
                item['pic_url'] = k
                item['number'] = self.total
                self.total += 1
                yield item
            except Exception as e:
                print("err:", e)
  • As with writing to a database, all data handling belongs in pipelines.py — which means the pipeline itself still has to send the download requests.
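One thing worth noting about the extraction step above: the href values captured by the regex can be relative paths, which must be resolved against the page URL before they can be requested (inside a spider, `response.urljoin` does this). A stdlib-only sketch of the same two regexes — the sample HTML and page URL below are made up for illustration:

```python
import re
from urllib.parse import urljoin

# Hypothetical page URL and HTML fragment, for illustration only
page_url = "http://www.weather.com.cn/photo/"
html = ('<a href="/photo/today.shtml" class="pic">'
        '<img src="http://p.weather.com.cn/1.jpg"></a>')

# Same patterns as in wt_Spider.py
links = re.findall('<a href="(.*?)" ', html, re.S)
imgs = re.findall(r'<img.*?src="(.*?)"', html, re.S)

# Resolve relative hrefs before requesting them
full_links = [urljoin(page_url, u) for u in links]
print(full_links)  # ['http://www.weather.com.cn/photo/today.shtml']
print(imgs)        # ['http://p.weather.com.cn/1.jpg']
```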

2.2.4 pipelines.py

  • Import the settings (note that `Request` and `get_project_settings` also need to be imported here)
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
from weatherSpider.settings import IMAGES_STORE as images_store  # read the configured save path

settings = get_project_settings()
  • Write the download hook
    def get_media_requests(self, item, info):
        image_url = item["pic_url"]
        yield Request(image_url)
  • One possible improvement here: rename the files when saving them.
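A sketch of how that renaming could work (not from the original post): build the filename from the item's counter plus the extension taken from the image URL. The helper name `make_image_name` and the example URL are assumptions; the helper itself is stdlib-only.

```python
import os
from urllib.parse import urlparse

def make_image_name(number, pic_url):
    """Build a name like '007.png' from the item counter and the URL's extension."""
    # Take the extension from the URL path; fall back to .jpg if there is none
    ext = os.path.splitext(urlparse(pic_url).path)[1] or '.jpg'
    return f"{number:03d}{ext}"

print(make_image_name(7, "http://p.weather.com.cn/pic/2021/10/abc.png"))  # 007.png
```

In the pipeline this would plug into `ImagesPipeline.file_path` (in recent Scrapy versions the signature is `file_path(self, request, response=None, info=None, *, item=None)`), returning `make_image_name(item['number'], item['pic_url'])` so each image is stored under its counter instead of Scrapy's default hash-based name.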
