Differences
Basic Usage
(I) Basic commands for creating a project
(1) Install Scrapy: pip install scrapy
(2) Create a project: scrapy startproject <project name>
(3) First spider project
- Step 1: scrapy startproject baidu
- Step 2: open the project in PyCharm
- Step 3: in the spiders folder, create a baidu.py file by hand, or run scrapy genspider s_baidu baidu.com inside that folder (a minimal sketch of such a spider follows this list)
- Step 4: cd into the project directory and start the crawler with scrapy crawl baidu (the argument is the spider's name attribute, not the file name)
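For reference, a minimal sketch of what the hand-written baidu.py from step 3 might look like; the parse body and the output file name are assumptions, since the original only says that running it produces no saved file:

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                        # the name used by "scrapy crawl baidu"
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        # assumption: simply dump the page source to a local file
        with open('baidu.html', 'w', encoding='utf-8') as f:
            f.write(response.text)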
Problem: after running the spider, no file is saved. Reason: the Robots protocol (also called the crawler or robot protocol), formally the Robots Exclusion Protocol, is how a website such as Baidu tells search engines and crawlers which pages may be scraped and which may not.
Why did our earlier hand-written crawlers work? Because they simply ignored this protocol, whereas Scrapy obeys it by default.
Solution: in settings.py, set ROBOTSTXT_OBEY = False.
Example: Hupu basketball news
(I) Install the module
pip install scrapy
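The storage pipeline in section (VI) below also relies on pymongo and a locally running MongoDB instance, so the driver can be installed at the same time (an extra step not spelled out in the original):

pip install pymongo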
(II) Create the project
# in the current directory
scrapy startproject hupu
# then, inside the spiders folder, generate the spider
scrapy genspider hupu_spider www
(III) Project directory overview
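For reference, a project generated by scrapy startproject hupu followed by the genspider command above has roughly this layout:

hupu/
├── scrapy.cfg
└── hupu/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── hupu_spider.py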
(IV) Edit settings.py
(1) Default request headers
Modify and extend these according to the target site:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.6.1000',
}
(2) The robots protocol
Full name: Robots Exclusion Protocol
Purpose: a file through which a website tells search engines and crawlers which pages may be scraped and which may not
Scrapy obeys the protocol by default, so it needs to be switched off for this project as well; a minimal sketch follows.
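Switching it off is a one-line change in settings.py:

# settings.py
ROBOTSTXT_OBEY = False   # stop enforcing robots.txt (the generated default is True)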
(V) Write the spider and the item
hupu_spider.py
import scrapy
# the item class has to be imported with a relative path (the spider lives inside the package)
from ..items import HupuItem


class HupuSpiderSpider(scrapy.Spider):
    name = 'hupu_spider'
    # allowed_domains = ['www']
    start_urls = ['https://voice.hupu.com/nba']
    # append the paginated list pages as well (range(1, 2) only adds page 1)
    for page in range(1, 2):
        url = f"https://voice.hupu.com/nba/{str(page)}"
        start_urls.append(url)

    def parse(self, response):
        # collect the detail-page links from the news list
        nba_urls = response.xpath('//div[@class="news-list"]/ul/li//h4/a/@href').extract()
        for url in nba_urls:
            yield scrapy.Request(url=url, callback=self.parse_detail, encoding="utf-8")
        print("Detail requests scheduled")

    def parse_detail(self, response):
        # title
        news_title = response.xpath('//h1[@class="headline"]/text()').extract_first().strip()
        # source
        news_from = response.xpath('//*[@id="source_baidu"]/a/text()').extract_first().strip()
        # publish time
        news_pub_time = response.xpath('//*[@id="pubtime_baidu"]/text()').extract_first().strip()
        # body text
        news_content = response.xpath('string(//div[@class="artical-content-read"])').extract_first().strip()
        # all image URLs
        news_imgs = response.xpath('//div[@class="artical-content-read"]//img/@src').extract()
        items = HupuItem()
        items['news_title'] = news_title
        items['news_from'] = news_from
        items['news_pub_time'] = news_pub_time
        items['news_content'] = news_content
        items['news_imgs'] = news_imgs
        items['news_url'] = response.url
        yield items
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class HupuItem(scrapy.Item):
    # define the fields for your item here like:
    news_title = scrapy.Field()
    news_from = scrapy.Field()
    news_pub_time = scrapy.Field()
    news_content = scrapy.Field()
    news_imgs = scrapy.Field()
    news_url = scrapy.Field()
    news_id = scrapy.Field()    # filled in by the pipeline (MD5 of news_url)
(VI) Store the data
- Register the pipeline in settings.py (a minimal sketch follows this list)
- Write the storage logic in pipelines.py (shown after the settings sketch)
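The pipeline is registered through ITEM_PIPELINES; a minimal sketch, assuming the default project layout (the priority value 300 is an arbitrary choice, lower numbers run earlier):

# settings.py
ITEM_PIPELINES = {
    'hupu.pipelines.HupuPipeline': 300,
}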
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import hashlib

import pymongo
from itemadapter import ItemAdapter


class HupuPipeline:
    def __init__(self):
        # 1. create the MongoDB connection
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        # 2. select the database
        self.db = self.client['hupu']

    def get_md5(self, value):
        # use the MD5 of the news URL as a stable id, so re-runs upsert instead of duplicating
        return hashlib.md5(bytes(value, encoding='utf-8')).hexdigest()

    def process_item(self, item, spider):
        # print(item)
        item['news_id'] = self.get_md5(item['news_url'])
        # upsert by news_id (update_one replaces the Collection.update removed in newer pymongo)
        self.db['nba'].update_one({'news_id': item['news_id']}, {'$set': dict(item)}, upsert=True)
        print("Storing...")
        return item
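With the spider, item, and pipeline in place, the crawl is started from the project root (the command follows from the spider's name attribute; it is not stated explicitly in the original):

scrapy crawl hupu_spider

Each scraped article is then upserted into the nba collection of the local hupu MongoDB database.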
(VII) Result screenshots