1.1 Assignment

Become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl book data from the Dangdang website using a Scrapy + XPath + MySQL storage pipeline. Candidate site: www.dangdang.com/
1.2 Approach
1.2.1 settings.py

- Enable the request headers
- Add the database connection settings
- Set ROBOTSTXT_OBEY to False
- Enable the pipelines
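The corresponding settings.py entries might look like the sketch below; the connection key names (HOSTNAME, PORT, DATABASE, USERNAME, PASSWORD) must match what pipelines.py reads later, and the pipeline path and credential values are placeholders:

```python
# settings.py (sketch; values are placeholders)
ROBOTSTXT_OBEY = False  # do not honor robots.txt

DEFAULT_REQUEST_HEADERS = {
    # a browser-like User-Agent; the exact string is illustrative
    'User-Agent': 'Mozilla/5.0',
}

# database connection info, read back in pipelines.py
HOSTNAME = '127.0.0.1'
PORT = 3306
DATABASE = 'spider'
USERNAME = 'root'
PASSWORD = 'your_password'  # placeholder

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,  # hypothetical module path
}
```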
1.2.2 items.py

Define the item fields:

```python
class DangdangItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    date = scrapy.Field()
    price = scrapy.Field()
    detail = scrapy.Field()
```
1.2.3 db_Spider.py

- Observe the site's pagination: comparing the URLs of the second and third result pages shows that the page_index query parameter is what controls paging.
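The page URLs can then be generated from a template; a minimal sketch, assuming a search-URL layout pieced together from the observation above (the exact query string is an assumption):

```python
# Hypothetical template; key and page_index mirror the observed parameters
next_url = "http://search.dangdang.com/?key=%s&act=input&page_index=%d"

def page_url(keyword, page_index):
    """Fill in the template for one results page."""
    return next_url % (keyword, page_index)

print(page_url("python", 2))
```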
- Extract the node information (note: `next_parse` is not defined anywhere in the spider, so the request callback should point back at `parse`):

```python
def parse(self, response):
    lis = response.xpath('//*[@id="component_59"]')
    titles = lis.xpath(".//p[1]/a/@title").extract()
    authors = lis.xpath(".//p[5]/span[1]/a[1]/text()").extract()
    publishers = lis.xpath('.//p[5]/span[3]/a/text()').extract()
    dates = lis.xpath(".//p[5]/span[2]/text()").extract()
    prices = lis.xpath('.//p[3]/span[1]/text()').extract()
    details = lis.xpath('.//p[2]/text()').extract()
    for title, author, publisher, date, price, detail in zip(titles, authors, publishers, dates, prices, details):
        item = DangdangItem(
            title=title,
            author=author,
            publisher=publisher,
            date=date,
            price=price,
            detail=detail,
        )
        self.total += 1
        print(self.total, item)
        yield item
    self.page_index += 1
    yield scrapy.Request(self.next_url % (self.keyword, self.page_index),
                         callback=self.parse)
```
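One caveat with the zip() pairing used above: zip stops at the shortest list, so if one book happens to lack an author node, fields can silently drop or misalign. A minimal illustration with made-up lists:

```python
titles = ["Book A", "Book B", "Book C"]
authors = ["Author A", "Author B"]  # one book's author node is missing

# zip truncates to the shorter list: "Book C" is silently dropped,
# and if the gap were in the middle, every pairing after it would shift
rows = list(zip(titles, authors))
print(rows)
```

Iterating over each `li` node and extracting fields relative to that node avoids this, at the cost of slightly longer code.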
- Limit the crawl: fetch 102 records in total.
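The 102-record cap can be enforced with the same counter that parse() increments; a Scrapy-free sketch of just the counting logic (`take_up_to` is a hypothetical helper mirroring the `self.total` check):

```python
LIMIT = 102  # target number of records

def take_up_to(items, limit=LIMIT):
    """Yield items until the limit is reached, mirroring the self.total counter."""
    total = 0
    for item in items:
        if total >= limit:
            break  # stop yielding once the cap is hit
        total += 1
        yield item

print(len(list(take_up_to(range(500)))))
```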
1.2.4 pipelines.py

- Connect to the database (pymysql must be imported, and the `settings` object can be obtained via `get_project_settings()`):

```python
import pymysql
from scrapy.utils.project import get_project_settings

# read the connection info configured in settings.py
settings = get_project_settings()

def __init__(self):
    # host, port, database name and credentials from settings.py
    host = settings['HOSTNAME']
    port = settings['PORT']
    dbname = settings['DATABASE']
    username = settings['USERNAME']
    password = settings['PASSWORD']
    self.conn = pymysql.connect(host=host, port=port, user=username,
                                password=password, database=dbname,
                                charset='utf8')
    self.cursor = self.conn.cursor()
```
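For the insert below to work, the spider_dangdang table must already exist; a hypothetical schema matching the column list used in process_item (the column types are assumptions):

```python
# Hypothetical DDL; column names follow the INSERT in process_item
# (b_date stores item["date"]), and id auto-increments as noted at the end
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS spider_dangdang (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    author VARCHAR(255),
    publisher VARCHAR(255),
    b_date VARCHAR(64),
    price VARCHAR(32),
    detail TEXT
)
"""
# run once after connecting:
# self.cursor.execute(CREATE_TABLE)
```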
- Insert the data. The original code called `commit()` before `execute()`, so the row was never persisted; commit must come after the execute:

```python
def process_item(self, item, spider):
    data = dict(item)
    sql = "INSERT INTO spider_dangdang(title,author,publisher,b_date,price,detail)" \
          " VALUES (%s, %s, %s, %s, %s, %s)"
    try:
        self.cursor.execute(sql, [data["title"],
                                  data["author"],
                                  data["publisher"],
                                  data["date"],
                                  data["price"],
                                  data["detail"],
                                  ])
        self.conn.commit()  # commit after execute so the row is persisted
        print("insert succeeded")
    except Exception as err:
        print("insert failed", err)
    return item
```
Checking the results: there are 102 rows in total. The id column is set to auto-increment, and since rows from earlier test runs had already been inserted, the ids do not start from 1.