Hands-On Exercise
Previously we scraped books.toscrape.com, but we only extracted each book's name and price from the list pages.
In this exercise, we create a new Scrapy project that scrapes more information about every book.
For each book we collect:
name, price, review rating, product code (UPC), stock availability, and number of reviews
The scraped results are then output.
Preparation
Page Analysis
On each book's detail page, the name, price, and review rating can be extracted from
<div class="col-sm-6 product_main">
The product code, stock availability, and number of reviews can be extracted from a table near the bottom of the page:
<table class="table table-striped">
The link to each book's detail page can be found inside each <article class="product_pod"> on the list pages.
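To make the page analysis concrete, here is a minimal, standard-library-only sketch of the idea: the detail-page link sits in an <a> inside the <h3> of each <article class="product_pod">. The sample HTML fragment below is a simplified stand-in for the real page; in the project itself, Scrapy's LinkExtractor does this work for us.

```python
from html.parser import HTMLParser

# Simplified fragment of a list page (illustrative, not the full markup).
SAMPLE = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html">A Light in the Attic</a></h3>
</article>
"""

class ProductLinkParser(HTMLParser):
    """Collect hrefs found inside <article class="product_pod"> <h3> elements."""

    def __init__(self):
        super().__init__()
        self.in_pod = False   # currently inside <article class="product_pod">
        self.in_h3 = False    # currently inside its <h3>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article" and "product_pod" in attrs.get("class", ""):
            self.in_pod = True
        elif tag == "h3" and self.in_pod:
            self.in_h3 = True
        elif tag == "a" and self.in_h3 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_pod = False
        elif tag == "h3":
            self.in_h3 = False

parser = ProductLinkParser()
parser.feed(SAMPLE)
print(parser.links)  # the relative URL of the book's detail page
```

The extracted URL is relative; Scrapy's LinkExtractor resolves such links against the page URL automatically.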
Writing the Code
Create a Scrapy project named t_book:

scrapy startproject t_book

Then generate a spider from the template using the scrapy genspider <SPIDER_NAME> <DOMAIN> command:

scrapy genspider books books.toscrape.com

The scrapy genspider command creates the file t_book/spiders/books.py and defines a BooksSpider class in it:
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
Define an Item class in t_book/items.py to encapsulate the book information:
import scrapy

class BookItem(scrapy.Item):
    name = scrapy.Field()           # book name
    price = scrapy.Field()          # price
    review_rating = scrapy.Field()  # review rating (e.g. Three)
    review_num = scrapy.Field()     # number of reviews
    upc = scrapy.Field()            # product code
    stock = scrapy.Field()          # stock availability
Implement the parsing of the book list pages:
import scrapy
from scrapy.linkextractors import LinkExtractor

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Parse a book list page
    def parse(self, response):
        # Extract the link to each book's detail page
        le = LinkExtractor(restrict_css='article.product_pod h3')
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_book)

        # Extract the link to the next list page
        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)

    # Parse a book detail page
    def parse_book(self, response):
        pass
Implement the parsing of the book detail pages:
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Parse a book list page
    def parse(self, response):
        le = LinkExtractor(restrict_css='article.product_pod h3')
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_book)

        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)

    # Parse a book detail page
    def parse_book(self, response):
        book = BookItem()
        # Name, price, and review rating live in <div class="product_main">
        sel = response.css('div.product_main')
        book['name'] = sel.xpath('./h1/text()').extract_first()
        book['price'] = sel.css('p.price_color::text').extract_first()
        book['review_rating'] = sel.css('p.star-rating::attr(class)') \
                                   .re_first('star-rating ([A-Za-z]+)')

        # Product code, stock, and review count are rows of the striped table
        sel = response.css('table.table.table-striped')
        book['upc'] = sel.xpath('(.//tr)[1]/td/text()').extract_first()
        book['stock'] = sel.xpath('(.//tr)[last()-1]/td/text()') \
                           .re_first(r'\((\d+) available\)')
        book['review_num'] = sel.xpath('(.//tr)[last()]/td/text()').extract_first()
        print(book)
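The two re_first() calls in parse_book apply regular expressions to the extracted text. Run standalone against the kind of strings the page actually contains (the class attribute of the rating element, and the text of the availability cell), they work like this:

```python
import re

# Sample inputs of the form found on a book detail page.
rating_class = "star-rating Three"      # value of the <p> element's class attribute
stock_text = "In stock (22 available)"  # text of the availability table cell

# Same patterns as used with re_first() in parse_book.
rating = re.search(r"star-rating ([A-Za-z]+)", rating_class).group(1)
stock = re.search(r"\((\d+) available\)", stock_text).group(1)

print(rating)  # Three
print(stock)   # 22
```

Note that the stock pattern needs escaped parentheses, \( and \), to match literal parentheses, while the unescaped pair forms the capture group.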
Running the Spider
Each scraped item is printed to the console as a dictionary-like object.
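Assuming the default project layout, the spider is run from the project root (the directory containing scrapy.cfg) by name:

```shell
# Run the spider; items are printed among the log output
scrapy crawl books

# Optionally export the items to a file instead of just printing them
scrapy crawl books -o books.csv
```

The -o option uses Scrapy's built-in feed exports, so the output format (CSV, JSON, etc.) is inferred from the file extension.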
References:
Scrapy official documentation
Liu Shuo (刘硕), 《精通Scrapy网络爬虫》, Tsinghua University Press