Web Scraping Lesson 7 -- Scrapy (3): Scraping 163 News


In this lesson, we learn a new crawling template: CrawlSpider.

'''
Basic usage of the CrawlSpider class

Switch to the crawl template:
scrapy genspider -t crawl <spider_name> <domain>

LinkExtractor: extracts links

Parameters:
    allow: keep only links whose URL matches the regular expression
    restrict_xpaths: keep only links found inside regions matching the XPath

Rule: binds a LinkExtractor to a callback and controls whether matched links are followed

Workflow: import the module first (from scrapy.linkextractors import LinkExtractor)

In the CrawlSpider class source code, link extraction is ultimately done by
the extractor's extract_links method.
'''
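To make the `allow=` parameter concrete, here is a minimal stdlib-only sketch of what a link extractor's regex filter does: collect `<a href>` values and keep only those matching the allow pattern. This is an illustration, not Scrapy's real `LinkExtractor`, which also handles `restrict_xpaths`, `deny`, relative-URL resolution, and deduplication; the class name and sample HTML below are made up for the demo.

```python
import re
from html.parser import HTMLParser

class SimpleLinkExtractor(HTMLParser):
    """Toy stand-in for LinkExtractor(allow=...): regex-filter <a href> links."""

    def __init__(self, allow):
        super().__init__()
        self.allow = re.compile(allow)  # the allow= regular expression
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep only <a> tags whose href matches the allow pattern
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and self.allow.search(value):
                    self.links.append(value)

html = """
<a href="https://news.163.com/23/0101/article.html">news</a>
<a href="https://example.com/other.html">other</a>
"""

extractor = SimpleLinkExtractor(allow=r"https?://.*?\.163\.com/")
extractor.feed(html)
print(extractor.links)  # only the 163.com link survives the filter
```

Scrapy's `Rule` wraps exactly this kind of extractor and, for each surviving link, schedules a request and routes the response to the rule's callback.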
"""
案例分析网易新闻
scrapy startproject new
scrapy genspider -t crawl new_spider 域名
"""

Next, let's try a small example.

Spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewSpiderSpider(CrawlSpider):
    name = 'new_spider'
    # allowed_domains = ['163.com']
    start_urls = ['https://www.163.com/']

    rules = (
        Rule(LinkExtractor(allow='http.*?://.*?\.163\.com/\d{2}/\