Web Scraping Lesson 7 -- Scrapy (3): Scraping 163 News


In this lesson, we learn a new crawling template: CrawlSpider.

'''
Basic usage of the CrawlSpider class

Switch to the crawl template:
scrapy genspider -t crawl <spider_name> <domain>

LinkExtractor: extracts links

Parameters:
    allow: keep only links whose URL matches the regular expression
    restrict_xpaths: keep only links found inside regions matching the XPath

Rule: binds a LinkExtractor to a callback and controls whether matched links are followed

Workflow: import the module first (from scrapy.linkextractors import LinkExtractor)

In the CrawlSpider class source code, link extraction is ultimately done by
the extractor's extract_links method.
'''
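To make the `allow=` parameter concrete, here is a minimal stdlib-only sketch of what a link extractor's regex filter does: collect `<a href>` values and keep only those matching the allow pattern. This is an illustration, not Scrapy's real `LinkExtractor`, which also handles `restrict_xpaths`, `deny`, relative-URL resolution, and deduplication; the class name and sample HTML below are made up for the demo.

```python
import re
from html.parser import HTMLParser

class SimpleLinkExtractor(HTMLParser):
    """Toy stand-in for LinkExtractor(allow=...): regex-filter <a href> links."""

    def __init__(self, allow):
        super().__init__()
        self.allow = re.compile(allow)  # the allow= regular expression
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep only <a> tags whose href matches the allow pattern
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and self.allow.search(value):
                    self.links.append(value)

html = """
<a href="https://news.163.com/23/0101/article.html">news</a>
<a href="https://example.com/other.html">other</a>
"""

extractor = SimpleLinkExtractor(allow=r"https?://.*?\.163\.com/")
extractor.feed(html)
print(extractor.links)  # only the 163.com link survives the filter
```

Scrapy's `Rule` wraps exactly this kind of extractor and, for each surviving link, schedules a request and routes the response to the rule's callback.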
"""
案例分析网易新闻
scrapy startproject new
scrapy genspider -t crawl new_spider 域名
"""

Next, let's try a small example.

Spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewSpiderSpider(CrawlSpider):
    name = 'new_spider'
    # allowed_domains = ['163.com']
    start_urls = ['https://www.163.com/']

    rules = (
        Rule(LinkExtractor(allow='http.*?://.*?\.163\.com/\d{2}/\