scrapy + selenium: a simple introduction

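When scraping sites such as JD.com or Baidu, you will sometimes run into anti-scraping measures, for example checks on the UA or on the cookies carried in the request headers, so a plain request is not enough to get the data.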
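To make "UA and cookie in the request headers" concrete, here is a minimal sketch using only the Python standard library. The header values are placeholders assumed for illustration, and the request is only built, not sent. A site with stricter checks or JavaScript-rendered content may still not return usable data to a request like this, which is the gap selenium is meant to fill.

```python
from urllib.request import Request

# Placeholder values, assumed for illustration only
FAKE_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
FAKE_COOKIE = "session_id=example"


def build_request(url):
    """Build (but do not send) a request carrying a browser-like UA and a cookie."""
    req = Request(url)
    req.add_header("User-Agent", FAKE_UA)
    req.add_header("Cookie", FAKE_COOKIE)
    return req


req = build_request("https://www.baidu.com/")
# Show the headers that would be sent with the request
print(req.header_items())
```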

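1. What is selenium?

(1) Selenium is a tool for testing web applications.

(2) Selenium tests run directly in the browser, just as a real user would operate it.

(3) It drives real browsers through various drivers (FirefoxDriver, InternetExplorerDriver, OperaDriver, ChromeDriver) to carry out the tests.

(4) selenium also supports headless (no UI) browser operation.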

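Note: selenium needs a browser driver. Copy the driver executable for your browser (for example chromedriver) into a directory where selenium can find it.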

[Troubleshooting] WebDriverException: Message: 'chromedriver' executable needs to be in PATH
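One way to check for this problem before starting the spider is to look the driver up on PATH yourself. The following is a small sketch using the standard library; the helper name find_driver is hypothetical and not part of selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    # In this situation selenium may raise the WebDriverException shown above
    print("chromedriver is not on PATH")
else:
    print("chromedriver found at", path)
```

If the lookup returns None, either copy the driver into a directory that is already on PATH or add the driver's directory to PATH.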

scrapy startproject ze

scrapy genspider baidu baidu.com

In the settings file (settings.py):

# Comment this out, or change it to False
# ROBOTSTXT_OBEY = True

The baidu.py file:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless browser settings
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=周杰伦']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Instantiate one browser object for the lifetime of the spider
        self.browser = webdriver.Chrome(options=chrome_options)

    def parse(self, response):
        # response.text is what Scrapy downloaded; load the same URL in the
        # real browser so that JavaScript-rendered content is included
        self.browser.get(response.url)
        print(self.browser.page_source)

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes
        print('Spider finished, closing the browser')
        self.browser.quit()
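The spider above only prints page_source. As a possible next step, the sketch below pulls text out of the rendered HTML using only the standard library, so it runs without extra packages. The assumption that the titles of interest sit in <h3> elements is mine, not something the page guarantees, and the sample HTML is made up. Inside a real Scrapy project, wrapping the HTML in Scrapy's Selector(text=...) and using CSS or XPath queries would likely be more convenient.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <h3>
        self._buffer = []    # text fragments of the current heading
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.headings.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_headings(page_source):
    """Return the text of all <h3> elements in an HTML string."""
    parser = HeadingExtractor()
    parser.feed(page_source)
    return parser.headings


# Made-up HTML standing in for self.browser.page_source
sample = "<div><h3><a href='#'>First <em>result</em></a></h3><p>x</p><h3>Second</h3></div>"
print(extract_headings(sample))
```

In the spider, the call would be extract_headings(self.browser.page_source) inside parse.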