Packaging stealth.min.js into an egg to hide the webdriver property on a distributed Selenium Grid


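Introduction:

Goal: hide the webdriver property on a distributed Selenium Grid.
The focus is on the pitfalls of packaging, deployment, file reading, and paths. I wrote a configurable, general-purpose crawler that prioritizes development efficiency, deployment efficiency, and manageability.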

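Details:
Package stealth.min.js into the egg, manage the project with the Scrapy framework, deploy it through scrapyd, and schedule it with scrapydweb,
so that the browsers on the distributed Selenium Grid hide the webdriver property.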

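Target sites:
Requirement: follow the announcements of the financial regulatory bureaus, with crawlers for dozens of provinces and cities that can be built in a day and are easy to maintain.
For example, these two sites (Base64-encoded) run browser detection: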
aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s
aHR0cDovL2pyai5oYWluYW4uZ292LmNuL3NqcmIvemNjYy9uZXd4eGdrX2luZGV4LnNodG1s
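
The two addresses above are Base64-encoded. A minimal sketch for decoding them with the standard library follows; the helper name `decode_target` is illustrative and not part of the original project.

```python
import base64

# The two encoded addresses, copied verbatim from the article.
ENCODED_TARGETS = [
    "aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s",
    "aHR0cDovL2pyai5oYWluYW4uZ292LmNuL3NqcmIvemNjYy9uZXd4eGdrX2luZGV4LnNodG1s",
]


def decode_target(encoded: str) -> str:
    """Decode one Base64-encoded URL back to plain text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    for item in ENCODED_TARGETS:
        print(decode_target(item))
```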

Related article: A configurable crawler based on Scrapy that greatly improves productivity

# Configurable crawler: example configuration
{"local": "海南省",
 "organization": "xxxx地方金融监督管理局",
 "link": "xxxxxx.shtml",
 "article_rows_xpath": '//a[contains(text(), "公示公告")]/../following-sibling::div[1]/ul/li',
 "title_xpath": "./a",
 "title_parse": "./@title",
 "title_link_xpath": "./a/@href",
 "date_xpath": "./em",
 "date_parse": './text()',
 "prefix": "http://jrj.hainan.gov.cn/",
 "note": "{'way':'selenium', 'use_proxy':'False'} ",
 },
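
In the configuration above, `note` is a string that holds a Python dict literal, and `prefix` is the base for relative article links. How the original spider consumes these fields is not shown, so the following is only a sketch under that assumption; `parse_note` and `absolute_link` are hypothetical helper names.

```python
import ast
from urllib.parse import urljoin


def parse_note(note: str) -> dict:
    """Turn the `note` string (a dict literal) into typed options."""
    raw = ast.literal_eval(note.strip())
    return {
        # Fetch strategy, e.g. 'selenium'; None when the key is absent.
        "way": raw.get("way"),
        # The config stores booleans as strings, so compare the text.
        "use_proxy": str(raw.get("use_proxy", "False")).lower() == "true",
    }


def absolute_link(prefix: str, href: str) -> str:
    """Join the site prefix with a (possibly relative) article link."""
    return urljoin(prefix, href)
```

For the example entry, `parse_note(config["note"])["way"]` would select the Selenium code path, and `absolute_link(config["prefix"], href)` would build the full article URL from each extracted `@href`.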

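Key difficulties:

1. Demo code for hiding webdriver

Reference: The most perfect solution! How to correctly hide the features of an automated browser

Local version

# Local version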
# -*- coding:utf-8 -*-
# @Author: clark
# @Time: 2021/3/4 5:40 下午
# @File: webdriver_hide_feature.py
# @project demand: as of this date, fully hide the webdriver features

import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

driver = Chrome(options=chrome_options)

with open('stealth.min.js') as f:
    js = f.read()

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": js
})
driver.get('https://bot.sannysoft.com/')
time.sleep(5)
driver.save_screenshot('screenshot.png')

# You can save the page source as HTML and open it by double-clicking to see the full result
source = driver.page_source
with open('result.html', 'w', encoding='utf-8') as f:
    f.write(source)
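
Besides inspecting the screenshot, the result can also be checked in code. The sketch below assumes a Selenium-style driver that exposes `execute_script`; the function name `webdriver_flag_hidden` is made up for illustration. It should be called after `driver.get(...)`, because the injected script only runs when a new document is created.

```python
def webdriver_flag_hidden(driver) -> bool:
    """Return True when the loaded page does not report navigator.webdriver as true.

    An automated Chrome normally returns True here; after a successful
    injection the value is expected to be False or undefined (None in Python).
    """
    value = driver.execute_script("return navigator.webdriver")
    return not value
```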

Remote Selenium Grid version

# -*- coding:utf-8 -*-
# @Author: clark
# @Time: 2021/4/8 2:38 下午
# @File: webdriver_hide_feature_remote.py
# @project demand: as of this date, fully hide the webdriver features
import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.remote_connection import ChromeRemoteConnection
from selenium import webdriver

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr='http://192.168.95.56:4444/wd/hub',
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS',
            'browserName': "chrome",
            'version': '',
            'javascriptEnabled': True
        },
        options=chrome_options
) as driver:
    with open('stealth.min.js') as f:
        js = f.read()
    print(driver.execute("executeCdpCommand", {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
        "source": js
    }}))

    driver.get('https://bot.sannysoft.com/')
    time.sleep(5)
    driver.save_screenshot('screenshot.png')
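
The remote version differs from the local one mainly in how the CDP command is sent: through `driver.execute("executeCdpCommand", ...)` instead of `execute_cdp_cmd`. Since the same payload is built again in the spider later, it could be factored into a small helper. This is a sketch only; `build_cdp_payload` and `inject_on_new_document` are hypothetical names, and `driver` is assumed to be a remote driver created as in the code above.

```python
CDP_INJECT = "Page.addScriptToEvaluateOnNewDocument"


def build_cdp_payload(js_source: str) -> dict:
    """Build the parameters expected by the executeCdpCommand endpoint."""
    return {"cmd": CDP_INJECT, "params": {"source": js_source}}


def inject_on_new_document(driver, js_source: str):
    """Register js_source so it runs before any page script on each new document.

    Must be called before driver.get(...), otherwise the first page
    loads without the script.
    """
    return driver.execute("executeCdpCommand", build_cdp_payload(js_source))
```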


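2. Packaging static resources into the egg and reading the packaged data file in production

References:

After two days, I finally figured out Python's setup.py

A brief introduction to setuptools, Python's packaging and distribution tool

Python Cookbook study notes: modules and packages

File path: <scrapy project path>/spiders/static/stealth.min.js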

# MANIFEST.in
recursive-include risk_control_info/spiders/static *
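
The spider reads the file from inside the package in production and falls back to the local path in a test environment:

# xxx/spiders/big_finance_jgj_news.py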
with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr="{}/wd/hub".format(SELENIUM_DOCKER_HOST),
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS',
            'browserName': "chrome",
            'version': '',
            'javascriptEnabled': True
        },
        options=options
) as browser:
    # Hide the webdriver property
    try:
        # Production: read the data file from inside the package
        import pkg_resources
        f = pkg_resources.resource_stream(__package__, 'static/stealth.min.js')
        js = f.read().decode()
        # self.logger.info(js)
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
    except Exception as e:
        self.logger.warning(f"Test environment, reading from the local path: {e}")
        # Read from the local file system
        this_dir, this_filename = os.path.split(__file__)
        STEALTH_PATH = os.path.join(this_dir, "static", "stealth.min.js")
        self.logger.info(f"STEALTH_PATH:{STEALTH_PATH}")
        with open(STEALTH_PATH) as f:
            js = f.read()
            self.logger.info(
                browser.execute("executeCdpCommand",
                                {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                    "source": js
                                }}))
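
The read-then-fall-back logic above can also be isolated from the browser call, which avoids repeating the CDP command in both branches. The sketch below swaps in `pkgutil.get_data` from the standard library in place of the `pkg_resources.resource_stream` call used in the article; it is an alternative under that assumption, not the author's method, and `load_packaged_text` is a hypothetical name.

```python
import os
import pkgutil


def load_packaged_text(package: str, resource: str, fallback_dir: str) -> str:
    """Return a data file as text, trying the installed package first.

    package      -- dotted package name, e.g. the spider's __package__
    resource     -- '/'-separated path relative to the package
    fallback_dir -- directory to read from when the package lookup fails
    """
    try:
        data = pkgutil.get_data(package, resource)
    except (ImportError, OSError):
        # Package not importable, or the resource is missing from it.
        data = None
    if data is not None:
        return data.decode("utf-8")
    local_path = os.path.join(fallback_dir, *resource.split("/"))
    with open(local_path, encoding="utf-8") as f:
        return f.read()
```

Inside the spider this might be called as `load_packaged_text(__package__, "static/stealth.min.js", os.path.dirname(__file__))`, with the returned string passed to a single `executeCdpCommand` call.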

The setup.py generated by scrapyd-deploy, with include_package_data enabled:

# xxx/setup.py
# Automatically created by: scrapyd-deploy

from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = risk_control_info.settings']},
    include_package_data=True  # enable the manifest file MANIFEST.in
)
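
Since most of the pitfalls here are about whether the file actually ends up in the package, it may help to inspect the built egg before deploying. An egg is a zip archive, so the standard `zipfile` module is enough. This is a minimal sketch; `egg_contains` is a hypothetical helper, and the egg path in the usage note is only an example.

```python
import zipfile


def egg_contains(egg_path: str, member_suffix: str) -> bool:
    """Return True if any file inside the egg ends with member_suffix."""
    with zipfile.ZipFile(egg_path) as egg:
        return any(name.endswith(member_suffix) for name in egg.namelist())
```

For example, `egg_contains("dist/project-1.0-py3.8.egg", "spiders/static/stealth.min.js")` would be expected to return True once MANIFEST.in and `include_package_data=True` are both in place; the exact egg file name depends on the Python version used for the build.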
