Intro:
Goal: hide the webdriver property on a distributed Selenium Grid.
The focus is on the pitfalls of packaging, deployment, reading data files, and paths. The configurable general-purpose crawler I wrote prioritizes development efficiency, deployment efficiency, and ease of management.
Details:
Package stealth.min.js into the egg, manage the spiders with the Scrapy framework, deploy through Scrapyd, and schedule with ScrapydWeb.
Hide the webdriver property on a distributed Selenium Grid.
Target sites:
Requirement: monitor announcements from local financial regulatory bureaus — spiders for dozens of provinces and cities, built and maintained within a day.
For example, these two sites detect the browser:
aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s
aHR0cDovL2pyai5oYWluYW4uZ292LmNuL3NqcmIvemNjYy9uZXd4eGdrX2luZGV4LnNodG1s
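(The two strings above are Base64-encoded URLs — a common convention for sharing target sites without linking them directly. Decoding is a one-liner:)

```python
import base64

# Decode the first target URL from the article
encoded = "aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s"
url = base64.b64decode(encoded).decode("utf-8")
print(url)  # the Chengdu financial bureau's announcement list page
```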
# Configurable crawler: sample config entry
{
    "local": "海南省",
    "organization": "xxxx地方金融监督管理局",
    "link": "xxxxxx.shtml",
    "article_rows_xpath": '//a[contains(text(), "公示公告")]/../following-sibling::div[1]/ul/li',
    "title_xpath": "./a",
    "title_parse": "./@title",
    "title_link_xpath": "./a/@href",
    "date_xpath": "./em",
    "date_parse": "./text()",
    "prefix": "http://jrj.hainan.gov.cn/",
    "note": "{'way':'selenium', 'use_proxy':'False'}",
},
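The `note` field above stores crawl options as a Python-dict literal inside a string. One safe way to turn it into a usable dict — assuming this exact format; the original post doesn't show how `note` is consumed — is `ast.literal_eval`:

```python
import ast

# Hypothetical helper: the original project's parsing code is not shown
def parse_note(note: str) -> dict:
    """Parse the crawl options stored as a dict literal in the config's `note` field."""
    return ast.literal_eval(note.strip())

options = parse_note("{'way':'selenium', 'use_proxy':'False'} ")
print(options["way"])        # which fetch backend to use, e.g. 'selenium'
print(options["use_proxy"])  # note: stored as the string 'False', not a bool
```

`ast.literal_eval` only evaluates literals, so a malformed or malicious `note` raises instead of executing code — safer than `eval` for config data.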
Tricky parts, explained:
1. Hiding webdriver: demo code
Local version

# Local version
# -*- coding:utf-8 -*-
# @Author: clark
# @Time: 2021/3/4 5:40 PM
# @File: webdriver_hide_feature.py
# @Project demand: fully hide the webdriver fingerprint (as of this writing)
import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
driver = Chrome(options=chrome_options)

# Inject stealth.min.js so it runs before any page script
with open('stealth.min.js') as f:
    js = f.read()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": js
})

driver.get('https://bot.sannysoft.com/')
time.sleep(5)
driver.save_screenshot('screenshot.png')

# You can also save the page source as HTML and double-click it to inspect the full result
source = driver.page_source
with open('result.html', 'w') as f:
    f.write(source)
Remote Selenium Grid version
# -*- coding:utf-8 -*-
# @Author: clark
# @Time: 2021/4/8 2:38 PM
# @File: webdriver_hide_feature_remote.py
# @Project demand: fully hide the webdriver fingerprint, this time on a remote grid
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.remote_connection import ChromeRemoteConnection

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr='http://192.168.95.56:4444/wd/hub',
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS',
            'browserName': "chrome",
            'version': '',
            'javascriptEnabled': True
        },
        options=chrome_options
) as driver:
    with open('stealth.min.js') as f:
        js = f.read()
    # Remote drivers have no execute_cdp_cmd; route the CDP command through execute()
    print(driver.execute("executeCdpCommand", {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
        "source": js
    }}))
    driver.get('https://bot.sannysoft.com/')
    time.sleep(5)
    driver.save_screenshot('screenshot.png')
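The same CDP injection appears in both the local and the remote version, differing only in how the command is sent. It can be factored into a small helper — a sketch of my own, not from the original project (the function name and dispatch logic are assumptions):

```python
def inject_stealth(driver, js: str) -> dict:
    """Register stealth.min.js to run before any page script, on local or remote drivers."""
    if hasattr(driver, "execute_cdp_cmd"):
        # Local ChromeDriver exposes execute_cdp_cmd directly
        return driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument", {"source": js})
    # Remote drivers route Chrome CDP calls through the generic execute() command
    return driver.execute("executeCdpCommand", {
        "cmd": "Page.addScriptToEvaluateOnNewDocument",
        "params": {"source": js},
    })

# Usage (with a live driver):
#   with open('stealth.min.js') as f:
#       inject_stealth(driver, f.read())
```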
2. Packaging static assets into the egg, and reading data files from inside the package in production
Reference:
"It took two days, but I finally figured out Python's setup.py" (a Chinese write-up)
File path:
<scrapy project dir>/spiders/static/stealth.min.js
# MANIFEST.in
recursive-include risk_control_info/spiders/static *
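An egg is just a zip archive, so after `scrapyd-deploy` builds it you can check whether the JS file actually made it in. A minimal sketch — the member path below assumes the project layout shown above, and the demo builds an in-memory zip standing in for the real egg:

```python
import io
import zipfile

def egg_contains(egg_bytes: bytes, member: str) -> bool:
    """Return True if the egg (a zip archive) contains the given member path."""
    with zipfile.ZipFile(io.BytesIO(egg_bytes)) as zf:
        return member in zf.namelist()

# Demo with an in-memory zip in place of the real scrapyd-deploy output
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("risk_control_info/spiders/static/stealth.min.js", "// stealth")
print(egg_contains(buf.getvalue(), "risk_control_info/spiders/static/stealth.min.js"))  # True
```

With a real egg, read the bytes from the `.egg` file in the scrapyd `eggs/` directory; if the member is missing, the usual culprits are a wrong MANIFEST.in pattern or a missing `include_package_data=True` in setup.py.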
# xxx/spiders/big_finance_jgj_news.py
with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr="{}/wd/hub".format(SELENIUM_DOCKER_HOST),
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS',
            'browserName': "chrome",
            'version': '',
            'javascriptEnabled': True
        },
        options=options
) as browser:
    # Hide the webdriver property
    try:
        # Production: read the data file from inside the package (egg)
        import pkg_resources
        f = pkg_resources.resource_stream(__package__, 'static/stealth.min.js')
        js = f.read().decode()
        # self.logger.info(js)
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
    except Exception as e:
        self.logger.warning(f"Test environment, falling back to a local path: {e}")
        # Read from the local filesystem instead
        this_dir, this_filename = os.path.split(__file__)
        STEALTH_PATH = os.path.join(this_dir, "static", "stealth.min.js")
        self.logger.info(f"STEALTH_PATH: {STEALTH_PATH}")
        with open(STEALTH_PATH) as f:
            js = f.read()
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
# xxx/setup.py
# Automatically created by: scrapyd-deploy
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = risk_control_info.settings']},
    include_package_data=True,  # honor the MANIFEST.in manifest
)