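Preface:
To keep crawlers out, companies keep changing how their front end and back end exchange data. But as the saying goes, "the law climbs a foot, the outlaw climbs ten." For the common obstacles you hit with selenium
(iframe, svg, and API traffic, i.e. Network requests and responses), experts around the web have already published countermeasures.
This article covers how to capture the browser's Network requests and responses, plus some lessons learned the hard way (tears...).
Main text:
Setup:
This example scrapes API data from Ximalaya. Target URL: https://www.ximalaya.com/sound/148249100. The goal is to capture its Network requests and responses.
Capturing Network traffic with Browsermob-Proxy
Download link:
The code: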
import json

from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the chromedriver executable
driverP = Service(r'xxxx/xxxxx/path-to-chromedriver')
# Start Browsermob-Proxy
BPserver = Server(r"<Browsermob-Proxy unzip path>\browsermob-proxy-2.1.4\bin\browsermob-proxy", {'port': 9723})
BPserver.start()
Bproxy = BPserver.create_proxy()
# ChromeOptions configuration: route the browser through the proxy
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server={0}'.format(Bproxy.proxy))
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(options=options, service=driverP)
base_url = "https://www.ximalaya.com/sound/148249100"
# Start recording a HAR, including headers and response bodies
Bproxy.new_har("douyin", options={'captureHeaders': True, 'captureContent': True})
driver.get(base_url)
result = Bproxy.har
for entry in result['log']['entries']:
    _url = entry['request']['url']
    # Find the data API by its URL
    if "/bdsp/album/pay" in _url:
        _response = entry['response']
        # Some responses have no body, so 'text' may be missing
        _content = _response['content'].get('text', '')
        # Parse the API response, falling back to the raw text
        try:
            json_content = json.loads(_content)
            print("json_content", json_content)
        except ValueError:
            print("_content", _content)
BPserver.stop()
driver.quit()
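The filtering loop above can also be pulled out into a small helper that works on any HAR dict, which makes it easy to try without launching a browser. This is only a sketch: the name extract_bodies is made up for illustration, and it assumes the proxy follows the HAR convention of marking binary bodies with an "encoding" field set to "base64".

```python
import base64
import json


def extract_bodies(har, url_fragment):
    """Return decoded response bodies for entries whose URL contains url_fragment.

    Hypothetical helper, not part of browsermobproxy.
    """
    bodies = []
    for entry in har.get("log", {}).get("entries", []):
        if url_fragment not in entry["request"]["url"]:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            # Redirects and empty responses carry no body at all.
            continue
        if content.get("encoding") == "base64":
            # HAR allows bodies to be stored base64-encoded.
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            bodies.append(json.loads(text))
        except ValueError:
            # Not JSON: keep the raw text instead.
            bodies.append(text)
    return bodies
```

With the script above, the call would be extract_bodies(Bproxy.har, "/bdsp/album/pay").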
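Looks nice, doesn't it? It works quite well on mac/linux, but on Windows I got careless and paid for it. Here are the pitfalls, learned in blood:
Pitfall 1: Getting BrowsermobProxy called correctly. Once an automation project reaches a certain size, driver.get(base_url) and the driver configuration live in separate places. To use BrowsermobProxy, its configuration has to be added to the code in both, which makes the scripts bloated and inflexible.
Solution:
None. I couldn't think of anything better; you can read the code and experiment yourself.
A.py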
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def drivers(proxyPath, driverPath):
    # Path to the chromedriver executable
    driverP = Service(driverPath)
    # Start Browsermob-Proxy
    BPserver = Server(proxyPath, {'port': 9723})
    BPserver.start()
    Bproxy = BPserver.create_proxy()
    # ChromeOptions configuration: route the browser through the proxy
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server={0}'.format(Bproxy.proxy))
    options.add_argument('--ignore-certificate-errors')
    driver = webdriver.Chrome(options=options, service=driverP)
    # Return all three: the caller needs the proxy for the HAR and the server to stop it
    return BPserver, Bproxy, driver
B.py
import json

import A as a

# Raw string so the backslashes in the Windows path are not treated as escapes
proxyPath = r"<Browsermob-Proxy unzip path>\browsermob-proxy-2.1.4\bin\browsermob-proxy"
driverPath = "xxx/xxx/chromedriver"
BPserver, Bproxy, driver = a.drivers(proxyPath, driverPath)
base_url = "https://www.ximalaya.com/sound/148249100"
# Start recording a HAR, including headers and response bodies
Bproxy.new_har("douyin", options={'captureHeaders': True, 'captureContent': True})
driver.get(base_url)
result = Bproxy.har
for entry in result['log']['entries']:
    _url = entry['request']['url']
    # Find the data API by its URL
    if "/bdsp/album/pay" in _url:
        _response = entry['response']
        # Some responses have no body, so 'text' may be missing
        _content = _response['content'].get('text', '')
        # Parse the API response, falling back to the raw text
        try:
            json_content = json.loads(_content)
            print("json_content", json_content)
        except ValueError:
            print("_content", _content)
BPserver.stop()
driver.quit()
P.S. Smart as you are, you have probably already spotted the inconvenience.
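One way to take some of the sting out of this, though certainly not a full fix, is to wrap startup and teardown in a context manager so the caller cannot forget to stop the server or quit the driver. The sketch below is an assumption of mine, not something from the original project: capture_session and make_driver are hypothetical names, and the function only relies on the same calls used above (start, create_proxy, stop, the proxy attribute, and quit).

```python
from contextlib import contextmanager


@contextmanager
def capture_session(server, make_driver):
    """Start the proxy server, build a driver behind it, and always clean up.

    server: an object with start(), create_proxy() and stop(),
            for example a browsermobproxy Server instance.
    make_driver: hypothetical callable that receives the proxy address
                 (the value of proxy.proxy) and returns an object with quit().
    """
    server.start()
    try:
        proxy = server.create_proxy()
        driver = make_driver(proxy.proxy)
        try:
            yield proxy, driver
        finally:
            # Runs even if the scraping code raises.
            driver.quit()
    finally:
        server.stop()
```

In B.py this would be used as "with capture_session(Server(proxyPath, {'port': 9723}), make_chrome) as (Bproxy, driver):", where make_chrome is a function you write yourself that builds ChromeOptions with --proxy-server set to the address it receives. The proxy configuration still has to live somewhere, but at least the cleanup is in one place.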
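Pitfall 2: On Windows, the BrowsermobProxy processes are a big headache. server.stop() does not actually kill all of the BrowsermobProxy processes, so the second run fails with: "10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None". I scoured Google and Baidu and found no explanation; after a reboot it ran fine again. I was questioning my whole career as a programmer before it turned out that leftover processes were the cause.
Solution: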
import psutil

def terminate_browsermob_processes():
    # Find BrowserMob-Proxy processes that may still be alive and kill all of them
    killed = False
    for process in psutil.process_iter():
        try:
            process_info = process.as_dict(attrs=['name', 'cmdline'])
            if process_info.get('name') in ('java', 'java.exe'):
                # cmdline is None when the process cannot be inspected
                if '-Dapp.name=browsermob-proxy' in (process_info.get('cmdline') or []):
                    process.kill()
                    killed = True
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return killed

if __name__ == '__main__':
    # Check for leftover browsermobproxy processes
    browsermob = terminate_browsermob_processes()
    print("browsermobproxy kill result: {}".format(browsermob))
So the only option is to check for leftover BrowserMob-Proxy processes before each run. Oh well, it is practical enough.
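A cheap extra check before starting the server is to see whether anything is still listening on the port you pass to Server (9723 in the scripts above). This is a minimal sketch using only the standard library; port_in_use is a hypothetical helper name, and it assumes the leftover process is still holding that port on the local machine, which I have not verified for every failure case.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(9723) is true before BPserver.start(), that is a hint to run the process cleanup first.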
Pitfall 3: The fatal one! When running behind a VPN (look it up if the term is unfamiliar), BrowserMob-Proxy fails about 90% of the time. For details, see: blog.csdn.net/qq_36991535…