1. Introduction
Product teams broadly agree on one thing: scripting support gives a product the extensibility and flexibility needed to meet customers' individual customization demands. We had similar requirements before, when we needed to integrate third-party applications and middleware into our platform. The tech stack at the time was Java, so I wrapped a dynamic-compilation component around Java's own dynamic compilation facilities; I may tidy it up and share it when I find the time.
For web applications, implementing scripting support with JS is a natural choice. Beyond web applications, crawlers that have to cope with anti-crawling measures also need the ability to execute scripts on the backend.
For example, here is a scenario I ran into while helping a friend deal with a website's anti-crawling measures.
The site implements its anti-crawling strategy with a signature mechanism, carried in the following request headers:
x-ca-key:203899271
x-ca-nonce:46d9d43c-54f4-46ac-bcfb-89d5d3a0c776
x-ca-signature:cAFHSOzeCjjbFjBVj6JIhbUsKbJJyHKoS7HLF1RvcH0=
x-ca-signature-headers:x-ca-key,x-ca-nonce
The signature mechanism is implemented on the client side in a JS script.
The x-ca-nonce parameter is produced by a function that generates a random UUID:
f = function(e) {
    var t = e || null;
    // If no value was supplied, fill in a random RFC 4122 version-4 UUID:
    // each x becomes a random hex digit, y becomes one of 8/9/a/b.
    return null == t && (t = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (function(e) {
        var t = 16 * Math.random() | 0;
        return ("x" === e ? t : 3 & t | 8).toString(16)
    }))), t
}
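As an aside, this is the standard version-4 UUID pattern (the fixed 4, plus a variant nibble drawn from 8/9/a/b), so if all you needed were the nonce itself, plain Python would do without any JS engine:

import uuid

# Pure-Python equivalent of the JS generator above
ca_nonce = str(uuid.uuid4())  # e.g. 46d9d43c-54f4-46ac-bcfb-89d5d3a0c776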
The final signature is an HMAC-SHA256 digest, Base64-encoded (the exact input string is reconstructed in section 2.3.2 below).
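In Python terms the signing primitive is straightforward; a minimal sketch (the string-to-sign itself is worked out in section 2.3.2):

import hmac
import hashlib
from base64 import b64encode

def hmac_sha256_b64(secret, string_to_sign):
    # x-ca-signature = Base64(HMAC-SHA256(secret, string-to-sign))
    digest = hmac.new(secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return b64encode(digest).decode()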
Since the crawler side of this project is written in Python, defeating this kind of anti-crawling means being able to execute JS scripts from Python. The example below shows, step by step, how we bypassed the protection.
Of course, today's focus is executing JS from Python; if you are interested in crawlers themselves, I may put together a dedicated crawler series later.
2. Example
2.1. Visiting the Nezha community
The target is CSDN's Nezha community forum (https://bbs.csdn.net/forums/nezha), first opened normally in a browser.
2.2. Accessing it with a crawler
import requests
url = "https://bizapi.csdn.net/community-cloud/v1/homepage/community/by/tag?deviceType=PC&tagId=1"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
"accept": "application/json, text/plain, */*",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cookie": """uuid_tt_dd=10_36832079180-1660385704790-634259; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac={"islogin":{"value":"0","scope":1},"isonline":{"value":"0","scope":1},"isvip":{"value":"0","scope":1}}; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_36832079180-1660385704790-634259; __gads=ID=ab63d7715dc2e0f3-2282c58b8ed500c7:T=1660385706:RT=1660385706:S=ALNI_Ma0EQOs5uHlLZtQGPYFpdoq54ekUw; __bid_n=1844bcefaaa19ac2e64207; c_segment=3; dc_sid=12a9dda1cd39a3ee2ea3e21add653e27; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1680481810,1681220462,1681607631; __gpi=UID=0000088ed13d5161:T=1660385706:RT=1682833209:S=ALNI_MbsHza3SbbUlcl8sKxCeqVG5UYm4A; FPTOKEN=A5rYyiEISFYQSXqLvce+WOMmmo+w824MM4zuNKyyheBAtpJJf0/kdBfciCvsOQgawa1KI6WF2hQFJLJqbV5LLKp2ibUsXOZcu6UfdxMMkjOzvzz1ft2IFenUeEUaQ0qNeQWeGxQJqqcuxbiSbG5cQ5XmU7fERCCXYqti7cdols1ttv1ZQzxw6zTgcHaa/EeHtFW1CVLt2cc0/EpS64FL3cjUZTL6u/N2cLKNFuuAuu4WtoZclmk7bITJ/Ku5u9rGbGaQg1Ws3ynBqFad9/XwpYydQ3AZQnxEakHBZwTgQ4ZZ5iJYdKQNGlEaMrWU1vxPszTsQQKQVhBLbZo/Zl3BOV4iPmDgmN8x7jw68L5m0mGDHoc0Jkdu49LCSmzrRDbCDYp91uflx4mxYu1eVCU/YA==|3tlxg6Wx4faTSoK2X5tzCxRZEpNtjqnJdVRN9V8OGiU=|10|0738f1fd42a6144fb5883981beb0c524; c_first_ref=default; c_first_page=https://www.csdn.net/; hide_login=1; is_advert=1; SESSION=2edf0501-9922-4967-98b3-ed635e280f28; ssxmod_itna=CqfhiIwGOGCtGHD8D2DEDkDBCz5DQ8qWe=T8HDlr4xA5D8D6DQeGTi2b8siG3hqbrBu+dQ0mbqKoEflrYAWNgae0aDbqGkd+G4iicDCeDIDWeDiDG4GmR4GtDpxG=DjCU1CgExYPGWtqDbDinOhxGCDeKD0ZRaDQKDu1KKdl2O4zph0gAkdYYPWCKD9xoDsriEFfAEA/h14SoAZpb7DlF6DCIC01HFt4GdfhyCZcqT5YxelCD3ejR4WArqbinbZirqdYxx7DbraG2bWADxokxNvqDWbQGRHYD===; ssxmod_itna2=CqfhiIwGOGCtGHD8D2DEDkDBCz5DQ8qWe=T8D6h3ED0vn503Wd72jeX=2D6i1QdyhPZnQkid3bSrB7L1lYbxOuAu0n+AXO8b9wRcFDgxQwHOTka1O8Ab2b2YXIEunki7gpM3kLO6phnbFOICjhn4Bgqi5nPKarhQkBcFbY9H2i05HS5FBhKb9iITq4IOwO3bGou4diOGW4IdituI3FFtpSefhowmY+O12GKFDoui6nvE7ECQ=Ia0ftCXTLgOt6SUTyB4AD07k408DY954D==; c_utm_source=csdn_bbs_toolbar; utm_source=csdn_bbs_toolbar; dc_session_id=10_1682924916939.195481; c_dsid=11_1682924872425.208034; log_Id_click=36; c_pref=https://bbs.csdn.net/forums/hacker; c_ref=https://bbs.csdn.net/forums/nezha?category=1; c_page_id=default; log_Id_pv=49; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1682925355; firstDie=1; dc_tos=rtyxij; log_Id_view=255""",
"origin": "https://bbs.csdn.net",
"referer": "https://bbs.csdn.net/forums/nezha?category=4",
"sec-ch-ua": '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "Windows",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-site",
"x-ca-key": "203899271",
"x-ca-nonce": "2e5572aa-d523-4o3c-b2bb-69fc0a9accb7",
"x-ca-signature": "yDsKW+DrW8ePjhbWgQ+aYAb9W8zzjfe5zCT7pjitniE=",
"x-ca-signature-headers": "x-ca-key,x-ca-nonce"
}
rsp = requests.get(url, headers=headers)
print(rsp.status_code)  # prints 401
Even with every captured request header replayed, the result is 401: blocked by the anti-crawler.
2.3. Defeating the anti-crawler
2.3.1. Preparing the environment
There are several ways to execute JS scripts from Python:
- PyExecJS: a thin wrapper that delegates to an external JS runtime (Node.js, JScript, and others)
- js2py: a JS interpreter written in pure Python
- PyMiniRacer: Python bindings that embed the V8 engine
I chose PyExecJS here. Being a third-party library, it needs to be installed first:
pip install PyExecJS
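Note that PyExecJS does not ship a JS engine of its own; it delegates to whatever runtime it finds on the machine (Node.js, JScript on Windows, and so on), so a JS runtime must be present. A quick smoke test:

import execjs

# Which runtime PyExecJS picked up, e.g. "Node.js (V8)"
print(execjs.get().name)

# One-off evaluation
print(execjs.eval("'red yellow blue'.split(' ')"))

# Compile once, call repeatedly
ctx = execjs.compile("function add(x, y) { return x + y; }")
print(ctx.call("add", 1, 2))  # 3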
2.3.2. Writing the signing script
import execjs
from base64 import b64encode
import hmac
import hashlib

# The nonce generator lifted from the site's client-side JS (a v4 UUID)
nonce_func = """
f = function(e) {
    var t = e || null;
    return null == t && (t = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (function(e) {
        var t = 16 * Math.random() | 0;
        return ("x" === e ? t : 3 & t | 8).toString(16)
    }))), t
}
"""

# Compile the JS once, then call f() to generate the x-ca-nonce value
py_func = execjs.compile(nonce_func)
ca_nonce = py_func.call("f")

# Build the string-to-sign: method, accept header, three empty slots
# (presumably content-md5, content-type and date in the gateway's canonical
# form), the signed headers, then the path with query string
app_secretKey = "bK9jk5dBEtjauy6gXL7vZCPJ1fOy076H"
ca_key = "203899271"
template = """GET\napplication/json, text/plain, */*\n\n\n\nx-ca-key:203899271\nx-ca-nonce:{}\n/community-cloud/v1/homepage/community/by/tag?deviceType=PC&tagId=1"""
result_template = template.format(ca_nonce)
result_template = result_template.encode("utf-8")

# x-ca-signature = Base64(HMAC-SHA256(secret, string-to-sign))
sign = b64encode(hmac.new(app_secretKey.encode("utf-8"), msg=result_template, digestmod=hashlib.sha256).digest()).decode()
print("nonce==[" + ca_nonce)
print("sign==[" + sign)
2.3.3. Verifying the result
Running the script prints a fresh nonce and signature, which we then plug back into the crawler request from section 2.2:
nonce==[2a81bf81-e2d3-4b27-985e-028a41ac3420
sign==[m17UJz93G2T8/uFrJei6o5Vwxefey9+ZuucYji4uNms=
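Plugging them in is just a matter of overwriting the stale header values (url and headers are the objects from section 2.2):

# Replace the captured, now-stale values with freshly generated ones
headers["x-ca-nonce"] = ca_nonce
headers["x-ca-signature"] = sign
rsp = requests.get(url, headers=headers)
print(rsp.status_code)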
The result: 200. The anti-crawler is beaten.
And that is it: by executing a JS script from Python, we defeated the anti-crawling mechanism. The core snippet:
nonce_func = """
f = function(e) {
var t = e || null;
return null == t && (t = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (function(e) {
var t = 16 * Math.random() | 0;
return ("x" === e ? t : 3 & t | 8).toString(16)
}
))),
t
}
"""
# 生成x-ca-nonce随机值
py_func = execjs.compile(nonce_func)
ca_nonce = py_func.call("f")