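JS anti-scraping
JS anti-scraping refers to anti-crawler measures implemented in JavaScript that a crawler runs into when fetching page data. Common approaches include dynamic rendering, asynchronous loading, CAPTCHAs, and IP restrictions. These techniques make scraping considerably harder and help protect a site's data.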
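JS reverse engineering
To get past JS anti-scraping measures, you have to reverse-engineer the JavaScript. This takes solid JavaScript programming and code-analysis skills, and counts as an advanced skill for crawler developers.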
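Reverse-engineering workflow
- Analyze the site: study the target site and determine whether any parameters affect data retrieval;
- Locate the encryption: use search, breakpoints, hooks, and similar techniques to manually find where the parameter is encrypted;
- Port the code: move the encryption algorithm or logic you found into your own code. This can involve decoding, decryption, and reconstructing algorithms, and you also have to deal with obstacles such as code obfuscation and algorithm changes;
- Optimize the code: clean up the reverse-engineered code so the crawler runs reliably, and keep it up to date as the anti-scraping measures change.
JS reverse engineering is complex work: beyond solid JavaScript skills, it takes the ability to analyze code and understand algorithms, plus the patience and persistence to work through the obstacles that come up.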
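JS encryption algorithms
Signs that JS reverse engineering is needed when paging or fetching new data:
- The request-body parameters change, and no pattern can be found in the changes.
- The request body and request headers are exactly identical to the browser's, yet the request still fails; the parameters probably include a timestamp.
- Requests for the same data carry different parameters each time.
How to confirm: send the same request to an endpoint twice and compare the two requests. The parameters that differ are the ones that are encrypted.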
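The comparison described above can be sketched in Python. This is a minimal illustration, not part of the original scripts: `changed_params` is a hypothetical helper, and the two payloads are made-up values standing in for request bodies copied from the browser's network panel.

```python
def changed_params(first: dict, second: dict) -> dict:
    """Return {name: (value_in_first, value_in_second)} for every parameter that differs."""
    names = first.keys() | second.keys()
    return {
        name: (first.get(name), second.get(name))
        for name in sorted(names)
        if first.get(name) != second.get(name)
    }


# Made-up payloads captured from two requests to the same endpoint.
request_a = {"page": "2", "channelId": "40009", "signKey": "aaa111", "timeStamp": "1709015253483"}
request_b = {"page": "2", "channelId": "40009", "signKey": "bbb222", "timeStamp": "1709015260000"}

diff = changed_params(request_a, request_b)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Parameters that show up in the diff (here `signKey` and `timeStamp`) are the ones worth searching for in the site's JavaScript.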
Case study 1
Baidu Translate: fanyi.baidu.com/?aldtype=16… (encrypted request-body parameter)
百度翻译js逆向.py
# coding: utf-8
import requests
import execjs
url = 'https://fanyi.baidu.com/v2transapi'
cookies = {
'APPGUIDE_10_6_9': '1',
'BAIDUID': '0AC910D623E84E2AF2EC0FEB2844D6C4:FG=1',
'BAIDUID_BFESS': '0AC910D623E84E2AF2EC0FEB2844D6C4:FG=1',
'BIDUPSID': '92068D3C5A4CAF9FD74E85044409657D',
'FANYI_WORD_SWITCH': '1',
'HISTORY_SWITCH': '1',
'H_PS_PSSID': '40124_40161_40200_40210_40207_40215_40223_40266_40079_40294_40290_40288_40285',
'H_WISE_SIDS': '40124_40161_40200_40210_40207_40215_40223_40266_40079',
'H_WISE_SIDS_BFESS': '40039_39939_40124_40161_40200_40210_40207_40215',
'Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574': '1709014343',
'Hm_lvt_64ecd82404c51e03dc91cb9e8c025574': '1708245569,1708313786,1708586584,1709014343',
'PSTM': '1699327572',
'REALTIME_TRANS_SWITCH': '1',
'SOUND_PREFER_SWITCH': '1',
'SOUND_SPD_SWITCH': '1',
'ZFY': 'GJJUPI80hEogV40Y:BfXkGhmboDFI5nAvm7Iu9p1Z2fQ:C',
'ab_sr': '1.0.1_MDM0ZDAxMDAwMWUxMGFlMTNhNDc5NzhjYzJiMTBkYjU0ZTZhYTIyMjQyZDhmMTNjMWNkZTI3MTI1ODVkYzkxZjhmODYxZjIwNTRkMTlhZWM3ZmM3Njg2YTgzZmQzMGYzNWYzMTQzNGRlMDM4Nzk2YTg2OGU1N2NjNTdlNjI4NzNkZmI0OGRkNjI2MzkyYTRlNzAxNzQwMWYyMDU5YWRlNQ==',
}
headers = {
'Accept': '*/*',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Acs-Token': '1708948806239_1709015253494_iiNu8giQN7SGz7NKJn2ymjISVlKelYfD8piAKrUU60RUrbGY3KJLa/HVG7Xs7tBhH6GBQyncEKnra4HsxMwpoWXU9b5oPQuapRf0UEiiX2IIbeaZ+3ial7JSpw6s/npoL9FPGW35BnjnOO7hmzEAeIDTHEj5Iw/ccWVH3tq+QRh0GDdVNTNHElCNpKJyI9/A7o5JxVDWNKAYTGvumidwAFp50FDobC2xS4BWuEfZ7bkmvs/UXt2aSZAzoTzlLVVmuDfEj6PdHsFGNnzocuVyFB4jbg4RF6j8V+18RmUmR8MXozpdq/dc2237nyNAx65WD6GZJOhJ34LffO6SKfqp0n+YiaP9Adnm1vUDH/CPXcRkCap9lQHjPxFi2PbSyZpKe5OCj0EpGP4XSQZn28UaGXLB1qImONBbLYmW880zWE6O6N8YWbMbtiHKqmhRCSyZrGtPHThsWLGFK5bJ4Q62ziRo5hlmsCobeV5tiOm3IYE=',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Origin': 'https://fanyi.baidu.com',
'Referer': 'https://fanyi.baidu.com/?aldtype=16047',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
}
with open('bd.js', 'r', encoding='utf-8') as file:
    js = execjs.compile(file.read())  # compile the JS source so its functions can be called from Python
query='你好'
data = {
'from': 'zh',
'to': 'en',
'query': query,
'transtype': 'realtime',
'simple_means_flag': '3',
'sign': js.call('b', query),  # call function b in the JS file, passing query as the argument
'token': 'ea5952a149816a41548acc9f575cdc8f',
'domain': 'common',
'ts': '1709015253483',
}
# 'ts': '1709015253483' is a millisecond timestamp; for comparison, time.time() returned 1709016027.882114 when this was written
response = requests.post(url, headers=headers, data=data, cookies=cookies)
print(response.json())
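# 1. Find where a request-body parameter comes from by searching for its name directly.
# Search for the exact name `sign`; if there are too many hits, append a colon and search for `sign:`.
# Go through the matching files one by one, find the likely locations, and set breakpoints there.
# A breakpoint pauses execution at that line; nothing after it runs until you resume.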
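The `ts` value in `data` above is a hard-coded millisecond timestamp copied from the browser. A minimal sketch of generating a fresh one for each request follows. It assumes the server accepts any current 13-digit millisecond timestamp, which is not verified here; `current_ts` is a hypothetical helper name.

```python
import time


def current_ts() -> str:
    """Return the current Unix time in milliseconds as a string, the same shape as 'ts'."""
    return str(int(time.time() * 1000))


ts = current_ts()
print(ts)
```

The result could then replace the literal, for example `'ts': current_ts()`.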
bd.js
function n(t, e) {
for (var n = 0; n < e.length - 2; n += 3) {
var r = e.charAt(n + 2);
r = "a" <= r ? r.charCodeAt(0) - 87 : Number(r),
r = "+" === e.charAt(n + 1) ? t >>> r : t << r,
t = "+" === e.charAt(n) ? t + r & 4294967295 : t ^ r
}
return t
}
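// window is the global object in the browser; outside the browser it can be stubbed with window = {}
// || short-circuits: it returns the left operand if that is truthy, otherwise the right operand
// Note: the helper e(...) called inside b is not included in this file, so b throws a ReferenceError
// for most input that contains surrogate pairs (e.g. emoji); input without them works as expected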
function b(t) {
var o, i = t.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
if (null === i) {
var a = t.length;
a > 30 && (t = "".concat(t.substr(0, 10)).concat(t.substr(Math.floor(a / 2) - 5, 10)).concat(t.substr(-10, 10)))
} else {
for (var s = t.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), c = 0, u = s.length, l = []; c < u; c++)
"" !== s[c] && l.push.apply(l, function(t) {
if (Array.isArray(t))
return e(t)
}(o = s[c].split("")) || function(t) {
if ("undefined" != typeof Symbol && null != t[Symbol.iterator] || null != t["@@iterator"])
return Array.from(t)
}(o) || function(t, n) {
if (t) {
if ("string" == typeof t)
return e(t, n);
var r = Object.prototype.toString.call(t).slice(8, -1);
return "Object" === r && t.constructor && (r = t.constructor.name),
"Map" === r || "Set" === r ? Array.from(t) : "Arguments" === r || /^(?:Ui|I)nt(?:8|16|32)(?:Clamped)?Array$/.test(r) ? e(t, n) : void 0
}
}(o) || function() {
throw new TypeError("Invalid attempt to spread non-iterable instance.\nIn order to be iterable, non-array objects must have a [Symbol.iterator]() method.")
}()),
c !== u - 1 && l.push(i[c]);
var p = l.length;
p > 30 && (t = l.slice(0, 10).join("") + l.slice(Math.floor(p / 2) - 5, Math.floor(p / 2) + 5).join("") + l.slice(-10).join(""))
}
var r = "320305.131321201"; // value hard-coded from the debugger so the function runs on its own
for (var d = "".concat(String.fromCharCode(103)).concat(String.fromCharCode(116)).concat(String.fromCharCode(107)), h = (null !== r ? r : (r = "320305.131321201" || "") || "").split("."), f = Number(h[0]) || 0, m = Number(h[1]) || 0, g = [], y = 0, v = 0; v < t.length; v++) {
var _ = t.charCodeAt(v);
_ < 128 ? g[y++] = _ : (_ < 2048 ? g[y++] = _ >> 6 | 192 : (55296 == (64512 & _) && v + 1 < t.length && 56320 == (64512 & t.charCodeAt(v + 1)) ? (_ = 65536 + ((1023 & _) << 10) + (1023 & t.charCodeAt(++v)),
g[y++] = _ >> 18 | 240,
g[y++] = _ >> 12 & 63 | 128) : g[y++] = _ >> 12 | 224,
g[y++] = _ >> 6 & 63 | 128),
g[y++] = 63 & _ | 128)
}
for (var b = f, w = "".concat(String.fromCharCode(43)).concat(String.fromCharCode(45)).concat(String.fromCharCode(97)) + "".concat(String.fromCharCode(94)).concat(String.fromCharCode(43)).concat(String.fromCharCode(54)), k = "".concat(String.fromCharCode(43)).concat(String.fromCharCode(45)).concat(String.fromCharCode(51)) + "".concat(String.fromCharCode(94)).concat(String.fromCharCode(43)).concat(String.fromCharCode(98)) + "".concat(String.fromCharCode(43)).concat(String.fromCharCode(45)).concat(String.fromCharCode(102)), x = 0; x < g.length; x++)
b = n(b += g[x], w);
return b = n(b, k),
(b ^= m) < 0 && (b = 2147483648 + (2147483647 & b)),
"".concat((b %= 1e6).toString(), ".").concat(b ^ f)
}
// e="香蕉"
// sign = b(e)
// console.log(sign);
// "320305.131321201"
// "320305.131321201"
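As a small illustration of the code-porting step, the pre-processing at the top of `b` (the branch taken when the input has no surrogate pairs) can be sketched in Python. This covers only the truncation rule for queries longer than 30 characters, not the hashing that follows. `shorten_query` is a hypothetical name, and the sketch assumes plain BMP text so that Python's `len` agrees with JavaScript's `length`.

```python
def shorten_query(text: str) -> str:
    """Mirror the truncation in b(): keep the first 10, middle 10, and last 10 characters."""
    length = len(text)
    if length <= 30:
        return text
    middle_start = length // 2 - 5  # same as Math.floor(a / 2) - 5 in bd.js
    return text[:10] + text[middle_start:middle_start + 10] + text[-10:]


sample = "abcdefghijklmnopqrstuvwxyzabcdefghijklmn"  # 40 characters
print(shorten_query(sample))
print(shorten_query("你好"))  # short input is returned unchanged
```

Porting small pieces like this and comparing each against the output of the original JS is one way to check a port before moving on to the bit-shuffling in `n`.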
Case study 2
Maoyan Pro: piaofang.maoyan.com/dashboard
猫眼专业版.py
# coding: utf-8
import crawles
import execjs
url = 'https://piaofang.maoyan.com/dashboard-ajax'
cookies = {
'_lxsdk': '186a6f6b9d4c8-0ddef66ba7f5d9-26021051-1fa400-186a6f6b9d5c8',
'_lxsdk_cuid': '186a6f6b9d4c8-0ddef66ba7f5d9-26021051-1fa400-186a6f6b9d5c8',
'_lxsdk_s': '18de573e10a-e39-16-d4a%7C%7C1',
}
headers = {
'Accept': 'application/json, text/plain, */*',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Referer': 'https://piaofang.maoyan.com/dashboard',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57',
'X-FOR-WITH': 'Zj+YjTqKOkz6iFKWUbN/kEXb1kVS0Xgt0OuyPfp9hJTTT87oHnFXVe8//gWpSBOPUTJgp12SaXtR7UO0Iy/kQx+mx65JuPDvYNrT4eKltfZ4cyxnnqMJFmNDM49JozGSIvxNbhW+bnoO5wzBbVfanQ+zVMKVkd0ppM8DNG2busvYYPZFtByO+TQj6ovJYcMFnG9AMCf4Wm4pxpnrj0peWg==',
'sec-ch-ua': '"Microsoft Edge";v="113", "Chromium";v="113", "Not-A.Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
}
# 'index' below can also be generated with index = random.randint(0, 1000)
params = {
'User-Agent': 'TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzExMy4wLjAuMCBTYWZhcmkvNTM3LjM2IEVkZy8xMTMuMC4xNzc0LjU3',
'channelId': '40009',
'index': '991',
'orderType': '0',
'sVersion': '2',
'timeStamp': '1708951665245',
'uuid': '186a6f6b9d4c8-0ddef66ba7f5d9-26021051-1fa400-186a6f6b9d5c8',
}
with open('my.js', 'r', encoding='utf-8') as file:
    js = execjs.compile(file.read())
params['signKey'] = js.call('b', params['index'], params['timeStamp'])
# 'signKey': '686651c99b3b916e6dbde5b600be3a15',
# index
# signKey "9f8a780184ebc39c6f7fb82e0e24ef76"
# Encrypted values are usually generated close together in the code
# Current timestamp: 1708951709.0289228
response = crawles.get(url, headers=headers, params=params, cookies=cookies)
# print(response.json())
for data in response.json()['movieList']['data']['list']:
    print(f"Title: {data['movieInfo']['movieName']}"
          f"  Total box office: {data['boxSplitUnit']['num']}"
          f"  Box-office share: {data['boxRate']}"
          f"  Screenings: {data['showCount']}")
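The query parameters in the script above are hard-coded from one captured request. A minimal sketch of building them fresh for each request follows. It relies on two observations from the listing: the `User-Agent` parameter is the base64 encoding of the User-Agent header, and `index` can be a random integer, as the comment in the script suggests. `build_params` is a hypothetical helper, and the `uuid` value in the example is a placeholder; whether the server accepts arbitrary values is not verified here.

```python
import base64
import random
import time


def build_params(user_agent: str, uuid: str) -> dict:
    """Build the query parameters (without signKey) with a fresh index and timestamp."""
    return {
        "User-Agent": base64.b64encode(user_agent.encode("utf-8")).decode("ascii"),
        "channelId": "40009",
        "index": str(random.randint(0, 1000)),
        "orderType": "0",
        "sVersion": "2",
        "timeStamp": str(int(time.time() * 1000)),  # milliseconds, 13 digits
        "uuid": uuid,
    }


ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
params_example = build_params(ua, uuid="placeholder-uuid")
print(params_example)
```

`signKey` would then be computed from the same `index`, `timeStamp`, and base64 `User-Agent` that are sent, so the three stay consistent with the signature.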
my.js
// _jsMd2.default
// c
// i(269)
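// Import CryptoJS
CryptoJS = require("crypto-js");
// Install the crypto-js module first: npm install crypto-js
// crypto-js covers most encryption/decryption needs; note that the install command must be run in the current project directory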
function b(index,timeStamp){
c=`method=GET&timeStamp=${timeStamp}&User-Agent=TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzEyMi4wLjAuMCBTYWZhcmkvNTM3LjM2&index=${index}&channelId=40009&sVersion=2&key=A013F70DB97834C0A5492378BD76C53A`
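// Note: the base64 User-Agent hard-coded in c differs from the 'User-Agent' value in params in 猫眼专业版.py;
// the signature is computed over it, so it presumably has to match the value that is actually sent
f = (0,CryptoJS.MD5)(c['replace'](/\s+/g, " ")) // the g flag means global; \s matches one whitespace character, \s+ one or more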
signKey = f
return signKey.toString()
}
//c="method=GET&timeStamp=1709022588516&User-Agent=TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzEyMi4wLjAuMCBTYWZhcmkvNTM3LjM2&index=339&channelId=40009&sVersion=2&key=A013F70DB97834C0A5492378BD76C53A"
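// f = (0,CryptoJS.MD5)(c['replace'](/\s+/g, " ")) // the g flag means global; \s matches one whitespace character, \s+ one or more
// The (0,_jsMd2.default) syntax looks odd; it is a comma expression. A comma expression evaluates each of its sub-expressions from left to right and returns the value of the last one.
// signKey = f
//
// console.log(signKey.toString()); // get the MD5 hash as a hex string
// Obfuscation: in JavaScript, any property access written with a dot can also be written with bracket notation
// s='abc'
// console.log(s.replace('a', 'b'));
// console.log(s['replace']('a', 'b')); // the string 'replace' here is what typically gets encoded to obfuscate the code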
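Because `b` in my.js comes down to an MD5 hash of a formatted string, the same logic can also be written directly in Python, which removes the need for Node.js and crypto-js. The following is a sketch, not the original author's method: `sign_key` is a hypothetical name, the key and field order are copied from my.js, and the output is expected to match `CryptoJS.MD5(...).toString()` for ASCII input, though that has not been checked against the live site.

```python
import hashlib
import re

# Copied from the string template in my.js.
SECRET_KEY = "A013F70DB97834C0A5492378BD76C53A"


def sign_key(index, time_stamp, user_agent_b64: str) -> str:
    """Rebuild the string signed in my.js and return its MD5 hex digest."""
    raw = (
        f"method=GET&timeStamp={time_stamp}&User-Agent={user_agent_b64}"
        f"&index={index}&channelId=40009&sVersion=2&key={SECRET_KEY}"
    )
    normalized = re.sub(r"\s+", " ", raw)  # same as c.replace(/\s+/g, " ")
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Example values; the User-Agent is the base64 of "Mozilla/5.0".
example = sign_key(index=339, time_stamp="1709022588516", user_agent_b64="TW96aWxsYS81LjA=")
print(example)
```

Taking the base64 User-Agent as an argument, instead of hard-coding it as my.js does, keeps the signed string in step with the `User-Agent` parameter that is actually sent.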
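Summary:
- When paging or fetching the data again, the request-body parameters change
- Search all sources for the parameter and keep only the exact name matches
- Set breakpoints on those matches
- Debug (refresh the page or submit new data)
- When a breakpoint is hit, check whether the data at that location is the data we need
- Extract the JS code, hard-coding values where needed so that it can run on its own
- Look for bracket-notation ([]) property access and similar code nearby to decide whether the code is obfuscated, and undo the obfuscation by hand
- Once debugging confirms it works, wire it into the Python code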