Many anti-crawler checks operate on the headers: the server looks at the User-Agent or other fields and flags clients that reuse the same headers within a short window, much like rate-limiting by IP request frequency. That is why most crawlers randomize the UA or other parameters. In one sentence: whatever gets banned, rotate it.
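As a minimal sketch of that rotate-whatever-gets-banned idea (the UA pool and the target URL below are placeholders, not anything specific to this site):

import random
import requests

# Small pool of User-Agent strings to rotate through; in practice you would
# keep a larger, up-to-date list.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
]

def fetch(url):
    # Pick a fresh User-Agent per request so no single UA racks up enough
    # requests in a short window to trip the server's check.
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)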
- Recall the Jimu News (极目新闻) hot list: the ranking endpoint takes no cookies, yet after sending requests for a while it stops returning data. First, here is the code that requests the ranking:
import requests

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://www.ctdsb.net',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    # Two non-standard fields copied from a captured request; these are what go stale.
    'requestTime': '1684978507946',
    'token': '8b18ea22ecc76494da0d9c3fc3adafdd',
}
data = {
    'focusNo': '5',
    'publishFlag': '1',
    'pageNo': '1',
    'pageSize': '20',
    'column': '1476',
}
response = requests.post('http://yth.ctdsb.net/amc/client/listContentByColumn',
                         headers=headers, data=data, verify=False).json()
print(response)
for res in response['data']['contentList']:
    contentId = res['contentId']
    channelId = 'c' + str(res['channelId'])
    # creationTime looks like 'YYYY-MM-DD ...'; keep YYYYMM for the article URL.
    creationTime = str(res['creationTime']).replace('-', '')[:6]
    print(contentId, channelId, creationTime)
    url = f'http://www.ctdsb.net/{channelId}_{creationTime}/{contentId}.html'
    print(url)
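For reference, one way to watch the hardcoded pair go stale is to replay the captured request on a loop (an assumption on my part: a rejected request seems to come back without a populated contentList rather than as an HTTP error):

import time

# Hypothetical staleness probe: re-send the captured request once a minute and
# report whether the hot-list payload is still present. Reuses the headers and
# data dicts from the snippet above.
for minute in range(30):
    resp = requests.post('http://yth.ctdsb.net/amc/client/listContentByColumn',
                         headers=headers, data=data, verify=False).json()
    has_list = bool((resp.get('data') or {}).get('contentList'))
    print(f'minute {minute}: contentList present = {has_list}')
    time.sleep(60)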
With no cookies in play, the most likely place for the restriction is the headers themselves. Looking closely at them, two parameter names differ from the usual ones: requestTime and token. Past experience suggests the logic here is a timestamp and a token bound together, with the server rejecting any request whose embedded timestamp differs from the current time by more than some maximum.
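To make that hypothesis concrete, the server-side check presumably looks something like this sketch (the 30-second window and the function name are my guesses, not anything recovered from the site):

import time

MAX_SKEW_MS = 30_000  # assumed freshness window, not a confirmed value

def is_request_fresh(request_time_ms: int) -> bool:
    # Reject requests whose embedded timestamp drifts too far from server time,
    # so a captured (requestTime, token) pair stops working after a while.
    now_ms = int(time.time() * 1000)
    return abs(now_ms - request_time_ms) <= MAX_SKEW_MS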
- Having found what is being checked, the next step is to debug the source and see how these two parameters are generated. Open DevTools with F12.
A global search for token turns up the spot.
Set a breakpoint where the value is generated and refresh the page.
The breakpoint confirms that token and requestTime come from this block of code.
The code boils down to these steps:
- Generate the current millisecond timestamp (this becomes requestTime).
- Concatenate the salt string hbrb-app-amc, a $, and the timestamp, then MD5 it.
- Concatenate h5Client-id, a $, the digest from the previous step, another $, and the timestamp, then MD5 it again; the result is the token.
With the generation process clear, it translates straight into Python. The code:
import hashlib
import time

def USE_MD5(test):
    # Return the hex MD5 digest of a string (or bytes).
    if not isinstance(test, bytes):
        test = bytes(test, 'utf-8')
    m = hashlib.md5()
    m.update(test)
    return m.hexdigest()

def get_token():
    salt = "hbrb-app-amc"
    h5Str = "h5Client-id"
    timee = int(time.time() * 1000)          # current millisecond timestamp
    strr = salt + "$" + str(timee)           # round 1: salt$timestamp
    md5_test = USE_MD5(strr)
    strr = h5Str + "$" + md5_test + "$" + str(timee)  # round 2: h5Client-id$digest$timestamp
    md5_test = USE_MD5(strr)
    return md5_test, timee

def get_data():
    md5_test, timee = get_token()
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Origin': 'http://www.ctdsb.net',
        'Pragma': 'no-cache',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
        'requestTime': f'{timee}',
        'token': f'{md5_test}',
    }
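Before wiring this into requests, a quick sanity check is to replay the requestTime captured at the top of this post through the same two MD5 rounds and compare against the captured token (assuming that pair really did come from this algorithm; reuses USE_MD5 from above):

# Hypothetical replay check: rebuild the token from the captured requestTime
# and see whether it matches the captured token value.
captured_ts = '1684978507946'
captured_token = '8b18ea22ecc76494da0d9c3fc3adafdd'
round1 = USE_MD5('hbrb-app-amc' + '$' + captured_ts)
rebuilt = USE_MD5('h5Client-id' + '$' + round1 + '$' + captured_ts)
print(rebuilt, rebuilt == captured_token)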
OK, with that the header parameters are generated automatically, using the current timestamp on every run. The complete code:
import hashlib
import time

import requests
from gne import GeneralNewsExtractor

def USE_MD5(test):
    # Return the hex MD5 digest of a string (or bytes).
    if not isinstance(test, bytes):
        test = bytes(test, 'utf-8')
    m = hashlib.md5()
    m.update(test)
    return m.hexdigest()

def get_token():
    # Rebuild the token exactly as the page's JS does:
    # md5("h5Client-id$" + md5("hbrb-app-amc$" + ts) + "$" + ts)
    salt = "hbrb-app-amc"
    h5Str = "h5Client-id"
    timee = int(time.time() * 1000)
    strr = salt + "$" + str(timee)
    md5_test = USE_MD5(strr)
    strr = h5Str + "$" + md5_test + "$" + str(timee)
    md5_test = USE_MD5(strr)
    return md5_test, timee

def get_data():
    md5_test, timee = get_token()
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Origin': 'http://www.ctdsb.net',
        'Pragma': 'no-cache',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
        'requestTime': f'{timee}',
        'token': f'{md5_test}',
    }
    data = {
        'focusNo': '5',
        'publishFlag': '1',
        'pageNo': '1',
        'pageSize': '20',
        'column': '1476',
    }
    response = requests.post('http://yth.ctdsb.net/amc/client/listContentByColumn',
                             headers=headers, data=data, verify=False).json()
    print(response)
    for res in response['data']['contentList']:
        contentId = res['contentId']
        channelId = 'c' + str(res['channelId'])
        # Keep the YYYYMM part of creationTime for the article URL.
        creationTime = str(res['creationTime']).replace('-', '')[:6]
        print(contentId, channelId, creationTime)
        url = f'http://www.ctdsb.net/{channelId}_{creationTime}/{contentId}.html'
        print(url)
        get_xq(url)

def get_xq(url):
    # Fetch the article page and extract title/body with gne's general extractor.
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    }
    response = requests.get(url, headers=headers, verify=False)
    response.encoding = response.apparent_encoding
    Genera = GeneralNewsExtractor()
    aa = Genera.extract(response.text)
    print(aa)

if __name__ == '__main__':
    get_data()