Reverse-Engineering a Timestamp + Token Anti-Scraping Check in Request Headers


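Sites can block scrapers through headers in many ways. For example, they may check the User-Agent or other header fields to see whether the same headers are being reused for many requests in a short time, much like rate-limiting by IP. That is why most crawlers randomize the UA or other fields. In short: whatever gets blocked, rotate it.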
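
As a minimal sketch of the "rotate whatever gets blocked" idea, the helper below picks a random User-Agent for each request. The `USER_AGENTS` pool and the `build_headers` name are hypothetical and only for illustration; a real crawler would keep a larger, up-to-date list.

```python
import random

# Hypothetical pool of User-Agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]


def build_headers(base_headers=None):
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers or {})
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers
```

Calling `build_headers({...})` before each request leaves the other fields untouched and swaps only the UA.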

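  1. Recap: the 极目新闻 (Jimu News) hot-list ranking endpoint uses no cookies, yet after sending requests for a while the data stops coming back. Here is the code that requests the ranking: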
import requests

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://www.ctdsb.net',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'requestTime': '1684978507946',
    'token': '8b18ea22ecc76494da0d9c3fc3adafdd',
}
data = {
    'focusNo': '5',
    'publishFlag': '1',
    'pageNo': '1',
    'pageSize': '20',
    'column': '1476',
}

response = requests.post('http://yth.ctdsb.net/amc/client/listContentByColumn', headers=headers, data=data,
                         verify=False).json()
print(response)
for res in response['data']['contentList']:
    contentId = res['contentId']
    channelId = 'c' + str(res['channelId'])
    creationTime = str(res['creationTime']).replace('-', '')[:6]
    print(contentId, channelId, creationTime)
    url = f'http://www.ctdsb.net/{channelId}_{creationTime}/{contentId}.html'
    print(url)

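After a while this request stops returning data. Since no cookies are involved, the most likely cause is a check on the headers. Looking closely at the headers, two field names stand out as unusual: requestTime and token. From experience, the logic is probably this: one is a timestamp and the other is a token bound to that timestamp, and the server compares the submitted timestamp with its own current time and rejects the request once the difference exceeds some maximum.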
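
To make that guess concrete, here is a rough sketch of what such a server-side check could look like. This is not the site's actual code: the 60-second window (`MAX_SKEW_MS`), the function names, and the `sign` callable (a stand-in for whatever derives the token from the timestamp) are all assumptions.

```python
import time

MAX_SKEW_MS = 60_000  # hypothetical tolerance; the site's real window is unknown


def is_request_fresh(request_time_ms, now_ms=None, max_skew_ms=MAX_SKEW_MS):
    """Return True if the client timestamp is within max_skew_ms of the server clock."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return abs(now_ms - int(request_time_ms)) <= max_skew_ms


def is_request_valid(headers, sign, now_ms=None):
    """Check freshness first, then check that token equals sign(requestTime).

    `sign` is a placeholder for the server's token derivation function.
    """
    request_time = headers.get("requestTime", "")
    token = headers.get("token", "")
    if not request_time.isdigit():
        return False
    if not is_request_fresh(request_time, now_ms):
        return False
    return token == sign(request_time)
```

Under a check like this, a hard-coded requestTime/token pair copied from the browser works only until the timestamp ages out of the window, which would explain why the snippet above eventually fails.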

  2. With the check located, debug the page source to see how these two parameters are generated. Open DevTools (F12).

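Search all sources for token; there is a hit. Set a breakpoint where the value is generated and refresh the page. Stepping through at the breakpoint shows that token and requestTime are both produced by this piece of code. Translating it step by step: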

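First, generate the current timestamp in milliseconds (requestTime).

Then join the string hbrb-app-amc, a $ separator, and the timestamp, and take the MD5 hash of the result.

Then join the string h5Client-id, $, that first hash, $, and the timestamp, and take the MD5 hash again to get token.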

With the generation process for both parameters clear, translate it directly into Python:

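import hashlib
import time


def USE_MD5(test):
    # Return the hex MD5 digest of a str or bytes value.
    if not isinstance(test, bytes):
        test = bytes(test, 'utf-8')
    m = hashlib.md5()
    m.update(test)
    return m.hexdigest()


def get_token():
    salt = "hbrb-app-amc"
    h5Str = "h5Client-id"
    timee = int(time.time() * 1000)  # current time in milliseconds
    # First pass: MD5 of salt + "$" + timestamp
    strr = salt + "$" + str(timee)
    md5_test = USE_MD5(strr)
    # Second pass: MD5 of h5Str + "$" + first hash + "$" + timestamp
    strr = h5Str + "$" + md5_test + "$" + str(timee)
    md5_test = USE_MD5(strr)
    return md5_test, timee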


def get_data():
    md5_test, timee = get_token()
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Origin': 'http://www.ctdsb.net',
        'Pragma': 'no-cache',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
        'requestTime': f'{timee}',
        'token': f'{md5_test}',
    }
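
One way to sanity-check the translation is to recompute the token for the requestTime captured from the browser earlier and compare it with the captured token. The sketch below does that; `md5_hex` and `token_for` are hypothetical helper names, and `token_for` assumes the same two-step MD5 as get_token() but takes the timestamp as an argument.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def token_for(request_time_ms):
    """Same two-step MD5 as get_token(), but for a caller-supplied timestamp."""
    inner = md5_hex("hbrb-app-amc" + "$" + str(request_time_ms))
    return md5_hex("h5Client-id" + "$" + inner + "$" + str(request_time_ms))


# Pair copied from the hard-coded headers at the top of this post.
captured_time = "1684978507946"
captured_token = "8b18ea22ecc76494da0d9c3fc3adafdd"

print("translation matches captured token:", token_for(captured_time) == captured_token)
```

If the comparison prints False, the order of the joined pieces or the separator is the first thing to recheck against the JavaScript.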

OK, the header parameters are now generated automatically, and every run uses the current timestamp. The complete code:

import hashlib
import time

import requests
from gne import GeneralNewsExtractor

def USE_MD5(test):
    if not isinstance(test, bytes):
        test = bytes(test, 'utf-8')
    m = hashlib.md5()
    m.update(test)
    return m.hexdigest()


def get_token():
    salt = "hbrb-app-amc"
    h5Str = "h5Client-id"
    timee = int(time.time() * 1000)
    strr = salt + "$" + str(timee)
    md5_test = USE_MD5(strr)
    strr = h5Str + "$" + md5_test + "$" + str(timee)
    md5_test = USE_MD5(strr)
    return md5_test, timee


def get_data():
    md5_test, timee = get_token()

    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Origin': 'http://www.ctdsb.net',
        'Pragma': 'no-cache',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
        'requestTime': f'{timee}',
        'token': f'{md5_test}',
    }

    data = {
        'focusNo': '5',
        'publishFlag': '1',
        'pageNo': '1',
        'pageSize': '20',
        'column': '1476',
    }

    response = requests.post('http://yth.ctdsb.net/amc/client/listContentByColumn', headers=headers, data=data,
                             verify=False).json()
    print(response)
    for res in response['data']['contentList']:
        contentId = res['contentId']
        channelId = 'c' + str(res['channelId'])
        creationTime = str(res['creationTime']).replace('-', '')[:6]
        print(contentId, channelId, creationTime)
        url = f'http://www.ctdsb.net/{channelId}_{creationTime}/{contentId}.html'
        print(url)

        get_xq(url)


def get_xq(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    }

    response = requests.get(url, headers=headers, verify=False)
    response.encoding = response.apparent_encoding
    Genera = GeneralNewsExtractor()
    aa = Genera.extract(response.text)
    print(aa)


if __name__ == '__main__':
    get_data()
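
The loop in get_data() indexes response['data']['contentList'] directly, so it raises an exception if a blocked request comes back without that list. As a small defensive sketch, the helper below builds the same article URLs but returns an empty list when the list is missing. The name `extract_urls` is hypothetical, and the shape of a failed response is an assumption, since the post does not show one.

```python
def extract_urls(payload):
    """Build article URLs from a listContentByColumn response, or return [] if the list is missing."""
    data = payload.get('data') if isinstance(payload, dict) else None
    content_list = data.get('contentList') if isinstance(data, dict) else None
    urls = []
    for res in content_list or []:
        channel_id = 'c' + str(res['channelId'])
        # creationTime is reduced to its first six digits, as in the original loop.
        creation_month = str(res['creationTime']).replace('-', '')[:6]
        urls.append(f"http://www.ctdsb.net/{channel_id}_{creation_month}/{res['contentId']}.html")
    return urls
```

An empty result is then a signal to regenerate the token and retry rather than a crash.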
