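Anti-scraping: IP bans plus cookie checks
Site:
Flow analysis:
Open a packet-capture tool, clear the browser history, and visit hd.chinatax.gov.cn/nszx/InitCr…
First request
Second request
Third request
URL of the third request: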
http://fxsjcj.kaipuyun.cn//logcount.php?WS=10003701&RD=common&SWS=&SWSID=&SWSPID=&JSVER=20161011&TDT=web&UC=_ck20073016022012811741728128923&LUC=&VUC=_vk1596096140278&FS=&RF=&PS=hd.chinatax.gov.cn&PU=%2Fnszx%2FInitCredit.html&PT=&PER=0&PC=&PI=&LM=1596096140000&LG=zh-CN&CL=24&CK=1&SS=1920*1080&SCW=1863&SCH=935&SSH=952&FT=1596096140278&LT=1596096140278&DL=0&FL=1&CKT=HttpCookie&JV=0&AL=0&SY=windows%20nt%2010.0&BR=chrome&TZ=-8&AU=&UN=&UID=&URT=&UA=&US=&TID=&MT=&FMSRC=same&MSRC=&MSCH=&EDM=&RC=0&SHPIC=&MID=1596096140278&TT=%E4%BC%81%E4%B8%9A%E7%BA%B3%E7%A8%8E%E4%BF%A1%E7%94%A8%E7%AD%89%E7%BA%A7%E6%9F%A5%E8%AF%A2&CHK=128&SHT=chinatax.gov.cn&RDM=0.5986343562186351
Parameter values:
{'WS': '10003701',
'RD': 'common',
'SWS': '',
'SWSID': '',
'SWSPID': '',
'JSVER': '20161011',
'TDT': 'web',
'UC': '_ck20073016022012811741728128923',
'LUC': '',
'VUC': '_vk1596096140278',
'FS': '',
'RF': '',
'PS': 'hd.chinatax.gov.cn',
'PU': '/nszx/InitCredit.html',
'PT': '',
'PER': '0',
'PC': '',
'PI': '',
'LM': '1596096140000',
'LG': 'zh-CN',
'CL': '24',
'CK': '1',
'SS': '1920*1080',
'SCW': '1863',
'SCH': '935',
'SSH': '952',
'FT': '1596096140278',
'LT': '1596096140278',
'DL': '0',
'FL': '1',
'CKT': 'HttpCookie',
'JV': '0',
'AL': '0',
'SY': 'windows nt 10.0',
'BR': 'chrome',
'TZ': '-8',
'AU': '',
'UN': '',
'UID': '',
'URT': '',
'UA': '',
'US': '',
'TID': '',
'MT': '',
'FMSRC': 'same',
'MSRC': '',
'MSCH': '',
'EDM': '',
'RC': '0',
'SHPIC': '',
'MID': '1596096140278',
'TT': '企业纳税信用等级查询',
'CHK': '128',
'SHT': 'chinatax.gov.cn',
'RDM': '0.5986343562186351'}
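A dict like the one above can be recovered from a captured URL with the standard library. The following is a minimal sketch, not the author's tooling: the helper name `query_to_dict` is made up for illustration, and the sample URL is a shortened version of the third request that keeps only a few parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def query_to_dict(url: str) -> dict:
    """Return the query-string parameters of `url` as a dict.

    keep_blank_values=True keeps parameters such as SWS= that have no value,
    and percent-encoded values are decoded (%2F becomes /, %20 becomes a space).
    """
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


# Shortened sample of the third-request URL (most parameters omitted).
sample = (
    "http://fxsjcj.kaipuyun.cn//logcount.php?WS=10003701&RD=common&SWS="
    "&PU=%2Fnszx%2FInitCredit.html&SY=windows%20nt%2010.0"
    "&FT=1596096140278&LT=1596096140278"
)
params = query_to_dict(sample)
print(params)
```

Decoding the URL this way makes it easier to see which values are timestamps and which can be left fixed.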
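These parameters are generated by the JS file http://fxsjcj.kaipuyun.cn/count/10003701/10003701.js
Parameters
Fourth request
Among the cookies sent with the fourth request, _Jo0OQK is returned by the server; yfx_c_g_u_id_10003701 takes the value of yfx_sv_c_g_u_id, which the server sets via Set-Cookie in its response to the third request; yfx_f_l_v_t_10003701 is just a fixed string template filled in with timestamps.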
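The two client-side cookies can be assembled with plain string formatting. This is a sketch under assumptions: the helper name `build_tracking_cookies` is hypothetical, and the template string is the one used in the test code at the end of this post. _Jo0OQK is not built here because it comes from the server.

```python
SITE_ID = "10003701"  # suffix that appears in both cookie names


def build_tracking_cookies(yfx_sv_c_g_u_id: str, t_ms: int) -> dict:
    """Build the two cookies the client has to set before the fourth request.

    yfx_sv_c_g_u_id: value taken from the Set-Cookie of the third response.
    t_ms: millisecond timestamp, ideally the same one sent in the third request.
    """
    return {
        "yfx_c_g_u_id_" + SITE_ID: yfx_sv_c_g_u_id,
        "yfx_f_l_v_t_" + SITE_ID: "f_t_{0}__r_t_{0}__v_t_{0}__r_c_0".format(t_ms),
    }


cookies = build_tracking_cookies("example-id", 1596096140278)
for name, value in cookies.items():
    print(name, "=", value)
```

With a `requests` session the returned pairs could be copied into `session.cookies` one by one, which is what the test code does with direct item assignment.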
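Note:
Each time the program sends the third request, its parameters include four time values; ideally these match the time values in the cookie sent with the fourth request. Apart from the time values, which take the current timestamp, the other parameters of the third request can be hard-coded, or you can analyze the JS and generate them the way the script does. In my own scraping, hard-coding them worked: when the IP gets banned while paging, switch IP, run the same flow again from the start, and data can be fetched again.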
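One way to keep the time values aligned is to take a single timestamp and write it into all four fields before encoding the URL. The sketch below assumes the four fields are LM, FT, LT and MID, which are the ones the test code at the end of this post fills in; the function name `build_logcount_url` is hypothetical. Note that `urlencode` with `quote_via=quote` encodes `*` as `%2A`, so the result can differ cosmetically from the browser's URL.

```python
import time
from urllib.parse import urlencode, quote

LOGCOUNT_URL = "http://fxsjcj.kaipuyun.cn//logcount.php"
TIME_FIELDS = ("LM", "FT", "LT", "MID")  # assumed to be the four time values


def build_logcount_url(fixed_params: dict, t_ms: int) -> str:
    """Return the third-request URL with every time field set to t_ms."""
    params = dict(fixed_params)          # do not mutate the caller's dict
    for key in TIME_FIELDS:
        params[key] = str(t_ms)
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return LOGCOUNT_URL + "?" + urlencode(params, quote_via=quote)


t = int(time.time() * 1000)              # reuse this same t for the cookie
print(build_logcount_url({"WS": "10003701", "SY": "windows nt 10.0"}, t))
```

Passing the same `t` to whatever builds the yfx_f_l_v_t_10003701 cookie keeps the request and the cookie consistent.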
Data scraping:
Data request 1
Data request 2
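Once the first four steps have produced the necessary cookie values, POST the form below to hd.chinatax.gov.cn/service/fin…; the first response (status 307 in the test code) does not carry data but sets a cookie such as C3VK=79e220.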
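{
    'page': 0,             # page number, controls paging
    'location': '110000',  # region code
    'code': '',
    'name': '',
    'evalyear': 2019       # year
}
Send the same request again with the C3VK cookie attached and you get the data shown in the "Data request 2" screenshot.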
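The two-step exchange can be wrapped in a small helper. This is a sketch, not a definitive implementation: `post_with_c3vk` is a hypothetical name, the full URL is the one from the test code at the end of this post, and the helper relies on a `requests.Session` (or any object with a compatible `post` method) storing the C3VK cookie from the 307 response on its own.

```python
FIND_CREDIT_URL = "http://hd.chinatax.gov.cn/service/findCredit.do"


def post_with_c3vk(session, form, headers=None, max_attempts=2):
    """POST the form; if the answer is 307, POST once more.

    The 307 response sets C3VK on the session's cookie jar, so the second
    attempt carries it automatically. Redirects are not followed because the
    retry is done by hand.
    """
    resp = None
    for _ in range(max_attempts):
        resp = session.post(FIND_CREDIT_URL, data=form, headers=headers,
                            allow_redirects=False)
        if resp.status_code != 307:
            break
    return resp
```

Usage would look like `resp = post_with_c3vk(session, {'page': 0, 'location': '110000', 'code': '', 'name': '', 'evalyear': 2019})`, after which `resp.status_code` tells you whether data (200) or a ban (503) came back.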
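Paging:
To fetch another page, first remove the C3VK value from the cookies while keeping all other cookies, and change the page number in the form to the page you want. POST the form to hd.chinatax.gov.cn/service/fin…; the response sets a new C3VK value (for example C3VK=3c20f0). Send the same request again with the new C3VK value to get the data back.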
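The preparation step before each page request is small enough to isolate. In this sketch the function name `prepare_page_request` is hypothetical; `cookies` can be any mutable mapping, which includes a plain dict and the cookie jar of a `requests` session.

```python
def prepare_page_request(cookies, page, location="110000", evalyear=2019):
    """Drop the stale C3VK cookie and return the form for the given page.

    pop() with a default does nothing when C3VK is absent, so the call is
    safe on the very first page as well.
    """
    cookies.pop("C3VK", None)
    return {
        "page": page,
        "location": location,
        "code": "",
        "name": "",
        "evalyear": evalyear,
    }


jar = {"C3VK": "79e220", "_Jo0OQK": "kept"}
form = prepare_page_request(jar, 3)
print(jar)   # C3VK is gone, the other cookie is still there
print(form)
```

Using `pop` instead of `del` avoids a `KeyError` if a previous request failed before the server had a chance to set C3VK.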
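Handling IP bans:
Keep a global record of the pages already scraped. After switching IP, repeat the four-step cookie flow described above from the start; once the cookies are obtained, follow the data-request flow and resume paging from the last recorded page.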
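The resume logic can be sketched as a loop that only advances the page counter on success. Everything here is an assumption made for illustration: `crawl_with_resume`, `new_session` (which would switch IP and redo the four-step cookie flow) and `fetch_page` (which would return a status code and body) are hypothetical names, and 503 is treated as the ban signal because that is what the test code at the end of this post checks for.

```python
def crawl_with_resume(total_pages, new_session, fetch_page, max_rebuilds=5):
    """Fetch pages 0..total_pages-1, rebuilding the session after a ban.

    new_session(): returns a fresh session (new IP, cookies re-acquired).
    fetch_page(session, page): returns (status_code, text).
    Returns (results, next_page); next_page < total_pages means it gave up.
    """
    results = {}
    page = 0            # the checkpoint: next page still to be fetched
    rebuilds = 0
    session = new_session()
    while page < total_pages:
        status, text = fetch_page(session, page)
        if status == 200:
            results[page] = text
            page += 1                   # advance only after a success
        elif status == 503:             # banned: same page, new session
            if rebuilds >= max_rebuilds:
                break
            rebuilds += 1
            session = new_session()
        else:
            break                       # unexpected status, stop and inspect
    return results, page
```

Persisting `page` to disk after each success would let the crawl survive a process restart as well; that part is left out to keep the sketch short.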
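Python test code for the flow:
This code was written while analyzing the flow; the headers and many of the steps have not been simplified, so treat it only as a reference for the flow.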
# -*- encoding: utf-8 -*-
"""
@File    :   Tax credit rating
"""
import json
import time
import requests


def spider():
    '''
    Test code
    :return:
    '''
    url = "http://hd.chinatax.gov.cn/nszx/InitCredit.html"
    headers1 = {'Host': 'hd.chinatax.gov.cn',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Accept-Encoding': 'gzip, deflate',
                'Accept-Language': 'zh-CN,zh;q=0.9'
                }
    session = requests.session()
    # Requests 1 and 2: the page answers 302 first, then the redirect target.
    resp = session.get(url, headers=headers1, allow_redirects=False)
    if resp.status_code == 302:
        location_url = resp.headers["Location"]
        resp = session.get(location_url, headers=headers1, allow_redirects=False)
        # One millisecond timestamp shared by request 3 and the cookie of request 4.
        t = int(time.time()*1000)
        if resp.status_code == 200:
            # Request 3: tracking call, everything except the time values is hard-coded.
            url = "http://fxsjcj.kaipuyun.cn//logcount.php?" \
                  "WS=10003701&RD=common&SWS=&SWSID=&SWSPID=&JSVER=20161011&" \
                  "TDT=web&UC=_ck20072709525514305521860574102&LUC=&" \
                  "VUC=_vk1595814775435&FS=&RF=&PS=hd.chinatax.gov.cn&PU=%2Fnszx%2FInitCredit.html&PT=&PER=0&PC=&PI=&" \
                  "LM={}&" \
                  "LG=zh-CN&CL=24&CK=1&SS=1536*864&SCW=1494&SCH=727&SSH=952&" \
                  "FT={}&" \
                  "LT={}&" \
                  "DL=0&FL=1&CKT=HttpCookie&JV=0&AL=0&SY=windows%20nt%2010.0&BR=chrome&TZ=-8&AU=&UN=&UID=&URT=&UA=&US=&TID=&MT=&FMSRC=same&MSRC=&MSCH=&EDM=&RC=0&SHPIC=&" \
                  "MID={}&" \
                  "TT=%E4%BC%81%E4%B8%9A%E7%BA%B3%E7%A8%8E%E4%BF%A1%E7%94%A8%E7%AD%89%E7%BA%A7%E6%9F%A5%E8%AF%A2&" \
                  "CHK=114&SHT=chinatax.gov.cn&RDM=0.0014739249745641114".format(t,t,t,t)
            headers2 = {
                'Host': 'fxsjcj.kaipuyun.cn',
                'Connection': 'keep-alive',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
                'Referer': 'http://hd.chinatax.gov.cn/nszx/InitCredit.html',
                'Accept-Encoding': 'gzip, deflate',
                'Accept-Language': 'zh-CN,zh;q=0.9'}
            resp = session.get(url, headers=headers2, allow_redirects=False)
            if resp.status_code == 200:
                yfx_sv_c_g_u_id = resp.cookies["yfx_sv_c_g_u_id"]
                # Request 4
                url = "http://hd.chinatax.gov.cn/pepp4_lakers"
                headers3 = {'Host': 'hd.chinatax.gov.cn',
                            'Connection': 'keep-alive',
                            'Origin': 'http://hd.chinatax.gov.cn',
                            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                            'Content-Type': 'application/x-www-form-urlencoded',
                            'Accept': '*/*',
                            'Referer': 'http://hd.chinatax.gov.cn/nszx/InitCredit.html',
                            'Accept-Encoding': 'gzip, deflate',
                            'Accept-Language': 'zh-CN,zh;q=0.9'}
                session.cookies["yfx_c_g_u_id_10003701"] = yfx_sv_c_g_u_id
                session.cookies['yfx_f_l_v_t_10003701'] = "f_t_{}__r_t_{}__v_t_{}__r_c_0".format(t,t,t)
                param = {
                    'js': '1',
                    'flag': '1',
                    'res': '',
                    'plat': 'Win32',
                    'ua': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
                }
                #param = "js=1&flag=1&res=&plat=Win32&ua=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
                resp = session.post(url, headers=headers3, data=param, allow_redirects=False)
                if resp.status_code == 200:
                    print(resp.text)
                    # Request the list of years
                    url = "http://hd.chinatax.gov.cn/dict/queryDict?code=PJND"
                    headers4 = {'Host': 'hd.chinatax.gov.cn',
                                'Connection': 'keep-alive',
                                'Accept': 'application/json, text/javascript, */*; q=0.01',
                                'X-Requested-With': 'XMLHttpRequest',
                                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                                'Referer': 'http://hd.chinatax.gov.cn/nszx/InitCredit.html',
                                'Accept-Encoding': 'gzip, deflate',
                                'Accept-Language': 'zh-CN,zh;q=0.9'
                                }
                    resp = session.get(url,headers=headers4)
                    # Request the list of regions
                    url = "http://hd.chinatax.gov.cn/dict/queryDict?code=SZDQ"
                    resp = session.get(url,headers=headers4)
                    # Request the data
                    url = "http://hd.chinatax.gov.cn/service/findCredit.do"
                    data = {'page': 0,
                            'location': '110000',
                            'code': '',
                            'name': '',
                            'evalyear': 2019
                            }
                    headers5 ={'Host': 'hd.chinatax.gov.cn',
                               'Connection': 'keep-alive',
                               'Content-Length': '55',
                               'Accept': '*/*',
                               'Origin': 'http://hd.chinatax.gov.cn',
                               'X-Requested-With': 'XMLHttpRequest',
                               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                               'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                               'Referer': 'http://hd.chinatax.gov.cn/nszx/InitCredit.html',
                               'Accept-Encoding': 'gzip, deflate',
                               'Accept-Language': 'zh-CN,zh;q=0.9'}
                    totalPages = 0  # stays 0 if the first data request fails
                    # The first POST returns 307 and sets C3VK; the second returns data.
                    resp = session.post(url,headers=headers5,data=data,allow_redirects=False)
                    if resp.status_code == 307:
                        resp = session.post(url, headers=headers5,data=data, allow_redirects=False)
                    if resp.status_code == 200:
                        print(resp.text)
                        json_obj = json.loads(resp.text)
                        totalPages = json_obj.get("totalPages",0)
                    if resp.status_code == 503:
                        print("IP banned")
                    for page in range(1,int(totalPages/15)+1):
                        # Clear the old C3VK (no error if it is missing), keep other cookies.
                        session.cookies.pop("C3VK", None)
                        data = {'page': page,
                                'location': '110000',
                                'code': '',
                                'name': '',
                                'evalyear': 2019}
                        resp = session.post(url, headers=headers5, data=data, allow_redirects=False)
                        if resp.status_code == 307:
                            resp = session.post(url, headers=headers5, data=data, allow_redirects=False)
                        if resp.status_code == 200:
                            print(resp.text)
                        if resp.status_code == 503:
                            print("IP banned")
                            break  # switch IP and restart from the cookie flow
                        time.sleep(1)  # be polite to the site between pages


if __name__ == '__main__':
    spider()
End
To avoid disrupting the site's normal operation, please sleep between requests.
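A fixed pause works, and a little random jitter on top of it spreads requests out more evenly. The sketch below is only an illustration: the function names and the 1.0 s base and 0.5 s jitter are arbitrary values chosen for the example, not figures measured against the site.

```python
import random
import time


def polite_delay(base=1.0, jitter=0.5, rng=random.random):
    """Return a delay in seconds between base and base + jitter."""
    return base + jitter * rng()


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for a randomized interval; call this between two requests."""
    time.sleep(polite_delay(base, jitter))
```

Calling `polite_sleep()` after each page request is enough for a single-threaded crawl like the one in this post.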