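This is a practice project. If it infringes on anyone's rights, please let me know and I will take it down.
Target site:
Target result:
Open the target site in a browser, then press F12 and capture the network traffic.
It is a GET request,
and the results are right there in the response. Next, look at the request parameters: they are all present and perfectly straightforward.
With curl as the go-to tool, copy the request out as code and run it. The results come straight back, so there is nothing more to investigate here.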
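Since the query parameters are the whole request, it can help to see how they turn into the final URL. The sketch below uses only the standard library and the parameter values from the script at the end of this post. `BASE_URL` and `build_url` are names introduced here for illustration, and treating `page` as the page number is an assumption based on the parameter's name.

```python
from urllib.parse import urlencode

# BASE_URL and build_url are illustrative names, not part of the original script.
BASE_URL = "https://www.powerbeijing-ec.com/jncms/search/jingneng_result.html"


def build_url(page: int) -> str:
    """Return the list-page URL for the given page number (assumed meaning of `page`)."""
    params = (
        ("dates", "300"),
        ("categoryId", "13"),
        ("tabName", "结果公告"),
        ("page", str(page)),
    )
    # urlencode percent-encodes the non-ASCII tabName value as UTF-8.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_url(1))
```

Passing the same tuple to `requests.get(..., params=params)` should produce an equivalent URL, so either form can be used.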
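What remains is extracting the data and storing it at its destination.
The source is as follows. There is not much to say about it: a simple test showed no anti-scraping measures.
Next, analyze the detail page. Take the URL of the next page (the detail page) from the results above.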
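As a rough sketch of the extraction step, the snippet below pulls detail-page links out of the list-page HTML and writes them to a CSV file. The page markup is not shown in this post, so the filter is an assumption: it keeps any `<a>` whose `href` contains `/jingneng_result/`, the path segment seen in the detail URL used at the end. `DetailLinkParser`, `extract_detail_links` and `save_links` are hypothetical helper names.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin

LIST_URL = "https://www.powerbeijing-ec.com/jncms/search/jingneng_result.html"


class DetailLinkParser(HTMLParser):
    """Collect (title, absolute_url) pairs for anchors that look like detail pages."""

    def __init__(self, base_url, marker="/jingneng_result/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumption: detail URLs contain this path segment
        self.links = []
        self._href = None     # set while we are inside a matching <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.marker in href:
                # Resolve relative links against the list-page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_detail_links(html, base_url=LIST_URL):
    """Return a list of (title, url) tuples found in the given HTML."""
    parser = DetailLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def save_links(rows, path):
    """Write the (title, url) rows to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(rows)
```

A typical call would be `save_links(extract_detail_links(response.text), "results.csv")`. If the real markup differs, adjust the `marker` filter or swap in an HTML library of your choice.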
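The request returns:
Opening it shows that it is a player, so the next step is to find the playback address, as follows:
It is actually this address:
The code for fetching the playback address is attached at the end.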
# Request the list page (search results)
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}

# Query-string parameters copied from the captured request
params = (
    ('dates', '300'),
    ('categoryId', '13'),
    ('tabName', '\u7ED3\u679C\u516C\u544A'),  # 结果公告 (result announcements)
    ('page', '1'),
)

response = requests.get('https://www.powerbeijing-ec.com/jncms/search/jingneng_result.html', headers=headers, params=params)
# Equivalent request with the parameters already encoded in the URL:
# response = requests.get('https://www.powerbeijing-ec.com/jncms/search/jingneng_result.html?dates=300&categoryId=13&tabName=%E7%BB%93%E6%9E%9C%E5%85%AC%E5%91%8A&page=1', headers=headers)
print(response.text)

# Request a detail page
import requests

# Headers copied from the browser; everything except User-Agent is commented out
headers = {
    # 'Connection': 'keep-alive',
    # 'Cache-Control': 'max-age=0',
    # 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    # 'sec-ch-ua-mobile': '?0',
    # 'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    # 'Sec-Fetch-Site': 'none',
    # 'Sec-Fetch-Mode': 'navigate',
    # 'Sec-Fetch-User': '?1',
    # 'Sec-Fetch-Dest': 'document',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'If-None-Match': 'W/"6126f74b-3213"',
    # 'If-Modified-Since': 'Thu, 26 Aug 2021 02:07:07 GMT',
}

response = requests.get('https://www.powerbeijing-ec.com/jingneng_result/2021-08-26/23287.html', headers=headers)
print(response.text)

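Summary: this site is fairly simple. Once you have confirmed that the data can be fetched, you can think about crawl efficiency (for example async requests or multithreading), stability (request retries, randomized request headers, IP proxies and other common techniques), and incremental crawling (using a database to run a daily incremental crawl).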
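To make two of these ideas concrete, here is a minimal sketch of request retries with a randomized User-Agent, plus a small SQLite table that remembers which URLs have already been crawled. None of this comes from the original scripts: `USER_AGENTS`, `fetch_with_retry` and `SeenStore` are hypothetical names, the second User-Agent string is only an illustrative value, and the linear backoff is one simple choice among many.

```python
import random
import sqlite3
import time

# Hypothetical pool of User-Agent strings; replace with values you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]


def fetch_with_retry(get, url, retries=3, backoff=1.0):
    """Call get(url, headers=...) up to `retries` times, waiting longer after each failure."""
    last_error = None
    for attempt in range(retries):
        # Pick a different User-Agent at random for every attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            return get(url, headers=headers)
        except Exception as exc:  # in real use, narrow this to the client's error types
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff * (attempt + 1))
    raise last_error


class SeenStore:
    """Remember crawled URLs so that a daily run only processes new ones."""

    def __init__(self, path="seen.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add_if_new(self, url):
        """Record the URL and return True if it was not seen before, else return False."""
        cur = self.conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        self.conn.commit()
        return cur.rowcount == 1
```

With `requests` installed, a call could look like `fetch_with_retry(requests.get, url)`, followed by `if store.add_if_new(url): ...` to skip pages handled on an earlier run. Note that `requests.get` does not raise on HTTP error status codes by itself, so you may want to wrap it in a function that calls `raise_for_status()`.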