Scraping the Hot List of a Certain "-ying" Site, Explained


Time really does fly. In the blink of an eye it has been 32 days since April 10, and writing one scraping article a day seems to have become a habit. Yesterday I read a post by someone who interviewed for a position at Alibaba and got grilled by the manager. You really do have to keep learning. Going forward I'll switch to a different type of article, focusing on small reverse-engineering and anti-scraping case studies, plus anything from daily work worth writing down. Today the target is the hot-list data of a certain "-ying" site. Without further ado, let's get straight into it.

  1. As always, start by analyzing the site's structure. The site address is www.digitaling.com, as used in the requests below.

Opening the site lands you straight on the home page, so the target is the "latest indexed articles" list shown there. With the data list found, flip a page to see what the data API looks like.

Scroll to the bottom, click "load more", and capture the request.

That captured request is the data API. It carries two parameters: page, the page number, and stamp, a timestamp. In my experience this timestamp is either the timestamp of the last article on the previous page or simply the current timestamp. Flip one more page to verify.

It is not the previous page's timestamp, so it must be the timestamp of the current request, and a current timestamp is generally not validated server-side, so no problem there.
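If the stamp really is just the current Unix timestamp, it can presumably be generated locally rather than replayed from a capture. A minimal sketch:

```python
import time

# Assumption: the API accepts any current Unix timestamp in seconds
stamp = str(int(time.time()))
print(stamp)  # e.g. '1683769198'
```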

Next, check what the detail-page API looks like. The detail-page data sits right in the HTML, and the article URL carries an id that is present in the list response, so we're all set.

  2. First request the list page to get all the article links.
```python
import requests

cookies = {
    'ci_session': 'a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22398048964ca2f0329d878daccc8a0fb4%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A13%3A%22103.85.169.91%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A115%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F100.0.4896.60+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1683769198%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7Dfc0151e8ef7867bce5657e522cc2521f',
    'SERVERID': 'wrs02',
}

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://www.digitaling.com',
    'Pragma': 'no-cache',
    'Referer': 'https://www.digitaling.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

data = {
    'page': '2',
    'stamp': '1683769198',  # the current Unix timestamp at capture time
}

# One page of the home list; the article entries sit under the 'content' key
response = requests.post('https://www.digitaling.com/api/getHome',
                         cookies=cookies, headers=headers, data=data).json()['content']
for res in response:
    conId = res['conId']  # article id, used below to build the detail-page URL
    print(conId)
```
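With one page working, walking several pages is just a loop over page with a freshly generated stamp, reusing the cookies and headers above. A rough sketch (assumptions: an empty content list marks the last page, and the five-page cap is arbitrary):

```python
import time
import requests

all_ids = []
for page in range(1, 6):  # arbitrary cap of five pages for illustration
    data = {'page': str(page), 'stamp': str(int(time.time()))}
    resp = requests.post('https://www.digitaling.com/api/getHome',
                         cookies=cookies, headers=headers, data=data)
    content = resp.json().get('content') or []
    if not content:  # assumption: an empty list means no more pages
        break
    all_ids.extend(item['conId'] for item in content)
    time.sleep(1)  # throttle between pages to stay polite
print(all_ids)
```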

Then request the detail page and parse it automatically.

```python
import requests
from gne import GeneralNewsExtractor  # pip install gne

cookies = {
    'SERVERID': 'wrs02',
    'ci_session': 'a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2292f18122f19a8b05ace8855b52137048%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A13%3A%22103.85.169.91%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A115%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F100.0.4896.60+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1683770650%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D231fde93f5516036d522b4d233a588a5',
}

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

# conId is the article id collected from the list request above
response = requests.get(f'https://www.digitaling.com/articles/{conId}.html',
                        cookies=cookies, headers=headers)

# gne pulls the title, author, publish time and body out of the raw HTML;
# the comment list is excluded as noise
extractor = GeneralNewsExtractor()
result = extractor.extract(response.text,
                           noise_node_list=['//div[@class="comment-list"]'])
print(result)
```

OK, no problems at all.
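To wrap up, here is a rough end-to-end sketch combining both steps, under the same assumptions as above (gne installed via pip install gne; cookies and headers are the captured dicts from earlier; for brevity one set of headers is reused for both requests, though the captures above used slightly different headers for the GET):

```python
import time
import requests
from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()

# Step 1: fetch one list page, stamped with the current Unix timestamp
list_data = {'page': '1', 'stamp': str(int(time.time()))}
articles = requests.post('https://www.digitaling.com/api/getHome',
                         cookies=cookies, headers=headers,
                         data=list_data).json()['content']

# Step 2: fetch each detail page and let gne extract the article fields
for item in articles:
    detail = requests.get(f"https://www.digitaling.com/articles/{item['conId']}.html",
                          cookies=cookies, headers=headers)
    result = extractor.extract(detail.text,
                               noise_node_list=['//div[@class="comment-list"]'])
    print(result.get('title'))
    time.sleep(1)  # throttle between articles
```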