A Crawler for Trending Posts on the 爱寻匿 Site


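Yesterday I scraped the third-party trending feeds that 爱寻匿 aggregates, and noticed that the site also has trending data of its own, so let's scrape that too.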

  1. As usual, start by looking at the page structure of the site (www.aixunni.com)

The post list sits right on this page. Next, capture the traffic to find the request that serves the list.

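Loading the home page fires off so many requests that the right one is tedious to find, so I captured the request triggered by turning to the next page instead.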

The data is embedded in the HTML. Unlike yesterday's source, no key is needed here, and pagination is just a path suffix: /page/2. Next, look at a detail page.
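Since pagination is only a path suffix, the list-page URLs can be generated rather than captured one by one. A minimal sketch follows. The helper name `list_page_url` is made up for illustration, and treating the bare /blog/ address as page 1 is an assumption based on the referer seen in the captured request.

```python
BASE_LIST_URL = "https://www.aixunni.com/blog/"


def list_page_url(page: int) -> str:
    """Return the list-page URL for a 1-based page number.

    Assumption: page 1 is the bare /blog/ address and later pages
    follow the /page/N/ pattern seen in the captured request.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        return BASE_LIST_URL
    return f"{BASE_LIST_URL}page/{page}/"


if __name__ == "__main__":
    for n in range(1, 4):
        print(list_page_url(n))
```

How many pages exist is not known in advance, so a crawler would probably stop when a page returns no posts or a non-200 status.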


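Note the 1128 in the link. My guess is that this is the article's id, and from experience such an id should also appear in the list page's HTML, so go back and check the source.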

It is there, so we can get started.
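If the id is wanted as a field of its own (for deduplication, say), it can be pulled out of the link with a regular expression. This is only a sketch: the function name `article_id_from_url` and the sample URL shape `/blog/1128.html` are hypothetical, because the exact link format is only visible in the screenshot.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def article_id_from_url(url: str) -> Optional[str]:
    """Return the first run of digits in the URL path, or None if there is none."""
    path = urlparse(url).path
    match = re.search(r"\d+", path)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical detail-page URL; only the 1128 comes from the article.
    print(article_id_from_url("https://www.aixunni.com/blog/1128.html"))
```

A pagination URL such as /blog/page/2/ also contains digits, so this helper should only be applied to links taken from the post list.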

  2. Request the list page first and parse out all of the article links

import requests
from lxml import etree

headers = {
    'authority': 'www.aixunni.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'cookie': '__51vcke__JidhIS3WqWdHR4fA=fb2eef69-4e80-589a-ad69-f5f0379d5d56; __51vuft__JidhIS3WqWdHR4fA=1682219380303; _tcnyl=1; __51huid__Jidk2jylHnTV3MHC=5c17fc32-49b1-5d14-a554-2e7cef40df2b; __51uvsct__JidhIS3WqWdHR4fA=2; __vtins__JidhIS3WqWdHR4fA=%7B%22sid%22%3A%20%22f9d70d4e-87da-5f0c-a6d7-27900669bb25%22%2C%20%22vd%22%3A%204%2C%20%22stt%22%3A%20487460%2C%20%22dr%22%3A%207820%2C%20%22expires%22%3A%201682232418482%2C%20%22ct%22%3A%201682230618482%7D',
    'pragma': 'no-cache',
    'referer': 'https://www.aixunni.com/blog/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}

# The cookie is already sent through the 'cookie' header, so no cookies= argument is needed
response = requests.get('https://www.aixunni.com/blog/page/2/', headers=headers)
res = etree.HTML(response.text)
# Each post title is an <a> inside the <h2> of a list card
data_list = res.xpath("//div[@class='cat_list']/div/div[@class='list-item card']/div[@class='list-content']/div[@class='list-body']/h2/a")
for data in data_list:
    title = data.xpath("./@title")[0]
    url = data.xpath("./@href")[0]
    print(title, url)
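
The loop above only prints each title and link. Before the links are handed to the detail-page step, it may help to collect them, resolve any relative href against the list-page URL, and drop duplicates. The sketch below uses only the standard library. `normalize_links` is a hypothetical helper, and the sample hrefs are invented for the demonstration.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urljoin


def normalize_links(page_url: str,
                    items: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (title, href) pairs into (title, absolute_url) pairs.

    Relative hrefs are resolved against page_url, and repeated URLs
    are dropped while the original order is kept.
    """
    seen = set()
    result = []
    for title, href in items:
        absolute = urljoin(page_url, href.strip())
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append((title.strip(), absolute))
    return result


if __name__ == "__main__":
    # Invented sample data standing in for the XPath results.
    sample = [
        ("Post A", "/blog/1128.html"),
        ("Post A", "https://www.aixunni.com/blog/1128.html"),
    ]
    print(normalize_links("https://www.aixunni.com/blog/page/2/", sample))
```

In the loop, the pairs would come from the same `./@title` and `./@href` lookups, appended to a list instead of printed.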

Then request the detail page and let GeneralNewsExtractor, which I covered last time, parse it automatically.

import requests
from gne import GeneralNewsExtractor

headers = {
    'authority': 'www.aixunni.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'cookie': '__51vcke__JidhIS3WqWdHR4fA=fb2eef69-4e80-589a-ad69-f5f0379d5d56; __51vuft__JidhIS3WqWdHR4fA=1682219380303; _tcnyl=1; __51huid__Jidk2jylHnTV3MHC=5c17fc32-49b1-5d14-a554-2e7cef40df2b; __51uvsct__JidhIS3WqWdHR4fA=2; __vtins__JidhIS3WqWdHR4fA=%7B%22sid%22%3A%20%22f9d70d4e-87da-5f0c-a6d7-27900669bb25%22%2C%20%22vd%22%3A%207%2C%20%22stt%22%3A%201097623%2C%20%22dr%22%3A%20157351%2C%20%22expires%22%3A%201682233028645%2C%20%22ct%22%3A%201682231228645%7D',
    'pragma': 'no-cache',
    'referer': 'https://www.aixunni.com/blog/page/2/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}

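# url is the last link left over from the list-page loop above
response = requests.get(url, headers=headers)
extractor = GeneralNewsExtractor()
# noise_node_list gives XPaths of nodes to strip before extraction
result = extractor.extract(response.text, noise_node_list=['//div[@class="s2"]'])
print(result)
input()  # wait for Enter so the output stays on screen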
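
The snippet above fetches a single detail page. To cover every link from the list page, the same two steps can be wrapped in a loop with a short pause between requests and a guard against individual failures. This is a sketch under assumptions, not the original author's code. `crawl_details` is a hypothetical helper, and the fetching and extracting steps are passed in as functions so the loop can be tried without touching the network.

```python
import time
from typing import Callable, Dict, Iterable, List


def crawl_details(urls: Iterable[str],
                  fetch: Callable[[str], str],
                  extract: Callable[[str], Dict],
                  delay_seconds: float = 1.0) -> List[Dict]:
    """Fetch and extract every URL in order, skipping the ones that fail."""
    results = []
    for index, url in enumerate(urls):
        # Pause between requests to go easy on the site.
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        try:
            html = fetch(url)
            record = dict(extract(html))
        except Exception as exc:  # one bad page should not stop the crawl
            print(f"skipping {url}: {exc}")
            continue
        record["url"] = url
        results.append(record)
    return results


if __name__ == "__main__":
    # Stand-in functions; a real run would use requests and GeneralNewsExtractor.
    demo = crawl_details(
        ["page-1", "page-2"],
        fetch=lambda u: f"<html>{u}</html>",
        extract=lambda html: {"title": html},
        delay_seconds=0,
    )
    print(demo)
```

With the real libraries, `fetch` could be `lambda u: requests.get(u, headers=headers, timeout=10).text` and `extract` could wrap `GeneralNewsExtractor().extract(html, noise_node_list=['//div[@class="s2"]'])`. The one-second delay is an arbitrary choice.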

OK, that works.
