Scraping Ranking Data from a Certain Star Community


A thoroughbred cannot cover ten paces in a single leap; a nag can go the distance by pulling for ten days, and its success lies in not giving up. Today we'll scrape the recommended-ranking data from the Liuxing community. Without further ado, let's get straight to it, but first, as usual, the obligatory opening image.


  1. The overall workflow is the usual one: analyze the site structure, check whether there is any anti-scraping to deal with, and then start writing code. The target site is bbs.liuxingw.com.

Apart from a few pinned threads at the top, which are not useful, what follows is the real data. We won't bother capturing the home-page request; as usual, scroll down, find the pagination endpoint, and see what the data looks like.

If you have read the earlier articles in this series, you can probably guess by now that the data is very likely rendered straight into the HTML on the server side. Let's capture a request and check.
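Besides eyeballing the captured response, a quick programmatic sanity check is to fetch one list page with plain requests and see whether a thread title you can read in the browser shows up in the raw HTML (the title string below is just a placeholder, not a real post):

import requests

# fetch one list page without running any JavaScript
resp = requests.get('https://bbs.liuxingw.com/new/2.html',
                    headers={'user-agent': 'Mozilla/5.0'})
# if a title that is visible in the browser appears here, the data is server-rendered
print('某个帖子标题' in resp.text)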

Sure enough, the data is in the HTML, and pagination works simply by changing the page number in the URL. Next, look at the detail endpoint; my blind guess is that it is also in the HTML.

It is indeed in the HTML as well, so everything is clear and we can start writing code.

  2. First, paginate through the list pages and parse out every detail-page link and title:


import requests
from lxml import etree

headers = {
    'authority': 'bbs.liuxingw.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'X_CACHE_KEY=17482ee2df40673b6db81d5896931d0b; __51vcke__JhUdxqqnl0OntR9a=53c40d43-cb94-5f98-8103-b3aabfc0bae9; __51vuft__JhUdxqqnl0OntR9a=1682403334483; _tcnyl=1; hyphp_lang=zh-CN; __51uvsct__JhUdxqqnl0OntR9a=2; __vtins__JhUdxqqnl0OntR9a=%7B%22sid%22%3A%20%226f193d92-5d15-5316-8465-e62087ac4101%22%2C%20%22vd%22%3A%204%2C%20%22stt%22%3A%201041172%2C%20%22dr%22%3A%20111272%2C%20%22expires%22%3A%201682412837045%2C%20%22ct%22%3A%201682411037045%7D',
    'pragma': 'no-cache',
    'referer': 'https://bbs.liuxingw.com/new/2.html',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
for page in range(1, 10):
    # the list pages paginate via the number in the URL, e.g. /new/2.html
    response = requests.get(f'https://bbs.liuxingw.com/new/{page}.html', headers=headers)
    res = etree.HTML(response.text)
    # each thread-title link carries both the title text and the detail-page href
    data_list = res.xpath("//div[@class='item']/div/h4/a[@class='thread-title']")
    for data in data_list:
        title = data.xpath("./text()")[0]
        url = data.xpath("./@href")[0]
        print(title, url)

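The loop above only prints each title and link. If the hrefs turn out to be relative paths (I have not verified that here), a small tweak collects them into absolute URLs ready for the detail-page step; a rough sketch:

from urllib.parse import urljoin

detail_pages = []
for data in data_list:
    title = data.xpath("./text()")[0]
    href = data.xpath("./@href")[0]
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    detail_pages.append((title, urljoin('https://bbs.liuxingw.com/', href)))
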
Then request the detail page:

headers = {
    'authority': 'bbs.liuxingw.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'X_CACHE_KEY=17482ee2df40673b6db81d5896931d0b; __51vcke__JhUdxqqnl0OntR9a=53c40d43-cb94-5f98-8103-b3aabfc0bae9; __51vuft__JhUdxqqnl0OntR9a=1682403334483; _tcnyl=1; hyphp_lang=zh-CN; __51uvsct__JhUdxqqnl0OntR9a=2; __vtins__JhUdxqqnl0OntR9a=%7B%22sid%22%3A%20%226f193d92-5d15-5316-8465-e62087ac4101%22%2C%20%22vd%22%3A%205%2C%20%22stt%22%3A%201162485%2C%20%22dr%22%3A%20121313%2C%20%22expires%22%3A%201682412958358%2C%20%22ct%22%3A%201682411158358%7D',
    'pragma': 'no-cache',
    'referer': 'https://bbs.liuxingw.com/new/3.html',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}

import sys

# fetch one detail page first and make sure the full HTML comes back
response = requests.get(url, headers=headers)
print(response.text)
sys.exit()  # stop here while verifying; remove once the parsing step is wired up

Printing the page source confirms everything is there; from here it is just a matter of feeding it to gne for automatic extraction, so I will not paste the parsing code again.
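For completeness, gne's automatic extraction only takes a few lines; a minimal sketch of how it would plug into the detail-page response above (assuming the fields it guesses by default are enough here):

from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
# gne takes the raw detail-page HTML and guesses title, content, author and publish time
result = extractor.extract(response.text)
print(result)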