Scraping the fork count of AI Studio projects


Let's build a writing habit together! This is day 15 of my participation in the "Juejin Daily New Plan · April Update Challenge".

1. Import the required Python packages

Since this is a scraper, the requests package is indispensable, and because the scraped data comes back as JSON, the json module is needed as well.

import requests
import json

2. Setting up the URL to scrape

The key piece is the URL to scrape. It can be found under the Network tab of the browser's developer tools, but note that for URLs with query parameters you have to assemble the query string yourself:

 link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)

Both parameters are dynamic: projectId selects the project and p selects the page.
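As a side note, the query string can also be assembled with the standard library instead of string concatenation, which handles escaping for free. A minimal sketch (`build_forklist_url` is a helper name of my own, not from the original code):

```python
from urllib.parse import urlencode

def build_forklist_url(projectid: int, page: int) -> str:
    """Assemble the fork-list URL; urlencode escapes the values for us."""
    base = "https://aistudio.baidu.com/studio/project/forklist"
    query = urlencode({"projectId": projectid, "p": page})
    return f"{base}?{query}"

print(build_forklist_url(3741553, 1))
# https://aistudio.baidu.com/studio/project/forklist?projectId=3741553&p=1
```

requests can also do this internally if you pass the parameters via its `params=` argument instead of baking them into the URL.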

3. Setting up the headers

This part is straightforward: copy the headers you see in the developer tools straight in. If the endpoint has no authentication requirement, the cookies can be left out; keep it simple.

 head = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52",
        "Referer": "https://aistudio.baidu.com/aistudio/projectdetail/" + str(projectid),
        "Origin": "https://aistudio.baidu.com",
        "Host": "aistudio.baidu.com",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6",
        "Accept-Encoding": "gzip, deflate",
        'x-requested-with': 'XMLHttpRequest',
    }

4. Looping through the pages

When there is a lot of data, it is served in pages, so first scrape the total page count and then loop over the pages.

 totalPage = json.loads(html)["result"]["totalPage"]
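Parsing the response defensively avoids a KeyError when the payload is missing a field. A small sketch against a mocked response body — the sample string below is hypothetical; only the field names `result`, `totalPage`, `data`, and `nickname` come from the real endpoint:

```python
import json

# Hypothetical sample of what the endpoint might return,
# using the field names the article's code accesses.
sample = '{"result": {"totalPage": 3, "data": [{"nickname": "alice"}]}}'

payload = json.loads(sample)
# .get with a default instead of [] so a missing key yields 0, not a crash.
total_page = payload.get("result", {}).get("totalPage", 0)
print(total_page)  # 3
```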

5. Extracting usernames

Since the response body is JSON, the fields can be read directly after parsing and appended to a list:

for i in range(1, totalPage + 1):
        link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)
        html = requests.post(url=link, headers=head).text
        data = json.loads(html)["result"]["data"]
        for ii in range(len(data)):
            mydata.append(data[ii]["nickname"])

6. Deduplication

Deduplication is the easy part: converting the list to a set removes the duplicates automatically, with nothing more to think about.

    myset = set(mydata)
    print(f"fork user count: {len(myset)}")
    return len(myset)
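The list-to-set trick can be seen on a toy list of nicknames:

```python
names = ["alice", "bob", "alice", "carol", "bob"]
unique = set(names)   # duplicate entries collapse automatically
print(len(unique))    # 3 distinct users
```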

7. Putting it all together

All that remains is to assemble the pieces. The main function can hold all the project IDs; scrape each project's fork count and sum them. The complete code is below:


#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
@author:livingbody
@file:project_fork_counts.py
@time:2022/04/12
"""

import requests
import json


def count_forks(projectid):
    head = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52",
        "Referer": "https://aistudio.baidu.com/aistudio/projectdetail/" + str(projectid),
        "Origin": "https://aistudio.baidu.com",
        "Host": "aistudio.baidu.com",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6",
        "Accept-Encoding": "gzip, deflate",
        'x-requested-with': 'XMLHttpRequest',
    }
    p = 1
    link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=1"
    html = requests.post(url=link, headers=head).text
    totalPage = json.loads(html)["result"]["totalPage"]
    # print("totalPage: ", totalPage)
    mydata = []
    for i in range(1, totalPage + 1):
        link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)
        html = requests.post(url=link, headers=head).text
        data = json.loads(html)["result"]["data"]
        for ii in range(len(data)):
            mydata.append(data[ii]["nickname"])
    myset = set(mydata)
    print(f"fork user count: {len(myset)}")
    return len(myset)


def main():
    # project IDs to tally
    projectids = [3741553, 3538237, 3571473, 3601731, 3581241]
    total = 0  # avoid shadowing the built-in sum()
    for projectid in projectids:
        num = count_forks(projectid)
        total = total + num
    print("total forks: ", total)


if __name__ == '__main__':
    main()
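For reference, the summing loop in main can also be written with the built-in sum() over a generator expression. A sketch with a stubbed count_forks — the fake counts here are made up purely for illustration:

```python
def count_forks(projectid: int) -> int:
    # Stand-in for the real scraper: fake fork counts keyed by project id.
    fake = {3741553: 10, 3538237: 5}
    return fake.get(projectid, 0)

projectids = [3741553, 3538237]
total = sum(count_forks(pid) for pid in projectids)
print(total)  # 15
```

One caveat worth noting: summing per-project counts can count the same user more than once if they forked several of the projects.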