Let's build a writing habit together! This is day 15 of my participation in the Juejin Daily New Plan · April Update Challenge.
1. Import the necessary Python packages

Since this is a crawler, the requests package is essential, and because the scraped data comes back as JSON, the json package is needed as well.

```python
import requests
import json
```
2.爬取url设置
比较重要的是要爬取的url,该url可以从net下查看到,但是要注意的是,对于带参数的url,需要手动补全,具体代码如下:
link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)
其中参数是动态的。
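Instead of concatenating strings by hand, the query string can also be built with the standard library. The helper below is a sketch not in the original post; `build_fork_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds the fork-list URL with urlencode
# instead of manual string concatenation.
def build_fork_url(projectid, page):
    base = "https://aistudio.baidu.com/studio/project/forklist"
    return base + "?" + urlencode({"projectId": projectid, "p": page})

print(build_fork_url(3741553, 2))
# → https://aistudio.baidu.com/studio/project/forklist?projectId=3741553&p=2
```

urlencode also takes care of escaping, which matters if a parameter ever contains special characters.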
3. Set the headers

This part is straightforward: copy the headers shown in DevTools. For an endpoint with no authorization requirement, cookies can be left out entirely; keep it simple.

```python
head = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52",
    "Referer": "https://aistudio.baidu.com/aistudio/projectdetail/" + str(projectid),
    "Origin": "https://aistudio.baidu.com",
    "Host": "aistudio.baidu.com",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6",
    "Accept-Encoding": "gzip, deflate",
    "X-Requested-With": "XMLHttpRequest",
}
```
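When the same headers are reused across many requests, a requests.Session can hold them once. This is a sketch of that pattern with a trimmed-down header set, assuming the minimal headers are enough for this endpoint:

```python
import requests

# Sketch: a Session applies these headers to every request it makes,
# so they only need to be defined once.
session = requests.Session()
session.headers.update({
    "Content-Type": "application/json",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
})
```

A Session also reuses the underlying TCP connection across the paginated requests, which is a small speedup for free.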
4. Loop over the pages

When there is a lot of data, it is returned page by page, so first scrape the total page count and then fetch each page in a loop.

```python
totalPage = json.loads(html)["result"]["totalPage"]
```
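The page-count extraction can be exercised offline against a sample payload. The JSON below is made up, but it follows the `result` → `totalPage` structure described above; the `.get()` chain is a defensive variant, not what the original code does.

```python
import json

# A trimmed, made-up sample of the response body.
sample = '{"result": {"totalPage": 3, "data": [{"nickname": "alice"}]}}'

payload = json.loads(sample)
# .get() with a default avoids a KeyError if the endpoint returns an error body
totalPage = payload.get("result", {}).get("totalPage", 0)
print(totalPage)  # → 3
```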
5. Extract the usernames

Since the response body is JSON, the fields can be read directly after parsing and appended to a list:

```python
for i in range(1, totalPage + 1):
    link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)
    html = requests.post(url=link, headers=head).text
    data = json.loads(html)["result"]["data"]
    for ii in range(len(data)):
        mydata.append(data[ii]["nickname"])
```
6. Deduplicate

Deduplication is the easy part: convert the list to a set and duplicates disappear on their own, no extra thought required. (Note that myset is already a set, so wrapping it in set() again would be redundant.)

```python
myset = set(mydata)
print(f"fork user count: {len(myset)}")
return len(myset)
```
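A minimal offline demonstration of the dedup step, using made-up nicknames:

```python
# Duplicates collapse automatically when the list becomes a set.
mydata = ["alice", "bob", "alice", "carol", "bob"]
myset = set(mydata)
print(len(myset))  # → 3
```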
7. Putting it all together

All that remains is assembly. The main function holds all the project IDs; scrape each project's fork count and sum them. The full code is below:
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
@author:livingbody
@file:project_fork_counts.py
@time:2022/04/12
"""
import requests
import json


def count_forks(projectid):
    head = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52",
        "Referer": "https://aistudio.baidu.com/aistudio/projectdetail/" + str(projectid),
        "Origin": "https://aistudio.baidu.com",
        "Host": "aistudio.baidu.com",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6",
        "Accept-Encoding": "gzip, deflate",
        "X-Requested-With": "XMLHttpRequest",
    }
    # Fetch the first page to learn the total page count.
    link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=1"
    html = requests.post(url=link, headers=head).text
    totalPage = json.loads(html)["result"]["totalPage"]
    mydata = []
    for i in range(1, totalPage + 1):
        link = "https://aistudio.baidu.com/studio/project/forklist?projectId=" + str(projectid) + "&p=" + str(i)
        html = requests.post(url=link, headers=head).text
        data = json.loads(html)["result"]["data"]
        for ii in range(len(data)):
            mydata.append(data[ii]["nickname"])
    # Converting the list to a set removes duplicate users.
    myset = set(mydata)
    print(f"fork user count: {len(myset)}")
    return len(myset)


def main():
    # Project IDs to scrape
    projectids = [3741553, 3538237, 3571473, 3601731, 3581241]
    sum = 0
    for projectid in projectids:
        num = count_forks(projectid)
        sum = sum + num
    print("sum: ", sum)


if __name__ == '__main__':
    main()
```
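As a style aside, using `sum` as a counter variable shadows Python's built-in sum(); a generator expression avoids both the shadowing and the manual loop. The sketch below uses a stub `count_forks` with made-up numbers in place of the network-bound function above:

```python
# Stub standing in for the real count_forks; the counts are made up.
def count_forks(projectid):
    fake_counts = {3741553: 10, 3538237: 5}
    return fake_counts.get(projectid, 0)

# The built-in sum() over a generator replaces the accumulator loop.
total = sum(count_forks(p) for p in [3741553, 3538237])
print("sum:", total)  # → sum: 15
```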