We collected the group members' LeetCode profile URLs into an Excel spreadsheet in advance:
import pandas as pd

# Read columns A:G of the member spreadsheet into a DataFrame
df = pd.read_excel("【万人千题】.xlsx", usecols="A:G")
df
The LeetCode problem counts in the sheet now need updating. What is the fastest way to crawl them?
After some digging into LeetCode's API, I found that all of the data is fetched through a single endpoint, leetcode-cn.com/graphql/, by POSTing a query statement in a format like the following:
{"operationName":"reputationUserReputations","variables":{"userSlugs":["01_qustionsolver"]},"query":"query reputationUserReputations($userSlugs: [String!]!) {
reputationUserReputations(userSlugs: $userSlugs) {
level
reputation
user {
userSlug
__typename
}
__typename
}
}
"}
This query syntax is GraphQL, an API query language (despite the name, it is not a graph database).
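A minimal sketch of sending the reputation query above with requests might look like this (I am assuming the csrftoken cookie, fetched via a HEAD request, is required here just as it is for the profile query used later):

import requests

# Minimal sketch: POST the reputation query above to the graphql endpoint.
# "01_qustionsolver" is simply the example userSlug from the payload shown above.
sess = requests.session()
sess.head("https://leetcode-cn.com/graphql/")  # grab the csrftoken cookie
payload = {
    "operationName": "reputationUserReputations",
    "variables": {"userSlugs": ["01_qustionsolver"]},
    "query": """query reputationUserReputations($userSlugs: [String!]!) {
  reputationUserReputations(userSlugs: $userSlugs) {
    level
    reputation
    user { userSlug }
  }
}""",
}
res = sess.post(
    "https://leetcode-cn.com/graphql/",
    json=payload,
    headers={"x-csrftoken": sess.cookies.get("csrftoken", "")},
)
print(res.json())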
I also found a GitHub project that has already documented these public APIs: github.com/prius/pytho…
The final, complete crawling code is as follows:
import requests
import json
import time
from concurrent.futures import ThreadPoolExecutor

headers = {"content-type": "application/json"}
data = {
    "operationName": "userPublicProfile",
    "variables": {
        "userSlug": "ying-xiong-na-li-chu-lai"
    },
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
'''
}

sess = requests.session()
# A HEAD request is enough to obtain the csrftoken cookie required by the API
sess.head("https://leetcode-cn.com/graphql/")
headers["x-csrftoken"] = sess.cookies["csrftoken"]


def query_leetcode_acTotal(username):
    if not isinstance(username, str):
        return pd.NA
    # Build a per-call payload instead of mutating the shared dict,
    # so concurrent threads do not overwrite each other's userSlug
    payload = {**data, "variables": {"userSlug": username}}
    for _ in range(3):
        res = sess.post(
            "https://leetcode-cn.com/graphql/",
            headers=headers,
            data=json.dumps(payload, ensure_ascii=False, separators=(',', ':'))
        )
        if res.status_code == 200:
            break
        print("Retrying in 1 second...")
        time.sleep(1)
    else:
        print("Failed")
        return pd.NA
    userProfilePublicProfile = res.json()['data']['userProfilePublicProfile']
    if userProfilePublicProfile is None:
        return pd.NA
    return userProfilePublicProfile['submissionProgress']['acTotal']


with ThreadPoolExecutor(max_workers=8) as executor:
    # Extract the userSlug from each profile URL and query 8 users concurrently
    nums = executor.map(query_leetcode_acTotal, df.力扣主页.str.extract(
        r"leetcode-cn\.com/u/([^/]+)(?:/|$)", expand=False))
df["力扣题数"] = list(nums)
df.力扣题数.fillna(0, inplace=True)
df.sort_values("力扣题数", ascending=False, inplace=True)
df
As you can see, with multithreading the problem counts of all 178 users were crawled in about 3 seconds, dozens of times faster than the Selenium approach.
The core multithreading code is:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    data = executor.map(func, list_data)
With this pattern, the results come back in the same order as the input, and the code stays short and readable.
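The order guarantee of executor.map is easy to verify with a small experiment (slow_square below is just a made-up stand-in for any I/O-bound task):

from concurrent.futures import ThreadPoolExecutor
import random
import time

def slow_square(x):
    # Sleep a random amount so tasks finish in a different order than they start
    time.sleep(random.random() / 10)
    return x * x

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(slow_square, range(10)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81], input order is preserved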