Concurrently Scraping Group Members' LeetCode Solved Counts with Multithreading


We had already collected the group members' LeetCode profile URLs in an Excel sheet:

import pandas as pd

df = pd.read_excel("【万人千题】.xlsx", usecols="A:G")
df

image-20220223210529350

The solved counts in the sheet are now out of date. What is the fastest way to scrape fresh ones?

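After some digging into LeetCode's API, I found that the site's data is all fetched through a single endpoint, leetcode-cn.com/graphql/, by sending it a query in a format like this: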

image-20220223211102550

{"operationName":"reputationUserReputations","variables":{"userSlugs":["01_qustionsolver"]},"query":"query reputationUserReputations($userSlugs: [String!]!) {
  reputationUserReputations(userSlugs: $userSlugs) {
    level
    reputation
    user {
      userSlug
      __typename
    }
    __typename
  }
}
"}

This syntax is GraphQL, a query language for APIs. Despite the name, it is not a graph database.
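As a minimal sketch of how such a request body could be assembled in Python: the helper name `build_reputation_payload` is hypothetical, while the operation name and query text are copied from the captured request above. Nothing is sent over the network here; the function only produces the JSON string.

```python
import json

# Query text copied from the captured request above.
REPUTATION_QUERY = """query reputationUserReputations($userSlugs: [String!]!) {
  reputationUserReputations(userSlugs: $userSlugs) {
    level
    reputation
    user {
      userSlug
      __typename
    }
    __typename
  }
}
"""


def build_reputation_payload(user_slugs):
    """Return the JSON body for a reputationUserReputations query.

    Hypothetical helper for illustration; it only serializes the payload.
    """
    payload = {
        "operationName": "reputationUserReputations",
        "variables": {"userSlugs": list(user_slugs)},
        "query": REPUTATION_QUERY,
    }
    # json.dumps escapes the newlines inside the query string as \n.
    return json.dumps(payload, separators=(",", ":"))


body = build_reputation_payload(["01_qustionsolver"])
print(body[:80])
```

The variables travel separately from the query text, so the same query string can be reused for any list of user slugs.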

I also found a GitHub project that has already documented these public APIs: github.com/prius/pytho…

The complete scraping code ended up as follows:

import json
import time
from concurrent.futures import ThreadPoolExecutor

import requests

headers = {"content-type": "application/json"}
data = {
    "operationName": "userPublicProfile",
    "variables": {
        "userSlug": "ying-xiong-na-li-chu-lai"
    },
    "query": '''query userPublicProfile($userSlug: String!) {
        userProfilePublicProfile(userSlug: $userSlug) {
            username
            submissionProgress {
                acTotal
            }
        }
    }
'''
}
# One shared session; the initial HEAD request obtains the csrftoken cookie.
sess = requests.Session()
sess.head("https://leetcode-cn.com/graphql/")
headers["x-csrftoken"] = sess.cookies["csrftoken"]


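def query_leetcode_acTotal(username):
    # Rows without a matching profile URL come through as NaN; skip them.
    if not isinstance(username, str):
        return pd.NA
    # Build a per-call payload. Writing the slug into the shared `data` dict
    # would let one thread overwrite another thread's userSlug.
    payload = {**data, "variables": {"userSlug": username}}
    for _ in range(3):
        res = sess.post(
            "https://leetcode-cn.com/graphql/",
            headers=headers,
            data=json.dumps(payload, ensure_ascii=False, separators=(',', ':'))
        )
        if res.status_code == 200:
            break
        print("Retrying in 1 second...")
        time.sleep(1)
    else:
        # Runs only when all three attempts failed.
        print("Failed")
        return pd.NA
    userProfilePublicProfile = res.json()['data']['userProfilePublicProfile']
    # The API returns null for a user that does not exist.
    if userProfilePublicProfile is None:
        return pd.NA
    acTotal = userProfilePublicProfile['submissionProgress']['acTotal']
    return acTotal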


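with ThreadPoolExecutor(max_workers=8) as executor:
    # Extract the user slug from each profile URL and query it in a worker thread.
    nums = executor.map(query_leetcode_acTotal, df.力扣主页.str.extract(
        r"leetcode-cn\.com/u/([^/]+)(?:/|$)", expand=False))
df["力扣题数"] = list(nums)
# Assign the result back instead of calling fillna(inplace=True) on the
# attribute, which may operate on a copy in recent pandas versions.
df["力扣题数"] = df["力扣题数"].fillna(0)
df.sort_values("力扣题数", ascending=False, inplace=True)
df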

image-20220223212045967

As you can see, with multithreading the solved counts of 178 users were scraped in 3 seconds, dozens of times faster than Selenium.
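The speedup comes from the fact that each request spends most of its time waiting on the network, so several threads can wait at once. Here is a rough sketch of that effect, assuming a simulated 50 ms delay in place of a real HTTP call; `fake_request` is a hypothetical stand-in, and the timings it prints are not measurements of LeetCode.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(user_id):
    """Stand-in for one HTTP request: wait 50 ms, then return a value."""
    time.sleep(0.05)
    return user_id * 2


user_ids = list(range(16))

# One request after another.
start = time.perf_counter()
sequential = [fake_request(u) for u in user_ids]
sequential_seconds = time.perf_counter() - start

# Up to eight requests waiting at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    threaded = list(executor.map(fake_request, user_ids))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Both runs produce the same results; only the total waiting time differs.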

The core of the multithreading is this:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    data = executor.map(func, list_data)

With this code, `executor.map` returns the results in the same order as the inputs, which keeps things short and simple.
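To illustrate the ordering guarantee, here is a small sketch in which later tasks are made to finish first. `slow_square` is a hypothetical function, and the `submit` plus `as_completed` variant is included only for contrast, since it yields results in completion order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(n):
    """Sleep longer for smaller inputs so that later tasks finish first."""
    time.sleep((4 - n) * 0.05)
    return n * n


inputs = [0, 1, 2, 3]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map: results come back in input order, whatever the finish order was.
    mapped = list(executor.map(slow_square, inputs))
    # submit + as_completed: results come back as each task finishes.
    futures = [executor.submit(slow_square, n) for n in inputs]
    completed = [f.result() for f in as_completed(futures)]

print("map:", mapped)
print("as_completed:", completed)
```

Because the results line up with the inputs, they can be assigned straight to a DataFrame column, as the scraping code above does.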