Crawling Recent Juejin Posts to Generate a Word Cloud


Crawling Juejin Articles

API Analysis

The simplest way to collect the data is to call the API directly. Inspecting the requests Juejin's PC site makes in the browser's network panel turns up the endpoint below.

There is an endpoint at https://api.juejin.cn/recommend_api/v1/article/recommend_all_feed?aid=2608&uuid=6896764480194545166&spider=0 with the following request body:

{
  "id_type": 2,
  "client_type": 2608,
  "sort_type": 300,
  "cursor": "eyJ2IjoiNzI2ODI5MzU0Mjk3MTI3NzM0OSIsImkiOjIwfQ==",
  "limit": 20
}

The cursor value looks like base64; decoding it as base64 gives

{"v":"7268293542971277349","i":20}

Calling the endpoint a few more times and decoding each cursor shows that i is simply the pagination offset.
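The decoding step above can be sketched in a few lines. The helper make_cursor is my own name, not part of any API; it just rebuilds a cursor for an arbitrary page under the assumption that i = page * 20:

```python
import base64
import json

# Decode the cursor captured from the browser (value taken from the request above)
cursor = "eyJ2IjoiNzI2ODI5MzU0Mjk3MTI3NzM0OSIsImkiOjIwfQ=="
decoded = json.loads(base64.b64decode(cursor))
print(decoded)  # {'v': '7268293542971277349', 'i': 20}

# Rebuild a cursor for page n, assuming i is the offset (page * page_size)
def make_cursor(v: str, page: int, page_size: int = 20) -> str:
    payload = json.dumps({"v": v, "i": page * page_size})
    return base64.b64encode(payload.encode("utf-8")).decode()
```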

Requesting the Data

Crawl the first 10 pages and save the article titles and summaries to a file:

import base64
import json

import requests

contents = []
url = "https://api.juejin.cn/recommend_api/v1/article/recommend_all_feed?aid=2608&uuid=6896764480194545166&spider=0"

for i in range(10):
    print(f"fetching page {i}")
    # i * 20 is the offset encoded in the cursor; the first page starts at 0
    cursor = base64.b64encode(json.dumps({"v": "7268293542971277349", "i": i * 20}).encode("utf-8"))
    data = {
        "client_type": 2608,
        "cursor": cursor.decode(),
        "id_type": 2,
        "limit": 20,
        "sort_type": 300,  # matches the request body captured from the browser
    }

    # json= serializes the body and sets the Content-Type: application/json header
    response = requests.post(
        url,
        headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"},
        json=data,
    )
    results = response.json()
    for result in results["data"]:
        # item_type 2 is a regular article; skip ads and other feed items
        if result["item_type"] != 2:
            continue
        article = result["item_info"]["article_info"]
        contents.append(article["title"])
        contents.append(article["brief_content"])

# Save the collected text to a file
text_path = "./content.txt"
with open(text_path, "w", encoding="utf-8") as f:
    f.write(" ".join(contents))

Segmenting with jieba and Counting Word Frequencies

import operator
import re

import jieba

with open(text_path, "r", encoding="utf-8") as f:
    sentence = f.read()

# Strip punctuation. Note that "-" must be escaped inside a character class,
# otherwise "+-=" is read as a range that also swallows digits and commas.
pattern = r"[_.!+\-=——,$%^,。?、~@#¥%……&*《》<>「」{}【】()/ ]"
text = re.sub(pattern, " ", sentence)
wordlist_after_jieba = jieba.lcut(text)
words_dropped_space = [w.strip() for w in wordlist_after_jieba if w.strip()]

# Filter out stop words
stopwords = {line.strip() for line in open("stop_words_base.txt", encoding="utf-8")}
words_dropped_stopwords = [w for w in words_dropped_space if w not in stopwords]

words_count = {}  # word -> frequency
for word in words_dropped_stopwords:
    words_count[word] = words_count.get(word, 0) + 1
# Sort by frequency in descending order
sorted_words_count = dict(sorted(words_count.items(), key=operator.itemgetter(1), reverse=True))
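As an aside, the hand-rolled counting loop above is equivalent to collections.Counter from the standard library; a minimal sketch with made-up sample words:

```python
from collections import Counter

# Hypothetical sample tokens standing in for the jieba output
words = ["前端", "面试", "前端", "ChatGPT", "前端"]
counts = Counter(words)
# most_common() returns (word, count) pairs sorted by descending frequency
sorted_counts = dict(counts.most_common())
print(sorted_counts)
```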

Generating the Word Cloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(width=600, height=400,
               background_color='white',
               mode='RGB',
               max_words=500,
               font_path="STHeiti Light.ttc",  # a font with Chinese glyphs is required
               max_font_size=150,
               relative_scaling=0.6,
               random_state=50,
               scale=2
               ).generate_from_frequencies(sorted_words_count)

plt.imshow(wc)
plt.axis('off')
plt.show()
wc.to_file('juejin.jpg')

Results

The resulting word cloud shows that the most discussed topics on Juejin are front-end development, interviews, and ChatGPT.

Full Code

The full code is on GitHub: github.com/stars1324/c…

I'll keep adding crawlers for other communities and running the same word-cloud analysis; if you find this interesting, a star would be appreciated.