Scraping the Posts of the WeChat Official Account "北邮家教部" with Python

I. Introduction

A few days ago, 金鱼 asked me whether I could scrape the full archive of posts from the WeChat official account "北邮家教部" and work out which grades and subjects appear most often. I had never written a scraper for a WeChat official account before, so I looked into it, and this article is the result.

II. Prerequisites

  • Python 3 (I used Python 3.8.3)
  • the requests library
  • a WeChat official account you can log in to

III. Getting Started

(1) Collecting the URLs of past posts in bulk

1. How the official-account backend inserts links to another account's posts

The hardest part of scraping all of an account's past posts is obtaining their URLs. When you open a post in WeChat, the client shows a randomly generated URL that bears no relation to the URLs of the account's other posts, so collecting them by hand would mean opening and copying every single post, which is clearly impractical. After reading around, I adopted the approach described in the article 《如何爬取公众号所有文章》.

The idea is as follows: when we log in to the official-account backend and edit a rich-text draft, we can insert links to another official account's posts. To do this, the backend calls an internal API that returns a list of permanent links to all of that account's posts.

(Figure: inserting another account's post into a draft)

Open Chrome's developer tools, switch to the Network tab, type "北邮家教部" into the account search box of the insert-hyperlink dialog, and select the account. A request whose name starts with "appmsg" appears in the Network panel; that is the one we want to analyse.

(Figure: the Network panel showing the appmsg request)

Click the "appmsg" entry and look at its request URL:

https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MjM5NDY3ODI4OA==&type=9&query=&token=1983840068&lang=zh_CN&f=json&ajax=1

The URL has three parts (the sketch after this list splits them apart programmatically):

  1. https://mp.weixin.qq.com/cgi-bin/appmsg is the base endpoint of the request
  2. ?action=list_ex selects the action; as on many dynamic sites, different parameter values make the same endpoint return different pages or results
  3. &begin=0&count=5&fakeid=MjM5NDY3ODI4OA==&type=9&query=&token=1983840068&lang=zh_CN&f=json&ajax=1 sets the remaining parameters
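
As a quick illustration (a sketch of my own, not part of the original workflow), the standard library can split the captured URL into its endpoint and parameters:

import urllib.parse

# split the captured appmsg request URL into its endpoint and query parameters
appmsg_url = (
    "https://mp.weixin.qq.com/cgi-bin/appmsg"
    "?action=list_ex&begin=0&count=5&fakeid=MjM5NDY3ODI4OA=="
    "&type=9&query=&token=1983840068&lang=zh_CN&f=json&ajax=1"
)
parsed = urllib.parse.urlparse(appmsg_url)
print(parsed.scheme + "://" + parsed.netloc + parsed.path)  # the base endpoint
for key, values in urllib.parse.parse_qs(parsed.query, keep_blank_values=True).items():
    print(key, "=", values[0])  # action, begin, count, fakeid, ...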

2. Getting the Cookie and User-Agent

If we request this URL directly with Python's requests library, we do not get a normal response. The reason is that we are logged in when we insert the hyperlink from the web backend, whereas a bare Python request is not. We therefore need to copy the Cookie and User-Agent the browser uses and pass them to requests through the headers argument. I saved the Cookie and User-Agent, together with the account identifier fakeid and the token parameter, in a YAML file so they are easy to load when scraping.

cookie : appmsglist_action_3899……
user_agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64)……
fakeid : MzI4M……
token : "19……

Load it in Python like this:

import yaml
with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data) 

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'] 
}

3. Setting the request parameters

Next, set up the parameters of the URL we are going to request:

# request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}

Here count is the number of items returned per request, and begin is the offset of the first item rather than a page number: with begin set to 0 the API returns the five most recent posts as JSON, with begin set to 5 the next five, and so on.
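
As a quick sanity check (my own sketch, not part of the original script), a single request with these parameters prints the first page of results; it assumes requests has been imported and that url, headers and params are defined as above, and it only touches fields the full script also uses:

# fetch one page (begin=0, count=5) and print the titles and links it contains
resp = requests.get(url, headers=headers, params=params)
page = resp.json()
for item in page.get("app_msg_list", []):
    print(item["title"], item["link"])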

4. Running the crawl

We crawl in a loop, increasing i by 1 each iteration so that begin advances in steps of five:

i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # pause a few random seconds so we do not send requests too quickly and get flagged
    time.sleep(random.randint(1,10))
    resp = requests.get(url, headers=headers, params = params, verify=False)
    # WeChat rate limiting: wait an hour, then retry
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        time.sleep(3600)
        continue
    
    # stop once the returned list is empty
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
        
    msg = resp.json()
    if "app_msg_list" in msg:
        for item in msg["app_msg_list"]:
            info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
            with open("app_msg_list.csv", "a",encoding='utf-8') as f:
                f.write(info+'\n')
        print(f"page {i} scraped successfully\n")
        print("\n".join(info.split(",")))
        print("\n\n---------------------------------------------------------------------------------\n")

    # next page
    i += 1

At around page 50 the request failed with the following error: {'base_resp': {'err_msg': 'freq control', 'ret': 200013}} This is WeChat's rate limiting; waiting an hour is enough. I handle it with the code below:

# WeChat rate limiting: wait an hour, then retry
if resp.json()['base_resp']['ret'] == 200013:
    print("frequency control, stop at {}".format(str(begin)))
    time.sleep(3600)
    continue
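
An alternative I find a bit more robust (a sketch of my own, not the original code) is to wrap the request in a helper that waits and retries a bounded number of times, so a persistent block cannot stall the loop forever; it assumes requests and time are imported as in the full script:

def get_page(url, headers, params, max_retries=3, wait_seconds=3600):
    # retry a bounded number of times while WeChat keeps returning ret == 200013
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params)
        if resp.json()['base_resp']['ret'] != 200013:
            return resp
        print("frequency control, retry {}/{}".format(attempt + 1, max_retries))
        time.sleep(wait_seconds)
    raise RuntimeError("still rate limited after {} retries".format(max_retries))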

Each record returned is parsed and appended to a CSV file:

msg = resp.json()
if "app_msg_list" in msg:
    for item in msg["app_msg_list"]:
        info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
        with open("app_msg_list.csv", "a",encoding='utf-8') as f:
            f.write(info+'\n')
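
Because a title can itself contain commas, this hand-rolled quoting is fragile (the extraction script later has to skip any row that splits into the wrong number of fields). A variant of my own using the standard csv module, which handles the quoting for you, could look like this:

import csv

msg = resp.json()
if "app_msg_list" in msg:
    # csv.writer quotes titles that contain commas or quotation marks
    with open("app_msg_list.csv", "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for item in msg["app_msg_list"]:
            writer.writerow([item["aid"], item["title"], item["link"], item["create_time"]])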

5. Full code

The complete script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    :   Spider.py
@Time    :   2021/06/04 02:20:24
@Author  :   YuFanWenShu 
@Contact :   1365240381@qq.com
'''

# here put the import lib

import json
import requests
import time
import random

import yaml
with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data) 

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'] 
}

# request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}

# results (each row is also written straight to the CSV below)
app_msg_list = []
# we do not know in advance how many posts the account has, so loop with while;
# this also makes it easy to restart from a given page
with open("app_msg_list.csv", "w",encoding='utf-8') as file:
    file.write("aid,title,link,create_time\n")
i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # pause a few random seconds so we do not send requests too quickly and get flagged
    time.sleep(random.randint(1,10))
    resp = requests.get(url, headers=headers, params = params, verify=False)
    # WeChat rate limiting: wait an hour, then retry
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        time.sleep(3600)
        continue
    
    # stop once the returned list is empty
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
        
    msg = resp.json()
    if "app_msg_list" in msg:
        for item in msg["app_msg_list"]:
            info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
            with open("app_msg_list.csv", "a",encoding='utf-8') as f:
                f.write(info+'\n')
        print(f"page {i} scraped successfully\n")
        print("\n".join(info.split(",")))
        print("\n\n---------------------------------------------------------------------------------\n")

    # next page
    i += 1    

6. Crawl results

The CSV ends up containing 565 post records. (Figure: the resulting CSV file)

(2) Fetching each post and extracting the information we need

1. Fetching every post

Read each post's URL from the CSV and fetch the post's content with requests:

with open("app_msg_list.csv","r",encoding="utf-8") as f:
    data = f.readlines()
n = len(data)
for i in range(n):
    mes = data[i].strip("\n").split(",")
    if len(mes)!=4:
        # skip rows whose title contains a comma and therefore splits into extra fields
        continue
    title,url = mes[1:3]
    if i>0:  # row 0 is the CSV header
        # each field was written wrapped in double quotes; eval() strips them
        r = requests.get(eval(url),headers=headers)
        if r.status_code == 200:
            text = r.text
            projects = re_project.finditer(text)

2. Extracting the information and writing it to files

From each tutoring listing we need the grade and the subject. After looking at how the posts are structured, I decided to extract them with regular expressions. (Figure: structure of a post) Some listings that go unclaimed for a long time reappear across several posts, which would skew the statistics, so I identify each listing by its number (编号) and count a given number only once. The following regular expression matches all three fields:

re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")
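
To make the capture groups concrete, here is a quick check against a made-up fragment; the tag layout is only my assumption of what the real post HTML roughly looks like:

import re

re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")

# hypothetical fragment in the style of a listing inside a post
sample = "<span>编号:A01</span><span>年级:高一</span><span>科目:数学</span>"
for m in re_project.finditer(sample):
    print(m.group(1), m.group(2), m.group(3))  # prints ":A01 :高一 :数学"; wash() below strips the colons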

Define a wash function that normalizes the extracted strings:

def wash(subject):
    # strip full-width and ASCII colons, stray spaces, and the HTML &nbsp; entity
    subject = subject.replace(":","")
    subject = subject.replace(":","")
    subject = subject.replace(" ","")
    subject = subject.replace(" ","")
    subject = subject.replace("&nbsp;","")
    return subject

Run the regex over the content of each fetched page:

for project in projects:
    iid,grade,subject = project.group(1),project.group(2),project.group(3)
    iid = wash(iid)
    grade = wash(grade)
    subject = wash(subject)

Listings are identified by their number; a given listing is written to the files only once:

if iid not in save_list:
    save_list.append(iid)
    with open("subjects.txt","a",encoding="utf-8") as x:
        x.write(subject+"\n")
        # x.write(subject+"-"+title+"\n")
    with open("grades.txt","a",encoding="utf-8") as y:
        y.write(grade+"\n")
        # y.write(grade+"-"+title+"\n")
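
Since save_list is only ever used for membership checks, a set would do the same job with constant-time lookups instead of a linear scan. A small variant of my own:

# variant sketch: de-duplicate listing numbers with a set instead of a list
seen_ids = set()

def record(iid, grade, subject):
    # write each listing's grade and subject at most once
    if iid in seen_ids:
        return
    seen_ids.add(iid)
    with open("subjects.txt", "a", encoding="utf-8") as x:
        x.write(subject + "\n")
    with open("grades.txt", "a", encoding="utf-8") as y:
        y.write(grade + "\n")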

3. Full code

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    :   extract.py
@Time    :   2021/06/04 04:09:12
@Author  :   YuFanWenShu 
@Contact :   1365240381@qq.com
'''

# here put the import lib
import re
import yaml
import requests

def wash(subject):
    subject = subject.replace(":","")
    subject = subject.replace(":","")
    subject = subject.replace(" ","")
    subject = subject.replace(" ","")
    subject = subject.replace("&nbsp;","")
    return subject

re_subject = re.compile(r">科目(.*?)<",re.S)
re_grade = re.compile(r">年级(.*?)<",re.S)
re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")

save_list = []

with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data) 

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'] 
}


with open("app_msg_list.csv","r",encoding="utf-8") as f:
    data = f.readlines()
n = len(data)

with open("subjects.txt","w",encoding="utf-8") as x:
    pass
with open("grades.txt","w",encoding="utf-8") as y:
    pass

for i in range(n):
    mes = data[i].strip("\n").split(",")
    if len(mes)!=4:
        continue
    title,url = mes[1:3]
    # if "家教信息" in title:
    if i>0:  # row 0 is the CSV header
        r = requests.get(eval(url),headers=headers)  # eval() strips the quotes around the stored URL
        if r.status_code == 200:
            text = r.text
            projects = re_project.finditer(text)
            for project in projects:
                iid,grade,subject = project.group(1),project.group(2),project.group(3)
                iid = wash(iid)
                grade = wash(grade)
                subject = wash(subject)
                if iid not in save_list:
                    save_list.append(iid)
                    with open("subjects.txt","a",encoding="utf-8") as x:
                        x.write(subject+"\n")
                        # x.write(subject+"-"+title+"\n")
                    with open("grades.txt","a",encoding="utf-8") as y:
                        y.write(grade+"\n")
                        # y.write(grade+"-"+title+"\n")
                    
            print(mes[1],"-",mes[3])
        else:
            print("error,status_code: ",r.status_code,"\n")
    print(f"progress: {i}/{n}")

4. Results

The extracted grades and subjects are written to separate text files.

(Figures: the grades file and the subjects file)

(3) Counting and reporting

1. Full code

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    :   count.py
@Time    :   2021/06/04 04:46:58
@Author  :   YuFanWenShu 
@Contact :   1365240381@qq.com
'''

# here put the import lib

from collections import Counter

with open("subjects.txt","r",encoding="utf-8") as x:
    subjects = [line for line in x.read().split("\n") if line]  # drop the empty trailing entry
with open("grades.txt","r",encoding="utf-8") as y:
    grades = [line for line in y.read().split("\n") if line]
print("\n-----------------------------------------------------------------------------------\n")    
print("Subjects:")
for i in Counter(subjects).most_common(5):
    print(i)

print("\n-----------------------------------------------------------------------------------\n")   
print("Grades:")
for i in Counter(grades).most_common(5):
    print(i)

2. Results

The five most frequent subjects and grades are printed. (Figure: program output)

IV. Closing Thoughts

This was my first attempt at scraping every post of a particular official account, and it worked fairly well. The final statistics are a little thinner than I hoped, with less information than expected; perhaps many tutoring listings are simply never published on the account.

Learning how to scrape an official account feels like opening the door to a new world. In the next post I plan to show how to scrape the images from the posts and import them into a PPT automatically. Stay tuned!