I. Introduction
A few days ago, 金鱼 asked me whether I could scrape the past posts of the WeChat Official Account "北邮家教部" and work out which grades and subjects appear most often. I had never written a crawler for WeChat Official Accounts before, so I looked into it, and this article is the result.
II. Preparation
- Python 3 (I used Python 3.8.3)
- the requests library
- a WeChat Official Account that you can log into
III. Getting started
(1) Batch-collecting the URLs of the account's past posts
1. How the Official Account backend inserts hyperlinks to other accounts' posts
The biggest obstacle to batch-scraping an Official Account's past posts is obtaining their URLs. When we open a post in the usual way, WeChat generates a random URL for it, and that URL bears no relation to the URLs of the account's other posts. Scraping every post would therefore mean opening each one by hand and copying its link, which is clearly impractical. After digging through a range of resources, I borrowed the method described in the article 如何爬取公众号所有文章 (how to scrape all articles of an Official Account).
The idea is as follows: when we log in to the Official Account backend and edit a rich-media draft, we can insert links to other accounts' posts into the material. To do this, the backend calls an internal API that returns a list of permanent links to all of the target account's posts.
Open Chrome DevTools, switch to the Network tab, type "北邮家教部" into the account search box of the hyperlink-insertion dialog, then search for and select the account. A new entry whose name starts with "appmsg" appears in the Network panel; this is the request we want to analyze.
Click that "appmsg" entry and look at the request URL:
https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MjM5NDY3ODI4OA==&type=9&query=&token=1983840068&lang=zh_CN&f=json&ajax=1
The URL breaks down into three parts:
- https://mp.weixin.qq.com/cgi-bin/appmsg is the base of the request
- ?action=list_ex is a pattern commonly seen on dynamic sites, where different parameter values produce different pages or results
- &begin=0&count=5&fakeid=MjM5NDY3ODI4OA==&type=9&query=&token=1983840068&lang=zh_CN&f=json&ajax=1 sets the remaining parameters
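To make the split concrete, here is a small sketch (my own addition, standard library only) that separates the captured URL into its base and its parameters:

from urllib.parse import urlparse, parse_qs

# The request URL captured from DevTools, as shown above
captured = ("https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5"
            "&fakeid=MjM5NDY3ODI4OA==&type=9&query=&token=1983840068"
            "&lang=zh_CN&f=json&ajax=1")

parsed = urlparse(captured)
base = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"  # base of the request
params = {k: v[0] for k, v in parse_qs(parsed.query, keep_blank_values=True).items()}

print(base)    # https://mp.weixin.qq.com/cgi-bin/appmsg
print(params)  # {'action': 'list_ex', 'begin': '0', 'count': '5', ...}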
2. Obtaining the Cookie and User-Agent
Accessing that URL directly with Python's requests library does not return a valid result. The reason is that we are logged in when inserting hyperlinks through the web backend, whereas a bare Python request is not. We therefore copy the Cookie and User-Agent the browser sends and pass them to requests via the headers argument. I saved them, together with the account identifier fakeid and the token parameter, in a YAML file so they are easy to load when scraping:
cookie : appmsglist_action_3899……
user_agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64)……
fakeid : MzI4M……
token : "19……
Load it in the Python code like this:
import yaml

with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent']
}
3. Setting the URL parameters
Next, set up the parameters of the URL we are going to request:
# Request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}
Here count is the number of items returned per request, and begin is the offset at which the current request starts. With begin=0 the API returns the five most recent posts as JSON, with begin=5 the next five, and so on.
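To illustrate how begin and count drive the paging, here is a hypothetical helper (not part of the original script):

# Page i (0-based) starts at offset i * count
def page_params(i, count=5):
    return {"begin": str(i * count), "count": str(count)}

print(page_params(0))  # {'begin': '0', 'count': '5'}   -> the five newest posts
print(page_params(3))  # {'begin': '15', 'count': '5'}  -> posts 16 through 20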
4. Start scraping
Scrape in a loop, bumping the page index i by one on each pass (so begin advances in steps of five):
i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # Pause a random number of seconds so the requests are not sent too quickly
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params, verify=False)
    # WeChat rate limiting: wait an hour, then retry the same page
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        time.sleep(3600)
        continue
    # An empty list means we have reached the oldest post
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
    msg = resp.json()
    if "app_msg_list" in msg:
        for item in msg["app_msg_list"]:
            info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
            with open("app_msg_list.csv", "a", encoding='utf-8') as f:
                f.write(info + '\n')
        print(f"Page {i} scraped successfully\n")
        print("\n".join(info.split(",")))
        print("\n\n---------------------------------------------------------------------------------\n")
    # Next page
    i += 1
After roughly 50 pages, I ran into the following error:
{'base_resp': {'err_msg': 'freq control', 'ret': 200013}}
This is WeChat's rate limiting on the Official Account platform; waiting an hour is enough to continue. I handle it with the following code:
# WeChat rate limiting: wait an hour, then retry the same page
if resp.json()['base_resp']['ret'] == 200013:
    print("frequency control, stop at {}".format(str(begin)))
    time.sleep(3600)
    continue
Each record that comes back is parsed and appended to a CSV file:
msg = resp.json()
if "app_msg_list" in msg:
    for item in msg["app_msg_list"]:
        info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
        with open("app_msg_list.csv", "a", encoding='utf-8') as f:
            f.write(info + '\n')
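One caveat of this hand-rolled format: a title that itself contains a comma will later break the split(",") step (which is why the extraction script skips malformed rows). A safer variant, sketched below with the standard csv module and reusing msg from the snippet above, quotes every field automatically:

import csv

# Sketch: csv.writer escapes commas inside titles, so rows always keep four fields
with open("app_msg_list.csv", "a", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for item in msg["app_msg_list"]:
        writer.writerow([item["aid"], item["title"], item["link"], item["create_time"]])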
5. Complete code
The full script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    : Spider.py
@Time    : 2021/06/04 02:20:24
@Author  : YuFanWenShu
@Contact : 1365240381@qq.com
'''

# Imports
import json
import requests
import time
import random
import yaml

# Load the credentials (cookie, user_agent, fakeid, token) saved in the YAML file
with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent']
}

# Request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}

# Container for results (unused below; rows are written straight to the CSV)
app_msg_list = []

# We don't know how many posts the account has, so loop with while;
# this also makes it easy to pick a starting page when re-running.
with open("app_msg_list.csv", "w", encoding='utf-8') as file:
    file.write("aid,title,link,create_time\n")

i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # Pause a random number of seconds so the requests are not sent too quickly
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params, verify=False)
    # WeChat rate limiting: wait an hour, then retry the same page
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        time.sleep(3600)
        continue
    # An empty list means we have reached the oldest post
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
    msg = resp.json()
    if "app_msg_list" in msg:
        for item in msg["app_msg_list"]:
            info = '"{}","{}","{}","{}"'.format(str(item["aid"]), item['title'], item['link'], str(item['create_time']))
            with open("app_msg_list.csv", "a", encoding='utf-8') as f:
                f.write(info + '\n')
        print(f"Page {i} scraped successfully\n")
        print("\n".join(info.split(",")))
        print("\n\n---------------------------------------------------------------------------------\n")
    # Next page
    i += 1
6. Scraping results
The CSV file ends up holding information on 565 posts in total.
(2) Fetching each post and extracting the needed information
1. Iterating over and fetching every post
Read each post's URL from the CSV file and fetch its content with requests:
with open("app_msg_list.csv","r",encoding="utf-8") as f:
data = f.readlines()
n = len(data)
for i in range(n):
mes = data[i].strip("\n").split(",")
if len(mes)!=4:
continue
title,url = mes[1:3]
if i>0:
r = requests.get(eval(url),headers=headers)
if r.status_code == 200:
text = r.text
projects = re_project.finditer(text)
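A note on eval(url): every field was written wrapped in double quotes, so the stored URL still carries them, and eval() happens to strip them off. A plain strip achieves the same thing without evaluating the string as code (my suggested variant, not the original):

# Equivalent result for this CSV format, without eval()
clean_url = url.strip('"')
r = requests.get(clean_url, headers=headers)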
2. Extracting the information and writing it to files
What we need from each tutoring listing is its grade and subject. After looking at how the posts are structured, I decided to extract both with regular expressions.
Some listings stay open for a long time and reappear in several posts, which would skew the statistics. I therefore identify listings by their number (编号) and count each number only once, using the following regular expression to capture all three fields:
re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")
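A quick demonstration of the pattern on a made-up HTML fragment (real post markup is longer, but the 编号/年级/科目 labels occur in this order):

import re

re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")

# Hypothetical fragment; the captured groups still carry the colon until wash() removes it
sample = '<span>编号:2021-001</span><span>年级:高一</span><span>科目:数学</span>'
for m in re_project.finditer(sample):
    print(m.group(1), m.group(2), m.group(3))  # :2021-001 :高一 :数学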
Define a wash function that normalises the extracted strings by stripping colons and whitespace:
def wash(subject):
    # Remove half- and full-width colons and stray space characters
    subject = subject.replace(":", "")
    subject = subject.replace(":", "")
    subject = subject.replace(" ", "")
    subject = subject.replace(" ", "")
    subject = subject.replace(" ", "")
    return subject
Match the regex against every page we fetch:
for project in projects:
    iid, grade, subject = project.group(1), project.group(2), project.group(3)
    iid = wash(iid)
    grade = wash(grade)
    subject = wash(subject)
Listings are identified by their id, and each id is written to the output files only once:
if iid not in save_list:
    save_list.append(iid)
    with open("subjects.txt", "a", encoding="utf-8") as x:
        x.write(subject + "\n")
        # x.write(subject + "-" + title + "\n")
    with open("grades.txt", "a", encoding="utf-8") as y:
        y.write(grade + "\n")
        # y.write(grade + "-" + title + "\n")
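Since save_list is only ever used for membership checks, a set would make the not in test O(1) instead of scanning the whole list; with a few hundred ids the difference is negligible, but the variant looks like this (a sketch with a made-up id, not the original code):

# Deduplication with a set instead of a list: same behaviour, O(1) lookups
seen_ids = set()

def is_new(iid):
    """Return True the first time an id is seen, False for every repeat."""
    if iid in seen_ids:
        return False
    seen_ids.add(iid)
    return True

print(is_new("2021-001"))  # True  -> write this listing
print(is_new("2021-001"))  # False -> duplicate, skip it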
3. Complete code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    : extract.py
@Time    : 2021/06/04 04:09:12
@Author  : YuFanWenShu
@Contact : 1365240381@qq.com
'''

# Imports
import re
import yaml
import requests

def wash(subject):
    # Remove half- and full-width colons and stray space characters
    subject = subject.replace(":", "")
    subject = subject.replace(":", "")
    subject = subject.replace(" ", "")
    subject = subject.replace(" ", "")
    subject = subject.replace(" ", "")
    return subject

re_subject = re.compile(r">科目(.*?)<", re.S)
re_grade = re.compile(r">年级(.*?)<", re.S)
re_project = re.compile(r">编号(.*?)<.*?>年级(.*?)<.*?>科目(.*?)<")

save_list = []

# Load the cookie and user agent saved earlier
with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent']
}

with open("app_msg_list.csv", "r", encoding="utf-8") as f:
    data = f.readlines()
n = len(data)

# Truncate the output files before appending to them in the loop
with open("subjects.txt", "w", encoding="utf-8") as x:
    pass
with open("grades.txt", "w", encoding="utf-8") as y:
    pass

for i in range(n):
    mes = data[i].strip("\n").split(",")
    # Skip rows that do not split into exactly four fields
    if len(mes) != 4:
        continue
    title, url = mes[1:3]
    # if "家教信息" in title:
    if i > 0:  # row 0 is the CSV header
        # The url field was stored wrapped in quotes; eval() strips them
        r = requests.get(eval(url), headers=headers)
        if r.status_code == 200:
            text = r.text
            projects = re_project.finditer(text)
            for project in projects:
                iid, grade, subject = project.group(1), project.group(2), project.group(3)
                iid = wash(iid)
                grade = wash(grade)
                subject = wash(subject)
                # Each listing id is written only once
                if iid not in save_list:
                    save_list.append(iid)
                    with open("subjects.txt", "a", encoding="utf-8") as x:
                        x.write(subject + "\n")
                        # x.write(subject + "-" + title + "\n")
                    with open("grades.txt", "a", encoding="utf-8") as y:
                        y.write(grade + "\n")
                        # y.write(grade + "-" + title + "\n")
            print(mes[1], "-", mes[3])
        else:
            print("error, status_code: ", r.status_code, "\n")
    print(f"Progress: {i}/{n}")
4. Results
The extracted grades and subjects are written to their own text files.
(3) Counting and output
1. Complete code
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File    : count.py
@Time    : 2021/06/04 04:46:58
@Author  : YuFanWenShu
@Contact : 1365240381@qq.com
'''

# Imports
from collections import Counter

with open("subjects.txt", "r", encoding="utf-8") as x:
    subjects = x.read().split("\n")
with open("grades.txt", "r", encoding="utf-8") as y:
    grades = y.read().split("\n")

print("\n-----------------------------------------------------------------------------------\n")
print("Subjects:")
for i in Counter(subjects).most_common(5):
    print(i)
print("\n-----------------------------------------------------------------------------------\n")
print("Grades:")
for i in Counter(grades).most_common(5):
    print(i)
2. Results
The script prints the five most common subjects and the five most common grades.
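For reference, most_common returns (value, count) tuples; a toy example with made-up data shows the shape of what gets printed:

from collections import Counter

# Made-up sample, just to show what most_common produces
grades = ["高一", "初三", "高一", "高三", "高一", "初三"]
for pair in Counter(grades).most_common(2):
    print(pair)  # ('高一', 3) then ('初三', 2)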
IV. Conclusion
This was my first attempt at scraping all posts of a specific Official Account, and it went fairly well. The final statistics were a little disappointing, though: there was less information than I expected, probably because many tutoring requests are never published on the account.
Learning how to scrape Official Accounts feels like opening the door to a new world. In the next post I plan to show how to scrape the images from posts and import them into PowerPoint automatically; stay tuned.