Environment setup
Code design
Usage and results

I. Environment Setup

1. Software versions

Python 3.7.4

Anaconda3

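2. Environment setup issues

Configuring the Anaconda environment variable
Problem: Anaconda had not been added to the environment variables, so third-party libraries installed with pip did not end up in the path where they could be used.
Solution: add the Anaconda paths to the system environment variables.
pip network problems
Problem: the connection was too slow for pip to upgrade itself or download Python libraries, and pip printed the following warning:
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Solution: change the command to `pip install xxx -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com` so that packages are downloaded from the Douban mirror.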
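
The warning above refers to the `ssl` module itself, so before switching mirrors it may be worth checking which interpreter is running and whether it can import `ssl`. The following is only a minimal sketch using the standard library; the function name `ssl_available` is made up for this example.

```python
import sys


def ssl_available():
    """Return True if this interpreter can import the ssl module."""
    try:
        import ssl  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # sys.executable shows which Python (for example the Anaconda one) is running.
    print("interpreter:", sys.executable)
    print("ssl module available:", ssl_available())
```

If the interpreter shown is not the Anaconda one, the environment variable fix described earlier is the first thing to revisit.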

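II. Code Design

1. Getting the cookie

After reading a lot of material online, I managed to obtain a cookie through a simulated login. However, the cookie obtained that way still cannot be used to connect to Weibo and fetch data, so the cookie used here was obtained by logging in to Weibo manually.
(There are many write-ups online on getting a cookie through simulated login, so it is not covered in detail here.)

Target platform: Weibo, www.weibo.cn
Getting the cookie: log in to www.weibo.cn with a username and password, then press F12 to open the developer tools. In the Network tab, find the record named weibo.cn and open it to view its cookies.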

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/6a64e1f8656a4de48e250131ec34f068)

![](https://p26-tt.byteimg.com/origin/pgc-image/48e1442bb1484f1d9d2400e754da9939)
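
The cookie copied from the developer tools is a single `name=value; name=value` string. As a rough sketch, it can be split into a dictionary for inspection or for passing to `requests` through its `cookies` argument. The helper name `parse_cookie_header` and the cookie values below are placeholders, not part of the original program.

```python
def parse_cookie_header(raw):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed pieces
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values, not a real session.
raw_cookie = "_T_WM=abc123; SUB=xyz=; SUHB=0jB"
cookies = parse_cookie_header(raw_cookie)
print(cookies)
# The dict could then be used as requests.get(url, cookies=cookies).
```

The original program takes the simpler route of pasting the whole string into the `Cookie` request header, which works just as well.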

2. Scraping the data

- Scraping text

This function scrapes the text of the comments. Part of the handling of redundant characters is omitted here.

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/c4e7fea8551f4099bd15ead89bb199a2)
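
To give an idea of what handling redundant characters can look like, here is a small sketch that strips tags, decodes HTML entities, and collapses whitespace. It is not the author's cleanup code; the sample markup and the name `clean_comment` are assumptions made for illustration.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")   # any HTML tag
SPACES = re.compile(r"\s+")    # runs of whitespace


def clean_comment(fragment):
    """Strip tags, decode entities, and collapse whitespace."""
    text = TAG.sub("", fragment)
    text = html.unescape(text)
    return SPACES.sub(" ", text).strip()


# Hypothetical markup, loosely modelled on a weibo.cn comment span.
sample = '<span class="ctt">Nice   photo &amp; caption<br/> </span>'
print(clean_comment(sample))
```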

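- Scraping images

This function scrapes images. Images in the Weibo post and images in the comments sit under different tags, so the tag filter differs between the two cases.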

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/ddcf7188462b47939f97f70db25721ac)
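
The tag filtering can be sketched with the standard library's `html.parser` instead of BeautifulSoup, so that the example runs without extra packages. It collects the `href` of every `<a>` whose text is `评论配图` ("comment image"), which is the link text the source code at the end of the article looks for. The class name and the sample markup are hypothetical, and unlike the original this sketch does not restrict the search to `span.ctt`.

```python
from html.parser import HTMLParser


class CommentImageLinks(HTMLParser):
    """Collect href values of <a> tags whose text is '评论配图'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "评论配图":
                self.links.append(self._href)
            self._href = None


# Hypothetical markup for illustration only.
sample = ('<span class="ctt">look '
          '<a href="http://example.com/p1.jpg">评论配图</a> '
          '<a href="/u/1">@someone</a></span>')
parser = CommentImageLinks()
parser.feed(sample)
print(parser.links)
```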

- Scraping emoticons

This function scrapes the emoticons used in the Weibo comments.

![](https://p1.pstatp.com/origin/pgc-image/d67d9b9d84e1479a9e5f730c58712550)
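
The source code at the end of the article prepends `http:` to every emoticon `src`, which suggests that the values are protocol-relative (starting with `//`). A slightly more defensive sketch, assuming the page was fetched from `http://weibo.cn/`, resolves each `src` against that base with `urllib.parse.urljoin`; the host names below are placeholders.

```python
from urllib.parse import urljoin

BASE = "http://weibo.cn/"  # assumed address of the page the src values came from


def absolute_src(src):
    """Turn a relative or protocol-relative src into an absolute URL."""
    return urljoin(BASE, src)


# Placeholder hosts, for illustration only.
print(absolute_src("//img.example.com/face.png"))
print(absolute_src("https://img.example.com/face.png"))
```

Unlike plain string concatenation, this leaves URLs that already carry a scheme unchanged.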

III. Usage and Results

1. Usage

Replace the cookie with your own

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/050bbc8025964c7e80df27303dc85834)

Change the comment URL to that of the Weibo post whose comments you want to scrape; the URL must end with "page="

![](https://p1.pstatp.com/origin/pgc-image/1a9f80fe9a684a348649aac639c58fc5)
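
The URL has to end with "page=" because the program appends the page number to it. A minimal sketch of that step follows; the helper name `page_urls` and the example address are hypothetical.

```python
def page_urls(base_url, start, end):
    """Build one URL per page by appending the page number to base_url."""
    if not base_url.endswith("page="):
        raise ValueError("base_url must end with 'page='")
    return [base_url + str(page) for page in range(start, end + 1)]


# Hypothetical address; use the comment URL of the post you want to scrape.
base = "https://weibo.cn/comment/EXAMPLE?page="
for url in page_urls(base, 1, 3):
    print(url)
```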

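In the same directory as the program, create two folders named 评论图片 (comment images) and 评论表情 (comment emoticons). The code uses relative paths, so the paths in the code do not need to be changed.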
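
The two folders could also be created by the program itself. The sketch below uses `os.makedirs` with `exist_ok=True`; the function name `ensure_folders` is an assumption, and the demo runs in a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

FOLDERS = ("评论图片", "评论表情")  # folder names the scraper writes into


def ensure_folders(base_dir="."):
    """Create the output folders under base_dir if they do not exist yet."""
    paths = []
    for name in FOLDERS:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folders(tmp)
    print([os.path.basename(p) for p in created])
```

Calling `ensure_folders()` with no argument at the start of the script would create the folders next to wherever the script is run from.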

2. Results

Run the program and enter the first and last page numbers to scrape.
The corresponding content is then scraped.
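
Since the page numbers come from keyboard input, it helps to validate them before any request is sent. This is a small sketch of such a check, assuming that pages are numbered from 1; the name `parse_page_range` is made up for this example.

```python
def parse_page_range(start_text, end_text):
    """Return (start, end) as ints, or None if the input is not usable."""
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        return None  # not a number
    if start < 1 or end < start:
        return None  # assumed rule: pages start at 1 and end >= start
    return start, end


print(parse_page_range("1", "3"))
print(parse_page_range("abc", "3"))
```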

Source code:

"""

环境: Python 3.7.4 内容:爬取微博评论内容、图片、表情修改时间：2020.10.27 @author: Wenwen

"""

import requests import urllib.request import re import time import csv from bs4 import BeautifulSoup

num = 1 list_content = [] list_t = []

#请求函数：获取某一网页上的所有内容 def get_one_page(url): headers = { 'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36', 'Host' : 'weibo.cn', 'Accept' : 'application/json, text/plain, /', 'Accept-Language' : 'zh-CN,zh;q=0.9', 'Accept-Encoding' : 'gzip, deflate, br', 'Cookie' : '_T_WM=7c42b73c4c9cfa6fc4ff7350804c4504; SUB=_2A25yR6oNDeRhGedO41AZ8CzJyz2IHXVRyzZFrDV6PUJbktAKLXfakW1Nm4uHzxmmnLSVywkxToubmHotH6MGAF_Y; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5cHZZGnYY4KoxQoDxu1c2T5NHD95QpehnE1h5ESK5pWs4DqcjHIsyQi--Ri-zciKnfi--RiK.7iKyh; SUHB=0jBFua3v6DW2fH; SSOLoginState=1598282334', 'DNT' : '1', 'Connection' : 'keep-alive' }#请求头的书写，包括User-agent,Cookie等 response = requests.get(url,headers = headers,verify=False)#利用requests.get命令获取网页html if response.status_code == 200:#状态为200即为爬取成功 return response.text#返回值为html文档，传入到解析函数当中 return None

#爬取微博评论内容 def comment_page(html): print("序号","内容") pattern = re.compile('.?', re.S) items = re.findall(pattern,html) result = str(items) reResults = re.findall(">.?<",result,re.S) first_results = [] for result in reResults: #print(result) if ">']['<" in result: continue if ">:<" in result: continue if "<a href" in result: continue if ">回复<" in result: continue if ">评论配图<" in result: continue if "><" in result: continue if ">', '<" in result: continue if "@" in result: continue if "> <" in result: continue else: first_results.append(result) #print("first:",firstStepResults) TEXT1 = re.compile(">") TEXT2 = re.compile("<")
```
datalist = []
datalist_t = []

for last_result in first_results:
    global num
    temp1 = re.sub(TEXT1, '', last_result)
    excel = re.sub(TEXT2, '', temp1)
    datalist_t = [num,excel]
    datalist.append(datalist_t)
    print(num,excel)
    with open('.\\微博评论内容.txt','a',encoding='utf-8') as fp:
        fp.write(excel)
    num += 1
'''    
if datalist == datalist1:
    datalist.clear()
else:
    datalist1 = datalist
'''
return datalist
```
```python
# Scrape the images attached to the Weibo comments
def comment_img(html):
    bf_1 = BeautifulSoup(html, 'lxml')  # BeautifulSoup object
    # All parent tags that hold comment content
    text_1 = bf_1.find_all('span', {'class': 'ctt'})
    img = []
    for span in text_1:
        # Image links are <a> tags whose text is "评论配图" (comment image)
        for x in span.find_all('a', string="评论配图"):
            link = x.get('href')
            if link:
                img.append(link)
    return img
```
```python
# Scrape the emoticons used in the Weibo comments
def comment_emotion(html):
    bf_1 = BeautifulSoup(html, 'lxml')  # BeautifulSoup object
    # All parent tags that hold comment content
    text_1 = bf_1.find_all('span', {'class': 'ctt'})
    emotion = []
    for span in text_1:
        # Emoticons are <img> tags inside the comment span
        for x in span.find_all('img'):
            link = x.get('src')
            if link:
                emotion.append('http:' + link)  # src has no scheme, so add one
    return emotion
```
```python
# Scrape the comments from page `ori` to page `end` (inclusive)
def comment(ori, end):
    alllist = []
    imglist = []
    emotionlist = []
    for i in range(ori, end + 1):
        # The URL is truncated in the article; use the full comment URL ending in "page="
        url = "weibo.cn/comment/hot…" + str(i)
        html = get_one_page(url)
        print('Scraping page %d of the comments' % i)
        if html is not None:  # skip pages whose request failed
            alllist += comment_page(html)
            imglist += comment_img(html)
            emotionlist += comment_emotion(html)
        time.sleep(3)  # pause between requests

    print('Finished: scraped %d pages of comments' % (end - ori + 1))
    print('Saving comments to 微博评论内容.csv ...')
    store_text(alllist)
    print('Saving comment images to the folder ...')
    store_img(imglist)
    print('Saving comment emoticons to the folder ...')
    store_emotion(emotionlist)
    print('All saved!')


# Store the comment text as CSV rows of [serial number, text]
def store_text(rows):
    with open('./微博评论内容.csv', 'w', encoding='utf-8-sig', newline='') as f:
        csv_writer = csv.writer(f)
        for row in rows:
            csv_writer.writerow(row)


# Store the comment images as 0.jpg, 1.jpg, ... in the 评论图片 folder
def store_img(urls):
    for j, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, './评论图片/%s.jpg' % j)


# Store the comment emoticons as 0.jpg, 1.jpg, ... in the 评论表情 folder
def store_emotion(urls):
    for k, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, './评论表情/%s.jpg' % k)


if __name__ == "__main__":
    try:
        ori = int(input("Enter the first page to scrape: "))
        end = int(input("Enter the last page to scrape: "))
    except ValueError:
        ori = end = 0  # non-numeric input is rejected below
    if ori > 0 and end >= ori:
        print("Pages %d to %d of the comments will be scraped" % (ori, end))
        comment(ori, end)
    else:
        print("Please enter valid numbers!")
```

