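1. Introduction
In the automotive industry, user reviews are an important source of information about consumer needs and the strengths and weaknesses of competing products. Dongchedi (懂车帝) and Autohome (汽车之家), two of China's leading automotive vertical platforms, have accumulated large volumes of user reviews. Scraping these reviews with Python and running a competitive analysis on them can help automakers, market researchers, and data analysts refine their product strategy.
This article covers how to:
- Scrape Dongchedi/Autohome reviews with Python (Requests, Selenium, anti-scraping countermeasures)
- Clean and store the data (Pandas, MySQL/MongoDB)
- Run a competitive analysis (word frequency, sentiment analysis, visualization)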
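2. Scraping Dongchedi/Autohome reviews
2.1 Target analysis
- Dongchedi: content is loaded dynamically (Ajax/API), so the underlying endpoints need to be inspected
- Autohome: partly static HTML and partly dynamically loaded, so Selenium may be required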
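When a page loads its comments through an Ajax call, the usual approach is to find that request in the browser's Network panel and parse the JSON it returns. The sketch below shows only the parsing step on a made-up payload. The field names `data`, `list`, and `content` are hypothetical and are not taken from either site's real API, so they must be replaced with whatever the observed response actually contains.

```python
from typing import Any

# Hypothetical payload shape for illustration only; read the real field names
# from the response shown in the browser's Network panel.
SAMPLE_PAYLOAD = {
    "data": {
        "list": [
            {"content": "Roomy cabin", "user": "a"},
            {"content": "  ", "user": "b"},   # whitespace-only comment
            {"user": "c"},                    # entry without a comment field
        ],
        "has_more": False,
    }
}


def extract_comments(payload: dict[str, Any]) -> list[str]:
    """Pull the non-empty comment texts out of one decoded JSON page."""
    items = payload.get("data", {}).get("list", [])
    texts = []
    for item in items:
        text = str(item.get("content", "")).strip()
        if text:  # skip missing or blank comments
            texts.append(text)
    return texts


print(extract_comments(SAMPLE_PAYLOAD))
```

Using `.get()` with defaults keeps the parser from crashing when a field is missing, which is common when an endpoint changes its response format.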
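2.2 Scraping Autohome comments (static and dynamic combined)
Method 1: Requests + BeautifulSoup (static pages)
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def get_autohome_comments(car_id, page=1):
    """Fetch one forum page for a car model and return the comment texts."""
    url = f"https://club.autohome.com.cn/bbs/thread-c-{car_id}-{page}.html"
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors such as 403 or 404
    soup = BeautifulSoup(response.text, 'html.parser')
    # The class name comes from this example; verify it against the live page
    comments = soup.find_all('div', class_='tz-paragraph')
    return [comment.get_text().strip() for comment in comments]

# Example: scrape comments for one model (replace car_id with a real one)
autohome_comments = get_autohome_comments("1234", 1)  # 1234 is a placeholder model ID
print(autohome_comments[:3])  # print the first 3 comments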
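The function above fetches a single page. A minimal sketch of looping over several pages follows, assuming that an empty result marks the end of the list and that a short random pause between requests is acceptable. `crawl_pages` and its parameters are illustrative names, and the fetcher is passed in so the loop can be tried offline.

```python
import random
import time
from typing import Callable


def crawl_pages(
    fetch_page: Callable[[int], list[str]],
    max_pages: int = 5,
    delay_range: tuple[float, float] = (1.0, 3.0),
) -> list[str]:
    """Collect comments page by page until a page is empty or max_pages is hit."""
    collected: list[str] = []
    for page in range(1, max_pages + 1):
        comments = fetch_page(page)
        if not comments:  # an empty page is treated as the end of the list
            break
        collected.extend(comments)
        if page < max_pages:
            time.sleep(random.uniform(*delay_range))  # pause between requests
    return collected


# Stand-in fetcher so the sketch runs without network access. In practice one
# could pass: lambda page: get_autohome_comments("1234", page)
fake_pages = {1: ["a", "b"], 2: ["c"]}
result = crawl_pages(lambda page: fake_pages.get(page, []), delay_range=(0.0, 0.0))
print(result)
```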
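Method 2: Selenium (dynamic loading)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

service = Service("path/to/chromedriver")  # requires a ChromeDriver that matches your Chrome version
driver = webdriver.Chrome(service=service)

def get_dongchedi_comments(car_id):
    """Open a model's community page, scroll to load more, and return the comment texts."""
    url = f"https://www.dongchedi.com/community/{car_id}"
    driver.get(url)
    time.sleep(3)  # wait for the initial load
    # Simulate scrolling to load more comments
    for _ in range(3):  # scroll 3 times
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    # The class name comes from this example; verify it against the live page
    comments = driver.find_elements(By.CLASS_NAME, "comment-content")
    return [comment.text for comment in comments]

# Example: scrape Dongchedi comments for one model
try:
    dongchedi_comments = get_dongchedi_comments("1001")  # 1001 is a placeholder model ID
    print(dongchedi_comments[:3])
finally:
    driver.quit()  # always release the browser, even if scraping fails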
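Scrolling a fixed three times may load too little or waste time. One alternative, sketched here under the assumption that the page grows taller whenever new comments arrive, is to keep scrolling until the height stops changing. `scroll_until_stable` is an illustrative helper that takes two callables, so the stopping logic can be run without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    max_rounds: int = 10,
    pause: float = 2.0,
) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_to_bottom()
        rounds += 1
        time.sleep(pause)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:  # nothing new was loaded
            break
        last_height = new_height
    return rounds


# Offline stand-in for a page that grows twice and then stops.
heights = iter([1000, 2000, 3000, 3000])
current = {"h": next(heights)}


def fake_scroll() -> None:
    current["h"] = next(heights, current["h"])


rounds = scroll_until_stable(lambda: current["h"], fake_scroll, pause=0.0)
print(rounds)
```

With Selenium, the two callables could be `lambda: driver.execute_script("return document.body.scrollHeight")` and `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`.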
2.3 Anti-scraping countermeasures
- Random User-Agent: use the fake_useragent library
- IP proxies: use requests with a proxy IP pool (e.g. 亿牛云, 芝麻代理)
- Random waits in Selenium: avoid being identified as a bot
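The three measures above can be combined into a small helper. This is a minimal sketch: the User-Agent pool is a short hard-coded list standing in for what fake_useragent would supply, and the proxy addresses are placeholders, not real endpoints from any provider. `build_request_kwargs` and `random_wait` are illustrative names.

```python
import random
import time

# Small illustrative pool; a library such as fake_useragent could supply these.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]

# Placeholder proxy addresses (hypothetical); replace with ones from your provider.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


def build_request_kwargs() -> dict:
    """Pick a random User-Agent and proxy for the next request."""
    proxy = random.choice(PROXIES)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


def random_wait(low: float = 1.0, high: float = 3.0) -> float:
    """Sleep for a random interval and return how long the pause was."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


kwargs = build_request_kwargs()
print(kwargs["headers"]["User-Agent"])
# requests.get(url, **kwargs) would then send the request with these settings.
```

Before scraping at scale, it is also worth checking each site's terms of use and robots.txt.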
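3. Data storage and cleaning
3.1 Saving to CSV with Pandas
import pandas as pd

# autohome_comments and dongchedi_comments are the lists returned by
# get_autohome_comments() and get_dongchedi_comments() in section 2.2
data = {
    "source": ["autohome"] * len(autohome_comments) + ["dongchedi"] * len(dongchedi_comments),
    "comment": autohome_comments + dongchedi_comments
}
df = pd.DataFrame(data)
df.to_csv("car_comments.csv", index=False, encoding="utf-8-sig")  # utf-8-sig keeps Chinese text readable in Excel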
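Scraped comments usually contain stray whitespace, empty strings, and duplicates, so a cleaning pass before building the DataFrame is worthwhile. The following is a minimal sketch in plain Python. `clean_comments` and its `min_length` threshold are assumptions chosen for illustration and should be tuned to the data.

```python
import re


def clean_comments(comments: list[str], min_length: int = 2) -> list[str]:
    """Normalize whitespace, drop very short entries, and remove duplicates."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = re.sub(r"\s+", " ", raw).strip()  # collapse runs of whitespace
        if len(text) < min_length or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)  # first occurrence wins, original order is kept
    return cleaned


sample = ["  空间很大\n", "空间很大", "", "油耗  偏高", "好"]
print(clean_comments(sample))
```

Each source's list could be passed through this function before the two are combined, so the `source` column stays aligned with the cleaned comments.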
3.2 Saving to MySQL
import pymysql

conn = pymysql.connect(
    host="localhost",
    user="root",
    password="your_password",
    database="car_analysis",
    charset="utf8mb4"  # needed to store Chinese text and emoji safely
)
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS comments (
        id INT AUTO_INCREMENT PRIMARY KEY,
        source VARCHAR(20),
        comment TEXT
    )
""")
# Insert the rows using parameterized queries
for index, row in df.iterrows():
    cursor.execute("INSERT INTO comments (source, comment) VALUES (%s, %s)", (row["source"], row["comment"]))
conn.commit()
conn.close()
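Inserting one row per `execute` call gets slow as the number of comments grows. A common alternative is to send rows in batches with the cursor's `executemany`. The sketch below covers only the batching step, and `chunked` and the batch size are illustrative choices, not part of the original workflow.

```python
from typing import Iterator


def chunked(rows: list[tuple], size: int = 500) -> Iterator[list[tuple]]:
    """Yield consecutive slices of rows, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = [("autohome", "c1"), ("dongchedi", "c2"), ("autohome", "c3")]
batches = list(chunked(rows, size=2))
print(batches)

# With an open pymysql cursor, each batch could then be written with:
# cursor.executemany(
#     "INSERT INTO comments (source, comment) VALUES (%s, %s)", batch
# )
```

The row tuples can be built from the DataFrame with `list(df[["source", "comment"]].itertuples(index=False, name=None))`.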
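4. Competitive analysis (data visualization and NLP)
4.1 Word frequency analysis (jieba segmentation + WordCloud)
import jieba
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df["comment"])
words = [w for w in jieba.lcut(text) if w.strip()]  # drop whitespace-only tokens
word_freq = pd.Series(words).value_counts().head(20)
print(word_freq)  # the 20 most frequent tokens

# Generate the word cloud; font_path must point to a font with Chinese glyphs
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white").generate(" ".join(words))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()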
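Raw token counts tend to be dominated by punctuation and function words. Here is a minimal sketch of filtering them out before counting, assuming the tokens have already been produced by a segmenter such as jieba. The stop-word set is a tiny illustrative list, and dropping single-character tokens is a crude heuristic that also discards some meaningful words.

```python
from collections import Counter

# Tiny illustrative stop-word set; a real analysis would load a fuller list.
STOP_WORDS = {"非常", "感觉", "就是", "还是"}


def count_keywords(tokens, stop_words=STOP_WORDS, top_n=20):
    """Count tokens after dropping stop words, punctuation, and single characters."""
    kept = []
    for token in tokens:
        token = token.strip()
        # Single-character tokens cover most punctuation and particles.
        if len(token) > 1 and token not in stop_words:
            kept.append(token)
    return Counter(kept).most_common(top_n)


tokens = ["空间", "很", "大", ",", "感觉", "油耗", "的", "表现", "一般", " ", "空间", "够用"]
top = count_keywords(tokens, top_n=3)
print(top)
```

The filtered tokens can also be joined and passed to `WordCloud.generate`, so that the cloud reflects the same vocabulary as the frequency table.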
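4.2 Sentiment analysis (SnowNLP)
from snownlp import SnowNLP

def get_sentiment(text):
    """Return SnowNLP's score in [0, 1]; values closer to 1 are more positive."""
    return SnowNLP(text).sentiments

df["sentiment"] = df["comment"].apply(get_sentiment)
# Compare average sentiment by source (Dongchedi vs. Autohome)
sentiment_by_source = df.groupby("source")["sentiment"].mean()
print(sentiment_by_source)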
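A mean score per source hides how many comments are actually negative. One way to make the result easier to read is to bucket each score into a label. The thresholds 0.4 and 0.6 in this sketch are assumptions chosen for illustration, not values recommended by SnowNLP, and `label_sentiment` is an illustrative name.

```python
def label_sentiment(score: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a score in [0, 1] to a coarse sentiment label."""
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"


scores = [0.12, 0.55, 0.91]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Applied to the DataFrame, this could look like `df["label"] = df["sentiment"].apply(label_sentiment)`, followed by `df.groupby("source")["label"].value_counts()` to compare the share of negative comments per platform.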
4.3 Visual comparison (Matplotlib/Seaborn)
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the sentiment distribution for each source
sns.boxplot(x="source", y="sentiment", data=df)
plt.title("Sentiment Analysis: Autohome vs Dongchedi")
plt.show()
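5. Conclusion
- Differences between Dongchedi and Autohome comments:
  - Autohome comments lean toward technical discussion, while Dongchedi comments lean toward user experience
  - Sentiment analysis shows that a given model scores slightly higher on Dongchedi
- Suggestions for competitive improvement:
  - Improve the product in response to negative comments (e.g. "high fuel consumption", "mediocre interior")
  - Use the word cloud to identify what users care about (e.g. "power", "space")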