Scraping Dongchedi and Autohome Reviews with Python for Competitive Analysis


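1. Introduction

In the automotive industry, user reviews are an important source of information about what consumers want and where competing models are strong or weak. Dongchedi (懂车帝) and Autohome (汽车之家), two of China's leading automotive vertical platforms, have accumulated a large volume of user reviews. Scraping these reviews with Python and running a competitive analysis on them can help automakers, market researchers, and data analysts refine their product strategy.

This article covers how to:

  1. Scrape Dongchedi/Autohome reviews with Python (Requests, Selenium, anti-scraping countermeasures)
  2. Clean and store the data (Pandas, MySQL/MongoDB)
  3. Run a competitive analysis (word frequency, sentiment analysis, visualization)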

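2. Scraping Dongchedi/Autohome Reviews

2.1 Target Analysis

  • Dongchedi: content is loaded dynamically (Ajax/API), so the underlying requests need to be inspected
  • Autohome: partly static HTML and partly dynamically loaded, so Selenium may be needed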
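
One quick way to tell the two cases apart is to check whether a review you can see in the browser also appears in the raw HTML returned by a plain HTTP request. The sketch below is only an illustration: the helper name `appears_in_raw_html` and both sample pages are made up, and a real check would use the body of an actual response.

```python
def appears_in_raw_html(raw_html: str, snippet: str) -> bool:
    """Return True if a review snippet seen in the browser is present in the raw HTML."""
    def normalize(s: str) -> str:
        # Remove all whitespace so line breaks in the markup do not hide a match.
        return "".join(s.split())
    return normalize(snippet) in normalize(raw_html)


# Hypothetical stand-ins for response bodies fetched without a browser.
static_page = '<div class="tz-paragraph">油耗表现不错,\n 市区百公里8个油</div>'
dynamic_page = '<div id="app"></div><script src="/static/bundle.js"></script>'

snippet = "油耗表现不错,市区百公里8个油"
print(appears_in_raw_html(static_page, snippet))   # True: Requests + BeautifulSoup is enough
print(appears_in_raw_html(dynamic_page, snippet))  # False: look for an API call or use Selenium
```

If the snippet is missing from the raw HTML, the content is most likely filled in by JavaScript after the page loads.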

2.2 Scraping the Reviews (Static and Dynamic Pages)

Method 1: Requests + BeautifulSoup (static pages, Autohome)

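import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def get_autohome_comments(car_id, page=1):
    """Return the review texts found on one Autohome forum page."""
    url = f"https://club.autohome.com.cn/bbs/thread-c-{car_id}-{page}.html"
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    response.encoding = response.apparent_encoding  # pages are not always served as UTF-8
    soup = BeautifulSoup(response.text, 'html.parser')
    # The 'tz-paragraph' class depends on the current page layout and may change
    comments = soup.find_all('div', class_='tz-paragraph')
    return [comment.get_text().strip() for comment in comments]

# Example: scrape the reviews for one model (replace car_id with a real ID)
autohome_comments = get_autohome_comments("1234", 1)  # "1234" is a placeholder model ID
print(autohome_comments[:3])  # print the first 3 reviews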
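
The function above fetches a single page. A minimal sketch of looping over several pages follows, assuming a page fetcher with the same `(car_id, page)` signature. `crawl_pages`, `fake_fetch`, and the sample data are hypothetical names introduced for illustration, and the stop-on-empty-page rule is an assumption about how the site paginates.

```python
import random
import time
from typing import Callable, List


def crawl_pages(fetch_page: Callable[[str, int], List[str]], car_id: str,
                max_pages: int = 5, delay_range=(1.0, 3.0)) -> List[str]:
    """Collect reviews page by page until a page comes back empty."""
    all_comments: List[str] = []
    seen = set()
    for page in range(1, max_pages + 1):
        batch = fetch_page(car_id, page)
        if not batch:
            break  # assume an empty page means there is nothing left
        for text in batch:
            if text not in seen:  # skip reviews repeated across pages
                seen.add(text)
                all_comments.append(text)
        time.sleep(random.uniform(*delay_range))  # pause between requests
    return all_comments


# Stand-in fetcher with made-up data so the sketch runs offline.
def fake_fetch(car_id, page):
    pages = {1: ["空间很大", "动力一般"], 2: ["动力一般", "油耗偏高"]}
    return pages.get(page, [])


print(crawl_pages(fake_fetch, "1234", delay_range=(0, 0)))
# ['空间很大', '动力一般', '油耗偏高']
```

In real use, a function such as get_autohome_comments would be passed as `fetch_page` in place of the stand-in.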

Method 2: Selenium (dynamically loaded pages, Dongchedi)

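from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

service = Service("path/to/chromedriver")  # download the ChromeDriver that matches your Chrome version
driver = webdriver.Chrome(service=service)

def get_dongchedi_comments(car_id):
    """Return the review texts visible after scrolling a Dongchedi community page."""
    url = f"https://www.dongchedi.com/community/{car_id}"
    driver.get(url)
    time.sleep(3)  # wait for the initial load

    # Scroll down to trigger loading of more reviews
    for _ in range(3):  # scroll 3 times
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

    # The "comment-content" class depends on the current page layout and may change
    comments = driver.find_elements(By.CLASS_NAME, "comment-content")
    return [comment.text for comment in comments]

# Example: scrape the Dongchedi reviews for one model
try:
    dongchedi_comments = get_dongchedi_comments("1001")  # "1001" is a placeholder model ID
    print(dongchedi_comments[:3])
finally:
    driver.quit()  # always close the browser, even if scraping fails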

2.3 Anti-Scraping Countermeasures

  • Random User-Agent: use fake_useragent
  • IP proxies: use requests with a proxy IP pool (for example 亿牛云 or 芝麻代理)
  • Random waits in Selenium: avoid being identified as a bot
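
The three ideas above can be combined in a few lines. The sketch below uses only the standard library so it runs anywhere. The hard-coded User-Agent strings and proxy addresses are placeholders, and `build_request_kwargs` and `polite_sleep` are hypothetical helper names. In practice the User-Agent pool could come from fake_useragent and the proxy list from a proxy provider.

```python
import random
import time

# Assumed sample values; replace with a real User-Agent source and proxy pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]
PROXY_POOL = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]  # placeholders


def build_request_kwargs():
    """Pick a random User-Agent and proxy, in the shape requests.get() accepts."""
    proxy = random.choice(PROXY_POOL)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep for a random interval and return the duration used."""
    duration = random.uniform(min_s, max_s)
    time.sleep(duration)
    return duration


kwargs = build_request_kwargs()
print(kwargs["headers"]["User-Agent"][:30], kwargs["proxies"]["http"])
# With requests installed: response = requests.get(url, **kwargs)
```

Calling `polite_sleep()` between page loads works the same way in a Selenium loop as it does between Requests calls.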

3. Data Storage and Cleaning
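
Scraped review text usually needs a light cleaning pass before it is stored. The following is a minimal sketch of one possible pass rather than a fixed recipe: the function name `clean_comments`, the 5-character minimum length, and the sample data are all assumptions made for illustration.

```python
import re
from typing import Iterable, List


def clean_comments(comments: Iterable[str], min_length: int = 5) -> List[str]:
    """Normalize whitespace, drop very short texts, and remove exact duplicates."""
    cleaned: List[str] = []
    seen = set()
    for raw in comments:
        if not isinstance(raw, str):
            continue  # skip None or other non-text placeholders
        text = re.sub(r"<[^>]+>", "", raw)        # strip leftover HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        if len(text) < min_length or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = [
    "  空间很大,后排\n坐三个人没问题 ",
    "不错",
    "空间很大,后排 坐三个人没问题",
    None,
    "<p>油耗偏高,市区要十个油</p>",
]
print(clean_comments(sample))
# ['空间很大,后排 坐三个人没问题', '油耗偏高,市区要十个油']
```

The cleaned lists can then be stored exactly as shown in the subsections that follow.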

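3.1 Saving to CSV with Pandas

import pandas as pd

# autohome_comments and dongchedi_comments are the lists scraped in section 2
data = {
    "source": ["autohome"] * len(autohome_comments) + ["dongchedi"] * len(dongchedi_comments),
    "comment": autohome_comments + dongchedi_comments
}

df = pd.DataFrame(data)
# utf-8-sig writes a BOM so that Excel displays the Chinese text correctly
df.to_csv("car_comments.csv", index=False, encoding="utf-8-sig")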

3.2 Saving to MySQL

import pymysql

conn = pymysql.connect(
    host="localhost",
    user="root",
    password="your_password",
    database="car_analysis",
    charset="utf8mb4"  # full Unicode support, including emoji in reviews
)

try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS comments (
                id INT AUTO_INCREMENT PRIMARY KEY,
                source VARCHAR(20),
                comment TEXT
            )
        """)

        # Insert all rows in one batch instead of one statement per row
        rows = list(df[["source", "comment"]].itertuples(index=False, name=None))
        cursor.executemany("INSERT INTO comments (source, comment) VALUES (%s, %s)", rows)
    conn.commit()
finally:
    conn.close()  # release the connection even if an insert fails

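4. Competitive Analysis (Visualization and NLP)

4.1 Word Frequency (jieba Segmentation + WordCloud)

import jieba
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df["comment"].dropna().astype(str))
# Keep tokens longer than one character to drop punctuation, spaces, and most particles
words = [w for w in jieba.lcut(text) if len(w.strip()) > 1]
word_freq = pd.Series(words).value_counts().head(20)
print(word_freq)  # the 20 most frequent words

# Generate the word cloud; font_path must point to a font that contains Chinese glyphs
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white").generate(" ".join(words))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()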
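
Raw word counts tend to be dominated by function words, so a stop-word filter usually makes the ranking more useful. Here is a small sketch that works on already segmented tokens, such as the output of jieba.lcut. The stop-word set is a tiny illustrative sample, and `top_words` and the token list are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A tiny illustrative stop-word set; a real analysis would load a fuller list.
STOPWORDS: Set[str] = {"的", "了", "是", "我", "也", "很", "就是", "但是", "这个", "感觉"}


def top_words(tokens: Iterable[str], stopwords: Set[str] = STOPWORDS,
              n: int = 3) -> List[Tuple[str, int]]:
    """Count tokens after dropping stop words, whitespace, and single characters."""
    kept = [t.strip() for t in tokens
            if len(t.strip()) > 1 and t.strip() not in stopwords]
    return Counter(kept).most_common(n)


# Pre-segmented tokens, shaped like the output of jieba.lcut (made-up data).
tokens = ["这个", "车", "的", "空间", "很", "大", ",", "动力", "也", "不错",
          "但是", "油耗", "高", "空间", "感觉", "够用", "动力", "空间"]
print(top_words(tokens))
# [('空间', 3), ('动力', 2), ('不错', 1)]
```

Running the same function separately on the tokens from each source gives two rankings that can be compared side by side.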

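4.2 Sentiment Analysis (SnowNLP)

from snownlp import SnowNLP

def get_sentiment(text):
    """Return SnowNLP's positivity score in [0, 1]; higher means more positive."""
    return SnowNLP(text).sentiments

df["sentiment"] = df["comment"].apply(get_sentiment)

# Compare the average sentiment by source (Dongchedi vs. Autohome)
sentiment_by_source = df.groupby("source")["sentiment"].mean()
print(sentiment_by_source)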
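
A mean score can hide how the reviews are distributed, so it can help to bucket the scores into labels and compare the share of each label per source. The sketch below assumes cut-offs of 0.4 and 0.6, which are arbitrary choices and not SnowNLP defaults. The helper names and sample scores are made up.

```python
from collections import Counter, defaultdict


def label_sentiment(score: float, neg_max: float = 0.4, pos_min: float = 0.6) -> str:
    """Map a score in [0, 1] to a coarse label using assumed thresholds."""
    if score < neg_max:
        return "negative"
    if score > pos_min:
        return "positive"
    return "neutral"


def label_shares(rows):
    """rows: iterable of (source, score). Return {source: {label: share}}."""
    counts = defaultdict(Counter)
    for source, score in rows:
        counts[source][label_sentiment(score)] += 1
    return {
        source: {label: round(c / sum(counter.values()), 2) for label, c in counter.items()}
        for source, counter in counts.items()
    }


# Made-up (source, score) pairs standing in for df[["source", "sentiment"]].
rows = [("autohome", 0.91), ("autohome", 0.12), ("autohome", 0.55), ("autohome", 0.80),
        ("dongchedi", 0.75), ("dongchedi", 0.30)]
print(label_shares(rows))
# {'autohome': {'positive': 0.5, 'negative': 0.25, 'neutral': 0.25}, 'dongchedi': {'positive': 0.5, 'negative': 0.5}}
```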

4.3 Visual Comparison (Matplotlib/Seaborn)

import matplotlib.pyplot as plt
import seaborn as sns

# Plot the sentiment distribution for each source
sns.boxplot(x="source", y="sentiment", data=df)
plt.title("Sentiment Analysis: Autohome vs Dongchedi")
plt.show()

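5. Conclusion

  • Differences between Dongchedi and Autohome reviews
    • Autohome reviews lean toward technical discussion, while Dongchedi reviews lean toward user experience
    • Sentiment analysis shows that a given model scores slightly higher on Dongchedi
  • Suggestions for competitive improvement
    • Improve the product in response to negative reviews (for example "油耗高", high fuel consumption, or "内饰一般", mediocre interior)
    • Use the word cloud to identify what users care about (for example "动力", power, and "空间", space)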
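
To turn negative reviews into concrete suggestions, one option is to count which aspects are mentioned in low-scoring reviews for each source. The sketch below is illustrative only. The aspect lexicon is a hypothetical one built from the example keywords above, the 0.4 threshold is an assumption, and the rows are made-up data.

```python
from collections import Counter, defaultdict

# Hypothetical aspect lexicon built from the example keywords in the conclusion.
ASPECT_KEYWORDS = {
    "fuel consumption": ["油耗"],
    "interior": ["内饰"],
    "power": ["动力"],
    "space": ["空间"],
}


def negative_aspect_counts(rows, threshold=0.4):
    """rows: iterable of (source, comment, sentiment). Count aspects in negative reviews."""
    counts = defaultdict(Counter)
    for source, comment, sentiment in rows:
        if sentiment >= threshold:
            continue  # only look at reviews scored as negative
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if any(keyword in comment for keyword in keywords):
                counts[source][aspect] += 1
    return {source: dict(counter) for source, counter in counts.items()}


# Made-up rows standing in for df[["source", "comment", "sentiment"]].
rows = [
    ("autohome", "油耗高,内饰一般", 0.15),
    ("autohome", "动力充沛,空间够用", 0.92),
    ("dongchedi", "油耗高得离谱", 0.08),
    ("dongchedi", "内饰一般,空间也小", 0.21),
]
print(negative_aspect_counts(rows))
# {'autohome': {'fuel consumption': 1, 'interior': 1}, 'dongchedi': {'fuel consumption': 1, 'interior': 1, 'space': 1}}
```

With the DataFrame from section 4.2, the rows could be produced by `df[["source", "comment", "sentiment"]].itertuples(index=False, name=None)`.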