🍲 大众点评 (Dianping) Hotpot Data: Scraping, Analysis, and a Recommendation System | A Complete Python Walkthrough

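This article is a condensed version of a class-of-2024 undergraduate graduation project. See the end of the article for how to get the full source code and dataset.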

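💡 Background and Industry Pain Points

Value of UGC platforms:

✅ Genuine user feedback: consumers share first-hand experiences, so credibility is high
✅ Large data volume: 180+ hotpot shops and tens of thousands of user reviews
✅ Market insight: reflects consumer preferences and market trends
✅ Decision support: gives merchants and diners data to base their decisions on

Limitations of traditional information retrieval:

❌ Information overload: useful information is hard to pull quickly out of a huge volume of reviews
❌ Subjective judgment: relies on personal experience rather than data
❌ Low efficiency: manual filtering is slow and laborious, with limited accuracy
❌ Weak personalization: hard to meet the specific needs of different users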

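🏗️ Technical Architecture

End-to-End Analysis Pipeline

🕷️ Data collection layer:
├── Requests: sends HTTP requests
├── BeautifulSoup: parses page content
└── Local storage: saves the HTML source

🛠️ Data processing layer:
├── Data cleaning: deduplication, missing-value handling
├── Text processing: word segmentation, stop-word removal
└── Feature extraction: merchant, user, and dish features

📊 Analysis and visualization layer:
├── Word clouds: distribution of business districts, types, and dishes
├── Bar charts: price and count statistics
└── Ranking analysis: TOP10 shop recommendations

🎯 Recommendation layer:
├── Personalized recommendation: based on user preferences
└── Smart filtering: multi-dimensional condition filters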

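Tech Stack

Area	Tools	Use
Data collection	Requests + BeautifulSoup	Page crawling, content parsing
Data processing	Pandas + NumPy	Data cleaning, feature engineering
Text processing	jieba	Chinese word segmentation, part-of-speech tagging
Visualization	WordCloud + Matplotlib	Word clouds, bar charts
Data analysis	Statistical analysis	Trend insight, pattern discovery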

⚡ Core Code

1. Scraping Dianping Data

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

class DazhongdianpingCrawler:
    """
    Crawler for Dianping shop pages.
    """
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        }
        # One session reuses the connection and sends the same headers every time
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def crawl_shop_info(self, shop_url):
        """
        Fetch one shop page and parse its details.
        Returns a dict with 'basic_info' and 'reviews', or None on failure.
        """
        try:
            # Send the HTTP request
            response = self.session.get(shop_url, timeout=10)
            response.encoding = 'utf-8'

            if response.status_code != 200:
                print(f"Request failed, status code: {response.status_code}")
                return None

            soup = BeautifulSoup(response.text, 'html.parser')
            return {
                # Basic shop information
                'basic_info': self.parse_shop_basic_info(soup),
                # User reviews
                'reviews': self.parse_shop_reviews(soup),
            }

        except requests.RequestException as e:
            print(f"Error while crawling: {e}")
            return None

        finally:
            # Random delay after every attempt, failed ones included,
            # to reduce the risk of an IP ban
            time.sleep(random.uniform(1, 3))
    
    def parse_shop_basic_info(self, soup):
        """
        Parse the basic shop information.
        """
        shop_info = {}

        try:
            # Shop name
            name_elem = soup.find('h1', class_='shop-name')
            shop_info['name'] = name_elem.text.strip() if name_elem else 'Unknown'

            # Shop rating; fall back to 0.0 if the text is not a number,
            # so one bad field does not abort the remaining fields
            rating_elem = soup.find('span', class_='rating')
            try:
                shop_info['rating'] = float(rating_elem.text.strip()) if rating_elem else 0.0
            except ValueError:
                shop_info['rating'] = 0.0

            # Average price per person
            price_elem = soup.find('span', class_='price')
            shop_info['avg_price'] = self.extract_price(price_elem.text) if price_elem else 0

            # Shop category
            category_elem = soup.find('span', class_='category')
            shop_info['category'] = category_elem.text.strip() if category_elem else 'Unknown'

            # Business district
            location_elem = soup.find('span', class_='region')
            shop_info['location'] = location_elem.text.strip() if location_elem else 'Unknown'

            # Number of reviews
            review_count_elem = soup.find('span', class_='review-count')
            shop_info['review_count'] = self.extract_number(review_count_elem.text) if review_count_elem else 0

        except Exception as e:
            print(f"Error parsing shop info: {e}")

        return shop_info

    def parse_shop_reviews(self, soup):
        """
        Parse user reviews.
        """
        reviews = []
        try:
            review_elems = soup.find_all('div', class_='review-words')

            for review_elem in review_elems[:10]:  # keep only the first 10 reviews
                review_text = review_elem.text.strip()
                if review_text and len(review_text) > 5:  # skip very short reviews
                    reviews.append(review_text)

        except Exception as e:
            print(f"Error parsing reviews: {e}")

        return reviews
    
    def extract_price(self, price_text):
        """Extract the price as an integer (first run of digits, 0 if none)."""
        return self.extract_number(price_text)

    def extract_number(self, text):
        """Extract the first run of digits as an integer (0 if none)."""
        import re
        numbers = re.findall(r'\d+', text)
        return int(numbers[0]) if numbers else 0

# Usage example
crawler = DazhongdianpingCrawler()
shop_data = crawler.crawl_shop_info('https://www.dianping.com/shop/xxx')
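
The `extract_number` helper above keeps only the first run of digits, so a count rendered with a thousands separator, or a price with a decimal part, would be cut short. A minimal sketch of more tolerant helpers follows. The function names `parse_count` and `parse_price` and the sample strings are hypothetical, and whether Dianping pages actually format numbers this way is an assumption to check against real pages.

```python
import re


def parse_count(text: str) -> int:
    """Return the first number in text, treating commas as thousands separators."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return 0
    return int(match.group().replace(',', ''))


def parse_price(text: str) -> float:
    """Return the first number in text, keeping an optional decimal part."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else 0.0


# Hypothetical sample strings, not taken from real pages
print(parse_count('1,234条评论'))
print(parse_price('人均 ¥85.5'))
```

Both helpers return zero when no digits are present, matching the fallback behavior of the original methods.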

2. Data Cleaning and Processing

import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from collections import Counter

class DataProcessor:
    """
    Data cleaning and processing.
    """
    def __init__(self, data_path):
        self.df = pd.read_csv(data_path)
        self.stop_words = self.load_stop_words()

    def load_stop_words(self):
        """Load the stop-word list."""
        stop_words = set()
        # Basic stop words
        basic_stop_words = ['的', '了', '在', '是', '我', '有', '和', '就',
                          '不', '人', '都', '一', '一个', '上', '也', '很',
                          '到', '说', '要', '去', '你', '会', '着', '没有',
                          '看', '好', '自己', '这', '那', '吃', '火锅']
        stop_words.update(basic_stop_words)
        return stop_words

    def data_cleaning(self):
        """
        Clean the data.
        """
        print("Starting data cleaning...")

        # Drop duplicate rows
        initial_count = len(self.df)
        self.df = self.df.drop_duplicates()
        print(f"Duplicates removed: {initial_count} -> {len(self.df)}")

        # Drop rows missing any key field
        self.df = self.df.dropna(subset=['name', 'rating', 'avg_price'])

        # Keep prices within a plausible range for hotpot
        self.df = self.df[(self.df['avg_price'] >= 30) & (self.df['avg_price'] <= 500)]

        # Keep shops with at least one review; copy so later column
        # assignments do not write to a view of the original frame
        self.df = self.df[self.df['review_count'] > 0].copy()

        print(f"Rows after cleaning: {len(self.df)}")
        return self.df
    
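    def text_processing(self, text_series):
        """
        Text processing: word segmentation and stop-word removal.
        Returns one space-joined token string per input text.
        """
        processed_texts = []

        for text in text_series:
            if pd.isna(text):
                processed_texts.append('')
                continue

            # Chinese word segmentation
            words = jieba.cut(str(text))

            # Drop stop words, single-character tokens (which covers most
            # punctuation), and blank tokens
            filtered_words = [
                word for word in words
                if (word not in self.stop_words and
                    len(word) > 1 and
                    word.strip() != '')
            ]

            processed_texts.append(' '.join(filtered_words))

        return processed_texts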
    
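    def extract_features(self):
        """
        Feature extraction.
        """
        print("Starting feature extraction...")

        # Price band (bins are right-inclusive: 60 falls in '0-60')
        self.df['price_range'] = pd.cut(self.df['avg_price'],
                                      bins=[0, 60, 90, 120, 200, 500],
                                      labels=['0-60', '60-90', '90-120', '120-200', '200+'])

        # Rating level; include_lowest keeps a 0.0 rating in the first bin
        self.df['rating_level'] = pd.cut(self.df['rating'],
                                       bins=[0, 3.5, 4.0, 4.5, 5.0],
                                       labels=['poor', 'fair', 'good', 'excellent'],
                                       include_lowest=True)

        # Popularity quartile based on review count. Ranking first breaks ties,
        # so qcut does not fail on duplicate bin edges when many shops share a count
        self.df['popularity'] = pd.qcut(self.df['review_count'].rank(method='first'),
                                       q=4,
                                       labels=['low', 'medium', 'high', 'very high'])

        return self.df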
    
    def generate_wordcloud(self, texts, title):
        """
        Generate a word cloud.
        Pass already segmented, space-joined text (for example the output of
        text_processing), since WordCloud does not segment Chinese itself.
        """
        from wordcloud import WordCloud
        import matplotlib.pyplot as plt

        # Merge all texts
        all_text = ' '.join([str(text) for text in texts if pd.notna(text)])

        # Build the word cloud
        wordcloud = WordCloud(
            font_path='SimHei.ttf',  # font file with Chinese glyphs; must exist at this path
            width=800,
            height=600,
            background_color='white',
            max_words=100
        ).generate(all_text)

        # Draw the word cloud
        plt.figure(figsize=(10, 8))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(title, fontsize=16)
        plt.tight_layout()
        return plt

# Usage example
processor = DataProcessor('hotpot_data.csv')
cleaned_data = processor.data_cleaning()
processed_data = processor.extract_features()
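
The code above imports `Counter` but never uses it. Since `text_processing` returns one space-joined token string per review, keyword frequencies can be tallied directly from that output. The following is a minimal sketch, not part of the original project: `top_keywords` is a hypothetical helper and the sample strings are invented stand-ins for segmented reviews.

```python
from collections import Counter


def top_keywords(processed_texts, n=10):
    """Count tokens across space-joined texts and return the n most frequent."""
    counter = Counter()
    for text in processed_texts:
        # str.split() with no argument also skips empty strings cleanly
        counter.update(text.split())
    return counter.most_common(n)


# Hypothetical output of text_processing for three reviews
sample = ['牛肉 新鲜 服务 热情', '牛肉 毛肚 新鲜', '']
for word, count in top_keywords(sample, n=3):
    print(word, count)
```

The resulting (word, count) pairs could also feed `WordCloud.generate_from_frequencies` after conversion to a dict, which avoids re-tokenizing the merged text.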

3. Data Analysis and Visualization

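import matplotlib.pyplot as plt

# Shop and district names are Chinese, so Matplotlib needs a CJK-capable font.
# Assumes SimHei is installed (the same font the word cloud uses).
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

class DataAnalyzer:
    """
    Data analysis and visualization.
    """
    def __init__(self, df):
        self.df = df

    def plot_price_distribution(self):
        """
        Draw a bar chart of the price distribution.
        """
        price_ranges = ['0-60', '60-90', '90-120', '120-200', '200+']
        price_counts = [len(self.df[self.df['price_range'] == pr]) for pr in price_ranges]

        plt.figure(figsize=(10, 6))
        bars = plt.bar(price_ranges, price_counts, color='skyblue', alpha=0.8)

        # Add a value label above each bar
        for bar, count in zip(bars, price_counts):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    str(count), ha='center', va='bottom')

        plt.title('Hotpot Shop Price Distribution', fontsize=16)
        plt.xlabel('Price band (yuan)', fontsize=12)
        plt.ylabel('Number of shops', fontsize=12)
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        return plt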
    
    def plot_top10_locations(self):
        """
        Draw the shop counts of the TOP10 business districts.
        """
        location_counts = self.df['location'].value_counts().head(10)

        plt.figure(figsize=(12, 6))
        bars = plt.bar(location_counts.index, location_counts.values, color='lightcoral')

        # Add a value label above each bar
        for bar, count in zip(bars, location_counts.values):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                    str(count), ha='center', va='bottom')

        plt.title('Hotpot Shops in the TOP10 Business Districts', fontsize=16)
        plt.xlabel('Business district', fontsize=12)
        plt.ylabel('Number of shops', fontsize=12)
        plt.xticks(rotation=45)
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        return plt

    def plot_top10_review_shops(self):
        """
        Draw the TOP10 shops by review count, with average price on a second axis.
        """
        top_shops = self.df.nlargest(10, 'review_count')[['name', 'review_count', 'avg_price']]

        fig, ax1 = plt.subplots(figsize=(12, 6))

        # Bar chart of review counts (left axis)
        bars = ax1.bar(range(len(top_shops)), top_shops['review_count'],
                      alpha=0.6, color='lightgreen', label='Review count')
        ax1.set_xlabel('Shop name')
        ax1.set_ylabel('Review count', color='green')
        ax1.tick_params(axis='y', labelcolor='green')

        # Line chart of average price (right axis)
        ax2 = ax1.twinx()
        ax2.plot(range(len(top_shops)), top_shops['avg_price'],
                color='red', marker='o', linewidth=2, label='Average price')
        ax2.set_ylabel('Average price (yuan)', color='red')
        ax2.tick_params(axis='y', labelcolor='red')

        plt.title('TOP10 Hotpot Shops by Review Count', fontsize=16)
        ax1.set_xticks(range(len(top_shops)))
        ax1.set_xticklabels(top_shops['name'], rotation=45)

        # Merge the legends of both axes
        lines1, labels1 = ax1.get_legend_handles_labels()
        lines2, labels2 = ax2.get_legend_handles_labels()
        ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

        plt.tight_layout()
        return plt
    
    def generate_recommendations(self, user_preferences):
        """
        Generate personalized recommendations.
        """
        # User preferences: budget, location, minimum rating
        budget = user_preferences.get('budget', 100)
        location = user_preferences.get('location', '')
        min_rating = user_preferences.get('min_rating', 4.0)

        # Filter by budget and rating
        filtered_shops = self.df[
            (self.df['avg_price'] <= budget) &
            (self.df['rating'] >= min_rating)
        ]

        if location:
            # Plain substring match; rows with a missing location are excluded
            filtered_shops = filtered_shops[
                filtered_shops['location'].str.contains(location, regex=False, na=False)
            ]

        # Copy so the new column is not written to a view of self.df
        filtered_shops = filtered_shops.copy()

        # Composite score from rating and review count. The rating is divided
        # by 5 so both terms are on a 0-1 scale before the 0.7/0.3 weighting
        filtered_shops['composite_score'] = (
            filtered_shops['rating'] / 5.0 * 0.7 +
            (filtered_shops['review_count'] / filtered_shops['review_count'].max()) * 0.3
        )

        recommendations = filtered_shops.nlargest(5, 'composite_score')
        return recommendations[['name', 'location', 'avg_price', 'rating', 'review_count']]

# Usage example
analyzer = DataAnalyzer(processed_data)
price_plot = analyzer.plot_price_distribution()
location_plot = analyzer.plot_top10_locations()
review_plot = analyzer.plot_top10_review_shops()

# Personalized recommendation
user_prefs = {'budget': 100, 'location': '万达广场', 'min_rating': 4.0}
recommendations = analyzer.generate_recommendations(user_prefs)
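
To make the 0.7/0.3 weighting in `generate_recommendations` concrete, here is a small worked sketch in plain Python. It assumes both terms are scaled to 0-1 before weighting (the rating is divided by 5, which is an assumption of this sketch), and the three shops are invented for illustration.

```python
def composite_score(rating, review_count, max_review_count,
                    rating_weight=0.7, popularity_weight=0.3):
    """Blend a 0-5 rating and a review count into a single 0-1 score."""
    rating_part = rating / 5.0
    # Guard against an empty or all-zero candidate set
    popularity_part = review_count / max_review_count if max_review_count else 0.0
    return rating_weight * rating_part + popularity_weight * popularity_part


# Hypothetical shops: (name, rating, review_count)
shops = [('Shop A', 4.8, 200), ('Shop B', 4.5, 2000), ('Shop C', 4.0, 1000)]
max_reviews = max(count for _, _, count in shops)

ranked = sorted(
    shops,
    key=lambda shop: composite_score(shop[1], shop[2], max_reviews),
    reverse=True,
)
for name, rating, count in ranked:
    print(name, round(composite_score(rating, count, max_reviews), 3))
```

With these numbers Shop C outranks Shop A despite its lower rating, because its review count is five times higher. If the raw 0-5 rating were used instead, the rating term would be up to five times larger than the popularity term and Shop A would rank above Shop C, so the effective weighting would be far from 70/30.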

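📊 Analysis Results

1. Business District Distribution

Shop counts for the top business districts (five of the TOP10 are listed):

District	Shops	Share	Notes
怡圆路 (Yiyuan Road)	12	6.8%	Core commercial area, heavy foot traffic
新建区中心城区 (Xinjian District central area)	11	6.3%	Administrative center, strong spending power
万达广场 (Wanda Plaza)	10	5.7%	Large shopping district, popular with young people
昌北经济开发区 (Changbei Economic Development Zone)	9	5.1%	Many companies, frequent business dinners
莲塘 (Liantang)	8	4.5%	Residential area, mostly family dining

🎯 Business insight: hotpot shops cluster in core commercial areas and large shopping districts, consistent with the adage that "location makes or breaks a business"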

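2. Price Distribution

Hotpot price bands (average spend per person):

Price band	Shops	Share	Target customers
60-90 yuan	85	48.3%	Students, young office workers
90-120 yuan	62	35.2%	Middle-class families, gatherings with friends
120-200 yuan	23	13.1%	Business dinners, quality-focused diners
200+ yuan	6	3.4%	High-end diners

💰 Consumption insight: 60-90 yuan is the mainstream band, covering nearly half of the shops and in line with mass-market spending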
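
The shares in the price table follow directly from the shop counts, which sum to 176. A short sketch of that arithmetic, using only the counts listed above:

```python
# Shop counts per price band, copied from the table above
price_counts = {'60-90': 85, '90-120': 62, '120-200': 23, '200+': 6}

total = sum(price_counts.values())

# Share of each band as a percentage, rounded to one decimal place
shares = {band: round(100 * count / total, 1) for band, count in price_counts.items()}

print(total)
for band, share in shares.items():
    print(f"{band}: {share}%")
```

The same calculation applies to the district table, where 12 of 176 shops gives the 6.8% listed for the top district.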

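3. Hotpot Type Preferences

Distribution of popular hotpot types:

Type	Shops	Popularity	Notes
Buffet hotpot	28	⭐⭐⭐⭐⭐	Good value, wide selection
Sichuan hotpot	25	⭐⭐⭐⭐	Numbing-spicy flavor, popular with young people
Chaoshan beef hotpot	22	⭐⭐⭐⭐	Light and healthy, preferred by families
Chongqing hotpot	18	⭐⭐⭐	Heavily numbing and spicy, distinctive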

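🎯 Recommendation System

Personalized Recommendation Strategy

🔍 User profile:
├── Spending power: derived from the budget
├── Taste preference: spice level, broth type
├── Occasion: gatherings with friends, family meals, business dinners
└── Location: nearest-first or a specific business district

📊 Multi-dimensional scoring:
├── Base score: the official Dianping rating
├── Popularity score: weighted by review count
├── Value for money: how reasonable the price is
└── Match: fit with the user's preferences

🎯 Recommendation results:
├── Exact match: fully meets the user's requirements
├── Potential picks: highly rated but slightly over budget
└── Specialty picks: shops offering a distinctive experience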
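
The first two result tiers above can be sketched as a simple classification rule. This is only an illustration: the `Shop` type, the `tier` function, the 20% budget stretch, and the 4.5 "highly rated" threshold are all hypothetical choices, and the specialty tier is left out because the text does not define how a distinctive experience would be measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shop:
    name: str
    avg_price: float
    rating: float


def tier(shop: Shop, budget: float, min_rating: float = 4.0,
         stretch: float = 0.2, high_rating: float = 4.5) -> Optional[str]:
    """Classify a shop as 'exact', 'potential', or None (not recommended)."""
    # Exact match: within budget and at or above the minimum rating
    if shop.avg_price <= budget and shop.rating >= min_rating:
        return 'exact'
    # Potential pick: highly rated and at most `stretch` over budget
    if shop.rating >= high_rating and shop.avg_price <= budget * (1 + stretch):
        return 'potential'
    return None


# Hypothetical candidates evaluated against a budget of 100 yuan
candidates = [Shop('A', 90, 4.2), Shop('B', 115, 4.7), Shop('C', 130, 4.9)]
for shop in candidates:
    print(shop.name, tier(shop, budget=100))
```

Keeping the tiers as a separate step after scoring means the thresholds can be tuned without touching the ranking formula.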

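💼 Business Value

For Consumers

🎯 Better choices: personalized recommendations based on real data
💰 Budget control: filter by price band to avoid overspending
📍 Convenient location: nearby recommendations save time
👅 Taste match: shops that fit personal preferences

For Merchants

📊 Market positioning: understand competitors and the price distribution
🎯 Targeted marketing: build campaigns around the target customer group
💡 Product optimization: adjust the menu based on popular dishes
📍 Site selection: use district popularity to plan shop locations

For the Platform

🔍 User experience: raises the platform's usefulness and stickiness
📈 Monetization: recommendation services open a new revenue stream
🏆 Competitive edge: differentiated services strengthen market position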

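🚀 Project Highlights

Technical Innovation

End-to-end coverage: a complete solution from data collection to recommendation
Multi-dimensional analysis: districts, prices, types, and reviews
Personalized recommendation: matching based on a user profile
Visualization: clear presentation of the data and findings

Practical Value

Real data: analysis based on 180+ real shops
Business insight: surfaces the commercial patterns behind the data
Extensibility: the framework can be reused for other categories
Ease of use: clear code structure and detailed comments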

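📈 Future Improvements

Technical Enhancements

🤖 Machine learning: introduce more sophisticated recommendation algorithms
🔄 Live updates: build an automated data refresh pipeline
📱 Mobile app: develop a phone app for a better user experience
🌐 More cities: extend coverage to other cities

Feature Extensions

💬 Sentiment analysis: analyze the sentiment of user reviews in depth
👥 Social recommendation: incorporate friends' recommendations and similar users' preferences
🎯 Occasion-based recommendation: dedicated recommendations for different dining occasions
📅 Time-series analysis: seasonal and holiday consumption trends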


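🎁 Resources

Complete project package:

✅ Full source code of the Dianping crawler
✅ Cleaned dataset of 180+ hotpot shops
✅ Jupyter Notebook for the data analysis
✅ Chart generation code
✅ Recommendation system code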

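💬 Discussion

FAQ:

Q: Will the crawler get my IP banned?
A: The code already includes random delays and browser-like request headers. Using a proxy IP pool is recommended to reduce the risk further.

Q: Is the data up to date?
A: The project ships real data from 2022, together with the method and code for refreshing it.

Q: Can it be used for other cities?
A: The code framework is generic; adapting it to another city only requires changing the target URL.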
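
On the IP-ban question, one common complement to random delays is to retry failed requests with an increasing, jittered wait. The sketch below is an assumption-laden illustration rather than part of the project: `fetch_with_backoff` is a hypothetical helper, `fetch` stands for any callable that returns None on failure (as `crawl_shop_info` does), and the delay values are arbitrary.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    The wait doubles after each failure (base_delay, 2x, 4x, ...) plus up to
    one second of random jitter, so retries do not arrive at a fixed rhythm.
    """
    for attempt in range(max_attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    return None


# Demo with a stub that fails twice, then succeeds; sleeping is disabled here
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    return {'url': url} if len(attempts) == 3 else None

print(fetch_with_backoff(flaky_fetch, 'https://www.dianping.com/shop/xxx',
                         sleep=lambda seconds: None))
```

In real use the call would look like `fetch_with_backoff(crawler.crawl_shop_info, shop_url)`, with the default `time.sleep` left in place.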

✨ If this project helped you, please like, bookmark, and follow! ✨
