[Big Data] Ctrip Hotel User Review Data Analysis System | Computer Science Graduation Project | Hadoop + Spark Environment Setup | Data Science and Big Data Technology | Source Code, Documentation, and Walkthrough Included


Preface

💖💖Author: Computer Programmer Xiao Yang 💙💙About me: I work in a computer-related field and am proficient in Java, WeChat Mini Programs, Python, Golang, Android, and several other IT areas. I take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I also know some techniques for lowering plagiarism-check scores. I love technology, enjoy exploring new tools and frameworks, and like solving real problems with code. Feel free to ask me anything about technology or code! 💛💛A few words: thank you all for your attention and support! 💕💕Contact Computer Programmer Xiao Yang at the end of this article to get the source code 💜💜 Website practical projects | Android/Mini Program practical projects | Big data practical projects | Deep learning practical projects | Computer graduation project topic selection 💜💜

1. Development Tools

Big data framework: Hadoop + Spark (Hive is not used in this build; customization is supported)
Development language: Python + Java (both versions supported)
Backend framework: Django + Spring Boot (Spring + SpringMVC + MyBatis) (both versions supported)
Frontend: Vue + ElementUI + Echarts + HTML + CSS + JavaScript + jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL

2. System Overview

The Ctrip Hotel User Review Data Analysis System is a hotel-industry data analysis platform built on big data technology. It uses the Hadoop + Spark distributed computing stack as its data processing engine to store, clean, and analyze large volumes of hotel user review data. The backend is developed in Python with the Django framework, using Spark SQL for structured queries and Pandas and NumPy for data modeling and statistical analysis. The frontend is built with Vue and the ElementUI component library, with Echarts for data visualization. The system's core features cover the full lifecycle of hotel review data, including user behavior tracking, time-series trend mining, quantitative sentiment and satisfaction evaluation, market competitiveness comparison, and multi-dimensional service quality assessment. Analysis results are displayed in real time on a visualization dashboard, giving hotel managers data-driven decision support to optimize service processes, improve user experience, and strengthen market competitiveness.
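The quantitative sentiment evaluation mentioned above boils down to a smoothed keyword-count polarity score. Below is a minimal standalone sketch of that scoring rule in plain Python; the +1 smoothing term and the ±0.2 thresholds mirror the Spark pipeline shown in the source code section, and the labels are the system's Chinese categories (正面/负面/中性):

```python
def sentiment_score(positive_count: int, negative_count: int) -> float:
    # Smoothed polarity in (-1, 1): the +1 in the denominator avoids
    # division by zero and damps scores for reviews with few keyword hits.
    return (positive_count - negative_count) / (positive_count + negative_count + 1)

def sentiment_label(score: float) -> str:
    # Same +/-0.2 thresholds as the Spark pipeline.
    if score > 0.2:
        return "正面"  # positive
    if score < -0.2:
        return "负面"  # negative
    return "中性"      # neutral
```

For example, a review hitting three positive keywords and no negative ones scores 3/4 = 0.75 and is labeled 正面.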

3. Feature Demo

Ctrip Hotel User Review Data Analysis System

4. System Screenshots

[System interface screenshots]

5. Source Code Highlights

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, when, regexp_replace, lower, trim, window, lag, lead, desc, asc
from pyspark.sql.types import TimestampType
from pyspark.sql.window import Window  # window spec for lag() in the trend analysis
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from django.http import JsonResponse
from django.views import View
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
spark = SparkSession.builder.appName("CtripHotelAnalysis").config("spark.executor.memory", "4g").config("spark.driver.memory", "2g").getOrCreate()
def sentiment_analysis_core(hotel_id, start_date, end_date):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hotel_db").option("dbtable", "user_reviews").option("user", "root").option("password", "password").load()
    filtered_df = df.filter((col("hotel_id") == hotel_id) & (col("review_date") >= start_date) & (col("review_date") <= end_date))
    cleaned_df = filtered_df.withColumn("cleaned_text", regexp_replace(lower(trim(col("review_content"))), "[^a-z0-9\\u4e00-\\u9fa5\\s]", ""))
    tokenizer = Tokenizer(inputCol="cleaned_text", outputCol="words")
    tokenized_df = tokenizer.transform(cleaned_df)
    stop_words = ["的", "了", "是", "在", "我", "有", "和", "就", "不", "人", "都", "一", "一个", "上", "也", "很", "到", "说", "要", "去", "你", "会", "着", "没有", "看", "好"]
    remover = StopWordsRemover(inputCol="words", outputCol="filtered_words", stopWords=stop_words)
    removed_df = remover.transform(tokenized_df)
    positive_keywords = ["满意", "舒适", "干净", "热情", "推荐", "值得", "不错", "棒", "好评", "优质", "贴心", "周到", "方便", "温馨", "喜欢"]
    negative_keywords = ["差", "脏", "吵", "冷漠", "失望", "不推荐", "糟糕", "难受", "后悔", "坑", "骗", "臭", "破", "旧", "差评"]
    # Chinese reviews are not whitespace-delimited, so Tokenizer leaves most of a review
    # as a single token; match sentiment keywords as substrings of the cleaned text instead
    # (Column.contains cannot be applied to the array-typed filtered_words column).
    sentiment_df = removed_df.withColumn("positive_count", sum([when(col("cleaned_text").contains(kw), 1).otherwise(0) for kw in positive_keywords]))
    sentiment_df = sentiment_df.withColumn("negative_count", sum([when(col("cleaned_text").contains(kw), 1).otherwise(0) for kw in negative_keywords]))
    sentiment_df = sentiment_df.withColumn("sentiment_score", (col("positive_count") - col("negative_count")) / (col("positive_count") + col("negative_count") + 1))
    sentiment_df = sentiment_df.withColumn("sentiment_label", when(col("sentiment_score") > 0.2, "正面").when(col("sentiment_score") < -0.2, "负面").otherwise("中性"))
    aggregated_result = sentiment_df.groupBy("sentiment_label").agg(count("*").alias("review_count"), avg("rating").alias("avg_rating"))
    pandas_result = aggregated_result.toPandas()
    satisfaction_rate = (pandas_result[pandas_result["sentiment_label"] == "正面"]["review_count"].sum() / pandas_result["review_count"].sum()) * 100 if not pandas_result.empty else 0
    detailed_stats = sentiment_df.groupBy("sentiment_label").agg(count("user_id").alias("user_count"), avg("positive_count").alias("avg_positive_words"), avg("negative_count").alias("avg_negative_words"))
    detailed_pandas = detailed_stats.toPandas()
    top_positive_reviews = sentiment_df.filter(col("sentiment_label") == "正面").orderBy(desc("sentiment_score")).limit(5).select("user_id", "review_content", "rating", "sentiment_score").toPandas()
    top_negative_reviews = sentiment_df.filter(col("sentiment_label") == "负面").orderBy(asc("sentiment_score")).limit(5).select("user_id", "review_content", "rating", "sentiment_score").toPandas()
    return {"satisfaction_rate": round(satisfaction_rate, 2), "sentiment_distribution": pandas_result.to_dict(orient="records"), "detailed_stats": detailed_pandas.to_dict(orient="records"), "top_positive": top_positive_reviews.to_dict(orient="records"), "top_negative": top_negative_reviews.to_dict(orient="records")}
def time_series_trend_analysis(hotel_ids, time_granularity, analysis_period):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hotel_db").option("dbtable", "user_reviews").option("user", "root").option("password", "password").load()
    df = df.withColumn("review_timestamp", col("review_date").cast(TimestampType()))
    end_date = datetime.now()
    if analysis_period == "month":
        start_date = end_date - timedelta(days=30)
        window_spec = "1 day"
    elif analysis_period == "quarter":
        start_date = end_date - timedelta(days=90)
        window_spec = "1 week"
    elif analysis_period == "year":
        start_date = end_date - timedelta(days=365)
        window_spec = "1 month"
    else:
        start_date = end_date - timedelta(days=7)
        window_spec = "1 hour"
    filtered_df = df.filter((col("hotel_id").isin(hotel_ids)) & (col("review_timestamp") >= start_date) & (col("review_timestamp") <= end_date))
    windowed_df = filtered_df.groupBy(window(col("review_timestamp"), window_spec), col("hotel_id")).agg(count("*").alias("review_count"), avg("rating").alias("avg_rating"), avg("cleanliness_score").alias("avg_cleanliness"), avg("service_score").alias("avg_service"), avg("facility_score").alias("avg_facility"))
    windowed_df = windowed_df.withColumn("time_bucket", col("window.start"))
    windowed_df = windowed_df.orderBy("hotel_id", "time_bucket")
    window_lag = Window.partitionBy("hotel_id").orderBy("time_bucket")
    trend_df = windowed_df.withColumn("prev_rating", lag("avg_rating", 1).over(window_lag))
    trend_df = trend_df.withColumn("rating_change", col("avg_rating") - col("prev_rating"))
    trend_df = trend_df.withColumn("trend_direction", when(col("rating_change") > 0.1, "上升").when(col("rating_change") < -0.1, "下降").otherwise("稳定"))
    pandas_trend = trend_df.select("hotel_id", "time_bucket", "review_count", "avg_rating", "avg_cleanliness", "avg_service", "avg_facility", "rating_change", "trend_direction").toPandas()
    pandas_trend["time_bucket"] = pd.to_datetime(pandas_trend["time_bucket"])
    correlation_matrix = pandas_trend[["avg_rating", "avg_cleanliness", "avg_service", "avg_facility"]].corr()
    forecast_data = []
    for hotel in hotel_ids:
        hotel_data = pandas_trend[pandas_trend["hotel_id"] == hotel].sort_values("time_bucket")
        if len(hotel_data) >= 3:
            recent_ratings = hotel_data["avg_rating"].tail(3).values
            trend_slope = np.polyfit(range(len(recent_ratings)), recent_ratings, 1)[0]
            next_prediction = recent_ratings[-1] + trend_slope
            forecast_data.append({"hotel_id": hotel, "predicted_rating": round(float(next_prediction), 2), "confidence": "中" if abs(trend_slope) < 0.05 else "高"})
    return {"trend_data": pandas_trend.to_dict(orient="records"), "correlation_matrix": correlation_matrix.to_dict(), "forecast": forecast_data}
def market_competition_analysis(target_hotel_id, competitor_hotel_ids, comparison_dimensions):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hotel_db").option("dbtable", "user_reviews").option("user", "root").option("password", "password").load()
    all_hotel_ids = [target_hotel_id] + competitor_hotel_ids
    filtered_df = df.filter(col("hotel_id").isin(all_hotel_ids))
    hotel_info_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hotel_db").option("dbtable", "hotel_info").option("user", "root").option("password", "password").load()
    joined_df = filtered_df.join(hotel_info_df, "hotel_id", "left")
    overall_stats = joined_df.groupBy("hotel_id", "hotel_name", "star_level", "price_range").agg(count("*").alias("total_reviews"), avg("rating").alias("overall_rating"), avg("cleanliness_score").alias("cleanliness"), avg("service_score").alias("service"), avg("facility_score").alias("facility"), avg("location_score").alias("location"), avg("cost_performance").alias("cost_performance"))
    dimension_comparison = overall_stats.toPandas()
    dimension_comparison["market_rank"] = dimension_comparison["overall_rating"].rank(ascending=False, method="min")
    target_hotel_data = dimension_comparison[dimension_comparison["hotel_id"] == target_hotel_id].iloc[0]
    competitor_avg = dimension_comparison[dimension_comparison["hotel_id"].isin(competitor_hotel_ids)][["overall_rating", "cleanliness", "service", "facility", "location", "cost_performance"]].mean()
    gap_analysis = {dim: round(float(target_hotel_data[dim] - competitor_avg[dim]), 2) for dim in ["overall_rating", "cleanliness", "service", "facility", "location", "cost_performance"] if dim in comparison_dimensions}
    strength_weakness = {"strengths": [k for k, v in gap_analysis.items() if v > 0.2], "weaknesses": [k for k, v in gap_analysis.items() if v < -0.2]}
    price_performance_df = joined_df.withColumn("price_level", when(col("price_range") < 200, "经济型").when(col("price_range") < 500, "舒适型").otherwise("豪华型"))
    price_segment_stats = price_performance_df.groupBy("hotel_id", "price_level").agg(avg("rating").alias("segment_rating"), count("*").alias("segment_reviews"))
    price_competitiveness = price_segment_stats.toPandas()
    review_volume_trend = joined_df.groupBy("hotel_id", window(col("review_date").cast(TimestampType()), "1 month")).agg(count("*").alias("monthly_reviews")).withColumn("month_start", col("window.start"))
    review_momentum = review_volume_trend.toPandas().sort_values(["hotel_id", "month_start"])
    # Average review volume over each hotel's three most recent months
    recent_3months = review_momentum.groupby("hotel_id")["monthly_reviews"].apply(lambda s: s.tail(3).mean()).reset_index()
    recent_3months.columns = ["hotel_id", "avg_recent_reviews"]
    dimension_comparison = dimension_comparison.merge(recent_3months, on="hotel_id", how="left")
    dimension_comparison["market_momentum"] = dimension_comparison["avg_recent_reviews"].rank(ascending=False, method="min")
    return {"target_hotel": target_hotel_data.to_dict(), "competitor_comparison": dimension_comparison.to_dict(orient="records"), "gap_analysis": gap_analysis, "strength_weakness": strength_weakness, "price_competitiveness": price_competitiveness.to_dict(orient="records"), "market_position": {"overall_rank": int(target_hotel_data["market_rank"]), "total_competitors": len(all_hotel_ids)}}
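The forecast step in `time_series_trend_analysis` fits a degree-1 polynomial to the last three average ratings and extrapolates one step ahead. That rule can be exercised on its own, outside Spark, with a small sketch (the function name here is illustrative, not part of the system's API):

```python
import numpy as np

def forecast_next_rating(recent_ratings):
    # Fit slope and intercept to the recent ratings, then extrapolate
    # one step by adding the slope to the last observed rating,
    # exactly as the np.polyfit step in time_series_trend_analysis does.
    slope = np.polyfit(range(len(recent_ratings)), recent_ratings, 1)[0]
    return round(float(recent_ratings[-1] + slope), 2)
```

With ratings [4.0, 4.1, 4.2] the fitted slope is 0.1, so the next predicted rating is 4.3; a flat history predicts no change. A simple linear fit like this only captures short-term momentum, which is why the source code also attaches a confidence flag based on the slope magnitude.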

6. Documentation

[Documentation screenshot]

Closing

💕💕Contact Computer Programmer Xiao Yang at the end of this article to get the source code