Computer Programming Mentor
⭐⭐About me: I love digging into technical problems! I specialize in hands-on projects covering Java, Python, mini-programs, Android, big data, web crawlers, Golang, large-screen dashboards, deep learning, machine learning, prediction, and more.
⛽⛽Hands-on projects: if you have questions about the source code or the technology, feel free to discuss them in the comments!
Domestic Tourist Attraction Visitor Data Analysis System - Introduction
The Hadoop+Spark-based Domestic Tourist Attraction Visitor Data Analysis System is a big data analytics platform built for deep mining of tourism-industry data. It uses the Hadoop Distributed File System (HDFS) as the storage foundation for massive tourism datasets and pairs it with the Spark distributed computing engine to process and analyze millions of visitor behavior records efficiently. On the implementation side, the system ships with two complete backend options, Python+Django and Java+SpringBoot, while the frontend uniformly uses the Vue+ElementUI+Echarts stack to build the interactive visualization interface.

The core analysis features cover five dimensions. The multi-dimensional visitor portrait module profiles the visitor population across six sub-dimensions such as age distribution, gender ratio, and place of origin. The tourism consumption behavior module applies the RFM model to cluster visitors by value and pinpoint high-value customer groups. The attraction appeal and satisfaction module jointly evaluates key indicators such as attraction sales rankings, type preferences, and sentiment polarity. The temporal and environmental impact module reveals monthly traffic trends and how weather shapes travel behavior. The regional tourism market module presents the spatial distribution of the national tourism market through province-level heat maps and origin-destination flow matrices.

The system uses Spark SQL for structured queries, combines Pandas and NumPy for statistical analysis and data cleaning, and finally renders the results of 26 fine-grained analysis dimensions in Echarts charts, providing data support for decision-making by tourism administrators and for operational improvements at scenic areas.
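To make the data flow concrete, below is a minimal sketch, assuming a Django backend, of how one Spark SQL aggregation could be exposed as Echarts-ready JSON; the HDFS path, view name, and `play_date` usage are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch: a hypothetical Django view that runs one Spark SQL
# aggregation (the monthly traffic trend) and returns Echarts-ready JSON.
# The HDFS path and view name are assumptions for illustration.
from django.http import JsonResponse
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TourismDashboard").getOrCreate()

def monthly_visitor_trend(request):
    df = spark.read.parquet("hdfs:///tourism/visitor_data")  # hypothetical path
    df.createOrReplaceTempView("visitor_data")
    trend = spark.sql("""
        SELECT MONTH(play_date) AS month, COUNT(*) AS visitor_count
        FROM visitor_data
        WHERE play_date IS NOT NULL
        GROUP BY MONTH(play_date)
        ORDER BY month
    """).toPandas()
    # Echarts line charts take parallel arrays for the x axis and the series
    return JsonResponse({
        "months": trend["month"].tolist(),
        "visitors": trend["visitor_count"].tolist(),
    })
```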
Domestic Tourist Attraction Visitor Data Analysis System - Technical Framework
Development language: Python or Java (both versions supported)
Big data framework: Hadoop+Spark (Hive is not used in this build; customization supported)
Backend framework: Django or Spring Boot (Spring+SpringMVC+MyBatis) (both versions supported)
Frontend: Vue+ElementUI+Echarts+HTML+CSS+JavaScript+jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL
Domestic Tourist Attraction Visitor Data Analysis System - Background
Topic Background
In recent years the domestic tourism market has been booming: the number of visitors received by attractions of all kinds keeps climbing, and the resulting visitor behavior data, spending records, and satisfaction reviews exhibit the typical traits of massive volume and high variety. Traditional approaches that rely on single-node databases and manual tallying struggle with datasets that routinely reach millions of visitor records: queries respond slowly, and the deeper patterns hidden in the data are hard to surface. At the same time, tourism administrators and scenic-area operators urgently need to understand true visitor profiles, spending preferences, regional flow trends, and how environmental factors affect travel, so that they can craft more scientific marketing strategies and resource-allocation plans. The maturing of big data technology, especially the Hadoop and Spark frameworks, opens a new path: Hadoop's distributed storage easily holds massive tourism datasets, while Spark's in-memory computing can compress complex multi-dimensional analysis jobs from hours down to minutes or even seconds. This technological shift provides a solid foundation for the digital transformation of the tourism industry and makes big-data-driven intelligent decision-making possible.
Topic Significance
The practical significance of this project shows up on several levels. From a tourism-management perspective, by analyzing multi-dimensional visitor portraits (age, gender, place of origin, and so on), the system helps scenic areas and tourism authorities see clearly who their target customers actually are and what travel styles they prefer; those insights can directly guide product design and marketing-channel selection, avoiding budget wasted on untargeted advertising. On the consumption-behavior side, the system applies the RFM model to cluster visitors by value, singling out high-frequency, high-spend customers as clear targets for precision marketing and customer-relationship maintenance; analyzing per-capita spending by place of origin also tells managers which provinces' visitors spend the most, so regional promotion can be rebalanced accordingly. From a technical-practice standpoint, the system demonstrates the full workflow of putting the Hadoop+Spark stack to work in a real business scenario: storing massive data on HDFS, running distributed queries with Spark SQL, and cleaning and summarizing data with Pandas. Wiring these stages together is a useful demonstration of the end-to-end big data processing pipeline; although this is only a graduation project, the techniques and analysis ideas involved can to some extent be reused in data-analysis tasks in other domains, making it a genuinely practical hands-on exercise.
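As a concrete illustration of the Pandas cleaning step mentioned above, here is a minimal sketch; the column names follow the fields used in the code section below, while the specific rules (a plausible age range, non-negative spend) are assumptions for illustration.

```python
# Minimal cleaning sketch. Column names follow the fields used in the code
# section below; the validity rules themselves are illustrative assumptions.
import pandas as pd

def clean_visitor_records(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning before visitor records are handed to the analyses."""
    # Coerce key fields to proper types; malformed values become NaN/NaT
    df["play_date"] = pd.to_datetime(df["play_date"], errors="coerce")
    df["consumption_amount"] = pd.to_numeric(df["consumption_amount"], errors="coerce")
    # Drop rows missing the fields every downstream analysis depends on
    df = df.dropna(subset=["visitor_id", "play_date", "consumption_amount"])
    # Discard impossible values: negative spend, ages outside a plausible range
    df = df[df["consumption_amount"] >= 0]
    df = df[df["visitor_age"].isna() | df["visitor_age"].between(1, 100)]
    return df.drop_duplicates()
```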
Domestic Tourist Attraction Visitor Data Analysis System - Video Demo
Domestic Tourist Attraction Visitor Data Analysis System - Screenshots
Domestic Tourist Attraction Visitor Data Analysis System - Code
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
from datetime import datetime

# Shared SparkSession with explicit executor/driver memory settings
spark = (SparkSession.builder
    .appName("TourismDataAnalysis")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "2g")
    .getOrCreate())
def analyze_visitor_portrait_multi_dimension(hdfs_data_path):
"""游客多维画像分析核心函数:包含年龄分布、性别消费对比、客源地TOP10、旅游方式偏好等"""
visitor_df = spark.read.parquet(hdfs_data_path)
visitor_df.createOrReplaceTempView("visitor_data")
age_distribution = spark.sql("""
SELECT
CASE
WHEN visitor_age BETWEEN 18 AND 30 THEN '青年(18-30岁)'
WHEN visitor_age BETWEEN 31 AND 50 THEN '中年(31-50岁)'
WHEN visitor_age BETWEEN 51 AND 70 THEN '老年(51-70岁)'
ELSE '其他年龄段'
END AS age_group,
COUNT(*) AS visitor_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS percentage
FROM visitor_data
WHERE visitor_age IS NOT NULL
GROUP BY age_group
ORDER BY visitor_count DESC
""")
gender_consumption = spark.sql("""
SELECT
visitor_gender,
COUNT(*) AS total_visitors,
ROUND(AVG(consumption_amount), 2) AS avg_consumption,
ROUND(SUM(consumption_amount), 2) AS total_consumption,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS gender_ratio
FROM visitor_data
WHERE visitor_gender IN ('男', '女')
GROUP BY visitor_gender
""")
source_province_top10 = spark.sql("""
SELECT
source_province,
COUNT(*) AS visitor_count,
ROUND(AVG(consumption_amount), 2) AS avg_consumption_per_capita,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM visitor_data), 2) AS market_share
FROM visitor_data
WHERE source_province IS NOT NULL
GROUP BY source_province
ORDER BY visitor_count DESC
LIMIT 10
""")
age_travel_mode_preference = spark.sql("""
SELECT
CASE
WHEN visitor_age BETWEEN 18 AND 30 THEN '青年'
WHEN visitor_age BETWEEN 31 AND 50 THEN '中年'
ELSE '老年'
END AS age_segment,
travel_mode,
COUNT(*) AS mode_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(PARTITION BY
CASE
WHEN visitor_age BETWEEN 18 AND 30 THEN '青年'
WHEN visitor_age BETWEEN 31 AND 50 THEN '中年'
ELSE '老年'
END
), 2) AS preference_rate
FROM visitor_data
WHERE travel_mode IS NOT NULL AND visitor_age IS NOT NULL
GROUP BY age_segment, travel_mode
ORDER BY age_segment, mode_count DESC
""")
trip_type_analysis = spark.sql("""
SELECT
CASE
WHEN source_province = attraction_province THEN '省内游'
ELSE '跨省游'
END AS trip_type,
COUNT(*) AS trip_count,
ROUND(AVG(consumption_amount), 2) AS avg_spending,
ROUND(AVG(duration_days), 2) AS avg_duration
FROM visitor_data
WHERE source_province IS NOT NULL AND attraction_province IS NOT NULL
GROUP BY trip_type
""")
duration_distribution = spark.sql("""
SELECT
CASE
WHEN duration_days BETWEEN 1 AND 3 THEN '1-3天短途游'
WHEN duration_days BETWEEN 4 AND 7 THEN '4-7天中长途游'
WHEN duration_days > 7 THEN '7天以上深度游'
ELSE '一日游'
END AS duration_category,
COUNT(*) AS visitor_count,
ROUND(AVG(consumption_amount), 2) AS avg_consumption
FROM visitor_data
WHERE duration_days IS NOT NULL
GROUP BY duration_category
ORDER BY visitor_count DESC
""")
return {
"age_distribution": age_distribution.toPandas(),
"gender_consumption": gender_consumption.toPandas(),
"source_province_top10": source_province_top10.toPandas(),
"age_travel_mode_preference": age_travel_mode_preference.toPandas(),
"trip_type_analysis": trip_type_analysis.toPandas(),
"duration_distribution": duration_distribution.toPandas()
}
def analyze_consumption_behavior_with_rfm_clustering(hdfs_data_path, current_date_str):
"""游客消费行为分析与RFM价值聚类核心函数:消费分层、旅游方式影响、K-Means聚类、日均消费"""
visitor_df = spark.read.parquet(hdfs_data_path)
visitor_df.createOrReplaceTempView("visitor_consumption")
    datetime.strptime(current_date_str, '%Y-%m-%d')  # validate the reference date before it is embedded in the SQL below
consumption_level_distribution = spark.sql("""
SELECT
CASE
WHEN consumption_amount < 2000 THEN '经济型(<2000元)'
WHEN consumption_amount BETWEEN 2000 AND 5000 THEN '舒适型(2000-5000元)'
WHEN consumption_amount BETWEEN 5001 AND 10000 THEN '高端型(5001-10000元)'
ELSE '豪华型(>10000元)'
END AS consumption_level,
COUNT(*) AS visitor_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS percentage,
ROUND(SUM(consumption_amount), 2) AS total_revenue
FROM visitor_consumption
WHERE consumption_amount IS NOT NULL
GROUP BY consumption_level
ORDER BY visitor_count DESC
""")
travel_mode_consumption_impact = spark.sql("""
SELECT
travel_mode,
COUNT(*) AS visitor_count,
ROUND(AVG(consumption_amount), 2) AS avg_consumption,
ROUND(MAX(consumption_amount), 2) AS max_consumption,
ROUND(MIN(consumption_amount), 2) AS min_consumption,
ROUND(STDDEV(consumption_amount), 2) AS consumption_stddev
FROM visitor_consumption
WHERE travel_mode IS NOT NULL AND consumption_amount IS NOT NULL
GROUP BY travel_mode
ORDER BY avg_consumption DESC
""")
rfm_query = f"""
SELECT
visitor_id,
DATEDIFF(TO_DATE('{current_date_str}'), MAX(play_date)) AS recency,
COUNT(DISTINCT play_date) AS frequency,
ROUND(SUM(consumption_amount), 2) AS monetary
FROM visitor_consumption
WHERE visitor_id IS NOT NULL AND play_date IS NOT NULL AND consumption_amount IS NOT NULL
GROUP BY visitor_id
"""
rfm_df = spark.sql(rfm_query).toPandas()
if len(rfm_df) > 0:
scaler = StandardScaler()
rfm_normalized = scaler.fit_transform(rfm_df[['recency', 'frequency', 'monetary']])
optimal_k = min(4, len(rfm_df))
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
rfm_df['cluster'] = kmeans.fit_predict(rfm_normalized)
cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=['recency', 'frequency', 'monetary'])
cluster_centers['cluster'] = range(optimal_k)
cluster_centers['avg_monetary'] = cluster_centers['monetary']
cluster_centers = cluster_centers.sort_values('avg_monetary', ascending=False)
value_mapping = {cluster_centers.iloc[i]['cluster']: f'第{i+1}价值层' for i in range(len(cluster_centers))}
rfm_df['value_segment'] = rfm_df['cluster'].map(value_mapping)
cluster_summary = rfm_df.groupby('value_segment').agg({
'visitor_id': 'count',
'recency': 'mean',
'frequency': 'mean',
'monetary': 'mean'
}).round(2).reset_index()
cluster_summary.columns = ['value_segment', 'visitor_count', 'avg_recency_days', 'avg_frequency', 'avg_monetary']
else:
cluster_summary = pd.DataFrame()
daily_consumption_analysis = spark.sql("""
SELECT
visitor_id,
consumption_amount,
duration_days,
ROUND(consumption_amount / NULLIF(duration_days, 0), 2) AS daily_consumption,
CASE
WHEN ROUND(consumption_amount / NULLIF(duration_days, 0), 2) < 300 THEN '低日均消费'
WHEN ROUND(consumption_amount / NULLIF(duration_days, 0), 2) BETWEEN 300 AND 800 THEN '中日均消费'
ELSE '高日均消费'
END AS daily_consumption_level
FROM visitor_consumption
WHERE duration_days IS NOT NULL AND duration_days > 0 AND consumption_amount IS NOT NULL
""")
daily_consumption_distribution = daily_consumption_analysis.groupBy("daily_consumption_level").agg(
count("*").alias("visitor_count"),
avg("daily_consumption").alias("avg_daily_spending")
).orderBy(col("avg_daily_spending").desc())
return {
"consumption_level_distribution": consumption_level_distribution.toPandas(),
"travel_mode_consumption_impact": travel_mode_consumption_impact.toPandas(),
"rfm_cluster_summary": cluster_summary,
"daily_consumption_distribution": daily_consumption_distribution.toPandas()
}
def analyze_attraction_popularity_and_satisfaction(hdfs_data_path):
"""景点吸引力与满意度分析核心函数:销量排行、类型满意度、价格关系、星级对比、情感倾向"""
attraction_df = spark.read.parquet(hdfs_data_path)
attraction_df.createOrReplaceTempView("attraction_data")
top15_attractions_by_sales = spark.sql("""
SELECT
attraction_name,
SUM(attraction_sales) AS total_sales,
ROUND(AVG(satisfaction_score), 2) AS avg_satisfaction,
COUNT(DISTINCT visitor_id) AS unique_visitors,
ROUND(SUM(attraction_sales) * 100.0 / SUM(SUM(attraction_sales)) OVER(), 2) AS market_share_pct
FROM attraction_data
WHERE attraction_name IS NOT NULL AND attraction_sales IS NOT NULL
GROUP BY attraction_name
ORDER BY total_sales DESC
LIMIT 15
""")
satisfaction_by_attraction_type = spark.sql("""
SELECT
attraction_type,
COUNT(*) AS visit_count,
ROUND(AVG(satisfaction_score), 2) AS avg_satisfaction,
ROUND(MAX(satisfaction_score), 2) AS max_satisfaction,
ROUND(MIN(satisfaction_score), 2) AS min_satisfaction,
ROUND(STDDEV(satisfaction_score), 2) AS satisfaction_stddev
FROM attraction_data
WHERE attraction_type IS NOT NULL AND satisfaction_score IS NOT NULL
GROUP BY attraction_type
ORDER BY avg_satisfaction DESC
""")
price_satisfaction_correlation = spark.sql("""
SELECT
CASE
WHEN ticket_price < 50 THEN '低价位(<50元)'
WHEN ticket_price BETWEEN 50 AND 150 THEN '中价位(50-150元)'
WHEN ticket_price BETWEEN 151 AND 300 THEN '高价位(151-300元)'
ELSE '超高价位(>300元)'
END AS price_range,
COUNT(*) AS visit_count,
ROUND(AVG(satisfaction_score), 2) AS avg_satisfaction,
ROUND(AVG(ticket_price), 2) AS avg_ticket_price,
ROUND(AVG(satisfaction_score) / NULLIF(AVG(ticket_price), 0) * 100, 4) AS satisfaction_price_ratio
FROM attraction_data
WHERE ticket_price IS NOT NULL AND satisfaction_score IS NOT NULL
GROUP BY price_range
ORDER BY avg_ticket_price
""")
satisfaction_by_attraction_rating = spark.sql("""
SELECT
attraction_rating,
COUNT(*) AS visitor_count,
ROUND(AVG(satisfaction_score), 2) AS avg_satisfaction,
ROUND(AVG(ticket_price), 2) AS avg_price,
ROUND(AVG(consumption_amount), 2) AS avg_total_consumption
FROM attraction_data
WHERE attraction_rating IS NOT NULL AND satisfaction_score IS NOT NULL
GROUP BY attraction_rating
ORDER BY
CASE
WHEN attraction_rating = '5A' THEN 1
WHEN attraction_rating = '4A' THEN 2
WHEN attraction_rating = '3A' THEN 3
ELSE 4
END
""")
sentiment_polarity_distribution = spark.sql("""
SELECT
sentiment_polarity,
COUNT(*) AS comment_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS percentage,
ROUND(AVG(satisfaction_score), 2) AS avg_satisfaction_in_sentiment
FROM attraction_data
WHERE sentiment_polarity IS NOT NULL
GROUP BY sentiment_polarity
ORDER BY
CASE
WHEN sentiment_polarity = '正面' THEN 1
WHEN sentiment_polarity = '中性' THEN 2
ELSE 3
END
""")
return {
"top15_attractions_by_sales": top15_attractions_by_sales.toPandas(),
"satisfaction_by_attraction_type": satisfaction_by_attraction_type.toPandas(),
"price_satisfaction_correlation": price_satisfaction_correlation.toPandas(),
"satisfaction_by_attraction_rating": satisfaction_by_attraction_rating.toPandas(),
"sentiment_polarity_distribution": sentiment_polarity_distribution.toPandas()
}
```
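To show how the three analysis functions fit together, here is a hypothetical driver; the HDFS path and the RFM reference date are placeholders rather than values from the project.

```python
# Hypothetical driver for the three analysis functions above; the HDFS path
# and the RFM reference date are placeholders, not values from the project.
if __name__ == "__main__":
    data_path = "hdfs:///tourism/visitor_data"  # placeholder path
    portrait = analyze_visitor_portrait_multi_dimension(data_path)
    consumption = analyze_consumption_behavior_with_rfm_clustering(data_path, "2024-12-31")
    attractions = analyze_attraction_popularity_and_satisfaction(data_path)
    print(portrait["age_distribution"])
    print(consumption["rfm_cluster_summary"])
    print(attractions["top15_attractions_by_sales"])
    spark.stop()
```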
Domestic Tourist Attraction Visitor Data Analysis System - Conclusion
Topic too hard, technology too hard, data too hard? This Hadoop+Spark tourism analysis system tackles all three pain points of a big data capstone at once.
Why does the same kind of data analysis system sail through the defense with high marks once the Hadoop+Spark stack is added?
A big data capstone in 7 days: the Hadoop+Spark tourism data analysis system, with five analysis modules that put an excellent grade well within reach.
Thanks for the likes, favorites, coins, and follows! If you run into technical issues or want the source code, feel free to discuss in the comments!