Recommended Big Data Graduation Project for the Class of 2026: A Spark Big Data + Visualization Supermarket Sales Data Analysis and Prediction System (a Machine-Learning-Based Supermarket Sales Forecasting and Decision Support System)


Note: this article showcases only part of the system's functionality.

1 Development Environment

Development language: Python

Technology stack: Spark, Hadoop, Django, Vue, ECharts

Database: MySQL

IDE: PyCharm

2 System Design

As the retail industry grows rapidly and consumer shopping behavior becomes increasingly complex, traditional sales-data analysis methods can no longer meet supermarkets' needs for fine-grained management. A supermarket generates massive sales data every day, spanning product information, customer purchase records, promotion outcomes, and other dimensions, and this data holds enormous commercial value. Because of its sheer volume and complex structure, however, traditional data-processing tools struggle to process it efficiently or to mine the patterns hidden within it. This motivates building a Spark big data + visualization supermarket sales data analysis and prediction system that uses Python, Spark, and Hadoop, combined with the Vue front-end framework and the ECharts visualization library, with MySQL as the storage layer, to deliver comprehensive, intelligent analysis of supermarket sales data.

The system has significant practical and commercial value: it helps supermarket managers extract actionable business insights from massive sales data and provides a scientific basis for operational decisions. By analyzing key indicators such as product sales patterns, customer purchasing behavior, and promotion effectiveness in depth, it can help supermarkets optimize procurement strategies, improve shelf layouts, design precise marketing campaigns, and raise customer satisfaction and loyalty. Its big-data-based predictive analytics also help forecast changes in market demand, reduce inventory risk, and improve capital turnover, ultimately maximizing operating returns and supporting sustainable growth.

The research centers on multi-dimensional, in-depth analysis of supermarket sales data. Using big data processing and data-mining techniques, the system builds a comprehensive analysis framework covering five dimensions: product sales, time trends, promotion effectiveness, customer behavior, and product associations. It mines key data fields such as product code, sales amount, customer code, sales date, and promotion flag, applying statistical analysis, association-rule mining, the RFM model, and other methods to analyze the sales data from every angle. Architecturally, the system uses Hadoop for distributed storage of massive data, Spark for large-scale computation and real-time analysis, Python for the core algorithm logic, MySQL for structured data management, and Vue with ECharts for an intuitive data-visualization interface, providing a scientific, intelligent foundation for supermarket management decisions.
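The core idea of the RFM model mentioned above can be sketched without Spark: each customer's Recency, Frequency, and Monetary value is scored 1-5 against percentile cut points, with lower recency scoring higher and higher frequency/monetary scoring higher. The cut points and sample values below are illustrative, not taken from the system's real data.

```python
# Minimal pure-Python sketch of RFM quintile scoring (illustrative only;
# the system itself computes percentiles with Spark's percentile_approx).

def quintile_score(value, q20, q40, q60, q80, lower_is_better=False):
    """Score a value 1-5 against the 20/40/60/80th percentile cut points."""
    if lower_is_better:  # e.g. recency: fewer days since last purchase is better
        if value <= q20: return 5
        if value <= q40: return 4
        if value <= q60: return 3
        if value <= q80: return 2
        return 1
    if value >= q80: return 5
    if value >= q60: return 4
    if value >= q40: return 3
    if value >= q20: return 2
    return 1

def rfm_score(recency, frequency, monetary, cuts):
    """Combine the three scores into a string like '543'."""
    r = quintile_score(recency, *cuts["recency"], lower_is_better=True)
    f = quintile_score(frequency, *cuts["frequency"])
    m = quintile_score(monetary, *cuts["monetary"])
    return f"{r}{f}{m}"

# Hypothetical cut points (q20, q40, q60, q80) for each metric.
cuts = {"recency": (7, 30, 90, 180),
        "frequency": (1, 3, 6, 12),
        "monetary": (50, 200, 500, 1500)}

print(rfm_score(recency=5, frequency=15, monetary=2000, cuts=cuts))
print(rfm_score(recency=200, frequency=1, monetary=30, cuts=cuts))
```

A customer with a score such as "555" would fall into the top segment, while low combined scores mark churned or churning customers.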

The system comprises five core functional modules:

- Product sales analysis: best-seller rankings, category share, price-range distribution, bulk vs. packaged product comparison, and profit-margin analysis.
- Time-dimension analysis: sales trends, day-of-week sales patterns, monthly category changes, and hot products by time slot.
- Promotion effect analysis: promotion uplift rate, promotion sensitivity by category, discount-depth impact, and promotion-frequency effects.
- Customer behavior analysis: RFM-based customer value segmentation, market-basket composition, purchase diversity, and price sensitivity.
- Product association analysis: product association mining, purchase-sequence patterns, complementary and substitute relationships, and bundling potential.
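The product association module rests on three standard metrics: support (how often the items appear together), confidence (how often the consequent follows the antecedent), and lift (how much more often than chance). A minimal pure-Python sketch of these definitions, with made-up transactions (the system itself uses Spark MLlib's FPGrowth at scale):

```python
# Support / confidence / lift for a rule "antecedent -> consequent",
# computed over a toy list of transactions (sets of product names).

def association_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)            # antecedent count
    c = sum(1 for t in transactions if consequent <= t)            # consequent count
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0                      # > 1 means positive association
    return support, confidence, lift

# Made-up example baskets.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]
support, confidence, lift = association_metrics(transactions, {"beer"}, {"diapers"})
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

Rules with high confidence and lift well above 1 are the candidates the module flags for bundling or adjacent shelf placement.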

3 System Showcase

3.1 Dashboard Page

Screenshot: 大屏下.png (data dashboard)

3.2 Analysis Pages

Screenshot: 促销效果.png (promotion effect analysis)

Screenshot: 顾客消费.png (customer consumption analysis)

Screenshot: 商品关联.png (product association analysis)

Screenshot: 商品销售.png (product sales analysis)

Screenshot: 时间维度.png (time-dimension analysis)

3.3 Basic Pages

Screenshot: 数据管理.png (data management)

Screenshot: 数据展示.png (data display)

4 More Recommendations

- New directions for computer-science capstone projects: 60 cutting-edge big data + AI topics for 2026, covering Hadoop, Spark, machine learning, AI, and more
- [Must-read] Topic pitfalls for the class of 2026: capstone topics you should never pick, with an in-depth analysis
- A K-Means clustering and big data based system for analyzing and visualizing characteristics of elder-care institutions
- A Hadoop + Spark based census income data analysis and visualization system
- A Hadoop and Python based rental housing data analysis and visualization system

5 Selected Feature Code

def rfm_customer_analysis():
    """Segment customers with the RFM model (quintile scoring) and persist results to MySQL."""
    sales_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/supermarket").option("dbtable", "sales_data").option("user", "root").option("password", "password").load()
    current_date = lit("2024-12-31")
    customer_rfm = sales_df.groupBy("customer_code").agg(max("sales_date").alias("last_purchase_date"), count("*").alias("frequency"), sum("sales_amount").alias("monetary"))
    rfm_with_recency = customer_rfm.withColumn("recency", datediff(current_date, col("last_purchase_date")))
    recency_stats = rfm_with_recency.agg(expr("percentile_approx(recency, 0.2)").alias("r_20"), expr("percentile_approx(recency, 0.4)").alias("r_40"), expr("percentile_approx(recency, 0.6)").alias("r_60"), expr("percentile_approx(recency, 0.8)").alias("r_80")).collect()[0]
    frequency_stats = rfm_with_recency.agg(expr("percentile_approx(frequency, 0.2)").alias("f_20"), expr("percentile_approx(frequency, 0.4)").alias("f_40"), expr("percentile_approx(frequency, 0.6)").alias("f_60"), expr("percentile_approx(frequency, 0.8)").alias("f_80")).collect()[0]
    monetary_stats = rfm_with_recency.agg(expr("percentile_approx(monetary, 0.2)").alias("m_20"), expr("percentile_approx(monetary, 0.4)").alias("m_40"), expr("percentile_approx(monetary, 0.6)").alias("m_60"), expr("percentile_approx(monetary, 0.8)").alias("m_80")).collect()[0]
    rfm_scored = rfm_with_recency.withColumn("R_score", when(col("recency") <= recency_stats.r_20, 5).when(col("recency") <= recency_stats.r_40, 4).when(col("recency") <= recency_stats.r_60, 3).when(col("recency") <= recency_stats.r_80, 2).otherwise(1)).withColumn("F_score", when(col("frequency") >= frequency_stats.f_80, 5).when(col("frequency") >= frequency_stats.f_60, 4).when(col("frequency") >= frequency_stats.f_40, 3).when(col("frequency") >= frequency_stats.f_20, 2).otherwise(1)).withColumn("M_score", when(col("monetary") >= monetary_stats.m_80, 5).when(col("monetary") >= monetary_stats.m_60, 4).when(col("monetary") >= monetary_stats.m_40, 3).when(col("monetary") >= monetary_stats.m_20, 2).otherwise(1))
    # Map RFM score combinations to named customer segments (segment labels kept in Chinese, as stored downstream)
    rfm_segmented = rfm_scored.withColumn("RFM_score", concat(col("R_score").cast("string"), col("F_score").cast("string"), col("M_score").cast("string"))).withColumn("customer_segment", when(col("RFM_score").isin(["555", "554", "544", "545", "454", "455", "445"]), "冠军客户").when(col("RFM_score").isin(["543", "444", "435", "355", "354", "345", "344", "335"]), "忠实客户").when(col("RFM_score").isin(["512", "511", "422", "421", "412", "411", "311"]), "潜在忠实客户").when(col("RFM_score").isin(["533", "532", "531", "523", "522", "521", "515", "514", "513", "425", "424", "413", "414", "415", "315", "314", "313"]), "新客户").when(col("RFM_score").isin(["155", "154", "144", "214", "215", "115", "114"]), "重要客户需要关注").when(col("RFM_score").isin(["331", "321", "231", "241", "251"]), "即将流失客户").when(col("RFM_score").isin(["111", "112", "121", "131", "141", "151"]), "流失客户").otherwise("其他客户"))
    customer_insights = rfm_segmented.withColumn("avg_order_value", col("monetary") / col("frequency")).withColumn("customer_lifetime_days", lit(365) - col("recency")).withColumn("estimated_clv", col("avg_order_value") * col("frequency") * (col("customer_lifetime_days") / 365))
    marketing_strategy = customer_insights.withColumn("recommended_strategy", when(col("customer_segment") == "冠军客户", "VIP专享活动+积分奖励").when(col("customer_segment") == "忠实客户", "会员专属优惠+新品推荐").when(col("customer_segment") == "潜在忠实客户", "个性化推荐+优惠券").when(col("customer_segment") == "新客户", "欢迎礼包+引导购买").when(col("customer_segment") == "即将流失客户", "挽回优惠+关怀活动").when(col("customer_segment") == "流失客户", "重激活促销+调研反馈").otherwise("标准营销活动"))
    final_result = marketing_strategy.select("customer_code", "recency", "frequency", "monetary", "R_score", "F_score", "M_score", "customer_segment", "avg_order_value", "estimated_clv", "recommended_strategy")
    result_data = final_result.collect()
    connection = mysql.connector.connect(host='localhost', database='supermarket', user='root', password='password')
    cursor = connection.cursor()
    cursor.execute("DELETE FROM customer_rfm_analysis")
    for row in result_data:
        cursor.execute("INSERT INTO customer_rfm_analysis VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", row)
    connection.commit()
    cursor.close()
    connection.close()
    return result_data
def market_basket_analysis():
    """Mine frequent itemsets and association rules with FPGrowth and derive bundling recommendations."""
    sales_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/supermarket").option("dbtable", "sales_data").option("user", "root").option("password", "password").load()
    transaction_items = sales_df.filter(col("sales_quantity") > 0).groupBy("customer_code", "sales_date").agg(collect_list("product_code").alias("items"))
    filtered_transactions = transaction_items.filter(size(col("items")) >= 2)
    fpgrowth = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.1)
    model = fpgrowth.fit(filtered_transactions)
    frequent_itemsets = model.freqItemsets.filter(size(col("items")) >= 2).withColumn("itemset_size", size(col("items"))).withColumn("support_percentage", col("freq") / filtered_transactions.count() * 100)
    association_rules = model.associationRules.withColumn("confidence_percentage", col("confidence") * 100).withColumn("lift_ratio", col("lift")).filter(col("confidence") > 0.3)
    product_names = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/supermarket").option("dbtable", "product_info").option("user", "root").option("password", "password").load().select("product_code", "product_name")
    # items is an array column, so membership must use array_contains rather than Column.contains
    itemsets_with_names = frequent_itemsets.join(product_names, array_contains(frequent_itemsets.items, product_names.product_code), "left")
    high_confidence_rules = association_rules.filter(col("confidence_percentage") >= 50).withColumn("antecedent_str", concat_ws(",", col("antecedent"))).withColumn("consequent_str", concat_ws(",", col("consequent")))
    cross_category_analysis = sales_df.join(spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/supermarket").option("dbtable", "product_info").option("user", "root").option("password", "password").load().select("product_code", "category_name"), "product_code").groupBy("customer_code", "sales_date").agg(collect_set("category_name").alias("categories"))
    category_combinations = cross_category_analysis.filter(size(col("categories")) >= 2).groupBy("categories").count().withColumn("combination_frequency", col("count")).orderBy(desc("count")).limit(20)
    seasonal_patterns = sales_df.withColumn("month", month(col("sales_date"))).withColumn("season", when(col("month").isin([12, 1, 2]), "冬季").when(col("month").isin([3, 4, 5]), "春季").when(col("month").isin([6, 7, 8]), "夏季").otherwise("秋季"))
    seasonal_associations = seasonal_patterns.groupBy("customer_code", "sales_date", "season").agg(collect_list("product_code").alias("seasonal_items")).filter(size(col("seasonal_items")) >= 2)
    promotion_impact = sales_df.filter(col("is_promotion") == 1).groupBy("customer_code", "sales_date").agg(collect_list("product_code").alias("promo_items"))
    promo_basket_patterns = promotion_impact.filter(size(col("promo_items")) >= 2).groupBy("promo_items").count().withColumn("promo_frequency", col("count")).orderBy(desc("count")).limit(10)
    product_revenue = sales_df.groupBy("product_code").agg(avg("sales_amount").alias("avg_revenue"))
    revenue_impact = association_rules.join(product_revenue, array_contains(association_rules.consequent, product_revenue.product_code), "left")
    high_value_rules = revenue_impact.withColumn("revenue_potential", col("confidence_percentage") * col("avg_revenue")).orderBy(desc("revenue_potential")).limit(15)
    final_recommendations = high_value_rules.withColumn("business_action", when(col("lift_ratio") > 2, "强推荐捆绑销售").when(col("lift_ratio") > 1.5, "建议关联陈列").when(col("lift_ratio") > 1.2, "考虑促销组合").otherwise("观察市场反应")).withColumn("expected_uplift", col("confidence_percentage") * col("lift_ratio") / 100)
    result_data = final_recommendations.select("antecedent", "consequent", "confidence_percentage", "lift_ratio", "revenue_potential", "business_action", "expected_uplift").collect()
    connection = mysql.connector.connect(host='localhost', database='supermarket', user='root', password='password')
    cursor = connection.cursor()
    cursor.execute("DELETE FROM market_basket_analysis")
    for row in result_data:
        cursor.execute("INSERT INTO market_basket_analysis VALUES (%s, %s, %s, %s, %s, %s, %s)", (str(row.antecedent), str(row.consequent), float(row.confidence_percentage), float(row.lift_ratio), float(row.revenue_potential), str(row.business_action), float(row.expected_uplift)))
    connection.commit()
    cursor.close()
    connection.close()
    return result_data

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.ml.fpm import FPGrowth
import mysql.connector

spark = SparkSession.builder.appName("SupermarketSalesAnalysis").config("spark.sql.adaptive.enabled", "true").config("spark.sql.adaptive.coalescePartitions.enabled", "true").getOrCreate()
def get_top20_bestselling_products():
    """Rank the top-20 best-selling products by a weighted quantity/amount score and persist to MySQL."""
    sales_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/supermarket").option("dbtable", "sales_data").option("user", "root").option("password", "password").load()
    product_sales = sales_df.groupBy("product_code", "product_name").agg(sum("sales_quantity").alias("total_quantity"), sum("sales_amount").alias("total_amount"), count("*").alias("transaction_count"))
    quantity_rank = product_sales.withColumn("quantity_rank", row_number().over(Window.orderBy(desc("total_quantity"))))
    amount_rank = product_sales.withColumn("amount_rank", row_number().over(Window.orderBy(desc("total_amount"))))
    top20_by_quantity = quantity_rank.filter(col("quantity_rank") <= 20).select("product_code", "product_name", "total_quantity", "quantity_rank")
    top20_by_amount = amount_rank.filter(col("amount_rank") <= 20).select("product_code", "product_name", "total_amount", "amount_rank")
    # The outer join leaves nulls for products ranked in only one list; fill with 0 so the composite score stays defined
    combined_top20 = top20_by_quantity.join(top20_by_amount, ["product_code", "product_name"], "outer").na.fill(0, ["total_quantity", "total_amount"])
    final_ranking = combined_top20.withColumn("avg_unit_price", col("total_amount") / col("total_quantity")).withColumn("comprehensive_score", col("total_quantity") * 0.4 + col("total_amount") * 0.6)
    top20_products = final_ranking.orderBy(desc("comprehensive_score")).limit(20)
    market_share = top20_products.withColumn("quantity_share", col("total_quantity") / sum("total_quantity").over(Window.partitionBy()) * 100).withColumn("amount_share", col("total_amount") / sum("total_amount").over(Window.partitionBy()) * 100)
    profitability = market_share.withColumn("profit_margin", when(col("avg_unit_price") > 50, "高利润").when(col("avg_unit_price") > 20, "中等利润").otherwise("低利润"))
    growth_trend = profitability.withColumn("sales_velocity", col("total_quantity") / col("transaction_count")).withColumn("recommended_action", when(col("sales_velocity") > 5, "增加库存").when(col("sales_velocity") > 2, "维持现状").otherwise("考虑促销"))
    result_data = growth_trend.select("product_code", "product_name", "total_quantity", "total_amount", "avg_unit_price", "quantity_share", "amount_share", "profit_margin", "recommended_action").collect()
    connection = mysql.connector.connect(host='localhost', database='supermarket', user='root', password='password')
    cursor = connection.cursor()
    cursor.execute("DELETE FROM top_products_analysis")
    for row in result_data:
        cursor.execute("INSERT INTO top_products_analysis VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", row)
    connection.commit()
    cursor.close()
    connection.close()
    return result_data

Source code projects, custom development, documentation and reports, PPTs, and code Q&A are available. I look forward to exchanging ideas with everyone.