A Top Pick for Big-Data Capstone Projects: A Dongchedi Used-Car Data Analysis System Built on Hadoop + Spark


🎓 Author: 计算机毕设小月哥 | Software Development Specialist

🖥️ About: 8 years of software development experience, proficient in Java, Python, WeChat Mini Programs, Android, big data, PHP, .NET|C#, Golang, and related stacks.

🛠️ Services 🛠️

  • Custom development to your requirements

  • Source code delivery with walkthroughs

  • Technical writing (guidance on capstone topic selection [novel + innovative], task statements, proposals, literature reviews, foreign-literature translation, etc.)

  • Defense-presentation (PPT) preparation

🌟 Welcome to like 👍, favorite ⭐, and comment 📝

👇🏻 Recommended columns 👇🏻 Subscribe and follow!

Big data hands-on projects

PHP | C#.NET | Golang hands-on projects

WeChat Mini Program | Android hands-on projects

Python hands-on projects

Java hands-on projects

🍅 ↓↓ Contact me via my profile page for the source code ↓↓ 🍅

Dongchedi Used-Car Big-Data Analysis System - Feature Overview

The Dongchedi used-car data analysis system built on Hadoop + Spark is a big-data application dedicated to deep mining and intelligent analysis of used-car market data. It uses Hadoop as its distributed storage foundation, leverages Spark's in-memory computing to process large volumes of used-car transaction data, and relies on HDFS for reliable storage and efficient access. Architecturally, the system combines Python's strength in data processing with Java's enterprise development ecosystem: the back end exposes RESTful APIs built with Django, while the front end uses Vue + ElementUI for a modern data-visualization interface, rendering rich analytical charts with Echarts. The analysis mines data along four core dimensions: macro market characteristics, price-influencing factors, brand competitiveness, and supply profiling. Pandas and NumPy handle data preprocessing, Spark SQL performs complex multi-dimensional statistics, and an integrated K-Means clustering algorithm groups vehicles intelligently. Beyond basic analyses of vehicle age, mileage, price, and city distribution, the system also surfaces higher-value business insights such as brand value retention, price trends, and supply-demand dynamics, giving used-car market participants a data-driven decision-support tool.
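
As a hedged illustration of how the Django back end might hand analysis results to the Echarts front end, the sketch below shapes a list of result dicts into an Echarts bar-chart option. The function name and the example field names are assumptions for illustration, not taken from the project source.

```python
def to_echarts_bar_option(rows, x_key, y_key, title):
    """Shape a list of analysis-result dicts into an Echarts bar-chart option."""
    return {
        "title": {"text": title},
        "xAxis": {"type": "category", "data": [r[x_key] for r in rows]},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": [r[y_key] for r in rows]}],
    }

# Example: two rows of a (hypothetical) age-distribution result.
rows = [
    {"age_range": "1年内", "count": 120},
    {"age_range": "1-3年", "count": 340},
]
option = to_echarts_bar_option(rows, "age_range", "count", "车龄分布")
```

A Django view would then only need to serialize such a dict to JSON for the Vue page to feed straight into Echarts.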

Dongchedi Used-Car Big-Data Analysis System - Background and Significance

Background

As China's car ownership keeps rising and consumer attitudes gradually shift, the used-car market is entering a period of rapid growth. Mainstream trading platforms such as Dongchedi have accumulated large amounts of vehicle information, pricing data, and user-behavior data that encode rich market patterns and consumption trends. Traditional used-car market analysis relies heavily on human judgment, lacks systematic data support, and struggles to track market dynamics and price movements accurately. Faced with such large, multi-dimensional transaction data, conventional processing methods can no longer support deep analysis, so big-data techniques are needed for effective mining and intelligent analysis. The maturity of big-data technology provides a strong foundation for handling this kind of complex data: distributed computing frameworks such as Hadoop and Spark can efficiently process terabyte-scale used-car data, opening a new technical path for market analysis.

Significance

This project has both theoretical and practical value. Technically, combining Hadoop's distributed storage with Spark's in-memory computing addresses the storage and processing challenges of large-scale used-car data and offers a reference architecture for similar big-data analysis projects. In practice, multi-dimensional analysis gives buyers and sellers more objective market information, helps consumers make better purchase decisions, and provides dealers with trend forecasts and inventory-optimization suggestions. For platform operators, the analytical results can inform pricing strategy and improve user experience. Academically, the system applies big-data processing to a concrete business scenario and explores data-mining applications in the automotive aftermarket, contributing a practical case study to the field. Although a graduation project is necessarily limited in scale and complexity, the design approach and implementation still carry reference and adoption value.

Dongchedi Used-Car Big-Data Analysis System - Technology Stack

  • Big-data frameworks: Hadoop + Spark (Hive is not used in this build; customization is supported)

  • Languages: Python + Java (both versions available)

  • Back-end frameworks: Django + Spring Boot (Spring + SpringMVC + MyBatis) (both versions available)

  • Front end: Vue + ElementUI + Echarts + HTML + CSS + JavaScript + jQuery

  • Key techniques: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy

  • Database: MySQL
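
The stack lists Pandas and NumPy for preprocessing, a step the code section below does not show. A minimal sketch of what it could look like follows; the column names mirror those used in the Spark code (`sh_price`, `car_age`, `car_mileage`), but the exact cleaning rules here are assumptions.

```python
import pandas as pd
import numpy as np

# Toy listing data; prices in 万元, mileage in 万公里 (assumed units).
raw = pd.DataFrame({
    "sh_price": [12.5, None, 8.0],
    "car_age": [3.0, 5.0, None],
    "car_mileage": [4.2, 6.0, 9.9],
})

# Drop rows missing the fields every downstream analysis depends on.
cleaned = raw.dropna(subset=["sh_price", "car_age"]).copy()

# Derive an illustrative feature with NumPy.
cleaned["price_per_year"] = np.round(cleaned["sh_price"] / cleaned["car_age"], 2)
```

Only the first row survives the cleaning here, since the other two are missing a required field.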

Dongchedi Used-Car Big-Data Analysis System - Video Demo


Dongchedi Used-Car Big-Data Analysis System - Screenshots


Dongchedi Used-Car Big-Data Analysis System - Code Highlights

from pyspark.sql import SparkSession
# Import Spark SQL functions explicitly: the original wildcard import shadowed
# Python's builtin round(), which the result-formatting code below relies on.
from pyspark.sql.functions import col, count, avg, desc
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Local session with adaptive query execution enabled; point master() at a real cluster for deployment.
spark = (SparkSession.builder
    .appName("DongchediCarAnalysis")
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate())

def market_macro_analysis():
    # Dimension 1: macro market profile -- age, mileage, city, and transfer-count distributions.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/car_data/dongchedi_data.csv")
    df_cleaned = df.filter(col("sh_price").isNotNull() & col("car_age").isNotNull() & col("car_mileage").isNotNull())
    age_distribution = df_cleaned.groupBy("car_age").agg(count("*").alias("car_count"), avg("sh_price").alias("avg_price")).orderBy("car_age")
    age_bins = [0, 1, 3, 5, 8, 10, float('inf')]
    age_labels = ["1年内", "1-3年", "3-5年", "5-8年", "8-10年", "10年以上"]
    age_result = []
    for i in range(len(age_bins) - 1):
        filtered_df = df_cleaned.filter((col("car_age") > age_bins[i]) & (col("car_age") <= age_bins[i + 1]))
        n = filtered_df.count()  # renamed from `count` so it does not shadow pyspark's count()
        avg_price = filtered_df.agg(avg("sh_price")).collect()[0][0] if n > 0 else 0
        age_result.append({"age_range": age_labels[i], "count": n, "avg_price": round(avg_price, 2) if avg_price else 0})
    mileage_distribution = df_cleaned.groupBy("car_mileage").agg(count("*").alias("car_count"), avg("sh_price").alias("avg_price")).orderBy("car_mileage")
    mileage_bins = [0, 3, 6, 10, 15, 20, float('inf')]
    mileage_labels = ["3万公里内", "3-6万公里", "6-10万公里", "10-15万公里", "15-20万公里", "20万公里以上"]
    mileage_result = []
    for i in range(len(mileage_bins) - 1):
        filtered_df = df_cleaned.filter((col("car_mileage") > mileage_bins[i]) & (col("car_mileage") <= mileage_bins[i + 1]))
        n = filtered_df.count()  # renamed from `count` so it does not shadow pyspark's count()
        avg_price = filtered_df.agg(avg("sh_price")).collect()[0][0] if n > 0 else 0
        mileage_result.append({"mileage_range": mileage_labels[i], "count": n, "avg_price": round(avg_price, 2) if avg_price else 0})
    city_distribution = df_cleaned.groupBy("car_source_city_name").agg(count("*").alias("car_count"), avg("sh_price").alias("avg_price")).orderBy(desc("car_count")).limit(20)
    city_result = city_distribution.collect()
    transfer_distribution = df_cleaned.groupBy("transfer_cnt").agg(count("*").alias("car_count"), avg("sh_price").alias("avg_price")).orderBy("transfer_cnt")
    transfer_result = transfer_distribution.collect()
    return {"age_analysis": age_result, "mileage_analysis": mileage_result, "city_analysis": [row.asDict() for row in city_result], "transfer_analysis": [row.asDict() for row in transfer_result]}
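
The age and mileage loops above share one pattern: assign each value to a half-open bin (bins[i], bins[i+1]]. Extracted as a plain-Python helper (a sketch for unit testing without a cluster, not part of the project source), the rule looks like:

```python
def bin_label(value, bins, labels):
    """Return the label of the half-open bin (bins[i], bins[i+1]] containing value."""
    for i in range(len(bins) - 1):
        if bins[i] < value <= bins[i + 1]:
            return labels[i]
    return None  # value <= bins[0], e.g. a car age of exactly 0

# Same bins and labels as the Spark code above.
age_bins = [0, 1, 3, 5, 8, 10, float("inf")]
age_labels = ["1年内", "1-3年", "3-5年", "5-8年", "8-10年", "10年以上"]
```

Note that a value equal to the lowest edge (here 0) falls into no bin, matching the strict `>` in the Spark filters.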

def price_influence_factor_analysis():
    # Dimension 2: price-influencing factors -- age, mileage, city, and original price vs. depreciation.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/car_data/dongchedi_data.csv")
    df_cleaned = df.filter(col("sh_price").isNotNull() & col("car_age").isNotNull() & col("car_mileage").isNotNull() & col("official_price").isNotNull())
    df_with_depreciation = df_cleaned.withColumn("depreciation_rate", (col("official_price") - col("sh_price")) / col("official_price"))
    age_price_correlation = df_with_depreciation.groupBy("car_age").agg(avg("sh_price").alias("avg_price"), avg("depreciation_rate").alias("avg_depreciation"), count("*").alias("sample_count")).orderBy("car_age")
    age_price_result = age_price_correlation.collect()
    mileage_bins = [0, 3, 6, 10, 15, 20, float('inf')]
    mileage_labels = ["3万公里内", "3-6万公里", "6-10万公里", "10-15万公里", "15-20万公里", "20万公里以上"]
    mileage_price_result = []
    for i in range(len(mileage_bins)-1):
        filtered_df = df_with_depreciation.filter((col("car_mileage") > mileage_bins[i]) & (col("car_mileage") <= mileage_bins[i+1]))
        if filtered_df.count() > 0:
            stats = filtered_df.agg(avg("sh_price").alias("avg_price"), avg("depreciation_rate").alias("avg_depreciation"), count("*").alias("sample_count")).collect()[0]
            mileage_price_result.append({"mileage_range": mileage_labels[i], "avg_price": round(stats["avg_price"], 2), "avg_depreciation": round(stats["avg_depreciation"], 4), "sample_count": stats["sample_count"]})
    city_price_analysis = df_with_depreciation.groupBy("car_source_city_name").agg(avg("sh_price").alias("avg_price"), avg("depreciation_rate").alias("avg_depreciation"), count("*").alias("sample_count")).filter(col("sample_count") >= 50).orderBy(desc("avg_price")).limit(15)
    city_price_result = city_price_analysis.collect()
    official_price_bins = [0, 10, 20, 30, 50, 100, float('inf')]
    official_price_labels = ["10万以下", "10-20万", "20-30万", "30-50万", "50-100万", "100万以上"]
    official_price_result = []
    for i in range(len(official_price_bins)-1):
        filtered_df = df_with_depreciation.filter((col("official_price") > official_price_bins[i]) & (col("official_price") <= official_price_bins[i+1]))
        if filtered_df.count() > 0:
            stats = filtered_df.agg(avg("sh_price").alias("avg_price"), avg("depreciation_rate").alias("avg_depreciation"), count("*").alias("sample_count")).collect()[0]
            official_price_result.append({"price_range": official_price_labels[i], "avg_price": round(stats["avg_price"], 2), "avg_depreciation": round(stats["avg_depreciation"], 4), "sample_count": stats["sample_count"]})
    return {"age_price_analysis": [row.asDict() for row in age_price_result], "mileage_price_analysis": mileage_price_result, "city_price_analysis": [row.asDict() for row in city_price_result], "official_price_analysis": official_price_result}
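
The depreciation figures above all derive from one formula: depreciation = (official_price - sh_price) / official_price, with value retention as its complement. As a plain-Python sketch for checking the arithmetic:

```python
def depreciation_rate(official_price, sh_price):
    """Fraction of the official (new-car) price lost on the used market."""
    return (official_price - sh_price) / official_price

def value_retention_rate(official_price, sh_price):
    """Complement of depreciation; the brand analysis ranks brands by this."""
    return 1 - depreciation_rate(official_price, sh_price)
```

For example, a car listed at 15 with an official price of 20 has depreciated by 25% and retains 75% of its value.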

def brand_competition_analysis():
    # Dimension 3: brand competitiveness -- market share, value retention, and brand tiers.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/car_data/dongchedi_data.csv")
    df_cleaned = df.filter(col("sh_price").isNotNull() & col("car_age").isNotNull() & col("brand_name").isNotNull() & col("official_price").isNotNull() & (col("official_price") > 0))
    df_with_depreciation = df_cleaned.withColumn("depreciation_rate", (col("official_price") - col("sh_price")) / col("official_price")).withColumn("value_retention_rate", 1 - col("depreciation_rate"))
    brand_market_share = df_with_depreciation.groupBy("brand_name").agg(count("*").alias("car_count")).orderBy(desc("car_count"))
    total_cars = df_with_depreciation.count()
    from pyspark.sql.functions import round as spark_round  # Spark-side round; the builtin round() only handles plain numbers
    brand_share_result = brand_market_share.withColumn("market_share", spark_round(col("car_count") / total_cars * 100, 2)).limit(20).collect()
    brand_value_retention = df_with_depreciation.groupBy("brand_name").agg(avg("value_retention_rate").alias("avg_retention_rate"), avg("sh_price").alias("avg_price"), avg("car_age").alias("avg_age"), count("*").alias("sample_count")).filter(col("sample_count") >= 30).orderBy(desc("avg_retention_rate"))
    brand_retention_result = brand_value_retention.limit(20).collect()
    premium_brands = ["奔驰", "宝马", "奥迪", "保时捷", "雷克萨斯", "沃尔沃", "捷豹", "路虎", "凯迪拉克", "林肯"]
    mainstream_brands = ["大众", "丰田", "本田", "日产", "福特", "雪佛兰", "现代", "起亚", "马自达", "三菱"]
    domestic_brands = ["比亚迪", "吉利", "长安", "奇瑞", "长城", "传祺", "荣威", "名爵", "红旗", "领克"]
    brand_category_analysis = []
    for category, brands in [("豪华品牌", premium_brands), ("主流合资", mainstream_brands), ("自主品牌", domestic_brands)]:
        category_df = df_with_depreciation.filter(col("brand_name").isin(brands))
        if category_df.count() > 0:
            stats = category_df.agg(avg("value_retention_rate").alias("avg_retention"), avg("sh_price").alias("avg_price"), avg("car_age").alias("avg_age"), count("*").alias("sample_count")).collect()[0]
            brand_category_analysis.append({"category": category, "avg_retention_rate": round(stats["avg_retention"], 4), "avg_price": round(stats["avg_price"], 2), "avg_age": round(stats["avg_age"], 1), "sample_count": stats["sample_count"]})
    top_brands_detail = df_with_depreciation.filter(col("brand_name").isin([row["brand_name"] for row in brand_share_result[:10]])).groupBy("brand_name").agg(avg("sh_price").alias("avg_price"), avg("car_age").alias("avg_age"), avg("car_mileage").alias("avg_mileage"), avg("value_retention_rate").alias("retention_rate"), count("*").alias("sample_count")).collect()
    return {"market_share_analysis": [row.asDict() for row in brand_share_result], "value_retention_ranking": [row.asDict() for row in brand_retention_result], "category_analysis": brand_category_analysis, "top_brands_detail": [row.asDict() for row in top_brands_detail]}
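
Market share in the function above is simply a brand's listing count over the total, expressed as a percentage rounded to two decimals. A plain-Python equivalent (illustrative only, not part of the project source):

```python
def market_share_pct(brand_count, total_count):
    """A brand's share of all listings, as a percentage rounded to 2 decimals."""
    return round(brand_count / total_count * 100, 2)
```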

def vehicle_clustering_analysis():
    # Dimension 4: supply profiling -- K-Means clustering plus price-segment and nearly-new analysis.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/car_data/dongchedi_data.csv")
    df_cleaned = df.filter(col("sh_price").isNotNull() & col("car_age").isNotNull() & col("car_mileage").isNotNull()).filter((col("sh_price") > 0) & (col("car_age") >= 0) & (col("car_mileage") >= 0))
    feature_cols = ["car_age", "car_mileage", "sh_price"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    df_features = assembler.transform(df_cleaned)
    kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(df_features)
    df_clustered = model.transform(df_features)
    cluster_analysis = df_clustered.groupBy("cluster").agg(avg("car_age").alias("avg_age"), avg("car_mileage").alias("avg_mileage"), avg("sh_price").alias("avg_price"), count("*").alias("cluster_size")).orderBy("cluster")
    cluster_result = cluster_analysis.collect()
    cluster_descriptions = []
    for row in cluster_result:
        cluster_id = row["cluster"]
        avg_age = round(row["avg_age"], 1)
        avg_mileage = round(row["avg_mileage"], 1)
        avg_price = round(row["avg_price"], 2)
        size = row["cluster_size"]
        if avg_price > 30 and avg_age < 3:
            description = "豪华准新车型"
        elif avg_price > 20 and avg_age < 5:
            description = "高端次新车型"
        elif avg_price < 10 and avg_age > 8:
            description = "经济实用老车"
        elif avg_mileage < 5 and avg_age < 5:
            description = "低里程优质车"
        else:
            description = "主流性价比车型"
        cluster_descriptions.append({"cluster_id": cluster_id, "description": description, "avg_age": avg_age, "avg_mileage": avg_mileage, "avg_price": avg_price, "cluster_size": size})
    price_segment_analysis = []
    price_ranges = [(0, 5, "5万以下"), (5, 10, "5-10万"), (10, 20, "10-20万"), (20, 30, "20-30万"), (30, 50, "30-50万"), (50, float('inf'), "50万以上")]
    for min_price, max_price, label in price_ranges:
        segment_df = df_cleaned.filter((col("sh_price") > min_price) & (col("sh_price") <= max_price))
        if segment_df.count() > 0:
            stats = segment_df.agg(avg("car_age").alias("avg_age"), avg("car_mileage").alias("avg_mileage"), avg("transfer_cnt").alias("avg_transfer"), count("*").alias("segment_count")).collect()[0]
            price_segment_analysis.append({"price_range": label, "avg_age": round(stats["avg_age"], 1), "avg_mileage": round(stats["avg_mileage"], 1), "avg_transfer_count": round(stats["avg_transfer"], 1), "segment_count": stats["segment_count"]})
    nearly_new_cars = df_cleaned.filter((col("car_age") <= 1) & (col("car_mileage") <= 1))
    if nearly_new_cars.count() > 0:
        nearly_new_analysis = nearly_new_cars.groupBy("brand_name").agg(count("*").alias("car_count"), avg("sh_price").alias("avg_price"), avg(col("sh_price") / col("official_price")).alias("price_ratio")).filter(col("car_count") >= 5).orderBy(desc("car_count")).limit(15)
        nearly_new_result = nearly_new_analysis.collect()
    else:
        nearly_new_result = []
    return {"clustering_result": cluster_descriptions, "price_segment_analysis": price_segment_analysis, "nearly_new_analysis": [row.asDict() for row in nearly_new_result]}
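
The cluster-labelling rules inside vehicle_clustering_analysis can be pulled out as a pure function so they can be tested without running Spark. This sketch mirrors the if/elif chain above exactly (thresholds assume prices in 万元 and mileage in 万公里):

```python
def describe_cluster(avg_price, avg_age, avg_mileage):
    """Label a K-Means cluster using the same thresholds and rule order as above."""
    if avg_price > 30 and avg_age < 3:
        return "豪华准新车型"
    elif avg_price > 20 and avg_age < 5:
        return "高端次新车型"
    elif avg_price < 10 and avg_age > 8:
        return "经济实用老车"
    elif avg_mileage < 5 and avg_age < 5:
        return "低里程优质车"
    return "主流性价比车型"
```

Because the rules are ordered, a cheap old car is labelled "经济实用老车" even if its mileage is high, and anything matching no rule falls through to the catch-all label.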

Dongchedi Used-Car Big-Data Analysis System - Closing Remarks
