Recommended Big Data Graduation Project: A Spark-Based Car Complaint Data Analysis and Visualization System


💖💖Author: 计算机编程小央姐 💙💙About me: I have long worked as a computer science trainer and genuinely enjoy teaching. My languages include Java, WeChat Mini Programs, Python, Golang, and Android, and my projects span big data, deep learning, websites, mini programs, Android apps, and algorithms. I regularly take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I also know some techniques for lowering plagiarism-check scores. I like sharing solutions to problems I run into during development and exchanging ideas about technology, so if you have any questions about code, feel free to ask me! 💛💛A word of thanks: I appreciate everyone's attention and support! 💜💜

💕💕Get the source code at the end of this article


Spark-Based Car Complaint Data Analysis and Visualization System: System Features

This car complaint data analysis and visualization system adopts a modern big data architecture built around Hadoop distributed storage and the Spark compute engine, forming a complete platform for processing and analyzing complaint data from the automotive industry. The system uses Spark SQL to process large volumes of complaint records efficiently, applying multidimensional analysis to key indicators such as per-brand complaint volume, the distribution of problems across models, and complaint trends over time. For visualization, it combines the Echarts charting library with a Vue front end to present results intuitively, supporting views such as brand complaint rankings, problem-type distribution statistics, and typical problems by car series. The system also applies natural language processing to the complaint text itself, extracting keywords, scoring sentiment strength, and identifying complaint topics, which gives consumers data to support purchase decisions and gives manufacturers direction for quality improvement. The back end is implemented with either Django or Spring Boot, closing the loop from data collection and storage through processing to visualization, with good extensibility and practical value.
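The post lists both Django and Spring Boot as back-end options. As a rough illustration of how the aggregated results might reach the Vue + Echarts front end, here is a minimal Django-flavored sketch; the route, the view name, and the analysis.spark_jobs module path are my own assumptions, not the project's actual code.

# Hypothetical Django view exposing a Spark analysis result as JSON;
# the Vue front end would feed this straight into an Echarts bar chart.
from django.http import JsonResponse
from django.urls import path

from analysis.spark_jobs import brand_complaint_ranking_analysis  # assumed module path

def brand_ranking_api(request):
    data = brand_complaint_ranking_analysis()  # in practice this result would be cached
    return JsonResponse(data, json_dumps_params={"ensure_ascii": False})

urlpatterns = [
    path("api/brand-ranking/", brand_ranking_api),
]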

Spark-Based Car Complaint Data Analysis and Visualization System: Technology Stack

Big data framework: Hadoop + Spark (Hive is not used in this build; customization is supported)
Languages: Python + Java (both versions supported)
Back end: Django or Spring Boot (Spring + SpringMVC + MyBatis) (both versions supported)
Front end: Vue + ElementUI + Echarts + HTML + CSS + JavaScript + jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL
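Because the stack pairs Spark with MySQL, one plausible wiring is to persist the aggregated DataFrames to MySQL through Spark's JDBC writer, so the Django or Spring Boot layer can serve results without touching the cluster. A minimal sketch, assuming a local MySQL instance, a car_analysis database, and the MySQL Connector/J driver on the Spark classpath (all assumptions on my part):

# Sketch: persist an aggregated Spark DataFrame to MySQL over JDBC.
# The URL, credentials, and database name below are placeholders.
def save_to_mysql(df, table_name):
    (df.write.format("jdbc")
       .option("url", "jdbc:mysql://localhost:3306/car_analysis?useUnicode=true&characterEncoding=utf8")
       .option("dbtable", table_name)
       .option("user", "root")
       .option("password", "secret")
       .option("driver", "com.mysql.cj.jdbc.Driver")
       .mode("overwrite")
       .save())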

Spark-Based Car Complaint Data Analysis and Visualization System: Background and Significance

In recent years the automotive market has grown rapidly, consumers' quality expectations have kept rising, and complaint platforms have accumulated large volumes of user feedback. These records carry rich quality information: they cover the distribution of problems across brands and models and capture consumers' real ownership experience. Traditional processing approaches, however, only cope with small datasets and fall short as complaint volumes keep growing. Manufacturers and consumers alike urgently need to mine value from this data, yet effective tools for datasets of this scale are lacking, and most existing analysis tools stop at simple statistics without surfacing the deeper patterns and trends. Meanwhile, the broad adoption of big data technology offers a way forward, and the maturity of compute frameworks such as Spark provides the technical foundation for building an efficient analysis system.

This project has genuine practical value. For consumers, the system offers a fuller picture of each brand's actual quality record and gives purchase decisions a data-backed reference, helping buyers avoid models with unusually many problems. For manufacturers, it helps surface widespread product defects early and shows what consumers care about most, pointing the way for product improvement and quality work. From a technical standpoint, applying big data processing to car complaint analysis validates, to a degree, the feasibility of Spark and related technologies in a real business scenario, and the end-to-end processing and visualization pipeline can serve as a reference for similar data analysis projects. That said, as a graduation project its main role is to demonstrate and practice big data skills; its real-world reach is limited, and its value lies chiefly in learning and technical exploration.

Spark-Based Car Complaint Data Analysis and Visualization System: Demo Video

Demo video

Spark-Based Car Complaint Data Analysis and Visualization System: Demo Screenshots

[Demo screenshots]

Spark-Based Car Complaint Data Analysis and Visualization System: Selected Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, explode, udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import StopWordsRemover, CountVectorizer
import jieba  # Chinese word segmentation

# Spark session with adaptive query execution enabled, so shuffle
# partitions are coalesced automatically during the aggregations below
spark = (
    SparkSession.builder
    .appName("CarComplaintAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

def brand_complaint_ranking_analysis():
    # Load the raw complaint records from HDFS and register them for Spark SQL
    complaint_df = spark.read.option("header", "true").csv("hdfs://localhost:9000/car_complaints/data.csv")
    complaint_df.createOrReplaceTempView("complaints")

    # Per-brand totals, distinct model counts, and share of all complaints
    brand_stats = spark.sql("""
        SELECT complaint_brand AS brand_name,
               COUNT(*) AS total_complaints,
               COUNT(DISTINCT complaint_model) AS model_count,
               ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM complaints), 2) AS complaint_percentage
        FROM complaints
        WHERE complaint_brand IS NOT NULL AND complaint_brand != ''
        GROUP BY complaint_brand
        ORDER BY total_complaints DESC
    """)

    # Top three problem types per brand, ranked with a window function
    brand_problem_distribution = spark.sql("""
        SELECT complaint_brand, problem_type, COUNT(*) AS problem_count,
               ROW_NUMBER() OVER (PARTITION BY complaint_brand ORDER BY COUNT(*) DESC) AS rank
        FROM complaints
        WHERE complaint_brand IS NOT NULL AND problem_type IS NOT NULL
        GROUP BY complaint_brand, problem_type
    """).filter(col("rank") <= 3)

    # Month-by-month complaint volume for each brand
    brand_monthly_trend = spark.sql("""
        SELECT complaint_brand,
               YEAR(complaint_date) AS year,
               MONTH(complaint_date) AS month,
               COUNT(*) AS monthly_complaints
        FROM complaints
        WHERE complaint_brand IS NOT NULL AND complaint_date IS NOT NULL
        GROUP BY complaint_brand, YEAR(complaint_date), MONTH(complaint_date)
        ORDER BY complaint_brand, year, month
    """)

    top_brands = brand_stats.limit(10).collect()
    top_brand_names = [row["brand_name"] for row in top_brands]
    result_data = {"brand_ranking": [], "problem_distribution": {}, "monthly_trends": {}}
    for row in top_brands:
        result_data["brand_ranking"].append({
            "brand": row["brand_name"],
            "complaints": row["total_complaints"],
            "models": row["model_count"],
            "percentage": row["complaint_percentage"],
        })
    # Fill the detail sections only for the top-10 brands, so a bounded
    # amount of data is collected back to the driver
    for row in brand_problem_distribution.filter(col("complaint_brand").isin(top_brand_names)).collect():
        result_data["problem_distribution"].setdefault(row["complaint_brand"], []).append(
            {"problem": row["problem_type"], "count": row["problem_count"]})
    for row in brand_monthly_trend.filter(col("complaint_brand").isin(top_brand_names)).collect():
        result_data["monthly_trends"].setdefault(row["complaint_brand"], []).append(
            {"year": row["year"], "month": row["month"], "complaints": row["monthly_complaints"]})
    return result_data
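All three analysis functions in this section re-read the same CSV from HDFS. A small refactor, sketched below, loads the data once with an explicit schema and caches it; the column types are assumptions inferred from how the queries use each field.

# Sketch: load the complaint data once, with an explicit (assumed) schema,
# and cache it so the analysis functions share one scan of the CSV.
from pyspark.sql.types import StructType, StructField, StringType, DateType

complaint_schema = StructType([
    StructField("complaint_id", StringType(), True),
    StructField("complaint_brand", StringType(), True),
    StructField("complaint_model", StringType(), True),
    StructField("problem_type", StringType(), True),
    StructField("complaint_summary", StringType(), True),
    StructField("complaint_date", DateType(), True),
])

def load_complaints():
    df = (spark.read.option("header", "true")
          .schema(complaint_schema)
          .csv("hdfs://localhost:9000/car_complaints/data.csv")
          .cache())
    df.createOrReplaceTempView("complaints")
    return df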

def complaint_text_mining_analysis():
    complaint_df = spark.read.option("header", "true").csv("hdfs://localhost:9000/car_complaints/data.csv")
    text_data = complaint_df.select("complaint_id", "complaint_summary").filter(col("complaint_summary").isNotNull())

    # jieba tokenizer: keep tokens longer than one character and drop digits
    # and punctuation (str.isalpha() is True for CJK characters)
    def chinese_tokenize(text):
        if text is None:
            return []
        words = jieba.lcut(text)
        return [word for word in words if len(word) > 1 and word.isalpha()]

    tokenize_udf = udf(chinese_tokenize, ArrayType(StringType()))
    text_tokenized = text_data.withColumn("words", tokenize_udf(col("complaint_summary")))

    # Strip common Chinese function words before counting
    stop_words = ["的", "是", "了", "在", "有", "和", "就", "不", "都", "很", "也", "但是", "因为", "所以", "这个", "那个"]
    stop_words_remover = StopWordsRemover(inputCol="words", outputCol="filtered_words", stopWords=stop_words)
    text_filtered = stop_words_remover.transform(text_tokenized)

    # Global keyword frequencies across all complaint summaries
    word_counts = text_filtered.select(explode(col("filtered_words")).alias("word")).groupBy("word").count().orderBy(desc("count"))
    top_keywords = word_counts.limit(50).collect()

    # Bag-of-words features for downstream topic analysis; a word must appear
    # in at least two complaints (minDF=2.0) to enter the vocabulary
    cv = CountVectorizer(inputCol="filtered_words", outputCol="features", vocabSize=1000, minDF=2.0)
    cv_model = cv.fit(text_filtered)
    text_vectorized = cv_model.transform(text_filtered)
    vocabulary = cv_model.vocabulary

    keyword_analysis = [{"word": row["word"], "count": row["count"]} for row in top_keywords]

    # Seed lexicons for the keyword-based sentiment scoring sketched below
    sentiment_keywords = {
        "positive": ["满意", "好", "不错", "优秀", "棒", "喜欢"],
        "negative": ["差", "烂", "垃圾", "失望", "糟糕", "问题", "故障", "坏"],
    }
    return {"top_keywords": keyword_analysis, "vocabulary_size": len(vocabulary)}

def complaint_time_trend_analysis():
    complaint_df = spark.read.option("header", "true").csv("hdfs://localhost:9000/car_complaints/data.csv")
    complaint_df.createOrReplaceTempView("complaints")

    # Daily complaint volume, plus how many brands and problem types each day touches
    daily_trends = spark.sql("""
        SELECT complaint_date,
               COUNT(*) AS daily_complaints,
               COUNT(DISTINCT complaint_brand) AS brands_involved,
               COUNT(DISTINCT problem_type) AS problem_types
        FROM complaints
        WHERE complaint_date IS NOT NULL
        GROUP BY complaint_date
        ORDER BY complaint_date DESC
    """)

    # Monthly roll-up, including the share of safety-related ('安全') complaints
    monthly_aggregation = spark.sql("""
        SELECT YEAR(complaint_date) AS year,
               MONTH(complaint_date) AS month,
               COUNT(*) AS monthly_complaints,
               COUNT(DISTINCT complaint_brand) AS active_brands,
               ROUND(AVG(CASE WHEN problem_type LIKE '%安全%' THEN 1 ELSE 0 END) * 100, 2) AS safety_issue_ratio
        FROM complaints
        WHERE complaint_date IS NOT NULL
        GROUP BY YEAR(complaint_date), MONTH(complaint_date)
        ORDER BY year DESC, month DESC
    """)

    # Season buckets ('春季' spring, '夏季' summer, '秋季' autumn, '冬季' winter);
    # also tracks air-conditioning ('空调') complaints, expected to peak in summer
    seasonal_analysis = spark.sql("""
        SELECT CASE
                 WHEN MONTH(complaint_date) IN (3,4,5) THEN '春季'
                 WHEN MONTH(complaint_date) IN (6,7,8) THEN '夏季'
                 WHEN MONTH(complaint_date) IN (9,10,11) THEN '秋季'
                 ELSE '冬季'
               END AS season,
               COUNT(*) AS seasonal_complaints,
               ROUND(AVG(CASE WHEN problem_type LIKE '%空调%' THEN 1 ELSE 0 END) * 100, 2) AS ac_issue_ratio
        FROM complaints
        WHERE complaint_date IS NOT NULL
        GROUP BY CASE
                   WHEN MONTH(complaint_date) IN (3,4,5) THEN '春季'
                   WHEN MONTH(complaint_date) IN (6,7,8) THEN '夏季'
                   WHEN MONTH(complaint_date) IN (9,10,11) THEN '秋季'
                   ELSE '冬季'
                 END
    """)

    # Ten busiest complaint days
    peak_analysis = daily_trends.orderBy(desc("daily_complaints")).limit(10)

    trend_data = {"daily_trends": [], "monthly_trends": [], "seasonal_patterns": [], "peak_days": []}
    for row in daily_trends.limit(100).collect():
        trend_data["daily_trends"].append({
            "date": str(row["complaint_date"]),
            "complaints": row["daily_complaints"],
            "brands": row["brands_involved"],
            "problem_types": row["problem_types"],
        })
    for row in monthly_aggregation.collect():
        trend_data["monthly_trends"].append({
            "year": row["year"],
            "month": row["month"],
            "complaints": row["monthly_complaints"],
            "brands": row["active_brands"],
            "safety_ratio": row["safety_issue_ratio"],
        })
    for row in seasonal_analysis.collect():
        trend_data["seasonal_patterns"].append({
            "season": row["season"],
            "complaints": row["seasonal_complaints"],
            "ac_ratio": row["ac_issue_ratio"],
        })
    for row in peak_analysis.collect():
        trend_data["peak_days"].append({
            "date": str(row["complaint_date"]),
            "complaints": row["daily_complaints"],
        })
    return trend_data
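For completeness, a minimal driver that runs the three analyses and hands the combined result to the web layer might look like this; the output path is a placeholder of mine, not part of the project.

import json

if __name__ == "__main__":
    results = {
        "brands": brand_complaint_ranking_analysis(),
        "keywords": complaint_text_mining_analysis(),
        "trends": complaint_time_trend_analysis(),
    }
    # Dump everything the Echarts front end needs into one JSON file
    with open("analysis_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    spark.stop()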

Spark-Based Car Complaint Data Analysis and Visualization System: Closing Remarks

💟💟If you have any questions, feel free to discuss them in detail in the comments below.