[Big Data] Car Insurance Data Visualization and Analysis System | Computer Science Graduation Project | Hadoop + Spark Environment Setup | Data Science and Big Data Technology | Source Code + Documentation + Walkthrough Included


1. About the Author

💖💖 Author: 计算机编程果茶熊 💙💙 About me: I spent years teaching computer science training courses as a programming instructor, and I still enjoy teaching. I am proficient in Java, WeChat Mini Programs, Python, Golang, Android, and several other IT areas. I also take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know some techniques for reducing similarity-check scores. I like sharing solutions to problems I run into during development and talking shop, so feel free to ask me anything code-related! 💛💛 A word of thanks: thank you all for your attention and support! 💜💜 Web projects | Android/Mini Program projects | Big data projects | Graduation project topic selection 💕💕 Contact 计算机编程果茶熊 at the end of this article to get the source code

2. System Overview

Big data framework: Hadoop + Spark (Hive supported via custom modification). Development languages: Java + Python (both versions supported). Database: MySQL. Back-end frameworks: SpringBoot (Spring + SpringMVC + MyBatis) + Django (both versions supported). Front end: Vue + Echarts + HTML + CSS + JavaScript + jQuery.

The Car Insurance Data Visualization and Analysis System is an intelligent insurance-business analysis platform built on big data technology. It uses a distributed Hadoop + Spark architecture as its data-processing core, combined with a Django back end and a Vue + ElementUI + Echarts front-end stack, to store, process, and visualize large volumes of car insurance data efficiently. Insurance records are stored in HDFS, queried and analyzed at scale with Spark SQL, and post-processed with data-science libraries such as Pandas and NumPy for statistical computation. The platform integrates core modules for user management, insurance data management, customer profiling, financial benefit analysis, insurance product analysis, marketing analysis, and risk management, and presents the results on an intuitive visualization dashboard. It helps insurers understand customer behavior, refine product design, and improve risk identification, supporting the digital transformation of the insurance business and data-driven decision making.
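Before the full Spark implementation below, the customer-profiling idea can be illustrated with a small pandas sketch. The sample records and bucket labels here are mine for illustration; in the real system this data lives in HDFS and is aggregated with Spark SQL.

```python
import pandas as pd

# Hypothetical sample of joined customer/policy/claim records.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [22, 30, 41, 55, 67],
    "premium_amount": [1200.0, 1500.0, 1800.0, 2100.0, 900.0],
    "claim_amount": [0.0, 300.0, 0.0, 2500.0, 400.0],
})

# Same age buckets as the Spark job below: <25, <35, <45, <60, 60+.
bins = [0, 25, 35, 45, 60, 200]
labels = ["young", "young-adult", "middle-aged", "pre-senior", "senior"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Per-bucket head count, average premium, and total claims.
portrait = (
    df.groupby("age_group", observed=True)
      .agg(customer_count=("customer_id", "count"),
           avg_premium=("premium_amount", "mean"),
           total_claim=("claim_amount", "sum"))
      .reset_index()
)
print(portrait)
```

The Spark version expresses the same bucketing with chained `when(...)` conditions instead of `pd.cut`, but the resulting aggregation is equivalent.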

3. Video Walkthrough

Car Insurance Data Visualization and Analysis System (video walkthrough)

4. Feature Highlights

(Feature screenshots omitted.)

5. Selected Source Code


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum, avg, when, desc, asc, lag, datediff, current_date
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
import pandas as pd
import numpy as np
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
import json

spark = (SparkSession.builder
         .appName("InsuranceDataAnalysis")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

def customer_portrait_analysis(request):
    customer_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/customer_data.csv")
    policy_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/policy_data.csv")
    claim_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/claim_data.csv")
    customer_policy = customer_df.join(policy_df, "customer_id", "left")
    customer_claim = customer_policy.join(claim_df, "policy_id", "left")
    age_groups = customer_claim.withColumn("age_group", when(col("age") < 25, "young").when(col("age") < 35, "young-adult").when(col("age") < 45, "middle-aged").when(col("age") < 60, "pre-senior").otherwise("senior"))
    age_analysis = age_groups.groupBy("age_group").agg(count("customer_id").alias("customer_count"), avg("premium_amount").alias("avg_premium"), sum("claim_amount").alias("total_claim"))
    income_analysis = customer_claim.withColumn("income_level", when(col("annual_income") < 50000, "low income").when(col("annual_income") < 100000, "middle income").when(col("annual_income") < 200000, "high income").otherwise("very high income"))
    income_stats = income_analysis.groupBy("income_level").agg(count("customer_id").alias("customer_count"), avg("premium_amount").alias("avg_premium"), avg("claim_frequency").alias("avg_claim_freq"))
    vehicle_analysis = customer_claim.groupBy("vehicle_type", "vehicle_age").agg(count("policy_id").alias("policy_count"), avg("premium_amount").alias("avg_premium"), sum("claim_amount").alias("total_claim"))
    feature_cols = ["age", "annual_income", "premium_amount", "claim_frequency", "vehicle_age"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    feature_data = assembler.transform(customer_claim.na.drop())
    kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(feature_data)
    clustered_data = model.transform(feature_data)
    cluster_summary = clustered_data.groupBy("cluster").agg(count("customer_id").alias("cluster_size"), avg("age").alias("avg_age"), avg("annual_income").alias("avg_income"), avg("premium_amount").alias("avg_premium"))
    risk_score = customer_claim.withColumn("risk_score", (col("claim_frequency") * 0.4 + col("claim_amount") / col("premium_amount") * 0.6) * 100)
    high_risk_customers = risk_score.filter(col("risk_score") > 80).select("customer_id", "customer_name", "risk_score", "claim_frequency", "claim_amount").orderBy(desc("risk_score"))
    age_result = age_analysis.toPandas().to_dict('records')
    income_result = income_stats.toPandas().to_dict('records')
    vehicle_result = vehicle_analysis.toPandas().to_dict('records')
    cluster_result = cluster_summary.toPandas().to_dict('records')
    risk_result = high_risk_customers.limit(100).toPandas().to_dict('records')
    return JsonResponse({'age_analysis': age_result, 'income_analysis': income_result, 'vehicle_analysis': vehicle_result, 'customer_clusters': cluster_result, 'high_risk_customers': risk_result})

def financial_benefit_analysis(request):
    premium_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/premium_data.csv")
    claim_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/claim_data.csv")
    expense_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/expense_data.csv")
    monthly_premium = premium_df.groupBy("year", "month").agg(sum("premium_amount").alias("total_premium"), count("policy_id").alias("policy_count"))
    monthly_claim = claim_df.groupBy("year", "month").agg(sum("claim_amount").alias("total_claim"), count("claim_id").alias("claim_count"))
    monthly_expense = expense_df.groupBy("year", "month").agg(sum("expense_amount").alias("total_expense"))
    financial_data = monthly_premium.join(monthly_claim, ["year", "month"], "left").join(monthly_expense, ["year", "month"], "left")
    financial_data = financial_data.fillna(0)
    profit_analysis = financial_data.withColumn("gross_profit", col("total_premium") - col("total_claim")).withColumn("net_profit", col("total_premium") - col("total_claim") - col("total_expense")).withColumn("profit_margin", col("net_profit") / col("total_premium") * 100).withColumn("claim_ratio", col("total_claim") / col("total_premium") * 100)
    product_premium = premium_df.groupBy("product_type").agg(sum("premium_amount").alias("product_premium"), count("policy_id").alias("product_policies"))
    product_claim = claim_df.groupBy("product_type").agg(sum("claim_amount").alias("product_claim"), count("claim_id").alias("product_claims"))
    product_profit = product_premium.join(product_claim, "product_type", "left").fillna(0)
    product_profit = product_profit.withColumn("product_net_profit", col("product_premium") - col("product_claim")).withColumn("product_profit_margin", col("product_net_profit") / col("product_premium") * 100).withColumn("product_claim_ratio", col("product_claim") / col("product_premium") * 100)
    quarterly_data = financial_data.withColumn("quarter", when(col("month") <= 3, "Q1").when(col("month") <= 6, "Q2").when(col("month") <= 9, "Q3").otherwise("Q4"))
    quarterly_summary = quarterly_data.groupBy("year", "quarter").agg(sum("total_premium").alias("quarter_premium"), sum("total_claim").alias("quarter_claim"), sum("total_expense").alias("quarter_expense"), avg("profit_margin").alias("avg_profit_margin"))
    cost_structure = expense_df.groupBy("expense_category").agg(sum("expense_amount").alias("category_expense")).withColumn("expense_percentage", col("category_expense") / expense_df.agg(sum("expense_amount")).collect()[0][0] * 100)
    growth_rate = financial_data.orderBy("year", "month").withColumn("prev_premium", lag("total_premium", 1).over(Window.orderBy("year", "month"))).withColumn("premium_growth_rate", (col("total_premium") - col("prev_premium")) / col("prev_premium") * 100)
    roi_analysis = financial_data.withColumn("investment_return", col("net_profit") / col("total_expense") * 100).filter(col("total_expense") > 0)
    monthly_result = profit_analysis.orderBy("year", "month").toPandas().to_dict('records')
    product_result = product_profit.orderBy(desc("product_net_profit")).toPandas().to_dict('records')
    quarterly_result = quarterly_summary.orderBy("year", "quarter").toPandas().to_dict('records')
    cost_result = cost_structure.orderBy(desc("category_expense")).toPandas().to_dict('records')
    growth_result = growth_rate.filter(col("prev_premium").isNotNull()).toPandas().to_dict('records')
    roi_result = roi_analysis.toPandas().to_dict('records')
    return JsonResponse({'monthly_profit': monthly_result, 'product_profit': product_result, 'quarterly_summary': quarterly_result, 'cost_structure': cost_result, 'growth_analysis': growth_result, 'roi_analysis': roi_result})

def risk_management_analysis(request):
    claim_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/claim_data.csv")
    policy_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/policy_data.csv")
    customer_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/customer_data.csv")
    fraud_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/insurance/fraud_detection.csv")
    claim_frequency = claim_df.groupBy("policy_id").agg(count("claim_id").alias("claim_count"), sum("claim_amount").alias("total_claim_amount"), avg("claim_amount").alias("avg_claim_amount"))
    high_frequency_claims = claim_frequency.filter(col("claim_count") > 3).orderBy(desc("claim_count"))
    claim_type_analysis = claim_df.groupBy("claim_type").agg(count("claim_id").alias("type_count"), sum("claim_amount").alias("type_total_amount"), avg("claim_amount").alias("type_avg_amount")).withColumn("risk_level", when(col("type_avg_amount") > 50000, "high risk").when(col("type_avg_amount") > 20000, "medium risk").otherwise("low risk"))
    geographical_risk = claim_df.groupBy("region", "city").agg(count("claim_id").alias("region_claims"), sum("claim_amount").alias("region_amount"), avg("claim_amount").alias("region_avg")).withColumn("regional_risk_score", col("region_claims") * 0.3 + col("region_avg") / 10000 * 0.7)
    age_risk_analysis = claim_df.join(customer_df, "customer_id", "inner").withColumn("age_group", when(col("age") < 25, "young").when(col("age") < 35, "young-adult").when(col("age") < 45, "middle-aged").when(col("age") < 60, "pre-senior").otherwise("senior"))
    age_risk_stats = age_risk_analysis.groupBy("age_group").agg(count("claim_id").alias("age_claims"), avg("claim_amount").alias("age_avg_claim"), sum("claim_amount").alias("age_total_claim"))
    vehicle_risk = claim_df.join(policy_df, "policy_id", "inner").groupBy("vehicle_type", "vehicle_age").agg(count("claim_id").alias("vehicle_claims"), avg("claim_amount").alias("vehicle_avg_claim")).withColumn("vehicle_risk_index", col("vehicle_claims") * 0.4 + col("vehicle_avg_claim") / 10000 * 0.6)
    fraud_detection = fraud_df.filter(col("fraud_probability") > 0.7).join(claim_df, "claim_id", "inner")
    suspicious_patterns = fraud_detection.groupBy("fraud_reason").agg(count("claim_id").alias("pattern_count"), avg("claim_amount").alias("pattern_avg_amount"))
    seasonal_risk = claim_df.withColumn("season", when(col("month").isin([12, 1, 2]), "winter").when(col("month").isin([3, 4, 5]), "spring").when(col("month").isin([6, 7, 8]), "summer").otherwise("autumn"))
    seasonal_analysis = seasonal_risk.groupBy("season").agg(count("claim_id").alias("seasonal_claims"), avg("claim_amount").alias("seasonal_avg"), sum("claim_amount").alias("seasonal_total"))
    risk_threshold = claim_frequency.filter(col("total_claim_amount") > 100000).withColumn("risk_category", when(col("total_claim_amount") > 500000, "very high risk").when(col("total_claim_amount") > 200000, "high risk").otherwise("medium-high risk"))
    early_warning = claim_df.filter(col("claim_status") == "pending").join(policy_df, "policy_id", "inner").withColumn("days_since_claim", datediff(current_date(), col("claim_date"))).filter(col("days_since_claim") > 30)
    frequency_result = high_frequency_claims.limit(50).toPandas().to_dict('records')
    type_result = claim_type_analysis.orderBy(desc("type_avg_amount")).toPandas().to_dict('records')
    geo_result = geographical_risk.orderBy(desc("regional_risk_score")).toPandas().to_dict('records')
    age_result = age_risk_stats.orderBy(desc("age_avg_claim")).toPandas().to_dict('records')
    vehicle_result = vehicle_risk.orderBy(desc("vehicle_risk_index")).toPandas().to_dict('records')
    fraud_result = suspicious_patterns.orderBy(desc("pattern_count")).toPandas().to_dict('records')
    seasonal_result = seasonal_analysis.toPandas().to_dict('records')
    threshold_result = risk_threshold.orderBy(desc("total_claim_amount")).toPandas().to_dict('records')
    warning_result = early_warning.select("policy_id", "customer_id", "claim_amount", "days_since_claim").toPandas().to_dict('records')
    return JsonResponse({'high_frequency_claims': frequency_result, 'claim_type_risk': type_result, 'geographical_risk': geo_result, 'age_risk_analysis': age_result, 'vehicle_risk_analysis': vehicle_result, 'fraud_patterns': fraud_result, 'seasonal_risk': seasonal_result, 'high_risk_policies': threshold_result, 'early_warning': warning_result})
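The `risk_score` column computed in the customer-portrait view weights claim frequency at 40% and the loss ratio (claims paid over premium collected) at 60%, scaled by 100. A plain-Python sketch of the same formula, with illustrative sample values and a hypothetical helper name:

```python
def risk_score(claim_frequency: float, claim_amount: float, premium_amount: float) -> float:
    """Weighted risk score: 40% claim frequency, 60% loss ratio, scaled by 100."""
    loss_ratio = claim_amount / premium_amount
    return (claim_frequency * 0.4 + loss_ratio * 0.6) * 100

# A customer with 2 claims and a 0.5 loss ratio scores well above the
# high-risk threshold of 80 used in the Spark job.
score = risk_score(claim_frequency=2, claim_amount=1000.0, premium_amount=2000.0)
print(score)
```

Note that the score is unbounded above (a frequent claimant with a loss ratio over 1 can exceed 100), so the threshold of 80 used to flag high-risk customers is a tunable business parameter rather than a percentage.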

6. Documentation Sample

(Documentation screenshot omitted.)

7. The End

💕💕 To get the source code, contact 计算机编程果茶熊.