Big-Data Credit Card Transaction Fraud Analysis System | Is Hadoop + Spark Really That Hard? This Credit Card Fraud Analysis System Will Change How You See Big Data


💖💖 Author: 计算机毕业设计杰瑞 (Jerry, Computer Science Graduation Projects) 💙💙 About me: I taught computer science training courses for a long time and still love teaching. I work in Java, WeChat Mini Programs, Python, Golang, Android, and more, and have built projects in big data, deep learning, websites, mini programs, Android apps, and algorithms. I also take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know a few techniques for reducing plagiarism scores. I enjoy sharing solutions to problems I run into during development and talking tech in general, so feel free to ask me anything about code! 💛💛 A word of thanks: thank you all for your follows and support! 💜💜 Website projects | Android/mini-program projects | Big-data projects | Deep-learning projects | Graduation project topic recommendations

Introduction to the Big-Data Credit Card Transaction Fraud Analysis System

The Credit Card Transaction Fraud Analysis System is a financial risk-control analytics platform built on the Hadoop distributed storage architecture and the Spark big-data computing framework. The system uses Python as its main development language, Django as the back-end service framework, and a front end built with Vue + ElementUI, with ECharts providing the data visualizations. Its core functionality covers seven modules: transaction data management, overall-trend analysis, attribute correlation analysis, clustering-based behavior analysis, spatiotemporal feature analysis, composite amount analysis, and a visualization dashboard. Massive volumes of transaction data are stored in the HDFS distributed file system; Spark SQL handles data preprocessing and feature engineering, while Pandas and NumPy support deeper data mining and statistical analysis. The system models credit card transaction behavior along multiple dimensions and identifies anomalous transaction patterns, supporting risk-control decision-making at financial institutions. The architecture is clean and the functionality modular, balancing efficient big-data processing with a friendly front-end experience, making this an integrated big-data application that combines data storage, computation and analysis, and visualization.
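To make the "overall-trend analysis" idea concrete, here is a minimal pandas sketch of the same per-category fraud-rate aggregation the system runs at scale in Spark SQL. The records are invented toy data for illustration; only the column names (`merchant_category`, `amount`, `is_fraud`) mirror those used in the code listing below.

```python
import pandas as pd

# Invented toy transactions; real data would come from HDFS / MySQL.
df = pd.DataFrame({
    "merchant_category": ["grocery", "grocery", "online", "online", "online"],
    "amount":            [25.0,      40.0,      300.0,    15.0,     980.0],
    "is_fraud":          [0,         0,         1,        0,        1],
})

# Per-category totals, fraud counts, and average amounts.
stats = (
    df.groupby("merchant_category")
      .agg(total=("is_fraud", "size"),
           fraud_total=("is_fraud", "sum"),
           avg_amount=("amount", "mean"))
)
# Fraud rate as a percentage, rounded like the Spark SQL version.
stats["fraud_percentage"] = (stats["fraud_total"] * 100.0 / stats["total"]).round(2)
print(stats.sort_values("fraud_percentage", ascending=False))
```

The Spark SQL query in the listing below expresses exactly this `GROUP BY merchant_category` computation, just distributed across the cluster instead of in memory.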

Demo Video of the Big-Data Credit Card Transaction Fraud Analysis System

Demo video

Demo Screenshots of the Big-Data Credit Card Transaction Fraud Analysis System

[Demo screenshots]

Code Walkthrough of the Big-Data Credit Card Transaction Fraud Analysis System

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, countDistinct, sum as spark_sum, avg, stddev, when,
    hour, month, date_format, lag, unix_timestamp,
)
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from django.http import JsonResponse

# One SparkSession shared by all views, with adaptive query execution enabled.
spark = (
    SparkSession.builder
    .appName("CreditCardFraudAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

def transaction_data_analysis(request):
    # Load the transaction table from MySQL over JDBC.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/fraud_db")
          .option("dbtable", "transactions")
          .option("user", "root").option("password", "password").load())
    df.createOrReplaceTempView("transactions")
    # Overall volume and fraud rate (guard against an empty table).
    total_transactions = df.count()
    fraud_transactions = df.filter(col("is_fraud") == 1).count()
    fraud_rate = fraud_transactions / total_transactions * 100 if total_transactions else 0.0
    # Daily trend of total vs. fraudulent transactions.
    daily_stats = spark.sql("SELECT DATE(transaction_time) as date, COUNT(*) as daily_count, SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as fraud_count FROM transactions GROUP BY DATE(transaction_time) ORDER BY date")
    # Amount distribution statistics.
    amount_stats = df.select(avg("amount").alias("avg_amount"), stddev("amount").alias("std_amount"), spark_sum("amount").alias("total_amount")).collect()[0]
    # Fraud rates by merchant category, hour of day, and location; per-card usage profile.
    merchant_fraud = spark.sql("SELECT merchant_category, COUNT(*) as total, SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as fraud_total, ROUND(SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as fraud_percentage FROM transactions GROUP BY merchant_category ORDER BY fraud_percentage DESC")
    time_pattern = spark.sql("SELECT HOUR(transaction_time) as hour, COUNT(*) as transaction_count, SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as fraud_count FROM transactions GROUP BY HOUR(transaction_time) ORDER BY hour")
    geo_analysis = spark.sql("SELECT state, city, COUNT(*) as count, AVG(amount) as avg_amount, SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as fraud_cases FROM transactions GROUP BY state, city HAVING COUNT(*) > 100 ORDER BY fraud_cases DESC")
    card_analysis = spark.sql("SELECT cc_num, COUNT(*) as usage_count, COUNT(DISTINCT merchant_category) as merchant_variety, AVG(amount) as avg_spending, MAX(amount) as max_amount, SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as fraud_incidents FROM transactions GROUP BY cc_num HAVING COUNT(*) > 10")
    # Collect the aggregates and serialize them for the Vue/ECharts front end.
    daily_data = [{"date": row.date.strftime("%Y-%m-%d"), "total": row.daily_count, "fraud": row.fraud_count} for row in daily_stats.collect()]
    merchant_data = [{"category": row.merchant_category, "total": row.total, "fraud_rate": row.fraud_percentage} for row in merchant_fraud.collect()]
    time_data = [{"hour": row.hour, "total": row.transaction_count, "fraud": row.fraud_count} for row in time_pattern.collect()]
    return JsonResponse({"total_transactions": total_transactions, "fraud_rate": round(fraud_rate, 2), "avg_amount": round(float(amount_stats.avg_amount), 2), "daily_trends": daily_data, "merchant_analysis": merchant_data, "time_patterns": time_data})

def clustering_behavior_analysis(request):
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/fraud_db")
          .option("dbtable", "transactions")
          .option("user", "root").option("password", "password").load())
    # Per-card behavioral features; countDistinct measures how many different
    # merchant categories the card is used with.
    user_features = df.groupBy("cc_num").agg(
        count("*").alias("transaction_frequency"),
        avg("amount").alias("avg_amount"),
        stddev("amount").alias("amount_variance"),
        spark_sum("amount").alias("total_spending"),
        count(when(col("is_fraud") == 1, True)).alias("fraud_count"),
        countDistinct("merchant_category").alias("merchant_count"),
    ).fillna(0)
    # Assemble and standardize the features, then cluster cards with K-Means.
    feature_cols = ["transaction_frequency", "avg_amount", "amount_variance", "total_spending", "fraud_count", "merchant_count"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    feature_df = assembler.transform(user_features)
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True)
    scaled_df = scaler.fit(feature_df).transform(feature_df)
    kmeans = KMeans(k=5, featuresCol="scaled_features", predictionCol="cluster", maxIter=100, seed=42)
    kmeans_model = kmeans.fit(scaled_df)
    clustered_df = kmeans_model.transform(scaled_df)
    # Summarize each cluster and compute a simple weighted risk score per card.
    cluster_summary = clustered_df.groupBy("cluster").agg(count("*").alias("user_count"), avg("transaction_frequency").alias("avg_frequency"), avg("avg_amount").alias("cluster_avg_amount"), avg("fraud_count").alias("avg_fraud_incidents"))
    risk_assessment = clustered_df.withColumn("risk_score", col("fraud_count") * 0.4 + col("amount_variance") * 0.3 + col("transaction_frequency") * 0.3)
    high_risk_users = risk_assessment.filter(col("risk_score") > 10).select("cc_num", "cluster", "risk_score", "fraud_count", "total_spending")
    cluster_centers = kmeans_model.clusterCenters()
    cluster_data = []
    for row in cluster_summary.collect():
        center_info = {"center": cluster_centers[row.cluster].tolist(), "user_count": row.user_count, "avg_frequency": round(float(row.avg_frequency), 2), "avg_amount": round(float(row.cluster_avg_amount), 2), "avg_fraud": round(float(row.avg_fraud_incidents), 2)}
        cluster_data.append({"cluster_id": row.cluster, "stats": center_info})
    high_risk_data = [{"card_number": row.cc_num, "cluster": row.cluster, "risk_score": round(float(row.risk_score), 2), "fraud_incidents": row.fraud_count} for row in high_risk_users.collect()]
    return JsonResponse({"cluster_analysis": cluster_data, "high_risk_users": high_risk_data, "total_clusters": len(cluster_data)})

def spatiotemporal_analysis(request):
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/fraud_db")
          .option("dbtable", "transactions")
          .option("user", "root").option("password", "password").load())
    df.createOrReplaceTempView("transactions")  # required by the SQL query below
    # Derive time features from the timestamp column.
    df = (df.withColumn("hour", hour(col("transaction_time")))
            .withColumn("day_of_week", date_format(col("transaction_time"), "E"))
            .withColumn("month", month(col("transaction_time"))))
    temporal_patterns = df.groupBy("hour", "day_of_week").agg(count("*").alias("transaction_count"), count(when(col("is_fraud") == 1, True)).alias("fraud_count"), avg("amount").alias("avg_transaction_amount"))
    spatial_patterns = df.groupBy("state", "city").agg(count("*").alias("location_transactions"), count(when(col("is_fraud") == 1, True)).alias("location_fraud"), avg("amount").alias("location_avg_amount"))
    # Cards used across more than three states.
    cross_state_analysis = spark.sql("SELECT cc_num, COUNT(DISTINCT state) as state_count, COUNT(DISTINCT city) as city_count, COUNT(*) as total_transactions FROM transactions GROUP BY cc_num HAVING COUNT(DISTINCT state) > 3")
    # "Impossible travel": consecutive transactions on one card in different states less than an hour apart.
    velocity_window = Window.partitionBy("cc_num").orderBy("transaction_time")
    velocity_df = (df.select("cc_num", "transaction_time", "state", "city", "amount")
                     .withColumn("prev_time", lag("transaction_time").over(velocity_window))
                     .withColumn("prev_state", lag("state").over(velocity_window)))
    suspicious_velocity = velocity_df.filter((col("state") != col("prev_state")) & (unix_timestamp("transaction_time") - unix_timestamp("prev_time") < 3600))
    # Rank locations and hours by fraud rate; monthly trend for the dashboard.
    fraud_hotspots = spatial_patterns.withColumn("fraud_rate", col("location_fraud") * 100.0 / col("location_transactions")).filter(col("location_transactions") > 50).orderBy(col("fraud_rate").desc())
    peak_hours = temporal_patterns.withColumn("fraud_rate", col("fraud_count") * 100.0 / col("transaction_count")).orderBy(col("fraud_rate").desc())
    monthly_trends = df.groupBy("month").agg(count("*").alias("monthly_transactions"), count(when(col("is_fraud") == 1, True)).alias("monthly_fraud"), avg("amount").alias("monthly_avg_amount"))
    temporal_data = [{"hour": row.hour, "day": row.day_of_week, "transactions": row.transaction_count, "fraud_rate": round(row.fraud_count * 100.0 / row.transaction_count, 2)} for row in peak_hours.collect()]
    spatial_data = [{"state": row.state, "city": row.city, "transactions": row.location_transactions, "fraud_rate": round(float(row.fraud_rate), 2)} for row in fraud_hotspots.collect()]
    velocity_data = [{"card": row.cc_num, "state_changes": row.state_count, "city_changes": row.city_count} for row in cross_state_analysis.collect()]
    return JsonResponse({"temporal_analysis": temporal_data, "spatial_hotspots": spatial_data, "velocity_patterns": velocity_data, "monthly_trends": [{"month": row.month, "transactions": row.monthly_transactions, "fraud": row.monthly_fraud} for row in monthly_trends.collect()]})
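The Spark window-and-lag velocity rule in `spatiotemporal_analysis` is easy to prototype locally before running it on the cluster. Here is a hedged pandas sketch of the same logic on invented data, assuming the same `cc_num` / `transaction_time` / `state` columns: `shift()` within each card's group plays the role of `lag(...).over(window)`.

```python
import pandas as pd

# Invented toy transactions, sorted the way the Spark window expects.
tx = pd.DataFrame({
    "cc_num": [1, 1, 1, 2, 2],
    "transaction_time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:30", "2024-01-02 09:00",
        "2024-01-01 08:00", "2024-01-01 12:00"]),
    "state": ["NY", "CA", "CA", "TX", "TX"],
}).sort_values(["cc_num", "transaction_time"])

# Previous state and time gap per card: the pandas analogue of lag() over
# Window.partitionBy("cc_num").orderBy("transaction_time").
grp = tx.groupby("cc_num")
tx["prev_state"] = grp["state"].shift()
tx["gap_s"] = (tx["transaction_time"] - grp["transaction_time"].shift()).dt.total_seconds()

# Flag a state change within one hour of the card's previous transaction.
suspicious = tx[(tx["state"] != tx["prev_state"]) & (tx["gap_s"] < 3600)]
print(suspicious[["cc_num", "transaction_time", "prev_state", "state"]])
```

On this toy data only card 1's NY-to-CA hop 30 minutes later is flagged; the first transaction of each card is skipped automatically because its time gap is NaN.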

Documentation of the Big-Data Credit Card Transaction Fraud Analysis System

[Documentation screenshot]
