[Big Data] A Visual Analytics System for Musicians' Social-Media Fan Data — Computer Science Graduation Project — Hadoop+Spark Environment Setup — Data Science and Big Data Technology — Source Code, Documentation, and Walkthrough Included


Preface

💖💖Author: 计算机程序员小杨
💙💙About me: I work in computer-related fields and am experienced in Java, WeChat Mini Programs, Python, Golang, Android, and several other IT directions. I take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know a few techniques for reducing plagiarism-check scores. I love technology, enjoy exploring new tools and frameworks, and like solving real problems with code. Feel free to ask me about anything code-related!
💛💛A word of thanks: thank you all for your attention and support!
💕💕Contact 计算机程序员小杨 at the end of this article to get the source code.
💜💜 Web projects · Android/Mini-Program projects · Big-data projects · Deep-learning projects · Graduation-project topic selection 💜💜

1. Development Tools

Big-data framework: Hadoop + Spark (Hive not used in this build; customization supported)
Languages: Python + Java (both versions supported)
Backend frameworks: Django / Spring Boot (Spring + SpringMVC + MyBatis) (both versions supported)
Frontend: Vue + ElementUI + Echarts + HTML + CSS + JavaScript + jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL

2. System Overview

This system is a visual analytics platform for musicians' social-media fan data. It uses the Hadoop+Spark distributed computing stack as its underlying big-data engine, with a Django backend and a Vue + ElementUI + Echarts frontend. Massive fan datasets are stored in HDFS and cleaned and analyzed with Spark SQL and Pandas, powering core modules for user management, fan-data management, music popularity analysis, fan geography analysis, fan-portrait analysis, and fan clustering. A visualization dashboard built on Echarts presents multi-dimensional views such as fan age distribution, gender ratio, regional heat maps, and play-count trends for music works, helping musicians and their teams understand their fan base and how their music spreads. Structured data is stored in MySQL, NumPy handles numerical computation, and Spark's distributed computing processes large-scale fan-behavior data, giving musicians data-driven support for operational decisions.
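The fan-portrait and clustering analyses both rest on a weighted interaction score: likes, favorites, comments, and shares are weighted by how much effort each interaction implies. Here is a minimal plain-Python sketch of that idea; the function name `engagement_score` and the `ENGAGEMENT_WEIGHTS` dict are illustrative names of mine, though the weights (1/2/3/5) mirror those used in the source code and should be treated as tunable assumptions.

```python
# Hypothetical standalone sketch of the weighted engagement score.
# Weights mirror the source code (like=1, favorite=2, comment=3, share=5);
# they are assumptions to tune, not fixed constants.
ENGAGEMENT_WEIGHTS = {"like": 1, "favorite": 2, "comment": 3, "share": 5}

def engagement_score(interactions):
    """interactions: list of interaction-type strings for one fan."""
    return sum(ENGAGEMENT_WEIGHTS.get(t, 0) for t in interactions)

# One like, one comment and one share -> 1 + 3 + 5 = 9
score = engagement_score(["like", "comment", "share"])
```

Unknown interaction types score zero, so new event types can be logged before a weight is chosen for them.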

3. Feature Demo

Demo: visual analytics system for musicians' social-media fan data

4. Interface Screenshots

(Interface screenshots omitted.)

5. Source Code Highlights


from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, avg, sum, desc, when, lit, round,  # note: sum/round shadow the Python builtins
    row_number, datediff, hour,
)
from pyspark.sql.window import Window
import pandas as pd
import numpy as np
from django.http import JsonResponse
from django.views.decorators.http import require_http_methods
from datetime import datetime, timedelta
import json

spark = (
    SparkSession.builder
    .appName("MusicFansAnalysis")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)
@require_http_methods(["POST"])
def analyze_fans_portrait(request):
    data = json.loads(request.body)
    musician_id = data.get('musician_id')
    time_range = data.get('time_range', 30)
    fans_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fans_data", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    interaction_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fan_interactions", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    filtered_fans = fans_df.filter(col("musician_id") == musician_id)
    end_date = datetime.now()
    start_date = end_date - timedelta(days=time_range)
    filtered_interactions = interaction_df.filter((col("musician_id") == musician_id) & (col("interaction_time") >= start_date) & (col("interaction_time") <= end_date))
    total_fan_count = filtered_fans.count()  # compute once instead of triggering a count job per column
    age_distribution = filtered_fans.groupBy("age_group").agg(count("fan_id").alias("fan_count")).withColumn("percentage", round(col("fan_count") / total_fan_count * 100, 2)).orderBy("age_group")
    gender_distribution = filtered_fans.groupBy("gender").agg(count("fan_id").alias("fan_count")).withColumn("percentage", round(col("fan_count") / total_fan_count * 100, 2))
    joined_df = filtered_interactions.join(filtered_fans, "fan_id", "inner")
    activity_scores = joined_df.groupBy("fan_id", "age_group", "gender", "province").agg(count("interaction_id").alias("interaction_count"), sum(when(col("interaction_type") == "comment", 3).when(col("interaction_type") == "share", 5).when(col("interaction_type") == "like", 1).otherwise(0)).alias("activity_score"))
    avg_activity_by_age = activity_scores.groupBy("age_group").agg(avg("activity_score").alias("avg_activity"), count("fan_id").alias("active_fans")).orderBy(desc("avg_activity"))
    avg_activity_by_gender = activity_scores.groupBy("gender").agg(avg("activity_score").alias("avg_activity"), count("fan_id").alias("active_fans"))
    consumption_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fan_consumption", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    consumption_data = consumption_df.filter((col("musician_id") == musician_id) & (col("consumption_time") >= start_date) & (col("consumption_time") <= end_date))
    consumption_joined = consumption_data.join(filtered_fans, "fan_id", "inner")
    consumption_by_age = consumption_joined.groupBy("age_group").agg(sum("amount").alias("total_consumption"), avg("amount").alias("avg_consumption"), count("fan_id").alias("paying_fans")).orderBy(desc("total_consumption"))
    high_value_fans = activity_scores.join(consumption_joined.groupBy("fan_id").agg(sum("amount").alias("total_spent")), "fan_id", "left").fillna(0, subset=["total_spent"]).withColumn("fan_value_score", col("activity_score") * 0.4 + col("total_spent") * 0.6).orderBy(desc("fan_value_score")).limit(100)
    device_distribution = filtered_fans.groupBy("device_type").agg(count("fan_id").alias("device_count")).withColumn("percentage", round(col("device_count") / filtered_fans.count() * 100, 2))
    interest_tags = joined_df.filter(col("interaction_type").isin(["like", "share", "comment"])).join(spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="music_works", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"}), "work_id", "inner").groupBy("music_genre").agg(count("interaction_id").alias("interaction_count")).withColumn("preference_score", round(col("interaction_count") / joined_df.count() * 100, 2)).orderBy(desc("preference_score"))
    age_dist_result = [{"age_group": row["age_group"], "count": row["fan_count"], "percentage": float(row["percentage"])} for row in age_distribution.collect()]
    gender_dist_result = [{"gender": row["gender"], "count": row["fan_count"], "percentage": float(row["percentage"])} for row in gender_distribution.collect()]
    activity_age_result = [{"age_group": row["age_group"], "avg_activity": float(row["avg_activity"]), "active_fans": row["active_fans"]} for row in avg_activity_by_age.collect()]
    consumption_age_result = [{"age_group": row["age_group"], "total_consumption": float(row["total_consumption"]), "avg_consumption": float(row["avg_consumption"]), "paying_fans": row["paying_fans"]} for row in consumption_by_age.collect()]
    device_result = [{"device_type": row["device_type"], "count": row["device_count"], "percentage": float(row["percentage"])} for row in device_distribution.collect()]
    interest_result = [{"genre": row["music_genre"], "interaction_count": row["interaction_count"], "preference_score": float(row["preference_score"])} for row in interest_tags.collect()]
    total_fans = filtered_fans.count()
    active_fans_count = activity_scores.filter(col("activity_score") > 10).count()
    avg_fan_age = filtered_fans.agg(avg("age").alias("avg_age")).collect()[0]["avg_age"]
    return JsonResponse({"status": "success", "data": {"age_distribution": age_dist_result, "gender_distribution": gender_dist_result, "activity_by_age": activity_age_result, "consumption_by_age": consumption_age_result, "device_distribution": device_result, "interest_tags": interest_result, "summary": {"total_fans": total_fans, "active_fans": active_fans_count, "avg_age": float(avg_fan_age) if avg_fan_age else 0}}})
@require_http_methods(["POST"])
def analyze_music_popularity(request):
    data = json.loads(request.body)
    musician_id = data.get('musician_id')
    time_range = data.get('time_range', 90)
    works_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="music_works", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    plays_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="music_plays", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    interaction_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fan_interactions", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    filtered_works = works_df.filter(col("musician_id") == musician_id)
    end_date = datetime.now()
    start_date = end_date - timedelta(days=time_range)
    filtered_plays = plays_df.filter((col("play_time") >= start_date) & (col("play_time") <= end_date))
    filtered_interactions = interaction_df.filter((col("interaction_time") >= start_date) & (col("interaction_time") <= end_date))
    plays_with_works = filtered_plays.join(filtered_works, "work_id", "inner")
    play_stats = plays_with_works.groupBy("work_id", "work_name", "music_genre", "release_date").agg(count("play_id").alias("total_plays"), avg("play_duration").alias("avg_play_duration"), sum(when(col("play_duration") >= col("work_duration") * 0.8, 1).otherwise(0)).alias("complete_plays"))
    interaction_stats = filtered_interactions.join(filtered_works, "work_id", "inner").groupBy("work_id").agg(sum(when(col("interaction_type") == "like", 1).otherwise(0)).alias("likes"), sum(when(col("interaction_type") == "comment", 1).otherwise(0)).alias("comments"), sum(when(col("interaction_type") == "share", 1).otherwise(0)).alias("shares"), sum(when(col("interaction_type") == "favorite", 1).otherwise(0)).alias("favorites"))
    popularity_metrics = play_stats.join(interaction_stats, "work_id", "left").fillna(0, subset=["likes", "comments", "shares", "favorites"])
    popularity_metrics = popularity_metrics.withColumn("completion_rate", round(col("complete_plays") / col("total_plays") * 100, 2)).withColumn("engagement_rate", round((col("likes") + col("comments") * 3 + col("shares") * 5 + col("favorites") * 2) / col("total_plays") * 100, 2)).withColumn("popularity_score", col("total_plays") * 0.3 + col("complete_plays") * 0.2 + col("likes") * 0.1 + col("comments") * 0.15 + col("shares") * 0.15 + col("favorites") * 0.1)
    top_works = popularity_metrics.orderBy(desc("popularity_score")).limit(10)
    genre_performance = popularity_metrics.groupBy("music_genre").agg(sum("total_plays").alias("genre_total_plays"), avg("completion_rate").alias("avg_completion_rate"), avg("engagement_rate").alias("avg_engagement_rate"), count("work_id").alias("work_count")).orderBy(desc("genre_total_plays"))
    plays_with_works_date = plays_with_works.withColumn("play_date", col("play_time").cast("date"))
    daily_trends = plays_with_works_date.groupBy("play_date").agg(count("play_id").alias("daily_plays"), sum(when(col("play_duration") >= col("work_duration") * 0.8, 1).otherwise(0)).alias("daily_complete_plays")).orderBy("play_date")
    daily_interaction_trends = filtered_interactions.join(filtered_works, "work_id", "inner").withColumn("interaction_date", col("interaction_time").cast("date")).groupBy("interaction_date").agg(sum(when(col("interaction_type") == "like", 1).otherwise(0)).alias("daily_likes"), sum(when(col("interaction_type") == "comment", 1).otherwise(0)).alias("daily_comments"), sum(when(col("interaction_type") == "share", 1).otherwise(0)).alias("daily_shares")).orderBy("interaction_date")
    trends_combined = daily_trends.join(daily_interaction_trends, daily_trends["play_date"] == daily_interaction_trends["interaction_date"], "left").fillna(0, subset=["daily_likes", "daily_comments", "daily_shares"]).select("play_date", "daily_plays", "daily_complete_plays", "daily_likes", "daily_comments", "daily_shares")
    new_works = filtered_works.filter(col("release_date") >= start_date)
    new_work_ids = [row["work_id"] for row in new_works.select("work_id").collect()]
    if len(new_work_ids) > 0:
        from pyspark.sql.functions import datediff
        # timestamp-minus-date arithmetic cannot be cast to int; datediff yields days directly
        new_works_performance = popularity_metrics.filter(col("work_id").isin(new_work_ids)).withColumn("days_since_release", datediff(lit(end_date), col("release_date")))
        new_works_growth = new_works_performance.withColumn("daily_avg_plays", col("total_plays") / col("days_since_release")).orderBy(desc("daily_avg_plays"))
        new_works_result = [{"work_id": row["work_id"], "work_name": row["work_name"], "total_plays": row["total_plays"], "daily_avg_plays": float(row["daily_avg_plays"]), "popularity_score": float(row["popularity_score"])} for row in new_works_growth.collect()]
    else:
        new_works_result = []
    from pyspark.sql.functions import hour
    # hour() is clearer and safer than substr() on a stringified timestamp
    peak_hours = plays_with_works.withColumn("play_hour", hour(col("play_time"))).groupBy("play_hour").agg(count("play_id").alias("hourly_plays")).orderBy(desc("hourly_plays")).limit(5)
    top_works_result = [{"work_id": row["work_id"], "work_name": row["work_name"], "genre": row["music_genre"], "total_plays": row["total_plays"], "completion_rate": float(row["completion_rate"]), "engagement_rate": float(row["engagement_rate"]), "likes": row["likes"], "comments": row["comments"], "shares": row["shares"], "popularity_score": float(row["popularity_score"])} for row in top_works.collect()]
    genre_result = [{"genre": row["music_genre"], "total_plays": row["genre_total_plays"], "avg_completion_rate": float(row["avg_completion_rate"]), "avg_engagement_rate": float(row["avg_engagement_rate"]), "work_count": row["work_count"]} for row in genre_performance.collect()]
    trends_result = [{"date": str(row["play_date"]), "plays": row["daily_plays"], "complete_plays": row["daily_complete_plays"], "likes": row["daily_likes"], "comments": row["daily_comments"], "shares": row["daily_shares"]} for row in trends_combined.collect()]
    peak_hours_result = [{"hour": row["play_hour"], "plays": row["hourly_plays"]} for row in peak_hours.collect()]
    total_plays = plays_with_works.count()
    total_works = filtered_works.count()
    avg_plays_per_work = (total_plays / total_works) if total_works > 0 else 0
    # round() here would resolve to pyspark's round (it shadows the builtin), so format in Python instead
    return JsonResponse({"status": "success", "data": {"top_works": top_works_result, "genre_performance": genre_result, "daily_trends": trends_result, "new_works_performance": new_works_result, "peak_hours": peak_hours_result, "summary": {"total_plays": total_plays, "total_works": total_works, "avg_plays_per_work": float(f"{avg_plays_per_work:.2f}")}}})
@require_http_methods(["POST"])
def analyze_fans_clustering(request):
    data = json.loads(request.body)
    musician_id = data.get('musician_id')
    time_range = data.get('time_range', 60)
    fans_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fans_data", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    interaction_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fan_interactions", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    plays_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="music_plays", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    consumption_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="fan_consumption", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    works_df = spark.read.jdbc(url="jdbc:mysql://localhost:3306/music_fans_db", table="music_works", properties={"user": "root", "password": "password", "driver": "com.mysql.cj.jdbc.Driver"})
    filtered_fans = fans_df.filter(col("musician_id") == musician_id)
    end_date = datetime.now()
    start_date = end_date - timedelta(days=time_range)
    filtered_works = works_df.filter(col("musician_id") == musician_id)
    work_ids = [row["work_id"] for row in filtered_works.select("work_id").collect()]
    filtered_plays = plays_df.filter((col("work_id").isin(work_ids)) & (col("play_time") >= start_date) & (col("play_time") <= end_date))
    filtered_interactions = interaction_df.filter((col("musician_id") == musician_id) & (col("interaction_time") >= start_date) & (col("interaction_time") <= end_date))
    filtered_consumption = consumption_df.filter((col("musician_id") == musician_id) & (col("consumption_time") >= start_date) & (col("consumption_time") <= end_date))
    play_metrics = filtered_plays.groupBy("fan_id").agg(count("play_id").alias("play_count"), sum("play_duration").alias("total_play_duration"), avg("play_duration").alias("avg_play_duration"))
    interaction_metrics = filtered_interactions.groupBy("fan_id").agg(sum(when(col("interaction_type") == "like", 1).otherwise(0)).alias("like_count"), sum(when(col("interaction_type") == "comment", 1).otherwise(0)).alias("comment_count"), sum(when(col("interaction_type") == "share", 1).otherwise(0)).alias("share_count"), sum(when(col("interaction_type") == "favorite", 1).otherwise(0)).alias("favorite_count"))
    consumption_metrics = filtered_consumption.groupBy("fan_id").agg(sum("amount").alias("total_consumption"), count("consumption_id").alias("consumption_frequency"), avg("amount").alias("avg_consumption"))
    fan_features = filtered_fans.join(play_metrics, "fan_id", "left").join(interaction_metrics, "fan_id", "left").join(consumption_metrics, "fan_id", "left").fillna(0, subset=["play_count", "total_play_duration", "avg_play_duration", "like_count", "comment_count", "share_count", "favorite_count", "total_consumption", "consumption_frequency", "avg_consumption"])
    fan_features = fan_features.withColumn("engagement_score", col("like_count") * 1 + col("comment_count") * 3 + col("share_count") * 5 + col("favorite_count") * 2).withColumn("activity_level", col("play_count") * 0.5 + col("engagement_score") * 0.5).withColumn("consumption_level", col("total_consumption"))
    fan_features_pd = fan_features.select("fan_id", "activity_level", "consumption_level", "play_count", "engagement_score").toPandas()
    activity_percentile_33 = np.percentile(fan_features_pd['activity_level'], 33)
    activity_percentile_67 = np.percentile(fan_features_pd['activity_level'], 67)
    consumption_percentile_33 = np.percentile(fan_features_pd['consumption_level'], 33)
    consumption_percentile_67 = np.percentile(fan_features_pd['consumption_level'], 67)
    def assign_cluster(row):
        activity = row['activity_level']
        consumption = row['consumption_level']
        if activity >= activity_percentile_67 and consumption >= consumption_percentile_67:
            return "核心粉丝"
        elif activity >= activity_percentile_67 and consumption < consumption_percentile_67:
            return "活跃粉丝"
        elif activity < activity_percentile_67 and consumption >= consumption_percentile_67:
            return "付费粉丝"
        elif activity >= activity_percentile_33 and consumption >= consumption_percentile_33:
            return "普通粉丝"
        else:
            return "潜在粉丝"
    fan_features_pd['cluster'] = fan_features_pd.apply(assign_cluster, axis=1)
    fan_clusters_spark = spark.createDataFrame(fan_features_pd)
    cluster_summary = fan_clusters_spark.groupBy("cluster").agg(count("fan_id").alias("fan_count"), avg("activity_level").alias("avg_activity"), avg("consumption_level").alias("avg_consumption"), avg("play_count").alias("avg_plays"), avg("engagement_score").alias("avg_engagement"))
    cluster_demographics = fan_clusters_spark.join(filtered_fans, "fan_id", "inner").groupBy("cluster", "age_group").agg(count("fan_id").alias("count")).orderBy("cluster", desc("count"))
    cluster_gender = fan_clusters_spark.join(filtered_fans, "fan_id", "inner").groupBy("cluster", "gender").agg(count("fan_id").alias("count"))
    cluster_province = fan_clusters_spark.join(filtered_fans, "fan_id", "inner").groupBy("cluster", "province").agg(count("fan_id").alias("count")).orderBy("cluster", desc("count"))
    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window
    top_fans_per_cluster = fan_clusters_spark.withColumn("rank", row_number().over(Window.partitionBy("cluster").orderBy(desc("activity_level")))).filter(col("rank") <= 20).select("fan_id", "cluster", "activity_level", "consumption_level", "play_count", "engagement_score")
    cluster_genre_preference = filtered_plays.join(filtered_works.select("work_id", "music_genre"), "work_id", "inner").join(fan_clusters_spark.select("fan_id", "cluster"), "fan_id", "inner").groupBy("cluster", "music_genre").agg(count("play_id").alias("play_count")).orderBy("cluster", desc("play_count"))
    cluster_summary_result = [{"cluster": row["cluster"], "fan_count": row["fan_count"], "avg_activity": float(row["avg_activity"]), "avg_consumption": float(row["avg_consumption"]), "avg_plays": float(row["avg_plays"]), "avg_engagement": float(row["avg_engagement"])} for row in cluster_summary.collect()]
    cluster_demographics_result = {}
    for row in cluster_demographics.collect():
        cluster = row["cluster"]
        if cluster not in cluster_demographics_result:
            cluster_demographics_result[cluster] = []
        cluster_demographics_result[cluster].append({"age_group": row["age_group"], "count": row["count"]})
    cluster_province_result = {}
    for row in cluster_province.collect():
        cluster = row["cluster"]
        if cluster not in cluster_province_result:
            cluster_province_result[cluster] = []
        cluster_province_result[cluster].append({"province": row["province"], "count": row["count"]})
    cluster_genre_result = {}
    for row in cluster_genre_preference.collect():
        cluster = row["cluster"]
        if cluster not in cluster_genre_result:
            cluster_genre_result[cluster] = []
        cluster_genre_result[cluster].append({"genre": row["music_genre"], "play_count": row["play_count"]})
    total_fans = filtered_fans.count()
    return JsonResponse({"status": "success", "data": {"cluster_summary": cluster_summary_result, "cluster_demographics": cluster_demographics_result, "cluster_provinces": cluster_province_result, "cluster_genre_preference": cluster_genre_result, "total_fans": total_fans}})
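The percentile-based segmentation inside `analyze_fans_clustering` can be checked in isolation without Spark or a database. Below is a minimal plain-Python sketch of the same thresholding logic: fans are bucketed by where their activity and consumption fall relative to the 33rd/67th percentiles of the whole fan base. The `percentile` helper, the English labels, and the function names are illustrative choices of mine, not part of the project code.

```python
# Hypothetical standalone sketch of the 33rd/67th-percentile fan segmentation.
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy (simple approximation of np.percentile)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, int(round(p / 100 * (len(s) - 1)))))
    return s[k]

def segment_fans(fans):
    """fans: list of (fan_id, activity, consumption) tuples -> {fan_id: label}."""
    acts = [a for _, a, _ in fans]
    cons = [c for _, _, c in fans]
    a33, a67 = percentile(acts, 33), percentile(acts, 67)
    c33, c67 = percentile(cons, 33), percentile(cons, 67)
    labels = {}
    for fan_id, a, c in fans:
        if a >= a67 and c >= c67:
            labels[fan_id] = "core"        # 核心粉丝
        elif a >= a67:
            labels[fan_id] = "active"      # 活跃粉丝
        elif c >= c67:
            labels[fan_id] = "paying"      # 付费粉丝
        elif a >= a33 and c >= c33:
            labels[fan_id] = "regular"     # 普通粉丝
        else:
            labels[fan_id] = "potential"   # 潜在粉丝
    return labels
```

Because the thresholds are relative percentiles rather than fixed cutoffs, the same rules adapt automatically as the fan base grows or shrinks.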

6. System Documentation

(Documentation screenshot omitted.)

Conclusion

💕💕Contact 计算机程序员小杨 to get the source code.