[Big Data] Douban Movie User Behavior and Market Trend Analysis System | Computer Science Graduation Project | Hadoop+Spark Environment Setup | Data Science and Big Data Technology | Source Code, Documentation, and Walkthrough Included


Preface

💖💖Author: Programmer Xiaoyang 💙💙About me: I work in a computer-related field and am comfortable in Java, WeChat Mini Programs, Python, Golang, Android, and several other IT areas. I take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know some techniques for reducing plagiarism-check scores. I love technology, enjoy exploring new tools and frameworks, and like solving real problems with code; feel free to ask me anything about code! 💛💛A word of thanks: thank you all for your attention and support! 💕💕To get the source code, contact Programmer Xiaoyang at the end of this post 💜💜 Website projects | Android/Mini Program projects | Big data projects | Deep learning projects | Graduation project topic selection 💜💜

I. Development Tools

Big data framework: Hadoop + Spark (Hive is not used in this build; customization supported)
Development language: Python + Java (both versions supported)
Backend framework: Django + Spring Boot (Spring + SpringMVC + MyBatis) (both versions supported)
Frontend: Vue + ElementUI + Echarts + HTML + CSS + JavaScript + jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL

II. System Overview

The Douban Movie User Behavior and Market Trend Analysis System is a movie-industry analytics platform built on big data technology. It uses the Hadoop+Spark stack as its foundation, Python with the Django framework on the backend, and Vue+ElementUI+Echarts on the frontend for data visualization. Large volumes of Douban movie data are stored in HDFS, processed and analyzed with Spark SQL, and mined further with data-science libraries such as Pandas and NumPy.

Core features include user management, Douban movie data management, comment sentiment analysis, market heat analysis, basic movie feature analysis, quality and market performance analysis, user clustering analysis, user rating behavior analysis, and a visualization dashboard. The system dissects movie market dynamics from multiple angles and digs into users' viewing preferences and behavior patterns, giving producers, distributors, and cinema operators data-driven decision support for precision marketing and content optimization.

III. Feature Demo

Douban Movie User Behavior and Market Trend Analysis System

IV. System Interface

(System interface screenshots)

V. Source Code


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc, when, regexp_replace, split, explode
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from snownlp import SnowNLP
import pandas as pd
import numpy as np
from django.http import JsonResponse
from django.views import View
import json

# Shared Spark session with adaptive query execution enabled
spark = (
    SparkSession.builder
    .appName("DoubanMovieAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

class SentimentAnalysisView(View):
    def post(self, request):
        # cast to int so the value interpolated into the SQL string cannot inject SQL
        movie_id = int(json.loads(request.body).get('movie_id'))
        comments_df = spark.sql(f"SELECT comment_text, user_id, rating FROM movie_comments WHERE movie_id = {movie_id}")
        comments_pd = comments_df.toPandas()
        sentiment_scores = []
        emotion_distribution = {'positive': 0, 'neutral': 0, 'negative': 0}
        for comment in comments_pd['comment_text'].dropna():  # skip null comments before cleaning
            cleaned_comment = self.clean_text(comment)
            if len(cleaned_comment) > 5:
                snow = SnowNLP(cleaned_comment)
                sentiment_score = snow.sentiments
                sentiment_scores.append(sentiment_score)
                if sentiment_score > 0.6:
                    emotion_distribution['positive'] += 1
                elif sentiment_score < 0.4:
                    emotion_distribution['negative'] += 1
                else:
                    emotion_distribution['neutral'] += 1
        avg_sentiment = np.mean(sentiment_scores) if sentiment_scores else 0.5
        total_comments = len(sentiment_scores)
        positive_ratio = emotion_distribution['positive'] / total_comments if total_comments > 0 else 0
        negative_ratio = emotion_distribution['negative'] / total_comments if total_comments > 0 else 0
        rating_sentiment_correlation = self.calculate_rating_sentiment_correlation(comments_pd, sentiment_scores)
        sentiment_trend = self.analyze_sentiment_trend_over_time(movie_id)
        keyword_sentiment = self.extract_keyword_sentiment(comments_pd, sentiment_scores)
        return JsonResponse({
            'avg_sentiment': round(avg_sentiment, 3),
            'emotion_distribution': emotion_distribution,
            'positive_ratio': round(positive_ratio, 3),
            'negative_ratio': round(negative_ratio, 3),
            'rating_correlation': round(rating_sentiment_correlation, 3),
            'sentiment_trend': sentiment_trend,
            'keyword_sentiment': keyword_sentiment,
            'total_analyzed': total_comments
        })
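The `clean_text` helper the sentiment view calls is not shown in the listing. A minimal sketch of what such a cleaner might look like; the regex patterns and the character whitelist here are illustrative assumptions, not the project's actual implementation:

```python
import re

def clean_text(comment):
    """Strip links, HTML tags, and symbol noise from a raw Douban comment."""
    if not isinstance(comment, str):
        return ""
    comment = re.sub(r"https?://\S+", "", comment)   # drop URLs
    comment = re.sub(r"<[^>]+>", "", comment)        # drop inline HTML tags
    # keep CJK characters, letters, digits, and common punctuation
    comment = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。！？、,.!? ]", "", comment)
    return comment.strip()
```

SnowNLP then scores the cleaned string; very short results are skipped, which matches the `len(cleaned_comment) > 5` guard in the view above.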

class MarketHeatAnalysisView(View):
    def post(self, request):
        # default to the last 30 days; cast to int before interpolating into the SQL string
        time_range = int(json.loads(request.body).get('time_range', 30))
        movies_df = spark.sql(f"SELECT movie_id, movie_name, release_date, rating, comment_count, view_count FROM movies WHERE release_date >= date_sub(current_date(), {time_range})")
        heat_scores = []
        for row in movies_df.collect():  # one time window is small enough to collect to the driver
            comment_weight = min(row['comment_count'] / 1000, 10)
            view_weight = min(row['view_count'] / 10000, 10)
            rating_weight = row['rating'] / 2 if row['rating'] else 2.5
            days_since_release = self.calculate_days_since_release(row['release_date'])
            time_decay = max(0.1, 1 - (days_since_release / 365))
            social_buzz = self.calculate_social_buzz(row['movie_id'])
            search_index = self.get_search_trend(row['movie_name'])
            heat_score = (comment_weight * 0.25 + view_weight * 0.25 + rating_weight * 0.2 + social_buzz * 0.15 + search_index * 0.15) * time_decay
            heat_scores.append({
                'movie_id': row['movie_id'],
                'movie_name': row['movie_name'],
                'heat_score': round(heat_score, 2),
                'comment_count': row['comment_count'],
                'view_count': row['view_count'],
                'rating': row['rating']
            })
        sorted_heat = sorted(heat_scores, key=lambda x: x['heat_score'], reverse=True)
        trending_genres = self.analyze_trending_genres(time_range)
        market_growth_rate = self.calculate_market_growth_rate(time_range)
        return JsonResponse({
            'top_hot_movies': sorted_heat[:20],
            'trending_genres': trending_genres,
            'market_growth_rate': market_growth_rate,
            'analysis_period': time_range
        })
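The time-decay factor and the weighted heat score computed inside the loop above can be factored into small pure functions, which also makes the weighting easy to unit-test. A sketch, assuming `release_date` arrives as a `datetime.date` (in the listing it goes through a hypothetical `calculate_days_since_release` helper):

```python
from datetime import date

def time_decay(release_date, today=None):
    """Linear decay over one year, floored at 0.1 so older titles keep a residual score."""
    days = ((today or date.today()) - release_date).days
    return max(0.1, 1 - days / 365)

def heat_score(comment_count, view_count, rating, social_buzz, search_index, decay):
    """Same weighting as the view above: engagement 50%, rating 20%, buzz/search 30%."""
    comment_weight = min(comment_count / 1000, 10)   # capped so one viral title can't dominate
    view_weight = min(view_count / 10000, 10)
    rating_weight = rating / 2 if rating else 2.5    # neutral default when rating is missing
    return (comment_weight * 0.25 + view_weight * 0.25 + rating_weight * 0.2
            + social_buzz * 0.15 + search_index * 0.15) * decay
```

The 0.1 floor in `time_decay` keeps long-running classics from dropping to zero, while the per-term caps stop a single outlier metric from swamping the composite score.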

class UserClusteringAnalysisView(View):
    def post(self, request):
        cluster_params = json.loads(request.body)
        k_clusters = cluster_params.get('k_clusters', 5)
        users_df = spark.sql("SELECT user_id, avg_rating, rating_count, comment_count, favorite_genres, activity_level FROM user_behavior_summary")
        feature_df = users_df.select(
            col('user_id'),
            col('avg_rating').cast('double'),
            col('rating_count').cast('double'),
            col('comment_count').cast('double'),
            col('activity_level').cast('double')
        ).filter(col('rating_count') > 5)
        assembler = VectorAssembler(inputCols=['avg_rating', 'rating_count', 'comment_count', 'activity_level'], outputCol='features')
        feature_vector_df = assembler.transform(feature_df)
        scaler = StandardScaler(inputCol='features', outputCol='scaled_features', withStd=True, withMean=True)
        scaler_model = scaler.fit(feature_vector_df)
        scaled_df = scaler_model.transform(feature_vector_df)
        kmeans = KMeans(k=k_clusters, seed=42, featuresCol='scaled_features', predictionCol='cluster')
        kmeans_model = kmeans.fit(scaled_df)
        clustered_df = kmeans_model.transform(scaled_df)
        cluster_stats = []
        for i in range(k_clusters):
            cluster_data = clustered_df.filter(col('cluster') == i)
            cluster_size = cluster_data.count()
            # one aggregation job per cluster instead of three separate collect() calls
            stats = cluster_data.agg(
                avg(col('avg_rating')).alias('r'),
                avg(col('activity_level')).alias('a'),
                avg(col('comment_count')).alias('c')
            ).collect()[0]
            avg_rating_cluster, avg_activity, avg_comments = stats['r'], stats['a'], stats['c']
            cluster_profile = self.generate_cluster_profile(avg_rating_cluster, avg_activity, avg_comments)
            cluster_stats.append({
                'cluster_id': i,
                'size': cluster_size,
                'avg_rating': round(avg_rating_cluster, 2) if avg_rating_cluster else 0,
                'avg_activity': round(avg_activity, 2) if avg_activity else 0,
                'avg_comments': round(avg_comments, 2) if avg_comments else 0,
                'profile': cluster_profile
            })
        user_distribution = self.analyze_user_distribution_by_cluster(clustered_df, k_clusters)
        behavioral_patterns = self.extract_behavioral_patterns(clustered_df)
        return JsonResponse({
            'cluster_statistics': cluster_stats,
            'user_distribution': user_distribution,
            'behavioral_patterns': behavioral_patterns,
            'total_users_analyzed': feature_df.count(),
            'clustering_quality': self.evaluate_clustering_quality(kmeans_model, scaled_df)
        })
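`generate_cluster_profile` is another helper the clustering view relies on but does not show. One possible sketch that maps the three cluster averages to a readable label; the thresholds and label strings are illustrative assumptions:

```python
def generate_cluster_profile(avg_rating, avg_activity, avg_comments):
    """Turn cluster-level averages into a short descriptive label for the dashboard."""
    parts = []
    # `or 0` guards against None averages from empty clusters
    parts.append("generous raters" if (avg_rating or 0) >= 4.0 else "critical raters")
    parts.append("highly active" if (avg_activity or 0) >= 0.7 else "casually active")
    parts.append("frequent commenters" if (avg_comments or 0) >= 20 else "quiet viewers")
    return ", ".join(parts)
```

Keeping the labeling rule-based and separate from the KMeans pipeline means the thresholds can be tuned without refitting the model.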

VI. Documentation

(Documentation screenshot)

Closing

💕💕To get the source code, contact Programmer Xiaoyang