[Big Data] Douban Books Data Analysis and Visualization System | Computer Science Graduation Project | Hadoop + Spark Environment Setup | Data Science and Big Data Technology | Source Code + Documentation + Walkthrough Included


一、About the Author

💖💖Author: 计算机编程果茶熊 💙💙About me: I spent years in professional computer science training and worked as a programming instructor; I love teaching and specialize in Java, WeChat Mini Programs, Python, Golang, Android, and several other IT areas. I take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know a few techniques for reducing similarity-check scores. I enjoy sharing solutions to problems I run into during development and talking shop, so feel free to ask me anything about code! 💛💛A word of thanks: I appreciate everyone's attention and support! 💜💜 Web projects | Android/Mini Program projects | Big data projects | Graduation project topic selection 💕💕To get the source code, contact 计算机编程果茶熊 at the end of the article

二、System Overview

Big data framework: Hadoop + Spark (Hive can be customized on request)
Development language: Java + Python (both versions supported)
Database: MySQL
Backend frameworks: SpringBoot (Spring + SpringMVC + MyBatis) and Django (both versions supported)
Frontend: Vue + Echarts + HTML + CSS + JavaScript + jQuery

The Douban Books Data Analysis and Visualization System is an intelligent data analysis platform built on a big data technology stack. It uses the Hadoop + Spark distributed computing framework as its core processing engine, with a Django backend and a Vue + ElementUI + Echarts frontend for full-stack development. The system applies Spark SQL together with data science tools such as Pandas and NumPy to mine and analyze large volumes of Douban Books data, covering core functional modules including user management, Douban Books data management, author-dimension analysis, book-feature analysis, content-value analysis, and publisher-dimension analysis. Large datasets are stored on the HDFS distributed file system, Spark's in-memory computing enables efficient processing, and Echarts visualization components drive an intuitive data dashboard. The system quantifies the book market along multiple dimensions, providing data support for readers choosing books, publishers making decisions, and authors planning their writing, and implements a complete analysis pipeline from data collection and storage through processing to visualization.
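The analysis-to-dashboard flow described above can be sketched end to end. The sketch below substitutes pandas for Spark so it runs standalone; the sample rows and the `author`/`rating` column names are illustrative assumptions, not the project's actual schema:

```python
import json

import pandas as pd

# Illustrative sample standing in for rows read from the books table.
books = pd.DataFrame({
    "author": ["A", "A", "B", "B", "B"],
    "rating": [8.0, 9.0, 7.0, 7.5, 8.5],
})

# Aggregate per author, mirroring the Spark groupBy/agg step.
stats = (
    books.groupby("author")
    .agg(book_count=("rating", "count"), avg_rating=("rating", "mean"))
    .reset_index()
)

# Serialize to the record-list shape an Echarts chart consumes on the frontend.
payload = json.dumps({"status": "success", "data": stats.to_dict("records")})
print(payload)
```

In the real system the same record-list JSON is produced by `toPandas().to_dict('records')` inside a Django view and bound to an Echarts `series` on the Vue side.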

三、Video Walkthrough

Douban Books Data Analysis and Visualization System (video)

四、Feature Showcase

[Screenshots of the system's functional modules]

五、Code Excerpts


from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, avg, count, desc, when, split, explode, trim, length,
    log10, coalesce, lit,
    max as spark_max, min as spark_min,  # aliased to avoid shadowing Python built-ins
)
from django.http import JsonResponse

spark = (
    SparkSession.builder
    .appName("DoubanBookAnalysis")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)

def author_dimension_analysis(request):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/douban_books").option("dbtable", "books").option("user", "root").option("password", "password").load()
    author_stats = df.groupBy("author").agg(count("*").alias("book_count"), avg("rating").alias("avg_rating"), avg("rating_count").alias("avg_rating_count")).filter(col("book_count") >= 3)
    author_productivity = author_stats.select("author", "book_count").orderBy(desc("book_count")).limit(20)
    author_quality = author_stats.filter(col("avg_rating_count") >= 100).select("author", "avg_rating").orderBy(desc("avg_rating")).limit(20)
    genre_analysis = df.select("author", explode(split(col("genres"), ",")).alias("genre")).groupBy("author", "genre").count().withColumnRenamed("count", "genre_count")
    author_genre_diversity = genre_analysis.groupBy("author").agg(count("genre").alias("genre_diversity"), avg("genre_count").alias("avg_genre_books"))
    comprehensive_author_ranking = author_stats.join(author_genre_diversity, "author", "left").withColumn("comprehensive_score", col("avg_rating") * 0.4 + col("book_count") * 0.3 + col("genre_diversity") * 0.2 + col("avg_rating_count") * 0.1 / 1000).orderBy(desc("comprehensive_score"))
    yearly_publication = df.select("author", "publish_year").groupBy("author", "publish_year").count().withColumnRenamed("count", "yearly_books")
    from pyspark.sql.functions import max as spark_max, min as spark_min  # aliased: Python's built-in max/min cannot operate on columns
    author_career_span = yearly_publication.groupBy("author").agg((spark_max("publish_year") - spark_min("publish_year")).alias("career_span"), avg("yearly_books").alias("avg_yearly_output"))
    rating_distribution = df.select("author", when(col("rating") >= 8.5, "excellent").when(col("rating") >= 7.5, "good").when(col("rating") >= 6.5, "average").otherwise("below_average").alias("rating_category")).groupBy("author", "rating_category").count()
    author_rating_profile = rating_distribution.groupBy("author").pivot("rating_category").sum("count").fillna(0)
    collaboration_analysis = df.select("author").filter(col("author").contains("、") | col("author").contains("/")).withColumn("author_list", split(col("author"), "[、/]")).select(explode(col("author_list")).alias("individual_author")).groupBy("individual_author").count().withColumnRenamed("count", "collaboration_count")
    final_author_analysis = comprehensive_author_ranking.join(author_career_span, "author", "left").join(author_rating_profile, "author", "left").join(collaboration_analysis, comprehensive_author_ranking.author == collaboration_analysis.individual_author, "left")
    result_data = final_author_analysis.select("author", "book_count", "avg_rating", "genre_diversity", "comprehensive_score", "career_span", "avg_yearly_output").orderBy(desc("comprehensive_score")).limit(50).toPandas().to_dict('records')
    return JsonResponse({"status": "success", "data": result_data})

def book_feature_analysis(request):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/douban_books").option("dbtable", "books").option("user", "root").option("password", "password").load()
    page_analysis = df.select("pages", "rating", "rating_count").filter(col("pages").isNotNull() & (col("pages") > 0))
    page_ranges = page_analysis.withColumn("page_range", when(col("pages") <= 100, "ultra_short").when(col("pages") <= 200, "short").when(col("pages") <= 400, "medium").when(col("pages") <= 600, "long").otherwise("ultra_long"))
    page_rating_stats = page_ranges.groupBy("page_range").agg(avg("rating").alias("avg_rating"), count("*").alias("book_count"), avg("rating_count").alias("avg_rating_count")).orderBy("avg_rating")
    price_analysis = df.select("price", "rating", "pages", "publish_year").filter(col("price").isNotNull() & (col("price") > 0))
    price_ranges = price_analysis.withColumn("price_range", when(col("price") <= 20, "budget").when(col("price") <= 50, "moderate").when(col("price") <= 100, "premium").otherwise("luxury"))
    price_rating_correlation = price_ranges.groupBy("price_range").agg(avg("rating").alias("avg_rating"), count("*").alias("book_count"), avg("pages").alias("avg_pages"))
    genre_feature_analysis = df.select("genres", "rating", "pages", "price", "rating_count").filter(col("genres").isNotNull())
    genre_expanded = genre_feature_analysis.select(explode(split(col("genres"), ",")).alias("genre"), col("rating"), col("pages"), col("price"), col("rating_count"))
    genre_characteristics = genre_expanded.groupBy("genre").agg(avg("rating").alias("avg_rating"), avg("pages").alias("avg_pages"), avg("price").alias("avg_price"), count("*").alias("book_count"), avg("rating_count").alias("avg_popularity"))
    title_length_analysis = df.select("title", "rating", "rating_count").withColumn("title_length", length(trim(col("title"))))
    title_length_ranges = title_length_analysis.withColumn("title_length_range", when(col("title_length") <= 5, "very_short").when(col("title_length") <= 10, "short").when(col("title_length") <= 20, "medium").otherwise("long"))
    title_impact = title_length_ranges.groupBy("title_length_range").agg(avg("rating").alias("avg_rating"), avg("rating_count").alias("avg_popularity"), count("*").alias("book_count"))
    publication_trend = df.select("publish_year", "rating", "pages", "price").filter(col("publish_year").isNotNull() & (col("publish_year") >= 2000))
    yearly_features = publication_trend.groupBy("publish_year").agg(avg("rating").alias("avg_rating"), avg("pages").alias("avg_pages"), avg("price").alias("avg_price"), count("*").alias("publication_count")).orderBy("publish_year")
    rating_distribution = df.select("rating").filter(col("rating").isNotNull()).withColumn("rating_bucket", when(col("rating") >= 9.0, "masterpiece").when(col("rating") >= 8.0, "excellent").when(col("rating") >= 7.0, "good").when(col("rating") >= 6.0, "average").otherwise("poor"))
    rating_bucket_stats = rating_distribution.groupBy("rating_bucket").count().withColumnRenamed("count", "book_count")
    comprehensive_features = page_rating_stats.select(col("page_range").alias("feature_range"), "avg_rating", "book_count").union(price_rating_correlation.select(col("price_range").alias("feature_range"), "avg_rating", "book_count")).union(title_impact.select(col("title_length_range").alias("feature_range"), "avg_rating", "book_count"))  # union requires matching schemas, so each branch is projected to the same three columns
    result_data = {"page_analysis": page_rating_stats.toPandas().to_dict('records'), "price_analysis": price_rating_correlation.toPandas().to_dict('records'), "genre_analysis": genre_characteristics.orderBy(desc("avg_rating")).limit(20).toPandas().to_dict('records'), "title_analysis": title_impact.toPandas().to_dict('records'), "yearly_trends": yearly_features.toPandas().to_dict('records'), "rating_distribution": rating_bucket_stats.toPandas().to_dict('records')}
    return JsonResponse({"status": "success", "data": result_data})

def content_value_analysis(request):
    df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/douban_books").option("dbtable", "books").option("user", "root").option("password", "password").load()
    reviews_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/douban_books").option("dbtable", "reviews").option("user", "root").option("password", "password").load()
    content_quality_metrics = df.select("id", "title", "rating", "rating_count", "pages", "summary").filter(col("rating_count") >= 50)
    summary_analysis = content_quality_metrics.filter(col("summary").isNotNull()).withColumn("summary_length", length(col("summary"))).withColumn("summary_quality", when(col("summary_length") >= 200, "detailed").when(col("summary_length") >= 100, "moderate").otherwise("brief"))
    summary_impact = summary_analysis.groupBy("summary_quality").agg(avg("rating").alias("avg_rating"), avg("rating_count").alias("avg_engagement"), count("*").alias("book_count"))
    value_score_calculation = content_quality_metrics.withColumn("popularity_score", log10(col("rating_count") + 1)).withColumn("quality_score", col("rating")).withColumn("engagement_ratio", col("rating_count") / (col("pages") + 1)).withColumn("content_value_score", col("quality_score") * 0.5 + col("popularity_score") * 0.3 + col("engagement_ratio") * 0.2)
    high_value_books = value_score_calculation.filter(col("content_value_score") >= 7.0).orderBy(desc("content_value_score"))
    genre_value_analysis = df.select("genres", "rating", "rating_count", "pages").filter(col("genres").isNotNull() & (col("rating_count") >= 30))
    genre_value_expanded = genre_value_analysis.select(explode(split(col("genres"), ",")).alias("genre"), col("rating"), col("rating_count"), col("pages"))
    genre_value_metrics = genre_value_expanded.withColumn("value_density", col("rating") * log10(col("rating_count") + 1) / (col("pages") + 1)).groupBy("genre").agg(avg("value_density").alias("avg_value_density"), avg("rating").alias("avg_rating"), avg("rating_count").alias("avg_popularity"), count("*").alias("book_count"))
    content_longevity = df.select("publish_year", "rating", "rating_count").filter(col("publish_year").isNotNull() & (col("rating_count") >= 100))
    book_age = content_longevity.withColumn("book_age", 2024 - col("publish_year")).withColumn("age_group", when(col("book_age") <= 5, "recent").when(col("book_age") <= 15, "modern").when(col("book_age") <= 30, "classic").otherwise("vintage"))
    longevity_analysis = book_age.groupBy("age_group").agg(avg("rating").alias("avg_rating"), avg("rating_count").alias("avg_current_popularity"), count("*").alias("book_count"))
    review_sentiment_joined = content_quality_metrics.join(reviews_df, content_quality_metrics.id == reviews_df.book_id, "left")
    review_engagement = review_sentiment_joined.groupBy("id", "title", "rating").agg(count("*").alias("review_count"), avg("helpful_count").alias("avg_helpfulness"))
    content_discussion_value = review_engagement.withColumn("discussion_score", col("review_count") * col("avg_helpfulness")).filter(col("discussion_score").isNotNull())
    comprehensive_content_ranking = value_score_calculation.join(content_discussion_value.select("id", "discussion_score"), "id", "left").withColumn("final_content_score", col("content_value_score") + coalesce(col("discussion_score") / 1000, lit(0))).orderBy(desc("final_content_score"))
    reading_difficulty_analysis = df.select("pages", "rating", "rating_count", "genres").filter(col("pages").isNotNull() & (col("pages") > 0))
    difficulty_assessment = reading_difficulty_analysis.withColumn("reading_intensity", col("pages") / (col("rating_count") + 1)).withColumn("difficulty_level", when(col("reading_intensity") > 1.0, "challenging").when(col("reading_intensity") > 0.5, "moderate").otherwise("accessible"))
    difficulty_preference = difficulty_assessment.groupBy("difficulty_level").agg(avg("rating").alias("avg_rating"), count("*").alias("book_count"), avg("rating_count").alias("avg_popularity"))
    result_data = {"high_value_books": high_value_books.select("title", "rating", "rating_count", "content_value_score").limit(30).toPandas().to_dict('records'), "genre_value_ranking": genre_value_metrics.orderBy(desc("avg_value_density")).limit(20).toPandas().to_dict('records'), "content_longevity": longevity_analysis.toPandas().to_dict('records'), "summary_impact": summary_impact.toPandas().to_dict('records'), "difficulty_analysis": difficulty_preference.toPandas().to_dict('records'), "comprehensive_ranking": comprehensive_content_ranking.select("title", "rating", "content_value_score", "final_content_score").limit(50).toPandas().to_dict('records')}
    return JsonResponse({"status": "success", "data": result_data})
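For completeness, the three view functions above would be exposed to the Vue frontend through Django's URL dispatcher. A minimal sketch, assuming a Django app whose `views.py` holds the functions shown above (the app name and URL paths are hypothetical):

```python
# analysis/urls.py -- hypothetical module path; adjust to your project layout.
from django.urls import path

from . import views  # views.py contains the analysis functions shown above

urlpatterns = [
    path("api/author-analysis/", views.author_dimension_analysis),
    path("api/book-features/", views.book_feature_analysis),
    path("api/content-value/", views.content_value_analysis),
]
```

The Echarts dashboard then fetches each endpoint and binds the returned `data` records to its chart series.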

六、Documentation Samples

[Sample page from the project documentation]

七、END

💕💕To get the source code, contact 计算机编程果茶熊