Big-Data-Based Fortune Global 500 Enterprise Data Analysis System | Finish a Fortune Global 500 Data Analysis System in 7 Days: A Complete Big-Data Graduation-Project Guide from HDFS to Spark SQL


💖💖Author: 计算机毕业设计江挽 💙💙About me: I have long worked in computer-science training and teaching, which I genuinely enjoy. I work in Java, WeChat Mini Programs, Python, Golang, and Android, and my projects cover big data, deep learning, websites, mini programs, Android apps, and algorithms. I regularly take on custom project development, code walkthroughs, thesis-defense coaching, and document writing, and I also know some techniques for reducing plagiarism-check similarity. I like sharing solutions to problems I hit during development and talking shop about technology, so feel free to ask me about anything code-related! 💛💛A word of thanks: thank you all for following and supporting me! 💜💜 Website practical projects · Android/mini-program practical projects · Big-data practical projects · Deep-learning practical projects

Introduction to the Big-Data-Based Fortune Global 500 Enterprise Data Analysis System

The Fortune Global 500 enterprise data analysis system is an enterprise-information analysis platform built on a big-data technology stack. It uses Hadoop + Spark as its core processing engine, stores large volumes of enterprise data in the HDFS distributed file system, and runs efficient queries and analyses with Spark SQL. The system supports two development stacks, Python + Django and Java + Spring Boot; the front end is built with Vue + ElementUI and uses the ECharts charting library for rich data visualization. Its core modules cover basic information management for Fortune Global 500 enterprises, multi-dimensional analysis of enterprise scale, geographic-distribution statistics, screening of special enterprise groups, in-depth industry-distribution analysis, and a comprehensive data dashboard. Pandas and NumPy handle data preprocessing and statistical computation, and MySQL stores the structured data. The system mines the diverse data of the world's 500 largest enterprises from multiple angles, giving users a clear view of enterprise development trends and industry-distribution patterns, and implements a complete big-data pipeline from data collection through storage and processing to visualization.
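To make that data path concrete, here is a minimal, self-contained sketch of the core flow described above: loading enterprise records from HDFS into Spark and aggregating them with Spark SQL. The HDFS path, file format, and column names are illustrative assumptions, not the project's actual data layout.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Top500Sketch").getOrCreate()

# Hypothetical HDFS location and columns; the real system defines its own schema.
df = spark.read.csv("hdfs://namenode:9000/top500/enterprises.csv",
                    header=True, inferSchema=True)

# Expose the DataFrame to Spark SQL as a temporary view.
df.createOrReplaceTempView("enterprises")

# Example query: enterprise count and total revenue per country.
spark.sql("""
    SELECT country, COUNT(*) AS enterprise_count, SUM(revenue) AS total_revenue
    FROM enterprises
    GROUP BY country
    ORDER BY enterprise_count DESC
""").show()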

Demo Video of the Big-Data-Based Fortune Global 500 Enterprise Data Analysis System

Demo video

Screenshots of the Big-Data-Based Fortune Global 500 Enterprise Data Analysis System

[System screenshots]

Code from the Big-Data-Based Fortune Global 500 Enterprise Data Analysis System

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum, avg, desc
import numpy as np
from django.http import JsonResponse

# Shared SparkSession for all analysis views. Reading MySQL over JDBC
# requires the MySQL Connector/J jar on the Spark classpath.
spark = SparkSession.builder.appName("WorldTop500Analysis").getOrCreate()

def enterprise_scale_analysis(request):
    try:
        # Load the enterprise table from MySQL via JDBC; cache it because the
        # bucketed aggregations below scan it repeatedly.
        df = spark.read.format("jdbc") \
            .option("url", "jdbc:mysql://localhost:3306/top500") \
            .option("dbtable", "enterprises") \
            .option("user", "root") \
            .option("password", "password") \
            .load().cache()
        total_count = df.count()
        # Revenue buckets; labels assume the revenue column is stored in millions.
        revenue_ranges = [(0, 50000), (50000, 100000), (100000, 200000), (200000, 500000), (500000, float('inf'))]
        range_labels = ['Small enterprises (under 50B)', 'Medium enterprises (50B-100B)', 'Large enterprises (100B-200B)', 'Very large enterprises (200B-500B)', 'Mega enterprises (500B+)']
        scale_data = []
        for i, (min_val, max_val) in enumerate(revenue_ranges):
            if max_val == float('inf'):
                filtered_df = df.filter(col("revenue") >= min_val)
            else:
                filtered_df = df.filter((col("revenue") >= min_val) & (col("revenue") < max_val))
            count_result = filtered_df.count()
            # Compute both averages in a single aggregation (one Spark job instead of two).
            stats = filtered_df.agg(avg("revenue").alias("avg_revenue"),
                                    avg("employees").alias("avg_employees")).collect()[0]
            avg_revenue = stats["avg_revenue"]
            avg_employees = stats["avg_employees"]
            scale_data.append({
                'range': range_labels[i],
                'count': count_result,
                'avg_revenue': float(avg_revenue) if avg_revenue else 0,
                'avg_employees': int(avg_employees) if avg_employees else 0,
                'percentage': round((count_result / total_count) * 100, 2)
            })
        # Headcount buckets for the same enterprises.
        employee_ranges = [(0, 10000), (10000, 50000), (50000, 100000), (100000, 300000), (300000, float('inf'))]
        employee_labels = ['Small (under 10k)', 'Medium (10k-50k)', 'Large (50k-100k)', 'Very large (100k-300k)', 'Huge (300k+)']
        employee_data = []
        for i, (min_val, max_val) in enumerate(employee_ranges):
            if max_val == float('inf'):
                filtered_df = df.filter(col("employees") >= min_val)
            else:
                filtered_df = df.filter((col("employees") >= min_val) & (col("employees") < max_val))
            count_result = filtered_df.count()
            avg_revenue = filtered_df.agg(avg("revenue").alias("avg_revenue")).collect()[0]["avg_revenue"]
            employee_data.append({
                'range': employee_labels[i],
                'count': count_result,
                'avg_revenue': float(avg_revenue) if avg_revenue else 0,
                'percentage': round((count_result / total_count) * 100, 2)
            })
        # Pearson correlation between revenue and headcount; drop rows with
        # missing values so np.corrcoef does not return NaN.
        correlation_data = df.select("revenue", "employees").dropna().toPandas()
        correlation_coefficient = np.corrcoef(correlation_data['revenue'], correlation_data['employees'])[0, 1]
        result = {
            'revenue_analysis': scale_data,
            'employee_analysis': employee_data,
            'correlation': float(correlation_coefficient),
            'total_enterprises': total_count
        }
        return JsonResponse(result)
    except Exception as e:
        return JsonResponse({'error': str(e)}, status=500)

def geographical_distribution_analysis(request):
    try:
        # Same JDBC source as above; cached because several aggregations follow.
        df = spark.read.format("jdbc") \
            .option("url", "jdbc:mysql://localhost:3306/top500") \
            .option("dbtable", "enterprises") \
            .option("user", "root") \
            .option("password", "password") \
            .load().cache()
        total_count = df.count()
        country_stats = df.groupBy("country").agg(
            count("*").alias("enterprise_count"),
            sum("revenue").alias("total_revenue"),
            avg("revenue").alias("avg_revenue"),
            sum("employees").alias("total_employees"),
            avg("employees").alias("avg_employees")
        ).orderBy(desc("enterprise_count"))
        country_rows = country_stats.collect()  # collect once, reuse for the continent rollup
        country_data = []
        for row in country_rows:
            country_data.append({
                'country': row['country'],
                'count': row['enterprise_count'],
                'total_revenue': float(row['total_revenue']),
                'avg_revenue': float(row['avg_revenue']),
                'total_employees': int(row['total_employees']),
                'avg_employees': int(row['avg_employees']),
                'percentage': round((row['enterprise_count'] / total_count) * 100, 2)
            })
        # Partial country-to-continent map; unmapped countries fall into 'Other'.
        continent_mapping = {
            'United States': 'North America', 'China': 'Asia', 'Japan': 'Asia',
            'Germany': 'Europe', 'France': 'Europe', 'United Kingdom': 'Europe',
            'South Korea': 'Asia', 'Canada': 'North America', 'India': 'Asia',
            'Netherlands': 'Europe', 'Switzerland': 'Europe', 'Italy': 'Europe'
        }
        continent_stats = {}
        for row in country_rows:
            continent = continent_mapping.get(row['country'], 'Other')
            if continent not in continent_stats:
                continent_stats[continent] = {
                    'count': 0, 'total_revenue': 0, 'total_employees': 0
                }
            continent_stats[continent]['count'] += row['enterprise_count']
            continent_stats[continent]['total_revenue'] += row['total_revenue']
            continent_stats[continent]['total_employees'] += row['total_employees']
        continent_data = []
        for continent, stats in continent_stats.items():
            continent_data.append({
                'continent': continent,
                'count': stats['count'],
                'total_revenue': float(stats['total_revenue']),
                'avg_revenue': float(stats['total_revenue'] / stats['count']),
                'total_employees': int(stats['total_employees']),
                'percentage': round((stats['count'] / total_count) * 100, 2)
            })
        top_cities = df.groupBy("city", "country").agg(
            count("*").alias("enterprise_count"),
            sum("revenue").alias("total_revenue")
        ).orderBy(desc("enterprise_count")).limit(20)
        city_data = []
        for row in top_cities.collect():
            city_data.append({
                'city': row['city'],
                'country': row['country'],
                'count': row['enterprise_count'],
                'total_revenue': float(row['total_revenue']),
                'avg_revenue': float(row['total_revenue'] / row['enterprise_count'])
            })
        result = {
            'country_analysis': country_data,
            'continent_analysis': continent_data,
            'city_analysis': city_data,
            'total_countries': len(country_data),
            'total_continents': len(continent_data)
        }
        return JsonResponse(result)
    except Exception as e:
        return JsonResponse({'error': str(e)}, status=500)

def industry_distribution_analysis(request):
    try:
        df = spark.read.format("jdbc") \
            .option("url", "jdbc:mysql://localhost:3306/top500") \
            .option("dbtable", "enterprises") \
            .option("user", "root") \
            .option("password", "password") \
            .load().cache()
        industry_stats = df.groupBy("industry").agg(
            count("*").alias("enterprise_count"),
            sum("revenue").alias("total_revenue"),
            avg("revenue").alias("avg_revenue"),
            sum("employees").alias("total_employees"),
            avg("employees").alias("avg_employees")
        ).orderBy(desc("total_revenue"))
        industry_data = []
        total_enterprises = df.count()
        total_global_revenue = df.agg(sum("revenue").alias("total")).collect()[0]["total"]
        industry_rows = industry_stats.collect()  # collect once, reused for the category rollup below
        for row in industry_rows:
            industry_data.append({
                'industry': row['industry'],
                'count': row['enterprise_count'],
                'total_revenue': float(row['total_revenue']),
                'avg_revenue': float(row['avg_revenue']),
                'total_employees': int(row['total_employees']),
                'avg_employees': int(row['avg_employees']),
                'market_share': round((row['total_revenue'] / total_global_revenue) * 100, 2),
                'enterprise_percentage': round((row['enterprise_count'] / total_enterprises) * 100, 2)
            })
        # Roll individual industries up into broad categories by substring match.
        industry_categories = {
            'Technology': ['Information Technology', 'Telecommunications', 'Electronics'],
            'Energy': ['Oil & Gas', 'Utilities', 'Energy', 'Mining'],
            'Finance': ['Banking', 'Insurance', 'Financial Services'],
            'Manufacturing': ['Automotive', 'Aerospace', 'Industrial Manufacturing'],
            'Consumer': ['Retail', 'Consumer Goods', 'Food & Beverage'],
            'Healthcare': ['Pharmaceuticals', 'Healthcare', 'Medical Devices']
        }
        category_stats = {}
        for industry_row in industry_rows:
            industry_name = industry_row['industry']
            category_found = False
            for category, industries in industry_categories.items():
                if any(keyword in industry_name for keyword in industries):
                    if category not in category_stats:
                        category_stats[category] = {
                            'count': 0, 'total_revenue': 0, 'total_employees': 0
                        }
                    category_stats[category]['count'] += industry_row['enterprise_count']
                    category_stats[category]['total_revenue'] += industry_row['total_revenue']
                    category_stats[category]['total_employees'] += industry_row['total_employees']
                    category_found = True
                    break
            if not category_found:
                if 'Others' not in category_stats:
                    category_stats['Others'] = {
                        'count': 0, 'total_revenue': 0, 'total_employees': 0
                    }
                category_stats['Others']['count'] += industry_row['enterprise_count']
                category_stats['Others']['total_revenue'] += industry_row['total_revenue']
                category_stats['Others']['total_employees'] += industry_row['total_employees']
        category_data = []
        for category, stats in category_stats.items():
            category_data.append({
                'category': category,
                'count': stats['count'],
                'total_revenue': float(stats['total_revenue']),
                'avg_revenue': float(stats['total_revenue'] / stats['count']) if stats['count'] > 0 else 0,
                'total_employees': int(stats['total_employees']),
                'market_share': round((stats['total_revenue'] / total_global_revenue) * 100, 2),
                'percentage': round((stats['count'] / total_enterprises) * 100, 2)
            })
        # Per-industry profitability ratios, computed in pandas for convenience.
        growth_analysis = df.select("industry", "revenue", "profit", "assets").toPandas()
        growth_data = []
        for industry in growth_analysis['industry'].unique():
            industry_subset = growth_analysis[growth_analysis['industry'] == industry]
            mean_revenue = industry_subset['revenue'].mean()
            mean_assets = industry_subset['assets'].mean()
            if not mean_revenue or not mean_assets or np.isnan(mean_revenue) or np.isnan(mean_assets):
                continue  # skip industries with missing or zero figures to avoid division by zero
            profit_margin = (industry_subset['profit'].mean() / mean_revenue) * 100
            asset_turnover = mean_revenue / mean_assets
            growth_data.append({
                'industry': industry,
                'profit_margin': round(float(profit_margin), 2),
                'asset_turnover': round(float(asset_turnover), 3),
                'efficiency_score': round(float(profit_margin * asset_turnover), 2)
            })
        result = {
            'industry_analysis': industry_data,
            'category_analysis': sorted(category_data, key=lambda x: x['total_revenue'], reverse=True),
            'growth_analysis': sorted(growth_data, key=lambda x: x['efficiency_score'], reverse=True),
            'total_industries': len(industry_data),
            'total_categories': len(category_data)
        }
        return JsonResponse(result)
    except Exception as e:
        return JsonResponse({'error': str(e)}, status=500)
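
These three views return plain JSON, so the Vue front end can feed their payloads straight into ECharts series. As a minimal sketch of how they might be exposed, assuming a Django app named analysis (the module path and route names are illustrative, not the project's actual layout):

# urls.py -- hypothetical routing for the three analysis endpoints above.
from django.urls import path
from analysis import views  # assumed app/module name

urlpatterns = [
    path('api/scale/', views.enterprise_scale_analysis),
    path('api/geography/', views.geographical_distribution_analysis),
    path('api/industry/', views.industry_distribution_analysis),
]

A Vue component would then fetch, for example, /api/scale/ and bind the revenue_analysis list to an ECharts bar or pie series.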

Documentation for the Big-Data-Based Fortune Global 500 Enterprise Data Analysis System

[Documentation screenshot]
