Big Data Capstone Too Hard? A Step-by-Step Guide to Building a Train Station Geodata Visualization and Analysis System, a Zero-Experience Intro to Hadoop + Spark | Computer Science Graduation Project


Preface

1. Development Tools Overview

  • Big data framework: Hadoop + Spark (Hive is not used in this version; customization is supported)
  • Development language: Python + Java (both versions are available)
  • Backend framework: Django + Spring Boot (Spring + Spring MVC + MyBatis) (both versions are available)
  • Frontend: Vue + ElementUI + ECharts + HTML + CSS + JavaScript + jQuery
  • Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
  • Database: MySQL

2. System Overview

The China railway station geodata visualization and analysis system is a comprehensive big-data analysis platform. Its core architecture is the Hadoop + Spark processing stack, with Python and the Django framework on the backend and a Vue + ElementUI + ECharts stack on the frontend for visualization. The system collects and consolidates geographic information on railway stations nationwide, cleans and preprocesses it with Spark SQL and Pandas, and runs numerical analysis with NumPy.

The main functional modules include a system home page, user management, CRUD operations on station records, a large-screen visualization dashboard, macro-level station statistics, spatial distribution analysis, railway bureau jurisdiction analysis, station-grade classification statistics, and core station cluster analysis. The raw geodata is stored on HDFS, whose distributed design keeps the data highly available and scalable, so the platform can serve as data support and a decision-making reference for railway transport planning and geographic information research.
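To make the cleaning step concrete, below is a minimal PySpark sketch of the ingestion flow, assuming the raw export sits on HDFS and the cleaned result feeds the station_info table queried later in this post; the HDFS path and column names are illustrative assumptions, not the project's exact pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("StationIngest").getOrCreate()

# Read the raw station export from HDFS (path is hypothetical).
raw = spark.read.csv("hdfs://namenode:9000/data/stations.csv", header=True, inferSchema=True)

# Basic cleaning: drop rows without coordinates, keep points inside China's
# rough bounding box (lng 73-135, lat 18-54), and deduplicate by station name.
clean = (raw.dropna(subset=["longitude", "latitude"])
            .filter(col("longitude").between(73, 135) & col("latitude").between(18, 54))
            .dropDuplicates(["station_name"]))

# Persist the cleaned table to MySQL for the Django views to query.
(clean.write.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/train_station_db")
      .option("dbtable", "station_info")
      .option("user", "root").option("password", "123456")
      .mode("overwrite").save())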

3. System Feature Demo

[Demo video]

4. System Interface Screenshots

[System interface screenshots]

5. Source Code Showcase


from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import pandas as pd
from django.http import JsonResponse

# One shared SparkSession for all analysis views; adaptive query execution lets
# Spark coalesce shuffle partitions at runtime. It is deliberately never stopped
# inside a view, since stopping a shared session per request breaks later calls.
spark = (SparkSession.builder.appName("TrainStationAnalysis")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

def load_station_df():
    # Load the station table from MySQL over JDBC; shared by all three views.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://localhost:3306/train_station_db")
            .option("dbtable", "station_info")
            .option("user", "root").option("password", "123456")
            .load())

def station_macro_analysis(request):
    # Macro statistics: station-level distribution, per-province counts, and
    # per-railway-bureau coverage extents.
    try:
        station_df = load_station_df()
        station_df.createOrReplaceTempView("stations")
        total_stations = station_df.count()
        # COUNT(*) is aliased to level_count: Row inherits tuple.count(), so
        # row.count would return the method instead of the column value.
        station_level_stats = spark.sql("SELECT station_level, COUNT(*) AS level_count FROM stations WHERE station_level IS NOT NULL GROUP BY station_level ORDER BY level_count DESC")
        level_data = []
        for row in station_level_stats.collect():
            level_data.append({"level": row.station_level, "count": row.level_count, "percentage": round((row.level_count / total_stations) * 100, 2)})
        # Per-province counts plus the mean coordinates of each province's stations.
        province_stats = spark.sql("SELECT province, COUNT(*) AS station_count, AVG(longitude) AS avg_lng, AVG(latitude) AS avg_lat FROM stations WHERE province IS NOT NULL GROUP BY province ORDER BY station_count DESC")
        province_data = []
        for row in province_stats.collect():
            province_data.append({"province": row.province, "station_count": row.station_count, "avg_longitude": round(row.avg_lng, 4), "avg_latitude": round(row.avg_lat, 4)})
        # Bounding box per railway bureau; the lng/lat spans give a rough
        # degree-squared measure of each bureau's coverage area.
        railway_bureau_stats = spark.sql("SELECT railway_bureau, COUNT(*) AS managed_stations, MIN(longitude) AS min_lng, MAX(longitude) AS max_lng, MIN(latitude) AS min_lat, MAX(latitude) AS max_lat FROM stations WHERE railway_bureau IS NOT NULL GROUP BY railway_bureau ORDER BY managed_stations DESC")
        bureau_data = []
        for row in railway_bureau_stats.collect():
            lng_span = row.max_lng - row.min_lng
            lat_span = row.max_lat - row.min_lat
            bureau_data.append({"bureau": row.railway_bureau, "managed_stations": row.managed_stations, "coverage_area": round(lng_span * lat_span, 4), "lng_span": round(lng_span, 4), "lat_span": round(lat_span, 4)})
        result_data = {"total_stations": total_stations, "level_distribution": level_data, "province_distribution": province_data, "bureau_management": bureau_data, "analysis_timestamp": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")}
        return JsonResponse({"status": "success", "data": result_data})
    except Exception as e:
        return JsonResponse({"status": "error", "message": str(e)})
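Assuming the view above is mapped to a route such as /api/station/macro/ (a hypothetical URL; see the routing sketch at the end of this section), the endpoint can be smoke-tested with a plain HTTP client:

import requests

# Hypothetical local dev-server route for station_macro_analysis.
resp = requests.get("http://127.0.0.1:8000/api/station/macro/")
payload = resp.json()
if payload["status"] == "success":
    # level_distribution mirrors the level_data list built in the view.
    for item in payload["data"]["level_distribution"]:
        print(item["level"], item["count"], str(item["percentage"]) + "%")
else:
    print("analysis failed:", payload["message"])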

def station_spatial_analysis(request):
    # Spatial distribution: coordinate statistics, a 2-degree density grid, and
    # the closest station pairs (distance measured in raw degrees).
    try:
        station_df = load_station_df()
        # Keep only rows with valid, non-zero coordinates.
        station_df = station_df.filter((col("longitude").isNotNull()) & (col("latitude").isNotNull()) & (col("longitude") != 0) & (col("latitude") != 0))
        station_df.createOrReplaceTempView("spatial_stations")
        lng_stats = spark.sql("SELECT MIN(longitude) AS min_lng, MAX(longitude) AS max_lng, AVG(longitude) AS avg_lng, STDDEV(longitude) AS std_lng FROM spatial_stations")
        lat_stats = spark.sql("SELECT MIN(latitude) AS min_lat, MAX(latitude) AS max_lat, AVG(latitude) AS avg_lat, STDDEV(latitude) AS std_lat FROM spatial_stations")
        lng_row = lng_stats.collect()[0]
        lat_row = lat_stats.collect()[0]
        coordinate_stats = {"longitude": {"min": round(lng_row.min_lng, 4), "max": round(lng_row.max_lng, 4), "avg": round(lng_row.avg_lng, 4), "std": round(lng_row.std_lng, 4)}, "latitude": {"min": round(lat_row.min_lat, 4), "max": round(lat_row.max_lat, 4), "avg": round(lat_row.avg_lat, 4), "std": round(lat_row.std_lat, 4)}}
        # Bucket stations into 2-degree grid cells; keep cells with more than 5 stations.
        regional_density = spark.sql("SELECT FLOOR(longitude/2)*2 AS lng_grid, FLOOR(latitude/2)*2 AS lat_grid, COUNT(*) AS station_density FROM spatial_stations GROUP BY lng_grid, lat_grid HAVING station_density > 5 ORDER BY station_density DESC")
        density_data = []
        for row in regional_density.collect():
            density_data.append({"lng_grid": row.lng_grid, "lat_grid": row.lat_grid, "density": row.station_density, "grid_center_lng": row.lng_grid + 1, "grid_center_lat": row.lat_grid + 1})
        # Self-join to find the 20 closest pairs within 0.5 degrees; the
        # s1.station_name < s2.station_name predicate keeps each pair only once.
        distance_analysis = spark.sql("SELECT s1.station_name AS station1, s2.station_name AS station2, s1.longitude AS lng1, s1.latitude AS lat1, s2.longitude AS lng2, s2.latitude AS lat2, SQRT(POW(s1.longitude - s2.longitude, 2) + POW(s1.latitude - s2.latitude, 2)) AS distance FROM spatial_stations s1 CROSS JOIN spatial_stations s2 WHERE s1.station_name < s2.station_name AND SQRT(POW(s1.longitude - s2.longitude, 2) + POW(s1.latitude - s2.latitude, 2)) < 0.5 ORDER BY distance ASC LIMIT 20")
        nearby_pairs = []
        for row in distance_analysis.collect():
            nearby_pairs.append({"station1": row.station1, "station2": row.station2, "distance": round(row.distance, 4), "coordinates": {"station1": {"lng": row.lng1, "lat": row.lat1}, "station2": {"lng": row.lng2, "lat": row.lat2}}})
        spatial_result = {"coordinate_statistics": coordinate_stats, "high_density_regions": density_data, "nearby_station_pairs": nearby_pairs, "total_analyzed_stations": station_df.count()}
        return JsonResponse({"status": "success", "data": spatial_result})
    except Exception as e:
        return JsonResponse({"status": "error", "message": str(e)})
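One caveat on the pair query above: it measures distance in raw coordinate degrees, which is only a rough proxy for ground distance (a degree of longitude shrinks toward higher latitudes). If true kilometers are needed, a haversine UDF along these lines could be registered and used in place of the SQL expression; the function name haversine_km is my own, not part of the project:

import math
from pyspark.sql.types import DoubleType

def haversine_km(lng1, lat1, lng2, lat2):
    # Great-circle distance in kilometers between two WGS-84 points.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Expose it to Spark SQL so queries can call haversine_km(...) directly, e.g.
# WHERE haversine_km(s1.longitude, s1.latitude, s2.longitude, s2.latitude) < 50
spark.udf.register("haversine_km", haversine_km, DoubleType())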

def core_station_cluster_analysis(request):
    # Cluster analysis of core stations (special-, first- and second-class):
    # 0.1-degree grid clusters, per-province core counts, inter-province links.
    try:
        station_df = load_station_df()
        station_df = station_df.filter((col("longitude").isNotNull()) & (col("latitude").isNotNull()) & (col("station_level").isNotNull()))
        station_df.createOrReplaceTempView("cluster_stations")
        core_stations = spark.sql("SELECT * FROM cluster_stations WHERE station_level IN ('特等站', '一等站', '二等站')")
        core_stations.createOrReplaceTempView("core_stations")
        # Rounding coordinates to one decimal bins stations into ~0.1-degree
        # cells; cells holding at least 3 core stations count as clusters.
        cluster_regions = spark.sql("SELECT ROUND(longitude, 1) AS lng_cluster, ROUND(latitude, 1) AS lat_cluster, COUNT(*) AS core_count, AVG(longitude) AS center_lng, AVG(latitude) AS center_lat, COLLECT_LIST(station_name) AS station_names, COLLECT_LIST(station_level) AS station_levels FROM core_stations GROUP BY lng_cluster, lat_cluster HAVING core_count >= 3 ORDER BY core_count DESC")
        cluster_data = []
        for row in cluster_regions.collect():
            station_list = list(zip(row.station_names, row.station_levels))
            level_distribution = {}
            for station, level in station_list:
                level_distribution[level] = level_distribution.get(level, 0) + 1
            # cluster_density is simply core stations per grid cell.
            cluster_data.append({"cluster_id": f"cluster_{row.lng_cluster}_{row.lat_cluster}", "center_coordinates": {"longitude": round(row.center_lng, 4), "latitude": round(row.center_lat, 4)}, "core_station_count": row.core_count, "station_details": [{"name": name, "level": level} for name, level in station_list], "level_distribution": level_distribution, "cluster_density": row.core_count})
        province_core_stats = spark.sql("SELECT province, COUNT(*) AS total_core, COUNT(CASE WHEN station_level = '特等站' THEN 1 END) AS special_count, COUNT(CASE WHEN station_level = '一等站' THEN 1 END) AS first_class_count, COUNT(CASE WHEN station_level = '二等站' THEN 1 END) AS second_class_count FROM core_stations WHERE province IS NOT NULL GROUP BY province ORDER BY total_core DESC")
        total_filtered = station_df.count()  # hoisted out of the loop: each count() triggers a Spark job
        province_core_data = []
        for row in province_core_stats.collect():
            province_core_data.append({"province": row.province, "total_core_stations": row.total_core, "special_stations": row.special_count or 0, "first_class_stations": row.first_class_count or 0, "second_class_stations": row.second_class_count or 0, "core_station_ratio": round((row.total_core / total_filtered) * 100, 2)})
        # Treat core stations of different provinces within 2 degrees of each
        # other as links; pair counts approximate inter-province connectivity.
        network_connectivity = spark.sql("SELECT c1.province AS province1, c2.province AS province2, COUNT(*) AS connection_strength FROM core_stations c1 CROSS JOIN core_stations c2 WHERE c1.province != c2.province AND c1.province < c2.province AND SQRT(POW(c1.longitude - c2.longitude, 2) + POW(c1.latitude - c2.latitude, 2)) < 2.0 GROUP BY c1.province, c2.province HAVING connection_strength > 5 ORDER BY connection_strength DESC LIMIT 15")
        connectivity_data = []
        for row in network_connectivity.collect():
            connectivity_data.append({"province_pair": f"{row.province1}-{row.province2}", "connection_strength": row.connection_strength, "provinces": {"province1": row.province1, "province2": row.province2}})
        total_core = core_stations.count()
        cluster_result = {"station_clusters": cluster_data, "province_core_distribution": province_core_data, "inter_province_connectivity": connectivity_data, "analysis_summary": {"total_clusters": len(cluster_data), "total_core_stations": total_core, "cluster_coverage_rate": round((len(cluster_data) / total_core) * 100, 2)}}
        return JsonResponse({"status": "success", "data": cluster_result})
    except Exception as e:
        return JsonResponse({"status": "error", "message": str(e)})
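Finally, to expose the three views over HTTP, a minimal Django urls.py sketch is shown below; the module layout and route paths are assumptions, not the project's actual configuration:

# urls.py (hypothetical routing for the three analysis views)
from django.urls import path
from . import views  # assumes the views above live in this app's views.py

urlpatterns = [
    path("api/station/macro/", views.station_macro_analysis),
    path("api/station/spatial/", views.station_spatial_analysis),
    path("api/station/cluster/", views.core_station_cluster_analysis),
]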

6. System Documentation

[Documentation screenshot]

End