项目名称:DataInsight Dashboard - 一个基于真实数据集的分析与可视化平台 技术栈:Python (Pandas, Matplotlib/Seaborn), SQLite/SQL, Streamlit (一个用Python写Web App的神器) 核心目标:证明你能够从数据中提取有价值的商业洞察。 具体执行步骤:
- 寻找数据集:去Kaggle找一个你感兴趣的数据集(如电商销售数据、电影IMDb评分、共享单车使用数据)。
- 数据清洗与探索:使用Pandas进行数据清洗(处理空值、格式转换)。这就是你展示SQL思维的地方:用Pandas实现类似SQL的GROUP BY, JOIN, WHERE等操作。
- SQLite集成:将清洗后的数据存入SQLite数据库,并编写几个关键的SQL查询语句来提取核心指标(如“每月销售额”、“最畅销产品”)。这能证明你会操作数据库。
- 可视化与报告:用Matplotlib/Seaborn绘制精美的图表(折线图、柱状图、热力图)。
- 打造Web应用:使用Streamlit,在几天内就能将你的分析变成一个交互式Web仪表盘。Streamlit非常简单,只需几十行代码。
一、前期准备:环境安装 pip install pandas numpy sqlite3 streamlit openpyxl 二、Step1:数据获取(Kaggle共享单车数据集)
下载数据集:Bike Sharing Dataset(解压后得到 day.csv 和 hour.csv,用 day.csv 做日度分析)
三、Step2:数据清洗与探索(Pandas实现SQL思维操作)
创建 data_cleaning.py: import pandas as pd import numpy as np
1. 读取数据(类似SQL SELECT * FROM table)
df = pd.read_csv("day.csv")
2. 数据清洗(SQL思维对应操作)
处理空值(WHERE 过滤非空)
df = df.dropna(subset=["dteday", "cnt", "temp"])
格式转换:日期字段(SQL DATE 类型)
df["dteday"] = pd.to_datetime(df["dteday"])
新增字段:年月(GROUP BY 分组维度)
df["year_month"] = df["dteday"].dt.strftime("%Y-%m")
新增字段:季节映射(提升可读性,类似SQL CASE WHEN)
season_map = {1: "春季", 2: "夏季", 3: "秋季", 4: "冬季"} df["season_name"] = df["season"].map(season_map)
新增字段:天气映射(SQL CASE WHEN)
weather_map = {1: "晴朗", 2: "多云/雾", 3: "小雨/雪", 4: "大雨/雪"} df["weather_name"] = df["weathersit"].map(weather_map)
过滤异常值(WHERE 条件:排除极端天气和无效骑行量)
df = df[(df["cnt"] > 0) & (df["weathersit"] != 4)]
3. 探索性分析(类似SQL GROUP BY + AGG)
print("=== 核心指标探索 ===") print(f"总骑行量:{df['cnt'].sum():,} 次") print(f"平均日骑行量:{df['cnt'].mean():.0f} 次") print(f"各季节骑行量:") season_rides = df.groupby("season_name")["cnt"].sum().sort_values(ascending=False) print(season_rides)
保存清洗后的数据(供PowerBI读取)
df.to_csv("cleaned_bike_data.csv", index=False, encoding="utf-8-sig") print("数据清洗完成,已保存为 cleaned_bike_data.csv(支持PowerBI读取)") 四、Step3:SQLite集成(数据存储+核心SQL查询)
创建 sql_operations.py: import sqlite3 import pandas as pd
1. 连接SQLite数据库
conn = sqlite3.connect("bike_sharing.db") cursor = conn.cursor()
2. 读取清洗后的数据
df = pd.read_csv("cleaned_bike_data.csv", encoding="utf-8-sig")
3. 数据存入SQLite表(CREATE TABLE + INSERT)
df.to_sql("bike_rides", conn, if_exists="replace", index=False) print("数据已存入SQLite表 bike_rides")
4. 关键SQL查询(提取核心指标,供分析和PowerBI联动)
def run_sql(query): return pd.read_sql(query, conn)
示例1:每月骑行量(GROUP BY 年月)
query1 = """ SELECT year_month, SUM(cnt) AS monthly_rides, AVG(temp) AS avg_temp FROM bike_rides GROUP BY year_month ORDER BY year_month; """ monthly_rides = run_sql(query1) print("\n=== 每月骑行量 ===") print(monthly_rides)
示例2:各季节骑行量(GROUP BY 季节)
query2 = """ SELECT season_name, SUM(cnt) AS season_rides, AVG(cnt) AS avg_daily_rides FROM bike_rides GROUP BY season_name ORDER BY season_rides DESC; """ season_rides = run_sql(query2) print("\n=== 各季节骑行量 ===") print(season_rides)
示例3:天气对骑行量的影响(GROUP BY 天气)
query3 = """ SELECT weather_name, SUM(cnt) AS weather_rides, AVG(cnt) AS avg_weather_rides FROM bike_rides GROUP BY weather_name ORDER BY avg_weather_rides DESC; """ weather_rides = run_sql(query3) print("\n=== 各天气骑行量 ===") print(weather_rides)
示例4:工作日vs周末骑行量(GROUP BY 工作日标识)
query4 = """ SELECT CASE WHEN workingday = 1 THEN '工作日' ELSE '周末' END AS day_type, SUM(cnt) AS total_rides, AVG(cnt) AS avg_rides FROM bike_rides GROUP BY workingday; """ workday_rides = run_sql(query4) print("\n=== 工作日vs周末骑行量 ===") print(workday_rides)
conn.close() 五、Step4:PowerBI可视化操作指引(核心!)
无需Python代码,直接用PowerBI连接数据生成专业图表,步骤如下:
-
导入数据:打开PowerBI → 「获取数据」→ 「文本/CSV」→ 选择 cleaned_bike_data.csv(或直接连接SQLite:「获取数据」→ 「SQLite数据库」→ 输入数据库文件路径 bike_sharing.db → 选择 bike_rides 表)
-
创建核心图表(拖拽式操作):
◦ 「折线图」:X轴=year_month,Y轴=SUM(cnt) → 每月骑行量趋势
◦ 「柱状图」:X轴=season_name,Y轴=SUM(cnt) → 各季节骑行量对比
◦ 「热力图」:行=weekday,列=season_name,值=AVG(cnt) → 周几+季节骑行量热力图
◦ 「散点图」:X轴=temp(温度),Y轴=cnt(骑行量) → 温度与骑行量相关性
◦ 「卡片图」:显示总骑行量、平均日骑行量等核心指标
-
美化与交互:设置主题颜色、添加筛选器(如年份、季节筛选)、添加数据标签
-
导出报告:可导出为PDF,或发布到PowerBI Service生成分享链接
六、Step5:Streamlit Web应用(交互式仪表盘)
创建 app.py(与PowerBI互补,生成可分享网页): import streamlit as st import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import sqlite3
设置中文字体
plt.rcParams["font.sans-serif"] = ["SimHei"] plt.rcParams["axes.unicode_minus"] = False
页面配置
st.set_page_config(page_title="BikeInsight Dashboard", layout="wide", page_icon="🚲")
标题
st.title("🚲 共享单车骑行数据智能分析仪表盘") st.divider()
缓存数据(提升性能)
@st.cache_data def load_sql_data(query): conn = sqlite3.connect("bike_sharing.db") df = pd.read_sql(query, conn) conn.close() return df
加载核心数据
monthly_data = load_sql_data("SELECT year_month, SUM(cnt) AS monthly_rides, AVG(temp) AS avg_temp FROM bike_rides GROUP BY year_month ORDER BY year_month;") season_data = load_sql_data("SELECT season_name, SUM(cnt) AS season_rides FROM bike_rides GROUP BY season_name ORDER BY season_rides DESC;") weather_data = load_sql_data("SELECT weather_name, AVG(cnt) AS avg_weather_rides FROM bike_rides GROUP BY weather_name ORDER BY avg_weather_rides DESC;") total_metrics = load_sql_data("SELECT SUM(cnt) AS total_rides, AVG(cnt) AS avg_daily_rides, COUNT(DISTINCT dteday) AS total_days FROM bike_rides;")
1. 核心指标卡片(顶部)
col1, col2, col3 = st.columns(3) with col1: st.metric("总骑行量", f"{total_metrics['total_rides'].iloc[0]:,} 次") with col2: st.metric("平均日骑行量", f"{total_metrics['avg_daily_rides'].iloc[0]:.0f} 次") with col3: st.metric("分析总天数", f"{total_metrics['total_days'].iloc[0]} 天") st.divider()
2. 可视化区域(两列布局)
col_left, col_right = st.columns(2)
左侧:每月骑行量趋势
with col_left: st.subheader("📈 每月骑行量趋势") fig, ax = plt.subplots(figsize=(12, 6)) sns.lineplot(x="year_month", y="monthly_rides", data=monthly_data, ax=ax, linewidth=2.5, color="#2E8B57", marker="o") ax.set_xlabel("年月", fontsize=12) ax.set_ylabel("骑行量(次)", fontsize=12) ax.tick_params(axis="x", rotation=45) st.pyplot(fig)
右侧:各季节骑行量
with col_right: st.subheader("🍂 各季节骑行量对比") fig, ax = plt.subplots(figsize=(10, 6)) sns.barplot(x="season_name", y="season_rides", data=season_data, ax=ax, palette="viridis") ax.set_xlabel("季节", fontsize=12) ax.set_ylabel("骑行量(次)", fontsize=12) # 添加数值标签 for i, v in enumerate(season_data["season_rides"]): ax.text(i, v + 5000, f"{v:,}", ha="center", fontsize=10) st.pyplot(fig)
下方:天气对骑行量影响 + 温度相关性
col1, col2 = st.columns(2) with col1: st.subheader("🌤️ 天气与平均骑行量") fig, ax = plt.subplots(figsize=(10, 6)) sns.barplot(x="weather_name", y="avg_weather_rides", data=weather_data, ax=ax, palette="Set2") ax.set_xlabel("天气状况", fontsize=12) ax.set_ylabel("平均骑行量(次/天)", fontsize=12) st.pyplot(fig)
with col2: st.subheader("🌡️ 温度与骑行量相关性") raw_data = pd.read_csv("cleaned_bike_data.csv", encoding="utf-8-sig") fig, ax = plt.subplots(figsize=(10, 6)) sns.scatterplot(x="temp", y="cnt", data=raw_data, ax=ax, alpha=0.6, color="#FF6347") ax.set_xlabel("标准化温度(0-1)", fontsize=12) ax.set_ylabel("骑行量(次)", fontsize=12) st.pyplot(fig)
3. 数据表格与下载
st.subheader("📋 数据预览(前100行)") st.dataframe(raw_data[["dteday", "season_name", "weather_name", "temp", "cnt"]].head(100), use_container_width=True)
st.download_button( label="💾 下载清洗后完整数据", data=raw_data.to_csv(index=False, encoding="utf-8-sig"), file_name="bike_sharing_cleaned_data.csv", mime="text/csv" )
页脚
st.divider() st.caption("© 2024 BikeInsight Dashboard | 基于 Python + Streamlit + PowerBI 构建")