Big Data Project in Practice 3 | Offline | Real-Time | Data Warehouse | Recommendation | Visualization
This article walks through the core big data scenarios end to end: offline data processing, real-time computation, data warehouse construction, recommender system development, and visualization. Every stage comes with runnable code samples to help you get started quickly with enterprise-grade big data projects.
I. Environment Setup and Technology Stack
Core technology stack
- Offline processing: Hadoop (HDFS + MapReduce), Spark SQL, Hive
- Real-time computation: Flink, Kafka
- Data warehouse: layered Hive warehouse, dimensional modeling
- Recommender system: Spark MLlib, collaborative filtering
- Visualization: Superset, ECharts
- Supporting tools: MySQL (metadata storage), Redis (caching), Docker (environment deployment)
Quick environment deployment (Docker Compose)
```yaml
# docker-compose.yml: one-command deployment of the big data environment
version: '3'
services:
  hadoop:
    image: sequenceiq/hadoop-docker:2.7.1
    ports:
      - "50070:50070"
      - "8088:8088"
    command: /etc/bootstrap.sh -d
  hive:
    image: sequenceiq/hive:1.1.0
    ports:
      - "10000:10000"
    depends_on:
      - hadoop
    environment:
      - HIVE_CONF_DIR=/usr/local/hive/conf
  kafka:
    image: confluentinc/cp-kafka:7.0.0
    ports:
      - "9092:9092"
    depends_on:
      - zookeeper
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.0
    ports:
      - "2181:2181"
    environment:
      - ZOOKEEPER_CLIENT_PORT=2181
  flink-jobmanager:
    image: flink:1.14.0
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=flink-jobmanager
  flink-taskmanager:
    image: flink:1.14.0
    depends_on:
      - flink-jobmanager
    command: taskmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=flink-jobmanager
  superset:
    image: apache/superset:2.0.0
    ports:
      - "8089:8088"  # remapped to host port 8089 to avoid clashing with YARN's 8088 above
    depends_on:
      - mysql
    environment:
      - SUPERSET_SECRET_KEY=your-secret-key
  mysql:
    image: mysql:8.0
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=root123
      - MYSQL_DATABASE=bigdata_project
```
Start the stack:
```bash
docker-compose up -d
```
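Once the containers are up, it helps to confirm the key service ports are reachable before moving on. A minimal sketch; the host/port pairs mirror the compose file above (Superset on 8089 after the remap):

```python
# check_env.py: probe the service ports exposed by docker-compose
import socket

SERVICES = {
    "HDFS NameNode UI": ("localhost", 50070),
    "YARN ResourceManager UI": ("localhost", 8088),
    "HiveServer2": ("localhost", 10000),
    "Kafka broker": ("localhost", 9092),
    "ZooKeeper": ("localhost", 2181),
    "Flink dashboard": ("localhost", 8081),
    "Superset": ("localhost", 8089),
    "MySQL": ("localhost", 3306),
}

for name, (host, port) in SERVICES.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        status = "UP" if sock.connect_ex((host, port)) == 0 else "DOWN"
        print(f"{name:26s} {host}:{port:<6} {status}")
```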
II. Offline Data Processing in Practice (Spark SQL + Hive)
Scenario: offline statistics over user behavior data (daily active users, retention, spend analysis)
1. Layered Hive warehouse tables
```sql
-- ODS layer (raw data)
CREATE EXTERNAL TABLE IF NOT EXISTS ods.user_behavior (
    user_id STRING,
    item_id STRING,
    category_id STRING,
    behavior_type STRING,
    ts BIGINT
)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/ods/user_behavior';

-- DWD layer (cleaned detail data)
CREATE TABLE IF NOT EXISTS dwd.user_behavior_detail (
    user_id STRING,
    item_id STRING,
    category_id STRING,
    behavior_type STRING,
    dt STRING,
    hour STRING
)
STORED AS PARQUET;

-- DWS layer (aggregated data)
CREATE TABLE IF NOT EXISTS dws.user_daily_stats (
    dt STRING,
    dau BIGINT,                       -- daily active users
    pu BIGINT,                        -- paying users
    total_order_amount DECIMAL(10,2)  -- total order amount for the day
)
STORED AS PARQUET;
```
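The ODS table starts out empty, so to exercise the pipeline end to end you need some raw events in it. A minimal sketch that writes mock behavior events into the ODS location with PySpark; the schema matches the DDL above, while the row count and value ranges are arbitrary choices:

```python
# seed_ods.py: write mock user-behavior events into the ODS table
import random
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SeedOdsUserBehavior").enableHiveSupport().getOrCreate()

now_ms = int(time.time() * 1000)
rows = [
    (
        f"user_{random.randint(0, 999)}",
        f"item_{random.randint(0, 9999)}",
        f"cat_{random.randint(1, 20)}",
        random.choice(["click", "collect", "cart", "pay"]),
        now_ms - random.randint(0, 86_400_000),  # random timestamp within the last 24h
    )
    for _ in range(100_000)
]

df = spark.createDataFrame(rows, ["user_id", "item_id", "category_id", "behavior_type", "ts"])
# The ODS table is EXTERNAL, so writing Parquet files into its LOCATION is enough
df.write.mode("append").parquet("/user/hive/warehouse/ods/user_behavior")
spark.stop()
```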
2. Spark SQL offline computation script
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, date_format, countDistinct, sum, when

# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("OfflineUserBehaviorStats") \
    .enableHiveSupport() \
    .getOrCreate()

# 1. Load ODS data and clean it into the DWD layer
ods_df = spark.table("ods.user_behavior")
dwd_df = ods_df.withColumn("dt", date_format(from_unixtime(ods_df.ts / 1000), "yyyy-MM-dd")) \
    .withColumn("hour", date_format(from_unixtime(ods_df.ts / 1000), "HH")) \
    .select("user_id", "item_id", "category_id", "behavior_type", "dt", "hour")

# Write to the DWD layer
dwd_df.write.mode("overwrite").insertInto("dwd.user_behavior_detail")

# 2. DWS aggregation (per-day statistics)
dws_df = dwd_df.filter(dwd_df.dt == "2025-11-28") \
    .groupBy("dt") \
    .agg(
        countDistinct("user_id").alias("dau"),
        countDistinct(when(dwd_df.behavior_type == "pay", dwd_df.user_id)).alias("pu"),
        sum(when(dwd_df.behavior_type == "pay", 100).otherwise(0)).alias("total_order_amount")  # mocked order amount
    )

# Write to the DWS layer
dws_df.write.mode("overwrite").insertInto("dws.user_daily_stats")

# Verify the output
dws_df.show()
spark.stop()
```
Run it with:
```bash
spark-submit --master yarn --deploy-mode cluster offline_stats.py
```
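The scenario above also lists retention, which the script does not compute. A minimal next-day-retention sketch on top of the DWD table (the two dates are placeholders; retention here is the share of one day's active users who are active again the following day):

```python
# retention.py: next-day retention from the DWD detail table
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NextDayRetention").enableHiveSupport().getOrCreate()

dwd = spark.table("dwd.user_behavior_detail")
day0 = dwd.filter(col("dt") == "2025-11-28").select("user_id").distinct()
day1 = dwd.filter(col("dt") == "2025-11-29").select("user_id").distinct()

day0_count = day0.count()
retained = day0.join(day1, "user_id").count()  # users active on both days
rate = retained / day0_count if day0_count else 0.0
print(f"2025-11-28 DAU={day0_count}, retained next day={retained}, retention={rate:.2%}")
spark.stop()
```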
III. Real-Time Processing in Practice (Kafka + Flink)
Scenario: real-time monitoring of the user behavior stream (Top-10 hot items, real-time alerting)
1. Kafka producer for mock data
```python
# kafka_producer.py: send mock user-behavior events
from kafka import KafkaProducer
import json
import random
import time

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Mock user-behavior value pools
user_ids = [f"user_{i}" for i in range(1000)]
item_ids = [f"item_{i}" for i in range(10000)]
behavior_types = ["click", "collect", "cart", "pay"]

while True:
    data = {
        "user_id": random.choice(user_ids),
        "item_id": random.choice(item_ids),
        "category_id": f"cat_{random.randint(1, 20)}",
        "behavior_type": random.choice(behavior_types),
        "ts": int(time.time() * 1000)
    }
    producer.send("user_behavior_topic", value=data)
    print(f"sent: {data}")
    time.sleep(0.1)  # roughly 10 events per second
```
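Before wiring Flink in, a quick way to confirm the topic is actually receiving events is a throwaway consumer. A sketch using the same kafka-python package as the producer:

```python
# kafka_consumer_check.py: confirm events are landing on the topic
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_behavior_topic",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="latest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for i, message in enumerate(consumer):
    print(f"offset={message.offset} value={message.value}")
    if i >= 9:  # ten messages are enough to confirm the pipe works
        break
```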
2. Flink real-time job
```java
// FlinkRealTimeStats.java: real-time Top-10 hot items
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// User behavior POJO (getters/setters/toString omitted for brevity; Flink and fastjson need them)
class UserBehavior {
    private String user_id;
    private String item_id;
    private String category_id;
    private String behavior_type;
    private long ts;
    public static UserBehavior fromJson(String json) {
        return com.alibaba.fastjson.JSON.parseObject(json, UserBehavior.class);
    }
}

// Aggregator for per-item click counts
class ItemClickCounter implements AggregateFunction<UserBehavior, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }
    @Override
    public Long add(UserBehavior value, Long accumulator) {
        // count click events only
        return value.getBehavior_type().equals("click") ? accumulator + 1 : accumulator;
    }
    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }
    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}

public class FlinkRealTimeStats {
    public static void main(String[] args) throws Exception {
        // 1. Set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Kafka configuration
        Properties kafkaProps = new Properties();
        kafkaProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaProps.put(ConsumerConfig.GROUP_ID_CONFIG, "flink_realtime_group");
        kafkaProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        // 3. Read the Kafka stream
        DataStream<String> kafkaStream = env.addSource(
            new FlinkKafkaConsumer<>("user_behavior_topic", new SimpleStringSchema(), kafkaProps)
        );
        // 4. Parse events and assign watermarks
        DataStream<UserBehavior> behaviorStream = kafkaStream
            .map(UserBehavior::fromJson)
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<UserBehavior>forMonotonousTimestamps()
                    .withTimestampAssigner((behavior, timestamp) -> behavior.getTs())
            );
        // 5. Key by item ID
        KeyedStream<UserBehavior, String> itemKeyedStream = behaviorStream
            .keyBy((KeySelector<UserBehavior, String>) UserBehavior::getItem_id);
        // 6. Sliding window: 1-minute window, sliding every 30 seconds
        WindowedStream<UserBehavior, String, TimeWindow> windowStream = itemKeyedStream
            .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(30)));
        // 7. Aggregate per-item click counts, attaching the item key to each result
        DataStream<Tuple2<String, Long>> itemClickStream = windowStream
            .aggregate(new ItemClickCounter(),
                new WindowFunction<Long, Tuple2<String, Long>, String, TimeWindow>() {
                    @Override
                    public void apply(String itemId, TimeWindow window,
                                      Iterable<Long> input, Collector<Tuple2<String, Long>> out) {
                        out.collect(Tuple2.of(itemId, input.iterator().next()));
                    }
                });
        // 8. Re-key everything onto a single global key and pick the Top 10
        itemClickStream.keyBy(t -> "global")
            .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(30)))
            .process(new TopNProcessFunction(10))
            .print("Top-10 hot items");
        // Run the job
        env.execute("UserBehaviorRealTimeStats");
    }

    // Top-N: collect the window's per-item counts, sort descending, emit the first N
    static class TopNProcessFunction extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
        private final int topSize;
        public TopNProcessFunction(int topSize) { this.topSize = topSize; }
        @Override
        public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> out) {
            List<Tuple2<String, Long>> list = new ArrayList<>();
            for (Tuple2<String, Long> elem : elements) list.add(elem);
            list.sort((a, b) -> b.f1.compareTo(a.f1));
            StringBuilder result = new StringBuilder();
            result.append("window: ").append(context.window().getStart())
                  .append(" - ").append(context.window().getEnd()).append("\n");
            for (int i = 0; i < Math.min(topSize, list.size()); i++) {
                result.append("#").append(i + 1).append(" item ").append(list.get(i).f0)
                      .append(" clicks: ").append(list.get(i).f1).append("\n");
            }
            out.collect(result.toString());
        }
    }
}
```
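The packaged job can be submitted through the Flink dashboard on localhost:8081 (exposed by the compose file) or with the flink CLI. To sanity-check the Top-N logic without a cluster, here is a rough Python stand-in that consumes the same topic and prints the top clicks; a debugging aid only, since it approximates the sliding event-time window with 60-second processing-time buckets:

```python
# topn_check.py: approximate the Top-10 computation outside Flink
import json
import time
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_behavior_topic",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="latest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

counts = Counter()
window_start = time.time()
for message in consumer:
    event = message.value
    if event["behavior_type"] == "click":
        counts[event["item_id"]] += 1
    if time.time() - window_start >= 60:  # close the 60-second bucket
        print("top 10:", counts.most_common(10))
        counts.clear()
        window_start = time.time()
```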
IV. Data Warehouse Construction (Hive Layers + Dimensional Modeling)
Scenario: e-commerce warehouse modeling (fact tables + dimension tables)
1. Dimension tables
```sql
-- User dimension table
CREATE TABLE IF NOT EXISTS dim.dim_user (
    user_id STRING COMMENT 'user ID',
    user_name STRING COMMENT 'user name',
    gender STRING COMMENT 'gender',
    age INT COMMENT 'age',
    register_dt STRING COMMENT 'registration date',
    phone STRING COMMENT 'phone number'
)
STORED AS PARQUET
COMMENT 'user dimension table';

-- Item dimension table
CREATE TABLE IF NOT EXISTS dim.dim_item (
    item_id STRING COMMENT 'item ID',
    item_name STRING COMMENT 'item name',
    category_id STRING COMMENT 'category ID',
    category_name STRING COMMENT 'category name',
    price DECIMAL(10,2) COMMENT 'item price',
    create_dt STRING COMMENT 'creation date'
)
STORED AS PARQUET
COMMENT 'item dimension table';

-- Date dimension table (generated with SQL)
CREATE TABLE IF NOT EXISTS dim.dim_date (
    dt STRING COMMENT 'date (yyyy-MM-dd)',
    year INT COMMENT 'year',
    month INT COMMENT 'month',
    day INT COMMENT 'day of month',
    week INT COMMENT 'week of year',
    quarter INT COMMENT 'quarter',
    is_weekend INT COMMENT 'weekend flag (0 = no, 1 = yes)'
)
STORED AS PARQUET
COMMENT 'date dimension table';

-- Populate the date dimension for all of 2025
-- (weekday() is the Spark SQL function with Monday = 0, so >= 5 flags Saturday/Sunday)
INSERT OVERWRITE TABLE dim.dim_date
SELECT
    date_add('2025-01-01', pos) AS dt,
    year(date_add('2025-01-01', pos)) AS year,
    month(date_add('2025-01-01', pos)) AS month,
    day(date_add('2025-01-01', pos)) AS day,
    weekofyear(date_add('2025-01-01', pos)) AS week,
    quarter(date_add('2025-01-01', pos)) AS quarter,
    CASE WHEN weekday(date_add('2025-01-01', pos)) >= 5 THEN 1 ELSE 0 END AS is_weekend
FROM (
    SELECT posexplode(split(space(datediff('2025-12-31', '2025-01-01')), ' ')) AS (pos, val)
) t;
```
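The posexplode trick above generates one row per day by exploding a string of spaces. To double-check the output (365 rows, sensible weekend flags), a small pandas sketch that rebuilds the same dimension independently:

```python
# check_dim_date.py: rebuild the 2025 date dimension with pandas for comparison
import pandas as pd

dates = pd.date_range("2025-01-01", "2025-12-31", freq="D")
dim_date = pd.DataFrame({
    "dt": dates.strftime("%Y-%m-%d"),
    "year": dates.year,
    "month": dates.month,
    "day": dates.day,
    "week": dates.isocalendar().week.values,
    "quarter": dates.quarter,
    "is_weekend": (dates.dayofweek >= 5).astype(int),  # Monday = 0, so 5/6 = Sat/Sun
})
print(len(dim_date), "rows")                 # expect 365
print(dim_date["is_weekend"].sum(), "weekend days")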
2. Fact table and join queries
```sql
-- Order fact table
CREATE TABLE IF NOT EXISTS fact.fact_order (
    order_id STRING COMMENT 'order ID',
    user_id STRING COMMENT 'user ID',
    item_id STRING COMMENT 'item ID',
    order_amount DECIMAL(10,2) COMMENT 'order amount',
    pay_amount DECIMAL(10,2) COMMENT 'paid amount',
    order_dt STRING COMMENT 'order date',
    pay_dt STRING COMMENT 'payment date'
)
STORED AS PARQUET
COMMENT 'order fact table';

-- Join query: Top-5 categories by sales in Q4 2025
-- (rank is a reserved word, so the alias is sales_rank; the ORDER BY makes LIMIT 5 a true Top 5)
SELECT
    d.category_name,
    SUM(f.order_amount) AS total_sales,
    RANK() OVER (ORDER BY SUM(f.order_amount) DESC) AS sales_rank
FROM fact.fact_order f
JOIN dim.dim_item d ON f.item_id = d.item_id
JOIN dim.dim_date dt ON f.order_dt = dt.dt
WHERE dt.quarter = 4 AND dt.year = 2025
GROUP BY d.category_name
ORDER BY total_sales DESC
LIMIT 5;
```
V. Recommender System in Practice (Spark MLlib Collaborative Filtering)
Scenario: item recommendations based on user behavior
```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, explode, when

# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("ItemRecommendationSystem") \
    .enableHiveSupport() \
    .getOrCreate()

# 1. Load user-item behavior data and turn it into a rating matrix
# Behavior weights: click=1, collect=2, cart=3, pay=4
# Note: ALS requires numeric IDs. The casts below assume user_id/item_id are numeric
# strings; IDs with prefixes like "user_123" must be stripped (or run through
# StringIndexer) first, otherwise the cast yields NULL.
behavior_df = spark.table("dwd.user_behavior_detail") \
    .select(
        col("user_id").cast("int").alias("user"),
        col("item_id").cast("int").alias("item"),
        when(col("behavior_type") == "click", 1)
            .when(col("behavior_type") == "collect", 2)
            .when(col("behavior_type") == "cart", 3)
            .when(col("behavior_type") == "pay", 4)
            .alias("rating")
    ) \
    .filter(col("rating").isNotNull())

# 2. Train/test split
train_df, test_df = behavior_df.randomSplit([0.8, 0.2], seed=42)

# 3. Train the ALS collaborative-filtering model
als = ALS(
    maxIter=10,
    regParam=0.01,
    userCol="user",
    itemCol="item",
    ratingCol="rating",
    coldStartStrategy="drop"  # drop predictions for unseen users/items
)
model = als.fit(train_df)

# 4. Evaluate the model
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"model RMSE: {rmse}")

# 5. Recommend the Top-10 items for one user
user_id = 100  # target user ID
user_df = spark.createDataFrame([(user_id,)], ["user"])
recommendations = model.recommendForUserSubset(user_df, 10)

# Flatten the recommendation structs
recommend_df = recommendations \
    .select(
        col("user"),
        explode(col("recommendations")).alias("rec")
    ) \
    .select(
        col("user").alias("user_id"),
        col("rec.item").alias("item_id"),
        col("rec.rating").alias("predicted_rating")
    )

# Join in the item attributes
item_df = spark.table("dim.dim_item").select(
    col("item_id").cast("int"),
    col("item_name"),
    col("category_name"),
    col("price")
)
final_recommend = recommend_df.join(item_df, "item_id") \
    .orderBy(col("predicted_rating").desc())

print(f"Top-10 recommendations for user {user_id}:")
final_recommend.show(truncate=False)
spark.stop()
```
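In production, recommendations are usually precomputed in batch and pushed to a low-latency store for the serving layer. Since the stack already lists Redis for caching, here is a hedged sketch: the model variable is the fitted ALS model from the script above, Redis itself is not in the compose file (run one separately, e.g. docker run -p 6379:6379 redis), and the rec:user:{id} key layout is an arbitrary choice:

```python
# push_recs_to_redis.py: cache per-user Top-10 lists for online serving
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)

# model is the fitted ALS model from the training script above
all_recs = model.recommendForAllUsers(10).collect()
for row in all_recs:
    items = [{"item_id": rec.item, "score": round(rec.rating, 4)} for rec in row.recommendations]
    # one JSON blob per user, expiring after a day so stale lists age out
    r.set(f"rec:user:{row.user}", json.dumps(items), ex=86400)

print(r.get("rec:user:100"))  # spot-check one user's cached list
```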
VI. Visualization (Superset + ECharts)
1. Superset data connection and dashboard setup
(1) Connecting Superset to Hive
```python
# superset_config.py
# Superset keeps its own metadata in SQLALCHEMY_DATABASE_URI; Hive is
# registered as a *data source*, not here in the config file.
SQLALCHEMY_DATABASE_URI = 'sqlite:////superset/superset.db'

# To register Hive as a data source, open Settings -> Database Connections in
# the Superset UI (or use the superset CLI) and supply a PyHive SQLAlchemy URI:
#   hive://hive@hive:10000/default?auth=NOSASL
# ("hive" after the @ is the docker-compose service name; the dialect comes
#  from pip install pyhive thrift)
```
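Before registering the source in Superset it is worth verifying that the URI actually connects. A small sketch using the same PyHive dialect (assumes pip install pyhive thrift sqlalchemy; when testing from the host rather than inside the Superset container, use localhost:10000 instead of the service name):

```python
# check_hive_uri.py: verify the SQLAlchemy URI Superset will use
from sqlalchemy import create_engine, text

engine = create_engine("hive://hive@localhost:10000/default?auth=NOSASL")
with engine.connect() as conn:
    for (db,) in conn.execute(text("SHOW DATABASES")):
        print(db)  # expect ods / dwd / dws / dim / fact among the results
```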
(2) ECharts real-time visualization (HTML)
```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Big Data Real-Time Monitoring Dashboard</title>
  <script src="https://cdn.bootcdn.net/ajax/libs/echarts/5.4.3/echarts.min.js"></script>
  <script src="https://cdn.bootcdn.net/ajax/libs/socket.io/4.5.1/socket.io.min.js"></script>
</head>
<body>
  <div id="dau-chart" style="width: 800px; height: 400px;"></div>
  <div id="top10-item-chart" style="width: 800px; height: 400px;"></div>
  <script>
    // Initialize the ECharts instances
    const dauChart = echarts.init(document.getElementById('dau-chart'));
    const top10Chart = echarts.init(document.getElementById('top10-item-chart'));
    // DAU trend chart options
    const dauOption = {
      title: { text: 'Real-Time DAU Trend' },
      xAxis: { type: 'category', data: [] },
      yAxis: { type: 'value' },
      series: [{ name: 'DAU', type: 'line', data: [] }]
    };
    // Top-10 hot items chart options
    const top10Option = {
      title: { text: 'Real-Time Top-10 Hot Items' },
      xAxis: { type: 'value' },
      yAxis: { type: 'category', data: [] },
      series: [{ name: 'clicks', type: 'bar', data: [] }]
    };
    // Connect to the realtime feed: Flink output relayed over Socket.IO
    // (see the gateway sketch after this block)
    const socket = io('http://localhost:3000');
    // DAU updates
    socket.on('dau_data', (data) => {
      dauOption.xAxis.data.push(data.dt);
      dauOption.series[0].data.push(data.dau);
      // keep only the most recent 10 points
      if (dauOption.xAxis.data.length > 10) {
        dauOption.xAxis.data.shift();
        dauOption.series[0].data.shift();
      }
      dauChart.setOption(dauOption);
    });
    // Top-10 item updates
    socket.on('top10_item', (data) => {
      top10Option.yAxis.data = data.map(item => item.item_name);
      top10Option.series[0].data = data.map(item => item.click_count);
      top10Chart.setOption(top10Option);
    });
    // Resize the charts with the window
    window.addEventListener('resize', () => {
      dauChart.resize();
      top10Chart.resize();
    });
  </script>
</body>
</html>
```
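The page above assumes something is pushing dau_data and top10_item events over Socket.IO on port 3000. A hedged Python stand-in for that gateway (python-socketio + aiohttp; this sketch emits mock payloads on a timer, whereas a real deployment would read the Flink job's output, e.g. from a Kafka sink topic, instead):

```python
# socketio_gateway.py: forward (mock) realtime metrics to the ECharts page
# pip install python-socketio aiohttp
import asyncio
import random
import time

import socketio
from aiohttp import web

sio = socketio.AsyncServer(cors_allowed_origins="*")
app = web.Application()
sio.attach(app)

async def push_metrics():
    while True:
        # in a real deployment, replace these mocks with the Flink job's output
        await sio.emit("dau_data", {"dt": time.strftime("%H:%M:%S"), "dau": random.randint(800, 1200)})
        await sio.emit("top10_item", [
            {"item_name": f"item_{i}", "click_count": random.randint(50, 500)}
            for i in range(10)
        ])
        await asyncio.sleep(5)

async def start_background(app):
    sio.start_background_task(push_metrics)

app.on_startup.append(start_background)

if __name__ == "__main__":
    web.run_app(app, port=3000)
```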
VII. Running and Verifying the Project
1. Execution order
- Start the Docker environment: docker-compose up -d
- Run the Hive warehouse DDL scripts
- Run the offline job (Spark SQL)
- Start the Kafka producer to generate mock data
- Submit the Flink real-time job
- Train the recommendation model
- Deploy Superset and wire up the ECharts dashboard
2. Key verification metrics
- Offline: accuracy of the DWS-layer DAU and paying-user counts (a basic consistency check is sketched below)
- Real-time: Flink job throughput (QPS), latency (target: seconds), and Top-10 refresh cadence
- Recommendation: RMSE (target < 1.0) and click-through rate on recommended items
- Visualization: dashboard refresh timeliness and chart completeness
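As a first-pass offline check, a small PySpark sketch that asserts the DWS numbers are at least internally consistent (paying users can never exceed DAU; the date is a placeholder):

```python
# verify_dws.py: basic consistency checks on the DWS layer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VerifyDws").enableHiveSupport().getOrCreate()

row = spark.table("dws.user_daily_stats").filter("dt = '2025-11-28'").first()
assert row is not None, "no DWS row for the target date"
assert row.pu <= row.dau, f"paying users ({row.pu}) exceed DAU ({row.dau})"
assert row.total_order_amount >= 0, "negative order amount"
print(f"dt={row.dt} dau={row.dau} pu={row.pu} amount={row.total_order_amount}: OK")
spark.stop()
```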
With the code above you can stand up a full-pipeline big data project covering offline processing, real-time computation, warehousing, recommendation, and visualization. The setup adapts to e-commerce, finance, and other verticals, and serves as a practical reference for enterprise-grade big data development.