51CTO | Big Data Project in Practice, Part 3 (Complete): Offline | Real-Time | Data Warehouse | Recommendation System | Data Visualization


This article walks through the core scenarios of a big data project: offline data processing, real-time computation, data warehouse construction, recommendation system development, and visualization. Every stage comes with runnable code samples to help you get started with enterprise-grade big data development.

I. Environment Setup and Technology Stack

Core technology stack

  • Offline processing: Hadoop (HDFS + MapReduce), Spark SQL, Hive
  • Real-time computing: Flink, Kafka
  • Data warehouse: layered Hive warehouse, dimensional modeling
  • Recommendation system: Spark MLlib, collaborative filtering
  • Visualization: Superset, ECharts
  • Supporting tools: MySQL (metadata store), Redis (cache), Docker (deployment)

Quick environment deployment (Docker Compose)

yaml

# docker-compose.yml — one-command big data environment
version: '3'
services:
  hadoop:
    image: sequenceiq/hadoop-docker:2.7.1
    ports:
      - "50070:50070"
      - "8088:8088"
    command: /etc/bootstrap.sh -d
  hive:
    image: sequenceiq/hive:1.1.0
    ports:
      - "10000:10000"
    depends_on:
      - hadoop
    environment:
      - HIVE_CONF_DIR=/usr/local/hive/conf
  kafka:
    image: confluentinc/cp-kafka:7.0.0
    ports:
      - "9092:9092"
    depends_on:
      - zookeeper
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.0
    ports:
      - "2181:2181"
    environment:
      - ZOOKEEPER_CLIENT_PORT=2181
  flink-jobmanager:
    image: flink:1.14.0
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=flink-jobmanager
  flink-taskmanager:
    image: flink:1.14.0
    depends_on:
      - flink-jobmanager
    command: taskmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=flink-jobmanager
  superset:
    image: apache/superset:2.0.0
    ports:
      - "8089:8088"  # host port 8089 avoids clashing with the YARN UI on 8088
    depends_on:
      - mysql
    environment:
      - SUPERSET_SECRET_KEY=your-secret-key
  mysql:
    image: mysql:8.0
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=root123
      - MYSQL_DATABASE=bigdata_project

Start the stack:

bash

docker-compose up -d

II. Offline Data Processing (Spark SQL + Hive)

Scenario: offline user-behavior statistics (DAU, retention, spending analysis)

1. Layered Hive warehouse tables

sql

-- Create the layer databases first
CREATE DATABASE IF NOT EXISTS ods;
CREATE DATABASE IF NOT EXISTS dwd;
CREATE DATABASE IF NOT EXISTS dws;

-- ODS layer (raw data)
CREATE EXTERNAL TABLE IF NOT EXISTS ods.user_behavior (
    user_id STRING,
    item_id STRING,
    category_id STRING,
    behavior_type STRING,
    ts BIGINT
)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/ods/user_behavior';

-- DWD layer (detail data)
CREATE TABLE IF NOT EXISTS dwd.user_behavior_detail (
    user_id STRING,
    item_id STRING,
    category_id STRING,
    behavior_type STRING,
    dt STRING,
    hour STRING
)
STORED AS PARQUET;

-- DWS layer (aggregated data)
CREATE TABLE IF NOT EXISTS dws.user_daily_stats (
    dt STRING,
    dau BIGINT,  -- daily active users
    pu BIGINT,  -- paying users
    total_order_amount DECIMAL(10,2)  -- total order amount for the day
)
)
STORED AS PARQUET;

2. Spark SQL offline job

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, date_format, countDistinct, sum, when

# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("OfflineUserBehaviorStats") \
    .enableHiveSupport() \
    .getOrCreate()

# 1. Load ODS data and clean it into the DWD layer
ods_df = spark.table("ods.user_behavior")
dwd_df = ods_df.withColumn("dt", date_format(from_unixtime(ods_df.ts / 1000), "yyyy-MM-dd")) \
    .withColumn("hour", date_format(from_unixtime(ods_df.ts / 1000), "HH")) \
    .select("user_id", "item_id", "category_id", "behavior_type", "dt", "hour")

# Write to the DWD layer
dwd_df.write.mode("overwrite").insertInto("dwd.user_behavior_detail")

# 2. DWS aggregation (per-day statistics)
dws_df = dwd_df.filter(dwd_df.dt == "2025-11-28") \
    .groupBy("dt") \
    .agg(
        countDistinct("user_id").alias("dau"),
        countDistinct(when(dwd_df.behavior_type == "pay", dwd_df.user_id)).alias("pu"),
        sum(when(dwd_df.behavior_type == "pay", 100).otherwise(0)).alias("total_order_amount")  # simulated flat amount of 100 per payment
    )

# Write to the DWS layer
dws_df.write.mode("overwrite").insertInto("dws.user_daily_stats")

# Verify the output
dws_df.show()

spark.stop()

Submit the job:

bash

spark-submit --master yarn --deploy-mode cluster offline_stats.py
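
The scenario above also mentions retention, which the job does not compute. As a complement, here is a minimal pure-Python sketch of the two metrics: DAU as a distinct count of users, and next-day retention as the overlap between two days' active-user sets. The event dicts are illustrative stand-ins for the ODS rows, not the real schema.

```python
def dau(events):
    """Daily active users: number of distinct user_ids among a day's events."""
    return len({e["user_id"] for e in events})

def next_day_retention(day1_events, day2_events):
    """Share of day-1 active users who are also active on day 2."""
    d1 = {e["user_id"] for e in day1_events}
    d2 = {e["user_id"] for e in day2_events}
    return len(d1 & d2) / len(d1) if d1 else 0.0

day1 = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u2"}]
day2 = [{"user_id": "u2"}, {"user_id": "u3"}]
print(dau(day1))                       # 2
print(next_day_retention(day1, day2))  # 0.5
```

In Spark the same logic would be a `countDistinct` plus a self-join of `dwd.user_behavior_detail` on consecutive `dt` values.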

III. Real-Time Processing (Kafka + Flink)

Scenario: real-time monitoring of the user-behavior stream (Top 10 hot items, real-time alerting)

1. Kafka producer for simulated data

python

# kafka_producer.py — emit simulated user-behavior events
from kafka import KafkaProducer
import json
import random
import time

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Pools of simulated users, items, and behaviors
user_ids = [f"user_{i}" for i in range(1000)]
item_ids = [f"item_{i}" for i in range(10000)]
behavior_types = ["click", "collect", "cart", "pay"]

while True:
    data = {
        "user_id": random.choice(user_ids),
        "item_id": random.choice(item_ids),
        "category_id": f"cat_{random.randint(1, 20)}",
        "behavior_type": random.choice(behavior_types),
        "ts": int(time.time() * 1000)
    }
    producer.send("user_behavior_topic", value=data)
    print(f"sent: {data}")
    time.sleep(0.1)  # ~10 events per second
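
To make the producer testable, the event construction can be factored out of the send loop. A small sketch, with field names taken from the schema above (the `rng` parameter is an illustrative hook for deterministic tests):

```python
import random
import time

def make_event(rng=random):
    """Build one simulated user-behavior event matching the ODS schema."""
    return {
        "user_id": f"user_{rng.randint(0, 999)}",
        "item_id": f"item_{rng.randint(0, 9999)}",
        "category_id": f"cat_{rng.randint(1, 20)}",
        "behavior_type": rng.choice(["click", "collect", "cart", "pay"]),
        "ts": int(time.time() * 1000),  # millisecond epoch, as the Flink job expects
    }

event = make_event()
print(event)
```

The producer loop then reduces to `producer.send("user_behavior_topic", value=make_event())`.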

2. Flink real-time job

java

// FlinkRealTimeStats.java — real-time Top 10 hot items
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// User-behavior POJO
class UserBehavior {
    private String user_id;
    private String item_id;
    private String category_id;
    private String behavior_type;
    private long ts;

    // getters/setters/toString omitted
    public static UserBehavior fromJson(String json) {
        return com.alibaba.fastjson.JSON.parseObject(json, UserBehavior.class);
    }
}

// Per-item click-count aggregator
class ItemClickCounter implements AggregateFunction<UserBehavior, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(UserBehavior value, Long accumulator) {
        // count click events only
        return value.getBehavior_type().equals("click") ? accumulator + 1 : accumulator;
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}

public class FlinkRealTimeStats {
    public static void main(String[] args) throws Exception {
        // 1. Set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Kafka consumer configuration
        Properties kafkaProps = new Properties();
        kafkaProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaProps.put(ConsumerConfig.GROUP_ID_CONFIG, "flink_realtime_group");
        kafkaProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        // 3. Read the Kafka stream
        DataStream<String> kafkaStream = env.addSource(
            new FlinkKafkaConsumer<>("user_behavior_topic", new SimpleStringSchema(), kafkaProps)
        );

        // 4. Parse JSON and assign event-time watermarks
        DataStream<UserBehavior> behaviorStream = kafkaStream
            .map(UserBehavior::fromJson)
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<UserBehavior>forMonotonousTimestamps()
                    .withTimestampAssigner((behavior, timestamp) -> behavior.getTs())
            );

        // 5. Key by item id
        KeyedStream<UserBehavior, String> itemKeyedStream = behaviorStream
            .keyBy((KeySelector<UserBehavior, String>) UserBehavior::getItem_id);

        // 6. Sliding window: 1-minute window, sliding every 30 seconds
        WindowedStream<UserBehavior, String, TimeWindow> windowStream = itemKeyedStream
            .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(30)));

        // 7. Aggregate per-item click counts; the window function attaches the item id
        DataStream<Tuple2<String, Long>> itemClickStream = windowStream
            .aggregate(new ItemClickCounter(),
                new WindowFunction<Long, Tuple2<String, Long>, String, TimeWindow>() {
                    @Override
                    public void apply(String itemId, TimeWindow window,
                                      Iterable<Long> counts, Collector<Tuple2<String, Long>> out) {
                        out.collect(new Tuple2<>(itemId, counts.iterator().next()));
                    }
                });

        // 8. Re-key everything onto one task and compute the global Top 10
        itemClickStream.keyBy(t -> "global")
            .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(30)))
            .process(new TopNProcessFunction(10))
            .print("Top 10 hot items:");

        // Execute the job
        env.execute("UserBehaviorRealTimeStats");
    }

    // Top-N window function: collect the window's (item, count) pairs, sort, keep the first N
    static class TopNProcessFunction extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
        private int topSize;
        public TopNProcessFunction(int topSize) { this.topSize = topSize; }
        @Override
        public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> out) {
            // collect and sort descending by click count
            List<Tuple2<String, Long>> list = new ArrayList<>();
            for (Tuple2<String, Long> elem : elements) list.add(elem);
            list.sort((a, b) -> b.f1.compareTo(a.f1));
            StringBuilder result = new StringBuilder();
            result.append("Window: ").append(context.window().getStart()).append(" - ").append(context.window().getEnd()).append("\n");
            for (int i = 0; i < Math.min(topSize, list.size()); i++) {
                result.append("#").append(i + 1).append(": ").append(list.get(i).f0).append(" clicks=").append(list.get(i).f1).append("\n");
            }
            out.collect(result.toString());
        }
    }
}
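
The sorting step inside `TopNProcessFunction` is easy to prototype outside Flink. A pure-Python sketch of the same logic, using a heap instead of a full sort; the input is the `(item_id, clicks)` pairs the window function emits:

```python
import heapq

def top_n(item_clicks, n=10):
    """Return the n (item_id, clicks) pairs with the highest click counts."""
    # heapq.nlargest is equivalent to sorted(..., reverse=True)[:n],
    # so ties keep their input order, matching a stable descending sort
    return heapq.nlargest(n, item_clicks, key=lambda pair: pair[1])

clicks = [("item_1", 40), ("item_2", 75), ("item_3", 12), ("item_4", 75)]
print(top_n(clicks, 2))  # [('item_2', 75), ('item_4', 75)]
```

For large windows the heap variant is O(m log n) instead of O(m log m), which matters when a window holds many distinct items.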

IV. Data Warehouse Construction (Hive Layers + Dimensional Modeling)

Scenario: e-commerce warehouse modeling (fact table + dimension tables)

1. Dimension tables

sql

-- Create the dimension database first
CREATE DATABASE IF NOT EXISTS dim;

-- User dimension table
CREATE TABLE IF NOT EXISTS dim.dim_user (
    user_id STRING COMMENT 'user ID',
    user_name STRING COMMENT 'user name',
    gender STRING COMMENT 'gender',
    age INT COMMENT 'age',
    register_dt STRING COMMENT 'registration date',
    phone STRING COMMENT 'phone number'
)
STORED AS PARQUET
COMMENT 'user dimension table';

-- Item dimension table
CREATE TABLE IF NOT EXISTS dim.dim_item (
    item_id STRING COMMENT 'item ID',
    item_name STRING COMMENT 'item name',
    category_id STRING COMMENT 'category ID',
    category_name STRING COMMENT 'category name',
    price DECIMAL(10,2) COMMENT 'item price',
    create_dt STRING COMMENT 'creation date'
)
STORED AS PARQUET
COMMENT 'item dimension table';

-- Date dimension table (populated via SQL)
CREATE TABLE IF NOT EXISTS dim.dim_date (
    dt STRING COMMENT 'date (yyyy-MM-dd)',
    year INT COMMENT 'year',
    month INT COMMENT 'month',
    day INT COMMENT 'day of month',
    week INT COMMENT 'ISO week number',
    quarter INT COMMENT 'quarter',
    is_weekend INT COMMENT 'weekend flag (0 = no, 1 = yes)'
)
STORED AS PARQUET
COMMENT 'date dimension table';

-- Populate the date dimension for all of 2025
INSERT OVERWRITE TABLE dim.dim_date
SELECT 
    date_add('2025-01-01', pos) AS dt,
    year(date_add('2025-01-01', pos)) AS year,
    month(date_add('2025-01-01', pos)) AS month,
    day(date_add('2025-01-01', pos)) AS day,
    weekofyear(date_add('2025-01-01', pos)) AS week,
    quarter(date_add('2025-01-01', pos)) AS quarter,
    CASE WHEN weekday(date_add('2025-01-01', pos)) >= 5 THEN 1 ELSE 0 END AS is_weekend
FROM (
    SELECT posexplode(split(space(datediff('2025-12-31', '2025-01-01')), ' ')) AS (pos, val)
) t;
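
The same dimension can be generated outside Hive, which is handy for cross-checking the SQL above. A sketch with Python's `datetime`; `isocalendar()` matches Hive's `weekofyear` (ISO weeks), and `weekday()` uses Monday = 0 just like Hive's `weekday()`:

```python
from datetime import date, timedelta

def build_date_dim(start, end):
    """Yield date-dimension rows from start to end, inclusive."""
    d = start
    while d <= end:
        yield {
            "dt": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "week": d.isocalendar()[1],          # ISO week number
            "quarter": (d.month - 1) // 3 + 1,
            "is_weekend": 1 if d.weekday() >= 5 else 0,  # Sat/Sun
        }
        d += timedelta(days=1)

rows = list(build_date_dim(date(2025, 1, 1), date(2025, 12, 31)))
print(len(rows))  # 365
```

Comparing a few sampled rows against `SELECT * FROM dim.dim_date` is a quick sanity check on the `posexplode` trick.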

2. Fact table and join query

sql

-- Create the fact database first
CREATE DATABASE IF NOT EXISTS fact;

-- Order fact table
CREATE TABLE IF NOT EXISTS fact.fact_order (
    order_id STRING COMMENT 'order ID',
    user_id STRING COMMENT 'user ID',
    item_id STRING COMMENT 'item ID',
    order_amount DECIMAL(10,2) COMMENT 'order amount',
    pay_amount DECIMAL(10,2) COMMENT 'paid amount',
    order_dt STRING COMMENT 'order date',
    pay_dt STRING COMMENT 'payment date'
)
STORED AS PARQUET
COMMENT 'order fact table';

-- Join query: Top 5 categories by sales amount in 2025 Q4
-- (rank is a reserved word in Hive, so the column is named sales_rank)
SELECT 
    d.category_name,
    SUM(f.order_amount) AS total_sales,
    RANK() OVER (ORDER BY SUM(f.order_amount) DESC) AS sales_rank
FROM fact.fact_order f
JOIN dim.dim_item d ON f.item_id = d.item_id
JOIN dim.dim_date dt ON f.order_dt = dt.dt
WHERE dt.quarter = 4 AND dt.year = 2025
GROUP BY d.category_name
ORDER BY total_sales DESC
LIMIT 5;

V. Recommendation System (Spark MLlib Collaborative Filtering)

Scenario: item recommendation based on user behavior

python

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, when, explode, regexp_extract

# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("ItemRecommendationSystem") \
    .enableHiveSupport() \
    .getOrCreate()

# 1. Load behavior data and turn it into an implicit rating matrix
# Behavior weights: click=1, collect=2, cart=3, pay=4
# ALS needs integer ids, so extract the numeric suffix of "user_N" / "item_N"
behavior_df = spark.table("dwd.user_behavior_detail") \
    .select(
        regexp_extract(col("user_id"), r"(\d+)", 1).cast("int").alias("user"),
        regexp_extract(col("item_id"), r"(\d+)", 1).cast("int").alias("item"),
        when(col("behavior_type") == "click", 1)
            .when(col("behavior_type") == "collect", 2)
            .when(col("behavior_type") == "cart", 3)
            .when(col("behavior_type") == "pay", 4)
            .alias("rating")
    ) \
    .filter(col("rating").isNotNull())

# 2. Split into training and test sets
train_df, test_df = behavior_df.randomSplit([0.8, 0.2], seed=42)

# 3. Train the ALS collaborative-filtering model
als = ALS(
    maxIter=10,
    regParam=0.01,
    userCol="user",
    itemCol="item",
    ratingCol="rating",
    coldStartStrategy="drop"  # drop predictions for users/items unseen in training
)
model = als.fit(train_df)

# 4. Evaluate the model
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"Model RMSE: {rmse}")

# 5. Recommend the Top 10 items for a given user
user_id = 100  # target user ID
user_df = spark.createDataFrame([(user_id,)], ["user"])
recommendations = model.recommendForUserSubset(user_df, 10)

# Flatten the recommendation structs
recommend_df = recommendations \
    .select(
        col("user"),
        explode(col("recommendations")).alias("rec")
    ) \
    .select(
        col("user").alias("user_id"),
        col("rec.item").alias("item_id"),
        col("rec.rating").alias("predicted_rating")
    )

# Join item attributes (same numeric-id extraction as above so the join keys match)
item_df = spark.table("dim.dim_item").select(
    regexp_extract(col("item_id"), r"(\d+)", 1).cast("int").alias("item_id"),
    col("item_name"),
    col("category_name"),
    col("price")
)
final_recommend = recommend_df.join(item_df, "item_id") \
    .orderBy(col("predicted_rating").desc())

print(f"Top 10 recommendations for user {user_id}:")
final_recommend.show(truncate=False)

spark.stop()
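
The RMSE reported by `RegressionEvaluator` is just the root mean squared error over (rating, prediction) pairs, so it is easy to sanity-check by hand. A pure-Python sketch on a toy sample:

```python
import math

def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

pairs = [(4.0, 3.5), (1.0, 1.5), (3.0, 3.0)]
print(round(rmse(pairs), 4))  # 0.4082
```

A useful habit: collect a handful of `(rating, prediction)` rows from `predictions` and confirm this function reproduces the evaluator's number before trusting a tuning run.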

VI. Visualization (Superset + ECharts)

1. Superset data connection and dashboard setup

(1) Connecting Superset to the Hive data source

python

# superset_config.py
# Superset keeps its own metadata in the database pointed to by
# SQLALCHEMY_DATABASE_URI; Hive is registered as a *data source*, not here.
SQLALCHEMY_DATABASE_URI = 'sqlite:////superset/superset.db'

# Register Hive from the UI (Data -> Databases -> + Database) with a
# SQLAlchemy URI such as the following (requires the pyhive package):
#
#   hive://hive@hive:10000/default?auth=NOSASL

(2) ECharts real-time dashboard (HTML)

html

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Real-Time Big Data Dashboard</title>
    <script src="https://cdn.bootcdn.net/ajax/libs/echarts/5.4.3/echarts.min.js"></script>
    <script src="https://cdn.bootcdn.net/ajax/libs/socket.io/4.5.1/socket.io.min.js"></script>
</head>
<body>
    <div id="dau-chart" style="width: 800px; height: 400px;"></div>
    <div id="top10-item-chart" style="width: 800px; height: 400px;"></div>

    <script>
        // Initialize the ECharts instances
        const dauChart = echarts.init(document.getElementById('dau-chart'));
        const top10Chart = echarts.init(document.getElementById('top10-item-chart'));

        // DAU trend chart options
        const dauOption = {
            title: { text: 'Real-Time DAU Trend' },
            xAxis: { type: 'category', data: [] },
            yAxis: { type: 'value' },
            series: [{ name: 'DAU', type: 'line', data: [] }]
        };

        // Top 10 hot items chart options
        const top10Option = {
            title: { text: 'Top 10 Hot Items (Real-Time)' },
            xAxis: { type: 'value' },
            yAxis: { type: 'category', data: [] },
            series: [{ name: 'Clicks', type: 'bar', data: [] }]
        };

        // Connect to the Flink output (assumes a Node.js Socket.IO forwarder relays the results)
        const socket = io('http://localhost:3000');

        // DAU updates
        socket.on('dau_data', (data) => {
            dauOption.xAxis.data.push(data.dt);
            dauOption.series[0].data.push(data.dau);
            // keep only the 10 most recent points
            if (dauOption.xAxis.data.length > 10) {
                dauOption.xAxis.data.shift();
                dauOption.series[0].data.shift();
            }
            dauChart.setOption(dauOption);
        });

        // Top-10 item updates
        socket.on('top10_item', (data) => {
            top10Option.yAxis.data = data.map(item => item.item_name);
            top10Option.series[0].data = data.map(item => item.click_count);
            top10Chart.setOption(top10Option);
        });

        // Resize the charts with the window
        window.addEventListener('resize', () => {
            dauChart.resize();
            top10Chart.resize();
        });
    </script>
</body>
</html>
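
The page above expects each `top10_item` message to be a list of `{item_name, click_count}` objects. The forwarder between Flink and Socket.IO is an assumption (not shown), but the payload shaping it must perform can be sketched in Python:

```python
import json

def to_top10_payload(top_items):
    """Convert (item_name, clicks) pairs into the JSON the dashboard consumes."""
    return json.dumps(
        [{"item_name": name, "click_count": clicks} for name, clicks in top_items]
    )

payload = to_top10_payload([("item_2", 75), ("item_1", 40)])
print(payload)
```

Whatever transport you choose, keeping the message schema identical to what `socket.on('top10_item', ...)` reads is the only hard requirement.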

VII. Running and Verifying the Project

1. Execution order

  1. Start the Docker environment: docker-compose up -d
  2. Run the Hive DDL scripts to create the warehouse tables
  3. Run the offline Spark SQL job
  4. Start the Kafka producer to simulate traffic
  5. Submit the Flink real-time job
  6. Train the recommendation model
  7. Deploy Superset and wire up the ECharts dashboard

2. Key verification metrics

  • Offline: accuracy of the DWS-layer DAU and paying-user counts
  • Real-time: Flink job throughput (QPS), latency (seconds or better), Top-10 refresh cadence
  • Recommendation: RMSE (target < 1.0), click-through rate of recommended items
  • Visualization: dashboard refresh timeliness, chart completeness

With the code above you can quickly stand up an end-to-end big data project covering offline processing, real-time computation, the data warehouse, recommendation, and visualization, adaptable to e-commerce, finance, and other domains, and use it as a practical reference for enterprise-grade big data development.