尚硅谷 "Online Medical Consultation Offline Data Warehouse" Project Walkthrough: An End-to-End Implementation from Data Ingestion to Analytics Applications
I. Project Architecture and Technology Stack
1. Overall Architecture
Data flow architecture:
Business databases → data ingestion → ODS layer → DWD layer → DWS layer → ADS layer → data applications
2. Core Component Versions
| Component | Version | Purpose |
|---|---|---|
| Hadoop | 3.1.3 | Distributed storage and compute |
| Hive | 3.1.2 | Data warehouse management |
| Sqoop | 1.4.7 | Relational data import |
| Flume | 1.9.0 | Log data collection |
| Azkaban | 3.90.0 | Workflow scheduling |
| Presto | 0.245.1 | Ad-hoc querying |
II. Data Ingestion and ODS Layer Construction
1. Structured Data Import (Sqoop)
# Import the consultation record table from MySQL
sqoop import \
--connect jdbc:mysql://mysql01:3306/medical \
--username root \
--password 123456 \
--table consultation \
--fields-terminated-by '\t' \
--delete-target-dir \
--target-dir /medical/ods/consultation/dt=${date} \
--num-mappers 4 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
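The imported files can then be mapped into Hive as an ODS external table. Below is a minimal sketch, assuming a column list and location consistent with the import above; it is not the course's exact DDL:
-- Hypothetical ODS external table over the Sqoop output directory
CREATE EXTERNAL TABLE IF NOT EXISTS ods_consultation (
  id            BIGINT,
  patient_id    BIGINT,
  doctor_id     BIGINT,
  department_id BIGINT,
  start_time    STRING,
  end_time      STRING,
  fee           DECIMAL(16,2),
  status        INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/medical/ods/consultation';
-- Register the newly imported partition
ALTER TABLE ods_consultation ADD IF NOT EXISTS PARTITION (dt='${date}')
LOCATION '/medical/ods/consultation/dt=${date}';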
2. Log Data Collection (Flume Configuration)
# flume-web-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Channel configuration
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /medical/ods/web_log/dt=%Y-%m-%d
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
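The log files Flume lands on HDFS can likewise be exposed through an ODS external table. The following is a rough sketch under the assumption that each record is kept as a single raw line; the table name and layout are illustrative:
-- Hypothetical ODS table over the Flume sink directory, one raw log line per row
CREATE EXTERNAL TABLE IF NOT EXISTS ods_web_log (
  line STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/medical/ods/web_log';
-- Pick up the dt=... directories that Flume has created
MSCK REPAIR TABLE ods_web_log;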
III. Data Cleansing and the DWD Layer
1. Cleansing Consultation Records (HQL Example)
-- Cleanse records into the consultation fact table
INSERT OVERWRITE TABLE dwd_consultation_fact PARTITION(dt='${date}')
SELECT
c.id,
c.patient_id,
c.doctor_id,
c.department_id,
c.start_time,
c.end_time,
c.fee,
c.status,
p.gender AS patient_gender,
p.age AS patient_age,
d.title AS doctor_title
FROM ods_consultation c
JOIN ods_patient p ON c.patient_id = p.id
JOIN ods_doctor d ON c.doctor_id = d.id
WHERE c.dt='${date}'
AND c.status IN (2,3,4) -- filter out invalid statuses
AND c.start_time IS NOT NULL;
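For reference, the target fact table could be declared roughly as follows; this DDL is inferred from the columns selected above and is an assumption rather than the course's exact schema:
-- Hypothetical DDL for the consultation fact table populated above
CREATE TABLE IF NOT EXISTS dwd_consultation_fact (
  id             BIGINT        COMMENT 'Consultation ID',
  patient_id     BIGINT        COMMENT 'Patient ID',
  doctor_id      BIGINT        COMMENT 'Doctor ID',
  department_id  BIGINT        COMMENT 'Department ID',
  start_time     STRING        COMMENT 'Start time',
  end_time       STRING        COMMENT 'End time',
  fee            DECIMAL(16,2) COMMENT 'Consultation fee',
  status         INT           COMMENT 'Status code',
  patient_gender STRING        COMMENT 'Patient gender',
  patient_age    INT           COMMENT 'Patient age',
  doctor_title   STRING        COMMENT 'Doctor title'
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;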
2. Parsing User Behavior Logs (Custom UDF)
// UDF that extracts the value of a named parameter from a URL query string
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ParseUrlParam extends GenericUDF {
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Two string arguments are expected: the URL and the parameter name
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments[0].get() == null || arguments[1].get() == null) return null;
        String url = arguments[0].get().toString();
        String param = arguments[1].get().toString();
        try {
            // Split off the query string, then scan the key=value pairs
            String[] pairs = url.split("\\?")[1].split("&");
            for (String pair : pairs) {
                String[] kv = pair.split("=");
                if (kv[0].equals(param)) {
                    return kv.length > 1 ? kv[1] : "";
                }
            }
        } catch (Exception e) {
            return null;
        }
        return null;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "parse_url_param(" + String.join(", ", children) + ")";
    }
}
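Once packaged into a jar, the UDF can be registered and called from HQL. A minimal sketch follows; the jar path, the function name, and the dwd_web_log table with its url column are illustrative assumptions:
-- Assumed jar path and class name; adjust to the actual build artifact
ADD JAR /opt/jars/medical-udf-1.0.jar;
CREATE TEMPORARY FUNCTION parse_url_param AS 'ParseUrlParam';
-- Example: count page views per doctor_id extracted from the page URL
SELECT parse_url_param(url, 'doctor_id') AS doctor_id, COUNT(*) AS pv
FROM dwd_web_log
WHERE dt = '${date}'
GROUP BY parse_url_param(url, 'doctor_id');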
IV. Dimensional Modeling and the DWS Layer
1. Doctor Service Wide Table Design
-- Aggregated wide table for doctor service metrics
CREATE TABLE dws_doctor_service_wide(
doctor_id BIGINT COMMENT 'Doctor ID',
doctor_name STRING COMMENT 'Doctor name',
department_name STRING COMMENT 'Department name',
consultation_count BIGINT COMMENT 'Total consultations',
avg_duration DOUBLE COMMENT 'Average consultation duration (minutes)',
good_comment_rate DOUBLE COMMENT 'Positive review rate',
month_total_fee DECIMAL(16,2) COMMENT 'Monthly total fee'
) COMMENT 'Doctor service wide table'
PARTITIONED BY (dt STRING)
STORED AS PARQUET;
-- Populate the wide table
INSERT OVERWRITE TABLE dws_doctor_service_wide PARTITION(dt='${date}')
SELECT
d.id,
d.name,
dept.name,
COUNT(c.id),
AVG(UNIX_TIMESTAMP(c.end_time) - UNIX_TIMESTAMP(c.start_time))/60,
SUM(CASE WHEN e.score >= 4 THEN 1 ELSE 0 END)/COUNT(e.id),
SUM(c.fee)
FROM dim_doctor d
JOIN dim_department dept ON d.department_id = dept.id
-- Keep the partition filter in the join condition so doctors with no consultations that day are not dropped
LEFT JOIN dwd_consultation_fact c ON d.id = c.doctor_id AND c.dt='${date}'
LEFT JOIN dwd_evaluation_fact e ON c.id = e.consultation_id
GROUP BY d.id, d.name, dept.name;
2. Patient Profile Tag Computation
-- Patient tag view
CREATE VIEW patient_tags AS
SELECT
p.id,
p.name,
CASE
WHEN p.age < 18 THEN 'Minor'
WHEN p.age BETWEEN 18 AND 35 THEN 'Young adult'
WHEN p.age BETWEEN 36 AND 55 THEN 'Middle-aged'
ELSE 'Senior'
END AS age_group,
COUNT(DISTINCT c.department_id) AS consulted_departments,
SUM(c.fee) AS total_fee,
NTILE(5) OVER (ORDER BY SUM(c.fee) DESC) AS consumption_level
FROM dim_patient p
JOIN dwd_consultation_fact c ON p.id = c.patient_id
WHERE c.dt BETWEEN '${start_date}' AND '${end_date}'
GROUP BY p.id, p.name, p.age;
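As a quick illustration of how these tags could feed an ADS-layer report, the hedged query below counts patients per age group and consumption level; it is a sketch built directly on the view above:
-- Hypothetical ADS-layer summary on top of patient_tags
SELECT
  age_group,
  consumption_level,
  COUNT(*)       AS patient_count,
  SUM(total_fee) AS group_total_fee
FROM patient_tags
GROUP BY age_group, consumption_level
ORDER BY age_group, consumption_level;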
V. Data Pipeline Scheduling and Monitoring
1. Azkaban Workflow Configuration
# medical_etl.flow
nodes:
- name: import_ods_data
type: command
config:
command: sh /scripts/sqoop_import.sh ${dt}
- name: etl_dwd_layer
type: command
dependsOn: [import_ods_data]
config:
command: hive -f /hql/dwd_consultation.hql --hivevar dt=${dt}
- name: build_dws_layer
type: command
dependsOn: [etl_dwd_layer]
config:
command: hive -f /hql/dws_doctor.hql --hivevar dt=${dt}
- name: data_quality_check
type: command
dependsOn: [build_dws_layer]
config:
command: python /scripts/quality_check.py ${dt}
2. Data Quality Monitoring (Python Example)
# quality_check.py
import sys
from datetime import datetime

from pyhive import hive

def send_alert(message):
    # Placeholder: hook this into the team's actual alerting channel (mail, IM, etc.)
    print(f"[ALERT] {message}")

def check_null_rate(table, dt, threshold=0.05):
    conn = hive.connect(host='hiveserver')
    cursor = conn.cursor()
    # Fetch the table schema, skipping separator and partition-info rows
    cursor.execute(f"DESCRIBE {table}")
    columns = [row[0] for row in cursor.fetchall()
               if row[0] and not row[0].startswith('#')]
    # Check the null rate of every column in the given partition
    for col in columns:
        query = f"""
            SELECT COUNT(CASE WHEN {col} IS NULL THEN 1 END) / COUNT(1)
            FROM {table} WHERE dt='{dt}'
        """
        cursor.execute(query)
        null_rate = cursor.fetchone()[0]
        if null_rate is not None and null_rate > threshold:
            send_alert(f"Null rate of {table}.{col} is {null_rate:.2%}, above the threshold")
    cursor.close()
    conn.close()

if __name__ == "__main__":
    # Azkaban passes the partition date as the first argument; default to today
    dt = sys.argv[1] if len(sys.argv) > 1 else datetime.today().strftime('%Y-%m-%d')
    check_null_rate('dwd_consultation_fact', dt)
VI. Data Visualization Applications
1. Department Consultation Volume Analysis (Presto SQL)
-- Department ranking by consultation volume
SELECT
d.name AS department,
COUNT(*) AS consultations,
SUM(c.fee) AS total_fee,
COUNT(DISTINCT c.patient_id) AS patient_count
FROM hive.medical.dwd_consultation_fact c
JOIN hive.medical.dim_department d ON c.department_id = d.id
WHERE c.dt = date_format(current_date - interval '1' day, '%Y-%m-%d')
GROUP BY d.name
ORDER BY consultations DESC
LIMIT 10;
2. Superset Dashboard Configuration
# Example department operations dashboard configuration
dashboard:
title: Department Operations Analysis
slices:
- viz_type: pie
datasource: presto_medical
params:
metrics: [sum__consultations]
groupby: [department]
row_limit: 10
- viz_type: line
datasource: presto_medical
params:
metrics: [sum__fee]
granularity: dt
time_range: Last 30 days
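The 30-day fee trend line chart above roughly corresponds to a query of the following shape; this Presto sketch is an assumption about how the chart's metric would be computed, not an export of the actual chart:
-- Approximate query behind the 30-day fee trend line chart (assumed shape)
SELECT
  c.dt,
  SUM(c.fee) AS total_fee
FROM hive.medical.dwd_consultation_fact c
WHERE c.dt >= date_format(current_date - interval '30' day, '%Y-%m-%d')
GROUP BY c.dt
ORDER BY c.dt;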
VII. Project Lessons and Summary
1. Key Problems and Solutions
| Problem | Solution | Result |
|---|---|---|
| Data skew | Random key prefix + partial aggregation | Job runtime reduced by 67% |
| Too many small files | Small-file merging + Hive compression | Query performance improved by 40% |
| Slowly changing dimensions | Zipper (chain) table design + SCD Type 2 | 100% accuracy for historical traceability |
| Unstable data quality | Data quality monitoring framework | Issue detection time shortened by 80% |
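The zipper-table (SCD Type 2) approach mentioned above can be sketched roughly as follows, assuming a dim_doctor_zip table with start_date/end_date columns and a daily ODS partition that contains only new and changed doctor records; the table, staging table, and column names are illustrative:
-- Hypothetical zipper table refresh; results go to a staging table first, then overwrite the target
INSERT OVERWRITE TABLE dim_doctor_zip_tmp
SELECT cur.id, cur.name, cur.title, cur.department_id, cur.start_date,
       -- Close the previously open version one day before the new version takes effect
       CASE WHEN cur.end_date = '9999-12-31' AND chg.id IS NOT NULL
            THEN date_sub('${date}', 1)
            ELSE cur.end_date END AS end_date
FROM dim_doctor_zip cur
LEFT JOIN (SELECT id FROM ods_doctor WHERE dt = '${date}') chg ON cur.id = chg.id
UNION ALL
-- Append today's new/changed records as the open versions
SELECT id, name, title, department_id, '${date}', '9999-12-31'
FROM ods_doctor
WHERE dt = '${date}';

INSERT OVERWRITE TABLE dim_doctor_zip
SELECT * FROM dim_doctor_zip_tmp;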
2. Best Practice Recommendations
- Layered design principles:
  - The ODS layer keeps data in its original form
  - The DWD layer deduplicates and desensitizes detail data (see the sketch after this list)
  - The DWS layer aggregates by subject area
  - The ADS layer is optimized for downstream applications
- Performance optimization essentials:
  -- Partition pruning and related settings
  SET hive.optimize.ppd=true;   -- predicate pushdown
  SET hive.exec.parallel=true;  -- parallel stage execution
  -- Set a reasonable data volume per reducer
  SET hive.exec.reducers.bytes.per.reducer=256000000;
- Data governance standards:
  - Unified naming conventions (tables/columns/partitions)
  - Complete metadata management (data lineage + business glossary)
  - Data lifecycle management policies
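As referenced in the layered design principles above, here is a minimal sketch of DWD-layer deduplication and desensitization, assuming a hypothetical ods_payment source table with a phone_number column and a matching dwd_payment_fact target:
-- Keep only the latest record per business key and mask the phone number
INSERT OVERWRITE TABLE dwd_payment_fact PARTITION(dt='${date}')
SELECT
  id,
  patient_id,
  amount,
  -- Desensitize: keep the first 3 and last 4 digits of an 11-digit mobile number
  CONCAT(SUBSTR(phone_number, 1, 3), '****', SUBSTR(phone_number, 8)) AS phone_number,
  create_time
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY create_time DESC) AS rn
  FROM ods_payment o
  WHERE dt = '${date}'
) t
WHERE rn = 1;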
Through systematic hands-on work on this project, learners build end-to-end data warehouse skills, from business requirements analysis to production data applications, with emphasis on:
- Dimensional modeling design thinking
- Tuning skills for big data components
- Awareness of data quality assurance
- Methods for extracting value from data
Building on this project, it is recommended to extend it with real-time data warehouse capabilities (Flink + Kafka) and to explore combining AI with data analytics (e.g., a consultation quality prediction model) to form a more complete data-driven system.