1. 缓慢变化维
缓慢变化维(Slowly Changing Dimensions,SCD)是数据仓库中的一个概念。维度表中的数据(维度属性)通常不会像事实表数据那样快速变化,而是随着时间缓慢地发生改变。例如,在一个销售数据仓库中,销售事实(如销售金额、销售数量)可能每天都在大量变化,这是事实表数据。而客户维度(如客户姓名、地址、联系方式)相对变化缓慢,这些客户维度数据就是缓慢变化维。
处理维度表的历史变化信息的问题称为处理缓慢变化维的问题,简称SCD问题。处理 缓慢变化维的方法有以下几种常见方式:
- 保留原值
- 直接覆盖
- 增加新属性列
- 快照表
- 拉链表
2. 保留原始值
维度属性值不做更改,保留原始值。
如商品上架售卖时间:一个商品上架售卖后由于其他原因下架,后来又再次上架,此种情况产生了多个商品上架售卖时间。如果业务重点关注的是商品首次上架售卖时间,则采用该方式。
3. 直接覆盖
修改维度属性为最新值,直接覆盖,不保留历史信息。
如商品属于哪个品类:当商品品类发生变化时,直接重写为新品类
4. 增加新属性列
在维度表中增加新的一列,原先属性列存放上一版本的属性值,当前属性列存放当前版本的属性值,还可以增加一列记录变化的时间。
缺点:只能记录最后一次变化的信息。
5. 快照表
每天保留一份全量数据。
简单、高效。缺点是信息重复,浪费磁盘空间。
适用范围:维表不能太大
使用场景多,范围广;一般而言维表都不大。
6. 拉链表
拉链表适合于:表的数据量大,而且数据会发生新增和变化,但是大部分是不变的
(数据发生变化的百分比不大),且是缓慢变化的(如电商中用户信息表中的某些用
户基本属性不可能每天都变化)。主要目的是节省存储空间。
适用场景:
- 表的数据量大
- 表中部分字段会被更新
- 表中记录变量的比例不高
- 需要保留历史信息
维表拉链表应用案例
创建表加载数据(准备工作)
-- 用户信息
DROP TABLE IF EXISTS test.userinfo;
CREATE TABLE test.userinfo(
userid STRING COMMENT '用户编号 ',
mobile STRING COMMENT '手机号码 ',
regdate STRING COMMENT '注册日期 '
) COMMENT '用户信息 '
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';
-- 拉链表(存放用户历史信息)
-- 拉链表不是分区表;多了两个字段start_date、end_date
DROP TABLE IF EXISTS test.userhis;
CREATE TABLE test.userhis(
userid STRING COMMENT '用户编号 ',
mobile STRING COMMENT '手机号码 ',
regdate STRING COMMENT '注册日期 ',
start_date STRING,
end_date STRING
)COMMENT '用户信息拉链表 '
row format delimited fields terminated by ',';
-- 数据(/data/dw/data/userinfo.dat)
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22
001,13554444444,2020-03-01,2020-06-23
003,13574444444,2020-05-01,2020-06-23
005,13555554444,2020-06-21,2020-06-23
007,18600744444,2020-06-23,2020-06-23
008,18600844444,2020-06-23,2020-06-23
-- 静态分区数据加载(略)
/data/dw/data/userinfo0620.dat
001,13551111111,2020-03-01
002,13561111111,2020-04-01
003,13571111111,2020-05-01
004,13581111111,2020-06-01
load data local inpath '/data/dw/data/userinfo0620.dat' into table test.userinfo
partition(dt='2020-06-20');
-- 动态分区数据加载:分区的值是不固定的,由输入数据确定
-- 创建中间表(非分区表)
drop table if exists test.tmp1;
create table test.tmp1 as
select * from test.userinfo;
-- tmp1 非分区表,使用系统默认的字段分割符 '\001'
alter table test.tmp1 set serdeproperties('field.delim'=',');
-- 向中间表加载数据
load data local inpath '/data/dw/data/userinfo.dat' into table test.tmp1;
-- 从中间表向分区表加载数据
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table test.userinfo
partition(dt)
select * from test.tmp1;
拉链表的实现
- userinfo(分区表) => userid、 mobile、 regdate => 每日变更的数据(修改的+新增 的)/ 历史数据(第一天)
- userhis(拉链表)=> 多了两个字段start_date / end_date
1 userinfo 分区:(2020-06-20)
-- 1、userinfo 分区:(2020-06-20)
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
2 初始化拉链表
-- 2、初始化拉链表(2020-06-20)由分区表userinfo到拉链表userhis
insert overwrite table test.userhis
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo where dt='2020-06-20';
2.1 查看数据
select * from userhis;
001 13551111111 2020-03-01 2020-06-20 9999-12-31
004 13581111111 2020-06-01 2020-06-20 9999-12-31
003 13571111111 2020-05-01 2020-06-20 9999-12-31
002 13561111111 2020-04-01 2020-06-20 9999-12-31
3 分区表 userinfo 分区:(2020-06-21)
-- 3、userinfo 分区:(2020-06-21)
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
3.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-21') A
right join test.userhis B
on A.userid=B.userid;
userid | userid(1) | mobile | regdate | start_date | end_date |
---|---|---|---|---|---|
002 | 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 9999-12-31 |
004 | 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 9999-12-31 |
001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 | |
003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
可以发现 002 004 发生了数据变化, 001 003 没有发生变化
3.2 核心sql
-- 最终的
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then date_add('2020-06-21', -1)
else B.end_date
end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
right join test.userhis B
on A.userid=B.userid;
userid | mobile | regdate | start_date | end_date |
---|---|---|---|---|
002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
004 | 13582222222 | 2020-06-01 | 2020-06-21 | 9999-12-31 |
005 | 13552222222 | 2020-06-21 | 2020-06-21 | 9999-12-31 |
002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
3.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then date_add('2020-06-21', -1)
else B.end_date
end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
right join test.userhis B
on A.userid=B.userid;
4 分区表 userinfo 分区:(2020-06-22)
004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22
4.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-22') A
right join test.userhis B
on A.userid=B.userid;
userid | userid | mobile | regdate | start_date | end_date |
---|---|---|---|---|---|
004 | 004 | 13582222222 | 2020-06-01 | 2020-06-21 | 9999-12-31 |
004 | 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
005 | 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 9999-12-31 |
002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 | |
002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 | |
001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 | |
003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
可以发现 004 005 发生了数据变化
4.2 核心sql
-- 最终的
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from userinfo
where dt='2020-06-22'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then '2020-06-21'
else B.end_date
end as end_date
from (select * from userinfo where dt='2020-06-22') A
right join userhis B
on A.userid=B.userid;
userid | mobile | regdate | start_date | end_date |
---|---|---|---|---|
004 | 13333333333 | 2020-06-01 | 2020-06-22 | 9999-12-31 |
005 | 13533333333 | 2020-06-21 | 2020-06-22 | 9999-12-31 |
006 | 13733333333 | 2020-06-22 | 2020-06-22 | 9999-12-31 |
004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 |
004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
005 | 13552222222 | 2020-06-21 | 2020-06-21 | 2020-06-21 |
002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
4.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-22'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then date_add('2020-06-22', -1)
else B.end_date
end as end_date
from (select * from test.userinfo where dt='2020-06-22') A
right join test.userhis B
on A.userid=B.userid;
5 分区表 userinfo 分区:(2020-06-23)
001,13554444444,2020-03-01,2020-06-23
003,13574444444,2020-05-01,2020-06-23
005,13555554444,2020-06-21,2020-06-23
007,18600744444,2020-06-23,2020-06-23
008,18600844444,2020-06-23,2020-06-23
5.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-23') A
right join test.userhis B
on A.userid=B.userid;
userid | userid | mobile | regdate | start_date | end_date |
---|---|---|---|---|---|
001 | 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
003 | 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
005 | 005 | 13533333333 | 2020-06-21 | 2020-06-22 | 9999-12-31 |
005 | 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 200-06-21 |
004 | 13333333333 | 2020-06-01 | 2020-06-22 | 9999-12-31 | |
006 | 13733333333 | 2020-06-22 | 2020-06-22 | 9999-12-31 | |
004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 | |
004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 | |
002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 | |
002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
5.2 核心sql
-- 最终的
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from userinfo
where dt='2020-06-22'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then '2020-06-21'
else B.end_date
end as end_date
from (select * from userinfo where dt='2020-06-22') A
right join userhis B
on A.userid=B.userid;
userid | mobile | regdate | start_date | end_date |
---|---|---|---|---|
001 | 13554444444 | 2020-03-01 | 2020-06-23 | 9999-12-31 |
003 | 13574444444 | 2020-05-01 | 2020-06-23 | 9999-12-31 |
005 | 13555554444 | 2020-06-21 | 2020-06-23 | 9999-12-31 |
007 | 18600744444 | 2020-06-23 | 2020-06-23 | 9999-12-31 |
008 | 18600844444 | 2020-06-23 | 2020-06-23 | 9999-12-31 |
001 | 13551111111 | 2020-03-01 | 2020-06-20 | 2020-06-22 |
003 | 13571111111 | 2020-05-01 | 2020-06-20 | 2020-06-22 |
005 | 13533333333 | 2020-06-21 | 2020-06-22 | 2020-06-22 |
005 | 13552222222 | 2020-06-21 | 2020-06-21 | 2020-06-21 |
004 | 13333333333 | 2020-06-01 | 2020-06-22 | 9999-12-31 |
006 | 13733333333 | 2020-06-22 | 2020-06-22 | 9999-12-31 |
004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 |
004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
5.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21'
union all
select B.userid,
B.mobile,
B.regdate,
B.start_Date,
case when B.end_date='9999-12-31' and A.userid is not null
then date_add('2020-06-21', -1)
else B.end_date
end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
right join test.userhis B
on A.userid=B.userid;