缓慢变化维与周期性事实表

130 阅读17分钟

1. 缓慢变化维

缓慢变化维(Slowly Changing Dimensions,SCD)是数据仓库中的一个概念。维度表中的数据(维度属性)通常不会像事实表数据那样快速变化,而是随着时间缓慢地发生改变。例如,在一个销售数据仓库中,销售事实(如销售金额、销售数量)可能每天都在大量变化,这是事实表数据。而客户维度(如客户姓名、地址、联系方式)相对变化缓慢,这些客户维度数据就是缓慢变化维。

处理维度表的历史变化信息的问题称为处理缓慢变化维的问题,简称SCD问题。处理 缓慢变化维的方法有以下几种常见方式:

  • 保留原值
  • 直接覆盖
  • 增加新属性列
  • 快照表
  • 拉链表

2. 保留原始值

维度属性值不做更改,保留原始值。

如商品上架售卖时间:一个商品上架售卖后由于其他原因下架,后来又再次上架,此种情况产生了多个商品上架售卖时间。如果业务重点关注的是商品首次上架售卖时间,则采用该方式。

3. 直接覆盖

修改维度属性为最新值,直接覆盖,不保留历史信息。

如商品属于哪个品类:当商品品类发生变化时,直接重写为新品类

4. 增加新属性列

在维度表中增加新的一列,原先属性列存放上一版本的属性值,当前属性列存放当前版本的属性值,还可以增加一列记录变化的时间。

缺点:只能记录最后一次变化的信息。

5. 快照表

每天保留一份全量数据。

简单、高效。缺点是信息重复,浪费磁盘空间。

适用范围:维表不能太大

使用场景多,范围广;一般而言维表都不大。

6. 拉链表

拉链表适合于:表的数据量大,而且数据会发生新增和变化,但是大部分是不变的

(数据发生变化的百分比不大),且是缓慢变化的(如电商中用户信息表中的某些用

户基本属性不可能每天都变化)。主要目的是节省存储空间。

适用场景:

  • 表的数据量大
  • 表中部分字段会被更新
  • 表中记录变量的比例不高
  • 需要保留历史信息

维表拉链表应用案例

创建表加载数据(准备工作)

-- 用户信息
DROP TABLE IF EXISTS test.userinfo; 

CREATE TABLE test.userinfo(
    userid STRING COMMENT '用户编号 ',  
    mobile STRING COMMENT '手机号码 ',  
    regdate STRING COMMENT '注册日期 '

) COMMENT '用户信息 '
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';

-- 拉链表(存放用户历史信息)
-- 拉链表不是分区表;多了两个字段start_date、end_date
DROP TABLE IF EXISTS test.userhis; 
CREATE TABLE test.userhis(
    userid STRING COMMENT '用户编号 ',  
    mobile STRING COMMENT '手机号码 ',  
    regdate STRING COMMENT '注册日期 ',
    start_date STRING, 
    end_date STRING
)COMMENT '用户信息拉链表 '
row format delimited fields terminated by ',';
 -- 数据(/data/dw/data/userinfo.dat)
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20

002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21

004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22

001,13554444444,2020-03-01,2020-06-23
003,13574444444,2020-05-01,2020-06-23
005,13555554444,2020-06-21,2020-06-23
007,18600744444,2020-06-23,2020-06-23
008,18600844444,2020-06-23,2020-06-23

 -- 静态分区数据加载(略)
/data/dw/data/userinfo0620.dat
001,13551111111,2020-03-01
002,13561111111,2020-04-01
003,13571111111,2020-05-01
004,13581111111,2020-06-01

 load data local inpath '/data/dw/data/userinfo0620.dat' into table test.userinfo
 partition(dt='2020-06-20'); 


 -- 动态分区数据加载:分区的值是不固定的,由输入数据确定 
 -- 创建中间表(非分区表)
 drop table if exists test.tmp1;
 create table test.tmp1 as
     select * from test.userinfo;

 -- tmp1 非分区表,使用系统默认的字段分割符 '\001'
 alter table test.tmp1 set serdeproperties('field.delim'=','); 

 -- 向中间表加载数据
 load data local inpath '/data/dw/data/userinfo.dat' into table test.tmp1;

 -- 从中间表向分区表加载数据
 set hive.exec.dynamic.partition.mode=nonstrict;

 insert into table test.userinfo
 partition(dt)
 select * from test.tmp1;

拉链表的实现

  • userinfo(分区表) => userid、 mobile、 regdate => 每日变更的数据(修改的+新增 的)/ 历史数据(第一天)
  • userhis(拉链表)=> 多了两个字段start_date / end_date
1 userinfo 分区:(2020-06-20)
-- 1、userinfo 分区:(2020-06-20)
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
2 初始化拉链表
-- 2、初始化拉链表(2020-06-20)由分区表userinfo到拉链表userhis
insert overwrite table test.userhis
    select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
    from test.userinfo where dt='2020-06-20';
2.1 查看数据
select * from userhis;
001	13551111111	2020-03-01	2020-06-20	9999-12-31
004	13581111111	2020-06-01	2020-06-20	9999-12-31
003	13571111111	2020-05-01	2020-06-20	9999-12-31
002	13561111111	2020-04-01	2020-06-20	9999-12-31
3 分区表 userinfo 分区:(2020-06-21)
-- 3、userinfo 分区:(2020-06-21)
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
3.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-21') A
     right join test.userhis B
     on A.userid=B.userid;
useriduserid(1)mobileregdatestart_dateend_date
002002135611111112020-04-012020-06-209999-12-31
004004135811111112020-06-012020-06-209999-12-31
001135511111112020-03-012020-06-209999-12-31
003135711111112020-05-012020-06-209999-12-31

可以发现 002 004 发生了数据变化, 001 003 没有发生变化

3.2 核心sql
-- 最终的
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21' 

union all

select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then date_add('2020-06-21', -1)
        else B.end_date
    end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
     right join test.userhis B
     on A.userid=B.userid;
useridmobileregdatestart_dateend_date
002135622222222020-04-012020-06-219999-12-31
004135822222222020-06-012020-06-219999-12-31
005135522222222020-06-212020-06-219999-12-31
002135611111112020-04-012020-06-202020-06-20
004135811111112020-06-012020-06-202020-06-20
001135511111112020-03-012020-06-209999-12-31
003135711111112020-05-012020-06-209999-12-31
3.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21' 
union all 
select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then date_add('2020-06-21', -1)
          else B.end_date
    end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
     right join test.userhis B
     on A.userid=B.userid;
4 分区表 userinfo 分区:(2020-06-22)
 004,13333333333,2020-06-01,2020-06-22
 005,13533333333,2020-06-21,2020-06-22
 006,13733333333,2020-06-22,2020-06-22
4.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-22') A
     right join test.userhis B
     on A.userid=B.userid;
useriduseridmobileregdatestart_dateend_date
004004135822222222020-06-012020-06-219999-12-31
004004135811111112020-06-012020-06-202020-06-20
005005135522222222020-06-212020-06-219999-12-31
002135622222222020-04-012020-06-219999-12-31
002135611111112020-04-012020-06-202020-06-20
001135511111112020-03-012020-06-209999-12-31
003135711111112020-05-012020-06-209999-12-31

可以发现 004 005 发生了数据变化

4.2 核心sql
-- 最终的
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
	from userinfo
	where dt='2020-06-22' 

union all 

select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then '2020-06-21'
        else B.end_date
    end as end_date
from (select * from userinfo where dt='2020-06-22') A
     right join userhis B
     on A.userid=B.userid;
useridmobileregdatestart_dateend_date
004133333333332020-06-012020-06-229999-12-31
005135333333332020-06-212020-06-229999-12-31
006137333333332020-06-222020-06-229999-12-31
004135822222222020-06-012020-06-212020-06-21
004135811111112020-06-012020-06-202020-06-20
005135522222222020-06-212020-06-212020-06-21
002135622222222020-04-012020-06-219999-12-31
002135611111112020-04-012020-06-202020-06-20
001135511111112020-03-012020-06-209999-12-31
003135711111112020-05-012020-06-209999-12-31
4.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-22' 
union all 
select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then date_add('2020-06-22', -1)
          else B.end_date
    end as end_date
from (select * from test.userinfo where dt='2020-06-22') A
     right join test.userhis B
     on A.userid=B.userid;
5 分区表 userinfo 分区:(2020-06-23)
 001,13554444444,2020-03-01,2020-06-23
 003,13574444444,2020-05-01,2020-06-23
 005,13555554444,2020-06-21,2020-06-23
 007,18600744444,2020-06-23,2020-06-23
 008,18600844444,2020-06-23,2020-06-23
5.1 观察数据
-- 观察数据
select A.userid, B.userid, B.mobile, B.regdate, B.start_date,B.end_date
from (select * from test.userinfo where dt='2020-06-23') A
     right join test.userhis B
     on A.userid=B.userid;
useriduseridmobileregdatestart_dateend_date
001001135511111112020-03-012020-06-209999-12-31
003003135711111112020-05-012020-06-209999-12-31
005005135333333332020-06-212020-06-229999-12-31
005005135522222222020-06-212020-06-21200-06-21
004133333333332020-06-012020-06-229999-12-31
006137333333332020-06-222020-06-229999-12-31
004135822222222020-06-012020-06-212020-06-21
004135811111112020-06-012020-06-202020-06-20
002135622222222020-04-012020-06-219999-12-31
002135611111112020-04-012020-06-202020-06-20
5.2 核心sql
-- 最终的
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
	from userinfo
	where dt='2020-06-22' 

union all 

select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then '2020-06-21'
        else B.end_date
    end as end_date
from (select * from userinfo where dt='2020-06-22') A
     right join userhis B
     on A.userid=B.userid;
useridmobileregdatestart_dateend_date
001135544444442020-03-012020-06-239999-12-31
003135744444442020-05-012020-06-239999-12-31
005135555544442020-06-212020-06-239999-12-31
007186007444442020-06-232020-06-239999-12-31
008186008444442020-06-232020-06-239999-12-31
001135511111112020-03-012020-06-202020-06-22
003135711111112020-05-012020-06-202020-06-22
005135333333332020-06-212020-06-222020-06-22
005135522222222020-06-212020-06-212020-06-21
004133333333332020-06-012020-06-229999-12-31
006137333333332020-06-222020-06-229999-12-31
004135822222222020-06-012020-06-212020-06-21
004135811111112020-06-012020-06-202020-06-20
002135622222222020-04-012020-06-219999-12-31
002135611111112020-04-012020-06-202020-06-20
5.3 将结果插入到拉链表userhis
-- 最终的处理(新增+历史数据)
insert overwrite table test.userhis
select  userid, mobile, regdate, dt as start_date, '9999-12-31' as end_date
from test.userinfo
where dt='2020-06-21' 
union all 
select B.userid,
     B.mobile,
     B.regdate,
     B.start_Date,
     case when B.end_date='9999-12-31' and A.userid is not null
          then date_add('2020-06-21', -1)
          else B.end_date
    end as end_date
from (select * from test.userinfo where dt='2020-06-21') A
     right join test.userhis B
     on A.userid=B.userid;