hive下merge into实现增量更新方法

3,441 阅读1分钟

一、hive下增量的两种方法

下面列出了两种hive批量增量的方法,我们详细讲一下merge into方法。

  1. union all方式
insert overwrite  table 原表
    select * from 增量表  union all
    select  a.* from 原表 a  left join 增量表 b on 1=1 and a.业务不会变化的字段=b.业务不会变化的字段 where b.业务不会变化的字段 is null;
  1. merge into
 MERGE INTO <target table> AS T USING <source expression/table> AS S
 ON <``boolean` `expression1>
 WHEN MATCHED [AND <``boolean` `expression2>] THEN UPDATE SET <set clause list>
 WHEN MATCHED [AND <``boolean` `expression3>] THEN DELETE
 WHEN NOT MATCHED [AND <``boolean` `expression4>] THEN INSERT VALUES<value list>

二、测试过程(hive: V 2.2.1)

  1. 创建存量表a和增量表b,准备测试数据; 如果要支持delete和update,则目标表需是AcidOutputFormat,必须分桶。 
    而且目前只有ORCFileformat支持AcidOutputFormat,不仅如此建表时必须指定参数('transactional' = true) 。否则在merge into数据时会报如下错:
org.apache.hive.service.cli.HiveSQLException: Error while compiling    statement: FAILED: SemanticException [Error 10297]: Attempt to do update or delete on table demo.a that does not use an AcidOutputFormat or is not bucketed

表a:

CREATE table demo.a (id int,name string) CLUSTERED by (id) into 2 buckets STORED as orc TBLPROPERTIES('transactional'='true');

表a插入存量数据:

insert into TABLE demo.a VALUES (1,'wf')

表b:

CREATE table demo.b (id int,name string) stored as orc;

表b插入增量数据(存在旧数据更新,新增数据两种场景):

insert into TABLE demo.b VALUES (1,'wf'),(3,'bb'),(4,'2132'),(8,'21'),(9,'12'),(11,'1')
  1. 将表b数据更新到表a:
merge into demo.a as T1 USING demo.b as T2
ON T1.id = T2.id
when matched then UPDATE set name = 'wf1'
when NOT matched then insert VALUES(T2.id,T2.name);
  1. 检查数据结果:
  • 表a旧数据已更新为新数据wf1;
  • 表b下新增数据已更新到表a。
select * from demo.a;

image.png