Update operations in different Hive versions: MERGE INTO

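The previous post, "Updating Hive table data: insert overwrite / merge into", briefly mentioned updates in Hive without describing the two commands in detail. This post covers them in detail.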

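I. Prerequisites

Hive 2.2.0 and later support the MERGE INTO syntax, which uses data from a source table to update a target table in bulk. Using this feature also requires the following configuration.

1. Parameter settings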

set hive.support.concurrency = true;
set hive.enforce.bucketing = true; -- not required as of Hive 2.0
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;
set hive.auto.convert.join=false;
set hive.merge.cardinality.check=false; -- set this when a target row is matched by more than one source row

2. Table requirements

Hive imposes specific requirements on the DDL of any table that will be updated:

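(1) The table must be created with buckets (CLUSTERED BY ... INTO n BUCKETS).

(2) The table must use a supported storage format. Only ORC file format and AcidOutputFormat are currently supported; other formats, such as Parquet, are not.

(3) The table must be created with TBLPROPERTIES('transactional'='true').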

DROP TABLE IF EXISTS dim_date_table;
create table dim_date_table(
  date_key string comment 'e.g. 2023-04-02'
  ,day int comment 'day of month (1~31)'
  ,month int comment 'month, e.g. 4'
  ,month_name string comment 'month name, e.g. 4月'
  ,year int comment 'year, e.g. 2023'
  ,year_month int comment 'year and month, e.g. 202304'
  ,week_of_year string comment 'week of the year, e.g. 2023-14'
  ,week int comment 'day of week (1~7)'
  ,week_name string comment 'weekday name, e.g. 星期三'
  ,quarter int comment 'quarter (1~4)'
)
CLUSTERED BY (date_key) INTO 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS orc
TBLPROPERTIES('transactional'='true');

II. Comparing batch update syntax

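This section compares INSERT OVERWRITE on Hive 1.1.0 with MERGE INTO on Hive 2.3.5, looking at both the syntax and the efficiency when updating data of different volumes.

How updates were previously implemented on Hive tables: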

insert overwrite table dim_date_table
-- existing rows that have changed
select t2.date_key,t2.day,t2.month,t2.month_name,t2.year,t2.year_month,t2.week_of_year,t2.week,t2.week_name,1001 as quarter
from dim_date_table t1
join dim_date_table1 t2 on t1.date_key=t2.date_key
-- existing rows that have not changed
union all
select t1.*
from dim_date_table t1
left join dim_date_table1 t2 on t1.date_key=t2.date_key
where t2.date_key is null
-- newly added rows
union all
select t1.*
from dim_date_table1 t1
left join dim_date_table t2 on t1.date_key=t2.date_key
where t2.date_key is null
;

Hive 2.3.5:

MERGE INTO dim_date_table AS T
USING dim_date_table1 AS S
ON T.date_key=S.date_key
WHEN MATCHED THEN UPDATE SET quarter=1001 -- matched rows: the changed data
WHEN NOT MATCHED THEN INSERT -- unmatched rows: the newly added data
VALUES(S.date_key,S.day,S.month,S.month_name,S.year,S.year_month,S.week_of_year,S.week,S.week_name,S.quarter);

Batch update syntax:

MERGE INTO <target table> AS T USING <source expression/table> AS S
ON <boolean expression1>
WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
WHEN MATCHED [AND <boolean expression3>] THEN DELETE
WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES <value list>
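
To make the optional AND predicates and the DELETE branch more concrete, here is a minimal sketch that reuses the dim_date_table and dim_date_table1 tables from this post. The year cutoff of 2000 is a made-up value chosen only for illustration, and the statement assumes the session settings and transactional table definition described in section I.

```sql
-- Sketch only: the cutoff year 2000 is a hypothetical business rule.
MERGE INTO dim_date_table AS T
USING dim_date_table1 AS S
ON T.date_key = S.date_key
-- matched rows that the source marks as too old are removed from the target
WHEN MATCHED AND S.year < 2000 THEN DELETE
-- all other matched rows take the quarter value from the source
WHEN MATCHED THEN UPDATE SET quarter = S.quarter
-- unmatched source rows are inserted, but only if they pass the same cutoff
WHEN NOT MATCHED AND S.year >= 2000 THEN INSERT
VALUES (S.date_key, S.day, S.month, S.month_name, S.year,
        S.year_month, S.week_of_year, S.week, S.week_name, S.quarter);
```

As far as the Hive documentation describes it, each WHEN type (UPDATE, DELETE, INSERT) may appear at most once, WHEN NOT MATCHED must come last, and when both UPDATE and DELETE are present the first of the two needs an AND predicate. That is why the DELETE branch above carries the condition and the UPDATE branch does not. The bucketing column (date_key here) is deliberately left out of the SET clause, since Hive does not allow updating it.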