Hive分区表(static、dynamic)

3,203 阅读4分钟

前言

简单理解分区表就是HDFS中文件夹分而治之,查询的时候可以进一步缩小数据搜索范围。

举个例子,订单表中承载了所有订单业务数据,由于单表查询数据压力很大,所以采用create_time字段作为分区标识,把当日的数据存在2020.06.11中,次日的数据则存储在2020.06.12中。

查询数据时则只需要添加where字段则可进一步缩小数据的查询范围。

静态分区

建表语句

create table static_partition(
id int,
name string,
age int,
sex string
)
comment '静态分区表'
partitioned by (day int)
row format delimited fields terminated by ',';

1、查询static_partition表数据

hive (xinxing)> select * from static_partition;
OK
static_partition.id	static_partition.name	static_partition.age	static_partition.sex	static_partition.day
  • 由此可见,新增了一个字段day,也就是我们刚才的分区字段

2、查询static_partition表结构

hive (xinxing)> desc formatted static_partition;

  • 表结构中新增了分区明细

3、把数据导入至Hive中

1,xinxing,22,boy
2,laocao,24,boy
3,shiqin,19,girl
4,xiaoming,50,boy
5,xiaohong,49,girl

hive (xinxing)> load data local inpath '/home/hadoop/data/student.txt' into table static_partition partition(day = 11);
hive (xinxing)> load data local inpath '/home/hadoop/data/student.txt' into table static_partition partition(day = 12);

4、在HDFSweb页查看static_partition

  • 分区表第11日的路径:/user/hive/warehouse/xinxing.db/static_partition/day=11
  • 分区表第12日的路径:/user/hive/warehouse/xinxing.db/static_partition/day=12

5、再次查询static_partition表数据

hive (xinxing)> select * from static_partition;
OK
static_partition.id	static_partition.name	static_partition.age	static_partition.sex	static_partition.day
1                   	xinxing	                    22                  	boy	                11
2                   	laocao	                    24                     	boy	                11
3                   	shiqin	                    19                     	girl                11
4                   	xiaoming                    50                  	boy	                11
5                   	xiaohong                    49                  	girl                11
1                   	xinxing	                    22                  	boy	                12
2                   	laocao	                    24                     	boy	                12
3                   	shiqin	                    19                     	girl            	12
4                   	xiaoming                    50                  	boy	                12
5                   	xiaohong                    49                  	girl               	12
Time taken: 0.555 seconds, Fetched: 10 row(s)

hive (xinxing)> select * from static_partition where day=12;
OK
static_partition.id	static_partition.name	static_partition.age	static_partition.sex	static_partition.day
1                   	xinxing	                    22                  	boy	                12
2                   	laocao	                    24                     	boy	                12
3                   	shiqin	                    19                     	girl            	12
4                   	xiaoming                    50                  	boy	                12
5                   	xiaohong                    49                  	girl               	12
Time taken: 0.239 seconds, Fetched: 5 row(s)
hive (xinxing)> 

动态分区

为什么要使用动态分区。分区是按天的,我不可能说365天每天都手动去分区。如果需求除了按天分区还要按小时分区呢,那我们什么都不要做了,一天到晚维护这张表就行。所有我们要使用动态分区。

建表语句

create table dynamic_partition(
id int,
name string,
age int,
sex string
)
comment '动态分区表'
partitioned by (day int)
row format delimited fields terminated by ',';

1、查询dynamic_partition表数据

hive (xinxing)> select * from dynamic_partition;
OK
dynamic_partition.id	dynamic_partition.name	dynamic_partition.age	dynamic_partition.sex	dynamic_partition.day
Time taken: 0.049 seconds
  • 好像跟静态分区没什么区别

2、把数据导入至Hive中

hive (xinxing)> load data  inpath '/data/student.txt' into table static_partition partition(day);
  • 细节来了,导入报错了FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
    失败:SemanticException [错误10096]:动态分区严格模式至少需要一个静态分区列。要关闭此功能,请设置hive.exec.dynamic.partition.mode = nonstrict

查看默认分区模式

hive (xinxing)> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict

更改分区模式从新执行

hive (xinxing)>set hive.exec.dynamic.partition.mode=nonstrict;

hive (xinxing)> load data  inpath '/data/student.txt' into table static_partition partition(day);
  • 再次报错hive (xinxing)> load data inpath '/data/student.txt' into table static_partition partition(day); FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [day, ], values [])MetaException(消息:无效的分区键和值;键[day,],值[])
  • 看来动态分区并不能通过load data的方式写入数据

3、查询表数据以及dynamic_partition表的HDFS目录结构

  • 先查询一次,好对比
hive (xinxing)> select id,name,age,sex,day from dynamic_partition;
OK
id	name	age	sex	day
Time taken: 0.041 seconds

  • 啥也没有,干净得很

4、使用insert来把查询到的数据插入到动态分区表中

hive (xinxing)> insert into table dynamic_partition partition(day)  select id,name,age,sex,day from static_partition;

此处不一定非要查static_partition表,普通表也可以,我偷懒,直接拿过来用。
此步骤执行是要跑Map的。

5、再次查询表数据以及dynamic_partition表的HDFS目录结构

hive (xinxing)> select * from static_partition;
OK
static_partition.id	static_partition.name	static_partition.age	static_partition.sex	static_partition.day
1                   	xinxing	                    22                  	boy	                11
2                   	laocao	                    24                     	boy	                11
3                   	shiqin	                    19                     	girl                11
4                   	xiaoming                    50                  	boy	                11
5                   	xiaohong                    49                  	girl                11
1                   	xinxing	                    22                  	boy	                12
2                   	laocao	                    24                     	boy	                12
3                   	shiqin	                    19                     	girl            	12
4                   	xiaoming                    50                  	boy	                12
5                   	xiaohong                    49                  	girl               	12
Time taken: 0.555 seconds, Fetched: 10 row(s)

  • 是不是来了,数据都有了,搞定

多分区字段

create table dynamic_partition(
id int,
name string,
age int,
sex string
)
comment '动态分区表'
partitioned by (day int, hour int)
row format delimited fields terminated by ',';
  • 没什么区别,就是在天的基础上加个小时字段,我就不在这里演示。

删除分区

单分区

alter table static_partition drop partition(day=11);

多分区

删除11号当天24点这一个小时的数据

alter table static_partition drop partition(day=11,hour=24);

删除11号的所有数据

alter table static_partition drop partition(day=11);

删除每一天24点这一小时的数据
(一个月30天全部的24点数据都会被删除)

alter table static_partition drop partition(hour=24);

!!注意:不能直接把day这个分区字段直接删除