Basic Hive Operations: Table Partitioning

This is day 4 of my participation in the Juejin Daily New Plan June Update Challenge.


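By default Hive scans the whole table when it runs a query. Often we only need part of the data, so to avoid the cost of a full table scan we can partition the table and have Hive scan only the relevant part.

In Hive, a partition column is a pseudo-column: its values are not stored in the data files, but it can be used in query conditions. Partitioning creates one subdirectory per partition under the table directory; with a second-level partition, further subdirectories are created inside each first-level subdirectory.

In Hive the partition column is declared outside the table's regular column list, whereas in MySQL the partition column is one of the table's own columns.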
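
One way to see that the partition column lives outside the regular column list is to try declaring the same name in both places. The sketch below uses a hypothetical table name, bad_partition_demo, that is not part of the original walkthrough. Hive is expected to reject the statement with a semantic error about a column being repeated in the partitioning columns; the exact wording may vary by version.

```sql
-- Hypothetical table, for illustration only.
-- Expected to fail: testdate is declared both as a regular column
-- and as a partition column, which Hive does not allow.
create table bad_partition_demo(
  id int,
  testdate string
)
partitioned by (testdate string);
```

Because the statement fails, nothing is created and the rest of the walkthrough is unaffected.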

Using partitions:

Switch to the target database

hive> use test_data;

Single-level partitioning:

Create the table

hive> create table if not exists partition1(
    > id int,
    > name string,
    > age int
    > )
    > partitioned by (testdate string)
    > row format delimited
    > fields terminated by '\t'
    > lines terminated by '\n';

Note: '\n' is already the default line terminator, so the last clause can be omitted.

Create the data file

[root@hadoop01 test_data]# pwd
/usr/local/wyh/test_data
[root@hadoop01 test_data]# cat user_partition1.txt
1       user1   1
2       user2   2
3       user3   2
4       user4   1
5       user5   1
6       user6   1
7       user7   2
8       user8   1
9       user9   2

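Load data

Because the table was created with a partition column, each load must say which partition the data file goes into.

Here we first put the data into the **testdate='2022-04-28'** partition. (If the partition does not exist yet, it is created automatically.)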

hive> load data local inpath '/usr/local/wyh/test_data/user_partition1.txt' into table partition1 partition(testdate='2022-04-28');

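View the HDFS directory tree

After the load, the table directory contains a subdirectory named testdate=2022-04-28 that holds the data file.
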
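
The directory layout can also be checked from the Hive prompt. The sketch below first asks Hive where the partition is stored, then lists the table directory. The path in the second command is an assumption: it supposes the default warehouse location /user/hive/warehouse and the test_data database, so substitute the Location value that the first command reports on your cluster.

```sql
-- Show the partition's metadata, including its Location on HDFS.
describe formatted partition1 partition (testdate='2022-04-28');

-- List the table directory recursively.
-- ASSUMED path (default warehouse dir); replace it with the Location reported above.
dfs -ls -R /user/hive/warehouse/test_data.db/partition1;
```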
View the data

The query result now has an extra last column, the partition column:

hive> select * from partition1;
OK
1       user1   1       2022-04-28
2       user2   2       2022-04-28
3       user3   2       2022-04-28
4       user4   1       2022-04-28
5       user5   1       2022-04-28
6       user6   1       2022-04-28
7       user7   2       2022-04-28
8       user8   1       2022-04-28
9       user9   2       2022-04-28

Load the same file into a second partition.

hive> load data local inpath '/usr/local/wyh/test_data/user_partition1.txt' into table partition1 partition(testdate='2022-04-27');

View the data

hive> select * from partition1;
OK
1       user1   1       2022-04-27
2       user2   2       2022-04-27
3       user3   2       2022-04-27
4       user4   1       2022-04-27
5       user5   1       2022-04-27
6       user6   1       2022-04-27
7       user7   2       2022-04-27
8       user8   1       2022-04-27
9       user9   2       2022-04-27
1       user1   1       2022-04-28
2       user2   2       2022-04-28
3       user3   2       2022-04-28
4       user4   1       2022-04-28
5       user5   1       2022-04-28
6       user6   1       2022-04-28
7       user7   2       2022-04-28
8       user8   1       2022-04-28
9       user9   2       2022-04-28

Filter on the partition column

hive> select * from partition1 where testdate='2022-04-27';
OK
1       user1   1       2022-04-27
2       user2   2       2022-04-27
3       user3   2       2022-04-27
4       user4   1       2022-04-27
5       user5   1       2022-04-27
6       user6   1       2022-04-27
7       user7   2       2022-04-27
8       user8   1       2022-04-27
9       user9   2       2022-04-27

With this predicate, the query scans only the directory of the matching partition.
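
To check which partitions a query would actually read, Hive's EXPLAIN DEPENDENCY can be used. This is a sketch rather than part of the original walkthrough; the output is a JSON document whose input_partitions entry should mention only the testdate=2022-04-27 partition for the filtered query, and both partitions for the unfiltered one.

```sql
-- Inputs of the filtered query: only one partition is expected.
explain dependency
select * from partition1 where testdate='2022-04-27';

-- Inputs of the unfiltered query, for comparison: every partition is expected.
explain dependency
select * from partition1;
```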

Two-level partitioning:

Create the table

hive> create table if not exists partition2(
    > id int,
    > name string,
    > age int
    > )
    > partitioned by (year string,month string)
    > row format delimited
    > fields terminated by '\t';

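Load data

With two-level partitioning, both partition columns must be given a value. The first column produces the first-level partition directory, and the second produces a subdirectory inside it.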

hive> load data local inpath '/usr/local/wyh/test_data/user_partition2.txt' into table partition2 partition(year='2022',month='03');

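View the HDFS directory tree

The table directory now contains the nested directories year=2022/month=03.

View the data

The result has two extra trailing columns, one for each partition column: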

hive> select * from partition2;
OK
1       user1   1       2022    03
2       user2   2       2022    03
3       user3   2       2022    03
4       user4   1       2022    03
5       user5   1       2022    03
6       user6   1       2022    03
7       user7   2       2022    03
8       user8   1       2022    03
9       user9   2       2022    03

Load data into a second partition

hive> load data local inpath '/usr/local/wyh/test_data/user_partition2.txt' into table partition2 partition(year='2022',month='02');

View the data

hive> select * from partition2;
OK
1       user1   1       2022    02
2       user2   2       2022    02
3       user3   2       2022    02
4       user4   1       2022    02
5       user5   1       2022    02
6       user6   1       2022    02
7       user7   2       2022    02
8       user8   1       2022    02
9       user9   2       2022    02
1       user1   1       2022    03
2       user2   2       2022    03
3       user3   2       2022    03
4       user4   1       2022    03
5       user5   1       2022    03
6       user6   1       2022    03
7       user7   2       2022    03
8       user8   1       2022    03
9       user9   2       2022    03

Load data into a third partition

hive> load data local inpath '/usr/local/wyh/test_data/user_partition2.txt' into table partition2 partition(year='2021',month='04');

View the data

hive> select * from partition2;
OK
1       user1   1       2021    04
2       user2   2       2021    04
3       user3   2       2021    04
4       user4   1       2021    04
5       user5   1       2021    04
6       user6   1       2021    04
7       user7   2       2021    04
8       user8   1       2021    04
9       user9   2       2021    04
1       user1   1       2022    02
2       user2   2       2022    02
3       user3   2       2022    02
4       user4   1       2022    02
5       user5   1       2022    02
6       user6   1       2022    02
7       user7   2       2022    02
8       user8   1       2022    02
9       user9   2       2022    02
1       user1   1       2022    03
2       user2   2       2022    03
3       user3   2       2022    03
4       user4   1       2022    03
5       user5   1       2022    03
6       user6   1       2022    03
7       user7   2       2022    03
8       user8   1       2022    03
9       user9   2       2022    03

List the table's partitions

hive> show partitions partition2;
OK
year=2021/month=04
year=2022/month=02
year=2022/month=03
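
With multi-level partitions, both the partition listing and ordinary queries can be narrowed by one or more levels. A short sketch against the partition2 table loaded above: the first statement should list only the two 2022 partitions, and the second should count the 9 rows loaded into year=2022/month=03.

```sql
-- List only the partitions under the first-level value year=2022.
show partitions partition2 partition(year='2022');

-- Filter on both partition columns so a single leaf directory is read.
select count(*) from partition2 where year='2022' and month='03';
```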

Both examples above name the partition values explicitly in the load statement, which is static partitioning.
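
Static partitioning is not tied to load data: what makes it static is that the partition value is spelled out in the statement. As an optional sketch, the insert below copies the rows of one partition into another partition of partition1 using a fixed value. The date '2022-04-29' is a made-up value for illustration and is not used elsewhere in this article.

```sql
-- Static partition insert: the target partition value is written literally.
-- The partition column is NOT in the select list, because its value is fixed.
insert into table partition1 partition (testdate='2022-04-29')
select id, name, age
from partition1
where testdate='2022-04-28';
```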

Dynamic partitioning:

Create the dynamically partitioned table

hive> create table dynamic_partition1(
    > id int,
    > name string,
    > gender string,
    > age int,
    > academy string
    > )
    > partitioned by (test_date string)
    > row format delimited fields terminated by ','
    > ;

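Create a staging table

In this walkthrough, dynamic partitioning is done through a staging table, an ordinary table used as a temporary holding area. First create a staging table with the same structure as the table above, but with the partition column added as a regular column. Load the data into the staging table, then insert from the staging table into the target table.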

hive> create table tmp_dynamic_partition1(
    > id int,
    > name string,
    > gender string,
    > age int,
    > academy string,
    > test_date string
    > )
    > row format delimited fields terminated by ','
    > ;

Note: the staging table has no partition clause. The target table's partition column is declared here as an ordinary column.

Create the data file

[root@hadoop01 test_data]# pwd
/usr/local/wyh/test_data
[root@hadoop01 test_data]# cat dynamic_student.txt
1801,Tom,male,20,CS,2018-8-23
1802,Lily,female,18,MA,2019-7-14
1803,Bob,male,19,IS,2017-8-29
1804,Alice,female,21,MA,2017-9-16
1805,Sam,male,19,IS,2018-8-23

Load the data into the staging table

hive> load data local inpath '/usr/local/wyh/test_data/dynamic_student.txt' into table tmp_dynamic_partition1;

View the data

hive> select * from tmp_dynamic_partition1;
OK
1801    Tom     male    20      CS      2018-8-23
1802    Lily    female  18      MA      2019-7-14
1803    Bob     male    19      IS      2017-8-29
1804    Alice   female  21      MA      2017-9-16
1805    Sam     male    19      IS      2018-8-23

Switch the dynamic partition mode to nonstrict

hive> set hive.exec.dynamic.partition.mode=nonstrict;

Otherwise, in strict mode, a dynamic-partition insert requires at least one partition column to be given a static value.
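
Several related session settings control dynamic partitioning. The sketch below lists the ones most often adjusted together; the numeric limits shown are illustrative values, not recommendations, and the defaults on your cluster may differ.

```sql
-- Print the current mode before changing it.
set hive.exec.dynamic.partition.mode;

-- Enable dynamic partitioning and allow all partition columns to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Upper bounds on how many partitions one statement may create
-- (in total, and per mapper/reducer node). Illustrative values.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```

These settings apply to the current session only.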

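Load data dynamically

Insert the staging table's rows into the target table and let Hive derive each row's partition from its test_date value. The dynamic partition column must come last in the select list.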

hive> insert into dynamic_partition1 partition(test_date) select id,name,gender,age,academy,test_date from tmp_dynamic_partition1;

Query the data

hive> select * from dynamic_partition1;
OK
1803    Bob     male    19      IS      2017-8-29
1804    Alice   female  21      MA      2017-9-16
1801    Tom     male    20      CS      2018-8-23
1805    Sam     male    19      IS      2018-8-23
1802    Lily    female  18      MA      2019-7-14

List the table's partitions

hive> show partitions dynamic_partition1;
OK
test_date=2017-8-29
test_date=2017-9-16
test_date=2018-8-23
test_date=2019-7-14

The 2018-8-23 partition holds two rows.
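
A quick way to confirm how the rows were distributed is to group by the partition column. This is a small sketch; row_cnt is just an alias chosen here. Given the five rows loaded above, 2018-8-23 should show 2 and each of the other dates 1.

```sql
-- Count rows per dynamically created partition.
select test_date, count(*) as row_cnt
from dynamic_partition1
group by test_date;
```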

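Mixed partitioning:

Mixed partitioning follows much the same steps as dynamic partitioning and also needs a staging table. The difference is that, when inserting from the staging table into the target table, some partition columns are given fixed values while others are derived from the data.

Create the target table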

hive> create table mixed_partition(
    > id int,
    > name string
    > )
    > partitioned by (year string,month string,day string)
    > row format delimited fields terminated by ','
    > ;

Create the staging table

hive> create table tmp_mixed_partition(
    > id int,
    > name string,
    > year string,
    > month string,
    > day string
    > )
    > row format delimited fields terminated by ','
    > ;

Note: all three partition columns of the target table are declared as ordinary columns in the staging table.

Create the data file

[root@hadoop01 test_data]# pwd
/usr/local/wyh/test_data
[root@hadoop01 test_data]# cat mixed_partition.txt
1,Mike,2022,03,13
2,Peak,2022,04,21
3,Tina,2022,04,17
4,Keith,2022,03,13

Load the data file into the staging table

hive> load data local inpath '/usr/local/wyh/test_data/mixed_partition.txt' into table tmp_mixed_partition;

Insert the staging table's data into the target table

hive> insert into mixed_partition partition (year='2022',month,day) select id,name,month,day from tmp_mixed_partition;


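Note: "mixed" here means that year is specified by hand and is therefore a static partition, while month and day are derived from the actual row values and are dynamic partitions.
Note: the select list does not include the year column, because its value is already fixed in the partition clause.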
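
One consequence of fixing year in the partition clause is that every selected row lands under year=2022, whatever its own year value in the staging table. The sample file only contains 2022 rows, so the statement above is safe here. If the staging table could hold other years, a filter like the one sketched below keeps the static value and the data consistent. It is an alternative to the insert above, not an extra step; running both would duplicate the rows.

```sql
-- Alternative form of the mixed-partition insert:
-- only rows whose own year matches the static partition value are copied.
insert into mixed_partition partition (year='2022', month, day)
select id, name, month, day
from tmp_mixed_partition
where year = '2022';
```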

Query the data

hive> select * from mixed_partition;
OK
1       Mike    2022    03      13
4       Keith   2022    03      13
3       Tina    2022    04      17
2       Peak    2022    04      21

List the table's partitions

hive> show partitions mixed_partition;
OK
year=2022/month=03/day=13
year=2022/month=04/day=17
year=2022/month=04/day=21

These are some simple uses of table partitioning in Hive.