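Partitioned Tables

Each partition of a partitioned table corresponds to a separate directory on HDFS, and that directory holds all of the partition's data files. In Hive, partitioning simply means splitting data into directories: a large dataset is divided into smaller ones according to business needs. At query time, an expression in the where clause selects only the partitions the query needs, so far less data is read and the query runs much faster.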

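Basic Partitioned Table Operations

Syntax for creating a partitioned table

create table dept_partition(
    deptno int,    -- department number
    dname string,  -- department name
    loc string     -- department location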
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Loading data

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition 
partition(day='20220401');

load data local inpath '/opt/module/hive/datas/dept_20220402.log' 
into table dept_partition 
partition(day='20220402');

load data local inpath '/opt/module/hive/datas/dept_20220403.log' 
into table dept_partition 
partition(day='20220403');

Note: when loading data into a partitioned table, you must specify the partition.
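
Because each partition is a directory, the result of the loads above can be checked from the Hive CLI. A minimal sketch, assuming the table lives under the default warehouse path /user/hive/warehouse (the same path this section uses for dept_partition2 later on); adjust it if your warehouse directory differs:

```sql
-- List the partition directories created by the three load statements.
-- Assumes the default warehouse location; the real path may differ.
dfs -ls /user/hive/warehouse/dept_partition;
```

If the loads succeeded, the listing should contain one subdirectory per partition, named in the form day=20220401.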

Querying data in a partitioned table

Single-partition query

select 
    * 
from dept_partition 
where day='20220401';

Multi-partition query

select 
    * 
from dept_partition 
where day='20220401'
union
select 
    * 
from dept_partition
where day='20220402';
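
Note that union (that is, union distinct) also removes duplicate rows, which adds a deduplication step. When the goal is simply to read several partitions, the partitions can be selected directly with in or or on the partition column. A sketch against the dept_partition table above:

```sql
-- Read two partitions with one filter; unlike union, duplicate rows are kept.
select
    *
from dept_partition
where day in ('20220401', '20220402');

-- Equivalent form using or.
select
    *
from dept_partition
where day='20220401' or day='20220402';
```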

Listing the partitions of a partitioned table

show partitions dept_partition;
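
The listing can also be narrowed to a partition specification, and the metadata of a single partition can be inspected. A short sketch using the partitions loaded earlier:

```sql
-- Show only the partitions that match the given specification.
show partitions dept_partition partition(day='20220401');

-- Show the metadata of one partition, including its HDFS location.
describe formatted dept_partition partition(day='20220401');
```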

Adding partitions

Add a single partition

alter table dept_partition 
add partition(day='20220404');

Add multiple partitions at once (no commas between the partitions)

alter table dept_partition
add partition(day='20220405') partition(day='20220406');

Dropping partitions

Drop a single partition

alter table dept_partition 
drop partition (day='20220406');

Drop multiple partitions at once (the partitions must be separated by commas)

alter table dept_partition 
drop partition (day='20220404'), partition(day='20220405');
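
In scripts that may run more than once, adding a partition that already exists raises an error. The optional if not exists and if exists clauses make these statements safe to rerun. A minimal sketch on the same table:

```sql
-- Add the partition only when it is not already present.
alter table dept_partition
add if not exists partition(day='20220404');

-- Drop the partition only when it is present.
alter table dept_partition
drop if exists partition(day='20220404');
```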

Two-Level Partitioning

Create a two-level partitioned table

create table dept_partition2(
    deptno int,    -- department number
    dname string,  -- department name
    loc string     -- department location
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

Loading data the normal way

Load data into the two-level partitioned table

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition2 
partition(day='20220401', hour='12');

Query the partition data

select 
*
from dept_partition2 
where day='20220401' and hour='12';

Three ways to associate data uploaded directly to a partition directory with the partitioned table

Method 1: upload the data, then repair

Upload the data

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20220401/hour=13;
dfs -put /opt/module/hive/datas/dept_20220401.log /user/hive/warehouse/dept_partition2/day=20220401/hour=13;

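Query the data (the newly uploaded data is not returned, because the metastore does not know about the partition yet)

select 
* 
from dept_partition2
where day='20220401' and hour='13';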

Run the repair command

msck repair table dept_partition2;

Query the data again

select 
*
from dept_partition2 
where day='20220401' and hour='13';
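
By default, msck repair table only adds partitions that exist on HDFS but are missing from the metastore. Hive 3.0 and later also accept explicit options for the opposite direction. A sketch, assuming a Hive version that supports these options:

```sql
-- Add partitions found on HDFS but missing from the metastore (the default behavior).
msck repair table dept_partition2 add partitions;

-- Remove metastore partitions whose directories no longer exist on HDFS.
msck repair table dept_partition2 drop partitions;

-- Do both in one step.
msck repair table dept_partition2 sync partitions;
```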

Method 2: upload the data, then add the partition

Upload the data

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20220401/hour=14;
dfs -put /opt/module/hive/datas/dept_20220401.log /user/hive/warehouse/dept_partition2/day=20220401/hour=14;

Add the partition

alter table dept_partition2 
add partition(day='20220401',hour='14');

Query the data

select 
*
from dept_partition2 
where day='20220401' and hour='14';

Method 3: create the directory, then load data into the partition

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20220401/hour=15;

Load the data

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition2 
partition(day='20220401',hour='15');

Query the data

select
*
from dept_partition2
where day='20220401' and hour='15';

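Dynamic Partitioning

In a relational database, when rows are inserted into a partitioned table, the database automatically routes each row to the right partition based on the value of the partition column. Hive provides a similar mechanism called dynamic partitioning (Dynamic Partition), but it has to be configured before use.

Parameters for enabling dynamic partitioning

Enable dynamic partitioning (default: true, enabled)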
```
hive.exec.dynamic.partition=true
```
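Set nonstrict mode (this is the dynamic partition mode; the default, strict, requires at least one partition column to be static, while nonstrict allows every partition column to be dynamic)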
```
hive.exec.dynamic.partition.mode=nonstrict
```
The maximum total number of dynamic partitions that can be created across all nodes running the MapReduce job. Default: 1000.
```
hive.exec.max.dynamic.partitions=1000
```
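The maximum number of dynamic partitions that can be created on each node running the MapReduce job. This value has to be set according to the actual data. For example, if the source data covers one year, so that the day column has 365 distinct values, this parameter must be at least 365; with the default of 100 the job fails with an error.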
```
hive.exec.max.dynamic.partitions.pernode=100
```
The maximum number of HDFS files that can be created in the whole MapReduce job. Default: 100000.
```
hive.exec.max.created.files=100000
```
Whether to throw an exception when an empty partition is generated. This usually does not need to be set. Default: false.
```
hive.error.on.empty.partition=false
```
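
The properties above are written as name=value pairs. Within a Hive session they can be applied with set, which lasts until the session ends. A minimal sketch; the value 400 for the per-node limit is an assumption chosen to cover the one-year example, not a recommended setting:

```sql
-- Session-level settings; they apply only to the current session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Assumed values for a source with about one year of day partitions.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=400;

-- Print the current value of a property to verify it.
set hive.exec.dynamic.partition.mode;
```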

Example

Create the target partitioned table

create table dept_partition_dynamic(
    id int, 
    name string
) 
partitioned by (loc int) 
row format delimited fields terminated by '\t';

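Set nonstrict mode and insert using dynamic partitioning (all partition columns are dynamic here, which strict mode rejects)

set hive.exec.dynamic.partition.mode=nonstrict;

insert into table dept_partition_dynamic 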
partition(loc) 
select 
    deptno, 
    dname, 
    loc
from dept;

Check the partitions of the target table

show partitions dept_partition_dynamic;
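
Under strict mode at least one partition column must be static. The following sketch mixes a static day with a dynamic hour, reusing the dept_partition and dept_partition2 tables created earlier in this section; the hour value '00' is made up for illustration:

```sql
-- day is static; hour is dynamic and is taken from the last column of the select.
-- Static partition columns must come before dynamic ones in the partition clause.
insert into table dept_partition2
partition(day='20220402', hour)
select
    deptno,
    dname,
    loc,
    '00' as hour   -- hypothetical hour value, for illustration only
from dept_partition
where day='20220402';

-- Check which partitions were created.
show partitions dept_partition2;
```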

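Bucketed Tables

Partitioning offers a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly. For a table or a partition, Hive can organize the data further into buckets, which divide the data into finer-grained ranges.

Bucketing is another technique for breaking a dataset into more manageable parts.

Partitioning applies to the storage path of the data; bucketing applies to the data files themselves.

Create a bucketed table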

create table stu_buck(
    id int, 
    name string
)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';
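
To see bucketing at work, the table can be filled with a few rows and a single bucket sampled. This is a sketch with made-up rows; the last query only illustrates the idea of "hash of the bucketing column modulo the bucket count", and the hash function Hive actually applies depends on the table's bucketing version, so the computed number is not guaranteed to match the physical bucket file:

```sql
-- Insert a few made-up rows; Hive distributes them into the 4 buckets by id.
insert into table stu_buck values
    (1001, 'ss1'), (1002, 'ss2'), (1003, 'ss3'), (1004, 'ss4'),
    (1005, 'ss5'), (1006, 'ss6'), (1007, 'ss7'), (1008, 'ss8');

-- Read only the first of the 4 buckets.
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- Conceptual view of bucket assignment: hash of the column modulo the bucket count.
select id, pmod(hash(id), 4) as approx_bucket from stu_buck;
```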

Create a sorted bucketed table (rows are sorted within each bucket)

create table stu_buck_sort(
    id int, 
    name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';

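Compression and Storage

Hadoop Compression Configuration

Compression codecs supported by MapReduce

Compression format	Algorithm	File extension	Splittable
DEFLATE	DEFLATE	.deflate	No
Gzip	DEFLATE	.gz	No
bzip2	bzip2	.bz2	Yes
LZO	LZO	.lzo	Yes (once the file is indexed)
Snappy	Snappy	.snappy	No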

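Run hadoop checknative to see which compression libraries Hadoop supports.

In Hadoop, compression is configured on the driver side.

To support multiple compression/decompression algorithms, Hadoop introduces codecs (encoder/decoders), as shown in the table below:

Compression format	Codec class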
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec

Compression parameter configuration

Parameter	Default	Stage	Recommendation
io.compression.codecs (configured in core-site.xml)	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	Input compression	Hadoop uses the file extension to determine whether a codec is supported
mapreduce.map.output.compress	false	Mapper output	Set this parameter to true to enable compression
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	Use the LZO, LZ4, or Snappy codec to compress data at this stage
mapreduce.output.fileoutputformat.compress	false	Reducer output	Set this parameter to true to enable compression
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use a standard tool or codec, such as gzip or bzip2

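Enabling compression for the Map output stage

Enable compression of Hive's intermediate data (Hive also wants its own switch for controlling compression)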

set hive.exec.compress.intermediate=true;

Enable Map output compression in MapReduce

set mapreduce.map.output.compress=true;

Set the codec used for Map output in MapReduce

set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

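Enabling compression for the Reduce output stage

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this. Users may want to keep the default value false in the default configuration file, so that output is uncompressed plain text by default, and then enable output compression by setting the property to true in a query or script.

Enable compression of Hive's final output (Hive wants its own switch for controlling compression)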

set hive.exec.compress.output=true;

Enable compression of the final MapReduce output

set mapreduce.output.fileoutputformat.compress=true;

Set the codec used for the final MapReduce output

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
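
One way to check that output compression took effect is to write a query result to a directory and look at the generated files. A sketch, assuming the Snappy native library is available on the cluster; the output path /tmp/dept_partition_out is hypothetical:

```sql
-- Enable compressed final output for this session.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Write a query result to a local directory; the path is hypothetical.
-- distribute by forces a reduce stage so the reducer output settings apply.
insert overwrite local directory '/tmp/dept_partition_out'
select * from dept_partition distribute by deptno;
```

If the settings were applied, the files in the output directory should carry a .snappy extension rather than being plain text.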

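File Storage Formats

The main data storage formats supported by Hive are textfile, sequencefile, orc, and parquet.

Columnar storage and row storage

Characteristics of row storage

When a query needs a whole row that matches a condition, columnar storage has to visit each column's storage area to collect the value of every field, whereas row storage only has to locate one value because the remaining values of the row sit next to it. Row storage is therefore faster for this kind of query.

Characteristics of columnar storage

Because the values of each field are stored together, a query that needs only a few fields reads far less data. In addition, every value in a field has the same data type, so columnar storage can apply compression algorithms tailored to that type.

The textfile and sequencefile formats are row-based.

The orc and parquet formats are columnar.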
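
The storage format is chosen per table with the stored as clause, and columnar formats take their compression setting from table properties. A minimal sketch; the table names log_text, log_orc, and log_parquet and their columns are hypothetical:

```sql
-- Hypothetical table and column names, for illustration only.
create table log_text(
    id int,
    msg string
)
row format delimited fields terminated by '\t'
stored as textfile;

create table log_orc(
    id int,
    msg string
)
stored as orc
tblproperties("orc.compress"="SNAPPY");

create table log_parquet(
    id int,
    msg string
)
stored as parquet
tblproperties("parquet.compression"="SNAPPY");

-- load data copies files as they are, so a columnar table is filled
-- with insert ... select, which rewrites the rows in the target format.
insert into table log_orc select id, msg from log_text;
```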
