Hive Cheat Sheet

1. Databases

1.1 Creating a Database

Syntax:

CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

Examples:

create database db_hive1;

create database db_hive2 location '/db_hive2';

create database db_hive3
with dbproperties('create_date'='2022-11-18');

1.2 Listing Databases

SHOW DATABASES [LIKE 'identifier_with_wildcards'];

Example:

show databases like 'db_hive*';

1.3 Describing a Database

DESCRIBE DATABASE [EXTENDED] db_name;

Example:

desc database db_hive3;

Output fields and their meanings:

  • db_name: database name
  • comment: database comment
  • location: HDFS path
  • owner_name / owner_type: owner information
  • parameters: DBPROPERTIES

1.4 Altering a Database

Modifiable: DBPROPERTIES, LOCATION, and OWNER USER. Note: changing a database's LOCATION does not move existing tables; it only changes the default parent directory for tables created afterwards.

ALTER DATABASE db_hive3
SET DBPROPERTIES ('create_date'='2022-11-20');
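
The list above also names LOCATION and OWNER as modifiable; a minimal sketch of those two forms, with a placeholder path and user name:

-- Assumed path; only affects the default location of later tables
ALTER DATABASE db_hive3 SET LOCATION 'hdfs://hadoop102:8020/db_hive3_new';
-- Placeholder user name
ALTER DATABASE db_hive3 SET OWNER USER atguigu;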

1.5 Dropping a Database

DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
  • RESTRICT (default): fails if the database is not empty
  • CASCADE: also drops all tables in the database

Example (fails because the database is not empty):

drop database db_hive3;
-- FAILED: ... Database db_hive3 is not empty. One or more tables exist.
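
As a follow-up, the same drop succeeds with CASCADE, which also removes the tables inside:

drop database db_hive3 cascade;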

1.6 Switching the Current Database

USE database_name;

2. Tables

2.1 CREATE TABLE Syntax Overview

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
  col_name data_type [COMMENT col_comment],
  ...
)
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)  [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];

Key concepts:

  • TEMPORARY: temporary table, visible only in the current session and dropped automatically when the session ends (see the sketch below)
  • EXTERNAL: external table (Hive manages only the metadata, not the HDFS data)
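
A minimal sketch of a temporary table (table name and columns are illustrative):

create temporary table tmp_stu(
  id   int,
  name string
);
-- Visible only in this session; dropped automatically when the session ends.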

2.2 Data Types

Primitive Types

  • tinyint: 1-byte signed integer
  • smallint: 2-byte signed integer
  • int: 4-byte signed integer
  • bigint: 8-byte signed integer
  • boolean: true/false
  • float: single-precision floating point
  • double: double-precision floating point
  • decimal: exact decimal number, e.g. decimal(16,2)
  • varchar: variable-length string with an upper length limit, e.g. varchar(32)
  • string: string
  • timestamp: timestamp
  • binary: binary data

Complex Types

  • array: elements of a single type; definition array<string>, access arr[0]
  • map: key-value pairs; definition map<string,int>, access m['key']
  • struct: named fields; definition struct<id:int,name:string>, access s.id
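
A quick sketch constructing and accessing each complex type inline, without any table:

select
  array('a','b','c')[0],                  -- array indexing
  map('k1', 1, 'k2', 2)['k1'],            -- map lookup by key
  named_struct('id', 1, 'name', 'x').id;  -- struct field access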

Type Conversion

  • Implicit conversion (examples):

    • tinyint → int → bigint
    • int/float/string → double
    • tinyint/smallint/int → float
    • boolean cannot be converted to any other type
  • Explicit conversion: use cast

cast(expr as <type>);

select '1' + 2, cast('1' as int) + 2;

2.3 Partitioned and Bucketed Table Keywords

  • PARTITIONED BY: partitioned table (data split into directories)
  • CLUSTERED BY ... SORTED BY ... INTO ... BUCKETS: bucketed table (data split into files)

2.4 Row Format / SerDe

2.4.1 Default SerDe: ROW FORMAT DELIMITED

ROW FORMAT DELIMITED
  [FIELDS TERMINATED BY char]
  [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char]
  [LINES TERMINATED BY char]
  [NULL DEFINED AS char]

Meaning:

  • FIELDS TERMINATED BY: column separator
  • COLLECTION ITEMS TERMINATED BY: separator between elements of an array/map/struct
  • MAP KEYS TERMINATED BY: separator between a map key and its value
  • LINES TERMINATED BY: row separator
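
A sketch combining these clauses (table name and delimiters chosen for illustration):

create table demo_delimited(
  id      int,
  hobbies array<string>,
  scores  map<string,int>
)
row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':'
  lines terminated by '\n'
  null defined as '\\N';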

2.4.2 Specifying a SerDe: ROW FORMAT SERDE

ROW FORMAT SERDE 'serde_name'
WITH SERDEPROPERTIES (...);
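
As an example, the built-in OpenCSVSerde with its separator and quote properties (property keys as documented for that serde; note it reads every column as string):

create table demo_csv(
  id   string,
  name string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  'separatorChar' = ',',
  'quoteChar'     = '"'
);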

2.5 File Formats: STORED AS

Common formats:

  • textfile (default)
  • sequencefile
  • orc
  • parquet
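
A minimal sketch choosing a columnar format at creation time:

create table demo_orc(
  id   int,
  name string
)
stored as orc;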

2.6 Location / TBLPROPERTIES

  • LOCATION: HDFS path for the table's data. Default: ${hive.metastore.warehouse.dir}/db_name.db/table_name
  • TBLPROPERTIES: table-level key-value parameters
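
A sketch using both clauses together; 'orc.compress' is one commonly used ORC table property:

create table demo_props(
  id int
)
stored as orc
location '/user/hive/warehouse/demo_props'
tblproperties ('orc.compress' = 'SNAPPY');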

2.7 CTAS and LIKE

CTAS (Create Table As Select)

Creates a table directly from a query result: the schema matches the query output, and the data is written into the new table.

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
AS select_statement;

Create Table Like

Copies the structure of an existing table; no data is copied.

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE exist_table_name
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)];

2.8 Managed vs External Tables

Managed (Internal) Tables

  • Hive manages both the metadata and the HDFS data
  • drop table deletes the HDFS data directory

Example:

create table if not exists student(
  id int,
  name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/db_hive3.db/student';

Prepare the data:

vim /opt/module/datas/student.txt
# Example contents:
# 1001    student1
# 1002    student2
# ...

hadoop fs -put /opt/module/datas/student.txt /user/hive/warehouse/db_hive3.db/student

External Tables (EXTERNAL)

  • Hive manages only the metadata
  • drop table removes only the metadata; the HDFS data is kept

create external table if not exists student(
  id int,
  name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/db_hive3.db/student';

2.9 JSON SerDe Example

Sample JSON:

{
  "name": "dasongsong",
  "friends": ["bingbing", "lili"],
  "students": {"xiaohaihai": 18, "xiaoyangyang": 16},
  "address": {"street": "hui long guan", "city": "beijing", "postal_code": 10010}
}

Create the table:

create table teacher (
  name     string,
  friends  array<string>,
  students map<string,int>,
  address  struct<city:string,street:string,postal_code:int>
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/user/hive/warehouse/db_hive3.db/teacher';

Load the data file:

vim /opt/module/datas/teacher.txt
# Single-line JSON:
# {"name":"dasongsong","friends":["bingbing","lili"],"students":{"xiaohaihai":18,"xiaoyangyang":16},"address":{"street":"hui long guan","city":"beijing","postal_code":10010}}

hadoop fs -put /opt/module/datas/teacher.txt /user/hive/warehouse/db_hive3.db/teacher

CTAS / LIKE examples:

create table teacher1 as select * from teacher;  -- teacher1 contains data
create table teacher2 like teacher;              -- teacher2 is empty (structure only)

2.10 Inspecting and Altering Tables

Listing Tables

SHOW TABLES [IN database_name] LIKE 'identifier_with_wildcards';

Example:

show tables like 'teacher*';

Describing a Table

DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name;

Examples:

desc extended teacher;
desc formatted teacher;

Altering a Table

1) Rename:

ALTER TABLE table_name RENAME TO new_table_name;

alter table student rename to stu;
alter table teacher1 rename to teacher_demo;

2) Add columns (appended after existing columns):

ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);

alter table stu add columns(age int);

3) Change a column (name/type/comment/position):

ALTER TABLE table_name
CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name];

alter table stu change column age ages double;

4) Replace columns (replaces the entire column set):

ALTER TABLE table_name REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...);

alter table stu replace columns(id int, name string);

Dropping / Truncating a Table

DROP TABLE [IF EXISTS] table_name;
TRUNCATE TABLE table_name;
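
One caveat worth noting: TRUNCATE applies to managed tables only; truncating an external table is rejected. A short example against the tables created earlier:

truncate table stu;               -- empties the data, keeps the table (managed tables only)
drop table if exists teacher_demo;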

3. DML (Data Manipulation)

3.1 LOAD

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)];

Example table:

create table student(
  id int,
  name string
)
row format delimited fields terminated by '\t';

Load a local file:

load data local inpath '/opt/module/hive/datas/student.txt' into table student;

Load an HDFS file (put it on HDFS first; note that LOAD from an HDFS path moves the file rather than copying it):

hadoop fs -put /opt/module/hive/datas/student.txt /user/atguigu
load data inpath '/user/atguigu/student.txt' into table student; -- append
load data inpath '/user/atguigu/student.txt' overwrite into table student; -- overwrite

3.2 INSERT

INSERT (INTO | OVERWRITE) ... SELECT

INSERT (INTO | OVERWRITE) TABLE tablename
[PARTITION (...)]
select_statement;

Example:

create table student1(
  id int,
  name string
)
row format delimited fields terminated by '\t';

insert overwrite table student1
select id, name from student;

INSERT ... VALUES

INSERT INTO TABLE tablename VALUES
  (v1, v2, ...),
  (v1, v2, ...);

Example:

insert into table student1 values (1,'wangwu'),(2,'zhaoliu');

3.3 INSERT OVERWRITE DIRECTORY

INSERT OVERWRITE [LOCAL] DIRECTORY 'directory'
[ROW FORMAT row_format]
[STORED AS file_format]
select_statement;

Example (write to a local directory):

insert overwrite local directory '/opt/module/hive/datas/student_out'
row format delimited fields terminated by '\t'
select id, name from student;

3.4 Export / Import

Export writes both the data and the metadata to HDFS; Import can restore them into another Hive instance.

Export:

EXPORT TABLE tablename TO 'export_target_path';

Import:

IMPORT [EXTERNAL] TABLE new_or_original_tablename
FROM 'source_path'
[LOCATION 'import_target_path'];

Example:

export table db_hive3.student to '/user/hive/warehouse/export/student';
import table student from '/user/hive/warehouse/export/student';

Note: the import fails if the target table already exists and contains data files.


4. Queries (SELECT)

4.1 Basic Syntax and Conventions

SELECT [ALL | DISTINCT] select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Writing conventions:

  • SQL is case-insensitive
  • Keywords must be written in full and cannot be split
  • Put clauses on their own lines and indent them for readability

4.2 Common Clause Examples

Column aliases:

select
  ename as name,
  deptno as dn
from emp;

Limit:

select * from emp limit 5;
select * from emp limit 2, 3;

Where:

select * from emp where sal > 1000;
select ename as name, deptno as dn from emp where sal > 1000;

Aggregation and grouping:

select count(*) cnt from emp;
select max(sal) max_sal from emp;
select min(sal) min_sal from emp;
select sum(sal) sum_sal from emp;
select avg(sal) avg_sal from emp;

select deptno, avg(sal) avg_sal
from emp
group by deptno
having avg_sal > 2000;

4.3 Join / Union / Cartesian Product

Prepare the tables:

create table if not exists dept(
  deptno int,
  dname string,
  loc int
)
row format delimited fields terminated by '\t';

create table if not exists emp(
  empno int,
  ename string,
  job string,
  sal double,
  deptno int
)
row format delimited fields terminated by '\t';

Equi-join (Hive supports only equality join conditions):

select e.empno, e.ename, d.dname
from emp e
join dept d
on e.deptno = d.deptno;

Left outer join:

select e.empno, e.ename, d.deptno
from emp e
left join dept d
on e.deptno = d.deptno;

Full outer join:

select e.empno, e.ename, d.deptno
from emp e
full join dept d
on e.deptno = d.deptno;

Multi-table join:

create table if not exists location(
  loc int,
  loc_name string
)
row format delimited fields terminated by '\t';

select e.ename, d.dname, l.loc_name
from emp e
join dept d on d.deptno = e.deptno
join location l on d.loc = l.loc;

Cartesian product (use with caution):

select empno, dname
from emp, dept;

Union / Union All:

select * from emp where deptno = 30
union
select * from emp where deptno = 40;

-- union all keeps duplicates (see the example below)

Requirements:

  1. Both queries must return the same number of columns
  2. Corresponding columns must have matching types
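
The union all counterpart referenced above; since duplicates are kept, unioning a query with itself returns every row twice:

select * from emp where deptno = 30
union all
select * from emp where deptno = 30;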

4.4 Sorting (ORDER/SORT/DISTRIBUTE/CLUSTER BY)

Tip: set the number of reducers when experimenting.

set mapreduce.job.reduces = 3;

Global sort: ORDER BY

select * from emp order by sal desc;

Per-reducer sort: SORT BY

insert overwrite local directory '/opt/module/hive/datas/sortby-result'
select * from emp sort by deptno desc;

Routing rows to reducers: DISTRIBUTE BY (usually combined with SORT BY)

insert overwrite local directory '/opt/module/hive/datas/distribute-result'
select *
from emp
distribute by deptno
sort by sal desc;

Notes:

  • Routing rule: rows with the same hash(column) % reduce_num land on the same reducer
  • Syntax: distribute by must appear before sort by

Distribute and sort together: CLUSTER BY

When distribute by and sort by use the same column, cluster by can replace them (ascending order only):

select * from emp cluster by deptno;

5. Functions

5.1 Browsing Functions

show functions;
desc function upper;
desc function extended upper;

5.2 Numeric / String / Date / Control-Flow / Collection Functions

Numeric functions

select round(3.3);  -- 3
select ceil(3.1);   -- 4
select floor(4.8);  -- 4

String functions

-- substring
select substring('atguigu', 3, 2);

-- replace
select replace('a-b-c', '-', '_');

-- regexp_replace (regex replace; backslashes must be escaped in the pattern)
select regexp_replace('abc123', '\\d+', 'X');

-- regexp (regex match)
select 'dfsaaaa' regexp 'dfsa+';

-- repeat
select repeat('123', 3);

-- split
select split('a-b-c-d', '-');

-- nvl
select nvl(null, 'fallback');

-- concat / concat_ws
select concat('a','b','c');
select concat_ws('-', 'a','b','c');

-- get_json_object
select get_json_object(
  '[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]',
  '$.[0]'
);

Date functions

select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');
select from_unixtime(1700000000, 'yyyy-MM-dd');

select current_date;
select current_timestamp;

select month('2022-08-08 08:08:08');
select day('2022-08-08 08:08:08');
select hour('2022-08-08 08:08:08');

select datediff('2022-08-10', '2022-08-08');
select date_add('2022-08-08', 2);
select date_sub('2022-03-01', 1);
select date_format('2022-08-08','yyyy年-MM月-dd日');

Control flow

-- case when
select
  case
    when sal >= 5000 then 'A'
    when sal >= 3000 then 'B'
    else 'C'
  end as level
from emp;

-- if
select if(sal > 3000, 'high', 'low') from emp;

Collection functions

-- size
select size(friends) from teacher;

-- map / map_keys / map_values
select map('xiaohai',1,'dahai',2);
select map_keys(map('k1',1,'k2',2));
select map_values(map('k1',1,'k2',2));

-- array / array_contains / sort_array
select array('1','2','3','4');
select array_contains(array('a','b'), 'b');
select sort_array(array('a','d','c'));

-- struct / named_struct
select struct('name','age','weight');
select named_struct('name','xiaosong','age',18,'weight',80);

5.3 User-Defined Functions: UDF/UDAF/UDTF

Categories:

  • UDF: one row in, one row out
  • UDAF: many rows in, one row out (e.g. count/max/min)
  • UDTF: one row in, many rows out (e.g. explode; see the sketch below)
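
A one-line illustration of the built-in UDTF explode, which turns a single input row into several output rows:

select explode(array('a','b','c'));  -- returns three rows: a, b, c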

Development / usage flow (overview):

  1. Extend a Hive base class:

    • org.apache.hadoop.hive.ql.udf.generic.GenericUDF
    • org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
  2. Implement the abstract methods

  3. Register the function in Hive and use it

Temporary function (session-scoped)

add jar /opt/module/hive/datas/myudf.jar;

create temporary function my_len
as "cc.ivera.hive.udf.MyUDF";

select ename, my_len(ename) as ename_len from emp;

drop temporary function my_len;

Permanent function (database-scoped, backed by a jar on HDFS)

hdfs dfs -mkdir /udf
hdfs dfs -put myudf.jar /udf/myudf.jar
create function my_len2
as "cc.ivera.hive.udf.MyUDF"
using jar "hdfs://hadoop102:8020/udf/myudf.jar";

select ename, my_len2(ename) as ename_len from emp;

drop function my_len2;

6. Partitioned and Bucketed Tables

6.1 Partitioned Tables

Partitioning splits a large table into multiple directories along a business dimension (path-level separation); queries that filter on the partition column scan far less data.

Create the table:

create table dept_partition(
  deptno int,
  dname  string,
  loc    string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Write data (load):

load data local inpath '/opt/module/hive/datas/dept_20220401.log'
into table dept_partition
partition(day='20220401');

Read data:

select deptno, dname, loc, day
from dept_partition
where day = '20220401';

Partition operations:

show partitions dept_partition;

alter table dept_partition add partition(day='20220403');
alter table dept_partition add partition(day='20220404') partition(day='20220405');

alter table dept_partition drop partition(day='20220403');
alter table dept_partition drop partition(day='20220404'), partition(day='20220405');

Repairing partitions (when the metastore and HDFS are out of sync):

msck repair table dept_partition;
-- or
msck repair table dept_partition add partitions;
msck repair table dept_partition drop partitions;
msck repair table dept_partition sync partitions;

6.2 Dynamic Partitioning

Meaning: on insert, partition values are not written by hand; each row's column values determine its partition.

Key parameters:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.created.files=100000;
set hive.error.on.empty.partition=false;

Example: write partitions dynamically by loc

create table dept_partition_dynamic(
  id int,
  name string
)
partitioned by (loc int)
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition.mode = nonstrict;

insert into table dept_partition_dynamic
partition(loc)
select deptno, dname, loc
from dept;

6.3 Bucketed Tables

Bucketing organizes data at a finer granularity (file level); it helps with shuffling, sampling, and certain join optimizations.

Create the table:

create table stu_buck(
  id int,
  name string
)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

Load:

load data local inpath '/opt/module/hive/datas/student.txt'
into table stu_buck;

Bucketed and sorted table:

create table stu_buck_sort(
  id int,
  name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive/datas/student.txt'
into table stu_buck_sort;

6.4 Partitioning vs Bucketing

  • Purpose: Partition prunes the directories/files scanned (partition pruning); Bucket reduces shuffle and splits data more evenly
  • Typical query: Partition WHERE dt='20240101'; Bucket WHERE user_id=... (filtering on the bucketing column)
  • Common dimensions: Partition uses time, region, or business date; Bucket uses high-cardinality columns such as user/order IDs
  • Storage layout: Partition produces many directories; Bucket produces fewer directories with multiple bucket files inside