1. Databases
1.1 Creating a Database
Syntax:
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Examples:
create database db_hive1;
create database db_hive2 location '/db_hive2';
create database db_hive3
with dbproperties('create_date'='2022-11-18');
1.2 Listing Databases
SHOW DATABASES [LIKE 'identifier_with_wildcards'];
Example:
show databases like 'db_hive*';
1.3 Describing a Database
DESCRIBE DATABASE [EXTENDED] db_name;
Example:
desc database db_hive3;
Output structure (field meanings):
| Field | Meaning |
|---|---|
| db_name | Database name |
| comment | Comment |
| location | HDFS path |
| owner_name / owner_type | Owner information |
| parameters | DBPROPERTIES |
1.4 Altering a Database
You can modify: dbproperties, location, and owner user. Note: changing a database's location does not move or affect existing tables; it only changes the default parent directory for tables created afterwards. (Sketches of the other two forms follow the example below.)
ALTER DATABASE db_hive3
SET DBPROPERTIES ('create_date'='2022-11-20');
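The statement above covers dbproperties. A minimal sketch of the location and owner forms, assuming the same db_hive3 database (the target path and user name are illustrative):
ALTER DATABASE db_hive3 SET LOCATION 'hdfs://hadoop102:8020/db_hive3_new'; -- affects only tables created afterwards
ALTER DATABASE db_hive3 SET OWNER USER atguigu;                            -- change the owner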
1.5 Dropping a Database
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
- RESTRICT (default): the drop fails if the database is not empty
- CASCADE: also drops every table in the database
Example (failure: database not empty):
drop database db_hive3;
-- FAILED: ... Database db_hive3 is not empty. One or more tables exist.
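To drop a non-empty database, CASCADE works, but it permanently removes all tables inside:
drop database db_hive3 cascade;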
1.6 Switching the Current Database
USE database_name;
2. Tables
2.1 CREATE TABLE Syntax Overview
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
col_name data_type [COMMENT col_comment],
...
)
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];
Key concepts:
- TEMPORARY: temporary table, visible only in the current session and dropped automatically when the session ends (see the sketch below)
- EXTERNAL: external table (Hive manages only the metadata, not the HDFS data)
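A minimal sketch of a temporary table (the table name is illustrative):
create temporary table tmp_ids (id int); -- disappears when the session ends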
2.2 Data Types
Primitive types
| Hive type | Description | Example |
|---|---|---|
| tinyint | 1-byte signed integer | |
| smallint | 2-byte signed integer | |
| int | 4-byte signed integer | |
| bigint | 8-byte signed integer | |
| boolean | Boolean | true/false |
| float | Single-precision floating point | |
| double | Double-precision floating point | |
| decimal | Exact decimal number | decimal(16,2) |
| varchar | Variable-length string with a declared maximum length | varchar(32) |
| string | String | |
| timestamp | Timestamp | |
| binary | Binary data | |
Complex types
| Type | Description | Declaration example | Access example |
|---|---|---|---|
| array | Array of same-typed elements | array<string> | arr[0] |
| map | Set of key/value pairs | map<string,int> | map['key'] |
| struct | Struct | struct<id:int,name:string> | struct.id |
Type conversion
Implicit conversion (examples):
- tinyint → int → bigint
- int/float/string → double
- tinyint/smallint/int → float
- boolean cannot be converted to any other type
Explicit conversion with cast:
cast(expr as <type>);
select '1' + 2, cast('1' as int) + 2;
2.3 Partitioned and Bucketed Table Keywords
- PARTITIONED BY: partitioned table (data split across directories)
- CLUSTERED BY ... SORTED BY ... INTO ... BUCKETS: bucketed table (data split across files)
2.4 Row Format / SerDe
2.4.1 Default SerDe: ROW FORMAT DELIMITED
ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
Meaning:
- FIELDS TERMINATED BY: column delimiter
- COLLECTION ITEMS TERMINATED BY: delimiter between elements of an array/map/struct
- MAP KEYS TERMINATED BY: delimiter between a map entry's key and value
- LINES TERMINATED BY: row delimiter
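A minimal sketch with every delimiter spelled out (table and columns are illustrative):
create table person_info (
    name     string,
    friends  array<string>,
    children map<string,int>
)
row format delimited
fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';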
2.4.2 Custom SerDe: ROW FORMAT SERDE
ROW FORMAT SERDE 'serde_name'
WITH SERDEPROPERTIES (...);
2.5 File Formats: STORED AS
Common formats: textfile (default), sequencefile, orc, parquet. For example:
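A minimal sketch using columnar ORC storage (the table name is illustrative):
create table emp_orc (
    empno int,
    ename string
)
stored as orc;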
2.6 LOCATION / TBLPROPERTIES
- LOCATION: explicit HDFS path; the default is ${hive.metastore.warehouse.dir}/db_name.db/table_name
- TBLPROPERTIES: table-level key/value parameters
A combined sketch follows.
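A minimal sketch combining both clauses (the path and property values are illustrative):
create table t_demo (id int)
stored as orc
location '/user/hive/warehouse/t_demo'
tblproperties ('orc.compress' = 'SNAPPY');
show tblproperties t_demo;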
2.7 CTAS and LIKE
CTAS (Create Table As Select)
Creates a table directly from a query: the schema matches the query result, and the data is written in.
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
AS select_statement;
Create Table Like
Copies an existing table's structure without any data.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE exist_table_name
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)];
2.8 Managed vs. External Tables
Managed (internal) tables
- Hive manages both the metadata and the HDFS data
- drop table deletes the HDFS data directory as well
Example:
create table if not exists student(
id int,
name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/db_hive3.db/student';
Prepare the data:
vim /opt/module/datas/student.txt
# Sample contents:
# 1001 student1
# 1002 student2
# ...
hadoop fs -put /opt/module/datas/student.txt /user/hive/warehouse/db_hive3.db/student
External tables (EXTERNAL)
- Hive manages only the metadata
- drop table deletes only the metadata; the HDFS data remains
create external table if not exists student(
id int,
name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/db_hive3.db/student';
2.9 JSON SerDe Example
Sample JSON:
{
"name": "dasongsong",
"friends": ["bingbing", "lili"],
"students": {"xiaohaihai": 18, "xiaoyangyang": 16},
"address": {"street": "hui long guan", "city": "beijing", "postal_code": 10010}
}
Create the table:
create table teacher (
name string,
friends array<string>,
students map<string,int>,
address struct<city:string,street:string,postal_code:int>
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/user/hive/warehouse/db_hive3.db/teacher';
Load the data file:
vim /opt/module/datas/teacher.txt
# Single-line JSON:
# {"name":"dasongsong","friends":["bingbing","lili"],"students":{"xiaohaihai":18,"xiaoyangyang":16},"address":{"street":"hui long guan","city":"beijing","postal_code":10010}}
hadoop fs -put /opt/module/datas/teacher.txt /user/hive/warehouse/db_hive3.db/teacher
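A minimal sketch of querying the complex-typed columns (assumes the teacher table above):
select
    name,
    friends[0],               -- array element by index
    students['xiaohaihai'],   -- map value by key
    address.city              -- struct field
from teacher;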
CTAS / LIKE examples:
create table teacher1 as select * from teacher; -- teacher1 contains data
create table teacher2 like teacher;             -- teacher2 is empty (structure only)
2.10 Viewing / Altering Tables
Listing tables
SHOW TABLES [IN database_name] LIKE 'identifier_with_wildcards';
Example:
show tables like 'teacher*';
Describing a table
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name;
Examples:
desc extended teacher;
desc formatted teacher;
Altering tables
1) Rename:
ALTER TABLE table_name RENAME TO new_table_name;
alter table student rename to stu;
alter table teacher1 rename to teacher_demo;
2) Add columns (appended at the end):
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
alter table stu add columns(age int);
3) Change a column (rename / type / comment / position):
ALTER TABLE table_name
CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name];
alter table stu change column age ages double;
4) Replace columns (replaces the entire column set):
ALTER TABLE table_name REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...);
alter table stu replace columns(id int, name string);
Dropping / truncating a table (note: TRUNCATE works only on managed tables and fails on external tables)
DROP TABLE [IF EXISTS] table_name;
TRUNCATE TABLE table_name;
3. DML (Data Manipulation)
3.1 Loading Data with LOAD
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)];
Example table:
create table student(
id int,
name string
)
row format delimited fields terminated by '\t';
Load a local file:
load data local inpath '/opt/module/hive/datas/student.txt' into table student;
Load an HDFS file (put it onto HDFS first; note that loading from HDFS moves the source file into the table directory):
hadoop fs -put /opt/module/hive/datas/student.txt /user/atguigu
load data inpath '/user/atguigu/student.txt' into table student;           -- append
load data inpath '/user/atguigu/student.txt' overwrite into table student; -- overwrite
3.2 Writing Data with INSERT
INSERT (INTO | OVERWRITE) ... SELECT
INSERT (INTO | OVERWRITE) TABLE tablename
[PARTITION (...)]
select_statement;
Example:
create table student1(
id int,
name string
)
row format delimited fields terminated by '\t';
insert overwrite table student1
select id, name from student;
INSERT ... VALUES
INSERT (INTO | OVERWRITE) TABLE tablename VALUES
(v1, v2, ...),
(v1, v2, ...);
Example:
insert into table student1 values (1,'wangwu'),(2,'zhaoliu');
3.3 INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE [LOCAL] DIRECTORY 'directory'
[ROW FORMAT row_format]
[STORED AS file_format]
select_statement;
Example (write to a local directory):
insert overwrite local directory '/opt/module/hive/datas/student_out'
row format delimited fields terminated by '\t'
select id, name from student;
3.4 Export / Import
Export writes both the data and the metadata to HDFS; Import can restore them on another Hive instance.
Export:
EXPORT TABLE tablename TO 'export_target_path';
Import:
IMPORT [EXTERNAL] TABLE new_or_original_tablename
FROM 'source_path'
[LOCATION 'import_target_path'];
Example:
export table db_hive3.student to '/user/hive/warehouse/export/student';
import table student from '/user/hive/warehouse/export/student';
Note: the import fails if the target table already exists and contains data files.
4. Queries (SELECT)
4.1 Basic Syntax and Conventions
SELECT [ALL | DISTINCT] select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Style conventions:
- SQL is case-insensitive
- Keywords cannot be split across lines or abbreviated
- Put clauses on separate lines and indent them for readability
4.2 Common Clause Examples
Column aliases:
select
ename as name,
deptno as dn
from emp;
Limit:
select * from emp limit 5;
select * from emp limit 2, 3;
Where:
select * from emp where sal > 1000;
select ename as name, deptno as dn from emp where sal > 1000;
Aggregation and grouping:
select count(*) cnt from emp;
select max(sal) max_sal from emp;
select min(sal) min_sal from emp;
select sum(sal) sum_sal from emp;
select avg(sal) avg_sal from emp;
select deptno, avg(sal) avg_sal
from emp
group by deptno
having avg_sal > 2000;
4.3 Join / Union / Cartesian Product
Prepare the tables:
create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
create table if not exists emp(
empno int,
ename string,
job string,
sal double,
deptno int
)
row format delimited fields terminated by '\t';
Equi-join (Hive supports only equality join conditions):
select e.empno, e.ename, d.dname
from emp e
join dept d
on e.deptno = d.deptno;
Left outer join:
select e.empno, e.ename, d.deptno
from emp e
left join dept d
on e.deptno = d.deptno;
Full outer join:
select e.empno, e.ename, d.deptno
from emp e
full join dept d
on e.deptno = d.deptno;
Multi-table join:
create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';
select e.ename, d.dname, l.loc_name
from emp e
join dept d on d.deptno = e.deptno
join location l on d.loc = l.loc;
Cartesian product (use with caution):
select empno, dname
from emp, dept;
Union / Union All:
select * from emp where deptno = 30
union
select * from emp where deptno = 40;
-- union all keeps duplicates
Requirements:
1) Both queries must return the same number of columns
2) Corresponding column types must match
4.4 Sorting (ORDER BY / SORT BY / DISTRIBUTE BY / CLUSTER BY)
Tip: set the reducer count for the demos below.
set mapreduce.job.reduces = 3;
Global sort: ORDER BY (runs through a single reducer, so it is expensive on large data; see the sketch below)
select * from emp order by sal desc;
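Since ORDER BY funnels everything through one reducer, a common pattern is to cap the result with LIMIT; a minimal sketch:
select * from emp order by sal desc limit 3;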
Per-reducer sort: SORT BY
insert overwrite local directory '/opt/module/hive/datas/sortby-result'
select * from emp sort by deptno desc;
Partitioning rows to reducers: DISTRIBUTE BY (usually combined with SORT BY)
insert overwrite local directory '/opt/module/hive/datas/distribute-result'
select *
from emp
distribute by deptno
sort by sal desc;
Notes:
- Partitioning rule: hash(column) % reduce_num; rows with the same hash land on the same reducer
- Syntax: distribute by must appear before sort by
Distribute and sort in one: CLUSTER BY
When distribute by and sort by use the same column, cluster by is a shorthand (ascending order only):
select * from emp cluster by deptno;
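The query above is equivalent to spelling out both clauses:
select * from emp distribute by deptno sort by deptno;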
5. Functions
5.1 Inspecting Functions
show functions;
desc function upper;
desc function extended upper;
5.2 Numeric / String / Date / Flow-Control / Collection Functions
Numeric functions
select round(3.3); -- 3
select ceil(3.1); -- 4
select floor(4.8); -- 4
String functions
-- substring
select substring('atguigu', 3, 2);
-- replace
select replace('a-b-c', '-', '_');
-- regexp_replace (regex replacement; the backslash itself must be escaped in the string literal)
select regexp_replace('abc123', '\\d+', 'X');
-- regexp (regex match)
select 'dfsaaaa' regexp 'dfsa+';
-- repeat
select repeat('123', 3);
-- split
select split('a-b-c-d', '-');
-- nvl
select nvl(null, 'fallback');
-- concat / concat_ws
select concat('a','b','c');
select concat_ws('-', 'a','b','c');
-- get_json_object
select get_json_object(
'[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]',
'$.[0]'
);
Date functions
select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');
select from_unixtime(1700000000, 'yyyy-MM-dd');
select current_date;
select current_timestamp;
select month('2022-08-08 08:08:08');
select day('2022-08-08 08:08:08');
select hour('2022-08-08 08:08:08');
select datediff('2022-08-10', '2022-08-08');
select date_add('2022-08-08', 2);
select date_sub('2022-03-01', 1);
select date_format('2022-08-08','yyyy年-MM月-dd日');
Flow control
-- case when
select
case
when sal >= 5000 then 'A'
when sal >= 3000 then 'B'
else 'C'
end as level
from emp;
-- if
select if(sal > 3000, 'high', 'low') from emp;
Collection functions
-- size
select size(friends) from teacher;
-- map / map_keys / map_values
select map('xiaohai',1,'dahai',2);
select map_keys(map('k1',1,'k2',2));
select map_values(map('k1',1,'k2',2));
-- array / array_contains / sort_array
select array('1','2','3','4');
select array_contains(array('a','b'), 'b');
select sort_array(array('a','d','c'));
-- struct / named_struct
select struct('name','age','weight');
select named_struct('name','xiaosong','age',18,'weight',80);
5.3 User-Defined Functions: UDF / UDAF / UDTF
Categories:
- UDF: one row in, one row out
- UDAF: many rows in, one row out (e.g. count/max/min)
- UDTF: one row in, many rows out (e.g. explode)
Development / usage workflow (overview):
1. Extend one of the Hive base classes:
org.apache.hadoop.hive.ql.udf.generic.GenericUDF
org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
2. Implement the abstract methods
3. Register the function in Hive and call it
Temporary function (session-scoped)
add jar /opt/module/hive/datas/myudf.jar;
create temporary function my_len
as "cc.ivera.hive.udf.MyUDF";
select ename, my_len(ename) as ename_len from emp;
drop temporary function my_len;
Permanent function (database-scoped; the jar must live on HDFS)
hdfs dfs -mkdir /udf
hdfs dfs -put myudf.jar /udf/myudf.jar
create function my_len2
as "cc.ivera.hive.udf.MyUDF"
using jar "hdfs://hadoop102:8020/udf/myudf.jar";
select ename, my_len2(ename) as ename_len from emp;
drop function my_len2;
6. Partitioned and Bucketed Tables
6.1 Partitioned Tables
Partitioning splits a large table into multiple directories along a business dimension (path-level separation); queries that filter on the partition column scan far less data.
Create the table:
create table dept_partition(
deptno int,
dname string,
loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Write data (load):
load data local inpath '/opt/module/hive/datas/dept_20220401.log'
into table dept_partition
partition(day='20220401');
Read data:
select deptno, dname, loc, day
from dept_partition
where day = '20220401';
Partition operations:
show partitions dept_partition;
alter table dept_partition add partition(day='20220403');
alter table dept_partition add partition(day='20220404') partition(day='20220405');
alter table dept_partition drop partition(day='20220403');
alter table dept_partition drop partition(day='20220404'), partition(day='20220405');
Repairing partitions (when the metastore and HDFS disagree); a typical scenario follows the commands:
msck repair table dept_partition;
-- or
msck repair table dept_partition add partitions;
msck repair table dept_partition drop partitions;
msck repair table dept_partition sync partitions;
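A typical scenario, assuming the table sits under the default warehouse path (the path shown is illustrative): a partition directory created directly on HDFS is invisible to the metastore until msck registers it.
hadoop fs -mkdir -p /user/hive/warehouse/dept_partition/day=20220406
msck repair table dept_partition add partitions;
show partitions dept_partition;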
6.2 Dynamic Partitioning
With dynamic partitioning, the partition values in an insert are not written by hand; they are derived from the data rows.
Key parameters:
set hive.exec.dynamic.partition=true;             -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;   -- allow all partition columns to be dynamic
set hive.exec.max.dynamic.partitions=1000;        -- max dynamic partitions per job
set hive.exec.max.dynamic.partitions.pernode=100; -- max dynamic partitions per node
set hive.exec.max.created.files=100000;           -- max files created per job
set hive.error.on.empty.partition=false;          -- do not fail when a dynamic partition is empty
Example: write partitions dynamically based on loc
create table dept_partition_dynamic(
id int,
name string
)
partitioned by (loc int)
row format delimited fields terminated by '\t';
set hive.exec.dynamic.partition.mode = nonstrict;
insert into table dept_partition_dynamic
partition(loc)
select deptno, dname, loc
from dept;
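To confirm which partitions the insert created, a quick check:
show partitions dept_partition_dynamic;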
6.3 Bucketed Tables
Bucketing organizes data at file granularity (finer than partitioning); each row goes to the bucket numbered hash(bucket column) % bucket count. It helps with sampling, shuffle reduction, and certain join optimizations.
Create the table:
create table stu_buck(
id int,
name string
)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
Load:
load data local inpath '/opt/module/hive/datas/student.txt'
into table stu_buck;
Bucketed and sorted table:
create table stu_buck_sort(
id int,
name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/hive/datas/student.txt'
into table stu_buck_sort;
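Bucket files enable cheap sampling; a minimal sketch that reads one of the four buckets:
select * from stu_buck tablesample (bucket 1 out of 4 on id);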
6.4 Partitioning vs. Bucketing
| Aspect | Partition | Bucket |
|---|---|---|
| Purpose | Prune scanned directories/files (partition pruning) | Reduce shuffle / split data more evenly |
| Typical predicate | WHERE dt='20240101' | WHERE user_id=... (on the bucket column) |
| Typical dimensions | Time, region, business date | High-cardinality keys such as user ID / order ID |
| On-disk layout | Many directories | Fewer directories with multiple bucket files inside |