【Hive】DDL in HQL



DDL (Data Definition Language): the main commands are CREATE, ALTER, DROP, and so on.

DDL is mainly used to define or modify the structure or data types of database objects.


1. Database Operations

Hive has a default database named default. When running HQL, if you do not explicitly specify which database to use, the default database is used.

  • Hive database names and table names are case-insensitive;

  • Names cannot start with a digit;

  • Keywords cannot be used as names, and special characters should be avoided;

  1. Create database syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];


-- Create a database; its storage path on HDFS is /user/hive/warehouse/*.db
hive (default)> create database mydb;
hive (default)> dfs -ls /user/hive/warehouse;
Found 2 items
drwxr-xr-x   - root supergroup          0 2020-08-16 17:37 /user/hive/warehouse/mydb.db
drwxr-xr-x   - root supergroup          0 2020-08-16 16:15 /user/hive/warehouse/test1.db


-- Use if not exists to avoid an error when the database already exists [standard practice]
hive (default)> create database if not exists mydb;


-- Create a database with a comment and a specified storage location
hive (default)> create database if not exists mydb2
              > comment 'this is mydb2'
              > location '/user/hive/mydb2.db';
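The WITH DBPROPERTIES clause from the syntax above attaches key/value properties to a database. A minimal sketch; the database name mydb3 and the property names are made up for illustration:

-- Create a database with custom properties
hive (default)> create database if not exists mydb3
              > comment 'this is mydb3'
              > with dbproperties ('creator'='demo', 'createdate'='2020-08-16');

-- The properties appear in the extended description
hive (default)> desc database extended mydb3;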
  2. Show databases
-- Show all databases
hive (default)> show databases;

-- Show database information
hive (default)> desc database mydb2;

hive (default)> desc database extended mydb2;

hive (default)> describe database extended mydb2;
  3. Use a database
use mydb;
  4. Drop a database
-- Drop an empty database
hive (default)> drop database databasename;

-- If the database is not empty (it contains tables), use cascade to force-drop it
hive (default)> drop database databasename cascade;
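if exists can be combined with cascade so the statement does not fail when the database is absent; a minimal sketch:

-- Drop the database only if it exists, together with any tables it contains
hive (default)> drop database if exists databasename cascade;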

2. CREATE TABLE Statement

create [external] table [IF NOT EXISTS] table_name
[(colName colType [comment 'comment'], ...)]
[comment table_comment]
[partitioned by (colName colType [comment col_comment], ...)]
[clustered by (colName, colName, ...)
[sorted by (col_name [ASC|DESC], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
  1. CREATE TABLE: creates a table with the given name; if the table already exists, an error is thrown. Use if not exists to avoid this.

  2. EXTERNAL keyword: creates an external table; otherwise an internal (managed) table is created.

  • When an internal table is dropped, both the data and the table definition are deleted;
  • When an external table is dropped, only the table definition is deleted and the data is kept;
  • External tables are mostly used in production;
  3. comment: the table's comment

  4. partitioned by: partitions the table data; specifies the partition column(s)

  5. clustered by: creates a bucketed table; specifies the bucketing column(s)

  6. sorted by: sorts one or more columns within each bucket; rarely used

  7. Storage clause:

ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char] | SERDE serde_name
[WITH SERDEPROPERTIES (property_name=property_value,
property_name=property_value, ...)]

A SerDe can be specified when creating a table. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the default SerDe is used. When creating a table you also specify its columns, and a custom SerDe can be specified along with them; Hive uses the SerDe to interpret the data in each column of the table. SerDe is short for Serialize/Deserialize; Hive uses SerDes to serialize and deserialize objects.

  8. stored as SEQUENCEFILE|TEXTFILE|RCFILE

If the file data is plain text, use STORED AS TEXTFILE (the default); if the data needs to be compressed, use STORED AS SEQUENCEFILE (a binary sequence file).

  9. LOCATION: where the table's data is stored on HDFS

  10. TBLPROPERTIES: defines table properties

  11. AS: can be followed by a query; the table is created from the query's results

  12. LIKE: like table_name copies the structure of an existing table without copying its data (see the sketch below)
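A minimal sketch of a few of these clauses; the table names csv_demo, t1_ctas, and t1_like are made up for illustration, t1 is the table created in section 3 below, and org.apache.hadoop.hive.serde2.OpenCSVSerde is Hive's built-in CSV SerDe:

-- Explicit SERDE clause (instead of ROW FORMAT DELIMITED)
create table if not exists csv_demo (id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ('separatorChar' = ',')
stored as textfile;

-- AS: create and populate a table from a query result (CTAS)
create table if not exists t1_ctas as select id, name from t1;

-- LIKE: copy only the structure of an existing table, no data
create table if not exists t1_like like t1;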

3. Internal & External Tables

When creating a table, you can specify its type.

There are two types of tables:

  • Internal tables (managed tables)
  • External tables

Tips:

  • By default, an internal table is created. To create an external table, use the external keyword
  • When an internal table is dropped, both the table definition (metadata) and the data are deleted
  • When an external table is dropped, only the table definition is deleted; the data is kept
  • External tables are mostly used in production

(1) Internal tables

Contents of the t1.dat file:

2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
  1. Create the table:
-- Create an internal table
create table t1(
  id int,
  name string,
  hobby array<string>,
  addr  map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";

-- Show the table definition (less information)
hive (mydb)> desc t1;


-- Show the table definition (more information, nicely formatted)
hive (mydb)> desc formatted t1;


-- Load data
hive (mydb)> load data local inpath '/home/hadoop/data/t1.dat' into table t1;
Loading data to table mydb.t1
OK
Time taken: 0.546 seconds


-- Query the data
hive (mydb)> select * from t1;
OK
t1.id	t1.name	t1.hobby	t1.addr
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}
Time taken: 1.214 seconds, Fetched: 3 row(s)


-- Check the data file
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
Found 1 items
-rwxr-xr-x   3 root supergroup        148 2020-08-16 18:28 /user/hive/warehouse/mydb.db/t1/t1.dat


-- Drop the table. Both the table and the data are deleted
hive (mydb)> drop table t1;
OK
Time taken: 0.317 seconds


-- Check the data file again; it has been deleted

(2) External tables

-- Create an external table
create external table t2(
  id int,
  name string,
  hobby array<string>,
  addr  map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";


-- Show the table definition
hive (mydb)> desc formatted t2;

-- Load data
hive (mydb)> load data local inpath '/home/hadoop/data/t1.dat' into table t2;
Loading data to table mydb.t2
OK
Time taken: 0.707 seconds

-- Query the data
hive (mydb)> select * from t2;
OK
t2.id	t2.name	t2.hobby	t2.addr
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}
Time taken: 0.298 seconds, Fetched: 3 row(s)


-- Drop the table. The table is gone, but the directory still exists
hive (mydb)> drop table t2;
OK
Time taken: 0.291 seconds


-- Check the data file again; it still exists
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db;
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-08-16 18:32 /user/hive/warehouse/mydb.db/t2


hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxr-xr-x   3 root supergroup        148 2020-08-16 18:32 /user/hive/warehouse/mydb.db/t2/t1.dat


hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/*;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin

(3) Converting between internal and external tables

-- Create an internal table, load data, then check the data file and the table definition
create table t1(
  id int,
  name string,
  hobby array<string>,
  addr map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";


load data local inpath '/home/hadoop/data/t1.dat' into table t1;
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
hive (mydb)> desc formatted t1;

-- Convert the internal table to an external table
hive (mydb)> alter table t1 set tblproperties('EXTERNAL'='TRUE');

-- Check the table info to verify the conversion
hive (mydb)> desc formatted t1;

-- Convert the external table back to an internal table. EXTERNAL must be uppercase; the value false is case-insensitive
hive (mydb)> alter table t1 set tblproperties('EXTERNAL'='FALSE');

-- Check the table info to verify the conversion
hive (mydb)> desc formatted t1;

(4) Summary

  1. When creating a table:
  • Without the external keyword, an internal table is created;
  • With the external keyword, an external table is created;
  2. When dropping a table:
  • Dropping an external table only deletes the table definition; the data is not affected
  • Dropping an internal table deletes both the data and the definition
  3. When to use external tables:
  • Use them when the data must be preserved; production environments mostly use external tables

4. Partitioned Tables

When executing a query, Hive normally scans the data of the entire table.

Because tables can hold large amounts of data, a full table scan is slow and inefficient.

Often a query only needs to scan part of the table, so Hive introduces partitioned tables: the table's data is stored in separate subdirectories, and each subdirectory corresponds to one partition.

When only some partitions are queried, the full table scan is avoided and query efficiency improves.

In practice, tables are usually partitioned by time, region, and similar attributes.
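For example, with the table t3 created below (partitioned by dt), filtering on the partition column lets Hive read only the matching subdirectories instead of the whole table; a minimal sketch:

-- Only the dt=2020-06-01 subdirectory is scanned
select * from t3 where dt = '2020-06-01';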

  1. Creating a partitioned table and loading data

Note: the partition column is not part of the data stored in the table; it can be viewed as a pseudo column.

-- Create the table
create table if not exists t3(
  id int,
  name string,
  hobby array<string>,
  addr map<String,string>
)
partitioned by (dt string)
row format delimited
fields terminated by ';'
collection items terminated by ','
map keys terminated by ':';

-- Load data.
load data local inpath "/home/hadoop/data/t1.dat" into table t3
partition(dt="2020-06-01");
load data local inpath "/home/hadoop/data/t1.dat" into table t3
partition(dt="2020-06-02");
  2. Viewing partitions
hive (mydb)> show partitions t3;
OK
partition
dt=2020-06-01
dt=2020-06-02
Time taken: 0.107 seconds, Fetched: 2 row(s)

hive (mydb)> select * from t3;
OK
t3.id	t3.name	t3.hobby	t3.addr	t3.dt
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-01
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-01
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-01
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-02
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-02
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-02
Time taken: 1.948 seconds, Fetched: 6 row(s)


-- Partitions are really pseudo columns, backed by directories
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3;
Found 2 items
drwxr-xr-x   - root supergroup          0 2020-08-16 18:41 /user/hive/warehouse/mydb.db/t3/dt=2020-06-01
drwxr-xr-x   - root supergroup          0 2020-08-16 18:41 /user/hive/warehouse/mydb.db/t3/dt=2020-06-02

  3. Adding partitions and setting their data
-- Add one partition without loading data
hive (mydb)> alter table t3 add partition(dt='2020-06-03');
OK
Time taken: 0.157 seconds


-- Add multiple partitions without loading data
hive (mydb)> alter table t3 add partition(dt='2020-06-05') partition(dt='2020-06-06');
OK
Time taken: 0.717 seconds


-- Add multiple partitions with data: prepare the data first
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2020-06-01 /user/hive/warehouse/mydb.db/t3/dt=2020-06-07;

hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2020-06-01 /user/hive/warehouse/mydb.db/t3/dt=2020-06-08;


-- Add multiple partitions, attaching the prepared data via location
hive (mydb)> alter table t3 add partition(dt='2020-06-07') location '/user/hive/warehouse/mydb.db/t3/dt=2020-06-07' partition(dt='2020-06-08') location '/user/hive/warehouse/mydb.db/t3/dt=2020-06-08';
OK
Time taken: 0.393 seconds


-- Query the data
hive (mydb)> select * from t3;
OK
t3.id	t3.name	t3.hobby	t3.addr	t3.dt
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-01
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-01
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-01
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-02
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-02
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-02
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-07
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-07
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-07
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-08
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-08
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-08
Time taken: 1.724 seconds, Fetched: 12 row(s)

  4. Changing a partition's HDFS path
alter table t3 partition(dt='2020-06-01') set location
'/user/hive/warehouse/t3/dt=2020-06-03';
  5. Dropping partitions
-- One or more partitions can be dropped, separated by commas
hive (mydb)> alter table t3 drop partition(dt='2020-06-03'),
           > partition(dt='2020-06-04');
Dropped the partition dt=2020-06-03
OK
Time taken: 0.558 seconds
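Hive also accepts if exists here, which avoids an error when one of the partitions (such as dt='2020-06-04' above, which was never actually created) does not exist; a minimal sketch:

alter table t3 drop if exists partition(dt='2020-06-04');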

5. Bucketed Tables

When a single partition or table holds too much data and partitioning cannot split it at a finer granularity, bucketing is used to divide the data further.

The data is split into multiple buckets by a specified column; that is, rows are distributed by that column's value across multiple files.

How bucketing works:

  • In MapReduce: key.hashCode % numReduceTasks
  • In Hive: hashCode of the bucketing column % number of buckets
-- Test data: /home/hadoop/data/course.dat
1	java	90
1	c	78
1	python	91
1	hadoop	80
2	java	75
2	c	76
2	python	80
2	hadoop	93
3	java	98
3	c	74
3	python	89
3	hadoop	91
5	java	93
6	c	76
7	python	87
8	hadoop	88
-- Create a bucketed table with 3 buckets; fields in each row are separated by Tab
create table course(
  id int,
  name string,
  score int
)
clustered by (id) into 3 buckets
row format delimited fields terminated by "\t";

-- Create an ordinary (non-bucketed) table
create table course_common(
  id int,
  name string,
  score int
)
row format delimited fields terminated by "\t";


-- Load data into the ordinary table
load data local inpath '/home/hadoop/data/course.dat' into table course_common;



-- Load data into the bucketed table with insert ... select ...
hive (mydb)> insert into table course select * from course_common;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200816201230_d45d45b0-cc91-473a-9ed0-0d512b941022
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Cannot run job locally: Number of reducers (= 3) is more than 1
Starting Job = job_1596350031879_0004, Tracking URL = http://linux123:8088/proxy/application_1596350031879_0004/
Kill Command = /opt/lagou/servers/hadoop-2.9.2/bin/hadoop job  -kill job_1596350031879_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-08-16 20:12:40,851 Stage-1 map = 0%,  reduce = 0%
2020-08-16 20:12:47,369 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
2020-08-16 20:12:53,622 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.17 sec
2020-08-16 20:12:54,653 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 3.77 sec
2020-08-16 20:12:55,710 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1596350031879_0004
Loading data to table mydb.course
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 5.47 sec   HDFS Read: 16087 HDFS Write: 374 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 470 msec
OK
course_common.id	course_common.name	course_common.score
Time taken: 27.092 seconds



-- Inspect the bucketed data. Rows are distributed by (hashCode of the bucketing column) % (number of buckets)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/course;
Found 3 items
-rwxr-xr-x   3 root supergroup         57 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000000_0
-rwxr-xr-x   3 root supergroup         53 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000001_0
-rwxr-xr-x   3 root supergroup         63 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000002_0


hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000000_0;
6	c	76
3	hadoop	91
3	python	89
3	c	74
3	java	98

hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000001_0;
1	python	91
1	c	78
1	java	90
7	python	87
1	hadoop	80

hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000002_0;
2	hadoop	93
2	python	80
2	c	76
8	hadoop	88
2	java	75
5	java	93
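The assignment of rows to files can be checked by computing the bucket index directly with Hive's built-in hash() function; for an int column, hash(id) should simply equal id, so ids 3 and 6 map to bucket 0 (000000_0), ids 1 and 7 to bucket 1, and ids 2, 5, and 8 to bucket 2, matching the files above. A minimal sketch:

-- bucket_idx corresponds to the numeric prefix of the bucket files
select id, hash(id) % 3 as bucket_idx from course_common;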

Notes:

  • Bucketing rule: hashCode of the bucketing column % number of buckets

  • Load data into a bucketed table with insert ... select ...

  • Some online material says hive.enforce.bucketing=true must be set to use bucketed tables

This only applies to Hive 1.x and earlier; in Hive 2.x the parameter was removed and bucketing is always enforced (a sketch of the 1.x usage follows).
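On Hive 1.x, the property would be set in the session before loading the bucketed table; a minimal sketch (the property no longer exists in Hive 2.x):

set hive.enforce.bucketing=true;
insert into table course select * from course_common;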

6. Altering & Dropping Tables

-- Rename a table: rename to
hive (mydb)> alter table course_common rename to course_common1;
OK
Time taken: 0.134 seconds


-- Rename a column: change column
hive (mydb)> alter table course_common1 change column id cid int;
OK
Time taken: 0.175 seconds


-- Change a column's type: change column
hive (mydb)> alter table course_common1 change column cid cid string;
OK
Time taken: 0.16 seconds


-- When changing a column's data type, the conversion must be allowed:
-- int can be converted to string, but string cannot be converted back to int.
-- Otherwise Hive fails with: The following columns have types incompatible with the existing columns in their respective positions
hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
cid                 	string              	                    
name                	string              	                    
score               	string              	                    
Time taken: 0.072 seconds, Fetched: 3 row(s)

hive (mydb)> alter table course_common1 change column score score int;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
score



-- Add columns: add columns
hive (mydb)> alter table course_common1 add columns (common string);
OK
Time taken: 0.094 seconds

hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
cid                 	string              	                    
name                	string              	                    
score               	string              	                    
common              	string              	                    
Time taken: 0.078 seconds, Fetched: 4 row(s)



-- Remove columns: replace columns
-- This only removes the columns from the metadata; the data files on HDFS are not changed
hive (mydb)> alter table course_common1 replace columns (id string, cname string, score string);
OK
Time taken: 0.131 seconds

hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
id                  	string              	                    
cname               	string              	                    
score               	string              	                    
Time taken: 0.077 seconds, Fetched: 3 row(s)



-- Drop the table
hive (mydb)> drop table course_common1;
OK
Time taken: 0.153 seconds

hive (mydb)> show tables;
OK
tab_name
course
t3
Time taken: 0.076 seconds, Fetched: 2 row(s)