Hive SQL
Table Operations
Hive syntax reference: cwiki.apache.org/confluence/…
Table Basics
Creating and Dropping Databases
-- Show all databases
show databases;
-- Switch database
use database_name;
-- Create database syntax
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
-- Create a database
create database xxx;
-- Drop database syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
-- Drop a database
drop database xxx;
Note: when you open the Hive CLI and start writing SQL without any database-related operation, all tables go into the default database; in HDFS the tables of this database are stored directly under Hive's default warehouse path. If you create a database, a folder named database_name.db is created under the default warehouse path, and all tables of that database are stored under the database_name.db directory.
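As a quick illustration (the database name test_db below is just a hypothetical example), DESCRIBE DATABASE shows the HDFS directory that backs a database:
-- create a database and inspect where it lives in HDFS
create database if not exists test_db comment 'demo database';
describe database test_db;   -- the location should point at .../test_db.db under the warehouse path
drop database if exists test_db cascade;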
Creating Tables
-- Create table statement
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
-- Create a table based on an existing table or view (CREATE TABLE ... LIKE)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
-- Data types (primitive and complex)
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
-- Primitive data types
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
-- Row format
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
-- File formats
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| JSONFILE -- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
: [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
: [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ]
-- Table constraints
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
-- A commonly used create table statement
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partitions
[CLUSTERED BY (col_name, col_name, ...) -- buckets
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format] -- e.g. row format delimited fields terminated by '<delimiter>'
[STORED AS file_format]
[LOCATION hdfs_path]
Primitive Data Types
| Type | Description | Example |
|---|---|---|
| boolean | true/false | true |
| tinyint | 1-byte signed integer | 1 |
| smallint | 2-byte signed integer | 1 |
| int | 4-byte signed integer | 1 |
| bigint | 8-byte signed integer | 1 |
| float | 4-byte single-precision floating point | 1.0 |
| double | 8-byte double-precision floating point | 1.0 |
| string | string (no declared length) | "aaa" |
| varchar | string (length 1-65535, longer values are truncated) | "bbb" |
| timestamp | timestamp | 161243654 |
| date | date | 2020-04-16 |
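A small, hypothetical sketch (the table product_demo and its columns are not from the text above) showing several of these types in one table definition:
create table product_demo(
sku varchar(50),        -- bounded string, truncated beyond the declared length
price decimal(10,2),    -- exact decimal with precision and scale
in_stock boolean,
created_at timestamp,
sale_date date
);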
Complex Data Types
| Type | Description | Example |
|---|---|---|
| array | An ordered collection of elements that must share the same type: array(elem1, elem2) | Array(1,2,3) |
| map | An unordered set of key-value pairs: map(k1,v1,k2,v2) | Map('a',1,'b',2) |
| struct | A set of named fields whose types may differ: struct(elem1, elem2) | Struct('a',1,2.0) |
create table complex(
col1 array<int>,
col2 map<string,int>,
col3 struct<a:string,b:int,c:double>
)
Working with complex types
- array elements are accessed by index, starting from 0; for example, the first element is array[0]
- map values are accessed by key; for example, map['a'] returns the value for key 'a'
- struct fields are accessed with dot notation; for a column c of type struct{a int; b string}, use c.a and c.b to read the field values. A struct can be thought of as an object (a query sketch follows below)
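A minimal query sketch against the complex table defined above (it assumes some rows have already been loaded, which the text does not show):
-- index into the array, look up a map key, and read struct fields by name
select col1[0], col2['a'], col3.a, col3.c from complex;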
Hive Create Table Example
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
Viewing a Table Description
DESC [EXTENDED|FORMATTED] table_name
desc formatted person;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Detailed Table Information
Database: default
OwnerType: USER
Owner: bigdata
CreateTime: Fri Mar 15 08:19:16 CST 2024
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://mycluster/user/hive/warehouse/person
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"address\":\"true\",\"id\":\"true\",\"likes\":\"true\",\"name\":\"true\"}}
bucketing_version 2
numFiles 0 # number of files
numRows 0 # number of rows
rawDataSize 0 # raw data size
totalSize 0 # total size
transient_lastDdlTime 1710461956
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
collection.delim - # collection item delimiter
field.delim , # field delimiter
line.delim \n # line delimiter
mapkey.delim : # delimiter between map keys and values
serialization.format ,
Dropping Tables
create table psn(id int,age int);
desc formatted psn;
drop table psn;
Loading Data
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Sample data is shown below (format: id, name, likes, address)
1,小明1,lol-book-movie,beijing:xisanqi-shanghai:pudong
2,小明2,lol-book-movie,beijing:xisanqi-shanghai:pudong
3,小明3,lol-book-movie,beijing:xisanqi-shanghai:pudong
4,小明4,lol-book-movie,beijing:xisanqi-shanghai:pudong
5,小明5,lol-movie,beijing:xisanqi-shanghai:pudong
6,小明6,lol-book-movie,beijing:xisanqi-shanghai:pudong
7,小明7,lol-book,beijing:xisanqi-shanghai:pudong
8,小明8,lol-book,beijing:xisanqi-shanghai:pudong
9,小明9,lol-book-movie,beijing:xisanqi-shanghai:pudong
-- Save the data above to ~/person01.txt, then load the local file into the table
load data local inpath '/home/bigdata/person01.txt' into table person;
-- Query all rows
select * from person;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
...
-- Query specific columns
select id,name,likes[0],address['beijing'] from person;
1 小明1 lol xisanqi
2 小明2 lol xisanqi
3 小明3 lol xisanqi
Upload the data file to HDFS, then load the file from HDFS into the table
# Upload the file to HDFS
hdfs dfs -put ~/person01.txt /tmp/
-- Truncate the table first
truncate table person;
-- Load data from HDFS
load data inpath '/tmp/person01.txt' into table person;
-- Query the data
select * from person;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
...
Checking HDFS again shows that /tmp/person01.txt has been moved to /user/hive/warehouse/person/person01.txt
Default Delimiter Demo
create table person2
(
id int,
name string,
likes array<string>,
address map<string,string>
);
The sample data is shown below. Note: ^A is entered with Ctrl-V Ctrl-A; ^B and ^C are entered the same way
1^A小明1^Alol^Bbook^Bmovie^Abeijing^Cxisanqi^Bshanghai^Cpudong
-- Save the data above to ~/person02.txt (the ^A etc. delimiters must be re-entered when editing the file)
load data local inpath '/home/bigdata/person02.txt' into table person2;
-- Query
select * from person2;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
create table person3
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003';
desc formatted person3;
...
Storage Desc Params:
collection.delim \u0002
field.delim \u0001
mapkey.delim \u0003
serialization.format \u0001
-- Load data
load data local inpath '/home/bigdata/person02.txt' into table person3;
-- Query
select * from person3;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
-- As shown above, the default delimiters are ^A (Ctrl-V Ctrl-A) = \u0001, ^B = \u0002, ^C = \u0003
Internal (Managed) Tables vs. External Tables
Hive Internal Tables
CREATE TABLE [IF NOT EXISTS] table_name
When the table is dropped, both the metadata and the data are deleted
Table Type: MANAGED_TABLE
Hive External Tables
CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
Dropping an external table removes only the metadata in the metastore; the table data in HDFS is kept
Table Type: EXTERNAL_TABLE
# First upload person01.txt to the /mydata directory in HDFS
hdfs dfs -mkdir /mydata
hdfs dfs -put ~/person01.txt /mydata
-- Create an external table
create external table person4
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/mydata/'; -- defaults to /user/hive/warehouse/ if not specified
desc formatted person4;
...
Table Type: EXTERNAL_TABLE
...
select * from person4;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
...
Differences between internal and external tables
- Tables created in Hive are internal (managed) tables by default; a storage location can be specified at creation time, and the default is /user/hive/warehouse/
- When an internal table is dropped, both the metadata and the data at the mapped path are deleted; internal tables are often used as temporary tables
- When an external table is dropped, the metadata is deleted but the mapped data is kept
- With an external table you can create the table first and add data later, or have the data in place first and create the table when needed (store the data in HDFS and build the table over it on demand). An external table essentially maps an HDFS directory to a Hive table, so it is well suited to storing raw data and protects against deleting data by mistake
# List the files of the person table
hdfs dfs -ls -R /user/hive/warehouse/person
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 08:40 /user/hive/warehouse/person/person01.txt
# List the files of the person4 table
hdfs dfs -ls -R /mydata/
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 13:17 /mydata/person01.txt
-- Drop the person4 and person tables
drop table person4;
drop table person;
# The person table's files are gone (dropping an internal table deletes its data files)
hdfs dfs -ls -R /user/hive/warehouse/person
ls: `/user/hive/warehouse/person': No such file or directory
# The person4 table's files are still there (dropping an external table keeps its data files)
hdfs dfs -ls -R /mydata/
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 13:17 /mydata/person01.txt
-- Recreate the external table that was just dropped
create external table person4
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/mydata/';
-- The data is still there
select * from person4;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
...
Hive uses schema-on-read, which decouples loading from validation and makes loading efficient: the data is not checked at load time, only at query time, and fields that cannot be split correctly are returned as NULL. Loading data simply copies (or moves) the file to the corresponding HDFS path, which is fast.
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
-- No validation at load time
load data local inpath '/home/bigdata/person02.txt' into table person;
-- Validation happens at query time; fields that cannot be split correctly come back as NULL
select * from person;
NULL NULL NULL NULL
Table Partitions
By default Hive stores a table's data under a single HDFS directory, so retrieving only the rows matching some condition requires a full scan, which is IO-heavy and slow. Following a divide-and-conquer approach, rows matching certain conditions can be placed in their own directory; a query then only needs to scan that directory instead of all the data. In other words, partitioning a Hive table speeds up queries by avoiding full scans. Partition columns must be declared with the partitioned by clause when the table is defined.
Single Partition Column
Single-partition table statement: create table xxx (col1 int, col2 string) partitioned by (col3 string).... The table exposes three columns col1, col2, col3 (note that partition columns must not appear in the regular column list), and in HDFS the data is separated into directories by col3.
-- Single-partition table, partitioned by age
create table person5
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
desc formatted person5;
...
# Partition Information
# col_name data_type comment
age int
...
-- Load data
load data local inpath '/home/bigdata/person01.txt' into table person5 partition(age=10);
select * from person5;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
...
-- Load data into another partition
load data local inpath '/home/bigdata/person01.txt' into table person5 partition(age=20);
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
...
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
# The person5 table's files in HDFS: each partition has its own directory
hdfs dfs -ls -R /user/hive/warehouse/person5
drwxr-xr-x - bigdata supergroup 0 2024-03-15 13:39 /user/hive/warehouse/person5/age=10
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 13:39 /user/hive/warehouse/person5/age=10/person01.txt
drwxr-xr-x - bigdata supergroup 0 2024-03-15 13:41 /user/hive/warehouse/person5/age=20
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 13:41 /user/hive/warehouse/person5/age=20/person01.txt
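Because each partition is a separate directory, a predicate on the partition column lets Hive read only the matching directory (partition pruning); the sketch below assumes the person5 table loaded above:
-- only the age=10 directory under /user/hive/warehouse/person5 needs to be scanned
select * from person5 where age = 10;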
Two Partition Columns
Two-partition table statement: create table xx (col1 int, col2 string) partitioned by (col3 string, col4 string).... The table is partitioned by col3 and col4 and has four columns col1, col2, col3, col4. In HDFS, col3 forms the first directory level and col4 the sub-directory level. In theory any number of partition columns is allowed, but single and double partitioning are the common cases.
create table person6
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
desc formatted person6;
...
# Partition Information
# col_name data_type comment
age int
sex string
...
-- All partition columns must be specified when loading data, otherwise an error is raised
load data local inpath '/home/bigdata/person01.txt' into table person6 partition(age=20);
FAILED: SemanticException [Error 10006]: Line 1:81 Partition not found '20'
load data local inpath '/home/bigdata/person01.txt' into table person6 partition(age=20,sex='man');
select * from person6;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
...
-- The order of the partition columns does not matter when loading data
load data local inpath '/home/bigdata/person01.txt' into table person6 partition(sex='man',age=10);
select * from person6;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
...
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
...
# The person6 table's files in HDFS: each partition has its own directory
hdfs dfs -ls -R /user/hive/warehouse/person6
drwxr-xr-x - bigdata supergroup 0 2024-03-15 14:01 /user/hive/warehouse/person6/age=10
drwxr-xr-x - bigdata supergroup 0 2024-03-15 14:01 /user/hive/warehouse/person6/age=10/sex=man
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 14:01 /user/hive/warehouse/person6/age=10/sex=man/person01.txt
drwxr-xr-x - bigdata supergroup 0 2024-03-15 13:58 /user/hive/warehouse/person6/age=20
drwxr-xr-x - bigdata supergroup 0 2024-03-15 13:58 /user/hive/warehouse/person6/age=20/sex=man
-rw-r--r-- 3 bigdata supergroup 496 2024-03-15 13:58 /user/hive/warehouse/person6/age=20/sex=man/person01.txt
Hive Partition Operations
Adding Partitions
Prepare ~/students.txt with the following data
1 zs 18 beijing
2 ls 18 shanghai
3 ww 19 tianjin
4 ml 20 shenzhen
5 tq 19 shanghai
6 lc
-- Create the table, partitioned by age and loc
create table students(id int,name string) partitioned by (age int,loc string)row format delimited fields terminated by '\t';
-- Load data into a specified partition: the file is written directly to the corresponding HDFS directory (no MR job needed), which is fast
load data local inpath '/home/bigdata/students.txt' into table students partition(loc='0',age=0);
-- Above, the partition values are specified by hand; usually we would rather have the target partition derived from a field of each record, which is what dynamic partitioning is for (see the sketch after the HDFS listing below)
-- Load data without specifying a partition: the partition of each record must be derived and the row written to the corresponding HDFS file (an MR job is needed), which is slower
load data local inpath '/home/bigdata/students.txt' into table students;
-- Show all partitions of the table
show partitions students;
age=0/loc=0
age=18/loc=beijing
age=18/loc=shanghai
age=19/loc=shanghai
age=19/loc=tianjin
age=20/loc=shenzhen
age=__HIVE_DEFAULT_PARTITION__/loc=__HIVE_DEFAULT_PARTITION__
When data was loaded into the partitioned table above without a partition spec, the fields were split by the delimiter and the rows were partitioned automatically by the values of the partition columns. If a split value is empty, the row goes into the default partition (__HIVE_DEFAULT_PARTITION__).
# The students table's files in HDFS: each partition has its own directory
hdfs dfs -ls -R /user/hive/warehouse/students
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:44 /user/hive/warehouse/students/age=0
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:44 /user/hive/warehouse/students/age=0/loc=0
-rw-r--r-- 3 bigdata supergroup 88 2024-03-18 08:44 /user/hive/warehouse/students/age=0/loc=0/students.txt
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=18
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=18/loc=beijing
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=18/loc=beijing/000000_2
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=18/loc=shanghai
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=18/loc=shanghai/000000_2
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=19
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=19/loc=shanghai
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=19/loc=shanghai/000000_2
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=19/loc=tianjin
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=19/loc=tianjin/000000_2
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=20
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=20/loc=shenzhen
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=20/loc=shenzhen/000000_2
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=__HIVE_DEFAULT_PARTITION__
drwxr-xr-x - bigdata supergroup 0 2024-03-18 08:37 /user/hive/warehouse/students/age=__HIVE_DEFAULT_PARTITION__/loc=__HIVE_DEFAULT_PARTITION__
-rw-r--r-- 3 bigdata supergroup 5 2024-03-18 08:37 /user/hive/warehouse/students/age=__HIVE_DEFAULT_PARTITION__/loc=__HIVE_DEFAULT_PARTITION__/000000_2
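As mentioned above, the target partition can also be derived from the data itself with a dynamic partition insert. A minimal sketch, assuming a hypothetical staging table students_stage(id int, name string, age int, loc string) that already holds the raw rows:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- the partition columns must be the last columns of the select, in partition order
insert into table students partition(age, loc)
select id, name, age, loc from students_stage;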
To add a new partition to the students table, use the following syntax
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
For example, to add the partition age=30, loc='hangzhou': since the table was defined with two partition columns, both partition values must be supplied
-- For a multi-partition table you cannot specify only one of the partition columns
ALTER TABLE students ADD PARTITION (age=30);
"FAILED: ValidationFailureSemanticException default.students: partition spec {age=30} doesn't contain all (2) partition columns"
-- Correct way
ALTER TABLE students ADD PARTITION (age=30,loc='hangzhou');
-- The order of the partition columns does not matter here. Because this partition already exists, omitting 'IF NOT EXISTS' would raise an error; with it, the statement simply does nothing
ALTER TABLE students ADD IF NOT EXISTS PARTITION (loc='hangzhou',age=30);
# In HDFS the newly added partition has its own directory, but the directory contains no data yet
hdfs dfs -ls -R /user/hive/warehouse/students
...
drwxr-xr-x - bigdata supergroup 0 2024-03-15 16:55 /user/hive/warehouse/students/age=30
drwxr-xr-x - bigdata supergroup 0 2024-03-15 16:55 /user/hive/warehouse/students/age=30/loc=hangzhou
...
Dropping Partitions
The syntax for dropping partitions is
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
For a two-partition table, a partition can be dropped by giving both partition values
alter table students drop if exists partition (age=30,loc='hangzhou');
A single partition value can also be given; in that case all matching partitions are dropped
alter table students drop partition (age=18);
Dropped the partition age=18/loc=beijing
Dropped the partition age=18/loc=shanghai
alter table students drop partition (loc='tianjin');
Dropped the partition age=19/loc=tianjin
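Hive also accepts comparison operators in the partition spec of DROP PARTITION, which drops every partition matching the predicate; a hedged example, assuming the age partitions above still exist:
-- drop all partitions whose age value is below 20
alter table students drop if exists partition (age < 20);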
Repairing Partitions
With external tables, data can be uploaded to an HDFS directory first and the table created afterwards to map it. If the uploaded data was laid out in partition-style subdirectories, the newly created table will not return any rows, because the partition metadata has not been recorded in MySQL (the metastore). The partitions must be repaired so that the metadata is synchronized into MySQL; only then can the data be queried.
Drop the students table; since it is an internal table by default, its data directory in HDFS is removed as well. We can then create a table over an existing data directory and repair it. Create the directories /student/age=10/loc=shanghai and /student/age=20/loc=beijing in HDFS and upload some data.
Prepare the data files s1.txt and s2.txt and upload them to the HDFS directories
1 zs
2 ls
3 ww
4 ml
# Create the directories
hdfs dfs -mkdir -p /student/age=10/loc=shanghai
hdfs dfs -mkdir -p /student/age=20/loc=beijing
# Upload the data
hdfs dfs -put s1.txt /student/age=10/loc=shanghai
hdfs dfs -put s2.txt /student/age=20/loc=beijing
Create the table and repair the partitions
-- Drop the students table first
drop table students;
-- Recreate the students table in Hive as an external table mapped to the HDFS path /student
create external table students(id int,name string) partitioned by (age int,loc string)row format delimited fields terminated by '\t' location '/student';
-- No rows are returned: the partitions are not recognized automatically
select * from students;
-- Creating the table does not sync the partition information into the metastore (MySQL); the partitions must be repaired. The repair syntax is
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
-- Run the following command. Repairing partitions only updates the table's metadata; no MR job runs, so it is quick
msck repair table students;
Partitions not in metastore: students:age=10/loc=shanghai students:age=20/loc=beijing
Repair: Added partition to metastore students:age=10/loc=shanghai
Repair: Added partition to metastore students:age=20/loc=beijing
-- Query the data
select * from students;
1 zs 10 shanghai
2 ls 10 shanghai
3 ww 20 beijing
4 ml 20 beijing
Summary
- After a partitioned table is created and data is stored, each partition column becomes a directory level in HDFS, so the data sits under a multi-level directory tree
- When inserting into a multi-partition table, you cannot supply only one partition column; every partition column needs a value
- When supplying partition values for a multi-partition table, the order does not matter; values are matched to partition columns by name
- When adding a partition to a multi-partition table, values for all partition columns must be given
- When dropping partitions, both single- and multi-partition tables allow the specified partitions to be dropped
Creating a Table from an Existing Table
Create Table Like
Syntax
CREATE TABLE empty_key_value_store LIKE key_value_store;
This only creates an empty table with the same structure; the data is not copied
select * from person3;
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
create table person8 like person3;
-- No data
select * from person8;
Create Table As Select (CTAS)
Syntax
CREATE TABLE new_key_value_store AS SELECT columA, columB FROM key_value_store;
Creates a regular table and writes the query result straight into it; useful for temporary tables when processing data
create table person9 as select id,name from person3;
select * from person9;
1 小明1
Deleting Data
Hive does not support the delete from tbl syntax (unless transactions are enabled, see below); truncate is used to remove data
truncate table person;
-- No data
select * from person;
Hive DML
Loading files into tables
The syntax for loading data into a Hive table with load is
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
copy: with LOCAL, the source is a local file and it is copied into HDFS
move: without LOCAL, the source file is already in HDFS; copying it would leave redundant files in HDFS, so the specified file is moved into the directory of the table (or partition) in HDFS
hdfs dfs -mkdir /test
hdfs dfs -put person01.txt /test/
LOAD DATA INPATH '/test/person01.txt' INTO TABLE person5 PARTITION (age=30);
# /test/person01.txt in HDFS has been moved to /user/hive/warehouse/person5/age=30/person01.txt
# person01.txt is no longer here
hdfs dfs -ls /test
# person01.txt is now here
hdfs dfs -ls -R /user/hive/warehouse/person5/age=30
-rw-r--r-- 3 bigdata supergroup 496 2024-03-18 17:19 /user/hive/warehouse/person5/age=30/person01.txt
Notes
- load does not transform or modify the data in any way
- loading a data file from the local Linux filesystem copies the file
- loading a data file from HDFS moves the file
- load also works for partitioned tables; the partition column values just have to be supplied
Inserting data into Hive Tables from queries
Insert data selected by a query into a table; the syntax is
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
To insert data selected from another table, the basic form is
INSERT [OVERWRITE] INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
This form is not recommended when different columns of one table must be inserted into several tables, because the source table would then be scanned multiple times. Use the following form instead
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
This form scans the source table only once and inserts into the other tables from that single scan, so it is recommended. Also note: INSERT OVERWRITE overwrites, INSERT INTO appends
-- Example
create table person2_1(
id int,
name string
);
create table person2_2(
id int,
likes array<string>
);
from person5
insert overwrite table person2_1
select id,name
insert into person2_2
select id,likes;
-- Run it a second time
from person5 insert overwrite table person2_1 select id,name insert into person2_2 select id,likes;
select count(*) from person2_1;
27
select count(*) from person2_2;
54
Use case: take a wide table person (say 50 columns) and insert some of its columns into person2 and other columns into person3. The table is scanned only once, reducing disk IO; a sketch follows below.
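A sketch of that pattern, using hypothetical tables person_wide, person_basic and person_detail (none of them defined earlier in the text):
-- one scan of person_wide feeds both target tables
from person_wide
insert overwrite table person_basic select id, name
insert overwrite table person_detail select id, likes, address;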
Writing data into the filesystem from queries
Write query results to the filesystem, either the local filesystem or HDFS; the syntax is
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
-- Warning: never use the root directory as the target path, or all data files there will be overwritten
-- Write the query result to HDFS
insert overwrite directory '/tmp/result/person2_2' select * from person2_2;
-- Write the query result to the local filesystem
insert overwrite local directory '/tmp/result/person2_2' select * from person2_2;
# node03
# Inspect the just-exported files in HDFS and on the local filesystem
hdfs dfs -ls /tmp/result/person2_2
-rw-r--r-- 3 bigdata supergroup 816 2024-03-27 14:31 /tmp/result/person2_2/000000_0
ll /tmp/result/person2_2
-rw-r--r-- 1 bigdata bigdata 816 Mar 27 14:32 000000_0
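Since the export syntax above also accepts a row_format, the field delimiter of the exported files can be controlled; a minimal sketch (the target directory /tmp/result/person2_2_csv is just an example path):
insert overwrite local directory '/tmp/result/person2_2_csv'
row format delimited fields terminated by ','
select * from person2_2;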
Inserting values into tables from SQL
Insert data the way a traditional relational database does; this is relatively inefficient
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
-- Create a table
create table psn(id int, name string);
-- Insert data
insert into psn values(1,'zhangsan');
-- Query the data
select * from psn;
1 zhangsan
Updating and Deleting Data
The official documentation states that Hive supports UPDATE and DELETE, but in practice they require transaction support, and Hive places many restrictions on transactions, as the long list of limitations in the documentation shows.
Enabling Transactions
In normal Hive usage, deletes and updates are rare, and when they do occur they are expensive: files in HDFS cannot be modified in place, so the affected files are replaced (deleted and re-uploaded). A short demonstration follows.
On the Hive server, add the following configuration to hive-site.xml (node01)
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.cleaner.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
On the Hive client, add the following configuration to hive-site.xml (node03)
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
Restart the Hive server and reconnect the Hive client
Prepare the data file (test_trancaction.txt)
1,jerrick
2,tom
3,jerry
4,lily
5,hanmei
6,limlei
7,lucky
-- Create the tables. The files backing test_trancaction must be in ORC format, so a text file cannot be loaded into it directly; an intermediate table is used to stage the data
create table test_trancaction (id int,name string) clustered by (id) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true');
create table test_insert_test(id int,name string) row format delimited fields TERMINATED BY ',';
-- Load data
load data local inpath '/home/bigdata/test_trancaction.txt' into table test_insert_test;
insert into test_trancaction select * from test_insert_test;
-- Query before the update
select * from test_trancaction where id = 1;
1 jerrick
-- Update
update test_trancaction set name='jerrick_up' where id = 1;
-- Query after the update
select * from test_trancaction where id = 1;
1 jerrick_up
-- Delete
delete from test_trancaction where id = 1;
-- Query again; the row is gone
select * from test_trancaction where id = 1;
Note: in practice, avoid Hive's update and delete unless there is no other option; both are very expensive
Hive SerDe
Hive SerDe (Serializer and Deserializer) handles serialization and deserialization. It sits between the data storage layer and the execution engine and decouples the two.
Use cases
- Hive is mainly used to store structured data. When the stored format is complex or deeply nested, a SerDe can be used to read the data, for example with regular-expression matching; e.g. a table with fields id, name, map<string,array<map<string,string>>>
- Sometimes certain formatting in the raw data should not show up when it is read, e.g. for 192.168.57.4 - - [29/Feb/2019:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 - you may not want the [] or "" to appear in the output; a SerDe can handle this as well
Hive can read and write content via ROW FORMAT DELIMITED or a SERDE; the syntax is
row_format
: DELIMITED
[FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Prepare the data file access_log.txt
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
Create a table with the regex SerDe and load access_log.txt
CREATE TABLE logtbl (
host STRING,
identity STRING,
t_user STRING,
`time` STRING,
request STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (.*) \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
);
load data local inpath '/home/bigdata/access_log.txt' into table logtbl;
-- The data is displayed as follows (the quotes around the request field are stripped; stripping the [] around the time would need a different regex)
select * from logtbl;
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] GET /bg-upper.png HTTP/1.1 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] GET /bg-nav.png HTTP/1.1 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] GET /asf-logo.png HTTP/1.1 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - [29/Feb/2023:18:14:35 +0800] GET /bg-middle.png HTTP/1.1 304 -
HiveServer2 and Beeline
Hive Beeline is the command-line client introduced in Hive 0.11; it is a JDBC client based on SQLLine CLI. Beeline works in two modes: embedded and remote. In embedded mode it runs an embedded Hive (similar to the Hive CLI); in remote mode it talks to a separate HiveServer2 process over the Thrift protocol.
To query Hive data with SQL through Beeline, the HiveServer2 service must be configured. HiveServer2 (HS2) is a service that enables clients to run queries against Hive. It is the successor to the deprecated HiveServer1, which supported only a single client. HiveServer2 supports multi-client concurrency and authentication, and is designed to provide better support for open-API clients such as JDBC and ODBC, i.e. it lets you connect to Hive over JDBC.
The advantages of HiveServer2 are:
- the application side does not need Hadoop or Hive clients installed
- HiveServer2 does not expose HDFS or the metastore directly to users
- it has an HA mechanism, which handles concurrency and load from the application side
- JDBC connections can be made from any language, making it easy for applications to exchange data with Hive
When connecting to Hive through Beeline and HiveServer2, a user must be specified. The user can be chosen freely, but HDFS must be configured to allow proxy users: add the following to core-site.xml on every Hadoop node
<!-- Proxy user configuration -->
<property>
<name>hadoop.proxyuser.bigdata.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.bigdata.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.bigdata.users</name>
<value>*</value>
</property>
Also add the following to $HIVE_HOME/conf/hive-site.xml (optional here)
<!-- Whether HiveServer2 impersonates the connected client user. If enabled, HiveServer2 accesses the Hadoop cluster as the client's login user; if disabled, it accesses the cluster as the user that started HiveServer2. Impersonation is enabled by default. If this is set to false, the proxy-user settings in core-site.xml above are not needed. -->
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
# node01, as the bigdata user
cd $HADOOP_HOME/etc/hadoop
# Edit core-site.xml
# Copy it to the other nodes
for i in 2 3 4;do scp core-site.xml node0$i:`pwd`/;done
cd $HIVE_HOME/conf
# Edit hive-site.xml
# After configuring, refresh the Hadoop cluster configuration or restart the cluster
# Refresh the configuration on all NameNodes (node01 and node02 are NameNodes)
hdfs dfsadmin -fs hdfs://node01:8020 -refreshSuperUserGroupsConfiguration
hdfs dfsadmin -fs hdfs://node02:8020 -refreshSuperUserGroupsConfiguration
# Restart the cluster
stop-all.sh
start-all.sh
# Restart the Hive metastore: kill it, then start it again (starting this service is optional; HiveServer2 can run with or without a standalone metastore)
kill -9 `jps | grep RunJar | awk '{print $1}'`
hive --service metastore &
Using Beeline in Embedded (Local) Mode
# First start HiveServer2 on the Hive server node node01
# Both 'hiveserver2 &' and 'hive --service hiveserver2' start HiveServer2
hiveserver2 &
# Then log in with beeline directly on node01
# By default beeline does not verify the username and password, so any values will do
beeline
# Enter any username and password
!connect jdbc:hive2://node01:10000
# Inside the beeline client
# Close the current connection
!close
# Exit the beeline client
!quit
# Another way to log in with beeline
beeline -u jdbc:hive2://node01:10000
# Logging in as a user other than bigdata may fail
beeline -u jdbc:hive2://node01:10000 zhangsan
Error: Could not open client transport with JDBC Uri: jdbc:hive2://node01:10000/default: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=zhangsan, access=EXECUTE, inode="/tmp":bigdata:supergroup:drwx------
# The error above occurs because beeline needs access to the /tmp directory in HDFS but has no permission; change the permissions on /tmp
hdfs dfs -chmod 777 /tmp
# Log in again; it now works
beeline -u jdbc:hive2://node01:10000 zhangsan
# Creating a table as user zhangsan still fails, again because of HDFS directory permissions: the proxy user (bigdata) only handles authentication (e.g. Kerberos); once authenticated, HDFS is accessed as the impersonated, actual user (zhangsan), so that user must have the required HDFS permissions (configured on the HDFS side)
# Under Hadoop's Simple authentication, ProxyUser adds little value, because a client can already access the server as any user and the server performs no authentication. With a secure mechanism such as Kerberos, ProxyUser becomes very useful: one proxy user can authenticate on behalf of many users
create table if not exists zhangsan_test (id int,name string,age int);
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=zhangsan, access=WRITE, inode="/user/hive/warehouse":bigdata:supergroup:drwxr-xr-x
# If core-site.xml has no proxy-user configuration and hive.server2.enable.doAs is set to true in hive-site.xml, then even connecting from beeline as the bigdata user fails with `Error: Could not open client transport with JDBC Uri: jdbc:hive2://node01:10000/default: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: bigdata is not allowed to impersonate root (state=08S01,code=0)`
Using Beeline in Remote Mode
In real projects the remote mode is the recommended way to use beeline, since you are not operating on the Hive server itself, which is safer. Remote usage simply means starting beeline on a Hive client node; HiveServer2 must already be running on the Hive server, and the client-side beeline communicates with the server-side HiveServer2 process over the Thrift protocol.
# node03, which has a Hive installation
beeline
# Enter any username and password
!connect jdbc:hive2://node01:10000
# Another way to log in with beeline
beeline -u jdbc:hive2://node01:10000
Beeline in Practice
# Any node with a Hive installation can connect with beeline
beeline -u jdbc:hive2://node01:10000
# Copy /home/bigdata/person02.txt from node03 to node01
# node03
cd ~
scp person02.txt node01:`pwd`
-- beeline cli
create table person22
(
id int,
name string,
likes array<string>,
address map<string,string>
);
-- The path given to local inpath must exist on the HiveServer2 server (node01 here); a path that exists only on the client is of no use, because the statement is sent to the server for execution. An HDFS path works as well, of course
load data local inpath '/home/bigdata/person02.txt' into table person22;
select * from person22;
+--------------+----------------+-------------------------+--------------------------------------------+
| person22.id | person22.name | person22.likes | person22.address |
+--------------+----------------+-------------------------+--------------------------------------------+
| 1 | 小明1 | ["lol","book","movie"] | {"beijing":"xisanqi","shanghai":"pudong"} |
+--------------+----------------+-------------------------+--------------------------------------------+
Note
HiveServer2 can run with or without a standalone metastore service: if hive.metastore.uris is set in hive-site.xml, it uses the metastore service; otherwise it connects to MySQL directly (in which case the HDFS path and MySQL settings must be in the config file). In practice, the metastore service is usually started on one node, where the Hive CLI is made available to administrators for full read/write operations, while HiveServer2 is started on other nodes to serve queries from beeline or JDBC programs.
<!-- If this is set in hive-site.xml, the Hive CLI uses this URL to reach the metastore (instead of connecting to MySQL directly), and HiveServer2 likewise talks to Hive through the metastore at this URL (not directly to MySQL). The Hive client and the HiveServer2 node then do not need the HDFS path and MySQL settings, and several metastore URLs can be listed here for high availability -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://node01:9083,thrift://node02:9083</value>
</property>
The HiveServer2 web UI is available on port 10002