大数据开发Hive高级函数（第二十篇）持续创作，加速成长！这是我参与「掘金日新计划 · 10 月更文挑战」的第7天，点击

持续创作，加速成长！这是我参与「掘金日新计划 · 10 月更文挑战」的第7天，点击查看活动详情

一、Hive函数的基本操作

与MySQL一样，hive也是一个主要做统计的工具，所以为了满足各种各样的统计需求，它也内置了相当多的函数。我们可以通过show functions来查看hive的内置函数

1.1、查看具体的函数信息

desc function

二、Hive高级函数应用

2.1、分组排序取TopN：row_number() over()

row_number会对数据编号，编号从1开始
over可以理解为把数据划分到一个窗口内，里面可以加上partition by，表示按照字段对数据进行分组，还可以加上order by 表示对每个分组内的数据按照某个字段进行排序

创建基础数据student_score.data

备注下面的空格是采用tab。也是'\t'

1 张三  语文  80
2 张三  数学  99
3 张三  英语  88
4 李四  语文  98
5 李四  数学  88
6 李四  英语  95

创建外部表

create external table student_score(
id int,
name string,
subject string,
score int
)row format delimited
fields terminated by '\t'
location '/data/student_score';

把基础数据手工上传到hdfs

hdfs dfs -put /root/data/student_score.data /data/student_score

查询hive数据

select * from student_score;

排序

select *, row_number() over (partition by subject order by score desc) as num 
from student_score

此时会生成mapreduce任务

最终结果如下：

如果取各学科前一名，SQL如下：

select * from (
select *, row_number() over (partition by subject order by score desc) as num 
from student_score) s where s.num=1;

2.2、行转列

行转列就是把多行数据转为一列数据。针对行转列这种需求主要需要使用到concat_ws()、collect_set()、collect_list()函数。

concat_ws

CONCAT_WS() 函数可以实现根据指定的分隔符拼接多个字段的值，最终转化为一个带有分隔符的字符串它可以接收多个参数，第一个参数是分隔符，后面的参数可以是字符串或者字符串数组，最终就是使用分隔符把后面的所有字符串拼接到一块

collect_list

这个函数可以返回一个list集合，集合中的元素会重复，一般和group by 结合在一起使用

collect_set

这个函数可以返回一个set集合，集合汇中的元素不重复，一般和group by 结合在一起使用

测试数据

vi student_favors.data

zhangsan  编码
zhangsan  打篮球
zhangsan  游泳

创建hive外部表

create external table student_favors(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/studentFavors';

上传数据至hdfs

hdfs dfs -put /root/data/student_favors.data /data/studentFavors

查询hive外部表

select * from student_favors;

行转列

select name,collect_list(favor) as favor from student_favors group by name;

Query ID = root_20221017232554_5a8a0a1c-a8ee-461f-a5ae-bb7fa9ef8af8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1666011801346_0003, Tracking URL = http://192.168.234.100:8088/proxy/application_1666011801346_0003/
Kill Command = /root/software/hadoop-3.3.4/bin/mapred job  -kill job_1666011801346_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-10-17 23:26:03,505 Stage-1 map = 0%,  reduce = 0%
2022-10-17 23:26:08,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.47 sec
2022-10-17 23:26:15,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.28 sec
MapReduce Total cumulative CPU time: 5 seconds 280 msec
Ended Job = job_1666011801346_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.28 sec   HDFS Read: 8736 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 280 msec
OK

再把这个数组信息转化为字符串

select name,concat_ws(',',collect_list(favor)) as favor from student_favors group by name;

Query ID = root_20221017232811_d4b5c9c7-99b2-4146-a97c-1a382785153d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1666011801346_0004, Tracking URL = http://192.168.234.100:8088/proxy/application_1666011801346_0004/
Kill Command = /root/software/hadoop-3.3.4/bin/mapred job  -kill job_1666011801346_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-10-17 23:28:20,778 Stage-1 map = 0%,  reduce = 0%
2022-10-17 23:28:25,926 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.48 sec
2022-10-17 23:28:32,081 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.48 sec
MapReduce Total cumulative CPU time: 5 seconds 480 msec
Ended Job = job_1666011801346_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.48 sec   HDFS Read: 9133 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 480 msec
OK

如果需要去重，则使用collect_set

select name,concat_ws(',',collect_set(favor)) as favor from student_favors group by name;

2.3、列转行操作

列转行是和刚才的行转列反着来的，列转行可以把一列数据转成多行。主要使用到了split()、explode()、lateral view

split

desc function split;

split函数，接受一个字串符和切割规则，就类似于java中的split函数，使用切割规则对字符串中的数据进行切割，最终返回一个array数组

explode

desc function explode;

explode函数可以接受array或者map

explode(ARRAY)：表示把数组中的每个元素转成一行

explode(MAP) ：表示把map中每个key-value对，转成一行，key为一列，value为一列

lateral view

Lateral view 通常和split， explode等函数一起使用。split可以对表中的某一列进行切割，返回一个数组类型的字段，explode可以对这个数组中的每一个元素转为一行，lateral view可以对这份数据产生一个支持别名的虚拟表

创建数据

vi student_favors_2.data

张三	编码,唱歌,打游戏

创建hive外部表

create external table student_favors2(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/studentFavors2';

上传数据至hdfs

hdfs dfs -put /root/data/student_favors_2.data /data/studentFavors2

查询hive表数据

select * from student_favors2;

 select split(favor,',') from student_favors2;

 select name,explode(split(favor,',')) from student_favors2;

laterview相当于把explode返回的数据作为一个虚拟表来使用了，起名字为table1，然后给这个表里面的那一列数据起一个名字叫favor_new，如果有多个字段，可以再后面指定多个。这样在select后面就可以使用这个名字了，有点类似join操作了

select name,favor_new from student_favors2 lateral view explode(split(favor,',')) t as favor_new;

三、hive排序相关函数

3.1、order by

hive中order by 跟传统的sql语言中的order by作用是一样的，会对查询的结果做一次全局排序，使用这个语句的时候生成的reduce任务只有一个。

3.2、sort by

Hive中指定了sort by，如果有多个reduce，那么在每个reducer端都会做排序，也就是说保证了局部有序，（每个reducer出来的数据是有序的，但是不能保证所有的数据是全局有序的，除非你只有一个reduce）

3.3、distribute by

只会根据指定的key对数据进行分区，但是不会排序。一般情况下可以和sort by结合使用，先对数据分区，再进行排序，两者结合使用的时候distribute by 必须写在sort by之前

3.4、cluster by

cluster by的功能就是distribute by和sort by的简写形式。也就是cluster by id 等于 distribute by id sort by id

四、hive的分区和去重函数

4.1、group by

对数据按照指定字段进行分组

4.2、distinct

对数据中指定字段的重复值进行去重

案例：

统计order表中name去重之后的数据量

1、select count(distinct name) from order
使用distinct会将所有的name都shuffle到一个reducer里面性能比较低
2、select count(*) from (select name from order group by name)tmp
先对name分组，因为分组的同时其实就是去重，此时可以并行计算，然后再计算count

五、hive数据倾斜处理方案

SELECT a.Key,
		 SUM(a.Cnt) AS Cnt
FROM 
    (SELECT Key,
		 COUNT(*) AS Cnt
    FROM TableName
    GROUP BY  Key,
    	CASE
    	WHEN Key = 'KEY001' THEN
    	Hash(Random()) % 50
    	ELSE 0
    	END ) a
GROUP BY  a.Key;

根据key进行分组，但是每个key的数据分布并不均匀，存在数据倾斜，而KEY001占用90%，此时可以针对KEY001利用hash。（任意数字模50，会返回0~49之间的数字）。所以KEY001被打散成50份。进行局部聚合，然后再全局聚合，提高计算效率

六、hive的web工具

gethue.com/