Big Data Development: Advanced Hive Functions (Part 20)


1. Basic Operations on Hive Functions

Like MySQL, Hive is mainly a tool for statistical and analytical work, so it ships with a large number of built-in functions to cover all kinds of aggregation needs. You can list them with show functions.


1.1 Viewing details of a specific function

desc function <function_name>;

desc function extended <function_name>; prints additional detail and usage examples.

2. Applications of Advanced Hive Functions

2.1 Top-N per group: row_number() over()
  • row_number() numbers the rows, starting from 1
  • over() can be thought of as placing the rows into a window; inside it, partition by groups the rows by a given column, and order by sorts the rows within each group by a given column

Create the sample data file student_score.data

Note: the fields below are separated by tab characters ('\t').

1 张三  语文  80
2 张三  数学  99
3 张三  英语  88
4 李四  语文  98
5 李四  数学  88
6 李四  英语  95
Create an external table
create external table student_score(
id int,
name string,
subject string,
score int
)row format delimited
fields terminated by '\t'
location '/data/student_score';

Manually upload the data file to HDFS

hdfs dfs -put /root/data/student_score.data /data/student_score


Query the table in Hive

select * from student_score;


Rank the scores within each subject:
select *, row_number() over (partition by subject order by score desc) as num
from student_score;

This query launches a MapReduce job.


The final result looks like this (each subject is ranked independently; the order in which the subject groups appear may vary):

2	张三	数学	99	1
5	李四	数学	88	2
6	李四	英语	95	1
3	张三	英语	88	2
4	李四	语文	98	1
1	张三	语文	80	2

To take only the top scorer in each subject:

select * from (
select *, row_number() over (partition by subject order by score desc) as num 
from student_score) s where s.num=1;

The result (one row per subject; group order may vary):

2	张三	数学	99	1
6	李四	英语	95	1
4	李四	语文	98	1

2.2 Rows to Columns

Row-to-column conversion collapses multiple rows into a single column value. It mainly relies on the concat_ws(), collect_set(), and collect_list() functions.

concat_ws

desc function concat_ws;

concat_ws() joins multiple values into one string using a given separator. The first argument is the separator; the remaining arguments may be strings or arrays of strings, and all of them are concatenated with the separator in between.
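
A quick way to see the behavior is to call it with literal values (the strings here are purely illustrative):

select concat_ws('-', 'a', 'b', 'c');         -- returns 'a-b-c'
select concat_ws(',', array('x', 'y', 'z'));  -- array arguments work too: returns 'x,y,z'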

collect_list

desc function collect_list;

collect_list() returns an array (list) of the values in each group, keeping duplicates; it is normally used together with group by.

collect_set

desc function collect_set;

collect_set() returns a set of the values in each group, with duplicates removed; it is also normally used together with group by.

Test data

vi student_favors.data

zhangsan  编码
zhangsan  打篮球
zhangsan  游泳
Create the Hive external table
create external table student_favors(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/studentFavors';
Upload the data to HDFS

hdfs dfs -put /root/data/student_favors.data /data/studentFavors


Query the external table
select * from student_favors;


Now do the row-to-column conversion:
select name,collect_list(favor) as favor from student_favors group by name;
Query ID = root_20221017232554_5a8a0a1c-a8ee-461f-a5ae-bb7fa9ef8af8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1666011801346_0003, Tracking URL = http://192.168.234.100:8088/proxy/application_1666011801346_0003/
Kill Command = /root/software/hadoop-3.3.4/bin/mapred job  -kill job_1666011801346_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-10-17 23:26:03,505 Stage-1 map = 0%,  reduce = 0%
2022-10-17 23:26:08,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.47 sec
2022-10-17 23:26:15,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.28 sec
MapReduce Total cumulative CPU time: 5 seconds 280 msec
Ended Job = job_1666011801346_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.28 sec   HDFS Read: 8736 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 280 msec
OK

The favors are aggregated into an array:

zhangsan	["编码","打篮球","游泳"]

Next, turn the array into a delimited string:

select name,concat_ws(',',collect_list(favor)) as favor from student_favors group by name;
Query ID = root_20221017232811_d4b5c9c7-99b2-4146-a97c-1a382785153d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1666011801346_0004, Tracking URL = http://192.168.234.100:8088/proxy/application_1666011801346_0004/
Kill Command = /root/software/hadoop-3.3.4/bin/mapred job  -kill job_1666011801346_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-10-17 23:28:20,778 Stage-1 map = 0%,  reduce = 0%
2022-10-17 23:28:25,926 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.48 sec
2022-10-17 23:28:32,081 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.48 sec
MapReduce Total cumulative CPU time: 5 seconds 480 msec
Ended Job = job_1666011801346_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.48 sec   HDFS Read: 9133 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 480 msec
OK

The array is now a comma-separated string:

zhangsan	编码,打篮球,游泳

If the values need to be deduplicated, use collect_set instead:

select name,concat_ws(',',collect_set(favor)) as favor from student_favors group by name;
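
Note that in this sample data each of zhangsan's favors is already unique, so collect_set returns the same result as collect_list. The difference only appears when there are duplicates; the name column of the same table (three identical values) shows it:

select collect_list(name) as lst, collect_set(name) as st from student_favors;
-- lst = ["zhangsan","zhangsan","zhangsan"]   (duplicates kept)
-- st  = ["zhangsan"]                         (duplicates removed)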
2.3 Columns to Rows

Column-to-row conversion is the inverse of the previous operation: it expands a single column into multiple rows. It mainly uses split(), explode(), and lateral view.

split

desc function split;


split() takes a string and a delimiter pattern, similar to Java's split(): it splits the string using the pattern and returns an array.
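
For example, with literal input (note that the delimiter is treated as a regular expression):

select split('a,b,c', ',');      -- returns ["a","b","c"]
select split('a1b2c', '[0-9]');  -- regex delimiter: returns ["a","b","c"]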

explode

desc function explode;


explode() accepts an array or a map:

explode(ARRAY): turns each element of the array into its own row

explode(MAP): turns each key-value pair of the map into a row, with the key in one column and the value in another
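
Two minimal calls with literal values illustrate both forms:

select explode(array(1, 2, 3));               -- three rows: 1 / 2 / 3
select explode(map('k1', 'v1', 'k2', 'v2'));  -- two rows with columns key and value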

lateral view

lateral view is usually used together with UDTFs such as split and explode. split cuts a column into an array, explode turns each array element into a row, and lateral view exposes the generated rows as a virtual table with its own alias, joined back to the original row.

Create the data

vi student_favors_2.data

张三	编码,唱歌,打游戏
Create the Hive external table
create external table student_favors2(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/studentFavors2';


Upload the data to HDFS

hdfs dfs -put /root/data/student_favors_2.data /data/studentFavors2


Query the table

select * from student_favors2;


select split(favor,',') from student_favors2;

The favor column comes back as an array: ["编码","唱歌","打游戏"]

select name,explode(split(favor,',')) from student_favors2;

This query fails: Hive rejects it with a SemanticException, because a UDTF such as explode cannot appear in the select list alongside ordinary columns. This is exactly the problem lateral view solves.

lateral view treats the rows returned by explode as a virtual table, here given the alias t, and the generated column is given the alias favor_new (if the UDTF produces multiple columns, list several aliases after "as"). Those names can then be used in the select list, which behaves somewhat like a join against the original row.

select name,favor_new from student_favors2 lateral view explode(split(favor,',')) t as favor_new;

The result:

张三	编码
张三	唱歌
张三	打游戏

3. Sorting-Related Functions in Hive

3.1 order by

order by in Hive behaves the same as in traditional SQL: it performs one global sort over the query result, and the query runs with a single reduce task.
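
For example, against the student_score table created earlier (one reducer, globally sorted output):

select * from student_score order by score desc;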

3.2 sort by

With sort by, each reducer sorts its own output. If there are multiple reducers, the result is only locally ordered: every reducer's output is sorted, but the overall result is not globally ordered unless there is exactly one reducer.
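
A minimal sketch against the same student_score table; forcing two reducers shows the local-only ordering:

set mapreduce.job.reduces=2;
select * from student_score sort by score desc;
-- each reducer's slice is sorted by score, but the combined output has no global order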

3.3 distribute by

distribute by only partitions the rows across reducers by the specified key; it does not sort them. It is usually combined with sort by: first distribute the rows, then sort within each partition. When combined, distribute by must be written before sort by.
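
For example, routing all rows of the same subject to one reducer and sorting within each reducer (again a sketch against student_score):

select * from student_score distribute by subject sort by score desc;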

3.4、cluster by

cluster by的功能就是distribute by和sort by的简写形式。也就是cluster by id 等于 distribute by id sort by id
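
So the following two queries against student_score are equivalent:

select * from student_score cluster by id;

select * from student_score distribute by id sort by id;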

4. Hive Grouping and Deduplication Functions

4.1 group by

Groups the rows by the specified column(s).

4.2 distinct

Removes duplicate values of the specified column(s).

Example:

Count the number of distinct names in the order table.

Option 1:
select count(distinct name) from order;
With distinct, all the names are shuffled to a single reducer, so performance is poor.

Option 2:
select count(*) from (select name from order group by name) tmp;
Grouping by name deduplicates as a side effect and runs in parallel across reducers; the outer query then simply counts the groups.

5. Handling Data Skew in Hive

SELECT a.Key,
       SUM(a.Cnt) AS Cnt
FROM
    (SELECT Key,
            COUNT(*) AS Cnt
     FROM TableName
     GROUP BY Key,
              CASE
                  WHEN Key = 'KEY001' THEN
                      floor(rand() * 50)  -- random bucket 0~49; Hive provides rand(), not Random()
                  ELSE 0
              END) a
GROUP BY a.Key;

The rows are grouped by Key, but the key distribution is highly skewed: KEY001 accounts for roughly 90% of the data. The fix is to scatter the KEY001 rows across 50 random buckets (floor(rand() * 50) yields a number between 0 and 49), aggregate each bucket locally in the inner query, and then aggregate globally in the outer query. Spreading the hot key over many reducers improves overall efficiency.

6. A Web Tool for Hive

Hue (gethue.com/) provides a web-based interface for writing and running Hive queries.