Hive Scenario Questions


Question 1:

Which window functions do you know, and what does each one do? Give an example of using one in a business scenario. Write out a window function and explain its purpose, then write any SQL statement that uses a window function and explain what it means.

 The over() clause turns a function into a window function. Common ones: row_number/rank/dense_rank for ranking; lag/lead for reading the previous/next row; sum/avg/max/min/count over a window for running aggregates; first_value/last_value; ntile for bucketing. For example, sum(amount) over(partition by uid order by dt) is each user's running total of amount by date, e.g. cumulative spend per user in an order table.
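What a partitioned running aggregate computes can be sketched outside Hive. A minimal Python model of `sum(amount) over(partition by uid order by dt)` (the data and names are illustrative):

```python
from itertools import groupby

def running_sum(rows):
    """rows: (uid, dt, amount) tuples. Returns (uid, dt, amount, running_total),
    mimicking sum(amount) over(partition by uid order by dt)."""
    out = []
    for _, grp in groupby(sorted(rows), key=lambda r: r[0]):  # partition by uid
        total = 0
        for uid, dt, amount in grp:  # rows arrive ordered by dt within the partition
            total += amount
            out.append((uid, dt, amount, total))
    return out

res = running_sum([(1, "d1", 10), (1, "d2", 5), (2, "d1", 7)])
```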

2. For each channel, find the number of views and the total watch time.

Data: video table

 uid channel min 
 1 1 23
 2 1 12
 3 1 12
 4 1 32
 5 1 342
 6 2 13
 7 2 34
 8 2 13
 9 2 134
 create table video( 
 uid int, 
 channel string, 
 min int 
 )
 row format delimited 
 fields terminated by ' ' 
 ;
 load data local inpath './hivedata/video.txt' into table video;

Answer:

 select channel,count(*) cnt,sum(min) total from video group by channel;

3. Write SQL for the following.

Data:

 userid,month,visits 
 A,2015-01,5
 A,2015-01,15
 B,2015-01,5
 A,2015-01,8
 B,2015-01,25
 A,2015-01,5
 A,2015-02,4
 A,2015-02,6
 B,2015-02,10
 B,2015-02,5
 A,2015-03,16
 A,2015-03,22
 B,2015-03,23
 B,2015-03,10
 B,2015-03,1
 drop table visits;
 create table visits(
 userid string, 
 month string, 
 visits int 
 )
 row format delimited 
 fields terminated by ',' 
 ;
 load data local inpath './hivedata/visits.txt' overwrite into table visits; 

Requirement: for each user, the maximum single-month visit count up to and including each month, and the cumulative visit total up to that month. The result should look like:

 +---------+----------+---------+-------------+---------------+--+
 | userid  |  month   | visits  | max_visits  | total_visits  |
 +---------+----------+---------+-------------+---------------+--+
 | A       | 2015-01  | 33      | 33          | 33            |
 | A       | 2015-02  | 10      | 33          | 43            |
 | A       | 2015-03  | 38      | 38          | 81            |
 | B       | 2015-01  | 30      | 30          | 30            |
 | B       | 2015-02  | 15      | 30          | 45            |
 | B       | 2015-03  | 34      | 34          | 79            |
 +---------+----------+---------+-------------+---------------+--+
 select userid,month,
 visits,
 max(visits) over(distribute by userid sort by month) max_visits,
 sum(visits) over(distribute by userid sort by month) total_visits
 from (
 select userid,month,
 sum(visits) visits
 from visits 
 group by userid,month
 ) t;

4. Count the total number of users who logged in for 7 consecutive days.

Data: t1 table

uid dt login_status (1 = successful login, 0 = failure) 
1 2019-07-11 1 
1 2019-07-12 1 
1 2019-07-13 1 
1 2019-07-14 1 
1 2019-07-15 1 
1 2019-07-16 1 
1 2019-07-17 1 
1 2019-07-18 1 
2 2019-07-11 1 
2 2019-07-12 1 
2 2019-07-13 0 
2 2019-07-14 1 
2 2019-07-15 1 
2 2019-07-16 0 
2 2019-07-17 1 
2 2019-07-18 0 
2 2019-07-19 1 
2 2019-07-20 0 
2 2019-07-21 1 
2 2019-07-22 0 
2 2019-07-23 1 
2 2019-07-24 0 
3 2019-07-11 1 
3 2019-07-12 1 
3 2019-07-13 1 
3 2019-07-14 1 
3 2019-07-15 1 
3 2019-07-16 1 
3 2019-07-17 1 
3 2019-07-18 1 
drop table login;
create table login( 
Uid int, 
dt string, 
login_status int 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/login.txt' into table login; 
select count(*)
from
(
select distinct uid
from(
select uid,dt,lag(dt,6) over(partition by uid order by dt) pre_dt,
sum(login_status) over(partition by uid order by dt rows between 6 preceding and current row) total
from login
) t
where date_sub(dt,6)=pre_dt and t.total=7) t1;
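The rolling-sum check above (`rows between 6 preceding and current row`, combined with the `lag(dt,6)` date test) is equivalent to asking whether some 7 consecutive calendar days all have a successful login. A Python sketch of that check for one user:

```python
from datetime import date, timedelta

def has_7_day_streak(login_days):
    """login_days: set of dates on which login_status = 1.
    True if some run of 7 consecutive calendar days is fully contained."""
    return any(all(d + timedelta(days=i) in login_days for i in range(7))
               for d in login_days)

u1 = {date(2019, 7, d) for d in range(11, 19)}         # 8 consecutive days -> True
u2 = {date(2019, 7, d) for d in (11, 12, 14, 15, 17)}  # broken streaks -> False
```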

5. Which ranking functions do you know, and how do they differ? A written answer is enough.

6. Write SQL that returns the top three students in each class (ties do not share a rank), together with the score gap between consecutive ranks.

Data: stu table

stu_no class score 
1	1901	90
2	1901	90
3	1901	83
4	1901	60
5	1902	66
6	1902	23
7	1902	99
8	1902	67
9	1902	87
drop table stu;
create table stu( 
Stu_no int, 
class string, 
score int 
)
row format delimited 
fields terminated by '\t' 
;

load data local inpath './hivedata/stu.txt' into table stu; 

Write SQL producing the following result:

+--------+---------+--------+-----+----------+--+
| class  | stu_no  | score  | rn  | rn_diff  |
+--------+---------+--------+-----+----------+--+
| 1901   | 2       | 90     | 1   | 90       |
| 1901   | 1       | 90     | 2   | 0        |
| 1901   | 3       | 83     | 3   | -7       |
| 1902   | 7       | 99     | 1   | 99       |
| 1902   | 9       | 87     | 2   | -12      |
| 1902   | 8       | 67     | 3   | -20      |
+--------+---------+--------+-----+----------+--+
select class,stu_no,score,rn,rn_diff
from(
select class,stu_no,score,
row_number() over(partition by class order by score desc) rn,
score-nvl(lag(score,1) over(partition by class order by score desc),0) rn_diff
from stu
) t
where t.rn<4;
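The row_number/lag combination can be modeled in Python (data is the sample class 1901; as with row_number, tied scores keep input order here):

```python
def top3_with_gaps(scores):
    """scores: (stu_no, score) pairs for one class. Returns
    (stu_no, score, rn, rn_diff), where rn_diff = score minus the previous
    rank's score, with 0 used for the missing previous row (the NVL in the SQL)."""
    ranked = sorted(scores, key=lambda x: -x[1])[:3]
    prev = 0
    out = []
    for rn, (stu_no, score) in enumerate(ranked, start=1):
        out.append((stu_no, score, rn, score - prev))
        prev = score
    return out

res = top3_with_gaps([(1, 90), (2, 90), (3, 83), (4, 60)])
```

Rank 1's gap is its own score because lag has no previous row and NVL substitutes 0.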

7. What approaches do you know for transposing rows and columns? Describe each in detail. A written answer is enough.

8. Write SQL to transpose rows to columns. Data:

id sid subject score
1,001,语文,90
2,001,数学,92
3,001,英语,80
4,002,语文,88
5,002,数学,90
6,002,英语,75.5
7,003,语文,70
8,003,数学,85
9,003,英语,90
10,003,政治,82

Write SQL producing the following result:

+---------+--------+--------+--------+--------+-----------+--+
| sid     | 语文   | 数学   | 英语   | 政治   | total     |
+---------+--------+--------+--------+--------+-----------+--+
| 001     | 90.0   | 92.0   | 80.0   | 0.0    | 262.0     |
| 002     | 88.0   | 90.0   | 75.5   | 0.0    | 253.5     |
| 003     | 70.0   | 85.0   | 90.0   | 82.0   | 327.0     |
| total   | 248.0  | 267.0  | 245.5  | 82.0   | 842.5     |
+---------+--------+--------+--------+--------+-----------+--+
drop table score;
create table score( 
id int, 
sid string, 
subject string, 
score double
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/score.txt' into table score; 
select 
sid, 
sum(if(subject="语文",score,0)) as `语文`, 
sum(if(subject="数学",score,0)) as `数学`, 
sum(case when subject="英语" then score else 0 end) as `英语`, 
sum(case when subject="政治" then score else 0 end) as `政治`,
sum(score) total
from score 
group by sid
union
select "total",sum(`语文`),sum(`数学`),sum(`英语`),sum(`政治`),sum(total) from
(
select 
sid, 
sum(if(subject="语文",score,0)) as `语文`, 
sum(if(subject="数学",score,0)) as `数学`, 
sum(case when subject="英语" then score else 0 end) as `英语`, 
sum(case when subject="政治" then score else 0 end) as `政治`,
sum(score) total
from score 
group by sid
)t;
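The conditional-sum pivot plus the UNIONed grand-total row can be sketched in Python (subject names follow the sample data):

```python
def pivot_scores(rows, subjects=("语文", "数学", "英语", "政治")):
    """rows: (sid, subject, score). Returns {sid: [per-subject sums..., total]}
    plus a "total" row, mirroring the SQL's conditional sums and UNION."""
    table = {}
    for sid, subject, score in rows:
        row = table.setdefault(sid, [0.0] * len(subjects))
        row[subjects.index(subject)] += score
    result = {sid: vals + [sum(vals)] for sid, vals in table.items()}
    totals = [sum(r[i] for r in table.values()) for i in range(len(subjects))]
    result["total"] = totals + [sum(totals)]
    return result

rows = [("001", "语文", 90), ("001", "数学", 92), ("001", "英语", 80),
        ("002", "语文", 88), ("002", "数学", 90), ("002", "英语", 75.5),
        ("003", "语文", 70), ("003", "数学", 85), ("003", "英语", 90),
        ("003", "政治", 82)]
res = pivot_scores(rows)
```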

9. Write SQL for the following.

Data: t1 table

uid tags 
1	1,2,3
2	2,3
3	1,2

Write SQL producing the result below:

uid tag 
1	1 
1	2
1	3
2	2
2	3
3	1
3	2
create table t1( 
uid int, 
tags string 
)
row format delimited 
fields terminated by '\t' 
;

load data local inpath './hivedata/t1.txt' into table t1; 
select uid,tag
from t1 lateral view explode(split(tags,",")) t  as tag;
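What `lateral view explode(split(tags, ","))` produces can be modeled directly in Python:

```python
def explode_tags(rows):
    """rows: (uid, tags) with comma-separated tags; yields one (uid, tag)
    row per tag, like LATERAL VIEW explode(split(tags, ","))."""
    return [(uid, tag) for uid, tags in rows for tag in tags.split(",")]

res = explode_tags([(1, "1,2,3"), (2, "2,3"), (3, "1,2")])
```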

10. Row-to-column conversion

Data: t2 table:

Tags 
1,2,3
1,2
2,3

t3 table:

id lab 
1 A
2 B
3 C

Using the data in t2 and t3, write SQL producing the following result:

+--------+--------+--+
|  tags  |  labs  |
+--------+--------+--+
| 1,2    | A,B    |
| 1,2,3  | A,B,C  |
| 2,3    | B,C    |
+--------+--------+--+
create table t2( 
tags string 
);
load data local inpath './hivedata/t2.txt' overwrite into table t2;


create table t3( 
id int, 
lab string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/t3.txt' overwrite into table t3; 
select tags,
concat_ws(",",sort_array(collect_set(lab))) labs  -- sort_array makes the label order deterministic
from
(select tags,lab
from 
(select tags,tag
from t2 lateral view explode(split(tags,",")) A as tag) B 
join t3 on B.tag = t3.id) C
group by tags;

11. Row-to-column conversion

Data: t4 table:

id tag flag 
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8

Write SQL producing the following result:

id tag flag 
a b 1|2|3
c d 6|8
create table t4( 
id string, 
tag string, 
flag int 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/t4.txt' overwrite into table t4;

-- sort_array makes the order deterministic; collect_set also drops the duplicate 8
select id,tag,concat_ws("|",sort_array(collect_set(cast(flag as string)))) from t4 group by id,tag;

12. Column-to-row conversion

Data: t5 table

uid name tags 
1	goudan	chihuo,huaci
2	mazi	sleep
3	laotie	paly

Write SQL producing the following result:

uid name tag 
1 goudan chihuo
1 goudan huaci
2 mazi sleep
3 laotie paly
create table t5( 
uid string, 
name string, 
tags string 
)
row format delimited 
fields terminated by '\t' ;

load data local inpath './hivedata/t5.txt' overwrite into table t5;
select 
uid, 
name, 
tag 
from t5 
lateral view explode(split(tags,",")) t1 as tag ; 

13. Row-to-column conversion

Data: content table:

uid contents 
1	i|love|china
2	china|is|good|i|i|like

Count word occurrences as below; when counts tie, order by the word itself:

+----------+------+--+
| content  | num  |
+----------+------+--+
| i        | 3    |
| china    | 2    |
| good     | 1    |
| is       | 1    |
| like     | 1    |
| love     | 1    |
+----------+------+--+
drop table content;
create table content( 
uid int, 
contents string 
)
row format delimited 
fields terminated by '\t' 
;
load data local inpath './hivedata/content.txt' overwrite into table content; 
select content,count(1) num
from content lateral view explode(split(contents,"\\|")) tmp as content
group by content
order by num desc,content;
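Note that split() takes a regex, so the pipe needs double escaping in the SQL above. The explode-then-count logic itself is a word count, sketched in Python:

```python
from collections import Counter

def word_freq(rows):
    """rows: contents strings with '|'-separated words. Counts words,
    ordered by count desc then word asc (the ORDER BY in the SQL)."""
    counts = Counter(w for s in rows for w in s.split("|"))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

res = word_freq(["i|love|china", "china|is|good|i|i|like"])
```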

14. Column-to-row conversion

Data: course1 table

id course 
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

Write SQL producing the result below (1 = enrolled in the course, 0 = not enrolled):

+-----+----+----+----+----+----+----+--+
| id  | a  | b  | c  | d  | e  | f  |
+-----+----+----+----+----+----+----+--+
| 1   | 1  | 1  | 1  | 0  | 1  | 0  |
| 2   | 1  | 0  | 1  | 1  | 0  | 1  |
| 3   | 1  | 1  | 1  | 0  | 1  | 0  |
+-----+----+----+----+----+----+----+--+
create table course( 
id int, 
course string 
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/course.txt' overwrite into table course; 
select 
id, 
sum(case when c.course="a" then 1 else 0 end) as `a`, 
sum(case when c.course="b" then 1 else 0 end) as `b`, 
sum(case when c.course="c" then 1 else 0 end) as `c`, 
sum(case when c.course="d" then 1 else 0 end) as `d`, 
sum(case when c.course="e" then 1 else 0 end) as `e`, 
sum(case when c.course="f" then 1 else 0 end) as `f` 
from course c 
group by id ; 

15. Timestamp functions: unix_timestamp, from_unixtime

Get the current Unix timestamp:

select unix_timestamp();

Get the timestamp for "2019-07-31 11:57:25":

select unix_timestamp("2019-07-31 11:57:25");

Get the timestamp for "2019-07-31 11:57":

select unix_timestamp("2019-07-31 11:57","yyyy-MM-dd HH:mm");

Get the date and time for timestamp 1564545445:

select from_unixtime(1564545445);

Get the date and hour (yyyy/MM/dd HH) for timestamp 1564545446:

select from_unixtime(1564545446,"yyyy/MM/dd HH");  -- MM = month; lowercase mm would mean minutes

16. Date format conversion: yyyyMMdd -> yyyy-MM-dd

Data: dt table

20190730
20190731

Write SQL producing the following result:

2019-07-30
2019-07-31
create table dt( 
dt string 
);

load data local inpath './hivedata/dt.txt' overwrite into table dt; 
select from_unixtime(unix_timestamp(dt,"yyyyMMdd"),"yyyy-MM-dd") from dt;
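The unix_timestamp/from_unixtime round trip is just a parse followed by a reformat, as this Python equivalent shows:

```python
from datetime import datetime

def reformat_date(d):
    """yyyyMMdd -> yyyy-MM-dd, the same round trip as
    from_unixtime(unix_timestamp(dt, "yyyyMMdd"), "yyyy-MM-dd")."""
    return datetime.strptime(d, "%Y%m%d").strftime("%Y-%m-%d")
```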

17. Data: store, month, amount

sid month money
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
create table store( 
sid string,
month string,
money int
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/store.txt' overwrite into table store;

Write HiveQL that returns each store's sales for the month and its cumulative sales up to that month.

select sid,month,sales,
sum(sales) over(partition by sid order by month) total
from (
select sid,month,sum(money) sales from store group by sid,month
) t;

18. Have you encountered data skew in Hive? How did you handle it, and what is the underlying cause?

Answer in words.

19. When have you used the array and map types in Hive, and why?

Answer in words.

20. Write SQL for the following requirement.

Table A records user login times; each login during the month adds one row. Table A:

log_time	uid
2018-10-01 12:34:11	123
2018-10-02 13:21:08	123
2018-10-02 14:21:08	123
2018-10-02 14:08:09	456
2018-10-04 05:10:22	123
2018-10-04 21:38:38	456
2018-10-04 22:38:38	456
2018-10-05 09:57:32	123
2018-10-06 13:22:56	123
2018-11-01 12:34:11	123
2018-11-02 13:21:08	123
2018-11-02 14:21:08	123
2018-11-02 14:08:09	456
2018-11-04 05:10:22	123
2018-11-04 21:38:38	456
2018-11-05 09:57:32	123
2018-11-06 13:22:56	123

Compute each user's maximum number of consecutive login days in each month. In the sample data, user 123's maximum streak is 3 days and user 456's is 1.

drop table login_time;
create table login_time(
log_time timestamp,
uid string
)
row format delimited
fields terminated by '\t';

load data local inpath './hivedata/login_time.txt' overwrite into table login_time;

-- Note: the raw data may need cleaning so that each user has at most one row per day
select uid,max(cnt)
from
(select uid,date_format(dt,"yyyy-MM"),count(1) cnt
from
select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) grp_dt
from
(
select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt 
from login_time
)a
)b group by uid,grp_dt,date_format(dt,"yyyy-MM")
)c
group by uid;
Subquery a cleans the data down to one row per user per day.
Subquery b takes uid and dt and computes grp_dt with date_sub, subtracting each row's rank (row_number) from its date.
The ranking function is the key trick here: for consecutive login dates, date minus rank stays constant,
so grp_dt has the same value for every day in a streak.
Subquery c can therefore group by uid, grp_dt, and date_format(dt, "yyyy-MM")
to get each streak's length, and taking the max per user gives the longest consecutive-login run.
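The date-minus-rank trick described above can be verified in a few lines of Python (per-month grouping omitted for brevity; dates are from user 123's October sample):

```python
from datetime import date, timedelta

def max_streak(days):
    """days: a user's login dates (one month). date - row_number is constant
    across consecutive dates, so grouping on that difference yields the
    streaks; return the length of the longest one."""
    runs = {}
    for rn, d in enumerate(sorted(set(days)), start=1):
        key = d - timedelta(days=rn)
        runs[key] = runs.get(key, 0) + 1
    return max(runs.values())

# user 123 in October: the 1st, 2nd (twice), 4th, 5th, 6th -> longest streak 3
u123_oct = [date(2018, 10, d) for d in (1, 2, 2, 4, 5, 6)]
```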

21. Write SQL for the following.

Sample data: t1 table

gender,cookie,ip,timestampe,ua 

F,1707041428491566106,111.200.195.186,1208524973899,Dalvik%2F2.1.0%20%28Linux%3B%20U%3B%20Android 

...(the full data set was provided as an image and is not reproduced here)

Rewrite the awk command shown in the image as SQL, then answer the questions above.

Use SQL for the pv/uv counts; the other parts can be answered in words.

22. Use Hive to compute the difference (symmetric difference) of two data sets.

Data: diff_t1 table:

id name 
1 zs
2 ls

diff_t2 table:

id name 
1 zs
3 ww

Result:

id name 
2 ls 
3 ww 

Idea: take the rows of t1 that have no match in t2 (left join, keep rows where t2.id is null), then union the rows of t2 that have no match in t1.
create table diff_t1( 
id string, 
name string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/diff_t1.txt' overwrite into table diff_t1; 
create table diff_t2( 
id string, 
name string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/diff_t2.txt' overwrite into table diff_t2; 
select t1.id id, t1.name name 
from diff_t1 t1 
left join diff_t2 t2 on t1.id=t2.id 
where t2.id is null 
union 
select t2.id id, t2.name name 
from diff_t1 t1 
right join diff_t2 t2 on t1.id=t2.id 
where t1.id is null ; 
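The two anti-joins UNIONed together compute a symmetric difference, which Python expresses directly with sets:

```python
def table_diff(t1, t2):
    """Rows that appear in exactly one of the two tables, i.e. what the
    left-anti-join UNION right-anti-join query computes."""
    s1, s2 = set(t1), set(t2)
    return sorted(s1 ^ s2)  # ^ is symmetric difference

res = table_diff([("1", "zs"), ("2", "ls")], [("1", "zs"), ("3", "ww")])
```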

23. Implement the following requirement in HiveQL.

A website's purchase records have the following fields:

orderid,userid,productid,price,timestamp,date
121,张三,3,100,1535945356,2018-08-07
122,张三,3,200,1535945356,2018-08-08
123,李四,3,200,1535945356,2018-08-08
124,王五,1,200,1535945356,2018-08-08
125,张三,3,200,1535945356,2018-08-09
126,张三,2,200,1535945356,2018-08-09
127,李四,3,200,1535945356,2018-08-09
128,李四,3,200,1535945356,2018-08-10
129,李四,3,200,1535945356,2018-08-11

Write SQL to find the users who bought product 3 both today and yesterday, together with yesterday's spend.

drop table product;
create table product(
orderid string,
userid string,
productid int,
price int,
tamp int,
dt date
)
row format delimited
fields terminated by ',';

load data local inpath './hivedata/product.txt' overwrite into table product;
select * 
from(
select userid,dt,
lag(dt,1) over(partition by userid order by dt) yestoday,
case when
datediff(dt,lag(dt,1) over(partition by userid order by dt))=1
then lag(total_price) over(partition by userid order by dt)
else null end yestoday_price
from
(
select userid,dt,productid,sum(price) over(partition by userid,dt order by dt) total_price
from product order by userid,dt
)A
where productid = 3
)B where yestoday_price is not null;

24. Implement the following in HiveQL.

Table user_action_log holds user behavior data:

uid	 time	action
1	Time1	Read
3	Time2	Comment
1	Time3	Share
2	Time4	Like
1	Time5	Write
2	Time6	like
3	Time7	Write
2	Time8	Read

To analyze behavior patterns, find each user's first action in the table.

drop table user_action_log;
create table user_action_log(
uid int,
time string,
action string
)
row format delimited
fields terminated by '\t';

load data local inpath './hivedata/user_action_log.txt' overwrite into table user_action_log;

Implementation:

select uid,time,action
from (
select uid,time,action,row_number() over(partition by uid order by time) rn
from user_action_log
)A
where rn = 1;

25. Find each user's longest run of consecutive login days.

Data: user_login table

uid,dt
1,2019-08-01
1,2019-08-02
1,2019-08-03
2,2019-08-01
2,2019-08-02
3,2019-08-01
3,2019-08-03
4,2019-07-28
4,2019-07-29
4,2019-08-01
4,2019-08-02
4,2019-08-03

Result:

uid cnt_days 
1 3 
2 2 
3 1 
4 3 
create table user_login(
uid int,
dt date
)
row format delimited
fields terminated by ',';

load data local inpath './hivedata/user_login.txt' overwrite into table user_login;
select uid,max(cnt)
from
(select uid,dt_sub,count(1) cnt
from
(select uid,dt,
date_sub(dt,row_number() over(partition by uid order by dt)) dt_sub
from user_login
)A 
group by uid, dt_sub
)B group by uid;

26. Compute pv and uv with SQL.

Data: t1 table

uid dt url
1 2019-08-06 http://www.baidu.com
2 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.soho.com
3 2019-08-06 http://www.meituan.com
3 2019-08-06

Result:

dt uv pv 
2019-08-06 3 5 
create table user_net_log(
uid int,
dt date,
url string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/user_net_log.txt' overwrite into table user_net_log;
select dt,count(distinct uid),count(url)
from user_net_log
group by dt;

27. What do the Hive functions coalesce(), nvl(), concat_ws(), collect_list(), collect_set(), and regexp_replace() do?

Answer in words.

28. The following table records the hero picks from each game of a MOBA played by a student.

pk_moba table

id	names
1	亚索,挖据机,艾瑞莉亚,洛,卡莎
2	亚索,盖伦,奥巴马,牛头,皇子
3	亚索,盖伦,艾瑞莉亚,宝石,琴女
4	亚索,盖伦,赵信,老鼠,锤石

Use HiveQL to find the three most-picked heroes and their pick rate (= games appeared in / total games).

create table pk_moba(
id int,
names array<string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath './hivedata/pk_moba.txt' overwrite into table pk_moba;
select name,cnt,rn,concat(round(cnt/4*100,0),"%") lv  -- 4 = total games in the sample; in general derive it with count(distinct id)
from 
(
select name,cnt,dense_rank() over(sort by cnt desc) rn
from (
select name,count(name) cnt
from pk_moba lateral view explode(names) tmp as name
group by name
)A
)B where rn<4;
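The explode-count-rank pipeline maps cleanly onto a Counter in Python (sample data from the table above):

```python
from collections import Counter

def top_heroes(games, k=3):
    """games: one hero list per game. Returns (hero, games_appeared, pick_rate)
    for the k most-picked heroes; pick_rate = appearances / total games."""
    n = len(games)
    counts = Counter(hero for game in games for hero in game)
    return [(hero, cnt, cnt / n) for hero, cnt in counts.most_common(k)]

games = [["亚索", "挖据机", "艾瑞莉亚", "洛", "卡莎"],
         ["亚索", "盖伦", "奥巴马", "牛头", "皇子"],
         ["亚索", "盖伦", "艾瑞莉亚", "宝石", "琴女"],
         ["亚索", "盖伦", "赵信", "老鼠", "锤石"]]
res = top_heroes(games)
```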

29. There are three tables:

district: two fields, region ID (disid) and region name (disname)

city: two fields, city ID (cityid) and region ID (disid)

order: four fields, order ID (orderid), user ID (userid), city ID (cityid), and spend amount (amount).

district table:

disid disname 
1 华中
2 西南


create table district(
disid int,
disname string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/district.txt' overwrite into table district;

city table:

cityid disid 
1 1
2 1
3 2
4 2
5 2

create table city(
cityid int,
disid int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/city.txt' overwrite into table city;

order table:

oid userid cityid amount 
1 1 1 1223.9
2 1 1 9999.9
3 2 2 2322
4 2 2 8909
5 2 3 6789
6 2 3 798
7 3 4 56786
8 4 5 78890

create table order_t(
oid int,
userid int,
cityid int,
amount float
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/order.txt' overwrite into table order_t;

A high spender is a user whose total spend exceeds 10,000. Write HiveQL producing this report:

region_name high_spender_count total_amount

select disname,count(1),sum(amount)
from(
select disname,userid,sum(amount) amount from district A join city B on A.disid = B.disid
join order_t C on B.cityid = C.cityid
group by disname,userid
)D
where amount>10000
group by disname;

30. An app's daily access data is stored in table access_log, with date field ds, user type field user_type, user account user_id, and access time log_time. Write HiveQL for the following:

(1) Overall UV and PV per day:

select ds,count(distinct user_id) uv,count(1) pv
from access_log
group by ds;

(2) UV and PV per day per user type:

select ds,user_type,count(distinct user_id) uv,count(1) pv
from access_log
group by ds,user_type;

(3) Earliest and latest access time per day per user type:

select ds,user_type,min(log_time),max(log_time)
from access_log
group by ds,user_type;

(4) The 10 users with the most accesses per day per user type:

select ds,user_type,user_id,cnt,rn
from(
select ds,user_type,user_id,cnt,
row_number() over(partition by ds,user_type order by cnt desc) rn
from
(select ds,user_type,user_id,count(1) cnt
from access_log
group by ds,user_type,user_id
)A
)B where rn<11;

31. A large table A (hundreds of millions of rows) is joined with a small table B (a few thousand rows) and the join skews. What can you do about it?

Answer in words.

32. Given the following three tables:

login_a (login table):

ds user_id 
2019-08-06 1
2019-08-06 2
2019-08-06 3
2019-08-06 4

create table login_a(
ds date,
user_id int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/login_a.txt' overwrite into table login_a;

read_b (reading table):

ds user_id read_num 
2019-08-06 1 2
2019-08-06 2 3
2019-08-06 3 6

create table read_b(
ds date,
user_id int,
read_num int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/read_b.txt' overwrite into table read_b;

cost_c (payment table):

ds user_id price
2019-08-06 1 55.6
2019-08-06 2 55.8

create table cost_c(
ds date,
user_id int,
price float
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/cost_c.txt' overwrite into table cost_c;

Using the three tables above, write HiveQL for the following:

(1) The number of users who logged in and read on the same day, and the total number of books read:

select A.ds,count(distinct A.user_id),sum(B.read_num)
from login_a A join read_b B on A.user_id = B.user_id and A.ds = B.ds
group by A.ds;

(2) The number of users who logged in and read but did not pay:

select A.ds,count(distinct A.user_id)
from login_a A join read_b B on A.user_id = B.user_id and A.ds = B.ds
left join cost_c C on B.user_id = C.user_id and B.ds = C.ds
where C.price is null
group by A.ds;

(3) The number of users who logged in and paid, and the total amount paid:

select A.ds,count(distinct A.user_id),sum(C.price)
from login_a A join cost_c C on A.user_id = C.user_id and A.ds = C.ds
group by A.ds;

33. In HiveQL, what is the difference between left outer join and left semi join?

Answer in words.

34. Given an order table and a channel table with the following structure:

create table `order`(
order_id bigint,
user_id bigint comment 'user id',
amount double comment 'order amount',
channel string comment 'channel',
time string comment 'order time, yyyy-MM-dd HH:mm:ss'
)
partitioned by (dt string comment 'day, yyyy-MM-dd');

Write HiveQL to find, for 2019-08-06, each channel's number of ordering users and total order amount.

Result header: channel user_num order_amount

select channel,count(distinct user_id) user_num,sum(amount) order_amount
from `order`
where dt = '2019-08-06'  -- filter on the partition column so Hive can prune partitions
group by channel;

35. Table design and queries:

Design tables to store basic student information, course information, and students' course scores, and write a query for all students whose average score is above 85.
 create table stu_1( 
 id string, 
 name string, 
 age int, 
 addr string 
 )
 row format delimited 
 fields terminated by ',' 
 ;
 ​
 create table course_1( 
 cid string, 
 cname string 
 )
 row format delimited 
 fields terminated by ',' ;
 ​
 create table course_sc( 
 id string, 
 cid string, 
 score int 
 )
 row format delimited 
 fields terminated by ',' 
 ;
 ​
 load data local inpath '/hivedata/course_1.txt' overwrite into table course_1; 
 load data local inpath '/hivedata/stu_1.txt' overwrite into table stu_1; 
 load data local inpath '/hivedata/course_sc.txt' overwrite into table course_sc; 
 select cs.id, avg(score) avgscore 
 from course_sc cs 
 group by cs.id 
 having avg(score)>85 ; 

36. Requirement:

 Given a user table user(uid, name) and a blacklist table BanUser(uid):
 1. Using a left join, select all users not on the blacklist.
 2. Using not exists, select all users not on the blacklist.
 create table u( 
 id string, 
 name string 
 )
 row format delimited 
 fields terminated by ',' 
 ;
 ​
 create table banuser( 
 id string 
 );
 ​
 load data local inpath '/hivedata/banuser.txt' overwrite into table banuser; 
 load data local inpath '/hivedata/u.txt' overwrite into table u; 
 select u.* 
 from u left join banuser on u.id = banuser.id
 where banuser.id is null;
 ​
 select u.*
 from u 
 where not exists (select 1 from banuser where banuser.id = u.id);

37. Requirement:

course_score table data:

 1,zhangsan,数学,80,2015
 2,lisi,语文,90,2016
 3,lisi,数学,70,2016
 4,wangwu,化学,80,2017
 5,zhangsan,语文,85,2015
 6,zhangsan,化学,90,2015
 create table course_score( 
 id string, 
 name string, 
 course string, 
 score int, 
 year string 
 )
 row format delimited 
 fields terminated by ',' 
 ;
 ​
 load data local inpath './hivedata/course_score.txt' overwrite into table course_score; 

1. For each year, the highest-score record per course (all 5 fields):

 select id,name,course,score,year
 from (
 select id,name,course,score,year,
 rank() over(partition by year,course order by score desc) rn
 from course_score
 ) t
 where rn=1;

 -- if only the max score itself is needed:
 select year,course,max(score)
 from course_score
 group by year,course order by year;

2. For each single year, the math (数学) score records of students who scored 90 or above in 语文 (all fields):

 select A.id,A.name,A.course,A.score,A.year
 from course_score A join course_score B on A.name = B.name and A.year = B.year
 where B.course='语文' and B.score>=90 and A.course='数学';
(The following is an alternative answer to part 1, joining each record against the per-course, per-year maximum:)

 select s.id, s.name, s.course, s.score, s.year 
 from course_score s 
 join
 (select course, year, max(score) score 
 from course_score 
 group by course,year) t1 
 on s.course=t1.course and s.year=t1.year and s.score=t1.score;
(An equivalent answer to part 2 using a subquery:)

 select s.id, s.name, s.course, s.score, s.year 
 from course_score s 
 join 
 (select id, name, course, score, year from course_score 
 where score >=90 and course="语文") t1 
 on s.name=t1.name and s.year=t1.year 
 where s.course="数学";

38. Requirement:

t1 table:

 name course score 
 aa English 75
 bb math 85
 aa math 90
 ​
 create table t1_1(
 name string,
 course string,
 score int
 )
 row format delimited
 fields terminated by ' ';
 ​
 load data local inpath './hivedata/t1_1.txt' overwrite into table t1_1;

Use HiveQL to produce:

 name English math 
 aa 75 90 
 bb 0 85
 select name,
 max(if(course="English",score,0)) English,
 max(if(course="math",score,0)) math
 from t1_1
 group by name;

39. Requirement:

t1 table:

 username product 
 A P1
 B P1
 A P2
 B P3

Use HiveQL to produce:

 username P1 P2 P3 
 A 1 1 0
 B 1 0 1
 select username,
 max(if(product="P1",1,0)) P1,
 max(if(product="P2",1,0)) P2,
 max(if(product="P3",1,0)) P3
 from t1
 group by username;

40. Requirement:

dpt (department) table

 dpt_id  dpt_name
 1   产品
 2   技术

user table

 User_id Dpt_id
 1   1
 2   1
 3   2
 4   2
 5   3

result table

 user_id dpt_id  dpt_name
 1   1   产品
 2   1   产品
 3   2   技术
 4   2   技术
 5   3   其他部门

Given the user and dpt tables, write HiveQL producing the result table.

 select user.user_id,user.dpt_id,nvl(dpt.dpt_name,'其他部门') dpt_name
 from user 
 left join dpt 
 on user.dpt_id = dpt.dpt_id;

41. Requirement:

t1_order table:

 order_id order_type order_time
 111 N 10:00
 111 A 10:05
 111 B 10:10
 ​
 create table t1_order(
 order_id string,
 order_type string,
 order_time string
 )
 row format delimited
 fields terminated by ' ';
 ​
 load data local inpath './hivedata/t1_order.txt' overwrite into table t1_order;

Use HiveQL to produce:

 order_id order_type_1 order_type_2 order_time_1 order_time_2 
 111 N A 10:00 10:05
 111 A B 10:05 10:10
 select order_id,order_type_1,order_type_2,order_time_1,order_time_2
 from(
 select order_id,order_type order_type_1,
 lead(order_type,1) over(partition by order_id order by order_time) order_type_2,
 order_time order_time_1,
 lead(order_time,1) over(partition by order_id order by order_time) order_time_2
 from t1_order
 )A
 where order_type_2 is not null;

42. Requirement:

t1_hobby table

 name sex hobby 
 janson 男 打乒乓球、游泳、看电影 
 tom 男 打乒乓球、看电影
 ​
 drop table t1_hobby;
 create table t1_hobby(
 name string,
 sex string,
 hobby string
 )
 row format delimited
 fields terminated by ' ';
 ​
 load data local inpath './hivedata/t1_hobby.txt' overwrite into table t1_hobby;

hobby has at most 3 values; write HiveQL producing:

 name sex hobby1 hobby2 hobby3 
 janson 男 打乒乓球 游泳 看电影 
 tom 男 打乒乓球 看电影 
 select name,sex,split(hobby,"、")[0] hobby1,
 split(hobby,"、")[1] hobby2,
 nvl(split(hobby,"、")[2],"") hobby3
 from t1_hobby;
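The split-and-pad logic (indexing into split() and using NVL for missing columns) is easy to check in Python:

```python
def split_hobby(hobby, n=3, sep="、"):
    """Split into at most n columns, padding missing ones with "",
    matching split(hobby, sep)[i] with NVL(..., "") in the SQL."""
    parts = hobby.split(sep)
    return (parts + [""] * n)[:n]
```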

43. Requirement:

Table t1 (a normal access-log line; all fields are space-separated)

 8.35.201.160 - - [16/May/2018:17:38:21 +0800] "GET/uc_server/data/avatar/000/01/54/22_avatar_middle.jpg HTTP/1.1" 200 5396 

Use HiveQL to produce:

ip dt url

 8.35.201.160 2018-5-16 17:38:21 /uc_server/data/avatar/000/01/54/22_avatar_middle.jpg
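No SQL answer is given; in Hive this would typically combine regexp_extract with unix_timestamp/from_unixtime. A Python sketch of the extraction (the regex is an assumption fitted to the sample line, which has no space between GET and the path):

```python
import re
from datetime import datetime

# Hypothetical pattern for the common access-log layout shown in the question.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST)\s*(\S+)')

def parse_line(line):
    """Extract (ip, dt, url); dt is reformatted to yyyy-MM-dd HH:mm:ss."""
    m = LOG_RE.match(line)
    ip, ts, url = m.group(1), m.group(2), m.group(3)
    dt = datetime.strptime(ts.split(" ")[0], "%d/%b/%Y:%H:%M:%S")
    return ip, dt.strftime("%Y-%m-%d %H:%M:%S"), url

line = ('8.35.201.160 - - [16/May/2018:17:38:21 +0800] '
        '"GET/uc_server/data/avatar/000/01/54/22_avatar_middle.jpg HTTP/1.1" 200 5396')
res = parse_line(line)
```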
