将《MySQL按日周月年统计数据、分组取出前n条数据、结合多表统计数据》中的统计操作,以MaxCompute(原来叫作ODPS)的方式实现。MaxCompute是用来处理量级比较大的数据的,使用方式和MySQL有一些不同。
表的设计和数据和统计目的都和该文中一样(但因为数据创建顺序和该文中不同,所以此文的统计结果和该文的统计结果有所不同),就不赘述了,直接上统计语句,数据准备部分在文章末尾。
按日、周、月、年对数据进行统计
参考地址:日期函数
1.按日、周、月、年统计各游戏分区入口被点击的次数
和MySQL的语句类似,需要换一下使用的函数,在select时加上分区的筛选,并且在MaxCompute中不能直接group by 列别名。
(1) 按日统计
以下语句将日期转化为形如'2021-08-05'这样的字符串:
to_char(created_time, 'yyyy-mm-dd')
将各分区入口的点击数据统计起来:
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
count(if(action='click_a_entrace', true, null)) / count(action) as a_rate,
count(if(action='click_b_entrace', true, null)) / count(action) as b_rate,
count(if(action='click_c_entrace', true, null)) / count(action) as c_rate,
to_char(created_time, 'yyyy-mm-dd') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy-mm-dd');
(2) 按周统计
将日期转化为行为'2021'的字符串:
to_char(created_time, 'yyyy')
拿到日期值位于当年的第几周(bigint类型):
weekofyear(created_time)
将日期和周数结合起来,组成形如2021-22的字符串:
concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time))
查询各分区入口被点击的总数:
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time)) as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time));
(3) 按月统计
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
to_char(created_time, 'yyyy-mm') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy-mm');
(4) 按年统计
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
to_char(created_time, 'yyyy') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy');
2.统计的是人次
还是按日、周、月、年统计各游戏分区入口被点击的数量,但是统计的是点击的人数。
以统计每年点击各区的人数为例:
select
count(distinct if(action='click_a_entrace', user_id, null)) as a_count,
count(distinct if(action='click_b_entrace', user_id, null)) as b_count,
count(distinct if(action='click_c_entrace', user_id, null)) as c_count,
count(distinct user_id) as all_count,
to_char(created_time, 'yyyy') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy');
分组取出每组统计数据的前n条
统计各区参比赛胜率前3的的用户。
(1)先查出每个用户在各个分区的比赛场次和胜率:
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition;
(2)对每个分区胜率进行排序:
select
game_partition,
user_id,
success_rate,
row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition
limit 20000
) as rate_table;
在MaxCompute中order by必须和limit一起使用,除非明确地进行设置:
ORDER BY must be used with a LIMIT clause, please set odps.sql.validate.orderby.limit=false to use it.
(3)取出胜率的前3名
使用行编号小于等于3where partition_row_number <= 3;取出前3名。
select * from (
select
game_partition,
user_id,
success_rate,
row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition
limit 20000
) as rate_table
) as has_row_bumber_table
where partition_row_number <= 3;
结合多个表对数据进行统计
统计胜率前3的用户发起投降的次数。
先从从games_demo表中取出胜率前3的用户:
select
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by user_id
order by success_rate desc
limit 3
然后从user_actions_demo中出用户点击投降的次数,left join到前面查出的结果集中:
select
success_rate_table.user_id,
success_rate_table.success_rate,
click_surrender_table.click_surrender_count
from (
select
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by user_id
order by success_rate desc
limit 3
) as success_rate_table
left join (
select
user_id,
count(1) as click_surrender_count
from user_actions_demo where dt=20210805 and action='click_surrender'
group by user_id
) as click_surrender_table
on success_rate_table.user_id = click_surrender_table.user_id
order by click_surrender_table.click_surrender_count;
这里拿到的用户id是1、3、5,但是MySQL实现中拿到的用户id是1、3、9。这是因为MaxCompute和MySQL的排序结果和limit结果不同。
MySQL的排序结果是这样的:
MaxCompute的排序结果是这样的:
使用limit 3之后,MySQL拿到的是:
MaxCompute拿到的是:
这里其实不严谨,因为光就胜率来说,前面几个人的胜率都是1,user_id为5和者user_id为9的胜率是一样的,谁排在前3都算是正确。只是很奇怪,为什么MySQL没有连续地选择值,而是中间跳了两个。这里先保留疑问,等进一步了解MySQL之后再回头来想想这个问题吧。
数据准备
1.设置数据类型版本为MaxCompute 2.0。
本练习只是对某Session设置数据类型版本:
set odps.sql.type.system.odps2=true;
我理解的这语句的意思是:针对这条语句所在的指令设置数据类型版本,不是针对整个项目空间设置的。
选择不同的数据类型版本,表的数据类型、DML语句和内置函数行为,以及与MaxCompute密切相关的开发组件都有所不同,具体差异看:不同数据类型版本的差异。
2.创建三张表,用户表users_demo、比赛表games_demo、用户行为表user_actions_demo。
(1)创建用户表
-- 创建用户表
create table if not exists users_demo (
id varchar(20) not null comment '用户id',
name varchar(255) not null comment '用户名称',
phone varchar(20) comment '电话号码',
created_time datetime not null comment '创建时间',
modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');
按照创建日期对数据进行分区,这里创建日期dt是形如20210805这样的字符串。
(2)创建比赛表
-- 创建比赛表
create table if not exists games_demo (
id int not null comment '比赛id',
user_id varchar(20) not null comment '用户id',
result varchar(200) not null comment '比赛结果',
game_partition varchar(2) not null comment '比赛分区',
created_time datetime not null comment '创建时间',
modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');
(3)创建日志表
-- 创建日志表
create table if not exists user_actions_demo (
id int not null comment '日志id',
user_id varchar(20) comment '用户id',
action varchar(255) not null comment '用户行为',
created_time datetime not null comment '创建时间',
modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');
3.向表中插入数据
参考地址:插入或覆写数据
(1)添加用户数据
在MySQL中使用的是current_timestamp()拿到当前的日期时间,在MaxCompute中是用getdate()拿当前日期时间。
以下语句拿到当前日期前60个月的日期:
dateadd(getdate(), -60, 'mm')
a.添加分区'20210805':
alter table users_demo add if not exists partition (dt='20210805');
b.向分区中添加数据:
insert into users_demo partition (dt='20210805')
values
('1', '用户1', '13012345671', dateadd(getdate(), -60, 'mm'), dateadd(getdate(), -60, 'mm')),
('2', '用户2', '13012345672', dateadd(getdate(), -60, 'mm'),
...
('10', '用户10', '13012345670', dateadd(getdate(), -60, 'mm'), dateadd(getdate(), -60, 'mm'));
这里有很奇怪的一点,我设置的分区是char(10)是固定长度为10的字符,但是当我以下面这种方式查找分区中的数据时,查出来的结果为空。
select * from users_demo where dt='20210805';
只能以数字的形式查才能查出数据:
select * from users_demo where dt=20210805;
(2)添加比赛数据
a.添加分区'20210805':
alter table games_demo add if not exists partition (dt='20210805');
b.向分区中添加数据:
-- 向比赛表中插入用户比赛的信息
insert into games_demo partition (dt='20210805')
values
(3, '6', 'failed', 'a', '2021-07-16 10:10:10', '2021-07-16 10:10:10')
...
(9, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10');
(3)添加日志数据
a.添加分区'20210805':
alter table user_actions_demo add if not exists partition (dt='20210805');
b.向分区中添加数据:
insert into user_actions_demo partition (dt='20210805')
values
(1, '1', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
(2, '2', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
...
(40, '6', 'click_surrender', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
这里添加数据直接是在ODPS SQL 文件中用SQL语句添加的,如果使用PyODPS文件,用python来写会简洁很多。