MaxCompute 按日周月年统计数据、分组取出前n条数据、结合多表统计数据

1,689 阅读7分钟

《MySQL按日周月年统计数据、分组取出前n条数据、结合多表统计数据》中的统计操作,以MaxCompute(原来叫作ODPS)的方式实现。MaxCompute是用来处理量级比较大的数据的,使用方式和MySQL有一些不同。

表的设计和数据和统计目的都和该文中一样(但因为数据创建顺序和该文中不同,所以此文的统计结果和该文的统计结果有所不同),就不赘述了,直接上统计语句,数据准备部分在文章末尾。

按日、周、月、年对数据进行统计

参考地址:日期函数

1.按日、周、月、年统计各游戏分区入口被点击的次数

和MySQL的语句类似,需要换一下使用的函数,在select时加上分区的筛选,并且在MaxCompute中不能直接group by 列别名

(1) 按日统计

以下语句将日期转化为形如'2021-08-05'这样的字符串:

to_char(created_time, 'yyyy-mm-dd')

将各分区入口的点击数据统计起来:

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    count(if(action='click_a_entrace', true, null)) / count(action) as a_rate,
    count(if(action='click_b_entrace', true, null)) / count(action) as b_rate,
    count(if(action='click_c_entrace', true, null)) / count(action) as c_rate,
    to_char(created_time, 'yyyy-mm-dd') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy-mm-dd');

image-20210805220602028.png

(2) 按周统计

将日期转化为行为'2021'的字符串:

to_char(created_time, 'yyyy')

拿到日期值位于当年的第几周(bigint类型):

weekofyear(created_time)

将日期和周数结合起来,组成形如2021-22的字符串:

concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time))

查询各分区入口被点击的总数:

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time)) as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by concat(to_char(created_time, 'yyyy'), '-', weekofyear(created_time));

image-20210805221641246.png

(3) 按月统计

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    to_char(created_time, 'yyyy-mm') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy-mm');

image-20210805221807729.png

(4) 按年统计

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    to_char(created_time, 'yyyy') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy');

image-20210805221903593.png

2.统计的是人次

还是按日、周、月、年统计各游戏分区入口被点击的数量,但是统计的是点击的人数。

以统计每年点击各区的人数为例:

select
    count(distinct if(action='click_a_entrace', user_id, null)) as a_count,
    count(distinct if(action='click_b_entrace', user_id, null)) as b_count,
    count(distinct if(action='click_c_entrace', user_id, null)) as c_count,
    count(distinct user_id) as all_count,
    to_char(created_time, 'yyyy') as action_date
from user_actions_demo where dt=20210805 and action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by to_char(created_time, 'yyyy');

image-20210805222441004.png

分组取出每组统计数据的前n条

统计各区参比赛胜率前3的的用户。

(1)先查出每个用户在各个分区的比赛场次和胜率:

select
    game_partition,
    user_id,
    count(if(result='succeed', true, null)) as success_count,
    count(result) as all_count,
    count(if(result='succeed', true, null)) / count(result) as success_rate
from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition;

image-20210805223216259.png

(2)对每个分区胜率进行排序:

MaxCompute的窗口函数MySQL的窗口函数

select
    game_partition,
    user_id,
    success_rate,
    row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
  select
    game_partition,
    user_id,
    count(if(result='succeed', true, null)) as success_count,
    count(result) as all_count,
    count(if(result='succeed', true, null)) / count(result) as success_rate
  from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
  group by game_partition, user_id
  order by game_partition
    limit 20000
  ) as rate_table;

在MaxCompute中order by必须和limit一起使用,除非明确地进行设置:

ORDER BY must be used with a LIMIT clause, please set odps.sql.validate.orderby.limit=false to use it.

image-20210805224506622.png

(3)取出胜率的前3名

使用行编号小于等于3where partition_row_number <= 3;取出前3名。

select * from (
    select
        game_partition,
        user_id,
        success_rate,
        row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
    from (
        select
            game_partition,
            user_id,
            count(if(result='succeed', true, null)) as success_count,
            count(result) as all_count,
            count(if(result='succeed', true, null)) / count(result) as success_rate
        from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
        group by game_partition, user_id
        order by game_partition
        limit 20000
        ) as rate_table
    ) as has_row_bumber_table
where partition_row_number <= 3;

image-20210805225114418.png

结合多个表对数据进行统计

统计胜率前3的用户发起投降的次数。

先从从games_demo表中取出胜率前3的用户:

select
      user_id,
      count(if(result='succeed', true, null)) as success_count,
      count(result) as all_count,
      count(if(result='succeed', true, null)) / count(result) as success_rate
  from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
  group by user_id 
  order by success_rate desc
  limit 3

然后从user_actions_demo中出用户点击投降的次数,left join到前面查出的结果集中:

select 
    success_rate_table.user_id, 
    success_rate_table.success_rate, 
    click_surrender_table.click_surrender_count
from (
    select
        user_id,
        count(if(result='succeed', true, null)) as success_count,
        count(result) as all_count,
        count(if(result='succeed', true, null)) / count(result) as success_rate
    from games_demo where dt=20210805 and created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
    group by user_id 
    order by success_rate desc
    limit 3
) as success_rate_table
left join (
  select 
    user_id,
    count(1) as click_surrender_count
 from user_actions_demo where dt=20210805 and action='click_surrender'
 group by user_id
) as click_surrender_table
on success_rate_table.user_id = click_surrender_table.user_id
order by click_surrender_table.click_surrender_count;

image-20210805225750528.png

这里拿到的用户id是1、3、5,但是MySQL实现中拿到的用户id是1、3、9。这是因为MaxCompute和MySQL的排序结果和limit结果不同。

MySQL的排序结果是这样的:

image-20210805230430476.png

MaxCompute的排序结果是这样的:

image-20210805230246763.png

使用limit 3之后,MySQL拿到的是:

image-20210805230619995.png

MaxCompute拿到的是:

image-20210805230704134.png

这里其实不严谨,因为光就胜率来说,前面几个人的胜率都是1,user_id为5和者user_id9的胜率是一样的,谁排在前3都算是正确。只是很奇怪,为什么MySQL没有连续地选择值,而是中间跳了两个。这里先保留疑问,等进一步了解MySQL之后再回头来想想这个问题吧。

数据准备

1.设置数据类型版本为MaxCompute 2.0。

本练习只是对某Session设置数据类型版本:

set odps.sql.type.system.odps2=true;

我理解的这语句的意思是:针对这条语句所在的指令设置数据类型版本,不是针对整个项目空间设置的。

选择不同的数据类型版本,表的数据类型、DML语句和内置函数行为,以及与MaxCompute密切相关的开发组件都有所不同,具体差异看:不同数据类型版本的差异

2.创建三张表,用户表users_demo、比赛表games_demo、用户行为表user_actions_demo

参考地址:分区表操作分区和列操作2.0数据类型版本

(1)创建用户表

-- 创建用户表
create table if not exists users_demo (
    id varchar(20) not null comment '用户id',
    name varchar(255) not null comment '用户名称',
    phone varchar(20) comment '电话号码',
    created_time datetime not null comment '创建时间',
    modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');

按照创建日期对数据进行分区,这里创建日期dt是形如20210805这样的字符串。

(2)创建比赛表

-- 创建比赛表
create table if not exists games_demo (
    id int not null comment '比赛id',
    user_id varchar(20) not null comment '用户id',
    result varchar(200) not null comment '比赛结果',
    game_partition varchar(2) not null comment '比赛分区',
    created_time datetime not null comment '创建时间',
    modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');

(3)创建日志表

-- 创建日志表
create table if not exists user_actions_demo (
    id int not null comment '日志id',
    user_id varchar(20) comment '用户id',
    action varchar(255) not null comment '用户行为',
    created_time datetime not null comment '创建时间',
    modified_time datetime not null comment '修改时间'
)
partitioned by (dt char(10) comment '创建日期');

3.向表中插入数据

参考地址:插入或覆写数据

(1)添加用户数据

在MySQL中使用的是current_timestamp()拿到当前的日期时间,在MaxCompute中是用getdate()拿当前日期时间。

以下语句拿到当前日期前60个月的日期:

dateadd(getdate(), -60, 'mm')

a.添加分区'20210805'

alter table users_demo add if not exists partition (dt='20210805');

b.向分区中添加数据:

insert into users_demo partition (dt='20210805') 
values
    ('1', '用户1', '13012345671', dateadd(getdate(), -60, 'mm'), dateadd(getdate(), -60, 'mm')),
    ('2', '用户2', '13012345672', dateadd(getdate(), -60, 'mm'), 
    ...
    ('10', '用户10', '13012345670', dateadd(getdate(), -60, 'mm'), dateadd(getdate(), -60, 'mm'));

这里有很奇怪的一点,我设置的分区是char(10)是固定长度为10的字符,但是当我以下面这种方式查找分区中的数据时,查出来的结果为空。

select * from users_demo where dt='20210805';

只能以数字的形式查才能查出数据:

select * from users_demo where dt=20210805;

(2)添加比赛数据

a.添加分区'20210805'

alter table games_demo add if not exists partition (dt='20210805');

b.向分区中添加数据:

-- 向比赛表中插入用户比赛的信息
insert into games_demo partition (dt='20210805') 
values
    (3, '6', 'failed', 'a', '2021-07-16 10:10:10', '2021-07-16 10:10:10')
    ...
    (9, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10');

(3)添加日志数据

a.添加分区'20210805'

alter table user_actions_demo add if not exists partition (dt='20210805');

b.向分区中添加数据:

insert into user_actions_demo partition (dt='20210805') 
values
    (1, '1', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
    (2, '2', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
    ...
    (40, '6', 'click_surrender', '2020-01-01 10:10:10', '2020-01-01 10:10:10');

这里添加数据直接是在ODPS SQL 文件中用SQL语句添加的,如果使用PyODPS文件,用python来写会简洁很多。

源码地址

github.com/renmo/myBlo…