本文主要总结下以下统计:
1.按日、周、月、年对数据进行统计
2.分组取出每组统计数据的前n条
3.结合多个表对数据进行统计
为了练习,设计了几张表,放入了一些简单的数据(不使用外键)。此处脑补的是游戏业务的场景,其他业务场景稍微变形一下也能适用的。
假设此游戏有A、B、C三个区,每个区以不同的方式进行比赛,有胜负。有一个用户表,用户日志表,用户比赛表。
按日、周、月、年对数据进行统计
主要用到date_format对日期的格式进行整理。
主要用到count(expr)函数进行统计,统计expr
表示的值不为空(null
)的行的数量。比如count(action)
,列action
的值不为空的行就会被计数。
1.按日、周、月、年统计各游戏分区入口被点击的次数
(1)按日统计
将时间整理成日期格式,按照日期统计对应的action
的数量。
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
date_format(created_time, '%Y-%m-%d') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
count(if(action='click_a_entrace', true, null)) / count(action) as a_rate,
count(if(action='click_b_entrace', true, null)) / count(action) as b_rate,
count(if(action='click_c_entrace', true, null)) / count(action) as c_rate,
date_format(created_time, '%Y-%m-%d') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
(统计周、月、年的action的比例也是同样的,就不再重复得到比例的代码了)
(2)按周统计
和按日统计相同,只是date_format
整理出的是周的格式。
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
date_format(created_time, '%Y-%u') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
%u
(00-53) 是一年中的星期数,星期一是一周的第一天。
(3)按月统计
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
date_format(created_time, '%Y-%m') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
(4)按年统计
select
count(if(action='click_a_entrace', true, null)) as a_count,
count(if(action='click_b_entrace', true, null)) as b_count,
count(if(action='click_c_entrace', true, null)) as c_count,
count(action) as all_count,
date_format(created_time, '%Y') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
如果要按照年份从小到大的顺序排序,使用order by action_date
。
select
count(if(action='click_a_entrace', true, null)) as a_count,
...
group by action_date
order by action_date;
2.统计的是人次
还是按日、周、月、年统计各游戏分区入口被点击的数量,但是统计的是点击的人数。
以统计每年点击各区的人数为例:
select
count(distinct if(action='click_a_entrace', user_id, null)) as a_count,
count(distinct if(action='click_b_entrace', user_id, null)) as b_count,
count(distinct if(action='click_c_entrace', user_id, null)) as c_count,
count(distinct user_id) as all_count,
date_format(created_time, '%Y') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
按照日期进行分组,统计各日期组中,行为为click_a_entrace
、click_b_entrace
、click_c_entrace
的不同的用户。
查询结果为:
====== 不重要内容分割线-start ======
这一小块内容不重要。
只是为了说明练习过程中的数据和下文“练习过程”部分的数据表格相比是有些微不同的。 为了看出效果,和下文“练习过程”部分的数据表格相比,这里手动添加了一条数据:
insert into user_actions(id, user_id, action, created_time, modified_time)
values
(41, '6', 'click_c_entrace', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
这样在2020-01-01,用户6
点击了C区
入口2
次,2020点击C区的人数应该还是2
(对照下文“练习过程”部分的数据表格能很明显地得到这个结论)。
这里可以看见2020
年C
区入口的人数为2,是正确的。
====== 不重要内容分割线-end ======
分组取出每组统计数据的前n条
1.统计各区参比赛胜率前3的的用户
统计数据的范围为2020-01-01到2021-07-17。
按照游戏区域对数据进行分区,对分区的行进行排序,取前3行。
(1)先查出每个用户在各个分区的比赛场次和胜率:
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition;
要记住,MySQL的处理顺序是from
-> where
-> select
-> group by
-> having
-> order by
-> limit
。
结果如下:
(2)对每个分区胜率进行排序:
用到了window functions,窗口函数和统计函数类似,不同的是统计函数将统计结果写为了一行,但是窗口函数是将统计结果写到各行中,并没有把所有的行合并为一行。
select
game_partition,
user_id,
success_rate,
row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition
) as rate_table;
得到的结果是这样的,按照游戏分区game_partition
对数据进行了划分,对于每一个游戏分区,按照success_rate
从高到低进行排序,并使用row_number()
得到了对于每一个游戏分区数据的行编号。
(3)取出胜率的前3名
使用行编号小于等于3where partition_row_number <= 3;
取出前3名。
select * from (
select
game_partition,
user_id,
success_rate,
row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
select
game_partition,
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition
) as rate_table
) as has_row_bumber_table
where partition_row_number <= 3;
====== 不重要内容分割线-start ======
这一小块内容不重要。
只是为了说明练习过程中的数据和下文和下文“练习过程”部分的数据表格相比是有些微不同的。
为了看出效果,和下文“练习过程”部分的数据表格相比,手动添加了一条数据:
insert into games(id, user_id, result, game_partition, created_time, modified_time)
values
(9, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10');
这样用户1
在A区
进行的比赛应该是2
场,且都胜了。
====== 不重要内容分割线-end ======
结合多个表对数据进行统计
1.统计胜率前3的用户发起投降的次数
这个统计设计了用户行为表user_actions
表和比赛表games
。这里就不按照每个游戏分区统计胜率了,直接按照所有数据统计胜率。
(1)从games
表中取出胜率前3的用户
select
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by user_id
order by success_rate desc
limit 3;
(2)从user_actions
中出用户点击投降的次数,left join 到前面查出的结果集中
select
success_rate_table.user_id,
success_rate_table.success_rate,
click_surrender_table.click_surrender_count
from (
select
user_id,
count(if(result='succeed', true, null)) as success_count,
count(result) as all_count,
count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by user_id
order by success_rate desc
limit 3
) as success_rate_table
left join (
select
user_id,
count(1) as click_surrender_count
from user_actions where action='click_surrender'
group by user_id
) as click_surrender_table
on success_rate_table.user_id = click_surrender_table.user_id
order by click_surrender_table.click_surrender_count;
数据准备
练习数据的创建过程:
1.创建数据库并使用该数据库
create database if not exists game_demo;
use game_demo;
2.创建所需的用户表、比赛表、日志表
-- 创建用户表
create table if not exists users (
id varchar(20) primary key,
name varchar(255) not null,
phone varchar(20),
created_time datetime not null,
modified_time datetime not null
);
-- 创建比赛表
create table if not exists games (
id int,
user_id varchar(20),
result varchar(200) not null,
game_partition varchar(2) not null,
created_time datetime not null,
modified_time datetime not null,
primary key(id, user_id)
);
-- 创建日志表
create table if not exists user_actions (
id int primary key auto_increment,
user_id varchar(20),
action varchar(255) not null,
created_time datetime not null,
modified_time datetime not null
);
查看创建好的表:show tables;
。
查看某个表的信息: describe user_actions
。
一个比赛id
会有多个不同的user_id
,因为一场比赛中会有多个用户。
3.向表中插入数据
按照下图向表中插入数据,为了方便看出结果,尽量简化了用户行为。
(1)向用户表中插入数据
-- 向用户表中添加10个用户
insert into users(id, name, phone, created_time, modified_time)
values
('1', '用户1', '13012345671', current_timestamp(), current_timestamp()),
...
('10', '用户10', '13012345670', current_timestamp(), current_timestamp());
(2)向用户行为表中插入数据
各action表示的意义:click_surrender
点击投降、click_a_entrace
点击A区入口、click_b_entrace
点击B区入口、click_c_entrace
点击C区入口。
-- 向用户行为表中插入点击入口数据
insert into user_actions(id, user_id, action, created_time, modified_time)
values
(1, '1', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
(2, '2', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
...
(16, '6', 'click_c_entrace', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
-- 向用户行为表中插入发起投降数据
insert into user_actions(id, user_id, action, created_time, modified_time)
values
(17, '1', 'click_surrender', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
(18, '2', 'click_surrender', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
...
(40, '6', 'click_surrender', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
(3)向比赛表中插入数据
result
为succeed
表示成功,为failed
表示失败。只有成功和失败,没有平局。
-- 向比赛表中插入用户比赛的信息
insert into games(id, user_id, result, game_partition, created_time, modified_time)
values
(1, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
(1, '2', 'failed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
...
(8, '5', 'succeed', 'c', '2020-01-01 10:10:10', '2020-01-01 10:10:10'),
(8, '6', 'failed', 'c', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
数据创建好之后,就按照上文描述的方式对数据进行查询。