MySQL按日周月年统计数据、分组取出前n条数据、结合多表统计数据

2,309 阅读8分钟

本文主要总结下以下统计:

1.按日、周、月、年对数据进行统计

2.分组取出每组统计数据的前n条

3.结合多个表对数据进行统计

为了练习,设计了几张表,放入了一些简单的数据(不使用外键)。此处脑补的是游戏业务的场景,其他业务场景稍微变形一下也能适用的。

假设此游戏有A、B、C三个区,每个区以不同的方式进行比赛,有胜负。有一个用户表,用户日志表,用户比赛表。

按日、周、月、年对数据进行统计

主要用到date_format对日期的格式进行整理。

主要用到count(expr)函数进行统计,统计expr表示的值不为空(null)的行的数量。比如count(action),列action的值不为空的行就会被计数。

1.按日、周、月、年统计各游戏分区入口被点击的次数

(1)按日统计

将时间整理成日期格式,按照日期统计对应的action的数量。

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    date_format(created_time, '%Y-%m-%d') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;
以上每天各区被点击的次数,以及总的点击次数,如果要统计出比例,就这样写:
select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    count(if(action='click_a_entrace', true, null)) / count(action) as a_rate,
    count(if(action='click_b_entrace', true, null)) / count(action) as b_rate,
    count(if(action='click_c_entrace', true, null)) / count(action) as c_rate,
    date_format(created_time, '%Y-%m-%d') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;

(统计周、月、年的action的比例也是同样的,就不再重复得到比例的代码了)

(2)按周统计

和按日统计相同,只是date_format整理出的是周的格式。

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    date_format(created_time, '%Y-%u') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;

%u (00-53) 是一年中的星期数,星期一是一周的第一天。

(3)按月统计

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    date_format(created_time, '%Y-%m') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;

(4)按年统计

select
    count(if(action='click_a_entrace', true, null)) as a_count,
    count(if(action='click_b_entrace', true, null)) as b_count,
    count(if(action='click_c_entrace', true, null)) as c_count,
    count(action) as all_count,
    date_format(created_time, '%Y') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;

如果要按照年份从小到大的顺序排序,使用order by action_date

select
    count(if(action='click_a_entrace', true, null)) as a_count,
...
group by action_date
order by action_date;

2.统计的是人次

还是按日、周、月、年统计各游戏分区入口被点击的数量,但是统计的是点击的人数。

以统计每年点击各区的人数为例:

select
    count(distinct if(action='click_a_entrace', user_id, null)) as a_count,
    count(distinct if(action='click_b_entrace', user_id, null)) as b_count,
    count(distinct if(action='click_c_entrace', user_id, null)) as c_count,
    count(distinct user_id) as all_count,
    date_format(created_time, '%Y') as action_date
from user_actions where action in ('click_a_entrace', 'click_b_entrace', 'click_c_entrace')
group by action_date;

按照日期进行分组,统计各日期组中,行为为click_a_entraceclick_b_entraceclick_c_entrace的不同的用户。

查询结果为:

乍一看可能会觉得奇怪,`4 + 4 + 4 > 10`了,但是稍微一想就会明白是因为有同样的用户点击了不同的区,是在不同区中去重的用户。

====== 不重要内容分割线-start ======

这一小块内容不重要。

只是为了说明练习过程中的数据和下文“练习过程”部分的数据表格相比是有些微不同的。 为了看出效果,和下文“练习过程”部分的数据表格相比,这里手动添加了一条数据:

insert into user_actions(id, user_id, action, created_time, modified_time)
    values
(41, '6', 'click_c_entrace', '2020-01-01 10:10:10', '2020-01-01 10:10:10');

这样在2020-01-01,用户6点击了C区入口2次,2020点击C区的人数应该还是2(对照下文“练习过程”部分的数据表格能很明显地得到这个结论)。

这里可以看见2020C区入口的人数为2,是正确的。

====== 不重要内容分割线-end ======

分组取出每组统计数据的前n条

1.统计各区参比赛胜率前3的的用户

统计数据的范围为2020-01-01到2021-07-17。

按照游戏区域对数据进行分区,对分区的行进行排序,取前3行。

(1)先查出每个用户在各个分区的比赛场次和胜率:

select
    game_partition,
    user_id,
    count(if(result='succeed', true, null)) as success_count,
    count(result) as all_count,
    count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by game_partition, user_id
order by game_partition;

要记住,MySQL的处理顺序是from -> where -> select -> group by -> having -> order by -> limit

结果如下:

(2)对每个分区胜率进行排序:

用到了window functions窗口函数和统计函数类似,不同的是统计函数将统计结果写为了一行,但是窗口函数是将统计结果写到各行中,并没有把所有的行合并为一行。

select
    game_partition,
    user_id,
    success_rate,
    row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
from (
	select
		game_partition,
		user_id,
		count(if(result='succeed', true, null)) as success_count,
		count(result) as all_count,
		count(if(result='succeed', true, null)) / count(result) as success_rate
	from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
	group by game_partition, user_id
	order by game_partition
	) as rate_table;

得到的结果是这样的,按照游戏分区game_partition对数据进行了划分,对于每一个游戏分区,按照success_rate从高到低进行排序,并使用row_number()得到了对于每一个游戏分区数据的行编号。

(3)取出胜率的前3名

使用行编号小于等于3where partition_row_number <= 3;取出前3名。

select * from (
	select
		game_partition,
		user_id,
		success_rate,
		row_number() over (partition by game_partition order by success_rate desc) as partition_row_number
	from (
		select
			game_partition,
			user_id,
			count(if(result='succeed', true, null)) as success_count,
			count(result) as all_count,
			count(if(result='succeed', true, null)) / count(result) as success_rate
		from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
		group by game_partition, user_id
		order by game_partition
		) as rate_table
	) as has_row_bumber_table
where partition_row_number <= 3;

====== 不重要内容分割线-start ======

这一小块内容不重要。

只是为了说明练习过程中的数据和下文和下文“练习过程”部分的数据表格相比是有些微不同的。

为了看出效果,和下文“练习过程”部分的数据表格相比,手动添加了一条数据:

insert into games(id, user_id, result, game_partition, created_time, modified_time)
values
  (9, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10');

这样用户1A区进行的比赛应该是2场,且都胜了。

====== 不重要内容分割线-end ======

结合多个表对数据进行统计

1.统计胜率前3的用户发起投降的次数

这个统计设计了用户行为表user_actions表和比赛表games。这里就不按照每个游戏分区统计胜率了,直接按照所有数据统计胜率。

(1)从games表中取出胜率前3的用户

select
    user_id,
    count(if(result='succeed', true, null)) as success_count,
    count(result) as all_count,
    count(if(result='succeed', true, null)) / count(result) as success_rate
from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
group by user_id 
order by success_rate desc
limit 3;

(2)从user_actions中出用户点击投降的次数,left join 到前面查出的结果集中

select 
    success_rate_table.user_id, 
    success_rate_table.success_rate, 
    click_surrender_table.click_surrender_count
from (
  select
     user_id,
     count(if(result='succeed', true, null)) as success_count,
     count(result) as all_count,
     count(if(result='succeed', true, null)) / count(result) as success_rate
  from games where created_time >= '2020-01-01 00:00:00' and created_time < '2021-07-17 23:59:59'
  group by user_id 
  order by success_rate desc
  limit 3
) as success_rate_table
left join (
  select 
    user_id,
    count(1) as click_surrender_count
 from user_actions where action='click_surrender'
 group by user_id
) as click_surrender_table
on success_rate_table.user_id = click_surrender_table.user_id
order by click_surrender_table.click_surrender_count;

数据准备

练习数据的创建过程:

1.创建数据库并使用该数据库

create database if not exists game_demo;

use game_demo;

2.创建所需的用户表、比赛表、日志表

-- 创建用户表
create table if not exists users (
    id varchar(20) primary key,
    name varchar(255) not null,
    phone varchar(20),
    created_time datetime not null,
    modified_time datetime not null
);
​
-- 创建比赛表
create table if not exists games (
    id int,
    user_id varchar(20),
    result varchar(200) not null,
    game_partition varchar(2) not null,
    created_time datetime not null,
    modified_time datetime not null,
    primary key(id, user_id)
);
​
-- 创建日志表
create table if not exists user_actions (
    id int primary key auto_increment,
    user_id varchar(20),
    action varchar(255) not null,
    created_time datetime not null,
    modified_time datetime not null
);

查看创建好的表:show tables;

查看某个表的信息: describe user_actions

一个比赛id会有多个不同的user_id,因为一场比赛中会有多个用户。

3.向表中插入数据

按照下图向表中插入数据,为了方便看出结果,尽量简化了用户行为。

(1)向用户表中插入数据

-- 向用户表中添加10个用户
insert into users(id, name, phone, created_time, modified_time)
values 
('1', '用户1', '13012345671', current_timestamp(), current_timestamp()),
...
('10', '用户10', '13012345670', current_timestamp(), current_timestamp());

(2)向用户行为表中插入数据

各action表示的意义:click_surrender 点击投降、click_a_entrace 点击A区入口、click_b_entrace 点击B区入口、click_c_entrace 点击C区入口。

-- 向用户行为表中插入点击入口数据
insert into user_actions(id, user_id, action, created_time, modified_time)
values
  (1, '1', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  (2, '2', 'click_a_entrace', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  ...
  (16, '6', 'click_c_entrace', '2020-01-01 10:10:10', '2020-01-01 10:10:10');
 
 -- 向用户行为表中插入发起投降数据
insert into user_actions(id, user_id, action, created_time, modified_time)
values
  (17, '1', 'click_surrender', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  (18, '2', 'click_surrender', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  ...
  (40, '6', 'click_surrender', '2020-01-01 10:10:10', '2020-01-01 10:10:10');

(3)向比赛表中插入数据

resultsucceed表示成功,为failed表示失败。只有成功和失败,没有平局。

-- 向比赛表中插入用户比赛的信息
insert into games(id, user_id, result, game_partition, created_time, modified_time)
values
  (1, '1', 'succeed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  (1, '2', 'failed', 'a', '2021-07-17 10:10:10', '2021-07-17 10:10:10'),
  ...
  (8, '5', 'succeed', 'c', '2020-01-01 10:10:10', '2020-01-01 10:10:10'),
    (8, '6', 'failed', 'c', '2020-01-01 10:10:10', '2020-01-01 10:10:10');

数据创建好之后,就按照上文描述的方式对数据进行查询。

其他

源码地址