窗口函数介绍

MySQL从8.0版本开始支持窗口函数，所谓窗口函数，就是为每行数据开辟一个窗口，然后在窗口的范围内对数据进行运算，并返回运算的结果给该行。
比如，当你在处理数据库中的数据时，你可能需要使用聚合函数来计算一些汇总值，比如总和、平均值、最大值和最小值等等。这些聚合函数通常是用在GROUP BY语句中。

但是，如果你想要在结果集中同时获取原始数据和聚合数据，你就需要使用窗口函数。窗口函数可以让你在查询结果中计算聚合函数，同时不破坏原始数据的行结构。这意味着你可以对原始数据进行排序、筛选、分组等操作，同时计算聚合数据，从而得到更加灵活的结果集。

思考

从上面的简介则需要了解到，会对窗口函数提出一下问题：

怎么样定义一个窗口？
在窗口中可以进行那些操作？
窗口函数有那些应用场景？

如何定义一个窗口函数？

首先，看一下窗口函数的基本语法：

window_function_name(expression) 
    OVER (
        [partition_defintion]
        [order_definition]
        [frame_definition]
    )

名称	说明	备注
window_function_name(expression)	窗口函数的名称，指定要对窗口中的数据执行什么操作
OVER	窗口函数的窗口定义部分，当中包含分区定义、排序定义、帧（分区的自己）定义
[partition_defintion]	用于定义分区的字段，可以省略，类似于group by，定义后，会根据partition_definition对数据进行分区
[order_definition]	定义分区内部的数据排序规则
[frame_definition]	定义分区数据的子集，以当前行为基准，定义一个数据子集（很抽象，实际上就是以当前行为参照，可以操作分区中其他行的数据，比如：当前行的上一行、下一行等等）

窗口函数能做什么？

试想一个场景，比如，我们有一个学生成绩表，要按照班级对学生的成绩进行排序。并且返回每个学生的名次。那么该如何使用窗口函数呢？，先定义学生表及插入相关数据：

create table transcripts(
    id int primary key auto_increment,
    name char(20),
    scores float,
    class char(30)
)engine=InnoDB character set utf8 COLLATE utf8_general_ci;

insert into transcripts(name,scores,class) values ("张三01",85.5,"大数据01");
insert into transcripts(name,scores,class) values ("张三02",80.5,"大数据01");
insert into transcripts(name,scores,class) values ("张三03",65.5,"大数据01");
insert into transcripts(name,scores,class) values ("张三04",55.5,"大数据01");
insert into transcripts(name,scores,class) values ("张三05",95.5,"大数据01");
insert into transcripts(name,scores,class) values ("张三06",75.5,"大数据01");
insert into transcripts(name,scores,class) values ("王二01",85.5,"计算机01");
insert into transcripts(name,scores,class) values ("王二02",80.5,"计算机01");
insert into transcripts(name,scores,class) values ("王二03",65.5,"计算机01");
insert into transcripts(name,scores,class) values ("王二04",55.5,"计算机01");
insert into transcripts(name,scores,class) values ("王二05",95.5,"计算机01");
insert into transcripts(name,scores,class) values ("王二06",75.5,"计算机01");

若使用窗口函数来解决上述问题，则语法如下：

select
    *,
    rank() over(
    partition by class
    order by scores
    rows between unbounded preceding and unbounded following
) as "rank"
from
    transcripts;

返回结果如下：

+----+--------+--------+----------+------+
| id | name   | scores | class    | rank |
+----+--------+--------+----------+------+
|  4 | 张三04 |   55.5 | 大数据01 |    1 |
|  3 | 张三03 |   65.5 | 大数据01 |    2 |
|  6 | 张三06 |   75.5 | 大数据01 |    3 |
| 13 | 张三07 |   75.5 | 大数据01 |    3 |
|  2 | 张三02 |   80.5 | 大数据01 |    5 |
|  1 | 张三01 |   85.5 | 大数据01 |    6 |
|  5 | 张三05 |   95.5 | 大数据01 |    7 |
| 10 | 王二04 |   55.5 | 计算机01 |    1 |
|  9 | 王二03 |   65.5 | 计算机01 |    2 |
| 12 | 王二06 |   75.5 | 计算机01 |    3 |
|  8 | 王二02 |   80.5 | 计算机01 |    4 |
|  7 | 王二01 |   85.5 | 计算机01 |    5 |
| 14 | 王二07 |   85.5 | 计算机01 |    5 |
| 11 | 王二05 |   95.5 | 计算机01 |    7 |
+----+--------+--------+----------+------+
14 rows in set (0.03 sec)

在窗口函数部分，调用了来窗口函数rank，rank函数会放回当前数据在窗口中的排序，其中，

over关键字声明了窗口定义的信息，
partition by 指定了按照class进行分组，o
rder by 指定了窗口内的数据按照score排序，
rows rows between unbounded preceding and unbounded following 指定了在对当前数据进行操作的时候，定义的数据子集（该语法在这个窗口函数中没有用到，因为这个时候我们不需要这个子集，通常会用到聚合函数中。）

专用窗口函数

可作用于窗口操作的函数包括专用窗口函数、聚合函数，常见的专用窗口函数包括：rank、dense_rank、row_number、lag、lead;

rank

rank函数对当前数据进行排序的时候，会将重复的数值排序为同一个名次，且会跳过重复名次所占据的位置。比如在上例中，大数据01的查询结果会跳过第四名，且会有两个人占据第三个名次。


+----+--------+--------+----------+------+
| id | name   | scores | class    | rank |
+----+--------+--------+----------+------+
|  4 | 张三04 |   55.5 | 大数据01 |    1 |
|  3 | 张三03 |   65.5 | 大数据01 |    2 |
|  6 | 张三06 |   75.5 | 大数据01 |    3 |
| 13 | 张三07 |   75.5 | 大数据01 |    3 |
|  2 | 张三02 |   80.5 | 大数据01 |    5 |
|  1 | 张三01 |   85.5 | 大数据01 |    6 |
|  5 | 张三05 |   95.5 | 大数据01 |    7 |
| 10 | 王二04 |   55.5 | 计算机01 |    1 |
|  9 | 王二03 |   65.5 | 计算机01 |    2 |
| 12 | 王二06 |   75.5 | 计算机01 |    3 |
|  8 | 王二02 |   80.5 | 计算机01 |    4 |
|  7 | 王二01 |   85.5 | 计算机01 |    5 |
| 14 | 王二07 |   85.5 | 计算机01 |    5 |
| 11 | 王二05 |   95.5 | 计算机01 |    7 |
+----+--------+--------+----------+------+
14 rows in set (0.00 sec)

dense_rank

dense_rank与rank类似，但是不会因为重复的名次而跳过相应的数值。如下：

select
    *,
    dense_rank() over(
    partition by class
    order by scores
    rows between unbounded preceding and unbounded following
) as "dense_rank"
from
    transcripts;

结果如下：

+----+--------+--------+----------+------------+
| id | name   | scores | class    | dense_rank |
+----+--------+--------+----------+------------+
|  4 | 张三04 |   55.5 | 大数据01 |          1 |
|  3 | 张三03 |   65.5 | 大数据01 |          2 |
|  6 | 张三06 |   75.5 | 大数据01 |          3 |
| 13 | 张三07 |   75.5 | 大数据01 |          3 |
|  2 | 张三02 |   80.5 | 大数据01 |          4 |
|  1 | 张三01 |   85.5 | 大数据01 |          5 |
|  5 | 张三05 |   95.5 | 大数据01 |          6 |
| 10 | 王二04 |   55.5 | 计算机01 |          1 |
|  9 | 王二03 |   65.5 | 计算机01 |          2 |
| 12 | 王二06 |   75.5 | 计算机01 |          3 |
|  8 | 王二02 |   80.5 | 计算机01 |          4 |
|  7 | 王二01 |   85.5 | 计算机01 |          5 |
| 14 | 王二07 |   85.5 | 计算机01 |          5 |
| 11 | 王二05 |   95.5 | 计算机01 |          6 |
+----+--------+--------+----------+------------+
14 rows in set (0.00 sec)

此时，大数据01不会跳过数值4；

row_number

与rank类似，但是不会给相同的数据相同的名次，只会按顺序给相应的数值。

select
    *,
    row_number() over(
    partition by class
    order by scores
    rows between unbounded preceding and unbounded following
) as "rank"
from
    transcripts;

结果如下：

+----+--------+--------+----------+------+
| id | name   | scores | class    | rank |
+----+--------+--------+----------+------+
|  4 | 张三04 |   55.5 | 大数据01 |    1 |
|  3 | 张三03 |   65.5 | 大数据01 |    2 |
|  6 | 张三06 |   75.5 | 大数据01 |    3 |
| 13 | 张三07 |   75.5 | 大数据01 |    4 |
|  2 | 张三02 |   80.5 | 大数据01 |    5 |
|  1 | 张三01 |   85.5 | 大数据01 |    6 |
|  5 | 张三05 |   95.5 | 大数据01 |    7 |
| 10 | 王二04 |   55.5 | 计算机01 |    1 |
|  9 | 王二03 |   65.5 | 计算机01 |    2 |
| 12 | 王二06 |   75.5 | 计算机01 |    3 |
|  8 | 王二02 |   80.5 | 计算机01 |    4 |
|  7 | 王二01 |   85.5 | 计算机01 |    5 |
| 14 | 王二07 |   85.5 | 计算机01 |    6 |
| 11 | 王二05 |   95.5 | 计算机01 |    7 |
+----+--------+--------+----------+------+
14 rows in set (0.00 sec)

lag/lead

lag有延迟，落后之意，在窗口中，也就是落后于当前数据的数据（这里的落后，就是指窗口中当前数据之前的数据）
lead则是领先之意，在窗口函数中，也就是在领先于当前数据的数据。
lag/lead函数有三个参数，第一个参数也是一个表达式，第二个参数是偏移量，也就是一以当前函数为基准，返回之前的第多少条，在三个参数是在当前数据之前没有找到数据时的返回值。

下面的sql语句，会返回上一条的分数*100，在没有上一条数据时，返回0。

select
    *,
    lag(scores*100,1,0) over(
    partition by class
    order by scores
    rows between unbounded preceding and unbounded following
) as "scores_before*100"
from
    transcripts;

结果如下：


+----+--------+--------+----------+-------------------+
| id | name   | scores | class    | scores_before*100 |
+----+--------+--------+----------+-------------------+
|  4 | 张三04 |   55.5 | 大数据01 |                 0 |
|  3 | 张三03 |   65.5 | 大数据01 |              5550 |
|  6 | 张三06 |   75.5 | 大数据01 |              6550 |
| 13 | 张三07 |   75.5 | 大数据01 |              7550 |
|  2 | 张三02 |   80.5 | 大数据01 |              7550 |
|  1 | 张三01 |   85.5 | 大数据01 |              8050 |
|  5 | 张三05 |   95.5 | 大数据01 |              8550 |
| 10 | 王二04 |   55.5 | 计算机01 |                 0 |
|  9 | 王二03 |   65.5 | 计算机01 |              5550 |
| 12 | 王二06 |   75.5 | 计算机01 |              6550 |
|  8 | 王二02 |   80.5 | 计算机01 |              7550 |
|  7 | 王二01 |   85.5 | 计算机01 |              8050 |
| 14 | 王二07 |   85.5 | 计算机01 |              8550 |
| 11 | 王二05 |   95.5 | 计算机01 |              8550 |
+----+--------+--------+----------+-------------------+
14 rows in set (0.00 sec)

聚合函数

聚合函数也可以用于窗口函数，与上面不同的时，他可以作用于在窗口函数中，以当前数据为基准，定义的数据子集。
常见的聚合函数有：count、sum、avg、min、max等。

比如，对分数进行从小到大排序后，计算周期为3的移动平均数，可以使用如下的sql：

select
    *,
    avg(scores) over(
    partition by class
    order by scores
    rows between 3 preceding and current row
) as "scores_before*100"
from
    transcripts;

结果如下：

+----+--------+--------+----------+-------------------+
| id | name   | scores | class    | scores_before*100 |
+----+--------+--------+----------+-------------------+
|  4 | 张三04 |   55.5 | 大数据01 |              55.5 |
|  3 | 张三03 |   65.5 | 大数据01 |              60.5 |
|  6 | 张三06 |   75.5 | 大数据01 |              65.5 |
| 13 | 张三07 |   75.5 | 大数据01 |                68 |
|  2 | 张三02 |   80.5 | 大数据01 |             74.25 |
|  1 | 张三01 |   85.5 | 大数据01 |             79.25 |
|  5 | 张三05 |   95.5 | 大数据01 |             84.25 |
| 10 | 王二04 |   55.5 | 计算机01 |              55.5 |
|  9 | 王二03 |   65.5 | 计算机01 |              60.5 |
| 12 | 王二06 |   75.5 | 计算机01 |              65.5 |
|  8 | 王二02 |   80.5 | 计算机01 |             69.25 |
|  7 | 王二01 |   85.5 | 计算机01 |             76.75 |
| 14 | 王二07 |   85.5 | 计算机01 |             81.75 |
| 11 | 王二05 |   95.5 | 计算机01 |             86.75 |
+----+--------+--------+----------+-------------------+
14 rows in set (0.00 sec)

窗口函数的其他应用场景

连续满足某个条件的问题的处理

力扣：601. 体育馆的人流量 - 力扣（LeetCode）编写一个 SQL 查询以找出每行的人数大于或等于 100 且 id 连续的三行或更多行记录。

使用where语句，筛选出人流量大于100的数据；
开辟一个窗口，在窗口中对id进行排序，使用row_number获得一个连续的序列；
使用id和row_number的结果做差值，若是id连续，则差值必定相同，差值命名为 "diff"，如下图：

id	row_number	diff
1	1	0
2	2	0
3	3	0
5	4	1
6	5	1
10	5	5
11	5	5

使用"diff"划分窗口，使用窗口函数count()计算总数，返回总数大于等于3的记录。

sql如下：

with a as(
    select
        id,
        visit_date,
        people,
        id - row_number() over(  -- 使用id减去一个连续的序列，若是子序列连续的，则差值相等。
            order by id
        ) as "diff"
    from
        Stadium
    where
        people >= 100
),b as (
    select
        id,
        visit_date,
        people,
        count(*) over(  -- 按差距进行分组，因为是连续的序列，则差值相同，然后求和即可
            partition by diff
            rows between unbounded preceding and unbounded following
        ) as "count"
    from 
        a
)

select
    id,
    visit_date,
    people
from
    b
where count >= 3
order by id

MySQL窗口函数学习笔记

窗口函数介绍

思考

如何定义一个窗口函数？

窗口函数能做什么？

专用窗口函数

rank

dense_rank

row_number

lag/lead

聚合函数

窗口函数的其他应用场景

连续满足某个条件的问题的处理