SQL窗口函数

500 阅读2分钟

rob-wingate-Fd9tUmRBJzk-unsplash.jpeg

1. 窗口函数及其作用

窗口函数定义

窗口函数在和当前行相关的一组表行上执行计算。 这相当于一个可以由聚合函数完成的计算类型。但不同于常规的聚合函数, 使用的窗口函数不会导致行被分组到一个单一的输出行;行保留其独立的身份。 在后台,窗口函数能够访问的不止查询结果的当前行。

窗口函数的作用

  • 更加优雅地实现一些功能
  • 更加强大,GROUP BY有一个缺点,在实际得运算统计中,会改变表得行数,一行只显示一个类别。在工作中,我们又经常会遇到需要每组内排名得相关问题,比如:排名问题:每个部门按业绩来排名;topN问题:找出每个部门排名前N的员工进行奖励,可以使用窗口函数来解决

2. 常用例子

以下例子来自:SQL for Data Analysis,使用PostgreSQL的SQL函数

数据的ERD(Entity Relationship Diagram)表示如下

image.png

1. 统计到某个时间段内数据之和

SELECT standard_amt_usd,
       SUM(standard_amt_usd) OVER (ORDER BY occurred_at) AS running_total
FROM orders

查询结果如下

image.png

2. 使用PARTITION BY划分分区

PARTITION BY列表指定将行划分成组或分区, 组或分区共享相同的PARTITION BY表达式的值。 对于每一行,窗口函数在和当前行落在同一个分区的所有行上进行计算

SELECT standard_amt_usd,
       DATE_TRUNC('year', occurred_at) as year,
       SUM(standard_amt_usd) OVER (PARTITION BY DATE_TRUNC('year', occurred_at) ORDER BY occurred_at) AS running_total
FROM orders

查询结果如下

image.png

3. 使用RANK对一个窗口内数据进行排序

SELECT id,
       account_id,
       total,
       RANK() OVER (PARTITION BY account_id ORDER BY total DESC) AS total_rank
FROM orders

查询结果如下

image.png

4. 实现窗口内统计

SELECT id,
       account_id,
       standard_qty,
       DATE_TRUNC('month', occurred_at) AS month,
       DENSE_RANK() OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS dense_rank,
       SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS sum_std_qty,
       COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS count_std_qty,
       AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS avg_std_qty,
       MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS min_std_qty,
       MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS max_std_qty
FROM orders

实现和GROUP BY的类似功能,运行结果类似,此处不展示

5. 对窗口取别名,提高可读性

SELECT id,
       account_id,
       DATE_TRUNC('year',occurred_at) AS year,
       DENSE_RANK() OVER account_year_window AS dense_rank,
       total_amt_usd,
       SUM(total_amt_usd) OVER account_year_window AS sum_total_amt_usd,
       COUNT(total_amt_usd) OVER account_year_window AS count_total_amt_usd,
       AVG(total_amt_usd) OVER account_year_window AS avg_total_amt_usd,
       MIN(total_amt_usd) OVER account_year_window AS min_total_amt_usd,
       MAX(total_amt_usd) OVER account_year_window AS max_total_amt_usd
FROM orders 
WINDOW account_year_window AS (PARTITION BY account_id ORDER BY DATE_TRUNC('year',occurred_at))

6. 将当前行和上一行比较

SELECT occurred_at,
       total_amt_usd,
       LEAD(total_amt_usd) OVER (ORDER BY occurred_at) AS lead,
       LEAD(total_amt_usd) OVER (ORDER BY occurred_at) - total_amt_usd AS lead_difference
FROM (
SELECT occurred_at,
       SUM(total_amt_usd) AS total_amt_usd
  FROM orders 
 GROUP BY 1
) sub

查询结果如下

image.png

LEAD相对的是LAG,方向相反。

7. 统计窗口内分位

SELECT
       account_id,
       occurred_at,
       total_amt_usd,
       NTILE(100) OVER (PARTITION BY account_id ORDER BY total_amt_usd) AS total_percentile
  FROM orders 
 ORDER BY account_id DESC

查询结果如下

image.png

NTILE(4) 即划分4份,按照大小统计落在哪个桶

3. 参考资料

  1. PostgreSQL 9.4.4 中文手册
  2. 数分面试必考题:窗口函数
  3. 零基础学习SQL-SQL高级功能:窗口函数
  4. SQL for Data Analysis