1. 窗口函数及其作用
窗口函数定义
窗口函数在和当前行相关的一组表行上执行计算。 这相当于一个可以由聚合函数完成的计算类型。但不同于常规的聚合函数, 使用的窗口函数不会导致行被分组到一个单一的输出行;行保留其独立的身份。 在后台,窗口函数能够访问的不止查询结果的当前行。
窗口函数的作用
- 更加优雅地实现一些功能
- 更加强大,GROUP BY有一个缺点,在实际得运算统计中,会改变表得行数,一行只显示一个类别。在工作中,我们又经常会遇到需要每组内排名得相关问题,比如:排名问题:每个部门按业绩来排名;topN问题:找出每个部门排名前N的员工进行奖励,可以使用窗口函数来解决
2. 常用例子
以下例子来自:SQL for Data Analysis,使用PostgreSQL的SQL函数
数据的ERD(Entity Relationship Diagram)表示如下
1. 统计到某个时间段内数据之和
SELECT standard_amt_usd,
SUM(standard_amt_usd) OVER (ORDER BY occurred_at) AS running_total
FROM orders
查询结果如下
2. 使用PARTITION BY划分分区
PARTITION BY列表指定将行划分成组或分区, 组或分区共享相同的PARTITION BY表达式的值。 对于每一行,窗口函数在和当前行落在同一个分区的所有行上进行计算
SELECT standard_amt_usd,
DATE_TRUNC('year', occurred_at) as year,
SUM(standard_amt_usd) OVER (PARTITION BY DATE_TRUNC('year', occurred_at) ORDER BY occurred_at) AS running_total
FROM orders
查询结果如下
3. 使用RANK对一个窗口内数据进行排序
SELECT id,
account_id,
total,
RANK() OVER (PARTITION BY account_id ORDER BY total DESC) AS total_rank
FROM orders
查询结果如下
4. 实现窗口内统计
SELECT id,
account_id,
standard_qty,
DATE_TRUNC('month', occurred_at) AS month,
DENSE_RANK() OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS dense_rank,
SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS sum_std_qty,
COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS count_std_qty,
AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS avg_std_qty,
MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS min_std_qty,
MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS max_std_qty
FROM orders
实现和GROUP BY的类似功能,运行结果类似,此处不展示
5. 对窗口取别名,提高可读性
SELECT id,
account_id,
DATE_TRUNC('year',occurred_at) AS year,
DENSE_RANK() OVER account_year_window AS dense_rank,
total_amt_usd,
SUM(total_amt_usd) OVER account_year_window AS sum_total_amt_usd,
COUNT(total_amt_usd) OVER account_year_window AS count_total_amt_usd,
AVG(total_amt_usd) OVER account_year_window AS avg_total_amt_usd,
MIN(total_amt_usd) OVER account_year_window AS min_total_amt_usd,
MAX(total_amt_usd) OVER account_year_window AS max_total_amt_usd
FROM orders
WINDOW account_year_window AS (PARTITION BY account_id ORDER BY DATE_TRUNC('year',occurred_at))
6. 将当前行和上一行比较
SELECT occurred_at,
total_amt_usd,
LEAD(total_amt_usd) OVER (ORDER BY occurred_at) AS lead,
LEAD(total_amt_usd) OVER (ORDER BY occurred_at) - total_amt_usd AS lead_difference
FROM (
SELECT occurred_at,
SUM(total_amt_usd) AS total_amt_usd
FROM orders
GROUP BY 1
) sub
查询结果如下
与LEAD相对的是LAG,方向相反。
7. 统计窗口内分位
SELECT
account_id,
occurred_at,
total_amt_usd,
NTILE(100) OVER (PARTITION BY account_id ORDER BY total_amt_usd) AS total_percentile
FROM orders
ORDER BY account_id DESC
查询结果如下
NTILE(4) 即划分4份,按照大小统计落在哪个桶