大数据开发必备技能,面试高频考点,收藏这一篇就够了
📖 前言
窗口函数是 SQL 中最强大也最容易混淆的功能之一。无论你是准备面试,还是实际工作中需要处理数据分析需求,窗口函数都是绕不开的技能点。
本文从基础概念讲起,配合大量实战案例,帮你彻底掌握窗口函数。
学完本文,你将能够:
- ✅ 理解窗口函数的工作原理
- ✅ 熟练使用各类窗口函数
- ✅ 解决 Top N、连续登录、同比环比等经典问题
- ✅ 避免常见坑点
🎯 一、窗口函数是什么?
1.1 普通聚合 vs 窗口聚合
先看一个简单需求:查询每个员工的工资,同时显示部门平均工资
普通聚合(GROUP BY):
-- ❌ 只能得到聚合结果,丢失明细
SELECT dept, AVG(salary) as avg_salary
FROM employees
GROUP BY dept;
窗口函数:
-- ✅ 保留明细,同时显示聚合值
SELECT
name,
dept,
salary,
AVG(salary) OVER (PARTITION BY dept) as avg_salary
FROM employees;
结果对比:
| name | dept | salary | avg_salary |
|---|---|---|---|
| 张三 | 技术部 | 10000 | 9000 |
| 李四 | 技术部 | 8000 | 9000 |
| 王五 | 销售部 | 7000 | 6500 |
| 赵六 | 销售部 | 6000 | 6500 |
核心区别:
- 普通聚合:多行 → 一行(丢失明细)
- 窗口函数:多行 → 多行(保留明细)
1.2 窗口函数语法结构
函数名 (参数) OVER (
PARTITION BY 分组列
ORDER BY 排序列
窗口子句
)
三个核心要素:
| 要素 | 作用 | 是否必需 |
|---|---|---|
| PARTITION BY | 定义窗口分组(类似 GROUP BY) | 可选 |
| ORDER BY | 定义窗口内排序 | 可选 |
| 窗口子句 | 定义窗口范围(ROWS/RANGE) | 可选 |
📚 二、窗口函数分类
窗口函数分为四大类:
┌─────────────────────────────────────────────────────────┐
│ 窗口函数分类 │
├──────────────┬──────────────┬──────────────┬───────────┤
│ 排名函数 │ 聚合函数 │ 前后行函数 │ 分布函数 │
├──────────────┼──────────────┼──────────────┼───────────┤
│ ROW_NUMBER │ SUM │ LAG │ NTILE │
│ RANK │ AVG │ LEAD │ PERCENT_RANK│
│ DENSE_RANK │ COUNT │ FIRST_VALUE │ CUME_DIST │
│ │ MAX/MIN │ LAST_VALUE │ │
└──────────────┴──────────────┴──────────────┴───────────┘
🔢 三、排名函数(最常用)
3.1 ROW_NUMBER - 行号
功能: 生成连续行号,不考虑并列
SELECT
name,
dept,
salary,
ROW_NUMBER() OVER (
PARTITION BY dept
ORDER BY salary DESC
) as rn
FROM employees;
结果:
| name | dept | salary | rn |
|---|---|---|---|
| 张三 | 技术部 | 10000 | 1 |
| 李四 | 技术部 | 8000 | 2 |
| 王五 | 销售部 | 7000 | 1 |
| 赵六 | 销售部 | 6000 | 2 |
特点: 即使工资相同,行号也连续(1, 2, 3, 4...)
3.2 RANK - 跳跃排名
功能: 考虑并列,排名跳跃
SELECT
name,
dept,
salary,
RANK() OVER (
PARTITION BY dept
ORDER BY salary DESC
) as rk
FROM employees;
结果(有并列情况):
| name | dept | salary | rk | |
|---|---|---|---|---|
| 张三 | 技术部 | 10000 | 1 | |
| 李四 | 技术部 | 10000 | 1 | |
| 王五 | 技术部 | 8000 | 3 | ← 跳过 2 |
| 赵六 | 销售部 | 7000 | 1 |
特点: 并列第 1,下一个是第 3(1, 1, 3, 4...)
3.3 DENSE_RANK - 密集排名
功能: 考虑并列,排名不跳跃
SELECT
name,
dept,
salary,
DENSE_RANK() OVER (
PARTITION BY dept
ORDER BY salary DESC
) as drk
FROM employees;
结果(有并列情况):
| name | dept | salary | drk | |
|---|---|---|---|---|
| 张三 | 技术部 | 10000 | 1 | |
| 李四 | 技术部 | 10000 | 1 | |
| 王五 | 技术部 | 8000 | 2 | ← 不跳过 |
| 赵六 | 销售部 | 7000 | 1 |
特点: 并列第 1,下一个是第 2(1, 1, 2, 3...)
3.4 三者对比总结
SELECT
name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) as rn,
RANK() OVER (ORDER BY salary DESC) as rk,
DENSE_RANK() OVER (ORDER BY salary DESC) as drk
FROM employees;
| name | salary | rn | rk | drk |
|---|---|---|---|---|
| A | 10000 | 1 | 1 | 1 |
| B | 10000 | 2 | 1 | 1 |
| C | 8000 | 3 | 3 | 2 |
| D | 6000 | 4 | 4 | 3 |
记忆口诀:
- ROW_NUMBER:永远连续,不管并列
- RANK:有并列就跳跃
- DENSE_RANK:有并列也密集
3.5 实战:Top N 问题
需求: 查询每个部门工资最高的前 3 名员工
WITH ranked AS (
SELECT
name,
dept,
salary,
DENSE_RANK() OVER (
PARTITION BY dept
ORDER BY salary DESC
) as rk
FROM employees
)
SELECT name, dept, salary
FROM ranked
WHERE rk <= 3;
为什么用 DENSE_RANK?
- 如果用 ROW_NUMBER:可能漏掉并列的员工
- 如果用 RANK:排名跳跃,可能取不到 3 个人
- DENSE_RANK:保证取到前 3 名(含并列)
📊 四、聚合窗口函数
4.1 基础聚合
SELECT
name,
dept,
salary,
-- 部门总和
SUM(salary) OVER (PARTITION BY dept) as dept_total,
-- 部门平均
AVG(salary) OVER (PARTITION BY dept) as dept_avg,
-- 部门最高
MAX(salary) OVER (PARTITION BY dept) as dept_max,
-- 部门最低
MIN(salary) OVER (PARTITION BY dept) as dept_min,
-- 部门人数
COUNT(*) OVER (PARTITION BY dept) as dept_count
FROM employees;
4.2 累计聚合
需求: 计算每日累计销售额
SELECT
date,
daily_sales,
SUM(daily_sales) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_sales
FROM sales;
结果:
| date | daily_sales | cumulative_sales |
|---|---|---|
| 01-01 | 100 | 100 |
| 01-02 | 150 | 250 |
| 01-03 | 200 | 450 |
| 01-04 | 180 | 630 |
4.3 移动平均
需求: 计算 7 日移动平均销售额
SELECT
date,
daily_sales,
AVG(daily_sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7d
FROM sales;
结果:
| date | daily_sales | moving_avg_7d |
|---|---|---|
| 01-01 | 100 | 100 |
| 01-02 | 150 | 125 |
| ... | ... | ... |
| 01-07 | 180 | 155 |
| 01-08 | 200 | 165 |
⏪ 五、前后行函数
5.1 LAG - 取前 N 行
需求: 计算销售额环比增长率
SELECT
date,
sales,
LAG(sales, 1) OVER (ORDER BY date) as prev_sales,
ROUND(
(sales - LAG(sales, 1) OVER (ORDER BY date)) * 100.0 /
LAG(sales, 1) OVER (ORDER BY date),
2
) as growth_rate
FROM sales;
结果:
| date | sales | prev_sales | growth_rate |
|---|---|---|---|
| 01-01 | 100 | NULL | NULL |
| 01-02 | 120 | 100 | 20.00% |
| 01-03 | 150 | 120 | 25.00% |
| 01-04 | 140 | 150 | -6.67% |
5.2 LEAD - 取后 N 行
需求: 对比当前与下一次的销售
SELECT
date,
sales,
LEAD(sales, 1) OVER (ORDER BY date) as next_sales,
next_sales - sales as diff
FROM sales;
5.3 FIRST_VALUE / LAST_VALUE
需求: 显示每个部门最高和最低工资
SELECT DISTINCT
dept,
FIRST_VALUE(salary) OVER (
PARTITION BY dept
ORDER BY salary DESC
) as max_salary,
LAST_VALUE(salary) OVER (
PARTITION BY dept
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as min_salary
FROM employees;
⚠️ 注意: LAST_VALUE 需要指定 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,否则只到当前行。
🎯 六、经典实战案例
6.1 连续登录问题
需求: 找出连续登录 3 天以上的用户
WITH login_ranked AS (
SELECT
user_id,
login_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY login_date
) as rn
FROM user_logins
),
login_grouped AS (
SELECT
user_id,
login_date,
DATE_SUB(login_date, INTERVAL rn DAY) as grp
FROM login_ranked
)
SELECT
user_id,
MIN(login_date) as start_date,
MAX(login_date) as end_date,
COUNT(*) as consecutive_days
FROM login_grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;
核心思路:
- 给每次登录编号
- 用日期减去行号,连续的日期会得到相同的组
- 按组聚合,统计连续天数
6.2 同比环比计算
需求: 计算每月销售额的同比和环比
WITH monthly_sales AS (
SELECT
DATE_FORMAT(order_date, '%Y-%m') as month,
SUM(amount) as sales
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
)
SELECT
month,
sales,
LAG(sales, 1) OVER (ORDER BY month) as prev_month,
ROUND(
(sales - LAG(sales, 1) OVER (ORDER BY month)) * 100.0 /
LAG(sales, 1) OVER (ORDER BY month),
2
) as mom_growth, -- 环比
LAG(sales, 12) OVER (ORDER BY month) as prev_year,
ROUND(
(sales - LAG(sales, 12) OVER (ORDER BY month)) * 100.0 /
LAG(sales, 12) OVER (ORDER BY month),
2
) as yoy_growth -- 同比
FROM monthly_sales;
6.3 留存率计算
需求: 计算次日留存、7 日留存
WITH user_first_login AS (
SELECT
user_id,
MIN(login_date) as first_date
FROM user_logins
GROUP BY user_id
),
user_retention AS (
SELECT
f.user_id,
f.first_date,
l.login_date,
DATEDIFF(l.login_date, f.first_date) as days_diff
FROM user_first_login f
JOIN user_logins l ON f.user_id = l.user_id
)
SELECT
first_date,
COUNT(DISTINCT user_id) as total_users,
COUNT(DISTINCT CASE WHEN days_diff = 1 THEN user_id END) as day1_users,
ROUND(
COUNT(DISTINCT CASE WHEN days_diff = 1 THEN user_id END) * 100.0 /
COUNT(DISTINCT user_id),
2
) as day1_retention,
COUNT(DISTINCT CASE WHEN days_diff = 7 THEN user_id END) as day7_users,
ROUND(
COUNT(DISTINCT CASE WHEN days_diff = 7 THEN user_id END) * 100.0 /
COUNT(DISTINCT user_id),
2
) as day7_retention
FROM user_retention
GROUP BY first_date;
6.4 去重取最新
需求: 订单表有重复数据,每个订单保留最新一条
WITH ranked_orders AS (
SELECT
order_id,
status,
update_time,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY update_time DESC
) as rn
FROM orders
)
SELECT order_id, status, update_time
FROM ranked_orders
WHERE rn = 1;
⚠️ 七、常见坑点与注意事项
7.1 执行顺序
-- ❌ 错误:窗口函数不能在 WHERE 中使用
SELECT * FROM (
SELECT name, salary, ROW_NUMBER() OVER (ORDER BY salary) as rn
FROM employees
)
WHERE rn <= 10; -- ✅ 正确:在外层过滤
-- ❌ 错误
SELECT name, salary
FROM employees
WHERE ROW_NUMBER() OVER (ORDER BY salary) <= 10;
执行顺序: FROM → WHERE → GROUP BY → HAVING → SELECT (窗口函数) → ORDER BY
7.2 LAST_VALUE 的坑
-- ❌ 错误:默认窗口范围是 UNBOUNDED PRECEDING TO CURRENT ROW
LAST_VALUE(salary) OVER (ORDER BY salary DESC)
-- ✅ 正确:需要明确指定窗口范围
LAST_VALUE(salary) OVER (
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
7.3 NULL 值处理
-- NULL 值默认排在最后(升序)或最前(降序)
-- 可以显式指定
ORDER BY salary DESC NULLS LAST
7.4 性能优化
- 避免全表窗口: 尽量用 PARTITION BY 缩小窗口范围
- 索引优化: ORDER BY 列加索引
- 避免嵌套窗口: 多个窗口函数尽量在一次扫描中完成
📋 八、速查表
排名函数
| 函数 | 并列处理 | 排名序列 |
|---|---|---|
| ROW_NUMBER | 不处理 | 1, 2, 3, 4... |
| RANK | 跳跃 | 1, 1, 3, 4... |
| DENSE_RANK | 不跳跃 | 1, 1, 2, 3... |
窗口子句
| 子句 | 含义 |
|---|---|
| ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | 从起点到当前行 |
| ROWS BETWEEN 3 PRECEDING AND CURRENT ROW | 前 3 行到当前行 |
| ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING | 当前行到终点 |
| ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING | 整个分区 |
常用场景速查
| 场景 | 推荐函数 |
|---|---|
| Top N | DENSE_RANK |
| 去重取最新 | ROW_NUMBER |
| 累计求和 | SUM + ORDER BY |
| 移动平均 | AVG + ROWS BETWEEN |
| 环比 | LAG |
| 同比 | LAG(12) |
| 连续登录 | ROW_NUMBER + 日期差 |
| 留存率 | 自连接 + 日期差 |
🎓 九、总结
窗口函数是 SQL 进阶的必经之路,核心要点:
- 理解窗口概念: 窗口是一组相关行,窗口函数对这组行进行计算但保留明细
- 掌握四大类函数: 排名、聚合、前后行、分布
- 熟练经典场景: Top N、累计、移动平均、同比环比、连续登录
- 避开常见坑点: 执行顺序、LAST_VALUE 窗口范围、NULL 处理
学习建议:
- 先理解原理,再记忆语法
- 多动手实践,在真实数据上验证
- 收藏本文,需要时快速查阅
📚 十、延伸学习
- LeetCode SQL 题库(窗口函数专题)
- 牛客网 SQL 实战
- 《SQL 必知必会》第 15-18 章
觉得有用?欢迎收藏、转发、评论!
下一篇预告: 《Spark 性能优化 10 个技巧》