1.问题描述
t_login记录了用户每次登录的时间信息。
如何通过sql查询出2022年1月连续登录三天以上的用户?
2.初始化脚本
CREATE TABLE t_login(uid int, login_time timestamp);
INSERT INTO t_login VALUES (1, '2022-01-01 08:05:11');
INSERT INTO t_login VALUES (2, '2022-01-01 10:00:00');
INSERT INTO t_login VALUES (3, '2022-01-01 12:13:14');
INSERT INTO t_login VALUES (1, '2022-01-01 19:30:00');
INSERT INTO t_login VALUES (1, '2022-01-02 07:59:30');
INSERT INTO t_login VALUES (2, '2022-01-02 14:00:00');
INSERT INTO t_login VALUES (2, '2022-01-03 11:15:00');
INSERT INTO t_login VALUES (3, '2022-01-03 16:00:00');
INSERT INTO t_login VALUES (1, '2022-01-04 07:20:00');
INSERT INTO t_login VALUES (2, '2022-01-04 07:45:00');
INSERT INTO t_login VALUES (3, '2022-01-04 10:30:30');
INSERT INTO t_login VALUES (1, '2022-01-05 13:00:00');
INSERT INTO t_login VALUES (1, '2022-01-06 17:18:19');
INSERT INTO t_login VALUES (1, '2022-01-07 20:00:00');
INSERT INTO t_login VALUES (2, '2022-01-07 21:00:00');
3.思路分析
-- 查询2022年1月份所有登录的用户
select
distinct uid,
date(login_time) ymd
from
t_login
where
login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59';
1.方法一:通过上述的信息,关联三次自身根据时间来判断,每一次关联前一张表比后一张表时间多一天。
select
t1.uid,t1.ymd,t2.ymd,t3.ymd
from(select distinct uid,date(login_time) ymd from t_login
where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59')t1
join (select distinct uid,date(login_time) ymd from t_login
where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59')t2
on (t1.uid=t2.uid and datediff(t2.ymd,t1.ymd)=1)
join (select distinct uid,date(login_time) ymd from t_login
where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59')t3
on (t2.uid=t3.uid and datediff(t3.ymd,t2.ymd)=1);
- 优点:简单
- 缺点:1.该方法的扩展性很差。2.多次join,性能很差。
2.方法二:我们可以通过窗口函数来解决
-
定义一个视图多次使用,通过窗口函数row_number ()来判断时间的连续性,这里我们会得到一个编号;partition by uid order by ymd 通过uid来进行分区并通过时间进行排序。
-
通过日期进去日期-编号如果得到的差是连续的说明这些天数是连续的,如果不是连续的说明时间是有跳跃的。
with t1 as(select distinct uid,date(login_time) ymd from t_login where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59') select uid,ymd, row_number () over(partition by uid order by ymd) num , date_sub(ymd,interval row_number () over(partition by uid order by ymd) day) sub_date from t1我们得到以下结果:
-
有了上面的操作我们可以将结果简化为t2,通过id和时间差分组,我们可以得到以下数据
with t1 as(select distinct uid,date(login_time) ymd from t_login where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59'), t2 as(select uid,ymd, date_sub(ymd,interval row_number () over(partition by uid order by ymd) day) sub_date from t1) select uid,count(*),min(ymd),max(ymd) from t2 group by uid ,sub_date having count(*)>=3 ;
- 对比方法1,该方法便于扩展,只需要修改having的条件就可以查找任意连续几天的登录信息。2.基于 with 我们只需要访问t_login这个表一次,性能上也得到优化
3.方法三:基于lag()函数进行查询。lag()函数是基于某一字段当前行进行向上偏移几位。
with t1 as(select distinct uid,date(login_time) ymd
from t_login
where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59')
select uid,ymd,
lag(ymd,2)over (partition by uid order by ymd ) from t1
查看结果我们可以发现
基于上述我们可以通过查找,我们将ymd与偏移之后的值相减,如果差值为3,则表示该用户连续三天登录。
with t1 as(select distinct uid,date(login_time) ymd
from t_login
where login_time between timestamp '2022-01-01 00:00:00' and timestamp '2022-01-31 23:59:59'),
t2 as(select uid,ymd,datediff(ymd,lag(ymd,2)over (partition by uid order by ymd)) diff from t1)
select uid,ymd,diff from t2 where diff=3;
执行上述sql得到结果
- 不难发现,该方法的思路与方法二大致相同。