环境:MySql 8.0.23
1.新建用户登陆表:action_log
CREATE TABLE action_log (
user_id int,
log_date date
);
2.插入测试数据
INSERT INTO action_log (user_id, log_date)
VALUES (2, '2021-04-01'),
(2, '2021-04-02'),
(2, '2021-04-03'),
(2, '2021-04-06'),
(2, '2021-04-07'),
(2, '2021-04-07'),
(2, '2021-04-08'),
(2, '2021-04-09'),
(2, '2021-04-10'),
(2, '2021-04-10'),
(2, '2021-04-11'),
(2, '2021-04-12'),
(3, '2021-04-03'),
(3, '2021-04-04'),
(3, '2021-04-05'),
(3, '2021-04-06'),
(3, '2021-04-07'),
(3, '2021-04-08'),
(3, '2021-04-09'),
(5, '2021-04-08'),
(5, '2021-04-09'),
(5, '2021-04-15'),
(6, '2021-04-10'),
(6, '2021-04-12'),
(6, '2021-04-13'),
(6, '2021-04-22'),
(7, '2021-04-20');
3.统计连续N天登录的用户
3.1 用户一天内可能有多次登录,先去重。记作a表:
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date;
这里也可以用distinct。但是对数据量比较大的表,使用group by可以达到同样的效果,速度也更快。
3.2 使用窗口函数,对数据进行排序。记作b表:
SELECT user_id, log_date, row_number() OVER (PARTITION BY user_id ORDER BY log_date) AS rw
FROM a;
3.3 使用log_date减去rw,得到一个新日期。连续登录时,得到的结果是同一个日期。记作c表:
SELECT user_id, log_date, rw
, date_sub(log_date, INTERVAL rw DAY) AS diff_date
FROM b;
3.4 使用user_id和diff_date分组,其中diff_date >= N的,即为连续登录N天或N天以上的用户。记作d表:
SELECT user_id
FROM c
GROUP BY user_id, diff_date;
3.5 对d表去重即到结果。最终sql:
SELECT DISTINCT user_id
FROM (
SELECT user_id
FROM (
SELECT user_id, log_date, rw
, date_sub(log_date, INTERVAL rw DAY) AS diff_date
FROM (
SELECT user_id, log_date, row_number() OVER (PARTITION BY user_id ORDER BY log_date) AS rw
FROM (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) a
) b
) c
GROUP BY user_id, diff_date
) d;
结果如下:
4.统计留存率:次日留存、7日留存
常用的留存率有两周,一是针对每日登录用户的留存率,一是针对每日新增用户的留存率。
4.1 每日登录用户留存率
登录用户留存率:指当天登录的用户数除以数天后(当天)该批用户数在此天仍活跃的用户数;
例:2019-11-18有100名登录用户,此100名用户中有60名在 2019-11-25仍有登录记录,则7天活跃用户留存率为60/100=60%
4.1.1 对数据去重。记作a表:
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date;
4.1.2 将上述表使用左外连接进行自连接,记作b表
左表tem1,右表为tem2,且tem2表的登录日期大于tem1表的登录日期。此举为求一一列出每用户每天及当天后面有活跃的日期。
SELECT tem1.user_id, tem1.log_date AS date1, tem2.log_date AS date2
FROM a tem1
LEFT JOIN a tem2
ON tem1.user_id = tem2.user_id
AND tem2.log_date >= tem1.log_date;
4.1.3 求以左表日期为首日的首日用户数,第二日用户数,第三日用户数,第四日用户数,第八日用户数
此处使用COUNT(DISTINCT IF(DATEDIFF(date1,date2)=1, user_id, NULL)) 。 DATEDIFF(date1,date2)=1为求出左右表日期差值为1(即次日这个条件),然后判断其是否为真,为真输出其对应的user_id,并去重计算,即为当天用户在次天仍有活跃,并计算符合此情况的用户ID数量。即可求出当天后的一天,仍有多少用户留存。sql如下:
SELECT b.date1, count(DISTINCT b.user_id) AS day_0
, count(DISTINCT if(datediff(b.date2, b.date1) = 1, b.user_id, NULL)) AS day_2
, count(DISTINCT if(datediff(b.date2, b.date1) = 2, b.user_id, NULL)) AS day_3
, count(DISTINCT if(datediff(b.date2, b.date1) = 6, b.user_id, NULL)) AS day_7
, count(DISTINCT if(datediff(b.date2, b.date1) >= 7, b.user_id, NULL)) AS day_7plus
FROM (
SELECT tem1.user_id, tem1.log_date AS date1, tem2.log_date AS date2
FROM (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) tem1
LEFT JOIN (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) tem2
ON tem1.user_id = tem2.user_id
AND tem2.log_date >= tem1.log_date
) b
GROUP BY b.date1;
结果如下:
4.1.4 最后可以基于用户数直接计算留存率
将前一步得出的结果记作c表,slq为:
SELECT date1, day_0
, concat(truncate(day_2 / day_0 * 100, 2), '%') AS day_2_retention
, concat(truncate(day_3 / day_0 * 100, 2), '%') AS day_3_retention
, concat(truncate(day_7 / day_0 * 100, 2), '%') AS day_7_retention
, concat(truncate(day_day_7plus / day_0 * 100, 2), '%') AS day_day_7plus_retention
FROM c;
4.2 每日新增用户留存率
计算公式为: 留存率 = 新增用户中在第N天登录的用户数 / 新增用户数 * 100% 这种定义相比第一种更为常用。这种情况下,计算逻辑和第一种类似,区别在于我们需要先筛选出每日的新增用户(即用户的首次登录日期),之后则和第一种计算方法类似。这里直接给出最终sql:
SELECT a.first_log_dt, count(DISTINCT a.user_id) AS day_0, count(DISTINCT b.user_id) AS day_2
, count(DISTINCT c.user_id) AS day_3, count(DISTINCT d.user_id) AS day_7
, count(DISTINCT e.user_id) AS day_7plus
FROM (
SELECT user_id, min(log_date) AS first_log_dt
FROM action_log
GROUP BY user_id
) a
LEFT JOIN (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) b
ON a.user_id = b.user_id
AND datediff(b.log_date, a.first_log_dt) = 1
LEFT JOIN (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) c
ON a.user_id = c.user_id
AND datediff(c.log_date, a.first_log_dt) = 2
LEFT JOIN (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) d
ON a.user_id = d.user_id
AND datediff(d.log_date, a.first_log_dt) = 6
LEFT JOIN (
SELECT user_id, log_date
FROM action_log
GROUP BY user_id, log_date
) e
ON a.user_id = e.user_id
AND datediff(e.log_date, a.first_log_dt) >= 7
GROUP BY a.first_log_dt;
结果如下:
可以根据上表结果计算相应留存率。