使用maxCompute(odps) 统计每个用户每天的在线时长

334 阅读6分钟

使用maxCompute(odps) 统计每个用户每天的在线时长

需求

  1. 用户状态有(ONLINE,REST,OFFLINE)分别表示在线、休息、下线。
  2. 用户从上线到下一个状态存在跨数天的情况,也存在同一天中存在多个状态或者重复多个状态。
  3. REST,OFFLINE后都不计算在线时长,一直到出现ONLINE后开始计算在线时长。
  4. 当天在上线状态内(表示从上线开始,一直没下线),如果没有上线记录,则以0点为当天在线时长开始时间;如果没有REST,OFFLINE记录则以23:59:59为当天在线时长结束时间。

sql

WITH user_info AS 
(
    SELECT  ui.mongo_id
            ,ui.name
            ,ui.nick
    FROM    antts.adm_ats_ods_links_user_for_ali_group ui
    WHERE   ui.dt = MAX_PT('antts.adm_ats_ods_links_user_for_ali_group')
    AND     ui.name = 'lxy'
)
,business_action AS 
(
    SELECT  uba.user_id
            ,ui.name
            ,ui.nick
            ,uba.gmt_modified
            ,uba.on_off_status
    FROM    antts.adm_ats_ods_links_user_business_action_for_ali_group uba
    ,       user_info ui
    WHERE   uba.user_id = ui.mongo_id
    AND     uba.dt = MAX_PT('antts.adm_ats_ods_links_user_business_action_for_ali_group')
)
,date_range AS 
(
    SELECT  user_id
            ,user_name
            ,user_nick
            ,DATEADD(TO_DATE(bizdate,'yyyy-mm-dd'),d,'dd') AS dt
    FROM    (
                SELECT  uba.user_id
                        ,MAX(uba.name) AS user_name
                        ,MAX(uba.nick) AS user_nick
                        ,MIN(SUBSTRING(uba.gmt_modified,1,10)) AS bizdate
                        ,sequence(0,
                                 DATEDIFF(TO_DATE(TO_CHAR(GETDATE(),'yyyy-mm-dd'),'yyyy-mm-dd'),TO_DATE(MIN(SUBSTRING(uba.gmt_modified,1,10)),'yyyy-mm-dd'),'dd')
                        ,1) AS ds
                FROM    business_action uba
                GROUP BY uba.user_id
            ) 
    LATERAL VIEW EXPLODE(ds) D AS d
)
,action AS 
(
    SELECT  user_id
            ,TO_DATE(SUBSTRING(gmt_modified,1,10),'yyyy-mm-dd') AS action_date
            ,gmt_modified AS action_time
            ,on_off_status AS action_status
            ,LAG(on_off_status,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS pre_status
            ,LEAD(on_off_status,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS after_status
            ,LAG(gmt_modified,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS pre_action_time
            ,LEAD(gmt_modified,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS after_action_time
    FROM  business_action
    ORDER BY action_time
)
,online_time_period AS 
(
    SELECT  DISTINCT dr.user_id
            ,dr.user_name
            ,dr.user_nick
            ,dr.dt AS action_date
            ,a.action_status AS action_status
            ,CASE   WHEN a.action_status = 'ONLINE' THEN a.action_time
                    ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
            END AS begin_time
            ,CASE   WHEN a.action_status = 'ONLINE' THEN (CASE   WHEN TO_DATE(SUBSTRING(a.after_action_time,1,10),'yyyy-mm-dd') = dr.dt THEN a.after_action_time
                            ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 23:59:59')
                    END)
                    WHEN a.action_status IN ('REST','OFFLINE')
                        AND a.pre_status = 'ONLINE'
                        AND TO_DATE(SUBSTRING(a.pre_action_time,1,10),'yyyy-mm-dd') = dr.dt THEN TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
                    WHEN a.action_status IN ('REST','OFFLINE')
                        AND a.pre_status = 'ONLINE'
                        AND TO_DATE(SUBSTRING(a.pre_action_time,1,10),'yyyy-mm-dd') != dr.dt THEN a.action_time
                    WHEN a.action_status IS NULL
                        AND (
                                SELECT  ARG_MAX(action_time,action_status)
                                FROM    action
                                WHERE   user_id = dr.user_id
                                    AND action_date < dr.dt
                            )  = 'ONLINE' THEN TO_CHAR(dr.dt,'yyyy-mm-dd 23:59:59')
                    ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
            END AS end_time
    FROM    date_range dr
    LEFT JOIN action a
    ON      dr.user_id = a.user_id
    AND     dr.dt = a.action_date
    ORDER BY dr.user_id,action_date,begin_time
)
SELECT  otp.user_id
        ,MAX(IFNULL(otp.user_nick,otp.user_name)) AS user_name
        ,otp.action_date
        ,COLLECT_LIST(otp.action_status) AS action_status
        ,COLLECT_LIST(MAP(otp.begin_time,otp.end_time)) as period_list
        ,SUM(
            DATEDIFF(TO_DATE(otp.end_time,'yyyy-mm-dd hh:mi:ss'),TO_DATE(otp.begin_time,'yyyy-mm-dd hh:mi:ss'),'mi')
        ) AS diff
FROM    online_time_period otp
where otp.begin_time != otp.end_time
GROUP BY otp.user_id
         ,otp.action_date
ORDER BY otp.user_id,otp.action_date

sql说明

user_info:根据用户名过滤数据,减少sql执行时间。
business_action:和user_info内连接,过滤数据。
date_range:查询日期数据集,得到用户的每天的日期范围。

date_range AS 
(
    SELECT  user_id
            ,user_name
            ,user_nick
            ,DATEADD(TO_DATE(bizdate,'yyyy-mm-dd'),d,'dd') AS dt
    FROM    (
                SELECT  uba.user_id
                        ,MAX(uba.name) AS user_name
                        ,MAX(uba.nick) AS user_nick
                        ,MIN(SUBSTRING(uba.gmt_modified,1,10)) AS bizdate
                        ,sequence(0,
                                 DATEDIFF(TO_DATE(TO_CHAR(GETDATE(),'yyyy-mm-dd'),'yyyy-mm-dd'),TO_DATE(MIN(SUBSTRING(uba.gmt_modified,1,10)),'yyyy-mm-dd'),'dd')
                        ,1) AS ds
                FROM    business_action uba
                GROUP BY uba.user_id
            ) 
    LATERAL VIEW EXPLODE(ds) D AS d
)
  1. Lateral View:将单行数据拆成多行数据。
  2. EXPLODE:将一行数据转为多行的UDTF。
  3. LATERAL VIEW explode(ds):将列ds序列展开(列转行),然后将最小日期增加天数生成最小日期到当前日期的日期数据集。最后得到的结果为每个用户的最小日期到当前当前日期的数据集合。
  4. 下面为构建日期数据集的最小日期和增加天数序列。
 SELECT  uba.user_id
                        ,MAX(uba.name) AS user_name
                        ,MAX(uba.nick) AS user_nick
                        ,MIN(SUBSTRING(uba.gmt_modified,1,10)) AS bizdate
                        ,sequence(0,
                                 DATEDIFF(TO_DATE(TO_CHAR(GETDATE(),'yyyy-mm-dd'),'yyyy-mm-dd'),TO_DATE(MIN(SUBSTRING(uba.gmt_modified,1,10)),'yyyy-mm-dd'),'dd')
                        ,1) AS ds
                FROM    business_action uba
                GROUP BY uba.user_id

sequence(start, stop, [step]) :根据序列的开始元素、结束元素以及步长step生成序列。
所以sql表示查询最小的日期bizdate、序列0到n(n为当前日期和最小日期的天数差)步长为1。 action:用户在线记录数据。

action AS 
(
    SELECT  user_id
            ,TO_DATE(SUBSTRING(gmt_modified,1,10),'yyyy-mm-dd') AS action_date
            ,gmt_modified AS action_time
            ,on_off_status AS action_status
            ,LAG(on_off_status,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS pre_status
            ,LEAD(on_off_status,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS after_status
            ,LAG(gmt_modified,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS pre_action_time
            ,LEAD(gmt_modified,1) OVER (PARTITION BY user_id ORDER BY gmt_modified ) AS after_action_time
    FROM  business_action
    ORDER BY action_time
)

获取用户在线记录数据,同时根据记录生成时间排序获取每个用户的每条记录的前一条数据的状态和生成时间。
online_time_period:获取每个用户每天的上下线情况

online_time_period AS 
(
    SELECT  DISTINCT dr.user_id
            ,dr.user_name
            ,dr.user_nick
            ,dr.dt AS action_date
            ,a.action_status AS action_status
            ,CASE   WHEN a.action_status = 'ONLINE' THEN a.action_time
                    ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
            END AS begin_time
            ,CASE   WHEN a.action_status = 'ONLINE' THEN (CASE   WHEN TO_DATE(SUBSTRING(a.after_action_time,1,10),'yyyy-mm-dd') = dr.dt THEN a.after_action_time
                            ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 23:59:59')
                    END)
                    WHEN a.action_status IN ('REST','OFFLINE')
                        AND a.pre_status = 'ONLINE'
                        AND TO_DATE(SUBSTRING(a.pre_action_time,1,10),'yyyy-mm-dd') = dr.dt THEN TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
                    WHEN a.action_status IN ('REST','OFFLINE')
                        AND a.pre_status = 'ONLINE'
                        AND TO_DATE(SUBSTRING(a.pre_action_time,1,10),'yyyy-mm-dd') != dr.dt THEN a.action_time
                    WHEN a.action_status IS NULL
                        AND (
                                SELECT  ARG_MAX(action_time,action_status)
                                FROM    action
                                WHERE   user_id = dr.user_id
                                    AND action_date < dr.dt
                            )  = 'ONLINE' THEN TO_CHAR(dr.dt,'yyyy-mm-dd 23:59:59')
                    ELSE TO_CHAR(dr.dt,'yyyy-mm-dd 00:00:00')
            END AS end_time
    FROM    date_range dr
    LEFT JOIN action a
    ON      dr.user_id = a.user_id
    AND     dr.dt = a.action_date
    ORDER BY dr.user_id,action_date,begin_time
)

arg_max(<valueToMaximize>, <valueToReturn>):返回valueToMaximize最大值对应行的valueToReturn。
获取每个用户从第一次记录日期开始当前日期的每天在线开始时间和结束时间。

  1. 如果当天日期有在线记录,并且为ONLINE上线,则以上线时间为开始时间,否则以00:00:00为开始时间。
  2. 如果当天日期有在线记录,并且为ONLINE上线,则判断当天用户是否有下次在线记录,如果有则以下次记录时间为结束时间,否则为23:59:59为结束时间。
  3. 如果当天有'REST', 'OFFLINE'记录,前一条记录为ONLINE并且上线时间为当天,则以00:00:00为结束时间。
  4. 如果当天有'REST', 'OFFLINE'记录,前一条记录为ONLINE并且上线时间不为当天,则以'REST', 'OFFLINE'记录的时间为结束时间。
  5. 如果当天没有任何在线记录,则判断当天日期前最近一条记录是否有ONLINE记录,如果有则以23:59:59为结束时间(表示用户当天依旧还是在线状态),否则00:00:00为结束时间(表示用户处于离线或者休息状态,不算在线时长,所以开始时间和结束时间都是00:00:00)。
SELECT otp.user_id
     , MAX(IFNULL(otp.user_nick, otp.user_name))       AS user_name
     , otp.action_date
     , COLLECT_LIST(otp.action_status)                 AS action_status
     , COLLECT_LIST(MAP(otp.begin_time, otp.end_time)) as period_list
     , SUM(
        DATEDIFF(TO_DATE(otp.end_time, 'yyyy-mm-dd hh:mi:ss'), TO_DATE(otp.begin_time, 'yyyy-mm-dd hh:mi:ss'), 'mi')
       )                                               AS diff
FROM online_time_period otp
where otp.begin_time != otp.end_time
GROUP BY otp.user_id
       , otp.action_date
ORDER BY otp.user_id, otp.action_date

根据用户和日期分组,聚合计算每个用户每个日期的在线时长。