dataframe日期按周、按月、按季度聚合

1,328 阅读3分钟

dataframe中的每一列都是一个Series对象,这个Series对象的index默认是从0开始,步长为1进行递增。

如果你的dataframe中有一列是日期,对于其他列字段需要按照日期字段进行聚合的话,需要将待聚合列的Series对象index属性修改为该日期字段。

关键是这个操作:

offline_ratio = custs.offline_ratio
offline_ratio.index=custs['day_tm'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d'))
offline_ratio_month = offline_ratio.resample('m').mean()
  1. 修改Series对象的index属性为日期
  2. resample函数可以通过调整参数,实现按月(m),按周(w),按季度(q)等聚合
import pandas as pd
import matplotlib.pyplot as plt
from impala.dbapi import connect
from impala.util import as_pandas
from datetime import datetime

conn = connect(host='你服务器的ip',port=端口号,user='用户名',password='密码',auth_mechanism='PLAIN')
curs = conn.cursor()

curs.execute("""
                select substring(created_time,1,10) as day_tm,
                       sum(case when date_from in (0,3) 
                           then 1 
                           else 0 
                           end) as offline_custs,
                       sum(case when date_from in (0,3) 
                                     and so_no is not null
                                     and substring(created_time,1,10)= substring(so_date,1,10)
                           then 1 
                           else 0 
                           end) as offline_orders,
                       sum(case when date_from not in (0,3) 
                           then 1 
                           else 0 
                           end) as online_custs,
                       sum(case when date_from not in (0,3) 
                                     and so_no is not null
                                     and substring(created_time,1,10)= substring(so_date,1,10)
                           then 1 
                           else 0 
                           end) as online_orders
                from 你的表名 c
                where substring(created_time,1,4) >= '2019'                     
                group by substring(created_time,1,10)
                order by substring(created_time,1,10) asc
             """)
custs_to_orders = as_pandas(curs)

custs_to_orders['offline_ratio'] = custs_to_orders['offline_orders']/custs_to_orders['offline_custs']
custs_to_orders['online_ratio'] = custs_to_orders['online_orders']/custs_to_orders['online_custs']

offline_ratio = custs_to_orders.offline_ratio
offline_ratio.index=custs_to_orders['day_tm'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d'))
offline_ratio_month = offline_ratio.resample('m').mean()

day_tm
2019-01-31 0.301876
2019-02-28 0.183390
2019-03-31 0.178983
2019-04-30 0.183437
2019-05-31 0.202010
2019-06-30 0.242368
2019-07-31 0.177942
2019-08-31 0.173683
2019-09-30 0.179291
2019-10-31 0.186196
2019-11-30 0.183292
2019-12-31 0.221013
2020-01-31 0.256396
2020-02-29 0.260454
2020-03-31 0.162729
2020-04-30 0.117873
2020-05-31 0.103655
Freq: M, Name: offline_ratio, dtype: float64

👏👏👏再看看我们以前的文章😃😃😃
🌺 Excel中数据分析工具库-相关系数篇
🌺 干货,手把手教会你做相关性分析
🌺 5年数据分析路,小结。
🌺 用户细分及画像分析
🌺 K-近邻算法及实践

欢迎关注,微信公众号“数据分析师之家
扫描二维码 关注我们
💁提供职业规划、简历指导、面试辅导服务哦
QQ交流群:254674155
在这里插入图片描述