我有一段日志数据,我想首先按 user_id 分组,然后按时间对每组进行排序,最后提取第二条记录。这部分已经完成,但缺少的步骤是计算每组中每个记录相对于第一条记录的时间差。
下面是示例数据和已经完成的部分代码:
import pandas as pd
dd = pd.DataFrame({'item_id': {0: 0, 1: 4, 2: 6, 3: 8, 4: 9, 5: 1},
'date': {0: '2013-12-29T17:56:01Z', 1: '2013-12-29T19:44:09Z',
2: '2013-12-29T19:58:05Z', 3: '2013-12-29T20:00:09Z',
4: '2013-12-29T20:13:35Z', 5: '2013-12-29T20:19:56Z'},
'user_id': {0: 6, 1: 8, 2: 3, 3: 3, 4: 6, 5: 6}})
print("Step 1: Original DataFrame, sorted by date:\n", dd)
g = dd.groupby(by='user_id', sort=False)
print("\nStep 2: Grouped by User ID:\n", g.head())
# Print the 2nd entry (if it exists)
print("\nStep 3: The 2nd user for each entry:\n", g.nth(1).dropna(how='all'))
输出结果:
Step 1: Original DataFrame, sorted by date:
date item_id user_id
0 2013-12-29T17:56:01Z 0 6
1 2013-12-29T19:44:09Z 4 8
2 2013-12-29T19:58:05Z 6 3
3 2013-12-29T20:00:09Z 8 3
4 2013-12-29T20:13:35Z 9 6
5 2013-12-29T20:19:56Z 1 6
Step 2: Grouped by User ID:
date item_id user_id
user_id
6 0 2013-12-29T17:56:01Z 0 6
4 2013-12-29T20:13:35Z 9 6
5 2013-12-29T20:19:56Z 1 6
8 1 2013-12-29T19:44:09Z 4 8
3 2 2013-12-29T19:58:05Z 6 3
3 2013-12-29T20:00:09Z 8 3
Step 3: The 2nd user for each entry:
date item_id
user_id
6 2013-12-29T20:13:35Z 9
3 2013-12-29T20:00:09Z 8
2、解决方案
为了计算每组中每个记录相对于第一条记录的时间差,我们可以使用 pandas.DataFrame.apply() 函数和一个自定义函数。自定义函数可以用来计算每组中每个记录与第一条记录的时间差,并将其作为新的列添加到 DataFrame 中。
def calculate_age(x):
"""
Calculate the age of each record in a group, relative to the first record.
Args:
x: A DataFrame representing a group of records.
Returns:
A DataFrame with an additional column named 'age' that contains the age of each record.
"""
# Convert the 'date' column to datetime64[ns] dtype
x['date'] = pd.to_datetime(x['date'])
# Sort the DataFrame by 'date'
x = x.sort_values(by='date')
# Calculate the difference between each record's date and the first record's date
x['age'] = x['date'] - x['date'].iloc[0]
# Return the DataFrame with the new 'age' column
return x
# Apply the 'calculate_age()` function to each group in the DataFrame
result = dd.groupby('user_id').apply(calculate_age)
# Display the result
print(result)
输出结果:
date item_id user_id age
0 2013-12-29 17:56:01 0 6 00:00:00
1 2013-12-29 19:44:09 4 8 00:00:00
2 2013-12-29 19:58:05 6 3 00:00:00
3 2013-12-29 20:00:09 8 3 00:02:04
4 2013-12-29 20:13:35 9 6 02:17:34
5 2013-12-29 20:19:56 1 6 02:23:55
现在,我们已经成功地为每组中的每个记录计算了时间差,并将其存储在新的列 age 中。