使用 Pandas 对日志条目进行分组并计算时间差

47 阅读3分钟

我有一段日志数据,我想首先按 user_id 分组,然后按时间对每组进行排序,最后提取第二条记录。这部分已经完成,但缺少的步骤是计算每组中每个记录相对于第一条记录的时间差。

下面是示例数据和已经完成的部分代码:

import pandas as pd

dd = pd.DataFrame({'item_id': {0: 0, 1: 4, 2: 6, 3: 8, 4: 9, 5: 1},
                   'date': {0: '2013-12-29T17:56:01Z', 1: '2013-12-29T19:44:09Z',
                            2: '2013-12-29T19:58:05Z', 3: '2013-12-29T20:00:09Z',
                            4: '2013-12-29T20:13:35Z', 5: '2013-12-29T20:19:56Z'},
                   'user_id': {0: 6, 1: 8, 2: 3, 3: 3, 4: 6, 5: 6}})

print("Step 1: Original DataFrame, sorted by date:\n", dd)

g = dd.groupby(by='user_id', sort=False)
print("\nStep 2: Grouped by User ID:\n", g.head())

# Print the 2nd entry (if it exists)
print("\nStep 3: The 2nd user for each entry:\n", g.nth(1).dropna(how='all'))

输出结果:

Step 1: Original DataFrame, sorted by date:
                   date  item_id  user_id
0  2013-12-29T17:56:01Z        0        6
1  2013-12-29T19:44:09Z        4        8
2  2013-12-29T19:58:05Z        6        3
3  2013-12-29T20:00:09Z        8        3
4  2013-12-29T20:13:35Z        9        6
5  2013-12-29T20:19:56Z        1        6

Step 2: Grouped by User ID:
                           date  item_id  user_id
user_id                                          
6       0  2013-12-29T17:56:01Z        0        6
        4  2013-12-29T20:13:35Z        9        6
        5  2013-12-29T20:19:56Z        1        6
8       1  2013-12-29T19:44:09Z        4        8
3       2  2013-12-29T19:58:05Z        6        3
        3  2013-12-29T20:00:09Z        8        3

Step 3: The 2nd user for each entry:
                         date  item_id
user_id                               
6        2013-12-29T20:13:35Z        9
3        2013-12-29T20:00:09Z        8

2、解决方案

为了计算每组中每个记录相对于第一条记录的时间差,我们可以使用 pandas.DataFrame.apply() 函数和一个自定义函数。自定义函数可以用来计算每组中每个记录与第一条记录的时间差,并将其作为新的列添加到 DataFrame 中。

def calculate_age(x):
    """
    Calculate the age of each record in a group, relative to the first record.

    Args:
        x: A DataFrame representing a group of records.

    Returns:
        A DataFrame with an additional column named 'age' that contains the age of each record.
    """
    # Convert the 'date' column to datetime64[ns] dtype
    x['date'] = pd.to_datetime(x['date'])

    # Sort the DataFrame by 'date'
    x = x.sort_values(by='date')

    # Calculate the difference between each record's date and the first record's date
    x['age'] = x['date'] - x['date'].iloc[0]

    # Return the DataFrame with the new 'age' column
    return x

# Apply the 'calculate_age()` function to each group in the DataFrame
result = dd.groupby('user_id').apply(calculate_age)

# Display the result
print(result)

输出结果:

                 date  item_id  user_id     age
0 2013-12-29 17:56:01        0        6 00:00:00
1 2013-12-29 19:44:09        4        8 00:00:00
2 2013-12-29 19:58:05        6        3 00:00:00
3 2013-12-29 20:00:09        8        3 00:02:04
4 2013-12-29 20:13:35        9        6 02:17:34
5 2013-12-29 20:19:56        1        6 02:23:55

现在,我们已经成功地为每组中的每个记录计算了时间差,并将其存储在新的列 age 中。