Data Analysis: Predicting APP User Activity


The code has been updated to the quickhand repository: "APP user activity prediction" predicts which users will be active in an upcoming period, based on anonymized and sampled log data (gitee.com)

Project background and task

The "APP user activity prediction" project uses anonymized and sampled log data to predict which users will be active in an upcoming period.

img

Users who use the APP at any time during the next 7 days (days 31 to 37), i.e. who appear in any of the log types listed above, are defined as "active users". The task is to predict these users from among those in the registration log, which turns the problem into a binary active/inactive classification.

APP User Activity Prediction: Data Analysis

Data acquisition and preprocessing

Import the basic toolkits

# data toolkits
import numpy as np
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
import os
import gc
import time
import datetime
import multiprocessing as mp
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

plt.rcParams["font.sans-serif"] = ["SimHei"]  # font that can render Chinese characters in plots
plt.rcParams["axes.unicode_minus"] = False  # render minus signs correctly with this font


pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Data loading and preprocessing

register = pd.read_csv('user_register_log.txt', sep='\t',
                       names=['user_id', 'register_day', 'register_type', 'device_type'])
launch = pd.read_csv('app_launch_log.txt', sep='\t', names=['user_id', 'launch_day'])
create = pd.read_csv('video_create_log.txt', sep='\t', names=['user_id', 'create_day'])
activity = pd.read_csv('user_activity_log.txt', sep='\t',
                       names=['user_id', 'act_day', 'page', 'video_id', 'author_id', 'act_type'])
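
The activity log is by far the largest of the four files. As a hedged alternative (not in the original post), read_csv can be given explicit dtypes to cut memory usage; the integer widths below are assumptions, and the day columns, stored as zero-padded strings, are parsed here as integers, which is how the rest of the code treats them anyway:

activity = pd.read_csv('user_activity_log.txt', sep='\t',
                       names=['user_id', 'act_day', 'page', 'video_id', 'author_id', 'act_type'],
                       dtype={'user_id': np.int32, 'act_day': np.int16, 'page': np.int8,
                              'video_id': np.int32, 'author_id': np.int32, 'act_type': np.int8})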

Data description

user_register_log.txt --- user registration log

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| user_id | int | unique user ID (anonymized) | 666 |
| register_day | string | date | 01, 02, ..., 30 |
| register_type | int | registration channel (anonymized) | 0 |
| device_type | int | device type (anonymized) | 0 |

img

app_launch_log --- APP launch log

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| user_id | int | unique user ID (anonymized) | 666 |
| launch_day | string | date | 01, 02, ..., 30 |

img

video_create_log --- video creation log

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| user_id | int | unique user ID (anonymized) | 666 |
| create_day | string | date | 01, 02, ..., 30 |

img

user_activity_log.txt --- user activity log

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| user_id | int | unique user ID (anonymized) | 666 |
| act_day | string | date | 01, 02, ..., 30 |
| page | int | page where the action happened; each value maps to one of: following page, profile page, discover page, nearby page, or other pages | 1 |
| video_id | int | video ID (anonymized) | 333 |
| author_id | int | author ID (anonymized) | 999 |
| act_type | int | user action type; each value maps to one of: play, follow, like, forward, report, or "show fewer posts like this" | 1 |

img

Choosing an evaluation metric

Let M be the set of users predicted to be active and N the set of users who are actually active; the metrics are then

$precision = \frac{|M \cap N|}{|M|}$

$recall = \frac{|M \cap N|}{|N|}$

$F1\ Score = \frac{2 \times precision \times recall}{precision + recall}$
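
For reference, here is a minimal sketch (not part of the original post) of computing these metrics on plain sets of user IDs; pred_users and true_users are hypothetical inputs:

def f1_on_user_sets(pred_users, true_users):
    # precision/recall/F1 over the predicted set M and the ground-truth set N
    M, N = set(pred_users), set(true_users)
    hit = len(M & N)
    precision = hit / len(M) if M else 0.0
    recall = hit / len(N) if N else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(f1_on_user_sets([1, 2, 3], [2, 3, 4]))  # 2 hits out of 3 predicted / 3 actual -> F1 = 2/3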

User behaviour analysis and visualization

The register table

First, for the registration table, we call info() to inspect its contents and check whether there are missing values and what the dtype of each column is.

register.info()

img

Next, we visualize the user registration dates.

sns.countplot(x='register_day', data=register)
plt.show()

img

The chart shows a clear spike in registrations on day 24, and a peak recurs at regular intervals.

# visualize the distribution with a density plot
plt.figure(figsize=(12, 5))
plt.title('Distribution of register day')
ax = sns.distplot(register['register_day'], bins=30)
plt.show()

img

Next, we visualize the registration channels.

sns.countplot(x='register_type', data=register)
plt.show()

img

Channels 0, 1 and 2 account for the most registrations.

Combining registration channel with registration date, we analyse the peak (i.e. the day-24 data):

sns.countplot(x='register_type', data=register[register['register_day'] == 24])
plt.show()

img

The chart reveals an abnormal surge of users registering through channel 3 on day 24.

For comparison, here is the distribution on the previous day:

sns.countplot(x='register_type', data=register[register['register_day'] == 23])

img

On day 23 the registrations still follow the usual channel distribution, so we suspect that the platform ran a promotion through channel 3 on day 24.

We use value_counts() to inspect the distribution of device_type.

register['device_type'].value_counts()

img

The launch table

launch.info()

img

This table is also fairly clean (no missing values).

We visualize launch_day:

sns.countplot(x='launch_day', data=launch)
plt.show()

img

Launches show an overall upward trend with a fairly clear periodic pattern (e.g. days 6, 7, 13, 14, ...), which suggests a weekend effect.
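
A quick way to probe the weekend hypothesis (a sketch, not part of the original analysis) is to bucket launches by launch_day modulo 7; which bucket falls on a weekend is unknown, since the calendar alignment of day 1 is not given:

# count launches per (launch_day mod 7) bucket
(launch['launch_day'] % 7).value_counts().sort_index().plot(kind='bar')
plt.xlabel('launch_day mod 7')
plt.ylabel('launch count')
plt.show()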

Use value_counts() to look at how many times each user launched the app:

launch['user_id'].value_counts()

img

Some users launched the app on every one of the 30 days, while others launched it only once in the whole month.

The create table

create.info()

img

We visualize the distribution of create_day:

sns.countplot(x='create_day', data=create)
plt.show()

img

Overall, video creation also trends upward, again with some periodicity.

The activity table

activity.info()

img

First, we visualize the activity dates:

sns.countplot(x='act_day', data=activity)
plt.show()

img

Similar to most of the plots above, activity shows an upward trend and a periodic pattern.

Feature engineering

def split_data(data, columns, start_day, end_day):
    data = data[(data[columns] >= start_day) & (data[columns] <= end_day)]
    return data

We first define split_data to slice a table to a date window; it takes the data, the date column, and the start and end days (the window is inclusive on both ends).

A note on the offset list returned below: days 1~18 form the base feature window, and the list holds how many extra days to append to it. For example, to use days 1~18 the offset is 18-18=0; for the test window of days 1~30 the offset is 30-18=12. To use fewer days, adjust the offsets accordingly.

Besides that, we need a few helper functions to define the windows (a sanity check of the resulting windows follows the code below):

def features_addday_list():
    return [0, 1, 2, 3, 4, 5, 12]


def ups():
    return 1


def downs():
    return 18
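
As a sanity check (a sketch, not in the original post), we can print the feature window and the 7-day label window implied by these helpers for every offset:

for offset in features_addday_list():
    feature_window = (ups(), downs() + offset)                    # days used to build features
    label_window = (downs() + offset + 1, downs() + offset + 7)   # following 7 days, used as the label
    print(offset, 'features:', feature_window, 'label:', label_window)
# offsets 0~5 give the six training windows; offset 12 gives the test window
# (features over days 1~30, label over days 31~37, which is what we must predict)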

Splitting the windows

def get_label_list(start_day, end_day):
    result = split_data(launch, 'launch_day', start_day, end_day)['user_id'].drop_duplicates()
    return pd.Series(result)


up = downs() + 1
down = downs() + 7
data = register.loc[:, ['user_id']]
for label_num in range(len(features_addday_list()) - 1):
    print(label_num, up + label_num, down + label_num)
    label_list = get_label_list(up + label_num, down + label_num)
    label_name = 'label_' + str(label_num)
    data[label_name] = data['user_id'].isin(label_list).replace({True: 1, False: 0})
data.to_csv('../data/feature/data_label.csv', index=None)

img

data_label holds the labels built from the known data. The output shows that we obtain six label windows (days 19~25, 20~26, ...), which completes the construction of the training labels.

img

data_label.csv
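
A quick check on the positive rate of each label column (a sketch, not from the original post) shows what share of registered users ends up active in each 7-day window:

# the mean of a 0/1 label column equals the share of users active in that window
print(data.filter(like='label_').mean())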

Building the features

Building features that describe the user

We build the user features from the four tables.

Features from the register table

up = ups()
down = downs()
for feature_num in tqdm(features_addday_list()):

    # basic window variables
    feature_start = up + 0
    feature_end = down + feature_num
    print(feature_num, feature_start, feature_end)

    result_data = split_data(register, 'register_day', 1, feature_end)
    feature_data = split_data(register, 'register_day', feature_start, feature_end)

    # extract the features (register_type and device_type are already columns of result_data)
    # number of days between the last day of the feature window and the registration day
    result_data['maxday_red_registerday'] = max(feature_data['register_day']) - feature_data['register_day']
    result_data = result_data.fillna(max(feature_data['register_day']))
    del result_data['register_day']

    # save the result
    result_file_name = 'register_feature_'+str(feature_num)+'.csv'
    result_data.to_csv('../data/feature/'+result_file_name, index=None)

result_data is the output frame that collects the features; feature_data is the slice of the source table the features are extracted from; variables suffixed with _tmp hold temporary feature frames.

img

Features from the create table

up = ups()
down = downs()

for feature_num in tqdm(features_addday_list()):
    # basic window variables
    feature_start = up
    feature_end = down + feature_num
    print(feature_num, feature_start, feature_end)

    result_data = split_data(register, 'register_day', 1, feature_end).loc[:, ['user_id', 'register_day']]
    feature_data = split_data(create, 'create_day', feature_start, feature_end)

    # extract the features
    # number of videos the user created
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc='count').reset_index().rename(columns={"create_day": 'create_count'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # time gap between the mean/max/min creation day and the registration day / last day of the window
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc='mean').reset_index().rename(columns={"create_day": 'create_mean'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['createmean_red_register'] = result_data['create_mean'] - result_data['register_day']
    result_data['maxday_red_createmean'] = max(result_data['register_day']) - result_data['create_mean']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc=np.max).reset_index().rename(columns={"create_day": 'create_max'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['createmax_red_register'] = result_data['create_max'] - result_data['register_day']
    result_data['maxday_red_createmax'] = max(result_data['register_day']) - result_data['create_max']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc=np.min).reset_index().rename(columns={"create_day": 'create_min'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['createmin_red_register'] = result_data['create_min'] - result_data['register_day']
    result_data['maxday_red_createmin'] = max(result_data['register_day']) - result_data['create_min']
    result_data = result_data.fillna(-1)

    # span between the earliest and the latest creation day
    result_data['max_red_min_create'] = result_data['create_max'] - result_data['create_min']

    # whether the user created a video on the last day of the window
    result_data['create_at_lastday'] = pd.Series(
        result_data['create_max'] == max(feature_data['create_day'])).replace({True: 1, False: 0})

    # convert the mean/max/min creation day into a distance from the last day of the window
    result_data['create_mean'] = max(feature_data['create_day']) - result_data['create_mean']
    result_data['create_max'] = max(feature_data['create_day']) - result_data['create_max']
    result_data['create_min'] = max(feature_data['create_day']) - result_data['create_min']

    # variance/mean of the gaps between consecutive creation days
    feature_data_tmp = feature_data.drop_duplicates(['user_id', 'create_day']).sort_values(
        by=['user_id', 'create_day'])
    feature_data_tmp['create_gap'] = np.array(feature_data_tmp['create_day']) - np.array(
        feature_data_tmp.tail(1).append(feature_data_tmp.head(len(feature_data_tmp) - 1))['create_day'])

    feature_tmp = pd.pivot_table(feature_data_tmp, index='user_id', values='create_gap',
                                 aggfunc=(lambda a: np.average(a[1:]))).reset_index().rename(
        columns={"create_gap": 'create_gap_mean'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data_tmp, index='user_id', values='create_gap',
                                 aggfunc=(lambda a: np.var(a[1:]))).reset_index().rename(
        columns={"create_gap": 'create_gap_var'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # whether the user created videos on consecutive days / consecutively up to the last day
    result_data['always_create'] = [1 if i == 1 else 0 for i in result_data['create_gap_mean']]
    tmp = (result_data['create_at_lastday'] == 1).replace({True: 1, False: 0})
    result_data['always_create_atlast'] = tmp * result_data['always_create']
    del tmp

    # variance/kurtosis/skewness of the creation days
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc=np.var).reset_index().rename(columns={"create_day": 'create_var'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc=pd.Series.kurt).reset_index().rename(columns={"create_day": 'create_kurt'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='create_day',
                                 aggfunc=pd.Series.skew).reset_index().rename(columns={"create_day": 'create_skew'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # maximum number of videos created in a single day
    feature_data['max_create_in_oneday'] = 0
    feature_tmp = pd.pivot_table(feature_data, index=['user_id', 'create_day'], values='max_create_in_oneday',
                                 aggfunc='count').reset_index()
    feature_tmp = pd.DataFrame(feature_tmp.groupby(['user_id'])['max_create_in_oneday'].max()).reset_index()
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data.fillna(0, inplace=True)
    del result_data['register_day']

    # save the result
    result_file_name = 'create_feature_' + str(feature_num) + '.csv'
    result_data.to_csv('../data/feature/' + result_file_name, index=None)
  

img

Building features that describe the creator

Again, we build these features from the four tables.


Features from the launch table

up = ups()
down = downs()

for feature_num in tqdm(features_addday_list()):
    # basic window variables
    feature_start = up
    feature_end = down + feature_num
    print(feature_num, feature_start, feature_end)

    result_data = split_data(register, 'register_day', 1, feature_end).loc[:, ['user_id', 'register_day']]
    feature_data = split_data(launch, 'launch_day', feature_start, feature_end)

    # extract the features
    # launch count / launch rate
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc='count').reset_index().rename(columns={"launch_day": 'launch_count'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    distance = (max(feature_data['launch_day']) - min(feature_data['launch_day']))
    result_data['launch_ratio'] = result_data['launch_count'] * 1.0 / distance
    result_data = result_data.fillna(0)

    # time gap between the mean/max/min launch day and the registration day / last day of the window
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc='mean').reset_index().rename(columns={"launch_day": 'launch_mean'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['launchmean_red_register'] = result_data['launch_mean'] - result_data['register_day']
    result_data['maxday_red_launchmean'] = max(result_data['register_day']) - result_data['launch_mean']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc=np.max).reset_index().rename(columns={"launch_day": 'launch_max'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['launchmax_red_register'] = result_data['launch_max'] - result_data['register_day']
    result_data['maxday_red_launchmax'] = max(result_data['register_day']) - result_data['launch_max']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc=np.min).reset_index().rename(columns={"launch_day": 'launch_min'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['maxday_red_launchmin'] = max(result_data['register_day']) - result_data['launch_min']
    result_data = result_data.fillna(-1)

    # span between the earliest and the latest launch day
    result_data['max_red_min_launch'] = result_data['launch_max'] - result_data['launch_min']

    # whether the user launched the app on the last day of the window
    result_data['launch_at_lastday'] = pd.Series(result_data['launch_max'] == max(feature_data['launch_day'])).replace(
        {True: 1, False: 0})

    # convert the mean/max/min launch day into a distance from the last day of the window
    result_data['launch_mean'] = max(feature_data['launch_day']) - result_data['launch_mean']
    result_data['launch_max'] = max(feature_data['launch_day']) - result_data['launch_max']
    result_data['launch_min'] = max(feature_data['launch_day']) - result_data['launch_min']

    # variance/mean/max of the gaps between consecutive launch days
    feature_data_tmp = feature_data.drop_duplicates(['user_id', 'launch_day']).sort_values(by=['user_id', 'launch_day'])
    feature_data_tmp['launch_gap'] = np.array(feature_data_tmp['launch_day']) - np.array(
        feature_data_tmp.tail(1).append(feature_data_tmp.head(len(feature_data_tmp) - 1))['launch_day'])

    feature_tmp = pd.pivot_table(feature_data_tmp, index='user_id', values='launch_gap',
                                 aggfunc=(lambda a: np.average(a[1:]))).reset_index().rename(
        columns={"launch_gap": 'launch_gap_mean'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data_tmp, index='user_id', values='launch_gap',
                                 aggfunc=(lambda a: np.var(a[1:]))).reset_index().rename(
        columns={"launch_gap": 'launch_gap_var'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data_tmp, index='user_id', values='launch_gap',
                                 aggfunc=(lambda a: np.max(a[1:]))).reset_index().rename(
        columns={"launch_gap": 'launch_gap_max'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # whether the user launched on consecutive days / consecutively up to the last day
    result_data['always_launch'] = [1 if i == 1 else 0 for i in result_data['launch_gap_mean']]
    tmp = (result_data['launch_at_lastday'] == 1).replace({True: 1, False: 0})
    result_data['always_launch_atlast'] = tmp * result_data['always_launch']
    del tmp

    # variance/kurtosis/skewness of the launch days
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc=np.var).reset_index().rename(columns={"launch_day": 'launch_var'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc=pd.Series.kurt).reset_index().rename(columns={"launch_day": 'launch_kurt'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='launch_day',
                                 aggfunc=pd.Series.skew).reset_index().rename(columns={"launch_day": 'launch_skew'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)
    del result_data['register_day']

    # save the result
    result_file_name = 'launch_feature_' + str(feature_num) + '.csv'
    result_data.to_csv('../data/feature/' + result_file_name, index=None)

img

Features from the activity table

up = ups()
down = downs()

for feature_num in tqdm(features_addday_list()):
    
    # basic window variables
    feature_start = up
    feature_end = down + feature_num
    print(feature_num, feature_start, feature_end)
    
    result_data = split_data(register, 'register_day', 1, feature_end).loc[:, ['user_id', 'register_day']]
    feature_data = split_data(activity, 'act_day', feature_start, feature_end)
    
    # extract the features
    # activity count
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc='count').reset_index().rename(columns={"act_day": 'act_count'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # time gap between the mean/max/min activity day and the registration day / last day of the window
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc='mean').reset_index().rename(columns={"act_day": 'act_mean'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['actmean_red_register'] = result_data['act_mean'] - result_data['register_day']
    result_data['maxday_red_actmean'] = max(result_data['register_day']) - result_data['act_mean']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc=np.max).reset_index().rename(columns={"act_day": 'act_max'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['actmax_red_register'] = result_data['act_max'] - result_data['register_day']
    result_data['maxday_red_actmax'] = max(result_data['register_day']) - result_data['act_max']

    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc=np.min).reset_index().rename(columns={"act_day": 'act_min'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data['actmin_red_register'] = result_data['act_min'] - result_data['register_day']
    result_data['maxday_red_actmin'] = max(result_data['register_day']) - result_data['act_min']
    result_data = result_data.fillna(-1)

    # whether the user was active on the last day of the window
    result_data['act_at_lastday'] = pd.Series(result_data['act_max'] == max(feature_data['act_day'])).replace({True: 1, False: 0})

    # convert the mean/max/min activity day into a distance from the last day of the window
    result_data['act_mean'] = max(feature_data['act_day']) - result_data['act_mean']
    result_data['act_max'] = max(feature_data['act_day']) - result_data['act_max']
    result_data['act_min'] = max(feature_data['act_day']) - result_data['act_min']

    # number of interactions with the user's own videos (user_id == author_id)
    feature_tmp = pd.pivot_table(feature_data[feature_data['user_id'] == feature_data['author_id']],
                                 index='user_id', values='author_id', aggfunc='count').reset_index().rename(columns={"author_id": 'act_self_count'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # variance/kurtosis/skewness of the activity days
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc=np.var).reset_index().rename(columns={"act_day": 'act_var'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc=pd.Series.kurt).reset_index().rename(columns={"act_day": 'act_kurt'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    feature_tmp = pd.pivot_table(feature_data, index='user_id', values='act_day',
                                 aggfunc=pd.Series.skew).reset_index().rename(columns={"act_day": 'act_skew'})
    result_data = pd.merge(result_data, feature_tmp, on='user_id', how='left')
    result_data = result_data.fillna(0)

    # per-action-type counts / ratios (the column is named act_type, see the read_csv names above)
    feature_tmp = feature_data.loc[:, ['user_id', 'act_type', 'act_day']].groupby(['user_id', 'act_type']).count().reset_index().rename(columns={"act_day": 'action_count'})
    for i in range(6):
        fea_name = 'action_' + str(i) + '_count'
        action_tmp = feature_tmp[feature_tmp['act_type'] == i].loc[:, ['user_id', 'action_count']].rename(columns={"action_count": fea_name})
        result_data = pd.merge(result_data, action_tmp, how='left', on='user_id')
    result_data = result_data.fillna(0)
    result_data['action_all'] = (result_data['action_0_count']+result_data['action_1_count']+
                                 result_data['action_2_count']+result_data['action_3_count']+
                                 result_data['action_4_count']+result_data['action_5_count']).replace(0, 1)
    for i in range(6):
        fea_name = 'action_' + str(i) + '_ratio'
        fea_name_2 = 'action_' + str(i) + '_count'
        result_data[fea_name] = result_data[fea_name_2] / result_data['action_all']

    # per-page counts / ratios
    feature_tmp = feature_data.loc[:, ['user_id', 'page', 'act_day']].groupby(['user_id', 'page']).count().reset_index().rename(columns={"act_day": 'page_count'})
    for i in range(5):
        fea_name = 'page_' + str(i) + '_count'
        page_tmp = feature_tmp[feature_tmp['page'] == i].loc[:, ['user_id', 'page_count']].rename(columns={"page_count": fea_name})
        result_data = pd.merge(result_data, page_tmp, how='left', on='user_id')
    result_data = result_data.fillna(0)
    result_data['page_all'] = (result_data['page_0_count']+result_data['page_1_count']+
                               result_data['page_2_count']+result_data['page_3_count']+
                               result_data['page_4_count']).replace(0, 1)
    for i in range(5):
        fea_name = 'page_' + str(i) + '_ratio'
        fea_name_2 = 'page_' + str(i) + '_count'
        result_data[fea_name] = result_data[fea_name_2] / result_data['page_all']

    del result_data['page_all']
    del result_data['action_all']
    del result_data['register_day']

    # save the result
    result_file_name = 'activity_feature_' + str(feature_num) + '.csv'
    result_data.to_csv('../data/feature/' + result_file_name, index=None)

img

Selecting valuable features

img

Loading the training and test sets

def get_feature(num, data_label=None):
    path = '../data/feature/'

    register = pd.read_csv(path + 'register_feature_' + str(num) + '.csv')
    create = pd.read_csv(path + 'create_feature_' + str(num) + '.csv')
    launch = pd.read_csv(path + 'launch_feature_' + str(num) + '.csv')
    activity = pd.read_csv(path + 'activity_feature_' + str(num) + '.csv')

    feature = pd.merge(register, launch, on='user_id', how='left')
    feature = pd.merge(feature, activity, on='user_id', how='left')
    feature = pd.merge(feature, create, on='user_id', how='left')
    del register
    del create
    del launch

    if data_label is not None:
        label_name = 'label_' + str(num)
        data_label_tmp = data_label[data_label['user_id'].isin(feature['user_id'])]
        data_label_tmp = data_label_tmp.loc[:, ['user_id', label_name]]
        data_label_tmp.columns = ['user_id', 'label']
        feature = pd.merge(feature, data_label_tmp, on='user_id', how='left')

    return feature

This function stitches the four feature tables together on user_id.

Load the label data and the feature data

# load the label data
data_label = pd.read_csv('../data/feature/data_label.csv')

# load the feature data
test_x = get_feature('12')
train_x = get_feature('0', data_label).append(get_feature('1', data_label)).append(
    get_feature('2', data_label)).append(get_feature('3', data_label)).append(
    get_feature('4', data_label)).append(get_feature('5', data_label))
train_x = train_x.reset_index(drop=True)

train_y = train_x['label']
test_user = test_x['user_id']

The training windows are loaded one by one and appended into a single train_x.

train_x.head()

img

Feature selection

We first inspect the pairwise correlations between features with a heatmap.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 14))
plt.title('features correlation plot (Pearson)')
corr = train_x.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, linewidths=.1, cmap="Blues")
plt.show()

img

With this many features the heatmap is hard to read, so we sort the correlations instead.

train_x.corr().sort_values('label', ascending=False)

Taking label as an example: to see which columns correlate most strongly with it, the code above sorts the correlation matrix by the label column in descending order.

img

The output shows that some features have fairly high correlation with the label (around 0.5 or 0.4), while others are negatively correlated.
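
A simple follow-up (a sketch, not from the original post) is to keep only the features whose absolute correlation with the label exceeds some threshold; the value 0.1 here is arbitrary:

corr_with_label = train_x.corr()['label'].drop('label')
selected = corr_with_label[corr_with_label.abs() > 0.1].index.tolist()  # hypothetical threshold
print(len(selected), selected[:10])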

Variance threshold

from sklearn import feature_selection

feature_df = train_x[[f for f in train_x.columns if f not in ['user_id','label']]]
label = train_x['label']

# Filter
# variance threshold: keep the features whose variance is above the threshold
features_var = feature_selection.VarianceThreshold(threshold=0.7).fit_transform(feature_df)

With the threshold set to 0.7, the filter automatically keeps only the features whose variance exceeds it.
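
Note that fit_transform returns a bare NumPy array and drops the column names; here is a small sketch (not from the original post) of recovering the kept columns via get_support:

selector = feature_selection.VarianceThreshold(threshold=0.7)
features_var = selector.fit_transform(feature_df)
kept_columns = feature_df.columns[selector.get_support()]  # names of the surviving features
print(features_var.shape, list(kept_columns)[:10])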

img

Recursive feature elimination

# Wrapper
# recursive feature elimination with logistic regression as the base model; n_features_to_select is the number of features to keep
from sklearn.linear_model import LogisticRegression
features_rfe = feature_selection.RFE(estimator=LogisticRegression(), n_features_to_select=30).fit_transform(feature_df, label)
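
Logistic regression on unscaled count features can converge poorly; a hedged variant (not in the original post) standardizes the features first and raises max_iter:

from sklearn.preprocessing import StandardScaler

feature_scaled = StandardScaler().fit_transform(feature_df)
features_rfe = feature_selection.RFE(estimator=LogisticRegression(max_iter=1000),
                                     n_features_to_select=30).fit_transform(feature_scaled, label)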

img

Penalty-based feature selection

# Embedded
# penalty-based feature selection, using logistic regression with an L1 penalty as the base model
features_lr_embed = feature_selection.SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit_transform(feature_df, label)

img

Tree-based feature selection

from sklearn.ensemble import GradientBoostingClassifier
features_gbdt_embed = feature_selection.SelectFromModel(GradientBoostingClassifier()).fit_transform(feature_df, label)

img

Training the models

Before training, we need to drop the user_id column and the label:

label is our prediction target and must not be used as a feature.

del train_x['user_id']
del test_x['user_id']
del train_x['label']

Next, we wrap the three tree-model workhorses (LightGBM, XGBoost and CatBoost) into a single function:

# the gradient-boosting libraries and sklearn utilities used below
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score


def cv_model(clf, train_x, train_y, test_x, clf_name):
    # 5-fold stratified cross-validation, random seed 2022
    folds = 5
    seed = 2022
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)

    # out-of-fold predictions for the training set and fold-averaged predictions for the test set
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], \
                                     train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.5,
                'seed': 2022,
                'n_jobs': -1,
                'verbose': -1,
            }

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=2000, early_stopping_rounds=500)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            print(list(sorted(zip(trn_x.columns, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.5,
                      'tree_method': 'exact',
                      'seed': 2022,
                      'nthread': 36
                      }

            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=2000,
                              early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.5, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=2000)

            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))

        print(cv_scores)

    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

Parameter notes:

clf: the model library / class to use

train_x: training features

train_y: training labels

test_x: test-set features

clf_name: name of the chosen learner ('lgb' for LightGBM, 'xgb' for XGBoost, 'cat' for CatBoost)

Because the three models take different parameters inside cv_model, we define three additional wrapper functions that dispatch to the corresponding model:

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test


def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test


def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test

First, train LightGBM:

lgb_train, lgb_test = lgb_model(train_x, train_y, test_x)

img

Train XGBoost:

xgb_train, xgb_test = xgb_model(train_x, train_y, test_x)

Train CatBoost:

cat_train, cat_test = cat_model(train_x, train_y, test_x)

Model stacking

def stack_model(oof_1, oof_2, oof_3, predictions_1, predictions_2, predictions_3, y):
   
    train_stack = np.vstack([oof_1, oof_2, oof_3]).transpose()
    test_stack = np.vstack([predictions_1, predictions_2, predictions_3]).transpose()
    
    from sklearn.linear_model import BayesianRidge
    from sklearn.model_selection import RepeatedKFold
    folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2022)
    
    oof = np.zeros(train_stack.shape[0])
    predictions = np.zeros(test_stack.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, y)):
        print("fold n°{}".format(fold_+1))
        trn_data, trn_y = train_stack[trn_idx], y[trn_idx]
        val_data, val_y = train_stack[val_idx], y[val_idx]
        print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
        clf = BayesianRidge()
        clf.fit(trn_data, trn_y)

        oof[val_idx] = clf.predict(val_data)
        predictions += clf.predict(test_stack) / (5 * 2)
    print('stacking oof auc: ', roc_auc_score(y, oof))
    
    return oof, predictions


stack_train, stack_test = stack_model(lgb_train, xgb_train, cat_train, lgb_test, xgb_test, cat_test, train_y)
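
As a closing sketch (not part of the original post, with an arbitrary, hypothetical cut-off of 0.4), the stacked scores could be turned into the predicted active-user list:

threshold = 0.4  # hypothetical cut-off; in practice it would be tuned on the out-of-fold scores to maximize F1
submission = pd.DataFrame({'user_id': test_user, 'score': stack_test})
active_users = submission.loc[submission['score'] > threshold, 'user_id']
active_users.to_csv('active_user_prediction.csv', index=False, header=False)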