How the S-Learner Uplift Model Works
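The S-Learner uplift approach is also called the Treatment Dummy approach or the Single model approach. Taking a binary treatment as the example: at training time, a single model learns from individuals in both the treatment group and the control group, with a 0/1 "treatment flag" included as one feature column and fed to the model together with the other features. At prediction time, each individual is scored twice, once with the treatment flag set to 1 and once with it set to 0; the difference between the two predictions is the uplift of treating versus not treating.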
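In symbols, the description above can be written as follows. The notation is an assumption introduced here for illustration: $X$ denotes the features, $W \in \{0, 1\}$ the treatment flag, $Y$ the response, and $\hat{\mu}$ the single fitted model. For a binary response, $\mathbb{E}[Y \mid \cdot]$ is simply the response probability.

```latex
\hat{\mu}(x, w) \approx \mathbb{E}\left[\, Y \mid X = x,\; W = w \,\right],
\qquad
\widehat{\mathrm{uplift}}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)
```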
The approach described above is the dummy method of the SoloModel class in the scikit-uplift package. A variant of the dummy method is the Single model approach with treatment interaction features, which is the treatment_interaction method of the same SoloModel class.
Original WeChat official-account article: "Uplift Model:S-Learner类增益模型实战" (Uplift Model: S-Learner Uplift Modeling in Practice)
Source code of the SoloModel class in scikit-uplift
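This article uses the scikit-uplift package to train an S-Learner uplift model, which applies to binary treatments (treatment takes the value 0 or 1). In scikit-uplift the S-Learner is implemented by class SoloModel; the model is trained with fit() and scored with predict().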
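The key parts of the class SoloModel source are explained below. Start with fit(): when self.method equals dummy, the treatment column is treated as a binary feature and appended column-wise to the feature matrix X, giving X_mod, the final input used to train the model. The source handles X as a numpy array and as a pandas DataFrame separately: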
if self.method == 'dummy':
    if isinstance(X, np.ndarray):
        X_mod = np.column_stack((X, treatment))  # append treatment to X as an extra column
    elif isinstance(X, pd.DataFrame):
        X_mod = X.assign(treatment=treatment)  # add a new treatment column to X
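To make the two branches concrete, here is a minimal sketch that applies the same np.column_stack and DataFrame.assign calls to a tiny hand-made dataset. The values and the column names age and spend are hypothetical.

```python
import numpy as np
import pandas as pd

# Three individuals, two features each (hypothetical values).
X_arr = np.array([[20, 1.5], [35, 0.2], [50, 3.0]])
treatment = np.array([1, 0, 1])

# numpy branch: the treatment flag becomes the last column.
X_mod_arr = np.column_stack((X_arr, treatment))
print(X_mod_arr.shape)  # (3, 3)

# pandas branch: assign() returns a copy of X with a new 'treatment' column.
X_df = pd.DataFrame(X_arr, columns=['age', 'spend'])
X_mod_df = X_df.assign(treatment=treatment)
print(list(X_mod_df.columns))  # ['age', 'spend', 'treatment']
```

Both branches leave the original X untouched, since column_stack and assign each return a new object.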
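When self.method equals treatment_interaction, the input contains everything the dummy method uses plus the element-wise products of X with the treatment column, i.e. the features built with np.multiply in the source. For a pd.DataFrame input, the source builds the same interaction terms with apply: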
if self.method == 'treatment_interaction':
    if isinstance(X, np.ndarray):  # add the element-wise products of X and treatment
        X_mod = np.column_stack((X, np.multiply(X, np.array(treatment).reshape(-1, 1)), treatment))
    elif isinstance(X, pd.DataFrame):
        X_mod = pd.concat([X, X.apply(lambda x: x * treatment).rename(columns=lambda x: str(x) + '_treatment_interaction')], axis=1).assign(treatment=treatment)
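The interaction features are easier to see on a small example. The sketch below runs the same expressions on a toy DataFrame (hypothetical values and column names) and checks that the numpy and pandas branches yield the same matrix.

```python
import numpy as np
import pandas as pd

X_df = pd.DataFrame({'age': [20, 35, 50], 'spend': [1.5, 0.2, 3.0]})
treatment = pd.Series([1, 0, 1], name='treatment')

# pandas branch: each feature column is multiplied by the treatment flag,
# so the interaction columns are zero for control rows.
interactions = X_df.apply(lambda col: col * treatment).rename(
    columns=lambda c: str(c) + '_treatment_interaction')
X_mod_df = pd.concat([X_df, interactions], axis=1).assign(treatment=treatment)
print(X_mod_df)

# numpy branch: reshape treatment to a column vector so it broadcasts over X.
X_arr = X_df.to_numpy()
t_arr = treatment.to_numpy()
X_mod_arr = np.column_stack((X_arr, np.multiply(X_arr, t_arr.reshape(-1, 1)), t_arr))
print(X_mod_arr.shape)  # (3, 5)
```

One thing to keep in mind with the pandas branch: when treatment is a Series, the multiplication aligns on the index, so X and treatment should share the same index or the products will contain NaN.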
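With the features ready, the estimator is fitted on them. Any estimator that follows sklearn's training and prediction interface can be used. At prediction time, each sample is scored twice: once as if it were assigned to the treatment group (self.trmnt_preds_) and once as if it were assigned to the control group (self.ctrl_preds_); for a binary response these are response probabilities. The difference between the two is the uplift.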
if self._type_of_target == 'binary':  # the response variable is binary
    self.trmnt_preds_ = self.estimator.predict_proba(X_mod_trmnt)[:, 1]
    self.ctrl_preds_ = self.estimator.predict_proba(X_mod_ctrl)[:, 1]
else:
    self.trmnt_preds_ = self.estimator.predict(X_mod_trmnt)
    self.ctrl_preds_ = self.estimator.predict(X_mod_ctrl)
uplift = self.trmnt_preds_ - self.ctrl_preds_  # uplift = treated response probability - control response probability
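The excerpt above does not show how X_mod_trmnt and X_mod_ctrl are built. The following from-scratch sketch of the dummy method assumes they are simply X with the treatment column forced to all ones and all zeros, which matches the description in the text. The synthetic data and the LogisticRegression estimator are likewise assumptions for illustration, not the library's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): the treatment raises the response probability.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 2))
treatment = rng.integers(0, 2, size=n)
p = 1.0 / (1.0 + np.exp(-(0.3 * X[:, 0] + 0.8 * treatment)))
y = (rng.random(n) < p).astype(int)

# Training: one model, with the observed treatment flag as the last column.
estimator = LogisticRegression(max_iter=1000)
estimator.fit(np.column_stack((X, treatment)), y)

# Prediction: score every row twice, once as treated and once as control.
X_mod_trmnt = np.column_stack((X, np.ones(n)))
X_mod_ctrl = np.column_stack((X, np.zeros(n)))
trmnt_preds = estimator.predict_proba(X_mod_trmnt)[:, 1]
ctrl_preds = estimator.predict_proba(X_mod_ctrl)[:, 1]

uplift = trmnt_preds - ctrl_preds
print(uplift.mean())
```

With a plain logistic regression the treatment flag enters only as an additive term in the logit, so every row's uplift has the same sign; a tree-based estimator or the treatment_interaction features allow the effect to vary more freely across individuals.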
S-Learner Uplift Model: A Worked Example
Dataset preprocessing
from sklearn.model_selection import train_test_split
from sklift.datasets import fetch_x5
import pandas as pd
pd.set_option('display.max_columns', None)
%matplotlib inline
dataset = fetch_x5()
dataset.keys()
# dict_keys(['data', 'target', 'treatment', 'DESCR', 'feature_names', 'target_name', 'treatment_name'])
# Inspect the data structure
print(type(dataset['data']['clients']), dataset['data']['clients'].shape)
print(type(dataset['data']['purchases']), dataset['data']['purchases'].shape)
print(type(dataset['data']['train']), dataset['data']['train'].shape)
print()
print(type(dataset['target']), dataset['target'].shape, dataset['target'].value_counts())
print(type(dataset['treatment']), dataset['treatment'].shape, dataset['treatment'].value_counts())
# print(type(dataset['DESCR']), dataset['DESCR'])
print(type(dataset['feature_names']), dataset['feature_names'], len(dataset['feature_names']['train_features']), len(dataset['feature_names']['clients_features']), len(dataset['feature_names']['purchases_features']) )
print(type(dataset['target_name']), dataset['target_name'])
print(type(dataset['treatment_name']), dataset['treatment_name'])
# <class 'pandas.core.frame.DataFrame'> (400162, 5)
# <class 'pandas.core.frame.DataFrame'> (45786568, 13)
# <class 'pandas.core.frame.DataFrame'> (200039, 1)
# <class 'pandas.core.series.Series'> (200039,) 1 124002
# 0 76037
# Name: target, dtype: int64
# <class 'pandas.core.series.Series'> (200039,) 0 100058
# 1 99981
# Name: treatment_flg, dtype: int64
# <class 'sklearn.utils.Bunch'> {'train_features': ['client_id', 'treatment_flg', 'target'], 'clients_features': ['client_id', 'first_issue_date', 'first_redeem_date', 'age', 'gender'], 'purchases_features': ['client_id', 'transaction_id', 'transaction_datetime', 'regular_points_received', 'express_points_received', 'regular_points_spent', 'express_points_spent', 'purchase_sum', 'store_id', 'product_id', 'product_quantity', 'trn_sum_from_iss', 'trn_sum_from_red']} 3 5 13
# <class 'str'> target
# <class 'str'> treatment_flg
# Extract features
df_clients = dataset.data['clients'].set_index("client_id")  # use client_id as the index of the clients table
print(df_clients.shape, df_clients.columns)
df_train = pd.concat([dataset.data['train'], dataset.treatment, dataset.target], axis=1).set_index("client_id")  # join train / treatment / target
print(df_train.shape, df_train.columns)
indices_test = pd.Index(set(df_clients.index) - set(df_train.index))  # clients that have no treatment/target label
print(len(indices_test), indices_test)
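The index bookkeeping above can be tried without downloading X5. The sketch below uses toy stand-in tables (hypothetical values, same column names) and shows one way the labelled clients could then be split with train_test_split; the split ratio, the random seed, and the names indices_learn and indices_valid are assumptions, not taken from the original article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the X5 tables (hypothetical values).
df_clients = pd.DataFrame({
    'client_id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'age': [25, 41, 33, 58, 47, 29],
}).set_index('client_id')
df_train = pd.DataFrame({
    'client_id': ['a', 'b', 'c', 'd'],
    'treatment_flg': [1, 0, 1, 0],
    'target': [1, 0, 0, 1],
}).set_index('client_id')

# Clients without a label form the unlabelled "test" set.
indices_test = pd.Index(set(df_clients.index) - set(df_train.index))

# Split the labelled clients into a learning part and a validation part.
indices_learn, indices_valid = train_test_split(
    list(df_train.index), test_size=0.5, random_state=123)

# Look up features, response and treatment flag by client_id.
X_learn = df_clients.loc[indices_learn]
y_learn = df_train.loc[indices_learn, 'target']
treat_learn = df_train.loc[indices_learn, 'treatment_flg']
print(sorted(indices_test), X_learn.shape)
```

Splitting on client_id rather than on row position keeps the features, the response, and the treatment flag aligned even though they live in different tables.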
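For reasons of space, the more detailed code and the full case study are in the original WeChat article: "Uplift Model:S-Learner类增益模型实战" (Uplift Model: S-Learner Uplift Modeling in Practice)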