Last time we walked through a Python implementation of the PSI calculation. This is the final article in the PSI series. It covers computing the PSI of model scores, the PSI of continuous and discrete features, and a Hive implementation of the calculation, and it closes with a simple hands-on example on Kaggle. Let's get started.
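As a quick refresher before the code, with $A_i$ the share of training-set ("actual") values and $P_i$ the share of test-set ("predict") values falling in bin $i$ of $k$ bins, the PSI is

```latex
\mathrm{PSI} = \sum_{i=1}^{k} \left(A_i - P_i\right)\,\ln\frac{A_i}{P_i}
```

which is exactly the quantity accumulated term by term in `psi_cut` in the implementation below.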
The Python implementation of the PSI calculation is as follows:
import math
import numpy as np
import matplotlib.pyplot as plt

def psi_calc(actual, predict, bins=10):
    '''
    Purpose: compute the PSI and plot the actual vs. expected proportion curves
    Inputs:
        actual: 1-D array or Series, model scores on the training set
        predict: 1-D array or Series, model scores on the test set
        bins: number of bins to split the scores into
    Output:
        dict of the form {'psi': PSI value, 'psi_fig': actual vs. expected proportion curves}
    '''
    psi_dict = {}
    actual = np.sort(actual)
    predict = np.sort(predict)
    actual_len = len(actual)
    predict_len = len(predict)
    psi_cut = []
    actual_bins = []
    predict_bins = []
    actual_min = actual.min()
    actual_max = actual.max()
    cuts = []
    binlen = (actual_max - actual_min) / bins
    for i in range(1, bins):
        cuts.append(actual_min + i * binlen)
    for i in range(1, bins + 1):
        if i == 1:
            lowercut = float('-Inf')
            uppercut = cuts[i-1]
        elif i == bins:
            lowercut = cuts[i-2]
            uppercut = float('Inf')
        else:
            lowercut = cuts[i-2]
            uppercut = cuts[i-1]
        # +1 smoothing so an empty bin does not produce log(0)
        actual_cnt = ((actual >= lowercut) & (actual < uppercut)).sum() + 1
        predict_cnt = ((predict >= lowercut) & (predict < uppercut)).sum() + 1
        actual_pct = (actual_cnt + 0.0) / actual_len
        predict_pct = (predict_cnt + 0.0) / predict_len
        psi_cut.append((actual_pct - predict_pct) * math.log(actual_pct / predict_pct))
        actual_bins.append(actual_pct)
        predict_bins.append(predict_pct)
    psi = sum(psi_cut)
    nbins = len(actual_bins)
    xlab = np.arange(1, nbins + 1)
    fig = plt.figure()
    plt.plot(xlab, np.array(actual_bins), 'r', label='actual')
    plt.plot(xlab, np.array(predict_bins), 'b', label='predict')
    plt.legend(loc='best')
    plt.title('PSI Curve')
    plt.close()
    psi_dict['psi'] = psi
    psi_dict['psi_fig'] = fig
    return psi_dict
In the code above, the variable is binned with the simplest unsupervised scheme, equal-width binning, which is also how the original definition of PSI specifies it.
Later we will introduce several other binning methods; any of them can replace the binning code here and yield the PSI under a different binning.
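As one drop-in alternative to the equal-width scheme, here is a minimal sketch of PSI under equal-frequency (quantile) binning. The function name and the synthetic data are illustrative, and the +1 smoothing mirrors the code above:

```python
import numpy as np

def psi_quantile(actual, predict, bins=10):
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    # Quantile cut points so each bin holds roughly 1/bins of the training data
    edges = np.quantile(actual, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # open-ended outer bins
    psi = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # +1 smoothing, as in the code above, to avoid log(0)
        a = (((actual >= lo) & (actual < hi)).sum() + 1) / len(actual)
        p = (((predict >= lo) & (predict < hi)).sum() + 1) / len(predict)
        psi += (a - p) * np.log(a / p)
    return psi

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10000)
test_same = rng.normal(0, 1, 10000)      # same distribution -> PSI near 0
test_shifted = rng.normal(0.5, 1, 10000)  # shifted distribution -> larger PSI
print(psi_quantile(train, test_same))
print(psi_quantile(train, test_shifted))
```

Because each bin holds roughly the same share of the training data, quantile binning is less sensitive than equal-width binning to outliers stretching the value range.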
Feature PSI Calculation: Python Implementation
The code above is generally used to compute the PSI of model scores, and it rests on one assumption: the variable is continuous. If the feature whose PSI we want is discrete and has fewer distinct values than the number of bins, that assumption breaks. To handle this case, we rewrote the code above so it computes the PSI for both types of feature:
def fea_psi_calc(actual, predict, bins=10):
    '''
    Purpose: compute the PSI for both continuous and discrete variables
    Inputs:
        actual: 1-D array or Series, the variable in the training set
        predict: 1-D array or Series, the variable in the test set
        bins: number of bins to split the values into
    Output:
        dict of the form {'psi': PSI value}
    '''
    psi_dict = {}
    actual = np.sort(actual)
    actual_distinct = np.sort(list(set(actual)))
    predict = np.sort(predict)
    actual_len = len(actual)
    actual_distinct_len = len(actual_distinct)
    predict_len = len(predict)
    psi_cut = []
    actual_bins = []
    predict_bins = []
    actual_min = actual.min()
    actual_max = actual.max()
    cuts = []
    binlen = (actual_max - actual_min) / bins
    if actual_distinct_len < bins:
        # Discrete case: the distinct values themselves serve as cut points
        for i in actual_distinct:
            cuts.append(i)
        for i in range(2, actual_distinct_len + 1):
            # Open-ended outer bins so test values outside the training range
            # are still counted (the original `if i == bins` check could never
            # fire in this branch, since actual_distinct_len < bins)
            lowercut = float('-Inf') if i == 2 else cuts[i-2]
            uppercut = float('Inf') if i == actual_distinct_len else cuts[i-1]
            # +1 smoothing so an empty bin does not produce log(0)
            actual_cnt = ((actual >= lowercut) & (actual < uppercut)).sum() + 1
            predict_cnt = ((predict >= lowercut) & (predict < uppercut)).sum() + 1
            actual_pct = (actual_cnt + 0.0) / actual_len
            predict_pct = (predict_cnt + 0.0) / predict_len
            psi_cut.append((actual_pct - predict_pct) * math.log(actual_pct / predict_pct))
            actual_bins.append(actual_pct)
            predict_bins.append(predict_pct)
    else:
        # Continuous case: equal-width binning, same as psi_calc above
        for i in range(1, bins):
            cuts.append(actual_min + i * binlen)
        for i in range(1, bins + 1):
            if i == 1:
                lowercut = float('-Inf')
                uppercut = cuts[i-1]
            elif i == bins:
                lowercut = cuts[i-2]
                uppercut = float('Inf')
            else:
                lowercut = cuts[i-2]
                uppercut = cuts[i-1]
            actual_cnt = ((actual >= lowercut) & (actual < uppercut)).sum() + 1
            predict_cnt = ((predict >= lowercut) & (predict < uppercut)).sum() + 1
            actual_pct = (actual_cnt + 0.0) / actual_len
            predict_pct = (predict_cnt + 0.0) / predict_len
            psi_cut.append((actual_pct - predict_pct) * math.log(actual_pct / predict_pct))
            actual_bins.append(actual_pct)
            predict_bins.append(predict_pct)
    psi = sum(psi_cut)
    psi_dict['psi'] = psi
    return psi_dict
Note that the PSI computed this way can differ slightly from the usual ten-equal-bin PSI, so adjust your thresholds based on experience. (In general, the fewer bins a discrete variable has, the smaller its PSI; you can verify this with the code above by varying bins.)
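The effect of the bin count can be checked directly. Below is a self-contained sketch (illustrative names, synthetic data) using the same equal-width-plus-smoothing logic as the code above; on this drifted pair, coarser binning yields a smaller PSI:

```python
import numpy as np

def psi_equal_width(actual, predict, bins):
    actual, predict = np.asarray(actual, float), np.asarray(predict, float)
    edges = np.linspace(actual.min(), actual.max(), bins + 1)
    edges[0], edges[-1] = -np.inf, np.inf  # open-ended outer bins
    psi = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # +1 smoothing, as in the code above
        a = (((actual >= lo) & (actual < hi)).sum() + 1) / len(actual)
        p = (((predict >= lo) & (predict < hi)).sum() + 1) / len(predict)
        psi += (a - p) * np.log(a / p)
    return psi

rng = np.random.default_rng(42)
base = rng.normal(0, 1, 5000)         # "training" sample
drifted = rng.normal(0.3, 1.1, 5000)  # shifted "test" sample
for b in (2, 5, 10, 20):
    print(b, round(psi_equal_width(base, drifted, b), 4))
```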
PSI Calculation: Hive Implementation
Above we showed how to compute feature PSI in Python. In practice, when you want fast online iteration to see results quickly, SQL is undoubtedly the handier weapon, so here we implement the PSI calculation in SQL.
select
    feature,
    dt,
    sum(psi_bins) as psi
from
(
    select
        feature,
        bins,
        (last_rate - rate) * ln(last_rate / rate) as psi_bins,
        dt
    from
    (
        select
            feature,
            bins,
            rate,
            -- previous period's rate: the next row in descending dt order
            lead(rate, 1, 0.000001) over (
                partition by feature, bins
                order by dt desc
            ) as last_rate,
            dt
        from table_name
    ) a
) b
group by feature, dt
This assumes the features have already been discretized, which is typically done with case when; it also assumes the proportion of values falling in each bin of each feature has been computed in advance. That part is straightforward, so we will not cover it here (feel free to message me with questions). Below we focus on the window function used here, lead.
The Hive lead function
Hive has a class of functions built specifically for analytics, known as window functions. Window operations greatly simplify table self-joins and are more efficient; lag and lead are two representatives.
LAG(col, n, DEFAULT) returns the value n rows above the current row within the window.
LEAD(col, n, DEFAULT), the opposite of LAG, returns the value n rows below the current row within the window.
Computing PSI requires comparing the current period with the previous one. To avoid a self-join, we use lead to pull the value one row down (with dt in descending order) as the previous period's value, aligning it on the same row as the current period so the calculation is straightforward. For usage details, see "Hive 分析函数lead、lag实例应用" in the references.
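The same lead-based computation can be mirrored in pandas, which may help make the SQL concrete. This is an illustrative sketch: the column names follow the SQL above, the data is made up, and groupby plus shift(1) over ascending dt plays the role of lead(rate, 1, 0.000001) over descending dt:

```python
import numpy as np
import pandas as pd

# Pre-aggregated per-bin rates, one row per (feature, bin, period)
rates = pd.DataFrame({
    "feature": ["f1"] * 6,
    "bins":    [0, 1, 2, 0, 1, 2],
    "dt":      ["2019-01"] * 3 + ["2019-02"] * 3,
    "rate":    [0.5, 0.3, 0.2, 0.4, 0.35, 0.25],
})

rates = rates.sort_values(["feature", "bins", "dt"])
# shift(1) within (feature, bins): the previous period's rate; the oldest
# period has no previous rate, so it gets the SQL's default 0.000001
rates["last_rate"] = (rates.groupby(["feature", "bins"])["rate"]
                           .shift(1)
                           .fillna(0.000001))
rates["psi_bins"] = (rates["last_rate"] - rates["rate"]) * np.log(rates["last_rate"] / rates["rate"])
psi = rates.groupby(["feature", "dt"])["psi_bins"].sum()
print(psi)
```

As in the SQL, the oldest period's PSI is inflated by the tiny default rate and is usually discarded.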
Kaggle Hands-On
Kaggle is a great platform, with plenty of competitions, projects, and datasets to experiment on. It also gives everyone free cloud compute via online kernels (plus 30 hours of GPU time and 30 hours of TPU time per month, courtesy of Google), which really gives you a "Kaggle in hand, the world is mine" feeling.
This time we use credit data: Lending Club Loan Data. You can find it by searching the Kaggle homepage (the link is also in the references). The data import and some of the feature processing below follow the highly upvoted kernel "Lending Club || Risk Analysis and Metrics".
1. Import the data
# Import our libraries we are going to use for our data analysis.
import tensorflow as tf
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Plotly visualizations
from plotly import tools
# import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
# For oversampling Library (Dealing with Imbalanced Datasets)
from imblearn.over_sampling import SMOTE
from collections import Counter
# Other Libraries
import time
%matplotlib inline
df = pd.read_csv('../input/lending-club-loan-data/loan.csv',low_memory=False)
df = df.sample(n = 1000000,replace=False,random_state=1024,axis=0)
# Copy of the dataframe
original_df = df.copy()
df.head()
2. Basic column processing
# Replace the name of some columns
df = df.rename(columns={"loan_amnt": "loan_amount", "funded_amnt": "funded_amount", "funded_amnt_inv": "investor_funds",
"int_rate": "interest_rate", "annual_inc": "annual_income"})
# Drop irrelevant columns
df.drop(['id', 'member_id', 'emp_title', 'url', 'desc', 'zip_code', 'title'], axis=1, inplace=True)
3. Target (Y) processing
# Determining the loans that are bad from loan_status column
bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period",
            "Late (16-30 days)", "Late (31-120 days)"]

def loan_condition(status):
    if status in bad_loan:
        return 'Bad Loan'
    else:
        return 'Good Loan'

df['loan_condition'] = df['loan_status'].apply(loan_condition)
4. Bad-loan rate analysis
f, ax = plt.subplots(1,2, figsize=(16,8))
colors = ["#3791D7", "#D72626"]
labels ="Good Loans", "Bad Loans"
plt.suptitle('Information on Loan Conditions', fontsize=20)
df["loan_condition"].value_counts().plot.pie(explode=[0,0.25], autopct='%1.2f%%', ax=ax[0], shadow=True, colors=colors,
labels=labels, fontsize=12, startangle=70)
# ax[0].set_title('State of Loan', fontsize=16)
ax[0].set_ylabel('% of Condition of Loans', fontsize=14)
# sns.countplot('loan_condition', data=df, ax=ax[1], palette=colors)
# ax[1].set_title('Condition of Loans', fontsize=20)
# ax[1].set_xticklabels(['Good', 'Bad'], rotation='horizontal')
palette = ["#3791D7", "#E01E1B"]
sns.barplot(x="year", y="loan_amount", hue="loan_condition", data=df, palette=palette, estimator=lambda x: len(x) / len(df) * 100, ax=ax[1])  # 'year' is derived from issue_d in the reference kernel
ax[1].set(ylabel="(%)")
5. More feature processing
# Copy Dataframe
complete_df = df.copy()
# Handling Missing Numeric Values
# Fill missing values in these numeric columns with 0
for col in ('dti_joint', 'annual_inc_joint', 'il_util', 'mths_since_rcnt_il', 'open_acc_6m', 'open_il_12m',
            'open_il_24m', 'inq_last_12m', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl',
            'mths_since_last_record', 'mths_since_last_major_derog', 'mths_since_last_delinq', 'total_bal_il', 'tot_coll_amt',
            'tot_cur_bal', 'total_rev_hi_lim', 'revol_util', 'collections_12_mths_ex_med', 'open_acc', 'inq_last_6mths',
            'verification_status_joint', 'acc_now_delinq'):
    complete_df[col] = complete_df[col].fillna(0)
# 'region' is derived from addr_state in the reference kernel.
# Fill the date columns with the per-region mode (note: the reference kernel's
# x.fillna(x.mode) is missing the call; x.mode()[0] is what is intended)
complete_df["next_pymnt_d"] = complete_df.groupby("region")["next_pymnt_d"].transform(lambda x: x.fillna(x.mode()[0]))
complete_df["last_pymnt_d"] = complete_df.groupby("region")["last_pymnt_d"].transform(lambda x: x.fillna(x.mode()[0]))
complete_df["last_credit_pull_d"] = complete_df.groupby("region")["last_credit_pull_d"].transform(lambda x: x.fillna(x.mode()[0]))
complete_df["earliest_cr_line"] = complete_df.groupby("region")["earliest_cr_line"].transform(lambda x: x.fillna(x.mode()[0]))
# Fill derogatory public records with the per-region median
complete_df["pub_rec"] = complete_df.groupby("region")["pub_rec"].transform(lambda x: x.fillna(x.median()))
# Fill annual income with the mean income of the client's region
complete_df["annual_income"] = complete_df.groupby("region")["annual_income"].transform(lambda x: x.fillna(x.mean()))
# Fill the total number of credit lines with the per-region median
complete_df["total_acc"] = complete_df.groupby("region")["total_acc"].transform(lambda x: x.fillna(x.median()))
# Fill delinquencies in the past two years with the per-region mean
complete_df["delinq_2yrs"] = complete_df.groupby("region")["delinq_2yrs"].transform(lambda x: x.fillna(x.mean()))
# Numeric target derived from loan_condition (1 = Bad Loan, 0 = Good Loan), used below
complete_df['loan_condition_int'] = (complete_df['loan_condition'] == 'Bad Loan').astype(int)
len(complete_df['loan_condition_int'])
# Loan Ratios (Imbalanced classes)
complete_df['loan_condition_int'].value_counts() / len(complete_df['loan_condition_int']) * 100
6. Train/test split (by time)
def data_train_test_split(df, split_date, strat_day='2017-09-01', apply_day='2017-10-01', end_day='2017-11-01', overdue='y30'):
    '''
    Purpose: split the dataset into training and test sets by date
    Inputs:
        df: DataFrame, the dataset
        split_date: str, name of the date column used for splitting
        strat_day: str, start of the training window, format %Y-%m-%d, e.g. '2017-09-01'
        apply_day: str, start of the test window (end of the training window), format %Y-%m-%d, e.g. '2017-10-01'
        end_day: str, end of the test window, format %Y-%m-%d, e.g. '2017-11-01'
        overdue: str, name of the target column (not used in the split itself)
    Outputs:
        df_train: DataFrame, training set
        df_test: DataFrame, test set
    '''
    df[split_date] = pd.to_datetime(df[split_date].astype(str)).apply(lambda x: x.strftime("%Y-%m-%d"))
    df_train = df[(df[split_date] >= strat_day) & (df[split_date] < apply_day)]
    df_test = df[(df[split_date] >= apply_day) & (df[split_date] < end_day)]
    return df_train, df_test

df_train, df_test = data_train_test_split(complete_df, split_date='complete_date', strat_day='2017-01-01',
                                          apply_day='2018-01-01', end_day='2019-01-01', overdue='loan_condition_int')
7. Compute feature PSI
Here we reuse the fea_psi_calc function defined in the feature-PSI section above to compute single-variable PSI; the definition is identical, so it is not repeated (just make sure math and numpy are imported in the notebook).
8. Select features by PSI
# X_columns is the list of candidate feature names built earlier in the notebook
columns_select = list(X_columns)
fea_psi_compare = pd.DataFrame(index=['df_train'], columns=columns_select)
for column in columns_select:
    psi = fea_psi_calc(df_train[column].dropna(), df_test[column].dropna())['psi']
    fea_psi_compare.loc['df_train', column] = psi
    print(column, psi)
fea_psi_compare_T = fea_psi_compare.T
fea_psi_compare_T[fea_psi_compare_T['df_train'] > 1].sort_values(by='df_train', ascending=False).head()
psi_drop_columns = fea_psi_compare_T[fea_psi_compare_T['df_train'] > 0.25].sort_values(by='df_train', ascending=False).index.tolist()
print('psi_drop_columns:{}'.format(psi_drop_columns))
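For context on that 0.25 cutoff, a widely quoted rule of thumb (an industry convention, not something derived from this dataset) can be sketched as:

```python
# Common rule-of-thumb interpretation of PSI values; the cutoffs 0.1 and 0.25
# are conventional and can be tuned per the note above.
def psi_level(psi):
    if psi < 0.1:
        return "stable"          # little to no population shift
    elif psi < 0.25:
        return "moderate shift"  # keep monitoring the feature/model
    else:
        return "major shift"     # consider dropping the feature or retraining

print(psi_level(0.03), psi_level(0.18), psi_level(0.6))
```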
The output is shown above; features can then be filtered with an experience-based threshold. That's it for this share. Upcoming directions: 1. walkthroughs of strong Kaggle kernels; 2. risk-control business knowledge; 3. risk-control technical details. Looking forward to your feedback!
References
Hive 分析函数lead、lag实例应用: https://blog.csdn.net/kent7306/article/details/50441967
https://www.kaggle.com/wendykan/lending-club-loan-data
https://www.kaggle.com/janiobachmann/lending-club-risk-analysis-and-metrics
https://www.kaggle.com/monkebai/lending-club-baseline/edit