Financial Risk Control: Loan Default Prediction


Preface

The Alibaba Tianchi platform offers a rich collection of datasets and is a good way to sharpen one's data-analysis thinking. This post works through a loan-default prediction task to deepen the understanding of training different models, with the hope of further refining a data-analysis framework that can be applied to real projects.

Module Imports, Jupyter Setup, and Dataset Loading

Import the required modules.

import pandas as pd, numpy as np, matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.exceptions import ConvergenceWarning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, GridSearchCV

Some upfront configuration.

# Set the font so Chinese characters render correctly.
mpl.rcParams['font.sans-serif'] = [u'SimHei']
# Prevent minus signs from showing as empty boxes in plots.
mpl.rcParams['axes.unicode_minus'] = False
# Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
# Seaborn font setting.
sns.set(font='SimHei')

Load the datasets.

data_submit = pd.read_csv('sample_submit.csv')
data_testA  = pd.read_csv('testA.csv')
data_train  = pd.read_csv('train.csv')
data_train
______________________________________________________

id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	...	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	...	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	...	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
3	3	11000.0	3	7.26	340.96	A	A4	46854.0	10+ years	1	...	16.0	4.0	7.0	21.0	6.0	9.0	0.0	0.0	0.0	1.0
4	4	3000.0	3	12.99	101.07	C	C2	54.0	NaN	1	...	4.0	9.0	10.0	15.0	7.0	12.0	0.0	0.0	0.0	4.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
799995	799995	25000.0	3	14.49	860.41	C	C4	2659.0	7 years	1	...	6.0	2.0	12.0	13.0	10.0	14.0	0.0	0.0	0.0	3.0
799996	799996	17000.0	3	7.90	531.94	A	A4	29205.0	10+ years	0	...	15.0	16.0	2.0	19.0	2.0	7.0	0.0	0.0	0.0	0.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	...	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	...	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	...	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0
800000 rows × 47 columns

Exploratory Data Analysis (EDA)

Dataset Overview

This mainly covers the data type information, a preview of the statistical distribution, and the dataset's dimensions.

# Inspect data types
data_train.info()
# Inspect dimensions
data_train.shape
# Inspect the statistical summary
data_train.describe()
——————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

Inspecting Missing Values

Find the columns that contain missing values, extract them, and visualize the counts.

# Store the number of missing values per column in a DataFrame with a 'counts' column.
nan_num = pd.DataFrame(data_train.isnull().sum(), columns=['counts'])
# Keep only the columns that actually have missing values.
data_nan = nan_num[nan_num['counts'] > 0]
# Sort by count.
data_nan.sort_values(by='counts', inplace=True)
data_nan
————————————————————————————————————————————————————
	counts
employmentTitle	1
postCode	1
title	       1
dti	239
pubRecBankruptcies	405
revolUtil	531
n10	33239
n4	33239
n12	40270
n9	40270
n7	40270
n6	40270
n3	40270
n13	40270
n2	40270
n1	40270
n0	40270
n5	40270
n14	40270
n8	40271
employmentLength	46799
n11	69752

Visualize the missing-value counts.

# pandas creates its own figure, so pass the Axes explicitly to control the size.
fig, ax = plt.subplots(figsize=(30, 10))
data_nan.plot.hist(ax=ax)

[Figure: histogram of missing-value counts]

Outlier analysis mainly checks whether values fall within an expected range; box plots are a convenient way to view each feature's distribution (a quick sketch follows). After that we look at how many distinct values each column takes.
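The original post mentions box plots but does not show the code; a minimal sketch, with a column choice and sample size of my own for illustration, might look like this:

# Hypothetical sketch: box plots of a few continuous columns on a 50k-row sample.
sample = data_train.head(50000)
cols = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'dti', 'revolBal']
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), cols):
    sns.boxplot(y=sample[col], ax=ax)   # outliers show up as points beyond the whiskers
    ax.set_title(col)
plt.tight_layout()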

# Number of distinct values per column
data_u = data_train.nunique().sort_values()
data_u
______________________________________________________
policyCode                 1
term                       2
applicationType            2
initialListStatus          2
isDefault                  2
verificationStatus         3
n12                        5
n11                        5
homeOwnership              6
grade                      7
pubRecBankruptcies        11
employmentLength          11
purpose                   14
n13                       28
delinquency_2years        30
n14                       31
pubRec                    32
n1                        33
subGrade                  35
n0                        39
ficoRangeLow              39
ficoRangeHigh             39
n9                        44
n4                        46
n2                        50
n3                        50
regionCode                51
n5                        65
n7                        70
openAcc                   75
n10                       76
n8                       102
n6                       107
totalAcc                 134
issueDate                139
interestRate             641
earliesCreditLine        720
postCode                 932
revolUtil               1286
loanAmnt                1540
dti                     6321
title                  39644
annualIncome           44926
revolBal               71116
installment            72360
employmentTitle       248683
id                    800000
dtype: int64

Data Type Analysis

From the earlier inspection we know the dataset contains three dtypes: object, int64, and float64. Below we split the columns by dtype and analyze each group in turn.

num_type=data_train.select_dtypes(exclude=['object'])
cls_type=data_train.select_dtypes(include=['object'])
num_type.columns ,cls_type.columns
——————————————————————————————————————————————————————
(Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment',        'employmentTitle', 'homeOwnership', 'annualIncome',        'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode',        'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',        'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',        'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0',        'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11',        'n12', 'n13', 'n14'],
       dtype='object'),
 Index(['grade', 'subGrade', 'employmentLength', 'issueDate',        'earliesCreditLine'],
       dtype='object'))

Numeric Data

Write a function that splits the numeric columns into discrete and continuous groups; for tidy display the results are returned as transposed ndarrays.

def classify_types():
    num_continuous = []
    num_discrete = []
    global data_train, num_type
    for name in num_type:
        counts = data_train[name].nunique()
        if counts <= 10:        # 10 or fewer distinct values -> treat as discrete
            num_discrete.append(name)
        else:
            num_continuous.append(name)
    return np.array(num_discrete).T, np.array(num_continuous).T

num_discrete, num_continuous = classify_types()

Continuous columns:

 array(['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle',
        'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti',
        'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
        'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil',
        'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6',
        'n7', 'n8', 'n9', 'n10', 'n13', 'n14'], dtype='<U18')

Discrete columns:

array(['term', 'homeOwnership', 'verificationStatus', 'isDefault',
        'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12'],
       dtype='<U18')

Inspect the discrete columns; the output is long, so only the beginning is shown.

for i in num_discrete:
    print(f'{i}:{data_train[i]}')

[Figure: preview of the discrete columns' values]

The continuous columns can be analyzed visually. Because the full dataset is large and plotting is slow, we take 50,000 rows and check whether each feature looks roughly normally distributed.

data_continuous = data_train[num_continuous]
data_partial = data_continuous.head(50000)
df = pd.melt(data_partial, value_vars=num_continuous)
sp = sns.FacetGrid(df, col='variable', col_wrap=3, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='r', rug=True)

[Figure: distribution plots of the continuous features]
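To complement the visual check, a numeric look at skewness can flag which features deviate most from normality. This is my own addition (the skew() call on the same 50,000-row sample), not part of the original post:

# Hypothetical sketch: skewness of the sampled continuous columns.
skewness = data_partial[num_continuous].skew().sort_values(ascending=False)
print(skewness.head(10))   # the most right-skewed features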

Categorical Data

First, take a look at how the categorical columns are distributed. [Figure: overview of the categorical columns]

[Figure: value counts of the categorical columns] Categorical columns can also be analyzed visually; here we plot the loan-grade counts. Grades B, C, and A are the most common.

plt.figure(figsize=(8, 5))
grade_counts = data_train['grade'].value_counts()
sns.barplot(x=grade_counts.index, y=grade_counts.values)

[Figure: bar chart of loan-grade counts]

Next, look at the distribution of employment length: the 10+ years group is by far the largest, while the other lengths are fairly evenly distributed.

plt.figure(figsize=(8, 5))
emp_counts = data_train['employmentLength'].value_counts()
sns.barplot(x=emp_counts.index, y=emp_counts.values)

[Figure: bar chart of employment-length counts]

Relationships Between Features and the Target

First, look at the distribution of the target variable.

data_train['isDefault'].value_counts().plot.bar(color='g')

[Figure: bar chart of isDefault counts]

Next, analyze how default status relates to the feature values.

Split the data into defaulted (isDefault=1) and non-defaulted (isDefault=0) subsets.

data_y = data_train[data_train['isDefault']==1]
data_n = data_train[data_train['isDefault']==0]

For the defaulted and non-defaulted subsets, look at how employmentLength and grade relate to default status.

fig,((ax1,ax2),(ax3,ax4)) =plt.subplots(2,2,figsize=(14,8))
data_y.groupby('employmentLength').size().plot.bar(ax=ax1,color='r',alpha=0.8)
data_n.groupby('employmentLength')['employmentLength'].count().plot.bar(ax=ax2,color='r',alpha=0.85)
data_y.groupby('grade').size().plot.bar(ax=ax3,color='g',alpha=0.8)
data_n.groupby('grade')['grade'].count().plot.bar(ax=ax4,color='g',alpha=0.85)
plt.xticks(rotation=90)

[Figure: employmentLength and grade distributions for defaulted vs. non-defaulted loans]

Now visualize the relationship between loan amount and default status; defaults occur more often among borrowers with larger loan amounts.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
# Total and median loan amount for each default status.
data_train.groupby('isDefault')['loanAmnt'].sum().plot.bar(ax=ax1, color='r', alpha=0.9)
data_train.groupby('isDefault')['loanAmnt'].median().plot.bar(ax=ax2, color='b', alpha=1)

[Figure: total and median loan amount by default status]
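A distribution-level comparison can make the difference easier to see; this box plot is my own addition, not part of the original post:

# Box plot of loan amount split by default status.
plt.figure(figsize=(8, 5))
sns.boxplot(x='isDefault', y='loanAmnt', data=data_train)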

Feature Engineering

After the exploratory analysis we have a fairly complete picture of the dataset, which is why the earlier work matters. The feature engineering stage addresses the points raised in the EDA: handling missing values, transforming feature values, and selecting features.

Handling Missing Values

The data types were already analyzed. After separating the target from the dataset, numeric columns are filled with their mean and categorical columns with their mode.

num_type = list(data_train.select_dtypes(exclude=['object']).columns)
cls_type = list(data_train.select_dtypes(include=['object']).columns)
num_type.remove('isDefault')
# Separate the target.
data_target = data_train['isDefault']
data_train.drop('isDefault', axis=1, inplace=True)
# Fill numeric columns with the mean.
data_train[num_type] = data_train[num_type].fillna(data_train[num_type].mean())
# Fill categorical columns with the mode (mode() returns a DataFrame, so take its first row).
data_train[cls_type] = data_train[cls_type].fillna(data_train[cls_type].mode().iloc[0])
data_train[cls_type].isnull().sum()
data_train['employmentLength'].fillna('10+ years', inplace=True)
data_train[cls_type]
_____________________________________________________________

grade	subGrade	employmentLength	issueDate	earliesCreditLine
0	E	E2	2 years	2014-07-01	Aug-2001
1	D	D2	5 years	2012-08-01	May-2002
2	D	D3	8 years	2015-10-01	May-2006
3	A	A4	10+ years	2015-08-01	May-1999
4	C	C2	10+ years	2016-03-01	Aug-1977
...	...	...	...	...	...
799995	C	C4	7 years	2016-07-01	Aug-2011
799996	A	A4	10+ years	2013-04-01	May-1989
799997	C	C3	10+ years	2015-10-01	Jul-2002
799998	A	A4	10+ years	2015-02-01	Jan-1994
799999	B	B3	5 years	2018-08-01	Feb-2002
800000 rows × 5 columns
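As a quick sanity check (my own addition), confirm that no missing values remain in the training set:

# Should print 0 after the mean/mode filling above.
print(data_train.isnull().sum().sum())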

Handle the missing values in the test set the same way:

data_testA[num_type] = data_testA[num_type].fillna(data_testA[num_type].mean())
data_testA[cls_type] = data_testA[cls_type].fillna(data_testA[cls_type].mode().iloc[0])
data_testA['employmentLength'] = data_testA['employmentLength'].fillna('10+ years')

Since id carries no information useful for model training, drop it.

data_train.drop(['id'], axis=1, inplace=True)
data_testA.drop(['id'], axis=1, inplace=True)

Append the test set to the training set so both receive the same feature processing; the training rows are split back out later for model training.

combined = pd.concat([data_train, data_testA], ignore_index=True)

Check the combined data. policyCode takes a single value across the whole dataset, so it contributes nothing and is dropped. The processed dataset has 1,000,000 rows and 44 features; the first 800,000 rows are the training set and the last 200,000 the test set, which makes the later split trivial (data_train = combined[:800000]).

combined.drop(['policyCode'], axis=1, inplace=True)
combined.shape
_______________________________________________________
(1000000, 44)

One-Hot Encoding

Judging from their values, the columns that need dummy (one-hot) encoding are: loan term (term), loan grade (grade), loan sub-grade (subGrade), employment length (employmentLength), and verification status (verificationStatus). After encoding, combined has 97 features.

def dummies_coder():
    global combined
    for name in ['term', 'grade', 'subGrade',
                 'employmentLength', 'verificationStatus']:
        data_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, data_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
    return combined

combined = dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 97)
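To see what the encoding produced (my own addition), list the dummy columns generated for one of the encoded fields, e.g. grade:

# Columns created by pd.get_dummies for the 'grade' field.
print([c for c in combined.columns if c.startswith('grade_')])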

Feature Combination

Based on the field descriptions from the competition page, group the features by meaning (columns already one-hot encoded are marked with "O-"), and then look for features that can be combined.

Borrower information:
    O-employmentLength   employment length (years)
    annualIncome         annual income
    employmentTitle      job title
    dti                  debt-to-income ratio

Loan information:
    loanAmnt             loan amount
    O-term               loan term (years)
    interestRate         loan interest rate
    O-grade              loan grade
    O-subGrade           loan sub-grade
    issueDate            month the loan was issued

Repayment information:
    installment          installment amount

Additional loan information:
    O-verificationStatus verification status
    postCode             first 3 digits of the applicant's postal code
    purpose              loan purpose category
    delinquency_2years   number of delinquency events
    pubRec               number of derogatory public records
    homeOwnership        home-ownership status
    revolBal             total revolving credit balance
    earliesCreditLine    month the borrower's earliest reported credit line was opened
    openAcc              number of open credit lines in the borrower's file
    title                loan title provided by the borrower
    applicationType      individual application or joint application with two co-borrowers
    O-initialListStatus  initial listing status
    revolUtil            revolving-line utilization rate (credit used relative to all available revolving credit)
    totalAcc             total number of credit lines currently in the borrower's file
    ficoRangeLow         lower bound of the borrower's FICO range at loan issuance
    ficoRangeHigh        upper bound of the borrower's FICO range at loan issuance
    pubRecBankruptcies   number of public-record bankruptcies

Additional repayment information:
    regionCode           region code
The anonymous features n0-n14 are counts of borrower behaviour. Their distinct-value counts (from the earlier nunique output) are:

n12      5
n11      5
n13     28
n14     31
n1      33
n0      39
n9      44
n4      46
n2      50
n3      50
n5      65
n7      70
n10     76
n8     102
n6     107
  • annualIncome x dti = debt (annual debt), a proxy for the borrower's financial pressure, which plausibly affects default:

combined['debt'] = combined['annualIncome'] * combined['dti']
combined.drop(['annualIncome', 'dti'], axis=1, inplace=True)
  • loanAmnt x interestRate x term = interest; then (loanAmnt + interest) / installment = stages (the number of installments). More installments means a longer window in which a default can occur:

combined['stages'] = (combined['loanAmnt'] * combined['interestRate'] * 0.01 * combined['term'] + combined['loanAmnt']) / combined['installment']
combined.drop(['loanAmnt', 'interestRate', 'term', 'installment'], axis=1, inplace=True)
combined.shape
__________________________________________________________
(1000000, 92)
  • delinquency_2years + pubRec - pubRecBankruptcies = Credit_record; a poor record gives an indirect view of the borrower's default history:

combined['Credit_record'] = combined['delinquency_2years'] + combined['pubRec'] - combined['pubRecBankruptcies']
combined.drop(['delinquency_2years', 'pubRec', 'pubRecBankruptcies'], axis=1, inplace=True)
  • openAcc + totalAcc + revolBal + revolUtil = credit_line, a rough measure of the borrower's total credit; better credit usually means a larger line, and a larger line also buffers against default:

combined['credit_line'] = combined['openAcc'] + combined['totalAcc'] + combined['revolBal'] + combined['revolUtil']
combined.drop(['openAcc', 'totalAcc', 'revolBal', 'revolUtil'], axis=1, inplace=True)
  • ficoRangeLow + ficoRangeHigh = ficoRange, the FICO range at loan issuance (a simple, admittedly crude, sum):

combined['ficoRange'] = combined['ficoRangeHigh'] + combined['ficoRangeLow']
combined.drop(['ficoRangeHigh', 'ficoRangeLow'], axis=1, inplace=True)
  • n0 + n1 + ... + n14 = total_n. These are all borrower-behaviour counts with small values, presumably negative signals, so we simply sum them into a single feature and drop 15 columns at once:

combined['total_n'] = combined[['n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7',
                                'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']].sum(axis=1)
combined.drop(['n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9',
               'n10', 'n11', 'n12', 'n13', 'n14'], axis=1, inplace=True)
combined.shape
___________________________________________________
(1000000, 72)

After all the processing the dataset has 72 features. Now recover the training set, test set, and target values.

train=combined.iloc[:800000]
test=combined.iloc[800000:]
targets=pd.read_csv('train.csv',usecols=['isDefault'])['isDefault'].values

Model Training

Feature Selection

After feature engineering the data still has 72 features, so we perform feature selection to reduce the dimensionality; below, a random forest estimator is used to compute feature importances.

clf=RandomForestClassifier(n_estimators=50,max_features='sqrt')
clf =clf.fit(train,targets)

Visualize the importance of each feature.

features =pd.DataFrame()
features['feature'] =train.columns
features['importance']=clf.feature_importances_
features.sort_values(by=['importance'],ascending=True,inplace=True)
features.set_index('feature',inplace=True)
features.plot(kind='barh',figsize=(15,15))

[Figure: random-forest feature importances]

As the feature engineering suggested, stages, debt, credit_line, total_n, postCode, and employmentTitle rank among the most important features. Next we make the dataset more compact with SelectFromModel, which cuts the feature count considerably.

model =SelectFromModel(clf,prefit=True)
train_reduce =model.transform(train)
train_reduce.shape
————————————————————————————————————————————————————
(800000, 12)
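To see which columns survived the selection (my own addition; SelectFromModel exposes the selection mask via get_support()):

# Names of the features retained by SelectFromModel.
selected_features = train.columns[model.get_support()]
print(list(selected_features))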

Define a scoring function to evaluate the models:

def compute_score(clf, x, y, cv=5, scoring='accuracy'):
    xval = cross_val_score(clf, x, y, cv=cv, scoring=scoring)
    return np.mean(xval)

Baseline Models

Instantiate the classifiers.

logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()
svc=SVC()
gnb = GaussianNB()
models = [ logreg_cv, rf,svc, gnb,gboost]

Since the dataset is large, we train on a subset of the data to save time.

train_partial =train[:80000]
targets_partial =targets[:80000]
for model in models:
    print (f'Cross-validation of :{model.__class__}')
    score = compute_score(clf=model, x=train_partial, y=targets_partial, scoring='accuracy')
    print (f'CV score ={score}')
    print ('****')
——————————————————————————————————————————————————————————————————
Cross-validation of :<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>
CV score =0.7982125
****
Cross-validation of :<class 'sklearn.ensemble._forest.RandomForestClassifier'>
CV score =0.7988125
****
Cross-validation of :<class 'sklearn.svm._classes.SVC'>
CV score =0.7984249999999999
****
Cross-validation of :<class 'sklearn.naive_bayes.GaussianNB'>
CV score =0.7983374999999999
****
Cross-validation of :<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>
CV score =0.8005749999999999
****
  • logreg_cv  CV score = 0.7982125
  • rf         CV score = 0.7988125
  • svc        CV score = 0.7984249999999999
  • gnb        CV score = 0.7983374999999999
  • gboost     CV score = 0.8005749999999999

Based on these first results, rf and gboost perform best, so we tune their parameters to improve the models.

Hyperparameter Tuning

Among the models tried, random forest and gradient boosting performed best, so we optimize their parameters further using GridSearchCV.

model_box = []
rf = RandomForestClassifier(random_state=2021, max_features='auto')
rf_params = {'n_estimators': [50, 120, 300], 'max_depth': [5, 8, 15],
             'min_samples_leaf': [2, 5, 10], 'min_samples_split': [2, 5, 10]}
model_box.append([rf, rf_params])

gboost = GradientBoostingClassifier(random_state=2021)
gboost_params = {'learning_rate': [0.05, 0.1, 0.15], 'n_estimators': [10, 50],
                 'max_depth': [3, 4, 6, 10], 'min_samples_split': [50, 10]}
model_box.append([gboost, gboost_params])

for i in range(len(model_box)):
    best_model = GridSearchCV(model_box[i][0], param_grid=model_box[i][1], refit=True,
                              cv=5, scoring='roc_auc').fit(train_partial, targets_partial)
    print(model_box[i], ':')
    print('best_parameters:', best_model.best_params_)
  
 ————————————————————————————————————————————————————————————————
[RandomForestClassifier(random_state=2021), {'n_estimators': [50, 120, 300], 'max_depth': [5, 8, 15], 'min_samples_leaf': [2, 5, 10], 'min_samples_split': [2, 5, 10]}] :
best_parameters: {'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300}
[GradientBoostingClassifier(random_state=2021), {'learning_rate': [0.05, 0.1, 0.15], 'n_estimators': [10, 50], 'max_depth': [3, 4, 6, 10], 'min_samples_split': [50, 10]}] :
best_parameters: {'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50}

The grid search gives the following best parameters:

  • rf: best_parameters: 'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300
  • gboost: best_parameters: 'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50
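The loop below refers to model_best, which the original post never shows being constructed; a minimal sketch consistent with the grid-search results above might be:

# Hypothetical: rebuild the two tuned models with their best parameters.
model_best = {
    'rf': RandomForestClassifier(random_state=2021, n_estimators=300, max_depth=15,
                                 min_samples_leaf=10, min_samples_split=2),
    'gboost': GradientBoostingClassifier(random_state=2021, learning_rate=0.15,
                                         n_estimators=50, max_depth=4, min_samples_split=50),
}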

Plug these parameters into the models, retrain, and compute cross-validated scores to evaluate them.

for i in model_best:
    model_best[i].fit(train_partial, targets_partial)
    score = cross_val_score(model_best[i], train_partial, targets_partial, cv=5, scoring='accuracy')
    print('%s score: %.4f' % (i, score.mean()))
__________________________________________________________________
rf score: 0.7997
gboost score: 0.8005

Select the best model, predict on the test set, and write the submission file.

model_final =model_best['gboost']
predictions= model_final.predict(test).astype(int)
df_predictions = pd.DataFrame()
abc = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/testA.csv')
df_predictions['id'] = abc['id']
df_predictions['isDefault'] = predictions
df_predictions[['id','isDefault']].to_csv('/content/drive/MyDrive/Colab Notebooks/submit.csv', index=False)
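One caveat, which is my own observation and not from the original post: the grid search was scored with roc_auc, and AUC-scored competitions typically expect predicted probabilities rather than hard 0/1 labels. If that applies here, a variant such as the following may score better:

# Hypothetical alternative: submit predicted default probabilities instead of hard labels.
df_predictions['isDefault'] = model_final.predict_proba(test)[:, 1]
df_predictions[['id','isDefault']].to_csv('/content/drive/MyDrive/Colab Notebooks/submit.csv', index=False)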