Preface
The Alibaba Tianchi platform offers a rich set of datasets and is a good way to sharpen one's data-analysis thinking. This post works through a loan default prediction task to deepen the understanding of training different models, in the hope of further refining the analysis framework for real projects.
Module imports, Jupyter setup and dataset loading
Import the modules that will be used.
import warnings

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Some initial settings.
# Set the font so Chinese labels render correctly.
mpl.rcParams['font.sans-serif'] = [u'SimHei']
# Keep minus signs in figures from showing up as empty boxes.
mpl.rcParams['axes.unicode_minus'] = False
# Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
# Seaborn font setting.
sns.set(font='SimHei')
Load the datasets.
data_submit = pd.read_csv('sample_submit.csv')
data_testA = pd.read_csv('testA.csv')
data_train = pd.read_csv('train.csv')
data_train
______________________________________________________
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership ... n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 ... 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 ... NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN
2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 ... 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 ... 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 ... 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
799995 799995 25000.0 3 14.49 860.41 C C4 2659.0 7 years 1 ... 6.0 2.0 12.0 13.0 10.0 14.0 0.0 0.0 0.0 3.0
799996 799996 17000.0 3 7.90 531.94 A A4 29205.0 10+ years 0 ... 15.0 16.0 2.0 19.0 2.0 7.0 0.0 0.0 0.0 0.0
799997 799997 6000.0 3 13.33 203.12 C C3 2582.0 10+ years 1 ... 4.0 26.0 4.0 10.0 4.0 5.0 0.0 0.0 1.0 4.0
799998 799998 19200.0 3 6.92 592.14 A A4 151.0 10+ years 0 ... 10.0 6.0 12.0 22.0 8.0 16.0 0.0 0.0 0.0 5.0
799999 799999 9000.0 3 11.06 294.91 B B3 13.0 5 years 0 ... 3.0 4.0 4.0 8.0 3.0 7.0 0.0 0.0 0.0 2.0
800000 rows × 47 columns
Exploratory data analysis (EDA)
Dataset overview
This covers the column dtypes, a preview of the summary statistics and the shape of the data.
# Inspect the column dtypes.
data_train.info()
# Inspect the shape of the data.
data_train.shape
# Preview the summary statistics.
data_train.describe()
——————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
Inspecting missing values
Find the columns that contain missing values, pull them out and visualise them.
# Put the per-column missing-value counts into a DataFrame with a 'counts' column.
nan_num = pd.DataFrame(data_train.isnull().sum(), columns=['counts'])
# Keep only the columns with at least one missing value.
data_nan = nan_num[nan_num['counts'] > 0]
# Sort by the number of missing values.
data_nan.sort_values(by='counts', inplace=True)
————————————————————————————————————————————————————
counts
employmentTitle 1
postCode 1
title 1
dti 239
pubRecBankruptcies 405
revolUtil 531
n10 33239
n4 33239
n12 40270
n9 40270
n7 40270
n6 40270
n3 40270
n13 40270
n2 40270
n1 40270
n0 40270
n5 40270
n14 40270
n8 40271
employmentLength 46799
n11 69752
Visualise the missing-value counts.
data_nan.plot.hist(figsize=(30, 10))
Outlier analysis is mainly about checking whether values fall within a reasonable range; boxplots are a convenient way to look at how the data is distributed, as in the sketch below.
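A minimal boxplot sketch (the three columns here are picked purely for illustration; any numeric feature works the same way):
# Boxplots of a few numeric columns to eyeball outliers.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, ['loanAmnt', 'interestRate', 'dti']):
    sns.boxplot(y=data_train[col], ax=ax)
    ax.set_title(col)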
# Count the distinct values of each column.
data_u = data_train.nunique().sort_values()
data_u
______________________________________________________
policyCode 1
term 2
applicationType 2
initialListStatus 2
isDefault 2
verificationStatus 3
n12 5
n11 5
homeOwnership 6
grade 7
pubRecBankruptcies 11
employmentLength 11
purpose 14
n13 28
delinquency_2years 30
n14 31
pubRec 32
n1 33
subGrade 35
n0 39
ficoRangeLow 39
ficoRangeHigh 39
n9 44
n4 46
n2 50
n3 50
regionCode 51
n5 65
n7 70
openAcc 75
n10 76
n8 102
n6 107
totalAcc 134
issueDate 139
interestRate 641
earliesCreditLine 720
postCode 932
revolUtil 1286
loanAmnt 1540
dti 6321
title 39644
annualIncome 44926
revolBal 71116
installment 72360
employmentTitle 248683
id 800000
dtype: int64
Analysis by data type
The overview above shows that the dataset contains three dtypes: object, int64 and float64. Below, the columns are split by dtype and analysed in turn.
num_type=data_train.select_dtypes(exclude=['object'])
cls_type=data_train.select_dtypes(include=['object'])
num_type.columns ,cls_type.columns
——————————————————————————————————————————————————————
(Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object'),
Index(['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine'],
dtype='object'))
Numeric features
A helper function splits the numeric columns into discrete and continuous ones; for tidier display, the lists are returned as transposed ndarrays.
def classify_types():
    # Columns with 10 or fewer distinct values are treated as discrete, the rest as continuous.
    num_continuous = []
    num_discrete = []
    global data_train, num_type
    for name in num_type:
        counts = data_train[name].nunique()
        if counts <= 10:
            num_discrete.append(name)
        else:
            num_continuous.append(name)
    return np.array(num_discrete).T, np.array(num_continuous).T

num_discrete, num_continuous = classify_types()
Continuous columns:
array(['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle',
'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti',
'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil',
'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6',
       'n7', 'n8', 'n9', 'n10', 'n13', 'n14'], dtype='<U18')
Discrete columns:
array(['term', 'homeOwnership', 'verificationStatus', 'isDefault',
'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12'],
      dtype='<U18')
Take a look at the discrete columns; the raw output is long, so only the beginning was kept. A more compact summary sketch follows the loop below.
for i in num_discrete:
    print(f'{i}:{data_train[i]}')
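As a sketch (not part of the original output), value counts give a much shorter summary of each discrete feature:
# Print the value counts of every discrete column.
for name in num_discrete:
    print(data_train[name].value_counts(), '\n')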
The distributions of the continuous columns can be checked visually. The full dataset is large and slow to plot, so 50,000 rows are used to see whether the columns look roughly normally distributed.
data_continuous = data_train[num_continuous]
data_partial = data_continuous.head(50000)
df = pd.melt(data_partial, value_vars=num_continuous)
sp = sns.FacetGrid(df, col='variable', col_wrap=3, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='r', rug=True)
Categorical features
First look at how the categorical columns are distributed.
Categorical columns are easy to analyse visually. Plotting the counts of the loan grades shows that grades B, C and A are the most common.
plt.figure(figsize=(8,5))
sns.barplot(data_train['grade'].value_counts().index,data_train['grade'].value_counts())
Next, the employment length distribution: borrowers with 10+ years of employment are by far the largest group, while the other buckets are fairly even.
plt.figure(figsize=(8,5))
sns.barplot(data_train['employmentLength'].value_counts().index,data_train['employmentLength'].value_counts())
Relationships between features
First look at the distribution of the prediction target.
data_train['isDefault'].value_counts().plot.bar(color='g')
Next, analyse how default relates to the features.
Split the data into the default (isDefault=1) and non-default (isDefault=0) subsets.
data_y = data_train[data_train['isDefault']==1]
data_n = data_train[data_train['isDefault']==0]
For each subset, look at employmentLength versus isDefault and grade versus isDefault; a default-rate sketch follows the plots below.
fig,((ax1,ax2),(ax3,ax4)) =plt.subplots(2,2,figsize=(14,8))
data_y.groupby('employmentLength').size().plot.bar(ax=ax1,color='r',alpha=0.8)
data_n.groupby('employmentLength')['employmentLength'].count().plot.bar(ax=ax2,color='r',alpha=0.85)
data_y.groupby('grade').size().plot.bar(ax=ax3,color='g',alpha=0.8)
data_n.groupby('grade')['grade'].count().plot.bar(ax=ax4,color='g',alpha=0.85)
plt.xticks(rotation=90)
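Counts alone do not show how likely each group is to default, so here is a small default-rate sketch (an addition, not in the original notebook):
# Share of defaulted loans within each grade and each employment-length bucket.
print(data_train.groupby('grade')['isDefault'].mean())
print(data_train.groupby('employmentLength')['isDefault'].mean())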
Visualising the loan amount against default status suggests that defaults happen more often among borrowers with larger loans; a distribution sketch follows the bar plots.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
# Total and median loan amount for each default status.
data_train.groupby('isDefault')['loanAmnt'].sum().plot.bar(ax=ax1, color='r', alpha=0.9)
data_train.groupby('isDefault')['loanAmnt'].median().plot.bar(ax=ax2, color='b', alpha=1)
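As a sketch, a boxplot of loan amount split by default status shows the same point from a distribution view:
plt.figure(figsize=(8, 5))
sns.boxplot(x='isDefault', y='loanAmnt', data=data_train)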
Feature engineering
The exploratory analysis above gives a fairly complete picture of the dataset, which is why that stage matters. The feature engineering below addresses the points raised in the EDA: missing value handling, feature value processing and feature selection.
Missing value handling
The dtypes were already analysed. After separating the prediction target from the training set, numeric columns are filled with their mean and categorical columns with their mode.
num_type = list(data_train.select_dtypes(exclude=['object']).columns)
cls_type = list(data_train.select_dtypes(include=['object']).columns)
num_type.remove('isDefault')
# Separate the prediction target.
data_target = data_train['isDefault']
data_train.drop('isDefault', axis=1, inplace=True)
# Fill numeric columns with the mean.
data_train[num_type] = data_train[num_type].fillna(data_train[num_type].mean())
# Fill categorical columns with the mode (mode() returns a DataFrame, so take its first row).
data_train[cls_type] = data_train[cls_type].fillna(data_train[cls_type].mode().iloc[0])
data_train[cls_type].isnull().sum()
data_train['employmentLength'].fillna('10+ years', inplace=True)
data_train[cls_type]
_____________________________________________________________
grade subGrade employmentLength issueDate earliesCreditLine
0 E E2 2 years 2014-07-01 Aug-2001
1 D D2 5 years 2012-08-01 May-2002
2 D D3 8 years 2015-10-01 May-2006
3 A A4 10+ years 2015-08-01 May-1999
4 C C2 10+ years 2016-03-01 Aug-1977
... ... ... ... ... ...
799995 C C4 7 years 2016-07-01 Aug-2011
799996 A A4 10+ years 2013-04-01 May-1989
799997 C C3 10+ years 2015-10-01 Jul-2002
799998 A A4 10+ years 2015-02-01 Jan-1994
799999 B B3 5 years 2018-08-01 Feb-2002
800000 rows × 5 columns
Handle the missing values in the test set the same way:
data_testA[num_type] = data_testA[num_type].fillna(data_testA[num_type].mean())
data_testA[cls_type] = data_testA[cls_type].fillna(data_testA[cls_type].mode().iloc[0])
data_testA['employmentLength'] = data_testA['employmentLength'].fillna('10+ years')
The id column carries no information useful for training, so it is dropped.
data_train.drop(['id'], axis=1, inplace=True)
data_testA.drop(['id'], axis=1, inplace=True)
Append the test set to the training set so both are transformed consistently; to be safe, only the training portion is used for fitting later.
combined = pd.concat([data_train, data_testA], ignore_index=True)
Check the combined data. policyCode takes a single value everywhere, so it contributes little to the analysis and is dropped. The processed dataset has 1,000,000 rows and 44 features; the first 800,000 rows are the training set and the last 200,000 the test set, which makes splitting them later straightforward, i.e. data_train=combined[:800000].
combined.drop(['policyCode'], axis=1, inplace=True)
combined.shape
_______________________________________________________
(1000000, 44)
One-hot encoding
Judging from their values, the columns that need dummy encoding are: loan term (term), loan grade (grade), loan sub-grade (subGrade), employment length (employmentLength) and verification status (verificationStatus). After encoding, combined has 97 features.
def dummies_coder():
    global combined
    for name in ['term', 'grade', 'subGrade',
                 'employmentLength', 'verificationStatus']:
        data_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, data_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
    return combined

combined = dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 97)
Feature combination
Based on the field descriptions on the competition page, all features are grouped subjectively below (columns already one-hot encoded are marked with an O- prefix), and the relationships between them are then examined; a correlation sketch follows the list.
Borrower information:
O-employmentLength: employment length (years)
annualIncome: annual income
employmentTitle: employment title
dti: debt-to-income ratio
Loan information:
loanAmnt: loan amount
O-term: loan term (years)
interestRate: loan interest rate
O-grade: loan grade
O-subGrade: loan sub-grade
issueDate: month in which the loan was issued
Repayment information:
installment: monthly installment amount
Additional loan information:
O-verificationStatus: verification status
postCode: first 3 digits of the postal code on the loan application
purpose: loan purpose category
delinquency_2years: number of delinquency events in the past two years
pubRec: number of derogatory public records
homeOwnership: home ownership status
revolBal: total revolving credit balance
earliesCreditLine: month the borrower's earliest reported credit line was opened
openAcc: number of open credit lines in the borrower's credit file
title: loan title provided by the borrower
applicationType: individual application or joint application with two co-borrowers
O-initialListStatus
revolUtil: revolving line utilisation rate, i.e. credit used relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
ficoRangeLow: lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh: upper bound of the borrower's FICO range at loan issuance
pubRecBankruptcies: number of public record bankruptcies
Additional repayment information:
regionCode: region code
Anonymous features n0-n14: processed counts of borrower behaviour; their distinct-value counts are:
n12 5
n11 5
n13 28
n14 31
n1 33
n0 39
n9 44
n4 46
n2 50
n3 50
n5 65
n7 70
n10 76
n8 102
n6 107
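To get a feel for the relationships between the features named above before combining them, a correlation heatmap can help; a minimal sketch over the raw numeric columns that are about to be combined (the column choice here is an assumption):
# Correlation among the numeric features used in the combinations below.
cols = ['annualIncome', 'dti', 'loanAmnt', 'interestRate', 'installment',
        'openAcc', 'totalAcc', 'revolBal', 'revolUtil', 'ficoRangeLow', 'ficoRangeHigh']
plt.figure(figsize=(10, 8))
sns.heatmap(combined[cols].corr(), cmap='coolwarm', center=0, annot=True, fmt='.2f')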
- Annual income (annualIncome) × debt-to-income ratio (dti) = annual debt (debt). This reflects the financial pressure on the borrower, which bears on whether they default;
combined['debt'] = combined['annualIncome'] * combined['dti']
combined.drop(['annualIncome', 'dti'], axis=1, inplace=True)
- Loan amount (loanAmnt) × interest rate (interestRate) × term in years (term) = interest, and then total principal plus interest (total_loan) / installment amount (installment) = number of installments (stages). More installments means a longer window in which a default can occur;
combined['stages'] = (combined['loanAmnt']*combined['interestRate']*0.01*combined['term']+combined['loanAmnt'])/combined['installment']
combined.drop(['loanAmnt', 'interestRate', 'term', 'installment'], axis=1, inplace=True)
combined.shape
__________________________________________________________
(1000000, 92)
- Number of delinquency events (delinquency_2years) + number of derogatory public records (pubRec) − number of public record bankruptcies (pubRecBankruptcies) = credit record (Credit_record). Bad records give an indirect view of the borrower's default history;
combined['Credit_record'] = combined['delinquency_2years'] + combined['pubRec'] - combined['pubRecBankruptcies']
combined.drop(['delinquency_2years', 'pubRec', 'pubRecBankruptcies'], axis=1, inplace=True)
- Number of open credit lines (openAcc) + total number of credit lines (totalAcc) + total revolving credit balance (revolBal) + revolving line utilisation (revolUtil) = total credit line (credit_line). The overall credit line reflects the borrower's creditworthiness: better credit means a larger line, and a larger line can also buffer default risk;
combined['credit_line'] = combined['openAcc'] + combined['totalAcc'] + combined['revolBal'] + combined['revolUtil']
combined.drop(['openAcc', 'totalAcc', 'revolBal', 'revolUtil'], axis=1, inplace=True)
- Lower FICO bound at issuance (ficoRangeLow) + upper FICO bound at issuance (ficoRangeHigh) = FICO range at issuance (ficoRange); this is simply a mechanical sum of the two bounds;
combined['ficoRange'] = combined['ficoRangeHigh'] + combined['ficoRangeLow']
combined.drop(['ficoRangeHigh', 'ficoRangeLow'], axis=1, inplace=True)
- Borrower behaviour counts n0 + n1 + ... + n14 = total_n. These are all borrower behaviour counts with small values, presumably flags of undesirable behaviour, so they are simply summed into one feature, removing 15 columns in one go;
combined['total_n'] = combined['n0']+combined['n1']+combined['n2']+combined['n3']+combined['n4']+combined['n5']+combined['n6']+combined['n7']+combined['n8']+combined['n9']+combined['n10']+combined['n11']+combined['n12']+combined['n13']+combined['n14']
combined.drop(['n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'], axis=1, inplace=True)
combined.shape
___________________________________________________
(1000000, 72)
After all of this processing, the dataset has 72 features. Now restore the training set, test set and target values.
train=combined.iloc[:800000]
test=combined.iloc[800000:]
targets=pd.read_csv('train.csv',usecols=['isDefault'])['isDefault'].values
Model training
Feature selection
Even after feature engineering the data still has as many as 72 features, so feature selection is used to bring the dimensionality down. Below, a random forest estimator computes feature importances.
clf=RandomForestClassifier(n_estimators=50,max_features='sqrt')
clf =clf.fit(train,targets)
Visualise the importance of each feature.
features =pd.DataFrame()
features['feature'] =train.columns
features['importance']=clf.feature_importances_
features.sort_values(by=['importance'],ascending=True,inplace=True)
features.set_index('feature',inplace=True)
features.plot(kind='barh',figsize=(15,15))
As the feature engineering suggested, stages, debt, credit_line, total_n, postCode and employmentTitle are among the most informative features. Next, the dataset is made more compact and the feature count drops sharply; a sketch for applying the same selector to the test set follows the output below.
model =SelectFromModel(clf,prefit=True)
train_reduce =model.transform(train)
train_reduce.shape
————————————————————————————————————————————————————
(800000, 12)
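The same selector can also be applied to the test set, and get_support() shows which columns were kept; a sketch, not part of the original notebook:
# Reduce the test set with the fitted selector and list the retained columns.
test_reduce = model.transform(test)
selected_features = train.columns[model.get_support()]
print(test_reduce.shape, list(selected_features))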
Define a scoring function to evaluate the models:
def compute_score(clf, x, y, cv=5, scoring='accuracy'):
    xval = cross_val_score(clf, x, y, cv=cv, scoring=scoring)
    return np.mean(xval)
Baseline models
Instantiate the classifiers.
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()
svc=SVC()
gnb = GaussianNB()
models = [ logreg_cv, rf,svc, gnb,gboost]
The dataset is large, so to keep things fast only part of it is used for training.
train_partial =train[:80000]
targets_partial =targets[:80000]
for model in models:
print (f'Cross-validation of :{model.__class__}')
score = compute_score(clf=model, x=train_partial, y=targets_partial, scoring='accuracy')
print (f'CV score ={score}')
print ('****')
——————————————————————————————————————————————————————————————————
Cross-validation of :<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>
CV score =0.7982125
****
Cross-validation of :<class 'sklearn.ensemble._forest.RandomForestClassifier'>
CV score =0.7988125
****
Cross-validation of :<class 'sklearn.svm._classes.SVC'>
CV score =0.7984249999999999
****
Cross-validation of :<class 'sklearn.naive_bayes.GaussianNB'>
CV score =0.7983374999999999
****
Cross-validation of :<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>
CV score =0.8005749999999999
****
- logreg_cv CV score = 0.7982125
- rf CV score = 0.7988125
- svc CV score = 0.7984249999999999
- gnb CV score = 0.7983374999999999
- gboost CV score = 0.8005749999999999
Based on this first round of training, rf and gboost perform best, so their hyperparameters are tuned next to improve the models.
Hyperparameter tuning
Of the models tried, the random forest and the gradient boosting classifier perform best, so their parameters are tuned further; GridSearchCV is used for this step.
model_box= []
rf=RandomForestClassifier(random_state=2021,max_features='auto')
rf_params ={'n_estimators':[50,120,300],'max_depth':[5,8,15],
'min_samples_leaf':[2,5,10],'min_samples_split':[2,5,10]}
model_box.append([rf,rf_params])
gboost=GradientBoostingClassifier(random_state=2021)
gboost_params ={'learning_rate':[0.05,0.1,0.15],'n_estimators':[10,50],
'max_depth':[3,4,6,10],'min_samples_split':[50,10]}
model_box.append([gboost,gboost_params])
model_best = {}
for name, (model, params) in zip(['rf', 'gboost'], model_box):
    gs = GridSearchCV(model, param_grid=params, refit=True, cv=5, scoring='roc_auc')
    gs.fit(train_partial, targets_partial)
    model_best[name] = gs.best_estimator_  # keep the refit best model for later use
    print([model, params], ':')
    print('best_parameters:', gs.best_params_)
————————————————————————————————————————————————————————————————
[RandomForestClassifier(random_state=2021), {'n_estimators': [50, 120, 300], 'max_depth': [5, 8, 15], 'min_samples_leaf': [2, 5, 10], 'min_samples_split': [2, 5, 10]}] :
best_parameters: {'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300}
[GradientBoostingClassifier(random_state=2021), {'learning_rate': [0.05, 0.1, 0.15], 'n_estimators': [10, 50], 'max_depth': [3, 4, 6, 10], 'min_samples_split': [50, 10]}] :
best_parameters: {'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50}
The grid search reports the following best parameters:
- rf: best_parameters: 'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300
- gboost: best_parameters: 'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50
Fit the models with these parameters and compute cross-validation scores to evaluate them.
for name, model in model_best.items():
    model.fit(train_partial, targets_partial)
    score = cross_val_score(model, train_partial, targets_partial, cv=5, scoring='accuracy')
    print('%s CV accuracy: %.4f' % (name, score.mean()))
__________________________________________________________________
rf CV accuracy: 0.7997
gboost CV accuracy: 0.8005
Pick the best model, predict on the test set and write the submission file.
model_final = model_best['gboost']
predictions = model_final.predict(test).astype(int)
df_predictions = pd.DataFrame()
test_ids = pd.read_csv('testA.csv', usecols=['id'])
df_predictions['id'] = test_ids['id']
df_predictions['isDefault'] = predictions
df_predictions[['id', 'isDefault']].to_csv('submit.csv', index=False)
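If the competition is scored with AUC, submitting the predicted probability of default rather than a hard 0/1 label would likely score better; a sketch under that assumption:
# Probability of class 1 (default) instead of a hard label.
df_predictions['isDefault'] = model_final.predict_proba(test)[:, 1]
df_predictions[['id', 'isDefault']].to_csv('submit_proba.csv', index=False)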