I. User Loan Default Prediction
\
1. Problem Overview
- Invitation code: ZDEYEU
\
With the rapid development of the global economy and capital markets, and with both business expansion and the spread of consumer credit, loans have become an important way for companies and individuals to solve financial problems.
\
For banks and micro-lending institutions, credit cards and loans are high-risk, high-return businesses. Mining massive user data for latent signals, i.e. a credit score, and feeding that score into approval decisions improves both the efficiency of the approval process and the quality of risk control. If risk monitoring falls short, however, the resulting losses to a bank can be immeasurable, which is why this work is critical.
\
\
2. Data Description
The dataset for this practice competition is based on real loan-user behavior data, with some fields adjusted; treat the field information provided by the competition as authoritative.
The fields are as follows:
\
Field | Description | Type
---|---|---
id | unique sample identifier | string
income | user income | integer
age | user age | integer
experience_years | years of work experience | integer
is_married | marital status (single / married) | string
city | city of residence, anonymized | integer
region | region of residence, anonymized | integer
current_job_years | years in the current job | integer
current_house_years | years in the current residence | integer
house_ownership | housing status: rented / owned / neither | string
car_ownership | whether the user owns a car | string
profession | occupation, anonymized | integer
label | whether the user defaulted in the past (0/1) | integer
\
II. Data Analysis
\
# Unzip the data archive
!unzip -oq /home/aistudio/data/data112664/data.zip
\
# Load the training data
import pandas as pd
train=pd.read_csv('train.csv')
train.head()
\
| | id | income | age | experience_years | is_married | city | region | current_job_years | current_house_years | house_ownership | car_ownership | profession | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_0 | 8529345 | 44 | 2 | single | 210 | 0 | 2 | 10 | rented | no | 13 | 0 |
| 1 | train_1 | 7848654 | 55 | 9 | single | 229 | 2 | 9 | 13 | rented | no | 43 | 0 |
| 2 | train_2 | 8491491 | 61 | 20 | single | 114 | 28 | 8 | 11 | rented | no | 12 | 0 |
| 3 | train_3 | 8631544 | 69 | 13 | married | 276 | 14 | 13 | 12 | rented | no | 27 | 0 |
| 4 | train_4 | 6947233 | 62 | 10 | single | 56 | 11 | 10 | 12 | rented | no | 47 | 0 |
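Before working through the questions below, it is worth confirming the actual dtypes and category values, since the field types are easy to get wrong (for example, house_ownership holds strings, not integers):

# Check column dtypes and the full set of values of each categorical field
print(train.dtypes)
for col in ['house_ownership', 'car_ownership', 'is_married']:
    print(col, train[col].unique())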
\
# 1. How many samples have label 1, and how many have label 0?
import matplotlib
%matplotlib inline
print(train['label'].value_counts())
train['label'].value_counts().plot.bar()
\
0 147325
1 20675
Name: label, dtype: int64
About 12.3% of the samples (20675 of 168000) are positive, a moderate class imbalance; this is one reason AUC, rather than accuracy, is used as the main evaluation metric later on.
# 2. What are the ages of rows 1000, 2000, 3000 and 4000 (0-based index)?
print(train['age'][1000],train['age'][2000],train['age'][3000],train['age'][4000])
\
28 67 74 54
\
# 3. How many distinct values does region take?
train['region'].nunique()
\
29
\
# 4. What is the maximum income?
train['income'].max()
\
9999938
\
# 5. How many samples have label == 1 and age at most 40?
len(train.loc[(train['label']==1) & (train['age']<=40)])
\
7441
\
# 6. What is the mean income of single (is_married == 'single') users?
train.loc[(train['is_married']=='single')]['income'].mean()
\
4999634.8222969035
\
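Equivalently, a single groupby reports the mean income for every marital status at once:

# Mean income by marital status
print(train.groupby('is_married')['income'].mean())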
# 7. How many variables have an absolute correlation with label greater than 0.03?
train.corr()
\
| | income | age | experience_years | city | region | current_job_years | current_house_years | profession | label |
|---|---|---|---|---|---|---|---|---|---|
| income | 1.000000 | 0.000080 | 0.005880 | -0.002808 | -0.002914 | 0.008629 | -0.002476 | 0.003137 | -0.003210 |
| age | 0.000080 | 1.000000 | -0.002743 | 0.003705 | -0.004571 | 0.002111 | -0.020356 | -0.009920 | -0.022394 |
| experience_years | 0.005880 | -0.002743 | 1.000000 | -0.024454 | -0.000557 | 0.646392 | 0.017630 | -0.000573 | -0.033902 |
| city | -0.002808 | 0.003705 | -0.024454 | 1.000000 | -0.036352 | -0.027934 | -0.009120 | 0.019299 | 0.004715 |
| region | -0.002914 | -0.004571 | -0.000557 | -0.036352 | 1.000000 | 0.008872 | 0.005963 | 0.004280 | -0.005890 |
| current_job_years | 0.008629 | 0.002111 | 0.646392 | -0.027934 | 0.008872 | 1.000000 | 0.005183 | -0.005910 | -0.016840 |
| current_house_years | -0.002476 | -0.020356 | 0.017630 | -0.009120 | 0.005963 | 0.005183 | 1.000000 | 0.002931 | -0.005695 |
| profession | 0.003137 | -0.009920 | -0.000573 | 0.019299 | 0.004280 | -0.005910 | 0.002931 | 1.000000 | -0.004034 |
| label | -0.003210 | -0.022394 | -0.033902 | 0.004715 | -0.005890 | -0.016840 | -0.005695 | -0.004034 | 1.000000 |

Reading the label column, only experience_years has an absolute correlation above 0.03 (about 0.034), so the answer is 1.
# 8. How many variables contain null values?
for column in train.columns:
print(column,train[column].isnull().any())
\
id False
income False
age False
experience_years False
is_married False
city False
region False
current_job_years False
current_house_years False
house_ownership False
car_ownership False
profession False
label False
\
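Since pandas can aggregate the per-column check, a one-liner gives the count of columns with missing values directly (0 for this dataset):

# Number of columns that contain at least one null
print(train.isnull().any().sum())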
# 9. Which region has the most defaulting users?
# The original snippet counted all users per region; to count defaults,
# filter on label == 1 first.
defaults_by_region = train.loc[train['label'] == 1, 'region'].value_counts()
print(defaults_by_region.idxmax(), defaults_by_region.max())
\
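Note that the region with the most defaults by raw count is not necessarily the riskiest: a small region can have a higher default rate. A short sketch comparing both views:

# Default count vs. default rate per region
by_region = train.groupby('region')['label'].agg(count='sum', rate='mean')
print(by_region.sort_values('count', ascending=False).head())
print(by_region.sort_values('rate', ascending=False).head())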
# 10. Which profession has the highest mean income?
mean_income = train.groupby('profession')['income'].mean()
print(mean_income.idxmax(), mean_income.max())
\
36 5447858.759071981
\
III. Model Selection
\
Gradient-boosting workhorses: LightGBM, XGBoost and CatBoost.
Model theory:
\
- CatBoost, theory and practice: zhuanlan.zhihu.com/p/37916954
- Understanding XGBoost: zhuanlan.zhihu.com/p/75217528
- A deep dive into LightGBM: zhuanlan.zhihu.com/p/99069186
Model comparison (the comparison chart from the original post is not reproduced): in brief, XGBoost grows trees level-wise with strong regularization; LightGBM uses histogram-based, leaf-wise growth and is usually the fastest of the three; CatBoost adds ordered boosting and native categorical-feature handling.
1. Imports
\
!pip install catboost
\
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
\
2. Loading the Data
\
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0)
\
cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
lb = LabelEncoder()
    # Fit on the combined train+test column so both splits share one encoding
    # (the original assigned the fitted encoder object into data[col])
    lb.fit(data[col])
train[col] = lb.transform(train[col])
test[col] = lb.transform(test[col])
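A quick optional check of the encoding: this prints the category-to-integer mapping learned by the last fitted encoder in the loop (the one for is_married).

# Inspect the category -> integer mapping of the last LabelEncoder
print(dict(zip(lb.classes_, lb.transform(lb.classes_))))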
\
no_feas = ['id', 'label']
features = [col for col in train.columns if col not in no_feas]
X_train = train[features]
X_test = test[features]
\
y_train = train['label'].astype(int)
\
def train_model_classification(X, X_test, y, params, num_classes=2,
folds=None, model_type='lgb',
eval_metric='logloss', columns=None,
plot_feature_importance=False,
model=None, verbose=10000,
early_stopping_rounds=200,
splits=None, n_folds=3):
"""
分类模型函数
返回字典,包括: oof predictions, test predictions, scores and, if necessary, feature importances.
:params: X - 训练数据, pd.DataFrame
:params: X_test - 测试数据,pd.DataFrame
:params: y - 目标
:params: folds - folds to split data
:params: model_type - 模型
:params: eval_metric - 评价指标
:params: columns - 特征列
:params: plot_feature_importance - 是否展示特征重要性
:params: model - sklearn model, works only for "sklearn" model type
"""
start_time = time.time()
global y_pred_valid, y_pred
\
columns = X.columns if columns is None else columns
X_test = X_test[columns]
    # Determine n_splits while `splits` is still None; the original checked
    # `splits is None` only after reassigning it, so the check was always
    # False and it silently fell back to n_folds.
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits
\
# to set up scoring parameters
metrics_dict = {
'logloss': {
'lgb_metric_name': 'logloss',
'xgb_metric_name': 'logloss',
'catboost_metric_name': 'Logloss',
'sklearn_scoring_function': metrics.log_loss
},
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
}
result_dict = {}
\
# out-of-fold predictions on train data
oof = np.zeros(shape=(len(X), num_classes))
# averaged predictions on train data
prediction = np.zeros(shape=(len(X_test), num_classes))
# list of scores on folds
acc_scores=[]
scores = []
# feature importance
feature_importance = pd.DataFrame()
\
# split and train on folds
for fold_n, (train_index, valid_index) in enumerate(splits):
if verbose:
print(f'Fold {fold_n + 1} started at {time.ctime()}')
if type(X) == np.ndarray:
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = y[train_index], y[valid_index]
else:
X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
\
if model_type == 'lgb':
model = lgb.LGBMClassifier(**params)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_valid, y_valid)],
eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
verbose=verbose,
early_stopping_rounds=early_stopping_rounds)
\
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)
\
if model_type == 'xgb':
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_valid, y_valid)],
eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
verbose=bool(verbose), # xgb verbose bool
early_stopping_rounds=early_stopping_rounds)
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)
if model_type == 'sklearn':
model = model
model.fit(X_train, y_train)
y_pred_valid = model.predict_proba(X_valid)
score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
y_pred = model.predict_proba(X_test)
\
if model_type == 'cat':
model = CatBoostClassifier(iterations=20000, eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
**params,
loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
verbose=False)
\
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test)
\
oof[valid_index] = y_pred_valid
        # leaderboard-style metrics on this fold
acc_scores.append(
metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
scores.append(
metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:,1]))
print(acc_scores)
print(scores)
prediction += y_pred
\
if model_type == 'lgb' and plot_feature_importance:
# feature importance
fold_importance = pd.DataFrame()
fold_importance["feature"] = columns
fold_importance["importance"] = model.feature_importances_
fold_importance["fold"] = fold_n + 1
feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
\
if model_type == 'xgb' and plot_feature_importance:
# feature importance
fold_importance = pd.DataFrame()
fold_importance["feature"] = columns
fold_importance["importance"] = model.feature_importances_
fold_importance["fold"] = fold_n + 1
feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
prediction /= n_splits
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
\
result_dict['oof'] = oof
result_dict['prediction'] = prediction
result_dict['acc_scores'] = acc_scores
result_dict['scores'] = scores
\
\
if model_type == 'lgb' or model_type == 'xgb':
if plot_feature_importance:
feature_importance["importance"] /= n_splits
cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
by="importance", ascending=False)[:50].index
\
best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
\
plt.figure(figsize=(16, 12))
sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
plt.title('LGB Features (avg over folds)')
plt.show()
result_dict['feature_importance'] = feature_importance
end_time = time.time()
\
print("train_model_classification cost time:{}".format(end_time - start_time))
return result_dict
\
\
3. LightGBM Model
- GitHub: github.com/microsoft/L…
\
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    # NOTE: bagging_fraction/feature_fraction duplicate the sklearn-API
    # aliases subsample/colsample_bytree further down; LightGBM warns about
    # the clash and only one value of each pair takes effect.
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    'min_split_gain': 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    'lambda_l1': 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    # 'n_jobs': -1,
    'metric': 'auc'
}
\
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314,shuffle=True)
X = train[features]
print(y_train.value_counts())
X_test = test[features]
\
result_dict_lgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=lgb_params,
num_classes=num_classes,
folds=folds,
model_type='lgb',
eval_metric='logloss',
plot_feature_importance=True,
verbose=200,
early_stopping_rounds=200,
n_folds=n_fold
)
\
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
\
sub['label'] = result_dict_lgb['prediction'][:, 1]
sub[['id','label']].to_csv('lgb_sub.csv', index=None)
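As a sanity check on the CV estimate, the out-of-fold predictions returned by the helper can be scored directly against the training labels; this should roughly match the mean fold AUC printed above:

from sklearn.metrics import roc_auc_score

# Overall AUC of the out-of-fold predictions on the training set
print(roc_auc_score(y_train, result_dict_lgb['oof'][:, 1]))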
\
4. XGBoost Model
- GitHub: github.com/dmlc/xgboos…
\
xgb_params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'nthread': 4,  # deprecated alias of n_jobs; only one of the two is needed
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'n_jobs': -1
}
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
\
result_dict_xgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=xgb_params,
num_classes=num_classes,
folds=folds,
model_type='xgb',
eval_metric='logloss',
plot_feature_importance=False,
verbose=10,
early_stopping_rounds=200,
n_folds=n_fold)
\
acc_score = np.mean(result_dict_xgb['acc_scores'])
score = np.mean(result_dict_xgb['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
sub['label'] = result_dict_xgb['prediction'][:, 1]
sub[['id','label']].to_csv('xgb_sub.csv', index=None)
\
5. CatBoost Model
\
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0)
\
cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
lb = LabelEncoder()
    # Fit on the combined train+test column, then transform each split
    # (same fix as above: the original assigned the encoder object itself)
    lb.fit(data[col])
train[col] = lb.transform(train[col])
test[col] = lb.transform(test[col])
\
no_feas = ['id', 'label']
features = [col for col in train.columns if col not in no_feas]
X_train = train[features]
X_test = test[features]
\
y_train = train['label'].astype(int)
\
cat_params = {'learning_rate': 0.1, 'depth': 9, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
# train, y, test, features = load_data()
\
\
result_dict_cat = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=cat_params,
num_classes=num_classes,
folds=folds,
model_type='cat',
eval_metric='logloss',
plot_feature_importance=False,
verbose=1,
early_stopping_rounds=200,
n_folds=n_fold)
\
num_classes: 2
Fold 1 started at Tue Oct 19 19:31:48 2021
[0.8984821428571429]
[0.9224975053919032]
Fold 2 started at Tue Oct 19 19:32:16 2021
[0.8984821428571429, 0.8975892857142858]
[0.9224975053919032, 0.9137011366138295]
Fold 3 started at Tue Oct 19 19:32:37 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926]
Fold 4 started at Tue Oct 19 19:33:00 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075]
Fold 5 started at Tue Oct 19 19:33:28 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667, 0.8984821428571429]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075, 0.9216918398255385]
CV mean score: 0.9203, std: 0.0038.
train_model_classification cost time:124.51118421554565
\
acc_score = np.mean(result_dict_cat['acc_scores'])
score = np.mean(result_dict_cat['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
sub['label'] = result_dict_cat['prediction'][:, 1]
sub[['id','label']].to_csv('cat_sub.csv', index=None)
\
0.9202806715733278
\
6. Blending the Models
\
!head lgb_sub.csv
\
id,label
test_0,0.04768790720651824
test_1,0.03234766703952933
test_2,0.019486958139745335
test_3,0.025913724543374726
test_4,0.015892808511107425
test_5,0.020375922959108346
test_6,0.5500078835133054
test_7,0.015298065400623622
test_8,0.5267191550395849
\
!head xgb_sub.csv
\
id,label
test_0,0.046526006609201434
test_1,0.01112006399780512
test_2,0.02196923615410924
test_3,0.022300471551716328
test_4,0.019372097216546535
test_5,0.02230435274541378
test_6,0.5783432126045227
test_7,0.021838839538395403
test_8,0.5301580131053925
\
!head cat_sub.csv
\
id,label
test_0,0.024350702410146956
test_1,0.012536424839143129
test_2,0.04935167789692676
test_3,0.030931472369207347
test_4,0.025919158896289263
test_5,0.015382541256121665
test_6,0.5760770372253025
test_7,0.01336187014240726
test_8,0.5325660396323679
\
import pandas as pd

# Use distinct names: plain `lgb`/`xgb` would shadow the library imports above
lgb_sub = pd.read_csv('lgb_sub.csv')
xgb_sub = pd.read_csv('xgb_sub.csv')
cat_sub = pd.read_csv('cat_sub.csv')
sub = lgb_sub.copy()

# Blend by a simple arithmetic mean of the three predicted probabilities
sub['label'] = (lgb_sub['label'] + xgb_sub['label'] + cat_sub['label']) / 3
sub.to_csv("result.csv", index=False)
\
\
Evidently a plain average of the three submissions is not the best way to blend them.
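Two common alternatives are a weighted average and a rank average. The sketch below is illustrative only: the weights are placeholders that would need to be tuned, for example against the out-of-fold predictions.

import pandas as pd

lgb_sub = pd.read_csv('lgb_sub.csv')
xgb_sub = pd.read_csv('xgb_sub.csv')
cat_sub = pd.read_csv('cat_sub.csv')
sub = lgb_sub.copy()

# 1) Weighted average: the weights here are placeholders, not tuned values
w_lgb, w_xgb, w_cat = 0.4, 0.3, 0.3
sub['label'] = (w_lgb * lgb_sub['label']
                + w_xgb * xgb_sub['label']
                + w_cat * cat_sub['label'])
sub.to_csv('weighted_sub.csv', index=False)

# 2) Rank average: robust when the three models' probabilities are
#    calibrated differently, since only the ordering matters for AUC
ranks = lgb_sub['label'].rank() + xgb_sub['label'].rank() + cat_sub['label'].rank()
sub['label'] = ranks / ranks.max()  # rescale to (0, 1]
sub.to_csv('rank_sub.csv', index=False)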