User Loan Default Prediction



I. User Loan Default Prediction

1. Problem overview

With the rapid development of the global economy and the spread of buy-now-pay-later consumption habits, loans have become an important means for both enterprises and individuals to meet their financing needs.


For banks and micro-lending institutions, credit cards and consumer credit are high-risk, high-reward businesses. Mining a user's behavioral data for latent information, i.e. a credit score, and feeding that score into the approval workflow both speeds up approvals and supports the key lending decisions. Conversely, if risk monitoring falls short, the losses to a bank can be immeasurable, so this work is critical.


2. Data description

The dataset for this practice competition is based on the behavioral data of loan users, with some fields adjusted; the field definitions provided by the competition are authoritative.

The fields are as follows:

| Field | Description | Type |
| --- | --- | --- |
| id | Unique sample identifier | string |
| income | User income | integer |
| age | User age | integer |
| experience_years | Years of work experience | integer |
| is_married | Marital status | string |
| city | City of residence (anonymized) | integer |
| region | Region of residence (anonymized) | integer |
| current_job_years | Years in the current job | integer |
| current_house_years | Years living in the current residence | integer |
| house_ownership | Housing status: rented / owned / neither | string |
| car_ownership | Whether the user owns a car | string |
| profession | Profession (anonymized) | integer |
| label | Whether the user has defaulted in the past | float |


II. Data Analysis




# Unzip the data archive

!unzip -oq /home/aistudio/data/data112664/data.zip




# Read the training data

import pandas as pd

train = pd.read_csv('train.csv')

train.head()




| # | id | income | age | experience_years | is_married | city | region | current_job_years | current_house_years | house_ownership | car_ownership | profession | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | train_0 | 8529345 | 44 | 2 | single | 210 | 0 | 2 | 10 | rented | no | 13 | 0 |
| 1 | train_1 | 7848654 | 55 | 9 | single | 229 | 2 | 9 | 13 | rented | no | 43 | 0 |
| 2 | train_2 | 8491491 | 61 | 20 | single | 114 | 28 | 8 | 11 | rented | no | 12 | 0 |
| 3 | train_3 | 8631544 | 69 | 13 | married | 276 | 14 | 13 | 12 | rented | no | 27 | 0 |
| 4 | train_4 | 6947233 | 62 | 10 | single | 56 | 11 | 10 | 12 | rented | no | 47 | 0 |
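A quick schema check against the field table above; `train.dtypes` should report `object` for the string fields (id, is_married, house_ownership, car_ownership) and integer dtypes for the rest:

train.dtypes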






# 1. How many samples in the dataset have label=1, and how many have label=0?

import matplotlib

%matplotlib inline

print(train['label'].value_counts())

train['label'].value_counts().plot.bar()


    0    147325

    1     20675

    Name: label, dtype: int64






    <matplotlib.axes._subplots.AxesSubplot at 0x7f234a8df810>
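The bar chart (not reproduced here) just visualizes a clear class imbalance: 20,675 defaults against 147,325 non-defaults, roughly 1:7. A short sketch of how to quantify it; the `scale_pos_weight` remark is an optional knob, not something the models below actually set:

# Quantify the imbalance implied by the counts above
pos_rate = (train['label'] == 1).mean()
print(f'positive rate: {pos_rate:.4f}')                            # 20675 / 168000 ≈ 0.1231
print(f'negatives per positive: {(1 - pos_rate) / pos_rate:.2f}')  # ≈ 7.13
# LightGBM/XGBoost could compensate via scale_pos_weight ≈ n_neg / n_pos if desired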






# 2. What are the ages of samples 1000, 2000, 3000 and 4000?

print(train['age'][1000],train['age'][2000],train['age'][3000],train['age'][4000])


    28 67 74 54





# 3. How many distinct values does region take?

train['region'].nunique()





    29






# 4. What is the maximum income?

train['income'].max()





    9999938






# 5. How many samples have label=1 and age at most 40?

len(train.loc[(train['label']==1) & (train['age']<=40)])





    7441






# 6. What is the mean income of single users?

train.loc[(train['is_married']=='single')]['income'].mean()





    4999634.8222969035






# 7. How many variables have an absolute correlation with label greater than 0.03?

train.corr()






| | income | age | experience_years | city | region | current_job_years | current_house_years | profession | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| income | 1.000000 | 0.000080 | 0.005880 | -0.002808 | -0.002914 | 0.008629 | -0.002476 | 0.003137 | -0.003210 |
| age | 0.000080 | 1.000000 | -0.002743 | 0.003705 | -0.004571 | 0.002111 | -0.020356 | -0.009920 | -0.022394 |
| experience_years | 0.005880 | -0.002743 | 1.000000 | -0.024454 | -0.000557 | 0.646392 | 0.017630 | -0.000573 | -0.033902 |
| city | -0.002808 | 0.003705 | -0.024454 | 1.000000 | -0.036352 | -0.027934 | -0.009120 | 0.019299 | 0.004715 |
| region | -0.002914 | -0.004571 | -0.000557 | -0.036352 | 1.000000 | 0.008872 | 0.005963 | 0.004280 | -0.005890 |
| current_job_years | 0.008629 | 0.002111 | 0.646392 | -0.027934 | 0.008872 | 1.000000 | 0.005183 | -0.005910 | -0.016840 |
| current_house_years | -0.002476 | -0.020356 | 0.017630 | -0.009120 | 0.005963 | 0.005183 | 1.000000 | 0.002931 | -0.005695 |
| profession | 0.003137 | -0.009920 | -0.000573 | 0.019299 | 0.004280 | -0.005910 | 0.002931 | 1.000000 | -0.004034 |
| label | -0.003210 | -0.022394 | -0.033902 | 0.004715 | -0.005890 | -0.016840 | -0.005695 | -0.004034 | 1.000000 |
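Reading the label row: only experience_years exceeds 0.03 in absolute value (−0.033902), so the answer is 1. A one-liner to confirm:

(train.corr()['label'].drop('label').abs() > 0.03).sum()  # -> 1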




# 8. How many variables contain null values?
for column in train.columns:
    print(column, train[column].isnull().any())


    id False

    income False

    age False

    experience_years False

    is_married False

    city False

    region False

    current_job_years False

    current_house_years False

    house_ownership False

    car_ownership False

    profession False

    label False



Every column reports False, so no variable contains null values; the answer is 0.


# 9. Which region has the most defaulting users?

train['region'].value_counts()





    25    18903

    14    17162

    0     16773

    28    15722

    2     13295

    22    10980

    13     9248

    11     7843

    6      7587

    20     6100

    10     5972

    7      5268

    23     4991

    1      4814

    12     3918

    5      3719

    19     3138

    17     3047

    4      2572

    27     1256

    9      1167

    18      968

    15      567

    16      561

    24      556

    8       538

    26      497

    3       421

    21      417

    Name: region, dtype: int64
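Note that value_counts() over the full frame counts all users per region, so the output above only says region 25 has the most users overall. To answer the question as asked, filter to defaulters first (a sketch; the resulting region is not shown in the original output):

# Count only rows where the user defaulted, then take the most frequent region
train.loc[train['label'] == 1, 'region'].value_counts().idxmax()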






# 10. Which profession has the highest average income?



prof_income = train.groupby('profession')['income'].mean()
print(prof_income.idxmax(), prof_income.max())  # idxmax returns the profession label, not its position


    36 5447858.759071981


III. Model Selection

Gradient-boosting workhorses: LightGBM, XGBoost and CatBoost.

Model principles: (diagram not reproduced)

Model comparison: (diagram not reproduced)


1. Import libraries




!pip install catboost



import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder


2. Load the data




train = pd.read_csv('train.csv')

test = pd.read_csv('test.csv')

data = pd.concat([train, test], axis=0)



cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
    lb = LabelEncoder()
    # fit on the concatenated train+test so both share one consistent encoding
    # (the original `data[col] = lb.fit(data[col])` overwrote the column with the encoder object)
    lb.fit(data[col])
    train[col] = lb.transform(train[col])
    test[col] = lb.transform(test[col])
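Fitting on the concatenated data guarantees that a category seen only in test still receives a code. To sanity-check the mapping for one column (illustrative; LabelEncoder assigns codes in sorted order):

lb = LabelEncoder().fit(data['is_married'])
print(dict(zip(lb.classes_, lb.transform(lb.classes_))))  # e.g. {'married': 0, 'single': 1}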



no_feas = ['id', 'label']

features = [col for col in train.columns if col not in no_feas]

X_train = train[features]

X_test = test[features]



y_train = train['label'].astype(int)




def train_model_classification(X, X_test, y, params, num_classes=2,

                               folds=None, model_type='lgb',

                               eval_metric='logloss', columns=None,

                               plot_feature_importance=False,

                               model=None, verbose=10000,

                               early_stopping_rounds=200,

                               splits=None, n_folds=3):

    """

    分类模型函数

    返回字典,包括: oof predictions, test predictions, scores and, if necessary, feature importances.

    :params: X - 训练数据, pd.DataFrame

    :params: X_test - 测试数据,pd.DataFrame

    :params: y - 目标

    :params: folds - folds to split data

    :params: model_type - 模型

    :params: eval_metric - 评价指标

    :params: columns - 特征列

    :params: plot_feature_importance - 是否展示特征重要性

    :params: model - sklearn model, works only for "sklearn" model type

    """

    start_time = time.time()

    global y_pred_valid, y_pred



    columns = X.columns if columns is None else columns

    X_test = X_test[columns]

    # compute n_splits before `splits` is reassigned, otherwise the check below never sees None
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits



    # to set up scoring parameters

    metrics_dict = {

        'logloss': {

            'lgb_metric_name': 'logloss',

            'xgb_metric_name': 'logloss',

            'catboost_metric_name': 'Logloss',

            'sklearn_scoring_function': metrics.log_loss

        },

        'lb_score_method': {

            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric

            'sklearn_scoring_auc': metrics.roc_auc_score

        },

    }

    result_dict = {}



    # out-of-fold predictions on train data

    oof = np.zeros(shape=(len(X), num_classes))

    # averaged predictions on train data

    prediction = np.zeros(shape=(len(X_test), num_classes))

    # list of scores on folds

    acc_scores=[]

    scores = []

    # feature importance

    feature_importance = pd.DataFrame()



    # split and train on folds

    for fold_n, (train_index, valid_index) in enumerate(splits):

        if verbose:

            print(f'Fold {fold_n + 1} started at {time.ctime()}')

        if type(X) == np.ndarray:

            X_train, X_valid = X[train_index], X[valid_index]

            y_train, y_valid = y[train_index], y[valid_index]

        else:

            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]

            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]



        if model_type == 'lgb':

            model = lgb.LGBMClassifier(**params)

            model.fit(X_train, y_train,

                      eval_set=[(X_train, y_train), (X_valid, y_valid)],

                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],

                      verbose=verbose,

                      early_stopping_rounds=early_stopping_rounds)



            y_pred_valid = model.predict_proba(X_valid)

            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)



        if model_type == 'xgb':

            model = xgb.XGBClassifier(**params)

            model.fit(X_train, y_train,

                      eval_set=[(X_train, y_train), (X_valid, y_valid)],

                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],

                      verbose=bool(verbose),  # xgb verbose bool

                      early_stopping_rounds=early_stopping_rounds)

            y_pred_valid = model.predict_proba(X_valid)

            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model.fit(X_train, y_train)

            y_pred_valid = model.predict_proba(X_valid)

            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)

            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')

            y_pred = model.predict_proba(X_test)



        if model_type == 'cat':

            model = CatBoostClassifier(iterations=20000, eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],

                                       **params,

                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])

            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,

                      verbose=False)



            y_pred_valid = model.predict_proba(X_valid)

            y_pred = model.predict_proba(X_test)



        oof[valid_index] = y_pred_valid

        # fold-level evaluation metrics

        acc_scores.append(

            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))

        scores.append(

            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:,1]))

        print(acc_scores)

        print(scores)

        prediction += y_pred



        if model_type == 'lgb' and plot_feature_importance:

            # feature importance

            fold_importance = pd.DataFrame()

            fold_importance["feature"] = columns

            fold_importance["importance"] = model.feature_importances_

            fold_importance["fold"] = fold_n + 1

            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)



        if model_type == 'xgb' and plot_feature_importance:

            # feature importance

            fold_importance = pd.DataFrame()

            fold_importance["feature"] = columns

            fold_importance["importance"] = model.feature_importances_

            fold_importance["fold"] = fold_n + 1

            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_splits

    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))



    result_dict['oof'] = oof

    result_dict['prediction'] = prediction

    result_dict['acc_scores'] = acc_scores

    result_dict['scores'] = scores



    if model_type == 'lgb' or model_type == 'xgb':

        if plot_feature_importance:

            feature_importance["importance"] /= n_splits

            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(

                by="importance", ascending=False)[:50].index



            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]



            plt.figure(figsize=(16, 12))

            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))

            plt.title('LGB Features (avg over folds)')

            plt.show()

            result_dict['feature_importance'] = feature_importance

    end_time = time.time()



    print("train_model_classification cost time:{}".format(end_time - start_time))

    return result_dict




3. LightGBM model




lgb_params = {

    'boosting_type': 'gbdt',

    'objective': 'binary',

    'n_estimators': 100000,  # deliberately huge; early_stopping_rounds decides the real tree count

    'learning_rate': 0.1,

    'random_state': 2948,

    'bagging_freq': 8,

    'bagging_fraction': 0.80718,

    'feature_fraction': 0.38691,  # 0.3

    'feature_fraction_seed': 11,

    'max_depth': 9,

    'min_data_in_leaf': 40,

    'min_child_weight': 0.18654,

    "min_split_gain": 0.35079,

    'min_sum_hessian_in_leaf': 1.11347,

    'num_leaves': 29,

    'num_threads': 6,

    "lambda_l1": 0.55831,

    'lambda_l2': 1.67906,

    'cat_smooth': 10.4,

    'subsample': 0.7,         # alias of bagging_fraction above; LightGBM keeps one and warns
    'colsample_bytree': 0.7,  # alias of feature_fraction above

    # 'n_jobs': -1,

    'metric': 'auc'

}



n_fold = 5

num_classes = 2

print("分类个数num_classes:{}".format(num_classes))

folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)  # stratified: each fold keeps the ~12% positive rate

X = train[features]

print(y_train.value_counts())

X_test = test[features]



result_dict_lgb = train_model_classification(X=X,

                                             X_test=X_test,

                                             y=y_train,

                                             params=lgb_params,

                                             num_classes=num_classes,

                                             folds=folds,

                                             model_type='lgb',

                                             eval_metric='logloss',

                                             plot_feature_importance=True,

                                             verbose=200,

                                             early_stopping_rounds=200,

                                             n_folds=n_fold

                                             )



acc_score = np.mean(result_dict_lgb['acc_scores'])

score = np.mean(result_dict_lgb['scores'])

print(score)
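The printed value is the mean of per-fold AUCs; since the function also returns out-of-fold predictions, the single overall OOF AUC can be computed directly (a sketch using the returned dict):

oof_auc = metrics.roc_auc_score(y_train, result_dict_lgb['oof'][:, 1])
print('OOF AUC:', oof_auc)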



sub = pd.read_csv('sample_submit.csv')



sub['label'] = result_dict_lgb['prediction'][:, 1]

sub[['id','label']].to_csv('lgb_sub.csv', index=None)


4. XGBoost model




xgb_params = {

    'objective': 'binary:logistic',

    'max_depth': 5,

    'n_estimators': 100000,

    'learning_rate': 0.1,

    'nthread': 4,

    'subsample': 0.7,

    'colsample_bytree': 0.7,

    'min_child_weight': 3,

    'n_jobs': -1  # duplicates 'nthread' above (they are aliases); only one takes effect

}

n_fold = 5

num_classes = 2

print("分类个数num_classes:{}".format(num_classes))

folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)



result_dict_xgb = train_model_classification(X=X,

                                             X_test=X_test,

                                             y=y_train,

                                             params=xgb_params,

                                             num_classes=num_classes,

                                             folds=folds,

                                             model_type='xgb',

                                             eval_metric='logloss',

                                             plot_feature_importance=False,

                                             verbose=10,

                                             early_stopping_rounds=200,

                                             n_folds=n_fold)




acc_score = np.mean(result_dict_xgb['acc_scores'])

score = np.mean(result_dict_xgb['scores'])

print(score)



sub = pd.read_csv('sample_submit.csv')

sub['label'] = result_dict_xgb['prediction'][:, 1]

sub[['id','label']].to_csv('xgb_sub.csv', index=None)


5. CatBoost model





train = pd.read_csv('train.csv')

test = pd.read_csv('test.csv')

data = pd.concat([train, test], axis=0)

\


cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
    lb = LabelEncoder()
    lb.fit(data[col])  # same fix as in section 2: fit, then transform
    train[col] = lb.transform(train[col])
    test[col] = lb.transform(test[col])



no_feas = ['id', 'label']

features = [col for col in train.columns if col not in no_feas]

X_train = train[features]

X_test = test[features]



y_train = train['label'].astype(int)




cat_params = {'learning_rate': 0.1, 'depth': 9, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
              'od_type': 'Iter', 'od_wait': 50,  # CatBoost's built-in overfitting detector (early stopping)
              'random_seed': 11, 'allow_writing_files': False}

n_fold = 5

num_classes = 2

print("分类个数num_classes:{}".format(num_classes))

folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)

# train, y, test, features = load_data()



result_dict_cat = train_model_classification(X=X,

                                             X_test=X_test,

                                             y=y_train,

                                             params=cat_params,

                                             num_classes=num_classes,

                                             folds=folds,

                                             model_type='cat',

                                             eval_metric='logloss',

                                             plot_feature_importance=False,

                                             verbose=1,

                                             early_stopping_rounds=200,

                                             n_folds=n_fold)


    num_classes: 2

    Fold 1 started at Tue Oct 19 19:31:48 2021

    [0.8984821428571429]

    [0.9224975053919032]

    Fold 2 started at Tue Oct 19 19:32:16 2021

    [0.8984821428571429, 0.8975892857142858]

    [0.9224975053919032, 0.9137011366138295]

    Fold 3 started at Tue Oct 19 19:32:37 2021

    [0.8984821428571429, 0.8975892857142858, 0.8974702380952381]

    [0.9224975053919032, 0.9137011366138295, 0.9188366497992926]

    Fold 4 started at Tue Oct 19 19:33:00 2021

    [0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667]

    [0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075]

    Fold 5 started at Tue Oct 19 19:33:28 2021

    [0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667, 0.8984821428571429]

    [0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075, 0.9216918398255385]

    CV mean score: 0.9203, std: 0.0038.

    train_model_classification cost time:124.51118421554565





acc_score = np.mean(result_dict_cat['acc_scores'])

score = np.mean(result_dict_cat['scores'])

print(score)



sub = pd.read_csv('sample_submit.csv')

sub['label'] = result_dict_cat['prediction'][:, 1]

sub[['id','label']].to_csv('cat_sub.csv', index=None)


    0.9202806715733278



6. Model ensembling




!head lgb_sub.csv


    id,label

    test_0,0.04768790720651824

    test_1,0.03234766703952933

    test_2,0.019486958139745335

    test_3,0.025913724543374726

    test_4,0.015892808511107425

    test_5,0.020375922959108346

    test_6,0.5500078835133054

    test_7,0.015298065400623622

    test_8,0.5267191550395849





!head xgb_sub.csv


    id,label

    test_0,0.046526006609201434

    test_1,0.01112006399780512

    test_2,0.02196923615410924

    test_3,0.022300471551716328

    test_4,0.019372097216546535

    test_5,0.02230435274541378

    test_6,0.5783432126045227

    test_7,0.021838839538395403

    test_8,0.5301580131053925





!head cat_sub.csv


    id,label

    test_0,0.024350702410146956

    test_1,0.012536424839143129

    test_2,0.04935167789692676

    test_3,0.030931472369207347

    test_4,0.025919158896289263

    test_5,0.015382541256121665

    test_6,0.5760770372253025

    test_7,0.01336187014240726

    test_8,0.5325660396323679
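Before blending, it is worth checking how correlated the three submissions are, since near-identical predictions gain little from averaging (a quick sketch):

import pandas as pd

subs = {name: pd.read_csv(f'{name}_sub.csv')['label'] for name in ('lgb', 'xgb', 'cat')}
print(pd.DataFrame(subs).corr())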





import pandas as pd



lgb_sub = pd.read_csv('lgb_sub.csv')  # renamed: `lgb`/`xgb` would shadow the imported modules
xgb_sub = pd.read_csv('xgb_sub.csv')
ctb_sub = pd.read_csv('cat_sub.csv')
sub = lgb_sub.copy()

sub['label'] = (lgb_sub['label'] + xgb_sub['label'] + ctb_sub['label']) / 3
sub.to_csv("result.csv", index=False)


As the leaderboard score shows, a simple mean of the three submissions is not the best blending strategy.
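A commonly tried alternative is rank averaging: blend the orderings instead of the raw probabilities, which is scale-free and matches the rank-based AUC metric. A minimal sketch with equal, untuned weights:

import pandas as pd

lgb_sub = pd.read_csv('lgb_sub.csv')
xgb_sub = pd.read_csv('xgb_sub.csv')
ctb_sub = pd.read_csv('cat_sub.csv')

blend = lgb_sub.copy()
# rank(pct=True) maps each prediction to its percentile, so a single
# over-confident model cannot dominate the average
blend['label'] = (lgb_sub['label'].rank(pct=True)
                  + xgb_sub['label'].rank(pct=True)
                  + ctb_sub['label'].rank(pct=True)) / 3
blend.to_csv('rank_blend.csv', index=False)

The blended values are no longer calibrated probabilities, which is acceptable when the metric only looks at ordering.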

Project link: aistudio.baidu.com/aistudio/pr…