I. User Loan Default Prediction
\
1. Problem Overview
- Invitation code: ZDEYEU
\
With the rapid development of the global economy and capital markets, and with both business expansion and the spread of consumer credit, loans have become an important way for companies and individuals to solve financial problems.
\
For banks and micro-lending institutions, credit cards and loans are high-risk, high-return businesses. Mining massive user data for latent signals, i.e. a credit score, and feeding that score into approval decisions improves both the efficiency of the approval process and the quality of risk control. If risk monitoring falls short, however, the resulting losses to a bank can be immeasurable, which is why this work is critical.
\
\
2. Data Description
The dataset for this practice competition is based on real loan-user behavior data, with some fields adjusted; treat the field information provided by the competition as authoritative.
The fields are as follows:
\
Field | Description | Type
---|---|---
id | unique sample identifier | string
income | user income | integer
age | user age | integer
experience_years | years of work experience | integer
is_married | marital status (single / married) | string
city | city of residence, anonymized | integer
region | region of residence, anonymized | integer
current_job_years | years in the current job | integer
current_house_years | years in the current residence | integer
house_ownership | housing status: rented / owned / neither | string
car_ownership | whether the user owns a car | string
profession | occupation, anonymized | integer
label | whether the user defaulted in the past (0/1) | integer
\
II. Data Analysis
\
# Unzip the data archive
!unzip -oq /home/aistudio/data/data112664/data.zip
\
# Load the training data
import pandas as pd
train=pd.read_csv('train.csv')
train.head()
\
| | id | income | age | experience_years | is_married | city | region | current_job_years | current_house_years | house_ownership | car_ownership | profession | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_0 | 8529345 | 44 | 2 | single | 210 | 0 | 2 | 10 | rented | no | 13 | 0 |
| 1 | train_1 | 7848654 | 55 | 9 | single | 229 | 2 | 9 | 13 | rented | no | 43 | 0 |
| 2 | train_2 | 8491491 | 61 | 20 | single | 114 | 28 | 8 | 11 | rented | no | 12 | 0 |
| 3 | train_3 | 8631544 | 69 | 13 | married | 276 | 14 | 13 | 12 | rented | no | 27 | 0 |
| 4 | train_4 | 6947233 | 62 | 10 | single | 56 | 11 | 10 | 12 | rented | no | 47 | 0 |
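Before working through the questions below, it is worth confirming the actual dtypes and category values, since the field types are easy to get wrong (for example, house_ownership holds strings, not integers):

# Check column dtypes and the full set of values of each categorical field
print(train.dtypes)
for col in ['house_ownership', 'car_ownership', 'is_married']:
    print(col, train[col].unique())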
\
# 1. How many samples have label 1, and how many have label 0?
import matplotlib
%matplotlib inline
print(train['label'].value_counts())
train['label'].value_counts().plot.bar()
\
0 147325
1 20675
Name: label, dtype: int64
About 12.3% of the samples (20675 of 168000) are positive, a moderate class imbalance; this is one reason AUC, rather than accuracy, is used as the main evaluation metric later on.
# 2. What are the ages of rows 1000, 2000, 3000 and 4000 (0-based index)?
print(train['age'][1000],train['age'][2000],train['age'][3000],train['age'][4000])
\
28 67 74 54
\
# 3. How many distinct values does region take?
train['region'].nunique()
\
29
\
# 4. What is the maximum income?
train['income'].max()
\
9999938
\
# 5. How many samples have label == 1 and age at most 40?
len(train.loc[(train['label']==1) & (train['age']<=40)])
\
7441
\
# 6. What is the mean income of single (is_married == 'single') users?
train.loc[(train['is_married']=='single')]['income'].mean()
\
4999634.8222969035
\
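Equivalently, a single groupby reports the mean income for every marital status at once:

# Mean income by marital status
print(train.groupby('is_married')['income'].mean())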
# 7. How many variables have an absolute correlation with label greater than 0.03?
train.corr()
\
| | income | age | experience_years | city | region | current_job_years | current_house_years | profession | label |
|---|---|---|---|---|---|---|---|---|---|
| income | 1.000000 | 0.000080 | 0.005880 | -0.002808 | -0.002914 | 0.008629 | -0.002476 | 0.003137 | -0.003210 |
| age | 0.000080 | 1.000000 | -0.002743 | 0.003705 | -0.004571 | 0.002111 | -0.020356 | -0.009920 | -0.022394 |
| experience_years | 0.005880 | -0.002743 | 1.000000 | -0.024454 | -0.000557 | 0.646392 | 0.017630 | -0.000573 | -0.033902 |
| city | -0.002808 | 0.003705 | -0.024454 | 1.000000 | -0.036352 | -0.027934 | -0.009120 | 0.019299 | 0.004715 |
| region | -0.002914 | -0.004571 | -0.000557 | -0.036352 | 1.000000 | 0.008872 | 0.005963 | 0.004280 | -0.005890 |
| current_job_years | 0.008629 | 0.002111 | 0.646392 | -0.027934 | 0.008872 | 1.000000 | 0.005183 | -0.005910 | -0.016840 |
| current_house_years | -0.002476 | -0.020356 | 0.017630 | -0.009120 | 0.005963 | 0.005183 | 1.000000 | 0.002931 | -0.005695 |
| profession | 0.003137 | -0.009920 | -0.000573 | 0.019299 | 0.004280 | -0.005910 | 0.002931 | 1.000000 | -0.004034 |
| label | -0.003210 | -0.022394 | -0.033902 | 0.004715 | -0.005890 | -0.016840 | -0.005695 | -0.004034 | 1.000000 |

Reading the label column, only experience_years has an absolute correlation above 0.03 (about 0.034), so the answer is 1.
# 8. How many variables contain null values?
for column in train.columns:
print(column,train[column].isnull().any())
\
id False
income False
age False
experience_years False
is_married False
city False
region False
current_job_years False
current_house_years False
house_ownership False
car_ownership False
profession False
label False
\
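Since pandas can aggregate the per-column check, a one-liner gives the count of columns with missing values directly (0 for this dataset):

# Number of columns that contain at least one null
print(train.isnull().any().sum())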
# 9. Which region has the most defaulting users?
# The original snippet counted all users per region; to count defaults,
# filter on label == 1 first.
defaults_by_region = train.loc[train['label'] == 1, 'region'].value_counts()
print(defaults_by_region.idxmax(), defaults_by_region.max())
\
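Note that the region with the most defaults by raw count is not necessarily the riskiest: a small region can have a higher default rate. A short sketch comparing both views:

# Default count vs. default rate per region
by_region = train.groupby('region')['label'].agg(count='sum', rate='mean')
print(by_region.sort_values('count', ascending=False).head())
print(by_region.sort_values('rate', ascending=False).head())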
# 10. Which profession has the highest mean income?
mean_income = train.groupby('profession')['income'].mean()
print(mean_income.idxmax(), mean_income.max())
\
36 5447858.759071981
\
III. Model Selection
\
Gradient-boosting workhorses: LightGBM, XGBoost and CatBoost.
Model theory:
\
- CatBoost, theory and practice: zhuanlan.zhihu.com/p/37916954
- Understanding XGBoost: zhuanlan.zhihu.com/p/75217528
- A deep dive into LightGBM: zhuanlan.zhihu.com/p/99069186
Model comparison (the comparison chart from the original post is not reproduced): in brief, XGBoost grows trees level-wise with strong regularization; LightGBM uses histogram-based, leaf-wise growth and is usually the fastest of the three; CatBoost adds ordered boosting and native categorical-feature handling.
1. Imports
\
!pip install catboost
\
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
\
2. Loading the Data
\
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0)
\
cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
lb = LabelEncoder()
    # Fit on the combined train+test column so both splits share one encoding
    # (the original assigned the fitted encoder object into data[col])
    lb.fit(data[col])
train[col] = lb.transform(train[col])
test[col] = lb.transform(test[col])
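A quick optional check of the encoding: this prints the category-to-integer mapping learned by the last fitted encoder in the loop (the one for is_married).

# Inspect the category -> integer mapping of the last LabelEncoder
print(dict(zip(lb.classes_, lb.transform(lb.classes_))))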
\
no_feas = ['id', 'label']
features = [col for col in train.columns if col not in no_feas]
X_train = train[features]
X_test = test[features]
\
y_train = train['label'].astype(int)
\
def train_model_classification(X, X_test, y, params, num_classes=2,
folds=None, model_type='lgb',
eval_metric='logloss', columns=None,
plot_feature_importance=False,
model=None, verbose=10000,
early_stopping_rounds=200,
splits=None, n_folds=3):
"""
分类模型函数
返回字典,包括: oof predictions, test predictions, scores and, if necessary, feature importances.
:params: X - 训练数据, pd.DataFrame
:params: X_test - 测试数据,pd.DataFrame
:params: y - 目标
:params: folds - folds to split data
:params: model_type - 模型
:params: eval_metric - 评价指标
:params: columns - 特征列
:params: plot_feature_importance - 是否展示特征重要性
:params: model - sklearn model, works only for "sklearn" model type
"""
start_time = time.time()
global y_pred_valid, y_pred
\
columns = X.columns if columns is None else columns
X_test = X_test[columns]
    # Determine n_splits while `splits` is still None; the original checked
    # `splits is None` only after reassigning it, so the check was always
    # False and it silently fell back to n_folds.
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits
\
# to set up scoring parameters
metrics_dict = {
'logloss': {
'lgb_metric_name': 'logloss',
'xgb_metric_name': 'logloss',
'catboost_metric_name': 'Logloss',
'sklearn_scoring_function': metrics.log_loss
},
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
}
result_dict = {}
\
# out-of-fold predictions on train data
oof = np.zeros(shape=(len(X), num_classes))
# averaged predictions on train data
prediction = np.zeros(shape=(len(X_test), num_classes))
# list of scores on folds
acc_scores=[]
scores = []
# feature importance
feature_importance = pd.DataFrame()
\
# split and train on folds
for fold_n, (train_index, valid_index) in enumerate(splits):
if verbose:
print(f'Fold {fold_n + 1} started at {time.ctime()}')
if type(X) == np.ndarray:
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = y[train_index], y[valid_index]
else:
X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
\
if model_type == 'lgb':
model = lgb.LGBMClassifier(**params)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_valid, y_valid)],
eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
verbose=verbose,
early_stopping_rounds=early_stopping_rounds)
\
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)
\
if model_type == 'xgb':
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_valid, y_valid)],
eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
verbose=bool(verbose), # xgb verbose bool
early_stopping_rounds=early_stopping_rounds)
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)
if model_type == 'sklearn':
model = model
model.fit(X_train, y_train)
y_pred_valid = model.predict_proba(X_valid)
score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
y_pred = model.predict_proba(X_test)
\
if model_type == 'cat':
model = CatBoostClassifier(iterations=20000, eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
**params,
loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
verbose=False)
\
y_pred_valid = model.predict_proba(X_valid)
y_pred = model.predict_proba(X_test)
\
oof[valid_index] = y_pred_valid
        # leaderboard-style metrics on this fold
acc_scores.append(
metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
scores.append(
metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:,1]))
print(acc_scores)
print(scores)
prediction += y_pred
\
if model_type == 'lgb' and plot_feature_importance:
# feature importance
fold_importance = pd.DataFrame()
fold_importance["feature"] = columns
fold_importance["importance"] = model.feature_importances_
fold_importance["fold"] = fold_n + 1
feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
\
if model_type == 'xgb' and plot_feature_importance:
# feature importance
fold_importance = pd.DataFrame()
fold_importance["feature"] = columns
fold_importance["importance"] = model.feature_importances_
fold_importance["fold"] = fold_n + 1
feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
prediction /= n_splits
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
\
result_dict['oof'] = oof
result_dict['prediction'] = prediction
result_dict['acc_scores'] = acc_scores
result_dict['scores'] = scores
\
\
if model_type == 'lgb' or model_type == 'xgb':
if plot_feature_importance:
feature_importance["importance"] /= n_splits
cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
by="importance", ascending=False)[:50].index
\
best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
\
plt.figure(figsize=(16, 12))
sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
plt.title('LGB Features (avg over folds)')
plt.show()
result_dict['feature_importance'] = feature_importance
end_time = time.time()
\
print("train_model_classification cost time:{}".format(end_time - start_time))
return result_dict
\
\
3. LightGBM Model
- GitHub: github.com/microsoft/L…
\
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    # NOTE: bagging_fraction/feature_fraction duplicate the sklearn-API
    # aliases subsample/colsample_bytree further down; LightGBM warns about
    # the clash and only one value of each pair takes effect.
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    'min_split_gain': 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    'lambda_l1': 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    # 'n_jobs': -1,
    'metric': 'auc'
}
\
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314,shuffle=True)
X = train[features]
print(y_train.value_counts())
X_test = test[features]
\
result_dict_lgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=lgb_params,
num_classes=num_classes,
folds=folds,
model_type='lgb',
eval_metric='logloss',
plot_feature_importance=True,
verbose=200,
early_stopping_rounds=200,
n_folds=n_fold
)
\
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
\
sub['label'] = result_dict_lgb['prediction'][:, 1]
sub[['id','label']].to_csv('lgb_sub.csv', index=None)
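As a sanity check on the CV estimate, the out-of-fold predictions returned by the helper can be scored directly against the training labels; this should roughly match the mean fold AUC printed above:

from sklearn.metrics import roc_auc_score

# Overall AUC of the out-of-fold predictions on the training set
print(roc_auc_score(y_train, result_dict_lgb['oof'][:, 1]))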
\
4. XGBoost Model
- GitHub: github.com/dmlc/xgboos…
\
xgb_params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'nthread': 4,  # deprecated alias of n_jobs; only one of the two is needed
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'n_jobs': -1
}
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
\
result_dict_xgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=xgb_params,
num_classes=num_classes,
folds=folds,
model_type='xgb',
eval_metric='logloss',
plot_feature_importance=False,
verbose=10,
early_stopping_rounds=200,
n_folds=n_fold)
\
acc_score = np.mean(result_dict_xgb['acc_scores'])
score = np.mean(result_dict_xgb['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
sub['label'] = result_dict_xgb['prediction'][:, 1]
sub[['id','label']].to_csv('xgb_sub.csv', index=None)
\
5. CatBoost Model
\
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0)
\
cate_cols = ['car_ownership', 'house_ownership', 'is_married']
for col in cate_cols:
lb = LabelEncoder()
    # Fit on the combined train+test column, then transform each split
    # (same fix as above: the original assigned the encoder object itself)
    lb.fit(data[col])
train[col] = lb.transform(train[col])
test[col] = lb.transform(test[col])
\
no_feas = ['id', 'label']
features = [col for col in train.columns if col not in no_feas]
X_train = train[features]
X_test = test[features]
\
y_train = train['label'].astype(int)
\
cat_params = {'learning_rate': 0.1, 'depth': 9, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
# train, y, test, features = load_data()
\
\
result_dict_cat = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=cat_params,
num_classes=num_classes,
folds=folds,
model_type='cat',
eval_metric='logloss',
plot_feature_importance=False,
verbose=1,
early_stopping_rounds=200,
n_folds=n_fold)
\
num_classes: 2
Fold 1 started at Tue Oct 19 19:31:48 2021
[0.8984821428571429]
[0.9224975053919032]
Fold 2 started at Tue Oct 19 19:32:16 2021
[0.8984821428571429, 0.8975892857142858]
[0.9224975053919032, 0.9137011366138295]
Fold 3 started at Tue Oct 19 19:32:37 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926]
Fold 4 started at Tue Oct 19 19:33:00 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075]
Fold 5 started at Tue Oct 19 19:33:28 2021
[0.8984821428571429, 0.8975892857142858, 0.8974702380952381, 0.8991666666666667, 0.8984821428571429]
[0.9224975053919032, 0.9137011366138295, 0.9188366497992926, 0.924676226236075, 0.9216918398255385]
CV mean score: 0.9203, std: 0.0038.
train_model_classification cost time:124.51118421554565
\
acc_score = np.mean(result_dict_cat['acc_scores'])
score = np.mean(result_dict_cat['scores'])
print(score)
\
sub = pd.read_csv('sample_submit.csv')
sub['label'] = result_dict_cat['prediction'][:, 1]
sub[['id','label']].to_csv('cat_sub.csv', index=None)
\
0.9202806715733278
\
6. Blending the Models
\
!head lgb_sub.csv
\
id,label
test_0,0.04768790720651824
test_1,0.03234766703952933
test_2,0.019486958139745335
test_3,0.025913724543374726
test_4,0.015892808511107425
test_5,0.020375922959108346
test_6,0.5500078835133054
test_7,0.015298065400623622
test_8,0.5267191550395849
\
!head xgb_sub.csv
\
id,label
test_0,0.046526006609201434
test_1,0.01112006399780512
test_2,0.02196923615410924
test_3,0.022300471551716328
test_4,0.019372097216546535
test_5,0.02230435274541378
test_6,0.5783432126045227
test_7,0.021838839538395403
test_8,0.5301580131053925
\
!head cat_sub.csv
\
id,label
test_0,0.024350702410146956
test_1,0.012536424839143129
test_2,0.04935167789692676
test_3,0.030931472369207347
test_4,0.025919158896289263
test_5,0.015382541256121665
test_6,0.5760770372253025
test_7,0.01336187014240726
test_8,0.5325660396323679
\
import pandas as pd

# Use distinct names: plain `lgb`/`xgb` would shadow the library imports above
lgb_sub = pd.read_csv('lgb_sub.csv')
xgb_sub = pd.read_csv('xgb_sub.csv')
cat_sub = pd.read_csv('cat_sub.csv')
sub = lgb_sub.copy()

# Blend by a simple arithmetic mean of the three predicted probabilities
sub['label'] = (lgb_sub['label'] + xgb_sub['label'] + cat_sub['label']) / 3
sub.to_csv("result.csv", index=False)
\
\
Evidently a plain average of the three submissions is not the best way to blend them.
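Two common alternatives are a weighted average and a rank average. The sketch below is illustrative only: the weights are placeholders that would need to be tuned, for example against the out-of-fold predictions.

import pandas as pd

lgb_sub = pd.read_csv('lgb_sub.csv')
xgb_sub = pd.read_csv('xgb_sub.csv')
cat_sub = pd.read_csv('cat_sub.csv')
sub = lgb_sub.copy()

# 1) Weighted average: the weights here are placeholders, not tuned values
w_lgb, w_xgb, w_cat = 0.4, 0.3, 0.3
sub['label'] = (w_lgb * lgb_sub['label']
                + w_xgb * xgb_sub['label']
                + w_cat * cat_sub['label'])
sub.to_csv('weighted_sub.csv', index=False)

# 2) Rank average: robust when the three models' probabilities are
#    calibrated differently, since only the ordering matters for AUC
ranks = lgb_sub['label'].rank() + xgb_sub['label'].rank() + cat_sub['label'].rank()
sub['label'] = ranks / ranks.max()  # rescale to (0, 1]
sub.to_csv('rank_sub.csv', index=False)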