Financial Risk Prediction


I. Problem

The task is to predict whether a user will default on a loan. The data comes from the loan records of a credit platform, with more than 1.2 million records and 47 columns of variables.

II. Evaluation Metric

The submission is, for each test sample, the predicted probability that y = 1 (i.e., the probability of default). Models are scored by AUC; higher is better.
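As a quick illustration (not part of the original solution), AUC can be computed with scikit-learn's roc_auc_score; the labels and probabilities below are made-up placeholders:

from sklearn.metrics import roc_auc_score
import numpy as np

# Hypothetical labels and predicted default probabilities
y_true = np.array([0, 1, 0, 1, 1])
y_prob = np.array([0.12, 0.83, 0.35, 0.61, 0.48])

# AUC is the probability that a random positive is ranked above a random negative
print(f"AUC: {roc_auc_score(y_true, y_prob):.5f}")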

III. Field Table

| Field | Description |
| --- | --- |
| id | Unique identifier assigned to the loan listing |
| loanAmnt | Loan amount |
| term | Loan term (years) |
| interestRate | Loan interest rate |
| installment | Installment payment amount |
| grade | Loan grade |
| subGrade | Loan subgrade |
| employmentTitle | Employment title |
| employmentLength | Employment length (years) |
| homeOwnership | Home ownership status provided by the borrower at registration |
| annualIncome | Annual income |
| verificationStatus | Verification status |
| issueDate | Month in which the loan was issued |
| purpose | Loan purpose category provided by the borrower at application |
| postCode | First 3 digits of the postal code provided by the borrower in the application |
| regionCode | Region code |
| dti | Debt-to-income ratio |
| delinquency_2years | Number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years |
| ficoRangeLow | Lower bound of the borrower's FICO range at loan origination |
| ficoRangeHigh | Upper bound of the borrower's FICO range at loan origination |
| openAcc | Number of open credit lines in the borrower's credit file |
| pubRec | Number of derogatory public records |
| pubRecBankruptcies | Number of public-record bankruptcies |
| revolBal | Total revolving credit balance |
| revolUtil | Revolving line utilization rate, i.e., the amount of credit the borrower uses relative to all available revolving credit |
| totalAcc | Total number of credit lines currently in the borrower's credit file |
| initialListStatus | Initial listing status of the loan |
| applicationType | Whether the loan is an individual application or a joint application with two co-borrowers |
| earliesCreditLine | Month the borrower's earliest reported credit line was opened |
| title | Loan title provided by the borrower |
| policyCode | policy_code = 1: publicly available product; policy_code = 2: new product, not publicly available |
| n0-n14 | Series of anonymized features: processed counting features of borrower behavior |
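Before any feature work, a quick look at the raw data helps confirm the scale and missing-value pattern described above; a minimal sketch, assuming the data paths used in the main flow below:

import pandas as pd

train = pd.read_csv('data/train.csv')  # path assumed from the main flow
print(train.shape)                                                  # rows x columns
print(train['isDefault'].value_counts(normalize=True))              # class imbalance
print(train.isnull().sum().sort_values(ascending=False).head(10))   # most-missing columns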

IV. Code

# Imports (assumed environment: pandas, numpy, scikit-learn, lightgbm, xgboost, catboost)
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# 1. Data preparation and feature engineering
def load_and_preprocess_data(train_path, test_path):
    # Load train and test, then concatenate so features are built consistently
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    data = pd.concat([train, test], axis=0, ignore_index=True)
    
    # 1.1 Basic feature processing
    # employmentLength: map the text description to an integer year count
    data['employmentLength'] = data['employmentLength'].replace(
        {'10+ years': '10 years', '< 1 year': '0 years'})
    def employmentLength_to_int(s):
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    
    # earliesCreditLine: keep only the year
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]) if pd.notnull(s) else np.nan)
    
    # 1.7 Statistical features (placed before one-hot encoding so the raw
    # categorical columns are still available)
    for col in ['grade', 'subGrade', 'regionCode']:
        # Default rate per category (test rows have NaN labels, which mean() skips)
        default_rate = data.groupby(col)['isDefault'].mean().to_dict()
        data[col+'_default_rate'] = data[col].map(default_rate)
        
        # Average loan amount per category
        mean_loan = data.groupby(col)['loanAmnt'].mean().to_dict()
        data[col+'_mean_loan'] = data[col].map(mean_loan)
    
    # 1.2 Categorical feature encoding (one-hot)
    cate_features = ['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode']
    data = pd.get_dummies(data, columns=cate_features, drop_first=True)
    
    # 1.3 High-cardinality categorical features: encode as counts and ranks
    for f in ['employmentTitle', 'postCode', 'title']:
        data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
        data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).fillna(-1).astype(int)
        del data[f]
    
    # 1.4 Time feature expansion
    data['issueDate'] = pd.to_datetime(data['issueDate'])
    data['issueYear'] = data['issueDate'].dt.year
    data['issueMonth'] = data['issueDate'].dt.month
    data['issueDay'] = data['issueDate'].dt.day
    data['issueDayofweek'] = data['issueDate'].dt.dayofweek
    data['issueDayofyear'] = data['issueDate'].dt.dayofyear
    # Credit history length in years at issue time (issue year minus first credit line year)
    data['credit_history_years'] = data['issueYear'] - data['earliesCreditLine']
    
    # 1.5 Numeric feature binning into deciles
    num_features = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'dti', 
                   'revolBal', 'revolUtil', 'totalAcc', 'openAcc']
    for col in num_features:
        data[col+'_bin'] = pd.qcut(data[col], q=10, duplicates='drop', labels=False)
    
    # 1.6 Cross features (+1 in the denominator avoids division by zero)
    data['income_to_loan_ratio'] = data['annualIncome'] / (data['loanAmnt'] + 1)
    data['installment_to_income_ratio'] = data['installment'] / (data['annualIncome'] + 1)
    data['revolBal_to_income_ratio'] = data['revolBal'] / (data['annualIncome'] + 1)
    
    # Split back into train and test by the presence of the label
    train = data[data.isDefault.notnull()].reset_index(drop=True)
    test = data[data.isDefault.isnull()].reset_index(drop=True)
    
    return train, test

# 2. Model definitions and parameter settings
def get_lgb_params():
    return {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.02,
        'num_leaves': 63,
        'max_depth': -1,
        'min_data_in_leaf': 50,
        'lambda_l1': 0.1,
        'lambda_l2': 0.1,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'seed': 2020,
        'n_jobs': -1,
        'verbose': -1,
        'max_bin': 255,
        'min_gain_to_split': 0.01,
        'is_unbalance': True
    }

def get_xgb_params():
    return {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'eta': 0.02,
        'gamma': 0.1,
        'max_depth': 6,
        'min_child_weight': 10,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'colsample_bylevel': 0.8,
        'lambda': 1,
        'alpha': 0.1,
        'tree_method': 'hist',
        'grow_policy': 'lossguide',
        'seed': 2020,
        'nthread': -1,
        'scale_pos_weight': 5  # tune to the actual negative/positive ratio
    }

def get_cat_params():
    return {
        'iterations': 10000,
        'learning_rate': 0.03,
        'depth': 6,
        'l2_leaf_reg': 3,
        'border_count': 254,
        'loss_function': 'Logloss',
        'eval_metric': 'AUC',
        'random_seed': 2020,
        'od_type': 'Iter',
        'od_wait': 100,
        'bootstrap_type': 'Bayesian',
        'allow_writing_files': False,
        'class_weights': [1, 5]  # tune to the actual negative/positive ratio
    }

# 3. Model training and cross-validation
def cv_model(train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    
    train_pred = np.zeros(train_x.shape[0])
    test_pred = np.zeros(test_x.shape[0])
    cv_scores = []
    feature_importance = pd.DataFrame()
    
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print(f'{"="*20} Fold {i+1} {"="*20}')
        
        # Fold split
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        
        # Sample weights: positive samples weighted higher (2 vs. 1)
        sample_weight = np.where(trn_y == 1, 2, 1)
        
        if clf_name == "lgb":
            params = get_lgb_params()
            train_matrix = lgb.Dataset(trn_x, label=trn_y, weight=sample_weight)
            valid_matrix = lgb.Dataset(val_x, label=val_y)
            
            model = lgb.train(params, train_matrix, 
                            num_boost_round=10000,
                            valid_sets=[train_matrix, valid_matrix],
                            callbacks=[lgb.early_stopping(stopping_rounds=200),
                                     lgb.log_evaluation(period=200)])
            
            # Feature importance for this fold
            fold_importance = pd.DataFrame({
                'feature': trn_x.columns,
                'importance': model.feature_importance(),
                'fold': i+1
            })
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
            
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred += model.predict(test_x, num_iteration=model.best_iteration) / folds
            
        elif clf_name == "xgb":
            params = get_xgb_params()
            dtrain = xgb.DMatrix(trn_x, label=trn_y, weight=sample_weight)
            dvalid = xgb.DMatrix(val_x, label=val_y)
            dtest = xgb.DMatrix(test_x)
            
            model = xgb.train(params, dtrain, 
                            num_boost_round=10000,
                            evals=[(dtrain, 'train'), (dvalid, 'eval')],
                            early_stopping_rounds=200,
                            verbose_eval=200)
            
            # Predict with the best iteration found by early stopping
            val_pred = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
            test_pred += model.predict(dtest, iteration_range=(0, model.best_iteration + 1)) / folds
            
        elif clf_name == "cat":
            params = get_cat_params()
            model = CatBoostClassifier(**params)
            model.fit(trn_x, trn_y, 
                     eval_set=(val_x, val_y),
                     sample_weight=sample_weight,
                     use_best_model=True,
                     verbose=500)
            
            val_pred = model.predict_proba(val_x)[:, 1]
            test_pred += model.predict_proba(test_x)[:, 1] / folds
        
        train_pred[valid_index] = val_pred
        fold_auc = roc_auc_score(val_y, val_pred)
        cv_scores.append(fold_auc)
        print(f'Fold {i+1} AUC: {fold_auc:.5f}')
    
    # Report feature importance averaged over folds
    if clf_name == "lgb":
        importance_df = feature_importance.groupby('feature')['importance'].mean().sort_values(ascending=False)
        print("\nFeature Importance:")
        print(importance_df.head(20))
    
    print("\nCV Results:")
    print(f"{clf_name} AUC scores: {cv_scores}")
    print(f"{clf_name} Mean AUC: {np.mean(cv_scores):.5f} (±{np.std(cv_scores):.5f})")
    
    return train_pred, test_pred

# 4. Model ensembling (stacking)
def stacking_ensemble(x_train, y_train, x_test):
    # Level-1 base models: out-of-fold and test predictions from cv_model
    print("\nTraining LightGBM...")
    lgb_train, lgb_test = cv_model(x_train, y_train, x_test, "lgb")
    
    print("\nTraining XGBoost...")
    xgb_train, xgb_test = cv_model(x_train, y_train, x_test, "xgb")
    
    print("\nTraining CatBoost...")
    cat_train, cat_test = cv_model(x_train, y_train, x_test, "cat")
    
    # Build level-2 training data from the OOF predictions
    stack_train = np.column_stack([lgb_train, xgb_train, cat_train])
    stack_test = np.column_stack([lgb_test, xgb_test, cat_test])
    
    # Hold out a validation split to evaluate the meta-model
    X_train_stack, X_val_stack, y_train_stack, y_val_stack = train_test_split(
        stack_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
    
    # Level-2 meta-model: logistic regression
    print("\nTraining Stacking Model...")
    lr = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
    lr.fit(X_train_stack, y_train_stack)
    
    # Evaluate on the held-out split
    val_pred = lr.predict_proba(X_val_stack)[:, 1]
    val_auc = roc_auc_score(y_val_stack, val_pred)
    print(f"\nStacking Model Validation AUC: {val_auc:.5f}")
    
    # Refit on the full stacked training data
    lr_final = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
    lr_final.fit(stack_train, y_train)
    
    # Predict on the test set
    test_pred = lr_final.predict_proba(stack_test)[:, 1]
    
    return test_pred

# 5. Main pipeline
def main():
    # Data paths
    train_path = 'data/train.csv'
    test_path = 'data/testA.csv'
    
    # 1. Load data and run feature engineering
    print("Loading and preprocessing data...")
    train, test = load_and_preprocess_data(train_path, test_path)
    
    # Prepare features and label
    features = [f for f in train.columns if f not in ['id', 'issueDate', 'isDefault']]
    x_train = train[features]
    y_train = train['isDefault']
    x_test = test[features]
    
    # 2. Train base models and stack them
    print("\nStarting model training and stacking...")
    final_pred = stacking_ensemble(x_train, y_train, x_test)
    
    # 3. Write the submission file
    print("\nGenerating submission file...")
    submission = test[['id']].copy()
    submission['isDefault'] = final_pred
    submission.to_csv('final_submission.csv', index=False)
    print("Submission file saved as 'final_submission.csv'")

if __name__ == "__main__":
    main()

Code Analysis

1. Data Preparation and Feature Engineering

1.1 Data Loading and Merging

  • Train and test sets are loaded together and concatenated so features are built consistently across both
  • Missing and abnormal values are handled; a minimal cleaning sketch follows this list
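The GBDT models used here handle NaN natively, so the original code leaves most missing values alone. If explicit cleaning is wanted, a minimal sketch (the clipping threshold and fill strategy are illustrative assumptions, not part of the original code):

import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    # Cap extreme debt-to-income values (threshold is an assumption)
    df['dti'] = df['dti'].clip(upper=100)
    # Fill missing revolving utilization with the median (one common choice)
    df['revolUtil'] = df['revolUtil'].fillna(df['revolUtil'].median())
    return df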

1.2 Feature Processing Highlights

  • employmentLength: converts the text description into a numeric year count

  • earliesCreditLine: extracts the year

  • Categorical feature encoding

    • One-hot encoding for grade, subGrade, and other low-cardinality features
    • High-cardinality features (employmentTitle, etc.) are replaced by counts and ranks instead of raw values
  • Time feature expansion: multiple calendar features are derived from issueDate

  • Numeric feature binning: continuous variables are cut into decile bins

  • Cross features: ratio features relating income to the loan

1.3 Statistical Features

  • The default rate and average loan amount of each category (grade, etc.) are computed as new features
  • Such target-based statistics can noticeably improve model performance, but they also risk label leakage; a leak-free out-of-fold variant is sketched below
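Because the code above computes each category's default rate on the full training data, every row contributes to its own encoding. A common safeguard (not used in the original code) is out-of-fold target encoding; a minimal sketch:

from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

def oof_target_encode(train, col, target='isDefault', n_splits=5, seed=2020):
    # Encode each row with a category mean computed on the *other* folds only
    encoded = pd.Series(np.nan, index=train.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train[col].iloc[enc_idx].map(fold_means).to_numpy()
    # Categories unseen in a fold fall back to the global mean
    return encoded.fillna(train[target].mean())

# e.g. train['grade_default_rate'] = oof_target_encode(train, 'grade')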

2. Model Design and Parameter Configuration

The code implements three mainstream gradient-boosted tree models:

2.1 LightGBM

  • GBDT boosting with a moderate leaf count (63) and unrestricted depth (-1)
  • L1/L2 regularization plus feature and row subsampling
  • Class imbalance handled via is_unbalance=True

2.2 XGBoost

  • hist tree method to speed up training
  • Moderate depth (6) and minimum child weight (10)
  • Positive-class weight raised via scale_pos_weight=5

2.3 CatBoost

  • Bayesian bootstrap
  • Large iteration budget (10000) with early stopping
  • Class weights [1, 5] for the imbalanced data; a sketch for deriving such weights from the labels follows this section
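Both scale_pos_weight (XGBoost) and class_weights (CatBoost) encode the same idea: up-weight the minority class, typically by roughly the negative-to-positive ratio. A minimal sketch for deriving the ratio from the labels (y_train as in the main flow):

import numpy as np

def neg_pos_ratio(y_train):
    # Common starting point: scale_pos_weight = n_negative / n_positive,
    # or class_weights = [1, ratio] for CatBoost
    y = np.asarray(y_train)
    return (y == 0).sum() / (y == 1).sum()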

3. Model Training and Validation

3.1 Cross-Validation Strategy

  • 5-fold stratified cross-validation (StratifiedKFold)
  • Validation AUC and feature importance are recorded for each fold
  • Early stopping guards against overfitting; the sketch after this list shows what stratification preserves
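Stratification keeps the positive rate nearly identical across folds, which makes per-fold AUC scores comparable; a quick check (y_train as in the main flow):

from sklearn.model_selection import StratifiedKFold
import numpy as np

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
X_dummy = np.zeros(len(y_train))  # only y matters for the stratified split
for i, (_, val_idx) in enumerate(kf.split(X_dummy, y_train)):
    print(f"Fold {i+1} positive rate: {y_train.iloc[val_idx].mean():.4f}")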

3.2 Training Techniques

  • Positive samples receive a higher sample weight (2 vs. 1)
  • Each model uses its own evaluation metric and early-stopping settings
  • LightGBM feature importance is recorded for analysis

4. Model Ensembling

4.1 Two-Layer Stacking Architecture

  • Layer 1: three different base models (LightGBM, XGBoost, CatBoost)
  • Layer 2: logistic regression as the meta-model

4.2 Ensembling Strategy

  • Base models generate out-of-fold (OOF) predictions via cross-validation
  • The meta-model is evaluated on a held-out validation split
  • The final meta-model is then refit on all of the stacked training data

V. Prediction Results
