1. Problem
The task is to predict whether a user will default on a loan. The data comes from the loan records of a credit platform: over 1.2 million rows with 47 columns of variables.
2. Evaluation Metric
The submission is, for each test sample, the predicted probability that the label is 1, i.e. P(y = 1). Models are scored by AUC (higher is better).
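For local validation, AUC can be computed with scikit-learn's roc_auc_score; a minimal sketch with toy labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.35, 0.62, 0.80, 0.20, 0.55]

# Prints 1.0 here, since every positive sample outranks every negative one
print(roc_auc_score(y_true, y_score))
```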
3. Field Table
| Field | Description |
|---|---|
| id | Unique credit identifier assigned to the loan record |
| loanAmnt | Loan amount |
| term | Loan term (years) |
| interestRate | Loan interest rate |
| installment | Monthly installment amount |
| grade | Loan grade |
| subGrade | Loan sub-grade |
| employmentTitle | Employment title |
| employmentLength | Employment length (years) |
| homeOwnership | Home-ownership status provided by the borrower at registration |
| annualIncome | Annual income |
| verificationStatus | Verification status |
| issueDate | Month in which the loan was issued |
| purpose | Loan purpose category stated by the borrower in the application |
| postCode | First 3 digits of the postal code provided by the borrower in the application |
| regionCode | Region code |
| dti | Debt-to-income ratio |
| delinquency_2years | Number of 30+ days past-due delinquencies in the borrower's credit file over the past 2 years |
| ficoRangeLow | Lower bound of the borrower's FICO range at loan issuance |
| ficoRangeHigh | Upper bound of the borrower's FICO range at loan issuance |
| openAcc | Number of open credit lines in the borrower's credit file |
| pubRec | Number of derogatory public records |
| pubRecBankruptcies | Number of public-record bankruptcies |
| revolBal | Total revolving credit balance |
| revolUtil | Revolving line utilization rate: credit used relative to all available revolving credit |
| totalAcc | Total number of credit lines currently in the borrower's credit file |
| initialListStatus | Initial listing status of the loan |
| applicationType | Whether the loan is an individual application or a joint application with two co-borrowers |
| earliesCreditLine | Month in which the borrower's earliest reported credit line was opened |
| title | Loan title provided by the borrower |
| policyCode | policy_code = 1 for publicly available products, policy_code = 2 for new products not publicly available |
| n0-n14 | Anonymized features n0-n14, derived from counts of borrower behavior |
4. Code
```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# 1. Data preparation and feature engineering
def load_and_preprocess_data(train_path, test_path):
    # Load train/test and concatenate so features are built consistently on both
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
data = pd.concat([train, test], axis=0, ignore_index=True)
    # 1.1 Basic feature processing
    # employmentLength: normalize the text labels, then convert to integers
    data['employmentLength'] = data['employmentLength'].replace(
        {'10+ years': '10 years', '< 1 year': '0 years'})
def employmentLength_to_int(s):
if pd.isnull(s):
return s
else:
return np.int8(s.split()[0])
data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    # earliesCreditLine: keep only the 4-digit year
data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]) if pd.notnull(s) else np.nan)
    # 1.7 Statistical features (built before one-hot encoding removes these columns)
    for col in ['grade', 'subGrade', 'regionCode']:
        # Default rate per category (test rows have NaN labels, which mean() skips)
        default_rate = data.groupby(col)['isDefault'].mean().to_dict()
        data[col+'_default_rate'] = data[col].map(default_rate)
        # Average loan amount per category
        mean_loan = data.groupby(col)['loanAmnt'].mean().to_dict()
        data[col+'_mean_loan'] = data[col].map(mean_loan)
    # 1.2 Categorical feature encoding (one-hot)
cate_features = ['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode']
data = pd.get_dummies(data, columns=cate_features, drop_first=True)
    # 1.3 High-cardinality categorical features: replace raw values with counts and ranks
for f in ['employmentTitle', 'postCode', 'title']:
data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).fillna(-1).astype(int)
del data[f]
    # 1.4 Date feature expansion
data['issueDate'] = pd.to_datetime(data['issueDate'])
data['issueYear'] = data['issueDate'].dt.year
data['issueMonth'] = data['issueDate'].dt.month
data['issueDay'] = data['issueDate'].dt.day
data['issueDayofweek'] = data['issueDate'].dt.dayofweek
data['issueDayofyear'] = data['issueDate'].dt.dayofyear
    # Credit history length in years (issue year minus earliest credit line year)
    data['credit_history_years'] = data['issueYear'] - data['earliesCreditLine']
    # 1.5 Numerical feature binning (deciles)
num_features = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'dti',
'revolBal', 'revolUtil', 'totalAcc', 'openAcc']
for col in num_features:
data[col+'_bin'] = pd.qcut(data[col], q=10, duplicates='drop', labels=False)
    # 1.6 Ratio (interaction) features
data['income_to_loan_ratio'] = data['annualIncome'] / (data['loanAmnt'] + 1)
data['installment_to_income_ratio'] = data['installment'] / (data['annualIncome'] + 1)
data['revolBal_to_income_ratio'] = data['revolBal'] / (data['annualIncome'] + 1)
    # Split back into train/test: only test rows lack the isDefault label
train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)
return train, test
# 2. Model definitions and parameter settings
def get_lgb_params():
return {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'auc',
'learning_rate': 0.02,
'num_leaves': 63,
'max_depth': -1,
'min_data_in_leaf': 50,
'lambda_l1': 0.1,
'lambda_l2': 0.1,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'seed': 2020,
'n_jobs': -1,
'verbose': -1,
'max_bin': 255,
'min_gain_to_split': 0.01,
'is_unbalance': True
}
def get_xgb_params():
return {
'booster': 'gbtree',
'objective': 'binary:logistic',
'eval_metric': 'auc',
'eta': 0.02,
'gamma': 0.1,
'max_depth': 6,
'min_child_weight': 10,
'subsample': 0.8,
'colsample_bytree': 0.8,
'colsample_bylevel': 0.8,
'lambda': 1,
'alpha': 0.1,
'tree_method': 'hist',
'grow_policy': 'lossguide',
'seed': 2020,
'nthread': -1,
        'scale_pos_weight': 5  # adjust to the actual negative/positive ratio
}
def get_cat_params():
return {
'iterations': 10000,
'learning_rate': 0.03,
'depth': 6,
'l2_leaf_reg': 3,
'border_count': 254,
'loss_function': 'Logloss',
'eval_metric': 'AUC',
'random_seed': 2020,
'od_type': 'Iter',
'od_wait': 100,
'bootstrap_type': 'Bayesian',
'allow_writing_files': False,
        'class_weights': [1, 5]  # adjust to the actual negative/positive ratio
}
# 3. Model training with cross-validation
def cv_model(train_x, train_y, test_x, clf_name):
folds = 5
seed = 2020
kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
train_pred = np.zeros(train_x.shape[0])
test_pred = np.zeros(test_x.shape[0])
cv_scores = []
feature_importance = pd.DataFrame()
for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
print(f'{"="*20} Fold {i+1} {"="*20}')
        # Fold split (positional indexing, so pandas index alignment cannot bite)
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        # Sample weights: up-weight the positive (default) class
sample_weight = np.where(trn_y == 1, 2, 1)
if clf_name == "lgb":
params = get_lgb_params()
train_matrix = lgb.Dataset(trn_x, label=trn_y, weight=sample_weight)
valid_matrix = lgb.Dataset(val_x, label=val_y)
model = lgb.train(params, train_matrix,
num_boost_round=10000,
valid_sets=[train_matrix, valid_matrix],
callbacks=[lgb.early_stopping(stopping_rounds=200),
lgb.log_evaluation(period=200)])
            # Record this fold's feature importance
fold_importance = pd.DataFrame({
'feature': trn_x.columns,
'importance': model.feature_importance(),
'fold': i+1
})
feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
val_pred = model.predict(val_x, num_iteration=model.best_iteration)
test_pred += model.predict(test_x, num_iteration=model.best_iteration) / folds
elif clf_name == "xgb":
params = get_xgb_params()
dtrain = xgb.DMatrix(trn_x, label=trn_y, weight=sample_weight)
dvalid = xgb.DMatrix(val_x, label=val_y)
dtest = xgb.DMatrix(test_x)
model = xgb.train(params, dtrain,
num_boost_round=10000,
evals=[(dtrain, 'train'), (dvalid, 'eval')],
early_stopping_rounds=200,
verbose_eval=200)
            # Predict using the best iteration found by early stopping
            val_pred = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
            test_pred += model.predict(dtest, iteration_range=(0, model.best_iteration + 1)) / folds
elif clf_name == "cat":
params = get_cat_params()
model = CatBoostClassifier(**params)
model.fit(trn_x, trn_y,
eval_set=(val_x, val_y),
sample_weight=sample_weight,
use_best_model=True,
verbose=500)
val_pred = model.predict_proba(val_x)[:, 1]
test_pred += model.predict_proba(test_x)[:, 1] / folds
train_pred[valid_index] = val_pred
fold_auc = roc_auc_score(val_y, val_pred)
cv_scores.append(fold_auc)
print(f'Fold {i+1} AUC: {fold_auc:.5f}')
    # Report feature importance averaged over folds
if clf_name == "lgb":
importance_df = feature_importance.groupby('feature')['importance'].mean().sort_values(ascending=False)
print("\nFeature Importance:")
print(importance_df.head(20))
print("\nCV Results:")
print(f"{clf_name} AUC scores: {cv_scores}")
print(f"{clf_name} Mean AUC: {np.mean(cv_scores):.5f} (±{np.std(cv_scores):.5f})")
return train_pred, test_pred
# 4. Model ensembling (stacking)
def stacking_ensemble(x_train, y_train, x_test):
    # Level-1 base models
    print("\nTraining LightGBM...")
    lgb_train, lgb_test = cv_model(x_train, y_train, x_test, "lgb")
    print("\nTraining XGBoost...")
    xgb_train, xgb_test = cv_model(x_train, y_train, x_test, "xgb")
    print("\nTraining CatBoost...")
    cat_train, cat_test = cv_model(x_train, y_train, x_test, "cat")
    # Build level-2 training data from the OOF predictions
stack_train = np.column_stack([lgb_train, xgb_train, cat_train])
stack_test = np.column_stack([lgb_test, xgb_test, cat_test])
    # Hold out a split to sanity-check the meta-model (logistic regression itself has no early stopping)
X_train_stack, X_val_stack, y_train_stack, y_val_stack = train_test_split(
stack_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
    # Level-2 meta-model: logistic regression
print("\nTraining Stacking Model...")
lr = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
lr.fit(X_train_stack, y_train_stack)
    # Evaluate on the held-out split
val_pred = lr.predict_proba(X_val_stack)[:, 1]
val_auc = roc_auc_score(y_val_stack, val_pred)
print(f"\nStacking Model Validation AUC: {val_auc:.5f}")
    # Refit the meta-model on all OOF data
lr_final = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
lr_final.fit(stack_train, y_train)
    # Predict on the test set
test_pred = lr_final.predict_proba(stack_test)[:, 1]
return test_pred
# 5. Main pipeline
def main():
    # Data paths
    train_path = 'data/train.csv'
    test_path = 'data/testA.csv'
    # 1. Load data and engineer features
print("Loading and preprocessing data...")
train, test = load_and_preprocess_data(train_path, test_path)
    # Prepare features and label (drop id, the raw date, and the target itself)
features = [f for f in train.columns if f not in ['id', 'issueDate', 'isDefault']]
x_train = train[features]
y_train = train['isDefault']
x_test = test[features]
    # 2. Train base models and stack them
print("\nStarting model training and stacking...")
final_pred = stacking_ensemble(x_train, y_train, x_test)
    # 3. Write the submission file
print("\nGenerating submission file...")
submission = test[['id']].copy()
submission['isDefault'] = final_pred
submission.to_csv('final_submission.csv', index=False)
print("Submission file saved as 'final_submission.csv'")
if __name__ == "__main__":
    main()
```
Code Analysis
1. Data Preparation and Feature Engineering
1.1 Data Loading and Merging
- Train and test sets are loaded and concatenated so feature engineering is applied consistently to both
- Missing and abnormal values are handled during preprocessing
1.2 Feature Processing Highlights
- employmentLength: text descriptions converted to numeric values
- earliesCreditLine: the year is extracted
- Categorical encoding:
  - One-hot encoding for grade, subGrade, and other low-cardinality features
  - High-cardinality features (employmentTitle, etc.) replaced by count and rank features instead of raw values
- Date feature expansion: several calendar features derived from issueDate
- Numerical binning: continuous variables bucketed into deciles
- Interaction features: ratios between income and loan amounts
1.3 Statistical Features
- The default rate and average loan amount of each category (grade, etc.) are computed as new features
- Such target-based statistics can noticeably improve model performance, but computed naively they leak the label; see the out-of-fold sketch below
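The code above computes each category's default rate over all labeled rows at once, so every training row's own label contributes to its feature. A common, safer variant is out-of-fold target encoding. A minimal sketch, assuming `train` holds the labeled rows and `isDefault` is the target; the helper `oof_target_encode` is hypothetical, not part of the solution above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train, col, target='isDefault', n_splits=5, seed=2020):
    """Mean-encode `col` using only out-of-fold target values, to avoid leakage."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in kf.split(train):
        # Category means computed on the other folds only
        fold_means = train.iloc[trn_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = train[col].iloc[val_idx].map(fold_means).values
    # Categories unseen in a fold fall back to the global mean
    return encoded.fillna(train[target].mean())

# Usage: train['grade_default_rate'] = oof_target_encode(train, 'grade')
```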
2. Model Design and Parameter Configuration
The code trains three mainstream gradient-boosted tree models:
2.1 LightGBM
- GBDT boosting with a moderate leaf count (63) and unrestricted depth (max_depth = -1)
- L1/L2 regularization plus feature and row subsampling
- Class imbalance handled via is_unbalance=True
2.2 XGBoost
- The hist tree method speeds up training
- Moderate depth (6) and minimum child weight (10)
- Positive class up-weighted via scale_pos_weight=5
2.3 CatBoost
- Bayesian bootstrap sampling
- A large iteration budget (10,000) with overfitting-detector early stopping
- Class weights [1, 5] to handle the imbalance (a concrete way to set these is sketched below)
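Both the XGBoost and CatBoost parameter comments say to adjust these weights to the actual class ratio. A common choice is negatives over positives, computed from the training labels; a sketch using `y_train` and the parameter helpers defined in the code above:

```python
# Ratio of negative to positive samples in the training labels
neg = int((y_train == 0).sum())
pos = int((y_train == 1).sum())
ratio = neg / pos  # e.g. a 20% positive rate gives ratio = 4

xgb_params = get_xgb_params()
xgb_params['scale_pos_weight'] = ratio

cat_params = get_cat_params()
cat_params['class_weights'] = [1, ratio]
```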
3. Model Training and Validation
3.1 Cross-Validation Strategy
- 5-fold stratified cross-validation (StratifiedKFold)
- Validation AUC and feature importance recorded for each fold
- Early stopping guards against overfitting
3.2 Training Tricks
- Positive samples receive a higher weight (2 vs. 1)
- Each model uses its own evaluation metric and early-stopping setup
- LightGBM feature importance is logged for later analysis
4. Model Ensembling
4.1 Two-Level Stacking Architecture
- Level 1: three diverse base models (LightGBM, XGBoost, CatBoost)
- Level 2: logistic regression as the meta-model
4.2 Ensembling Strategy
- Base models produce out-of-fold (OOF) predictions via cross-validation
- The meta-model is evaluated on a held-out split
- The final meta-model is refit on all OOF data
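Because AUC depends only on the ordering of the scores, a rank-average blend of the three base models' test predictions is a lightweight alternative (and sanity check) to the stacked meta-model. A sketch, assuming the `*_test` arrays returned by `cv_model` inside `stacking_ensemble`:

```python
from scipy.stats import rankdata

# Average the per-model ranks, then rescale into (0, 1]
blend_test = (rankdata(lgb_test) + rankdata(xgb_test) + rankdata(cat_test)) / 3
blend_test /= len(blend_test)
```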