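1. The Task
This practice competition is based on the marketing campaigns of a Portuguese banking institution. The campaigns were generally conducted by phone: the bank's customer service staff contacted each customer at least once to find out whether the customer was willing to buy the bank's product (a term deposit).
This is a classification task: predict whether a customer will buy the bank's product.
2. Detailed Steps and Code Implementation
2.1 Loading and Merging the Data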
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from catboost import CatBoostClassifier
train = pd.read_csv('/home/mw/input/purchase_pred8546/purchase_prediction_train_set.csv')
test = pd.read_csv('/home/mw/input/purchase_pred8546/purchase_prediction_test_set.csv')
# Concatenate train and test so both are processed together
data = pd.concat([train, test], ignore_index=True, sort=False)
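To illustrate why concatenating is convenient here, the following minimal sketch uses two made-up frames (the values are invented, not taken from the competition files). It shows that rows coming from the test set end up with NaN in y after pd.concat, which is what later allows the combined frame to be split apart again.

```python
import pandas as pd

# Toy stand-ins for the real CSV files; all values are invented.
toy_train = pd.DataFrame({"ID": [1, 2], "duration": [120, 300], "y": [0, 1]})
toy_test = pd.DataFrame({"ID": [3], "duration": [45]})

toy_data = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# The test row has no label, so y is NaN there and the column is upcast to float.
n_labeled = int(toy_data["y"].notna().sum())
n_unlabeled = int(toy_data["y"].isna().sum())
print(n_labeled, n_unlabeled, toy_data["y"].dtype)
```

Because y becomes a float column after the merge, it has to be cast back to an integer type once the labeled rows are separated out again.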
2.2 Feature Engineering
Feature engineering by category:
# Handle pdays: 999 denotes a missing value
data['pdays_missing'] = (data['pdays'] == 999).astype(int)
data.loc[data['pdays'] == 999, 'pdays'] = -1
# Whether balance is negative: an indicator of financial health
data['balance_negative'] = (data['balance'] < 0).astype(int)
# Enhanced duration features: related to call length
data['duration_log'] = np.log1p(data['duration'])
data['dur_prev'] = data['duration'] * (data['previous'] + 1)
data['dur_per_call'] = data['duration'] / (data['campaign'] + 1)
data['pdays_recip'] = 1 / (data['pdays'] + 2)
# poutcome success × duration: interaction between a past success and the current call
data['dur_success'] = data['duration'] * (data['poutcome'] == 'success').astype(int)
# Numeric encoding of month: captures seasonal trends
data['month_num'] = data['month'].map({
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
})
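The feature steps above can be traced on a tiny example. The sketch below applies the same transformations to two invented customers (the values are assumptions chosen for easy arithmetic) so the resulting numbers can be checked by hand. Note that pdays_recip is computed after the 999 sentinel has been replaced by -1, so a never-contacted customer gets 1 / (-1 + 2) = 1.0.

```python
import numpy as np
import pandas as pd

# Two invented customers: one never contacted before (pdays == 999),
# one contacted 3 days ago with a successful previous outcome.
toy = pd.DataFrame({
    "pdays": [999, 3],
    "balance": [-50, 200],
    "duration": [0, 100],
    "previous": [0, 2],
    "campaign": [1, 4],
    "poutcome": ["unknown", "success"],
    "month": ["may", "dec"],
})

# Same transformations as in the section above, applied to the toy frame.
toy["pdays_missing"] = (toy["pdays"] == 999).astype(int)
toy.loc[toy["pdays"] == 999, "pdays"] = -1
toy["balance_negative"] = (toy["balance"] < 0).astype(int)
toy["duration_log"] = np.log1p(toy["duration"])
toy["dur_prev"] = toy["duration"] * (toy["previous"] + 1)
toy["dur_per_call"] = toy["duration"] / (toy["campaign"] + 1)
toy["pdays_recip"] = 1 / (toy["pdays"] + 2)
toy["dur_success"] = toy["duration"] * (toy["poutcome"] == "success").astype(int)

month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
toy["month_num"] = toy["month"].map({m: i + 1 for i, m in enumerate(month_order)})

print(toy[["pdays_missing", "pdays_recip", "dur_prev", "dur_per_call",
           "dur_success", "month_num"]])
```

The separate pdays_missing flag keeps the "never contacted" case distinguishable even though its reciprocal value is larger than that of any real pdays value.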
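2.3 Data Splitting and Preprocessing
- Re-split the dataset and prepare the data for modeling
- Split the data according to whether the target variable y is present
- Remove the ID column (not a feature)
- Explicitly specify the categorical features; CatBoost handles them automatically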
train_processed = data[data['y'].notna()].copy()
test_processed = data[data['y'].isna()].copy()
# Separate features and target
X = train_processed.drop(['y', 'ID'], axis=1)
y = train_processed['y'].astype(int)
X_test = test_processed.drop(['y', 'ID'], axis=1)
# Define the categorical features
cat_cols = [
    'job', 'marital', 'education', 'default', 'housing', 'loan',
    'contact', 'month', 'poutcome'
]
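CatBoost expects categorical columns to hold strings or integers and rejects NaN in them. Whether the competition data contains missing categories is not stated here, so the following is only a defensive sketch. The helper name prepare_cat_features and the fill value "missing" are hypothetical choices, not part of the original solution.

```python
import numpy as np
import pandas as pd

def prepare_cat_features(frame, cat_cols, fill_value="missing"):
    """Return a copy whose categorical columns are NaN-free strings.

    Hypothetical helper for illustration; the original frame is not modified.
    """
    out = frame.copy()
    for col in cat_cols:
        out[col] = out[col].fillna(fill_value).astype(str)
    return out

# Invented example rows with gaps in two categorical columns.
toy = pd.DataFrame({
    "job": ["admin.", np.nan, "technician"],
    "marital": ["married", "single", np.nan],
    "duration": [10, 20, 30],
})
clean = prepare_cat_features(toy, ["job", "marital"])
print(clean)
```

If something like this were used, it would be applied to both X and X_test with the same cat_cols list before fitting, so that training and prediction see the same category labels.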
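2.4 Model Parameter Configuration
Goal: set the best-performing CatBoost parameter combination found.
Tuning highlights:
- bootstrap_type='Bayesian': the key improvement; it makes results more stable
- iterations=3000 + learning_rate=0.02: a small-step, many-trees strategy
- od_type='Iter': automatic overfitting detection (takes effect only when an eval_set is passed to fit)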
model = CatBoostClassifier(
    iterations=3000,
    learning_rate=0.02,
    depth=7,
    l2_leaf_reg=3,
    random_strength=1.5,
    bagging_temperature=0.6,
    border_count=254,
    loss_function='Logloss',
    eval_metric='AUC',
    bootstrap_type='Bayesian',
    od_type='Iter',
    od_wait=80,
    random_seed=42,
    task_type='CPU'
)
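To give some intuition for bootstrap_type='Bayesian' and bagging_temperature, the sketch below reproduces the weighting rule as described in the CatBoost documentation, to the best of my understanding: each object gets a random weight w = (-log ψ)^t, with ψ drawn from Uniform(0, 1) and t the bagging temperature. The function name and the seed are assumptions for illustration; this is not CatBoost's internal code.

```python
import numpy as np

def bayesian_bootstrap_weights(n, temperature, seed=0):
    """Sample n object weights w = (-log(psi)) ** temperature.

    Illustrative re-implementation of the documented rule, not CatBoost itself.
    """
    rng = np.random.default_rng(seed)
    # rng.random() lies in [0, 1); using 1 - u keeps psi in (0, 1] so log is finite.
    psi = 1.0 - rng.random(n)
    return (-np.log(psi)) ** temperature

w_off = bayesian_bootstrap_weights(10_000, temperature=0.0)   # no resampling
w_used = bayesian_bootstrap_weights(10_000, temperature=0.6)  # value used above
w_exp = bayesian_bootstrap_weights(10_000, temperature=1.0)   # exponential weights

print(w_off.std(), w_used.std(), w_exp.std())
```

With temperature 0 every weight is 1, and the spread of the weights grows with the temperature, so 0.6 sits between no resampling and fully exponential weights.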
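2.5 Cross-Validation Training
Use 5-fold cross-validation to improve the model's stability and generalization: one model is trained per fold, and the five test-set predictions are averaged.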
kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_pred = np.zeros(len(X_test))  # running average of the fold predictions
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Training Fold {fold+1} ...")
    model = CatBoostClassifier(
        iterations=3000, learning_rate=0.02, depth=7,
        l2_leaf_reg=3, random_strength=1.5,
        bagging_temperature=0.6, border_count=254,
        loss_function='Logloss', eval_metric='AUC',
        bootstrap_type='Bayesian', od_type='Iter',
        od_wait=80, random_seed=42, task_type='CPU'
    )
    # val_idx is not used: no eval_set is passed, so all 3000 iterations run
    model.fit(
        X.iloc[train_idx], y.iloc[train_idx],
        cat_features=cat_cols
    )
    # Each fold contributes 1/5 of the final predicted probability
    test_pred += model.predict_proba(X_test)[:, 1] / 5
print("5-fold training finished!")
2.6 Generating and Submitting the Results
Generate the submission file:
submit = pd.DataFrame({
    'ID': test_processed['ID'],
    'Prediction': test_pred
})
submit.to_csv("submit.csv", index=False)
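Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below assumes the two-column layout used above (ID and Prediction) and that predictions are probabilities in [0, 1]. The helper check_submission is hypothetical, and the competition's actual format requirements are not stated in this write-up.

```python
import pandas as pd

def check_submission(submit, expected_rows):
    """Raise ValueError if the submission frame looks malformed.

    Hypothetical helper; the rules below are assumptions, not official requirements.
    """
    if list(submit.columns) != ["ID", "Prediction"]:
        raise ValueError(f"unexpected columns: {list(submit.columns)}")
    if len(submit) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(submit)}")
    if submit["ID"].duplicated().any():
        raise ValueError("duplicate IDs found")
    pred = submit["Prediction"]
    if pred.isna().any() or not pred.between(0.0, 1.0).all():
        raise ValueError("predictions must be non-missing values in [0, 1]")
    return True

# Invented example frame standing in for the real submission.
toy_submit = pd.DataFrame({"ID": [101, 102, 103], "Prediction": [0.1, 0.8, 0.5]})
ok = check_submission(toy_submit, expected_rows=3)
print(ok)
```

In the real pipeline the call would look like check_submission(submit, expected_rows=len(test)), placed just before writing the CSV.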
Best test score: 0.9334422974836348