Summary

The top-ranked solutions in this competition relied mainly on model ensembling (stacking), mostly built on standard base models

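An entry-level competition with fairly simple data
Feature engineering had little effect on the results
- Categorical variables were mostly encoded and used as-is
- The features are anonymized, and feature crosses had little effect
Most solutions used conventional base models
- Neural networks: simple NN, DAE with Transformer, TabNet (tested but not used)
- Tree models: XGB, LGBM, CatBoost, RF, etc.
Model stacking
- Mainly stacking, influenced by the public code
- Blending as a supplement
Tricks
- Pseudo-labeling (it helped some people and not others; the apparent gain may be an illusion)
- Averaging runs over multiple seeds
- The training and test distributions differ slightly; the difference was removed by arithmetic applied directly to the submission (may overfit)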

TOP1: 1st Place Solution

CV/Public LB

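Experiments showed that CV and LB were very well correlated in this competition, so "Trust CV": a submission was made only when CV improved
LB probing: the competition metric is RMSE, so some information can be probed from the public test data
- Mean and variance of the data: for a constant prediction, the decomposition of MSE gives $RMSE^2=MSE=variance+(mean-guessed\_value)^2$, so two constant submissions to the public LB are enough to solve for both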
  
  $\begin{array}{|c|c|c|c|}\hline & \text { CV } & \text { Public LB } & \text { Private LB } \\\hline \text { all predictions 0 } & 8.27572 & 8.27414 & 8.27281 \\\hline \text { all predictions 1 } & 7.28036 & 7.27880 & 7.27747 \\\hline\end{array}$
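  - The training set has mean 8.24195 and standard deviation 0.74686
  - The public test set has mean 8.24023 and standard deviation 0.74833
  - The private test set has mean 8.23891 and standard deviation 0.748186
  - The small difference between the training set and the public test set causes a gap between CV and LB; for example, the best submission scored CV 0.71544 and LB 0.71694
  - Thought: could these probing results be used to adjust the predictions numerically? I have also seen local data probing done under other metrics; it may be worth summarizing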
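The probing arithmetic can be sketched as follows. This is a minimal illustration, not the 1st place author's code: the function name `probe_mean_std` and its parameters are hypothetical, and the inputs are the public LB scores for the two constant submissions in the table above. It solves the two MSE equations for the mean, then back-substitutes for the (population) standard deviation.

```python
import math

def probe_mean_std(rmse_a, rmse_b, guess_a=0.0, guess_b=1.0):
    """Recover target mean and std from the RMSE of two constant submissions.

    Uses MSE = variance + (mean - guess)^2 for each constant guess.
    """
    mse_a, mse_b = rmse_a ** 2, rmse_b ** 2
    # MSE_a - MSE_b = (mean - a)^2 - (mean - b)^2 = (b - a) * (2*mean - a - b)
    mean = ((mse_a - mse_b) / (guess_b - guess_a) + guess_a + guess_b) / 2.0
    variance = mse_a - (mean - guess_a) ** 2
    return mean, math.sqrt(variance)

# Public LB scores for "all predictions 0" and "all predictions 1".
public_mean, public_std = probe_mean_std(8.27414, 7.27880)
print(round(public_mean, 5), round(public_std, 5))
```

The same call with the CV or private LB columns of the table should give the training and private test statistics listed above, up to rounding of the reported scores.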

Base Models

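Data preprocessing:
- No missing values
- The continuous features share the same scale, so no normalization is needed
- Categorical features are encoded so they can be fed to the models; three encodings were used:
  - Label Encoding (link).
  - Leave One Out Encoding (link).
  - One Hot Encoding (link).
  - Thought: which encoding to use was found by trial and error, with no clear rationale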
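To make the three encodings concrete, here is a dependency-free sketch on a toy column. The function names and toy data are hypothetical. The original solution presumably used library encoders, which is an assumption on my part. The leave-one-out version shown is the training-time form, where each row's own target is excluded from its category mean.

```python
from collections import defaultdict

def label_encode(values):
    """Map each category to an integer id (ids follow sorted category order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """One 0/1 column per category, columns in sorted category order."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def leave_one_out_encode(values, targets):
    """Replace each category by the mean target of the other rows in it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for v, t in zip(values, targets):
        if counts[v] > 1:
            encoded.append((sums[v] - t) / (counts[v] - 1))
        else:
            # Singleton category: no other rows, fall back to the global mean.
            encoded.append(global_mean)
    return encoded

cat_column = ["A", "B", "A", "C", "B", "A"]   # toy categorical feature
target = [8.0, 7.0, 9.0, 8.5, 7.5, 8.5]       # toy target values

print(label_encode(cat_column))
print(one_hot_encode(cat_column)[0])
print(leave_one_out_encode(cat_column, target))
```

At prediction time there is no target to leave out, so a leave-one-out encoder typically falls back to the plain per-category mean learned on the training set.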
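Hyperparameter tuning: hyperparameters were optimized with the optuna library. GBDTs run faster on GPU than on CPU but gave worse results, so the hyperparameter search ran on GPU and the final models were trained on CPU
- Thought: CPU training may be more accurate than GPU training because the GPU is limited to float32 precision
Model 1: Extreme Gradient Boosting (XGBoost)
- Categorical Encoding Type: Label Encoding.
- Three different sets of hyperparameters were used
- Training with different random seeds and averaging improves the CV score; for each set of hyperparameters, 20 seeds were trained and averaged (improving CV from 0.71629 to 0.71594)
- Thought: I did not train my models with many seeds; this may be worth using in the next competition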
Model 2: Light Gradient Boosting Machine (LGBM)
- Categorical Encoding Type: Leave One Out Encoding.
- Averaging of 20 different random seeds.
Model 3: Cat Boost (CB)
- Categorical Encoding Type: Label Encoding.
- Averaging of 20 different random seeds.
Model 4: Hist Gradient Boosting Regressor (HGBR)
- Categorical Encoding Type: Leave One Out Encoding.
- Averaging of 20 different random seeds.
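The seed averaging used for the GBDT models above can be sketched generically. This is an assumption-laden toy: `average_over_seeds` is a hypothetical helper, and `toy_fit_predict` is a stand-in for "train a GBDT with this seed and predict", simulated here as a fixed signal plus seed-dependent noise so the example runs without any ML library.

```python
import random
import statistics

def average_over_seeds(fit_predict, seeds):
    """Average the prediction vectors returned by fit_predict(seed)."""
    runs = [fit_predict(seed) for seed in seeds]
    # zip(*runs) groups the predictions for the same row across seeds.
    return [statistics.fmean(per_row) for per_row in zip(*runs)]

TRUE_SIGNAL = [8.0, 8.2, 7.9, 8.4]  # toy "ideal" predictions

def toy_fit_predict(seed):
    """Hypothetical stand-in for a seeded model: signal plus seed noise."""
    rng = random.Random(seed)
    return [signal + rng.gauss(0.0, 0.05) for signal in TRUE_SIGNAL]

single_run = toy_fit_predict(0)
averaged = average_over_seeds(toy_fit_predict, range(20))
print(single_run)
print(averaged)
```

Averaging n independent runs shrinks the seed-induced noise by roughly a factor of sqrt(n), which is consistent with the small but steady CV gains reported for 20 seeds.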
Model 5: Ridge with features from Denoise Transformer AutoEncoder.
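- Reproduced the 1st place DAE training code
- Trained the Denoise Transformer AutoEncoder with self-supervised pretraining
  - Training mode: swap noise is added to the input, with two pretraining objectives: predicting the swap mask and reconstructing the original input
  - Inference mode: the feature representation of the input is taken from the Transformer layers of the model
- The feature representation from the DAE with Transformer is fed to models such as Ridge for training
- Although it performs worse than the GBDTs, it is a very important source of model diversity
- Thought: this model was not tuned. I also reproduced this code and got a similar score, but discarded it because the score was too low. I was evidently looking only at scores and not thinking about model diversity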
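The swap-noise step of the DAE pretraining described above can be illustrated in isolation. This is a sketch, not the reproduced training code: the function name, the `swap_prob=0.15` default, and the choice to mark only cells whose value actually changed are all assumptions. Each selected cell is replaced by the value of the same column from a randomly chosen row, and the returned mask is what the "predict the swap mask" objective would be trained against.

```python
import random

def swap_noise(rows, swap_prob=0.15, seed=0):
    """Corrupt a table by swapping cells with same-column values of random rows.

    Returns (corrupted_rows, mask), where mask[i][j] == 1 marks a cell whose
    value was changed by the swap.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    corrupted, mask = [], []
    for row in rows:
        new_row, row_mask = [], []
        for j, value in enumerate(row):
            if rng.random() < swap_prob:
                donor = rows[rng.randrange(n_rows)][j]
                new_row.append(donor)
                # A swap with an identical value leaves the cell unchanged.
                row_mask.append(1 if donor != value else 0)
            else:
                new_row.append(value)
                row_mask.append(0)
        corrupted.append(new_row)
        mask.append(row_mask)
    return corrupted, mask

toy_rows = [[i, i * 10] for i in range(50)]  # toy two-column table
noisy_rows, swap_mask = swap_noise(toy_rows, swap_prob=0.3, seed=1)
print(noisy_rows[:3], swap_mask[:3])
```

Because donors come from the same column, every corrupted value stays within that feature's observed distribution, which is the usual argument for swap noise over Gaussian noise on tabular data.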

$\begin{array}{|c|c|c|c|c|c|c|}\hline & \text { XGB } & \text { XGB2 } & \text { LGBM } & \text { CB } & \text { HGBR } & \text { Ridge with DTA } \\\hline \text { CV } & 0.71592 & 0.71594 & 0.71690 & 0.71763 & 0.71959 & 0.72217 \\\hline\end{array}$

Ensemble Models

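The higher the stacking level, the better the CV score
The hyperparameters of the stacking models were optimized with optuna
Stacking level 1 (10 folds)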

$\begin{array}{|c|c|c|c|c|c|c|c|}\hline & \text { Ridge } & \text { XGB } & \text { LGBM } & \text { CB } & \text { HGBR } & \text { RF } & \text { GBR } \\\hline \text { CV } & 0.71571 & 0.71555 & 0.71561 & 0.71562 & 0.71569 & 0.71599 & 0.71604 \\\hline\end{array}$
Stacking level 2 (10 folds)

Not all features are useful for the second stacking level, so the set of second-level input features can be searched

$\begin{array}{|c|c|c|c|c|}\hline & \text { XGB } & \text { Ridge } & \text { LGBM } & \text { CB } \\\hline \text { CV } & 0.71555 & 0.71551 & 0.71547 & 0.71554 \\\hline\end{array}$
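Level 3 (forward selection)

The outputs were combined with weights optimized by forward selection, following Forward Selection OOF Ensemble - [0.942 Private] and How To CV and How To Ensemble OOF Files
Thoughts:
- Beyond the hyperparameters, the inputs to each stacking level can also be optimized
- Besides this way of stacking, restacking could also be introduced and might work better
- TODO: weights optimized using forward selection is new to me and needs further reading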
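As a starting point for the TODO above, here is one common form of forward selection over OOF predictions: greedy selection with replacement, where each step adds the model that most reduces the blended OOF RMSE and the final weights are the selection frequencies. Whether the 1st place solution used exactly this variant is an assumption; the function names and toy predictions are hypothetical.

```python
import math

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def forward_selection_weights(oof_preds, target, max_steps=50):
    """Greedy forward selection with replacement over OOF prediction vectors.

    oof_preds maps model name -> list of OOF predictions.
    Returns (weights, blended_oof_rmse).
    """
    names = list(oof_preds)
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(target)
    picked = 0
    best_score = float("inf")
    for _ in range(max_steps):
        best_name, best_candidate = None, best_score
        for name in names:
            blend = [(s + p) / (picked + 1)
                     for s, p in zip(running_sum, oof_preds[name])]
            score = rmse(blend, target)
            if score < best_candidate - 1e-12:  # require a real improvement
                best_name, best_candidate = name, score
        if best_name is None:
            break  # no candidate improves the blend any further
        counts[best_name] += 1
        picked += 1
        running_sum = [s + p for s, p in zip(running_sum, oof_preds[best_name])]
        best_score = best_candidate
    weights = {name: counts[name] / picked for name in names}
    return weights, best_score

toy_target = [1.0, 2.0, 3.0, 4.0]
toy_oof = {
    "a": [1.25, 2.25, 3.25, 4.25],  # biased high
    "b": [0.75, 1.75, 2.75, 3.75],  # biased low
    "c": [2.0, 2.0, 2.0, 2.0],      # poor model
}
weights, score = forward_selection_weights(toy_oof, toy_target)
print(weights, score)
```

In the toy case the two oppositely biased models cancel out and the poor model is never selected, which is the behaviour that makes the method attractive as a final blending layer.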

What did not work

TabNet
Add Gaussian Mixture Model features.
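Pseudo Labeling: pseudo labeling greatly improved the CV score but did not improve the LB
Thoughts:
- In my experiments TabNet performed much worse than even the DAE, which is broadly consistent with this
- TODO: notes on GMM features need to be added
- Pseudo labeling did not help here but did help in my solution, possibly because I had not optimized the stacking hyperparameters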

Notebook

Discussion

The diversity that weak models bring to an ensemble may be useful
Books
- ML: Pattern Recognition and Machine Learning of Christopher Bishop
- DL: Dive into Deep Learning
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
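On hyperparameter optimization: run 200 trials to get good hyperparameters, and optimize on a single fold to speed up the search
Do a post-mortem after each competition, e.g. petfinder_7th_place_solution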

TOP3: 3rd Place Solution

Label encoding with sklearn.preprocessing.OrdinalEncoder
XGBoost
Hyper-parameter tuning with Optuna
K-fold cross validation (n_splits=10)
Stacking
Blending (with multi random seeds)

Stacked models were built from these sub-models.

For the 1st level (base models): 5 XGBoost models with 5 different sets of hyperparameters, using K-fold cross validation (n_splits=10)
For the 2nd level (meta model): 1 XGBoost model, using K-fold cross validation (n_splits=10)

This process was repeated 8 times with different random seeds, and the results were blended to get the average prediction values.
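The mechanics of this pipeline (K-fold out-of-fold predictions at level 1, a meta model trained on them at level 2, and an average over seeds) can be sketched with toy learners. Everything here is a simplified stand-in: the helper names are hypothetical, and a mean predictor, a one-feature least-squares line, and a two-way blend replace the XGBoost models so that the example runs with the standard library only. Test-set predictions are omitted.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Shuffle row indices and deal them into n_splits validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]

def out_of_fold_predictions(fit, X, y, n_splits=10, seed=0):
    """fit(X_train, y_train) must return a predict(row) callable."""
    oof = [None] * len(X)
    for valid_idx in kfold_indices(len(X), n_splits, seed):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in valid_idx:
            oof[i] = predict(X[i])  # each row is predicted by a model that never saw it
    return oof

def fit_mean(X_train, y_train):
    mean = sum(y_train) / len(y_train)
    return lambda row: mean

def fit_line(X_train, y_train):
    """Least-squares line on the first feature."""
    xs = [row[0] for row in X_train]
    mean_x, mean_y = sum(xs) / len(xs), sum(y_train) / len(y_train)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y_train))
    slope = sxy / sxx if sxx else 0.0
    return lambda row: mean_y + slope * (row[0] - mean_x)

def fit_two_way_blend(X_train, y_train):
    """Toy meta model: least-squares weight w for w*p0 + (1 - w)*p1."""
    num = sum((p0 - p1) * (t - p1) for (p0, p1), t in zip(X_train, y_train))
    den = sum((p0 - p1) ** 2 for p0, p1 in X_train)
    w = num / den if den else 0.5
    return lambda row: w * row[0] + (1.0 - w) * row[1]

def stacked_oof(X, y, seed):
    level0 = [out_of_fold_predictions(f, X, y, seed=seed) for f in (fit_line, fit_mean)]
    level1_X = [list(cols) for cols in zip(*level0)]  # one column per base model
    return out_of_fold_predictions(fit_two_way_blend, level1_X, y, seed=seed)

X = [[float(i)] for i in range(40)]       # toy feature
y = [2.0 * i + 1.0 for i in range(40)]    # toy target, exactly linear
runs = [stacked_oof(X, y, seed) for seed in range(8)]
blended = [sum(col) / len(col) for col in zip(*runs)]
print(blended[:5])
```

The key property is that the meta model only ever sees predictions made on held-out rows, so its inputs at training time resemble what it will see on the test set.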

TOP4: 4th Place Solution

Stacking(method from stacking)

Level 0
- 13 Fine Tuned Xgboost Model (Label Encoding)
- 5 Fine Tuned Lightgbm Model (Label Encoding)
- 4 Fine Tuned Catboost Model
Level 1: the inputs are selected, so not every Level 0 output is necessarily used
- Linear Regression(CV : 0.716209)
- Ridge Regression (CV : 0.716209)
- Xgboost (CV: 0.716085)
Level 2
- Linear Regression (CV: 0.71600)
Other Things Tried but Failed
- One-Hot Encoding
- Target Encoding
- Polynomial Features
- Dropping less important feature by using Mutual Information, Permutation Importance, etc.
- Dropping Target Outliers
- Scaling and Normalizing
- Ridge Regression and HistGradBoost Model for Level0
- Random Forest and HistGradBoost for Level1
- Xgboost for Level2
- Neural Network
- Denoising-AutoEncoder (CV: 0.72+) TPS Feb 1st Solution
Things learned:
- Sometimes Decision Trees are more powerful than NN
- Blending and Stacking
- Believe in your CV score
- If you are following best CV practices then don't be afraid of over-fitting on public LB
- Hyper-parameters make huge difference
- While making early submissions, try to find out whether there is any relation between your CV and public LB.
- Do not make use of weak Level 0 model while blending or stacking
- Other ML practices, etc.

TOP20: My 20th place approach

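Forked a notebook scoring LB 0.71719 / PB 0.71552
The predictions ran higher than the true values, so rounding them to three decimal places and then subtracting 0.002 improved the score to LB 0.71718 / PB 0.71550
Thoughts:
- This agrees with the probing results in TOP1: the training set mean is 8.24195 and the public test set mean is 8.24023
- This may be related to the source and distribution of the data: many target values are capped, and because the data come from insurance claims, the few spikes in the distribution correspond to claim amounts
- Testing with my own predictions (LB, PB) did improve the LB, but the PB showed no clear change; perhaps it had reached its limit?
  - raw: 0.71695/0.71539
  - rounded to three decimals, then -0.002: 0.71694/0.71539
  - subtract 0.002 directly: 0.71694/0.71539
  - subtract 0.003 directly: 0.71694/0.71539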
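The effect of such a constant shift is easy to reason about on data where the target is known. The sketch below uses hypothetical helper names and toy numbers. It applies the "round to three decimals, then subtract 0.002" adjustment and also computes the RMSE-optimal constant shift, which is simply the mean residual. On the real leaderboard the targets are hidden, so the shift has to be guessed or probed, which is where the overfitting risk comes from.

```python
import math

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def round_and_shift(preds, decimals=3, shift=-0.002):
    """Round predictions, then add a constant shift (negative = subtract)."""
    return [round(p, decimals) + shift for p in preds]

def best_constant_shift(preds, target):
    """The constant shift that minimises RMSE is the mean residual."""
    return sum(t - p for p, t in zip(preds, target)) / len(target)

toy_target = [8.0, 8.1, 7.9, 8.2]
toy_preds = [8.004, 8.101, 7.903, 8.2]  # toy predictions that run slightly high

shift = best_constant_shift(toy_preds, toy_target)
adjusted = round_and_shift(toy_preds, decimals=3, shift=-0.002)
print(shift, rmse(toy_preds, toy_target), rmse(adjusted, toy_target))
```

A shift of size d changes the MSE by d^2 + 2*d*(mean prediction error), so when the bias is already near zero, as the unchanged PB scores above suggest, there is little left to gain.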

TOP165: Sharing my solution. (Missing 'in the swag' cut by 0.00005)

10-fold CV

Base models:

XGBoost #1
XGBoost #2
XGBoost #3 (Similar parameters to Kosta R's notebook)
XGBoost #4 (Similar parameters to Steven Ferrer's notebook)
XGBoost #5 (Similar parameters to Abhishek Thakur's notebook)
LightGBM
Random Forest on GPU (RAPIDS/cuML)
K-Nearest Neighbors on GPU (RAPIDS/cuML)
Ridge Regression

Model ensembling:

Neural Network - trained using the predictions of all models, except the Ridge Regression.
XGBoost - trained using the predictions of 4 XGBoost base models (2 to 5) + Random Forest and Ridge Regression.

The results from both meta-models were used for a final blending.

Submitted models

the output from the Neural Network meta-model(0.71709 on Public LB/ 0.71552 on Private LB)
the blended submission(0.71712 on Public LB/ 0.71552 on Private LB)

Notebook: 30days - Final Stacking

Changelog

First draft: September 2, 2021 - Today

Competition retrospective - Kaggle - 30 Days of ML - 3 - TOP solutions
