08 - Ensemble Learning: Boosting and Bagging

Ensemble learning in one sentence: "two heads are better than one". Combine several ordinary models and you get one that is stronger than any of them.

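1. Why does ensemble learning work?

1.1 An everyday analogy

Imagine you are ill and go to see a doctor:

See one doctor → they may misdiagnose you (a single model may overfit or underfit)
See three doctors and go with the majority opinion → a misdiagnosis becomes much less likely (that is ensemble learning)

Another example:

On a quiz show with an "ask the audience" round, 100 audience members vote on the answer to a question, and the majority answer is often right more often than a single "expert". This is the effect known as the "Wisdom of Crowds".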

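1.2 Mathematical intuition

Suppose we have 3 independent classifiers, each with 70% accuracy:

Accuracy of a single model = 0.7

Majority vote of 3 models (correct when at least 2 are correct):
P(correct) = C(3,2)*0.7^2*0.3 + C(3,3)*0.7^3
           = 3*0.49*0.3 + 0.343
           = 0.441 + 0.343
           = 0.784   ← up from 70% to 78.4%

The more models there are and the more independent their errors, the larger the gain, provided each model is better than random guessing. This is the core idea behind ensemble learning.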
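The calculation above generalizes to any odd number of voters. The sketch below is a minimal illustration that assumes the classifiers are fully independent and share the same accuracy p, which real models rarely satisfy; the function name is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, gives the right answer (n odd)."""
    need = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )

for n in (1, 3, 5, 11, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.7):.4f}")
```

With p below 0.5 the same formula shows the opposite effect: voting makes a worse-than-random model even worse, which is why the "better than random" condition matters.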

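1.3 Three problems that ensemble learning addresses

┌───────────────────────────────────────────────────────────────────┐
│          Core problems that ensemble learning addresses           │
├─────────────────────┬─────────────────────┬───────────────────────┤
│ Statistical         │ Computational       │ Representational      │
│ (with limited data, │ (when training gets │ (when one model is    │
│ lowers the risk of  │ stuck in a local    │ not expressive enough,│
│ picking the wrong   │ optimum, multiple   │ combining several     │
│ model)              │ starts find better  │ widens what can be    │
│                     │ solutions)          │ represented)          │
└─────────────────────┴─────────────────────┴───────────────────────┘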

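2. The ensemble learning landscape

                    ┌───────────────────┐
                    │ Ensemble learning │
                    └─────────┬─────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
     ┌──────▼─────┐    ┌──────▼─────┐    ┌──────▼─────┐
     │  Bagging   │    │  Boosting  │    │  Stacking  │
     │ (parallel) │    │(sequential)│    │ (layered)  │
     └──────┬─────┘    └──────┬─────┘    └──────┬─────┘
            │                 │                 │
     ┌──────▼─────┐    ┌──────▼─────┐      multi-layer
     │   Random   │    │ AdaBoost   │      stacked models
     │   forest   │    │ GBDT       │
     └────────────┘    │ XGBoost    │
                       │ LightGBM   │
                       │ CatBoost   │
                       └────────────┘

Key differences:

Property	Bagging	Boosting	Stacking
Model relationship	Parallel, independent	Sequential, dependent	Layered combination
Data sampling	Sampling with replacement	Reweighting samples	Cross-validation splits
Main goal	Reduce variance	Reduce bias	Combine strengths
Typical algorithms	Random forest	AdaBoost, GBDT	Stacks of assorted models
Overfitting risk	Lower	Higher	Medium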

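3. How Bagging works

3.1 What is Bagging?

Bagging = Bootstrap Aggregating

Think of a dish whose seasoning you are unsure about:

Ask 10 chefs to each cook the same dish from a randomly chosen set of ingredients (sampling with replacement)
Mix the 10 dishes together (or average their flavors)
The result is more reliably good than what any single chef produces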

3.2 Bagging flow

Original training set D (N samples)
     │
     │  random sampling with replacement (bootstrap)
     │
     ├────────────┬────────────┬────────────┐
     ▼            ▼            ▼            ▼
 Subset D1    Subset D2    Subset D3    Subset Dk
(N samples)  (N samples)  (N samples)  (N samples)
     │            │            │            │
     ▼            ▼            ▼            ▼
  Model h1     Model h2     Model h3     Model hk
     │            │            │            │
     └────────────┴──────┬─────┴────────────┘
                         │
                         ▼
                ┌─────────────────┐
                │  Aggregate      │
                │  (vote/average) │
                └────────┬────────┘
                         │
                         ▼
                  Final prediction

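3.3 Properties of bootstrap sampling

Each bootstrap sample draws N items with replacement, so the probability that a given sample is never drawn is:

P(sample not drawn) = (1 - 1/N)^N ≈ 1/e ≈ 0.368   (for large N)

→ About 36.8% of the samples are missing from any given subset
→ These are the OOB (Out-Of-Bag) samples, and they can be used for validation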
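The 36.8% figure is easy to check empirically. The following is a small sketch using only the standard library; it compares the closed-form value (1 - 1/N)^N with the out-of-bag fraction observed in one simulated bootstrap draw. The function name and seed are arbitrary choices for this example.

```python
import random

def oob_fraction(n_samples: int, seed: int = 0) -> float:
    """Draw n_samples indices with replacement and return the fraction
    of indices that were never drawn (the out-of-bag fraction)."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

for n in (10, 100, 10_000):
    exact = (1 - 1 / n) ** n
    print(f"N={n:>6}: formula={exact:.4f}  simulated={oob_fraction(n):.4f}")
```

For small N the formula gives a value noticeably below 1/e (about 0.349 for N = 10); the approximation tightens as N grows.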

3.4 Python example

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Fix random_state so the split (and the reported scores) are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build the Bagging classifier
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner: decision tree (scikit-learn >= 1.2)
    n_estimators=100,                     # 100 trees
    max_samples=0.8,                      # each tree draws 80% of N (classic bootstrap draws N)
    bootstrap=True,                       # sample with replacement
    oob_score=True,                       # evaluate on out-of-bag samples
    random_state=42,
    n_jobs=-1                             # train in parallel
)

bagging.fit(X_train, y_train)
print(f"Test accuracy: {bagging.score(X_test, y_test):.4f}")
print(f"OOB accuracy:  {bagging.oob_score_:.4f}")

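4. Random forest (recap)

Random forest = Bagging + random feature selection

It goes one step beyond Bagging: besides sampling rows at random, it also samples features at random.

Random forest vs. plain Bagging:

Plain Bagging:  sample [rows] at random → train a decision tree
Random forest:  sample [rows] at random → train a decision tree whose splits use a random [feature subset]
                                                       ↑
                                    this step makes the trees differ more,
                                    lowers the correlation between trees,
                                    and improves the ensemble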
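Why lower correlation helps can be made concrete with a standard identity that is background knowledge rather than something stated above: for n identically distributed predictors with variance σ² and pairwise correlation ρ, the variance of their average is ρσ² + (1 - ρ)σ²/n. The sketch below simply evaluates that formula; the function name is hypothetical.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the average of n_trees predictors that each have
    variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

for rho in (0.9, 0.5, 0.1):
    row = ", ".join(
        f"n={n}: {ensemble_variance(1.0, rho, n):.3f}" for n in (1, 10, 100)
    )
    print(f"rho={rho}: {row}")
```

Adding trees only shrinks the second term. The first term, ρσ², is a floor that no number of trees removes, so reducing ρ through feature subsampling is what lets a random forest improve on plain Bagging.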

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,          # 200 trees
    max_features='sqrt',       # consider sqrt(n_features) random features at each split
    max_depth=10,              # cap the tree depth
    min_samples_leaf=5,        # minimum samples per leaf
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"Random forest accuracy: {rf.score(X_test, y_test):.4f}")

# Feature importance (indexed by column position, since X is a NumPy array)
importance = pd.Series(rf.feature_importances_).sort_values(ascending=False)
print("Top 5 features:", importance.head().to_dict())

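5. How Boosting works

5.1 Core idea

Bagging is "everyone works independently and votes at the end"; Boosting is "each learner specifically makes up for the mistakes of the one before it".

An everyday analogy:

You fail an exam, so the teacher writes a new paper in which the questions you got wrong appear more often. You sit it, the teacher builds the next paper from your new mistakes, and so on. Round after round, you keep getting better.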

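5.2 Boosting as a sequential pipeline

Round 1:
┌────────────────┐   train    ┌──────────┐  predict
│ All samples    │ ─────────→ │ Model h1 │ ─────────→  misclassified samples
│ (equal weight) │            └──────────┘                      │
└────────────────┘                                              │
                                                                ▼
Round 2:                                              adjust sample weights
┌────────────────┐   train    ┌──────────┐  predict   (misclassified: weight ↑)
│ Weighted       │ ─────────→ │ Model h2 │ ─────────→  misclassified samples
│ (errors weigh  │            └──────────┘                      │
│  more)         │                                              │
└────────────────┘                                              ▼
Round 3:                                              adjust sample weights
┌────────────────┐   train    ┌──────────┐
│ Weighted       │ ─────────→ │ Model h3 │ ─────→ ......
│ (errors weigh  │            └──────────┘
│  more)         │
└────────────────┘

                    Final model
                        │
                        ▼
        H(x) = α1*h1(x) + α2*h2(x) + α3*h3(x) + ...
               ↑
          each model's weight is set by its weighted error rate
          (the more accurate the model, the larger its weight)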

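5.3 Why does Boosting reduce bias?

                      Bias-variance comparison

    Bagging:                             Boosting:
    ┌──────────────────────────┐         ┌──────────────────────────┐
    │ Many high-variance       │         │ Many high-bias models    │
    │ models (deep trees)      │         │ (shallow trees / stumps) │
    │            ↓             │         │            ↓             │
    │ Averaging lowers variance│         │ Sequential fitting       │
    │ Bias stays about the same│         │ lowers bias; variance    │
    │                          │         │ may rise slightly        │
    └──────────────────────────┘         └──────────────────────────┘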

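6. AdaBoost in detail

6.1 AdaBoost = Adaptive Boosting

Core idea: increase the weight of misclassified samples so that later models concentrate on the hard ones.

6.2 Algorithm steps

Initialize: every sample weight = 1/N   (labels and predictions take values in {-1, +1})

For t = 1, 2, ..., T:

    1. Train a weak classifier h_t on the weighted samples

    2. Compute the weighted error rate:
       ε_t = Σ(weights of the misclassified samples)

    3. Compute the model weight:
       α_t = 0.5 * ln((1 - ε_t) / ε_t)

       smaller ε_t (more accurate) → larger α_t (more influence)

    4. Update the sample weights:
       correctly classified:  w = w * exp(-α_t)  ← weight goes down
       misclassified:         w = w * exp(+α_t)  ← weight goes up

    5. Normalize the weights so they sum to 1

Final model: H(x) = sign(Σ α_t * h_t(x))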
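The steps above can be sketched from scratch in a few dozen lines. This is a minimal illustration, not a replacement for a library implementation: it assumes 1-D inputs, labels in {-1, +1}, and threshold stumps as weak learners, and the ten-point dataset is a toy example chosen so that no single stump can separate it.

```python
import math

def stump_predict(x, thr, sign):
    """Threshold stump: predicts `sign` when x <= thr, otherwise -sign."""
    return sign if x <= thr else -sign

def fit_stump(xs, ys, w):
    """Step 1: pick the stump with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best  # (weighted error, threshold, sign)

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1 / n] * n                      # initialize: equal weights
    model = []
    for _ in range(rounds):
        err, thr, sign = fit_stump(xs, ys, w)          # steps 1-2
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # step 3
        model.append((alpha, thr, sign))
        # Step 4: y * h(x) is +1 when correct, -1 when wrong, so this
        # multiplies by exp(-alpha) or exp(+alpha) respectively.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)                                 # step 5: normalize
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * stump_predict(x, thr, s) for a, thr, s in model)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    model = adaboost(xs, ys, rounds)
    errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(f"{rounds} round(s): {errors} training errors")
```

On this toy set the first stump alone misclassifies three points, while the weighted combination of three stumps fits all ten, which is the bias reduction described in section 5.3.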

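6.3 Visualizing the weight updates

Round 1 (all samples weighted equally):
Sample:  [●]    [●]    [●]    [○]    [○]    [○]    [●]    [○]    [●]    [○]
Weight:  0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
                        ↑                    ↑
                      wrong                wrong

Round 2 (misclassified samples gain weight; ε = 0.2, α ≈ 0.693):
Sample:  [●]    [●]    [●]    [○]    [○]    [○]    [●]    [○]    [●]    [○]
Weight:  0.0625 0.0625 0.25   0.0625 0.0625 0.25   0.0625 0.0625 0.0625 0.0625
                        ↑                    ↑
                     heavier              heavier
                (wrong last round)   (wrong last round)

Round 3 (the adjustment continues...):
→ The models focus more and more on the "hard" samples
→ The final ensemble combines all weak models, each with its own weight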
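Applying the update rule from section 6.2 to this ten-sample picture, with the third and sixth samples misclassified in round 1, gives the normalized round-2 weights directly. The snippet below is a small worked check of that arithmetic; the variable names are chosen for this example only.

```python
import math

n = 10
weights = [1 / n] * n
misclassified = {2, 5}            # 0-based positions of the two wrong samples

eps = sum(weights[i] for i in misclassified)      # weighted error rate
alpha = 0.5 * math.log((1 - eps) / eps)           # model weight

raw = [w * math.exp(alpha if i in misclassified else -alpha)
       for i, w in enumerate(weights)]
total = sum(raw)
new_weights = [w / total for w in raw]            # normalize to sum to 1

print(f"eps = {eps:.2f}, alpha = {alpha:.3f}")
print([round(w, 4) for w in new_weights])
```

After normalization the misclassified samples together carry exactly half of the total weight (2 × 0.25), so the stump from round 1 would score no better than chance on the reweighted data. That is what forces the next weak learner to find something new.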

6.4 Python example

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost's default base learner is already a depth-1 decision stump;
# it is spelled out here for clarity.
# The `algorithm` argument is left at its default: 'SAMME.R' is deprecated
# in recent scikit-learn releases.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: decision stump
    n_estimators=200,          # 200 rounds
    learning_rate=0.1,         # shrinks each model's contribution
    random_state=42
)

ada.fit(X_train, y_train)
print(f"AdaBoost accuracy: {ada.score(X_test, y_test):.4f}")

# Track training/test error after each boosting round
train_errors = []
test_errors = []
for y_pred_train, y_pred_test in zip(
    ada.staged_predict(X_train), ada.staged_predict(X_test)
):
    train_errors.append(1 - np.mean(y_pred_train == y_train))
    test_errors.append(1 - np.mean(y_pred_test == y_test))

plt.plot(train_errors, label='Train Error')
plt.plot(test_errors, label='Test Error')
plt.xlabel('Boosting rounds')
plt.ylabel('Error Rate')
plt.legend()
plt.title('AdaBoost error vs. number of rounds')
plt.show()

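7. Gradient-boosted trees (GBDT)

7.1 GBDT = Gradient Boosting Decision Tree

If AdaBoost "puts more weight on misclassified samples", GBDT "fits each new tree to the residuals (errors) left by the previous rounds".

An everyday analogy:

An archery contest. Your first arrow lands 10 cm off, so you correct your aim by that amount (-10 cm) and are now 2 cm off; you correct again (-2 cm), and so on. Each correction brings you closer to the bullseye.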

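7.2 How GBDT fits residuals

Targets:            y  = [90, 85, 70, 65]

Tree 1 (rough prediction):
Prediction:         ŷ1 = [80, 80, 80, 80]        ← a constant first guess (the exact mean is 77.5)
Residual:           r1 = [10,  5, -10, -15]      ← y - ŷ1

Tree 2 (fits residual r1):
Predicted residual: ŷ2 = [8,  3, -8, -12]
New residual:       r2 = [2,  2, -2, -3]         ← r1 - ŷ2

Tree 3 (fits residual r2):
Predicted residual: ŷ3 = [1.5, 1.5, -1.5, -2.5]
New residual:       r3 = [0.5, 0.5, -0.5, -0.5]  ← the residuals keep shrinking

Final prediction:
F(x) = ŷ1 + η*ŷ2 + η*ŷ3 + ...
       ↑      ↑
   initial   learning rate (guards against overfitting; the walk-through above uses η = 1)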
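The residual-fitting loop can be sketched with regression stumps in plain Python. This is an illustrative toy under stated assumptions, not how library GBDT implementations are built: the 1-D feature values xs are hypothetical (the walk-through above gives only targets), the initial prediction is the exact mean, and the learning rate of 0.5 is an arbitrary choice.

```python
def fit_regression_stump(xs, targets):
    """Least-squares split on a 1-D feature.
    Returns (threshold, mean of left side, mean of right side)."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def gradient_boost(xs, ys, n_trees, lr):
    f0 = sum(ys) / len(ys)                    # initial prediction: the mean
    preds = [f0] * len(ys)
    trees, mse_history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is left to explain
        thr, lm, rm = fit_regression_stump(xs, residuals)
        trees.append((thr, lm, rm))
        # Add a shrunken copy of the new tree's output to the running prediction
        preds = [p + lr * (lm if x <= thr else rm) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return f0, trees, mse_history

xs = [1, 2, 3, 4]                 # hypothetical feature values
ys = [90, 85, 70, 65]             # targets from the walk-through
f0, trees, mse_history = gradient_boost(xs, ys, n_trees=5, lr=0.5)
print(f"initial prediction: {f0}")
print("training MSE per round:", [round(m, 3) for m in mse_history])
```

A smaller learning rate makes each round remove less of the residual, so more trees are needed, but the individual steps are less likely to chase noise.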

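7.3 Why is it called "gradient" boosting?

Negative gradient of the loss ≈ residual

For squared error L = (y - F(x))^2 / 2:
  -∂L/∂F(x) = y - F(x) = residual

→ Each step moves along the negative gradient of the loss
→ This is gradient descent, carried out on the model's predictions
→ Hence "gradient" boosting

For other loss functions (such as log loss or Huber loss) the negative gradient is not exactly the residual, but the idea is the same: each round fits the negative gradient.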
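The claim about negative gradients can be checked numerically. The sketch below compares a finite-difference estimate of -∂L/∂F with the closed form for two losses: squared error, where it is the residual y - F, and binary log loss on a raw score F, where it is y - sigmoid(F). The function names and test points are arbitrary choices for this example.

```python
import math

def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def log_loss(y, f):
    """Binary log loss for label y in {0, 1} and raw score f (log-odds)."""
    return math.log(1 + math.exp(f)) - y * f

def numeric_neg_gradient(loss, y, f, h=1e-6):
    """Central finite-difference estimate of -dL/dF."""
    return -(loss(y, f + h) - loss(y, f - h)) / (2 * h)

def sigmoid(f):
    return 1 / (1 + math.exp(-f))

y, f = 3.0, 1.2
print("squared loss:", numeric_neg_gradient(squared_loss, y, f), "vs", y - f)

y, f = 1, 0.4
print("log loss:    ", numeric_neg_gradient(log_loss, y, f), "vs", y - sigmoid(f))
```

For log loss the quantity being fitted is "label minus predicted probability", which still behaves like a residual even though it is not y - F.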

7.4 Python example

from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    n_estimators=200,          # 200 trees
    learning_rate=0.1,         # learning rate
    max_depth=3,               # depth of each tree (GBDT usually uses shallow trees)
    subsample=0.8,             # row subsampling ratio (a Bagging-like effect)
    min_samples_leaf=5,
    random_state=42
)

gbdt.fit(X_train, y_train)
print(f"GBDT accuracy: {gbdt.score(X_test, y_test):.4f}")

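8. XGBoost / LightGBM / CatBoost: the big three

These three are enhanced versions of GBDT and regular winners in Kaggle competitions.

8.1 How they relate

                 GBDT (baseline)
                        │
       ┌────────────────┼────────────────┐
       │                │                │
       ▼                ▼                ▼
    XGBoost          LightGBM         CatBoost
    (2014)           (2017)           (2017)
    Tianqi Chen      Microsoft        Yandex

    Regularization   Histogram        Categorical
                     speedup          features
    Second-order     GOSS sampling    Ordered boosting
    gradients
    Sparsity-aware   Leaf-wise        Less overfitting
                     growth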

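8.2 Detailed comparison

Property	XGBoost	LightGBM	CatBoost
Speed	Medium	Fastest	Slower
Memory	Higher	Lowest	Medium
Categorical features	Usually encoded manually	Supported directly	Best native support
Tree growth strategy	Level-wise	Leaf-wise	Symmetric trees
Overfitting control	Good	Needs care	Best
Small datasets	Good	Overfits easily	Good
GPU support	Yes	Yes	Yes
Best suited for	General use	Large datasets	Many categorical features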

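8.3 Tree growth strategies compared

Level-wise (XGBoost default):
Splits layer by layer; every leaf in a layer is split

         [root]
         /    \
       [A]    [B]         ← layer 1 fully split
      / \    / \
    [C] [D][E] [F]        ← layer 2 fully split

Pro: less prone to overfitting
Con: may perform splits that add little


Leaf-wise (LightGBM):
Each step splits only the leaf with the largest gain

         [root]
         /    \
       [A]    [B]         ← B has the larger gain
               / \
             [E] [F]      ← E has the larger gain
             / \
           [G] [H]

Pro: lower loss for the same number of leaves
Con: unbalanced trees that may overfit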

8.4 Python example

# ==================== XGBoost ====================
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,      # column subsampling
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=1.0,             # L2 regularization
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)
# eval_set is used here only to record the metric, not for early stopping
xgb_clf.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            verbose=False)
print(f"XGBoost accuracy: {xgb_clf.score(X_test, y_test):.4f}")


# ==================== LightGBM ====================
import lightgbm as lgb

lgb_clf = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=-1,               # no depth limit
    num_leaves=31,              # maximum number of leaves (the key parameter)
    subsample=0.8,
    subsample_freq=1,           # row subsampling only takes effect when this is > 0
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)
lgb_clf.fit(X_train, y_train)
print(f"LightGBM accuracy: {lgb_clf.score(X_test, y_test):.4f}")


# ==================== CatBoost ====================
from catboost import CatBoostClassifier

cat_clf = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    l2_leaf_reg=3,              # L2 regularization
    random_state=42,
    verbose=0                   # do not print training progress
)
cat_clf.fit(X_train, y_train)
print(f"CatBoost accuracy: {cat_clf.score(X_test, y_test):.4f}")

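9. Stacking

9.1 Core idea

Stacking takes a different approach from both Bagging and Boosting.

Think of how a company makes a decision:

Layer 1: different departments (sales, engineering, finance) each give their own opinion

Layer 2: the CEO weighs all the departments' opinions and makes the final call

The CEO does not need to know the details of each department, only how to combine their opinions.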

9.2 Stacking architecture

                 Original training data
                            │
            ┌───────────────┼───────────────┐
            │               │               │
            ▼               ▼               ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ Model 1     │ │ Model 2     │ │ Model 3     │    Layer 1
     │ (random     │ │ (XGBoost)   │ │ (SVM)       │    (base learners)
     │  forest)    │ │             │ │             │
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
            ▼               ▼               ▼
      Prediction P1   Prediction P2   Prediction P3
            │               │               │
            └───────────────┼───────────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │ New feature matrix  │
                 │ [P1, P2, P3]        │        Layer 2
                 └──────────┬──────────┘        (meta-learner)
                            │
                            ▼
                 ┌─────────────────────┐
                 │ Meta-model          │
                 │ (logistic           │
                 │  regression)        │
                 └──────────┬──────────┘
                            │
                            ▼
                    Final prediction

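9.3 Key detail: use cross-validation to prevent overfitting

Predicting on the training set with models fitted on that same set, and then using those predictions as features, causes severe overfitting. The correct approach:

Generating layer-1 predictions with 5-fold cross-validation:

Split the training set into 5 parts: [Fold1] [Fold2] [Fold3] [Fold4] [Fold5]

Pass 1: train on Fold2-5 → predict Fold1 → predictions for Fold1
Pass 2: train on Fold1,3-5 → predict Fold2 → predictions for Fold2
Pass 3: train on Fold1,2,4,5 → predict Fold3 → ...
Pass 4: ...
Pass 5: ...

Concatenate the predictions from all folds → training features for layer 2 (no leakage)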
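The out-of-fold procedure can be written generically. The sketch below is a library-free illustration: fit and predict are placeholder callables supplied by the caller, the folds are contiguous and unshuffled for readability, and the base learner is a deliberately trivial one (it predicts the mean training label) so the numbers are easy to verify by hand.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(xs, ys, fit, predict, k=5):
    """For each fold, train on the other k-1 folds and predict the held-out
    fold, so no sample is predicted by a model that saw its label."""
    oof = [None] * len(xs)
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in fold:
            oof[i] = predict(model, xs[i])
    return oof

# Toy base learner: always predicts the mean of its training labels
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
oof = out_of_fold_predictions(xs, ys, fit_mean, predict_mean, k=5)
print(oof)
```

Running this once per base learner and placing the resulting lists side by side as columns gives the layer-2 training matrix. In practice the folds are usually shuffled or stratified first.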

9.4 Python example

from sklearn.ensemble import (
    StackingClassifier, RandomForestClassifier,
    GradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Layer-1 base learners
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gbdt', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

# Stacking ensemble
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # layer 2: logistic regression
    cv=5,                                   # 5-fold cross-validation
    stack_method='predict_proba',           # use probabilities as features
    n_jobs=-1
)

stacking.fit(X_train, y_train)
print(f"Stacking accuracy: {stacking.score(X_test, y_test):.4f}")

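10. Voting

10.1 Hard voting vs. soft voting

Suppose 3 models classify the same sample:

Hard voting: the majority wins
  Model A predicts: class 1
  Model B predicts: class 0
  Model C predicts: class 1
  → Final result: class 1 (2 votes to 1)

Soft voting: average the probabilities
  Model A: P(class 1) = 0.9,  P(class 0) = 0.1
  Model B: P(class 1) = 0.3,  P(class 0) = 0.7
  Model C: P(class 1) = 0.8,  P(class 0) = 0.2

  Average: P(class 1) = (0.9+0.3+0.8)/3 = 0.667
           P(class 0) = (0.1+0.7+0.2)/3 = 0.333
  → Final result: class 1

Soft voting usually performs better because it uses the probability information, as long as the models' probabilities are reasonably well calibrated.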
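The two rules are simple enough to write out by hand. This sketch reproduces the three-model example above for a binary problem and then shows a second, made-up set of probabilities where the two rules disagree; the function names and the 0.5 threshold are choices made for this illustration.

```python
from collections import Counter

def hard_vote(probs_class1, threshold=0.5):
    """Each model casts one vote for class 1 or class 0; the majority wins."""
    votes = Counter(1 if p >= threshold else 0 for p in probs_class1.values())
    return votes.most_common(1)[0][0]

def soft_vote(probs_class1, threshold=0.5):
    """Average P(class 1) across models, then threshold the mean."""
    mean_p1 = sum(probs_class1.values()) / len(probs_class1)
    return (1 if mean_p1 >= threshold else 0), mean_p1

example = {"A": 0.9, "B": 0.3, "C": 0.8}        # the example from the text
print(hard_vote(example), soft_vote(example))

# Made-up case: two lukewarm "yes" votes against one confident "no"
disagree = {"A": 0.55, "B": 0.55, "C": 0.10}
print(hard_vote(disagree), soft_vote(disagree))
```

In the second case hard voting picks class 1 (two votes to one) while soft voting picks class 0, because the one confident model outweighs two barely-convinced ones.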

10.2 Python example

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hard voting
hard_voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('svc', SVC()),
        ('dt', DecisionTreeClassifier())
    ],
    voting='hard'
)

# Soft voting (every model must support predict_proba)
soft_voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('svc', SVC(probability=True)),       # SVC needs probability=True
        ('dt', DecisionTreeClassifier())
    ],
    voting='soft'
)

hard_voting.fit(X_train, y_train)
soft_voting.fit(X_train, y_train)
print(f"Hard voting accuracy: {hard_voting.score(X_test, y_test):.4f}")
print(f"Soft voting accuracy: {soft_voting.score(X_test, y_test):.4f}")

11. Case studies

11.1 Case 1: Credit scoring

This is one of the most common machine learning applications in finance.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, classification_report
import xgboost as xgb
import lightgbm as lgb

# ---------- 1. Simulate credit scoring data ----------
np.random.seed(42)
n = 5000
data = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10, 1, n),
    'debt_ratio': np.random.uniform(0, 1, n),
    'num_credit_lines': np.random.randint(0, 20, n),
    'late_payments_30d': np.random.poisson(1, n),
    'late_payments_90d': np.random.poisson(0.3, n),
    'credit_utilization': np.random.uniform(0, 1, n),
})
# Simplified default label: a logistic function of a few features
prob = 1 / (1 + np.exp(-(
    -3 + 0.02 * data['debt_ratio'] * 5
    + 0.5 * data['late_payments_90d']
    - 0.01 * data['income'] / 10000
)))
data['default'] = (np.random.random(n) < prob).astype(int)

X = data.drop('default', axis=1)
y = data['default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ---------- 2. Train several models and compare ----------
models = {
    'XGBoost': xgb.XGBClassifier(
        n_estimators=200, learning_rate=0.05, max_depth=4,
        subsample=0.8, colsample_bytree=0.8,
        eval_metric='auc', random_state=42, n_jobs=-1
    ),
    'LightGBM': lgb.LGBMClassifier(
        n_estimators=200, learning_rate=0.05, num_leaves=31,
        subsample=0.8, subsample_freq=1,  # subsample needs subsample_freq > 0
        colsample_bytree=0.8,
        random_state=42, verbose=-1, n_jobs=-1
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    print(f"{name} AUC: {auc:.4f}")

# ---------- 3. Feature importance ----------
# Using LightGBM as the example
feature_imp = pd.DataFrame({
    'feature': X.columns,
    'importance': models['LightGBM'].feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature importance ranking:")
print(feature_imp.to_string(index=False))

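11.2 Case 2: Ensembling strategies in Kaggle competitions

A routine that top Kaggle competitors often follow:

Typical Kaggle ensembling pipeline:

Step 1: Feature engineering (about 70% of the time)
  │
  ▼
Step 2: Train several models of different types
  ├── LightGBM (3 parameter sets)
  ├── XGBoost (3 parameter sets)
  ├── CatBoost (× 2)
  ├── Neural networks (× 2)
  └── Other models ...
  │
  ▼
Step 3: Two-layer Stacking or weighted averaging
  ├── Option A: Stacking (logistic regression as the meta-learner)
  └── Option B: Weighted average (weights based on validation performance)
  │
  ▼
Step 4: Post-processing + submission

# Simple weighted-average blending
from sklearn.metrics import roc_auc_score

# Predicted probabilities from the two models trained in 11.1
pred_xgb = models['XGBoost'].predict_proba(X_test)[:, 1]
pred_lgb = models['LightGBM'].predict_proba(X_test)[:, 1]

# Method 1: simple average
pred_avg = (pred_xgb + pred_lgb) / 2
print(f"Simple average AUC: {roc_auc_score(y_test, pred_avg):.4f}")

# Method 2: weighted average (weights from each model's validation performance)
# Assume validation AUCs of XGBoost = 0.85 and LightGBM = 0.87
w_xgb, w_lgb = 0.85, 0.87
w_sum = w_xgb + w_lgb
pred_weighted = (w_xgb * pred_xgb + w_lgb * pred_lgb) / w_sum
print(f"Weighted average AUC: {roc_auc_score(y_test, pred_weighted):.4f}")

# Method 3: search for the best weights automatically with Optuna
# (the recommended approach in real competitions)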
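To make the idea of a weight search concrete without extra dependencies, the sketch below uses a plain grid search over a single blending weight instead of Optuna, and a pairwise-comparison AUC written by hand. The labels and predictions are hypothetical validation data invented for this example; in a real project the weight should be tuned on a validation set, never on the test set.

```python
def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs that are
    ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_blend_weight(y_val, pred_a, pred_b, steps=20):
    """Try w = 0, 1/steps, ..., 1 for the blend w*a + (1-w)*b and keep
    the weight with the highest validation AUC."""
    best_w, best_auc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        score = auc(y_val, blended)
        if score > best_auc:
            best_w, best_auc = w, score
    return best_w, best_auc

# Hypothetical validation labels and two models' predicted probabilities
y_val  = [0, 0, 0, 1, 1, 1]
pred_a = [0.1, 0.6, 0.3, 0.5, 0.9, 0.4]
pred_b = [0.4, 0.2, 0.5, 0.3, 0.6, 0.9]

w, blended_auc = best_blend_weight(y_val, pred_a, pred_b)
print(f"model A AUC: {auc(y_val, pred_a):.3f}")
print(f"model B AUC: {auc(y_val, pred_b):.3f}")
print(f"best weight for A: {w:.2f}, blended AUC: {blended_auc:.3f}")
```

Because the grid includes w = 0 and w = 1, the blended score on the validation data can never be worse than the better single model. With more than two models a grid becomes impractical, which is where an optimizer such as Optuna earns its place.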

11.3 Hyperparameter tuning tips

# Tune LightGBM with Optuna (recommended)
# pip install optuna

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 15, 127),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'subsample_freq': 1,  # needed for subsample to take effect
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'random_state': 42,
        'verbose': -1,
        'n_jobs': -1,
    }
    model = lgb.LGBMClassifier(**params)
    # Score each trial by 5-fold cross-validated AUC on the training set
    scores = cross_val_score(model, X_train, y_train,
                              cv=5, scoring='roc_auc')
    return scores.mean()

# Run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best parameters: {study.best_params}")

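12. Summary and selection guide

12.1 The whole picture in one chart

┌──────────────────────────────────────────────────────────────────────┐
│                   Ensemble method selection guide                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Small data (<10k rows)      → CatBoost / XGBoost                    │
│  Large data (>1M rows)       → LightGBM (fastest)                    │
│  Many categorical features   → CatBoost (native support)             │
│  Interpretability needed     → Single decision tree / random forest  │
│  Maximum accuracy            → Stacking several models               │
│  Quick baseline              → LightGBM with default parameters      │
│  Overfitting is main concern → Random forest / Bagging               │
│  Kaggle competition          → LightGBM + XGBoost + CatBoost blend   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘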

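12.2 Quick answers to common interview questions

Q: What is the difference between Bagging and Boosting?

Bagging: parallel, reduces variance, less sensitive to outliers, less prone to overfitting
Boosting: sequential, reduces bias, sensitive to outliers, can overfit

Q: Why is a random forest less prone to overfitting?

1. Bootstrap sampling introduces randomness
2. Random feature selection lowers the correlation between trees
3. Averaging many trees smooths the prediction further

Q: What does XGBoost improve over GBDT?

1. Uses second-order derivatives (second-order Taylor expansion) for a more accurate approximation of the loss
2. Adds regularization terms (L1 + L2) to curb overfitting
3. Handles sparse data automatically
4. Built-in column subsampling (similar to feature sampling in random forests)
5. Supports parallel computation (parallel across features, not across trees)
6. Supports custom loss functions

Q: Why is LightGBM faster than XGBoost?

1. Histogram algorithm: bins continuous features into histograms, which cuts computation
2. GOSS (Gradient-based One-Side Sampling): keeps the samples with large gradients and subsamples the rest
3. EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features
4. Leaf-wise growth: more efficient than level-wise growth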

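12.3 Suggested learning path

Beginner path:

1. Understand decision trees ✓
        ↓
2. Understand random forests (Bagging + random features)  ← you are here
        ↓
3. Understand AdaBoost (reweighting)
        ↓
4. Understand GBDT (residual fitting)
        ↓
5. Get hands-on with XGBoost / LightGBM (tuning in practice)
        ↓
6. Try Stacking
        ↓
7. Enter a Kaggle competition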

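Appendix: glossary of key terms

Term	Meaning
Ensemble learning	Combining multiple models
Base learner	An individual model inside the ensemble
Weak learner	A model only slightly better than random guessing
Bootstrap sampling	Sampling with replacement: each draw is put back before the next
Out-of-bag (OOB) data	Samples that were not drawn into a bootstrap sample
Residual	True value minus predicted value
Learning rate	Controls the size of each update step
Regularization	A penalty that discourages overfitting
Meta-learner	The second-layer model in Stacking
Feature importance	How much each feature contributes to the model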

Next up: 09 - Model Evaluation and Tuning: Cross-Validation, Grid Search, and Bayesian Optimization
