Knowledge Summary: The Underlying Principles of MarsCode AI (Part 6) | Doubao MarsCode AI Problem Practice

Chapter 6: Nonlinear Machine Learning and Ensemble Methods

I. Decision Trees (key exam topic!!!)

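Let's take playing tennis as an example: when exactly do we go out to play?

If you are comfortable with the tree data structure, you will see that the decision to go out and play can be represented as a tree. Each branch represents a possible decision, outcome, or reaction, and each leaf node represents a final result.

Let's build a tree in Python: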

import math

class TreeNode:
    def __init__(self, attribute=None, class_label=None):
        self.attribute = attribute  # The attribute tested at this node (None for a leaf)
        self.class_label = class_label  # The class label if it's a leaf node
        self.children = {}  # Maps each attribute value to a child node

def majority_class(samples):
    # Returns the most frequent class label in the samples
    class_counts = {}
    for sample in samples:
        label = sample[-1]  # Assuming the last element is the class label
        if label not in class_counts:
            class_counts[label] = 0
        class_counts[label] += 1
    return max(class_counts, key=class_counts.get)

def entropy(samples):
    # Entropy (in bits) of the class labels in the samples
    class_counts = {}
    for sample in samples:
        class_counts[sample[-1]] = class_counts.get(sample[-1], 0) + 1
    total = len(samples)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts.values())

def best_partition_attribute(D, A):
    # Select the attribute in A with the largest information gain
    def gain(a):
        remainder = 0.0
        for value in set(sample[a] for sample in D):
            Dv = [sample for sample in D if sample[a] == value]
            remainder += len(Dv) / len(D) * entropy(Dv)
        return entropy(D) - remainder
    return max(A, key=gain)

def TreeGenerate(D, A):
    # D: list of samples (sequences whose last element is the class label)
    # A: list of candidate attribute indices into each sample
    node = TreeNode()  # Step 1: Generate a new node

    # Step 2: Check if all samples belong to the same class
    classes = set(sample[-1] for sample in D)
    if len(classes) == 1:
        node.class_label = classes.pop()  # Step 3: Mark as leaf node
        return node

    # Step 5: Check if A is empty or all samples agree on every attribute in A
    if not A or all(sample[a] == D[0][a] for sample in D for a in A):
        # Step 6: Mark as leaf node with majority class
        node.class_label = majority_class(D)
        return node

    # Step 7: Select the best partition attribute and record it on the node
    a_star = best_partition_attribute(D, A)
    node.attribute = a_star

    # Step 8: Iterate through the values of a_star observed in D
    for value in set(sample[a_star] for sample in D):
        # Step 9: Create subset Dv
        Dv = [sample for sample in D if sample[a_star] == value]
        # Step 10: Dv is empty only for values not observed in D;
        # the check is kept to mirror the pseudocode
        if not Dv:
            node.children[value] = TreeNode(class_label=majority_class(D))
        else:
            node.children[value] = TreeGenerate(Dv, [a for a in A if a != a_star])

    return node  # Return the generated tree node

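Based on the code above, the tennis problem offers four ways to split the tree at the root, one per attribute, as shown in the figure:

Which split is best?

A good split: one value (node) contains only positive instances, and every other value (node) contains only negative instances.

A poor split: no discriminating power. The attribute contributes nothing to the decision, and every value (node) holds roughly equal numbers of positive and negative instances (a 50/50 split).

If every leaf produced by a split is sufficiently pure, the split is a good one. Entropy is the quantity used to measure this: the lower the entropy of a node, the purer it is.

Given a set D that contains only positive and negative examples, the entropy of D can be computed with the following formula: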
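Writing p₊ and p₋ for the proportions of positive and negative examples in D (symbol names chosen here for illustration), the standard two-class entropy, which matches the mnemonic used in these notes, can be written as:

```latex
\mathrm{Entropy}(D) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}
```

By convention 0 log2 0 is taken to be 0, so a pure set has entropy 0 and a 50/50 set has entropy 1.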

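Guaranteed exam question: how do you compute the entropy of a data set D?

Mnemonic: negative sign, proportion, A log2 A. For each class with proportion A, write the term -A log2 A, then add the terms up.

Suppose a leaf node holds 9 positive and 5 negative examples. Applying the mnemonic, we can quickly write: Entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940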
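To check this kind of arithmetic, here is a minimal sketch of the mnemonic in Python. The helper name `entropy` and its list-of-counts argument are choices made for this illustration.

```python
import math

def entropy(counts):
    # Entropy in bits of a node, given the number of examples in each class.
    # Classes with zero examples are skipped (0 * log2(0) is treated as 0).
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))  # 9 positive, 5 negative: about 0.940
print(entropy([7, 7]))            # 50/50 node, the least pure case
print(entropy([14, 0]))           # pure node
```

The 50/50 node comes out at 1 bit and the pure node at 0, which is the "poor split" versus "good split" contrast described earlier.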

This extends naturally from two classes to any number of classes, as shown in the figure.
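Assuming D contains c classes and p_i denotes the proportion of examples that belong to class i (notation chosen here), the general form is commonly written as:

```latex
\mathrm{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i
```

With c = 2 the sum has just two terms, one for the positive class and one for the negative class.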

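In the introduction, we said that entropy is a mathematical tool for measuring information density, and also a tool for measuring the purity of a set.

We now introduce a Gain function that measures how much information a split gains.

Values(A): the set of all possible values that attribute A can take

Dv: the subset of D in which attribute A takes the value v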
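Using the two definitions above, information gain in its usual form is the entropy of D minus the size-weighted entropy of the subsets produced by splitting on A. A sketch in the notation of these notes:

```latex
\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v)
```

The attribute with the largest Gain leaves the purest subsets behind, so it is the one chosen for the split.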

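Guaranteed exam question: convert a table like the tennis one into a decision tree

With the formulas in hand, working out the tree splits becomes much easier. Practice the detailed calculations on your own. Let's review the instructor's approach to splitting:

① Apply the formula to the four candidate trees to compute the information gain; Outlook turns out to have the largest Gain value

② The middle branch of Outlook needs no further splitting; we only need to split the Sunny and Rain branches

③ For the Sunny node, we can split further on Humidity, Wind, or Temp; Humidity gives the best split, after which we stop splitting

④ For the Rain node, we can split further on Humidity, Wind, or Temp; Wind gives the best split, after which we stop splitting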
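The four steps above can be reproduced in a few lines of Python. This sketch assumes the table is the classic 14-row PlayTennis data set (9 positive, 5 negative); the original table is not reproduced in these notes, so the rows below are an assumption, and the names `DATA`, `ATTRS`, `gain`, and `best_attribute` are chosen for illustration.

```python
import math
from collections import Counter

# Assumed data: classic PlayTennis table (Outlook, Temp, Humidity, Wind, Play)
ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(rows):
    # Entropy in bits of the class labels (last column)
    counts = Counter(r[-1] for r in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    # Entropy(D) minus the size-weighted entropy of each subset Dv
    remainder = 0.0
    for v in set(r[i] for r in rows):
        dv = [r for r in rows if r[i] == v]
        remainder += len(dv) / len(rows) * entropy(dv)
    return entropy(rows) - remainder

def best_attribute(rows, candidates):
    # Index of the candidate attribute with the largest gain
    return max(candidates, key=lambda i: gain(rows, i))

root = best_attribute(DATA, range(4))
print("root:", ATTRS[root])
for branch in ("Sunny", "Rain"):
    subset = [r for r in DATA if r[0] == branch]
    print(branch, "->", ATTRS[best_attribute(subset, [1, 2, 3])])
```

On this assumed table the root is Outlook, the Sunny branch splits on Humidity, and the Rain branch splits on Wind, in line with steps ① to ④.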

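Exam notes

If a long-answer question lets us split by inspection, there is no need to memorize the formulas: just enumerate the candidate splits by brute force.

If the question asks us to show the solution steps, we have to follow the formulas rigorously.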

II. Ensemble Learning

Ensemble learning: combining a set of base models to produce a better predictive model.

Main methods: Bagging and Boosting
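As a rough illustration of bagging (bootstrap aggregating), the sketch below trains each base model on a bootstrap sample, meaning a sample drawn with replacement, and combines the models by majority vote. The base learner `fit_stump`, the toy data, and all function names are hypothetical and only serve the example.

```python
import random
from collections import Counter

def fit_stump(sample):
    # Hypothetical base learner: 1-D threshold halfway between the class means
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    if not pos or not neg:  # the bootstrap sample happened to contain one class only
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    high = 1 if mean_pos > mean_neg else 0
    return lambda x: high if x > threshold else 1 - high

def bagging_fit(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_stump(boot))
    return models

def bagging_predict(models, x):
    # Majority vote over the base models
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Toy 1-D data: (feature, label)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 1.2), bagging_predict(models, 7.1))
```

Boosting differs in that the base models are trained one after another, each one focusing on the mistakes of the previous ones.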

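Random forest is an extension of bagging that uses decision trees as its base learners.

At each split, m features are drawn at random from the p available features, and the best split feature is chosen from among those m.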

Bagging	Random Forest
Fixed structure	Randomized structure; more efficient to train
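The random feature draw can be sketched as follows. The function name `choose_split_feature` and the `gains` list (one placeholder score per feature) are hypothetical; in a real tree the scores would be recomputed from the data at every node.

```python
import random

def choose_split_feature(gains, m, rng):
    # Draw m distinct feature indices out of p = len(gains), then keep the
    # best-scoring one among those m (not among all p features)
    candidates = rng.sample(range(len(gains)), m)
    best = max(candidates, key=lambda j: gains[j])
    return best, candidates

rng = random.Random(42)
p = 16
m = int(p ** 0.5)  # a common heuristic for classification; an assumption, not from these notes
gains = [rng.random() for _ in range(p)]  # placeholder scores, for illustration only

best, candidates = choose_split_feature(gains, m, rng)
print("candidates:", sorted(candidates), "chosen:", best)
```

Scoring only m of the p features at each split is cheaper than scoring all of them, and it also makes the trees in the forest differ from one another.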

III. AdaBoost (Adaptive Boosting)

Combines base learners linearly

Iteratively adapts to the errors made by base learners in previous iterations
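In symbols, with h_t the base learner from round t, e_t its weighted error rate, and α_t its weight (the same quantities the code in this section computes, with labels in {-1, +1}), the linear combination is usually written as:

```latex
H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right), \qquad \alpha_t = \frac{1}{2}\ln\frac{1 - e_t}{e_t}
```

A learner with a lower error rate gets a larger α_t and therefore more say in the final vote.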

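Weight-update trick:

Points that were misclassified are assigned higher weights, and points that were classified correctly are assigned lower weights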
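A single round makes the effect concrete. The sketch below assumes 10 points with uniform starting weights and a hypothetical base learner that gets the last 3 of them wrong; the update rule is the exponential one used by the AdaBoost code in this section.

```python
import math

n = 10
w = [1.0 / n] * n
correct = [True] * 7 + [False] * 3  # assumption: which points h_t classified correctly

e_t = sum(wi for wi, ok in zip(w, correct) if not ok)  # weighted error rate
alpha_t = 0.5 * math.log((1 - e_t) / e_t)               # weight of this base learner

# y * h_t(x) is +1 for a correct point and -1 for a misclassified one, so
# correct points are scaled by exp(-alpha_t) and misclassified ones by exp(+alpha_t)
w = [wi * math.exp(-alpha_t if ok else alpha_t) for wi, ok in zip(w, correct)]
total = sum(w)
w = [wi / total for wi in w]  # normalize so the weights sum to 1

print(round(e_t, 3), round(alpha_t, 4))
print(round(w[0], 4), round(w[-1], 4))  # a correct point vs. a misclassified point
```

After normalization each correct point weighs 1/14 and each misclassified point weighs 1/6, so the three mistakes together carry half of the total weight in the next round.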

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class AdaBoost:
    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.alphas = []  # Weight of each weak learner
        self.models = []  # The weak learners themselves

    def fit(self, X, y):
        # Labels y must be encoded as -1 / +1 for the weight update below
        n_samples = X.shape[0]
        # Step 1: Initialize the weight vector uniformly
        w = np.ones(n_samples) / n_samples

        for t in range(self.n_estimators):
            # Step 3: Fit a base learner (a depth-1 tree, i.e. a decision stump)
            model = DecisionTreeClassifier(max_depth=1)
            model.fit(X, y, sample_weight=w)
            y_pred = model.predict(X)

            # Step 4: Compute the weighted classification error rate
            error = np.sum(w * (y_pred != y)) / np.sum(w)

            # Step 6: Compute the weight of this base learner
            alpha = 0.5 * np.log((1 - error) / (error + 1e-10))  # Small constant avoids division by zero

            # Step 7: Update the weight of every data point
            w *= np.exp(-alpha * y * y_pred)  # Raise weights of mistakes, lower the rest
            w /= np.sum(w)  # Normalize the weights

            self.models.append(model)  # Save the weak learner
            self.alphas.append(alpha)  # Save its weight

    def predict(self, X):
        # Step 9: Output the final hypothesis
        final_pred = np.zeros(X.shape[0])
        for alpha, model in zip(self.alphas, self.models):
            final_pred += alpha * model.predict(X)
        return np.sign(final_pred)  # Sign of the weighted vote

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Generate a sample data set
    X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=42)
    y = 2 * y - 1  # Convert labels to {-1, 1}

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the AdaBoost model
    model = AdaBoost(n_estimators=50)
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy}")

Next, let's work through AdaBoost hands-on, starting from an example.

for t in range(1, T + 1):
    if error <= bound:  # keep looping until the error is small enough
        break
    # ① Fit the base learner h_t(x) to the weighted data points
    # ② Calculate the classification error rate e_t of h_t(x)
    # ③ Calculate the weight alpha_t for h_t(x)
    # ④ Update the weight of each data point
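The loop above can be run end to end on a small example. The sketch below uses a hypothetical ten-point 1-D toy set (not the example from the original figures) and threshold stumps as base learners; every name in it is chosen for illustration.

```python
import math

# Hypothetical toy data: x = 0..9 with labels in {-1, +1}
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump(threshold, polarity):
    # Predicts `polarity` when x < threshold, and -polarity otherwise
    return lambda x: polarity if x < threshold else -polarity

def fit_stump(X, y, w):
    # Steps 1 and 2: try every threshold and polarity, keep the lowest weighted error
    best = None
    for threshold in [x + 0.5 for x in range(-1, 10)]:
        for polarity in (1, -1):
            h = stump(threshold, polarity)
            err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best

def predict(ensemble, x):
    # Sign of the alpha-weighted vote
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

def training_error(ensemble, X, y):
    return sum(predict(ensemble, xi) != yi for xi, yi in zip(X, y)) / len(X)

def adaboost(X, y, T=10, bound=0.0):
    w = [1.0 / len(X)] * len(X)
    ensemble = []  # list of (alpha_t, h_t)
    for t in range(T):
        e_t, h_t = fit_stump(X, y, w)
        alpha_t = 0.5 * math.log((1 - e_t) / max(e_t, 1e-10))  # step 3
        ensemble.append((alpha_t, h_t))
        # Step 4: raise the weights of mistakes, lower the rest, then normalize
        w = [wi * math.exp(-alpha_t * yi * h_t(xi)) for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        if training_error(ensemble, X, y) <= bound:  # error is small enough: stop
            break
    return ensemble

ensemble = adaboost(X, y)
print(len(ensemble), training_error(ensemble, X, y))
```

No single stump can classify this set perfectly, but on this toy data the weighted vote reaches zero training error after three rounds.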