决策树中信息增益、ID3以及C4.5的实现与总结决策树其核心是寻找在给定的特征条件下类（Labels or Catego

决策树其核心是寻找在给定的特征条件下类（Labels or Category）的条件概率分布。构建一颗决策树通常都要经历以下三个步骤：

特征提取
决策树的生成
决策树的剪枝

本文旨在以分类决策树为模型，总结决策树相关思想、实现特征提取、ID3以及C4.5这两个决策树生成算法。

本文数据集采用李航——《统计学习方法第2版》中的贷款申请样本数据集。

其中『年龄』属性的三个取值{0, 1, 2}对应{青年、中年、老年}；『有工作』、『有自己房子』属性的两个取值{0, 1}对应{否、是}；『信贷情况』的三个取值{0, 1, 2}对应{一般、好、非常好}；『类别』属性表示是否同意贷款取值{0, 1}对应{否、是}。

数据集下载地址

特征选择

选择能够最大程度划分数据集𝑫的特征。

信息增益

要理解信息增益的首要前提是理解『熵』这一抽象概念。

『熵』是信息的一个度量单位，在信息论与概率统计中，『熵』表示随机变量不确定性的程度，变量的不确定性越大，『熵』就越大，要了解（评估）这一变量需要的信息量也就越多。

这一基本概念对信息增益这个思想起着非常重要的指导作用。

要解释这一句话我们需要引入『熵』与『条件熵』的公式定义。（当『熵』和『条件熵』的概率由数据估计（似然估计等）得到时，所对应的分别是『经验熵』和『经验条件熵』）

设一离散型随机变量𝑿，及其概率分布为：

则『熵』或『经验熵』定义为：

$H(p) = -\sum^{n}_{i = 1}{p_i}{logp_i}$

另设随机变量(𝑿, 𝒀)，其联合概率分布为：

$p(X = x_i, Y = y_i) = p_{ij}, \qquad i = 1, 2, ... ,n; \quad j = 1, 2, ..., m$

则『条件熵』𝑯(𝒀|𝑿) 表示在已经随机变量X的条件下Y的不确定性，有：

$H(Y|X)=\sum^{n}_{i=1}p_iH(Y|X=x_i)\qquad p_i=P(X=x_i),i=1,2,...,n$

也就是说当我们已经知道一些信息𝑿，这些信息对我们了解𝒀的帮助程度达到多少。

信息增益：特征𝑨对训练数据集D的信息增益𝒈(𝑫, 𝑨)，定义为集合𝑫的经验熵𝑯(𝑫)与特征𝑨给定条件下𝑫的经验条件熵𝑯(𝑫|𝑨)之差，即：

在特征选择的过程中，需要选择信息增益值最大的的特征𝑨。

当一个集合𝑫给定时，𝑯(𝑫)就是定值，此时如果特征𝑨使得𝒈(𝑫, 𝑨)取得最大值，就意味着经验条件熵𝑯(𝑫|𝑨)是所有特征里最小的，即此时特征𝑨能够最大程度的减少𝑫的不确定性，也就意味着特征𝑨能够很好的将𝑫进行区分。这就是信息增益算法的核心。

Python实现

import scipy as sci
def information_gain(X: pd.DataFrame, feature, y) -> float:
    """
    param X: the dataset.
    param feature: the feature which need to compute.
    param y: categories or labels
    1. compute the data set D empirical entropy H(D)
    2. compute the empirical conditional entropy for D under the feature A. H(D|A)
    3. compute the infomation gain. g(D, A) = H(D) - H(D|A)
    """
    C = set(y)
    m = int(y.count())
    HD = 0
    for c in C:
        ck = int(y[y==c].count())
        HD += (ck/m * sci.log2(ck/m))
    HD = -HD
    
    A = X[feature]
    DA =  set(A)
    HDA = 0
    for da in DA:
        di = int(A[A==da].count())
        temp_hd = 0
        for c in C:
            dik = float(A[(A==da) & (y == c)].count())
            if dik == 0:
                continue
            temp_hd += (dik/di * sci.log2(dik/di))
        temp_hd = -temp_hd
        HDA += (di/m * temp_hd)
    return HD-HDA

测试『年龄』特征的信息增益值：

X = dataset[["Age", "Working", "Housing", "Credit"]]
y = dataset.Category
information_gain(X, "Age", y)       #  0.083

信息增益比

信息增益比算法修正了信息增益算法中会对某一特征取值较多时产生偏向的情况。

$\ _{gR} (D, A) = \frac{_{g}(D,A)}{\mathrm{H}_{A}(D)}$

其中

$\mathrm{H}_{A}(D) = -\sum^{n}_{i=1} \frac{\left | D_{i} \right |}{\left | D \right | } log_{2} \frac{\left | D_{i} \right |}{\left | D \right | }$

Python3代码

def information_gain_rate(X: pd.DataFrame, feature, y) -> float:
    """
    param X: dataset.
    param feature: which need to compute.
    param y: labels.
    1. compute g(D, A) -> information gain
    2. compute HA(D)
    """
    g = information_gain(X, feature, y)
    m = int(y.count())
    A = X[feature]
    N = set(A)
    HDA = 0
    for i in N:
        di = int(A[A==i].count())
        HDA += (di/m * sci.log2(di/m))
    HDA = -HDA
    return g/HDA

测试『年龄』特征信息增益比：

information_gain_rate(X, "Working", y)     # 0.35

最优特征选择

思路很简单，就是遍历所有的特征寻找信息增益或者信息增益比最大的特征。

def select_optima_feature(alg, X: pd.DataFrame, features: list, y) -> tuple:
    """
    Return the optimatic feature and correspondence value
    """
    opt_col = ""
    opt_ig = -1
    for c in features:
        if not callable(alg):
            raise TypeError("alg must a callable function.")
        ig = alg(X, c, y)
        if ig > opt_ig:
            opt_col = c
            opt_ig = ig
    return opt_col, opt_ig

测试最优特征

# 采用信息增益算法
select_optima_feature(information_gain, X, ["Age", "Housing", "Working", "Credit"], y)    # ('Housing', 0.4199730940219749)

# 采用信息增益比算法
select_optima_feature(information_gain_rate, X, ["Age", "Housing", "Working", "Credit"], y)    # ('Housing', 0.4325380677663126)

ID3 与 C4.5算法实现

ID3算法的核心是在决策树各个节点上应用信息增益准则选择特征，递归地构建决策树。而C4.5仅仅是在ID3基础上将信息增益算法换成信息增益比。

在构建决策树之前，我们还必须实现根据特征划分子集的方法。

def spilt_subset(X: pd.DataFrame, feature, feature_labels, y) -> dict:
    result = {}
    for i in feature_labels:
        s = (X[feature] == i)
        result.setdefault(i, {})
        result[i]['X'] = X[s]
        result[i]['y'] = y[s]
    return result

构建决策树

实现

def build_decision_tree(fsalg, X: pd.DataFrame, features: list, threshold: float, y):
    """
    create a decision tree
    param fsalg: feature selection algorithm.
    params X: dataset.
    params features: a array like, to describe what features in the D.
    params threshold: in accordance with judge D whether is a single tree.
    params labels_name: colum name which are include category.
    Return a dict-like structure object which namely Decision Tree
    """
    # 1. detect the features is equal null and whether all instances in D belong to the same category.
    if (not features) or (len(set(y))==1):
        return int(y.mode()[0])
    # 2. compute the biggest value for information gain.
    A, ig_value = select_optima_feature(fsalg, X, features, y)
    # 3. detect the value whether large than threshold.
    if ig_value < threshold:
        return int(y.mode()[0])
    DT = {}
    values = list(set(X[A]))
    DT.setdefault(A, {})
    childset = spilt_subset(X, A, values, y)
    features.remove(A)
    for v in values:
        DT[A][v] = build_decision_tree(fsalg, childset[v]['X'], features, threshold, childset[v]['y'])
    return DT

可以看到生成决策树时涉及到一个阈值，这个阈值是代表了能够算法忍受的最低信息不确定性因子，因为不管使用信息增益或者是信息增益比算法，其核心都是以最小化特征𝑨对𝑫的不确定性亦𝑯(𝑫|𝑨)，当𝑯(𝑫|𝑨)无限逼近𝑯(𝑫)时，此时可以说特征𝑨对于了解𝑫事件来说毫无意义，因此这个阈值就是在限定这种情况的最低限度。

测试：

model_id3 = build_decision_tree(information_gain, X, ["Age", "Housing", "Working", "Credit"], 0.2, y)    # {'Housing': {0: {'Working': {0: 0, 1: 1}}, 1: 1}}

model_c45 = build_decision_tree(information_gain_rate, X, ["Age", "Housing", "Working", "Credit"], 0.2, y)    # {'Housing': {0: {'Working': {0: 0, 1: 1}}, 1: 1}}

该树表示为