Ensemble Learning: Random Forest

(Before we begin: please take note of section 4.1 Data Preparation as you read. Many thanks.)

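1 Introduction

Random Forest

A random forest builds a "forest" in a randomized way. It consists of many decision trees, each trained independently of the others. Once the forest is built, every tree makes its own prediction for a new sample, and the final class is decided by majority vote.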
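The voting step can be sketched in a few lines. This is a minimal illustration, not part of the original article: the function name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list: [(label, count)]
    return counts.most_common(1)[0][0]


# Hypothetical predictions from five trees for a single new sample
votes = [1, 0, 1, 1, 0]
final_label = majority_vote(votes)
print(final_label)
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees instead of counting hard votes, which usually leads to the same decision.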

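Advantages

1. Performs well on many datasets and often compares favorably with other algorithms;

2. Easy to parallelize, which is a major advantage on large datasets;

3. Can handle high-dimensional data without requiring a separate feature-selection step.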

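2 Structure

Random Forest is an extended variant of Bagging. Building on a Bagging ensemble that uses decision trees as base learners, it additionally introduces random feature selection into the training of each tree. A random forest consists of four parts:

1. Random selection of samples (sampling with replacement);

2. Random selection of features;

3. Building the decision trees;

4. Forest voting (or averaging, for regression).

Note: this is where the word "random" in random forest comes from: random samples and random features.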
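The four parts above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration under stated assumptions, not the internals of `RandomForestClassifier`: the data is synthetic, and the tree count and variable names are chosen only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

rng = np.random.default_rng(0)
n_trees = 15  # odd, so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Part 1: bootstrap sample (draw row indices with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Parts 2-3: max_features="sqrt" makes every split consider only a
    # random subset of the features while the tree is being built
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Part 4: majority vote over the trees (labels are 0/1, so the mean
# of the votes is the fraction of trees predicting class 1)
all_preds = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
train_accuracy = (forest_pred == y).mean()
print(train_accuracy)
```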

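3 Sampling

Random sample selection works the same way as in Bagging: it uses bootstrap sampling (sampling with replacement). Random feature selection means that every node draws a random subset of candidate features when it splits (as opposed to drawing one batch of features per tree).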
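Both kinds of randomness can be illustrated with NumPy. This is a small sketch; the sample count, the feature count, and the square-root rule for the subset size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Bootstrap sampling: draw n row indices with replacement
n_samples = 10_000
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
# Some rows are drawn several times and others not at all; the fraction of
# distinct rows approaches 1 - 1/e (about 0.632) as n grows
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(unique_fraction)

# Per-node feature selection: each split draws a fresh subset of features
n_features = 9
k = int(np.sqrt(n_features))  # a common choice for classification
candidate_features = rng.choice(n_features, size=k, replace=False)
print(candidate_features)
```

The rows a tree never sees in its bootstrap sample are the "out-of-bag" samples, which can serve as a built-in validation set.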

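This randomness slightly increases the bias of a random forest (compared with a single, non-randomized tree). However, because the forest averages over many trees, its variance decreases, and this reduction usually more than offsets the increase in bias, so the overall model is generally better.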
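The variance reduction can be made concrete with a standard result. The symbols here are assumptions introduced for illustration: suppose the forest averages $B$ identically distributed trees $T_b(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$. Then:

```latex
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as more trees are added, and random feature selection lowers the correlation $\rho$ between trees, which shrinks the first term.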

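4 Demonstration

4.1 Data Preparation

Click here for the data. The data was originally shared publicly by another blogger, but I did not record the source at the time and can no longer find it. The dataset used here is a modified version of that blogger's data (changed to make the results more realistic). If the original author sees this article, please message me privately and I will update it to credit the source. I would also be grateful to any reader who can point me to it.

4.2 Environment Setup

Python 3 with the scikit-learn library (the code below also uses pandas and joblib)

4.3 Model Construction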

# Imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import pandas as pd
import joblib

import warnings
warnings.filterwarnings('ignore')

# Load the data (reading .xlsx files with pandas requires the openpyxl package)
df = pd.read_excel("./data_b.xlsx")

# Select the features (x) and the target (y)
y_col = 'label'
x_col = [v for v in df.columns if v != y_col]
x = df[x_col]
y = df[y_col]  # 1-D Series, the shape scikit-learn expects for a single target

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=18)

# Build the model; class_weight up-weights the rare positive class (label 1)
rf_clf = RandomForestClassifier(class_weight={0: 1, 1: 36},
                                n_jobs=-1,
                                n_estimators=45,
                                max_depth=3,
                                min_samples_split=5,
                                min_samples_leaf=1,
                                random_state=18)

rf_clf.fit(x_train, y_train)

tr_pred_rfc = rf_clf.predict(x_train)
te_pred_rfc = rf_clf.predict(x_test)

# Inspect the results: confusion matrix and classification report for the training set, then the test set
print(metrics.confusion_matrix(y_train, tr_pred_rfc, labels=[0, 1]))
print(metrics.classification_report(y_train, tr_pred_rfc, digits=4))
print("=+=" * 20)
print(metrics.confusion_matrix(y_test, te_pred_rfc, labels=[0, 1]))
print(metrics.classification_report(y_test, te_pred_rfc, digits=4))


# [[29123   493]
#  [  608    96]]
#               precision    recall  f1-score   support
# 
#            0     0.9795    0.9834    0.9814     29616
#            1     0.1630    0.1364    0.1485       704
# 
#     accuracy                         0.9637     30320
#    macro avg     0.5713    0.5599    0.5650     30320
# weighted avg     0.9606    0.9637    0.9621     30320
# 
# =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
# [[12527   192]
#  [  248    28]]
#               precision    recall  f1-score   support
# 
#            0     0.9806    0.9849    0.9827     12719
#            1     0.1273    0.1014    0.1129       276
# 
#     accuracy                         0.9661     12995
#    macro avg     0.5539    0.5432    0.5478     12995
# weighted avg     0.9625    0.9661    0.9643     12995



# Save and reload the model (the ./model directory must already exist)

joblib.dump(rf_clf, "./model/rf_clf.m")
rf_clf = joblib.load("./model/rf_clf.m")



# Feature importances, sorted from most to least important
result = pd.DataFrame([x_test.columns, rf_clf.feature_importances_], index=['columns', 'importances']).T
result_df = result.sort_values('importances', ascending=False)
result_df

#         columns            importances
# 0        K1K2驱动信号       0.293763
# 3        门禁信号           0.234393
# 1        电子锁驱动信号      0.19334
# 2        急停信号           0.142503
# 4        THDI-M            0.136

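4.4 Model Tuning

Model tuning depends on the choice of parameters, the amount of sample data, and other factors. I will cover it in follow-up articles; for now, this article only shows how to build the model.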
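As a starting point, a parameter search with scikit-learn's `GridSearchCV` might look like the sketch below. Everything here is an assumption for illustration: the data is synthetic and imbalanced (it is not the article's dataset), and the parameter grid, the `f1` scoring, and the 3-fold cross-validation are arbitrary example choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives), for illustration only
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1],
                           random_state=18)

# Hypothetical grid: every combination is evaluated with cross-validation
param_grid = {
    "n_estimators": [20, 45],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=18),
    param_grid,
    scoring="f1",  # F1 of the positive class, more telling than accuracy here
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

On imbalanced data like the dataset above, scoring on F1 or recall of the minority class is generally more informative than accuracy, which stays high even when almost no positives are found.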