[Machine Learning] Hands-On XGBoost Regression

Reference: blog.csdn.net/HHTNAN/arti…

XGBoost provides two main interfaces: the native xgboost interface (developed by Tianqi Chen's team) and the xgboost.sklearn (scikit-learn style) interface.
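As a quick orientation, here is a minimal sketch of the two import styles (the aliases are just the usual conventions, not required names):

# Native (low-level) interface: build DMatrix objects and call xgb.train()
import xgboost as xgb

# sklearn-style interface: estimator classes with fit() / predict()
from xgboost import XGBRegressor, XGBClassifier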

Objective function: objective

"reg:linear": regression, linear regression (renamed to "reg:squarederror" in newer XGBoost versions)
"reg:logistic": regression, logistic regression

"binary:logistic": binary classification with logistic regression, outputs probabilities
"binary:logitraw": binary classification with logistic regression, outputs the raw score w^T x before the logistic transformation

"count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution. In Poisson regression the default of max_delta_step is 0.7 (used to safeguard optimization).

"multi:softmax": multi-class classification, outputs the predicted class label; also requires setting num_class (the number of classes)

"multi:softprob": same as softmax, but outputs a vector of ndata * nclass probabilities, which can be reshaped into an ndata x nclass matrix; each row gives the probability of that sample belonging to each class

"rank:pairwise": sets XGBoost to do a ranking task by minimizing the pairwise loss
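For example, the multi-class objectives need num_class set alongside them. A minimal, self-contained sketch with toy data (the data here is made up purely for illustration):

import numpy as np
import xgboost as xgb

# Toy data: 100 samples, 4 features, 3 classes (illustration only)
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softmax',  # predicts the class label directly
    'num_class': 3,                # required for the multi:* objectives
    'max_depth': 3,
    'eta': 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=20)
pred_labels = booster.predict(dtrain)  # array of shape (100,), values in {0, 1, 2}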

1/ Regression with the native xgboost interface

With the native xgboost interface, the model is trained with the train() function rather than fit().
The parameter that controls the number of weak learners (boosting rounds) is num_boost_round (num_rounds below), not n_estimators.
The remaining training parameters go into a params dict, which is passed to xgb.train() to train the model.

# Import the required packages
import xgboost as xgb

from xgboost import plot_importance

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Raw data read from the file
data = []
labels = []

# Read the file lppz5.csv
# For each line, split on ',' to get a list
# line_split[8] is the label column
# line_split[10:] (column index 10 onward) are the data columns, i.e. the feature columns
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

# Build the sample data x and y
# First convert every value from string to float
x = []
for row in data:
    row = [float(value) for value in row]
    x.append(row)

y = [float(value) for value in labels]

# XGBoost training process
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

# Parameters for training the model
params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',    # gamma regression; the target must be positive
    'gamma': 0.1,                # minimum loss reduction required to make a split
    'max_depth': 5,
    'lambda': 3,                 # L2 regularization on leaf weights
    'subsample': 0.7,            # row subsampling ratio per tree
    'colsample_bytree': 0.7,     # column subsampling ratio per tree
    'min_child_weight': 3,
    'verbosity': 0,              # 0 = silent; replaces the deprecated 'silent' parameter
    'eta': 0.1,                  # learning rate
    'seed': 1000,
    'nthread': 4,
}

# Wrap the training features and labels in a DMatrix
dtrain = xgb.DMatrix(x_train, y_train)

# Number of weak learners, i.e. the number of boosting rounds
num_rounds = 300

# Train the model
# Pass the parameter dict, the training data, and the number of boosting rounds
xgb_model = xgb.train(params,
                      dtrain,
                      num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(x_test)
ans = xgb_model.predict(dtest)

# Plot feature importances
plot_importance(xgb_model)
plt.show()
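To sanity-check the fit, the predictions ans can be compared against y_test. A short sketch using sklearn.metrics (this RMSE check is an addition to the original example):

import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error of the native-interface model on the held-out set
rmse = np.sqrt(mean_squared_error(y_test, ans))
print("Test RMSE: %.4f" % rmse)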

2/ Regression with the sklearn interface

With the sklearn-style interface, you create the estimator with xgboost.XGBRegressor() and then train it with the fit() method.
fit() is the training entry point of the XGBClassifier and XGBRegressor wrappers that xgboost provides on top of its native API.
# Import the required packages
from xgboost.sklearn import XGBRegressor

from xgboost import plot_importance
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Raw data read from the file
data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]

# XGBoost training process
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

xgb_model = XGBRegressor(n_estimators=160,       # number of boosting rounds (trees)
                         learning_rate=0.1,
                         max_depth=5,
                         verbosity=0,             # 0 = silent; replaces the deprecated silent=True
                         objective='reg:gamma')
                         
xgb_model.fit(X_train, y_train) 

# Predict on the test set
ans = xgb_model.predict(X_test)

# Plot feature importances
plot_importance(xgb_model)
plt.show()
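Because the sklearn wrapper follows the scikit-learn estimator protocol, the trained model also exposes the usual conveniences. A short usage note (again an addition to the original example):

# R^2 on the held-out set via the standard scikit-learn score() method
print("Test R^2: %.4f" % xgb_model.score(X_test, y_test))

# Per-feature importance scores as a plain array
# (the same information that plot_importance draws)
print(xgb_model.feature_importances_)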