[Machine Learning] Hands-On XGBoost Regression

Reference: blog.csdn.net/HHTNAN/arti…

XGBoost provides two main interfaces: the native xgboost interface (developed by Tianqi Chen's team) and the xgboost.sklearn (scikit-learn style) interface.
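As a quick orientation, here is a minimal sketch of the two import styles (the aliases are just the usual conventions, not required names):

# Native (low-level) interface: build DMatrix objects and call xgb.train()
import xgboost as xgb

# sklearn-style interface: estimator classes with fit() / predict()
from xgboost import XGBRegressor, XGBClassifier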

Objective function: objective

"reg:linear": regression, linear regression (renamed to "reg:squarederror" in newer XGBoost versions)
"reg:logistic": regression, logistic regression

"binary:logistic": binary classification with logistic regression, outputs probabilities
"binary:logitraw": binary classification with logistic regression, outputs the raw score w^T x before the logistic transformation

"count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution. In Poisson regression the default of max_delta_step is 0.7 (used to safeguard optimization).

"multi:softmax": multi-class classification, outputs the predicted class label; also requires setting num_class (the number of classes)

"multi:softprob": same as softmax, but outputs a vector of ndata * nclass probabilities, which can be reshaped into an ndata x nclass matrix; each row gives the probability of that sample belonging to each class

"rank:pairwise": sets XGBoost to do a ranking task by minimizing the pairwise loss
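For example, the multi-class objectives need num_class set alongside them. A minimal, self-contained sketch with toy data (the data here is made up purely for illustration):

import numpy as np
import xgboost as xgb

# Toy data: 100 samples, 4 features, 3 classes (illustration only)
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softmax',  # predicts the class label directly
    'num_class': 3,                # required for the multi:* objectives
    'max_depth': 3,
    'eta': 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=20)
pred_labels = booster.predict(dtrain)  # array of shape (100,), values in {0, 1, 2}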

1/ Regression with the native xgboost interface

With the native xgboost interface, the model is trained with the train() function rather than fit().
The parameter that controls the number of weak learners (boosting rounds) is num_boost_round (num_rounds below), not n_estimators.
The remaining training parameters go into a params dict, which is passed to xgb.train() to train the model.

# Import the required packages
import xgboost as xgb

from xgboost import plot_importance

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Raw data read from the file
data = []
labels = []

# Read the file lppz5.csv
# For each line, split on ',' to get a list
# line_split[8] is the label column
# line_split[10:] (column index 10 onward) are the data columns, i.e. the feature columns
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

# Build the sample data x and y
# First convert every value from string to float
x = []
for row in data:
    row = [float(value) for value in row]
    x.append(row)

y = [float(value) for value in labels]

# XGBoost training process
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

# Parameters for training the model
params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',    # gamma regression; the target must be positive
    'gamma': 0.1,                # minimum loss reduction required to make a split
    'max_depth': 5,
    'lambda': 3,                 # L2 regularization on leaf weights
    'subsample': 0.7,            # row subsampling ratio per tree
    'colsample_bytree': 0.7,     # column subsampling ratio per tree
    'min_child_weight': 3,
    'verbosity': 0,              # 0 = silent; replaces the deprecated 'silent' parameter
    'eta': 0.1,                  # learning rate
    'seed': 1000,
    'nthread': 4,
}

# Wrap the training features and labels in a DMatrix
dtrain = xgb.DMatrix(x_train, y_train)

# Number of weak learners, i.e. the number of boosting rounds
num_rounds = 300

# Train the model
# Pass the parameter dict, the training data, and the number of boosting rounds
xgb_model = xgb.train(params,
                      dtrain,
                      num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(x_test)
ans = xgb_model.predict(dtest)

# Plot feature importances
plot_importance(xgb_model)
plt.show()
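To sanity-check the fit, the predictions ans can be compared against y_test. A short sketch using sklearn.metrics (this RMSE check is an addition to the original example):

import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error of the native-interface model on the held-out set
rmse = np.sqrt(mean_squared_error(y_test, ans))
print("Test RMSE: %.4f" % rmse)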

2/ Regression with the sklearn interface

With the sklearn-style interface, you create the estimator with xgboost.XGBRegressor() and then train it with the fit() method.
fit() is the training entry point of the XGBClassifier and XGBRegressor wrappers that xgboost provides on top of its native API.
# Import the required packages
from xgboost.sklearn import XGBRegressor

from xgboost import plot_importance
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Raw data read from the file
data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]

# XGBoost training process
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

xgb_model = XGBRegressor(n_estimators=160,       # number of boosting rounds (trees)
                         learning_rate=0.1,
                         max_depth=5,
                         verbosity=0,             # 0 = silent; replaces the deprecated silent=True
                         objective='reg:gamma')
                         
xgb_model.fit(X_train, y_train) 

# Predict on the test set
ans = xgb_model.predict(X_test)

# Plot feature importances
plot_importance(xgb_model)
plt.show()
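Because the sklearn wrapper follows the scikit-learn estimator protocol, the trained model also exposes the usual conveniences. A short usage note (again an addition to the original example):

# R^2 on the held-out set via the standard scikit-learn score() method
print("Test R^2: %.4f" % xgb_model.score(X_test, y_test))

# Per-feature importance scores as a plain array
# (the same information that plot_importance draws)
print(xgb_model.feature_importances_)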