Reference:
blog.csdn.net/HHTNAN/arti…
XGBoost has two main families of interfaces: the native xgboost interface (developed by Tianqi Chen's team) and the xgboost.sklearn interface.
Objective function: objective
"reg:linear" -- regression: linear regression
"reg:logistic" -- regression: logistic regression
"binary:logistic" -- binary classification with logistic regression, outputs a probability
"binary:logitraw" -- binary classification with logistic regression, outputs the raw score w^T x (before the sigmoid is applied)
"count:poisson" -- Poisson regression for count data, outputs the mean of a Poisson distribution; in Poisson regression the default value of max_delta_step is 0.7 (used to safeguard optimization)
"multi:softmax" -- multi-class classification, outputs the predicted class label;
also requires setting the parameter num_class (the number of classes); see the sketch after this list
"multi:softprob" -- same as softmax, but outputs an ndata * nclass vector, which can be reshaped into a matrix with ndata rows and nclass columns; each row holds the probability that the sample belongs to each class
"rank:pairwise" -- sets XGBoost to do a ranking task by minimizing the pairwise loss
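As an illustration of how an objective is wired into the native interface, here is a minimal sketch for a three-class problem; the synthetic data, the class count of 3, and the other hyperparameter values are assumptions made only for this example.

import numpy as np
import xgboost as xgb

# Tiny synthetic 3-class dataset, just to show how the objective is specified.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 3, size=30)      # integer labels 0, 1, 2

params = {
    "objective": "multi:softmax",    # predict the class label directly
    "num_class": 3,                  # required for multi:softmax / multi:softprob
    "max_depth": 3,
    "eta": 0.1,
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=10)
print(booster.predict(dtrain)[:5])   # class labels; use multi:softprob to get probabilities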
1/ Regression with the native xgboost interface
The native interface is imported with import xgboost as xgb.
With the native interface, a model is trained with the train() function rather than the fit() method.
The number of weak learners (boosting rounds) is given by the num_boost_round argument of train(), not by n_estimators.
The remaining parameters of the native interface go into a params dict, which is then passed to xgb.train() to train the model.
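The contrast between the two calling conventions can be summarized in a short sketch; the synthetic data and the reg:squarederror objective here are only for illustration.

import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

X = np.random.rand(50, 5)
y = np.random.rand(50)

# Native interface: params dict + DMatrix + train(), rounds set via num_boost_round.
params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1}
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=100)

# sklearn-style interface: keyword arguments + fit(), rounds set via n_estimators.
reg = XGBRegressor(objective="reg:squarederror", max_depth=3,
                   learning_rate=0.1, n_estimators=100)
reg.fit(X, y)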
import xgboost as xgb
from xgboost import plot_importance
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Read the raw CSV: columns from index 10 onward are features, column 8 is the label.
data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

# Convert the string fields to floats.
x = [[float(value) for value in row] for row in data]
y = [float(value) for value in labels]

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=0)

params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,          # deprecated in newer xgboost; 'verbosity' replaces it
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

# The native interface trains from a DMatrix.
dtrain = xgb.DMatrix(x_train, label=y_train)
num_rounds = 300
xgb_model = xgb.train(params, dtrain, num_rounds)

# Predict on the test set (it must be wrapped in a DMatrix as well).
dtest = xgb.DMatrix(x_test)
ans = xgb_model.predict(dtest)

# Plot feature importances of the trained booster.
plot_importance(xgb_model)
plt.show()
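To check how good the fit is, the predictions in ans can be compared against y_test, for example with mean squared error from scikit-learn; the choice of metric here is an assumption for illustration, continuing from the code above.

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, ans)   # ans holds the booster's predictions on x_test
print("test MSE: %.4f" % mse)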
2/ Regression with the sklearn interface
With the sklearn-style interface, you first create a model object with xgboost.XGBRegressor(), and then train it by calling its fit() method.
fit() is provided by the XGBClassifier and XGBRegressor wrappers that xgboost ships with.
from xgboost.sklearn import XGBRegressor
from xgboost import plot_importance
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Read the raw CSV: columns from index 10 onward are features, column 8 is the label.
data = []
labels = []
with open("lppz5.csv", encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

# Convert the string fields to floats.
X = [[float(value) for value in row] for row in data]
y = [float(value) for value in labels]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

# sklearn-style estimator: hyperparameters are keyword arguments,
# and the number of boosting rounds is n_estimators.
xgb_model = XGBRegressor(n_estimators=160,
                         learning_rate=0.1,
                         max_depth=5,
                         silent=True,          # deprecated in newer xgboost
                         objective='reg:gamma')
xgb_model.fit(X_train, y_train)

# Predict on the test set; no DMatrix is needed with the sklearn interface.
ans = xgb_model.predict(X_test)

# Plot feature importances of the fitted model.
plot_importance(xgb_model)
plt.show()
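Because XGBRegressor follows the scikit-learn estimator API, it also drops straight into scikit-learn utilities such as cross-validation. A minimal sketch with cross_val_score, reusing the X and y prepared above; the 5-fold setup and the scoring metric are assumptions for illustration.

from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBRegressor

# 5-fold cross-validation, scored with negative mean squared error.
reg = XGBRegressor(n_estimators=160, learning_rate=0.1,
                   max_depth=5, objective='reg:gamma')
scores = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')
print("CV MSE: %.4f" % (-scores.mean()))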