AAAMLP-Chapter-8: Hyperparameter Optimization

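The larger the model, the harder it is to optimize its hyperparameters.

What is hyperparameter optimization?

Suppose a machine learning project follows a simple workflow: take a dataset, feed it to a model, and get predictions. The settings that control this workflow are called hyperparameters: they govern how the model is trained and used, and unlike the model's parameters they are not learned from the data.

If we train a linear regression model with SGD, the model's parameters are the coefficients and the bias, and the learning rate is a hyperparameter.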
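To make the distinction concrete, here is a minimal sketch using scikit-learn's SGDRegressor. The synthetic data and the specific values (eta0=0.01, max_iter=1000) are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=500)

# Hyperparameters: chosen by us before training starts
model = SGDRegressor(
    learning_rate="constant",
    eta0=0.01,  # the learning rate
    max_iter=1000,
    random_state=0,
)
model.fit(X, y)

# Parameters: learned from the data during training
print("coefficients:", model.coef_)
print("bias:", model.intercept_)
```

Changing eta0 changes how training proceeds, while coef_ and intercept_ are whatever training ends up with.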

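Suppose a model has three hyperparameters a, b and c, each an integer between 0 and 10. A "correct" combination gives the best model performance. This resembles a three-digit combination lock, except that a lock has exactly one right answer, whereas several hyperparameter combinations may be equally good.

So how do we find the best combination?

A simple idea is to evaluate the model's metric for every possible combination.

This is very time-consuming, and on real problems it is not feasible at all: hyperparameters are often real-valued, so the number of combinations is infinite.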
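For the toy case of three integer hyperparameters, exhaustive search can be sketched in a few lines. The score function here is a hypothetical stand-in for "train a model and measure it"; it is not from the original text.

```python
from itertools import product

def score(a, b, c):
    # Hypothetical stand-in for training and evaluating a model;
    # by construction it peaks at (a, b, c) = (3, 7, 5).
    return -((a - 3) ** 2 + (b - 7) ** 2 + (c - 5) ** 2)

values = range(0, 11)  # the integers 0..10 inclusive
combos = list(product(values, repeat=3))  # every (a, b, c) triple

# Evaluate every combination and keep the best one
best = max(combos, key=lambda abc: score(*abc))
print(len(combos), best)
```

Even this tiny space already has 11 × 11 × 11 = 1331 combinations, each of which would mean a full training run in a real project.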

Let's look at how the random forest model is defined in scikit-learn.

RandomForestClassifier(
    n_estimators=100,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='auto',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    min_impurity_split=None,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
    ccp_alpha=0.0,
    max_samples=None,
)

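The listing shows 19 arguments, and the number of possible combinations of their values is infinite, so we cannot enumerate them all. (This signature comes from an older scikit-learn release; newer releases have removed min_impurity_split and the 'auto' value of max_features.)

Instead, we define a grid of candidate values and search over the combinations on that grid. This method is called grid search.

Suppose n_estimators can be one of [100, 200, 250, 300, 400, 500], max_depth one of [1, 2, 5, 7, 11, 15], and criterion one of ['gini', 'entropy']. That does not look like many combinations, but evaluating all of them still takes a lot of computation.

We can run grid search over these values and record each combination's score on the validation set. With k-fold cross-validation there is an extra loop, because every combination is trained and scored k times, which requires even more computation time.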
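To get a feel for the cost, the number of model fits can be counted before running anything. This sketch uses scikit-learn's ParameterGrid with the grid from the grid search listing in this section and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 250, 300, 400, 500],
    "max_depth": [1, 2, 5, 7, 11, 15],
    "criterion": ["gini", "entropy"],
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_combinations = len(ParameterGrid(param_grid))
n_folds = 5  # assumed number of cross-validation folds
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)
```

That is 6 × 6 × 2 = 72 combinations, or 360 random forest fits with 5 folds, not counting the final refit on the full data.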

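For this reason, grid search is not very popular.

Below, grid search is demonstrated on the task of predicting the price range of a mobile phone.

Dataset: Mobile Price Classification | Kaggle

The code is as follows.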

import numpy as np
import pandas as pd
from sklearn import ensemble, metrics, model_selection

df = pd.read_csv('mobile-phone-price/train.csv')
x = df.drop('price_range', axis=1).values
y = df.price_range.values

classifier = ensemble.RandomForestClassifier(n_jobs=-1)
param_grid = {
    'n_estimators': [100, 200, 250, 300, 400, 500],
    'max_depth': [1, 2, 5, 7, 11, 15],
    'criterion': ['gini', 'entropy']
}
model = model_selection.GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring='accuracy',
    verbose=10,
    n_jobs=1,
    cv=5
)
model.fit(x, y)

print(f'Best Score: {model.best_score_}')
print('Best Param Set:')
best_params = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'{param_name}: {best_params[param_name]}')

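After the search finishes, we have the best parameters on this grid.

The next search method is random search.

It randomly picks hyperparameter combinations and computes their cross-validation scores. Because it does not enumerate every combination, it takes less time than grid search.

You normally specify how many combinations to evaluate, and that number determines how long the search takes.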
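The sampling step can be looked at in isolation. The sketch below uses scikit-learn's ParameterSampler to draw 20 candidates; the value ranges match the random search listing in this section, and random_state=42 is an arbitrary choice made here for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "n_estimators": np.arange(100, 1500, 100),  # 100, 200, ..., 1400
    "max_depth": np.arange(1, 31),              # 1, 2, ..., 30
    "criterion": ["gini", "entropy"],
}

# Draw 20 random combinations instead of enumerating all of them
sampler = ParameterSampler(param_distributions, n_iter=20, random_state=42)
candidates = list(sampler)

total = 14 * 30 * 2  # size of the full grid over the same values
print(len(candidates), "of", total)
```

Only 20 of the 840 possible combinations are evaluated, so n_iter directly sets the computational budget.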

The code is as follows.

import numpy as np
import pandas as pd
from sklearn import ensemble, metrics, model_selection

df = pd.read_csv('mobile-phone-price/train.csv')
x = df.drop('price_range', axis=1).values
y = df.price_range.values

classifier = ensemble.RandomForestClassifier(n_jobs=-1)
param_grid = {
    'n_estimators': np.arange(100, 1500, 100),
    'max_depth': np.arange(1, 31),
    'criterion': ['gini', 'entropy']
}
model = model_selection.RandomizedSearchCV(
    estimator=classifier,
    param_distributions=param_grid,
    n_iter=20,
    scoring='accuracy',
    verbose=10,
    n_jobs=1,
    cv=5
)
model.fit(x, y)

print(f'Best Score: {model.best_score_}')
print('Best Param Set:')
best_params = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'{param_name}: {best_params[param_name]}')

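Compared with grid search, random search can cover a wider range of hyperparameter values with fewer iterations, and it often finds parameters that perform as well or better.

With these two search methods, you can look for the best hyperparameters, at least within the space you searched, for any model.

Sometimes you need to tune a whole pipeline. For example, consider a multiclass classification task in which the training data consists of two text columns, and we need to build a model that predicts the target class.

Suppose the pipeline works like this: first compute tf-idf features from the text, then reduce them with SVD, and finally train an SVM classifier on the resulting components.

The question now is how to select the hyperparameters of the SVD and of the SVM at the same time.

The implementation is shown below (the data source for this code is not provided, so it is only useful for following the logic).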

import numpy as np
import pandas as pd
from sklearn import metrics, model_selection, pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def quadratic_weighted_kappa(y_true, y_pred):
    return metrics.cohen_kappa_score(y_true, y_pred, weights='quadratic')

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

idx = test.id.values.astype(int)
train = train.drop('id', axis=1)
test = test.drop('id', axis=1)

y = train.relevance.values

traindata = list(
    train.apply(lambda x:'%s %s' % (x['text1'], x['text2']),axis=1)
)
testdata = list(
    test.apply(lambda x:'%s %s' % (x['text1'], x['text2']),axis=1)
)
tfv = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 3),
    # booleans rather than 1: recent scikit-learn validates parameter types
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,
    stop_words='english'
)
tfv.fit(traindata)
X = tfv.transform(traindata) 
X_test = tfv.transform(testdata)

svd = TruncatedSVD()
scl = StandardScaler()
svm_model = SVC()

clf = pipeline.Pipeline([
    ('svd', svd),
    ('scl', scl),
    ('svm', svm_model)
])
param_grid = {
    'svd__n_components' : [200, 300],
    'svm__C': [10, 12]
}
kappa_scorer = metrics.make_scorer(
    quadratic_weighted_kappa, 
    greater_is_better=True
)
model = model_selection.GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring=kappa_scorer,
    verbose=10,
    n_jobs=-1,
    refit=True,
    cv=5
)
# fit on the tf-idf matrix X built above (lowercase x is not defined in this script)
model.fit(X, y)

print(f'Best Score: {model.best_score_}')
print('Best Param Set:')
best_params = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'{param_name}: {best_params[param_name]}')
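The keys svd__n_components and svm__C in the grid above follow scikit-learn's naming scheme for pipeline parameters: the step name, two underscores, then the parameter name. A small sketch with the same step names shows how to list and set them without any data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = Pipeline([
    ("svd", TruncatedSVD()),
    ("scl", StandardScaler()),
    ("svm", SVC()),
])

# Every step parameter is exposed as "<step name>__<parameter name>"
tunable = [name for name in clf.get_params() if "__" in name]
print("svd__n_components" in tunable, "svm__C" in tunable)

# set_params uses the same naming scheme that the grid keys rely on
clf.set_params(svd__n_components=200, svm__C=10)
print(clf.named_steps["svd"].n_components, clf.named_steps["svm"].C)
```

Printing tunable in full is a quick way to discover which names are valid keys for a parameter grid.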

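Next come a few more advanced hyperparameter optimization techniques. First is minimization of functions, which finds the best hyperparameters with a minimization algorithm such as the downhill simplex algorithm (also known as Nelder-Mead optimization) or Bayesian optimization with a Gaussian process.

Let's see how a Gaussian process is used for hyperparameter optimization. Methods of this kind need an objective function to optimize, so the setup looks a lot like minimizing a loss.

Suppose our objective is accuracy. Higher accuracy is better, so we cannot minimize it directly, but we can multiply it by -1 so that smaller values are better.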
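The sign flip can be tried on a toy problem before involving any real model. This sketch uses SciPy's Nelder-Mead minimizer; toy_accuracy is a hypothetical smooth curve invented for illustration, not a real model score.

```python
from scipy.optimize import minimize

def toy_accuracy(params):
    # Hypothetical "accuracy" curve with its peak (0.9) at x = 3
    x = params[0]
    return 0.9 - 0.01 * (x - 3.0) ** 2

def objective(params):
    # Minimizers look for small values, so return the negated accuracy
    return -toy_accuracy(params)

result = minimize(objective, x0=[0.0], method="Nelder-Mead")

best_x = result.x[0]
best_accuracy = -result.fun  # flip the sign back to read it as accuracy
print(round(best_x, 2), round(best_accuracy, 3))
```

The same pattern applies to the Gaussian process and TPE examples: the optimizer sees a loss, and the sign is flipped back when reporting the result.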

The gp_minimize function provided by scikit-optimize (skopt) performs Bayesian optimization with a Gaussian process.

import numpy as np
import pandas as pd
from sklearn import metrics, ensemble, model_selection
from functools import partial
from skopt import gp_minimize, space

# Workaround: some scikit-optimize releases still use np.int, which NumPy 1.24 removed
np.int = np.int64

def optimize(params, param_names, x, y):
    params = dict(zip(param_names, params))
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.StratifiedKFold(n_splits=5)
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain = x[train_idx]
        ytrain = y[train_idx]
        xtest = x[test_idx]
        ytest = y[test_idx]
        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        accuracies.append(
            metrics.accuracy_score(ytest, preds)
        )
    return np.mean(accuracies) * -1

df = pd.read_csv('mobile-phone-price/train.csv')
x = df.drop("price_range", axis=1).values
y = df.price_range.values

param_space = [
    space.Integer(3, 15, name="max_depth"),
    space.Integer(100, 1500, name="n_estimators"),
    space.Categorical(["gini", "entropy"], name="criterion"),
    space.Real(0.01, 1, prior="uniform", name="max_features")
]

param_names = [
    "max_depth",
    "n_estimators",
    "criterion",
    "max_features"
]

# Wrap optimize with functools.partial, fixing its last three arguments
# (param_names, x, y); this is similar to currying
optimization_function = partial(
    optimize,
    param_names=param_names,
    x=x,
    y=y
)
result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    n_calls=15,
    n_random_starts=10,
    verbose=10
)
best_params = dict(zip(param_names, result.x))
print(best_params)

Next, we can plot how the optimization converges.

from skopt.plots import plot_convergence
plot_convergence(result)

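Many libraries can be used for hyperparameter optimization; scikit-optimize, used here, is only one of them (note: the library is no longer maintained 😑).

Another useful library is hyperopt, which uses the Tree-structured Parzen Estimator (TPE) to find the best hyperparameters.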

import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble, metrics, model_selection
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

def optimize(params, x, y):
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.StratifiedKFold(n_splits=5)
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain = x[train_idx]
        ytrain = y[train_idx]
        xtest = x[test_idx]
        ytest = y[test_idx]
        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        accuracies.append(
            metrics.accuracy_score(ytest, preds)
        )
    return np.mean(accuracies) * -1

df = pd.read_csv('mobile-phone-price/train.csv')
x = df.drop("price_range", axis=1).values
y = df.price_range.values

param_space = {
    'max_depth': scope.int(
        hp.quniform('max_depth', 1, 15, 1)
    ),
    'n_estimators': scope.int(
        hp.quniform('n_estimators', 100, 1500, 1)
    ),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    # lower bound 0.01 rather than 0: max_features must be greater than 0
    'max_features': hp.uniform('max_features', 0.01, 1)
}
optimization_function = partial(
    optimize,
    x=x,
    y=y
)
trials = Trials()
hopt = fmin(
    fn=optimization_function,
    space=param_space,
    algo=tpe.suggest,
    max_evals=15,
    trials=trials
)
print(hopt)

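Using TPE is much like using GP: in both cases you define a parameter space and an objective function.

Running the code above shows the best loss (negative accuracy) reached during tuning and prints the corresponding set of hyperparameters.

In the result, criterion has the value 1, which means entropy was selected: hp.choice reports the index of the chosen option rather than the option itself.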
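Mapping the raw result back to usable values can be sketched in plain Python. The hopt dictionary below is a hypothetical example of the shape fmin returns for this search space, not the output of a real run; hyperopt also ships a space_eval helper for the same purpose.

```python
# Hypothetical result (not a real run): choice parameters come back as an
# index, and quniform parameters come back as floats
hopt = {
    "criterion": 1,
    "max_depth": 9.0,
    "max_features": 0.5,
    "n_estimators": 500.0,
}

# Same options, in the same order, as passed to hp.choice
criterion_choices = ["gini", "entropy"]

decoded = {
    "criterion": criterion_choices[hopt["criterion"]],  # index -> option
    "max_depth": int(hopt["max_depth"]),                # float -> int
    "n_estimators": int(hopt["n_estimators"]),          # float -> int
    "max_features": hopt["max_features"],
}
print(decoded)
```

The decoded dictionary can then be passed to RandomForestClassifier(**decoded) to train the final model.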

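The tuning methods above are the most common ones and can be applied to almost any model: linear regression, logistic regression, tree-based models, gradient boosting models, and neural networks.

Even though these libraries exist, beginners should start learning the concept by tuning hyperparameters by hand. For example, in a gradient boosting model, increase or decrease the depth or change the learning rate, and watch how the score responds. If you jump straight to automated tools, you will not learn these details.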
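A hand-tuning session can be as simple as a loop over a few values of one hyperparameter. The sketch below varies max_depth for scikit-learn's GradientBoostingClassifier; the synthetic dataset and the other settings (30 trees, 3 folds) are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic dataset, used only so the example runs quickly
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = {}
for max_depth in [1, 3, 5]:
    model = GradientBoostingClassifier(
        n_estimators=30,
        learning_rate=0.1,
        max_depth=max_depth,  # the one hyperparameter being varied
        random_state=0,
    )
    # Mean accuracy over 3 cross-validation folds
    scores[max_depth] = cross_val_score(
        model, X, y, cv=3, scoring="accuracy"
    ).mean()
    print(f"max_depth={max_depth}: accuracy={scores[max_depth]:.3f}")
```

Repeating the loop with a different learning_rate, one change at a time, shows how the two hyperparameters interact.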

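The table at the end of this section is a recommended starting point for learning hyperparameter tuning; RS* marks parameters for which random search is the better choice.

Once you can improve a model by tuning its hyperparameters by hand, you may not even need automated tuning tools, because you will know which parameters matter for which models.

When you build a complex model that uses a large number of features, it is prone to overfitting the training data. To prevent overfitting, you can add noise to the training data or penalize the loss function. This penalty is called regularization, and it helps the model generalize.

In linear models, the most common types of regularization are L1 and L2. Linear regression with an L1 penalty is called Lasso regression, and with an L2 penalty it is called Ridge regression.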
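The practical difference between the two penalties shows up in the fitted coefficients. This sketch compares scikit-learn's Lasso and Ridge on synthetic data where only 3 of 20 features are informative; the dataset and alpha=5.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 20 features, only 3 carry signal
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=10.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)  # L1 penalty on the coefficients
ridge = Ridge(alpha=5.0).fit(X, y)  # L2 penalty on the coefficients

# Count coefficients that were driven exactly to zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

The L1 penalty tends to set uninformative coefficients exactly to zero, while the L2 penalty only shrinks them, which is why alpha is the hyperparameter to tune for both models.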

In neural networks, dropout is used to add noise to the model.
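As a rough illustration of what dropout does to a layer's activations, here is a NumPy sketch of the common "inverted dropout" formulation. It is a standalone toy rather than any framework's implementation, and the rate of 0.5 is an arbitrary choice.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Inverted dropout: zero each unit with probability `rate`, then
    # rescale the survivors so the expected value stays unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 250))  # toy activations, all equal to 1

dropped = dropout(activations, rate=0.5, rng=rng)
zero_fraction = float(np.mean(dropped == 0))
print(dropped.shape, zero_fraction)
```

The dropout rate is itself a hyperparameter and can be tuned with any of the search methods in this chapter.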

Model	Hyperparameter	Range of values
Linear Regression	fit_intercept	T/F
Linear Regression	normalize	T/F
Ridge	alpha	0.01,0.1,1,10,100
Ridge	fit_intercept	T/F
Ridge	normalize	T/F
KNN	n_neighbors	2,4,8,16,...
KNN	p	2,3
SVM	C	0.001,0.01,..,1000
SVM	gamma	'auto', RS*
SVM	class_weight	'balanced', None
Logistic Regression	penalty	l1/l2
Logistic Regression	C	0.001,0.01,...,100
Lasso	alpha	0.1,1,10
Lasso	normalize	T/F
Random Forest	n_estimators	120,300,..,1200
Random Forest	max_depth	5,8,15,25,30
Random Forest	max_features	log2,sqrt,None
Random Forest	min_samples_split	1,2,5,10,15,100
Random Forest	min_samples_leaf	1,2,5,10
XGBoost	eta	0.01,0.015..
XGBoost	gamma	0.05,...,1
XGBoost	max_depth	3,..,25
XGBoost	min_child_weight	1,3,5,7
XGBoost	subsample	0.6,..,1
XGBoost	colsample_bytree	0.6,..,1
XGBoost	lambda	0.01,..,1,RS*
XGBoost	alpha	0,..,1,RS*