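In this lecture, the instructor introduced three automated hyperparameter tuning methods that can be used with scikit-learn: grid search, random search, and the more advanced Bayesian optimization. The dataset is again the IMDB movie reviews.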

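Manual vs. automated hyperparameter tuning

Manual hyperparameter optimization

Pros: with practice, you develop intuition for certain hyperparameters. For example, as covered earlier, when a model overfits badly you can reduce max_depth (decision tree) or C (logistic regression).
Cons: it is inefficient, and in complex settings intuition may not work as well as a data-driven approach.

Automated hyperparameter optimization

Pros:
- Less human effort
- Less error-prone and more reproducible
- Data-driven approaches may be effective
Cons:
- It may be hard to incorporate intuition
- Beware of overfitting the validation set (my understanding: if cross-validation is used too heavily, the model selected as best may only be the best on the cross-validation data)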
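
The last point can be made concrete with a small simulation. The sketch below is a hypothetical setup, not from the lecture: every "candidate model" is a coin flipper with a true accuracy of 0.5, and we simply keep the one with the highest validation score. The numbers of candidates and examples are arbitrary choices for illustration.

```python
import random

def accuracy_of_random_guesser(rng, n_examples):
    """Accuracy of a model that guesses labels by coin flip (true accuracy 0.5)."""
    correct = sum(rng.random() < 0.5 for _ in range(n_examples))
    return correct / n_examples

rng = random.Random(0)
n_candidates, n_val, n_test = 200, 50, 10_000

# Score every candidate on a small validation set and keep the best score
val_scores = [accuracy_of_random_guesser(rng, n_val) for _ in range(n_candidates)]
best_val = max(val_scores)

# The "winning" candidate is still a coin flipper when scored on fresh data
test_score = accuracy_of_random_guesser(rng, n_test)

print("best validation accuracy:", best_val)
print("accuracy on fresh data:  ", test_score)
```

The best validation score looks clearly better than chance even though no candidate is, which is why a score used for selection should not also be reported as the final estimate.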

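Three `automated` tuning methods

scikit-learn itself ships two automated tuning methods: exhaustive grid search, sklearn.model_selection.GridSearchCV, and randomized search, sklearn.model_selection.RandomizedSearchCV. The CV suffix means that cross-validation is built into these methods.

1. GridSearchCV

In grid search, the user specifies a set of values for each hyperparameter, and scikit-learn tries every combination one by one.

Usage is shown below. To avoid breaking the golden rule, we need a pipeline: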

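from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Parameters fixed here do not take part in the search
countvec = CountVectorizer(binary=True) # do not set min_df here, it will be optimized
lr = LogisticRegression(max_iter=1000)  # do not set C here, it will be optimized

pipe = Pipeline([
    ('countvec', countvec),
    ('lr', lr)])


# Define the parameters to search over. Because we use a pipeline, param_grid
# needs names like lr__C (double underscore), meaning the C parameter of the lr step
# Later lectures use deeper nesting, e.g., preprocessor__numeric__imputer__strategy
# Note: an integer min_df of 0 behaves like the default of 1; recent scikit-learn
# releases may reject the integer 0, in which case use 1 instead
param_grid = {
    "countvec__min_df" : [0, 10, 100],
    "lr__C" : [0.01, 1, 10, 100]
}

# Pass the pipeline to GridSearchCV
# n_jobs=-1 uses all available CPU cores to run the fits in parallel
grid_search = GridSearchCV(pipe, param_grid, verbose=2, n_jobs=-1)

# Run the search. Because of the pipeline, we fit on the raw text X_train_imdb_raw
grid_search.fit(X_train_imdb_raw, y_train_imdb);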
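
If you are unsure which double-underscore names a pipeline accepts, you can ask the pipeline itself. This is a minimal sketch that rebuilds the same two-step pipeline; it only relies on the standard get_params and set_params methods that every scikit-learn estimator has.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("countvec", CountVectorizer(binary=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])

# get_params() lists every parameter name; nested ones contain "__"
nested_names = sorted(name for name in pipe.get_params() if "__" in name)
print([n for n in nested_names if n in ("countvec__min_df", "lr__C")])

# set_params() accepts the same names; the search classes use this mechanism
# to configure each candidate before fitting it
pipe.set_params(lr__C=10, countvec__min_df=10)
print(pipe.named_steps["lr"].C, pipe.named_steps["countvec"].min_df)
```

Any name printed by get_params() can be used as a key in param_grid.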

[Screenshot: output of grid_search.fit]

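After calling fit, the output reports 12 candidates. This is because we gave 3 values for min_df and 4 values for C, so there are 3*4=12 combinations. Each combination is cross-validated (5-fold by default), which gives 60 fits.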
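
The counting can be reproduced with the standard library. This is only an illustration of the arithmetic, not how scikit-learn enumerates candidates internally; n_folds is set by hand to the default of 5.

```python
from itertools import product

param_grid = {
    "countvec__min_df": [0, 10, 100],
    "lr__C": [0.01, 1, 10, 100],
}
n_folds = 5  # the default number of cross-validation folds

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Adding a third parameter with 5 values would multiply both numbers by 5, which is how grid search gets expensive quickly.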

After fitting, we can use the best_params_, best_score_, and best_estimator_ attributes, for example:

grid_search.best_params_

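Returns:

{'countvec__min_df': 0, 'lr__C': 1} # the best hyperparameter combination found by the grid search

# These values match the defaults: C defaults to 1.0, and min_df=0 behaves the same as
# the default min_df=1 (every vocabulary term occurs in at least one document)
# This is not surprising, since the scikit-learn authors put a lot of work into choosing defaults

# We can also call predict on unseen data
grid_search.best_estimator_.predict(X_test_imdb_raw)
# As a shortcut, .predict() and .score() can be called directly on the GridSearchCV object
# without typing .best_estimator_ each time
grid_search.predict(X_test_imdb_raw)

# Note that once the best combination is found, by default the model is refit on the
# whole training set with those hyperparameters, so the final model is trained on
# more data. This is controlled by refit=True in GridSearchCV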
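Grid search is easy to understand, but the number of combinations grows quickly with the number of parameters and values, so it becomes slow or even infeasible. scikit-learn therefore provides a second method, RandomizedSearchCV.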

2. RandomizedSearchCV

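In random search, the user gives each hyperparameter either a list of candidate values or a probability distribution to sample from. scikit-learn then samples combinations at random and stops after the number of iterations the user specified (n_iter).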

Research suggests that random search is often a better idea than grid search. A figure from the following paper helps build intuition:

Source: Bergstra and Bengio, Random Search for Hyper-Parameter Optimization, JMLR 2012.
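
The intuition can be sketched in a few lines: with the same budget of 9 trials, a 3 x 3 grid only ever tries 3 distinct values of each parameter, while random sampling tries 9. If only one of the two parameters really matters, the grid has effectively spent 9 trials to test 3 values. The parameter ranges here are arbitrary placeholders.

```python
import random

rng = random.Random(0)
n_trials = 9

# 3 x 3 grid: each parameter only ever takes 3 distinct values
grid_values = [0.0, 0.5, 1.0]
grid_trials = [(a, b) for a in grid_values for b in grid_values]

# 9 random trials: each parameter takes a fresh value on every trial
random_trials = [(rng.random(), rng.random()) for _ in range(n_trials)]

# Count how many different values of the first parameter were explored
distinct_grid = len({a for a, _ in grid_trials})
distinct_random = len({a for a, _ in random_trials})
print(distinct_grid, distinct_random)
```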

The code is as follows:


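import numpy as np

# We can provide a sequence of candidate values; on each iteration scikit-learn
# samples one value at random from each sequence
param_choices = {
    "countvec__min_df" : [0, 10, 100],
    "lr__C" : [0.01, 1, 10, 100]
}

# Because random search does not try every combination as grid search does,
# we can afford to provide many more values
param_choices = {
    "countvec__min_df" : np.arange(0,100),
    "lr__C" : 2.0**np.arange(-5,5)
}
# This kind of range for C is common. If C took the values [1,2,3,...100], values such as
# C=1,2,3 would be too close together to matter
# For a parameter like C we care about the order of magnitude, e.g. C = [0.01, 0.1, 1, 10, 100],
# which can be written as 10^n with n=-2, -1, 0, 1, 2
# "lr__C" : 2.0**np.arange(-5,5) above follows the same idea with powers of 2


# We can also provide probability distributions to sample from
import scipy.stats

param_choices = {
    "countvec__min_df" : scipy.stats.randint(low=0, high=300),
    # C must be positive and is best sampled on a log scale, so a continuous
    # log-uniform distribution replaces randint here (these bounds are one possible choice)
    "lr__C" : scipy.stats.loguniform(1e-2, 1e2)
}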
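
To see why the scale matters for a parameter like C, compare uniform and log-uniform sampling over the same interval. This sketch uses only the standard library and an interval of [0.01, 100] picked for illustration; scipy.stats.loguniform provides a ready-made distribution of the second kind.

```python
import math
import random

rng = random.Random(0)
n = 10_000
low, high = 1e-2, 1e2

# Uniform sampling: most draws land in the top decade, small values are rare
uniform_draws = [rng.uniform(low, high) for _ in range(n)]
# Log-uniform sampling: draw the exponent uniformly, then exponentiate
log_draws = [10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(n)]

def fraction_below_one(draws):
    """Share of sampled values that are smaller than 1."""
    return sum(d < 1 for d in draws) / len(draws)

print("uniform:    ", fraction_below_one(uniform_draws))
print("log-uniform:", fraction_below_one(log_draws))
```

Under uniform sampling only about 1% of the trials explore C < 1, whereas log-uniform sampling spends about half of them there.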

We continue with the sequence approach:

param_choices = {
    "countvec__min_df" : np.arange(0,100),
    "lr__C" : 2.0**np.arange(-5,5)
}

# The extra n_iter parameter: here 12 combinations are sampled, so cross-validation runs 12*5=60 fits
random_search = RandomizedSearchCV(pipe, param_choices,
                                   n_iter = 12, 
                                   verbose = 1,
                                   n_jobs = -1,
                                   random_state = 123)
                                   
                                   
random_search.fit(X_train_imdb_raw, y_train_imdb)


random_search.best_params_ # {'lr__C': 0.0625, 'countvec__min_df': 13}


random_search.best_score_ # 0.8605333333333333, Mean cross-validated score of the best_estimator


random_search.score(X_test_imdb_raw, y_test_imdb) # 0.8544

We can also inspect the .cv_results_ attribute:

[Screenshot: contents of random_search.cv_results_]
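
For readers without the screenshot, here is a rough sketch of what .cv_results_ holds. It runs a small RandomizedSearchCV on hypothetical toy data (not the IMDB set), with a log-uniform distribution for C and cv=3 chosen only to fit the tiny corpus.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to make the example runnable
docs = ["good movie", "great movie", "good film", "great film",
        "good great acting", "great good story",
        "bad movie", "awful movie", "bad film", "awful film",
        "bad awful acting", "awful bad story"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([("countvec", CountVectorizer(binary=True)),
                 ("lr", LogisticRegression(max_iter=1000))])
search = RandomizedSearchCV(pipe, {"lr__C": loguniform(1e-2, 1e2)},
                            n_iter=5, cv=3, random_state=123)
search.fit(docs, labels)

# cv_results_ is a dict of equal-length arrays, one entry per sampled candidate
results = search.cv_results_
rows = zip(results["rank_test_score"], results["params"], results["mean_test_score"])
for rank, params, mean_score in sorted(rows, key=lambda row: row[0]):
    print(rank, params, round(float(mean_score), 3))
```

Because cv_results_ is a plain dict of arrays, it is commonly wrapped in pandas.DataFrame(search.cv_results_) to view it as a table.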

3. Bayesian optimization

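In both grid search and random search, every trial is independent. But if one trial shows that some parameter value performs poorly, that information can be used to guide the remaining trials. This is the idea behind Bayesian hyperparameter optimization.

The obvious cost is that the search is harder to parallelize, because each trial depends on the previous ones.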
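
To make the idea of "using earlier trials to pick the next one" concrete, here is a rough sketch of a sequential model-based loop. It is not scikit-optimize's implementation: it assumes a Gaussian process surrogate from scikit-learn, an expected-improvement rule, and a made-up one-dimensional objective standing in for a cross-validation score as a function of log10(C).

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_c):
    """Hypothetical stand-in for a CV score; its peak is at log_c = 0.5."""
    return -(log_c - 0.5) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(-3, 3, 121).reshape(-1, 1)  # values we may try next

# Start from a few random trials
X = rng.uniform(-3, 3, size=(3, 1))
y = np.array([objective(x[0]) for x in X])

for _ in range(10):
    # 1. Fit a surrogate model to all trials seen so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # 2. Expected improvement over the best score so far (maximization)
    improvement = mu - y.max()
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Run the most promising candidate and record its score
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best log10(C) found:", X[np.argmax(y), 0], "score:", y.max())
```

Each pass through the loop needs the scores of all earlier trials before it can choose the next point, which is exactly why this kind of search is hard to run in parallel.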

We can do this with scikit-optimize, which is a completely different package from scikit-learn.
It uses a technique called "model-based optimization", and we'll specifically use "Bayesian optimization".
- In short, it uses machine learning to predict which hyperparameters will be good.
- Machine learning on machine learning!

The instructor used scikit-optimize (independent of scikit-learn) in the 2020 course. Optuna has developed quickly in recent years and is also worth learning.

Installation:

pip install scikit-optimize

or

conda install -c conda-forge scikit-optimize

BayesSearchCV uses the same interface as GridSearchCV and RandomizedSearchCV.
However, the way we specify the parameter distributions is slightly different.
Here, we can just give the bounds as tuples.


from skopt import BayesSearchCV

bayes_opt = BayesSearchCV(
    pipe,
    {
        'countvec__min_df': (0, 300),   # This gets interpreted as a range
        'lr__C': (0.25, 0.5, 1, 2, 4, 8, 16, 32) # This gets interpreted as a list.
    },
    n_iter=10,
    cv=3,
    random_state=123,
    verbose=0,
    refit=True
)

# Run the search; this call is needed before best_params_ and score are used below
bayes_opt.fit(X_train_imdb_raw, y_train_imdb)

Fitting is usually slow:

[Screenshot: timing of the BayesSearchCV fit]

bayes_opt.best_params_
# Output:
OrderedDict([('countvec__min_df', 8), ('lr__C', 32.0)])


bayes_opt.best_score_  # 0.844

# In theory, the score should improve as n_iter increases (the optimizer has more data to learn from).
# Test-set scores compared with the other two methods:

bayes_opt.score(X_test_imdb_raw, y_test_imdb) # 0.84

random_search.score(X_test_imdb_raw, y_test_imdb) # 0.8544

grid_search.score(X_test_imdb_raw, y_test_imdb) # 0.8556

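In this example, the Bayesian approach did not do better on the test set. In theory, though, the more trials it runs (the n_iter parameter), the more it can learn from past trials.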

Disadvantage: requires installation.
Disadvantage: when number of trials is large (e.g. hundreds), the meta-ML can actually get too slow.
Disadvantage: harder to parallelize the search because each trial depends on the previous ones.
- Note n_jobs parameter for GridSearchCV and RandomizedSearchCV.
- BayesSearchCV also has this parameter.
- It can definitely parallelize the folds. (The search as a whole is sequential, but for each parameter setting the fits on the individual folds can run in parallel; the result then guides the next setting.)
- The search will be less effective if it parallelizes further.
I feel there's kind of a "sweet spot" of maybe ~10 continuous hyperparameters and ~100 trials where this tends to do really well.

Can I generalize this to say BayesSearchCV > RandomizedSearchCV > GridSearchCV?
Not quite. I'd say RandomizedSearchCV > GridSearchCV is pretty reasonable most of the time.
But we should think a bit more carefully about BayesSearchCV for the above reasons.
RandomizedSearchCV is often a reasonable choice.


Lecture 05 supplement: three hyperparameter tuning methods in sklearn