Machine Learning Cookbook Study Notes: Model Selection

For my own review only; I will remove it on request if it infringes any rights.

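We use the term model selection for both choosing the best learning algorithm and choosing the best hyperparameters. The reason is simple: suppose we want to train a support vector classifier with 10 candidate hyperparameter values and a random forest classifier with 10 candidate hyperparameter values. In effect, we are picking the single best model out of 20 candidate models.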
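To make the count concrete, here is a minimal sketch that enumerates candidates with scikit-learn's `ParameterGrid`. The hyperparameters chosen (`C` for the support vector classifier, `n_estimators` for the random forest) and their values are assumptions for illustration only; they are not from the notes.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values: 10 per algorithm (illustrative only)
svc_grid = {"C": np.logspace(-2, 2, 10)}
forest_grid = {"n_estimators": list(range(10, 110, 10))}

# A list of grids is expanded grid by grid, so the candidate counts add up
candidates = ParameterGrid([svc_grid, forest_grid])
n_candidates = len(candidates)
print(n_candidates)  # prints 20
```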

Selecting the best model with exhaustive search

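import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load the data (assumed: iris, as in the later sections of these notes)
iris = datasets.load_iris()
features, target = iris.data, iris.target
# The notes do not show how `logistic` was created. Assumed here: the saga solver,
# because the default lbfgs solver does not support the 'l1' penalty
logistic = linear_model.LogisticRegression(solver='saga', max_iter=5000)

penalty = ['l1', 'l2']  # candidate regularization penalties
C = np.logspace(0, 4, 10)  # candidate values of C (inverse regularization strength)
hyperparameters = dict(C=C, penalty=penalty)  # dictionary of candidate hyperparameters
# verbose controls how much is printed during the search: 0 prints nothing,
# 1 to 3 print progressively more detail
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)  # create the grid search
best_model = gridsearch.fit(features, target)  # run the grid search

# Inspect the best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])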
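Beyond `best_estimator_`, a fitted search object also exposes the winning settings and the per-candidate results directly. The following is a minimal sketch, assuming the iris data and a search over `C` only (so that the default solver can be used); the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumed data set: iris, as used elsewhere in these notes
features, target = datasets.load_iris(return_X_y=True)

# Search only over C so the default solver can be used
param_grid = {"C": np.logspace(0, 4, 10)}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(features, target)

best_params = search.best_params_  # dict holding the winning value of C
best_score = search.best_score_    # mean cross-validated accuracy of the winner
n_candidates = len(search.cv_results_["params"])  # one entry per candidate
print(best_params, round(best_score, 3), n_candidates)
```

`cv_results_` is a dictionary of arrays, so it can be passed to `pandas.DataFrame` when a table of all candidates is easier to read.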

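Selecting the best model with randomized search

A more efficient approach than the brute-force search of GridSearchCV is to draw a fixed number of random hyperparameter combinations from user-supplied distributions (for example, normal or uniform).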

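from scipy.stats import uniform
from sklearn import linear_model, datasets
from sklearn.model_selection import RandomizedSearchCV

# Load the data and create the estimator (assumed; not shown in the notes).
# saga is used because the default lbfgs solver does not support 'l1'
iris = datasets.load_iris()
features, target = iris.data, iris.target
logistic = linear_model.LogisticRegression(solver='saga', max_iter=5000)

# Candidate regularization penalties
penalty = ['l1', 'l2']
# Distribution to sample C from: uniform on [0, 4]
C = uniform(loc=0, scale=4)
# Dictionary of hyperparameters
hyperparameters = dict(C=C, penalty=penalty)
# Create the randomized search; n_iter sets how many hyperparameter
# combinations (i.e. candidate models) are sampled
randomizedsearch = RandomizedSearchCV(logistic, hyperparameters, random_state=1,
                                      n_iter=100, cv=5, verbose=0, n_jobs=-1)
# Run the randomized search
best_model = randomizedsearch.fit(features, target)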
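To see what the randomized search actually draws, the sampling step can be reproduced on its own with `ParameterSampler`. This is a small sketch using the same kind of search space as above; `n_iter=5` is an arbitrary choice to keep the output short.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# A list is sampled uniformly; a scipy distribution is sampled via its rvs method
space = {"penalty": ["l1", "l2"], "C": uniform(loc=0, scale=4)}

# Draw 5 random combinations; random_state makes the draw repeatable
samples = list(ParameterSampler(space, n_iter=5, random_state=1))
for params in samples:
    print(params)
```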

Selecting the best model from multiple learning algorithms

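import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Load the data (assumed: iris, as in the later sections of these notes)
iris = datasets.load_iris()
features, target = iris.data, iris.target

# Create a pipeline; the classifier step is a placeholder that the search replaces
pipe = Pipeline([("classifier", RandomForestClassifier())])
# Candidate learning algorithms and their hyperparameters
# (saga is assumed because the default lbfgs solver does not support 'l1')
search_space = [{"classifier": [LogisticRegression(solver='saga', max_iter=5000)],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]
# Create the GridSearchCV object
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Run the grid search
best_model = gridsearch.fit(features, target)
# Inspect the best model
best_model.best_estimator_.get_params()["classifier"]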

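Including data preprocessing in model selection

FeatureUnion combines several preprocessing steps: it applies each transformer to the input and concatenates their outputs.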

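import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the data (assumed: iris, as in the later sections of these notes)
iris = datasets.load_iris()
features, target = iris.data, iris.target

# Create a preprocessing object that combines StandardScaler and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])
# Create a pipeline (saga is assumed because the default lbfgs solver does not support 'l1')
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression(solver='saga', max_iter=5000))])
# Define the space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ["l1", "l2"],
                 "classifier__C": np.logspace(0, 4, 10)}]
# Create the grid search object
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)
# Fit the model
best_model = clf.fit(features, target)
# Inspect the number of principal components in the best model
best_model.best_estimator_.get_params()['preprocess__pca__n_components']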

Speeding up model selection with parallelization

Set n_jobs=-1 to use all CPU cores.
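A rough way to check whether parallelization helps on a given machine is to time the same search with and without it. The sketch below assumes the iris data and a small grid over `C`; `timed_search` is a hypothetical helper written for this illustration. On a data set this small, the cost of starting worker processes may well outweigh the gain, so no particular speedup should be expected.

```python
import time
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

features, target = datasets.load_iris(return_X_y=True)
param_grid = {"C": np.logspace(0, 4, 10)}

def timed_search(n_jobs):
    """Fit the same grid search and return (best C, elapsed seconds)."""
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          cv=5, n_jobs=n_jobs)
    start = time.perf_counter()
    search.fit(features, target)
    return search.best_params_["C"], time.perf_counter() - start

best_serial, t_serial = timed_search(1)      # one core
best_parallel, t_parallel = timed_search(-1)  # all available cores
print(f"n_jobs=1: {t_serial:.2f}s, n_jobs=-1: {t_parallel:.2f}s")
```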

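Speeding up model selection with algorithm-specific methods

In scikit-learn, many learning algorithms (such as ridge regression, lasso regression, and elastic net regression) have a dedicated cross-validation variant that exploits the algorithm's own structure to find the best hyperparameters.

LogisticRegressionCV has one drawback: it only searches over values of C (plus l1_ratios when the elastic-net penalty is used).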

# Load libraries
from sklearn import linear_model, datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create the LogisticRegressionCV object
logit = linear_model.LogisticRegressionCV(Cs=100)
# If Cs is a list, the candidate values are taken from that list. If Cs is an
# integer, that many candidate values are generated, evenly spaced on a
# logarithmic scale between 0.0001 and 10,000
# Fit the model
logit.fit(features, target)
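After fitting, the grid that was searched and the value that was chosen can be read back from the estimator. This is a minimal sketch, assuming a smaller grid (`Cs=10`) and a raised `max_iter` to keep the run short and quiet; `np.atleast_1d` is used because the shape of `C_` can differ between scikit-learn versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV

features, target = datasets.load_iris(return_X_y=True)

# An integer Cs asks for that many log-spaced candidates
logit = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logit.fit(features, target)

candidate_cs = logit.Cs_             # the grid of C values that was searched
chosen_cs = np.atleast_1d(logit.C_)  # the selected value(s) of C
print(candidate_cs.min(), candidate_cs.max(), chosen_cs)
```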

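Evaluating performance after model selection

Because we used the data to select the best hyperparameters, we can no longer use the same data to evaluate the model's performance: the resulting score would be optimistically biased.

In nested cross-validation, the "inner" cross-validation selects the best model, while the "outer" cross-validation gives an unbiased estimate of model performance.

In this solution, the GridSearchCV object performs the inner cross-validation, and cross_val_score then wraps it in the outer cross-validation.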

import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create the logistic regression object
logistic = linear_model.LogisticRegression()
# Create 20 candidate values for the hyperparameter C
C = np.logspace(0, 4, 20)
# Create the hyperparameter dictionary
hyperparameters = dict(C=C)
# Create the grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1,
                          verbose=0)
# Run nested cross-validation and output the mean score
cross_val_score(gridsearch, features, target).mean()
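The two loops can be made explicit by passing separate splitters to each level. The following is a sketch under stated assumptions: shuffled 5-fold splits for both loops (the iris rows are ordered by class), a reduced grid of 5 values for `C`, and illustrative variable names.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

features, target = datasets.load_iris(return_X_y=True)

# Assumed splitters: shuffled 5-fold for both the inner and the outer loop
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C using only the training portion of each outer split
inner_search = GridSearchCV(LogisticRegression(max_iter=1000),
                            {"C": np.logspace(0, 4, 5)}, cv=inner_cv)

# Outer loop: score the whole selection procedure on the held-out folds
outer_scores = cross_val_score(inner_search, features, target, cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the estimate varies from one outer fold to the next.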