scikit-learn Tutorial, Part 2: Advanced Operations

I. When More Is Better Than Less: Cross-Validation Instead of a Single Split

1. Example

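Splitting the data is necessary to evaluate the performance of a statistical model. However, it reduces the number of samples available for training the model. Cross-validation should therefore be used whenever possible. Repeating the split several times also provides information about the stability of the model.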

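scikit-learn provides three functions for this: cross_val_score, cross_val_predict and cross_validate. The last one gives the most information: fit and score times, test scores, and training scores when return_train_score=True is set. It can also return several scores at once.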

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Load the data and split it into training and test sets
X, y = load_digits(return_X_y = True)
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state = 42, stratify = y)

pipe = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(
        solver='lbfgs', 
        multi_class='auto', 
        max_iter=1000, 
        random_state=42
    )
)
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)
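The three functions mentioned above return different things, which a side-by-side call can make concrete. The following is only a minimal sketch on the same digits data; the variable names (fold_scores, oof_pred, multi) and the choice of the "accuracy" and "f1_macro" metrics are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# cross_val_score: one test score per fold and nothing else
fold_scores = cross_val_score(pipe, X, y, cv=3)

# cross_val_predict: one out-of-fold prediction per sample
oof_pred = cross_val_predict(pipe, X, y, cv=3)

# cross_validate: timings plus several metrics in a single call
# (the two metric names are an illustrative choice)
multi = cross_validate(pipe, X, y, cv=3, scoring=["accuracy", "f1_macro"])

print(fold_scores.shape)  # (3,)
print(oof_pred.shape)     # (1797,)
print(sorted(multi))      # ['fit_time', 'score_time', 'test_accuracy', 'test_f1_macro']
```

When several metrics are requested, the result keys are named test_<metric> instead of test_score.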

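cross_validate parameters:

estimator: any object with a fit method; a Pipeline you define yourself also works.
X: the data to cross-validate on.
y: the target labels; defaults to None.
groups: an array with one group label per sample. It is only used when cv is a group-aware splitter such as GroupKFold. For example, with 350 samples and a label array [1, 1, ..., 2, 2] of length 350, GroupKFold(n_splits=2) produces two folds: the samples labelled 1 form one group and the samples labelled 2 form the other.
cv: the splitting strategy. None uses the default 5-fold split (3-fold before scikit-learn 0.22). An integer sets the number of folds, using StratifiedKFold when the estimator is a classifier and KFold otherwise. A splitter object is also accepted.
n_jobs: the number of jobs to run in parallel; -1 uses all processors.
fit_params: a dict of parameters passed to the estimator's fit method (recent releases call this parameter params).
scoring: the metric or metrics to compute, given as a string, a callable, or a list or dict of several metrics. None uses the estimator's own score method.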
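The groups parameter is easier to follow with a concrete run. The sketch below assumes a made-up grouping: the first 350 digits samples, with the first half labelled 1 and the second half labelled 2. These labels are hypothetical and carry no meaning in the dataset; they only show that a group-aware splitter keeps each group in a single fold.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X, y = X[:350], y[:350]

# Hypothetical group labels: samples 0-174 in group 1, samples 175-349 in group 2
groups = np.array([1] * 175 + [2] * 175)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# groups is only taken into account because cv is a group-aware splitter
scores = cross_validate(pipe, X, y, groups=groups, cv=GroupKFold(n_splits=2))

print(len(scores["test_score"]))  # 2
```

With a plain integer cv, the groups array would be ignored, so the splitter has to be passed explicitly.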

With cross_validate we can quickly check the training and test scores and plot them with pandas.

import pandas as pd
# Put the results in a DataFrame
df_scores = pd.DataFrame(scores)
df_scores


	fit_time	score_time	test_score	train_score
0	0.127001	0.000000	0.926544	0.988314
1	0.096663	0.001004	0.943239	0.984975
2	0.093163	0.000000	0.924875	0.993322
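The per-fold scores can also be reduced to a mean and a standard deviation, which is one simple way to read the "stability" mentioned earlier. This is a rough sketch under assumptions: it uses 5 folds rather than 3, and the names test_mean, test_std and gap are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)

# Spread of the test scores across folds: a small value suggests a stable model
test_mean = scores["test_score"].mean()
test_std = scores["test_score"].std()

# Gap between training and test accuracy: a large value hints at overfitting
gap = scores["train_score"].mean() - test_mean

print(f"test accuracy: {test_mean:.3f} +/- {test_std:.3f}")
print(f"train/test gap: {gap:.3f}")
```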

df_scores[['train_score', 'test_score']].boxplot()

<AxesSubplot:>

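2. Exercise

Use the pipeline from the previous exercise (breast cancer data) and evaluate it with cross-validation instead of a single split.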

from sklearn.datasets import load_breast_cancer 
# Load the data and split it into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto',
                                        max_iter=1000, random_state=0))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)

df = pd.DataFrame(scores)
df


	fit_time	score_time	test_score	train_score
0	0.005	0.001	0.963158	0.965699
1	0.003	0.000	0.952632	0.963061
2	0.004	0.000	0.978836	0.976316

df[['train_score', 'test_score']].boxplot()

<AxesSubplot:>

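II. Hyperparameter Optimization: Fine-Tuning the Inside of a Pipeline

1. Example

Sometimes you want to find the parameters of a pipeline component that give the best accuracy. We have already seen that the parameters of a pipeline can be inspected with get_params().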

pipe.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler()),
  ('logisticregression', LogisticRegression(max_iter=1000, random_state=42))],
 'verbose': False,
 'minmaxscaler': MinMaxScaler(),
 'logisticregression': LogisticRegression(max_iter=1000, random_state=42),
 'minmaxscaler__clip': False,
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'logisticregression__C': 1.0,
 'logisticregression__class_weight': None,
 'logisticregression__dual': False,
 'logisticregression__fit_intercept': True,
 'logisticregression__intercept_scaling': 1,
 'logisticregression__l1_ratio': None,
 'logisticregression__max_iter': 1000,
 'logisticregression__multi_class': 'auto',
 'logisticregression__n_jobs': None,
 'logisticregression__penalty': 'l2',
 'logisticregression__random_state': 42,
 'logisticregression__solver': 'lbfgs',
 'logisticregression__tol': 0.0001,
 'logisticregression__verbose': 0,
 'logisticregression__warm_start': False}
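The keys in this dictionary follow the pattern <step name>__<parameter name>, and the same keys are what a parameter grid refers to. As a small sketch of how they can be used directly (the values 10 and (-1, 1) are arbitrary illustrations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
print([name for name, _ in pipe.steps])  # ['minmaxscaler', 'logisticregression']

# '<step>__<parameter>' addresses one parameter of one step
pipe.set_params(logisticregression__C=10, minmaxscaler__feature_range=(-1, 1))

print(pipe.get_params()["logisticregression__C"])      # 10
print(pipe.named_steps["minmaxscaler"].feature_range)  # (-1, 1)
```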

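Hyperparameters can be optimized by exhaustive search. GridSearchCV provides this utility: it runs a cross-validated grid search over a parameter grid.

As an example, we want to optimize the C and penalty parameters of the LogisticRegression classifier.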

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=5000,
                                                           random_state=42,
                                                           solver='saga'))]),
             n_jobs=-1,
             param_grid={'logisticregression__C': [0.1, 1.0, 10],
                         'logisticregression__penalty': ['l2', 'l1']},
             return_train_score=True)

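When the grid-search object is fitted, it finds the best parameter combination on the training set (using cross-validation). We can inspect the results of the search through the cv_results_ attribute, which lets us check how the parameters affect model performance.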

df_grid = pd.DataFrame(grid.cv_results_)
df_grid


	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_logisticregression__C	param_logisticregression__penalty	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score	split0_train_score	split1_train_score	split2_train_score	mean_train_score	std_train_score
0	0.003666	0.000471	0.000667	0.000471	0.1	l2	{'logisticregression__C': 0.1, 'logisticregres...	0.894366	0.936620	0.957746	0.929577	0.026350	5	0.943662	0.926056	0.933099	0.934272	0.007235
1	0.010667	0.002356	0.000667	0.000471	0.1	l1	{'logisticregression__C': 0.1, 'logisticregres...	0.880282	0.950704	0.936620	0.922535	0.030426	6	0.929577	0.915493	0.908451	0.917840	0.008783
2	0.006667	0.000943	0.000333	0.000471	1.0	l2	{'logisticregression__C': 1.0, 'logisticregres...	0.936620	0.978873	0.971831	0.962441	0.018484	4	0.971831	0.961268	0.964789	0.965962	0.004392
3	0.103380	0.053784	0.000000	0.000000	1.0	l1	{'logisticregression__C': 1.0, 'logisticregres...	0.936620	0.985915	0.978873	0.967136	0.021769	3	0.982394	0.975352	0.975352	0.977700	0.003320
4	0.023666	0.004497	0.000333	0.000471	10	l2	{'logisticregression__C': 10, 'logisticregress...	0.957746	0.985915	0.992958	0.978873	0.015213	1	0.992958	0.992958	0.982394	0.989437	0.004980
5	0.277392	0.040295	0.000000	0.000000	10	l1	{'logisticregression__C': 10, 'logisticregress...	0.950704	0.971831	0.992958	0.971831	0.017250	2	0.996479	0.996479	0.989437	0.994131	0.003320
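The full cv_results_ table is wide, so it often helps to keep a few columns and sort by rank. The sketch below is a simplified assumption of the search above: it tunes only C (not the penalty), uses the default solver, and the names columns and summary are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=5000))
param_grid = {"logisticregression__C": [0.1, 1.0, 10]}  # simplified grid: C only
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, return_train_score=True)
grid.fit(X_train, y_train)

# Keep the columns that matter for comparing candidates, best candidate first
columns = [
    "param_logisticregression__C",
    "mean_train_score",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Comparing mean_train_score with mean_test_score row by row shows whether a larger C mainly improves the fit on the training folds.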

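By default, the grid-search object also behaves like an estimator. Once it has been fitted, calling score evaluates the model with the hyperparameters fixed to the best combination found.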

grid.best_params_

{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}

In addition, the grid search can be called like any other classifier to make predictions.

accuracy = grid.score(X_test, y_test)
print('Accuracy score of the {} is {:.6f}'.format(grid.__class__.__name__, accuracy))

Accuracy score of the GridSearchCV is 0.951049
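What "behaves like an estimator" means can be checked directly: with refit=True (the default), predict and score are delegated to best_estimator_, the pipeline refit on the whole training set. A minimal sketch, assuming a simplified grid over C only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipe, param_grid={"logisticregression__C": [0.1, 1.0, 10]}, cv=3)
grid.fit(X_train, y_train)

# predict on the search object uses the refit best pipeline
y_pred = grid.predict(X_test)
delegated = grid.best_estimator_.predict(X_test)
print((y_pred == delegated).all())  # True

# best_score_ is the mean cross-validated score of the best candidate,
# while grid.score evaluates the refit pipeline on held-out data
print(f"best CV score: {grid.best_score_:.3f}")
print(f"test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```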

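Note that so far we have fitted the grid search on a single split only. As discussed earlier, however, we may want an outer cross-validation to estimate the model's performance on different samples of the data and to check how much that performance varies. Because the grid search is an estimator, it can be passed directly to cross_validate.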

scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores


	fit_time	score_time	test_score	train_score
0	0.576077	0.000	0.963158	0.989446
1	0.381000	0.000	0.963158	0.981530
2	0.417000	0.001	0.973545	0.981579
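In this nested setup each outer fold runs its own inner grid search, so the selected hyperparameters may differ from fold to fold. One way to look at them is return_estimator=True, which keeps the fitted search object of every outer fold. A sketch under the assumption of a simplified grid over C only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipe, param_grid={"logisticregression__C": [0.1, 1.0, 10]}, cv=3)

# Outer loop: 3 folds; inner loop: the 3-fold grid search run inside each outer fold
outer = cross_validate(grid, X, y, cv=3, return_estimator=True)

# One fitted GridSearchCV per outer fold, each with its own best_params_
for fold, (fitted_grid, score) in enumerate(zip(outer["estimator"], outer["test_score"])):
    print(fold, fitted_grid.best_params_, round(score, 3))
```

If the chosen parameters change a lot between outer folds, the search itself is unstable on this amount of data.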

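2. Exercise

Reuse the previous pipeline for the breast cancer dataset and run a grid search to evaluate the difference between the hinge and log losses. In addition, fine-tune the penalty.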

from sklearn.datasets import load_breast_cancer 
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler 
# Load the data and split it into training and test sets
X_breast, y_breast = load_breast_cancer(return_X_y=True)
X_breast_train, X_breast_test, y_breast_train, y_breast_test = train_test_split(
    X_breast, y_breast, random_state=0, stratify=y_breast)

pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))
# Note: 'log' was renamed 'log_loss' in scikit-learn 1.1; use 'log_loss' on recent versions
param_grid = {'sgdclassifier__loss': ['hinge', 'log'],
              'sgdclassifier__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = cross_validate(grid, X_breast, y_breast, scoring='balanced_accuracy', cv=3, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores[['train_score', 'test_score']].boxplot()

grid.fit(X_breast_train, y_breast_train)
print(grid.best_params_)

{'sgdclassifier__loss': 'log', 'sgdclassifier__penalty': 'l2'}

III. Summary: my scikit-learn pipeline in fewer than 10 lines of code (skipping the import statements)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()

grid.fit(X_train, y_train)
print(grid.best_params_)

{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}
