Machine Learning Model Evaluation Methods, from Introduction to Practice [Part 2]


Splitting Datasets with Cross-Validation

1 KFold: k-fold cross-validation

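The method described above, which splits the dataset into k equal parts, is called k-fold cross-validation. Part 3, "Evaluating Models with Cross-Validation", introduces the cross_val_score method. Its cv parameter specifies how the dataset is split: if you pass an integer k, KFold with k folds is used (or StratifiedKFold when the estimator is a classifier and the target is binary or multiclass).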

The sklearn implementation is shown below (although, as described in the previous part, it is usually used through cross_val_score):

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)
    train_X, train_y = X[train_index], y[train_index]
    test_X, test_y = X[test_index], y[test_index]

The n_splits parameter sets how many folds the data is split into.
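To make the link between the cv parameter and KFold concrete, here is a minimal sketch. The synthetic data and the choice of LinearRegression are assumptions for illustration, not part of the original article. With a regressor, an integer cv should resolve to an unshuffled KFold, so both calls are expected to return the same scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative assumption, not from the article).
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(20)

model = LinearRegression()

# Passing an integer: for a regressor this uses KFold without shuffling.
scores_int = cross_val_score(model, X, y, cv=5)

# Passing the splitter explicitly.
scores_kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5))

print(scores_int)
print(scores_kfold)
```

Passing a KFold object instead of an integer is useful when you want to set options such as shuffle or random_state yourself.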

2 RepeatedKFold: k-fold cross-validation repeated p times

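In practice, a single run of k-fold cross-validation is often not enough, so we repeat it several times with a different random split each time. The most typical setup is 10 repetitions of 10-fold cross-validation. RepeatedKFold lets you control the number of repetitions.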

from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)

The n_repeats parameter sets how many times the k-fold procedure is repeated.
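As a quick check of how n_splits and n_repeats interact, the sketch below (with made-up data of 10 samples) counts the generated splits and verifies that, within each repetition, the test folds together cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Ten samples with two features each (illustrative data).
X = np.arange(20).reshape(10, 2)

n_splits, n_repeats = 5, 3
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
splits = list(rkf.split(X))

# Total number of train/test pairs is n_splits * n_repeats.
print(len(splits), rkf.get_n_splits())

# Within each repetition, the test folds form a partition of the samples.
coverage = []
for r in range(n_repeats):
    folds = splits[r * n_splits:(r + 1) * n_splits]
    covered = np.sort(np.concatenate([test for _, test in folds]))
    coverage.append(covered.tolist())
    print(f'repeat {r}: test indices cover {covered}')
```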

3 LeaveOneOut: leave-one-out

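Leave-one-out is the special case of k-fold cross-validation where k = n (n is the number of samples in the dataset): each time, a single sample is held out for validation. (At first I misread this as holding out one subset of the data, and wondered how that differed from KFold.) Because the model has to be fitted n times, this method is only suitable when the number of samples is small.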

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print('train_index', train_index, 'test_index', test_index)
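The claim that leave-one-out is k-fold with k = n can be checked directly. This is a small sketch that compares the splits produced by LeaveOneOut with those of an unshuffled KFold whose n_splits equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.array([1, 2, 3, 4])

# Splits from leave-one-out.
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X)]

# Splits from k-fold with k equal to the number of samples.
kf_splits = [(train.tolist(), test.tolist())
             for train, test in KFold(n_splits=len(X)).split(X)]

# Prints whether both strategies produced identical splits.
print(loo_splits == kf_splits)
```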

4 LeavePOut: leave-p-out

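See the previous section. The idea is the same as leave-one-out, except that p samples are held out for validation each time, and every possible combination of p samples serves as the test set once.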

from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]

lpo = LeavePOut(p=2)
for train_index, test_index in lpo.split(X):
    print('train_index', train_index, 'test_index', test_index)
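Since every combination of p samples becomes a test set, the number of splits grows as "n choose p". The sketch below, using a made-up list of five samples, compares the count reported by LeavePOut with math.comb.

```python
from math import comb
from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4, 5]
p = 2

lpo = LeavePOut(p=p)

# Number of splits reported by the splitter.
n_splits = lpo.get_n_splits(X)

# Collect every test set as a tuple of plain Python ints.
test_sets = [tuple(int(i) for i in test) for _, test in lpo.split(X)]

print(n_splits, comb(len(X), p))
print(test_sets)
```

This growth is the reason leave-p-out quickly becomes expensive as the dataset gets larger.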

5 ShuffleSplit: random splitting

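ShuffleSplit randomly shuffles the data and then splits it into a training set and a test set. Another advantage is that the splits can be reproduced by setting the random_state seed; if it is not set, the splits differ on every run.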

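You can think of it as a randomized counterpart of KFold, or as running the train_test_split hold-out method n_splits times. Unlike KFold, the test sets of different splits may overlap.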

import numpy as np
from sklearn.model_selection import ShuffleSplit

X=np.random.randint(1,100,20).reshape(10,2)
rs = ShuffleSplit(n_splits=10, test_size=0.25)

for train , test in rs.split(X):
    print(f'train: {train} , test: {test}')
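To illustrate the reproducibility point, here is a small sketch in which the helper collect and the seed 42 are hypothetical names and values chosen for illustration. It builds the splits twice with the same random_state and compares them.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten samples with two features each (illustrative data).
X = np.arange(20).reshape(10, 2)

def collect(random_state):
    """Return all (train, test) index lists for a given seed."""
    rs = ShuffleSplit(n_splits=3, test_size=0.25, random_state=random_state)
    return [(train.tolist(), test.tolist()) for train, test in rs.split(X)]

run_a = collect(42)
run_b = collect(42)

# Prints whether the two runs with the same seed match.
print(run_a == run_b)
for train, test in run_a:
    print('train', train, 'test', test)
```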

6 Splitting methods for other special cases

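For classification data, the target classes may be imbalanced. In medical data, for example, far fewer people have cancer than do not. In this case the splitting methods to use are StratifiedKFold and StratifiedShuffleSplit.
For grouped data the splitting works differently. The main methods are GroupKFold, LeaveOneGroupOut, LeavePGroupOut, and GroupShuffleSplit.
For time-dependent data, there is TimeSeriesSplit.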
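The following sketch shows one splitter from each family on made-up data (the labels and group ids are assumptions for illustration). It checks the property each splitter is meant to provide: class ratios preserved per fold, no group shared between train and test, and training samples always preceding test samples.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)       # imbalanced labels: 8 negatives, 4 positives
groups = np.repeat([0, 1, 2, 3], 3)   # four groups of three samples each

# StratifiedKFold: count how many samples of each class land in every test fold.
skf_counts = [np.bincount(y[test], minlength=2).tolist()
              for _, test in StratifiedKFold(n_splits=4).split(X, y)]
print('class counts per test fold:', skf_counts)

# GroupKFold: record whether any group appears in both train and test.
gkf_leaks = [bool(set(groups[train]) & set(groups[test]))
             for train, test in GroupKFold(n_splits=4).split(X, y, groups=groups)]
print('group leakage per fold:', gkf_leaks)

# TimeSeriesSplit: record whether every training index precedes every test index.
tss_ordered = [bool(train.max() < test.min())
               for train, test in TimeSeriesSplit(n_splits=3).split(X)]
print('train before test per fold:', tss_ordered)
```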