KFold and StratifiedKFold in Python's sklearn


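Cross-validation comes up constantly in machine learning, and the two most commonly used splitters are KFold and StratifiedKFold. So what is the difference between them, and how should each be used?

Both live in the sklearn package (in sklearn.model_selection) and must be imported before use: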

import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold

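First, the difference. StratifiedKFold splits by stratification (the idea behind stratified random sampling): after the split, the class distribution in each training and validation set stays as close as possible to that of the original dataset. That is why StratifiedKFold needs the labels when splitting, whereas KFold ignores them.

Simulated data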

X = np.array([[10, 1], [20, 2], [30, 3], [40, 4], [50,5], [60,6], [70,7],[80,8],[90,9],[100,10],
             [110, 1], [120, 2], [130, 3], [140, 4], [150,5], [160,6], [170,7],[180,8],[190,9],[200,10]])
# Five classes in the ratio 1:1:1:1:1
Y1 = np.array([1,1,2,3,3,2,4,4,5,5,1,1,2,3,3,2,4,4,5,5])
# Two classes in the ratio 2:3
Y2 = np.array([1,1,1,1,2,2,2,2,2,2,1,1,1,1,2,2,2,2,2,2])

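1. KFold

KFold takes three parameters:

  • n_splits: defaults to 5 (it was 3 before scikit-learn 0.22); the number of folds to split the data into, i.e. the k in k-fold cross-validation;

  • shuffle: defaults to False; whether to shuffle the samples before splitting. This parameter appears in many sklearn functions: if True, the data is shuffled first and then split; if False, it is split in its original order;

  • random_state: defaults to None; the seed for the random number generator, which only takes effect when shuffle is True.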
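The effect of shuffle and random_state can be seen in a minimal sketch. The array X_demo and the seed 42 are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features

# shuffle=False: validation folds are contiguous blocks of indices
plain = [val.tolist() for _, val in KFold(n_splits=5).split(X_demo)]
print(plain)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True with a fixed seed: indices are permuted, but the split is repeatable
a = [val.tolist() for _, val in
     KFold(n_splits=5, shuffle=True, random_state=42).split(X_demo)]
b = [val.tolist() for _, val in
     KFold(n_splits=5, shuffle=True, random_state=42).split(X_demo)]
print(a == b)  # True
```

Fixing random_state is what makes a shuffled split reproducible across runs; with shuffle=True and random_state=None the folds would generally differ each time.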

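# k = 5: the loop runs five times, and each round holds out a different fold for validation
kfolds = KFold(n_splits=5, shuffle=False)

# Note: split() yields index arrays, not the data itself
for (trn_idx, val_idx) in kfolds.split(X, Y1):
    print((trn_idx, val_idx))
    print((len(trn_idx), len(val_idx)))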

Out: 
(array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([0, 1, 2, 3]))
(16, 4)
(array([ 0,  1,  2,  3,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([4, 5, 6, 7]))
(16, 4)
(array([ 0,  1,  2,  3,  4,  5,  6,  7, 12, 13, 14, 15, 16, 17, 18, 19]), array([ 8,  9, 10, 11]))
(16, 4)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 16, 17, 18, 19]), array([12, 13, 14, 15]))
(16, 4)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), array([16, 17, 18, 19]))
(16, 4)
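The output above suggests that each sample lands in a validation fold exactly once over the five rounds. A small check of that property, assuming a placeholder feature array X_demo of the same length as the simulated data:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X_demo = np.zeros((n_samples, 2))  # placeholder features; only the length matters

val_counts = np.zeros(n_samples, dtype=int)
for trn_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # training and validation indices never overlap within a round
    assert np.intersect1d(trn_idx, val_idx).size == 0
    val_counts[val_idx] += 1

print(val_counts.tolist())  # every entry is 1
```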

2. StratifiedKFold

StratifiedKFold takes the same parameters as KFold.

# StratifiedKFold: the class proportions in each training and validation set
# stay as close as possible to those of the original dataset

# Split (X, Y1)
# Y1 has 5 classes in the ratio 1:1:1:1:1
# so each validation fold should hold 1*x+1*x+1*x+1*x+1*x = 5x samples
stratifiedKFolds = StratifiedKFold(n_splits=2, shuffle=False)
for (trn_idx, val_idx) in stratifiedKFolds.split(X, Y1):
    print((trn_idx, val_idx))
    print((len(trn_idx), len(val_idx)))

print('################################################')
# Split (X, Y2)
# Y2 has 2 classes in the ratio 2:3
# so each validation fold should hold 2x+3x = 5x samples
stratifiedKFolds = StratifiedKFold(n_splits=4, shuffle=False)
for (trn_idx, val_idx) in stratifiedKFolds.split(X, Y2):
    print((trn_idx, val_idx))
    print((len(trn_idx), len(val_idx)))

Out:
(array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
(10, 10)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]))
(10, 10)
################################################
(array([ 2,  3,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([0, 1, 4, 5, 6]))
(15, 5)
(array([ 0,  1,  4,  5,  6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), array([2, 3, 7, 8, 9]))
(15, 5)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 12, 13, 17, 18, 19]), array([10, 11, 14, 15, 16]))
(15, 5)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 14, 15, 16]), array([12, 13, 17, 18, 19]))
(15, 5)
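To make the difference concrete, the sketch below counts the classes inside each validation fold for both splitters, using the same labels as Y2. The helper val_class_counts and the array X_demo are names introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2] * 2)  # same labels as Y2
X_demo = np.zeros((len(y), 1))  # features do not affect the split


def val_class_counts(splitter):
    """Return [count of class 1, count of class 2] for each validation fold."""
    return [np.bincount(y[val_idx], minlength=3)[1:].tolist()
            for _, val_idx in splitter.split(X_demo, y)]


kf_counts = val_class_counts(KFold(n_splits=4))
skf_counts = val_class_counts(StratifiedKFold(n_splits=4))
print(kf_counts)   # [[4, 1], [0, 5], [4, 1], [0, 5]]
print(skf_counts)  # [[2, 3], [2, 3], [2, 3], [2, 3]]
```

Without shuffling, KFold produces validation folds that contain no class-1 samples at all, while every StratifiedKFold fold keeps the original 2:3 ratio.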

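Summary:

In our own code below, the loop overwrites df_train and df_valid on every iteration, so only the index arrays from the last round are actually used. The value of k (n_splits) therefore does not give us k train/validation rounds here; it only fixes the share of data held out for validation, roughly 1/k.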

skf = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)

for train_idx, valid_idx in skf.split(df_all['image'], df_all["cultivar_index"]):
    df_train = df_all.iloc[train_idx]
    df_valid = df_all.iloc[valid_idx]

print(f"train size: {len(df_train)}")
print(f"valid size: {len(df_valid)}")

print(df_train.cultivar.value_counts())  # number of samples per class
print(df_valid.cultivar.value_counts())
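If all k rounds are wanted rather than only the last one, one common pattern is to record each row's fold number in a column and select a fold afterwards. The sketch below is an assumption-laden illustration: df_demo is a hypothetical stand-in for df_all, and the fold column, n_splits=4 and seed 0 are arbitrary choices, not values from the code above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for df_all: 12 images, 3 cultivars with 4 images each
df_demo = pd.DataFrame({
    "image": [f"img_{i}.png" for i in range(12)],
    "cultivar_index": [0, 1, 2] * 4,
})

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
df_demo["fold"] = -1
fold_col = df_demo.columns.get_loc("fold")
for fold, (_, valid_idx) in enumerate(
        skf.split(df_demo["image"], df_demo["cultivar_index"])):
    # valid_idx holds positional indices, so assign through iloc
    df_demo.iloc[valid_idx, fold_col] = fold

# Any fold can now be selected as the validation set
df_valid = df_demo[df_demo["fold"] == 0]
df_train = df_demo[df_demo["fold"] != 0]
print(len(df_train), len(df_valid))  # 9 3
```

Storing the fold number keeps every round available, so the same DataFrame can drive a full k-fold training loop or a single hold-out run.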