How to Create a Training and Testing Set from a Pandas DataFrame

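When fitting machine learning models to datasets, we often split the dataset into two sets:

1. Training set: used to train the model (70-80% of the original dataset).

2. Testing set: used to get an unbiased estimate of the model's performance (20-30% of the original dataset).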

In Python, there are two common ways to split a pandas DataFrame into a training set and a testing set:

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8,random_state=0)
test = df.drop(train.index)

The following examples show how to use each method with the following pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the **train_test_split()** function from sklearn to split the pandas DataFrame into training and testing sets:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

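From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that test_size sets the fraction of observations from the original DataFrame that go to the testing set, and the random_state value makes the split reproducible.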
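To make the roles of test_size and random_state concrete, here is a minimal sketch. It rebuilds a DataFrame of the same shape as the one above (the `rng` generator and the `_a`, `_b`, `_c` variable names are only illustrative and are not part of the original example), then checks that repeating the split with the same random_state gives the same rows and that changing test_size changes the set sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative DataFrame with 1,000 rows and 3 columns (assumed, mirrors the tutorial's shape)
rng = np.random.RandomState(1)
df = pd.DataFrame({'x1': rng.randint(30, size=1000),
                   'x2': rng.randint(12, size=1000),
                   'y': rng.randint(2, size=1000)})

# Two splits with the same random_state select the same rows
train_a, test_a = train_test_split(df, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=0)
same_split = test_a.index.equals(test_b.index)

# A larger test_size moves more rows into the testing set
train_c, test_c = train_test_split(df, test_size=0.25, random_state=0)
sizes = (len(train_c), len(test_c))

print(same_split, sizes)
```

If random_state is left out, the rows that land in each set will generally differ from run to run.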

Example 2: Use sample() from pandas

The following code shows how to use the **sample()** function from pandas to split the pandas DataFrame into training and testing sets:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8,random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

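From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that frac sets the fraction of observations from the original DataFrame that go to the training set, and the random_state value makes the split reproducible.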
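Because the testing set here is built by dropping the training rows by index label, it can be worth checking that the two sets do not overlap and together cover the whole DataFrame. The sketch below is one way to do that. The DataFrame is an illustrative stand-in of the same shape as the one above, and `overlap`, `covers_all` and `df_unique` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with 1,000 rows and 3 columns (assumed, mirrors the tutorial's shape)
rng = np.random.RandomState(1)
df = pd.DataFrame({'x1': rng.randint(30, size=1000),
                   'x2': rng.randint(12, size=1000),
                   'y': rng.randint(2, size=1000)})

# Split with sample() and drop(), as in Example 2
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# Row labels that appear in both sets (expected to be empty)
overlap = train.index.intersection(test.index)

# The two sets together should account for every row
covers_all = len(train) + len(test) == len(df)

print(len(overlap), covers_all)

# drop() removes rows by index label, so this approach relies on a unique index.
# If the index might contain duplicates, resetting it before splitting is one option.
df_unique = df.reset_index(drop=True)
```

With a default integer index, as in this tutorial, the index is already unique and the reset is not needed.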

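Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python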