Pandas Sampling: Randomly Selecting Rows from a DataFrame

Use the pandas.DataFrame.sample() method from the pandas library to randomly select rows from a DataFrame.

Randomly selecting rows is a useful way to inspect the values in a DataFrame.

In this article, you will learn the different configurations of this method for randomly selecting rows from a DataFrame, followed by some practical tips for using it for different purposes.

Creating a DataFrame

# Make a DataFrame
import pandas as pd

# Create the data of the DataFrame as a dictionary
data_df = {'Name': ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano', 'Keras', 'GPUImage', 'YOLO', 'BoofCV'],

           'Created By': ['Gary Bradski', 'Google Brain', 'Cleve Moler', 'Ian Buck', 'MILA',
                          'Francois Chollet', 'Brad Larson', 'Joseph Redmon', 'Peter Abeles'],

           'Written in': ['C++', 'Python', 'C++', 'C++', 'Python', 'Python', 'C', 'C', 'Java']}

# Create the DataFrame from the dictionary
df = pd.DataFrame(data_df)
df

pandas.DataFrame.sample()

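Syntax: DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Purpose: Returns a random sample of rows or columns from a DataFrame.
Parameters:
- n: int (default: None). The number of randomly selected rows or columns to return. If neither n nor frac is given, a single item is returned.
- frac: float (default: None). The number of rows to return, expressed as a fraction of the total number of rows in the DataFrame. It can be seen as an alternative to n, but it cannot be used together with n.
- replace: bool (default: False). Specifies whether the same row or column may be returned more than once.
- weights: str or array-like (default: None). Adds a bias toward selected rows or columns so that they have a higher chance of being returned by the method.
- random_state: int, array-like, BitGenerator, or numpy.random.RandomState (default: None). Specifies the seed used by the random number generator.
- axis: 0 or 1 (default: None). Specifies the direction of the returned object. Pass 0 to return randomly selected rows, or pass 1 to return randomly selected columns.
Returns: A pandas Series or DataFrame, depending on the type of the calling object.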
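The examples in this article all sample rows, so the axis parameter is never shown in action. The following is a minimal sketch of sampling columns instead; the three-row DataFrame is a shortened stand-in for the one built above, and the seed value 1 is arbitrary.

```python
import pandas as pd

# Shortened stand-in DataFrame (illustrative subset of the article's data)
df = pd.DataFrame({
    'Name': ['OpenCV', 'Keras', 'BoofCV'],
    'Created By': ['Gary Bradski', 'Francois Chollet', 'Peter Abeles'],
    'Written in': ['C++', 'Python', 'Java'],
})

# axis=0 samples rows; every column is kept
rows = df.sample(n=2, axis=0, random_state=1)

# axis=1 samples columns; every row is kept
cols = df.sample(n=2, axis=1, random_state=1)

print(rows.shape)  # (2, 3)
print(cols.shape)  # (3, 2)
```

Which two rows or columns come back depends on the seed, so only the shapes are shown here.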

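Using the DataFrame.sample() Method

You can call DataFrame.sample() without passing any arguments. In that case, every parameter takes its default value and a single randomly selected row of the DataFrame is returned.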

# Use the DataFrame.sample() method to return a single randomly selected row
df.sample()

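Using the n Parameter

With its default configuration, DataFrame.sample() returns only one row. To return several rows, use the n parameter to specify how many rows to return.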

# Return three randomly selected rows from the DataFrame
df.sample(n=3)

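Using the frac Parameter

With the frac parameter, you can specify the number of rows to return as a fraction of the total number of rows in the DataFrame.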

# Return 30% of the total number of rows from the DataFrame
df.sample(frac=0.3)
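Because 30% of nine rows is not a whole number, it can help to check how many rows actually come back. The sketch below assumes a nine-row DataFrame like the one above (only the Name column is rebuilt, to keep it short) and also shows that n and frac cannot be combined.

```python
import pandas as pd

# Nine-row stand-in for the article's DataFrame (Name column only)
df = pd.DataFrame({'Name': ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano',
                            'Keras', 'GPUImage', 'YOLO', 'BoofCV']})

# 0.3 * 9 = 2.7, which is rounded to the nearest whole number of rows
subset = df.sample(frac=0.3)
print(len(subset))  # 3

# Passing both n and frac is rejected with a ValueError
try:
    df.sample(n=3, frac=0.3)
except ValueError as err:
    print('ValueError:', err)
```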

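Using the replace Parameter

This parameter allows the same row to be returned more than once. Its default value is False, which means that a row cannot be selected more than once. Set it to True to allow repeated rows.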

# Sample three rows with replacement, so the same row may appear more than once
df.sample(n=3, replace=True, random_state=2)

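Using the weights Parameter

Each call to DataFrame.sample() draws a new random sample, so the returned rows generally differ from call to call. However, if you want certain rows to have a higher chance of being returned, you can use the weights parameter to specify the relative probability of each row being returned.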

# Add bias to those rows which should be returned more frequently than the others
bias = [15, 10, 0.5, 0.55, 0.4, 0.2, 0.1, 0.6, 8]
df.sample(n=2, weights=bias)

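As you can see, the first, second, and last rows are assigned higher weights than the other rows. This means that these rows have a higher chance of being returned each time the method is called.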
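One way to see the effect of the weights is to draw many rows with replacement and count how often each name appears. The sketch below is an illustration under a few assumptions: the weights are stored in a hypothetical column named bias so that they can be passed by column name (the string form of the weights parameter), and the draw count of 2000 and the seed 0 are arbitrary.

```python
import pandas as pd

names = ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano',
         'Keras', 'GPUImage', 'YOLO', 'BoofCV']
bias = [15, 10, 0.5, 0.55, 0.4, 0.2, 0.1, 0.6, 8]

# 'bias' is a hypothetical column name holding the per-row weights
df = pd.DataFrame({'Name': names, 'bias': bias})

# Selection probabilities implied by the weights: each weight divided by the sum
probabilities = df['bias'] / df['bias'].sum()
print(probabilities.round(3).tolist())

# Draw 2000 rows with replacement, passing the weights by column name
draws = df.sample(n=2000, replace=True, weights='bias', random_state=0)

# Heavily weighted rows should dominate the counts
counts = draws['Name'].value_counts()
print(counts)
```

The counts should roughly follow the printed probabilities, with OpenCV, Tensorflow, and BoofCV far ahead of the rest.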

Using the random_state Parameter

You can use the random_state parameter to make sure that the same rows are returned every time the method is called.

# Ensure that the same three rows are repeated each time the method is called
df.sample(n=3, random_state=0)
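A quick way to confirm this behavior is to call the method twice with the same seed and compare the results. This is a minimal sketch on a stand-in DataFrame; the seed value 0 is arbitrary, and any fixed integer works the same way.

```python
import pandas as pd

# Nine-row stand-in for the article's DataFrame (Name column only)
df = pd.DataFrame({'Name': ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano',
                            'Keras', 'GPUImage', 'YOLO', 'BoofCV']})

# Two calls with the same seed select the same rows in the same order
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)

print(first.equals(second))  # True
```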

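Practical Tips

When you pass True to the replace parameter, you can return more rows than the DataFrame contains, although some of the returned rows will be duplicates of others.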

print('Rows and columns present in the DataFrame:', df.shape)

df.sample(n=15, replace=True)

Rows and columns present in the DataFrame: (9, 3)
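The tip above can be checked directly. The sketch below, again on a nine-row stand-in DataFrame, confirms that an oversized sample contains repeated index labels and shows what happens when replace is left at its default.

```python
import pandas as pd

# Nine-row stand-in for the article's DataFrame (Name column only)
df = pd.DataFrame({'Name': ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano',
                            'Keras', 'GPUImage', 'YOLO', 'BoofCV']})

# 15 draws from 9 rows must repeat at least one original index label
oversampled = df.sample(n=15, replace=True)
print(oversampled.index.duplicated().any())  # True

# Without replace=True, asking for more rows than exist raises a ValueError
try:
    df.sample(n=15)
except ValueError as err:
    print('ValueError:', err)
```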

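When using the weights parameter, you can assign weights greater than 1 to rows; weights that do not sum to 1 are normalized so that they do.
The random_state parameter is useful when you want to share your code with others and make sure that the output is reproducible.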

Test Your Knowledge

Q1: The frac parameter is used to return a randomly selected fraction of the total rows of a DataFrame. True or false?

Answer: True

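Q2: How do the weights parameter and the random_state parameter differ in what they do?

Answer: The random_state parameter ensures that the output is the same every time DataFrame.sample() is called. The weights parameter increases the chance that rows with higher weights are selected, but it does not guarantee that those rows are returned on every call.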

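Q3: Write code that returns any three randomly selected rows from a DataFrame df. Make sure that the same rows are returned every time the method is called.

Answer: df.sample(n=3, random_state=0)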

Q4: You have a DataFrame df with 5 rows and 4 columns. Write code that randomly returns 10 rows from the DataFrame. The returned rows do not have to be unique.

Answer: df.sample(n=10, replace=True)

Q5: Write code that returns 47% of all the rows in a DataFrame df.

Answer: df.sample(frac=0.47)