Pandas DataFrame.duplicated() Method: An Introduction to pandas.DataFrame.duplicated

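The pandas.DataFrame.duplicated() method finds duplicate rows in a DataFrame. It returns a boolean Series that marks each row as either a duplicate or unique.

In this article, you will learn how to use this method to identify duplicate rows in a DataFrame, along with a few practical tips for working with it.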

Creating a DataFrame

# Create a DataFrame
import pandas as pd
data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],

           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee', 'Full-time Employee'],

           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}

df = pd.DataFrame(data_df)
df

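pandas.DataFrame.duplicated()

Syntax: pandas.DataFrame.duplicated(subset=None, keep='first')
Purpose: Identify duplicate rows in a DataFrame.
Parameters:
- subset: (default: None). A column label or list of labels that restricts the duplicate search to specific columns. By default, all columns are compared.
- keep: 'first', 'last', or False (default: 'first'). Specifies which instance of a duplicated row is treated as unique.
Returns: A boolean Series in which True means the row at that index is a duplicate and False means the row is unique.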
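As a minimal sketch of the signature above, the following compares a call with the defaults written out against a bare call. The three-row frame named demo is made up purely for illustration and is not part of the article's data.

```python
import pandas as pd

# Hypothetical three-row frame: the last row repeats the first
demo = pd.DataFrame({'Name': ['Arpit', 'Riya', 'Arpit'],
                     'Department': ['Administration', 'Marketing', 'Administration']})

explicit = demo.duplicated(subset=None, keep='first')  # defaults written out
implicit = demo.duplicated()                           # defaults implied

print(explicit.equals(implicit))          # the two calls return the same Series
print(explicit.dtype)                     # the result has a boolean dtype
print(explicit.index.equals(demo.index))  # one flag per row, aligned to the frame's index
```

Because the result shares the DataFrame's index, it can be used directly as a row mask.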

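Using the DataFrame.duplicated() method

When you call DataFrame.duplicated() without arguments, the default parameter values are used: all columns are compared, and the first instance of each row is treated as unique.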

# Use the DataFrame.duplicated() method to return a series of boolean values
bool_series = df.duplicated()
print(bool_series)

0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool
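Since the result is a boolean Series, it can also be summarized rather than read row by row. The sketch below rebuilds the article's DataFrame and counts the flagged rows; the variable names n_duplicates, has_duplicates, and duplicate_labels are illustrative only.

```python
import pandas as pd

data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                               'Full-time Employee'],
           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}
df = pd.DataFrame(data_df)
bool_series = df.duplicated()

n_duplicates = int(bool_series.sum())                       # each True counts as 1
has_duplicates = bool(bool_series.any())                    # is there at least one duplicate?
duplicate_labels = bool_series[bool_series].index.tolist()  # index labels of the flagged rows

print(n_duplicates, has_duplicates, duplicate_labels)  # 2 True [4, 6]
```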

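Using the keep parameter

The keep parameter specifies which instance of a duplicated row is treated as unique; the remaining instances are flagged as duplicates.

Setting keep to 'first'

The default value of keep is 'first'. With this setting, the method treats the first instance of a row as unique and flags every later instance as a duplicate.

Let's use the result to remove the duplicate rows.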

# Use the keep parameter to consider only the first instance of a duplicate row to be unique
bool_series = df.duplicated(keep='first')
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after keeping only the first instance of the duplicate rows:')

# The `~` sign is used for negation. It changes the boolean value True to False and False to True.

df[~bool_series]

Boolean series:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after keeping only the first instance of the duplicate rows:

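As you can see, the fifth and seventh rows (index 4 and 6) are identified as duplicates. The fifth row duplicates the first, and the seventh duplicates the second, so both are filtered out of the DataFrame.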
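Filtering with the negated mask gives the same rows as pandas' DataFrame.drop_duplicates() method, which the article does not otherwise cover. A short sketch on the same data, with illustrative variable names:

```python
import pandas as pd

data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                               'Full-time Employee'],
           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}
df = pd.DataFrame(data_df)

filtered = df[~df.duplicated(keep='first')]  # boolean-mask approach used in this article
dropped = df.drop_duplicates(keep='first')   # built-in shortcut

print(filtered.equals(dropped))   # True
print(filtered.index.tolist())    # [0, 1, 2, 3, 5, 7]
```

The mask approach is still handy when you want to inspect or count the duplicates before deciding whether to drop them.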


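Setting keep to 'last'

When you set keep to 'last', the method treats the last instance of a row as unique and flags every earlier instance as a duplicate.

Let's use the result to remove the duplicate rows.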

# Use the keep parameter to consider only the last instance of a duplicate row to be unique
bool_series = df.duplicated(keep='last')
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after keeping only the last instance of the duplicate rows:')

# The `~` sign is used for negation. It changes the boolean value True to False and False to True
df[~bool_series]

Boolean series:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool


DataFrame after keeping only the last instance of the duplicate rows:

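Here, the first and second rows (index 0 and 1) are identified as duplicates, while the fifth and seventh rows (index 4 and 6) are treated as unique.

Setting keep to False

If you set keep to the boolean value False, the method flags every instance of a duplicated row as a duplicate, so none of the copies is treated as unique.

Let's use the result to remove the duplicate rows.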

# Use the keep parameter to consider all instances of a row to be duplicates
bool_series = df.duplicated(keep=False)
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing all the instances of the duplicate rows:')
# The `~` sign is used for negation. It changes the boolean value True to False and False to True
df[~bool_series]

Boolean series:
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing all the instances of the duplicate rows:
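To see the three keep settings side by side, a sketch like the following can help. It rebuilds the same DataFrame; the comparison table and its column labels are illustrative, not part of the pandas API.

```python
import pandas as pd

data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                               'Full-time Employee'],
           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}
df = pd.DataFrame(data_df)

# One column of flags per keep setting
comparison = pd.DataFrame({
    "keep='first'": df.duplicated(keep='first'),
    "keep='last'": df.duplicated(keep='last'),
    "keep=False": df.duplicated(keep=False),
})
print(comparison)

# With keep=False, only rows that never repeat survive the filter
remaining = df[~df.duplicated(keep=False)]
print(remaining.index.tolist())  # [2, 3, 5, 7]
```

For this data, a row is flagged under keep=False exactly when it is flagged under 'first' or under 'last'.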

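Using the subset parameter

The subset parameter specifies which columns to search for duplicates.
Once you specify these columns, the method compares only their values across rows when deciding whether a row is a duplicate.

This is useful when you only care about duplicate values in a few columns.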

# Use the subset parameter to search for duplicate values only in the Name column of the DataFrame

bool_series = df.duplicated(subset='Name')

print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing duplicates found in the Name column:')
df[~bool_series]

Boolean series:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing duplicates found in the Name column:
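In this dataset the Name column happens to flag the same rows as a full-row comparison. As a contrasting sketch, restricting the search to the Department column flags many more rows, because departments repeat across different people. The variable names by_name and by_department are illustrative.

```python
import pandas as pd

data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                               'Full-time Employee'],
           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}
df = pd.DataFrame(data_df)

by_name = df.duplicated(subset='Name')              # compare only the Name column
by_department = df.duplicated(subset='Department')  # compare only the Department column

print(by_name.tolist())
# [False, False, False, False, True, False, True, False]
print(by_department.tolist())
# [False, False, False, True, True, True, True, True]

# Only the first row seen for each department remains
print(df[~by_department].index.tolist())  # [0, 1, 2]
```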

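Practical tips

If you do not use the subset parameter, every value in a row must match another row for it to be identified as a duplicate.
You can also pass multiple columns to the subset parameter. Keep in mind, however, that the values in all of the specified columns must match for a row to be considered a duplicate.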

bool_series = df.duplicated(subset=['Name', 'Department'])
print(bool_series)
print('\n')
print('DataFrame after removing duplicates found in the "Name" and "Department" columns:')

df[~bool_series]

0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing duplicates found in the "Name" and "Department" columns:

Although this method only returns a Series that identifies the duplicate rows, you can use that Series to subset the DataFrame so that it contains only unique rows.
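To make the multi-column tip concrete, here is a small sketch on a made-up frame (staff is hypothetical and separate from the article's data) in which a name repeats but the department does not always match.

```python
import pandas as pd

# Hypothetical data: 'Riya' appears three times, in two different departments
staff = pd.DataFrame({'Name': ['Riya', 'Riya', 'Riya'],
                      'Department': ['Marketing', 'Technical', 'Marketing']})

name_only = staff.duplicated(subset='Name')
name_and_dept = staff.duplicated(subset=['Name', 'Department'])

print(name_only.tolist())      # [False, True, True]
print(name_and_dept.tolist())  # [False, False, True]

# Row 1 survives the two-column check because its Department differs from row 0
print(staff[~name_and_dept])
```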

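Test your knowledge

Q1: Setting the keep parameter to False removes all duplicate rows from the DataFrame. True or false?

Answer: False. keep=False flags every instance of a duplicated row, but it does not remove any rows.

Q2: How are duplicate rows identified when multiple columns are passed to the subset parameter?

Answer: When multiple columns are passed to the subset parameter, the method considers a row a duplicate only if its values in all of the specified columns match those of another row.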

Q3: Write code that removes every instance of each duplicate row from the DataFrame df except the first.

Answer:

bool_series = df.duplicated(keep='first')

df[~bool_series]

Q4: Write code that removes every instance of each duplicate row from the DataFrame df except the last.

Answer:

bool_series = df.duplicated(keep='last')

df[~bool_series]

Q5: Write code that searches for duplicate values in the col_1 and col_2 columns of the DataFrame df.

Answer: df.duplicated(subset=['col_1', 'col_2'])
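The Q5 answer can be tried on a throwaway frame. The data below, including the extra col_3 column, is invented for this sketch; only the column names col_1 and col_2 come from the question.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 agree on col_1 and col_2 but differ in col_3
df = pd.DataFrame({'col_1': [1, 1, 2, 1],
                   'col_2': ['a', 'a', 'b', 'b'],
                   'col_3': [10, 20, 30, 40]})

subset_flags = df.duplicated(subset=['col_1', 'col_2'])  # compare two columns only
full_row_flags = df.duplicated()                         # compare every column

print(subset_flags.tolist())    # [False, True, False, False]
print(full_row_flags.tolist())  # [False, False, False, False]
```

Note that the column labels are passed as strings; unquoted names such as col_1 would be looked up as Python variables instead.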