Python Pandas - Find Duplicate Rows In DataFrame Based On All Or Selected Columns

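If you want to find duplicate rows in a DataFrame based on all or selected columns, use the pandas.DataFrame.duplicated() function. In data science, you will sometimes receive a messy dataset. For example, it may contain duplicate records that would skew your analysis.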

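Duplicate rows in Pandas

To find duplicate rows in a Pandas DataFrame, use the **DataFrame.duplicated()** function. **pandas.DataFrame.duplicated()** is a library method that finds duplicate rows based on all or specific columns. It returns a boolean Series, with one entry per row, whose value is True for each row marked as a duplicate.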

Syntax

The syntax of the pandas.DataFrame.duplicated() function is as follows.

DataFrame.duplicated(subset=None, keep='first')

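Parameters

subset:
- A single column label or a sequence of labels to use when checking for duplicates. If you omit it, all columns are compared to find duplicate rows.
keep:
- Determines which occurrences are marked as duplicates. Its value can be one of {'first', 'last', False}; the default is 'first'.
  - 'first': all duplicate rows are marked True except the first occurrence.
  - 'last': all duplicate rows are marked True except the last occurrence.
  - False: all duplicate rows are marked True.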
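
To make the three keep options concrete, here is a minimal sketch on a small toy DataFrame. The data and the variable names (df, mask_first, mask_last, mask_all) are illustrative assumptions and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical.
df = pd.DataFrame({"Name": ["A", "B", "A", "A"], "Seasons": [1, 2, 1, 1]})

mask_first = df.duplicated()             # keep='first' is the default
mask_last = df.duplicated(keep="last")   # the last occurrence is not marked
mask_all = df.duplicated(keep=False)     # every occurrence is marked

print(mask_first.tolist())  # [False, False, True, True]
print(mask_last.tolist())   # [True, False, True, False]
print(mask_all.tolist())    # [True, False, True, True]
```

Each mask is a boolean Series aligned with the DataFrame's index, so it can be passed directly to df[...] to select the marked rows.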

Example

Let's create a sample DataFrame that contains duplicate values.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print(dfObj)

Output

 python3 app.py
               Name  Seasons        Actor
0   Stranger Things        3       Millie
1   Game of Thrones        8       Emilia
2  La Casa De Papel        4       Sergio
3         Westworld        3  Evan Rachel
4   Stranger Things        3       Millie
5  La Casa De Papel        4       Sergio

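As you can see, the DataFrame above contains duplicate rows.

Find duplicate rows based on all columns

To find and select rows that are duplicated across all columns, call DataFrame.duplicated() without a subset argument. It returns a boolean Series that is True for every duplicate row except its first occurrence (the default value of the keep parameter is 'first'). Then pass this boolean Series to the DataFrame's [] operator to select the duplicate rows.

See the following code.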

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select the duplicate rows (every occurrence except the first)
duplicateDFRow = dfObj[dfObj.duplicated()]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio

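Here, all duplicate rows are returned except their first occurrence, because the default value of the keep parameter is 'first'.

To select all duplicate rows except the last occurrence, pass the keep parameter as **'last'**. See the following code.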

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select the duplicate rows (every occurrence except the last)
duplicateDFRow = dfObj[dfObj.duplicated(keep='last')]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
0   Stranger Things        3  Millie
2  La Casa De Papel        4  Sergio
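
The parameter list also describes keep=False, which the examples above do not demonstrate. As a minimal sketch using the same sample data (the variable name allDuplicates is an illustrative assumption), it selects every occurrence of a duplicated row:

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# keep=False marks every occurrence of a duplicated row, not just the repeats
allDuplicates = dfObj[dfObj.duplicated(keep=False)]
print(allDuplicates)
```

With this data, the selection contains the rows at index 0, 2, 4, and 5, which is useful when you want to inspect both copies of each duplicated row side by side.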

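Find duplicate rows based on selected columns

To compare rows and find duplicates based on selected columns only, pass a list of column names to the subset parameter of **DataFrame.duplicated()**. It then marks rows as duplicates by comparing only those columns.

For example, let's find and select rows based on a single column.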

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select rows whose 'Name' value has already appeared in an earlier row
duplicateDFRow = dfObj[dfObj.duplicated(['Name'])]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio

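Here, rows whose **'Name'** column repeats a value seen in an earlier row are marked as duplicates and returned.

Let's look at another example.

Find and select rows based on two column names.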

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select rows whose ('Name', 'Seasons') pair has already appeared in an earlier row
duplicateDFRow = dfObj[dfObj.duplicated(['Name', 'Seasons'])]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio
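
In both subset examples above, the result happens to match the full-row check. As a rough sketch of a case where the choice of columns changes the result, the snippet below checks only the 'Seasons' column of the same sample data. The variable names bySeasons and fullRow are illustrative assumptions.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Compare only the 'Seasons' column
bySeasons = dfObj[dfObj.duplicated(subset=['Seasons'])]

# Compare all columns, for reference
fullRow = dfObj[dfObj.duplicated()]

print(bySeasons.index.tolist())  # [3, 4, 5]
print(fullRow.index.tolist())    # [4, 5]
```

Westworld (index 3) is marked when only 'Seasons' is compared, because Stranger Things already has 3 seasons, even though the rest of the row is different.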

Conclusion

To find duplicate rows in a Pandas DataFrame, use the pandas.DataFrame.duplicated() function.

That's it for this tutorial.