Check for duplicate values in a pandas DataFrame column

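Is there a way in pandas to check whether a DataFrame column contains duplicate values, without actually dropping rows? I have a function that removes duplicate rows, but I only want it to run if there are actually duplicates in a specific column.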

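Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, there are duplicates and the code runs.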

if len(df['Student'].unique()) < len(df.index):
    # Code to remove duplicates based on Date column runs

Is there an easier or more efficient way to check whether duplicate values exist in a specific column, using pandas?

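Some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):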

    Student Date
0   Joe     December 2017
1   James   January 2018
2   Bob     April 2018
3   Joe     December 2017
4   Jack    February 2018
5   Jack    March 2018

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the dataframe above (df), a quick check for duplicates in the Student column is:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True

Further reading and references

Above we use the Series attribute is_unique and the Series method duplicated. The pandas DataFrame also has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no two rows are identical,
# i.e. (Joe, Dec 2017) and (Joe, Dec 2018) differ in Date

And a final useful tip: the keep parameter controls which occurrence in each group of duplicates is retained, so we can often get straight to the rows we need without extra steps. As documented for drop_duplicates:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
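The same keep values are accepted by duplicated, where they decide which occurrences get flagged rather than dropped. A minimal sketch of that behaviour, rebuilding the three-row frame from the table above by hand (the variable names first, last and all_dupes are only illustrative):

```python
import pandas as pd

# Same three rows as the example table above
df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# keep='first' (default): every occurrence after the first is flagged
first = df.duplicated(subset=["Student"], keep="first").tolist()

# keep='last': every occurrence before the last is flagged
last = df.duplicated(subset=["Student"], keep="last").tolist()

# keep=False: all members of a duplicate group are flagged
all_dupes = df.duplicated(subset=["Student"], keep=False).tolist()

print(first)      # [False, False, True]
print(last)       # [True, False, False]
print(all_dupes)  # [True, False, True]
```

keep=False is handy when you want to inspect every conflicting row before deciding which one to keep.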

Example to play around with

import pandas as pd
import io


data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''


df = pd.read_csv(io.StringIO(data), sep=',')


# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True


# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')


# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
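To relate this back to the question's own sample data, here is a rough sketch of "only de-duplicate when needed, and keep the row with the oldest date". It assumes the Date strings are English "Month Year" values that pd.to_datetime can parse with the format "%B %Y"; the helper column name _parsed and the variable deduped are made up for illustration, and this is not necessarily how the asker's own function works.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Jack", "Jack"],
    "Date": ["December 2017", "January 2018", "April 2018",
             "December 2017", "February 2018", "March 2018"],
})

if df["Student"].duplicated().any():
    # Parse the strings so sorting is chronological rather than alphabetical
    parsed = pd.to_datetime(df["Date"], format="%B %Y")
    deduped = (
        df.assign(_parsed=parsed)                    # hypothetical helper column
          .sort_values("_parsed", kind="mergesort")  # stable sort keeps ties in original order
          .drop_duplicates(subset=["Student"], keep="first")
          .drop(columns="_parsed")
          .sort_index()                              # restore the original row order
    )
else:
    # No duplicates in Student, nothing to remove
    deduped = df

print(deduped)
```

For this data the rows with index 0, 1, 2 and 4 remain, so Jack keeps the February 2018 row.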

In addition to DataFrame.duplicated and Series.duplicated, Pandas also has a DataFrame.any and Series.any.

import pandas as pd


df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

With Python ≥3.8, check for duplicates and access some duplicate rows:

if (duplicated := df.duplicated(keep=False)).any():
    some_duplicates = df[duplicated].sort_values(by=df.columns.to_list()).head()
    print(f"Dataframe has one or more duplicated rows, for example:\n{some_duplicates}")

You can use is_unique:

df['Student'].is_unique  # True if the column has no duplicates

Older pandas versions required:

pd.Series(df['Student']).is_unique

To see how many times each value occurs (any count above 1 means the value is duplicated), use:

df.pivot_table(index=['ColumnName'], aggfunc='size')


df.pivot_table(index=['ColumnName1',.., 'ColumnNameN'], aggfunc='size')
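As a possible follow-up, the counts can be filtered so that only the duplicated values remain. The sketch below deliberately uses groupby(...).size() rather than pivot_table, on the assumption that per-value counts are all that is needed; the data is the question's sample and the names counts and duplicated_values are illustrative.

```python
import pandas as pd

# Student column from the question's sample data
df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Jack", "Jack"],
})

# Number of rows per distinct Student value
counts = df.groupby("Student").size()

# Keep only the values that occur more than once
duplicated_values = counts[counts > 1]

print(duplicated_values.to_dict())  # {'Jack': 2, 'Joe': 2}
```

An empty result here means the column has no duplicates, so it can double as the True/False check via `not duplicated_values.empty`.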