Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.
Currently I compare the number of unique values in the column to the number of rows: if there are less unique values than rows then there are duplicates and the code runs.
if len(df['Student'].unique()) < len(df.index):
# Code to remove duplicates based on Date column runs
Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?
Some of the sample data I am working with (only two columns shown). If duplicates are found then another function identifies which row to keep (row with oldest date):
Student Date
0 Joe December 2017
1 James January 2018
2 Bob April 2018
3 Joe December 2017
4 Jack February 2018
5 Jack March 2018
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming above dataframe (df), we could do a quick check if duplicated in the Student
col by:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
These methods can be applied on the DataFrame as a whole, and not just a Serie (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
And a final useful tip. By using the keep
paramater we can normally skip a few rows directly accessing what we need:
keep : {‘first’, ‘last’, False}, default ‘first’
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
Student Date
0 Joe December 2017
1 Bob April 2018
Student Date
0 Joe December 2017
1 Bob April 2018