Drop all duplicate rows across multiple columns in Python Pandas

Jamie Bull picture Jamie Bull · May 15, 2014 · Viewed 331.3k times · Source

The pandas drop_duplicates function is great for "uniquifying" a dataframe. However, one of the keyword arguments to pass is take_last=True or take_last=False, while I would like to drop all rows which are duplicates across a subset of columns. Is this possible?

    A   B   C
0   foo 0   A
1   foo 1   A
2   foo 1   B
3   bar 1   A

As an example, I would like to drop rows which match on columns A and C so this should drop rows 0 and 1.

Answer

Ben picture Ben · Dec 14, 2015

This is much easier in pandas now with drop_duplicates and the keep parameter.

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)