Reindexing after pandas.drop_duplicates

Brebenel · Mar 5, 2015

I want to open a file, read it, drop duplicates in two of the file's columns, and then use the deduplicated data for some further calculations. To do this I am using pandas.drop_duplicates, which keeps the original index labels of the surviving rows, so the index ends up with gaps. For example, after dropping row 1, file1 becomes file2:

file1:
   Var1    Var2    Var3   Var4
0    52     2       3      89
1    65     2       3      43
2    15     1       3      78
3    33     2       4      67

file2:
   Var1    Var2    Var3   Var4
0    52     2       3      89
2    15     1       3      78
3    33     2       4      67

To further use file2 as a dataframe I need to reindex it to 0, 1, 2, ...

Here is the code I am using:

import pandas as pd

file1 = pd.read_csv("filename.txt", sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4'])
file2 = file1.drop_duplicates(["Var2", "Var3"])
# create another variable as a new index: ni
file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
file2 = file2.set_index('ni')

Although the code runs and produces the correct results, the reindexing step gives the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  file2['ni']= range(0, len(file2))

I did check the link but I cannot figure out how to change my code. Any ideas on how to fix this?

Answer

cjprybol · Aug 27, 2015

Pandas has a built-in method for exactly this task, which lets you avoid the warning entirely with a simpler approach.

Rather than adding a new column of sequential numbers and then setting the index to that column as you did with:

file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
file2 = file2.set_index('ni')

You can instead use:

file2 = file2.reset_index(drop=True)

The default behavior of .reset_index() is to take the current index, insert it as the first column of the dataframe, and then build a new sequential index (I assume the logic here is that the default makes it very easy to compare the old vs. new index, which is useful for sanity checks). drop=True means that instead of preserving the old index as a new column, it is simply discarded and replaced with the new index, which seems to be what you want.
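
For example, here is a minimal sketch on a small throwaway dataframe (not your data) showing the difference between the two variants:

import pandas as pd

# a dataframe with a "gappy" index, similar to what drop_duplicates leaves behind
df = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 2, 3])

print(df.reset_index())           # old index is kept as a new 'index' column
#    index   a
# 0      0  10
# 1      2  20
# 2      3  30

print(df.reset_index(drop=True))  # old index is discarded, new 0..n-1 index
#     a
# 0  10
# 1  20
# 2  30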

All together, your new code could look like this:

file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4']) 
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
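
If you run that on the sample data from your question (built inline here instead of being read from filename.txt, just so the snippet is self-contained), file2 comes out with the clean 0, 1, 2 index and no warning:

import pandas as pd

file1 = pd.DataFrame({'Var1': [52, 65, 15, 33],
                      'Var2': [2, 2, 1, 2],
                      'Var3': [3, 3, 3, 4],
                      'Var4': [89, 43, 78, 67]})

file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
print(file2)
#    Var1  Var2  Var3  Var4
# 0    52     2     3    89
# 1    15     1     3    78
# 2    33     2     4    67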

See this question as well