I want to open a file, read it, drop duplicates based on two of its columns, and then use the de-duplicated data for some further calculations. To do this I am using pandas.DataFrame.drop_duplicates, which keeps the original index labels of the remaining rows, leaving gaps in the index. For example, after dropping row 1, file1 becomes file2:
file1:
   Var1  Var2  Var3  Var4
0    52     2     3    89
1    65     2     3    43
2    15     1     3    78
3    33     2     4    67

file2:
   Var1  Var2  Var3  Var4
0    52     2     3    89
2    15     1     3    78
3    33     2     4    67
To further use file2 as a DataFrame, I need to reindex it to 0, 1, 2, ...
Here is the code I am using:
import pandas as pd

file1 = pd.read_csv("filename.txt", sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4'])
file2 = file1.drop_duplicates(["Var2", "Var3"])
# create another variable as a new index: ni
file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
file2 = file2.set_index('ni')
Although the code runs and produces correct results, the reindexing step gives the following warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
file2['ni'] = range(0, len(file2))
I did check the link but I cannot figure out how to change my code. Any ideas on how to fix this?
Pandas has a built-in method to accomplish this task, which lets you avoid the warning entirely with an alternative, simpler approach.
Rather than adding a new column of sequential numbers and then setting the index to that column as you did with:
file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
file2 = file2.set_index('ni')
You can instead use:
file2 = file2.reset_index(drop=True)
The default behavior of .reset_index() is to take the current index, insert it as the first column of the DataFrame, and then build a new index (presumably because this makes it easy to compare the old and new index, which is useful for sanity checks). Passing drop=True means the old index is discarded rather than preserved as a new column, and is simply replaced by the new index, which is what you want here.
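For illustration, here is a minimal, self-contained sketch (with made-up values mirroring file2 above) of what the two variants produce:

import pandas as pd

# A small frame with a gappy index, similar to file2 after drop_duplicates
df = pd.DataFrame({'Var2': [2, 1, 2], 'Var3': [3, 3, 4]}, index=[0, 2, 3])

print(df.reset_index())
#    index  Var2  Var3
# 0      0     2     3
# 1      2     1     3
# 2      3     2     4
# The old index is preserved as a new 'index' column.

print(df.reset_index(drop=True))
#    Var2  Var3
# 0     2     3
# 1     1     3
# 2     2     4
# The old index is discarded and replaced by 0, 1, 2.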
All together, your new code could look like this:
file1 = pd.read_csv("filename.txt", sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4'])
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
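As a quick check (using the example data above, with a made-up 'total' column purely for illustration), file2 now has a fresh 0-based index, and modifying it should no longer trigger the warning, because it is an independent DataFrame rather than a view of file1:

print(file2.index)  # RangeIndex(start=0, stop=3, step=1) with the example data
file2['total'] = file2['Var1'] + file2['Var4']  # e.g. a follow-up calculation; no SettingWithCopyWarning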