When selecting a sub-DataFrame from a parent DataFrame, I noticed that some programmers make a copy of it using the .copy() method. For example,
X = my_dataframe[features_list].copy()
...instead of just
X = my_dataframe[features_list]
Why are they making a copy of the data frame? What will happen if I don't make a copy?
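This expands on Paul's answer. In pandas, indexing a DataFrame can return a view of the original data rather than an independent object. Whether it does depends on the kind of indexing and on the pandas version. When the subset is a view, changing it also changes the original DataFrame, so call .copy() if you want to be sure the original stays untouched. (With Copy-on-Write enabled, which the pandas documentation describes as the default from pandas 3.0, modifying a subset no longer affects the parent, and the copy mainly serves to make the intent explicit.) Consider the following code: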
import pandas as pd
df = pd.DataFrame({'x': [1, 2]})
df_sub = df[0:1]   # row slice; may be a view of df
df_sub.x = -1      # assign through the subset
print(df)
On pandas versions where the slice is a view, you'll get:
   x
0 -1
1  2
In contrast, the following leaves df unchanged:
df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
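To tie this back to the question, here is a minimal sketch of both patterns with an explicit copy. The column names, the values, and the contents of features_list are made up for illustration. Only the explicit-copy case is shown because its result should not depend on the pandas version.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Pattern from the question: select columns, then copy.
features_list = ["x"]
X = df[features_list].copy()
X["x"] = X["x"] * 10          # modifies only the copy

# Pattern from this answer: slice rows, then copy.
df_sub_copy = df[0:1].copy()
df_sub_copy["x"] = -1         # modifies only the copy

print(df["x"].tolist())           # [1, 2]  (parent unchanged)
print(X["x"].tolist())            # [10, 20]
print(df_sub_copy["x"].tolist())  # [-1]
```

Without the .copy() calls, older pandas versions may emit a SettingWithCopyWarning on these assignments, and whether the parent changes depends on whether the subset happened to be a view.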