I use sklearn to plot the feature importance for forests of trees. The dataframe is named 'heart'. Here the code to extract the list of the sorted features:
importances = extc.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(heart_train.shape[1]):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Then I plot the list in this way:
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Feature ranking", fontsize = 20)
plt.bar(range(heart_train.shape[1]), importances[indices],
color="b",
align="center")
plt.xticks(range(heart_train.shape[1]), indices)
plt.xlim([-1, heart_train.shape[1]])
plt.ylabel("importance", fontsize = 18)
plt.xlabel("index of the feature", fontsize = 18)
and I get a plot like this:
My question is: how could I substitute the NUMBER of the feature with the NAME of the feature in order to turn the plot more understandable? I tried to convert the string containing the name of the feature (which is the name of each column of the data frame), but I cannot reach my goal.
Thanks
The problem is here:
plt.xticks(range(heart_train.shape[1]), indices)
indices
is an array of indices returned from your np.argsort(importances)[::-1]
, it doesn't have the feature names you want to appear as ticks on your X axis.
You need something like this, assuming df
is your Pandas DataFrame
feature_names = df.columns # e.g. ['A', 'B', 'C', 'D', 'E']
plt.xticks(range(heart_train.shape[1]), feature_names)