How does sklearn random forest index feature_importances_

Jason Wolosonovich · Mar 12, 2014 · Viewed 11.7k times

I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How can I return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their positions (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features.

import numpy as np

# Collect the indices of features whose importance is above the mean
important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):
        important_features.append(str(x))
print important_features

Additionally, in an effort to understand the indexing, I found out what important feature '12' actually was (it was variable x14). When I move variable x14 into the 0 index position of the training dataset and run the code again, it should then tell me that feature '0' is important, but it does not. It is as if it can no longer see that feature, and the first feature listed is the one that was second in the original run (feature '22').

I'm thinking that perhaps feature_importances_ is actually using the first column (where I have placed x14) as a sort of ID for the rest of the training dataset, and thus ignoring it when selecting important features. Can anyone shed some light on these two questions? Thank you in advance for any assistance.
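
For reference, here is a minimal sketch of the kind of check I am describing (tiny made-up data and hypothetical settings, not my real dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 3 columns, where the second column (index 1) drives the target
X = np.random.rand(100, 3)
y = (X[:, 1] > 0.5).astype(int)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=0)
rf_toy.fit(X, y)
print rf_toy.feature_importances_    # the largest value should sit at index 1

# Swap columns 0 and 1; the largest importance should now move to index 0
rf_toy.fit(X[:, [1, 0, 2]], y)
print rf_toy.feature_importances_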

EDIT
Here is how I stored the feature names:

import csv
import numpy as np

tgmc_reader = csv.reader(csvfile)   # csvfile is the already-opened dataset file
row = tgmc_reader.next()            # Header row contains the feature names
feature_names = np.array(row)


Then I loaded the datasets and target classes

tgmc_x, tgmc_y = [], []
for row in tgmc_reader:
    tgmc_x.append(row[3:])    #This says predictors start at the 4th column, columns 2 and 3 are just considered ID variables.
    tgmc_y.append(row[0])     #Target column is the first in the dataset
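
Since the predictors start at the 4th column, I assume the header array needs the same offset before it can line up with the fitted features; a sketch of that assumption:

predictor_names = feature_names[3:]   # drop the target and ID names so predictor_names[i] names the i-th predictor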


Then proceeded to split the dataset into testing and training portions.

from sklearn.cross_validation import train_test_split

x_train, x_test, y_train, y_test = train_test_split(tgmc_x, tgmc_y, test_size=.10, random_state=33)


Then fit the model

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1, criterion='entropy', max_features=2, max_depth=5, bootstrap=True, oob_score=True, n_jobs=2, random_state=33)
rf = rf.fit(x_train, y_train)


Then returned the important features

important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):
        important_features.append(x)


Then I incorporated your suggestion, which worked (thank you very much!):

important_names = feature_names[important_features > np.mean(important_features)]
print important_names


And it did indeed return variable names.

['x9' 'x10' 'x11' 'x12' 'x13' 'x15' 'x16']


So you have solved one part of my question for sure, which is awesome. But when I go back to printing the results of my important features

print important_features


It returns the following output:

[12, 22, 51, 67, 73, 75, 87, 91, 92, 106, 125, 150, 199, 206, 255, 256, 275, 309, 314, 317]


I am interpreting this to mean that it considers the 12th, 22nd, 51st, etc., variables to be the important ones. So this would be the 12th variable counting from the point where I told it to start indexing the observations at the beginning of my code:

tgmc_x.append(row[3:])
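
If that interpretation is right, I assume mapping an importance index back to a column position in the original CSV is just an offset; a quick sketch of the conversion I have in mind (zero-based positions):

# Importance index i refers to the i-th predictor, i.e. the column at
# position i + 3 of the original CSV row, since the predictors start at row[3:]
original_csv_positions = [i + 3 for i in important_features]
print original_csv_positions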


Is this interpretation correct? If it is, then when I move the 12th variable to the 4th column in the original dataset (where I told it to start reading the predictor values with the code I just referenced) and run the code again, I get the following output:

[22, 51, 66, 73, 75, 76, 106, 112, 125, 142, 150, 187, 191, 199, 250, 259, 309, 317]


This seems like it is not recognizing that variable any longer. Additionally, when I move the same variable to the 5th column in the original dataset, the output looks like this:

[1, 22, 51, 66, 73, 75, 76, 106, 112, 125, 142, 150, 187, 191, 199, 250, 259, 309, 317]


This looks like it is recognizing it again. One last thing: after I got it to return the variable names via your suggestion, it gave me a list of 7 variables, but when I return the important variables using my original code, it gives me a longer list. Why is this? Thank you again for all of your help. I really appreciate it!

Answer

Newmu · Mar 12, 2014

feature_importances_ returns an array in which each position holds the estimated importance of the feature at the same position in the training data. No sorting is done internally; it is a 1-to-1 correspondence with the columns passed to the model during training.

If you stored your feature names as a numpy array and made sure it lines up with the features passed to the model, you can take advantage of numpy boolean indexing to pull out the names:

importances = rf.feature_importances_
important_names = feature_names[importances > np.mean(importances)]
print important_names
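
If you also want the names ranked by importance, and assuming the header array is first trimmed to cover only the columns actually passed to fit (e.g. dropping the leading target/ID names), a sketch along these lines should work:

import numpy as np

# Hypothetical alignment: keep only the header entries for the predictor columns,
# matching the row[3:] slice used when the training data was built
predictor_names = feature_names[3:]

importances = rf.feature_importances_
order = np.argsort(importances)[::-1]   # column indices, sorted by decreasing importance
for idx in order[:10]:                  # top 10 features
    print predictor_names[idx], importances[idx]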