python: How to get real feature name from feature_importances

gladys0313 · May 20, 2015 · Viewed 8.7k times

I am using sklearn's random forest (ensemble.RandomForestClassifier) for classification, and feature_importances_ to find the most significant features for the classifier. My code so far:

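from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature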

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())

# DictVectorizer() and TfidfTransformer() turn the counts into feature vectors; each instance has 580 dimensions, i.e. there are 580 venue types

data = orig_ven_feat.tocsr()

le = LabelEncoder() 
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"])[0])
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels!=unlabelled_int)[0]  
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)                      
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind,:]
    test_data = data[test_ind,:]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print ("Training classifier...")
    clf.fit(train_data,labels_train)
    importances = clf.feature_importances_

The problem is that feature_importances_ gives me an array of length 580 (the same as the feature dimension), and I want to know the top 20 most important features (the 20 most important venues).

I think I need at least the indices of the 20 largest values in importances, but I don't know:

  1. How to get the indices of the top 20 values in importances

  2. How to match those indices to the real venue names ('school', 'home', ...), since I used DictVectorizer and TfidfTransformer

Any ideas? Thank you very much!

Answer

Jared Wilber · Dec 3, 2017

To get the importance of each feature name, iterate over the column names and feature_importances_ together (they are in the same order). Here df is assumed to be the pandas DataFrame the classifier was trained on:

for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
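
The question builds its features with DictVectorizer rather than a DataFrame, so there is no df.columns to zip against. Below is a minimal sketch of the same idea for that case, covering both parts of the question: np.argsort gives the indices of the largest importances, and the fitted vectorizer's feature_names_ list gives the name for each column index. TfidfTransformer only rescales values and keeps the column order, so the names still line up. The toy trips, the toy labels, and the top_features helper are made up for illustration and are not from the question.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy stand-in for the question's database (hypothetical values).
trips = [
    {"POI": ["school", "bus station", "bus station"]},
    {"POI": ["hospital", "school"]},
    {"POI": ["home", "bus station"]},
    {"POI": ["home", "hospital", "hospital"]},
]
toy_labels = [0, 1, 0, 1]

# Same pipeline shape as in the question: counts -> tf-idf -> dense array.
vectorizer = DictVectorizer()
counts = vectorizer.fit_transform([Counter(t["POI"]) for t in trips])
features = TfidfTransformer().fit_transform(counts).toarray()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, toy_labels)


def top_features(importances, names, k=20):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1][:k]  # column indices, largest first
    return [(names[i], float(importances[i])) for i in order]


# feature_names_ lists the DictVectorizer keys in column order.
names = vectorizer.feature_names_
top = top_features(clf.feature_importances_, names, k=3)
for name, importance in top:
    print("feature: {f}, importance: {i:.4f}".format(f=name, i=importance))
```

With the question's own objects this would be top_features(clf.feature_importances_, feat_loc_vectorizer.feature_names_, k=20). Inside a cross-validation loop the importances differ per fold, so it may make sense to collect them and average before ranking.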