I am using Python's sklearn
random forest (ensemble.RandomForestClassifier
) to do classification and am using feature_importances_
to find significant feature for the classifier. Now my code is:
for trip in database:
venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}),actually key is the feature
feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)
orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# so DictVectorizer() and TfidfTransformer() help me to phrase the features and for each instance, the feature dimension is 580, which means that there are 580 venue types
data = orig_ven_feat.tocsr()
le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
unlabelled_int = int(le.transform(["Unlabelled"]))
else:
unlabelled_int = -1
valid_rows_idx = np.where(labels!=unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification
clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
train_data = data[train_ind,:]
test_data = data[test_ind,:]
labels_train = labels[train_ind]
labels_test = labels[test_ind]
print ("Training classifier...")
clf.fit(train_data,labels_train)
importances = clf.feature_importances_
Now the problem is that, I get an array of dimension 580 (same as feature dimension) when I use feature_importances, I want to know the top 20 important features (top 20 important venues)
I think at least what I should know is the indices of the 20 biggest number from importances, but I don't know:
How to get indices of top 20 from importances
Since I used Dictvectorizer and TfidfTransformer so I don't know how to match the indices with the real venue names ('school', 'home',....)
Any idea to help me? Thank you very much!
To get the importance for each feature name, just iterate through the columns names and feature_importances together (they map to each other):
for feat, importance in zip(df.columns, clf.feature_importances_):
print 'feature: {f}, importance: {i}'.format(f=feat, i=importance)