I have a dataset of reviews with a class label of positive/negative. I am applying Logistic Regression to this reviews dataset. First, I convert the reviews into a Bag of Words representation. Here sorted_data['Text'] contains the reviews and final_counts is a sparse matrix:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)
Then I split the dataset into train, cross-validation, and test sets:
from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(standardized_data, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)
I am applying the Logistic Regression algorithm as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)  # note: C is the inverse of the regularization strength
# fitting the model
log_reg_optimal.fit(X_tr, y_tr)
# predict the response
pred = log_reg_optimal.predict(X_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))
My weights are:
weights = log_reg_optimal.coef_  # <class 'numpy.ndarray'>
array([[-0.23729528, -0.16050616, -0.1382504 , ..., 0.27291847,
0.35857267, 0.41756443]])
(1, 38178) #shape of weights
I want to get the feature importance, i.e., the top 100 features which have the highest weights. Could anyone tell me how to get them?
One way to investigate the "influence" or "importance" of a given feature / parameter in a linear classification model is to consider the magnitude of the coefficients.
This is the most basic approach. Other techniques for finding feature importance or parameter influence, such as p-values, bootstrap scores, or various "discriminative indices", can provide more insight.
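For instance, a rough bootstrap check of coefficient stability could look like this. This is only a minimal sketch, assuming X_tr and y_tr from your split above; the number of resamples n_boot and max_iter are arbitrary choices:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def bootstrap_coefficients(X, y, n_boot=20, C=0.001, random_state=0):
    """Refit the model on bootstrap resamples and collect the coefficients."""
    rng = np.random.RandomState(random_state)
    coefs = []
    for _ in range(n_boot):
        X_bs, y_bs = resample(X, y, random_state=rng)
        model = LogisticRegression(C=C, max_iter=1000).fit(X_bs, y_bs)
        coefs.append(model.coef_.ravel())
    return np.array(coefs)  # shape: (n_boot, n_features)

# coefs = bootstrap_coefficients(X_tr, y_tr)
# coefs.std(axis=0) then gives a rough per-feature stability estimate

Features whose coefficients keep the same sign and similar magnitude across resamples are more trustworthy than features whose coefficients fluctuate.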
Here, since you have standardized the data, you can compare the coefficient magnitudes directly:
import numpy as np

weights = log_reg_optimal.coef_
abs_weights = np.abs(weights)
print(abs_weights)
If you look at the original (signed) weights, a negative coefficient means that a higher value of the corresponding feature pushes the classification towards the negative class.
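As a quick illustration of the sign interpretation (a minimal sketch reusing weights from above; the top-10 cutoff is arbitrary), you can separate the features that push towards each class:

signed = weights.ravel()
order = np.argsort(signed)        # most negative first, most positive last
most_negative = order[:10]        # column indices pushing towards the negative class
most_positive = order[::-1][:10]  # column indices pushing towards the positive class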
EDIT 1
Example showing how to obtain the feature names:
import numpy as np

# feature names
names_of_variables = np.array(['a', 'b', 'c', 'd'])

# create random (possibly negative) weights and get their magnitude
weights = np.random.randn(4)
abs_weights = np.abs(weights)

# get the sorting indices (largest magnitude first)
sorted_index = np.argsort(abs_weights)[::-1]

# check that the sorting indices are correct
print(abs_weights[sorted_index])

# get the indices of the top-2 features
top_2 = sorted_index[:2]

# get the names of the top 2 most important features
print(names_of_variables[top_2])
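Applied to your own pipeline, the same recipe gives the top 100 features. A minimal sketch, assuming a recent scikit-learn where CountVectorizer exposes get_feature_names_out() (older versions use get_feature_names()), and reusing count_vect and log_reg_optimal from your code:

import numpy as np

feature_names = np.array(count_vect.get_feature_names_out())

weights = log_reg_optimal.coef_.ravel()  # shape (38178,)
abs_weights = np.abs(weights)

# indices of the 100 largest-magnitude coefficients
top_100 = np.argsort(abs_weights)[::-1][:100]

for name, w in zip(feature_names[top_100], weights[top_100]):
    print('%-20s %+.4f' % (name, w))

Since your data was standardized, these magnitudes are comparable across features; if you only want the words that push towards the positive class, sort the signed weights instead of their absolute values.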