How to get feature Importance in naive bayes?

merkle picture merkle · May 25, 2018 · Viewed 18.4k times · Source

I have a dataset of reviews which has a class label of positive/negative. I am applying Naive Bayes to that reviews dataset. Firstly, I am converting into Bag of words. Here sorted_data['Text'] is reviews and final_counts is a sparse matrix

count_vect = CountVectorizer() 
final_counts = count_vect.fit_transform(sorted_data['Text'].values)

I am splitting the data into train and test dataset.

X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0)

I am applying the naive bayes algorithm as follows

optimal_alpha = 1
NB_optimal = BernoulliNB(alpha=optimal_aplha)

# fitting the model
NB_optimal.fit(X_tr, y_tr)

# predict the response
pred = NB_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the NB classifier for k = %d is %f%%' % (optimal_aplha, acc))

Here X_test is test dataset in which pred variable gives us whether the vector in X_test is positive or negative class.

The X_test shape is (54626 rows, 82343 dimensions)

length of pred is 54626

My question is I want to get the words with highest probability in each vector so that I can get to know by the words that why it predicted as positive or negative class. Therefore, how to get the words which have highest probability in each vector?

Answer

piman314 picture piman314 · May 25, 2018

You can get the important of each word out of the fit model by using the coefs_ or feature_log_prob_ attributes. For example

neg_class_prob_sorted = NB_optimal.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = NB_optimal.feature_log_prob_[1, :].argsort()[::-1]

print(np.take(count_vect.get_feature_names(), neg_class_prob_sorted[:10]))
print(np.take(count_vect.get_feature_names(), pos_class_prob_sorted[:10]))

Prints the top 10 most predictive words for each of your classes.