How to use tf-idf with Naive Bayes?

POOJA GUPTA · May 24, 2016 · Viewed 11.5k times

In searching for an answer to the question I am posting here, I have found many links that propose solutions but do not explain exactly how this is to be done. I have explored, for example, the following links:

Link 1

Link 2

Link 3

Link 4

etc.

Therefore, I am presenting my understanding of how the Naive Bayes formula can be combined with tf-idf, which is as follows:

Naive Bayes formula:

P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes)

where total_unique_words_in_all_classes is basically the vocabulary of the entire training set.

tf-idf weighting can be employed in the above formula as:

word_count_in_class : sum of the tf-idf weights of the word over all documents belonging to that class (i.e., the raw counts are replaced by the tf-idf weights of the same word, calculated for every document within that class).

total_words_in_class : sum of the tf-idf weights of all words belonging to that class.

total_unique_words_in_all_classes : unchanged.
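
For concreteness, here is a minimal sketch in plain Python of the substitution I have in mind (no libraries, matching my implementation constraint; tfidf[d] is assumed to be a precomputed dict mapping each word in document d to its tf-idf weight, and all names are illustrative):

    def word_weight_in_class(word, class_docs, tfidf):
        """Sum of the word's tf-idf weights over all documents in the class."""
        return sum(tfidf[d].get(word, 0.0) for d in class_docs)

    def total_weight_in_class(class_docs, tfidf):
        """Sum of the tf-idf weights of all words in the class's documents."""
        return sum(sum(tfidf[d].values()) for d in class_docs)

    def p_word_given_class(word, class_docs, tfidf, vocab_size):
        # Same Laplace-smoothed formula as above, with raw counts
        # replaced by tf-idf weights.
        numerator = word_weight_in_class(word, class_docs, tfidf) + 1.0
        denominator = total_weight_in_class(class_docs, tfidf) + vocab_size
        return numerator / denominator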

This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that provides built-in functions for Naive Bayes and tf-idf. What I actually want is to improve the accuracy (currently 30%) of a model trained with a Naive Bayes classifier, so if there are better ways to achieve good accuracy, suggestions are welcome.

Please advise; I am new to this domain.

Answer

jrhee17 · May 24, 2016

It would be better if you gave us the exact features and classes you would like to use, or at least an example. Since none of those have been given concretely, I'll assume the following is your problem:

  1. You have a number of documents, each of which has a number of words.
  2. You would like to classify documents into categories.
  3. Your feature vector consists of all possible words in all documents, and its values are the word counts in each document.

Your Solution

The tf-idf formulation you gave is the following:

word_count_in_class : sum of the tf-idf weights of the word over all documents belonging to that class (i.e., the raw counts are replaced by the tf-idf weights of the same word, calculated for every document within that class).

total_words_in_class : sum of the tf-idf weights of all words belonging to that class.

Your approach sounds reasonable. The probabilities still sum to 1 regardless of how the tf-idf weights are computed: summing (weight + 1) / (total_weight_in_class + vocabulary_size) over all vocabulary words gives (total_weight_in_class + vocabulary_size) / (total_weight_in_class + vocabulary_size) = 1. The features simply reflect tf-idf values instead of raw counts, so this looks like a solid way to incorporate tf-idf into NB.

Another Potential Solution

It took me a while to wrap my head around this problem, mainly because of having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes sidesteps this issue entirely.

If you wanted to use this method:

  1. Compute the mean and variance of the tf-idf values for each class.
  2. Compute the likelihood P(feature|class) from a Gaussian distribution with the above mean and variance.
  3. Proceed as normal: multiply by the prior and predict.

Hard-coding this shouldn't be too hard, since the Gaussian density is a one-liner with numpy. I just prefer this type of generic solution for these types of problems.
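
A rough sketch of those three steps with numpy might look like this (the variable names and the small variance-smoothing term are my own assumptions, not from any particular library; X is an (n_docs, n_features) matrix of tf-idf values and y holds the class labels):

    import numpy as np

    def fit_gaussian_nb(X, y):
        """Estimate per-class mean, variance, and log prior from tf-idf features."""
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            params[c] = (Xc.mean(axis=0),           # per-feature mean
                         Xc.var(axis=0) + 1e-9,     # per-feature variance (smoothed to avoid /0)
                         np.log(len(Xc) / len(X)))  # log prior
        return params

    def predict(X, params):
        """Score each class in log space and return the argmax per document."""
        classes = list(params.keys())
        scores = []
        for c in classes:
            mu, var, log_prior = params[c]
            # Log of the Gaussian density summed over features; log space
            # avoids underflow when multiplying many small densities.
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            scores.append(log_prior + log_lik)
        return np.array(classes)[np.argmax(scores, axis=0)]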

Additional Methods to Increase Accuracy

Apart from the above, you could also use the following techniques to increase accuracy:

  1. Preprocessing:

    1. Feature reduction (usually NMF, PCA, or LDA)
    2. Additional features
  2. Algorithm:

    Naive Bayes is fast, but it inherently performs worse than many other algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as an SVM or logistic regression (see the sketch after this list).

  3. Misc.

    Bootstrapping, boosting, etc. Be careful not to overfit, though.
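
If you are open to using a library for the discriminative-model route mentioned above, a minimal scikit-learn pipeline might look like the following (the component choices and numbers are illustrative, not tuned to your data; docs and labels are placeholders for your own corpus):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # tf-idf -> dimensionality reduction -> discriminative classifier
    pipeline = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=100),   # feature reduction; works on sparse tf-idf matrices
        LogisticRegression(max_iter=1000),
    )
    # pipeline.fit(docs, labels)
    # predictions = pipeline.predict(new_docs)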

Hopefully this was helpful. Leave a comment if anything was unclear.