Naive Bayes vs. SVM for classifying text data

Ryan picture Ryan · Feb 12, 2016 · Viewed 25.4k times · Source

I'm working on a problem that involves classifying a large database of texts. The texts are very short (think 3-8 words each) and there are 10-12 categories into which I wish to sort them. For the features, I'm simply using the tf–idf frequency of each word. Thus, the number of features is roughly equal to the number of words that appear overall in the texts (I'm removing stop words and some others).

In trying to come up with a model to use, I've had the following two ideas:

  • Naive Bayes (likely the sklearn multinomial Naive Bayes implementation)
  • Support vector machine (with stochastic gradient descent used in training, also an sklearn implementation)

I have built both models, and am currently comparing the results.

What are the theoretical pros and cons to each model? Why might one of these be better for this type of problem? I'm new to machine learning, so what I'd like to understand is why one might do better.

Many thanks!

Answer

Horia Coman picture Horia Coman · Feb 12, 2016

The biggest difference between the models you're building from a "features" point of view is that Naive Bayes treats them as independent, whereas SVM looks at the interactions between them to a certain degree, as long as you're using a non-linear kernel (Gaussian, rbf, poly etc.). So if you have interactions, and, given your problem, you most likely do, an SVM will be better at capturing those, hence better at the classification task you want.

The consensus for ML researchers and practitioners is that in almost all cases, the SVM is better than the Naive Bayes.

From a theoretical point of view, it is a little bit hard to compare the two methods. One is probabilistic in nature, while the second one is geometric. However, it's quite easy to come up with a function where one has dependencies between variables which are not captured by Naive Bayes (y(a,b) = ab), so we know it isn't an universal approximator. SVMs with the proper choice of Kernel are (as are 2/3 layer neural networks) though, so from that point of view, the theory matches the practice.

But in the end it comes down to performance on your problem - you basically want to choose the simplest method which will give good enough results for your problem and have a good enough performance. Spam detection has been famously solvable by just Naive Bayes, for example. Face recognition in images by a similar method enhanced with boosting etc.