I am using the scikit-learn Multinomial Naive Bayes classifier for binary text classification (the classifier tells me whether a document belongs to category X or not). I use a balanced dataset to train my model and a balanced test set to test it, and the results are very promising.
This classifier needs to run in real time and constantly analyze documents thrown at it randomly.
However, when I run my classifier in production, the number of false positives is very high, and I therefore end up with very low precision. The reason is simple: the classifier encounters many more negative samples in the real-time scenario (around 90% of the time), and this does not match the ideal balanced dataset I used for training and testing.
Is there a way I can simulate this real-time case during training, or are there any tricks I can use (including pre-processing the documents to check whether they are suitable for the classifier)?
I was planning to train my classifier on an imbalanced dataset with the same proportions as the real-time case, but I am afraid that might bias Naive Bayes towards the negative class and lose the recall I currently have on the positive class.
Any advice is appreciated.
You have encountered one of the problems of classification with a highly imbalanced class distribution. I have to disagree with those who state that the problem lies with the Naive Bayes method, and I'll provide an explanation that should hopefully illustrate what the problem is.
Imagine your false positive rate is 0.01, and your true positive rate is 0.9. This means your false negative rate is 0.1 and your true negative rate is 0.99.
Imagine an idealised test scenario with 100 test cases from each class. You'll get (in expectation) 1 false positive and 90 true positives. Great! Precision on your positive class is 90 / (90 + 1) ≈ 99%!
Now imagine there are 1,000 times more negative examples than positive ones: the same 100 positive examples at test time, but now 1,000,000 negative examples. You still get the same 90 true positives, but also 0.01 × 1,000,000 = 10,000 false positives. Disaster! Your precision is now almost zero: 90 / (90 + 10,000) ≈ 0.9%.
The point here is that the performance of the classifier hasn't changed: the false positive and true positive rates remained constant, but the class balance changed, and your precision figures dived as a result.
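The arithmetic above can be sketched in a few lines (the 0.9 TPR and 0.01 FPR figures are the illustrative values from this example, not measurements):

```python
# Precision collapse under class imbalance: TPR and FPR stay fixed,
# only the number of negatives changes.
def precision(tpr, fpr, n_pos, n_neg):
    tp = tpr * n_pos   # expected true positives
    fp = fpr * n_neg   # expected false positives
    return tp / (tp + fp)

# Balanced test set: 100 positives, 100 negatives
print(precision(0.9, 0.01, 100, 100))        # ~0.989
# Production-like mix: 100 positives, 1,000,000 negatives
print(precision(0.9, 0.01, 100, 1_000_000))  # ~0.0089
```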
What to do about it is harder. If your scores are separable but the decision threshold is wrong, look at the ROC curve over thresholds on the posterior probability and see whether there is a threshold that gives you the kind of performance you want. If your scores are not separable, try a number of different classifiers and see whether you can find one where they are. Logistic regression is pretty much a drop-in replacement for Naive Bayes; you may also want to experiment with non-linear classifiers such as a neural net or a non-linear SVM, however, as you can often end up with non-linear boundaries delineating the space of a very small class.
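A minimal sketch of the threshold-tuning idea, using scikit-learn's `precision_recall_curve` on the posterior probabilities. The data here is synthetic Poisson "word counts" (a stand-in for your real document features), and the 0.9 precision target is an arbitrary example:

```python
# Sketch: pick a decision threshold on P(positive | doc) instead of the
# default 0.5, using a validation set with a production-like class mix.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n_pos, n_neg = 200, 2000                      # imbalanced, like production
X_pos = rng.poisson(3.0, size=(n_pos, 20))    # toy stand-in for word counts
X_neg = rng.poisson(1.0, size=(n_neg, 20))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n_pos + [0] * n_neg)

clf = MultinomialNB().fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # posterior P(positive | doc)

prec, rec, thresholds = precision_recall_curve(y, probs)
# e.g. find the lowest threshold that achieves precision >= 0.9
ok = prec[:-1] >= 0.9
if ok.any():
    t = thresholds[ok][0]
    print(f"threshold {t:.3f} gives precision {prec[:-1][ok][0]:.3f}")
```

In practice you would fit on a training split and pick the threshold on a held-out validation split with the real-world class proportions, rather than on the training data as in this toy.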
To simulate this effect from a balanced test set, you can simply multiply instance counts by an appropriate factor in the contingency table. For instance, if your negative class is 10x the size of the positive one, make every negative instance in testing add 10 counts to the contingency table instead of 1.
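This reweighting can be done directly with the `sample_weight` argument of scikit-learn's `confusion_matrix`. The labels and predictions below are toy stand-ins for your balanced test set, and the 10:1 ratio is the example figure from above:

```python
# Sketch: simulate a 10:1 negative:positive production mix from a balanced
# test set by letting each negative instance count 10x in the table.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # toy balanced test labels
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # toy predictions

weights = np.where(y_true == 0, 10.0, 1.0)    # every negative counts 10x
cm = confusion_matrix(y_true, y_pred, sample_weight=weights)
tn, fp, fn, tp = cm.ravel()
print(f"precision under a 10:1 mix: {tp / (tp + fp):.3f}")
```

The one misclassified negative becomes 10 false positives in the weighted table, so precision drops from 3/4 on the balanced set to 3/13 ≈ 0.231 under the simulated mix.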
I hope that's of some help at least understanding the problem you're facing.