scikit-learn .predict() default threshold

ADJ · Nov 14, 2013

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability.

In a binary classification problem, is scikit's classifier.predict() using 0.5 by default? If it doesn't, what's the default method? If it does, how do I change it?

In scikit some classifiers have the class_weight='auto' option, but not all do. With class_weight='auto', would .predict() use the actual population proportion as a threshold?

What would be the way to do this in a classifier like MultinomialNB that doesn't support class_weight, other than using predict_proba() and then calculating the classes myself?

Answer

Fred Foo · Nov 15, 2013

is scikit's classifier.predict() using 0.5 by default?

In probabilistic classifiers, yes. It's the only sensible threshold from a mathematical viewpoint, as others have explained.
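A quick way to check this on your own model is to compare predict() against a manual 0.5 cut on predict_proba(). The sketch below is only an illustration: LogisticRegression, the synthetic data from make_classification, and the roughly 5% positive rate are all assumptions chosen to mirror the question, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary data (about 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba holds P(y = 1 | x) because clf.classes_ is [0, 1].
proba_pos = clf.predict_proba(X)[:, 1]
manual = (proba_pos > 0.5).astype(int)

# Labels as returned by the classifier itself.
labels = clf.predict(X)

# True when predict() matches the manual 0.5 threshold on every sample.
agree = np.array_equal(manual, labels)
print(agree)
```

If the two agree, predict() is picking the class with the higher predicted probability, which in the binary case is the same as thresholding at 0.5.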

What would be the way to do this in a classifier like MultinomialNB that doesn't support class_weight?

You can set the class_prior, which is the prior probability P(y) per class y. That effectively shifts the decision boundary. E.g.

# minimal dataset
>>> from sklearn.naive_bayes import MultinomialNB
>>> X = [[1, 0], [1, 0], [0, 1]]
>>> y = [0, 0, 1]
# use empirical prior, learned from y
>>> MultinomialNB().fit(X, y).predict([[1, 1]])
array([0])
# use custom prior to make 1 more likely
>>> MultinomialNB(class_prior=[.1, .9]).fit(X, y).predict([[1, 1]])
array([1])
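For comparison, here is the route the question mentions: keep the empirical prior and apply your own cut-off to predict_proba(). This is a minimal sketch on the same toy data; predict_with_threshold is a hypothetical helper written for this example, not a scikit-learn function, and the 0.3 threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Same toy dataset as above.
X = [[1, 0], [1, 0], [0, 1]]
y = [0, 0, 1]
clf = MultinomialNB().fit(X, y)


def predict_with_threshold(model, samples, threshold):
    """Hypothetical helper: label a sample 1 when P(y = 1 | x) >= threshold."""
    # Column 1 corresponds to class 1 because model.classes_ is [0, 1].
    proba_pos = model.predict_proba(samples)[:, 1]
    return (proba_pos >= threshold).astype(int)


# Posterior of class 1 for the sample [1, 1] under the empirical prior.
p1 = clf.predict_proba([[1, 1]])[0, 1]

default_label = predict_with_threshold(clf, [[1, 1]], 0.5)  # same as predict()
lowered_label = predict_with_threshold(clf, [[1, 1]], 0.3)  # favours class 1

print(p1, default_label, lowered_label)
```

With default smoothing the posterior for class 1 on this sample is 16/43 (about 0.37), so the 0.5 cut gives class 0 and the 0.3 cut gives class 1. Setting class_prior changes the model's probabilities themselves, whereas this approach leaves the model alone and moves only the decision rule.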