Predict classes or class probabilities?

Rahul picture Rahul · Jul 16, 2018 · Viewed 7.7k times · Source

I am currently using H2O for a classification problem dataset. I am testing it out with H2ORandomForestEstimator in a python 3.6 environment. I noticed the results of the predict method was giving values between 0 to 1(I am assuming this is the probability).

In my data set, the target attribute is numeric i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, I was still getting the same result.

Then I modified to the code to convert the target column to factor using asfactor() method on the H2OFrame still, there wasn't any change on the result.

But when I changed the values in the target attribute to True and False for 1 and 0 respectively, I was getting the expected result(i.e) the output was the classification rather than the probability.

  • What is the right way to get the classified prediction result?
  • If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

Answer

desertnaut picture desertnaut · Jul 19, 2018

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Classification probability threshold:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.

To be honest, I am rather surprised by the behavior of H2O you report (I haven't use it personally), i.e. that the kind of the output is affected by the representation of the input; this should not be the case, and if it is indeed, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have been computed already before).

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).