What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

Hereme picture Hereme · Jun 19, 2016 · Viewed 10.9k times · Source

After I trained a LogisticRegressionModel, I transformed the test data DF with it and get the prediction DF. And then when I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and featrues mean, but how should I understand rawPrediction|probability|prediction?

Answer

StephenBoesch picture StephenBoesch · Jun 19, 2016

Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563

RawPrediction is typically the direct probability/confidence calculation. From Spark docs:

Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

The Prediction is the result of finding the statistical mode of the rawPrediction - viaargmax`:

  protected def raw2prediction(rawPrediction: Vector): Double =
          rawPrediction.argmax

The Probability is the conditional probability for each class. Here is the scaladoc:

Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.

The actual calculation depends on which Classifier you are using.

DecisionTree

Normalize a vector of raw predictions to be a multinomial probability vector, in place.

It simply sums by class across the instances and then divides by the total instance count.

 class_k probability = Count_k/Count_Total

LogisticRegression

It uses the logistic formula

 class_k probability: 1/(1 + exp(-rawPrediction_k))

Naive Bayes

 class_k probability = exp(max(rawPrediction) - rawPrediction_k)

Random Forest

 class_k probability = Count_k/Count_Total