After I trained a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?
Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563
RawPrediction is typically the direct confidence score the model computes for each class, before it is normalized into a probability. From the Spark docs:
Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
The Prediction is the statistical mode of the rawPrediction, that is, the index of the class with the largest value, found via argmax:
    protected def raw2prediction(rawPrediction: Vector): Double =
      rawPrediction.argmax
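To see what that argmax does without pulling in Spark, here is a minimal sketch that assumes the raw prediction is just a plain array of per-class scores. The object name RawToPrediction and the sample scores are made up for illustration; they are not part of Spark.

```scala
// Hypothetical stand-in for Spark's raw2prediction, working on a plain array.
object RawToPrediction {
  // Return the index of the largest raw score as a Double,
  // matching the Double type of Spark's prediction column.
  def raw2prediction(rawPrediction: Array[Double]): Double = {
    require(rawPrediction.nonEmpty, "rawPrediction must not be empty")
    rawPrediction.indices.maxBy(i => rawPrediction(i)).toDouble
  }
}
```

With scores of -1.2, 3.4, and 0.5 for classes 0, 1, and 2, the largest score sits at index 1, so the prediction is 1.0.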
The Probability is the conditional probability for each class. Here is the scaladoc:
Estimate the probability of each class given the raw prediction, doing the computation in-place. These predictions are also called class conditional probabilities.
The actual calculation depends on which Classifier you are using.
DecisionTree
Normalize a vector of raw predictions to be a multinomial probability vector, in place.
It simply sums by class across the instances and then divides by the total instance count.
class_k probability = Count_k/Count_Total
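As a rough sketch of that normalization, assuming the raw prediction is an array of per-class instance counts (the object name CountsToProbability is hypothetical, and this is not Spark's actual source):

```scala
// Hypothetical helper: turn per-class counts into class probabilities.
object CountsToProbability {
  // Divide each class count by the total count so the result sums to 1.
  def normalize(classCounts: Array[Double]): Array[Double] = {
    val total = classCounts.sum
    require(total > 0.0, "need at least one counted instance")
    classCounts.map(_ / total)
  }
}
```

For counts of 6, 3, and 1, this gives probabilities of roughly 0.6, 0.3, and 0.1.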
LogisticRegression
It uses the logistic formula:
class_k probability = 1/(1 + exp(-rawPrediction_k))
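The logistic formula is easy to try on its own. This is a small sketch that applies it to a single raw score; the object name LogisticProbability is hypothetical, and it only illustrates the formula rather than reproducing Spark's implementation.

```scala
// Hypothetical helper illustrating the logistic formula on one raw score.
object LogisticProbability {
  // Map a real-valued raw score into the open interval (0, 1).
  def sigmoid(rawPrediction: Double): Double =
    1.0 / (1.0 + math.exp(-rawPrediction))
}
```

A raw score of 0 maps to a probability of 0.5, positive scores map above 0.5, and negative scores map below it.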
Naive Bayes
class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))
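Here is a sketch of that kind of calculation, assuming the raw predictions are log-scale scores: subtract the maximum so every exponent is non-positive (which avoids overflow), exponentiate, then divide by the sum so the values add up to 1. The object name LogScoresToProbability is hypothetical and this is not copied from Spark's source.

```scala
// Hypothetical helper: convert log-scale raw scores into probabilities.
object LogScoresToProbability {
  def normalize(rawPrediction: Array[Double]): Array[Double] = {
    require(rawPrediction.nonEmpty, "rawPrediction must not be empty")
    // Shift by the largest score so the biggest exponent is exactly 0.
    val maxLog = rawPrediction.max
    val scaled = rawPrediction.map(r => math.exp(r - maxLog))
    // Normalize so the result sums to 1.
    val total = scaled.sum
    scaled.map(_ / total)
  }
}
```

If the raw scores are log(0.2), log(0.6), and log(0.2), the result comes back as approximately 0.2, 0.6, and 0.2.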
Random Forest
class_k probability = Count_k/Count_Total