I'm using Spark 2.0.1 with Python. My dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and only two labels.
My question is: is it possible to get not only the labels, but also (or only) the probability for each label? That is, not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I only used MLP because I know it should be capable of returning probabilities, but I can't find how to get them in PySpark.
I found a similar question, "How to get classification probabilities from MultilayerPerceptronClassifier?", but it uses Java, and the solution suggested there doesn't work in Python.
Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities. A number of other classifiers do, however, including Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last of these:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)
# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|
# +---------+--------------------+--------------------+----------+
rf_result = rf_model.transform(test)
# +---------+-------------+--------------------+----------+
# | features|rawPrediction|         probability|prediction|
# +---------+-------------+--------------------+----------+
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0|
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0|
# +---------+-------------+--------------------+----------+
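The probability column in the results above holds a vector with one entry per class, so getting a single number (say, the probability of label 1) takes one more step. A minimal sketch of one way to do this with a UDF follows; the helper names class_probability and with_class_probability and the output column name p1 are made up for illustration and are not part of Spark:

```python
def class_probability(vector, class_index=1):
    """Return the probability of a single class from a probability vector.

    Works on anything indexable: a pyspark.ml.linalg vector or a plain list.
    """
    return float(vector[class_index])


def with_class_probability(df, class_index=1, output_col="p1"):
    """Add a DoubleType column holding probability[class_index].

    Assumes df has a "probability" column, as produced by
    lr_model.transform(...) or rf_model.transform(...).
    """
    # Imported here so class_probability stays usable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    extract = udf(lambda v: class_probability(v, class_index), DoubleType())
    return df.withColumn(output_col, extract(df["probability"]))
```

With the DataFrames from the example, with_class_probability(lr_result).select("features", "p1").show() should then list the probability of label 1 for each test row.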
For MLlib, see my answer here; for several undocumented & counter-intuitive features of PySpark classification, see my relevant blog post.
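As a side note on reading the output tables, the probability column appears to be derived from rawPrediction: for binary logistic regression by applying the logistic function to the margin, and for the random forest by normalizing the raw vector so it sums to 1. The sketch below reproduces the numbers shown above in plain Python. It assumes that the binary logistic regression raw vector has the form [-margin, margin] (only the first, truncated entry is visible in the table), and the function names are illustrative only:

```python
import math


def lr_probabilities(raw_prediction):
    """Binary logistic regression: sigmoid of the margin for class 1."""
    margin = raw_prediction[1]  # assumed layout: [-margin, margin]
    p1 = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p1, p1]


def rf_probabilities(raw_prediction):
    """Random forest: normalize the raw per-class scores to sum to 1."""
    total = sum(raw_prediction)
    return [v / total for v in raw_prediction]


# First test row; the raw value is the truncated one shown in the table.
print(lr_probabilities([0.98941878916476, -0.98941878916476]))
print(rf_probabilities([1.0, 2.0]))
```

This also explains why both random forest rows get prediction 1.0: the second entry of [1.0,2.0] is the larger one, so class 1 has the higher probability.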