When using SparkML to predict labels the result Dataframe is:
scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.6,0.4]| 1.0|
| [0.6,0.4]| 1.0|
| [1.0,0.0]| 1.0|
| [0.9,0.1]| 1.0|
| [0.9,0.1]| 1.0|
| [1.0,0.0]| 1.0|
| [1.0,0.0]| 1.0|
+-----------+--------------+
only showing top 20 rows
I want to create a new DataFrame with an added column named prob that holds the first value of the Vector in the probability column of the original DataFrame, e.g.:
+-----------+--------------+----------+
|probability|predictedLabel| prob |
+-----------+--------------+----------+
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.6,0.4]| 1.0| 0.6|
| [0.6,0.4]| 1.0| 0.6|
| [1.0,0.0]| 1.0| 1.0|
| [0.9,0.1]| 1.0| 0.9|
| [0.9,0.1]| 1.0| 0.9|
| [1.0,0.0]| 1.0| 1.0|
| [1.0,0.0]| 1.0| 1.0|
+-----------+--------------+----------+
How can I extract this value into a new column?
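You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need. For an array column:
result.withColumn("prob", $"probability".getItem(0))
This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0; we are computer scientists after all) in the array.
Note that getItem works on array and map columns. The probability column produced by Spark ML is an ML Vector, so it has to be converted to an array first. On Spark 3.0 and later, import org.apache.spark.ml.functions.vector_to_array and write:
result.withColumn("prob", vector_to_array($"probability").getItem(0))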
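For a Vector column such as the probability output of a Spark ML classifier, the following is a minimal sketch rather than a definitive implementation. It assumes Spark 3.0 or later, where org.apache.spark.ml.functions.vector_to_array is available. The object name FirstProbability, the helper withProb, and the two sample rows are hypothetical, made up only to keep the example self-contained.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FirstProbability {
  // Hypothetical helper: converts the ML Vector column to array<double>,
  // then takes the element at index 0 as the new "prob" column.
  def withProb(result: DataFrame): DataFrame =
    result.withColumn("prob", vector_to_array(col("probability")).getItem(0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("first-probability")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the model output; the values are illustrative only.
    val result = Seq(
      (Vectors.dense(0.1, 0.9), 0.0),
      (Vectors.dense(0.6, 0.4), 1.0)
    ).toDF("probability", "predictedLabel")

    withProb(result).show()
    spark.stop()
  }
}
```

The conversion step is what makes getItem applicable: getItem indexes into arrays and maps, not into the Vector type directly.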
I would also mention that UDFs should be your last resort: the Catalyst optimizer treats a UDF as a black box and cannot currently optimize it, so you should always prefer the built-in functions to get the most out of Catalyst.
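For comparison, here is what the last-resort UDF route might look like when no built-in function fits, for example on a Spark version older than 3.0 where vector_to_array does not exist. This is only a sketch: the names FirstProbabilityUdf, firstElement, and addProb are hypothetical, and it assumes the probability column holds non-null, non-empty ML Vectors.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object FirstProbabilityUdf {
  // Reads element 0 of an ML Vector.
  // Assumption: the vector is non-null and has at least one element.
  val firstElement = udf((v: Vector) => v(0))

  // Adds a "prob" column computed by the UDF above.
  def addProb(result: DataFrame): DataFrame =
    result.withColumn("prob", firstElement(col("probability")))
}
```

Calling FirstProbabilityUdf.addProb(result) gives the same extra column, at the cost of a step that Catalyst cannot see into.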