I have a data file with three columns, and I want to normalize the last column in order to apply ALS with Spark ML (Scala). How can I do it?
Here is an excerpt of my DataFrame code:
import org.apache.spark.sql.types.{IntegerType, FloatType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)
val viewdd = view_df.withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
  .withColumnRenamed("userIdTemp", "userId")
  .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
  .withColumnRenamed("productIdTemp", "productId")
  .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
  .withColumnRenamed("viewTemp", "view")
Using the StandardScaler is usually what you want when there is any scaling/normalization to be done. In this case, however, there is only a single column to scale, and it is of type Float rather than Vector. Since the StandardScaler only works on Vector columns, a VectorAssembler would have to be applied first, and the resulting Vector would then need to be converted back to a Float after the scaling.
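For completeness, here is a rough sketch of what that route would look like, assuming the viewdd DataFrame from the question (the view_vec and view_scaled_vec column names and the firstElem UDF are only illustrative):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Wrap the single Float column into a Vector column so StandardScaler can use it
val assembled = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("view_vec")
  .transform(viewdd)

// Fit and apply the scaler (subtract the mean, divide by the standard deviation)
val scaled = new StandardScaler()
  .setInputCol("view_vec")
  .setOutputCol("view_scaled_vec")
  .setWithMean(true)
  .setWithStd(true)
  .fit(assembled)
  .transform(assembled)

// Unpack the one-element Vector back into a plain Float column
val firstElem = udf((v: Vector) => v(0).toFloat)
val scaledBack = scaled.withColumn("view_scaled", firstElem(scaled("view_scaled_vec")))

As you can see, that is quite a lot of machinery for a single column.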
The simpler way in this case is to do it yourself: first get the mean and standard deviation of the column, and then perform the scaling. It can be done on the view column as follows:
import org.apache.spark.sql.functions.{mean, stddev}
import spark.implicits._  // for $"..." and .as[...]

// Collect the mean and standard deviation of the view column in a single pass
val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()

// Standardize: (value - mean) / stddev
viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
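If the goal is the ALS step mentioned in the question, the scaled column can then be used as the rating column. A minimal sketch, assuming the viewdd DataFrame and mean_view/std_view values above and default ALS parameters (the viewScaled name is just illustrative):

import org.apache.spark.ml.recommendation.ALS

// Keep the scaled column alongside userId and productId
val viewScaled = viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)

// Train ALS using the standardized view values as ratings
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("view_scaled")

val model = als.fit(viewScaled)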