Normalize column with Spark

user9529942 · May 3, 2018

I have a data file with three columns, and I want to normalize the last column so I can apply ALS with Spark ML (in Scala). How can I do it?

Here is an excerpt from my DataFrame:

import org.apache.spark.sql.types.{IntegerType, FloatType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)

// Cast userId and productId to Int and view to Float, keeping the original column names
val viewdd = view_df.withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
                    .withColumnRenamed("userIdTemp", "userId")
                    .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
                    .withColumnRenamed("productIdTemp", "productId")
                    .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
                    .withColumnRenamed("viewTemp", "view")

Answer

Shaido · May 4, 2018

Using the StandardScaler is usually what you want when there is any scaling/normalization to be done. However, in this case there is only a single column to scale, and it is of Float type rather than Vector. Since the StandardScaler only works on Vectors, a VectorAssembler could be applied first, but then the resulting Vector would need to be converted back into a Float after the scaling.
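For reference, that VectorAssembler + StandardScaler route could look roughly like the sketch below. It assumes the viewdd DataFrame from the question; the column names view_vec and view_scaled_vec are made up here for illustration.

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Wrap the single "view" column into a Vector column so StandardScaler can use it
val assembler = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("view_vec")

// Scale to zero mean and unit standard deviation
val scaler = new StandardScaler()
  .setInputCol("view_vec")
  .setOutputCol("view_scaled_vec")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(viewdd)
val scaled = scaler.fit(assembled).transform(assembled)

// Pull the single element back out of the Vector as a Float column
val toFloat = udf((v: Vector) => v(0).toFloat)
val result = scaled.withColumn("view_scaled", toFloat($"view_scaled_vec"))

As you can see, this takes an assembler, a scaler, and a UDF just to scale one column, which is why the manual approach below is simpler here.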

The simpler way in this case would be to do it yourself. First get the mean and standard deviation of the column and then perform the scaling. It can be done on the view column as follows:

import org.apache.spark.sql.functions.{mean, stddev}
import spark.implicits._

val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()

// Standardize the view column to zero mean and unit standard deviation
viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
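If the end goal is the ALS step mentioned in the question, the scaled column can then be used as the rating column. A minimal sketch, assuming the userId and productId columns from the question and default ALS parameters:

import org.apache.spark.ml.recommendation.ALS

// Keep the scaled column alongside the id columns
val scaledDf = viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("view_scaled")

val model = als.fit(scaledDf)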