Normalize column with Spark

user9529942 · May 3, 2018

I have a data file with three columns, and I want to normalize the last column so I can apply ALS with Spark ML (in Scala). How can I do it?

Here is an excerpt from my DataFrame:

import org.apache.spark.sql.types.{IntegerType, FloatType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)

// Cast userId and productId to Int and view to Float, keeping the original column names
val viewdd = view_df.withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
                    .withColumnRenamed("userIdTemp", "userId")
                    .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
                    .withColumnRenamed("productIdTemp", "productId")
                    .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
                    .withColumnRenamed("viewTemp", "view")

Answer

Shaido · May 4, 2018

Using the StandardScaler is usually what you want when there is any scaling/normalization to be done. However, in this case there is only a single column to scale, and it is of Float type rather than Vector. Since the StandardScaler only works on Vectors, a VectorAssembler could be applied first, but then the resulting Vector would need to be converted back into a Float after the scaling.
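For reference, that VectorAssembler + StandardScaler route could look roughly like the sketch below. It assumes the viewdd DataFrame from the question; the column names view_vec and view_scaled_vec are made up here for illustration.

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Wrap the single "view" column into a Vector column so StandardScaler can use it
val assembler = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("view_vec")

// Scale to zero mean and unit standard deviation
val scaler = new StandardScaler()
  .setInputCol("view_vec")
  .setOutputCol("view_scaled_vec")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(viewdd)
val scaled = scaler.fit(assembled).transform(assembled)

// Pull the single element back out of the Vector as a Float column
val toFloat = udf((v: Vector) => v(0).toFloat)
val result = scaled.withColumn("view_scaled", toFloat($"view_scaled_vec"))

As you can see, this takes an assembler, a scaler, and a UDF just to scale one column, which is why the manual approach below is simpler here.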

The simpler way in this case would be to do it yourself. First get the mean and standard deviation of the column and then perform the scaling. It can be done on the view column as follows:

import org.apache.spark.sql.functions.{mean, stddev}
import spark.implicits._

val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()

// Standardize the view column to zero mean and unit standard deviation
viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
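If the end goal is the ALS step mentioned in the question, the scaled column can then be used as the rating column. A minimal sketch, assuming the userId and productId columns from the question and default ALS parameters:

import org.apache.spark.ml.recommendation.ALS

// Keep the scaled column alongside the id columns
val scaledDf = viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("view_scaled")

val model = als.fit(scaledDf)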