Mapping Spark DataSet row values into new hash column

Jesús Zazueta · Nov 6, 2017 · Viewed 11.1k times

Given the following DataSet values as inputData:

column0 column1 column2 column3
A       88      text    99
Z       12      test    200
T       120     foo     12

In Spark, what is an efficient way to compute a new hash column and append it to a new DataSet, hashedData, where hash is defined as applying MurmurHash3 to each row's values in inputData?

Specifically, hashedData would look like:

column0 column1 column2 column3 hash
A       88      text    99      MurmurHash3.arrayHash(Array("A", 88, "text", 99))
Z       12      test    200     MurmurHash3.arrayHash(Array("Z", 12, "test", 200))
T       120     foo     12      MurmurHash3.arrayHash(Array("T", 120, "foo", 12))
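For reference, the per-row hash the question describes can be computed directly with the Scala standard library; a minimal sketch of what one cell of the hash column would contain (the row values here are taken from the table above):

```scala
import scala.util.hashing.MurmurHash3

// One row's values, mixed types, so Array[Any].
val row: Array[Any] = Array("A", 88, "text", 99)

// MurmurHash3.arrayHash folds every element's hashCode into a single Int.
val rowHash: Int = MurmurHash3.arrayHash(row)
```

Note that arrayHash is deterministic for a fixed seed, so the same row values always produce the same hash within one Scala version.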

Please let me know if any more specifics are necessary.

Any help is appreciated. Thanks!

Answer

soote picture soote · Nov 6, 2017

One way is to use the withColumn function together with Spark's built-in hash function, which combines all of the given columns into a single int column:

import org.apache.spark.sql.functions.{col, hash}
dataset.withColumn("hash", hash(dataset.columns.map(col):_*))
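Put together, a minimal self-contained sketch (the column names and row values are taken from the question; local mode and the app name are assumptions for illustration). One caveat: Spark's hash function uses its own Murmur3 implementation over Spark's internal row representation, so the resulting values will not byte-for-byte match scala.util.hashing.MurmurHash3.arrayHash, even though both are Murmur3-based.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash}

val spark = SparkSession.builder()
  .appName("hash-example")   // hypothetical app name
  .master("local[*]")        // local mode for illustration
  .getOrCreate()
import spark.implicits._

val inputData = Seq(
  ("A", 88, "text", 99),
  ("Z", 12, "test", 200),
  ("T", 120, "foo", 12)
).toDF("column0", "column1", "column2", "column3")

// hash() applies Spark's Murmur3-based hash across all columns of each row.
val hashedData = inputData.withColumn("hash", hash(inputData.columns.map(col): _*))
hashedData.show()
```

Because the columns are passed via inputData.columns.map(col), this works for any schema without hard-coding column names.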