Given the following Dataset values as inputData:
column0  column1  column2  column3
A        88       text     99
Z        12       test     200
T        120      foo      12
In Spark, what is an efficient way to compute a new hash column and append it to a new Dataset, hashedData, where hash is defined as the application of MurmurHash3 over each row value of inputData?
Specifically, hashedData would be:
column0  column1  column2  column3  hash
A        88       text     99       MurmurHash3.arrayHash(Array("A", 88, "text", 99))
Z        12       test     200      MurmurHash3.arrayHash(Array("Z", 12, "test", 200))
T        120      foo      12       MurmurHash3.arrayHash(Array("T", 120, "foo", 12))
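For reference, here is a minimal plain-Scala sketch of the per-row hash described above, with no Spark involved. The object name RowHashSketch and the helper rowHash are hypothetical and exist only for illustration. The row values are hard-coded from the inputData table.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper, for illustration only: hashes one row's values
// with MurmurHash3.arrayHash, as in the hash column of hashedData.
object RowHashSketch {
  def rowHash(values: Any*): Int = MurmurHash3.arrayHash(values.toArray)

  def main(args: Array[String]): Unit = {
    // Rows copied from the inputData table.
    val inputData = Seq(
      Seq[Any]("A", 88, "text", 99),
      Seq[Any]("Z", 12, "test", 200),
      Seq[Any]("T", 120, "foo", 12)
    )
    // Print each row followed by its hash.
    inputData.foreach { row =>
      println(s"${row.mkString(" ")} ${rowHash(row: _*)}")
    }
  }
}
```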
Please let me know if any more specifics are necessary.
Any help is appreciated. Thanks!
One way is to use withColumn with Spark's built-in hash function, passing every column of the Dataset:
import org.apache.spark.sql.functions.{col, hash}
dataset.withColumn("hash", hash(dataset.columns.map(col): _*))
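Note that Spark's built-in hash function is Murmur3-based but, as far as I know, should not be expected to return the same values as scala.util.hashing.MurmurHash3.arrayHash. If the exact arrayHash values from the question are required, one option is a UDF applied to a struct of all columns. The following is a minimal sketch under that assumption. The names MurmurRowHash, rowHash, and withRowHash are hypothetical, and whether the result matches the literal Array(...) expressions in the question depends on the column types (for example, Int versus String columns).

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, udf}
import scala.util.hashing.MurmurHash3

// Hypothetical helper object, for illustration only.
object MurmurRowHash {
  // Hash all values of one row with Scala's MurmurHash3.arrayHash.
  def rowHash(row: Row): Int = MurmurHash3.arrayHash(row.toSeq.toArray)

  // Append a "hash" column computed over every existing column.
  def withRowHash(df: DataFrame): DataFrame = {
    val hashUdf = udf((r: Row) => rowHash(r))
    // struct(...) packs all columns into one value, which the UDF receives as a Row.
    df.withColumn("hash", hashUdf(struct(df.columns.map(col): _*)))
  }
}
```

Usage would look like val hashedData = MurmurRowHash.withRowHash(inputData.toDF()). A UDF is opaque to Spark's optimizer, so the built-in hash is likely the faster choice when the exact arrayHash values are not needed.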