Add UUID to spark dataset

Question 1

apache-spark apache-spark-dataset spark-csv

Adiant · Apr 9, 2018 · Viewed 11.9k times · Source

Answer

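When you pass a UUID to lit, UUID.randomUUID() runs once on the driver and the resulting string is added as a constant column, exactly like any other string literal. That is why every row ends up with the same value.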
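To make the difference concrete, here is a minimal plain-Java sketch with no Spark involved. The class and method names (LiteralVsPerRow, fillWithLiteral, fillPerRow) are hypothetical and only mimic the two behaviours: a value computed once and copied into every row, versus a generator invoked once per row.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not Spark code.
class LiteralVsPerRow {
    // Mimics lit(UUID.randomUUID().toString()): the argument is already a
    // finished string when the method is called, so each row gets a copy of it.
    static List<String> fillWithLiteral(int rows, String literal) {
        List<String> column = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            column.add(literal);
        }
        return column;
    }

    // Mimics a per-row expression: the generator is invoked once for every row.
    static List<String> fillPerRow(int rows, Supplier<String> generator) {
        List<String> column = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            column.add(generator.get());
        }
        return column;
    }

    public static void main(String[] args) {
        List<String> same = fillWithLiteral(3, UUID.randomUUID().toString());
        List<String> distinct = fillPerRow(3, () -> UUID.randomUUID().toString());
        // Count distinct values in each "column".
        System.out.println(new HashSet<>(same).size());     // prints 1
        System.out.println(new HashSet<>(distinct).size()); // 3, barring a UUID collision
    }
}
```

The first call evaluates UUID.randomUUID() before any row exists, which is the situation in the question; the second defers generation to each row.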

A UUID needs to be generated for each row. You could do this with a UDF, but Spark expects UDFs to be deterministic, so relying on one for randomness can cause problems: if cached data is lost or the column is recomputed, the same row can come back with a different UUID.

Your best bet may be to generate a column with the Spark function rand and use UUID.nameUUIDFromBytes to convert each value to a UUID.
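As a rough sketch of the conversion step only, the plain-Java snippet below turns a double (standing in for one value of a rand column) into a name-based UUID. The class and method names (UuidFromRandomValue, fromDouble) are hypothetical, and the 8-byte encoding of the double is an assumption, since the answer does not say how the bytes should be produced. In Spark the conversion would still have to run per row, for example inside a UDF, because UUID.nameUUIDFromBytes is an ordinary JVM method; the difference is that such a UDF is deterministic in its input.

```java
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.UUID;

// Hypothetical illustration only; this is not Spark code.
class UuidFromRandomValue {
    // Encodes the double as 8 bytes and derives a name-based (version 3) UUID.
    // The same input always produces the same UUID.
    static UUID fromDouble(double value) {
        byte[] bytes = ByteBuffer.allocate(Double.BYTES).putDouble(value).array();
        return UUID.nameUUIDFromBytes(bytes);
    }

    public static void main(String[] args) {
        // A fixed seed stands in for a seeded rand column, so reruns
        // reproduce the same sequence of values and therefore the same UUIDs.
        Random rng = new Random(42L);
        for (int row = 0; row < 3; row++) {
            System.out.println(fromDouble(rng.nextDouble()));
        }
    }
}
```

Note that uniqueness here is only as good as the random values: two rows that happen to receive the same double would receive the same UUID.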

Originally, I had:

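import org.apache.spark.sql.functions.udf
// Scala: a zero-argument UDF that returns a new random UUID string on each call
val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(classOf[Transaction]).withColumn("uniqueId", uuid()).show(false)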

which @irbull pointed out could be an issue.

Question 2

I am trying to add a UUID column to my dataset.

getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But the result is that all the rows have the same UUID. How can I make it unique for each row?

+-----------------------------------+
|uniqueId                           |
+-----------------------------------+
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
+-----------------------------------+
