Add UUID to spark dataset

Adiant picture Adiant · Apr 9, 2018 · Viewed 11.9k times · Source

I am trying to add a UUID column to my dataset.

getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But the result is all the rows have the same UUID. How can i make it unique?

+-----------------------------------+
uniqueId                            |
+----------------+-------+-----------
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
----------+----------------+--------+

Answer

Benjamin Manns picture Benjamin Manns · Apr 9, 2018

When you include UUID as a lit column, you're doing the same as including a string literal.

UUID needs to be generated for each row. You could do this with a UDF, however this can cause problems as UDFs are expected to be deterministic, and expecting randomness from them can cause issues when caching or regeneration happen.

Your best bet may be generating a column with the Spark function rand and using UUID.nameUUIDFromBytes to convert that to a UUID.

Originally, I had:

val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(Transaction.class).withColumn("uniqueId", uuid()).show(false);

which @irbull pointed out could be an issue.