I am trying to add a UUID column to my dataset.
getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);
But the result is all the rows have the same UUID. How can i make it unique?
+-----------------------------------+
uniqueId |
+----------------+-------+-----------
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
----------+----------------+--------+
When you include UUID as a lit
column, you're doing the same as including a string literal.
UUID needs to be generated for each row. You could do this with a UDF, however this can cause problems as UDFs are expected to be deterministic, and expecting randomness from them can cause issues when caching or regeneration happen.
Your best bet may be generating a column with the Spark function rand
and using UUID.nameUUIDFromBytes
to convert that to a UUID.
Originally, I had:
val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(Transaction.class).withColumn("uniqueId", uuid()).show(false);
which @irbull pointed out could be an issue.