Spark Streaming with Python: how to add a UUID column?

bea · Apr 12, 2018

I would like to add a column with a generated id to my data frame. I have tried:

uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())

However, when I do this, nothing is written to my output directory. When I remove these lines, everything works fine, so there must be an error somewhere, but I don't see anything in the console.

I have also tried using monotonically_increasing_id() instead of generating a UUID, but in my testing this produces many duplicates. I need a unique identifier (it does not have to be a UUID specifically).

How can I do this?

Answer

Atanu chatterjee · Apr 30, 2018

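Please try this. The lambda in your UDF expects one argument (x), but the UDF is called with no columns, so the lambda should take no arguments: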

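import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The lambda takes no arguments because the UDF is called with no columns.
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
Df1 = Df.withColumn("id", uuidUdf())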

Note: withColumn returns a new DataFrame rather than modifying the existing one, so assign the result to a variable after adding the column (Df1 = Df.withColumn(....)).
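
The same idea can be wrapped in a small helper. This is only a sketch: the names new_uuid and with_uuid_column are made up for illustration, and it assumes Spark 2.3 or later, where a UDF can be marked with asNondeterministic() so the optimizer is told the function returns a different value on each call. The pyspark imports sit inside the helper so the UUID generator can be checked without a Spark installation.

```python
import uuid


def new_uuid():
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


def with_uuid_column(df, name="id"):
    """Return a new DataFrame with a UUID string column added.

    Hypothetical helper for illustration; assumes Spark 2.3+.
    """
    # Imported here so that new_uuid() can be tested without pyspark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Zero-argument UDF, marked nondeterministic because every call
    # produces a different value.
    uuid_udf = udf(new_uuid, StringType()).asNondeterministic()

    # withColumn does not modify df; it returns a new DataFrame.
    return df.withColumn(name, uuid_udf())
```

Usage would be Df1 = with_uuid_column(Df). On Spark 2.3 and later, Spark SQL also appears to provide a built-in uuid() function, so Df.withColumn("id", expr("uuid()")) (with expr imported from pyspark.sql.functions) may avoid the Python UDF entirely; I have not verified how it behaves in a streaming query.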