How to overwrite entire existing column in Spark dataframe with new column?

Question 1

How to overwrite entire existing column in Spark dataframe with new column?

apache-spark dataframe pyspark apache-spark-sql apache-spark-mllib

GeorgeOfTheRF · Jun 19, 2017 · Viewed 20.1k times · Source

Answer

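You can use either of the following. The original snippet used the Scala form $"colName"; in PySpark, refer to the column as d1["colName"]:

    d1.withColumnRenamed("colName", "newColName")
    d1.withColumn("newColName", d1["colName"])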

withColumnRenamed renames an existing column.

withColumn adds a column with the given name. If a column with that name already exists, it is replaced.

In your case the change is not applied to the original DataFrame df2. Each of these calls returns a new DataFrame, which you need to assign to a variable for further use.

d3 = df2.select((df2.id2 > 0).alias("id2"))

Above should work fine in your case.

Hope this helps!

Question 2

I want to overwrite a Spark column with a new column which is a binary flag.

I tried overwriting the column id2 directly, but why does it not work like an inplace operation in Pandas?

How can I do it without using withColumn() to create a new column and drop() to drop the old one?

I know that a Spark DataFrame is immutable. Is that the reason, or is there a different way to overwrite without using withColumn() and drop()?

    df2 = spark.createDataFrame(
        [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

    df2.select(df2.id2 > 0).show()

+---------+
|(id2 > 0)|
+---------+
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
+---------+
    # Attempting to overwrite df2.id2
    df2.id2=df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)','id2')
    df2.show()
# Overwriting unsuccessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1| NaN|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4| NaN|
|      1|         5|10.0|
|      1|         6| NaN|
|      1|         6| NaN|
+-------+----------+----+
