Concatenate columns in Apache Spark DataFrame

sql apache-spark dataframe apache-spark-sql

Nipun · Jul 16, 2015

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

Answer

With raw SQL you can use CONCAT:

In Python

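# Build a small two-column DataFrame and expose it to SQL as table "df".
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
# registerTempTable is deprecated since Spark 2.0; use createOrReplaceTempView there.
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")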

In Scala

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
// registerTempTable is deprecated since Spark 2.0; use createOrReplaceTempView there.
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

Since Spark 1.5.0 you can use the concat function with the DataFrame API:

In Python

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

In Scala

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

There is also the concat_ws function, which takes a string separator as its first argument.
