I'm trying to wrap my head around these two functions in the Spark SQL documentation:
def union(other: RDD[Row]): RDD[Row]
Return the union of this RDD and another one.
def unionAll(otherPlan: SchemaRDD): SchemaRDD
Combines the tuples of two RDDs with the same schema, keeping duplicates.
This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.
My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.
scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
Why would I prefer one over the other?
In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
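If you actually want SQL UNION semantics (duplicates removed), you have to deduplicate explicitly. A minimal sketch, assuming a Spark 2.x shell where the session is named spark, using data similar to the question's:

import spark.implicits._

val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
val df2 = Seq(("Alpha", 1), ("Beta", 2), ("Gamma", 3)).toDF("name", "age")

// union keeps duplicates, like SQL UNION ALL: 5 rows
df1.union(df2).count()

// follow it with distinct() to get SQL UNION semantics: 3 rows
df1.union(df2).distinct().count()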