Spark: subtract two DataFrames

apache-spark dataframe rdd

Interfector · Apr 9, 2015 · Viewed 110.7k times · Source

In Spark version 1.2.0 one could use subtract with 2 SchemRDDs to end up with only the different content from the first one

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Answer

According to the api docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataframe2.