Apache Spark RDD filter into two RDDs

monster picture monster · Apr 9, 2015 · Viewed 8.5k times · Source

I need to split an RDD into 2 parts:

1 part which satisfies a condition; another part which does not. I can do filter twice on the original RDD but it seems inefficient. Is there a way that can do what I'm after? I can't find anything in the API nor in the literature.

Answer

Marius Soutier picture Marius Soutier · Apr 9, 2015

Spark doesn't support this by default. Filtering on the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.

If it's really just two different types, you can use a helper method:

implicit class RDDOps[T](rdd: RDD[T]) {
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)

But as soon as you have multiple types of data, just assign the filtered to a new val.