Spark: RDD to List

bill · Nov 30, 2016 · Viewed 35.3k times

I have an RDD of type

RDD[(String, String)]

and I want to create two Lists (one for each component of the pairs).

I tried using rdd.foreach() to fill two ListBuffers and then convert them to Lists, but I guess each node creates its own ListBuffer, because after the iteration the ListBuffers are empty. How can I do it?

EDIT: my approach

import scala.collection.mutable.ListBuffer

val labeled = data_labeled.map { line =>
  val parts = line.split(',')
  (parts(5), parts(7))
}.cache()

var testList: ListBuffer[String] = new ListBuffer()

labeled.foreach(line =>
  testList += line._1
)
val labeledList = testList.toList
println("rdd: " + labeled.count)
println("bufferList: " + testList.size)
println("list: " + labeledList.size)

and the result is:

rdd: 31990654
bufferList: 0
list: 0

Answer

Tzach Zohar · Nov 30, 2016

If you really want to create two Lists - meaning, you want all the distributed data collected into the driver application (risking slowness or an OutOfMemoryError) - you can use collect and then apply simple map operations to the result:

val list: List[(String, String)] = rdd.collect().toList
val col1: List[String] = list.map(_._1)
val col2: List[String] = list.map(_._2)
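Once the data has been collected it is a plain Scala List, so the two maps above can also be done in a single pass with the standard-library unzip method (a minor variation, shown here on a small hard-coded list for illustration):

```scala
// After collect(), the data is an ordinary Scala List of pairs;
// unzip splits it into two lists in one traversal,
// equivalent to the two map calls above.
val list: List[(String, String)] = List(("a", "1"), ("b", "2"), ("c", "3"))
val (col1, col2) = list.unzip
println(col1) // List(a, b, c)
println(col2) // List(1, 2, 3)
```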

Alternatively - if you want to "split" your RDD into two RDDs - it's pretty similar without collecting the data:

rdd.cache() // cache so the RDD is not computed twice
val rdd1: RDD[String] = rdd.map(_._1)
val rdd2: RDD[String] = rdd.map(_._2)

A third alternative is to first map into these two RDDs and then collect each of them, but that's not much different from the first option and suffers from the same risks and limitations.
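For completeness, that third alternative might look like the following (a sketch, assuming the same `rdd: RDD[(String, String)]` as above; the local SparkSession setup is only there to make the snippet self-contained):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SplitAndCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-and-collect")
      .getOrCreate()

    // Stand-in for the original rdd: RDD[(String, String)]
    val rdd: RDD[(String, String)] =
      spark.sparkContext.parallelize(Seq(("a", "1"), ("b", "2")))

    // Third alternative: map into two RDDs first, then collect each one.
    // Caching avoids recomputing rdd for the second map.
    rdd.cache()
    val col1: List[String] = rdd.map(_._1).collect().toList
    val col2: List[String] = rdd.map(_._2).collect().toList

    println(col1) // List(a, b)
    println(col2) // List(1, 2)

    spark.stop()
  }
}
```

As the answer notes, this still pulls everything onto the driver, so it has the same memory risks as the first option.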