How Can I Obtain an Element Position in Spark's RDD?

SciPioneer · Sep 25, 2014

I am new to Apache Spark, and I know that its core data structure is the RDD. I am now writing some apps that require positional information about elements. For example, after converting an ArrayList into a (Java)RDD, I need to know the (global) array subscript of each integer in the RDD. Is this possible?

As far as I know, RDD has a take(int) function, so I believe the positional information is still maintained in the RDD.

Answer

zhang zhan · Sep 28, 2014

I believe zipWithIndex() will do the trick in most cases, and it will preserve the order. I read the comments again, and my understanding is that they mean exactly this: the indices keep the order of the elements in the RDD.

scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3)
scala> val r2 = r1.zipWithIndex
scala> r2.foreach(println)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)
(a,0)
(b,1)

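The example above confirms it. The RDD has 3 partitions, and a gets index 0, b gets index 1, etc. The lines print out of order only because foreach runs on each partition separately, so the print order across partitions is not guaranteed; the indices themselves still follow the original list order.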
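To make the index assignment easier to see, here is a minimal sketch in plain Scala (no Spark required) that models it. It relies on the documented ordering of zipWithIndex: first by partition index, then by the order of items within each partition. The helper name zipWithIndexModel and the object name are hypothetical, and the three-way split of the list is an assumption inferred from the printed output above. This is a model of the behavior, not Spark's actual code.

```scala
// Plain-Scala model of how global indices follow partition order (not Spark code).
object ZipWithIndexModel {
  // Hypothetical helper: each inner Seq stands for one partition, in partition order.
  def zipWithIndexModel[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    // Starting index of a partition = total size of all partitions before it.
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, start) =>
      // Within a partition, elements keep their local order.
      part.zipWithIndex.map { case (elem, i) => (elem, start + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Same data as above; this split into 3 partitions is assumed.
    val parts = Seq(Seq("a", "b"), Seq("c", "d"), Seq("e", "f", "g"))
    val indexed = zipWithIndexModel(parts)
    // Visit the partitions in a different order, as a per-partition foreach might.
    Seq(1, 2, 0).foreach(p => indexed(p).foreach(println))
  }
}
```

Visiting the partitions in the order 1, 2, 0 prints the pairs in the same sequence as the shell output above, while every element still carries the index of its position in the original list. On a real RDD, if you want to see the pairs in index order, collecting to the driver first (for example r2.collect().foreach(println)) should print them partition by partition.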