How to get element by Index in Spark RDD (Java)

progNewbie · Nov 9, 2014 · Viewed 57.2k times

I know the method rdd.first(), which gives me the first element in an RDD.

There is also the method rdd.take(num), which gives me the first "num" elements.

But is there a way to get an element by index?

Thanks.

Answer

maasg · Nov 9, 2014

This should be possible by first indexing the RDD. The transformation zipWithIndex provides a stable indexing, numbering each element in its original order.

Given: rdd = (a,b,c)

val withIndex = rdd.zipWithIndex // ((a,0),(b,1),(c,2))

This form is not useful for looking up an element by index. First we need to make the index the key:

val indexKey = withIndex.map{case (k,v) => (v,k)}  //((0,a),(1,b),(2,c))

Now it's possible to use the lookup action, available on pair RDDs, to find an element by key:

val b = indexKey.lookup(1) // Seq(b), since lookup returns every value stored under the key

If you expect to use lookup often on the same RDD, I'd recommend caching the indexKey RDD to improve performance.

How to do this using the Java API is an exercise left for the reader.
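
For anyone who wants a starting point, here is a minimal sketch of the same three steps in Java. It assumes Java 8 lambdas and a local Spark master; the class name IndexLookup and the helper indexByPosition are made up for illustration and are not part of Spark.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class IndexLookup {

    // Hypothetical helper: pairs each element with its position,
    // then swaps the pair so that the position becomes the key.
    static <T> JavaPairRDD<Long, T> indexByPosition(JavaRDD<T> rdd) {
        return rdd.zipWithIndex()                                 // (element, index)
                  .mapToPair(t -> new Tuple2<>(t._2(), t._1()));  // (index, element)
    }

    public static void main(String[] args) {
        // Assumption: a local master, for trying this out on one machine.
        SparkConf conf = new SparkConf().setAppName("IndexLookup").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

            // Cache the keyed RDD because it may be queried several times.
            JavaPairRDD<Long, String> indexKey = indexByPosition(rdd).cache();

            // The key type is Long, so the literal must be 1L rather than 1.
            List<String> b = indexKey.lookup(1L);  // one-element list holding "b"
            System.out.println(b);
        }
    }
}
```

Note that lookup returns a List, not a single element, so take the first entry if you only want the value. Also keep in mind that zipWithIndex has to run a Spark job when the RDD has more than one partition, which is another reason to cache the result.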