How to remove duplicate values from an RDD [PySpark]

Prince Bhatti · Sep 18, 2014

I have the following table as an RDD:

Key Value
1    y
1    y
1    y
1    n
1    n
2    y
2    n
2    n

I want to remove all the duplicate key-value pairs.

The output should look like this:

Key Value
1    y
1    n
2    y
2    n

While working in pyspark, the output should come as a list of key-value pairs like this:

[(u'1',u'n'),(u'2',u'n')]

I don't know how to apply a for loop here. In a normal Python program it would have been very easy.

I wonder if there is some function in pyspark for this.
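For reference, here is roughly what I would do in a normal Python program (a sketch using a set to drop duplicate pairs):

pairs = [('1', 'y'), ('1', 'y'), ('1', 'y'), ('1', 'n'),
         ('1', 'n'), ('2', 'y'), ('2', 'n'), ('2', 'n')]

# Keep the first occurrence of each (key, value) pair, preserving order.
seen = set()
unique_pairs = []
for pair in pairs:
    if pair not in seen:
        seen.add(pair)
        unique_pairs.append(pair)

print(unique_pairs)  # [('1', 'y'), ('1', 'n'), ('2', 'y'), ('2', 'n')]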

Answer

Mikel Urkia · Sep 18, 2014

I am afraid I have no knowledge of Python, so all the references and code I provide in this answer are in Java. However, it should not be very difficult to translate them into Python.

You should take a look at Spark's official programming guide, which provides a list of all the transformations and actions supported by Spark.

If I am not mistaken, the best approach in your case would be to use the distinct() transformation, which "returns a new dataset that contains the distinct elements of the source dataset" (quoting the documentation). In Java, it would be something like:

JavaPairRDD<Integer, String> myDataSet = ...; // already obtained somewhere else
// distinct() returns a new RDD containing only one copy of each pair.
JavaPairRDD<Integer, String> distinctSet = myDataSet.distinct();
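Since the question is about PySpark, a direct translation would be the following sketch (assuming sc is an existing SparkContext):

# Sketch, assuming `sc` is an existing SparkContext.
pairs = sc.parallelize([('1', 'y'), ('1', 'y'), ('1', 'y'), ('1', 'n'),
                        ('1', 'n'), ('2', 'y'), ('2', 'n'), ('2', 'n')])

distinct_pairs = pairs.distinct()
print(distinct_pairs.collect())
# [('1', 'y'), ('1', 'n'), ('2', 'y'), ('2', 'n')]  (order may vary)

Note that distinct() compares the whole (key, value) tuple, so ('1', 'y') and ('1', 'n') are both kept.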

So, for example, an RDD with the following partitions:

Partition 1:

1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n

Partition 2:

2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n

would get converted to (one possible distribution of the results):

Partition 1:

1-y | 2-y | 2-g

Partition 2:

1-n | 2-n

Note that distinct() involves a shuffle, so duplicates are removed across the whole RDD, not merely within each partition: every distinct element appears exactly once in the resulting RDD, whichever partition it ends up in.
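If you want to verify this per-partition behavior in PySpark, glom() (which collects the elements of each partition into a list) gives a quick way to look inside; a sketch, again assuming an existing sc:

# Sketch: inspect partition contents before and after distinct().
rdd = sc.parallelize([('1', 'y'), ('1', 'y'), ('1', 'n'),
                      ('2', 'y'), ('2', 'n'), ('2', 'n')], 2)

print(rdd.glom().collect())
# e.g. [[('1', 'y'), ('1', 'y'), ('1', 'n')], [('2', 'y'), ('2', 'n'), ('2', 'n')]]

print(rdd.distinct().glom().collect())
# Duplicates are gone across *all* partitions, e.g.
# [[('2', 'y'), ('1', 'n')], [('1', 'y'), ('2', 'n')]]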