I have the following table as an RDD:
Key Value
1 y
1 y
1 y
1 n
1 n
2 y
2 n
2 n
I want to remove all the duplicates from Value. The output should look like this:
Key Value
1 y
1 n
2 y
2 n
When working in pyspark, the output should come as a list of key-value pairs like this:
[(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
I don't know how to apply a for loop here. In a normal Python program it would have been very easy. I wonder if there is some function in pyspark for the same.
I am afraid I have no knowledge of Python, so all the references and code I provide in this answer relate to Java. However, it should not be very difficult to translate it into Python code.
You should take a look at Spark's official programming guide, which provides a list of all the transformations and actions supported by Spark.
If I am not mistaken, the best approach in your case would be to use the distinct() transformation, which "returns a new dataset that contains the distinct elements of the source dataset" (quoting that guide). In Java, it would be something like:
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<Integer, String> myDataSet = ...; // already obtained somewhere else
JavaPairRDD<Integer, String> distinctSet = myDataSet.distinct(); // removes duplicate pairs
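Since you are working in pyspark, here is a rough translation of the same idea (a minimal sketch, not tested on my end; the variable names and the local SparkContext setup are my own assumptions):

from pyspark import SparkContext

sc = SparkContext("local", "distinct-example")

# The question's sample data as (key, value) tuples
my_data_set = sc.parallelize([
    (u'1', u'y'), (u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'1', u'n'),
    (u'2', u'y'), (u'2', u'n'), (u'2', u'n'),
])

# distinct() works the same way as in the Java version above
distinct_set = my_data_set.distinct()

print(distinct_set.collect())
# e.g. [(u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n')] -- order may vary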
So, for example, an RDD whose two partitions look like this:
Partition 1:
1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n
Partition 2:
2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n
Would get converted to something like this (note that distinct() involves a shuffle, so duplicates are removed across partitions as well, not just within each one; which partition each surviving element ends up in depends on the partitioner):

Partition 1:
1-y | 2-y | 2-g
Partition 2:
1-n | 2-n

Of course, the result is still a single RDD spread over multiple partitions, but each distinct pair now appears exactly once.
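If you want to check this partition behaviour yourself in pyspark, glom() turns each partition into a list, which makes the layout easy to print (again a sketch, reusing the sc from above; the variable name pairs is made up):

# Two partitions, mirroring the example above
pairs = sc.parallelize(
    [(1, 'y'), (1, 'y'), (2, 'y'), (2, 'n'), (1, 'n'), (2, 'n')], 2)

print(pairs.glom().collect())             # elements grouped by partition
print(pairs.distinct().glom().collect())  # duplicates removed globally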