Mock a Spark RDD in the unit tests

Edamame · Jun 19, 2015 · Viewed 14.4k times

Is it possible to mock an RDD without using a SparkContext?

I want to unit test the following utility function:

 def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}

So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 from a mock org.apache.spark.rdd.RDD[myClass1] instead of creating a real RDD from a SparkContext? Thank you!

Answer

Holden · Jun 19, 2015

RDDs are pretty complex, so mocking them is probably not the best way to create test data. Instead, I'd recommend calling sc.parallelize on your data with a local SparkContext. I also think (though I'm somewhat biased) that https://github.com/holdenk/spark-testing-base can help by providing a trait to set up and tear down the SparkContext for your tests.
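
To make the sc.parallelize suggestion concrete, here is a minimal sketch of a test that builds both inputs from in-memory collections on a local SparkContext. The case classes MyClass1 and MyClass2, their fields, and the body of myUtilityFunction are hypothetical stand-ins, since the question elides the real implementation; the "local[2]" master and the app name are likewise just assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the question's myClass1 and myClass2.
case class MyClass1(id: Int, name: String)
case class MyClass2(id: Int)

object MyUtilityFunctionCheck {
  // Hypothetical body (the question only shows {...}):
  // keep the MyClass1 records whose id also appears in data2.
  def myUtilityFunction(data1: RDD[MyClass1], data2: RDD[MyClass2]): RDD[MyClass1] = {
    val wantedIds = data2.map(_.id).distinct().map(id => (id, ()))
    data1.map(r => (r.id, r)).join(wantedIds).map { case (_, (rec, _)) => rec }
  }

  def main(args: Array[String]): Unit = {
    // A local master runs Spark inside this JVM, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-unit-test")
    val sc = new SparkContext(conf)
    try {
      // Real RDDs built from small in-memory collections, not mocks.
      val data1 = sc.parallelize(Seq(MyClass1(1, "a"), MyClass1(2, "b"), MyClass1(3, "c")))
      val data2 = sc.parallelize(Seq(MyClass2(1), MyClass2(3)))

      // Collect to the driver and sort, since RDD ordering is not guaranteed after a join.
      val result = myUtilityFunction(data1, data2).collect().sortBy(_.id)
      assert(result.sameElements(Array(MyClass1(1, "a"), MyClass1(3, "c"))))
    } finally {
      // Always stop the context so later tests can create their own.
      sc.stop()
    }
  }
}
```

In a real test suite the try/finally block would normally move into the framework's setup and teardown hooks, which is the part spark-testing-base is meant to take over.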