What are the differences between sc.parallelize and sc.textFile?

user2531569 · Jul 1, 2017 · Viewed 26.7k times

I am new to Spark. Can someone please clear up my doubt?

Let's assume the following is my code:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("111"))
c = b.collect()

I hope the following is what happens internally (please correct me if my understanding is wrong):

(1) Variable a will be saved as an RDD variable containing the expected text file content.

(2) The driver node breaks the work up into tasks, and each task contains information about the split of the data it will operate on. These tasks are then assigned to worker nodes.

(3) When a collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes and saved as a local variable c.

Now I want to understand what difference the code below makes:

a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x) > 0 and x.split("\t").count("111"))
c = b.collect()

Could someone please clarify?

Answer

Jacek Laskowski · Jul 1, 2017

(1) Variable a will be saved as an RDD variable containing the expected text file content.

(Highlighting mine) Not really. The line just describes what will happen after you execute an action, i.e. the RDD variable does not contain the expected text file content.

The RDD describes the partitions that, when an action is called, become tasks that will read their parts of the input file.
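You can observe this laziness directly in a PySpark shell. A minimal sketch (the file name events.txt is made up; sc is the usual SparkContext, as in your code):

a = sc.textFile("events.txt")       # returns immediately; no file is read yet
b = a.filter(lambda x: len(x) > 0)  # still no I/O, only a longer lineage
print(b.toDebugString())            # prints the lineage Spark has recorded so far
c = b.collect()                     # only now do tasks run and read the file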

(2) The driver node breaks the work up into tasks, and each task contains information about the split of the data it will operate on. These tasks are then assigned to worker nodes.

Yes, but only when an action is called, which is c = b.collect() in your case.
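Note that every action triggers its own set of tasks, so calling two actions on b would re-read the file twice unless you persist it. A hedged sketch using cache (reuse from the cache assumes the executors have enough memory):

b.cache()        # ask Spark to keep b's partitions in executor memory once computed
n = b.count()    # first action: tasks read the file and populate the cache
c = b.collect()  # second action: served from the cache, no re-read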

(3) When a collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes and saved as a local variable c.

YES! That's the most dangerous operation memory-wise, since all the Spark executors running somewhere in the cluster start sending data back to the driver.
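If you only need a peek at the data or its size, actions such as take and count are much gentler on the driver, because only a small result comes back. For example, with the same b as above:

first_rows = b.take(10)  # ships at most 10 elements to the driver
n = b.count()            # counting happens on the executors; only one number returns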

Now I want to understand what difference the code below makes

Quoting the documentation of sc.textFile:

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

Quoting the documentation of sc.parallelize:

parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

Distribute a local Scala collection to form an RDD.

The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Both do the same thing under the covers, i.e. they build a description of how to access the data that is going to be processed using transformations and an action.

The main difference is therefore the source of the data.
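To make that concrete, here is a hedged PySpark sketch (the path, the list contents, and the partition counts are made up):

from_file = sc.textFile("hdfs:///logs/events.txt", minPartitions=8)  # executors read from storage
from_list = sc.parallelize(["1\t111", "2\t222"], numSlices=8)        # data shipped out from the driver
# From here on, both are RDDs of strings and support the same transformations:
matched_file = from_file.filter(lambda x: "111" in x.split("\t"))
matched_list = from_list.filter(lambda x: "111" in x.split("\t"))

Seen this way, your second snippet first funnels the entire file through the driver with collect, only for parallelize to ship it back out to the executors, while the first snippet avoids that round trip by letting the executors read the file directly.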