How spark read a large file (petabyte) when file can not be fit in spark's main memory

Arpit Rai picture Arpit Rai · Oct 9, 2017 · Viewed 22.8k times · Source

What will happen for large files in these cases?

1) Spark gets a location from NameNode for data . Will Spark stop in this same time because data size is too long as per information from NameNode?

2) Spark do partition of data as per datanode block size but all data can not be stored into main memory. Here we are not using StorageLevel. So what will happen here?

3) Spark do partition the data, some data will store on main memory once this main memory store's data will process again spark will load other data from disc.

Answer

Glennie Helles Sindholt picture Glennie Helles Sindholt · Oct 25, 2017

First of all, Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads in data in partitions - the number of concurrently loaded partitions depend on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.

Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations on the data in the loaded partitions, it stores the output as shuffle-data and then reads in more partitions. It then applies the transformations on these partitions, stores the output as shuffle-data, reads in more partitions and so forth until all data has been read.

If you apply no transformation but only do for instance a count, Spark will still read in the data in partitions, but it will not store any data in your cluster and if you do the count again it will read in all the data once again. To avoid reading in data several times, you might call cache or persist in which case Spark will try to store the data in you cluster. On cache (which is the same as persist(StorageLevel.MEMORY_ONLY) it will store all partitions in memory - if it doesn't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and the rest will be put on disk. If data doesn't fit on disk either the OS will usually kill your workers.

Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on and some of the memory is used for storage if you call cache or persist.

I hope this explanation helps :)