I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.

Any ideas how to make Spark save to a file with a specified name?
Since Spark writes data through the Hadoop FileSystem API, this is more or less inevitable. If you do
rdd.saveAsTextFile("foo")
it will be saved as "foo/part-XXXXX", with one part-* file per partition in the RDD you are trying to save. The reason each partition is written to a separate file is fault tolerance: if the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover from a single task failure.
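The fault-tolerance argument can be mimicked in plain Python, without Spark; `save_partitions` and the directory name are made-up for illustration:

```python
import os

def save_partitions(partitions, out_dir):
    """Write each partition to its own part-NNNNN file, the way Spark does.

    Because every partition owns an independent file, a failed/retried
    writer for partition i simply overwrites part-0000i and leaves the
    other part files untouched.
    """
    os.makedirs(out_dir, exist_ok=True)
    for i, records in enumerate(partitions):
        with open(os.path.join(out_dir, "part-%05d" % i), "w") as f:
            for rec in records:
                f.write(rec + "\n")

partitions = [["a", "b"], ["c"], ["d", "e"]]
save_partitions(partitions, "foo")
# foo/ now contains part-00000, part-00001, part-00002
```

Re-running the loop body for a single index is the moral equivalent of Spark retrying one task: only that one part file changes.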
The part-XXXXX files are usually not a problem if you are going to consume them again with Spark or other Hadoop-based frameworks: since they all use the HDFS API, if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
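If you really do need a single file with a name of your choosing, one common workaround is to merge the part-* files yourself after the job finishes. A minimal sketch in plain Python; `merge_parts` and the paths are hypothetical names, not a Spark API:

```python
import glob
import os
import shutil

def merge_parts(spark_output_dir, dest_file):
    """Concatenate all part-* files from a Spark output directory
    into a single file with the desired name, in partition order."""
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*")))
    with open(dest_file, "wb") as out:
        for part in part_files:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# Example usage after rdd.saveAsTextFile("foo"):
# merge_parts("foo", "result.csv")
```

On an actual HDFS cluster, `hadoop fs -getmerge foo result.csv` does the same thing. Alternatively, `rdd.coalesce(1).saveAsTextFile("foo")` produces a single part-00000, at the cost of funneling all the data through one task.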