I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.
Why is `take(100)` basically instant, whereas

```scala
df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

takes forever? I do not want to obtain the first 100 records per partition, just any 100 records. Why is `take()` so much faster than `limit()`?
Although this has already been answered, I want to share what I learned.
`myDataFrame.take(10)` -> results in an Array of Rows. This is an action: it collects the data to the driver (like `collect` does).

`myDataFrame.limit(10)` -> results in a new DataFrame. This is a transformation: it does not collect the data by itself.
I do not have a full explanation of why limit then takes longer; that may be covered in other answers. This is just a basic answer to the difference between take and limit.
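To make the action/transformation distinction concrete, here is a minimal sketch (plain Scala, no Spark; all names are illustrative, not Spark APIs) of the scheduling behavior usually cited for this difference: `take(n)` can scan partitions incrementally and stop as soon as it has `n` rows, while a `limit(n)` that feeds a write job launches tasks over every partition before the limit is applied.

```scala
object TakeVsLimitSketch {
  // Ten fake "partitions" of 100 rows each.
  val partitions: Seq[Seq[Int]] =
    (0 until 10).map(p => (0 until 100).map(_ + p * 100))

  // take-style: scan partitions one at a time, stop once n rows are found.
  // Returns the rows and how many partitions were actually touched.
  def takeSim(n: Int): (Seq[Int], Int) = {
    val rows = scala.collection.mutable.ArrayBuffer.empty[Int]
    var scanned = 0
    val it = partitions.iterator
    while (rows.size < n && it.hasNext) {
      rows ++= it.next().take(n - rows.size)
      scanned += 1
    }
    (rows.toSeq, scanned)
  }

  // limit-plus-write style: every partition is processed first,
  // then the result is truncated to n rows.
  def limitSim(n: Int): (Seq[Int], Int) = {
    val all = partitions.flatten // touches all partitions
    (all.take(n), partitions.size)
  }
}
```

Running both with `n = 100` yields the same 100 rows, but `takeSim` touches only 1 of the 10 partitions while `limitSim` touches all 10, which is the intuition behind `take` feeling instant and the `limit`-based write feeling slow.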