I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.
Why is `take(100)` basically instant, whereas

```scala
df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

takes forever? I do not want to obtain the first 100 records per partition, just any 100 records. Why is `take()` so much faster than `limit()`?
Although this has already been answered, I want to share what I learned.
`myDataFrame.take(10)` -> results in an Array of Rows. This is an action: it collects the data to the driver (like `collect` does).

`myDataFrame.limit(10)` -> results in a new DataFrame. This is a transformation: it does not collect the data by itself.
I do not have a full explanation of why limit then takes longer; that may be covered in other answers. This is just a basic answer to the difference between take and limit.
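To make the action/transformation distinction concrete, here is a minimal sketch (plain Scala, no Spark; all names are illustrative, not Spark APIs) of the scheduling behavior usually cited for this difference: `take(n)` can scan partitions incrementally and stop as soon as it has `n` rows, while a `limit(n)` that feeds a write job launches tasks over every partition before the limit is applied.

```scala
object TakeVsLimitSketch {
  // Ten fake "partitions" of 100 rows each.
  val partitions: Seq[Seq[Int]] =
    (0 until 10).map(p => (0 until 100).map(_ + p * 100))

  // take-style: scan partitions one at a time, stop once n rows are found.
  // Returns the rows and how many partitions were actually touched.
  def takeSim(n: Int): (Seq[Int], Int) = {
    val rows = scala.collection.mutable.ArrayBuffer.empty[Int]
    var scanned = 0
    val it = partitions.iterator
    while (rows.size < n && it.hasNext) {
      rows ++= it.next().take(n - rows.size)
      scanned += 1
    }
    (rows.toSeq, scanned)
  }

  // limit-plus-write style: every partition is processed first,
  // then the result is truncated to n rows.
  def limitSim(n: Int): (Seq[Int], Int) = {
    val all = partitions.flatten // touches all partitions
    (all.take(n), partitions.size)
  }
}
```

Running both with `n = 100` yields the same 100 rows, but `takeSim` touches only 1 of the 10 partitions while `limitSim` touches all 10, which is the intuition behind `take` feeling instant and the `limit`-based write feeling slow.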