In Spark, how to quickly estimate the number of elements in a DataFrame

lovasoa · May 31, 2017

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count()?

Maybe we could calculate this information from the number of partitions of the Dataset, could we?

Answer

Raphael Roth · May 31, 2017

You could try to use countApprox on the RDD API. Although this also launches a Spark job, it should be faster: it gives you an estimate of the true count within the time budget you specify (in milliseconds), together with a confidence interval (i.e. the probability that the true value lies within that range).

Example usage:

val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)

You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
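A fuller sketch of the same idea, for reference. This is a minimal, self-contained example assuming a local-mode SparkSession; the DataFrame contents, app name, and the timeout/confidence values are illustrative, not prescriptive. countApprox returns a PartialResult[BoundedDouble]; reading initialValue blocks until either the timeout expires or the job finishes, whichever comes first.

```scala
import org.apache.spark.sql.SparkSession

// Local-mode session purely for illustration (assumption: spark-sql is on the classpath).
val spark = SparkSession.builder()
  .appName("approx-count-sketch")
  .master("local[*]")
  .getOrCreate()

// Illustrative DataFrame of one million rows.
val df = spark.range(0L, 1000000L).toDF("id")

// Ask for an estimate within 1 second at 90% confidence.
// countApprox returns a PartialResult[BoundedDouble].
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)

// initialValue blocks until the timeout (or job completion) and
// exposes the low/high bounds of the interval.
val bounds = cntInterval.initialValue
println(s"count lies in [${bounds.low}, ${bounds.high}] with 90% confidence")

spark.stop()
```

If the job finishes before the timeout (as it will here for such a small dataset), the bounds collapse to the exact count.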