In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count() does.
Could we perhaps calculate this information from the number of partitions of the Dataset?
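For example, something along these lines (a rough sketch of the idea, assuming roughly evenly sized partitions; df is a placeholder Dataset):

// Count only the first partition, then scale by the number of partitions.
// This is only a sensible estimate if partition sizes are fairly uniform.
val rdd = df.rdd
val numPartitions = rdd.getNumPartitions
val firstPartitionCount = rdd
  .mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) Iterator(iter.size.toLong) else Iterator.empty
  }
  .collect()
  .headOption
  .getOrElse(0L)
val approxCount = firstPartitionCount * numPartitions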
You could try countApprox on the RDD API. Although this also launches a Spark job, it should be faster, since it only gives you an estimate of the true count for a given amount of time you want to spend (in milliseconds) and a confidence level, i.e. the probability that the true count lies within the returned interval:
Example usage:

// Wait at most 1000 ms, then return the current estimate at 90% confidence.
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
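A small sketch to see that trade-off in practice (df as above; the timeout values are just illustrative):

// Longer timeouts let more tasks finish before the estimate is returned,
// so the interval tightens around the true count.
Seq(100L, 1000L, 10000L).foreach { t =>
  val bounds = df.rdd.countApprox(timeout = t, confidence = 0.90).initialValue
  println(s"timeout = $t ms -> estimate in [${bounds.low}, ${bounds.high}]")
}

// If you can afford to wait, getFinalValue() blocks until the underlying
// job completes and returns the exact count.
val exact = df.rdd.countApprox(timeout = 1000L, confidence = 0.90).getFinalValue()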