The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a a way to get the distinct count for all (or some selected) columns.
In pySpark
you could do something like this, using countDistinct()
:
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala
:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct()
.