What's the equivalent of Panda's value_counts() in PySpark?

TSAR · Jun 27, 2018 · Viewed 10.7k times

I have the following Python/pandas command:

df.groupby('Column_Name').agg(lambda x: x.value_counts().max())

where I am getting the value counts for ALL columns in a DataFrameGroupBy object.

How do I do this action in PySpark?
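For reference, here is a minimal sketch of what that pandas line produces, using a hypothetical toy DataFrame (the column names 'A' and 'B' are assumptions for illustration):

    import pandas as pd

    df = pd.DataFrame({
        'Column_Name': ['x', 'x', 'y', 'y', 'y'],
        'A': [1, 1, 2, 2, 3],
        'B': ['u', 'v', 'v', 'v', 'w'],
    })

    # For each group, take the largest value count found in each other column.
    result = df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
    print(result)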

Answer

Tanjin · Jun 27, 2018

It's more or less the same:

spark_df.groupBy('column_name').count().orderBy('count')

In groupBy you can pass multiple columns, separated by commas.

For example: groupBy('column_1', 'column_2'), as in the sketch below.
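A minimal runnable sketch, assuming an existing SparkSession and a toy DataFrame with hypothetical columns 'column_1' and 'column_2':

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark_df = spark.createDataFrame(
        [('x', 'u'), ('x', 'u'), ('y', 'v'), ('y', 'w')],
        ['column_1', 'column_2'],
    )

    # Count occurrences of each combination, most frequent first.
    (spark_df
        .groupBy('column_1', 'column_2')
        .count()
        .orderBy('count', ascending=False)
        .show())

Sorting descending on the count column gives the same "most common values first" ordering you get from pandas value_counts().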