How to count unique ID after groupBy in pyspark

Lizou · Sep 26, 2017 · Viewed 91.8k times

I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# Count rows per year (this also counts duplicate Student_IDs)
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The problem I discovered is that many IDs are repeated, so the count is inflated and wrong.

I want to group the students by year, count the total number of students per year, and avoid counting repeated IDs.

Answer

pauli · Sep 26, 2017

Use the countDistinct function:

from pyspark.sql.functions import countDistinct

# Sample data: (year, student id) pairs containing duplicates
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# countDistinct counts each id at most once per group
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
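Applied to your own DataFrame, the fix is a one-line change (a sketch assuming the Df2, Year, and Student_ID names from your question); alias renames the result column, which would otherwise default to count(DISTINCT Student_ID):

from pyspark.sql.functions import countDistinct

# Count each Student_ID at most once per year
df_grouped = Df2.groupBy('Year').agg(
    countDistinct('Student_ID').alias('total_student_by_year')
)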