PySpark - Sum a column in dataframe and return results as int

python dataframe sum pyspark

Bryce Ramgovind · Dec 14, 2017

I have a PySpark dataframe with a column of numbers. I need to sum that column and get the result back as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "Number"])

I do the following to sum the column:

df.groupBy().sum()

But I get a dataframe back:

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int and stored in a variable to be used elsewhere in the program:

result = 130

Answer

I think the simplest way is:

df.groupBy().sum().collect()

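This returns a list of Row objects rather than a dataframe. Indexing with [0][0] takes the first value of the first row, so in your example: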

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130