Spark 1.6: filtering DataFrames generated by describe()

Rami picture Rami · Feb 8, 2016 · Viewed 14.2k times · Source

The problem arises when I call describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling describe function yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can show statsDF normally by calling statsDF.show()

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would like now to get the standard deviation and the mean from statsDF, but when I am trying to collect the values by doing something like:

val temp = statsDF.where($"summary" === "stddev").collect()

I am getting Task not serializable exception.

I am also facing the same exception when I call:

statsDF.where($"summary" === "stddev").show()

It looks like we cannot filter DataFrames generated by describe() function?

Answer

eliasah picture eliasah · Feb 8, 2016

I have considered a toy dataset I had containing some health disease data

val stddev_tobacco = rawData.describe().rdd.map{ 
    case r : Row => (r.getAs[String]("summary"),r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect