How to name aggregate columns?

Emre · Jul 25, 2016 · Viewed 16.7k times

I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns from a dataset? I thought about imposing a schema with as, but the key column is a struct (due to the groupBy operation), and I can't figure out how to define a case class with a StructType in it.

I tried defining a schema as follows:

val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
                                                             StructField("dst", IntegerType), true)), 
                              StructField("count", LongType, true))
edge_count.as[returnSchema]

but I got a compile error:

Message: <console>:74: error: overloaded method value apply with alternatives:
  (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
 cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
       val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),

Answer

Sim · Jul 25, 2016

The best solution is to name your columns explicitly, e.g.,

df
  .groupBy('a, 'b)
  .agg(
    expr("count(*) as cnt"),
    expr("sum(x) as x"),
    expr("sum(y)").as("y")
  )

If you are using a Dataset, you also have to provide the type of your columns, e.g., expr("count(*) as cnt").as[Long].
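As a sketch of that typed path (assuming a SparkSession named spark and an illustrative case class Edge, neither of which appears in the original post), groupByKey keeps you in the Dataset API while .as[Long] supplies the result type:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

// Assumed setup, not from the post.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative type standing in for the question's edge struct.
case class Edge(src: Int, dst: Int)

val edges: Dataset[Edge] = Seq(Edge(1, 2), Edge(1, 2), Edge(2, 3)).toDS()

// groupByKey stays typed; .as[Long] attaches an encoder to the
// expression, so the result is a Dataset[(Edge, Long)], not a DataFrame.
val edgeCounts: Dataset[(Edge, Long)] =
  edges.groupByKey(identity).agg(expr("count(*) as cnt").as[Long])
```

Here the key of each group is the full Edge value, which mirrors the struct key the question's groupBy produces.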

You can use the DSL directly, but I often find it more verbose than simple SQL expressions.

If you want to do mass renames, use a Map of old-to-new names and then foldLeft over the dataframe.
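A minimal sketch of that foldLeft pattern (the Map contents are illustrative):

```scala
// Old name -> new name; the entries are illustrative.
val renames = Map("cnt" -> "count", "sum(x)" -> "x")

// Thread the dataframe through one withColumnRenamed call per entry.
val renamed = renames.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}
```

withColumnRenamed is a no-op for names that don't exist, so unmatched entries in the Map are silently skipped.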