I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns of a dataset? I thought about imposing a schema with as, but the key column is a struct (due to the groupBy operation), and I can't figure out how to define a case class with a StructType in it.
I tried defining a schema as follows:
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
                                                              StructField("dst", IntegerType), true)),
                              StructField("count", LongType, true))
edge_count.as[returnSchema]
but I got a compile error:
Message: <console>:74: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
The best solution is to name your columns explicitly, e.g.,
df
  .groupBy('a, 'b)
  .agg(
    expr("count(*) as cnt"),
    expr("sum(x) as x"),
    expr("sum(y)").as("y")
  )
If you are using a Dataset, you have to provide the type of your columns, e.g., expr("count(*) as cnt").as[Long].
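For instance, here is a minimal sketch of the typed route, assuming a hypothetical Edge case class that mirrors the question's edge/count shape (the names and data are illustrative):

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical record type standing in for the groupBy key struct.
case class Edge(src: Int, dst: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val edges: Dataset[Edge] = Seq(Edge(1, 2), Edge(1, 2), Edge(3, 4)).toDS()

// groupByKey keeps the Dataset typed; each aggregate carries an explicit
// type, so the result is a Dataset[(Edge, Long)], not an untyped DataFrame.
val counts: Dataset[(Edge, Long)] =
  edges
    .groupByKey(identity)
    .agg(expr("count(*) as cnt").as[Long])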
You can use the DSL directly but I often find it to be more verbose than simple SQL expressions.
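For comparison, a DSL version of the aggregation above might look like this (df and the column names a, b, x, y are the same hypothetical ones as before):

import org.apache.spark.sql.functions.{col, count, sum}

// Equivalent to the expr(...) snippet, spelled with DSL functions.
val aggregated = df
  .groupBy(col("a"), col("b"))
  .agg(
    count("*").as("cnt"),
    sum(col("x")).as("x"),
    sum(col("y")).as("y")
  )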
If you want to do mass renames, put the old-to-new names in a Map and then foldLeft over the dataframe.
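A sketch of that pattern, with a hypothetical renames map:

// Old column name -> new column name; the entries are illustrative.
val renames = Map("cnt" -> "count", "x" -> "sum_x", "y" -> "sum_y")

// Apply each rename in turn; withColumnRenamed is a no-op for names
// that don't exist in the dataframe, so stale entries are harmless.
val renamed = renames.foldLeft(df) { case (acc, (oldName, newName)) =>
  acc.withColumnRenamed(oldName, newName)
}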