exitTotalDF
.filter($"accid" === "dc215673-ef22-4d59-0998-455b82000015")
.groupBy("exiturl")
.agg(first("accid"), first("segment"), $"exiturl", sum("session"), sum("sessionfirst"), first("date"))
.orderBy(desc("session"))
.take(500)
org.apache.spark.sql.AnalysisException: cannot resolve '`session`' given input columns: [first(accid, false), first(date, false), sum(session), exiturl, sum(sessionfirst), first(segment, false)]
Its like the sum function cannot find the column names properly.
Using Spark 2.1
Typically in scenarios like this, I'll use the as
method on the column. For example .agg(first("accid"), first("segment"), $"exiturl", sum("session").as("session"), sum("sessionfirst"), first("date"))
. This gives you more control on what to expect, and if the summation name were to ever change in future versions of spark, you will have less of a headache updating all of the names in your dataset.
Also, I just ran a simple test. When you don't specify the name, it looks like the name in Spark 2.1 gets changed to "sum(session)". One way to find this yourself is to call printSchema on the dataset.