Spark DataFrame exploding a map with the key as a member

dlamblin picture dlamblin · May 26, 2017 · Viewed 8.1k times · Source

I've found a map exploding example at databrick's blog:

// input
{
  "a": {
    "b": 1,
    "c": 2
  }
}

Python: events.select(explode("a").alias("x", "y"))
 Scala: events.select(explode('a) as Seq("x", "y"))
   SQL: select explode(a) as (x, y) from events

// output
[{ "x": "b", "y": 1 }, { "x": "c", "y": 2 }]

However, I can't see a way that this leads me to change my map to an array into which the key is flattened which is then exploded:

// input
{
  "id": 0,
  "a": {
    "b": {"d": 1, "e": 2}
    "c": {"d": 3, "e": 4}
  }
}
// Schema
struct<id:bigint,a:map<string,struct<d:bigint,e:bigint>>>
root
 |-- id: long (nullable = true)
 |-- a: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- d: long (nullable = true)
 |    |    |-- e: long (nullable = true)


// Imagined proces
Python: …
 Scala: events.select('id, explode('a) as Seq("x", "*")) //? "*" ?
   SQL: …

// Desired output
[{ "id": 0, "x": "b", "d": 1, "e": 2 }, { "id": 0, "x": "c", "d": 3, "e": 4 }]

Is there some obvious way that one could take such input to make a table like:

id | x | d | e
---|---|---|---
 0 | b | 1 | 2
 0 | c | 3 | 4

Answer

Raphael Roth picture Raphael Roth · May 26, 2017

Although I don't know whether its possible to explode the map with one single explode, there is a way to it with a UDF. The trick is to use Row#schema.fields(i).name to get the name of the "key"

def mapStructs = udf((r: Row) => {
  r.schema.fields.map(f => (
    f.name,
    r.getAs[Row](f.name).getAs[Long]("d"),
    r.getAs[Row](f.name).getAs[Long]("e"))
  )
})

df
  .withColumn("udfResult", explode(mapStructs($"a")))
  .withColumn("x", $"udfResult._1")
  .withColumn("d", $"udfResult._2")
  .withColumn("e", $"udfResult._3")
  .drop($"udfResult")
  .drop($"a")
  .show

gives

+---+---+---+---+
| id|  x|  d|  e|
+---+---+---+---+
|  0|  b|  1|  2|
|  0|  c|  3|  4|
+---+---+---+---+