SPARK DataFrame: How to efficiently split dataframe for each group based on same column values

Question 1

scala apache-spark apache-spark-sql spark-dataframe parquet

shubham rajput · Jan 15, 2017 · Viewed 31.4k times · Source

Answer

As noted in my comments, one potentially easy approach to this problem would be to use:

df.write.partitionBy("hour").saveAsTable("myparquet")

As noted, the folder structure would be myparquet/hour=1, myparquet/hour=2, ..., myparquet/hour=24 as opposed to myparquet/1, myparquet/2, ..., myparquet/24.
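The partitioned write described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the rows are toy data shaped like the question's output, the session uses a local master, the output path /tmp/myparquet is hypothetical, and the write goes through .parquet(path) instead of saveAsTable so that no Hive metastore is needed. The folder names follow the column name passed to partitionBy, here Hour.

```scala
import org.apache.spark.sql.SparkSession

object PartitionByHour {
  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("partition-by-hour")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows shaped like the aggregated output in the question.
    val df = Seq(
      (0L, "cat26", 30.9),
      (0L, "cat13", 22.1),
      (1L, "cat67", 28.5),
      (2L, "cat56", 39.6)
    ).toDF("Hour", "Category", "TotalValue")

    // Hypothetical output location (assumption, not from the original post).
    val out = "/tmp/myparquet"

    // A single write job; Spark creates out/Hour=0, out/Hour=1, out/Hour=2.
    df.write.mode("overwrite").partitionBy("Hour").parquet(out)

    spark.stop()
  }
}
```

The partition column is encoded in the directory name and is not stored inside the data files, which is why the Hour=N naming matters when the data is read back.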

To change the folder structure, you could:

Potentially use the Hive configuration setting hcat.dynamic.partitioning.custom.pattern within an explicit HiveContext; more information is available at HCatalog DynamicPartitions.
Change the file system directly after you have executed the df.write.partitionBy.saveAsTable(...) command, with something like for f in *; do mv $f ${f/${f:0:5}/} ; done, which strips the first five characters (the Hour= text) from each folder name.
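For a job that stays on the JVM, the renaming step could also be done from Scala. The sketch below is an assumption-laden illustration, not part of the original answer: the object and method names are hypothetical, and it uses java.io/java.nio, so it only applies to a local file system (HDFS or object stores would need their own file system API).

```scala
import java.io.File
import java.nio.file.{Files, Path}

object StripPartitionPrefix {
  /** Renames every child directory of `root` named `<prefix><value>` to `<value>`
    * and returns the new paths. Hypothetical helper, local file system only. */
  def stripPrefix(root: Path, prefix: String): Seq[Path] = {
    // listFiles() returns null if root is not a readable directory.
    val children = Option(root.toFile.listFiles()).getOrElse(Array.empty[File])
    children.toSeq
      .filter(f => f.isDirectory && f.getName.startsWith(prefix))
      .map { dir =>
        // e.g. root/Hour=3 becomes root/3
        val target = dir.toPath.resolveSibling(dir.getName.stripPrefix(prefix))
        Files.move(dir.toPath, target)
      }
  }
}
```

A call such as StripPartitionPrefix.stripPrefix(java.nio.file.Paths.get("/tmp/myparquet"), "Hour=") would then play the role of the shell loop, leaving other entries such as _SUCCESS untouched.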

It is important to note that once the naming pattern of the folders is changed, running spark.read.parquet(...) on that folder will no longer discover the dynamic partitions automatically, since the partition key (i.e. Hour) is missing from the path.
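To make the consequence concrete, here is a hedged sketch of reading the data back in both layouts. The object and method names are hypothetical, and root stands for whatever directory the data was written to. With the untouched Hour=N layout Spark infers the Hour column from the path; with renamed folders the column has to be restored by hand.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ReadHourFolders {
  /** Untouched layout (root/Hour=N): Spark discovers Hour from the directory
    * names, and the filter lets it skip the other partitions. */
  def readPartitioned(spark: SparkSession, root: String, hour: Long): DataFrame =
    spark.read.parquet(root).where(col("Hour") === hour)

  /** Renamed layout (root/N): the files no longer carry Hour at all, so it is
    * added back as a literal column after reading the single folder. */
  def readRenamed(spark: SparkSession, root: String, hour: Long): DataFrame =
    spark.read.parquet(s"$root/$hour").withColumn("Hour", lit(hour))
}
```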

Question 2

I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)

The results look like:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   0|   cat13|      22.1|
|   0|   cat95|      19.6|
|   0|  cat105|       1.3|
|   1|   cat67|      28.5|
|   1|    cat4|      26.8|
|   1|   cat13|      12.6|
|   1|   cat23|       5.3|
|   2|   cat56|      39.6|
|   2|   cat40|      29.7|
|   2|  cat187|      27.9|
|   2|   cat68|       9.8|
|   3|    cat8|      35.6|
| ...|    ....|      ....|
+----+--------+----------+

I would like to make new dataframes based on every unique value of col("Hour"), i.e.

for the group of Hour==0
for the group of Hour==1
for the group of Hour==2 and so on...

So the desired output would be:

df0 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   0|   cat13|      22.1|
|   0|   cat95|      19.6|
|   0|  cat105|       1.3|
+----+--------+----------+

df1 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   1|   cat67|      28.5|
|   1|    cat4|      26.8|
|   1|   cat13|      12.6|
|   1|   cat23|       5.3|
+----+--------+----------+

and similarly,

df2 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   2|   cat56|      39.6|
|   2|   cat40|      29.7|
|   2|  cat187|      27.9|
|   2|   cat68|       9.8|
+----+--------+----------+

Any help is highly appreciated.

EDIT 1:

What I have tried:

df.foreach(
  row => splitHour(row)
  )

def splitHour(row: Row) ={
    val Hour=row.getAs[Long]("Hour")

    val HourDF= sparkSession.createDataFrame(List((s"$Hour",1)))

    val hdf=HourDF.withColumnRenamed("_1","Hour_unique").drop("_2")

    val mydf: DataFrame =df.join(hdf,df("Hour")===hdf("Hour_unique"))

    mydf.write.mode("overwrite").parquet(s"/home/dev/shaishave/etc/myparquet/$Hour/")
  }

PROBLEM WITH THIS STRATEGY:

It took 8 hours when run on a dataframe df with over 1 million rows, with the Spark job given around 10 GB of RAM on a single node. So the join is turning out to be highly inefficient.

Caveat: I have to write each dataframe mydf as parquet, and its nested schema must be maintained (not flattened).
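For comparison, the same per-hour split can be sketched without the per-row join. This is an illustrative alternative, not the approach from the post: the object and method names are hypothetical, and it assumes Hour is a Long column (as the getAs[Long] call above suggests). The distinct hours are collected on the driver, and each group is a plain filter on df; a filter passes rows through unchanged, so nested columns stay nested.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SplitByHour {
  /** Builds one lazy DataFrame per distinct Hour value, using a filter
    * instead of a join. Assumes Hour is stored as a Long. */
  def split(df: DataFrame): Map[Long, DataFrame] = {
    // The set of hours is small, so collecting it on the driver is cheap.
    val hours = df.select("Hour").distinct().collect().map(_.getLong(0))
    hours.map(h => h -> df.filter(col("Hour") === h)).toMap
  }
}
```

Each entry could then be written with something like part.write.mode("overwrite").parquet(s"$out/$h"), where out is a placeholder for the target directory. Since every filter rescans df, caching df first may help when it is expensive to compute.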
