How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Question 1

How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

excel scala apache-spark apache-spark-dataset spark-excel

tomaskazemekas · Mar 12, 2017 · Viewed 8.7k times · Source

Answer

Answer

For this I have used spark-excel package. It can be added to build.sbt file as : libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"

The code to execute in IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read xlsx file using spark-excel. 
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows

Question 2

Recently I wanted to do Spark Machine Learning Lab from Spark Summit 2016. Training video is here and exported notebook is available here.

The dataset used in the lab can be downloaded from UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is xlsx file with five sheets.

To use the data in the lab I needed to read all the sheets form the Excel file and to concatenate them into one Spark DataFrame. During the training they are using Databricks Notebook but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlxs, sheet2.xlsx etc. and put them into sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Answer

Related questions