How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?

koiralo picture koiralo · May 26, 2017 · Viewed 81.6k times · Source

I have a large Excel(xlsx and xls) file with multiple sheet and I need convert it to RDD or Dataframe so that it can be joined to other dataframe later. I was thinking of using Apache POI and save it as a CSV and then read csv in dataframe. But if there is any libraries or API that can help in this Process would be easy. Any help is highly appreciated.

Answer

Ramesh Maharjan picture Ramesh Maharjan · May 26, 2017

The solution to your problem is to use Spark Excel dependency in your project.

Spark Excel has flexible options to play with.

I have tested the following code to read from excel and convert it to dataframe and it just works perfect

def readExcel(file: String): DataFrame = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("location", file)
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .option("addColorColumns", "False")
    .load()

val data = readExcel("path to your excel file")

data.show(false)

you can give sheetname as option if your excel sheet has multiple sheets

.option("sheetName", "Sheet2")

I hope its helpful