I have a large Excel(xlsx and xls)
file with multiple sheet and I need convert it to RDD
or Dataframe
so that it can be joined to other dataframe
later. I was thinking of using Apache POI and save it as a CSV
and then read csv
in dataframe
. But if there is any libraries or API that can help in this Process would be easy. Any help is highly appreciated.
The solution to your problem is to use Spark Excel
dependency in your project.
Spark Excel has flexible options
to play with.
I have tested the following code to read from excel
and convert it to dataframe
and it just works perfect
def readExcel(file: String): DataFrame = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", file)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.load()
val data = readExcel("path to your excel file")
data.show(false)
you can give sheetname
as option
if your excel sheet has multiple sheets
.option("sheetName", "Sheet2")
I hope its helpful