Suppose I have a defined schema for loading 10 CSV files from a folder. Is there a way to load them as tables automatically using Spark SQL? I know this can be done by creating an individual DataFrame for each file [as shown below], but can it be automated with a single command that points to a folder rather than to a single file?
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("../Downloads/2008.csv")
Use a wildcard, e.g. replace 2008 with *:
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("../Downloads/*.csv")  # <-- note the star (*)
# these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")
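Since the question mentions a predefined schema, you can pass it to the reader so every file is parsed with the same column definitions; a directory path also works in place of a wildcard. A minimal sketch follows, where the column names Year and Carrier are hypothetical placeholders for your actual schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema; substitute your real column definitions
schema = StructType([
    StructField("Year", IntegerType(), True),
    StructField("Carrier", StringType(), True),
])

# a directory path works too: Spark reads every file inside the folder
df = spark.read.schema(schema).option("header", "true").csv("../Downloads/")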
Notes:

Replace format("com.databricks.spark.csv") with format("csv"), or use the csv method instead; the com.databricks.spark.csv format has been integrated into Spark 2.0 as the built-in csv data source.

Use spark (the SparkSession), not sqlContext; SparkSession replaces SQLContext as the entry point in Spark 2.0.
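Putting it together in Spark 2.0 style, and registering the result as a temporary view so it can be queried with Spark SQL; a minimal sketch, where the app name and view name are arbitrary:

from pyspark.sql import SparkSession

# SparkSession replaces SQLContext as the entry point in Spark 2.0+
spark = SparkSession.builder.appName("csv-loader").getOrCreate()

df = spark.read.option("header", "true").csv("../Downloads/*.csv")

# register as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView("csv_data")
spark.sql("SELECT COUNT(*) FROM csv_data").show()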