Converting CSV to ORC with Spark

Edmon · Apr 5, 2016 · Viewed 6.9k times

I've seen this blog post by Hortonworks on ORC support in Spark 1.2 through data sources.

It covers version 1.2 and addresses the creation of an ORC file from objects, not the conversion from CSV to ORC. I have also seen ways to do these conversions in Hive, as intended.

Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a DataFrame in Spark?

Answer

eliasah · Apr 5, 2016

I'm going to omit the CSV reading part because that question has been answered many times before, and plenty of tutorials are available on the web for that purpose, so it would be overkill to write it again. Check here if you want!

ORC support

ORC files are supported through the HiveContext.

HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC files, window functions and other features depend on HiveContext, which reads its configuration from hive-site.xml on the classpath.

You can define a HiveContext as follows:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

If you are working with the spark-shell, you can use sqlContext directly for this purpose without creating a hiveContext, since by default sqlContext is created as a HiveContext (provided Spark was built with Hive support).

Adding stored as orc at the end of the SQL statement below ensures that the Hive table is stored in the ORC format, e.g.:

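val df: DataFrame = ??? // placeholder: the DataFrame loaded from the CSV file
// Create a Hive table stored in the ORC format; the statement itself returns no rows
hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
// The data to persist is the DataFrame itself, not the result of the DDL statement
val results = df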

Saving as an ORC file

Let’s persist the DataFrame in the ORC format:

results.write.format("orc").save("data_orc")

To store the results in the Hive warehouse directory rather than the user directory, use the path /apps/hive/warehouse/data_orc instead (the Hive warehouse path from hive-default.xml).
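For completeness, here is a minimal end-to-end sketch of the round trip asked about in the question: read a CSV file, write it as ORC, and load the ORC files back as a DataFrame. It is only a sketch under a few assumptions: the CSV is read with the external spark-csv package (com.databricks:spark-csv), which has to be on the classpath for Spark 1.6, the file has a header row, and the names CsvToOrcSketch and csvToOrc are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper object, not part of Spark
object CsvToOrcSketch {

  // Reads the CSV at csvPath, writes it as ORC under orcPath,
  // and returns the ORC data loaded back as a DataFrame.
  def csvToOrc(hiveContext: HiveContext, csvPath: String, orcPath: String): DataFrame = {
    // Assumption: the spark-csv package is available and the file has a header row
    val df: DataFrame = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // guess column types instead of reading everything as strings
      .load(csvPath)

    // Write the DataFrame as ORC files, replacing any previous output at orcPath
    df.write.mode(SaveMode.Overwrite).format("orc").save(orcPath)

    // Load the ORC files back as a DataFrame
    hiveContext.read.format("orc").load(orcPath)
  }
}
```

With the hiveContext defined earlier, a call such as CsvToOrcSketch.csvToOrc(hiveContext, "prices.csv", "data_orc") would return the reloaded DataFrame, where prices.csv is a placeholder file name. SaveMode.Overwrite is a choice made here so that the sketch can be re-run; the default mode fails if the output path already exists.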