SparkSQL vs Hive on Spark - Difference and pros and cons?

Gaurav Khare · Jul 24, 2015

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody shed some more light on how exactly these two scenarios differ, and on the pros and cons of both approaches?

Answer

prajod · Feb 2, 2016
  1. When SparkSQL uses Hive

    SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to better optimize the queries that it executes. Here Spark is the query processor.

  2. When Hive uses Spark (see the JIRA entry HIVE-7292)

    Here the data is accessed via Spark, and Hive is the query processor. So we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2, 2016.

  3. There is a third option to process data with SparkSQL

    Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
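
To make the three options concrete, here is a minimal sketch of how each one is typically selected. It is not taken from the answer above. It assumes PySpark on Spark 2.x or later, where `SparkSession` replaced the `HiveContext` / `SQLContext` pair that was current when this answer was written. The helper names `catalog_setting` and `build_session` and the app name are made up for illustration. Option 1 also assumes a Spark build with Hive support and a reachable metastore.

```python
# Sketch only: assumes Spark 2.x or later with PySpark installed.
# The function names and the default app name are hypothetical.


def catalog_setting(use_hive_metastore):
    """Return the Spark setting that selects the catalog behind SparkSQL."""
    # "hive" reads table metadata from the Hive Metastore (option 1);
    # "in-memory" keeps SparkSQL independent of Hive (option 3).
    value = "hive" if use_hive_metastore else "in-memory"
    return {"spark.sql.catalogImplementation": value}


def build_session(use_hive_metastore, app_name="sparksql-demo"):
    """Create a SparkSession for option 1 (True) or option 3 (False)."""
    # Imported lazily so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in catalog_setting(use_hive_metastore).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Option 2 is configured on the Hive side, not in Spark: issued in a Hive
# session, this statement makes Hive plan the query and execute it on Spark.
HIVE_ON_SPARK_STATEMENT = "set hive.execution.engine=spark;"
```

With a working installation, `build_session(True).sql("SHOW TABLES").show()` would list tables registered in the Hive Metastore, while `build_session(False)` starts with an empty catalog. The catalog setting is fixed when the session is first created, so the two variants need separate Spark applications rather than one process switching between them.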