Spark SQL: how to cache sql query result without using rdd.cache()

lwwwzh · Jan 19, 2015 · Viewed 28.4k times

Is there any way to cache a SQL query result without using rdd.cache()? For example:

output = sqlContext.sql("SELECT * From people")

We can use output.cache() to cache the result, but then we cannot run SQL queries against it.

So my question is: is there anything like sqlContext.cacheTable() to cache the result?

Answer

0x0FFF · Jan 19, 2015

You should use sqlContext.cacheTable("table_name") to cache it, or alternatively run the CACHE TABLE table_name SQL statement.

Here's an example. I've got this file on HDFS:

1|Alex|[email protected]
2|Paul|[email protected]
3|John|[email protected]

Then the code in PySpark:

from pyspark.sql import Row

# Read the pipe-delimited file and map each line to a Row
people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')

Now we have a table and can query it:

sqlContext.sql('select * from people').collect()

To persist it, we have 3 options:

# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()     
# 3rd - using Spark to cache the underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()     

The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.

So going back to your question, here's one possible solution:

output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
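When you no longer need the cached result, you can release the memory as well. Here's a minimal sketch, assuming the same sqlContext and the 'people2' table registered above (this needs a running Spark 1.x context, so it won't run standalone):

```python
# Check whether the table is currently cached
sqlContext.isCached('people2')

# Drop it from the in-memory cache via the API...
sqlContext.uncacheTable('people2')

# ...or equivalently via SQL
sqlContext.sql('UNCACHE TABLE people2')
```

Note that sqlContext.cacheTable() is lazy: the data is only materialized in memory on the first query that scans the table, which is why the examples above follow it with a count(*).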