How to refresh a table and do it concurrently?

宇宙人 · Aug 22, 2017 · Viewed 19k times

I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded from a Spark-provided DataSource such as Parquet or MySQL, or from a user-defined data source).

  1. how to refresh the table?

    Suppose I have some table loaded by

    spark.read.format("").load().createTempView("my_table")

    and it is also cached by

    spark.sql("cache table my_table")

    is the following code enough to refresh the table, and will the table be cached again automatically the next time it is loaded:

    spark.sql("refresh table my_table")

    or do I have to do that manually with

    spark.table("my_table").unpersist()
    spark.read.format("").load().createOrReplaceTempView("my_table")
    spark.sql("cache table my_table")

  2. is it safe to refresh the table concurrently?

    By concurrent I mean using a ScheduledThreadPoolExecutor to do the refresh work on a thread separate from the main one.

    What will happen if Spark is using the cached table when I call refresh on it?
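
    For reference, the background-refresh setup described above might look like this (a minimal sketch, assuming an existing SparkSession named `spark`; the source format, path, and 10-minute interval are placeholders, and `refreshMyTable` is a hypothetical helper wrapping the manual unpersist/reload/cache sequence):

        import java.util.concurrent.{Executors, TimeUnit}

        // Hypothetical helper: drop the cached copy, re-load the source,
        // and re-cache the view under the same name.
        def refreshMyTable(): Unit = {
          spark.table("my_table").unpersist()
          spark.read.format("parquet").load("/path/to/data")
            .createOrReplaceTempView("my_table")
          spark.sql("cache table my_table")
        }

        // Schedule the refresh off the main thread, every 10 minutes.
        val scheduler = Executors.newScheduledThreadPool(1)
        scheduler.scheduleAtFixedRate(
          new Runnable { def run(): Unit = refreshMyTable() },
          10, 10, TimeUnit.MINUTES)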

Answer

Ganesh · Aug 22, 2017

Spark 2.2.0 introduced a feature for refreshing the metadata of a table when it has been updated by Hive or some external tool.

You can achieve this with the API

spark.catalog.refreshTable("my_table")

This call invalidates and refreshes the cached data and metadata for the table, keeping it consistent with the underlying source.
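
For example (a minimal sketch, assuming an existing SparkSession named `spark`; since `refreshTable` is lazy, the follow-up action that forces the re-read is optional):

    // Invalidate the cached data and metadata for my_table.
    // Because the table was cached, Spark will lazily re-read the
    // underlying files and re-cache them on the next access.
    spark.catalog.refreshTable("my_table")

    // Optionally force the re-read and re-cache eagerly.
    spark.table("my_table").count()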