I'm using Spark Streaming 2.1. I'd like to periodically refresh a cached table (loaded by a Spark-provided data source such as Parquet or MySQL, or by a user-defined data source).
How do I refresh the table?
Suppose I have a table loaded by
spark.read.format("").load().createTempView("my_table")
and it is also cached by
spark.sql("cache table my_table")
Is the following code enough to refresh the table, so that the table will automatically be cached the next time it is loaded?
spark.sql("refresh table my_table")
Or do I have to do it manually with
spark.table("my_table").unpersist
spark.read.format("").load().createOrReplaceTempView("my_table")
spark.sql("cache table my_table")
Is it safe to refresh the table concurrently? By concurrent I mean using a ScheduledThreadPoolExecutor
to do the refresh work on a separate thread from the main one, e.g. something like the sketch below.
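(A sketch of what I mean; it assumes the hypothetical refreshManually helper above, and the 10-minute interval is arbitrary.)

import java.util.concurrent.{Executors, TimeUnit}

// Run the refresh on a single background thread, apart from the main one.
val scheduler = Executors.newScheduledThreadPool(1)
scheduler.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = refreshManually(spark, sourcePath)
}, 0, 10, TimeUnit.MINUTES)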
What will happen if Spark is using the cached table when I call refresh on it?
In Spark 2.2.0, a feature was introduced to refresh the metadata of a table when it has been updated by Hive or an external tool.
You can achieve this by using the API:
spark.catalog.refreshTable("my_table")
This API invalidates and refreshes the cached data and metadata for that table, keeping it consistent with the source.
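For example, it could replace the manual unpersist/reload/cache sequence from the question (a minimal sketch, assuming an existing SparkSession named spark):

// Refresh metadata and invalidate cached data in one call.
spark.catalog.refreshTable("my_table")
// If "my_table" is cached, the stale entry is dropped and the table is
// re-cached lazily the next time a query scans it.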