Difference between invalidate metadata and refresh commands in Impala?

covfefe picture covfefe · Feb 15, 2017 · Viewed 23.2k times · Source

I saw at this link which affects Impala version 1.1:

Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.

Does this still hold true for later versions of Impala?

Answer

spijs picture spijs · Feb 15, 2017

According to Cloudera's Impala guide (Cloudera Enterprise 5.8) but stayed the same for 5.9:

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

and related to working on existing tables:

The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.

So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.