When creating an external table in hive can I point the location to specific files in a directory?

George TeVelde picture George TeVelde · Jun 29, 2012 · Viewed 70.2k times · Source

I have defined a table as such:

create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format as delimited fields terminated by ','
stored as textfile location '/user/data';

I do not want all the files in the /user/data directory to be used as part of the table. Is it possible for me to do the following?

location 'user/data/*.csv'

Answer

Mark Grover picture Mark Grover · Jul 1, 2012

What kmosley said is true. As of now, you can't selectively choose certain files to be a part of your Hive table. However, there are 2 ways to get around it.

Option 1: You can move all the csv files into another HDFS directory and create a Hive table on top of that. If it works better for you, you can create a subdirectory (say, csv) within your present directory that houses all CSV files. You can then create a Hive table on top of this subdirectory. Keep in mind that any Hive tables created on top of the parent directory will NOT contain the data from the subdirectory.

Option 2: You can change your queries to make use of a virtual column called INPUT__FILE__NAME.

Your query would look something like:

SELECT 
   *
FROM
   my_table
WHERE
   INPUT__FILE__NAME LIKE '%csv';

The ill-effect of this approach is that the Hive query will have to churn through entire data present in the directory even though you only cared about specific files. The query wouldn't filter out files based on the predicate using INPUT__FILE__NAME. It will just filter out the records that don't come from match the predicate using INPUT__FILE__NAME during the map phase (consequently filtering out all records from particular files) but the mappers would run on unnecessary files as well. It will give you the correct result, might have some, probably minor, performance overhead.

The benefit of this approach is the you can use the same Hive table if you had multiple files in your table and you wanted the ability to query all files from that table (or its partition) in a few queries and a subset of the files in other queries. You could make use of the INPUT__FILE__NAME virtual column to achieve that. As an example: if a partition in your HDFS directory /user/hive/warehouse/web_logs/ looked like:

/user/hive/warehouse/web_logs/dt=2012-06-30/
   /user/hive/warehouse/web_logs/dt=2012-06-30/00.log
   /user/hive/warehouse/web_logs/dt=2012-06-30/01.log
   .
   .
   .
   /user/hive/warehouse/web_logs/dt=2012-06-30/23.log

Let's say your table definition looked like:

CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_table (col1 STRING)
PARTITIONED BY (dt STRING)
LOCATION '/user/hive/warehouse/web_logs';

After adding the appropriate partitions, you could query all logs in the partition using a query like:

SELECT
   *
FROM
   web_logs_table w
WHERE
   dt='2012-06-30';

However, if you only cared about the logs from the first hour of the day, you could query the logs for the first hour using a query like:

SELECT
   *
FROM
   web_logs_table w
WHERE 
   dt ='2012-06-30'
   AND INPUT__FILE__NAME='00.log';

Another similar use case could be a directory that contains web logs from different domains and various queries need to analyze logs on different sets of domains. The queries can filter out domains using the INPUT__FILE__NAME virtual column.

In both the above use-cases, having a sub partition for hour or domain would solve the problem as well, without having to use the virtual column. However, there might exist some design trade-offs that require you to not create sub-partitions. In that case, arguably, using INPUT__FILE__NAME virtual column is your best bet.

Deciding between the 2 options:

It really depends on your use case. If you would never care about the files are you are trying to exclude from the Hive table, using Option 2 is probably an overkill and you should fix up the directory structure and create a Hive table on top of the directory containing files that you care about.

If the files you are presently excluding follow the same format as the other files (so they can all be part of the same Hive table) and you could see yourself writing a query that would analyze all the data in the directory, then go with Option 2.