Hive CLUSTERED BY on more than one column

Manikandan Kannan · Jun 16, 2015

I understand that when a Hive table is CLUSTERED BY one column, Hive applies a hash function to that column's value and uses the result to put the row into one of the buckets. There is one file per bucket, i.e. if there are 32 buckets then there are 32 files in HDFS.

What does it mean to have CLUSTERED BY on more than one column? For example, let's say the table is declared with CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How would the hash function be computed if there is more than one column?

How many files would be generated? Is this still 32?

Answer

Maddy RS · Jun 17, 2015
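  1. Yes, the number of files will still be 32. The bucket count comes from INTO 32 BUCKETS, not from how many columns are listed in CLUSTERED BY.
  2. The hash function treats the pair (continent, country) as one composite key: the values of both columns are hashed together into a single hash code, and that hash modulo 32 selects the bucket. You can think of it roughly as hashing "continent,country" as a single string, although Hive combines the per-column hashes rather than literally concatenating the values.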
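
To make the second point concrete, here is a minimal Python sketch of the idea. It is not Hive's actual code: the function names string_hash and bucket_for are made up for illustration, and the multiply-by-31 scheme is only loosely modelled on Hive's older bucketing hash. As far as I know, newer Hive versions use a different hash function, so real bucket numbers will not match what this prints. What it does show is that any number of clustering columns still collapses into one hash value and one bucket in the range 0 to 31.

```python
def string_hash(s: str) -> int:
    """Illustrative 31-based rolling hash of one column value (32-bit wraparound)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(values, num_buckets: int) -> int:
    """Fold the per-column hashes into one hash, then map it to a bucket.

    `values` holds the clustering column values of one row,
    e.g. ("Asia", "India") for CLUSTERED BY (continent, country).
    """
    combined = 0
    for v in values:
        combined = (31 * combined + string_hash(v)) & 0xFFFFFFFF
    # Clear the sign bit so the modulo result is never negative.
    return (combined & 0x7FFFFFFF) % num_buckets


# Hypothetical sample rows, not taken from any real table.
rows = [("Asia", "India"), ("Asia", "Japan"), ("Europe", "France")]
buckets = {row: bucket_for(row, 32) for row in rows}

for row, bucket in buckets.items():
    print(row, "-> bucket", bucket)
```

Rows with the same (continent, country) pair always land in the same bucket, while rows that share only the continent may land in different ones. Either way there are never more than 32 buckets.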

Hope it helps!!