generating unique ids in hive

user1745713 picture user1745713 · Aug 15, 2013 · Viewed 25.9k times · Source

I have been trying to generate unique ids for each row of a table (30 million+ rows).

  • using sequential numbers obviously not does not work due to the parallel nature of Hadoop.
  • the built in UDFs rand() and hash(rand(),unixtime()) seem to generate collisions.

There has to be a simple way to generate row ids, and I was wondering of anyone has a solution.

  • my next step is just creating a Java map reduce job to generate a real hash string with a secure random + host IP + current time as a seed. but I figure I'd ask here before doing it ;)

Answer

Carter Shanklin picture Carter Shanklin · Aug 18, 2013

Use the reflect UDF to generate UUIDs.

reflect("java.util.UUID", "randomUUID")

Update (2019)

For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

create table customer (
  id bigint default surrogate_key(),
  name string, 
  city string, 
  primary key (id) disable novalidate
);

To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;