I have been trying to generate unique ids for each row of a table (30 million+ rows).
There has to be a simple way to generate row ids, and I was wondering of anyone has a solution.
Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:
create table customer (
id bigint default surrogate_key(),
name string,
city string,
primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;