Looping using Hiveql

Beta picture Beta · Feb 25, 2016 · Viewed 29.3k times · Source

I'm trying to merge 2 datasets, say A and B. The dataset A has a variable "Flag" which takes 2 values. Rather than jut merging both data together I was trying to merge 2 datasets based on "flag" variable.

The merging code is the following:

create table new_data as
select a.*,b.y
from A as a left join B as b
on a.x=b.x

Since I'm running Hive code through CLI, I'm calling this through the following command

hive -f new_data.hql

The looping part of the code I'm calling to merge data based on "Flag" variable is the following:

for flag in 1 2;
do
  hive -hivevar flag=$flag -f new_data.hql
done

I put the above code in another ".hql" file asn calling it:

hive -f loop_data.hql

But it's throwing error.

cannot recognize input near 'for' 'flag' 'in'

Can anybody please tell me where I'm making mistake.

Thanks!

Answer

Ambrish picture Ambrish · Mar 5, 2016
  1. You should be adding the loop logic to shell script.

File Name: loop_data.sh

for flag in 1 2;
do
  hive -hivevar flag=$flag -f new_data.hql
done

And execute the script like:

sh loop_data.sh
  1. In your new_data.hql script, you are creating table. Since you should split out the DDL & DML in 2 separate scripts. Like

DDL: create_new_data.hql

create table new_data as
select 
  a.*,
  b.y
from 
  A as a left join 
  B as b on 
  a.x = b.x
where 
  1 = 0;

DML: insert_new_data.hql

insert into new_data 
select 
  a.*,
  b.y
from 
  A as a left join 
  B as b on 
  a.x = b.x
where
  flag = ${hiveconf:flag}

And update you shell script like:

File Name: loop_new_data.sh

# Create table
hive -f create_new_data.hql

# Insert data
for flag in 1 2;
do
  hive -hiveconf flag=$flag -f insert_new_data.hql
done

And execute it like:

sh loop_new_data.sh

Let me know if you want more info.