the table is:
create table test (
id string,
name string,
age string,
modified string)
data like this:
id name age modifed
1 a 10 2011-11-11 11:11:11
1 a 11 2012-11-11 12:00:00
2 b 20 2012-12-10 10:11:12
2 b 20 2012-12-10 10:11:12
2 b 20 2012-12-12 10:11:12
2 b 20 2012-12-15 10:11:12
I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:
1 a 11 2012-11-11 12:00:00
2 b 20 2012-12-15 10:11:12
I do like this:
insert overwrite table t
select b.id, b.name, b.age, b.modified
from (
select id,max(modified) as modified
from test
group by id
) a
left outer join test b on (a.id=b.id and a.modified=b.modified);
This sql can get the right result,but when mass data,it runs slow.
**Is there any way to do this without left outer join? **
There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:
test_argmax
id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y
You can do this:
select
max(struct(val, key, id)).col1 as max_val,
max(struct(val, key, id)).col2 as max_key,
max(struct(val, key, id)).col3 as max_id
from test_argmax
group by id
and get the result:
max_val,max_key,max_id
3,C,1
3,W,2
I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?