Loading data with Hive, S3, EMR, and Recover Partitions

Mike Repass · Nov 10, 2012 · Viewed 9.4k times

SOLVED: See Update #2 below for the 'solution' to this issue.

~~~~~~~

In S3, I have some log*.gz files stored in a nested directory structure like:

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I'm attempting to load these into Hive on Elastic MapReduce (EMR), using a multi-level partition spec like:

create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

Creation of the table works. I then attempt to recover all of the existing partitions:

alter table logs recover partitions;

This appears to work: it drills down through my S3 structure and adds all the various levels of directories:

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
y=2012/m=11/d=06/h=09
y=2012/m=11/d=06/h=10
y=2012/m=11/d=06/h=11
y=2012/m=11/d=06/h=12
y=2012/m=11/d=06/h=13
y=2012/m=11/d=06/h=14
y=2012/m=11/d=06/h=15
y=2012/m=11/d=06/h=16
...

So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try a simple count or select *, I get nothing:

hive> select count(*) from logs;
...
OK
0

hive> select * from logs limit 10;
OK

hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10;
OK

Thoughts? Am I missing some additional command to load data beyond recovering the partitions?

If I manually add a partition with an explicit location, then that works:

alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/';

I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t. RECOVER PARTITIONS.
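For reference, the workaround script is straightforward. This is a minimal sketch (not an official tool): it takes partition prefixes as listed from the bucket and emits explicit ADD PARTITION statements, preserving the original case of the path in the LOCATION clause so the case-sensitive S3 keys still match. The bucket name and table name are illustrative.

```python
import re

def add_partition_ddl(bucket, prefixes, table="logs"):
    """Turn S3 partition prefixes like 'y=2012/m=11/d=09/H=10/' into
    Hive ADD PARTITION statements with explicit locations."""
    stmts = []
    pattern = re.compile(r"y=(\d+)/m=(\d+)/d=(\d+)/H=(\d+)/?")
    for p in prefixes:
        m = pattern.fullmatch(p)
        if not m:
            continue  # skip keys that aren't partition directories
        y, mo, d, h = m.groups()
        stmts.append(
            f"alter table {table} add partition "
            f"(y='{y}', m='{mo}', d='{d}', h='{h}') "
            f"location 's3://{bucket}/{p}';"
        )
    return stmts
```

Feeding these statements to the Hive CLI (e.g. via `hive -f`) registers each partition with the correct, case-preserved location.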

UPDATE #1

Thanks to a brilliant and keen observation by Joe K in a comment below, I think case-sensitivity issues might be involved here.

The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to ISO 8601 formatting):

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I create my external table with a partition spec that does the proper capitalization:

partitioned by (y string, m string, d string, H string)

(Notice the capital 'H'.) I then run RECOVER PARTITIONS, which does recurse through the directories and find the partitions appropriately, but despite my using 'H' in all the instructive places so far, Hive stores the partition name as a lowercase 'h':

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08

(Note the lowercase 'h'.) So it seems that Hive is able to discover the partitions but then stores them in lowercase form; later, when it goes looking for data, those paths are (of course) empty, because S3 is case sensitive.

I am going to move my data into an all-lowercase directory structure and see if that works...
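Since S3 has no rename operation, "moving" the data means rewriting each key and doing a copy-then-delete. A sketch of the key rewrite needed (the function name is mine, not from any library): lowercase only the partition-name components (the part before '='), leaving values and filenames untouched.

```python
def lowercase_partition_names(key):
    """Lowercase only the partition names in an S3 key,
    e.g. 'y=2012/m=11/d=09/H=10/log.gz' -> '.../h=10/log.gz'."""
    parts = []
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        # Only rewrite 'name=value' segments; leave plain segments alone.
        parts.append(name.lower() + sep + value if sep else segment)
    return "/".join(parts)
```

Each object would then be copied from `key` to `lowercase_partition_names(key)` and the original deleted.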

UPDATE #2

Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening:

  • My layout on S3 used a capitalized partition name (H=)
  • Running RECOVER PARTITIONS correctly discovers these partitions...
  • But then they are stored internally as lowercase (h)

The RECOVER PARTITIONS command is an extension to Hive authored by Amazon, and I strongly suspect the bug is in this component. To my knowledge, native Hive has no concept of exploring a file root for partition discovery...
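The failure mode above can be modeled in miniature. This is a toy reproduction of the suspected behavior (not Hive's actual code): if the partition column names are normalized to lowercase before the partition path is rebuilt, the reconstructed prefix no longer matches the case-sensitive S3 key.

```python
# Partition as discovered on S3, with the capitalized hour column.
discovered = {"y": "2012", "m": "11", "d": "09", "H": "10"}

# Suspected normalization step: column names stored in lowercase.
stored = {k.lower(): v for k, v in discovered.items()}

actual_prefix = "/".join(f"{k}={v}" for k, v in discovered.items())
rebuilt_prefix = "/".join(f"{k}={v}" for k, v in stored.items())

# S3 keys are case sensitive, so the rebuilt prefix finds no data.
assert actual_prefix != rebuilt_prefix
```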

Answer

Joe K · Nov 13, 2012

It's a case issue on the hour field!