I have a large number of image files that I need to store and process on HDFS.
Let's assume 2 scenarios:
I would like to do 4 things with the images:
The solution design IMO should consider:
My first thought was to aggregate the images to take care of the small-file issue, which satisfies (1) and (2). But that still left me with the problem of quick random access to the images and of adding new images, and I am not sure how to deal with those.
I looked into SequenceFiles, HAR, MapFiles, CombineFileInputFormat, and Avro, but wasn't able to find a solution for (3) and (4), since I would have to take care of indexing the contents of the blocks myself, and searching, deleting, or adding new files could become tricky.
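To be clear about what I mean by aggregating, here is a minimal sketch of packing local images into a single SequenceFile keyed by file name; the output path and class name are hypothetical, just for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.nio.file.Files;
import java.nio.file.Paths;

public class ImagePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/images.seq"); // hypothetical HDFS output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // each local image passed on the command line becomes one record:
            // key = original file name, value = raw image bytes
            for (String name : args) {
                byte[] bytes = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(bytes));
            }
        }
    }
}
```

This solves the small-file problem and plays well with MapReduce, but as said above, pulling one image back out by name means scanning the file or maintaining an index myself.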
The other approach was to use HBase or HCatalog to store the images, which would take care of (1), (2), (3), and (4), but at what cost? I know that storing binary BLOBs in a database is not very efficient, especially as the number of images increases, but I thought maybe HBase or HCatalog handled this a bit differently.
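To make (3) concrete, this is roughly the access pattern I am hoping to get out of HBase; a sketch assuming a table named images with a column family d (both names are mine, not from any documentation):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] family = Bytes.toBytes("d");
        byte[] qualifier = Bytes.toBytes("raw");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) {
            // write: row key = image id, one cell holds the raw bytes
            byte[] imageBytes = Bytes.toBytes("placeholder image data");
            Put put = new Put(Bytes.toBytes("img-00042"));
            put.addColumn(family, qualifier, imageBytes);
            table.put(put);

            // random read by key -- exactly what the block formats above
            // do not give me without extra indexing
            Get get = new Get(Bytes.toBytes("img-00042"));
            byte[] fetched = table.get(get).getValue(family, qualifier);
            System.out.println("fetched " + fetched.length + " bytes");
        }
    }
}
```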
Thanks for all the help!
EDIT:
I just found this thread on HBase for serving images; apparently Yfrog and ImageShack have billions of records with images. Here is the link; it's a good read. If anyone knows of any benchmarks, that would be great.
IMHO, there is no problem in storing images of ~10MB directly in HBase, and bigger files can be stored in HDFS itself with a pointer kept in HBase. This allows quick access even if you have millions of such files. MR works perfectly well with both HBase and HDFS.
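A minimal sketch of that hybrid pattern, assuming a table images with column family d and an /images/ directory on HDFS (the names and the exact 10MB cutoff are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageStore {
    private static final long INLINE_LIMIT = 10L * 1024 * 1024; // ~10MB, per the rule of thumb above
    private static final byte[] FAMILY = Bytes.toBytes("d");

    public static void store(String id, byte[] image) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Put put = new Put(Bytes.toBytes(id));

        if (image.length <= INLINE_LIMIT) {
            // small image: keep the bytes inline in HBase
            put.addColumn(FAMILY, Bytes.toBytes("raw"), image);
        } else {
            // big image: write the bytes to HDFS, keep only a pointer in HBase
            Path p = new Path("/images/" + id); // illustrative layout
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(p)) {
                out.write(image);
            }
            put.addColumn(FAMILY, Bytes.toBytes("path"), Bytes.toBytes(p.toString()));
        }

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) {
            table.put(put);
        }
    }
}
```

Reads would check for the raw column first and fall back to the path pointer, so small images come straight out of HBase while big ones cost one extra HDFS open.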