Storing images in HBase for processing and quick access

sanders · Jul 6, 2013 · Viewed 7.6k times

I have a large number of image files that I need to store and process on HDFS.

Let's assume 2 scenarios:

  1. Images are less than 5MB
  2. Images range from 50KB to 20MB

I would like to do 4 things with the images:

  1. I need to apply some function fnc() to each image independently.
  2. I will need to extract a specific image from HDFS from time to time (1000 times/day) and display it on a website. These are user queries for specific images so the latency should be a few seconds.
  3. Once a year groups of images would have to be deleted.
  4. New images will be added to the system (1000 new images/day)

The solution design IMO should consider:

  1. The small files issue
  2. MR Processing
  3. Quick access to the files
  4. Quick write of the new files is not that big of an issue since the image will not be used immediately. A delay of a few minutes or hours is OK.

My first thought was to aggregate the images to take care of the small files issue, which satisfies (1) and (2). But that left me with the problems of quick random access to individual images and of adding new images. I am not sure how to deal with these.

I looked into SequenceFiles, HAR, MapFiles, CombineFileInputFormat, and Avro, but wasn't able to find a solution for (3) and (4), since I would have to take care of indexing the contents of the blocks, and searching, deleting, or adding files may become tricky.

The other approach was to use HBase or HCatalog to store the images. This would take care of (1), (2), (3), and (4), but at what cost? I know that storing binary BLOBs in a database is not very efficient, especially as the number of images increases, but I thought maybe HBase or HCatalog handle this a bit differently.

Thanks for all the help!

EDIT:

I just found this thread on HBase for serving images; apparently Yfrog and ImageShack have billions of records with images. Here is the link; it's a good read. If anyone knows of any benchmarks, that would be great.

Answer

Tariq · Jul 7, 2013

IMHO, there is no problem storing images of around 10MB directly in HBase. Bigger files can be stored in HDFS itself, with a pointer to them in HBase. This allows quick access even if you have millions of such files. MR works perfectly well with both HBase and HDFS.
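
To make the "small images inline, big images by pointer" idea concrete, here is a rough sketch of just the routing logic. It does not talk to HBase or HDFS: a plain dict stands in for the HBase table and a local directory stands in for HDFS. The class name `ImageStore`, the column names `data` and `path`, and the exact 10MB cutoff are all assumptions made for illustration, not anything prescribed by HBase.

```python
import os
import tempfile

# Assumed cutoff, taken from the "around 10MB" figure above; tune for your cluster.
INLINE_LIMIT_BYTES = 10 * 1024 * 1024


class ImageStore:
    """Toy model: a dict plays the HBase table, a local directory plays HDFS."""

    def __init__(self, blob_dir, inline_limit=INLINE_LIMIT_BYTES):
        self.table = {}  # row key -> {"data": bytes} or {"path": str}
        self.blob_dir = blob_dir
        self.inline_limit = inline_limit

    def put(self, image_id, data):
        # image_id is used as a file name here, so it must be filesystem-safe.
        if len(data) <= self.inline_limit:
            # Small image: keep the bytes in the row itself.
            self.table[image_id] = {"data": data}
        else:
            # Large image: write the bytes to the file store, keep only a pointer.
            path = os.path.join(self.blob_dir, image_id)
            with open(path, "wb") as f:
                f.write(data)
            self.table[image_id] = {"path": path}

    def get(self, image_id):
        # One keyed lookup either returns the bytes or tells us where they live.
        row = self.table[image_id]
        if "data" in row:
            return row["data"]
        with open(row["path"], "rb") as f:
            return f.read()

    def delete(self, image_id):
        # Remove the row and, for pointer rows, the file it points to.
        row = self.table.pop(image_id)
        if "path" in row:
            os.remove(row["path"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as blob_dir:
        # A tiny limit so the demo exercises both branches without large files.
        store = ImageStore(blob_dir, inline_limit=8)
        store.put("thumb.png", b"1234")
        store.put("photo.png", b"0123456789")
        print(sorted(store.table["thumb.png"]))  # inline row
        print(sorted(store.table["photo.png"]))  # pointer row
```

In a real setup the dict lookups would become HBase gets and puts keyed by the image id, and the file writes would go to HDFS. The point of the sketch is only that reads, writes, and deletes all go through a single keyed lookup, whichever side the bytes end up on.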