Performance difference between Scan and Get?

Aniket Dutta · Jan 27, 2013

I have an HBase table containing 8 GB of data.

When I use a partial-key scan on that table to retrieve the value for a given key, the retrieval time is almost constant.

When I use a Get, the time taken is far greater than with the scan. However, when I looked at the code, I found that Get itself uses a Scan.

Can anyone explain this time difference?

Answer

Suman · Jan 29, 2013

Correct: when you issue a Get, a scan happens behind the scenes. Cloudera's blog post confirms this: "Each time a get or a scan is issued, HBase scan (sic) through each file to find the result."

I can't confirm your results, but I think the clue may lie in your "partial key scan". When you compare a partial key scan and a get, remember that the row key you use for Get can be a much longer string than the partial key you use for the scan.

In that case, for the Get, HBase has to do an exact lookup: it must find the precise location of the row key it needs to match, and then fetch that row. With the partial key, HBase does not need to look up an exact match; it only needs to find the approximate location of the key prefix.
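To make the exact-match versus prefix distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not HBase code: a `TreeMap` stands in for HBase's sorted row keys, and the class name, method names, and sample keys are all made up for illustration. It assumes ASCII keys, so that Java's string ordering agrees with byte-wise lexicographic ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: a sorted in-memory map stands in for sorted row keys.
// It does not model regions, store files, or caching.
class RowKeyLookupSketch {

    // Hypothetical sample rows: row key -> value, kept in sorted key order.
    static final NavigableMap<String, String> ROWS = new TreeMap<>();

    static {
        ROWS.put("user1|20130101", "v1");
        ROWS.put("user1|20130102", "v2");
        ROWS.put("user2|20130101", "v3");
        ROWS.put("user20|20130101", "v4");
    }

    // Get expressed as a one-row scan: seek to the first key >= rowKey,
    // then accept it only if it matches the full key exactly.
    static String getViaScan(String rowKey) {
        Map.Entry<String, String> first = ROWS.ceilingEntry(rowKey);
        return (first != null && first.getKey().equals(rowKey)) ? first.getValue() : null;
    }

    // Partial-key scan: seek to the first key >= prefix, then read forward
    // while keys still start with the prefix. No exact match is required.
    static List<String> prefixScan(String prefix) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> row : ROWS.tailMap(prefix, true).entrySet()) {
            if (!row.getKey().startsWith(prefix)) {
                break; // keys are sorted, so nothing later can match
            }
            values.add(row.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(getViaScan("user2|20130101")); // prints v3
        System.out.println(getViaScan("user1|"));         // prints null (no exact match)
        System.out.println(prefixScan("user1|"));         // prints [v1, v2]
    }
}
```

In this model both operations start with the same seek into the sorted keys; the Get then has to confirm a full-key match, while the prefix scan just reads forward from wherever it lands. It says nothing about real timings, which you would have to measure on your own table.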

So the short answer is: it depends. I think it will depend on:

  1. Your row key "schema" or composition
  2. The length of the Get key and the Scan prefix
  3. How many regions you have

and possibly other factors.