High throughput vs low latency in HDFS

spacemonkey · May 23, 2013 · Viewed 23.9k times

I tried to define what high throughput vs. low latency means in HDFS in my own words, and came up with the following definition:

HDFS is optimized to access batches of a data set quickly (high throughput), rather than particular records in that data set (low latency)

Does it make sense? :)

Thanks!

Answer

Joe K · May 23, 2013

I think what you've described is more like the difference between optimizing for different access patterns (sequential/batch access vs. random access) than the difference between throughput and latency in the purest sense.

When I think of a high latency system, I'm not thinking about which record I'm accessing, but rather that accessing any record at all has a high overhead cost. Accessing even just the first byte of a file from HDFS can take around a second or more.

If you're more quantitatively inclined, you can think of the total time required to access N records as T(N) = aN + b. Here, a is the incremental cost per record (the inverse of throughput) and b is the fixed overhead paid on any access (the latency). With a system like HDFS, N is often so large that b becomes irrelevant, so tradeoffs that lower a are worthwhile. Contrast that with a low-latency data store, where each read often accesses only a single record, and optimizing for a low b is better.
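To make that concrete, here's a minimal Python sketch of the T(N) = aN + b cost model. The parameter values are invented for illustration only (not measured from HDFS or any real key-value store): the batch-oriented store pays a large fixed overhead b but a tiny per-record cost a, and the random-access store is the opposite.

```python
def access_time(n_records: int, a: float, b: float) -> float:
    """Total time to read n_records: per-record cost a plus fixed overhead b."""
    return a * n_records + b

# Hypothetical parameters, in seconds.
batch_store = {"a": 1e-6, "b": 1.0}      # ~1 s of overhead, then ~1 us per record
random_store = {"a": 1e-4, "b": 1e-3}    # ~1 ms of overhead, ~100 us per record

for n in (1, 1_000, 10_000_000):
    t_batch = access_time(n, **batch_store)
    t_random = access_time(n, **random_store)
    print(f"N={n:>10,}: batch store {t_batch:12.3f} s, random-access store {t_random:12.3f} s")
```

Running this shows the crossover: for a single record the random-access store wins by three orders of magnitude, but at ten million records the batch store's low per-record cost dominates and it finishes far sooner, despite its one-second startup overhead.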

That said, your statement isn't wrong: it is often the case that batch-access stores have high latency and high throughput, whereas random-access stores have low latency and low throughput. But this is not strictly always the case.