How big does data have to be to be "Big Data"?

user1913522 · Dec 26, 2012 · Viewed 8.5k times

How much data does it take to qualify as "Big Data"?

At what data size should one decide it's time to adopt technologies like Hadoop and use the power of distributed computing?

I believe there is a certain premium in adopting these technologies, so how can one make sure that using Big Data methods will actually pay off for the current system?

Answer

Brian Campbell · Dec 26, 2012

"Big Data" is a somewhat vague term, used more for marketing purposes than making technical decisions. What one person calls "big data" another may consider just to be day to day operations on a single system.

My rule of thumb is that big data starts when your working set of data no longer fits into main memory on a single system. The working set is the data you are actively working on at a given time. For instance, you might have a filesystem that stores 10 TB of data, but if you are using it to store video for editing, your editors may only need a few hundred gigabytes of it at any given time, and they are generally streaming that data off disk, which doesn't require random access. But if you are trying to run database queries against the full 10 TB data set, and that data set is changing regularly, you don't want to be serving all of it from disk; that starts to become "big data."
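To make the distinction concrete, here is a minimal back-of-the-envelope sketch of that working-set test. All of the sizes below (the 10 TB archive, the 300 GB working set, the 512 GB of RAM) are hypothetical placeholders for illustration, not figures from the answer:

```python
# Back-of-the-envelope check: does the *working set* (the data you actively
# touch in a computation) fit in one machine's RAM, regardless of how much
# total storage you have? All numbers are hypothetical placeholders.

GIB = 1024 ** 3
TIB = 1024 ** 4

total_storage = 10 * TIB     # full archive sitting on disk
working_set   = 300 * GIB    # what is actually touched at any given time
ram_per_node  = 512 * GIB    # an assumed single affordable server

if working_set <= ram_per_node:
    print("Working set fits on one box -> ordinary single-system tools are fine")
else:
    nodes = -(-working_set // ram_per_node)  # ceiling division
    print(f"Working set needs ~{nodes} nodes -> 'big data' territory")
```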

As a basic rule of thumb, I can configure an off-the-shelf Dell server with 2 TB of RAM right now, but you pay a substantial premium to stuff that much RAM into a single system. 512 GB of RAM in a single server is much more affordable, so it would generally be more cost effective to use four machines with 512 GB of RAM each than a single machine with 2 TB. So you could probably say that anything above 512 GB of working-set data (data that you need to access for a given computation on a day-to-day basis) qualifies as "big data".
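As a rough illustration of that cost argument (the prices below are invented placeholders, not quotes for any real hardware):

```python
# Rough comparison of scaling up (one 2 TB RAM server) versus scaling out
# (four 512 GB RAM servers). Prices are invented placeholders, used only
# to show the shape of the calculation.

PRICE_2TB_SERVER   = 80_000   # hypothetical premium scale-up box
PRICE_512GB_SERVER = 10_000   # hypothetical commodity box

scale_up_cost  = PRICE_2TB_SERVER
scale_out_cost = 4 * PRICE_512GB_SERVER   # same total RAM: 4 x 512 GB = 2 TB

print(f"Scale up:  ${scale_up_cost:,} for 2 TB in one machine")
print(f"Scale out: ${scale_out_cost:,} for 2 TB across four machines")
print("Cheaper option:", "scale out" if scale_out_cost < scale_up_cost else "scale up")
```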

Given the additional cost of developing software for "big data" systems as opposed to a traditional database, for some people it may be more cost effective to move to that single 2 TB system than to re-design their software to run distributed across several machines. So, depending on your needs, the point where you need to move to "big data" systems may fall anywhere between 512 GB and 2 TB of working-set data.
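One way to frame that trade-off is as a simple break-even comparison between the hardware premium and the re-engineering cost. Again, every figure here is an assumed placeholder:

```python
# Break-even sketch: is it cheaper to pay the scale-up hardware premium than
# to re-design the application for a distributed stack? All figures assumed.

scale_up_premium = 40_000    # extra cost of the 2 TB box over commodity hardware
redesign_cost    = 150_000   # hypothetical engineering cost to distribute the system
cluster_hardware = 40_000    # hypothetical cost of the commodity cluster itself

stay_single_node = scale_up_premium
go_distributed   = redesign_cost + cluster_hardware

print("Cheaper path:",
      "single big machine" if stay_single_node < go_distributed else "distribute")
```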

I wouldn't use the term "big data" to make any technical decisions. Instead, formulate your actual requirements and determine what technologies are needed to address them now. Allow for some growth, but remember that single systems are still growing in capacity, so don't over-plan. Many "big data" systems can be hard to use and inflexible, so if you don't actually need to spread your data and computation across dozens or hundreds of systems, they can be more trouble than they're worth.