Simple explanation of MapReduce?

reefnet_alex picture reefnet_alex · Aug 26, 2008 · Viewed 70k times · Source

Related to my CouchDB question.

Can anyone explain MapReduce in terms a numbnuts could understand?

Answer

chakrit picture chakrit · Aug 27, 2008

Going all the way down to the basics for Map and Reduce.


Map is a function which "transforms" items in some kind of list to another kind of item and put them back in the same kind of list.

suppose I have a list of numbers: [1,2,3] and I want to double every number, in this case, the function to "double every number" is function x = x * 2. And without mappings, I could write a simple loop, say

A = [1, 2, 3]
foreach (item in A) A[item] = A[item] * 2

and I'd have A = [2, 4, 6] but instead of writing loops, if I have a map function I could write

A = [1, 2, 3].Map(x => x * 2)

the x => x * 2 is a function to be executed against the elements in [1,2,3]. What happens is that the program takes each item, execute (x => x * 2) against it by making x equals to each item, and produce a list of the results.

1 : 1 => 1 * 2 : 2  
2 : 2 => 2 * 2 : 4  
3 : 3 => 3 * 2 : 6  

so after executing the map function with (x => x * 2) you'd have [2, 4, 6].


Reduce is a function which "collects" the items in lists and perform some computation on all of them, thus reducing them to a single value.

Finding a sum or finding averages are all instances of a reduce function. Such as if you have a list of numbers, say [7, 8, 9] and you want them summed up, you'd write a loop like this

A = [7, 8, 9]
sum = 0
foreach (item in A) sum = sum + A[item]

But, if you have access to a reduce function, you could write it like this

A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )

Now it's a little confusing why there are 2 arguments (0 and the function with x and y) passed. For a reduce function to be useful, it must be able to take 2 items, compute something and "reduce" that 2 items to just one single value, thus the program could reduce each pair until we have a single value.

the execution would follows:

result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24

But you don't want to start with zeroes all the time, so the first argument is there to let you specify a seed value specifically the value in the first result = line.

say you want to sum 2 lists, it might look like this:

A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )

or a version you'd more likely to find in the real world:

A = [7, 8, 9]
B = [1, 2, 3]

sum_func = (x, y) => x + y
sum = A.reduce( B.reduce( 0, sum_func ), sum_func )

Its a good thing in a DB software because, with Map\Reduce support you can work with the database without needing to know how the data are stored in a DB to use it, thats what a DB engine is for.

You just need to be able to "tell" the engine what you want by supplying them with either a Map or a Reduce function and then the DB engine could find its way around the data, apply your function, and come up with the results you want all without you knowing how it loops over all the records.

There are indexes and keys and joins and views and a lot of stuffs a single database could hold, so by shielding you against how the data is actually stored, your code are made easier to write and maintain.

Same goes for parallel programming, if you only specify what you want to do with the data instead of actually implementing the looping code, then the underlying infrastructure could "parallelize" and execute your function in a simultaneous parallel loop for you.