Since you asked for a thorough explanation, and the previous answer covers only basics, I'm taking the liberty of adding a more thorough one.
std::reduce is intended to perform the second major step of the MapReduce programming model. The basic idea is that the platform (in this case, the C++ implementation) provides the two primitive operations map and reduce, and the programmer provides callback operations for each of the two that perform the "actual work".
Basically, the callback for the map operation maps an input value to an intermediate value. The callback for reduce combines two intermediate values into one intermediate value. The last intermediate value left becomes the output of the whole MapReduce. It seems to be a very restrictive model per se, but still, it has a large range of applications.
The platform has to do more, of course, such as "shuffling" (distributing the inputs, usually in groups, to different processing units) and scheduling, but this is of little concern to the application programmer.
By the way, in the C++ standard library, the "map" operation is called transform. It has received parallelism support in C++17 as well, but I'll get into parallelism later.
Here's a made-up example: Let's say we have a function that converts an integer to a string representation. Now, given a list of integers, we want the textual representation with the greatest ratio of consonants to vowels. E.g. among 1, 2 and 5, "two" wins with two consonants against a single vowel.
(Try it for yourself if you don't believe this result.)
We could use MapReduce here by using our int-to-text function as the callback to map (resp. std::transform), and a function that counts the consonants and vowels and then selects either the left-hand or the right-hand argument accordingly. There are some inefficiencies here, in particular, the "ratio" should be cached, but these are details.
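Here is a rough sketch of how that could look; the helpers to_text, consonant_vowel_ratio and best_text are made up for illustration and are not part of any library:

    #include <algorithm>
    #include <cctype>
    #include <iterator>
    #include <numeric>
    #include <string>
    #include <vector>

    // Hypothetical helper: spell out small integers; anything else falls back to digits.
    std::string to_text(int n) {
        static const char* small[] = {"zero", "one", "two", "three", "four",
                                      "five", "six", "seven", "eight", "nine"};
        return (n >= 0 && n <= 9) ? small[n] : std::to_string(n);
    }

    // Counts consonants per vowel; as noted above, recomputing this on every
    // comparison is inefficient and the ratio should really be cached.
    double consonant_vowel_ratio(const std::string& s) {
        double vowels = 0, consonants = 0;
        for (unsigned char c : s) {
            if (!std::isalpha(c)) continue;
            if (std::string("aeiouAEIOU").find(c) != std::string::npos)
                vowels += 1;
            else
                consonants += 1;
        }
        return vowels > 0 ? consonants / vowels : consonants;
    }

    // "map" (std::transform) + "reduce" (std::reduce): returns the representation
    // with the greatest consonant-to-vowel ratio; input must not be empty.
    std::string best_text(const std::vector<int>& input) {
        std::vector<std::string> texts(input.size());
        std::transform(input.begin(), input.end(), texts.begin(), to_text);

        return std::reduce(std::next(texts.begin()), texts.end(), texts.front(),
                           [](const std::string& a, const std::string& b) {
                               return consonant_vowel_ratio(a) >= consonant_vowel_ratio(b)
                                          ? a : b;
                           });
    }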
Now your question may, and possibly should, be: Why should I care? After all, so far you haven't gained much from using e.g. std::transform and std::reduce in this way, and you could have used std::accumulate in place of the latter as well. The answer, of course, given a sufficiently large number of input values, is execution policies – the former two standard function templates have overloads that allow implicit parallelism. But that still raises the question of why one would use MapReduce rather than a thread pool or std::async plus a bunch of hand-written loops. First, especially for "pure" computations on large vectors or other containers, with no I/O, it is often more convenient to write the two MapReduce callbacks because you don't have to deal with all the details of how the input data is spread around to different threads and then combined.
Second, MapReduce encourages a discipline of structuring your computations in a way that can be parallelized very effectively. Of course, in a programming language that supports the imperative paradigm, such as C++, you can still mess things up by locking a bunch of mutexes, or whatever other means you may have of interfering with other threads. But the MapReduce paradigm encourages writing independent mapping callbacks. As a simple rule of thumb, if these tasks share data at all, it should be read-only, so that copies of it can sit in multiple CPU caches at the same time. Well-written tasks, combined with an efficient platform implementation of the underlying algorithms, can scale to hundreds or even thousands of CPU cores, as is shown by the MapReduce platforms already in common use (such as Apache Hadoop – take this only as an example, not as gratuitous advertisement).
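Coming back to the execution policies mentioned above: switching std::reduce to its parallel overload is just a matter of passing a policy. This assumes a standard library that actually implements the parallel algorithms; depending on the toolchain this may mean linking an extra backend such as TBB.

    #include <execution>
    #include <numeric>
    #include <vector>

    long long sum_sequential(const std::vector<int>& v) {
        // Plain single-threaded reduction.
        return std::reduce(v.begin(), v.end(), 0LL);
    }

    long long sum_parallel(const std::vector<int>& v) {
        // The implementation may split the work across threads and/or vectorize it.
        return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0LL);
    }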
But the question was mainly about std::reduce – we could still perform this highly scalable mapping and then run std::accumulate on the intermediate results, right? And this is where we get to what François Andrieux previously wrote. accumulate performs what mathematicians call a left fold. If you view the operations and their operands as a binary tree, this is a left-leaning tree, e.g. to add 1, 2, 3 and 4:
              +
             / \
            +   4
           / \
          +   3
         / \
        1   2
As you can see, the result of each operation is one of the arguments of the next operation. This means there is a linear chain of data dependencies, and that is the bane of all parallelism. To add a million numbers, you need a million consecutive operations, which will tie up a single core, and you cannot make any use of additional cores. They will have nothing to do most of the time, and the communication overhead will greatly outweigh the cost of the computations. (It's actually worse than that, because modern CPUs can perform multiple simple calculations per clock cycle, but this doesn't work when there are so many data dependencies, so most of the ALUs or FPUs go unused.)
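To make the fixed left-to-right evaluation order concrete, here is a small illustration; the string-building operation is made up purely to visualize the fold:

    #include <cassert>
    #include <numeric>
    #include <string>
    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3, 4};
        // std::accumulate is specified to apply the operation strictly left to
        // right, i.e. op(op(op(op(init, 1), 2), 3), 4).
        std::string folded = std::accumulate(
            v.begin(), v.end(), std::string("0"),
            [](const std::string& acc, int x) {
                return "(" + acc + "+" + std::to_string(x) + ")";
            });
        assert(folded == "((((0+1)+2)+3)+4)");
        return 0;
    }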
By lifting the restriction that a left fold must be used, std::reduce allows the platform to make more efficient use of computational resources. Even if you use the single-threaded overload, the platform could, for example, use SIMD to add a million integers in far fewer than a million instructions, and the chain of data dependencies is greatly shortened. A 10x speedup on a well-written integer addition reduce would not surprise me. Granted, this particular special case could probably be optimized under the as-if rule, because the C++ implementation "knows" that integer addition is (almost, see below) associative.
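As a hand-written illustration of the kind of regrouping this permits (an implementation or optimizer may do something similar internally, but nothing requires it to):

    #include <cstddef>
    #include <vector>

    long long sum_four_chains(const std::vector<int>& v) {
        // Four independent partial sums instead of one long dependency chain;
        // the four additions per iteration can overlap in the CPU pipeline or
        // be turned into SIMD instructions.
        long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= v.size(); i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < v.size(); ++i)  // leftover elements
            s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }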
But reduce goes further than that, as was mentioned, by supporting execution policies, i.e. in most cases multi-core parallelism. In theory, a balanced binary tree of operations could be used. (Remember that a binary tree is balanced if both its subtrees are balanced and their depths differ by at most 1.) Such a tree has only logarithmic depth. If we have a million integers, the minimum tree depth is 20, so – theoretically – given enough cores and no communication overhead, we could do the roughly one million additions in about 20 parallel steps, i.e. achieve a factor 50,000 speedup over even the optimized single-threaded calculation. Of course, in practice, that's a load of wishful thinking, but we can still expect large speedups.
All that said, let me add a quick disclaimer/reminder that performance is not the same as efficiency. Using 64 cores for 100ms each means much higher performance than using one core for 1,000ms, but much less CPU efficiency. Another way to put it is that performance is efficiency in the sense of minimizing the elapsed time, but there are other measures of efficiency – total CPU time used, RAM used, energy used, and so on. The primary motivation of parallel MapReduce is to provide higher performance. Whether it reduces CPU time or energy consumption is unclear, and it will very likely increase peak RAM usage.
To top it all off, here are some caveats. As was mentioned, reduce is non-deterministic if the operations are not associative or not commutative. Fortunately, the most important arithmetic operations such as addition and multiplication are associative and commutative, right? We all know that integer and floating-point addition, for example, have both of these properties. And of course, I am being facetious. In C++, neither signed integer addition nor floating-point addition is associative. For floating-point numbers, this is due to possible differences in the rounding of intermediate results. This is easily visualized if we, as an example, pick a simple decimal floating-point format with two significant digits and consider the sum 10 + 0.4 + 0.4. If this is done by the normal C++ syntax rules as a left fold, we get (10 + 0.4) + 0.4 = 10 + 0.4 = 10, because each time the result is rounded back down to 10. But if we do it as 10 + (0.4 + 0.4), the first intermediate result is 0.8, and 10 + 0.8 is then rounded up to 11. Also, rounding errors can be greatly magnified by a high depth of the operation tree, so doing a left fold is actually one of the worst things one could do when it comes to accuracy. There are several ways to deal with this behavior, ranging from sorting and grouping the inputs to using increased intermediate precision, but when it comes to reduce, there may simply be no way to get 100% run-for-run consistency.
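The same effect is easy to reproduce in ordinary double precision, for example:

    #include <cstdio>

    int main() {
        double a = 1e20, b = -1e20, c = 1.0;
        std::printf("%g\n", (a + b) + c);  // prints 1: a and b cancel exactly first
        std::printf("%g\n", a + (b + c));  // prints 0: c is lost when rounded against b
        return 0;
    }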
The other, possibly more surprising, observation is that signed integer addition is not associative in C++. The reason for this is the possibility of overflow, to put it bluntly: (-1) + 1 + INT_MAX. According to the normal syntax rules, or accumulate, the result is INT_MAX. But if you use reduce, this might get rewritten as (-1) + (1 + INT_MAX), which contains an integer overflow and hence undefined behavior. In fact, because reduce may arbitrarily change the order of operands, this is even true if the inputs are { INT_MAX, -1, 1 }.
My recommendation here is to ensure that the callback to reduce cannot produce an overflow. This could be done by limiting the allowed range of the inputs (e.g. if you add 1000 ints, make sure that none of them is greater than INT_MAX / 1000 or less than INT_MIN / 1000, rounded up), or, equivalently, by using a larger integer type, or, as an absolute last resort (because this is both expensive and hard to handle correctly), by putting additional checks in the reduce callback. In most practical cases, however, reduce is only marginally less safe regarding integer overflow than accumulate.
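The "larger integer type" option can be spelled with std::transform_reduce, for example, widening each element before any addition happens so that no intermediate sum is ever computed in int (assuming the total fits in long long; safe_sum is just an illustrative name):

    #include <functional>
    #include <numeric>
    #include <vector>

    long long safe_sum(const std::vector<int>& v) {
        // Widen first, then reduce: every intermediate value is a long long.
        return std::transform_reduce(
            v.begin(), v.end(), 0LL, std::plus<>{},
            [](int x) { return static_cast<long long>(x); });
    }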