MPI + GPU: how to mix the two techniques

cmo · Apr 9, 2012 · Viewed 7.6k times

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use MPI_Reduce to multiply together the results from all the CPUs.

But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.

I have googled around but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all the others are CPUs"? Is there a recommended tutorial or something?

Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.

Here is a schematic example of what I'm talking about:

Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,000 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact; the computation really is very large. I'm just trying to convey the general problem.
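For reference, the pure-MPI version I have now looks roughly like the sketch below; compute_one_double() is just a stand-in for the real per-element work, and each rank multiplies its own 50 doubles together before the MPI_Reduce with MPI_PROD.

```c
/* Sketch of the pure-MPI pattern described above: each rank produces
 * NLOCAL doubles per "state", multiplies them together locally, and an
 * MPI_Reduce with MPI_PROD combines the partial products across ranks. */
#include <mpi.h>

#define NLOCAL   50      /* doubles produced per rank per state (illustrative) */
#define NSTATES  10000   /* number of repetitions (illustrative)               */

/* Hypothetical stand-in for the real, expensive per-element computation. */
static double compute_one_double(int rank, int state, int i)
{
    return 1.0 + 1e-9 * (rank + state + i);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int state = 0; state < NSTATES; state++) {
        /* Multiply this rank's own doubles together first... */
        double local = 1.0;
        for (int i = 0; i < NLOCAL; i++)
            local *= compute_one_double(rank, state, i);

        /* ...then combine the partial products from all ranks on rank 0. */
        double global = 1.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_PROD, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            /* global now holds the product of all ranks' doubles for this state */
        }
    }

    MPI_Finalize();
    return 0;
}
```

(Launched with, e.g., mpirun -np 500, although any number of ranks works.)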

Answer

Jonathan Dursi picture Jonathan Dursi · Apr 9, 2012

This isn't the way to think about these things.

I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within that task. There is no MPI rank on a GPU.
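To make that concrete, here is a rough sketch of the usual hybrid layout (not your exact problem): every MPI task is an ordinary CPU process, and a task that finds a GPU on its node attaches to one and uses it to speed up its own local work before the usual MPI communication. The do_local_work_cpu/do_local_work_gpu functions are hypothetical placeholders, and the rank-modulo device selection is a crude stand-in for a proper node-local rank mapping.

```c
/* Rough sketch of a hybrid MPI + CUDA layout: every rank is an ordinary CPU
 * process; a rank whose node has a GPU attaches to one and offloads its own
 * local work. The GPU never has an MPI rank of its own. */
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical stand-ins for the per-task work; a real code would do the
 * expensive computation here (launching CUDA kernels in the GPU version). */
static double do_local_work_cpu(int rank) { return 1.0 + 1e-6 * rank; }
static double do_local_work_gpu(int rank) { return do_local_work_cpu(rank); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each task asks the CUDA runtime whether its node has any GPUs. */
    int ndevices = 0;
    int have_gpu = (cudaGetDeviceCount(&ndevices) == cudaSuccess && ndevices > 0);

    double local;
    if (have_gpu) {
        /* Crude device selection; a real code would map node-local ranks to GPUs. */
        cudaSetDevice(rank % ndevices);
        local = do_local_work_gpu(rank);
    } else {
        local = do_local_work_cpu(rank);
    }

    /* Communication between tasks is still plain MPI, GPU or not. */
    double global = 1.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_PROD, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

The point is that the GPU only ever appears inside a task; the communication pattern between tasks is unchanged.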

Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
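To put rough, purely illustrative numbers on that: 250,000 doubles is only about 2 MB, so at a few GB/s over PCIe the host-to-device copy alone costs a few tenths of a millisecond, which is roughly what a single CPU core needs to do all 250,000 multiplies anyway. The transfer would eat the entire potential gain before the GPU did any useful work.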

(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over InfiniBand, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.