I've been searching the web but I'm still very confused about this topic. Can anyone explain this more clearly? I come from an Aerospace Engineering background (not from a Computer Science one), so when I read online about OpenMP/CUDA/etc. and multithreading I don't really understand a great deal of what is being said.
I'm currently trying to parallelize an in-house CFD software written in FORTRAN. These are my doubts:
OpenMP shares the workload using multiple threads from the CPU. Can it be used to allow the GPU to get some of the work too?
I've read about OpenACC. Is it similar to OpenMP (easy to use)?
I've also read about CUDA and kernels, but I don't have any much experience in parallel programming and I don't have the faintest idea of what a kernel is.
Can you give me a "for dummies" type of answer?
Yes. The OpenMP 4 target constructs were designed to support a wide range of accelerators. Compiler support for NVIDIA GPUs is available from GCC 7+ (see 1 and 2, although the latter has not been updated to reflect OpenMP 4 GPU support), Clang (see 3,4,5), and Cray. Compiler support for Intel GPUs is available in the Intel C/C++ compiler (see e.g. 6).
The IBM-developed Clang/LLVM implementation of OpenMP 4+ for NVIDIA GPUs is available from https://github.com/clang-ykt. The build recipe is provided in "OpenMP compiler for CORAL/OpenPower Heterogeneous Systems".
The Cray compiler supports OpenMP target for NVIDIA GPUs. From Cray Fortran Reference Manual (8.5):
The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.
The Intel compiler supports OpenMP target for Intel Gen graphics for C/C++ but not Fortran. Furthermore, the teams
and distribute
clauses are not supported because they are not necessary/appropriate. Below is a simple example showing how the OpenMP target features work in different environments.
void vadd2(int n, float * a, float * b, float * c)
{
#pragma omp target map(to:n,a[0:n],b[0:n]) map(from:c[0:n])
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
#pragma omp parallel for simd
#else
#pragma omp teams distribute parallel for simd
#endif
for(int i = 0; i < n; i++)
c[i] = a[i] + b[i];
}
The compiler options for Intel and GCC are as follows. I don't have GCC setup for NVIDIA GPUs but you can see the documentation for the appropriate -foffload
options.
$ icc -std=c99 -qopenmp -qopenmp-offload=gfx -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS
$ gcc-7 -fopenmp -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS