How to use C++ templates in OpenCL kernels?

flashnik · Dec 16, 2010

I'm a novice in OpenCL.

I have an algorithm that uses templates. It worked well with OpenMP parallelization, but the amount of data has grown and the only way to process it now is to rewrite it for OpenCL. I could easily use MPI to build it for a cluster, but a Tesla-like GPU is much cheaper than a cluster :)

Is there any way to use C++ templates in an OpenCL kernel?

Is it possible to expand the templates with a C++ compiler or some other tool, and then use the resulting kernel function?

EDIT. The idea of a workaround is to generate C99-compatible code from the templated C++ code.

I found the following about Comeau:

Comeau C++ 4.3.3 is a full and true compiler that performs full syntax checking, full semantic checking, full error checking and all other compiler duties. Input C++ code is translated into internal compiler trees and symbol tables looking nothing like C++ or C. As well, it generates an internal proprietary intermediate form. But instead of using a proprietary back end code generator, Comeau C++ 4.3.3 generates C code as its output. Besides the technical advantages of C++, the C generating aspects of products like Comeau C++ 4.3.3 have been touted as a reason for C++'s success since it was able to be brought to a large number of platforms due to the common availability of C compilers.

The C compiler is used merely and only for the sake of obtaining native code generation. This means that Comeau C++ is tailored for use with specific C compilers on each respective platform. Please note that it is a requirement that tailoring must be done by Comeau. Otherwise, the generated C code is meaningless as it is tied to a specific platform (where platform includes at least the CPU, OS, and C compiler) and furthermore, the generated C code is not standalone. Therefore, it cannot be used by itself (note that this is both a technical and legal requirement when using Comeau C++), and this is why there is not normally an option to see the generated C code: it's almost always unhelpful and the compile process, including its generation, should be considered as internal phases of translation.

Answer

stgatilov · Jul 16, 2013

There is an old way to emulate templates in pure C. It is based on including a single file several times (without an include guard), with different macro definitions in effect each time. Since OpenCL has a fully functional preprocessor and allows including files, this trick can be used there too.

Here is a good explanation: http://arnold.uthar.net/index.php?n=Work.TemplatesC

It is still much messier than C++ templates: the code has to be split into several parts, and you have to explicitly instantiate every instance of the template. Also, it seems that you cannot do some useful things, such as implementing factorial as a recursive template.

Code example

Let's apply the idea to OpenCL. Suppose that we want to calculate the inverse square root by Newton-Raphson iteration (generally not a good idea, but it makes a compact example), and that the floating-point type and the number of iterations may vary.
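For reference, the update performed by the loop in the example is the usual Newton-Raphson step applied to a function whose root is the inverse square root of a. The symbols x_n and a below correspond to the x and a arguments of the example function; the function f is introduced here only for the derivation.

```latex
f(x) = \frac{1}{x^{2}} - a, \qquad f'(x) = -\frac{2}{x^{3}}
\\[4pt]
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
        = x_n + \frac{x_n^{3}}{2}\left(\frac{1}{x_n^{2}} - a\right)
        = x_n \left(1.5 - 0.5\, a\, x_n^{2}\right)
```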

First of all, we need a helper header ("templates.h"):

#ifndef TEMPLATES_H_
#define TEMPLATES_H_

#define CAT(X,Y,Z) X##_##Y##_##Z   //concatenate words
#define TEMPLATE(X,Y,Z) CAT(X,Y,Z)

#endif
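The reason TEMPLATE forwards to CAT instead of pasting directly is that the preprocessor does not macro-expand arguments that are operands of ##. A minimal sketch in plain C that makes the difference visible follows; STR, STR_, name_via_cat and name_via_template are hypothetical helpers that are not part of the original header.

```c
#define CAT(X,Y,Z) X##_##Y##_##Z   /* pastes its arguments exactly as written */
#define TEMPLATE(X,Y,Z) CAT(X,Y,Z) /* expands the arguments first, then pastes */

/* Hypothetical helpers: turn the resulting identifier into a string. */
#define STR_(x) #x
#define STR(x) STR_(x)

#define real float
#define iters 2

/* CAT alone ignores the definitions of real and iters:
   yields "NewtonRaphsonRsqrt_real_iters". */
const char *name_via_cat(void) {
    return STR(CAT(NewtonRaphsonRsqrt, real, iters));
}

/* TEMPLATE substitutes float and 2 before pasting:
   yields "NewtonRaphsonRsqrt_float_2". */
const char *name_via_template(void) {
    return STR(TEMPLATE(NewtonRaphsonRsqrt, real, iters));
}

#undef real
#undef iters
```

Without the extra level of indirection, every instantiation would get the same name and the second include would fail with a redefinition error.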

Then we write the template function in "NewtonRaphsonRsqrt.cl" (note that this file has no include guard, because it is meant to be included several times):

#include "templates.h"

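// Expects `real` (floating-point type) and `iters` (iteration count)
// to be #defined before this file is included.
real TEMPLATE(NewtonRaphsonRsqrt, real, iters) (real x, real a) {
    int i;
    for (i = 0; i < iters; i++) {
        // Newton-Raphson step; the casts keep both constants in type `real`
        x *= ((real)1.5 - ((real)0.5 * a) * x * x);
    }
    return x;
}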

In your main .cl file, instantiate this template as follows:

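#define real float
#define iters 2
#include "NewtonRaphsonRsqrt.cl"  //defining NewtonRaphsonRsqrt_float_2
#undef real   //undefine before redefining with a different value
#undef iters

//double needs fp64 support on the device (cl_khr_fp64 on OpenCL 1.0/1.1)
#define real double
#define iters 3
#include "NewtonRaphsonRsqrt.cl"  //defining NewtonRaphsonRsqrt_double_3
#undef real
#undef iters

#define real double
#define iters 4
#include "NewtonRaphsonRsqrt.cl"  //defining NewtonRaphsonRsqrt_double_4
#undef real
#undef iters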

Then you can use it like this:

// arguments: initial guess x, then the value a whose inverse square root is wanted
double prec = TEMPLATE(NewtonRaphsonRsqrt, double, 4) (1.5, 0.5);
float approx = TEMPLATE(NewtonRaphsonRsqrt, float, 2) (1.5f, 0.5f);
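The generated functions can be sanity-checked on the host in plain C before they go into a kernel. The sketch below is a single-file variant of the trick, not the multi-include layout shown above: it assumes a hypothetical DEFINE_NEWTON_RAPHSON_RSQRT macro that wraps the whole function body, so that everything fits in one translation unit. The function names it produces match the ones from the example.

```c
#define CAT(X,Y,Z) X##_##Y##_##Z
#define TEMPLATE(X,Y,Z) CAT(X,Y,Z)

/* Hypothetical macro: expands to one complete instantiation of the
   function for the given floating-point type and iteration count. */
#define DEFINE_NEWTON_RAPHSON_RSQRT(real, iters)                        \
    real TEMPLATE(NewtonRaphsonRsqrt, real, iters)(real x, real a) {    \
        for (int i = 0; i < (iters); i++) {                             \
            x *= ((real)1.5 - ((real)0.5 * a) * x * x);                 \
        }                                                               \
        return x;                                                       \
    }

/* Explicit instantiations, one per (type, iteration count) pair. */
DEFINE_NEWTON_RAPHSON_RSQRT(float, 2)   /* NewtonRaphsonRsqrt_float_2  */
DEFINE_NEWTON_RAPHSON_RSQRT(double, 3)  /* NewtonRaphsonRsqrt_double_3 */
DEFINE_NEWTON_RAPHSON_RSQRT(double, 4)  /* NewtonRaphsonRsqrt_double_4 */
```

Starting from the guess 1.5 with a = 0.5, each instantiation should move closer to 1/sqrt(0.5), which is roughly 1.41421356. The macro form is more compact, but the body is harder to debug than an included file because the compiler reports every error on the line of the macro invocation.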