I am working on the GPU algorithm which is supposed to do a lot of modular computations. Particularly, various operations on matrices in a finite field which in the long run reduce to primitive operations like: (a*b - c*d) mod m or (a*b + c) mod m where a,b,c and d are residues modulo m and m is a 32-bit prime.
Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic because integer modulo (%) and division operations are not supported on the GPU in hardware.
I appreciate if somebody can give me an idea how to realize efficient modular computations with CUDA ?
To see how this is implemented on CUDA, I use the following code snippet:
__global__ void mod_kernel(unsigned *gout, const unsigned *gin) {
unsigned tid = threadIdx.x;
unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];
typedef unsigned long long u64;
__syncthreads();
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
__syncthreads();
gout[tid] = r;
}
This code is not supposed to work, I just wanted to see how modular reduction is implemented on CUDA.
When I disassemble this with cuobjdump --dump-sass (thanks njuffa for advice!), I see the following:
/*0098*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00a0*/ /*0x1c315c4350000000*/ IMUL.U32.U32.HI R5, R3, R7;
/*00a8*/ /*0x1c311c0350000000*/ IMUL.U32.U32 R4, R3, R7;
/*00b0*/ /*0xfc01dde428000000*/ MOV R7, RZ;
/*00b8*/ /*0xe001000750000000*/ CAL 0xf8;
/*00c0*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*00c8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
Note that between the two calls to bar.red.popc there is a call to 0xf8 procedure which implements some sophisticated algorithm (about 50 instructions or even more). Not surpising that mod (%) operation is slow
Some time ago I experimented a lot with modular arithmetic on the GPU. On Fermi GPUs you can use double-precision arithmetic to avoid expensive div and mod operations. For example, modular multiplication can be done as follows:
// fast truncation of double-precision to integers
#define CUMP_D2I_TRUNC (double)(3ll << 51)
// computes r = a + b subop c unsigned using extended precision
#define VADDx(r, a, b, c, subop) \
asm volatile("vadd.u32.u32.u32." subop " %0, %1, %2, %3;" : \
"=r"(r) : "r"(a) , "r"(b), "r"(c));
// computes a * b mod m; invk = (double)(1<<30) / m
__device__ __forceinline__
unsigned mul_m(unsigned a, unsigned b, volatile unsigned m,
volatile double invk) {
unsigned hi = __umulhi(a*2, b*2); // 3 flops
// 2 double instructions
double rf = __uint2double_rn(hi) * invk + CUMP_D2I_TRUNC;
unsigned r = (unsigned)__double2loint(rf);
r = a * b - r * m; // 2 flops
// can also be replaced by: VADDx(r, r, m, r, "min") // == umin(r, r + m);
if((int)r < 0)
r += m;
return r;
}
However this only works for 31-bit integer modulos (if 1 bit is not critical for you) and you also need to precompute 'invk' beforehand. This gives absolute minimum of instructions I can achieve, ie.:
SHL.W R2, R4, 0x1;
SHL.W R8, R6, 0x1;
IMUL.U32.U32 R4, R4, R6;
IMUL.U32.U32.HI R8, R2, R8;
I2F.F64.U32 R8, R8;
DFMA R2, R2, R8, R10;
IMAD.U32.U32 R4, -R12, R2, R4;
ISETP.GE.AND P0, pt, R4, RZ, pt;
@!P0 IADD R4, R12, R4;
For description of the algorithm, you can have a look at my paper: gpu_resultants. Other operations like (xy - zw) mod m are also explained there.
Out of curiosity, I compared the performance of the resultant algorithm using your modular multiplication:
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
against the optimized version with mul_m.
Modular arithmetic with default % operation:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 5755.357910 ms; mod_inv time: 0.907008 ms; interp time: 856.015015 ms; CRA time: 44.065857 ms
GPU time elapsed: 6659.405273 ms;
Modular arithmetic with mul_m:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 1100.124756 ms; mod_inv time: 0.192608 ms; interp time: 220.615143 ms; CRA time: 10.376352 ms
GPU time elapsed: 1334.742310 ms;
So on the average it is about 5x faster. Note also that, you might not see a speed-up if you just evaluate raw arithmetic performance using a kernel with a bunch of mul_mod operations (like saxpy example). But in real applications with control logic, synchronization barriers etc. the speed-up is very noticeable.