large integer addition with CUDA

user1545642 picture user1545642 · Oct 18, 2012 · Viewed 7.4k times · Source

I've been developing a cryptographic algorithm on the GPU and currently stuck with an algorithm to perform large integer addition. Large integers are represented in a usual way as a bunch of 32-bit words.

For example, we can use one thread to add two 32-bit words. For simplicity, let assume that the numbers to be added are of the same length and number of threads per block == number of words. Then:

__global__ void add_kernel(int *C, const int *A, const int *B) {
     int x = A[threadIdx.x];
     int y = B[threadIdx.x];
     int z = x + y;
     int carry = (z < x);
     /** do carry propagation in parallel somehow ? */
     ............

     z = z + newcarry; // update the resulting words after carry propagation
     C[threadIdx.x] = z;
 }

I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure but could not figure it out..

I had a look at CUDA thrust extensions but big integer package seems not to be implemented yet. Perhaps someone can give me a hint how to do that on CUDA ?

Answer

user1545642 picture user1545642 · Oct 21, 2012

You are right, carry propagation can be done via prefix sum computation but it's a bit tricky to define the binary function for this operation and prove that it is associative (needed for parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in Carry-lookahead adder.

Suppose we have two large integers a[0..n-1] and b[0..n-1]. Then we compute (i = 0..n-1):

s[i] = a[i] + b[i]l;
carryin[i] = (s[i] < a[i]);

We define two functions:

generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);

with quite intuitive meaning: generate[i] == 1 means that the carry is generated at position i while propagate[i] == 1 means that the carry will be propagated from position (i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1] used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:

carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = 0

Here carryout[i] == 1 if carry is generated at position i OR it is generated sometimes earlier AND propagated to position i. Finally, we update the resulting sum:

s[i] = s[i] + carryout[i-1];  for i = 1..n-1
carry = carryout[n-1];

Now it is quite straightforward to prove that carryout function is indeed binary associative and hence parallel prefix sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' in a single variable since they are mutually exclusive, i.e.:

cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];

In other words,

cy[i] = 0xffffffff  if propagate[i]
cy[i] = 1           if generate[i]
cy[u] = 0           otherwise

Then, one can verify that the following formula computes prefix sum for carryout function:

cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];

for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:

// add & output carry flag
#define UADDO(c, a, b) \ 
     asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \ 
     asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {

extern __shared__ unsigned shared[];
unsigned *r = shared; 

const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;

uint4 a = ((const uint4 *)g_A)[thid],
      b = ((const uint4 *)g_B)[thid];

UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out

// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
        49 * (thid / 64)) + ((thid / 32) & 1);

scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
    // this indicates that carry will propagate through the number
    cf = -1u;

// "Hillis-and-Steele-style" reduction 
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;

int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
    postscan[thid >> 5] = cf;

__syncthreads();

if(thid < N_THIDS / 32) {
    volatile int *t = (volatile int *)postscan + thid;
    t[-8] = -1; // load identity symbol
    cf = t[0];
    cf = max((int)cf, (int)t[-1]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-2]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-4]) & cf;
    t[0] = cf;
}
__syncthreads();

cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];

if(thid_in_warp == 0)
    cf = ps;
if((int)cf < 0)
    cf = 0;

UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}

Note that macros UADDO / UADDC might not be necessary anymore since CUDA 4.0 has corresponding intrinsics (however I am not entirely sure).

Also remark that, though parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use some redundant representation (which was suggested in comments above), i.e., first accumulate the results of additions in 64-bit words, and then perform one carry propagation at the very end in "one sweep".