Sometimes the reduction has to be performed on a very small scale, as part of a bigger CUDA kernel.
Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp.
In that scenario a single warp can be assigned to perform the reduction.
Given that a warp executes in perfect sync, many __syncthreads() instructions can be removed - compared to a block-level reduction.
static const int warpSize = 32;

__device__ int sumCommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    //each step halves the number of active lanes; no synchronization needed within a warp
    if (idx<16) shArr[idx] += shArr[idx+16];
    if (idx<8)  shArr[idx] += shArr[idx+8];
    if (idx<4)  shArr[idx] += shArr[idx+4];
    if (idx<2)  shArr[idx] += shArr[idx+2];
    if (idx==0) shArr[idx] += shArr[idx+1];
    return shArr[0];
}
shArr is preferably an array in shared memory. Its value should be the same for all threads in the warp.
If sumCommSingleWarp is called by multiple warps, shArr should be different between warps (but the same within each warp).
The argument shArr is marked as volatile to ensure that operations on the array are actually performed where indicated.
Otherwise, the repetitive assignments to shArr[idx] may be optimized into assignments to a register, with only the final assignment being an actual store to shArr.
If that happened, the intermediate assignments would not be visible to other threads, yielding incorrect results.
Note that you can pass a normal, non-volatile array as an argument of a volatile parameter, the same way you can pass a non-const argument as a const parameter.
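As a usage illustration, a minimal single-warp kernel might look like the following (a sketch; the kernel name sum32 is illustrative, and it assumes a launch with exactly 32 threads in one block):

__global__ void sum32(const int *a, int *out) {
    __shared__ int shArr[32]; //ordinary, non-volatile shared array
    int idx = threadIdx.x;
    shArr[idx] = a[idx]; //each of the 32 threads loads one input element
    int sum = sumCommSingleWarp(shArr); //implicit int* -> volatile int* conversion
    if (idx == 0)
        *out = sum;
}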
If one does not care about the contents of shArr[1..31] after the reduction, one can simplify the code even further:
static const int warpSize = 32;

__device__ int sumCommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    if (idx<16) {
        //lanes 1..15 leave garbage in shArr[1..15], but only shArr[0] matters
        shArr[idx] += shArr[idx+16];
        shArr[idx] += shArr[idx+8];
        shArr[idx] += shArr[idx+4];
        shArr[idx] += shArr[idx+2];
        shArr[idx] += shArr[idx+1];
    }
    return shArr[0];
}
In this setup we removed many of the if conditions. The extra threads perform some unnecessary additions, but we no longer care about the contents they produce.
Since warps execute in SIMD mode, we do not actually save time by having those threads do nothing.
On the other hand, evaluating the conditions takes a relatively large amount of time, since the bodies of these if statements are so small.
The initial if statement can be removed as well if shArr[32..47] are padded with zeros.
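A fully unrolled variant might then look like this (a sketch; the name sumCommSingleWarpPadded is illustrative, and it assumes the caller provides a 48-element array with shArr[32..47] zeroed):

__device__ int sumCommSingleWarpPadded(volatile int* shArr) {
    //assumes shArr has 48 elements and shArr[32..47] are zero,
    //so lanes 16..31 read zeros instead of out-of-bounds values
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    shArr[idx] += shArr[idx+16];
    shArr[idx] += shArr[idx+8];
    shArr[idx] += shArr[idx+4];
    shArr[idx] += shArr[idx+2];
    shArr[idx] += shArr[idx+1];
    return shArr[0];
}

Since no lane ever writes past shArr[31], the zero padding stays intact across all five steps.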
The warp-level reduction can be used to boost the block-level reduction as well:
__global__ void sumCommSingleBlockWithWarps(const int *a, int *out) {
    //arraySize and blockSize are the compile-time constants from the earlier examples
    int idx = threadIdx.x;
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize)
        sum += a[i];
    __shared__ int r[blockSize];
    r[idx] = sum;
    sumCommSingleWarp(&r[idx & ~(warpSize-1)]); //each warp reduces its own 32-element chunk
    __syncthreads();
    if (idx<warpSize) { //first warp only
        r[idx] = idx*warpSize<blockSize ? r[idx*warpSize] : 0; //gather the per-warp results
        sumCommSingleWarp(r);
        if (idx == 0)
            *out = r[0];
    }
}
The argument &r[idx & ~(warpSize-1)] is basically r + warpIdx*32: threads 0..31 pass &r[0], threads 32..63 pass &r[32], and so on.
This effectively splits the r array into chunks of 32 elements, and each chunk is assigned to a separate warp.
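For completeness, a host-side call might look like this (a sketch; dev_a and dev_out are illustrative names, and arraySize and blockSize are the same constants the kernel uses):

int a[arraySize];
//... fill a with input data ...
int *dev_a, *dev_out;
cudaMalloc((void**)&dev_a, arraySize * sizeof(int));
cudaMalloc((void**)&dev_out, sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
sumCommSingleBlockWithWarps<<<1, blockSize>>>(dev_a, dev_out); //a single block of blockSize threads
int out;
cudaMemcpy(&out, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_out);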