Sometimes the reduction has to be performed on a very small scale, as part of a bigger CUDA kernel.
Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp.
In such a scenario a single warp can be assigned to perform the reduction.
Given that a warp executes in perfect sync, many __syncthreads() instructions can be removed - when compared to a block-level reduction. (Note that this perfect-sync assumption holds only on pre-Volta architectures; with independent thread scheduling, __syncwarp() calls are needed instead.)
static const int warpSize = 32;

__device__ int sumNoncommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    if (idx % 2 == 0) shArr[idx] += shArr[idx + 1];
    if (idx % 4 == 0) shArr[idx] += shArr[idx + 2];
    if (idx % 8 == 0) shArr[idx] += shArr[idx + 4];
    if (idx % 16 == 0) shArr[idx] += shArr[idx + 8];
    if (idx == 0) shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
shArr is preferably an array in shared memory. Its value should be the same for all threads in the warp.
If sumNoncommSingleWarp is called by multiple warps, shArr should be different between warps (but the same within each warp).
The argument shArr is marked as volatile to ensure that operations on the array are actually performed where indicated.
Otherwise, the repetitive assignments to shArr[idx] might be optimized into assignments to a register, with only the final assignment being an actual store to shArr.
When that happens, the intermediate assignments are not visible to other threads, yielding incorrect results.
Note that you can pass a normal, non-volatile array as the volatile argument, the same way you can pass a non-const argument to a const parameter.
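As an illustration, a minimal kernel sketch that drives the function above might look as follows; the kernel name and launch configuration are assumptions for the example, and it relies on the same single-warp synchrony described earlier:

__global__ void sumSingleWarpKernel(const int* input, int* output) {
    __shared__ int shArr[32];              //one warp's worth of data
    int idx = threadIdx.x;                 //assumes blockDim.x == 32 (a single warp)
    shArr[idx] = input[idx];               //load the 32 inputs into shared memory
    int sum = sumNoncommSingleWarp(shArr); //int* converts implicitly to volatile int*
    if (idx == 0)
        *output = sum;                     //one thread writes out the result
}

Such a kernel would be launched with a single warp, e.g. sumSingleWarpKernel<<<1, 32>>>(d_in, d_out).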
If one does not care about the final contents of shArr[1..31] and can pad shArr[32..47] with zeros, one can simplify the above code:
static const int warpSize = 32;

__device__ int sumNoncommSingleWarpPadded(volatile int* shArr) {
    //precondition: shArr[32..47] == 0
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    shArr[idx] += shArr[idx + 1];     //upper lanes read into the zero padding
    shArr[idx] += shArr[idx + 2];
    shArr[idx] += shArr[idx + 4];
    shArr[idx] += shArr[idx + 8];
    shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
In this setup we removed all the if conditions, which constitute about half of the instructions.
The extra threads perform some unnecessary additions, storing the results into cells of shArr that ultimately have no impact on the final result.
Since warps execute in SIMD fashion, we would not actually save any time by having those threads do nothing instead.
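For completeness, here is a sketch of a caller for the padded variant, with the same caveats as before (the kernel name is illustrative); the caller is responsible for zeroing the padding cells shArr[32..47] before the reduction:

__global__ void sumSingleWarpPaddedKernel(const int* input, int* output) {
    __shared__ int shArr[48];                    //32 data cells plus 16 padding cells
    int idx = threadIdx.x;                       //assumes blockDim.x == 32
    shArr[idx] = input[idx];                     //load the 32 inputs
    if (idx < 16)
        shArr[32 + idx] = 0;                     //zero the padding, shArr[32..47]
    int sum = sumNoncommSingleWarpPadded(shArr); //the precondition is now satisfied
    if (idx == 0)
        *output = sum;
}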