Sometimes the reduction has to be performed on a very small scale, as part of a bigger CUDA kernel.
Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp.
In such a scenario a single warp can be assigned to perform the reduction.
Given that a warp executes in perfect sync, many __syncthreads() instructions can be removed - when compared to a block-level reduction. (Note that this perfect-sync assumption holds only on pre-Volta architectures; with independent thread scheduling, __syncwarp() calls are needed instead.)
static const int warpSize = 32;

__device__ int sumNoncommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    if (idx % 2 == 0) shArr[idx] += shArr[idx + 1];
    if (idx % 4 == 0) shArr[idx] += shArr[idx + 2];
    if (idx % 8 == 0) shArr[idx] += shArr[idx + 4];
    if (idx % 16 == 0) shArr[idx] += shArr[idx + 8];
    if (idx == 0) shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
shArr is preferably an array in shared memory. Its value should be the same for all threads in the warp.
If sumNoncommSingleWarp is called by multiple warps, shArr should be different between warps (but the same within each warp).
The argument shArr is marked as volatile to ensure that operations on the array are actually performed where indicated.
Otherwise, the repetitive assignments to shArr[idx] might be optimized into assignments to a register, with only the final assignment being an actual store to shArr.
When that happens, the intermediate assignments are not visible to other threads, yielding incorrect results.
Note that you can pass a normal, non-volatile array as the volatile argument, the same way you can pass a non-const argument to a const parameter.
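As an illustration, a minimal kernel sketch that drives the function above might look as follows; the kernel name and launch configuration are assumptions for the example, and it relies on the same single-warp synchrony described earlier:

__global__ void sumSingleWarpKernel(const int* input, int* output) {
    __shared__ int shArr[32];              //one warp's worth of data
    int idx = threadIdx.x;                 //assumes blockDim.x == 32 (a single warp)
    shArr[idx] = input[idx];               //load the 32 inputs into shared memory
    int sum = sumNoncommSingleWarp(shArr); //int* converts implicitly to volatile int*
    if (idx == 0)
        *output = sum;                     //one thread writes out the result
}

Such a kernel would be launched with a single warp, e.g. sumSingleWarpKernel<<<1, 32>>>(d_in, d_out).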
If one does not care about the final contents of shArr[1..31] and can pad shArr[32..47] with zeros, one can simplify the above code:
static const int warpSize = 32;

__device__ int sumNoncommSingleWarpPadded(volatile int* shArr) {
    //precondition: shArr[32..47] == 0
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    shArr[idx] += shArr[idx + 1];     //upper lanes read into the zero padding
    shArr[idx] += shArr[idx + 2];
    shArr[idx] += shArr[idx + 4];
    shArr[idx] += shArr[idx + 8];
    shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
In this setup we removed all the if conditions, which constitute about half of the instructions.
The extra threads perform some unnecessary additions, storing the results into cells of shArr that ultimately have no impact on the final result.
Since warps execute in SIMD fashion, we would not actually save any time by having those threads do nothing instead.
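For completeness, here is a sketch of a caller for the padded variant, with the same caveats as before (the kernel name is illustrative); the caller is responsible for zeroing the padding cells shArr[32..47] before the reduction:

__global__ void sumSingleWarpPaddedKernel(const int* input, int* output) {
    __shared__ int shArr[48];                    //32 data cells plus 16 padding cells
    int idx = threadIdx.x;                       //assumes blockDim.x == 32
    shArr[idx] = input[idx];                     //load the 32 inputs
    if (idx < 16)
        shArr[32 + idx] = 0;                     //zero the padding, shArr[32..47]
    int sum = sumNoncommSingleWarpPadded(shArr); //the precondition is now satisfied
    if (idx == 0)
        *output = sum;
}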