Sometimes the reduction has to be performed on a very small scale, as a part of a bigger CUDA kernel.
Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp.
In such a scenario a single warp can be assigned to perform the reduction.
Given that a warp executes in perfect sync, many __syncthreads() instructions can be removed when compared to a block-level reduction.
static const int warpSize = 32;
__device__ int sumNoncommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    //pairwise tree reduction; relies on warp-synchronous execution, so no __syncthreads() is needed
    if (idx % 2 == 0) shArr[idx] += shArr[idx + 1];
    if (idx % 4 == 0) shArr[idx] += shArr[idx + 2];
    if (idx % 8 == 0) shArr[idx] += shArr[idx + 4];
    if (idx % 16 == 0) shArr[idx] += shArr[idx + 8];
    if (idx == 0) shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
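As a usage sketch, the function could be called from a kernel launched with a single warp of 32 threads. The kernel name reduceSingleWarp and its launch configuration below are illustrative assumptions, not part of the original code; the sketch follows the section's warp-synchronous assumption, so no explicit synchronization is added.
__global__ void reduceSingleWarp(const int *in, int *out) {
    __shared__ int shArr[32];              // one element per lane
    int idx = threadIdx.x;                 // 0..31 for a 32-thread block
    shArr[idx] = in[idx];                  // stage the input in shared memory
    // No __syncthreads() needed: all 32 threads belong to one warp.
    int sum = sumNoncommSingleWarp(shArr);
    if (idx == 0)
        *out = sum;                        // lane 0 publishes the result
}
// Launch example: reduceSingleWarp<<<1, 32>>>(d_in, d_out);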
shArr is preferably an array in shared memory. The pointer value should be the same for all threads in the warp.
If sumNoncommSingleWarp is called by multiple warps, shArr should be different between warps (same within each warp); a sketch of such a multi-warp call is given below.
The argument shArr is marked as volatile to ensure that operations on the array are actually performed where indicated.
Otherwise, the repetitive assignments to shArr[idx] could be optimized into assignments to a register, with only the final assignment being an actual store to shArr.
If that happened, the intermediate assignments would not be visible to other threads, yielding incorrect results.
Note that you can pass a normal, non-volatile array as an argument for a volatile one, just as you can pass a non-const argument as a const parameter.
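For instance, a block with several warps might give each warp its own 32-int slice of a larger shared array. The kernel below is only a sketch under illustrative assumptions (the names reduceEachWarp and warpRes are made up, and a single block of 256 threads is assumed); it also shows a regular, non-volatile __shared__ array being passed to the volatile parameter.
__global__ void reduceEachWarp(const int *in, int *warpRes) {
    // Illustrative only: assumes a single block of 256 threads (8 warps).
    __shared__ int shArr[256];             // a regular, non-volatile shared array
    int tid = threadIdx.x;
    int warpId = tid / warpSize;
    shArr[tid] = in[tid];
    // Each warp reduces its own 32-element slice; the non-volatile array
    // converts implicitly to the volatile int* parameter.
    int sum = sumNoncommSingleWarp(&shArr[warpId * warpSize]);
    if (tid % warpSize == 0)
        warpRes[warpId] = sum;             // one partial sum per warp
}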
If one does not care about the final contents of shArr[1..31] and can pad shArr[32..47] with zeros, one can simplify the above code:
static const int warpSize = 32;
__device__ int sumNoncommSingleWarpPadded(volatile int* shArr) {
    //shArr[32..47] == 0
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    shArr[idx] += shArr[idx + 1];
    shArr[idx] += shArr[idx + 2];
    shArr[idx] += shArr[idx + 4];
    shArr[idx] += shArr[idx + 8];
    shArr[idx] += shArr[idx + 16];
    return shArr[0];
}
In this setup we removed all the if conditions, which constituted about half of the instructions.
The extra threads perform some unnecessary additions, storing the results into cells of shArr that ultimately have no impact on the final result.
Since warps execute in SIMD mode, we would not actually save any time by having those threads do nothing.
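A possible calling sketch for the padded variant is given below. The kernel name reduceSingleWarpPadded and the launch configuration are illustrative assumptions; the only requirement imposed by the function itself is that shArr[32..47] holds zeros.
__global__ void reduceSingleWarpPadded(const int *in, int *out) {
    __shared__ int shArr[48];              // 32 data elements + 16 padding cells
    int idx = threadIdx.x;                 // block launched with 32 threads
    shArr[idx] = in[idx];
    if (idx < 16)
        shArr[32 + idx] = 0;               // zero the padding shArr[32..47]
    int sum = sumNoncommSingleWarpPadded(shArr);
    if (idx == 0)
        *out = sum;
}
// Launch example: reduceSingleWarpPadded<<<1, 32>>>(d_in, d_out);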