Sometimes the reduction has to be performed on a very small scale, as part of a bigger CUDA kernel.
Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp.
In that scenario a single warp can be assigned to perform the reduction.
Given that a warp executes in perfect sync, many __syncthreads() instructions can be removed - compared to a block-level reduction.
static const int warpSize = 32;

__device__ int sumCommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    //each step halves the number of active lanes; no synchronization needed within a warp
    if (idx<16) shArr[idx] += shArr[idx+16];
    if (idx<8)  shArr[idx] += shArr[idx+8];
    if (idx<4)  shArr[idx] += shArr[idx+4];
    if (idx<2)  shArr[idx] += shArr[idx+2];
    if (idx==0) shArr[idx] += shArr[idx+1];
    return shArr[0];
}
shArr is preferably an array in shared memory. Its value should be the same for all threads in the warp.
If sumCommSingleWarp is called by multiple warps, shArr should be different between warps (but the same within each warp).
The argument shArr is marked as volatile to ensure that operations on the array are actually performed where indicated.
Otherwise, the repetitive assignments to shArr[idx] may be optimized into assignments to a register, with only the final assignment being an actual store to shArr.
If that happened, the intermediate assignments would not be visible to other threads, yielding incorrect results.
Note that you can pass a normal, non-volatile array as an argument of a volatile parameter, the same way you can pass a non-const argument as a const parameter.
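As a usage illustration, a minimal single-warp kernel might look like the following (a sketch; the kernel name sum32 is illustrative, and it assumes a launch with exactly 32 threads in one block):

__global__ void sum32(const int *a, int *out) {
    __shared__ int shArr[32]; //ordinary, non-volatile shared array
    int idx = threadIdx.x;
    shArr[idx] = a[idx]; //each of the 32 threads loads one input element
    int sum = sumCommSingleWarp(shArr); //implicit int* -> volatile int* conversion
    if (idx == 0)
        *out = sum;
}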
If one does not care about the contents of shArr[1..31] after the reduction, one can simplify the code even further:
static const int warpSize = 32;

__device__ int sumCommSingleWarp(volatile int* shArr) {
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    if (idx<16) {
        //lanes 1..15 leave garbage in shArr[1..15], but only shArr[0] matters
        shArr[idx] += shArr[idx+16];
        shArr[idx] += shArr[idx+8];
        shArr[idx] += shArr[idx+4];
        shArr[idx] += shArr[idx+2];
        shArr[idx] += shArr[idx+1];
    }
    return shArr[0];
}
In this setup we removed many of the if conditions. The extra threads perform some unnecessary additions, but we no longer care about the contents they produce.
Since warps execute in SIMD mode, we do not actually save time by having those threads do nothing.
On the other hand, evaluating the conditions takes a relatively large amount of time, since the bodies of these if statements are so small.
The initial if statement can be removed as well if shArr[32..47] are padded with zeros.
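A fully unrolled variant might then look like this (a sketch; the name sumCommSingleWarpPadded is illustrative, and it assumes the caller provides a 48-element array with shArr[32..47] zeroed):

__device__ int sumCommSingleWarpPadded(volatile int* shArr) {
    //assumes shArr has 48 elements and shArr[32..47] are zero,
    //so lanes 16..31 read zeros instead of out-of-bounds values
    int idx = threadIdx.x % warpSize; //the lane index in the warp
    shArr[idx] += shArr[idx+16];
    shArr[idx] += shArr[idx+8];
    shArr[idx] += shArr[idx+4];
    shArr[idx] += shArr[idx+2];
    shArr[idx] += shArr[idx+1];
    return shArr[0];
}

Since no lane ever writes past shArr[31], the zero padding stays intact across all five steps.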
The warp-level reduction can be used to boost the block-level reduction as well:
__global__ void sumCommSingleBlockWithWarps(const int *a, int *out) {
    //arraySize and blockSize are the compile-time constants from the earlier examples
    int idx = threadIdx.x;
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize)
        sum += a[i];
    __shared__ int r[blockSize];
    r[idx] = sum;
    sumCommSingleWarp(&r[idx & ~(warpSize-1)]); //each warp reduces its own 32-element chunk
    __syncthreads();
    if (idx<warpSize) { //first warp only
        r[idx] = idx*warpSize<blockSize ? r[idx*warpSize] : 0; //gather the per-warp results
        sumCommSingleWarp(r);
        if (idx == 0)
            *out = r[0];
    }
}
The argument &r[idx & ~(warpSize-1)] is basically r + warpIdx*32: threads 0..31 pass &r[0], threads 32..63 pass &r[32], and so on.
This effectively splits the r array into chunks of 32 elements, and each chunk is assigned to a separate warp.
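For completeness, a host-side call might look like this (a sketch; dev_a and dev_out are illustrative names, and arraySize and blockSize are the same constants the kernel uses):

int a[arraySize];
//... fill a with input data ...
int *dev_a, *dev_out;
cudaMalloc((void**)&dev_a, arraySize * sizeof(int));
cudaMalloc((void**)&dev_out, sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
sumCommSingleBlockWithWarps<<<1, blockSize>>>(dev_a, dev_out); //a single block of blockSize threads
int out;
cudaMemcpy(&out, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_out);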