The simplest approach to parallel reduction in CUDA is to assign a single block to perform the task:
static const int arraySize = 10000;
static const int blockSize = 1024;

__global__ void sumCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    // Phase 1: each thread independently accumulates a partial sum
    // over a strided slice of the input.
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize)
        sum += a[i];
    __shared__ int r[blockSize];
    r[idx] = sum;
    __syncthreads();
    // Phase 2: tree reduction in shared memory (blockSize must be a power of two).
    for (int size = blockSize / 2; size > 0; size /= 2) { //uniform
        if (idx < size)
            r[idx] += r[idx + size];
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}
...
sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);
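For context, a minimal host-side driver might look like the sketch below. The buffer names dev_a and dev_out match the launch above, but the allocation and copy code around them is an assumption, since the original snippet elides it:

int a[arraySize];
for (int i = 0; i < arraySize; i++)
    a[i] = 1;                                  // example input: the sum should equal arraySize

int *dev_a, *dev_out;
cudaMalloc(&dev_a, arraySize * sizeof(int));   // hypothetical setup, not from the original
cudaMalloc(&dev_out, sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);

sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

int out;
cudaMemcpy(&out, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_out);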
This approach is most feasible when the data size is not very large (around a few thousand elements), which typically happens when the reduction is part of a bigger CUDA program.
If the input size matches blockSize from the very beginning, the first for loop can be removed entirely (see the sketch below). Note that in the first phase, when there are more elements than threads, the threads add things up completely independently; only once the problem is reduced to blockSize partial sums does the actual parallel reduction begin.
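As a sketch of that simplification, assuming the input holds exactly blockSize elements (the kernel name is not from the original text):

__global__ void sumCommExactBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    __shared__ int r[blockSize];
    r[idx] = a[idx];                 // one element per thread, no accumulation loop
    __syncthreads();
    for (int size = blockSize / 2; size > 0; size /= 2) {
        if (idx < size)
            r[idx] += r[idx + size];
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}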
The same code can be applied to any other commutative and associative operator, such as multiplication, minimum, or maximum.
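For instance, a minimum reduction only needs the accumulator, its identity element, and the combining step changed; below is a sketch under that assumption (the kernel name minCommSingleBlock is mine, and INT_MAX comes from <climits>):

#include <climits>

__global__ void minCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    int m = INT_MAX;                 // identity element for min
    for (int i = idx; i < arraySize; i += blockSize)
        m = min(m, a[i]);
    __shared__ int r[blockSize];
    r[idx] = m;
    __syncthreads();
    for (int size = blockSize / 2; size > 0; size /= 2) {
        if (idx < size)
            r[idx] = min(r[idx], r[idx + size]);
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}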
Note that the algorithm can be made faster, for example by using a warp-level parallel reduction.
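To illustrate that idea, once the shared-memory reduction is down to 32 elements, the remaining steps can be done with warp shuffles and no __syncthreads(); the sketch below assumes CUDA 9 or newer (for __shfl_down_sync) and a warp size of 32, and the helper name is not from the original text:

__device__ int warpReduceSum(int val) {
    // Each step folds the upper half of the active lanes into the lower half;
    // after five steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

Lane 0 of each warp would then write its partial result to shared memory, where a single final warp combines the per-warp sums the same way.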