cuda Single-warp parallel reduction using registers only


Typically, reduction is performed on global or shared array. However, when the reduction is performed on a very small scale, as a part of a bigger CUDA kernel, it can be performed with a single warp. When that happens, on Keppler or higher architectures (CC>=3.0), it is possible to use warp-shuffle functions to avoid using shared memory at all.

Suppose for example, that each thread in a warp holds a single input data value. All threads together have 32 elements, that we need to sum up (or perform other associative operation)

__device__ int sumSingleWarpReg(int value) {
    value += __shfl_down(value, 1);
    value += __shfl_down(value, 2);
    value += __shfl_down(value, 4);
    value += __shfl_down(value, 8);
    value += __shfl_down(value, 16);
    return __shfl(value,0);

This version works for both commutative and non-commutative operators.