Typically, reduction is performed on a global or shared array. However, when the reduction is performed at a very small scale, as part of a bigger CUDA kernel, it can be done within a single warp. In that case, on Kepler or newer architectures (compute capability 3.0 or higher), warp-shuffle functions can be used to avoid shared memory entirely. Note that since CUDA 9.0 the unsuffixed shuffle intrinsics are deprecated in favor of the _sync variants, which the example below uses.
Suppose, for example, that each thread in a warp holds a single input value. Together, the 32 threads have 32 elements that we need to sum up (or combine with some other associative operation):
__device__ int sumSingleWarpReg(int value) {
    // Each step adds the value held by the lane `delta` positions above;
    // after the last step, lane 0 holds the sum of all 32 lanes.
    // The full mask 0xFFFFFFFF requires all 32 lanes to participate.
    value += __shfl_down_sync(0xFFFFFFFF, value, 1);
    value += __shfl_down_sync(0xFFFFFFFF, value, 2);
    value += __shfl_down_sync(0xFFFFFFFF, value, 4);
    value += __shfl_down_sync(0xFFFFFFFF, value, 8);
    value += __shfl_down_sync(0xFFFFFFFF, value, 16);
    return __shfl_sync(0xFFFFFFFF, value, 0); // broadcast lane 0's result to all lanes
}
Because each lane always combines its own value on the left with a value taken from a higher lane on the right, the combining order of the elements is preserved, so this version works for both commutative and non-commutative operators, as long as the operator is associative.
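For illustration, here is a minimal sketch of how the helper might be called from a kernel. The kernel name warpSumKernel, the indexing scheme, and the assumption that blockDim.x is a multiple of 32 are all illustrative, not from the original:

__global__ void warpSumKernel(const int* in, int* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range lanes contribute the additive identity (0), so every
    // lane of the warp can still participate in the shuffle calls.
    int value = (tid < n) ? in[tid] : 0;
    int sum = sumSingleWarpReg(value); // every lane receives the warp total
    if ((threadIdx.x & 31) == 0)       // first lane of each warp writes out
        out[tid / 32] = sum;
}

With the full-warp mask 0xFFFFFFFF, all 32 lanes must execute the shuffle calls, which is why inactive lanes are given the identity value rather than being branched around the reduction.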