Typically, reduction is performed on a global or shared array.
However, when the reduction is performed on a very small scale, as a part of a bigger CUDA kernel, it can be performed with a single warp.
When that happens, on Kepler or newer architectures (compute capability 3.0 or higher), it is possible to use warp-shu...
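
The single-warp case described above can be illustrated with a minimal sketch. It assumes CUDA 9 or later, where the shuffle intrinsic is spelled `__shfl_down_sync` and takes an explicit lane mask (older toolkits used the non-`_sync` `__shfl_down`). The names `warpReduceSum`, `sumOneWarp` and `sumOneWarpOnHost` are hypothetical and exist only for this illustration; the kernel assumes it is launched with exactly one full warp of 32 threads.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: sums one int per lane across a full warp.
// After the loop, lane 0 holds the total; other lanes hold partial sums.
__inline__ __device__ int warpReduceSum(int val) {
    // 0xffffffff: all 32 lanes take part (assumes a full, converged warp).
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Each lane adds the value held by the lane `offset` positions above it.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical kernel: expects a launch of <<<1, 32>>> and 32 input elements.
__global__ void sumOneWarp(const int* in, int* out) {
    int val = in[threadIdx.x];      // one element per lane, no shared memory
    val = warpReduceSum(val);
    if (threadIdx.x == 0) {
        *out = val;                 // only lane 0 has the complete sum
    }
}

// Hypothetical host wrapper: copies 32 ints to the device, reduces, returns the sum.
int sumOneWarpOnHost(const int* hostIn) {
    int* dIn = nullptr;
    int* dOut = nullptr;
    int result = 0;

    cudaError_t err = cudaMalloc(&dIn, 32 * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMalloc(&dOut, sizeof(int));
    assert(err == cudaSuccess);

    err = cudaMemcpy(dIn, hostIn, 32 * sizeof(int), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    sumOneWarp<<<1, 32>>>(dIn, dOut);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;                      // silence unused warning when NDEBUG is set

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because the values travel directly between registers of the lanes in the warp, this sketch needs neither a shared array nor `__syncthreads()`. Calling `sumOneWarpOnHost` on the values 1 through 32 should give their arithmetic sum, 528.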