cuda Tutorial => Single-warp parallel reduction using registers only

Example

Typically, reduction is performed on global or shared array. However, when the reduction is performed on a very small scale, as a part of a bigger CUDA kernel, it can be performed with a single warp. When that happens, on Keppler or higher architectures (CC>=3.0), it is possible to use warp-shuffle functions to avoid using shared memory at all.

Suppose for example, that each thread in a warp holds a single input data value. All threads together have 32 elements, that we need to sum up (or perform other associative operation)

__device__ int sumSingleWarpReg(int value) {
    value += __shfl_down(value, 1);
    value += __shfl_down(value, 2);
    value += __shfl_down(value, 4);
    value += __shfl_down(value, 8);
    value += __shfl_down(value, 16);
    return __shfl(value,0);
}

This version works for both commutative and non-commutative operators.

PDF - Download cuda for free

Previous Next

cuda

Fastest Entity Framework Extensions

Example

Got any cuda Question?

cuda

cuda Parallel reduction (e.g. how to sum an array) Single-warp parallel reduction using registers only

Fastest Entity Framework Extensions

Example

Got any cuda Question?