Tutorial by Examples | RIP Tutorial

Single-block parallel reduction for commutative operator

The simplest approach to parallel reduction in CUDA is to assign a single block to perform the task: static const int arraySize = 10000; static const int blockSize = 1024; __global__ void sumCommSingleBlock(const int *a, int *out) { int idx = threadIdx.x; int sum = 0; for (int i ...

cuda • Parallel reduction (e.g. how to sum an array)

Single-block parallel reduction for non-commutative operator

Doing parallel reduction for a non-commutative operator is a bit more involved, compared to commutative version. In the example we still use a addition over integers for the simplicity sake. It could be replaced, for example, with matrix multiplication which really is non-commutative. Note, when ...

cuda • Parallel reduction (e.g. how to sum an array)

Multi-block parallel reduction for commutative operator

Multi-block approach to parallel reduction in CUDA poses an additional challenge, compared to single-block approach, because blocks are limited in communication. The idea is to let each block compute a part of the input array, and then have one final block to merge all the partial results. To do t...

cuda • Parallel reduction (e.g. how to sum an array)

Multi-block parallel reduction for noncommutative operator

Multi-block approach to parallel reduction is very similar to the single-block approach. The global input array must be split into sections, each reduced by a single block. When a partial result from each block is obtained, one final block reduces these to obtain the final result. sumNoncommSin...

cuda • Parallel reduction (e.g. how to sum an array)

Single-warp parallel reduction for commutative operator

Sometimes the reduction has to be performed on a very small scale, as a part of a bigger CUDA kernel. Suppose for example, that the input data has exactly 32 elements - the number of threads in a warp. In such scenario a single warp can be assigned to perform the reduction. Given that warp execut...

cuda • Parallel reduction (e.g. how to sum an array)

Single-warp parallel reduction for noncommutative operator

Sometimes the reduction has to be performed on a very small scale, as a part of a bigger CUDA kernel. Suppose for example, that the input data has exactly 32 elements - the number of threads in a warp. In such scenario a single warp can be assigned to perform the reduction. Given that warp execut...

cuda • Parallel reduction (e.g. how to sum an array)

Single-warp parallel reduction using registers only

Typically, reduction is performed on global or shared array. However, when the reduction is performed on a very small scale, as a part of a bigger CUDA kernel, it can be performed with a single warp. When that happens, on Keppler or higher architectures (CC>=3.0), it is possible to use warp-shu...

cuda • Parallel reduction (e.g. how to sum an array)