Tutorial by Examples

The simplest approach to parallel reduction in CUDA is to assign a single block to perform the whole task: each thread accumulates a private sum over a strided slice of the input, and the per-thread sums are then combined in shared memory, as in the sketch below.
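A minimal sketch of such a kernel (the names sumCommSingleBlock, arraySize and blockSize follow the original example; the tree-reduction phase assumes blockSize is a power of two):

    static const int arraySize = 10000;
    static const int blockSize = 1024;

    __global__ void sumCommSingleBlock(const int *a, int *out) {
        int idx = threadIdx.x;

        // Each thread accumulates a private sum over a strided slice of the input.
        int sum = 0;
        for (int i = idx; i < arraySize; i += blockSize)
            sum += a[i];

        // Combine the per-thread sums with a tree reduction in shared memory.
        __shared__ int r[blockSize];
        r[idx] = sum;
        __syncthreads();
        for (int size = blockSize / 2; size > 0; size /= 2) {
            if (idx < size)
                r[idx] += r[idx + size];
            __syncthreads();
        }
        if (idx == 0)
            *out = r[0];
    }

The kernel is launched with a single block, e.g. sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out). Note that the strided loop visits the elements out of their original order, which is exactly why this version requires the operator to be commutative.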
Doing parallel reduction for a non-commutative operator is a bit more involved than the commutative version. In the example we still use addition over integers for simplicity's sake; it could be replaced, for example, with matrix multiplication, which really is non-commutative. Note that when ...
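A minimal sketch of an order-preserving single-block reduction, assuming for simplicity that the whole input fits into one block (n <= blockSize); 0 stands in for the neutral element of the operator:

    static const int blockSize = 1024;

    __global__ void sumNoncommSingleBlock(const int *a, int *out, int n) {
        int idx = threadIdx.x;
        __shared__ int r[blockSize];
        r[idx] = idx < n ? a[idx] : 0;  // pad on the right with the neutral element
        __syncthreads();

        // Ordered tree reduction: element i is always the left operand and
        // element i + stride the right one, so operand order is never swapped.
        for (int stride = 1; stride < blockSize; stride *= 2) {
            int i = idx * 2 * stride;
            if (i + stride < blockSize)
                r[i] = r[i] + r[i + stride];
            __syncthreads();
        }
        if (idx == 0)
            *out = r[0];
    }

Unlike the commutative kernel, the data is loaded contiguously (one element per thread) rather than through a strided loop, because a strided pass would already reorder the operands.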
The multi-block approach to parallel reduction in CUDA poses an additional challenge compared to the single-block approach, because blocks are limited in how they can communicate. The idea is to let each block compute a part of the input array, and then have one final block merge all the partial results. To do that ...
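One way to realize this, sketched below, is to launch the kernel twice: once with many blocks, each writing its partial sum to a scratch array, and once with a single block that reduces the partials. Here gridSize is a tuning parameter; a small multiple of the number of SMs is a reasonable choice:

    static const int blockSize = 1024;
    static const int gridSize = 24;  // tuning parameter, e.g. a small multiple of the SM count

    __global__ void sumCommMultiBlock(const int *gArr, int arraySize, int *gOut) {
        int thIdx = threadIdx.x;
        int gthIdx = thIdx + blockIdx.x * blockSize;
        const int stride = blockSize * gridDim.x;

        // Grid-stride accumulation: every thread sums its slice of the array.
        int sum = 0;
        for (int i = gthIdx; i < arraySize; i += stride)
            sum += gArr[i];

        // Per-block tree reduction in shared memory.
        __shared__ int shArr[blockSize];
        shArr[thIdx] = sum;
        __syncthreads();
        for (int size = blockSize / 2; size > 0; size /= 2) {
            if (thIdx < size)
                shArr[thIdx] += shArr[thIdx + size];
            __syncthreads();
        }
        if (thIdx == 0)
            gOut[blockIdx.x] = shArr[0];  // one partial result per block
    }

On the host side the same kernel serves both passes (dev_arr, dev_partial and dev_out are assumed to be device allocations); kernels launched on the same stream execute in order, so no explicit synchronization is needed between the two launches:

    sumCommMultiBlock<<<gridSize, blockSize>>>(dev_arr, arraySize, dev_partial);
    sumCommMultiBlock<<<1, blockSize>>>(dev_partial, gridSize, dev_out);

An alternative that avoids the second launch is to let the last block that finishes perform the merge, which requires a global atomic counter and a __threadfence().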
The multi-block approach to non-commutative parallel reduction is very similar to the single-block approach. The global input array must be split into contiguous sections, each reduced by a single block; when a partial result from each block is obtained, one final block reduces these to obtain the final result. sumNoncommSingleBlock ...
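A minimal sketch under the assumption that each block takes one contiguous section, folds it left to right, and stores its partial in block order, so that a final one-block pass can merge the partials in order (the kernel name and the use of 0 as the neutral element are illustrative):

    static const int blockSize = 1024;
    static const int gridSize = 24;

    __global__ void sumNoncommMultiBlock(const int *gArr, int arraySize, int *gOut) {
        int thIdx = threadIdx.x;

        // Each block owns one contiguous section of the input.
        int sectionSize = (arraySize + gridDim.x - 1) / gridDim.x;
        int begin = blockIdx.x * sectionSize;
        int end = min(begin + sectionSize, arraySize);

        // Each thread folds one contiguous sub-range of the section, left to
        // right, so thread t's partial precedes thread t+1's in operand order.
        int chunk = (sectionSize + blockSize - 1) / blockSize;
        int lo = begin + thIdx * chunk;
        int hi = min(lo + chunk, end);
        int acc = 0;  // neutral element
        for (int i = lo; i < hi; ++i)
            acc = acc + gArr[i];

        // Ordered tree reduction, left operand always at the lower index.
        __shared__ int shArr[blockSize];
        shArr[thIdx] = acc;
        __syncthreads();
        for (int stride = 1; stride < blockSize; stride *= 2) {
            int i = thIdx * 2 * stride;
            if (i + stride < blockSize)
                shArr[i] = shArr[i] + shArr[i + stride];
            __syncthreads();
        }
        if (thIdx == 0)
            gOut[blockIdx.x] = shArr[0];  // partials land in block order
    }

Because the partials are stored in block order, the final merge is simply the same kernel launched with a single block over the gridSize partial results.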
Sometimes the reduction has to be performed on a very small scale, as part of a bigger CUDA kernel. Suppose, for example, that the input data has exactly 32 elements - the number of threads in a warp. In such a scenario a single warp can be assigned to perform the reduction. Given that a warp executes in perfect sync, many of the explicit synchronization points needed in a block-wide reduction can be dropped, as in the sketch below.
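A minimal sketch of a warp-level reduction over a 32-element shared array; the volatile qualifier keeps the compiler from caching the values in registers, and the __syncwarp() calls are needed for correctness on Volta and newer architectures with independent thread scheduling:

    __device__ int sumSingleWarp(volatile int *shArr) {
        int lane = threadIdx.x % warpSize;  // lane index within the warp

        // Halve the active range each step; no __syncthreads() is required
        // because all participating threads belong to the same warp.
        for (int offset = 16; offset > 0; offset /= 2) {
            if (lane < offset)
                shArr[lane] += shArr[lane + offset];
            __syncwarp();
        }
        return shArr[0];  // every lane returns the final sum
    }

Note that this variant pairs element lane with element lane + offset, which is fine for a commutative operator; a non-commutative one needs the order-preserving pairing shown in the block-level kernel above.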
Typically, reduction is performed on a global or shared array. However, when the reduction is performed on a very small scale, as part of a bigger CUDA kernel, it can be performed with a single warp. When that happens, on Kepler or higher architectures (CC >= 3.0), it is possible to use warp-shuffle instructions ...
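A minimal sketch using warp-shuffle, where each thread holds its own element in a register and no shared memory is needed at all; __shfl_down_sync is the CUDA 9+ form of the intrinsic (older code used the now-deprecated __shfl_down):

    __device__ int sumWarpShuffle(int val) {
        // Each step adds the value held by the lane `offset` positions higher;
        // after log2(32) = 5 steps, lane 0 holds the sum of all 32 values.
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;  // only lane 0 holds the complete sum
    }

Besides avoiding shared memory, the shuffle version needs no separate synchronization, since the _sync variants of the intrinsics already synchronize the lanes named in the mask.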
