The simplest approach to parallel reduction in CUDA is to assign a single block to perform the task:
static const int arraySize = 10000;
static const int blockSize = 1024;

__global__ void sumCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    // Phase 1: each thread independently accumulates a partial sum
    // over a strided slice of the input.
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize)
        sum += a[i];
    __shared__ int r[blockSize];
    r[idx] = sum;
    __syncthreads();
    // Phase 2: tree reduction in shared memory (blockSize must be a power of two).
    for (int size = blockSize / 2; size > 0; size /= 2) { //uniform
        if (idx < size)
            r[idx] += r[idx + size];
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}
...
sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);
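For context, a minimal host-side driver might look like the sketch below. The buffer names dev_a and dev_out match the launch above, but the allocation and copy code around them is an assumption, since the original snippet elides it:

int a[arraySize];
for (int i = 0; i < arraySize; i++)
    a[i] = 1;                                  // example input: the sum should equal arraySize

int *dev_a, *dev_out;
cudaMalloc(&dev_a, arraySize * sizeof(int));   // hypothetical setup, not from the original
cudaMalloc(&dev_out, sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);

sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

int out;
cudaMemcpy(&out, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_out);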
This approach is most feasible when the data size is not very large (around a few thousand elements), which typically happens when the reduction is part of a bigger CUDA program.
If the input size matches blockSize from the very beginning, the first for loop can be removed entirely (see the sketch below). Note that in the first phase, when there are more elements than threads, the threads add things up completely independently; only once the problem is reduced to blockSize partial sums does the actual parallel reduction begin.
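As a sketch of that simplification, assuming the input holds exactly blockSize elements (the kernel name is not from the original text):

__global__ void sumCommExactBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    __shared__ int r[blockSize];
    r[idx] = a[idx];                 // one element per thread, no accumulation loop
    __syncthreads();
    for (int size = blockSize / 2; size > 0; size /= 2) {
        if (idx < size)
            r[idx] += r[idx + size];
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}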
The same code can be applied to any other commutative and associative operator, such as multiplication, minimum, or maximum.
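For instance, a minimum reduction only needs the accumulator, its identity element, and the combining step changed; below is a sketch under that assumption (the kernel name minCommSingleBlock is mine, and INT_MAX comes from <climits>):

#include <climits>

__global__ void minCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x;
    int m = INT_MAX;                 // identity element for min
    for (int i = idx; i < arraySize; i += blockSize)
        m = min(m, a[i]);
    __shared__ int r[blockSize];
    r[idx] = m;
    __syncthreads();
    for (int size = blockSize / 2; size > 0; size /= 2) {
        if (idx < size)
            r[idx] = min(r[idx], r[idx + size]);
        __syncthreads();
    }
    if (idx == 0)
        *out = r[0];
}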
Note that the algorithm can be made faster, for example by using a warp-level parallel reduction.
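To illustrate that idea, once the shared-memory reduction is down to 32 elements, the remaining steps can be done with warp shuffles and no __syncthreads(); the sketch below assumes CUDA 9 or newer (for __shfl_down_sync) and a warp size of 32, and the helper name is not from the original text:

__device__ int warpReduceSum(int val) {
    // Each step folds the upper half of the active lanes into the lower half;
    // after five steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

Lane 0 of each warp would then write its partial result to shared memory, where a single final warp combines the per-warp sums the same way.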