CUDA: Last-block guard


Example

Consider a grid working on some task, e.g. a parallel reduction. Initially, each block does its work independently, producing a partial result. At the end, however, the partial results need to be combined and merged together. A typical example is a reduction over a large array.

A typical approach is to invoke two kernels: one for the partial computation and one for merging. However, if the merging can be done efficiently by a single block, only one kernel call is required. This is achieved with a lastBlock guard, defined as follows:

Compute capability 2.0 and higher (uses __syncthreads_or):
__device__ bool lastBlock(int* counter) {
  __threadfence(); //ensure that partial result is visible by all blocks
  int last = 0;
  if (threadIdx.x == 0)
    last = atomicAdd(counter, 1);
  return __syncthreads_or(last == gridDim.x-1);
}
Compute capability 1.1 and higher (shared-memory variant):
__device__ bool lastBlock(int* counter) {
  __shared__ int last;
  __threadfence(); //ensure that partial result is visible by all blocks
  if (threadIdx.x == 0) {
    last = atomicAdd(counter, 1);
  }
  __syncthreads();
  return last == gridDim.x-1;
}

With such a guard, the last block is guaranteed to see all the results produced by all the other blocks and can perform the merging:

__device__ void computePartial(T* out) { ... }
__device__ void merge(T* partialResults, T* out) { ... }

__global__ void kernel(int* counter, T* partialResults, T* finalResult) {
    computePartial(&partialResults[blockIdx.x]);
    if (lastBlock(counter)) {
      //this is executed by all threads of the last block only
      merge(partialResults,finalResult);
    }
}

Assumptions:

  • The counter must be a global memory pointer, initialized to 0 before the kernel is invoked.
  • The lastBlock function is invoked uniformly by all threads in all blocks.
  • The kernel is invoked on a one-dimensional grid (for simplicity of the example).
  • T names any type you like, but the example is not intended to be a template in the C++ sense.
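The assumptions above imply some host-side setup, in particular zeroing the counter before launch. A minimal sketch of a host launcher is shown below; the grid/block sizes and the use of float for T are assumptions made for illustration, not part of the original example.

```cuda
// Host-side sketch: allocate and zero the counter, then launch the kernel.
// BLOCKS, THREADS and float-for-T are illustrative choices only.
int main() {
  const int BLOCKS = 256, THREADS = 128;
  int* counter;
  float *partialResults, *finalResult;
  cudaMalloc(&counter, sizeof(int));
  cudaMalloc(&partialResults, BLOCKS * sizeof(float));
  cudaMalloc(&finalResult, sizeof(float));
  cudaMemset(counter, 0, sizeof(int)); // counter must start at 0 (see assumptions)
  kernel<<<BLOCKS, THREADS>>>(counter, partialResults, finalResult);
  cudaDeviceSynchronize();
  // ... copy finalResult back with cudaMemcpy and free the buffers ...
  return 0;
}
```

Note that if the kernel is launched again with the same counter, it must be reset to 0 first, since the guard leaves it equal to gridDim.x after a run.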