cuda Global work queue


Consider an array of work items. The time needed for an each work item to complete varies greatly. In order to balance the work distribution between blocks it may be prudent for each block to fetch the next item only when previous one is complete. This is in contrast to a-priori assigning items to blocks.

class WorkQueue {
  WorkItem* gItems;
  size_t totalSize;
  size_t current;
  __device__ WorkItem& fetch() {
    __shared__ WorkItem item;
    if (threadIdx.x == 0) {
      size_t itemIdx = atomicAdd(current,1);
      if (itemIdx<totalSize)
        item = gItems[itemIdx];
        item = WorkItem::none();
    return item; //returning reference to smem - ok


  • The WorkQueue object, as well as gItem array reside in global memory
  • No new work items are added to the WorkQueue object in the kernel that is fetching from it
  • The WorkItem is a small representation of the work assignment, e.g. a pointer to another object
  • WorkItem::none() static member function creates a WorkItem object that represents no work at all
  • WorkQueue::fetch() must be called uniformly by all threads in the block
  • There are no 2 invocations of WorkQueue::fetch() without another __syncthreads() in between. Otherwise a race condition will appear!

The example does not include how the initialize the WorkQueue or populate it. It is done by another kernel or CPU code and should be pretty straight-forward.