cuda Tutorial => Global work queue

Example

Consider an array of work items. The time needed for an each work item to complete varies greatly. In order to balance the work distribution between blocks it may be prudent for each block to fetch the next item only when previous one is complete. This is in contrast to a-priori assigning items to blocks.

class WorkQueue {
private:
  WorkItem* gItems;
  size_t totalSize;
  size_t current;
public:
  __device__ WorkItem& fetch() {
    __shared__ WorkItem item;
    if (threadIdx.x == 0) {
      size_t itemIdx = atomicAdd(current,1);
      if (itemIdx<totalSize)
        item = gItems[itemIdx];
      else
        item = WorkItem::none();
    }
    __syncthreads();
    return item; //returning reference to smem - ok
  }
}

Assumptions:

The WorkQueue object, as well as gItem array reside in global memory
No new work items are added to the WorkQueue object in the kernel that is fetching from it
The WorkItem is a small representation of the work assignment, e.g. a pointer to another object
WorkItem::none() static member function creates a WorkItem object that represents no work at all
WorkQueue::fetch() must be called uniformly by all threads in the block
There are no 2 invocations of WorkQueue::fetch() without another __syncthreads() in between. Otherwise a race condition will appear!

The example does not include how the initialize the WorkQueue or populate it. It is done by another kernel or CPU code and should be pretty straight-forward.

PDF - Download cuda for free

Previous Next

cuda

Fastest Entity Framework Extensions

Example

Got any cuda Question?

cuda

cuda Inter-block communication Global work queue

Fastest Entity Framework Extensions

Example

Got any cuda Question?