Consider an array of work items where the time needed for each item to complete varies greatly. To balance the load across blocks, it may be prudent for each block to fetch the next item only once the previous one is complete, rather than assigning items to blocks a priori.
class WorkQueue {
private:
    WorkItem* gItems;            // global-memory array of work items
    size_t totalSize;
    unsigned long long current;  // next unclaimed index; unsigned long long
                                 // rather than size_t, so atomicAdd accepts it
public:
    __device__ WorkItem& fetch() {
        __shared__ WorkItem item;
        if (threadIdx.x == 0) {
            // One thread claims the next item on behalf of the whole block.
            // atomicAdd takes a pointer; the WorkQueue object itself must
            // live in global memory so that `current` is shared.
            unsigned long long itemIdx = atomicAdd(&current, 1ULL);
            if (itemIdx < totalSize)
                item = gItems[itemIdx];
            else
                item = WorkItem::none();
        }
        __syncthreads();  // make the fetched item visible to all threads
        return item;      // returning a reference to shared memory is fine here
    }
};
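
A typical consumer of such a queue is a kernel in which every block loops, fetching items until the queue is exhausted. The following is a minimal sketch; the names `consumeQueue`, `processItem`, and `WorkItem::isNone()` are illustrative assumptions, not part of the original example.

```cuda
// Hypothetical consumer kernel; assumes the queue lives in global memory
// and WorkItem provides an isNone() predicate matching WorkItem::none().
__global__ void consumeQueue(WorkQueue* queue) {
    while (true) {
        WorkItem& item = queue->fetch();  // uniform call: every thread reaches it
        if (item.isNone())                // all threads see the same item
            break;
        processItem(item);                // block-wide cooperative processing
        __syncthreads();                  // required before the next fetch():
                                          // thread 0 must not overwrite the shared
                                          // item while other threads still read it
    }
}
```

Note that the trailing __syncthreads() in the loop body is exactly the extra synchronization the assumptions below call for between consecutive fetch() calls.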
Assumptions:
- WorkItem is a small representation of the work assignment, e.g. a pointer to another object.
- WorkItem::none() is a static member function that creates a WorkItem object representing no work at all.
- WorkQueue::fetch() must be called uniformly by all threads in the block.
- WorkQueue::fetch() must not be called again without another __syncthreads() in between; otherwise a race condition will appear.
- The example does not show how to initialize the WorkQueue or populate it. That is done by another kernel or by CPU code and should be fairly straightforward.
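
For completeness, host-side setup could look roughly like the sketch below. It assumes the queue's members are accessible from host code (or that equivalent setters exist) and that a consumer kernel taking a WorkQueue* is available; all names beyond those in the example above are illustrative.

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side initialization; `launchWork` and `consumeQueue`
// are assumed names, not part of the original example.
void launchWork(const WorkItem* hostItems, size_t n, int numBlocks, int blockSize) {
    // Copy the work items to device global memory.
    WorkItem* dItems;
    cudaMalloc(&dItems, n * sizeof(WorkItem));
    cudaMemcpy(dItems, hostItems, n * sizeof(WorkItem), cudaMemcpyHostToDevice);

    // Fill in the queue on the host, then copy it to global memory:
    // the queue must reside there so atomicAdd on `current` is shared.
    WorkQueue hostQueue;
    hostQueue.gItems = dItems;   // assumes members are accessible from the host
    hostQueue.totalSize = n;
    hostQueue.current = 0;

    WorkQueue* dQueue;
    cudaMalloc(&dQueue, sizeof(WorkQueue));
    cudaMemcpy(dQueue, &hostQueue, sizeof(WorkQueue), cudaMemcpyHostToDevice);

    consumeQueue<<<numBlocks, blockSize>>>(dQueue);
    cudaDeviceSynchronize();

    cudaFree(dQueue);
    cudaFree(dItems);
}
```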