Now we come to the real work: executing your kernels on the parallel device. Please read about the hardware basics first to fully understand kernel dispatching.
Before you can call a kernel, you need to set its arguments. Argument indices start at 0 and follow the order of the parameters in the kernel signature. Each argument is set via
err = Cl.SetKernelArg(_kernel, $argumentIndex, $argument);
If you don't set every argument before launching the kernel, the launch will fail: the enqueue call returns the OpenCL error CL_INVALID_KERNEL_ARGS.
Before we actually launch our kernel, we need to calculate the 'global work size' and the 'local work size'.
The global work size is the total number of threads (work-items in OpenCL terms) that will be launched on your GPU. The local work size is the number of threads inside each thread block (work-group). The local work size can be omitted by passing null if the kernel has no special requirements; the OpenCL implementation then chooses one for you. If the local work size is given, the global work size has to be a multiple of it in every dimension.
The work sizes can be one-dimensional, two-dimensional or three-dimensional. The choice of how many dimensions to use is entirely up to you, so pick whatever suits your algorithm best.
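Since the problem size is rarely an exact multiple of the local work size, a common approach is to round the global work size up to the next multiple. The following is a minimal sketch of that arithmetic; the class name WorkSizes and the method RoundUp are hypothetical helpers, not part of OpenCL.Net.

```csharp
using System;

// Hypothetical helper, not part of OpenCL.Net.
static class WorkSizes
{
    // Rounds globalSize up to the nearest multiple of localSize,
    // so that the "multiple of the local work size" rule is satisfied.
    public static long RoundUp(long globalSize, long localSize)
    {
        if (localSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(localSize));
        }

        long remainder = globalSize % localSize;
        return remainder == 0
            ? globalSize
            : globalSize + (localSize - remainder);
    }
}
```

For example, WorkSizes.RoundUp(1000, 64) yields 1024. Note that the padded launch contains more threads than there are data elements, so the kernel should compare its global id against the real element count and return early for the surplus threads.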
Now that we have decided on our work sizes, we can call the kernel.
Event clevent;
err = Cl.EnqueueNDRangeKernel(_queue, _kernel, $dimensions, null, $globalWorkSize, $localWorkSize, 0, null, out clevent);
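$dimensions is the number of dimensions we chose. $globalWorkSize is an array of length $dimensions holding the global work size for each dimension, and $localWorkSize is the same for the local work size. The null passed as the fourth argument is the global work offset, which we do not use here. The last argument returns an event object that represents the enqueued kernel execution; you can use it to wait for the kernel to finish.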
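To make the relationship between the three work size parameters concrete, here is a small sketch that builds the size arrays and checks them against the rules described above before enqueueing. It assumes the binding takes the work sizes as IntPtr[] arrays; the names NdRange, ToWorkSize and IsValidRange are hypothetical and not part of OpenCL.Net.

```csharp
using System;

// Hypothetical helpers, not part of OpenCL.Net.
static class NdRange
{
    // Converts per-dimension sizes into an IntPtr[]
    // (assumed to be the array type the enqueue call expects).
    public static IntPtr[] ToWorkSize(params long[] sizes)
    {
        var result = new IntPtr[sizes.Length];
        for (int i = 0; i < sizes.Length; i++)
        {
            result[i] = new IntPtr(sizes[i]);
        }
        return result;
    }

    // Checks the rules from the text: 1 to 3 dimensions, one entry per
    // dimension, and the global size a multiple of the local size.
    // Passing null for localSize means "let the implementation choose".
    public static bool IsValidRange(uint dimensions, long[] globalSize, long[] localSize)
    {
        if (dimensions < 1 || dimensions > 3) return false;
        if (globalSize == null || globalSize.Length != dimensions) return false;
        if (localSize == null) return true;
        if (localSize.Length != dimensions) return false;

        for (int i = 0; i < globalSize.Length; i++)
        {
            if (globalSize[i] <= 0 || localSize[i] <= 0) return false;
            if (globalSize[i] % localSize[i] != 0) return false;
        }
        return true;
    }
}
```

For a 1920 x 1080 image with a 16 x 16 local work size, IsValidRange rejects a global size of { 1920, 1080 } because 1080 is not a multiple of 16, but accepts the padded size { 1920, 1088 }. The arrays returned by ToWorkSize(1920, 1088) and ToWorkSize(16, 16) could then be passed as $globalWorkSize and $localWorkSize with $dimensions set to 2.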