This example illustrates how to create a simple program that will sum two int
arrays with CUDA.
A CUDA program is heterogeneous: it consists of parts that run on the CPU (the host) and parts that run on the GPU (the device).
The main steps of a program that uses CUDA are similar to those of a CPU program: allocate device memory, copy the input data from host memory to device memory, invoke the kernel to process the data, copy the result back to host memory, and free the device memory.
To allocate device memory we use the cudaMalloc function. To copy data between the device and the host, the cudaMemcpy function can be used (a short sketch follows the list of copy kinds below).
The last argument of cudaMemcpy specifies the direction of the copy operation. There are five possible kinds:

cudaMemcpyHostToHost - Host -> Host
cudaMemcpyHostToDevice - Host -> Device
cudaMemcpyDeviceToHost - Device -> Host
cudaMemcpyDeviceToDevice - Device -> Device
cudaMemcpyDefault - Default, inferred from the pointer values (requires a unified virtual address space)
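As a side note, a minimal sketch of the allocate/copy/free pattern with basic status checking looks like the following; the names host_data and dev_data are ours, and the full example below omits this error handling for brevity.

cudaError_t err;
int host_data[4] = { 1, 2, 3, 4 };
int* dev_data = nullptr;

// Allocate device memory and check the returned status code.
err = cudaMalloc((void**)&dev_data, sizeof(host_data));
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}

// Copy the host array to the device; the last argument selects the direction.
err = cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}

// Release the device allocation.
cudaFree(dev_data);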
Next, the kernel function is invoked. The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel.
The first number (2 in the example) specifies the number of blocks, and the second ((size + 1) / 2 in the example) the number of threads in a block. Note that we add 1 to size so that the integer division rounds up: when size is odd we request one extra thread rather than having one thread responsible for two elements.
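As an aside, the same rounding-up idea is more commonly written for an arbitrary block size, computing the number of blocks from it rather than fixing the block count at 2. This is only a sketch using the names from the example below, not what the example itself does.

// Hypothetical block size; any value up to the device's per-block limit works.
int threadsPerBlock = 256;
// Round up so that blocks * threadsPerBlock >= size.
int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
addKernel<<<blocks, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);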
Since a kernel launch is asynchronous, cudaDeviceSynchronize is called to wait until execution has completed.
The result array is then copied back to host memory, and all memory allocated on the device is freed with cudaFree.
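A common defensive pattern, which the example below leaves out for brevity, is to check cudaGetLastError right after the launch and the status returned by cudaDeviceSynchronize afterwards; roughly:

addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size);

// Errors in the launch configuration itself are reported immediately.
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess) {
    printf("kernel launch failed: %s\n", cudaGetErrorString(launchErr));
}

// Errors that occur while the kernel runs surface when we synchronize.
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess) {
    printf("kernel execution failed: %s\n", cudaGetErrorString(syncErr));
}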
To define a function as a kernel, the __global__ declaration specifier is used. This function will be invoked by each thread.
If we want each thread to process an element of the resultant array, then we need a means of distinguishing and identifying each thread.
CUDA defines the variables blockDim, blockIdx, and threadIdx. The predefined variable blockDim contains the dimensions of each thread block as specified in the second execution configuration parameter of the kernel launch. The predefined variables threadIdx and blockIdx contain the index of the thread within its thread block and the index of the thread block within the grid, respectively. For example, with 2 blocks of 3 threads each, the thread with blockIdx.x = 1 and threadIdx.x = 0 computes the global index 1 * 3 + 0 = 3. Note that since we potentially request one more thread than there are elements in the arrays, we need to pass in size to ensure we don't access past the end of the array.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void addKernel(int* c, const int* a, const int* b, int size) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < size) {
c[i] = a[i] + b[i];
}
}
// Helper function for using CUDA to add vectors in parallel.
void addWithCuda(int* c, const int* a, const int* b, int size) {
int* dev_a = nullptr;
int* dev_b = nullptr;
int* dev_c = nullptr;
// Allocate GPU buffers for three vectors (two input, one output)
cudaMalloc((void**)&dev_c, size * sizeof(int));
cudaMalloc((void**)&dev_a, size * sizeof(int));
cudaMalloc((void**)&dev_b, size * sizeof(int));
// Copy input vectors from host memory to GPU buffers.
cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
// Launch a kernel on the GPU with one thread for each element.
// 2 is number of computational blocks and (size + 1) / 2 is a number of threads in a block
addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size);
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaDeviceSynchronize();
// Copy output vector from GPU buffer to host memory.
cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
}
int main(int argc, char** argv) {
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };
addWithCuda(c, a, b, arraySize);
printf("{1, 2, 3, 4, 5} + {10, 20, 30, 40, 50} = {%d, %d, %d, %d, %d}\n", c[0], c[1], c[2], c[3], c[4]);
cudaDeviceReset();
return 0;
}
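Assuming the source is saved as add.cu (the file name is arbitrary), it can be compiled with NVIDIA's nvcc compiler and executed as shown below; the last line is the expected output given the input arrays above.

nvcc add.cu -o add
./add
{1, 2, 3, 4, 5} + {10, 20, 30, 40, 50} = {11, 22, 33, 44, 55}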