cuda Getting started with cuda Sum two arrays with CUDA


This example illustrates how to create a simple program that will sum two int arrays with CUDA.

A CUDA program is heterogenous and consist of parts runs both on CPU and GPU.

The main parts of a program that utilize CUDA are similar to CPU programs and consist of

  • Memory allocation for data that will be used on GPU
  • Data copying from host memory to GPUs memory
  • Invoking kernel function to process data
  • Copy result to CPUs memory

To allocate devices memory we use cudaMalloc function. To copy data between device and host cudaMemcpy function can be used. The last argument of cudaMemcpy specifies the direction of copy operation. There are 5 possible types:

  • cudaMemcpyHostToHost - Host -> Host
  • cudaMemcpyHostToDevice - Host -> Device
  • cudaMemcpyDeviceToHost - Device -> Host
  • cudaMemcpyDeviceToDevice - Device -> Device
  • cudaMemcpyDefault - Default based unified virtual address space

Next the kernel function is invoked. The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. The first number (2 in example) specifies number of blocks and second ((size + 1) / 2 in example) - number of threads in a block. Note that in this example we add 1 to the size, so that we request one extra thread rather than having one thread responsible for two elements.

Since kernel invocation is an asynchronous function cudaDeviceSynchronize is called to wait until execution is completed. Result arrays is copied to the host memory and all memory allocated on the device is freed with cudaFree.

To define function as kernel __global__ declaration specifier is used. This function will be invoked by each thread. If we want each thread to process an element of the resultant array, then we need a means of distinguishing and identifying each thread. CUDA defines the variables blockDim, blockIdx, and threadIdx. The predefined variable blockDim contains the dimensions of each thread block as specified in the second execution configuration parameter for the kernel launch. The predefined variables threadIdx and blockIdx contain the index of the thread within its thread block and the thread block within the grid, respectively. Note that since we potentially request one more thread than elements in the arrays, we need to pass in size to ensure we don't access past the end of the array.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];

// Helper function for using CUDA to add vectors in parallel.
void addWithCuda(int* c, const int* a, const int* b, int size) {
    int* dev_a = nullptr;
    int* dev_b = nullptr;
    int* dev_c = nullptr;

    // Allocate GPU buffers for three vectors (two input, one output)
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    // 2 is number of computational blocks and (size + 1) / 2 is a number of threads in a block
    addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size);
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.

    // Copy output vector from GPU buffer to host memory.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);


int main(int argc, char** argv) {
    const int arraySize = 5;
    const int a[arraySize] = {  1,  2,  3,  4,  5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    addWithCuda(c, a, b, arraySize);

    printf("{1, 2, 3, 4, 5} + {10, 20, 30, 40, 50} = {%d, %d, %d, %d, %d}\n", c[0], c[1], c[2], c[3], c[4]);


    return 0;