This simple CUDA program demonstrates how to write a function that will execute on the GPU (aka "device"). The CPU, or "host", creates CUDA threads by calling special functions called "kernels". CUDA programs are C++ programs with additional syntax.
To see how it works, put the following code in a file named hello.cu:
#include <stdio.h>

// __global__ functions, or "kernels", execute on the device
__global__ void hello_kernel(void)
{
  printf("Hello, world from the device!\n");
}

int main(void)
{
  // greet from the host
  printf("Hello, world from the host!\n");

  // launch a kernel with a single thread to greet from the device
  hello_kernel<<<1,1>>>();

  // wait for the device to finish so that we see the message
  cudaDeviceSynchronize();

  return 0;
}
(Note that in order to use the printf function on the device, you need a device with a compute capability of at least 2.0. See the versions overview for details.)
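(If nvcc targets a lower architecture by default on your system, you can select one explicitly with the -arch flag; on older toolkits, sm_20 is the target corresponding to compute capability 2.0. The exact flag value depends on your GPU and CUDA version, so treat this as a sketch rather than a universal recipe:)

$ nvcc -arch=sm_20 hello.cu -o hello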
Now let's compile the program using the NVIDIA compiler and run it:
$ nvcc hello.cu -o hello
$ ./hello
Hello, world from the host!
Hello, world from the device!
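A kernel launch and cudaDeviceSynchronize can fail without printing anything, so in anything beyond a toy program it is worth checking their results. Here is a minimal sketch of one common error-checking pattern (the checks are an addition to the original example, not part of it):

#include <stdio.h>

__global__ void hello_kernel(void)
{
  printf("Hello, world from the device!\n");
}

int main(void)
{
  hello_kernel<<<1,1>>>();

  // cudaGetLastError reports errors from the launch itself,
  // e.g. an invalid execution configuration
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
  {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // cudaDeviceSynchronize reports errors that occur while the kernel runs
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
  {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  return 0;
}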
Some additional information about the above example:
nvcc stands for "NVIDIA CUDA Compiler". It separates source code into host and device components.

__global__ is a CUDA keyword used in function declarations to indicate that the function runs on the GPU device and is called from the host.

The triple angle brackets (<<< and >>>) mark a call from host code to device code, also called a "kernel launch". The two numbers inside them form the execution configuration: the first is the number of thread blocks to launch, and the second is the number of threads per block.
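To see the execution configuration in action, here is a small variation on hello.cu (the kernel name hello_threads_kernel is just for illustration) that launches 2 blocks of 4 threads each; every thread identifies itself via the built-in blockIdx and threadIdx variables:

#include <stdio.h>

// each launched thread runs this kernel and prints its own coordinates
__global__ void hello_threads_kernel(void)
{
  printf("Hello from block %d, thread %d!\n",
         (int)blockIdx.x, (int)threadIdx.x);
}

int main(void)
{
  // launch 2 blocks of 4 threads each -- 8 threads in total
  hello_threads_kernel<<<2,4>>>();

  // wait for the device so that all 8 messages appear
  cudaDeviceSynchronize();
  return 0;
}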