URL: https://www.geeksforgeeks.org/cpp/launching-a-kernel-in-cuda/

Launching a Kernel | CUDA

Last Updated : 26 Feb, 2026

In CUDA, a kernel launch is the process of starting parallel execution of a kernel function on the GPU from the Host (CPU). This is done using the Execution Configuration syntax <<< ... >>>, which specifies how many blocks and threads will execute the kernel on the Device (GPU).

Syntax

KernelName<<<blocksPerGrid, threadsPerBlock>>>(arguments);

  • KernelName: The name of the __global__ kernel function to execute on the GPU.
  • blocksPerGrid: The number of blocks in the grid, chosen so that the grid covers all N operations.
  • threadsPerBlock: The number of threads in each block. This is chosen by the programmer, typically as a multiple of the warp size (32), such as 64, 128 or 256, and is limited to 1024 threads per block on current NVIDIA GPUs.
  • arguments: The device pointers or values passed to the kernel.
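
To connect the syntax to a complete program, here is a minimal sketch. The kernel name fillValue, the array size and the fill value are illustrative assumptions, not part of the original article. N is deliberately chosen as an exact multiple of threadsPerBlock so that no bounds check is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1024;                               // divides evenly by 256
constexpr int threadsPerBlock = 256;                  // a multiple of 32
constexpr int blocksPerGrid = N / threadsPerBlock;    // 4 blocks

// Hypothetical kernel: each thread writes one element of the array.
__global__ void fillValue(float *data, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    data[i] = value;                                  // safe only because the grid has exactly N threads
}

int main()
{
    float *dData = nullptr;
    cudaMalloc(&dData, N * sizeof(float));

    // KernelName<<<blocksPerGrid, threadsPerBlock>>>(arguments);
    fillValue<<<blocksPerGrid, threadsPerBlock>>>(dData, 1.5f);
    cudaDeviceSynchronize();                          // wait for the kernel to finish

    // Copy two elements back to the host to confirm the kernel ran.
    float first = 0.0f, last = 0.0f;
    cudaMemcpy(&first, dData, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last, dData + N - 1, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %.1f, data[%d] = %.1f\n", first, N - 1, last);

    cudaFree(dData);
    return 0;
}
```

A file like this would typically be compiled with nvcc, for example nvcc fill.cu -o fill, where the file name is arbitrary.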

Mathematical Relation for Grid Sizing

When you have N total operations to perform, the grid (the entire collection of threads) must be large enough to provide at least N threads. Because threads are launched in fixed-size blocks, the number of blocks is calculated with ceiling division:

    B = \lceil N / T \rceil

In integer arithmetic this is usually written as B = (N + T - 1) / T, where:

  • B: Number of blocks required
  • N: Total number of elements
  • T: Threads per block

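This formula ensures that if N is not perfectly divisible by T, one extra block is included to handle the "remainder" elements. The grid can then contain more than N threads, so the kernel should check that its global index is less than N before reading or writing memory.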
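
The ceiling-division rule can be sketched as follows. The helper blocksFor and the kernel doubleElements are hypothetical names introduced for illustration. With N = 1000 and T = 256, the helper gives 4 blocks (1024 threads), so 24 threads in the last block fall outside the array and must be skipped by a bounds check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ceiling division: the smallest B such that B * t >= n (assumes n >= 0 and t > 0).
inline int blocksFor(int n, int t)
{
    return (n + t - 1) / t;
}

// Hypothetical kernel: doubles each of the n elements.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // threads past the end of the array do nothing
        data[i] *= 2.0f;
    }
}

int main()
{
    const int n = 1000;                                       // not a multiple of 256
    const int threadsPerBlock = 256;
    const int blocksPerGrid = blocksFor(n, threadsPerBlock);  // (1000 + 255) / 256 = 4

    const int launched = blocksPerGrid * threadsPerBlock;     // 1024 threads in total
    printf("N = %d, T = %d, B = %d, threads launched = %d, idle = %d\n",
           n, threadsPerBlock, blocksPerGrid, launched, launched - n);

    float *dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));

    doubleElements<<<blocksPerGrid, threadsPerBlock>>>(dData, n);
    cudaDeviceSynchronize();

    cudaFree(dData);
    return 0;
}
```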

Multidimensional Thread Organization

CUDA provides the dim3 type to organize threads and blocks in 2D or 3D. This is useful for tasks like image processing, matrices and volumetric data, where data exists in multiple dimensions.

Explanation:

  • dim3 threadsPerBlock(16, 16); creates 16 × 16 threads per block (total 256 threads).
  • dim3 numBlocks(32, 32); creates 32 × 32 blocks in the grid.
  • imageKernel<<<numBlocks, threadsPerBlock>>>(imageData); launches the kernel using this 2D parallel configuration.
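  • Each thread can find its position using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y. Its global column is blockIdx.x * blockDim.x + threadIdx.x and its global row is blockIdx.y * blockDim.y + threadIdx.y.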
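
The configuration described in the points above can be sketched as a full program. The names imageKernel, imageData, threadsPerBlock and numBlocks come from the explanation. The kernel body (inverting an 8-bit grayscale image), the 512 × 512 image size and the WIDTH and HEIGHT constants are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WIDTH = 512;    // 32 blocks * 16 threads in x
constexpr int HEIGHT = 512;   // 32 blocks * 16 threads in y

// Hypothetical body: inverts an 8-bit grayscale image in place.
__global__ void imageKernel(unsigned char *imageData)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    if (x < WIDTH && y < HEIGHT) {
        int idx = y * WIDTH + x;                     // row-major offset
        imageData[idx] = 255 - imageData[idx];
    }
}

int main()
{
    unsigned char *imageData = nullptr;
    cudaMalloc(&imageData, WIDTH * HEIGHT);
    cudaMemset(imageData, 0, WIDTH * HEIGHT);        // start from an all-black image

    dim3 threadsPerBlock(16, 16);                    // 256 threads per block
    dim3 numBlocks(32, 32);                          // 1024 blocks in the grid
    imageKernel<<<numBlocks, threadsPerBlock>>>(imageData);
    cudaDeviceSynchronize();

    // Copy one pixel back to the host to inspect the result.
    unsigned char firstPixel = 0;
    cudaMemcpy(&firstPixel, imageData, 1, cudaMemcpyDeviceToHost);
    printf("First pixel after inversion: %d\n", firstPixel);

    cudaFree(imageData);
    return 0;
}
```

For image sizes that are not multiples of 16, the block counts would be computed per dimension with the same ceiling division used for 1D grids.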

Example: This example shows a kernel being launched with 4 blocks, where each block contains 8 threads.
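
The listing for this example is not included here. A minimal reconstruction consistent with the output and explanation in this section might look like the following. The kernel name checkIndex and the <<<4, 8>>> configuration are taken from the explanation; the rest is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints the index of its block and its own index within that block.
__global__ void checkIndex()
{
    printf("Block ID: %u, Thread ID: %u\n", blockIdx.x, threadIdx.x);
}

int main()
{
    printf("Launching 4 blocks with 8 threads each...\n");

    checkIndex<<<4, 8>>>();       // 4 blocks of 8 threads
    cudaDeviceSynchronize();      // wait so the device-side printf output is flushed

    return 0;
}
```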

Output

Launching 4 blocks with 8 threads each...
Block ID: 1, Thread ID: 0
Block ID: 1, Thread ID: 1
Block ID: 1, Thread ID: 2
Block ID: 1, Thread ID: 3
Block ID: 1, Thread ID: 4
Block ID: 1, Thread ID: 5
Block ID: 1, Thread ID: 6
Block ID: 1, Thread ID: 7
Block ID: 0, Thread ID: 0
Block ID: 0, Thread ID: 1
Block ID: 0, Thread ID: 2
Block ID: 0, Thread ID: 3
Block ID: 0, Thread ID: 4
Block ID: 0, Thread ID: 5
Block ID: 0, Thread ID: 6
Block ID: 0, Thread ID: 7
Block ID: 3, Thread ID: 0
Block ID: 3, Thread ID: 1
Block ID: 3, Thread ID: 2
Block ID: 3, Thread ID: 3
Block ID: 3, Thread ID: 4
Block ID: 3, Thread ID: 5
Block ID: 3, Thread ID: 6
Block ID: 3, Thread ID: 7
Block ID: 2, Thread ID: 0
Block ID: 2, Thread ID: 1
Block ID: 2, Thread ID: 2
Block ID: 2, Thread ID: 3
Block ID: 2, Thread ID: 4
Block ID: 2, Thread ID: 5
Block ID: 2, Thread ID: 6
Block ID: 2, Thread ID: 7

Explanation:

  • checkIndex<<<4, 8>>>(): This launches the kernel with 4 blocks, each containing 8 threads, resulting in a total of 32 threads (4 × 8).
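  • blockIdx.x and threadIdx.x: These built-in CUDA variables give each thread its block index within the grid and its thread index within the block, so each thread can work out which part of the data it is responsible for.
  • Output order: Blocks are scheduled independently, so the order in which blocks print (1, 0, 3, 2 in the output above) is not guaranteed and can differ between runs.
  • Asynchronous launch: After the <<<...>>> call, the CPU does not wait and continues with the next instruction immediately. The cudaDeviceSynchronize() function blocks the CPU until the GPU has finished; without it, the program could exit before the kernel's printf output appears.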
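
Because the launch is asynchronous, errors are commonly checked in two places: right after the launch and again after synchronizing. The sketch below shows one way to do this with cudaGetLastError() and cudaDeviceSynchronize(). The kernel setFlag and the variable names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes 1 to a single integer in device memory.
__global__ void setFlag(int *flag)
{
    *flag = 1;
}

int main()
{
    int *dFlag = nullptr;
    cudaMalloc(&dFlag, sizeof(int));
    cudaMemset(dFlag, 0, sizeof(int));

    setFlag<<<1, 1>>>(dFlag);     // control returns to the CPU right away

    // Launch problems (for example an invalid configuration) are reported here.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(launchErr));
        return 1;
    }

    printf("Kernel launched; the CPU can do other work here.\n");

    // Block until the GPU has finished; errors raised during execution surface here.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(syncErr));
        return 1;
    }

    int hFlag = 0;
    cudaMemcpy(&hFlag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Flag set by the GPU: %d\n", hFlag);

    cudaFree(dFlag);
    return 0;
}
```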