Compiling CUDA Programs (NVCC)

Last Updated : 12 Feb, 2026

The NVIDIA CUDA Compiler (NVCC) is the compiler driver used to transform CUDA C++ source code into executable programs. Because CUDA programs are "heterogeneous", meaning they contain code for both a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU), the compilation process is more complex than standard C++ development.

CUDA Compilation Process

A CUDA source file (ending in .cu) contains two types of code: Host code (runs on the CPU) and Device code (runs on the GPU). NVCC acts as a compiler driver: a single nvcc invocation splits the two parts, sends each to the appropriate compiler, and combines the results.

Figure: CUDA Compilation Process

How it Works:

Code Splitting: NVCC reads the .cu file and separates the host (CPU) code from the device code, that is, kernels marked with __global__ and GPU helper functions marked with __device__.
Host Compilation: The CPU code is passed to a standard host compiler (like gcc on Linux or cl.exe on Windows) to create standard object code.
Device Compilation: The GPU code is compiled by NVIDIA’s tools into PTX (Parallel Thread Execution), an assembly-like intermediate format, and/or cubin, a binary format for a specific GPU architecture.
Linking: Finally, NVCC embeds the compiled GPU code in the host object code and links everything into a single executable file. When the user runs this file, the CPU part starts first and launches GPU kernels when needed.
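To make the split concrete, here is a minimal sketch of a .cu file that mixes both kinds of code. The file name vector_add.cu and the names add_vectors and run_vector_add are hypothetical, chosen only for illustration; the runtime calls (cudaMalloc, cudaMemcpy, cudaFree) are standard CUDA Runtime API functions.

```cuda
// vector_add.cu: host and device code in one source file (illustrative sketch).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Device code: nvcc compiles this kernel for the GPU (PTX and/or cubin).
__global__ void add_vectors(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host code: nvcc hands this function to the host compiler.
// Returns true if the GPU result equals a[i] + b[i] for every element.
bool run_vector_add(int n) {
    std::vector<float> a(n, 1.0f), b(n, 2.0f), out(n, 0.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up so every element is covered
        add_vectors<<<blocks, threads>>>(d_a, d_b, d_out, n);
        // This copy runs after the kernel on the default stream, so it sees the finished result.
        err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }

    // cudaFree accepts null pointers, so this is safe even after an early failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) {
        if (out[i] != 3.0f) return false;
    }
    return true;
}

int main() {
    bool ok = run_vector_add(1 << 20);
    std::printf("vector add %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

Everything outside add_vectors is ordinary C++ that the host compiler sees; only the kernel body goes through the device compilation path.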

Compiling via Command Line

1. Basic Command: To compile a source file into a runnable program, use the following syntax:

nvcc program.cu -o program

Explanation:

nvcc: Invokes the compiler driver.
program.cu: The source file to be compiled.
-o program: The "output" flag that defines the name of the final executable.

2. Targeting Specific Hardware (-arch): GPU architectures evolve with every generation (e.g., Pascal, Turing, Ampere). To get the best performance, the compiler needs to know which GPU generation you are targeting.

nvcc -arch=sm_75 program.cu -o program

Explanation: In -arch=sm_xx, "sm" stands for "Streaming Multiprocessor", and the digits give the GPU's compute capability. For example, sm_75 means compute capability 7.5, which targets Turing-generation GPUs (like the RTX 20-series or Tesla T4). This ensures the compiler uses the specific instructions available on that hardware.
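If you are not sure which value to pass, one option is to ask the CUDA runtime for the compute capability of the installed GPU. The sketch below assumes a file named query_arch.cu; the helper arch_flag is hypothetical, while cudaGetDeviceCount and cudaGetDeviceProperties are standard CUDA Runtime API calls.

```cuda
// query_arch.cu: print a matching -arch value for each visible GPU (illustrative sketch).
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: builds the architecture name from a compute capability,
// e.g. major 7 and minor 5 give "sm_75".
std::string arch_flag(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable GPU found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("GPU %d: %s -> -arch=%s\n", dev, prop.name,
                    arch_flag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

Compile it once without -arch (nvcc query_arch.cu -o query_arch), run it, and reuse the printed value when building your real program.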

3. Optimization and Debugging: One can pass flags to improve performance or help find errors in the code.

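-O3: Enables aggressive optimization of the host (CPU) code to make it run faster. Device code is already optimized by default.
-g -G: These flags add debugging information for both the CPU (-g) and the GPU (-G), allowing you to use tools like cuda-gdb to find bugs. Note that -G also turns off most device-code optimizations, so it is best kept for debug builds.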

4. Running the Program: Once compiled, the resulting binary is a regular executable file (the machine still needs an NVIDIA driver to use the GPU). On Linux, run it with ./, and on Windows, simply type the filename.

./program

Explanation: This starts the host code on the CPU. Kernel launches are asynchronous, so the CPU does not wait for the GPU by default. If the code includes a cudaDeviceSynchronize() call, the CPU will wait for all GPU kernels and their printf statements to finish before the program closes, ensuring you see all output in the terminal.
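The effect of cudaDeviceSynchronize() is easiest to see with a kernel that prints. The following is a minimal sketch, not a definitive pattern; the names hello_kernel and run_hello are hypothetical, and the file is assumed to be saved as hello.cu.

```cuda
// hello.cu: device-side printf followed by a synchronization (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread prints its own index.
__global__ void hello_kernel() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Launches the kernel and waits for it. Returns 0 on success, 1 on any CUDA error.
int run_hello() {
    hello_kernel<<<1, 4>>>();  // one block of four threads

    // Catches launch problems such as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Blocks the CPU until the kernel has finished and its printf output is flushed.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    int status = run_hello();
    if (status == 0) {
        std::printf("Host: all GPU work finished\n");
    }
    return status;
}
```

If the cudaDeviceSynchronize() call is removed, the program may exit before the GPU messages reach the terminal.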

URL: https://www.geeksforgeeks.org/cpp/compiling-cuda-programs-nvcc/