
URL: https://thenewstack.io/cuda-12-harnesses-a-nvidias-speedier-gpu-architecture/

⇱ CUDA 12 Harnesses Nvidia's Speedier GPU Architecture - The New Stack



CUDA 12 Harnesses Nvidia’s Speedier GPU Architecture

CUDA 12 is tuned to Nvidia's new GPU architecture called Hopper, which can be five times faster than the previous-generation Nvidia chips.
Sep 29th, 2022 7:32am by Agam Shah

GPU maker Nvidia will soon release the next version of the CUDA parallel-programming framework, version 12, to accompany the release of its new GPU architecture code-named Hopper.

“It’s the biggest release we’ve ever done,” said Stephen Jones, CUDA architect at Nvidia, during a breakout session at Nvidia’s GPU Technology Conference, which was held virtually earlier this month.

CUDA started off in June 2007 as a simple programming framework targeted at graphics processors, and is currently at version 11.7, with one more update, version 11.8, due before the move to version 12.

Jones didn’t provide an exact shipment date for CUDA 12, but past release timelines point to version 12 being available for download either late this year or early next year.

Nvidia typically releases a new version of CUDA with every new GPU architecture. This is the first time in two years that CUDA users will experience a major version change.

GPUs were initially popular for graphics, but the ability of the chips to compute in parallel planted the seed for Nvidia’s hardware to be used in non-graphics applications. Today, Nvidia’s GPUs dominate the market as accelerators for AI, simulation, graphics and supercomputing. But the proprietary CUDA parallel programming model is designed to work only on Nvidia’s GPUs, and that is forcing customers to buy the company’s hardware.

Nvidia is now trying to shift gears to grow in the software business by selling AI software applications developed in CUDA. The company sees a $1 trillion market opportunity in software, with CUDA-based applications going into self-driving cars, robots, medical devices, and other AI systems.

A typical CUDA program has a GPU code section, which includes the code for execution on graphics cores, and a CPU code section, which sets up the execution environment that includes memory allocation and hardware management. CUDA also has a runtime system that includes libraries and a compiler that compiles the code into an executable.

CUDA binaries have CPU and GPU sections, and a separate PTX assembly code section, which acts as a backward and, to some extent, forward compatibility layer for all versions of CUDA dating back to the first edition in 2007.

Upgrade Challenges

But CUDA 12 applications will break on CUDA 11. Starting with CUDA 11, Nvidia included a compatibility layer so that APIs don’t break between minor versions within a major release; for example, an application built on CUDA 11.5 will work with CUDA 11.1. But that compatibility layer doesn’t extend across major versions.

“You can’t run CUDA 12 applications, say, on a system with 11.2 installed because API signatures may have changed across a major version,” Jones said, adding: “This means two things. First, you need to care what major version of CUDA is running on your [system], and second, some APIs and data structures will change.”
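The rule Jones describes can be sketched in a few lines of Python. This is only a model of the rule as stated in this article: the helper names `parse_version` and `same_major_version` are hypothetical and not part of any Nvidia tool, and real-world compatibility also depends on details such as the installed driver that this sketch ignores.

```python
def parse_version(version):
    """Split a version string such as "11.5" into (major, minor) integers."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def same_major_version(app_built_with, installed):
    """Return True when both CUDA versions share a major version.

    Hypothetical helper: it encodes only the rule quoted in the article,
    namely that minor versions within one major release are compatible
    and that the guarantee stops at a major-version boundary.
    """
    return parse_version(app_built_with)[0] == parse_version(installed)[0]
```

Under this model, `same_major_version("11.5", "11.1")` is true, matching the article's example, while `same_major_version("12.0", "11.2")` is false.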

CUDA 12 is specifically tuned to the new GPU architecture called Hopper, which replaces the two-year-old architecture code-named Ampere, which CUDA 11 supported. The flagship Hopper-based GPU, called the H100, has been measured at up to five times faster than the previous-generation Ampere flagship GPU, branded A100. The speed enhancements in Hopper come through a host of new features such as higher memory throughput, new interconnect technologies, faster tensor cores for AI, and faster vector and floating-point operations.

Hopper has 132 streaming multiprocessors, PCIe Gen5 support, HBM3 memory, 50MB of L2 cache and the new NVLink interconnect with 900GB/s bandwidth.

If you want to get the best performance out of Hopper, you will only get it from CUDA 12. Nvidia is keeping its hardware and software close to its chest, and if you use Khronos’ OpenCL, AMD’s ROCm or other parallel programming tools, you won’t be able to harness the full power of Hopper.

The Hopper H100 GPU focuses on keeping data local and reducing the time it takes to execute code. The H100 has 132 streaming multiprocessor (SM) units, up from 15 in Kepler ten years ago. Scaling across the SMs is central to CUDA 12, Jones said.

The CUDA programming model, at its core, asks users to break up work, like processing an image, into blocks, which are organized next to each other in a grid. Each block runs on a GPU as if it were a separate program, and Hopper can run several thousand blocks at once. Each block, working on its own problem, is further broken down into threads.
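To make the grid, block and thread split concrete, here is a minimal CPU-side sketch in Python. It is a model, not CUDA code: the function names are hypothetical, the two loops stand in for blocks and threads that a GPU would run in parallel, and the block size of 256 is an arbitrary choice for illustration. The index arithmetic mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def blocks_needed(num_items, block_dim):
    """Number of blocks required to cover num_items (ceiling division)."""
    return (num_items + block_dim - 1) // block_dim


def global_thread_index(block_idx, block_dim, thread_idx):
    """Position of one thread within the whole grid."""
    return block_idx * block_dim + thread_idx


def brighten(pixels, amount, block_dim=256):
    """Brighten a flat list of 8-bit pixels, one pixel per simulated thread."""
    out = list(pixels)
    grid_dim = blocks_needed(len(pixels), block_dim)
    for block_idx in range(grid_dim):        # each block is independent work
        for thread_idx in range(block_dim):  # threads within that block
            i = global_thread_index(block_idx, block_dim, thread_idx)
            if i < len(pixels):              # the last block may be partly empty
                out[i] = min(255, pixels[i] + amount)
    return out
```

For example, a 1,000-pixel image with 256 threads per block needs four blocks, and the bounds check keeps the spare threads in the last block from touching memory past the end of the image.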

Nvidia has broken down that grid-block-thread hierarchy even further with a new layer called the “thread block cluster.” The thread block cluster breaks up the old structure and weaves in interconnected mini-grids at the block level, which together add up to the larger grid. Because of Hopper’s massive scale, “we’ve taken the concept of a grid made up of wholly independent blocks of work,” Jones said, and extended it with clusters of cooperating blocks.

The SMs have been organized in that hierarchy of thread block clusters, whose blocks exchange data in a synchronized way. A cluster of 16 blocks runs as many as 16,384 threads simultaneously, which is a huge amount of concurrency, Jones said, adding that every block in a cluster can read and write the shared memory of every other block in the cluster.

“What we’ve made is a way to target a localized subset of your grid to a localized set of execution resources that opens up more opportunities for programmability and performance,” Jones said.
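The cluster arithmetic can be sketched as follows. This is an illustrative Python model with hypothetical names, not CUDA syntax. It assumes 1,024 threads per block, which is CUDA's per-block limit, and the 16-block cluster size quoted above; it also assumes the grid size is a whole multiple of the cluster size.

```python
THREADS_PER_BLOCK = 1024   # assumption: CUDA's maximum threads per block
BLOCKS_PER_CLUSTER = 16    # cluster size quoted in the article


def threads_per_cluster(blocks=BLOCKS_PER_CLUSTER, threads=THREADS_PER_BLOCK):
    """Threads that can cooperate inside one cluster."""
    return blocks * threads


def cluster_position(block_idx, blocks_per_cluster=BLOCKS_PER_CLUSTER):
    """Return (cluster_id, rank_in_cluster) for a block in a 1-D grid."""
    return divmod(block_idx, blocks_per_cluster)


def cluster_peers(block_idx, blocks_per_cluster=BLOCKS_PER_CLUSTER):
    """Blocks whose shared memory this block could reach in this model."""
    cluster_id, _ = cluster_position(block_idx, blocks_per_cluster)
    start = cluster_id * blocks_per_cluster
    return list(range(start, start + blocks_per_cluster))
```

With these assumptions, 16 blocks of 1,024 threads give the 16,384 figure, and block 37 sits in cluster 2 alongside blocks 32 through 47.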

The thread block cluster feature in the programming model has new syntax that allows developers to define the launch size and the resources it needs for a task instead of relying on the CPU to do it correctly.

Another new Hopper feature is an asynchronous transaction barrier that reduces the back and forth of data for quicker execution of code. The asynchronous transaction barrier is like a sleeping room in which waiting threads doze off until data from other threads arrives to complete a transaction. That reduces the energy, effort and bandwidth required to move data.

“You just say ‘Wake me up when the data has arrived.’ I can have my thread waiting … expecting data from lots of different places and only wake up when it’s all arrived,” Jones said.

In chips, work is commonly broken up into threads, which have to coordinate with each other. With normal barriers, threads typically track where data is coming from and which source they are synchronizing with. That’s not the case in Hopper, where the transfer is just a single write operation.

“The asynchronous memory copy knows how many bytes it’s carrying. The barrier knows how many it is expecting. When the data arrives, it just counts itself in. These are one-sided memory copies and they are seven times faster [communication] because they just go one way and don’t have to go back and forth,” Jones said.
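The byte-counting idea can be modeled on a CPU with Python threads. This is a sketch of the concept only: the class `TransactionBarrier` and the helper `run_demo` are hypothetical names, and nothing here reflects how Hopper implements the barrier in hardware. The waiter never asks who sent the data; each copy simply adds its own byte count, and the waiter wakes once the expected total has arrived.

```python
import threading


class TransactionBarrier:
    """Toy model: wake waiters once the expected number of bytes has arrived."""

    def __init__(self, expected_bytes):
        self.expected_bytes = expected_bytes
        self.arrived_bytes = 0
        self._lock = threading.Lock()
        self._done = threading.Event()
        if expected_bytes <= 0:
            self._done.set()

    def arrive(self, nbytes):
        """Called by a one-sided copy: it just counts its own bytes in."""
        with self._lock:
            self.arrived_bytes += nbytes
            if self.arrived_bytes >= self.expected_bytes:
                self._done.set()

    def is_complete(self):
        return self._done.is_set()

    def wait(self, timeout=None):
        """Sleep until all expected bytes have arrived."""
        return self._done.wait(timeout)


def run_demo(chunk_sizes):
    """One sleeping consumer, one producer thread per incoming copy."""
    barrier = TransactionBarrier(sum(chunk_sizes))
    seen = []

    def consumer():
        barrier.wait()
        seen.append(barrier.arrived_bytes)

    waiter = threading.Thread(target=consumer)
    waiter.start()
    producers = [threading.Thread(target=barrier.arrive, args=(n,))
                 for n in chunk_sizes]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    waiter.join()
    return seen[0]
```

Calling `run_demo([256, 256, 512])` starts three copies from different threads; the consumer sleeps through the first arrivals and wakes only when the full 1,024 bytes have been counted in.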

Hopper also has a new processing unit called the Tensor Memory Accelerator (TMA), which the company has classified as a data movement engine. The engine enables bidirectional movement of large data blocks between global and shared memory. The TMA also takes over asynchronous memory copy between thread blocks in a cluster.

“You call [TMA] and it goes off to do the copy, which means the hardware is taking over the job of calculating addresses and strides, checking boundaries, all that kind of stuff. It can cut out a section of data … and just drop it into shared memory or put it back the other way. You don’t have to write a single line of code,” Jones said.
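The bookkeeping Jones says the hardware takes over looks roughly like the following when written by hand. This Python sketch is a hypothetical model, not the TMA interface: `copy_tile` is an invented name, the data is a flat row-major list, and filling out-of-range elements with a default value is an assumption of the sketch.

```python
def copy_tile(src, height, width, row0, col0, tile_h, tile_w, fill=0):
    """Cut a tile_h x tile_w section out of a flat row-major 2-D array.

    src holds height * width elements. Elements of the tile that fall
    outside the array are replaced by `fill` instead of being read.
    """
    row_stride = width                      # elements between consecutive rows
    tile = []
    for r in range(tile_h):
        for c in range(tile_w):
            row, col = row0 + r, col0 + c
            if 0 <= row < height and 0 <= col < width:   # boundary check
                tile.append(src[row * row_stride + col])  # address calculation
            else:
                tile.append(fill)
    return tile
```

Each element costs an address calculation with a stride plus a boundary check, which is the per-thread work a dedicated copy engine would remove from the kernel.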

Hopper has new DPX instructions for dynamic programming, a technique in which one can efficiently find the solution to a larger problem by recursively solving overlapping subproblems. This could make CUDA 12 relevant to applications that search through many candidate paths to optimize or solve a problem, like mapping or robotic path finding.

“It’s very similar to a divide-and-conquer approach … except it is overlapping data which is harder to solve,” Jones said.
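A textbook example of dynamic programming with overlapping subproblems is the Floyd-Warshall all-pairs shortest-path algorithm, sketched below in plain Python. The choice of algorithm is an illustration picked for this sketch because it resembles a route-finding problem; the function name is hypothetical, and the code runs on a CPU and says nothing about how DPX instructions are used.

```python
INF = float("inf")


def all_pairs_shortest_paths(num_nodes, edges):
    """Floyd-Warshall: edges is a list of (source, target, weight) tuples."""
    dist = [[0 if i == j else INF for j in range(num_nodes)]
            for i in range(num_nodes)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    # Each pass reuses answers from earlier passes: the best route from
    # i to j through nodes 0..k builds on the best routes through 0..k-1.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
    return dist
```

The same partial results, such as the best route from node 0 to node 2, are reused by many larger queries rather than being recomputed, which is the overlap Jones refers to. The inner update is an add followed by a min, repeated for every triple of nodes.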

Nvidia has also enhanced the concept of dynamic parallelism, which allows the GPU to launch a new kernel directly without invoking the CPU. “By adding some special mechanisms to the dynamic parallel programming model, we’ve been able to speed up the launch performance by a factor of three,” Jones said.

An Nvidia moderator didn’t clarify if the dynamic parallelism would advance to the OpenMP or OpenACC standards, saying “whether it makes its way into the standards as an explicit language feature depends on the committees.”

Nvidia is actively trying to upstream some features in the CUDA toolkit as part of standard C++ releases. CUDA has its own compiler, called NVCC, which is designed for GPUs, and a Runtime API with a simple C++-like interface; the runtime is built on top of a driver API. GPUs typically have computing elements, such as vector processing units, that are well suited to applications such as AI.

Agam Shah has covered enterprise IT for more than a decade. Outside of machine learning, hardware and chips, he's also interested in martial arts and Russia.