URL: https://thenewstack.io/hpc-kubernetes-ai-training-on-3500-gpus/

HPC Kubernetes: AI Training on 3,500 GPUs - The New Stack



HPC Kubernetes: AI Training on 3,500 GPUs

K8s brings many advantages to managing fleets of GPUs, said CoreWeave's Peter Salanki, during a talk at KubeCon+CloudNativeCon 2023.
Dec 4th, 2023 10:20am by Joab Jackson
“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” — Peter Salanki, CoreWeave

To date, Kubernetes has largely steered clear of the high-performance computing (HPC), or supercomputing, space.

But with such a premium being put on GPUs for large-scale machine learning these days, Kubernetes could provide a more dynamic way of managing vast fleets of GPUs, with a little help from tools that originated in the HPC space.

One cloud provider showing what can be done is CoreWeave, which specializes in accelerating GPU workloads.

In June, the company aced round three of MLCommons' MLPerf, a benchmark suite for measuring and comparing system performance on training and inference tasks. CoreWeave spun up a cluster of 3,500 (recently released) Nvidia H100 GPUs that trounced other Kubernetes clusters by up to a factor of 29.

Unlike traditional HPC systems, CoreWeave does not run its services directly on bare metal but rather runs Kubernetes on top of the bare metal.

K8s brings many advantages to managing GPUs, said Peter Salanki, CoreWeave director of engineering, during a talk at KubeCon+CloudNativeCon 2023.

“Building an ecosystem around Kubernetes makes it very easy for us to plug in new things. And get metrics out without having to build a bunch of glue between proprietary systems and Kubernetes itself,” Salanki said.

Kubernetes on Bare Metal

All the GPUs were located in a single data center, with each server housing eight GPUs on an Intel Sapphire Rapids platform. They were all tethered by 400 miles of InfiniBand fiber (for the lowest possible interconnect latency) and 40,000 connections.

That number is important to note because large ML workloads, of the kind MLPerf models, can span all the available GPUs for maximum performance. But if any one of these components goes down, the whole job must be restarted from the last checkpoint.
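
To give a rough sense of why scale makes a failure so likely, here is a back-of-the-envelope sketch in Python. The GPU, per-server and connection counts come from the figures above; the per-component failure probability `P_FAIL` and the assumption that components fail independently are purely illustrative and are not numbers from CoreWeave.

```python
import math

# Figures quoted in the article.
GPUS = 3_500
GPUS_PER_SERVER = 8
CONNECTIONS = 40_000

# Hypothetical: chance that any single component fails during one job.
P_FAIL = 1e-5


def servers_needed(gpus: int, per_server: int) -> int:
    """Smallest number of servers that can hold the given GPU count."""
    return math.ceil(gpus / per_server)


def job_survival_probability(components: int, p_fail: float) -> float:
    """Probability that none of the components fails, assuming independence."""
    return (1.0 - p_fail) ** components


servers = servers_needed(GPUS, GPUS_PER_SERVER)
components = GPUS + CONNECTIONS
survival = job_survival_probability(components, P_FAIL)

print(f"servers needed: {servers}")
print(f"components in the failure domain: {components}")
print(f"chance the job runs without a restart: {survival:.1%}")
```

Even with a very small per-component failure rate, the chance of an uninterrupted run drops noticeably once tens of thousands of parts sit in the same failure domain, which is the point Salanki makes next.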

“Any individual failure can be catastrophic to a job,” Salanki said. “So ensuring that your nodes are healthy and your entire fabric is healthy. That is critical to not lose performance.”

Everything is booted statelessly — the servers do not have any operating systems on them.

“The systems are delivered without any OS. We don’t want them to come with any OS from a vendor because things change constantly. We have new kernels to deploy and new CPUs, so we can’t really expect anything that is preloaded in the factory to work,” Salanki said.

Each server comes with an Nvidia BlueField data processing unit (DPU), a processor on a network card (also managed by Kubernetes).

When booted, the DPU downloads a trimmed Ubuntu image with little more than GPU and InfiniBand drivers, and a kubelet. It then asks for a join token and joins a Kubernetes cluster. (The DPU also provides VPC isolation for each workload, to support a multi-tenant environment.)

“Everything is stateless,” Salanki said. “It’s fully ephemeral, which means we can plug in your nodes and get them up and running on a Kubernetes cluster immediately.”

Kubernetes as the System of Record

Kubernetes serves as the system of record for each cluster, Salanki noted. Every action that happens is logged. All the performance metrics are captured.

In this setup, the Kubernetes API server is central. “Every action flows through Kubernetes. There is no path that does not go through Kubernetes,” he said. An admin who wants to reboot a node sets a condition on the node, which triggers a reboot by the node controller. The whole flow is captured by event logging.
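
The reboot flow described above can be pictured with a small, self-contained Python model. This is only a sketch of the general pattern (set a condition, let a controller act on it, log each step); it does not use the Kubernetes API, and the condition name `RebootRequested`, the class names and the event strings are all hypothetical, not CoreWeave's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical condition name, used only for illustration.
REBOOT_CONDITION = "RebootRequested"


@dataclass
class Node:
    name: str
    conditions: dict = field(default_factory=dict)  # condition type -> bool
    boot_count: int = 0


@dataclass
class Cluster:
    nodes: dict = field(default_factory=dict)   # node name -> Node
    events: list = field(default_factory=list)  # (node name, reason) tuples

    def record(self, node_name: str, reason: str) -> None:
        """Append an event so every step of the flow stays auditable."""
        self.events.append((node_name, reason))

    def set_condition(self, node_name: str, condition: str) -> None:
        """What the admin does: flag the node instead of touching it directly."""
        self.nodes[node_name].conditions[condition] = True
        self.record(node_name, f"ConditionSet:{condition}")


def reconcile(cluster: Cluster) -> list:
    """One pass of a toy node controller: reboot every flagged node."""
    rebooted = []
    for node in cluster.nodes.values():
        if node.conditions.get(REBOOT_CONDITION):
            cluster.record(node.name, "Rebooting")
            node.boot_count += 1  # stands in for the real reboot
            node.conditions[REBOOT_CONDITION] = False
            cluster.record(node.name, "RebootComplete")
            rebooted.append(node.name)
    return rebooted


cluster = Cluster(nodes={n: Node(n) for n in ("gpu-node-1", "gpu-node-2")})
cluster.set_condition("gpu-node-1", REBOOT_CONDITION)
rebooted = reconcile(cluster)

for node_name, reason in cluster.events:
    print(node_name, reason)
```

The admin never reboots the machine directly in this model. The only write is the condition, and the controller's actions land in the same event log, which mirrors the idea of a single system of record.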

“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” including a programming model that many developers already know, he said.

Slurm on Kubernetes

To run MLPerf, CoreWeave used Slurm, a scheduler well known to researchers in the HPC space, though rarely used in a K8s environment.

So the company created a Helm chart for running Slurm on Kubernetes (SUNK), which will be released as open source in early 2024. All the Slurm components are containerized, including the daemon, controllers and login nodes.

With SUNK, Slurm acts as a plug-in scheduler for Kubernetes. On the same cluster, a training job could run on Slurm alongside long-running production inference workloads, which are handled more effectively by Kubernetes itself and could even pre-empt Slurm jobs.
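
As a rough illustration of how an inference workload might pre-empt a Slurm training job on a shared cluster, here is a toy priority-based admission function in Python. It is a simplified model written for this article, not SUNK's scheduling logic; the `Workload` fields, the priority values and the 16-GPU capacity are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    gpus: int
    priority: int   # higher wins; the values used below are hypothetical
    scheduler: str  # "slurm" or "kubernetes"


def admit(running: list, incoming: Workload, capacity: int):
    """Try to place `incoming`, evicting lower-priority workloads if needed.

    Returns (new_running, preempted). If the workload cannot fit even after
    evicting every lower-priority workload, nothing changes.
    """
    free = capacity - sum(w.gpus for w in running)
    preempted = []
    # Consider the lowest-priority workloads first as eviction candidates.
    candidates = sorted(
        (w for w in running if w.priority < incoming.priority),
        key=lambda w: w.priority,
    )
    for victim in candidates:
        if free >= incoming.gpus:
            break
        preempted.append(victim)
        free += victim.gpus
    if free < incoming.gpus:
        return list(running), []
    evicted_ids = {id(w) for w in preempted}
    survivors = [w for w in running if id(w) not in evicted_ids]
    return survivors + [incoming], preempted


training = Workload("llm-training", gpus=16, priority=1, scheduler="slurm")
inference = Workload("prod-inference", gpus=8, priority=10, scheduler="kubernetes")

running, preempted = admit([training], inference, capacity=16)
print("running:", [w.name for w in running])
print("preempted:", [w.name for w in preempted])
```

In this model the evicted training job would have to be requeued and resumed from its last checkpoint, which is why checkpointing matters for jobs that share a cluster with higher-priority services.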

In his talk, Salanki also went into detail about the two node controllers, node testing and automatic remediation of failures.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
TNS owner Insight Partners is an investor in: run.ai, Unit.