
URL: https://thenewstack.io/mlops-needs-a-better-way-to-manage-gpus/




MLOps Needs a Better Way to Manage GPUs

GPUs must be provisioned in a smarter way, asserted two Run.AI engineers at KubeCon, pointing the way to useful tools and further areas of research.
Nov 4th, 2022 3:00am by Joab Jackson

GPUs are a necessity for deep learning and other large-scale forms of machine learning, yet we don’t have tools to manage them as effectively as we manage regular CPUs. With GPU costs being what they are these days, you want to make sure you get the most for your money.

Two Run.AI software engineers, Natasha Romm and Raz Rotenberg (Software Team Lead), have been investigating ways to improve GPU utilization. They presented their findings at Kubernetes AI Day, part of the KubeCon+CloudNativeCon conference held last week in Detroit.

Dell Technologies (NYSE:DELL) is a unique family of businesses that helps organizations and individuals build their digital future and transform how they work, live and play. The company provides customers with the industry’s broadest and most innovative technology and services portfolio spanning from edge to core to cloud.

“GPUs must be provisioned in a smarter way,” Rotenberg said.

Today, GPUs are allocated statically and without much nuance, usually by user or by AI workload. What is needed is finer-grained allocation, such as fractions of GPU time, so that GPUs can be better shared across tasks.

“GPU provisioning is not a term we use on a daily basis,” Rotenberg admitted. As more departments start using GPUs, the admin may put in a request for more hardware. But these existing GPUs are probably way underutilized. “The truth is, you don’t always need more GPUs,” he said.

Admins should be able to overprovision, that is, assign more workloads than there are GPUs to run them. Overprovisioning is routinely done with CPUs, memory and even storage. Kubernetes, however, can’t overprovision GPUs.

Most users don’t require an entire GPU for their work. And, much of the time, they probably don’t run the GPUs at all. The researchers may be off on a coffee break, or even away for a holiday.
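To make the idea of fractional allocation concrete, here is a minimal sketch of packing workloads that each need only part of a GPU onto as few GPUs as possible. It is a toy first-fit-decreasing heuristic, not Run.AI's scheduler, and the demand figures are hypothetical.

```python
from typing import List


def pack_fractional_demands(demands: List[float], capacity: float = 1.0) -> List[List[float]]:
    """Place each fractional GPU demand on the first GPU that still has room.

    Demands are sorted largest-first (first-fit decreasing). Returns one list
    of demands per GPU used.
    """
    gpus: List[List[float]] = []  # demands placed on each GPU
    free: List[float] = []        # remaining capacity of each GPU
    for demand in sorted(demands, reverse=True):
        if not 0 < demand <= capacity:
            raise ValueError(f"demand {demand} must be in (0, {capacity}]")
        for i, room in enumerate(free):
            # Small tolerance so floating-point rounding does not waste a GPU.
            if demand <= room + 1e-9:
                gpus[i].append(demand)
                free[i] -= demand
                break
        else:
            # No existing GPU has room, so start a new one.
            gpus.append([demand])
            free.append(capacity - demand)
    return gpus


# Hypothetical workloads, each needing only a fraction of one GPU.
demands = [0.5, 0.25, 0.25, 0.5, 0.1, 0.4]
placement = pack_fractional_demands(demands)
print(f"{len(demands)} workloads share {len(placement)} GPUs")
```

With one whole GPU per workload, these six workloads would occupy six GPUs. A real scheduler would also have to account for GPU memory, bursts in demand and priorities, which this sketch ignores.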

Kubernetes can do dynamic allocation. However, it allocates only whole GPUs to a pod. Once a GPU is assigned to a pod, it can’t be used elsewhere, even if the GPU itself is not being used.
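As an illustration of this allocation model, the sketch below builds a pod manifest as a Python dictionary. It assumes the cluster runs Nvidia's device plugin, which exposes GPUs under the extended resource name `nvidia.com/gpu`. The pod name and image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Return a minimal Pod manifest that requests whole GPUs.

    Extended resources such as nvidia.com/gpu are requested as integers,
    so a fractional request is rejected here.
    """
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPU requests must be whole, positive integers")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The GPU stays bound to this pod for its whole lifetime,
                    # whether or not the container is actually using it.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_pod_manifest("notebook", "example.com/ml/notebook:latest")
print(json.dumps(manifest, indent=2))
```

Because the request is an integer, a researcher running a lightly loaded notebook still holds an entire GPU until the pod is deleted.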

Tools for Better Management

The Run.AI engineers have created an open source utility, called genv (GPU Environment Management), which can be used to control, configure and monitor the GPU resources. This tool works for workloads running directly on bare metal, or ones that can be accessed over SSH.

For Kubernetes deployments, take a look at Nvidia’s DCGM Exporter, which can extract operational metrics from Nvidia’s GPUs. It can be run as a standalone container or deployed as a DaemonSet on GPU nodes in a Kubernetes cluster. It is usually deployed by Nvidia’s GPU Operator for Kubernetes, so if you use that operator, you probably already have this capability built in.

To build GPU-monitoring dashboards, the Run.AI folks also use Prometheus and Grafana on top of the Nvidia exporter. This dashboard shows the GPU usage as a percentage.
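The same utilization figure can be pulled from Prometheus directly. The sketch below builds an instant-query URL for the Prometheus HTTP API, using `DCGM_FI_DEV_GPU_UTIL`, the GPU utilization metric (in percent) published by the DCGM Exporter. The Prometheus address and the one-hour window are assumptions for illustration, not details from the talk.

```python
from urllib.parse import urlencode


def gpu_utilization_query(window: str = "1h") -> str:
    """PromQL for each GPU's average utilization (percent) over a window."""
    return f"avg_over_time(DCGM_FI_DEV_GPU_UTIL[{window}])"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' instant-query endpoint, /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address; replace with your own server.
url = instant_query_url("http://prometheus.example:9090", gpu_utilization_query("1h"))
print(url)
```

Averaging over a window rather than reading a single sample helps avoid flagging a GPU as idle just because it was sampled between training steps.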

With this information, an administrator can approach the owners of the GPUs, who are identifiable by the K8s namespace, and ask them to relinquish any GPUs showing 0% utilization.
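Finding those owners could be automated along the following lines. This sketch groups idle GPUs by namespace from a Prometheus instant-query response. The label names (`namespace`, `Hostname`, `gpu`) are assumptions: they depend on the exporter version and on Prometheus relabeling, so check them against your own metrics. The sample response is made up.

```python
from collections import defaultdict
from typing import Dict, List


def idle_gpus_by_namespace(response: dict, threshold: float = 0.0) -> Dict[str, List[str]]:
    """Group GPUs whose utilization is at or below `threshold` by namespace.

    `response` is the decoded JSON body of a Prometheus instant query that
    returns one GPU-utilization sample (in percent) per GPU.
    """
    idle: Dict[str, List[str]] = defaultdict(list)
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # Prometheus encodes the value as a string
        if float(value) <= threshold:
            namespace = labels.get("namespace", "<unassigned>")
            gpu_id = f'{labels.get("Hostname", "?")}/gpu{labels.get("gpu", "?")}'
            idle[namespace].append(gpu_id)
    return dict(idle)


# Made-up response in the shape returned by /api/v1/query.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-a", "gpu": "0", "namespace": "team-vision"},
             "value": [1667530800, "0"]},
            {"metric": {"Hostname": "node-a", "gpu": "1", "namespace": "team-nlp"},
             "value": [1667530800, "87"]},
            {"metric": {"Hostname": "node-b", "gpu": "0", "namespace": "team-vision"},
             "value": [1667530800, "0"]},
        ],
    },
}

report = idle_gpus_by_namespace(sample_response)
for namespace, gpus in report.items():
    print(f"{namespace}: {len(gpus)} idle GPU(s): {', '.join(gpus)}")
```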

Further Down the Road

Genv was created as a way to introduce AI users to the idea of better managing GPUs. The next steps are smart utilization and smart provisioning, Romm said.

Run.AI has built an orchestration layer for AI resources, and for managing GPUs in particular.

“We help organizations get more out of their expensive hardware using smart scheduling algorithms and deep core capabilities,” Rotenberg wrote in a follow-up e-mail.

The company’s engineers have built Linux-level capabilities “that allow better management of GPUs, such as memory limitation, memory swapping, time-sharing and prioritization, rerouting running pods to idle GPUs,” he wrote. These Linux-level capabilities are integrated into the Kubernetes-level layer, which provides scheduling and other key areas of support.

The Run.AI platform also offers features for managing AI workloads through projects and departments, as well as for managing users and enforcing more sophisticated quotas.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
TNS owner Insight Partners is an investor in: run.ai.