URL: https://thenewstack.io/kubernetes-for-genai-why-it-makes-so-much-sense/


Kubernetes for GenAI: Why It Makes So Much Sense

Kubernetes’ support for third-party operators, storage integrations, GPU enablement and more makes it a great platform for GenAI applications.
Jun 26th, 2024 11:14am by Ryan Wallner
Featured image by Unsplash+ in collaboration with Alex Shuper.
Dell Technologies sponsored this post.

Generative AI (or GenAI) is quickly evolving, becoming essential to many organizations. It’s moved beyond simple prediction to enhancing applications with code completion, automation, deep knowledge and expertise. Whether your use case is web-based chat, customer service, documentation search, content generation, image manipulation, infrastructure troubleshooting or countless other functions, GenAI promises to help us become more efficient problem solvers.

Kubernetes, which recently marked its 10th birthday, offers valuable features for running GenAI workloads. Over the years, Kubernetes and the cloud native community have been improving, integrating and automating numerous infrastructure layers to make the lives of admins, developers and operations professionals easier.

GenAI can take advantage of this work to build frameworks that work well on top of Kubernetes. For example, the Operator Framework is already being used to adopt GenAI within Kubernetes, as it allows building applications in an automated and scalable way.

Let’s look a little deeper at why Kubernetes makes a great home for building GenAI workloads.

Why Generative AI on Kubernetes Makes Sense

Kubernetes provides building blocks for any type of application. It provides workload scheduling, automation, observability, persistent storage, security, networking, high availability, node labeling and other capabilities that are crucial for GenAI and other applications.

Take, for instance, making a foundation model like Google’s Gemma or Meta’s Llama 2 available to worker nodes with graphics processing units (GPUs). Kubernetes’ built-in Container Storage Interface (CSI) driver mechanisms make it much simpler to expose persistent shared storage for a model so that inference engines can quickly load it into GPU memory.

Another example is running a vector database like Chroma within a retrieval-augmented generation (RAG) pipeline. Databases often need to remain highly available, and Kubernetes’ built-in scheduling capability coupled with CSI drivers can enable vector databases to move to different workers in the Kubernetes cluster. This is critical in case of node, network, zone and other failures, as it keeps your pipelines up and running with access to the embeddings.

Whether you’re looking at observability, networking or much more, Kubernetes is a suitable place for GenAI applications because of its “batteries included” architecture.

Enabling GPUs on Kubernetes

Upstream Kubernetes supports managing Intel, AMD and NVIDIA GPUs through its device plugin framework, as long as an administrator has provisioned the hardware and installed the necessary drivers on the nodes.

This, along with third-party integrations via plugins and operators, sets up Kubernetes with the essential building blocks needed for enabling GenAI workloads.

Vendor support, such as the Intel Device Plugins Operator and NVIDIA GPU Operator, can also help simplify administrative overhead. For instance, the NVIDIA GPU Operator helps manage the driver, CUDA runtime and container toolkit installation and life cycle without having to perform them separately.
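
As a rough illustration of how a workload consumes a GPU once a device plugin is running, the sketch below builds a Pod manifest that requests GPUs as an extended resource. The resource name `nvidia.com/gpu` is the one advertised by NVIDIA's device plugin. The pod name, the container image and the helper `gpu_pod_manifest` are hypothetical placeholders, not something taken from a particular vendor guide.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image, replace with your own
                    # GPUs are requested under limits as an extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-test:latest")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` on a cluster where the device plugin is installed.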

Deploying Models and Inference Engines

Enabling GPUs on a Kubernetes cluster is only a small part of the full GenAI puzzle. GPUs are needed to run GenAI models on Kubernetes; however, the full infrastructure layer includes other elements such as shared storage, inference engines, serving layers, embedding models, web apps and batch jobs that are needed to run a GenAI application.

Once a model is trained and available, it needs to be pulled into the Kubernetes environment. Many foundation models can be downloaded from Hugging Face, then loaded into the serving layer, which is part of the inference server or engine.

An inference engine or server, such as NVIDIA Triton Inference Server or Hugging Face Text Generation Inference (TGI), is software that interfaces with pretrained models. It loads and unloads models, handles requests to the model, returns results, monitors logs and versions, and more.

Inference engines and serving layers do not have to be run on Kubernetes, but that’s what I will focus on here. You can deploy Hugging Face TGI to Kubernetes via Helm, a Kubernetes application package manager. This Helm chart from Substratus AI is an example of how to deploy and make TGI available to a Kubernetes environment using a simple configuration file to define the model and GPU-labeled nodes.
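
Once TGI is running behind a Kubernetes Service, an application can call it over HTTP. The following is a minimal sketch that uses only the Python standard library and TGI's `/generate` endpoint, which takes an `inputs` string plus optional `parameters` and returns a `generated_text` field. The function names are hypothetical, and the Service URL is whatever your deployment exposes.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint (nothing is sent yet)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Send the prompt to a running TGI server and return the generated text."""
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]
```

From inside the cluster, a call might look like `generate("http://tgi.default.svc.cluster.local:8080", "What is Kubernetes?")`, where the host name and port are assumptions about how the Service was named and exposed.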

Data and Storage

Several types of data storage are required for running models and GenAI architectures, outside of the raw datasets that are fed into the training process.

For one, it is not realistic to copy large language models (LLMs), which can be gigabytes to terabytes in size, to every node after they’re downloaded to an environment. A better approach is shared storage, such as a performant shared file system served over Network File System (NFS). This enables a model to be loaded into shared storage and mounted to any node that may need to load and serve the model on an available GPU.
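
One way to picture the shared-storage approach is a single claim with the `ReadWriteMany` access mode that every inference pod mounts read-only. The sketch below builds that claim and the matching volume entries. The storage class name, claim name and mount path are assumptions for illustration, and the storage class must be backed by a CSI driver that supports shared access (an NFS-backed one, for example).

```python
import json


def shared_model_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that many nodes can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def model_volume(pvc_name: str, mount_path: str = "/models") -> tuple[dict, dict]:
    """Return the (volume, volumeMount) pair a pod spec needs to mount the claim read-only."""
    volume = {
        "name": "models",
        "persistentVolumeClaim": {"claimName": pvc_name, "readOnly": True},
    }
    mount = {"name": "models", "mountPath": mount_path, "readOnly": True}
    return volume, mount


if __name__ == "__main__":
    # "shared-nfs" is a hypothetical storage class name.
    print(json.dumps(shared_model_pvc("llm-models", "shared-nfs", "500Gi"), indent=2))
```

The model is downloaded into the claim once, and each serving pod adds the volume and mount returned by `model_volume` to its own spec.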

Another potential use case for data storage is running a RAG framework to supplement running models with external or more recent sources. RAG frameworks often use vectorized data and vector databases, and a block storage-based Persistent Volume (PV) and Persistent Volume Claim (PVC) in Kubernetes can improve availability of the vector database.
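
For the vector database case, a common Kubernetes pattern is a StatefulSet whose `volumeClaimTemplates` give each replica its own block-backed claim, so the data can be reattached if the pod is rescheduled to another node. The following is only a sketch of that pattern: the image, data path and storage class are hypothetical placeholders rather than the settings of any particular vector database.

```python
import json


def vector_db_statefulset(name: str, image: str, data_path: str,
                          storage_class: str, size: str) -> dict:
    """Build a single-replica StatefulSet with one ReadWriteOnce claim per pod."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Mount the per-pod claim where the database keeps its data.
                            "volumeMounts": [{"name": "data", "mountPath": data_path}],
                        }
                    ]
                },
            },
            # Kubernetes creates one PVC per replica from this template.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "storageClassName": storage_class,
                        "resources": {"requests": {"storage": size}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    sts = vector_db_statefulset("vector-db", "example.com/vector-db:latest",
                                "/data", "block-storage", "20Gi")
    print(json.dumps(sts, indent=2))
```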

Lastly, the application utilizing the model may need its own persistence to store user data, sessions and more. This will be highly dependent on the application and its data storage requirements. For example, a chatbot may store a specific user’s recent prompt queries to save the history for lookback.

RAG Frameworks

Another deployment scenario is implementing RAG or a context augmentation framework using tools such as LlamaIndex or LangChain. Deployed foundation models are typically trained on datasets at a point in time, so RAG or context augmentation can supply an LLM with additional, more recent context. These frameworks add a step in the query process that takes newly sourced data and feeds it, along with the user query, to the LLM.
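
The extra step in the query process can be pictured as simple prompt assembly: retrieved passages are placed ahead of the user's question before the text is sent to the model. This is a hand-rolled sketch, not the API of LlamaIndex or LangChain, and the function name and prompt wording are assumptions.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user question into one prompt string."""
    # Number each passage so the model (and a reader) can tell them apart.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_rag_prompt(
        "Which nodes have GPUs?",
        ["GPU nodes carry the label gpu=true.", "Backups run nightly."],
    ))
```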

For example, a model trained on a company’s documents can use a RAG framework to bring in documents created after the model was trained as context for a query. Data in a RAG framework is usually loaded, split into smaller chunks, converted into vectors (called embeddings) and stored in a vector database, such as Chroma, pgvector or Milvus. These embeddings can represent diverse types of data including text, audio and images.
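
To make the load, chunk and embed flow concrete, here is a small sketch that uses only the standard library. `toy_embed` is a deliberately crude stand-in (a hashed bag of words) for a real embedding model, and the in-memory list returned by `index_document` stands in for a vector database. All names and default sizes are assumptions.

```python
import hashlib
import math


def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of max_words that share `overlap` words with the previous window."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(chunk: str, dim: int = 64) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def index_document(text: str) -> list[tuple[str, list[float]]]:
    """Return (chunk, vector) records, the shape of data a vector database would store."""
    return [(chunk, toy_embed(chunk)) for chunk in chunk_text(text)]
```

The overlap between neighboring chunks is a common choice so that a sentence split across a boundary still appears whole in at least one chunk.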

RAG frameworks can retrieve relevant information from the embeddings, and the model can use them as additional context in its generative response. Vector data is often more condensed and smaller than the model, but it can still benefit from using persistent storage.
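
Retrieval itself usually comes down to a nearest-neighbor search over stored vectors. The sketch below ranks records by cosine similarity in plain Python. A real vector database would use an index instead of a full scan, and the three-dimensional vectors here are hand-written placeholders, not real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda record: cosine(query_vec, record[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hand-written vectors stand in for embeddings stored in a vector database.
INDEX = [
    ("GPU nodes are labeled gpu=true", [0.9, 0.1, 0.0]),
    ("Backups run nightly", [0.0, 0.2, 0.9]),
    ("The GPU operator installs drivers", [0.8, 0.3, 0.1]),
]

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], INDEX))
```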

Using Kubernetes to run stateful workloads is nothing new. Existing projects such as Postgres can add the pgvector extension to a Postgres cluster deployed via CloudNativePG using a PVC. PVCs give databases persistent storage that can follow them as pods move around a Kubernetes cluster, which supports high availability. This can be important for the health of the RAG framework in case of failures or pod life cycle events.
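
A hedged sketch of what this could look like: a CloudNativePG `Cluster` resource with PVC-backed storage, plus the SQL that pgvector uses to store and search embeddings. The cluster name, table name and vector dimension are hypothetical, and the sketch assumes the Postgres image in use ships the pgvector extension.

```python
import json


def cnpg_cluster(name: str, instances: int, size: str) -> dict:
    """Build a CloudNativePG Cluster; the operator creates one PVC per instance."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": name},
        "spec": {"instances": instances, "storage": {"size": size}},
    }


# Schema for a hypothetical "chunks" table holding 3-dimensional embeddings.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(3)
);
"""


def vector_literal(vec: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


def nearest_chunks_sql(k: int) -> str:
    """Query for the k nearest chunks by cosine distance (pgvector's <=> operator).

    %s is a placeholder for the query vector, to be bound by a Postgres driver.
    """
    return f"SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT {int(k)};"


if __name__ == "__main__":
    print(json.dumps(cnpg_cluster("rag-postgres", 3, "10Gi"), indent=2))
    print(SCHEMA_SQL)
    print(nearest_chunks_sql(5), vector_literal([1, 0.5, 0]))
```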

Conclusion

Kubernetes provides a GenAI toolbox that supports compute scheduling, third-party operators, storage integrations, GPU enablement, security frameworks, monitoring and logging, application life cycle management and more. These are all considerable tactical advantages to using Kubernetes as a platform for GenAI.

In the end, using Kubernetes as the platform for your GenAI application extends the advantages it provides for operators, engineers, DevOps professionals and application developers to the deployment and usability of GenAI infrastructure and applications.

Ryan Wallner is lead developer advocate at Dell Technologies and host of the Kubernetes Bytes podcast. Ryan is a cloud native and Kubernetes enthusiast, husband, and dad of a fearless daughter. Ryan enjoys adventure moto riding, hiking and mountain biking.