URL: https://thenewstack.io/kserve-a-robust-and-extensible-cloud-native-model-server/

KServe: A Robust and Extensible Cloud Native Model Server - The New Stack


KServe is collaboratively developed by Google, IBM, Bloomberg, NVIDIA, and Seldon as an open source, cloud native model server for Kubernetes.
Mar 18th, 2022 9:47am by Janakiram MSV
Feature Image by Alexas_Fotos from Pixabay.

If you are familiar with Kubeflow, you know KFServing as the platform’s model server and inference engine. In September 2021, the KFServing project was transformed into KServe.

Beyond the name change, KServe has graduated from the Kubeflow project and is now an independent component. The separation allows KServe to evolve as a separate, cloud native inference engine deployed as a standalone model server. It will continue to integrate tightly with Kubeflow, but the two will be treated and maintained as independent open source projects.

For a brief overview of the model server, refer to one of my previous articles at The New Stack.

KServe is collaboratively developed by Google, IBM, Bloomberg, NVIDIA, and Seldon as an open source, cloud native model server for Kubernetes. The most recent version, 0.8, focuses squarely on transforming the model server into a standalone component, with changes to the taxonomy and nomenclature.

Let’s understand the core capabilities of KServe.

A model server is to machine learning models what an application server is to code binaries. Both provide the runtime and execution context for deployments. KServe, as a model server, provides the foundation for serving machine learning and deep learning models at scale.

KServe can be deployed as a traditional Kubernetes deployment or as a serverless deployment with support for scale-to-zero. In serverless mode, it relies on Knative Serving, which comes with automatic scale-up and scale-down capabilities. Istio is used as the ingress to expose the service endpoints to API consumers. The combination of Istio and Knative Serving enables scenarios such as blue/green and canary deployments of models.

[Figure: KServe architecture diagram]

The RawDeployment mode, which lets you use KServe without Knative Serving, supports traditional scaling techniques such as the Horizontal Pod Autoscaler (HPA) but lacks support for scale-to-zero.
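
To make the two deployment modes concrete, here is a minimal sketch that builds an InferenceService manifest as a plain Python dictionary. The service name and the bucket path are hypothetical placeholders, and the helper function is my own illustration rather than part of any KServe SDK. Because the canary scenario above is attributed to the Knative and Istio combination, the sketch assumes a canary percentage and RawDeployment should not be combined.

```python
import json
from typing import Optional


def build_inference_service(
    name: str,
    storage_uri: str,
    raw_deployment: bool = False,
    canary_percent: Optional[int] = None,
) -> dict:
    """Return an InferenceService manifest as a dict (illustrative helper)."""
    metadata = {"name": name}
    if raw_deployment:
        # Annotation that asks KServe to skip Knative Serving for this service.
        metadata["annotations"] = {
            "serving.kserve.io/deploymentMode": "RawDeployment"
        }

    # A scikit-learn predictor that pulls its model from the model store.
    predictor = {"sklearn": {"storageUri": storage_uri}}

    if canary_percent is not None:
        if raw_deployment:
            # Assumption of this sketch: canary rollouts are a serverless feature.
            raise ValueError("canary_percent assumes the serverless mode")
        if not 0 <= canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        # Share of traffic routed to the newest revision of the predictor.
        predictor["canaryTrafficPercent"] = canary_percent

    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": metadata,
        "spec": {"predictor": predictor},
    }


if __name__ == "__main__":
    # Hypothetical name and bucket, used only for illustration.
    manifest = build_inference_service(
        "sklearn-iris", "s3://example-bucket/models/iris", canary_percent=10
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, since Kubernetes accepts JSON manifests as well as YAML.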

KServe Architecture

The KServe model server has a control plane and a data plane. The control plane manages and reconciles the custom resources responsible for inference. In serverless mode, it coordinates with Knative resources to manage autoscaling.

[Figure: KServe control plane]

At the heart of the KServe control plane is the KServe Controller, which manages the lifecycle of an inference service. It is responsible for creating the service, the ingress resources, the model server container, and the model agent container, which handles request/response logging, batching, and pulling models from the model store. The model store is a repository of models registered with the model server. It is typically an object storage service such as Amazon S3, Google Cloud Storage, Azure Storage, or MinIO.
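
A model in the store is usually addressed by a storage URI. The sketch below shows one way to split such a URI into the bucket and object key that a model puller would need. It only handles the s3:// and gs:// schemes, and the ModelLocation type, the parse_model_uri function, and the bucket name are hypothetical names of my own, not KServe code.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Schemes this sketch understands, mapped to the backing service.
SUPPORTED_SCHEMES = {
    "s3": "Amazon S3 or an S3-compatible store such as MinIO",
    "gs": "Google Cloud Storage",
}


@dataclass(frozen=True)
class ModelLocation:
    scheme: str  # "s3" or "gs"
    bucket: str  # bucket that holds the model artifacts
    key: str     # object prefix inside the bucket


def parse_model_uri(uri: str) -> ModelLocation:
    """Split a storage URI into scheme, bucket, and key."""
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme in {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket in {uri!r}")
    return ModelLocation(
        scheme=parsed.scheme,
        bucket=parsed.netloc,
        key=parsed.path.lstrip("/"),
    )


if __name__ == "__main__":
    # Hypothetical bucket and path.
    print(parse_model_uri("s3://example-bucket/models/iris/v1"))
```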

The data plane manages the request/response cycle targeting a specific model. It has predictor, transformer, and explainer components.

An AI application sends a REST or gRPC request to the predictor endpoint. The predictor acts as an inference pipeline that invokes the transformer component, which can perform pre-processing of the inbound data (request) and post-processing of outbound data (response). Optionally, there may be an explainer component to bring AI explainability to the hosted models. KServe encourages the use of the V2 protocol, which is interoperable and extensible.
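
The flow of pre-processing, prediction, and post-processing can be sketched in a few lines of Python. The request and response bodies follow the JSON shape of the V2 protocol (an "inputs" or "outputs" list whose entries carry name, shape, datatype, and data). Everything else is an assumption for illustration: the predictor is a toy stand-in that averages each row, the tensor names are made up, and the order in which the components call each other is simplified.

```python
from typing import Callable, List


def preprocess(rows: List[List[float]]) -> List[List[float]]:
    """Scale raw 0-255 values into the 0-1 range."""
    return [[value / 255.0 for value in row] for row in rows]


def to_v2_request(rows: List[List[float]], input_name: str = "input-0") -> dict:
    """Wrap a 2-D batch in a V2 inference request body."""
    n_cols = len(rows[0]) if rows else 0
    flat = [value for row in rows for value in row]  # row-major order
    return {
        "inputs": [
            {
                "name": input_name,  # hypothetical tensor name
                "shape": [len(rows), n_cols],
                "datatype": "FP32",
                "data": flat,
            }
        ]
    }


def toy_predictor(request: dict) -> dict:
    """Stand-in for a real model: returns the mean of each input row."""
    tensor = request["inputs"][0]
    n_rows, n_cols = tensor["shape"]
    data = tensor["data"]
    means = [
        sum(data[i * n_cols:(i + 1) * n_cols]) / n_cols for i in range(n_rows)
    ]
    return {
        "model_name": "toy",
        "outputs": [
            {"name": "output-0", "shape": [n_rows, 1],
             "datatype": "FP32", "data": means}
        ],
    }


def postprocess(response: dict) -> List[str]:
    """Turn raw scores into human-readable labels."""
    scores = response["outputs"][0]["data"]
    return ["bright" if score >= 0.5 else "dark" for score in scores]


def handle(rows: List[List[float]], predictor: Callable[[dict], dict]) -> List[str]:
    """Run the full pre-process, predict, post-process cycle."""
    return postprocess(predictor(to_v2_request(preprocess(rows))))


if __name__ == "__main__":
    print(handle([[255, 255], [0, 0]], toy_predictor))
```

In a real deployment, the predictor call would be a REST or gRPC request to the model server rather than a local function.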

The data plane also has endpoints to check the readiness and health of models. It also exposes APIs for retrieving model metadata.
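
For reference, the REST paths that the V2 protocol defines for these checks can be generated with a small helper. The paths follow the V2 specification. The function name, the model name, and the version in the usage line are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import quote


def v2_paths(model_name: str, version: Optional[str] = None) -> Dict[str, str]:
    """Return the V2 REST paths for a model (optionally a specific version)."""
    model = f"/v2/models/{quote(model_name, safe='')}"
    if version is not None:
        model += f"/versions/{quote(version, safe='')}"
    return {
        "server_live": "/v2/health/live",    # is the server process up?
        "server_ready": "/v2/health/ready",  # can the server accept requests?
        "model_ready": f"{model}/ready",     # is this model loaded?
        "model_metadata": model,             # inputs, outputs, platform
        "infer": f"{model}/infer",           # POST inference requests here
    }


if __name__ == "__main__":
    for purpose, path in v2_paths("sklearn-iris", version="1").items():
        print(f"{purpose:15s} {path}")
```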

Supported Frameworks and Runtimes

KServe supports a wide range of machine learning and deep learning frameworks. Deep learning frameworks and runtimes work with existing serving infrastructures such as TensorFlow Serving, TorchServe, and Triton Inference Server. KServe can host TensorFlow, ONNX, PyTorch, and TensorRT runtimes through Triton.

For classical machine learning models based on SKLearn, XGBoost, Spark MLlib, and LightGBM, KServe relies on Seldon’s MLServer.

The extensible framework of KServe makes it possible to plug in any runtime that adheres to the V2 inference protocol.
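
The pairings described in this section can be summarized as a simple lookup table. This is only a restatement of the text above in code form, not KServe's actual runtime selection logic, and the function name and the lowercase framework keys are my own.

```python
from typing import Dict, List

# Framework -> serving runtimes, as paired in this section.
RUNTIMES: Dict[str, List[str]] = {
    "tensorflow": ["TensorFlow Serving", "Triton Inference Server"],
    "pytorch": ["TorchServe", "Triton Inference Server"],
    "onnx": ["Triton Inference Server"],
    "tensorrt": ["Triton Inference Server"],
    "sklearn": ["MLServer"],
    "xgboost": ["MLServer"],
    "spark-mllib": ["MLServer"],
    "lightgbm": ["MLServer"],
}


def runtimes_for(framework: str) -> List[str]:
    """Look up the runtimes listed for a framework (case-insensitive)."""
    key = framework.strip().lower()
    if key not in RUNTIMES:
        raise KeyError(f"no runtime listed for {framework!r}")
    return RUNTIMES[key]


if __name__ == "__main__":
    print(runtimes_for("PyTorch"))
```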

Multimodel Serving with ModelMesh

KServe deploys one model per inference service, limiting the platform’s scalability to the available CPUs and GPUs. This limitation becomes obvious when running inference on GPUs, which are expensive and scarce compute resources.

With multimodel serving, we can overcome the limitations of the infrastructure: compute resources, maximum pods, and maximum IP addresses.

ModelMesh Serving, developed by IBM, is a Kubernetes-based platform for real-time serving of ML/DL models, optimized for high volume/density use cases. Similar to an operating system that manages processes to optimally utilize the available resources, ModelMesh optimizes the deployed models to run efficiently within the cluster.

[Figure: ModelMesh Serving diagram]

Through intelligent management of in-memory model data across clusters of deployed pods, and the usage of those models over time, the system maximizes the use of available cluster resources.
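
To give a feel for the idea of keeping only recently used models in memory, here is a small least-recently-used (LRU) cache. It is a deliberately simplified, single-process illustration and not ModelMesh's actual placement algorithm. The ModelCache class and the loader callback are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Keep at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loads = 0  # how many times a model had to be loaded
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str, loader: Callable[[str], Any]) -> Any:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: load the model, for example from the model store.
        model = loader(model_id)
        self.loads += 1
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            # Evict the model that has gone unused the longest.
            self._models.popitem(last=False)
        return model

    def resident(self) -> List[str]:
        """Model ids currently in memory, least recently used first."""
        return list(self._models)


if __name__ == "__main__":
    cache = ModelCache(capacity=2)
    for model_id in ["a", "b", "a", "c"]:
        cache.get(model_id, loader=lambda mid: f"weights-for-{mid}")
    print(cache.resident(), cache.loads)
```

With room for two models, the request sequence a, b, a, c evicts b, because a was touched more recently before c arrived.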

ModelMesh Serving is based on the KServe V2 data plane API for inferencing, which makes it possible to deploy it as a runtime similar to NVIDIA Triton Inference Server. When a request hits the KServe data plane, it is simply delegated to ModelMesh Serving.

The integration of ModelMesh Serving with KServe is currently in alpha. As both projects mature, tighter integration will make it possible to mix and match the features and capabilities of both platforms.

With model serving becoming the core building block of MLOps, open source projects such as KServe become important. The extensibility of KServe to use existing and upcoming runtimes makes it a unique model serving platform.

In upcoming articles, I will walk you through the steps of deploying KServe on a GPU-based Kubernetes cluster to perform inference on a TensorFlow model. Stay tuned.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
IBM is a sponsor of The New Stack.