URL: https://thenewstack.io/choose-the-right-storage-engine-for-kubeflow-and-ml-workloads/

Choose the Right Storage Engine for Kubeflow and ML Workloads - The New Stack



Portworx by Pure Storage can serve as the optimal storage engine for Kubeflow.
Apr 23rd, 2021 12:45pm by Janakiram MSV
Feature image by andreasmetallerreni on Pixabay.
Portworx by Pure Storage sponsored this post.

Kubeflow is one of the unique workloads designed for Kubernetes. This platform abstracts the underpinnings of Kubernetes by exposing a set of integrated functionalities for data scientists, developers, machine learning engineers, and operators. It’s also unique because of the prerequisites it has for running a robust, cloud native, and enterprise-ready machine learning platform.

Like any other mature application designed for Kubernetes, Kubeflow relies heavily on the storage layer to achieve high availability and deliver the expected performance.

There are many open source and commercially available storage engines for Kubernetes that can be used with Kubeflow. From Ceph/Rook to Red Hat's GlusterFS to good old NFS, customers can choose from a variety of options. But no single one of these storage layers meets all the requirements of running the Kubeflow platform and its diverse set of components, such as Notebook Servers, Pipelines, and KFServing.

When you use Kubeflow, you are expected to meet the storage requirements of the platform and the ML jobs that you run through Jupyter Notebooks, Pipelines, Katib, and KFServing. It’s important to know that the Kubeflow platform and the ML jobs have distinct storage requirements.

Let’s take a closer look at the storage configuration that these two layers need: the Kubeflow platform itself, and the custom jobs that users create and run on the platform.

Storage Prerequisites for Kubeflow Platform

Kubeflow is a comprehensive stack assembled from a variety of open source components and projects. The platform is based on Argo Workflows, Istio, JupyterHub, Knative, MinIO, MySQL, and Seldon.

There are multiple operators, CRDs, and Kubernetes objects that integrate these diverse open source projects to deliver the platform capabilities. For example, the tf-job-operator, pytorch-operator, and mxnet-operator are a combination of custom resources and operators that can run distributed training jobs.

Below is a subset of CRDs and operators created by Kubeflow:

[Images: a subset of the CRDs and operators created by Kubeflow]

Kubeflow’s CRDs and operators depend on some of the stateful services deployed as Kubernetes statefulsets and deployments with external PVCs.

Kubeflow needs a storage class that supports dynamic provisioning to create the PVCs on the fly.

Stateful services such as MySQL and MinIO need a persistent volume (PV) and persistent volume claim (PVC) backed by a high throughput storage layer.

When you run kubectl get pv immediately after installing Kubeflow, you see the persistent volumes created for MySQL and MinIO.

[Image: output of kubectl get pv showing the MySQL and MinIO persistent volumes]

The PVs are bound to the PVCs attached to the pods running within the kubeflow namespace.

[Image: PVCs in the kubeflow namespace]
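To check these bindings without reading the table by eye, the JSON form of the same command can be filtered in a few lines. The sketch below is a minimal example, not part of the original walkthrough: the sample document is hypothetical (volume names, claim names, and sizes are made up) and only mirrors the fields that kubectl get pv -o json is known to return.

```python
import json

# Hypothetical sample shaped like the output of `kubectl get pv -o json`.
# The volume names, claim names, and sizes are illustrative assumptions.
SAMPLE_PV_JSON = """
{
  "items": [
    {"metadata": {"name": "pvc-aaa"},
     "spec": {"capacity": {"storage": "20Gi"},
              "accessModes": ["ReadWriteOnce"],
              "claimRef": {"namespace": "kubeflow", "name": "mysql-pv-claim"}},
     "status": {"phase": "Bound"}},
    {"metadata": {"name": "pvc-bbb"},
     "spec": {"capacity": {"storage": "20Gi"},
              "accessModes": ["ReadWriteOnce"],
              "claimRef": {"namespace": "kubeflow", "name": "minio-pvc"}},
     "status": {"phase": "Bound"}},
    {"metadata": {"name": "pvc-ccc"},
     "spec": {"capacity": {"storage": "5Gi"},
              "accessModes": ["ReadWriteOnce"]},
     "status": {"phase": "Available"}}
  ]
}
"""


def bound_claims(pv_list, namespace="kubeflow"):
    """Return (pv, claim, capacity, access modes) for PVs bound in `namespace`."""
    rows = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        claim = spec.get("claimRef") or {}
        is_bound = pv.get("status", {}).get("phase") == "Bound"
        if is_bound and claim.get("namespace") == namespace:
            rows.append((
                pv["metadata"]["name"],
                claim["name"],
                spec.get("capacity", {}).get("storage"),
                ",".join(spec.get("accessModes", [])),
            ))
    return rows


if __name__ == "__main__":
    for row in bound_claims(json.loads(SAMPLE_PV_JSON)):
        print(row)
```

On a real cluster, the sample string would be replaced by the captured output of kubectl get pv -o json.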

These PVCs are utilized by the pods highlighted in the screenshot below.

[Image: pods in the kubeflow namespace that use these PVCs]

To ensure that the stateful services get the expected throughput and I/O, they need a high-performance storage layer. The other important aspect is that the stateful services are not configured as statefulsets. They are normal deployments backed by a normal PVC.
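The shape of such a service, a plain Deployment that mounts one dynamically provisioned PVC, can be sketched as follows. This is an illustration under stated assumptions rather than the manifests Kubeflow ships: the claim name, image tag, size, and mount path are hypothetical, and the container is missing the environment settings a real MySQL pod needs. The PVC omits storageClassName on purpose so that the cluster's default storage class provisions the volume.

```python
import json


def pvc_manifest(name, namespace, size, storage_class=None):
    """Build a ReadWriteOnce PVC; with no class set, the default class is used."""
    spec = {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def deployment_manifest(name, namespace, image, claim_name, mount_path):
    """Build a single-replica Deployment that mounts an existing PVC."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            # Recreate avoids two pods competing for one RWO volume on rollout.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "volumeMounts": [
                            {"name": "data", "mountPath": mount_path},
                        ],
                    }],
                    "volumes": [{
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": claim_name},
                    }],
                },
            },
        },
    }


# Hypothetical names and sizes, chosen only for illustration.
claim = pvc_manifest("mysql-pv-claim", "kubeflow", "20Gi")
deploy = deployment_manifest(
    "mysql", "kubeflow", "mysql:8.0", "mysql-pv-claim", "/var/lib/mysql"
)
manifest_list = {"apiVersion": "v1", "kind": "List", "items": [claim, deploy]}
print(json.dumps(manifest_list, indent=2))
```

Because the pod is not managed by a statefulset, the stability of its data depends entirely on the PVC and on the storage layer behind it, which is why the choice of storage engine matters here.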

If you configure Kubeflow with shared filesystems such as NFS and GlusterFS, you may not get the expected throughput.

The key takeaway is that the Kubeflow platform layer needs a highly available, performant, and reliable storage engine that can deliver the throughput and I/O performance of write-intensive workloads such as MySQL and MinIO.

Storage Requirements for Machine Learning Jobs running on Kubeflow

Now, let’s take a look at a typical use case of Kubeflow — multiple teams within an organization leveraging the Notebook Server to build and deploy a deep learning model.

It all starts with the DevOps team building the container images for the individual teams: data scientists, ML engineers, and developers. The data science team prepares the data and performs feature engineering. The final dataset is stored in a shared location that is accessible to the ML engineers who train and tune the model. The trained model is persisted to another shared location, which is used by the developers building the model serving and inference application.

Portworx is the leading provider of persistent storage for containers and is used in production by healthcare, global manufacturing, and telecom members of the Fortune Global 500 and other great companies. Learn about Portworx solutions for Kubernetes storage, DCOS storage & more at portworx.com.

The screenshot below shows the three Notebook Servers that the DevOps team has created, one for each team.

[Image: the three Notebook Servers]

The Notebook Servers are created under a dedicated Kubernetes namespace. In this example, they are a part of the mldemo namespace. Notice how each Notebook Server runs as its own statefulset.

[Image: statefulsets in the mldemo namespace]

The pods dataprep-0, train-0, and infer-0 belong to their respective Notebook Servers running in Kubeflow.

[Image: the pods dataprep-0, train-0, and infer-0]

Each Notebook Server instance has a dedicated PVC in ReadWriteOnce (RWO) mode that becomes the home directory of the user. To enable sharing of artifacts such as datasets and models, each Notebook Server is also associated with a shared PVC in ReadWriteMany (RWX) mode, which can be mounted for reading and writing by many pods at the same time.

To support this scenario, we create two shared PVCs beforehand and attach them to the Notebook Servers.

[Image: the two shared PVCs]
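The volume layout described above can be sketched as a set of PVC manifests: one RWO claim per Notebook Server and two RWX claims shared by all of them. The namespace and team names come from the example; the claim names, sizes, and storage class names are hypothetical placeholders, not values taken from the screenshots.

```python
import json


def pvc_manifest(name, namespace, size, access_mode, storage_class):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


NAMESPACE = "mldemo"
TEAMS = ["dataprep", "train", "infer"]

# Assumed storage class names; substitute whatever the cluster defines.
DEDICATED_CLASS = "example-rwo-class"
SHARED_CLASS = "example-rwx-class"

# One private home-directory volume per Notebook Server (single writer).
home_pvcs = [
    pvc_manifest(f"workspace-{team}", NAMESPACE, "10Gi",
                 "ReadWriteOnce", DEDICATED_CLASS)
    for team in TEAMS
]

# Two volumes shared by every Notebook Server (many writers).
shared_pvcs = [
    pvc_manifest("datasets", NAMESPACE, "50Gi", "ReadWriteMany", SHARED_CLASS),
    pvc_manifest("models", NAMESPACE, "20Gi", "ReadWriteMany", SHARED_CLASS),
]

manifest_list = {
    "apiVersion": "v1",
    "kind": "List",
    "items": home_pvcs + shared_pvcs,
}
print(json.dumps(manifest_list, indent=2))
```

The JSON that this prints can be piped to kubectl apply -f - to create the claims before the Notebook Servers are launched.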

The shared PVs are attached to each Notebook Server when it is created.

[Image: attaching the shared volumes while creating a Notebook Server]

With this approach, DevOps can enable a shared environment for all the teams to collaborate effectively. Shared volumes are one of the critical requirements for Kubeflow applications and ML jobs.

Since the majority of the cloud native storage engines don’t deliver shared volumes out of the box, customers end up using NFS or GlusterFS for Kubeflow.

We will explore this concept further in the upcoming MLOps tutorial based on Notebook Servers and Kubeflow Pipelines.

Portworx By Pure Storage for Kubeflow

As we have seen, Kubeflow needs a combination of storage engines — a high throughput, reliable backend for running stateful components, and a shared storage layer for the jobs running on Kubeflow.

Portworx by Pure Storage is a cloud native, container-granular, enterprise-grade storage engine for Kubernetes. It stands out among storage platforms for combining capabilities such as replication, encryption, shared volumes, and built-in high availability and failover.

For Kubeflow, Portworx by Pure Storage becomes the natural choice due to the following reasons:

  • Dynamic storage class optimized for running databases that need high availability and throughput
  • Built-in replication and high availability for regular stateful pods, without the need to configure statefulsets
  • Sharedv4 volumes provide the out-of-the-box capability to create multi-writer shared volumes

For stateful services such as MySQL, MinIO, and Jupyter Notebooks, the following Portworx storage class delivers all the expected capabilities.

[Image: Portworx storage class for stateful services]

Since the storage class is annotated as the default, any dynamically provisioned PVC that does not name a storage class is automatically based on it.

The repl parameter keeps three copies of the data, which brings high availability. The io_profile parameter selects a write-back flush coalescing algorithm, which assumes that replicas do not fail (kernel panic or power loss) simultaneously within a 50 ms window.
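Since the original manifest is only available as an image, here is a hedged reconstruction of what such a storage class might look like. The class name, the in-tree provisioner name kubernetes.io/portworx-volume, and the io_profile value db_remote are assumptions (CSI-based installs use a different provisioner); only the default-class annotation, repl, and io_profile are named in the text.

```python
import json


def portworx_storage_class(name, repl, io_profile, default=False):
    """Build a StorageClass manifest; parameter values must be strings."""
    metadata = {"name": name}
    if default:
        # Standard Kubernetes annotation that marks the default class.
        metadata["annotations"] = {
            "storageclass.kubernetes.io/is-default-class": "true"
        }
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": metadata,
        # Assumption: in-tree Portworx provisioner; CSI installs differ.
        "provisioner": "kubernetes.io/portworx-volume",
        "parameters": {
            "repl": str(repl),         # number of replicas of each volume
            "io_profile": io_profile,  # assumed value, see lead-in
        },
    }


# "portworx-sc" is a hypothetical name used only for this sketch.
default_sc = portworx_storage_class(
    "portworx-sc", repl=3, io_profile="db_remote", default=True
)
print(json.dumps(default_sc, indent=2))
```

StorageClass parameters are a string-to-string map, which is why the replica count is converted with str() rather than passed as an integer.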

For provisioning shared volumes, we create a separate storage class that enables the sharedv4 option.

[Image: Portworx storage class for sharedv4 volumes]

The PVCs based on the above storage class support RWX mode, making it possible to share ML artifacts across teams and Notebook Servers.
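A shared class and a claim against it might look like the sketch below. As before, this is an approximation of the missing screenshot: the class name, claim name, size, replica count, and provisioner are assumptions, while sharedv4 and the RWX access mode come from the text.

```python
import json

# Assumed names; "portworx-shared-sc" and "datasets" are illustrative only.
shared_sc = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "portworx-shared-sc"},
    # Assumption: in-tree Portworx provisioner; CSI installs differ.
    "provisioner": "kubernetes.io/portworx-volume",
    "parameters": {
        "repl": "3",         # assumed replica count
        "sharedv4": "true",  # enables multi-writer shared volumes
    },
}

shared_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "datasets", "namespace": "mldemo"},
    "spec": {
        # RWX lets several Notebook Server pods mount the volume read-write.
        "accessModes": ["ReadWriteMany"],
        "storageClassName": shared_sc["metadata"]["name"],
        "resources": {"requests": {"storage": "50Gi"}},
    },
}

manifest_list = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [shared_sc, shared_pvc],
}
print(json.dumps(manifest_list, indent=2))
```

The claim refers to the class by name rather than relying on the default, so dedicated RWO volumes and shared RWX volumes can coexist in the same cluster.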

Portworx by Pure Storage is the only storage platform in the market that provides seamless support for dedicated volumes (RWO) and shared volumes (RWX) with no compromise in performance and throughput.

Check back next Friday for the next part of this series, where I will walk you through the steps involved in integrating Portworx Essentials, the free container-native storage engine from Portworx by Pure Storage with NVIDIA DeepOps. Stay tuned!

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...