URL: https://volcano.sh/

Volcano

Cloud native batch scheduling system for compute-intensive workloads

Why Volcano

Unified Scheduling

Supports integrated job scheduling for both Kubernetes native workloads and mainstream computing frameworks such as TensorFlow, Spark, PyTorch, Ray, and Flink.

Queue Management

Provides multi-level queue management capabilities, enabling fine-grained resource quota control and task priority scheduling.

Heterogeneous Device Support

Efficiently schedules heterogeneous devices such as GPUs and NPUs, making full use of the hardware's computing capacity.

Network Topology Aware Scheduling

Greatly enhances model training efficiency in distributed AI training scenarios.

Multi-cluster Scheduling

Supports cross-cluster job scheduling, improving resource pool management and achieving large-scale load balancing.

Online and Offline Workload Colocation

Enables colocation of online and offline workloads, improving cluster resource utilization through intelligent colocation scheduling.

Load Aware Descheduling

Optimizes cluster load distribution and enhances system stability.

Multiple Scheduling Policies

Supports various scheduling strategies, including Gang scheduling, Fair-Share, Binpack, DeviceShare, NUMA-aware scheduling, and Task Topology.
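
Gang scheduling, the first strategy named above, is an all-or-nothing policy: a job starts only if its minimum number of tasks can be placed together, so a partially placed job never holds resources while waiting for the rest. The following is a minimal Python sketch of that idea, not Volcano's implementation. The names Job, min_available, cpu_per_task, and gang_schedule are hypothetical, and the first-fit placement over a single CPU resource is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job model used only for this illustration.
    name: str
    min_available: int  # minimum number of tasks that must start together
    cpu_per_task: int   # CPU units requested by each task


def gang_schedule(jobs, free_cpu_per_node):
    """Admit each job only if all min_available tasks fit at once."""
    free = list(free_cpu_per_node)  # copy so the caller's list is untouched
    placements = {}
    for job in jobs:
        trial = list(free)  # tentative allocation for this job only
        assigned = []
        for _ in range(job.min_available):
            # First-fit (an assumption): take the first node with enough CPU.
            node = next(
                (i for i, cpu in enumerate(trial) if cpu >= job.cpu_per_task),
                None,
            )
            if node is None:
                break
            trial[node] -= job.cpu_per_task
            assigned.append(node)
        if len(assigned) == job.min_available:
            free = trial  # commit the whole gang
            placements[job.name] = assigned
        # Otherwise the trial is discarded: no partial placement is kept.
    return placements, free
```

With two nodes of 4 CPUs each, a job that needs three 4-CPU tasks is rejected as a whole, which leaves room for a smaller job behind it instead of stranding two idle tasks.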

Rich Framework Support

Seamlessly integrates with mainstream computing frameworks for AI, big data, and scientific computing.

Spark

Apache Spark™ is a unified analytics engine for large-scale data processing

Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams

TensorFlow

An end-to-end open source machine learning platform

PyTorch

An open source machine learning framework that accelerates the path from research prototypes to production deployment

Argo

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

MindSpore

The all-scenario deep learning framework developed by Huawei.

Ray

Ray is a high-performance distributed computing framework that supports machine learning, deep learning, and distributed applications.

Kubeflow

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.

Open MPI

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.

Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

MXNet

A truly open source deep learning framework suited for flexible research prototyping and production.

PaddlePaddle

PaddlePaddle is an open source deep learning platform derived from industrial practice initiated by Baidu.

Recent Posts

30 May 2026

Volcano v1.15 Released: Gang-Granularity Preemption, DRA Queue Quota, and More Scheduling Enhancements

New Features: Gang-Aware Preemption and Resource Reclamation, DRA Queue Quota in Capacity Plugin, Pluggable Multi-Sharding Policies, Volcano Benchmark and Performance Observability, Scheduling Gates for Queue Admission, Kubernetes 1.35 support, and more

30 Jan 2026

Volcano v1.14 Released: Entering a New Era of Unified AI Scheduling

New Features: Agent Scheduler for AI Agent workloads, Dynamic Node Sharding, Network Topology-Aware Scheduling, NPU and vNPU support, CPU Burst and Cgroup V2 support, and more

6 Jan 2026

Introducing Kthena: Redefining LLM Inference for the Cloud-Native Era

Kthena is a Kubernetes-native, high-performance LLM inference routing and orchestration system. It improves GPU/NPU utilization and reduces latency with topology-aware scheduling, KV Cache-aware routing, and Prefill-Decode disaggregation.

29 Sep 2025

Volcano v1.13 Released: Comprehensive Enhancement of Scheduling Capabilities for LLM Training and Inference

New Features: LeaderWorkerSet support for large model inference, Cron VolcanoJob, Label-based HyperNode auto-discovery, Native Ray framework support, HCCL plugin support, Enhanced NodeGroup functionality, ResourceStrategyFit plugin, Colocation decoupled from OS, Custom oversubscription resource names, Kubernetes v1.33 support, and more

13 Jun 2025

iFlytek Enhances AI Infrastructure with Volcano, Wins CNCF End-User Case Study Award

iFlytek was awarded for its innovative use of Volcano in the CNCF End-User Case Study Competition and shared its success in large-scale AI model training at KubeCon + CloudNativeCon China 2025.