Slurm is the default job scheduler at Meta, most national supercomputing centers, and nearly every university HPC cluster running serious AI workloads, and it is used in production at AI labs including Mistral. Yet almost every GPU cloud tutorial assumes Kubernetes. If you are migrating from an on-prem HPC cluster, or just want a simpler batch scheduling model for training jobs, this guide shows you how to run Slurm on cloud GPU nodes: architecture, cluster setup, sbatch patterns for multi-node LLM training, Pyxis containers, topology-aware scheduling, and cost optimization with spot instances.
Why Frontier Labs Run Slurm
The core reason is gang scheduling. When you submit a 4-node training job to Slurm, all four nodes start simultaneously or none start. The job does not begin until the full allocation is available. This matters because distributed training is intolerant of partial starts: a PyTorch dist.barrier() call blocks indefinitely if one rank never shows up.
Kubernetes does not have native gang scheduling. It schedules pods independently, and a 4-node training job can end up with 3 pods running and the fourth stuck in pending because no node has free GPUs. This causes silent hangs and wasted billing time. KAI Scheduler and Volcano add gang scheduling to Kubernetes, but they add operational complexity that Slurm users do not need.
Beyond gang scheduling:
- Fair-share scheduling. Slurm's multifactor priority system automatically decays the priority of teams that over-consume GPU resources, redistributing capacity to under-served users. No manual queue management needed.
- Native MPI integration. mpirun binds to srun task slots naturally. For non-PyTorch HPC workloads, OpenMPI over Slurm is a solved problem.
- Simple job semantics. A Slurm job is a shell script with #SBATCH directives. There is no YAML object graph, no custom resource definition, no controller to debug. Submit a script, get results, read logs.
- Zero toolchain change. Researchers moving from a university cluster to cloud GPUs can copy their sbatch scripts with minor modifications. The learning curve is flat.
This is not a claim that Slurm is universally better. Slurm solves a different problem set than Kubernetes. The right choice depends on what you are building.
Slurm vs Kubernetes for AI: When Each Wins
| Dimension | Slurm | Kubernetes |
|---|---|---|
| Job type | Batch training, HPC, MPI | Always-on inference, serving |
| Gang scheduling | Native | Requires KAI Scheduler or Volcano |
| Container support | Via Pyxis + Enroot | Native |
| Auto-scaling | Elastic plugins (cloud bursting) | KEDA, Knative |
| Multi-tenancy | Fair-share queues | Namespaces + resource quotas |
| Topology awareness | topology.conf (native) | Node affinity + labels |
| Existing HPC migration | Zero toolchain change | Full rewrite |
| Inference serving | Not designed for it | Designed for it |
The decision is usually straightforward: if you run training jobs that start, run for hours, checkpoint, and terminate, Slurm wins on simplicity. If you need auto-scaling HTTP inference endpoints or microservice architectures alongside AI, Kubernetes wins on ecosystem. For teams doing both, running Slurm for training and Kubernetes for inference serving is a common split. For a deep look at the Kubernetes side, the Kubernetes GPU scheduling with DRA and KAI Scheduler guide covers the full stack.
For teams that want Kubernetes-native gang scheduling with fractional GPU sharing, NVIDIA Run:ai on GPU Cloud covers the full setup and licensing math. For enterprise NVIDIA stacks that need both scheduler options and unified lifecycle management, the NVIDIA Mission Control guide covers the architecture and migration path.
Slurm Architecture for GPU Clusters
A Slurm cluster has three main components:
slurmctld (the controller daemon) runs on the head node. It manages the job queue, allocates resources, and dispatches jobs to compute nodes. There is typically one active controller with an optional standby for high availability.
slurmd (the compute node daemon) runs on every GPU node. It receives job steps from the controller, launches processes, and reports resource usage and health back to the controller.
slurmdbd (the database daemon) stores job accounting data in MariaDB or MySQL. Required for fair-share scheduling and usage reporting. Runs on the controller node or a dedicated host.
The GRES (Generic Resource) system is how Slurm tracks GPUs. You declare GPU resources in two files:
# /etc/slurm/slurm.conf (controller)
ClusterName=gpu-cluster
SlurmctldHost=controller-node
GresTypes=gpu
NodeName=gpu-node-[001-004] \
Gres=gpu:h100:8 \
CPUs=128 \
RealMemory=2048000 \
State=UNKNOWN
PartitionName=train \
Nodes=gpu-node-[001-004] \
Default=YES \
MaxTime=168:00:00 \
State=UP

# /etc/slurm/gres.conf (each compute node)
NodeName=gpu-node-001 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-002 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-003 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-004 Name=gpu Type=h100 File=/dev/nvidia[0-7]
The File=/dev/nvidia[0-7] binding tells Slurm to set CUDA_VISIBLE_DEVICES correctly for each job, preventing GPU conflicts between concurrent jobs on the same node.
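For larger clusters, the NodeName and gres.conf lines are tedious to write by hand. The following is a minimal sketch, assuming a homogeneous cluster with zero-padded, three-digit node names; the function name `render_node_configs` and its parameters are hypothetical helpers, not part of Slurm.

```python
from __future__ import annotations


def render_node_configs(prefix: str, count: int, gpu_type: str,
                        gpus_per_node: int, cpus: int,
                        real_memory_mb: int) -> tuple[str, str]:
    """Return (slurm.conf NodeName line, gres.conf text) for identical nodes.

    Hypothetical helper: it only formats text in the same shape as the
    hand-written config in this section.
    """
    if count < 1 or gpus_per_node < 1:
        raise ValueError("count and gpus_per_node must be positive")

    # Slurm hostlist syntax, e.g. gpu-node-[001-004]
    node_range = f"{prefix}[{1:03d}-{count:03d}]"
    slurm_line = (
        f"NodeName={node_range} Gres=gpu:{gpu_type}:{gpus_per_node} "
        f"CPUs={cpus} RealMemory={real_memory_mb} State=UNKNOWN"
    )

    # One device file per GPU: /dev/nvidia0 .. /dev/nvidia{N-1}
    if gpus_per_node == 1:
        devices = "/dev/nvidia0"
    else:
        devices = f"/dev/nvidia[0-{gpus_per_node - 1}]"

    gres_lines = [
        f"NodeName={prefix}{i:03d} Name=gpu Type={gpu_type} File={devices}"
        for i in range(1, count + 1)
    ]
    return slurm_line, "\n".join(gres_lines)


slurm_line, gres_text = render_node_configs(
    "gpu-node-", 4, "h100", 8, cpus=128, real_memory_mb=2048000
)
print(slurm_line)
print(gres_text)
```

Called with the values from this section, it reproduces the four-node H100 entries; the CPU and memory figures should come from the actual hardware (slurmd -C prints them on a compute node).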
Provisioning a Slurm Cluster on GPU Cloud: 4-8 Node H100 Walkthrough
This walkthrough provisions a 4-node H100 SXM5 cluster. Adjust node counts and GPU types as needed.
Step 1: Provision the controller node. Rent one CPU-only instance (8-16 cores, 32-64 GB RAM) as the controller. It does not need GPUs. This node will run slurmctld and slurmdbd.
Step 2: Provision compute nodes. Rent 4 H100 SXM5 bare-metal instances on Spheron. All nodes must be on the same private subnet so they can reach each other without NAT. Note the private IPs of each compute node.
Step 3: Set up shared storage. Create an NFS server (or use a managed NFS service) and export /home and /scratch to all nodes. Shared home directories are required so that sbatch scripts and dataset paths resolve identically on every node. Keep dataset reads on fast local NVMe on each compute node, and use the NFS-backed /scratch only for checkpoints.
For a detailed treatment of parallel file system options (WekaIO, Lustre, BeeOND) and how to size them for specific cluster topologies, see the parallel file systems for AI training guide.
Step 4: Install Slurm. On Ubuntu 22.04, the slurm-wlm package installs both slurmctld and slurmd:
sudo apt-get update && sudo apt-get install -y slurm-wlm slurmdbd mariadb-server munge
Step 5: Generate and distribute the Munge key. Munge is the authentication system Slurm uses. The key must be byte-identical on every node:
# On the controller (Ubuntu 22.04 / munge < 0.5.15):
sudo create-munge-key
# If your system has munge >= 0.5.15 use: sudo mungekey --create
sudo systemctl enable --now munge
# Copy to each compute node (use a secrets manager in production)
# The key file is readable only by the munge user, so read it with sudo
sudo cat /etc/munge/munge.key | ssh user@gpu-node-001 "umask 077 && cat > /tmp/munge.key"
# Restart munge so a daemon that is already running picks up the new key
ssh user@gpu-node-001 "sudo mv /tmp/munge.key /etc/munge/munge.key && sudo chown munge:munge /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key && sudo systemctl enable munge && sudo systemctl restart munge"
Step 6: Write slurm.conf. Populate slurm.conf with your node names, specs, and partition definitions. Use the template from the Architecture section above, substituting your actual IP-resolved hostnames.
Step 7: Start the daemons. On the controller: sudo systemctl enable --now slurmctld slurmdbd. On each compute node: sudo systemctl enable --now slurmd.
Step 8: Verify the cluster. Run sinfo on the controller. If nodes show idle status, the cluster is ready. Run a sanity check:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
srun nvidia-smi
Submit with sbatch sanity.sh and check output with squeue, then sacct -j $JOBID.
Topology-Aware Scheduling for Multi-Node Training
On a GPU cluster with InfiniBand, multi-node all-reduce traffic routes through a leaf-spine fabric. Nodes under the same leaf switch can communicate at full IB bandwidth. Nodes under different leaf switches cross the spine, adding latency and potentially sharing bandwidth.
Slurm's topology scheduling places jobs on nodes that minimize cross-switch hops. You configure it in topology.conf:
# /etc/slurm/topology.conf
# Two racks, four nodes per rack, connected through a spine switch
SwitchName=spine1 Switches=leaf1,leaf2
SwitchName=leaf1 Nodes=gpu-node-[001-004]
SwitchName=leaf2 Nodes=gpu-node-[005-008]
Enable it in slurm.conf:
TopologyPlugin=topology/tree
With topology/tree, Slurm attempts to place a 4-node job entirely under leaf1 before considering nodes across leaves. For single-rack clusters where all nodes share one switch, topology/flat is sufficient.
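To make the same-leaf-first idea concrete, here is a deliberately simplified sketch. It is not Slurm's actual topology/tree algorithm; the function `pick_nodes` and its best-fit rule are assumptions chosen only to illustrate why a job prefers a single leaf switch and crosses the spine only when it must.

```python
from __future__ import annotations


def pick_nodes(leaves: dict[str, list[str]], idle: set[str],
               wanted: int) -> list[str]:
    """Pick `wanted` idle nodes, preferring a single leaf switch.

    Illustration only: the real scheduler uses its own best-fit logic.
    """
    # Idle nodes grouped by the leaf switch they hang off.
    free = {leaf: [n for n in nodes if n in idle]
            for leaf, nodes in leaves.items()}

    # First choice: the tightest leaf that can hold the whole job.
    fitting = [leaf for leaf, nodes in free.items() if len(nodes) >= wanted]
    if fitting:
        best = min(fitting, key=lambda leaf: len(free[leaf]))
        return free[best][:wanted]

    # Otherwise span leaves, draining the leaves with the most idle nodes
    # first so the job touches as few switches as possible.
    chosen: list[str] = []
    for leaf in sorted(free, key=lambda name: len(free[name]), reverse=True):
        for node in free[leaf]:
            if len(chosen) == wanted:
                return chosen
            chosen.append(node)
    if len(chosen) < wanted:
        raise RuntimeError("not enough idle nodes for this job")
    return chosen


topology = {
    "leaf1": [f"gpu-node-{i:03d}" for i in range(1, 5)],
    "leaf2": [f"gpu-node-{i:03d}" for i in range(5, 9)],
}
all_nodes = {n for nodes in topology.values() for n in nodes}
print(pick_nodes(topology, all_nodes - {"gpu-node-002"}, wanted=4))
```

With gpu-node-002 busy, leaf1 can no longer hold a 4-node job, so the sketch places it entirely under leaf2 instead of splitting it across the spine.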
Test placement without actually running a job:
srun --nodes=4 --gres=gpu:8 --test-only /bin/true
The output shows which nodes Slurm would allocate.
On the NCCL side, align your environment variables with the IB fabric topology:
# Pin NCCL to the correct InfiniBand HCAs (check with ibstat)
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1
# Use GID index 3 for RoCEv2 (or 0 for IB)
export NCCL_IB_GID_INDEX=3
# Enable GPU Direct RDMA reads
export NCCL_NET_GDR_READ=1
# Set the socket interface for inter-node rendezvous
export NCCL_SOCKET_IFNAME=ib0
On cloud bare-metal nodes, IB device names may differ from on-prem clusters. Always check with ibstat before setting NCCL_IB_HCA. For more detail on IB vs RoCE tradeoffs, see the InfiniBand vs RoCE fabric selection guide. For the full set of NCCL environment variables, see NCCL tuning for multi-node training.
Running LLM Training Jobs: torchrun, FSDP, and DeepSpeed Under sbatch
Here is a complete sbatch script for a 4-node, 32-GPU FSDP training job:
#!/bin/bash
#SBATCH --job-name=llm-fsdp-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --time=48:00:00
# Slurm does not create missing directories: run mkdir -p logs before submitting
#SBATCH --output=logs/train-%j.out
#SBATCH --error=logs/train-%j.err
# Extract the first node as the rendezvous master
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
# InfiniBand settings - check ibstat for your HCA names
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_READ=1
export NCCL_SOCKET_IFNAME=ib0
# Launch torchrun on each node via srun
# srun starts one task per node; torchrun starts 8 GPU workers within each task
srun torchrun \
  --nnodes=$SLURM_JOB_NUM_NODES \
  --nproc_per_node=8 \
  --rdzv_id=$SLURM_JOB_ID \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  train.py \
  --model_name meta-llama/Llama-3-70b \
  --fsdp_sharding_strategy FULL_SHARD \
  --gradient_checkpointing
--rdzv_backend=c10d uses PyTorch's C10d rendezvous for node discovery. --rdzv_endpoint points every node at the same host and port, and --rdzv_id keeps concurrent jobs from joining each other's rendezvous. The c10d backend is more reliable than the default static backend in cluster environments where nodes may have slightly different startup times, and it assigns node ranks itself, so no --node_rank flag is needed.
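Inside train.py, each of the 32 workers learns its place in the job from environment variables that torchrun sets (RANK, LOCAL_RANK, WORLD_SIZE, LOCAL_WORLD_SIZE). The sketch below shows one way to read them; the `DistEnv` class and `read_dist_env` function are hypothetical names, and the node index calculation assumes every node runs the same number of workers.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int              # global rank, 0 .. world_size - 1
    local_rank: int        # GPU index on this node, 0 .. local_world_size - 1
    world_size: int        # total workers in the job (4 nodes x 8 GPUs = 32)
    local_world_size: int  # workers on this node

    @property
    def node_index(self) -> int:
        # Assumes the same worker count on every node, so global ranks
        # are laid out in contiguous blocks per node.
        return self.rank // self.local_world_size


def read_dist_env(environ=os.environ) -> DistEnv:
    """Read the variables torchrun exports for each worker process."""
    return DistEnv(
        rank=int(environ["RANK"]),
        local_rank=int(environ["LOCAL_RANK"]),
        world_size=int(environ["WORLD_SIZE"]),
        local_world_size=int(environ["LOCAL_WORLD_SIZE"]),
    )


# Example with a hand-written environment instead of a real torchrun launch.
example = read_dist_env({
    "RANK": "19", "LOCAL_RANK": "3",
    "WORLD_SIZE": "32", "LOCAL_WORLD_SIZE": "8",
})
print(example, example.node_index)
```

In a real training script, local_rank is what you would pass to torch.cuda.set_device before initializing the process group.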
For DeepSpeed, let srun handle distribution and skip the DeepSpeed CLI launcher entirely:
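srun python train_ds.py --deepspeed ds_config.json
The script calls deepspeed.initialize() internally, so srun starts the processes and there is no conflicting second launcher. This pattern needs one task per GPU (--ntasks-per-node=8 instead of 1), and the script has to derive RANK, LOCAL_RANK, and WORLD_SIZE from SLURM_PROCID, SLURM_LOCALID, and SLURM_NTASKS before it initializes. If you prefer using the DeepSpeed CLI as the sole outer launcher (not wrapped in srun), use a hostfile instead: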
# Build a hostfile from the nodes SLURM allocated (8 slots = 8 GPUs per node)
scontrol show hostnames "$SLURM_JOB_NODELIST" \
| awk '{print $1 " slots=8"}' > /tmp/hostfile
HOSTFILE=/tmp/hostfile
deepspeed \
--hostfile=$HOSTFILE \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train_ds.py \
--deepspeed ds_config.json
Monitoring Running Jobs
Check the queue and job status:
# See all running and pending jobs
squeue -u $USER
# Detailed accounting for a completed or running job
sacct -j $JOBID --format=JobID,Elapsed,CPUTime,NCPUS,AllocTRES,State
# Check GPU utilization on allocated nodes without SSH
srun --jobid=$JOBID --overlap nvidia-smi
For the full multi-node FSDP and DeepSpeed ZeRO-3 setup, including memory math and checkpoint strategies, see the FSDP and DeepSpeed multi-node setup guide.
Pyxis and Enroot: Containerized Slurm Without Performance Loss
The standard way to run containers in Slurm is via Pyxis and Enroot. The combination gives you full OCI container support, rootless execution, and GPU passthrough with near-zero performance overhead.
Why not Docker? Docker requires a root daemon, which is a security risk on shared HPC clusters. Most cluster admins do not allow it. Enroot is rootless: it unpacks Docker/OCI images into squashfs files and mounts them without a daemon. NVIDIA Container Toolkit hooks inside Enroot handle GPU device access.
How Pyxis works. Pyxis is a Slurm SPANK plugin that extends srun and sbatch with container flags. Once installed, your sbatch scripts gain --container-image and --container-mounts options that work identically to regular job flags.
Installation (abbreviated):
# On all compute nodes: install Enroot
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_amd64.deb
sudo apt-get install -y ./enroot_3.5.0-1_amd64.deb
# Install NVIDIA Container Toolkit; its nvidia-container-cli is what Enroot's
# bundled NVIDIA hook (98-nvidia.sh in /etc/enroot/hooks.d/) calls for GPU access
sudo apt-get install -y nvidia-container-toolkit
# On all nodes: install Pyxis
# Download from github.com/NVIDIA/pyxis and build against your Slurm headers
# Then register in /etc/slurm/plugstack.conf:
# required /usr/local/lib/slurm/spank_pyxis.so
Using containers in sbatch:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun \
  --container-image=nvcr.io/nvidia/pytorch:24.12-py3 \
  --container-mounts=/data:/data,/scratch:/scratch \
  torchrun \
  --nnodes=$SLURM_JOB_NUM_NODES \
  --nproc_per_node=8 \
  --rdzv_id=$SLURM_JOB_ID \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py
Enroot imports the image on first use and caches a squashfs on each node. Subsequent runs mount from cache with near-zero startup time. For GPU-bound training workloads, squashfs mounts add less than 5% overhead vs bare-metal execution.
Fair-Share Scheduling, GPU Partitions, and Preemption
For multi-team clusters, Slurm's fair-share scheduler prevents any single team from monopolizing GPU resources.
Partition setup for a research cluster:
# slurm.conf partition definitions
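# Per-user caps belong in a QOS (MaxTRESPerUser), not on the partition line
PartitionName=debug Nodes=gpu-node-001 MaxTime=02:00:00 MaxNodes=1 State=UP
PartitionName=train Nodes=gpu-node-[001-008] MaxTime=168:00:00 State=UP Default=YES
# PriorityTier is what preempt/partition_prio compares between partitions
PartitionName=priority Nodes=gpu-node-[001-008] MaxTime=720:00:00 PriorityJobFactor=2 PriorityTier=2 State=UP
Enable fair-share scheduling: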
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
PriorityType=priority/multifactor
PriorityWeightFairshare=50000
PriorityWeightAge=1000
PriorityWeightJobSize=1000
Fair-share works by comparing each user's or account's historical usage against their allocated share. Teams that have consumed more GPU-hours than their share get lower priority; teams that have used less get a boost. The PriorityWeightFairshare parameter controls how aggressively historical usage influences the queue.
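The multifactor plugin combines these weights as a weighted sum of factors, each normalized to the range 0.0 to 1.0. The sketch below covers only the three weights configured above and leaves out the other terms Slurm supports (partition, QOS, TRES, nice); the factor values in the example are made up for illustration.

```python
# Weights copied from the slurm.conf fragment in this section.
PRIORITY_WEIGHT_FAIRSHARE = 50000
PRIORITY_WEIGHT_AGE = 1000
PRIORITY_WEIGHT_JOB_SIZE = 1000


def job_priority(fairshare: float, age: float, job_size: float) -> float:
    """Weighted sum of three normalized factors (each between 0.0 and 1.0).

    Simplified: the real plugin adds further terms that are omitted here.
    """
    for name, value in (("fairshare", fairshare), ("age", age),
                        ("job_size", job_size)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor must be within [0.0, 1.0]")
    return (PRIORITY_WEIGHT_FAIRSHARE * fairshare
            + PRIORITY_WEIGHT_AGE * age
            + PRIORITY_WEIGHT_JOB_SIZE * job_size)


# Hypothetical factors: a lightly used team with a fresh job versus a
# heavy consumer whose job has waited long enough to max out the age factor.
light_user = job_priority(fairshare=0.9, age=0.0, job_size=0.0)
heavy_user = job_priority(fairshare=0.1, age=1.0, job_size=1.0)
print(light_user, heavy_user)
```

With a fair-share weight fifty times larger than the age and size weights, the under-served team's new job still outranks the heavy consumer's oldest and largest job, which is the behavior these settings are meant to produce.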
Preemption allows high-priority jobs to evict lower-priority ones:
# slurm.conf
PreemptType=preempt/partition_prio
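PreemptMode=REQUEUE
With REQUEUE, evicted jobs re-enter the queue and start again from the top of the batch script when resources free up. Slurm does not restore any state itself, so the training script has to load its last checkpoint on startup. This pairs well with frequent checkpointing in training scripts.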
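Because a requeued job reruns the same script, the resume logic lives in the training code. A minimal sketch, assuming checkpoints are saved as step_<N>.pt files in a single directory (the naming matches the save example later in this guide; the helper name `latest_checkpoint` is hypothetical):

```python
import re
from pathlib import Path

# Matches step_100.pt, step_2500.pt, ... and nothing else.
_STEP_RE = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(ckpt_dir):
    """Return the Path of the highest-numbered checkpoint, or None.

    Steps are compared as integers, so step_1000.pt beats step_500.pt
    (a plain string sort would get that wrong).
    """
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None

    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _STEP_RE.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("/scratch/ckpt"))
```

At startup the training script would call this once, load the returned file if it is not None, and continue from the stored step. Writing each checkpoint to a temporary name and renaming it when complete avoids resuming from a file that was cut off by preemption.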
QOS per team sets hard limits:
# Create accounts for each team
sacctmgr add account team-a Description="Team A" Organization=research
sacctmgr add user alice Account=team-a
# Create a QOS limiting GPU-hours per month
sacctmgr add qos team-a-qos GrpTRESMins="gres/gpu=43200" # 43200 GPU-minutes = 720 GPU-hours
sacctmgr modify account team-a set QOS=team-a-qos
Cost Optimization: Spot Instances and Elastic Slurm
Spot pricing cuts GPU costs significantly on workloads that can tolerate preemption. Here are current Spheron prices for the most common Slurm training configurations:
| GPU | Type | On-Demand (per GPU/hr) | 8-GPU Node (on-demand/hr) |
|---|---|---|---|
| H100 SXM5 | NVIDIA Hopper | $4.21 | $33.68 |
| A100 80GB | NVIDIA Ampere | $1.04 | $8.32 |
Pricing fluctuates based on GPU availability. The prices above are as of 11 May 2026 and may have changed. Check current GPU pricing for live rates.
Spot-safe job patterns for Slurm. Add --requeue to your sbatch script so that if a spot node is reclaimed, Slurm requeues the job automatically:
#SBATCH --requeue
Pair this with frequent checkpointing in your training script. Save a checkpoint every 100-500 steps to shared NFS. On restart, load from the latest checkpoint:
# In your training loop
if step % checkpoint_interval == 0:
    torch.save({"step": step, "model": model.state_dict(), ...}, f"/scratch/ckpt/step_{step}.pt")
For automatic restart after preemption:
# Submit job and capture the job ID
JOBID=$(sbatch --parsable train.sh)
# Set up a dependency job that restarts if the first fails (exit code != 0)
sbatch --dependency=afternotok:$JOBID train.sh
Elastic Slurm (cloud bursting) adds and removes nodes dynamically via ResumeProgram and SuspendProgram hooks in slurm.conf. These hooks call cloud provider APIs to provision or terminate nodes as the queue grows or shrinks. SchedMD's documentation covers the configuration; the key point is that the hook scripts need to update /etc/slurm/slurm.conf and reload the controller each time nodes change.
Cost attribution with sacct:
# Total GPU-hours consumed by each user this month
# AllocTRES replaces the older AllocGRES field; --allusers needs admin rights
sacct --allusers --allocations --noheader --parsable2 \
  --starttime=$(date -d "1 month ago" +%Y-%m-%d) \
  --format=User,AllocTRES,ElapsedRaw \
  --state=COMPLETED \
  | awk -F'|' 'match($2, /gres\/gpu=[0-9]+/) {n = substr($2, RSTART + 9, RLENGTH - 9); gpuhours[$1] += n * $3 / 3600} END {for (u in gpuhours) print u, gpuhours[u], "GPU-hrs"}'
Run the same accounting query on A100 clusters or H100 nodes to get per-team GPU spend for chargebacks. For a detailed comparison of when spot vs on-demand vs reserved makes sense, see on-demand vs spot vs reserved GPU instances. For a real case study, the spot GPU training cost analysis shows how a 70B training run was completed for $11,200 on spot GPUs. For workloads that don't need a central scheduler at all, decentralized training swarms without a central scheduler covers Pluralis and Prime Intellect's gossip and DHT-based coordination, which replace Slurm's central controller with peer-to-peer node discovery.
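If the awk one-liner becomes hard to maintain, the same aggregation is easy to express in Python. The sketch below assumes the input lines come from sacct run with --parsable2 --noheader and --format=User,AllocTRES,ElapsedRaw, so each line looks like user|tres-list|seconds; the function names are hypothetical, and the per-GPU price is the H100 on-demand figure from the table in this section.

```python
def gpu_count(alloc_tres: str) -> int:
    """Extract the GPU count from an AllocTRES string such as
    'billing=128,cpu=128,gres/gpu=8,mem=2000G,node=1'."""
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0  # CPU-only job


def gpu_hours_by_user(lines):
    """Sum GPU-hours per user from 'user|AllocTRES|ElapsedRaw' lines."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, tres, elapsed_seconds = line.split("|")
        hours = gpu_count(tres) * int(elapsed_seconds) / 3600
        totals[user] = totals.get(user, 0.0) + hours
    return totals


# Hand-written sample rows standing in for real sacct output.
sample = [
    "alice|billing=128,cpu=128,gres/gpu:h100=8,gres/gpu=8,mem=2000G,node=1|7200",
    "bob|cpu=16,gres/gpu=2,mem=64G,node=1|1800",
    "alice|cpu=4,mem=8G,node=1|3600",
]
H100_PRICE_PER_GPU_HOUR = 4.21  # on-demand rate from the pricing table

for user, hours in sorted(gpu_hours_by_user(sample).items()):
    print(user, hours, "GPU-hrs", round(hours * H100_PRICE_PER_GPU_HOUR, 2), "USD")
```

Multiplying GPU-hours by the hourly rate gives a simple chargeback figure; a mixed cluster would need one rate per GPU type, keyed on the typed entry (for example gres/gpu:h100) instead of the generic count.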
Slurm for Batch Inference: When It Beats Always-On Kubernetes
Batch inference is an underrated Slurm use case. If you are running overnight embedding generation, batch LLM evaluation, or dataset scoring pipelines, you are probably paying 24/7 for a Kubernetes deployment that is idle 18 hours a day.
Slurm array jobs handle embarrassingly parallel inference efficiently:
#!/bin/bash
#SBATCH --job-name=batch-embed
#SBATCH --array=0-999 # 1000 shards
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
# Each task processes one shard
SHARD_ID=$SLURM_ARRAY_TASK_ID
python embed.py \
--shard $SHARD_ID \
--total-shards 1000 \
--input /data/corpus \
  --output /scratch/embeddings/shard_${SHARD_ID}.npy
Slurm schedules array tasks across available GPUs as they free up. A 1000-shard job on a 10-GPU cluster runs in roughly 100 waves of 10 tasks, and every GPU stays busy until the queue drains. Compare to an always-on Kubernetes deployment: even if you have 10 replicas running 24/7, idle time during off-hours still bills at full rate.
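On the Python side, embed.py needs a deterministic way to decide which inputs belong to its shard. A minimal sketch, assuming the corpus can be listed as a sorted sequence that every task sees in the same order (the function names are hypothetical; the source does not show embed.py):

```python
import math


def shard_slice(items, shard_id: int, total_shards: int):
    """Return the items owned by one array task using a strided split.

    Every task must see `items` in the same order, so sort file listings
    before calling this.
    """
    if total_shards < 1 or not 0 <= shard_id < total_shards:
        raise ValueError("shard_id must be in [0, total_shards)")
    return items[shard_id::total_shards]


def waves(total_tasks: int, concurrent_gpus: int) -> int:
    """Number of scheduling rounds if every task takes about the same time."""
    return math.ceil(total_tasks / concurrent_gpus)


docs = [f"doc_{i}" for i in range(10)]
print(shard_slice(docs, shard_id=1, total_shards=4))
print(waves(1000, 10))
```

The strided split keeps shard sizes within one item of each other and needs no coordination between tasks: each one computes its own slice from SLURM_ARRAY_TASK_ID and the total shard count.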
The economics flip when you need low p50 latency or auto-scaling HTTP endpoints. Slurm has no native HTTP serving layer. For production inference serving with latency SLAs, Kubernetes wins. For batch LLM inference scheduling where throughput matters more than latency, Slurm array jobs are the simpler choice.
RLHF training workflows fit naturally into Slurm: reward-model inference and policy training each become separate sbatch jobs with --dependency linking them. For an overview of the major RLHF frameworks, see the verl, OpenRLHF, and TRL training infrastructure guide.
Migrating from On-Prem Slurm to GPU Cloud
Engineers moving from on-prem HPC clusters to cloud Slurm hit a predictable set of gotchas.
Storage. On-prem clusters usually have Lustre or IBM Storage Scale with hundreds of GB/s aggregate bandwidth. On cloud, you are working with NFS over network block storage. Shared storage I/O is often the bottleneck, not compute. Mitigate this by mounting training datasets to fast local NVMe on each compute node and only using NFS for checkpoints and model outputs. Many cloud providers offer local NVMe on bare-metal GPU nodes.
Networking. InfiniBand is available on bare-metal H100 and A100 nodes from some cloud providers. Verify IB availability before provisioning. If IB is unavailable, RoCEv2 over 100GbE substitutes for most training workloads at moderate scale. The multi-node training without InfiniBand on cloud guide covers the tradeoffs and configuration in detail.
Licensing. Slurm is open source under the GNU GPLv2 license. No license cost. The database backend (slurmdbd) requires MariaDB or MySQL; budget a small instance for that, or use a managed database service.
Node naming. Cloud instances use dynamic hostnames or IP-based names that change on reprovision. Automate NodeName entries in slurm.conf via Terraform or a startup script that registers each node with the controller on boot. The controller must be able to resolve every compute node hostname.
MPI. On-prem clusters often run OpenMPI extensively. For PyTorch-based LLM training, NCCL replaces MPI entirely. OpenMPI still works for non-PyTorch HPC codes via srun; install libopenmpi-dev on all nodes and it works the same as on-prem.
Munge key distribution. The Munge authentication key must be byte-identical on all nodes. In production, store it in a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.) and retrieve it during node initialization before slurmd starts. Do not copy keys over SSH manually in scripts that run on node boot.
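A quick way to confirm that every node ended up with a byte-identical key, without sending the key itself anywhere, is to compare SHA-256 digests. This is a sketch under stated assumptions: how each node reports its digest back (a boot log, a metadata tag, an SSH loop) is left open, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def key_fingerprint(path) -> str:
    """SHA-256 hex digest of a key file, read as raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_nodes(reference: str, reported: dict) -> list:
    """Names of nodes whose reported digest differs from the controller's."""
    return sorted(node for node, digest in reported.items()
                  if digest != reference)


# Example with made-up digests standing in for values collected from nodes.
controller_digest = hashlib.sha256(b"example-key-bytes").hexdigest()
reported = {
    "gpu-node-001": controller_digest,
    "gpu-node-002": hashlib.sha256(b"stale-key-bytes").hexdigest(),
}
print(mismatched_nodes(controller_digest, reported))
```

On a real node, key_fingerprint("/etc/munge/munge.key") has to run as root or as the munge user, since the key file is not world-readable.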
Spheron's bare-metal H100 and A100 instances with InfiniBand networking are built for exactly this kind of workload. You bring your own scheduler, whether Slurm, Ray, or something custom, and get raw HPC performance without managed-Kubernetes overhead. No lock-in, per-minute billing, full root access.
Quick Setup Guide
Provision Slurm head node and compute nodes on GPU cloud
Rent one CPU-only or small GPU instance as the Slurm controller (slurmctld) and N bare-metal H100 or A100 nodes as compute nodes (slurmd). All nodes must be on the same private network subnet. Install slurm-wlm on Ubuntu or use the SchedMD RPMs. Configure /etc/slurm/slurm.conf with ClusterName, SlurmctldHost, and NodeName entries for each compute node.
Configure GPU GRES in slurm.conf and gres.conf
In slurm.conf, add GresTypes=gpu and set Gres=gpu:h100:8 (or gpu:a100:8) on each NodeName line. Create /etc/slurm/gres.conf on each compute node listing each GPU's device file (File=/dev/nvidia[0-7]). This lets sbatch scripts request --gres=gpu:h100:N and have Slurm bind the correct CUDA_VISIBLE_DEVICES.
Set up topology-aware scheduling
Create /etc/slurm/topology.conf mapping switch names to compute node names, reflecting your IB or Ethernet fabric topology. Set TopologyPlugin=topology/tree in slurm.conf. For single-switch clusters, topology/flat is sufficient. Verify placement with srun --test-only and check the assigned nodes fall within the expected switch boundary.
Write and submit a multi-node LLM training sbatch script
Write a batch script with #SBATCH --nodes, --ntasks-per-node=1, --gres=gpu:8, --exclusive. Extract MASTER_ADDR from SLURM_JOB_NODELIST using scontrol, then call srun to launch torchrun or deepspeed on each node. Set NCCL_IB_HCA, NCCL_IB_GID_INDEX, and NCCL_NET_GDR_READ for InfiniBand. Submit with sbatch train.sh and monitor with squeue, sstat, and sacct.
Add Pyxis and Enroot for containerized jobs
Install Enroot on all compute nodes and the Pyxis SPANK plugin. Register the Enroot hook in /etc/enroot/hooks.d/ for NVIDIA GPU support. Add PlugStackConfig=plugstack.conf to slurm.conf and register the Pyxis plugin there. Jobs can then use srun --container-image=nvcr.io/nvidia/pytorch:24.12-py3 --container-mounts=/data:/data to run inside a container with full GPU access and NCCL performance.
Frequently Asked Questions
Slurm was built for HPC gang scheduling: all nodes in a job start together or none start. It has native MPI support, fair-share scheduling for multi-team clusters, and simple job semantics that map cleanly to training runs. Most on-prem HPC clusters at universities and national labs run Slurm, so researchers port their code to the cloud without a toolchain change. Kubernetes excels at long-running inference services and microservice orchestration but adds significant complexity for batch training workloads that run for hours, checkpoint, and terminate.
In your sbatch script, set --nodes=N, --ntasks-per-node=1, and --gres=gpu:8 (for 8 GPUs per node). Inside the script, extract MASTER_ADDR with: export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1). Then launch with: srun torchrun --nnodes=$SLURM_JOB_NUM_NODES --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 --rdzv_id=$SLURM_JOB_ID train.py. The c10d rendezvous assigns node ranks as nodes join, so no --node_rank flag is needed. With the static --master_addr form you would instead have to pass --node_rank=$SLURM_NODEID from inside each task.
Pyxis is a SPANK plugin for Slurm that adds --container-image and --container-mounts flags to srun and sbatch. It uses Enroot as the container runtime, which unpacks Docker/OCI images into squashfs files that mount with near-zero overhead. Unlike running Docker inside a job (which requires root), Enroot runs rootless. GPU passthrough works via NVIDIA Container Toolkit hooks inside the Enroot runtime, so CUDA and NCCL perform identically to a bare-metal job.
Slurm reads a topology.conf file that maps switch names to node names, building a tree of network switches. With TopologyPlugin=topology/tree in slurm.conf, the scheduler places multi-node jobs on nodes that share the closest common switch, minimizing cross-switch hops. On IB fabrics, this means jobs land on nodes under the same leaf switch first, then the same spine switch. Combined with GPU and memory binding (--gpu-bind=closest, --mem-bind=local), this helps NCCL's RDMA traffic take the shortest IB path.
Choose Slurm when: your team already uses Slurm on-prem and wants zero toolchain change on cloud, you run batch training jobs (not always-on inference services), you need fair-share scheduling for a shared research cluster with multiple teams, or you need native MPI support for HPC-style workloads. Choose Kubernetes when: you need auto-scaling HTTP inference endpoints, you are running microservice architectures alongside AI, or you need GitOps-style declarative deployment. For training-heavy shops, Slurm wins on simplicity; for inference-serving shops, Kubernetes wins on ecosystem.
