URL: https://www.hardware-corner.net/vllm-local-mixed-gpu-mig-setup/

Running vLLM for Local LLMs on Mixed GPUs? MIG Might Just Make It Work.

By Allan Witt | Updated: November 5, 2025

When I recently helped set up an LLM inference server for a client, I ran into a problem that may sound familiar to anyone mixing different GPUs. I had an RTX Pro 6000 Workstation (95 GB VRAM) and an RTX 5090 (32 GB VRAM). The goal was simple: run vLLM without wasting available memory. The reality was less straightforward.

vLLM, as of now, doesn’t support splitting models across GPUs with mismatched VRAM sizes. It will sync allocations to the smallest GPU, meaning if one card has 32 GB and the other has 95 GB, both will effectively operate as 32 GB. This leads to out-of-memory errors when you’re trying to load something like a 100 GB model.
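
To make the arithmetic concrete, here is a minimal sketch of the behavior described above. It assumes usable capacity is simply the smallest card's VRAM multiplied by the number of cards, and it ignores KV cache and runtime overhead. The helper name usable_vram_gb is hypothetical.

```shell
# Hypothetical helper: pooled VRAM when every card is capped at the smallest one.
# Arguments are per-card VRAM sizes in whole GB.
usable_vram_gb() {
  min=$1
  count=$#
  for size in "$@"; do
    if [ "$size" -lt "$min" ]; then
      min=$size
    fi
  done
  echo $(( min * count ))
}

# The two cards from this build: 95 GB + 32 GB behaves like 2 x 32 GB.
usable_vram_gb 95 32   # prints 64
```

Under that assumption the pair offers 64 GB in total, which is why a 100 GB model fails to load even though the cards hold 127 GB between them.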

The solution turned out to be MIG — Multi-Instance GPU, a feature NVIDIA introduced for Ampere, Hopper, and now Blackwell GPUs. MIG lets you slice a physical GPU into multiple smaller, fully isolated GPU instances. Each instance behaves like an independent GPU with its own compute cores, memory, and cache.

What NVIDIA MIG Actually Does

MIG isn’t virtualization in the usual sense. It’s hardware-level partitioning of GPU resources. Each partition, or “instance”, runs its own CUDA contexts and application stack on top of the shared host driver, so different workloads or containers can use separate slices of the same GPU without interfering with each other.

Each MIG instance has:

  • A fixed slice of memory (for example, 32 GB on a 96 GB GPU)
  • Its own portion of streaming multiprocessors (SMs)
  • Dedicated L2 cache and memory bandwidth

This architecture is why MIG remains stable under mixed workloads — the hardware-level isolation prevents contention. It’s one of the most useful features for inference servers where you want predictable performance, consistent VRAM allocation, or to run multiple models side by side.

Supported Architectures and Requirements

MIG is supported on Ampere (A100, A30), Hopper (H100, H200), and Blackwell (B200, RTX Pro 6000 Blackwell, RTX Pro 5000 Blackwell).

To use MIG, you’ll need:

  • CUDA 12 or newer
  • A compatible NVIDIA driver (R450 or later for A100, R525 or later for H100, and R575 or later for Blackwell RTX Pro series)
  • Linux with the latest NVIDIA Data Center driver
  • Optional but recommended: NVIDIA Container Toolkit and nvidia-smi utilities

On the Blackwell workstation cards, you should also verify that the vBIOS is updated to at least version 98.02.55.00.00 for the RTX Pro 6000 Workstation edition.

⚠️ vBIOS Disclaimer: If you need to update your GPU’s vBIOS to enable MIG, make sure you use the correct BIOS for your exact GPU model and edition (RTX Pro 6000 Workstation, Max-Q, or Server).

Do not flash firmware meant for a different variant — it can permanently disable your card. Before flashing, check the minimum required vBIOS version in the official NVIDIA MIG documentation and verify that your hardware and motherboard are compatible.

Proceed at your own risk — incorrect flashing may void warranties or render the GPU unusable.

How to Enable MIG Mode

Activating MIG on a supported GPU is straightforward but must be done carefully. Here’s a step-by-step example using Linux.

First, install NVIDIA’s Display Mode Selector tool. This utility switches the GPU between “graphics” and “compute” mode. MIG can only be enabled in compute mode.

sudo ./displaymodeselector -i 1 --gpumode compute
sudo reboot

Info: If you run into errors with displaymodeselector, see this Reddit post for possible fixes.

After rebooting, confirm the card is in compute mode:

nvidia-smi -i 1

Then enable MIG on the device. Here -i 1 selects the GPU at index 1; the flag also accepts a PCI bus ID or UUID:

sudo nvidia-smi -i 1 -mig 1
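
One way to check that the switch took effect is to query the MIG mode through nvidia-smi. The sketch below assumes the mig.mode.current query field is available on your driver and that each CSV row looks like "1, Enabled". The helper name mig_enabled is hypothetical.

```shell
# Hypothetical helper: reads rows like "1, Enabled" on stdin and succeeds
# only if the GPU with the given index reports MIG mode as Enabled.
mig_enabled() {
  awk -F', *' -v idx="$1" '
    $1 == idx && $2 == "Enabled" { found = 1 }
    END { exit !found }
  '
}

# Usage on the MIG-capable machine (GPU index 1, as in the command above):
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader \
#     | mig_enabled 1 && echo "MIG is enabled on GPU 1"
```

The function only parses text, so it can be tried against saved nvidia-smi output before running it on the real machine.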

You can now create specific MIG instances. For example, to split a 96 GB GPU into four equal 24 GB partitions:

sudo nvidia-smi mig -cgi 1g.24gb,1g.24gb,1g.24gb,1g.24gb -C
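
To confirm the instances were created, you can list the devices with nvidia-smi -L and count the MIG entries. This is a minimal sketch, assuming each MIG device is printed on its own line with a UUID that starts with MIG-, which is the format recent drivers use. The helper name count_mig_devices is hypothetical.

```shell
# Hypothetical helper: counts MIG devices in `nvidia-smi -L` output read from stdin.
count_mig_devices() {
  grep -c 'UUID: MIG-'
}

# Usage after creating the instances:
#   nvidia-smi -L | count_mig_devices
#
# Related listing commands:
#   sudo nvidia-smi mig -lgi     # GPU instances that currently exist
#   sudo nvidia-smi mig -lgip    # instance profiles the GPU offers
```

If the count is lower than the number of profiles you requested, the -lgip listing is the first place to look, since only the profiles shown there can be created on that card.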

Each instance will appear as a separate logical GPU, visible to CUDA and inference frameworks like vLLM. If you want to destroy the instances and disable MIG later, you can run:

sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
sudo nvidia-smi -i 1 -mig 0

To revert fully to graphics mode:

sudo ./displaymodeselector -i 1 --gpumode graphics
sudo reboot

For a detailed guide and the complete list of instance profiles, check out the NVIDIA MIG docs.

Balancing Uneven GPUs for Inference

In the build I did, using MIG on the RTX Pro 6000 (95 GB) allowed me to carve it into three 32 GB GPU instances. Combined with the RTX 5090 (32 GB), I now effectively had four identical 32 GB GPUs.

That uniformity matters because vLLM, the best-performing inference runtime for local and distributed agents right now, expects all participating GPUs to have identical VRAM sizes when using tensor parallelism. Once each card appeared as a 32 GB device, I could finally launch large models using:

vllm serve model-name -tp 4
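
If the machine exposes more devices than you want vLLM to use, one option is to pin the device list with CUDA_VISIBLE_DEVICES, which accepts GPU and MIG UUIDs. The sketch below is an assumption about how that could be wired up, not the exact configuration from this build. The helper device_list is hypothetical: it expects nvidia-smi -L output on stdin and keeps every MIG instance plus any whole GPU whose name contains the argument. Support for enumerating several MIG instances from one process has varied across driver and CUDA releases, so verify it on your own stack.

```shell
# Hypothetical helper: builds a comma-separated UUID list from `nvidia-smi -L`
# output on stdin. Keeps all MIG devices and any full GPU whose line contains
# the name given as the first argument; the MIG parent GPU is skipped.
device_list() {
  awk -v name="$1" '
    /UUID: MIG-/ || ($1 == "GPU" && index($0, name) > 0) {
      # Pull the UUID out of "(UUID: <id>)".
      id = $0
      sub(/.*UUID: /, "", id)
      sub(/\).*/, "", id)
      ids = (ids == "" ? id : ids "," id)
    }
    END { print ids }
  '
}

# Usage (device name and model name are placeholders):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | device_list "RTX 5090") \
#     vllm serve model-name -tp 4
```

Passing UUIDs rather than indices keeps the selection stable if device ordering changes after MIG instances are recreated.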

The result was a stable multi-GPU setup with clean memory symmetry, allowing full utilization of my hardware — without waiting for vLLM to natively support uneven GPUs.

Why MIG Matters for Local LLM Enthusiasts

For home lab and on-prem inference builders, MIG changes the equation. Instead of hunting for perfectly matched GPUs, you can mix and match — carving larger workstation cards into standardized units that align with consumer GPUs.

It also improves performance isolation. You can dedicate one MIG instance per container, model, or client, ensuring no single workload starves the others for memory bandwidth. In mixed deployments, like serving quantized 7B and 70B models side-by-side, that predictability makes a big difference.

And let’s be honest — MIG is going to age really well. When the current AI gold rush cools off and those sweet, sweet A100s and H100s start flooding the used market, MIG will be the perfect way to slot them into your existing setups. You’ll be able to carve those datacenter beasts into tidy, equal slices and pair them with whatever GPUs you already have lying around — a little like recycling, but for compute power.

Final Thoughts

MIG is one of those features that feels enterprise-only at first glance but becomes incredibly practical once you understand it. For local LLM builders using vLLM, it can unlock configurations that simply weren’t possible before.

By slicing a 96 GB GPU into multiple smaller logical GPUs, you not only maximize VRAM efficiency but also keep your setup scalable, flexible, and cost-effective.

Whether you’re using a Pro 6000 Blackwell, an A100, or planning a Hopper upgrade, enabling MIG is worth the five minutes it takes. For anyone running vLLM on mixed GPUs, it’s a game changer.
