Running Multiple Local LLMs Simultaneously: Multi-Model Setup Guide
Running multiple local LLMs simultaneously has moved from experimental curiosity to practical necessity. Professionals building AI-powered workflows increasingly need concurrent access to task-specific models: one tuned for code generation, another for summarization, a third embedded in a retrieval-augmented generation pipeline. By the end of this article, you will have a working multi-model deployment across all three orchestration tiers: bare Ollama, Docker Compose, and Kubernetes.
How to Run Multiple Local LLMs Simultaneously
- Plan your VRAM budget by summing each model's quantized size plus 15–20% overhead for KV cache and CUDA runtime.
- Select task-specific models and assign quantization levels (Q4_K_M for density, Q5_K_M/Q8 for fidelity) based on priority.
- Configure Ollama with OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, and an appropriate keep-alive duration.
- Deploy a model router (e.g., FastAPI) that dispatches requests to the correct model by task type.
- Containerize each model instance using Docker Compose with GPU passthrough for isolated failure domains.
- Scale to Kubernetes with per-model Deployments, PVCs, and GPU resource requests when you need autoscaling and rolling updates.
- Monitor VRAM utilization and per-model latency via nvidia-smi dmon, Prometheus, and router-level logging.
- Tune keep-alive values and quantization mix iteratively to eliminate cold-start churn and VRAM contention.
Table of Contents
- Prerequisites
- Why Run Multiple Local LLMs Simultaneously?
- Resource Planning for Concurrent Models
- Setting Up Multi-Model Serving with Ollama
- Containerized Multi-Model Orchestration with Docker
- Scaling with Kubernetes for Production Workloads
- Monitoring, Debugging, and Optimization
- Recommended Configurations
Prerequisites
Before starting, ensure the following are in place:
- A Linux host running Ubuntu 22.04 or later. macOS and Windows WSL2 have partial support but are not covered here.
- NVIDIA GPU driver version 525+ for CUDA 12.x compatibility with Ollama.
- The NVIDIA Container Toolkit (nvidia-ctk and nvidia-container-runtime) installed and configured in your Docker daemon (/etc/docker/daemon.json) for GPU passthrough in containers.
- Docker Engine 24.0 or later with Docker Compose v2.
- Ollama installed locally or via container. Pin to a specific version for reproducibility (e.g., ollama/ollama:0.3.6). Some examples below use latest for brevity; replace it with a pinned version tag for production use.
- Python 3.9+ for the planner script, which uses PEP 585 type hints.
- curl available in PATH.
- At least 24 GB GPU VRAM for the three-model examples to fit without CPU spillover.
Why Run Multiple Local LLMs Simultaneously?
The Case for Task-Specific Model Specialization
Smaller, specialized models often match or outperform much larger generalists on narrow tasks while demanding a fraction of the VRAM. A 7B parameter code-focused model can generate more syntactically correct output for programming tasks than a general-purpose 70B model that was not tuned for code (as measured by benchmarks like HumanEval pass rates), and it does so with lower latency and no API round-trip costs. Running these models locally eliminates per-token billing entirely, and it keeps sensitive data, whether proprietary codebases, legal documents, or medical records, from ever leaving the local network.
Common Multi-Model Architectures
Three patterns dominate multi-model local deployments. The router pattern places a dispatcher in front of a model pool, directing each request to the best-fit model based on task type. (The router implementation later in this guide uses an explicit task field in the request rather than an automatic query classifier.) By contrast, the pipeline pattern and parallel pattern address different coordination needs: pipelines chain models sequentially, feeding one model's output into the next (common in extract-then-summarize workflows), while parallel deployments run concurrent inference across models and merge results when you need multiple perspectives on the same input.
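The pipeline and parallel patterns can be sketched without any model server. The block below is a minimal illustration, assuming each "model" is an async callable that takes a prompt and returns text; the names run_pipeline, run_parallel, fake_extract, and fake_summarize are hypothetical stand-ins, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable

# A "model" here is any async callable mapping a prompt to text.
# In a real deployment each one would wrap an HTTP call to a model server.
Model = Callable[[str], Awaitable[str]]


async def run_pipeline(stages: list[Model], text: str) -> str:
    """Pipeline pattern: feed each stage's output into the next stage."""
    for stage in stages:
        text = await stage(text)
    return text


async def run_parallel(models: dict[str, Model], prompt: str) -> dict[str, str]:
    """Parallel pattern: query every model concurrently, collect results by name."""
    names = list(models)
    outputs = await asyncio.gather(*(models[name](prompt) for name in names))
    return dict(zip(names, outputs))


# Hypothetical stand-ins so the sketch runs without a model server.
async def fake_extract(text: str) -> str:
    return text.split(":", 1)[-1].strip()


async def fake_summarize(text: str) -> str:
    return text[:20]


doc = "header: the quick brown fox jumps over the lazy dog"
print(asyncio.run(run_pipeline([fake_extract, fake_summarize], doc)))
print(asyncio.run(run_parallel({"extract": fake_extract, "summarize": fake_summarize}, doc)))
```

In the parallel case, how the collected outputs are merged (voting, concatenation, a judge model) is a separate design decision that this sketch leaves open.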
Resource Planning for Concurrent Models
Hardware Requirements and VRAM Budgeting
VRAM is the primary bottleneck when loading multiple models simultaneously. A model's VRAM footprint depends heavily on quantization level:
| Quantization | 7B Model VRAM (approx.) |
|---|---|
| Q4_K_M | ~4.2 GB (±10%) |
| Q5_K_M | ~5.1 GB (±10%) |
| Q8 | ~7.7 GB (±10%) |
Actual usage varies by model architecture. The practical rule of thumb: total VRAM needed equals the sum of all loaded model sizes plus 15 to 20% overhead to account for KV cache, context window allocations, and CUDA runtime overhead.
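As a quick worked example of that rule of thumb, the sketch below applies the 15% and 20% overhead bounds to three Q4_K_M 7B models, using the approximate sizes from the table above. The function name vram_budget is illustrative only.

```python
# Approximate 7B sizes in GB, taken from the table above.
MODEL_SIZES_GB = {"Q4_K_M": 4.2, "Q5_K_M": 5.1, "Q8": 7.7}


def vram_budget(quants: list[str], overhead: float) -> float:
    """Sum of model sizes plus a fractional overhead (0.15 to 0.20 per the rule of thumb)."""
    return sum(MODEL_SIZES_GB[q] for q in quants) * (1 + overhead)


three_q4 = ["Q4_K_M"] * 3
low = vram_budget(three_q4, 0.15)   # 12.6 GB of weights plus 15%
high = vram_budget(three_q4, 0.20)  # 12.6 GB of weights plus 20%
print(f"Three Q4_K_M 7B models: {low:.1f} to {high:.1f} GB")
```

Under these estimates the three models need roughly 14.5 to 15.1 GB, which fits a 24 GB card with room to spare for longer contexts.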
When you run out of VRAM, Ollama offloads layers to system RAM. This works but introduces substantial latency penalties. Offloading roughly 25% of layers slows inference by about 3x; offloading all layers to RAM can cost 10x or more, depending on memory bandwidth. For latency-sensitive tasks, keeping the entire model in VRAM is critical.
Quantization Strategy for Multi-Model Density
Mixing quantization levels across models is a pragmatic strategy. A code generation model where output correctness matters most might warrant Q5_K_M or Q8 quantization (Q8 costs roughly 2.6 GB more VRAM than Q5_K_M for a 7B model, so choose Q8 only when you have the headroom and need maximum fidelity). A summarization model handling less precision-critical tasks can run at Q4_K_M to conserve VRAM. This trade-off between quality and density is the central lever for fitting more models into a fixed VRAM budget.
#!/usr/bin/env python3
# Requires Python 3.9+
"""VRAM/RAM allocation planner for multi-model deployments."""

# Values are midpoint estimates; actual usage varies ±10% by model architecture.
# Source: llama.cpp quantization size tables (approximate).
QUANT_SIZES_GB = {
    "Q4_K_M": {4: 2.3, 7: 4.2, 13: 7.9, 34: 20.0},
    "Q5_K_M": {4: 2.8, 7: 5.1, 13: 9.4, 34: 23.8},
    "Q8": {4: 4.0, 7: 7.7, 13: 13.5, 34: 35.0},
}
OVERHEAD_FACTOR = 0.18


def plan_allocation(models: list[dict], vram_gb: float, ram_gb: float) -> dict:
    """
    models: list of {"name": str, "params_b": int, "quant": str, "priority": int}
    priority: lower number = higher priority for VRAM placement
    """
    models = list(models)  # Handle generators safely

    # Validate all models before allocating
    for m in models:
        if m["quant"] not in QUANT_SIZES_GB:
            raise ValueError(
                f"Unknown quantization '{m['quant']}' for model '{m['name']}'. "
                f"Valid: {list(QUANT_SIZES_GB.keys())}"
            )
        if m["params_b"] not in QUANT_SIZES_GB[m["quant"]]:
            raise ValueError(
                f"Unknown params_b '{m['params_b']}' for quant '{m['quant']}' "
                f"in model '{m['name']}'. "
                f"Valid: {list(QUANT_SIZES_GB[m['quant']].keys())}"
            )

    sorted_models = sorted(models, key=lambda m: m["priority"])
    vram_used = 0.0
    allocation = {"vram": [], "ram_spillover": [], "total_vram_gb": 0, "total_ram_gb": 0}

    # Place models in priority order; anything that does not fit spills to system RAM
    for model in sorted_models:
        size = QUANT_SIZES_GB[model["quant"]][model["params_b"]]
        size_with_overhead = size * (1 + OVERHEAD_FACTOR)
        if vram_used + size_with_overhead <= vram_gb:
            allocation["vram"].append({
                "name": model["name"], "quant": model["quant"],
                "vram_gb": round(size_with_overhead, 2)
            })
            vram_used += size_with_overhead
        else:
            allocation["ram_spillover"].append({
                "name": model["name"], "quant": model["quant"],
                "ram_gb": round(size_with_overhead, 2),
                "warning": "Expect 3-10x latency increase with CPU offloading"
            })

    allocation["total_vram_gb"] = round(vram_used, 2)
    total_ram_spillover = round(
        sum(m["ram_gb"] for m in allocation["ram_spillover"]), 2
    )
    allocation["total_ram_gb"] = total_ram_spillover
    if total_ram_spillover > ram_gb:
        allocation["ram_warning"] = (
            f"RAM spillover ({total_ram_spillover} GB) exceeds available RAM ({ram_gb} GB). "
            "Host may OOM. Reduce model count or quantization level."
        )
    return allocation


if __name__ == "__main__":
    models = [
        {"name": "codellama:7b", "params_b": 7, "quant": "Q5_K_M", "priority": 1},
        {"name": "mistral:7b", "params_b": 7, "quant": "Q4_K_M", "priority": 2},
        {"name": "phi3:mini", "params_b": 4, "quant": "Q4_K_M", "priority": 3},
    ]
    result = plan_allocation(models, vram_gb=24.0, ram_gb=64.0)
    for section in ("vram", "ram_spillover"):
        print(f"\n=== {section.upper()} ===")
        for m in result[section]:
            size_gb = m["vram_gb"] if "vram_gb" in m else m["ram_gb"]
            print(f"  {m['name']} ({m.get('quant')}): {size_gb} GB")
    print(f"\nTotal VRAM: {result['total_vram_gb']} GB | Total RAM spillover: {result['total_ram_gb']} GB")
    if "ram_warning" in result:
        print(f"WARNING: {result['ram_warning']}")
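The quality-versus-density trade-off can also be run in the other direction: instead of checking whether a fixed mix fits, start every model at Q4_K_M and upgrade the highest-priority models while the budget holds. The following is a rough sketch of that greedy idea for 7B models only, reusing the table values and the 18% midpoint overhead; pick_quant_mix and its greedy policy are assumptions for illustration, not a method prescribed by Ollama or llama.cpp.

```python
# 7B sizes in GB from the table earlier in this section, ordered low to high fidelity.
SIZES_7B_GB = {"Q4_K_M": 4.2, "Q5_K_M": 5.1, "Q8": 7.7}
LADDER = ["Q4_K_M", "Q5_K_M", "Q8"]
OVERHEAD = 0.18  # midpoint of the 15-20% rule of thumb


def pick_quant_mix(names_by_priority: list[str], vram_gb: float) -> dict[str, str]:
    """Start every model at Q4_K_M, then upgrade in priority order while the budget holds."""
    mix = {name: LADDER[0] for name in names_by_priority}

    def total(candidate: dict[str, str]) -> float:
        return sum(SIZES_7B_GB[q] for q in candidate.values()) * (1 + OVERHEAD)

    if total(mix) > vram_gb:
        raise ValueError("Even an all-Q4_K_M mix does not fit; load fewer models")

    for name in names_by_priority:
        for quant in LADDER[1:]:
            trial = {**mix, name: quant}
            if total(trial) > vram_gb:
                break  # this model stays at its current level
            mix = trial
    return mix


print(pick_quant_mix(["code", "summarize", "chat"], vram_gb=24.0))
print(pick_quant_mix(["code", "summarize", "chat"], vram_gb=16.0))
```

With these estimates, a 24 GB budget lets the two highest-priority models reach Q8 while the third stays at Q4_K_M, and a 16 GB budget only allows the top model to move up to Q5_K_M. A greedy pass like this spends headroom on the first model in the list, so the ordering is the real policy decision.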
Setting Up Multi-Model Serving with Ollama
Configuring Ollama for Concurrent Model Loading
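Ollama supports concurrent model serving through two critical environment variables. OLLAMA_MAX_LOADED_MODELS caps how many models can stay loaded at once, provided they fit in available memory. According to the Ollama FAQ, recent releases default to three times the number of GPUs (or 3 for CPU inference), but the default has changed between versions, so set the value explicitly rather than relying on it. Use ollama ps to see which models are currently loaded. Setting the variable to 3 or higher allows multiple models to stay resident without being evicted between requests. OLLAMA_NUM_PARALLEL sets the maximum number of parallel requests each loaded model processes at the same time; every parallel slot needs its own context allocation, so raising it increases memory use per model.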
The keep-alive duration determines how long an idle model remains in VRAM before Ollama unloads it. A keep-alive under 5m causes cold-start churn; over 30m wastes VRAM on idle models. Start at 10-15m and adjust based on your request patterns.
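The server-wide keep-alive can also be overridden per request: the Ollama generate endpoint accepts a keep_alive field in the request body. The sketch below shows one way to attach a per-model duration using only the standard library. The durations in KEEP_ALIVE_BY_MODEL are made-up examples, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

# Hypothetical per-model durations; tune these to your own traffic.
KEEP_ALIVE_BY_MODEL = {
    "codellama:7b-code-q4_K_M": "30m",  # used constantly, keep it resident
    "phi3:mini": "5m",                  # used occasionally, let it unload sooner
}


def build_payload(model: str, prompt: str) -> dict:
    """Build an /api/generate body, attaching keep_alive when one is configured."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if model in KEEP_ALIVE_BY_MODEL:
        payload["keep_alive"] = KEEP_ALIVE_BY_MODEL[model]
    return payload


def generate(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the request to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Models without an entry fall back to whatever OLLAMA_KEEP_ALIVE the server was started with, which keeps the override table short.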
# Pull the three models
ollama pull codellama:7b-code-q4_K_M
ollama pull mistral:7b-instruct-q4_K_M
ollama pull phi3:mini
# Launch Ollama with multi-model concurrency
OLLAMA_MAX_LOADED_MODELS=3 \
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_KEEP_ALIVE="10m" \
ollama serve &
OLLAMA_PID=$!
# Wait for Ollama to be ready before proceeding (with timeout)
i=0
until curl -sf http://localhost:11434/api/tags >/dev/null; do
  i=$((i + 1))
  echo "Waiting for Ollama... attempt ${i}"
  if [ "$i" -ge 60 ]; then
    echo "ERROR: Ollama did not become ready within 60 seconds. Exiting."
    exit 1
  fi
  sleep 1
done
# Verify models are available
ollama list
# Warm-load all three models by sending an initial request to each
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"codellama:7b-code-q4_K_M","prompt":"// hello","stream":false}'
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"mistral:7b-instruct-q4_K_M","prompt":"Hello","stream":false}'
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"phi3:mini","prompt":"Hello","stream":false}'
# Confirm loaded models
curl http://localhost:11434/api/ps
Note: If Ollama is already running as a system service on port 11434, the ollama serve command above will fail with a port conflict. Stop the existing service first (systemctl stop ollama) or choose a different port with OLLAMA_HOST=0.0.0.0:11435.
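The JSON returned by /api/ps can be checked programmatically to confirm that each model is fully resident in VRAM rather than partially offloaded. The sketch below assumes each entry carries name, size, and size_vram fields, which is how the Ollama API documentation describes the response; verify the field names against your installed version. The sample payload is hand-written for illustration, not captured from a server.

```python
import json


def summarize_loaded(ps_json: str) -> list[dict]:
    """Report each loaded model and what fraction of it is resident in VRAM."""
    report = []
    for entry in json.loads(ps_json).get("models", []):
        size = entry.get("size", 0)
        in_vram = entry.get("size_vram", 0)
        report.append({
            "name": entry.get("name", "?"),
            "size_gb": round(size / 1e9, 2),
            "vram_fraction": round(in_vram / size, 2) if size else 0.0,
        })
    return report


# Hand-written sample in the assumed shape.
sample = json.dumps({"models": [
    {"name": "phi3:mini", "size": 4_000_000_000, "size_vram": 4_000_000_000},
    {"name": "mistral:7b-instruct-q4_K_M", "size": 5_000_000_000, "size_vram": 2_500_000_000},
]})

for row in summarize_loaded(sample):
    flag = "" if row["vram_fraction"] >= 1.0 else "  <- partially offloaded to CPU"
    print(f"{row['name']}: {row['size_gb']} GB, {row['vram_fraction']:.0%} in VRAM{flag}")
```

A fraction below 1.0 after warm-loading is an early sign that the VRAM budget from the planning section was too optimistic.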
Building a Model Router
A lightweight router maps incoming requests to the appropriate model based on declared task type, implementing the router pattern described earlier.
Install dependencies first:
pip install "fastapi>=0.100.0" "httpx>=0.25.0" "uvicorn[standard]>=0.23.0"
"""Model router — FastAPI service that dispatches to the correct Ollama model."""
import os

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import Literal

app = FastAPI()

MODEL_MAP = {
    "code": "codellama:7b-code-q4_K_M",
    "summarize": "mistral:7b-instruct-q4_K_M",
    "chat": "phi3:mini",
}
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/generate")


class InferenceRequest(BaseModel):
    prompt: str = Field(..., max_length=32768)
    task: Literal["code", "summarize", "chat"]
    stream: bool = False


@app.post("/infer")
async def infer(req: InferenceRequest):
    model = MODEL_MAP[req.task]  # Literal type guarantees key exists
    if req.stream:
        # A streamed Ollama reply is newline-delimited JSON, which resp.json()
        # below cannot parse, so this router only serves complete responses.
        raise HTTPException(
            status_code=400,
            detail="Streaming is not supported by this router"
        )
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(OLLAMA_URL, json={
                "model": model, "prompt": req.prompt, "stream": False
            })
            resp.raise_for_status()
            return resp.json()
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=502,
            detail=f"Upstream model error: {e.response.status_code}"
        )
    except httpx.RequestError as e:
        raise HTTPException(
            status_code=503,
            detail=f"Ollama unreachable: {type(e).__name__}"
        )

# Run: uvicorn router:app --host 0.0.0.0 --port 8000
Security note: The command above binds to all interfaces (0.0.0.0) with no authentication. For non-local deployments, place the router behind an authenticated reverse proxy (e.g., nginx with auth_basic) or add an API key header check before exposing it on any shared network.
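The API key check mentioned in the security note could look roughly like the following. This is a framework-independent sketch: the X-API-Key header name and the idea of reading the expected key from an environment variable are assumptions for illustration, and is_authorized is a hypothetical helper rather than anything FastAPI provides.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_key: Optional[str]) -> bool:
    """Return True only when the X-API-Key header matches the configured key."""
    if not expected_key:
        return False  # fail closed when no key has been configured
    supplied = ""
    for name, value in headers.items():
        if name.lower() == "x-api-key":  # HTTP header names are case-insensitive
            supplied = value
            break
    # compare_digest avoids leaking the match position through timing differences
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))


print(is_authorized({"X-API-Key": "s3cret"}, "s3cret"))
print(is_authorized({"X-API-Key": "wrong"}, "s3cret"))
```

In the router, a handler could call this with the incoming request headers and a key loaded from the environment, returning a 401 response when it yields False. A check like this is a minimum; it does not replace TLS or a proper reverse proxy on shared networks.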
Containerized Multi-Model Orchestration with Docker
Docker Compose for Isolated Model Instances
Containers let you restart one model without killing the others, and they give you reproducible builds across machines. Note that a GPU reservation under Docker Compose's deploy.resources.reservations.devices grants a container access to GPUs but does not enforce VRAM limits. On a single-GPU system, all containers share the same physical device and its total VRAM. True hardware-level GPU isolation requires NVIDIA MIG or multiple physical GPUs.
Running one Ollama container per model versus a single Ollama instance serving all models is a real trade-off: per-model containers give you independent failure domains, but they consume more system resources from duplicated Ollama processes. A single multi-model Ollama instance is more memory-efficient but lacks hard isolation between models.
Important: The NVIDIA Container Toolkit (nvidia-ctk, nvidia-container-runtime) must be installed and configured in your Docker daemon before GPU passthrough will work. See the NVIDIA Container Toolkit installation guide.
The following Compose file is configured for a multi-GPU system (3 GPUs), with each container pinned to a separate GPU via CUDA_VISIBLE_DEVICES. If you have a single GPU, remove all CUDA_VISIBLE_DEVICES entries — the containers will share one GPU and contend for its VRAM.
# docker-compose.yml — Three isolated Ollama model containers
# Docker Compose v2+ (version field deprecated)
#
# Notes on the entrypoints:
# - Compose interpolates $VAR in this file, so literal shell dollars are written as $$.
# - Readiness is checked with "ollama list" because the official image
#   may not ship curl.
# Note on GPUs: "count: all" exposes every GPU so CUDA_VISIBLE_DEVICES can select one.
# With "count: 1" a container typically sees a single device as index 0, so
# CUDA_VISIBLE_DEVICES=1 or 2 would match nothing.
services:
  codellama:
    image: ollama/ollama:latest  # Pin to a specific tag (e.g., ollama/ollama:0.3.6) for production
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - codellama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull codellama:7b-code-q4_K_M
        wait "$$OLLAMA_PID"
    networks:
      - llm-net

  mistral:
    image: ollama/ollama:latest  # Pin to a specific tag for production
    ports:
      - "11435:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
      - CUDA_VISIBLE_DEVICES=1
    volumes:
      - mistral_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull mistral:7b-instruct-q4_K_M
        wait "$$OLLAMA_PID"
    networks:
      - llm-net

  phi3:
    image: ollama/ollama:latest  # Pin to a specific tag for production
    ports:
      - "11436:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
      - CUDA_VISIBLE_DEVICES=2
    volumes:
      - phi3_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull phi3:mini
        wait "$$OLLAMA_PID"
    networks:
      - llm-net

networks:
  llm-net:
    driver: bridge

volumes:
  codellama_data:
  mistral_data:
  phi3_data:
Production tip: The ollama pull in each entrypoint re-downloads the model on every cold start if the volume is missing or the image changes, and it requires internet access. For production or air-gapped environments, build a custom image with the model pre-baked:
FROM ollama/ollama:0.3.6
# Readiness is polled with "ollama list" because the base image may not include curl.
RUN ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && ollama pull codellama:7b-code-q4_K_M
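With one container per model, each model sits behind a different host port, so a router can no longer send every request to a single Ollama URL. One possible way to handle that is a small lookup from task to endpoint and model, sketched below. The host ports mirror the Compose port mappings above; the ENDPOINTS table and resolve function are illustrative names.

```python
# Host ports published by the Compose file above; adjust if you change the mappings.
ENDPOINTS = {
    "code": ("http://localhost:11434", "codellama:7b-code-q4_K_M"),
    "summarize": ("http://localhost:11435", "mistral:7b-instruct-q4_K_M"),
    "chat": ("http://localhost:11436", "phi3:mini"),
}


def resolve(task: str) -> tuple[str, str]:
    """Return (generate_url, model_name) for a task, or raise on unknown tasks."""
    try:
        base_url, model = ENDPOINTS[task]
    except KeyError:
        raise ValueError(f"Unknown task '{task}'. Valid: {sorted(ENDPOINTS)}") from None
    return f"{base_url}/api/generate", model


url, model = resolve("summarize")
print(url, model)
```

If the router itself runs as a service on the same llm-net network, the base URLs would instead use the Compose service names on the container port, for example http://mistral:11434.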
GPU Partitioning Across Containers
Setting CUDA_VISIBLE_DEVICES per container restricts which physical GPUs a container can access, but on a single-GPU system all containers with CUDA_VISIBLE_DEVICES=0 still share that GPU with no VRAM enforcement. Supported NVIDIA data-center GPUs (A100, H100, A30, A10) offer Multi-Instance GPU (MIG) for hardware-level GPU partitioning when an administrator explicitly enables MIG mode (nvidia-smi -i 0 -mig 1). On consumer GPUs without MIG, containers share the same GPU and rely on Ollama's memory management, which means VRAM contention is possible if total allocations exceed physical capacity.
Scaling with Kubernetes for Production Workloads
Kubernetes Deployment for Multi-Model Serving
When Docker Compose's limitations become apparent, particularly around automated health checks, rolling restarts, load balancing, and scaling, Kubernetes provides the necessary orchestration layer. The NVIDIA GPU Operator and its device plugin expose GPU resources as schedulable Kubernetes resources. (The GPU Operator must be installed in your cluster first; see the NVIDIA GPU Operator documentation for installation instructions.)
Each model gets its own Deployment and Service. An init container pulls the model before the main container starts serving, ensuring the model is present when the liveness probe begins. The init container does not require a GPU — it only needs CPU and network access to download the model files.
# k8s-multi-model.yaml
# The init containers poll readiness with "ollama list" (the image may not ship curl)
# and exit with the pull's status so a failed download blocks the pod from starting.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: codellama-model-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi  # Adjust to model size
  storageClassName: standard  # Replace with your cluster's storage class
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-model-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codellama-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: codellama
  template:
    metadata:
      labels:
        app: codellama
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:0.3.6
          command:
            - /bin/sh
            - -c
            - |
              ollama serve &
              OLLAMA_PID=$!
              i=0
              until ollama list >/dev/null 2>&1; do
                i=$((i+1))
                if [ "$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
                sleep 1
              done
              ollama pull codellama:7b-code-q4_K_M
              status=$?
              kill "$OLLAMA_PID"
              wait "$OLLAMA_PID" 2>/dev/null
              exit "$status"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1"
      containers:
        - name: ollama
          image: ollama/ollama:0.3.6
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /api/tags
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 15
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: codellama-model-pvc
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral
  template:
    metadata:
      labels:
        app: mistral
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:0.3.6
          command:
            - /bin/sh
            - -c
            - |
              ollama serve &
              OLLAMA_PID=$!
              i=0
              until ollama list >/dev/null 2>&1; do
                i=$((i+1))
                if [ "$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
                sleep 1
              done
              ollama pull mistral:7b-instruct-q4_K_M
              status=$?
              kill "$OLLAMA_PID"
              wait "$OLLAMA_PID" 2>/dev/null
              exit "$status"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1"
      containers:
        - name: ollama
          image: ollama/ollama:0.3.6
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /api/tags
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 15
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: mistral-model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: codellama-service
spec:
  selector:
    app: codellama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-service
spec:
  selector:
    app: mistral
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
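Because the per-model manifests differ only in a few names, they can be generated rather than copied. The sketch below builds a simplified Deployment and Service for one model as plain dictionaries and prints them as a JSON List, a format kubectl also accepts. It deliberately omits the init container and probes shown above, and the model_manifests helper is a hypothetical convenience, not part of any Kubernetes client library.

```python
import json


def model_manifests(app: str, image: str = "ollama/ollama:0.3.6") -> list[dict]:
    """Build a simplified Deployment and ClusterIP Service for one model."""
    labels = {"app": app}
    container = {
        "name": "ollama",
        "image": image,
        "ports": [{"containerPort": 11434}],
        "env": [{"name": "OLLAMA_MAX_LOADED_MODELS", "value": "1"}],
        "resources": {"limits": {"nvidia.com/gpu": "1"}},
        "volumeMounts": [{"name": "model-storage", "mountPath": "/root/.ollama"}],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{app}-serving"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [container],
                    "volumes": [{
                        "name": "model-storage",
                        "persistentVolumeClaim": {"claimName": f"{app}-model-pvc"},
                    }],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{app}-service"},
        "spec": {
            "selector": labels,
            "ports": [{"port": 11434, "targetPort": 11434}],
            "type": "ClusterIP",
        },
    }
    return [deployment, service]


items = [m for app in ("codellama", "mistral") for m in model_manifests(app)]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

For anything beyond a handful of models, a templating tool such as Helm or Kustomize is the more conventional route; the point here is only that the manifests share one shape.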
Autoscaling and Model Lifecycle Management
Tie horizontal pod autoscaling to inference queue depth using custom Prometheus metrics to scale replicas of high-demand models while keeping low-traffic models at a single replica. Custom metric HPA requires KEDA (helm install keda kedacore/keda) or a Prometheus adapter; standard HPA cannot consume arbitrary Prometheus metrics. Preemption strategies matter: when VRAM is scarce, low-priority model pods can be evicted using Kubernetes priority classes, freeing GPU resources for critical inference tasks. Use standard Kubernetes rollout strategies for model version updates; for true zero-downtime updates, ensure replicas: 2 or configure a PodDisruptionBudget with minAvailable: 1, since a single-replica rolling update causes a brief service interruption.
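To make the queue-depth scaling concrete, the sketch below applies the scaling rule documented for the Horizontal Pod Autoscaler: desired replicas equal the current replica count times the ratio of the observed metric to its target, rounded up. The queue-depth numbers and the min/max bounds are illustrative assumptions; the real autoscaler also applies a tolerance band and stabilization windows that are not modeled here.

```python
import math


def desired_replicas(current: int, queue_depth: float, target_depth: float,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """ceil(current * observed / target), clamped to the configured bounds."""
    if target_depth <= 0:
        raise ValueError("target_depth must be positive")
    raw = math.ceil(current * (queue_depth / target_depth))
    return max(min_replicas, min(max_replicas, raw))


# Two replicas, each averaging 15 queued requests against a target of 10.
print(desired_replicas(current=2, queue_depth=15, target_depth=10))
# A burst to 30 queued requests per replica hits the max_replicas cap.
print(desired_replicas(current=2, queue_depth=30, target_depth=10))
```

On GPU nodes the max_replicas bound matters more than usual, since every extra replica requests a whole GPU and may simply stay Pending if none is free.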
Monitoring, Debugging, and Optimization
Monitoring VRAM and Inference Latency
Continuous nvidia-smi dmon monitoring reveals per-GPU VRAM utilization, temperature, and compute load. (Note: nvidia-smi dmon requires a native Linux driver installation and is not available on WSL2; use nvidia-smi without dmon in WSL2 environments.) For structured observability, feed GPU metrics from the NVIDIA DCGM exporter into Prometheus and visualize them in Grafana. (Requires dcgm-exporter deployment; see the NVIDIA DCGM documentation for installation.) Logging per-model response times at the router layer exposes contention: if one model's p99 latency exceeds 2x its baseline while others remain stable, that model is likely experiencing VRAM pressure or KV cache eviction.
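The 2x-baseline check on p99 latency can be sketched in a few lines. The block below assumes the router has collected per-model latency samples in seconds and that a baseline p99 per model was recorded under light load; the function names and sample data are hypothetical.

```python
def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def contended_models(latencies: dict[str, list[float]],
                     baselines: dict[str, float],
                     factor: float = 2.0) -> list[str]:
    """Names of models whose current p99 exceeds factor times their baseline p99."""
    return sorted(
        name for name, samples in latencies.items()
        if samples and name in baselines and p99(samples) > factor * baselines[name]
    )


# Made-up samples: "code" has a slow tail, "chat" is steady.
latencies = {"code": [0.5] * 90 + [3.0] * 10, "chat": [0.2] * 100}
baselines = {"code": 1.0, "chat": 0.3}
print(contended_models(latencies, baselines))
```

Comparing each model against its own baseline, rather than against a global threshold, keeps a naturally slower model from being flagged just for being larger.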
Common Pitfalls
Frequently swapping models in and out of memory fragments VRAM, leaving unusable gaps in the GPU memory space. (This is allocator-level fragmentation in the CUDA memory pool, not disk fragmentation.) A model can also fail with an out-of-memory error, sometimes with little warning in the logs, when its actual memory use, including KV cache growth during long contexts, exceeds your estimate. Overly long keep-alive values can hold VRAM hostage for idle models, blocking higher-priority loads. The fix is deliberate lifecycle management: set keep-alive values proportional to expected request frequency per model.
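One possible way to derive a keep-alive from request frequency is to take a multiple of the median gap between requests and clamp it to the 5 to 30 minute range suggested earlier. The sketch below does exactly that; the multiplier of 3 is an arbitrary starting assumption, and suggest_keep_alive is a hypothetical helper. It returns a duration string in seconds, a form Ollama's keep-alive setting is expected to accept alongside values like 10m.

```python
import statistics


def suggest_keep_alive(request_times: list[float], multiplier: float = 3.0,
                       floor_s: int = 300, ceiling_s: int = 1800) -> str:
    """Suggest a keep-alive from the median gap between requests, clamped to 5-30 minutes."""
    if len(request_times) < 2:
        return f"{floor_s}s"  # not enough data, fall back to the floor
    ordered = sorted(request_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    seconds = int(statistics.median(gaps) * multiplier)
    return f"{max(floor_s, min(ceiling_s, seconds))}s"


# Request timestamps in seconds since some start point (made-up data).
print(suggest_keep_alive([0, 200, 400, 600]))   # steady traffic every 200 s
print(suggest_keep_alive([0, 3600, 7200]))      # hourly traffic hits the ceiling
```

A model that is only hit hourly gets capped at 30 minutes, which means it will usually cold-start; that is the intended outcome when its VRAM is better spent on busier models.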
Recommended Configurations
The three architecture patterns map to different operational needs: the router pattern suits varied workloads with distinct task types, the pipeline pattern fits sequential processing chains, and the parallel pattern enables ensemble-style inference. A practical starting point is Docker Compose with Ollama serving two to three Q4_K_M-quantized 7B models on a single 24 GB GPU, which leaves adequate headroom for KV cache and overhead. (This assumes no other VRAM consumers are active on the GPU.)
| Hardware Tier | Concurrent 7B Models (Q4_K_M) | Suggested Quantization Mix | Orchestration Tool |
|---|---|---|---|
| 24 GB (RTX 4090) | 2 to 3 | Q4_K_M across all | Ollama standalone or Docker Compose |
| 48 GB single (RTX A6000 / RTX 6000 Ada) or 2x 24 GB (e.g., 2x RTX 3090) | 4 to 6 | Q5_K_M for critical, Q4_K_M for others | Docker Compose |
| 2x48 GB | 8+ or mix of 7B/13B | Q8 for primary, Q4_K_M for auxiliary | Kubernetes |
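For teams ready to push further, Mixture of Experts (MoE) models, which llama.cpp can run, apply a related routing idea inside a single model: a learned gating network sends each token to a small subset of expert sub-networks. This reduces compute per token, although the full set of experts generally still has to be loaded, so MoE complements rather than replaces the VRAM planning described in this guide.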
