Running Multiple Local LLMs Simultaneously: Multi-Model Setup Guide

SitePoint Team

Published in

AI·Computing·DevOps·

March 20, 2026

Running multiple local LLMs simultaneously has moved from experimental curiosity to practical necessity. Professionals building AI-powered workflows increasingly need concurrent access to task-specific models: one tuned for code generation, another for summarization, a third embedded in a retrieval-augmented generation pipeline. By the end of this article, you will have a working multi-model deployment across all three orchestration tiers: bare Ollama, Docker Compose, and Kubernetes.

How to Run Multiple Local LLMs Simultaneously

Plan your VRAM budget by summing each model's quantized size plus 15–20% overhead for KV cache and CUDA runtime.
Select task-specific models and assign quantization levels (Q4_K_M for density, Q5_K_M/Q8 for fidelity) based on priority.
Configure Ollama with OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, and an appropriate keep-alive duration.
Deploy a model router (e.g., FastAPI) that dispatches requests to the correct model by task type.
Containerize each model instance using Docker Compose with GPU passthrough for isolated failure domains.
Scale to Kubernetes with per-model Deployments, PVCs, and GPU resource requests when you need autoscaling and rolling updates.
Monitor VRAM utilization and per-model latency via nvidia-smi dmon, Prometheus, and router-level logging.
Tune keep-alive values and quantization mix iteratively to eliminate cold-start churn and VRAM contention.

Prerequisites

Before starting, ensure the following are in place:

A Linux host running Ubuntu 22.04 or later. macOS and Windows WSL2 have partial support but are not covered here.
NVIDIA GPU driver version 525+ for CUDA 12.x compatibility with Ollama.
The NVIDIA Container Toolkit (nvidia-ctk and nvidia-container-runtime) installed and configured in your Docker daemon (/etc/docker/daemon.json) for GPU passthrough in containers.
Docker Engine 24.0 or later with Docker Compose v2.
Ollama installed locally or via container. Pin to a specific version for reproducibility (e.g., ollama/ollama:0.3.6). The examples below use latest for brevity; replace with a pinned version tag for production use.
Python 3.9+ for the planner script, which uses PEP 585 type hints.
curl available in PATH.
At least 24 GB GPU VRAM for the three-model examples to fit without CPU spillover.

Why Run Multiple Local LLMs Simultaneously?

The Case for Task-Specific Model Specialization

Smaller, specialized models can match or outperform much larger generalists on narrow tasks while demanding a fraction of the VRAM. A 7B code-focused model can produce functionally correct code at a rate that rivals, and on benchmarks such as HumanEval sometimes exceeds, a general-purpose 70B model, and it does so with lower latency and no API round-trip costs. Running these models locally eliminates per-token billing entirely, and it keeps sensitive data, whether proprietary codebases, legal documents, or medical records, from ever leaving the local network.

Common Multi-Model Architectures

Three patterns dominate multi-model local deployments. The router pattern places a dispatcher in front of a model pool, directing each request to the best-fit model based on task type. (The router implementation later in this guide uses an explicit task field in the request rather than an automatic query classifier.) By contrast, the pipeline pattern and parallel pattern address different coordination needs: pipelines chain models sequentially, feeding one model's output into the next (common in extract-then-summarize workflows), while parallel deployments run concurrent inference across models and merge results when you need multiple perspectives on the same input.
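
The pipeline and parallel patterns can be sketched in a few lines of Python. This is a minimal illustration, not part of the router built later: the extract, summarize, and critique functions are hypothetical stand-ins for real inference calls, and any async function that maps a prompt to text could take their place.

```python
import asyncio
from typing import Awaitable, Callable

# A "model" here is any async function from input text to output text.
Model = Callable[[str], Awaitable[str]]


async def run_pipeline(stages: list[Model], text: str) -> str:
    """Pipeline pattern: feed each stage's output into the next stage."""
    for stage in stages:
        text = await stage(text)
    return text


async def run_parallel(models: dict[str, Model], text: str) -> dict[str, str]:
    """Parallel pattern: send the same input to every model concurrently."""
    names = list(models)
    outputs = await asyncio.gather(*(models[name](text) for name in names))
    return dict(zip(names, outputs))


# Hypothetical stand-ins for real inference calls.
async def extract(text: str) -> str:
    return text.split(".")[0]


async def summarize(text: str) -> str:
    return f"summary({text})"


async def critique(text: str) -> str:
    return f"critique({text})"


if __name__ == "__main__":
    doc = "Ollama serves models. It also manages VRAM."
    print(asyncio.run(run_pipeline([extract, summarize], doc)))
    print(asyncio.run(run_parallel({"sum": summarize, "crit": critique}, doc)))
```

In the parallel case, asyncio.gather only overlaps the waiting; whether the models actually run concurrently depends on the serving layer having all of them loaded.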

Resource Planning for Concurrent Models

Hardware Requirements and VRAM Budgeting

VRAM is the primary bottleneck when loading multiple models simultaneously. A model's VRAM footprint depends heavily on quantization level:

Quantization	7B Model VRAM (approx.)
Q4_K_M	~4.2 GB (±10%)
Q5_K_M	~5.1 GB (±10%)
Q8	~7.7 GB (±10%)

Actual usage varies by model architecture. The practical rule of thumb: total VRAM needed equals the sum of all loaded model sizes plus 15 to 20% overhead to account for KV cache, context window allocations, and CUDA runtime overhead.
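
To see why the overhead figure is only a starting point, it helps to size the KV cache directly. The sketch below uses the standard formula (one key and one value vector per token, per layer) and assumes a Llama-2-7B-style shape of 32 layers, 32 KV heads, and a head dimension of 128 with a 16-bit cache; other architectures will differ, so treat the numbers as illustrative.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Size of the key/value cache for one sequence, in GiB.

    Each token stores one key and one value vector (hence the factor 2)
    per layer, each holding n_kv_heads * head_dim elements.
    """
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return bytes_total / 1024**3


if __name__ == "__main__":
    # Assumed shape: 32 layers, 32 KV heads, head dim 128, 16-bit cache.
    print(kv_cache_gib(32, 32, 128, context_tokens=4096))  # 2.0
    # Grouped-query attention with 8 KV heads needs a quarter of that.
    print(kv_cache_gib(32, 8, 128, context_tokens=4096))   # 0.5
```

Long context windows, or several parallel requests per model, can push the cache well past 20% of the weights, so re-check the budget whenever you raise either.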

When you run out of VRAM, Ollama offloads layers to system RAM. This works but introduces substantial latency penalties. As a rough guide, offloading about 25% of layers can slow inference by around 3x, and offloading every layer can cost 10x or more, depending on memory bandwidth and CPU. For latency-sensitive tasks, keeping the entire model in VRAM is critical.

Quantization Strategy for Multi-Model Density

Mixing quantization levels across models is a pragmatic strategy. A code generation model where output correctness matters most might warrant Q5_K_M or Q8 quantization (Q8 costs roughly 2.6 GB more VRAM than Q5_K_M for a 7B model, so choose Q8 only when you have the headroom and need maximum fidelity). A summarization model handling less precision-critical tasks can run at Q4_K_M to conserve VRAM. This trade-off between quality and density is the central lever for fitting more models into a fixed VRAM budget.

#!/usr/bin/env python3
# Requires Python 3.9+
"""VRAM/RAM allocation planner for multi-model deployments."""

# Values are midpoint estimates; actual usage varies ±10% by model architecture.
# Source: llama.cpp quantization size tables (approximate).
QUANT_SIZES_GB = {
    "Q4_K_M": {4: 2.3, 7: 4.2, 13: 7.9, 34: 20.0},
    "Q5_K_M": {4: 2.8, 7: 5.1, 13: 9.4, 34: 23.8},
    "Q8": {4: 4.0, 7: 7.7, 13: 13.5, 34: 35.0},
}
OVERHEAD_FACTOR = 0.18  # within the 15-20% overhead rule of thumb


def plan_allocation(models: list[dict], vram_gb: float, ram_gb: float) -> dict:
    """
    models: list of {"name": str, "params_b": int, "quant": str, "priority": int}
    priority: lower number = higher priority for VRAM placement
    """
    models = list(models)  # Handle generators safely

    # Validate all models before allocating
    for m in models:
        if m["quant"] not in QUANT_SIZES_GB:
            raise ValueError(
                f"Unknown quantization '{m['quant']}' for model '{m['name']}'. "
                f"Valid: {list(QUANT_SIZES_GB.keys())}"
            )
        if m["params_b"] not in QUANT_SIZES_GB[m["quant"]]:
            raise ValueError(
                f"Unknown params_b '{m['params_b']}' for quant '{m['quant']}' "
                f"in model '{m['name']}'. "
                f"Valid: {list(QUANT_SIZES_GB[m['quant']].keys())}"
            )

    # Place the highest-priority models in VRAM first; the rest spill to RAM
    sorted_models = sorted(models, key=lambda m: m["priority"])
    vram_used = 0.0
    allocation = {"vram": [], "ram_spillover": [], "total_vram_gb": 0, "total_ram_gb": 0}
    for model in sorted_models:
        size = QUANT_SIZES_GB[model["quant"]][model["params_b"]]
        size_with_overhead = size * (1 + OVERHEAD_FACTOR)
        if vram_used + size_with_overhead <= vram_gb:
            allocation["vram"].append({
                "name": model["name"], "quant": model["quant"],
                "vram_gb": round(size_with_overhead, 2)
            })
            vram_used += size_with_overhead
        else:
            allocation["ram_spillover"].append({
                "name": model["name"], "quant": model["quant"],
                "ram_gb": round(size_with_overhead, 2),
                "warning": "Expect 3-10x latency increase with CPU offloading"
            })

    allocation["total_vram_gb"] = round(vram_used, 2)
    total_ram_spillover = round(
        sum(m["ram_gb"] for m in allocation["ram_spillover"]), 2
    )
    allocation["total_ram_gb"] = total_ram_spillover
    if total_ram_spillover > ram_gb:
        allocation["ram_warning"] = (
            f"RAM spillover ({total_ram_spillover} GB) exceeds available RAM ({ram_gb} GB). "
            "Host may OOM. Reduce model count or quantization level."
        )
    return allocation


if __name__ == "__main__":
    models = [
        {"name": "codellama:7b", "params_b": 7, "quant": "Q5_K_M", "priority": 1},
        {"name": "mistral:7b", "params_b": 7, "quant": "Q4_K_M", "priority": 2},
        {"name": "phi3:mini", "params_b": 4, "quant": "Q4_K_M", "priority": 3},
    ]
    result = plan_allocation(models, vram_gb=24.0, ram_gb=64.0)
    for section in ("vram", "ram_spillover"):
        print(f"\n=== {section.upper()} ===")
        for m in result[section]:
            size_gb = m["vram_gb"] if "vram_gb" in m else m["ram_gb"]
            print(f"  {m['name']} ({m.get('quant')}) - {size_gb} GB")
    print(f"\nTotal VRAM: {result['total_vram_gb']} GB | Total RAM spillover: {result['total_ram_gb']} GB")
    if "ram_warning" in result:
        print(f"WARNING: {result['ram_warning']}")

Setting Up Multi-Model Serving with Ollama

Configuring Ollama for Concurrent Model Loading

Ollama supports concurrent model serving through two critical environment variables. OLLAMA_MAX_LOADED_MODELS caps how many models can stay loaded at once, provided they fit in memory. The Ollama FAQ gives the default as three times the number of GPUs (or 3 for CPU inference), but defaults have changed between releases, so check the documentation for the version you pin and set the value explicitly rather than relying on it. Use ollama ps to see which models are currently loaded; it does not report the configured limit. OLLAMA_NUM_PARALLEL sets the maximum number of parallel requests that each loaded model will process, so memory for context scales with both settings together.

The keep-alive duration determines how long an idle model remains in VRAM before Ollama unloads it. A keep-alive under 5m causes cold-start churn; over 30m wastes VRAM on idle models. Start at 10-15m and adjust based on your request patterns.

# Pull the three models
ollama pull codellama:7b-code-q4_K_M
ollama pull mistral:7b-instruct-q4_K_M
ollama pull phi3:mini
# Launch Ollama with multi-model concurrency
OLLAMA_MAX_LOADED_MODELS=3 \
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_KEEP_ALIVE="10m" \
ollama serve &
OLLAMA_PID=$!
# Wait for Ollama to be ready before proceeding (with timeout)
i=0
until curl -sf http://localhost:11434/api/tags >/dev/null; do
 i=$((i + 1))
 echo "Waiting for Ollama... attempt ${i}"
 if [ "$i" -ge 60 ]; then
 echo "ERROR: Ollama did not become ready within 60 seconds. Exiting."
 exit 1
 fi
 sleep 1
done
# Verify models are available
ollama list
# Warm-load all three models by sending an initial request to each
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"codellama:7b-code-q4_K_M","prompt":"// hello","stream":false}'
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"mistral:7b-instruct-q4_K_M","prompt":"Hello","stream":false}'
curl -H "Content-Type: application/json" http://localhost:11434/api/generate -d '{"model":"phi3:mini","prompt":"Hello","stream":false}'
# Confirm loaded models
curl http://localhost:11434/api/ps

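Note: If Ollama is already running as a system service on port 11434, the ollama serve command above will fail with a port conflict. Stop the existing service first (systemctl stop ollama) or choose a different port with OLLAMA_HOST=127.0.0.1:11435, and set the same OLLAMA_HOST for the ollama CLI commands and adjust the curl URLs to match. Bind to 0.0.0.0 only if other machines need to reach the server.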
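
The /api/ps response can also tell you whether a loaded model sits entirely in VRAM. The helper below is a small sketch that assumes each entry carries size and size_vram fields in bytes, as the Ollama API documentation describes; the sample data is hand-written for illustration rather than captured from a real server.

```python
def vram_residency(ps_response: dict) -> list[tuple[str, float]]:
    """Return (model name, fraction of the model resident in VRAM) pairs."""
    report = []
    for entry in ps_response.get("models", []):
        size = entry.get("size", 0)
        size_vram = entry.get("size_vram", 0)
        fraction = size_vram / size if size else 0.0
        report.append((entry.get("name", "?"), round(fraction, 2)))
    return report


if __name__ == "__main__":
    # Hand-written sample in the shape of an /api/ps response; sizes are illustrative.
    sample = {
        "models": [
            {"name": "phi3:mini", "size": 4_000_000_000, "size_vram": 4_000_000_000},
            {"name": "mistral:7b-instruct-q4_K_M", "size": 6_000_000_000, "size_vram": 4_500_000_000},
        ]
    }
    for name, fraction in vram_residency(sample):
        status = "fully on GPU" if fraction >= 1.0 else "partially offloaded to CPU"
        print(f"{name}: {fraction:.0%} in VRAM ({status})")
```

A fraction below 1.0 after the warm-load step suggests the VRAM budget is already exceeded and some layers have spilled to system RAM.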

Building a Model Router

A lightweight router maps incoming requests to the appropriate model based on declared task type, implementing the router pattern described earlier.

Install dependencies first:

pip install "fastapi>=0.100.0" "httpx>=0.25.0" "uvicorn[standard]>=0.23.0"

"""Model router — FastAPI service that dispatches to the correct Ollama model."""
import os
from typing import Literal

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

# Task type -> Ollama model tag
MODEL_MAP = {
    "code": "codellama:7b-code-q4_K_M",
    "summarize": "mistral:7b-instruct-q4_K_M",
    "chat": "phi3:mini",
}
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/generate")


class InferenceRequest(BaseModel):
    prompt: str = Field(..., max_length=32768)
    task: Literal["code", "summarize", "chat"]
    stream: bool = False


@app.post("/infer")
async def infer(req: InferenceRequest):
    if req.stream:
        # Ollama streams newline-delimited JSON, which resp.json() below cannot parse
        raise HTTPException(status_code=400, detail="Streaming is not supported by this router")
    model = MODEL_MAP[req.task]  # Literal type guarantees key exists
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(OLLAMA_URL, json={
                "model": model, "prompt": req.prompt, "stream": False
            })
            resp.raise_for_status()
            return resp.json()
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=502,
            detail=f"Upstream model error: {e.response.status_code}"
        )
    except httpx.RequestError as e:
        raise HTTPException(
            status_code=503,
            detail=f"Ollama unreachable: {type(e).__name__}"
        )

# Save as router.py, then run: uvicorn router:app --host 0.0.0.0 --port 8000

Security note: The command above binds to all interfaces (0.0.0.0) with no authentication. For non-local deployments, place the router behind an authenticated reverse proxy (e.g., nginx with auth_basic) or add an API key header check before exposing it on any shared network.
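
As one possible shape for the API key check mentioned above, the sketch below compares a client-supplied key against an expected value in constant time. The ROUTER_API_KEY variable name and the change-me fallback are assumptions made for this example, not settings that FastAPI or Ollama define.

```python
import hmac
import os
from typing import Optional


def is_authorized(presented_key: Optional[str], expected_key: str) -> bool:
    """Constant-time comparison of a client-supplied key against the expected one."""
    if not presented_key or not expected_key:
        return False
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())


if __name__ == "__main__":
    # ROUTER_API_KEY is a hypothetical variable name chosen for this sketch.
    expected = os.getenv("ROUTER_API_KEY", "change-me")
    print(is_authorized(expected, expected))
    print(is_authorized(None, expected))
```

In the router, a check like this could run before the request reaches Ollama, for example by reading a header of your choosing and raising HTTPException(status_code=401) when it fails.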

Containerized Multi-Model Orchestration with Docker

Docker Compose for Isolated Model Instances

Containers let you restart one model without killing the others, and they give you reproducible builds across machines. Note that Docker Compose's deploy.resources.reservations.devices with count: 1 reserves access to a GPU but does not enforce VRAM limits. On a single-GPU system, all containers share the same physical device and its total VRAM. True hardware-level GPU isolation requires NVIDIA MIG or multiple physical GPUs.

Running one Ollama container per model versus a single Ollama instance serving all models is a real trade-off: per-model containers give you independent failure domains, but they consume more system resources from duplicated Ollama processes. A single multi-model Ollama instance is more memory-efficient but lacks hard isolation between models.

Important: The NVIDIA Container Toolkit (nvidia-ctk, nvidia-container-runtime) must be installed and configured in your Docker daemon before GPU passthrough will work. See the NVIDIA Container Toolkit installation guide.

The following Compose file is configured for a multi-GPU system (3 GPUs), with each container pinned to a separate GPU through device_ids. Inside a container the assigned GPU always appears as device 0, so setting CUDA_VISIBLE_DEVICES to the host index there would hide it. If you have a single GPU, set every device_ids entry to ["0"]; the containers will then share one GPU and contend for its VRAM. Two details in the entrypoints matter: every shell $ is written as $$ so that Compose does not try to interpolate it, and the readiness loop uses ollama list instead of curl, which may not be present in the ollama/ollama image.

# docker-compose.yml: three isolated Ollama model containers
# Docker Compose v2+ (version field deprecated)
services:
  codellama:
    image: ollama/ollama:latest # Pin to a specific tag (e.g., ollama/ollama:0.3.6) for production
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
    volumes:
      - codellama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"] # Host GPU index
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull codellama:7b-code-q4_K_M
        wait "$$OLLAMA_PID"
    networks:
      - llm-net
  mistral:
    image: ollama/ollama:latest # Pin to a specific tag for production
    ports:
      - "11435:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
    volumes:
      - mistral_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"] # Host GPU index
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull mistral:7b-instruct-q4_K_M
        wait "$$OLLAMA_PID"
    networks:
      - llm-net
  phi3:
    image: ollama/ollama:latest # Pin to a specific tag for production
    ports:
      - "11436:11434"
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=15m
    volumes:
      - phi3_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2"] # Host GPU index
              capabilities: [gpu]
    entrypoint:
      - /bin/sh
      - -c
      - |
        ollama serve &
        OLLAMA_PID=$$!
        i=0
        until ollama list >/dev/null 2>&1; do
          i=$$((i+1))
          if [ "$$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
          sleep 1
        done
        ollama pull phi3:mini
        wait "$$OLLAMA_PID"
    networks:
      - llm-net
networks:
  llm-net:
    driver: bridge
volumes:
  codellama_data:
  mistral_data:
  phi3_data:

Production tip: The ollama pull in each entrypoint needs network access at every container start, and it downloads the full model again whenever the volume is missing. For production or air-gapped environments, build a custom image with the model pre-baked. The readiness check below uses ollama list because curl may not be present in the base image:

FROM ollama/ollama:0.3.6
RUN ollama serve & until ollama list >/dev/null 2>&1; do sleep 1; done && ollama pull codellama:7b-code-q4_K_M

GPU Partitioning Across Containers

Pinning a container to a GPU, whether through CUDA_VISIBLE_DEVICES or a Compose device reservation, restricts which physical GPUs it can access, but on a single-GPU system every container still shares that GPU with no VRAM enforcement. Supported NVIDIA data-center GPUs (such as the A100, H100, and A30) offer Multi-Instance GPU (MIG) for hardware-level GPU partitioning when an administrator explicitly enables MIG mode (nvidia-smi -i 0 -mig 1). On GPUs without MIG, which includes consumer cards, containers share the same GPU and rely on Ollama's memory management, which means VRAM contention is possible if total allocations exceed physical capacity.
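
Because nothing enforces a per-container VRAM limit on a shared GPU, it can help to check free memory before starting another model. The following sketch parses the CSV output of nvidia-smi's --query-gpu option; the fits helper and the fallback sample line are illustrative additions for this example.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text: str) -> dict[int, tuple[int, int]]:
    """Map GPU index -> (used MiB, total MiB) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def fits(usage: dict[int, tuple[int, int]], gpu: int, needed_mib: int) -> bool:
    """True if the GPU currently has at least needed_mib of free memory."""
    used, total = usage[gpu]
    return total - used >= needed_mib


if __name__ == "__main__":
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        out = "0, 18000, 24564\n"  # illustrative sample for machines without an NVIDIA GPU
    usage = parse_gpu_memory(out)
    print(usage, fits(usage, 0, 5000))
```

The check is only a snapshot: another container can claim the memory between the query and the model load, so treat it as an early warning rather than a guarantee.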

Scaling with Kubernetes for Production Workloads

Kubernetes Deployment for Multi-Model Serving

When Docker Compose's limitations become apparent, particularly around automated health checks, rolling restarts, load balancing, and scaling, Kubernetes provides the necessary orchestration layer. The NVIDIA GPU Operator and its device plugin expose GPU resources as schedulable Kubernetes resources. (The GPU Operator must be installed in your cluster first; see the NVIDIA GPU Operator documentation for installation instructions.)

Each model gets its own Deployment and Service. An init container pulls the model before the main container starts serving, ensuring the model is present when the liveness probe begins. The init container does not require a GPU — it only needs CPU and network access to download the model files.

# k8s-multi-model.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: codellama-model-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi # Adjust to model size
  storageClassName: standard # Replace with your cluster's storage class
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-model-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codellama-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: codellama
  template:
    metadata:
      labels:
        app: codellama
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:0.3.6
          command:
            - /bin/sh
            - -c
            - |
              # ollama list doubles as the readiness check; curl may be absent from the image
              ollama serve &
              OLLAMA_PID=$!
              i=0
              until ollama list >/dev/null 2>&1; do
                i=$((i+1))
                if [ "$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
                sleep 1
              done
              # Fail the init container if the download fails
              ollama pull codellama:7b-code-q4_K_M || exit 1
              kill "$OLLAMA_PID"
              wait "$OLLAMA_PID"
              exit 0
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1"
      containers:
        - name: ollama
          image: ollama/ollama:0.3.6
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /api/tags
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 15
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: codellama-model-pvc
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral
  template:
    metadata:
      labels:
        app: mistral
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:0.3.6
          command:
            - /bin/sh
            - -c
            - |
              # ollama list doubles as the readiness check; curl may be absent from the image
              ollama serve &
              OLLAMA_PID=$!
              i=0
              until ollama list >/dev/null 2>&1; do
                i=$((i+1))
                if [ "$i" -ge 60 ]; then echo 'Ollama failed to start'; exit 1; fi
                sleep 1
              done
              # Fail the init container if the download fails
              ollama pull mistral:7b-instruct-q4_K_M || exit 1
              kill "$OLLAMA_PID"
              wait "$OLLAMA_PID"
              exit 0
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1"
      containers:
        - name: ollama
          image: ollama/ollama:0.3.6
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /api/tags
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 15
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: mistral-model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: codellama-service
spec:
  selector:
    app: codellama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-service
spec:
  selector:
    app: mistral
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP

Autoscaling and Model Lifecycle Management

Tie horizontal pod autoscaling to inference queue depth using custom Prometheus metrics, scaling replicas of high-demand models while keeping low-traffic models at a single replica. The HPA cannot read Prometheus directly: custom-metric scaling requires KEDA (helm install keda kedacore/keda, after adding the kedacore chart repository) or a Prometheus adapter. Preemption strategies matter: when GPUs are scarce, Kubernetes priority classes let the scheduler evict low-priority model pods, freeing GPU resources for critical inference tasks. Use standard Kubernetes rollout strategies for model version updates. A rolling update starts the replacement pod before stopping the old one, which needs a free GPU and a volume the new pod can mount; with every GPU allocated or a ReadWriteOnce PVC bound to another node, the replacement can sit in Pending. For zero-downtime updates, run replicas: 2 with spare GPU capacity, or accept a brief interruption and use the Recreate strategy. A PodDisruptionBudget with minAvailable: 1 protects against voluntary evictions such as node drains, not against rollouts.
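
The scaling decision itself follows the formula in the Kubernetes HPA documentation: desired replicas equal the current count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below applies that formula to a hypothetical metric, queued requests per replica; the replica bounds are assumptions, and the HPA's tolerance band and stabilization window are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """HPA-style scaling: ceil(current * current_metric / target_metric), clamped."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # One replica with 12 queued requests against a target of 5 per replica.
    print(desired_replicas(1, current_metric=12, target_metric=5))  # 3
    # Two replicas that are nearly idle scale back to the minimum.
    print(desired_replicas(2, current_metric=1, target_metric=5))   # 1
```

For GPU-backed models, max_replicas should not exceed the number of GPUs the cluster can actually schedule, or the extra pods will stay Pending.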

Monitoring, Debugging, and Optimization

Monitoring VRAM and Inference Latency

Continuous nvidia-smi dmon monitoring reveals per-GPU VRAM utilization, temperature, and compute load. (Note: nvidia-smi dmon requires a native Linux driver installation and is not available on WSL2; use nvidia-smi without dmon in WSL2 environments.) For structured observability, feed GPU metrics from the NVIDIA DCGM exporter into Prometheus and visualize them in Grafana. (Requires dcgm-exporter deployment; see the NVIDIA DCGM documentation for installation.) Logging per-model response times at the router layer exposes contention: if one model's p99 latency exceeds 2x its baseline while others remain stable, that model is likely experiencing VRAM pressure or KV cache eviction.
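
The p99 comparison described above can be expressed as a short check over the router's latency log. This is a sketch under stated assumptions: latencies are held in memory as lists of seconds per model, the baseline values come from a quiet period you have measured yourself, and the 2x factor is simply the threshold suggested in the text.

```python
def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def contended_models(latencies: dict[str, list[float]],
                     baseline_p99: dict[str, float], factor: float = 2.0) -> list[str]:
    """Names of models whose current p99 exceeds factor x their baseline p99."""
    return [name for name, samples in latencies.items()
            if samples and p99(samples) > factor * baseline_p99[name]]


if __name__ == "__main__":
    # Illustrative numbers only, in seconds.
    recent = {"code": [0.5, 0.6, 3.0], "chat": [0.2, 0.25, 0.3]}
    baseline = {"code": 1.0, "chat": 0.3}
    print(contended_models(recent, baseline))  # ['code']
```

With few samples the nearest-rank p99 is just the maximum, so evaluate it over a window large enough to smooth out a single slow request.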

Common Pitfalls

Frequently swapping models in and out of memory can fragment VRAM, leaving gaps too small for the next model to use. (This is allocator-level fragmentation in the CUDA memory pool, not disk fragmentation.) When a model's actual memory use, including KV cache growth during long contexts, exceeds your estimate, the load or request fails with a CUDA out-of-memory error, or Ollama quietly places layers on the CPU and latency jumps. Overly long keep-alive values can hold VRAM hostage for idle models, blocking higher-priority loads. The fix is deliberate lifecycle management: set keep-alive values proportional to expected request frequency per model.
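
One way to make keep-alive follow request frequency is to compute it per model and send it with each request, since Ollama's /api/generate accepts a keep_alive field. The mapping below is a hypothetical heuristic, not an Ollama feature: it scales linearly between the 5 and 30 minute bounds discussed earlier, and the busy_rph value of 60 requests per hour is an assumed cutoff for a fully busy model.

```python
def suggest_keep_alive(requests_per_hour: float, min_minutes: int = 5,
                       max_minutes: int = 30, busy_rph: float = 60.0) -> str:
    """Scale keep-alive linearly with request rate, clamped to [min, max] minutes."""
    load = max(0.0, min(requests_per_hour / busy_rph, 1.0))
    minutes = round(min_minutes + (max_minutes - min_minutes) * load)
    return f"{minutes}m"


def generate_payload(model: str, prompt: str, requests_per_hour: float) -> dict:
    """Body for Ollama's /api/generate with a per-request keep_alive override."""
    return {"model": model, "prompt": prompt, "stream": False,
            "keep_alive": suggest_keep_alive(requests_per_hour)}


if __name__ == "__main__":
    print(suggest_keep_alive(0))    # 5m
    print(suggest_keep_alive(12))   # 10m
    print(suggest_keep_alive(120))  # 30m
```

Rarely used models release VRAM quickly under this scheme, while frequently used ones stay resident; tune the bounds against the cold-start times you observe.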

Recommended Configurations

The three architecture patterns map to different operational needs: the router pattern suits varied workloads with distinct task types, the pipeline pattern fits sequential processing chains, and the parallel pattern enables ensemble-style inference. A practical starting point is Docker Compose with Ollama serving two to three Q4_K_M-quantized 7B models on a single 24 GB GPU, which leaves adequate headroom for KV cache and overhead. (This assumes no other VRAM consumers are active on the GPU.)

Hardware Tier	Concurrent 7B Models (Q4_K_M)	Suggested Quantization Mix	Orchestration Tool
24 GB (RTX 4090)	2 to 3	Q4_K_M across all	Ollama standalone or Docker Compose
48 GB single (RTX A6000 / RTX 6000 Ada) or 2x 24 GB (e.g., 2x RTX 3090)	4 to 6	Q5_K_M for critical, Q4_K_M for others	Docker Compose
2x48 GB	8+ or mix of 7B/13B	Q8 for primary, Q4_K_M for auxiliary	Kubernetes

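For teams ready to push further, Mixture of Experts models, which llama.cpp can run, apply a related idea inside a single model: a learned gating network routes each token to a small subset of expert sub-networks. This reduces per-token compute rather than VRAM, since every expert still has to be resident in memory, but it shows how lightweight routing can stand in for an explicit task field when selecting among specialists.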
