Ollama is the easiest way to run local LLMs, right up until it isn’t. The installation is one command, but when something goes wrong — GPU not detected, model won’t load, painfully slow responses — the error messages aren’t always helpful.

This guide covers every common Ollama problem with exact commands to diagnose and fix it. Bookmark this for when things break.

Qwen 3.6 + Ollama gotchas (June 2026)

Qwen 3.6-35B-A3B is one of the most-pulled Ollama models right now, and r/ollama threads keep surfacing the same handful of issues. For the full Qwen 3.6 model breakdown (27B dense and 35B-A3B), see the Qwen 3.6 complete guide.

401 unauthorized on Ollama Cloud. Hitting :cloud-tagged models (kimi-k2.6:cloud etc.) and getting 401? The cloud token isn’t being picked up. Run ollama signin, confirm with ollama ls. If it persists, check ~/.ollama/id_ed25519.pub matches the key at ollama.com/settings/keys. Don’t delete the key file — regenerating breaks other installs on the same account.

Role developer is not supported. Pops on Qwen 3.6 calls from agent harnesses (Pi Agent, OpenCode, OpenClaw) that pass a developer role. Update to the latest stable Ollama release and check the GitHub release notes for the version that resolved it on your platform. Workaround in the meantime: remap developer → system in your harness config; quality drops a hair but the call goes through.

Pi Agent / OpenClaw drops on long sessions. Long agentic runs hit the context wall and Ollama drops the request silently. Use ollama launch openclaw for sane compaction, or set OLLAMA_CONTEXT_LENGTH=131072 before ollama serve. The ollama launch family covers other coding integrations too, and the exact list shifts release to release (Claude Desktop, for example, was added then removed across v0.23.x). Run ollama launch --help for what’s available on your version.

Version that works cleanly with Qwen 3.6: current stable, v0.30.0 or later. The Qwen 3.6 readiness all landed during v0.17.x: v0.17.4 added the architecture, v0.17.5 fixed GPU/CPU split crashes, v0.17.6 fixed tool-call parsing, v0.17.7 fixed thinking-effort and compaction. v0.30.0 integrated llama.cpp alongside the MLX engine on Apple Silicon and made flash attention auto-enable for Qwen 3.x and Gemma 3/4 on Ampere+/RDNA3+ GPUs. On anything newer you’re fine. The hard floor is v0.17.4 (don’t limp along on v0.17.3).
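If you script your setup, the version floor is easy to check before pulling anything. Here's a minimal sketch, assuming `ollama --version` prints a banner that contains an x.y.z number and that your `sort` supports `-V` (version sort). `version_at_least` and `extract_version` are hypothetical helper names, not part of Ollama.

```shell
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Relies on `sort -V`, which orders 0.9.0 before 0.17.4 (plain string sort would not).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: pull the first x.y.z number out of a version banner.
extract_version() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Literal banner for illustration; on a real machine use "$(ollama --version)".
banner="ollama version is 0.17.3"
installed="$(extract_version "$banner")"
if version_at_least "$installed" "0.17.4"; then
  echo "$installed: new enough for Qwen 3.6"
else
  echo "$installed: below the v0.17.4 floor, update first"
fi
```

Swap the literal banner for the real command output once you've confirmed what your build prints.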

v0.17 fixes worth knowing (if you came from an older build)

Current stable is v0.30.0, which includes everything below and brought the bigger engine update on top: llama.cpp integrated alongside the MLX engine (Apple Silicon stays MLX-first), flash attention auto-enabled at runtime for supported architectures (Qwen 3.x, Gemma 3/4, gpt-oss, mistral3) on Ampere+/RDNA3+ GPUs, and matured cloud-model support. This section is for users still on a v0.17.x or older build who want to know what was fixed when.

Auto-update broken on v0.17.1: If you’re stuck on v0.17.1, automatic updates won’t carry you forward. Manually re-download from ollama.com/download or re-run the install script. Once you’re on a later build, future updates work normally.

Dynamic context scaling (v0.17.0): Context length now scales based on available VRAM instead of crashing when a model’s default exceeds your card. You can still override with OLLAMA_CONTEXT_LENGTH or num_ctx.

Qwen 3.5 architecture floor (v0.17.4): Qwen 3.5 uses Gated DeltaNet hybrid attention; v0.17.4 was the first build that could load it. If you see unsupported model architecture or a refused pull on a Qwen 3.5 model, the version is too old.

Qwen 3.5 GPU/CPU split crash (v0.17.5): Models that split layers across GPU and CPU crashed on Qwen 3.5 before v0.17.5.

Qwen 3.5 tool calling (v0.17.3, v0.17.6): Earlier versions routed Qwen 3.5 through the wrong tool-call pipeline. v0.17.3 fixed parsing during thinking mode; v0.17.6 fixed the rest.

For 412 errors on a Qwen 3.5 pull, see the Qwen 3.5 section below.

How to Check What’s Going Wrong

Before fixing anything, gather information. These three commands tell you most of what you need to know:

# What models are loaded and where they're running (GPU vs CPU)
ollama ps
# Is the GPU visible to the system?
nvidia-smi # NVIDIA
rocm-smi # AMD
# Enable debug logging for detailed diagnostics
# (stop any running Ollama instance first, or port 11434 is already taken)
OLLAMA_DEBUG=1 ollama serve

The ollama ps command is your most important diagnostic tool. The Processor column shows whether the model is running on GPU, CPU, or a split:

| Processor Output | Meaning |
|---|---|
| `100% GPU` | Fully on GPU (good) |
| `100% CPU` | Entirely on CPU (slow) |
| `48%/52% CPU/GPU` | Split between CPU and GPU (slower than full GPU) |

If you expected GPU and see CPU, that’s your problem. Read the GPU section below.
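To flag a CPU fallback from a script, you can classify the Processor value. A rough sketch follows. `extract_processor` and `classify_processor` are made-up helper names, and the sample line only mimics the `ollama ps` column layout, so adjust the pattern if your version prints it differently.

```shell
# Pull the Processor value (e.g. "100% GPU" or "48%/52% CPU/GPU") out of one line.
extract_processor() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+%(/[0-9]+%)? (CPU/GPU|CPU|GPU)' | head -n 1
}

# Map a Processor value to gpu / cpu / split / unknown.
classify_processor() {
  case "$1" in
    "100% GPU") echo "gpu" ;;
    "100% CPU") echo "cpu" ;;
    *CPU/GPU*)  echo "split" ;;
    *)          echo "unknown" ;;
  esac
}

# Illustrative line, not captured from a real run.
line='qwen3.5:9b  abc123def456  6.6 GB  48%/52% CPU/GPU  4 minutes from now'
proc="$(extract_processor "$line")"
echo "$proc -> $(classify_processor "$proc")"
```

Anything other than `gpu` means you're leaving speed on the table.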

Where to Find Logs

| OS | Location |
|---|---|
| Linux (systemd) | `journalctl -u ollama --no-pager -f` |
| Linux (manual) | Terminal output from `ollama serve` |
| macOS | `~/.ollama/logs/server.log` |
| Windows | `%LOCALAPPDATA%\Ollama\server.log` |
| Docker | `docker logs -f ollama` |

GPU Not Detected / Running on CPU

This is the #1 problem people hit. The model loads but runs on CPU at 2-8 tok/s instead of GPU at 40-100+ tok/s.

NVIDIA: Diagnosing the Problem

# Step 1: Can the OS see your GPU?
nvidia-smi
# If nvidia-smi fails: drivers aren't installed or are broken
# If it works: check the driver version (top of output)

Minimum driver version for Ollama: Windows 531+, Linux 535+. If yours is older, update.

# Step 2: Check what Ollama sees
OLLAMA_DEBUG=1 ollama serve
# Look for lines about GPU detection, CUDA version, VRAM

# Step 3: Check what's actually running
ollama ps
# Look at the Processor column

Common NVIDIA Fixes

Driver issue after update: If a recent driver update broke GPU detection, try the latest known-working version or check ollama/ollama issues for active driver regressions.

After suspend/resume on Linux: The GPU can disappear after waking from sleep.

sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm

Force CUDA version: If auto-detection picks the wrong library:

OLLAMA_LLM_LIBRARY=cuda_v12 ollama serve

Permissions (Linux): Your user needs GPU access:

sudo usermod -aG video,render $USER
# Log out and back in

Docker: The --gpus=all flag is required:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

Test GPU passthrough first: docker run --gpus all ubuntu nvidia-smi

AMD ROCm: Diagnosing the Problem

# Step 1: Can ROCm see your GPU?
rocm-smi
rocminfo
# Step 2: Check kernel messages
sudo dmesg | grep -i amdgpu

Common AMD Fixes

Unsupported GPU architecture: Most AMD GPU issues come down to architecture mismatch. Override the GFX version:

# Common overrides:
# RX 6600/6600 XT (gfx1032) → set to gfx1030
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve
# RX 7600 (gfx1102) → set to gfx1100
HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve

For the systemd service (persistent):

sudo systemctl edit ollama.service

Add:

[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Permissions (Linux):

sudo usermod -aG render,video $USER

iGPU conflict: If your CPU has integrated graphics and ROCm picks the wrong GPU, disable the iGPU in BIOS or set:

ROCR_VISIBLE_DEVICES=1 ollama serve # Use second GPU (check rocminfo for correct ID)

Docker with AMD:

docker run -d --device /dev/kfd --device /dev/dri \
 -v ollama:/root/.ollama -p 11434:11434 \
 -e HSA_OVERRIDE_GFX_VERSION="10.3.0" \
 ollama/ollama:rocm

Verifying the Fix

After any GPU fix, verify with:

ollama run llama3.2 --verbose
# Check the eval rate in the output — GPU should give 40+ tok/s for a 3B model
# Then:
ollama ps
# Should show 100% GPU

Also watch VRAM usage in real-time:

watch -n 1 nvidia-smi # NVIDIA
watch -n 1 rocm-smi # AMD

VRAM usage should spike when the model loads.

Apple M5 Pro / M5 Max: First Steps

If you just upgraded to a MacBook Pro with M5 Pro or M5 Max and Ollama feels slow or isn’t using the GPU properly:

Update Ollama first. Older versions don’t know about the M5’s Metal 4 GPU architecture or the Neural Accelerator. Run:

curl -fsSL https://ollama.com/install.sh | sh

Or download the latest .dmg from ollama.com. You want current stable (v0.30.0 or later) for the best M5 support and MLX stability. Historical context: v0.17.5 was the first version with stable M5/MLX support, and v0.30.0 integrated llama.cpp alongside MLX (on Mac you’re still MLX-first; the integration matters more on Linux/Windows). Anything newer than v0.30.0 is fine.

Check that the GPU is actually being used:

ollama run qwen3.5:9b --verbose
# Look for the eval rate — M5 Pro should push 50+ tok/s on a 9B model
ollama ps
# Should show 100% GPU

If you’re seeing CPU fallback on M5, the most common cause is running an Ollama version from before M5 shipped. The MLX runner improvements in v0.16.3+ expanded architecture support, and v0.17.5 improved MLX stability further. Update and restart.

Unified memory advantage: The M5 Pro’s 36GB and M5 Max’s 128GB of unified memory mean you can load larger models than discrete GPU users. A 70B Q4 model fits entirely in memory on the M5 Max without any CPU offloading. See our Mac M-series guide for model-to-chip recommendations.

Out of Memory Errors

The error: llama runner exited, you may not have enough memory to run the model

This means the model weights + KV cache + overhead exceed your available VRAM (and possibly RAM).

Why It Happens

Total VRAM needed = Model Weights + KV Cache + ~500MB-1GB overhead

The KV cache is the part people forget. It scales with context length:

| Model Size | 2K Context | 4K Context | 8K Context | 32K Context |
|---|---|---|---|---|
| 8B params | ~0.3 GB | ~0.6 GB | ~1.2 GB | ~5 GB |
| 14B params | ~0.5 GB | ~1.0 GB | ~2.0 GB | ~8 GB |
| 32B params | ~1.0 GB | ~2.0 GB | ~4.0 GB | ~16 GB |

A 14B model at Q4 takes about 8GB for weights. Add 2GB of KV cache at 8K context plus overhead, and you’re at ~11GB. That fits in 12GB VRAM but barely. Bump context to 16K and it won’t.
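The budget is plain addition, so it's easy to sanity-check your own numbers. A small sketch using the 14B figures from the text: the 4.0 GB KV cache value for 16K context is an assumption (extrapolated linearly from the table), the 1.0 GB overhead is the upper end of the quoted range, and `vram_total_gb` / `fits_in_vram` are hypothetical helper names.

```shell
# vram_total_gb WEIGHTS_GB KV_GB OVERHEAD_GB -> sum, printed with one decimal
vram_total_gb() {
  awk -v w="$1" -v k="$2" -v o="$3" 'BEGIN { printf "%.1f\n", w + k + o }'
}

# fits_in_vram TOTAL_GB CARD_GB -> exit status 0 when the total fits on the card
fits_in_vram() {
  awk -v t="$1" -v c="$2" 'BEGIN { exit !(t <= c) }'
}

# 14B at Q4: ~8 GB weights, 2.0 GB KV at 8K, assumed 4.0 GB KV at 16K, 12 GB card
for kv in 2.0 4.0; do
  total="$(vram_total_gb 8 "$kv" 1.0)"
  if fits_in_vram "$total" 12; then verdict="fits"; else verdict="does not fit"; fi
  echo "KV cache ${kv} GB -> total ${total} GB: ${verdict} in 12 GB"
done
```

Plug in your own weight size and context to see how much headroom you really have.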

Fixes (In Order of Impact)

1. Reduce context length — the fastest fix:

ollama run llama3.2
# then, at the >>> prompt inside the session:
/set parameter num_ctx 2048

Or set it globally:

export OLLAMA_CONTEXT_LENGTH=4096

Note: As of v0.17.0, Ollama dynamically scales context length based on available VRAM, which reduces surprise OOM crashes. But if you’ve set an explicit num_ctx in a Modelfile or via the API, that override still applies and can still blow your VRAM budget.

2. Enable KV cache quantization — halves KV cache memory:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

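Q8 KV cache has negligible quality loss. For even more savings, use q4_0 (roughly 1/4 the size of f16), at the cost of a more noticeable quality drop at long contexts. KV cache quantization only takes effect with flash attention enabled, which is why both variables are set together.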

As of v0.30.0, flash attention auto-enables at runtime for Qwen 3.x, Gemma 3/4, gpt-oss, and mistral3 on Ampere+/RDNA3+ GPUs — so on those combinations the OLLAMA_FLASH_ATTENTION=1 line is redundant (already on). Set it explicitly anyway if you want the KV cache quant on an older build or a model not in the auto list.

3. Use a smaller model or lower quantization:

Switch from Q6 to Q4_K_M (saves ~30% VRAM)
Switch from 14B to 8B (saves ~50% VRAM)
See our VRAM requirements guide for what fits where

4. Unload other models:

ollama stop <model-name>

Or limit concurrent models:

export OLLAMA_MAX_LOADED_MODELS=1

5. Watch out for parallel requests: Setting OLLAMA_NUM_PARALLEL=4 with num_ctx=2048 allocates KV cache for an effective 8192 tokens. This alone can push a model off GPU.
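The parallel-request math is worth running before you change either setting. A minimal sketch, assuming KV cache size scales linearly with the effective context (as the table earlier suggests) and using only the q8_0 halving quoted in this guide; `effective_ctx` and `kv_after_quant` are hypothetical helper names.

```shell
# effective_ctx NUM_PARALLEL NUM_CTX -> tokens of KV cache actually allocated
effective_ctx() {
  echo $(( $1 * $2 ))
}

# kv_after_quant F16_GB TYPE -> KV cache size after quantization.
# Only f16 and q8_0 (about half of f16) are modeled here.
kv_after_quant() {
  case "$2" in
    f16)  awk -v g="$1" 'BEGIN { printf "%.1f\n", g }' ;;
    q8_0) awk -v g="$1" 'BEGIN { printf "%.1f\n", g / 2 }' ;;
    *)    echo "unsupported type: $2" >&2; return 1 ;;
  esac
}

ctx="$(effective_ctx 4 2048)"
echo "4 parallel x 2048 ctx = ${ctx} tokens of KV cache"
# 8B model: ~1.2 GB of f16 KV cache at ~8K effective context (from the table)
echo "8B KV cache: $(kv_after_quant 1.2 f16) GB at f16, $(kv_after_quant 1.2 q8_0) GB at q8_0"
```

In this example, quantizing the cache roughly cancels out half of what the four parallel slots cost.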

The Partial Offload Trap

When a model doesn’t fully fit in VRAM, Ollama splits layers between GPU and CPU. This works but performance drops dramatically — from 50+ tok/s to 5-10 tok/s. You might not even notice it’s happening unless you check ollama ps.

If you see a CPU/GPU split, your model is too large for full GPU loading. Either reduce context, use a smaller model, or upgrade your GPU.

Slow Performance

If Ollama is running but painfully slow, work through these causes:

1. Model Running on CPU (Most Common)

ollama ps

If the Processor column shows CPU or a CPU/GPU split, that’s your answer. See the GPU section above, or reduce model size to fit fully in VRAM.

Expected speeds (full GPU):

| Model | RTX 3060 12GB | RTX 3090 24GB | RTX 4090 24GB |
|---|---|---|---|
| 8B Q4 | ~35-45 tok/s | ~80-112 tok/s | ~95-140 tok/s |
| 14B Q4 | ~20-25 tok/s | ~40-55 tok/s | ~55-75 tok/s |
| 32B Q4 | Too large | ~25-35 tok/s | ~34-50 tok/s |

If you’re seeing 2-8 tok/s on any of these, the model is on CPU.

2. Context Length Too High

Every token of context costs VRAM. Ollama’s default is 4096, but some model configs request much higher. Check what your model is using:

ollama show <model-name>
# Look for num_ctx in the parameters

Lower it if needed:

ollama run llama3.2
# then, at the >>> prompt inside the session:
/set parameter num_ctx 4096

3. Multiple Models Loaded

Ollama keeps models in memory by default (up to 3 per GPU). If you’ve been testing several models, they’re all competing for VRAM.

ollama ps # See what's loaded
ollama stop <model> # Unload specific models

Or shorten the idle timeout (the default is already 5m, so pick something lower to free VRAM sooner):

export OLLAMA_KEEP_ALIVE=1m # Unload after 1 minute idle

4. Enable Flash Attention (mostly automatic now)

As of v0.30.0, flash attention auto-enables at runtime for Qwen 3.x, Gemma 3/4, gpt-oss, and mistral3 on Ampere+/RDNA3+ GPUs (RTX 30-series and newer NVIDIA; RX 7000-series and newer AMD). If you’re running one of those combinations on current stable, it’s already on. Worth flipping manually if you’re on an older build, running a model architecture not in the auto list, or you want to force it explicitly:

export OLLAMA_FLASH_ATTENTION=1

5. Request Serialization With Multiple Models (Known Issue)

If you have two models loaded and send a request to model B while model A is busy with a long generation, model B’s request can queue for 50+ seconds even though it’s already in VRAM. This is a known concurrency issue — still no public fix as of v0.30.0; the linked issue is the live source for status. Workaround: run separate Ollama instances on different ports for latency-sensitive multi-model setups.

6. Measuring Performance

ollama run llama3.2 --verbose

After each response, this prints timing stats including the eval rate (tokens per second). As of v0.17.5, it also shows peak memory usage — useful for seeing how close you are to your VRAM ceiling. Use this to verify whether changes actually help.
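If you're comparing settings, pulling the eval rate out of the stats saves squinting at the terminal. A rough sketch, assuming the stats include a line that starts with `eval rate:` followed by a number and `tokens/s`; the sample text is illustrative rather than captured output, and `eval_rate` / `verdict` are hypothetical helper names.

```shell
# eval_rate: read --verbose timing stats on stdin, print the generation speed.
# The "prompt eval rate:" line is skipped because the pattern is anchored at line start.
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

# verdict TOK_PER_SEC -> rough reading based on the 2-8 tok/s CPU range quoted in this guide
verdict() {
  awk -v r="$1" 'BEGIN { if (r <= 8) print "likely on CPU"; else print "plausibly on GPU" }'
}

# Illustrative stats, not from a real run.
sample='prompt eval rate:     210.00 tokens/s
eval count:           120 token(s)
eval rate:            6.50 tokens/s'
rate="$(printf '%s\n' "$sample" | eval_rate)"
echo "eval rate ${rate} tok/s: $(verdict "$rate")"
```

On a real machine, something like `ollama run llama3.2 "hello" --verbose 2>&1 | eval_rate` should work; the `2>&1` is there in case your build writes the stats to stderr.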

Installation & Startup Issues

Ollama Won’t Start

“command not found: ollama” Ollama isn’t installed or isn’t in PATH.

# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Verify
which ollama # Should return /usr/local/bin/ollama

“bind: address already in use” Port 11434 is occupied by another process (or a zombie Ollama session).

# Find what's using the port
sudo lsof -i :11434 # Linux/Mac
netstat -aon | findstr :11434 # Windows
# Kill the process, then start Ollama

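Or change the port (keep the bind on 127.0.0.1 unless you actually need remote access):

OLLAMA_HOST=127.0.0.1:11435 ollama serve
# The CLI reads the same variable, so point it at the new port too:
OLLAMA_HOST=127.0.0.1:11435 ollama ps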

Service Management

Linux:

sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama

macOS: Launch the Ollama app from Applications, or:

ollama serve # Run manually in terminal

Windows: Find “Ollama” in Start Menu or the system tray. For service control, open Services (services.msc) and find “Ollama”.

Pinning a Specific Version

If an update breaks something:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.30.0 sh

Model Download & Pull Issues

Failed or Stuck Downloads

Check network connectivity:

curl -I https://registry.ollama.ai/v2/

If this fails, it’s a network issue (firewall, VPN, DNS).

Check disk space: Models range from 2GB (small 3B) to 45GB+ (70B). Make sure you have enough free space.

Clear corrupted downloads:

sudo systemctl stop ollama # Stop the service first
rm -rf ~/.ollama/models/* # Nuclear option: remove all models
# (service installs keep models in /usr/share/ollama/.ollama/models instead)
rm -rf ~/.ollama/cache/* # Clear download cache
sudo systemctl start ollama
ollama pull llama3.2 # Re-download

Windows equivalent: delete contents of %HOMEPATH%\.ollama\models and %HOMEPATH%\.ollama\cache.

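Proxy issues: If you’re behind a corporate proxy, set HTTPS_PROXY on the server process. The server does the downloading, so setting it only on the ollama pull client has no effect:

HTTPS_PROXY=https://proxy.example.com ollama serve
# in another terminal:
ollama pull llama3.2

Do not set HTTP_PROXY. Model pulls go over HTTPS only, and HTTP_PROXY can break Ollama’s internal client-server communication.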

Model Name Errors

Model names are exact. Common mistakes:

ollama pull llama-3.2 # Wrong — no hyphen
ollama pull llama3.2 # Correct
ollama pull llama3.2:7b # Wrong — it's 3b or 1b for 3.2
ollama pull llama3.2:3b # Correct

Check available tags on ollama.com/library.

Changing Model Storage Location

Models are stored at:

| OS | Default Path |
|---|---|
| Linux (service) | `/usr/share/ollama/.ollama/models` |
| Linux (user) | `~/.ollama/models` |
| macOS | `~/.ollama/models` |
| Windows | `C:\Users\%username%\.ollama\models` |

Move them to a larger drive:

export OLLAMA_MODELS=/mnt/large-drive/ollama-models

Set this permanently in your systemd override or shell profile.

Qwen 3.5 Model Issues

Qwen 3.5 dropped in late February 2026 and is still one of the most-pulled Ollama families. For the full Qwen 3.5 lineup and benchmarks, see the Qwen 3.5 complete cheat sheet. Here’s what people hit on the troubleshooting side.

“pull model manifest: 412” Error

This is the most common Qwen 3.5 error. It means your Ollama version is too old to understand the model manifest format that Qwen 3.5 uses.

Fix: Update Ollama.

curl -fsSL https://ollama.com/install.sh | sh

On macOS, download the latest .dmg from ollama.com. On Windows, run the installer from the same page. Update to current stable (v0.30.0 or later). For historical context: v0.16.0 was the first that could pull Qwen 3.5; v0.17.7 was where tool calling, stability, and thinking levels all landed.

After updating, the pull should work:

ollama pull qwen3.5:9b

Broken Tool Calling

Tool calling with Qwen 3.5 had problems on two fronts, both now fixed:

GGUF conversion issue (35B, 27B): Unsloth flagged broken GGUF conversions that produced malformed tool call responses. Fixed March 2, 2026. Re-pull:

ollama pull qwen3.5:35b # Re-pulls the fixed version
ollama pull qwen3.5:27b

Ollama parsing issue (all Qwen 3.5 sizes): Ollama was routing Qwen 3.5 tool calls through the wrong parsing pipeline (Hermes-style JSON instead of the Qwen3-Coder XML format the model was trained on). v0.17.3 fixed parsing during thinking mode; v0.17.6 fixed the remaining cases. Update to current stable (v0.30.0 or later); v0.17.6 was the first clean tool-call build.

If you read elsewhere that Qwen 3.5 tool calling is “broken in Ollama,” that was true before v0.17.6 and isn’t anymore.

Still seeing flaky tool calls after the update? The version is fine, but the underlying quant can degrade tool-call formatting in long agent loops. A default Q4_K_M can be sloppier than a calibrated UD-Q4_K_XL on the exact patterns agents lean on. The pattern and the fix are in Q4 vs Q6 for local coding agents.

Thinking Mode Disabled by Default on Small Models

This catches people off guard. Qwen 3.5 supports “thinking mode” (chain-of-thought reasoning inside <think> tags), but it’s disabled by default on the smaller sizes: 0.8B, 2B, 4B, and 9B.

If you’re running qwen3.5:9b and expecting chain-of-thought reasoning like you saw in benchmarks or demos, you’re not getting it unless you explicitly enable it.

Enable thinking mode in Ollama:

Create a Modelfile:

FROM qwen3.5:9b
PARAMETER num_ctx 8192
TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
<think>
"""

Then:

ollama create qwen3.5-think -f Modelfile
ollama run qwen3.5-think

Or use the /think toggle if your client supports it (Open WebUI does).

When to bother: Thinking mode helps on math, logic puzzles, and multi-step coding problems. For plain conversation and simple Q&A, leave it off — it adds latency and the quality difference is marginal. See our Qwen 3.5 9B setup guide for detailed benchmarks with and without thinking.

Vision and Multimodal on Small Models

The Qwen 3.5 small models (0.8B through 9B) are natively multimodal — they handle images without a separate vision adapter. But you need to know how to actually send images.

Via the Ollama CLI:

ollama run qwen3.5:9b "Describe this image: ./photo.jpg"

Via the Ollama API:

curl http://localhost:11434/api/generate -d '{
 "model": "qwen3.5:9b",
 "prompt": "What do you see in this image?",
 "images": ["'"$(base64 -w 0 photo.jpg)"'"]
}'

The image gets base64-encoded and passed in the images array. This works with any Qwen 3.5 small model, no extra setup needed.

Common gotcha: If you’re using an older Ollama version, the vision capabilities might not work even if the model loads fine. Update to current stable (v0.30.0 or later).

For a full walkthrough of what these models can do with images, see our Qwen 3.5 small models overview.

Crash When Splitting Across GPU and CPU

If you’re running a Qwen 3.5 model that doesn’t fully fit in VRAM and Ollama crashes (not just slow, actually crashes), update to current stable (v0.30.0 or later). v0.17.5 was the first build with the fix; earlier versions had a bug specific to Qwen 3.5 when splitting layers between GPU and CPU.

Repetition / Looping Output

Some users reported Qwen 3.5 producing excessively long chain-of-thought loops that never resolved into a final answer, or repeating itself endlessly. This was caused by a missing presence penalty. v0.17.5 was the first build with the fix, so update to current stable (v0.30.0 or later) and re-run.

Connection & API Issues

“Could Not Connect to Ollama”

# Is the server running?
curl http://localhost:11434
# Should return "Ollama is running"

If not:

ollama serve # Start manually
# or
sudo systemctl start ollama # Start the service

Remote Access (Binding to 0.0.0.0)

Security warning: Researchers have found 175,000+ publicly exposed Ollama instances across 130 countries, many with no authentication. Ollama has no built-in auth. If you bind to 0.0.0.0 and your firewall is open, anyone on the internet can use your GPU. Put a reverse proxy with auth in front, or use a VPN/Cloudflare Tunnel for remote access. Only bind to 0.0.0.0 on a trusted LAN.

By default, Ollama only listens on localhost. To access from other machines:

Linux (systemd):

sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

sudo systemctl daemon-reload
sudo systemctl restart ollama

macOS:

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# Restart the Ollama app

Windows: Set OLLAMA_HOST to 0.0.0.0:11434 in System Environment Variables, then restart Ollama from the taskbar.

Open the firewall:

sudo ufw allow 11434/tcp # Linux

Docker Networking (Open WebUI)

The most common Docker problem: Open WebUI can’t reach Ollama because localhost inside the container doesn’t point to the host.

Option A — Host networking (Linux, simplest):

docker run -d --network=host \
 -v open-webui:/app/backend/data \
 -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
 --name open-webui ghcr.io/open-webui/open-webui:main

Option B — host.docker.internal (Mac/Windows):

docker run -d -p 3000:8080 \
 --add-host=host.docker.internal:host-gateway \
 -v open-webui:/app/backend/data \
 --name open-webui ghcr.io/open-webui/open-webui:main

Set OLLAMA_BASE_URL=http://host.docker.internal:11434 in Open WebUI settings.

Option C — Docker Compose (both containerized):

services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    ports:
      - "11434:11434"
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"

CORS Issues

If a web app can’t connect to Ollama’s API:

export OLLAMA_ORIGINS=http://localhost:3000,http://your-server-ip:3000

Set this as an environment variable the same way as OLLAMA_HOST (systemd edit, launchctl, or Windows env vars).

Common Error Messages: Quick Reference

| Error | Cause | Fix |
|---|---|---|
| `could not connect to ollama app` | Server not running | `ollama serve` or `sudo systemctl start ollama` |
| `bind: address already in use` | Port 11434 occupied | Kill the other process or change port |
| `llama runner exited, not enough memory` | Model + context exceeds VRAM/RAM | Reduce `num_ctx`, use smaller model, enable KV cache quant |
| `pull model manifest: 412` | Ollama too old for this model format | Update to current stable: `curl -fsSL https://ollama.com/install.sh \| sh` |
| `model not found` | Typo or model not pulled | Check name, run `ollama pull <model>` |
| `command not found: ollama` | Not installed or not in PATH | Install with `curl -fsSL https://ollama.com/install.sh \| sh` |
| `connection refused` (Docker) | Container can’t reach host | Use `host.docker.internal` or `--network=host` |
| `Max retries exceeded` (Python) | API server unreachable | Check server is running, check firewall |
| GPU error code 3 | GPU not initialized | Reinstall drivers, check `nvidia-smi` |
| GPU error code 100 | No GPU device found | Driver issue or GPU not connected |
| `cudaMalloc failed: out of memory` | VRAM exhausted mid-generation | Restart Ollama, reduce concurrent models |
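If you'd rather have the table do the lookup for you, a few of its rows translate directly into a pattern match. This is only a sketch: `suggest_fix` is a hypothetical helper, it matches on the substrings listed in the table, and the real messages on your system may be worded a little differently.

```shell
# suggest_fix "LOG LINE" -> prints the fix from the quick-reference table, if any row matches
suggest_fix() {
  case "$1" in
    *"could not connect to ollama app"*) echo "Start the server: ollama serve or sudo systemctl start ollama" ;;
    *"address already in use"*)          echo "Kill the other process or change port" ;;
    *"llama runner exited"*)             echo "Reduce num_ctx, use a smaller model, or enable KV cache quant" ;;
    *"pull model manifest: 412"*)        echo "Update Ollama to current stable" ;;
    *"cudaMalloc failed"*)               echo "Restart Ollama and reduce concurrent models" ;;
    *"model not found"*)                 echo "Check the name, then run ollama pull <model>" ;;
    *)                                   echo "No match: search the full message" ;;
  esac
}

# Illustrative log line, not captured from a real run.
suggest_fix "Error: listen tcp 127.0.0.1:11434: bind: address already in use"
```

Pipe your log through a `while read` loop and call `suggest_fix` per line if you want it to scan a whole file.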

Environment Variables Reference

The most useful ones to know:

| Variable | Default | What It Does |
|---|---|---|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Bind address (set to `0.0.0.0` for remote access) |
| `OLLAMA_MODELS` | OS-specific | Model storage path |
| `OLLAMA_DEBUG` | `0` | Set to `1` for verbose logging |
| `OLLAMA_FLASH_ATTENTION` | `false` | Force flash attention on/off. As of v0.30.0 auto-enables at runtime for Qwen 3.x, Gemma 3/4, gpt-oss, mistral3 on Ampere+/RDNA3+; this var is an explicit override |
| `OLLAMA_API_KEY` | unset | Cloud-model auth (for `:cloud`-tagged models routed via ollama.com) |
| `OLLAMA_KV_CACHE_TYPE` | `f16` | KV cache quant: `f16`, `q8_0`, `q4_0` |
| `OLLAMA_CONTEXT_LENGTH` | `4096` | Default context window |
| `OLLAMA_NUM_PARALLEL` | `1` | Concurrent requests per model |
| `OLLAMA_MAX_LOADED_MODELS` | `3 * GPU count` | Max models in memory |
| `OLLAMA_KEEP_ALIVE` | `5m` | Time before idle model unloads (`-1` = never) |
| `OLLAMA_ORIGINS` | localhost | Allowed CORS origins |
| `OLLAMA_LLM_LIBRARY` | auto | Force backend: `cuda_v12`, `rocm`, `cpu`, etc. |
| `CUDA_VISIBLE_DEVICES` | all | Select specific NVIDIA GPUs (e.g., `0,1`) |
| `HSA_OVERRIDE_GFX_VERSION` | auto | Override AMD GPU architecture version |

How to set them permanently:

Linux (systemd): sudo systemctl edit ollama.service → add Environment="VAR=value" under [Service] → sudo systemctl daemon-reload && sudo systemctl restart ollama
macOS: launchctl setenv VAR "value" → restart Ollama app
Windows: System Settings → Environment Variables → add/edit → restart Ollama from taskbar
Docker: -e VAR=value in the docker run command
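For the systemd route, you can generate the override text instead of typing it by hand. A minimal sketch: `systemd_override` is a hypothetical helper that just prints the `[Service]` block described above, one `Environment=` line per `VAR=value` argument, and you still paste the result into `sudo systemctl edit ollama.service` yourself.

```shell
# systemd_override VAR=value ... -> prints a drop-in with one Environment= line per pair
systemd_override() {
  echo "[Service]"
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

# Example: the KV cache quantization settings from earlier in this guide
systemd_override OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0
```

After saving the override, run `sudo systemctl daemon-reload && sudo systemctl restart ollama` as usual.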

When to Reinstall vs When to Debug

Debug first if:

The problem started after a specific change (driver update, new model, config edit)
ollama ps and nvidia-smi / rocm-smi give useful output
Logs show a specific error message you can search for

Reinstall if:

ollama serve crashes immediately with no useful output
Driver issues that can’t be resolved with version changes
Corrupted installation (missing binaries, broken symlinks)

How to clean reinstall:

# Linux
sudo systemctl stop ollama
sudo rm /usr/local/bin/ollama
sudo rm -rf /usr/share/ollama # Removes service user data
rm -rf ~/.ollama # Removes your models and config
curl -fsSL https://ollama.com/install.sh | sh

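Your models will need to be re-downloaded after a clean install. If you just want to reinstall the binary without losing models, skip both rm -rf steps: a service install keeps models under /usr/share/ollama/.ollama/models, a user install under ~/.ollama/models.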

The Bottom Line

Most Ollama problems come down to three things:

GPU not being used — check with ollama ps, fix drivers or permissions
Not enough memory — reduce num_ctx, enable KV cache quantization, use a smaller model
Server not reachable — make sure ollama serve is running, check the port, configure Docker networking correctly

When in doubt: OLLAMA_DEBUG=1 ollama serve tells you everything Ollama is doing. Read the output, search for the error message, and you’ll find your fix.

URL: https://insiderllm.com/guides/ollama-troubleshooting-guide/

⇱ Ollama Troubleshooting Guide: Every Common Problem and Fix | InsiderLLM