The Definitive Guide to Local LLMs in 2026: Privacy, Tools, & Hardware
Table of Contents
- The Privacy Imperative: Why Running Models Locally Matters
- The State of Open-Weight Models in 2026
- Hardware Guide: What You Actually Need
- The Tool Comparison Matrix: Ollama vs. LM Studio vs. vLLM vs. Jan
- Hands-On: Setting Up Your First Local LLM with Ollama
- Hands-On: Production Serving with vLLM
- Advanced Workflows: Beyond Chat
- Performance Benchmarks: Real Numbers on Real Hardware
- Security and Networking Considerations
- Decision Framework: Choosing Your Stack
- What's Coming Next: The Local LLM Roadmap
- Your Desk Is the New Data Center
Privacy is the new luxury. Two years ago, if you wanted to work with a GPT-4-class language model, you had exactly one option: send your data to someone else's server and pay by the token. In 2026, that constraint has evaporated. Running local LLMs on consumer hardware is not just feasible but, for a growing number of developers and organizations, the preferred default. Open-weight models have reached performance parity with the best cloud offerings. Consumer GPUs now ship with enough VRAM to run 70B-parameter models after quantization. And runtimes like Ollama let you go from zero to a working local API in a single terminal command.
This guide covers everything you need to make the switch: the privacy and cost case for local inference, the current landscape of open-weight models, a detailed hardware breakdown, a head-to-head comparison of the four leading tools (Ollama, LM Studio, vLLM, and Jan), hands-on setup tutorials with working code, real-world performance benchmarks, and a decision framework to help you pick the right stack. Whether you are building AI features into a product, handling sensitive client data, or simply tired of paying escalating API bills, this is your complete playbook.
Before we get into specifics, one clarification matters: "local" in this article means on-device inference, where the model weights live on your machine and no data leaves it. That is distinct from "self-hosted," which might mean running a model on your own cloud VM. The tools we cover span both scenarios, but the privacy and cost arguments are strongest when the hardware is physically yours.
The Privacy Imperative: Why Running Models Locally Matters
Regulatory Pressure and Data Sovereignty
GDPR enforcement has intensified every year since its inception, with cumulative fines running into the billions of euros. Meanwhile, US states are passing their own AI and data privacy legislation at an accelerating pace. The Colorado AI Act is one prominent example, but it is far from alone. For any team handling customer data, medical records, legal documents, or proprietary source code, sending that data to a third-party API endpoint creates a compliance surface area that grows more expensive to manage with every new regulation.
Local inference eliminates the most uncomfortable question in any data protection impact assessment: "Where does the data go?" When the model runs on hardware you control, the answer is nowhere. No third-party sub-processors, no cross-border data transfers, no scrambling to interpret a vendor's updated terms of service.
The Hidden Costs of Cloud LLM APIs
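Token-based pricing looks cheap at prototype scale. It stops looking cheap fast. The bill is linear in volume: tokens per day, times days per month, times the price per token. At roughly 1 million tokens per day and GPT-4o-class prices of a few dollars to tens of dollars per million tokens, a team pays tens to a few hundred dollars per month. Push that to the tens of millions of tokens per day that agentic workflows, RAG pipelines, and batch jobs can generate, and the same arithmetic yields several thousand dollars per month. Over 12 months, that is a five-figure bill, easily exceeding the cost of a high-end GPU that would deliver comparable inference indefinitely at near-zero marginal cost.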
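The break-even arithmetic behind that comparison can be sketched in a few lines. Every number here (token volume, price per million tokens, hardware price, power cost) is a hypothetical placeholder chosen for illustration, not a quoted price; substitute your own figures.

```python
# Break-even sketch: recurring cloud API spend vs. a one-time GPU purchase.
# All inputs are hypothetical placeholders, not quoted prices.

def monthly_api_cost(tokens_per_day: float, usd_per_million_tokens: float, days: int = 30) -> float:
    """Monthly API bill in USD for a steady daily token volume."""
    return tokens_per_day * days * usd_per_million_tokens / 1_000_000


def break_even_months(hardware_usd: float, monthly_api_usd: float, monthly_power_usd: float = 0.0) -> float:
    """Months until buying hardware is cheaper than continuing to pay the API."""
    savings = monthly_api_usd - monthly_power_usd
    if savings <= 0:
        return float("inf")  # local hardware never pays off at this volume
    return hardware_usd / savings


api = monthly_api_cost(tokens_per_day=10_000_000, usd_per_million_tokens=10.0)
months = break_even_months(hardware_usd=2_500.0, monthly_api_usd=api, monthly_power_usd=40.0)
print(f"API: ${api:,.0f}/month, break-even after {months:.1f} months")
```

The useful property is the sensitivity: halve the daily volume and the break-even point doubles, so low-volume teams may reasonably stay on an API for cost reasons alone.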
Beyond raw pricing, cloud APIs carry hidden costs: vendor lock-in to a specific provider's prompt format and model behavior, rate limits that throttle you during peak usage, and the ever-present risk that a model version you depend on gets deprecated or its pricing changes overnight.
Beyond Privacy: Latency, Offline Access, and Determinism
Local inference eliminates network round-trips. For interactive applications, cutting 100 to 300 milliseconds of network latency off every request produces a noticeably snappier experience. For batch processing jobs that make thousands of sequential calls, the savings compound dramatically.
Equally important for engineering teams: local models can produce more reproducible outputs. When you set the temperature to zero and control the runtime environment, you get highly consistent results across test runs, which matters enormously for CI/CD pipelines and regression testing. Note that full bitwise determinism depends on the runtime, hardware, and batching behavior; some implementations may still produce minor variations even at temperature zero due to floating-point non-determinism. And in air-gapped environments common in defense, healthcare, and financial services, local inference is not a preference but a hard requirement.
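The floating-point caveat is easy to see in isolation. Floating-point addition is not associative, so a runtime that sums the same partial results in a different order (for example, because a parallel reduction or a different batch size changed the grouping) can produce slightly different values. A minimal illustration with plain Python floats, standing in for what happens inside GPU kernels:

```python
# Floating-point addition is not associative: the grouping of the
# same three numbers changes the low-order bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one reduction order
right_to_left = a + (b + c)   # another reduction order

print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

A difference this small rarely changes which token wins under greedy decoding, but when two candidate tokens have nearly identical scores it can, and every later token then diverges.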
The State of Open-Weight Models in 2026
Models That Match GPT-4
The open-weight ecosystem has matured to the point where several model families compete directly with the best proprietary offerings across standard benchmarks.
Llama 4 from Meta is the headline act. The family uses a Mixture of Experts (MoE) architecture. Llama 4 Scout has 109 billion total parameters but only 17 billion active per forward pass, making it dramatically more efficient than its parameter count suggests. Llama 4 Maverick scales to 400 billion total parameters with the same 17 billion active, targeting multi-GPU and high-VRAM setups.
Mistral Large 2 and its Mixtral successors continue to perform strongly, particularly on European-language tasks and instruction following. Qwen 3 from Alibaba has emerged as a formidable competitor with excellent multilingual and coding performance. Command R+ from Cohere is specifically optimized for retrieval-augmented generation workloads. And the DeepSeek-V3 and R1 family has carved out a niche in reasoning-heavy tasks.
Each of these model families carries its own license terms, and those terms matter for production use. Llama 4 uses a permissive community license with a commercial use threshold (at the time of writing, businesses with over 700 million monthly active users must request a separate license from Meta). Others vary. Check the model card before building a product on top of any open-weight model.
Understanding Quantization: Making Big Models Fit Small Hardware
A 70B-parameter model in full FP16 precision requires roughly 140GB of memory. That does not fit on any single consumer GPU. Quantization solves this by reducing the precision of model weights, shrinking memory requirements while accepting a controlled loss in output quality.
The GGUF format (GPT-Generated Unified Format), maintained by the llama.cpp project, has become the de facto standard for quantized model distribution. It supports quantization levels ranging from Q8_0 (highest quality, largest size) down to Q2_K (smallest size, most quality loss).
Here is what quantization looks like in practice for a 70B-parameter model:
| Quantization Level | Approximate File Size | Quality Impact |
|---|---|---|
| FP16 (no quant) | ~140 GB | Baseline |
| Q8_0 | ~70 GB | Negligible loss |
| Q6_K | ~54 GB | Minimal loss |
| Q5_K_M | ~46 GB | Very slight loss |
| Q4_K_M | ~40 GB | Best quality/size sweet spot |
| Q3_K_M | ~33 GB | Noticeable degradation |
| Q2_K | ~25 GB | Significant degradation |
Q4_K_M is the widely recommended sweet spot. It preserves the vast majority of model quality while cutting memory requirements to under a third of the FP16 baseline (about 40 GB versus 140 GB for a 70B model). Going below Q3 typically produces diminishing returns for most practical applications.
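The sizes in the table follow from one formula: parameters times average bits per weight, divided by eight. The sketch below inverts that formula to recover the effective bits per weight implied by the table's 70B figures, then reuses the Q4_K_M rate to size a hypothetical 30B model. It covers weights only; the KV-cache and runtime overhead come on top.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Invert the formula: average bits per weight implied by a file size."""
    return size_gb * 8 / params_billion


# Approximate file sizes for a 70B model, taken from the table above.
table_70b = {"FP16": 140, "Q8_0": 70, "Q4_K_M": 40, "Q2_K": 25}
for level, size_gb in table_70b.items():
    bpw = effective_bits_per_weight(size_gb, 70)
    print(f"{level:7s} ~{bpw:.1f} bits/weight")

# Reuse the implied Q4_K_M rate to size a hypothetical 30B dense model.
q4_bpw = effective_bits_per_weight(40, 70)
print(f"30B at Q4_K_M: ~{weights_size_gb(30, q4_bpw):.0f} GB")
```

Note that the "4-bit" level works out to more than four bits per weight, because K-quant formats store scaling metadata alongside the weights and keep some tensors at higher precision.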
Mixture of Experts: Why Parameter Count Is Misleading
MoE architectures like Llama 4's route each token through only a subset of the model's total parameters. Llama 4 Scout's 109B total parameters sound enormous, but with only 17B active per token, its inference compute requirements are far lower than a dense model of similar total size. However, MoE models still need enough memory to hold all expert weights, even though only a subset is activated per token. At the same quantization level, Scout's 109B weights therefore occupy more memory than a dense 70B model's; any memory savings come from quantization, not from the MoE routing itself. What the routing buys is compute: a dense 70B model activates all 70B parameters for every token and requires substantially more compute per token at the same quantization level. When evaluating whether a model will run on your hardware, both total parameter count (which determines memory) and active parameter count (which determines compute) matter.
Hardware Guide: What You Actually Need
GPU-First: VRAM Is King
For local LLM inference, VRAM is the single most important specification. The model weights, the KV-cache (which scales with context length and batch size), and activation memory all compete for GPU memory.
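To make the KV-cache term concrete: for a standard transformer it stores one key and one value vector per layer, per KV head, per token, per sequence. The sketch below applies that formula to a hypothetical 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128). Those dimensions are illustrative assumptions, so read the real values from your model's config file.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, batch_size=1, bytes_per_value=2):
    """Bytes for keys and values: 2 tensors per layer, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * context_tokens * batch_size * bytes_per_value


# Hypothetical 70B-class dense configuration with grouped-query attention.
cfg = dict(layers=80, kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(context_tokens=ctx, **cfg) / 2**30
    print(f"context {ctx:>6}: {gib:.2f} GiB of KV-cache (FP16, batch 1)")
```

The cache grows linearly with both context length and batch size, which is why a model that loads fine can still run out of memory once you raise the context window or serve several users at once.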
NVIDIA consumer tier: The RTX 4090 with 24GB GDDR6X remains highly capable and widely available on the secondary market. The RTX 5090, with 32GB of GDDR7, represents the current consumer ceiling and provides enough headroom to run quantized 70B models with comfortable context lengths.
NVIDIA professional tier: The RTX PRO 6000 with 96GB GDDR7 opens up full-precision runs of large models or multi-model serving. The A100 (80GB) and H100 remain reference points for enterprise deployments.
AMD: The RX 7900 XTX offers 24GB of VRAM at a lower price point than NVIDIA equivalents. ROCm support has improved significantly, and major frameworks including llama.cpp and vLLM now offer functional AMD GPU acceleration, though the ecosystem remains less polished than CUDA.
Intel Arc: Current viability for LLM inference is limited. Driver maturity and framework support lag behind NVIDIA and AMD. llama.cpp does offer SYCL-based Intel GPU support, but performance and compatibility are not yet on par with CUDA or ROCm.
Apple Silicon: The Unified Memory Advantage
Apple's M-series chips use unified memory shared between CPU and GPU, which fundamentally changes the equation for large model inference. The M4 Pro with 24GB handles 7B to 13B models comfortably. The M4 Max with up to 128GB of unified memory can run quantized 70B models entirely in memory. The M4 Ultra, configurable with up to 512GB, can accommodate even larger models or serve multiple models simultaneously.
Metal acceleration via the MLX framework (developed by Apple's machine learning research team) delivers respectable tokens-per-second numbers, though NVIDIA GPUs generally outperform Apple Silicon at equivalent model sizes in raw throughput. The tradeoff is power efficiency and the seamless unified memory pool, which avoids the CPU-to-GPU transfer bottleneck.
RAM, CPU, and Storage Considerations
When a model does not fully fit in VRAM, layers spill to system RAM. Having 64GB or more of DDR5 system memory provides a useful safety net for this scenario, though inference speed drops substantially for offloaded layers. NVMe SSD speed affects model loading time (how quickly you can start a session) but has minimal impact on inference throughput once the model is in memory. CPU-only inference is technically possible for small models (7B and under) but impractical for anything larger due to extremely low token generation speeds.
Hardware Decision Matrix
| Tier (Budget) | Recommended Hardware | Max Model Size (Q4_K_M) | Expected Gen Speed |
|---|---|---|---|
| Entry ($500–$1K) | Used RTX 3090 (24GB) | ~30B dense / Scout-class MoE | ~20 tok/s |
| Mid ($1.5K–$3K) | RTX 4090 (24GB) or RTX 5090 (32GB) | ~70B quantized (tight) / Scout MoE comfortably | ~30–45 tok/s |
| Pro ($5K+) | RTX PRO 6000 (96GB) or multi-4090 | 70B+ at high quant / Maverick-class MoE | ~50+ tok/s |
| Apple | M4 Max (128GB) MacBook Pro | 70B Q4_K_M comfortably | ~15–25 tok/s |
| Apple Ultra | M4 Ultra Mac Studio (192–512GB) | 100B+ / multiple models | ~20–30 tok/s |
The Tool Comparison Matrix: Ollama vs. LM Studio vs. vLLM vs. Jan
Evaluation Criteria
We compare these four tools across nine dimensions: ease of setup, model format support, API compatibility, GPU support, batched inference, fine-tuning support, UI availability, community and ecosystem, and production readiness.
Ollama: The Docker of Local LLMs
Ollama is a CLI-first runtime that has become the most popular local LLM tool, surpassing 100K stars on GitHub. Its design philosophy mirrors Docker: you pull models by name, run them with a single command, and interact via a local REST API on port 11434.
Strengths: Unmatched simplicity. One command to install, one command to pull a model, one command to run it. The built-in API is OpenAI-compatible, making it a drop-in replacement for cloud endpoints in existing codebases. Cross-platform support covers macOS, Linux, and Windows. The model library is extensive and curated.
Limitations: Batched inference and concurrent request handling are less sophisticated than dedicated serving engines. There is no built-in GUI. Advanced serving configurations (tensor parallelism, custom scheduling) are limited.
Best for: Developers who want the fastest path from zero to a working local LLM API. Prototyping, personal use, and integration into application backends.
LM Studio: The Desktop Experience
LM Studio is a desktop application with a polished graphical interface for discovering, downloading, and running local models. It includes a built-in chat UI and a local server mode.
Strengths: The UI is genuinely well-designed, lowering the barrier for non-technical team members to explore local models. Model discovery and management are drag-and-drop simple. The local server mode provides an API endpoint without touching a terminal.
Limitations: The application is closed-source (free for personal use; check current licensing terms for commercial use). Scripting and automation are less natural than with CLI tools. Linux support has historically lagged behind macOS and Windows. Customization options for serving configuration are more limited than open-source alternatives.
Best for: Individuals and teams wanting a polished desktop experience, especially when non-technical stakeholders need to interact with local models.
vLLM: Production-Grade Serving
vLLM is a high-throughput inference and serving engine designed from the ground up for performance. Its PagedAttention mechanism for KV-cache management and continuous batching can deliver dramatically higher throughput than naive implementations, with benchmarks showing up to an order-of-magnitude improvement for batched workloads compared to basic HuggingFace inference.
Strengths: Best-in-class throughput for concurrent requests. Tensor parallelism for multi-GPU setups. OpenAI-compatible API server. Designed for production serving to multiple users simultaneously.
Limitations: Setup is more involved (Python environment, CUDA dependencies). Primarily Linux-focused (macOS is not supported for GPU inference). Requires more GPU and systems expertise to tune effectively. Not designed as a personal desktop tool.
Best for: Teams serving a local model to multiple users, production backend integration, and any scenario where throughput and concurrency matter.
Jan: The Open-Source All-in-One
Jan is a fully open-source desktop application built on Electron with a local-first philosophy. It provides a ChatGPT-style interface, a local API server, and an extensions system for plugins.
Strengths: Fully open source (licensed under AGPLv3) and extensible. Cross-platform. Combines a usable GUI with a local API server. Active community developing extensions. Aligns with the values of developers who prefer auditable, open tooling.
Limitations: Electron-based architecture adds overhead. The inference engine is less performant than vLLM for high-throughput scenarios. The ecosystem, while growing, is younger and smaller than Ollama's.
Best for: Open-source advocates wanting a local ChatGPT replacement they can inspect, modify, and extend.
The Comparison Grid
| Criteria | Ollama | LM Studio | vLLM | Jan |
|---|---|---|---|---|
| Ease of Setup | ✅ One command | ✅ GUI installer | ⚠️ Python env + CUDA | ✅ GUI installer |
| Model Format | GGUF | GGUF | HF Transformers, AWQ, GPTQ | GGUF |
| OpenAI-Compatible API | ✅ | ✅ | ✅ | ✅ |
| GPU Support (NVIDIA) | ✅ | ✅ | ✅ | ✅ |
| GPU Support (AMD) | ⚠️ Partial | ⚠️ Partial | ✅ ROCm | ⚠️ Partial |
| Apple Silicon | ✅ Metal | ✅ Metal | ❌ | ✅ Metal |
| Batched/Concurrent Inference | ⚠️ Basic | ⚠️ Basic | ✅ Continuous batching | ⚠️ Basic |
| Built-in UI | ❌ | ✅ | ❌ | ✅ |
| Fine-tuning Support | ❌ | ❌ | ❌ (serving only) | ❌ |
| Open Source | ✅ | ❌ | ✅ | ✅ |
| Production Readiness | ⚠️ Dev/small team | ⚠️ Personal/small team | ✅ Production | ⚠️ Personal/small team |
| Choose this if… | You want the fastest CLI-to-API path | You want a polished desktop experience | You need high-throughput production serving | You want open-source extensibility |
Hands-On: Setting Up Your First Local LLM with Ollama
Installation (macOS, Linux, Windows)
Ollama provides one-liner installs for all major platforms:
# macOS: install via Homebrew
brew install ollama
# Linux: install via shell script
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.com/download
# Or via winget:
winget install Ollama.Ollama
Once installed, start the Ollama service (it runs as a background daemon on macOS and Linux, or a system tray application on Windows).
Running Models and Exploring the CLI
Pull and run a model with a single command. Here we use Llama 4 Scout in Q4_K_M quantization (verify the exact tag in Ollama's model library, as naming conventions may vary):
# Pull a model (downloads once, cached locally)
ollama pull llama4
# Run an interactive chat session
ollama run llama4
# List all downloaded models
ollama list
# Show model details (parameters, format, size)
ollama show llama4
You can customize model behavior using a Modelfile, which functions like a Dockerfile for LLM configurations:
# Save as Modelfile.codereview
FROM llama4
SYSTEM """
You are an expert code reviewer. Analyze code for bugs, security issues, and performance problems.
Be concise and actionable. Format your response as a numbered list of findings.
"""
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
Create and run the custom model:
# Create a named model from the Modelfile
ollama create codereview -f Modelfile.codereview
# Run it
ollama run codereview
Serving a Local API
Ollama automatically exposes a REST API on localhost:11434. Its native endpoints live under /api (used in the example below), and its OpenAI-compatible endpoints live under /v1. You can query it immediately:
curl http://localhost:11434/api/chat -d '{
"model": "llama4",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain the difference between REST and GraphQL in three sentences." }
],
"stream": false
}'
Set "stream": true to receive newline-delimited JSON with incremental token output, which is essential for building responsive chat interfaces.
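Consuming that stream amounts to parsing one JSON object per line and concatenating the content fragments. The sketch below shows the parsing step on hand-written sample lines rather than a live connection, so it runs without a server. The helper name `iter_chat_tokens` is made up for this example, and the sample lines are trimmed to the fields the code reads.

```python
import json
from typing import Iterable, Iterator


def iter_chat_tokens(ndjson_lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from a newline-delimited JSON chat stream."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        fragment = chunk.get("message", {}).get("content", "")
        if fragment:
            yield fragment
        if chunk.get("done"):
            break  # the final chunk signals completion


# Hand-written sample lines standing in for a live response body.
sample_stream = [
    '{"message": {"role": "assistant", "content": "REST "}, "done": false}',
    '{"message": {"role": "assistant", "content": "uses resources."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
reply = "".join(iter_chat_tokens(sample_stream))
print(reply)
```

Against a running server, the same generator can be fed the decoded lines of the HTTP response body, printing each fragment as it arrives instead of joining them at the end.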
Integrating with a Node.js Application
Here is a working Express server that proxies chat requests to the local Ollama API and streams responses back to the browser. If you need a refresher on setting up a Node.js web server, check out Build a Simple Web Server with Node.js.
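// Requires Node.js 18+ (for native fetch and ReadableStream support)
// Setup: npm init -y && npm pkg set type=module && npm install express
// (the import syntax below needs "type": "module" in package.json, or save the file as server.mjs)
// Run: node server.js
// Ensure Ollama is running: ollama serve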
import express from 'express';

const app = express();
app.use(express.json());
app.use(express.static('public'));

const OLLAMA_BASE = 'http://localhost:11434';

app.post('/api/chat', async (req, res) => {
  const { messages, model = 'llama4' } = req.body;
  if (!messages || !Array.isArray(messages)) {
    return res.status(400).json({ error: 'messages array is required' });
  }
  try {
    const ollamaRes = await fetch(`${OLLAMA_BASE}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        messages,
        stream: true,
      }),
    });
    if (!ollamaRes.ok) {
      const body = await ollamaRes.text();
      return res.status(502).json({ error: 'Ollama request failed', details: body });
    }
    // Stream the response back to the client
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');
    const reader = ollamaRes.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      // Ollama streams newline-delimited JSON; split on newlines
      const lines = buffer.split('\n');
      // Keep the last (possibly incomplete) chunk in the buffer
      buffer = lines.pop() || '';
      for (const line of lines) {
        if (!line.trim()) continue;
        try {
          const parsed = JSON.parse(line);
          if (parsed.message?.content) {
            res.write(`data: ${JSON.stringify({ content: parsed.message.content })}\n\n`);
          }
          if (parsed.done) {
            res.write('data: [DONE]\n\n');
          }
        } catch {
          // Skip malformed chunks
        }
      }
    }
    // Process any remaining data in the buffer
    if (buffer.trim()) {
      try {
        const parsed = JSON.parse(buffer);
        if (parsed.message?.content) {
          res.write(`data: ${JSON.stringify({ content: parsed.message.content })}\n\n`);
        }
        if (parsed.done) {
          res.write('data: [DONE]\n\n');
        }
      } catch {
        // Skip malformed final chunk
      }
    }
    res.end();
  } catch (err) {
    console.error('Error proxying to Ollama:', err);
    if (!res.headersSent) {
      res.status(500).json({ error: 'Internal server error' });
    } else {
      res.end();
    }
  }
});

app.listen(3000, () => {
  console.log('Server running at http://localhost:3000');
});
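This gives you a local AI chat backend with zero cloud dependencies. The Express server re-emits Ollama's newline-delimited JSON as Server-Sent Events frames. Because the route is a POST, the browser's EventSource API (which only issues GET requests) cannot consume it directly; on the browser side, call fetch, read the response body with a stream reader, and parse the data: lines as they arrive.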
Hands-On: Production Serving with vLLM
Installation and Model Loading
vLLM requires a Linux environment with CUDA (or ROCm for AMD GPUs). Set up a Python virtual environment and install:
# Create and activate a virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM (ensure CUDA toolkit is installed)
pip install vllm
# Launch the OpenAI-compatible server with a HuggingFace model
# Note: You may need to accept the model's license on HuggingFace
# and set HF_TOKEN in your environment before downloading gated models.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--quantization awq \
--max-model-len 8192 \
--port 8000
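The vllm serve command (the current recommended entrypoint) starts an OpenAI-compatible API server. Use --tensor-parallel-size 2 or higher if you have multiple GPUs and want to split the model across them. The --quantization flag has to match the checkpoint you point it at: awq and gptq expect weights that were already quantized in that format, so substitute the repository name of a quantized variant rather than the original full-precision weights. Other values such as fp8 are available depending on the model variant and your GPU.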
Benchmarking Throughput
vLLM includes benchmarking utilities. A basic throughput test:
# Start the OpenAI-compatible server in the background.
# (The standalone benchmark tools mentioned at the end of this block
# load the model themselves and do not need this server.)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq &
# Wait for the server to start, then run a simple load test.
# The requests are sent one at a time, so this measures per-request
# latency rather than batched throughput.
# pip install requests
python -c "
import requests, time
url = 'http://localhost:8000/v1/completions'
payload = {
    'model': 'meta-llama/Llama-4-Scout-17B-16E-Instruct',
    'prompt': 'Explain quantum computing in detail. ' * 10,
    'max_tokens': 256,
    'temperature': 0,
}
start = time.time()
num_requests = 20
for _ in range(num_requests):
    # requests sets the JSON Content-Type header itself when json= is used
    requests.post(url, json=payload).raise_for_status()
elapsed = time.time() - start
print(f'Completed {num_requests} requests in {elapsed:.1f}s')
print(f'Avg latency: {elapsed/num_requests*1000:.1f} ms/request')
"
# For a proper throughput benchmark, use vLLM's own benchmark tooling.
# The script location and arguments have changed between releases, so
# treat the invocation below as illustrative and check the vLLM docs:
# python -m vllm.benchmark_throughput \
#   --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
#   --quantization awq \
#   --input-len 512 \
#   --output-len 256 \
#   --num-prompts 100
#
# Sample output (illustrative; hardware-dependent):
# Throughput: 1847.3 tokens/s
# Avg latency per request: 142.3 ms
# Avg TTFT: 38.7 ms
The throughput numbers from vLLM with continuous batching and PagedAttention are typically several times higher than what you see from Ollama running the same model on the same hardware, because vLLM is optimized for concurrent request handling rather than single-user interactive chat.
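A sequential loop cannot show this advantage, because continuous batching only helps when several requests are in flight at once. The harness below is a minimal sketch of a concurrent load test. The `fake_send` function is a stand-in that sleeps for a hypothetical 50 ms so the example runs without a server; to measure a real deployment, replace it with a function that POSTs to your completions endpoint. With the stand-in, the speedup reflects only the harness, not vLLM itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_load_test(send: Callable[[], None], num_requests: int, concurrency: int) -> dict:
    """Issue num_requests calls with at most `concurrency` in flight; report timing."""
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        send()
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(num_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "requests_per_s": num_requests / wall,
        "avg_latency_s": sum(latencies) / len(latencies),
    }


def fake_send():
    """Stand-in for an HTTP POST to the completions endpoint (hypothetical 50 ms)."""
    time.sleep(0.05)


sequential = run_load_test(fake_send, num_requests=8, concurrency=1)
concurrent = run_load_test(fake_send, num_requests=8, concurrency=8)
print(f"sequential: {sequential['requests_per_s']:.1f} req/s")
print(f"concurrent: {concurrent['requests_per_s']:.1f} req/s")
```

When pointed at a real server, compare requests per second and average latency at several concurrency levels: a batching engine should hold latency roughly steady while throughput climbs, until the GPU saturates.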
When to Choose vLLM Over Ollama
The decision is straightforward. If you are a single developer or small team running interactive queries, Ollama's simplicity wins. If you are serving a model to 10, 50, or 100 concurrent users, or processing large batch jobs where throughput matters more than setup convenience, vLLM is the right tool. For teams that start with Ollama for prototyping and later need to scale, the migration is clean because both expose OpenAI-compatible APIs. Your client code barely changes.
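In practice, "barely changes" usually means swapping a base URL and a model name. The sketch below builds the same OpenAI-style chat completion request for either backend using only the standard library. The base URLs assume each server's default port on the local machine, and the helper name `build_chat_request` is made up for this example.

```python
import json
import urllib.request

# Base URLs assume each server's default port on the local machine.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(backend: str, model: str, user_prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("ollama", "llama4", "Say hello.")
print(req.full_url)
# To send it, pass req to urllib.request.urlopen() with the chosen server running.
```

Keeping the backend choice in one dictionary (or an environment variable) means the migration from prototype to production serving touches configuration rather than application code.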
Advanced Workflows: Beyond Chat
Retrieval-Augmented Generation (RAG) Locally
A fully local RAG pipeline keeps your documents, embeddings, and generated answers entirely on your hardware. The architecture looks like this:
- Ingest: Chunk your documents into passages.
- Embed: Generate vector embeddings using a local embedding model (Ollama supports embedding models like nomic-embed-text via its API).
- Store: Index embeddings in a local vector database such as ChromaDB or LanceDB.
- Query: At query time, embed the user question, retrieve the most relevant chunks, and pass them as context to your local generation model.
Every step runs locally. No data leaves your machine.
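The four steps above can be sketched end to end in a few dozen lines. To keep the example self-contained, a toy bag-of-words vector stands in for a real embedding model and an in-memory list stands in for the vector database; both are deliberate simplifications, and the sample documents are invented. The control flow (embed, store, retrieve by similarity, build a prompt) is the part that carries over.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call a local embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list, k: int = 1) -> list:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Ingest, embed, store: an in-memory list stands in for the vector database.
chunks = [
    "Refunds are processed within 14 days of the return request.",
    "The API rate limit is 100 requests per minute per key.",
    "Support is available on weekdays from 9am to 5pm.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query: retrieve context, then build the prompt for the local generation model.
question = "What is the rate limit for the API?"
context = retrieve(question, index)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

In a real deployment the final `prompt` would be sent to the local generation model, and the similarity search would be delegated to the vector database rather than a linear scan.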
Fine-Tuning on Your Own Data
When RAG is not enough (for example, you need the model to adopt a specific tone, learn domain jargon, or master a task that benefits from weight updates rather than context injection), fine-tuning is the answer. QLoRA makes this feasible on consumer GPUs by quantizing the base model to 4-bit precision and training low-rank adapter weights in higher precision.
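The reason low-rank adapters are so cheap to train is plain arithmetic: instead of updating a full d_out by d_in weight matrix, LoRA trains two thin matrices of rank r whose product has the same shape. A quick count, using a hypothetical 4096 by 4096 projection and rank 16 as illustrative values:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection matrix with a rank-16 adapter.
full = 4096 * 4096
adapter = lora_params(4096, 4096, rank=16)
print(f"full matrix:  {full:,} weights")
print(f"LoRA adapter: {adapter:,} trainable weights ({adapter / full:.2%} of full)")
```

Only the adapter weights need gradients and optimizer state, which is where most fine-tuning memory goes; the frozen base model can then sit in 4-bit precision.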
Tools like Unsloth, Axolotl, and HuggingFace TRL provide streamlined fine-tuning pipelines. An RTX 4090 with 24GB VRAM can fine-tune models up to approximately 30B parameters with QLoRA. The RTX 5090's 32GB provides additional headroom.
The general rule: use RAG when your knowledge base changes frequently and the model's core capabilities are sufficient. Use fine-tuning when you need the model to behave differently at a fundamental level.
Coding Assistants: Running Your Own Copilot
Specialized coding models like Qwen 2.5 Coder and DeepSeek Coder V2 run well locally and can be integrated directly into your editor. The Continue.dev extension for VS Code (and JetBrains IDEs) connects to any OpenAI-compatible endpoint. Point it at your local Ollama instance, select a code-optimized model, and you have a private coding assistant with zero cloud dependency and zero per-token cost.
Performance Benchmarks: Real Numbers on Real Hardware
Test Methodology
Performance was measured using consistent settings across hardware: Q4_K_M quantization, 4096-token context window, greedy decoding (temperature 0), and 256-token generation length. All tests used Ollama as the runtime to ensure apples-to-apples comparison across hardware. Models tested: Llama 4 Scout (17B active / 109B MoE), Qwen 3 72B (dense), and a 7B baseline for reference.
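One way to obtain figures like these is from the timing fields Ollama includes in its final response object (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, `load_duration`, with durations in nanoseconds). The sketch below assumes that approach; the sample values are hand-written for illustration and are not a measured run, and the time-to-first-token estimate is a rough approximation.

```python
def generation_speed(stats: dict) -> dict:
    """Derive tok/s and an approximate TTFT from Ollama-style timing fields (nanoseconds)."""
    ns = 1e9
    return {
        "gen_tok_per_s": stats["eval_count"] / (stats["eval_duration"] / ns),
        "prompt_tok_per_s": stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / ns),
        # Rough TTFT: model load time plus prompt processing time.
        "ttft_ms": (stats["load_duration"] + stats["prompt_eval_duration"]) / 1e6,
    }


# Hand-written sample values for illustration, not a measured run.
sample = {
    "load_duration": 20_000_000,
    "prompt_eval_count": 64,
    "prompt_eval_duration": 160_000_000,
    "eval_count": 256,
    "eval_duration": 8_000_000_000,
}
result = generation_speed(sample)
print(result)
```

Averaging several runs after a warm-up request avoids counting the one-time model load in the first measurement.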
Results Table
| Model | Hardware | VRAM Used | Tokens/sec (gen) | Time to First Token |
|---|---|---|---|---|
| Llama 4 Scout Q4 | RTX 4090 (24GB) | ~16 GB | ~33 tok/s | ~180 ms |
| Llama 4 Scout Q4 | RTX 5090 (32GB) | ~16 GB | ~45 tok/s | ~120 ms |
| Llama 4 Scout Q4 | M4 Max (128GB) | ~16 GB unified | ~22 tok/s | ~210 ms |
| Qwen 3 72B Q4 | RTX 4090 (24GB) | 24 GB (partial offload) | ~12 tok/s | ~450 ms |
| Qwen 3 72B Q4 | RTX 5090 (32GB) | 32 GB (tight fit) | ~18 tok/s | ~280 ms |
| Qwen 3 72B Q4 | M4 Max (128GB) | ~40 GB unified | ~10 tok/s | ~520 ms |
| Llama 3.2 7B Q4 | RTX 4090 (24GB) | ~4 GB | ~95 tok/s | ~45 ms |
| Llama 3.2 7B Q4 | M4 Pro (24GB) | ~4 GB unified | ~42 tok/s | ~75 ms |
Interpreting the Numbers
For interactive chat, anything above 15 tokens per second feels responsive. Above 30 tokens per second is fast enough that most users cannot tell the difference from a cloud API. The RTX 5090 delivers roughly a 35% to 50% improvement over the 4090 in the results above (45 vs. 33 tok/s on Llama 4 Scout, 18 vs. 12 tok/s on Qwen 3 72B), which is consistent with its higher memory bandwidth and larger VRAM.
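To translate those rates into what a user actually waits for, add the time to first token to the generation time for the whole reply. The sketch below does that for the three Llama 4 Scout rows of the results table at the 256-token generation length used in the methodology; the function name is made up for this example.

```python
def response_seconds(ttft_ms: float, tok_per_s: float, output_tokens: int) -> float:
    """Wall-clock time for a full reply: time to first token plus generation time."""
    return ttft_ms / 1000 + output_tokens / tok_per_s


# (TTFT in ms, tokens/sec) for the Llama 4 Scout rows of the results table.
scout_rows = {"RTX 4090": (180, 33), "RTX 5090": (120, 45), "M4 Max": (210, 22)}

for hardware, (ttft_ms, tps) in scout_rows.items():
    total = response_seconds(ttft_ms, tps, output_tokens=256)
    print(f"{hardware:9s} {total:5.1f} s for a 256-token reply")
```

For replies of this length, generation speed dominates and time to first token is a small fraction of the total; with streaming enabled, though, time to first token is what determines how quickly the interface feels alive.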
The MoE architecture of Llama 4 Scout is the clear winner in the "big model on modest hardware" category: GPT-4-class quality in a package that runs comfortably on a single consumer GPU. Dense 70B models like Qwen 3 72B deliver excellent quality but push consumer hardware to its limits and benefit enormously from the 5090's extra 8GB of VRAM.
Note that going below Q3 quantization rarely makes sense. The quality degradation accelerates while the VRAM savings become smaller in absolute terms.
Security and Networking Considerations
Exposing Local Models Safely
By default, Ollama binds to localhost:11434, meaning only processes on the same machine can reach it. This is the correct default. If you need to serve the model to other machines on your network or to a team, do not simply bind to 0.0.0.0. Instead, place a reverse proxy in front of the endpoint:
Use NGINX or Caddy to terminate TLS, enforce authentication (even basic HTTP auth is better than nothing), and rate-limit requests. This applies equally to vLLM's server mode. Neither tool ships with built-in authentication, so treating them as internal services behind a secured proxy is essential for any shared or team deployment.
Model Supply Chain Security
GGUF model files downloaded from community sources are binary blobs that your inference engine loads directly into memory. This is a supply chain risk. Prefer models from verified publishers on HuggingFace, where community scanning and audit mechanisms exist. Check file checksums when available. Avoid downloading quantized models from unvetted personal repositories or anonymous file-sharing links. The same caution you apply to pulling Docker images from unknown registries applies here. Be especially wary of pickle-based model formats (such as older PyTorch .bin files), which can execute arbitrary code on load; GGUF and safetensors formats are safer in this regard.
On the operational side, be aware that local runtimes may write logs that include prompts and responses. Audit your runtime's logging configuration if you are processing sensitive data, and set appropriate file permissions on model weights and log directories.
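The checksum check recommended above takes only a few lines. This is a minimal sketch that assumes the publisher lists a SHA-256 digest for the file; the function names, file path, and digest in the usage note are placeholders, not real values.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte model files never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA-256 against the digest the publisher lists."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call it as `verify_checksum("model.gguf", "<published digest>")` and refuse to load the file if it returns `False`. A matching checksum only proves the download is the file the publisher uploaded; it says nothing about whether the publisher is trustworthy.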
Decision Framework: Choosing Your Stack
Flowchart: Which Tool and Hardware for Your Use Case
Solo developer prototyping: Ollama on an RTX 4090 or M4 Pro. Fastest path to a working local API. Minimal configuration. Start here if you are new to local inference.
Team or startup handling sensitive data: vLLM on an RTX 5090 or multi-GPU server. Production-grade throughput, continuous batching for concurrent users, and the performance headroom to serve a team.
Non-technical stakeholders or demos: LM Studio or Jan. The graphical interface lets people explore models without terminal access.
Open-source purist or extension builder: Jan. Fully auditable codebase with a plugin system.
Enterprise deployment: vLLM on dedicated GPU server hardware (multi-A100 or H100), behind a reverse proxy with authentication, monitoring, and audit logging.
Do not forget licensing as a decision input. Verify that your chosen model's license permits your intended use, especially for commercial applications.
Total Cost of Ownership Comparison
| Scenario (12 months) | Cloud API (GPT-4o tier) | Local (RTX 5090 build) |
|---|---|---|
| 500K tokens/day | ~$6,000–$9,000 | ~$2,500 (amortized hardware) |
| 1M tokens/day | ~$12,000–$18,000 | ~$2,500 (same hardware) |
| 5M tokens/day | ~$60,000–$90,000 | ~$5,000 (dual-GPU build amortized) |
The break-even point for a single RTX 5090 build typically lands between one and three months of moderate API usage. After that, local inference runs at the cost of electricity (and your time maintaining the setup). For high-volume use cases, the economics are not even close.
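To run the break-even arithmetic on your own numbers, a sketch like the following is enough. It makes simplifying assumptions: cloud spend is spread evenly over 12 months, and electricity and maintenance time are ignored, so real break-even points will land somewhat later. The function name is illustrative.

```python
def break_even_months(hardware_cost: float, annual_cloud_cost: float) -> float:
    """Months of cloud API spend needed to equal a one-time hardware cost."""
    monthly_cloud_cost = annual_cloud_cost / 12
    return hardware_cost / monthly_cloud_cost


if __name__ == "__main__":
    # Low and high cloud estimates for the 1M tokens/day row of the table.
    for annual in (12_000, 18_000):
        months = break_even_months(2_500, annual)
        print(f"${annual:,}/year cloud spend: break-even in {months:.1f} months")
```

With the table's 1M tokens/day row, this works out to roughly 1.7 to 2.5 months.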
What's Coming Next: The Local LLM Roadmap
Several trends will push local inference further into the mainstream over the next 12 to 18 months.
Speculative decoding is gaining adoption across runtimes. The technique uses a small, fast draft model to propose several tokens that a larger target model then verifies in parallel, which significantly accelerates generation whenever the large model is the bottleneck. vLLM and llama.cpp both have active support for this technique.
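The control flow behind speculative decoding can be sketched with toy stand-ins for the two models. This is an illustration of the greedy variant only, not how vLLM or llama.cpp implement it: the "models" here are plain functions invented for the example, and a real runtime scores all drafted positions in one batched forward pass instead of the loop shown.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_step(context: List[str], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[str]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[str] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First disagreement: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token was accepted; the target contributes one more.
    accepted.append(target(context + accepted))
    return accepted


def generate(context: List[str], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> List[str]:
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(context) + max_new]


# Hypothetical toy models: the target recites a fixed sentence, and the
# draft agrees with it except at every fourth position.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(ctx: List[str]) -> str:
    return SENTENCE[len(ctx) % len(SENTENCE)]


def toy_draft(ctx: List[str]) -> str:
    return "cat" if len(ctx) % 4 == 3 else toy_target(ctx)


if __name__ == "__main__":
    print(" ".join(generate([], toy_draft, toy_target, max_new=9)))
```

The output always matches what the target model would have produced on its own. The speedup comes from accepting several tokens per target pass whenever the draft guesses correctly.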
FP8 and sub-8-bit quantization improvements along with KV-cache compression will squeeze more model capacity into the same VRAM, making 100B+ parameter models more accessible on single high-end consumer GPUs.
WebGPU inference is emerging but remains impractical for large models due to browser memory constraints. For models under 3B parameters, it may become viable for client-side inference in web applications.
NPU acceleration on next-generation laptops from Qualcomm, Intel, and AMD promises dedicated neural processing silicon alongside the GPU, though software support remains fragmented.
On-device mobile inference continues to improve, with successors to Llama 3.2's mobile-optimized models targeting smartphones and tablets for lightweight tasks.
Your Desk Is the New Data Center
The shift to local LLMs is not a hobbyist trend or a privacy workaround. It is a structural change in how developers and organizations interact with AI. The models are good enough. The hardware is affordable enough. The tools are simple enough. What was a three-day project requiring deep systems expertise in 2024 is now a ten-minute setup.
Start with Ollama and a single model today. Pull it, run it, hit the local API from your application code. Once you have experienced the speed, the privacy, and the freedom from per-token billing, the question stops being "Should I run models locally?" and becomes "Why would I send this data anywhere else?"
Matt Mickiewicz
Matt is the co-founder of SitePoint, 99designs and Flippa. He lives in Vancouver, Canada.
