Running large language models on local hardware not only lets you avoid paying monthly subscriptions to cloud providers, but also prevents large corporations from gaining access to your private data. But unless you’re willing to spend thousands of dollars on a top-of-the-line graphics card, you’re bound to run out of VRAM when attempting to run models with over 15B parameters. Sure, 7B and 9B models can get the job done when it comes to productivity tasks, but sub-10B LLMs (or even their sub-20B counterparts, for that matter) aren’t the best for hardcore coding workloads or tasks where the output has to be precise.

Fortunately, all hope isn’t lost, as there are a handful of workarounds for deploying larger models on cheaper, outdated hardware. In fact, with a little bit of tweaking, you can get reliable performance without giving up much accuracy in the outputs.

Offloading layers lets me run massive LLMs on weak GPUs

That’s how I managed to deploy Qwen3.6-35B-A3B on 12GB of VRAM

Although your GPU is the ideal component for providing extra processing oomph to your LLMs, it’s not the only device capable of running them. On a typical LLM, offloading a handful of layers to the CPU and system memory lets your server load a model that wouldn’t otherwise fit in VRAM, though you’ll see a noticeable drop in token generation speed. Still, that’s definitely better than not being able to load the model at all, and it’s pretty much how I got larger models “running” at low token generation rates on my old Pascal card.
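As a rough illustration of how a layer split can be picked, the sketch below estimates how many layers fit inside a VRAM budget. It assumes every layer is the same size and sets aside a fixed reserve for the KV cache and runtime buffers; the function name, the reserve, and the example numbers are all hypothetical rather than measurements.

```python
def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float,
                    reserve_gib: float = 2.0) -> int:
    """Estimate how many transformer layers fit on the GPU.

    Assumes all layers are equally sized, which is only approximately true.
    reserve_gib is a guess at the space needed for the KV cache and buffers.
    """
    if model_gib <= 0 or n_layers <= 0:
        raise ValueError("model_gib and n_layers must be positive")
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical example: a 20 GiB model with 40 layers on a 12 GiB card.
gpu_layers = layers_that_fit(model_gib=20.0, n_layers=40, vram_gib=12.0)
print(f"Layers to keep on the GPU: {gpu_layers}")
```

In llama.cpp, a number like this is what you would pass to the --n-gpu-layers option, with the remaining layers staying on the CPU. Treat it as a starting point and adjust it based on actual memory usage.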

However, mixture-of-experts models like Qwen3.6-35B-A3B and GPT-OSS-20B have an ace up their sleeve when it comes to running them on limited hardware. Rather than moving entire layers off the GPU, you can offload just the expert weights to system RAM, while leaving the attention layers on the graphics card. The experts make up most of the model’s size, but only a few of them are active for any given token, so the CPU ends up handling far less work than it would with whole layers. This way, you can run MoE models with limited VRAM and still get reliable token generation speeds. Thanks to the MoE offload technique, I was able to get 20+ tokens/s when prompting Qwen3.6-35B-A3B on my RTX 3080 Ti, an outdated GPU with merely 12GB of VRAM.
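To make the expert-versus-attention split more concrete, here is a minimal sketch that sorts tensor names into CPU and GPU buckets. It assumes the GGUF naming convention in which expert tensors look like blk.<layer>.ffn_up_exps.weight, and it mimics the idea of keeping the experts of the first N layers on the CPU. The tensor list and the place_tensor helper are hypothetical, not part of any library.

```python
import re

# Assumed naming convention for MoE expert tensors in GGUF files.
EXPERT_PATTERN = re.compile(r"^blk\.(\d+)\.ffn_(up|down|gate)_exps\.")


def place_tensor(name: str, n_cpu_moe: int) -> str:
    """Return "CPU" for expert tensors in the first n_cpu_moe layers, else "GPU"."""
    match = EXPERT_PATTERN.match(name)
    if match and int(match.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"


# Hypothetical tensor names, for illustration only.
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.31.ffn_down_exps.weight",
    "blk.32.ffn_gate_exps.weight",
]

placement = {name: place_tensor(name, n_cpu_moe=32) for name in tensors}
for name, device in placement.items():
    print(f"{name:32s} -> {device}")
```

Attention tensors always stay on the GPU in this sketch, while expert tensors move to the CPU only for layers below the chosen cutoff.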

I’m a llama.cpp user, so I used the --n-cpu-moe flag to keep the expert weights on my CPU. The flag takes the number of layers whose experts should stay in system RAM, and after running a couple of benchmarks, I found 32 to be the sweet spot for my LLM tasks. And the best part? I can still use a decently large context length of 65,536 tokens, making my self-hosted Qwen3.6 setup perfect for long prompting sessions when I need to troubleshoot broken experiments or long chains of terminal outputs.
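For reference, a launch command with those settings could be assembled along these lines. This is only a sketch: it assumes the llama-server binary is on your PATH, the model path is a placeholder, and the --n-gpu-layers value of 99 is a common way of asking for every layer on the GPU rather than a setting taken from my configuration.

```python
import shlex


def build_llama_server_cmd(model_path: str, n_cpu_moe: int = 32,
                           ctx_size: int = 65536,
                           n_gpu_layers: int = 99) -> list[str]:
    """Build an argument list for llama-server with MoE experts kept on the CPU."""
    return [
        "llama-server",
        "--model", model_path,
        # Assumption: request all layers on the GPU, then pull experts back out.
        "--n-gpu-layers", str(n_gpu_layers),
        # Keep the expert weights of the first n_cpu_moe layers in system RAM.
        "--n-cpu-moe", str(n_cpu_moe),
        "--ctx-size", str(ctx_size),
    ]


# Placeholder path; point this at your own GGUF file.
cmd = build_llama_server_cmd("models/your-moe-model.gguf")
print(shlex.join(cmd))
```

Returning a list keeps the command safe to hand to subprocess.run without shell quoting issues, and printing it first lets you check the flags before launching anything.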

Opting for heavier quantization helps just as much

But I wouldn't recommend going below Q4_K_M

Besides selectively offloading the experts, switching to heavily quantized models is a great way to get some more oomph out of your GPU. If you browse Hugging Face as often as I do, you’ll find models with different quantization levels, ranging from the highly precise Q8 versions to the heavily compressed Q2 and Q3 variants. Since quantization reduces the precision of model weights, Q8 stays close to the full accuracy of the model, but the VRAM hit is so large that it often makes more sense to go for an LLM with more parameters and heavier quantization. For example, if I had to choose between DeepSeek-R1 7B running at a lighter quant (something like Q8) and its 14B variant running at Q4_K_M or Q5_K_M, I’d pick the latter. While there are exceptions to this rule, I’d rather run a model with stronger reasoning capabilities than opt for better precision on a comparatively weaker LLM.

Of course, that doesn’t mean you should go full throttle on the quantization aspect, either. Depending on your specific model, going down to Q2_K (or worse, Q2_K_S) can reduce its accuracy quite a bit. I consider Q4_K_M the sweet spot for my LLM workloads, as it’s light enough to fit on my old workstations without reducing the precision too much.
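The 7B-versus-14B trade-off above is easier to judge with some back-of-the-envelope arithmetic. The sketch below multiplies the parameter count by an approximate bits-per-weight figure for each quant. Those figures are assumptions: real GGUF files vary by architecture, and the estimate ignores the KV cache and runtime buffers.

```python
# Approximate bits per weight for a few GGUF quant types (assumed values).
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def approx_weights_gb(n_params_billion: float, quant: str) -> float:
    """Estimate the size of the weights in GB (10^9 bytes) for a given quant."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return n_params_billion * bits / 8


small_q8 = approx_weights_gb(7, "Q8_0")
large_q4 = approx_weights_gb(14, "Q4_K_M")
print(f"7B at Q8_0:    ~{small_q8:.1f} GB")
print(f"14B at Q4_K_M: ~{large_q4:.1f} GB")
```

Under these assumed figures, the 14B model at Q4_K_M lands at roughly 8.5 GB against roughly 7.4 GB for the 7B model at Q8, so doubling the parameter count costs only about a gigabyte more memory.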

Local LLMs can be a godsend for productivity tasks

Name a better duo than self-hosted tools and local models

Over the last couple of months, I’ve started pairing my self-hosted LLMs with different productivity apps, and larger models such as Qwen3.6-35B-A3B and GPT-OSS-20B work really well with most tasks I throw at them. On the coding front, they mesh well with Claude Code and VS Code, and the former is especially effective at helping me troubleshoot annoying bugs, autocompleting long functions, and rewriting syntax (especially config files and IaC documents) into different languages.

I also use them for my document management tasks, with my Paperless-GPT and Paperless AI instances benefiting the most from their superior reasoning prowess. Since I don’t need to turn down the context window, these massive LLMs can sift through my notes on Blinko and documents on Open Notebook to answer my queries. Now, I wouldn’t say that it’s possible to run 200B+ models without enterprise-grade hardware. But if you’ve got a decent GPU, you can try offloading the experts (or even sequential layers) and switching to more heavily quantized models to get some of the best consumer-tier LLMs running on your humble PC.
