Running large language models on local hardware not only lets you avoid paying monthly subscriptions to cloud providers, but also keeps your private data out of the hands of large corporations. But unless you’re willing to spend thousands of dollars on a top-of-the-line graphics card, you’re bound to run out of VRAM when loading models with more than 15B parameters. Sure, 7B and 9B models can get the job done for everyday productivity tasks, but sub-10B LLMs (or even their sub-20B counterparts, for that matter) aren’t the best for hardcore coding workloads or tasks that demand precise output.
Fortunately, all hope isn’t lost, as there are a handful of workarounds for deploying larger models on cheaper, outdated hardware. In fact, with a little bit of tweaking, you can get reliable performance without giving up much accuracy in the outputs.
Offloading layers lets me run massive LLMs on weak GPUs
That’s how I managed to deploy Qwen3.6-35B-A3B on 12GB of VRAM
Although your GPU is the ideal component for providing extra processing oomph to your LLMs, it’s not the only device capable of running them. With a typical dense LLM, offloading a handful of layers to the CPU and system memory lets your server load a model that wouldn’t otherwise fit in VRAM, though you’ll see a noticeable drop in performance. Still, that beats not being able to load the model at all, and it’s pretty much how I got larger models “running” at low token generation rates on my old Pascal card.
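To get a feel for how many layers can stay on the GPU, here is a minimal back-of-the-envelope sketch. It assumes every layer is roughly the same size and that a fixed chunk of VRAM is reserved for the context cache and runtime overhead. The function name and all the numbers are hypothetical and not measurements from my setup.

```python
GIB = 2**30  # bytes in one gibibyte


def gpu_layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes):
    """Estimate how many equally sized layers fit in the usable VRAM.

    Assumption: layers are the same size, which is only roughly true
    for real models (embeddings and output layers differ).
    """
    usable = max(vram_bytes - reserve_bytes, 0)
    # Integer arithmetic avoids floating-point rounding surprises.
    return min(n_layers, usable * n_layers // model_bytes)


# Hypothetical example: a 20 GiB model with 48 layers on a 12 GiB card,
# keeping 2 GiB free for the KV cache and other overhead.
layers_on_gpu = gpu_layers_that_fit(20 * GIB, 48, 12 * GIB, 2 * GIB)
print(f"Keep about {layers_on_gpu} of 48 layers on the GPU")
```

The result is only a starting point for the GPU layer count you pass to your inference tool. In practice, you would nudge the value up or down until the model loads without running out of memory.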
However, mixture-of-experts models like Qwen3.6-35B-A3B and GPT-OSS-20B have an ace up their sleeve when it comes to running them on limited hardware. Only a small subset of the experts is active for any given token (the A3B suffix indicates roughly 3B active parameters), so rather than moving entire layers off the GPU, you can keep the bulky expert weights in system RAM while leaving the attention layers on the graphics card. This way, you can run MoE models with limited VRAM and still get reliable token generation speeds. Thanks to the MoE offload technique, I was able to get 20+ tokens/s when prompting Qwen3.6-35B-A3B on my RTX 3080 Ti, an outdated GPU with merely 12GB of VRAM.
I’m a llama.cpp user, so I used the --n-cpu-moe flag, which keeps the expert weights of the first N layers on the CPU. After running a couple of benchmarks, I found a value of 32 to be the sweet spot for my LLM tasks. And the best part? I can still use a decently large context length of 65,536 tokens, making my self-hosted Qwen3.6 setup perfect for long prompting sessions when I need to troubleshoot broken experiments or long chains of terminal outputs.
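As a rough sketch, a launch command with those settings could be assembled like this. The --n-cpu-moe value of 32 and the 65536 context length come from the setup described above. The choice of llama-server as the binary, the --n-gpu-layers value of 99 (a common way of asking for every layer on the GPU), and the model path are assumptions for illustration, not my exact command.

```python
import shlex


def build_llama_server_cmd(model_path, n_cpu_moe=32, ctx_size=65536,
                           n_gpu_layers=99):
    """Return the argument list for a llama-server launch.

    n_gpu_layers=99 is an assumed value meant to place all layers on
    the GPU; --n-cpu-moe then keeps the expert weights of the first
    n_cpu_moe layers in system RAM.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--n-cpu-moe", str(n_cpu_moe),
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical GGUF path; substitute the file you actually downloaded.
cmd = build_llama_server_cmd("models/qwen-moe-q4_k_m.gguf")
print(shlex.join(cmd))
```

If the model fails to load, raising the --n-cpu-moe value moves more expert weights off the GPU. Lowering it does the opposite and usually improves speed as long as everything still fits in VRAM.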
Opting for heavier quantization helps just as much
But I wouldn't recommend going below Q4_K_M
Besides selectively offloading the experts, switching to more heavily quantized models is a great way to squeeze larger LLMs onto your GPU. If you browse Hugging Face as often as I do, you’ll find models with different quantization levels, ranging from the highly precise Q8 versions to the heavily compressed Q2 and Q3 variants. Since quantization reduces the precision of the model weights, Q8 offers close to the full accuracy of the model, but its VRAM hit is so massive that it often makes more sense to go for a higher-parameter LLM with heavier quantization. For example, if I had to choose between DeepSeek-R1 7B running at a high-precision quant (something like Q8) and its 14B variant running at Q4_K_M or Q5_K_M, I’d pick the latter. While there are exceptions to this rule, I’d rather run a model with better reasoning capabilities than opt for higher precision on a comparatively weaker LLM.
Of course, that doesn’t mean you should go full throttle on the quantization aspect, either. Depending on your specific model, going down to Q2_K (or worse, Q2_K_S) can reduce its accuracy quite a bit. I consider Q4_K_M the sweet spot for my LLM workloads, as it’s light enough to fit on my old workstations without reducing the precision too much.
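To put rough numbers on the 7B-versus-14B comparison, here is a small estimate of the weight footprint at different quantization levels. The bits-per-weight figures are approximate averages I’m assuming for llama.cpp-style quants, the parameter counts are the nominal 7B and 14B, and the estimate ignores the context cache and runtime overhead, so treat the output as a ballpark rather than a measurement.

```python
# Approximate average bits per weight (assumed values, not exact).
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def approx_weight_gib(params_billion, quant):
    """Estimate the size of the quantized weights in GiB."""
    total_bits = params_billion * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


for params, quant in [(7, "Q8_0"), (14, "Q4_K_M"), (14, "Q5_K_M"), (14, "Q8_0")]:
    print(f"{params}B at {quant}: ~{approx_weight_gib(params, quant):.1f} GiB")
```

Under these assumptions, a 14B model at Q4_K_M needs only about 1 GiB more than a 7B model at Q8, while the same 14B model at Q8 would overflow a 12GB card before the context cache is even counted.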
Local LLMs can be a godsend for productivity tasks
Name a better duo than self-hosted tools and local models
Over the last couple of months, I’ve started pairing my self-hosted LLMs with different productivity apps, and larger models such as Qwen3.6-35B-A3B and GPT-OSS-20B handle most of the tasks I throw at them really well. On the coding front, they mesh well with Claude Code and VS Code, and Claude Code is especially effective at helping me troubleshoot annoying bugs, autocomplete long functions, and translate code (especially config files and IaC documents) from one language to another.
I also use them for my document management tasks, with my Paperless-GPT and Paperless AI instances benefiting the most from their superior reasoning prowess. Since I don’t need to turn down the context window, these massive LLMs can sift through my notes on Blinko and documents on Open Notebook to answer my queries. Now, I wouldn’t say that it’s possible to run 200B+ models without enterprise-grade hardware. But if you’ve got a decent GPU, you can try offloading the experts (or even sequential layers) and switching to more heavily quantized models to get some of the best consumer-tier LLMs running on your humble PC.
