Large language models (LLMs) are increasingly everywhere. Copilot, ChatGPT, and others are now so ubiquitous that you almost can’t use a website without being exposed to some form of "artificial intelligence," even if the feature isn’t exactly smart. That said, running your own LLM from home is pretty cool and can open up a world of possibilities, from helping you be more productive to interacting with other self-hosted services. I recently started hosting LLMs and it blew me away.
The best part? You don’t need the world’s most powerful GPU to get started, though it would certainly help. The issue is that models usually require high VRAM and GPU resources to provide enough computing power, which just isn’t available with most desktop PCs and home lab servers. There are ways around this problem, however. Here’s how I got around VRAM limitations when hosting models to interact with.
Picking the right model size
No two LLMs are the same
So, why is VRAM so important for hosting and running models? Simply put, a GPU with fast onboard memory is the best hardware for this kind of workload. The CPU and system RAM are capable of handling an LLM, but performance will be sluggish since system RAM has significantly less bandwidth than the memory on a GPU. Even if you have 64GB of DDR5-7000 with an AMD Ryzen 9 9950X CPU, you'll still get better results with an Nvidia GeForce RTX 3060 Ti, as long as the model fits in its VRAM.
VRAM is critical for storing model weights and running intermediate computations. Larger models with 70B parameters or more require considerable amounts of memory, which is why high-VRAM GPUs are often recommended. Even smaller 14B models can take up a lot of memory, so it’s important to consider not just your hardware but also how many models you’ll need to run simultaneously.
There are ways to get around these VRAM limitations. You can opt for model compression, quantization, and pruning, all of which can lower memory requirements. Each step that reduces a model's footprint will affect accuracy and the other metrics used to measure LLM performance, but it allows larger models (or multiple instances) to run on weaker hardware. A 4B model will be significantly quicker than a 14B version, but you'll lose out on accuracy.
Lowering the model footprint
It's all about quantization
Quantization reduces the precision of model weights from 16-bit or 32-bit down to as little as 4-bit. Going from 32-bit to 4-bit cuts the memory needed for the weights by up to roughly 87%, so a 70B model shrinks from around 280GB to about 35GB, and smaller models become possible to run on hardware with as little as 8GB of VRAM. While this slightly reduces accuracy, the trade-off is often worth it for the flexibility it provides.
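As a rough back-of-the-envelope check, the memory needed for the weights alone is simply the parameter count multiplied by the bytes used per weight. The sketch below assumes exactly that and nothing more; the function name is made up for illustration, and real quantized files come out a little larger because they also store scaling factors and keep some tensors at higher precision.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_weight / 1e9


# Compare a 70B and a 14B model at common precisions.
for params in (70, 14):
    for bits in (32, 16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params}B at {bits}-bit: ~{size:.0f} GB")
```

This ignores the context (KV) cache and runtime buffers, so treat the result as a floor rather than the full VRAM requirement.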
While this does impact the capabilities of the model (there's always a trade-off somewhere), it has been found that slightly reducing the precision of a model can still lead to similar results. This alone allows specific LLMs to run on a variety of hardware configurations, everything from a single-board computer (SBC) to a full-fledged server. It's fairly impressive how this technology works, calibrating the LLM to run at lower precision.
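To get a feel for what "lower precision" means, here is a toy sketch of one simple scheme, symmetric "absmax" quantization, where a group of weights shares a single scale factor. This is an illustration only and not what any particular tool does internally; the formats used by real runtimes are more elaborate, and the sample weights are invented.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.97, -0.08, 0.31]  # made-up sample values
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("integers:", q)
print("largest round-trip error:", round(max_err, 4))
```

Each restored weight lands within half a quantization step of the original, which is why a model can often tolerate the loss.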
Pruning and distillation are techniques to create more efficient and smaller model versions. Like quantization, the aim is to reduce the overall footprint as much as possible without sacrificing performance too heavily. Pruning, much like gardening, removes parts of the model deemed less important. Distillation transfers knowledge between "teacher" and "student" models, resulting in faster, more compact models.
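Both ideas can be sketched in a few lines. The example below assumes the simplest variants: magnitude pruning (zero out the smallest weights) and a distillation loss that measures how far the student's softened output distribution is from the teacher's. The function names, temperature, and sample numbers are illustrative assumptions, not how any specific model was produced.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


print(magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.2], sparsity=0.5))
# -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
print(distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

The loss is zero when the student matches the teacher exactly and grows as their predictions diverge, which is the signal used to train the smaller model.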
You'll see a few smaller models essentially mimicking larger ones, such as TinyLlama. These can prove invaluable in running an LLM on weaker hardware without losing all the benefits of a larger model. Then there's the actual framework, which can be configured to use as few resources as possible so as to leave more of the system for the model itself. Ollama is a pretty good lightweight server for running LLMs on CPUs and GPUs.
I've tried others too with favorable results, including llama.cpp and ONNX for CPU-friendly models.
Start small, see how you go
You can always switch to another LLM
With just 8GB of VRAM on the RTX 3060, I opted to run the 14B Qwen3 with Q4 quantization. A model like this would typically need around double that VRAM to run comfortably, but just 8GB turned out to be enough with some spillover into RAM. On the RTX 4060 Ti with 16GB of VRAM, running qwen3:14b-q4 takes up just 10GB, leaving a little free for anything else that I need to run. It all depends on what you need to use an LLM for (and how many you want to run).
So long as you're happy with the performance, it doesn't matter how low your token throughput is or how long the model takes to conjure up a response. It's easy to gauge just how much VRAM you'll require when looking at a list of models and their various versions. Many tables and lists will provide size estimations, which should work out at almost 1:1 for how much memory will be required to store all the parameters.
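That near 1:1 rule makes for a quick fit check. The sketch below takes a model's listed size and your card's VRAM and reports whether it fits or roughly how much would spill into system RAM. The helper name and the 1.5GB overhead allowance for context and runtime buffers are assumptions for illustration; actual overhead varies with context length and the runtime you use.

```python
def vram_fit(model_size_gb, vram_gb, overhead_gb=1.5):
    """Return (fits, spill_gb) for a model of the listed size on a given GPU.

    overhead_gb is an assumed allowance for context and runtime buffers.
    """
    needed = model_size_gb + overhead_gb
    spill = max(0.0, needed - vram_gb)
    return spill == 0.0, spill


# A model of around 10GB on an 8GB card and on a 16GB card.
for vram in (8, 16):
    fits, spill = vram_fit(10, vram)
    if fits:
        print(f"{vram}GB card: fits entirely in VRAM")
    else:
        print(f"{vram}GB card: about {spill:.1f}GB spills into system RAM")
```

The more that spills over, the slower responses get, so it is worth comparing a couple of sizes and quantization levels before downloading.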
I've noticed that slightly larger models with more aggressive quantization yield results as good as (if not better than) "more complete" smaller models. The beauty of using LLMs is that it's pretty easy to switch between models. Using Ollama, I can download and try a new model type within a minute, depending on size and download speed. Even after looking through Reddit and other LLM sites, it's best to try a few before settling on one that ticks all your boxes.
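Switching models can be as simple as changing one string. As a minimal sketch, the snippet below talks to a locally running Ollama server over its HTTP API, assuming the default address of localhost on port 11434 and that the model has already been pulled. The helper names are my own and the model tags you pass in are placeholders for whatever you have installed.

```python
import json
import urllib.request

# Default local Ollama address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generate request for the given model tag."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=300):
    """Send the prompt to the server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, calling `ask("qwen3:14b", "Explain quantization in one sentence.")` and then repeating it with a different tag is an easy way to compare answers and response times side by side.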
Just don't fret too much about VRAM requirements, as there are plenty of ways to get around them. These include choosing smaller models, taking advantage of more efficient quantization variants, and allowing the LLM to take a little longer to complete requests.
