Voozh

There are many cloud-based services that operate powerful language models, but like anything on the cloud, two immediate problems arise: data collection and consistent access. I've always loved experimenting with self-hosted LLMs, and the tech has advanced hugely since it first became possible to run completely free and powerful models on a consumer-grade graphics card. Of course, what you can actually run differs greatly based on your specs, but the truth is that there's a lot out there.

Nowadays, I run Gemma 27B IT QAT and Qwen 2.5 Coder 32B on my AMD Radeon RX 7900 XTX, and I play around with smaller models for local testing, like Deepseek R1 0528 Qwen 3 8B. There's been a lot to learn over time, though, and as I figured things out, there were plenty of lessons I learned along the way that I wish I had known before I started, as it would have saved me a lot of time or helped me upgrade my setup faster. These are the most important lessons that I took from it, and while some of them are aimed at beginners just getting started, some of them are things even the more seasoned self-hosting veteran may not be aware of.

7 Model size and VRAM aren't the only things that matter

Memory bandwidth is a major factor

The first lesson I wish I knew sooner is that model size isn't just about how "smart" a model is, even if the parameters of a model do map pretty linearly to its capabilities. However, there's one other aspect to consider, and that's how many tokens are generated per second. Memory bandwidth can play a big part here, and that's why the RTX 3090 is still one of the best consumer-grade GPUs for local inference despite the release of the 4090 and the 5090, and that's thanks to its high memory bandwidth and lower cost when compared to the newer RTX xx90 cards that have been released since then. Those cards perform better than the RTX 3090 in a lot of ways, but not to the scale you'd expect when comparing them in inference specifically. For reference, the 5090 can see a performance uplift of anywhere from two to three times the speed of the 3090 (though it also has 32GB of VRAM), yet the 4090 to the 3090 presents only a minor upgrade when it comes to large language models.

There's another part of the equation to consider, too, and that's the context window. Language models calculate their "position" using Rotary Positional Embeddings (RoPE) encoded in transformers, and these act like a mathematical ruler laid over the sequence. Increasing the length of this ruler (the context window) means more multiplications during every forward pass and a larger key-value cache, and doubling the length of the context (for example, from 8K tokens to 16K) can cut performance in half. There are RoPE scaling methods that scale this further (like NTK or yaRN), but that scaling can then blur details that degrade responses as the conversation lengthens.

There are better ways around this that allow you to provide more information without using up your context window, but while it's tempting to crank up the context length in order to try and give your LLM superhuman levels of recall, it comes at a cost. Performance will quickly degrade as time goes on, and if you overflow your VRAM and start hitting RAM, it will only get worse.

6 Quantization is your friend

Decrease memory usage with a negligible impact on performance

Quantization is one of the most important things to learn about when it comes to self-hosted LLMs, as it dictates a whole lot. It essentially compresses the continuous 16-bit or 32-bit floating-point numbers that make up a neural network into fewer bits, storing approximate values that are "good enough" for inference. In fact, eight-bit integer quantization (INT8) is quite common at this point, and it maps each channel's range to 256 discrete levels at runtime and can often run with no retraining. To put it another way, let's take the 671B parameter version of DeepSeek's R1 model, specifically the Q4_K_M 4-bit quantized version. There's very little quality lost compared to the full-sized model without any quantization, yet the reduction in memory footprint is a big deal. Here's how to read that, first of all:

Qx: This refers to the quantization level. It's how much memory is used to store the model's weight.
K: This refers to the k-quant family (originally "k-means"/improved quantization) schemes in llama.cpp that use grouped blocks with additional scale and min data for better accuracy.
M: This refers to which tensors get higher-precision sub-formats, and can be S, M, or L, meaning small, medium, or large.

So, Q4_K of the original Deepseek R1 model (not 0528) comes in at about 400GB. What about Q6_K and Q8_0? Those come in at about 550GB and 713GB each, yet the difference in technical performance between them is very little. What this means is that a model that would theoretically require 713GB of RAM to run can run in a machine with less than 500GB, which is a big deal. Going from Q8 to Q4 will see memory usage drop by almost half, which we can see above, yet thanks to the technology underpinning the deployment of local language models, it's still almost as good. It's essentially using a compressed floating-point tensor alongside metadata, meaning it can reconstruct the values at runtime, resulting in similar outputs to a larger model with much lower memory.

There are drawbacks to aggressive quantization, though, such as higher reconstruction error. This means rare words or subtle numeric reasoning can fail when every operation is being rounded. Despite this, the exponential savings in VRAM and bandwidth usually outweigh the occasional loss in accuracy, particularly for more "basic" usage, and the ever-so-slight trade-off in performance will result in being able to run a larger model than you could have otherwise. Plus, it's almost a guarantee that the smaller model would have had worse performance in all categories than the quantized version of the larger model.

5 Don't forget to factor in electricity costs

Not to mention hardware costs

While a self-hosted LLM may seem like a cost-effective way to get good, local inference, many people forget about the associated electricity bills and other costs that can rack up when deploying a locally-hosted LLM. The RTX 4090 has a 450W TDP, and the average U.S. electricity cost is $0.16 per kWh. That means you could run up an energy bill of more than $50 every month if you were running it at full pelt. Obviously, most people wouldn't be, but even using it frequently throughout a day could add up quickly, and may work out to be more expensive than using the Gemini or OpenAI APIs for access to significantly more powerful models.

This gets even more out of hand if you're looking to use multiple GPUs for inference, and that's without thinking of power distribution that you'll need to account for, custom cooling, and any other hardware you'd need to pick up along the way. I've seen people say that they can save money by hosting their own model rather than paying for ChatGPT Plus or Google One's AI tier, and that is probably true on the surface, but add in the cost of the GPU and other hardware, and you might find yourself spending more in the long run.

4 You don't just need to focus on Nvidia

Intel and AMD can be great, too

While this is more of a recent development, Nvidia isn't the only player in the game these days when it comes to self-hosted LLMs. As I already mentioned, I use an AMD Radeon RX 7900 XTX for my self-hosted models, and I've also tested out the Intel Arc A770 with its 16GB of VRAM. AMD enjoys official support in tools like Ollama, and while it takes a bit more work, you can use an Intel GPU as well through the IPEX LLM fork of Ollama.

While Nvidia undoubtedly rules the roost when it comes to pure tokens-per-second generation, the reality is that Nvidia's GPUs are so sought after that you may not be able to pick up a high-end Nvidia card for your system. An A770 will yield decent performance when it comes to language models, and I've been more than happy with my 7900 XTX. Even with the Gemma 27B model that I'm running, I still see token generation of more than 30 tokens per second. Plus, it has 24GB of VRAM, eclipsed only by the RTX 5090 and matching the RTX 4090 while costing a lot less in general.

Nvidia is certainly preferable, but if an Nvidia card is out of the question, take a look at AMD and Intel, and research performance for the kinds of models you want to run and see if any of their cards fit your needs. You may be surprised.

3 Prompt engineering and tool usage are great ways to get more out of a small model

Don't just brute-force with more parameters

If you're running a smaller model and want better performance out of it, don't just go switching models in the hopes that a few extra billion parameters will solve all of your problems. Instead, a few tips, and the first is to rethink your prompts. A concise, direct, and comprehensive prompt will yield better results than a vague, open-ended one. Just because you're used to Gemini, ChatGPT, or Claude, which can do well with vague prompts, doesn't mean you can approach a significantly smaller model running on your computer or home server in the same way. If you're direct and to-the-point, your models will likely perform significantly better, so rethink your prompts if the answers you're getting aren't good enough.

The next tip is to make use of Retrieval Augmented Generation, otherwise known as RAG. What this does is provide your model a dataset that it can base its answers on, leading to better accuracy in responses without needing to load up the model's entire context length with every potentially relevant piece of information. As it pulls from actual data that exists on-device, it also reduces the tendency for hallucinations, which can be an issue on smaller models, too. It's obviously not going to solve every issue (nor will it give a 7B parameter model all of the capabilities of a 70B parameter model), but it can massively improve performance if your main goal of using an LLM is to query data. Nvidia's Chat with RTX was a great demo of how RAG could speed up local inference.

The final tip is to make use of tools. Tools in the context of an LLM are software utilities designed to be operated by the model and can be called when necessary. For example, the JSON Toolkit provides an LLM the ability to iteratively explore a JSON response without wasting valuable context with data that's useless to the actual query it was given. The same goes for the Pandas Dataframe tool, as rather than loading the whole dataframe into context, the LLM can utilize the Pandas Dataframe tool to run Python code that will find the answer, skipping the need to look at all of it. There are so many different kinds of tools you can use, and in many cases, it may not be necessary to brute-force "intelligence" by simply using a larger model.

2 Mixture of Experts models allow for larger models in lower VRAM constraints

Though with some work, first

Mixture-of-Experts (MoE) language models are somewhat new, but the concept in AI is decades old and has been used in deep learning contexts for research and computation. These models essentially partition a network of "experts" with a lightweight gate that decides which experts handle which tasks. This doesn't mean that its memory footprint is less than that of another model with the same quantization and parameter count. However, what it does mean is that you can mess around with the loading of the model so that the lesser-accessed tensors are offloaded to system RAM, leaving room in our GPU's VRAM for tensors we do want to access frequently.

This is a significantly more advanced topic, and not really a tip that newcomers should try to take heed of right away. However, it's good to know that if you find yourself limited by VRAM, there are solutions that can lessen the impact on performance when it comes to MoE models. There are a lot of different MoE models out there, too, and once you've gotten comfortable with your tool of choice, you can start to explore this option a bit more to get even larger models running on your machine in a way that makes the most out of your VRAM and offloads the least-accessed parts of the model to the system RAM.

1 Keep it simple to start with

LM Studio is a great place to start

Rather than going all out on setting up tools like Ollama and Open Web UI, which can be daunting for a first-time self-hoster, use a graphical front-end like LM Studio to get you started. It's super simple; use its built-in search to find a model, download it, and run it. You don't need to do any configuration, and it's a "plug-and-play" equivalent when it comes to LLMs. It comes with all of the libraries you need to make the most of your hardware, and works on Windows, Linux, and macOS, so it takes all of the pain out of figuring out what exactly you need to make an LLM run on your system.

Even better, for development, LM Studio can host an OpenAI-compatible server in the background, so you can use it for testing your own applications or tools that understand the OpenAI API once you point them to your locally-hosted endpoint instead. It's easy, it's free, and it's a great way to get started and get a feel for what it's like to host your own LLM before going all out on deploying one elsewhere. All of the main settings you could want to modify, from system prompts to context lengths, are modifiable too, so it's a great way to get started.

URL: https://www.xda-developers.com/things-wish-knew-started-self-host-llms/

⇱ 7 things I wish I knew when I started self-hosting LLMs