
As someone who stayed away from anything AI-related for a long time, I only realized how productive local models can be once I started hosting my own large language models. Whether they’re helping me troubleshoot after a botched project takes my home lab offline, extracting precise text snippets from abysmally long documents, or organizing my bookmarks, local LLMs are now a staple of my FOSS arsenal.

Plus, the more I use LLMs, the more I’ve grown to appreciate the sheer number of models at my disposal. Instead of locking me into a proprietary LLM ecosystem or forcing me to choose between similar models trained with the same algorithms, my self-hosted Ollama, LM Studio, and (most importantly) llama.cpp setups let me swap LLMs depending on the workload. And I don’t just mean different parameter counts, mind you. I’m talking about entirely different multimodal capabilities, data inputs, and machine-learning algorithms – and it’s by far the most useful aspect of local models in 2026.

Not all LLMs are created equal

The “best LLM” depends entirely on your use case

If you lurk in AI-centric forums, you’ve probably come across posts highlighting specific LLMs as the next best thing since sliced bread. And well, they aren’t wrong. Despite their lower computing prowess compared to their cloud counterparts, local LLMs are versatile enough to process queries over a range of topics, even more so once you start looking at high-parameter models.

But once you start using them extensively, you might find certain models producing unsatisfactory results in some situations. What makes this puzzling is that the same LLM can perform extraordinarily well at other tasks. A lot of these inconsistencies can be attributed to the algorithms, training data, quantization levels, and tuning methods that shaped these LLMs.

For example, Qwen 2.5 Coder (the higher-parameter variants) and GPT-OSS are considered the crème de la crème for programming workloads, while the DeepSeek lineup is typically the preferred option for workloads requiring a lot of reasoning on the clanker’s side. This, in turn, massively influences the model’s utility in typical productivity tasks. In my case, I’ve had the best luck using Llama 3.1 for processing long notes and querying them for research in Open Notebook, while Qwen 2.5 Coder works really well for autocomplete suggestions in VS Code. Meanwhile, Qwen 3.5 and DeepSeek are better at calling tools in external apps (namely, my NAS server, Nextcloud instance, and Home Assistant hub) via MCP servers.


That’s before you include the parameter size and multimodal capabilities

Although the training data matters quite a bit, the parameter size is another key factor that determines an LLM’s computational firepower. For example, when I used conversational language to invoke tools via 1B models, they’d often hallucinate or throw errors. Switching to 9B and higher models got rid of this problem, as they have far more parameters with which to capture deeper patterns, though they come with a higher performance tax, especially on old consumer hardware such as mine.

Heck, even with the same parameter size, two LLMs can differ in the type of data they can process. Multimodal LLMs can integrate text, videos, images, and even something as quirky as sensor feeds, while simpler models can fail to comprehend anything past text files. The local AI ecosystem even has some ultra-efficient embedding models whose sole purpose is to map documents into a vector space, where similar content sits close together, thereby making it easier to retrieve information.
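To make the "vector space" idea a little more concrete, here is a minimal sketch of retrieval by cosine similarity. The three-dimensional vectors and document names are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions, and nothing here reflects a particular model's output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )


# Hypothetical embeddings (assumption): hand-written stand-ins for model output.
DOC_VECS = {
    "nas-backup-notes": [0.9, 0.1, 0.0],
    "home-assistant-automations": [0.1, 0.9, 0.2],
    "pasta-recipes": [0.0, 0.2, 0.9],
}

# Stand-in for an embedded question about backups (also hypothetical).
QUERY_VEC = [0.8, 0.2, 0.1]

ranking = rank_documents(QUERY_VEC, DOC_VECS)
print(ranking)  # the backup notes rank first for this query
```

The embedding model only has to place related text close together; the retrieval step itself is plain arithmetic, which is part of why such small models are enough for tagging and search.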

Switching between LLMs lets you harness their unique capabilities

And you can optimize your LLM workloads to match your GPU specs

If I were to stick to a specific model family, I’d be significantly bottlenecking the potential of my local LLM servers. So, I tend to use a bunch of LLMs in my workflow, and cycle between them depending on my needs.

Take my Paperless-GPT text extraction pipeline, for example. Since minicpm-v is a vision-capable model, it’s highly accurate at recognizing text in images and PDF documents. Attempting to use Llama 3.1 for text extraction would be futile because it’s a text-only model, even though it’s significantly better for general-purpose conversational tasks.

Likewise, I rely on my Qwen 3.5 (9B) model when instructing my MCP-powered tools, as I don’t want them to hallucinate while I’m trying to configure a quick automation workflow. For my coding workloads (and no, I don’t use clankers to generate apps), I tend to go with even higher-parameter models. However, the 20B models are too large to run on anything but my RTX 3080 Ti (and even then, they require extra tweaks). So, I rely on the weaker 3B LLMs and embedding models hosted on my GTX 1080 for simpler bookmark, document, and note-tagging tasks instead.
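A rough back-of-the-envelope check shows why model size and GPU have to be matched. The sketch below estimates only the size of the weights (parameters times bits per weight) and adds an assumed fixed overhead for the context cache and runtime buffers. The 1.5 GB overhead and the 4-bit quantization are illustrative assumptions, not measurements from this setup; the VRAM figures are the stock 12 GB of the RTX 3080 Ti and 8 GB of the GTX 1080.

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, vram_gb, overhead_gb=1.5):
    """True if weights plus an assumed fixed overhead fit in the given VRAM.

    overhead_gb is a guess standing in for the KV cache and runtime buffers;
    real usage grows with context length.
    """
    needed = weight_footprint_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


RTX_3080_TI_GB = 12
GTX_1080_GB = 8

for size in (3, 9, 20):
    gb = weight_footprint_gb(size, 4)
    print(
        f"{size}B @ 4-bit: ~{gb:.1f} GB weights | "
        f"3080 Ti: {fits(size, 4, RTX_3080_TI_GB)} | "
        f"GTX 1080: {fits(size, 4, GTX_1080_GB)}"
    )
```

Under these assumptions, a 20B model at 4-bit squeezes onto the 12 GB card with little room to spare and does not fit on the 8 GB one, while 3B models leave plenty of headroom on the older GPU.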

Really, the secret to a productive local LLM setup isn’t staying loyal to a specific family; it’s swapping between a bunch of models depending on the task and the capabilities of the underlying hardware.
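In practice, "swapping models per task" can be as simple as a lookup table in front of the server. The sketch below assumes an Ollama server on its default port and uses its `/api/generate` endpoint. The task names, the `ROUTES` table, and the model tags (especially `qwen3.5:9b`) are hypothetical placeholders, so substitute whatever tags your own server actually lists.

```python
import json
import urllib.request

# Hypothetical task-to-model table (assumption): the tags are placeholders.
ROUTES = {
    "ocr": "minicpm-v",
    "notes": "llama3.1",
    "autocomplete": "qwen2.5-coder",
    "tools": "qwen3.5:9b",
}
DEFAULT_MODEL = "llama3.1"


def pick_model(task):
    """Return the model tag for a task, falling back to a general-purpose one."""
    return ROUTES.get(task, DEFAULT_MODEL)


def build_request(task, prompt):
    """Build the JSON body for a non-streaming text generation request."""
    return {"model": pick_model(task), "prompt": prompt, "stream": False}


def send(task, prompt, base_url="http://localhost:11434"):
    """POST the request to an Ollama server and return the generated text.

    Not called here; it needs a running server with the chosen model pulled.
    """
    body = json.dumps(build_request(task, prompt)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


print(build_request("tools", "Turn off the office lights at 11 pm."))
```

This sketch only handles text prompts; image inputs for a vision model would need extra fields in the request. Keeping the table separate from the calling code means a new model can be trialled for one task without touching the others.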


URL: https://www.xda-developers.com/local-llms-work-best-when-youre-not-loyal-to-just-one/