One of the things that I enjoy the most about running local AI models is the sense of ownership that comes with them. They're private, efficient, and becoming increasingly capable iteration after iteration while hardware requirements continue to become more modest over time. But after spending a few months with Gemma 4, I ran into a problem that needed to be addressed if I wanted to rely on it at all. I first noticed it about a week ago when I asked it to summarize some of the announcements coming out of Computex 2026, and the response, while convincing, completely hallucinated most of the details, to the point that some of the products it described simply didn't exist.

Now, to be fair, the onus is entirely on the user to understand the limitations of the model. Since local models don't always have access to live information and, ergo, no reliable way of admitting they don't know something, they tend to answer incorrectly, but with confidence regardless. That's a dangerous problem to have, and here's how I decided to fix it.

Why does Gemma 4 hallucinate?

Most local models have this problem

The Computex summarization wasn't a one-off occurrence. Once I realized that Gemma had confidently invented details and product announcements surrounding the event, I decided to experiment further to see how far the "hallucinations" go. To my surprise (and a little bit of horror), I found that it can fabricate driver versions, product specs, pricing trends, and anything that relies on an updated corpus of information with finesse. The answers were authoritative, fluent, and completely detached from reality as you'd expect. It was clear that whenever the model reached the edge of its understanding, it would simply continue the conversation rather than acknowledge that it didn't know something.

When you delve into it, you'll realize why it happens. Since the model was running locally via Ollama, Gemma had no retrieval layer and therefore no access to current information. After all, a compact 4.5-billion-parameter model can't be expected to have a reliable mechanism for recognizing when a question falls outside its knowledge boundary. While I consciously remember not to prompt a locally running model like a cloud-based LLM service, the reason why it becomes a risk is that, in a busy workflow, it's often easy to forget the limitations of the model you're prompting or asking for information, and it becomes a challenging problem to have when the said model does not stop to think if it has access to valid information and continues to behave like a subject-matter expert.

The need for architecting a solution

Fixing this tendency wasn't straightforward

The aspect of fact-checking every single response post generation is a tedious affair, and one that anyone relying on a large volume of research done through locally run models couldn't be bothered with. So naturally, I had to think of something else.

The obvious "fix" to this problem that immediately came to mind was to detect when Gemma was struggling and escalate from there. If Gemma expressed uncertainty, it was possible to intercept that response and forward the query to a cloud model like Claude. The first implementation of this solution scanned every response for hedging language, such as phrases like "I'm not sure", "my knowledge cutoff is", or "I cannot confirm" before handing off the query and context to Claude. It was a reasonable starting point, and a rather promising attempt at a solution. I covered three major aspects of refusal, including the explicit lack of knowledge, temporal limitations, and inability to independently verify information.

UNCERTAINTY_PATTERNS = [
r"i('m| am) not sure",
r"i don't (know|have enough)",
r"i cannot (answer|help|provide|assist)",
r"i('m| am) unable to",
r"i lack (the )?(knowledge|information|context)",
r"beyond my (knowledge|training|capabilities)",
r"i (don't|do not) have access",
r"my (knowledge|training) (cutoff|only goes)",
r"i('m| am) (not|unsure) (certain|confident)",
r"i (cannot|can't) (confirm|verify|guarantee)",
r"i (would|may) be (wrong|incorrect|mistaken)",
r"knowledge cutoff",
r"training cutoff",
r"cutoff is",
r"still in the future",
r"has not (happened|occurred|taken place) yet",
r"i do not have (information|data|details)",
r"no (information|data|announcements).{0,30}(available|yet|made)",
]

The problem with this approach was the fact that Gemma does not easily hedge. It completes prompts, and it completes them confidently regardless of whether the underlying information... is real. That's why it failed immediately. The next viable solution I thought of where the signal exists was in the input and the nature of the query itself, before any model touches it. Since routing had to happen upstream, it meant classifying the nature of each question rather than auditing each answer ex post facto.

How a single Python script solved the problem

A deceptively simple solution to a complex problem

Once I realized the routing had to happen before Gemma ever saw the prompt, I had already solved half the problem. The solution was a single Python script running a Flask backend with a browser-based interface that opens in any web browser. Gemma 4 runs locally through Ollama, while Claude is accessed through Anthropic's API using an environment variable for authentication. Interestingly enough, neither model is aware that the other exists in the environment. The decision-making happens entirely between the layers.

The routing follows very simple rules. The ones classed as "stable" queries involve explanations, brainstorming, and writing tasks, and they all stay with Gemma. Coding requests, product comparisons, and any other fact-sensitive queries are automatically redirected to Claude. This includes anything that's got an element of recency, such as current pricing, recent announcements, or news. To make the process transparent, the GUI displays which model will take the query before the prompt is submitted.

Gemma just needed a co-worker, not a total replacement

After running this setup for a little while, I've come to appreciate the best of both local and cloud-based LLMs, and it has an economic value proposition as well. Most of my everyday questions run locally, which keeps the experience private, fast, and free of cost. Claude only steps in when factual accuracy is at risk. This also means that Anthropic's API costs stay low because Gemma handles the primary interface, and neither model is forced into a task that it wasn't designed to handle. While it may not be the most elegant solution, it does solve a problem that I had in my workflow, and I'd wager that this approach, if fine-tuned to other workflows, can provide better value.

Claude is an AI assistant and LLM developed by Anthropic.