Summary
- Let local LLMs condense and structure messy ideas before sending to Claude to save tokens and time.
- Keep a manual copy-and-paste checkpoint to refine specs; it prevents overcomplicating and wasted tokens.
- A hybrid workflow boosts code quality and productivity: local LLMs for prep, Claude for heavy lifting.
I've become a little obsessed with weekend coding projects ever since I started using Claude Code. It might be a simple Python utility, or a React app that simplifies or streamlines something I can do without ever using AI, but the joy of creating something or solving a problem with just vibe-coding remains unparalleled. Despite how capable Claude Code can be, and how easy it is to lean on for every step of the development process, the problem arises when it starts getting mighty expensive in terms of tokens, even with a paid subscription.
After burning through my Opus usage more times than I'd like to admit, I finally stopped asking how to get more out of my tokens, because the usual methods are done to death. In fact, I came upon a realization that I knew all along, but didn't quite pay attention to: I didn't need Claude Code for everything. Despite hosting local AI models through Ollama and llama.cpp, I do find it hard to get off the Claude Train, but that's exactly where I realized my local LLM could help me out significantly. Soon enough, I built a hybridized workflow that feels faster, cleaner, and a lot more productive than relying on just a single AI model for every task.
I had to stop treating Claude Code as my first stop
Throwing every idea at Opus wasn't doing me any favors
When I first subscribed to Claude Code, I did what I suspect most people do: use it for absolutely everything. I'd ask it to brainstorm ideas, organize my thoughts, figure out project architecture, write code, and then refactor the whole thing a dozen prompts later, because, of course, there's just one more feature I'd really like added into this final executable. Over the course of building an entire wedding planner, a React-based trip planner, and even a Python-based text-based RPG, Claude has definitely earned its top place in my priority list of destinations when I need something done.
For the most part, that approach has worked brilliantly... until it didn't. Even on a paid plan, it never really takes long for Opus to chew through my credits, and I'm left staring at the screen wondering where it had all gone. Of course, I say it's Opus, but it's really me who wastes tokens with each rambling prompt and half-baked idea. Mostly, it's the revisions that add more context for Claude to process with each new prompt.
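To see why revisions add up, here's a rough back-of-the-envelope sketch. It assumes the whole conversation so far is re-read on every turn, ignores any caching and the model's own replies, and uses made-up token counts purely for illustration.

```python
def cumulative_input_tokens(turn_sizes):
    """Total tokens read when every turn re-reads all earlier turns."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the conversation grows by this prompt
        total += context  # the whole conversation so far is read again
    return total


# Hypothetical numbers, not measurements.
rambling = [400] * 6  # six revision prompts of roughly 400 tokens each
condensed = [600]     # one tidy spec of roughly 600 tokens

print(cumulative_input_tokens(rambling))   # 8400
print(cumulative_input_tokens(condensed))  # 600
```

Under those assumptions, six small follow-ups cost far more than one slightly longer prompt, because each follow-up drags the earlier ones along with it.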
It wasn't long before it dawned on me that I was asking one of the most capable coding assistants in the world to perform work that didn't even really require frontier-level intelligence. Planning, organizing, and translating messy thoughts into a proper structure is important, but that doesn't need to be Claude Code's job in the first place.
My local LLM became Claude Code's assistant
The cheapest AI in my setup shot up to being one of the most useful
Like almost everyone experimenting with local AI, I too have a few of the most popular models downloaded: Gemma, Qwen, Mistral, and the like. They absolutely do have their benefits, but the thought of replacing Claude Code with any of them, even the more coding-focused models, isn't something I've ever entertained. Instead, I learned that I needed to let the local model do a ton of administrative work before ever letting Claude enter the picture.
Nowadays, if I have six messy paragraphs describing one app idea, they no longer go straight to Claude. Instead, I word-vomit right into my local model, and then ask it to condense the entire thing into structured requirements, remove any redundancies, tell me if there are crucial details or highlights that I'm forgetting to mention, and then ask me questions about the project so I can cover all my bases. It's only after this entire process is over (with no token limits thanks to the model being self-hosted) that I turn it all into something much closer to a streamlined, point-to-point, no-nonsense product specification instead of a stream of consciousness typed by a rambler at midnight.
It's one extra step in the workflow, sure, but this change has made a remarkable difference. Once my local model, Gemma 4:e4b, is finished making everything concise and removing any repetitions, I then send it to Claude. As such, Claude spends a lot less time, if any, on interpreting what I mean, and more time and tokens on actually building what I want. Gemma can't compete with Opus's intelligence, but it sure can act as an assistant preparing the meeting room before the bigwigs walk in. As odd as it sounds, using two AI models this way feels significantly more efficient than relying on one for every stage of the process.
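For anyone who wants to try the condensing step, here's a minimal sketch. It assumes Ollama is serving a model locally on its default port and uses its `/api/generate` endpoint. The instruction wording, the function names, and the model tag you pass in are all illustrative placeholders, not a fixed recipe.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Illustrative instructions mirroring the four asks described above.
INSTRUCTIONS = """You are helping me prepare a project brief for a coding assistant.
From the rough notes below:
1. Condense them into structured requirements.
2. Remove redundancies.
3. Point out crucial details I forgot to mention.
4. Ask me questions that would fill the remaining gaps.

Notes:
"""


def build_payload(notes: str, model: str) -> dict:
    """Build the JSON body for one non-streaming generate request."""
    return {
        "model": model,
        "prompt": INSTRUCTIONS + notes.strip(),
        "stream": False,
    }


def condense(notes: str, model: str, url: str = OLLAMA_URL) -> str:
    """Send the notes to the local model and return its text reply."""
    body = json.dumps(build_payload(notes, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `condense(notes, model=...)` with whatever tag your local install lists for the model returns the cleaned-up brief as plain text. From there, it can be read, trimmed, and pasted into Claude by hand.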
There's a reason I'm not directly integrating Gemma into Claude Code
I need the final bit of friction for a quality result
One thing I haven't automated, and probably never will, is the handoff between Gemma 4 and Claude Code. It would be fairly easy to script the entire process so that Gemma's output lands directly inside Claude without me lifting a finger, but I genuinely think that would make the workflow worse. That manual copy-and-paste acts like a checkpoint where I stop being the person with a dozen ideas bouncing around in my head and become the person reviewing an actual product specification.
More often than I'd like to admit, that's the moment when I realize that a feature or characteristic isn't nearly as clever as it sounded in my head. I often end up overcomplicating things that were supposed to be nothing more than fun weekend projects, too. As such, I'd much rather spend thirty seconds refining the plan than have Claude faithfully write hundreds of lines of code for an idea I wasn't completely sold on from the get-go. This little bit of intentional friction has saved me far more time (and tokens) than automating the entire pipeline ever could.
Better prompts ended up producing better code
Saving tokens is only half the story here
Now, I did adopt this dual-model workflow because I wanted to stretch my Claude Code usage further, and it definitely worked. However, the biggest improvement here has had little to do with token counts. In fact, the quality of the conversations I have with Claude has improved remarkably, because Claude has now stopped asking as many clarification questions. After all, most of those answers now already sit inside the prompt, all before a single line of code is generated.
Ever since, nearly every personal project I've worked on has skyrocketed in quality. My wedding planner now has cleaner feature lists before I even get into revisions, and even my text-based RPG has evolved from a rough proof-of-concept into a proper executable with clearly defined gameplay systems.
This entire process now feels less like prompting an AI and more like handing detailed requirements to an experienced developer. When you think about it, it's hardly surprising, considering how software projects have always benefited from better documentation and clearer specifications. AI-assisted development doesn't seem to be all that different. We've just replaced lengthy meetings and design documents with well-structured prompts that serve exactly the same purpose.
The best AI workflow rarely ever has one perfect model
Instead, give multiple models a job they can excel at
The more I use AI for development, the less interested I become in benchmark charts that pit one model against another. Sure, those comparisons have their place, but my experience over the past few months has changed the way I approach both local and cloud-based models. My local model is fast, private, always available, and, above all, free of cost. It's excellent at summarizing ideas, restructuring prompts, breaking projects into manageable milestones, and generally cleaning up the messy parts of my thinking. Claude Code, meanwhile, shines where it matters most: reasoning through complex problems, writing code that is production-ready from the first attempt, and making architectural decisions that would challenge much smaller models. Rather than replacing one with the other, I've begun treating them like teammates that are tasked with very different responsibilities.
Ironically, adding another AI model to my workflow has made the entire experience feel simpler instead of more complicated. In fact, I even add Qwen3-Coder sometimes, letting Gemma prepare prompts for Qwen, and only bringing in Claude for quality assurance, as it were. Claude Code is still doing the heavy lifting, but every token it spends now feels a lot more deliberate since it isn't wasting any more time on deciphering my rambling thoughts or extracting requirements from half-formed ideas. The ideation phase is no longer Claude's problem, and instead, it's Gemma's.
Why I'm sticking with this hybrid AI workflow
This new workflow takes away my constant worries about token budgets
I've stopped thinking about AI models as products competing for my attention, because now, they're specialized tools that complement one another. Gemma helps me think more clearly, while Claude helps me build more effectively. This new workflow has made coding more of a creative game where I can do a lot more, since I'm thankfully not worrying about token budgets the entire time.
Local models are also improving at a frantic pace, and before long, this kind of hybrid workflow will absolutely become the norm instead of the exception.
