I've been using Claude Code for all kinds of things lately, but I'm always worried about my token usage, even on the Max plan. That's especially true with Opus 4.7, where my allocation seems to get burned through on simple tasks it used to handle quickly. I've also been moving into hosting LLMs locally after getting a couple of DGX Spark units from Asus to play with, which lets me run much larger, and hence more capable, models.
And while Nvidia's cloud has plenty of free-to-use models, it still requires a constant internet connection. The problem I've had with local LLMs is twofold: the ones small enough to run on consumer hardware were barely better than autocomplete, and the hardware needed to run the larger models cost as much as a second-hand car.
That's changing (somewhat) with hardware like the Nvidia DGX Spark. Sure, it's still expensive, but compared to Claude Max plans or API usage, it looks like an investment that will start saving money after just over a year. And as a business expense, it's a line item. A local LLM isn't perfect and isn't a full replacement for Claude or other cloud LLMs, but local models are getting closer, and having one handle the tasks it's good at saves my cloud tokens for the ones it can't handle.
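The payback claim is simple arithmetic, and it can be sketched as below. The function name and both example figures are illustrative placeholders rather than numbers from this article, so swap in your own hardware price and whatever monthly cloud spend the box actually displaces.

```python
import math


def months_to_break_even(hardware_cost, monthly_cloud_cost, monthly_running_cost=0.0):
    """Whole months until the hardware has paid for itself.

    All three inputs are assumptions supplied by the caller.
    Returns None if the box never pays for itself.
    """
    monthly_saving = monthly_cloud_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Placeholder figures only: a 3,000 box displacing 200 a month of cloud spend.
example_months = months_to_break_even(3000, 200)
print(example_months)
```

Because a local model only replaces part of the cloud workload, the honest input for `monthly_cloud_cost` is the portion of spend that really goes away, which pushes the break-even point further out.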
Why use a local LLM with Claude Code?
Burning through cloud tokens only gets you so far
Okay, I could use Pi or OpenCode and still switch between the local LLM and Claude, but Anthropic has already restricted use of my Claude subscription in third-party tools, and I'd rather not pay API costs for things I'm just creating for myself. If work were paying, I'd have no worries about going API-first, but that's not the case. Doing the same tasks through the API would be several times, if not an order of magnitude, more expensive than my Max plan.
And that's also why the local LLM helps. Claude Code already has a switch between Opus, Sonnet, and Haiku, and setting up an Anthropic API-capable local LLM slots in as a fourth. It's more complicated to switch between multiple local LLMs, but honestly, the only one right now that's capable enough for coding is Qwen3-Coder-Next, and until that changes, I don't need another running.
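As a rough sketch of what that fourth slot can look like, the helper below builds an environment that points Claude Code at a local server through its `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` environment variables. The helper name, the `http://localhost:8000` address, the `qwen3-coder-next` model label, and the dummy token are all assumptions for illustration. It also assumes the local server already speaks the Anthropic API format.

```python
import os


def local_claude_env(base_url, model, token="local-placeholder", base_env=None):
    """Return a copy of the environment with Claude Code pointed at a local endpoint.

    base_url, model, and token are assumptions about your own setup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url   # where the local server listens
    env["ANTHROPIC_AUTH_TOKEN"] = token    # dummy value; adjust if your server checks it
    env["ANTHROPIC_MODEL"] = model         # name the local server exposes
    return env


# Hypothetical address and model name; nothing is launched here.
env = local_claude_env("http://localhost:8000", "qwen3-coder-next")
print(env["ANTHROPIC_BASE_URL"], env["ANTHROPIC_MODEL"])
```

Passing the result as `env=` to `subprocess.run(["claude"], env=env)` would start a session against the local model while leaving the normal shell environment, and the cloud subscription, untouched.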
That fourth slot gets used by Claude Code when I'm pulling apart existing code for malware and other analysis, or when I'm doing sanity checks on the code that Opus plans and executes. I've found it struggles with building things from scratch, but if it has existing code, the local model is great at figuring out what is going on inside the program.
FP8 precision for a very reasonable cost
Asus happened to send me two GX10s on an extended loan because I wanted to play with connecting them as a cluster. Getting models to run across the cluster has so far eluded me, but I'm getting closer by pulling together little nuggets of knowledge from GitHub comments and forum posts.
That means I've been running Qwen3-Coder-Next FP8 on a single box, which uses about 88 GiB of GPU memory in active use. That still leaves plenty of memory for context and for the OS and CPU. It's stable with a 32K context, and can be pushed a little further to 40K for some tasks.
The unified RAM on DGX Spark units makes this possible
Larger models need extraordinary amounts of VRAM
Nvidia's GB10 Grace Blackwell Superchip is a mini supercomputer in a Mac mini-sized box. With 128GB of LPDDR5x memory shared between the Arm CPU and the Blackwell GPU, these boxes can run models of around 80 billion parameters without taking the RAM that the system needs to keep running.
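To see why the numbers fit, here is a back-of-the-envelope memory budget. It assumes FP8 weights take one byte per parameter, uses the rough 80-billion-parameter figure, and treats the advertised 128GB as decimal gigabytes. Those are simplifying assumptions, and the variable names are only for illustration.

```python
GIB = 1024 ** 3   # bytes in a gibibyte
GB = 1000 ** 3    # bytes in a decimal gigabyte


def weight_footprint_gib(params, bytes_per_param):
    """Approximate size of the model weights alone, in GiB."""
    return params * bytes_per_param / GIB


total_gib = 128 * GB / GIB                      # unified memory, assuming decimal GB
weights_gib = weight_footprint_gib(80e9, 1.0)   # FP8: roughly one byte per parameter
observed_gib = 88.0                             # in-use figure reported in this article
headroom_gib = total_gib - observed_gib         # left for the OS and everything else

print(f"total {total_gib:.1f} GiB, weights {weights_gib:.1f} GiB, "
      f"headroom {headroom_gib:.1f} GiB")
```

The gap between the weights-only estimate and the observed 88 GiB is roughly what goes to context (the KV cache) and runtime overhead, which is why stretching the context window eats into the remaining headroom.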
Honestly, the only thing that's been holding me back somewhat on these specific Asus GX10s is that they have only 1TB of SSD storage. That makes larger training projects like Nanochat difficult, as its training data is over 800 GB. But for pretrained LLMs? It's more than enough, with plenty of space to keep several models available.
It's important to know what the limitations are
Currently, no local LLM can rival the reasoning of the large cloud models. I'm not sure local models will ever catch up with the cloud frontier, but given how fast the field is moving, it won't be long before they reach the level of today's cloud models. Not for general tasks, probably, but for specialized coding tasks? Sure, especially if models start getting trained for specific programming languages.
Qwen3-Coder-Next is a non-reasoning model, so it behaves very differently from the models Claude Code uses by default. There's no digital pausing before it gives me a direct answer, which strips away some of the anthropomorphic feel that cloud LLMs have taken on. To some degree, that makes it feel about as fast as Opus in Claude Code, since the cloud model pauses to 'think' before answering.
Being able to switch to a local LLM for tasks saves my subscription for the hard stuff
This hybrid approach, using a hyperscaler cloud model for deep thinking, planning, and executing, and a capable local LLM for diving into existing code, is the best of both worlds. I have a local LLM that's capable of real work and runs on hardware that doesn't cost a second mortgage. And that keeps my cloud tokens for the thinking and planning tasks that larger models excel at.
