I've been using Claude Code for all kinds of things lately, but I'm always worried about my token usage, even on the Max plan. That's especially true with Opus 4.7, where my allocation seems to get burned through on simple tasks it used to handle quickly. I've also been moving into hosting LLMs locally after getting a couple of DGX Spark units from Asus to play with, which lets me run much larger, and hence more capable, models.

And while Nvidia's cloud has plenty of free-to-use models, it still requires a constant internet connection. The problem I've had with local LLMs is twofold: the ones small enough to run on consumer hardware were barely better than autocomplete, and the hardware needed to run the larger models cost as much as a second-hand car.

That's changing (somewhat) with hardware like the Nvidia DGX Spark. Sure, these boxes are still expensive, but compared to Claude Max plans or API usage, one looks like an investment that will start saving money after just over a year. And as a business expense, it's a line item. A local LLM isn't perfect and isn't a full replacement for Claude or other cloud LLMs, but local models are getting closer, and having one handle the tasks it can saves my cloud tokens for the ones it can't.
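
The payback claim is simple arithmetic, and a rough sketch makes it easy to plug in your own numbers. The hardware cost below uses the roughly $3,500 price listed for the GX10 further down; the $250 monthly savings figure is a made-up placeholder, not a number from this article, and the function name is hypothetical. Electricity and resale value are ignored.

```python
import math


def break_even_months(hardware_cost: float, monthly_cloud_savings: float) -> int:
    """Whole months until cumulative cloud savings cover the hardware cost."""
    if monthly_cloud_savings <= 0:
        raise ValueError("monthly_cloud_savings must be positive")
    return math.ceil(hardware_cost / monthly_cloud_savings)


# ASSUMPTION: $250/month of avoided cloud spend is a placeholder value.
months = break_even_months(hardware_cost=3500, monthly_cloud_savings=250)
print(f"Break-even after {months} months")
```

With those placeholder inputs the box pays for itself in 14 months, which lines up with "just over a year". Lower monthly savings push the break-even point out accordingly.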

Why use a local LLM with Claude Code?

Burning through cloud tokens only gets you so far

Okay, I could use Pi or OpenCode and still switch between the local LLM and Claude, but Anthropic has already restricted use of my Claude subscription in third-party tools, and I'd rather not pay API costs for things I'm just creating for myself. If work were paying, I'd have no worries about going API-first, but that's not the case. Doing the same tasks over the API would be several times, if not an order of magnitude, more expensive than my Max plan.

And that's also why the local LLM helps. Claude Code already has a switch between Opus, Sonnet, and Haiku, and a local LLM served behind an Anthropic-compatible API slots in as a fourth option. It's more complicated to switch between multiple local LLMs, but honestly, the only one right now that's capable enough for coding is Qwen3-Coder-Next, and until that changes, I don't need another running.
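
Here is a minimal sketch of what that fourth slot could look like, under stated assumptions rather than as the exact setup used here. It assumes a local server exposing an Anthropic-compatible endpoint at `http://localhost:8000` and a served model name of `qwen3-coder-next`; both are hypothetical placeholders, as are the helper function names. `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` are environment variables that Claude Code reads, and `claude` is its CLI command.

```python
import os
import subprocess


def local_claude_env(base_url: str, model: str, token: str = "local-placeholder") -> dict:
    """Copy the current environment and point Claude Code at a local endpoint."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base_url   # where requests are sent
    env["ANTHROPIC_AUTH_TOKEN"] = token    # many local servers ignore this value
    env["ANTHROPIC_MODEL"] = model         # model name the local server advertises
    return env


def launch_claude(env: dict) -> int:
    """Start an interactive Claude Code session with the given environment."""
    return subprocess.run(["claude"], env=env, check=False).returncode


# ASSUMPTION: the URL and model name are placeholders for your own server.
env = local_claude_env("http://localhost:8000", "qwen3-coder-next")
print(env["ANTHROPIC_BASE_URL"], env["ANTHROPIC_MODEL"])
```

Calling `launch_claude(env)` would then open a session backed by the local model, while a plain `claude` in another terminal keeps using the cloud subscription. Keeping the override in a per-process environment, instead of a global setting, is what makes switching back and forth cheap.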

That fourth slot gets used by Claude Code when I'm pulling apart existing code for malware and other analysis, or when I'm doing sanity checks on the code that Opus plans and executes. I've found it struggles with building things from scratch, but if it has existing code, the local model is great at figuring out what is going on inside the program.

FP8 precision for a very reasonable cost

Asus happened to send me two GX10s on an extended loan because I wanted to play with connecting them as a cluster. Getting models to run across the cluster has so far eluded me, but I'm getting closer by pulling together little nuggets of knowledge from GitHub comments and forum posts.

That means I've been running Qwen3-Coder-Next FP8 on a single box, where it uses about 88 GiB of GPU memory in active use. That still leaves enough memory for context, the OS, and the CPU. It's stable with a 32K context, and can be pushed a little further to 40K for some tasks.

Asus Ascent GX10: $3,500 at Amazon, $3,500 at Asus, $3,561 at Newegg

The unified RAM on DGX Spark units makes this possible

Larger models need extraordinary amounts of VRAM

Nvidia's GB10 Grace Blackwell Superchip puts a mini supercomputer in a Mac mini-sized box. With 128GB of LPDDR5x memory shared between the Arm CPU and the Blackwell GPU, these units can keep a model of around 80 billion parameters loaded without taking the RAM that the system needs to keep running.
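
A back-of-the-envelope sketch shows why 8-bit weights are what make an 80-billion-parameter model fit. It assumes one byte per parameter for FP8 and two for FP16, and treats everything above the raw weights in the roughly 88 GiB observed in use as runtime overhead (context cache, activations, framework buffers). The function name is hypothetical, and real footprints vary by runtime.

```python
GIB = 2**30  # bytes per gibibyte


def weights_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw model weights in GiB."""
    return num_params * bytes_per_param / GIB


fp8 = weights_gib(80e9, 1)    # ASSUMPTION: 1 byte per parameter at FP8
fp16 = weights_gib(80e9, 2)   # ASSUMPTION: 2 bytes per parameter at FP16
overhead = 88 - fp8           # 88 GiB is the in-use figure quoted in this article

print(f"FP8 weights:  {fp8:.1f} GiB")
print(f"FP16 weights: {fp16:.1f} GiB")
print(f"Overhead beyond weights at 88 GiB in use: {overhead:.1f} GiB")
```

The FP8 weights come to about 74.5 GiB, while the same model at FP16 would need about 149 GiB for weights alone, more than the box has. That leaves around 13.5 GiB of the observed usage for everything that isn't weights.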

Honestly, the only thing holding me back somewhat on these specific Asus GX10s is that they have only 1TB of SSD storage. That makes larger training projects like Nanochat difficult, as its training data is over 800 GB. But for pretrained LLMs? It's more than enough, with plenty of space to keep several models available.

It's important to know what the limitations are

Currently, no local LLM can rival the reasoning of the large cloud models. I'm not sure the gap with the frontier will ever close, but given how fast the field is moving, it won't be long before local LLMs reach the level of today's cloud models. Probably not for general tasks, but for specialized coding tasks? Sure, especially if models start getting trained for specific programming languages.

Qwen3-Coder-Next is a non-reasoning model, so it behaves very differently from the models Claude Code uses by default. There's no digital pause before it gives me a direct answer, which reduces the anthropomorphic feel that cloud LLMs have taken on. To some degree, that makes it feel about as fast as Opus in Claude Code, since the cloud model pauses to 'think' before answering.

Being able to switch to a local LLM for tasks saves my subscription for the hard stuff

This hybrid setup, with a hyperscaler cloud model for deep thinking, planning, and executing, and a capable local LLM for diving into existing code, is the best of both worlds. I have a local LLM that's capable of real work and runs on hardware that doesn't cost a second mortgage. And that saves my cloud tokens for the thinking and planning tasks that larger models excel at.

URL: https://www.xda-developers.com/claude-code-with-a-local-llm-running-offline-is-the-hybrid-setup-i-didnt-know-i-needed/

Claude Code with a local LLM running offline is the hybrid setup I didn't know I needed
