Claude Code is one of the best agentic coding harnesses out there right now, and for good reason. It understands your codebase, calls tools, edits files, and can plan multi-step tasks with minimal hand-holding. The problem is that it's tied to Anthropic's cloud by default, and while that's fine for a lot of work, there are times when I'd rather not send my code to someone else's servers. Privacy aside, I also just like running things locally when I can.

Setting up Claude Code with a local model isn't hard, but it's not exactly frictionless either. You need to export a handful of environment variables, remember the right flags, and make sure your inference server is actually running before you launch. I got tired of copy-pasting the same block of exports every time I wanted to use my local setup, so I wrote a short bash script that handles all of it in a single command. It checks if the server is up, detects which model is loaded, and launches Claude Code pointed at the right endpoint. The whole thing is quite short, and it's made my workflow noticeably smoother. The link to it is at the bottom of the article, so you can use it too!

"Why not use OpenCode?" I hear you ask. It's a fair question, and OpenCode is a good tool; I understand why a lot of people prefer it, especially for local model support. But I like the Claude Code harness. The way it handles tool integration, file edits, and permissions just feels more polished to me, and I've had better success with local models like Qwen3 Coder Next with Claude Code than I've had with OpenCode. For example, the todowrite tool consistently fails in OpenCode, but it works perfectly in Claude Code. So, instead, I'd rather write a small script to make Claude Code work the way I want than switch to a different tool entirely.

Claude Code doesn't care where the model lives

If it speaks Anthropic's API, it works

Claude Code is, at its core, a client that speaks the Anthropic Messages API. It doesn't actually verify that there's a Claude model on the other end. If your inference server can respond in the right format, Claude Code will happily connect to it, use its tools, and treat whatever model is running as its brain. This is by design, and it's one of the things that makes the harness so flexible.

The reason this works so well now is that llama.cpp has native support for the Anthropic Messages API. There's no proxy needed and no translation layer. You start llama-server with a compatible model, and Claude Code can connect to it directly. Ollama supports it too, and LM Studio added an Anthropic-compatible endpoint as well. I'm using llama.cpp with Qwen3 Coder Next as it was deemed the best way to run it on the DGX Spark when I started testing it, and the ThinkStation PGX is a very similar device.

Qwen3 Coder Next on the GB10 is surprisingly capable. It was trained specifically for agentic coding workflows, so it understands tool calling, multi-step planning, and file editing in a way that most local models simply don't. Paired with the Claude Code harness, it feels like a real coding assistant, not a glorified autocomplete.

Lenovo ThinkStation PGX (9/10)
Brand: Lenovo
Storage: 1TB/4TB
CPU: Nvidia GB10
Memory: 128 GB

The Lenovo ThinkStation PGX is a mini PC powered by Nvidia's GB10 Grace Blackwell Superchip. It has 128 GB of unified memory for local AI workloads, and can be used for quantization, fine-tuning, and all things CUDA.

The script handles everything in one command

I just type "lcc"

The script I wrote lives at ~/.local/bin/lcc (lcc2 in the screenshot above, as I was updating it for this article), and it does a few things. First, it checks whether the llama-server is reachable by hitting the /health endpoint. If the server isn't running, it tells you and shows you how to start it. If it is running, it queries /v1/models to find out which model is loaded.
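The health and model checks described above might look roughly like the sketch below. This is not the actual script. The function names, the fallback host and port, and the placeholder model path are all assumptions for illustration, and the parser assumes the OpenAI-style `{"data":[{"id":"..."}]}` response shape from /v1/models.

```shell
#!/usr/bin/env bash
# LCC_HOST and LCC_PORT are the variable names used in the article;
# the fallback values here are assumptions.
LCC_HOST="${LCC_HOST:-127.0.0.1}"
LCC_PORT="${LCC_PORT:-8080}"

# Succeeds (exit 0) only if the server answers /health with a 2xx status.
server_is_up() {
  curl --silent --fail --max-time 2 --output /dev/null \
    "http://${LCC_HOST}:${LCC_PORT}/health"
}

# Reads a /v1/models JSON response on stdin and prints one model id per line.
# Uses grep/sed rather than jq so there is no extra dependency.
extract_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -e 's/^"id"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}

# Fetches the model list from the server and prints the ids.
list_models() {
  curl --silent --fail --max-time 5 \
    "http://${LCC_HOST}:${LCC_PORT}/v1/models" | extract_model_ids
}

# Printed when the server is unreachable; the model path is a placeholder.
print_start_hint() {
  echo "llama-server is not reachable at ${LCC_HOST}:${LCC_PORT}. Start it with, for example:"
  echo "  llama-server -m /path/to/model.gguf --host 0.0.0.0 --port ${LCC_PORT}"
}
```

A caller could run `server_is_up || { print_start_hint; exit 1; }` first, and only then call `list_models`.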

Here's the core of what it does when you pass a model name:

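# Point Claude Code at the local server instead of Anthropic's API
export ANTHROPIC_BASE_URL="http://${LCC_HOST}:${LCC_PORT}"
# Dummy token for the local endpoint
export ANTHROPIC_AUTH_TOKEN="local"
# Cleared so Claude Code doesn't try to authenticate with Anthropic
export ANTHROPIC_API_KEY=""
# Skip telemetry and other non-essential traffic
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

# Replace this shell with Claude Code, passing through any extra arguments
exec claude --model "$MODEL" "$@"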

That's it. It sets the base URL to your local server, provides a dummy auth token, clears the API key so Claude Code doesn't try to authenticate with Anthropic, and disables the telemetry traffic that would fail anyway without a real connection. Then it launches Claude Code with whatever model name you specified and passes through any additional arguments. It'll work with anything that supports the Anthropic API, and that includes LM Studio. If you're running it on the same machine, just change the IP to 127.0.0.1 instead.

If you run lcc without any arguments, it checks the status of the server preset in the script and shows you what's loaded. If there's exactly one model available, it skips the selection step entirely and just launches. I added that behavior because Qwen3 Coder Next is often the only model loaded, and I don't want to type its name every single time. The host and port are configurable through environment variables too, so if you're running your server on a different machine or port, you can override them without editing the script.
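As a rough illustration of the override and auto-select behavior, here is a minimal sketch. The helper name `resolve_model` and the default host and port values are hypothetical. Only the `LCC_HOST` and `LCC_PORT` variable names come from the snippet above.

```shell
#!/usr/bin/env bash
# Environment overrides with fallbacks; the default values are assumptions.
LCC_HOST="${LCC_HOST:-127.0.0.1}"
LCC_PORT="${LCC_PORT:-8080}"

# Hypothetical helper. $1 is an optional model name; stdin is a
# newline-separated list of loaded model ids. Prints the model to launch,
# or fails when no name was given and the choice is ambiguous.
resolve_model() {
  local requested="${1:-}" line only="" count=0
  if [ -n "$requested" ]; then
    printf '%s\n' "$requested"   # an explicit name always wins
    return 0
  fi
  # Count non-empty lines, remembering the last one seen.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    only="$line"
    count=$((count + 1))
  done
  if [ "$count" -eq 1 ]; then
    printf '%s\n' "$only"        # exactly one model: no selection step needed
    return 0
  fi
  echo "Found $count models; pass one by name." >&2
  return 1
}
```

With defaults written this way, a one-off override such as `LCC_HOST=192.168.1.50 LCC_PORT=9000 lcc` (an example address, not one from the article) would work without touching the script.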

Local models are quietly perfect for security work

They automate the boring parts of reverse engineering and testing

This is where the local setup stops being a nice-to-have and starts being, arguably, a hard requirement. A big part of what I use this for is pentesting and reverse engineering. When I'm analyzing binaries, tearing apart firmware, or working through the early stages of a security assessment, I really don't want that data leaving my network. Sending disassembled code or extracted strings to a cloud API is, at best, a questionable decision. If keeping that data on your own network appeals to you, you can find my script on GitHub Gist.

The initial phases of reverse engineering are often quite cookie-cutter. For static analysis, you run the same file and readelf commands, you look for debug symbols, you extract strings. When it comes to firmware, it's the same thing; you extract the firmware, identify the filesystem, pull out interesting binaries, and you check for hardcoded credentials. These are all well-understood steps that a competent LLM can help automate. But unlike a deterministic script that follows the same path every time, an LLM can pivot based on what it finds. If it spots something unusual in a binary, it can adjust its approach. If the target uses an unexpected architecture, it adapts. That dynamic flexibility is what makes it actually useful, rather than just another wrapper around binwalk and strings.
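For reference, the deterministic first pass described above could be sketched as follows. This is only the fixed baseline that an agent would run and then deviate from. The function names and the keyword list are illustrative assumptions, and it presumes `file`, `readelf`, and `strings` are installed.

```shell
#!/usr/bin/env bash
# Filters text on stdin for lines that look like hardcoded credentials.
# The keyword list is illustrative, not exhaustive.
flag_credential_strings() {
  grep -Ei '(passw(or)?d|secret|api[_-]?key|token)[[:space:]]*[=:]' || true
}

# First-pass triage of one binary: file type, ELF header, debug sections,
# and credential-like strings.
triage_binary() {
  local target="$1"
  echo "== file =="
  file "$target"
  echo "== ELF header =="
  readelf -h "$target" 2>/dev/null || echo "(not an ELF file)"
  echo "== debug sections =="
  readelf -S "$target" 2>/dev/null | grep -F '.debug_' || echo "(none found)"
  echo "== credential-like strings =="
  strings -n 6 "$target" | flag_credential_strings
}
```

Running `triage_binary ./some_binary` prints the four sections in order. The point of putting an LLM in the loop is that it can read this output and decide what to run next instead of stopping here.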

And because I'm running everything locally, I don't have to think twice about what I'm feeding into the model. There are no terms of service to worry about, and no risk of sensitive findings ending up in someone else's training data. The model runs on my hardware, the data stays on my network, and I get the agentic capabilities of Claude Code without any of the privacy trade-offs.

I should be clear: this doesn't replace cloud models for everything. The local setup has its limitations, and Qwen3 Coder Next, as good as it is, can't match the depth of the largest cloud models. But for the kind of work where privacy matters and the tasks are well-defined, it's more than enough. A local model costs nothing per token, and while the hardware required to run this model is on the higher end, it's possible to offload many of the expert layers to the CPU and still generate enough tokens per second to be useful. And with a script that takes the friction out of launching it, I find myself reaching for the local option a lot more often than I expected.