Claude Code remains one of the strongest tools when it comes to reasoning depth and long-context understanding. Its structured workflow, interactive prompts, and file-aware refactoring are what make it different from other cloud-based AI tools.

AI-assisted coding is fast, powerful, and convenient, but it is also metered. Every task we run, whether it is refactoring, explaining code, or boilerplate generation, consumes tokens. Over time, with long development sessions and more assisted coding, the API usage and cost start to add up.

Watching the meter slowly changed the way I worked. Instead of working freely, I kept trimming my prompts to stay efficient, and that limited how much I explored.

I wanted a workflow that felt like Claude Code without depending on a cloud backend. I started exploring other options. Local models have matured enough to assist daily development sessions, and LM Studio exposes them through an OpenAI-compatible API.

This is when I got the idea to rebuild a Claude-style workflow using local models and a lightweight custom CLI. It isn’t meant to replace cloud-based models; it’s about adding flexibility and reducing dependence on them.

What you’ll need before you start

Your GPU matters more than you think

Credit: Shekhar Vaidya/XDA

When working with local models, hardware becomes critical, and it can become the bottleneck too. Here’s my system configuration. I have a Ryzen 7 7700X and GeForce RTX 4070 Ti OC, paired with 32GB of DDR5 memory.

The most important hardware to look for is a powerful GPU with at least 8GB VRAM; CPU-only inference can work but is slow. After that, a good amount of memory, ideally 16GB+ of system RAM, is recommended. If you have these two, you are ready to dive into local models.

The GPU is the backbone of this setup, as token generation speed is heavily dependent on it. The more VRAM you have, the larger the model and context window you can keep on the GPU. For example, my RTX 4070 Ti has 12GB of VRAM, which is enough to hold a quantized 7B model entirely on the GPU, and I see 100+ tokens/sec.
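
To see why VRAM limits the context window, a rough back-of-the-envelope estimate helps. The sketch below computes the size of the key-value cache that a model keeps for each token of context. The architecture numbers (32 layers, 8 key-value heads, head dimension 128) are the published Mistral 7B configuration, and the 2 bytes per value assume a 16-bit cache. Both are assumptions about your setup, and real runtimes add overhead, so treat the result as a lower bound rather than a measurement.

```javascript
// Rough KV-cache size estimate. All defaults are assumptions:
// Mistral 7B's published config (32 layers, 8 KV heads, head dim 128)
// and a 16-bit (2-byte) cache. Real runtimes add overhead on top.
function estimateKvCacheBytes(contextTokens, {
  layers = 32,
  kvHeads = 8,
  headDim = 128,
  bytesPerValue = 2,
} = {}) {
  // Each token stores one key and one value vector per layer (hence the 2).
  const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerValue;
  return contextTokens * bytesPerToken;
}

const GIB = 1024 ** 3;
for (const ctx of [4096, 8192, 32768]) {
  const gib = estimateKvCacheBytes(ctx) / GIB;
  console.log(`${ctx} tokens -> ~${gib.toFixed(2)} GiB of KV cache`);
}
```

Under these assumptions, an 8192-token context needs about 1 GiB of VRAM on top of the model weights, which is why a larger context length only makes sense if your card has room to spare.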

On the software side, you'll need LM Studio with at least a 7B instruct model downloaded (in my case, Mistral 7B Instruct v0.3), Node.js, and a basic terminal tool like Termius or Windows Terminal, depending on your preference.

Step 1: Turn LM Studio into a local API server

Make your model behave like a cloud endpoint

Although LM Studio is a hub to discover, download, and use open-source local models in its built-in conversation mode, we will use it differently. We’ll configure it as a local backend using its OpenAI-compatible API.

Let me go through the steps.

  • Open LM Studio, click the Model Search button in the left sidebar, and search for an instruct model like "mistral-7b-instruct-v0.3".
  • Make sure the model you download is at least 7B; that size is a good balance between output quality and VRAM requirements.
  • Once downloaded, navigate to My Models and customize the local model: set the context length to 8192 (if your VRAM allows) and GPU offload to 32 (or the maximum available). Then load the model. You can cross-check it in the My Chat section.
  • Finally, go to Local Server under the Developer section from the sidebar and start the server. You can verify it by checking the “Reachable at: http://127.0.0.1:1234” text on the same page.

On the same page, you’ll notice all the supported endpoint formats, including LM Studio API, OpenAI-compatible, and Anthropic-compatible. For this setup, we will use the OpenAI-compatible endpoint reachable at http://127.0.0.1:1234/v1; it will work as the base URL for our custom CLI in the next step.
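
Before building anything on top of the server, it can help to confirm that it answers from code. Here is a minimal sketch that asks the OpenAI-compatible endpoint for its model list. The function name listModelIds and the injectable fetchImpl parameter are my own for illustration, and the snippet assumes Node.js 18 or newer, where fetch is built in.

```javascript
// Ask an OpenAI-compatible server which models it exposes.
// BASE_URL matches the address LM Studio reports; adjust it if yours differs.
const BASE_URL = "http://127.0.0.1:1234/v1";

// fetchImpl is injectable so the function can be exercised without a server.
async function listModelIds(baseUrl = BASE_URL, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/models`);
  if (!res.ok) {
    throw new Error(`Server responded with HTTP ${res.status}`);
  }
  const body = await res.json();
  // OpenAI-style list responses wrap the items in a "data" array.
  return body.data.map((model) => model.id);
}
```

With the server running, listModelIds().then(console.log) should print an array that includes the identifier of the loaded model. That identifier is the value to pass as model in later requests.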

Step 2: Build a Claude-style CLI on top of it

Replace the cloud without changing how you work

Our local backend is ready. Now we need a custom CLI to interact with it. LM Studio already exposes an OpenAI-compatible API; you could use cURL or Postman, but that’s not how developers work day to day.

Claude Code feels powerful because of its integrated workflow and not because of an API. So, a lightweight custom CLI will work as a thin orchestration layer that will sit between you and the local model.

We will use the OpenAI SDK inside a simple Node.js-based custom CLI to call the local backend and print the output.

  • Download and install the latest version of Node.js from its official website. Once done, create a new, empty folder for the project and open a terminal tool like Termius or Windows Terminal in it.
  • Set two environment variables so the CLI can reach the local backend, using these PowerShell commands: $env:OPENAI_BASE_URL="http://localhost:1234/v1" and $env:OPENAI_API_KEY="local-dev". Variables set this way apply only to the current terminal session, so keep working in the same window.
  • Now, in the same folder, initialize the project with the npm init -y command, which creates a package.json file. Because the script below uses import syntax, mark the project as an ES module with npm pkg set type=module. Then install the OpenAI SDK with the npm install openai command.
  • Create a JS script that will act as the entry point for the CLI. Create it with the notepad claude.js command, and paste the following script into it.
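// claude.js: a minimal CLI that sends one prompt to the local LM Studio server.
import OpenAI from "openai";

// Use the environment variables if they are set; otherwise fall back to the local defaults.
const openai = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL ?? "http://localhost:1234/v1",
  apiKey: process.env.OPENAI_API_KEY ?? "local-dev",
});

// Everything after the script name is treated as the prompt.
const prompt = process.argv.slice(2).join(" ");
if (!prompt) {
  console.log('Usage: node claude.js "Your prompt here"');
  process.exit(1);
}

async function run() {
  try {
    const response = await openai.chat.completions.create({
      // Must match the identifier of the model loaded in LM Studio.
      model: "mistralai/mistral-7b-instruct-v0.3",
      messages: [
        {
          role: "user",
          // The instructions are prepended to the user message rather than sent as a separate system message.
          content:
            "You are a senior software engineer. Respond with clean, well-formatted code and minimal explanation.\n\n" +
            prompt,
        },
      ],
      temperature: 0.3, // a low temperature keeps code output focused
    });
    console.log("\n" + response.choices[0].message.content);
  } catch (error) {
    console.error("Error:", error.message);
    process.exitCode = 1; // signal failure to the calling shell
  }
}

run();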

We are ready to run our first prompt using our custom CLI and LM Studio’s OpenAI-compatible API. The command looks like this: node claude.js "custom prompt". The CLI simply forwards the prompt to http://localhost:1234/v1/chat/completions and prints the model’s reply, with no internet connection needed.
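
To make "simply forwards prompts" concrete, here is a sketch of the HTTP request that a chat call boils down to, written with the built-in fetch instead of the SDK. The helper names buildChatRequest and ask are hypothetical, and I am assuming Node.js 18 or newer. The URL, model name, and placeholder key are the ones used in this article.

```javascript
// Build the HTTP request that an OpenAI-compatible chat call boils down to.
// The URL, model name, and placeholder key are the ones used in this article.
function buildChatRequest(prompt, {
  baseUrl = "http://localhost:1234/v1",
  model = "mistralai/mistral-7b-instruct-v0.3",
} = {}) {
  return {
    url: `${baseUrl}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer local-dev", // same placeholder key as in claude.js
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        temperature: 0.3,
      }),
    },
  };
}

// Send the request and return the assistant's text.
// fetchImpl is injectable so the function can be tried without a server.
async function ask(prompt, fetchImpl = fetch) {
  const { url, options } = buildChatRequest(prompt);
  const res = await fetchImpl(url, options);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Seeing the raw request also makes it clear why the same CLI can point at a different OpenAI-compatible server by changing only the base URL.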

Step 3: Add streaming, memory, and real developer features

Make it feel like a real coding assistant

We have created a working CLI, but at this stage it is just a raw API wrapper. The basic CLI proved that we could replace cloud endpoints with local models, but the script isn’t a usable development assistant. Our goal was to replicate the productivity features that make Claude Code effective.

Streaming and context awareness were the first upgrades I added. Instead of waiting for the full response from the model, the CLI prints tokens as they are generated. And by maintaining a conversation history, the model can now effectively respond to follow-up questions without losing context.
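
The two upgrades described above can be sketched roughly as follows. This is a minimal illustration rather than my exact implementation: streamTurn and chatLoop are hypothetical names, client is assumed to be an OpenAI SDK instance configured as in claude.js, and the dynamic import is only there so the snippet works in both ES module and CommonJS files.

```javascript
// Streaming chat turn with conversation memory.
// `client` is an OpenAI SDK instance configured as in claude.js;
// MODEL is the identifier used earlier in this article.
const MODEL = "mistralai/mistral-7b-instruct-v0.3";

async function streamTurn(client, history, userText, write = (t) => process.stdout.write(t)) {
  // Record the user's message so later turns can see it.
  history.push({ role: "user", content: userText });

  const stream = await client.chat.completions.create({
    model: MODEL,
    messages: history, // the full history is re-sent on every turn
    temperature: 0.3,
    stream: true, // ask for incremental chunks instead of one response
  });

  let reply = "";
  for await (const chunk of stream) {
    // Each chunk carries a small delta of text; some chunks carry none.
    const token = chunk.choices[0]?.delta?.content ?? "";
    reply += token;
    write(token);
  }

  // Store the assembled reply so follow-up questions have context.
  history.push({ role: "assistant", content: reply });
  return reply;
}

// Minimal interactive loop: one shared history for the whole session.
async function chatLoop(client) {
  const readline = await import("node:readline/promises");
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const history = [];
  while (true) {
    const line = await rl.question("\n> ");
    if (line.trim() === "exit") break;
    await streamTurn(client, history, line);
  }
  rl.close();
}
```

With the openai client from claude.js, calling chatLoop(openai) starts an interactive session, and typing exit ends it. Because the whole history is re-sent each turn, long sessions eventually hit the context length set in LM Studio, so a real tool would need to trim or summarize older messages.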

Finally, to make it more effective and practical for real projects, I added file and directory modes. The CLI can now load any file or folder I specify and keep the context in memory. It can now summarize and explain any code snippet and generate structured diffs for a particular file or folder. At this point, it starts behaving like a coding assistant.
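
Here is one way the file and directory modes could look, as a hedged sketch rather than the exact code I run. The function names, the ignore list, and the 20,000-character cap are illustrative assumptions. The snippet reads every file as UTF-8 text, so a real tool would also want to filter by extension and skip binaries.

```javascript
// Load a file or a directory into a single prompt-ready context string.
// The limit and the ignore list are illustrative assumptions.
const IGNORED_DIRS = new Set(["node_modules", ".git"]);
const MAX_CONTEXT_CHARS = 20000; // crude guard so the prompt fits a small context window

// Recursively collect file paths under `target` (or return the file itself).
async function collectFiles(target) {
  const fs = await import("node:fs/promises");
  const path = await import("node:path");
  const info = await fs.stat(target);
  if (info.isFile()) return [target];

  const files = [];
  const entries = await fs.readdir(target, { withFileTypes: true });
  for (const entry of entries) {
    const full = path.join(target, entry.name);
    if (entry.isDirectory()) {
      if (!IGNORED_DIRS.has(entry.name)) files.push(...(await collectFiles(full)));
    } else if (entry.isFile()) {
      files.push(full);
    }
  }
  return files.sort(); // stable order keeps prompts reproducible
}

// Concatenate file contents, labelling each file so the model can refer to it.
async function buildContext(target) {
  const fs = await import("node:fs/promises");
  let context = "";
  for (const file of await collectFiles(target)) {
    const text = await fs.readFile(file, "utf8");
    const section = `--- ${file} ---\n${text}\n`;
    if (context.length + section.length > MAX_CONTEXT_CHARS) break; // stop before overflowing
    context += section;
  }
  return context;
}

// Turn the loaded context plus a question into a chat message.
async function buildFileMessage(target, question) {
  const context = await buildContext(target);
  return { role: "user", content: `${context}\n${question}` };
}
```

The message returned by buildFileMessage("src", "Summarize this folder.") can be pushed onto the conversation history, which is what keeps the loaded files in memory for follow-up questions. Asking for a diff is then just a matter of wording the question, for example requesting the changes in unified diff format.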

What this workflow gets right — and where it doesn’t

Offline speed comes with reasoning limits

Credit: Claude Code Docs

The most important thing that we achieved from this workflow is control. There is no stress about API costs, rate limits, and external dependencies. Locally, we achieved streaming, memory, file-aware prompts, and structured diffs. That means once configured correctly, it can operate entirely offline and with no concerns for billing, quotas, or connectivity.

In my case, a 7B model on an RTX 4070 Ti performs extremely well with a consistent 100+ tokens per second generation, which makes streaming responses smooth.
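
If you want to check generation speed on your own hardware, a rough measurement is easy to script. The sketch below times one non-streaming request and divides the reported completion tokens by the elapsed time. The names tokensPerSecond and measureSpeed are hypothetical, client is assumed to be an OpenAI SDK instance pointed at the local server, and because the timer also covers prompt processing, the figure will read somewhat lower than the pure generation speed shown in LM Studio.

```javascript
// Estimate generation speed from a single non-streaming request.
// `client` is an OpenAI SDK instance pointed at the local server.
function tokensPerSecond(completionTokens, elapsedMs) {
  if (elapsedMs <= 0) return 0;
  return completionTokens / (elapsedMs / 1000);
}

async function measureSpeed(client, model, prompt = "Write a function that reverses a string.") {
  const start = performance.now();
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: 256, // cap the reply so runs are comparable
  });
  const elapsedMs = performance.now() - start;
  // `usage` is part of the OpenAI response format; fall back to 0 if the server omits it.
  const generated = response.usage?.completion_tokens ?? 0;
  return tokensPerSecond(generated, elapsedMs);
}
```

Running it a few times and averaging gives a steadier number, since the first request after loading a model is often slower.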

However, small local models and low-end hardware have clear limitations. A weak GPU or low memory cannot keep up with the 7B model, and a 7B instruct model cannot match long context capability or reasoning depth of a cloud model.

This makes the setup highly effective for day-to-day iterative coding sessions, but it becomes a limitation in scenarios that require advanced reasoning chains or large context windows.

Local-first doesn’t mean cloud-free

Building a local-only Claude-style workflow on top of LM Studio proves that cloud dependency is a choice and not a requirement. Configured with the right local model, powerful hardware, and a structured CLI, it can handle most day-to-day development sessions. Local models may not replace the frontier cloud models for complex reasoning, but they significantly reduce API costs and the need to rely on those models constantly.