When it comes to coding and development, AI apps and cloud APIs are the default choice. Small local large language models (LLMs), or local models in general, are underestimated and dismissed as underpowered. AI coding assistants and cloud models are convenient and impressive, but there are also metered billing, external API dependencies, and privacy concerns.

Everyone talks about AI models, 70B models, and cloud APIs. I wanted to see what would happen if I stayed 100 percent local and mostly with smaller local LLMs, like a 7B model to power a real app. So, I decided to build a useful full-stack app using only local LLMs, a Python app with no external APIs and no internet usage.

The result was a full-stack app for reviewing code snippets and getting professional reviews for bugs & vulnerabilities, performance optimization, and security audits. It was based on a Python backend, a React frontend, a local LLM as the intelligence, and a model context protocol (MCP) as the nervous system.

👁 A MacBook air connected to a monitor running DeepSeek-R1 locally
7 things I wish I knew when I started self-hosting LLMs

I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.

Building with constraints on purpose

Local first, by design

Credit: Shekhar Vaidya/XDA

Opting for a local-only app powered by a small local LLM like a 7B model was not a restriction or limitation; I deliberately chose the constraints. The constraint wasn’t the local architecture or the model size; it was whether I could design and build the app around those limits.

My aim was to build a modern app based on FastAPI for the backend, a high-performance Python web framework, and React with Vite for the frontend. The goal was to build something that is modern and production-ready, not just vanilla HTML, CSS, and JS.

For the engine and intelligence layer, I chose mistral-7b-instruct-v0.3 via LM Studio’s OpenAI-compatible endpoints. The local model was an optimal choice for me, given my output expectations and my hardware constraints. Statistically, the Mistral 7B Instruct is perfect for the use case and in specific cases, the model can outperform 13B-class models on benchmarks.

While the Mistral 7B was the brain that held the raw intelligence and patterns, the Model Context Protocol (MCP) was the nervous system of the app, bridging the gap between the local model and the codebase. Cloud models rely on massive compute power and context windows; my app used MCP to extract better reasoning with limited compute.

All of it ran on my local system, making the app fully offline yet powerful enough for serious code analysis.

Why “more context” isn’t the answer

Bigger windows, blurry thinking

Credit: Shekhar Vaidya/XDA

Many users assume that the more the context, the more reasoning it can do. This was the first thing I tested with the app, and my tests gave opposite results. A larger context window means the model can see more, but it doesn’t mean it can reason better about it.

I thoroughly tested multiple codebases of different sizes. When I tested a 500-line file, it took about 1500 tokens and gave me strong insights about the code. However, when I reviewed a larger file (around 1200 lines), the total token count, including the input code, internal structured prompt, and the output, exceeded 5000 tokens. In single-pass mode, the feedback became noticeably flatter and more generic.

If more context weren't the answer, the solution had to be structural.

I didn’t upgrade the model — I upgraded the architecture

Split, focus, merge

When reviews started degrading and failing, my first instinct was to just upgrade the model and get better results. But it wasn’t feasible in my case for two reasons. My hardware was a borderline limitation for larger models, and it wasn’t logically right to choose a 13B or 30B model for the app, as it would be overpowered.

Instead of scaling the model, I scaled the structure with a simple architectural shift. It was simple, yet optimal in my case. I added a sectional review mechanism in the frontend.

On input, the app counts the code lines and calculates the approximate token count. If the token count exceeds 1200, then it instructs the backend to chunk the code into 1200-token sections. Each section then passes to the model for review, and then the results are deterministically merged into a single structured output.

The results showed measurable improvement. A 1200-line file that previously produced flattened feedback in single-pass mode was successfully processed with stable and more focused insights.

Real performance, measured locally

No cloud. No billing

Credit: Shekhar Vaidya/XDA

I have a fairly powerful PC. It has a Ryzen 7 7700X and a GeForce RTX 4070 Ti (12GB VRAM), paired with 32GB of DDR5 memory. This provides more than enough headroom for a 7B local model. With proper model-level optimization, it can perform optimally for the scope of the app.

The model is configured with a context window up to 8192 tokens, temperature set to 0.2, a response limit of 800 tokens, top-p sampling at 0.9, and the KV cache fully offloaded to GPU memory. The final output was as per my expectations. A 500-line file was reviewed in roughly 6–8 seconds, while a 1000+ line file was reviewed in around 15–20 seconds in the sectional mode.

The takeaway was simple: without any external API calls, rate limits, or privacy concerns, I could run the app as many times as I wanted, and I didn’t have to worry about any usage-based billing.

Why I refused to build a “Fix-It” button

Analysis over automation

Credit: Shekhar Vaidya/XDA

The natural instinct of an engineer is to ask, "If the app can review the code, why not add another feature that will work on the bugs and improvements and provide the user with a better code?" It would be a flashy feature to have, but that’s where the real risk begins.

This app is designed to review a snippet of code or single file, sometimes up to 2000–3000 lines. At any given time, the app will have the file provided as context, nothing more. It has no awareness of the full project context, no dependency graph, and no multi-file understanding. Even if I introduce the context, the 7B model will be a hallucination risk. That opens the door to out-of-context code rewrites, which would be dangerous for any project.

The app, right now, delivers structured analysis with strict JSON validation, which is safer for real codebases. In the end, the app is about precision and control, not autopilot coding.

Control over compute

In the end, I learned that architecture matters more than scale in real-world scenarios. The experiment wasn’t to prove whether 7B models are better than larger ones. Even a 7B model can power a full-stack application if the context, output, and architecture are structured, constrained, and intentional.