I've been using cloud-based coding assistants and local LLMs heavily since they first came out. Partly out of idle curiosity at first, but as the models have improved and the tooling ecosystem has grown, I see valid, powerful use cases everywhere. I still don't feel like it's the earth-shattering revolution that the marketing teams try to portray, but they can handle real work nowadays.

Or they can, if you take your hands off the reins a little and give them some space to work in. Now, I'm not saying you can one-shot the next TikTok-beater with vibe coding alone. But with a little knowledge of how LLMs work under the hood, and more structured prompting, you can get pretty impressive results.

The prompting issue

We're so far past basic chatbots

Whether you're prompting a local or cloud-based LLM, everything runs on tokens. Your initial prompt is turned into tokens; the response is generated bit-by-bit from probability calculations and turned into more tokens, and so on. The problem here is that each model has a context window, which is the number of tokens it can handle at once. And since everything uses tokens, it fills up quickly.

Now, studies have shown that being ruder or using brevity in your prompts and responses improves the accuracy of results. While it's tempting to anthropomorphize the LLM into a person, don't. It doesn't understand manners, at least not in the context we do. But those shorter prompts help by using fewer tokens, and brevity aligns with the high-quality technical literature the models were trained on.

Instead of treating your chatbot as a conversational partner, drop all courtesies and treat it as a direct list of specifications to follow. LLMs don't handle ambiguous questions well and are more likely to make mistakes if you prompt them that way. Be pragmatic, use brevity, and you'll get results. Anthropic has a fantastic doc on best practices when prompting that will help anyone, even if you're using a different LLM.

Ah, but the hallucination problem

All LLMs can hallucinate, and to my knowledge, that's an ingrained facet of how they work. But that doesn't mean we can't get better results or refine them. Treat the first response from any prompt as a rough draft, and work accordingly.

Adopt a Socratic prompting method, where you query the LLM to read its response, and either verify, quality control, or analyze it for conciseness, or whatever end result you are trying to achieve.

Like any tool, it needs the right materials to work well

Feed it all the knowledge it needs to succeed

Credit: Shekhar Vaidya/XDA

LLMs are general-purpose tools with a wide range of knowledge fed into them for training. But to get the best results, they need more up-to-date data, or information that is more relevant to your needs. And that's where a combination of RAG, MCP servers, and assigning a little roleplay comes in.

Define the model's role in the prompt, whether that's "you are a JavaScript expert and are writing code for a front-end website" or "you are a librarian and are analyzing the given research material to uncover relationships for cross-specialization use."

If you're coding something, make sure your LLM has access to the latest documentation for the languages you're working with. Handily, most are available in Markdown for easy ingesting into your model. But that doesn't just stop with code, add documents relevant to your work, or any other relevant repositories of information. They work best when given context around your prompts, and they can't pull that from the data they were trained on.

And MCP servers fill in the gaps, allowing them to search the internet for current data or scour your Obsidian vault. You can easily have the LLM create new MCP servers that fill specific gaps in your requirements, and you'll get more specific results. Or you could put your knowledge base into a RAG (retrieval augmented generation) to limit the sources the LLM can draw from, reducing hallucinations due to unnecessary sources.

And to set clear output expectations

Doing a little bit of the thinking for your LLM beforehand brings big dividends. If it's not burning tokens deciding how to format its output, you get more reasoned answers. The best formats I've found to specify are JSON or Markdown, and to specify the format as well, which not only gives you a structured answer but also one that can be used for further automation.

Other good practices include setting limits on response word count, using delimiters with clear separators such as ###, ```, or XML tags to separate instructions from data to be processed, and specifying what to include or exclude from the final result.

With the right structure, LLMs can be incredibly powerful

Your LLM is more knowledgeable than you, and able to surface connections that you might miss, purely on the amount of data it contains. It needs some prodding in the right direction for the best results, though, and then you'll get outputs that are usable and with the minimum necessary revisions.