
URL: https://thenewstack.io/retrieval-augmented-generation-for-llms/

Retrieval Augmented Generation for LLMs - The New Stack


AI / Large Language Models

Retrieval Augmented Generation for LLMs

Retrieval-augmented generation (RAG) is a cutting-edge approach in NLP and AI. Badrul Sarwar, a machine learning scientist, shares his tips.
Dec 13th, 2023 7:55am by Badrul Sarwar

Generative AI (GenAI), powered by advanced neural network architectures and large language models (LLMs), has the remarkable ability to generate coherent and contextually relevant content — including text, images and even music — with minimal human intervention. However, such models suffer from one major limitation: they cannot expand or update their memory and may produce what is known as “hallucinations.” Hallucinations occur when the LLM produces content that sounds plausible but is actually fictional or incorrect, often due to the model extrapolating or confabulating beyond the scope of its training data.

For general-purpose content generation and other use cases, model hallucination can cause mild annoyance; but for an AI assistant or chatbot dealing with enterprise data, any type of inaccurate answer can lead to user frustration and even catastrophic consequences.

Solution: Retrieval Augmented Generation

Retrieval-augmented generation represents a cutting-edge approach to natural language processing and AI. This technique combines elements of both text generation and information retrieval to enhance the quality and relevance of generated content.

By incorporating knowledge and context from external sources or databases, retrieval-augmented generation models can produce more contextually accurate, coherent, and informative text that is far less prone to hallucination. Most importantly, RAG can harness an application’s internal data and augment an LLM’s knowledge to find the specific answer to a question.

One can think of a general-purpose LLM as taking a closed-book test: it has memorized its knowledge, and when asked a question it generates an answer from memory. When the question falls outside that memorized knowledge, which is modeled through billions of parameters, the model tends to fill the gap by confabulating or hallucinating an answer. By contrast, RAG is like an open-book test: when needed, the system can quickly retrieve the relevant knowledge and augment the LLM’s knowledge to provide a correct answer. RAG systems can also be designed not to provide any answer if no relevant contextual information can be found, which greatly reduces the hallucination problem.

RAG Details

At the heart of the RAG system is the retrieval system for additional knowledge. Embeddings or vector representations are used for semantic knowledge retrieval. The following are the main components of a RAG system:

1. Embedding and Similarity Search

All additional documents or knowledge sources are tokenized and embedded in a dense, comparatively low-dimensional vector space using a foundational NLP model (e.g., Word2Vec, GPT, BERT, Llama). Embeddings are numerical representations of text (words, sentences, or document chunks) that preserve semantic relationships. These embeddings have a fixed dimension whose size is dictated by the model that generates them.

With these embeddings, pieces of text with similar meanings or contexts are located closer together in the vector space. Given a query, a nearest-neighbor search such as Maximum Inner Product Search (MIPS) is usually used to find the documents most semantically similar to it. Similarity search in general ranks documents by dot product, cosine similarity, or Euclidean distance; MIPS specifically ranks by dot product, which gives the same ordering as cosine similarity when the vectors are normalized to unit length.
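The three ranking criteria can be compared on a toy example. The sketch below uses made-up three-dimensional vectors rather than real model embeddings, and the helper names are chosen here for illustration only.

```python
import math

def dot(a, b):
    """Inner (dot) product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" (illustrative values, not produced by any model).
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, larger magnitude
doc_far = [0.0, 3.0, 0.0]    # orthogonal to the query

for name, doc in (("doc_close", doc_close), ("doc_far", doc_far)):
    print(name, dot(query, doc), cosine_similarity(query, doc),
          euclidean_distance(query, doc))
```

Note that dot product is sensitive to vector magnitude while cosine similarity is not, which is one reason embeddings are often normalized before indexing.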

2. Managing Embeddings — Vector Databases

For a typical enterprise application, there can be a great number of documents. Storing and searching through so many embeddings can be a daunting task. Imagine a scenario with 1 million documents, each with a 1,000-dimensional embedding. An exact MIPS-based top-k nearest neighbor search would require computing the dot product between the query vector and every one of the 1 million document vectors, and then selecting the k most similar documents. That is roughly a billion multiply-add operations per query, a very compute-intensive task.
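A minimal sketch of this exhaustive search, assuming documents are held in a plain dictionary of toy vectors (the function name and document IDs are hypothetical):

```python
import heapq

def top_k_by_inner_product(query, doc_vectors, k):
    """Exhaustive MIPS: score every document, keep the k highest scores."""
    scored = (
        (sum(q * d for q, d in zip(query, vec)), doc_id)
        for doc_id, vec in doc_vectors.items()
    )
    # nlargest returns (score, doc_id) pairs sorted from best to worst.
    return heapq.nlargest(k, scored)

docs = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.1, 0.8, 0.1],
    "doc-c": [0.7, 0.2, 0.1],
}
results = top_k_by_inner_product([1.0, 0.0, 0.0], docs, k=2)
print([doc_id for _, doc_id in results])

# Work per query for the scenario in the text:
# one multiply-add per dimension per document.
multiply_adds = 1_000_000 * 1_000
print(multiply_adds)  # 1000000000
```

The cost grows linearly with both the number of documents and the embedding dimension, which is what motivates approximate methods.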

Faster approximate nearest neighbor (ANN) algorithms, such as locality-sensitive hashing, have been developed to address this. These days, a class of services called vector databases can help with storage and organization and (most importantly) provide similarity-based retrieval through simple APIs. Vector databases are specifically designed to operate on vector embeddings. Cloud-hosted options such as Pinecone, managed Milvus deployments and vector search services on AWS, and locally run tools such as FAISS (a similarity search library) and Chroma, are becoming very popular and play a crucial role in designing RAG systems.
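To illustrate the idea behind locality-sensitive hashing, here is a minimal sketch of one common variant, random-hyperplane hashing. It is not how any particular vector database is implemented; the function names, the number of hyperplanes and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Random hyperplanes through the origin, one normal vector each."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(doc_vectors, hyperplanes):
    """Group document IDs into buckets keyed by their bit signature."""
    buckets = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        buckets[signature(vec, hyperplanes)].append(doc_id)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Only documents in the query's bucket are scored exactly afterwards."""
    return buckets.get(signature(query, hyperplanes), [])

planes = make_hyperplanes(num_planes=8, dim=3)
index = build_index({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]}, planes)
print(candidates([4.0, 8.0, 12.0], index, planes))
```

Vectors pointing in similar directions tend to share a signature, so a query is compared against one bucket instead of the whole collection. The trade-off is that a true nearest neighbor can land in a different bucket and be missed, which is why the result is approximate.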

3. Augmented Prompt

Once the vector database returns the documents most similar to the question, they are compiled into a single context that is expected to contain enough information to answer the question. Finally, a special prompt is created that instructs the LLM to answer the question using only this supplied context. If the quality of retrieval is good, meaning the context is relevant to the question, the LLM can generate a suitable answer.
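A minimal sketch of assembling such a prompt follows. The instruction wording and the function name are assumptions for illustration; real systems tune the wording for their model.

```python
def build_augmented_prompt(question, retrieved_docs):
    """Combine retrieved passages and the question into one prompt string."""
    # Number each passage so the model (and a human reader) can refer to it.
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_augmented_prompt(
    "How many vacation days do new employees get?",
    ["New employees accrue 15 vacation days per year.",
     "Vacation requests are submitted through the HR portal."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM in place of the bare question.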

The quality of the retrieved context can be controlled by applying similarity thresholds and the RAG system can decide not to answer a question if the retrieved contextual information is not relevant.
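The thresholding step might look like the sketch below. The threshold value, the refusal message and the `generate` callable (standing in for whatever LLM call the application uses) are all hypothetical.

```python
def select_context(scored_docs, min_score, max_docs=3):
    """Keep passages whose similarity meets the threshold, best first.

    scored_docs is an iterable of (similarity, text) pairs.
    """
    kept = sorted((pair for pair in scored_docs if pair[0] >= min_score),
                  reverse=True)
    return [text for _, text in kept[:max_docs]]

def answer_or_decline(question, scored_docs, min_score, generate):
    """Call the LLM only when at least one passage is relevant enough."""
    context = select_context(scored_docs, min_score)
    if not context:
        return "I don't have enough information to answer that."
    return generate(question, context)

# Stub standing in for a real LLM call, so the flow can be exercised.
def fake_generate(question, context):
    return f"Answer based on {len(context)} passage(s)."

hits = [(0.82, "Refunds are issued within 14 days."), (0.31, "Office hours are 9 to 5.")]
print(answer_or_decline("What is the refund window?", hits, 0.75, fake_generate))
print(answer_or_decline("What is the refund window?", hits, 0.90, fake_generate))
```

Choosing the threshold is a judgment call: too low and irrelevant context reaches the model, too high and the system declines questions it could have answered.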

Benefits of RAGs for Enterprise Applications

Retrieval-augmented generation can be beneficial for enterprise applications in a variety of ways:

  • Reduced hallucination: one of the most important benefits of RAG-based generative applications. For enterprise use cases, it is crucial that the answers provided by the model are factually correct and trustworthy; otherwise, they will do more harm than good.
  • Cost savings: no need to retrain LLMs as knowledge evolves. Training LLMs is very expensive. As enterprise knowledge grows or changes, RAG can accommodate the changes by simply generating embeddings for the new content and inserting them into the vector database. The similarity search can then retrieve the new content and use it when building the context for the LLM.
  • Tailored experience: ideally, enterprises can train smaller, more tailored foundational LLMs with the help of RAG systems and can provide a much better customer experience.
  • Privacy and security maintenance: enterprises can avoid the exposure of their proprietary data to large LLMs. The privacy and security of enterprise customer data is one of the most important considerations when it comes to using LLMs. With the help of RAG, enterprises can run smaller but powerful open LLMs and provide better customer experience without compromising the privacy and security of sensitive data.
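The cost-savings point above, that new knowledge is added by inserting embeddings rather than by retraining, can be sketched with a toy in-memory store. The class and its methods are hypothetical and do not mirror the API of any real vector database; the two-dimensional vectors are made up.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector database (exact search, no persistence)."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, doc_id, vector):
        """Insert a new document embedding or overwrite an existing one."""
        self._vectors[doc_id] = list(vector)

    def delete(self, doc_id):
        """Remove a document so it can no longer be retrieved."""
        self._vectors.pop(doc_id, None)

    def search(self, query, k=1):
        """Return the IDs of the k documents with the highest dot product."""
        scored = sorted(
            ((sum(q * v for q, v in zip(query, vec)), doc_id)
             for doc_id, vec in self._vectors.items()),
            reverse=True,
        )
        return [doc_id for _, doc_id in scored[:k]]

store = InMemoryVectorStore()
store.upsert("pricing-2023", [1.0, 0.0])
before = store.search([0.1, 1.0])

# Knowledge changes: add the new document; the LLM itself is untouched.
store.upsert("pricing-2024", [0.0, 1.0])
after = store.search([0.1, 1.0])
print(before, after)
```

The same query returns the newer document as soon as its embedding is stored, with no change to the model's parameters.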

Challenges of RAGs

RAG-based applications have challenges, too. Running an additional vector database may add to the cost. The retrieval step that fetches extra information from the vector DB also adds to the response time. In addition, the overall prompt is much larger, because the question and the contextual information are sent together in the same prompt. Because hosted LLM services typically charge by token count, each question answered becomes more expensive.
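The effect of prompt size on cost is simple arithmetic. The token counts and the price per 1,000 tokens below are hypothetical placeholders, not figures from any provider.

```python
def prompt_cost(question_tokens, context_tokens, price_per_1k_tokens):
    """Return (total prompt tokens, cost) for a single request."""
    total = question_tokens + context_tokens
    return total, total / 1000 * price_per_1k_tokens

# Hypothetical numbers for illustration only.
PRICE_PER_1K = 0.01  # assumed price per 1,000 prompt tokens

plain_tokens, plain_cost = prompt_cost(50, 0, PRICE_PER_1K)
rag_tokens, rag_cost = prompt_cost(50, 1500, PRICE_PER_1K)

print(plain_tokens, rag_tokens)
print(rag_cost / plain_cost)
```

Under these assumed numbers the augmented prompt is about 31 times the size of the bare question, and the per-question prompt cost scales by the same factor.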

Badrul Sarwar is co-founder and CTO of CloudAEye, an AI-Ops startup. He is an expert AI researcher/practitioner with decades of industry and academia experience.