VOOZH about

URL: https://thenewstack.io/do-enormous-llm-context-windows-spell-the-end-of-rag/

⇱ Do Enormous LLM Context Windows Spell the End of RAG? - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-05-10 10:36:35
Do Enormous LLM Context Windows Spell the End of RAG?
sponsor-myscale,sponsored-post-contributed,
AI / Data / Large Language Models

Do Enormous LLM Context Windows Spell the End of RAG?

Now that LLMs can retrieve 1 million tokens at once, how long will it be until we don’t need retrieval augmented generation for accurate AI responses?
May 10th, 2024 10:36am by Usama Jamil
👁 Featued image for: Do Enormous LLM Context Windows Spell the End of RAG?
 Featured image by Erik Mclean on Unsplash.
MyScale sponsored this post.

Generative AI’s (GenAI) iteration speed is growing exponentially. One outcome is that the context window — the number of tokens a large language model (LLM) can use at one time to generate a response — is also expanding rapidly.

Google Gemini 1.5 Pro, released in February 2024, set a record for the longest context window to date: 1 million tokens, equivalent to 1 hour of video or 700,000 words. Gemini’s outstanding performance in handling long contexts led some people to proclaim that “retrieval augmented generation (RAG) is dead.” LLMs are already very powerful retrievers, they said, so why spend time building a weak retriever and dealing with RAG-related issues like chunking, embedding and indexing?

The increased context window started a debate: With these improvements, is RAG still needed? Or might it soon become obsolete?

How RAG Works

LLMs are constantly pushing the boundaries of what machines can understand and achieve, but they have been limited by issues such as difficulty responding accurately to unseen data or being up to date with the latest information. These gave rise to hallucinations, which RAG was developed to address.

RAG combines the power of LLMs with external knowledge sources to produce more informed and accurate responses. When a user query is received, the RAG system initially processes the text to understand its context and intent. Then it retrieves data relevant to the user’s query from a knowledge base and passes it, along with the query, to the LLM as context. Instead of passing the entire knowledge base, it passes only the relevant data because LLMs have a contextual limit, the amount of text a model can consider or understand at once.

👁 RAG architecture

First, the query is converted into vector embeddings using an embedding model. This embedding vector is then compared against a database of document vectors, identifying the most relevant documents. These relevant documents are fetched and combined with the original query to provide a rich context for the LLM to generate a more accurate response. This hybrid approach enables the model to use up-to-date information from external sources, allowing LLMs to generate more informed and accurate responses.

Why Long Context Windows Might Be the End of RAG

The continuous increase in LLMs’ contextual window will directly affect how these models ingest and generate responses. By increasing the amount of text LLMs can process at one time, these extended contextual windows enhance the model’s ability to understand more comprehensive narratives and complex ideas, improving the overall quality and relevance of generated responses. This improves LLMs’ ability to keep track of longer texts, allowing them to grasp a context and its finer details more effectively. Consequently, as LLMs become more adept at handling and integrating extensive context, the reliance on RAG for enhancing response accuracy and relevancy may decrease.

Accuracy

RAG improves the model’s ability by providing relevant documents as a context based on a similarity score. However, RAG doesn’t adapt or learn from the context in real time. Instead, it retrieves documents that appear similar to the user’s query, which may not always be the most contextually appropriate, potentially leading to less accurate responses.

On the other hand, by utilizing the long contextual window of a language model by stuffing all the data inside it, an LLM’s attention mechanism can generate better answers. The attention mechanism within a language model focuses on the different parts of the provided context to generate a precise response. Furthermore, this mechanism can be fine-tuned. By adjusting the LLM model to reduce loss, it gets progressively better, resulting in more accurate and contextually relevant responses.

Information Retrieval

When retrieving information from a knowledge base to enhance LLM responses, it’s always hard to find complete and relevant data for the contextual window. There’s continual uncertainty about whether the retrieved data completely answers the user’s query. This situation can cause inefficiencies and errors if the information chosen isn’t enough and doesn’t align well with the user’s actual intentions or the context of the conversation.

External Storage

Traditionally, LLMs couldn’t handle massive amounts of information simultaneously because they were limited in how much context they could process. But their newly enhanced capabilities allow them to handle extensive data directly, eliminating the need for separate storage per query. This streamlines architecture speeds access to external databases and boosts AI efficiency.

Why RAG Will Stick Around

Expanded contextual windows in LLMs may provide a model with deeper insights, but it also brings challenges such as higher computational costs and efficiency. RAG tackles these challenges by selectively retrieving only the most relevant information, which helps optimize performance and accuracy.

Complex RAG Will Persist

There’s no doubt that simple RAG, where data is chunked into fixed-length documents and retrieved based on similarity, is fading. However, the complex RAG systems are not just persisting but evolving significantly.

Complex RAG includes a broader range of capabilities such as query rewriting, data cleaning, reflection, optimized vector search, graph search, rerankers and more sophisticated chunking techniques. These enhancements are not only refining RAG’s functionality but also scaling its capabilities.

Performance Beyond Context Length

While expanding the contextual window in LLMs to include millions of tokens looks promising, the practical implementations are still questionable due to several factors including time, efficiency and cost.

  • Time: As the size of the contextual window increases, so does latency in response times. As you expand the size of the contextual window, the LLM takes more time to process a higher number of tokens, leading to delays and increased latency. Many LLM applications require quick responses. The added delay from processing larger text blocks can significantly hinder the LLM’s performance in real-time scenarios, creating a major bottleneck in the implementation of larger contextual windows.
  • Efficiency: Studies suggest that LLMs achieve better results when provided fewer, more relevant documents rather than large volumes of unfiltered data. A recent Stanford study found that state-of-the-art LLMs often struggle to extract valuable information from their extensive context windows. The difficulty is particularly pronounced when critical data is buried in the middle of a large text block, causing LLMs to overlook important details and leading to inefficient data processing.
  • Cost: Expanding the size of the context window in LLMs results in higher computational costs. Processing a larger number of input tokens demands more resources, leading to increased operational expenses. For instance, in systems like ChatGPT, there is a focus on limiting the number of tokens processed to keep costs under control.

RAG optimizes each of these three factors directly. By passing only similar or related documents as context (rather than stuffing everything in), LLMs process information more quickly, which not only reduces latency but also improves the quality of responses and lowers the cost.

Why Not Fine-Tuning

Besides using a larger contextual window, another alternative to RAG is fine-tuning. However, fine-tuning can be expensive and cumbersome. It’s challenging to update an LLM every time new information comes in to ensure it’s up to date. Other issues with fine-tuning include:

  • Training data limitations: No matter the advancements made in LLMs, there will always be contexts that were either unavailable at the time of training or not deemed relevant.
  • Computational resources: To fine-tune an LLM on your data set and tailor it for specific tasks, you need high computational resources.
  • Expertise needed: Developing and maintaining cutting-edge AI isn’t for the faint of heart. You need specialized skills and knowledge that can be hard to come by.

Additional issues include gathering the data, making sure that the quality is good enough and model deployment.

Comparing RAG vs. Fine-Tuning or Long Context Windows

Below is a comparative overview of RAG and fine-tuning or long context window techniques (because the latter two have similar characteristics, I have combined them in this chart). It highlights key aspects such as cost, data timelines and scalability.

Feature RAG Fine-Tuning/Long Context Windows
Cost Minimal, no training required High, requires extensive training and updating
Data Timeliness Data retrieved on demand, ensuring currency Data can quickly become outdated
Transparency High, shows retrieved documents Low, unclear how data influences outcomes
Scalability High, easily integrates with various data sources Limited, scaling up involves significant resources
Performance Selective data retrieval enhances performance Performance can degrade with larger context sizes
Adaptability Can be tailored to specific tasks without retraining Requires retraining for significant adaptations

Optimizing RAG Systems With Vector Databases

State-of-the-art LLMs can process millions of tokens simultaneously, but the complexity and constant evolution of data structures make it challenging for LLMs to manage heterogeneous enterprise data effectively. RAG addresses these challenges, although retrieval accuracy remains a major bottleneck for end-to-end performance. Whether it’s the large context window of LLMs or RAG, the goal is to make the best use of big data and ensure the high efficiency of large-scale data processing.

Integrating LLMs with big data using advanced SQL vector databases like MyScaleDB enhances the effectiveness of LLMs and facilitates better intelligence extraction from big data. Additionally, it mitigates model hallucination, offers data transparency and improves reliability. MyScaleDB, an open source SQL vector database built on ClickHouse is tailored for large AI/RAG applications. Leveraging ClickHouse as its base and featuring the proprietary MSTG algorithm, MyScaleDB demonstrates superior performance managing large-scale data compared to other vector databases.

LLM technology is changing the world, and the importance of long-term memory will persist. Developers of AI applications are always striving for the perfect balance between query quality and cost. When large enterprises put generative AI into production, they need to control costs while maintaining the best quality of response. RAG and vector databases remain important tools to achieve this goal.

You’re welcome to dive into MyScaleDB on GitHub or discuss more about LLMs or RAG in our Discord.

MyScale is an open-source SQL vector database that allows to effectively manage massive volumes of both structured and vector data for developing robust AI applications. It enables every developer to build production-grade GenAI applications with powerful and familiar SQL.
Learn More
TRENDING STORIES
Usama Jamil, a developer advocate at MyScale, brings with him a wealth of experience and a profound interest in data science. With a passion for exploring new trends in the AI/ML domain, Usama strives to make complex concepts accessible to...
Read more from Usama Jamil
MyScale sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: turing, ClickHouse.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.