
URL: https://thenewstack.io/rag-still-relevant-in-the-era-of-long-context-models/

⇱ RAG: Still Relevant in the Era of Long Context Models - The New Stack


contributed,
AI / API Management / Large Language Models

RAG: Still Relevant in the Era of Long Context Models

While RAG will remain a staple of production applications, Gemini 1.5 Pro and similar models will help enterprise data science teams.
May 20th, 2024 10:00am by Shahebaz Mohammad
Image by Artie_Navarre from Pixabay.

Google recently released Gemini 1.5 Pro, a large language model boasting a mammoth one million token context window. This sparked a buzz in the AI community, with some dubbing it the “RAG killer.”

Before we rush to write eulogies for retrieval-augmented generation (RAG), let’s take a deep breath and analyze the situation from an enterprise perspective. Extremely long context windows may get data science teams to a working pipeline faster, but does a deployment speed advantage justify an application that costs many times as much to run in production?

Probably not.

Enterprises need applications that achieve high performance in a small footprint. That means choosing and customizing a right-sized foundation model along with the supporting LLM system around it. Highly customized RAG systems simply provide better value for high-throughput tasks.

But these technologies can co-exist. While RAG will remain a staple of production applications, Gemini 1.5 Pro and similar models will help enterprise data science teams experiment and iterate faster.

RAG’s Obvious Advantage: More Tokens = Higher Cost

Injecting more context into large language model (LLM) prompts means paying for more processing power — whether that’s directly with per-token charges through an API, or indirectly through the cost of computational resources. Data scientists and developers, therefore, must carefully consider how much context is the right amount for each task.

In a way, this is a nice problem to have. Early LLM-backed applications typically used the entire context window and struggled to optimize what context to fit into it. As context sizes increased from 1,000 tokens to 16,000 tokens and now one million tokens, development pressure has shifted from prioritizing the most important documents to deciding where performance gains no longer justify the price of additional text.

No matter how an enterprise pays for its LLM usage, more tokens mean higher operating costs. Very few tasks require a million tokens of context.
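To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The per-token price, the 4,000-token RAG prompt and the request volume are all hypothetical numbers chosen for illustration; they are not figures from Google or from any real price list.

```python
# Hypothetical flat input price; real pricing varies by provider and model.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 5.00  # assumption, for illustration only


def prompt_cost_usd(prompt_tokens: int,
                    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD) -> float:
    """Input-side cost of a single request at a flat per-token rate."""
    return prompt_tokens / 1_000_000 * price_per_million


def daily_cost_usd(prompt_tokens: int, requests_per_day: int) -> float:
    """Input-side cost of a day's traffic when every request has the same prompt size."""
    return prompt_cost_usd(prompt_tokens) * requests_per_day


# Assumed workload: 10,000 requests per day.
full_context = daily_cost_usd(prompt_tokens=1_000_000, requests_per_day=10_000)
rag_context = daily_cost_usd(prompt_tokens=4_000, requests_per_day=10_000)
ratio = full_context / rag_context

print(f"Full context window: ${full_context:,.2f} per day")
print(f"Retrieved context:   ${rag_context:,.2f} per day")
print(f"Cost ratio:          {ratio:.0f}x")
```

Under a flat per-token rate the ratio depends only on the prompt sizes, so the gap stays the same whatever the actual price turns out to be.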

RAG’s Modularity Advantage

The modular architecture of RAG-based applications offers valuable flexibility. Gemini, like most LLMs, is a black box. It undoubtedly works well on some topics and tasks and less well on others. If an enterprise data science team built an application that used Gemini 1.5’s entire context window, they would have a difficult time replacing Gemini with another model — at least until a comparable competitor reaches the market.

That’s not true with RAG-based applications. RAG-based LLM systems allow data science teams to swap out and customize each component to their specific needs.
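As a minimal sketch of what that modularity can look like in code, the Python below treats the retriever and the generator as plain callables that can be replaced independently. Everything here is hypothetical: the `RagPipeline` class, the toy keyword retriever and the two stub "models" stand in for real components and do not reflect any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Each component is just a callable, so any one can be replaced on its own.
Retriever = Callable[[str, int], List[str]]  # (question, top_k) -> chunks
Generator = Callable[[str], str]             # prompt -> answer


@dataclass
class RagPipeline:
    retrieve: Retriever
    generate: Generator
    top_k: int = 2

    def answer(self, question: str) -> str:
        chunks = self.retrieve(question, self.top_k)
        prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"
        return self.generate(prompt)


# Toy corpus and components (hypothetical stand-ins for real ones).
DOCS = [
    "The lease term begins on 1 March 2021.",
    "Either party may terminate with 90 days notice.",
    "Payments are due on the first business day of each month.",
]


def keyword_retriever(question: str, k: int) -> List[str]:
    """Rank documents by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]


def stub_model_a(prompt: str) -> str:
    return f"[model-a] received {len(prompt.split())} words"


def stub_model_b(prompt: str) -> str:
    return f"[model-b] received {len(prompt.split())} words"


pipeline = RagPipeline(retrieve=keyword_retriever, generate=stub_model_a)
first = pipeline.answer("When does the lease term begin?")

# Swapping the LLM leaves retrieval and prompt construction untouched.
pipeline.generate = stub_model_b
second = pipeline.answer("When does the lease term begin?")
print(first)
print(second)
```

The same pattern applies in the other direction: a fine-tuned embedding retriever could replace `keyword_retriever` without touching the generator.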

Snorkel AI recently worked on a RAG-based project with a banking customer. The customer needed the system to accurately answer questions about contracts. The project started with off-the-shelf components (GPT-4 as the LLM with LlamaIndex for RAG) and scored 25% accuracy — quite far from deployment benchmarks.

In their first sprint, our engineers added components to the application to intelligently chunk and tag source documents. The off-the-shelf version of the application struggled to identify which texts contained dates. Our team added a lightweight helper model that explicitly tagged document chunks predicted to contain date information. They also optimized the prompt template and fine-tuned the embedding model on the domain-specific data. In just three weeks, they improved system accuracy to 79%.
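The chunk-and-tag step might look roughly like the Python sketch below. It is an illustration only: the project described above used a trained helper model, while this example swaps in a simple regular-expression heuristic, and the `Chunk` class, the `contains_date` tag and the sample contract text are all invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# Regex stand-in for the date-tagging helper model (an assumption, not the
# original method): matches forms like 04/15/2021 or "March 1".
DATE_PATTERN = re.compile(
    rf"\b\d{{1,2}}/\d{{1,2}}/\d{{2,4}}\b|\b(?:{MONTHS})\s+\d{{1,2}}\b",
    re.IGNORECASE,
)


@dataclass
class Chunk:
    text: str
    tags: Set[str] = field(default_factory=set)


def chunk_by_paragraph(document: str) -> List[Chunk]:
    """Split on blank lines; real systems often chunk by section or clause."""
    return [Chunk(p.strip()) for p in document.split("\n\n") if p.strip()]


def tag_dates(chunks: List[Chunk]) -> List[Chunk]:
    """Add a 'contains_date' tag to every chunk the heuristic flags."""
    for chunk in chunks:
        if DATE_PATTERN.search(chunk.text):
            chunk.tags.add("contains_date")
    return chunks


contract = (
    "This agreement takes effect on March 1, 2021.\n\n"
    "Either party may terminate with written notice.\n\n"
    "The first invoice was issued 04/15/2021."
)

chunks = tag_dates(chunk_by_paragraph(contract))
dated = [c.text for c in chunks if "contains_date" in c.tags]
print(dated)
```

At query time, a retriever could use the tag as a metadata filter so that date questions only search chunks flagged as containing dates.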

Later work pushed the accuracy to 89%, but they achieved their first 54-point gain without modifying the off-the-shelf LLM at all. That’s the power of RAG’s modularity.

Better Data Development Builds Better LLM Systems

Our engineers’ 64-point accuracy gain, from 25% to 89%, would have been impossible without top-quality data development guided by our client’s subject matter expert.

To train the date-tagging model, we needed examples of passages that did and did not mention dates. Our engineers didn’t immediately know what kind of subtle date references to expect, but the subject matter expert did. The SME identified a small number of passages with oblique or subtle date references and wrote a brief explanation of why they tagged each one.

Our engineers then encoded the SME’s explanations as labeling functions in the Snorkel Flow AI data development platform. The platform quickly labeled a large number of documents, and our engineers then checked the accuracy of their labeling functions against the SME’s ground truth. This allowed them to identify shortcomings and iterate until they produced a high-quality dataset capable of training a high-accuracy helper model.
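The general idea of programmatic labeling can be sketched in plain Python. This is not the Snorkel Flow API: the labeling functions, the simple majority vote used to combine them and the five example passages are all hypothetical and written only to show the loop of encoding rules, labeling text and checking the result against SME ground truth.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

DATE, NO_DATE, ABSTAIN = 1, 0, -1
LabelingFunction = Callable[[str], int]


# Each function encodes one SME-style rule and abstains when it does not apply.
def lf_explicit_year(text: str) -> int:
    return DATE if re.search(r"\b(19|20)\d{2}\b", text) else ABSTAIN


def lf_relative_period(text: str) -> int:
    pattern = r"\b(anniversary|within \d+ days|fiscal year)\b"
    return DATE if re.search(pattern, text, re.IGNORECASE) else ABSTAIN


def lf_boilerplate_clause(text: str) -> int:
    pattern = r"governing law|confidential"
    return NO_DATE if re.search(pattern, text, re.IGNORECASE) else ABSTAIN


LFS: List[LabelingFunction] = [lf_explicit_year, lf_relative_period, lf_boilerplate_clause]


def majority_vote(text: str, lfs: List[LabelingFunction]) -> int:
    """Combine non-abstaining votes; abstain when nothing fires or on a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


# Invented passages with SME ground-truth labels.
SME_EXAMPLES: List[Tuple[str, int]] = [
    ("The term ends on 31 December 2025.", DATE),
    ("Renewal occurs on each anniversary of signing.", DATE),
    ("This agreement is subject to the governing law of Ohio.", NO_DATE),
    ("Notice must be given within 30 days.", DATE),
    ("The supplier shall deliver the goods promptly.", NO_DATE),
]

predicted = [majority_vote(text, LFS) for text, _ in SME_EXAMPLES]
covered = [(p, gold) for p, (_, gold) in zip(predicted, SME_EXAMPLES) if p != ABSTAIN]
coverage = len(covered) / len(SME_EXAMPLES)
accuracy = sum(p == gold for p, gold in covered) / len(covered)
print(f"coverage={coverage:.0%} accuracy_on_covered={accuracy:.0%}")
```

Passages that no function covers, like the last example here, point to the next rule worth writing, which is the kind of iteration the paragraph above describes.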

In the end, our client’s SME spent more time verifying the accuracy of the model than they did labeling data.

While this kind of data development is technically possible with non-programmatic approaches, it’s neither efficient nor practical.

Where Gemini 1.5’s Million-Token Context Window Fits In

While I would not recommend that any enterprise build a production LLM system that uses Gemini 1.5 Pro’s full context window, Google’s noteworthy achievement has a place in enterprise AI development.

Long context models will accelerate simpler and preproduction use cases. That’s a lot of enterprise AI today! Gemini and others will let data science teams complete proof-of-concept applications faster than they can now. Once they’ve proven the concept, they can move on to building a robust, modular and highly customized RAG-based application.

Customized RAG > Long Contexts in Production Applications

Gemini 1.5 represents a significant technical achievement. I applaud the researchers and engineers at Google for what they’ve done. Gemini and other long context models will hold an important place in enterprise AI. Allowing data science teams to handle challenging one-off questions and finish rough drafts of applications faster will yield real business value.

But, when it comes to production use cases, RAG will win out. Its modularity, multiple points of customizability and comparative cost-effectiveness make it the better choice for enterprise AI.

Shahebaz Mohammad is an Applied Machine Learning Engineer at Snorkel AI, the $1B company helping enterprises take AI initiatives into production 100x faster than their previous baseline. At Snorkel, he leads go-to-market efforts in Europe, the Middle East, and Africa....