VOOZH about

URL: https://thenewstack.io/how-to-scale-rag-and-build-more-accurate-llms/

⇱ How to Scale RAG and Build More Accurate LLMs - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-06-10 10:45:21
How to Scale RAG and Build More Accurate LLMs
sponsor-confluent,sponsored-post-contributed,
Data / Large Language Models

How to Scale RAG and Build More Accurate LLMs

RAG must be implemented in a way that provides accurate and up-to-date information and in a governed manner that can be scaled across apps and teams.
Jun 10th, 2024 10:45am by Andrew Sellers
👁 Featued image for: How to Scale RAG and Build More Accurate LLMs
Image from Krot_Studio on Shutterstock.
Confluent sponsored this post.

Retrieval augmented generation (RAG) has emerged as a leading pattern to combat hallucinations and other inaccuracies that affect large language model content generation. However, RAG needs the right data architecture around it to scale effectively and efficiently.

A data streaming approach grounds the optimal architecture for supplying LLMs with the large volumes of continuously enriched, trustworthy data to generate accurate results. This approach also allows data and application teams to work and scale independently to accelerate innovation.

Foundational LLMs like GPT and Llama are trained on vast amounts of data and can often generate reasonable responses about a broad range of topics, but they do generate erroneous content. As Forrester noted recently, public LLMs “regularly produce results that are irrelevant or flat wrong,” because their training data is weighted toward publicly available internet data.

In addition, these foundational LLMs are completely blind to the corporate data locked away in customer databases, ERP systems, corporate Wikis and other internal data sources. This hidden data must be leveraged to improve accuracy and unlock real business value.

RAG allows data teams to contextualize prompts in real-time with domain-specific company data. Having this additional context makes it far more likely that the LLM will identify the right pattern in the data and provide a correct, relevant response. This is critical for popular enterprise use cases like semantic search, content generation or copilots, where outputs must be based on accurate, up-to-date information to be trustworthy.

Why Not Just Train an LLM on Company-Specific Data?

Current best practices for generative AI often necessitate creating foundation models by training billion-node transformers on massive amounts of data, making this approach prohibitively expensive for most organizations. For example, OpenAI has said it spent more than $100 million to train GPT-4. Research and industry are beginning to provide promising results for small language models and less expensive training methods, but those aren’t generalizable and commoditized yet.

Fine-tuning an existing model is another, less resource-intensive approach and may also become a good option in the future, but this technique still requires significant expertise to get right. One of the benefits of LLMs is that they democratize access to AI, but having to hire a team of Ph.D.s to fine-tune a model largely negates that benefit.

RAG is the best option today, but it must be implemented in a way that provides accurate and up-to-date information and in a governed manner that can be scaled across applications and teams. To see why an event-driven architecture is the best fit for this, it’s helpful to look at four patterns of GenAI application development.

Data Augmentation

An application must be able to pull relevant contextual information, which is typically achieved by using a vector database to look up semantically similar information typically encoded in semi-structured or unstructured text. This means gathering data from disparate operational stores and “chunking” it into manageable segments that retain its meaning. These chunks of information are then embedded into the vector database where they can be coupled with prompts.

An event-driven architecture is beneficial here because it’s a proven method for integrating disparate sources of data from across an enterprise in real-time to provide reliable and trustworthy information.

By contrast, a more traditional ETL (extract, transform, load) pipeline that uses cascading batch operations is a poor fit, because the information will often be stale by the time it reaches the LLM. An event-driven architecture ensures that when changes are made to the operational data store, those changes are carried over to the vector store that will be used to contextualize prompts. Organizing this data as streaming data products also promotes reusability, so these data transformations can be treated as composable components that can support data augmentation for multiple LLM-enabled applications.

Inference

Inference involves engineering prompts with data prepared in the previous steps and handling responses from the LLM. When a prompt from a user comes in, the application gathers relevant context from the vector database or an equivalent service to generate the best possible prompt.

Applications like ChatGPT often take a few seconds to respond, which is an eternity in distributed systems. Using an event-driven approach means this communication can take place asynchronously between services and teams. With an event-driven architecture, services can be decomposed along functional specializations, which allows application development teams and data teams to work separately to achieve their objectives of performance and accuracy.

Further, by having decomposed, specialized services rather than monoliths, these applications can be deployed and scaled independently. This helps decrease time to market since the new inference steps are consumer groups, and the organization can template infrastructure for instantiating these quickly.

Workflows

Reasoning agents and inference steps are often linked into sequences where the next LLM call is based on the previous response. This is useful in automating complex tasks where a single LLM call will not be sufficient to complete a process. Another reason for decomposing agents into chains of calls is because the popular LLMs today tend to return better results when we ask multiple, simpler questions, although this is changing.

As the example workflow below illustrates, with a data streaming platform, the web development team can work independently from the backend system engineers, allowing each team to scale according to its needs. The data streaming platform enables this decoupling of technologies, teams and systems.

👁 Diagram showing a retail example of data streaming and RAG.

Post-Processing

Despite our best efforts, LLMs can still generate erroneous results, so we need a way to validate outputs and enforce business rules to prevent those errors from causing harm.

Typically, LLM workflows and dependencies change much more quickly than the business rules that determine whether outputs are acceptable. In the example above, we again see good use of decoupling with a data streaming platform: The compliance team validating LLM outputs can operate independently to define the rules without needing to coordinate with the team building the LLM applications.

Conclusion

RAG is a powerful model for improving the accuracy of LLMs and making generative AI applications viable for enterprise use cases. But RAG is not a silver bullet. It needs to be surrounded by an architecture and data delivery mechanisms that allow teams to build multiple generative AI applications without reinventing the wheel, and in a manner that meets enterprise standards for data governance and quality.

A data streaming model is the simplest and most efficient way to meet these needs, allowing teams to unlock the full power of LLMs to drive new value for their business. As technology becomes the business and AI enhances this technology, those firms that compete effectively will incorporate AI to augment and streamline more and more processes.

By having a common operating model for RAG applications, the enterprise can bring the first use case to market quickly while also accelerating delivery and reducing costs for everyone that follows.

Check out our GenAI resource hub to learn how Confluent can power your GenAI journey.

Confluent, founded by the original creators of Apache Kafka, pioneered a complete data streaming platform that streams, connects, processes, and governs data as it flows throughout a business. With Confluent, any organization can modernize their business and run it in real-time.
Learn More
The latest from Confluent
TRENDING STORIES
Andrew Sellers leads Confluent's Technology Strategy Group, supporting strategy development, competitive analysis, and thought leadership. He has previously brought several AI-enabled commercial offerings to market as a technology leader. He is a co-inventor on over a dozen patents related to...
Read more from Andrew Sellers
Confluent sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: OpenAI.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.