URL: https://thenewstack.io/3-reasons-data-engineers-are-the-unsung-heroes-of-genai/

3 Reasons Data Engineers Are the Unsung Heroes of GenAI - The New Stack



3 Reasons Data Engineers Are the Unsung Heroes of GenAI

As organizations add AI to their products, data engineers will be integral to expanding infrastructure and governance to include new models and technology.
May 1st, 2024 10:00am by Barr Moses

Over the last 18 months, advancements in generative AI have created an insatiable appetite for the technology among boards and business leaders. As of September, 87% of C-suite executives surveyed by IDC said they were at least exploring potential use cases. And 77% of business leaders fear they’re already missing out on the benefits of GenAI, according to a November 2023 report from Salesforce.

But data leaders understand that no matter how much FOMO their CEOs experience after watching a flashy demo, implementing the latest LLMs has to be done thoughtfully. To deliver meaningful business value, those models need to be supplied with quality data — while maintaining security, privacy and scalability.

In most organizations, there are key contributors already doing that work: data engineers. And given the current state of how companies achieve enterprise-ready AI, data engineers will be increasingly essential going forward.

The Essential Role of Data Engineers in Enterprise AI

Within any modern data team, data engineers are responsible for building and maintaining the underlying infrastructure of the data stack. Their pipelines and workflows enable applications, analysts, business consumers and data scientists to access and use the data they need to get their work done.

As organizations begin to layer generative AI into their products, data engineers will be integral to expanding existing infrastructure and governance to encompass the latest models and technologies. Let’s explore three specific ways data engineers will contribute to AI success.

1. Facilitate RAG to Improve LLM Outputs

At this moment, most organizations achieving success with GenAI are using retrieval-augmented generation (RAG). This involves incorporating a knowledge source or dataset into their generative process — giving an LLM access to a dynamic database while responding to prompts. For example, with RAG fully implemented, a consumer-facing chatbot would be able to pull specific customer data to reference during a support interaction.
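
To make the retrieval step concrete, here is a minimal sketch in Python. It assumes a toy in-memory knowledge base and uses simple word overlap as a stand-in for the vector search a production system would likely use. The `Document` class, the sample records and the prompt wording are all hypothetical, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return {word.strip(".,?!").lower() for word in text.split()} - {""}


def retrieve(query: str, documents: list[Document], top_k: int = 2) -> list[Document]:
    """Rank documents by shared words with the query (a stand-in for vector search)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc.text)), doc) for doc in documents]
    # Keep only documents with at least one overlapping word, best match first.
    matches = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(query: str, context: list[Document]) -> str:
    """Place the retrieved passages ahead of the question for the model to use."""
    context_block = "\n".join(f"[{doc.doc_id}] {doc.text}" for doc in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


# Hypothetical support records, invented for illustration.
KNOWLEDGE_BASE = [
    Document("order-1001", "Order 1001 shipped on Monday and arrives Thursday."),
    Document("policy-7", "Refunds are issued within five business days."),
    Document("faq-3", "Password resets are sent by email."),
]

question = "When does order 1001 arrive?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE, top_k=1))
```

The resulting `prompt` string is what would be sent to the model. Because the context is fetched at request time, updating the knowledge base changes the answers without retraining anything.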

For most use cases, RAG is a better fit than fine-tuning, which retrains an existing LLM on a smaller, specific dataset. Fine-tuning requires considerable computational resources and large volumes of data, and typically involves a higher risk of overfitting.

Effectively implementing RAG requires quality data pipelines that feed company data to AI models. Data engineers are responsible for ensuring:

  • The database is accurate and relevant, with regular updates and quality checks
  • Retrieval processes are optimized and prompts are addressed with correct and contextually appropriate data
  • Data inputs are continuously monitored and refined through data observability
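
As one illustration of the first responsibility in the list above, a pipeline might hold back empty or stale records before they reach the retrieval index. The sketch below is an assumption about how such a gate could look, not a description of any particular tool. The record fields (`id`, `text`, `updated_at`) and the 90-day freshness limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_quality_issues(record: dict, now: datetime, max_age: timedelta) -> list[str]:
    """Return the reasons a record should be held back from the retrieval index."""
    issues = []
    if not str(record.get("text", "")).strip():
        issues.append("empty text")
    updated_at = record.get("updated_at")
    if updated_at is None:
        issues.append("missing updated_at")
    elif now - updated_at > max_age:
        issues.append("stale")
    return issues


def split_by_quality(records: list[dict], now: datetime, max_age: timedelta):
    """Partition records into those safe to index and those needing review."""
    accepted, rejected = [], []
    for record in records:
        issues = find_quality_issues(record, now, max_age)
        if issues:
            rejected.append((record["id"], issues))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical records; the fixed timestamp keeps the example reproducible.
NOW = datetime(2024, 5, 1, tzinfo=timezone.utc)
records = [
    {"id": "a", "text": "Plan pricing table", "updated_at": NOW - timedelta(days=1)},
    {"id": "b", "text": "Old pricing table", "updated_at": NOW - timedelta(days=400)},
    {"id": "c", "text": "   ", "updated_at": NOW - timedelta(days=2)},
]
accepted, rejected = split_by_quality(records, NOW, max_age=timedelta(days=90))
```

Here only record `a` is accepted. Returning the reasons alongside each rejected ID gives the team something to alert on, rather than silently dropping data.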

The preference for RAG may change as the technology evolves, but for now, it’s generally considered the most practical path forward for enterprise AI. It also helps reduce hallucinations and inaccuracies, while improving transparency for data teams.

2. Maintain Security and Privacy

Data engineers already play a key role in data governance, making sure that databases have the proper built-in roles and security controls to ensure privacy and compliance. When RAG is implemented, those controls need to be extended and applied consistently throughout the pipelines.

For instance, a company’s LLM shouldn’t be using any customer data for its own training, and a customer-facing chatbot must confirm a user’s identity and permissions before sharing sensitive data. Data engineers play a pivotal role in maintaining compliance with regulations and best practices.
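
One way to picture the chatbot requirement is to filter records by identity and ownership before anything reaches the retrieval step. The following Python sketch assumes a simplified model in which each record has a single owner and identity verification has already happened upstream. The `User` and `Record` classes and their fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    owner_id: str  # the customer this record belongs to
    text: str


@dataclass(frozen=True)
class User:
    user_id: str
    verified: bool  # set by an upstream identity check (not shown)


def authorized_records(user: User, records: list[Record]) -> list[Record]:
    """Return only the records this user may see; unverified users get nothing."""
    if not user.verified:
        return []
    return [record for record in records if record.owner_id == user.user_id]


# Hypothetical customer data, invented for illustration.
RECORDS = [
    Record("r1", "alice", "Alice's billing address"),
    Record("r2", "bob", "Bob's billing address"),
]

visible_to_alice = authorized_records(User("alice", verified=True), RECORDS)
visible_to_guest = authorized_records(User("alice", verified=False), RECORDS)
```

Applying the filter before retrieval, rather than after generation, means sensitive text never enters the prompt in the first place.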

3. Deliver Reliable, High-Quality Data

Ultimately, the success of GenAI depends on data quality. Without accurate, reliable data consistently made available to LLMs, even the most advanced models won’t produce useful outputs.

Over the last five years, leading data engineers have adopted observability tooling — including automated monitoring and alerting, similar to DevOps observability software — to help improve data quality. Observability helps data teams monitor and proactively respond to incidents like failed Airflow jobs, broken APIs and misformatted third-party data that put data health at risk. And with end-to-end data lineage, teams gain visibility into upstream and downstream dependencies.
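
A simple example of automated monitoring is a volume check that flags a pipeline run whose row count deviates sharply from recent history. The sketch below uses a basic z-score test from Python's standard library. The threshold of 3.0 and the sample row counts are assumptions for illustration, and real observability tools typically use more sophisticated models.

```python
from statistics import mean, stdev


def volume_alert(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Return True when the latest row count is an outlier versus recent runs."""
    if len(history) < 2:
        return False  # not enough history to estimate normal variation
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past run had the same count, so any change is worth flagging.
        return latest != center
    return abs(latest - center) / spread > threshold


# Hypothetical daily row counts from a pipeline's recent runs.
recent_runs = [10_000, 10_200, 9_900, 10_100, 9_800]

dropped_load = volume_alert(recent_runs, latest=4_000)    # large drop, likely incident
normal_load = volume_alert(recent_runs, latest=10_050)    # within normal variation
```

A check like this would catch the kind of silent failure described above, such as a third-party feed that delivers only a fraction of its usual rows.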

Data engineers can provide transparency when observability tooling is applied across the modern AI stack, including vector databases. Lineage allows engineers to trace data from its source as it is converted to embeddings, which are then used to generate the text that the LLM puts in front of the user. This visibility helps data teams understand how LLMs operate, improve their outputs, and quickly troubleshoot incidents.
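
In practice, this kind of lineage often comes down to carrying source metadata alongside each embedded chunk. The Python sketch below assumes each chunk records the table and row it came from, so that the chunks used in a response can be traced upstream. The `Chunk` class, table names and IDs are hypothetical, and the embedding vectors themselves are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_table: str  # upstream table the text was extracted from
    source_row: int    # row within that table
    text: str


def trace_sources(chunk_ids: list[str], index: dict[str, Chunk]) -> list[tuple[str, int]]:
    """Map chunk ids used in a response back to their upstream (table, row) pairs."""
    sources = {
        (index[chunk_id].source_table, index[chunk_id].source_row)
        for chunk_id in chunk_ids
        if chunk_id in index  # ignore ids that are no longer in the index
    }
    return sorted(sources)


# Hypothetical chunk index, invented for illustration.
CHUNK_INDEX = {
    "c1": Chunk("c1", "support_tickets", 42, "Customer reported a login error."),
    "c2": Chunk("c2", "support_tickets", 42, "Issue resolved after password reset."),
    "c3": Chunk("c3", "billing_notes", 7, "Invoice corrected in April."),
}

upstream = trace_sources(["c1", "c2", "c3", "missing"], CHUNK_INDEX)
```

If a response turns out to be wrong, the returned pairs point to the exact upstream rows to inspect.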

As Vishnu Ram, VP of engineering at CreditKarma, told us: “We need to be able to observe the data. We need to understand what data we’re putting into the LLM, and if the LLM is coming up with its own thing, we need to know that — and then know how to deal with that situation. If you don’t have observability of what goes into the LLM and what comes out, you’re screwed.”

Data Engineers Are the Future of AI-Driven Organizations

AI technologies are evolving at a head-spinning pace. But even as fine-tuning models and more advanced custom training become feasible for enterprises, the need to ensure data quality, security and privacy will not change.

As organizations invest in generative AI applications, the quality and availability of their data will be more valuable than ever before. That means data engineering workflows and processes may change, but their importance within organizations is only beginning to grow.

Barr Moses is CEO & Co-Founder of Monte Carlo, a data reliability company and creator of the data observability category, backed by Accel, GGV, Redpoint, ICONIQ Growth, Salesforce Ventures, IVP, and other top Silicon Valley investors. Previously, she was VP...