URL: https://thenewstack.io/a-call-to-use-generative-ai-to-create-more-trustworthy-data/

How Developers Can Use Generative AI to Improve Data Quality

While generative AI is driving the need for stronger data governance, it can also help to meet that need.
Aug 22nd, 2024 12:00pm by Andrew Sellers
Confluent sponsored this post.

It sounds counterintuitive — using a technology that has trust issues to create more trustworthy data. But smart engineers can put generative AI to work to improve the quality of their data, allowing them to build more accurate and trustworthy AI-powered applications.

Generative AI models are remarkable for their ability to answer questions in human-like sentences, but they are prone to hallucination, and they can’t derive insight from internal company data that wasn’t part of their training. Yet this internal data is critical for many enterprise use cases.

Imagine an AI chatbot that tells employees how many days of PTO they have left or one that tells airline customers if they’re eligible for a seat upgrade. These use cases require precise responses, and machine learning engineers need access to accurate, timely data to maximize the value of generative AI in business.

Data governance can play a key role here, helping to manage the operational and reputational risks that can result from improper AI decision-making. Specifically, by applying metadata that describes the structure and provenance of data and how it should be used, data teams can ensure data quality and improve the accuracy of generative AI-powered applications. This extends beyond the business domain to emerging compliance frameworks, which require policies to ensure data integrity, security and accountability.

Creating this metadata is time-consuming work for data producers, however, which means busy data teams often cut corners or don’t create it at all. For an analogy, you may remember that Tim Berners-Lee once called for the creation of a “semantic web,” where web content would be much more useful because it was described in a machine-readable form. This required sites to manually tag their content, which mostly never happened. That’s not unlike the governance problem data teams face today.

But while generative AI is driving the need for stronger data governance, it can also help to meet that need. Given examples of how data should be labeled, a generative AI model can create the required metadata automatically. A human will still need to review the results, but the process will be much less laborious than creating metadata from scratch.

Get Started With a Data Product Mindset 

The need for high-quality data doesn’t apply only to generative AI. As data becomes more important for all types of analysis, there’s been an accompanying surge of interest in building unified data catalogs that make it easier for other teams to discover and use data. When teams use generative AI to create metadata and a data-streaming platform to create reusable data products, data becomes much more available, boosting innovation and productivity.

This metadata includes machine-readable information, such as a data schema and field descriptions, as well as human-readable information, such as who created the data and how it should be used. The key is to provide sufficient information so that someone elsewhere in the organization who wants to consume a data asset will know where it originated, how it can be used, any associated service-level agreement (SLA) and its degree of trustworthiness.
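As a rough illustration of the kinds of information listed above, the following sketch models a data product's metadata as a small Python record. The class names, field names, the trust scale and the example values are all hypothetical and do not come from any particular data catalog or platform.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldSpec:
    """Machine-readable description of one field in the data set."""
    name: str
    type: str
    description: str


@dataclass
class DataProductMetadata:
    """Hypothetical metadata record for a reusable data product."""
    name: str
    owner: str          # who created the data
    origin: str         # where the data originated
    intended_use: str   # how the data should be used
    sla: str            # associated service-level agreement
    trust_level: str    # assumed scale: "draft", "reviewed" or "certified"
    fields: list[FieldSpec] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested FieldSpec entries to dicts.
        return json.dumps(asdict(self), indent=2)


pto_balance = DataProductMetadata(
    name="employee_pto_balance",
    owner="hr-data-team",
    origin="payroll system export",
    intended_use="Answering employee questions about remaining PTO",
    sla="Updated within one hour of a payroll change",
    trust_level="reviewed",
    fields=[
        FieldSpec("employee_id", "string", "Unique employee identifier"),
        FieldSpec("days_remaining", "float", "PTO days left this year"),
    ],
)

print(pto_balance.to_json())
```

Keeping the machine-readable parts (the field list) and the human-readable parts (owner, intended use, SLA) in one record means a consumer elsewhere in the organization can find both in the same place.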

The foundational element of data governance is a schema — specific metadata that describes the structure of data. If we present a generative AI model with enough examples of the data being collected or the code that produces it, the model can induce the schema.
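To make "inducing a schema from examples" concrete, here is a minimal sketch that derives field types, nullability and required status from a handful of sample records. It is a deterministic stand-in written for illustration, not a generative AI model, and the function name and sample records are hypothetical. A model would be expected to go further, for example by proposing field descriptions.

```python
from collections import defaultdict


def infer_schema(records):
    """Infer field types, nullability and required status from examples."""
    seen_types = defaultdict(set)
    present_count = defaultdict(int)

    for record in records:
        for key, value in record.items():
            present_count[key] += 1
            type_name = "null" if value is None else type(value).__name__
            seen_types[key].add(type_name)

    schema = {}
    for key in sorted(seen_types):
        types = seen_types[key]
        schema[key] = {
            "types": sorted(types - {"null"}),
            "nullable": "null" in types,
            # A field is required only if every example contains it.
            "required": present_count[key] == len(records),
        }
    return schema


samples = [
    {"employee_id": "E100", "days_remaining": 12.5, "manager": "E007"},
    {"employee_id": "E101", "days_remaining": 3, "manager": None},
    {"employee_id": "E102", "days_remaining": 8.0},
]

schema = infer_schema(samples)
for name, spec in schema.items():
    print(name, spec)
```

In this sample, `days_remaining` is seen as both `int` and `float`, and `manager` is sometimes missing. Ambiguities like these are where more examples, or a person who knows the data, are needed to settle the schema.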

This process works best when the metadata is created at the time the data is produced. We can retroactively run a generative AI program over older data sets to induce metadata, but the results may have lower fidelity because the original schema evolved over time. Metadata created when data is produced tends to describe the underlying data set more accurately.

Keep Humans in the Loop

Human review is needed because of the limitations of current AI. The AI will be good at seeing patterns, but it may not be able to generalize the entire schema from the limited set of examples it has been shown. AI has not yet fully replicated expert intuition and understanding, so human expertise complements the volume of information that AI can rapidly process. We know there are 12 months in the year, that there are 50 states in the United States and that street addresses typically require a street number, and that knowledge lets us easily spot mislabeled data. The AI process may make errors because it lacks this basic knowledge or because it hasn’t seen enough examples. However, a human can quickly fix these mistakes before nonconformant data is used by engineers downstream, and still save a lot of time and effort.
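Some of this basic knowledge can be written down as simple checks that route suspicious records to a reviewer. The sketch below encodes the three examples from the paragraph above. The record layout (`month`, `state`, `street`) and the function name are hypothetical, and because addresses only typically start with a street number, a flag here means "ask a human," not "reject."

```python
import re

# Two-letter postal codes for the 50 U.S. states.
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)


def review_flags(record):
    """Return reasons a record should be looked at by a human reviewer."""
    flags = []

    month = record.get("month")
    if not (isinstance(month, int) and 1 <= month <= 12):
        flags.append(f"month out of range: {month!r}")

    state = record.get("state")
    if state not in US_STATES:
        flags.append(f"unknown state code: {state!r}")

    street = str(record.get("street") or "")
    # Expect a leading street number followed by the rest of the address.
    if not re.match(r"\d+\s+\S", street):
        flags.append(f"street address has no street number: {street!r}")

    return flags


good = {"month": 8, "state": "CA", "street": "899 W Evelyn Ave"}
bad = {"month": 13, "state": "ZZ", "street": "Main Street"}

print(review_flags(good))
for reason in review_flags(bad):
    print(reason)
```

Checks like these do not replace the reviewer. They narrow the review to the records most likely to be mislabeled.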

To make this work well, producers of data need to adhere to the data policies that the organization has established. In addition, when a schema evolves, you may need to adjust the model to reflect the new schema. The choice of LLM matters, but it is less important than the workflows that support data curation and the contextualization of the system prompt. For the best results, the model needs not only examples of the data set or production code, but also guidance for the metadata you want it to create.
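One way to combine guidance and examples is a few-shot prompt: a system prompt carrying the organization's metadata guidance, followed by labeled examples and then the new records. The sketch below only assembles the prompt text and deliberately calls no model API. The function name, the prompt wording and the `system`/`user` layout are assumptions, so adapt them to whichever LLM client you use.

```python
import json


def build_metadata_prompt(guidance, labeled_examples, new_records):
    """Assemble a few-shot prompt asking a model to draft metadata."""
    system = (
        "You write metadata for data sets. Follow the guidance below "
        "and answer with JSON only.\n\nGuidance:\n" + guidance
    )

    parts = []
    # Each labeled example pairs sample records with reviewed metadata.
    for example in labeled_examples:
        parts.append("Records:\n" + json.dumps(example["records"], indent=2))
        parts.append("Metadata:\n" + json.dumps(example["metadata"], indent=2))

    # End with the new records and an open slot for the model to fill.
    parts.append("Records:\n" + json.dumps(new_records, indent=2))
    parts.append("Metadata:")

    return {"system": system, "user": "\n\n".join(parts)}


prompt = build_metadata_prompt(
    guidance="Every field needs a type and a one-sentence description.",
    labeled_examples=[
        {
            "records": [{"employee_id": "E100", "days_remaining": 12.5}],
            "metadata": {
                "employee_id": {"type": "string", "description": "Unique employee identifier."},
                "days_remaining": {"type": "float", "description": "PTO days left this year."},
            },
        }
    ],
    new_records=[{"flight": "XY123", "upgrade_eligible": True}],
)

print(prompt["system"])
print(prompt["user"])
```

Whatever the model returns should still go to a human reviewer before it is published to a catalog.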

A Data-Streaming Platform Is the Optimal Pattern

Recall the Semantic Web: its vision of making the web machine-readable was never realized in the way its creators envisioned. Yet the web became machine-readable in a way that few foresaw in the early 2000s, because machine learning got far better at understanding media created for humans. In a similar way, better machine learning now offers an alternative to manually completing the rote tasks necessary for data governance.

Applying generative AI in this way requires a platform to work with, and a data-streaming platform that can process data generated in real time is a good fit. Data-streaming platforms are designed from the ground up to present data in a consumable way, which makes them an efficient environment for applying metadata at the time of production and for creating data products that can be reused in other applications.

A data streaming platform also helps to ensure that governance controls and metadata are incorporated into a common data catalog for discovery and reuse.

The rapid emergence of generative AI has created a critical need for high-quality data and data governance, but it has also provided a solution. In time, generative AI may be able to take on additional governance tasks, such as applying data policies, but it’s not ready for that yet in general.

Nevertheless, generative AI can help eliminate much of the rote work for defining and applying schema and other important data characteristics, creating a virtuous cycle that increases the quality of generative AI-powered applications and makes data much more widely available for reuse.

Industry and academia are beginning to define what AI governance should look like, but it’s still an emerging concept. Practitioners lack a consensus definition of what AI governance entails, let alone anything resembling a framework. But we can say for sure that AI governance depends on data governance, which helps engineers trust the data they use to build generative AI applications.

In the future, I would like to see the industry further define what AI governance should look like, and data infrastructure vendors bring more focus to integrating generative AI into tools and abstractions that promote better data quality.

Confluent, founded by the original creators of Apache Kafka, pioneered a complete data streaming platform that streams, connects, processes, and governs data as it flows throughout a business. With Confluent, any organization can modernize their business and run it in real-time.
Andrew Sellers leads Confluent's Technology Strategy Group, supporting strategy development, competitive analysis, and thought leadership. He has previously brought several AI-enabled commercial offerings to market as a technology leader. He is a co-inventor on over a dozen patents related to...