VOOZH about

URL: https://thenewstack.io/clean-data-trusted-model-ensure-good-data-hygiene-for-your-llms/

⇱ Clean Data, Trusted Model: Ensure Good Data Hygiene for Your LLMs - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-05-03 10:00:41
Clean Data, Trusted Model: Ensure Good Data Hygiene for Your LLMs
contributed,
Data / Large Language Models / Security

Clean Data, Trusted Model: Ensure Good Data Hygiene for Your LLMs

The fact is, some data is just too risky to input into a model. Some can carry significant risks, such as privacy violations or biases.
May 3rd, 2024 10:00am by Chase Lee
👁 Featued image for: Clean Data, Trusted Model: Ensure Good Data Hygiene for Your LLMs
Image via Pixabay.

Large language models (LLMs) have emerged as powerful engines of creativity, transforming simple prompts into a world of possibilities.

But beneath their potential capacity lies a critical challenge. The data that flows into LLMs touches countless enterprise systems, and this interconnectedness poses a growing data security threat to organizations.

LLMs are nascent and not always completely understood. Depending on the model, their inner workings may be a black box, even to their creators — meaning that we can’t fully understand what will happen to the data we put in, and how or where it may come out.

To stave off risks, organizations will need to build infrastructure and processes that perform rigorous data sanitization of both inputs and outputs, and can monitor and canvas every LLM on an ongoing basis.

Model Inventory: Take Stock of What You’re Deploying

As the old saying goes, “You can’t secure what you can’t see.” Maintaining a comprehensive inventory of models throughout both production and development phases is critical to achieving transparency, accountability and operational efficiency.

In production, tracking each model is crucial for monitoring performance, diagnosing issues and executing timely updates. During development, inventory management helps track iterations, facilitating the decision-making process for model promotion.

To be clear, this is not a “record-keeping task” — a robust model inventory is absolutely essential in building reliability and trust in AI-driven systems.

Data Mapping: Know What Data You’re Feeding Models

Data mapping is a critical component of responsible data management. It involves a meticulous process to comprehend the origin, nature and volume of data that feeds into these models.

It’s imperative to know where the data originates, whether it contains sensitive information like personally identifiable information (PII) or protected health information (PHI), especially given the sheer quantity of data being processed.

Understanding the precise data flow is a must; this includes tracking which data goes into which models, when this data is utilized and for what specific purposes. This level of insight not only enhances data governance and compliance but also aids in risk mitigation and the preservation of data privacy. It ensures that machine learning operations remain transparent, accountable and aligned with ethical standards while optimizing the utilization of data resources for meaningful insights and model performance improvements.

Data mapping bears striking resemblance to compliance efforts often undertaken for regulations like the General Data Protection Regulation (GDPR). Just as GDPR mandates a thorough understanding of data flows, the types of data being processed and their purpose, the data mapping exercise extends these principles to the realm of machine learning. By applying similar practices to both regulatory compliance and model data management, organizations can ensure that their data practices adhere to the highest standards of transparency, privacy and accountability across all facets of operations, whether it’s meeting legal obligations or optimizing the performance of AI models.

Data Input Sanitation: Weed out Risky Data

“Garbage in, garbage out” has never rung truer than with LLMs. Just because you have vast troves of data to train a model doesn’t mean you should do so. Whatever data you use should have a reasonable and defined purpose.

The fact is, some data is just too risky to input into a model.Some can carry significant risks, such as privacy violations or biases.

It is crucial to establish a robust data sanitization process to filter out such problematic data points and ensure the integrity and fairness of the model’s predictions. In this era of data-driven decision-making, the quality and suitability of the inputs are just as vital as the sophistication of the models themselves.

One method rising in popularity is adversarial testing on models. Just as selecting clean and purposeful data is vital for model training, assessing the model’s performance and robustness is equally crucial in the development and deployment stages. These evaluations help detect potential biases, vulnerabilities or unintended consequences that may arise from the model’s predictions.

There’s already a growing market of startups specializing in providing services for precisely this purpose. These companies offer invaluable expertise and tools to rigorously test and challenge models, ensuring they meet ethical, regulatory and performance standards.

Data Output Sanitation: Ensure Trust and Coherence

Data sanitation isn’t limited to just the inputs in the context of large language models; it extends to what’s generated as well. Given the inherently unpredictable nature of LLMs, the output data requires careful scrutiny in order to establish effective guard rails.

The outputs should not only be relevant but also coherent and sensible within the context of their intended use. Failing to ensure this coherence can swiftly erode trust in the system, as nonsensical or inappropriate responses can have detrimental consequences.

As organizations continue to embrace LLMs, they will need to pay close attention to the sanitation and validation of model outputs in order to maintain the reliability and credibility of any AI-driven systems.

The inclusion of a diverse set of stakeholders and experts when creating and maintaining the rules for outputs and when building tools to monitor outputs is are key steps toward successfully safeguarding models.

Putting Data Hygiene into Action

Using LLMs in a business context is no longer an option; it’s essential to stay ahead of the competition. This means organizations will have to establish measures to ensure model safety and data privacy. Data sanitization and meticulous model monitoring are a good start, but the landscape of LLMs evolves quickly. Staying abreast of the latest and greatest, as well as regulations, will be key to making continuous improvements to your processes.

TRENDING STORIES
Chase Lee currently serves as GM of Enterprise at Vanta, the hypergrowth security and compliance management platform. Vanta acquired Trustpage in January 2023 where Chase was founder and CEO. Prior to starting Trustpage in 2020, Chase was the VP of...
Read more from Chase Lee
SHARE THIS STORY
TRENDING STORIES
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.