URL: https://thenewstack.io/the-data-quality-problem-and-its-impact-on-application-performance/

The Data Quality Problem and Its Impact on Application Performance

How legacy monitoring and data warehousing tools are not keeping pace with the Big Data era.
Jun 4th, 2021 12:30pm by Manu Bansal
Manu Bansal is the co-founder and CEO of Lightup Data. He was previously co-founder of Uhana, an AI-based analytics platform for mobile operators that was acquired by VMware in 2019. He received his Ph.D. in Electrical Engineering from Stanford University.

You’ve seen it everywhere: your application is having major problems, but your IT and application performance monitoring tools have not flagged any issues. The wide range of outages affecting application performance points to growing problems in your data pipelines.

Consequently, data quality has become a hot topic again and new tools have started to appear. But why is this happening? Why do we need to revisit a problem that has been around as long as data itself, and that already has an incumbent stack of legacy tools?

Two words: Big data.

The growth in data volume over the past 10 years has created a tectonic shift in the requirements for data quality tools — and legacy tools don’t meet them anymore.

Here’s why.

Legacy Tools: How IDQ and Others Were Built Before Big Data

Legacy data quality tools were designed to serve a different world of data. Informatica Data Quality was released in 2001. Talend was released in 2005. Comparable tools arrived in the same window. But the world of “big data” was created by three events that arrived much later.

Event 1: The Birth of Big Data and ETL

ETL for big data began with Hadoop, which was released in 2006, but didn’t penetrate the mainstream Fortune 500 enterprise segment for another decade.

Event 2: The Birth of Cloud

Mainstream cloud adoption began with Amazon Web Services, which launched publicly in 2006, but its data warehouse, Redshift, did not become generally available until 2013.

Event 3: The Birth of the Cloud Data Warehouse and ELT

Cloud Data Warehouses (CDWs) made data warehousing accessible to everyone. But Snowflake wasn’t founded until 2012, followed by Databricks in 2013.

In Short: Legacy data quality tools were created long before big data arrived, so they were never designed to solve data quality in a big data world. While they have tried to catch up, they fundamentally do not meet the requirements created by the 44x increase in the volume of data produced from 2010 to 2020.

Fundamental Mismatch: 12 Requirements Legacy Tools Don’t Meet

Big data has made legacy tools ineffective across multiple requirements, including:

  1. Increased Data Volume. Legacy tools often load complete datasets before analyzing them. But big data lakes and warehouses have so much data that this approach is expensive, slow, or infeasible.
  2. Increased Data Cardinality. Legacy tools and manual approaches were not built to handle thousands of tables with hundreds or thousands of columns each.
  3. Increased Data Stochasticity. Legacy tools inspect individual data integrity violations. But this is untenable and meaningless when we have so much data volume and variety, and when one small issue can break many data elements.
  4. Continuous Flows of Data. Legacy tools can’t keep pace when data arrives every hour or minute and must be used right away, which means issues must be detected in near real time to prevent damage.
  5. Processing Pipelines. Legacy tools use legacy definitions of data quality. But automated ELT pipelines now have additional failure modes that are unique to that setting and are not covered by legacy data quality definitions.
  6. Changing Data Shapes. Legacy tools were designed before every organization became data-driven. But now data is deeply embedded in the product and analytics pipeline, and data models evolve as the product evolves.
  7. Dataflow Topology/Lineage. Legacy tools were built to run checks on a single master dataset. But we now have data pipelines with a dozen stages and many branches, which adds a spatial dimension to data quality problems.
  8. Timeseries Problems. Legacy tools were designed to measure data quality on a single batch of data using absolute criteria. But data now flows continuously in small batches, which adds a temporal dimension to data quality problems.
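
Items 3, 4, and 8 can be illustrated with a minimal sketch. It assumes data arrives in small batches of dictionaries and tracks one aggregate metric per batch, the null rate of a column, judging each new value against its own recent history rather than an absolute threshold. The names `null_rate`, `is_anomalous`, and `make_batch`, the `user_id` column, and the three-standard-deviation cutoff are all hypothetical choices for illustration and are not taken from any particular tool.

```python
from statistics import mean, stdev


def null_rate(batch, column):
    """Fraction of rows in the batch where the column is missing or None."""
    if not batch:
        return 0.0
    missing = sum(1 for row in batch if row.get(column) is None)
    return missing / len(batch)


def is_anomalous(history, value, threshold=3.0, min_history=5):
    """Flag a metric value that deviates from its trailing history.

    history: metric values from earlier batches, oldest first.
    Returns False until enough history exists to estimate a baseline.
    """
    if len(history) < min_history:
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Flat history: any change at all counts as a deviation.
        return value != baseline
    return abs(value - baseline) / spread > threshold


def make_batch(size, nulls):
    """Build a hypothetical batch with `nulls` rows missing user_id."""
    return [{"user_id": None}] * nulls + [
        {"user_id": i} for i in range(size - nulls)
    ]


# Five healthy batches with 1-3% nulls, then one batch where an
# upstream change drops user_id for half of the rows.
batches = [make_batch(100, n) for n in (1, 2, 1, 3, 2)]
batches.append(make_batch(100, 50))

history = []
flags = []
for batch in batches:
    rate = null_rate(batch, "user_id")
    flags.append(is_anomalous(history, rate))
    history.append(rate)
```

Here only the last batch is flagged. No single row is inspected on its own; the check looks at how one summary statistic moves over time, which is the temporal dimension item 8 describes.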

We have also experienced cultural changes that created their own new requirements.

  1. Collaboration. Data problems and solutions now touch everyone in the org.
  2. Consumerization. Every org now struggles with data volume and complexity.
  3. APIs. Platforms now need to be dev-friendly, automatable, and interoperable.
  4. Laws. Platforms must build architecture for security, compliance, and privacy.

These new requirements have been quietly building over the last decade, and have suddenly begun to drive new conversations around data quality for one core reason.

The Tipping Point: Why Now Is the Time to Revisit Data Quality

After a period of heavy flux in the ETL jungle, a new and stable ELT data stack has emerged. And the centerpiece of the new stack, the data warehouse, enforces fewer data integrity checks and constraints than traditional databases do.
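
When the warehouse may accept rows that a traditional database would reject, checks such as uniqueness and not-null have to run explicitly after loading. The following is a minimal sketch, assuming the rows have already been fetched as dictionaries; `find_violations` and the `order_id` key are hypothetical names used only for illustration.

```python
from collections import Counter


def find_violations(rows, key):
    """Report rows that a PRIMARY KEY constraint on `key` would have rejected."""
    values = [row.get(key) for row in rows]
    counts = Counter(v for v in values if v is not None)
    return {
        "null_keys": sum(1 for v in values if v is None),
        "duplicate_keys": sorted(v for v, n in counts.items() if n > 1),
    }


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 2, "amount": 5.5},     # same row loaded twice
    {"order_id": None, "amount": 7.0},  # missing key
]
report = find_violations(rows, "order_id")
# report == {"null_keys": 1, "duplicate_keys": [2]}
```

At warehouse scale the same counts would more plausibly be computed inside the warehouse with an aggregate query than by pulling rows into application memory; the in-memory version above only shows the logic of the check.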

At the same time that support for data quality is thinner than before, companies depend on their data more than before. Every company is now data-driven, nobody can afford bad data anymore, and the flaws in legacy tools are really starting to hurt.

In summary, it has become painfully obvious that too much has changed, that legacy tools do not work in the new world of data, and that we need to rethink the data quality problem from a clean slate.

Amazon Web Services and Snowflake are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Databricks, Shapes.