URL: https://thenewstack.io/build-data-factories-not-data-warehouses/

Build Data Factories, Not Data Warehouses - The New Stack


2022-04-12 10:00:17
Build Data Factories, Not Data Warehouses
We don’t store data in warehouses; we operate data factories without quality controls.
Apr 12th, 2022 10:00am by Jeremy Stanley
Feature image via Pixabay.

The data warehouse is a broken metaphor in the modern data stack.

We aren’t loading indistinguishable pallets of data into virtual warehouses, where we stack them in neat rows and columns and then forklift them out onto delivery trucks.

Instead, we feed raw data into factories filled with complex assembly lines connected by conveyor belts. Our factories manufacture customized and evolving data products for various internal and external customers.

As a business operating a data factory, our primary concerns should be:

  • Is the factory producing high-quality data products?
  • How much does it cost to run our factory?
  • How quickly can we adapt our factory to changing customer needs?

Cloud data warehouses like Amazon Web Services’ Redshift, Snowflake, Google’s BigQuery, and Databricks have reduced data factory operating costs. Orchestration tools like Airflow and data transformation frameworks like dbt have made redesigning components in our factory easier.

But we often miss data quality issues, resulting in bad decisions and broken product experiences. Or end users catch them at the last minute, leading to fire drills and eroded trust.

Data Quality Control Priorities

Jeremy Stanley
Jeremy Stanley is the co-founder and CTO of Anomalo, where he helps companies improve the quality and reliability of their data. Most recently, he was the VP of data science at Instacart, where he focused on machine learning for logistics and marketplace discovery. Prior to that, he led data science and engineering at Sailthru, building personalization tools for e-commerce and publishing companies. Before Sailthru, he was responsible for creating advertising optimization and bidding technology at Collective. His early experience with data, machine learning and strategy began at EY.

To establish data quality control in our metaphorical factory, we could test at four points:

  • The raw materials that arrive in our factory.
  • The machine performance at each step in the line.
  • The work-in-progress material that lands between transformation steps.
  • The final products we ship to internal or external customers.

These testing points are not equally important. As a factory operator, the most critical quality test is at the end of the line. Factories have dedicated teams that sample finished products and ensure they meet rigorous quality standards.

The same holds for data. We don’t know if the data we produce is high quality until we have tested the finished product. For example:

  • Did a join introduce duplicate rows?
  • Did a malformed column cause missing values?
  • Are timestamps inconsistently recorded?
  • Has a change in query logic affected business metrics?
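The four questions above can each be expressed as a small check on the finished table. What follows is a minimal sketch in plain Python, not a method described in this article: the `orders` rows, the column names, and the expected timestamp format are all hypothetical, and a real table would be checked in the warehouse rather than in memory.

```python
from collections import Counter
from datetime import datetime

def duplicate_keys(rows, key):
    """Return key values that appear more than once (e.g. after a fan-out join)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def null_rate(rows, column):
    """Fraction of rows where the column is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def unparseable_timestamps(rows, column, fmt="%Y-%m-%dT%H:%M:%S"):
    """Return timestamp strings that do not match the expected format.

    Assumes timestamps are stored as strings; None values are skipped.
    """
    bad = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad

def metric_shift(previous, current):
    """Relative change of a business metric between two runs."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous

# Hypothetical finished table: intended to hold one row per order.
orders = [
    {"order_id": 1, "total": 20.0, "created_at": "2022-04-12T10:00:17"},
    {"order_id": 2, "total": None, "created_at": "2022-04-12T10:05:00"},
    {"order_id": 2, "total": 35.0, "created_at": "04/12/2022 10:05"},
    {"order_id": 3, "total": 15.0, "created_at": "2022-04-12T11:30:00"},
]

print(duplicate_keys(orders, "order_id"))            # duplicated order ids
print(null_rate(orders, "total"))                    # share of missing totals
print(unparseable_timestamps(orders, "created_at"))  # inconsistent timestamps
print(metric_shift(previous=100.0, current=90.0))    # change in a metric
```

Each function answers one question and returns evidence (the offending keys or values) rather than a bare pass or fail, which makes the result easier to act on.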

After validating the quality of our final product, we should ensure we are consuming high-quality raw materials. Identifying defects in raw data arriving into the factory will save us time and effort in root-causing issues later.
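One way to picture a raw-material inspection is a gate that sorts incoming records into accepted and rejected piles and records why each reject failed. The sketch below is illustrative only: `EXPECTED_SCHEMA`, its field names, and the helper functions are assumptions made for this example, not part of any tool named in this article.

```python
# Hypothetical schema for an incoming event feed: field name -> expected type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}

def find_defects(record, schema=EXPECTED_SCHEMA):
    """List the reasons a raw record fails the expected schema."""
    defects = []
    for field, expected_type in schema.items():
        value = record.get(field)
        if value is None:
            defects.append(f"{field}: missing")
        elif not isinstance(value, expected_type):
            # Caveat: bool is a subclass of int in Python, so True passes as int.
            defects.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return defects

def split_raw_batch(records, schema=EXPECTED_SCHEMA):
    """Separate clean records from defective ones, keeping the reasons."""
    accepted, rejected = [], []
    for record in records:
        defects = find_defects(record, schema)
        if defects:
            rejected.append((record, defects))
        else:
            accepted.append(record)
    return accepted, rejected

# Hypothetical raw batch arriving at the factory door.
batch = [
    {"user_id": 1, "event": "click", "amount": 2.5},
    {"user_id": "2", "event": "click", "amount": 1.0},  # id arrived as text
    {"user_id": 3, "event": "view"},                    # amount is absent
]

accepted, rejected = split_raw_batch(batch)
for record, defects in rejected:
    print(record, defects)
```

Keeping the defect reasons next to each rejected record is what shortens root-cause analysis later: the defect is attributed to the source before any transformation has touched it.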

Insufficient Investments

Unfortunately, to date, most investment in testing our data factories has been the equivalent of evaluating machine performance or visualizing floor plans:

  • We monitor data infrastructure for uptime and responsiveness.
  • We monitor Airflow tasks for exceptions and run times.
  • We apply rule-based tests with dbt to check the logic of transformations.
  • We analyze data lineage to build complex maps of data factory floors.

These activities are helpful, but we have put the cart before the horse! We should first ensure that our factory produces and ingests high-quality data. From conversations I have had with hundreds of data teams, I believe we have failed to do so for three reasons:

1. We use the tools we have at hand.
Engineering teams have robust tools and best practices for monitoring the operations of web and backend applications. We can use these existing tools to monitor the infrastructure and orchestration for our data factory. However, these tools are incapable of monitoring the data itself.

2. We have tasked machine operators with quality control.
The burden of data quality often falls on the backs of the data and analytics engineers operating the machines in the factory. They are experts in the tools and logic used to transform the data. They may write tests to ensure their transformations are correct, but they can overlook issues that arise upstream or downstream of their own processing.

3. Testing data well is difficult.
Our data factories produce thousands of incredibly diverse data tables with hundreds of meaningful columns and segments. The data in these tables constantly changes for reasons that range from “expected” to “entirely out of our control.” Simplistic testing strategies frequently miss real issues, and complex strategies are hard to maintain. Poorly calibrated tests can spam users with false-positive alerts, leading to alert fatigue.
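To make the calibration problem concrete, compare a fixed row-count floor with a check calibrated against recent history. This is a deliberately simple sketch: the z-score rule is a stand-in chosen for brevity, not the approach this article advocates, and the numbers, function names, and thresholds are hypothetical. It also ignores trend and seasonality, which real tables often exhibit.

```python
from statistics import mean, stdev

def static_check(row_count, minimum=1000):
    """Simplistic rule: pass if the table has at least `minimum` rows."""
    return row_count >= minimum

def calibrated_check(row_count, history, z_threshold=3.0):
    """Pass if today's count is within z_threshold standard deviations
    of recent history. Needs at least two historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return row_count == mu
    return abs(row_count - mu) / sigma <= z_threshold

# Hypothetical daily row counts for the past week.
history = [10_000, 10_200, 9_900, 10_100, 9_800, 10_050, 9_950]

# Half of the data went missing today.
print(static_check(5_000))               # True: the fixed floor misses the drop
print(calibrated_check(5_000, history))  # False: far outside the normal range

# An ordinary day passes both checks.
print(calibrated_check(10_150, history))  # True
```

The fixed floor is easy to maintain but blind to a 50% drop. The calibrated check catches it, at the cost of needing history and a threshold that must itself be tuned to avoid false positives.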

Data Quality Control Needs

We need purpose-built tools to monitor and assess the quality of data arriving into or exiting our data factories.

We should place these tools into the hands of data consumers, the subject matter experts who deeply care about the quality of the data they use. These consumers should be able to quickly test their data and monitor their key metrics, with or without code.

Our data quality tools must scale to cover thousands of tables, with billions of rows, across hundreds of teams, in daily batch processes or real-time flows.

The algorithms used should be flexible enough to handle data from diverse applications and industries. They should gracefully adapt to different tabular structures, data granularity and table update mechanics. We should automate testing to avoid burdening data consumers with busy work.

We should avoid creating alert fatigue by minimizing false positives through notification controls, feedback loops and robust predictive models. When issues arise, we should visually explain them by leveraging context in the data and upstream data generation processes.
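Notification controls and feedback loops can be sketched as a small gate that sits between a failing check and the person being notified. The `AlertGate` class below is hypothetical and intentionally crude: it suppresses repeat alerts for the same check within a cooldown window and mutes any check a user has flagged as a false positive. A real system would likely use that feedback to retune the check rather than mute it outright.

```python
from datetime import datetime, timedelta

class AlertGate:
    """Decides whether a failing check should actually notify anyone."""

    def __init__(self, cooldown=timedelta(hours=24)):
        self.cooldown = cooldown
        self.last_sent = {}  # check name -> time of last notification
        self.muted = set()   # checks that users flagged as noise

    def mark_false_positive(self, check):
        """Feedback loop: a user reports that this alert was not useful."""
        self.muted.add(check)

    def should_notify(self, check, now):
        """Return True and record the send time if an alert should go out."""
        if check in self.muted:
            return False
        last = self.last_sent.get(check)
        if last is not None and now - last < self.cooldown:
            return False  # already notified recently; avoid repeats
        self.last_sent[check] = now
        return True

# Hypothetical check name and timeline.
gate = AlertGate(cooldown=timedelta(hours=1))
start = datetime(2022, 4, 12, 10, 0)

print(gate.should_notify("orders.null_rate", start))                         # first failure
print(gate.should_notify("orders.null_rate", start + timedelta(minutes=30))) # inside cooldown
print(gate.should_notify("orders.null_rate", start + timedelta(hours=2)))    # cooldown elapsed

gate.mark_false_positive("orders.null_rate")
print(gate.should_notify("orders.null_rate", start + timedelta(hours=5)))    # muted by feedback
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the gate deterministic and easy to test.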

The Future of Data Quality

Organizations today can capture, store and query a remarkable breadth of data relevant to their business. They can democratize access to this data so that analyses, processes or products can depend on it.

Data teams operate complex data factories to service the data needs of their organization. But they are often unable to control the quality of the data produced. Data teams risk losing trust and becoming sidelined if they do not catch and address data quality issues before downstream users do.

Data leaders must take responsibility for data quality by defining and enforcing quality control standards. They need tools and processes that test data in ways that scale, both with the data itself and the people involved in producing and consuming it.

These are complex challenges, but a tremendous amount of innovation is happening in the data community to address them. I look forward to a future where our data factories are transparent, fast, inexpensive, and produce data of outstanding quality!

I’d like to thank Anthony Goldbloom, Chris Riccomini, Dan Siroker, D.J. Patil, John Joo, Kris Kendall, Monica Rogati, Pete Soderling, Taly Kanfi and Vicky Andonova for their feedback and suggestions.

AWS and Snowflake are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Databricks.