Zhamak Dehghani’s data mesh article popularized the concept of “data as a product.” As one of the four essential principles of data mesh, it describes a fundamental shift in the way organizations need to create, store and communicate important business data. While the data mesh concept is relatively new and feels simple and intuitive, the problems it highlights and proposes to solve are not.
The challenges around data movement and self-service access to reliable data have plagued organizations for decades, and none of the so-called modern approaches provide a realistic solution to the problem. If anything, they amplify existing data access challenges.
Under the hood, data pipelines play a critical role. They do a lot of the heavy lifting: coordinating data movement and extracting, transforming, integrating and loading data across silos of systems to serve various operational and analytical use cases. They’re essential to building trustworthy data products. And yet, even though a modern data-driven organization cannot function without them, data pipelines fundamentally haven’t evolved in the last few decades.
In this article, we’ll dive into the legacy approaches to data pipelines and the compounding data problems present in the much-touted modern data stack. We’ll also explore how to reimagine your data pipelines and build better data products that can serve the real-time needs of your business and customers.
ETL (extract, transform and load) tools were originally built decades ago to extract data from siloed systems, and then transform and load it at periodic intervals into a format that matches the destination data warehouse for post-hoc analysis. Data typically has one final destination, the data warehouse, and processing is heavily governed by centralized, domain-agnostic “data product” teams, who spend the bulk of their time fixing broken data pipelines and use the remainder to discover and understand domain data and glean meaningful insights. This approach made sense in the old world because the application of data, the analysis, was a back-office concern with cycle times of weeks, months or even quarters. That analysis, arriving with high latency, was then used to steer the company.
As the internet took hold in the 2000s, data growth accelerated, and the market subsequently saw the emergence of cloud-based data warehouses such as Amazon Redshift, Snowflake and Google BigQuery, which could accommodate very large volumes of data and scale storage and processing elastically. Unfortunately, traditional ETL software can’t take advantage of the native improvements that this newer generation of cloud data warehouses offers. In an attempt to overcome the performance bottlenecks created by legacy ETL tools, a variation of the traditional paradigm emerged: extract, load and transform (ELT).
ELT (extract, load and transform) tools also focus on loading data into a centralized cloud data warehouse or data lake, but unlike traditional ETL tools, the transformations happen in the target system, which reduces physical infrastructure and intermediate staging layers. Data can be extracted and loaded directly into the data warehouse, improving load times. The final destination, yet again, is the data warehouse.
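The difference between the two patterns comes down to where the transformation runs. The following is a minimal Python sketch, not a production pipeline: an in-memory SQLite database stands in for the warehouse, and the table names, columns and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount in cents, lowercase country code)
SOURCE_ROWS = [(1, 1250, "us"), (2, 990, "de"), (3, 4300, "us")]


def run_etl(rows):
    """ETL: transform inside the pipeline, then load the finished shape."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE orders (order_id INTEGER, amount_usd REAL, country TEXT)"
    )
    # The transformation happens before the data reaches the warehouse.
    transformed = [(oid, cents / 100, country.upper()) for oid, cents, country in rows]
    warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
    return warehouse


def run_elt(rows):
    """ELT: load raw rows first, then transform inside the target with SQL."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    # The transformation is pushed down to the target system.
    warehouse.execute(
        "CREATE TABLE orders AS "
        "SELECT order_id, amount_cents / 100.0 AS amount_usd, "
        "UPPER(country) AS country FROM raw_orders"
    )
    return warehouse


def fetch_orders(warehouse):
    """Read the final table so both approaches can be compared."""
    query = "SELECT order_id, amount_usd, country FROM orders ORDER BY order_id"
    return warehouse.execute(query).fetchall()


etl_result = fetch_orders(run_etl(SOURCE_ROWS))
elt_result = fetch_orders(run_elt(SOURCE_ROWS))
```

Both functions end with the same `orders` table. What changes is whether the pipeline or the warehouse does the work, and in both cases the warehouse remains the single destination.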
However, in a modern environment, data doesn’t have just one destination. Other systems and applications in the organization need access to the data. That has given rise to a whole new set of tools that reverse the pattern of data movement. A symptom of the centralizing force, reverse-ETL (rETL) tools have evolved to share the subsequent analysis of the data (from the data warehouse) back to operational systems, such as databases and SaaS applications, including CRM, finance and ERP systems.
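To make the reversed direction concrete, here is a small Python sketch of the reverse-ETL idea under simplifying assumptions: an in-memory SQLite database plays the warehouse, a plain dictionary plays a CRM, and the `orders` table, the `lifetime_value` field and the customer records are hypothetical.

```python
import sqlite3


def reverse_etl(warehouse, crm_records):
    """Copy an aggregate computed in the warehouse back into operational records."""
    query = "SELECT customer_id, SUM(amount_usd) FROM orders GROUP BY customer_id"
    for customer_id, lifetime_value in warehouse.execute(query):
        # Only update customers the operational system already knows about.
        if customer_id in crm_records:
            crm_records[customer_id]["lifetime_value"] = lifetime_value
    return crm_records


# Hypothetical warehouse contents, already loaded by an earlier ETL/ELT job.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id TEXT, amount_usd REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("c1", 12.5), ("c1", 43.0), ("c2", 9.5)],
)

# Hypothetical operational system (a stand-in for a CRM).
crm = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}
reverse_etl(warehouse, crm)
```

The CRM only sees a new value after the warehouse has been loaded and the sync job has run, so the operational system is always at least one batch cycle behind.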
While the combination of ELT and rETL tools surfaces the need to share data back to operational systems and SaaS applications to power various use cases, these approaches intensify the data problem. Decodable CEO Eric Sammer wrote an excellent article on the abuse of the data warehouse and how “putting high-priced analytical database systems in the hot path introduces pants-on-head anti-patterns to supportability and ops.”
ETL, ELT and reverse ETL are all approaches that aim to solve the need for unlocking the value of data. They all highlight the imperative to maximize the availability, reliability and usability of data to derive meaningful insights and power various use cases. However, all these approaches make a foundational assumption that data needs to be housed in a single central repository before it can be put to practical use and that the flow of data is sequential and not continuous.
If you take a moment to think about this waterfall approach, which involves centralizing data into the warehouse and then reversing that pattern to decentralize data access, its drawbacks are significant.
Despite all the buzz around the modern data stack where traditional data integration tools play an important role, the reality is that the entire stack is built on a legacy paradigm, revolving around the centralization of data into the data warehouse.
However, as Sammer mentions in his article, the data warehouse was never meant to be a gateway to enable free movement of data. “Its design center is to store data at scale, and to support things like large analytical queries and visualization tools. Gateways are built for many-to-many relationships, for decoupled no-knowledge apps, for decentralization and access control.” We’re trying to solve the right problem but with the wrong solution, and as a result we’re further exacerbating data access challenges and accumulating technical debt. “We’ve accidentally designed our customer experience to rely on slow batch ELT processes,” states Sammer.
To put your data to work and make it discoverable, accessible, and usable, you have to stop thinking of the flow of data as a sequence of processing steps, where the next step is triggered after the previous step is completed. Instead, your data needs to be a first-class citizen.
You have to think of data as something that’s continuously flowing, continuously being processed and continuously shared, so your systems and applications can react and respond to the data the moment it’s created and the moment it changes.
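The contrast between step-by-step processing and continuous flow can be illustrated with plain Python generators. This is only a sketch: the `order_source`, `enrich` and `deliver` functions are hypothetical stand-ins for a source system, a transformation and a downstream consumer, and a shared `log` records the order in which things happen.

```python
from typing import Callable, Iterable, Iterator

log: list[str] = []


def order_source(order_ids: Iterable[int]) -> Iterator[dict]:
    """Hypothetical source system that emits one order event at a time."""
    for order_id in order_ids:
        log.append(f"created {order_id}")
        yield {"order_id": order_id, "amount_cents": 1000 * order_id}


def enrich(event: dict) -> dict:
    """Hypothetical transformation: add a dollar amount to the event."""
    return {**event, "amount_usd": event["amount_cents"] / 100}


def deliver(event: dict) -> None:
    """Hypothetical consumer that reacts to a processed event."""
    log.append(f"delivered {event['order_id']}")


def run_batch(events: Iterable[dict], subscriber: Callable[[dict], None]) -> None:
    """Sequential steps: each step starts only after the previous one finishes."""
    collected = list(events)                   # step 1: extract everything
    enriched = [enrich(e) for e in collected]  # step 2: transform everything
    for event in enriched:                     # step 3: hand off everything
        subscriber(event)


def run_streaming(events: Iterable[dict], subscriber: Callable[[dict], None]) -> None:
    """Continuous flow: each event is processed and shared as soon as it exists."""
    for event in events:
        subscriber(enrich(event))


run_batch(order_source([1, 2]), deliver)
batch_log = list(log)

log.clear()
run_streaming(order_source([1, 2]), deliver)
streaming_log = list(log)
```

In `batch_log`, both orders are created before either is delivered. In `streaming_log`, each order is delivered before the next one is created, which is the behavior the consumer needs in order to react the moment data changes.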
This requires a shift in your mindset: treating your data as if it were a high-quality, ready-to-use product that’s instantly accessible across the organization. It’s consistent everywhere, which means that everyone is using the same data and taking advantage of the latest and greatest data so your operational systems can serve your customers better, your analytical systems can meet the demands of your stakeholders, and your SaaS applications are always up to date.
It’s governed, which means you can track where the data is coming from, where it’s going and who has access to it. You enable the discoverability of data assets through data contracts, so whoever needs access to the data in whatever format can easily subscribe and use it on demand.
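One way to picture a data contract is as a small, explicit description of a data stream that producers must satisfy and consumers can subscribe to. The sketch below is an illustration under stated assumptions, not a reference to any particular product: `DataContract`, `DataProduct`, the `orders.v1` stream and the `checkout-team` owner are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataContract:
    """Hypothetical contract: stream name, owning team, required fields and types."""
    name: str
    owner: str
    schema: dict[str, type]

    def validate(self, record: dict) -> None:
        """Reject records that do not match the agreed schema."""
        for field_name, field_type in self.schema.items():
            if field_name not in record:
                raise ValueError(f"{self.name}: missing field {field_name!r}")
            if not isinstance(record[field_name], field_type):
                raise TypeError(
                    f"{self.name}: {field_name!r} must be {field_type.__name__}"
                )


@dataclass
class DataProduct:
    """A governed stream: validated on publish, with a record of who consumes it."""
    contract: DataContract
    subscribers: dict[str, Callable[[dict], None]] = field(default_factory=dict)

    def subscribe(self, consumer: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[consumer] = handler

    def publish(self, record: dict) -> None:
        self.contract.validate(record)  # governance: only valid data flows
        for handler in self.subscribers.values():
            handler(record)

    def consumers(self) -> list[str]:
        """Answer 'who has access to this data?'"""
        return sorted(self.subscribers)


orders = DataProduct(
    DataContract("orders.v1", "checkout-team", {"order_id": int, "amount_usd": float})
)
received: list[dict] = []
orders.subscribe("billing", received.append)
orders.publish({"order_id": 1, "amount_usd": 12.5})
```

Publishing a record that lacks `amount_usd` raises a `ValueError` before any subscriber sees it, and `orders.consumers()` lists who is reading the stream. Those are the two governance questions the paragraph above raises: where the data goes and who has access to it.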
When you apply this kind of product thinking to your data, you begin to accelerate use-case delivery and innovation.
How do you deliver this vision and put your data to work? In a modern enterprise, where immediate access to data is a competitive differentiator, you need to think of data movement and access in a fundamentally different way. Your approach to data pipelines has to incorporate five essential components. They are:
When you reimagine the flow of data in the context of these five foundational principles, you enable data-as-a-product thinking and maximize the usability of data, making it easy for different teams to produce, share and consume trustworthy data assets. Together, these principles reinforce one another to promote data reusability, engineering agility and greater collaboration and trust within the organization.
The next time you’re building an application that needs to be informed by data, ask yourself whether your data infrastructure can support your need for real-time, ubiquitous, self-service and governed access to high-quality data streams. Question why you default to the status quo, which is batch-oriented, centralized, ungoverned and inflexible. Identify solutions that will not only solve your data needs today but also serve your business needs in the future. Here are a few resources to get you started: