![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are designed to support real-time analytics as a part of event-driven architectures. They prioritize high write throughput, low query latency, even with complex analytical queries including filter aggregates and joins, and high levels of concurrent requests.
This highly-specialized class of database, which includes open source variants such as ClickHouse, Apache Pinot and Apache Druid, is often the first choice when you’re building a real-time data pipeline from scratch. But more often than not, real-time analytics is pursued as an add-on to an existing application or service, where a more traditional, relational database like PostgreSQL, SQL Server or MySQL has already been collecting data for years.
In the post I linked above, I also briefly touched on how these online transactional processing (OLTP) databases aren’t optimized for analytics at scale. When it comes to analytics, they simply cannot deliver the same query performance at the necessary levels of concurrency. If you want to understand why in more detail, read this.
Row-based databases may not work for real-time analytics, but we can’t get around the fact that they are tightly integrated with backend data systems around the world and across the internet. They’re everywhere, and they host critical data sets that are integral to and provide context for many of the real-time systems and use cases we want to build. They store facts and dimensions about customers, products, locations and more that we want to use to enrich streaming data and build more powerful user experiences.
So, what are we to do? How do you bring this row-oriented, relational data into the high-speed world of real-time analytics? And how do you do it without overwhelming your relational database server?
Right now, the prevailing pattern to get data out of a relational database and into an analytical system is using a batch extract, transform, load (ETL) process scheduled with an orchestrator to pull data from the database, transform it as needed and dump it into a data warehouse so the analysts can query it for the dashboards and reports. Or, if you’re feeling fancy, you go for an extract, load, transform (ELT) approach and let the analytics engineers build 500 dbt models on the Postgres table you’ve replicated in Snowflake.
This may as well be an anti-pattern in real-time analytics. It doesn’t work. Data warehouses make terrible application backends, especially when you’re dealing with real-time data.
Batch ETL processes read from the source system on a schedule, which not only introduces latency but also puts strain on your relational database server.
ETL/ELT is simply not designed for serving high volumes of concurrent data requests in real-time. By nature, it introduces untenable latency between data updates and their availability to downstream consumers. With these batch approaches, latencies of more than an hour are common, with five-minute latencies about as fast as can be expected.
And finally, ETLs put your application or service at risk. If you’re querying a source system (often inefficiently) on a schedule, that puts a strain on your database server, which puts a strain on your application and degrades your user experience. Sure, you can create a read replica, but now you’re doubling your storage costs, and you’re still stuck with the same latency and concurrency constraints.
Hope is not lost, however, thanks to real-time change data capture (CDC). CDC is a method of tracking changes made to a database such as inserts, updates and deletes, and sending those changes to a downstream system in real time.
Change data capture works by monitoring a transaction log of the database. CDC tools read the transaction log and extract the changes that have been made. These changes are then sent to the downstream system.
Change data capture tools read from the database log file and propagate change events to a message queue for downstream consumers.
The transaction log, such as PostgreSQL’s Write Ahead Log (WAL) or MySQL’s “bin log,” chronologically records database changes and related data. This log-based CDC minimizes the additional load on the source system, making it superior to other methods executing queries directly on source tables.
CDC tools monitor these logs for new entries and append them to a topic on an event-streaming platform like Apache Kafka or some other message queue, where they can be consumed and processed by downstream systems such as data warehouses, data lakes or real-time data platforms.
If your service or product uses a microservices architecture, it’s highly likely that you have several (perhaps dozens!) of relational databases that are continually being updated with new information about your customers, your products and even how your internal systems are running. Wouldn’t it be nice to be able to run analytics on that data in real time so you can implement features like real-time recommendation engines or real-time visualizations in your products or internal tools like anomaly detection, systems automation or operational intelligence dashboards?
For example, let’s say you run an e-commerce business. Your website runs over a relational database that keeps track of customers, products and transactions. Every customer action, such as viewing products, adding to a cart and making a purchase, triggers a change in a database.
Using change data capture, you can keep these data sources in sync with real-time analytics systems to provide the up-to-the-second details needed for managing inventory, logistics and positive customer experiences.
Now, when you want to place a personalized offer in front of a shopper during checkout to improve conversion rates and increase average order value, you can rely on your real-time data pipelines, fed by the most up-to-date change data to do so.
OK, that all sounds great. But how do you build a CDC event pipeline? How do you stream changes from your relational database into a system that can run real-time analytics and then expose them back as APIs that you can incorporate into the products you’re building?
Let’s start with the components you’ll need:
An example real-time CDC pipeline for PostgreSQL. Note that unless your destination includes an API layer, you’ll have to build one to support user-facing features.
Put all these components together, and you’ve got a real-time analytics pipeline built on fresh data from your source data systems. From there, what you build is limited only by your imagination (and your SQL skills).
Change data capture (CDC) bridges the gap between traditional backend databases and modern real-time streaming data architectures. By capturing and instantly propagating data changes, CDC gives you the power to create new event streams and enrich others with up-to-the-second information from existing applications and services.
So what are you waiting for? It’s time to tap into that 20-year-old Postgres instance and mine it for all its worth. Get out there, research the right CDC solution for your database, and start building. If you’re working with Postgres, MongoDB or MySQL, here are some links to get you started: