![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Although streams can be an efficient way of processing huge volumes of data, they come with their own set of challenges. Let’s take a look at a few of them.
1. What happens if the consumer is unable to process the chunks as quickly as the producer creates them? Let’s look at an example: What if the consumer is 50% slower than the producer? If we’re starting out with a 10 gigabyte file, that means by the time the producer has processed all 10GBs, the consumer would only have processed 5GB. What happens to the remaining 5GB while it’s waiting to be processed? Suddenly that 50 to100 bytes allocated for data that still needs to be processed would have to be expanded to 5GB.
Picture 1: If the consumer is slower than the producer, you’ll need additional memory.
2. And that’s just one nightmare scenario. There are others. For example, what happens if the consumer suddenly dies while it’s processing a line? You’d need a way of keeping track of the line that was being processed and a mechanism that would allow you to reread that line and all the lines that follow.
Picture 2: When the consumer fails.
3. Finally, what happens if you need to be able to process different events and send them to different consumers? And, to add an extra level of complexity, what if you have interdependent processing, when the process of one consumer depends on the actions of another? There’s a real risk that you’ll wind up with a complex, tightly coupled, monolithic system that’s very hard to manage. This is because these requirements will keep changing as you keep adding and removing different producers and consumers.
For example (Picture 3), let’s assume we have a large retail shop with thousands of servers that support shopping through web apps and mobile apps.
Imagine that we are processing three types of data related to payments, inventory and webserver logs and that each has a corresponding consumer: a “payment processor,” an “inventory processor” and a “webserver events processor.” In addition, there is an important interdependency between two of the consumers. Before you can process the inventory, you need to verify payment first. Finally, each type of data has different destinations. If it’s a payment event, you send the output to all the systems, such as the database, email system, CRM and so on. If it’s a webserver event, then you send it just to the database. If it’s an inventory event, you send it to the database and the CRM.
As you can imagine, this can quickly become quite complicated and messy. And that’s not even including the slow consumers and fault-tolerance issues that we’ll need to deal with for each consumer.
Picture 3: The challenge of tight coupling because of multiple producers and consumers.
Of course, all of this assumes that you’re dealing with a monolithic architecture, that you have a single server receiving and processing all the events. How would you deal with a microservices architecture? In this case, numerous small servers — that is, microservices — would be processing the events, and they would all need to be able to talk to each other. Suddenly, you don’t just have multiple producers and consumers. You have them spread out over multiple servers.
A key benefit of microservices is that they solve the problem of scaling specific services depending on changing needs. Unfortunately, although microservices solve some problems, they leave others unaddressed. We still have tight coupling between our producers and consumers, and we retain the dependency between the inventory microservices and the payment ones. Finally, the problems we pinpointed in our original streaming example remain problems:
These are just some of the main challenges. Let’s take a look at how to address them.
Picture 4: The challenges of tight coupling in the microservices world
As we’ve seen, streams can be great for processing large amounts of data but also introduce a set of challenges. New specialized systems such as Apache Kafka and Redis Streams were introduced to solve these challenges. In the world of Kafka and Redis Streams, servers no longer lie at the center, the streams do, and everything else revolves around them.
Data engineers and data architects frequently share this stream-centered worldview. Perhaps it’s not surprising that when streams become the center of the world, everything is streamlined.
Picture 5 shows a direct mapping of the tightly coupled example you saw earlier. Let’s see how it works at a high level.
Picture 5: When we make streams the center of the world, everything becomes streamlined.
A number of streaming challenges that once seemed formidable, even insurmountable, can readily be solved just by putting streams at the center of the world. This is why more and more people are using Kafka and Redis Streams in their data layer.
This is why data engineers view streams as the center of the world.
Learn more about how Kafka and Redis deal with these complex data challenges. Download this free book that features 50+ illustrations to help you understand this complex topic in a fun and engaging way.