URL: https://thenewstack.io/how-to-handle-bad-data-in-event-streams/

How to Handle Bad Data in Event Streams - The New Stack


2024-10-08 11:00:18

How to Handle Bad Data in Event Streams

Apache Kafka topics are immutable, so you can’t edit or delete their data. But there are a few things you can do to fix bad data in event streams.
Oct 8th, 2024 11:00am by Adam Bellemare
Featured image by Getty Images for Unsplash+.
Confluent sponsored this post.

A recent Gartner survey found that poor data quality costs organizations an average of $12.9 million annually and can increase the complexity of data ecosystems. “Bad data” is corrupted or malformed data that doesn’t conform to developers’ expectations. It can cause outages and corrupt downstream results for data scientists, analysts, machine learning and AI systems, and other data practitioners.

Apache Kafka topics are immutable. Once an event is written into an event stream, it cannot be edited or deleted. This design tradeoff ensures that every consumer of the data will end up with exactly the same copy, and that no data will be edited or changed after it has been read. Because bad data cannot be edited once written to the stream, it is essential to prevent it from getting into the stream in the first place. If bad data does get in, there are still a few things you can do, even though you can’t edit it in place.

Here are four tips to help you effectively prevent and fix bad data in event streams.

1. Use Schemas to Prevent Bad Data From Entering

Schemas explicitly define what data should and should not be in the event, including field names, types, defaults, ranges of acceptable values and human-readable documentation. Popular schema technologies for event streams include Avro, Protobuf and JSON Schema.

Schemas significantly reduce data errors by preventing the producer from writing bad data. If data does not adhere to the schema, the producer application will throw an exception instead of writing the event, which alerts the developer to the problem. Schemas allow consumers to focus on using the data instead of making best-effort attempts to parse the producer’s actual meaning.
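The following is a minimal sketch of that producer-side check, written in plain Python rather than Avro, Protobuf or JSON Schema. The `Field` class, `ORDER_SCHEMA`, `SchemaViolation` and `produce` are hypothetical names invented for illustration, and the in-memory list stands in for a topic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Field:
    """One field of a hypothetical, hand-rolled event schema."""
    name: str
    type_: type
    # Optional value check, e.g. "quantity must be positive".
    check: Optional[Callable[[Any], bool]] = None


# Hypothetical schema for an "order placed" event.
ORDER_SCHEMA: List[Field] = [
    Field("order_id", str),
    Field("quantity", int, check=lambda q: q > 0),
    Field("currency", str, check=lambda c: c in {"USD", "EUR"}),
]


class SchemaViolation(ValueError):
    """Raised on the producer side, before anything reaches the stream."""


def validate(event: Dict[str, Any], schema: List[Field]) -> None:
    """Raise SchemaViolation if the event does not match the schema."""
    unknown = set(event) - {f.name for f in schema}
    if unknown:
        raise SchemaViolation(f"unexpected fields: {sorted(unknown)}")
    for f in schema:
        if f.name not in event:
            raise SchemaViolation(f"missing field: {f.name}")
        value = event[f.name]
        if type(value) is not f.type_:
            raise SchemaViolation(f"{f.name}: expected {f.type_.__name__}")
        if f.check is not None and not f.check(value):
            raise SchemaViolation(f"{f.name}: value {value!r} not allowed")


def produce(topic: List[Dict[str, Any]], event: Dict[str, Any],
            schema: List[Field]) -> None:
    """Append the event to the (in-memory) topic only if it is valid."""
    validate(event, schema)
    topic.append(event)
```

With this sketch, an event whose `quantity` is the string `"2"` raises `SchemaViolation` and never lands in the topic. A real schema library would do this work during serialization.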

Strongly defined explicit schemas are important for ensuring clear meaning. It is common in an event-driven system to have different independent consumers read the same topic.

Figure: With 2 topics and 4 consumers, there are 8 chances a consumer will misinterpret the data.

In the figure above, four consumers each read two topics, so there are eight chances (4 × 2) that a consumer will misinterpret the data from an event stream. The more consumers and topics you have, the greater the chance that one of them interprets the data differently from its peers, unless you use clearly defined explicit schemas.

The risk is that your consumers misinterpret the data just slightly differently from one another, leading to computations and results that deviate from one another. This can lead to significant efforts to reconcile which system is misinterpreting the data. Instead, eliminate this possibility by using schemas.
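To make that risk concrete, here is a small hypothetical illustration. Two consumers read the same schemaless events and each guesses at the unit of `amount`. The event shape and both consumer functions are assumptions made up for this sketch.

```python
from typing import Dict, List

# Schemaless events: nothing says whether "amount" is in cents or dollars.
EVENTS: List[Dict[str, str]] = [{"amount": "1050"}, {"amount": "250"}]


def billing_total_dollars(events: List[Dict[str, str]]) -> float:
    """Hypothetical consumer that assumes amounts are integer cents."""
    return sum(int(e["amount"]) for e in events) / 100


def reporting_total_dollars(events: List[Dict[str, str]]) -> float:
    """Hypothetical consumer that assumes amounts are already dollars."""
    return sum(float(e["amount"]) for e in events)
```

Both functions run without error on the same input, yet one reports 13.0 and the other 1300.0. A schema that states the type and unit of `amount` removes the guesswork.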

2. Test Your Schemas With Your Applications

Testing is essential for preventing bad data from entering your streams. While a runtime exception from the producing service may prevent the bad data from getting into the stream, it’ll likely degrade the experience for other applications and users that depend on that service.

Schemas provide everything you need to mock out test data for testing your code. Your producer service tests can exercise all your code paths to ensure that they create only properly formatted events. Meanwhile, your consumer applications can write all of their business logic and tests against the same schema so they don’t throw any exceptions or miscompute results when they receive and process the events.
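As a rough sketch of this idea, the snippet below derives mock events from a schema and checks producer output against it. The schema is reduced to a mapping of field names to Python types, and `USER_SCHEMA`, `build_user_event` and `is_adult` are hypothetical stand-ins for your own schema and application code.

```python
from typing import Any, Dict

# Hypothetical schema: field name -> expected Python type.
USER_SCHEMA: Dict[str, type] = {"user_id": str, "age": int, "email": str}

# One sample value per type, used to mock events for tests.
SAMPLES: Dict[type, Any] = {str: "test", int: 1, float: 1.0, bool: True}


def mock_event(schema: Dict[str, type]) -> Dict[str, Any]:
    """Build a schema-conforming event filled with sample values."""
    return {name: SAMPLES[type_] for name, type_ in schema.items()}


def conforms(event: Dict[str, Any], schema: Dict[str, type]) -> bool:
    """True if the event has exactly the schema's fields and types."""
    return event.keys() == schema.keys() and all(
        type(event[name]) is type_ for name, type_ in schema.items()
    )


def build_user_event(row: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical producer code under test: maps a raw row to an event."""
    return {
        "user_id": row["id"],
        "age": int(row["age"]),
        "email": row["email"].lower(),
    }


def is_adult(event: Dict[str, Any]) -> bool:
    """Hypothetical consumer logic under test."""
    return event["age"] >= 18
```

A producer test would assert `conforms(build_user_event(row), USER_SCHEMA)` for every code path, while a consumer test can feed `mock_event(USER_SCHEMA)` into `is_adult` without needing a running producer.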

These tests fit into your CI/CD pipeline, so you can verify that your code and schemas operate correctly together before you deploy your applications and services. Your pipeline can also validate your application’s schemas against the latest versions in the schema registry, which ensures that the application is compatible with all of the schemas it depends on in case you missed an evolution or update.
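A pipeline step of this kind could look roughly like the sketch below. It is a deliberately simplified take on a backward-compatibility check (can a reader on the new schema read data written with the old one?), not the actual rule set of any schema registry. `FieldSpec` and `backward_incompatibilities` are hypothetical names, and real formats such as Avro also allow certain type promotions that this sketch rejects.

```python
from typing import Dict, List, NamedTuple


class FieldSpec(NamedTuple):
    """Simplified description of a schema field (illustrative only)."""
    type_name: str
    has_default: bool = False


Schema = Dict[str, FieldSpec]


def backward_incompatibilities(old: Schema, new: Schema) -> List[str]:
    """List reasons a reader on `new` could not read data written with `old`.

    Simplified rules assumed here:
      * a field added in `new` must have a default value;
      * a field present in both must keep the same type;
      * a field removed in `new` is fine (the reader ignores it).
    """
    problems: List[str] = []
    for name, spec in new.items():
        if name not in old:
            if not spec.has_default:
                problems.append(f"new field '{name}' has no default")
        elif old[name].type_name != spec.type_name:
            problems.append(
                f"field '{name}' changed type from "
                f"{old[name].type_name} to {spec.type_name}"
            )
    return problems
```

In a CI job, a non-empty result would fail the build before the incompatible schema is deployed.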

3. Prioritize Event Design

Despite efforts to prevent bad data from entering a stream, sometimes a typo is all it takes to corrupt an input. Event design plays another pivotal role in preventing bad data in your event streams. A well-thought-out event design can allow for corrections, like overwriting previous bad data by publishing new records with the correct data. Prioritizing careful, deliberate event design during the application development phase can significantly ease issues related to bad data remediation.

State events (also known as event-carried state transfers) provide a complete picture of the entity at a given point in time. Delta events provide only the change from the previous delta event. The following image shows delta events as analogous to the moves in a game of chess, while the state event shows the full current state of the board.

Figure: Delta events are like individual chess moves, while state events provide a complete picture of the entity at any point in time.

State events can simplify the process of correcting previously published bad data. You simply publish a new state event with the updated correct state. Then, you can use compaction to (asynchronously) delete the old, incorrect data. Each consumer will receive the corrected state and can infer what changed by comparing it to any previous state it has stored within its own domain boundary.
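The correct-then-compact flow can be sketched with an in-memory model of a keyed log. This is only a simulation of the idea (keep the latest record per key), not Kafka’s actual compaction process, which runs asynchronously in the broker. The `compact` function and the sample records are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# A record is (key, full state of the entity at that point in time).
Record = Tuple[str, Dict[str, str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, preserving log order."""
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]


LOG: List[Record] = [
    ("user-1", {"email": "alcie@example.com"}),  # bad data: typo in email
    ("user-2", {"email": "bob@example.com"}),
    ("user-1", {"email": "alice@example.com"}),  # correction: new full state
]

COMPACTED = compact(LOG)
# COMPACTED == [
#     ("user-2", {"email": "bob@example.com"}),
#     ("user-1", {"email": "alice@example.com"}),
# ]
```

The misspelled record disappears because a newer state event exists for the same key. No consumer has to understand a special “undo” event.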

While deltas provide a smaller event size, you cannot compact them away. The best you can do is issue a delta that undoes a previous delta, but the problem is that all of your consumers must be able to handle the reversal events. The challenge is that there are many ways to produce bad deltas (e.g., illegal moves, one player moving several turns in a row), and each undo event must be a precise fix. The reality is that this is really hard to do at any meaningful scale, and you still end up with all the previous bad data in your event stream; you simply can’t clean it up if you choose to use deltas.
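For contrast, here is a small sketch of correcting a bad delta, using an account balance instead of the chess example. The event shape and function names are hypothetical.

```python
from typing import Dict, List

Delta = Dict[str, int]


def balance(deltas: List[Delta]) -> int:
    """A consumer's view: the balance is the sum of every delta seen so far."""
    return sum(d["change"] for d in deltas)


def reversal_for(bad: Delta, intended_change: int) -> Delta:
    """Build the precise compensating delta for one bad delta."""
    return {"change": intended_change - bad["change"]}


DELTAS: List[Delta] = [
    {"change": 100},
    {"change": 5000},  # bad data: the intended change was 50
]

BALANCE_BEFORE_FIX = balance(DELTAS)  # 5100, already seen by consumers

DELTAS.append(reversal_for(DELTAS[1], intended_change=50))

BALANCE_AFTER_FIX = balance(DELTAS)  # 150
```

The final balance is right only because the reversal is exact. The bad delta still sits in the stream, so any consumer that read up to that point computed 5100 in the meantime, and every new consumer must replay both the mistake and its fix.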

Event design allows for rectifying errors without having to delete everything and start from square one. However, only state events provide the means to issue a correction (a new event with the total fixed state) and delete the old bad data (compaction).

4. When All Else Fails, Rewind, Rebuild and Retry

In the world of data streaming, prevention is always better than a fix. But when prevention fails, be prepared to dig into the event stream as a last resort. While the process can be applied to any topic with bad data (whether it’s state, delta or a hybrid), it is labor-intensive and easy to mess up. Proceed with caution.

Rebuilding data from an external source requires searching for the bad data and producing a new stream with the fixed data. You have to rewind to the start of the process and pause consumers and producers. After that, you can fix and rewrite the data into another stream where you will eventually migrate all parties.
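The rebuild step itself can be sketched as a replay from the first offset into a new stream. Lists stand in for topics here, and `SOURCE_OF_TRUTH`, `is_bad` and `repair` are hypothetical placeholders for your own external source and repair rules. Pausing producers and consumers and migrating them to the new stream are outside the scope of this sketch.

```python
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]

# Hypothetical external source holding the correct values.
SOURCE_OF_TRUTH: Dict[str, int] = {"o-2": 3}


def is_bad(event: Event) -> bool:
    """Illustrative rule: a quantity must be positive."""
    return event["quantity"] <= 0


def repair(event: Event) -> Optional[Event]:
    """Return a fixed copy of the event, or None if it cannot be repaired."""
    correct = SOURCE_OF_TRUTH.get(event["order_id"])
    if correct is None:
        return None
    return {**event, "quantity": correct}


def rebuild(
    old_topic: List[Event],
    is_bad: Callable[[Event], bool],
    repair: Callable[[Event], Optional[Event]],
) -> List[Event]:
    """Replay the old topic from the start and write a cleaned new topic."""
    new_topic: List[Event] = []
    for event in old_topic:  # rewind: begin at the first offset
        if not is_bad(event):
            new_topic.append(event)
            continue
        fixed = repair(event)
        if fixed is not None:  # unrepairable events are dropped
            new_topic.append(fixed)
    return new_topic


OLD_TOPIC: List[Event] = [
    {"order_id": "o-1", "quantity": 2},
    {"order_id": "o-2", "quantity": -3},  # bad, repairable from the source
    {"order_id": "o-3", "quantity": 0},   # bad, no known correct value
]

NEW_TOPIC = rebuild(OLD_TOPIC, is_bad, repair)
```

Whether to drop, quarantine or manually fix unrepairable events is a decision for your own domain. Dropping them is just the simplest choice for the sketch.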

Although this expensive, complex solution should be a last resort, it’s an essential strategy to have in your arsenal.

Mitigate the Impact of Bad Data

Handling bad data in event streams doesn’t have to be a daunting task. By understanding the nature of bad data, preventing it from entering your event stream, utilizing event design to overwrite bad data, and being prepared to rewind, rebuild and retry when necessary, you can effectively mitigate the impact of bad data. Good data practices not only save time and effort but also enable you to get work done.

Adam Bellemare is a principal technologist in the Technology Strategy Group at Confluent. He has worked on a wide range of projects, including event-driven data mesh theory and proof of concepts, event-driven microservice strategies, and event and event stream design...
TNS owner Insight Partners is an investor in: Rewind.