![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Apache Kafka has become the go-to platform for streaming data across an enterprise, but streaming is even more valuable when the data can be cleaned, enriched and made available for additional use cases downstream. That’s where stream processing comes in.
Stream processing allows you to continually consume data streams, process them with additional business logic and turn them into new streams that others can repurpose for their own applications. Uses cover a wide variety of applications, including real-time dashboards, machine learning models, materialized views, and event-driven apps and microservices.
Stream processing augments data streams with additional business logic, turning them into new data streams that can be reused in other applications and pipelines.
The complexity of processing logic differs based on the application at hand and can range from straightforward tasks, like filters and aggregators, to more involved operations, like multiway temporal joins and arbitrary event-driven logic. As a result, the benefits of stream processing over other options (such as periodic batch jobs, ELT, classical two-tiered architecture) differ depending on the use case.
Despite this variation, the key drivers of adoption for stream processing typically fall into one or more of these categories:
Flink is one of the most active Apache projects, providing a unified framework for stream and batch processing. Digital-first companies like Uber, Netflix and LinkedIn use Flink, as well as more traditional enterprises like Goldman Sachs and Comcast.
Flink also has a large and vibrant contributor community, supported by companies like Apple and Alibaba, that helps to ensure continual innovation. As a result, Flink has enjoyed rapid adoption comparable to that of Kafka in its early days.
Here are four common reasons that companies choose Flink over other stream-processing technologies:
Flink boasts a powerful runtime with exceptional resource optimization, high throughput with low latency and robust state handling. Specifically, the runtime can:
Flink can be configured for a wide range of workloads depending on the use case, including streaming, batch or a hybrid of the two.
Flink offers four distinct APIs that can each cater to different users and applications. Flink also extends support for multiple programming languages, including Python, Java and SQL.
Flink offers several layered APIs with varying levels of abstraction, allowing it to handle both common and more unusual use cases.
The DataStream API, available in both Java and Python, allows you to create data flow graphs by linking transformation functions like FlatMap, Filter and Process. Within these user-defined functions, you gain access to the fundamental components of a stateful stream processor, such as state, time and events. This provides you with fine-grained control over how records flow through the system and how they read, write and update the state of your application. If you’re familiar with the Kafka Streams DSL and Kafka Processor API (↔ ProcessFunction), the experience will be familiar.
The Table API is Flink’s more modern, declarative API. It enables you to write programs using relational operations like joins, filters, aggregations and projections, in addition to various types of user-defined functions. Similar to the DataStream API, the Table API is supported in Java and Python. Programs developed using this API undergo optimization similar to Flink SQL queries, sharing several features with SQL, such as the type system, built-in functions and the validation layer. This API has parallels to Spark Structured Streaming, Spark’s DataFrame API and Snowpark DataFrame API, although those APIs are geared more toward micro-batch and batch processing than streaming.
Built on the same underlying architecture as the Table API is Flink SQL, a SQL engine that adheres to ANSI standards and can process both live and historical data. Flink SQL uses Apache Calcite for query planning and optimization. It supports arbitrarily nested subqueries, has broad language support including various streaming joins and pattern matching, and comes with an extensive ecosystem including JDBC Driver, catalogs and an interactive SQL shell.
Finally we have “Stateful Functions,” which eases the creation of stateful, distributed event-driven applications. This is a separate subproject under the Flink umbrella and quite different from Flink’s other APIs. The simplest way to think about Stateful Functions is as a stateful, fault-tolerant distributed Actor system based on the Flink runtime.
The broad choice of APIs makes Flink the ideal option for stream processing, and you can mix different APIs over time as your requirements and use cases evolve.
Apache Flink unifies stream and batch processing, because its main APIs (SQL, Table API and DataStream API) support both bounded data sets and unbounded data streams. Specifically, you can run the same program in either batch- or stream-processing mode depending on the nature of the data that is being processed. You can even let the system choose the processing mode for you.
This unification of stream and batch processing offers tangible benefits for developers:
Flink is a mature platform that has been battle-tested in the most demanding production use cases. Features that demonstrate this include:
Flink and Kafka are frequently used together; in fact, Kafka is Flink’s most popular connector. The two are highly compatible, and Kafka in many ways has driven Flink’s widespread adoption.
Note that Flink itself does not store any data; it operates on data stored elsewhere. You can think of Flink as a computation layer for Kafka, powering real-time applications and pipelines, while Kafka serves as the foundational storage layer for streaming data.
Within the data-streaming stack, Flink can manage computational needs while Kafka provides the storage layer.
Flink has become even more adept at supporting Kafka applications over time. It can employ Kafka as both a data source and a data sink, capitalizing on Kafka’s broad ecosystem and tools. Flink also supports popular data formats natively, including Avro, JSON and Protobuf.
Kafka is an equally good fit for Flink. Compared to other messaging systems like ActiveMQ, RabbitMQ or PubSub, Kafka enables persistent and indefinite data storage for Flink. Furthermore, Kafka allows multiple consumers to read streams concurrently and rewind them as necessary. The first attribute complements Flink’s distributed processing paradigm, while the second is crucial for Flink’s fault-tolerance mechanisms.
For a deeper dive, try your hand at the Flink 101 course at the Confluent Developer site or this Apache Flink training.