In the world of machine learning, change is the only constant. The traditional reliance on large, batch-processed data sets is giving way to a more dynamic, real-time approach to data. This evolution is being driven by the understanding that being able to process and analyze data in real time is not just an advantage — it’s a necessity.
This is particularly true in sectors like the food delivery ecosystem, where customer expectations and business needs can switch at the drop of a hat. Here, streaming data engines emerge as key players transforming the landscape of data processing and machine learning.
Food delivery time prediction has traditionally relied on batch-processed data. This method, while somewhat effective, often leads to stale insights because of the lag between data collection and processing. The input variables typically include the delivery partner’s mode of transport, age and ratings, along with the crucial metric of distance between the restaurant and the delivery location.
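The article does not say how the distance feature is computed. As a minimal sketch, assuming straight-line (great-circle) distance from latitude/longitude pairs rather than road distance, it could be derived with the haversine formula. The function name and the choice of kilometers are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Straight-line distance understates real travel distance in a city, so a production model would more likely use routed distance; the haversine value is only a cheap first feature.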
In recent years, the food delivery industry experienced a tremendous spike in demand. This surge, partially driven by the pandemic, highlighted the painful limitations of batch-processed data models and underlined the need for real-time data processing. Real-time data processing allows immediate insights and adaptability — key components in an industry driven by time-sensitive customer expectations.
Streaming technologies like Apache Kafka bubbled up to solve the challenges created by the influx of real-time data. Kafka, known for its ability to handle high-throughput data streams, provides the backbone for real-time data ingestion and processing. However, Kafka’s architecture, while robust, often requires additional components for data transformation and processing.
Redpanda is a modern implementation of the Kafka API positioned as a more streamlined alternative to Kafka. It addresses some of Kafka’s complexities by providing a simpler setup and operational experience for developers.
For example, Redpanda Data Transforms is powered by WebAssembly (Wasm) and allows in-place data processing. This means data can be cleaned, transformed and prepared for machine learning models directly within the Redpanda broker, eliminating the need for additional data-processing layers.
To illustrate Redpanda’s role in machine learning (ML) applications that handle high volumes of data in real time, I’ll continue the example of a food delivery service.
Architecture of how Redpanda fits into a real-time delivery service powered by machine learning (Source: Redpanda)
In the “food delivery time” prediction model, Redpanda’s architecture involves these key components:
The following diagram illustrates the setup process, which involves several key steps.
Components of the proposed food delivery service infrastructure. (Source: Redpanda)
A Python script simulates the continuous flow of data, mimicking real-world scenarios of frequent order updates.
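The simulator script itself is not reproduced here. A minimal sketch of what such a generator might look like follows; the field names, value ranges and vehicle categories are hypothetical. The `publish` helper assumes the third-party kafka-python package, which works against Redpanda because Redpanda implements the Kafka API, and it is imported lazily so the rest of the sketch runs without it.

```python
import json
import random
import time

VEHICLES = ("bicycle", "scooter", "motorcycle")  # hypothetical categories


def simulate_orders(count, seed=None):
    """Yield `count` synthetic delivery records as dicts."""
    rng = random.Random(seed)
    for order_id in range(1, count + 1):
        yield {
            "order_id": order_id,
            "partner_age": rng.randint(18, 60),
            "partner_rating": round(rng.uniform(1.0, 5.0), 1),
            "vehicle": rng.choice(VEHICLES),
            "distance_km": round(rng.uniform(0.5, 20.0), 2),
            "timestamp": time.time(),
        }


def encode(record):
    """Serialize a record to UTF-8 JSON bytes, the payload sent to the broker."""
    return json.dumps(record).encode("utf-8")


def publish(records, topic, bootstrap_servers, delay_s=0.0):
    """Send records to a Kafka-compatible broker (requires kafka-python)."""
    from kafka import KafkaProducer  # lazy import: optional dependency

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for record in records:
        producer.send(topic, value=encode(record))
        time.sleep(delay_s)  # pace the stream to mimic live order updates
    producer.flush()
```

Calling `publish(simulate_orders(1000), "orders", "localhost:9092", delay_s=0.1)` would stream ten records per second; the topic name and address are placeholders for whatever the cluster actually uses.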
A Redpanda cluster is set up to handle the data streams. This involves configuring the number of brokers and setting up Redpanda Console for monitoring.
The Golang script for data transformation is deployed using Redpanda’s rpk transform deploy command. This ensures that the data transformation logic is applied uniformly across all broker nodes.
Data is processed in the broker of the partition it is sent to, and the result is written directly into memory. (Source: Redpanda)
Initialize the Redpanda Transforms project with the rpk transform init command.
Build the transform into a WebAssembly (Wasm) module with rpk transform build, ready for deployment to the Redpanda cluster.
Deploy the module to the Redpanda cluster. Redpanda distributes the deployed module across all brokers in the cluster, which is vital for load balancing and fault tolerance. Whichever broker manages a particular partition, the transform logic is available locally to process the data. Because data does not have to move across the network for processing, latency drops and efficiency improves.
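The transform in this example is written in Go and compiled to Wasm, and its source is not shown here. Purely to illustrate the kind of per-record cleaning such a transform might perform, here is the logic rendered in Python. It is not the Redpanda transform SDK, and the field names, vehicle codes and validation rules are all assumptions.

```python
import json
from typing import Optional

REQUIRED = ("partner_age", "partner_rating", "vehicle", "distance_km")
VEHICLE_CODES = {"bicycle": 0, "scooter": 1, "motorcycle": 2}  # hypothetical


def clean_record(raw: bytes) -> Optional[bytes]:
    """Return a normalized JSON record, or None if the input should be dropped."""
    try:
        record = json.loads(raw)
    except ValueError:  # covers malformed JSON and invalid UTF-8
        return None
    if not isinstance(record, dict) or any(k not in record for k in REQUIRED):
        return None

    vehicle = str(record["vehicle"]).strip().lower()
    if vehicle not in VEHICLE_CODES:
        return None

    try:
        row = {
            "partner_age": int(record["partner_age"]),
            "partner_rating": float(record["partner_rating"]),
            "vehicle_code": VEHICLE_CODES[vehicle],
            "distance_km": float(record["distance_km"]),
        }
    except (TypeError, ValueError):  # non-numeric values
        return None

    if row["distance_km"] < 0:
        return None
    return json.dumps(row, sort_keys=True).encode("utf-8")
```

In the broker, the equivalent function would read each record from the input topic and write the cleaned result to the output topic, skipping records for which this sketch returns `None`.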
The TensorFlow model, fed through TensorFlow I/O, is trained using both historical batch data and real-time data streams. This hybrid approach helps ensure the model benefits from the depth of historical data while staying agile with real-time updates.
Wasm assists in preprocessing data into the desired format and prepares it for ML model training. (Source: Redpanda)
To stream data directly from Redpanda topics into a TensorFlow data set, configure the data set to ingest data from the “model data” topic on a Redpanda cluster. The main processing loop handles data in batches: It accumulates messages, and then shuffles and decodes them before using them for training. Subsequently, the model is trained for one epoch with each batch and then saved and exported.
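The shape of that loop can be sketched without TensorFlow. In the version below, any iterable of JSON messages stands in for the Redpanda topic, and a tiny linear regressor trained with stochastic gradient descent stands in for the real model. The field names (including the `delivery_minutes` label), the batch size and the learning rate are assumptions, and the save/export step is omitted.

```python
import json
import random


def batches(messages, batch_size):
    """Accumulate messages from an iterable into lists of up to `batch_size`."""
    batch = []
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch


def decode(message):
    """Decode one JSON message into a (features, label) pair."""
    row = json.loads(message)
    return [row["distance_km"], row["partner_rating"]], row["delivery_minutes"]


class LinearModel:
    """Tiny SGD linear regressor; a stand-in for the TensorFlow model."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr  # assumes features are small; scale inputs otherwise

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def fit_one_epoch(self, examples):
        """One pass over the batch; returns the mean squared pre-update error."""
        total = 0.0
        for x, y in examples:
            err = self.predict(x) - y
            total += err * err
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return total / len(examples)


def train_on_stream(messages, model, batch_size=32, seed=0):
    """Accumulate, shuffle, decode, then train one epoch per batch."""
    rng = random.Random(seed)
    losses = []
    for batch in batches(messages, batch_size):
        rng.shuffle(batch)
        examples = [decode(m) for m in batch]
        losses.append(model.fit_one_epoch(examples))
    return losses
```

Training one epoch per batch keeps the model current with the stream, at the cost of weighting recent messages more heavily than a full retraining pass would.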
Integrating Redpanda in predictive modeling offers several advantages:
This approach, while demonstrated through the example of food delivery time prediction, has far-reaching implications. It can be applied to many sectors where real-time data analysis is crucial, such as financial markets, health-care monitoring and smart city management.
Modern streaming-data engines like Redpanda aren’t just transforming the way we handle data — they’re reshaping the future of real-time ML applications. As we continue to explore and innovate, the possibilities are as vast and exciting as the data streams we seek to harness.