In the world of academia, the hard problems posed by machine learning revolve around building better, smarter models and finding more and better data. My co-author and I know from our days as Ph.D. students that there’s plenty of time to build models, and data moves slowly, if at all.
But when it comes to data engineering at data-centric enterprises, building offline models for offline data doesn’t cut it in the use cases that companies face today. Every software app with active users generates data constantly; the data moves fast, streaming back and forth between application clients and backend servers.
Use cases for machine learning (ML) in these apps need the backend infrastructure, including the ML and AI model deployments, to be able to scale and keep pace with the data. Insights that are delivered late (or never) lose a lot of their business value.
Online search, recommendation engines, user engagement interventions, video game AI and streaming content are just a few use cases that are better served if the latest, freshest data is fully used in intelligent app features. Taking full advantage of up-to-the-moment data to generate powerful insight is the central goal of real-time AI, and it requires a modern stack that can handle the demands of real-time AI at scale.
Though data speed and quantity are certainly formidable challenges, moving a successful offline ML project into a real-time AI system isn’t as simple as making your infrastructure bigger and faster. There are additional challenges, and every step needs to be automated. For example, “data wrangling” (that much-maligned, messy cleanup step in every data science project) has to be automated, as do the communication and orchestration between applications, services, ML models, streams and data stores. Development and deployment processes need to be implemented around data and ML models.
Many challenges to real-time AI stem from one of three broad root causes.
A major reason that real-time AI is difficult is the nature of the data itself. Event data is not only fast and often high volume, but it can also be sporadic, unreliable, unstructured and incompatible with other data and systems. Events usually must be processed, transformed and aggregated via a number of steps within a data pipeline before ML and AI services can use them.
With few exceptions, input data for an ML model, both training and live production data, needs to be normalized to a single format. This format could be rows of inputs containing numeric feature values; it could be sentences and paragraphs of text; or it could be a time series at regular intervals. In any case, the ML model needs somewhat structured data so that, across training and live data inputs, it is comparing apples to apples, so to speak.
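As a rough illustration of normalizing events to a single format, the sketch below maps loosely structured event dictionaries onto a fixed-length numeric row. The field names (`ts`, `amount`, `device`), the feature order and the default values are all hypothetical, chosen only for this example; a real pipeline would take them from the model's own training schema.

```python
from datetime import datetime, timezone

# Hypothetical feature order; a real model would define its own schema.
FEATURE_ORDER = ("hour_of_day", "amount", "is_mobile")


def to_feature_row(event: dict) -> list:
    """Map one raw event to a fixed-length numeric row, using defaults for gaps."""
    # Accept either epoch seconds or an ISO-8601 string for the timestamp.
    ts = event.get("ts")
    if isinstance(ts, (int, float)):
        when = datetime.fromtimestamp(ts, tz=timezone.utc)
    elif isinstance(ts, str):
        when = datetime.fromisoformat(ts)
    else:
        when = None

    features = {
        # -1.0 marks a missing timestamp (an assumption of this sketch).
        "hour_of_day": float(when.hour) if when else -1.0,
        # Amounts may arrive as numbers, numeric strings or not at all.
        "amount": float(event.get("amount") or 0.0),
        "is_mobile": 1.0 if str(event.get("device", "")).lower() in {"ios", "android"} else 0.0,
    }
    # Emit values in one fixed order so training and live rows line up.
    return [features[name] for name in FEATURE_ORDER]
```

Applying the same function to both the training set and the live stream is what keeps the comparison apples to apples: every row has the same columns in the same order, whatever shape the original event had.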
Successful real-time AI deployments require bringing data to ML models at the right time, and in the right format, before communicating results to app clients or wherever else it might be useful. It is helpful when your data storage, transport and processing systems integrate well with one another into an infrastructure stack that works well across many use cases.
Given the demands of deploying and running ML or AI in production at scale, it’s important to choose the right tools for the job. Back in graduate school, we didn’t need (or want) to use the most powerful data storage or streaming platforms, but at large enterprises, as well as the not-so-large, using the best data infrastructure is practically required for an app or service to be a success.
As your enterprise grows, and you graduate from easy-to-use data infrastructure to a more highly scalable, resilient stack, it can be hard to know which tools are worth the effort. Having worked alongside and learned from some amazing software and data engineers over the years, we have a lot of confidence in three open source projects that create a powerful combination to support real-time AI at scale: Apache Cassandra, Apache Pulsar and Kaskada.
Some of the largest companies in the world — like Apple, Netflix and Uber — rely on Cassandra to power their data-intensive, AI-powered applications. Cassandra’s distributed architecture provides high availability with no single point of failure, making it a popular choice for organizations that require a scalable, fault-tolerant database. Features include:
Designed for streaming data at scale, Pulsar provides an efficient, reliable and secure platform for processing, storing and transmitting real-time data. Pulsar’s unique architecture combines the benefits of a traditional messaging system with the scalability and durability of a log-based storage system, making it an ideal solution for ML and AI applications. Advantages include:
One feature that is particularly helpful for ML and AI deployments is Pulsar functions. While more complex ML models might require heavier infrastructure for deployment, models on the simpler side — from Python’s scikit-learn, for example — can be deployed natively in Pulsar via a function, making it unnecessary to deploy a separate Lambda function or other API or service to host the ML model. Simpler infrastructure configuration makes data engineers happy.
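To make this concrete, here is a minimal sketch written in the style of a Pulsar native Python function, which is a module exposing a `process(input)` function whose return value is published to the output topic. The hard-coded `WEIGHTS` and `BIAS` are hypothetical stand-ins for a small trained model; an actual deployment would more likely load a fitted scikit-learn model from a file, and the JSON message layout here is an assumption as well.

```python
import json
import math

# Hypothetical coefficients standing in for a small trained model.
WEIGHTS = {"amount": 0.8, "is_mobile": -0.5}
BIAS = -1.0


def score(features: dict) -> float:
    """Logistic score over the known features; missing features count as 0."""
    z = BIAS + sum(w * float(features.get(name, 0.0)) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def process(input):
    """Entry point: parse one JSON message, score it, return a JSON result."""
    features = json.loads(input)
    return json.dumps({"score": round(score(features), 4)})
```

Because the function is plain Python with no framework imports, it can be exercised locally, for example with `process('{"amount": 1.25, "is_mobile": 0}')`, before it is registered with a Pulsar cluster.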
As a new open source project, Kaskada is certainly the least well known of the three described here. However, Kaskada offers a unique capability: It’s an event-processing engine designed specifically for turning event data into continuous, stateful timelines that are consumable by real-time systems.
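To illustrate what a continuous, stateful feature means in practice, the following sketch keeps a per-key count and sum over a trailing time window and updates them as each event arrives. This is not Kaskada's API or implementation; the class name, method names and window semantics are hypothetical, and the code assumes timestamps arrive in non-decreasing order for each key.

```python
from collections import defaultdict, deque


class SlidingWindowFeatures:
    """Per-key count and sum of values over the trailing `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> deque of (timestamp, value)
        self._sums = defaultdict(float)    # key -> running sum inside the window

    def update(self, key, timestamp, value):
        """Add one event and return the current features for that key."""
        queue = self._events[key]
        queue.append((timestamp, value))
        self._sums[key] += value

        # Evict events at or before the cutoff, so the window is (t - w, t].
        cutoff = timestamp - self.window_seconds
        while queue and queue[0][0] <= cutoff:
            _, old_value = queue.popleft()
            self._sums[key] -= old_value

        return {"count": len(queue), "sum": self._sums[key]}
```

Each call returns the feature values as of the latest event, so a downstream model always sees current state without re-running a query over the full event history.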
The current alternatives for continuously generating stateful features for real-time AI seem painfully unfit for anything but the simplest cases. SQL-based tools require writing the same types of time-centric JOIN over and over to get events into various aggregation buckets for ML feature calculations, and the results aren’t natively stateful or continuous, meaning you have to query again and again to try to stay close to real time. Python-based tools give you all of the flexibility to write whatever calculations you want, but you have to write all the logic yourself and then find a way to deploy it. Other event-processing engines handle events and simple logic well but don’t do the complex calculations needed for ML features (or they let you write your own in Python). Kaskada features include:
Each of these three projects is best-in-class, but why use them together? Well, if you have event data, at scale, and you want to do real-time AI, it’s becoming hard to argue for another stack. Without Pulsar, it’s hard to do a lot of things in real time at web scale. Without Kaskada, it’s difficult to process events into ML features in real time; most people are currently doing it the hard way because Kaskada wasn’t widely available until very recently. Without Cassandra (or Pulsar, for that matter), scaling can become a significant hurdle.
Cassandra and Pulsar already have large, established open source development communities. Kaskada has a small but growing community, and because of its natural fit with Cassandra and Pulsar, it has the full commitment of DataStax to deliver on the promise of being the first-choice event-processing engine for real-time AI.
Navigating these technologies and understanding how they can fit into your current architecture can be a challenge. To accelerate development and deployment of real-time AI solutions for your business, DataStax recently introduced Luna ML, a new support service for Kaskada Open Source. Luna ML helps organizations deploy Kaskada and operate modern, open source event processing for ML. Together with the rest of our Luna offerings, we support your entire stack with real-time AI capabilities to derive maximum value from your high-speed, high-volume data.
Learn more about how DataStax enables real-time AI.