Real-time data is becoming increasingly important to enterprise success. Collecting incoming data reliably while reacting to it both quickly and strategically is paramount.
However, the data collection process is often far from trivial. Write contention is a common bottleneck in many large-scale architectures. As data storage infrastructure evolves, developers are abstracted further away from the critical areas of the write path, which makes it difficult to troubleshoot problems with write durability and performance.
The question becomes: How can we ensure that all data is being stored while not overloading the storage layer? Here, we’ll explore using DataStax Astra Streaming to avoid some of the pitfalls of real-time data delivery.
Let’s say that we’re supporting a video service. Once a user finishes watching their favorite show or movie, they’re prompted to give it a star rating between one star and five stars. From that, we take a data-driven approach to infer a few recommendations that we “think” they might like. This one small rating action helps improve the future accuracy of this inference for everyone.
How can we architect a system to accomplish this? We will need some simple components, most notably a database and a service layer to query it.
For our data storage layer, we will use DataStax Enterprise Graph, a Gremlin/TinkerPop property graph database built on top of Apache Cassandra. Graph databases are perfect for use cases where the relationships between the data are just as important as the data itself.
In our case, we’re focusing on the relationship between our users and the movies they like. This way, we can help to point them toward additional movies they might enjoy.
Using DSE Graph, we can track data about users and movies, storing them as “vertices” in the database (figure 1). Whenever a user rates a movie, we can add a “rated” edge (with the rating value as a property) from the user to the movie.
When we want to get a recommendation, we can use a particular movie as the entry point and “walk” the graph out to other users, then further out to movies that they’ve rated similarly.
Figure 1 – A partial graph showing how “User” and “Movie” vertices are connected by edges containing the user’s rating.
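To make the data model concrete, here is a minimal in-memory sketch of the "User rated Movie" structure. It is not the DSE Graph or Gremlin API; the `RatingGraph` class and its method names are hypothetical, and a nested map simply stands in for the "rated" edges and their rating property.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the User -rated-> Movie graph (hypothetical, not the DSE Graph API). */
class RatingGraph {
    // userId -> (movieId -> rating); each inner entry models one "rated" edge and its rating property
    private final Map<String, Map<String, Integer>> ratedEdges = new HashMap<>();

    /** Adds the "rated" edge from a user to a movie, replacing any earlier rating. */
    void addRating(String userId, String movieId, int rating) {
        if (rating < 1 || rating > 5) {
            throw new IllegalArgumentException("rating must be between 1 and 5 stars");
        }
        ratedEdges.computeIfAbsent(userId, k -> new HashMap<>()).put(movieId, rating);
    }

    /** Returns the rating stored on the edge, or null if the user has not rated the movie. */
    Integer ratingOf(String userId, String movieId) {
        Map<String, Integer> edges = ratedEdges.get(userId);
        return edges == null ? null : edges.get(movieId);
    }

    /** Counts all "rated" edges currently in the graph. */
    int edgeCount() {
        int count = 0;
        for (Map<String, Integer> edges : ratedEdges.values()) {
            count += edges.size();
        }
        return count;
    }
}
```

In this sketch, a second rating from the same user for the same movie overwrites the first. That is a design assumption made here, not something the article specifies.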
To interact with the database, we’ll build a simple RESTful service using Java Spring Boot. Inside the controller, we’ll build out two services.
The services are largely self-explanatory. One writes user ratings to the database (stored as an edge in graph). The other returns recommendations based on the movie provided, using item-based collaborative filtering to match like-rated movies. (For more on item-based collaborative filtering, check out chapter 10 of “The Practitioner’s Guide to Graph Data.”)
Astra Streaming is a distributed streaming-as-a-service platform built atop Apache Pulsar. We’ll use a streaming topic to handle the incoming write traffic. Calls to the addUserRating service will send a user’s new rating for a movie to the topic (as shown in figure 2). We will then have a process “subscribing” to the topic, which will consume the data and write it as an edge into the graph.
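Using Astra Streaming in this way gives us a few advantages, notably stronger data persistence guarantees and protection for the storage layer during spikes in user traffic.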
Figure 2 – A visual representation of how the different components of the recommendation system interact with each other
Once new ratings are sent to the Astra Streaming topic, a consumer process will take over. This process will “subscribe” to the topic and await any messages posted to it. Upon the arrival of a rating message on the topic, the consumer will acknowledge it and write it as an edge into the graph database using the following (Fluent) Gremlin code (figure 3).
As the consumer process is running continuously, it will continue to monitor the topic and apply new “rated” edges to the graph as they come in. The great thing about this is that additional ratings messages will queue up and be applied at a consistent level of throughput.
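The buffering behavior described above can be sketched with a plain in-memory queue. This is only an illustration of the pattern, not the Pulsar or Astra Streaming client API: `RatingMessage` and `BufferedRatingWriter` are hypothetical names, the queue stands in for the topic, and a list stands in for the graph writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** One rating message, as the addUserRating service might publish it (hypothetical shape). */
class RatingMessage {
    final String userId;
    final String movieId;
    final int rating;

    RatingMessage(String userId, String movieId, int rating) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
    }
}

/** In-memory stand-in for a topic and its consumer (not the Pulsar client API). */
class BufferedRatingWriter {
    private final Queue<RatingMessage> topic = new ConcurrentLinkedQueue<>();
    private final List<RatingMessage> graphWrites = new ArrayList<>(); // stand-in for "rated" edges

    /** Producer side: accepting a message is cheap, so traffic spikes only grow the queue. */
    void publish(RatingMessage message) {
        topic.add(message);
    }

    /** Consumer side: applies at most maxMessages writes per call, capping load on storage. */
    int consume(int maxMessages) {
        int written = 0;
        while (written < maxMessages) {
            RatingMessage next = topic.poll(); // returns null when the queue is empty
            if (next == null) {
                break;
            }
            graphWrites.add(next); // a real consumer would write the edge to the graph here
            written++;
        }
        return written;
    }

    int pending() {
        return topic.size();
    }

    int writtenCount() {
        return graphWrites.size();
    }
}
```

Publishing five messages and then calling `consume(2)` writes two and leaves three pending. A spike in ratings lengthens the queue instead of increasing the write rate seen by the storage layer.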
The read path will be built using a simple graph traversal. Using the original movie as an entry point, we’ll move along similar ratings edges out to the users who submitted them, and then continue to the adjoining movie vertices. Running this traversal with the movie “Back to the Future” as the entry point for our sample data set produces the following results:
| Title | Total # of High Ratings |
| --- | --- |
| Indiana Jones and the Raiders of the Lost Ark (1981) | 20 |
| Shawshank Redemption (1994) | 18 |
| Star Wars: Episode IV – A New Hope (1977) | 18 |
| Forrest Gump (1994) | 18 |
| The Matrix (1999) | 17 |
Table 1 – Traversal results for movies that are similarly rated to “Back to the Future.”
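The traversal can be approximated in plain Java over an in-memory map. This is a sketch of the counting logic only, not the Gremlin traversal the service uses: the `Recommender` class is hypothetical, and treating four or more stars as a "high" rating is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the movie -> users -> movies walk (hypothetical, not Gremlin). */
class Recommender {
    static final int HIGH_RATING = 4; // assumption: four or more stars counts as a high rating

    // userId -> (movieTitle -> rating)
    private final Map<String, Map<String, Integer>> ratingsByUser = new HashMap<>();

    void addRating(String userId, String movieTitle, int rating) {
        ratingsByUser.computeIfAbsent(userId, k -> new HashMap<>()).put(movieTitle, rating);
    }

    /** Ranks other movies by how many fans of the seed movie also rated them highly. */
    List<Map.Entry<String, Integer>> recommend(String seedMovie, int limit) {
        Map<String, Integer> highRatingCounts = new HashMap<>();
        for (Map<String, Integer> userRatings : ratingsByUser.values()) {
            Integer seedRating = userRatings.get(seedMovie);
            if (seedRating == null || seedRating < HIGH_RATING) {
                continue; // only walk out through users who rated the seed movie highly
            }
            for (Map.Entry<String, Integer> rated : userRatings.entrySet()) {
                if (!rated.getKey().equals(seedMovie) && rated.getValue() >= HIGH_RATING) {
                    highRatingCounts.merge(rated.getKey(), 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(highRatingCounts.entrySet());
        // Highest count first; ties are broken alphabetically so the order is deterministic
        ranked.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

Each returned entry pairs a movie title with its count of high ratings, which corresponds to the two columns of the table above.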
We discussed steps to improve our write path into a real-time recommendation system. We’ve implemented our main storage model in a graph database, which offers methods and algorithms that can take data discovery to a whole new level. Likewise, we’ve improved our data persistence guarantees while simultaneously protecting the storage layer from becoming overwhelmed in the event of a spike in user traffic.
While the use case of a movie recommendation system was the example here, the concepts discussed can be applied to many types of real-time systems. It’s common to find event processing and graph databases in use cases for many areas, such as supply chain, cybersecurity, and product-data management. Employing the methods discussed above can help ensure real-time data persistence.
The code for the Java Spring Boot service layer can be found in this repository, and the code for the Java consumer can be found