VOOZH about

URL: https://thenewstack.io/linkedin-unifies-stream-and-batch-processing-with-apache-beam/

⇱ LinkedIn Unifies Stream and Batch Processing with Apache Beam - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-03-23 08:00:44
LinkedIn Unifies Stream and Batch Processing with Apache Beam
sponsor-kinetica,sponsored-topic,
Operations

LinkedIn Unifies Stream and Batch Processing with Apache Beam

By migrating to Apache Beam, social networking service LinkedIn unified its streaming and batch source code files, and reduced data processing time by 94%.
Mar 23rd, 2023 8:00am by Jessica Wachtel
👁 Featued image for: LinkedIn Unifies Stream and Batch Processing with Apache Beam

By migrating to Apache Beam, social networking service LinkedIn unified its streaming and batch source code files, and reduced data processing time by 94%.

Originally, the job of refreshing data sets, “backfilling,” was first run as a set of stream processing jobs,  but the more complex the jobs became, the more problems arose, explained a multi-author LinkedIn blog entry posted Thursday.

Backfills were then processed as batch via a lambda architecture, bringing on a new set of problems — now there were two different code bases, complete with all the challenges that came with owning and maintaining two sets of source code. The lambda architecture was replaced with the Beam API which required only one source code file for batch and streaming. The project was a success and resource use overall dropped by 50%.

Thought leaders and streaming software companies are engaged in a debate over real time vs. batch processing. One side is firmly planted in the idea that software must become more accessible to developers of all skill levels before streaming will truly hit the mainstream. The counterargument says developers must rise to the higher skill level requirements of the inconsistent tech stacks and languages that make up current streaming systems.

That LinkedIn recently reduced its data processing time by 94% by unifying its streaming and batch pipelines with Apache Beam makes a big win for the simplification argument.

The Challenge with Backfills

LinkedIn’s standardization process is the mapping of user data input strings (job titles, skills, education history) to internal IDs. The standardization data is required for search indexing and recommendation models. There are also more advanced AI models used within the pipelines to join complex data (job types and working experience) to standardize data for further usage.

Standardization requires data processing in two methods: real-time computation to reflect immediate updates and periodic backfills to refresh data when new models are introduced. When both real-time and backfills were processed as stream processing they were executed via Apache Samza Runner, which runs Beam pipelines. This worked until the following problems became insurmountable:

  • Real-time jobs failed to meet time and resource requirements during backfill processing.
  • The target requirement of 900 million profiles at a rate of 40,000/sec required for each backfill job became unreachable as training models grew in complexity.
  • Streaming clusters weren’t optimized for the backfill’s spiky resource footprint.
👁 Image

Figure 1

The first optimization moved backfills to batch processing and execute the logic with the lambda architecture. This was operational but not optimal because with the Lambda architecture came the Matryoshka doll of challenges — a second code base. The introduction of a second code base began the requirement for developers to build, learn, and maintain two codebases in two different languages and stacks.

The next iteration of the process brought on the introduction of Apache Beam API. Using Apache Beam meant developers could go back to working on one source code file.

The Solution: Apache Beam

Apache Beam is an open source, unified model for defining batch and streaming data-parallel processing pipelines. Developers can build a program to define the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of beam’s distributed processing backends of which there are several options such as Apache Flink, Spark, and Google Cloud Dataflow.

👁 Image

The unified pipeline is powered by Beam’s Samza and Spark backends in this particular use case. Samza processes two trillion messages daily with large states and fault tolerance. Beam Samza Runner executes the Beam pipeline as a Samza application locally. The Spark backend processes petabytes of data with LinkedIn’s eternal shuffling service and schema metadata store. Beam Apache Spark Runner executes beam pipelines using Spark just as a native spark application does.

Kinetica is the real-time database platform that leverages generative AI and vectorized processing to let you ask anything of your sensor and machine data. Kinetica offers native vectorized analytics in generative AI, spatial, time-series, and graph.
Learn More
The latest from Kinetica

How It Works

The Beam pipeline manages a directed acyclic graph of processing logic. The pipeline in the diagram below reads ProfileData, joins the table with a sideTable, applies a user-defined function called Standardizer(), and completes by writing the standardized result to the database. The code snippet is executed by both Samza Cluster and Spark Cluster.

Batch and stream processing jobs accept different inputs and return different outputs even in the instance of Beam when the source code is the same. Streaming inputs originate from unbounded sources such as Kafka and their outputs update the database while batch inputs come from bounded sources like HDFS and produce datasets as outputs.

PTransforms are the out-of-the-box step in the Beam workflow that takes the input from either source and performs the processing function which then produces zero or more outputs. LinkedIn added functionality to further streamline the Beam API in their Unified PTransforms. Unified PTransforms provides two expand() functions for streaming and batch. The pipeline types are detected at runtime and the appropriate expand() is called accordingly.

👁 Image

Success Metrics

The original method of performing backfills as a stream processing required over 5,000 GB-Hours of memory and nearly 4,000 hours of CPU time. Those numbers were cut in half after the migration to Beam. The seven hours it took to complete the jobs dropped to a mere 25 minutes after the migration.

👁 Image

Overall this translates to 94% of processing time and 50% of overall resource usage was saved. Eleven times the operating cost was reduced based on the cost-to-serve analysis.

Future Works

This is but a first step toward a truly end-to-end convergence solution. LinkedIn continues to work toward easing the complexity of working with streaming and batch solutions. Though there is only one source code file, there is still additional complexity caused by different runtime binary stacks (Beam Samza runner in stream and Beam Spark runner in batch) such as learning how to run, tune, and debug both clusters, the operational and maintenance costs of runtime on the two engines, and the maintenance of the two runner codebases.

LinkedIn Senior Software Engineer Yuhong Cheng was the lead author for the LinkedIn post, with Yuhong Cheng, Shangjin Zhang, Xinyu Liu, and Yi Pan serving as co-authors.

Kinetica is the real-time database platform that leverages generative AI and vectorized processing to let you ask anything of your sensor and machine data. Kinetica offers native vectorized analytics in generative AI, spatial, time-series, and graph.
Learn More
The latest from Kinetica
TRENDING STORIES
Jessica Wachtel is a developer marketing writer at InfluxData where she creates content that helps make the world of time series data more understandable and accessible. Jessica has a background in software development and technical journalism.
Read more from Jessica Wachtel
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Real, Pragma.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.