
URL: https://thenewstack.io/optimizing-compute-in-the-post-hadoop-era/

Optimizing Compute in the Post-Hadoop Era - The New Stack



Optimizing Compute in the Post-Hadoop Era

The explosion in AI and data-driven apps has in turn driven the migration from Hadoop and fueled the adoption of Spark and machine learning.
Aug 25th, 2021 11:00am by Randy Thomasson
Lead image via Pixabay.
HPE sponsored this post.

In my recent CIO.com post, “Is There Life After Hadoop?”, I wrote about the post-Hadoop era and two key strategies that organizations can deploy to help them transition. These strategies are: 1) Build a better lake, and 2) Optimize the compute.

I’ll expand on building a better lake in a future article, but today I want to focus on the compute part of the equation. As I wrote in my previous article, Apache Spark’s flexibility, columnar approach to data, suitability for artificial intelligence (AI) and machine learning, and its vastly improved performance over Hadoop have all served to dramatically increase its adoption in recent years. For most users, it has become the logical successor to Hadoop MapReduce. This article addresses how to get the most from Spark and help ignite a revolution.

Why Spark?

Randy Thomasson
As a global solution architect for HPE Ezmeral software, Randy provides technical leadership, strategy and architectural guidance spanning a wide range of technologies and disciplines, including application development and modernization, big data and advanced analytics, infrastructure automation, in-memory and NoSQL data technologies and DevOps.

Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer and a physical execution engine. It has a strong affinity with Hadoop and uses many Hadoop libraries. Spark also powers a stack of its own libraries, including SQL and DataFrames, Spark Streaming, MLlib and GraphX. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object.
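To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not Spark's scheduler; it only illustrates how a directed acyclic graph of dependencies determines which pieces of work can run together. The stage names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the real scheduling logic.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each key is a stage, each value is the set of stages it depends on.
stages = {
    "read_orders": set(),
    "read_customers": set(),
    "join": {"read_orders", "read_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def execution_waves(dag):
    """Group stages into waves; every stage in a wave has all of its dependencies finished."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished so dependents become ready
    return waves


waves = execution_waves(stages)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this toy graph the two read stages have no dependencies, so they land in the same wave and could run in parallel, while the join has to wait for both.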

Spark is well suited to tackle the needs of today’s data-driven business, as its support for streaming and in-memory processing can yield significant performance improvements over more batch-oriented technologies like Hadoop.

Spark’s cluster-based architecture also makes it well suited to handle a wide range of data sets. Moreover, Spark provides multiple deployment options and directly supports four different cluster managers:

  • Standalone – a basic cluster manager included with Spark for simple, easy-to-run clusters.
  • Apache Mesos – an open source cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – an open source resource manager that is included in Hadoop 2.
  • Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications.

The good news is that the underlying cluster manager is transparent to Spark applications. So choosing a different cluster manager doesn’t require changes to Spark applications, only the deployment configuration.
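As a rough illustration of that point, the sketch below builds a `spark-submit` command line for each of the four cluster managers. The master URL formats (`spark://`, `mesos://`, `yarn`, `k8s://`) follow Spark's documented conventions; the host names, ports, application file and arguments are placeholders invented for this example. Real deployments usually need additional manager-specific settings, so treat this as a sketch rather than a complete recipe.

```python
# Master URL formats documented by Spark; host names and ports are placeholders.
MASTER_URLS = {
    "standalone": "spark://spark-master.example.com:7077",
    "mesos": "mesos://mesos-master.example.com:5050",
    "yarn": "yarn",
    "kubernetes": "k8s://https://k8s-api.example.com:6443",
}


def spark_submit_command(cluster_manager, app="wordcount.py", app_args=("input.txt",)):
    """Build a spark-submit argument list; only the --master value depends on the manager."""
    master = MASTER_URLS[cluster_manager]
    return ["spark-submit", "--master", master, app, *app_args]


commands = {name: spark_submit_command(name) for name in MASTER_URLS}
for name, command in commands.items():
    print(f"{name:<11} {' '.join(command)}")
```

The application file and its arguments are identical in all four commands; the only thing that differs is the value passed to `--master`.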

Choosing the Deployment

The choice of Spark cluster managers varies somewhat, but most organizations have traditionally run production Spark workloads using Hadoop YARN. However, momentum is shifting to running Spark on Kubernetes. This is true for several reasons:

  • Standalone is limited – It’s the easiest to start with, but is best suited to single-node development/test clusters. It lacks the dynamic management capabilities of the other three cluster managers, and in today’s container-based, virtualized infrastructure environments, it lags behind more advanced technologies like Kubernetes.
  • Mesos is dead – Just a few months ago, it looked like Apache Mesos was headed for the Attic, the place where Apache projects go to die, but at the eleventh hour it was granted a reprieve. That said, activity in the community has slowed dramatically, with only a single release in 2020 and none so far in 2021. Mesos adopters include marquee names like Apple, Twitter, Netflix and Uber, but it never gained critical mass and didn’t make it into the mainstream like Hadoop or Kubernetes.
  • YARN is yesterday – YARN has historically been the most popular Spark cluster manager for a variety of reasons. However, unlike Mesos and Kubernetes, YARN is a relative newcomer to containers (as recently as Hadoop 3.1.1 it was considered experimental and incomplete) and as an integral part of Hadoop requires either a Hadoop cluster to run or a means of deploying YARN independent of a Hadoop cluster (e.g., as a KubeDirector application). The state of container support and extra Hadoop baggage translates to higher life-cycle costs and makes it less attractive for new Spark deployments.

Given today’s data-driven business processes with shared, virtualized infrastructure running in complex deployments spanning on-premises data centers and public clouds, the choice for today’s production Spark workloads is clear: Kubernetes.

Kubernetes offers some distinct advantages for Spark deployments. Chief among these is its support for containers. Containers have revolutionized the way that applications are packaged and deployed, much like virtualization revolutionized server infrastructure. Containers provide better isolation, improved portability, simpler dependency management and, most importantly, dramatically reduced application cycle times. Kubernetes also provides more efficient resource management, eliminating the need for transient clusters (as recommended by Databricks, EMR, etc.) to avoid resource conflicts/impacts in non-Kubernetes Spark environments. The shorter application iteration cycles and reduced setup/teardown delays provided by Kubernetes translate to substantially lower life-cycle costs. As a result, organizations moving their Spark workloads to Kubernetes can see 50% to 75% lower costs.
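For readers who have not seen a containerized Spark submission, the following sketch assembles a `spark-submit` command that targets Kubernetes in cluster mode. The configuration keys (`spark.kubernetes.container.image`, `spark.kubernetes.namespace`, `spark.executor.instances`) are standard Spark properties. Everything else is an assumption made for illustration: the API server address, image name, namespace, class name and jar path are all hypothetical.

```python
def kubernetes_submit_command(api_server, image, app_jar, main_class, executors=3):
    """Build a spark-submit argument list for a cluster-mode run on Kubernetes."""
    conf = {
        # The container image packages Spark, the application and its dependencies.
        "spark.kubernetes.container.image": image,
        # Hypothetical namespace; isolates this workload from others on the cluster.
        "spark.kubernetes.namespace": "analytics",
        "spark.executor.instances": str(executors),
    }
    command = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        command += ["--conf", f"{key}={value}"]
    command.append(app_jar)
    return command


# All values below are placeholders, not real endpoints or artifacts.
command = kubernetes_submit_command(
    api_server="https://k8s-api.example.com:6443",
    image="registry.example.com/analytics/spark-report:1.0",
    app_jar="local:///opt/spark/app/report.jar",  # local:// refers to a path inside the image
    main_class="com.example.Report",
)
print(" ".join(command))
```

Because the application and its dependencies travel inside the image, the submission itself stays small: a master URL, a handful of settings and a path that already exists in the container.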


Retiring the Elephant

Hadoop has had a good run over the years, but for many organizations it’s time to move on, and Spark has emerged as the tool of choice to replace it. Spark’s improved performance, affinity with existing Hadoop assets, and its more advanced approach to data have made it a popular choice for migrating Hadoop workloads. That said, Hadoop will be with us for a while. This is true for multiple reasons:

  • Migrating petabytes of Hadoop data and related applications takes time.
  • Some Hadoop services don’t have direct replacements in the Spark ecosystem yet.
  • There are still some use cases where Hadoop is a better choice.

Given this reality, organizations migrating from Hadoop need a solution strategy that provides a cost-effective home for their remaining Hadoop assets both during and after migration, while at the same time accommodating growing Spark workloads, preferably using a common, container-based platform. Ideally, the solution would support the compute and storage needs of existing Hadoop assets as well as newer Spark workloads while minimizing both the number of runtime platforms and associated storage.

Igniting a Revolution with Spark

Recent years have seen an explosion in AI and data-driven applications. This in turn has driven the migration from Hadoop and fueled the adoption of Spark and machine learning technologies.

As organizations look to migrate existing Hadoop data and applications, they need an approach that will allow them to effectively manage their shrinking Hadoop investment, while at the same time increasing their investments in Spark and machine learning technologies. The best way to do this is to embrace a Spark-plus-data fabric strategy for analytics.

By adopting HPE Ezmeral, organizations can ease their transition into a post-Hadoop era, optimizing analytics compute functions with Spark while effectively managing legacy Hadoop assets in the process.

TNS owner Insight Partners is an investor in: Pragma, Databricks.