In my recent CIO.com post, “Is There Life After Hadoop?”, I wrote about the post-Hadoop era and two key strategies that organizations can deploy to help them transition. These strategies are: 1) Build a better lake, and 2) Optimize the compute.
I’ll expand on building a better lake in a future article, but today I want to focus on the compute part of the equation. As I wrote in my previous article, Apache Spark’s flexibility, columnar approach to data, suitability for artificial intelligence (AI) and machine learning, and its vastly improved performance over Hadoop have all served to dramatically increase its adoption in recent years. For most users, it has become the logical successor to Hadoop MapReduce. This article addresses how to get the most from Spark and help ignite a revolution.
Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer and a physical execution engine. It has a strong affinity with Hadoop and uses many Hadoop libraries. Spark also powers a stack of its own libraries, including SQL and DataFrames, Spark Streaming, MLlib and GraphX. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object.
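To make the DAG idea concrete, here is a toy sketch in plain Python. It is not Spark's scheduler, and the stage names are hypothetical. It only illustrates the principle that a job is a graph of dependent steps and that a scheduler must run each step after the steps it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}


def execution_order(dependencies):
    """Return one valid order in which every stage follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(stage_dependencies)
print(order)
```

Because the graph is acyclic, at least one valid order always exists. Independent branches such as the two read stages have no ordering constraint between them, which is what leaves a real engine free to run them in parallel.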
Spark clusters are well suited to the needs of today’s data-driven business, as their support for streaming and in-memory processing can yield significant performance improvements over more batch-oriented technologies like Hadoop.
Spark’s cluster-based architecture also makes it well suited to handle a wide range of data sets. Moreover, Spark provides multiple deployment options and directly supports four different cluster managers: its own standalone manager, Apache Mesos, Hadoop YARN and Kubernetes.
The good news is that the underlying cluster manager is transparent to Spark applications. So choosing a different cluster manager doesn’t require changes to Spark applications, only the deployment configuration.
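As a rough illustration of that point, the sketch below builds `spark-submit` command lines for one application against different cluster managers. The host names, the container image and the `etl_job.py` script are hypothetical placeholders, and the `build_submit_command` helper is an assumption made for this example. The master URL formats and the `spark.kubernetes.container.image` property are standard Spark settings.

```python
from typing import Dict, List

# Master URL formats understood by spark-submit.
# Hosts and ports are placeholders, not real endpoints.
MASTER_URLS: Dict[str, str] = {
    "standalone": "spark://spark-master.example.com:7077",
    "yarn": "yarn",
    "kubernetes": "k8s://https://k8s-api.example.com:6443",
}


def build_submit_command(manager: str, app: str, app_args: List[str]) -> List[str]:
    """Return a spark-submit argument list for the chosen cluster manager."""
    command = ["spark-submit", "--master", MASTER_URLS[manager]]
    if manager == "kubernetes":
        # On Kubernetes the driver and executors run from a container image
        # (hypothetical image name).
        command += [
            "--conf",
            "spark.kubernetes.container.image=example.com/spark-app:latest",
        ]
    # The application and its arguments are identical for every manager.
    return command + [app, *app_args]


for name in MASTER_URLS:
    print(" ".join(build_submit_command(name, "etl_job.py", ["--date", "2021-01-01"])))
```

Only the options before the application path change between managers. The script and its arguments stay the same, which is what makes a move from YARN to Kubernetes a deployment exercise rather than a rewrite.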
The choice of Spark cluster managers varies somewhat, but most organizations have traditionally run production Spark workloads using Hadoop YARN. However, momentum is shifting to running Spark on Kubernetes. This is true for several reasons:
Given today’s data-driven business processes with shared, virtualized infrastructure running in complex deployments spanning on-premises data centers and public clouds, the choice for today’s production Spark workloads is clear: Kubernetes.
Kubernetes offers some distinct advantages for Spark deployments. Chief among these is its support for containers. Containers have revolutionized the way that applications are packaged and deployed, much as virtualization revolutionized server infrastructure. Containers provide better isolation, improved portability, simpler dependency management and, most importantly, dramatically reduced application cycle times. Kubernetes also provides more efficient resource management, eliminating the need for transient clusters (as recommended by Databricks, EMR, etc.) to avoid resource conflicts and impacts in non-Kubernetes Spark environments. The shorter application iteration cycles and significantly shorter setup and teardown delays provided by Kubernetes translate to substantially lower life-cycle costs. As a result, organizations moving their Spark workloads to Kubernetes can see 50% to 75% lower costs.
Hadoop has had a good run over the years, but for many organizations it’s time to move on, and Spark has emerged as the tool of choice to replace it. Spark’s improved performance, affinity with existing Hadoop assets, and its more advanced approach to data have made it a popular choice for migrating Hadoop workloads. That said, Hadoop will be with us for a while. This is true for multiple reasons:
Given this reality, organizations migrating from Hadoop need a solution strategy that provides a cost-effective home for their remaining Hadoop assets both during and after migration, while at the same time accommodating growing Spark workloads, preferably using a common, container-based platform. Ideally, the solution would support the compute and storage needs of existing Hadoop assets as well as newer Spark workloads while minimizing both the number of runtime platforms and associated storage.
Recent years have seen an explosion in AI and data-driven applications. This in turn has driven the migration from Hadoop and fueled the adoption of Spark and machine learning technologies.
As organizations look to migrate existing Hadoop data and applications, they need an approach that will allow them to effectively manage their shrinking Hadoop investment, while at the same time increasing their investments in Spark and machine learning technologies. The best way to do this is to embrace a Spark-plus-data fabric strategy for analytics.
By adopting HPE Ezmeral, organizations can ease their transition into a post-Hadoop era, optimizing analytics compute functions with Spark while effectively managing legacy Hadoop assets in the process.