URL: https://towardsdatascience.com/hadoop-vs-spark-overview-and-comparison-f62c99d0ee15/

Hadoop vs Spark: Overview and Comparison

A summary and comparison of Spark and Hadoop

Photo by Wolfgang Hasselmann on Unsplash

Both Hadoop and Spark are collections of open-source software, maintained by the Apache Software Foundation, that are used for large-scale data processing. Hadoop is the older of the two and was once the go-to for processing big data. Since its introduction, however, Spark has been growing much more rapidly than Hadoop, which is no longer the undisputed leader in the area.

With Spark’s rise in popularity, choosing between Spark and Hadoop is a question many companies face in the real world. The answer to that question, unfortunately, is not a simple one. Both systems have strengths and weaknesses, and the correct choice will depend on the intricacies of the use case in question.

In this discussion we will give a brief introduction to both Spark and Hadoop, discuss the main technical differences between the two, and compare their strengths and weaknesses with a view to identifying the circumstances in which you should choose one over the other.

Hadoop Overview

Photo by Wolfgang Hasselmann on Unsplash – Edited by Author

Hadoop allows its users to utilise a network of many computers, harnessing their combined computational power to tackle problems involving huge amounts of data. There are two main elements to the Hadoop framework, namely distributed storage and distributed processing. The storage layer is the Hadoop Distributed File System (HDFS), while the processing layer implements the MapReduce programming model, using Yet Another Resource Negotiator (YARN) to schedule tasks and allocate resources.

HDFS was designed with a number of goals in mind. Firstly, since an HDFS instance may consist of thousands of machines, hardware failure is seen as the norm rather than the exception. This failure is planned for by ensuring faults are detected quickly and the recovery process is smooth and automatic. Secondly, HDFS is designed with batch processing in mind, rather than interactive use. As such, instead of low-latency access to data, HDFS prioritises high throughput, which enables streaming access to data. Thirdly, HDFS accommodates use cases involving huge datasets (many terabytes, for example). Finally, HDFS is easy to adopt because it is compatible with many operating systems and portable across hardware platforms.
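To make the "failure is the norm" goal concrete, here is a minimal sketch of how storing several copies of each block lets a file survive the loss of a machine. It is a toy model rather than HDFS itself: the function names and node names are hypothetical, and the round-robin placement is an assumption made for brevity (real HDFS uses a rack-aware placement policy). The 128 MB block size and replication factor of 3 match the usual HDFS defaults.

```python
BLOCK_SIZE_MB = 128  # usual HDFS default block size
REPLICATION = 3      # usual HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Placement is plain round-robin, an assumption for illustration only.
    Requires replication <= len(nodes) so that replicas land on distinct nodes.
    """
    n_blocks = -(-file_size_mb // block_size)  # ceiling division
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block keeps at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(file_size_mb=500, nodes=nodes)
survives = readable_after_failure(placement, failed_node="node2")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block, so the file remains readable while the missing replicas are re-created elsewhere.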

Hadoop was originally released without YARN and relied solely on the MapReduce framework. The addition of YARN expanded Hadoop’s potential use cases beyond MapReduce alone. The key contribution of YARN was the decoupling of cluster resource management and scheduling from the data processing component of MapReduce. This resulted in Hadoop clusters allocating resources (in terms of both memory and CPU load) better than the more rigid approach of MapReduce allowed. By providing a more efficient link between HDFS and the processing engines (such as Spark) running applications, YARN enabled Hadoop to run a broader range of applications, such as streaming data and interactive querying.

The real basis of Hadoop is MapReduce. Its key characteristics are batch processing, no limit on the number of passes over the data, and no time or memory constraints. There are a number of ideas that enable these characteristics and define Hadoop MapReduce. Firstly, the design is such that hardware failure is expected and will be handled quickly and without losing or corrupting data. Secondly, priority is on scaling out rather than up, meaning that adding more commodity machines is preferable to using fewer high-end ones. As a result, scalability in Hadoop is relatively cheap and seamless. Additionally, Hadoop processes data sequentially, avoiding random access, and promotes data locality awareness, which means running the computation on the machines that already hold the data. Sequential reads from disk can be orders of magnitude faster than random access, and data locality means the expensive process of moving large amounts of data is avoided where possible.
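The MapReduce programming model itself is simple enough to sketch in a few lines. The example below is a single-machine illustration of the three phases (map, shuffle, reduce) using the classic word count task. It is not Hadoop's API, and the function names are our own; in a real cluster the map and reduce calls would run in parallel on different machines, with intermediate results written to disk.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all the values seen for a single key."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["spark and hadoop", "hadoop and yarn"])
# counts == {"spark": 1, "and": 2, "hadoop": 2, "yarn": 1}
```

Each mapper only ever sees its own slice of the input, and each reducer only ever sees the values for its own keys, which is what makes the model easy to distribute across many commodity machines.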

Spark Overview

Photo by Cristian Escobar on Unsplash

The simple MapReduce programming model of Hadoop is attractive and is utilised extensively in industry. However, performance on certain tasks remains sub-optimal. This gave rise to Spark, which was introduced to provide a speedup over Hadoop. It is important to note that Spark is not dependent on Hadoop but can make use of it. Before comparing and contrasting the two technologies we will provide a brief overview of Spark.

Spark is a data processing engine for big data sets that is also open-source and maintained by the Apache Software Foundation. The introduction of an abstraction called resilient distributed datasets (RDDs) was the foundation that allowed Spark to excel and gain a huge speedup over Hadoop at certain tasks.

RDDs are fault-tolerant collections of elements that are distributed among multiple nodes in a cluster so that they can be worked on in parallel. The key to the speed of Spark is that any operation performed on an RDD is done in memory rather than on disk. Spark allows two types of operations on RDDs, namely transformations and actions. A transformation creates a new RDD from an existing one, while an action applies a computation and returns a result. The distribution of these operations is handled by Spark and does not need direction from the user.
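The split between transformations and actions is easier to see in code. The class below is a toy, single-machine stand-in for an RDD and is not Spark's real API: `ToyRDD` and its internals are hypothetical names, although `map`, `filter`, `collect` and `count` mirror the names of real RDD operations. It also reflects the fact that Spark evaluates transformations lazily: they are only recorded, and nothing is computed until an action is called.

```python
class ToyRDD:
    """A toy stand-in for an RDD (illustrative only, not Spark's API)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # source data, already split into partitions
        self.ops = tuple(ops)         # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    def _compute_partition(self, part):
        """Apply every recorded transformation, in order, to one partition."""
        for kind, fn in self.ops:
            if kind == "map":
                part = [fn(x) for x in part]
            else:  # "filter"
                part = [x for x in part if fn(x)]
        return part

    # Actions: trigger the computation and return a plain Python result.
    def collect(self):
        return [x for p in self.partitions for x in self._compute_partition(p)]

    def count(self):
        return len(self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = squares_of_evens.collect()  # [4, 16, 36]
```

Note that `filter` and `map` return immediately with a new object, leaving `numbers` unchanged. Only `collect` and `count` do any work, and each partition is processed independently, which is what allows the real system to spread the work across nodes.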

The operations performed on an RDD are managed by using a directed acyclic graph (DAG). In a Spark DAG, each RDD is represented as a node while the operations form the edges. The fault-tolerant property of RDDs comes from the fact that if part of an RDD is lost then it can be re-computed from the original dataset by using the lineage of operations, which is stored in the graph.
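Lineage-based recovery can be sketched as follows. This is a simplified illustration with hypothetical names, in which the lineage is just an ordered list of operations rather than a full graph. The point is that a lost partition is rebuilt by replaying the recorded operations on the matching source partition, without touching the partitions that survived.

```python
def apply_lineage(partition, lineage):
    """Replay the recorded operations, in order, on one source partition."""
    for kind, fn in lineage:
        if kind == "map":
            partition = [fn(x) for x in partition]
        elif kind == "filter":
            partition = [x for x in partition if fn(x)]
    return partition


# Original dataset, split into three partitions (e.g. on three nodes).
source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The lineage: how the derived dataset was built from the source.
lineage = [("map", lambda x: x * 10), ("filter", lambda x: x > 20)]

computed = [apply_lineage(p, lineage) for p in source]
before_failure = [list(p) for p in computed]

# Simulate a node failure that loses partition 1 of the derived dataset.
computed[1] = None

# Recovery: recompute only the lost partitions from source plus lineage.
lost = [i for i, p in enumerate(computed) if p is None]
for i in lost:
    computed[i] = apply_lineage(source[i], lineage)
```

After recovery, `computed` is identical to `before_failure`, and only one of the three partitions had to be recomputed.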

Major Technical Differences and Choosing Between Spark and Hadoop

Image by Author

As previously mentioned, Spark results in a massive speedup for certain tasks. The primary technical reason for this is that Spark processes data in RAM (random access memory) while Hadoop MapReduce reads and writes files to HDFS, which is on disk. (We note here that Spark can use HDFS as a data source but will still process the data in RAM rather than on disk.) RAM is much faster than disk for two reasons. Firstly, RAM stores information electronically with no moving parts, while a traditional hard disk stores it magnetically and has to physically move a read head to reach it. Secondly, RAM is much closer to the CPU than information stored on disk and has a faster connection, so data in RAM is accessed much faster.

This technical difference results in large speedups for applications where the same dataset is reused multiple times. Hadoop incurs significant delays (latency) for these tasks because a separate MapReduce job is required for each query, which involves reloading the data from disk each time. With Spark, however, the data remains in RAM and so is read from there instead of from disk. This results in Spark exhibiting speedups of up to 100x over Hadoop in certain cases where we reuse the same data multiple times. As a result, in cases like these, I would choose Spark over Hadoop. Common examples where this is true are iterative jobs and interactive analysis.

A specific and very common example of an iterative task that repeatedly uses the same dataset is the training of a machine learning (ML) model. ML models are often trained by iteratively passing over the same training dataset to try to reach a minimum of the error function using an optimisation algorithm such as gradient descent. The performance gain achieved by Spark in a task like this becomes more prominent the more times the data is queried. For example, there would be no speedup evident if you were to train an ML model on Hadoop and Spark using only one pass over the data (epoch), since the data needs to be loaded from disk into RAM for the first iteration on Spark. However, each subsequent iteration on Spark will run in a fraction of the time, while each subsequent Hadoop iteration will take the same amount of time as the very first iteration because the data is retrieved from disk each time. As a result, Spark is generally preferable to Hadoop when dealing with ML applications.
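This argument can be turned into a back-of-the-envelope cost model. The sketch below assumes every iteration has a fixed load-from-disk cost and a fixed compute cost, and that the in-memory engine pays the load cost only once. The function names and the timings (99 seconds to load, 1 second to compute) are hypothetical values chosen for illustration, not measurements of either system.

```python
def disk_based_time(iterations, load_s, compute_s):
    """Every iteration reloads the data from disk, then computes."""
    return iterations * (load_s + compute_s)


def in_memory_time(iterations, load_s, compute_s):
    """The data is loaded once, then every iteration computes from RAM."""
    return load_s + iterations * compute_s


def speedup(iterations, load_s, compute_s):
    """How many times faster the in-memory approach is under this model."""
    return (disk_based_time(iterations, load_s, compute_s)
            / in_memory_time(iterations, load_s, compute_s))


# Hypothetical timings: loading dominates, computing is cheap.
LOAD_S, COMPUTE_S = 99.0, 1.0

for n in (1, 10, 100, 1000):
    print(f"{n:>5} iterations: speedup {speedup(n, LOAD_S, COMPUTE_S):.1f}x")
```

With a single iteration the two approaches take the same time, matching the one-epoch case described above. As the number of iterations grows, the speedup rises towards (load + compute) / compute, which is 100 for these made-up numbers, so in this model the size of the gain depends on how much loading dominates computing and on how often the data is reused.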

Though in-memory computation is a huge advantage in many applications, it is worth noting that there are circumstances where it falls short. For example, if the data sets we are dealing with are so large that they exceed available RAM, then Hadoop is often the preferred choice. Additionally, because disk is much cheaper than RAM, Hadoop is relatively easy and cheap to scale when compared with Spark. As a result, though a business under time constraints would likely be best served by Spark, a business with capital constraints may be better served by the cheaper setup and scalability of Hadoop.


Written By

Rian Dolphin
