Hadoop MapReduce - Data Flow

Last Updated : 4 Aug, 2025

MapReduce is Hadoop's processing framework for handling large-scale data across a cluster of machines. Rather than moving data to a central processor, it runs computation, where possible, on the nodes where the data is already stored in HDFS.

Hadoop MapReduce follows a simple yet powerful data processing model: it breaks large datasets into smaller chunks and processes them in parallel across a cluster. This flow, from input splitting through mapping, shuffling and reducing, enables scalable, fault-tolerant and efficient data processing over distributed systems.

Below is the workflow of Hadoop MapReduce with a simple data flow diagram.

Figure: Hadoop MapReduce data flow

How MapReduce Works (Step-by-Step)

1. Input Split

The input data (e.g., a large log file or dataset) is divided into smaller logical chunks called Input Splits. Each split is processed independently by a separate Mapper.

Example: If you have a 1 GB file and the split size is 256 MB, Hadoop creates four splits and runs 4 Mappers, one for each split.
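The arithmetic in this example can be sketched in Python. This is a simplified illustration, not Hadoop's actual split logic, which depends on the input format and configuration; `plan_splits` is a hypothetical helper name, not a Hadoop API.

```python
MB = 1024 * 1024

def plan_splits(file_size_bytes, split_size_bytes):
    """Return (offset, length) pairs that cover the file, one pair per split."""
    splits = []
    offset = 0
    while offset < file_size_bytes:
        # The last split may be shorter than the configured split size.
        length = min(split_size_bytes, file_size_bytes - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 1 GB file with an assumed 256 MB split size.
splits = plan_splits(1024 * MB, 256 * MB)
print(len(splits))  # 4 splits, so 4 Mappers
```

With a smaller split size, the same file would produce more splits and therefore more Mappers.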

2. Mapper Phase

Mappers run in parallel on different nodes, and each one processes a single input split.

What it does:

Reads the input data line by line
Transforms each line into key-value pairs
Stores the intermediate output on the node's local disk (not in HDFS)

Example: If the task is counting words and the Mapper reads the line "Data is power", it emits:

("Data", 1), ("is", 1), ("power", 1)
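The word-count map step above can be sketched as a plain Python generator. This only imitates the logic of a Mapper and does not use the Hadoop API; `map_words` is a hypothetical name chosen for illustration.

```python
def map_words(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    for word in line.split():
        yield (word, 1)

pairs = list(map_words("Data is power"))
print(pairs)  # [('Data', 1), ('is', 1), ('power', 1)]
```

The Mapper does no counting itself; it only tags each word with a 1 and leaves aggregation to the later phases.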

3. Shuffling & Sorting

This is a behind-the-scenes phase that Hadoop handles automatically between the map and reduce phases.

What it does:

Transfers intermediate key-value pairs from the Mappers to the Reducers, so that all pairs with the same key reach the same Reducer
Groups all values with the same key
Sorts them by key before sending to the Reducer

Example: From all Mappers, these pairs:

("Data", 1), ("Data", 1), ("power", 1)

are grouped into:

("Data", [1, 1]), ("power", [1])
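The grouping and sorting shown above can be imitated in a single process. This is a minimal sketch that assumes all Mapper outputs fit in memory and go to one Reducer; the real framework does this across the network and on disk. `shuffle_and_sort` is a hypothetical name.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Group values by key across all Mapper outputs, then sort by key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    # Sort by key only, as the Reducer sees keys in sorted order.
    return sorted(grouped.items(), key=lambda item: item[0])

# Outputs from two hypothetical Mappers.
mapper_outputs = [[("Data", 1), ("power", 1)], [("Data", 1)]]
grouped = shuffle_and_sort(mapper_outputs)
print(grouped)  # [('Data', [1, 1]), ('power', [1])]
```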

4. Reducer Phase

Each Reducer receives, for every unique key assigned to it, the list of all values for that key.

What it does:

Applies aggregation logic (e.g., sum, average, filter)
Generates the final key-value output
Stores the result in HDFS

Example: For word count, the Reducer sums the values for each key:

("Data", [1, 1]) --> ("Data", 2)

The final output is saved in HDFS, one file per Reducer, with names like part-r-00000.
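The reduce step for word count can be sketched as follows. This is an illustration in plain Python rather than the Hadoop API; `reduce_counts` is a hypothetical name, and the tab-separated lines assume the default text output layout of one key and value per line.

```python
def reduce_counts(key, values):
    """Sum all counts received for one key."""
    return (key, sum(values))

# Grouped input as produced by the shuffle and sort phase.
grouped = [("Data", [1, 1]), ("power", [1])]
results = [reduce_counts(key, values) for key, values in grouped]
print(results)  # [('Data', 2), ('power', 1)]

# Format each result as a tab-separated line, similar to a part-r-00000 file.
lines = [f"{key}\t{count}" for key, count in results]
print("\n".join(lines))
```

Swapping `sum` for another function (for example, `max` or an average) changes the aggregation without touching the rest of the flow.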


URL: https://www.geeksforgeeks.org/big-data/hadoop-mapreduce-data-flow/