MapReduce is Hadoop's processing framework for handling large-scale data across a cluster of machines. Instead of moving the data to a central system for processing, it runs the computation on the nodes where the data is already stored in HDFS.
Hadoop MapReduce follows a simple yet powerful data processing model: it breaks a large dataset into smaller chunks and processes them in parallel across a cluster. The flow runs from input splitting through mapping and shuffling to reducing, which keeps processing scalable and fault-tolerant on distributed systems.
Below is the workflow of Hadoop MapReduce with a simple data flow diagram.
[Figure: Hadoop MapReduce data flow]
The input data (e.g., a big log file or dataset) is divided into smaller chunks called Input Splits. Each split is processed independently by a separate Mapper.
Example: If you have a 1 GB file and Hadoop splits it into four 256 MB chunks, it will use 4 Mappers, one for each chunk.
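The arithmetic behind that example can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's actual split logic: the function name `count_splits` is hypothetical, and it assumes every split has the same fixed size with one Mapper per split.

```python
MB = 1024 * 1024
GB = 1024 * MB

def count_splits(file_size_bytes: int, split_size_bytes: int) -> int:
    """Return how many fixed-size splits are needed to cover the file.

    Simplifying assumption: each split is exactly split_size_bytes,
    except the last one, which holds whatever remains.
    """
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partial chunk still needs its own split.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes

print(count_splits(1 * GB, 256 * MB))  # 4
```

With these assumptions, a file even one byte larger than 1 GB would need a fifth split, and therefore a fifth Mapper.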
Mappers run in parallel on different nodes, and each one processes a single input split.
What it does: reads the records in its split one by one and emits intermediate key-value pairs.
Example: If the task is counting words, a Mapper that reads "Data is power" emits:
("Data", 1), ("is", 1), ("power", 1)
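The map step for word count can be sketched in plain Python as follows. This is only an illustration of the idea, not the Hadoop Mapper API: the function name `map_words` is hypothetical, and splitting on whitespace is an assumed tokenization rule.

```python
from typing import Iterator, Tuple

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    for word in line.split():
        yield (word, 1)

print(list(map_words("Data is power")))
# [('Data', 1), ('is', 1), ('power', 1)]
```

The Mapper does no counting itself; it only tags each word with a 1 and leaves the aggregation to later phases.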
Shuffle and sort is a behind-the-scenes phase that Hadoop handles automatically after mapping is done.
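What it does: collects the output of all Mappers, sorts it by key and groups the values that share a key, so that all values for a given key reach the same Reducer.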
Example: These pairs, collected from all Mappers:
("Data", 1), ("Data", 1), ("power", 1)
are grouped into:
("Data", [1, 1]), ("power", [1])
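The grouping shown above can be imitated in memory with a dictionary. This is a minimal single-process sketch: the function name `shuffle_and_sort` is hypothetical, and real Hadoop performs this step across the network with on-disk sorting rather than in one Python dictionary.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key, then return the groups ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sort by key only, so each Reducer would see its keys in order.
    return sorted(grouped.items(), key=lambda item: item[0])

mapper_output = [("Data", 1), ("Data", 1), ("power", 1)]
print(shuffle_and_sort(mapper_output))
# [('Data', [1, 1]), ('power', [1])]
```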
Each Reducer receives a list of values for each unique key.
What it does: combines the list of values for each key into a final result and writes that result to the output.
Example: For word count, the Reducer gets:
("Data", [1, 1]) --> outputs: ("Data", 2)
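For word count, the reduce step amounts to summing the list of ones for each key. The sketch below shows that in plain Python; the function name `reduce_counts` is hypothetical and this is not the Hadoop Reducer API.

```python
from typing import Iterable, Tuple

def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single total."""
    return (key, sum(values))

print(reduce_counts("Data", [1, 1]))  # ('Data', 2)
```

Other jobs would swap `sum` for a different aggregation (a maximum, an average, a concatenation) while keeping the same key-plus-values shape.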
The final output is saved to the job's output directory in HDFS, with each Reducer writing its own file named like part-r-00000.
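The naming pattern can be sketched as below, assuming the suffix is the Reducer's index zero-padded to five digits and that records are written as tab-separated text, one per line. Both helper names (`part_file_name`, `format_records`) are hypothetical, and the sketch only builds strings; it does not touch HDFS.

```python
from typing import Iterable, Tuple

def part_file_name(reducer_index: int) -> str:
    """Build the output file name for a given Reducer index."""
    return f"part-r-{reducer_index:05d}"

def format_records(records: Iterable[Tuple[str, int]]) -> str:
    """Render (key, value) records as tab-separated lines (assumed text layout)."""
    return "".join(f"{key}\t{value}\n" for key, value in records)

print(part_file_name(0))  # part-r-00000
print(format_records([("Data", 2), ("power", 1)]), end="")
```

A job with three Reducers would therefore be expected to produce part-r-00000, part-r-00001 and part-r-00002 under these assumptions.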