![]() |
VOOZH | about |
MapReduce is the processing engine of Hadoop. While HDFS is responsible for storing massive amounts of data, MapReduce handles the actual computation and analysis. It provides a simple yet powerful programming model that allows developers to process large datasets in a distributed and parallel manner. It is a two-phase data processing model in Hadoop:
Example Input File: sample.txt
Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
When stored in HDFS, this file is divided into input splits (e.g., first.txt, second.txt, etc.). Each split is assigned to a Mapper for processing.
Example:
(0, "Hello I am GeeksforGeeks")
(26, "How can I help you")
File Formats in Hadoop
The way a RecordReader converts text into (key, value) pairs depends on the input file format. Hadoop provides several builtβin formats:
By default, Hadoop uses TextInputFormat.
Each Mapper processes its assigned (key, value) pair and generates intermediate data.
For (0, "Hello I am GeeksforGeeks"):
(Hello, 1)
(I, 1)
(am, 1)
(GeeksforGeeks, 1)
For (26, "How can I help you"):
(How, 1)
(can, 1)
(I, 1)
(help, 1)
(you, 1)
All Mappers run in parallel, one per input split.
The Mapper outputs are not final. Before reducing:
(How, [1,1])
(Are, [1,1,1])
(I, [1,1])
This prepares data for the reducer.
The Reducer aggregates values for each key:
(How, [1,1]) β (How, 2)
(Are, [1,1,1]) β (Are, 3)
(I, [1,1]) β (I, 2)
Final output (saved in result.output):
Hello - 1
I - 2
am - 1
GeeksforGeeks - 1
How - 2
Are - 3
are - 2
you - 3
what - 2
...