In Hadoop’s MapReduce framework, the Mapper is the core component of the Map Phase, responsible for processing raw input data and converting it into a structured form (key-value pairs) that Hadoop can efficiently handle.
A Mapper is a user-defined Java class that takes input splits (chunks of data from HDFS), processes each record and emits intermediate key-value pairs. These pairs are then shuffled and sorted before being passed to the Reducer (or directly stored in case of a Map-only job).
For example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
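Parameters:
KEYIN: type of the input key
VALUEIN: type of the input value
KEYOUT: type of the intermediate output key
VALUEOUT: type of the intermediate output value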
The Mapper’s task is completed with the help of five key components: the input, the input splits, the RecordReader, the map() function and the intermediate output.
The Mapper process starts with the input, which consists of raw datasets stored in HDFS. An InputFormat is used to locate and interpret this data so it can be processed properly.
The input is divided into input splits, allowing Hadoop to process data in parallel. Each split is handled by a separate Mapper task. The maximum split size can be configured with mapred.max.split.size (named mapreduce.input.fileinputformat.split.maxsize in Hadoop 2 and later), and the number of Mappers is approximately:
Number of Mappers = Total Data Size / Input Split Size
For example, a 10TB file with 128MB splits results in about 81,920 Mappers.
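The arithmetic above can be checked with a few lines of plain Java. This is only a rough sketch: the class and method names are made up for illustration, it assumes a single file and binary units (1 TB = 1024 × 1024 MB), and it rounds up because a trailing partial split still needs its own Mapper.

```java
public class MapperCountEstimate {

    /** Ceiling division: a partial last split is still processed by one Mapper. */
    public static long estimateMappers(long totalBytes, long splitBytes) {
        if (totalBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes
        long split = 128L * 1024 * 1024;              // 128 MB in bytes
        System.out.println(estimateMappers(tenTb, split)); // prints 81920
    }
}
```

In a real job the count is computed per file by the InputFormat, so many small files can produce more Mappers than this estimate suggests.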
Each split is then converted into key-value pairs by a RecordReader. By default, Hadoop uses TextInputFormat, where the key is the byte offset of a line and the value is the text itself.
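To make the (byte offset, line) idea concrete, here is a small stand-alone simulation. It does not use Hadoop's RecordReader classes; LineRecordSketch and its Record type are hypothetical names, and the input is assumed to be newline-delimited UTF-8 text held in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineRecordSketch {

    /** One record: the byte offset where the line starts, and the line's text. */
    public static final class Record {
        public final long offset;
        public final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits the bytes on '\n' and pairs each line with its starting offset. */
    public static List<Record> readRecords(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            if (atEnd || data[i] == '\n') {
                // Skip the empty remainder after a trailing newline.
                if (!atEnd || i > start) {
                    String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                    records.add(new Record(start, line));
                }
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "Hello Hadoop\nHello Mapper\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : readRecords(data)) {
            System.out.println("(" + r.offset + ", " + r.line + ")");
        }
        // prints (0, Hello Hadoop) then (13, Hello Mapper)
    }
}
```

The second key is 13 because the first line occupies 12 bytes plus one newline byte.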
The map() function contains the user-defined logic. It processes each key-value pair and produces intermediate key-value pairs, which serve as input for the Reduce phase.
The Mapper’s output is stored temporarily, first in an in-memory buffer (100 MB by default, configurable via io.sort.mb, named mapreduce.task.io.sort.mb in Hadoop 2 and later). When the buffer reaches its spill threshold (80% full by default), the data is sorted and spilled to the local disk. These results are not written to HDFS unless it is a Map-only job with no Reducer.
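The buffer-and-spill behaviour can be illustrated with a deliberately simplified model. SpillBufferSketch is a hypothetical class, not Hadoop code: it estimates record size as one byte per character, keeps "spills" in memory instead of writing files, and spills synchronously, whereas Hadoop spills in a background thread. The 0.8 fraction used in main mirrors Hadoop's default spill threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SpillBufferSketch {
    private final long thresholdBytes;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> spills = new ArrayList<>();
    private long usedBytes = 0;

    public SpillBufferSketch(long capacityBytes, double spillFraction) {
        this.thresholdBytes = (long) (capacityBytes * spillFraction);
    }

    /** Buffers one record and spills once the threshold is reached. */
    public void collect(String record) {
        buffer.add(record);
        usedBytes += record.length(); // crude estimate: one byte per char
        if (usedBytes >= thresholdBytes) {
            spill();
        }
    }

    /** Sorts the buffered records and moves them into a new spill run. */
    private void spill() {
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
        usedBytes = 0;
    }

    /** Flushes whatever is left when the map task finishes. */
    public void close() {
        if (!buffer.isEmpty()) {
            spill();
        }
    }

    public List<List<String>> spills() {
        return spills;
    }

    public static void main(String[] args) {
        SpillBufferSketch b = new SpillBufferSketch(10, 0.8); // tiny 10-byte buffer
        b.collect("hello");
        b.collect("hadoop");
        b.collect("mapper");
        b.close();
        System.out.println(b.spills()); // prints [[hadoop, hello], [mapper]]
    }
}
```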
The WordCount program demonstrates the Mapper’s role clearly.
Input:
Hello Hadoop
Hello Mapper
Mapper Output (Intermediate Data):
(Hello, 1)
(Hadoop, 1)
(Hello, 1)
(Mapper, 1)
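Explanation: The Mapper splits each input line into words and emits the pair (word, 1) for every word it finds. It does not add the counts together, so Hello appears twice; the summing happens later in the Reducer.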
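The WordCount map step shown above can be sketched in plain Java. This is a simulation only: it does not extend Hadoop's Mapper class, WordCountMapSketch is a hypothetical name, and the returned list stands in for the pairs a real Mapper would emit through its context. Words are assumed to be separated by whitespace.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountMapSketch {

    /** Emits (word, 1) for every whitespace-separated word in the line. */
    public static List<Map.Entry<String, Integer>> map(long offset, String line) {
        // The offset key is ignored, as is typical for WordCount.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"Hello Hadoop", "Hello Mapper"};
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(offset, line)) {
                System.out.println("(" + pair.getKey() + ", " + pair.getValue() + ")");
            }
            offset += line.length() + 1; // +1 for the newline
        }
    }
}
```

Running main prints the same four pairs as the Mapper output listed above, in the same order.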
The number of Mappers is determined by the input split size, not directly by the number of HDFS blocks. Each split is handled by one Mapper task. By default, the split size equals the HDFS block size (e.g., 128 MB), but it can be configured.
Formula:
Number of Mappers = Total Data Size / Input Split Size
Example: For a dataset of 10 TB (10 × 1024 × 1024 = 10,485,760 MB) with a split size of 128 MB:
Number of Mappers = 10,485,760 / 128 = 81,920
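How the split size itself is derived from the block size can be sketched as follows. The one-line formula mirrors the rule used by Hadoop's FileInputFormat, where the minimum and maximum come from mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize; the class SplitSizeSketch is a hypothetical wrapper written for illustration, not part of Hadoop.

```java
public class SplitSizeSketch {

    /** Clamps the block size between the configured minimum and maximum. */
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (min = 1, max = Long.MAX_VALUE): split size equals block size.
        System.out.println(computeSplitSize(128 * mb, 1L, Long.MAX_VALUE) / mb); // prints 128
        // Lowering the maximum to 64 MB halves the split and doubles the Mappers.
        System.out.println(computeSplitSize(128 * mb, 1L, 64 * mb) / mb);        // prints 64
        // Raising the minimum to 256 MB produces larger splits and fewer Mappers.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // prints 256
    }
}
```

With the defaults the result is the block size, which is why the number of Mappers usually matches the number of HDFS blocks even though it is the split size that actually determines it.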