MapReduce is a core programming model in the Hadoop ecosystem, designed to process large datasets in parallel across distributed machines (nodes). The execution flow is divided into two major phases: Map Phase and Reduce Phase.
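A Hadoop MapReduce program typically consists of three main components: a Mapper class, a Reducer class, and a Driver class that configures and submits the job.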
The Reducer is the second stage of MapReduce. It takes the intermediate key-value pairs generated by the Mapper and produces the final consolidated output, which is then written to HDFS (Hadoop Distributed File System).
1. Intermediate Data (Mapper Output): The Mapper produces output in the form of (key, value) pairs.
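2. Shuffle & Sort: Before passing the data to the Reducer, Hadoop automatically performs two operations: shuffling, which transfers each Mapper's output to the Reducer responsible for its keys, and sorting, which orders the data by key so that all values for the same key are grouped together.
Shuffling and sorting run concurrently for efficiency: map outputs are merged while they are still being fetched.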
3. Reduce Phase:
The Reducer receives (key, list of values) and applies user-defined computation logic such as aggregation, filtering, or summation. The output is then written back to HDFS.
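The three steps above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Hadoop API, a TreeMap stands in for Hadoop's sort-and-merge step, and the class and method names (ShuffleSortSketch, shuffleAndSort, reduce) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; it does not use the Hadoop API.
final class ShuffleSortSketch {

    // Group intermediate (key, value) pairs by key, with keys kept in sorted order.
    // The TreeMap stands in for Hadoop's Shuffle & Sort step.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // A user-defined reduce step: here, sum the values that share a key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up Mapper output, deliberately out of key order.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("b", 2), Map.entry("a", 1), Map.entry("b", 3), Map.entry("a", 4));

        // Each key reaches reduce() once, together with the list of all its values.
        shuffleAndSort(mapperOutput).forEach((key, values) ->
            System.out.println(key + " " + values + " -> " + reduce(key, values)));
    }
}
```

In a real job the grouping is done by the framework; only the reduce logic is written by the user.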
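Suppose we have faculty salary data stored in a CSV file. To compute the total salary per department, the Mapper can emit a (department, salary) pair for each record, so that after Shuffle & Sort each department key reaches the Reducer together with all of its salary values.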
The Reducer will aggregate all salary values for each department and produce the final result in the format:
Dept_Name Total_Salary
CSE 750000
ECE 620000
MECH 450000
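A minimal plain-Java sketch of this example follows. It mimics the map and reduce steps without the Hadoop API; a real job would subclass Hadoop's Mapper and Reducer instead. The record layout (name,department,salary), the sample rows, and the names DeptSalarySketch, map, and totalByDept are all assumptions made for illustration; the rows were chosen so that the totals match the table above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not the Hadoop Mapper/Reducer API.
final class DeptSalarySketch {

    // "Map" step: turn one CSV record into a (department, salary) pair.
    // Assumed (hypothetical) record layout: name,department,salary
    static Map.Entry<String, Long> map(String csvLine) {
        String[] fields = csvLine.split(",");
        return Map.entry(fields[1].trim(), Long.parseLong(fields[2].trim()));
    }

    // "Reduce" step: add up all salary values that share a department key.
    static Map<String, Long> totalByDept(List<String> csvLines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : csvLines) {
            Map.Entry<String, Long> pair = map(line);
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Invented sample rows; not taken from any real dataset.
        List<String> rows = List.of(
            "Asha,CSE,400000", "Ravi,CSE,350000",
            "Meena,ECE,320000", "Kiran,ECE,300000",
            "Arun,MECH,450000");

        // Print one "Dept_Name Total_Salary" line per department.
        totalByDept(rows).forEach((dept, total) -> System.out.println(dept + " " + total));
    }
}
```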
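By default, each Reducer writes its result to a file named part-r-00000, part-r-00001, and so on in the job's output directory. The part prefix can be replaced by setting the output base name in the driver:
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");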
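Note: The Reducer's output is not re-sorted. Each Reducer writes records in the order in which it processes its keys, and there is no overall ordering across the output files of different Reducers.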
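Hadoop allows users to configure the number of Reducers. It can be set through a configuration property (mapreduce.job.reduces; mapred.reduce.tasks is the older, deprecated name):
mapreduce.job.reduces=<number_of_reducers>
or in the driver code:
job.setNumReduceTasks(2);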
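If the number of Reducers is set to 0, only the Map phase is executed and the Mapper output is written directly to the output path without being shuffled or sorted (useful for Map-only jobs).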
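To see how the Reducer count shapes the output, the sketch below assigns keys to Reducers using the same arithmetic as Hadoop's default HashPartitioner (hash code with the sign bit cleared, modulo the number of Reducers). It is a plain-Java approximation: it hashes Java strings, whereas Hadoop hashes its own key types such as Text, so the actual assignment in a real job may differ. The department keys and the names PartitionSketch, partitionFor, and assign are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Plain-Java illustration only; it does not use the Hadoop API.
final class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner, applied to String.hashCode().
    // Requires at least one Reducer; a Map-only job (0 Reducers) does no partitioning.
    static int partitionFor(String key, int numReducers) {
        if (numReducers < 1) {
            throw new IllegalArgumentException("numReducers must be >= 1");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Assign each key to one Reducer; a TreeSet keeps each Reducer's keys in sorted order.
    static List<TreeSet<String>> assign(List<String> keys, int numReducers) {
        List<TreeSet<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeSet<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numReducers)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical department keys, split across 2 Reducers.
        List<TreeSet<String>> parts = assign(List.of("CSE", "ECE", "MECH", "CIVIL", "EEE"), 2);
        for (int i = 0; i < parts.size(); i++) {
            // One output file per Reducer, using the default part-r-NNNNN naming.
            System.out.printf("part-r-%05d: %s%n", i, parts.get(i));
        }
    }
}
```

Each Reducer's keys come out in sorted order, but nothing orders the keys across the separate output files.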
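The number of Reducers significantly affects performance and resource utilization, so it should be tuned to the cluster size and workload. Increasing it adds framework overhead but improves load balancing and lowers the cost of failures.
Recommended formula:
NumReducers ≈ (0.95 or 1.75) × (Number of Nodes × Max Containers per Node)
With 0.95, all Reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of Reducers and launch a second wave, which balances the load better.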
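The formula can be evaluated directly, as in the short sketch below. The cluster size is hypothetical, and rounding to the nearest integer is an assumption of this sketch, since the formula itself is only approximate. ReducerCountSketch and estimate are illustrative names.

```java
// Plain-Java illustration only; it does not use the Hadoop API.
final class ReducerCountSketch {

    // NumReducers ≈ factor × (nodes × max containers per node).
    // Rounding to the nearest integer is an assumption of this sketch.
    static int estimate(double factor, int nodes, int maxContainersPerNode) {
        int totalContainers = nodes * maxContainersPerNode;
        return (int) Math.round(factor * totalContainers);
    }

    public static void main(String[] args) {
        int nodes = 10;             // hypothetical cluster size
        int containersPerNode = 8;  // hypothetical per-node container limit

        System.out.println("factor 0.95: " + estimate(0.95, nodes, containersPerNode));
        System.out.println("factor 1.75: " + estimate(1.75, nodes, containersPerNode));
    }
}
```

For this hypothetical cluster of 10 nodes with 8 containers each, the estimates are 76 and 140 Reducers. The result can then be passed to job.setNumReduceTasks.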