Running a MapReduce job isn't just about splitting data and computing results; it also involves monitoring, handling failures and finally committing the output. Let’s break down what happens when a job completes successfully, and what Hadoop does when things go wrong.
Once all the map and reduce tasks are done, the ApplicationMaster updates the job status to SUCCEEDED.
waitForCompletion(true) is a blocking call that waits until the entire job finishes and then returns the outcome to the client: true if the job succeeded, false otherwise. Passing true makes the client print progress while it waits.
Once the job completes successfully, the client prints the job's counters and statistics to the console.
You can enable automatic job completion notifications by configuring:
mapreduce.job.end-notification.url
This URL gets a callback when your job ends.
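Hadoop's default configuration reference notes that the notification URL may contain the placeholders $jobId and $jobStatus, which are replaced with the real values before the callback is made. The sketch below only models that substitution; the class name, the endpoint and the job ID are hypothetical, and this is not Hadoop's own code.

```java
// Illustrative model only: shows how a notification URL template with
// $jobId and $jobStatus placeholders expands. Not Hadoop's implementation.
final class EndNotificationUrl {

    // Replace the two documented placeholders with concrete values.
    // String.replace does literal (non-regex) substitution, so '$' is safe.
    static String expand(String template, String jobId, String jobStatus) {
        return template
                .replace("$jobId", jobId)
                .replace("$jobStatus", jobStatus);
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and job ID, used purely for illustration.
        String template = "http://example.com/notify?id=$jobId&status=$jobStatus";
        System.out.println(expand(template, "job_1700000000000_0001", "SUCCEEDED"));
    }
}
```

The service behind the URL can then read the job ID and final status from the query string instead of polling the cluster.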
After the job is marked complete, all containers (map, reduce and ApplicationMaster) clean up their temporary files and resources. The OutputCommitter runs commitJob(), which publishes the final output.
Note: Hadoop uses FileOutputCommitter by default, which ensures output is only published if all tasks succeed, preventing partial writes.
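The "publish only on success" idea can be sketched with plain java.nio, without any Hadoop classes. This is a simplified model, not FileOutputCommitter's implementation: the class and method names are hypothetical, and only the _temporary directory and the _SUCCESS marker file are borrowed from FileOutputCommitter's default behaviour.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Simplified model of "write to a temporary directory, publish on commit".
final class CommitSketch {

    // Tasks write here first, so readers of outputDir never see partial files.
    static Path tempDir(Path outputDir) {
        return outputDir.resolve("_temporary");
    }

    static void writeTaskOutput(Path outputDir, String fileName, String data) throws IOException {
        Files.createDirectories(tempDir(outputDir));
        Files.write(tempDir(outputDir).resolve(fileName), data.getBytes(StandardCharsets.UTF_8));
    }

    // Called once, only after every task has succeeded.
    static void commitJob(Path outputDir) throws IOException {
        Path tmp = tempDir(outputDir);
        // Snapshot the listing first so files are not moved while iterating.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tmp)) {
            for (Path f : files) {
                pending.add(f);
            }
        }
        for (Path f : pending) {
            Files.move(f, outputDir.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(tmp);                               // temporary area is empty now
        Files.createFile(outputDir.resolve("_SUCCESS")); // marker: output is complete
    }
}
```

If the job fails, commitJob() is never called in this model, so the output directory contains no part files and no _SUCCESS marker.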
After successful completion, the Job History Server stores metadata about the completed job. This includes logs, counters and configuration, which are useful for debugging and performance analysis.
Hadoop’s strength lies in fault tolerance. Jobs can still succeed even if some components fail. Let’s explore how Hadoop handles failures.
This diagram shows which MapReduce components step in to complete tasks and manage failures:
[Image: MapReduce components involved in failure handling]
Let’s explore how Hadoop detects, manages and recovers from task or node failures to ensure job completion.
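If a Mapper or Reducer crashes because of an error in user code (e.g., a bug or bad input), the JVM running that task reports the error to the ApplicationMaster and exits, and the task attempt is marked as failed.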
In Hadoop Streaming, any non-zero exit code is treated as a failure. This behavior is controlled by:
stream.non.zero.exit.is.failure #(default = true)
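If the JVM running a task crashes or exits unexpectedly, the NodeManager notices that the process has exited and informs the ApplicationMaster, which marks the task attempt as failed.
This helps Hadoop recover quickly by retrying the task on another node.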
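If a task appears to hang (i.e., stops reporting progress), the ApplicationMaster marks it as failed once no progress update has arrived within the timeout period, and the task JVM is killed.
The timeout is given in milliseconds (the default is 600000, i.e. 10 minutes) and is configurable via: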
mapreduce.task.timeout
Set to 0 to disable timeout (not recommended)
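The timeout rule above can be expressed as a small check. This is a minimal sketch with hypothetical names, not Hadoop's heartbeat handler; it only encodes the two rules from the text: a task is considered hung when it has reported no progress for longer than the timeout, and a timeout of 0 switches the check off.

```java
// Minimal model of a progress-timeout check. Names are hypothetical.
final class TaskTimeoutSketch {

    static boolean isTimedOut(long lastProgressMillis, long nowMillis, long timeoutMillis) {
        if (timeoutMillis <= 0) {
            // 0 disables the timeout (this sketch treats negative values the same way),
            // so a hung task would never be failed.
            return false;
        }
        return nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long tenMinutes = 600_000L;
        // Last progress at t=0, checked 11 minutes later.
        System.out.println(isTimedOut(0L, 660_000L, tenMinutes));
        // Same silence, but with the timeout disabled.
        System.out.println(isTimedOut(0L, 660_000L, 0L));
    }
}
```

The second call shows why disabling the timeout is discouraged: a task that never reports progress again is never detected, and its container stays occupied.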
When a map or reduce task fails due to a temporary issue (like a bad node or a network glitch), the ApplicationMaster takes charge of rescheduling the task, typically on a different node to avoid repeating the problem.
Each task is attempted up to 4 times by default. These limits can be configured using:
mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
If all attempts fail, the task is marked as permanently failed and the entire job may be aborted.
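The attempt limit can be pictured as a bounded loop. The sketch below is a plain-Java model with hypothetical names; in a real cluster the ApplicationMaster schedules each attempt in a new container, which this loop does not try to reproduce.

```java
import java.util.function.IntPredicate;

// Bounded-attempts model: try a task up to maxAttempts times.
final class TaskRetrySketch {

    // 'attempt' receives the 1-based attempt number and returns true on success.
    // Returns the attempt number that succeeded, or -1 if every attempt failed.
    static int runWithRetries(int maxAttempts, IntPredicate attempt) {
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.test(n)) {
                return n;
            }
        }
        return -1; // permanently failed
    }

    public static void main(String[] args) {
        // A task that fails twice (say, on a bad node) and succeeds on the third attempt.
        System.out.println(runWithRetries(4, n -> n == 3));
        // A task with a deterministic bug fails on every attempt.
        System.out.println(runWithRetries(4, n -> false));
    }
}
```

Retries help with transient problems only; a task that fails for a deterministic reason, such as a bug triggered by a specific record, uses up all its attempts and still fails.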
A MapReduce job may fail if too many of its tasks fail permanently or if critical system components stay down.
You can allow some task failures without failing the entire job by configuring:
mapreduce.map.failures.maxpercent
mapreduce.reduce.failures.maxpercent
But if the failure thresholds are exceeded or system components remain down, the job is marked as FAILED and will not complete.
MapReduce is built to handle real-world issues such as code bugs, machine crashes and JVM errors, and still complete the job if possible. Always monitor logs and counters after job completion. They reveal critical info about task health and performance.
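The threshold comparison can be sketched as a percentage check. The class and method names below are hypothetical and the code is a model rather than Hadoop's source. Both properties default to 0, so by default a single permanently failed task fails the job.

```java
// Model of the "tolerated failure percentage" check for one task type.
final class FailureThresholdSketch {

    // True when the share of failed tasks is above the allowed percentage.
    // Cross-multiplying avoids integer-division rounding; long avoids overflow.
    static boolean exceedsThreshold(int failedTasks, int totalTasks, int maxFailuresPercent) {
        return (long) failedTasks * 100 > (long) maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        // With 5% tolerated, 5 failures out of 100 map tasks are acceptable...
        System.out.println(exceedsThreshold(5, 100, 5));
        // ...but a sixth failure pushes the job over the limit.
        System.out.println(exceedsThreshold(6, 100, 5));
    }
}
```

A non-zero percentage suits jobs where a few missing input splits are acceptable, such as sampling or approximate aggregation; for exact results the default of 0 is the safer choice.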