MapReduce Lifecycle
JOB CLIENT
The Job Client prepares a job for execution. When a MapReduce job is submitted to Hadoop, the local Job Client does the following (a driver sketch follows this list):
• Validates the job configuration.
• Generates the input splits.
• Copies the job resources to a shared location (an HDFS directory) accessible to the Job Tracker and Task Trackers.
• Submits the job to the Job Tracker.
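As a concrete illustration, the following driver sketch shows a typical job submission; the WordCount class names and argument paths are assumptions for illustration, and job.waitForCompletion() is the call that triggers the Job Client steps above.

// Minimal driver sketch (hypothetical WordCount classes and paths for illustration).
// job.waitForCompletion() performs the Job Client steps described above: validation,
// input split generation, copying resources to HDFS, and submission to the Job Tracker.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}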
JOB TRACKER
The Job Tracker coordinates the job: it assigns map and reduce tasks to Task Trackers, tracks their status, and makes completed map results available to the reduce tasks.
TASK TRACKER
A Task Tracker manages the tasks assigned to it and reports status to the Job Tracker. A Task Tracker runs on its own cluster node, which does not need to be the same host as the Job Tracker.
When the Job Tracker assigns a map or reduce task to a Task Tracker, the Task Tracker -
• Fetches job resources locally.
• Launches a child JVM on the node to execute the map or reduce task (see the sketch below).
• Reports status to the Job Tracker.
The child task launched by the Task Tracker runs the job's map or reduce function.
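The memory given to these child JVMs can be tuned from the job configuration. A minimal sketch, assuming the classic (MRv1) mapred.child.java.opts property applies to the Hadoop version in use:

// Sketch: passing JVM options to the child JVMs launched by Task Trackers.
// mapred.child.java.opts is the classic MRv1 property; newer releases split it
// into separate per-map and per-reduce properties.
Configuration conf = new Configuration();
conf.set("mapred.child.java.opts", "-Xmx512m");   // 512 MB heap per child task JVM
Job job = Job.getInstance(conf, "word count");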
MAP TASK
The Hadoop MapReduce framework creates a map task to process each InputSplit. The map task -
• Creates input key-value pairs by using the InputFormat to fetch the input data locally.
• Applies the job-supplied map function to each key-value pair (see the sketch after this list).
• Performs local sorting and aggregation of the results.
• Runs the Combiner for further aggregation if the job includes a Combiner.
• Stores the results locally in memory and on the local file system.
• Communicates with the Task Tracker about progress and status.
• Notifies the Task Tracker when the task is complete.
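As an example of a job-supplied map function, here is a minimal word-count Mapper; the class and field names are illustrative assumptions, not part of the framework.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word-count mapper: the framework calls map() once per input
// key-value pair produced by the InputFormat for this task's InputSplit.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) for local sort and combine
            }
        }
    }
}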
Map task results undergo a local sort by key to prepare the data for the reduce tasks. If a Combiner is configured for the job, it also runs in the map task. The Combiner consolidates the map output and reduces the amount of data that must be transferred to the reduce tasks.
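Because a Combiner implements the Reducer interface, enabling one is typically a single driver call; WordCountReducer is the hypothetical reducer from the driver sketch, and a sum-of-counts reducer is safe to reuse as a combiner.

// Sketch: register a Combiner so the map task can pre-aggregate its output
// before it is transferred to the reduce tasks (assumes the Job from the driver sketch).
job.setCombinerClass(WordCountReducer.class);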
When a map task notifies the Task Tracker that it has completed, the Task Tracker passes the notification on to the Job Tracker. The Job Tracker then makes the results available to the reduce tasks.
REDUCE TASK
The reduce phase aggregates the results from the map phase into final results. Normally the result set is smaller than the input set, but this is application dependent. The reduction is carried out by parallel reduce tasks.
The reduce input keys and values need not have the same type as the output keys and values. The reduce phase is optional, and a job can be configured to stop after the map phase completes (as sketched below). Reduce is carried out in three phases - copy, sort and reduce.
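For the map-only case mentioned above, the driver simply sets the number of reduce tasks to zero; with no reducers, the copy, sort and reduce phases are skipped and map output is written directly to the output path.

// Sketch: configure a map-only job (assumes the Job instance from the driver sketch).
job.setNumReduceTasks(0);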
A reduce task -
• Fetches the job resources locally.
• Runs the copy phase to fetch local copies of its share of the map results from the map worker nodes.
• Once the copy phase completes, runs the sort phase to merge the copied results into a single sorted set of (key, value-list) pairs.
• Once the sort phase completes, runs the reduce phase, invoking the job-supplied reduce function on each (key, value-list) pair.
• Saves the end results to the output destination (HDFS).
The input to a reduce function is a set of key-value pairs where each value is a list of values sharing the same key. When a reduce task notifies the Task Tracker that it has completed, the Task Tracker passes the notification on to the Job Tracker, and the end results are saved at the output destination (HDFS).
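To make the (key, value-list) input concrete, here is a minimal word-count Reducer matching the earlier sketches; the class name and summing logic are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count reducer: reduce() is invoked once per key with the
// merged, sorted list of values produced by the copy and sort phases.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // aggregate all values that share this key
        }
        result.set(sum);
        context.write(key, result);   // saved to the output destination (HDFS)
    }
}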