MapReduce Notes
MapReduce is a data processing tool used to process data in parallel in a distributed environment. It
was developed in 2004, based on the Google paper "MapReduce: Simplified Data Processing on Large
Clusters."
MapReduce is a paradigm with two phases: the Mapper phase and the Reducer phase. The Mapper takes
its input in the form of key-value pairs, and its output is fed to the Reducer as input. The Reducer runs
only after the Mapper has finished. The Reducer also takes its input in key-value format, and the output
of the Reducer is the final output. For example, in a word-count job the Mapper emits a pair such as
("cat", 1) for every occurrence of a word, and the Reducer receives ("cat", [1, 1, 1]) and emits ("cat", 3).
Input reader
The input reader reads the incoming data and splits it into data blocks of an appropriate size (64 MB
to 128 MB). Each data block is associated with a Map function.
Once the input reader has read the data, it generates the corresponding key-value pairs. The input files reside in HDFS.
Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-
value pairs. The map input and output types may differ from each other.
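As a minimal sketch (using the classic word-count example, with class and field names chosen here for illustration), a Mapper whose input types (LongWritable offset, Text line) differ from its output types (Text word, IntWritable count) could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: reads one line per call and emits (word, 1) pairs.
// Note that the input types differ from the output types.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for each token.
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}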
Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given
the key and value, along with the number of reducers, and returns the index of the target reducer.
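A sketch of a custom partition function (the class name and first-letter routing scheme are illustrative; Hadoop's default HashPartitioner uses the same hash-modulo idea):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Given a key, a value, and the number of reducers, getPartition() returns
// the index of the reducer that should receive this key-value pair.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // a map-only job has a single partition
        }
        // Route keys by their first character, so all words starting with
        // the same letter end up on the same reducer.
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It is registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).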
Shuffling and sorting
The data is shuffled between and within nodes so that it moves out of the map phase and becomes
ready for the reduce function to process. The shuffling of data can sometimes take considerable
computation time.
The sorting operation is performed on the input data for the Reduce function. Here, the data is
compared using a comparison function and arranged in sorted order.
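As a sketch, this comparison function can be replaced; the comparator below (an illustrative example, not part of the standard flow) delivers keys to the Reducer in descending rather than ascending order:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Custom sort comparator: negates the natural Text ordering so that keys
// arrive at the Reducer in descending order.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true); // true: instantiate keys for comparison
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b);
    }
}

It is registered with job.setSortComparatorClass(DescendingTextComparator.class).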
Reduce function
The Reduce function is invoked once for each unique key. The keys are already arranged in sorted
order. The function iterates over the values associated with each key and generates the corresponding
output.
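A minimal sketch of the matching word-count Reducer (names again illustrative): it is called once per unique key, iterates over the grouped values, and emits the final pair:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count Reducer: receives ("word", [1, 1, ...]) and emits ("word", n).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get(); // accumulate the 1s emitted by the Mapper
        }
        context.write(key, new IntWritable(sum));
    }
}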
Output writer
Once the data has flowed through all the above phases, the output writer executes. Its role is to write
the Reduce output to stable storage.
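The output writer is configured through the job's OutputFormat. As a minimal fragment (assuming a configured Job object named job; the HDFS path is illustrative, and the classes live in org.apache.hadoop.mapreduce.lib.output and org.apache.hadoop.fs):

// TextOutputFormat, the default OutputFormat, writes one "key<TAB>value"
// line per Reduce output record to the given directory on stable storage.
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));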
MapReduce API
In this section, we focus on the MapReduce APIs and the classes and methods used in MapReduce
programming.
Mapper class
The Mapper class maps input key-value pairs to a set of intermediate key-value pairs. The intermediate records associated with a given output key are grouped and passed to the Reducer for the final output. Its main methods are:

void map(KEYIN key, VALUEIN value, Context context): This method is called once for each key-value pair in the input split.
void run(Context context): This method can be overridden to control the execution of the Mapper.
void setup(Context context): This method is called only once, at the beginning of the task.
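For example, setup() is typically overridden to perform one-time initialization before map() is called for each record. A sketch (the property name wordcount.min.length is a made-up, user-defined parameter):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that reads a job parameter once in setup(), then uses it in map().
public class ThresholdMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private int minLength; // initialized once per task in setup()

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        minLength = conf.getInt("wordcount.min.length", 1); // illustrative property
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit only tokens at least minLength characters long.
        for (String token : value.toString().split("\\s+")) {
            if (token.length() >= minLength) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }
}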
Job class
The Job class configures a MapReduce job, submits it, controls its execution, and queries its state. Its commonly used methods include:

Counters getCounters(): This method is used to get the counters for the job.
long getFinishTime(): This method is used to get the finish time for the job.
Job getInstance(): This method is used to generate a new Job without any cluster.
Job getInstance(Configuration conf): This method is used to generate a new Job without any cluster, using the provided configuration.
Job getInstance(Configuration conf, String jobName): This method is used to generate a new Job without any cluster, using the provided configuration and job name.
String getJobFile(): This method is used to get the path of the submitted job configuration.
String getJobName(): This method is used to get the user-specified job name.
JobPriority getPriority(): This method is used to get the scheduling priority of the job.
void setJarByClass(Class<?> cls): This method is used to set the job's jar by finding where the given class came from.
void setJobName(String name): This method is used to set the user-specified job name.
void setMapOutputKeyClass(Class<?> theClass): This method is used to set the key class for the map output data.
void setMapOutputValueClass(Class<?> theClass): This method is used to set the value class for the map output data.
void setMapperClass(Class<? extends Mapper> cls): This method is used to set the Mapper for the job.
void setNumReduceTasks(int tasks): This method is used to set the number of reduce tasks for the job.
void setReducerClass(Class<? extends Reducer> cls): This method is used to set the Reducer for the job.
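A sketch of a driver that exercises several of the Job methods above, wiring in the WordCountMapper and WordCountReducer sketched earlier (the paths and reducer count are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that configures and submits a word-count job using the Job API.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // getInstance(Configuration, String): new Job with a config and name.
        Job job = Job.getInstance(conf, "word count");

        // setJarByClass: locate the jar by finding where this class came from.
        job.setJarByClass(WordCountDriver.class);

        // Wire in the Mapper and Reducer sketched earlier.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key and value classes for the map output and the final output.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // setNumReduceTasks: number of reduce tasks for the job.
        job.setNumReduceTasks(2);

        // Input and output locations on HDFS (illustrative paths).
        FileInputFormat.addInputPath(job, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}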