18mcs35e U4
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The Reduce task
then takes the output from a map as its input and combines those data tuples into a
smaller set of tuples. As the name MapReduce implies, the reduce task is always
performed after the map job.
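As a rough illustration of these two steps (independent of Hadoop; the class name MapReduceIdea and the toy data below are made up for illustration), the following plain-Java sketch maps lines of text into (word, 1) tuples and then reduces the tuples of each key into a count:

import java.util.*;
import java.util.stream.*;

public class MapReduceIdea {
   public static void main(String[] args) {
      List<String> lines = Arrays.asList("a b a", "b c");

      //Map: break each element into (key, value) tuples
      List<Map.Entry<String, Integer>> tuples = lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .map(word -> Map.entry(word, 1))
            .collect(Collectors.toList());

      //Reduce: combine the tuples for each key into a smaller set
      Map<String, Integer> counts = tuples.stream()
            .collect(Collectors.groupingBy(e -> e.getKey(),
                     Collectors.summingInt(e -> e.getValue())));

      System.out.println(counts);   //e.g. {a=2, b=2, c=1}
   }
}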
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But once an
application is written in the MapReduce form, scaling it to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration
change. This simple scalability is what has attracted many programmers to the
MapReduce model.
The Algorithm
• The MapReduce paradigm is generally based on sending the computation to where
the data resides, rather than moving the data to the computation.
• A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.
o Map stage − The map or mapper’s job is to process the input data.
Generally the input data is in the form of a file or directory and is stored in
the Hadoop Distributed File System (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and creates
several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The shuffle groups the intermediate key/value pairs by
key (a small sketch of this grouping follows this list), and the Reducer’s
job is to process the data that comes from the mapper. After processing,
it produces a new set of output, which will be stored in HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks, which
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
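As referenced in the reduce-stage bullet above, the following plain-Java sketch imitates what the shuffle stage does (the class name ShuffleSketch and the intermediate pairs are made up, not part of Hadoop): it groups the intermediate key/value pairs emitted by the mappers by key, so that each reduce call receives one key together with all of its values.

import java.util.*;

public class ShuffleSketch {
   public static void main(String[] args) {
      //Intermediate pairs as the mappers might emit them (made-up values)
      String[][] mapOutput = { {"1980", "26"}, {"1981", "34"}, {"1980", "29"} };

      //Shuffle: group the values of each key together
      Map<String, List<Integer>> grouped = new TreeMap<>();
      for (String[] pair : mapOutput) {
         grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                .add(Integer.parseInt(pair[1]));
      }

      //Reduce: each key and its list of values go to one reduce call
      grouped.forEach((year, values) ->
            System.out.println(year + " -> " + values));
      //1980 -> [26, 29]
      //1981 -> [34]
   }
}

In Hadoop this grouping, along with sorting and moving data between nodes, is performed by the framework itself; the programmer only supplies the map and reduce functions.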
(Figure: MapReduce data flow from input to output.)
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form
the core of the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate
key/value pairs (see the interface sketch after this list).
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing
takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the assigned jobs on the Task Tracker.
• Task Tracker − Tracks the task and reports status to the JobTracker.
• Job − An execution of a Mapper and a Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
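For orientation, the Mapper and Reducer terms above correspond to interfaces in the classic org.apache.hadoop.mapred API used later in this unit. The following is a simplified, self-contained imitation of those interfaces, not the real declarations (the real ones also take a Reporter argument and extend further interfaces):

import java.io.IOException;
import java.util.Iterator;

//Simplified stand-ins for the classic org.apache.hadoop.mapred interfaces
interface OutputCollector<K, V> {
   void collect(K key, V value) throws IOException;
}

interface Mapper<K1, V1, K2, V2> {
   //Maps one input (key, value) pair to intermediate pairs via the collector
   void map(K1 key, V1 value, OutputCollector<K2, V2> output) throws IOException;
}

interface Reducer<K2, V2, K3, V3> {
   //Combines all intermediate values that share one key
   void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output)
         throws IOException;
}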
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It
contains the monthly electrical consumption and the annual average for various years.
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and
produce results such as finding the year of maximum usage, the year of minimum
usage, and so on. This is a walkover for programmers with a finite number of records:
they will simply write the logic to produce the required output and pass the data to
the application they have written.
But think of the data representing the electrical consumption of all the large-scale
industries of a particular state since its formation.
When we write applications to process such bulk data,
• They will take a lot of time to execute.
• There will be heavy network traffic when we move the data from its source to
the network server, and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as
shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable, Text, Text, IntWritable> {

      //Map function: emits a (year, annual average) pair for each input line
      public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();

         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }

         //The last token on each line is the annual average
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase implements
   Reducer<Text, IntWritable, Text, IntWritable> {

      //Reduce function: emits the years whose average exceeds the threshold
      public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_electricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      //Input and output paths are taken from the command-line arguments
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}
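Note that the reducer, which also serves as the combiner here, emits a year only when its value exceeds the hard-coded threshold of 30 units; the final output therefore lists the years of comparatively high average consumption.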
Save the above program as ProcessUnits.java. The compilation and execution of the
program are explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command creates a directory to store the compiled Java classes.
$ mkdir units
Step 2
Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program, and place it in the current directory.
Step 3
The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt to the input
directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the ProcessUnits application by taking the
input files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the job is executed. After execution, as shown below, the output
will contain the number of input splits, the number of Map tasks, the number of
reducer tasks, etc.
14/10/31 06:02:52 INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
   File System Counters
   Map-Reduce Framework
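Once the job completes, the output can be verified with the usual HDFS commands; note that the file name part-00000 assumes the default output naming with a single reducer.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000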