Bda Unit III r20csm
Diagram: the user submits a Hadoop job to the Job tracker.
• Map:
  – Accepts an input key/value pair
  – Emits an intermediate key/value pair
• Reduce:
  – Accepts an intermediate key/value pair
  – Emits an output key/value pair
(A minimal Java sketch follows the diagram.)
Diagram (1)
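As a first illustration of this contract (the class name PassThroughMapper and the Writable types here are assumptions for illustration, not from the slides), a mapper that simply forwards each input pair as its intermediate pair:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a mapper accepts an input <key, value> pair and emits an
// intermediate <key, value> pair (here it forwards the pair unchanged).
public class PassThroughMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // emit the intermediate key/value pair
    }
}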
SCALING OUT
• A small input can be processed on a standalone system. This is useful
  for testing the MR programming model: the data sits in the local file
  system and the computation runs in a single JVM.
• If the data is large, we should move to a distributed system, so that
  each data node processes its part of the data in parallel (see the
  configuration sketch below).
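A sketch of how the same job can be pointed at either mode. The property keys are standard Hadoop configuration keys; the hdfs:// address and class name are assumptions about a typical setup, not from the slides:

import org.apache.hadoop.conf.Configuration;

public class ModeConfig {
    // Standalone/local mode: the whole job runs in one JVM over the
    // local file system, handy for testing the MR programming model.
    static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        return conf;
    }

    // Distributed mode (assumes a configured YARN cluster and HDFS):
    // each data node processes its part of the data in parallel.
    static Configuration clusterConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // address assumed
        return conf;
    }
}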
Diagram (2)
INPUT FILE (in GBs)
  -> INPUT SPLIT (128 MB block)
  -> RECORD READER (RR): emits <byteOffset, line> pairs
  -> MAPPER (tokenizes): emits <key, value> pairs, e.g. <How, 1>
  -> Shuffle and sort
  -> REDUCER (aggregation)
  -> RECORD WRITER (final output)

• Hadoop runs the map task on the node where the input data resides in
  HDFS. This is called data locality optimization.
• The output of the mapper is written to local disk, because it is only
  intermediate output; it is processed further by the reduce task.
• The final output comes from the reducer. There can be a single reducer
  or multiple reducers.
• If there are multiple reducers, the output of the mappers is
  partitioned across the nodes (a sketch of the default partitioning
  follows).
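With multiple reducers, a partitioner decides which reducer receives each map output record. By default Hadoop uses HashPartitioner; the sketch below reproduces its logic (the class name WordPartitioner is an assumption for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default HashPartitioner logic: every record with the
// same key goes to the same reduce task, so all counts for one word
// are aggregated in one place.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}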
MAPPER PHASE:
<k1, v1> -> list(<k2, v2>)
COMBINER:
<k2, list(v2)> -> list(<k2, v2>)
REDUCER PHASE:
<k2, list(v2)> -> list(<k3, v3>)
PACKAGES AND LIBRARIES
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line.
        Text word = new Text();
        String line = value.toString();
        StringTokenizer s = new StringTokenizer(line);
        while (s.hasMoreTokens()) {
            // Emit <word, 1> for every token in the line.
            word.set(s.nextToken());
            context.write(word, new IntWritable(1));
        }
    }
}
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Sum the counts emitted for this word across all mappers.
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        // Emit <word, total count> as the final output.
        context.write(key, new IntWritable(sum));
    }
}
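The slides give the mapper and reducer, but a job also needs a driver that wires them together. A minimal sketch (the class name WordCountDriver and the argument layout are assumptions, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Word count is associative and commutative, so the reducer can
        // also act as the combiner: <k2, list(v2)> -> list(<k2, v2>).
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file (in GBs)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir, must not exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with two arguments, an input path and a not-yet-existing output path; the reducer's <word, count> pairs are written there by the record writer.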