
MAP REDUCE

Mrs. P. Sridevi, CSE-IT Dept.


UNIT -III
• Writing MapReduce Programs:
• A Weather Dataset,
• Understanding Hadoop API for MapReduce Framework
(Old and New),
• Basic programs of Hadoop MapReduce: WordCount
• Driver code,
• Mapper code,
• Reducer code,
• RecordReader,
• Combiner, Partitioner
Hadoop components
• Hadoop has two core components: HDFS (distributed storage) and MapReduce (distributed processing).

MR Design pattern
• Used in
• Indexing and Search
• Classification
• Summarization
• Recommendation Systems
• Analytics
MR Features
• It’s a Programming Model
• Large Scale Distributed Model
• Parallel Programming
Functions in MR
• Mapper Map()
• Reducer Reduce()
Implementation in Hadoop
File input formats
• TextInputFormat: <k, v> = <byte offset, line>
• KeyValueTextInputFormat: <k, v> = <key field, rest of line>
• SequenceFileInputFormat: <k, v> read from a sequence file
• Binary data (images, video, audio) is packed as <filename, file contents> pairs and stored in sequence files
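A minimal driver fragment for choosing the input format (a sketch, assuming a Job object named job as in the WordCount driver shown later in this unit):

// Pick how the input file is turned into <key, value> records for the mapper.
// TextInputFormat (the default): key = byte offset (LongWritable), value = the line (Text).
// KeyValueTextInputFormat: each line is split at the first tab into a Text key and a Text value.
// SequenceFileInputFormat: reads binary sequence files of <key, value> pairs.
job.setInputFormatClass(
        org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);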
Architecture overview
• The user submits a job to the master node, which runs the Job Tracker.
• Slave node 1 … Slave node N each run a Task Tracker.
• Each Task Tracker manages the map/reduce worker tasks on its node.


Map + Reduce
• Data flow: very big data → MAP → partitioning function → REDUCE → result
• Map: accepts an input key/value pair and emits intermediate key/value pairs.
• Reduce: accepts an intermediate key/value pair and emits output key/value pairs.
Diagram (1)
SCALING OUT
• A small input can be processed on a standalone system: the data sits in the
local file system and the computation runs there. This is useful for testing the
MR programming model.
• When the data grows, we move to a distributed system, where each data node
processes its part of the data in parallel.
Diagram (2)
Input file (in GBs)
• split into INPUT SPLIT 0, INPUT SPLIT 1, INPUT SPLIT 2, INPUT SPLIT 3, each a 64 MB block
• a RECORD READER (RR0 … RR3) turns each split into <key, value> records
• MAPPER 0 … MAPPER 3 tokenize the records and emit <key, value> pairs
• SHUFFLE AND SORT groups the intermediate data into <key2, list[ ]>


Points need to be emphasized
• No reduce can begin until map is complete
• Master must communicate locations of
intermediate files
• Tasks scheduled based on location of data
• If map worker fails any time before reduce
finishes, task must be completely rerun
• MapReduce library does most of the hard
work.
INPUT FILE (read via TextInputFormat)

MapReduce programs process data in two phases:
1. MAP PHASE: input <key, value> → output <key, value>
2. REDUCE PHASE: input <key, value> → output <key, value>

OUTPUT FILE

A MapReduce program has two functions:
1. map()
2. reduce()
INPUT FILE → INPUT SPLIT (blocks) → MAPPER
• Each input split is a 64/128 MB block
• One map task is applied to each split
• Each line is considered a record
• The number of records = the number of times the mapper runs within the split
• There are as many mappers as there are input splits
INPUT FILE → INPUT SPLIT → RECORD READER → MAPPER
• The record reader presents each record to the mapper as <key, value> = <byte offset, line>
Working of the MapReduce flow
INPUT FILE (in GBs) → INPUT SPLIT (128 MB block) → RECORD READER (RR)
<key, value> = <byte offset, line> → MAPPER (tokenizes) → <key, value>, e.g. <How, 1>
→ shuffle and sort → REDUCER (aggregation) → RECORD WRITER → result/output file

• Hadoop runs the map task on the node where the input data resides in HDFS.
This is called data locality optimization.
• The output of the mapper is written to local disk, because it is only
intermediate output; it is then processed by the reduce task.
• The final output comes from the reducer. There can be a single reducer or
multiple reducers.
• If there are multiple reducers, the output of the mapper is partitioned across
the various reducer nodes.
NCDC weather dataset (sample records)
0029029170999991909010106004+62900+027667FM-12+009099999V0202001N001019999999N0000001N9-01061+99999102241ADDGF108991999999999999999999
0029029170999991909010113004+62900+027667FM-12+009099999V0209991C000019999999N0000001N9-00781+99999102031ADDGF108991999999999999999999
0029029170999991909010120004+62900+027667FM-12+009099999V0202001N001019999999N0000001N9-00501+99999101781ADDGF108991999999999999999999
0029029170999991909010206004+62900+027667FM-12+009099999V0202301N001019999999N0000001N9-00721+99999101141ADDGF108991999999999999999999
0035029170999991909010213004+62900+027667FM-12+009099999V0202301N002619999999N0000001N9-00221+99999100081ADDGF108991999999999999999999
0029029170999991909010220004+62900+027667FM-12+009099999V0202301N001019999999N0000001N9+00061+99999099671ADDGF104991999999999999999999
0029029170999991909010306004+62900+027667FM-12+009099999V0202701N001019999999N0000001N9-00111+99999099861ADDGF100991999999999999999999
0029029170999991909010313004+62900+027667FM-12+009099999V0209991C000019999999N0000001N9-00221+99999100111ADDGF104991999999999999999999
0029029170999991909010320004+62900+027667FM-12+009099999V0202501N002619999999N0000001N9+00061+99999099321ADDGF108991999999999999999999
0029029170999991909010406004+62900+027667FM-12+009099999V0202701N001019999999N0000001N9+00111+99999099101ADDGF100991999999999999999999
0029029170999991909010413004+62900+027667FM-12+009099999V0202901N001019999999N0000001N9+00111+99999098951ADDGF100991999999999999999999
0029029170999991909010420004+62900+027667FM-12+009099999V0203201N002619999999N0000001N9+00001+99999098941ADDGF100991999999999999999999
0029029170999991909010506004+62900+027667FM-12+009099999V0202901N002619999999N0000001N9-00171+99999099341ADDGF100991999999999999999999
0029029170999991909010513004+62900+027667FM-12+009099999V0203201N001019999999N0000001N9-00221+99999099961ADDGF100991999999999999999999
0029029170999991909010520004+62900+027667FM-12+009099999V0203201N001019999999N0000001N9-00391+99999100181ADDGF100991999999999999999999
0029029170999991909010606004+62900+027667FM-12+009099999V0209991C000019999999N0000001N9-00561+99999100431ADDGF108991999999999999999999
FILE SIZE IS 2 GB, BLOCK SIZE IS 64 MB
(2048 MB / 64 MB = 32 blocks, so 32 input splits and 32 map tasks)
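A hedged sketch of a mapper for this weather dataset (the classic max-temperature
example): it pulls the year and the air temperature out of each fixed-width NCDC
record and emits <year, temperature>. The character offsets used here (year at
15-19, signed temperature at 87-92, quality flag at 92) follow the commonly
documented NCDC layout and should be verified against the actual records; a
matching reducer would simply keep the maximum value per year.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
  private static final int MISSING = 9999;   // value used in the data for a missing reading

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException
  {
    String line = value.toString();
    String year = line.substring(15, 19);              // assumed year field
    int airTemperature;
    if (line.charAt(87) == '+') {                      // temperature is signed, in tenths of a degree
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);           // assumed quality-code position
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}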
Sample text input
• Hi students
• How are you
• How is your BDA class
• How many students are there in the class
• How are the students feeling about online
class
• Many students are present today.
• I love teaching BDA
Shuffle
Input to the Reducer is the sorted output
of the mappers. In this phase the
framework fetches the relevant partition
of the output of all the mappers.
Sort
The framework groups Reducer inputs by
keys (since different mappers may have
output the same key) in this stage.
The shuffle and sort phases occur
simultaneously; while map-outputs are
being fetched they are merged.
File output formats
TextOutputFormat - plain text files
SequenceFileOutputFormat - sequence files
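A minimal driver fragment for choosing the output format (a sketch, assuming the
Job object job from the WordCount driver later in this unit):

// TextOutputFormat (the default) writes one plain-text "key <TAB> value" line per record.
// SequenceFileOutputFormat writes compact binary sequence files, handy as input to a later MR job.
job.setOutputFormatClass(
        org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);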

MAPPER PHASE: <k1, v1> → list(<k2, v2>)
COMBINER: <k2, list(v2)> → list(<k2, v2>)
REDUCER PHASE: <k2, list(v2)> → list(<k3, v3>)
PACKAGES AND LIBRARIES
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountMapper extends
    Mapper<LongWritable, Text, Text, IntWritable>
{
  // Called once per input record: key = byte offset of the line, value = the line itself
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException
  {
    Text word = new Text();
    String line = value.toString();
    StringTokenizer s = new StringTokenizer(line);   // split the line into words
    while (s.hasMoreTokens())
    {
      word.set(s.nextToken());
      context.write(word, new IntWritable(1));       // emit <word, 1> for every token
    }
  }
}
public class WordCountReducer extends
    Reducer<Text, IntWritable, Text, IntWritable>
{
  // Called once per key with the list of counts emitted for that word
  public void reduce(Text key, Iterable<IntWritable> value,
      Context context) throws IOException, InterruptedException
  {
    int sum = 0;
    for (IntWritable val : value)
    {
      sum = sum + val.get();   // add up all the 1s for this word
    }
    context.write(key, new IntWritable(sum));
  }
}
public class WordCountDriver
{
  public static void main(String[] args)
      throws IOException, InterruptedException, ClassNotFoundException
  {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input file or directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    if (!job.waitForCompletion(true))
      return;
  }
}
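As a usage sketch (the jar name and HDFS paths here are placeholders): compile the
three classes, package them into a jar, and submit the job with
hadoop jar wordcount.jar WordCountDriver /user/input /user/output
where the output directory must not already exist.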
Mapper output
• INPUT (two lines):
• Hello World Bye World
• Hello Hadoop Goodbye Hadoop
• For the given sample input the first map
emits:
• < Hello, 1>
• < World, 1>
• < Bye, 1>
• < World, 1>
For the second input line (Hello Hadoop Goodbye Hadoop) the second map emits:
• < Hello, 1>
• < Hadoop, 1>
• < Goodbye, 1>
• < Hadoop, 1>
Input to the reducer (the map output after shuffle and sort, grouped by key):
• < Bye, 1>
• < Goodbye, 1>
• < Hadoop, [1,1]>
• < Hello, [1,1]>
• < World, [1,1]>
OUTPUT OF REDUCER
• < Bye, 1>
• < Goodbye, 1>
• < Hadoop, 2>
• < Hello, 2>
• < World, 2>
How Many Maps?
The number of maps is usually driven by the total size
of the inputs, that is, the total number of blocks of the
input files.
The right level of parallelism for maps seems to be around 10-100 maps per node.
Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll
end up with about 82,000 maps.
Configuration.set(MRJobConfig.NUM_MAPS, int) can be used to hint the desired
number of maps.
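A worked sketch of the arithmetic and of the hint (assuming the driver's
Configuration object conf; the value is only a hint, since the real count is
driven by the input splits):

// 10 TB of input at a 128 MB block size:
// 10 * 1024 * 1024 MB / 128 MB per block = 81,920 blocks, i.e. roughly 82,000 map tasks.
// MRJobConfig.NUM_MAPS ("mapreduce.job.maps") only hints the desired number of maps:
conf.setInt(org.apache.hadoop.mapreduce.MRJobConfig.NUM_MAPS, 100);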
Reducer
• Reducer has 3 primary phases: shuffle, sort and reduce.
• Reducer reduces a set of intermediate values which share a key to a smaller set
of values.
• The number of reducers for the job is set by the user via
Job.setNumReduceTasks(int).
• The framework then calls the reduce(WritableComparable, Iterable<Writable>,
Context) method for each <key, (list of values)> pair in the grouped inputs.
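A one-line sketch in the driver (assuming the Job object job):

// Ask for 4 reduce tasks; the framework will call reduce() once for every
// distinct key in each reducer's grouped input.
job.setNumReduceTasks(4);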
• a MapReduce job takes a set of input key-value pairs
and produces a set of output key-value pairs by
passing the data through map and reduce functions.
• Map tasks deal with splitting and mapping of the data.
• Reduce tasks shuffle and reduce the data; the reduce tasks consolidate the data
into the final results.
• MapReduce programs are parallel in nature, thus are
very useful for performing large-scale data analysis
• The input to each phase is key-value pairs.
• Map Reduce programs are performed by multiple
machines in a cluster.
Map only job
• Consider a scenario where we just need to perform an operation and no
aggregation is required; in such a case we prefer a 'Map-Only job' in Hadoop.
• In a Hadoop Map-Only job, the map does all the work on its InputSplit and no
work is done by the reducer. Here the map output is the final output.
• This is done by calling job.setNumReduceTasks(0) in the driver, which sets the
number of reducers to 0.
• With 0 reducers, no reducer executes and no aggregation takes place.
• In a Map-Only job, the map does all the work on its InputSplit and the reducer
does no work; the mapper output is the final output.
• In a normal job, between the map and reduce phases there is a sort and shuffle
phase, which sorts the keys in ascending order and then groups values by key.
This phase is very expensive.
Map only job cont...
• Avoiding the reduce phase eliminates the sort and shuffle phase as well, which
also avoids network congestion: during shuffling the output of the mapper travels
to the reducer, and when the data size is huge a lot of data has to move.
• In a normal MapReduce job, the mapper output is written to local disk before
being sent to the reducer.
• In a map-only job, this output is written directly to HDFS, which further saves
time and reduces cost.
• In between the map and reduce phases there is a sort and shuffle phase, which
sorts the keys in ascending order and then groups values by the same key.
• The output of the mapper is written to local disk before being sent to the
reducer, but in a map-only job this output is written directly to HDFS, which
further saves time and reduces cost.
• There is no need for a partitioner or combiner in a Hadoop map-only job, which
makes the process fast.
• A map-only job in Hadoop reduces network congestion by avoiding the shuffle,
sort and reduce phases. The mapper takes care of the overall processing and
produces the output. We achieve this by calling job.setNumReduceTasks(0), as
sketched below.
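A minimal map-only driver sketch, reusing the WordCountMapper from this unit
(the class name and paths are otherwise placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver
{
  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map only word count");
    job.setJarByClass(MapOnlyDriver.class);
    job.setMapperClass(WordCountMapper.class);   // reuse the mapper shown earlier
    job.setNumReduceTasks(0);                    // no reducer: map output is written straight to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}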
Combiners & Partitioners
Combiners
• Combiners are used to reduce the amount of data being transferred over the
network.
• They are used to optimize the usage of network bandwidth.
• Combiners are also called local reducers.
• They run on the mapper output, on the same machine where the mapper executed.
• Hadoop may call the combiner function zero, one or many times for a particular
map output record.
WORD COUNT PROCESS (with a combiner)
• 1st map output:
<Hadoop, 1>  <Hadoop, 1>  <Hadoop, 1>  <Hadoop, 1>
• The combiner function is called with <Hadoop, [1,1,1,1]> and emits <Hadoop, 4>.
• The reduce method would then be called with <Hadoop, 4> (merged with the values
from the other maps).
• A Combiner, also known as a semi-reducer, is an optional class that accepts the
inputs from the Map class and passes its output key-value pairs on to the Reducer
class.
• The main function of a Combiner is to summarize the map output records with the
same key.
• The output (key-value collection) of the combiner is sent over the network to
the actual Reducer task as input.
• The combiner function does not replace the reduce method.
• A combiner is implemented by extending the Reducer class and is wired in as
shown below.
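Because WordCount's summation is associative and commutative, the existing
WordCountReducer can also serve as the combiner. A minimal sketch of wiring it
into the WordCountDriver shown earlier:

// In WordCountDriver.main(), after setMapperClass/setReducerClass:
job.setCombinerClass(WordCountReducer.class);   // runs a local reduce on each mapper's output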
Partitioner
• A partitioner works like a condition applied while processing the input dataset.
• The partition phase takes place after the Map phase and before the Reduce phase.
• The number of partitions is equal to the number of reducers.
• That means a partitioner will divide the data according to the number of
reducers.
• Therefore, the data in a single partition is processed by a single Reducer.
• The partitioner decides which reducer each output from the map stage is sent
to; it partitions the key space.
• A partitioner partitions the key-value pairs of the intermediate map outputs.
• It partitions the data using a user-defined condition, which works like a hash
function.
• The total number of partitions is the same as the number of Reducer tasks for
the job.
• The difference between a partitioner and a combiner is that the partitioner
divides the data according to the number of reducers, so that all the data in a
single partition gets processed by a single reducer. A sketch of a user-defined
partitioner follows.
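A hedged sketch of a user-defined partitioner (the routing rule used here, the
first letter of the word, is only an illustration): it must be registered in the
driver, and the number of reduce tasks should match the partitions it produces.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends words starting with a-m to reducer 0 and all other words to reducer 1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable>
{
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions)
  {
    char first = Character.toLowerCase(key.toString().charAt(0));
    int partition = (first >= 'a' && first <= 'm') ? 0 : 1;
    return partition % numPartitions;   // stay in range even if fewer reducers are configured
  }
}

// In the driver:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
//   job.setNumReduceTasks(2);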
DIFFERENCE BETWEEN COMBINERS AND PARTITIONERS
• Combiners can be viewed as mini-reducers in the map phase. They perform a local
reduce on the mapper results before those results are distributed further. Once
the combiner has run, its output is passed on to the Reducer for further work.
• Partitioners come into the picture when we are working with more than one
Reducer. The partitioner decides which reducer is responsible for a particular
key: it takes the mapper result (or the combiner result, if a combiner is used)
and sends it to the responsible reducer.
MAPREDUCE FEATURES
• MapReduce is a popular programming model
for processing large datasets in parallel across
a distributed cluster of computers.
• Developed by Google, it has become an
essential component of the Hadoop
ecosystem, enabling efficient data processing
and analysis.
• MapReduce Fundamentals
• MapReduce is designed to address the challenges associated
with processing massive amounts of data by breaking the
problem into smaller, more manageable tasks. It consists of two
primary functions: Map and Reduce, which work together to
process and analyze data.
• The Map Function
• The Map function takes input data and processes it into
intermediate key-value pairs. It applies a user-defined function
to each input record, generating output pairs that are then
sorted and grouped by key.
• The Reduce Function
• The Reduce function processes the intermediate key-value pairs
generated by the Map function. It aggregates, filters, or
combines the data based on a user-defined function, generating
the final output.
