Introduction to the Map-Reduce framework
Introduction
• Map-Reduce is a programming model designed
for processing large volumes of data in parallel
by dividing the work into a set of independent
tasks.
• Map-Reduce programs are written in a
particular style influenced by functional
programming constructs, specifically idioms for
processing lists of data.
• This module explains the nature of this
programming model and how it can be used to
write programs which run in the Hadoop
environment.
Map-Reduce Basics
• 1. List Processing
• Conceptually, Map-Reduce programs
transform lists of input data elements into lists
of output data elements.
• A Map-Reduce program will do this twice,
using two different list processing idioms:
map and reduce.
• These terms are taken from list processing
languages such as LISP, Scheme, and ML.
Map-Reduce Basics
• 2. Mapping Lists
• The first phase of a Map-Reduce program is called
mapping.
• A list of data elements is provided, one at a time, to a
function called the Mapper, which transforms each
element individually into an output data element.
Map-Reduce Basics
• 3. Reducing Lists
• Reducing lets you aggregate values together.
• A reducer function receives an iterator of input values from an input list.
• It then combines these values together, returning a single output value.
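A tiny illustration of the two list-processing idioms using plain Java streams (this is not Hadoop code; the class name and example values are only illustrative): map() transforms each element independently, and reduce() folds the resulting list into a single output value.

  import java.util.Arrays;
  import java.util.List;
  import java.util.stream.Collectors;

  public class ListIdioms {
      public static void main(String[] args) {
          List<String> words = Arrays.asList("map", "reduce", "hadoop");

          // map: transform each element individually (word -> its length)
          List<Integer> lengths = words.stream()
                                       .map(String::length)
                                       .collect(Collectors.toList());

          // reduce: combine all values into one result (total length)
          int total = lengths.stream().reduce(0, Integer::sum);

          System.out.println(lengths + " -> " + total);  // [3, 6, 6] -> 15
      }
  }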
Map-Reduce Basics
• 4. Putting Them Together in Map-Reduce
• The Hadoop Map-Reduce framework
takes these concepts and uses them to
process large volumes of information.
• A Map-Reduce program has two
components:
– a Mapper
– and a Reducer.
The Mapper and Reducer idioms described
above are extended slightly to work in this
environment, but the basic principles are the same.
Example (word count)
  mapper (filename, file-contents):
    for each word in file-contents:
      emit (word, 1)

  reducer (word, values):
    sum = 0
    for each value in values:
      sum = sum + value
    emit (word, sum)
Example (word count)
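As a hedged sketch, the same word count written in Java against the classic org.apache.hadoop.mapred API that the rest of this module refers to (Mapper, Reducer, OutputCollector, Reporter, JobConf). The class name, tokenization, and argument handling are illustrative, not prescribed by the slides.

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {

      // Mapper: for every word in the input line, emit (word, 1).
      public static class Map extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  output.collect(word, ONE);
              }
          }
      }

      // Reducer: sum the counts emitted for each word and emit (word, sum).
      public static class Reduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
          }
      }

      // Driver: configure and submit the job (input/output paths from args).
      public static void main(String[] args) throws IOException {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
      }
  }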
Map-Reduce Data Flow
Now that we have seen the components that make up a basic MapReduce job, we can
see how everything works together at a higher level:
Data flow
• Map-Reduce inputs typically come from
input files loaded onto our processing
cluster in HDFS.
• These files are distributed across all our
nodes.
• Running a Map-Reduce program involves
running mapping tasks on many or all of
the nodes in our cluster.
• Each of these mapping tasks is equivalent:
– no mappers have particular "identities"
associated with them.
Data flow
• When the mapping phase has completed, the
intermediate (key, value) pairs must be
exchanged between machines to send all
values with the same key to a single reducer.
• The reduce tasks are spread across the
same nodes in the cluster as the mappers.
• This is the only communication step in
MapReduce.
• Individual map tasks do not exchange
information with one another, nor are they
aware of one another's existence.
• Similarly, different reduce tasks do not
communicate with one another.
Data flow
• The user never explicitly marshals information
from one machine to another; all data transfer is
handled by the Hadoop Map-Reduce platform
itself, guided implicitly by the different keys
associated with values.
• This is a fundamental element of Hadoop Map-
Reduce's reliability.
• If nodes in the cluster fail, tasks must be able to be
restarted.
• If they have been performing side-effects, e.g.,
communicating with the outside world, then the
shared state must be restored in a restarted task.
• By eliminating communication and side-effects,
restarts can be handled more gracefully.
Input files
• This is where the data for a Map-Reduce
task is initially stored.
• While this does not need to be the case,
the input files typically reside in HDFS.
• The format of these files is arbitrary; while
line-based log files can be used, we could
also use a binary format, multi-line input
records, or something else entirely.
• It is typical for these input files to be very
large -- tens of gigabytes or more.
InputFormat
• How these input files are split up and read is defined by the
InputFormat.
• An InputFormat is a class that provides the following
functionality:
– Selects the files or other objects that should be used for input
– Defines the InputSplits that break a file into tasks
– Provides a factory for RecordReader objects that read the file
• Several InputFormats are provided with Hadoop.
• FileInputFormat is an abstract base type; all InputFormats
that operate on files inherit functionality and properties from
this class.
• When starting a Hadoop job, FileInputFormat is provided with
a path containing files to read.
• The FileInputFormat will read all files in this directory. It then
divides these files into one or more InputSplits each.
• You can choose which InputFormat to apply to your input files
for a job by calling the setInputFormat() method of the
JobConf object that defines the job.
• The default InputFormat is the TextInputFormat.
– This is useful for unformatted data or line-based records
like log files.
• A more interesting input format is the KeyValueInputFormat.
• This format also treats each line of input as a separate record.
While the TextInputFormat treats the entire line as the value,
the KeyValueInputFormat breaks the line itself into the key and
value by searching for a tab character.
• This is particularly useful for reading the output of one MapReduce
job as the input to another.
• Finally, the SequenceFileInputFormat reads special binary files that
are specific to Hadoop.
• These files include many features designed to allow data to be rapidly
read into Hadoop mappers.
• Sequence files are block-compressed and provide direct serialization
and deserialization of several arbitrary data types (not just text).
• Sequence files can be generated as the output of other MapReduce
tasks and are an efficient intermediate representation for data that is
passing from one MapReduce job to another.
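A minimal sketch of selecting the InputFormat on the JobConf (in stock Hadoop's classic API, the tab-splitting format described above is the class KeyValueTextInputFormat). This is a driver fragment; it assumes a job class like the WordCount sketch earlier, and the input path is illustrative.

  JobConf conf = new JobConf(WordCount.class);

  // Default: TextInputFormat -- key = byte offset, value = the whole line.
  conf.setInputFormat(TextInputFormat.class);

  // Tab-separated key and value on each line:
  // conf.setInputFormat(KeyValueTextInputFormat.class);

  // Block-compressed binary sequence files, e.g. the output of a previous job:
  // conf.setInputFormat(SequenceFileInputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path("/data/input"));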
InputSplits
• An InputSplit describes a unit of work that
comprises a single map task in a
MapReduce program.
• A MapReduce program applied to a data set,
collectively referred to as a Job, is made up
of several (possibly several hundred) tasks.
• Map tasks may involve reading a whole file;
they often involve reading only part of a file.
• By default, the FileInputFormat and its
descendants break a file up into 64 MB
chunks (the same size as blocks in HDFS).
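A hedged sketch of two knobs that influence how many InputSplits (and therefore map tasks) a job gets, assuming the JobConf conf from the earlier sketches. The property name "mapred.min.split.size" and the 128 MB figure are assumptions for illustration, not values taken from the slides.

  conf.setNumMapTasks(100);   // a hint to the framework, not a guarantee
  conf.set("mapred.min.split.size",
           String.valueOf(128L * 1024 * 1024));  // don't split below ~128 MB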
RecordReader
• The InputSplit has defined a slice of work, but
does not describe how to access it.
• The RecordReader class actually loads the data
from its source and converts it into (key, value)
pairs suitable for reading by the Mapper.
• The RecordReader instance is defined by the
InputFormat.
• The default InputFormat, TextInputFormat,
provides a
– LineRecordReader, which treats each line of the
input file as a new value.
– The key associated with each line is its byte offset in
the file.
– The RecordReader is invoked repeatedly on the input
until the entire InputSplit has been consumed. Each
invocation of the RecordReader leads to another call
to the map() method of the Mapper.
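For example, assuming a split containing the two short lines below, the LineRecordReader presents them to the Mapper as the (byte offset, line) pairs on the right (offsets assume one byte per character plus a trailing newline):

  hello world    ->  (0,  "hello world")
  bye world      ->  (12, "bye world")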
Mapper
• The Mapper performs the interesting user-defined
work of the first phase of the MapReduce
program.
• Given a key and a value, the map() method emits
(key, value) pair(s) which are forwarded to the
Reducers.
• A new instance of Mapper is instantiated in a
separate Java process for each map task
(InputSplit) that makes up part of the total job
input. The individual mappers are intentionally not
provided with a mechanism to communicate with
one another in any way.
• This allows the reliability of each map task to be
governed solely by the reliability of the local
machine.
Mapper
• The OutputCollector object has a method named
collect() which will forward a (key, value) pair to the
reduce phase of the job.
• The Reporter object provides information about the
current task; its getInputSplit() method will return an
object describing the current InputSplit.
• It also allows the map task to provide additional
information about its progress to the rest of the
system.
• The setStatus() method allows you to emit a status
message back to the user. The incrCounter() method
allows you to increment shared performance counters.
• Each mapper can increment the counters, and the
JobTracker will collect the increments made by the
different processes and aggregate them for later
retrieval when the job ends.
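A minimal sketch of using the OutputCollector and Reporter inside map(); this method body belongs in a Mapper implementation like the Map class in the word-count sketch. The status text and the counter group/name ("WordCount", "EMPTY_LINES") are illustrative.

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
      reporter.setStatus("processing byte offset " + key.get());
      if (value.toString().trim().isEmpty()) {
          // Shared counter: the JobTracker aggregates these across all mappers.
          reporter.incrCounter("WordCount", "EMPTY_LINES", 1);
          return;
      }
      // ... emit (key, value) pairs with output.collect(...) as usual ...
  }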
1. Partition & Shuffle (Mapper)
• After the first map tasks have completed, the nodes may still be
performing several more map tasks each.
• But they also begin exchanging the intermediate outputs from the
map tasks to where they are required by the reducers.
• This process of moving map outputs to the reducers is known as
shuffling.
• A different subset of the intermediate key space is assigned to each
reduce node; these subsets (known as "partitions") are the inputs
to the reduce tasks.
• Each map task may emit (key, value) pairs to any partition; all
values for the same key are always reduced together, regardless of
which mapper produced them.
• Therefore, the map nodes must all agree on where to send the
different pieces of the intermediate data.
• The Partitioner class determines which partition a given (key,
value) pair will go to.
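A minimal sketch of a custom Partitioner for the classic API; the default HashPartitioner does essentially the same thing with key.hashCode(). Routing by the key's first character here is purely illustrative, and the class name is made up.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) { }   // no configuration needed

      public int getPartition(Text key, IntWritable value, int numPartitions) {
          // All keys starting with the same character go to the same reducer.
          int c = key.getLength() == 0 ? 0 : key.charAt(0);
          return (c & Integer.MAX_VALUE) % numPartitions;
      }
  }

  // Registered on the job with:
  // conf.setPartitionerClass(FirstCharPartitioner.class);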
2. Sort (Mapper)
• Each reduce task is responsible for reducing the
values associated with several intermediate keys.
• The set of intermediate keys on a single node is
automatically sorted by Hadoop before they are
presented to the Reducer.
Reduce
• A Reducer instance is created for each reduce
task.
• This is an instance of user-provided code that
performs the second important phase of job-
specific work.
• For each key in the partition assigned to a
Reducer, the Reducer's reduce() method is called
once.
• This receives a key as well as an iterator over all
the values associated with the key.
• The values associated with a key are returned by
the iterator in an undefined order.
• The Reducer also receives as parameters
OutputCollector and Reporter objects; they are
used in the same manner as in the map() method.
1. OutputFormat (Reducer)
• The (key, value) pairs provided to this
OutputCollector are then written to output
files.
• The way they are written is governed by
the OutputFormat.
• The OutputFormat functions much like
the InputFormat class.
• The instances of OutputFormat provided
by Hadoop write to files on the local disk
or in HDFS; they all inherit from a common
FileOutputFormat.
2. RecordWriter (Reducer)
• Much like how the InputFormat
actually reads individual records
through the RecordReader
implementation, the OutputFormat
class is a factory for RecordWriter
objects; these are used to write the
individual records to the files as
directed by the OutputFormat.
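A minimal sketch of configuring the output side of a job, again assuming the JobConf conf from the word-count sketch; the output path is illustrative. TextOutputFormat's RecordWriter writes each (key, value) pair as "key<TAB>value" on its own line.

  conf.setOutputFormat(TextOutputFormat.class);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

  // To feed another MapReduce job efficiently, write the binary format
  // read by SequenceFileInputFormat instead:
  // conf.setOutputFormat(SequenceFileOutputFormat.class);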
3. Combiner (Reducer)
• The pipeline shown earlier omits a processing
step which can be used to optimize the bandwidth
usage of a MapReduce job.
• Called the Combiner, this pass runs after the
Mapper and before the Reducer.
• Usage of the Combiner is optional. If this pass is
suitable for your job, instances of the Combiner
class are run on every node that has run map
tasks.
• The Combiner will receive as input all data
emitted by the Mapper instances on a given node.
The output from the Combiner is then sent to the
Reducers, instead of the output from the
Mappers.
• The Combiner is a "mini-reduce" process which
operates only on the data generated by one machine.
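In the word-count example, summation is associative and commutative, so the Reduce class from the earlier sketch can double as the Combiner; a one-line driver sketch, assuming that same class:

  conf.setCombinerClass(Reduce.class);  // pre-aggregate (word, 1) pairs per node
  conf.setReducerClass(Reduce.class);   // final aggregation across the cluster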
More Tips about Map-Reduce
• Chaining Jobs
• Not every problem can be solved with a
MapReduce program, and fewer still can be
solved with a single MapReduce job. Many
problems can be solved with MapReduce by
writing several MapReduce steps which run in
series to accomplish a goal:
• E.g.:
– Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 -> ...
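A hedged sketch of chaining two jobs with the classic API: run them in sequence and point the second job's input at the first job's output directory. JobClient.runJob() blocks until the submitted job finishes. The FirstPass/SecondPass job classes and the paths are illustrative placeholders.

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class ChainDriver {
      public static void main(String[] args) throws IOException {
          // First pass: FirstPass bundles its own Mapper/Reducer (illustrative).
          JobConf first = new JobConf(FirstPass.class);
          FileInputFormat.setInputPaths(first, new Path("/data/input"));
          FileOutputFormat.setOutputPath(first, new Path("/data/intermediate"));
          JobClient.runJob(first);          // blocks until the first job finishes

          // Second pass reads the first pass's output directory as its input.
          JobConf second = new JobConf(SecondPass.class);
          FileInputFormat.setInputPaths(second, new Path("/data/intermediate"));
          FileOutputFormat.setOutputPath(second, new Path("/data/output"));
          JobClient.runJob(second);
      }
  }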
Listing and Killing Jobs:
• It is possible to submit jobs to a Hadoop cluster which malfunction
and send themselves into infinite loops or other problematic states.
• In this case, you will want to manually kill the job you have started.
• The following command, run in the Hadoop installation directory on
a Hadoop cluster, will list all the current jobs:
• $ bin/hadoop job -list
1 jobs currently running
JobId State StartTime UserName
job_200808111901_0001 1 1218506470390 aaron
• $ bin/hadoop job -kill jobid
Conclusions
• This module described the MapReduce
execution platform at the heart of the
Hadoop system. By using MapReduce, a
high degree of parallelism can be
achieved by applications.
• The MapReduce framework provides a
high degree of fault tolerance for
applications running on it by limiting the
communication which can occur between
nodes, and by requiring applications to be
written without side-effects.
Assignment
• Parallel efficiency of map-reduce.
• Q&A / Feedback?