
UNIT-IV

Map Reduce and Yarn: Hadoop Map Reduce paradigm, Map and Reduce tasks, Job and Task
trackers, Mapper, Reducer, Map Reduce workflows, classic Map-reduce - YARN - failures in classic Map-
reduce and YARN - job scheduling - shuffle and sort - task execution - Map Reduce types -input formats -
output formats.

4.1 Hadoop Map Reduce paradigm:

The MapReduce programming model is a paradigm for processing Big Data sets in a parallel and distributed environment using map and reduce tasks.

Big Data processing employs the MapReduce programming model. A job is a MapReduce program; each job consists of several smaller units called MapReduce tasks.

A software execution framework in MapReduce programming defines and runs the parallel tasks. The Hadoop MapReduce implementation uses a Java framework.

Fig : MapReduce Programming Model

1. Map Phase:
1. The input data is split into independent chunks (Input Splits).
2. Each chunk is processed by a Mapper task.
3. The Mapper outputs key-value pairs.
2. Shuffle and Sort:
1. After the Map phase, intermediate data is shuffled and sorted.
2. Data with the same key is grouped together to be processed by the Reducer.
3. Reduce Phase:
1. Reducers take the grouped key-value pairs from the Shuffle and Sort phase.
2. They perform aggregation or summarization and produce the final output.
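
To make the three phases concrete, here is a minimal word-count sketch in the Hadoop Java API. It is illustrative rather than part of the original notes, and the class names (WordCountSketch, TokenMapper, SumReducer) are chosen only for this example: the Mapper emits (word, 1) pairs, the framework groups them by key during shuffle and sort, and the Reducer sums the grouped values.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: one call per input record, emitting (word, 1) key-value pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: values arrive grouped by key, thanks to shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}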

4.2.MapReduce Program on Client Submitting Job

Fig: MapReduce Program on Client Submitting Job

Job Client:
The job client is the one who submits the job. A job contains the mapper function, the reducer function, and some configuration that drives the job.

Job Tracker:
The job tracker is the master of the task trackers, which are slaves running on the data nodes. The job tracker's responsibilities are to come up with an execution plan and to coordinate and schedule that plan across the task trackers. It also performs phase coordination.

Task Tracker:
The task tracker breaks the job down into tasks, that is, map and reduce tasks. Every task tracker has slots on it; the job tracker takes the compiled map and reduce functions and places them into the task slots, which actually execute the map and reduce functions.

Working of MapReduce:
When a client submits a job, the JobTracker and TaskTrackers carry out the succeeding actions. The data for a MapReduce job initially resides in input files, which typically live in HDFS.

The files may be line-based log files, binary files, multi-line input records, or something else entirely. These input files are usually very large, hundreds of terabytes or more.

JobTracker and TaskTracker: MapReduce consists of a single master JobTracker and one slave TaskTracker per cluster node.

The master is responsible for scheduling a job's component tasks onto the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

4.3.Map Reduce workflows:

Fig: MapReduce Workflow

• Input: This is the input data / file to be processed.


• Split: Hadoop splits the incoming data into smaller pieces called “splits”.
• Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Combine: This is an optional step, used to improve performance by reducing the amount of data transferred across the network. The combiner has the same form as the reduce step and aggregates the output of the map() function before it is passed on to the subsequent steps.
• Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
• Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Output: Finally, the output of the reduce step is written to a file in HDFS (a driver sketch wiring these steps together follows).
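
A driver that wires these workflow steps together might look like the following sketch. It is not part of the original notes and assumes the WordCountSketch mapper and reducer classes from the earlier sketch; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WorkflowDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count workflow");
        job.setJarByClass(WorkflowDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input / Split
        job.setMapperClass(WordCountSketch.TokenMapper.class);   // Map
        job.setCombinerClass(WordCountSketch.SumReducer.class);  // Combine (optional)
        job.setReducerClass(WordCountSketch.SumReducer.class);   // Reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output (HDFS)

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // Submit and wait
    }
}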

4.4. Classic MapReduce:


A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there
are four independent entities:
• The client, which submits the MapReduce job.

• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose
main class is JobTracker.

• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are
Java applications whose main class is TaskTracker.

• The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

1. Job Submission:
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 6-1).

The job submission process implemented by JobSubmitter does the following:

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Computes the input splits for the job. Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID (step 3).

• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).

2. Job Initialization:

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue
from where the job scheduler will pick it up and initialize it. Initialization involves creating
an object to represent the job being run (step 5).

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the
client from the shared filesystem (step 6). It then creates one map task for each split.

3. Task Assignment:
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task, and if it is, the jobtracker allocates it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).

4. Task Execution:

Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8).

TaskRunner then launches a new Java Virtual Machine (step 9) in which to run each task (step 10).

5. Progress and Status Updates:


MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
Because this is a significant length of time, it’s important for the user to get feedback on how
the job is progressing. A job and each of its tasks have a status.

When a task is running, it keeps track of its progress, that is, the proportion of the task
completed.

6. Job Completion:

When the jobtracker receives a notification that the last task for a job is complete (this will be the
special job cleanup task), it changes the status for the job to “successful.”

4.5. Failures in Classic MapReduce:
In the MapReduce 1 runtime there are three failure modes to consider:
1. failure of the running task,
2. failure of the tasktracker, and
3. failure of the jobtracker.
1. Task Failure:
• Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits.

• The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.

• When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed.

2. Tasktracker Failure:
• Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or by running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently).
• The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on.
• A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. If more than four tasks from the same job fail on a particular tasktracker, then the jobtracker records this as a fault.
• Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with the jobtracker. Faults expire over time (at the rate of one per day), so tasktrackers get the chance to run jobs again simply by leaving them running.

3. Jobtracker Failure:
• Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for dealing with failure of the jobtracker; it is a single point of failure, so in this case the job fails.
• However, this failure mode has a low chance of occurring, since the chance of a particular machine failing is low. The good news is that the situation is improved in YARN, since one of its design goals is to eliminate single points of failure in MapReduce.
• After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be re-submitted. There is a configuration option that attempts to recover any running job (mapred.jobtracker.restart.recover, turned off by default); however, it is known not to work reliably, so it should not be used.

4.6. YARN Architecture:

Fig: YARN Architecture

YARN stands for Yet Another Resource Negotiator. It has two major responsibilities:
1. Management of cluster resources such as compute, network, and
memory.
2. Scheduling and monitoring of jobs.

YARN achieves these goals through two long-running daemons:


1. Resource Manager
2. Node Manager
The two components work in a master-slave relationship, where the Resource Manager (RM) is the master and the Node Managers are the slaves.
A single Resource Manager runs in the cluster, with one Node Manager per machine. Together, these two components make up the data-computation framework.
The components involved are discussed below.

1. Client: It submits map-reduce jobs.


2. Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a
processing request, it forwards it to the corresponding node manager and allocates
resources for the completion of the request accordingly.
It has two major components:
Scheduler: It performs scheduling based on the application's requirements and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.

Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.
3. Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up to date with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager.
4. Application Master: An application is a single job submitted to a framework.
The application master is responsible for negotiating resources with the
resource manager, tracking the status and monitoring progress of a single
application.
The application master obtains containers from the resource manager and asks the node manager to launch them by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the resource manager from time to time.
5. Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are launched using a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.

4.7. YARN architecture for running a Map Reduce Job:
MapReduce on YARN involves more entities than classic MapReduce.
They are:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
• The distributed filesystem, which is used for sharing job files between the other entities.

1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Job Completion
1. Job Submission:
The job submission process implemented by JobSubmitter does the following:
• Asks the resource manager for a new application ID, used for the MapReduce job
ID (step 2).
• Checks the output specification of the job. For example, if the output directory has
not been specified or it already exists, the job is not submitted, and an error is
thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the
input paths don’t exist, for example), the job is not submitted, and an error is
thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the shared filesystem in a
directory named after the job ID (step 3). The job JAR is copied with a high
replication factor so that there are lots of copies across the cluster for the node
managers to access when they run tasks for the job.
• Submits the job by calling submitApplication() on the resource manager (step 4).
2. Job Initialization:
• When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler. The scheduler allocates a container,
and the resource manager then launches the application master’s process.
• The application master for MapReduce jobs is a Java application. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks (step 6).
• Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then creates a map task object for each split, as well as a number of reduce task objects. Tasks are given IDs at this point.
3. Task Assignment:
• If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8).
• Requests for map tasks are made first and with a higher priority than those for
reduce tasks, since all the map tasks must complete before the sort phase of the
reduce can start. Requests for reduce tasks are not made until 5% of map tasks have
completed.
• Reduce tasks can run anywhere in the cluster, but requests for map tasks have data
locality constraints that the scheduler tries to honor.
• In the optimal case, the task is data local, that is, running on the same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from a different rack than the one they are running on.
• Requests also specify memory requirements and CPUs for tasks. By default, each
map and reduce task is allocated 1,024 MB of memory and one virtual core.
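
These defaults can be overridden per job. The following sketch (not from the notes) assumes the Hadoop 2 / MRv2 property names mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores, and mapreduce.reduce.cpu.vcores; the values chosen are only examples.

import org.apache.hadoop.conf.Configuration;

public class TaskResourceConfig {
    // Returns a configuration that requests larger containers than the defaults
    // (1,024 MB and one virtual core per task).
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);      // memory per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);   // memory per reduce container
        conf.setInt("mapreduce.map.cpu.vcores", 1);        // virtual cores per map task
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);     // virtual cores per reduce task
        return conf;
    }
}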

4. Task Execution:
• Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by
contacting the node manager (steps 9a and 9b).
• The task is executed by a Java application whose main class is YarnChild. Before it
can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
• Finally, it runs the map or reduce task (step 11).
5. Job Completion:
• When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to “successful”.
• Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method.
4.8.Job Scheduling:
1. FIFO Scheduler:
First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more preference to applications submitted earlier than to those submitted later. It places the applications in a queue and executes them in the order of their submission (first in, first out).
Advantages:
It is simple to understand and doesn't need any configuration.
Jobs are executed in the order of their submission.

Disadvantage:
It is not suitable for shared clusters. If the large application comes before the
shorter one, then the large application will use all the resources in the cluster,
and the shorter application has to wait for its turn. This leads to starvation.
It does not take into account the balance of resource allocation between the long
applications and short applications.

2. Capacity Scheduler:
The Capacity Scheduler allows multiple tenants to securely share a large Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.

It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf.

Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is the most complex of the schedulers.

3.The Fair Scheduler:
• The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.
• If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
• A short job belonging to one user will complete in a reasonable time even while another user's long job is running, and the long job will still make progress.
• Jobs are placed in pools, and by default, each user gets their own pool. A user who submits more jobs than a second user will not get any more cluster resources than the second, on average. It is also possible to define custom pools with guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to set weightings for each pool.
Advantages:
It provides a reasonable way to share the Hadoop cluster among a number of users.
Also, the Fair Scheduler can work with application priorities, where the priorities are used as weights in determining the fraction of the total resources that each application should get.

Disadvantage:
It requires configuration.

4.9 Failures in YARN:


1. Task Failures: Failure of the running task is similar to the classic case.
Runtime exceptions and sudden exits of the JVM are propagated back to the
application master and the task attempt is marked as failed.

The configuration properties for determining when a task is considered to be
failed are the same as the classic case: a task is marked as failed after four
attempts.
2. Application Master Failure: An application master sends periodic heartbeats
to the resource manager, and in the event of application master failure, the
resource manager will detect the failure and start a new instance of the master
running in a new container (managed by a node manager).
In the case of the MapReduce application master, it can recover the state of the
tasks that had already been run by the (failed) application so they don’t have
to be rerun.

3. Node Manager Failure:


If a node manager fails, then it will stop sending heartbeats to the resource
manager, and the node manager will be removed from the resource
manager’s pool of available nodes.
The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000 (10 minutes), determines the minimum time the resource manager waits before considering a node manager that has sent no heartbeat in that time as failed.
Node managers may be blacklisted if the number of failures for the application
is high. Blacklisting is done by the application master, and for MapReduce
the application master will try to reschedule tasks on different nodes if more
than three tasks fail on a node manager.
4. Resource Manager Failure:
Failure of the resource manager is serious, since without it neither jobs nor task
containers can be launched.
After a crash, a new resource manager instance is brought up (by an administrator) and it recovers from the saved state. The state consists of the node managers in the system as well as the running applications.

4.10 TASK EXECUTION:


1. The Task Execution Environment
2. Speculative Execution
3. Output Committers
1. The Task Execution Environment:

• Hadoop provides information to a map or reduce task about the environment in which it is running. For example, a map task can discover the name of the file it is processing, and a map or reduce task can find out the attempt number of the task.
Task execution environment properties: (the table of properties is not reproduced here).

2. Speculative Execution:
• The MapReduce model breaks jobs into tasks and runs the tasks in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially.
• This makes job execution time sensitive to slow-running tasks, as it takes only one slow task to make the whole job take significantly longer than it would have done otherwise. When a job consists of hundreds or thousands of tasks, the possibility of a few straggling tasks is very real.
• Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent, task as a backup. This is termed speculative execution of tasks.
• A speculative task is launched only after all the tasks for a job have been launched, and then only for tasks that have been running for some time (at least a minute) and have failed to make as much progress, on average, as the other tasks from the job.
• When a task completes successfully, any duplicate tasks that are running are killed since they are no longer needed. So if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, then the original is killed (a configuration sketch for turning speculative execution on or off follows).
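
Speculative execution is on by default and can be toggled per job. The sketch below assumes the Hadoop 2 property names mapreduce.map.speculative and mapreduce.reduce.speculative (older releases use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution); it is illustrative rather than part of the notes.

import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionConfig {
    // Enables or disables speculative execution for both map and reduce tasks.
    public static Configuration configure(boolean enabled) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", enabled);
        conf.setBoolean("mapreduce.reduce.speculative", enabled);
        return conf;
    }
}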

3.Output Committers:

• Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed or fail cleanly. The behavior is implemented by the OutputCommitter in use for the job, and this is set in the old MapReduce API by calling setOutputCommitter() on JobConf, or by setting mapred.output.committer.class in the configuration.
• In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for file-based MapReduce.
• The setupJob() method is called before the job is run and is typically used to perform initialization. For FileOutputCommitter the method creates the final output directory, ${mapred.output.dir}, and a temporary working space for task output, ${mapred.output.dir}/_temporary.
• If the job succeeds, then the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully.
• If the job did not succeed, then abortJob() is called with a state object indicating whether the job failed or was killed (by a user, for example). In the default implementation this will delete the job's temporary working space.

4.11. Shuffle and Sort:

1. The Map Side :


MapReduce makes the guarantee that the input to every reducer is sorted by key. The
process by which the system performs the sort—and transfers the map outputs to
the reducers as inputs—is known as the shuffle.

Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size which can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background thread will start to spill the contents to disk. Each time the memory buffer reaches the spill threshold, a new spill file is created.

Spills are written in round-robin fashion to the directories. Before it writes to disk, the
thread first divides the data into partitions corresponding to the reducers that they will
ultimately be sent to. Within each partition, the background thread performs an in-
memory sort by key, and if there is a combiner function, it is run on the output of the
sort. Running the combiner function makes for a more compact map output, so there
is less data to write to local disk and to transfer to the reducer.

So, after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file.

If there are at least three spill files (set by the min.num.spills.for.combine property), then the combiner is run again before the output file is written. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead of invoking the combiner (a sketch of tuning the map-side buffer follows).
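
As a sketch (not from the notes) of tuning the map-side buffer, the old-API property names used in the text can be set as below; newer releases use mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent instead, and the values here are only examples.

import org.apache.hadoop.conf.Configuration;

public class MapSideTuning {
    // Enlarges the circular buffer and keeps the default spill threshold.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                 // buffer size in MB (default 100)
        conf.setFloat("io.sort.spill.percent", 0.80f);  // spill threshold (default 80%)
        return conf;
    }
}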

2. The Reduce Side :


Let’s turn now to the reduce part of the process.
The map output file is sitting on the local disk of the machine that ran the map task, but now it is needed by the machine that is about to run the reduce task for the partition. The reduce task needs the map output for its particular partition from several map tasks across the cluster.

This is the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.

The map outputs are copied to the reduce task JVM's memory if they are small enough; otherwise they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.

When all the map outputs have been copied, the reduce task moves into the merge phase, which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map's merge), then there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The
output of this phase is written directly to the output filesystem, typically HDFS.

Configuration Tuning:
(The tables of shuffle tuning properties are not reproduced here.)
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in
languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between
Hadoop and your program, so you can use any language that can read standard input and write to
standard output to write your MapReduce program.
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

4.12. Map Reduce Types:


The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map output
types (K2 and V2). However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3 and V3).

public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
    ...
    context.write(new Text(year), new IntWritable(airTemperature));
}

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
    ...
    context.write(key, new IntWritable(maxValue));
}

// A combine function is written as the reduce() method of a Reducer implementation
// and registered on the job with job.setCombinerClass(...); its form is the same:
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
    ...
    context.write(key, new IntWritable(maxValue));
}

If a combine function is used, then it has the same form as the reduce function (and is an implementation of Reducer), except that its output types are the intermediate key and value types (K2 and V2), so they can feed the reduce function:

map: (K1, V1) → list(K2, V2)


combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Often the combine and reduce functions are the same, in which case, K3 is the same as K2,
and V3 is the same as V2.

Input types are set by the input format. So, for instance, a TextInputFormat generates keys of
type LongWritable and values of type Text. The other types are set explicitly by calling the
methods on the Job as follows.

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

So if K2 and K3 are the same, you don’t need to call setMapOutputKeyClass(), since it falls
back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same,
you only need to use setOutputValueClass().
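
As a sketch, a driver for the max-temperature example above might set the intermediate (K2, V2) and final (K3, V3) types like this; since K2 equals K3 and V2 equals V3 here, the two setMapOutput* calls could be omitted.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfiguration {
    public static void configure(Job job) {
        job.setMapOutputKeyClass(Text.class);          // K2
        job.setMapOutputValueClass(IntWritable.class); // V2
        job.setOutputKeyClass(Text.class);             // K3
        job.setOutputValueClass(IntWritable.class);    // V3
    }
}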

1. Input Formats:
Hadoop can process many different types of data formats, from flat text files to databases.

The Relationship Between Input Splits and HDFS Blocks: Figure 7-3 shows an example. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries. Splits honor logical record boundaries, in this case lines, so we see that the first split contains line 5, even though it spans the first and second block. The second split starts at line 6.

2. Text Input:

TextInputFormat :
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a
LongWritable, is the byte offset within the file of the beginning of the line. The value is the
contents of the line.
So a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

The records are interpreted as the following
key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

3. KeyValueTextInputFormat

TextInputFormat’s keys, being simply the offset within the file, are not normally very useful.
It is common for each line in a file to be a key-value pair, separated by a delimiter such as a
tab character.

You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or key.value.separator.in.input.line in the old API). It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree


line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

(line1, On the top of the Crumpetty Tree)


(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
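
A sketch of configuring a job to use KeyValueTextInputFormat with the separator property from the new API (the value shown is the tab default, so setting it is only needed when a different delimiter is used):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputConfig {
    public static void configure(Job job) {
        job.getConfiguration().set(
                "mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}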

4. NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use. Like TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.

N refers to the number of lines of input that each mapper receives. With N set to one (the default), each mapper receives exactly one line of input.

On the top of the Crumpetty Tree


The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

If, for example, N is two, then each split contains two lines. One mapper will receive the first
two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)

And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
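
A sketch of configuring NLineInputFormat so that each mapper receives two lines per split, as in the example above:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputConfig {
    public static void configure(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);  // N = 2 lines per mapper
    }
}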

5. XML

Most XML parsers operate on whole XML documents, so if a large XML document is made
up of multiple input splits, then it is a challenge to parse these individually.

Large XML documents that are composed of a series of “records” (XML document
fragments) can be broken into these records using simple string or regular-expression
matching to find start and end tags of records.

Set your input format to StreamInputFormat and set the stream.recordreader.class property to org.apache.hadoop.streaming.StreamXmlRecordReader to use XML as an input format.

To take an example, Wikipedia provides dumps of its content in XML form, which are
appropriate for processing in parallel using MapReduce using this approach.

6. Binary Input :

SequenceFileInputFormat

Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files are well suited as a format for MapReduce data since they are splittable and they support compression as a part of the format.

SequenceFileAsTextInputFormat

SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.

SequenceFileAsBinaryInputFormat

SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects.

7. Multiple Inputs
The input to a MapReduce job may consist of multiple input files. This case is handled elegantly by using the MultipleInputs class. For example, if we had weather data from the UK Met Office that we wanted to combine with the NCDC data for our maximum temperature analysis, then we might set up the input as follows:

MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class,
    MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath, TextInputFormat.class,
    MetOfficeMaxTemperatureMapper.class);

8. Database Input

DBInputFormat is an input format for reading data from a relational database, using JDBC. It
is best used for loading relatively small datasets, perhaps for joining with larger datasets from
HDFS, using MultipleInputs.

2.Output Formats:

1.Text Output

The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them. Each key-value pair is separated by a tab character. The counterpart to TextOutputFormat for reading in this case is KeyValueTextInputFormat (a sketch of changing the separator follows).
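
A sketch of changing the separator, assuming the new-API property name mapreduce.output.textoutputformat.separator (the old API uses mapred.textoutputformat.separator); a comma is used here only as an example:

import org.apache.hadoop.mapreduce.Job;

public class TextOutputConfig {
    public static void configure(Job job) {
        // Write key,value instead of the default tab-separated output.
        job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");
    }
}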

2.Binary Output

SequenceFileOutputFormat
As the name indicates, SequenceFileOutputFormat writes sequence files for its output.

SequenceFileAsBinaryOutputFormat
SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInput
Format, and it writes keys and values in raw binary format into a SequenceFile container.

3.MapFileOutputFormat

MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.

4.Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, etc. There is sometimes a need to have more control over the naming of the files or to produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to help you do this.

An example: partitioning data. Consider the problem of partitioning the weather dataset by weather station. We would like to run a job whose output is a file per station, with each file containing all the records for that station.

One way of doing this is to have a reducer for each weather station. To arrange this, we need
to do two things. First, write a partitioner that puts records from the same weather station into
the same partition. Second, set the number of reducers on the job to be the number of weather
stations. The partitioner would look like this:

public class StationPartitioner extends Partitioner<LongWritable, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    public int getPartition(LongWritable key, Text value, int numPartitions) {
        parser.parse(value);
        return getPartition(parser.getStationId());
    }

    private int getPartition(String stationId) {
        ...
    }
}
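
As an alternative to dedicating one reducer per station, the MultipleOutputs class mentioned above can write a differently named output file per station from within a single reducer. The following is a sketch, not taken from the notes, which assumes the reduce key is the station ID and the values are the full records.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleOutputsReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text stationId, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            // The third argument is the base name of the output file for this station.
            multipleOutputs.write(NullWritable.get(), record, stationId.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();  // flush and close all the named outputs
    }
}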

5.Lazy Output

FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty.
Some applications prefer that empty files not be created, which is where LazyOutputFormat
helps.
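
A sketch of enabling lazy output by wrapping the job's output format with LazyOutputFormat; TextOutputFormat is used here only as an example wrapped format.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputConfig {
    public static void configure(Job job) {
        // Part files are created only when the first record is written to them.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}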

6.Database Output

There are output formats for writing to relational databases and to HBase. DBOutputFormat is useful for dumping job outputs (of modest size) into a database.
