UNIT-3 Hadoop and MapReduce Programming

Big Data and Analytics
Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Chapter 5

Introduction to Hadoop

Learning Objectives and Learning Outcomes

Learning Objectives:
1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and MapReduce programming.
3. To study HDFS architecture.
4. To study the MapReduce programming model.
5. To study the Hadoop ecosystem.

Learning Outcomes:
a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend the MapReduce framework.
d) To understand reads and writes in HDFS.
e) To be able to understand the Hadoop ecosystem.

Agenda
► Hadoop - An Introduction
► Why Hadoop?
► Why not RDBMS?
► RDBMS versus Hadoop
► Distributed Computing Challenges
► History of Hadoop
► Hadoop Overview
❖ Key Aspects of Hadoop
❖ Hadoop Components
❖ Hadoop Conceptual Layer
❖ High Level Architecture of Hadoop
► Use case for Hadoop
❖ ClickStream Data
► Hadoop Distributors
► HDFS
❖ HDFS Daemons
❖ Anatomy of File Read
❖ Anatomy of File Write
❖ Replica Placement Strategy
❖ Working with HDFS commands
❖ Special Features of HDFS
Agenda
► Processing Data with Hadoop
❖ What is MapReduce Programming?
❖ MapReduce Daemons
❖ How does MapReduce Work?
❖ MapReduce Word Count Example

► Managing Resources and Application with Hadoop YARN


❖ Limitations of Hadoop 1.0 Architecture
❖ HDFS Limitation
❖ Hadoop 2: HDFS
❖ Hadoop 2 YARN: Taking Hadoop Beyond Batch

► Interacting with Hadoop Ecosystem


❖ Pig
❖ Hive
❖ Sqoop
❖ HBase
Hadoop – An Introduction
1. Every day:
► NYSE
► Facebook
► Google
2. Every minute:
► Facebook
► Twitter
► Instagram
► YouTube
► Apple
► Email
► Amazon
► Google
3. Every second:
► Banking applications
Data: The Treasure Trove
► Provides business advantages such as generating product recommendations, inventing new products, analyzing the market, etc.
► Provides a few early key indicators that can turn the fortunes of a business.
► Provides room for precise analysis.
► To process, analyze, and make sense of these different kinds of data, we need a system that scales and addresses the challenges shown in Fig. 5.1.

Why Hadoop?
Ever wondered why Hadoop has been, and remains, one of the most sought-after technologies?

The key consideration (the rationale behind its huge popularity) is its capability to handle massive amounts of data, of different categories, fairly quickly.

The other considerations are:

Why not RDBMS?

RDBMS versus Hadoop

Distributed Computing Challenges

• Hardware failure – handled through replication.

• The default replication factor is 3: each block of a file is replicated that many times across the cluster. The default is set in hdfs-site.xml and can be changed for individual files by using:
 hdfs dfs -setrep <replication factor> <filename>

• How to process this gigantic store of data? How to integrate the data available on several machines prior to processing it?
• MapReduce programming is the solution (see the programmatic sketch after this list).
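A sketch of changing a file's replication factor programmatically, assuming the HDFS Java client API (org.apache.hadoop.fs.FileSystem); the path and class name here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hdfs dfs -setrep 2 /sample/test.txt
        fs.setReplication(new Path("/sample/test.txt"), (short) 2);
        fs.close();
    }
}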

History of Hadoop
Hadoop Overview

Key Aspects of Hadoop

Hadoop Components

Flume is a distributed,
reliable, and available service
for efficiently collecting,
aggregating, and moving
large amounts of streaming
data into the Hadoop
Distributed File System
(HDFS).

Hadoop Components

Hadoop Core Components:

HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.

MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.

Hadoop Ecosystem Components:

Flume, Oozie, Mahout, Hive, Pig, Sqoop, HBase

Hadoop Conceptual Layers:

► Data storage layer


► Data Processing layer

Hadoop High Level Architecture

Use case for Hadoop

ClickStream Data Analysis

ClickStream data (mouse clicks) helps you understand the purchasing behavior of customers. ClickStream analysis helps online marketers optimize their product web pages, promotional content, etc., to improve their business.

Hadoop Distributors

HDFS
(HADOOP DISTRIBUTED FILE SYSTEM)

Hadoop Distributed File System
1. Storage component of Hadoop.

2. Distributed file system.

3. Modeled after the Google File System.

4. Optimized for high throughput (HDFS leverages a large block size and moves computation to where the data is stored).

5. A file can be replicated a configured number of times, which makes HDFS tolerant of both software and hardware failure.

6. Re-replicates data blocks automatically when nodes fail.

7. You realize the power of HDFS when you perform reads or writes on large files (gigabytes and larger).

8. Sits on top of the native file system.

HDFS key points

► Block-structured file system
► Default replication factor: 3
► Default block size: 64 MB
(A configuration sketch for these defaults follows.)
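A sketch of how these two defaults would appear in hdfs-site.xml, assuming the Hadoop 1.x property names (dfs.replication and dfs.block.size, the latter in bytes):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <!-- 64 MB = 67108864 bytes -->
    <value>67108864</value>
  </property>
</configuration>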

HDFS Daemons

NameNode:
• A single NameNode per cluster.
• Keeps the metadata details.

DataNode:
• Multiple DataNodes per cluster.
• Performs read/write operations.

SecondaryNameNode:
• Housekeeping daemon.

NameNode
► FsImage – file in which the entire file system namespace is stored
► EditLog – records every transaction (metadata change) that occurs

NameNode and DataNode communication

Anatomy of File Read

Anatomy of File Write

Steps involved in Anatomy of File Write

Special Features of HDFS

Data Replication: There is absolutely no need for a client application to track all blocks. The NameNode directs the client to the nearest replica to ensure high performance.

Data Pipeline: A client application writes a block to the first DataNode in the pipeline. This DataNode then takes over and forwards the data to the next node in the pipeline. The process repeats for all the data blocks, until all the replicas have been written to disk.

Replica Placement Strategy
As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node within that rack. Once the replica locations have been decided, a pipeline is built. This strategy provides good reliability.

Fig: Replica Placement Strategy

Working with HDFS Commands
Objective: To create a directory (say, sample) in HDFS.

Act:

hadoop fs -mkdir /sample

Objective: To copy a file from local file system to HDFS.

Act:

hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to local file system.

Act:

hadoop fs -get /sample/test.txt /root/sample/testsample.txt

HDFS Commands..
Objective: To get the list of directories and files at the root of HDFS.
Act:
hadoop fs -ls /

Objective: To get the complete list of directories and files of HDFS, recursively.
Act:
hadoop fs -ls -R /

Objective: To copy a file from the local file system to HDFS via the copyFromLocal command.
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt

Objective: To copy a file from the Hadoop file system to the local file system via the copyToLocal command.
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
HDFS Commands..

Objective: To display the contents of an HDFS file on the console.
Act:
hadoop fs -cat /sample/test.txt

Objective: To copy a file from one directory to another on HDFS.
Act:
hadoop fs -cp /sample/test.txt /sample1

Objective: To remove a directory from HDFS.
Act:
hadoop fs -rm -r /sample1

Processing Data with Hadoop

What is MapReduce Programming?

MapReduce is a software framework that helps you process massive amounts of data in parallel.

► In MapReduce programming, the input data set is split into independent chunks.
► Map tasks process these independent chunks in a completely parallel manner. The output produced by the map tasks serves as intermediate data and is stored on the local disk of the server that produced it.
► The outputs of the mappers are automatically shuffled and sorted by the framework. The MapReduce framework sorts the output based on KEYS.
► This sorted output becomes the input to the reduce tasks.
► Reduce tasks produce the reduced output by combining the outputs of the various mappers.
► Job inputs and outputs are stored in a file system.
► The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
► The HDFS and MapReduce frameworks run on the same set of nodes. This configuration allows effective scheduling of tasks on the nodes where data is present (DATA LOCALITY), which in turn results in very high throughput.

► There are two daemons associated with MapReduce programming:
a single master JOB TRACKER per cluster and one slave TASK TRACKER per cluster node.
► The JobTracker is responsible for scheduling tasks to the TaskTrackers, monitoring the tasks, and re-executing a task in case a TaskTracker fails.
► The TaskTracker executes the tasks.
► MapReduce applications use suitable interfaces to construct the job. The application and the job parameters together are called the JOB CONFIGURATION.
► The Hadoop JOB CLIENT submits the job (jar/executable, etc.) to the JobTracker. It is then the responsibility of the JobTracker to schedule the tasks to the slaves; it also monitors the tasks and provides status information to the client.

MapReduce daemons:
Job Tracker and Task Tracker

Fig: JobTracker and TaskTracker interaction

MapReduce Programming Workflow

MapReduce programming architecture

MapReduce WordCount-Example
► Count the occurrences of similar words across 50 files.

► Driver class: job configuration details
► Mapper class: map function
► Reducer class: reduce function

A minimal sketch of these three pieces is shown below.
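The following sketch uses the org.apache.hadoop.mapreduce (new) API; the class names and input/output paths are illustrative, not the book's own code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: emits (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class: sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver class: holds the job configuration details.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /sample
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /wordcount-out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}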

MapReduce – Word Count Example

SQL vs MapReduce

Parameter     SQL                         MapReduce
Access        Interactive and batch       Batch
Structure     Static                      Dynamic
Updates       Read and write many times   Write once, read many times
Integrity     High                        Low
Scalability   Nonlinear                   Linear

MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN

(YET ANOTHER RESOURCE NEGOTIATOR)

Limitations of Hadoop 1.0 Architecture

1. A single NameNode is responsible for managing the entire namespace of the Hadoop cluster.

2. It has a restricted processing model, suitable only for batch-oriented MapReduce jobs.

3. Hadoop MapReduce is not suitable for interactive analysis.

4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.

5. MapReduce is responsible for both cluster resource management and data processing.

HDFS Limitation:
The NameNode stores all file metadata in main memory, so it can quickly become overwhelmed as the load on the system increases.

Hadoop 2: HDFS
► Major components:
► Namespace
► Block storage service
► Features:
► Horizontal scalability
► High availability
Fig: Active and Passive NameNode Interaction

Hadoop 2 YARN: Taking Hadoop beyond Batch
The fundamental idea behind this architecture is splitting the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons. The daemons that are part of the YARN architecture are described below.

A global ResourceManager: Its main responsibility is to distribute resources among the various applications in the system. It has two main components:
Scheduler: Decides the allocation of resources to the various running applications. It is a pure scheduler; it does not monitor or track the status of applications.
Application Manager: Accepts jobs, negotiates resources for executing the application-specific ApplicationMaster, and restarts the ApplicationMaster on failure.

NodeManager: This is a per-machine slave daemon. Its responsibility is launching the application containers for application execution. The NodeManager monitors resource usage such as memory, CPU, disk, and network, and reports it to the global ResourceManager.

Per-application ApplicationMaster: This is an application-specific entity. Its responsibility is to negotiate the resources required for execution from the ResourceManager. It works along with the NodeManager to execute and monitor component tasks.
Basic concepts
► Application
► Container
► YARN Architecture

Interacting with Hadoop Ecosystem

Interacting with Hadoop Ecosystem

Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow. Pig is an alternative to MapReduce programming. It abstracts away some details and allows you to focus on data processing.

Hive: Hive is a data warehousing layer on top of Hadoop. Analysis and queries can be done using an SQL-like language. Hive can be used for ad-hoc queries, summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop ecosystem.

Sqoop: Sqoop is a tool which helps to transfer data between Hadoop and relational databases. With the help of Sqoop, you can import data from an RDBMS to HDFS and vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.

HBase: HBase is a NoSQL database for Hadoop. It is a column-oriented NoSQL database, used to store billions of rows and millions of columns. HBase provides random read/write operations. It also supports record-level updates, which are not possible with HDFS. HBase sits on top of HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.

Chapter 8
Introduction to MapReduce Programming

Introduction to MapReduce Programming
► Introduction
► Mapper
❖ RecordReader
❖ Map
❖ Combiner
❖ Partitioner
► Reducer
❖ Shuffle
❖ Sort
❖ Reduce
❖ Output Format
► Combiner
► Partitioner
► Searching
► Sorting
► Compression
Introduction

In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks. These tasks are then executed in a distributed fashion on the Hadoop cluster.

Each task processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the load across the cluster.

A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed File System) as input.

Mapper

A mapper maps the input key–value pairs into a set of intermediate key–value pairs. Maps are individual tasks that have the responsibility of transforming input records into intermediate key–value pairs.

A mapper consists of the following phases:

• RecordReader

• Map

• Combiner

• Partitioner
Reducer

The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a common key) to a smaller set of values.

The Reducer has three primary phases:

⮚ Shuffle and Sort

⮚ Reduce

⮚ Output Format

The chores of Mapper, Combiner, Partitioner, and Reducer
Combiner

The combiner is an optimization technique for a MapReduce job. Generally, the reducer class is set to be the combiner class. The difference between the combiner class and the reducer class is as follows:

• Output generated by the combiner is intermediate data, and it is passed to the reducer.

• Output of the reducer is passed to the output file on disk.

• Objective
• Input data
• Act
• Output data

A minimal sketch of wiring in a combiner follows.
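This sketch assumes the WordCount job from earlier; IntSumReducer can double as the combiner because summing partial counts gives the same result as summing everything at once:

// In the WordCount driver, before submitting the job: run a local
// reduce on each mapper's output to shrink the intermediate data
// that is shuffled across the network.
job.setCombinerClass(IntSumReducer.class);

Note that the framework may invoke the combiner zero, one, or more times, so the combiner must not change the final result.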

Partitioner

The partitioning phase happens after the map phase and before the reduce phase. Usually the number of partitions is equal to the number of reducers. The default partitioner is the hash partitioner, which assigns a key to a partition based on its hash code. A sketch of a custom partitioner is shown below.
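A minimal sketch of a custom partitioner, reusing the WordCount key/value types; the first-letter routing scheme is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words to reducers by their first character instead of
// the default hash of the whole key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        int c = word.isEmpty() ? 0 : Character.toLowerCase(word.charAt(0));
        return c % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).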

Searching and Sorting Demo

Compression

In MapReduce programming, you can compress the MapReduce output file. Compression provides two benefits:

1. It reduces the space needed to store files.

2. It speeds up data transfer across the network.

You can specify the compression format in the driver program as shown below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec",
    GzipCodec.class, CompressionCodec.class);

Here, a codec is the implementation of a compression–decompression algorithm. GzipCodec is the compression algorithm for gzip. This compresses the output file.
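The same settings expressed against the newer org.apache.hadoop.mapreduce API, as a sketch that assumes a Job object like the one in the WordCount driver above:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Enable gzip compression for the job's output files.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);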

Answer a few questions…

Fill in the blanks

1. Partitioner phase belongs to the ------------------------ task.

2. Combiner is also known as ---------------------------.

3. RecordReader converts a byte-oriented view into a --------------------------- view.

4. MapReduce sorts the intermediate values based on --------------------------.

5. In MapReduce programming, the reduce function is applied to ---------------- group at a time.

Thank You

Answer a few quick questions…

Match the columns

Column A                   Column B

HDFS                       DataNode
MapReduce Programming      NameNode
Master node                Processing Data
Slave node                 Google File System and MapReduce
Hadoop Implementation      Storage

Match the columns

Column A             Column B

JobTracker           Executes Task
MapReduce            Schedules Task
TaskTracker          Programming Model
Job Configuration    Converts input into Key Value pair
Map                  Job Parameters

Thank You
