
Module 2 - Introduction to Hadoop

Introduction to Hadoop

The Apache™ Hadoop® project develops open-source
software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Introduction to Hadoop

It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.

Yahoo! has more than 100,000 CPUs in over 40,000
servers running Hadoop.

Facebook has two major clusters:

One cluster has 1100 machines with 8800 cores and
about 12 PB of raw storage; the other is a 300-machine
cluster with 2400 cores and about 3 PB of raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.

The Hadoop data store concept implies storing the data at a number of clusters. Each cluster has a number of data stores, called racks. Each rack stores a number of DataNodes, and each DataNode holds a large number of data blocks. The racks are distributed across a cluster. The nodes have processing and storage capabilities, and hold the data blocks on which the application tasks run. By default, each data block replicates on at least three DataNodes, on the same or remote racks. The data at the stores enables running distributed applications, including analytics, data mining and OLAP, using the clusters. A file containing the data divides into data blocks. The default data block size is 64 MB. (The HDFS division of files into fixed-size blocks is similar to virtual-memory paging in Linux on Intel x86 and Pentium processors, where the page size is fixed at 4 KB.)
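To make the block concept concrete, the following minimal sketch (not from the slides; the file path is hypothetical) uses the HDFS Java API to list the blocks of a file and the DataNodes holding each replica:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; HDFS splits a large file into fixed-size blocks
    Path file = new Path("/user/demo/bigfile.dat");
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; each lists the DataNodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```

With the default replication factor of three, each block should report three hosts.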

Hadoop HDFS features (system characteristics) are as follows:


Scalable

Self-manageable

Self-healing

Distributed file system
Hadoop core components
The Hadoop core components of the framework
are:

Hadoop Common — The common module
contains the libraries and utilities that are
required by the other modules of Hadoop.
Hadoop core components

Hadoop Distributed File System (HDFS) — A
Java-based distributed file system which can
store all kinds of data on the disks at the
clusters.
Hadoop core components


MapReduce — The software programming model in
Hadoop, which uses Mapper and Reducer tasks.
Hadoop processes large sets of data in parallel
and in batches.
Hadoop core components

YARN — Software for managing the resources for
computing.

The user application tasks or sub-tasks run in
parallel on Hadoop; YARN schedules them and
handles the requests for resources during the
distributed running of the tasks.
Features of Hadoop

Fault-efficient, scalable, flexible and modular design

Robust design of HDFS

Store and process Big Data

Distributed cluster computing with data locality

Hardware fault-tolerant

Open-source framework

Java and Linux based
Features of Hadoop
Fault-efficient, scalable, flexible and modular design:

Hadoop uses a simple and modular programming model,
and provides servers with high scalability.

The system scales by adding new nodes to handle
larger data.

Modular functions make the system flexible.

One can add or replace components with ease.
Features of Hadoop
Robust design of HDFS:

Execution of Big Data applications continues even
when an individual server or cluster fails. This is
because Hadoop provides for backup (due to
replication of each data block at least three times)
and a data-recovery mechanism.

HDFS thus has high reliability.
Features of Hadoop
Store and process Big Data:

Processes Big Data of 3V (volume, velocity, variety) characteristics.
Features of Hadoop
Distributed cluster computing with data locality:

Processes Big Data at high speed, as the application tasks
and sub-tasks are submitted to the DataNodes that hold
the data.

One can achieve more computing power by increasing the
number of computing nodes. The processing splits across
multiple DataNodes (servers), which yields fast processing
and aggregated results.
Features of Hadoop
Hardware fault-tolerant:

A fault does not affect data and application
processing. If a node goes down, the other nodes
take over its work.

This is due to the multiple copies of all data blocks,
which replicate automatically. The default is three
copies of each data block.
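As an illustrative sketch (not from the slides; the path is hypothetical), the replication factor can also be requested per file through the HDFS Java API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; the cluster-wide default (dfs.replication) is 3
    Path file = new Path("/user/demo/sample.txt");

    // Ask HDFS to keep three copies of this file's blocks
    boolean accepted = fs.setReplication(file, (short) 3);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}
```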
Features of Hadoop
Open-source framework:

Open-source access and cloud services enable
large data stores. Hadoop uses a cluster of
multiple inexpensive servers or the cloud.
Features of Hadoop
Java and Linux based:

Hadoop uses Java interfaces. The Hadoop base is
Linux, but it has its own set of shell commands.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS is a core component of Hadoop.

HDFS is designed to run on a cluster of
computers and servers at cloud-based utility
services.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


HDFS stores Big Data, which may range from
GBs to PBs.

HDFS stores the data in a distributed manner to
enable fast computation.
HDFS Data Storage

Hadoop data store concept implies storing the data
at a number of clusters.

Each cluster has a number of data stores, called
racks.

Each rack stores a number of DataNodes.

Each DataNode has a large number of data blocks.
Problem

Consider a data storage for University
students. Each student's data, stuData,
is in a file of size less than 64 MB (1 MB = 2^20 B).
A data block stores the full file data for a
student stuData_idN, where N = 1 to 500.
(i) How will the files of each student be
distributed at a Hadoop cluster?
How many students' data can be stored at one
cluster? Assume that each rack has two
DataNodes, each with 64 GB (1 GB = 2^30 B)
of memory, and that the cluster consists of
120 racks, thus 240 DataNodes.
(ii) What is the total memory capacity of the
cluster in TB?
(iii) Show the distributed blocks for the
students with ID = 96 and 1025.
Assume default replication in the
DataNodes = 3.
(iv) What shall be the changes when a
stuData file size is <= 128 MB?
(i) The default data block size is 64 MB.

Each student's file size is less than
64 MB; therefore, each student file
occupies one data block.

A data block resides in a DataNode.

Assume, for simplicity, that each rack has two
nodes, each of memory capacity 64 GB.

Each node can thus store 64 GB/64 MB = 1024
data blocks = 1024 student files.

Each rack can thus store 2 x 64 GB/64 MB =
2048 data blocks = 2048 student files.

Each data block replicates three times
by default in the DataNodes.

Therefore, the number of students whose
data can be stored in the cluster =
number of racks multiplied by number of
files per rack, divided by 3 = 120 x 2048/3 = 81,920.

Therefore, a maximum of 81,920 stuData_idN
files can be distributed per cluster,
with N = 1 to 81,920.

(ii) Total memory capacity of the cluster =
120 racks x 2 nodes x 64 GB = 120 x 128 GB = 15,360 GB = 15 TB.

(iii) Each of stuData_id96 and stuData_id1025
occupies one data block, and each block
replicates at three DataNodes in the same or
different racks. The file of student ID = 96 thus
appears as three distributed block copies, and
likewise for ID = 1025.

(iv) Each stuData file of size up to 128 MB will
now need two 64 MB data blocks, so each node
will store half the number of student files.
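The arithmetic above can be checked with a few lines of Java; this is just an illustrative calculation, not part of the original problem:

```java
public class ClusterCapacity {
  public static void main(String[] args) {
    long blockSizeMB = 64;       // default HDFS block size assumed in the problem
    long nodeCapacityGB = 64;    // each DataNode stores 64 GB
    int nodesPerRack = 2;
    int racks = 120;
    int replication = 3;         // default replication factor

    long blocksPerNode = nodeCapacityGB * 1024 / blockSizeMB;    // 1024
    long blocksPerRack = nodesPerRack * blocksPerNode;           // 2048
    long filesPerCluster = racks * blocksPerRack / replication;  // 81920

    long totalGB = (long) racks * nodesPerRack * nodeCapacityGB; // 15360 GB
    System.out.println("Blocks per node:      " + blocksPerNode);
    System.out.println("Student files/cluster:" + filesPerCluster);
    System.out.println("Total capacity (TB):  " + totalGB / 1024); // 15 TB
  }
}
```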
Hadoop main components
and
ecosystem components

The Hadoop ecosystem refers to a combination of
technologies.

The Hadoop ecosystem consists of Hadoop's own family of
applications, which tie up together with the
Hadoop framework.
The four layers in the figure are as follows:
(i) Distributed storage layer
(ii) Resource-manager layer for job or application sub-
tasks scheduling and execution
(iii) Processing-framework layer, consisting of Mapper
and Reducer for the MapReduce process-flow
(iv) APIs at application support layer (applications
such as Hive and Pig).
Hadoop Physical organization
HDFS uses NameNodes and DataNodes.
Hadoop Physical organization

A NameNode stores the file's
metadata. Metadata gives
information about the user
application's file, but does not
participate in the computations.
Hadoop Physical organization

The DataNode stores the actual data files in
the data blocks.
Hadoop Physical organization

A few nodes in a Hadoop cluster
act as NameNodes. These nodes
are termed MasterNodes, or
simply masters.

The masters have a different
configuration, supporting high
DRAM and processing power.

The masters have much less local
storage.

Hadoop Physical organization

Clients, as the users, run the
applications with the help of
Hadoop ecosystem projects;
for example, Hive, Mahout and
Pig are ecosystem projects.

The clients are not required to be
present at the Hadoop cluster.
Hadoop Physical organization

The MasterNode fundamentally plays
the role of a coordinator.

The MasterNode receives client
connections, maintains the
description of the global file-system
namespace and the allocation of file
blocks, and monitors the state of
the system in order to detect any
failure.

The masters consist of three
components: NameNode, Secondary
NameNode and JobTracker.
Hadoop Physical organization
The NameNode stores all the
file-system-related
information, such as:

Which part of the cluster a
file section is stored in

Last access time for the files

User permissions, i.e., which
user has access to the file
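As a minimal sketch (not from the slides; the path is hypothetical), this Java program reads exactly the kind of metadata the NameNode serves, via the HDFS FileStatus API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path
    FileStatus st = fs.getFileStatus(new Path("/user/demo/sample.txt"));

    // Metadata answered by the NameNode, not the DataNodes
    System.out.println("Owner:       " + st.getOwner());
    System.out.println("Permissions: " + st.getPermission());
    System.out.println("Access time: " + st.getAccessTime()); // ms since epoch
    System.out.println("Block size:  " + st.getBlockSize());
    System.out.println("Replication: " + st.getReplication());

    fs.close();
  }
}
```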
MapReduce Programming Model
MapReduce
Goal:
count the number of books in
the library.
Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
Reduce:
We all get together and add up our individual counts
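Below is a minimal sketch of this analogy as a Hadoop MapReduce job in Java. It is not from the slides, and the class and key names are illustrative: each input line is assumed to hold one book title, the Mapper emits a (books, 1) pair per book, and the Reducer adds up the individual counts into one total, just as in the library analogy.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BookCount {

  // Map ("you count shelf #1, I count shelf #2"): emit (books, 1) per book seen
  public static class ShelfMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("books");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (!line.toString().trim().isEmpty()) {
        context.write(KEY, ONE);
      }
    }
  }

  // Reduce ("we all get together"): add up the individual counts into one total
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "book count");
    job.setJarByClass(BookCount.class);
    job.setMapperClass(ShelfMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-totals per mapper, like one person per shelf
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Adding more map tasks (more people, more shelves) speeds up the map phase, while the single reduce key produces the one library-wide total.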
HADOOP YARN

YARN is a resource management platform. It
manages computing resources.

The platform is responsible for providing the
computational resources, such as CPUs,
memory and network I/O, which are needed
when an application executes.
HADOOP YARN

An application task has a number of sub-tasks.

Each sub-task uses the resources in allotted
time intervals.

YARN stands for Yet Another Resource
Negotiator
Hadoop 2 Execution Model

The figure shows the YARN-based execution model
and its components:

Client,

Resource Manager (RM),

Node Manager (NM),

Application Master (AM) and Containers.
Hadoop 2 Execution Model

The list of actions of the YARN resource allocation
and scheduling functions is as follows:

A MasterNode has two components:
(i) Job History Server
and
(ii) Resource Manager (RM).
Hadoop 2 Execution Model

A Client Node submits an application request
to the RM.

The RM is the master.

One RM exists per cluster.
Hadoop 2 Execution Model

The RM keeps information on all the slave NMs.

The information covers their location (rack
awareness) and the number of resources (data
blocks and servers) they have.
Hadoop 2 Execution Model

Multiple NMs exist in a cluster.

An NM creates an AM instance (AMI) and starts
it up.

The AMI initializes itself and registers with the
RM.

Multiple AMIs can be created in an AM.
Hadoop 2 Execution Model

The AMI performs the role of an Application
Manager (ApplM), which estimates the resource
requirements for running an application program
or sub-task.

The ApplMs send their requests for the
necessary resources to the RM.
Hadoop 2 Execution Model

The NM is a slave of the infrastructure.

It signals the RM whenever it initializes.

All active NMs send a controlling signal
periodically to the RM, signaling their presence.
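As a minimal sketch of how a client talks to the RM (not from the slides; it assumes a reachable YARN cluster and uses the YarnClient Java API), the following program lists the registered NMs known to the RM and asks the RM to grant a new application id:

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // The RM tracks every registered NM: location (rack awareness) and resources
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " rack=" + node.getRackName()
          + " capability=" + node.getCapability());
    }

    // Ask the RM for a new application; the submission context would
    // later describe the AM container to launch
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ApplicationId appId = ctx.getApplicationId();
    System.out.println("Granted application id: " + appId);

    yarnClient.stop();
  }
}
```

A full submission would go on to fill in the submission context (AM launch command, requested memory and cores) before calling the client's submit method; the sketch stops at the resource-negotiation handshake the slides describe.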
