0% found this document useful (0 votes)
14 views

Hadoop Cluster

Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views

Hadoop Cluster

Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 20

Cluster

Md. Asif Mahmud Ridoy


Lecturer
Department of Computer Science and Engineering
Agenda
⮚ What is Hadoop
⮚ Hadoop Components
⮚ Hadoop Cluster Architecture
⮚ Size of Hadoop Architecture
⮚ Single Node and Multi-Node Cluster
⮚ Communication Protocol in Hadoop Cluster
⮚ Benefits of Hadoop Cluster
⮚ Challenges of Hadoop Cluster
What is Hadoop
Hadoop is an open project overseen by the Apache Software Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations – Including
Cloudera, Yahoo!, Facebook, LinkedIn

 Hadoop takes a radical new approach to the problem of


distributed computing – distribute the data as it’s initially stored
in the system and individual nodes work on data local to the
nodes.
History of Hadoop
Who Uses Hadoop?
Hadoop Components

Application Application
& Resource Management Layer
Layer

Resource Management
Layer
Storage
Layer
Storage
Layer
HDFS
HDFS stands for Hadoop Distributed File System. It provides for data storage of Hadoop. HDFS splits
the data unit into smaller units called blocks and stores them in a distributed manner.
It has got two clusters running.
• Master node – NameNode
• Slave nodes – DataNode.
Block in HDFS
Block is nothing but the smallest unit of storage on a computer system. It is the smallest contiguous
storage allocated to a file. In Hadoop, we have a default block size of 128MB or 256 MB.
What is Cluster
⮚ A Hadoop cluster is a collection of computers, known as nodes, that
are networked together to perform parallel computations on big data
sets.

⮚ Hadoop clusters consist of a network of connected master and slave


nodes that utilize high availability, low-cost commodity hardware.
Hadoop Cluster Architecture
Master node in Hadoop Cluster Consists of
• NameNode
• Resource Manager
Hadoop Cluster Architecture
Slave nodes in the Hadoop Cluster Consists of
• DataNode
• NodeManager
• Client Node
Single Node Cluster

⮚ Consist of only one node.


⮚ All the NameNode, DataNode, ResourceManager, and
NodeManager are on a single machine.
⮚ Only one DataNode is running.
⮚ Mainly used for studying and testing purposes
Multi-Node Cluster

⮚ Consist of multiple node.


⮚ Clusters run on separate host or machine.
⮚ NameNode cluster run on the master machine
⮚ DataNode clusters runs on the slave machines.
⮚ Used in organizations for analysing Big Data.
Self-healing
• DataNodes send heartbeats to the NameNode
• After a period without any heartbeats, a DataNode is assumed to be lost
• NameNode determines which blocks were on the lost node
• NameNode finds other DataNodes with copies of these blocks Same block stored in
• These DataNodes are instructed to copy the blocks to other nodes different DataNodes
• Replication is actively maintained
What is cluster size in Hadoop?
A Hadoop cluster size is a set of metrics that defines storage and
compute capabilities to run Hadoop workloads, namely :

⮚ Number of nodes : number of Master nodes, number of Edge


Nodes, number of Worker Nodes.
⮚ Configuration of each type node: number of cores per node, RAM
and Disk Volume.
Calculating Hadoop Cluster Capacity
For example: Let's say that you need 500TB of space. If you have a
JBOD of 12 disks, and each disk can store 6TB of data, then the data
node capacity, or the maximum amount of data that each node can
store, will be 72 TB. Data nodes can be added as the data grows, so to
start with its better to select the lowest number of data nodes
required.

In this case, the number of data nodes required to store 500TB of data
equals 500/72, or approximately 7.
Communication Protocols Used in Hadoop
Clusters
⮚ The HDFS communication protocol works on the top of TCP/IP protocol.

⮚ The client establishes a connection with NameNode using configurable TCP


port.

⮚ Hadoop cluster establishes the connection to the client using client protocol.

⮚ DataNode talks to NameNode using the DataNode Protocol.


Benefits of Hadoop Clusters
⮚ Robustness
⮚ Data disks failures, heartbeats and re-replication
⮚ Cluster Rebalancing
⮚ Data integrity
⮚ Metadata disk failure
⮚ Snapshot
What are the challenges of a Hadoop Cluster?
⮚ Issue with small files

⮚ High processing overhead

⮚ Only batch processing is supported

⮚ Iterative Processing
References
⮚ Data-Flair
⮚ Data-Bricks

You might also like