
HDFS (THE HADOOP DISTRIBUTED FILE SYSTEM)

Instructor: Oussama Derbel


OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)



Introduction

■ What is Hadoop?

– Hadoop was created by Doug Cutting and Mike Cafarella in 2005


– Cutting named the program after his son’s toy elephant.



Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment.

– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available; they are mainly useful for achieving greater computational power at low cost.


Introduction

■ Hadoop History
Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part

– MapReduce: the processing part


OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)


HDFS Overview and Architecture

■ Apache Hadoop consists of two sub-projects:

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous amounts of data in parallel on large clusters of compute nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
HDFS Overview and Architecture

■ The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster.

■ It provides scalable, fault-tolerant, rack-aware data storage designed to be deployed on commodity hardware.

■ HDFS is:

– designed with hardware failure in mind

– built for large datasets, with a default block size of 128 MB

– optimized for sequential operations

– rack-aware (with rack awareness, the NameNode prefers DataNodes on the same rack or a nearby rack when placing blocks)

– cross-platform and supports heterogeneous clusters


HDFS Overview and Architecture

■ How Is Data Stored in Hadoop?


HDFS Overview and Architecture

■ Hadoop = Partitioning + Replication
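In configuration terms, the partitioning granularity and the replication factor are the two knobs behind this equation. A minimal hdfs-site.xml sketch, shown with the usual defaults (128 MB blocks, 3 replicas):

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>  <!-- 128 MB in bytes -->
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>          <!-- copies of each block -->
    </property>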


OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)


Deployment Architecture
■ HDFS Architecture: master-slave architecture

■ Operation is ensured by 3 types of nodes:

– Active NameNode (ANN): master node

■ Hosts the metadata (file-to-block mapping and permissions)
■ Locates the blocks in the cluster
■ Manages the replication of blocks in the event of a DN failure

– Standby NameNode (SNN): master node

■ Performs maintenance tasks for the ANN
■ Acts as a mirror of the ANN in the event of failure

– Data Node (DN): slave node

■ Stores data blocks in the local file system
■ Provides block reports to the NN and facilitates block transfers to other DNs
Deployment Architecture


OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)


Name Node


Name Node

■ Initially, data is broken into abstract data blocks.

– The file metadata for these blocks, which includes the file name, file permissions, IDs, locations, and the number of replicas, is stored in the fsimage, held in the NameNode's local memory.

– FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS, with details about the data blocks and which blocks are stored on which node.

– EditLogs is a transaction log that records changes to the HDFS file system and actions performed on the HDFS cluster, such as the addition of a new block, replication, or deletion. It records the changes made since the last FsImage was created; these changes are then merged into the FsImage file to create a new FsImage.
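Both files can be inspected offline with the standard hdfs tools; a quick sketch (the file names below are illustrative, as they would appear in the NameNode's metadata directory):

    # Dump a copy of the fsimage to XML with the Offline Image Viewer
    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml

    # Dump an edit-log segment with the Offline Edits Viewer
    hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml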
Name Node

■ Should the NameNode fail, HDFS would not be able to locate any of the data sets distributed throughout the DataNodes.

– This makes the NameNode the single point of failure for the entire cluster.

– This vulnerability is resolved by implementing a Secondary NameNode or a Standby NameNode.

– The Secondary NameNode's whole purpose is to checkpoint the HDFS metadata. It is just a helper node for the NameNode, which is why it is also known as the Checkpoint Node.


OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)


Data Node and Checkpoint Node (Secondary Name Node)

■ Secondary NameNode

– The Secondary NameNode served as the primary backup solution in early Hadoop versions. Periodically, the Secondary NameNode downloads the current fsimage instance and edit logs from the NameNode and merges them. The merged fsimage can then be retrieved and restored on the primary NameNode.

– Failover is not an automated process: an administrator would need to recover the data from the Secondary NameNode manually.


Data Node and Checkpoint Node (Secondary Name Node)

■ Standby NameNode

– The High Availability feature was introduced in Hadoop 2.0 and subsequent versions to avoid any downtime in case of NameNode failure. This feature allows you to maintain two NameNodes running on separate dedicated master nodes.

– The Standby NameNode provides automated failover in case the Active NameNode becomes unavailable. The Standby NameNode additionally carries out the checkpointing process. Due to this property, the Secondary and Standby NameNodes are not compatible: a Hadoop cluster can maintain either one or the other.


Data Node and Checkpoint Node (Secondary Name Node)

■ Data Node

– Each DataNode in a cluster uses a background process to store the individual blocks of data on slave servers.

– By default, HDFS stores three copies of every data block on separate DataNodes. The NameNode uses a rack-aware placement policy. This means that the DataNodes that contain the data block replicas cannot all be located on the same server rack.

– A DataNode communicates with and accepts instructions from the NameNode roughly twenty times a minute (a heartbeat every few seconds). It also reports the status and health of the data blocks located on that node once an hour (a block report). Based on this information, the NameNode can request the DataNode to create additional replicas, remove them, or decrease the number of data blocks present on the node.
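One quick way to observe this reporting from the outside is the standard admin command below, which lists each DataNode along with its capacity, usage, and time of last contact (assuming it is run on a node with access to the cluster):

    hdfs dfsadmin -report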
Data Node and Checkpoint Node (Secondary Name Node)

■ Rack Aware Placement Policy

■ One of the main objectives of a distributed storage system like HDFS is to maintain high availability and replication. Therefore, data blocks need to be distributed not only on different DataNodes but on nodes located on different server racks.

■ This ensures that the failure of an entire rack does not terminate all data replicas. The HDFS NameNode maintains a default rack-aware replica placement policy:

■ The first data block replica is placed on the same node as the client.

■ The second replica is automatically placed on a random DataNode on a different rack.

■ The third replica is placed on a separate DataNode on the same rack as the second replica.

■ Any additional replicas are stored on random DataNodes throughout the cluster.
Data Node and Checkpoint Node (Secondary Name Node)

■ This rack placement policy maintains only one replica per node and sets a limit of two replicas per server rack.

Rack failures are much less frequent than node failures. HDFS ensures high reliability by always storing at least one data block replica on a DataNode on a different rack.
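HDFS does not detect rack locations by itself; clusters are commonly given a topology script via the net.topology.script.file.name property in core-site.xml. A minimal sketch, assuming a hypothetical two-rack layout with placeholder IP ranges:

    #!/bin/bash
    # topology.sh: invoked by the NameNode with DataNode IPs/hostnames as
    # arguments; prints one rack path per argument (mapping is illustrative)
    for node in "$@"; do
      case "$node" in
        10.0.1.*) echo -n "/rack1 " ;;
        10.0.2.*) echo -n "/rack2 " ;;
        *)        echo -n "/default-rack " ;;
      esac
    done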
OUTLINE
Introduction

HDFS Overview and Architecture

Deployment Architecture

Name Node

Data Node and Checkpoint Node (Secondary Name Node)

HDFS Data Flows (Read/Write)


HDFS Data Flows (Read/Write)

■ Read Operation

1. A client initiates the read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.

2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are those of the first few blocks of the file.

3. In response to this metadata request, the addresses of the DataNodes holding a copy of each block are returned.


HDFS Data Flows (Read/Write)

■ Read Operation

4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 of the diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
HDFS Data Flows (Read/Write)

■ Read Operation

5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This read() process continues until the end of the block is reached.

6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.

7. Once the client is done reading, it calls the close() method.
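The whole read flow maps onto a few calls in the Java FileSystem API. A minimal sketch, assuming a configured cluster and a hypothetical file path (/data/sample.txt):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // a DistributedFileSystem when fs.defaultFS is hdfs://

            // Steps 1-3: open() asks the NameNode over RPC for block locations
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                byte[] buffer = new byte[4096];
                int n;
                // Steps 4-6: each read() streams bytes from the DataNode holding the current block
                while ((n = in.read(buffer)) != -1) {
                    System.out.write(buffer, 0, n);
                }
            } // Step 7: try-with-resources calls close()
        }
    }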
HDFS Data Flows (Read/Write)

■ Write Operation

1- A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file (step 1 in the diagram).

2- The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the creation of a new file. However, this file-create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file being created does not already exist and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission, an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
HDFS Data Flows (Read/Write)

■ Write Operation

3- Once a new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).

4- FSDataOutputStream contains a DFSOutputStream object, which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream keeps creating packets with this data. These packets are enqueued into a queue called the DataQueue.
HDFS Data Flows (Read/Write)

■ Write Operation

5- There is one more component, called the DataStreamer, which consumes this DataQueue. The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking desirable DataNodes to be used for replication.

6- Now the process of replication starts, by creating a pipeline using DataNodes. In our case, we have chosen a replication level of 3, and hence there are 3 DataNodes in the pipeline.
HDFS Data Flows (Read/Write)

■ Write Operation

7- The DataStreamer pours packets into the first DataNode in the pipeline.

8- Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.

9- Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that are waiting for acknowledgment from the DataNodes.


HDFS Data Flows (Read/Write)

■ Write Operation

10- Once the acknowledgment for a packet is received from all DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. In the event of a DataNode failure, packets from this queue are used to reinitiate the operation.

11- After the client is done writing data, it calls the close() method (step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgment.

12- Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
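On the client side, this entire pipeline is hidden behind create() and close(). A minimal sketch, assuming a configured cluster and a hypothetical target path (/data/output.txt):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Steps 1-2: create() asks the NameNode to record the new file;
            // an IOException is thrown if it exists or permissions are missing
            try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
                // Steps 3-9: write() hands bytes to DFSOutputStream, which packetizes
                // them and pushes packets through the DataNode replication pipeline
                out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
            } // Steps 10-12: close() flushes remaining packets, waits for acks,
              // and tells the NameNode the file is complete
        }
    }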
HDFS Commands

■ Put command
■ List command
■ Get command
■ Cat command
■ Tail command
■ Text command
■ Make directory command
■ Move command
■ Change file permissions command
■ Remove command
■ Remove recursively command
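A sketch of how each of these looks with the hdfs dfs client (paths are placeholders; the older hadoop fs form takes the same options):

    hdfs dfs -put local.txt /data/          # Put
    hdfs dfs -ls /data                      # List
    hdfs dfs -get /data/local.txt copy.txt  # Get
    hdfs dfs -cat /data/local.txt           # Cat
    hdfs dfs -tail /data/local.txt          # Tail
    hdfs dfs -text /data/file.gz            # Text (decodes/decompresses to stdout)
    hdfs dfs -mkdir /data/new               # Make directory
    hdfs dfs -mv /data/a.txt /data/b.txt    # Move
    hdfs dfs -chmod 644 /data/b.txt         # Change file permissions
    hdfs dfs -rm /data/b.txt                # Remove
    hdfs dfs -rm -r /data/new               # Remove recursively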
Thank you
