The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to reliably store large datasets and stream data at high bandwidth to applications. It uses a master/slave architecture where the NameNode is the master metadata server and DataNodes store data in blocks with replication across machines. The NameNode manages the file system namespace and monitors DataNodes. DataNodes store data blocks and service read/write requests. HDFS provides reliability through replication and uses rack awareness for placement of replicas.

The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA

Presented by Ying Yang 9/24/2012

Outline
Introduction
Architecture
    NameNode, DataNodes, HDFS Client, CheckpointNode, BackupNode, Snapshots
File I/O Operations and Replica Management
    File Read and Write, Block Placement, Replication Management, Balancer
Practice at Yahoo!
Future Work

Introduction
HDFS
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed (1) to store very large data sets reliably and (2) to stream those data sets at high bandwidth to user applications. Both goals are achieved by replicating file content on multiple machines (DataNodes).

Architecture
HDFS is a block-structured file system: files are broken into blocks of 128 MB (configurable per file). A file can consist of several blocks, which are stored across a cluster of one or more machines with data storage capacity. To prevent loss of data, each block of a file is replicated across a number of machines.
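The block arithmetic above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the replication factor of 3 is an assumption taken from the "replicated three times" figure quoted later in this deck; nothing here is HDFS code.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the lengths (in bytes) of the blocks a file would be split into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The final block only holds the bytes that are left over.
        lengths.append(remainder)
    return lengths


def replica_count(file_size, replication=3, block_size=BLOCK_SIZE):
    """Total number of block replicas stored in the cluster for one file.

    replication=3 is an assumed value, not something this slide states.
    """
    return len(split_into_blocks(file_size, block_size)) * replication


if __name__ == "__main__":
    mb = 1024 * 1024
    print([n // mb for n in split_into_blocks(300 * mb)])
    print(replica_count(300 * mb))
```

Under these assumptions a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each of the three blocks is stored on several machines.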

Architecture
NameNode and DataNodes
HDFS stores file system metadata and application data separately. Metadata means the per-file records called inodes (attributes such as permissions, modification and access times, and namespace and disk space quotas) plus the list of blocks belonging to each file. HDFS stores metadata on a dedicated server called the NameNode (master). Application data are stored on other servers called DataNodes (slaves). All servers are fully connected and communicate with each other using TCP-based protocols (RPC).

Architecture
Single NameNode:
- Maintains the namespace tree (a hierarchy of files and directories) and handles operations such as opening, closing, and renaming files and directories.
- Determines the mapping of file blocks to DataNodes (the physical location of file data).
- Holds file metadata (i.e., inodes).
- Handles authorization and authentication.
- Collects block reports from DataNodes on block locations.
- Replicates missing blocks.
HDFS keeps the entire namespace in RAM, allowing fast access to the metadata.

Architecture
DataNodes:
The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
DataNodes periodically send block reports to the NameNode.

Architecture
NameNode and DataNode communication: Heartbeats.

DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
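As a rough illustration of heartbeat-based liveness tracking, the sketch below records the time of each DataNode's last heartbeat and reports nodes that have been silent for too long. The class name, the method names, and the 600-second timeout are all assumptions made for the example; the slide does not state an interval or a timeout.

```python
import time


class HeartbeatMonitor:
    """Toy liveness tracker: a node is considered dead once it stays silent too long."""

    def __init__(self, timeout_seconds=600.0, clock=time.monotonic):
        # timeout_seconds is an assumed value, not taken from the slides.
        self.timeout = timeout_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # datanode id -> time of last heartbeat

    def record_heartbeat(self, datanode_id):
        """Called whenever a heartbeat arrives from a DataNode."""
        self.last_seen[datanode_id] = self.clock()

    def dead_nodes(self):
        """Return the ids of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    fake_time = [0.0]
    monitor = HeartbeatMonitor(timeout_seconds=10, clock=lambda: fake_time[0])
    monitor.record_heartbeat("dn1")
    monitor.record_heartbeat("dn2")
    fake_time[0] = 8.0
    monitor.record_heartbeat("dn2")   # dn2 keeps reporting, dn1 goes silent
    fake_time[0] = 11.0
    print(monitor.dead_nodes())
```

Passing the clock in as a parameter is a design choice for the sketch only; it lets the timeout logic be exercised without waiting in real time.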

Architecture
Blockreports:
A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp, and the length of each block replica the server hosts. Block reports give the NameNode an up-to-date view of where block replicas are located on the cluster; the NameNode constructs and maintains its latest metadata from them.
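One way to picture how block reports turn into a location map is the following sketch. The record fields mirror the three items named above (block ID, generation stamp, length); the `BlockMap` class and its methods are hypothetical names, and a full report is assumed to replace whatever was previously known about that DataNode.

```python
from collections import defaultdict
from typing import NamedTuple


class ReplicaInfo(NamedTuple):
    """One entry of a block report, with the three fields named on the slide."""
    block_id: int
    generation_stamp: int
    length: int


class BlockMap:
    """Toy block-to-DataNode map rebuilt from block reports."""

    def __init__(self):
        self.locations = defaultdict(set)   # block id -> set of datanode ids

    def process_block_report(self, datanode_id, report):
        """Treat the report as the complete list of replicas on this DataNode."""
        reported = {replica.block_id for replica in report}
        # Forget replicas this node no longer reports.
        for block_id, nodes in list(self.locations.items()):
            if datanode_id in nodes and block_id not in reported:
                nodes.discard(datanode_id)
                if not nodes:
                    del self.locations[block_id]
        # Record every replica the node does report.
        for block_id in reported:
            self.locations[block_id].add(datanode_id)

    def nodes_for(self, block_id):
        """Return the DataNodes currently known to hold a replica of the block."""
        return sorted(self.locations.get(block_id, ()))


if __name__ == "__main__":
    block_map = BlockMap()
    block_map.process_block_report("dn1", [ReplicaInfo(1, 1, 100), ReplicaInfo(2, 1, 50)])
    block_map.process_block_report("dn2", [ReplicaInfo(1, 1, 100)])
    print(block_map.nodes_for(1))
```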

Architecture
Failure recovery
The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
- replicate blocks to other nodes (DataNode died; copy data to local);
- remove local block replicas;
- re-register or shut down the node.
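The "reply to a heartbeat" pattern can be sketched as a per-DataNode queue that is drained whenever that node next checks in. The class, the method names, and the tuple format of the commands are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class CommandQueue:
    """Toy model: commands wait until the target DataNode sends its next heartbeat."""

    def __init__(self):
        self.pending = defaultdict(deque)   # datanode id -> queued commands

    def enqueue(self, datanode_id, command):
        """Queue an instruction; nothing is sent to the DataNode at this point."""
        self.pending[datanode_id].append(command)

    def handle_heartbeat(self, datanode_id):
        """Return the queued commands as the heartbeat reply and clear the queue."""
        return list(self.pending.pop(datanode_id, ()))


if __name__ == "__main__":
    queue = CommandQueue()
    queue.enqueue("dn1", ("replicate", 42, "dn5"))   # hypothetical command format
    queue.enqueue("dn1", ("remove", 7))
    print(queue.handle_heartbeat("dn1"))   # both commands ride on this reply
    print(queue.handle_heartbeat("dn1"))   # nothing left on the next heartbeat
```

The point of the design is that the server never opens a connection to a DataNode; it only answers requests that the DataNode itself initiates.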

Architecture
Failure recovery
When a DataNode dies, the NameNode notices and instructs other DataNodes to replicate its data to a new DataNode. What if the NameNode dies?

Architecture
Failure recovery
The NameNode keeps a journal (the modification log of metadata) and a checkpoint (the persistent record of the metadata, stored in the local host's native file system). For example, during restart the NameNode initializes the namespace image from the checkpoint and then replays changes from the journal until the image is up to date with the last state of the file system.
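The restart sequence described above (load the checkpoint, then replay the journal) can be sketched as follows. The namespace is modeled as a plain dictionary from path to a list of block IDs, and the journal record format is invented for the example; neither reflects the actual on-disk formats.

```python
import copy


def replay(checkpoint, journal):
    """Rebuild the namespace by applying journal records to a checkpoint image.

    checkpoint: dict mapping file path -> list of block ids (toy format).
    journal:    list of tuples such as ("create", path) (hypothetical format).
    """
    namespace = copy.deepcopy(checkpoint)   # never mutate the saved checkpoint
    for operation, *args in journal:
        if operation == "create":
            (path,) = args
            namespace[path] = []
        elif operation == "add_block":
            path, block_id = args
            namespace[path].append(block_id)
        elif operation == "rename":
            source, destination = args
            namespace[destination] = namespace.pop(source)
        elif operation == "delete":
            (path,) = args
            del namespace[path]
        else:
            raise ValueError(f"unknown journal operation: {operation!r}")
    return namespace


if __name__ == "__main__":
    checkpoint = {"/logs/a": [1, 2]}
    journal = [
        ("create", "/logs/b"),
        ("add_block", "/logs/b", 3),
        ("rename", "/logs/a", "/archive/a"),
    ]
    print(replay(checkpoint, journal))
```

In this toy model, writing out the value returned by `replay` as the new checkpoint and starting again with an empty journal corresponds to creating a fresh checkpoint.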

Architecture
Failure recovery: CheckpointNode and BackupNode, two other roles of the NameNode

CheckpointNode: When the journal becomes too long, the CheckpointNode combines the existing checkpoint and journal to create a new checkpoint and an empty journal.

Architecture
Failure recovery: CheckpointNode and BackupNode, two other roles of the NameNode
BackupNode: A read-only NameNode. It maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk are a record of the latest namespace state.

Architecture
Failure recovery: Upgrades and File System Snapshots
The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades. During software upgrades, the possibility of corrupting the system due to software bugs or human mistakes increases. The snapshot mechanism lets administrators persistently save the current state of the file system (both data and metadata), so that if the upgrade results in data loss or corruption, it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

File I/O Operations and Replica Management

Hadoop has the concept of rack awareness. The default HDFS replica placement policy can be summarized as follows:
1. No DataNode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.
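The two placement rules can be expressed as a small validity check. This is only a sketch: the function name and arguments are hypothetical, and "sufficient racks" is interpreted here as "enough racks to hold every replica at two per rack", which is an assumption rather than something the slide defines.

```python
from collections import Counter


def placement_is_valid(replica_nodes, rack_of, total_racks):
    """Check one block's replica placement against the two summarized rules.

    replica_nodes: list of DataNode ids holding a replica of the block.
    rack_of:       dict mapping DataNode id -> rack id.
    total_racks:   number of racks in the cluster.
    """
    # Rule 1: no DataNode holds more than one replica of the block.
    if len(set(replica_nodes)) != len(replica_nodes):
        return False

    # Rule 2: no rack holds more than two replicas, if there are enough racks.
    # Assumed reading of "sufficient": two replicas per rack could fit them all.
    enough_racks = total_racks * 2 >= len(replica_nodes)
    replicas_per_rack = Counter(rack_of[node] for node in replica_nodes)
    if enough_racks and max(replicas_per_rack.values(), default=0) > 2:
        return False
    return True


if __name__ == "__main__":
    rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack1", "dn5": "rack2"}
    print(placement_is_valid(["dn1", "dn2", "dn5"], rack_of, total_racks=2))
    print(placement_is_valid(["dn1", "dn2", "dn3"], rack_of, total_racks=2))
```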

File I/O Operations and Replica Management

[Figure: block write example. Callouts: "OK. Write to DataNode 1." and "Hey, DN1, duplicate Block A to DN5 and DN6."]

File I/O Operations and Replica Management

Balancer

Practice at Yahoo!
HDFS clusters at Yahoo! include about 3500 nodes. A typical cluster node has:
- 2 quad-core Xeon processors @ 2.5 GHz
- Red Hat Enterprise Linux Server Release 5.1
- Sun Java JDK 1.6.0_13-b03
- 4 directly attached SATA drives (one terabyte each)
- 16 GB RAM
- 1-gigabit Ethernet

Practice at Yahoo!
70 percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.) For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64GB RAM; application tasks are never assigned to those hosts. In total, a cluster of 3500 nodes has 9.8 PB of storage available as blocks that are replicated three times yielding a net 3.3 PB of storage for user applications. As a convenient approximation, one thousand nodes represent one PB of application storage.
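The storage figures above can be checked with a short calculation. The inputs (3500 nodes, four 1 TB drives per node, 70 percent of disk for HDFS, three replicas) come from the slides; the function name and the decimal convention of 1 PB = 1000 TB are assumptions of this sketch.

```python
def cluster_capacity_pb(nodes, drives_per_node, tb_per_drive, hdfs_percent, replication):
    """Return (HDFS block storage, net application storage) in petabytes.

    Assumes decimal units, i.e. 1 PB = 1000 TB.
    """
    raw_tb = nodes * drives_per_node * tb_per_drive
    hdfs_tb = raw_tb * hdfs_percent / 100     # share of disk given to HDFS
    net_tb = hdfs_tb / replication            # each block is stored `replication` times
    return hdfs_tb / 1000, net_tb / 1000


if __name__ == "__main__":
    hdfs_pb, net_pb = cluster_capacity_pb(
        nodes=3500, drives_per_node=4, tb_per_drive=1,
        hdfs_percent=70, replication=3,
    )
    print(f"HDFS block storage: {hdfs_pb:.1f} PB")
    print(f"Net application storage: {net_pb:.1f} PB")
```

The same function gives roughly 0.93 PB of net storage for 1000 nodes, which matches the rule of thumb of about one PB per thousand nodes.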

Practice at Yahoo!
Durability of data
- Uncorrelated node failures: replicating data three times is a robust guard against loss of data due to uncorrelated node failures.
- Correlated node failures (the failure of a rack or core switch): HDFS can tolerate losing a rack switch, because each block has a replica on some other rack.
- Loss of electrical power to the cluster: a large cluster will lose a handful of blocks during a power-on restart.

Practice at Yahoo!
Benchmarks

NameNode Throughput benchmark

Future Work
Automated failover
Plan: use ZooKeeper, Yahoo!'s distributed consensus technology, to build an automated failover solution.

Scalability of the NameNode
Solution: Our near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster.
Drawbacks: The main drawback of multiple independent namespaces is the cost of managing them.
