Hadoop Distributed File System (HDFS)

Agenda
▪ Overview
▪ Architecture
– NameNode
– DataNode
▪ Blocks and Replication
▪ File System Shell
▪ Web Console

Two Key Aspects of Hadoop

MapReduce
• Parallel Programming
• Fault Tolerant

HDFS
• Distributed
• Reliable
• Commodity gear

Hadoop Distributed File System (HDFS)

▪ Distributed, scalable, fault tolerant, high throughput
▪ Data access through MapReduce
▪ Files split into blocks
▪ 3 replicas for each piece of data by default
▪ Can create, delete, copy, but NOT update
▪ Designed for streaming reads, not random access
▪ Data locality: processing data on or near the physical storage to decrease
transmission of data

HDFS – Architecture

Master / Slave architecture

▪ Master: NameNode
– manages the file system namespace and metadata
  • FsImage
  • EditLog
– regulates client access to files

▪ Slave: DataNode
– many per cluster
– manages storage attached to the nodes
– periodically reports status to NameNode

[Figure: the NameNode holds the metadata for File1 (blocks a, b, c, d); the DataNodes each store a subset of the block replicas.]

HDFS – Blocks
▪ HDFS is designed to support very large files
▪ Each file is split into blocks
– Hadoop default: 64 MB (Hadoop 1.x; 128 MB from Hadoop 2.x on)
– BigInsights default: 128 MB
▪ The blocks of a file reside on different physical DataNodes
▪ Behind the scenes, 1 HDFS block is supported by multiple operating system blocks

[Figure: one 64 MB HDFS block mapped onto many smaller OS blocks]

▪ If a file or a chunk of the file is smaller than the block size, only the needed space is used. E.g.: a 210 MB file is split as
64 MB | 64 MB | 64 MB | 18 MB
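
The 210 MB example can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API. Sizes are in whole megabytes.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size occupies.

    Hypothetical helper for illustration only, not a Hadoop API.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only uses the space it needs.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(210))        # [64, 64, 64, 18]
print(split_into_blocks(210, 128))   # [128, 82]
```

With the BigInsights default of 128 MB, the same file needs two blocks instead of four.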

HDFS – Replication
▪ Blocks of data are replicated to multiple nodes
– Behavior is controlled by the replication factor, configurable per file
– Default is 3 replicas

Common case:
▪ one replica on one node in the local rack
▪ another replica on a different node in the local rack
▪ and the last on a different node in a different rack

This cuts inter-rack write traffic, which improves write performance.
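
The common case listed above can be sketched as a small selection function. This is an illustration of the slide's description only: `place_replicas` and the topology dictionary are hypothetical, and the sketch always takes the first eligible node. Actual HDFS chooses among eligible nodes at random, and current Hadoop releases document a somewhat different default arrangement.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes following the common case described on the slide.

    topology maps a rack name to the list of DataNodes on that rack.
    Hypothetical helper; picks the first eligible node instead of a random one.
    """
    local_rack = next(
        (rack for rack, nodes in topology.items() if writer in nodes), None
    )
    if local_rack is None:
        raise ValueError(f"{writer} is not part of the topology")
    # Second replica: a different node on the writer's rack.
    local_others = [n for n in topology[local_rack] if n != writer]
    # Third replica: any node on a different rack.
    remote_nodes = [
        n for rack, nodes in topology.items() if rack != local_rack for n in nodes
    ]
    if not local_others or not remote_nodes:
        raise ValueError("need a second local node and at least one remote rack")
    return [writer, local_others[0], remote_nodes[0]]


topology = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn2', 'dn3']
```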

Setting Rack Topology (Rack Awareness)

▪ Can be defined by a script that specifies which node is on which rack.
▪ The script is referenced by the topology.script.file.name property in core-site.xml.

– Example of property:
<property>
<name>topology.script.file.name</name>
<value>/opt/ibm/biginsights/hadoop-conf/rack-aware.sh</value>
</property>
▪ The network topology script (topology.script.file.name in the above example)
receives as arguments one or more IP addresses of nodes in the cluster. It
returns on stdout a list of rack names, one for each input. The input and
output order must be consistent.
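
A topology script can be any executable. The following is a minimal sketch in Python, assuming a hard-coded mapping: the IP addresses and rack names are invented for illustration, and a real script would more likely read them from a file. Unknown addresses fall back to `/default-rack`.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack mapping, for illustration only.
RACK_BY_ADDRESS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(addresses):
    """Return one rack name per input address, in the same order."""
    return [RACK_BY_ADDRESS.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == "__main__":
    # Hadoop passes the addresses as arguments and reads the racks from stdout.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

Called as `rack-aware.py 10.0.1.11 10.0.2.21`, the sketch prints `/rack1` and `/rack2` on separate lines, preserving the input order as the slide requires.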

Namenode Startup

1. NameNode reads the fsimage into memory
2. NameNode applies the editlog changes
3. NameNode waits for block data from the DataNodes
▪ The NameNode does not persist block locations; it rebuilds them from the block information the DataNodes send
▪ The NameNode exits safemode when 99.9% of blocks have at least one copy accounted for

[Figure: (1) the fsimage is read from namedir, (2) the editlog is read and applied, (3) datanode1 and datanode2 send information about the blocks in their datadir (block1, block2, …) to the NameNode]
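
The safemode exit condition can be written as a simple ratio check. This is a sketch under stated assumptions: the function name and the dictionary layout are invented for illustration, and the real NameNode tracks this state internally.

```python
def can_leave_safemode(reported_replicas, threshold=0.999, min_replicas=1):
    """Check the 99.9% rule from the slide.

    reported_replicas maps a block id to the number of replicas that DataNodes
    have reported so far. Hypothetical helper, not a Hadoop API.
    """
    total = len(reported_replicas)
    if total == 0:
        # Assumption: an empty file system has nothing to wait for.
        return True
    safe = sum(1 for count in reported_replicas.values() if count >= min_replicas)
    return safe / total >= threshold


blocks = {block_id: 1 for block_id in range(1000)}
print(can_leave_safemode(blocks))   # True: every block has a reported copy

blocks[0] = 0
blocks[1] = 0
print(can_leave_safemode(blocks))   # False: only 99.8% of blocks are accounted for
```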

Adding file

1. File is added to NameNode memory and persisted in the editlog
2. Data is written in blocks to the DataNodes
▪ The receiving DataNode starts a chained copy to two other DataNodes
▪ If at least one write for each block succeeds, the write is successful

[Figure: the NameNode records the change in the editlog in its namedir; datanode1 and datanode2 store block1, block2, … in their datadir]

Managing Cluster

▪ Adding a DataNode
– Start the new DataNode (pointing to the NameNode)
– If required, run the balancer (hadoop balancer) to rebalance blocks

▪ Removing a node
– Simply remove the DataNode
– Better: add the node to the exclude file and wait until all blocks have been moved
– Progress can be checked in the server admin console at server:50070

▪ Checking filesystem health
– Use hadoop fsck

HDFS-2 Namenode HA
▪ HDFS-2 adds NameNode High Availability
▪ The Standby NameNode needs the filesystem transactions and block locations for fast failover
▪ Every filesystem modification is logged by the Active NameNode to the quorum JournalNodes (at least 3)
– The Standby NameNode applies changes from the JournalNodes as they occur
– The majority of JournalNodes defines reality
– Split brain is avoided by the JournalNodes (they only allow one NameNode to write to them)
▪ DataNodes send block locations and heartbeats to both NameNodes
▪ The memory state of the Standby NameNode is very close to that of the Active NameNode
– Much faster failover than a cold start

[Figure: the Active NameNode writes to Journalnode1, Journalnode2 and Journalnode3, and the Standby NameNode reads from them; Datanode1 … Datanodex report to both NameNodes]
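
The majority rule and the single-writer guarantee can be illustrated with a toy model. This is a heavily simplified sketch, not the actual quorum journal protocol: the class and function names are invented, and the "epoch" number stands in for whatever mechanism the JournalNodes use to recognize the current writer.

```python
class JournalNode:
    """Toy JournalNode that only accepts edits from the newest writer it knows."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        """Register a writer; succeeds only if its epoch is newer than any seen."""
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def journal(self, epoch, edit):
        """Store an edit unless it comes from a writer with a stale epoch."""
        if epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


def log_edit(journal_nodes, epoch, edit):
    """An edit counts as committed once a majority of JournalNodes stored it."""
    acks = sum(1 for jn in journal_nodes if jn.journal(epoch, edit))
    return acks > len(journal_nodes) // 2


nodes = [JournalNode() for _ in range(3)]
for jn in nodes:
    jn.new_epoch(1)                           # first Active NameNode
print(log_edit(nodes, 1, "mkdir /a"))         # True: 3 of 3 acknowledged

nodes[0].new_epoch(2)                         # after failover, the new Active
nodes[1].new_epoch(2)                         # registers on a majority
print(log_edit(nodes, 1, "delete /a"))        # False: old writer reaches only 1 node
print(log_edit(nodes, 2, "mkdir /b"))         # True
```

Once a majority has accepted the new writer, the old Active NameNode can no longer reach a quorum, which is what prevents split brain in this model.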

Federated Namenode (HDFS2)

▪ New in Hadoop 2: NameNodes can be federated
– Historically, the NameNode would become a bottleneck on huge clusters
– One million blocks or ~100 TB of data require roughly one GB of RAM in the NameNode

▪ Blockpools
– Administrator can create separate blockpools/namespaces with different NameNodes
– DataNodes register on all NameNodes
– DataNodes store data of all blockpools (otherwise you could set up separate clusters)
– New ClusterID identifies all NameNodes in a cluster
– A namespace and its block pool together are called a Namespace Volume
– You define which blockpool to use by connecting to a specific NameNode
– Each NameNode still has its own separate backup/secondary/checkpoint node

▪ Benefits
– One Namenode failure will not impact other Blockpools
– Better scalability for large numbers of file operations

Secondary NameNode
▪ During operation the primary NameNode cannot merge fsimage and editlog
▪ This is done on the Secondary NameNode
▪ Every couple of minutes (the checkpoint interval is configurable), the Secondary NameNode copies the new editlog from the primary NameNode
– Merges the editlog into the fsimage
– Copies the new merged fsimage back to the primary NameNode
▪ Not HA, but faster startup time
– The Secondary NameNode does not have the complete image; in-flight transactions would be lost
– The primary NameNode needs to merge less during startup
▪ Was temporarily deprecated because of NameNode HA but has some advantages
– (no need for quorum nodes, less network traffic, fewer moving parts)

[Figure: the new editlog is copied from the primary NameNode's namedir to the Secondary NameNode; the merged fsimage is copied back to the primary NameNode's namedir]
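
The checkpoint merge amounts to replaying the logged operations on top of the last image. The sketch below models this with plain Python data structures. The operation names and the tuple layout are invented for illustration and do not reflect the real on-disk formats of fsimage or editlog.

```python
def apply_edits(fsimage, editlog):
    """Return a new namespace image with the logged operations applied in order.

    fsimage maps a path to its list of block ids; editlog is a list of tuples
    such as ("create", path, blocks). Hypothetical format for illustration.
    """
    merged = dict(fsimage)  # leave the old image untouched
    for op, path, *args in editlog:
        if op == "create":
            merged[path] = args[0]
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


old_image = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
print(apply_edits(old_image, edits))   # {'/c': ['blk_1']}
```

After the merge the edit log can start empty again, which is why the primary NameNode has less to replay at its next startup.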

Possible FileSystem Setup

▪ GPFS
– No single point of failure
– POSIX compliance
– Advanced features like cold storage, backup and restore

▪ Hadoop 2 with HA
– No single point of failure
– Wide community support

▪ Hadoop 2 without HA (or Hadoop 1.x in older versions)
– Copy namedir to NFS (RAID)
– Have a virtual IP for the backup NameNode
– Still some failover time to read blocks; no instant failover, but less overhead

fs – file system shell

• File System Shell (fs)

• Invoked as follows:
hadoop fs <args>
• Example:
• Listing the current directory in HDFS

hadoop fs -ls .

fs – file system shell

• FS shell commands take URIs as arguments
• URI format:
scheme://authority/path
• Scheme:
• For the local filesystem, the scheme is file
• For HDFS, the scheme is hdfs
• Authority is the hostname and port of the NameNode

hadoop fs -copyFromLocal file:///myfile.txt hdfs://localhost:9000/user/keith/myfile.txt

• Scheme and authority are optional
• Defaults are taken from configuration file core-site.xml
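
How a URI breaks down into scheme, authority and path, and how missing parts fall back to a default, can be sketched with Python's standard `urllib.parse`. The `resolve` function and its `default_fs` value are assumptions for illustration. In a real installation the default comes from core-site.xml.

```python
from urllib.parse import urlsplit


def resolve(uri, default_fs="hdfs://localhost:9000"):
    """Split a URI into (scheme, authority, path), filling gaps from a default.

    Hypothetical helper; default_fs stands in for the configured default file system.
    """
    parts = urlsplit(uri)
    default = urlsplit(default_fs)
    scheme = parts.scheme or default.scheme
    # Only borrow the default authority when the URI has no scheme of its own.
    authority = parts.netloc if parts.scheme else default.netloc
    return scheme, authority, parts.path


print(resolve("hdfs://localhost:9000/user/keith/myfile.txt"))
# ('hdfs', 'localhost:9000', '/user/keith/myfile.txt')
print(resolve("file:///myfile.txt"))
# ('file', '', '/myfile.txt')
print(resolve("/user/keith/myfile.txt"))
# ('hdfs', 'localhost:9000', '/user/keith/myfile.txt')
```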

fs – file system shell

• Many POSIX-like commands
• cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail

• Some HDFS-specific commands

• copyFromLocal, put, copyToLocal, get, getmerge, setrep

HDFS – FS shell commands

• copyFromLocal / put
• Copy files from the local file system into fs

hadoop fs -copyFromLocal <localsrc> .. <dst>

hadoop fs -put <localsrc> .. <dst>

HDFS – FS shell commands

• copyToLocal / get
• Copy files from fs into the local file system

hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst>

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
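
When these commands are driven from a script, it helps to build the argument list in one place. The sketch below only assembles the command line and does not run it. `fs_command` is a hypothetical helper, and the paths in the example are placeholders.

```python
import shlex

# Copy commands covered on the slides above.
COPY_ACTIONS = {"put", "copyFromLocal", "get", "copyToLocal"}


def fs_command(action, *paths):
    """Build the argument list for a 'hadoop fs' copy command (hypothetical helper)."""
    if action not in COPY_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    if len(paths) < 2:
        raise ValueError("need at least one source and a destination")
    return ["hadoop", "fs", f"-{action}", *paths]


cmd = fs_command("put", "a.txt", "b.txt", "/user/keith")
print(shlex.join(cmd))   # hadoop fs -put a.txt b.txt /user/keith
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` client is installed.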

Files Tab – hadoop shell command

Questions?
