
DATA228 Lecture Notes Week 4

The document provides an overview of distributed filesystems, focusing on Hadoop's Distributed File System (HDFS). It outlines key attributes, architecture, and core concepts of HDFS, including its design for large files, replication, and data flow for reads and writes. Additionally, it discusses the limitations of HDFS and its components like Namenodes and Datanodes.

Uploaded by

sreenidhi.hayagreevan


DATA 228

Big Data Technologies and Applications (Fall 2024)

Sangjin Lee
Hadoop: distributed filesystems
& HDFS

Chapter 3, “Hadoop: The Definitive Guide” 4th Edition, Tom White

What is a distributed filesystem?

“A distributed filesystem is a filesystem that enables clients to access file storage from multiple
hosts through a computer network as if the user was accessing local storage.”

What is a distributed filesystem?
Key phrases

• A filesystem

• Multiple hosts

• Through a computer network

• As if the user was accessing local storage

What is a distributed filesystem?
More attributes

• Semantics of a filesystem

• Paths, directories, access control, timestamps, etc.

• POSIX compliance?

• Resiliency and fault tolerance matter more than for local filesystems

• Transient network failures

• More traditional: SMB, NFS

• Big-data-driven: HDFS, GFS, MapR File System

• Storage-derived: CephFS, GlusterFS

• Cloud solutions (block-based): EBS (AWS), PD (GCP)

• Cloud solutions (object-based)*: S3 (AWS), GCS (GCP)

• Other vendor solutions: NetApp, Nutanix, Cohesity, …

* Not all object storage systems are filesystems.

Hadoop’s distributed filesystem

• Hadoop provides an abstract (distributed) filesystem API

• Clients of distributed filesystems can interact with them at the abstract level (via URIs)

• HDFS is only one implementation provided by Hadoop out of the box

• Examples

• file:// (local files), hdfs:// (HDFS), s3n:// (S3 “native”), gs:// (GCS), …

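The idea of selecting a filesystem implementation from the URI scheme can be sketched in a few lines of Python. This is only an illustration of the dispatch idea, not Hadoop's actual lookup code: the `BACKENDS` registry and the `resolve_backend` function are hypothetical names, and the host names in the sample URIs are made up.

```python
from urllib.parse import urlparse

# Hypothetical registry: URI scheme -> backend description.
# The schemes are the ones listed on the slide; everything else is illustrative.
BACKENDS = {
    "file": "local filesystem",
    "hdfs": "HDFS",
    "s3n": "S3 (native)",
    "gs": "Google Cloud Storage",
}


def resolve_backend(uri):
    """Return the backend a path would be routed to, based only on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")
    return BACKENDS[scheme]


if __name__ == "__main__":
    # Sample URIs with made-up hosts and buckets.
    for uri in ("file:///tmp/a.txt", "hdfs://namenode:8020/data/a.txt", "gs://bucket/a.txt"):
        print(uri, "->", resolve_backend(uri))
```

The point of the sketch is that client code only ever sees a URI; which storage system serves it is decided by the scheme.
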
HDFS
Design

• Geared towards very large files: GBs or TBs

• Streaming data access

• Write-once and read-many-times

• Reading whole files over random seeks

• Commodity hardware

• Highly resilient to individual node failures: multiple replicas, block repairs, rebalancing

HDFS
What HDFS is NOT so good at

• Low-latency data access

• Trade-off between throughput and latency

• Lots of small files: trade-off from architectural and scale considerations

• Multiple writers

• Arbitrary file modifications

• Doesn’t provide full POSIX compliance

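A rough calculation shows why lots of small files are a problem. The sketch below assumes the rule of thumb from the course text (Chapter 3 of the book cited above) that each file, directory, and block costs the namenode about 150 bytes of memory; the constant and the function name `namenode_heap_bytes` are assumptions for illustration, not measured values.

```python
# Rule of thumb assumed from the course text: roughly 150 bytes of namenode
# memory per file, directory, or block object. Not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate namenode memory for num_files files (directories are ignored)."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_million = namenode_heap_bytes(1_000_000)
    one_billion = namenode_heap_bytes(1_000_000_000)
    print(f"1 million single-block files: ~{one_million / 1e6:.0f} MB")   # ~300 MB
    print(f"1 billion single-block files: ~{one_billion / 1e9:.0f} GB")   # ~300 GB
```

The memory cost depends on the number of objects, not on how many bytes the files hold, which is why many small files hurt more than a few large ones.
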
HDFS
Core concepts

• Blocks

• Blocks are a useful concept in filesystem implementations

• Local filesystem blocks: commonly 512 B - 8 KB

• Hadoop’s default block size: 128 MB (often much larger in real clusters)

• Implications for small files

• Replication: 3 by default (erasure coding can reduce it)

• Compression: up to users

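The block size and replication defaults above translate into simple arithmetic. The following sketch uses the 128 MB and 3x defaults from the slide; the function names are illustrative, and the storage estimate ignores checksum and metadata overhead. It also assumes that a file smaller than one block only consumes as much underlying storage as it actually holds.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # default block size from the slide
REPLICATION = 3         # default replication factor from the slide


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file is split into (integer ceiling division)."""
    return -(-file_size // block_size)


def raw_storage(file_size, replication=REPLICATION):
    """Approximate bytes used across the cluster, ignoring checksum overhead.

    A short final block only occupies the bytes it holds, so the total is
    the file size times the replication factor.
    """
    return file_size * replication


if __name__ == "__main__":
    for size in (1 * MB, 128 * MB, 1024 * MB):
        print(f"{size // MB:>5} MB file: {block_count(size)} block(s), "
              f"{raw_storage(size) // MB} MB raw storage")
```

A 1 MB file and a 128 MB file both map to one block, so both cost the namenode the same amount of bookkeeping even though their storage footprints differ by two orders of magnitude.
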
HDFS
Architecture

• Namenodes and datanodes

HDFS
Namenode

• “One” for a single cluster

• Namenode manages metadata: metadata for files and directories

• Block locations are reported by datanodes (not persisted by the namenode)

• Namenode requires a large amount of memory

• Namenode data (namespace image and the edit log) are written to disk in several locations

• Namenode can be a scalability bottleneck

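The split between persisted metadata and reported block locations can be modeled with a toy class. This is a teaching sketch under simplifying assumptions, not how the namenode is implemented: `ToyNamenode`, its methods, and the block and datanode names are all hypothetical.

```python
from collections import defaultdict


class ToyNamenode:
    """Toy model: the file-to-block mapping is durable, block locations are not."""

    def __init__(self, file_blocks):
        # Stands in for the persisted namespace: path -> ordered list of block ids.
        self.file_blocks = file_blocks
        # Held in memory only and rebuilt from datanode reports: block id -> datanodes.
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, blocks):
        """Replace everything known about one datanode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block in blocks:
            self.block_locations[block].add(datanode)

    def locate(self, path):
        """Return (block id, sorted datanodes) pairs for a file, in block order."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


if __name__ == "__main__":
    nn = ToyNamenode({"/logs/a": ["blk_1", "blk_2"]})
    nn.receive_block_report("dn1", ["blk_1", "blk_2"])
    nn.receive_block_report("dn2", ["blk_1"])
    print(nn.locate("/logs/a"))
```

After a restart, a model like this would start with an empty `block_locations` map and fill it in again as datanodes report.
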
HDFS
Namenode high availability (HA)

• Redundant storage of the filesystem metadata

• Secondary namenode

• Gets periodic updates from the primary namenode and retains the state

• It can run as a hot standby (Hadoop calls this role the standby namenode)

• Failover via ZooKeeper

• Fencing

• Client handles it via the client library

HDFS
Datanodes

• One per node

• Stores and retrieves blocks (asked by clients and the namenode)

• Verifies blocks’ checksums periodically

• Reports the block list to the namenode

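Periodic checksum verification can be illustrated with per-chunk checksums. The sketch below is a stand-in under stated assumptions: it uses `zlib.crc32` from the Python standard library and a 512-byte chunk size chosen for the example, and the function names are hypothetical. It is not the checksum code that datanodes run.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size assumed for this sketch


def chunk_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Compute one CRC-32 value per fixed-size chunk of a block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def find_corrupt_chunks(data, expected, chunk=BYTES_PER_CHECKSUM):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk)
    if len(actual) != len(expected):
        raise ValueError("block length does not match the stored checksums")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    block = bytes(range(256)) * 8           # 2048 bytes, four chunks
    stored = chunk_checksums(block)         # written alongside the block
    damaged = bytearray(block)
    damaged[600] ^= 0xFF                    # flip one byte inside the second chunk
    print(find_corrupt_chunks(bytes(damaged), stored))
```

Checksumming per chunk rather than per block lets a scanner report which part of a block went bad, so a repair can be triggered from a healthy replica.
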
HDFS
Data flow: reads

• DistributedFileSystem returns the block locations from the namenode

• Actual reads are done via FSDataInputStream

• Reads go directly to datanodes (not through the namenode)

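The read path above can be sketched as a two-step toy model: block locations come from the namenode, then bytes are fetched straight from datanodes. Everything here is hypothetical (`ToyDatanode`, `read_file`, the block ids); the fallback to the next replica when a datanode is unavailable is a simplification of what a real client does.

```python
class ToyDatanode:
    """Toy datanode that serves whole blocks from an in-memory dict."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(block_locations, datanodes):
    """Assemble a file from its blocks.

    block_locations: ordered (block id, [datanode names]) pairs, i.e. the
        metadata a namenode would hand back.
    datanodes: name -> ToyDatanode for the nodes that are currently reachable.
    """
    out = bytearray()
    for block_id, holders in block_locations:
        for name in holders:
            node = datanodes.get(name)
            if node is not None and block_id in node.blocks:
                out += node.read_block(block_id)  # data flows from the datanode
                break
        else:
            raise IOError(f"no reachable replica for {block_id}")
    return bytes(out)


if __name__ == "__main__":
    live = {"dn2": ToyDatanode("dn2", {"blk_1": b"hello ", "blk_2": b"world"})}
    locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]  # dn1 is down
    print(read_file(locations, live))
```

Note that the namenode never appears in `read_file`: it only supplied the locations, which keeps file data off the metadata server.
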
HDFS
Data flow: writes

• Client makes a request to write a new file via DistributedFileSystem

• Namenode creates a record of the new file

• Datanodes form a pipeline of writes: blocking operation

• Datanodes report block locations to the namenode

• Replica placement

• Rack diversity: same node as client -> off-rack -> same rack

HDFS
Replica placement
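The rack-diversity rule (same node as the client, then a node on a different rack, then another node on that second rack) can be sketched as follows. This is a deterministic simplification for illustration: the function `place_replicas` and the topology layout are hypothetical, the client is assumed to run on a cluster node, and a real placement policy would pick nodes at random and weigh load and free space.

```python
def place_replicas(client_node, topology):
    """Pick three nodes for a block's replicas.

    topology: rack name -> list of node names.
    Assumes at least two racks and at least two nodes on the remote rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # First replica: the node the client is running on.
    first = client_node
    local_rack = rack_of[first]

    # Second replica: a node on a different rack (first candidate, for determinism).
    remote_racks = [r for r in sorted(topology) if r != local_rack and len(topology[r]) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = remote_racks[0]
    second = topology[remote_rack][0]

    # Third replica: a different node on the same rack as the second.
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", topology))
```

With this layout only one replica transfer crosses racks, yet the block survives the loss of either rack.
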
HDFS
Coherency model

• A file is guaranteed to exist after create()

• A file’s content may not be visible even after the stream is flushed (via flush())

• A file’s content is guaranteed to be visible after hflush()

• File renames or directory renames are NOT atomic

HDFS
Demo

Exploring filesystem APIs

