DATA228 Lecture Notes Week 4
Sangjin Lee
Hadoop: distributed filesystems
& HDFS
“A distributed filesystem is a filesystem that enables clients to access file storage from multiple
hosts through a computer network as if the user was accessing local storage.”
What is a distributed filesystem?
Key phrases
• A filesystem
• Multiple hosts
• Data losses
Examples of distributed filesystems
• Clients of distributed filesystems can interact with them at the abstract level (via URIs)
• HDFS is only one implementation provided by Hadoop out of the box
• Examples
• file:// (local files), hdfs:// (HDFS), s3n:// (s3 “native”), gs:// (GCS), …
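As a rough illustration of addressing storage by URI, the sketch below maps a URI's scheme to a backend label. The BACKENDS table and the backend_for helper are hypothetical names made up for this example; they are not Hadoop APIs, and the host and path in the usage line are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical lookup table for this sketch: URI scheme -> backend label.
BACKENDS = {
    "file": "local filesystem",
    "hdfs": "HDFS",
    "s3n": "S3 (native)",
    "gs": "Google Cloud Storage",
}


def backend_for(uri: str) -> str:
    """Return the backend label for a URI, based only on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return BACKENDS[scheme]


if __name__ == "__main__":
    # Placeholder host and path, for illustration only.
    print(backend_for("hdfs://namenode:8020/data/a.txt"))
```

The point of the sketch is that client code only sees the URI; which storage system sits behind it is decided by the scheme.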
HDFS
Design
• Commodity hardware
• Highly resilient to individual node failures: multiple replicas, block repairs, rebalancing
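To make "multiple replicas" and "block repairs" concrete, here is a minimal sketch of how one might detect blocks that have fallen below their target replica count after a node failure. The under_replicated function, the block IDs, and the node names are hypothetical; a target of 3 is assumed because that is HDFS's usual default replication factor. This is a toy model, not how the namenode is implemented.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block_id: missing_replica_count} for blocks below target.

    block_locations: dict mapping block id -> list of nodes holding a replica
    live_nodes: set of nodes currently considered alive
    """
    missing = {}
    for block_id, nodes in block_locations.items():
        # Only replicas on live nodes count toward the target.
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            missing[block_id] = target - len(live)
    return missing


if __name__ == "__main__":
    locations = {
        "blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"],
    }
    # dn1 has failed, so blk_1 needs one more replica.
    print(under_replicated(locations, live_nodes={"dn2", "dn3", "dn4"}))
```

A repair step would then copy each listed block from a surviving replica to another live node until the target is met again.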
HDFS
What HDFS is NOT so good at
• Multiple writers
• Blocks
• Hadoop’s default block size: 128 MB (often much larger in real clusters)
• Compression: up to users
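The block-size figure above lends itself to a short worked calculation. The sketch below computes how many blocks a file of a given size is split into, assuming the 128 MB default; block_layout is a hypothetical helper name used only for this example.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size quoted in the notes


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (number_of_blocks, size_of_last_block_in_bytes)."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    if file_size == 0:
        return 0, 0
    # Ceiling division using integers only.
    n_blocks = (file_size + block_size - 1) // block_size
    last_block = file_size - (n_blocks - 1) * block_size
    return n_blocks, last_block


if __name__ == "__main__":
    n, last = block_layout(300 * MB)
    print(n, last // MB)  # 3 blocks; the last one holds 44 MB
```

A 300 MB file therefore occupies two full 128 MB blocks plus one 44 MB block; the final block only takes as much space as the data it holds.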
HDFS
Architecture
• Namenodes and datanodes
HDFS
Namenode
• Namenode data (the namespace image and the edit log) are written to disk in several locations
• Secondary namenode
• Gets periodic updates from the primary namenode and retains the state
• Failover via ZooKeeper
• Fencing
• Replica placement
• A file's content may not be visible even after the stream is flushed (via flush())
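The relationship between the namespace image and the edit log mentioned above can be pictured with a toy model: the image is a snapshot of the namespace, the edit log is the list of changes made since that snapshot, and replaying the log onto the image yields the current state. The sketch assumes a made-up in-memory format (a dict for the image, tuples for log entries) and a hypothetical apply_edits helper; the real on-disk formats and operation set differ.

```python
def apply_edits(image, edit_log):
    """Replay edit-log entries onto a copy of the namespace image.

    image: dict mapping path -> metadata dict (toy format for this sketch)
    edit_log: list of tuples such as ("create", path, replication),
              ("rename", old_path, new_path), ("delete", path)
    """
    merged = dict(image)  # leave the original image untouched
    for op, path, *args in edit_log:
        if op == "create":
            merged[path] = {"replication": args[0]}
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


if __name__ == "__main__":
    image = {"/a": {"replication": 3}}
    log = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
    print(apply_edits(image, log))
```

Periodically folding the log into a fresh image in this way keeps the log short, which is one way to read "gets periodic updates ... and retains the state" for the secondary namenode.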