1
Introduction to HDFS
By: Siddharth Mathur
Instructor: Dr. Shiyong Lu
2
Big Data
Wikipedia Definition:
In information technology, big data is a loosely defined
term used to describe data sets so large and complex that
they become awkward to work with using on-hand database
management tools.
3
How Big is Big Data?
2008: Google processed 20 PB a day
2009: Facebook had 2.5 PB user data + 15
TB/day
2009: eBay had 6.5 PB user data + 50 TB/day
2011: Yahoo! had 180-200 PB of data
2012: Facebook ingests 500 TB/day
4
HOW TO ANALYZE THIS DATA?
5
Divide and Conquer
Partition
Combine
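A minimal sketch of the partition/combine idea, assuming a word count as the task and a local thread pool standing in for cluster workers (the sample lines and function names are illustrative, not from the slides):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts roughly equal chunks."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Work done independently on one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partials):
    """Merge the per-partition results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data is big", "data is data", "hadoop stores big data"]
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, partition(lines, 3)))
word_counts = combine(partials)
print(word_counts["data"])
```

Each partition is processed without knowledge of the others, which is what makes the work easy to spread over many machines.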
6
But Parallel Processing is complicated
How do we assign tasks to workers?
What if we have more tasks than slots?
What happens when tasks fail?
How do you handle distributed synchronization?
7
The Solution!
Google File System
MapReduce
BigTable
8
GFS to HDFS
It started when Google researchers published a paper on a
distributed file system designed to address the storage and
analysis problems of Big Data.
The file system they proposed, the Google File System (GFS),
in turn gave birth to the Hadoop Distributed File System (HDFS).
The paper on MapReduce resulted in the Hadoop MapReduce
programming model.
The paper on BigTable produced Hadoop HBase, a column-oriented
data store built on top of HDFS.
9
HADOOP DISTRIBUTED FILE SYSTEM
10
Key Features
Accessible
Hadoop runs on large clusters of commodity machines or on
cloud computing services such as Amazon's Elastic Compute
Cloud (EC2).
Robust
As Hadoop is intended to run on commodity hardware, it is
architected with the assumption of frequent hardware
malfunctions. It can gracefully handle most such failures.
Scalable
Hadoop scales linearly to handle larger data by adding more
nodes to the cluster.
Simple
Hadoop allows users to quickly write efficient parallel code.
11
HDFS Scaling Out
Original cluster: performs a task in 45 minutes
Cluster with four times the nodes: performs the same task in ~45/4 (about 11) minutes
12
Basic Hadoop Stack
Hadoop Distributed File System
MapReduce
HBase
Higher Level Languages
13
Hadoop Platforms
Platforms: Unix and Windows.
Linux: the only supported production platform.
Other variants of Unix, like Mac OS X: can run Hadoop for
development.
Windows + Cygwin: development platform (with openssh)
Java 6
Java 1.6.x (aka 6.0.x aka 6) is recommended for
running Hadoop.
14
Hadoop Modes
• Standalone (or local) mode
– There are no daemons running and everything runs in
a single JVM. Standalone mode is suitable for running
MapReduce programs during development, since it is
easy to test and debug them.
• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.
• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
15
Master-Slave Architecture
Master: NameNode, JobTracker
Slave: DataNode, TaskTracker
Secondary NameNode
16
Master-Slave Architecture
HDFS has a master-slave architecture.
The master node, or name node, governs the cluster.
It takes care of task and resource allocation.
It stores all the metadata related to file splitting, block
storage, block replication and task execution status.
The slave nodes, or data nodes, store all the data blocks
and perform task execution.
TaskTracker is the program that runs on each individual
data node and monitors task execution on that node.
JobTracker runs on the name node and monitors the
complete job execution.
17
HDFS File Distribution
File metadata
FILE-A -> 1,2,3 (split into 3 blocks)
FILE-B -> 4,5 (split into 2 blocks)
Replication factor = 3, set by "dfs.replication" in hdfs-site.xml
[Diagram: blocks 1 to 5 distributed across the data nodes, three copies of each block]
18
HDFS File Distribution
Name node stores metadata related to:
File split
Block allocation
Task allocation
Each file is split into data blocks. The default block size
is 64 MB.
Each data block is replicated on different data nodes.
The replication factor is configurable; the default value is 3.
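The arithmetic behind these defaults can be sketched as follows. The 200 MB file size is a hypothetical example; the block size and replication factor are the defaults stated on the slide.

```python
import math

BLOCK_SIZE_MB = 64      # default block size stated on the slide
REPLICATION_FACTOR = 3  # default replication factor stated on the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Total space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


size_mb = 200  # hypothetical file size
print(block_count(size_mb))     # 4 blocks: 64 + 64 + 64 + 8 MB
print(raw_storage_mb(size_mb))  # 600 MB of raw storage
```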
19
Block Placement
Current Strategy
-- One replica on local node
-- Second replica on a remote rack
-- Third replica on same remote rack
-- Additional replicas are randomly placed
Clients read from nearest replica
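The strategy above can be sketched as a small function. This is an illustrative model, not the HDFS implementation: the topology, node names and the `place_replicas` helper are assumptions made for the example.

```python
import random

# Hypothetical topology: rack name -> data nodes on that rack.
TOPOLOGY = {
    "rack1": ["dn1", "dn2", "dn3", "dn4"],
    "rack2": ["dn5", "dn6", "dn7", "dn8"],
    "rack3": ["dn9", "dn10", "dn11", "dn12"],
}


def rack_of(node):
    """Look up which rack a data node belongs to."""
    for rack, nodes in TOPOLOGY.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(local_node, replication=3, rng=random):
    """Pick replica targets following the strategy listed on the slide."""
    targets = [local_node]                        # 1st replica: local node
    if replication >= 2:
        remote_racks = [r for r in TOPOLOGY if r != rack_of(local_node)]
        remote = rng.choice(remote_racks)
        second = rng.choice(TOPOLOGY[remote])
        targets.append(second)                    # 2nd replica: a remote rack
    if replication >= 3:
        third = rng.choice([n for n in TOPOLOGY[remote] if n != second])
        targets.append(third)                     # 3rd replica: same remote rack
    # Any additional replicas go to random nodes not already chosen.
    all_nodes = [n for nodes in TOPOLOGY.values() for n in nodes]
    while len(targets) < replication:
        targets.append(rng.choice([n for n in all_nodes if n not in targets]))
    return targets


print(place_replicas("dn1"))
```

Placing two of the three replicas on one remote rack keeps cross-rack traffic low while still surviving the loss of either rack.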
20
Rack awareness
[Diagram: three racks of data nodes, each behind its own switch, plus the NameNode]
Rack 1 = DN 1, 2, 3, 4
Rack 2 = DN 5, 6, 7, 8
Rack 3 = DN 9, 10, 11, 12
NameNode metadata: File X = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11
FILE X is split into data block A and data block B
21
Rack awareness
HDFS is aware of the placement of each data node on the racks.
To prevent data loss due to a complete rack failure, Hadoop
intelligently replicates each data block onto other racks as well.
This helps HDFS recover the data even if a complete rack of
data nodes shuts down.
This information is stored in the name node.
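To see why spreading replicas over racks protects the data, the sketch below takes the rack layout and block locations from the rack-awareness diagram (Rack 1 = DN 1 to 4, Rack 2 = DN 5 to 8, Rack 3 = DN 9 to 12; block A on DN 1, 5, 6; block B on DN 7, 10, 11) and checks what survives when one whole rack fails. The function name is an assumption made for the example.

```python
# Topology and block locations copied from the rack-awareness diagram.
RACK_LAYOUT = {1: [1, 2, 3, 4], 2: [5, 6, 7, 8], 3: [9, 10, 11, 12]}
BLOCK_LOCATIONS = {"A": [1, 5, 6], "B": [7, 10, 11]}


def surviving_replicas(failed_rack):
    """For each block, list the replicas still reachable after a rack fails."""
    lost = set(RACK_LAYOUT[failed_rack])
    return {
        block: [dn for dn in nodes if dn not in lost]
        for block, nodes in BLOCK_LOCATIONS.items()
    }


for rack in RACK_LAYOUT:
    print(rack, surviving_replicas(rack))
```

In every case each block keeps at least one replica, so the file remains readable.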
22
File Write in Hadoop
[Diagram: a client, the NameNode, and three racks of data nodes (Rack 1 = DN 1 to 4, Rack 2 = DN 5 to 8, Rack 3 = DN 9 to 12), each behind a switch]
Client: File.txt is broken down into blocks [A, B, C] using the Hadoop client API
NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in…..
Intelligent storage of data: first block in one rack, next blocks in a different rack
Arrows: Heartbeat, Request, Response, MetaData Creation, Block A Write
23
File Write in Hadoop
The HDFS client requests the name node to write a file
onto HDFS.
It also provides the file size and other metadata to the
name node.
Meanwhile, each slave node sends a heartbeat signal to the
name node, reporting its status.
24
File Write in Hadoop
The name node tells the client where to store the data
blocks.
It also tells the data nodes to get ready for the data
write.
After the data write is complete, the data node sends a
success message to both the client and the name node.
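The write steps can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the `NameNode` class, the round-robin choice of targets, and the dictionaries standing in for data nodes. It ignores rack awareness and how data actually travels between nodes.

```python
class NameNode:
    """Toy name node: hands out target data nodes and records block locations."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes  # node name -> dict acting as its storage
        self.replication = replication
        self.block_map = {}           # (file name, block index) -> node names
        self._next = 0

    def allocate(self, file_name, block_index):
        """Tell the client where to store a block (round-robin, for simplicity)."""
        names = list(self.data_nodes)
        targets = [names[(self._next + i) % len(names)]
                   for i in range(self.replication)]
        self._next += 1
        return targets

    def report_success(self, file_name, block_index, node_name):
        """A data node confirms that it stored the block."""
        self.block_map.setdefault((file_name, block_index), []).append(node_name)


def write_file(name_node, file_name, data, block_size):
    """Client side: split the data, ask for targets, write, collect acks."""
    acks = []
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        for node_name in name_node.allocate(file_name, index):
            name_node.data_nodes[node_name][(file_name, index)] = block
            name_node.report_success(file_name, index, node_name)  # ack to name node
            acks.append((index, node_name))                        # ack to client
    return acks


nn = NameNode({f"dn{i}": {} for i in range(1, 5)})
acks = write_file(nn, "File.txt", b"abcdefghij", block_size=4)
print(nn.block_map)
```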
25
File Read in Hadoop
[Diagram: a client, the NameNode, and three racks of data nodes (Rack 1 = DN 1 to 4, Rack 2 = DN 5 to 8, Rack 3 = DN 9 to 12), each behind a switch]
NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in…..
Annotation: an ordered list of nodes.
Arrows: Heartbeat, Request, Response
26
Re-replicating missing replicas
27
Re-replication
Missing Heartbeats signify lost Nodes
Name Node consults metadata, finds affected
data
Name Node consults Rack Awareness script
Name Node tells the Data node to re-replicate
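A rough sketch of these steps, assuming a fixed heartbeat timeout and a simple "first live holder copies to the first free node" rule. The timeout value, node names and helper functions are hypothetical, and the rack-awareness lookup is left out for brevity.

```python
HEARTBEAT_TIMEOUT = 30  # seconds; hypothetical value for illustration


def dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Nodes whose last heartbeat is older than the timeout are treated as lost."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_re_replication(block_map, lost, live_nodes, replication=3):
    """Return (block, source, target) copies needed to restore the replica count."""
    plan = []
    for block, holders in block_map.items():
        alive = [n for n in holders if n not in lost]
        candidates = [n for n in live_nodes if n not in alive]
        missing = replication - len(alive)
        if not alive or missing <= 0:
            continue  # nothing to copy from, or already fully replicated
        for target in candidates[:missing]:
            plan.append((block, alive[0], target))
    return plan


last_heartbeat = {"dn1": 100, "dn5": 98, "dn6": 40, "dn7": 99}
block_map = {"A": ["dn1", "dn5", "dn6"]}
lost = dead_nodes(last_heartbeat, now=101)
live = sorted(set(last_heartbeat) - lost)
print(lost, plan_re_replication(block_map, lost, live))
```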
28
3 main configuration files
core-site.xml
Contains configuration information that overrides the
default core Hadoop properties
mapred-site.xml
Contains configuration information that overrides the
default MapReduce properties
Also defines the host and port that the MapReduce job
tracker runs at
hdfs-site.xml
Mainly used to set the block replication factor
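For illustration, the sketch below shows what a minimal hdfs-site.xml setting "dfs.replication" might look like and reads it back with Python's standard XML parser. The file content is an example, not taken from the slides, and `read_properties` is a helper invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Example hdfs-site.xml content; the value 3 mirrors the default on the slides.
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the name -> value pairs from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = read_properties(HDFS_SITE)
print(props["dfs.replication"])
```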
29
Anatomy of a Job Launch
30
Job Status updates
31
Limitations of Hadoop -1
Scalability
Maximum Cluster size – 4,000 nodes for best
performance
Maximum Concurrent tasks- 40,000
Name Node as a single point of failure
Failure kills all running and queued jobs
Jobs need to be re-submitted by the user
Restartability
Restart is very tricky due to complex state
32
Who has the biggest cluster setups
Facebook 400
Microsoft 400
LinkedIn 4100
Yahoo 42,000
33
References
http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
http://research.google.com/archive/bigtable.html
http://hbase.apache.org/
http://wiki.apache.org/hadoop/FAQ
http://matt-wand.utsacademics.info/webUTSdiscns/HadoopNotes.pdf
34
THANK YOU
Introduction to HDFS

  • 1. 1 Introduction to HDFS By: Siddharth Mathur Instructor: Dr. Shiyong Lu
  • 2. 2 Big Data Wikipedia Definition: In information technology, big data is a loosely- defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
  • 3. 3 How Big is Big Data? 2008: Google processed 20 PB a day 2009: Facebook had 2.5 PB user data + 15 TB/day 2009: eBay had 6.5 PB user data + 50 TB/day 2011: Yahoo! had 180-200 PB of data 2012: Facebook ingests 500 TB/day
  • 4. 4 HOW TO ANALYZE THIS DATA?
  • 6. 6 But Parallel Processing is complicated How do we assign tasks to workers? What if we have more tasks than slots? What happens when tasks fail? How do you handle distributed synchronization?
  • 8. 8 GFS to HDFS It started when google researchers wrote a paper on a distributed file system to resolve storage and analysis issues of Big Data The researchers proposed a file system named Google File System which in turn, gave birth to Hadoop Distributed File System (HDFS) The paper on MapReduce resulted in MapReduce programming structure The paper on BigTable produced Hadoop Hbase, Data warehouse schema over HDFS
  • 10. 10 Key Features Accesible Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2). Robust As Hadoop is intended to run on commodity hardware, It is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures. Scalable Hadoop scales linearly to handle larger data by adding more nodes to the cluster. Simple Hadoop allows users to quickly write efficient parallel code.
  • 11. 11 HDFS Scaling Out Performs a task in 45 minutes Performs a task in ~ 45/4 minutes
  • 12. 12 Basic Hadoop Stack Hadoop Distributed File System MapReduce Hbase Higher Level Languages
  • 13. 13 Hadoop Platforms Platforms: Unix and on Windows. Linux: the only supported production platform. Other variants of Unix, like Mac OS X: run Hadoop for development. Windows + Cygwin: development platform (openssh) Java 6 Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop.
  • 14. 14 Hadoop Modes • Standalone (or local) mode – There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them. • Pseudo-distributed mode – The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale. • Fully distributed mode – The Hadoop daemons run on a cluster of machines.
  • 16. 16 Master-Slave Architecture HDFS has a master-slave architecture. The master node or the name node governs the cluster. It takes care of tasks and resource allocation. It stores all the metadata related to file breakage, block storage, block replication and task execution status. The slave nodes or the data nodes are the one which stores all the data blocks and perform task executions Tasktracker is the program which runs on each individual data node and monitors the task execution over each node. Jobtracker runs on name node and monitors the complete job execution.
  • 17. 17 HDFS File Distribution File metadata: FILE-A -> 1, 2, 3 (split into 3 blocks); FILE-B -> 4, 5 (split into 2 blocks). Replication factor = 3, set by "dfs.replication" in hdfs-site.xml. [Diagram: the replicas of blocks 1 to 5 spread across the data nodes.]
  • 18. 18 HDFS File Distribution The name node stores metadata related to: file split, block allocation, task allocation. Each file is split into data blocks. The default block size is 64 MB. Each data block is replicated on different data nodes. The replication factor is configurable. The default value is 3.
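The block arithmetic on this slide can be sketched as follows. This is a minimal illustration, not Hadoop code: the function name `block_layout` and the 200 MB example file are assumptions made here, while the 64 MB block size and replication factor of 3 are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64      # default block size quoted on the slide
REPLICATION_FACTOR = 3  # default replication factor quoted on the slide


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, size of the last block in MB, raw storage in MB).

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block holds only the remainder, not a full block.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the cluster.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


if __name__ == "__main__":
    # A hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
    print(block_layout(200))
```

Under these assumptions a 200 MB file occupies four blocks and 600 MB of raw cluster storage once all three replicas are written.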
  • 19. 19 Block Placement Current strategy: one replica on the local node; second replica on a remote rack; third replica on the same remote rack; additional replicas are randomly placed. Clients read from the nearest replica.
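The placement strategy above can be sketched in a few lines. This is a simplified model, not the actual HDFS placement policy: `place_replicas` and `RACKS` are names invented here, the rack layout is borrowed from the rack awareness slide, and the sketch assumes at least one remote rack with two or more nodes.

```python
import random

# Rack layout taken from the rack awareness slide (12 data nodes, 3 racks).
RACKS = {
    "rack1": [1, 2, 3, 4],
    "rack2": [5, 6, 7, 8],
    "rack3": [9, 10, 11, 12],
}


def place_replicas(writer_node, racks, replication=3, rng=None):
    """Pick data nodes for one block, following the strategy on the slide."""
    rng = rng or random.Random()
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    if writer_node not in node_rack:
        raise ValueError("writer node is not in any rack")

    # 1. First replica on the local (writer) node.
    chosen = [writer_node]
    if replication == 1:
        return chosen

    # 2. Second replica on a node in a remote rack.
    local_rack = node_rack[writer_node]
    remote_racks = [r for r in racks if r != local_rack and racks[r]]
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(racks[remote_rack])
    chosen.append(second)

    # 3. Third replica on a different node of the same remote rack.
    if replication >= 3:
        candidates = [n for n in racks[remote_rack] if n != second]
        chosen.append(rng.choice(candidates))

    # 4. Any additional replicas go on randomly chosen unused nodes.
    remaining = [n for n in node_rack if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[:max(0, replication - 3)])
    return chosen


if __name__ == "__main__":
    print(place_replicas(writer_node=1, racks=RACKS))
```

Keeping the second and third replicas in one remote rack means a block write crosses the rack switch only once, while the data still survives the loss of either rack.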
  • 20. 20 Rack awareness [Diagram: FILE X is split into data block A and data block B; data nodes DN 1 to DN 12 are grouped into three racks, each behind its own switch.] NameNode metadata: File X = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11. Rack 1 = DN 1, 2, 3, 4; Rack 2 = DN 5, 6, 7, 8; Rack 3 = DN 9, 10, 11, 12.
  • 21. 21 Rack awareness HDFS is aware of the rack on which each data node is placed. To prevent data loss due to a complete rack failure, Hadoop replicates each data block onto other racks as well. This helps HDFS recover the data even if a complete rack of data nodes shuts down. The rack information is stored in the name node.
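Whether a given layout really survives a full rack failure can be checked mechanically. The sketch below uses the block locations and rack membership from the rack awareness diagram; the function names are hypothetical and the check is an illustration, not part of Hadoop.

```python
# Metadata as shown on the rack awareness diagram.
RACKS = {
    "rack1": [1, 2, 3, 4],
    "rack2": [5, 6, 7, 8],
    "rack3": [9, 10, 11, 12],
}
BLOCKS = {"A": [1, 5, 6], "B": [7, 10, 11]}


def surviving_replicas(blocks, racks, failed_rack):
    """Return the replicas of each block that remain after one rack fails."""
    lost = set(racks[failed_rack])
    return {blk: [dn for dn in dns if dn not in lost]
            for blk, dns in blocks.items()}


def survives_any_single_rack_failure(blocks, racks):
    """True if every block keeps at least one replica whichever rack fails."""
    return all(
        all(surviving_replicas(blocks, racks, rack).values())
        for rack in racks
    )


if __name__ == "__main__":
    for rack in RACKS:
        print(rack, "fails ->", surviving_replicas(BLOCKS, RACKS, rack))
    print(survives_any_single_rack_failure(BLOCKS, RACKS))
```

For the layout on the slide, losing Rack 2 leaves block A with a single replica on DN 1, which is why the name node would then schedule new copies.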
  • 22. 22 File Write in Hadoop [Diagram: a client holding File.txt, broken down into blocks A, B, C using the Hadoop client API; the NameNode; and data nodes DN 1 to DN 12 in Rack 1, Rack 2 and Rack 3, each behind a switch. Arrows are labelled Heartbeat, Request, Response, MetaData Creation and Block A Write.] NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in….. Intelligent storage of data: first block in one rack, next blocks in a different rack.
  • 23. 23 File Write in Hadoop The HDFS client system asks the name node to write a file to HDFS. It also provides the file size and other metadata to the name node. Meanwhile, each slave node sends a heartbeat signal to the name node to report its status.
  • 24. 24 File Write in Hadoop The name node tells the client system where to store the data blocks. It also tells the data nodes to get ready for the data write. After the data write is complete, the data node sends a success message to both the client and the name node.
  • 25. 25 File Read in Hadoop [Diagram: a client; the NameNode; and data nodes DN 1 to DN 12 in Rack 1, Rack 2 and Rack 3, each behind a switch. Arrows are labelled Heartbeat, Request and Response; the response is an ordered list of nodes.] NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in…..
  • 27. 27 Re-replication Missing heartbeats signify lost nodes. The name node consults its metadata and finds the affected data. The name node consults the rack awareness script. The name node tells the data nodes to re-replicate.
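The four steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: the function names, the 30-second heartbeat timeout, and the rule of preferring racks that do not yet hold a replica (a stand-in for whatever the rack awareness script would decide). The block and rack data come from the rack awareness slide.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Step 1: nodes whose last heartbeat is older than the timeout are lost."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen > timeout_s}


def plan_re_replication(block_locations, node_rack, dead_nodes, target=3):
    """Steps 2 to 4: find under-replicated blocks and pick new target nodes.

    Returns {block: [nodes that should receive a new copy]}.
    """
    live_nodes = [n for n in node_rack if n not in dead_nodes]
    plan = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead_nodes]
        missing = target - len(live)
        if missing <= 0 or not live:
            continue  # fully replicated, or no live copy left to copy from
        used_racks = {node_rack[n] for n in live}
        candidates = [n for n in live_nodes if n not in live]
        # Assumed preference: racks without a replica first, then node id.
        candidates.sort(key=lambda n: (node_rack[n] in used_racks, n))
        plan[block] = candidates[:missing]
    return plan


if __name__ == "__main__":
    node_rack = {n: "rack%d" % ((n - 1) // 4 + 1) for n in range(1, 13)}
    blocks = {"A": [1, 5, 6], "B": [7, 10, 11]}
    # Hypothetical timestamps: every node reported at t=100 except DN 5.
    heartbeats = {n: 100 for n in node_rack}
    heartbeats[5] = 0
    dead = find_dead_nodes(heartbeats, now=100, timeout_s=30)
    print(dead, plan_re_replication(blocks, node_rack, dead))
```

In this run only DN 5 is declared lost, block A drops to two replicas, and the plan asks for one new copy on a Rack 3 node because Racks 1 and 2 already hold the surviving replicas.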
  • 28. 28 3 main configuration files core-site.xml: contains configuration information that overrides the default core Hadoop properties. mapred-site.xml: contains configuration information that overrides the default MapReduce properties; also defines the host and port that the MapReduce job tracker runs at. hdfs-site.xml: mainly used to set the block replication factor.
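Each of these files is an XML document with a `configuration` root holding `property` entries made of a `name` and a `value`. The sketch below generates that structure with Python's standard library. The helper `build_site_xml` is hypothetical, and the job tracker address `localhost:9001` is a placeholder chosen for illustration; `dfs.replication` is the property named on the file distribution slide, and `mapred.job.tracker` is the Hadoop 1.x property for the job tracker host and port.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# hdfs-site.xml: set the block replication factor.
hdfs_site = build_site_xml({"dfs.replication": 3})

# mapred-site.xml: host and port of the job tracker (placeholder address).
mapred_site = build_site_xml({"mapred.job.tracker": "localhost:9001"})

if __name__ == "__main__":
    print(hdfs_site)
    print(mapred_site)
```

Writing the returned strings to `hdfs-site.xml` and `mapred-site.xml` in the Hadoop configuration directory would override the corresponding defaults.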
  • 29. 29 Anatomy of a Job Launch
  • 31. 31 Limitations of Hadoop -1 Scalability: maximum cluster size of 4,000 nodes for best performance; maximum of 40,000 concurrent tasks. Name node as a single point of failure: a failure kills all running and queued jobs; jobs need to be re-submitted by the user. Restartability: restart is very tricky due to complex state.
  • 32. 32 Who has the biggest cluster setups Facebook: 400; Microsoft: 400; LinkedIn: 4,100; Yahoo: 42,000