Hadoop Architecture - Hadoop Distributed File System (HDFS)-2
[Figure: file blocks B1, B2, …, Bn]
Data replication:
Helps to handle hardware failures.
Spreads the data by placing copies of the same piece of data on different nodes.
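The placement idea above can be sketched in a few lines. This is a toy model, not HDFS's actual placement policy: it ignores racks, and the node names and the `place_replicas` helper are hypothetical.

```python
import random

def place_replicas(blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (no rack awareness)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # rng.sample never picks the same node twice for one block.
    return {block: rng.sample(nodes, replication) for block in blocks}

# Hypothetical cluster of five DataNodes holding three blocks.
placement = place_replicas(["B1", "B2", "B3"], ["dn1", "dn2", "dn3", "dn4", "dn5"])
for block, holders in placement.items():
    print(block, "->", holders)
```

Because every copy of a block sits on a different node, losing one node never removes the only copy.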
Basic Functions:
Manage the storage on the DataNode.
Serve read and write requests from the clients.
Block creation, deletion, and replication are all performed on instructions from
the NameNode.
Benefits:
Increase namespace scalability
Performance
Isolation
Heterogeneous Storage and Archival Storage
Storage types: ARCHIVE, DISK, SSD, RAM_DISK
So, if you remember the original design, you have one namespace and a bunch of
DataNodes. Here the structure looks similar.
You have a bunch of NameNodes instead of one NameNode. Each of those
NameNodes essentially writes into its own block pool, but the pools are spread out
over the DataNodes just like before. This is where the data is spread out. You can
gloss over the different DataNodes; the block pool is essentially the main thing
that's different.
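A small model may make the block-pool layout easier to picture. It is only a sketch: the class names, pool identifiers, and round-robin placement are assumptions for illustration, not HDFS code.

```python
from collections import defaultdict

class DataNode:
    def __init__(self, name):
        self.name = name
        # One block store per block pool: pool id -> set of block ids.
        self.pools = defaultdict(set)

class NameNode:
    """Owns one namespace and writes only into its own block pool."""
    def __init__(self, namespace, pool_id, datanodes):
        self.namespace = namespace
        self.pool_id = pool_id
        self.datanodes = datanodes  # shared by all NameNodes
        self.next_dn = 0

    def add_block(self, block_id):
        # Round-robin placement, a stand-in for the real placement policy.
        dn = self.datanodes[self.next_dn % len(self.datanodes)]
        self.next_dn += 1
        dn.pools[self.pool_id].add(block_id)
        return dn.name

datanodes = [DataNode(f"dn{i}") for i in range(1, 4)]
nn_user = NameNode("/user", "BP-1", datanodes)  # hypothetical namespaces
nn_logs = NameNode("/logs", "BP-2", datanodes)
for i in range(3):
    nn_user.add_block(f"user-blk-{i}")
    nn_logs.add_block(f"logs-blk-{i}")

for dn in datanodes:
    print(dn.name, {pool: sorted(blocks) for pool, blocks in dn.pools.items()})
```

Every DataNode ends up holding blocks from both pools, while each NameNode only ever touches its own pool.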
Big Data Computing Vu Pham Big Data Enabling Technologies
HDFS Performance Measures
The other impact of having many small files is on the map tasks: each time
they spin up and spin down, there is latency involved, because you
are starting up Java processes and stopping them.
Solution:
Merge/Concatenate files
Sequence files
HBase, Hive configuration
CombineFileInputFormat
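A rough count shows why merging helps. The sketch assumes a 128 MB block size, one map task per block-sized split, and splits that never span files; the file sizes are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size

def map_tasks(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """One map task per block-sized split; every file needs at least one."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

small_files = [1] * 10_000    # ten thousand 1 MB files
merged = [sum(small_files)]   # the same data as one 10,000 MB file

print(map_tasks(small_files))  # 10000 task start-ups
print(map_tasks(merged))       # 79 task start-ups
```

The data volume is identical in both cases; only the number of process start-ups and shut-downs changes.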
Tuning parameters
Default replication is 3.
Parameter: dfs.replication
Tradeoffs:
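One tradeoff that can be put into numbers is storage cost against failure tolerance. The sketch below assumes each replica of a block lives on a different node; the 100 TB figure and the helper name are hypothetical.

```python
def replication_tradeoff(logical_tb, replication):
    """Return (raw storage in TB, node failures a block can survive)."""
    raw_tb = logical_tb * replication  # every block is stored `replication` times
    tolerated = replication - 1        # at least one copy must remain
    return raw_tb, tolerated

for r in (1, 2, 3):
    raw, tolerated = replication_tradeoff(100, r)
    print(f"dfs.replication={r}: {raw} TB raw, survives {tolerated} node failure(s)")
```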
Examples:
dfs.datanode.handler.count (default 10): Sets the number of server
threads on each DataNode.
dfs.namenode.fs-limits.max-blocks-per-file: Maximum number
of blocks per file.
Full List:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Mark the data. Any new I/O that comes up is not going to be sent to
that data node. Also remember that the NameNode has all the replication
information for the files on the file system. So, if a DataNode fails, it
knows which blocks will fall below the replication factor.
Now, this replication factor is set for the entire system, and you can also
set it for a particular file when you're writing the file. Either way, the
NameNode knows which blocks fall below the replication factor, and it will
start the process to re-replicate them.
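The bookkeeping described above can be sketched as follows. This is a simplified model, not NameNode code: the block map, node names, and `under_replicated` helper are hypothetical, and per-file targets are keyed by block for brevity.

```python
def under_replicated(block_map, live_nodes, default_replication=3, per_block=None):
    """Return {block: number of copies still needed} after node failures."""
    per_block = per_block or {}
    missing = {}
    for block, holders in block_map.items():
        live_copies = [node for node in holders if node in live_nodes]
        # A per-file setting overrides the system-wide default.
        target = per_block.get(block, default_replication)
        if len(live_copies) < target:
            missing[block] = target - len(live_copies)
    return missing

block_map = {
    "B1": ["dn1", "dn2", "dn3"],
    "B2": ["dn2", "dn3", "dn4"],
    "B3": ["dn1", "dn4"],          # written with a replication factor of 2
}
live_nodes = {"dn1", "dn3", "dn4"}  # dn2 has failed

print(under_replicated(block_map, live_nodes, per_block={"B3": 2}))
# {'B1': 1, 'B2': 1}
```

B3 is untouched by the failure, so only B1 and B2 are queued for one extra copy each.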