Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.
Hadoop itself is an open source distributed processing framework that manages data
processing and storage for big data applications. HDFS is a key component of the Hadoop
ecosystem: it provides a reliable means of managing pools of big data and supports related
big data analytics applications.
HDFS follows a write-once, read-many model: data is written to the cluster once and then
read and reused numerous times. A primary NameNode keeps track of where file data is
stored in the cluster.
HDFS also has multiple DataNodes running on a commodity hardware cluster -- typically
one per node in the cluster. The DataNodes are organized into racks within the data
center. Data is broken down into separate blocks and distributed among the various
DataNodes for storage. Blocks are also replicated across nodes, both to guard against
failure and to enable efficient parallel processing.
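The block mechanics described above can be sketched in a few lines of Python. This is an illustrative model, not the real Hadoop API: the names `split_into_blocks`, `assign_blocks` and the round-robin placement are simplifications for clarity (actual placement is rack-aware, as discussed later).

```python
# Conceptual sketch of HDFS block splitting and distribution.
# All names here are illustrative, not part of the Hadoop API.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) block descriptors for a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_blocks(blocks, datanodes):
    """Assign each block to a DataNode round-robin (simplified placement)."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

# A 300 MB file yields three blocks: 128 MB, 128 MB and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Splitting files into fixed-size blocks is what lets HDFS spread a single large file across many machines and read its pieces in parallel.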
The NameNode knows which DataNode contains which blocks and where the DataNodes
reside within the machine cluster. The NameNode also manages access to the files, including
reads, writes, creates, deletes and data block replication across the DataNodes.
The NameNode operates in conjunction with the DataNodes. As a result, the cluster can
dynamically adapt to server capacity demand in real time by adding or subtracting nodes as
necessary.
The DataNodes are in constant communication with the NameNode, sending regular
heartbeats and reporting on the blocks they hold. Consequently, the NameNode is always
aware of the status of each DataNode. If the NameNode detects that a DataNode isn't
working properly, it can immediately reassign that DataNode's work to a different node
containing the same data blocks. DataNodes also communicate with each other, which
enables them to cooperate during normal file operations.
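The heartbeat bookkeeping described above can be sketched as follows. This is a hedged, minimal model: the class, timeout value and method names are hypothetical, and real HDFS failure detection involves more state (block reports, stale-node handling) than shown here.

```python
# Illustrative sketch of NameNode heartbeat tracking: a DataNode that
# misses heartbeats past a timeout is treated as dead, and its blocks
# are scheduled for re-replication. Names and values are hypothetical.

HEARTBEAT_TIMEOUT = 600  # seconds; real HDFS defaults differ

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}   # datanode -> last heartbeat timestamp
        self.blocks_on = {}        # datanode -> set of block IDs it holds

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def blocks_to_rereplicate(self, now):
        """Blocks whose host DataNode has stopped heartbeating."""
        lost = set()
        for dn in self.dead_nodes(now):
            lost |= self.blocks_on.get(dn, set())
        return lost
```

Because every block is replicated, losing one DataNode only means making fresh copies of its blocks from the surviving replicas.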
Moreover, HDFS is designed to be highly fault-tolerant. The file system replicates -- or
copies -- each piece of data multiple times and distributes the copies to individual nodes,
placing at least one copy on a different server rack from the others.
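The rack-aware placement just described follows a well-known default policy in HDFS: with three replicas, the first goes on the writer's node, and the second and third go on two different nodes in a single remote rack. The sketch below models that policy; the `topology` layout and function names are hypothetical, not Hadoop API.

```python
# Sketch of HDFS's default rack-aware replica placement for 3 replicas:
# 1st replica on the writer's node, 2nd on a node in a different rack,
# 3rd on another node in that same remote rack. Names are illustrative.
import random

def place_replicas(writer_node, topology):
    """topology: dict mapping rack name -> list of node names."""
    writer_rack = next(r for r, nodes in topology.items()
                       if writer_node in nodes)
    other_racks = [r for r in topology if r != writer_rack]
    remote_rack = random.choice(other_racks)          # pick one remote rack
    second, third = random.sample(topology[remote_rack], 2)
    return [writer_node, second, third]
```

Keeping two replicas in one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.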
HDFS exposes a file system namespace and enables user data to be stored in files. A file is
split into one or more of the blocks that are stored in a set of DataNodes. The NameNode
performs file system namespace operations, including opening, closing and renaming files
and directories. The NameNode also governs the mapping of blocks to the DataNodes. The
DataNodes serve read and write requests from the clients of the file system. In addition, they
perform block creation, deletion and replication when the NameNode instructs them to do so.
HDFS supports a traditional hierarchical file organization. An application or user can create
directories and then store files inside these directories. The file system namespace hierarchy
is like most other file systems -- a user can create, remove, rename or move files from one
directory to another.
The NameNode records any change to the file system namespace or its properties. An
application can stipulate the number of replicas of a file that HDFS should maintain. The
NameNode records the number of copies of each file; this is called the file's replication factor.
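A minimal sketch of this per-file metadata follows. The `Namespace` class and its methods are hypothetical stand-ins for the NameNode's bookkeeping, shown only to make the idea of a per-file replication factor concrete; the real default of 3 replicas matches HDFS's `dfs.replication` default.

```python
# Illustrative model of per-file NameNode metadata: each path in the
# namespace carries its own replication factor. Hypothetical names.

class Namespace:
    DEFAULT_REPLICATION = 3  # HDFS default replication factor

    def __init__(self):
        self.files = {}  # path -> replication factor

    def create(self, path, replication=None):
        self.files[path] = replication or self.DEFAULT_REPLICATION

    def set_replication(self, path, replication):
        """An application can change a file's replication factor later."""
        self.files[path] = replication

    def rename(self, old, new):
        """Namespace operations such as rename preserve file properties."""
        self.files[new] = self.files.pop(old)
```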
Features of HDFS
There are several features that make HDFS particularly useful, including:
• Data replication. This is used to ensure that the data is always available and prevents data
loss. For example, when a node crashes or there is a hardware failure, replicated data can
be pulled from elsewhere within a cluster, so processing continues while data is
recovered.
• Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across
nodes in a large cluster ensures fault tolerance and reliability.
• Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. By
minimizing the distance between the data and the computing process, this approach
decreases network congestion and boosts a system's overall throughput.
• Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-shelf
hardware, which cuts storage costs. Also, because HDFS is open source, there's no
licensing fee.
• Large data set storage. HDFS stores data of any size -- from megabytes to petabytes --
and in any format, including structured and unstructured data.
• Fast recovery from hardware failure. HDFS is designed to detect faults and automatically
recover on its own.
• Portability. HDFS is portable across hardware platforms, and it is compatible with
several operating systems, including Windows, Linux and macOS.
• Streaming data access. HDFS is built for high data throughput, which suits applications
that stream large data sets.
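The data locality feature listed above comes down to a simple scheduling preference: run the computation on a node that already holds a replica of the input block, and only fall back to a remote read when no such node is free. The sketch below illustrates that decision; the function and node names are hypothetical.

```python
# Sketch of data-locality scheduling: prefer a node that already holds
# a replica of the input block, avoiding a network transfer. Names are
# illustrative, not the Hadoop scheduler API.

def pick_node(block_replicas, available_nodes):
    """block_replicas: set of nodes holding the block.
    available_nodes: nodes with free capacity, in priority order."""
    for node in available_nodes:
        if node in block_replicas:
            return node            # node-local: data is already here
    return available_nodes[0]      # fallback: remote read over the network
```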
EBay, Facebook, LinkedIn and Twitter are among the companies that have used HDFS to
underpin large-scale big data analytics.
HDFS has found use beyond meeting ad serving and search engine requirements. The New
York Times used it as part of large-scale image conversions, Media6Degrees for log
processing and machine learning, LiveBet for log storage and odds analysis, Joost for session
analysis, and Fox Audience Network for log analysis and data mining. HDFS is also at the
core of many open source data lakes.
More generally, companies in several industries use HDFS to manage pools of big data,
including:
• Electric companies. The power industry deploys phasor measurement units (PMUs)
throughout its transmission networks to monitor the health of smart grids. These high-
speed sensors measure current and voltage by amplitude and phase at selected
transmission stations. These companies analyze PMU data to detect system faults in
network segments and adjust the grid accordingly. For instance, they might switch to a
backup power source or perform a load adjustment. PMU networks clock thousands of
records per second, and consequently, power companies can benefit from inexpensive,
highly available file systems, such as HDFS.
• Oil and gas providers. Oil and gas companies deal with very large data sets in a variety
of formats, including videos, 3D earth models and machine sensor data. An HDFS
cluster can provide a suitable platform for the big data analytics they need.
• Research. Analyzing data is a key part of research, so, here again, HDFS clusters
provide a cost-effective way to store, process and analyze large amounts of data.