CST322_Module4_Part3_Hadoop
• Apache Hadoop is an open source software framework used to develop data processing
applications which are executed in a distributed computing environment.
• Applications built using Hadoop run on large data sets distributed across clusters of commodity
computers. Commodity computers are cheap and widely available, and they make it possible to
achieve greater computational power at low cost.
• Just as data on a personal computer resides in the local file system, data in Hadoop resides in a
distributed file system called the Hadoop Distributed File System (HDFS).
• In the traditional approach, all data was stored in a single central database. With the rise of big
data, a single database is no longer enough for storage.
Apache Hadoop consists of two sub-projects –
• Hadoop MapReduce: MapReduce is a computational model and software framework for writing
applications which are run on Hadoop. These MapReduce programs are capable of processing
enormous data in parallel on large clusters of computation nodes.
• HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop
applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of
data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and
extremely rapid computations.
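As a quick illustration of how the two parts fit together, here is a minimal sketch that uses Hadoop's Java FileSystem API to write a small file into HDFS and read it back. It assumes a cluster address supplied by core-site.xml on the classpath; the class name and the path /user/demo/hello.txt are purely hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from the configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: HDFS splits the stream into blocks and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hadoop\n");
        }

        // Read: the NameNode supplies the block locations, the DataNodes serve the actual bytes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

A MapReduce job would consume files written this way as its input.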
DataNode
✔ DataNode is responsible for storing the actual data in HDFS.
✔ DataNode is also known as the Slave node.
✔ NameNode and DataNode are in constant communication.
✔ When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is
responsible for.
✔ When a DataNode goes down, it does not affect the availability of data or the cluster. The NameNode
will arrange for replication of the blocks managed by the DataNode that is not available.
✔ DataNode is usually configured with a lot of hard disk space, because the actual data in HDFS is
stored in the DataNodes.
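To make the NameNode/DataNode division concrete, the hedged sketch below asks the NameNode which DataNodes hold each block of a given file, using the getFileBlockLocations call of the FileSystem API. The file path comes from the command line; the class name is only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                    // e.g. a large file already in HDFS
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the DataNodes that hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}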
HDFS Data Blocks
2. It reduces the processing time and supports faster processing of data, because all the nodes work
on their own part of the data in parallel.
3. Developers can write MapReduce code in a range of languages such as Java, C++, and Python.
4. It is fault-tolerant, as it can fall back on the replicated copies of the blocks on other machines for
recovery.
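As a small, hedged illustration of the fault-tolerance point, the sketch below asks HDFS to keep three replicas of every block of a file, so that the NameNode can re-replicate from the surviving copies if a DataNode fails. The class name and file name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/important.csv");   // hypothetical file

        // Request three copies of every block of this file.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("replication request accepted: " + accepted);
    }
}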
MapReduce Data Flow
• The input data to be processed by the MapReduce task is stored in input files that
reside on HDFS.
• The input format defines the input specification and how the input files are split
and read.
• The record reader communicates with the input split and converts the data into
key-value pairs suitable for reading by the mapper (k, v).
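The input side of a job is configured in the driver. The hedged sketch below uses the standard TextInputFormat, whose RecordReader turns each line of the input files into a (byte offset, line) pair for the mapper; the class name, input directory, and split-size value are only examples, and a complete driver is sketched at the end of this list.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSideConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input side demo");
        job.setJarByClass(InputSideConfig.class);

        // Input files reside on HDFS; the input format defines how they are
        // split, and its RecordReader hands the mapper (offset, line) pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // hypothetical directory

        // Optionally cap the size of each input split (here 128 MB).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}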
• The mapper class processes input records from the RecordReader and generates
intermediate key-value pairs (k’, v’). The map logic is applied to the ‘n’ data blocks
spread across the various DataNodes. (WordCount-style sketches of the mapper, combiner,
partitioner, reducer and driver follow this list.)
• The combiner is a mini reducer that runs on the output of each mapper. It is used to
optimize the performance of MapReduce jobs by reducing the amount of data shuffled to the reducers.
• The partitioner decides which reducer each intermediate key-value pair coming out of the combiner is sent to.
• The output of the partitioner is shuffled and sorted: intermediate pairs with the same key
are brought together, and their values are grouped into a list per key. This grouped output is
fed as input to the reducer, which aggregates the list of values for each intermediate key
into the final key-value pairs.
• The record writer writes these output key-value pairs from the reducer to the
output files. The output data is stored on the HDFS.
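The classes below are a hedged, WordCount-style sketch of the mapper and reducer described above (the class names are chosen only for illustration). The mapper turns every line it receives from the RecordReader into intermediate (word, 1) pairs; the reducer sums the grouped values for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset from the RecordReader, value = one line of input
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit intermediate (k', v') pair
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values = the grouped list of counts for this intermediate key
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);            // final pair handed to the RecordWriter
    }
}

Because summing is associative, the same reducer class can also be registered as the combiner, which is exactly the “mini reducer” role described above.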
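A custom partitioner is usually unnecessary (by default Hadoop's HashPartitioner hashes the key), but as a sketch of the idea, the hypothetical class below routes words to reducers by their first character.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Assumes non-empty keys (true for words produced by the mapper above).
        char first = key.toString().charAt(0);
        return first % numReduceTasks;   // words with the same first letter go to the same reducer
    }
}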
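Finally, a hedged sketch of a driver that wires the whole flow together: input format and input path, mapper, combiner, partitioner, reducer, and the output format whose RecordWriter stores the results back on HDFS. Input and output paths are taken from the command line; everything else reuses the illustrative classes above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Input side: files on HDFS, split and read as (offset, line) pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Map -> combine (locally) -> partition -> shuffle/sort -> reduce.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);       // the "mini reducer"
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setReducerClass(WordCountReducer.class);

        // Output side: the RecordWriter of TextOutputFormat writes the final
        // key-value pairs to the output directory on HDFS.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}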