
Big Data Analytics

CST 322 MODULE 4 – PART 2
Slides Courtesy: Dr. Kesab Nath, Asst. Professor,
Indian Institute of Information Technology, Kottayam
[Fig: data volume growing from 1TB to 10TB to 100TB vs. a single relational database]

• Apache Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment.
• Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available; they are mainly useful for achieving greater computational power at low cost.
• Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System (HDFS). In the traditional approach, all data was stored in a single central database; with the rise of big data, a single database is no longer enough for storage.
Apache Hadoop consists of two sub-projects:
• Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of compute nodes.
• HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computation.



Data Distribution
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
• Highly fault-tolerant and designed to be deployed on low-cost hardware.
• Provides high-throughput access to application data and is suitable for applications that have large data sets.
• Error rectification: Hadoop replicates every piece of data stored in its nodes, so if a particular node fails and loses its data, other nodes hold copies to back it up. This prevents data loss and lets you work without worrying about it.
• Scaling: Hadoop provides secure options for scaling to more data. Its clusters can be scaled to a large extent by adding more cluster nodes; by adding nodes, you can easily enhance the capacity of your Hadoop system.
• HDFS splits large data files into fragments that are managed by different nodes of the cluster.
• Each fragment is replicated on multiple computers, so that a single machine failure will not make the data unavailable.
• Content is universally accessible through a single namespace.
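The block size and replication factor behind this behaviour are ordinary HDFS settings (dfs.blocksize and dfs.replication). The sketch below is not part of the original slides: it shows how a Java client might set these properties on a Configuration and then check how an existing file is actually stored. The class name and the command-line argument are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSettingsDemo {
    public static void main(String[] args) throws Exception {
        // Client-side settings; on a real cluster these normally live in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // keep three copies of every block
        conf.set("dfs.blocksize", "134217728");  // split files into 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(args[0]));  // an existing HDFS file

        // Report how the file is actually stored.
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
        fs.close();
    }
}
```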
HDFS Architecture

• HDFS has a master/slave architecture.


• An HDFS cluster consists of a single NameNode, a
master server that manages the file system
namespace and regulates access to files by clients.
• In addition, there are a number of DataNodes.
• HDFS exposes a file system namespace and allows
user data to be stored in files.
• Internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes.
• The NameNode determines the mapping of blocks to
DataNodes. The DataNodes are responsible for
serving read and write requests from the file
system’s clients.
• The DataNodes also perform block creation,
deletion, and replication upon instruction from the
NameNode.



NameNode and DataNodes
Namenode
✔ NameNode is also known as the Master.
✔ NameNode stores only the metadata of HDFS – the directory tree of all files in the file system – and tracks the files across the cluster.
✔ Metadata includes the filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
✔ NameNode does not store the actual data or the dataset; the data itself is stored in the DataNodes.
✔ NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, the NameNode knows how to construct the file from its blocks.
✔ NameNode is a single point of failure in a Hadoop cluster.

DataNode
✔ DataNode is responsible for storing the actual data in HDFS.
✔ DataNode is also known as the Slave.
✔ NameNode and DataNode are in constant communication.
✔ When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
✔ When a DataNode is down, it does not affect the availability of data or the cluster; the NameNode will arrange replication for the blocks managed by the DataNode that is not available.
✔ DataNode is usually configured with a lot of hard disk space, because the actual data is stored in the DataNodes.
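To make the NameNode/DataNode split concrete, here is a minimal sketch (not from the slides) that uses the standard HDFS Java client to ask where the blocks of a file live. The NameNode answers the metadata query; the hosts printed for each block are the DataNodes holding its replicas. The class name is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));  // path of a file in HDFS

        // The NameNode answers this query from its metadata: block offsets,
        // lengths, and the DataNodes holding each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```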
HDFS Data Blocks

[Fig: HDFS data blocks]
• MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
• The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.
❖ Input Splits:
In the splitting step, the dataset is divided into equal units called chunks (input splits).
❖ Mapping
In this phase, the data in each split is passed to a mapping function to produce output values. Hadoop uses a RecordReader, with TextInputFormat, to transform input splits into key-value pairs; this is the only data format that a mapper can read or understand. The mapping step contains the coding logic that is applied to these data blocks: the mapper processes the key-value pairs and produces output of the same form (key-value pairs).
❖ Shuffling
The shuffling phase takes place after the completion of the mapping phase. It consists of two main steps: sorting and merging. In the sorting step, the key-value pairs are sorted by key; merging then combines key-value pairs that share a key. The shuffling phase facilitates the removal of duplicate values and the grouping of values: different values with the same key are grouped together. The output of this phase is again keys and values, just as in the mapping phase.
❖ Reducing
In this phase, the output values from the shuffling phase are aggregated: the values grouped under each key are combined into a single output value, which summarizes the dataset.
Consider the following input data for the MapReduce (word count) program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
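As a rough sketch of how this word count could be implemented (the slides only link to a full example later), the mapper and reducer below follow the standard Hadoop Java API; the class and field names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each input line arrives as (byte offset, line text);
    // emit (word, 1) for every token in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // e.g. ("Hadoop", 1)
            }
        }
    }

    // Reduce phase: after shuffling, all counts for one word arrive together;
    // summing them yields ("Hadoop", 3), ("is", 2), ... for the sample input above.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```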


• Features of MapReduce

1. MapReduce algorithms help organizations process vast amounts of data stored in the Hadoop Distributed File System (HDFS) in parallel.

2. It reduces the processing time and supports faster processing of data, because all the nodes work on their part of the data in parallel.

3. Developers can write MapReduce code in a range of languages such as Java, C++, and Python.

4. It is fault-tolerant: in case of failure, it uses the replicated copies of the blocks on other machines for further processing.
• How Does the Hadoop MapReduce Algorithm Work?

• The input data to be processed by the MapReduce job is stored in input files that reside on HDFS.

• The input format defines the input specification and how the input files are split and read.

• An input split logically represents the data to be processed by an individual Mapper.

• The record reader communicates with the input split and converts the data into key-value pairs (k, v) suitable for reading by the mapper.
• The mapper class processes input records from the RecordReader and generates intermediate key-value pairs (k’, v’). Conditional logic is applied to the ‘n’ data blocks present across the various data nodes.
• The combiner is a mini reducer: one combiner runs on the output of each mapper. It is used to optimize the performance of MapReduce jobs.
• The partitioner decides how outputs from the combiner are sent to the reducers (a minimal partitioner sketch follows this list).
• The output of the partitioner is shuffled and sorted: duplicate values are removed, and the values are grouped by key. This output is fed as input to the reducer, where all the intermediate values for an intermediate key are combined into a list (a tuple of key and value list) that the reducer processes.
• The record writer writes the output key-value pairs from the reducer to the output files, and the output data is stored on HDFS.
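The combiner and partitioner are optional hooks. As an illustration only (not taken from the slides), a custom partitioner might look like the following; by default Hadoop simply hash-partitions on the key, and in a word-count job the combiner is usually just the reducer class reused.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: send keys starting with an upper-case letter to
// reducer 0 and spread everything else over the remaining reducers.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        String s = key.toString();
        if (!s.isEmpty() && Character.isUpperCase(s.charAt(0))) {
            return 0;
        }
        // Non-negative hash, mapped onto partitions 1 .. numPartitions-1.
        return 1 + (s.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

A job would opt into this with job.setPartitionerClass(FirstLetterPartitioner.class); otherwise the default hash partitioner is used.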



Fig: MapReduce workflow



Fig: MapReduce example to count the occurrences of words


• Developing and Executing a Hadoop MapReduce Program
• A common approach to developing a Hadoop MapReduce program is to write Java code using an Integrated Development Environment (IDE) such as Eclipse.
• Compared to a plain-text editor or a command-line interface (CLI), IDE tools offer a better experience for writing, compiling, testing, and debugging code.
• A typical MapReduce program consists of three Java files: one each for the driver code, the map code, and the reduce code.
• Additional Java files can be written for the combiner or a custom partitioner, if applicable.
• The Java code is compiled and packaged as a Java Archive (JAR) file.
• This JAR file is then executed against the specified HDFS input files.
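A minimal sketch of such a driver, assuming the illustrative TokenizerMapper and IntSumReducer classes from the earlier word-count sketch and two command-line arguments for the HDFS input and output paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet)
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional mini-reduce
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Packaged into a JAR, this is typically launched with something like:
        //   hadoop jar wordcount.jar WordCountDriver /input /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```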



• Beyond learning the mechanics of submitting a MapReduce job, three key challenges for a new Hadoop developer are:
1. defining the logic of the code to use the MapReduce paradigm;
2. learning the Apache Hadoop Java classes, methods, and interfaces;
3. implementing the driver, map, and reduce functionality in Java.
• For users who prefer to use a programming language other than Java, there are some other options.
• One option is to use the Hadoop Streaming API, which allows the user to write and run Hadoop jobs with no direct knowledge of Java.
• However, knowledge of some other programming language, such as Python, C, or Ruby, is necessary.
• Apache Hadoop provides the hadoop-streaming.jar file, which accepts the HDFS paths for the input/output files and the paths for the files that implement the map and reduce functionality.
• Here are some important considerations when preparing and running a Hadoop
streaming job:



• A second alternative is to use Hadoop Pipes, a mechanism that uses compiled C++ code for the map and reduce functionality.
• An advantage of using C++ is the extensive numerical libraries available to include in the code.
• To work directly with data in HDFS, one option is to use the C API (libhdfs) or the Java API provided with Apache Hadoop.
• These APIs allow reads and writes to HDFS data files outside the typical MapReduce paradigm.
• Such an approach may be useful when attempting to debug a MapReduce job by examining the input data, or when the objective is to transform the HDFS data prior to running a MapReduce job.
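A minimal sketch of that kind of direct HDFS access through the Java API (not from the slides; the two command-line arguments stand for hypothetical HDFS input and output paths):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsInspect {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Read the first few lines of an HDFS file, e.g. to sanity-check the
        // input of a MapReduce job before submitting it.
        try (FSDataInputStream in = fs.open(new Path(args[0]));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            for (int i = 0; i < 5; i++) {
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);
            }
        }

        // Write a small file back to HDFS, outside the MapReduce paradigm.
        try (FSDataOutputStream out = fs.create(new Path(args[1]), true)) {
            out.writeBytes("checked\n");
        }
        fs.close();
    }
}
```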



• MapReduce algorithm simplified:
https://www.tutorialscampus.com/map-reduce/algorithm.htm

• MapReduce word count program in Java:
https://www.javatpoint.com/mapreduce-word-count-example

• Steps to configure the Hadoop MapReduce environment in Linux:
https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
