
Big Data & Big Data Analytics

Dr. Iman Ahmed ElSayed

Spring 24-25 / Fourth Level

Lecture 3: Hadoop Distributed File System Components and Architecture
Big Data & Big Data Analytics
Lecture Contents:

1. The Beginning and the Need for a Distributed File System

2. Google influence (GFS)

3. Nutch Distributed File System (NDFS)

4. Birth of Hadoop 1.0

5. Hadoop’s Rise

6. Evolution of HDFS

7. Current HDFS System & EcoSystem Integration

8. Key Features of HDFS


Big Data Hadoop

➢ Hadoop is an open-source software framework for storing and processing large datasets ranging in size from gigabytes to petabytes.

➢ Hadoop was developed at the Apache Software Foundation in 2005.

➢ It is written in Java.

➢ Hadoop is a solution to the big data problem, i.e. storing and processing big data, with some extra capabilities.

➢ Its co-founder Doug Cutting named it after his son's toy elephant.

➢ There are two main components of Hadoop:

✓ Hadoop Distributed File System (HDFS)
✓ Yet Another Resource Negotiator (YARN)
Big Data Hadoop architecture and components

The Hadoop architecture mainly consists of 4 components:

➢ MapReduce

➢ HDFS (Hadoop Distributed File System)

➢ YARN (Yet Another Resource Negotiator)

➢ Common Utilities (Hadoop Common)
Big Data Hadoop architecture and components

1. MapReduce

MapReduce is essentially a programming model (an algorithm pattern) that runs on top of the YARN framework.

Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast.

MapReduce divides the work into two phase-wise tasks (a minimal sketch follows the list below):

➢ In the first phase, Map is applied.

➢ In the next phase, Reduce is applied.
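As a rough, Hadoop-free sketch of this idea (plain Java; the class name and sample records are illustrative and anticipate the city/temperature example on the following slides), the Map phase emits key-value pairs and the Reduce phase folds all values that share a key into one result:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hadoop-free illustration of the two MapReduce phases (sample data is hypothetical).
public class TwoPhaseSketch {
    public static void main(String[] args) {
        List<String> records = List.of("Toronto,20", "Whitby,25", "Toronto,32", "Whitby,19");

        Map<String, Integer> maxPerCity = records.stream()
            // Map phase: turn each record into a (city, temperature) pair.
            .map(r -> r.split(","))
            // Shuffle + Reduce phase: group by city, keeping the maximum temperature per key.
            .collect(Collectors.toMap(p -> p[0], p -> Integer.parseInt(p[1]), Integer::max));

        System.out.println(maxPerCity);  // e.g. {Toronto=32, Whitby=25} (map order may vary)
    }
}
```

On a real cluster the same two steps run in parallel across many machines, with the framework handling the grouping between them.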

Big Data Map - Reduce functions

[MapReduce data-flow diagram]
Big Data Map() function

As shown, an input is provided to the Map() function; since we are working with Big Data:

➢ The input is a set of data blocks.

➢ The Map() function breaks these data blocks into tuples, which are simply key-value pairs.

➢ These key-value pairs are then sent as input to the Reduce() function.
Big Data Map() function

Map Task:

➢ RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing the key-value pairs to the Map() function: the key is the record's locational information (e.g., its byte offset) and the value is the data associated with it.

➢ Map: A map is a user-defined function whose job is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pairs at all, or multiple pairs, for each input tuple.

➢ Combiner: The Combiner is used for locally grouping (pre-aggregating) the map output within the Map workflow.

➢ Partitioner: The Partitioner takes the key-value pairs generated in the Map phase and decides which Reducer each pair is sent to (see the sketch below).
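To make that last stage concrete, here is a minimal, hypothetical Partitioner for the (city, temperature) pairs used on the following slides, assuming Text keys and IntWritable values. It mirrors what Hadoop's default HashPartitioner already does and is shown only as an illustration (the class name is a placeholder):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: routes each (city, temperature) pair to a reducer
// based on a hash of the city name, i.e. what HashPartitioner does by default.
public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text city, IntWritable temperature, int numReduceTasks) {
        // Non-negative hash of the key, modulo the number of reduce tasks.
        return (city.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In a job driver this would be registered with job.setPartitionerClass(CityPartitioner.class); a Combiner is registered the same way with job.setCombinerClass(...).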
Big Data Map() function

Assume you have 5 files, and each file contains two columns (a key and a value, in Hadoop terms) that represent a city and the temperature recorded in that city on various measurement days (record reading).

The city is the key, and the temperature is the value.

For example: (Toronto, 20) is one (key, value) pair produced by the Map step.

Out of all the data collected, you want to find the maximum temperature for each city across all the data files.
Big Data Map() function
Using the MapReduce framework, you can break this down into five map tasks, where each mapper works on one of the five files.

Each mapper task goes through its data and returns the maximum temperature for each city. For example, the results produced by one mapper task for the data above would look like this (this per-mapper aggregation is the Combiner step):

(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)

Assume the other four mapper tasks (working on the other four files, not shown here) produced the following intermediate results, which the Partitioner then routes to the reducers:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
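A minimal sketch of such a mapper using the org.apache.hadoop.mapreduce API, assuming (for illustration) that each input line has the form "Toronto,20"; the class name is a placeholder:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a map task for the temperature example. It assumes each input line
// is "city,temperature"; the RecordReader supplies the line's byte offset as the
// key and the line text as the value.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        if (parts.length == 2) {
            // Emit (city, temperature) as an intermediate key-value pair.
            context.write(new Text(parts[0].trim()),
                          new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}
```

Running the reducer shown on a later slide as a combiner over each mapper's output is what produces the per-file maxima listed above.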
Big Data Reduce() function
The Reduce() function then:

✓ Combines the broken tuples (key-value pairs) based on their key value and forms sets of tuples.

✓ Performs an operation such as sorting, summation, etc., whose result is then sent to the final output node.

✓ Finally, the output is obtained.

The data processing done in the Reducer depends on the business requirement of that industry.
Big Data Reduce() function
Reduce Task:

➢ Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as Shuffling. Using the shuffling process, the system can sort the data by its key value. Shuffling begins as soon as some of the map tasks are done, which is why it is a faster process: it does not wait for all Mappers to finish.

➢ Reduce: The main task of the Reducer is to gather the tuples generated by the Map step and then perform some sorting and aggregation on those key-value pairs, depending on the key element.

➢ OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of the record writer, each record on a new line, with the key and value separated by a delimiter (a tab character by default).
Big Data Reduce() function

Example Cont’d:

(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)


(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

Shuffling and sorting are performed as part of the Reduce() stage.

All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final output:
(Toronto,32) (Whitby,27) (New York, 33) (Rome,38)
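A minimal sketch of the corresponding reducer (same placeholder naming as the mapper sketch above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the reduce task: for each city, the framework hands the reducer
// all intermediate temperatures for that key; the reducer keeps the maximum.
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text city, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temperatures) {
            max = Math.max(max, t.get());
        }
        context.write(city, new IntWritable(max));  // e.g. (Toronto, 32)
    }
}
```

Because taking a maximum is associative and commutative, the same class can also be registered as the combiner (job.setCombinerClass(MaxTemperatureReducer.class)), which is what yields the per-mapper maxima shown on the earlier slide.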
Big Data Hadoop distributed file system (HDFS)
➢ Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.

➢ HDFS is based on the Google File System (GFS) and provides a distributed file system designed to run on large clusters (thousands of nodes) of commodity machines in a reliable, fault-tolerant manner.

➢ It is designed to store and manage large volumes of data across multiple machines in a distributed manner.

➢ HDFS uses a master/slave architecture: the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
Big Data Hadoop distributed file system (HDFS)
HDFS is responsible for storing data on the cluster in Hadoop.

Files in HDFS are split into blocks of 64 MB or 128 MB before they are stored on the cluster.

On a fully configured cluster, running Hadoop means running a set of daemons, or resident programs (a set of Java processes), on the different servers in your network.

Programs that reside permanently in memory are called "resident programs."

Each daemon runs separately in its own JVM. These daemons have specific roles; some exist only on one server, some exist across multiple servers.
Big Data Hadoop distributed file system (HDFS)
HDFS Architecture

HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of massive datasets. Its architecture consists of several key components:

1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
6. JobTracker (master daemon)
7. TaskTracker (slave daemon)

The above daemons are called the "Building Blocks of Hadoop."
Big Data Hadoop distributed file system (HDFS)
NameNode

➢ The NameNode maps file blocks to DataNodes.

➢ It is the master daemon of HDFS and directs the slave DataNode daemons to perform low-level I/O tasks such as opening, closing, and renaming files and directories.

➢ It acts like a bookkeeper: it keeps track of how your files are broken down into file blocks and which nodes store those blocks (i.e., it is responsible for storing all the location information of the files present in HDFS).

➢ The actual data is never stored on a NameNode. In other words, it holds only the metadata of the files in HDFS.
Big Data Hadoop distributed file system (HDFS)

Cont'd: NameNode

The NameNode maintains the entire metadata in RAM, which helps clients receive quick responses to read requests.

The NameNode daemon also maintains a persistent checkpoint of the metadata in a file stored on disk called the fsimage file. Whenever a file is placed, deleted, or updated in the cluster, an entry for this action is recorded in a file called the edits logfile.

Key Responsibilities:
• Maintaining the filesystem tree and metadata.
• Managing the mapping of file blocks to DataNodes.
• Ensuring data integrity and coordinating replication of data blocks.
Big Data Hadoop distributed file system (HDFS)
DataNode daemon

➢ Performs the "dirty work" of the distributed filesystem.

➢ Acts as a slave node and is responsible for storing the actual file data in HDFS.

➢ Files are split into data blocks across the cluster.

➢ Blocks are typically 64 MB or 128 MB in size; the block size is a configurable parameter.

➢ File blocks in a Hadoop cluster are also replicated to other DataNodes for redundancy, so that no data is lost if a DataNode daemon fails.

➢ The DataNode daemon sends information to the NameNode daemon about the files and blocks stored on that node, and responds to the NameNode daemon for all filesystem operations.
Big Data Hadoop distributed file system (HDFS)
It is the responsibility of the NameNode daemon to maintain a list of the files and their corresponding block locations on the cluster. Whenever a client needs to access a file, the NameNode daemon provides the locations to the client, and the client then accesses the data directly from the DataNode daemons (a sketch of this interaction follows the list below).

Key Responsibilities:

➢ Storing data blocks and serving read/write requests from clients.

➢ Performing block creation, deletion, and replication upon instruction from the NameNode.

➢ Periodically sending block reports and heartbeats to the NameNode to confirm its status.
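As an illustration of that interaction, here is a hedged sketch using the public org.apache.hadoop.fs.FileSystem API (the NameNode URI and file path are hypothetical): the getFileBlockLocations() call is answered from the NameNode's metadata, after which a client would read the blocks directly from the listed DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: list which DataNodes hold the blocks of a file.
public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        FileStatus status = fs.getFileStatus(new Path("/data/weather/readings.txt"));
        // The NameNode answers this metadata query; no file data is transferred yet.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```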
Big Data Hadoop distributed file system (HDFS)
Secondary NameNode:

Since the fsimage file is not updated for every operation inside the NameNode, the edits logfile can grow into a very large file.

A restart of the NameNode service would then become very slow, because all the actions in the large edits logfile have to be applied to the fsimage file.

This slow boot-up time can be avoided using the Secondary NameNode daemon.

The Secondary NameNode acts as a helper to the primary NameNode. It is primarily responsible for merging the edits log with the current filesystem image (fsimage) to reduce the potential load on the NameNode.
Big Data Hadoop distributed file system (HDFS)
Secondary NameNode:

✓ Is an assistant daemon for monitoring the state of the cluster HDFS.

✓ Is responsible for performing periodic housekeeping functions for the NameNode.

✓ Each cluster has one SNN, and it typically resides on its own machine.

✓ It only creates checkpoints of the filesystem metadata (fsimage) present in the NameNode.

✓ In case the NameNode daemon fails, this checkpoint can be used to rebuild the filesystem metadata.
Big Data Hadoop distributed file system (HDFS)

The following steps are carried out by the Secondary NameNode daemon:

1. Get the edits logfile from the primary NameNode daemon.

2. Get the fsimage file from the primary NameNode daemon.

3. Apply all the actions present in the edits log to the fsimage file.

4. Push the updated fsimage file back to the primary NameNode.
Big Data Hadoop distributed file system (HDFS)
JobTracker (the "foreman"):

➢ Is responsible for accepting job requests from a client and scheduling/assigning TaskTrackers with tasks to be performed.

➢ The JobTracker daemon tries to assign tasks to the TaskTracker daemon on the DataNode where the data to be processed is stored. This feature is called "data locality."

➢ If that is not possible, it will at least try to assign tasks to TaskTrackers within the same physical server rack.

➢ If, for some reason, the node hosting the DataNode and TaskTracker daemons fails, the JobTracker daemon assigns the task to another TaskTracker daemon on a node where a replica of the data exists.
Big Data Hadoop distributed file system (HDFS)
TaskTracker (the "workers"):

➢ Is a daemon that accepts tasks (map, reduce, and shuffle) from the JobTracker daemon.

➢ The TaskTracker daemon is the daemon that performs the actual tasks during a MapReduce operation.

➢ The TaskTracker daemon periodically sends a heartbeat message to the JobTracker to notify it that it is alive.

➢ Along with the heartbeat, it also reports the free task slots available for processing. The TaskTracker daemon starts and monitors the map and reduce tasks and sends progress/status information back to the JobTracker daemon.
Big Data Hadoop distributed file system (HDFS)
Block Structure:

HDFS stores files by dividing them into large blocks, typically 128 MB or 256 MB in size.

Each block is stored independently across multiple DataNodes, allowing for parallel processing and fault tolerance.

The NameNode keeps track of the block locations and their replicas (an API sketch follows below).

Key Features:
✓ A large block size reduces the overhead of managing a large number of blocks.
✓ Blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance.
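A brief, hedged sketch of inspecting these per-file settings through the same Java API (URI and path are hypothetical); the cluster-wide defaults behind them are the dfs.blocksize and dfs.replication configuration properties.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: report the block size and replication factor HDFS
// recorded for an existing file.
public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        FileStatus status = fs.getFileStatus(new Path("/data/weather/readings.txt"));
        System.out.println("block size (bytes): " + status.getBlockSize());
        System.out.println("replication factor: " + status.getReplication());

        // Cluster-wide defaults come from dfs.blocksize and dfs.replication.
        System.out.println("default block size: " + fs.getDefaultBlockSize(status.getPath()));
        System.out.println("default replication: " + fs.getDefaultReplication(status.getPath()));
        fs.close();
    }
}
```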
Big Data Hadoop distributed file system (HDFS)

HDFS Client:

The HDFS client is the interface through which users and applications interact with HDFS.

It allows file creation, deletion, reading, and writing operations.

The client communicates with the NameNode to determine which DataNodes hold the blocks of a file, and interacts directly with the DataNodes for the actual data read/write operations (see the sketch below).

Key Responsibilities:
✓ Facilitating interaction between the user/application and HDFS.
✓ Communicating with the NameNode for metadata and with DataNodes for data access.
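A minimal, hedged sketch of an HDFS client using the Java FileSystem API (the NameNode URI and path are hypothetical): the create/open calls go to the NameNode for metadata, while the actual bytes flow between the client and the DataNodes.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Illustrative HDFS client sketch: create a file, write to it, read it back.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets the block locations from the NameNode,
        // then pulls the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```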
Big Data Hadoop architecture and components

[Architecture diagram] Note: a bottleneck will occur sooner or later.
