L02-Hadoop Framework
Agenda
Basic Concepts
Introduction to MapReduce
Apache Hadoop
Hadoop ecosystem
Hadoop YARN
Basic Concepts
Basic Concepts
➢Parallel Data Processing vs Distributed Data Processing:
• Parallel data processing involves the simultaneous execution of
multiple sub-tasks that collectively comprise a larger task.
• The goal is to reduce the execution time by dividing a single larger
task into multiple smaller tasks that run concurrently.
• Although parallel data processing can be achieved through multiple
networked machines, it is more typically achieved within a single
machine with multiple processors or cores.
• Distributed data processing is always achieved through physically
separate machines that are networked together as a cluster.
Basic Concepts
[Figure: Parallel Data Processing vs. Distributed Data Processing]
Basic Concepts
➢Batch vs Interactive vs Real-time stream Processing
• Batch processing is the processing of data in groups or batches. No
user interaction is required once a batch job is running.
• Interactive processing means that the user provides the computer
with instructions while it is doing the processing.
• Real-time stream processing is the processing of data at the time
the data is generated. Processing times can be measured in
microseconds rather than in hours or days.
Introduction to MapReduce
Introduction to MapReduce
➢MapReduce is a programming paradigm to perform large-scale
computation across computing clusters.
➢It is mainly composed of two stages, Map and Reduce:
• Map: takes a sequence of values, iteratively applies a mapper
function to each value in the sequence, and produces <key, value>
pairs as output.
• Reduce: takes <key, value> pairs and combines all the elements
having the same key through a reducer function.
Introduction to MapReduce
Note: the same key goes to the same reducer machine.
Introduction to MapReduce
➢Example: Word count.
[Figure: Map stage (mapper)]
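To make the map stage concrete, below is a minimal mapper sketch in Java against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class name is illustrative. For each line of input text it emits a <word, 1> pair per token:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for word count: for every token in an input line,
// emit the pair <word, 1>.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE); // <word, 1>
    }
  }
}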
Introduction to MapReduce
➢Example: Word count.
• Assume there are two reducer machines
[Figure: Reduce stage (reducer)]
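And a matching reducer and driver sketch, again using the standard Hadoop API. Since all pairs with the same key arrive at the same reducer, the reducer simply sums the 1s for each word; setNumReduceTasks(2) mirrors the two reducer machines assumed above. Class names and the input/output paths are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// IntSumReducer.java: sums all the 1s emitted for a given word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result); // <word, total count>
  }
}

// WordCount.java (driver, in a separate file): wires the mapper
// and reducer together and submits the job.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(2); // two reducer machines, as in the slide
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}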
Apache Hadoop
Apache Hadoop
➢Hadoop is an Apache open-source framework written in Java that allows
distributed processing of large datasets across clusters of computers.
Apache Hadoop - Architecture
➢Hadoop Architecture:
• It consists of two main components:
1. Storage (Hadoop Distributed File System)
• HDFS creates multiple replicas of data blocks and distributes them on
the nodes in a cluster; MapReduce consumes its input data from HDFS.
This distribution enables reliable and extremely rapid computations.
2. Processing/Computation (MapReduce)
• Capable of processing enormous volumes of data in parallel on large
clusters of computation nodes. It performs two main tasks: map and reduce.
Apache Hadoop - Architecture
➢The Hadoop Distributed File System (HDFS) follows a master-slave
architecture with the following components:
• NameNode: This is the master of the HDFS system. It maintains the file system
tree and the metadata for all the files and directories present in the system.
• DataNode: These are slaves that are deployed on each machine and provide the
actual storage. They are responsible for serving read and write requests from
clients. Internally, HDFS divides each file into one or more blocks, and these
blocks are stored on a set of DataNodes.
Note: HDFS is not a database; it is a distributed file system that can
store huge volumes of data across a cluster of computers so that the
data can be processed there.
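To illustrate how a client touches this architecture, here is a minimal read sketch with Hadoop's Java FileSystem API, assuming a configured cluster; the file path is hypothetical. The client consults the NameNode (located via fs.defaultFS) for metadata and streams the block data from DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (including the NameNode address) from
    // core-site.xml / hdfs-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/input.txt"); // hypothetical path
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // block data is served by DataNodes
      }
    }
  }
}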
Apache Hadoop - Architecture
➢MapReduce also follows a master-slave architecture, with the
following components:
• JobTracker: This is the master node of the MapReduce system, which manages
the jobs and resources in the cluster. The JobTracker tries to schedule the maps
to specific nodes in the cluster, ideally the nodes that have the data.
• TaskTracker: These are the slaves that are deployed on each machine. They are
responsible for running the map and reduce tasks as instructed by the JobTracker.
Apache Hadoop - Architecture
➢ How MapReduce tasks are assigned to specific nodes in the cluster:
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data (DataNodes).
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not send heartbeat signals often enough, they are
considered to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to
do then: it may resubmit the job elsewhere, it may mark that specific record as something to
avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
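From the client's perspective, step 1 is a single submission call. Below is a minimal sketch using the classic org.apache.hadoop.mapred API (the JobTracker-era interface); the mapper and reducer classes are left as hypothetical placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJob.class);
    conf.setJobName("wordcount");
    // MyMapper / MyReducer are hypothetical classes implementing the
    // old-style Mapper / Reducer interfaces:
    // conf.setMapperClass(MyMapper.class);
    // conf.setReducerClass(MyReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the JobTracker and polls its status until
    // completion (steps 1-8 above happen behind this call).
    JobClient.runJob(conf);
  }
}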
Number of Maps
➢The number of maps is usually driven by the number of blocks of the
input files.
➢With 10 TB of input data and a block size of 128 MB, you will end up with
81,920 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int), which
only provides a hint to the framework, is used to set it even higher.
➢The right level of parallelism for maps seems to be around 10-100 maps
per node, although it has been set as high as 300 maps for very CPU-light
map tasks.
➢Block size can also be changed to adjust the number of blocks.
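A small sketch of both knobs mentioned above, assuming the standard Hadoop client API; note that MRJobConfig.NUM_MAPS is only a hint to the framework, and the block size applies to files written with this configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class MapCountConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Hint a higher number of maps (10 TB / 128 MB = 81,920 by default).
    conf.setInt(MRJobConfig.NUM_MAPS, 100_000);

    // Alternatively, use a larger block size (here 256 MB) so the same
    // input is split into fewer blocks, and therefore fewer maps.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

    System.out.println(conf.getInt(MRJobConfig.NUM_MAPS, -1));
  }
}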
Number of Reducers
➢In Hadoop, the default number of reducers is one.
➢In this phase the reduce(WritableComparable, Iterable<Writable>,
Context) method is called for each <key, (list of values)> pair in the
grouped inputs.
➢The number of reducers for the job is set by the user via
Job.setNumReduceTasks(int)
➢Increasing the number of reducers increases the framework overhead,
but improves load balancing and lowers the cost of failures.
➢It is legal to set the number of reduce-tasks to zero if no reduction is
desired.
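A minimal sketch of setting the reducer count on a job via Job.setNumReduceTasks(int); the value 8 is arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "example");

    // Use several reducers for better load balancing
    // (the default is one).
    job.setNumReduceTasks(8);

    // Or skip the reduce phase entirely: map output then goes
    // straight to HDFS.
    // job.setNumReduceTasks(0);
  }
}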
Apache Hadoop – HDFS Features
➢ High Scalability - HDFS is highly scalable as it can have hundreds of nodes
in a single cluster.
➢ Handling Huge datasets − Due to the high scalability, HDFS can manage
applications having huge datasets.
➢ Distributed data storage - This is one of the most important features of
HDFS that makes Hadoop very powerful. Here, data is divided into multiple
blocks and stored into nodes.
➢ Hardware at data − A requested task is done efficiently as the computation
takes place near the data. This reduces the network traffic and increases the
throughput, especially where huge datasets are involved.
21
Apache Hadoop – HDFS Features
➢Replication - Due to unfavorable conditions, the node containing
the data may be lost. To overcome such problems, HDFS always
maintains copies of the data on more than one machine.
➢Fault tolerance - In HDFS, fault tolerance signifies the robustness
of the system in the event of failure. Thanks to replication, HDFS is so
fault-tolerant that if any machine fails, another machine containing a
copy of that data automatically becomes active.
➢Portability - HDFS is designed in such a way that it can easily be
ported from one platform to another.
Apache Hadoop – Hadoop Operation Modes
➢ Hadoop can run in 3 different modes:
1. Standalone Mode (single machine, single process):
By default, Hadoop is configured to run in a non-distributed mode. It runs as a single
Java process. Instead of HDFS, this mode utilizes the local file system. This mode is
useful for debugging.
2. Pseudo-Distributed Mode (single machine, multiple processes):
Hadoop can also run on a single machine in pseudo-distributed mode. In this
mode, each Hadoop daemon runs as a separate Java process, and HDFS is
utilized for input and output.
3. Fully Distributed Mode (multiple machines, multiple processes):
This is the production mode of Hadoop. In this mode, Hadoop is distributed across
multiple machines. Therefore, separate Java processes are present. This mode offers
fully distributed computing capability, reliability, fault tolerance and scalability.
Hadoop ecosystem
Apache Hadoop – Hadoop ecosystem
➢Apache Hive
• It is software developed by Facebook that facilitates reading, writing, and
managing large datasets residing in distributed storage using SQL.
• It allows users to write queries in a SQL-like language called HiveQL.
➢Apache Pig:
• It is a platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin.
• Apache Pig was developed by Yahoo. Currently, Yahoo and Twitter are the
primary users of Pig.
Apache Hadoop – Hadoop ecosystem
➢Apache Hbase:
• It is the Hadoop database, a distributed, scalable, big data store. This allows
random, real-time read/write access to Big Data. Apache HBase is an open-
source, distributed, non-relational database modeled after Google's Bigtable.
• The following are some companies using HBase: Yahoo, Twitter, and
StumbleUpon (a personalized recommender system, real-time data storage,
and data analytics platform).
Apache Hadoop – Hadoop ecosystem
➢Apache Impala:
• With Impala, you can query data, whether stored in HDFS or Apache HBase –
including SELECT, JOIN, and aggregate functions – in real time.
• Impala directly accesses the data through a specialized distributed query
engine. The result is order-of-magnitude faster performance than Hive.
➢Apache Sqoop:
• Apache Sqoop is a data transfer tool for importing data from relational
databases to Hadoop HDFS and exporting data from HDFS to relational
databases.
• It works together with most modern relational databases, such as Microsoft SQL
Server, MySQL, and Oracle.
Apache Hadoop – Hadoop ecosystem
➢Mahout:
• It is a popular, scalable machine-learning library that includes the most
popular scalable data mining and machine learning algorithms.
• The following are some companies that are using Mahout: Amazon, Twitter,
Yahoo, and LucidWorks Big Data (an analytics firm that uses Mahout for
clustering, duplicate document detection, phrase extraction, and classification).
Apache Hadoop – Hadoop ecosystem
➢Apache Solr:
• It is an open-source enterprise search platform.
• Solr is highly reliable, scalable and fault tolerant, providing distributed indexing,
replication and load-balanced querying, automated failover and recovery,
centralized configuration and more.
• This allows building web applications with powerful search capabilities.
• Solr powers the search and navigation features of many of the world's largest
internet sites.
Apache Hadoop – Hadoop ecosystem
➢Apache Zookeeper:
• It is a service that enables highly reliable distributed coordination.
• It is a centralized service for maintaining configuration information, naming, and
providing distributed synchronization.
➢… and others
Hadoop YARN
Hadoop YARN
➢The JobTracker in v1.0 is the single master that allocates resources for
applications, performs scheduling on demand, and also monitors the
processing jobs in the system (master for both resources and processing).
➢YARN splits these two responsibilities between separate components,
described next.
Hadoop YARN - Architecture
➢The main components of YARN architecture include:
1. Resource Manager: It is the master of YARN and is responsible for
resource assignment and management among all the applications.
• Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the
completion of the request accordingly.
• The ResourceManager has two main components: Scheduler and
ApplicationsManager.
Hadoop YARN - Architecture
➢The main components of YARN architecture include:
1. Resource Manager:
• The ApplicationsManager is responsible for accepting job submissions,
negotiating the first container for executing the application-specific
ApplicationMaster, and providing the service for restarting the
ApplicationMaster container on failure.
Hadoop YARN - Architecture
2. Container: A container is a bundle of physical resources on a node,
such as disk, CPU cores, and RAM.
3. Application Master: An application is a single job submitted to the
framework. The Application Master is responsible for negotiating
resources with the Resource Manager, and for tracking the status
and monitoring the progress of a single application.
4. Node Manager: It sends each node's health status to the Resource
Manager, stating whether the node's processes have finished working
with their resources. The Node Manager is also responsible for
monitoring the resource usage of individual containers and reporting
it to the Resource Manager.
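As a small illustration of talking to the Resource Manager, the sketch below uses the YarnClient API (org.apache.hadoop.yarn.client.api) to list the running nodes and the resources each Node Manager reports; it is only a cluster query, not a full application submission, and assumes a configured yarn-site.xml:

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
  public static void main(String[] args) throws Exception {
    // Connects to the ResourceManager configured in yarn-site.xml.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Each NodeReport reflects what a Node Manager has reported.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + " capability: "
          + node.getCapability()); // e.g. memory and vCores per node
    }
    yarnClient.stop();
  }
}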
Hadoop YARN - Features
➢Multi-tenancy: YARN allows multiple data processing engines, such as
batch, interactive, and real-time stream processing, to access the same data.
➢Scalability: The scheduler in YARN's Resource Manager allows Hadoop to
extend to and manage thousands of nodes and clusters.
➢Cluster utilization: YARN allocates all cluster resources in an efficient
and dynamic manner, which leads to better utilization.
➢Compatibility: YARN supports existing MapReduce applications without
disruption, making it compatible with Hadoop 1.0 as well.
Hadoop Version 2.0 vs Version 1.0
Criteria                   | Version 2.0                              | Version 1.0
Components                 | HDFS, MapReduce, YARN                    | HDFS, MapReduce
Suitable for               | MapReduce and non-MapReduce applications | Only MapReduce applications
Managing cluster resources | Done by YARN                             | Done by JobTracker
Thank You