
Module 2 - Introduction to Hadoop

Introduction to Hadoop

The Apache™ Hadoop® project develops open-source
software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Introduction to Hadoop

It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.

Yahoo! has more than 100,000 CPUs in over 40,000
servers running Hadoop.

Facebook has two major clusters:

One cluster has 1100 machines with 8800 cores and
about 12 PB of raw storage; the other is a 300-machine
cluster with 2400 cores and about 3 PB of raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.

The Hadoop data store concept implies storing the data at a number of clusters. Each cluster has a number of data stores, called racks. Each rack stores a number of DataNodes, and each DataNode holds a large number of data blocks. The racks are distributed across a cluster. The nodes have processing and storage capabilities, and hold the data blocks on which the application tasks run. By default, each data block replicates on at least three DataNodes, on the same or remote racks. The data at the stores enables running distributed applications, including analytics, data mining and OLAP, using the clusters. A file containing the data divides into data blocks. The default data block size is 64 MB. (The HDFS division of files into fixed-size blocks is similar to virtual-memory paging in Linux on Intel x86 and Pentium processors, where the page size is fixed at 4 KB.)
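To make the block concept concrete, the following minimal sketch (not from the slides; the file path is hypothetical) uses the HDFS Java API to list the blocks of a file and the DataNodes holding each replica:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; HDFS splits a large file into fixed-size blocks
    Path file = new Path("/user/demo/bigfile.dat");
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; each lists the DataNodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```

With the default replication factor of three, each block should report three hosts.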

Hadoop HDFS features (system characteristics) are as follows:


Scalable

Self-manageable

Self-healing

Distributed file system
Hadoop core components
The Hadoop core components of the framework
are:

Hadoop Common — The common module
contains the libraries and utilities that are
required by the other modules of Hadoop.
Hadoop core components

Hadoop Distributed File System (HDFS) — A
Java-based distributed file system which can
store all kinds of data on the disks at the
clusters.
Hadoop core components


MapReduce — The software programming model in
Hadoop, which uses Mapper and Reducer tasks.
Hadoop processes large sets of data in parallel
and in batches.
Hadoop core components

YARN — Software for managing the resources for
computing.

The user application tasks or sub-tasks run in
parallel on Hadoop; YARN schedules them and
handles the requests for resources during the
distributed running of the tasks.
Features of Hadoop

Fault-efficient, scalable, flexible and modular design

Robust design of HDFS

Store and process Big Data

Distributed cluster computing with data locality

Hardware fault-tolerant

Open-source framework

Java and Linux based
Features of Hadoop
Fault-efficient, scalable, flexible and modular design:

Hadoop uses a simple and modular programming model,
and provides servers with high scalability.

The system scales by adding new nodes to handle
larger data.

Modular functions make the system flexible.

One can add or replace components with ease.
Features of Hadoop
Robust design of HDFS:

Execution of Big Data applications continues even
when an individual server or cluster fails. This is
because Hadoop provides for backup (due to
replication of each data block at least three times)
and a data-recovery mechanism.

HDFS thus has high reliability.
Features of Hadoop
Store and process Big Data:

Processes Big Data of 3V (volume, velocity, variety) characteristics.
Features of Hadoop
Distributed cluster computing with data locality:

Processes Big Data at high speed, as the application tasks
and sub-tasks are submitted to the DataNodes that hold
the data.

One can achieve more computing power by increasing the
number of computing nodes. The processing splits across
multiple DataNodes (servers), which yields fast processing
and aggregated results.
Features of Hadoop
Hardware fault-tolerant:

A fault does not affect data and application
processing. If a node goes down, the other nodes
take over its work.

This is due to the multiple copies of all data blocks,
which replicate automatically. The default is three
copies of each data block.
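As an illustrative sketch (not from the slides; the path is hypothetical), the replication factor can also be requested per file through the HDFS Java API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; the cluster-wide default (dfs.replication) is 3
    Path file = new Path("/user/demo/sample.txt");

    // Ask HDFS to keep three copies of this file's blocks
    boolean accepted = fs.setReplication(file, (short) 3);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}
```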
Features of Hadoop
Open-source framework:

Open-source access and cloud services enable
large data stores. Hadoop uses a cluster of
multiple inexpensive servers or the cloud.
Features of Hadoop
Java and Linux based:

Hadoop uses Java interfaces. The Hadoop base is
Linux, but it has its own set of shell commands.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS is a core component of Hadoop.

HDFS is designed to run on a cluster of
computers and servers at cloud-based utility
services.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


HDFS stores Big Data, which may range from
GBs to PBs.

HDFS stores the data in a distributed manner to
enable fast computation.
HDFS Data Storage

Hadoop data store concept implies storing the data
at a number of clusters.

Each cluster has a number of data stores, called
racks.

Each rack stores a number of DataNodes.

Each DataNode has a large number of data blocks.
Problem

Consider a data storage for University
students. Each student's data, stuData,
is in a file of size less than 64 MB (1 MB = 2^20 B).
A data block stores the full file data for a
student stuData_idN, where N = 1 to 500.
(i) How will the files of each student be
distributed at a Hadoop cluster?
How many students' data can be stored at one
cluster? Assume that each rack has two
DataNodes, each with 64 GB (1 GB = 2^30 B)
of memory, and that the cluster consists of
120 racks, thus 240 DataNodes.
(ii) What is the total memory capacity of the
cluster in TB?
(iii) Show the distributed blocks for the
students with ID = 96 and 1025.
Assume default replication in the
DataNodes = 3.
(iv) What shall be the changes when a
stuData file size is <= 128 MB?
(i) The default data block size is 64 MB.

Each student's file size is less than
64 MB; therefore, each student file
occupies one data block.

A data block resides in a DataNode.

Assume, for simplicity, that each rack has two
nodes, each of memory capacity 64 GB.

Each node can thus store 64 GB/64 MB = 1024
data blocks = 1024 student files.

Each rack can thus store 2 x 64 GB/64 MB =
2048 data blocks = 2048 student files.

Each data block replicates three times
by default in the DataNodes.

Therefore, the number of students whose
data can be stored in the cluster =
number of racks multiplied by number of
files per rack, divided by 3 = 120 x 2048/3 = 81,920.

Therefore, a maximum of 81,920 stuData_idN
files can be distributed per cluster,
with N = 1 to 81,920.

(ii) Total memory capacity of the cluster =
120 racks x 2 nodes x 64 GB = 120 x 128 GB = 15,360 GB = 15 TB.

(iii) Each of stuData_id96 and stuData_id1025
occupies one data block, and each block
replicates at three DataNodes in the same or
different racks. The file of student ID = 96 thus
appears as three distributed block copies, and
likewise for ID = 1025.

(iv) Each stuData file of size up to 128 MB will
now need two 64 MB data blocks, so each node
will store half the number of student files.
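The arithmetic above can be checked with a few lines of Java; this is just an illustrative calculation, not part of the original problem:

```java
public class ClusterCapacity {
  public static void main(String[] args) {
    long blockSizeMB = 64;       // default HDFS block size assumed in the problem
    long nodeCapacityGB = 64;    // each DataNode stores 64 GB
    int nodesPerRack = 2;
    int racks = 120;
    int replication = 3;         // default replication factor

    long blocksPerNode = nodeCapacityGB * 1024 / blockSizeMB;    // 1024
    long blocksPerRack = nodesPerRack * blocksPerNode;           // 2048
    long filesPerCluster = racks * blocksPerRack / replication;  // 81920

    long totalGB = (long) racks * nodesPerRack * nodeCapacityGB; // 15360 GB
    System.out.println("Blocks per node:      " + blocksPerNode);
    System.out.println("Student files/cluster:" + filesPerCluster);
    System.out.println("Total capacity (TB):  " + totalGB / 1024); // 15 TB
  }
}
```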
Hadoop main components
and
ecosystem components

The Hadoop ecosystem refers to a combination of
technologies.

The Hadoop ecosystem consists of Hadoop's own family of
applications, which tie up together with the
Hadoop framework.
The four layers in the figure are as follows:
(i) Distributed storage layer
(ii) Resource-manager layer for job or application sub-
tasks scheduling and execution
(iii) Processing-framework layer, consisting of Mapper
and Reducer for the MapReduce process-flow
(iv) APIs at application support layer (applications
such as Hive and Pig).
Hadoop Physical organization
HDFS uses NameNodes and DataNodes.
Hadoop Physical organization

A NameNode stores the file's
metadata. Metadata gives
information about the user
application's file, but does not
participate in the computations.
Hadoop Physical organization

The DataNode stores the actual data files in
the data blocks.
Hadoop Physical organization

A few nodes in a Hadoop cluster
act as NameNodes. These nodes
are termed MasterNodes, or
simply masters.

The masters have a different
configuration, supporting high
DRAM and processing power.

The masters have much less local
storage.

Hadoop Physical organization

Clients, as the users, run the
applications with the help of
Hadoop ecosystem projects;
for example, Hive, Mahout and
Pig are ecosystem projects.

The clients are not required to be
present at the Hadoop cluster.
Hadoop Physical organization

The MasterNode fundamentally plays
the role of a coordinator.

The MasterNode receives client
connections, maintains the
description of the global file-system
namespace and the allocation of file
blocks, and monitors the state of
the system in order to detect any
failure.

The masters consist of three
components: NameNode, Secondary
NameNode and JobTracker.
Hadoop Physical organization
The NameNode stores all the
file-system-related
information, such as:

Which part of the cluster a
file section is stored in

Last access time for the files

User permissions, i.e., which
user has access to the file
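As a minimal sketch (not from the slides; the path is hypothetical), this Java program reads exactly the kind of metadata the NameNode serves, via the HDFS FileStatus API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path
    FileStatus st = fs.getFileStatus(new Path("/user/demo/sample.txt"));

    // Metadata answered by the NameNode, not the DataNodes
    System.out.println("Owner:       " + st.getOwner());
    System.out.println("Permissions: " + st.getPermission());
    System.out.println("Access time: " + st.getAccessTime()); // ms since epoch
    System.out.println("Block size:  " + st.getBlockSize());
    System.out.println("Replication: " + st.getReplication());

    fs.close();
  }
}
```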
MapReduce Programming Model
MapReduce
Goal:
count the number of books in
the library.
Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
Reduce:
We all get together and add up our individual counts
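Below is a minimal sketch of this analogy as a Hadoop MapReduce job in Java. It is not from the slides, and the class and key names are illustrative: each input line is assumed to hold one book title, the Mapper emits a (books, 1) pair per book, and the Reducer adds up the individual counts into one total, just as in the library analogy.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BookCount {

  // Map ("you count shelf #1, I count shelf #2"): emit (books, 1) per book seen
  public static class ShelfMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("books");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (!line.toString().trim().isEmpty()) {
        context.write(KEY, ONE);
      }
    }
  }

  // Reduce ("we all get together"): add up the individual counts into one total
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "book count");
    job.setJarByClass(BookCount.class);
    job.setMapperClass(ShelfMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-totals per mapper, like one person per shelf
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Adding more map tasks (more people, more shelves) speeds up the map phase, while the single reduce key produces the one library-wide total.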
HADOOP YARN

YARN is a resource management platform. It
manages computing resources.

The platform is responsible for providing the
computational resources, such as CPUs,
memory and network I/O, which are needed
when an application executes.
HADOOP YARN

An application task has a number of sub-tasks.

Each sub-task uses the resources in allotted
time intervals.

YARN stands for Yet Another Resource
Negotiator
Hadoop 2 Execution Model

The figure shows the YARN-based execution model
and its components:

Client,

Resource Manager (RM),

Node Manager (NM),

Application Master (AM) and Containers.
Hadoop 2 Execution Model

The list of actions of the YARN resource allocation
and scheduling functions is as follows:

A MasterNode has two components:
(i) Job History Server
and
(ii) Resource Manager (RM).
Hadoop 2 Execution Model

A Client Node submits an application request
to the RM.

The RM is the master.

One RM exists per cluster.
Hadoop 2 Execution Model

The RM keeps information on all the slave NMs.

The information covers their location (rack
awareness) and the number of resources (data
blocks and servers) they have.
Hadoop 2 Execution Model

Multiple NMs exist in a cluster.

An NM creates an AM instance (AMI) and starts
it up.

The AMI initializes itself and registers with the
RM.

Multiple AMIs can be created in an AM.
Hadoop 2 Execution Model

The AMI performs the role of an Application
Manager (ApplM), which estimates the resource
requirements for running an application program
or sub-task.

The ApplMs send their requests for the
necessary resources to the RM.
Hadoop 2 Execution Model

The NM is a slave of the infrastructure.

It signals the RM whenever it initializes.

All active NMs send a controlling signal
periodically to the RM, signaling their presence.
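As a minimal sketch of how a client talks to the RM (not from the slides; it assumes a reachable YARN cluster and uses the YarnClient Java API), the following program lists the registered NMs known to the RM and asks the RM to grant a new application id:

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // The RM tracks every registered NM: location (rack awareness) and resources
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " rack=" + node.getRackName()
          + " capability=" + node.getCapability());
    }

    // Ask the RM for a new application; the submission context would
    // later describe the AM container to launch
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ApplicationId appId = ctx.getApplicationId();
    System.out.println("Granted application id: " + appId);

    yarnClient.stop();
  }
}
```

A full submission would go on to fill in the submission context (AM launch command, requested memory and cores) before calling the client's submit method; the sketch stops at the resource-negotiation handshake the slides describe.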
