Module 2 Hadoop Eco System
The Hadoop ecosystem is not a single tool; consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.
Data Node:
It acts as a slave daemon which runs on each slave machine.
The DataNodes act as the storage devices of the cluster.
It is responsible for serving read and write requests from the user.
It acts according to the instructions of the NameNode, which include deleting blocks, adding blocks, and replicating blocks.
It sends heartbeat reports to the NameNode regularly; by default this happens once every 3 seconds.
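As a minimal illustration of the DataNodes' role as storage devices, the sketch below (assuming a Hadoop client classpath with the cluster's core-site.xml/hdfs-site.xml, and a hypothetical existing file /user/demo/sample.txt) asks the NameNode which DataNodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockHosts {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from the configuration files on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/sample.txt");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // The NameNode returns, for each block, the DataNode hosts that store it.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " -> " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }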
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual DataNodes.
These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 64 MB (128 MB from Hadoop 2.x onwards), but it can be increased as per need by changing the HDFS configuration (a per-file example is sketched below).
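The block size is normally set cluster-wide through dfs.blocksize in hdfs-site.xml, but the HDFS client API also allows a per-file block size. A minimal sketch, assuming a reachable NameNode at the hypothetical address hdfs://namenode:9000 and an illustrative path /user/demo/sample.txt:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSizeWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the configured default
            short replication = 3;                // replication factor for this file
            int bufferSize = 4096;

            // create(path, overwrite, bufferSize, replication, blockSize) lets the
            // client choose the block size for this particular file at write time.
            Path file = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeUTF("block size demo");
            }
            fs.close();
        }
    }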
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware machines, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
2 YARN:
YARN is an acronym for Yet Another Resource Negotiator.
It handles the cluster of nodes and acts as Hadoop’s resource management unit. YARN allocates CPU, memory, and other resources to the different applications running on the cluster.
YARN has two components:
1. ResourceManager (Master) - This is the master daemon. It manages the
assignment of resources such as CPU, memory, and network bandwidth.
2. NodeManager (Slave) - This is the slave daemon, and it reports the resource
usage to the Resource Manager.
Resource Manager
• It works at the cluster level and is responsible for running on the master machine.
• It keeps track of the heartbeats received from the Node Manager.
• It accepts job submissions and negotiates the first container for executing an application.
• It consists of two components: the Application Manager and the Scheduler.
Node manager:
• It works at the node level and runs on every slave machine.
• It manages the containers in which applications run and monitors their resource usage.
• It sends regular heartbeats and resource-usage reports to the Resource Manager (a small client sketch that queries these reports follows below).
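As a minimal sketch of the ResourceManager/NodeManager split, the snippet below uses the standard YarnClient API (assuming yarn-site.xml on the classpath points at the cluster) to ask the ResourceManager for the reports it maintains about each running NodeManager:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListClusterNodes {
        public static void main(String[] args) throws Exception {
            // The client talks to the ResourceManager, never to NodeManagers directly.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // One NodeReport per running NodeManager, built from its heartbeats.
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId()
                        + "  memory=" + node.getCapability().getMemory() + " MB"
                        + "  vcores=" + node.getCapability().getVirtualCores());
            }
            yarnClient.stop();
        }
    }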
3 Map Reduce:
MapReduce is the processing layer of Hadoop. A MapReduce program works in two phases: the Map function processes the input data and produces intermediate key-value pairs, and the Reduce function aggregates those pairs to produce the final output.
Let us take an example to get a better understanding of a MapReduce program. We have a sample case of students and their respective departments, and we want to calculate the number of students in each department. Initially, the Map program executes and emits a key-value pair of the form (department, 1) for every student record. These key-value pairs are the input to the Reduce function. The Reduce function then aggregates the pairs for each department, calculates the total number of students in each department, and produces the result, as sketched in the code below.
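A minimal sketch of such a job against the standard Hadoop MapReduce Java API. The class names (DeptCount, DeptMapper, DeptReducer) and the assumption that every input line has the form "studentName,department" are illustrative, not part of the original notes:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DeptCount {

        // Map: each input line is assumed to be "studentName,department";
        // emit (department, 1) for every student record.
        public static class DeptMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text dept = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 2) {
                    dept.set(fields[1].trim());
                    context.write(dept, ONE);
                }
            }
        }

        // Reduce: sum the 1s for each department to get the student count.
        public static class DeptReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                context.write(key, new IntWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "students per department");
            job.setJarByClass(DeptCount.class);
            job.setMapperClass(DeptMapper.class);
            job.setCombinerClass(DeptReducer.class);
            job.setReducerClass(DeptReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as a combiner is safe here because summing counts is associative, so partial sums computed on the map side do not change the final result.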
4 HBASE:
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns.
HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.
Components of HBase:
There are two HBase components, namely the HBase Master and the RegionServer.
i. HBase Master
It is not part of the actual data storage, but negotiates load balancing across all RegionServers.
• Maintains and monitors the HBase cluster.
• Performs administration (provides an interface for creating, updating and deleting tables).
• Controls the failover.
• HMaster handles DDL operations.
ii. RegionServer
It is the worker node which handles read, write, update and delete requests from clients, as in the client sketch below. The RegionServer process runs on every node in the Hadoop cluster, co-located with the HDFS DataNode.
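A minimal client sketch using the standard HBase Java API. It assumes hbase-site.xml on the classpath (so the client can find the ZooKeeper quorum) and a pre-existing, hypothetical table named students with a column family info:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StudentTableDemo {
        public static void main(String[] args) throws Exception {
            // hbase-site.xml on the classpath supplies the ZooKeeper quorum for the cluster.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("students"))) {

                // Write: the HMaster is not on this path; the client talks to the
                // RegionServer that hosts the row's region.
                Put put = new Put(Bytes.toBytes("student001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("dept"), Bytes.toBytes("CS"));
                table.put(put);

                // Read the same row back.
                Result result = table.get(new Get(Bytes.toBytes("student001")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("dept"))));
            }
        }
    }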
5 HIVE:
• Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel at home while working in the Hadoop Ecosystem.
• Basically, HIVE is a data warehousing component which performs reading, writing and managing of large data sets in a distributed environment using an SQL-like interface.
HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
• The Hive Command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used by external applications to establish a connection to Hive (as in the JDBC sketch after this list).
• Hive is highly scalable, as it serves both purposes: large data set processing (batch query processing) and real-time processing (interactive query processing).
• It supports all primitive data types of SQL.
• You can use predefined functions, or write tailored user-defined functions (UDFs), to accomplish your specific needs.
The main parts of Hive are the Hive Command Line interface and the JDBC/ODBC driver.
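A minimal sketch of the JDBC path, assuming a HiveServer2 instance at the hypothetical address hiveserver2-host:10000, the Hive JDBC driver on the classpath, and an illustrative students table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcDemo {
        public static void main(String[] args) throws Exception {
            // HQL is submitted over JDBC exactly like SQL over a normal database connection.
            String url = "jdbc:hive2://hiveserver2-host:10000/default";
            try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT dept, COUNT(*) AS total FROM students GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString("dept") + "\t" + rs.getLong("total"));
                }
            }
        }
    }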
6 PIG
Apache Pig is a Hadoop ecosystem component for analysing large data sets. It provides a high-level scripting language called Pig Latin, and Pig scripts are internally converted into MapReduce jobs that run on the cluster.
7 SQOOP
Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
8 ZOOKEEPER:
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group services.
Zookeeper manages and coordinates a large cluster of machines.
Features of Zookeeper:
• Fast – Zookeeper performs best with workloads where reads are more common than writes; the ideal read-to-write ratio is about 10:1.
• Ordered – Zookeeper maintains a record of all transactions.
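A minimal sketch with the standard ZooKeeper Java client, assuming a quorum reachable at the hypothetical address zk-host:2181 and that the znode /app-config does not already exist; it stores one piece of configuration information and reads it back:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // 3000 ms session timeout; the watcher fires once the session is established.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store a small piece of configuration under a znode, then read it back.
            String path = "/app-config";
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }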
9 FLUME:
A Flume agent ingests streaming data from various data sources into HDFS. A web server is a typical data source, and Twitter is one of the well-known sources of streaming data.
The flume agent has 3 components: source, sink and channel.
1. Source: it accepts the data from the incoming stream and stores it in the channel.
2. Channel: it acts as the local, primary storage; a channel is temporary storage between the source of the data and the data persisted in HDFS.
3. Sink: our last component, the Sink, collects the data from the channel and commits or writes it to HDFS permanently.
10 OOZIE:
Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Apache Hadoop jobs, Oozie works as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow: These are sequential sets of actions to be executed. You can think of a workflow as a relay race, where each athlete waits for the previous one to complete his part.
2. Oozie Coordinator: These are Oozie jobs which are triggered when the data is made available to them. Think of this as the stimulus-response system in our body: in the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.
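A minimal, hedged sketch of submitting a workflow with the Oozie Java client API. The Oozie URL, the HDFS application path, and the nameNode/jobTracker property values are placeholders, and the workflow.xml is assumed to already exist under the application path:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflowDemo {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            // Job properties: where the workflow definition lives and the cluster endpoints.
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/dept-count-wf");
            conf.setProperty("nameNode", "hdfs://namenode:9000");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // run() submits and starts the workflow; Oozie then drives its actions in order.
            String jobId = client.run(conf);
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println("Submitted " + jobId + " with status " + job.getStatus());
        }
    }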