Module 2 Hadoop Eco System
The Hadoop ecosystem is not a single tool; consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.
Data Node:
It acts as a slave daemon which runs on each slave machine.
The DataNodes act as the storage devices of the cluster.
It is responsible for serving read and write requests from the user.
It acts according to the instructions of the NameNode, which include deleting blocks, adding blocks, and replicating blocks.
It sends heartbeat reports to the NameNode regularly; by default this happens once every 3 seconds.
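As a minimal illustration of the DataNodes' role as storage devices, the sketch below (assuming a Hadoop client classpath with the cluster's core-site.xml/hdfs-site.xml, and a hypothetical existing file /user/demo/sample.txt) asks the NameNode which DataNodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockHosts {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from the configuration files on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/sample.txt");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // The NameNode returns, for each block, the DataNode hosts that store it.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " -> " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }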
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual DataNodes.
These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 64 MB (128 MB from Hadoop 2.x onwards), but it can be increased as per need by changing the HDFS configuration (a per-file example is sketched below).
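The block size is normally set cluster-wide through dfs.blocksize in hdfs-site.xml, but the HDFS client API also allows a per-file block size. A minimal sketch, assuming a reachable NameNode at the hypothetical address hdfs://namenode:9000 and an illustrative path /user/demo/sample.txt:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSizeWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the configured default
            short replication = 3;                // replication factor for this file
            int bufferSize = 4096;

            // create(path, overwrite, bufferSize, replication, blockSize) lets the
            // client choose the block size for this particular file at write time.
            Path file = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeUTF("block size demo");
            }
            fs.close();
        }
    }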
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware machines, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
2 YARN:
YARN is an acronym for Yet Another Resource Negotiator.
It handles the cluster of nodes and acts as Hadoop’s resource management unit. YARN allocates CPU, memory, and other resources to the different applications running on the cluster.
YARN has two components:
1. ResourceManager (Master) - This is the master daemon. It manages the
assignment of resources such as CPU, memory, and network bandwidth.
2. NodeManager (Slave) - This is the slave daemon, and it reports the resource
usage to the Resource Manager.
Resource Manager
• It works at the cluster level and is responsible for running on the master machine.
• It keeps track of the heartbeats received from the Node Manager.
• It accepts job submissions and negotiates the first container for executing an application.
• It consists of two components: the Application Manager and the Scheduler.
Node manager:
• It works at the node level and runs on every slave machine.
• It manages the containers in which applications run and monitors their resource usage.
• It sends regular heartbeats and resource-usage reports to the Resource Manager (a small client sketch that queries these reports follows below).
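As a minimal sketch of the ResourceManager/NodeManager split, the snippet below uses the standard YarnClient API (assuming yarn-site.xml on the classpath points at the cluster) to ask the ResourceManager for the reports it maintains about each running NodeManager:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListClusterNodes {
        public static void main(String[] args) throws Exception {
            // The client talks to the ResourceManager, never to NodeManagers directly.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // One NodeReport per running NodeManager, built from its heartbeats.
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId()
                        + "  memory=" + node.getCapability().getMemory() + " MB"
                        + "  vcores=" + node.getCapability().getVirtualCores());
            }
            yarnClient.stop();
        }
    }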
3 Map Reduce:
MapReduce is the processing layer of Hadoop. A MapReduce program works in two phases: the Map function processes the input data and produces intermediate key-value pairs, and the Reduce function aggregates those pairs to produce the final output.
Let us take an example to get a better understanding of a MapReduce program. We have a sample case of students and their respective departments, and we want to calculate the number of students in each department. Initially, the Map program executes and emits a key-value pair of the form (department, 1) for every student record. These key-value pairs are the input to the Reduce function. The Reduce function then aggregates the pairs for each department, calculates the total number of students in each department, and produces the result, as sketched in the code below.
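A minimal sketch of such a job against the standard Hadoop MapReduce Java API. The class names (DeptCount, DeptMapper, DeptReducer) and the assumption that every input line has the form "studentName,department" are illustrative, not part of the original notes:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DeptCount {

        // Map: each input line is assumed to be "studentName,department";
        // emit (department, 1) for every student record.
        public static class DeptMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text dept = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 2) {
                    dept.set(fields[1].trim());
                    context.write(dept, ONE);
                }
            }
        }

        // Reduce: sum the 1s for each department to get the student count.
        public static class DeptReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                context.write(key, new IntWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "students per department");
            job.setJarByClass(DeptCount.class);
            job.setMapperClass(DeptMapper.class);
            job.setCombinerClass(DeptReducer.class);
            job.setReducerClass(DeptReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as a combiner is safe here because summing counts is associative, so partial sums computed on the map side do not change the final result.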
4 HBASE:
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns.
HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.
Components of HBase:
There are two HBase components, namely the HBase Master and the RegionServer.
i. HBase Master
It is not part of the actual data storage, but negotiates load balancing across all RegionServers.
• Maintains and monitors the HBase cluster.
• Performs administration (provides an interface for creating, updating and deleting tables).
• Controls the failover.
• HMaster handles DDL operations.
ii. RegionServer
It is the worker node which handles read, write, update and delete requests from clients, as in the client sketch below. The RegionServer process runs on every node in the Hadoop cluster, co-located with the HDFS DataNode.
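A minimal client sketch using the standard HBase Java API. It assumes hbase-site.xml on the classpath (so the client can find the ZooKeeper quorum) and a pre-existing, hypothetical table named students with a column family info:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StudentTableDemo {
        public static void main(String[] args) throws Exception {
            // hbase-site.xml on the classpath supplies the ZooKeeper quorum for the cluster.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("students"))) {

                // Write: the HMaster is not on this path; the client talks to the
                // RegionServer that hosts the row's region.
                Put put = new Put(Bytes.toBytes("student001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("dept"), Bytes.toBytes("CS"));
                table.put(put);

                // Read the same row back.
                Result result = table.get(new Get(Bytes.toBytes("student001")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("dept"))));
            }
        }
    }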
5 HIVE:
• Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel at home while working in the Hadoop Ecosystem.
• Basically, HIVE is a data warehousing component which performs reading, writing and managing of large data sets in a distributed environment using an SQL-like interface.
HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
• The Hive Command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used by external applications to establish a connection to Hive (as in the JDBC sketch after this list).
• Hive is highly scalable, as it serves both purposes: large data set processing (batch query processing) and real-time processing (interactive query processing).
• It supports all primitive data types of SQL.
• You can use predefined functions, or write tailored user-defined functions (UDFs), to accomplish your specific needs.
The main parts of Hive are the Hive Command Line interface and the JDBC/ODBC driver.
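A minimal sketch of the JDBC path, assuming a HiveServer2 instance at the hypothetical address hiveserver2-host:10000, the Hive JDBC driver on the classpath, and an illustrative students table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcDemo {
        public static void main(String[] args) throws Exception {
            // HQL is submitted over JDBC exactly like SQL over a normal database connection.
            String url = "jdbc:hive2://hiveserver2-host:10000/default";
            try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT dept, COUNT(*) AS total FROM students GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString("dept") + "\t" + rs.getLong("total"));
                }
            }
        }
    }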
6 PIG
Apache Pig is a Hadoop ecosystem component for analysing large data sets. It provides a high-level scripting language called Pig Latin, and Pig scripts are internally converted into MapReduce jobs that run on the cluster.
7 SQOOP
Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
8 ZOOKEEPER:
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group services.
Zookeeper manages and coordinates a large cluster of machines.
Features of Zookeeper:
• Fast – Zookeeper performs best with workloads where reads are more common than writes; the ideal read-to-write ratio is about 10:1.
• Ordered – Zookeeper maintains a record of all transactions.
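A minimal sketch with the standard ZooKeeper Java client, assuming a quorum reachable at the hypothetical address zk-host:2181 and that the znode /app-config does not already exist; it stores one piece of configuration information and reads it back:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // 3000 ms session timeout; the watcher fires once the session is established.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store a small piece of configuration under a znode, then read it back.
            String path = "/app-config";
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }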
9 FLUME:
A Flume agent ingests streaming data from various data sources into HDFS. A web server is a typical data source, and Twitter is one of the well-known sources of streaming data.
The flume agent has 3 components: source, sink and channel.
1. Source: it accepts the data from the incoming stream and stores it in the channel.
2. Channel: it acts as the local, primary storage; a channel is temporary storage between the source of the data and the data persisted in HDFS.
3. Sink: our last component, the Sink, collects the data from the channel and commits or writes it to HDFS permanently.
10 OOZIE:
Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Apache Hadoop jobs, Oozie works as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow: These are sequential sets of actions to be executed. You can think of a workflow as a relay race, where each athlete waits for the previous one to complete his part.
2. Oozie Coordinator: These are Oozie jobs which are triggered when the data is made available to them. Think of this as the stimulus-response system in our body: in the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.
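A minimal, hedged sketch of submitting a workflow with the Oozie Java client API. The Oozie URL, the HDFS application path, and the nameNode/jobTracker property values are placeholders, and the workflow.xml is assumed to already exist under the application path:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflowDemo {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            // Job properties: where the workflow definition lives and the cluster endpoints.
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/dept-count-wf");
            conf.setProperty("nameNode", "hdfs://namenode:9000");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // run() submits and starts the workflow; Oozie then drives its actions in order.
            String jobId = client.run(conf);
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println("Submitted " + jobId + " with status " + job.getStatus());
        }
    }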