Hadoop Ecosystem
Introduction: The Hadoop ecosystem is a platform, or a suite, which provides various services to solve
big data problems. It includes Apache projects and various commercial tools and solutions.
There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common
Utilities. Most of the other tools or solutions are used to supplement or support these major elements.
All these tools work collectively to provide services such as ingestion, analysis, storage, and
maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
• YARN:
• Yet Another Resource Negotiator, as the name implies, YARN is the one that helps
to manage the resources across the clusters.
• In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.
• Resource Manager
• Node Manager
• Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in
the system, whereas the Node Managers work on the allocation of resources such as
CPU, memory, and bandwidth per machine and later acknowledge the Resource
Manager. The Application Manager works as an interface between the Resource
Manager and the Node Managers and performs negotiations as per the requirements
of the two. A minimal client sketch follows this list.
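As a rough illustration of how a client talks to the Resource Manager, the sketch below (assuming a reachable YARN cluster and the Hadoop client libraries on the classpath) lists the running Node Managers and the resources each one reports.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeReport describes one Node Manager and the CPU/memory it offers.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("%s -> %d MB, %d vcores%n",
                    node.getNodeId(),
                    node.getCapability().getMemorySize(),
                    node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}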
Hadoop Ecosystem
• Spark:
• It's a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, and visualization, etc.
• It consumes in-memory resources and is hence faster than MapReduce in terms of
optimization.
• Spark is best suited for real-time data whereas Hadoop (MapReduce) is best suited for
structured data or batch processing, hence both are used in most companies, often in
combination. A word-count sketch in Spark's Java API follows this list.
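To make the contrast concrete, here is a minimal word-count sketch using Spark's Java API; the HDFS input and output paths are placeholders, and a local master is assumed purely for illustration.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder input path; in practice this points at data already in HDFS.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
                    .reduceByKey(Integer::sum);                                    // sum counts per word
            counts.saveAsTextFile("hdfs:///data/output"); // placeholder output path
        }
    }
}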
• Apache HBase:
• It's a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database.
• It provides capabilities similar to Google's BigTable, and is thus able to work on big
data sets effectively.
• At times when we need to search for or retrieve the occurrences of something small in a
huge database, the request must be processed within a short span of time.
• At such times, HBase comes in handy as it gives us a tolerant way of storing limited data
and retrieving it quickly, as in the client sketch below.
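A minimal sketch of such a point lookup with the HBase Java client; the table name "users", row key, and column family "info" are made-up examples, and hbase-site.xml is assumed to be on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table users = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            // Fetch a single row by key: HBase answers this without scanning the whole table.
            Result row = users.get(new Get(Bytes.toBytes("user#42")));
            byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(name == null ? "not found" : Bytes.toString(name));
        }
    }
}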
Hadoop Ecosystem
• Other Components: Apart from all of these, there are some other components too
that play a big role in making Hadoop capable of processing large
datasets. They are as follows:
• Solr, Lucene: These are the two services that perform the task of searching and
indexing with the help of Java libraries. Lucene, in particular, is written in Java
and also provides a spell-check mechanism. Solr is built on top of Lucene. A small
indexing-and-search sketch follows this entry.
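As a rough illustration (an in-memory index and a single text field, both chosen only for this example, and a recent Lucene version assumed), indexing and searching with Lucene looks roughly like this:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the demo

        // Index one document with a single analysed text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Hadoop ecosystem components", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index for the term "hadoop".
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new QueryParser("content", new StandardAnalyzer()).parse("hadoop"), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}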
• Zookeeper: There was a huge issue with the management of coordination and
synchronization among the resources or the components of Hadoop, which often
resulted in inconsistency.
• Zookeeper overcame all these problems by performing synchronization, inter-
component communication, grouping, and maintenance; a small client sketch follows
this entry.
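A minimal sketch of the coordination primitive ZooKeeper exposes (znodes), using its Java client; the connection string and the /hadoop-demo path are placeholders, and production code would additionally wait for the session-established event before issuing requests.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble; the watcher simply ignores events here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Create a znode that other components can watch to coordinate on shared state.
        String path = zk.create("/hadoop-demo", "ready".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any component can read the same znode and stay in sync.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));
        zk.close();
    }
}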
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit.
• There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs.
• Oozie workflow jobs are those that need to be executed in a sequentially ordered
manner, whereas Oozie coordinator jobs are those that are triggered when some
data or an external stimulus is given to them. A submission sketch using the Oozie
Java client follows.
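For illustration, submitting a pre-deployed workflow through the Oozie Java client might look like the sketch below; the Oozie URL and the HDFS application path are placeholders, and the workflow.xml is assumed to already exist at that path.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server's REST endpoint.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow.xml lives and any parameters it expects.
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/wordcount"); // placeholder path
        props.setProperty("queueName", "default");

        // run() submits and starts the workflow; Oozie then executes its actions in order.
        String jobId = oozie.run(props);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + job.getStatus());
    }
}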