Hadoop Ecosystem

Introduction: The Hadoop ecosystem is a platform, or suite, that provides various services to solve
big data problems. It includes Apache projects as well as various commercial tools and solutions.
There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common
utilities. Most of the other tools or solutions are used to supplement or support these major elements.
All these tools work collectively to provide services such as ingestion, analysis, storage, and
maintenance of data.
The following components collectively form the Hadoop ecosystem:

•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming-based data processing
•Spark: In-memory data processing
•PIG, HIVE: Query-based processing of data services
•HBase: NoSQL database
•Mahout, Spark MLlib: Machine learning algorithm libraries
•Solr, Lucene: Searching and indexing
•Zookeeper: Cluster management
•Oozie: Job scheduling
Hadoop Ecosystem
• HDFS:

• HDFS is the primary or major component of the Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured
data across various nodes, thereby maintaining the metadata in
the form of log files.
• HDFS consists of two core components, i.e.
• Name Node
• Data Node
• The Name Node is the prime node which contains the metadata (data about
data), requiring comparatively fewer resources than the Data Nodes,
which store the actual data. These Data Nodes are commodity
hardware in the distributed environment, which undoubtedly makes
Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and
hardware, thus working at the heart of the system.
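As a concrete illustration, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address and file path are hypothetical placeholders; on a real cluster the address would come from core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");
        // Write a small file into HDFS (overwrite if it exists).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS!");
        }
        // Read it back; the Name Node resolves which Data Nodes hold the blocks.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```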
Hadoop Ecosystem
• YARN:

• Yet Another Resource Negotiator. As the name implies, YARN is the component that
helps to manage the resources across the clusters.
• In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.
• Resource Manager
• Node Manager
• Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in
the system, whereas Node Managers work on the allocation of resources such as
CPU, memory, and bandwidth per machine, and later report back to the Resource
Manager. The Application Manager works as an interface between the Resource
Manager and the Node Managers and performs negotiations as per the requirements
of the two.
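To make this division of labor concrete, the sketch below (assuming a running cluster whose yarn-site.xml is on the classpath) uses the YarnClient API to ask the Resource Manager which Node Managers are up and what capacity each offers:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Picks up the ResourceManager address from yarn-site.xml on the classpath.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the Resource Manager for all currently running Node Managers.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
        }
        yarnClient.stop();
    }
}
```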
Hadoop Ecosystem
• MapReduce:

• By making use of distributed and parallel algorithms,
MapReduce makes it possible to carry the processing logic out
over the cluster and helps to write applications that transform big data
sets into manageable ones.
• MapReduce makes use of two functions, i.e. Map() and
Reduce(), whose tasks are:
• Map() performs sorting and filtering of data, thereby organizing
it into groups. Map generates key-value pair based results
which are later processed by the Reduce() method.
• Reduce(), as the name suggests, does the summarization by
aggregating the mapped data. In short, Reduce() takes the output
generated by Map() as input and combines those tuples into a smaller set
of tuples.
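The classic word count shows both functions in action. This is essentially the canonical WordCount from the Hadoop MapReduce tutorial: Map() emits (word, 1) pairs, and Reduce() sums them per word:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map(): emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): sum the counts for each word (the framework groups values by key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```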
Hadoop Ecosystem
• PIG:
• Pig was originally developed by Yahoo. It works on Pig Latin,
a query-based language similar to SQL.
• It is a platform for structuring the data flow, and for processing and
analyzing huge data sets.
• Pig does the work of executing commands, and in the
background all the activities of MapReduce are taken care of.
After the processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and
runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and
hence is a major segment of the Hadoop ecosystem.
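As a small taste of Pig Latin, here is a sketch that embeds a script through Pig's PigServer Java API in local mode; the access_log.txt input file and its field layout are hypothetical:

```java
import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // "local" runs Pig against the local filesystem; use "mapreduce" on a cluster.
        PigServer pig = new PigServer("local");

        // Each registerQuery line is plain Pig Latin.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:int);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Iterate over the result; behind the scenes Pig compiles this to MapReduce jobs.
        Iterator<Tuple> it = pig.openIterator("totals");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```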
Hadoop Ecosystem
• HIVE:

• With the help of SQL methodology and an SQL-like interface, HIVE performs
reading and writing of large data sets. Its query
language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch
processing. Also, all the SQL data types are supported by
Hive, making query processing easier.
• Similar to other query-processing frameworks, HIVE comes with
two components: JDBC drivers and the HIVE command line.
• JDBC, along with ODBC drivers, works on establishing the data
storage permissions and connection, whereas the HIVE command line
helps in the processing of queries.
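Since the JDBC driver is one of the two components above, here is a minimal sketch of querying HiveServer2 over JDBC; the host, credentials, and the access_log table are hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver class.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks almost exactly like SQL.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT user, COUNT(*) AS visits FROM access_log GROUP BY user")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```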
Hadoop Ecosystem
• Mahout:

• Mahout brings machine learnability to a system or application.
Machine learning, as the name suggests, helps a system to
develop itself based on patterns, user/environmental
interaction, or algorithms.
• It provides various libraries and functionalities such as
collaborative filtering, clustering, and classification, which are
core concepts of machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
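For instance, user-based collaborative filtering with Mahout's classic Taste API looks roughly like the sketch below. The ratings.csv file (lines of userID,itemID,preference) is a hypothetical placeholder, and this API has been deprecated in more recent Mahout releases:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical) holds lines of: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when recommending.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```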
Hadoop Ecosystem
• Apache Spark:

• It's a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, visualization, etc.
• It consumes in-memory resources, and is hence faster than earlier disk-based
processing in terms of optimization.
• Spark is best suited for real-time data whereas Hadoop MapReduce is best suited for
structured data or batch processing; hence most companies use both
interchangeably.
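For comparison with the MapReduce version above, the same word count in Spark's Java API is much shorter because intermediate results stay in memory; input.txt and the local master are hypothetical placeholders:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process; on a cluster this would point at YARN.
        SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new scala.Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```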
• Apache HBase:

• It's a NoSQL database which supports all kinds of data, and is thus capable of
handling anything within a Hadoop database.
• It provides capabilities similar to Google's BigTable, and is thus able to work on big
data sets effectively.
• At times when we need to search for or retrieve a few occurrences of something small
in a huge database, the request must be processed within a short span of time.
• At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
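Here is a minimal sketch of that fast point lookup through the HBase Java client. A users table with an info column family is assumed to already exist, and hbase-site.xml is assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Assumed: a 'users' table with an 'info' column family.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column info:name.
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Aijaz"));
            table.put(put);

            // Point lookup by row key: the fast "small read" HBase is built for.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```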
Hadoop Ecosystem
• Other Components: Apart from all of these, there are some other components
that carry out important tasks in order to make Hadoop capable of processing large
data sets. They are as follows:

• Solr, Lucene: These are two services that perform the tasks of searching and
indexing with the help of Java libraries. Lucene in particular is based on Java
and also provides a spell-check mechanism. Solr is built on top of Lucene.
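A minimal sketch of indexing and searching through Solr's SolrJ client; the server URL and the docs core are hypothetical placeholders, and Lucene handles the inverted index underneath:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core named "docs" on a local server.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build()) {

            // Index one document; Lucene builds the inverted index underneath.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Hadoop Ecosystem");
            solr.add(doc);
            solr.commit();

            // Full-text search against the indexed field.
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id") + " -> " + d.getFieldValue("title"));
            }
        }
    }
}
```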
• Zookeeper: There was a huge issue with the management of coordination and
synchronization among the resources and components of Hadoop, which often
resulted in inconsistency.
• Zookeeper overcame all these problems by performing synchronization, inter-
component communication, grouping, and maintenance.
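A minimal sketch of the coordination primitive ZooKeeper offers, assuming a local server on the default port: components share small, consistent pieces of state (znodes) instead of coordinating ad hoc:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Block until the session with the (hypothetical) local server is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // A znode acts as a tiny piece of shared, consistent state for coordination.
        String path = zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Every component that reads this path sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));
        zk.close();
    }
}
```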
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit.
• There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs.
• Oozie workflow jobs are those that need to be executed in a sequentially ordered
manner, whereas Oozie coordinator jobs are those that are triggered when some
data or external stimulus is given to them.
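A minimal sketch of submitting a workflow through the Oozie Java client; the server URL, HDFS application path, and job properties are hypothetical placeholders for an already deployed workflow.xml:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Hypothetical HDFS path containing the deployed workflow.xml.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow; Oozie runs its actions as one bound unit.
        String jobId = oozie.run(conf);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " is " + job.getStatus());
    }
}
```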
