
Hadoop Ecosystem

Unit II, Chapter 4

Mr. M.S.Emmi,
Faculty,
Department of MCA,
KLS GIT, Belagavi.
Contents
• Understanding Hadoop Ecosystem
• Hadoop Distributed File System
• HDFS Architecture
• Concept of Blocks in HDFS Architecture
• NameNodes and DataNodes
• The Command-Line Interface
• Using HDFS Files
• Hadoop-Specific File System Types
• HDFS Commands
• The org.apache.hadoop.io package
• HDFS High availability: Features of HDFS.
Traditional Approach
In the traditional approach, an enterprise uses a single computer to store and process big data. For storage, programmers rely on the database vendor of their choice, such as Oracle or IBM. In this approach, the user interacts with the application, which in turn handles data storage and analysis.

Limitation
This approach works fine for applications that process less voluminous data, i.e. data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of scalable data, pushing everything through a single database server becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides a task into small parts, assigns them to many computers, and collects the results from them; when integrated, these form the result dataset.
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across the nodes of a cluster. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
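To make the MapReduce model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class name and the input/output paths taken from the command line are illustrative; the point is the structure: a Mapper that emits (word, 1) pairs, a Reducer that sums them, and a driver that configures the Job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name and paths here are placeholders).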
Advantages of Hadoop

• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.

• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

• Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.

• Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.
How are distributed databases and Hadoop different?

➢ Distributed databases
• Deal with tables and relations
• Must have a schema for the data
• Implement data fragmentation and partitioning
• Use the notion of a transaction
• Implement ACID transaction properties
• Allow distributed transactions

➢ Hadoop
• Deals with flat files in any format
• Operates with no schema for the data
• Divides files automatically into blocks (see the sketch after this list)
• Uses the notion of a job divided into tasks
• Implements the MapReduce computing model
• Considers every task as either a map or a reduce
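To illustrate the point about blocks: the sketch below (a minimal example; the file path is hypothetical and a reachable, configured HDFS cluster is assumed) asks the NameNode which blocks make up a file and on which machines their replicas live, using Hadoop's Java FileSystem API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
  public static void main(String[] args) throws Exception {
    // Assumes an HDFS cluster configured via core-site.xml on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-input.csv"); // hypothetical file

    FileStatus status = fs.getFileStatus(file);
    // Ask the NameNode which blocks make up the file and where their replicas live.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```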
Understanding Hadoop Ecosystem
• So exactly what is Hadoop?
• “Hadoop is a framework that allows for the distributed processing of data sets across clusters of computers using simple programming models.”
• Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
Understanding Hadoop Ecosystem

The Hadoop ecosystem can be defined as a “comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.”

MapReduce and the Hadoop Distributed File System (HDFS) are two components of the Hadoop ecosystem. Along with these two, the ecosystem provides a collection of various elements to support the complete development and deployment of Big Data solutions.

The figure depicts the elements of the Hadoop ecosystem.
❖ HDFS: The storage layer of Hadoop; data is stored across different machines in a distributed fashion (a small Java example of reading an HDFS file follows this list).
❖ MapReduce: Helps in processing data and deriving valuable results. Since Hadoop 2, resource management for MapReduce jobs has been handled by YARN (Yet Another Resource Negotiator).
❖ Sqoop: A mechanism to get data from relational databases into Hadoop (and back). It provides import and export utilities.
❖ Flume: Helps to get unstructured data, such as logs, into Hadoop.
❖ Hive: A high-level language, or wrapper, on top of MapReduce, based on writing logic-driven, SQL-like queries. (Created by Facebook)
❖ Pig: Provides a high-level API to process data, speed up coding and make it handier. (An English-like language, created by Yahoo)
❖ Mahout: The machine learning component.
❖ R connectors: Provide support for statistical and mathematical calculations.
❖ Ambari: An open-source mechanism to create, provision, manage and monitor clusters.
❖ ZooKeeper: Provides coordination and synchronization between the tools and components of Hadoop.
❖ Oozie: Schedules jobs, i.e., manages workflows.
❖ HBase: Structures data into columns and sits on top of HDFS, providing a reference to HDFS data.
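As a concrete illustration of the HDFS entry above, here is a minimal sketch that reads a file stored in HDFS through Hadoop's Java FileSystem API. The NameNode address and the file path are assumptions for the example; in a real deployment they would come from the cluster's configuration files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/input/sample.txt"); // hypothetical path

    // Open the file and print it line by line; the client fetches the blocks
    // from whichever DataNodes hold their replicas.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```

The same file could be inspected from the command line with hdfs dfs -cat /user/demo/input/sample.txt.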
All these elements enable users to process large datasets in real time and provide tools to support various types of Hadoop projects, schedule jobs and manage cluster resources.

The figure depicts how the various elements of Hadoop are involved at the various stages of processing data.

MapReduce and HDFS provide the necessary services and basic structure to deal with the core requirements of Big Data solutions. Other services and tools of the ecosystem provide the environment and components required to build and manage purpose-driven Big Data applications.
