Apache Hadoop

Hadoop is an open source framework that allows distributed processing of large datasets across clusters of computers. It uses HDFS for storage and YARN for resource management and scheduling. MapReduce is used for distributed processing, where the map function performs operations in parallel and reduce combines the results.

WHAT IS APACHE HADOOP

• Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
• It is basically a collection of open source components for processing Big Data.
BIG DATA
• A term that describes the large volume of data, both structured and unstructured, that is generated by businesses on a day-to-day basis.
• It is a collection of data sets so large and complex that it becomes difficult to
process using traditional data processing applications within the given time
frame.
• Big data challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, and updating.
CHARACTERISTICS…
• Volume : The quantity of generated and stored data (petabytes, exabytes).
• Variety : Composed of structured, semi-structured and unstructured data.
• Velocity : The speed at which the data is generated.
• Veracity : The quality of the data, which may be imprecise or uncertain.
• Variability : The inconsistency in the data.
SOME EXAMPLES…
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• The New York Stock Exchange generates one terabyte of data per day.
• The Large Hadron Collider produces 15 petabytes of data per year.
THE PROBLEM….
• A typical disk transfer speed is around 100 MB/s.
• A standard disk holds about 1 terabyte, which is small in comparison to large data sets.
• The obvious solution is to use multiple machines working in parallel, fragmenting the problem into pieces (see the estimate below).
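As a rough back-of-the-envelope estimate: reading a full 1 TB disk at 100 MB/s takes about 1,000,000 MB ÷ 100 MB/s ≈ 10,000 seconds, close to three hours, before any processing even begins. Reading 100 disks in parallel brings this down to under two minutes, which is the core motivation for distributing both storage and computation.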
DISTRIBUTED PROCESSING
• Distributed data processing is a computer-networking method in which data
is distributed over multiple computers across different locations and they
share computer-processing capability.
• Distributed Processing utilizes a network of many machines each
accomplishing a portion of an overall task to achieve a computational result
much more quickly than with a single computer.
HADOOP BASICS…
• Designed to answer the question: “How to process big data with
reasonable cost and time?”
• Reliable shared storage and analysis system.
• Efficient, automatic distribution of data.
• Provides a simplified programming model which allows the user to quickly
write and test distributed systems.
HADOOP ECOSYSTEM
• Hadoop ecosystem includes a set of official Apache open source projects
and a number of commercial tools and solutions.
• Core elements of the Apache Hadoop system are HDFS, YARN,
MapReduce.
• Spark, Hive, Oozie, Pig, and Sqoop are a few of the popular open source tools.
ECOSYSTEM
• The Hadoop Ecosystem consists of various technologies and Hadoop components that together can solve complex data problems easily.
HDFS(Hadoop Distributed File System)
• Manages the storage of large amounts of data across a network of machines.
• Built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
• Provides interfaces for applications to move themselves closer to where the data is located.
• Fault tolerant : data blocks are replicated across nodes, and a heartbeat is sent by each Data Node to the Name Node (see the sketch below).
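A minimal sketch of the write-once, read-many pattern using the Hadoop Java FileSystem API (the file path and contents are illustrative; it assumes the client's configuration points at a running HDFS cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // Write once: the file is created, written sequentially, then closed.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello, HDFS");
    }

    // Read many times: any client in the cluster can now open the file.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}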
HDFS ARCHITECTURE
• Based on Master-Slave
Architecture.
• Each server works as a
node, so each node has the
computing power.
MASTER NODE

• The Name Node stores the metadata (file system namespace and block locations).
• Determines the mapping of data blocks to the Data Nodes.
• Issues the instructions under which the slave nodes execute their tasks.
SLAVE NODE
• Data Nodes store the actual data.
• Responsible for serving read and write requests from the client.
• Block creation, deletion, and replication.
• TaskTracker:
• Executes tasks upon instruction from the Master.
• Handles data motion between the map and reduce phases.
YARN(Yet Another Resource Negotiator)
• The brain of the Hadoop ecosystem : all processing is coordinated here, including resource allocation, job scheduling, and activity processing.
• YARN’s dynamic allocation of cluster resources improves utilization.
• YARN’s Resource Manager focuses exclusively on scheduling.
YARN ARCHITECTURE
• Components of YARN :
• Resource Manager
• Node Manager
• Application Master
RESOURCE MANAGER
• Arbitrator of all cluster resources.
• TWO PARTS :
• 1. Scheduler : Responsible for allocating resources to the various running applications.
• 2. Applications Manager : Responsible for accepting job submissions.
NODE MANAGER
• Per-machine framework agent that is responsible for containers and for monitoring their resource usage.
• It reports this usage to the Resource Manager/Scheduler.
APPLICATION MASTER
• Application Master is where the job resides.
• The per-application Application Master is a framework-specific library tasked with negotiating resources from the Resource Manager.
• Works with the Node Manager(s) to execute and monitor the tasks.
• Works as a job life-cycle manager.
MAP REDUCE
• A combination of two operations, named Map and Reduce.
• "Map" sends a query to the various data nodes for processing, and "Reduce" collects the results of these queries.
• The Map function performs grouping, sorting and filtering operations, while the Reduce function summarizes and aggregates the results produced by the Map function.
EXAMPLE…
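As a concrete illustration, below is a minimal sketch of the classic WordCount job written against the Hadoop Java MapReduce API (the class names and the input/output paths passed on the command line are illustrative). The Mapper emits a (word, 1) pair for every token, the framework shuffles and groups the pairs by word, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: receives all counts for one word (after the shuffle) and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The compiled job is typically packaged into a jar and submitted with the hadoop jar command, for example hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are illustrative).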
SOME IMPORTANT POINTS….
• Input data can be divided into a number of chunks (splits) depending on the amount of data.
• All the chunks are processed in parallel.
• A shuffle phase then groups together records with the same key.
• Reducers combine these groups to produce a consolidated output according to the job's logic.
OTHER COMPONENTS
• Apache Pig : A procedural language, an alternative to Java, used to process large data sets in parallel.
• HBase : An open source, non-relational (NoSQL) database that can handle any data type inside a Hadoop system.
• Mahout : Provides an environment for developing machine learning applications that perform filtering, clustering and classification.
• Zookeeper : Known as the king of coordination; provides reliable, fast and organized operational services for Hadoop clusters.
COMPONENTS…
• Oozie : Performs the job scheduling and works like an alarm and clock
service inside the Hadoop Ecosystem.
• Ambari : Makes the Hadoop ecosystem more manageable by managing,
monitoring, and provisioning of the Hadoop clusters.
• Hive : Gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop (see the sketch after this list).
• Sqoop : Command-line interface application for transferring data between
relational databases and Hadoop.
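A hedged sketch of the Hive item above: a small Java client querying Hive through JDBC. The host, port, database, and table names are illustrative, and it assumes a running HiveServer2 with the Hive JDBC driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (assumed to be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Illustrative connection string: default HiveServer2 port, "default" database.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement();
         // Illustrative HiveQL: an SQL-like aggregation over a hypothetical "sales" table.
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}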
REAL-WORLD USE CASES
• Financial services companies use it to assess risks and build investment models.
• Retail websites use it to analyze structured and unstructured data in order to better understand and serve their customers.
• Companies can even use it to understand what people think about them, through data mining and machine learning.
• Companies such as Amazon, Microsoft and Intel use Hadoop to store and analyze their data.
ADVANTAGES OF HADOOP
• Scalability : A highly scalable storage platform.
• Fast : Hadoop's distributed file system processes data at a very rapid rate.
• Flexible : Hadoop enables businesses to easily access new data sources.
• Fault Tolerance : Because data is replicated, hardware failures do not lead to data loss.
CONCLUSION
• Hadoop is a natural platform with which enterprise IT can apply data science to a huge variety of business problems, such as product recommendation, data analysis and sentiment analysis.
• It is rapidly becoming a central store for big data in many industries.
