
Hadoop

Introduction
• “3-D Data Management: Controlling Data Volume, Velocity and Variety”
• Research note by Doug Laney, 2001
• Introduced the “3 Vs” framing of Big Data
Varying Data Structures
1. Structured- RDBMS tables
2. Semi-structured- XML, JSON, tab-delimited files
3. Unstructured- Free text, images, audio, video
Origin
• By Doug Cutting
• Douglass Read Cutting is a software designer and creator of open-source
search technology. He founded the Lucene open-source search project,
which is now managed through the Apache Software Foundation. He is also
a co-founder of Apache Hadoop.
• Components
• MapReduce
• Hadoop Distributed File System (HDFS)
HDFS Architecture
• Master Service- NameNode
• In HDFS cluster Namenode is the master and the centerpiece of
the HDFS file system.
• It manages the file system namespace.
• It keeps the directory tree of all files in the file system and
metadata about files and directories.
• Slave Service- DataNode
• DataNode is responsible for storing the actual data in HDFS.
• It is also known as the Slave.
• NameNode and DataNode are in constant communication.
• When a DataNode starts up, it announces itself to the NameNode
along with the list of blocks it is responsible for.
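The NameNode-holds-metadata / DataNode-holds-blocks split described above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop's actual API; the class and method names are invented:

```python
from collections import defaultdict

class NameNode:
    """Toy NameNode: tracks which DataNodes hold which blocks (metadata only)."""
    def __init__(self):
        self.block_locations = defaultdict(set)  # block_id -> {datanode ids}

    def register(self, datanode_id, block_ids):
        # A DataNode announces itself and its block list at startup.
        for b in block_ids:
            self.block_locations[b].add(datanode_id)

    def locate(self, block_id):
        # Clients ask the NameNode where a block lives, then read
        # the actual bytes directly from a DataNode.
        return sorted(self.block_locations[block_id])

nn = NameNode()
nn.register("dn1", ["blk_1", "blk_2"])
nn.register("dn2", ["blk_2", "blk_3"])
print(nn.locate("blk_2"))  # ['dn1', 'dn2']
```

Note that the NameNode never stores file contents, only the block-to-DataNode mapping.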
Hadoop Ecosystem
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning Algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Hadoop Distributions
1) Amazon Web Services Elastic MapReduce Hadoop
Distribution
2) Hortonworks Hadoop Distribution
3) Cloudera Hadoop Distribution
4) MapR Hadoop Distribution
5) IBM Infosphere BigInsights Hadoop Distribution
6) Microsoft Azure's HDInsight Cloud based Hadoop
Distribution
1) Amazon Web Services Elastic
MapReduce Hadoop Distribution
• AWS Elastic MapReduce (EMR) provides a data analytics platform built
on the powerful HDFS architecture, with a major focus on
map/reduce workloads.
• AWS EMR handles important big data uses like web indexing,
scientific simulation, log analysis, bioinformatics, machine
learning, financial analysis and data warehousing for big data
analysis.
• DynamoDB is another major NoSQL database from Amazon, originally
deployed to run its giant consumer website.
• Redshift is a completely managed petabyte scale data analytics
solution.
2) Hortonworks Hadoop Distribution
• Hortonworks is a pure play Hadoop company that drives open source
Hadoop distributions in the IT market.
• Apache Ambari is an example of a Hadoop cluster management console
developed by the Hortonworks Hadoop vendor for provisioning, managing
and monitoring Hadoop clusters.
• Hortonworks has garnered strong engineering partnerships with RedHat,
Microsoft, SAP and Teradata.
3) Cloudera Hadoop Distribution
• Founded in 2008 by a group of engineers from Yahoo, Google and
Facebook, Cloudera is focused on providing enterprise-ready
Hadoop solutions with additional customer support and training.
• The Cloudera Hadoop vendor has close to 350 paying customers,
including the U.S. Army, Allstate and Monsanto.
• Cloudera owes its long term success to corporate partners -
Oracle, IBM, HP, NetApp and MongoDB that have been
consistently pushing its services.
4) MapR Hadoop Distribution
• MapR has made considerable investments to get over
the obstacles to worldwide adoption of Hadoop which
include enterprise grade reliability, data protection,
integrating Hadoop into existing environments with
ease and infrastructure to render support for real time
operations.
• In 2015, MapR plans to make further investments to
maintain its significance in the Big Data vendors list.
5) IBM Infosphere BigInsights
Hadoop Distribution
• IBM Infosphere BigInsights is an industry-standard distribution that
combines Hadoop with enterprise-grade features.
• IBM provides BigSheets and BigInsights as a service via its
Smartcloud Enterprise Infrastructure .
• With IBM Hadoop distributions users can easily set up and
move data to Hadoop clusters in no more than 30 minutes with
data processing rate of 60 cents per Hadoop cluster, per hour.
• With IBM BigInsights innovation, customers can get to market
at a rapid pace with their applications that incorporate
advanced Big Data analytics by harnessing the power of
Hadoop.
6) Microsoft Azure's HDInsight
Cloud based Hadoop Distribution
• Microsoft’s big data solution is best leveraged through its public
cloud product, Windows Azure’s HDInsight, developed particularly
to run on Azure. Another production-ready Microsoft feature,
PolyBase, lets users query data held in SQL Server during the
execution of Hadoop queries.
• Microsoft has great significance in delivering a growing Hadoop
stack to its customers. Microsoft Azure’s HDInsight is a
public-cloud-only product, and customers cannot run it on their
own hardware.
Hadoop Toolbox
1) HDFS
• Hadoop Distributed File System (HDFS) is designed to store very large amounts of data,
and for that workload it is considerably more efficient than NTFS (New Technology File
System) and the FAT32 file system used on Windows PCs.
• HDFS is used to deliver large chunks of data quickly to applications.
• Yahoo has been using Hadoop Distributed File System to manage over 40 petabytes of data.
2) HIVE
• Apache, best known for its web server, also offers a solution for Hadoop’s
data warehouse needs: the Apache HIVE data warehouse software.
• This makes it easy for us to query and manage large datasets.
• With HIVE, a structure is projected onto unstructured data, and we can then query
the data with a SQL-like language known as HiveQL.
• HIVE supports different storage formats such as plain text, RCFile, HBase, ORC, etc.
• HIVE also comes with built-in functions for the users, which can be used to manipulate
dates, strings, numbers, and several other types of data mining functions.
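To illustrate the idea of projecting a structure onto raw data and then querying it with SQL, here is a rough analogy using Python's built-in sqlite3 in place of HIVE. The table, columns, and sample data are invented, and real HiveQL runs on a Hadoop cluster, not SQLite:

```python
import sqlite3

# Hypothetical raw, tab-delimited log lines (the "unstructured" input).
raw = "alice\t3\nbob\t5\nalice\t2\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, pages INTEGER)")

# Project a structure onto the raw text, line by line.
for line in raw.strip().split("\n"):
    user, pages = line.split("\t")
    conn.execute("INSERT INTO visits VALUES (?, ?)", (user, int(pages)))

# The query reads much like what you might write in HiveQL.
total = conn.execute(
    "SELECT user, SUM(pages) FROM visits GROUP BY user ORDER BY user"
).fetchall()
print(total)  # [('alice', 5), ('bob', 5)]
```

The point of HIVE is exactly this convenience: analysts write declarative queries while the engine handles the underlying files.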
Hadoop Toolbox
3) NoSQL
• Structured Query Languages have been in use for a long time; now that data is
mostly unstructured, we need databases that do not impose a rigid schema. This
is what NoSQL (“not only SQL”) provides.
• Here we primarily have key-value pairs with secondary indexes. NoSQL can easily be
integrated with Oracle Database, Oracle Wallet, and Hadoop.
• This makes NoSQL one of the most widely supported approaches to querying unstructured data.
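The key-value-plus-secondary-index idea can be sketched as a toy Python class. This is illustrative only; real NoSQL stores distribute the data and indexes across many machines:

```python
class KVStore:
    """Toy key-value store with one secondary index (field name is chosen at creation)."""
    def __init__(self, index_field):
        self.data = {}           # primary key -> record (a dict)
        self.index = {}          # secondary index: field value -> set of primary keys
        self.index_field = index_field

    def put(self, key, record):
        self.data[key] = record
        self.index.setdefault(record[self.index_field], set()).add(key)

    def get(self, key):
        return self.data[key]

    def find_by_index(self, value):
        # Look up records by the indexed field instead of the primary key.
        return sorted(self.index.get(value, set()))

store = KVStore("city")
store.put("u1", {"name": "Asha", "city": "Pune"})
store.put("u2", {"name": "Ben", "city": "Pune"})
print(store.find_by_index("Pune"))  # ['u1', 'u2']
```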
4) Mahout
• Apache has also developed its own library of machine learning algorithms, known
as Mahout. Mahout is implemented on top of Apache Hadoop and uses the
MapReduce paradigm of Big Data.
• Machines learn by building models from the data generated by different users’
inputs; this is known as machine learning and is one of the critical components
of Artificial Intelligence.
• Machine learning is often used to improve the performance of a particular system,
and it works largely from the outcome of the machine’s previous runs.
Hadoop Toolbox
5) Avro
• Avro is a data serialization system: with it, we can get compact representations of the
complex data structures generated by Hadoop’s MapReduce jobs.
• Avro can serve as both the input and the output format of a MapReduce job, handling
the formatting for us.
• Avro schemas are defined in easily understandable JSON, which makes the serialized
data self-describing.
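A minimal sketch of schema-driven serialization in plain Python, to convey the idea. Real Avro defines schemas in JSON and uses its own binary encoding; this stand-in is not Avro-compatible, and the schema and record are invented:

```python
import struct

# A schema as an ordered list of (field, type) pairs — a stand-in for Avro's JSON schemas.
schema = [("name", "string"), ("count", "int")]

def encode(record):
    """Serialize a record to bytes, driven entirely by the schema."""
    out = b""
    for field, ftype in schema:
        if ftype == "int":
            out += struct.pack(">i", record[field])
        else:  # string: 4-byte length prefix, then UTF-8 bytes
            raw = record[field].encode("utf-8")
            out += struct.pack(">i", len(raw)) + raw
    return out

def decode(buf):
    """Reverse of encode: walk the same schema over the byte buffer."""
    record, pos = {}, 0
    for field, ftype in schema:
        if ftype == "int":
            record[field] = struct.unpack_from(">i", buf, pos)[0]
            pos += 4
        else:
            n = struct.unpack_from(">i", buf, pos)[0]
            pos += 4
            record[field] = buf[pos:pos + n].decode("utf-8")
            pos += n
    return record

rec = {"name": "page_view", "count": 42}
assert decode(encode(rec)) == rec  # round-trips losslessly
```

Because both sides share the schema, no field names need to travel with the data, which is what keeps the encoding compact.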
6) GIS tools
• Geographic information is one of the most extensive sets of information available in the world.
• This includes all the states, cafes, restaurants, and other points of interest around the
world, and it needs to be precise.
• Hadoop is used with GIS tools, a Java-based toolkit for working with
geographic information.
• With the help of this toolkit, we can handle geographic coordinates in place of strings,
which helps us minimize the lines of code.
• With GIS, we can integrate maps in reports and publish them as online map applications.
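As an example of computing with coordinates rather than strings, here is the standard haversine great-circle distance in Python (the city coordinates below are approximate):

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))  # 6371 km = mean Earth radius

pune = (18.52, 73.86)
mumbai = (19.08, 72.88)
print(round(haversine_km(pune, mumbai)))  # ~120
```

Had the locations been stored as strings, this computation would first require parsing; typed coordinates make the operation a one-liner.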
Hadoop Toolbox
7) Flume
• LOGs are generated whenever there is any request, response, or any type of activity in
the database.
• Logs help to debug the program and see where things are going wrong. While working
with large sets of data, even the Logs are generated in bulk. And when we need to move
this massive amount of log data, Flume comes into play.
• Flume uses a simple, extensible data model, which will help you to apply online analytic
applications with the most ease.
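Flume's source → channel → sink pipeline can be sketched with a simple Python queue. This is a single-process toy with invented names; real Flume agents are distributed and make the channel durable:

```python
from queue import Queue

channel = Queue()  # buffers events between producer and consumer

def source(lines):
    # The source ingests log events and pushes them into the channel.
    for line in lines:
        channel.put(line)

def sink(store):
    # The sink drains the channel and delivers events to a destination
    # (in real Flume, typically HDFS).
    while not channel.empty():
        store.append(channel.get())

logs = ["GET /index 200", "POST /login 500"]
delivered = []
source(logs)
sink(delivered)
print(delivered)  # ['GET /index 200', 'POST /login 500']
```

The decoupling matters: the source can keep accepting bursts of log traffic while the sink writes at its own pace.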
8) Clouds
• Cloud platforms work over large data sets, which can make them slow when
processed in the traditional way. Hence most cloud platforms now offer Hadoop,
and the cloud tooling helps you run it.
• With this approach, a temporary machine is used to compute over a big data set;
the results are stored and the temporary machine is then freed. All of this is
set up and scheduled by the cloud, so the normal working of the servers is not
affected at all.
Hadoop Toolbox
9) Spark
• Coming to hadoop analytics tools, Spark tops the list. Spark is a
framework available for Big Data analytics from Apache.
• This one is an open-source data analytics cluster computing framework
that was initially developed by AMPLab at UC Berkeley and later donated
to the Apache Software Foundation.
• Spark works on the Hadoop Distributed File System, one of the
standard file systems for working with Big Data. Spark promises to perform
up to 100 times better than Hadoop’s MapReduce for certain types of
applications.
• Spark loads data into memory across the cluster, which allows the
program to query it repeatedly, making it a strong framework
for AI and Machine Learning.
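The cached, chainable-transformation idea can be sketched with a tiny RDD-like Python class. It is eager rather than lazy, unlike real Spark RDDs, and purely illustrative:

```python
class RDD:
    """Tiny stand-in for Spark's RDD: chained map/filter over in-memory data."""
    def __init__(self, data):
        self.data = data  # kept in memory, so repeated queries avoid re-reading disk

    def map(self, f):
        return RDD([f(x) for x in self.data])

    def filter(self, p):
        return RDD([x for x in self.data if p(x)])

    def collect(self):
        return self.data

rdd = RDD([1, 2, 3, 4, 5])
# Square the odd numbers — each step returns a new RDD, as in Spark.
print(rdd.filter(lambda x: x % 2 == 1).map(lambda x: x * x).collect())  # [1, 9, 25]
```

Iterative ML algorithms benefit most: the same `rdd` can be queried again and again without reloading the source data, which is precisely what MapReduce cannot do cheaply.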
Hadoop Toolbox
10) MapReduce
• Hadoop MapReduce is a framework that makes it quite easy for the developer to
write an application that will process multi-terabyte datasets in parallel.
• These datasets can be processed over large clusters. The MapReduce framework
consists of a JobTracker and TaskTrackers; there is a single JobTracker which
tracks all the jobs, while there is one TaskTracker per cluster node.
• The master, i.e. the JobTracker, schedules jobs and monitors the tasks,
re-scheduling them if they fail, while each slave TaskTracker executes the
tasks as directed.
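The map → shuffle → reduce flow can be simulated in a single Python process with the classic word-count example. This is a toy; real MapReduce distributes these phases across the cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["big data", "big clusters"]
pairs = list(chain.from_iterable(map_phase(d) for d in docs))
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 1, 'clusters': 1}
```

Because each map call and each reduce call is independent, the framework can run them in parallel on many nodes, which is what makes multi-terabyte jobs tractable.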
11) Impala
• Cloudera is another company that develops tools for development needs.
Impala is software from Cloudera: a leading massively parallel processing
(MPP) SQL query engine that runs natively on Apache Hadoop. Impala is
Apache-licensed, which makes it quite easy to directly query data
stored in HDFS (Hadoop Distributed File System) and Apache HBase.
Thanks
