
Hadoop

Introduction
• “3-D Data Management: Controlling Data Volume, Velocity and Variety”
• Research note by Doug Laney, 2001
• Introduced the “3 Vs” framing of Big Data
Varying Data Structures
1. Structured- RDBMS tables
2. Semi-structured- XML, JSON, tab-delimited files
3. Unstructured- Free text, images, audio, video
Origin
• By Doug Cutting
• Douglass Read Cutting is a software designer and creator of open-source
search technology. He founded the Lucene open-source search project,
which is now managed through the Apache Software Foundation. He is also
a co-founder of Apache Hadoop.
• Components
• MapReduce
• Hadoop Distributed File System (HDFS)
HDFS Architecture
• Master Service- NameNode
• In HDFS cluster Namenode is the master and the centerpiece of
the HDFS file system.
• It manages the file system namespace.
• It keeps the directory tree of all files in the file system and
metadata about files and directories.
• Slave Service- DataNode
• DataNode is responsible for storing the actual data in HDFS.
• It is also known as the Slave.
• NameNode and DataNode are in constant communication.
• When a DataNode starts up, it announces itself to the NameNode
along with the list of blocks it is responsible for.
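The NameNode-holds-metadata / DataNode-holds-blocks split described above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop's actual API; the class and method names are invented:

```python
from collections import defaultdict

class NameNode:
    """Toy NameNode: tracks which DataNodes hold which blocks (metadata only)."""
    def __init__(self):
        self.block_locations = defaultdict(set)  # block_id -> {datanode ids}

    def register(self, datanode_id, block_ids):
        # A DataNode announces itself and its block list at startup.
        for b in block_ids:
            self.block_locations[b].add(datanode_id)

    def locate(self, block_id):
        # Clients ask the NameNode where a block lives, then read
        # the actual bytes directly from a DataNode.
        return sorted(self.block_locations[block_id])

nn = NameNode()
nn.register("dn1", ["blk_1", "blk_2"])
nn.register("dn2", ["blk_2", "blk_3"])
print(nn.locate("blk_2"))  # ['dn1', 'dn2']
```

Note that the NameNode never stores file contents, only the block-to-DataNode mapping.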
Hadoop Ecosystem
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning Algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Hadoop Distributions
1) Amazon Web Services Elastic MapReduce Hadoop
Distribution
2) Hortonworks Hadoop Distribution
3) Cloudera Hadoop Distribution
4) MapR Hadoop Distribution
5) IBM Infosphere BigInsights Hadoop Distribution
6) Microsoft Azure's HDInsight Cloud based Hadoop
Distribution
1) Amazon Web Services Elastic
MapReduce Hadoop Distribution
• AWS Elastic MapReduce (EMR) provides a data analytics platform built
on the powerful HDFS architecture, with a major focus on
map/reduce workloads.
• AWS EMR handles important big data uses like web indexing,
scientific simulation, log analysis, bioinformatics, machine
learning, financial analysis and data warehousing for big data
analysis.
• DynamoDB is another major NoSQL database from Amazon, originally
deployed to run its giant consumer website.
• Redshift is a completely managed petabyte scale data analytics
solution.
2) Hortonworks Hadoop Distribution
• Hortonworks is a pure play Hadoop company that drives open source
Hadoop distributions in the IT market.
• Apache Ambari is an example of a Hadoop cluster management console
developed by the Hortonworks Hadoop vendor for provisioning, managing
and monitoring Hadoop clusters.
• Hortonworks has garnered strong engineering partnerships with RedHat,
Microsoft, SAP and Teradata.
3) Cloudera Hadoop Distribution
• Founded in 2008 by a group of engineers from Yahoo, Google and
Facebook, Cloudera is focused on providing enterprise-ready
Hadoop solutions with additional customer support and training.
• The Cloudera Hadoop vendor has close to 350 paying customers,
including the U.S. Army, Allstate and Monsanto.
• Cloudera owes its long term success to corporate partners -
Oracle, IBM, HP, NetApp and MongoDB that have been
consistently pushing its services.
4) MapR Hadoop Distribution
• MapR has made considerable investments to get over
the obstacles to worldwide adoption of Hadoop which
include enterprise grade reliability, data protection,
integrating Hadoop into existing environments with
ease and infrastructure to render support for real time
operations.
• In 2015, MapR plans to make further investments to
maintain its significance in the Big Data vendors list.
5) IBM Infosphere BigInsights
Hadoop Distribution
• IBM Infosphere BigInsights is an industry-standard distribution that
combines Hadoop with enterprise-grade features.
• IBM provides BigSheets and BigInsights as a service via its
Smartcloud Enterprise Infrastructure .
• With IBM Hadoop distributions users can easily set up and
move data to Hadoop clusters in no more than 30 minutes with
data processing rate of 60 cents per Hadoop cluster, per hour.
• With IBM BigInsights innovation, customers can get to market
at a rapid pace with their applications that incorporate
advanced Big Data analytics by harnessing the power of
Hadoop.
6) Microsoft Azure's HDInsight
Cloud based Hadoop Distribution
• Microsoft’s big data solution is best leveraged through its public
cloud product, Windows Azure’s HDInsight, developed particularly
to run on Azure. Another production-ready Microsoft feature,
PolyBase, lets users query data held in SQL Server during the
execution of Hadoop queries.
• Microsoft has great significance in delivering a growing Hadoop
stack to its customers. Microsoft Azure’s HDInsight is a
public-cloud-only product, and customers cannot run it on their
own hardware.
Hadoop Toolbox
1) HDFS
• Hadoop Distributed File System (HDFS) is designed to store very large amounts of data,
and for that workload it is considerably more efficient than NTFS (New Technology File
System) and the FAT32 file system used on Windows PCs.
• HDFS is used to deliver large chunks of data quickly to applications.
• Yahoo has been using Hadoop Distributed File System to manage over 40 petabytes of data.
2) HIVE
• Apache, best known for its web server, also offers a solution for Hadoop’s
data warehouse needs: the Apache HIVE data warehouse software.
• This makes it easy for us to query and manage large datasets.
• With HIVE, a structure is projected onto unstructured data, and we can then query
the data with a SQL-like language known as HiveQL.
• HIVE supports different storage formats such as plain text, RCFile, HBase, ORC, etc.
• HIVE also comes with built-in functions for the users, which can be used to manipulate
dates, strings, numbers, and several other types of data mining functions.
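To illustrate the idea of projecting a structure onto raw data and then querying it with SQL, here is a rough analogy using Python's built-in sqlite3 in place of HIVE. The table, columns, and sample data are invented, and real HiveQL runs on a Hadoop cluster, not SQLite:

```python
import sqlite3

# Hypothetical raw, tab-delimited log lines (the "unstructured" input).
raw = "alice\t3\nbob\t5\nalice\t2\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, pages INTEGER)")

# Project a structure onto the raw text, line by line.
for line in raw.strip().split("\n"):
    user, pages = line.split("\t")
    conn.execute("INSERT INTO visits VALUES (?, ?)", (user, int(pages)))

# The query reads much like what you might write in HiveQL.
total = conn.execute(
    "SELECT user, SUM(pages) FROM visits GROUP BY user ORDER BY user"
).fetchall()
print(total)  # [('alice', 5), ('bob', 5)]
```

The point of HIVE is exactly this convenience: analysts write declarative queries while the engine handles the underlying files.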
Hadoop Toolbox
3) NoSQL
• Structured Query Languages have been in use for a long time; now that data is
mostly unstructured, we need databases that do not impose a rigid schema. This
is what NoSQL (“not only SQL”) provides.
• Here we primarily have key-value pairs with secondary indexes. NoSQL can easily be
integrated with Oracle Database, Oracle Wallet, and Hadoop.
• This makes NoSQL one of the most widely supported approaches to querying unstructured data.
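The key-value-plus-secondary-index idea can be sketched as a toy Python class. This is illustrative only; real NoSQL stores distribute the data and indexes across many machines:

```python
class KVStore:
    """Toy key-value store with one secondary index (field name is chosen at creation)."""
    def __init__(self, index_field):
        self.data = {}           # primary key -> record (a dict)
        self.index = {}          # secondary index: field value -> set of primary keys
        self.index_field = index_field

    def put(self, key, record):
        self.data[key] = record
        self.index.setdefault(record[self.index_field], set()).add(key)

    def get(self, key):
        return self.data[key]

    def find_by_index(self, value):
        # Look up records by the indexed field instead of the primary key.
        return sorted(self.index.get(value, set()))

store = KVStore("city")
store.put("u1", {"name": "Asha", "city": "Pune"})
store.put("u2", {"name": "Ben", "city": "Pune"})
print(store.find_by_index("Pune"))  # ['u1', 'u2']
```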
4) Mahout
• Apache has also developed its own library of machine learning algorithms, known
as Mahout. Mahout is implemented on top of Apache Hadoop and uses the
MapReduce paradigm of Big Data.
• Machines learn by building models from the data generated by different users’
inputs; this is known as machine learning and is one of the critical components
of Artificial Intelligence.
• Machine learning is often used to improve the performance of a particular system,
and it works largely from the outcome of the machine’s previous runs.
Hadoop Toolbox
5) Avro
• Avro is a data serialization system: with it, we can get compact representations of the
complex data structures generated by Hadoop’s MapReduce jobs.
• Avro can serve as both the input and the output format of a MapReduce job, handling
the formatting for us.
• Avro schemas are defined in easily understandable JSON, which makes the serialized
data self-describing.
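A minimal sketch of schema-driven serialization in plain Python, to convey the idea. Real Avro defines schemas in JSON and uses its own binary encoding; this stand-in is not Avro-compatible, and the schema and record are invented:

```python
import struct

# A schema as an ordered list of (field, type) pairs — a stand-in for Avro's JSON schemas.
schema = [("name", "string"), ("count", "int")]

def encode(record):
    """Serialize a record to bytes, driven entirely by the schema."""
    out = b""
    for field, ftype in schema:
        if ftype == "int":
            out += struct.pack(">i", record[field])
        else:  # string: 4-byte length prefix, then UTF-8 bytes
            raw = record[field].encode("utf-8")
            out += struct.pack(">i", len(raw)) + raw
    return out

def decode(buf):
    """Reverse of encode: walk the same schema over the byte buffer."""
    record, pos = {}, 0
    for field, ftype in schema:
        if ftype == "int":
            record[field] = struct.unpack_from(">i", buf, pos)[0]
            pos += 4
        else:
            n = struct.unpack_from(">i", buf, pos)[0]
            pos += 4
            record[field] = buf[pos:pos + n].decode("utf-8")
            pos += n
    return record

rec = {"name": "page_view", "count": 42}
assert decode(encode(rec)) == rec  # round-trips losslessly
```

Because both sides share the schema, no field names need to travel with the data, which is what keeps the encoding compact.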
6) GIS tools
• Geographic information is one of the most extensive sets of information available in the world.
• This includes all the states, cafes, restaurants, and other points of interest around the
world, and it needs to be precise.
• Hadoop is used with GIS tools, a Java-based toolkit for working with
geographic information.
• With the help of this toolkit, we can handle geographic coordinates in place of strings,
which helps us minimize the lines of code.
• With GIS, we can integrate maps in reports and publish them as online map applications.
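As an example of computing with coordinates rather than strings, here is the standard haversine great-circle distance in Python (the city coordinates below are approximate):

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))  # 6371 km = mean Earth radius

pune = (18.52, 73.86)
mumbai = (19.08, 72.88)
print(round(haversine_km(pune, mumbai)))  # ~120
```

Had the locations been stored as strings, this computation would first require parsing; typed coordinates make the operation a one-liner.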
Hadoop Toolbox
7) Flume
• LOGs are generated whenever there is any request, response, or any type of activity in
the database.
• Logs help to debug the program and see where things are going wrong. While working
with large sets of data, even the Logs are generated in bulk. And when we need to move
this massive amount of log data, Flume comes into play.
• Flume uses a simple, extensible data model, which will help you to apply online analytic
applications with the most ease.
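Flume's source → channel → sink pipeline can be sketched with a simple Python queue. This is a single-process toy with invented names; real Flume agents are distributed and make the channel durable:

```python
from queue import Queue

channel = Queue()  # buffers events between producer and consumer

def source(lines):
    # The source ingests log events and pushes them into the channel.
    for line in lines:
        channel.put(line)

def sink(store):
    # The sink drains the channel and delivers events to a destination
    # (in real Flume, typically HDFS).
    while not channel.empty():
        store.append(channel.get())

logs = ["GET /index 200", "POST /login 500"]
delivered = []
source(logs)
sink(delivered)
print(delivered)  # ['GET /index 200', 'POST /login 500']
```

The decoupling matters: the source can keep accepting bursts of log traffic while the sink writes at its own pace.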
8) Clouds
• Cloud platforms work over large data sets, which can make them slow when
processed in the traditional way. Hence most cloud platforms now offer Hadoop,
and the cloud tooling helps you run it.
• With this approach, a temporary machine is used to compute over a big data set;
the results are stored and the temporary machine is then freed. All of this is
set up and scheduled by the cloud, so the normal working of the servers is not
affected at all.
Hadoop Toolbox
9) Spark
• Coming to hadoop analytics tools, Spark tops the list. Spark is a
framework available for Big Data analytics from Apache.
• This one is an open-source data analytics cluster computing framework
that was initially developed by AMPLab at UC Berkeley and later donated
to the Apache Software Foundation.
• Spark works on the Hadoop Distributed File System, one of the
standard file systems for working with Big Data. Spark promises to perform
up to 100 times better than Hadoop’s MapReduce for certain types of
applications.
• Spark loads data into memory across the cluster, which allows the
program to query it repeatedly, making it a strong framework
for AI and Machine Learning.
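The cached, chainable-transformation idea can be sketched with a tiny RDD-like Python class. It is eager rather than lazy, unlike real Spark RDDs, and purely illustrative:

```python
class RDD:
    """Tiny stand-in for Spark's RDD: chained map/filter over in-memory data."""
    def __init__(self, data):
        self.data = data  # kept in memory, so repeated queries avoid re-reading disk

    def map(self, f):
        return RDD([f(x) for x in self.data])

    def filter(self, p):
        return RDD([x for x in self.data if p(x)])

    def collect(self):
        return self.data

rdd = RDD([1, 2, 3, 4, 5])
# Square the odd numbers — each step returns a new RDD, as in Spark.
print(rdd.filter(lambda x: x % 2 == 1).map(lambda x: x * x).collect())  # [1, 9, 25]
```

Iterative ML algorithms benefit most: the same `rdd` can be queried again and again without reloading the source data, which is precisely what MapReduce cannot do cheaply.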
Hadoop Toolbox
10) MapReduce
• Hadoop MapReduce is a framework that makes it quite easy for the developer to
write an application that will process multi-terabyte datasets in parallel.
• These datasets can be processed over large clusters. The MapReduce framework
consists of a JobTracker and TaskTrackers; there is a single JobTracker which
tracks all the jobs, while there is one TaskTracker per cluster node.
• The master, i.e. the JobTracker, schedules jobs and monitors the tasks,
re-scheduling them if they fail, while each slave TaskTracker executes the
tasks as directed.
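The map → shuffle → reduce flow can be simulated in a single Python process with the classic word-count example. This is a toy; real MapReduce distributes these phases across the cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["big data", "big clusters"]
pairs = list(chain.from_iterable(map_phase(d) for d in docs))
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 1, 'clusters': 1}
```

Because each map call and each reduce call is independent, the framework can run them in parallel on many nodes, which is what makes multi-terabyte jobs tractable.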
11) Impala
• Cloudera is another company that develops tools for development needs.
Impala is software from Cloudera: a leading massively parallel processing
(MPP) SQL query engine that runs natively on Apache Hadoop. Impala is
Apache-licensed, which makes it quite easy to directly query data
stored in HDFS (Hadoop Distributed File System) and Apache HBase.
Thanks
