Introduction to Hadoop
Dr. C.V. Suresh Babu
(Centre for Knowledge Transfer) institute
Discussion Topics
• What is Hadoop?
• Need for Hadoop
• History of Hadoop
• Hadoop Overview
• Advantages and Disadvantages of Hadoop
• Hadoop Distributed File System
• Comparing: RDBMS vs. Hadoop
• Advantages and Disadvantages of HDFS
• Hadoop frameworks
• Modules of Hadoop frameworks
• Features of 'Hadoop'
• Hadoop Analytics Tools
What is Hadoop?
• Hadoop is an open-source software framework for storing large amounts of
data and running computations on it.
• The framework is written mainly in Java, with some native code in C and
shell scripts.
• Hadoop is used for advanced analytics, including machine learning and
data mining.
Need for Hadoop
• Redundant, Fault-tolerant data storage
• Parallel computation framework
• Job coordination
Programmers no longer need to worry about:
Q: Where is the file located?
Q: How to handle failures and data loss?
Q: How to divide the computation?
Q: How to program for scaling?
History of Hadoop
• Hadoop is developed under the Apache Software Foundation; its
co-founders are Doug Cutting and Mike Cafarella.
• Doug Cutting named it after his son's toy elephant. The first related
paper, on the Google File System, was released in October 2003.
• In January 2006, MapReduce development started within the Apache Nutch
project; the initial code consisted of around 6,000 lines for MapReduce
and around 5,000 lines for HDFS.
• In April 2006, Hadoop 0.1.0 was released.
Advantages and Disadvantages of Hadoop
Advantages:
• Ability to store a large amount of
data.
• High flexibility.
• Cost effective.
• High computational power.
• Tasks are independent.
• Linear scaling.
Disadvantages:
• Not very effective for small
data.
• Hard cluster management.
• Has stability issues.
• Security concerns.
Hadoop Distributed File System
• Hadoop has a distributed file system known as HDFS, which splits files
into blocks and distributes them across the nodes of a large cluster.
• If a node fails, the system keeps operating: HDFS serves the data from
copies held on other nodes and handles the data transfer between nodes.
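The block splitting and node-failure behaviour described above can be sketched roughly in Python. This is an illustration only, not HDFS code: the block size, the replication count, the node names, and all function names are assumptions introduced here, and the round-robin placement is a simplification of what a real cluster does.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)
REPLICATION = 3                 # assumed number of copies kept per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a node other than failed_node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical example: a 300 MiB file on a four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), placement[0], readable_after_failure(placement, "node3"))
```

Because every block lives on more than one node, losing any single node in this sketch still leaves each block readable.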
Comparing: RDBMS vs. Hadoop
                     Traditional RDBMS          Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)      Petabytes (Exabytes)
Access               Interactive and Batch      Batch, NOT Interactive
Updates              Read / Write many times    Write once, Read many times
Structure            Static Schema              Dynamic Schema
Integrity            High (ACID)                Low
Scaling              Nonlinear                  Linear
Query Response Time  Can be near immediate      Has latency (due to batch processing)
Advantages of HDFS:
• It is inexpensive, immutable in nature (write once, read many), reliable,
fault-tolerant, scalable, and block-structured, and it can process large
amounts of data in parallel.
Disadvantages of HDFS:
• Its biggest disadvantage is that it is not suited to small quantities of
data. It also has potential stability issues and can be restrictive and
rough to work with.
Some common frameworks of Hadoop
• Hive: uses HiveQL to structure data and to express complex MapReduce
jobs over data in HDFS.
• Drill: supports user-defined functions and is used for data exploration.
• Storm: allows real-time processing and streaming of data.
• Spark: includes a machine learning library (MLlib) and is widely used
for data processing. It supports Java, Python, and Scala.
• Pig: provides Pig Latin, a SQL-like language, and performs data
transformation on unstructured data.
• Tez: reduces the complexity of Hive and Pig and helps their code run
faster.
Modules of Hadoop frameworks
The Hadoop framework is made up of the following modules:
1. Hadoop MapReduce: a programming model for processing large data sets.
2. Hadoop Distributed File System (HDFS): distributes files across the nodes
of a cluster.
3. Hadoop YARN: a platform that manages computing resources.
4. Hadoop Common: packages and libraries used by the other modules.
Features of 'Hadoop'
Suitable for Big Data Analysis
• As Big Data tends to be distributed and unstructured in nature, Hadoop
clusters are well suited to analyzing it.
• Since it is the processing logic (not the actual data) that flows to the
computing nodes, less network bandwidth is consumed.
• This is called data locality, and it helps increase the efficiency of
Hadoop-based applications.
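Data locality can be illustrated with a small Python sketch of a scheduler that prefers to run each task on a node that already stores the task's block. This is a simplified illustration, not Hadoop's scheduler: the function name `assign_tasks`, the slot counts, and the node and block names are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores it.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots: dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block_id, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder has capacity: fall back to any free node. The block
            # then has to travel over the network to that node.
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment


# Hypothetical cluster: node1 holds all three blocks but has only two slots.
locations = {"b0": ["node1", "node2"], "b1": ["node1"], "b2": ["node1"]}
assignment = assign_tasks(locations, {"node1": 2, "node2": 1, "node3": 1})
for block_id, (node, is_local) in assignment.items():
    print(block_id, node, "local" if is_local else "remote")
```

Only the task for `b2` runs remotely here, so only that one block would need to cross the network; the other two tasks read their data from local disk.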
Scalability
• Hadoop clusters can be scaled out by adding more cluster nodes, which
allows for the growth of Big Data.
• Scaling does not require modifications to application logic.
Fault Tolerance
• The Hadoop ecosystem has a provision to replicate the input data onto
other cluster nodes.
• That way, in the event of a cluster node failure, data processing can still
proceed by using data stored on another cluster node.
• Scales to petabytes or more easily
• Parallel data processing
• Suited for particular types of big data problems
Hadoop Analytics Tools
• A wide range of analytical tools is available that help Hadoop handle
very large data sets efficiently.
• The following slides cover ten of the most widely used Hadoop analytics
tools for big data, one by one.
Apache Spark
• Apache Spark is an open-source processing engine designed for ease of
analytics operations.
• It is a cluster computing platform designed to be fast and general-purpose.
• Spark covers batch applications, machine learning, streaming data
processing, and interactive queries.
Features of Spark:
• In-memory processing
• Tight integration of components
• Easy and inexpensive
• A powerful processing engine that makes it fast
• Spark Streaming, a high-level library for stream processing
MapReduce
• MapReduce is a programming model and processing engine that runs on the
YARN framework.
• Its primary feature is distributed, parallel processing across a Hadoop
cluster, which is what makes Hadoop fast: when dealing with Big Data,
serial processing is no longer practical.
Features of MapReduce:
• Scalable
• Fault tolerance
• Parallel processing
• Tunable replication
• Load balancing
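The MapReduce model can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it does not use the Hadoop API, and the function names are chosen here for illustration. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is why the work can be spread across many machines.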
Apache Hive
• Apache Hive is a data warehousing tool built on top of Hadoop. Data
warehousing means storing data gathered from various sources in one fixed
location.
• Hive is one of the best tools for data analysis on Hadoop.
• Anyone who knows SQL can comfortably use Apache Hive.
• The query language of Hive is known as HQL or HiveQL.
Features of Hive:
• Queries are similar to SQL queries.
• Hive supports different storage types, such as HBase, ORC, and plain text.
• Hive has built-in functions for data mining and other tasks.
• Hive operates on compressed data stored in the Hadoop ecosystem.
Apache Impala
• Apache Impala is an open-source SQL engine designed for Hadoop.
• Impala addresses the speed-related issues of Apache Hive with faster
processing.
• Apache Impala uses a similar SQL syntax, ODBC driver, and user interface
to Apache Hive.
• Apache Impala can easily be integrated with Hadoop for data analytics.
Features of Impala:
• Easy integration
• Scalability
• Security
• In-memory data processing
Apache Mahout
• The name Mahout is taken from the Hindi word "mahavat", which means
elephant rider.
• Apache Mahout runs algorithms on top of Hadoop, hence the name.
• Mahout is mainly used for implementing machine learning algorithms on
Hadoop, such as classification, collaborative filtering, and recommendation.
• Apache Mahout can also run machine learning algorithms without Hadoop
integration.
Features of Mahout:
• Used for machine learning applications
• Mahout has vector and matrix libraries
• Ability to analyze large datasets quickly
Apache Pig
• Pig was initially developed by Yahoo to make programming easier.
• Apache Pig can process extensive datasets because it works on top of Hadoop.
• Apache Pig is used for analyzing massive datasets by representing them as
data flows.
• Apache Pig also raises the level of abstraction for processing enormous
datasets.
• Pig Latin is the scripting language that developers use with the Pig
framework; it runs on the Pig runtime.
Features of Pig:
• Easy to program
• Rich set of operators
• Ability to handle various kinds of data
• Extensibility
Apache HBase
• HBase is a non-relational, distributed, column-oriented NoSQL database.
It consists of tables, and each table has multiple data rows.
• Each row has multiple column families, and each column family has columns
that contain key-value pairs.
• HBase works on top of HDFS (Hadoop Distributed File System).
• HBase is used for looking up small pieces of data within massive datasets.
Features of HBase:
• HBase has linear and modular scalability
• The Java API can easily be used for client access
• Block cache for real-time data queries
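The HBase data model (rows, column families, and key-value columns) can be pictured as nested dictionaries. The Python sketch below is a toy in-memory model for illustration only: the `Table` class and its methods are hypothetical and are not the HBase client API, and features such as cell versions and timestamps are left out.

```python
class Table:
    """Toy model: row key -> column family -> column qualifier -> value."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_key, stop_key):
        """Yield (row_key, row) for start_key <= row_key < stop_key, in key order."""
        for key in sorted(self.rows):
            if start_key <= key < stop_key:
                yield key, self.rows[key]


# Hypothetical table with two column families.
users = Table(["info", "stats"])
users.put("user#001", "info", "name", "Asha")
users.put("user#002", "info", "name", "Ravi")
users.put("user#001", "stats", "logins", 7)

print(users.get("user#001", "info", "name"))
print([key for key, _ in users.scan("user#001", "user#002")])
```

Looking up one cell by row key, family, and qualifier is the kind of small, targeted read from a large dataset that the slide describes.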
Apache Sqoop
• Sqoop is a command-line tool developed by Apache.
• The primary purpose of Apache Sqoop is to import structured data from an
RDBMS (relational database management system) such as MySQL, SQL Server,
or Oracle into HDFS (Hadoop Distributed File System).
• Sqoop can also export data from HDFS back to an RDBMS.
Features of Sqoop:
• Sqoop can import data to Hive or HBase
• Connecting to a database server
• Controlling parallelism
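"Controlling parallelism" refers to running an import as several parallel tasks, each reading its own slice of the source table. The Python sketch below shows one way such a split could be computed, assuming the table has an integer key with a known minimum and maximum. The function name `split_ranges` is hypothetical, and this is not Sqoop's actual code.

```python
def split_ranges(min_id, max_id, num_tasks):
    """Split the inclusive key range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    if total <= 0 or num_tasks <= 0:
        raise ValueError("need a non-empty range and at least one task")
    num_tasks = min(num_tasks, total)  # never create empty slices
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Hypothetical table with keys 1..10, imported by four parallel tasks.
ranges = split_ranges(1, 10, 4)
print(ranges)
```

Each task would then read only the rows whose key falls inside its slice, so raising or lowering the task count trades import speed against load on the source database.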
Tableau
• Tableau is data visualization software that can be used for data analytics
and business intelligence.
• It provides a variety of interactive visualizations to showcase insights
from the data, can translate queries into visualizations, and can import
data of all ranges and sizes.
• Tableau offers rapid analysis and processing, generating useful charts on
interactive dashboards and worksheets.
Features of Tableau:
• Tableau supports bar charts, histograms, pie charts, motion charts, bullet
charts, Gantt charts, and many more
• Secure and robust
• Interactive dashboards and worksheets
Apache Storm
• Apache Storm is a free, open-source, distributed real-time computation
system built using programming languages such as Clojure and Java.
• It can be used with many programming languages.
• Apache Storm is used for stream processing, which it performs very fast.
• Apache Storm uses daemons such as Nimbus, ZooKeeper, and Supervisor.
• Apache Storm can be used for real-time processing, online machine learning,
and more. Companies such as Yahoo, Spotify, and Twitter use Apache Storm.
Features of Storm:
• Easy to operate
• Each node can process millions of tuples per second
• Scalable and fault tolerant
Companies Using Hadoop
Common Hadoop Distributions
• Open Source
  - Apache
• Commercial
  - Cloudera
  - Hortonworks
  - MapR
  - AWS MapReduce
  - Microsoft Azure HDInsight (Beta)
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 

Introduction to Hadoop

  • 1. Introduction to Hadoop, Dr. C.V. Suresh Babu (Centre for Knowledge Transfer) institute
  • 2. Discussion Topics • What is Hadoop? • Need for Hadoop • History of Hadoop • Hadoop Overview • Advantages and Disadvantages of Hadoop • Hadoop Distributed File System • Comparing: RDBMS vs. Hadoop • Advantages and Disadvantages of HDFS • Hadoop frameworks • Modules of Hadoop frameworks • Features of Hadoop • Hadoop Analytics Tools
  • 3. What is Hadoop? • Hadoop is an open-source software framework for storing large amounts of data and running computations on it. • The framework is written mainly in Java, with some native code in C and shell scripts. • Hadoop is used for advanced analytics, including machine learning and data mining.
  • 4. Need for Hadoop • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination. Programmers no longer need to worry about: Q: Where is the file located? Q: How to handle failures and data loss? Q: How to divide the computation? Q: How to program for scaling?
  • 5. History of Hadoop • Hadoop is developed under the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella. • Doug Cutting named it after his son's toy elephant. • In October 2003 the first paper, on the Google File System, was released. • In January 2006, MapReduce development started within Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. • In April 2006 Hadoop 0.1.0 was released.
  • 7. Advantages and Disadvantages of Hadoop. Advantages: • Ability to store a large amount of data. • High flexibility. • Cost effective. • High computational power. • Tasks are independent. • Linear scaling. Disadvantages: • Not very effective for small data. • Hard cluster management. • Has stability issues. • Security concerns.
  • 8. Hadoop Distributed File System • Hadoop has a distributed file system, HDFS, which splits files into blocks and distributes them across the nodes of large clusters. • If a node fails, the system keeps operating, and HDFS handles the data transfer between the remaining nodes.
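The block splitting and distribution described on this slide can be sketched in a few lines. This is a toy model, not the HDFS implementation: the function names `split_into_blocks` and `place_replicas`, the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin.

    Hypothetical policy chosen for simplicity; it only guarantees that
    the copies of one block land on different nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with 4-byte blocks gives blocks of 4, 4 and 2 bytes.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(
    len(blocks), ["node1", "node2", "node3", "node4"], replication=3
)
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block is stored on several nodes, losing one node never removes the only copy of a block, which is what lets the cluster keep operating.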
  • 9. Comparing: RDBMS vs. Hadoop
      Aspect | Traditional RDBMS | Hadoop / MapReduce
      Data size | Gigabytes (terabytes) | Petabytes (exabytes)
      Access | Interactive and batch | Batch, not interactive
      Updates | Read / write many times | Write once, read many times
      Structure | Static schema | Dynamic schema
      Integrity | High (ACID) | Low
      Scaling | Nonlinear | Linear
      Query response time | Can be near immediate | Has latency (due to batch processing)
  • 10. Advantages of HDFS: • It is inexpensive, immutable in nature, reliable, fault tolerant, scalable, and block structured, and it can process a large amount of data simultaneously. Disadvantages of HDFS: • Its biggest disadvantage is that it is not suited to small quantities of data. • It also has potential stability issues and is restrictive and rough in nature.
  • 11. Some common frameworks of Hadoop • Hive: uses HiveQL to structure data and to express complex MapReduce jobs over data in HDFS. • Drill: supports user-defined functions and is used for data exploration. • Storm: allows real-time processing and streaming of data. • Spark: includes a machine learning library (MLlib) and is widely used for data processing. It supports Java, Python, and Scala. • Pig: provides Pig Latin, a SQL-like language, and performs data transformation on unstructured data. • Tez: reduces the complexity of Hive and Pig and helps their code run faster.
  • 12. Modules of Hadoop frameworks. The Hadoop framework is made up of the following modules: 1. Hadoop MapReduce: a programming model for handling and processing large data. 2. Hadoop Distributed File System: distributes files across the nodes of a cluster. 3. Hadoop YARN: a platform that manages computing resources. 4. Hadoop Common: packages and libraries used by the other modules.
  • 13. Features of Hadoop: Suitable for Big Data Analysis • As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for analysis of Big Data. • Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. • This is called data locality, and it helps increase the efficiency of Hadoop-based applications.
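Data locality can be illustrated with a small scheduling sketch: each task is sent to a node that already stores its block, and data only crosses the network when no such node has a free slot. The function `schedule_tasks`, the block and node names, and the greedy policy are assumptions for illustration and do not reproduce the actual Hadoop scheduler.

```python
def schedule_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that holds the block.

    block_locations: block name -> list of nodes storing a replica
    free_slots:      node name  -> number of tasks the node can still accept
    Returns block name -> (chosen node, True if the block is local to it).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that already stores the block and has capacity.
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with capacity; the block must travel to it.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free task slots left in the cluster")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


block_locations = {
    "blk_0": ["node1", "node2"],
    "blk_1": ["node2", "node3"],
    "blk_2": ["node1", "node2"],
}
assignment = schedule_tasks(block_locations, {"node1": 1, "node2": 1, "node3": 1})
print(assignment)
```

In this run two of the three tasks read their block from local disk; only `blk_2` has to be fetched over the network, because both nodes holding it are already busy.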
  • 14. Scalability • Hadoop clusters can be scaled out by adding cluster nodes, which accommodates the growth of Big Data. • Scaling does not require modifications to application logic.
  • 15. Fault Tolerance • The Hadoop ecosystem replicates the input data onto other cluster nodes. • That way, if a cluster node fails, data processing can still proceed using the data stored on another cluster node.
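A reader that falls back to another replica when a node is down captures the idea on this slide. The sketch below is a simplified in-memory model: `read_block`, the `replicas` and `storage` dictionaries, and the node names are hypothetical and are not part of any Hadoop API.

```python
def read_block(block_id, replicas, live_nodes, storage):
    """Return (node, data) from the first live node holding the block."""
    for node in replicas[block_id]:
        if node in live_nodes:
            return node, storage[node][block_id]
    raise IOError(f"all replicas of block {block_id} are unavailable")


# Block 0 is stored on three nodes; node1 has failed.
replicas = {0: ["node1", "node2", "node3"]}
storage = {node: {0: b"hello"} for node in replicas[0]}
live_nodes = {"node2", "node3"}

node, data = read_block(0, replicas, live_nodes, storage)
print(node, data)
```

The read succeeds from `node2` even though the first replica is gone. Only when every node holding the block is down does the read fail.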
  • 16. • Scales easily to petabytes or more • Parallel data processing • Suited to particular types of big data problems
  • 17. Hadoop Analytics Tools • A wide range of analytical tools is available that help Hadoop deal efficiently with very large data sets. • Below are the top 10 Hadoop analytics tools for big data, discussed one by one.
  • 18. • Apache Spark is an open-source processing engine designed to make analytics operations easy. • It is a cluster computing platform designed to be fast and general purpose. • Spark covers batch applications, machine learning, streaming data processing, and interactive queries. Features of Spark: • In-memory processing • Tight integration of components • Easy and inexpensive • A powerful, fast processing engine • Spark Streaming provides a high-level library for stream processing
  • 19. • MapReduce is a programming model, comparable to an algorithm or a data structure, that runs on the YARN framework. • Its primary feature is distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop fast: when dealing with Big Data, serial processing is no longer practical. Features of MapReduce: • Scalable • Fault tolerance • Parallel processing • Tunable replication • Load balancing
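The MapReduce model can be illustrated with the classic word count, written here as plain single-process Python rather than against the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this sketch. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here we loop serially.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because each map call only sees its own line and each reduce call only sees one key, the work can be split across machines without the tasks depending on each other.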
  • 20. • Apache Hive is a data warehousing tool built on top of Hadoop; data warehousing means storing data gathered from various sources in one fixed location. • Hive is one of the best tools for data analysis on Hadoop. • Anyone who knows SQL can comfortably use Apache Hive. • The query language of Hive is known as HQL or HiveQL. Features of Hive: • Queries are similar to SQL queries. • Hive supports different storage types such as HBase, ORC, and plain text. • Hive has built-in functions for data mining and other tasks. • Hive operates on compressed data stored in the Hadoop ecosystem.
  • 21. • Apache Impala is an open-source SQL engine designed for Hadoop. • Impala overcomes the speed-related issues of Apache Hive with its faster processing. • Apache Impala uses the same kind of SQL syntax, ODBC driver, and user interface as Apache Hive. • Apache Impala can easily be integrated with Hadoop for data analytics. Features of Impala: • Easy integration • Scalability • Security • In-memory data processing
  • 22. • The name Mahout is taken from the Hindi word Mahavat, which means elephant rider. • Apache Mahout runs its algorithms on top of Hadoop, hence the name. • Mahout is mainly used for implementing machine learning algorithms on Hadoop, such as classification, collaborative filtering, and recommendation. • Apache Mahout can also run machine learning algorithms without Hadoop integration. Features of Mahout: • Used for machine learning applications • Provides vector and matrix libraries • Ability to analyze large datasets quickly
  • 23. • Pig was initially developed by Yahoo to make programming easier. • Apache Pig can process extensive datasets because it works on top of Hadoop. • Apache Pig is used for analyzing massive datasets by representing them as data flows. • Apache Pig also raises the level of abstraction for processing enormous datasets. • Pig Latin is the scripting language developers use with the Pig framework; scripts run on the Pig runtime. Features of Pig: • Easy to program • Rich set of operators • Ability to handle various kinds of data • Extensibility
  • 24. • HBase is a non-relational, distributed, column-oriented NoSQL database. HBase consists of tables, each with many data rows. • Each row has multiple column families, and each column family has columns that hold key-value pairs. • HBase runs on top of HDFS (Hadoop Distributed File System). • HBase is used for looking up small amounts of data within massive datasets. Features of HBase: • Linear and modular scalability • Java API for easy client access • Block cache for real-time data queries
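The table, row, column family, and column layout described on this slide can be modeled as nested dictionaries. This is an in-memory illustration of the data model only, not the HBase client API; `put`, `get`, `scan`, and the sample row keys are names invented for this sketch.

```python
def put(table, row_key, family, qualifier, value):
    """Store one cell at row key -> column family -> column qualifier."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get(table, row_key, family, qualifier, default=None):
    """Look up one cell; return `default` if any level is missing."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


def scan(table, start_row, stop_row):
    """Return rows with start_row <= key < stop_row, in sorted key order."""
    return [(key, table[key]) for key in sorted(table) if start_row <= key < stop_row]


table = {}
put(table, "user#001", "info", "name", "Asha")
put(table, "user#001", "info", "city", "Chennai")
put(table, "user#002", "info", "name", "Ravi")
put(table, "user#010", "stats", "logins", 7)

print(get(table, "user#001", "info", "name"))
print([key for key, _ in scan(table, "user#001", "user#003")])
```

Looking up a single cell by row key is what makes it cheap to find a small piece of data inside a very large table, and keeping rows ordered by key lets a range scan touch only the rows it needs.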
  • 25. • Sqoop is a command-line tool developed by Apache. • The primary purpose of Apache Sqoop is to import structured data from an RDBMS (relational database management system) such as MySQL, SQL Server, or Oracle into HDFS (Hadoop Distributed File System). • Sqoop can also export data from HDFS back to an RDBMS. Features of Sqoop: • Can import data into Hive or HBase • Connects to a database server • Controls parallelism
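"Controlling parallelism" in an import roughly means dividing the source table into key ranges so that several workers can each copy one range. The sketch below shows that idea only; `split_ranges`, the `id` column, and the generated `WHERE` clauses are assumptions for illustration and are not Sqoop's actual code or output.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks.

    Returns at most `num_mappers` (low, high) pairs of near-equal size.
    """
    total = max_id - min_id + 1
    if total <= 0 or num_mappers <= 0:
        raise ValueError("need a non-empty range and at least one mapper")
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # the first `extra` chunks get one more row
    ranges, start = [], min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_ranges(1, 10, 4)
# One hypothetical filter per worker, using an assumed numeric "id" column.
clauses = [f"id >= {low} AND id <= {high}" for low, high in ranges]
for clause in clauses:
    print(clause)
```

With four workers, ten rows are divided into ranges of 3, 3, 2, and 2 rows. Each worker can then read its range independently, so raising the worker count shortens the import as long as the database can serve the extra connections.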
  • 26. • Tableau is data visualization software used for data analytics and business intelligence. • It provides a variety of interactive visualizations to showcase insights from the data, can translate queries into visualizations, and can import data of all ranges and sizes. • Tableau offers rapid analysis and processing, generating useful charts on interactive dashboards and worksheets. Features of Tableau: • Supports bar charts, histograms, pie charts, motion charts, bullet charts, Gantt charts, and many more • Secure and robust • Interactive dashboards and worksheets
  • 27. • Apache Storm is a free, open-source, distributed real-time computation system built using programming languages such as Clojure and Java. • It can be used with many programming languages. • Apache Storm is used for stream processing, which it performs very fast. • Apache Storm uses daemons such as Nimbus, ZooKeeper, and Supervisor. • Apache Storm can be used for real-time processing, online machine learning, and more. Companies such as Yahoo, Spotify, and Twitter use Apache Storm. Features of Storm: • Easy to operate • Each node can process millions of tuples per second • Scalable and fault tolerant
  • 29. Common Hadoop Distributions • Open source: Apache • Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft Azure HDInsight (Beta)