Big Data Analytics
Dr. Madhura Phadke
Module 1
Introduction : Big Data
• Each one of us generates data, which
contributes to the generation of big data.
• Let’s consider some examples.
MODULE 1: Introduction to Big Data Analytics.pptx
Data measuring units
• Brontobyte (BB) = 1024 yottabytes
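The unit ladder behind this entry can be sketched in a few lines of Python. This is only an illustration: it assumes every unit is 1024 times the previous one, and the list of unit names is filled in from common usage rather than from the slide. Note that "brontobyte" is an informal name, not a standardized unit.

```python
# Binary data units: each unit is assumed to be 1024 times the previous one.
# "brontobyte" is an informal name and not part of any official standard.
UNITS = [
    "byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte", "brontobyte",
]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit`, using 1024 as the step."""
    return 1024 ** UNITS.index(unit)


if __name__ == "__main__":
    for position, unit in enumerate(UNITS):
        print(f"1 {unit} = 1024^{position} bytes")
```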
• Phases of Big Data
– 1970 to 2000
– 2000 to 2010
– 2010 onwards
Who records and uses this data?
• ?
Application areas of Big Data
• Different sectors
– Different service providers
Different sectors
Data creation
• In 2024, approximately 402.74 million terabytes of data are generated daily.
• How much data is created every day per person?
• The amount of data created every day per person is approximately 0.0635 terabytes,
an estimate based on 5.18 billion users.
Data center : India
• Currently, India has about 151 data centres, ranking 14th in the world.
With 880 million users, India has witnessed a surge in data centre
investments.
• A data center serves as a network of computing and storage resources,
facilitating the delivery of shared software applications and data. These
centers play a crucial role in housing vast amounts of data, making them
essential for the seamless operations of both companies and consumers.
• Consequently, data center real estate, whether in the form of cloud,
colocation, or managed services, is expected to gain growing significance
on a global scale.
• Tata Communications Ltd
• Sify Technologies
• Web Werks India Pvt Ltd
Big Data characteristics
• 3 Vs: Volume, Velocity and Variety
• 5 Vs: Velocity, Volume, Value, Variety and Veracity
• Extended list: Volume, Velocity, Value, Variety, Veracity,
Validity, Visualization, Virality, Variability,
Volatility, Venue, Vocabulary, Vagueness
Types of Big Data
•
Hadoop
What is Hadoop?
• Hadoop:
– An open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
• Goals / Requirements:
– Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
– Structured and unstructured data
– Simple programming models
– High scalability and availability
– Use commodity (cheap!) hardware with little redundancy
– Fault tolerance
– Move computation rather than data
03/24/2025 20
Uses for Hadoop
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
Who Uses Hadoop?
The Hadoop Ecosystem
• Hadoop Common: contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
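To make the MapReduce programming model concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the example only mimics the map, shuffle and reduce steps on an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on in-memory data."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce calls would run as separate tasks on different nodes, with the framework performing the shuffle over the network.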
Hadoop Framework Tools
Hadoop Ecosystem
• Storing data: HDFS, HBase (NoSQL database)
• Processing data: MapReduce
• Querying data: Pig, Hive, Apache Drill
• Machine learning: Mahout, Spark MLlib
• Managing the cluster: ZooKeeper
• Ingesting data: Flume, Sqoop
• Searching and indexing: Apache Solr and Lucene
• Provisioning, monitoring and maintaining the cluster: Ambari
HDFS ARCHITECTURE
Failure handling
• If the active NameNode fails, a standby NameNode can take over very quickly
because it has the latest state of the metadata. ZooKeeper helps in
switching between the active and the standby NameNodes. The
NameNode keeps a reference to every file and block in memory.
• A 'heartbeat' is a signal that each DataNode sends periodically to the NameNode.
This signal is taken as a sign of vitality. If the heartbeats stop arriving,
the NameNode concludes that the DataNode has health or technical problems.
(In MapReduce v1, TaskTrackers report to the JobTracker with heartbeats in the same way.)
• The default heartbeat interval is 3 seconds. If the NameNode does
not receive any heartbeat from a DataNode for roughly 10 minutes
(10 minutes 30 seconds with the default settings), a 'Heartbeat Lost'
condition occurs and the corresponding DataNode is deemed to be dead/unavailable.
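The timeout arithmetic can be sketched as follows. The formula (twice the recheck interval plus ten heartbeat intervals) and the two property names in the comments reflect my understanding of the HDFS defaults and should be checked against the `hdfs-default.xml` of the Hadoop version in use. The function names are hypothetical.

```python
from typing import Optional


def datanode_dead_timeout_seconds(heartbeat_interval_s: float = 3.0,
                                  recheck_interval_ms: int = 300_000) -> float:
    """Seconds of silence after which a DataNode is considered dead.

    Assumed defaults: dfs.heartbeat.interval = 3 (seconds) and
    dfs.namenode.heartbeat.recheck-interval = 300000 (milliseconds).
    """
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s


def is_dead(last_heartbeat_s: float, now_s: float,
            timeout_s: Optional[float] = None) -> bool:
    """Return True if no heartbeat has arrived within the timeout window."""
    if timeout_s is None:
        timeout_s = datanode_dead_timeout_seconds()
    return (now_s - last_heartbeat_s) > timeout_s


if __name__ == "__main__":
    timeout = datanode_dead_timeout_seconds()
    print(f"Timeout: {timeout:.0f} s ({timeout / 60:.1f} minutes)")
```

With the assumed defaults this gives 630 seconds, which is why the timeout is usually quoted as "about 10 minutes".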
HDFS Racks
Hadoop Cluster
Rack awareness
• There should not be more than one replica of a block on the same DataNode.
• No more than two replicas of a single block are allowed on the same rack.
• The number of racks used to store a block's replicas should be smaller than the number of replicas.
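The three rules can be expressed as a small checker. This is a hedged sketch and not the HDFS block placement policy itself: the function name, the node names and the rack mapping are all hypothetical, and the third rule is interpreted per block (racks holding that block's replicas versus its replica count).

```python
from collections import Counter
from typing import Dict, List


def placement_violations(replica_nodes: List[str],
                         node_to_rack: Dict[str, str]) -> List[str]:
    """Check one block's replica placement against the three rules above."""
    problems: List[str] = []

    # Rule 1: at most one replica of the block per DataNode.
    for node, count in Counter(replica_nodes).items():
        if count > 1:
            problems.append(f"{count} replicas on node {node}")

    # Rule 2: at most two replicas of the block per rack.
    rack_counts = Counter(node_to_rack[node] for node in replica_nodes)
    for rack, count in rack_counts.items():
        if count > 2:
            problems.append(f"{count} replicas on rack {rack}")

    # Rule 3: fewer racks in use than replicas (only meaningful for 2+ replicas).
    if len(replica_nodes) > 1 and len(rack_counts) >= len(replica_nodes):
        problems.append("replicas are spread over too many racks")

    return problems


if __name__ == "__main__":
    topology = {"dn1": "rackA", "dn2": "rackB", "dn3": "rackB", "dn4": "rackC"}
    print(placement_violations(["dn1", "dn2", "dn3"], topology))
    print(placement_violations(["dn1", "dn2", "dn4"], topology))
```

With a replication factor of 3, the rules together imply that the replicas end up on exactly two racks: one replica on one rack and two on another.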
• Hadoop clusters are also known as shared-nothing systems because nothing is shared
between the nodes in the cluster except the network bandwidth. This decreases the
processing latency.
• Thus, when queries must be processed over a huge amount of data, the cluster-wide
latency is minimized.
Hadoop Cluster
• The master in a Hadoop cluster is a high-powered machine with a large amount
of memory and CPU.
ResourceManager
• is the master daemon of YARN.
• It keeps track of live and dead nodes in the
cluster.
NodeManager
• is the slave daemon of YARN.
• It is responsible for containers, monitoring their
resource usage (such as CPU, disk, memory,
network) and reporting the same to the
ResourceManager.
• The NodeManager also checks the health of
the node on which it is running.
Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and
storage of the system lie
• Main nodes run a TaskTracker to accept and reply to MapReduce tasks,
and also a DataNode to store the needed blocks as close to the computation as possible
• The central control node runs the NameNode to keep track of HDFS directories
and files, and the JobTracker to dispatch compute tasks to the TaskTrackers
(in Hadoop 2 and later, YARN's ResourceManager and NodeManagers take over these roles)
• Written in Java; other languages such as Python and Ruby are also supported
(for example, through Hadoop Streaming)
Limitations of Hadoop
• Not suited for small files
• It cannot reliably handle live (real-time) data
• Slow processing speed
• Not efficient for iterative processing
• Not efficient for caching
Hadoop Ecosystem Tools
• YARN
– The brain of the Hadoop ecosystem.
– It performs all processing activities by allocating resources and scheduling tasks.
– Components: ResourceManager and NodeManager. The ResourceManager itself
consists of the Scheduler and the ApplicationsManager.
• Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
• 10 lines of Pig Latin are roughly equivalent to 200 lines of MapReduce Java code.
• First, the load command loads the data. Then we perform various operations on it,
such as grouping, filtering, joining and sorting. Finally, we can either dump the data
to the screen or store the result back in HDFS.
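The load, transform and dump sequence described above can be imitated in plain Python to show the shape of such a dataflow. This is an analogy only and is not Pig Latin: the sample data, the column names and every function name are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: user, category, amount (stands in for a file in HDFS).
SAMPLE = "alice,books,12.5\nbob,games,30.0\nalice,games,7.5\n"


def load(text: str) -> List[dict]:
    """Load step: parse raw comma-separated text into rows."""
    reader = csv.reader(io.StringIO(text))
    return [{"user": u, "category": c, "amount": float(a)} for u, c, a in reader]


def filter_rows(rows: List[dict], min_amount: float) -> List[dict]:
    """Filter step: keep rows whose amount is at least `min_amount`."""
    return [row for row in rows if row["amount"] >= min_amount]


def group_by(rows: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group step: collect rows that share the same value for `key`."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def total_per_group(groups: Dict[str, List[dict]]) -> List[Tuple[str, float]]:
    """Aggregate and sort step: sum amounts per group, ordered by group key."""
    return sorted((k, sum(r["amount"] for r in rows)) for k, rows in groups.items())


if __name__ == "__main__":
    rows = filter_rows(load(SAMPLE), min_amount=10.0)
    for user, total in total_per_group(group_by(rows, "user")):
        print(user, total)  # the "dump to screen" step
```

In Pig, each of these steps would be a single Pig Latin statement, and the runtime would translate the whole script into MapReduce jobs.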
APACHE HIVE
• It has 2 basic components: Hive Command
Line and JDBC/ODBC driver.
• It supports all primitive data types of SQL.
APACHE MAHOUT
• It has a predefined library that already contains built-in algorithms for
different use cases.
• These include collaborative filtering, clustering and classification.
APACHE SPARK
• A framework for real-time data analytics.
• It executes computations in memory to increase the speed of data processing
over MapReduce.
• It can be up to 100x faster than Hadoop MapReduce for large-scale data processing
by exploiting in-memory computations and other optimizations. Therefore, it requires
more processing power than MapReduce.
APACHE ZOOKEEPER
• Apache ZooKeeper coordinates the various services in a distributed environment.
• It provides synchronization, configuration maintenance, grouping and naming.
APACHE OOZIE
• Acts as a clock and alarm service inside the Hadoop ecosystem.
• There are two kinds of Oozie jobs:
– Oozie workflow: a sequential set of actions to be executed. Think of it as a
relay race, where each athlete waits for the previous one to complete their part.
– Oozie coordinator: jobs that are triggered when data becomes available. Think of
this as the stimulus-response system in our body: just as we respond to an external
stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.
APACHE FLUME
• Ingesting data
• A service for collecting, aggregating and moving large amounts of data.
• It helps us ingest online streaming data from various sources, such as network
traffic, social media, email messages and log files, into HDFS.
• The Flume agent has 3 components: source, sink and channel.
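How the three components of an agent fit together can be sketched in plain Python. This does not use Flume or its configuration format: the function names are hypothetical, the channel is just an in-memory queue, and a Python list stands in for HDFS as the destination.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def source(lines: Iterable[str], channel: Deque[str]) -> None:
    """Source: receive raw events and put the non-empty ones on the channel."""
    for line in lines:
        event = line.strip()
        if event:
            channel.append(event)


def sink(channel: Deque[str], write: Callable[[str], None]) -> int:
    """Sink: take events off the channel in arrival order and deliver them."""
    delivered = 0
    while channel:
        write(channel.popleft())
        delivered += 1
    return delivered


def run_agent(lines: Iterable[str]) -> List[str]:
    """Wire source -> channel -> sink; the list stands in for HDFS."""
    channel: Deque[str] = deque()  # the channel buffers events between the two ends
    stored: List[str] = []
    source(lines, channel)
    sink(channel, stored.append)
    return stored


if __name__ == "__main__":
    print(run_agent(["GET /a\n", "\n", "GET /b\n"]))
```

The channel decouples the two ends: the source can keep accepting events even when the sink is temporarily slower, because events wait in the buffer.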
APACHE SQOOP
• Flume only ingests unstructured or semi-structured data into HDFS.
• Sqoop, in contrast, can both import structured data from an RDBMS or enterprise
data warehouse into HDFS and export it back.
• When we submit a Sqoop command, the main task is divided into subtasks, each of
which is handled internally by an individual map task.
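The idea of dividing one import into per-mapper subtasks can be illustrated by splitting a numeric key range into contiguous chunks. This is a simplified sketch, not Sqoop's actual splitting code: the function name is hypothetical, the key is assumed to be an integer column, and the default of 4 mappers reflects my understanding of Sqoop's default rather than anything stated in the slides.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int,
                 num_mappers: int = 4) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks.

    Each chunk would be read by one map task, e.g. with a query such as
    "... WHERE id >= lo AND id <= hi".
    """
    if max_id < min_id or num_mappers < 1:
        raise ValueError("need min_id <= max_id and at least one mapper")

    total = max_id - min_id + 1
    chunks = min(num_mappers, total)      # never create empty chunks
    base, extra = divmod(total, chunks)   # the first `extra` chunks get one more key

    ranges: List[Tuple[int, int]] = []
    lo = min_id
    for index in range(chunks):
        size = base + (1 if index < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


if __name__ == "__main__":
    for lo, hi in split_ranges(1, 100):
        print(f"map task reads rows with id between {lo} and {hi}")
```

Because the chunks do not overlap, the map tasks can run in parallel without reading any row twice.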
Solr, Lucene, Ambari
• Apache Solr and Apache Lucene are the two
services used for searching and
indexing in the Hadoop ecosystem.
• Ambari is an Apache Software Foundation
project that aims to make the Hadoop
ecosystem more manageable.
– It includes software for provisioning, managing
and monitoring Apache Hadoop clusters.
APACHE DRILL
• The main power of Apache Drill lies in combining a variety of data stores
in a single query.
• Supported data stores include Azure Blob Storage, Google Cloud Storage,
HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.
APACHE HBASE
• NoSQL database.
• It supports all types of data, which is why
it is capable of handling anything and
everything inside a Hadoop ecosystem.
• Kafka: an open-source message broker. Using Kafka, we can handle feeds with
high throughput and low latency.
• Storm: a real-time, stream-based processing framework. (It does for real-time
streams what MapReduce does for batch processing.)