MODULE 1: Introduction to Big Data Analytics.pptx

Big Data Analytics
Dr. Madhura Phadke

Introduction : Big Data
• Each one of us generates data, which
contributes to generation of big data.
• Let’s consider some examples.

Data measuring units
• Brontobyte (BB) = 1024 yottabytes

• Phases of Big Data
– 1970 to 2000
– 2000 to 2010
– 2010 onwards

Who records and uses this data?
• ?

Application areas of Big Data
• Different sectors
– Different service providers

Data creation
• In 2024, approximately 402.74 million terabytes of data are generated daily.
• How much data is created every day per person?
• The amount of data created every day per person is approximately 0.0635 terabytes (5.18 billion users).

Data center : India
• Currently, India has about 151 data centres, ranking 14th in the world.
With 880 million users, India has witnessed a surge in data centre
investments.
• A data center serves as a network of computing and storage resources,
facilitating the delivery of shared software applications and data. These
centers play a crucial role in housing vast amounts of data, making them
essential for the seamless operations of both companies and consumers.
• Consequently, data center real estate, whether in the form of cloud,
colocation, or managed services, is expected to gain growing significance
on a global scale.
• Tata Communications Ltd
• Sify Technologies
• Web Werks India Pvt Ltd

Big Data characteristics
• 3 Vs: Volume, Velocity and Variety
• 5 Vs: Velocity, Volume, Value, Variety and Veracity
• 13 Vs: Volume, Velocity, Value, Variety, Veracity, Validity, Visualization, Virality, Variability, Volatility, Venue, Vocabulary, Vagueness

What is Hadoop?
• Hadoop:
• an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large
and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data

03/24/2025 20
Uses for Hadoop
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis

Who Uses Hadoop?

The Hadoop Ecosystem
• Hadoop Common: contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
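
The MapReduce programming model can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and on a real cluster the framework runs the map and reduce steps in parallel across nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Each phase only sees keys and values, which is what lets the framework split the work across many machines.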

Hadoop Framework Tools

• Storing data: HDFS, HBase (NoSQL database)
• Processing data: MapReduce
• Querying data: Pig, Hive, Apache Drill
• Machine learning: Mahout, Spark MLlib
• Managing the cluster: ZooKeeper
• Ingesting data: Flume, Sqoop
• Searching and indexing: Apache Solr and Lucene
• Provisioning, monitoring and maintaining the cluster: Ambari

Failure handling
• If the active NameNode fails, a standby can take over very quickly because it has the latest state of the metadata. ZooKeeper helps in switching between the active and the standby NameNodes. The NameNode keeps a reference to every file and block in memory.
• A 'heartbeat' is a signal sent periodically from a DataNode to the NameNode. This signal is taken as a sign of vitality. If the heartbeats stop arriving, it is understood that there are health issues or technical problems with the DataNode or the TaskTracker.
• The default heartbeat interval is 3 seconds. If the NameNode does not receive any heartbeat from a DataNode for about 10 minutes, a 'Heartbeat Lost' condition occurs and the corresponding DataNode is deemed dead/unavailable.
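
The heartbeat timeout can be sketched as follows. This is an illustrative model, not NameNode code: the HeartbeatMonitor class is hypothetical, and the timeout formula (2 x recheck interval + 10 x heartbeat interval, with a 5-minute recheck interval) is stated here as an assumption about the default HDFS settings. Under that assumption the timeout works out to 630 seconds, i.e. roughly the 10 minutes quoted above.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_s=300):
    """Seconds of silence after which a DataNode is treated as dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of DataNode heartbeats."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of its last heartbeat (seconds)

    def record_heartbeat(self, node_id, now_s):
        """Remember when a DataNode last reported in."""
        self.last_seen[node_id] = now_s

    def dead_nodes(self, now_s):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(dead_node_timeout_seconds())
    monitor.record_heartbeat("dn1", now_s=0)
    monitor.record_heartbeat("dn2", now_s=0)
    monitor.record_heartbeat("dn1", now_s=600)  # dn1 keeps reporting, dn2 goes silent
    print(monitor.dead_nodes(now_s=700))
```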

Rack awareness
• There should not be more than 1 replica of a block on the same DataNode.
• More than 2 replicas of a single block are not allowed on the same rack.
• The number of racks used to store a block's replicas should be smaller than the number of replicas.
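
The three rules can be checked against a small placement sketch. The code below assumes a replication factor of 3 and a placement order of: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The topology, node names and function names are hypothetical, and the sketch assumes the remote rack has at least two nodes.

```python
from collections import Counter


def rack_of(topology):
    """Invert a rack -> nodes mapping into a node -> rack mapping."""
    return {node: rack for rack, nodes in topology.items() for node in nodes}


def place_replicas(topology, writer_node):
    """Choose three DataNodes for one block (replication factor 3 assumed)."""
    local_rack = rack_of(topology)[writer_node]
    chosen = [writer_node]  # replica 1: the writer's own node
    remote_rack = next(rack for rack in topology if rack != local_rack)
    chosen.append(topology[remote_rack][0])  # replica 2: a node on another rack
    # replica 3: a different node on the same remote rack
    chosen.append(next(n for n in topology[remote_rack] if n not in chosen))
    return chosen


def satisfies_rules(chosen, topology):
    """Check the three rack-awareness rules for one block's replicas."""
    per_rack = Counter(rack_of(topology)[node] for node in chosen)
    return (
        len(set(chosen)) == len(chosen)      # at most 1 replica per DataNode
        and max(per_rack.values()) <= 2      # at most 2 replicas per rack
        and len(per_rack) < len(chosen)      # fewer racks used than replicas
    )


if __name__ == "__main__":
    topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
    replicas = place_replicas(topology, "dn1")
    print(replicas, satisfies_rules(replicas, topology))
```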

• Hadoop Clusters are also known as Shared-
nothing systems because nothing is shared
between the nodes in the cluster except the
network bandwidth. This decreases the
processing latency.
• Thus, when there is a need to process queries
on the huge amount of data, the cluster-wide
latency is minimized.

Hadoop Cluster
The master in a Hadoop cluster is a high-powered machine with a large amount of memory and CPU.
ResourceManager
• is the master daemon of YARN.
• It keeps track of live and dead nodes in the
cluster.
NodeManager
• is the slave daemon of YARN.
• It is responsible for containers, monitoring their
resource usage (such as CPU, disk, memory,
network) and reporting the same to the
ResourceManager.
• The NodeManager also checks the health of
the node on which it is running.

Hadoop’s Architecture
• Distributed, with some centralization
• The main nodes of the cluster are where most of the computational power and storage of the system lie
• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store the needed blocks as close as possible
• The central control node runs the NameNode to keep track of HDFS directories & files, and the JobTracker to dispatch compute tasks to the TaskTrackers
• Written in Java, also supports Python and Ruby

Limitations of Hadoop
• Not suited for small files
• It cannot reliably handle live (real-time) data
• Slow processing speed
• Not efficient for iterative processing
• Not efficient for caching

Hadoop Ecosystem Tools
• YARN
– The brain of the Hadoop ecosystem.
– It performs all processing activities by allocating resources and scheduling tasks.
– Components: ResourceManager and NodeManager; the ResourceManager in turn consists of the Scheduler and the ApplicationsManager.

• Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
• 10 lines of Pig Latin = approx. 200 lines of MapReduce Java code.
• First, the LOAD command loads the data. Then we perform various operations on it, such as grouping, filtering, joining and sorting. Finally, we can either dump the result to the screen or store it back in HDFS.
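
The load, transform, dump flow can be mimicked in plain Python. This is an analogy only, not Pig Latin and not the Pig runtime: the record layout ("user,amount"), the sample data and the function names are made up for illustration.

```python
from collections import defaultdict


def load(lines):
    """LOAD step: parse 'user,amount' text lines into records."""
    records = []
    for line in lines:
        user, amount = line.split(",")
        records.append({"user": user, "amount": float(amount)})
    return records


def run_pipeline(lines, min_amount):
    """Filter, group, aggregate and sort the loaded records."""
    records = load(lines)
    kept = [r for r in records if r["amount"] >= min_amount]  # filtering
    totals = defaultdict(float)
    for record in kept:                                       # grouping + summing
        totals[record["user"]] += record["amount"]
    # sorting: largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = ["asha,120.0", "ravi,40.0", "asha,30.0", "meena,200.0"]
    for row in run_pipeline(sample, min_amount=35.0):         # dump to the screen
        print(row)
```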

APACHE HIVE
• It has 2 basic components: Hive Command
Line and JDBC/ODBC driver.
• It supports all primitive data types of SQL.

APACHE MAHOUT
• It has a predefined library that already contains built-in algorithms for different use cases.
• These cover collaborative filtering, clustering and classification.

APACHE SPARK
• A framework for real-time data analytics.
• It executes in-memory computations to increase the speed of data processing over MapReduce.
• It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations. It therefore requires more processing power than MapReduce.

APACHE ZOOKEEPER
• Apache ZooKeeper coordinates the various services in a distributed environment.
• It provides synchronization, configuration maintenance, grouping and naming.

APACHE OOZIE
• Acts as a clock and alarm service inside the Hadoop ecosystem.
• There are two kinds of Oozie jobs:
– Oozie workflow: a sequential set of actions to be executed. Think of it as a relay race, where each athlete waits for the previous one to complete their part.
– Oozie coordinator: Oozie jobs that are triggered when the data is made available. Think of this as the stimulus-response system in our body: just as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.
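
The difference between the two job kinds can be sketched in a few lines. This is a conceptual model in plain Python, not the Oozie API or its XML job definitions: run_workflow, run_coordinator and the dataset name are hypothetical.

```python
def run_workflow(actions):
    """Workflow: run the actions strictly in order, one after the other."""
    results = []
    for action in actions:
        results.append(action())  # the next action starts only after this returns
    return results


def run_coordinator(available_datasets, required_dataset, workflow_actions):
    """Coordinator: trigger the workflow only if the required data is available."""
    if required_dataset not in available_datasets:
        return None  # data not there yet: the coordinator rests
    return run_workflow(workflow_actions)


if __name__ == "__main__":
    actions = [lambda: "extract", lambda: "transform", lambda: "load"]
    print(run_coordinator(set(), "logs/day1", actions))          # nothing to do yet
    print(run_coordinator({"logs/day1"}, "logs/day1", actions))  # data arrived
```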

APACHE FLUME
• Ingesting data
• A service for collecting, aggregating and moving large amounts of data.
• It helps us to ingest online streaming data from various sources, such as network traffic, social media, email messages and log files, into HDFS.
• The Flume agent has 3 components: source, sink and channel.
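
How the three components of an agent fit together can be shown with a toy model. This is a conceptual sketch in plain Python, not Flume's API or configuration format: the class names mirror the component names, and a Python list stands in for HDFS as the destination.

```python
from collections import deque


class Channel:
    """Buffer that holds events between the source and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        return self._events.popleft() if self._events else None


class Source:
    """Receives incoming data (here, log lines) and puts events on the channel."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, lines):
        for line in lines:
            self.channel.put({"body": line})


class Sink:
    """Drains the channel and writes each event to the destination."""

    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination  # a list standing in for HDFS

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.destination.append(event["body"])
            event = self.channel.take()


if __name__ == "__main__":
    channel, storage = Channel(), []
    Source(channel).receive(["GET /index", "GET /about"])
    Sink(channel, storage).drain()
    print(storage)
```

The channel decouples the two ends, so the source can keep accepting events even when the sink is temporarily slower.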

APACHE SQOOP
• Flume only ingests unstructured or semi-structured data into HDFS.
• Sqoop, in contrast, can both import and export structured data between an RDBMS or enterprise data warehouse and HDFS.
• When we submit a Sqoop command, the main task is divided into sub-tasks, each of which is handled internally by an individual map task.
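
One way to picture the division into sub-tasks is to split a table's key range into contiguous slices, one per map task. The sketch below is illustrative only and is not Sqoop's actual splitting logic: split_ranges and its parameters are hypothetical, and it assumes an integer key column with known minimum and maximum values.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous sub-ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            break  # more mappers than rows: leave the rest idle
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # Each tuple is the (first_id, last_id) slice one map task would import.
    print(split_ranges(1, 10, 4))
```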

Solr, Lucene, Ambari
• Apache Solr and Apache Lucene are the two services used for searching and indexing in the Hadoop ecosystem.
• Ambari is an Apache Software Foundation project that aims to make the Hadoop ecosystem more manageable.
– It includes software for provisioning, managing
and monitoring Apache Hadoop clusters.

APACHE DRILL
• The main power of Apache Drill lies in combining a variety of data stores using just a single query.
• Supported data stores include Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.

APACHE HBASE
• NoSQL database.
• It supports all types of data and that is why,
it’s capable of handling anything and
everything inside a Hadoop ecosystem.

• Kafka: an open-source message broker. Using Kafka, we can handle feeds with high throughput and low latency.
• Storm: a real-time, stream-based processing framework. (It does for stream processing what MapReduce does for batch processing.)

• https://www.mentimeter.com/app/presentation/n/alyp2c5nuf9khbrcdso4wwx94ucuc39y/edit?question=dci9typyikba
