
Unit-2
BIG DATA ANALYTICS
PREPARED BY
Dr M Mohammed Mustafa, Associate Professor
Department of AI & DS
SYLLABUS
OUTLINE
Introduction
Hadoop
History of Hadoop
RDBMS vs Hadoop
Distributed Computing Challenges
Key Aspects of Hadoop
INTRODUCTION
Why Hadoop?
Its capability to handle massive amounts of data, and different categories of data, fairly quickly.
Massive data storage
Faster data processing
HADOOP
Data can be managed with Hadoop.
Hadoop distributes the data and duplicates chunks of each data file across several nodes.
Locally available compute resources are used to process each chunk of data in parallel.
The Hadoop framework handles failover smartly and automatically.
HISTORY OF HADOOP
(Figure: history of Hadoop timeline)
RDBMS VS HADOOP
DISTRIBUTED COMPUTING CHALLENGES
In a distributed system, several servers are networked together.
Hardware Failure
Hadoop uses a Replication Factor (RF).
The Replication Factor connotes the number of copies of a given data item/data block stored across the network.
Processing Huge Volumes of Data
The key challenge is integrating the data.
Hadoop solves this using MapReduce programming, a programming model to process the data.
KEY ASPECTS OF HADOOP
Open-source software
It is free to download, use and contribute to.
Framework
Everything that you will need to develop and execute an application is provided: programs, tools, etc.
Distributed
Divides and stores data across multiple computers.
Computation/processing is done in parallel across multiple connected nodes.
Massive Storage
Stores large amounts of data across nodes of low-cost commodity hardware.
Faster Processing
Large amounts of data are processed in parallel, yielding quick responses.
HADOOP COMPONENTS
(Figure: Hadoop ecosystem and components)
Reference: https://www.turing.com/kb/hadoop-ecosystem-and-hadoop-components-for-big-data-proble
HDFS
Storage component of Hadoop.
Distributed file system, modeled after GFS (the Google File System).
Optimized for high throughput.
Replicates a file a configured number of times, which makes it tolerant of both software and hardware failures.
It sits on top of the native file system.
HADOOP COMPONENTS
HDFS, YARN, and MapReduce are the core components of the Hadoop ecosystem.
HDFS helps store structured, unstructured, and semi-structured data in large amounts.
It works as a single unit, as HDFS creates an abstraction over the resources. HDFS maintains log files about the metadata.
Files in HDFS are broken into block-sized chunks.
Each file is divided into blocks of 128 MB (configurable) and stored on different machines in the cluster (see the sketch below).
HDFS master/slave architecture
This architecture has two main components: NameNode and DataNode.
A single NameNode works as the master and multiple DataNodes perform the role of slaves. Both NameNode and DataNode are capable of running on commodity machines.
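As a rough illustration of the block and replication arithmetic described above, the sketch below (plain Python, not part of Hadoop) computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size and replication factor of 3 are assumed defaults; both are configurable in HDFS.

```python
import math

# Assumed defaults; both are configurable in HDFS (block size, replication factor).
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, total storage consumed in MB) for one file.

    The last block may be smaller than BLOCK_SIZE_MB; HDFS stores only the
    actual bytes, so the footprint is the file size times the replication factor.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_storage_mb = file_size_mb * REPLICATION_FACTOR
    return num_blocks, total_storage_mb

if __name__ == "__main__":
    blocks, storage = hdfs_footprint(1024)  # a 1 GB file
    print(f"1 GB file -> {blocks} blocks, ~{storage:.0f} MB stored across the cluster")
    # Expected: 8 blocks, ~3072 MB (3 copies of each block)
```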
HDFS
HADOOP USE CASE
Clickstream Data
Clickstream data (mouse clicks) helps you understand the purchasing behavior of customers.
Clickstream data helps online marketers optimize their products and improve their business.
Three key benefits:
Hadoop helps to join clickstream data with other sources such as CRM data (which includes demographics, sales, and ad campaigns), as sketched below.
Scalability: stores years of data (helps in year-over-year analysis).
Business analysts can use Pig or Hive for website analysis (visualization).
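The join benefit above can be pictured with a tiny, plain-Python sketch that enriches clickstream events with CRM attributes keyed by customer ID. The field names and sample records are hypothetical; on a real cluster this kind of join would typically be expressed as a Hive query, a Pig script, or a MapReduce job.

```python
# Toy illustration of joining clickstream events with CRM records on a
# customer ID. Field names and sample data are hypothetical.
clickstream = [
    {"customer_id": 1, "page": "/product/42", "action": "click"},
    {"customer_id": 2, "page": "/checkout", "action": "purchase"},
]
crm = {
    1: {"segment": "new", "region": "EU"},
    2: {"segment": "returning", "region": "US"},
}

# Enrich each click event with the customer's CRM attributes.
enriched = [
    {**event, **crm.get(event["customer_id"], {})}
    for event in clickstream
]

for row in enriched:
    print(row)
```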
MAPREDUCE PROGRAMMING
Software framework.
Helps to process massive amounts of data in parallel.
The input dataset is split into independent chunks.
Map tasks process these chunks in parallel.
The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.
The output of the mappers is automatically shuffled and sorted by the framework.
The MapReduce framework sorts the output based on keys. This sorted output becomes the input to the reduce tasks.
MAPREDUCE PROGRAMMING
The reduce task produces the reduced output by combining the output of the various mappers.
Job inputs and outputs are stored in a file system.
The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
HDFS and the MapReduce framework run on the same set of nodes, because this allows effective scheduling of tasks on the nodes where the data is present.
This in turn gives high throughput (a sketch of the map/shuffle/reduce flow follows).
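To make the flow described on these slides concrete, here is a minimal, single-process Python sketch of the map, shuffle/sort, and reduce phases for a word-count job. The function names are illustrative only and are not Hadoop APIs; a real job would distribute the same logic across the cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# --- Map phase: each map task turns its input chunk into (key, value) pairs.
def map_task(chunk: str) -> Iterator[tuple[str, int]]:
    for word in chunk.split():
        yield (word.lower(), 1)

# --- Shuffle/sort phase: the framework groups intermediate pairs by key.
def shuffle_and_sort(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))  # sorted by key, as the framework does

# --- Reduce phase: each reduce task combines all values for a key.
def reduce_task(key: str, values: list[int]) -> tuple[str, int]:
    return key, sum(values)

if __name__ == "__main__":
    # Input dataset split into independent chunks (one per map task).
    chunks = ["big data needs big storage", "hadoop processes big data"]

    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    grouped = shuffle_and_sort(intermediate)
    results = [reduce_task(k, v) for k, v in grouped.items()]

    for word, count in results:
        print(word, count)  # e.g. big 3, data 2, hadoop 1, ...
```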
MAPREDUCE PROGRAMMING
There are two daemons associated with MapReduce programming:
JobTracker (one per cluster)
TaskTracker (one per slave node)
JobTracker --> responsible for scheduling tasks for the TaskTrackers.
Provides connectivity between Hadoop and the client application.
TaskTracker --> executes the assigned tasks.
When the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes that the TaskTracker has failed and assigns the task to another TaskTracker (a toy heartbeat check is sketched below).
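Below is a toy, plain-Python sketch of the heartbeat-based failure handling described above. The class, method names, and timeout value are invented for illustration and do not mirror Hadoop's internal JobTracker implementation.

```python
import time

# Hypothetical timeout; Hadoop's actual expiry interval is configurable
# and is not reproduced here.
HEARTBEAT_TIMEOUT_S = 30.0

class JobTrackerSketch:
    """Tracks the last heartbeat of each TaskTracker and reassigns tasks."""

    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}  # tracker id -> timestamp
        self.assignments: dict[str, str] = {}       # task id -> tracker id

    def heartbeat(self, tracker_id: str) -> None:
        self.last_heartbeat[tracker_id] = time.monotonic()

    def assign(self, task_id: str, tracker_id: str) -> None:
        self.assignments[task_id] = tracker_id

    def reassign_from_dead_trackers(self) -> None:
        now = time.monotonic()
        dead = {t for t, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT_S}
        live = [t for t in self.last_heartbeat if t not in dead]
        for task_id, tracker_id in self.assignments.items():
            if tracker_id in dead and live:
                # Assume the tracker failed; hand its task to a live tracker.
                self.assignments[task_id] = live[0]
```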
MAPREDUCE PROGRAMMING
Input data is split into multiple pieces.
The framework creates a master and several worker processes and executes the worker processes remotely.
Several map tasks work simultaneously, each reading the pieces of data assigned to it.
The map worker uses a partitioner function to divide the data into regions.
The partitioner decides which reducer should get the output of a given mapper (see the sketch below).
When the map workers complete their work, the master instructs the reduce workers to begin their work.
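The partitioner's role can be sketched in a few lines: each intermediate key is hashed and mapped to one of R reducers. Hadoop's default HashPartitioner does the analogous thing with the key's Java hashCode(); Python's built-in hash() merely stands in for it here.

```python
# Illustrative partitioner: route each intermediate key to one of R reducers.
NUM_REDUCERS = 3

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    # Mask to a non-negative value, then take the remainder, so every key
    # lands on exactly one reducer.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

if __name__ == "__main__":
    for key in ["big", "data", "hadoop", "mapreduce"]:
        print(key, "-> reducer", partition(key))
```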
MAPREDUCE PROGRAMMING
The reduce workers in turn contact the map workers to get the key/value data for their partition.
The data thus received is shuffled and sorted by key.
The reduce worker then calls the reduce function for every unique key; this function writes the output to a file.
When all the reduce workers complete their work, the master transfers control back to the user program.