
Unit-2
BIG DATA ANALYTICS
PREPARED BY
Dr M Mohammed Mustafa, Associate Professor
Department of AI & DS
SYLLABUS
OUTLINE
Introduction
Hadoop
History of Hadoop
RDBMS vs Hadoop
Distributed Computing Challenges
Key Aspects of Hadoop
INTRODUCTION
Why Hadoop?
Its capability to handle massive amounts of data, and different categories of data, fairly quickly.
Massive data storage
Faster data processing
HADOOP
Data can be managed with Hadoop.
Hadoop distributes the data and duplicates chunks of each data file across several nodes.
Locally available compute resources are used to process each chunk of data in parallel.
The Hadoop framework handles failover smartly and automatically.
HISTORY OF HADOOP
(Figure: history of Hadoop timeline)
RDBMS VS HADOOP
DISTRIBUTED COMPUTING CHALLENGES
In a distributed system, several servers are networked together.
Hardware Failure
Hadoop uses a Replication Factor (RF).
The Replication Factor connotes the number of copies of a given data item/data block stored across the network.
Processing Huge Volumes of Data
The key challenge is integrating the data.
Hadoop solves this using MapReduce programming, a programming model to process the data.
KEY ASPECTS OF HADOOP
Open-source software
It is free to download, use and contribute to.
Framework
Everything that you will need to develop and execute an application is provided: programs, tools, etc.
Distributed
Divides and stores data across multiple computers.
Computation/processing is done in parallel across multiple connected nodes.
Massive Storage
Stores large amounts of data across nodes of low-cost commodity hardware.
Faster Processing
Large amounts of data are processed in parallel, yielding quick responses.
HADOOP COMPONENTS
(Figure: Hadoop ecosystem and components)
Reference: https://www.turing.com/kb/hadoop-ecosystem-and-hadoop-components-for-big-data-proble
HDFS
Storage component of Hadoop.
Distributed file system, modeled after GFS (the Google File System).
Optimized for high throughput.
Replicates a file a configured number of times, which makes it tolerant of both software and hardware failures.
It sits on top of the native file system.
HADOOP COMPONENTS
HDFS, YARN, and MapReduce are the core components of the Hadoop ecosystem.
HDFS helps store structured, unstructured, and semi-structured data in large amounts.
It works as a single unit, as HDFS creates an abstraction over the resources. HDFS maintains log files about the metadata.
Files in HDFS are broken into block-sized chunks.
Each file is divided into blocks of 128 MB (configurable) and stored on different machines in the cluster (see the sketch below).
HDFS master/slave architecture
This architecture has two main components: NameNode and DataNode.
A single NameNode works as the master and multiple DataNodes perform the role of slaves. Both NameNode and DataNode are capable of running on commodity machines.
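As a rough illustration of the block and replication arithmetic described above, the sketch below (plain Python, not part of Hadoop) computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size and replication factor of 3 are assumed defaults; both are configurable in HDFS.

```python
import math

# Assumed defaults; both are configurable in HDFS (block size, replication factor).
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, total storage consumed in MB) for one file.

    The last block may be smaller than BLOCK_SIZE_MB; HDFS stores only the
    actual bytes, so the footprint is the file size times the replication factor.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_storage_mb = file_size_mb * REPLICATION_FACTOR
    return num_blocks, total_storage_mb

if __name__ == "__main__":
    blocks, storage = hdfs_footprint(1024)  # a 1 GB file
    print(f"1 GB file -> {blocks} blocks, ~{storage:.0f} MB stored across the cluster")
    # Expected: 8 blocks, ~3072 MB (3 copies of each block)
```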
HDFS
HADOOP USE CASE
Clickstream Data
Clickstream data (mouse clicks) helps you understand the purchasing behavior of customers.
Clickstream data helps online marketers optimize their products and improve their business.
Three key benefits:
Hadoop helps to join clickstream data with other sources such as CRM data (which includes demographics, sales, and ad campaigns), as sketched below.
Scalability: stores years of data (helps in year-over-year analysis).
Business analysts can use Pig or Hive for website analysis (visualization).
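The join benefit above can be pictured with a tiny, plain-Python sketch that enriches clickstream events with CRM attributes keyed by customer ID. The field names and sample records are hypothetical; on a real cluster this kind of join would typically be expressed as a Hive query, a Pig script, or a MapReduce job.

```python
# Toy illustration of joining clickstream events with CRM records on a
# customer ID. Field names and sample data are hypothetical.
clickstream = [
    {"customer_id": 1, "page": "/product/42", "action": "click"},
    {"customer_id": 2, "page": "/checkout", "action": "purchase"},
]
crm = {
    1: {"segment": "new", "region": "EU"},
    2: {"segment": "returning", "region": "US"},
}

# Enrich each click event with the customer's CRM attributes.
enriched = [
    {**event, **crm.get(event["customer_id"], {})}
    for event in clickstream
]

for row in enriched:
    print(row)
```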
MAPREDUCE PROGRAMMING
Software framework.
Helps to process massive amounts of data in parallel.
The input dataset is split into independent chunks.
Map tasks process these chunks in parallel.
The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.
The output of the mappers is automatically shuffled and sorted by the framework.
The MapReduce framework sorts the output based on keys. This sorted output becomes the input to the reduce tasks.
MAPREDUCE PROGRAMMING
The reduce task produces the reduced output by combining the output of the various mappers.
Job inputs and outputs are stored in a file system.
The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
HDFS and the MapReduce framework run on the same set of nodes, because this allows effective scheduling of tasks on the nodes where the data is present.
This in turn gives high throughput (a sketch of the map/shuffle/reduce flow follows).
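To make the flow described on these slides concrete, here is a minimal, single-process Python sketch of the map, shuffle/sort, and reduce phases for a word-count job. The function names are illustrative only and are not Hadoop APIs; a real job would distribute the same logic across the cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# --- Map phase: each map task turns its input chunk into (key, value) pairs.
def map_task(chunk: str) -> Iterator[tuple[str, int]]:
    for word in chunk.split():
        yield (word.lower(), 1)

# --- Shuffle/sort phase: the framework groups intermediate pairs by key.
def shuffle_and_sort(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))  # sorted by key, as the framework does

# --- Reduce phase: each reduce task combines all values for a key.
def reduce_task(key: str, values: list[int]) -> tuple[str, int]:
    return key, sum(values)

if __name__ == "__main__":
    # Input dataset split into independent chunks (one per map task).
    chunks = ["big data needs big storage", "hadoop processes big data"]

    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    grouped = shuffle_and_sort(intermediate)
    results = [reduce_task(k, v) for k, v in grouped.items()]

    for word, count in results:
        print(word, count)  # e.g. big 3, data 2, hadoop 1, ...
```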
MAPREDUCE PROGRAMMING
There are two daemons associated with MapReduce programming:
JobTracker (one per cluster)
TaskTracker (one per slave node)
JobTracker --> responsible for scheduling tasks for the TaskTrackers.
Provides connectivity between Hadoop and the client application.
TaskTracker --> executes the assigned tasks.
When the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes that the TaskTracker has failed and assigns the task to another TaskTracker (a toy heartbeat check is sketched below).
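Below is a toy, plain-Python sketch of the heartbeat-based failure handling described above. The class, method names, and timeout value are invented for illustration and do not mirror Hadoop's internal JobTracker implementation.

```python
import time

# Hypothetical timeout; Hadoop's actual expiry interval is configurable
# and is not reproduced here.
HEARTBEAT_TIMEOUT_S = 30.0

class JobTrackerSketch:
    """Tracks the last heartbeat of each TaskTracker and reassigns tasks."""

    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}  # tracker id -> timestamp
        self.assignments: dict[str, str] = {}       # task id -> tracker id

    def heartbeat(self, tracker_id: str) -> None:
        self.last_heartbeat[tracker_id] = time.monotonic()

    def assign(self, task_id: str, tracker_id: str) -> None:
        self.assignments[task_id] = tracker_id

    def reassign_from_dead_trackers(self) -> None:
        now = time.monotonic()
        dead = {t for t, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT_S}
        live = [t for t in self.last_heartbeat if t not in dead]
        for task_id, tracker_id in self.assignments.items():
            if tracker_id in dead and live:
                # Assume the tracker failed; hand its task to a live tracker.
                self.assignments[task_id] = live[0]
```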
MAPREDUCE PROGRAMMING
Input data is split into multiple pieces.
The framework creates a master and several worker processes and executes the worker processes remotely.
Several map tasks work simultaneously, each reading the pieces of data assigned to it.
The map worker uses a partitioner function to divide the data into regions.
The partitioner decides which reducer should get the output of a given mapper (see the sketch below).
When the map workers complete their work, the master instructs the reduce workers to begin their work.
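The partitioner's role can be sketched in a few lines: each intermediate key is hashed and mapped to one of R reducers. Hadoop's default HashPartitioner does the analogous thing with the key's Java hashCode(); Python's built-in hash() merely stands in for it here.

```python
# Illustrative partitioner: route each intermediate key to one of R reducers.
NUM_REDUCERS = 3

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    # Mask to a non-negative value, then take the remainder, so every key
    # lands on exactly one reducer.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

if __name__ == "__main__":
    for key in ["big", "data", "hadoop", "mapreduce"]:
        print(key, "-> reducer", partition(key))
```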
MAPREDUCE PROGRAMMING
The reduce workers in turn contact the map workers to get the key/value data for their partition.
The data thus received is shuffled and sorted by key.
The reduce worker then calls the reduce function for every unique key; this function writes the output to a file.
When all the reduce workers complete their work, the master transfers control back to the user program.