BigData&Analytics Module6

The document discusses Hadoop and Spark. It describes the Hadoop Distributed File System (HDFS) and its NameNode and DataNode architecture. HDFS stores data in blocks that are replicated across nodes for availability. MapReduce is also described, involving map and reduce stages to process data in parallel. Spark is then introduced as an alternative to Hadoop that can be used in cases requiring faster computation.



Big Data & Business Analytics

Module 06: Hadoop & Spark
Module 6 Learning Objectives

Module Objectives:
• Understand the Hadoop Distributed File System (HDFS) concept and structure
• Know the advantages of HDFS
• Understand MapReduce, with an example
• Compare Spark with Hadoop
• Know use cases where Spark is required

What to Study for Exam:
• Module 6 lecture notes (emphasis on the topics above; no YARN, no Spark components)

© 2020 Eslsca. All Rights Reserved


Module 6 1st Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.
• HDFS employs a NameNode and DataNode architecture to implement a distributed file system.
• HDFS takes in data, breaks it down into separate blocks and distributes them to different nodes/DataNodes (which can reside on different computers).
• Blocks are also replicated across nodes, enabling a highly available architecture. If a failure happens on a computer, another computer holding a copy of the block will take over.
How Hadoop works:
https://www.youtube.com/watch?v=aReuLtY0YMI
Block
o Generally, user data is stored in the files of HDFS.
o A file is divided into one or more segments, which are stored in individual DataNodes.
o These file segments are called blocks.
o The default block size is 128 MB, but it can be increased as needed by changing the HDFS configuration.
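The arithmetic behind block splitting can be sketched in a few lines of Python; the 128 MB figure is the default quoted above, and the function is illustrative rather than part of any Hadoop API:

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def blocks_needed(file_size_mb: float) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# A 300 MB file needs 3 blocks: two full 128 MB blocks plus one 44 MB block.
print(blocks_needed(300))
```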


• NameNode is the master server that manages the HDFS namespace and regulates clients' access to files/data. It executes operations such as:
o opening/closing/renaming files (which contain the application data)
o mapping data blocks to DataNodes
o regulating clients' access (read and write) to files

• DataNode is the slave server that, according to the instructions of the NameNode:
o performs operations such as block creation, deletion, and replication
o performs read/write operations on the file system in response to client requests
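A toy sketch of the block-to-DataNode mapping the NameNode maintains may make the division of labour concrete; all file, block and node names below are invented for illustration:

```python
# NameNode-style metadata: which blocks make up each file (namespace),
# and which DataNodes hold a replica of each block.
namespace = {
    "/logs/app.log": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["datanode1", "datanode3", "datanode4"],
    "blk_2": ["datanode2", "datanode3", "datanode5"],
}

def locate(path):
    # Lookup a client would ask the NameNode for: path -> DataNodes per block.
    return {blk: block_locations[blk] for blk in namespace[path]}

print(locate("/logs/app.log"))
```

The actual reads and writes then go directly to the listed DataNodes; the NameNode only serves the metadata.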

Main Advantages
• Parallel processing: With HDFS, computation happens on the DataNodes, where the data resides, rather than moving the data to the master server. This approach decreases network congestion and boosts the system's overall throughput, in addition to the parallelism gained from each DataNode processing its data concurrently.
• Data replication: Blocks on a DataNode are replicated on other DataNodes. This ensures that data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from other DataNodes, so processing continues.
• Fault tolerance: HDFS' ability to replicate file blocks and store them across nodes in a large cluster of commodity hardware ensures fault tolerance and reliability.
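Replication has a simple storage cost worth keeping in mind; a minimal sketch, assuming the common default replication factor of 3 (the factor is configurable per cluster):

```python
REPLICATION_FACTOR = 3  # a common HDFS default; configurable per cluster

def raw_storage_mb(file_size_mb: float) -> float:
    """Raw cluster storage a file consumes once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR

# A 600 MB file occupies 1800 MB of raw storage across the cluster.
print(raw_storage_mb(600))
```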

Module 6 2nd Hadoop MapReduce

A MapReduce program executes in two stages, namely the map stage and the reduce stage:
• Map stage − The map or mapper's job is to process the input data. Generally, the input data is stored in the Hadoop file system (HDFS). The mapper processes the data and creates several small chunks of intermediate data.

• Reduce stage − This stage includes the shuffle/combine and sort of the map stage's results, and the aggregation of values to achieve the final results. The reducer's job is to process the data that comes from the mapper. After processing, it produces the final output.

Module 6 2nd Hadoop MapReduce Example
Suppose you have to perform a word count on sample.txt using MapReduce: find the unique words and the number of occurrences of each unique word.
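The word-count example can be mirrored in plain Python, with each stage as its own function; the sample lines are invented, and a real job would run the mappers and reducers in parallel across DataNodes:

```python
from collections import defaultdict

def map_stage(line):
    # Mapper: emit an intermediate (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/Sort: group all intermediate values by key (the word).
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_stage(grouped):
    # Reducer: aggregate the grouped values into the final count per word.
    return {word: sum(counts) for word, counts in grouped.items()}

sample = ["deer bear river", "car car river", "deer car bear"]
intermediate = [pair for line in sample for pair in map_stage(line)]
result = reduce_stage(shuffle(intermediate))
print(result)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```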

Hadoop aggregation functions include any of the following:
• Average (i.e., arithmetic mean: sum of values / count of values)
• Count
• Sum (sum of the values)
• Maximum
• Minimum
• Range
• Median
• Mode

Calculation of Median (after sorting the dataset):
Median = {(n+1)/2}th term if n is odd
Median = [{(n/2)}th term + {(n/2)+1}th term] / 2 if n is even

Calculation of Mode:
Mode = the value repeated most often in the dataset
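These formulas can be checked on a small example dataset (the numbers below are invented for illustration):

```python
import statistics

data = [3, 7, 1, 9, 4, 7]
n = len(data)
ordered = sorted(data)  # [1, 3, 4, 7, 7, 9]

mean = sum(data) / n                 # 31 / 6
# n = 6 is even: median = [(n/2)th term + ((n/2)+1)th term] / 2 = (4 + 7) / 2
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
mode = statistics.mode(data)         # 7 occurs most often
value_range = max(data) - min(data)  # 9 - 1 = 8

print(mean, median, mode, value_range)
```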

Module 6 3rd Hadoop Yarn For exam

 The Resource Manager handles the assignment of resources (CPU and memory) to the competing applications.
 The CPU and memory assigned are also called a container.
 Each application is governed by its App Master and has its own resource/container.
Module 6 4th Hadoop Ecosystem

Module 6 5th Spark

• Whereas Hadoop reads and writes files to HDFS on hard disk, Spark processes data in RAM (memory).
• Spark reduces the number of read/write cycles to hard disk and stores intermediate data in memory, hence it provides faster processing.
• Spark requires a lot of RAM for in-memory data processing, so running it is more costly than Hadoop.
• Spark is useful for processing real-time data, for example streamed videos, streamed sensor-based data, streamed transactions, etc.
• Internet Big Data giants such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
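The style of chained in-memory transformations Spark is built around can be mirrored with plain Python lists, as a rough sketch; in Spark each step would be an RDD or DataFrame operation whose intermediate results stay in RAM instead of being written back to disk between stages. The data and thresholds are illustrative:

```python
from functools import reduce

readings = [12, 7, 30, 3, 18]  # e.g. sensor values arriving for processing

filtered = [x for x in readings if x > 5]   # like rdd.filter(lambda x: x > 5)
scaled = [x * 2 for x in filtered]          # like rdd.map(lambda x: x * 2)
total = reduce(lambda a, b: a + b, scaled)  # like rdd.reduce(lambda a, b: a + b)
print(total)  # 2 * (12 + 7 + 30 + 18) = 134
```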


o Spark can run in stand-alone mode (not common), or
o Spark can run with a Hadoop cluster serving as the input data source, or in conjunction with other data sources:
o Redis
o Cassandra
o MongoDB
and others ….

Module 6 5th Spark real-time processing
Data is streamed from, for example, surveillance cameras or a video streaming provider through Kafka or Flume, then processed by Spark in real time; the results are then output from Spark, stored in static data sources, and displayed if required.
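This streaming pattern can be sketched in plain Python as a micro-batch loop; the event generator, field names and threshold below are invented for illustration, with a real pipeline using Kafka/Flume as the source and Spark as the processor:

```python
def event_stream():
    # Stand-in for a Kafka/Flume feed: yields events one at a time.
    for i in range(10):
        yield {"camera": i % 2, "motion": i * 3 % 7}

def micro_batches(stream, batch_size=4):
    # Group a continuous stream into small batches for processing.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

results = []
for batch in micro_batches(event_stream()):
    # "Processing": count motion readings above a threshold in each batch.
    results.append(sum(1 for e in batch if e["motion"] > 3))
print(results)  # one count per micro-batch, ready to store or display
```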

More information on Flume can be found in the following video:
https://www.youtube.com/watch?v=fUesPFJ6FfE
Module 6 5th Spark Components For exam
o Spark is structured around Spark Core, the engine that drives resource scheduling, optimization, in-memory processing, and MapReduce-style computation.
Several libraries operate on top of Spark Core, including:
o Spark SQL, which allows you to run SQL-like commands on distributed data sets
o MLlib, the built-in library for machine learning
o GraphX, for graph problems
o Spark Streaming, which allows for the input of continually streaming data

Module 6 6th Apache Flume for Data Ingestion For exam
Apache Flume is a data ingestion tool for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store. It is principally designed to copy streaming data from various web servers to HDFS.
Module 6 7th Spark Use Cases

• Banks are using Apache Spark to access and analyze social media profiles, call recordings, complaint logs, emails, forum discussions, etc., to gain insights that can help them make the right business decisions for credit risk assessment, targeted advertising and customer segmentation.
• In e-commerce, information about real-time transactions can be passed to Spark, and with its machine learning libraries customer segmentation can be undertaken with the K-means clustering algorithm. The results can be combined with data from other sources, like social media profiles, product reviews on forums and customer comments, to enhance the recommendations made to customers based on new trends.
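The clustering step in the e-commerce case can be illustrated with a toy, pure-Python K-means on a single feature; the spend values and starting centers are invented, and a real segmentation job would use Spark MLlib's KMeans on distributed, multi-feature data:

```python
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (empty clusters keep their previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [5, 6, 7, 50, 52, 55]  # illustrative customer spend values
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers)  # converges to the means of the two obvious groups
```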


• Many healthcare providers are using Apache Spark to analyse patient records along with past clinical data to identify which patients are likely to face health issues after being discharged from the clinic. This helps hospitals prevent re-admittance, as they can deploy home healthcare services to the identified patients, saving costs for both the hospitals and the patients.
• Apache Spark is used in genomic sequencing to reduce the time needed to process genome data. Earlier it took several weeks to organize all the chemical compounds with genes, but now with Apache Spark on Hadoop it takes just a few hours.
• Yahoo uses Apache Spark for personalizing its news webpages and for targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize news stories to find out what kind of users would be interested in reading each category of news.
Source: https://www.projectpro.io/article/top-5-apache-spark-use-cases/271

Module 6 Questions



 Module Completed

Module 06
