BigData&Analytics Module6

The document discusses Hadoop and Spark. It describes the Hadoop Distributed File System (HDFS) and its NameNode and DataNode architecture. HDFS stores data in blocks that are replicated across nodes for availability. MapReduce is also described, involving map and reduce stages to process data in parallel. Spark is then introduced as an alternative to Hadoop that can be used in cases requiring faster computation.



Big Data & Business Analytics

Module 06: Hadoop & Spark
Module 6 Learning Objectives

Module Objectives:
• Understand the Hadoop Distributed File System (HDFS) concept and structure
• Know the advantages of HDFS
• Understand MapReduce, with an example
• Compare Spark with Hadoop
• Know use cases where Spark is required

What to Study for Exam:
• Module 6 lecture notes (emphasis on the topics above; no YARN, no Spark components)

© 2020 Eslsca. All Rights Reserved


Module 6 1st Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.
• HDFS employs a NameNode and DataNode architecture to implement a distributed file system.
• HDFS takes in data, breaks it down into separate blocks and distributes them to different nodes/DataNodes (which can reside on different computers).
• Blocks are also replicated across nodes, enabling a highly available architecture. If a failure happens on a computer, another computer holding a copy of the block will take over.
How Hadoop works:
https://www.youtube.com/watch?v=aReuLtY0YMI
Block
o Generally, user data is stored in the files of HDFS.
o A file is divided into one or more segments, which are stored in individual DataNodes.
o These file segments are called blocks.
o The default block size is 128 MB, but it can be increased as needed by changing the HDFS configuration.
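The arithmetic behind block splitting can be sketched in a few lines of Python; the 128 MB figure is the default quoted above, and the function is illustrative rather than part of any Hadoop API:

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def blocks_needed(file_size_mb: float) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# A 300 MB file needs 3 blocks: two full 128 MB blocks plus one 44 MB block.
print(blocks_needed(300))
```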


• NameNode is the master server that manages the HDFS namespace and regulates clients' access to files/data. It executes operations such as:
o opening/closing/renaming files (which contain the application data)
o mapping data blocks to DataNodes
o regulating clients' access (read and write) to files

• DataNode is the slave server that, according to the instructions of the NameNode:
o performs operations such as block creation, deletion, and replication
o performs read/write operations on the file system in response to client requests
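A toy sketch of the block-to-DataNode mapping the NameNode maintains may make the division of labour concrete; all file, block and node names below are invented for illustration:

```python
# NameNode-style metadata: which blocks make up each file (namespace),
# and which DataNodes hold a replica of each block.
namespace = {
    "/logs/app.log": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["datanode1", "datanode3", "datanode4"],
    "blk_2": ["datanode2", "datanode3", "datanode5"],
}

def locate(path):
    # Lookup a client would ask the NameNode for: path -> DataNodes per block.
    return {blk: block_locations[blk] for blk in namespace[path]}

print(locate("/logs/app.log"))
```

The actual reads and writes then go directly to the listed DataNodes; the NameNode only serves the metadata.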

Main Advantages
• Parallel processing: With HDFS, computation happens on the DataNodes, where the data resides, rather than moving the data to the master server. This approach decreases network congestion and boosts the system's overall throughput, in addition to the parallelism gained from each DataNode processing its data concurrently.
• Data replication: Blocks on a DataNode are replicated on other DataNodes. This ensures that data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from other DataNodes, so processing continues.
• Fault tolerance: HDFS' ability to replicate file blocks and store them across nodes in a large cluster of commodity hardware ensures fault tolerance and reliability.
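Replication has a simple storage cost worth keeping in mind; a minimal sketch, assuming the common default replication factor of 3 (the factor is configurable per cluster):

```python
REPLICATION_FACTOR = 3  # a common HDFS default; configurable per cluster

def raw_storage_mb(file_size_mb: float) -> float:
    """Raw cluster storage a file consumes once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR

# A 600 MB file occupies 1800 MB of raw storage across the cluster.
print(raw_storage_mb(600))
```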

Module 6 2nd Hadoop MapReduce

A MapReduce program executes in two stages, namely the map stage and the reduce stage:
• Map stage − The map or mapper's job is to process the input data. Generally, the input data is stored in the Hadoop file system (HDFS). The mapper processes the data and creates several small chunks of intermediate data.

• Reduce stage − This stage includes the shuffle/combine and sort of the map stage's results, and the aggregation of values to achieve the final results. The reducer's job is to process the data that comes from the mapper. After processing, it produces the final output.

Module 6 2nd Hadoop MapReduce Example
Suppose you have to perform a word count on sample.txt using MapReduce: find the unique words and the number of occurrences of each unique word.
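The word-count example can be mirrored in plain Python, with each stage as its own function; the sample lines are invented, and a real job would run the mappers and reducers in parallel across DataNodes:

```python
from collections import defaultdict

def map_stage(line):
    # Mapper: emit an intermediate (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/Sort: group all intermediate values by key (the word).
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_stage(grouped):
    # Reducer: aggregate the grouped values into the final count per word.
    return {word: sum(counts) for word, counts in grouped.items()}

sample = ["deer bear river", "car car river", "deer car bear"]
intermediate = [pair for line in sample for pair in map_stage(line)]
result = reduce_stage(shuffle(intermediate))
print(result)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```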

Hadoop aggregation functions include any of the following:
• Average (i.e., arithmetic mean: sum of values / count of values)
• Count
• Sum (sum of the values)
• Maximum
• Minimum
• Range
• Median
• Mode

Calculation of Median (after sorting the dataset):
Median = {(n+1)/2}th term if n is odd
Median = [{(n/2)}th term + {(n/2)+1}th term] / 2 if n is even

Calculation of Mode:
Mode = the value repeated most often in the dataset
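These formulas can be checked on a small example dataset (the numbers below are invented for illustration):

```python
import statistics

data = [3, 7, 1, 9, 4, 7]
n = len(data)
ordered = sorted(data)  # [1, 3, 4, 7, 7, 9]

mean = sum(data) / n                 # 31 / 6
# n = 6 is even: median = [(n/2)th term + ((n/2)+1)th term] / 2 = (4 + 7) / 2
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
mode = statistics.mode(data)         # 7 occurs most often
value_range = max(data) - min(data)  # 9 - 1 = 8

print(mean, median, mode, value_range)
```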

Module 6 3rd Hadoop Yarn For exam

 The Resource Manager handles the assignment of resources (CPU and memory) to the competing applications.
 The CPU and memory assigned are also called a container.
 Each application is governed by its App Master and has its own resource/container.
Module 6 4th Hadoop Ecosystem

Module 6 5th Spark

• Whereas Hadoop reads and writes files to HDFS on hard disk, Spark processes data in RAM (memory).
• Spark reduces the number of read/write cycles to hard disk and stores intermediate data in memory, hence it provides faster processing.
• Spark requires a lot of RAM for in-memory data processing, so running it is more costly than Hadoop.
• Spark is useful for processing real-time data, for example streamed videos, streamed sensor-based data, streamed transactions, etc.
• Internet Big Data giants such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
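The style of chained in-memory transformations Spark is built around can be mirrored with plain Python lists, as a rough sketch; in Spark each step would be an RDD or DataFrame operation whose intermediate results stay in RAM instead of being written back to disk between stages. The data and thresholds are illustrative:

```python
from functools import reduce

readings = [12, 7, 30, 3, 18]  # e.g. sensor values arriving for processing

filtered = [x for x in readings if x > 5]   # like rdd.filter(lambda x: x > 5)
scaled = [x * 2 for x in filtered]          # like rdd.map(lambda x: x * 2)
total = reduce(lambda a, b: a + b, scaled)  # like rdd.reduce(lambda a, b: a + b)
print(total)  # 2 * (12 + 7 + 30 + 18) = 134
```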


o Spark can run in stand-alone mode (not common), or
o Spark can run with a Hadoop cluster serving as the input data source, or in conjunction with other data sources:
o Redis
o Cassandra
o MongoDB
and others ….

Module 6 5th Spark real-time processing
Data is streamed from, for example, surveillance cameras or a video streaming provider through Kafka or Flume, then processed by Spark in real time; the results are then output from Spark, stored in static data sources, and displayed if required.
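This streaming pattern can be sketched in plain Python as a micro-batch loop; the event generator, field names and threshold below are invented for illustration, with a real pipeline using Kafka/Flume as the source and Spark as the processor:

```python
def event_stream():
    # Stand-in for a Kafka/Flume feed: yields events one at a time.
    for i in range(10):
        yield {"camera": i % 2, "motion": i * 3 % 7}

def micro_batches(stream, batch_size=4):
    # Group a continuous stream into small batches for processing.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

results = []
for batch in micro_batches(event_stream()):
    # "Processing": count motion readings above a threshold in each batch.
    results.append(sum(1 for e in batch if e["motion"] > 3))
print(results)  # one count per micro-batch, ready to store or display
```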

More information on Flume can be found in the following video:
https://www.youtube.com/watch?v=fUesPFJ6FfE
Module 6 5th Spark Components For exam
o Spark is structured around Spark Core, the engine that drives resource scheduling, optimization, in-memory processing, and MapReduce-style computation.
Several libraries operate on top of Spark Core, including:
o Spark SQL, which allows you to run SQL-like commands on distributed data sets
o MLlib, the built-in library for machine learning
o GraphX, for graph problems
o Spark Streaming, which allows for the input of continually streaming data

Module 6 6th Apache Flume for Data Ingestion For exam
Apache Flume is a data ingestion tool for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store. It is principally designed to copy streaming data from various web servers to HDFS.
Module 6 7th Spark Use Cases

• Banks are using Apache Spark to access and analyze social media profiles, call recordings, complaint logs, emails, forum discussions, etc., to gain insights that can help them make the right business decisions for credit risk assessment, targeted advertising and customer segmentation.
• In e-commerce, information about real-time transactions can be passed to Spark, and with its machine learning libraries customer segmentation can be undertaken with the K-means clustering algorithm. The results can be combined with data from other sources, like social media profiles, product reviews on forums and customer comments, to enhance the recommendations made to customers based on new trends.
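The clustering step in the e-commerce case can be illustrated with a toy, pure-Python K-means on a single feature; the spend values and starting centers are invented, and a real segmentation job would use Spark MLlib's KMeans on distributed, multi-feature data:

```python
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (empty clusters keep their previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [5, 6, 7, 50, 52, 55]  # illustrative customer spend values
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers)  # converges to the means of the two obvious groups
```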


• Many healthcare providers are using Apache Spark to analyse patient records along with past clinical data to identify which patients are likely to face health issues after being discharged from the clinic. This helps hospitals prevent re-admittance, as they can deploy home healthcare services to the identified patients, saving costs for both the hospitals and the patients.
• Apache Spark is used in genomic sequencing to reduce the time needed to process genome data. Earlier it took several weeks to organize all the chemical compounds with genes, but now with Apache Spark on Hadoop it takes just a few hours.
• Yahoo uses Apache Spark for personalizing its news webpages and for targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize news stories to find out what kind of users would be interested in reading each category of news.
Source: https://www.projectpro.io/article/top-5-apache-spark-use-cases/271

Module 6 Questions



 Module Completed

Module 06
