Big Data & Analytics Module 6
Module Objectives:
Understand the Hadoop Distributed File System (HDFS) concept and structure
Know the advantages of HDFS
Understand MapReduce with an example
Compare Spark with Hadoop
Know use cases where Spark is required
• NameNode is the master server that manages the HDFS namespace and regulates
clients' access to files/data. It executes operations such as:
o opening/closing/renaming files (which contain the application data)
o mapping data blocks to DataNodes
o regulating clients' read and write access to files
A MapReduce program executes in two stages, namely the map stage and the reduce
stage:
• Map stage − The map or mapper's job is to process the input data.
Generally, the input data is stored in the Hadoop file system (HDFS). The
mapper processes the data and creates several small chunks of
intermediate data.
• Reduce stage − This stage includes the Shuffle/Combine and Sort of the map
stage results and the aggregation of values to produce the final results. The
reducer's job is to process the data that comes from the mapper. After
processing, it produces the final output.
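The two stages above can be sketched in plain Python as a toy, single-machine illustration (this is not Hadoop's actual API), using the classic word-count example: the mapper emits (word, 1) pairs, the shuffle/sort step groups them by word, and the reducer sums each group.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map stage: emit an intermediate (word, 1) pair for each word
    for word in line.lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle/Sort stage: group the intermediate pairs by key (word)
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [count for _, count in group]

def reducer(key, counts):
    # Reduce stage: aggregate the per-word counts into a final total
    return key, sum(counts)

lines = ["big data big analytics", "big data tools"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle_and_sort(intermediate))
print(result)  # {'analytics': 1, 'big': 3, 'data': 2, 'tools': 1}
```

In real Hadoop, the mapper and reducer run on many DataNodes in parallel and the shuffle moves intermediate data across the network; the logic per record, however, is exactly this.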
The Resource Manager handles the assignment of resources (CPU and memory) to the
competing applications. The CPU and memory assigned are also called a container.
Each application is governed by its Application Master and has its own
resources/containers.
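A simplified sketch of this allocation idea (not the real YARN API; the class and method names here are made up for illustration): the ResourceManager grants fixed CPU/memory containers to competing applications only while cluster capacity remains.

```python
from dataclasses import dataclass

@dataclass
class Container:
    # A container is simply a granted slice of CPU (vcores) and memory
    app_id: str
    vcores: int
    memory_mb: int

class ResourceManager:
    def __init__(self, total_vcores, total_memory_mb):
        self.free_vcores = total_vcores
        self.free_memory_mb = total_memory_mb

    def allocate(self, app_id, vcores, memory_mb):
        # Grant a container only if enough CPU and memory remain
        if vcores <= self.free_vcores and memory_mb <= self.free_memory_mb:
            self.free_vcores -= vcores
            self.free_memory_mb -= memory_mb
            return Container(app_id, vcores, memory_mb)
        return None  # request rejected: the cluster is out of capacity

rm = ResourceManager(total_vcores=8, total_memory_mb=16384)
c1 = rm.allocate("app-1", vcores=4, memory_mb=8192)  # granted
c2 = rm.allocate("app-2", vcores=4, memory_mb=8192)  # granted
c3 = rm.allocate("app-3", vcores=2, memory_mb=4096)  # rejected: no capacity left
print(c1, c2, c3)
```

In real YARN, each application's Application Master negotiates these container requests with the ResourceManager on the application's behalf.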
© 2020 Eslsca. All Rights Reserved
Big Data & Business Analytics
Module 6: Hadoop & Spark
• Whereas Hadoop reads and writes files to HDFS on disk, Spark
processes data in RAM (memory).
• Spark reduces the number of read/write cycles to the hard disk by storing
intermediate data in memory, hence it provides faster processing.
• Spark requires a lot of RAM to run in-memory data processing, so a Spark
deployment is more costly than a Hadoop one.
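The contrast above can be made concrete with a toy sketch (plain Python, not actual Hadoop or Spark code): a Hadoop-style pipeline writes its intermediate result to disk between stages, while a Spark-style pipeline keeps it in memory.

```python
import json
import os
import tempfile

data = list(range(10))

def disk_pipeline(values):
    # Hadoop-style: stage 1 writes its output to disk, stage 2 reads it back
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    with open(path, "w") as f:
        json.dump([v * 2 for v in values], f)  # stage 1 -> disk
    with open(path) as f:
        intermediate = json.load(f)            # disk -> stage 2
    return sum(intermediate)

def memory_pipeline(values):
    # Spark-style: the intermediate result stays in RAM between stages
    intermediate = [v * 2 for v in values]     # stage 1, kept in memory
    return sum(intermediate)                   # stage 2

print(disk_pipeline(data), memory_pipeline(data))  # → 90 90
```

Both pipelines compute the same answer; the in-memory version simply skips the disk round trip, which is the cost Spark avoids at the price of needing more RAM.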
• Spark is useful for processing real-time data, for example streamed videos,
streamed sensor-based data, streamed transactions, etc.
• Internet Big Data giants such as Netflix, Yahoo, and eBay have deployed
Spark at massive scale, collectively processing multiple petabytes of data
on clusters of over 8,000 nodes.
• Banks are using Apache Spark to access and analyze social media
profiles, call recordings, complaint logs, emails, forum discussions, etc. to
gain insights that help them make the right business decisions on
credit risk assessment, targeted advertising and customer segmentation.
• In e-commerce, information about real-time transactions can be passed to
Spark, and with its machine learning libraries customer segmentation can
be undertaken with the K-means clustering algorithm. The results can be
combined with data from other sources like social media profiles, product
reviews on forums, customer comments, etc. to enhance the
recommendations to customers based on new trends.
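A minimal K-means sketch in plain Python (stdlib only) illustrates the segmentation idea above; in practice Spark's MLlib would run this distributed over the transaction stream. The spend figures are made-up illustrative data.

```python
def kmeans_1d(points, centroids, iterations=20):
    # Classic K-means loop on 1-D data
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend per customer: a low-spend and a high-spend segment
spend = [20, 25, 30, 22, 480, 510, 495, 505]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 1000.0])
print(centroids)  # converges to roughly [24.25, 497.5]
```

Each resulting cluster is one customer segment; the centroid is the segment's typical spend, which downstream systems could join with the social-media and review data mentioned above.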
• Many healthcare providers are using Apache Spark to analyse patient records
along with past clinical data to identify which patients are likely to face health
issues after being discharged from the clinic. This helps hospitals reduce
re-admissions, as they can deploy home healthcare services to the
identified patients, saving costs for both the hospitals and the patients.
• Apache Spark is used in genomic sequencing to reduce the time needed to
process genome data. Earlier, it took several weeks to organize all the
chemical compounds with genes, but now, with Apache Spark on Hadoop, it
takes just a few hours.
• Yahoo uses Apache Spark for personalizing its news webpages and for
targeted advertising. It uses machine learning algorithms that run on Apache
Spark to find out what kind of news users are interested in reading and to
categorize the news stories, determining which users would be
interested in reading each category of news.
Source: https://www.projectpro.io/article/top-5-apache-spark-use-cases/271