Unit 3 MapReduce Part 1
Agenda: MapReduce
1. Data Flow
2. Map
3. Shuffle
4. Sort
5. Reduce
6. Hadoop Streaming
7. mrjob
8. Installation
9. wordcount in mrjob
10. Executing mrjob
What is MapReduce?
History:
MapReduce was developed at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat (Dean & Ghemawat, 2004). The model they described in their paper, "MapReduce: Simplified Data Processing on Large Clusters," was inspired by the map and reduce functions commonly used in functional programming.
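As a quick illustration of that functional-programming inspiration, here is a minimal Python sketch using the built-in map() and functools.reduce(); the sample data and merge helper are made up purely for the example:

from functools import reduce

lines = ["hello world", "hello mapreduce"]  # toy input, purely illustrative

# map: transform each line into a list of (word, 1) pairs
pairs = map(lambda line: [(word, 1) for word in line.split()], lines)

# reduce: merge the per-line pair lists into a single word -> count dictionary
def merge(counts, line_pairs):
    for word, one in line_pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # {'hello': 2, 'world': 1, 'mapreduce': 1}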
What is MapReduce?
Hadoop MapReduce is the data processing layer of Hadoop. It processes the huge amounts of structured and unstructured data stored in HDFS. MapReduce processes data in parallel by dividing a job into a set of independent tasks; this parallel processing improves both speed and reliability.
Hadoop MapReduce data processing takes place in two phases, Map and Reduce (a minimal Python sketch of the flow follows the bullets below).
•Map phase: the first phase of data processing, where we specify all the complex logic, business rules, and costly code.
•Reduce phase: the second phase of processing, where we specify light-weight processing such as aggregation and summation.
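To make the two phases concrete, here is a minimal plain-Python sketch of the word-count flow; the function names and sample records are assumptions for illustration only (in Hadoop the framework itself performs the shuffle and sort between the phases). mrjob and Hadoop Streaming versions appear later in this unit.

from itertools import groupby
from operator import itemgetter

# Map phase: emit a (key, value) pair for every word in every input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word, 1)

# Shuffle and sort: group the mapper output by key (Hadoop does this for us).
def shuffle_and_sort(pairs):
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

# Reduce phase: light-weight aggregation, here a simple sum per key.
def reduce_phase(grouped):
    for word, word_pairs in grouped:
        yield (word, sum(count for _, count in word_pairs))

records = ["hello world", "hello mapreduce"]  # toy input, purely illustrative
print(dict(reduce_phase(shuffle_and_sort(map_phase(records)))))
# {'hello': 2, 'mapreduce': 1, 'world': 1}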
MapReduce programming offers several benefits to help you gain
valuable insights from your big data: