0% found this document useful (0 votes)

37 views

Map Reduce

The document discusses MapReduce and batch processing. It summarizes a research paper on MapReduce, which presents a programming model for expressing distributed computations on large data sets. MapReduce allows users to parallelize processing across large clusters of machines and handles details like fault tolerance, scheduling and load balancing.

Uploaded by

ahmadroheed

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views

Map Reduce

Uploaded by

ahmadroheed

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 25

Cloud Computing

MapReduce, Batch Processing

Eva Kalyvianaki
[email protected]
Contents

 Batch processing: processing large amounts of data at-

once, in one-go to deliver a result according to a query on
the data.

 Material is from the paper:

“MapReduce: Simplified Data Processing on Large Clusters”,
By Jeffrey Dean and Sanjay Ghemawat from Google
published in Usenix OSDI conference, 2004

2
Motivation: Processing Large-sets of Data
 Need for many computations over large/huge sets of data:
 Input data: crawled documents, web request logs
 Output data: inverted indices, summary of pages crawled per host, the
set of the most frequent queries in a given day, …
 Most of these computation are relatively straight-forward
 To speedup computation and shorten processing time, we can
distribute data across 100s of machines and process them in
parallel
 But, parallel computations are difficult and complex to manage:
 Race conditions, debugging, data distribution, fault-tolerance, load
balancing, etc

 Ideally, we would like to process data in parallel but not deal

with the complexity of parallelisation and data distribution
3
MapReduce
“A new abstraction that allows us to express the simple
computations we were trying to perform but hides the messy
details of parallelization, fault-tolerance, data distribution and
load balancing in a library.”

 Programming model:
 Provides abstraction to express computation
 Library:
 To take care the runtime parallelisation of the computation.

4
Example: Counting the number of occurrences of each
work in the text below from Wikipedia
“Cloud computing is a recently evolved computing terminology
or metaphor based on utility and consumption of computing
resources. Cloud computing involves deploying groups of
remote servers and software networks that allow centralized
data storage and online access to computer services or
resources. Cloud can be classified as public, private or
hybrid.”

Word: Number of Occurrences

Cloud 1
2
computing 1
is 1
a 1
recently 1
evolved 1
computing 1?
terminology 1 5
Programming Model

 Input: a set of key/value pairs

 Output: a set of key/value pairs
 Computation is expressed using the two functions:
1. Map task: a single pair  a list of intermediate pairs
 map(input-key, input-value)  list(out-key, intermediate-value)
 <ki, vi>  { < kint, vint > }

2. Reduce task: all intermediate pairs with the same kint  a list of values
 reduce(out-key, list(intermediate-value))  list(out-values)
 < kint, {vint} >  < ko, vo >

6
Example: Counting the number of occurrences of each
work in a collection of documents

map(String input_key, String input_value):

// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
// output_key: a word
// output_values: a list of counts
int result = 0;
for each v in intermediate_values:
result += ParseInt(v);
Emit(AsString(result));
7
MapReduce Example Applications
 The MapReduce model can be applied to many applications:
 Distributed grep:
 map: emits a line, if line matched the pattern
 reduce: identity function
 Count of URL access Frequency
 Reverse Web-Link Graph
 Inverted Index
 Distributed Sort
 ….

8
MapReduce Implementation

 MapReduce implementation presented in the paper

matched Google infrastructure at-the-time:
1. Large cluster of commodity PCs connected via switched Ethernet
2. Machines are typically dual-processor x86, running Linux, 2-4GB of
mem! (slow machines for today’s standards)
3. A cluster of machines, so failures are anticipated
4. Storage with (GFS) Google File System (2003) on IDE disks
attached to PCs. GFS is a distributed file system, uses replication
for availability and reliability.
 Scheduling system:
1. Users submit jobs
2. Each job consists of tasks; scheduler assigns tasks to machines

9
Google File System (GFS)
 File is divided into several chunks of predefined size;
 Typically, 16-64 MB
 The system replicates each chunk by a number:
 Usually three replicas
 To achieve fault-tolerance, availability and reliability

10
Parallel Execution
 User specifies:
 M: number of map tasks
 R: number of reduce tasks
 Map:
 MapReduce library splits the input file into M pieces
 Typically 16-64MB per piece
 Map tasks are distributed across the machines
 Reduce:
 Partitioning the intermediate key space into R pieces
 hash(intermediate_key) mod R
 Typical setting:
 2,000 machines
 M = 200,000
 R = 5,000
11
Execution Flow

12
Master Data Structures

 For each map/reduce task:

 State status {idle, in-progress, completed}
 Identity of the worker machine (for non-idle tasks)

 The location of intermediate file regions is passed from maps to

reducers tasks through the master.
 This information is pushed incrementally (as map tasks finish) to
workers that have in-progress reduce tasks.

13
Fault-Tolerance
Two types of failures:
1. worker failures:
 Identified by sending heartbeat messages by the master. If no
response within a certain amount of time, then the worker is dead.
 In-progress and completed map tasks are re-scheduled  idle
 In-progress reduce tasks are re-scheduled  idle
 Workers executing reduce tasks affected from failed map/workers are
notified of re-scheduling
 Question: Why completed tasks have to be re-scheduler?
 Answer: Map output is stored on local fs, while reduce output is stored
on GFS
2. master failure:
1. Rare
2. Can be recovered from checkpoints
3. Solution: aborts the MapReduce computation and starts again
14
Disk Locality

 Network bandwidth is a relatively scarce resource and also

increases latency
 The goal is to save network bandwidth

 Use of GFS that stores typically three copies of the data block
on different machines
 Map tasks are scheduled “close” to data
 On nodes that have input data (local disk)
 If not, on nodes that are nearer to input data (e.g., same switch)

15
Task Granularity
 Number of map tasks > number of worker nodes
 Better load balancing
 Better recovery

 But, this, increases load on the master

 More scheduling
 More states to be saved

 M could be chosen with respect to the block size of the file

system
 For locality properties
 R is usually specified by users
 Each reduce tasks produces one output file

16
Stragglers
 Slow workers delay overall completion time  stragglers
 Bad disks with soft errors
 Other tasks using up resources
 Machine configuration problems, etc

 Very close to end of MapReduce operation, master schedules backup

execution of the remaining in-progress tasks.
 A task is marked as complete whenever either the primary or the
backup execution completes.

 Example: sort operation takes 44% longer to complete when the

backup task mechanism is disabled.

17
Refinements: Partitioning Function

 Partitioning function identifies the reduce task

 Users specify the desired output files they want, R
 But, there may be more keys than R
 Uses the intermediate key and R
 Default: hash(key) mod R

 Important to choose well-balanced partitioning functions:

 hash(hostname(urlkey)) mod R
 For output keys that are URLs

18
Refinements: Combiner Function

 Introduce a mini-reduce phase before intermediate data is

sent to reduce
 When there is significant repetition of intermediate keys
 Merge values of intermediate keys before sending to reduce tasks
 Example: word count, many records of the form <word_name, 1>.
Merge records with the same word_name
 Similar to reduce function

 Saves network bandwidth

19
Evaluation - Setup
 Evaluation on two programs running on a large cluster and
processing 1 TB of data:
1. grep: search over 1010 100-byte records looking for a rare 3-character
pattern
2. sort: sorts 1010 100-byte records

 Cluster configuration:
 1,800 machines
 Each machine has 2 GHz Intel Xeon proc., 4GB mem, 2 160GB IDE disks
 Gigabit Ethernet link
 Hosted in the same facility

20
Grep
 M = 15,000 of 64MB each split
 R=1
 Entire computation finishes at 150s
 Startup overhead ~60s
 Propagation of program to workers
 Delays to interact with GFS to open 1,000 files
 …
 Picks at 30GB/s with 1,764 workers

21
Sort
 M = 15,000 splits, 64MB each
 R = 4,000 files
 Workers = 1,700
 Evaluated on three executions:
 With backup tasks
 Without backup tasks
 With machine failures

22
Sort Results
Top: rate at which input is read
Middle: rate at which data is sent from mappers to reducers
Bottom: rate at which sorted data is written to output file by reducers

Without backup With machine failures,

Normal execution
tasks, 5 reduce 200 out of 1746 workers, 23
with backup
tasks stragglers,  a 5% increase over
44% increase normal execution time
Implementation
 First MapReduce library in 02/2003
 Use cases (back then):
 Large-scale machine learning problems
 Clustering problems for the Google News
 Extraction of data for reports Google zeitgeist
 Large-scale graph computations MapReduce jobs run in 8/2004

24
Summary

 MapReduce is a very powerful and expressive model

 Performance depends a lot on implementation details
 Material is from the paper:
“MapReduce: Simplified Data Processing on Large Clusters”,
By Jeffrey Dean and Sanjay Ghemawat from Google
published in Usenix OSDI conference, 2004

Apache Cassandra Administrator Associate - Exam Practice Tests
From Everand
Apache Cassandra Administrator Associate - Exam Practice Tests
Cristian Scutaru
No ratings yet
Scheduling Algorithm
No ratings yet
Scheduling Algorithm
27 pages
Package EasySocial
No ratings yet
Package EasySocial
238 pages
7BCEE1A-Datamining and Data Warehousing
No ratings yet
7BCEE1A-Datamining and Data Warehousing
128 pages
DBMS Unit 5
No ratings yet
DBMS Unit 5
12 pages
Java Programming: Inheritance, Packages, Exceptions Topics Covered in This Unit
No ratings yet
Java Programming: Inheritance, Packages, Exceptions Topics Covered in This Unit
43 pages
Data Structures - Unit-1
No ratings yet
Data Structures - Unit-1
32 pages
R18CSE4102-UNIT 2 Data Mining Notes
100% (1)
R18CSE4102-UNIT 2 Data Mining Notes
31 pages
Os Notes
No ratings yet
Os Notes
67 pages
Web Notes M3
No ratings yet
Web Notes M3
43 pages
Cryptanalysis of The Stream Cipher ZUC in The 3GPP Confidentiality & Integrity Algorithms 128-EEA3 & 128-EIA3
No ratings yet
Cryptanalysis of The Stream Cipher ZUC in The 3GPP Confidentiality & Integrity Algorithms 128-EEA3 & 128-EIA3
15 pages
It6713 Grid & Cloud Computing Lab
67% (12)
It6713 Grid & Cloud Computing Lab
97 pages
Distributed System: Processes in DS
100% (1)
Distributed System: Processes in DS
36 pages
LinkedList, Jagged Array, List, and Enumeration in Java
No ratings yet
LinkedList, Jagged Array, List, and Enumeration in Java
29 pages
Azure Data Factory Interview Questions: Click Here
No ratings yet
Azure Data Factory Interview Questions: Click Here
28 pages
SQLMX Vs Oracle
No ratings yet
SQLMX Vs Oracle
49 pages
CS8691 QB 006 Edubuzz360
No ratings yet
CS8691 QB 006 Edubuzz360
11 pages
Relational Database Design: (PM Jat, DAIICT, Gandhinagar)
No ratings yet
Relational Database Design: (PM Jat, DAIICT, Gandhinagar)
14 pages
Lecture 0 INT306
No ratings yet
Lecture 0 INT306
38 pages
Data Analytics Process
No ratings yet
Data Analytics Process
9 pages
DBMS Concepts&architecture
No ratings yet
DBMS Concepts&architecture
39 pages
Merkle Tree
No ratings yet
Merkle Tree
2 pages
CC (Neha) PDF
No ratings yet
CC (Neha) PDF
50 pages
COP CD Unit3
No ratings yet
COP CD Unit3
247 pages
Web Technologies Notes
No ratings yet
Web Technologies Notes
238 pages
ParrotOSPressentation MohamedDaoud
No ratings yet
ParrotOSPressentation MohamedDaoud
14 pages
DW Olap
No ratings yet
DW Olap
57 pages
Lecture 9: BCSE302L - DBMS: Relational Algebra
No ratings yet
Lecture 9: BCSE302L - DBMS: Relational Algebra
25 pages
CSI104 Summary
No ratings yet
CSI104 Summary
114 pages
7bcee3a Unit V Data Binding
No ratings yet
7bcee3a Unit V Data Binding
18 pages
Introduction To DBMS
No ratings yet
Introduction To DBMS
154 pages
Memory Allocation in Operating Systems
No ratings yet
Memory Allocation in Operating Systems
37 pages
Surayya's Resume-1
No ratings yet
Surayya's Resume-1
4 pages
DW and Olap
No ratings yet
DW and Olap
59 pages
PHP Program
No ratings yet
PHP Program
16 pages
Lecture 1 - Basic Concepts in Information Security
No ratings yet
Lecture 1 - Basic Concepts in Information Security
40 pages
Web Mining: Based On Tutorials and Presentations
No ratings yet
Web Mining: Based On Tutorials and Presentations
101 pages
Exame 1 PDF
No ratings yet
Exame 1 PDF
3 pages
3.MySql Lab Exercise - 1
100% (1)
3.MySql Lab Exercise - 1
11 pages
Dbms r19 - Unit-2 (Ref-2)
No ratings yet
Dbms r19 - Unit-2 (Ref-2)
27 pages
DWM Course
No ratings yet
DWM Course
67 pages
OCaml
No ratings yet
OCaml
11 pages
Algorithms and Data Structures: Dynamic Programming Matrix-Chain Multiplication
No ratings yet
Algorithms and Data Structures: Dynamic Programming Matrix-Chain Multiplication
17 pages
BDA Unit-4
No ratings yet
BDA Unit-4
38 pages
Chapter 2-Cloud Computing
No ratings yet
Chapter 2-Cloud Computing
29 pages
Unit 1.2 HTML - Frames-1
No ratings yet
Unit 1.2 HTML - Frames-1
46 pages
Java Assignment For Instructor David
No ratings yet
Java Assignment For Instructor David
13 pages
2nd Unit Android
No ratings yet
2nd Unit Android
50 pages
Case Study On Building Data Warehouse/Data Mart
100% (2)
Case Study On Building Data Warehouse/Data Mart
6 pages
Transport Layer Notes
No ratings yet
Transport Layer Notes
35 pages
CD Unit-3
No ratings yet
CD Unit-3
146 pages
Core Java Notes
No ratings yet
Core Java Notes
136 pages
Topic 3 Data Models PDF
No ratings yet
Topic 3 Data Models PDF
12 pages
Big Data Nit067
No ratings yet
Big Data Nit067
1 page
Java Unit 1
No ratings yet
Java Unit 1
14 pages
DWH Int Questions
100% (1)
DWH Int Questions
9 pages
Unit-3 - Unix Scripts
No ratings yet
Unit-3 - Unix Scripts
21 pages
Data Mining-2-1
No ratings yet
Data Mining-2-1
12 pages
Pearson 1932
No ratings yet
Pearson 1932
59 pages
Amazon Relational Database Service (RDS) TPC-C Benchmark
No ratings yet
Amazon Relational Database Service (RDS) TPC-C Benchmark
15 pages
File Input-Output in CPP
No ratings yet
File Input-Output in CPP
68 pages
Mapreduce Advanced
No ratings yet
Mapreduce Advanced
26 pages
Virtual Ization
No ratings yet
Virtual Ization
42 pages
Lecture - 1
No ratings yet
Lecture - 1
17 pages
Saturation Attack in SDN
100% (1)
Saturation Attack in SDN
2 pages
Python Programming
No ratings yet
Python Programming
6 pages
Supervisor Visit Form
No ratings yet
Supervisor Visit Form
3 pages
Chapter 6 Functions: For Educational Purpose Only. Not To Be Circulated Without This Banner
No ratings yet
Chapter 6 Functions: For Educational Purpose Only. Not To Be Circulated Without This Banner
132 pages
Etere Media Asset Management
No ratings yet
Etere Media Asset Management
4 pages
Java EE Midlife Crisis: by Azrul MADISA
No ratings yet
Java EE Midlife Crisis: by Azrul MADISA
41 pages
Install OpenOPC and PyComm Library in Python For PLC
No ratings yet
Install OpenOPC and PyComm Library in Python For PLC
1 page
Microsoft Power FX Overview: Article - 02/23/2023 - 20 Minutes To Read
No ratings yet
Microsoft Power FX Overview: Article - 02/23/2023 - 20 Minutes To Read
1,207 pages
DCS115
No ratings yet
DCS115
304 pages
3
No ratings yet
3
3 pages
System Analysis
No ratings yet
System Analysis
5 pages
PDF What Every Engineer Should Know about Software Engineering, 2nd Edition Phillip A. Laplante download
100% (1)
PDF What Every Engineer Should Know about Software Engineering, 2nd Edition Phillip A. Laplante download
48 pages
02 - CM2015 - Variables, Control Flow and Functions (2022-10)
No ratings yet
02 - CM2015 - Variables, Control Flow and Functions (2022-10)
23 pages
Full-Stack Development 5 Day Workshop Syllabus
No ratings yet
Full-Stack Development 5 Day Workshop Syllabus
5 pages
Asm Transformations
No ratings yet
Asm Transformations
7 pages
Lab 7
No ratings yet
Lab 7
3 pages
DIP10 - MATLAB Example - Utility M-Functions For Intensity Transformations
100% (1)
DIP10 - MATLAB Example - Utility M-Functions For Intensity Transformations
19 pages
CS Class 12
No ratings yet
CS Class 12
21 pages
Azure Devops
100% (2)
Azure Devops
25 pages
Lab 4.3 Guess Who
No ratings yet
Lab 4.3 Guess Who
2 pages
How Does A Web Application Work?
No ratings yet
How Does A Web Application Work?
9 pages
Experiment No. 2 Web Page Using HTML5 Title: Objective
No ratings yet
Experiment No. 2 Web Page Using HTML5 Title: Objective
34 pages
Data Flow Testing
No ratings yet
Data Flow Testing
1 page
Interview Question
No ratings yet
Interview Question
6 pages
Java Notes B.Sc-1-1 PDF
No ratings yet
Java Notes B.Sc-1-1 PDF
28 pages
Generated 1
No ratings yet
Generated 1
22 pages
Chapitre 3 Autosar
No ratings yet
Chapitre 3 Autosar
70 pages
ICT SS1 WK8 PROGRAMMING LANGUAGE I
No ratings yet
ICT SS1 WK8 PROGRAMMING LANGUAGE I
4 pages
PF-Lab-16-File Handling
No ratings yet
PF-Lab-16-File Handling
6 pages
OOP1 Practical Index 2022-23
No ratings yet
OOP1 Practical Index 2022-23
2 pages
CIS Red Hat Enterprise Linux 8 STIG Benchmark v1.0.0
No ratings yet
CIS Red Hat Enterprise Linux 8 STIG Benchmark v1.0.0
1,403 pages
E Commerce 1
No ratings yet
E Commerce 1
19 pages

Map Reduce

Uploaded by

Map Reduce

Uploaded by

Cloud Computing

MapReduce, Batch Processing

 Batch processing: processing large amounts of data at-

 Material is from the paper:

 Ideally, we would like to process data in parallel but not deal

Word: Number of Occurrences

 Input: a set of key/value pairs

map(String input_key, String input_value):

 MapReduce implementation presented in the paper

 For each map/reduce task:

 The location of intermediate file regions is passed from maps to

 Network bandwidth is a relatively scarce resource and also

 But, this, increases load on the master

 M could be chosen with respect to the block size of the file

 Very close to end of MapReduce operation, master schedules backup

 Example: sort operation takes 44% longer to complete when the

 Partitioning function identifies the reduce task

 Important to choose well-balanced partitioning functions:

 Introduce a mini-reduce phase before intermediate data is

 Saves network bandwidth

Without backup With machine failures,

 MapReduce is a very powerful and expressive model

You might also like