Chapter 4 MapReduce

MapReduce

Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Engineering
HCMC University of Technology

HPC Lab-CSE-HCMUT
Ref
– MapReduce algorithm design, Jimmy Lin

MapReduce: A Real World Analogy
Coins Deposit

Coins Counting Machine

Mapper: Categorize coins by their face values

Reducer: Count the coins in each face value in parallel

MapReduce
• Programmers specify two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
(All values with the same key are sent to the same reducer)
• The execution framework handles everything else...
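
The two signatures above can be exercised with a small in-memory sketch. This is only an illustration of the programming model, not how Hadoop or Google's runtime is implemented: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and the "framework" here is a single-process loop that groups intermediate values by key.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver: map every record, group by key, reduce per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):      # map (k1, v1) -> [<k2, v2>]
            groups[k2].append(v2)          # all values for one key end up together
    output = []
    for k2 in sorted(groups):              # keys are visited in sorted order
        output.extend(reducer(k2, groups[k2]))  # reduce (k2, [v2]) -> [<k3, v3>]
    return output


def wc_map(line_no, line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)


lines = ["Deer Beer River", "Car Car River", "Deer Car Beer"]
result = run_mapreduce(enumerate(lines), wc_map, wc_reduce)
print(result)
```

The programmer writes only `wc_map` and `wc_reduce`; everything inside `run_mapreduce` stands in for what the execution framework would do across many machines.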

MapReduce “runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts
• Everything happens on top of a distributed file system

Synchronization & ordering
• Barrier between map and reduce phases
– But intermediate data can be copied over as soon as mappers finish
• Keys arrive at each reducer in sorted order
– No enforced ordering across reducers

MapReduce
• Programmers specify two functions:
Map (k1, v1) → <k2, v2>*
Reduce (k2, list (v2)) → list (v3)
(All values with the same key are sent to the same reducer)
• The execution framework handles everything else...
• Not quite...usually, programmers also specify:
partition (k2, number of partitions) → partition for k2
– Often a simple hash of the key, e.g., hash(k2) mod n
– Divides up key space for parallel reduce operations
combine (k2, v2) → <k2, v2>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
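
A hash partitioner of the kind described above might look like the following sketch. The function names are hypothetical, and `zlib.crc32` is used only because it is a stable hash available in the Python standard library; real frameworks use their own hash functions.

```python
import zlib


def partition(key: str, num_partitions: int) -> int:
    """Map a key to a reducer index with a stable hash: hash(k2) mod n."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_to_reducers(keys, num_partitions):
    """Group distinct keys by partition; each reducer sees its own keys in sorted order."""
    buckets = [set() for _ in range(num_partitions)]
    for key in keys:
        buckets[partition(key, num_partitions)].add(key)
    return [sorted(bucket) for bucket in buckets]


keys = ["Deer", "Beer", "River", "Car", "Car", "River", "Deer", "Car", "Beer"]
buckets = assign_to_reducers(keys, 2)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Every occurrence of a key lands in the same bucket, which is what guarantees that one reducer sees all values for that key. Keys are sorted within each bucket, but nothing orders one bucket relative to another.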

What’s the big deal?
• Developers need the right level of abstraction
– Moving beyond the von Neumann architecture
– We need better programming models
• Abstractions hide low-level details from the developers
– No more race conditions, lock contention, etc.
• MapReduce separates the what from the how
– The developer specifies the computation that needs to be performed
– The execution framework (“runtime”) handles actual execution

The data center is the computer?

MapReduce can refer to…
• The programming model
• The execution framework (aka “runtime”)
• The specific implementation

MapReduce Implementations
• Google has a proprietary implementation in C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in Java
– Development led by Yahoo, now an Apache project
– Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, etc.
– The de facto big data processing platform
– Rapidly expanding software ecosystem
• Lots of custom research implementations
– For GPUs, Cell processors, etc.

MapReduce algorithm design
• The execution framework handles “everything else”...
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate data
– Errors and faults: detects worker failures and restarts
• Limited control over data and execution flow
– All algorithms must be expressed in terms of mappers, reducers, combiners, and partitioners (m, r, c, p)
• You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing

Apache Hadoop

Data volumes: Google Example
• Analyze 10 billion web pages
• Average size of a webpage: 20 KB
• Size of the collection: 10 billion x 20 KB = 200 TB
• HDD read bandwidth: 150 MB/sec
• Time needed to read all web pages (without analyzing them): about 1.3 million seconds = more than 15 days
• A single node architecture is not adequate

Data volumes: Google Example with SSD
• Analyze 10 billion web pages
• Average size of a webpage: 20 KB
• Size of the collection: 10 billion x 20 KB = 200 TB
• SSD read bandwidth: 550 MB/sec
• Time needed to read all web pages (without analyzing them): about 360,000 seconds = more than 4 days
• A single node architecture is not adequate
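
The read-time estimates can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 10^3 bytes, 1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a single disk read sequentially at full bandwidth; `read_time` is a hypothetical helper name.

```python
KB, MB, TB = 10**3, 10**6, 10**12
SECONDS_PER_DAY = 86_400


def read_time(num_pages, page_bytes, bandwidth_bytes_per_sec):
    """Return (total bytes, seconds, days) to read the whole collection once."""
    total_bytes = num_pages * page_bytes
    seconds = total_bytes / bandwidth_bytes_per_sec
    return total_bytes, seconds, seconds / SECONDS_PER_DAY


for name, bandwidth in [("HDD", 150 * MB), ("SSD", 550 * MB)]:
    total, seconds, days = read_time(10 * 10**9, 20 * KB, bandwidth)
    print(f"{name}: {total / TB:.0f} TB, {seconds:,.0f} s, {days:.1f} days")
```

With these assumptions the HDD case comes to roughly 1.33 million seconds (about 15.4 days) and the SSD case to roughly 364,000 seconds (about 4.2 days).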


Apache Hadoop
• Scalable fault-tolerant distributed system for Big Data
o Distributed Data Storage
o Distributed Data Processing
o Borrowed concepts/ideas from the systems designed at Google (Google File System for Google's MapReduce)
o Open source project under the Apache license
  – But there are also many commercial distributions (e.g., Cloudera, Hortonworks, MapR)

Hadoop History
• Dec 2004 - Google published the MapReduce paper (the GFS paper had appeared in 2003)
• July 2005 - Nutch uses MapReduce
• Feb 2006 - Hadoop becomes a Lucene subproject
• Apr 2007 - Yahoo! runs it on a 1000-node cluster
• Jan 2008 - Hadoop becomes an Apache Top Level Project
• Jul 2008 - Hadoop is tested on a 4000-node cluster
• Feb 2009 - The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores
• June 2009 - Yahoo! made available the source code of its production version of Hadoop
• In 2010 Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage
o On July 27, 2011 they announced the data had grown to 30 PB.

Hadoop vs. HPC
• Hadoop
o Designed for data-intensive workloads
o Usually, no CPU-demanding/intensive tasks
• HPC (High-performance computing)
o A supercomputer with a high level of computational capacity
  – Performance of a supercomputer is measured in floating-point operations per second (FLOPS)
o Designed for CPU-intensive tasks
o Usually used to process "small" data sets

Hadoop: main components
• Core components of Hadoop:
o Distributed Big Data Processing Infrastructure based on the MapReduce programming paradigm
  – Provides a high-level abstraction view
    – Programmers do not need to care about task scheduling and synchronization
  – Fault-tolerant
    – Node and task failures are automatically managed by the Hadoop system
o HDFS (Hadoop Distributed File System)
  – High availability distributed storage
  – Fault-tolerant

HDFS
(Hadoop Distributed File System)

HDFS
• HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand
• HDFS is the primary distributed storage for Hadoop applications
• HDFS provides interfaces for applications to move themselves closer to data
• HDFS is designed to ‘just work’; however, a working knowledge helps in diagnostics and improvements

HDFS: a distributed file system

Example with number of replicas per chunk = 2

HDFS – Data Organization
• Each file written into HDFS is split into data blocks
• Each block is stored on one or more nodes
• Each copy of the block is called a replica
• Block placement policy
o First replica is placed on the local node
o Second replica is placed in a different rack
o Third replica is placed in the same rack as the second replica
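
The placement policy above can be sketched as a small function. This is a simplified illustration, not HDFS code: the topology dictionary, node names, and function names are hypothetical, and the sketch assumes the third replica goes to a different node than the second and that at least one remote rack has two nodes.

```python
import random


def rack_of(node, topology):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(writer, topology, rng):
    """Pick three nodes for one block following the policy on the slide."""
    local_rack = rack_of(writer, topology)
    # Candidate remote racks must be able to hold two distinct replicas.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [writer, second, third]   # first replica stays on the local node


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas)
```

Keeping one replica local makes the write cheap, while the two replicas on another rack protect the block against the loss of a whole rack.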


HDFS architecture (1)

HDFS architecture (2)
There are two (and a half) types of machines in a HDFS cluster
• NameNode is the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g., what blocks make up a file, and on which DataNodes those blocks are stored
• DataNode is where HDFS stores the actual data; there are usually quite a few of these

Read operation in HDFS

Write operation in HDFS

Unique features of HDFS
• HDFS also has a bunch of unique features that make it ideal for distributed systems:
– Failure tolerant - data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3 (every block is stored on three machines).
– Scalability - data transfers happen directly with the DataNodes, so your read/write capacity scales fairly well with the number of DataNodes
– Space - need more disk space? Just add more DataNodes and re-balance
– Industry standard - other distributed applications are built on top of HDFS (HBase, MapReduce)
• HDFS is designed to process large data sets with write-once-read-many access; it is not for low-latency access

MapReduce & HDFS

Algorithm & programming

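MapReduce Example: Word Count
Stages: Input → Split → Map → Shuffle/Sort → Reduce → Output

Input:
Deer Beer River
Car Car River
Deer Car Beer

Split and Map (one split per line):
Deer Beer River → (Deer, 1), (Beer, 1), (River, 1)
Car Car River → (Car, 1), (Car, 1), (River, 1)
Deer Car Beer → (Deer, 1), (Car, 1), (Beer, 1)

Shuffle/Sort and Reduce:
Beer: (Beer, 1), (Beer, 1) → (Beer, 2)
Car: (Car, 1), (Car, 1), (Car, 1) → (Car, 3)
Deer: (Deer, 1), (Deer, 1) → (Deer, 2)
River: (River, 1), (River, 1) → (River, 2)

Output:
Beer, 2
Car, 3
Deer, 2
River, 2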

Q: What are the Key and Value Pairs of Map and Reduce?
Map: Key=word, Value=1
Reduce: Key=word, Value=aggregated count

Word Count: baseline

Q: Do you see any place we can improve the efficiency?
Local aggregation at the mapper can improve MapReduce efficiency.

MapReduce: Combiner
• Combiner: does local aggregation (a combine task) at the mapper

Map output (Car, 1), (Car, 1), (River, 1) → Combine → (Car, 2), (River, 1)
Reducer input (Car, 2), (Car, 1) → Reduce → (Car, 3)

• Q: What are the benefits of using a combiner?
– Reduces memory/disk requirements of Map tasks
– Reduces network traffic
• Q: Can we remove the reduce function?
– No, the reducer still needs to process records with the same key but from different mappers
• Q: How would you implement a combiner?
– For word count, it is the same as the Reducer!
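
The effect of the combiner on the word count example can be illustrated with a short sketch. It assumes one map task per input line and counts how many intermediate records would cross the network with and without local aggregation; all function names are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word within one map task's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(pairs):
    """Reducer side: sum counts per word across all map tasks."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["Deer Beer River", "Car Car River", "Deer Car Beer"]
shuffled_without = [pair for line in lines for pair in map_words(line)]
shuffled_with = [pair for line in lines for pair in combine(map_words(line))]

print(len(shuffled_without), len(shuffled_with))
print(reduce_all(shuffled_with))
```

On this tiny input the saving is a single record, since only the second line repeats a word. The final counts are identical either way, which is the property a combiner has to preserve.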

Shuffle and sort

Preserving state

Implementation don’ts
• Don’t unnecessarily create objects
– Object creation is costly
– Garbage collection is costly
• Don’t buffer objects
– Processes have limited heap size (remember, commodity machines)
– May work for small datasets, but won’t scale!

Word Count: version 1

Word Count: version 2

Are combiners still needed?

Design pattern for local aggregation
• “In-mapper combining”
– Fold the functionality of the combiner into the mapper by preserving state across multiple map calls
• Advantages
– Speed
– Why is this faster than actual combiners?
• Disadvantages
– Explicit memory management required
– Potential for order-dependent bugs
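
A minimal sketch of in-mapper combining for word count follows. The class and its `map`/`close` methods are hypothetical stand-ins for a mapper object whose lifetime spans many map calls; real frameworks expose their own setup and cleanup hooks.

```python
from collections import defaultdict


class InMapperCombiningWordCount:
    """Mapper that keeps partial counts in memory and emits them once at the end."""

    def __init__(self, emit):
        self.emit = emit                 # callback standing in for the framework's Emit
        self.counts = defaultdict(int)   # state preserved across map() calls

    def map(self, key, line):
        # Nothing is emitted here; counts are only accumulated.
        for word in line.split():
            self.counts[word] += 1

    def close(self):
        # Called once after the last map() call: flush the aggregated counts.
        for word, count in self.counts.items():
            self.emit(word, count)
        self.counts.clear()


emitted = []
mapper = InMapperCombiningWordCount(lambda k, v: emitted.append((k, v)))
for line_no, line in enumerate(["Car Car River", "Deer Car Beer"]):
    mapper.map(line_no, line)
mapper.close()
print(sorted(emitted))
```

The dictionary is the explicit memory management the slide warns about: with a large vocabulary it would need to be flushed whenever it grows past some size limit.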


Combiner design
• Combiners and reducers share the same method signature
– Sometimes, reducers can serve as combiners
– Often, not…
• Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
• Example: find the average of integers associated with the same key

Computing the Mean: version 1

Why can’t we use Reducer as Combiner?

Computing the Mean: version 2

Computing the Mean: version 3

Computing the Mean: version 4

Are combiners still needed?

Word Count & sorting
• New Goal: output all words sorted by their frequencies (total counts) in a document.
• Question: How would you adapt the basic word count program to solve it?
• Solution:
– Sort words by their counts in the reducer
– Problem: what happens if we have more than one reducer?
• Solution:
– Do two rounds of MapReduce
– In the 2nd round, take the output of WordCount as input but swap the key and value of each pair!
– Leverage the sorting capability of shuffle/sort to do the global sorting!
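
The two-round approach can be sketched with a toy driver. This assumes a single reducer in the second round so that the key order produced by shuffle/sort is a global order; with several reducers, a partitioner that assigns count ranges to reducers would be needed. All names are hypothetical.

```python
from collections import defaultdict


def run_round(records, mapper, reducer):
    """Toy single-reducer round: map, group by key, visit keys in sorted order."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):            # stands in for shuffle/sort
        output.extend(reducer(k2, groups[k2]))
    return output


# Round 1: ordinary word count.
def count_map(line_no, line):
    for word in line.split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


# Round 2: swap key and value so the framework sorts by count.
def swap_map(word, count):
    yield count, word


def swap_reduce(count, words):
    for word in words:
        yield word, count


lines = ["Deer Beer River", "Car Car River", "Deer Car Beer"]
counts = run_round(enumerate(lines), count_map, count_reduce)
by_frequency = run_round(counts, swap_map, swap_reduce)
print(by_frequency)
```

The second round does no sorting of its own; the order comes entirely from the keys being visited in sorted order. The result here is ascending by count, and a descending order would require a reversed key comparator.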


Word Count & top K words
• New Goal: output the top K words sorted by their frequencies (total counts) in a document.
• Question: How would you adapt the basic word count program to solve it?
• Solution:
– Use the solution of the previous problem and only grab the top K in the final output
– Problem: is there a more efficient way to do it?
• Solution:
– Add a sort function to the reducer in the first round and only output the top K words
– Intuition: every word in the global top K must also be in the local top K of the reducer that handles it!
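
The local top K idea can be sketched as follows. The per-reducer word counts are made-up illustrative data, the function names are hypothetical, and the final merge over the local candidates is an assumed extra step that the slide does not spell out.

```python
import heapq


def local_top_k(word_counts, k):
    """Each reducer keeps only its K most frequent words."""
    return heapq.nlargest(k, word_counts, key=lambda wc: wc[1])


def global_top_k(per_reducer_counts, k):
    """Merge the local candidates; the global top K is guaranteed to be among them."""
    candidates = [wc for counts in per_reducer_counts
                  for wc in local_top_k(counts, k)]
    return heapq.nlargest(k, candidates, key=lambda wc: wc[1])


# Hypothetical final counts held by two reducers (each word lives on one reducer).
reducer_0 = [("car", 7), ("beer", 2), ("deer", 5)]
reducer_1 = [("river", 6), ("bear", 1), ("lake", 3)]

print(local_top_k(reducer_0, 2))
print(global_top_k([reducer_0, reducer_1], 2))
```

Only K records per reducer reach the merge step, instead of the full word list that a second sorting round would have to process.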


MapReduce In-class Exercise
• Problem: Find the maximum monthly temperature for each year from weather reports
• Input: A set of records with format as:
<Year/Month, Average Temperature of that month>
- (200707,100), (200706,90)
- (200508, 90), (200607,100)
- (200708, 80), (200606,80)
• Question: write down the Map and Reduce functions to solve this problem
– Assume we split the input by line

Mapper and Reducer of Max Temperature
• Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
• Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int max_temp = -100; // sentinel: assumes no temperature is below -100
for each v in values:
max_temp = max(v, max_temp);
Emit(key, max_temp);}
• Combiner is the same as Reducer
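
The pseudocode above can be turned into a small runnable sketch. It assumes each input line is already parsed into (year-month, temperature) tuples and simulates one map task per line; `run` and the other names are hypothetical. Using `max()` over the values avoids the need for a sentinel.

```python
from collections import defaultdict


def map_max(line_no, tuples):
    # Key is the year, taken from the YYYYMM integer.
    for year_month, temperature in tuples:
        yield year_month // 100, temperature


def reduce_max(year, temperatures):
    yield year, max(temperatures)


def run(lines, use_combiner):
    shuffled = defaultdict(list)
    for line_no, tuples in enumerate(lines):
        per_task = defaultdict(list)
        for year, temperature in map_max(line_no, tuples):
            per_task[year].append(temperature)
        for year, temperatures in per_task.items():
            if use_combiner:
                # The reducer doubles as the combiner on each map task's output.
                for y, t in reduce_max(year, temperatures):
                    shuffled[y].append(t)
            else:
                shuffled[year].extend(temperatures)
    return [pair for year in sorted(shuffled)
            for pair in reduce_max(year, shuffled[year])]


lines = [
    [(200707, 100), (200706, 90)],
    [(200508, 90), (200607, 100)],
    [(200708, 80), (200606, 80)],
]
print(run(lines, use_combiner=True))
print(run(lines, use_combiner=False))
```

Both runs give the same result, because the maximum of local maxima equals the maximum over all values.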

MapReduce Example: Max Temperature
Input
(200707,100), (200706,90)
(200508, 90), (200607,100)
(200708, 80), (200606,80)

Map

(2007,100), (2007,90) (2005, 90), (2006,100) (2007, 80), (2006, 80)

Combine

(2007,100) (2005, 90), (2006,100) (2007, 80), (2006, 80)

Shuffle/Sort

(2005,[90]) (2006,[100, 80]) (2007,[100, 80])

Reduce

(2005,90) (2006,100) (2007,100)

MapReduce In-class Exercise
• Key-Value Pair of Map and Reduce:
– Map: (year, temperature)
– Reduce: (year, maximum temperature of the year)

• Question: How to use the above MapReduce program (that contains the combiner) with slight changes to find the average monthly temperature of the year?

Mapper and Reducer of Average Temperature
• Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
• Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp = total_temp + v;
Emit(key, total_temp/size_of(values));}
• Combiner is the same as Reducer

MapReduce Example: Average Temperature
Input
(200707,100), (200706,90)
(200508, 90), (200607,100)
(200708, 80), (200606,80)

Map

(2007,100), (2007,90) (2005, 90), (2006,100) (2007, 80), (2006,80)

Combine

(2007,95) (2005, 90), (2006,100) (2007, 80), (2006,80)

Shuffle/Sort

(2005,[90]) (2006,[100, 80]) (2007,[95, 80])

Reduce

(2005,90) (2006,90) (2007,87.5)

Real average of 2007: 90

MapReduce In-class Exercise
• The problem is with the combiner!
• Here is a simple counterexample:
– (2007, 100), (2007,90) -> (2007, 95)
(2007,80) -> (2007,80)
– Average of the above is: (2007,87.5)
– However, the real average is: (2007,90)

• However, we can do a small trick to get around this
– Combiner (at the mapper): (2007, 100), (2007,90) -> (2007, <190,2>)
(2007,80) -> (2007,<80,1>)
– Reducer: (2007,<270,3>) -> (2007,90)
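
The counterexample and the (sum, count) trick can be checked numerically. In this sketch the mapper is assumed to emit each temperature already wrapped as a (temperature, 1) pair, a variation on the slide that lets the combiner's input and output have the same shape, so the result stays correct whether the combiner runs zero, one, or several times. All names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


# Wrong: averaging the per-mapper averages.
average_of_averages = mean([mean([100, 90]), mean([80])])


# Right: carry (sum, count) pairs and divide only at the very end.
def to_pair(temperature):
    return (temperature, 1)


def combine_pairs(pairs):
    """Combiner: add sums and counts; input and output are both (sum, count)."""
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))


def reduce_mean(pairs):
    """Reducer: merge all pairs for the key, then divide once."""
    total, count = combine_pairs(pairs)
    return total / count


mapper_1 = [to_pair(100), to_pair(90)]
mapper_2 = [to_pair(80)]

with_combiner = reduce_mean([combine_pairs(mapper_1), combine_pairs(mapper_2)])
without_combiner = reduce_mean(mapper_1 + mapper_2)

print(average_of_averages, with_combiner, without_combiner)
```

The intermediate pairs with the combiner are (190, 2) and (80, 1), matching the slide, and the reducer sees (270, 3) in total.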

MapReduce Example: Average Temperature
Input
(200707,100), (200706,90)
(200508, 90), (200607,100)
(200708, 80), (200606,80)

Map

(2007,100), (2007,90) (2005, 90), (2006,100) (2007, 80), (2006,80)

Combine

(2007,<190,2>) (2005, <90,1>), (2006, <100,1>) (2007, <80,1>), (2006,<80,1>)

Shuffle/Sort

(2005,[<90,1>]) (2006,[<100,1>, <80,1>]) (2007,[<190,2>, <80,1>])

Reduce

(2005,90) (2006,90) (2007,90)

Mapper and Reducer of Average Temperature
• Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
• Combine(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp = total_temp + v;
Emit(key, <total_temp, size_of(values)>);}
• Reduce(key, list of values){
// key: year
// list of values: a list of <temperature sum, count> tuples
int total_temp = 0;
int total_count = 0;
for each v in values:
total_temp = total_temp + v->sum;
total_count = total_count + v->count;
Emit(key, total_temp/total_count);}

MapReduce In-class Exercise

• Functions whose reducer can be reused directly as a combiner are called distributive:
– Distributive: Min/Max(), Sum(), Count(), TopK()
– Non-distributive: Mean(), Median(), Rank()
Gray, Jim*, et al. "Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals." Data Mining and Knowledge Discovery 1.1 (1997): 29-53.

*Jim Gray received the Turing Award in 1998

Map Reduce Problems Discussion
• Problem 1: Find Word Length Distribution
• Statement: Given a set of documents, use MapReduce to find the length distribution of all words contained in the documents
• Question:
– What are the Mapper and Reducer Functions?
Example input:
This is a test data for the word length distribution problem
Example output after MapReduce (word length: number of words):
12: 1
7: 1
6: 1
4: 4
3: 2
2: 1
1: 1

Mapper and Reducer of Word Length Distribution
• Map(key, value){
// key: document name
// value: words in a document
for each word w in value:
Emit(length(w), w);}
• Reduce(key, list of values){
// key: length of a word
// list of values: a list of words with the same length
Emit(key, size_of(values));}
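
A runnable sketch of this mapper and reducer, applied to the example sentence from the problem statement, could look like this. The driver function and the document name `doc1` are hypothetical.

```python
from collections import defaultdict


def map_lengths(doc_name, text):
    # Key is the word length, value is the word itself.
    for word in text.split():
        yield len(word), word


def reduce_lengths(length, words):
    # The number of words that share this length.
    yield length, len(words)


def word_length_distribution(docs):
    groups = defaultdict(list)
    for doc_name, text in docs.items():
        for length, word in map_lengths(doc_name, text):
            groups[length].append(word)
    return dict(pair for length in sorted(groups)
                for pair in reduce_lengths(length, groups[length]))


docs = {"doc1": "This is a test data for the word length distribution problem"}
distribution = word_length_distribution(docs)
print(distribution)
```

If the mapper emitted `(len(word), 1)` instead of the word itself, the reducer would become a plain sum and could also serve as a combiner.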


Map Reduce Problems Discussion
• Problem 1: Find Word Length Distribution
• Mapper and Reducer:
– Mapper(document)
{ Emit (Length(word), word) }
– Reducer(output of map)
{ Emit (Length(word), Size of (List of words at a particular length))}

Map Reduce Problems Discussion
• Problem 2: Indexing & Page Rank
• Statement: Given a set of web pages, each page has a page rank associated with it, use MapReduce to find, for each word, a list of pages (sorted by rank) that contain that word
• Question:
– What are the Mapper and Reducer Functions?
Expected output after MapReduce:
Word 1: [page x1, page x2, ..]
Word 2: [page y1, page y2, …]
…

Page Rank

Mapper and Reducer of Indexing and PageRank
• Map(key, value){
// key: a page
// value: words in a page
for each word w in value:
Emit(w, <page_id, page_rank>);}
• Reduce(key, list of values){
// key: a word
// list of values: a list of pages containing that word
sorted_pages = sort(values, page_rank)
Emit(key, sorted_pages);}
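
The indexing job can be sketched in Python as below. The page contents and rank values are made-up illustrative data, the names are hypothetical, and the sketch assumes "sorted by rank" means highest rank first and that each page should appear once per word.

```python
from collections import defaultdict


def map_index(page_id, page):
    text, page_rank = page
    # set() so that a page is emitted once per distinct word.
    for word in set(text.split()):
        yield word, (page_id, page_rank)


def reduce_index(word, postings):
    # Sort the pages containing this word by rank, highest first.
    ranked = sorted(postings, key=lambda posting: posting[1], reverse=True)
    yield word, [page_id for page_id, _ in ranked]


def build_index(pages):
    groups = defaultdict(list)
    for page_id, page in pages.items():
        for word, posting in map_index(page_id, page):
            groups[word].append(posting)
    return dict(pair for word in sorted(groups)
                for pair in reduce_index(word, groups[word]))


# Hypothetical pages: page_id -> (text, page_rank)
pages = {
    "p1": ("hadoop mapreduce", 0.2),
    "p2": ("mapreduce hdfs", 0.5),
    "p3": ("hadoop hdfs mapreduce", 0.3),
}
index = build_index(pages)
print(index)
```

The sort happens inside the reducer, so the posting list of a very common word has to fit in that reducer's memory.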


Map Reduce Problems Discussion
• Problem 2: Indexing and Page Rank
• Mapper and Reducer:
– Mapper(page_id, <page_text, page_rank>)
{ Emit (word, <page_id, page_rank>) }
– Reducer(output of map)
{ Emit (word, List of pages that contain the word, sorted by their page_ranks)}

Map Reduce Problems Discussion
• Problem 3: Find Common Friends
• Statement: Given a group of people on online social media (e.g., Facebook), each has a list of friends, use MapReduce to find common friends of any two persons who are friends
• Question:
– What are the Mapper and Reducer Functions?

Map Reduce Problems Discussion
• Problem 3: Find Common Friends
• Simple example (friendship graph over the persons A, B, C, D):
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B
Output after MapReduce:
(A,B) -> C,D
(A,C) -> B
(A,D) -> ..
….

Mapper and Reducer of Common Friends
• Map(key, value){
// key: person_id
// value: the list of friends of the person
for each friend f_id in value:
// order the pair so that <A,B> and <B,A> become the same key
Emit(<min(person_id, f_id), max(person_id, f_id)>, value);}
• Reduce(key, list of values){
// key: <friend pair>
// list of values: the two friend lists related to the friend pair
for v1, v2 in values:
common_friends = v1 intersects v2;
Emit(key, common_friends);}
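
A runnable sketch of the common friends job on the four-person example is shown below. It assumes friendships are symmetric, so every pair key receives exactly two friend lists; the function names are hypothetical.

```python
from collections import defaultdict


def map_friends(person, friends):
    for friend in friends:
        # Sort the pair so that (A, B) and (B, A) become the same key.
        yield tuple(sorted((person, friend))), set(friends)


def reduce_friends(pair, friend_sets):
    # Intersect the friend lists of the two persons in the pair.
    yield pair, sorted(set.intersection(*friend_sets))


def common_friends(graph):
    groups = defaultdict(list)
    for person, friends in graph.items():
        for pair, friend_set in map_friends(person, friends):
            groups[pair].append(friend_set)
    return dict(result for pair in sorted(groups)
                for result in reduce_friends(pair, groups[pair]))


graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["A", "B"],
}
result = common_friends(graph)
for pair, friends in result.items():
    print(pair, "->", friends)
```

Each friend list is shipped once per friend, so the intermediate data grows with the square of the list length for highly connected people.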

Map Reduce Problems Discussion
• Problem 3: Find Common Friends
• Mapper and Reducer:
– Mapper(friend list of a person)
{ for each person in the friend list:
Emit (<friend pair>, <list of friends>) }
– Reducer(output of map)
{ Emit (<friend pair>, Intersection of the two friend lists (i.e., those of the two persons in the friend pair))}

Map Reduce Problems Discussion
• Problem 3: Find Common Friends
• Mapper and Reducer (the reduce output can be used to suggest friends):
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B
Map:
(A,B) -> B,C,D
(A,C) -> B,C,D
(A,D) -> B,C,D
(A,B) -> A,C,D
(B,C) -> A,C,D
(B,D) -> A,C,D
(A,C) -> A,B
(B,C) -> A,B
(A,D) -> A,B
(B,D) -> A,B
Reduce:
(A,B) -> C,D
(A,C) -> B
(A,D) -> B
(B,C) -> A
(B,D) -> A

Enjoy MR and Hadoop :)
