Chapter 4 MapReduce
Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Engineering
HCMC University of Technology
Reference: MapReduce Algorithm Design, Jimmy Lin
MapReduce: A Real-World Analogy (Coin Deposit)
(Figure slides.)
MapReduce
• Programmers specify two functions:
  Map (k1, v1) → <k2, v2>*
  Reduce (k2, list(v2)) → list(v3)
  (All values with the same key are sent to the same reducer)
• The execution framework handles everything else...
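As a minimal sketch of what these two signatures look like in code (assuming the Hadoop Java API; the class names and types here are illustrative, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map (k1, v1) -> <k2, v2>*: called once per input record, may emit any number of pairs
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable k1, Text v1, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(new Text("k2"), new IntWritable(1));   // emit an intermediate <k2, v2> pair
    }
}

// Reduce (k2, list(v2)) -> list(v3): all values with the same key arrive at the same reducer
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text k2, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(k2, new IntWritable(sum));             // emit a v3 for this key
    }
}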
MapReduce "runtime"
• Handles scheduling
  – Assigns workers to map and reduce tasks
• Handles "data distribution"
  – Moves processes to data
• Handles synchronization
  – Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  – Detects worker failures and restarts
• Everything happens on top of a distributed file system
Synchronization & ordering
• Barrier between map and reduce phases
  – But intermediate data can be copied over as soon as mappers finish
• Keys arrive at each reducer in sorted order
  – No enforced ordering across reducers
MapReduce
• Programmers specify two functions:
  Map (k1, v1) → <k2, v2>*
  Reduce (k2, list(v2)) → list(v3)
  (All values with the same key are sent to the same reducer)
• The execution framework handles everything else...
• Not quite... usually, programmers also specify:
  partition (k2, number of partitions) → partition for k2
  – Often a simple hash of the key, e.g., hash(k') mod n (see the sketch after this list)
  – Divides up the key space for parallel reduce operations
  combine (k2, v2) → <k2, v2>*
  – Mini-reducers that run in memory after the map phase
  – Used as an optimization to reduce network traffic
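A minimal sketch of a custom partitioner (assuming the Hadoop Java API; Hadoop's default HashPartitioner already implements exactly this hash-mod-n assignment, so a custom class is only needed for non-default policies):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// partition (k2, number of partitions) -> partition for k2
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // a simple hash of the key: hash(k') mod n (masked to stay non-negative)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}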
What's the big deal?
• Developers need the right level of abstraction
  – Moving beyond the von Neumann architecture
  – We need better programming models
• Abstractions hide low-level details from the developers
  – No more race conditions, lock contention, etc.
• MapReduce separates the what from the how
  – The developer specifies the computation that needs to be performed
  – The execution framework ("runtime") handles the actual execution
The data center is the computer?
MapReduce can refer to…
• The programming model
• The execution framework (aka "runtime")
• The specific implementation
MapReduce Implementations
• Google has a proprietary implementation in C++
  – Bindings in Java, Python
• Hadoop is an open-source implementation in Java
  – Development led by Yahoo, now an Apache project
  – Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, etc.
  – The de facto big data processing platform
  – Rapidly expanding software ecosystem
• Lots of custom research implementations
  – For GPUs, Cell processors, etc.
MapReduce algorithm design
• The execution framework handles "everything else"...
  – Scheduling: assigns workers to map and reduce tasks
  – "Data distribution": moves processes to data
  – Synchronization: gathers, sorts, and shuffles intermediate data
  – Errors and faults: detects worker failures and restarts
• Limited control over data and execution flow
  – All algorithms must be expressed in terms of m, r, c, p (mapper, reducer, combiner, partitioner)
• You don't know:
  – Where mappers and reducers run
  – When a mapper or reducer begins or finishes
  – Which input a particular mapper is processing
  – Which intermediate key a particular reducer is processing
Apache Hadoop
Data volumes: Google example
• Analyze 10 billion web pages
• Average size of a webpage: 20 KB
• Size of the collection: 10 billion x 20 KB = 200 TB
• HDD read bandwidth: 150 MB/sec
• Time needed to read all web pages (without analyzing them): 200 TB / 150 MB/s ≈ 1.3 million seconds, i.e., more than 15 days
• A single-node architecture is not adequate
Data volumes: Google example with SSD
• Analyze 10 billion web pages
• Average size of a webpage: 20 KB
• Size of the collection: 10 billion x 20 KB = 200 TB
• SSD read bandwidth: 550 MB/sec
• Time needed to read all web pages (without analyzing them): 200 TB / 550 MB/s ≈ 364,000 seconds, i.e., more than 4 days
• A single-node architecture is still not adequate
Apache Hadoop
• Scalable, fault-tolerant distributed system for Big Data
  o Distributed Data Storage
  o Distributed Data Processing
  o Borrowed concepts/ideas from the systems designed at Google (Google File System for Google's MapReduce)
  o Open source project under the Apache license
    Ø But there are also many commercial implementations (e.g., Cloudera, Hortonworks, MapR)
Hadoop History
• Dec 2004 - Google published a paper about GFS
• July 2005 - Nutch uses MapReduce
• Feb 2006 - Hadoop becomes a Lucene subproject
• Apr 2007 - Yahoo! runs it on a 1000-node cluster
• Jan 2008 - Hadoop becomes an Apache Top-Level Project
• Jul 2008 - Hadoop is tested on a 4000-node cluster
• Feb 2009 - The Yahoo! Search Webmap is a Hadoop application running on a Linux cluster with more than 10,000 cores
• June 2009 - Yahoo! made the source code of its production version of Hadoop available
• In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage
  o On July 27, 2011 they announced that the data had grown to 30 PB
Hadoop vs. HPC
• Hadoop
  o Designed for data-intensive workloads
  o Usually not CPU-demanding/intensive tasks
• HPC (High-Performance Computing)
  o A supercomputer with a high-level computational capacity
    Ø Performance of a supercomputer is measured in floating-point operations per second (FLOPS)
  o Designed for CPU-intensive tasks
  o Usually used to process "small" data sets
Hadoop: main components
• Core components of Hadoop:
  o Distributed Big Data Processing Infrastructure based on the MapReduce programming paradigm
    § Provides a high-level abstraction view
      Ø Programmers do not need to care about task scheduling and synchronization
    § Fault-tolerant
      Ø Node and task failures are automatically managed by the Hadoop system
  o HDFS (Hadoop Distributed File System)
    § High-availability distributed storage
    § Fault-tolerant
HDFS (Hadoop Distributed File System)
HDFS
• HDFS is a distributed file system that is fault-tolerant, scalable, and extremely easy to expand
• HDFS is the primary distributed storage for Hadoop applications
• HDFS provides interfaces for applications to move themselves closer to data
• HDFS is designed to 'just work'; however, a working knowledge helps in diagnostics and improvements
HDFS: a distributed file system
(Figure slide.)
HDFS – Data Organization
• Each file written into HDFS is split into data blocks
• Each block is stored on one or more nodes
• Each copy of a block is called a replica
• Block placement policy
  o The first replica is placed on the local node
  o The second replica is placed in a different rack
  o The third replica is placed in the same rack as the second replica
HDFS architecture (1)
HDFS architecture (2)
There are two (and a half) types of machines in an HDFS cluster:
• NameNode: the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g., what blocks make up a file and on which DataNodes those blocks are stored
• DataNode: where HDFS stores the actual data; there are usually quite a few of these
Read operation in HDFS
Write operation in HDFS
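The read and write operation slides above are figures showing the client-side paths. As a rough client-side sketch (assuming the Hadoop FileSystem Java API and a hypothetical file path; this illustrates the API, not the internal replication pipeline):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/demo/example.txt");    // hypothetical path

        // Write: the client asks the NameNode where to place blocks,
        // then streams the data to the chosen DataNodes
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeUTF("hello HDFS");
        }

        // Read: the client fetches block locations from the NameNode,
        // then reads the blocks directly from the DataNodes
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println(in.readUTF());
        }
    }
}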
Unique features of HDFS
• HDFS also has a bunch of unique features that make it ideal for distributed systems:
  Ø Failure tolerant - data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3 (every block is stored on three machines).
  Ø Scalability - data transfers happen directly with the DataNodes, so read/write capacity scales fairly well with the number of DataNodes
  Ø Space - need more disk space? Just add more DataNodes and re-balance
  Ø Industry standard - other distributed applications are built on top of HDFS (HBase, MapReduce)
• HDFS is designed to process large data sets with a write-once-read-many access pattern; it is not for low-latency access
MapReduce & HDFS
(Figure slide.)
Algorithm & programming
MapReduce Example: Word Count
(Figure: an input such as "Deer Beer River ..." flows through Input → Split → Map → Shuffle/Sort → Reduce → Output. The mappers emit pairs such as (Deer, 1), (Beer, 1), (River, 1); shuffle/sort groups equal keys; the reducers emit totals such as (Beer, 2) and (River, 2).)
MapReduce Example: Word Count (cont.)
(Same figure as the previous slide.)
Q: What are the key and value pairs of Map and Reduce?
Map: key = word, value = 1
Reduce: key = word, value = aggregated count
Word Count: baseline
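The baseline word count algorithm is shown as a pseudocode figure on this slide. A sketch of the same algorithm in Hadoop Java, essentially the standard WordCount example (class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(line.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);                    // emit (word, 1) for every token
        }
    }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));       // emit (word, total count)
    }
}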
MapReduce Example: Word Count (cont.)
(Same figure as before.)
Q: Do you see any place where we can improve the efficiency?
Local aggregation at the mapper will improve MapReduce efficiency.
MapReduce: Combiner
• Combiner: do the local aggregation/combine task at the mapper (see the job setup sketch below)
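A minimal sketch of enabling a combiner when configuring a Hadoop job, reusing the TokenizerMapper and IntSumReducer sketched earlier (for word count the reducer can double as the combiner; the driver class name and paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}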
Preserving state
Implementation don'ts
• Don't unnecessarily create objects
  – Object creation is costly
  – Garbage collection is costly
• Don't buffer objects
  – Processes have limited heap size (remember, commodity machines)
  – May work for small datasets, but won't scale!
Word Count: version 1
Word Count: version 2
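The "version 1" and "version 2" slides are figures from Jimmy Lin's design patterns; a common reading of the improved versions is in-mapper combining, where the mapper aggregates counts in memory and emits them only when it finishes. A sketch (assuming the Hadoop Java API; the class name is illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: counts are accumulated across all records seen by this
// mapper and emitted once, in cleanup(), after the last record has been mapped
class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ctx.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

Note the tension with the "don't buffer objects" advice above: the in-memory map must stay small enough to fit in the mapper's heap (e.g., a bounded vocabulary), otherwise this pattern will not scale.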
Combiner design
• Combiners and reducers share the same method signature
  – Sometimes, reducers can serve as combiners
  – Often, not…
• Remember: combiners are optional optimizations
  – Should not affect algorithm correctness
  – May be run 0, 1, or multiple times
• Example: find the average of the integers associated with the same key
Computing the Mean: version 1
Computing the Mean: version 3
Computing the Mean: version 4
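These "version" slides are figures; the key idea of the later versions is that the combiner must emit partial (sum, count) pairs rather than partial means, so the reducer can still compute the exact mean. A sketch (assuming the Hadoop Java API; for brevity the pair is encoded as "sum count" in a Text value, and the mapper is assumed to emit (key, "x 1") for each integer x):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: folds values into a partial (sum, count) pair; it does NOT divide
class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> pairs, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text p : pairs) {
            String[] sc = p.toString().split(" ");
            sum += Long.parseLong(sc[0]);
            count += Long.parseLong(sc[1]);
        }
        ctx.write(key, new Text(sum + " " + count));   // still a pair, so combining stays associative
    }
}

// Reducer: merges the partial pairs and only now divides
class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> pairs, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text p : pairs) {
            String[] sc = p.toString().split(" ");
            sum += Long.parseLong(sc[0]);
            count += Long.parseLong(sc[1]);
        }
        ctx.write(key, new DoubleWritable((double) sum / count));
    }
}

Because the combiner's output has the same form as its input, it can run 0, 1, or many times without changing the final result, which is exactly the correctness requirement stated above.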
Word Count & sorting
• New goal: output all words sorted by their frequencies (total counts) in a document.
• Question: how would you adapt the basic word count program to solve it?
• Solution:
  – Do two rounds of MapReduce
  – In the 2nd round, take the output of WordCount as input but switch the key and value of each pair! (see the sketch below)
  – Leverage the sorting capability of shuffle/sort to do the global sorting!
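A sketch of the second-round mapper (assuming the first round's output is Hadoop's default "word<TAB>count" text lines; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps (word, count) into (count, word) so that shuffle/sort orders records by count
class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");    // parts[0] = word, parts[1] = count
        ctx.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}

The second-round reducer can simply re-emit (word, count). With a single reducer the output is globally sorted by frequency (ascending by default; a custom sort comparator or negated keys give descending order).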
Word Count & top K words
• New goal: output the top K words sorted by their frequencies (total counts) in a document.
• Question: how would you adapt the basic word count program to solve it?
• Solution:
  – Use the solution of the previous problem and only grab the top K in the final output
  – Problem: is there a more efficient way to do it?
Word Count & top K words
• New goal: output the top K words sorted by their frequencies (total counts) in a document.
• Question: how would you adapt the basic word count program to solve it?
• Solution:
  – Add a sort function to the reducer in the first round and only output the top K words (see the sketch below)
  – Intuition: the global top K must be a local top K in any reducer!
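A sketch of a reducer that keeps a local top K with a min-heap and emits it once all of its keys have been processed (assuming the Hadoop Java API; K and the class name are illustrative):

import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.PriorityQueue;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class TopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int K = 10;   // illustrative value
    // min-heap on the count: the smallest of the current top-K candidates sits on top
    private final PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx) {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        heap.add(new SimpleEntry<>(word.toString(), sum));
        if (heap.size() > K) heap.poll();              // drop the smallest, keep only K candidates
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // emit this reducer's local top K; the global top K is contained in the union of these
        for (Map.Entry<String, Integer> e : heap) {
            ctx.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

A small second pass (or a single-reducer first round) is still needed to merge the local top-K lists into the global top K.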
MapReduce In-class Exercise
• Problem: find the maximum monthly temperature for each year from weather reports
• Input: a set of records with the format <Year/Month, Average Temperature of that month>
  - (200707, 100), (200706, 90)
  - (200508, 90), (200607, 100)
  - (200708, 80), (200606, 80)
• Question: write down the Map and Reduce functions to solve this problem
  – Assume we split the input by line
Mapper and Reducer of Max Temperature
• Map(key, value){
    // key: line number
    // value: tuples in a line
    for each tuple t in value:
      Emit(t->year, t->temperature);}

• Reduce(key, list of values){
    // key: year
    // list of values: a list of monthly temperatures
    int max_temp = -100;
    for each v in values:
      max_temp = max(v, max_temp);
    Emit(key, max_temp);}

Note: the Combiner is the same as the Reducer.
MapReduce Example: Max Temperature
(Figure: the input records (200707,100), (200706,90), (200508,90), (200607,100), (200708,80), (200606,80) flow through Map → Combine → Shuffle/Sort → Reduce; the final output is (2005,90), (2006,100), (2007,100).)
MapReduce In-class Exercise
• Key-value pairs of Map and Reduce:
  – Map: (year, temperature)
  – Reduce: (year, maximum temperature of the year)
Mapper and Reducer of Average Temperature
• Map(key, value){
    // key: line number
    // value: tuples in a line
    for each tuple t in value:
      Emit(t->year, t->temperature);}

• Reduce(key, list of values){
    // key: year
    // list of values: a list of monthly temperatures
    int total_temp = 0;
    for each v in values:
      total_temp = total_temp + v;
    Emit(key, total_temp/size_of(values));}

Note: here the Combiner is taken to be the same as the Reducer (the next slides show why this is a problem for averages).
MapReduce Example: Average Temperature
(Figure: with the Reducer reused as the Combiner, the pipeline Map → Combine → Shuffle/Sort → Reduce outputs (2005,90), (2006,90), (2007,87.5), although the real average for 2007 is 90.)
MapReduce In-class Exercise
• The problem is with the combiner!
• Here is a simple counterexample:
  – (2007,100), (2007,90) -> (2007,95) and (2007,80) -> (2007,80)
  – The average of the above is: (2007,87.5)
  – However, the real average is: (2007,90)
• However, we can do a small trick to get around this
  – Mapper: (2007,100), (2007,90) -> (2007,<190,2>) and (2007,80) -> (2007,<80,1>)
  – Reducer: (2007,<270,3>) -> (2007,90)
MapReduce Example: Average Temperature (fixed)
(Figure: with the <sum, count> trick, the pipeline Map → Combine → Reduce outputs the correct averages (2005,90), (2006,90), (2007,90).)
Mapper and Reducer of Average Temperature (with Combiner)
• Map(key, value){
    // key: line number
    // value: tuples in a line
    for each tuple t in value:
      Emit(t->year, t->temperature);}

• Combine(key, list of values){
    // key: year
    // list of values: a list of monthly temperatures
    int total_temp = 0;
    for each v in values:
      total_temp = total_temp + v;
    Emit(key, <total_temp, size_of(values)>);}

• Reduce(key, list of values){
    // key: year
    // list of values: a list of <temperature sum, count> tuples
    int total_temp = 0;
    int total_count = 0;
    for each v in values:
      total_temp = total_temp + v->sum;
      total_count = total_count + v->count;
    Emit(key, total_temp/total_count);}
MapReduce In-class Exercise
• Functions that can use a combiner are called distributive:
  – Distributive: Min/Max(), Sum(), Count(), TopK()
  – Non-distributive: Mean(), Median(), Rank()

Gray, Jim*, et al. "Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals." Data Mining and Knowledge Discovery 1.1 (1997): 29-53.
*Jim Gray received the Turing Award in 1998.
MapReduce Problems Discussion
• Problem 1: Find Word Length Distribution
• Statement: given a set of documents, use MapReduce to find the length distribution of all words contained in the documents
• Question:
  – What are the Mapper and Reducer functions?
(Figure: for the sample text "This is a test data for the word length distribution problem", MapReduce produces the distribution 12: 1, 7: 1, 6: 1, 4: 4, 3: 2, 2: 1, 1: 1.)
Mapper and Reducer of Word Length Distribution
• Map(key, value){
    // key: document name
    // value: words in a document
    for each word w in value:
      Emit(length(w), w);}

• Reduce(key, list of values){
    // key: length of a word
    // list of values: a list of words with the same length
    Emit(key, size_of(values));}
MapReduce Problems Discussion
• Problem 1: Find Word Length Distribution
• Mapper and Reducer:
  – Mapper(document) { Emit(Length(word), word) }
  – Reducer(output of map) { Emit(Length(word), size of (list of words at a particular length)) }
MapReduce Problems Discussion
• Problem 2: Indexing & Page Rank
• Statement: given a set of web pages, each with a page rank associated with it, use MapReduce to find, for each word, a list of pages (sorted by rank) that contain that word
• Question:
  – What are the Mapper and Reducer functions?
(Figure: MapReduce produces output of the form
  Word 1: [page x1, page x2, ..]
  Word 2: [page y1, page y2, …]
  …)
Page Rank
(Figure slide.)
Mapper and Reducer of Indexing and PageRank
• Map(key, value){
    // key: a page
    // value: words in a page
    for each word w in value:
      Emit(w, <page_id, page_rank>);}

• Reduce(key, list of values){
    // key: a word
    // list of values: a list of pages containing that word
    sorted_pages = sort(values, page_rank)
    Emit(key, sorted_pages);}
MapReduce Problems Discussion
• Problem 2: Indexing and Page Rank
• Mapper and Reducer:
  – Mapper(page_id, <page_text, page_rank>) { Emit(word, <page_id, page_rank>) }
  – Reducer(output of map) { Emit(word, list of pages containing the word, sorted by their page_ranks) }
MapReduce Problems Discussion
• Problem 3: Find Common Friends
• Statement: given a group of people on online social media (e.g., Facebook), each with a list of friends, use MapReduce to find the common friends of any two persons who are friends
• Question:
  – What are the Mapper and Reducer functions?
MapReduce Problems Discussion
• Problem 3: Find Common Friends
• Simple example:
  Input:
    A -> B,C,D
    B -> A,C,D
    C -> A,B
    D -> A,B
  (Figure: the friendship graph over A, B, C, D.)
  Output:
    (A,B) -> C,D
    (A,C) -> B
    (A,D) -> ..
    …
Mapper and Reducer of Common Friends
• Map(key, value){
    // key: person_id
    // value: the list of friends of the person
    for each friend f_id in value:
      Emit(<person_id, f_id>, value);}

• Reduce(key, list of values){
    // key: <friend pair>
    // list of values: the set of friend lists related to the friend pair
    for v1, v2 in values:
      common_friends = v1 intersects v2;
    Emit(key, common_friends);}
MapReduce Problems Discussion
• Problem 3: Find Common Friends
• Mapper and Reducer:
  – Mapper(friend list of a person) { for each person in the friend list: Emit(<friend pair>, <list of friends>) }
  – Reducer(output of map) { Emit(<friend pair>, intersection of the two friend lists of the pair) }
MapReduce Problems Discussion
• Problem 3: Find Common Friends
• Mapper and Reducer (worked example):
  Input:
    A -> B,C,D
    B -> A,C,D
    C -> A,B
    D -> A,B
  Map:
    (A,B) -> B,C,D
    (A,C) -> B,C,D
    (A,D) -> B,C,D
    (A,B) -> A,C,D
    (B,C) -> A,C,D
    (B,D) -> A,C,D
    (A,C) -> A,B
    (B,C) -> A,B
    (A,D) -> A,B
    (B,D) -> A,B
  Reduce:
    (A,B) -> C,D
    (A,C) -> B
    (A,D) -> B
    (B,C) -> A
    (B,D) -> A
  Suggest friends :)
Enjoy MapReduce and Hadoop :)