CSE545 Sp23 (3) Hadoop MapReduce 2-13
H. Andrew Schwartz
CSE545
Spring 2023
(freesvg.org/1534373472)
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.
[Figure: course overview diagram; recoverable labels: Workflow Systems, Algorithms, Distributed, Statistical, Tools, Methods]
[Figure: "Big Data Analytics, The Class" diagram repeated with one added label: Data]
Classical Data Analytics
CPU
Memory
(64 GB)
Disk
IO Bounded
Reading a word from disk versus main memory: 10^5 times slower!
Reading many contiguously stored words
is faster per word, but fast modern disks
still only reach ~1GB/s for sequential reads.
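A quick back-of-the-envelope check of what ~1 GB/s means in practice. This is a minimal sketch: the 1 TB dataset size and the 100-disk figure are illustrative assumptions, not numbers from the slides.

```python
def sequential_read_seconds(num_bytes, bytes_per_second=1e9):
    """Time to stream num_bytes at a given sequential throughput (~1 GB/s by default)."""
    return num_bytes / bytes_per_second

one_terabyte = 1e12  # hypothetical dataset size
one_disk = sequential_read_seconds(one_terabyte)

# Assumption: the data is spread evenly over 100 disks that are read in parallel.
hundred_disks = one_disk / 100

print(f"one disk: {one_disk:.0f} s, 100 disks in parallel: {hundred_disks:.0f} s")
```

Under these assumptions a single pass over the data takes minutes on one disk, which is part of the motivation for spreading data across many machines.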
Classical focus: efficient use of disk.
e.g. Apache Lucene / Solr
[Figure: cluster of racks. Rack 1, Rack 2, ... each have a switch (~1Gbps); the rack switches connect to a higher-level switch (~10Gbps).]
1. Nodes fail
1 in 1000 nodes fail a day
2. Network is a bottleneck
Typically 1-10 Gb/s throughput
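To see why "1 in 1000 nodes fail a day" matters at cluster scale, here is a rough sketch. It assumes node failures are independent, which is a simplification, and the cluster sizes are illustrative.

```python
def expected_failures_per_day(n_nodes, p_fail=1 / 1000):
    """Expected number of failed nodes per day."""
    return n_nodes * p_fail

def prob_any_failure(n_nodes, p_fail=1 / 1000):
    """Probability that at least one node fails in a day (assumes independent failures)."""
    return 1 - (1 - p_fail) ** n_nodes

for n in (10, 1000, 10000):  # hypothetical cluster sizes
    print(n, expected_failures_per_day(n), round(prob_any_failure(n), 3))
```

With a thousand nodes, a failure on any given day is more likely than not, so the system has to treat failure as routine.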
Distributed Filesystem
(e.g. Apache HadoopDFS, GoogleFS, EMRFS)
C, D: Two different files
“Hadoop” was named after a toy elephant belonging to Doug Cutting’s son. Cutting was one of Hadoop’s creators.
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Each file is split into chunks:
C: C0, C1, C2, C3, C4, C5
D: D0, D1, D2, D3, D4, D5
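The chunking idea can be sketched as follows. The tiny chunk size, the node names, and the round-robin placement are assumptions chosen for illustration; production systems use much larger chunks and also replicate them, which this sketch omits.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Cut a file's bytes into consecutive fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunks(file_name, chunks, nodes):
    """Round-robin placement (an assumption): chunk i goes to node i mod len(nodes)."""
    return {f"{file_name}{i}": nodes[i % len(nodes)] for i in range(len(chunks))}

file_c = b"x" * 60                                  # stand-in for file C
chunks = split_into_chunks(file_c, chunk_size=10)   # six chunks: C0 .. C5
nodes = ["node1", "node2", "node3"]                 # hypothetical node names
placement = place_chunks("C", chunks, nodes)

for chunk_name, node in placement.items():
    print(chunk_name, "->", node)
```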
What is MapReduce
noun.1 - A style of programming
input chunks => map tasks | group_by keys | reduce tasks => output
“|” is the Linux “pipe” symbol: passes stdout from first process to stdin of next.
What is MapReduce
Easy as 1, 2, 3!
Step 1: Map. Extract what you care about.
Step 2: Sort / Group by. Sort and shuffle (system handles).
Step 3: Reduce. Aggregate, summarize.
Group by key: (k1’, v1’), (k2’, v2’), ... -> (k1’, (v1’, v’, …)), (k2’, (v1’, v’, …)), …
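The group-by-key step can be imitated on a single machine with a sort followed by grouping. This is only a local sketch of what the system does during sort and shuffle; `group_by_key` is a name made up for this example.

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(pairs):
    """Sort (key, value) pairs by key, then collect each key's values into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(ordered, key=itemgetter(0))]

# Output of some map tasks: (key, value) pairs in arbitrary order.
mapped = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
grouped = group_by_key(mapped)
print(grouped)
```

Each reduce call then receives one key together with the full list of that key's values.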
Example: Word Count
Chunks
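from abc import abstractmethod

@abstractmethod
def map(k, v):
    pass  # written by the programmer: emits (key, value) pairs

@abstractmethod
def reduce(k, vs):
    pass  # written by the programmer: aggregates all values vs of one key k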
Example: Word Count (v1)
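def tokenize(text):  # assumed helper: simple whitespace tokenizer
    return text.split()

def map(k, v):
    for w in tokenize(v):
        yield (w, 1)  # emit one (word, 1) pair per token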
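For context, here is a minimal end-to-end word count that runs locally. Only the map side appears above; the summing reduce, the lowercasing tokenizer, and the names `map_fn`, `reduce_fn`, and `run_word_count` are assumptions made for this sketch, not the course's reference implementation.

```python
from collections import defaultdict

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()

def map_fn(k, v):
    """k: chunk id (unused), v: chunk text. Emits (word, 1) per token."""
    for w in tokenize(v):
        yield (w, 1)

def reduce_fn(k, vs):
    """k: a word, vs: all counts emitted for it. Emits (word, total)."""
    yield (k, sum(vs))

def run_word_count(chunks):
    """Local stand-in for the system: map each chunk, group by key, reduce each key."""
    grouped = defaultdict(list)
    for chunk_id, text in enumerate(chunks):
        for key, value in map_fn(chunk_id, text):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for word, total in reduce_fn(key, grouped[key]):
            results[word] = total
    return results

counts = run_word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```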
Example: Relational Algebra
Select
Project
Natural Join
Grouping
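A sketch of how some of these operators can be phrased as map and reduce functions, run with a small local driver. The relation, its columns, and every function name are hypothetical; natural join is left out to keep the example short.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Local stand-in for the system: map, group by key, then reduce in key order."""
    grouped = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            grouped[k].append(v)
    out = []
    for k in sorted(grouped):
        out.extend(reduce_fn(k, grouped[k]))
    return out

# Hypothetical relation: (name, dept, salary)
rows = [("ann", "cs", 120), ("bob", "math", 90), ("cy", "cs", 100), ("dee", "math", 90)]

# Select salary >= 100: map keeps matching tuples, reduce is the identity.
def select_map(row):
    if row[2] >= 100:
        yield (row, row)

def select_reduce(key, values):
    yield from values

# Project onto dept: map emits the projected value, reduce removes duplicates.
def project_map(row):
    yield (row[1], row[1])

def project_reduce(key, values):
    yield key

# Grouping: total salary per dept.
def group_map(row):
    yield (row[1], row[2])

def group_reduce(dept, salaries):
    yield (dept, sum(salaries))

selected = run_mapreduce(rows, select_map, select_reduce)
projected = run_mapreduce(rows, project_map, project_reduce)
totals = run_mapreduce(rows, group_map, group_reduce)
print(selected, projected, totals)
```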
[Figure: MapReduce data-flow diagram; surviving labels: hash, Programmer]
Handled with:
Answer (M: number of map tasks, R: number of reduce tasks):
1) If possible, one chunk per map task
(maximizes flexibility for scheduling)
2) M >> |nodes| ≈ |cores|
(better handling of node failures, better load balancing)
3) R <= M
(reduces number of parts stored in DFS)
[Figure: several nodes, each with its own Mem and Disk]
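A small sketch of these sizing rules together with hash partitioning of keys across reduce tasks. The chunk, node, and task counts are made-up numbers, and `zlib.crc32` is just one convenient stable hash; this is not how any particular system picks its partitioner.

```python
import zlib

def reduce_task_for(key: str, num_reduce_tasks: int) -> int:
    """Stable hash partitioning: the same key always lands on the same reduce task."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

num_chunks = 40   # hypothetical input size in chunks
num_nodes = 5     # hypothetical cluster size
M = num_chunks    # rule 1: one map task per chunk
R = 10            # rule 3: R <= M (here still more than the number of nodes)

assert M > num_nodes and R <= M

# Route a few keys to reduce tasks; repeated keys share a partition.
partitions = {}
for key in ["apple", "banana", "cherry", "apple"]:
    partitions.setdefault(reduce_task_for(key, R), []).append(key)

print(partitions)
```

Because the partition depends only on the key, every map task sends a given key to the same reduce task without any coordination.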
Data Flow Tasks (Map Task or Reduce Task)
version 1: few reduce tasks (same number of reduce tasks as nodes)
version 2: more reduce tasks (more reduce tasks than nodes)
[Figure: per-node timelines (node1 to node5) along a time axis for each version; tasks represented by time to complete task (some tasks take much longer)]
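The effect of the two versions can be illustrated with a toy scheduler. The task durations are invented, and the sketch assumes the slow task's work spans several keys so it can be split into smaller reduce tasks; a single very heavy key would not split this way.

```python
import heapq

def makespan(task_times, n_nodes):
    """Greedy scheduling: each task runs on whichever node becomes free first."""
    free_at = [0.0] * n_nodes
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)       # earliest-available node
        heapq.heappush(free_at, start + t)   # that node is busy until start + t
    return max(free_at)

# version 1: 5 reduce tasks on 5 nodes; one task takes much longer (hypothetical times)
v1_tasks = [10, 10, 10, 10, 30]

# version 2: the same total work (70) split into 20 smaller tasks
v2_tasks = [2.5] * 16 + [7.5] * 4

v1_time = makespan(v1_tasks, n_nodes=5)
v2_time = makespan(v2_tasks, n_nodes=5)
print(v1_time, v2_time)
```

In version 1 four nodes sit idle while the slow task finishes; with more, smaller tasks the idle nodes pick up the remaining work.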