Week 4 - Hadoop Ecosystem

Hadoop Ecosystem
Chapter 2.1

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

The Hadoop Ecosystem

Chapter Topics
- Introduction
- Data Storage: HBase
- Data Integration: Flume and Sqoop
- Data Processing: Spark
- Data Analysis: Hive, Pig, and Impala
- Workflow Engine: Oozie
- Machine Learning: Mahout

The Hadoop Ecosystem (1)
[Diagram labels: Hadoop Ecosystem; HBase; Flume; Oozie; CDH]

The Hadoop Ecosystem (2)
[Diagram labels: Hadoop Ecosystem; HBase; Flume; Oozie]

The Hadoop Ecosystem (3)

HBase

HBase vs Traditional RDBMSs
              RDBMS          HBase
Data layout   Row-oriented   Column-oriented

When To Use HBase

Flume: Real-time Data Import
- What is Flume?
  - A service to move large amounts of data in real time
  - Example: storing log files in HDFS
- Flume is
  - Distributed
  - Reliable and available
  - Horizontally scalable
  - Extensible

Flume: High-Level Overview
[Diagram labels: Agent(s); Write in parallel; Scalable throughput]

Sqoop: Exchanging Data With RDBMSs
[Diagram labels: Sqoop; RDBMS; HDFS]

Sqoop Custom Connectors

Apache Spark

Spark vs Hadoop MapReduce
- MapReduce
  - Widely used, huge investment already made
  - Supports and supported by many complementary tools
  - Mature, well-tested
- Spark
  - Flexible
  - Elegant
  - Fast
  - Supports real-time streaming data processing
- Over time, Spark is expected to supplant MapReduce as the general processing framework used by most organizations

Hive and Pig: High Level Data Languages

Hive
- What is Hive?
  - HiveQL: An SQL-like interface to Hadoop

Pig
- What is Pig?
  - Pig Latin: A dataflow language for transforming large data sets

Hive vs. Pig
                      Hive                                      Pig
Language              HiveQL (SQL-like)                         Pig Latin (dataflow language)
Schema                Table definitions stored in a metastore   Schema optionally defined at runtime
Programmatic access   JDBC, ODBC                                PigServer (Java API)

Impala: High Performance Queries

Which to Choose?

Oozie
- Oozie
  - Workflow engine for MapReduce jobs
  - Defines dependencies between jobs
- The Oozie server submits the jobs to the server in the correct sequence
- We will investigate Oozie later in the course
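
The idea of submitting jobs "in the correct sequence" can be illustrated with a small dependency-ordering sketch. This is not Oozie's API (Oozie workflows are defined in XML); the job names and the `depends_on` mapping are hypothetical, and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
depends_on = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "sessionize": {"clean_logs"},
    "daily_report": {"sessionize", "clean_logs"},
}

def submission_order(dependencies):
    """Return one order in which every job comes after all of its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())

order = submission_order(depends_on)
print(order)
```

A workflow engine would submit each job only once the jobs before it in such an order have completed; a cycle in the dependencies makes `static_order()` raise `graphlib.CycleError`.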

Mahout

Key Points
- Hadoop Ecosystem
  - Many projects built on, and supporting, Hadoop
  - Several will be covered in detail later in the course

Managing Your Hadoop Solution
Chapter 2.2

A Typical Data Center With Hadoop

Technology Strengths and Weaknesses

Example Data Flow
[Diagram labels: Hadoop; HDFS; HBase; ETL; Orders; Web server logs; Enterprise Data Warehouse; Site Content/Recommendations; Recommendations; Sqoop (nightly); Sqoop; Flume (real time)]

Cluster Hardware
[Diagram labels: Master Nodes; Name Node; JobTracker or Resource Manager; Slave Nodes; Client]

Slave Nodes: Recommended Configurations (1)
- Processors
  - Mid-grade processors (e.g., 2 x 6-core 2.9 GHz)
- Memory
  - 48-96GB RAM
- Network
  - 1Gb Ethernet (mid-range)
  - 10Gb Ethernet (high-end)
- Disk Drives
  - 6 x 2TB drives per machine (mid-range)
  - 12 x 3TB drives per machine (high-end)
  - Non-RAID
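
The disk figures above translate into per-node storage with simple arithmetic. The sketch below is a rough estimate only: it assumes HDFS's default replication factor of 3 (not stated on the slide) and ignores space reserved for the operating system and intermediate data. The function names are illustrative.

```python
def raw_capacity_tb(drives, tb_per_drive):
    """Raw disk capacity of one slave node, in terabytes."""
    return drives * tb_per_drive

def usable_hdfs_tb(raw_tb, replication=3):
    """Approximate unique data one node's disks can hold when each block is stored `replication` times."""
    return raw_tb / replication

mid_range = raw_capacity_tb(6, 2)    # 6 x 2TB drives
high_end = raw_capacity_tb(12, 3)    # 12 x 3TB drives

print("mid-range:", mid_range, "TB raw,", usable_hdfs_tb(mid_range), "TB of unique data")
print("high-end:", high_end, "TB raw,", usable_hdfs_tb(high_end), "TB of unique data")
```

Under these assumptions a mid-range node contributes 12 TB raw (about 4 TB of unique data) and a high-end node 36 TB raw (about 12 TB of unique data).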

Slave Nodes: Recommended Configurations (2)
- Switch
  - Dedicated switching infrastructure required because Hadoop can saturate the network
  - All nodes talking to all nodes
- Cost
  - Per slave node cost should be around $4,000 to $10,000 (2014 estimate)

Master Nodes Are More Important

Capacity Planning (1)

Capacity Planning (2)

Introduction to MapReduce
Chapter 2.3

Chapter Topics
- Introduction to MapReduce
  - MapReduce Overview
  - Example: WordCount
  - Mappers
  - Reducers

Review - Features of MapReduce

Review Key MapReduce Stages
- The Mapper
  - Each Map task (typically) operates on a single HDFS block
  - Map tasks (usually) run on the node where the block is stored
- Shuffle and Sort
  - Sorts and consolidates intermediate data from all mappers
  - Happens after all Map tasks are complete and before Reduce tasks start
- The Reducer
  - Operates on shuffled/sorted intermediate data (Map task output)
  - Produces final output

The MapReduce Flow
[Diagram sequence: the MapReduce flow, built up over six slides; the only labels that survive are two Reducer boxes]

Example: Word Count
Input Data
    the cat sat on the mat
    the aardvark sat on the sofa
Result (after Map and Reduce)
    aardvark 1
    cat 1
    mat 1
    on 2
    sat 2
    sofa 1
    the 4
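
The word count above can be reproduced with a single-process sketch of the three stages. This is plain Python written for illustration, not the Hadoop API; the function names are assumptions, and the mapper's input key is just a line index that the mapper ignores.

```python
from collections import defaultdict

def map_words(key, line):
    """Mapper: emit (word, 1) for every word in one input line; the key is ignored."""
    for word in line.split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def sum_reduce(key, values):
    """Reducer: add up the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle/sort and reduce over a list of input lines."""
    intermediate = []
    for index, line in enumerate(lines):
        intermediate.extend(map_words(index, line))
    return [sum_reduce(key, values) for key, values in shuffle_and_sort(intermediate)]

lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]
for word, count in word_count(lines):
    print(word, count)
```

On a real cluster the map calls run in parallel on the nodes holding the data and the grouping happens across the network; the sketch only mirrors the data flow.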

Example: The WordCount Mapper (1)
Input Data (HDFS file)
    the cat sat on the mat
    the aardvark sat on the sofa
[Diagram labels: Mapper; Record Reader]

Example: The WordCount Mapper (2)
Input Data (HDFS file)
    the cat sat on the mat
    the aardvark sat on the sofa
Mapper (Record Reader, then map()) output
    the 1
    cat 1
    sat 1
    on 1
    the 1
    mat 1

Example: WordCount Shuffle and Sort
Node 1, Mapper output grouped by key
    aardvark 1
    cat 1
    mat 1
    on 1,1
    sat 1,1
    sofa 1
    the 1,1,1,1
Node 2, Mapper output grouped by key
    drove 1
    elephant 1
    mahout 1
    the 1,1
After shuffle and sort, input to the three Reducers
    aardvark 1; cat 1; mat 1
    elephant 1; mahout 1; sat 1,1
    drove 1; on 1,1; sofa 1; the 1,1,1,1,1,1
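
How intermediate pairs from several mappers end up at particular reducers can be sketched as follows. This is an illustrative simulation, not Hadoop code: the `partition` function uses a CRC32 hash modulo the number of reducers (Hadoop's default partitioner similarly hashes the key), so the reducer chosen for each word will generally differ from the assignment drawn on the slide. The second node's input line is a hypothetical sentence built from the words shown.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Choose a reducer for a key; a stable hash keeps the choice repeatable."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to the reducer picked by partition()."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order.
    return [sorted(groups.items()) for groups in reducers]

node1 = [(w, 1) for w in "the cat sat on the mat the aardvark sat on the sofa".split()]
node2 = [(w, 1) for w in "the mahout drove the elephant".split()]  # hypothetical input

for number, groups in enumerate(shuffle([node1, node2], 3)):
    print("Reducer", number, groups)
```

Whatever the assignment, every value for a given key (for example all six occurrences of "the") lands on the same reducer, which is the property the reduce step depends on.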

Example: SumReducer (1)
Reducer     Input                                      Final output file   Final output
Reducer 0   aardvark 1; cat 1; mat 1                   part-r-00000        aardvark 1; cat 1; mat 1
Reducer 1   elephant 1; mahout 1; sat 1,1              part-r-00001        elephant 1; mahout 1; sat 2
Reducer 2   drove 1; on 1,1; sofa 1; the 1,1,1,1,1,1   part-r-00002        drove 1; on 2; sofa 1; the 6

Example: SumReducer (2)
Reducer 2, final output written to HDFS file part-r-00002
    drove 1           -> reduce() -> drove 1
    on 1,1            -> reduce() -> on 2
    sofa 1            -> reduce() -> sofa 1
    the 1,1,1,1,1,1   -> reduce() -> the 6

MapReduce: The Mapper (1)
- The Mapper
  - Input: key/value pair
  - Output: A list of zero or more key value pairs

    map(in_key, in_value) -> (inter_key, inter_value) list

MapReduce: The Mapper (2)

Example Mapper: Upper Case Mapper
    let map(k, v) =
      emit(k.toUpper(), v.toUpper())
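
A runnable counterpart to the Upper Case Mapper pseudo-code might look like the sketch below. It is plain Python, not the Hadoop Mapper API: `yield` stands in for `emit`, and the sample input pair is made up for illustration.

```python
def upper_case_mapper(key, value):
    """Emit the input pair with both key and value converted to upper case."""
    yield key.upper(), value.upper()

print(list(upper_case_mapper("foo", "bar")))
```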

Example Mapper: Explode Mapper
    let map(k, v) =
      foreach char c in v:
        emit (k, c)

    pi 3.14 -> map() -> pi 3
                        pi .
                        pi 1
                        pi 4
    [A second example survives only as the fragment: 145 k]
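
The Explode Mapper pseudo-code can be tried out with a short sketch. It is plain Python rather than the Hadoop API, with `yield` standing in for `emit`, and it assumes the value is a string.

```python
def explode_mapper(key, value):
    """Emit one (key, character) pair for every character in the value."""
    for char in value:
        yield key, char

for pair in explode_mapper("pi", "3.14"):
    print(pair)
```

One input pair produces four output pairs here, showing that a mapper may emit more records than it reads.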

Example Mapper: Filter Mapper
- Only output key/value pairs where the input value is a prime number (pseudo-code):

    let map(k, v) =
      if (isPrime(v)) then emit(k, v)

    48 7    -> map() -> 48 7
    pi 3.14 -> map() -> (no output)
    5 12    -> map() -> (no output)
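
A runnable sketch of the Filter Mapper follows. The slide leaves `isPrime` undefined, so the trial-division `is_prime` helper here is an assumption added for illustration; the mapper itself is plain Python, with `yield` in place of `emit`.

```python
def is_prime(n):
    """True only for integers greater than 1 that have no divisor other than 1 and themselves."""
    if not isinstance(n, int) or n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True

def filter_mapper(key, value):
    """Emit the pair unchanged when the value is prime; otherwise emit nothing."""
    if is_prime(value):
        yield key, value

for key, value in [(48, 7), ("pi", 3.14), (5, 12)]:
    print((key, value), "->", list(filter_mapper(key, value)))
```

Only the pair (48, 7) passes the filter; the other two inputs produce an empty list, which is the "zero or more" case of mapper output.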

Example Mapper: Changing Keyspaces
- The key output by the Mapper does not need to be identical to the input key
- Example: output the word length as the key (pseudo-code):

    let map(k, v) =
      emit(v.length(), v)
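
The keyspace change can be seen concretely in a small sketch. This is plain Python, not the Hadoop API; the input key (a made-up record offset) and the sample word are hypothetical.

```python
def word_length_mapper(key, value):
    """Ignore the input key and emit the word's length as the new key."""
    yield len(value), value

print(list(word_length_mapper(0, "hadoop")))
```

After the shuffle, a reducer would then receive all words of the same length together, even though the input keys had nothing to do with length.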

Example Mapper: Identity Mapper
    let map(k, v) =
      emit(k,v)

Shuffle and Sort
- After the Map phase is over, all intermediate values for a given intermediate key are grouped together
- Each key and value list is passed to a Reducer
  - All values for a particular intermediate key go to the same Reducer
  - The intermediate keys/value lists are passed in sorted key order

    Intermediate pairs: gif 1231; jpg 3992; html 891; gif 3997; html 788; html 344
    One Reducer receives:     gif 1231, 3997; jpg 3992
    Another Reducer receives: html 344, 891, 788
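
The grouping step described above can be sketched with a sort followed by a group-by. This is an illustration in plain Python, not Hadoop's implementation; the order of the input pairs is an assumption, and Hadoop does not by default guarantee any particular order of values within one key.

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(pairs):
    """Sort pairs by key, then collect each key's values into one list."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort: value order within a key is kept
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

pairs = [("gif", 1231), ("jpg", 3992), ("html", 891),
         ("gif", 3997), ("html", 788), ("html", 344)]

for key, values in group_by_key(pairs):
    print(key, values)
```

Each resulting (key, value list) entry corresponds to one call of a reducer's reduce function, made in sorted key order.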

The Reducer
    gif 1231, 3997       -> reduce() -> gif 2614
    html 344, 891, 788   -> reduce() -> html 1498

Example Reducer: Sum Reducer
- Add up all the values associated with each intermediate key (pseudo-code):

    the 1, 1, 1, 1      -> reduce() -> the 4
    SKU0021 34, 8, 19   -> reduce() -> SKU0021 61
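
The pseudo-code for this slide did not survive extraction, so the following is only a plausible sketch of a sum reducer in plain Python (not the Hadoop Reducer API), with `yield` standing in for `emit`.

```python
def sum_reducer(key, values):
    """Emit the key together with the total of all its intermediate values."""
    yield key, sum(values)

print(list(sum_reducer("the", [1, 1, 1, 1])))
print(list(sum_reducer("SKU0021", [34, 8, 19])))
```

The two calls reproduce the slide's examples: a total of 4 for "the" and 61 for "SKU0021".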

Example Reducer: Average Reducer
- Find the mean of all the values associated with each intermediate key (pseudo-code):

    the 1, 1, 1, 1      -> reduce() -> the 1
    SKU0021 34, 8, 19   -> reduce() -> SKU0021 20.33
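
As with the sum reducer, the slide's pseudo-code is missing from the extracted text, so this is a hedged sketch in plain Python rather than the original. It assumes every key arrives with at least one value, which holds for keys produced by a shuffle.

```python
def average_reducer(key, values):
    """Emit the key together with the arithmetic mean of its intermediate values."""
    values = list(values)  # copy first: an iterator could only be consumed once
    yield key, sum(values) / len(values)

for key, mean in average_reducer("SKU0021", [34, 8, 19]):
    print(key, round(mean, 2))
```

For "SKU0021" the mean is 61 / 3, which rounds to the 20.33 shown on the slide.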

Example Reducer: Identity Reducer
    28 2, 2, 7 -> reduce() -> 28 2
                              28 2
                              28 7

Key Points

Hadoop Clusters
Chapter 2.4

Chapter Topics
- Hadoop Clusters

Installing A Hadoop Cluster (1)

Installing A Hadoop Cluster (2)
- Difficult
  - Download, install, and integrate individual Hadoop components directly from Apache
- Easier: CDH
  - Cloudera's Distribution for Apache Hadoop
  - Vanilla Hadoop plus many patches, backports, bug fixes
  - Includes many other components from the Hadoop ecosystem
- Easiest: Cloudera Manager
  - Wizard-based UI to install, configure and manage a Hadoop cluster
  - Included with Cloudera Standard (free) or Cloudera Enterprise

Hadoop Cluster Terminology
[Diagram labels: Slave Node (repeated)]

Hadoop Daemons: HDFS
- HDFS daemons
  - NameNode: holds the metadata for HDFS
    - Typically two on a production cluster: one active, one standby
  - DataNode: holds the actual HDFS data
    - One per slave node
[Diagram labels: Name Node (Active + Standby); DataNode (repeated)]

MapReduce v1 and v2 (1)

Hadoop Daemons: MapReduce v1
- MRv1 daemons
  - JobTracker: one per cluster
    - Manages MapReduce jobs, distributes individual tasks to TaskTrackers
  - TaskTracker: one per slave node
    - Starts and monitors individual Map and Reduce tasks
[Diagram labels: JobTracker (Active + Standby); TaskTracker (repeated)]

Basic Cluster Configuration: HDFS
[Diagram labels: Master Nodes; Name Node (Active + Standby); HDFS: manage data storage, hold metadata]

Basic Cluster Configuration: HDFS + MapReduce v1
[Diagram labels: Master Nodes; Name Node (Active + Standby); HDFS: manage data storage, hold metadata]

Hadoop Daemons: MapReduce v2
- MRv2 daemons
  - ResourceManager: one per cluster
    - Starts ApplicationMasters, allocates resources on slave nodes
  - ApplicationMaster: one per job
    - Requests resources, manages individual Map and Reduce tasks
  - NodeManager: one per slave node
    - Manages resources on individual slave nodes
  - JobHistory: one per cluster
    - Archives jobs' metrics and metadata
[Diagram labels: Resource Manager (Active + Standby); NodeManager (repeated); ApplicationMaster]

Basic Cluster Configuration: HDFS
[Diagram labels: Master Nodes; Name Node (Active + Standby); HDFS: manage data storage, hold metadata]
Note: This slide is the same as the earlier HDFS slide; there is no change in the HDFS design for YARN/MapReduce v2 because HDFS is the storage side of Hadoop. YARN/MapReduce v2 implement the compute side of Hadoop.

Basic Cluster Configuration: HDFS + MapReduce v2
[Diagram labels: Master Nodes; Name Node (Active + Standby); HDFS: manage data storage, hold metadata; MR Job]

Review - MapReduce Terminology

Submitting A Job
[Diagram labels: Client; Job; MR .jar; XML; Master Node]

A MapReduce v1 Cluster
[Diagram labels: Name Node(s); Job Tracker(s); Slave Nodes, each running a TaskTracker and a DataNode]

Running a Job on a MapReduce v1 Cluster (1)
    $ hadoop fs -put mydata
[Diagram labels: Name Node(s); Job Tracker(s); Slave Nodes, each running a TaskTracker and a DataNode; Block2]

Running a Job on a MapReduce v1 Cluster (2)
[Diagram labels: Client; HDFS: mydata; Block1; Block2; Map Task 1; Map Task 2; Reduce Task 1; Reduce Task 2; Name Node(s); Job Tracker(s); Slave Nodes, each running a TaskTracker and a DataNode]

A MapReduce v2 Cluster
[Diagram labels: Name Node(s); Resource Manager(s); Job History Server; Slave Nodes, each running a NodeManager and a DataNode]

Running a Job on a MapReduce v2 Cluster (1)
[Diagram labels: Client; HDFS: mydata; Block1; Block2; MapReduce Application Master; Name Node(s); Resource Manager(s); Job History Server; Slave Nodes, each running a NodeManager and a DataNode]

Running a Job on a MapReduce v2 Cluster (2)
[Diagram labels: Client; HDFS: mydata; Block1; Block2; Map Task 1; Map Task 2; Reduce Task 1; Reduce Task 2; MapReduce Application Master; Name Node(s); Resource Manager(s); Job History Server; Slave Nodes, each running a NodeManager and a DataNode]

Job Data: Mapper Data Locality

Job Data: Intermediate Data
- Map task intermediate data is stored on the local disk (not HDFS)
[Diagram labels: HDFS; Block2; Map Task 2; Intermediate Data]

Job Data: Shuffle and Sort
- Intermediate data is transferred across the network to the Reducers
[Diagram labels: HDFS; Block2; Reduce Task 1]

Is Shuffle and Sort a Bottleneck?

Is a Slow Mapper a Bottleneck?
- It is possible for one Map task to run more slowly than the others
  - Perhaps due to faulty hardware, or just a very slow machine
- It would appear that this would create a bottleneck
  - The reduce method in the Reducer cannot start until every Mapper has finished
- Hadoop uses speculative execution to mitigate against this
  - If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
    - A new task attempt for the same task
  - The results of the first Mapper to finish will be used
  - Hadoop will kill off the Mapper which is still running
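
The "first attempt to finish wins" rule can be illustrated with a toy simulation. This is not how Hadoop schedules tasks: the attempts are threads in one process, both start at the same time (Hadoop launches the second attempt only after noticing the first is slow), the durations are invented, and "killing" is modelled by a shared cancellation flag. All names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_attempt(name, seconds, cancelled):
    """Pretend to process a block; poll `cancelled` so a losing attempt stops early."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if cancelled.is_set():
            return None  # this attempt was killed
        time.sleep(0.01)
    return name

def run_with_speculation(original_seconds, speculative_seconds):
    """Run two attempts of the same task and return the name of the first to finish."""
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = [
            pool.submit(run_attempt, "original attempt", original_seconds, cancelled),
            pool.submit(run_attempt, "speculative attempt", speculative_seconds, cancelled),
        ]
        done, _ = wait(attempts, return_when=FIRST_COMPLETED)
        winner = next(iter(done)).result()
        cancelled.set()  # "kill" whichever attempt is still running
    return winner

print(run_with_speculation(original_seconds=1.0, speculative_seconds=0.1))
```

Because both attempts operate on the same data, either result is equally valid, so discarding the slower one costs only the duplicated work.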

Creating and Running a MapReduce Job
- Run the hadoop jar command to submit the job to the Hadoop cluster