CS621 Week 15
Objectives
Usage of MapReduce.
What is MapReduce?
• Popularized by the open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon
What is MapReduce used for?
At Google
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
At Facebook
• Data mining
• Ad optimization
• Spam detection
At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
MapReduce Usage in Research
• Astronomical image analysis (Washington)
• Ocean climate simulation (Washington)
• Bioinformatics (Maryland)
• Analyzing Wikipedia conflicts (PARC)
• Particle physics (Nebraska)
• Natural language processing (CMU)
How does MapReduce work?
MapReduce has three main phases (sketched below):
• Map
• Sort
• Reduce
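As a rough illustration only (not Hadoop's actual API), the three phases can be sketched in plain Python for word count; the names map_phase, sort_phase, and reduce_phase are made up for this sketch:

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word, 1)

def sort_phase(pairs):
    # Sort: order intermediate pairs by key so each word's counts become adjacent.
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    # Reduce: sum the counts of each distinct word.
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat"]
    intermediate = [pair for line in corpus for pair in map_phase(line)]
    for word, total in reduce_phase(sort_phase(intermediate)):
        print(word, total)   # cat 1, dog 1, sat 2, the 2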
MapReduce Overview
MapReduce: Examples
Objectives
The five processing stages, shown through a MapReduce example.
MapReduce Example (based on Three Phases)
The canonical MapReduce example: Word Count
• Example corpus:
Jane likes toast with jam
Joe likes toast
Joe burnt the toast
MapReduce: Map (Slow Motion)
MapReduce: Sort (Slow Motion)
MapReduce: Reduce (Slow Motion)
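For the toast corpus, the intermediate (key, value) pairs at each phase can be written out by hand and checked against the slow-motion slides; the Python literals below are just that trace (alphabetical ordering is one possible sort order):

map_output = [
    ("Jane", 1), ("likes", 1), ("toast", 1), ("with", 1), ("jam", 1),
    ("Joe", 1), ("likes", 1), ("toast", 1),
    ("Joe", 1), ("burnt", 1), ("the", 1), ("toast", 1),
]
# Sort brings identical keys together and groups their values:
sorted_output = [
    ("Jane", [1]), ("Joe", [1, 1]), ("burnt", [1]), ("jam", [1]),
    ("likes", [1, 1]), ("the", [1]), ("toast", [1, 1, 1]), ("with", [1]),
]
# Reduce sums each group:
reduce_output = [
    ("Jane", 1), ("Joe", 2), ("burnt", 1), ("jam", 1),
    ("likes", 2), ("the", 1), ("toast", 3), ("with", 1),
]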
MapReduce logical data flow in 5 processing stages over successive (key, value) pairs.
MapReduce logical data flow in 5 processing stages: Example
MapReduce: Actual Data and Computation
Data and Control Flow
The main responsibility of the MapReduce framework is to efficiently run a user's program on a distributed computing system. The data and control flow passes through the following steps:
• Partitioning
• Determining the master and workers
• Reading the input data (data distribution)
• Map function
• Combine function (sketched below)
• Communication
• Synchronization
• Sorting and grouping
• Reduce function
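One step in the flow above worth a concrete look is the combine function: a combiner runs a local reduce on each worker's map output before the communication (shuffle) step, so less data crosses the network. The sketch below is illustrative Python, not Hadoop code:

from collections import Counter

def combine(mapper_output):
    # Local pre-aggregation on one worker: many (word, 1) pairs collapse
    # into a single (word, partial_count) pair per distinct word.
    partial = Counter()
    for word, count in mapper_output:
        partial[word] += count
    return list(partial.items())

split_on_one_worker = ["Joe likes toast", "Joe burnt the toast"]
raw = [(word, 1) for line in split_on_one_worker for word in line.split()]
print(len(raw))            # 7 pairs would cross the network without a combiner
print(len(combine(raw)))   # 5 pairs after local combining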
MapReduce Design Goals
Cost-efficiency:
• Commodity machines (cheap, but unreliable)
• Commodity network
• Automatic fault-tolerance (fewer administrators)
• Easy to use (fewer programmers)
Hadoop
Objectives
Key functions of Hadoop.
What is Hadoop?
MapReduce:
• A simple programming model that enables parallel execution of data processing programs
• Executes the work on the data, near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.
Hadoop Principle
• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster (see the sketch below)
• The cluster can grow indefinitely simply by adding new nodes
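As a toy sketch of that block-scattering idea (this is not HDFS's real placement policy; the block size, replication factor, and node names below are assumptions):

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size):
    # Split the file into fixed-size blocks (ceiling division).
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    placement = {}
    for b in range(num_blocks):
        # Naive round-robin placement; real HDFS also considers racks and free space.
        placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

# A 1 GB file becomes 8 blocks, each stored on 3 different nodes.
for block, nodes in place_blocks(1024 * 1024 * 1024).items():
    print(block, nodes)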
Hadoop Components
• MapReduce
• HDFS
Hadoop and MapReduce
Not good for:
• Replacing an RDBMS
• Complex processing requiring various phases and/or iterations
• Processing small to medium size data
GFS: Google File System
Objectives
GFS working process.
GFS: Google File System
Workload:
• Modest number of large files
• Large streaming reads + small random reads
• Many large sequential writes
Need:
• Semantics for concurrent appends
• High sustained bandwidth (more important than low latency)
• Must monitor & recover from component failures
GFS: Interface
Novel operations:
• Snapshot (low cost)
• Record append (atomicity with multiple concurrent writes)
GFS: Architecture
GFS: Architecture details
Objectives
GFS implementation.
GFS Architecture: Master
• Stores all metadata
• Namespace
• Access-control information
• Chunk locations
• ‘Lease’ management
• Heartbeats
• Having one master gives global knowledge
• Allows better placement / replication
• Simplifies design
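A rough Python sketch of the kinds of metadata the master keeps in memory; the class and field names are illustrative, not taken from GFS itself:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkInfo:
    chunk_handle: int
    replica_locations: List[str]      # chunk servers holding a replica
    lease_holder: str = ""            # chunk server currently granted the lease

@dataclass
class FileMetadata:
    path: str
    owner: str                        # access-control information (simplified)
    chunks: List[ChunkInfo] = field(default_factory=list)

# The whole namespace lives in the master's memory as a simple mapping.
namespace: Dict[str, FileMetadata] = {}
namespace["/logs/web-00001"] = FileMetadata(
    path="/logs/web-00001",
    owner="crawler",
    chunks=[ChunkInfo(chunk_handle=42, replica_locations=["cs1", "cs7", "cs9"])],
)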
GFS Architecture: Chunk Servers
Client read path (sketched below):
• Contact the single master
• Obtain chunk locations
• Contact one of the chunk servers
• Obtain data
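The read path above can be sketched with hypothetical in-memory stubs standing in for the master and a chunk server (this is not a real GFS client API):

class ChunkServer:
    def __init__(self, name, chunks):
        self.name, self.chunks = name, chunks     # chunk_handle -> bytes
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Master:
    def __init__(self, table):
        self.table = table                        # (path, chunk_index) -> (handle, [servers])
    def lookup(self, path, chunk_index):
        return self.table[(path, chunk_index)]

cs1 = ChunkServer("cs1", {42: b"hello gfs"})
master = Master({("/logs/web-00001", 0): (42, [cs1])})

# Steps 1-2: contact the single master, obtain the chunk handle and replica locations.
handle, servers = master.lookup("/logs/web-00001", 0)
# Steps 3-4: contact one of the chunk servers and obtain the data directly;
# the master stays out of the data path.
print(servers[0].read(handle, 0, 5))              # b'hello'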
GFS Architecture: Master -> Metadata
• Stored in memory
Mutation Order
• Identical replicas
• A file region may end up containing mingled fragments from different clients (consistent but undefined)
GFS: Limitations
• Only viable in a specific environment
• Custom designed
• Limited security
HDFS: Hadoop Distributed File System
Objectives
HDFS Blocks and Nodes.
HDFS: Background
Doug Cutting and Yahoo! reverse engineered the GFS and called it Hadoop
Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
Highly fault-tolerant
High throughput
Traditional concept of the library card catalogs
Structural Metadata
Focuses on the data structure’s design and specification
Descriptive Metadata
The namenode keeps track of which datanodes hold the blocks of each distributed file.
Without the namenode, the filesystem cannot be used.
If the computer running the namenode malfunctions, the files cannot be reconstructed from the blocks on the datanodes.
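A minimal sketch of the namenode's two in-memory mappings and why losing them leaves the blocks on the datanodes unrecoverable as files (paths, block IDs, and node names are made up):

namenode_block_map = {
    "/user/cs621/corpus.txt": ["blk_001", "blk_002", "blk_003"],
}
block_locations = {
    "blk_001": ["datanode1", "datanode3"],
    "blk_002": ["datanode2", "datanode3"],
    "blk_003": ["datanode1", "datanode2"],
}

def read_file(path):
    # Without these two mappings on the namenode, the raw blocks scattered
    # across the datanodes cannot be reassembled into files.
    return [(blk, block_locations[blk]) for blk in namenode_block_map[path]]

print(read_file("/user/cs621/corpus.txt"))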