Large-Scale Data Management: CS525: Special Topics in DBs
Large-Scale Data Management
Hadoop/MapReduce Computing Paradigm
Spring 2013
WPI, Mohamed Eltabakh
Large-Scale Data Analytics

MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems

Points of comparison:
- Scalability (petabytes of data, thousands of machines)
- Performance (tons of indexing, tuning, and data organization techniques)
- Features such as provenance tracking and annotation management
What is Hadoop
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Large datasets: terabytes or petabytes of data
- Large clusters: hundreds or thousands of nodes
What is Hadoop (Cont'd)
The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave shared-nothing architecture
Design Principles of Hadoop
- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem
Design Principles of Hadoop (Cont'd)
- Automatic parallelization & distribution
- Hidden from the end-user
Hadoop Architecture
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Centralized namenode
- Maintains metadata info about files
(Diagram: a file F is stored as blocks distributed across the datanodes)
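To make the namenode's metadata role concrete, here is a minimal sketch (in Java, against the standard org.apache.hadoop.fs.FileSystem API) that asks HDFS which datanodes hold the blocks of a file; the path and class name are illustrative, not part of the lecture.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // client talks to the namenode
    Path file = new Path("/user/demo/fileF.txt");    // hypothetical file F

    // The namenode's metadata: which datanodes hold each block of the file
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + b.getOffset()
          + ", length " + b.getLength()
          + ", hosts " + Arrays.toString(b.getHosts()));
    }
  }
}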
Main Properties of
HDFS
Large: A HDFS instance may consist of thousands of
server machines, each storing part of the file systems data
Replication: Each data block is replicated many times
(default is 3)
Failure: Failure is the norm rather than exception
Fault Tolerance: Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
Namenode is consistently checking Datanodes
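As a small illustration of the replication property, the sketch below writes a file to HDFS with an explicit replication factor of 3 (the default); the file name and class name are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");                // each block replicated 3 times (the default)

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");    // hypothetical output path

    // Write a small file; HDFS splits larger files into blocks and replicates each block
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }
  }
}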
(Diagram: MapReduce dataflow. Each Map task parses its input and produces (k, v) pairs, e.g., (word, 1), hashing each key to decide its destination reducer. The pairs are shuffled and sorted on k. Each Reduce task consumes (k, [v]) pairs, e.g., (word, [1,1,1,1,1,1,...]), and produces (k, v) pairs, e.g., (word, 100).)
Properties of MapReduce Engine
Job Tracker is the master node (runs with the namenode)
- Receives the user's job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality)
Properties of MapReduce Engine (Cont'd)
Task Tracker is the slave node (runs on each datanode)
- Receives the task from the Job Tracker
- Runs the task until completion (either a map or a reduce task)
- Always in communication with the Job Tracker, reporting progress
Key-Value Pairs
Mappers and Reducers are the users' code (provided functions)
- They just need to obey the key-value pairs interface
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>>
- Produce <key, value>
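As a concrete sketch of this interface, here is the classic word-count mapper and reducer written against Hadoop's org.apache.hadoop.mapreduce API (the "new" API); class names are chosen for the example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: consumes <byte offset, line of text>, produces <word, 1> pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);                     // emit (word, 1)
    }
  }
}

// Reducer: consumes <word, [1,1,1,...]>, produces <word, total count>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));        // emit (word, sum)
  }
}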
MapReduce Phases
Example 1: Word Count
(Diagram: the word count job runs as a set of Map Tasks feeding a set of Reduce Tasks)
(Diagram: word-count dataflow. Each Map task parses its input and produces (word, 1) pairs; the pairs are shuffled and sorted on the key; each Reduce task consumes (word, [1,1,1,1,1,1,...]), produces (word, count) pairs such as (word, 100), and writes its output to one file: Part0001, Part0002, Part0003.)
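A job like the word count above is tied together by a small driver program; the following is a minimal sketch using the standard org.apache.hadoop.mapreduce.Job API (Hadoop 2.x style; input/output paths come from the command line, and the three reduce tasks match the three Part files in the figure).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setNumReduceTasks(3);                                  // one output part file per reducer

    FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}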
(Diagram: each Map task writes its output directly to HDFS as a separate part file: Part0001, Part0002, Part0003, Part0004.)
Hadoop vs. Traditional Database Systems

Computing model:
- Database systems: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Cost model:
- Database systems: expensive servers
- Hadoop: cheap commodity machines

Key characteristics of database systems: efficiency, optimizations, fine tuning

The comparison also covers the data model and fault tolerance.

Cloud Computing

A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user's demand
- Example: Amazon EC2