03 Intro HadoopAndMapReduce BigData
3
Analyze 10 billion web pages
Average size of a webpage: 20KB
Size of the collection: 10 billion x 20KB = 200TB
HDD read bandwidth: 150MB/sec
Time needed to read all web pages (without
analyzing them): about 1.3 million seconds =
more than 15 days
A single node architecture is not adequate
4
Analyze 10 billion web pages
Average size of a webpage: 20KB
Size of the collection: 10 billion x 20KB = 200TB
SSD read bandwidth: 550MB/sec
Time needed to read all web pages (without
analyzing them): about 0.36 million seconds =
more than 4 days
A single node architecture is not adequate
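A minimal Python sketch of the arithmetic on these two slides (the read bandwidths are the figures quoted above):

# Back-of-the-envelope check of the scan times quoted above.
pages = 10 * 10**9               # 10 billion web pages
page_size = 20 * 10**3           # 20 KB per page
collection = pages * page_size   # 2 * 10**14 bytes = 200 TB

for name, bandwidth in [("HDD", 150 * 10**6), ("SSD", 550 * 10**6)]:  # bytes/sec
    seconds = collection / bandwidth
    print(f"{name}: {seconds:,.0f} s = {seconds / 86400:.1f} days")
# HDD: ~1,333,333 s (about 15.4 days); SSD: ~363,636 s (about 4.2 days)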
5
Failures are part of everyday life, especially in
data centers
A single server stays up for 3 years (~1000 days)
▪ 10 servers → 1 failure every 100 days (~3 months)
▪ 100 servers → 1 failure every 10 days
▪ 1000 servers → 1 failure per day
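A quick sketch of the failure-rate arithmetic above, assuming servers fail independently:

# Expected failure frequency, assuming each server fails independently
# about once every 1000 days (~3 years of uptime, as above).
mtbf_single_server = 1000  # days
for n_servers in (10, 100, 1000):
    print(f"{n_servers} servers -> one failure every {mtbf_single_server / n_servers:g} days")
# 10 -> every 100 days, 100 -> every 10 days, 1000 -> every day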
Sources of failures
Hardware/Software
Electrical, Cooling, ...
Unavailability of a resource due to overload
6
LANL data [DSN 2006]
Data for 5000 machines, over 9 years
Hardware failures: 60%, Software: 20%, Network: 5%
DRAM error analysis [Sigmetrics 2009]
Data for 2.5 years
8% of DIMMs affected by errors
Disk drive failure analysis [FAST 2007]
Utilization and temperature major causes of
failures
7
Failure types
Permanent
▪ E.g., Broken motherboard
Transient
▪ E.g., Unavailability of a resource due to overload
9
Network becomes the bottleneck if large amounts
of data need to be exchanged between
nodes/servers
Network bandwidth (in a data center): 10Gbps
Moving 10 TB from one server to another takes more
than 2 hours
Data should be moved across nodes only when it
is indispensable
Usually, programs are small (a few MBs)
Move code (programs) and computation to data
Data locality
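A back-of-the-envelope sketch of the transfer time quoted above:

# Time to move 10 TB between two servers over a 10 Gbps link.
data_bits = 10 * 10**12 * 8      # 10 TB expressed in bits
bandwidth = 10 * 10**9           # 10 Gbps
seconds = data_bits / bandwidth
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")   # 8000 s, ~2.2 hours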
10
[Figure: a single server (single node) with CPU, memory, and disk. Machine learning and statistics typically run when the data fits in main memory; "classical" data mining when the data fits on the local disk.]
14
[Figure: cluster architecture. Nodes are grouped in racks, each with its own switch; there is about 1 Gbps of bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks.]
18
Vertical scalability (scale up)
Add more power/resources (main memory, CPUs)
to a single node (high-performing server)
▪ Cost of super-computers is not linear with respect to
their resources
Horizontal scalability (scale out)
Add more nodes (commodity servers) to a system
▪ The cost scales approximately linearly with respect to
the number of added nodes
▪ But data center efficiency is a difficult problem to solve
19
For data-intensive workloads, a large number of
commodity servers is preferred over a small
number of high-performing servers
At the same cost, we can deploy a system that
processes data more efficiently and is more fault-
tolerant
Horizontal scalability (scale out) is preferred for
big data applications
But distributed computing is hard
New systems that hide the complexity of the distributed
part of the problem from developers are needed
20
Distributed programming is hard
Problem decomposition and parallelization
Task synchronization
Task scheduling of distributed applications is
critical
Assign tasks to nodes by trying to
▪ Speed up the execution of the application
▪ Exploit (almost) all the available resources
▪ Reduce the impact of node failures
21
Distributed data storage
How do we store data persistently on disk and
keep it available if nodes can fail?
▪ Redundancy is the solution, but it increases the
complexity of the system
Network bottleneck
Reduce the amount of data sent through the
network
▪ Move computation and code to data
22
Distributed computing is not a new topic
HPC (High-performance computing) ~1960
Grid computing ~1990
Distributed databases ~1990
Hence, many solutions to the mentioned
challenges are already available
But we are now facing big-data-driven
problems
The former solutions are not adequate to address
big data volumes
23
Typical Big Data Problem
Iterate over a large number of records/objects
Extract something of interest from each record/object
Aggregate intermediate results
Generate final output
The challenges:
Parallelization
Distributed storage of large data sets (Terabytes,
Petabytes)
Node failure management
Network bottleneck
Diverse input format (data diversity & heterogeneity)
24
Scalable fault-tolerant distributed system for
Big Data
Distributed Data Storage
Distributed Data Processing
Borrowed concepts/ideas from the systems
designed at Google (Google File System and
Google’s MapReduce)
Open source project under the Apache license
▪ But there are also many commercial distributions
(e.g., Cloudera, Hortonworks, MapR)
26
Dec 2004 – Google published the MapReduce paper
(the GFS paper had been published in 2003)
July 2005 – Nutch uses MapReduce
Feb 2006 – Hadoop becomes a Lucene
subproject
Apr 2007 – Yahoo! runs it on a 1000-node cluster
Jan 2008 – Hadoop becomes an Apache Top
Level Project
Jul 2008 – Hadoop is tested on a 4000-node
cluster
27
Feb 2009 – The Yahoo! Search Webmap is a
Hadoop application that runs on a Linux cluster
with more than 10,000 cores
June 2009 – Yahoo! made available the source
code of its production version of Hadoop
In 2010 Facebook claimed to have the largest
Hadoop cluster in the world, with 21 PB of
storage
On July 27, 2011 they announced that the data had
grown to 30 PB
28
Amazon
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
…..
29
Hadoop
Designed for data-intensive workloads
Usually, not for CPU-demanding/intensive tasks
HPC (High-performance computing)
A supercomputer with high computational
capacity
▪ Performance of a supercomputer is measured in
floating-point operations per second (FLOPS)
Designed for CPU-intensive tasks
Usually used to process “small” data sets
30
Core components of Hadoop:
Distributed Big Data Processing Infrastructure based
on the MapReduce programming paradigm
▪ Provides a high-level abstraction view
▪ Programmers do not need to care about task scheduling and
synchronization
▪ Fault-tolerant
▪ Node and task failures are automatically managed by the Hadoop
system
HDFS (Hadoop Distributed File System)
▪ High availability distributed storage
▪ Fault-tolerant
31
[Figure: a Hadoop cluster spanning several racks, each with its own switch, interconnected by higher-level switches.]
36
But an in-depth knowledge of the Hadoop
framework is important to develop efficient
applications
The design of the application must exploit data
locality and limit network usage/data sharing
37
HDFS
Standard Apache Hadoop distributed file system
Provides global file namespace
Stores data redundantly on multiple nodes to provide
persistence and availability
▪ Fault-tolerant file system
Typical usage pattern
Huge files (GB to TB)
Data is rarely updated
Reads and appends are common
▪ Usually, random read/write operations are not performed
38
Each file is split into “chunks/blocks” that are
spread across the servers
Each chunk is replicated on different servers
(usually there are 3 replicas per chunk)
▪ Ensures persistence and availability
▪ To increase persistence and availability, replicas are
stored in different racks, if possible
Typically each chunk is 64-128MB
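A small illustrative sketch (the file size is hypothetical) of how many blocks and how much raw storage a single file needs with 128 MB blocks and 3 replicas per block:

import math

# Hypothetical example: blocks and raw storage used by one file in HDFS.
file_size = 10 * 2**30        # a 10 GB file (hypothetical)
block_size = 128 * 2**20      # 128 MB block size
replication = 3               # 3 replicas per block

n_blocks = math.ceil(file_size / block_size)
raw_storage = file_size * replication
print(n_blocks, "blocks,", raw_storage / 2**30, "GB of raw storage")
# 80 blocks, 30.0 GB of raw storage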
39
50
[Figure: a cluster of three servers, each with CPU, memory, and disk. The toy input file contains the text “Toy example file for Hadoop. Hadoop running example.”]
51
[Figure: the input file is split into two chunks. The first server stores the chunk “Toy example file for”, the second server stores the chunk “Hadoop. Hadoop running example.”; the third server stores no chunk of this file.]
52
[Figure: each server computes the word counts of its own chunk. The first server emits <toy, 1>, <example, 1>, <file, 1>, <for, 1>; the second server emits <hadoop, 2>, <running, 1>, <example, 1>.]
53
The problem can be easily parallelized
1. Each server processes its chunk of data and
counts the number of times each word appears
in its own chunk
▪ Each server can execute its sub-task independently of
the other servers of the cluster: no synchronization is
needed in this phase
▪ The output generated from each chunk by each server
represents a partial result
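A minimal Python sketch of this first phase on the toy example (plain Python, not the Hadoop API; the chunk contents reproduce the figures):

from collections import Counter

# Phase 1 sketch: each server counts the words of its own chunk,
# independently of the other servers.
chunk_1 = "Toy example file for"
chunk_2 = "Hadoop. Hadoop running example."

def local_word_count(chunk):
    # Runs locally on the server that stores the chunk; no synchronization needed.
    return Counter(word.strip(".").lower() for word in chunk.split())

partial_1 = local_word_count(chunk_1)  # {'toy': 1, 'example': 1, 'file': 1, 'for': 1}
partial_2 = local_word_count(chunk_2)  # {'hadoop': 2, 'running': 1, 'example': 1}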
55
[Figure: the local (partial) word counts computed by the two servers are sent through the network to the third server.]
56
[Figure: the third server merges the partial results and computes the final word counts: <toy, 1>, <example, 2>, <file, 1>, <for, 1>, <hadoop, 2>, <running, 1>.]
57
2. Each server sends its local (partial) list of pairs
<word, number of occurrences in its chunk> to a
server that is in charge of aggregating all local
results and computing the global result
▪ The server in charge of computing the global result
needs to receive all the local (partial) results to compute
and emit the final list
A synchronization operation is needed in this phase
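A minimal sketch of this second phase, continuing the toy example (plain Python, not the Hadoop API; the two Counters reproduce the partial results shown above):

from collections import Counter

# Phase 2 sketch: a single server receives every local (partial) list and
# merges them into the global result. Waiting for all partial lists is the
# synchronization point of this phase.
partial_1 = Counter({"toy": 1, "example": 1, "file": 1, "for": 1})
partial_2 = Counter({"hadoop": 2, "running": 1, "example": 1})

final = Counter()
for partial in (partial_1, partial_2):
    final.update(partial)
print(dict(final))
# {'toy': 1, 'example': 2, 'file': 1, 'for': 1, 'hadoop': 2, 'running': 1}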
Case 2: File too large to fit in main memory
Suppose that
The file size is 100 GB and the number of distinct
words occurring in it is at most 1,000
The cluster has 101 servers
The file is spread across 100 servers and each of
these servers contains one (different) chunk of the
input file
▪ i.e., the file is optimally spread across 100 servers (each
server contains 1/100 of the file in its local hard drives)
Each server reads 1 GB of data from its local hard
drive (it reads one chunk from HDFS)
▪ A few seconds
Each local list consists of at most 1,000 pairs
(because the number of distinct words is 1,000)
▪ A few MBs
The maximum amount of data sent on the
network is 100 x size of local list (number of
servers x local list size)
▪ Some MBs
We can define scalability along two dimensions
In terms of data:
▪ Given twice the amount of data, the word count algorithm
takes approximately no more than twice as long to run
▪ Each server processes 2 x data => 2 x execution time to compute local
list
In terms of resources:
▪ Given twice the number of servers, the word count algorithm
takes approximately no more than half as long to run
▪ Each server processes ½ x data => ½ x execution time to compute
local list
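A toy model of these two scalability claims (the processing rate is a hypothetical value; only the ratios matter):

# Local processing time is proportional to (data size / number of servers).
def local_time(data_gb, n_servers, gb_per_second=0.1):   # rate is hypothetical
    return (data_gb / n_servers) / gb_per_second

base = local_time(100, 100)         # 10 s per server
print(local_time(200, 100) / base)  # 2.0 -> twice the data, twice the time
print(local_time(100, 200) / base)  # 0.5 -> twice the servers, half the time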
61
The time needed to send local results to the
node in charge of computing the final result
and the computation of the final result are
considered negligible in this running example
Frequently, this assumption is not true
It depends
▪ on the complexity of the problem
▪ on the ability of the developer to limit the amount of
data sent on the network
62
Scale “out”, not “up”
Increase the number of servers, rather than upgrading
the resources (CPU, memory) of the current ones
Move processing to data
The network has a limited bandwidth
Process data sequentially, avoid random access
Seek operations are expensive
Big data applications usually read and analyze all
input records/objects
▪ Random access is useless
63
Traditional distributed systems (e.g., HPC)
move data to computing nodes (servers)
This approach cannot be used to process TBs of
data
▪ The network bandwidth is limited
Hadoop moves code to data
Code (few KB) is copied and executed on the
servers where the chunks of data are stored
This approach is based on “data locality”
64
Hadoop/MapReduce is designed for
Batch processing involving (mostly) full scans of
the input data
Data-intensive applications
▪ Read and process the whole Web (e.g., PageRank
computation)
▪ Read and process the whole Social Graph (e.g.,
LinkPrediction, a.k.a. “friend suggestion”)
▪ Log analysis (e.g., Network traces, Smart-meter data, ..)
65
Hadoop/MapReduce is not the panacea for all
Big Data problems
66
67
The MapReduce programming paradigm is
based on the basic concepts of Functional
programming
MapReduce “implements” a subset of
functional programming
The programming model appears quite limited
and strict
▪ Everything is based on two “functions” with predefined
signatures
▪ Map and Reduce
68
Solving complex problems is difficult
However, there are several important
problems that can be adapted to MapReduce
Log analysis
PageRank computation
Social graph analysis
Sensor data analysis
Smart-city data analysis
Network capture analysis
69
MapReduce is based on two main “building
blocks”
Map and Reduce functions
Map function
It is applied over each element of an input data set
and emits a set of (key, value) pairs
Reduce function
It is applied over each set of (key, value) pairs
(emitted by the map function) with the same key and
emits a set of (key, value) pairs (the final result)
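A minimal sketch of how the two building blocks fit together (plain Python, not the Hadoop API; the driver function and its name are illustrative):

from itertools import groupby

# Apply a user-defined map function to every input element, group the emitted
# (key, value) pairs by key, then apply a user-defined reduce function to each group.
def run_mapreduce(data, map_fn, reduce_fn):
    pairs = [kv for element in data for kv in map_fn(element)]     # map phase
    pairs.sort(key=lambda kv: kv[0])                                # shuffle and sort
    result = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):         # reduce phase
        result.extend(reduce_fn(key, [value for _, value in group]))
    return result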
70
Input
A textual file (i.e., a list of words)
Problem
Count the number of times each distinct word
appears in the file
Output
A list of pairs <word, number of occurrences in the
input file>
71
The input textual file is considered as a list of
words L
72
L = [toy, example, toy, example, hadoop]

Map phase: a pair (w, +1) is emitted for each word w in L
Lm = [(toy, +1), (example, +1), (toy, +1), (example, +1), (hadoop, +1)]

Shuffle and sort phase: the pairs in Lm are grouped by key
(toy, [+1, +1]), (example, [+1, +1]), (hadoop, [+1])

Reduce phase: a function is applied on each group (sum of the values)
[(toy, 2), (example, 2), (hadoop, 1)]
78
The input textual file is considered as a list of
words L
A key-value pair (w, 1) is emitted for each
word w in L
i.e., the map function is
m(w) = (w, 1)
A new list of (key, value) pairs Lm is generated
79
The key-value pairs in Lm are aggregated by
key (i.e., by word w in our example)
One group Gw is generated for each word w
Each group Gw is a key-list pair (w, [list of values])
where [list of values] contains all the values of the
pairs associated with the word w
▪ i.e., [list of values] is a list of [1, 1, 1, …] in our example
▪ Given a group Gw, the number of ones [1, 1, 1, …] is equal
to the occurrences of word w in the input file
80
A key-value pair (w, sum(Gw.[list of values])) is
emitted for each group Gw
i.e., the reduce function is
r(Gw) = (w, sum(Gw.[list of values]) )
The list of emitted pairs is the result of the
word count problem
One pair (word w, num. of occurrences) for each
word in our running example
81
The Map phase can be viewed as a
transformation over each element of a data set
This transformation is a function m defined by
developers
m is invoked one time for each input element
Each invocation of m happens in isolation
▪ The application of m to each element of a data set can be
parallelized in a straightforward manner
82
The Reduce phase can be viewed as an
aggregate operation
The aggregate function is a function r defined by
developers
r is invoked one time for each distinct key and
aggregates all the values associated with it
Also the reduce phase can be performed in
parallel and in isolation
▪ Each group of key-value pairs with the same key can be
processed in isolation
83
The shuffle and sort phase is always the same
i.e., group the output of the map phase by key
It does not need to be defined by developers
It is already provided by the Hadoop system
84
Key-value pair is the basic data structure in
MapReduce
Keys and values can be: integers, floats, strings, …
They can also be (almost) arbitrary data structures
defined by the designer
Both input and output of a MapReduce
program are lists of key-value pairs
Note that also the input is a list of key-value pairs
85
The design of MapReduce involves
Imposing the key-value structure on the input and
output data sets
▪ E.g., for a collection of Web pages, input keys may be
URLs and values may be their HTML content
86
The map and reduce functions are formally
defined as follows:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
Since the input data set is a list of key-value
pairs, the argument of the map function is a
key-value pair
map(key, value):
    // key: offset of the word in the file
    // value: a word of the input document
    emit(value, 1)

reduce(key, values):
    // key: a word; values: a list of integers
    occurrences = 0
    for each c in values:
        occurrences = occurrences + c
    emit(key, occurrences)
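A runnable sketch of the pseudocode above (plain Python, not the Hadoop API):

from itertools import groupby

def map_fn(key, value):
    # key: offset of the word in the file, value: the word itself
    yield (value, 1)

def reduce_fn(key, values):
    # key: a word; values: the list of 1s emitted for that word
    yield (key, sum(values))

words = ["toy", "example", "toy", "example", "hadoop"]

pairs = [kv for offset, word in enumerate(words) for kv in map_fn(offset, word)]
pairs.sort(key=lambda kv: kv[0])                       # shuffle and sort by key
result = [kv for key, group in groupby(pairs, key=lambda kv: kv[0])
             for kv in reduce_fn(key, [v for _, v in group])]
print(result)   # [('example', 2), ('hadoop', 1), ('toy', 2)]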
91