03 Intro HadoopAndMapReduce BigData
3
Analyze 10 billion web pages
Average size of a webpage: 20KB
Size of the collection: 10 billion x 20KB = 200TB
HDD read bandwidth: 150MB/sec
Time needed to read all web pages (without
analyzing them): about 1.3 million seconds =
more than 15 days
A single node architecture is not adequate
4
Analyze 10 billion web pages
Average size of a webpage: 20KB
Size of the collection: 10 billion x 20KB = 200TB
SSD read bandwidth: 550MB/sec
Time needed to read all web pages (without
analyzing them): about 0.36 million seconds =
more than 4 days
A single node architecture is not adequate
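A minimal Python sketch of the arithmetic on these two slides (the read bandwidths are the figures quoted above):

# Back-of-the-envelope check of the scan times quoted above.
pages = 10 * 10**9               # 10 billion web pages
page_size = 20 * 10**3           # 20 KB per page
collection = pages * page_size   # 2 * 10**14 bytes = 200 TB

for name, bandwidth in [("HDD", 150 * 10**6), ("SSD", 550 * 10**6)]:  # bytes/sec
    seconds = collection / bandwidth
    print(f"{name}: {seconds:,.0f} s = {seconds / 86400:.1f} days")
# HDD: ~1,333,333 s (about 15.4 days); SSD: ~363,636 s (about 4.2 days)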
5
Failures are part of everyday life, especially in
data centers
A single server stays up for 3 years (~1000 days)
▪ 10 servers → 1 failure every 100 days (~3 months)
▪ 100 servers → 1 failure every 10 days
▪ 1000 servers → 1 failure per day
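A quick sketch of the failure-rate arithmetic above, assuming servers fail independently:

# Expected failure frequency, assuming each server fails independently
# about once every 1000 days (~3 years of uptime, as above).
mtbf_single_server = 1000  # days
for n_servers in (10, 100, 1000):
    print(f"{n_servers} servers -> one failure every {mtbf_single_server / n_servers:g} days")
# 10 -> every 100 days, 100 -> every 10 days, 1000 -> every day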
Sources of failures
Hardware/Software
Electrical, Cooling, ...
Unavailability of a resource due to overload
6
LANL data [DSN 2006]
Data for 5000 machines, over 9 years
Hardware failures: 60%, Software: 20%, Network: 5%
DRAM error analysis [Sigmetrics 2009]
Data for 2.5 years
8% of DIMMs affected by errors
Disk drive failure analysis [FAST 2007]
Utilization and temperature major causes of
failures
7
Failure types
Permanent
▪ E.g., Broken motherboard
Transient
▪ E.g., Unavailability of a resource due to overload
9
Network becomes the bottleneck if large amounts
of data need to be exchanged between
nodes/servers
Network bandwidth (in a data center): 10Gbps
Moving 10 TB from one server to another takes more
than 2 hours
Data should be moved across nodes only when it
is indispensable
Usually, programs are small (a few MBs)
Move code (programs) and computation to data
Data locality
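A back-of-the-envelope sketch of the transfer time quoted above:

# Time to move 10 TB between two servers over a 10 Gbps link.
data_bits = 10 * 10**12 * 8      # 10 TB expressed in bits
bandwidth = 10 * 10**9           # 10 Gbps
seconds = data_bits / bandwidth
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")   # 8000 s, ~2.2 hours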
10
[Figure: a single server (single node) with CPU, memory, and disk. Machine learning and statistics typically run when the data fits in main memory; "classical" data mining when the data fits on the local disk.]
14
[Figure: cluster architecture. Nodes are grouped in racks, each with its own switch; there is about 1 Gbps of bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks.]
18
Vertical scalability (scale up)
Add more power/resources (main memory, CPUs)
to a single node (high-performing server)
▪ Cost of super-computers is not linear with respect to
their resources
Horizontal scalability (scale out)
Add more nodes (commodity servers) to a system
▪ The cost scales approximately linearly with respect to
the number of added nodes
▪ But data center efficiency is a difficult problem to solve
19
For data-intensive workloads, a large number of
commodity servers is preferred over a small
number of high-performing servers
At the same cost, we can deploy a system that
processes data more efficiently and is more fault-
tolerant
Horizontal scalability (scale out) is preferred for
big data applications
But distributed computing is hard
New systems that hide the complexity of the distributed
part of the problem from developers are needed
20
Distributed programming is hard
Problem decomposition and parallelization
Task synchronization
Task scheduling of distributed applications is
critical
Assign tasks to nodes by trying to
▪ Speed up the execution of the application
▪ Exploit (almost) all the available resources
▪ Reduce the impact of node failures
21
Distributed data storage
How do we store data persistently on disk and
keep it available if nodes can fail?
▪ Redundancy is the solution, but it increases the
complexity of the system
Network bottleneck
Reduce the amount of data sent through the
network
▪ Move computation and code to data
22
Distributed computing is not a new topic
HPC (High-performance computing) ~1960
Grid computing ~1990
Distributed databases ~1990
Hence, many solutions to the mentioned
challenges are already available
But we are now facing big-data-driven
problems
The former solutions are not adequate to address
big data volumes
23
Typical Big Data Problem
Iterate over a large number of records/objects
Extract something of interest from each record/object
Aggregate intermediate results
Generate final output
The challenges:
Parallelization
Distributed storage of large data sets (Terabytes,
Petabytes)
Node failure management
Network bottleneck
Diverse input format (data diversity & heterogeneity)
24
Scalable fault-tolerant distributed system for
Big Data
Distributed Data Storage
Distributed Data Processing
Borrowed concepts/ideas from the systems
designed at Google (Google File System and
Google’s MapReduce)
Open source project under the Apache license
▪ But there are also many commercial distributions
(e.g., Cloudera, Hortonworks, MapR)
26
Dec 2004 – Google published the MapReduce paper
(the GFS paper had been published in 2003)
July 2005 – Nutch uses MapReduce
Feb 2006 – Hadoop becomes a Lucene
subproject
Apr 2007 – Yahoo! runs it on a 1000-node cluster
Jan 2008 – Hadoop becomes an Apache Top
Level Project
Jul 2008 – Hadoop is tested on a 4000-node
cluster
27
Feb 2009 – The Yahoo! Search Webmap is a
Hadoop application that runs on a Linux cluster
with more than 10,000 cores
June 2009 – Yahoo! made available the source
code of its production version of Hadoop
In 2010 Facebook claimed to have the largest
Hadoop cluster in the world, with 21 PB of
storage
On July 27, 2011 they announced that the data had
grown to 30 PB
28
Amazon
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
…..
29
Hadoop
Designed for data-intensive workloads
Usually, not for CPU-demanding/intensive tasks
HPC (High-performance computing)
A supercomputer with high computational
capacity
▪ Performance of a supercomputer is measured in
floating-point operations per second (FLOPS)
Designed for CPU-intensive tasks
Usually used to process “small” data sets
30
Core components of Hadoop:
Distributed Big Data Processing Infrastructure based
on the MapReduce programming paradigm
▪ Provides a high-level abstraction view
▪ Programmers do not need to care about task scheduling and
synchronization
▪ Fault-tolerant
▪ Node and task failures are automatically managed by the Hadoop
system
HDFS (Hadoop Distributed File System)
▪ High availability distributed storage
▪ Fault-tolerant
31
[Figure: a Hadoop cluster spanning several racks, each with its own switch, interconnected by higher-level switches.]
36
But an in-depth knowledge of the Hadoop
framework is important to develop efficient
applications
The design of the application must exploit data
locality and limit network usage/data sharing
37
HDFS
Standard Apache Hadoop distributed file system
Provides global file namespace
Stores data redundantly on multiple nodes to provide
persistence and availability
▪ Fault-tolerant file system
Typical usage pattern
Huge files (GB to TB)
Data is rarely updated
Reads and appends are common
▪ Usually, random read/write operations are not performed
38
Each file is split into “chunks/blocks” that are
spread across the servers
Each chunk is replicated on different servers
(usually there are 3 replicas per chunk)
▪ Ensures persistence and availability
▪ To increase persistence and availability, replicas are
stored in different racks, if possible
Typically each chunk is 64-128MB
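A small illustrative sketch (the file size is hypothetical) of how many blocks and how much raw storage a single file needs with 128 MB blocks and 3 replicas per block:

import math

# Hypothetical example: blocks and raw storage used by one file in HDFS.
file_size = 10 * 2**30        # a 10 GB file (hypothetical)
block_size = 128 * 2**20      # 128 MB block size
replication = 3               # 3 replicas per block

n_blocks = math.ceil(file_size / block_size)
raw_storage = file_size * replication
print(n_blocks, "blocks,", raw_storage / 2**30, "GB of raw storage")
# 80 blocks, 30.0 GB of raw storage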
39
50
[Figure: a cluster of three servers, each with CPU, memory, and disk. The toy input file contains the text “Toy example file for Hadoop. Hadoop running example.”]
51
[Figure: the input file is split into two chunks. The first server stores the chunk “Toy example file for”, the second server stores the chunk “Hadoop. Hadoop running example.”; the third server stores no chunk of this file.]
52
[Figure: each server computes the word counts of its own chunk. The first server emits <toy, 1>, <example, 1>, <file, 1>, <for, 1>; the second server emits <hadoop, 2>, <running, 1>, <example, 1>.]
53
The problem can be easily parallelized
1. Each server processes its chunk of data and
counts the number of times each word appears
in its own chunk
▪ Each server can execute its sub-task independently of
the other servers of the cluster: no synchronization is
needed in this phase
▪ The output generated from each chunk by each server
represents a partial result
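A minimal Python sketch of this first phase on the toy example (plain Python, not the Hadoop API; the chunk contents reproduce the figures):

from collections import Counter

# Phase 1 sketch: each server counts the words of its own chunk,
# independently of the other servers.
chunk_1 = "Toy example file for"
chunk_2 = "Hadoop. Hadoop running example."

def local_word_count(chunk):
    # Runs locally on the server that stores the chunk; no synchronization needed.
    return Counter(word.strip(".").lower() for word in chunk.split())

partial_1 = local_word_count(chunk_1)  # {'toy': 1, 'example': 1, 'file': 1, 'for': 1}
partial_2 = local_word_count(chunk_2)  # {'hadoop': 2, 'running': 1, 'example': 1}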
55
[Figure: the local (partial) word counts computed by the two servers are sent through the network to the third server.]
56
[Figure: the third server merges the partial results and computes the final word counts: <toy, 1>, <example, 2>, <file, 1>, <for, 1>, <hadoop, 2>, <running, 1>.]
57
2. Each server sends its local (partial) list of pairs
<word, number of occurrences in its chunk> to a
server that is in charge of aggregating all local
results and computing the global result
▪ The server in charge of computing the global result
needs to receive all the local (partial) results to compute
and emit the final list
A synchronization operation is needed in this phase
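A minimal sketch of this second phase, continuing the toy example (plain Python, not the Hadoop API; the two Counters reproduce the partial results shown above):

from collections import Counter

# Phase 2 sketch: a single server receives every local (partial) list and
# merges them into the global result. Waiting for all partial lists is the
# synchronization point of this phase.
partial_1 = Counter({"toy": 1, "example": 1, "file": 1, "for": 1})
partial_2 = Counter({"hadoop": 2, "running": 1, "example": 1})

final = Counter()
for partial in (partial_1, partial_2):
    final.update(partial)
print(dict(final))
# {'toy': 1, 'example': 2, 'file': 1, 'for': 1, 'hadoop': 2, 'running': 1}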
Case 2: File too large to fit in main memory
Suppose that
The file size is 100 GB and the number of distinct
words occurring in it is at most 1,000
The cluster has 101 servers
The file is spread across 100 servers and each of
these servers contains one (different) chunk of the
input file
▪ i.e., the file is optimally spread across 100 servers (each
server contains 1/100 of the file in its local hard drives)
Each server reads 1 GB of data from its local hard
drive (it reads one chunk from HDFS)
▪ A few seconds
Each local list consists of at most 1,000 pairs
(because the number of distinct words is 1,000)
▪ A few MBs
The maximum amount of data sent on the
network is 100 x size of local list (number of
servers x local list size)
▪ Some MBs
We can define scalability along two dimensions
In terms of data:
▪ Given twice the amount of data, the word count algorithm
takes approximately no more than twice as long to run
▪ Each server processes 2 x data => 2 x execution time to compute local
list
In terms of resources:
▪ Given twice the number of servers, the word count algorithm
takes approximately no more than half as long to run
▪ Each server processes ½ x data => ½ x execution time to compute
local list
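A toy model of these two scalability claims (the processing rate is a hypothetical value; only the ratios matter):

# Local processing time is proportional to (data size / number of servers).
def local_time(data_gb, n_servers, gb_per_second=0.1):   # rate is hypothetical
    return (data_gb / n_servers) / gb_per_second

base = local_time(100, 100)         # 10 s per server
print(local_time(200, 100) / base)  # 2.0 -> twice the data, twice the time
print(local_time(100, 200) / base)  # 0.5 -> twice the servers, half the time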
61
The time needed to send local results to the
node in charge of computing the final result
and the computation of the final result are
considered negligible in this running example
Frequently, this assumption is not true
It depends
▪ on the complexity of the problem
▪ on the ability of the developer to limit the amount of
data sent on the network
62
Scale “out”, not “up”
Increase the number of servers, rather than upgrading
the resources (CPU, memory) of the current ones
Move processing to data
The network has a limited bandwidth
Process data sequentially, avoid random access
Seek operations are expensive
Big data applications usually read and analyze all
input records/objects
▪ Random access is useless
63
Traditional distributed systems (e.g., HPC)
move data to computing nodes (servers)
This approach cannot be used to process TBs of
data
▪ The network bandwidth is limited
Hadoop moves code to data
Code (few KB) is copied and executed on the
servers where the chunks of data are stored
This approach is based on “data locality”
64
Hadoop/MapReduce is designed for
Batch processing involving (mostly) full scans of
the input data
Data-intensive applications
▪ Read and process the whole Web (e.g., PageRank
computation)
▪ Read and process the whole Social Graph (e.g.,
LinkPrediction, a.k.a. “friend suggestion”)
▪ Log analysis (e.g., Network traces, Smart-meter data, ..)
65
Hadoop/MapReduce is not the panacea for all
Big Data problems
66
67
The MapReduce programming paradigm is
based on the basic concepts of Functional
programming
MapReduce “implements” a subset of
functional programming
The programming model appears quite limited
and strict
▪ Everything is based on two “functions” with predefined
signatures
▪ Map and Reduce
68
Solving complex problems is difficult
However, there are several important
problems that can be adapted to MapReduce
Log analysis
PageRank computation
Social graph analysis
Sensor data analysis
Smart-city data analysis
Network capture analysis
69
MapReduce is based on two main “building
blocks”
Map and Reduce functions
Map function
It is applied over each element of an input data set
and emits a set of (key, value) pairs
Reduce function
It is applied over each set of (key, value) pairs
(emitted by the map function) with the same key and
emits a set of (key, value) pairs (the final result)
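A minimal sketch of how the two building blocks fit together (plain Python, not the Hadoop API; the driver function and its name are illustrative):

from itertools import groupby

# Apply a user-defined map function to every input element, group the emitted
# (key, value) pairs by key, then apply a user-defined reduce function to each group.
def run_mapreduce(data, map_fn, reduce_fn):
    pairs = [kv for element in data for kv in map_fn(element)]     # map phase
    pairs.sort(key=lambda kv: kv[0])                                # shuffle and sort
    result = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):         # reduce phase
        result.extend(reduce_fn(key, [value for _, value in group]))
    return result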
70
Input
A textual file (i.e., a list of words)
Problem
Count the number of times each distinct word
appears in the file
Output
A list of pairs <word, number of occurrences in the
input file>
71
The input textual file is considered as a list of
words L
72
L = [toy, example, toy, example, hadoop]

Map phase: a pair (w, +1) is emitted for each word w in L
Lm = [(toy, +1), (example, +1), (toy, +1), (example, +1), (hadoop, +1)]

Shuffle and sort phase: the pairs in Lm are grouped by key
(toy, [+1, +1]), (example, [+1, +1]), (hadoop, [+1])

Reduce phase: a function is applied on each group (sum of the values)
[(toy, 2), (example, 2), (hadoop, 1)]
78
The input textual file is considered as a list of
words L
A key-value pair (w, 1) is emitted for each
word w in L
i.e., the map function is
m(w) = (w, 1)
A new list of (key, value) pairs Lm is generated
79
The key-value pairs in Lm are aggregated by
key (i.e., by word w in our example)
One group Gw is generated for each word w
Each group Gw is a key-list pair (w, [list of values])
where [list of values] contains all the values of the
pairs associated with the word w
▪ i.e., [list of values] is a list of [1, 1, 1, …] in our example
▪ Given a group Gw, the number of ones [1, 1, 1, …] is equal
to the occurrences of word w in the input file
80
A key-value pair (w, sum(Gw.[list of values])) is
emitted for each group Gw
i.e., the reduce function is
r(Gw) = (w, sum(Gw.[list of values]) )
The list of emitted pairs is the result of the
word count problem
One pair (word w, num. of occurrences) for each
word in our running example
81
The Map phase can be viewed as a
transformation over each element of a data set
This transformation is a function m defined by
developers
m is invoked one time for each input element
Each invocation of m happens in isolation
▪ The application of m to each element of a data set can be
parallelized in a straightforward manner
82
The Reduce phase can be viewed as an
aggregate operation
The aggregate function is a function r defined by
developers
r is invoked one time for each distinct key and
aggregates all the values associated with it
Also the reduce phase can be performed in
parallel and in isolation
▪ Each group of key-value pairs with the same key can be
processed in isolation
83
The shuffle and sort phase is always the same
i.e., group the output of the map phase by key
It does not need to be defined by developers
It is already provided by the Hadoop system
84
Key-value pair is the basic data structure in
MapReduce
Keys and values can be: integers, floats, strings, …
They can also be (almost) arbitrary data structures
defined by the designer
Both input and output of a MapReduce
program are lists of key-value pairs
Note that also the input is a list of key-value pairs
85
The design of MapReduce involves
Imposing the key-value structure on the input and
output data sets
▪ E.g., for a collection of Web pages, input keys may be
URLs and values may be their HTML content
86
The map and reduce functions are formally
defined as follows:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
Since the input data set is a list of key-value
pairs, the argument of the map function is a
key-value pair
map(key, value):
    // key: offset of the word in the file
    // value: a word of the input document
    emit(value, 1)

reduce(key, values):
    // key: a word; values: a list of integers
    occurrences = 0
    for each c in values:
        occurrences = occurrences + c
    emit(key, occurrences)
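A runnable sketch of the pseudocode above (plain Python, not the Hadoop API):

from itertools import groupby

def map_fn(key, value):
    # key: offset of the word in the file, value: the word itself
    yield (value, 1)

def reduce_fn(key, values):
    # key: a word; values: the list of 1s emitted for that word
    yield (key, sum(values))

words = ["toy", "example", "toy", "example", "hadoop"]

pairs = [kv for offset, word in enumerate(words) for kv in map_fn(offset, word)]
pairs.sort(key=lambda kv: kv[0])                       # shuffle and sort by key
result = [kv for key, group in groupby(pairs, key=lambda kv: kv[0])
             for kv in reduce_fn(key, [v for _, v in group])]
print(result)   # [('example', 2), ('hadoop', 1), ('toy', 2)]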
91