The Big Data Ecosystem
Jesús Montes
Sept. 2022
Do you know any of these?
1. Infrastructure
○ Collecting/storing the data
○ Processing the data
2. Analytics + ML/AI
○ Extracting knowledge from data
○ Visualizing the data/knowledge
3. Applications
● Basically Available
● Soft state
● Eventual consistency
[Figure: data stores plotted by size (one axis) against complexity (the other axis). Labels: Stores, Column stores, Document databases, Graph databases, Relational databases.]
● Inspired by the map and reduce operations of functional programming
● Designed to make processing very large datasets possible
Two phases:
Map
Reduce
[Figure: the Input is split among several Map tasks; the Map outputs are routed to a smaller number of Reduce tasks, which together produce the Output.]
problem:
foolishness, it was the epoch of belief, it was the
epoch of incredulity…
For each word in the line received, the Map function generates a (key, value) pair. The key is the word being processed, and the value is always the number 1.
The Reduce function receives all values emitted by the mappers for a single key. The function adds these values and produces this sum as a result.
Map:
(was, 1), (the, 1), (best, 1), (of, 1), (times, 1), ..., (it, 1), (was, 1), (the, 1), (age, 1), ...
Shuffle and sort:
(it, {1,1,1,1,1,1}), (was, {1,1,1,1,1,1}), (the, {1,1,1,1,1,1}), (best, {1}), (of, {1,1,1,1,1}), ...
Reduce:
(it, 6), (was, 6), (the, 6), (best, 1), (of, 5), ...
Append:
it, 6
was, 6
the, 6
best, 1
of, 5
...
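The word count flow above can be sketched in a few lines of Python. This is a minimal, single-process sketch, not a real MapReduce deployment: the names `map_fn`, `shuffle_and_sort`, `reduce_fn` and `word_count` are hypothetical, and the two input lines are illustrative rather than the full text used on the slide.

```python
from collections import defaultdict

def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle_and_sort(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Add up all the values emitted for a single key."""
    return key, sum(values)

def word_count(lines):
    # Map every line, then shuffle, then reduce each key independently.
    emitted = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle_and_sort(emitted))

# Illustrative input (assumption), not the full text from the slide.
lines = ["it was the best of times", "it was the worst of times"]
counts = word_count(lines)
print(counts)
```

Only `map_fn` and `reduce_fn` contain problem-specific logic. The grouping step is generic and, in a real framework, would not be written by the application developer.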
MapReduce
● A MapReduce application only needs to provide implementations of
the Map and Reduce functions.
● MapReduce applications are deployed over a MapReduce framework,
usually running in a cluster.
● The framework takes care of all data management operations:
○ Data splitting
○ Shuffle and sort
○ Collection of results
● The parallelization is transparent to the programmer.
● The MapReduce paradigm sacrifices design flexibility in exchange for easy
and fast development of parallel applications.
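The division of labour described above can be illustrated with a toy driver. This is a hedged sketch, assuming a local thread pool as a stand-in for a cluster: `run_mapreduce`, `parity_map` and `sum_reduce` are hypothetical names, and a real framework would also handle distribution, storage and fault tolerance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn, workers=4):
    """Hypothetical mini-framework: the caller supplies only map_fn and reduce_fn."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Data splitting: each record is handed to a mapper running in the pool.
        mapped = pool.map(map_fn, records)
        pairs = [pair for emitted in mapped for pair in emitted]
        # Shuffle and sort: bring together all values that share a key.
        pairs.sort(key=itemgetter(0))
        grouped = [(key, [value for _, value in group])
                   for key, group in groupby(pairs, key=itemgetter(0))]
        # Reduce each key in the pool, then collect the results.
        reduced = pool.map(lambda item: reduce_fn(*item), grouped)
        return list(reduced)

# Application code: no threads, no grouping, only Map and Reduce.
def parity_map(numbers):
    return [("even" if n % 2 == 0 else "odd", n) for n in numbers]

def sum_reduce(key, values):
    return key, sum(values)

result = run_mapreduce([[1, 2, 3], [4, 5, 6], [7, 8]], parity_map, sum_reduce)
print(result)  # [('even', 20), ('odd', 16)]
```

The application functions never mention the pool, which is the sense in which the parallelization is transparent. The price is that every problem has to be expressed as key/value pairs flowing through Map and Reduce.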
[Figure: two chained MapReduce cycles. Input -> Map -> Reduce -> Output 1 -> Map -> Reduce -> Output 2.]
In addition to the Map and Reduce functions of each cycle, global/cycle parameters can be defined,
but state is never shared between mappers or reducers in the same stage.
// Cycle 1: find the global maximum and minimum.
Map(numbers) {
    emit(1, (max(numbers), min(numbers)))
}

Reduce(key, values) {
    max_v = -infinity
    min_v = infinity
    for each pair in values {
        max_v = max(max_v, pair[0])
        min_v = min(min_v, pair[1])
    }
    emit("max", max_v)
    emit("min", min_v)
}

// Cycle 2: count the numbers that fall in each of the n bars.
// max_v, min_v and n are global parameters of this cycle.
Map(line) {
    for each number in line {
        bar = floor((number - min_v) / ((max_v - min_v) / n))
        if (number == max_v)
            emit(n-1, 1)
        else
            emit(bar, 1)
    }
}

Reduce(key, values) {
    emit(key, sum(values))
}
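The two-cycle histogram above can be tried out with a small Python sketch. It assumes a sequential `run_cycle` helper in place of a real framework, and passes the global parameters of the second cycle (`min_v`, `max_v`, `n`) through a closure. All function names and the sample input are hypothetical, and the sketch assumes the data contains at least two distinct values, since otherwise the bar width is zero.

```python
import math
from collections import defaultdict

def run_cycle(records, map_fn, reduce_fn):
    """Run one Map/Reduce cycle sequentially (stand-in for a real framework)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

# Cycle 1: find the global maximum and minimum.
def map_extremes(numbers):
    return [(1, (max(numbers), min(numbers)))]

def reduce_extremes(key, values):
    max_v = max(pair[0] for pair in values)
    min_v = min(pair[1] for pair in values)
    return [("max", max_v), ("min", min_v)]

# Cycle 2: count how many numbers fall into each of n bars.
def make_map_bars(min_v, max_v, n):
    width = (max_v - min_v) / n  # assumes max_v > min_v
    def map_bars(numbers):
        pairs = []
        for number in numbers:
            # The maximum goes into the last bar instead of opening bar n.
            if number == max_v:
                pairs.append((n - 1, 1))
            else:
                pairs.append((math.floor((number - min_v) / width), 1))
        return pairs
    return map_bars

def reduce_bars(key, values):
    return [(key, sum(values))]

lines = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]  # illustrative input
extremes = dict(run_cycle(lines, map_extremes, reduce_extremes))
map_bars = make_map_bars(extremes["min"], extremes["max"], n=3)
histogram = dict(run_cycle(lines, map_bars, reduce_bars))
print(histogram)  # {0: 3, 1: 3, 2: 4}
```

The output of the first cycle becomes a read-only parameter of the second. Each mapper receives the same values, but no mapper or reducer modifies state that another one reads within the same stage.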