
The Big Data Ecosystem
Jesús Montes
[email protected]

Sept. 2022
The Big Data Ecosystem
Do you know any of these?

The Big Data Ecosystem 2


And all these?

The Big Data Ecosystem 3


Technology for Big Data
Big Data technologies can be grouped in three categories:

1. Infrastructure
○ Collecting/storing the data
○ Processing the data
2. Analytics + ML/AI
○ Extracting knowledge from data
○ Visualizing the data/knowledge
3. Applications

A multidisciplinary profile is required (computer systems, statistics, AI, visualization…)

The Big Data Ecosystem 4


Big Data infrastructure
● Collecting/storing the data
○ Information needs to be properly collected and stored
○ New technologies have appeared that are better suited for the nature of Big Data
problems
■ Fast generating data
■ Very large volume
■ Heterogeneous sources and complex data schemas
○ NoSQL

The Big Data Ecosystem 5


Big Data infrastructure

● Processing the data


○ Data needs to be efficiently processed.
○ We need to take as much advantage as possible of distributed and parallel techniques.
○ At the same time, it is important to maintain focus on the data and its analysis.
○ This is not (necessarily) HPC, so we try to avoid making it a parallel programming
problem.
○ MapReduce (and its descendants)

The Big Data Ecosystem 6


Aren't traditional RDBMS enough?
● RDBMS are successfully used in most
professional applications that require
proper data handling.
● They provide high performance and the
very convenient ACID properties:
○ Atomicity
○ Consistency
○ Isolation
○ Durability

The Big Data Ecosystem 7


Aren't traditional RDBMS enough?
● It is generally accepted that traditional RDBMS are not enough for several
reasons. Mainly:
○ Scalability: They do not handle extremely large datasets well.
○ Flexibility: They do not adapt easily to the complexity and requirements of some modern
applications.
● Still, there are some voices that argue that most of these claims have not
been properly justified, and that traditional RDBMS are suitable for many
Big Data applications (example: Oracle Exadata).
● It is up to us to decide what solution is better for each problem.

The Big Data Ecosystem 8


NoSQL databases
NoSQL (Not Only SQL) is a broad group of database technologies that do not necessarily use SQL as their query language.

● They do not fully guarantee ACID properties.
● They are optimized for LOAD and STORE/INSERT operations, but not UPDATE.
● They have very limited JOIN capabilities.
● They scale extremely well.
● They are distributed solutions.

The Big Data Ecosystem 9


NoSQL databases
● Without schema
● Easily replicable
● Simple APIs
● Relaxed consistency requirements (eventual consistency vs. strong
consistency in RDBMS)
● Immutable data:
○ Raw storage (without transformation)
○ Makes accounting more complicated

The Big Data Ecosystem 10


NoSQL databases
Brewer’s Theorem (a.k.a. the CAP principle): It is impossible for a distributed storage system to present the three following characteristics simultaneously:

● Consistency
● Availability
● Partition tolerance

The Big Data Ecosystem 11


NoSQL databases

Instead of ACID, NoSQL databases present the BASE properties:

● Basically Available
● Soft state
● Eventual consistency

The Big Data Ecosystem 12


NoSQL databases
NoSQL databases are modern alternatives to traditional RDBMSs. They are usually grouped into the following categories:

● Generic NoSQL models


○ Key-value stores (Dynamo, Redis, Riak...)
○ Column-based stores/databases (BigTable, HBase, Cassandra...)
● Data-specific models
○ Document-oriented databases (MongoDB, CouchDB...)
○ Graph databases (Neo4J, OrientDB...)

The Big Data Ecosystem 13


NoSQL databases
[Diagram: database types plotted by data size vs. data complexity. Key-value stores handle the largest sizes, followed by column stores, document databases and graph databases; relational databases sit at the high-complexity end. A marker notes that 90% of use cases fall below the indicated line.]

The Big Data Ecosystem 14


Key-value stores
● Very simple data model: just (key : value) pairs.
● Designed to store extremely large amounts of data.
● Easy to implement and use.
● Very efficient for locating+reading data.
● Inefficient when we need to access/update only a part of a value:
○ We have to read the entire value.
○ We have to update the entire value.
● Similar to a distributed hash table (DHT).
● Usually implemented in a fully distributed fashion, without a master node that could become a single point of failure (SPOF).
● Popular examples: Dynamo, Redis, Riak.
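As an illustration of the (key : value) access pattern, here is a minimal sketch using the Python redis client. It assumes a Redis server running locally; the key and value shown are invented for the example.

import redis  # third-party client, assumed installed (pip install redis)

# Assumption: a Redis server is listening on localhost:6379
r = redis.Redis(host="localhost", port=6379)

# STORE: the whole value is written under a single key
r.set("user:42", '{"name": "Alice", "visits": 17}')

# LOAD: the whole value is read back; to change one field we would still
# have to read and rewrite the entire value
print(r.get("user:42"))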

The Big Data Ecosystem 15


Column-based stores/databases
● Data is organized into columns, instead of tables with rows.
● Column: An object with three fields:
○ Unique name.
○ Value.
○ Timestamp.
● Columns in column-based stores are not the same as matrix columns or attributes in an RDBMS table.
● Columns are grouped into tuples, each with a unique key (tuple ID).
● Tuples can be grouped into column families, analogous to tables in
RDBMSs
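To make the column/tuple/column-family terminology concrete, the following is a purely illustrative Python sketch of the data model (not the API of any particular column store):

import time
from dataclasses import dataclass

@dataclass
class Column:
    name: str          # unique name
    value: bytes       # stored value
    timestamp: float   # version/insertion time

# A tuple (row) is a set of columns identified by a unique key
row = {
    "name":  Column("name", b"Alice", time.time()),
    "email": Column("email", b"alice@example.com", time.time()),
}

# A column family groups tuples, analogous to a table in an RDBMS
users = {"user:42": row}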

The Big Data Ecosystem 16


Column-based stores/databases
● Pros:
○ Aggregation operations (COUNT, SUM, MIN, …) are very efficient.
○ Batch insertions are very efficient.
○ Optimize the use of storage space.
○ Easy to compress and distribute
● Cons
○ Reading/writing an entire tuple takes time (it operates over many columns).
● Popular examples: Cassandra, HBase, BigTable, Accumulo

The Big Data Ecosystem 17


Document databases
● Basic principle: Storing documents in a natural way.
● Document:
○ Its specific definition will depend on the database implementation, but they will always
present a flexible, rich structure
○ They are typically stored as XML or JSON/BSON files
○ They can be labeled and organized into collections and/or hierarchies.
● Documents are stored as (key : value) pairs. Each document has a unique identifier (the key). Its contents are stored in the value field.
● The database understands the document format (JSON/BSON, XML, …)
and provides tools to access parts of the documents.
● Popular examples: MongoDB, CouchDB
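A minimal sketch with the Python MongoDB driver (pymongo). It assumes a MongoDB server on localhost; the database, collection and field names are invented for the example.

from pymongo import MongoClient  # assumed installed (pip install pymongo)

# Assumption: a MongoDB server is running on localhost:27017
client = MongoClient("localhost", 27017)
db = client["music"]

# Documents are flexible JSON-like structures; no fixed schema is required
db.songs.insert_one({
    "title": "Example Song",
    "album": {"name": "Example Album", "year": 1999},
    "lyrics": "love love love",
})

# The database understands the document structure, so inner fields can be queried
print(db.songs.find_one({"album.year": 1999}))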

The Big Data Ecosystem 18


Graph databases
● Data is organized into nodes and edges.
● Both nodes and edges have unique IDs, a
type ID and a variable set of properties,
usually in the form of (key : value) entries.
● Edges also have a source node ID and a
destination node ID.
● Data in graph databases are not restricted to
a fixed schema, as in relational DBs.
● The graph model allows for an extremely rich
data representation, and to perform data
queries that would not be feasible in a
relational DB (using graph based algorithms).
● Popular examples: Neo4J, OrientDB, “GraphX”
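The node/edge model can be illustrated with plain Python structures (this is only a sketch of the data model, not the API of Neo4J or any other product):

# Nodes: unique ID, a type ID and a variable set of (key : value) properties
nodes = {
    "n1": {"type": "Person", "properties": {"name": "Alice"}},
    "n2": {"type": "Band", "properties": {"name": "Example Band"}},
}

# Edges: unique ID, type ID, source and destination node IDs, plus properties
edges = {
    "e1": {"type": "FAN_OF", "src": "n1", "dst": "n2",
           "properties": {"since": 2015}},
}

# A graph-style query: which nodes is n1 a fan of?
print([e["dst"] for e in edges.values()
       if e["type"] == "FAN_OF" and e["src"] == "n1"])  # ['n2']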

The Big Data Ecosystem 19


Big Data processing
● We need an efficient way of processing data.
● When data is too large and/or complex, parallel and distributed
approaches are good.
○ Increased performance and throughput.
○ Better use of computational resources (avoiding bottlenecks).
● Parallel programming, however, is often complex and developed ad hoc
for each problem (e.g. HPC).
● Is there an alternative, more convenient approach?

The Big Data Ecosystem 20


The song lyrics database example

The Big Data Ecosystem 21


The song lyrics database example
● We want a list of all the artists that have participated in an album
containing at least one song whose lyrics contain the word “LOVE”.
○ The output should be the artist name, the album title, the song title and the song lyrics
● Can we do it with a typical RDBMS?
○ Of course, but with how many queries/joins?
○ What if we are looking for a phrase, not a word?
○ What if there are duplicated records?
○ What if data is split across more than one RDBMS?
○ What if not all lyrics are stored as plain text, but there are also document files?
○ What if the database is from Spotify or Amazon Music, with information from all the
music in their catalogs?

The Big Data Ecosystem 22


A little bit of history...
● 2003: Google publishes its papers about Google File System (GFS).
● 2004: Jeffrey Dean and Sanjay Ghemawat (Google) publish their paper
about MapReduce.
● 2006: Doug Cutting and Mike Cafarella (Yahoo) develop Hadoop, based on
Google’s MapReduce.
○ Open source framework for data processing using the MapReduce model
○ Includes the Hadoop File System (HDFS)
○ Donated to the Apache Foundation and distributed under Apache License 2.0
● 2010: Matei Zaharia develops Spark (initially as his Ph.D. thesis).
● 2014: Spark becomes a Top-Level Apache Project, with more than 1000
contributors worldwide.

The Big Data Ecosystem 23


MapReduce vs. map/reduce

map/reduce is a standard functional programming technique:

● map(f, X) = [f(x1), f(x2), f(x3), …]


○ applies function f to every value of X.
● reduce(f, X) = f(x1, f(x2, f(x3, …)))
○ Recursively accumulates the values of X, applying function f. Also known as fold or
accumulate.
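For instance, in Python (a minimal sketch of the two primitives):

from functools import reduce

X = [1, 2, 3, 4]

# map(f, X): apply f to every value of X
squares = list(map(lambda x: x * x, X))   # [1, 4, 9, 16]

# reduce(f, X): recursively accumulate (fold) the values of X with f
total = reduce(lambda a, b: a + b, X)     # 10

print(squares, total)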

The Big Data Ecosystem 24


MapReduce
● Data processing framework for large problems that can be parallelized.
● Designed for large data sets.
○ A small problem will be much slower with MapReduce.
● Makes use of large clusters or grids (many nodes).
● Uses low-level techniques to improve performance, mainly data locality.
● Inspired by functional programming map/reduce, but with different
objectives.

The Big Data Ecosystem 25


MapReduce vs. map/reduce
MapReduce is a distributed computing framework.

● Inspired by map/reduce
● Designed to make possible the processing of very large datasets

Two phases:

● Map phase: Input data is split into blocks (sub-problems) and distributed throughout the cluster (worker nodes). Each node processes its sub-problem.
● Reduce phase: Sub-problem results are combined into a final output.

The Big Data Ecosystem 26


MapReduce
[Diagram: Input → multiple Map tasks → multiple Reduce tasks → Output]

The Big Data Ecosystem 27


The 5 steps of MapReduce
1. Data is randomly split and disseminated throughout the cluster.
2. Map: Each mapper (a worker node or processor assigned) executes the
user-provided Map() function over its assigned data block. The result of
each sub-problem is a set of key-value pairs.
3. Shuffle and sort: Map() output is sorted by key and redistributed in the
cluster.
4. Reduce: Each reducer (again, a worker node or assigned processor) executes the user-provided Reduce() over all data associated with a single key. One reducer is executed per key generated.
5. Final output: The output of all reducers put together.
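The five steps can be condensed into a single-machine Python sketch of the control flow (a real framework runs the mappers and reducers on different worker nodes; this only illustrates the data movement):

from collections import defaultdict

def run_mapreduce(blocks, map_fn, reduce_fn):
    # Steps 1-2: each "mapper" runs the user Map() over its data block,
    # producing (key, value) pairs
    pairs = []
    for block in blocks:
        pairs.extend(map_fn(block))

    # Step 3: shuffle and sort, grouping all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Step 4: one Reduce() call per key
    # Step 5: put all reducer outputs together as the final result
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]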

The Big Data Ecosystem 28


MapReduce
[Diagram: Input → Split → multiple Map tasks → Shuffle and sort → multiple Reduce tasks → Append → Output]
The Big Data Ecosystem 29
Word Count
One of the most typical basic MapReduce examples is the word count problem:

● Input: A (very large) text, usually simply a collection of text lines.
● Desired output: A list of all the words present in the text, and the number of times each appears in the text.

Example input: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity…" (A Tale of Two Cities, Charles Dickens)

Example output (after MapReduce):
it, 6
was, 6
the, 6
best, 1
of, 5
...

How can we do it with MapReduce?

The Big Data Ecosystem 30


Word Count: Solution
Map function:

Map(line) {
  for each word in line {
    emit(word, 1)
  }
}

For each word in the line received, the Map function generates a (key, value) pair. The key is the word being processed, and the value is always the number 1.

Reduce function:

Reduce(key, values) {
  sum = 0
  for each value in values {
    sum = sum + value
  }
  emit(key, sum)
}

The Reduce function receives all values emitted by the mappers for a single key. The function adds these values and produces this sum as a result.
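The same solution written as runnable Python, with the shuffle-and-sort step simulated locally (in a real framework emit() would hand the pairs to the runtime instead of returning them):

from collections import defaultdict

def word_count_map(line):
    # emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def word_count_reduce(key, values):
    # add up all the 1s emitted for this word
    return key, sum(values)

lines = ["it was the best of times", "it was the worst of times"]

groups = defaultdict(list)          # shuffle and sort
for line in lines:
    for word, one in word_count_map(line):
        groups[word].append(one)

print([word_count_reduce(w, v) for w, v in sorted(groups.items())])
# [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]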

The Big Data Ecosystem 31


Word Count: How does it work?
Map phase (each mapper processes its assigned input lines):

It was the best of times        → Map → (it, 1) (was, 1) (the, 1) (best, 1) (of, 1) (times, 1)
it was the worst of times       → Map → (it, 1) (was, 1) (the, 1) (worst, 1) (of, 1) (times, 1)
it was the age of wisdom        → Map → (it, 1) (was, 1) (the, 1) (age, 1) (of, 1) (wisdom, 1)
it was the age of foolishness   → Map → (it, 1) (was, 1) (the, 1) (age, 1) (of, 1) (foolishness, 1)
it was the epoch of belief      → Map → (it, 1) (was, 1) (the, 1) (epoch, 1) (of, 1) (belief, 1)
it was the epoch of incredulity → Map → (it, 1) (was, 1) (the, 1) (epoch, 1) (of, 1) (incredulity, 1)

continues...
The Big Data Ecosystem 32
Word Count: How does it work?
The map output pairs — (it, 1) (was, 1) (the, 1) (best, 1) (of, 1) (times, 1) (it, 1) (was, 1) (the, 1) (age, 1) ... — go through shuffle and sort, which groups all values by key; each reducer then processes one key:

(it, {1,1,1,1,1,1})  → Reduce → (it, 6)
(was, {1,1,1,1,1,1}) → Reduce → (was, 6)
(the, {1,1,1,1,1,1}) → Reduce → (the, 6)
(best, {1})          → Reduce → (best, 1)
(of, {1,1,1,1,1})    → Reduce → (of, 5)
...

Append (final output):
it, 6
was, 6
the, 6
best, 1
of, 5
...
The Big Data Ecosystem 33
MapReduce
● MapReduce applications only require providing the implementations of the Map and Reduce functions.
● MapReduce applications are deployed over a MapReduce framework,
usually running in a cluster.
● The framework takes care of all data management operations:
○ Data splitting
○ Shuffle and sort
○ Collection of results
● The parallelization is transparent to the programmer.
● The MapReduce paradigm sacrifices design flexibility in exchange for easy
and fast development of parallel applications.

The Big Data Ecosystem 34


MapReduce
Typical MapReduce applications present more than one MapReduce cycle

[Diagram: Input → Map tasks → Reduce tasks → Output 1 → Map tasks → Reduce tasks → Output 2]

In addition to the Map and Reduce functions of each cycle, global/cycle parameters can be defined,
but state is never shared between mappers or reducers in the same stage.

The Big Data Ecosystem 35


The histogram
“A histogram is a graphical representation of the
distribution of numerical data. [...] To construct a
histogram, the first step is to "bin" the range of
values—that is, divide the entire range of values into a
series of intervals—and then count how many values
fall into each interval.” [Wikipedia]

Suppose we want to create a histogram of an extremely large sample of a random variable.
● We know the number of bins we want, but the data file is so big it does
not fit in the memory of any single machine we have.
● Can we do it in a cluster, with MapReduce? If so, how?

The Big Data Ecosystem 36


The histogram: Solution
The histogram problem can be solved in two MapReduce cycles/jobs:

● Job 1: Calculate the data range (maximum and minimum).


○ Input: The data file
○ Output: The maximum and minimum of the sample
● Job 2: Knowing the data range and the number of bins, construct the
histogram.
○ Input: The data file and the maximum and minimum calculated in job 1 (as global
parameters known by all workers).
○ Output: The histogram.

The Big Data Ecosystem 37


The histogram: Solution
Job 1

Map(numbers) {
  emit(1, (max(numbers), min(numbers)))
}

Reduce(key, values) {
  max_v = -infinity
  min_v = infinity
  for each pair in values {
    max_v = max(max_v, pair[0])
    min_v = min(min_v, pair[1])
  }
  emit("max", max_v)
  emit("min", min_v)
}

Job 2

Map(line) {
  for each number in line {
    bar = floor((number - min_v) / ((max_v - min_v) / n))
    if (number == max_v)
      emit(n - 1, 1)
    else
      emit(bar, 1)
  }
}

Reduce(key, values) {
  emit(key, sum(values))
}
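A condensed single-machine Python sketch of the same two jobs (the per-block loop stands in for the distributed workers; bin boundaries follow the pseudocode above):

import math
from collections import defaultdict

def histogram(blocks, n):
    # Job 1: compute the data range (maximum and minimum) over all blocks
    max_v = max(max(block) for block in blocks)
    min_v = min(min(block) for block in blocks)

    # Job 2: max_v and min_v are known to every worker as global parameters
    width = (max_v - min_v) / n
    counts = defaultdict(int)
    for block in blocks:
        for x in block:
            bar = n - 1 if x == max_v else int(math.floor((x - min_v) / width))
            counts[bar] += 1
    return [counts[i] for i in range(n)]

print(histogram([[0.1, 0.4, 0.9], [0.5, 0.2, 1.0]], 2))  # [4, 2]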

The Big Data Ecosystem 38


Directed graph
Given a directed graph:

● Nodes (with state)


● Edges, connecting two nodes

The state of a graph changes through several


iterations.

In each iteration, the new state of a node is


calculated using its current state and the state
of its neighbors.

Can we calculate the graph’s evolution


using MapReduce?

The Big Data Ecosystem 39


Directed graph: Solution
● The graph is stored in two files:
○ A list of nodes and their states [(A:SA), (B:SB), ...]. This will be the input of the MapReduce
process.
○ A list of nodes and their neighbors [(A:{B,C}), (B:{A}), ...]. This has to be accessible to all mappers.
● In the Map stage, each node sends a message with its state to itself and all its neighbors: Map(A:SA) = [(A:SA),(B:SA),(C:SA)]
● In the Reduce stage, the new state of each node is calculated:
Reduce(A, [(SA,SB)]) = (A:new SA)
● The output of the Reduce stage can serve as input for the next graph
iteration.

The Big Data Ecosystem 40


Directed graph: Solution
Map function:

Map(n, N) {
  emit(n, message(N))
  for each m in outgoing(n) {
    emit(m, message(N))
  }
}

Reduce function:

Reduce(m, messages) {
  M = calculateState(messages)
  emit(m, M)
}

This implementation:

● Allows fast, parallel calculation of the graph state.
● It can be implemented without the burdens of traditional parallel programming, such as message passing, process synchronization, etc.
● Its only limitation is that it requires the graph to fit completely in the memory of every cluster node.
● It is, therefore, useful when the node state is not very big.
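A single-machine Python sketch of one graph iteration; the state update used here (summing the received states) is only a hypothetical example of calculateState:

from collections import defaultdict

# Neighbor lists, accessible to every mapper
outgoing = {"A": ["B", "C"], "B": ["A"], "C": []}
state = {"A": 1.0, "B": 2.0, "C": 3.0}

def graph_map(n, s):
    # A node sends its state to itself and to all its neighbors
    return [(n, s)] + [(m, s) for m in outgoing[n]]

def graph_reduce(n, messages):
    # Hypothetical calculateState: the new state is the sum of received states
    return n, sum(messages)

inbox = defaultdict(list)           # shuffle and sort
for n, s in state.items():
    for target, msg in graph_map(n, s):
        inbox[target].append(msg)

new_state = dict(graph_reduce(n, msgs) for n, msgs in inbox.items())
print(new_state)  # {'A': 3.0, 'B': 3.0, 'C': 4.0}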

The Big Data Ecosystem 41


MapReduce: Key Takeaways
● As software, MapReduce is nowadays obsolete
(it has been replaced by more mature technologies like Spark)
● Its theoretical principles, however, are the basis of most modern data
processing frameworks, meeting the developer “halfway” between the
infrastructure and the data processing/analysis problem:
○ Development is based on versatile primitive operations that are easy to implement
○ The framework takes care of data splitting and distribution, load balancing and network
management
○ The framework takes advantage of data locality to improve performance
○ Procedures can be sequenced/orchestrated to solve more complex tasks

The Big Data Ecosystem 42
