
The Big Data Ecosystem
Jesús Montes
[email protected]

Sept. 2022
The Big Data Ecosystem
Do you know any of these?

The Big Data Ecosystem 2


And all these?

The Big Data Ecosystem 3


Technology for Big Data
Big Data technologies can be grouped in three categories:

1. Infrastructure
○ Collecting/storing the data
○ Processing the data
2. Analytics + ML/AI
○ Extracting knowledge from data
○ Visualizing the data/knowledge
3. Applications

A multidisciplinary profile is required (computer systems, statistics, AI, visualization…)

The Big Data Ecosystem 4


Big Data infrastructure
● Collecting/storing the data
○ Information needs to be properly collected and stored
○ New technologies have appeared that are better suited for the nature of Big Data
problems
■ Fast generating data
■ Very large volume
■ Heterogeneous sources and complex data schemas
○ NoSQL

The Big Data Ecosystem 5


Big Data infrastructure

● Processing the data


○ Data needs to be efficiently processed.
○ We need to take as much advantage as possible of distributed and parallel techniques.
○ At the same time, it is important to maintain focus on the data and its analysis.
○ This is not (necessarily) HPC, so we try to avoid making it a parallel programming
problem.
○ MapReduce (and its descendants)

The Big Data Ecosystem 6


Aren't traditional RDBMS enough?
● RDBMS are successfully used in most
professional applications that require
proper data handling.
● They provide high performance and the
very convenient ACID properties:
○ Atomicity
○ Consistency
○ Isolation
○ Durability

The Big Data Ecosystem 7


Aren't traditional RDBMS enough?
● It is generally accepted that traditional RDBMS are not enough for several
reasons. Mainly:
○ Scalability: They do not handle extremely large datasets well.
○ Flexibility: They do not adapt easily to the complexity and requirements of some modern
applications.
● Still, there are some voices that argue that most of these claims have not
been properly justified, and that traditional RDBMS are suitable for many
Big Data applications (example: Oracle Exadata).
● It is up to us to decide what solution is better for each problem.

The Big Data Ecosystem 8


NoSQL databases
NoSQL (Not Only SQL) is a broad group of database technologies that do not necessarily use SQL as their query language.

● They do not fully guarantee ACID properties.
● They are optimized for LOAD and STORE/INSERT operations, but not UPDATE.
● They have very limited JOIN capabilities.
● They scale extremely well.
● They are distributed solutions.

The Big Data Ecosystem 9


NoSQL databases
● Without schema
● Easily replicable
● Simple APIs
● Relaxed consistency requirements (eventual consistency vs. strong
consistency in RDBMS)
● Immutable data:
○ Raw storage (without transformation)
○ Makes accounting more complicated

The Big Data Ecosystem 10


NoSQL databases
Brewer’s Theorem (a.k.a. the CAP principle): It is impossible for a distributed storage system to present the three following characteristics simultaneously:

● Consistency
● Availability
● Partition tolerance

The Big Data Ecosystem 11


NoSQL databases

Instead of ACID, NoSQL databases present the BASE properties:

● Basically Available
● Soft state
● Eventual consistency

The Big Data Ecosystem 12


NoSQL databases
NoSQL databases are modern alternatives to traditional RDBMSs. They are usually grouped into the following categories:

● Generic NoSQL models


○ Key-value stores (Dynamo, Redis, Riak...)
○ Column-based stores/databases (BigTable, HBase, Cassandra...)
● Data-specific models
○ Document-oriented databases (MongoDB, CouchDB...)
○ Graph databases (Neo4J, OrientDB...)

The Big Data Ecosystem 13


NoSQL databases
[Diagram: database types plotted by data size vs. data complexity. Key-value stores handle the largest sizes, followed by column stores, document databases and graph databases; relational databases sit at the high-complexity end. A marker notes that 90% of use cases fall below the indicated line.]

The Big Data Ecosystem 14


Key-value stores
● Very simple data model: just (key : value) pairs.
● Designed to store extremely large amounts of data.
● Easy to implement and use.
● Very efficient for locating+reading data.
● Inefficient when we need to access/update only a part of a value:
○ We have to read the entire value.
○ We have to update the entire value.
● Similar to a distributed hash table (DHT).
● Usually implemented in a fully distributed fashion, without a master node that could become a single point of failure (SPOF).
● Popular examples: Dynamo, Redis, Riak.
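As an illustration of the (key : value) access pattern, here is a minimal sketch using the Python redis client. It assumes a Redis server running locally; the key and value shown are invented for the example.

import redis  # third-party client, assumed installed (pip install redis)

# Assumption: a Redis server is listening on localhost:6379
r = redis.Redis(host="localhost", port=6379)

# STORE: the whole value is written under a single key
r.set("user:42", '{"name": "Alice", "visits": 17}')

# LOAD: the whole value is read back; to change one field we would still
# have to read and rewrite the entire value
print(r.get("user:42"))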

The Big Data Ecosystem 15


Column-based stores/databases
● Data is organized into columns, instead of tables with rows.
● Column: An object with three fields:
○ Unique name.
○ Value.
○ Timestamp.
● Columns in column-based stores are not the same as matrix columns or attributes in an RDBMS table.
● Columns are grouped into tuples, each with a unique key (tuple ID).
● Tuples can be grouped into column families, analogous to tables in
RDBMSs
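To make the column/tuple/column-family terminology concrete, the following is a purely illustrative Python sketch of the data model (not the API of any particular column store):

import time
from dataclasses import dataclass

@dataclass
class Column:
    name: str          # unique name
    value: bytes       # stored value
    timestamp: float   # version/insertion time

# A tuple (row) is a set of columns identified by a unique key
row = {
    "name":  Column("name", b"Alice", time.time()),
    "email": Column("email", b"alice@example.com", time.time()),
}

# A column family groups tuples, analogous to a table in an RDBMS
users = {"user:42": row}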

The Big Data Ecosystem 16


Column-based stores/databases
● Pros:
○ Aggregation operations (COUNT, SUM, MIN, …) are very efficient.
○ Batch insertions are very efficient.
○ Optimize the use of storage space.
○ Easy to compress and distribute
● Cons
○ Reading/writing an entire tuple takes time (it operates over many columns).
● Popular examples: Cassandra, HBase, BigTable, Accumulo

The Big Data Ecosystem 17


Document databases
● Basic principle: Storing documents in a natural way.
● Document:
○ Its specific definition will depend on the database implementation, but they will always
present a flexible, rich structure
○ They are typically stored as XML or JSON/BSON files
○ They can be labeled and organized into collections and/or hierarchies.
● Documents are stored as (key : value) pairs. Each document has a unique identifier (the key). Its contents are stored in the value field.
● The database understands the document format (JSON/BSON, XML, …)
and provides tools to access parts of the documents.
● Popular examples: MongoDB, CouchDB
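A minimal sketch with the Python MongoDB driver (pymongo). It assumes a MongoDB server on localhost; the database, collection and field names are invented for the example.

from pymongo import MongoClient  # assumed installed (pip install pymongo)

# Assumption: a MongoDB server is running on localhost:27017
client = MongoClient("localhost", 27017)
db = client["music"]

# Documents are flexible JSON-like structures; no fixed schema is required
db.songs.insert_one({
    "title": "Example Song",
    "album": {"name": "Example Album", "year": 1999},
    "lyrics": "love love love",
})

# The database understands the document structure, so inner fields can be queried
print(db.songs.find_one({"album.year": 1999}))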

The Big Data Ecosystem 18


Graph databases
● Data is organized into nodes and edges.
● Both nodes and edges have unique IDs, a
type ID and a variable set of properties,
usually in the form of (key : value) entries.
● Edges also have a source node ID and a
destination node ID.
● Data in graph databases are not restricted to
a fixed schema, as in relational DBs.
● The graph model allows for an extremely rich
data representation, and to perform data
queries that would not be feasible in a
relational DB (using graph based algorithms).
● Popular examples: Neo4J, OrientDB, “GraphX”
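The node/edge model can be illustrated with plain Python structures (this is only a sketch of the data model, not the API of Neo4J or any other product):

# Nodes: unique ID, a type ID and a variable set of (key : value) properties
nodes = {
    "n1": {"type": "Person", "properties": {"name": "Alice"}},
    "n2": {"type": "Band", "properties": {"name": "Example Band"}},
}

# Edges: unique ID, type ID, source and destination node IDs, plus properties
edges = {
    "e1": {"type": "FAN_OF", "src": "n1", "dst": "n2",
           "properties": {"since": 2015}},
}

# A graph-style query: which nodes is n1 a fan of?
print([e["dst"] for e in edges.values()
       if e["type"] == "FAN_OF" and e["src"] == "n1"])  # ['n2']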

The Big Data Ecosystem 19


Big Data processing
● We need an efficient way of processing data.
● When data is too large and/or complex, parallel and distributed
approaches are good.
○ Increased performance and throughput.
○ Better use of computational resources (avoiding bottlenecks).
● Parallel programming, however, is often complex and developed ad hoc
for each problem (e.g. HPC).
● Is there an alternative, more convenient approach?

The Big Data Ecosystem 20


The song lyrics database example

The Big Data Ecosystem 21


The song lyrics database example
● We want a list of all the artists that have participated in an album
containing at least one song whose lyrics contain the word “LOVE”.
○ The output should be the artist name, the album title, the song title and the song lyrics
● Can we do it with a typical RDBMS?
○ Of course, but with how many queries/joins?
○ What if we are looking for a phrase, not a word?
○ What if there are duplicated records?
○ What if data is split across more than one RDBMS?
○ What if not all lyrics are stored as plain text, but there are also document files?
○ What if the database is from Spotify or Amazon Music, with information from all the
music in their catalogs?

The Big Data Ecosystem 22


A little bit of history...
● 2003: Google publishes its papers about Google File System (GFS).
● 2004: Jeffrey Dean and Sanjay Ghemawat (Google) publish their paper
about MapReduce.
● 2006: Doug Cutting and Mike Cafarella (Yahoo) develop Hadoop, based on
Google’s MapReduce.
○ Open source framework for data processing using the MapReduce model
○ Includes the Hadoop File System (HDFS)
○ Donated to the Apache Foundation and distributed under Apache License 2.0
● 2010: Matei Zaharia develops Spark (initially as his Ph.D. thesis).
● 2014: Spark becomes a Top-Level Apache Project, with more than 1000
contributors worldwide.

The Big Data Ecosystem 23


MapReduce vs. map/reduce

map/reduce is a standard functional programming technique:

● map(f, X) = [f(x1), f(x2), f(x3), …]


○ applies function f to every value of X.
● reduce(f, X) = f(x1, f(x2, f(x3, …)))
○ Recursively accumulates the values of X, applying function f. Also known as fold or
accumulate.
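For instance, in Python (a minimal sketch of the two primitives):

from functools import reduce

X = [1, 2, 3, 4]

# map(f, X): apply f to every value of X
squares = list(map(lambda x: x * x, X))   # [1, 4, 9, 16]

# reduce(f, X): recursively accumulate (fold) the values of X with f
total = reduce(lambda a, b: a + b, X)     # 10

print(squares, total)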

The Big Data Ecosystem 24


MapReduce
● Data processing framework for large problems that can be parallelized.
● Designed for large data sets.
○ A small problem will be much slower with MapReduce.
● Makes use of large clusters or grids (many nodes).
● Uses low-level techniques to improve performance, mainly data locality.
● Inspired by functional programming map/reduce, but with different
objectives.

The Big Data Ecosystem 25


MapReduce vs. map/reduce
MapReduce is a distributed computing framework.

● Inspired by map/reduce
● Designed to make possible the processing of very large datasets

Two phases:

● Map phase: Input data is split into blocks (sub-problems) and distributed throughout the cluster (worker nodes). Each node processes its sub-problem.
● Reduce phase: Sub-problem results are combined into a final output.

The Big Data Ecosystem 26


MapReduce
[Diagram: Input → multiple Map tasks → multiple Reduce tasks → Output]

The Big Data Ecosystem 27


The 5 steps of MapReduce
1. Data is randomly split and disseminated throughout the cluster.
2. Map: Each mapper (a worker node or processor assigned) executes the
user-provided Map() function over its assigned data block. The result of
each sub-problem is a set of key-value pairs.
3. Shuffle and sort: Map() output is sorted by key and redistributed in the
cluster.
4. Reduce: Each reducer (again, a worker node or assigned processor) executes the user-provided Reduce() over all data associated with a single key. One reducer is executed per key generated.
5. Final output: The output of all reducers put together.
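The five steps can be condensed into a single-machine Python sketch of the control flow (a real framework runs the mappers and reducers on different worker nodes; this only illustrates the data movement):

from collections import defaultdict

def run_mapreduce(blocks, map_fn, reduce_fn):
    # Steps 1-2: each "mapper" runs the user Map() over its data block,
    # producing (key, value) pairs
    pairs = []
    for block in blocks:
        pairs.extend(map_fn(block))

    # Step 3: shuffle and sort, grouping all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Step 4: one Reduce() call per key
    # Step 5: put all reducer outputs together as the final result
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]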

The Big Data Ecosystem 28


MapReduce
[Diagram: Input → Split → multiple Map tasks → Shuffle and sort → multiple Reduce tasks → Append → Output]
The Big Data Ecosystem 29
Word Count
One of the most typical basic MapReduce examples is the word count problem:

● Input: A (very large) text, usually simply a collection of text lines.
● Desired output: A list of all the words present in the text, and the number of times each appears in the text.

Example input: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity…" (A Tale of Two Cities, Charles Dickens)

Example output (after MapReduce):
it, 6
was, 6
the, 6
best, 1
of, 5
...

How can we do it with MapReduce?

The Big Data Ecosystem 30


Word Count: Solution
Map function:

Map(line) {
  for each word in line {
    emit(word, 1)
  }
}

For each word in the line received, the Map function generates a (key, value) pair. The key is the word being processed, and the value is always the number 1.

Reduce function:

Reduce(key, values) {
  sum = 0
  for each value in values {
    sum = sum + value
  }
  emit(key, sum)
}

The Reduce function receives all values emitted by the mappers for a single key. The function adds these values and produces this sum as a result.
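The same solution written as runnable Python, with the shuffle-and-sort step simulated locally (in a real framework emit() would hand the pairs to the runtime instead of returning them):

from collections import defaultdict

def word_count_map(line):
    # emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def word_count_reduce(key, values):
    # add up all the 1s emitted for this word
    return key, sum(values)

lines = ["it was the best of times", "it was the worst of times"]

groups = defaultdict(list)          # shuffle and sort
for line in lines:
    for word, one in word_count_map(line):
        groups[word].append(one)

print([word_count_reduce(w, v) for w, v in sorted(groups.items())])
# [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]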

The Big Data Ecosystem 31


Word Count: How does it work?
Map phase (each mapper processes its assigned input lines):

It was the best of times        → Map → (it, 1) (was, 1) (the, 1) (best, 1) (of, 1) (times, 1)
it was the worst of times       → Map → (it, 1) (was, 1) (the, 1) (worst, 1) (of, 1) (times, 1)
it was the age of wisdom        → Map → (it, 1) (was, 1) (the, 1) (age, 1) (of, 1) (wisdom, 1)
it was the age of foolishness   → Map → (it, 1) (was, 1) (the, 1) (age, 1) (of, 1) (foolishness, 1)
it was the epoch of belief      → Map → (it, 1) (was, 1) (the, 1) (epoch, 1) (of, 1) (belief, 1)
it was the epoch of incredulity → Map → (it, 1) (was, 1) (the, 1) (epoch, 1) (of, 1) (incredulity, 1)

continues...
The Big Data Ecosystem 32
Word Count: How does it work?
The map output pairs — (it, 1) (was, 1) (the, 1) (best, 1) (of, 1) (times, 1) (it, 1) (was, 1) (the, 1) (age, 1) ... — go through shuffle and sort, which groups all values by key; each reducer then processes one key:

(it, {1,1,1,1,1,1})  → Reduce → (it, 6)
(was, {1,1,1,1,1,1}) → Reduce → (was, 6)
(the, {1,1,1,1,1,1}) → Reduce → (the, 6)
(best, {1})          → Reduce → (best, 1)
(of, {1,1,1,1,1})    → Reduce → (of, 5)
...

Append (final output):
it, 6
was, 6
the, 6
best, 1
of, 5
...
The Big Data Ecosystem 33
MapReduce
● MapReduce applications only require providing the implementations of the Map and Reduce functions.
● MapReduce applications are deployed over a MapReduce framework,
usually running in a cluster.
● The framework takes care of all data management operations:
○ Data splitting
○ Shuffle and sort
○ Collection of results
● The parallelization is transparent to the programmer.
● The MapReduce paradigm sacrifices design flexibility in exchange for easy
and fast development of parallel applications.

The Big Data Ecosystem 34


MapReduce
Typical MapReduce applications present more than one MapReduce cycle

[Diagram: Input → Map tasks → Reduce tasks → Output 1 → Map tasks → Reduce tasks → Output 2]

In addition to the Map and Reduce functions of each cycle, global/cycle parameters can be defined,
but state is never shared between mappers or reducers in the same stage.

The Big Data Ecosystem 35


The histogram
“A histogram is a graphical representation of the
distribution of numerical data. [...] To construct a
histogram, the first step is to "bin" the range of
values—that is, divide the entire range of values into a
series of intervals—and then count how many values
fall into each interval.” [Wikipedia]

Suppose we want to create a histogram of an extremely large sample of a random variable.
● We know the number of bins we want, but the data file is so big it does
not fit in the memory of any single machine we have.
● Can we do it in a cluster, with MapReduce? If so, how?

The Big Data Ecosystem 36


The histogram: Solution
The histogram problem can be solved in two MapReduce cycles/jobs:

● Job 1: Calculate the data range (maximum and minimum).


○ Input: The data file
○ Output: The maximum and minimum of the sample
● Job 2: Knowing the data range and the number of bins, construct the
histogram.
○ Input: The data file and the maximum and minimum calculated in job 1 (as global
parameters known by all workers).
○ Output: The histogram.

The Big Data Ecosystem 37


The histogram: Solution
Job 1

Map(numbers) {
  emit(1, (max(numbers), min(numbers)))
}

Reduce(key, values) {
  max_v = -infinity
  min_v = infinity
  for each pair in values {
    max_v = max(max_v, pair[0])
    min_v = min(min_v, pair[1])
  }
  emit("max", max_v)
  emit("min", min_v)
}

Job 2

Map(line) {
  for each number in line {
    bar = floor((number - min_v) / ((max_v - min_v) / n))
    if (number == max_v)
      emit(n - 1, 1)
    else
      emit(bar, 1)
  }
}

Reduce(key, values) {
  emit(key, sum(values))
}
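A condensed single-machine Python sketch of the same two jobs (the per-block loop stands in for the distributed workers; bin boundaries follow the pseudocode above):

import math
from collections import defaultdict

def histogram(blocks, n):
    # Job 1: compute the data range (maximum and minimum) over all blocks
    max_v = max(max(block) for block in blocks)
    min_v = min(min(block) for block in blocks)

    # Job 2: max_v and min_v are known to every worker as global parameters
    width = (max_v - min_v) / n
    counts = defaultdict(int)
    for block in blocks:
        for x in block:
            bar = n - 1 if x == max_v else int(math.floor((x - min_v) / width))
            counts[bar] += 1
    return [counts[i] for i in range(n)]

print(histogram([[0.1, 0.4, 0.9], [0.5, 0.2, 1.0]], 2))  # [4, 2]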

The Big Data Ecosystem 38


Directed graph
Given a directed graph:

● Nodes (with state)


● Edges, connecting two nodes

The state of a graph changes through several


iterations.

In each iteration, the new state of a node is


calculated using its current state and the state
of its neighbors.

Can we calculate the graph’s evolution


using MapReduce?

The Big Data Ecosystem 39


Directed graph: Solution
● The graph is stored in two files:
○ A list of nodes and their states [(A:SA), (B:SB), ...]. This will be the input of the MapReduce
process.
○ A list of nodes and their neighbors [(A:{B,C}), (B:{A}), ...]. This has to be accessible to all mappers.
● In the Map stage, each node sends a message with its state to itself and all its neighbors: Map(A:SA) = [(A:SA),(B:SA),(C:SA)]
● In the Reduce stage, the new state of each node is calculated:
Reduce(A, [(SA,SB)]) = (A:new SA)
● The output of the Reduce stage can serve as input for the next graph
iteration.

The Big Data Ecosystem 40


Directed graph: Solution
Map function:

Map(n, N) {
  emit(n, message(N))
  for each m in outgoing(n) {
    emit(m, message(N))
  }
}

Reduce function:

Reduce(m, messages) {
  M = calculateState(messages)
  emit(m, M)
}

This implementation:

● Allows fast, parallel calculation of the graph state.
● It can be implemented without the burdens of traditional parallel programming, such as message passing, process synchronization, etc.
● Its only limitation is that it requires the graph to fit completely in the memory of every cluster node.
● It is, therefore, useful when the node state is not very big.
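A single-machine Python sketch of one graph iteration; the state update used here (summing the received states) is only a hypothetical example of calculateState:

from collections import defaultdict

# Neighbor lists, accessible to every mapper
outgoing = {"A": ["B", "C"], "B": ["A"], "C": []}
state = {"A": 1.0, "B": 2.0, "C": 3.0}

def graph_map(n, s):
    # A node sends its state to itself and to all its neighbors
    return [(n, s)] + [(m, s) for m in outgoing[n]]

def graph_reduce(n, messages):
    # Hypothetical calculateState: the new state is the sum of received states
    return n, sum(messages)

inbox = defaultdict(list)           # shuffle and sort
for n, s in state.items():
    for target, msg in graph_map(n, s):
        inbox[target].append(msg)

new_state = dict(graph_reduce(n, msgs) for n, msgs in inbox.items())
print(new_state)  # {'A': 3.0, 'B': 3.0, 'C': 4.0}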

The Big Data Ecosystem 41


MapReduce: Key Takeaways
● As software, MapReduce is nowadays obsolete
(it has been replaced by more mature technologies like Spark)
● Its theoretical principles, however, are the basis of most modern data
processing frameworks, meeting the developer “halfway” between the
infrastructure and the data processing/analysis problem:
○ Development is based on versatile primitive operations that are easy to implement
○ The framework takes care of data splitting and distribution, load balancing and network
management
○ The framework takes advantage of data locality to improve performance
○ Procedures can be sequenced/orchestrated to solve more complex tasks

The Big Data Ecosystem 42
