Architecture of AI Systems - Engineering For Big Data and AI (Grokking)

Uploaded by

Herve Roussel

Architectures of AI systems

Engineering for Big Data & AI

HCMC, Sep 6th 2019 Herve Roussel [email protected]

What is Data Engineering?
Is this data engineering?

UploadData.java

upload_data.py
Is this data engineering?

cat console.log | grep "ERROR" > errors.log
Data engineering?

Program

Event data Transformed data

Backend vs Data?
Is this data engineering?

Event data:        cat console.log
Transform:         | grep "ERROR"
Transformed data:  > errors.log
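
The same three stages can be sketched in Python. This is only an illustration of the slide's shell pipeline; the function names transform and run_pipeline are made up for the example and are not from the talk.

```python
from typing import Iterable, Iterator


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Keep only lines containing the substring ERROR, like grep "ERROR"."""
    return (line for line in lines if "ERROR" in line)


def run_pipeline(source_path: str, sink_path: str) -> int:
    """Read the source file, apply the transform, write the sink file.

    Returns the number of lines written to the sink.
    """
    written = 0
    with open(source_path, encoding="utf-8") as source, \
            open(sink_path, "w", encoding="utf-8") as sink:
        for line in transform(source):
            sink.write(line)
            written += 1
    return written


# Hypothetical log lines standing in for console.log.
sample = ["10:00 INFO started\n", "10:01 ERROR disk full\n", "10:02 INFO done\n"]
errors = list(transform(sample))
```

Calling run_pipeline("console.log", "errors.log") would play the role of the full shell command, assuming console.log exists.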
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT
*
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY
posts.timestamp DESC
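
A runnable version of this naive query, using an in-memory SQLite database. The slide leaves the join condition and the WHERE clause open, so the schema, the ON clause and the user filter below are assumptions made only to get something executable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; the slide only names the two tables.
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        content TEXT, timestamp INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    INSERT INTO posts VALUES (1, 2, 'hello', 100),
                             (2, 3, 'world', 200),
                             (3, 4, 'not a friend', 300);
    INSERT INTO friends VALUES (1, 2), (1, 3);
""")


def news_feed(connection, user_id):
    """Return (post id, content) written by the user's friends, newest first."""
    rows = connection.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (user_id,),
    )
    return rows.fetchall()


feed = news_feed(conn, 1)
```

The query itself stays small. Nothing in it decides who may see a post, whether to notify, or how to rank.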
Chris posted. Is that good or bad?
Who can see this?
Racist? Vulgar?
Notify? Web, mobile?
Anybody tagged?
Is this a face? Who’s this?
Friend? Celebrity?
Courtney likes. Is that good or bad?
What rank in feed?
Paddy commented. Is that good or bad?
Copyright violation?
Is Big Data just for big companies?

300K QPS [R] 1B+ QPM [P] 400M LOC [P]

6K QPS [W] 250M+ QPM [R] 1.8 TB per year [P]

As of JULY 8, 2013
Data Engineering

Program

Event data Augmented data

Big Data Engineering + AI
Event data

Transform

Augmented data
Source (Event data)

Pipeline (Transform)

Sink (Augmented data)

What is a source?
Where is data coming from?
Main data

Synchronous (10-100 ms)

Asynchronous (3-5 s)

Event source

Why split?
What’s in an event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …
  updated_at: …
  author_id: 67890,
  …
}

PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted"
  …
}
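
A minimal sketch of the split between the main record and the event derived from it. Only the fields spelled out on the slide are modelled; the elided ones are left out. The to_event helper is a name invented for this example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Post:
    """Main data: the row the application reads and updates."""
    id: int
    content: str
    author_id: int


@dataclass(frozen=True)
class PostCreatedEvent:
    """Event data: an immutable fact that something happened."""
    story_id: int
    type: str


def to_event(post: Post) -> PostCreatedEvent:
    """Derive the event from the post (hypothetical helper)."""
    return PostCreatedEvent(story_id=post.id, type="story_posted")


post = Post(id=12345, content="hello world", author_id=67890)
event = to_event(post)
event_payload = asdict(event)
```

The event carries a reference to the post rather than a copy of it, so the post can change later without rewriting history.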
What’s batch processing?

Job 1

Scheduler

Job 2
Which DB for event source?
How to store events?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
                        MySQL   MongoDB   JSON on S3 (or GCS)
30 GB                   OK      Good      Very good
10K WPS                 OK      Good      Very good
1K RPS                  OK      Good      Very good
Sequential range read   OK      Good      Very good
Cost                    $$      $$$       $
Who wants to become architect?
What’s the problem with batch?

LATENCY
Job 1

Scheduler

Job 2
How to process real-time?

Stream processing
How can 2 processes talk?
QUEUE
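
A small sketch of two workers talking through a queue. Threads and the standard library queue.Queue stand in for the two processes and the message broker; the upper-casing step is a placeholder transform, not something from the talk.

```python
import queue
import threading

SENTINEL = None  # tells the consumer that the producer has finished


def producer(q, events):
    """Push every event onto the queue, then signal the end."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)


def consumer(q, out):
    """Pull events in arrival order and process them one by one."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        out.append(event.upper())  # placeholder transform


def run(events):
    q = queue.Queue(maxsize=100)  # bounded: a slow consumer slows the producer
    out = []
    workers = [
        threading.Thread(target=producer, args=(q, events)),
        threading.Thread(target=consumer, args=(q, out)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return out


processed = run(["story_posted", "story_liked"])
```

Neither side calls the other directly. The producer only knows the queue, which is what lets each side be scaled or restarted on its own.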

Why not use database?

Why not database?
                  Importance   MySQL              Kafka               Redis
10K WPS           1.0          5                  10                  10
1K RPS            1.0          5                  10                  10
Sequential read   1.0          10 (with B-TREE)   10                  10 (using Lists)
Order guarantee   0.2          10                 0                   10
Durability        0.1          10                 5 (but perf. hit)   0
Deployability     0.5          10                 5                   7.5
Score                          5.6 / 10           6.6 / 10            7.15 / 10
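
The scores can be reproduced as a weighted sum of each row's rating times its importance. The divisor of 5 is inferred from the slide's numbers and is an assumption; dividing by the sum of the weights (3.8) would give different values.

```python
# Importance weights and ratings copied from the table.
WEIGHTS = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

RATINGS = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}


def weighted_score(ratings, weights, divisor=5.0):
    """Sum of rating * importance, scaled by an assumed divisor of 5."""
    total = sum(weights[name] * ratings[name] for name in weights)
    return round(total / divisor, 2)


scores = {db: weighted_score(r, WEIGHTS) for db, r in RATINGS.items()}
```

Changing a single importance value is enough to reorder the candidates, which is the point of writing the weights down.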

What is a transform?
Source

Transforms

Sink
Functional vs OOP

Operations on things Things with operations

Add more things Add more operations

Librarian.startShift()
Books.create()
Catalog.open()
Library.close()

find(book)
load_cover(book)
remove(book)
assign(book)
Functional vs OOP

Things with operations

Add more operations

generate_thumbnails(vid_uploaded)

find_similar(vid_uploaded)

transcribe_captions(vid_uploaded)

alert_subscribers(vid_uploaded)
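
A sketch of this style: one event, and a list of plain functions applied to it. The function names come from the slide; their bodies and the shape of the event are placeholders invented for the example.

```python
def generate_thumbnails(vid_uploaded):
    # Placeholder: a real transform would decode frames from the video.
    return [f"{vid_uploaded['id']}_thumb.jpg"]


def find_similar(vid_uploaded):
    # Placeholder: a real transform would query a similarity index.
    return []


def transcribe_captions(vid_uploaded):
    # Placeholder: a real transform would call a speech-to-text model.
    return ""


def alert_subscribers(vid_uploaded):
    # Placeholder: a real transform would enqueue notifications.
    return 0


TRANSFORMS = [generate_thumbnails, find_similar,
              transcribe_captions, alert_subscribers]


def apply_all(event, transforms):
    """Run every transform on the same event, keyed by function name."""
    return {transform.__name__: transform(event) for transform in transforms}


results = apply_all({"id": "vid42"}, TRANSFORMS)
```

Adding a fifth operation means appending one function to the list. The event itself does not change.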
What’s supporting data?

Supporting data

event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}

Transform

Friends or city DB
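
A sketch of a transform that augments the event with supporting data. The two-entry CITY_DB and the nearest-point lookup are assumptions for illustration; a real city database and a proper geographic distance would replace them.

```python
# Hypothetical supporting data: city name -> (latitude, longitude).
CITY_DB = {
    "Ho Chi Minh City": (10.76, 106.66),
    "Hanoi": (21.03, 105.85),
}


def nearest_city(coordinates):
    """Pick the closest city by squared difference in degrees (crude)."""
    lat, lon = coordinates

    def squared_distance(name):
        city_lat, city_lon = CITY_DB[name]
        return (lat - city_lat) ** 2 + (lon - city_lon) ** 2

    return min(CITY_DB, key=squared_distance)


def enrich(event):
    """Return a copy of the event with the city added; the input is untouched."""
    return {**event, "city": nearest_city(event["coordinates"])}


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = enrich(event)
```

The lookup table here lives in memory. Whether supporting data is local or behind an external API is exactly what the next slides question.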
Who uses ext. supporting data?
API vs Pipeline: availability?

Requests in thread Long running

API vs Pipeline: performance?

100ms        0.1 s * 300,000 / 60 / 60 ≈ 8.3 h

⇓            ⇓

10ms         0.01 s * 300,000 / 60 = 50 min
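
Working the arithmetic through for 300,000 sequential calls gives roughly 8.3 hours at 100 ms per call and 50 minutes at 10 ms per call. A small helper, with a made-up name, makes the unit conversions explicit.

```python
def total_seconds(latency_ms, calls):
    """Total wall-clock time for sequential calls, in seconds."""
    return latency_ms * calls / 1000


CALLS = 300_000

slow_hours = total_seconds(100, CALLS) / 3600   # 100 ms per call
fast_minutes = total_seconds(10, CALLS) / 60    # 10 ms per call

print(f"100 ms per call: {slow_hours:.1f} h")
print(f"10 ms per call: {fast_minutes:.0f} min")
```

A latency that is fine for one API request becomes hours when a pipeline pays it once per record.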

Where is the data coming from?

Is this a face? Who’s this?

Friend? Celebrity?
Data pipelines & AI

AI model Transform
How can 2 processes talk?

Transform

AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read

Data scientist

Sales
What are the read use cases?

Aggregation: Give me summary report of last month’s activity

Full text search: Give me posts that contain the words Donald Trump, Trump or President

Bulk data, filtered: Give me all posts by female, age 18-35

ACID
Denormalization: good or bad?
What is BCNF?
What’s distributed data systems?
Why re-run the pipeline?

AI model Transform
Transform v2
Idempotency & backfill

f(f(x)) = f(x)

POST "/BankAccount/AddFunds"
{ value: 1000, token: TX123 }
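
One common way to make such a request idempotent is to remember which tokens have already been applied. The sketch below assumes that reading of the token field; the BankAccount class and its in-memory set are illustrative only.

```python
class BankAccount:
    """Toy account whose add_funds call is safe to retry."""

    def __init__(self):
        self.balance = 0
        self._applied_tokens = set()  # assumption: tokens identify requests

    def add_funds(self, value, token):
        """Apply the deposit once per token; repeats leave the balance alone."""
        if token in self._applied_tokens:
            return self.balance
        self._applied_tokens.add(token)
        self.balance += value
        return self.balance


account = BankAccount()
first = account.add_funds(value=1000, token="TX123")
retry = account.add_funds(value=1000, token="TX123")  # same request replayed
```

Without the token check, replaying the request during a backfill would credit the account twice.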
Another reason for backfill?
What if the AI model improves?

AI model v2 Transform
AI systems ≠ traditional systems?

93.2%

Deterministic Probabilistic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )

AI Model v2
( accuracy: ?? )
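
One possible answer, sketched under assumptions: keep both, keyed by model version, so a backfill with v2 adds rows instead of overwriting v1. The sink layout, the backfill helper and the two stand-in models are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Sink keyed by (event id, model version).
Sink = Dict[Tuple[int, str], str]


def backfill(events: Iterable[dict], model: Callable[[dict], str],
             version: str, sink: Sink) -> None:
    """Re-run a model over stored events and record its output per version."""
    for event in events:
        sink[(event["id"], version)] = model(event)


def model_v1(event):
    return "label_from_v1"  # stand-in for a real prediction


def model_v2(event):
    return "label_from_v2"  # stand-in for a real prediction


events = [{"id": 12345, "type": "story_posted"}]
sink: Sink = {}
backfill(events, model_v1, "v1", sink)
backfill(events, model_v2, "v2", sink)
backfill(events, model_v2, "v2", sink)  # re-running leaves the sink unchanged
```

Keeping both outputs makes it possible to compare the two versions on the same events before deciding which one the application reads.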
What have we learned?
[BE/FE] Use DL model in app
[DE] Collect data

[DS] Build DL model

[DE] Process data

[DA] Validate DL model

Source: Uber Engineering

Which NFR for Big Data?

• Scalability • Deployability
• Availability • Ease of Development
• Interoperability • Performance
• Portability • Security
• Modifiability • Localization
• Maintainability • Legal
• Testability • Reusability
• Usability • Supportability
• Buildability • Monitorability
Which NFR for Big Data?

Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
Want to learn more about AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)

http://bit.ly/quod-ai-join

Herve Roussel [email protected]
