Architecture of AI Systems - Engineering For Big Data and AI (Grokking)
UploadData.java
upload_data.py
Is this data engineering?
cat console.log | grep "ERROR" > errors.log
Data engineering?
Program → Event data → Transform → Transformed data
cat console.log | grep "ERROR" > errors.log
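The shell one-liner has the same shape as a data pipeline: a source, a transform, and a sink. A minimal Python sketch of that shape, assuming an in-memory list stands in for console.log and another for errors.log (the function names are illustrative and not from the talk):

```python
from typing import Iterable, Iterator, List


def read_source(lines: Iterable[str]) -> Iterator[str]:
    """Source: yield raw log lines one at a time (stands in for `cat console.log`)."""
    for line in lines:
        yield line.rstrip("\n")


def keep_errors(lines: Iterable[str]) -> Iterator[str]:
    """Transform: keep only lines containing ERROR (stands in for `grep "ERROR"`)."""
    return (line for line in lines if "ERROR" in line)


def write_sink(lines: Iterable[str]) -> List[str]:
    """Sink: collect the transformed lines (stands in for `> errors.log`)."""
    return list(lines)


# Hypothetical log content, used only to exercise the pipeline.
console_log = [
    "INFO server started\n",
    "ERROR disk full\n",
    "INFO request served\n",
    "ERROR timeout\n",
]
errors_log = write_sink(keep_errors(read_source(console_log)))
```

Each stage only consumes an iterator and produces one, so stages can be swapped or chained without touching the others.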
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT *
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY posts.timestamp DESC
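To make the query concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The table columns, the join condition, and the WHERE clause are all assumptions made for illustration, since the slide leaves them out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; the slide does not define either.
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        content TEXT, timestamp INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    INSERT INTO posts VALUES (1, 2, 'hello world', 100),
                             (2, 3, 'lunch', 200),
                             (3, 4, 'not a friend', 300);
    INSERT INTO friends VALUES (1, 2), (1, 3);
""")


def news_feed(conn, user_id):
    """Return the posts written by the user's friends, newest first."""
    rows = conn.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (user_id,),
    )
    return rows.fetchall()


feed = news_feed(conn, 1)
```

This works on a single small database. The questions on the following slides (visibility, moderation, notifications, tagging) are what a single query like this cannot answer on its own.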
Chris posted.
Is that good or bad?
Who can see this?
Racist? Vulgar?
Notify? Web, mobile?
Anybody tagged?
As of JULY 8, 2013
Data Engineering
Program
Transform
Augmented data
Source (Event data)
Pipeline (Transform)
Synchronous (10-100 ms)
Asynchronous (3-5 s)
Event source
Why split?
What’s in event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …,
  updated_at: …,
  author_id: 67890,
  …
}

PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted",
  …
}
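The difference between the stored entity and the event it emits can be sketched with two small dataclasses. Only the fields that the slide spells out are modelled; the elided fields are left out, and `to_event` is a hypothetical helper name:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Post:
    """Main data: the current state of a post (timestamps omitted here)."""
    id: int
    content: str
    author_id: int


@dataclass(frozen=True)
class PostCreatedEvent:
    """Event data: an immutable fact that something happened."""
    story_id: int
    type: str


def to_event(post: Post) -> PostCreatedEvent:
    """Build the event emitted when a post is created (hypothetical helper)."""
    return PostCreatedEvent(story_id=post.id, type="story_posted")


post = Post(id=12345, content="hello world", author_id=67890)
event = to_event(post)
payload = json.dumps(asdict(event), sort_keys=True)
```

The event carries a reference to the post rather than a copy of it, so the pipeline can look up whatever else it needs later.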
What’s batch processing?
Job 1
Scheduler
Job 2
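A rough sketch of the batch idea: a scheduler hands the whole set of accumulated events to each job in turn. The job functions and event fields here are made up for illustration, and a real scheduler would trigger `run_batch` on a timer rather than once:

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Job = Callable[[List[Event]], object]


def run_batch(jobs: Dict[str, Job], events: List[Event]) -> Dict[str, object]:
    """Run every job once over the full batch of accumulated events."""
    return {name: job(events) for name, job in jobs.items()}


def count_posts(events: List[Event]) -> int:
    """Job 1 (hypothetical): count story_posted events in the batch."""
    return sum(1 for e in events if e["type"] == "story_posted")


def posts_per_user(events: List[Event]) -> Dict[object, int]:
    """Job 2 (hypothetical): count story_posted events per user."""
    counts: Dict[object, int] = {}
    for e in events:
        if e["type"] == "story_posted":
            counts[e["user_id"]] = counts.get(e["user_id"], 0) + 1
    return counts


events = [
    {"type": "story_posted", "user_id": 67890},
    {"type": "story_posted", "user_id": 67890},
    {"type": "story_liked", "user_id": 11111},
]
results = run_batch({"job_1": count_posts, "job_2": posts_per_user}, events)
```

Results are only as fresh as the last run, which is the latency cost that motivates stream processing.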
Which DB for event source?
How to store events?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
                  MySQL   MongoDB   JSON on S3 (or GCS)
Range read
Sequential read   OK      Good      Very good
Cost              $$      $$$       $
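As a sketch of the "JSON on S3" option, events can be appended as JSON Lines to one file per day and read back with a sequential scan. A local temporary directory stands in for the bucket here, and the per-day file layout and function names are assumptions. A real object store writer would typically buffer events and upload whole files:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def append_event(root: Path, day: str, event: dict) -> None:
    """Append one event as a JSON line to the file for its day."""
    path = root / f"{day}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(root: Path, day: str) -> Iterator[dict]:
    """Sequentially scan one day's events in the order they were written."""
    path = root / f"{day}.jsonl"
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


# The temporary directory stands in for a bucket; the day is illustrative.
root = Path(tempfile.mkdtemp())
append_event(root, "2013-07-08", {"story_id": 12345, "type": "story_posted"})
append_event(root, "2013-07-08", {"story_id": 12346, "type": "story_posted"})
stored = list(read_events(root, "2013-07-08"))
```

This layout makes sequential reads cheap and simple, but there is no index, so random access to a single event means scanning a file.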
Who wants to become architect?
What’s the problem with batch?
LATENCY
Job 1
Scheduler
Job 2
How to process real-time?
Stream processing
How can 2 processes talk?
QUEUE
1K RPS            1.0   5                  10                 10
Sequential read   1.0   10 (with B-TREE)   10 (using Lists)   10
Order guarantee   0.2   10                 0                  10
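The queue idea can be sketched with Python's standard `queue.Queue`. Two threads stand in for the two processes here, which is a simplification: a production system would use a separate queueing service between real processes. The sentinel value and function names are illustrative:

```python
import queue
import threading
from typing import List

SENTINEL = None  # signals the consumer that no more events will arrive


def producer(q: "queue.Queue", events: List[dict]) -> None:
    """Put every event on the queue, then signal completion."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)


def consumer(q: "queue.Queue", sink: List[dict]) -> None:
    """Take events off the queue in arrival order until the sentinel."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        sink.append(event)


q: "queue.Queue" = queue.Queue()
sent = [{"story_id": i, "type": "story_posted"} for i in range(5)]
received: List[dict] = []

t_producer = threading.Thread(target=producer, args=(q, sent))
t_consumer = threading.Thread(target=consumer, args=(q, received))
t_producer.start()
t_consumer.start()
t_producer.join()
t_consumer.join()
```

The producer never waits for the consumer to finish its work, and with one producer and one consumer the events arrive in the order they were sent.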
Transforms
Sink
Functional vs OOP
OOP:
Librarian.startShift()
Books.create()
Catalog.open()
Library.close()

Functional:
find(book)
load_cover(book)
remove(book)
assign(book)
Functional vs OOP
generate_thumbnails(vid_uploaded)
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
alert_subscribers(vid_uploaded)
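In the functional style, each transform is a plain function of the event, so one event can fan out to several independent transforms. The sketch below reuses the four function names from the slide, but the bodies are placeholders and the `video_id` field is an assumption:

```python
from typing import Callable, Dict, List

VideoUploaded = Dict[str, object]
Task = Dict[str, object]


def generate_thumbnails(vid_uploaded: VideoUploaded) -> Task:
    return {"task": "thumbnails", "video_id": vid_uploaded["video_id"]}


def find_similar(vid_uploaded: VideoUploaded) -> Task:
    return {"task": "similar", "video_id": vid_uploaded["video_id"]}


def transcribe_captions(vid_uploaded: VideoUploaded) -> Task:
    return {"task": "captions", "video_id": vid_uploaded["video_id"]}


def alert_subscribers(vid_uploaded: VideoUploaded) -> Task:
    return {"task": "alerts", "video_id": vid_uploaded["video_id"]}


TRANSFORMS: List[Callable[[VideoUploaded], Task]] = [
    generate_thumbnails,
    find_similar,
    transcribe_captions,
    alert_subscribers,
]


def fan_out(vid_uploaded: VideoUploaded) -> List[Task]:
    """Apply every transform to the same event; none of them mutates it."""
    return [transform(vid_uploaded) for transform in TRANSFORMS]


vid_uploaded = {"video_id": 42, "type": "vid_uploaded"}
tasks = fan_out(vid_uploaded)
```

Because no transform depends on another or changes the event, adding a fifth transform means appending one function to the list.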
What’s supporting data?
event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}
Transform
Supporting data: Friends or city DB
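A minimal sketch of a transform that uses supporting data: it looks up the nearest city for the event's coordinates and returns an augmented copy. `CITY_DB` is a hypothetical in-memory stand-in for the city DB with approximate coordinates, and the distance measure is deliberately crude:

```python
from typing import Dict, Sequence, Tuple

# Hypothetical supporting data: city name -> approximate (lat, lon).
CITY_DB: Dict[str, Tuple[float, float]] = {
    "Ho Chi Minh City": (10.82, 106.63),
    "Hanoi": (21.03, 105.85),
}


def nearest_city(coordinates: Sequence[float]) -> str:
    """Return the city whose stored coordinates are closest.

    Uses squared differences of latitude and longitude, which is only
    a rough approximation of real geographic distance.
    """
    lat, lon = coordinates
    return min(
        CITY_DB,
        key=lambda name: (CITY_DB[name][0] - lat) ** 2
        + (CITY_DB[name][1] - lon) ** 2,
    )


def enrich(event: dict) -> dict:
    """Return a copy of the event augmented with a city field."""
    return {**event, "city": nearest_city(event["coordinates"])}


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = enrich(event)
```

The original event is left untouched, so the same event can be enriched again later if the supporting data changes.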
Who uses ext. supporting data?
API vs Pipeline: availability?
⇓ ⇓
AI model Transform
How can 2 processes talk?
Transform
AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read
Data scientist
Sales
What are the read use cases?
AI model Transform
Transform v2
Idempotency & backfill
f(f(x)) = f(x)
POST “/BankAccount/AddFunds”
{ value: 1000, token: TX123 }
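One common way to make a request like this idempotent is to remember which tokens have already been applied and ignore repeats. The sketch below assumes that approach; the `BankAccount` class and its in-memory token set are illustrative, not the talk's implementation:

```python
class BankAccount:
    """Toy account whose add_funds is safe to retry with the same token."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen_tokens = set()  # tokens of requests already applied

    def add_funds(self, value: int, token: str) -> int:
        """Apply the deposit once per token and return the balance."""
        if token in self._seen_tokens:
            return self.balance  # duplicate request: no further effect
        self._seen_tokens.add(token)
        self.balance += value
        return self.balance


account = BankAccount()
first = account.add_funds(1000, "TX123")
retry = account.add_funds(1000, "TX123")  # same request replayed
```

Replaying the request leaves the balance unchanged, which is what makes re-running a pipeline during a backfill safe.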
Another reason for backfill?
What if the AI model improves?
AI model v2 Transform
AI systems ≠ traditional systems?
93.2%
Deterministic Probabilistic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
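One possible answer is to store both: key each output by event and model version, so a backfill with v2 adds new rows next to the v1 rows instead of overwriting them. This is a sketch under that assumption, with placeholder models and an in-memory dict as the sink:

```python
from typing import Callable, Dict, List, Tuple

Sink = Dict[Tuple[int, str], str]


def model_v1(event: dict) -> str:
    """Placeholder for the first model: labels everything as ok."""
    return "ok"


def model_v2(event: dict) -> str:
    """Placeholder for an improved model: flags content containing 'spam'."""
    return "flagged" if "spam" in event["content"] else "ok"


def run_model(sink: Sink, events: List[dict], version: str,
              model: Callable[[dict], str]) -> None:
    """Write one output per (event id, model version); reruns overwrite in place."""
    for event in events:
        sink[(event["id"], version)] = model(event)


sink: Sink = {}
events = [
    {"id": 1, "content": "hello world"},
    {"id": 2, "content": "spam spam"},
]
run_model(sink, events, "v1", model_v1)
run_model(sink, events, "v2", model_v2)  # backfill with the new model
run_model(sink, events, "v2", model_v2)  # rerunning adds no duplicates
```

Keeping both versions side by side lets the outputs be compared before deciding which model the application should read from.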
What have we learned?
[BE/FE] Use DL model in app
[DE] Collect data
Which NFR for Big Data?
• Scalability • Deployability
• Availability • Ease of Development
• Interoperability • Performance
• Portability • Security
• Modifiability • Localization
• Maintainability • Legal
• Testability • Reusability
• Usability • Supportability
• Buildability • Monitorability
What have we learned?
Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
Want to learn more about AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join