ENGINEERING
PIPELINES
Part 1: Data Storage
1
For each section, start the GPU task
before going through the lecture
2
WELCOME!
3
Main Goal:
How do we organize and process an
unexplored dataset to produce
actionable insight?
4
THE GOALS OF THIS COURSE
Part 3: Data
Visualization
AGENDA – PART 1
• Systems Engineering
• File Formats
• Data Frameworks
• Lab
SYSTEMS ENGINEERING
8
SYSTEM EXAMPLE
Modeling a Car
[Diagram: Energy, Tires, and Lights feed into "Let's Ride!"]
9
SYSTEM EXAMPLE
Modeling a Car
[Diagram: Energy, Tires, and Lights feed into "Let's Ride!"; annotations: 3/4, 1/2, 3/3]
10
SYSTEM EXAMPLE
Modeling a Car
[Diagram: Energy, Tires, and Lights; annotation: 3/3 (.999); outcome: "No go!"]
11
SYSTEMS BIG AND SMALL
12
SYSTEM ENGINEERING FOR DATA
Data Algorithms | Data Hardware | Data Exchange
[Diagram: three "Calculate Age" tasks feed into "Average Age"; a CPU; a Client and a Server exchanging the steps below]
1: User asks for website
2: Server returns webpage
3: User requests filtered data
6: Server returns filtered data
13
DATA AS A SYSTEM OF
ALGORITHMS
14
DATABASE SYSTEM DESIGN
Enhanced Entity Relationship Diagram
[Diagram: entity "Car" (1) is linked by the relationship "Has" to entity "Part" (M); Part has the attributes UPC, Brand, and Quantity]
15
DATABASE SYSTEM DESIGN
From Design to Practice
Cars.csv
VIN, Model, Year
1a2b3c, Sedan, 1986
4d5e6g, Convertible, 2011
7h8i9j, Sedan, 1997

Parts.csv
VIN, UPC, Brand, Quantity
1a2b3c, 8675309, Generic Lights, 2
1a2b3c, 8675310, Generic Tires, 4
4d5e6g, 8675309, Awesome Lights, 2
4d5e6g, 8675310, Awesome Tires, 4

Cars.json
[
  {"VIN": "1a2b3c",
   "Model": "Sedan",
   "Year": 1986,
   "Parts": [
     {"UPC": 8675309,
      "Brand": "Generic Lights",
      "Quantity": 2
     },
     {"UPC": 8675310,
      "Brand": "Generic Tires",
      …
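The step from the two CSV tables to the nested JSON document can be sketched in a few lines. This is a minimal illustration, not the course's lab code: `nest_parts` is a hypothetical helper, and the sample rows are copied from the slide with the spaces after the commas removed.

```python
import csv
import io
import json

# Sample rows from the slide's Cars.csv and Parts.csv (whitespace trimmed).
CARS_CSV = """VIN,Model,Year
1a2b3c,Sedan,1986
4d5e6g,Convertible,2011
7h8i9j,Sedan,1997
"""
PARTS_CSV = """VIN,UPC,Brand,Quantity
1a2b3c,8675309,Generic Lights,2
1a2b3c,8675310,Generic Tires,4
4d5e6g,8675309,Awesome Lights,2
4d5e6g,8675310,Awesome Tires,4
"""


def nest_parts(cars_text, parts_text):
    """Hypothetical helper: group each part row under its car, keyed by VIN."""
    cars = {}
    for row in csv.DictReader(io.StringIO(cars_text)):
        cars[row["VIN"]] = {
            "VIN": row["VIN"],
            "Model": row["Model"],
            "Year": int(row["Year"]),
            "Parts": [],
        }
    for row in csv.DictReader(io.StringIO(parts_text)):
        cars[row["VIN"]]["Parts"].append({
            "UPC": int(row["UPC"]),
            "Brand": row["Brand"],
            "Quantity": int(row["Quantity"]),
        })
    return list(cars.values())


cars_json = nest_parts(CARS_CSV, PARTS_CSV)
print(json.dumps(cars_json[0], indent=2))
```

The CSV pair keeps one fact per row and links the tables through VIN; the JSON form repeats no keys across files but stores each car's parts inside the car record.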
16
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs
[Diagram: two "Calculate Age" tasks feed into "Map"]
17
DATABASE SYSTEM DESIGN
Data Quality
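[Diagram: the entity relationship diagram (Car (1), Has, Part (M), with Part attributes UPC, Brand, Quantity) next to a graph of three "Calculate Age" tasks feeding "Average Age"]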
18
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs
[Diagram, MapReduce: three "Calculate Age" tasks feed into "Average Age"]
[Diagram, ETL: Drivers and Cars, Join on VID, Denormalized Table]
[Diagram, Network Routing: two graphs over the nodes A, B, C, D, E]
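The MapReduce graph (several "Calculate Age" tasks feeding one "Average Age" task) can be sketched with Python's built-in `map` and `functools.reduce`. The birth years are borrowed from the people table used on later slides, and `REFERENCE_YEAR` is an assumed value chosen only to make the arithmetic concrete.

```python
from functools import reduce

# Birth years borrowed from the people table on later slides.
birth_years = [1906, 1623, 1918, 1912]
REFERENCE_YEAR = 2024  # assumption: the year ages are measured against


def calculate_age(born):
    """Map step: one independent task per record."""
    return REFERENCE_YEAR - born


# Map: each record is processed on its own, so the tasks could run in parallel.
ages = list(map(calculate_age, birth_years))

# Reduce: combine the mapped values into a running (total, count) pair.
total, count = reduce(lambda acc, age: (acc[0] + age, acc[1] + 1), ages, (0, 0))
average_age = total / count

print(ages, average_age)
```

Because no map task depends on another, the graph has no cycles, which is what makes it a directed acyclic graph.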
19
FILE FORMATS
20
DATA FORMATS
Pick 1 - 2
• Flexible: supports unique schemas
• Functional: usable in many contexts
• Scalable: works well with many data points
• Scalable to read or to write?
• Scalable with speed or cost?
• Flexible in the data store or flexible in the application?
• Functional for the server or functional for the client?
21
PICKING THE BEST FORMAT
CRUD
• Create
• Add a record
• Read
• Get record
• Update
• Change a record
• Delete
• Remove a record
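The four CRUD operations map directly onto SQL statements. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `people` table and its first value (a deliberately wrong birth year that the update corrects) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, born INTEGER)")
select_all = "SELECT first, last, born FROM people"

# Create: add a record (birth year is wrong on purpose, fixed below).
conn.execute("INSERT INTO people VALUES (?, ?, ?)", ("Grace", "Hopper", 1907))

# Read: get the record back.
after_create = conn.execute(select_all).fetchall()

# Update: change a record.
conn.execute("UPDATE people SET born = ? WHERE last = ?", (1906, "Hopper"))
after_update = conn.execute(select_all).fetchall()

# Delete: remove a record.
conn.execute("DELETE FROM people WHERE last = ?", ("Hopper",))
after_delete = conn.execute(select_all).fetchall()

conn.close()
print(after_create, after_update, after_delete)
```

Which of the four operations dominates a workload is a useful first question when picking a format.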
22
ROW VS COLUMNAR STORAGE
Row: efficient for adding a new record
• Formats: CSV (Comma-Separated Values), TSV (Tab-Separated Values), Apache Avro
• Engines: MySQL, PostgreSQL
Columnar: efficient for data aggregation
• Formats: Apache Parquet
• Engines: BigQuery, Snowflake, Redshift
23
WRITING
Adding a New Entry
Table (First, Last, Born)
Grace Hopper 1906
Blaise Pascal 1623
Katherine Johnson 1918
+ Alan Turing 1912
Row Formatted Data
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
+ Alan | Turing | 1912
The new entry can be concatenated to the end, or inserted by row number.
Column Formatted Data
Grace | Blaise | Katherine | + Alan
Hopper | Pascal | Johnson | + Turing
1906 | 1623 | 1918 | + 1912
The new entry is broken up and inserted at the end of each block.
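The difference in write cost can be sketched with plain Python containers. This is only an in-memory analogy for the two layouts (a list of tuples for rows, a dict of lists for columns), not how any particular file format is implemented.

```python
# Row layout: one tuple per record.
rows = [
    ("Grace", "Hopper", 1906),
    ("Blaise", "Pascal", 1623),
    ("Katherine", "Johnson", 1918),
]

# Column layout: one list (block) per field.
columns = {
    "first": ["Grace", "Blaise", "Katherine"],
    "last": ["Hopper", "Pascal", "Johnson"],
    "born": [1906, 1623, 1918],
}

new_entry = ("Alan", "Turing", 1912)

# Row layout: the whole record goes to the end in a single write.
rows.append(new_entry)

# Column layout: the record is broken up, one write per column block.
for name, value in zip(columns, new_entry):
    columns[name].append(value)

print(rows[-1], [block[-1] for block in columns.values()])
```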
24
ANALYSIS
A.K.A. Feature Engineering
Table (First, Last, In.)
Grace Hopper GH
Blaise Pascal BP
Katherine Johnson KJ
Alan Turing AT
Row Formatted Data
Grace | Hopper | + GH
Blaise | Pascal | + BP
Katherine | Johnson | + KJ
The new column is broken up and inserted at the end of each block.
Column Formatted Data
Grace | Blaise | Katherine
Hopper | Pascal | Johnson
+ GH | BP | KJ
The new column can be concatenated to the end, or inserted by column number.
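Adding a derived feature reverses the cost picture. In the sketch below, again an in-memory analogy rather than a real file format, the columnar layout gains one new block while the row layout has to rewrite every record.

```python
# Row layout: one tuple per record.
rows = [
    ("Grace", "Hopper"),
    ("Blaise", "Pascal"),
    ("Katherine", "Johnson"),
    ("Alan", "Turing"),
]

# Column layout: one list (block) per field.
columns = {
    "first": ["Grace", "Blaise", "Katherine", "Alan"],
    "last": ["Hopper", "Pascal", "Johnson", "Turing"],
}

# Column layout: the new feature is a single extra block added at the end.
columns["initials"] = [
    first[0] + last[0]
    for first, last in zip(columns["first"], columns["last"])
]

# Row layout: every record is touched to add its piece of the new column.
rows = [(first, last, first[0] + last[0]) for first, last in rows]

print(columns["initials"], rows[0])
```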
25
BINARY
Ex: Multimedia File
Pro
• Faster to send and process
• Flexible
• Many datatypes can easily be converted to binary
• Great for images
Con
• … decoding software
• Difficult to debug data integrity
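A small sketch of converting typed values to binary and back, using Python's standard `struct` module. The record layout (`"<Id"`: a little-endian 4-byte unsigned integer followed by an 8-byte float) and the sample values are assumptions made for illustration.

```python
import struct

# Hypothetical record layout: little-endian unsigned int + double.
RECORD = struct.Struct("<Id")

# Encode: typed values become a fixed-size run of bytes.
packed = RECORD.pack(1906, 85.5)

# Decode: only possible when the reader knows the same layout.
year, value = RECORD.unpack(packed)

print(len(packed), packed, year, value)
```

The printed bytes are not readable on their own, which is the debugging cost listed above: the reader needs software that knows the layout.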
26
ASCII
Ex: CSV
Pro
• No file metadata
• File is human readable
• Average scalability
• Easy to join and split multiple CSV files
• Easy to append a new entry
Con
• No file metadata
• Average scalability
• Data is not compressed as much as other file types
27
PARQUET
Ex: Hadoop
Pro
• … repeated values
• Efficient to read a subset of columns
• Support for complex datatypes like arrays
Con
• Query results are typically saved in a new file
• Querying for all the attributes of an entity is an expensive operation
• Files are not human readable without a tool
28
DATA FORMATS COMPARISON
Summary
29
DATA FRAMEWORKS
30
VERTICAL VS HORIZONTAL SCALING
↑ Vertical ↑: scales to higher quality hardware
• SQL
• CuPy
• NumPy
• cuDF
• pandas
← Horizontal →: scales to more partitions / machines
• Dask
• NoSQL
• Spark
• Hadoop
31
SQL
Structured Query Language
Query against table awesome.people (columns: first, last, born)
Result (columns: first, last)
Grace Hopper
Katherine Johnson
Alan Turing
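The query text did not survive on this slide, so the sketch below is a guess at its shape rather than the original. It assumes the table holds the four people used elsewhere in the deck and that the filter is `born > 1900`, which reproduces the result shown. It runs on Python's standard `sqlite3` module, attaching an in-memory database under the name `awesome` so that `awesome.people` resolves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database so the name awesome.people is valid.
conn.execute("ATTACH DATABASE ':memory:' AS awesome")
conn.execute("CREATE TABLE awesome.people (first TEXT, last TEXT, born INTEGER)")
conn.executemany(
    "INSERT INTO awesome.people VALUES (?, ?, ?)",
    [
        ("Grace", "Hopper", 1906),
        ("Blaise", "Pascal", 1623),
        ("Katherine", "Johnson", 1918),
        ("Alan", "Turing", 1912),
    ],
)

# Assumed query: choose two columns and filter on the third.
query = "SELECT first, last FROM awesome.people WHERE born > 1900"
result = conn.execute(query).fetchall()
conn.close()

for first, last in result:
    print(first, last)
```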
32
DATAFRAMES
Pandas (CPU) and cuDF (GPU)
Query against table df (columns: first, last, born)
Result (columns: first, last)
Grace Hopper
Katherine Johnson
Alan Turing
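A DataFrame version of the same selection, sketched with pandas. As with the SQL slide, the original query is not recoverable, so the `born > 1900` filter and the table contents are assumptions chosen to reproduce the result shown. cuDF aims to mirror the pandas API, so a similar selection is expected to work on a `cudf.DataFrame`, though that is not exercised here.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["Grace", "Blaise", "Katherine", "Alan"],
    "last": ["Hopper", "Pascal", "Johnson", "Turing"],
    "born": [1906, 1623, 1918, 1912],
})

# Assumed query: boolean mask on rows, list of labels on columns.
result = df.loc[df["born"] > 1900, ["first", "last"]]

print(result)
```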
33
MATRICES AND NUMBER ARRAYS
NumPy (CPU) and CuPy (GPU)
Array a
8 6 7
5 3 0
9 8 6
7 5 3
Query
a = a.sum(axis = 0)
Result
29 22 16
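The slide's array and query, written out as a runnable NumPy sketch. `axis=0` collapses the rows and leaves one sum per column; the `axis=1` line is an addition for contrast. CuPy is designed to follow the NumPy interface, so the same calls are expected to work on a `cupy` array, which is not tested here.

```python
import numpy as np

a = np.array([
    [8, 6, 7],
    [5, 3, 0],
    [9, 8, 6],
    [7, 5, 3],
])

# axis=0 sums down the rows: one value per column.
column_sums = a.sum(axis=0)

# axis=1 sums across the columns: one value per row (added for contrast).
row_sums = a.sum(axis=1)

print(column_sums, row_sums)
```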
34
DASK SCALES PYTHON ANALYTICS
SCALE FROM A LAPTOP TO LARGE-SCALE CLUSTERS WITH EASE
[Diagram: two "Calculate Age" tasks feed into "Map"]
36
LAZY EXECUTION
Building a Factory
DAG
[Diagram: three "Calculate Age" tasks feed into "Average Age"]
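The "building a factory" idea can be sketched in plain Python: describe the graph first, then run it only when a result is requested. The `Lazy` class below is a hypothetical stand-in written for this illustration and is not Dask's API (Dask offers `dask.delayed` for a similar pattern). The birth years and reference year are assumed sample values.

```python
class Lazy:
    """Hypothetical task-graph node; nothing runs until compute() is called."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps

    def compute(self):
        # Resolve upstream nodes first, then run this node's function.
        args = [d.compute() if isinstance(d, Lazy) else d for d in self.deps]
        return self.func(*args)


calls = []  # records which tasks have actually executed


def calculate_age(born, year=2024):
    calls.append(born)
    return year - born


def average(*ages):
    return sum(ages) / len(ages)


# Build the factory: three "Calculate Age" nodes feeding one "Average Age" node.
age_tasks = [Lazy(calculate_age, born) for born in (1906, 1918, 1912)]
average_age = Lazy(average, *age_tasks)

ran_before = list(calls)        # still empty: the graph is only a description
result = average_age.compute()  # now the whole graph executes

print(ran_before, result, calls)
```

Deferring the work is also why errors tend to surface only at `compute()` time, far from the line that defined the failing task.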
37
RELATIONAL DATABASES
Ex: SQL
Pro
• Concise Language
• Relatively fast querying
• Foreign keys
• Blazing SQL
Con
• Some objects do not convert well to table format
• Typically, single server
• More expensive hardware needed to scale
38
DATAFRAME
Ex: cuDF, Pandas, R
Pro
• cuDF, Pandas
• Compared to SQL, more flexible operations
• Easier to make user-defined functions and integrate third party libraries
Con
• … large-scale data manipulation
• Consider Spark instead
• Compared to SQL, not as scalable
39
DASK
Ex: Dask DataFrame, Dask-cuDF
Pro
• … receive a significant speed increase
• Can read large data sources due to partitioning
Con
• … not worth it for small files or limited computation
• Lazy execution can make it tricky to debug
40
LAB
41
WEATHER SYSTEMS
42
INVESTIGATING WATER LEVEL
43
LET’S GO!
44