
ACCELERATING DATA ENGINEERING PIPELINES
Part 1: Data Storage

1
For each section, start the GPU task before going through the lecture.

To see the lecture notes, make the presentation full screen and click the "notes" button.
2
WELCOME!
3
Main Goal:
How do we organize and process an unexplored dataset to produce actionable insight?

4
THE GOALS OF THIS COURSE

• Get used to many different data types / frameworks and how they operate on GPU vs CPU
• Understand how DAG-based frameworks can speed up ETL
• Learn how to visualize data to:
  • Assess data quality
  • Allow users to make their own decisions through interactivity
AGENDA

Part 1: Data Formats
Part 2: ETL with NVTabular
Part 3: Data Visualization
AGENDA – PART 1
• Systems Engineering
• File Formats
• Data Frameworks
• Lab
SYSTEMS ENGINEERING
8
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the car as a system: Energy, Tires, and Lights subsystems all feed into "Let's Ride!")
9
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the system expanded: Tires needs 3 of 4 tires working, Lights needs 1 of 2 lights working, and "Let's Ride!" needs all 3 of Energy, Tires, and Lights.)
10
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the same system with component reliabilities: each tire 0.99 and each light 0.98, giving 0.9994 for Tires [3 of 4], 0.9996 for Lights [1 of 2], and 0.999 overall [3 of 3]. Result: "No go!")
11
SYSTEMS BIG AND SMALL

(Diagram: zooming between the Details and the Big Picture.)

12
SYSTEM ENGINEERING FOR DATA
Data Algorithms | Data Hardware | Data Exchange

(Diagram: a Client talks to a CPU Server: 1: user asks for website, 2: server returns webpage, 3: user requests filtered data, ... 6: server returns filtered data. On the server, Calculate Age tasks feed an Average Age step.)
13
DATA AS A SYSTEM OF ALGORITHMS
14
DATABASE SYSTEM DESIGN
Enhanced Entity Relationship Diagram

(Diagram: entity Car with attributes VIN, Model, Year; entity Part with attributes UPC, Brand, Quantity; a "Has" relationship from Car (1) to Part (M).)
15
DATABASE SYSTEM DESIGN
From Design to Practice

Cars.csv
VIN, Model, Year
1a2b3c, Sedan, 1986
4d5e6g, Convertible, 2011
7h8i9j, Sedan, 1997

Parts.csv
VIN, UPC, Brand, Quantity
1a2b3c, 8675309, Generic Lights, 2
1a2b3c, 8675310, Generic Tires, 4
4d5e6g, 8675309, Awesome Lights, 2
4d5e6g, 8675310, Awesome Tires, 4

Cars.json
[{
  "VIN": "1a2b3c",
  "Model": "Sedan",
  "Year": 1986,
  "Parts": [
    {"UPC": 8675309, "Brand": "Generic Lights", "Quantity": 2},
    {"UPC": 8675310, "Brand": "Generic Tires", "Quantity": 4}
  ]
}, …
16
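A minimal sketch, assuming Cars.csv and Parts.csv are on disk with exactly the headers shown above, of how the normalized CSV tables could be nested into the JSON layout:

import csv
import json

# Load both normalized tables (headers as shown on the slide).
with open("Cars.csv", newline="") as f:
    cars = list(csv.DictReader(f, skipinitialspace=True))
with open("Parts.csv", newline="") as f:
    parts = list(csv.DictReader(f, skipinitialspace=True))

# Nest each car's parts under it, matching on VIN.
for car in cars:
    car["Parts"] = [{k: v for k, v in p.items() if k != "VIN"}
                    for p in parts if p["VIN"] == car["VIN"]]

print(json.dumps(cars, indent=2))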
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs

(Diagram: three parallel "Calculate Age" tasks (Map) feed a single "Average Age" task (Reduce).)
17
DATABASE SYSTEM DESIGN
Data Quality

(Diagram: the Car/Part entity relationship model shown alongside the Calculate Age / Average Age DAG.)
18
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs

(Diagram: DAGs show up across data systems: MapReduce (Calculate Age tasks reduced to an Average Age), ETL (Drivers and Cars tables joined on VID into a denormalized table), and network routing (paths through nodes A, B, C, D, E).)
19
FILE FORMATS
20
DATA FORMATS
Pick 1 - 2

(Diagram: three overlapping properties:
• Flexible: supports unique schemas
• Functional: works well in many contexts
• Scalable: usable with many data points)

These definitions vary based on context:
• Scalable to read or to write?
• Scalable with speed or cost?
• Flexible in the data store or flexible in the application?
• Functional for the server or functional for the client?
21
PICKING THE BEST FORMAT

CRUD
• Create
• Add a record

• Read
• Get record

• Update
• Change a record

• Delete
• Remove a record

22
ROW VS COLUMNAR STORAGE

Row: efficient for adding a new record
• Formats: CSV (Comma-Separated Values), TSV (Tab-Separated Values), Apache Avro
• Engines: MySQL, PostgreSQL

Columnar: efficient for data aggregation
• Formats: Apache Parquet
• Engines: BigQuery, Snowflake, Redshift
23
WRITING
Adding a New Entry

New record: Alan | Turing | 1912

Row formatted data:
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
+ Alan | Turing | 1912
The new record can be concatenated to the end, or inserted by row number.

Column formatted data:
Grace | Blaise | Katherine | Alan
Hopper | Pascal | Johnson | Turing
1906 | 1623 | 1918 | 1912
The new record is broken up and inserted at the end of each column block.
24
ANALYSIS
A.K.A. Feature Engineering

New feature: initials (In.) computed from First and Last
Grace Hopper > GH, Blaise Pascal > BP, Katherine Johnson > KJ, Alan Turing > AT

Row formatted data:
Grace | Hopper | GH
Blaise | Pascal | BP
Katherine | Johnson | KJ
Alan | Turing | AT
The new column is broken up and inserted at the end of each row block.

Column formatted data:
Grace | Blaise | Katherine | Alan
Hopper | Pascal | Johnson | Turing
GH | BP | KJ | AT
The new column can be concatenated to the end, or inserted by column number.
25
BINARY
Ex: Multimedia File

Pro:
• Compact
• Faster to send and process
• Flexible
• Many datatypes can easily be converted to binary
• Great for images

Con:
• Hard to visualize without decoding software
• Difficult to debug data integrity
26
ASCII
Ex: CSV

Pro:
• Simple structure
• No file metadata
• File is human readable
• Average scalability
• Easy to join and split multiple CSV files
• Easy to append a new entry

Con:
• Simple structure
• No file metadata
• Average scalability
• Data is not compressed as much as other file types
27
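The "easy to append a new entry" point fits in a couple of lines of Python; a minimal sketch, with the file name people.csv assumed only for illustration:

import csv

# Appending a record to a row-oriented file is just a write at the end.
with open("people.csv", "a", newline="") as f:
    csv.writer(f).writerow(["Alan", "Turing", 1912])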
PARQUET
Ex: Hadoop

Pro:
• Good compression if many repeated values
• Efficient to read a subset of columns
• Support for complex datatypes like arrays

Con:
• Immutable: query results are typically saved in a new file
• Querying for all the attributes of an entity is an expensive operation
• Files are not human readable without a tool

28
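A minimal sketch of the column-subset advantage, assuming pandas with a Parquet engine (pyarrow or fastparquet) installed; the file name and data are made up for illustration:

import pandas as pd

# Write a small table to Parquet.
cars = pd.DataFrame({
    "VIN": ["1a2b3c", "4d5e6g", "7h8i9j"],
    "Model": ["Sedan", "Convertible", "Sedan"],
    "Year": [1986, 2011, 1997],
})
cars.to_parquet("cars.parquet")

# Read back only two columns: the columnar layout means the other
# columns never have to be scanned.
subset = pd.read_parquet("cars.parquet", columns=["VIN", "Year"])
print(subset)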
DATA FORMATS COMPARISON
Summary

(Table: CSV, JSON, Parquet, and Avro compared on the following properties:)
• Columnar
• Compressible
• Splittable
• Human readable
• Complex data structure
• Schema evolution/validation
• Binary

29
DATA FRAMEWORKS
30
VERTICAL VS HORIZONTAL SCALING

↑ Vertical ↑ ← Horizontal →
Scales to higher quality hardware Scales to more partitions /
machines

• SQL
• Dask
• CuPy
• NoSQL
• NumPy
• Spark
• cuDF
• Hadoop
• pandas

31
SQL
Structured Query Language

Table (awesome.people):
first | last | born
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
Alan | Turing | 1912

Query:
SELECT first, last
FROM awesome.people
WHERE born > 1900

Result:
first | last
Grace | Hopper
Katherine | Johnson
Alan | Turing
32
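A self-contained way to try the query above is Python's built-in sqlite3; a minimal sketch, with the table name flattened to awesome_people because SQLite has no schema prefix like awesome.people:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE awesome_people (first TEXT, last TEXT, born INTEGER)")
con.executemany(
    "INSERT INTO awesome_people VALUES (?, ?, ?)",
    [("Grace", "Hopper", 1906), ("Blaise", "Pascal", 1623),
     ("Katherine", "Johnson", 1918), ("Alan", "Turing", 1912)],
)

# Same filter and projection as on the slide.
for first, last in con.execute(
        "SELECT first, last FROM awesome_people WHERE born > 1900"):
    print(first, last)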
DATAFRAMES
Pandas (CPU) and cuDF (GPU)

Table (df):
first | last | born
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
Alan | Turing | 1912

Query:
df = df[df["born"] > 1900]
df = df[["first", "last"]]

Result:
first | last
Grace | Hopper
Katherine | Johnson
Alan | Turing
33
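A runnable version of the slide's query in pandas. cuDF follows the pandas API, so swapping the import for cudf is expected to run the same lines on the GPU, assuming a RAPIDS install:

import pandas as pd   # or: import cudf as pd  (GPU variant, assuming RAPIDS is installed)

df = pd.DataFrame({
    "first": ["Grace", "Blaise", "Katherine", "Alan"],
    "last": ["Hopper", "Pascal", "Johnson", "Turing"],
    "born": [1906, 1623, 1918, 1912],
})

df = df[df["born"] > 1900]    # keep people born after 1900
df = df[["first", "last"]]    # keep only the first/last columns
print(df)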
MATRICES AND NUMBER ARRAYS
NumPy (CPU) and CuPy (GPU)

Array (a):
8 6 7
5 3 0
9 8 6
7 5 3

Query:
a = a.sum(axis=0)

Result:
29 22 16
34
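The same reduction as a runnable NumPy snippet; CuPy mirrors the NumPy API, so importing cupy instead is the usual GPU swap (assuming CuPy is installed):

import numpy as np   # or: import cupy as np  (GPU variant, assuming CuPy is installed)

a = np.array([[8, 6, 7],
              [5, 3, 0],
              [9, 8, 6],
              [7, 5, 3]])

print(a.sum(axis=0))   # column-wise sum -> [29 22 16]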
DASK SCALES PYTHON ANALYTICS
SCALE FROM A LAPTOP TO LARGE-SCALE CLUSTERS WITH EASE

(Figure: Time Series Data)

• Dask enables data scientists to scale out analytics workloads in native Python. With an optimized scheduler, Dask makes it easy to schedule and execute tasks across distributed compute.

• Dask follows the standards set by the PyData ecosystem to provide a familiar, comfortable user experience at scale. When paired with NVTabular/RAPIDS, data scientists can leverage the processing power of NVIDIA accelerated compute and distribute work across clusters to improve cycle time, drastically reducing time to insight.
35
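A minimal sketch of scaling a pandas-style workflow with Dask; the file pattern and column names are made up for illustration, and Dask-cuDF would be the GPU-backed variant:

import dask.dataframe as dd

# Many CSV files become many partitions; nothing is read yet.
ddf = dd.read_csv("measurements-*.csv")

# Build the computation lazily, then trigger it across the cluster (or local threads).
mean_level = ddf.groupby("station")["water_level"].mean()
print(mean_level.compute())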
MAPREDUCE
Map to each thread, Reduce all threads to one

(Diagram: three parallel "Calculate Age" tasks (Map) feed a single "Average Age" task (Reduce).)
36
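A toy illustration of the pattern in plain Python: the "Calculate Age" map step runs once per record on a thread pool, and the reduce step collapses the results into one average. The birth years and the reference year 2024 are assumed only for illustration:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce

born = [1906, 1623, 1918, 1912]

def calculate_age(year):                 # Map: one task per record
    return 2024 - year

with ThreadPoolExecutor() as pool:
    ages = list(pool.map(calculate_age, born))

average_age = reduce(lambda a, b: a + b, ages) / len(ages)   # Reduce: all results to one
print(average_age)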
LAZY EXECUTION
Building a Factory

(Diagram: the DAG of "Calculate Age" tasks feeding "Average Age" is built first, like designing a factory; nothing runs until the graph is executed.)
37
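A minimal sketch of lazy execution with dask.delayed (assuming Dask is installed): the calls below only build the task graph, the "factory", and nothing runs until .compute() is called. The functions and values are assumed for illustration:

import dask

@dask.delayed
def calculate_age(year):
    return 2024 - year

@dask.delayed
def average(ages):
    return sum(ages) / len(ages)

# Building the DAG: no ages are calculated yet.
graph = average([calculate_age(y) for y in [1906, 1623, 1918, 1912]])

# Executing the DAG.
print(graph.compute())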
RELATIONAL DATABASES
Ex: SQL

Pro:
• Well known
• Concise language
• Relatively fast querying
• Foreign keys
• Blazing SQL

Con:
• Inflexible data structure
• Some objects do not convert well to table format
• Typically single server
• More expensive hardware needed to scale
38
DATAFRAME
Ex: cuDF, Pandas, R

Pro:
• Python and R APIs (cuDF, Pandas)
• Compared to SQL, more flexible operations
• Easier to make user-defined functions and integrate third-party libraries

Con:
• Single server, not meant for large-scale data manipulation (consider Spark instead)
• Compared to SQL, not as scalable
39
DASK
Ex: Dask DataFrame, Dask-cuDF

Pro:
• Large computations can receive a significant speed increase
• Can read large data sources due to partitioning

Con:
• Large overhead to set up; not worth it for small files or limited computation
• Lazy execution can make it tricky to debug

40
LAB
41
WEATHER SYSTEMS

Credit: Ralph F. Kresge, Submitted to NOAA

42
INVESTIGATING WATER LEVEL

43
LET’S GO!

44
45
