
ACCELERATING DATA ENGINEERING PIPELINES
Part 1: Data Storage

1
For each section, start the GPU task before going through the lecture.

To see the lecture notes, make the presentation full screen and click the "notes" button.
2
WELCOME!
3
Main Goal:
How do we organize and process an unexplored dataset to produce actionable insight?

4
THE GOALS OF THIS COURSE

• Get used to many different data types / frameworks and how they operate on GPU vs CPU
• Understand how DAG-based frameworks can speed up ETL
• Learn how to visualize data to:
  • Assess data quality
  • Allow users to make their own decisions through interactivity
AGENDA

Part 1: Data Formats
Part 2: ETL with NVTabular
Part 3: Data Visualization
AGENDA – PART 1
• Systems Engineering
• File Formats
• Data Frameworks
• Lab
SYSTEMS ENGINEERING
8
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the car as a system: Energy, Tires, and Lights subsystems all feed into "Let's Ride!")
9
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the system expanded: Tires needs 3 of 4 tires working, Lights needs 1 of 2 lights working, and "Let's Ride!" needs all 3 of Energy, Tires, and Lights.)
10
SYSTEM EXAMPLE
Modeling a Car

(Diagram: the same system with component reliabilities: each tire 0.99 and each light 0.98, giving 0.9994 for Tires [3 of 4], 0.9996 for Lights [1 of 2], and 0.999 overall [3 of 3]. Result: "No go!")
11
SYSTEMS BIG AND SMALL

(Diagram: zooming between the Details and the Big Picture.)

12
SYSTEM ENGINEERING FOR DATA
Data Algorithms | Data Hardware | Data Exchange

(Diagram: a Client talks to a CPU Server: 1: user asks for website, 2: server returns webpage, 3: user requests filtered data, ... 6: server returns filtered data. On the server, Calculate Age tasks feed an Average Age step.)
13
DATA AS A SYSTEM OF ALGORITHMS
14
DATABASE SYSTEM DESIGN
Enhanced Entity Relationship Diagram

(Diagram: entity Car with attributes VIN, Model, Year; entity Part with attributes UPC, Brand, Quantity; a "Has" relationship from Car (1) to Part (M).)
15
DATABASE SYSTEM DESIGN
From Design to Practice

Cars.csv
VIN, Model, Year
1a2b3c, Sedan, 1986
4d5e6g, Convertible, 2011
7h8i9j, Sedan, 1997

Parts.csv
VIN, UPC, Brand, Quantity
1a2b3c, 8675309, Generic Lights, 2
1a2b3c, 8675310, Generic Tires, 4
4d5e6g, 8675309, Awesome Lights, 2
4d5e6g, 8675310, Awesome Tires, 4

Cars.json
[{
  "VIN": "1a2b3c",
  "Model": "Sedan",
  "Year": 1986,
  "Parts": [
    {"UPC": 8675309, "Brand": "Generic Lights", "Quantity": 2},
    {"UPC": 8675310, "Brand": "Generic Tires", "Quantity": 4}
  ]
}, …
16
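A minimal sketch, assuming Cars.csv and Parts.csv are on disk with exactly the headers shown above, of how the normalized CSV tables could be nested into the JSON layout:

import csv
import json

# Load both normalized tables (headers as shown on the slide).
with open("Cars.csv", newline="") as f:
    cars = list(csv.DictReader(f, skipinitialspace=True))
with open("Parts.csv", newline="") as f:
    parts = list(csv.DictReader(f, skipinitialspace=True))

# Nest each car's parts under it, matching on VIN.
for car in cars:
    car["Parts"] = [{k: v for k, v in p.items() if k != "VIN"}
                    for p in parts if p["VIN"] == car["VIN"]]

print(json.dumps(cars, indent=2))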
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs

(Diagram: three parallel "Calculate Age" tasks (Map) feed a single "Average Age" task (Reduce).)
17
DATABASE SYSTEM DESIGN
Data Quality

(Diagram: the Car/Part entity relationship model shown alongside the Calculate Age / Average Age DAG.)
18
DATABASE SYSTEM DESIGN
Directed Acyclic Graphs

(Diagram: DAGs show up across data systems: MapReduce (Calculate Age tasks reduced to an Average Age), ETL (Drivers and Cars tables joined on VID into a denormalized table), and network routing (paths through nodes A, B, C, D, E).)
19
FILE FORMATS
20
DATA FORMATS
Pick 1 - 2

(Diagram: three overlapping properties:
• Flexible: supports unique schemas
• Functional: works well in many contexts
• Scalable: usable with many data points)

These definitions vary based on context:
• Scalable to read or to write?
• Scalable with speed or cost?
• Flexible in the data store or flexible in the application?
• Functional for the server or functional for the client?
21
PICKING THE BEST FORMAT

CRUD
• Create
• Add a record

• Read
• Get record

• Update
• Change a record

• Delete
• Remove a record

22
ROW VS COLUMNAR STORAGE

Row: efficient for adding a new record
• Formats: CSV (Comma-Separated Values), TSV (Tab-Separated Values), Apache Avro
• Engines: MySQL, PostgreSQL

Columnar: efficient for data aggregation
• Formats: Apache Parquet
• Engines: BigQuery, Snowflake, Redshift
23
WRITING
Adding a New Entry

New record: Alan | Turing | 1912

Row formatted data:
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
+ Alan | Turing | 1912
The new record can be concatenated to the end, or inserted by row number.

Column formatted data:
Grace | Blaise | Katherine | Alan
Hopper | Pascal | Johnson | Turing
1906 | 1623 | 1918 | 1912
The new record is broken up and inserted at the end of each column block.
24
ANALYSIS
A.K.A. Feature Engineering

New feature: initials (In.) computed from First and Last
Grace Hopper > GH, Blaise Pascal > BP, Katherine Johnson > KJ, Alan Turing > AT

Row formatted data:
Grace | Hopper | GH
Blaise | Pascal | BP
Katherine | Johnson | KJ
Alan | Turing | AT
The new column is broken up and inserted at the end of each row block.

Column formatted data:
Grace | Blaise | Katherine | Alan
Hopper | Pascal | Johnson | Turing
GH | BP | KJ | AT
The new column can be concatenated to the end, or inserted by column number.
25
BINARY
Ex: Multimedia File

Pro:
• Compact
• Faster to send and process
• Flexible
• Many datatypes can easily be converted to binary
• Great for images

Con:
• Hard to visualize without decoding software
• Difficult to debug data integrity
26
ASCII
Ex: CSV

Pro:
• Simple structure
• No file metadata
• File is human readable
• Average scalability
• Easy to join and split multiple CSV files
• Easy to append a new entry

Con:
• Simple structure
• No file metadata
• Average scalability
• Data is not compressed as much as other file types
27
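The "easy to append a new entry" point fits in a couple of lines of Python; a minimal sketch, with the file name people.csv assumed only for illustration:

import csv

# Appending a record to a row-oriented file is just a write at the end.
with open("people.csv", "a", newline="") as f:
    csv.writer(f).writerow(["Alan", "Turing", 1912])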
PARQUET
Ex: Hadoop

Pro:
• Good compression if many repeated values
• Efficient to read a subset of columns
• Support for complex datatypes like arrays

Con:
• Immutable: query results are typically saved in a new file
• Querying for all the attributes of an entity is an expensive operation
• Files are not human readable without a tool

28
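A minimal sketch of the column-subset advantage, assuming pandas with a Parquet engine (pyarrow or fastparquet) installed; the file name and data are made up for illustration:

import pandas as pd

# Write a small table to Parquet.
cars = pd.DataFrame({
    "VIN": ["1a2b3c", "4d5e6g", "7h8i9j"],
    "Model": ["Sedan", "Convertible", "Sedan"],
    "Year": [1986, 2011, 1997],
})
cars.to_parquet("cars.parquet")

# Read back only two columns: the columnar layout means the other
# columns never have to be scanned.
subset = pd.read_parquet("cars.parquet", columns=["VIN", "Year"])
print(subset)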
DATA FORMATS COMPARISON
Summary

(Table: CSV, JSON, Parquet, and Avro compared on the following properties:)
• Columnar
• Compressible
• Splittable
• Human readable
• Complex data structure
• Schema evolution/validation
• Binary

29
DATA FRAMEWORKS
30
VERTICAL VS HORIZONTAL SCALING

↑ Vertical ↑ ← Horizontal →
Scales to higher quality hardware Scales to more partitions /
machines

• SQL
• Dask
• CuPy
• NoSQL
• NumPy
• Spark
• cuDF
• Hadoop
• pandas

31
SQL
Structured Query Language

Table (awesome.people):
first | last | born
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
Alan | Turing | 1912

Query:
SELECT first, last
FROM awesome.people
WHERE born > 1900

Result:
first | last
Grace | Hopper
Katherine | Johnson
Alan | Turing
32
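A self-contained way to try the query above is Python's built-in sqlite3; a minimal sketch, with the table name flattened to awesome_people because SQLite has no schema prefix like awesome.people:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE awesome_people (first TEXT, last TEXT, born INTEGER)")
con.executemany(
    "INSERT INTO awesome_people VALUES (?, ?, ?)",
    [("Grace", "Hopper", 1906), ("Blaise", "Pascal", 1623),
     ("Katherine", "Johnson", 1918), ("Alan", "Turing", 1912)],
)

# Same filter and projection as on the slide.
for first, last in con.execute(
        "SELECT first, last FROM awesome_people WHERE born > 1900"):
    print(first, last)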
DATAFRAMES
Pandas (CPU) and cuDF (GPU)

Table (df):
first | last | born
Grace | Hopper | 1906
Blaise | Pascal | 1623
Katherine | Johnson | 1918
Alan | Turing | 1912

Query:
df = df[df["born"] > 1900]
df = df[["first", "last"]]

Result:
first | last
Grace | Hopper
Katherine | Johnson
Alan | Turing
33
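A runnable version of the slide's query in pandas. cuDF follows the pandas API, so swapping the import for cudf is expected to run the same lines on the GPU, assuming a RAPIDS install:

import pandas as pd   # or: import cudf as pd  (GPU variant, assuming RAPIDS is installed)

df = pd.DataFrame({
    "first": ["Grace", "Blaise", "Katherine", "Alan"],
    "last": ["Hopper", "Pascal", "Johnson", "Turing"],
    "born": [1906, 1623, 1918, 1912],
})

df = df[df["born"] > 1900]    # keep people born after 1900
df = df[["first", "last"]]    # keep only the first/last columns
print(df)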
MATRICES AND NUMBER ARRAYS
NumPy (CPU) and CuPy (GPU)

Array (a):
8 6 7
5 3 0
9 8 6
7 5 3

Query:
a = a.sum(axis=0)

Result:
29 22 16
34
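The same reduction as a runnable NumPy snippet; CuPy mirrors the NumPy API, so importing cupy instead is the usual GPU swap (assuming CuPy is installed):

import numpy as np   # or: import cupy as np  (GPU variant, assuming CuPy is installed)

a = np.array([[8, 6, 7],
              [5, 3, 0],
              [9, 8, 6],
              [7, 5, 3]])

print(a.sum(axis=0))   # column-wise sum -> [29 22 16]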
DASK SCALES PYTHON ANALYTICS
SCALE FROM A LAPTOP TO LARGE-SCALE CLUSTERS WITH EASE

(Figure: Time Series Data)

• Dask enables data scientists to scale out analytics workloads in native Python. With an optimized scheduler, Dask makes it easy to schedule and execute tasks across distributed compute.

• Dask follows the standards set by the PyData ecosystem to provide a familiar, comfortable user experience at scale. When paired with NVTabular/RAPIDS, data scientists can leverage the processing power of NVIDIA accelerated compute and distribute work across clusters to improve cycle time, drastically reducing time to insight.
35
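A minimal sketch of scaling a pandas-style workflow with Dask; the file pattern and column names are made up for illustration, and Dask-cuDF would be the GPU-backed variant:

import dask.dataframe as dd

# Many CSV files become many partitions; nothing is read yet.
ddf = dd.read_csv("measurements-*.csv")

# Build the computation lazily, then trigger it across the cluster (or local threads).
mean_level = ddf.groupby("station")["water_level"].mean()
print(mean_level.compute())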
MAPREDUCE
Map to each thread, Reduce all threads to one

(Diagram: three parallel "Calculate Age" tasks (Map) feed a single "Average Age" task (Reduce).)
36
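A toy illustration of the pattern in plain Python: the "Calculate Age" map step runs once per record on a thread pool, and the reduce step collapses the results into one average. The birth years and the reference year 2024 are assumed only for illustration:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce

born = [1906, 1623, 1918, 1912]

def calculate_age(year):                 # Map: one task per record
    return 2024 - year

with ThreadPoolExecutor() as pool:
    ages = list(pool.map(calculate_age, born))

average_age = reduce(lambda a, b: a + b, ages) / len(ages)   # Reduce: all results to one
print(average_age)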
LAZY EXECUTION
Building a Factory

(Diagram: the DAG of "Calculate Age" tasks feeding "Average Age" is built first, like designing a factory; nothing runs until the graph is executed.)
37
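A minimal sketch of lazy execution with dask.delayed (assuming Dask is installed): the calls below only build the task graph, the "factory", and nothing runs until .compute() is called. The functions and values are assumed for illustration:

import dask

@dask.delayed
def calculate_age(year):
    return 2024 - year

@dask.delayed
def average(ages):
    return sum(ages) / len(ages)

# Building the DAG: no ages are calculated yet.
graph = average([calculate_age(y) for y in [1906, 1623, 1918, 1912]])

# Executing the DAG.
print(graph.compute())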
RELATIONAL DATABASES
Ex: SQL

Pro:
• Well known
• Concise language
• Relatively fast querying
• Foreign keys
• Blazing SQL

Con:
• Inflexible data structure
• Some objects do not convert well to table format
• Typically single server
• More expensive hardware needed to scale
38
DATAFRAME
Ex: cuDF, Pandas, R

Pro:
• Python and R APIs (cuDF, Pandas)
• Compared to SQL, more flexible operations
• Easier to make user-defined functions and integrate third-party libraries

Con:
• Single server, not meant for large-scale data manipulation (consider Spark instead)
• Compared to SQL, not as scalable
39
DASK
Ex: Dask DataFrame, Dask-cuDF

Pro:
• Large computations can receive a significant speed increase
• Can read large data sources due to partitioning

Con:
• Large overhead to set up; not worth it for small files or limited computation
• Lazy execution can make it tricky to debug

40
LAB
41
WEATHER SYSTEMS

Credit: Ralph F. Kresge, Submitted to NOAA

42
INVESTIGATING WATER LEVEL

43
LET’S GO!

44
45
