Jomar Silva - 2023
NVIDIA Technologies Applied to E-commerce - Far Beyond the Hardware
2
NVIDIA is built like a computing stack or neural
network—in four layers: hardware, system
software, platform software, and applications.
Each layer is open to computer makers, service
providers, and developers to integrate into their
offerings however works best for them.
[Stack diagram, bottom to top: hardware (GPU, CPU, DPU, NIC, SoC, switch); systems (RTX, DGX, HGX, EGX, OVX, SuperPOD, AGX); platform software (RTX, CUDA-X, PhysX); applications (NVIDIA HPC, NVIDIA AI, NVIDIA Omniverse)]
ACCELERATED COMPUTING ACROSS THE FULL STACK AND AT DATA CENTER SCALE
NVIDIA ACCELERATED AI PLATFORM
Accelerating AI from edge to data center to cloud
[Platform diagram:
- Application frameworks: METROPOLIS (video analytics), ISAAC (robotics), RAPIDS (data science), RIVA (conversational AI/ASR/NLP/TTS), MERLIN (recommendation systems), OMNIVERSE (simulation / digital twins)
- NVIDIA AI ENTERPRISE: AI and data science tools and frameworks (TensorFlow, PyTorch, NVIDIA RAPIDS™, NVIDIA TensorRT®, NVIDIA Triton™ Inference Server); cloud-native deployment (NVIDIA GPU Operator, NVIDIA Network Operator); infrastructure optimization (NVIDIA vGPU, NVIDIA Magnum IO™, NVIDIA CUDA-X AI™); managed with NVIDIA Fleet Command
- Infrastructure: GPU | NETWORKING | SECURITY | STORAGE
- Ecosystem: PRE-TRAINED MODELS; SYSTEM INTEGRATORS (consult with experts); MAJOR CUSTOMERS (in-house data science teams); 150+ SOFTWARE VENDORS (pre-packaged AI solutions)]
4
https://rapids.ai
5
THE CHALLENGES OF DATA SCIENCE TODAY
PROBLEMS THAT HINDER DATA-DRIVEN ENTERPRISES
It's Time Consuming
Building, training, and iterating on
models takes substantial time. As
big data use cases continue to
grow, CPU computational power
becomes a major bottleneck.
It's Costly
Large-scale CPU infrastructure is
incredibly expensive for conducting
big data operations. With growing
datasets, adding CPU infrastructure
continues to increase costs.
It's Frustrating
Productionizing large-scale data
processing operations is arduous. It
often involves refactoring and
hand-offs between teams, adding
more cycle time.
These challenges increase cycle time and slow down time to insight.
6
NVIDIA RAPIDS TRANSFORMS DATA SCIENCE
From Analytics to NVIDIA Accelerated Data Science
7
PYTHON TOOLS HAVE DEMOCRATIZED DATA SCIENCE
ACCESSIBLE, EASY TO USE TOOLS ABSTRACT COMPLEXITY
[Diagram: the PyData stack in CPU memory, spanning data preparation, model training, and visualization: pandas (analytics), scikit-learn (machine learning), NetworkX (graph analytics), TensorFlow/PyTorch/MXNet (deep learning), cuxfilter <> pyViz (visualization), and Dask]
Python is the most-used language in
Data Science today. Libraries like
NumPy, Scikit-Learn, and Pandas have
changed how we think about accessibility
in Data Science and Machine Learning.
While great for experimentation, PyData
tools lack the power necessary for
enterprise-scale workloads. This leads to
substantial refactoring to handle the size
of modern problems, increasing cycle
time, overhead, and time to insight.
These pain points are further
compounded by computational
bottlenecks of CPU-based processing.
Code refactors and inter-team handoffs decrease data-driven ROI
8
RAPIDS ACCELERATES POPULAR DATA SCIENCE TOOLS
DELIVERING ENTERPRISE-GRADE DATA SCIENCE SOLUTIONS IN PURE PYTHON
[Diagram: the RAPIDS stack in GPU memory, spanning data preparation, model training, and visualization: cuIO & cuDF (pre-processing), cuML (machine learning), cuGraph (graph analytics), TensorFlow/PyTorch/MXNet (deep learning), cuxfilter <> pyViz (visualization), and Dask]
The RAPIDS suite of open source
software libraries gives you the freedom
to execute end-to-end data science and
analytics pipelines entirely on GPUs.
RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly, PyData-style Python interfaces.
With Dask, RAPIDS can scale out to multi-node, multi-GPU clusters to power through big data workloads.
RAPIDS enables the PyData stack with the power of NVIDIA GPUs
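To make the drop-in nature concrete, here is a minimal sketch of the pandas-to-cuDF swap, assuming a CUDA-capable GPU and the RAPIDS cudf package; the CSV path and column names are hypothetical:

import cudf

# pandas-style analytics, executed on the GPU
df = cudf.read_csv("transactions.csv")            # data lands in GPU memory
df["total"] = df["price"] * df["quantity"]        # columnar math on the GPU
top = df.groupby("user_id")["total"].sum()        # GPU-accelerated groupby
print(top.sort_values(ascending=False).head(10))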
9
SCALE OUT PYTHON TOOLS WITH RAPIDS + DASK
DISTRIBUTE & ACCELERATE COMPUTATION FOR PRODUCTION WORKLOADS
- PYDATA: accessible, easy-to-use tooling (NumPy, Pandas, Scikit-Learn, Numba, and many more); single CPU core, in-memory data
- RAPIDS (scale up / accelerate): accelerates PyData on NVIDIA GPUs; NumPy -> CuPy/PyTorch/..., Pandas -> cuDF, Scikit-Learn -> cuML, Numba -> Numba
- DASK (scale out / parallelize): distributes PyData across multiple cores; NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, ... -> Dask Futures
- RAPIDS + DASK: distributes and accelerates PyData across multiple GPUs on a single node (DGX) or across a cluster; easy-to-use tooling enabling HPC-level performance
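A minimal sketch of the RAPIDS + Dask combination, assuming the dask-cuda and dask-cudf packages are installed; the Parquet glob path and column names are hypothetical:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()   # starts one Dask worker per local GPU
client = Client(cluster)

# dask_cudf partitions the dataset across workers, so it can exceed
# the memory of any single GPU
ddf = dask_cudf.read_parquet("events/*.parquet")
mean_clicks = ddf.groupby("item_id")["clicks"].mean().compute()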
10
RAPIDS INTEGRATES WITH DEEP LEARNING TOOLS
BUILD COMPLEX WORKFLOWS WITHOUT LEAVING THE GPU
RAPIDS supports device memory sharing
between many popular data science
and deep learning libraries, such as
PyTorch and TensorFlow. By providing
native array_interface support, data can stay on the GPU, avoiding costly copies back and forth to host memory.
Data stored in Apache Arrow can be
seamlessly pushed to deep learning
frameworks that accept the CUDA Array Interface protocol or work with DLPack, such as Chainer, MXNet, and PyTorch.
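A minimal sketch of this zero-copy hand-off, assuming cudf, cupy, and a CUDA-enabled torch are installed:

import cudf
import cupy as cp
import torch

gdf = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})   # lives in GPU memory

# cuDF -> CuPy via the CUDA array interface: no copy to host memory
arr = cp.asarray(gdf["x"].values)

# CuPy -> PyTorch via DLPack: the tensor still references the same GPU memory
tensor = torch.utils.dlpack.from_dlpack(arr.toDlpack())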
NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
- ETL: Challenge: pipelines are slow and complex. Solution (NVTabular): GPU-accelerated, easy-to-use ETL pipelines prepare datasets in minutes.
- Data Loading: Challenge: common item-by-item loading can be slow. Solution: an asynchronous, GPU-accelerated dataloader for PyTorch and TensorFlow/Keras.
- Training: Challenge: embedding tables of large deep learning recommender systems can exceed memory. Solution (HugeCTR): easy data- and model-parallel training scales to TB-size embeddings.
- Inference: Challenge: achieving high throughput to rank more items is difficult while maintaining low latency. Solution (Triton): high-throughput, low-latency production deployment.
NVIDIA Merlin is an open-source library to deploy recommender systems end to end.
NVIDIA Merlin Accelerates Every Stage in the Recommender Pipeline
[Pipeline diagram: a data lake feeds ETL (NVTabular, on RAPIDS), the data loader (NVTabular, on RAPIDS), training (HugeCTR, on cuDNN), validation and model analysis (on RAPIDS), and inference (Triton). At serving time, a user query goes through candidate generation, which narrows O(billions) of items to O(1000) via embeddings, and ranking, which returns the top 10 recommendations.]
14
Merlin speeds up the entire pipeline
[Benchmark chart:
- ETL (minutes): Optimized Spark (4x CPU nodes) vs. NVTabular ETL (1x A100): 21x speedup
- Accelerating training (minutes): TensorFlow data loader (1x A100) vs. NVTabular data loader / TF-PyT plugins (1x A100): 9x speedup
- Scaling accelerated training (minutes): CPU cluster (4x nodes) vs. HugeCTR (1x DGX A100): 24x speedup
- Inference (throughput in samples/sec at 10 ms latency): PyTorch (2x CPU nodes) vs. HugeCTR HPS on Triton: 35x speedup]
NVIDIA Merlin provides a 9-35x speed-up across ETL, training, and inference for RecSys models and easily scales to multiple GPUs.
NVTabular
Fast Feature Transforms & Dataloading of Tabular Data on GPU
NVTABULAR: RECOMMENDER SYSTEM ETL ON GPU
NVTabular
What it is:
Feature engineering and preprocessing library designed to quickly and
easily manipulate terabytes of tabular data
What it’s capable of:
• Speed – GPU acceleration, 10x speedup compared to CPU, eliminating the input bottleneck
• Scale – No limit on dataset size (not bound by GPU or CPU
memory)
• Usability - Higher level abstraction, recommender systems
oriented, fewer API calls are required to accomplish the
same processing pipeline
• Core Features - integration with PyTorch, TensorFlow, and
HugeCTR; multi-hot encoding
Visualization of the feature engineering and preprocessing pipeline for the Criteo Click Ads Prediction dataset
SCALE: NVTABULAR SCALES TO MULTI-GPUS AND
MULTI-NODES USING TBS OF GPU MEMORY
Example of NVTabular ETL with 2 nodes:
1. ETL starts with 1 node (8 GPUs)
2. ETL adds a 2nd node (8 GPUs, total=16GPUs)
The Dask dashboard monitors NVTabular ETL:
● Top-right: visualization of GPU interaction; each small dot is one GPU (up to 16 GPUs in this example)
● Bottom-center: utilized GPU memory and total GPU memory (up to 500 GB in this example)
● Top-left: tasks over time
Detailed Notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/multi-gpu_dask.ipynb
Under the hood, NVTabular uses Dask and Dask-cuDF to provide a
high-performance recommender system-specific ETL pipeline on multiple GPUs
Graphic showing the Dask dashboard when running NVTabular
SCALE: NVTABULAR SCALES EASILY TO MULTI-NODES
AND REDUCES ETL TIME
In our example, NVTabular scales to 128 GPUs (16 nodes x 8 GPUs) with a total of 5.1 TB of GPU memory (each GPU has 40 GB of GPU memory)
Preliminary results
Run on DGX A100,
Dataset: Criteo 1TB (26 categorical features and 13 integer features)
NVTabular v0.3
To enable distributed parallelism, we need to start a cluster and then connect to it to run the application.
How it works:
1. Start the scheduler: dask-scheduler
2. Start the workers: dask-cuda-worker
3. Run the NVTabular application (see the sketch below)
Dask is a flexible library for parallel computing in Python
that makes scaling out your workflow smooth and simple.
Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames.
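A minimal sketch of step 3 from the Python side, assuming the scheduler and dask-cuda workers from steps 1 and 2 are already running; the scheduler address, column names, and paths are hypothetical, and the client keyword reflects older NVTabular releases:

from dask.distributed import Client
import nvtabular as nvt

client = Client("tcp://scheduler-node:8786")   # connect to the running cluster

# NVTabular partitions its Dask-cuDF work across all connected GPUs
features = ["I1", "I2"] >> nvt.ops.FillMissing() >> nvt.ops.Normalize()
workflow = nvt.Workflow(features, client=client)
workflow.fit(nvt.Dataset("criteo/*.parquet", part_size="1GB"))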
USABILITY: NVTABULAR’S HIGH-LEVEL API IS 10-20 LINES
Easy-to-use API. Find more details in the examples in the NVTabular repository; a sketch of the full pipeline follows the step annotations below.
Encode categorical variables
using the defined thresholds and
add metadata (tags)
For continuous features -
zero filling any nulls
clip all values,
log transform,
normalize,
add metadata (tags)
Collect statistics on train dataset
Transform train & valid dataset
with statistics from train dataset
Define dataset files
Combine pipelines and initialize
workflow
Load schema by tags for model
definition
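Pieced together, the annotated pipeline might look like the following sketch, assuming the nvtabular package; the Criteo-style column names, thresholds, and paths are hypothetical:

import nvtabular as nvt

# Continuous features: zero-fill nulls, clip, log-transform, normalize
conts = ["I1", "I2"] >> nvt.ops.FillMissing(fill_val=0) \
                     >> nvt.ops.Clip(min_value=0) \
                     >> nvt.ops.LogOp() \
                     >> nvt.ops.Normalize()

# Categorical features: encode using a frequency threshold
cats = ["C1", "C2"] >> nvt.ops.Categorify(freq_threshold=10)

# Combine the pipelines and initialize the workflow
workflow = nvt.Workflow(conts + cats)

# Collect statistics on the train dataset, then transform train & valid
train = nvt.Dataset("train/*.parquet")
valid = nvt.Dataset("valid/*.parquet")
workflow.fit(train)
workflow.transform(train).to_parquet("train_out/")
workflow.transform(valid).to_parquet("valid_out/")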
ONE PARAMETER TO SWITCH BETWEEN CPU AND GPU
New users can easily try out NVTabular, as CPU mode runs on any infrastructure; users can develop on local machines in CPU mode and push to a cluster in GPU mode, as in the sketch below.
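A minimal sketch of the switch, assuming the cpu flag on nvt.Dataset; the paths are hypothetical:

import nvtabular as nvt

gpu_ds = nvt.Dataset("data/*.parquet")             # default: cuDF on the GPU
cpu_ds = nvt.Dataset("data/*.parquet", cpu=True)   # same pipeline on pandas/CPU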
21
Interoperability: NVTabular’s output files can be used
by PyTorch, TensorFlow or HugeCTR
[Diagram: data exported from the data lake (csv, parquet, avro) flows through NVTabular ETL with its high-level API; the output (parquet, or HugeCTR binary) feeds data loading and training via the NVTabular dataloader or a dedicated DL framework]
22
Advantages of NVTabular ETL
● Speed: benchmarks show 100x-3000x speedups in comparison to CPU environments
● Scale: supports datasets larger than host and GPU memory; scales easily to multi-GPU / multi-node with terabytes of GPU memory
● Usability: recommender-focused APIs to easily implement the most common workflows; 5-25 lines of code compared to other frameworks such as Pandas; examples published for common datasets and models
● Core features: optimized TF, PyTorch, and HugeCTR dataloaders; native tabular data format support (CSV, Parquet, ORC, Avro); building an easy path to production deployment for data transforms
Dataloading with NVTabular
Fast Dataloading of Tabular Data on GPU
24
NVTabular: GPU-accelerated dataloading
NVTabular Dataloader
What it is:
GPU-accelerated, asynchronous dataloader for TensorFlow & PyTorch that quickly prepares new data for the training step, fully utilizing the GPU
What it's capable of:
• Speed – the NVTabular dataloader is 10x faster than the TensorFlow dataloader with GPU training
• Scale – dataloaders enable training on larger-than-memory datasets by streaming chunks from disk
• Usability – easy integration with TensorFlow, PyTorch, and FastAI
• Core Feature – convergence behavior is similar to other framework dataloaders; supports the common tabular data format Parquet; no CPU-GPU communication, as data is loaded directly onto the GPU
Visualization of idle times for the NVTabular data loader
25
Speed: The NVTabular dataloader is up to 9x faster than the native TensorFlow version
Benchmark:
● Training and validation each use 1 day of the Criteo Ads Click Prediction dataset, with 150M samples
● Each experiment uses GPUs to train the same neural network architecture
Results:
● Dataloading alone, without training, shows a 24x speed-up, which indicates that the bottleneck is in the dataloader.
● Training neural networks shows a 2.7x-9x speed-up, as the data loader can prepare the next batch during the training step. Overall, the dataloader is still the bottleneck in the TensorFlow / PyTorch pipeline.
Comparison to TensorFlow
26
Scale: NVTabular supports larger-than-memory datasets by streaming data from disk
[Diagram: a file is stored in N chunks on disk storage (e.g., RAID); the dataloader asynchronously buffers k randomly selected chunks into GPU memory, where they are shuffled, preprocessed, and batched into the final batch. A hyperparameter defines how many chunks the NVTabular dataloader buffers.]
● Buffering chunks from disk
enables larger than host/GPU
memory datasets
● Dataloading on GPU removes
overhead of CPU-GPU
communication
● Shuffle, preprocess and batch
are faster on GPU
● Using multiple random chunks increases randomness in batches
27
Usability: NVTabular dataloaders can be used in an existing TensorFlow pipeline
Reserve GPU memory for NVTabular dataloader
Helper function for defining tf.feature_columns
Define dataset files
Initialize NVTabular dataloader for training
Define data schema with tf.feature_columns
Initialize NVTabular dataloader for validation
Get a TensorFlow model
NVTabular dataloader requires a custom
Validation Callback
Train model with .fit()
Steps which are different are highlighted; a sketch follows.
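The steps above roughly correspond to the following sketch, assuming NVTabular's Keras loader API; the paths, column names, bucket sizes, batch size, and toy model are hypothetical:

import os
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"   # reserve GPU memory for the loader

import tensorflow as tf
from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater

def make_loader(path, shuffle):
    return KerasSequenceLoader(
        path,
        batch_size=65536,
        label_names=["click"],
        cat_names=["C1", "C2"],
        cont_names=["I1", "I2"],
        shuffle=shuffle,
    )

train_loader = make_loader("train_out/*.parquet", shuffle=True)
valid_loader = make_loader("valid_out/*.parquet", shuffle=False)

# A toy model over tf.feature_columns, standing in for a real architecture
columns = [tf.feature_column.numeric_column(c) for c in ["I1", "I2"]] + [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_identity(c, num_buckets=1000))
    for c in ["C1", "C2"]
]
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(columns),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Validation goes through a custom callback rather than validation_data
model.fit(train_loader,
          callbacks=[KerasSequenceValidater(valid_loader)],
          epochs=1)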
28
Usability: NVTabular dataloaders can be used in an existing PyTorch pipeline
Helper function to run one epoch
Define dataset files
Initialize NVTabular dataloader for training
Initialize NVTabular dataloader for validation
Get a PyTorch model
Train and validate model
Steps which are different are highlighted; a sketch follows.
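The analogous PyTorch sketch, assuming NVTabular's torch loader API; paths, columns, and batch size are hypothetical:

import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader

def make_loader(path):
    itr = TorchAsyncItr(
        nvt.Dataset(path),
        batch_size=65536,
        cats=["C1", "C2"],
        conts=["I1", "I2"],
        labels=["click"],
    )
    return DLDataLoader(itr, batch_size=None, collate_fn=lambda x: x,
                        pin_memory=False, num_workers=0)

train_loader = make_loader("train_out/*.parquet")
valid_loader = make_loader("valid_out/*.parquet")

for x_cat, x_cont, y in train_loader:
    # batches arrive already on the GPU; run the forward/backward
    # pass of your torch.nn.Module here
    pass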
29
Core Feature: NVTabular has less idle time than other data loaders
The NVTabular data loader has no overhead for moving data from CPU to GPU, and preparing batches on the GPU is significantly faster.
[Timeline diagram comparing three schemes: with a synchronous CPU dataloader, the GPU idles while each batch is prepared on the CPU; with an asynchronous CPU dataloader, preparation overlaps training but CPU-GPU copies still leave idle gaps; with NVTabular's asynchronous GPU dataloader, batches are prepared on the GPU while it trains, leaving almost no idle time]
30
Advantages of NVTabular dataloaders
● Speed: NVTabular dataloaders speed up TensorFlow pipelines by 9x; they speed up PyTorch pipelines by 9x
● Scale: NVTabular dataloaders stream the dataset in chunks from disk and thereby support larger-than-host/GPU-memory datasets
● Usability: seamless integration into existing TensorFlow or PyTorch pipelines
● Core features: models converge in the same number of update steps but require less time; supports common tabular data formats (Parquet and CSV); data is loaded directly onto the GPU, removing CPU-GPU communication
Training on GPU at scale
Accelerating training at scale with HugeCTR and TensorFlow plugins
32
Merlin Overview - Make Recommenders on GPU Fast & Easy
[Architecture diagram: high-level libraries for end-to-end usability (Merlin Models & Merlin Systems, the Merlin data loader, and Transformer4Rec) sit above lower-level libraries for performance and scalability: NVTabular for ETL; SparseOperationKit (TF1), DistributedEmbeddings (TF2), and native HugeCTR for training; Triton, TRT, and the HugeCTR Hierarchical Parameter Server (with HKV) for inference; all built on RAPIDS (cuDNN, cuDF, cuBLAS, NCCL, ...) and partner and NVIDIA products]
HugeCTR
What it is:
HugeCTR is an open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs. It is written in CUDA C++ and heavily exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL.
What it's capable of:
Speed: a single-node DGX A100 trains the MLPerf v1.0 DLRM in 1.96 minutes
Scale: multi-node model parallelism, GPU embedding cache
Usability: Keras-like Python API
Core Features: multi-slot embedding with an in-memory GPU hashtable; asynchronous, multithreaded data pipeline; inference & embedding cache; TensorFlow embedding plugin
HUGECTR: RECOMMENDER SYSTEM TRAINING ON GPU
SPEED: HUGECTR DLRM MLPERF WIN
Features used to achieve MLPerf win:
- Hybrid embedding
- Fused MLP
- Optimized collectives
- Optimized data reader
- Overlapping MLP with embedding
- Whole-iteration CUDA graph
At 0.99 minutes, HugeCTR on DGX-A100 is the fastest available system on MLPerf Recommender Benchmark
MLPerf v1.0 training results. Recommender task (training DLRM on the Criteo 1TB dataset). Bars represent speedup factor compared to a 4 CPU-node cluster. The higher the better. HugeCTR v3.1
running on single node of DGX-A100 with 8x A100 80GB GPU and 14 nodes of DGX-A100. Intel's CPU submission based on 4 nodes, each with 4X Intel(R) Xeon(R) Platinum 8376H CPU @ 2.60GHz with 6
UPI for a total of 16 CPUs.
Scale: Multi-Node Model Parallel
[Diagram: the sparse model/embedding is partitioned across GPUs 0-3 of nodes 0 and 1 (model parallel), while each GPU holds its own copy of the dense model (data parallel)]
To train a large-scale CTR estimation model, the HugeCTR embedding table can span multiple GPUs and multiple nodes (model parallelism), while each GPU holds its own feed-forward neural network (data parallelism).
USABILITY: KERAS-LIKE PYTHON API
Easy to use Python API
Train:
1. Initialize Model
solver = hugectr.CreateSolver(...)
reader = hugectr.DataReaderParams(...)
optimizer = hugectr.CreateOptimizer(...)
model = hugectr.Model(solver, reader, optimizer)
2. Create Model graph
model.add(hugectr.Input(...))
model.add(hugectr.SparseEmbedding(...))
model.add(hugectr.DenseLayer(...))
...
3. Save model graph
model.graph_to_json("dlrm.json")
4. Compile & Fit
model.compile(...)
model.fit(...)

Inference:
1. Create InferenceSession
inference_params = InferenceParams(
    max_batchsize = 4096,
    hit_rate_threshold = 0.8,
    dense_model_file = ...,
    sparse_model_files = ...,
    device_id = 0,
    use_gpu_embedding_cache = True,
    cache_size_percentage = 0.3)
inference_session = CreateInferenceSession("dlrm.json", inference_params)
2. Make inference
• From extracted VCSR: inference_session.predict(dense, keys, row_ptrs)
• From a Norm/Parquet dataset: inference_session.predict(...) or inference_session.evaluate(...)
37
Usability - Democratize Model Parallelism with Three Lines!
Try it today at https://github.com/NVIDIA-Merlin/distributed-embeddings to easily train TB-size models fast!
Learn more from our blog: https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
ASYNCHRONOUS, MULTITHREADED DATA PIPELINE
[Diagram: for batches 0, 1, and 2, the 'read file', 'copy to GPU', and 'train' stages overlap in time; multiple workers read datasets into CPU memory while a collector feeds GPU memory for model training]
The HugeCTR pipeline can overlap the data read from disk to CPU memory, the data transfer from CPU to GPU, and the actual training on the GPU across different batches.
Triton
Fast and scalable AI in production
MERLIN TRITON: EASY DEPLOYMENT OF ETL WORKFLOW
AND DEEP LEARNING MODELS WITH GPU-ACCELERATION
Triton Inference with Merlin
What it is:
Triton Inference Server simplifies the deployment of AI models at scale in
production. It is open source inference serving software that lets teams deploy trained AI models from any framework. Both NVTabular workflows and TensorFlow, PyTorch, and HugeCTR models can be deployed.
What it's capable of:
• Speed – GPU acceleration for inference in production; the HugeCTR embedding cache keeps frequent user/item embeddings on the GPU
• Scale – Triton with HugeCTR can deploy larger-than-memory embedding tables with the parameter server
• Usability – only a few lines of code to deploy an ensemble of an ETL workflow and a deep learning model
• Core Features – maximizes CPU/GPU utilization with real-time or batch inference; supports all major deep learning frameworks and custom-built models; Kubernetes integration for orchestration, metrics, and auto-scaling; serves multiple models at once
Visualization of a deployed ensemble model of an NVTabular workflow and a HugeCTR model; alternatively, a TensorFlow or PyTorch model can be deployed.
MERLIN DEPLOYS ENSEMBLE MODEL FOR SAME ETL
TRANSFORMATION IN PRODUCTION
[Diagram: in production, each of N requests flows through the same Triton-deployed ensemble that was defined at training time and returns a prediction]
● Easy deployment to production with only a few lines of code (see the sketch below)
● NVTabular collects statistics,
such as mean/std of numeric
features and mapping tables for
categorical features
● The same ETL workflow needs
to be applied in production
● Triton supports TensorFlow and
PyTorch, too
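A sketch of that deployment, assuming the merlin-systems API; "workflow" and "model" stand for the fitted NVTabular workflow and trained TensorFlow model from the earlier sketches, and the output path is hypothetical:

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.workflow import TransformWorkflow
from merlin.systems.dag.ops.tensorflow import PredictTensorflow

# Requests are first transformed with the statistics collected at training
# time, then scored by the model
serving_ops = (workflow.input_schema.column_names
               >> TransformWorkflow(workflow)
               >> PredictTensorflow(model))

ensemble = Ensemble(serving_ops, workflow.input_schema)
ensemble.export("/models")   # the model repository directory Triton serves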
42
Merlin Overview - Make Recommenders on GPU Fast & Easy
[Architecture diagram, repeated: high-level libraries (Merlin Models & Merlin Systems, the Merlin data loader, Transformer4Rec) above lower-level libraries: NVTabular (ETL); SOK - TF1, DE - TF2, and native HugeCTR (training); Triton, TRT, and HPS (inference); all on RAPIDS (cuDNN, cuDF, cuBLAS, NCCL, ...) and partner and NVIDIA products]
43
Options for Running Inference
- CPU for Embedding & MLP. Pros: runs everywhere. Cons: slow per socket.
- CPU for Embedding + GPU for MLP. Pros: easier to switch the MLP to GPU; makes use of all existing resources. Cons: embedding is not accelerated.
- GPU for both Embedding & MLP. Pros: most performant, especially for large models. Cons: not supported by native TF.
That's why we have the HugeCTR Hierarchical Parameter Server!
44
Input Features Follow a Power-Law Distribution
Only a small fraction of parameters is accessed most frequently. These “hot parameters” will be placed in the GPU cache.
In the Criteo 1 TB dataset, 305K out of 188M categories (0.16%) account for 95.9% of samples.
Hierarchical Parameter Server (HPS) - A “Giant” Key-Value Store
- Each GPU has its own lookup session and its own embedding cache, with no inter-communication between GPUs; this reduces synchronization time and ensures high system availability
- CPU memory and SSDs can be configured by users as extended embedding storage
- HPS supports embedding lookup for a variety of models/frameworks
[Diagram: a three-level store: GPUs 0-3 each have a lookup session and an embedding cache, backed by CPU memory (e.g., Redis, a HashMap), backed in turn by SSD (e.g., RocksDB, HDFS)]
46
HPS Performance Benchmark
[Chart: 16x speedup]
Merlin Models
Merlin Models: Model Training Library
Merlin Models
What it is:
High-quality implementations in TensorFlow of models ranging from classical machine learning to more advanced deep learning
What it's capable of:
▪ Implementations of common architectures, loss functions, and tasks
▪ A unified API that lets users create models in various frameworks and libraries: TensorFlow, XGBoost, Implicit, and LightFM
▪ Flexible building blocks to design custom architectures from common components
▪ GPU-optimized data loading and model training
▪ Integration with NVTabular
Visualization of Merlin Models
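A minimal sketch of training with Merlin Models, assuming the merlin.models TensorFlow API; the dataset paths and hyperparameters are hypothetical:

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train_out/*.parquet")
valid = Dataset("valid_out/*.parquet")

# DLRMModel builds the architecture from the schema produced by NVTabular
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=65536)
model.evaluate(valid, batch_size=65536)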
Fast Iteration is Key
Merlin enables quick experimentation cycles to find high-accuracy models
[Diagram: the iteration loop of input, feature engineering, model training, and evaluation, followed by deployment; NVTabular + (Merlin Models / HugeCTR / Transformer4Rec) accelerate the loop stages by 10x each, and Triton Inference Server handles deployment]
Example Model Architectures in Merlin Models
Facebook DLRM, Google DCN, YouTube DNN
Sequential/Session-based recommendation w/ NLP Transformer architecture
Transformers4Rec
SESSION-BASED RECOMMENDATION TASK
Problem: A user's intention changes between sessions, and online services usually only have access to the current session (GDPR).
Goal: Build a recommender system that is able to leverage the short sequence of past interactions within the same session and dynamically adapt the next suggested item.
NLP X SEQUENTIAL RECSYS
[Timeline, 2013-2021: NLP milestones (Word2Vec, GRU, Attention, Transformer, BERT, GPT-2, XLNet, and the HuggingFace Transformers lib) alongside the seminal neural RecSys architectures they influenced (Prod2Vec, Meta-Prod2Vec, GRU4Rec, NARM, SASRec, ATRank, AttRec, BERT4Rec, and the NVIDIA Transformers4Rec lib)]
Fig 2. The influence of NLP research on Recommender Systems
WHAT IS TRANSFORMERS4REC?
Transformers4Rec
- is a flexible and efficient open source library for sequential and session-based recommendation,
- makes state-of-the-art Transformer architectures available to the RecSys community,
- is available for both PyTorch and TensorFlow.
https://github.com/NVIDIA-Merlin/Transformers4Rec
TRANSFORMERS4REC IS MULTI-FRAMEWORK AND EASY TO USE
PyTorch TensorFlow
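A minimal sketch of defining a session-based model with the PyTorch API, assuming a schema produced by an NVTabular workflow over session data; the hyperparameters are hypothetical:

from transformers4rec import torch as tr

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    masking="causal",          # next-item, language-model-style masking
)

# XLNet-style transformer over the session sequence
config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock([64]),         # project the features to d_model
    tr.TransformerBlock(config, masking=inputs.masking),
)

# Next-item prediction head; weight tying shares the item embeddings
model = tr.Model(tr.Head(body, tr.NextItemPredictionTask(weight_tying=True)))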
56
Transformers4Rec Workflow Pipeline
Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based
Recommendation
Fig 3. End-to-end session-based recommendation pipeline
Success Cases
58
PERSONALIZED
PRODUCT RECOMMENDATIONS
Olay is arming consumers with knowledge to make informed
purchase decisions. Its Olay Skin Advisor — a GPU-accelerated AI
tool that works on any mobile device — assesses a user-provided
selfie and advises how to improve trouble areas using a daily regimen of recommended Olay products. After four weeks, 94% of Skin Advisor users continued to use the products the Olay Skin Advisor recommended.
59
With >100,000 different products in its 4,700 U.S. stores, the Walmart Labs data science team predicts demand for 500 million item-by-store
combinations every week.
By performing forecasting with the open-source RAPIDS data processing and machine learning libraries built on CUDA-X AI on NVIDIA GPUs,
Walmart speeds up feature engineering 100x and trains machine learning algorithms 20x faster, resulting in faster delivery of products, real-time reaction to shopper trends, and inventory cost savings at scale.
IMPROVING DEMAND FORECASTS
60
MODERNIZING THE WAREHOUSE
In 2019, global e-commerce represented $3.4 trillion, or 13.7%, of the $25 trillion retail sales market. Oberlo predicts this market share will grow to 17.5% by 2021.
With thousands of orders placed every hour, data scientists at Zalando, Europe’s leading online fashion retailer, applied deep
learning powered by NVIDIA GPUs to develop the Optimal Cart Pick algorithm.
The algorithm resulted in an 11% decrease in workers’ travel time per item picked. The work is a good example of the efficiencies
that AI can discover for e-commerce, manufacturing and other large-systems-based industries.
NVIDIA DEVELOPER
PROGRAM
FOR DEVELOPERS
62
NVIDIA DEVELOPER PROGRAM
TOOLS
JOIN THE COMMUNITY THAT’S CHANGING THE WORLD
• Get exclusive access to an extensive library of NVIDIA software, spanning all of NVIDIA’s technology platforms.
• Save time with ready-to-run, GPU-optimized software, model scripts, and containerized apps from the NVIDIA NGC™
catalog.
• Participate in early access programs where you can be one of the first to experience the latest NVIDIA technology.
TRAINING
• Take advantage of research papers, technical documentation, developer blogs, and industry-specific resources.
• Choose from a broad catalog of training options through the NVIDIA Deep Learning Institute (DLI).
• Get unlimited access to NVIDIA On-Demand, the home for NVIDIA resources from GTCs and other leading industry events.
COMMUNITY
• Network with like-minded developers, engage with GPU experts, and contribute to discussions in the developer forums.
• Attend exclusive meetups, GPU hackathons, and events.
• Connect with NVIDIA experts through developer-focused webinars and Instructor-led workshops.
Join the Free Program developer.nvidia.com/join
63
https://ngc.nvidia.com/
NVIDIA INCEPTION
PROGRAM
FOR STARTUPS
NVIDIA INCEPTION HELPS TECH STARTUPS BUILD AND
GROW FASTER
Open to Tech Startups of All Stages Working in AI, Deep Learning, AR/VR, Gaming, Networking, and Graphics
100+
Countries Represented
37%
Program Growth in 2022
15,000+
Startups Worldwide
$94B+
in Cumulative Funding
INCEPTION HELPS YOU ACHIEVE SUCCESS AT EVERY STAGE
BUILD: Get free technical training, engineering guidance, and discounts on technology to build your solutions faster.
GROW: Accelerate growth with opportunities to increase your company's market awareness and gain exposure to VC investors.
SCALE: Premier members enjoy an enhanced set of go-to-market and technical guidance benefits to help them scale faster.
NVIDIA INCEPTION
67
https://nvidia.com/startups
NVIDIA DEEP LEARNING
INSTITUTE
FOR STUDENTS, RESEARCHERS AND PROFESSORS
69
HELPING YOU SOLVE YOUR
MOST CHALLENGING PROBLEMS
• Build and deploy end-to-end projects across a range
of technologies and domains.
• Gain hands-on experience with the most widely used,
industry-standard software, tools, and frameworks.
• Join live, instructor-led workshops and learn from DLI-certified
instructors who are experts in their fields.
• Take self-paced, online courses anytime, anywhere.
• Earn NVIDIA DLI certificates to demonstrate subject matter
competency and support career growth.
Hands-On Training in AI, Accelerated Computing,
Accelerated Data Science, Graphics and Simulation,
and More
Learn more: www.nvidia.com/dli
Deep Learning Fundamentals, AI for Anomaly Detection, AI for Industrial Inspection, Conversational AI, AI for Intelligent Video Analytics, Accelerated Computing Fundamentals, Recommender Systems, AI for Predictive Maintenance, Accelerated Data Science Fundamentals, Graphics and Simulation, Networking, AI in the Data Center
THANK YOU!
jsilva@nvidia.com

More Related Content

PDF
NVIDIA Rapids presentation
PDF
Rapids: Data Science on GPUs
PDF
RAPIDS – Open GPU-accelerated Data Science
PDF
RAPIDS: GPU-Accelerated ETL and Feature Engineering
PPTX
Innovation with ai at scale on the edge vt sept 2019 v0
PDF
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
PDF
RAPIDS Overview
PDF
RAPIDS, GPUs & Python - AWS Community Day Melbourne
NVIDIA Rapids presentation
Rapids: Data Science on GPUs
RAPIDS – Open GPU-accelerated Data Science
RAPIDS: GPU-Accelerated ETL and Feature Engineering
Innovation with ai at scale on the edge vt sept 2019 v0
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
RAPIDS Overview
RAPIDS, GPUs & Python - AWS Community Day Melbourne

Similar to Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito além do hardware. (20)

PPTX
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PDF
The Convergence of HPC and Deep Learning
PDF
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
PDF
Deep Learning Update May 2016
PDF
Enabling Artificial Intelligence - Alison B. Lowndes
PPTX
abelbrownnvidiarakuten2016-170208065814 (1).pptx
PDF
GOAI: GPU-Accelerated Data Science DataSciCon 2017
PDF
Harnessing AI for the Benefit of All.
PDF
Introduction to Deep Learning (NVIDIA)
PDF
Accelerating Data Science With GPUs
PPTX
End to End Machine Learning Open Source Solution Presented in Cisco Developer...
PDF
Phi Week 2019
PDF
Nvidia why every industry should be thinking about AI today
PDF
Possibilities of generative models
PDF
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
PDF
GTC 2017: Powering the AI Revolution
PDF
Deep Learning on the SaturnV Cluster
PDF
Tesla Accelerated Computing Platform
PDF
Nvidia at SEMICon, Munich
PDF
instruction of install Caffe on ubuntu
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
The Convergence of HPC and Deep Learning
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
Deep Learning Update May 2016
Enabling Artificial Intelligence - Alison B. Lowndes
abelbrownnvidiarakuten2016-170208065814 (1).pptx
GOAI: GPU-Accelerated Data Science DataSciCon 2017
Harnessing AI for the Benefit of All.
Introduction to Deep Learning (NVIDIA)
Accelerating Data Science With GPUs
End to End Machine Learning Open Source Solution Presented in Cisco Developer...
Phi Week 2019
Nvidia why every industry should be thinking about AI today
Possibilities of generative models
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
GTC 2017: Powering the AI Revolution
Deep Learning on the SaturnV Cluster
Tesla Accelerated Computing Platform
Nvidia at SEMICon, Munich
instruction of install Caffe on ubuntu
Ad

More from E-Commerce Brasil (20)

PPTX
Congresso Alimentos & Bebidas | O novo capítulo do e-commerce alimentar com o...
PPTX
Congresso Alimentos & Bebidas | Sucessão e transformação: o futuro das redes ...
PPTX
Congresso Alimentos & Bebidas | Varejo e distribuição de alimentos e bebidas:...
PDF
Congresso Moda&Beleza I Do caos ao sucesso - Transformando desafios em oportu...
PPTX
Congresso Moda&Beleza I O fim da compra tradicional: como a hiperpersonalizaç...
PPTX
Congresso Moda&BelezaI Case Modaliss: como ter sucesso com um e-commerce de moda
PDF
Congresso Saúde & Farma | Os desafios do marketing digital e da inovação no s...
PDF
Conferência Rio Grande do Sul I O Poder do CRM: transformando a cultura da em...
PPSX
Conferência Rio Grande do Sul I Inovação como estratégia para criar mercados ...
PPTX
Conferência Rio Grande do Sul I De fraude a fidelização: pagamentos seguros n...
PDF
Conferência Rio Grande do Sul I Como uma gestão Full Commerce impacta na expe...
PDF
Conferência Rio Grande do Sul I Tecnologia logística para o crescimento das v...
PPTX
Conferência Rio Grande do Sul I Fraude na Black Friday no Brasil e no Sul: in...
PDF
Fórum E-Commerce Brasil 2024 | Transformação Impulsionada pela Tecnologia/IA:...
PPTX
Fórum E-Commerce Brasil 2024 | Da IA à Geração Z: desbloqueando a experiência...
PPTX
Fórum E-Commerce Brasil 2024 | O Modelo Y - Onde Produto, Canal e Experiência...
PDF
Congresso Indústria Digital I A revolução dos pagamentos
PPTX
Congresso Indústria Digital I Como um checkout externo pode facilitar e agili...
PDF
Congresso Indústria Digital I Retail Media: como indústrias podem expandir su...
PDF
Congresso Indústria Digital I Perspectivas do Pix 2024: novidades e impactos ...
Congresso Alimentos & Bebidas | O novo capítulo do e-commerce alimentar com o...
Congresso Alimentos & Bebidas | Sucessão e transformação: o futuro das redes ...
Congresso Alimentos & Bebidas | Varejo e distribuição de alimentos e bebidas:...
Congresso Moda&Beleza I Do caos ao sucesso - Transformando desafios em oportu...
Congresso Moda&Beleza I O fim da compra tradicional: como a hiperpersonalizaç...
Congresso Moda&BelezaI Case Modaliss: como ter sucesso com um e-commerce de moda
Congresso Saúde & Farma | Os desafios do marketing digital e da inovação no s...
Conferência Rio Grande do Sul I O Poder do CRM: transformando a cultura da em...
Conferência Rio Grande do Sul I Inovação como estratégia para criar mercados ...
Conferência Rio Grande do Sul I De fraude a fidelização: pagamentos seguros n...
Conferência Rio Grande do Sul I Como uma gestão Full Commerce impacta na expe...
Conferência Rio Grande do Sul I Tecnologia logística para o crescimento das v...
Conferência Rio Grande do Sul I Fraude na Black Friday no Brasil e no Sul: in...
Fórum E-Commerce Brasil 2024 | Transformação Impulsionada pela Tecnologia/IA:...
Fórum E-Commerce Brasil 2024 | Da IA à Geração Z: desbloqueando a experiência...
Fórum E-Commerce Brasil 2024 | O Modelo Y - Onde Produto, Canal e Experiência...
Congresso Indústria Digital I A revolução dos pagamentos
Congresso Indústria Digital I Como um checkout externo pode facilitar e agili...
Congresso Indústria Digital I Retail Media: como indústrias podem expandir su...
Congresso Indústria Digital I Perspectivas do Pix 2024: novidades e impactos ...
Ad

Recently uploaded (20)

PDF
PakistanCoinageAct-906.pdfdbnsshsjjsbsbb
PPTX
Pin configuration and project related to
PDF
GENERATOR AND IMPROVED COIL THEREFOR HAVINGELECTRODYNAMIC PROPERTIES
PDF
ISS2022 present sdabhsa hsdhdfahasda ssdsd
PPTX
PLC ANALOGUE DONE BY KISMEC KULIM TD 5 .0
PDF
20A LG INR18650HJ2 3.6V 2900mAh Battery cells for Power Tools Vacuum Cleaner
PDF
CAB UNIT 1 with computer details details
PPTX
AIR BAG SYStYEM mechanical enginweering.pptx
PDF
Topic-1-Main-Features-of-Data-Processing.pdf
PPTX
unit1d-communitypharmacy-240815170017-d032dce8.pptx
PPTX
Grade 10 System Servicing for Hardware and Software
PPTX
New professional education PROF-ED-7_103359.pptx
PPTX
AI_ML_Internship_WReport_Template_v2.pptx
PPTX
Computers and mobile device: Evaluating options for home and work
PDF
SAHIL PROdhdjejss yo yo pdf TOCOL PPT.pdf
PPTX
Prograce_Present.....ggation_Simple.pptx
PPTX
Presentation 1.pptxnshshdhhdhdhdhdhhdhdhdhd
PPTX
RTS MASTER DECK_Household Convergence Scorecards. Use this file copy.pptx
PDF
ICT grade for 8. MATATAG curriculum .P2.pdf
PDF
Dozuki_Solution-hardware minimalization.
PakistanCoinageAct-906.pdfdbnsshsjjsbsbb
Pin configuration and project related to
GENERATOR AND IMPROVED COIL THEREFOR HAVINGELECTRODYNAMIC PROPERTIES
ISS2022 present sdabhsa hsdhdfahasda ssdsd
PLC ANALOGUE DONE BY KISMEC KULIM TD 5 .0
20A LG INR18650HJ2 3.6V 2900mAh Battery cells for Power Tools Vacuum Cleaner
CAB UNIT 1 with computer details details
AIR BAG SYStYEM mechanical enginweering.pptx
Topic-1-Main-Features-of-Data-Processing.pdf
unit1d-communitypharmacy-240815170017-d032dce8.pptx
Grade 10 System Servicing for Hardware and Software
New professional education PROF-ED-7_103359.pptx
AI_ML_Internship_WReport_Template_v2.pptx
Computers and mobile device: Evaluating options for home and work
SAHIL PROdhdjejss yo yo pdf TOCOL PPT.pdf
Prograce_Present.....ggation_Simple.pptx
Presentation 1.pptxnshshdhhdhdhdhdhhdhdhdhd
RTS MASTER DECK_Household Convergence Scorecards. Use this file copy.pptx
ICT grade for 8. MATATAG curriculum .P2.pdf
Dozuki_Solution-hardware minimalization.

Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito além do hardware.

  • 1. Jomar Silva - 2023 Tecnologias NVIDIA aplicadas ao e- commerce - Muito além do hardware
  • 2. 2 NVIDIA is built like a computing stack or neural network—in four layers: hardware, system software, platform software, and applications. Each layer is open to computer makers, service providers, and developers to integrate into their offerings however best for them. NVIDIA HPC RTX CUDA-X PHYSX RTX DGX HGX EGX OVX SUPER POD AGX GPU CPU DPU NIC SOC SWITCH NVIDIA AI NVIDIA OMNIVERSE ACCELERATED COMPUTING ACROSS THE FULL-STACK AND AT DATA CENTER SCALE
  • 3. NVIDIA ACCELERATED AI PLATFORM Accelerating AI from edge to datacenter to cloud FLEET COMMAND NVIDIA AI ENTERPRISE NVIDIA vGPU NVIDIA Magnum IO™ NVIDIA CUDA-X AI™ Infrastructure Optimization NVIDIA GPU Operator NVIDIA Network Operator Cloud-Native Deployment TensorFlow PyTorch NVIDIA RAPIDS™ NVIDIA TensorRT® NVIDIA Triton™ Inference Server AI and Data Science Tools and Frameworks APPLICATION FRAMEWORKS METROPOLIS Video Analytics ISAAC Robotics RAPIDS Data Science RIVA Conv AI/ASR/NLP/TTS MERLIN Recommendation Systems OMNIVERSE Simulation / Digital Twins GPU | NETWORKING | SECURITY | STORAGE Tencent Cloud PRE-TRAINED MODELS SYSTEM INTEGRATORS Consult with experts MAJOR CUSTOMERS In-house data science teams 150+ SOFTWARE VENDORS Pre-packaged AI solutions
  • 5. 5 THE CHALLENGES OF DATA SCIENCE TODAY PROBLEMS THAT HINDER DATA-DRIVEN ENTERPRISES It's Time Consuming Building, training, and iterating on models takes substantial time. As big data use cases continue to grow, CPU computational power becomes a major bottleneck. It's Costly Large-scale CPU infrastructure is incredibly expensive for conducting big data operations. With growing datasets, adding CPU infrastructure continues to increase costs. It's Frustrating Productionizing large-scale data processing operations is arduous. It often involves refactoring and hand-offs between teams adding more cycle time. These challenges increase cycle time and slow down time to insight.
  • 6. 6 NVIDIA RAPIDS TRANSFORMS DATA SCIENCE From Analytics to NVIDIA Accelerated Data Science
  • 7. 7 PYTHON TOOLS HAVE DEMOCRATIZED DATA SCIENCE ACCESSIBLE, EASY TO USE TOOLS ABSTRACT COMPLEXITY Analytics pandas Data Preparation Visualization Model Training Machine Learning scikit-learn Graph Analytics NetworkX Deep Learning TensorFlow, PyTorch, MxNet Vizualization CuXFILTER <> pyViz Dask CPU Memory Python is the most-used language in Data Science today. Libraries like NumPy, Scikit-Learn, and Pandas have changed how we think about accessibility in Data Science and Machine Learning. While great for experimentation, PyData tools lack the power necessary for enterprise-scale workloads. This leads to substantial refactoring to handle the size of modern problems, increasing cycle time, overhead, and time to insight. These pain points are further compounded by computational bottlenecks of CPU-based processing. Code refactors and inter-team handoffs decrease data-driven ROI
  • 8. 8 RAPIDS ACCELERATES POPULAR DATA SCIENCE TOOLS DELIVERING ENTERPRISE-GRADE DATA SCIENCE SOLUTIONS IN PURE PYTHON Pre-Processing cuIO & cuDF Data Preparation Visualization Model Training Machine Learning cuML Graph Analytics cuGRAPH Deep Learning TensorFlow, PyTorch, MxNet Vizualization CuXFILTER <> pyViz Dask GPU Memory The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces like PyData. With Dask, RAPIDS can scale out to multi-node, multi-GPU cluster to power through big data processes. RAPIDS enables the PyData stack with the power of NVIDIA GPUs
  • 9. 9 Accelerates PyData on NVIDIA GPUs NumPy -> CuPy/PyTorch/.. Pandas -> cuDF Scikit-Learn -> cuML Numba -> Numba RAPIDS Distributes and accelerates PyData Can be distributed across Multi-GPU on single node (DGX) or across a cluster Provides easy to use tooling enabling HPC-level performance RAPIDS + DASK Provides accessible, easy to use tooling NumPy, Pandas, Scikit-Learn, Numba and many more Single CPU core, in-memory data PYDATA Distributes PyData across multiple cores NumPy -> Dask Array Pandas -> Dask DataFrame Scikit-Learn -> Dask-ML … -> Dask Futures DASK Scale Up / Accelerate Scale Out / Parallelize SCALE OUT PYTHON TOOLS WITH RAPIDS + DASK DISTRIBUTE & ACCELERATE COMPUTATION FOR PRODUCTION WORKLOADS
  • 10. 10 RAPIDS INTEGRATES WITH DEEP LEARNING TOOLS BUILD COMPLEX WORKFLOWS WITHOUT LEAVING THE GPU mpi4py RAPIDS supports device memory sharing between many popular data science and deep learning libraries, such as PyTorch and TensorFlow. By providing native array_interface support, data can stay on the GPU avoiding costly copying back and forth to host memory. Data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that accept CUDA Array Interface protocol or work with DLPack, such as Chainer, MXNet, and PyTorch.
  • 12. NVTabular Pipelines are slow and complex Challenge Solution Inference Training Data Loading ETL Using common item- by-item loading can be slow High throughput to rank more items is difficult while maintaining low latency Embedding tables of large deep learning recommender systems can exceed memory GPU-accelerated and easy-to-use ETL pipelines prepares datasets in minutes Asynchronous and GPU-accelerated dataloader for PyTorch and TensorFlow/Keras Easy data and model parallel training allow to scale TB size embeddings High throughput, low-latency production deployment NVIDIA Merlin is an open-source library to deploy recommender systems end-2-end Triton HugeCTR NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
  • 13. DATA LAKE TRITON USER QUERY 10 RECOMMENDATIONS CANDIDATE GENERATION RANKING O(1000) O(Billions) EMBEDDINGS INFERENCE TRAINING DATA LOADER ETL NVTABULAR HUGECTR NVTABULAR VALIDATION MODEL ANALYSIS RAPIDS RAPIDS CUDNN RAPIDS NVIDIA Merlin Accelerates Every Stage in Recommender Pipeline
  • 14. 14 Merlin speeds up the entire pipeline ETL Minutes Optimized Spark (4x CPU node) NVTabular ETL (1xA100) NVIDIA Merlin provides 9-35x speed-up in ETL+Training+Inference RecSys models and easily scales to multiple GPUs Accelerating Training Scaling Accelerated Training Inference Tensorflow Data loader (1x A100) NVTabular Data loader (1x A100) CPU cluster (4x nodes) HugeCTR (1x DGX-A100) 21x 9x 24x 35x HugeCTR NVTabular HPS on Triton Speedup TF/PyT Plugins Minutes Minutes Throughput (samples/sec) at 10 ms latency PyTorch (2x CPU node) HugeCTR (1x A100)
  • 15. NVTabular Fast Feature Transforms & Dataloading of Tabular Data on GPU
  • 16. NVTABULAR: RECOMMENDER SYSTEM ETL ON GPU NVTabular What it is: Feature engineering and preprocessing library designed to quickly and easily manipulate terabytes of tabular data What it’s capable of: • Speed – GPU acceleration, 10x speedup compared to CPU, eliminate input bottleneck • Scale – No limit on dataset size (not bound by GPU or CPU memory) • Usability - Higher level abstraction, recommender systems oriented, fewer API calls are required to accomplish the same processing pipeline • Core Features - integration with PyTorch, TensorFlow, and HugeCTR; multi-hot encoding 1 2 3 4 Visualization of feature engineering and preprocessing pipeline for Criteo Click Ads Prediction dataset
  • 17. SCALE: NVTABULAR SCALES TO MULTI-GPUS AND MULTI-NODES USING TBS OF GPU MEMORY Example of NVTabular ETL with 2 nodes: 1. ETL starts with 1 node (8 GPUs) 2. ETL adds a 2nd node (8 GPUs, total=16GPUs) Dask dashboard monitors NVTabular ETL: ● Top-right: Visualization of GPUs interaction, small dot is one GPU (upto 16 GPUs in this example) ● Bottom-center: Utilized GPU memory and total GPU memory (upto 500GB in this example) ● Top-left: Tasks overtime Detailed Notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/multi-gpu_dask.ipynb Under the hood, NVTabular uses Dask and Dask-cuDF to provide a high-performance recommender system-specific ETL pipeline on multiple GPUs Graphic showing the dask dashboard when running NVTabular I II III I II II I
  • 18. SCALE: NVTABULAR SCALES EASILY TO MULTI-NODES AND REDUCES ETL TIME In our example, NVTabular scales to 128 GPUs (16 nodes x 8 GPUs) with total of 5.1 TB GPU memory (each GPU has 40GB GPU memory) Preliminary results Run on DGX A100, Dataset: Criteo 1TB (26 categorical features and 13 integer features) NVTabular v0.3 To enable distributed parallelism, we need to start a cluster and then connect to it to run the application. How it works: 1. Start the scheduler dask-scheduler 2. Start the workers dask-cuda-worker 3. Run the NVTabular application Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames
  • 19. USABILITY: NVTABULAR’S HIGH-LEVEL API IS 10-20 LINES Easy-to-use API. Find more details from examples here. Encode categorical variables using the defined thresholds and add metadata (tags) For continuous features - zero filling any nulls clip all values, log transform, normalize, add metadata (tags) Collect statistics on train dataset Transform train & valid dataset with statistics from train dataset Define dataset files Combine pipelines and initialize workflow Load schema by tags for model definition
  • 20. ONE PARAMETER TO SWITCH BETWEEN CPU AND GPU GPU CPU New users can easily try out NVTabular as CPU-mode runs on all infrastructures Users can develop on local machines on CPU-mode and push to cluster on GPU-mode
  • 21. 21 Interoperability: NVTabular’s output files can be used by PyTorch, TensorFlow or HugeCTR 3 Export from data lake parquet Huge CTR binary Training Data Loading csv parquet avro NVTabular with high-level API Dedicated DL framework ETL NVTabular Dataloader
  • 22. 22 Advantages of NVTabular ETL Scale Speed Usability Core Features ● Recommender focused APIs to easily implement the most common workflows ● 5-25 lines of code compared to other frameworks such as Pandas ● Examples published for common datasets and models ● Optimized TF, PyT, and HugeCTR dataloaders ● Native tabular data format support: CSV, parquet, orc, avro ● Building an easy path to production deployment for data transforms ● Supports datasets larger than host and GPU memory ● Scales easily to multi-GPU / multi-Nodes with terabytes of GPU memory ● Benchmark shows 100x-3000x speedup in comparison to CPU environments
  • 23. Dataloading with NVTabular Fast Dataloading of Tabular Data on GPU
  • 24. 24 NVTabular: GPU-accelerated dataloading NVTabular Dataloader What it is: GPU-accelerated and asynchronous dataloader for TensorFlow & PyTorch prepares quickly new data for training step to fully utilize the GPU What it’s capable of: • Speed – NVTabular dataloader is 10x faster compared to Tensorflow dataloader with GPU training • Scale - dataloaders enables training of larger than memory datasets by streaming chunks from disk • Usability - easy integration with TensorFlow, PyTorch and FastAI Core Feature – convergence behavior is similar to other framework dataloaders; supports common tabular data format parquet; no CPU-GPU communication as data is loaded directly into GPU 1 2 3 4 Visualization of idle times for NVTabular data loader Time
  • 25. 25 Speed: NVTabular dataloader is up to 9x faster in comparison to TensorFlow native version Benchmark: ● Training and validation have each 1 day of Criteo Ads Clicks Prediction dataset with 150M samples ● Each experiment uses GPUs for training the same neural network architecture Results: ● Only dataloading without training has 24x speed-up, which indicates the bottleneck in the dataloader. ● Training neural networks has 2.7x-9x speed-up, as data loader has time to prepare batch during training step. Overall dataloader is still the bottleneck in TensorFlow / PyTorch pipeline. 1 Comparison to TensorFlow
  • 26. 26 Scale: NVTabular supports larger than memory datasets by streaming data from disk 2 1 2 3 ... N Buffer k randomly selected chunks1 1) A file is stored in multiple chunks. A hyperparameter defines how many chunks are buffered by NVTabular dataloader 2 ... k Shuffle Preprocess Batch Final Batch Disk Storage (e.g. raid) GPU Asynchronous ● Buffering chunks from disk enables larger than host/GPU memory datasets ● Dataloading on GPU removes overhead of CPU-GPU communication ● Shuffle, preprocess and batch are faster on GPU ● Using random, multiple chunks increases randomness in batches
  • 27. 27 Usability: NVTabular dataloaders can be used in existing TensorFlow pipeline 3 Reserve GPU memory for NVTabular dataloader Helper function for defining tf.feature_columns Define dataset files Initialize NVTabular dataloader for training Define data schema with tf.feature_columns Initialize NVTabular dataloader for validation Get a TensorFlow model NVTabular dataloader requires a custom Validation Callback Train model with .fit() Steps which a different
  • 28. 28 Usability: NVTabular dataloaders can be used in existing PyTorch pipeline 3 Helper function for run one epoch Define dataset files Initialize NVTabular dataloader for training Initialize NVTabular dataloader for validation Get a PyTorch model Train and validate model Steps which a different
  • 29. 29 Core Feature: NVTabular has less idle times than other data loader 4 NVTabular data loader has no overhead to move data from CPU to GPU. Preparing batch on GPU is significantly faster. Synchronous CPU dataloader Asynchronous CPU dataloader Asynchronous GPU dataloader NVTabular Prepare CPU: Idle GPU: Train Idle Prepare Idle Train Idle Prepare Idle Train Idle Prepare CPU: Idle GPU: Train Prepare Idle Train Prepare Idle Train Prepare Idle Train Prepare Idle Prepare GPU: Idle GPU: Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Time
  • 30. 30 Advantages of NVTabular dataloaders Speed Scale Usability Core Features ● NVTabular dataloaders stream dataset in chunks from hard disk and thereby, they support larger than host/GPU memory datasets ● Seamless integration into existing TensorFlow or PyTorch pipelines ● NVTabular dataloaders speed-up TensorFlow pipelines by 9x ● NVTabular dataloaders speed-up PyTorch pipelines by 9x ● NVTabular dataloaders convergence in the same number of update steps by require less time ● Supports common tabular dataformats: parquet and csv ● Data is loaded directly into GPU to remove CPU-GPU communication
  • 31. Training on GPU at scale Accelerating training at scale with HugeCTR and Tensorflow Plugins
  • 32. 32 TRITON ETL NVTABULAR NATIVE HUGECTR RAPIDS CuDNN, CuDF, CuBLAS, NCCL … Merlin Overview - Make Recommenders on GPU Fast & Easy Merlin Models & Merlin Systems SparseOperationKit - TF1 TRT MERLIN DATA LOADER DistributedEmbeddings - TF2 HUGECTR Hierarchical Parameter Server TRAINING INFERENCE Transformer4Rec High Level Libraries E2E usability Merlin Partner Nvidia Products Lower Level Libraries Performance & Scalability HKV
  • 33. Huge CTR What it is: HugeCTR is an open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs. It is written in CUDA C++ and highly exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL. What it’s capable of: Speed: A single node DGX-A100 trains MLPerf v1.0 DLRM at 1.96 minutes Scale: Multi-Node Model Parallel, GPU embedding cache Usability: Keras-like Python API Core Features: Multi-slot Embedding with in-memory GPU Hashtable, Asynchronous and Multithreaded Data Pipeline, Inference & Embedding Cache, Tensorflow Embedding Plugin HUGECTR: RECOMMENDER SYSTEM TRAINING ON GPU 1 2 3 4
  • 34. SPEED: HUGECTR DLRM MLPERF WIN Features used to achieve MLPerf win: § Hybrid embedding § Fused MLP § Optimized collectives § Optimized data reader § Overlapping MLP with embedding § Whole-iteration CUDA graph At 0.99 minutes, HugeCTR on DGX-A100 is the fastest available system on MLPerf Recommender Benchmark 1 MLPerf v1.0 training results. Recommender task (training DLRM on the Criteo 1TB dataset). Bars represent speedup factor compared to a 4 CPU-node cluster. The higher the better. HugeCTR v3.1 running on single node of DGX-A100 with 8x A100 80GB GPU and 14 nodes of DGX-A100. Intel's CPU submission based on 4 nodes, each with 4X Intel(R) Xeon(R) Platinum 8376H CPU @ 2.60GHz with 6 UPI for a total of 16 CPUs.
  • 35. Scale: Multi-Node Model Parallel Dense Model Dense Model Dense Model Dense Model Sparse Model/Embedding GPU 0 GPU 1 GPU 2 GPU 3 Node 0 Model Parallel Data Parallel Dense Model Dense Model Dense Model Dense Model GPU 0 GPU 1 GPU 2 GPU 3 Node 1 To train a large-scale CTR estimation model, the HugeCTR embedding table can span multiple nodes beyond multiple GPUs (model parallelism). Each GPU has its own feed-forward neural network (data parallelism). 2
  • 36. USABILITY: KERAS-LIKE PYTHON API Easy to use Python API 3 1. Initialize Model solver = hugectr.CreateSolver(...) reader = hugectr.DataReaderParams(...) optimizer = hugectr.CreateOptimizer(...) model = hugectr.Model(solver, reader, optimizer) 2. Create Model graph model.add(hugectr.Input(...)) model.add(hugectr.SparseEmbedding(...)) model.add(hugectr.DenseLayer(...)) ... 3. Save model graph model.graph_to_json(“dlrm.json”) 4. Compile & Fit model.compile(...) model.fit(...) 1. Create InferenceSession inference_params = InferenceParams( max_batchsize = 4096, hit_rate_threshold = 0.8, dense_model_file = ..., sparse_model_files = ..., device_id = 0, use_gpu_embedding_cache = True, cache_size_percentage = 0.3) inference_session = CreateInferenceSession(“dlrm.json”, inference_params) 2. Make inference • From extracted VCSR inference_session.predict(dense, keys, row_ptrs) • From Norm/Parquet dataset inference_session.predict(...) inference_session.evaluate(...) Train Inference
  • 37. 37 Usability - Democratize Model Parallelism with Three lines! Try it today https://github.com/NVIDIA- Merlin/distributed-embeddings to easily train TB size model with speed! Learn more from our blog here https://developer.nvidia.com/blog/fast- terabyte-scale-recommender-training-made- easy-with-nvidia-merlin-distributed- embeddings/
  • 38. ASYNCHRONOUS, MULTITHREADED DATA PIPELINE Read File Copy to GPU Train Read File Copy to GPU Train Read File Copy to GPU Train Batch 0 Batch 1 Batch 2 Time CPU Memory GPU Memory Datasets Worker Worker Worker Worker Collector Model Training HugeCTR pipeline can overlap the data read from disk to CPU memory, data transfer from CPU to GPU and the actual training on GPU across different batches. 4
  • 39. Triton Fast and scalable AI in production
  • 40. MERLIN TRITON: EASY DEPLOYMENT OF ETL WORKFLOW AND DEEP LEARNING MODELS WITH GPU-ACCELERATION Triton Inference with Merlin What it is: Triton Inference Server simplifies the deployment of AI models at scale in production. It is an open source inference serving software that lets teams deploy trained AI models from any framework. Both NVTabular workflows and TensorFlow, PyTorch and HugeCTR models can be deployed What it’s capable of: • Speed – GPU acceleration for inference in production, HugeCTR embedding cache to keep frequent users/items embedding on GPU • Scale – Triton with HugeCTR can deploy larger-than-memory embedding tables with parameter server • Usability - Only few lines of code to deploy ensemble of ETL workflow and deep learning model • Core Features - maximize CPU/GPU utilization with real-time inference or batch inference; supports all major deep learning frameworks and custom builts; Kubernetes integration for orchestration, metrics, and auto-scaling; multiple models at one Visualization of deployed ensemble model of NVTabular workflow and HugeCTR model. Alternatively, TensorFlow or PyTorch model can be deployed Triton 1 2 3 4
  • 41. MERLIN DEPLOYS AN ENSEMBLE MODEL FOR THE SAME ETL TRANSFORMATION IN PRODUCTION
[Diagram: a request flows through the deployed NVTabular workflow and the trained model inside Triton and returns a prediction; training-time statistics are reused in production.]
● Easy deployment to production with only a few lines of code (see the export sketch below)
● NVTabular collects statistics, such as the mean/std of numeric features and mapping tables for categorical features
● The same ETL workflow needs to be applied in production
● Triton supports TensorFlow and PyTorch, too
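[Editor's note] A minimal sketch of exporting such an ensemble with the merlin-systems library, assuming an already-fit NVTabular workflow and a trained TensorFlow model. The operator names (TransformWorkflow, PredictTensorflow, Ensemble) follow the Merlin Systems examples, but verify them against the version you install; `workflow` and `model` are placeholders.

    # Assumes: `workflow` is a fit nvtabular.Workflow and `model` is a trained
    # tf.keras model; both are placeholders from the Merlin examples.
    from merlin.systems.dag.ensemble import Ensemble
    from merlin.systems.dag.ops.workflow import TransformWorkflow
    from merlin.systems.dag.ops.tensorflow import PredictTensorflow

    # Chain the ETL workflow and the model into one serving graph: requests
    # are transformed exactly as at training time, then scored by the model.
    serving_graph = (
        workflow.input_schema.column_names
        >> TransformWorkflow(workflow)
        >> PredictTensorflow(model)
    )

    # Write Triton model-repository artifacts (config.pbtxt + model files)
    # for the workflow, the model, and the ensemble tying them together.
    ensemble = Ensemble(serving_graph, workflow.input_schema)
    ensemble.export("/models")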
  • 42. Merlin Overview - Make Recommenders on GPU Fast & Easy
[Stack diagram: high-level libraries (Merlin Models & Merlin Systems, Transformer4Rec, Merlin Data Loader) sit above training components (ETL with NVTabular, native HugeCTR, SOK for TF1, DE for TF2) and inference components (Triton, HPS, TRT), all built on lower-level libraries (RAPIDS: cuDNN, cuDF, cuBLAS, NCCL, ...); the legend distinguishes Merlin, partner, and other NVIDIA products.]
  • 43. Options of Running Inference

Option                           | Pros                                              | Cons
CPU for Embedding & MLP          | Runs everywhere                                   | Slow per socket
CPU for Embedding + GPU for MLP  | Easier to switch to GPU for MLP; makes use of all existing resources | Embedding is not accelerated
GPU for both Embedding & MLP     | Most performant, especially for large models      | Not supported by native TF

That's why we have the HugeCTR Hierarchical Parameter Server!
  • 44. Input Features Follow a Power-Law Distribution
Only a small fraction of parameters is accessed most frequently. These "hot parameters" are placed in the GPU cache. In the Criteo 1TB dataset, 305K out of 188M (0.16%) categories account for 95.9% of samples (a verification sketch follows below).
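[Editor's note] A small sketch of how you might verify such a coverage number on your own data: count category frequencies, sort descending, and find how many top categories cover 95.9% of interactions. The column name and the Zipf-distributed toy data are placeholders.

    import numpy as np
    import pandas as pd

    # Hypothetical interaction log with one categorical column.
    df = pd.DataFrame({"category": np.random.zipf(1.2, size=1_000_000)})

    counts = df["category"].value_counts()       # frequency per category, descending
    coverage = counts.cumsum() / counts.sum()    # cumulative share of samples
    n_hot = int((coverage < 0.959).sum()) + 1    # top categories covering 95.9%

    print(f"{n_hot} of {counts.size} categories "
          f"({n_hot / counts.size:.2%}) cover 95.9% of samples")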
  • 45. Hierarchical Parameter Server (HPS) - A "Giant" Key-Value Store!
- Each GPU has its own lookup session and its own embedding cache, with no inter-GPU communication
- Reduces synchronization time
- Ensures high system availability
- CPU memory and SSDs can be configured by users as extended embedding storage
- HPS supports embedding lookup for a variety of models/frameworks (a tiered-lookup sketch follows below)
[Diagram: per-GPU lookup sessions and embedding caches (GPUs 0-3) backed by CPU memory (e.g., Redis, HashMap) and SSD (e.g., RocksDB, HDFS).]
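[Editor's note] This is not the HugeCTR HPS API; it is a minimal Python sketch of the tiered-lookup idea only: check the fast GPU-resident cache first, fall back to CPU memory, then to SSD-backed storage, promoting hot keys upward. All tiers here are plain dicts standing in for real backends.

    class TieredEmbeddingStore:
        """Illustrative 3-tier lookup: GPU cache -> CPU memory -> SSD store."""

        def __init__(self, gpu_capacity):
            self.gpu_cache = {}        # stand-in for a per-GPU embedding cache
            self.cpu_store = {}        # stand-in for Redis / an in-memory hash map
            self.ssd_store = {}        # stand-in for RocksDB / HDFS
            self.gpu_capacity = gpu_capacity

        def lookup(self, key):
            if key in self.gpu_cache:                 # hot path: already on GPU
                return self.gpu_cache[key]
            vector = self.cpu_store.get(key)          # warm path: CPU memory
            if vector is None:
                vector = self.ssd_store[key]          # cold path: SSD
                self.cpu_store[key] = vector          # promote to the CPU tier
            if len(self.gpu_cache) < self.gpu_capacity:
                self.gpu_cache[key] = vector          # promote hot key to GPU tier
            return vector

    store = TieredEmbeddingStore(gpu_capacity=2)
    store.ssd_store.update({k: [0.1 * k] * 4 for k in range(10)})
    print(store.lookup(3))   # cold lookup, then promoted
    print(store.lookup(3))   # now served from the GPU cache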
  • 48. Merlin Models: Model Training Library
Merlin Models
What it is: High-quality implementations of models ranging from classical machine learning to more advanced deep learning, in TensorFlow.
What it's capable of:
▪ Implementations of common architectures, loss functions, and tasks
▪ A unified API that lets users create models in various frameworks and libraries: TensorFlow, XGBoost, Implicit, and LightFM
▪ Flexible building blocks to design custom architectures from common components
▪ GPU-optimized data loading and model training
▪ Integration with NVTabular
[Diagram: visualization of Merlin Models; a training sketch follows below.]
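[Editor's note] A minimal training sketch with the Merlin Models TensorFlow API, following the library's quick-start; treat the exact signatures as assumptions to check against your installed version. The Parquet paths and the "click" target name are placeholders for an NVTabular-processed dataset.

    import merlin.models.tf as mm
    from merlin.io import Dataset

    # Placeholder paths: data preprocessed by NVTabular, whose schema tags
    # categorical/continuous features and the binary "click" target.
    train = Dataset("train.parquet")
    valid = Dataset("valid.parquet")

    # A DLRM assembled from Merlin Models building blocks.
    model = mm.DLRMModel(
        train.schema,
        embedding_dim=64,
        bottom_block=mm.MLPBlock([128, 64]),
        top_block=mm.MLPBlock([128, 64, 32]),
        prediction_tasks=mm.BinaryClassificationTask("click"),
    )

    model.compile(optimizer="adagrad", run_eagerly=False)
    model.fit(train, validation_data=valid, batch_size=16 * 1024)
    model.evaluate(valid, batch_size=16 * 1024, return_dict=True)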
  • 49. Fast Iteration is Key
Merlin enables quick experimentation cycles to find high-accuracy models.
[Diagram: iteration loop of Input → Feature Engineering → Model Training → Evaluation → Deployment; NVTabular plus Merlin Models / HugeCTR / Transformer4Rec deliver roughly 10x speedups per stage, with Triton Inference Server handling deployment.]
  • 50. Example Model Architectures in Merlin Models
Facebook DLRM, Google DCN, YouTube DNN
  • 51. Transformers4Rec: Sequential/session-based recommendation with NLP Transformer architectures
  • 52. SESSION-BASED RECOMMENDATION TASK
Problem: Users' intentions change between sessions, and online services usually have access only to the current session (e.g., due to GDPR).
Goal: Build a recommender system that can leverage the short sequence of past interactions within the same session and dynamically adapt the next suggested item.
  • 53. NLP X SEQUENTIAL RECSYS
[Fig 2. The influence of NLP research on recommender systems: a 2013-2021 timeline pairing NLP milestones (Word2Vec, GRU, Attention, Transformer, BERT, GPT-2, XLNet, the HuggingFace Transformers library) with seminal RecSys neural architectures (Prod2Vec, Meta-Prod2Vec, GRU4Rec, NARM, ATRank, AttRec, SASRec, BERT4Rec, and the NVIDIA Transformers4Rec library).]
  • 54. WHAT IS TRANSFORMERS4REC?
Transformers4Rec
§ is a flexible and efficient open source library for sequential and session-based recommendation,
§ makes state-of-the-art Transformer architectures available to the RecSys community,
§ is available for both PyTorch and TensorFlow.
https://github.com/NVIDIA-Merlin/Transformers4Rec
  • 55. TRANSFORMERS4REC IS MULTI-FRAMEWORK AND EASY TO USE
[Side-by-side PyTorch and TensorFlow code examples; a PyTorch sketch follows below.]
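[Editor's note] A minimal PyTorch sketch following the Transformers4Rec quick-start; the module names (TabularSequenceFeatures, XLNetConfig, NextItemPredictionTask) come from the library's README, the schema path is a placeholder, and the exact signatures should be verified against your installed version.

    from transformers4rec import torch as tr
    from merlin_standard_lib import Schema

    # Placeholder: a schema describing the sequential item/session features.
    schema = Schema().from_proto_text("schema.pbtxt")

    max_seq_len, d_model = 20, 64

    # Turn tabular sequence features into Transformer inputs, with causal
    # language modeling (CLM) masking for next-item prediction.
    input_module = tr.TabularSequenceFeatures.from_schema(
        schema,
        max_sequence_length=max_seq_len,
        continuous_projection=d_model,
        aggregation="concat",
        masking="clm",
    )

    prediction_task = tr.NextItemPredictionTask(weight_tying=True)

    # Reuse an XLNet Transformer body, configured for RecSys-sized inputs.
    transformer_config = tr.XLNetConfig.build(
        d_model=d_model, n_head=4, n_layer=2, total_seq_length=max_seq_len
    )
    model = transformer_config.to_torch_model(input_module, prediction_task)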
  • 56. Transformers4Rec Workflow Pipeline
Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation.
[Fig 3. End-to-end session-based recommendation pipeline]
  • 58. PERSONALIZED PRODUCT RECOMMENDATIONS
Olay is arming consumers with knowledge to make informed purchase decisions. Its Olay Skin Advisor, a GPU-accelerated AI tool that works on any mobile device, assesses a user-provided selfie and advises how to improve trouble areas using a daily regimen of recommended Olay products. After four weeks, 94% of Skin Advisor users continued to use the products it recommended.
  • 59. IMPROVING DEMAND FORECASTS
With more than 100,000 different products in its 4,700 U.S. stores, the Walmart Labs data science team predicts demand for 500 million item-by-store combinations every week. By performing forecasting with the open source RAPIDS data processing and machine learning libraries, built on CUDA-X AI and running on NVIDIA GPUs, Walmart speeds up feature engineering 100x and trains machine learning algorithms 20x faster, resulting in faster delivery of products, real-time reaction to shopper trends, and inventory cost savings at scale.
  • 60. MODERNIZING THE WAREHOUSE
In 2019, global ecommerce represented $3.4 trillion, or 13.7%, of the $25 trillion retail sales market; Oberlo predicts this share will grow to 17.5% by 2021. With thousands of orders placed every hour, data scientists at Zalando, Europe's leading online fashion retailer, applied deep learning powered by NVIDIA GPUs to develop the Optimal Cart Pick algorithm, which reduced workers' travel time per item picked by 11%. The work is a good example of the efficiencies AI can uncover for e-commerce, manufacturing, and other large-systems-based industries.
  • 62. NVIDIA DEVELOPER PROGRAM - JOIN THE COMMUNITY THAT'S CHANGING THE WORLD
TOOLS
• Get exclusive access to an extensive library of NVIDIA software, spanning all of NVIDIA's technology platforms.
• Save time with ready-to-run, GPU-optimized software, model scripts, and containerized apps from the NVIDIA NGC™ catalog.
• Participate in early access programs where you can be one of the first to experience the latest NVIDIA technology.
TRAINING
• Take advantage of research papers, technical documentation, developer blogs, and industry-specific resources.
• Choose from a broad catalog of training options through the NVIDIA Deep Learning Institute (DLI).
• Get unlimited access to NVIDIA On-Demand, the home for NVIDIA resources from GTC and other leading industry events.
COMMUNITY
• Network with like-minded developers, engage with GPU experts, and contribute to discussions in the developer forums.
• Attend exclusive meetups, GPU hackathons, and events.
• Connect with NVIDIA experts through developer-focused webinars and instructor-led workshops.
Join the free program: developer.nvidia.com/join
  • 65. NVIDIA INCEPTION HELPS TECH STARTUPS BUILD AND GROW FASTER
Open to tech startups of all stages working in AI, deep learning, AR/VR, gaming, networking, and graphics.
15,000+ startups worldwide | 100+ countries represented | 37% program growth in 2022 | $94B+ in cumulative funding
  • 66. INCEPTION HELPS YOU ACHIEVE SUCCESS AT EVERY STAGE - NVIDIA INCEPTION
BUILD: Get free technical training, engineering guidance, and discounts on technology to build your solutions faster.
GROW: Accelerate growth with opportunities to increase your company's market awareness and gain exposure to VC investors.
SCALE: Premier members enjoy an enhanced set of go-to-market and technical guidance benefits to help them scale faster.
  • 68. NVIDIA DEEP LEARNING INSTITUTE FOR STUDENTS, RESEARCHERS AND PROFESSORS
  • 69. HELPING YOU SOLVE YOUR MOST CHALLENGING PROBLEMS
Hands-On Training in AI, Accelerated Computing, Accelerated Data Science, Graphics and Simulation, and More
• Build and deploy end-to-end projects across a range of technologies and domains.
• Gain hands-on experience with the most widely used, industry-standard software, tools, and frameworks.
• Join live, instructor-led workshops and learn from DLI-certified instructors who are experts in their fields.
• Take self-paced, online courses anytime, anywhere.
• Earn NVIDIA DLI certificates to demonstrate subject-matter competency and support career growth.
Course areas: Deep Learning Fundamentals | AI for Anomaly Detection | AI for Industrial Inspection | Conversational AI | AI for Intelligent Video Analytics | Accelerated Computing Fundamentals | Recommender Systems | AI for Predictive Maintenance | Accelerated Data Science Fundamentals | Graphics and Simulation | Networking | AI in the Data Center
Learn more: www.nvidia.com/dli