Jomar Silva - 2023
NVIDIA Technologies Applied to E-commerce - Far Beyond the Hardware
2
NVIDIA is built like a computing stack or neural
network—in four layers: hardware, system
software, platform software, and applications.
Each layer is open to computer makers, service
providers, and developers to integrate into their
offerings however works best for them.
[Stack diagram, bottom to top: hardware (GPU, CPU, DPU, NIC, SoC, switch); systems (RTX, DGX, HGX, EGX, OVX, SuperPOD, AGX); platform software (RTX, CUDA-X, PhysX); applications (NVIDIA HPC, NVIDIA AI, NVIDIA Omniverse)]
ACCELERATED COMPUTING ACROSS THE FULL STACK AND AT DATA CENTER SCALE
NVIDIA ACCELERATED AI PLATFORM
Accelerating AI from edge to data center to cloud
[Platform diagram:
- Application frameworks: METROPOLIS (video analytics), ISAAC (robotics), RAPIDS (data science), RIVA (conversational AI/ASR/NLP/TTS), MERLIN (recommendation systems), OMNIVERSE (simulation / digital twins)
- NVIDIA AI ENTERPRISE: AI and data science tools and frameworks (TensorFlow, PyTorch, NVIDIA RAPIDS™, NVIDIA TensorRT®, NVIDIA Triton™ Inference Server); cloud-native deployment (NVIDIA GPU Operator, NVIDIA Network Operator); infrastructure optimization (NVIDIA vGPU, NVIDIA Magnum IO™, NVIDIA CUDA-X AI™); managed with NVIDIA Fleet Command
- Infrastructure: GPU | NETWORKING | SECURITY | STORAGE
- Ecosystem: PRE-TRAINED MODELS; SYSTEM INTEGRATORS (consult with experts); MAJOR CUSTOMERS (in-house data science teams); 150+ SOFTWARE VENDORS (pre-packaged AI solutions)]
4
https://rapids.ai
5
THE CHALLENGES OF DATA SCIENCE TODAY
PROBLEMS THAT HINDER DATA-DRIVEN ENTERPRISES
It's Time Consuming
Building, training, and iterating on
models takes substantial time. As
big data use cases continue to
grow, CPU computational power
becomes a major bottleneck.
It's Costly
Large-scale CPU infrastructure is
incredibly expensive for conducting
big data operations. With growing
datasets, adding CPU infrastructure
continues to increase costs.
It's Frustrating
Productionizing large-scale data
processing operations is arduous. It
often involves refactoring and
hand-offs between teams, adding
more cycle time.
These challenges increase cycle time and slow down time to insight.
6
NVIDIA RAPIDS TRANSFORMS DATA SCIENCE
From Analytics to NVIDIA Accelerated Data Science
7
PYTHON TOOLS HAVE DEMOCRATIZED DATA SCIENCE
ACCESSIBLE, EASY TO USE TOOLS ABSTRACT COMPLEXITY
[Diagram: the PyData stack in CPU memory, spanning data preparation, model training, and visualization: pandas (analytics), scikit-learn (machine learning), NetworkX (graph analytics), TensorFlow/PyTorch/MXNet (deep learning), cuxfilter <> pyViz (visualization), and Dask]
Python is the most-used language in
Data Science today. Libraries like
NumPy, Scikit-Learn, and Pandas have
changed how we think about accessibility
in Data Science and Machine Learning.
While great for experimentation, PyData
tools lack the power necessary for
enterprise-scale workloads. This leads to
substantial refactoring to handle the size
of modern problems, increasing cycle
time, overhead, and time to insight.
These pain points are further
compounded by computational
bottlenecks of CPU-based processing.
Code refactors and inter-team handoffs decrease data-driven ROI
8
RAPIDS ACCELERATES POPULAR DATA SCIENCE TOOLS
DELIVERING ENTERPRISE-GRADE DATA SCIENCE SOLUTIONS IN PURE PYTHON
[Diagram: the RAPIDS stack in GPU memory, spanning data preparation, model training, and visualization: cuIO & cuDF (pre-processing), cuML (machine learning), cuGraph (graph analytics), TensorFlow/PyTorch/MXNet (deep learning), cuxfilter <> pyViz (visualization), and Dask]
The RAPIDS suite of open source
software libraries gives you the freedom
to execute end-to-end data science and
analytics pipelines entirely on GPUs.
RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly, PyData-style Python interfaces.
With Dask, RAPIDS can scale out to multi-node, multi-GPU clusters to power through big data workloads.
RAPIDS enables the PyData stack with the power of NVIDIA GPUs
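To make the drop-in nature concrete, here is a minimal sketch of the pandas-to-cuDF swap, assuming a CUDA-capable GPU and the RAPIDS cudf package; the CSV path and column names are hypothetical:

import cudf

# pandas-style analytics, executed on the GPU
df = cudf.read_csv("transactions.csv")            # data lands in GPU memory
df["total"] = df["price"] * df["quantity"]        # columnar math on the GPU
top = df.groupby("user_id")["total"].sum()        # GPU-accelerated groupby
print(top.sort_values(ascending=False).head(10))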
9
SCALE OUT PYTHON TOOLS WITH RAPIDS + DASK
DISTRIBUTE & ACCELERATE COMPUTATION FOR PRODUCTION WORKLOADS
- PYDATA: accessible, easy-to-use tooling (NumPy, Pandas, Scikit-Learn, Numba, and many more); single CPU core, in-memory data
- RAPIDS (scale up / accelerate): accelerates PyData on NVIDIA GPUs; NumPy -> CuPy/PyTorch/..., Pandas -> cuDF, Scikit-Learn -> cuML, Numba -> Numba
- DASK (scale out / parallelize): distributes PyData across multiple cores; NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, ... -> Dask Futures
- RAPIDS + DASK: distributes and accelerates PyData across multiple GPUs on a single node (DGX) or across a cluster; easy-to-use tooling enabling HPC-level performance
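A minimal sketch of the RAPIDS + Dask combination, assuming the dask-cuda and dask-cudf packages are installed; the Parquet glob path and column names are hypothetical:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()   # starts one Dask worker per local GPU
client = Client(cluster)

# dask_cudf partitions the dataset across workers, so it can exceed
# the memory of any single GPU
ddf = dask_cudf.read_parquet("events/*.parquet")
mean_clicks = ddf.groupby("item_id")["clicks"].mean().compute()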
10
RAPIDS INTEGRATES WITH DEEP LEARNING TOOLS
BUILD COMPLEX WORKFLOWS WITHOUT LEAVING THE GPU
RAPIDS supports device memory sharing
between many popular data science
and deep learning libraries, such as
PyTorch and TensorFlow. By providing
native array_interface support, data can stay on the GPU, avoiding costly copies back and forth to host memory.
Data stored in Apache Arrow can be
seamlessly pushed to deep learning
frameworks that accept the CUDA Array Interface protocol or work with DLPack, such as Chainer, MXNet, and PyTorch.
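A minimal sketch of this zero-copy hand-off, assuming cudf, cupy, and a CUDA-enabled torch are installed:

import cudf
import cupy as cp
import torch

gdf = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})   # lives in GPU memory

# cuDF -> CuPy via the CUDA array interface: no copy to host memory
arr = cp.asarray(gdf["x"].values)

# CuPy -> PyTorch via DLPack: the tensor still references the same GPU memory
tensor = torch.utils.dlpack.from_dlpack(arr.toDlpack())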
NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
- ETL: Challenge: pipelines are slow and complex. Solution (NVTabular): GPU-accelerated, easy-to-use ETL pipelines prepare datasets in minutes.
- Data Loading: Challenge: common item-by-item loading can be slow. Solution: an asynchronous, GPU-accelerated dataloader for PyTorch and TensorFlow/Keras.
- Training: Challenge: embedding tables of large deep learning recommender systems can exceed memory. Solution (HugeCTR): easy data- and model-parallel training scales to TB-size embeddings.
- Inference: Challenge: achieving high throughput to rank more items is difficult while maintaining low latency. Solution (Triton): high-throughput, low-latency production deployment.
NVIDIA Merlin is an open-source library to deploy recommender systems end to end.
NVIDIA Merlin Accelerates Every Stage in the Recommender Pipeline
[Pipeline diagram: a data lake feeds ETL (NVTabular, on RAPIDS), the data loader (NVTabular, on RAPIDS), training (HugeCTR, on cuDNN), validation and model analysis (on RAPIDS), and inference (Triton). At serving time, a user query goes through candidate generation, which narrows O(billions) of items to O(1000) via embeddings, and ranking, which returns the top 10 recommendations.]
14
Merlin speeds up the entire pipeline
[Benchmark chart:
- ETL (minutes): Optimized Spark (4x CPU nodes) vs. NVTabular ETL (1x A100): 21x speedup
- Accelerating training (minutes): TensorFlow data loader (1x A100) vs. NVTabular data loader / TF-PyT plugins (1x A100): 9x speedup
- Scaling accelerated training (minutes): CPU cluster (4x nodes) vs. HugeCTR (1x DGX A100): 24x speedup
- Inference (throughput in samples/sec at 10 ms latency): PyTorch (2x CPU nodes) vs. HugeCTR HPS on Triton: 35x speedup]
NVIDIA Merlin provides a 9-35x speed-up across ETL, training, and inference for RecSys models and easily scales to multiple GPUs.
NVTabular
Fast Feature Transforms & Dataloading of Tabular Data on GPU
NVTABULAR: RECOMMENDER SYSTEM ETL ON GPU
NVTabular
What it is:
Feature engineering and preprocessing library designed to quickly and
easily manipulate terabytes of tabular data
What it’s capable of:
• Speed – GPU acceleration, 10x speedup compared to CPU, eliminating the input bottleneck
• Scale – No limit on dataset size (not bound by GPU or CPU
memory)
• Usability - Higher level abstraction, recommender systems
oriented, fewer API calls are required to accomplish the
same processing pipeline
• Core Features - integration with PyTorch, TensorFlow, and
HugeCTR; multi-hot encoding
Visualization of the feature engineering and preprocessing pipeline for the Criteo Click Ads Prediction dataset
SCALE: NVTABULAR SCALES TO MULTI-GPUS AND
MULTI-NODES USING TBS OF GPU MEMORY
Example of NVTabular ETL with 2 nodes:
1. ETL starts with 1 node (8 GPUs)
2. ETL adds a 2nd node (8 GPUs, total=16GPUs)
The Dask dashboard monitors NVTabular ETL:
● Top-right: visualization of GPU interaction; each small dot is one GPU (up to 16 GPUs in this example)
● Bottom-center: utilized GPU memory and total GPU memory (up to 500 GB in this example)
● Top-left: tasks over time
Detailed Notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/multi-gpu_dask.ipynb
Under the hood, NVTabular uses Dask and Dask-cuDF to provide a
high-performance recommender system-specific ETL pipeline on multiple GPUs
Graphic showing the Dask dashboard when running NVTabular
SCALE: NVTABULAR SCALES EASILY TO MULTI-NODES
AND REDUCES ETL TIME
In our example, NVTabular scales to 128 GPUs (16 nodes x 8 GPUs) with a total of 5.1 TB of GPU memory (each GPU has 40 GB of GPU memory)
Preliminary results
Run on DGX A100,
Dataset: Criteo 1TB (26 categorical features and 13 integer features)
NVTabular v0.3
To enable distributed parallelism, we need to start a cluster and then connect to it to run the application.
How it works:
1. Start the scheduler: dask-scheduler
2. Start the workers: dask-cuda-worker
3. Run the NVTabular application (see the sketch below)
Dask is a flexible library for parallel computing in Python
that makes scaling out your workflow smooth and simple.
Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames.
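A minimal sketch of step 3 from the Python side, assuming the scheduler and dask-cuda workers from steps 1 and 2 are already running; the scheduler address, column names, and paths are hypothetical, and the client keyword reflects older NVTabular releases:

from dask.distributed import Client
import nvtabular as nvt

client = Client("tcp://scheduler-node:8786")   # connect to the running cluster

# NVTabular partitions its Dask-cuDF work across all connected GPUs
features = ["I1", "I2"] >> nvt.ops.FillMissing() >> nvt.ops.Normalize()
workflow = nvt.Workflow(features, client=client)
workflow.fit(nvt.Dataset("criteo/*.parquet", part_size="1GB"))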
USABILITY: NVTABULAR’S HIGH-LEVEL API IS 10-20 LINES
Easy-to-use API. Find more details in the examples in the NVTabular repository; a sketch of the full pipeline follows the step annotations below.
Encode categorical variables
using the defined thresholds and
add metadata (tags)
For continuous features -
zero filling any nulls
clip all values,
log transform,
normalize,
add metadata (tags)
Collect statistics on train dataset
Transform train & valid dataset
with statistics from train dataset
Define dataset files
Combine pipelines and initialize
workflow
Load schema by tags for model
definition
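Pieced together, the annotated pipeline might look like the following sketch, assuming the nvtabular package; the Criteo-style column names, thresholds, and paths are hypothetical:

import nvtabular as nvt

# Continuous features: zero-fill nulls, clip, log-transform, normalize
conts = ["I1", "I2"] >> nvt.ops.FillMissing(fill_val=0) \
                     >> nvt.ops.Clip(min_value=0) \
                     >> nvt.ops.LogOp() \
                     >> nvt.ops.Normalize()

# Categorical features: encode using a frequency threshold
cats = ["C1", "C2"] >> nvt.ops.Categorify(freq_threshold=10)

# Combine the pipelines and initialize the workflow
workflow = nvt.Workflow(conts + cats)

# Collect statistics on the train dataset, then transform train & valid
train = nvt.Dataset("train/*.parquet")
valid = nvt.Dataset("valid/*.parquet")
workflow.fit(train)
workflow.transform(train).to_parquet("train_out/")
workflow.transform(valid).to_parquet("valid_out/")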
ONE PARAMETER TO SWITCH BETWEEN CPU AND GPU
New users can easily try out NVTabular, as CPU mode runs on any infrastructure; users can develop on local machines in CPU mode and push to a cluster in GPU mode, as in the sketch below.
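A minimal sketch of the switch, assuming the cpu flag on nvt.Dataset; the paths are hypothetical:

import nvtabular as nvt

gpu_ds = nvt.Dataset("data/*.parquet")             # default: cuDF on the GPU
cpu_ds = nvt.Dataset("data/*.parquet", cpu=True)   # same pipeline on pandas/CPU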
21
Interoperability: NVTabular’s output files can be used
by PyTorch, TensorFlow or HugeCTR
[Diagram: data exported from the data lake (csv, parquet, avro) flows through NVTabular ETL with its high-level API; the output (parquet, or HugeCTR binary) feeds data loading and training via the NVTabular dataloader or a dedicated DL framework]
22
Advantages of NVTabular ETL
● Speed: benchmarks show 100x-3000x speedups in comparison to CPU environments
● Scale: supports datasets larger than host and GPU memory; scales easily to multi-GPU / multi-node with terabytes of GPU memory
● Usability: recommender-focused APIs to easily implement the most common workflows; 5-25 lines of code compared to other frameworks such as Pandas; examples published for common datasets and models
● Core features: optimized TF, PyTorch, and HugeCTR dataloaders; native tabular data format support (CSV, Parquet, ORC, Avro); building an easy path to production deployment for data transforms
Dataloading with NVTabular
Fast Dataloading of Tabular Data on GPU
24
NVTabular: GPU-accelerated dataloading
NVTabular Dataloader
What it is:
GPU-accelerated, asynchronous dataloader for TensorFlow & PyTorch that quickly prepares new data for the training step, fully utilizing the GPU
What it's capable of:
• Speed – the NVTabular dataloader is 10x faster than the TensorFlow dataloader with GPU training
• Scale – dataloaders enable training on larger-than-memory datasets by streaming chunks from disk
• Usability – easy integration with TensorFlow, PyTorch, and FastAI
• Core Feature – convergence behavior is similar to other framework dataloaders; supports the common tabular data format Parquet; no CPU-GPU communication, as data is loaded directly onto the GPU
Visualization of idle times for the NVTabular data loader
25
Speed: The NVTabular dataloader is up to 9x faster than the native TensorFlow version
Benchmark:
● Training and validation each use 1 day of the Criteo Ads Click Prediction dataset, with 150M samples
● Each experiment uses GPUs to train the same neural network architecture
Results:
● Dataloading alone, without training, shows a 24x speed-up, which indicates that the bottleneck is in the dataloader.
● Training neural networks shows a 2.7x-9x speed-up, as the data loader can prepare the next batch during the training step. Overall, the dataloader is still the bottleneck in the TensorFlow / PyTorch pipeline.
Comparison to TensorFlow
26
Scale: NVTabular supports larger-than-memory datasets by streaming data from disk
[Diagram: a file is stored in N chunks on disk storage (e.g., RAID); the dataloader asynchronously buffers k randomly selected chunks into GPU memory, where they are shuffled, preprocessed, and batched into the final batch. A hyperparameter defines how many chunks the NVTabular dataloader buffers.]
● Buffering chunks from disk
enables larger than host/GPU
memory datasets
● Dataloading on GPU removes
overhead of CPU-GPU
communication
● Shuffle, preprocess and batch
are faster on GPU
● Using multiple random chunks increases randomness in batches
27
Usability: NVTabular dataloaders can be used in an existing TensorFlow pipeline
Reserve GPU memory for NVTabular dataloader
Helper function for defining tf.feature_columns
Define dataset files
Initialize NVTabular dataloader for training
Define data schema with tf.feature_columns
Initialize NVTabular dataloader for validation
Get a TensorFlow model
NVTabular dataloader requires a custom
Validation Callback
Train model with .fit()
Steps which are different are highlighted; a sketch follows.
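The steps above roughly correspond to the following sketch, assuming NVTabular's Keras loader API; the paths, column names, bucket sizes, batch size, and toy model are hypothetical:

import os
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"   # reserve GPU memory for the loader

import tensorflow as tf
from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater

def make_loader(path, shuffle):
    return KerasSequenceLoader(
        path,
        batch_size=65536,
        label_names=["click"],
        cat_names=["C1", "C2"],
        cont_names=["I1", "I2"],
        shuffle=shuffle,
    )

train_loader = make_loader("train_out/*.parquet", shuffle=True)
valid_loader = make_loader("valid_out/*.parquet", shuffle=False)

# A toy model over tf.feature_columns, standing in for a real architecture
columns = [tf.feature_column.numeric_column(c) for c in ["I1", "I2"]] + [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_identity(c, num_buckets=1000))
    for c in ["C1", "C2"]
]
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(columns),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Validation goes through a custom callback rather than validation_data
model.fit(train_loader,
          callbacks=[KerasSequenceValidater(valid_loader)],
          epochs=1)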
28
Usability: NVTabular dataloaders can be used in an existing PyTorch pipeline
Helper function to run one epoch
Define dataset files
Initialize NVTabular dataloader for training
Initialize NVTabular dataloader for validation
Get a PyTorch model
Train and validate model
Steps which are different are highlighted; a sketch follows.
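The analogous PyTorch sketch, assuming NVTabular's torch loader API; paths, columns, and batch size are hypothetical:

import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader

def make_loader(path):
    itr = TorchAsyncItr(
        nvt.Dataset(path),
        batch_size=65536,
        cats=["C1", "C2"],
        conts=["I1", "I2"],
        labels=["click"],
    )
    return DLDataLoader(itr, batch_size=None, collate_fn=lambda x: x,
                        pin_memory=False, num_workers=0)

train_loader = make_loader("train_out/*.parquet")
valid_loader = make_loader("valid_out/*.parquet")

for x_cat, x_cont, y in train_loader:
    # batches arrive already on the GPU; run the forward/backward
    # pass of your torch.nn.Module here
    pass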
29
Core Feature: NVTabular has less idle time than other data loaders
The NVTabular data loader has no overhead for moving data from CPU to GPU, and preparing batches on the GPU is significantly faster.
[Timeline diagram comparing three schemes: with a synchronous CPU dataloader, the GPU idles while each batch is prepared on the CPU; with an asynchronous CPU dataloader, preparation overlaps training but CPU-GPU copies still leave idle gaps; with NVTabular's asynchronous GPU dataloader, batches are prepared on the GPU while it trains, leaving almost no idle time]
30
Advantages of NVTabular dataloaders
● Speed: NVTabular dataloaders speed up TensorFlow pipelines by 9x; they speed up PyTorch pipelines by 9x
● Scale: NVTabular dataloaders stream the dataset in chunks from disk and thereby support larger-than-host/GPU-memory datasets
● Usability: seamless integration into existing TensorFlow or PyTorch pipelines
● Core features: models converge in the same number of update steps but require less time; supports common tabular data formats (Parquet and CSV); data is loaded directly onto the GPU, removing CPU-GPU communication
Training on GPU at scale
Accelerating training at scale with HugeCTR and TensorFlow plugins
32
Merlin Overview - Make Recommenders on GPU Fast & Easy
[Architecture diagram: high-level libraries for end-to-end usability (Merlin Models & Merlin Systems, the Merlin data loader, and Transformer4Rec) sit above lower-level libraries for performance and scalability: NVTabular for ETL; SparseOperationKit (TF1), DistributedEmbeddings (TF2), and native HugeCTR for training; Triton, TRT, and the HugeCTR Hierarchical Parameter Server (with HKV) for inference; all built on RAPIDS (cuDNN, cuDF, cuBLAS, NCCL, ...) and partner and NVIDIA products]
HugeCTR
What it is:
HugeCTR is an open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs. It is written in CUDA C++ and heavily exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL.
What it's capable of:
Speed: a single-node DGX A100 trains the MLPerf v1.0 DLRM in 1.96 minutes
Scale: multi-node model parallelism, GPU embedding cache
Usability: Keras-like Python API
Core Features: multi-slot embedding with an in-memory GPU hashtable; asynchronous, multithreaded data pipeline; inference & embedding cache; TensorFlow embedding plugin
HUGECTR: RECOMMENDER SYSTEM TRAINING ON GPU
SPEED: HUGECTR DLRM MLPERF WIN
Features used to achieve MLPerf win:
- Hybrid embedding
- Fused MLP
- Optimized collectives
- Optimized data reader
- Overlapping MLP with embedding
- Whole-iteration CUDA graph
At 0.99 minutes, HugeCTR on DGX-A100 is the fastest available system on MLPerf Recommender Benchmark
MLPerf v1.0 training results. Recommender task (training DLRM on the Criteo 1TB dataset). Bars represent speedup factor compared to a 4 CPU-node cluster. The higher the better. HugeCTR v3.1
running on single node of DGX-A100 with 8x A100 80GB GPU and 14 nodes of DGX-A100. Intel's CPU submission based on 4 nodes, each with 4X Intel(R) Xeon(R) Platinum 8376H CPU @ 2.60GHz with 6
UPI for a total of 16 CPUs.
Scale: Multi-Node Model Parallel
[Diagram: the sparse model/embedding is partitioned across GPUs 0-3 of nodes 0 and 1 (model parallel), while each GPU holds its own copy of the dense model (data parallel)]
To train a large-scale CTR estimation model, the HugeCTR embedding table can span multiple GPUs and multiple nodes (model parallelism), while each GPU holds its own feed-forward neural network (data parallelism).
USABILITY: KERAS-LIKE PYTHON API
Easy to use Python API
Train:
1. Initialize Model
solver = hugectr.CreateSolver(...)
reader = hugectr.DataReaderParams(...)
optimizer = hugectr.CreateOptimizer(...)
model = hugectr.Model(solver, reader, optimizer)
2. Create Model graph
model.add(hugectr.Input(...))
model.add(hugectr.SparseEmbedding(...))
model.add(hugectr.DenseLayer(...))
...
3. Save model graph
model.graph_to_json("dlrm.json")
4. Compile & Fit
model.compile(...)
model.fit(...)

Inference:
1. Create InferenceSession
inference_params = InferenceParams(
    max_batchsize = 4096,
    hit_rate_threshold = 0.8,
    dense_model_file = ...,
    sparse_model_files = ...,
    device_id = 0,
    use_gpu_embedding_cache = True,
    cache_size_percentage = 0.3)
inference_session = CreateInferenceSession("dlrm.json", inference_params)
2. Make inference
• From extracted VCSR: inference_session.predict(dense, keys, row_ptrs)
• From a Norm/Parquet dataset: inference_session.predict(...) or inference_session.evaluate(...)
37
Usability - Democratize Model Parallelism with Three Lines!
Try it today at https://github.com/NVIDIA-Merlin/distributed-embeddings to easily train TB-size models fast!
Learn more from our blog: https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
ASYNCHRONOUS, MULTITHREADED DATA PIPELINE
[Diagram: for batches 0, 1, and 2, the 'read file', 'copy to GPU', and 'train' stages overlap in time; multiple workers read datasets into CPU memory while a collector feeds GPU memory for model training]
The HugeCTR pipeline can overlap the data read from disk to CPU memory, the data transfer from CPU to GPU, and the actual training on the GPU across different batches.
Triton
Fast and scalable AI in production
MERLIN TRITON: EASY DEPLOYMENT OF ETL WORKFLOW
AND DEEP LEARNING MODELS WITH GPU-ACCELERATION
Triton Inference with Merlin
What it is:
Triton Inference Server simplifies the deployment of AI models at scale in
production. It is open source inference serving software that lets teams deploy trained AI models from any framework. Both NVTabular workflows and TensorFlow, PyTorch, and HugeCTR models can be deployed.
What it's capable of:
• Speed – GPU acceleration for inference in production; the HugeCTR embedding cache keeps frequent user/item embeddings on the GPU
• Scale – Triton with HugeCTR can deploy larger-than-memory embedding tables with the parameter server
• Usability – only a few lines of code to deploy an ensemble of an ETL workflow and a deep learning model
• Core Features – maximizes CPU/GPU utilization with real-time or batch inference; supports all major deep learning frameworks and custom-built models; Kubernetes integration for orchestration, metrics, and auto-scaling; serves multiple models at once
Visualization of a deployed ensemble model of an NVTabular workflow and a HugeCTR model; alternatively, a TensorFlow or PyTorch model can be deployed.
MERLIN DEPLOYS ENSEMBLE MODEL FOR SAME ETL
TRANSFORMATION IN PRODUCTION
[Diagram: in production, each of N requests flows through the same Triton-deployed ensemble that was defined at training time and returns a prediction]
● Easy deployment to production with only a few lines of code (see the sketch below)
● NVTabular collects statistics,
such as mean/std of numeric
features and mapping tables for
categorical features
● The same ETL workflow needs
to be applied in production
● Triton supports TensorFlow and
PyTorch, too
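A sketch of that deployment, assuming the merlin-systems API; "workflow" and "model" stand for the fitted NVTabular workflow and trained TensorFlow model from the earlier sketches, and the output path is hypothetical:

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.workflow import TransformWorkflow
from merlin.systems.dag.ops.tensorflow import PredictTensorflow

# Requests are first transformed with the statistics collected at training
# time, then scored by the model
serving_ops = (workflow.input_schema.column_names
               >> TransformWorkflow(workflow)
               >> PredictTensorflow(model))

ensemble = Ensemble(serving_ops, workflow.input_schema)
ensemble.export("/models")   # the model repository directory Triton serves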
42
Merlin Overview - Make Recommenders on GPU Fast & Easy
[Architecture diagram, repeated: high-level libraries (Merlin Models & Merlin Systems, the Merlin data loader, Transformer4Rec) above lower-level libraries: NVTabular (ETL); SOK - TF1, DE - TF2, and native HugeCTR (training); Triton, TRT, and HPS (inference); all on RAPIDS (cuDNN, cuDF, cuBLAS, NCCL, ...) and partner and NVIDIA products]
43
Options for Running Inference
- CPU for Embedding & MLP. Pros: runs everywhere. Cons: slow per socket.
- CPU for Embedding + GPU for MLP. Pros: easier to switch the MLP to GPU; makes use of all existing resources. Cons: embedding is not accelerated.
- GPU for both Embedding & MLP. Pros: most performant, especially for large models. Cons: not supported by native TF.
That's why we have the HugeCTR Hierarchical Parameter Server!
44
Input Features Follow a Power-Law Distribution
Only a small fraction of parameters is accessed most frequently. These “hot parameters” will be placed in the GPU cache.
In the Criteo 1 TB dataset, 305K out of 188M categories (0.16%) account for 95.9% of samples.
Hierarchical Parameter Server (HPS) - A “Giant” Key-Value Store
- Each GPU has its own lookup session and its own embedding cache, with no inter-communication between GPUs; this reduces synchronization time and ensures high system availability
- CPU memory and SSDs can be configured by users as extended embedding storage
- HPS supports embedding lookup for a variety of models/frameworks
[Diagram: a three-level store: GPUs 0-3 each have a lookup session and an embedding cache, backed by CPU memory (e.g., Redis, a HashMap), backed in turn by SSD (e.g., RocksDB, HDFS)]
46
HPS Performance Benchmark
[Chart: 16x speedup]
Merlin Models
Merlin Models: Model Training Library
Merlin Models
What it is:
High-quality implementations in TensorFlow of models ranging from classical machine learning to more advanced deep learning
What it's capable of:
▪ Implementations of common architectures, loss functions, and tasks
▪ A unified API that lets users create models in various frameworks and libraries: TensorFlow, XGBoost, Implicit, and LightFM
▪ Flexible building blocks to design custom architectures from common components
▪ GPU-optimized data loading and model training
▪ Integration with NVTabular
Visualization of Merlin Models
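A minimal sketch of training with Merlin Models, assuming the merlin.models TensorFlow API; the dataset paths and hyperparameters are hypothetical:

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train_out/*.parquet")
valid = Dataset("valid_out/*.parquet")

# DLRMModel builds the architecture from the schema produced by NVTabular
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=65536)
model.evaluate(valid, batch_size=65536)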
Fast Iteration is Key
Merlin enables quick experimentation cycles to find high-accuracy models
[Diagram: the iteration loop of input, feature engineering, model training, and evaluation, followed by deployment; NVTabular + (Merlin Models / HugeCTR / Transformer4Rec) accelerate the loop stages by 10x each, and Triton Inference Server handles deployment]
Example Model Architectures in Merlin Models
Facebook DLRM, Google DCN, YouTube DNN
Sequential/Session-based recommendation w/ NLP Transformer architecture
Transformers4Rec
SESSION-BASED RECOMMENDATION TASK
Problem: A user's intention changes between sessions, and online services usually only have access to the current session (GDPR).
Goal: Build a recommender system that is able to leverage the short sequence of past interactions within the same session and dynamically adapt the next suggested item.
NLP X SEQUENTIAL RECSYS
[Timeline, 2013-2021: NLP milestones (Word2Vec, GRU, Attention, Transformer, BERT, GPT-2, XLNet, and the HuggingFace Transformers lib) alongside the seminal neural RecSys architectures they influenced (Prod2Vec, Meta-Prod2Vec, GRU4Rec, NARM, SASRec, ATRank, AttRec, BERT4Rec, and the NVIDIA Transformers4Rec lib)]
Fig 2. The influence of NLP research on Recommender Systems
WHAT IS TRANSFORMERS4REC?
Transformers4Rec
- is a flexible and efficient open source library for sequential and session-based recommendation,
- makes state-of-the-art Transformer architectures available to the RecSys community,
- is available for both PyTorch and TensorFlow.
https://github.com/NVIDIA-Merlin/Transformers4Rec
TRANSFORMERS4REC IS MULTI-FRAMEWORK AND EASY TO USE
PyTorch TensorFlow
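A minimal sketch of defining a session-based model with the PyTorch API, assuming a schema produced by an NVTabular workflow over session data; the hyperparameters are hypothetical:

from transformers4rec import torch as tr

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    masking="causal",          # next-item, language-model-style masking
)

# XLNet-style transformer over the session sequence
config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock([64]),         # project the features to d_model
    tr.TransformerBlock(config, masking=inputs.masking),
)

# Next-item prediction head; weight tying shares the item embeddings
model = tr.Model(tr.Head(body, tr.NextItemPredictionTask(weight_tying=True)))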
56
Transformers4Rec Workflow Pipeline
Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based
Recommendation
Fig 3. End-to-end session-based recommendation pipeline
Success Cases
58
PERSONALIZED
PRODUCT RECOMMENDATIONS
Olay is arming consumers with knowledge to make informed
purchase decisions. Its Olay Skin Advisor — a GPU-accelerated AI
tool that works on any mobile device — assesses a user-provided
selfie and advises how to improve trouble areas using a daily regimen of recommended Olay products. After four weeks, 94% of Skin Advisor users continued to use the products the Olay Skin Advisor recommended.
59
With >100,000 different products in its 4,700 U.S. stores, the Walmart Labs data science team predicts demand for 500 million item-by-store
combinations every week.
By performing forecasting with the open-source RAPIDS data processing and machine learning libraries built on CUDA-X AI on NVIDIA GPUs,
Walmart speeds up feature engineering 100x and trains machine learning algorithms 20x faster, resulting in faster delivery of products, real-time reaction to shopper trends, and inventory cost savings at scale.
IMPROVING DEMAND FORECASTS
60
MODERNIZING THE WAREHOUSE
In 2019, global e-commerce represented $3.4 trillion, or 13.7%, of the $25 trillion retail sales market. Oberlo predicts this market share will grow to 17.5% by 2021.
With thousands of orders placed every hour, data scientists at Zalando, Europe’s leading online fashion retailer, applied deep
learning powered by NVIDIA GPUs to develop the Optimal Cart Pick algorithm.
The algorithm resulted in an 11% decrease in workers’ travel time per item picked. The work is a good example of the efficiencies
that AI can discover for e-commerce, manufacturing and other large-systems-based industries.
NVIDIA DEVELOPER
PROGRAM
FOR DEVELOPERS
62
NVIDIA DEVELOPER PROGRAM
TOOLS
JOIN THE COMMUNITY THAT’S CHANGING THE WORLD
• Get exclusive access to an extensive library of NVIDIA software, spanning all of NVIDIA’s technology platforms.
• Save time with ready-to-run, GPU-optimized software, model scripts, and containerized apps from the NVIDIA NGC™
catalog.
• Participate in early access programs where you can be one of the first to experience the latest NVIDIA technology.
TRAINING
• Take advantage of research papers, technical documentation, developer blogs, and industry-specific resources.
• Choose from a broad catalog of training options through the NVIDIA Deep Learning Institute (DLI).
• Get unlimited access to NVIDIA On-Demand, the home for NVIDIA resources from GTCs and other leading industry events.
COMMUNITY
• Network with like-minded developers, engage with GPU experts, and contribute to discussions in the developer forums.
• Attend exclusive meetups, GPU hackathons, and events.
• Connect with NVIDIA experts through developer-focused webinars and Instructor-led workshops.
Join the Free Program developer.nvidia.com/join
63
https://ngc.nvidia.com/
NVIDIA INCEPTION
PROGRAM
FOR STARTUPS
NVIDIA INCEPTION HELPS TECH STARTUPS BUILD AND
GROW FASTER
Open to Tech Startups of All Stages Working in AI, Deep Learning, AR/VR, Gaming, Networking, and Graphics
100+
Countries Represented
37%
Program Growth in 2022
15,000+
Startups Worldwide
$94B+
in Cumulative Funding
INCEPTION HELPS YOU ACHIEVE SUCCESS AT EVERY STAGE
BUILD: Get free technical training, engineering guidance, and discounts on technology to build your solutions faster.
GROW: Accelerate growth with opportunities to increase your company's market awareness and gain exposure to VC investors.
SCALE: Premier members enjoy an enhanced set of go-to-market and technical guidance benefits to help them scale faster.
NVIDIA INCEPTION
67
https://nvidia.com/startups
NVIDIA DEEP LEARNING
INSTITUTE
FOR STUDENTS, RESEARCHERS AND PROFESSORS
69
HELPING YOU SOLVE YOUR
MOST CHALLENGING PROBLEMS
• Build and deploy end-to-end projects across a range
of technologies and domains.
• Gain hands-on experience with the most widely used,
industry-standard software, tools, and frameworks.
• Join live, instructor-led workshops and learn from DLI-certified
instructors who are experts in their fields.
• Take self-paced, online courses anytime, anywhere.
• Earn NVIDIA DLI certificates to demonstrate subject matter
competency and support career growth.
Hands-On Training in AI, Accelerated Computing,
Accelerated Data Science, Graphics and Simulation,
and More
Learn more: www.nvidia.com/dli
Deep Learning Fundamentals, AI for Anomaly Detection, AI for Industrial Inspection, Conversational AI, AI for Intelligent Video Analytics, Accelerated Computing Fundamentals, Recommender Systems, AI for Predictive Maintenance, Accelerated Data Science Fundamentals, Graphics and Simulation, Networking, AI in the Data Center
THANK YOU!
jsilva@nvidia.com

More Related Content

PDF
NVIDIA Rapids presentation
PDF
Rapids: Data Science on GPUs
PDF
RAPIDS – Open GPU-accelerated Data Science
PDF
RAPIDS: GPU-Accelerated ETL and Feature Engineering
PPTX
Innovation with ai at scale on the edge vt sept 2019 v0
PDF
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
PDF
RAPIDS Overview
PDF
RAPIDS, GPUs & Python - AWS Community Day Melbourne
NVIDIA Rapids presentation
Rapids: Data Science on GPUs
RAPIDS – Open GPU-accelerated Data Science
RAPIDS: GPU-Accelerated ETL and Feature Engineering
Innovation with ai at scale on the edge vt sept 2019 v0
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
RAPIDS Overview
RAPIDS, GPUs & Python - AWS Community Day Melbourne

Similar to Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito além do hardware. (20)

PPTX
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PDF
The Convergence of HPC and Deep Learning
PDF
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
PDF
Deep Learning Update May 2016
PDF
Enabling Artificial Intelligence - Alison B. Lowndes
PPTX
abelbrownnvidiarakuten2016-170208065814 (1).pptx
PDF
GOAI: GPU-Accelerated Data Science DataSciCon 2017
PDF
Harnessing AI for the Benefit of All.
PDF
Introduction to Deep Learning (NVIDIA)
PDF
Accelerating Data Science With GPUs
PPTX
End to End Machine Learning Open Source Solution Presented in Cisco Developer...
PDF
Phi Week 2019
PDF
Nvidia why every industry should be thinking about AI today
PDF
Possibilities of generative models
PDF
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
PDF
GTC 2017: Powering the AI Revolution
PDF
Deep Learning on the SaturnV Cluster
PDF
Tesla Accelerated Computing Platform
PDF
Nvidia at SEMICon, Munich
PDF
instruction of install Caffe on ubuntu
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
The Convergence of HPC and Deep Learning
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
Deep Learning Update May 2016
Enabling Artificial Intelligence - Alison B. Lowndes
abelbrownnvidiarakuten2016-170208065814 (1).pptx
GOAI: GPU-Accelerated Data Science DataSciCon 2017
Harnessing AI for the Benefit of All.
Introduction to Deep Learning (NVIDIA)
Accelerating Data Science With GPUs
End to End Machine Learning Open Source Solution Presented in Cisco Developer...
Phi Week 2019
Nvidia why every industry should be thinking about AI today
Possibilities of generative models
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
GTC 2017: Powering the AI Revolution
Deep Learning on the SaturnV Cluster
Tesla Accelerated Computing Platform
Nvidia at SEMICon, Munich
instruction of install Caffe on ubuntu
Ad

More from E-Commerce Brasil (20)

PPTX
Congresso Alimentos & Bebidas | O novo capítulo do e-commerce alimentar com o...
PPTX
Congresso Alimentos & Bebidas | Sucessão e transformação: o futuro das redes ...
PPTX
Congresso Alimentos & Bebidas | Varejo e distribuição de alimentos e bebidas:...
PDF
Congresso Moda&Beleza I Do caos ao sucesso - Transformando desafios em oportu...
PPTX
Congresso Moda&Beleza I O fim da compra tradicional: como a hiperpersonalizaç...
PPTX
Congresso Moda&BelezaI Case Modaliss: como ter sucesso com um e-commerce de moda
PDF
Congresso Saúde & Farma | Os desafios do marketing digital e da inovação no s...
PDF
Conferência Rio Grande do Sul I O Poder do CRM: transformando a cultura da em...
PPSX
Conferência Rio Grande do Sul I Inovação como estratégia para criar mercados ...
PPTX
Conferência Rio Grande do Sul I De fraude a fidelização: pagamentos seguros n...
PDF
Conferência Rio Grande do Sul I Como uma gestão Full Commerce impacta na expe...
PDF
Conferência Rio Grande do Sul I Tecnologia logística para o crescimento das v...
PPTX
Conferência Rio Grande do Sul I Fraude na Black Friday no Brasil e no Sul: in...
PDF
Fórum E-Commerce Brasil 2024 | Transformação Impulsionada pela Tecnologia/IA:...
PPTX
Fórum E-Commerce Brasil 2024 | Da IA à Geração Z: desbloqueando a experiência...
PPTX
Fórum E-Commerce Brasil 2024 | O Modelo Y - Onde Produto, Canal e Experiência...
PDF
Congresso Indústria Digital I A revolução dos pagamentos
PPTX
Congresso Indústria Digital I Como um checkout externo pode facilitar e agili...
PDF
Congresso Indústria Digital I Retail Media: como indústrias podem expandir su...
PDF
Congresso Indústria Digital I Perspectivas do Pix 2024: novidades e impactos ...
Congresso Alimentos & Bebidas | O novo capítulo do e-commerce alimentar com o...
Congresso Alimentos & Bebidas | Sucessão e transformação: o futuro das redes ...
Congresso Alimentos & Bebidas | Varejo e distribuição de alimentos e bebidas:...
Congresso Moda&Beleza I Do caos ao sucesso - Transformando desafios em oportu...
Congresso Moda&Beleza I O fim da compra tradicional: como a hiperpersonalizaç...
Congresso Moda&BelezaI Case Modaliss: como ter sucesso com um e-commerce de moda
Congresso Saúde & Farma | Os desafios do marketing digital e da inovação no s...
Conferência Rio Grande do Sul I O Poder do CRM: transformando a cultura da em...
Conferência Rio Grande do Sul I Inovação como estratégia para criar mercados ...
Conferência Rio Grande do Sul I De fraude a fidelização: pagamentos seguros n...
Conferência Rio Grande do Sul I Como uma gestão Full Commerce impacta na expe...
Conferência Rio Grande do Sul I Tecnologia logística para o crescimento das v...
Conferência Rio Grande do Sul I Fraude na Black Friday no Brasil e no Sul: in...
Fórum E-Commerce Brasil 2024 | Transformação Impulsionada pela Tecnologia/IA:...
Fórum E-Commerce Brasil 2024 | Da IA à Geração Z: desbloqueando a experiência...
Fórum E-Commerce Brasil 2024 | O Modelo Y - Onde Produto, Canal e Experiência...
Congresso Indústria Digital I A revolução dos pagamentos
Congresso Indústria Digital I Como um checkout externo pode facilitar e agili...
Congresso Indústria Digital I Retail Media: como indústrias podem expandir su...
Congresso Indústria Digital I Perspectivas do Pix 2024: novidades e impactos ...
Ad

Recently uploaded (20)

PDF
PakistanCoinageAct-906.pdfdbnsshsjjsbsbb
PPTX
Pin configuration and project related to
PDF
GENERATOR AND IMPROVED COIL THEREFOR HAVINGELECTRODYNAMIC PROPERTIES
PDF
ISS2022 present sdabhsa hsdhdfahasda ssdsd
PPTX
PLC ANALOGUE DONE BY KISMEC KULIM TD 5 .0
PDF
20A LG INR18650HJ2 3.6V 2900mAh Battery cells for Power Tools Vacuum Cleaner
PDF
CAB UNIT 1 with computer details details
PPTX
AIR BAG SYStYEM mechanical enginweering.pptx
PDF
Topic-1-Main-Features-of-Data-Processing.pdf
PPTX
unit1d-communitypharmacy-240815170017-d032dce8.pptx
PPTX
Grade 10 System Servicing for Hardware and Software
PPTX
New professional education PROF-ED-7_103359.pptx
PPTX
AI_ML_Internship_WReport_Template_v2.pptx
PPTX
Computers and mobile device: Evaluating options for home and work
PDF
SAHIL PROdhdjejss yo yo pdf TOCOL PPT.pdf
PPTX
Prograce_Present.....ggation_Simple.pptx
PPTX
Presentation 1.pptxnshshdhhdhdhdhdhhdhdhdhd
PPTX
RTS MASTER DECK_Household Convergence Scorecards. Use this file copy.pptx
PDF
ICT grade for 8. MATATAG curriculum .P2.pdf
PDF
Dozuki_Solution-hardware minimalization.
PakistanCoinageAct-906.pdfdbnsshsjjsbsbb
Pin configuration and project related to
GENERATOR AND IMPROVED COIL THEREFOR HAVINGELECTRODYNAMIC PROPERTIES
ISS2022 present sdabhsa hsdhdfahasda ssdsd
PLC ANALOGUE DONE BY KISMEC KULIM TD 5 .0
20A LG INR18650HJ2 3.6V 2900mAh Battery cells for Power Tools Vacuum Cleaner
CAB UNIT 1 with computer details details
AIR BAG SYStYEM mechanical enginweering.pptx
Topic-1-Main-Features-of-Data-Processing.pdf
unit1d-communitypharmacy-240815170017-d032dce8.pptx
Grade 10 System Servicing for Hardware and Software
New professional education PROF-ED-7_103359.pptx
AI_ML_Internship_WReport_Template_v2.pptx
Computers and mobile device: Evaluating options for home and work
SAHIL PROdhdjejss yo yo pdf TOCOL PPT.pdf
Prograce_Present.....ggation_Simple.pptx
Presentation 1.pptxnshshdhhdhdhdhdhhdhdhdhd
RTS MASTER DECK_Household Convergence Scorecards. Use this file copy.pptx
ICT grade for 8. MATATAG curriculum .P2.pdf
Dozuki_Solution-hardware minimalization.

Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito além do hardware.

  • 1. Jomar Silva - 2023 Tecnologias NVIDIA aplicadas ao e- commerce - Muito além do hardware
  • 2. 2 NVIDIA is built like a computing stack or neural network—in four layers: hardware, system software, platform software, and applications. Each layer is open to computer makers, service providers, and developers to integrate into their offerings however best for them. NVIDIA HPC RTX CUDA-X PHYSX RTX DGX HGX EGX OVX SUPER POD AGX GPU CPU DPU NIC SOC SWITCH NVIDIA AI NVIDIA OMNIVERSE ACCELERATED COMPUTING ACROSS THE FULL-STACK AND AT DATA CENTER SCALE
  • 3. NVIDIA ACCELERATED AI PLATFORM Accelerating AI from edge to datacenter to cloud FLEET COMMAND NVIDIA AI ENTERPRISE NVIDIA vGPU NVIDIA Magnum IO™ NVIDIA CUDA-X AI™ Infrastructure Optimization NVIDIA GPU Operator NVIDIA Network Operator Cloud-Native Deployment TensorFlow PyTorch NVIDIA RAPIDS™ NVIDIA TensorRT® NVIDIA Triton™ Inference Server AI and Data Science Tools and Frameworks APPLICATION FRAMEWORKS METROPOLIS Video Analytics ISAAC Robotics RAPIDS Data Science RIVA Conv AI/ASR/NLP/TTS MERLIN Recommendation Systems OMNIVERSE Simulation / Digital Twins GPU | NETWORKING | SECURITY | STORAGE Tencent Cloud PRE-TRAINED MODELS SYSTEM INTEGRATORS Consult with experts MAJOR CUSTOMERS In-house data science teams 150+ SOFTWARE VENDORS Pre-packaged AI solutions
  • 5. 5 THE CHALLENGES OF DATA SCIENCE TODAY PROBLEMS THAT HINDER DATA-DRIVEN ENTERPRISES It's Time Consuming Building, training, and iterating on models takes substantial time. As big data use cases continue to grow, CPU computational power becomes a major bottleneck. It's Costly Large-scale CPU infrastructure is incredibly expensive for conducting big data operations. With growing datasets, adding CPU infrastructure continues to increase costs. It's Frustrating Productionizing large-scale data processing operations is arduous. It often involves refactoring and hand-offs between teams adding more cycle time. These challenges increase cycle time and slow down time to insight.
  • 6. 6 NVIDIA RAPIDS TRANSFORMS DATA SCIENCE From Analytics to NVIDIA Accelerated Data Science
  • 7. 7 PYTHON TOOLS HAVE DEMOCRATIZED DATA SCIENCE ACCESSIBLE, EASY TO USE TOOLS ABSTRACT COMPLEXITY Analytics pandas Data Preparation Visualization Model Training Machine Learning scikit-learn Graph Analytics NetworkX Deep Learning TensorFlow, PyTorch, MxNet Vizualization CuXFILTER <> pyViz Dask CPU Memory Python is the most-used language in Data Science today. Libraries like NumPy, Scikit-Learn, and Pandas have changed how we think about accessibility in Data Science and Machine Learning. While great for experimentation, PyData tools lack the power necessary for enterprise-scale workloads. This leads to substantial refactoring to handle the size of modern problems, increasing cycle time, overhead, and time to insight. These pain points are further compounded by computational bottlenecks of CPU-based processing. Code refactors and inter-team handoffs decrease data-driven ROI
  • 8. 8 RAPIDS ACCELERATES POPULAR DATA SCIENCE TOOLS DELIVERING ENTERPRISE-GRADE DATA SCIENCE SOLUTIONS IN PURE PYTHON Pre-Processing cuIO & cuDF Data Preparation Visualization Model Training Machine Learning cuML Graph Analytics cuGRAPH Deep Learning TensorFlow, PyTorch, MxNet Vizualization CuXFILTER <> pyViz Dask GPU Memory The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces like PyData. With Dask, RAPIDS can scale out to multi-node, multi-GPU cluster to power through big data processes. RAPIDS enables the PyData stack with the power of NVIDIA GPUs
  • 9. 9 Accelerates PyData on NVIDIA GPUs NumPy -> CuPy/PyTorch/.. Pandas -> cuDF Scikit-Learn -> cuML Numba -> Numba RAPIDS Distributes and accelerates PyData Can be distributed across Multi-GPU on single node (DGX) or across a cluster Provides easy to use tooling enabling HPC-level performance RAPIDS + DASK Provides accessible, easy to use tooling NumPy, Pandas, Scikit-Learn, Numba and many more Single CPU core, in-memory data PYDATA Distributes PyData across multiple cores NumPy -> Dask Array Pandas -> Dask DataFrame Scikit-Learn -> Dask-ML … -> Dask Futures DASK Scale Up / Accelerate Scale Out / Parallelize SCALE OUT PYTHON TOOLS WITH RAPIDS + DASK DISTRIBUTE & ACCELERATE COMPUTATION FOR PRODUCTION WORKLOADS
  • 10. 10 RAPIDS INTEGRATES WITH DEEP LEARNING TOOLS BUILD COMPLEX WORKFLOWS WITHOUT LEAVING THE GPU mpi4py RAPIDS supports device memory sharing between many popular data science and deep learning libraries, such as PyTorch and TensorFlow. By providing native array_interface support, data can stay on the GPU avoiding costly copying back and forth to host memory. Data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that accept CUDA Array Interface protocol or work with DLPack, such as Chainer, MXNet, and PyTorch.
  • 12. NVTabular Pipelines are slow and complex Challenge Solution Inference Training Data Loading ETL Using common item- by-item loading can be slow High throughput to rank more items is difficult while maintaining low latency Embedding tables of large deep learning recommender systems can exceed memory GPU-accelerated and easy-to-use ETL pipelines prepares datasets in minutes Asynchronous and GPU-accelerated dataloader for PyTorch and TensorFlow/Keras Easy data and model parallel training allow to scale TB size embeddings High throughput, low-latency production deployment NVIDIA Merlin is an open-source library to deploy recommender systems end-2-end Triton HugeCTR NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
  • 13. DATA LAKE TRITON USER QUERY 10 RECOMMENDATIONS CANDIDATE GENERATION RANKING O(1000) O(Billions) EMBEDDINGS INFERENCE TRAINING DATA LOADER ETL NVTABULAR HUGECTR NVTABULAR VALIDATION MODEL ANALYSIS RAPIDS RAPIDS CUDNN RAPIDS NVIDIA Merlin Accelerates Every Stage in Recommender Pipeline
  • 14. 14 Merlin speeds up the entire pipeline ETL Minutes Optimized Spark (4x CPU node) NVTabular ETL (1xA100) NVIDIA Merlin provides 9-35x speed-up in ETL+Training+Inference RecSys models and easily scales to multiple GPUs Accelerating Training Scaling Accelerated Training Inference Tensorflow Data loader (1x A100) NVTabular Data loader (1x A100) CPU cluster (4x nodes) HugeCTR (1x DGX-A100) 21x 9x 24x 35x HugeCTR NVTabular HPS on Triton Speedup TF/PyT Plugins Minutes Minutes Throughput (samples/sec) at 10 ms latency PyTorch (2x CPU node) HugeCTR (1x A100)
  • 15. NVTabular Fast Feature Transforms & Dataloading of Tabular Data on GPU
  • 16. NVTABULAR: RECOMMENDER SYSTEM ETL ON GPU NVTabular What it is: Feature engineering and preprocessing library designed to quickly and easily manipulate terabytes of tabular data What it’s capable of: • Speed – GPU acceleration, 10x speedup compared to CPU, eliminate input bottleneck • Scale – No limit on dataset size (not bound by GPU or CPU memory) • Usability - Higher level abstraction, recommender systems oriented, fewer API calls are required to accomplish the same processing pipeline • Core Features - integration with PyTorch, TensorFlow, and HugeCTR; multi-hot encoding 1 2 3 4 Visualization of feature engineering and preprocessing pipeline for Criteo Click Ads Prediction dataset
  • 17. SCALE: NVTABULAR SCALES TO MULTI-GPUS AND MULTI-NODES USING TBS OF GPU MEMORY Example of NVTabular ETL with 2 nodes: 1. ETL starts with 1 node (8 GPUs) 2. ETL adds a 2nd node (8 GPUs, total=16GPUs) Dask dashboard monitors NVTabular ETL: ● Top-right: Visualization of GPUs interaction, small dot is one GPU (upto 16 GPUs in this example) ● Bottom-center: Utilized GPU memory and total GPU memory (upto 500GB in this example) ● Top-left: Tasks overtime Detailed Notebook: https://github.com/NVIDIA/NVTabular/blob/main/examples/multi-gpu_dask.ipynb Under the hood, NVTabular uses Dask and Dask-cuDF to provide a high-performance recommender system-specific ETL pipeline on multiple GPUs Graphic showing the dask dashboard when running NVTabular I II III I II II I
  • 18. SCALE: NVTABULAR SCALES EASILY TO MULTI-NODES AND REDUCES ETL TIME In our example, NVTabular scales to 128 GPUs (16 nodes x 8 GPUs) with total of 5.1 TB GPU memory (each GPU has 40GB GPU memory) Preliminary results Run on DGX A100, Dataset: Criteo 1TB (26 categorical features and 13 integer features) NVTabular v0.3 To enable distributed parallelism, we need to start a cluster and then connect to it to run the application. How it works: 1. Start the scheduler dask-scheduler 2. Start the workers dask-cuda-worker 3. Run the NVTabular application Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames
  • 19. USABILITY: NVTABULAR’S HIGH-LEVEL API IS 10-20 LINES Easy-to-use API. Find more details from examples here. Encode categorical variables using the defined thresholds and add metadata (tags) For continuous features - zero filling any nulls clip all values, log transform, normalize, add metadata (tags) Collect statistics on train dataset Transform train & valid dataset with statistics from train dataset Define dataset files Combine pipelines and initialize workflow Load schema by tags for model definition
  • 20. ONE PARAMETER TO SWITCH BETWEEN CPU AND GPU GPU CPU New users can easily try out NVTabular as CPU-mode runs on all infrastructures Users can develop on local machines on CPU-mode and push to cluster on GPU-mode
  • 21. 21 Interoperability: NVTabular’s output files can be used by PyTorch, TensorFlow or HugeCTR 3 Export from data lake parquet Huge CTR binary Training Data Loading csv parquet avro NVTabular with high-level API Dedicated DL framework ETL NVTabular Dataloader
  • 22. 22 Advantages of NVTabular ETL Scale Speed Usability Core Features ● Recommender focused APIs to easily implement the most common workflows ● 5-25 lines of code compared to other frameworks such as Pandas ● Examples published for common datasets and models ● Optimized TF, PyT, and HugeCTR dataloaders ● Native tabular data format support: CSV, parquet, orc, avro ● Building an easy path to production deployment for data transforms ● Supports datasets larger than host and GPU memory ● Scales easily to multi-GPU / multi-Nodes with terabytes of GPU memory ● Benchmark shows 100x-3000x speedup in comparison to CPU environments
  • 23. Dataloading with NVTabular Fast Dataloading of Tabular Data on GPU
  • 24. 24 NVTabular: GPU-accelerated dataloading NVTabular Dataloader What it is: GPU-accelerated and asynchronous dataloader for TensorFlow & PyTorch prepares quickly new data for training step to fully utilize the GPU What it’s capable of: • Speed – NVTabular dataloader is 10x faster compared to Tensorflow dataloader with GPU training • Scale - dataloaders enables training of larger than memory datasets by streaming chunks from disk • Usability - easy integration with TensorFlow, PyTorch and FastAI Core Feature – convergence behavior is similar to other framework dataloaders; supports common tabular data format parquet; no CPU-GPU communication as data is loaded directly into GPU 1 2 3 4 Visualization of idle times for NVTabular data loader Time
  • 25. 25 Speed: NVTabular dataloader is up to 9x faster in comparison to TensorFlow native version Benchmark: ● Training and validation have each 1 day of Criteo Ads Clicks Prediction dataset with 150M samples ● Each experiment uses GPUs for training the same neural network architecture Results: ● Only dataloading without training has 24x speed-up, which indicates the bottleneck in the dataloader. ● Training neural networks has 2.7x-9x speed-up, as data loader has time to prepare batch during training step. Overall dataloader is still the bottleneck in TensorFlow / PyTorch pipeline. 1 Comparison to TensorFlow
  • 26. 26 Scale: NVTabular supports larger than memory datasets by streaming data from disk 2 1 2 3 ... N Buffer k randomly selected chunks1 1) A file is stored in multiple chunks. A hyperparameter defines how many chunks are buffered by NVTabular dataloader 2 ... k Shuffle Preprocess Batch Final Batch Disk Storage (e.g. raid) GPU Asynchronous ● Buffering chunks from disk enables larger than host/GPU memory datasets ● Dataloading on GPU removes overhead of CPU-GPU communication ● Shuffle, preprocess and batch are faster on GPU ● Using random, multiple chunks increases randomness in batches
  • 27. 27 Usability: NVTabular dataloaders can be used in existing TensorFlow pipeline 3 Reserve GPU memory for NVTabular dataloader Helper function for defining tf.feature_columns Define dataset files Initialize NVTabular dataloader for training Define data schema with tf.feature_columns Initialize NVTabular dataloader for validation Get a TensorFlow model NVTabular dataloader requires a custom Validation Callback Train model with .fit() Steps which a different
  • 28. 28 Usability: NVTabular dataloaders can be used in existing PyTorch pipeline 3 Helper function for run one epoch Define dataset files Initialize NVTabular dataloader for training Initialize NVTabular dataloader for validation Get a PyTorch model Train and validate model Steps which a different
  • 29. 29 Core Feature: NVTabular has less idle times than other data loader 4 NVTabular data loader has no overhead to move data from CPU to GPU. Preparing batch on GPU is significantly faster. Synchronous CPU dataloader Asynchronous CPU dataloader Asynchronous GPU dataloader NVTabular Prepare CPU: Idle GPU: Train Idle Prepare Idle Train Idle Prepare Idle Train Idle Prepare CPU: Idle GPU: Train Prepare Idle Train Prepare Idle Train Prepare Idle Train Prepare Idle Prepare GPU: Idle GPU: Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Train Prepare Time
  • 30. 30 Advantages of NVTabular dataloaders Speed Scale Usability Core Features ● NVTabular dataloaders stream dataset in chunks from hard disk and thereby, they support larger than host/GPU memory datasets ● Seamless integration into existing TensorFlow or PyTorch pipelines ● NVTabular dataloaders speed-up TensorFlow pipelines by 9x ● NVTabular dataloaders speed-up PyTorch pipelines by 9x ● NVTabular dataloaders convergence in the same number of update steps by require less time ● Supports common tabular dataformats: parquet and csv ● Data is loaded directly into GPU to remove CPU-GPU communication
  • 31. Training on GPU at scale Accelerating training at scale with HugeCTR and Tensorflow Plugins
  • 32. 32 TRITON ETL NVTABULAR NATIVE HUGECTR RAPIDS CuDNN, CuDF, CuBLAS, NCCL … Merlin Overview - Make Recommenders on GPU Fast & Easy Merlin Models & Merlin Systems SparseOperationKit - TF1 TRT MERLIN DATA LOADER DistributedEmbeddings - TF2 HUGECTR Hierarchical Parameter Server TRAINING INFERENCE Transformer4Rec High Level Libraries E2E usability Merlin Partner Nvidia Products Lower Level Libraries Performance & Scalability HKV
  • 33. Huge CTR What it is: HugeCTR is an open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs. It is written in CUDA C++ and highly exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL. What it’s capable of: Speed: A single node DGX-A100 trains MLPerf v1.0 DLRM at 1.96 minutes Scale: Multi-Node Model Parallel, GPU embedding cache Usability: Keras-like Python API Core Features: Multi-slot Embedding with in-memory GPU Hashtable, Asynchronous and Multithreaded Data Pipeline, Inference & Embedding Cache, Tensorflow Embedding Plugin HUGECTR: RECOMMENDER SYSTEM TRAINING ON GPU 1 2 3 4
  • 34. SPEED: HUGECTR DLRM MLPERF WIN Features used to achieve MLPerf win: § Hybrid embedding § Fused MLP § Optimized collectives § Optimized data reader § Overlapping MLP with embedding § Whole-iteration CUDA graph At 0.99 minutes, HugeCTR on DGX-A100 is the fastest available system on MLPerf Recommender Benchmark 1 MLPerf v1.0 training results. Recommender task (training DLRM on the Criteo 1TB dataset). Bars represent speedup factor compared to a 4 CPU-node cluster. The higher the better. HugeCTR v3.1 running on single node of DGX-A100 with 8x A100 80GB GPU and 14 nodes of DGX-A100. Intel's CPU submission based on 4 nodes, each with 4X Intel(R) Xeon(R) Platinum 8376H CPU @ 2.60GHz with 6 UPI for a total of 16 CPUs.
  • 35. Scale: Multi-Node Model Parallel Dense Model Dense Model Dense Model Dense Model Sparse Model/Embedding GPU 0 GPU 1 GPU 2 GPU 3 Node 0 Model Parallel Data Parallel Dense Model Dense Model Dense Model Dense Model GPU 0 GPU 1 GPU 2 GPU 3 Node 1 To train a large-scale CTR estimation model, the HugeCTR embedding table can span multiple nodes beyond multiple GPUs (model parallelism). Each GPU has its own feed-forward neural network (data parallelism). 2
  • 36. USABILITY: KERAS-LIKE PYTHON API Easy to use Python API 3 1. Initialize Model solver = hugectr.CreateSolver(...) reader = hugectr.DataReaderParams(...) optimizer = hugectr.CreateOptimizer(...) model = hugectr.Model(solver, reader, optimizer) 2. Create Model graph model.add(hugectr.Input(...)) model.add(hugectr.SparseEmbedding(...)) model.add(hugectr.DenseLayer(...)) ... 3. Save model graph model.graph_to_json(“dlrm.json”) 4. Compile & Fit model.compile(...) model.fit(...) 1. Create InferenceSession inference_params = InferenceParams( max_batchsize = 4096, hit_rate_threshold = 0.8, dense_model_file = ..., sparse_model_files = ..., device_id = 0, use_gpu_embedding_cache = True, cache_size_percentage = 0.3) inference_session = CreateInferenceSession(“dlrm.json”, inference_params) 2. Make inference • From extracted VCSR inference_session.predict(dense, keys, row_ptrs) • From Norm/Parquet dataset inference_session.predict(...) inference_session.evaluate(...) Train Inference
  • 37. 37 Usability - Democratize Model Parallelism with Three lines! Try it today https://github.com/NVIDIA- Merlin/distributed-embeddings to easily train TB size model with speed! Learn more from our blog here https://developer.nvidia.com/blog/fast- terabyte-scale-recommender-training-made- easy-with-nvidia-merlin-distributed- embeddings/
  • 38. ASYNCHRONOUS, MULTITHREADED DATA PIPELINE Read File Copy to GPU Train Read File Copy to GPU Train Read File Copy to GPU Train Batch 0 Batch 1 Batch 2 Time CPU Memory GPU Memory Datasets Worker Worker Worker Worker Collector Model Training HugeCTR pipeline can overlap the data read from disk to CPU memory, data transfer from CPU to GPU and the actual training on GPU across different batches. 4
  • 39. Triton Fast and scalable AI in production
  • 40. MERLIN TRITON: EASY DEPLOYMENT OF ETL WORKFLOW AND DEEP LEARNING MODELS WITH GPU-ACCELERATION Triton Inference with Merlin What it is: Triton Inference Server simplifies the deployment of AI models at scale in production. It is an open source inference serving software that lets teams deploy trained AI models from any framework. Both NVTabular workflows and TensorFlow, PyTorch and HugeCTR models can be deployed What it’s capable of: • Speed – GPU acceleration for inference in production, HugeCTR embedding cache to keep frequent users/items embedding on GPU • Scale – Triton with HugeCTR can deploy larger-than-memory embedding tables with parameter server • Usability - Only few lines of code to deploy ensemble of ETL workflow and deep learning model • Core Features - maximize CPU/GPU utilization with real-time inference or batch inference; supports all major deep learning frameworks and custom builts; Kubernetes integration for orchestration, metrics, and auto-scaling; multiple models at one Visualization of deployed ensemble model of NVTabular workflow and HugeCTR model. Alternatively, TensorFlow or PyTorch model can be deployed Triton 1 2 3 4
  • 41. MERLIN DEPLOYS AN ENSEMBLE MODEL FOR THE SAME ETL TRANSFORMATION IN PRODUCTION
[Diagram: a request flows through the deployed NVTabular workflow and the trained model inside Triton and returns a prediction; training-time statistics are reused in production.]
● Easy deployment to production with only a few lines of code (see the export sketch below)
● NVTabular collects statistics, such as the mean/std of numeric features and mapping tables for categorical features
● The same ETL workflow needs to be applied in production
● Triton supports TensorFlow and PyTorch, too
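[Editor's note] A minimal sketch of exporting such an ensemble with the merlin-systems library, assuming an already-fit NVTabular workflow and a trained TensorFlow model. The operator names (TransformWorkflow, PredictTensorflow, Ensemble) follow the Merlin Systems examples, but verify them against the version you install; `workflow` and `model` are placeholders.

    # Assumes: `workflow` is a fit nvtabular.Workflow and `model` is a trained
    # tf.keras model; both are placeholders from the Merlin examples.
    from merlin.systems.dag.ensemble import Ensemble
    from merlin.systems.dag.ops.workflow import TransformWorkflow
    from merlin.systems.dag.ops.tensorflow import PredictTensorflow

    # Chain the ETL workflow and the model into one serving graph: requests
    # are transformed exactly as at training time, then scored by the model.
    serving_graph = (
        workflow.input_schema.column_names
        >> TransformWorkflow(workflow)
        >> PredictTensorflow(model)
    )

    # Write Triton model-repository artifacts (config.pbtxt + model files)
    # for the workflow, the model, and the ensemble tying them together.
    ensemble = Ensemble(serving_graph, workflow.input_schema)
    ensemble.export("/models")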
  • 42. Merlin Overview - Make Recommenders on GPU Fast & Easy
[Stack diagram: high-level libraries (Merlin Models & Merlin Systems, Transformer4Rec, Merlin Data Loader) sit above training components (ETL with NVTabular, native HugeCTR, SOK for TF1, DE for TF2) and inference components (Triton, HPS, TRT), all built on lower-level libraries (RAPIDS: cuDNN, cuDF, cuBLAS, NCCL, ...); the legend distinguishes Merlin, partner, and other NVIDIA products.]
  • 43. Options of Running Inference

Option                           | Pros                                              | Cons
CPU for Embedding & MLP          | Runs everywhere                                   | Slow per socket
CPU for Embedding + GPU for MLP  | Easier to switch to GPU for MLP; makes use of all existing resources | Embedding is not accelerated
GPU for both Embedding & MLP     | Most performant, especially for large models      | Not supported by native TF

That's why we have the HugeCTR Hierarchical Parameter Server!
  • 44. Input Features Follow a Power-Law Distribution
Only a small fraction of parameters is accessed most frequently. These "hot parameters" are placed in the GPU cache. In the Criteo 1TB dataset, 305K out of 188M (0.16%) categories account for 95.9% of samples (a verification sketch follows below).
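[Editor's note] A small sketch of how you might verify such a coverage number on your own data: count category frequencies, sort descending, and find how many top categories cover 95.9% of interactions. The column name and the Zipf-distributed toy data are placeholders.

    import numpy as np
    import pandas as pd

    # Hypothetical interaction log with one categorical column.
    df = pd.DataFrame({"category": np.random.zipf(1.2, size=1_000_000)})

    counts = df["category"].value_counts()       # frequency per category, descending
    coverage = counts.cumsum() / counts.sum()    # cumulative share of samples
    n_hot = int((coverage < 0.959).sum()) + 1    # top categories covering 95.9%

    print(f"{n_hot} of {counts.size} categories "
          f"({n_hot / counts.size:.2%}) cover 95.9% of samples")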
  • 45. Hierarchical Parameter Server (HPS) - A "Giant" Key-Value Store!
- Each GPU has its own lookup session and its own embedding cache, with no inter-GPU communication
- Reduces synchronization time
- Ensures high system availability
- CPU memory and SSDs can be configured by users as extended embedding storage
- HPS supports embedding lookup for a variety of models/frameworks (a tiered-lookup sketch follows below)
[Diagram: per-GPU lookup sessions and embedding caches (GPUs 0-3) backed by CPU memory (e.g., Redis, HashMap) and SSD (e.g., RocksDB, HDFS).]
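[Editor's note] This is not the HugeCTR HPS API; it is a minimal Python sketch of the tiered-lookup idea only: check the fast GPU-resident cache first, fall back to CPU memory, then to SSD-backed storage, promoting hot keys upward. All tiers here are plain dicts standing in for real backends.

    class TieredEmbeddingStore:
        """Illustrative 3-tier lookup: GPU cache -> CPU memory -> SSD store."""

        def __init__(self, gpu_capacity):
            self.gpu_cache = {}        # stand-in for a per-GPU embedding cache
            self.cpu_store = {}        # stand-in for Redis / an in-memory hash map
            self.ssd_store = {}        # stand-in for RocksDB / HDFS
            self.gpu_capacity = gpu_capacity

        def lookup(self, key):
            if key in self.gpu_cache:                 # hot path: already on GPU
                return self.gpu_cache[key]
            vector = self.cpu_store.get(key)          # warm path: CPU memory
            if vector is None:
                vector = self.ssd_store[key]          # cold path: SSD
                self.cpu_store[key] = vector          # promote to the CPU tier
            if len(self.gpu_cache) < self.gpu_capacity:
                self.gpu_cache[key] = vector          # promote hot key to GPU tier
            return vector

    store = TieredEmbeddingStore(gpu_capacity=2)
    store.ssd_store.update({k: [0.1 * k] * 4 for k in range(10)})
    print(store.lookup(3))   # cold lookup, then promoted
    print(store.lookup(3))   # now served from the GPU cache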
  • 48. Merlin Models: Model Training Library
Merlin Models
What it is: High-quality implementations of models ranging from classical machine learning to more advanced deep learning, in TensorFlow.
What it's capable of:
▪ Implementations of common architectures, loss functions, and tasks
▪ A unified API that lets users create models in various frameworks and libraries: TensorFlow, XGBoost, Implicit, and LightFM
▪ Flexible building blocks to design custom architectures from common components
▪ GPU-optimized data loading and model training
▪ Integration with NVTabular
[Diagram: visualization of Merlin Models; a training sketch follows below.]
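[Editor's note] A minimal training sketch with the Merlin Models TensorFlow API, following the library's quick-start; treat the exact signatures as assumptions to check against your installed version. The Parquet paths and the "click" target name are placeholders for an NVTabular-processed dataset.

    import merlin.models.tf as mm
    from merlin.io import Dataset

    # Placeholder paths: data preprocessed by NVTabular, whose schema tags
    # categorical/continuous features and the binary "click" target.
    train = Dataset("train.parquet")
    valid = Dataset("valid.parquet")

    # A DLRM assembled from Merlin Models building blocks.
    model = mm.DLRMModel(
        train.schema,
        embedding_dim=64,
        bottom_block=mm.MLPBlock([128, 64]),
        top_block=mm.MLPBlock([128, 64, 32]),
        prediction_tasks=mm.BinaryClassificationTask("click"),
    )

    model.compile(optimizer="adagrad", run_eagerly=False)
    model.fit(train, validation_data=valid, batch_size=16 * 1024)
    model.evaluate(valid, batch_size=16 * 1024, return_dict=True)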
  • 49. Fast Iteration is Key
Merlin enables quick experimentation cycles to find high-accuracy models.
[Diagram: iteration loop of Input → Feature Engineering → Model Training → Evaluation → Deployment; NVTabular plus Merlin Models / HugeCTR / Transformer4Rec deliver roughly 10x speedups per stage, with Triton Inference Server handling deployment.]
  • 50. Example Model Architectures in Merlin Models
Facebook DLRM, Google DCN, YouTube DNN
  • 51. Transformers4Rec: Sequential/session-based recommendation with NLP Transformer architectures
  • 52. SESSION-BASED RECOMMENDATION TASK
Problem: Users' intentions change between sessions, and online services usually have access only to the current session (e.g., due to GDPR).
Goal: Build a recommender system that can leverage the short sequence of past interactions within the same session and dynamically adapt the next suggested item.
  • 53. NLP X SEQUENTIAL RECSYS
[Fig 2. The influence of NLP research on recommender systems: a 2013-2021 timeline pairing NLP milestones (Word2Vec, GRU, Attention, Transformer, BERT, GPT-2, XLNet, the HuggingFace Transformers library) with seminal RecSys neural architectures (Prod2Vec, Meta-Prod2Vec, GRU4Rec, NARM, ATRank, AttRec, SASRec, BERT4Rec, and the NVIDIA Transformers4Rec library).]
  • 54. WHAT IS TRANSFORMERS4REC?
Transformers4Rec
§ is a flexible and efficient open source library for sequential and session-based recommendation,
§ makes state-of-the-art Transformer architectures available to the RecSys community,
§ is available for both PyTorch and TensorFlow.
https://github.com/NVIDIA-Merlin/Transformers4Rec
  • 55. TRANSFORMERS4REC IS MULTI-FRAMEWORK AND EASY TO USE
[Side-by-side PyTorch and TensorFlow code examples; a PyTorch sketch follows below.]
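[Editor's note] A minimal PyTorch sketch following the Transformers4Rec quick-start; the module names (TabularSequenceFeatures, XLNetConfig, NextItemPredictionTask) come from the library's README, the schema path is a placeholder, and the exact signatures should be verified against your installed version.

    from transformers4rec import torch as tr
    from merlin_standard_lib import Schema

    # Placeholder: a schema describing the sequential item/session features.
    schema = Schema().from_proto_text("schema.pbtxt")

    max_seq_len, d_model = 20, 64

    # Turn tabular sequence features into Transformer inputs, with causal
    # language modeling (CLM) masking for next-item prediction.
    input_module = tr.TabularSequenceFeatures.from_schema(
        schema,
        max_sequence_length=max_seq_len,
        continuous_projection=d_model,
        aggregation="concat",
        masking="clm",
    )

    prediction_task = tr.NextItemPredictionTask(weight_tying=True)

    # Reuse an XLNet Transformer body, configured for RecSys-sized inputs.
    transformer_config = tr.XLNetConfig.build(
        d_model=d_model, n_head=4, n_layer=2, total_seq_length=max_seq_len
    )
    model = transformer_config.to_torch_model(input_module, prediction_task)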
  • 56. Transformers4Rec Workflow Pipeline
Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation.
[Fig 3. End-to-end session-based recommendation pipeline]
  • 58. PERSONALIZED PRODUCT RECOMMENDATIONS
Olay is arming consumers with knowledge to make informed purchase decisions. Its Olay Skin Advisor, a GPU-accelerated AI tool that works on any mobile device, assesses a user-provided selfie and advises how to improve trouble areas using a daily regimen of recommended Olay products. After four weeks, 94% of Skin Advisor users continued to use the products it recommended.
  • 59. IMPROVING DEMAND FORECASTS
With more than 100,000 different products in its 4,700 U.S. stores, the Walmart Labs data science team predicts demand for 500 million item-by-store combinations every week. By performing forecasting with the open source RAPIDS data processing and machine learning libraries, built on CUDA-X AI and running on NVIDIA GPUs, Walmart speeds up feature engineering 100x and trains machine learning algorithms 20x faster, resulting in faster delivery of products, real-time reaction to shopper trends, and inventory cost savings at scale.
  • 60. MODERNIZING THE WAREHOUSE
In 2019, global ecommerce represented $3.4 trillion, or 13.7%, of the $25 trillion retail sales market; Oberlo predicts this share will grow to 17.5% by 2021. With thousands of orders placed every hour, data scientists at Zalando, Europe's leading online fashion retailer, applied deep learning powered by NVIDIA GPUs to develop the Optimal Cart Pick algorithm, which reduced workers' travel time per item picked by 11%. The work is a good example of the efficiencies AI can uncover for e-commerce, manufacturing, and other large-systems-based industries.
  • 62. NVIDIA DEVELOPER PROGRAM - JOIN THE COMMUNITY THAT'S CHANGING THE WORLD
TOOLS
• Get exclusive access to an extensive library of NVIDIA software, spanning all of NVIDIA's technology platforms.
• Save time with ready-to-run, GPU-optimized software, model scripts, and containerized apps from the NVIDIA NGC™ catalog.
• Participate in early access programs where you can be one of the first to experience the latest NVIDIA technology.
TRAINING
• Take advantage of research papers, technical documentation, developer blogs, and industry-specific resources.
• Choose from a broad catalog of training options through the NVIDIA Deep Learning Institute (DLI).
• Get unlimited access to NVIDIA On-Demand, the home for NVIDIA resources from GTC and other leading industry events.
COMMUNITY
• Network with like-minded developers, engage with GPU experts, and contribute to discussions in the developer forums.
• Attend exclusive meetups, GPU hackathons, and events.
• Connect with NVIDIA experts through developer-focused webinars and instructor-led workshops.
Join the free program: developer.nvidia.com/join
  • 65. NVIDIA INCEPTION HELPS TECH STARTUPS BUILD AND GROW FASTER
Open to tech startups of all stages working in AI, deep learning, AR/VR, gaming, networking, and graphics.
15,000+ startups worldwide | 100+ countries represented | 37% program growth in 2022 | $94B+ in cumulative funding
  • 66. INCEPTION HELPS YOU ACHIEVE SUCCESS AT EVERY STAGE - NVIDIA INCEPTION
BUILD: Get free technical training, engineering guidance, and discounts on technology to build your solutions faster.
GROW: Accelerate growth with opportunities to increase your company's market awareness and gain exposure to VC investors.
SCALE: Premier members enjoy an enhanced set of go-to-market and technical guidance benefits to help them scale faster.
  • 68. NVIDIA DEEP LEARNING INSTITUTE FOR STUDENTS, RESEARCHERS AND PROFESSORS
  • 69. HELPING YOU SOLVE YOUR MOST CHALLENGING PROBLEMS
Hands-On Training in AI, Accelerated Computing, Accelerated Data Science, Graphics and Simulation, and More
• Build and deploy end-to-end projects across a range of technologies and domains.
• Gain hands-on experience with the most widely used, industry-standard software, tools, and frameworks.
• Join live, instructor-led workshops and learn from DLI-certified instructors who are experts in their fields.
• Take self-paced, online courses anytime, anywhere.
• Earn NVIDIA DLI certificates to demonstrate subject-matter competency and support career growth.
Course areas: Deep Learning Fundamentals | AI for Anomaly Detection | AI for Industrial Inspection | Conversational AI | AI for Intelligent Video Analytics | Accelerated Computing Fundamentals | Recommender Systems | AI for Predictive Maintenance | Accelerated Data Science Fundamentals | Graphics and Simulation | Networking | AI in the Data Center
Learn more: www.nvidia.com/dli