Whitepaper
In this paper, we begin with a view of the end-to-end deep learning workflow and move
into the details of taking AI-enabled applications from prototype to production
deployments. We’ll cover the evolving inference usage landscape, architectural
considerations for the optimal inference accelerator, and the NVIDIA AI inference
platform, a complete end-to-end stack of products and services that delivers the
performance, efficiency, and responsiveness critical to powering the next era of
generative AI inference.
Additionally, training LLMs from scratch can be incredibly expensive and take weeks to
months depending on the size of the task and underlying infrastructure. Off-the-shelf
LLMs can also fall short in catering to the distinct requirements of organizations,
whether it is the intricacies of specialized domain knowledge, industry jargon, or unique
operational scenarios. This is precisely why modern end-to-end generative AI workflows
take advantage of custom LLMs using various LLM customization techniques. These
customized models give enterprises the means to create solutions personalized to
match their brand voice, streamline workflows, deliver more accurate insights, and
provide rich user experiences. Once the AI model, or ensemble of models, is ready for
deployment, it must be optimized for inference in order to meet real-time latency
requirements at scale.
This requires a full-stack approach that addresses the entire workflow, start to finish,
from importing and preparing data sets for training to deploying a trained network as an
AI-powered service using inference.
See Figure 3 for the end-to-end deep learning workflow, from training to inference.
The focus of this paper is mainly on the challenges of deploying trained AI models in
production and how to overcome them to accelerate your path to production. However,
a key prerequisite before you get to the deployment phase is, of course, to have
completed the development phase of the AI workflow and have trained AI models that
are ready to take to production.
Depending on the service or product that you need to integrate your AI models into, and
how your end customers will interact with it, the optimal place to execute AI inference
can vary from the heart of the data center to the public cloud, the enterprise edge, or
embedded devices (see Figure 6).
Some industries, like healthcare for example, have well-established rules about where
data must be stored and how it can be accessed; for these customers and industries,
on-premises deployment is likely the right call. Cloud deployments are also a great
choice, since they provide on-demand compute as needed and allow organizations to
ease into the AI transition before making larger IT investments.
> NVIDIA Grace CPU Superchip: The NVIDIA AI software stack is also optimized for
NVIDIA’s Grace CPU, purpose-built for giant-scale AI and HPC applications. The
NVIDIA Grace CPU comes in two data center superchip products. The Grace Hopper
Superchip pairs an NVIDIA Grace CPU with an NVIDIA H100 Tensor Core GPU in a
coherent memory architecture for giant-scale NVIDIA AI applications. The NVIDIA
Grace CPU Superchip combines 144 high-performance Arm cores with a fast NVIDIA-
designed fabric and LPDDR5X memory for HPC and demanding cloud and enterprise
applications.
> DPU: The BlueField® DPU platform offloads, accelerates, and isolates a broad range of
advanced infrastructure services, providing AI data centers with high-performance
networking, robust security, and sustainability. For LLM AI inference, NVIDIA BlueField
DPUs provide high-speed, low-latency connectivity between NVIDIA GPUs, which
delivers consistent results as models grow in scale and complexity. They can handle
networking, storage, and security tasks, as well as AI acceleration.
> NVIDIA DGX Systems: NVIDIA DGX™ Systems are purpose-built for the unique
demands of enterprise AI. Powered by NVIDIA Base Command software and
architected for industry-leading performance and multi-node scale with DGX POD
and DGX SuperPOD, DGX systems are the gold standard in AI infrastructure,
delivering the fastest time-to-solution on the most complex AI workloads, including
natural language processing, recommender systems, data analytics, and more, with
direct access to NVIDIA AI experts.
> NVIDIA-Certified Systems: NVIDIA-Certified Systems bring NVIDIA GPUs and NVIDIA
high-speed, secure network adapters to systems from leading NVIDIA partners in
configurations validated for optimum performance, manageability, and scale. With an
NVIDIA-Certified System, organizations can confidently choose enterprise-grade
hardware solutions to power their accelerated computing workloads, from the
desktop to the data center and edge.
> Networking: NVIDIA Quantum InfiniBand and Spectrum Ethernet networking
platforms provide AI practitioners with advanced acceleration engines and
interconnect speeds of up to 400 Gb/s, enabling superior performance for inference
at scale. These platforms focus on high-speed communication and data transfer
between the parts of a distributed system, providing fast and efficient pathways for
rapid data exchange between servers, storage, and other components.
NVIDIA GPUs
NVIDIA offers a complete portfolio of GPUs, featuring Hopper and Ada Lovelace Tensor
Core GPUs as the inference engine powering NVIDIA AI. Following are the inference
GPUs (see Figure 7):
> NVIDIA GH200 Grace Hopper Superchip
Enterprises need a versatile system to handle the largest models and realize the full
potential of their inference infrastructure. The NVIDIA GH200 Grace Hopper
Superchip delivers over 7x the fast-access GPU memory of traditional accelerated
inference solutions and up to 284x more performance than CPUs to address LLMs,
recommenders, vector databases, and more.
> NVIDIA H100 Tensor Core GPU
The NVIDIA H100 Tensor Core GPU delivers unprecedented performance, scalability,
and security for every workload. The H100 GPU extends NVIDIA’s market leadership
in inference with advancements in the NVIDIA Hopper architecture that deliver
industry-leading conversational AI, speeding up large language models with over
175B parameters by 30x over the previous generation. With fourth-generation
NVIDIA NVLink, H100 accelerates workloads, while the dedicated Transformer
Engine supports trillion-parameter language models. The NVIDIA H100 PCIe GPU
configuration includes the NVIDIA AI Enterprise software suite to streamline
development and deployment of AI workloads. For LLMs up to 175B parameters,
systems equipped with H100 NVL GPUs can support inference on GPT3-175B with
12x more throughput in a fixed-power data center than previous-generation systems.
For next-generation trillion-parameter LLMs, HGX H100 systems can scale to the
highest inference performance.
> NVIDIA L40S GPU
Combining NVIDIA’s full stack of inference serving software with the compute
capabilities of the L40S GPU provides a powerful platform for trained models ready
for inference. With support for Transformer Engine, FP8, and a broad range of
precisions, servers equipped with 8x L40S GPUs deliver up to 1.7x the inference
performance of HGX A100 8-GPU systems.
> NVIDIA L4 Tensor Core GPU
The NVIDIA Ada Lovelace L4 Tensor Core GPU delivers universal acceleration and
energy efficiency for video, AI, virtual workstations, and graphics in the enterprise, in
the cloud, and at the edge. With NVIDIA’s AI platform and full-stack approach, L4 is
optimized for video and inference at scale for a broad range of AI applications to
deliver the best in personalized experiences. For AI video pipeline applications using
CV-CUDA®, servers equipped with L4 provide 120x better performance than CPU-
based server solutions, letting enterprises gain real-time insights to personalize
content and implement cost-effective smart-space solutions.
NVIDIA-Certified Systems
Deploying cutting-edge AI-enabled products and services in enterprise data centers
requires computing infrastructure that provides performance, manageability, security,
and scalability, while increasing operational efficiencies.
NVIDIA-Certified Systems™ enable enterprises to confidently deploy hardware solutions
that securely and optimally run their modern accelerated workloads. NVIDIA-Certified
Systems bring together NVIDIA GPUs and NVIDIA networking in servers, from leading
NVIDIA partners, in optimized configurations. These servers are validated for
performance, manageability, security, and scalability and are backed by enterprise-grade
support from NVIDIA and our partners. With an NVIDIA-Certified System, enterprises
can confidently choose performance-optimized hardware solutions to power their
accelerated computing workloads—from the data center to the edge.
NVIDIA-Certified Systems with the GH200 Superchips, H100, L40S and L4 GPUs deliver
breakthrough AI inference performance, ensuring that AI-enabled applications can be
deployed with fewer servers and less power, resulting in faster insights with dramatically
lower costs.
NVIDIA AI Enterprise
Inference is where AI models are put to work and make predictions. It is a crucial process
for enterprises that integrate AI to address questions and make evidence-based
decisions. However, maintaining the security and stability of an AI software stack with
its growing web of dependencies is a massive undertaking. A foundational AI stack
consists of over 4,500 open-source software packages, including third-party and
NVIDIA packages, representing 10,000 dependencies.
NVIDIA AI Enterprise, the enterprise-grade software that powers the NVIDIA inference
platform, accelerates time to production with security, stability, manageability, and
support. NVIDIA AI Enterprise includes proven, open-source NVIDIA frameworks,
pretrained models, and development tools to streamline development and deployment
of production-ready generative AI, computer vision, speech AI, data science, and more.
To maintain uptime for mission-critical AI applications, NVIDIA AI Enterprise offers
continuous monitoring and regular releases of security patches for critical and common
vulnerabilities and exposures (CVEs), production releases that ensure API stability,
management software for AI deployment at scale, and enterprise support with service-
level agreements (SLAs).
Key components of NVIDIA AI Enterprise that help optimize AI inference
performance and deployments include:
> NVIDIA TensorRT™, an SDK for high-performance deep learning inference (see the
sketch after this list)
> NVIDIA TensorRT-LLM, an SDK that includes an inference optimizer and runtime for
high-performance LLM inference
> NVIDIA Triton Inference Server, an open-source inference server for AI models
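To make the TensorRT workflow concrete, below is a minimal sketch that builds an optimized inference engine from a trained model exported to ONNX. The file names are placeholders, and API details vary across TensorRT versions, so treat this as an illustration rather than a definitive recipe.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse a trained model exported to ONNX ("model.onnx" is a placeholder).
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

# Build a serialized engine; FP16 is enabled here as one common optimization.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)

# The serialized engine can be loaded by the TensorRT runtime or served by
# Triton Inference Server from its model repository.
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```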
TensorRT-LLM
The LLM ecosystem is innovating rapidly, developing new and diverse model
architectures for new capabilities and use cases. As LLMs evolve, developers need high-
accuracy results for optimal production inference deployments. However, increases in
LLM size drive up the costs and complexities of optimal deployment.
Tensor Parallelism
Previously, developers looking to achieve the best performance for LLM inference had to
rewrite and manually split the AI model into fragments and coordinate execution across
GPUs. Now with TensorRT-LLM, developers can use tensor parallelism, a type of model
parallelism in which individual weight matrices are split across devices. This enables
efficient inference at scale, with each model running in parallel across multiple GPUs
connected through NVLink and across multiple servers, without developer intervention
or model changes.
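To illustrate the idea behind tensor parallelism (a conceptual NumPy sketch, not TensorRT-LLM’s actual implementation), the weight matrix of a linear layer can be split column-wise across devices; each device computes a partial product, and the partial outputs are concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))      # a batch of input activations
W = rng.standard_normal((1024, 4096))   # full weight matrix of a linear layer

# Column parallelism: each "device" holds one vertical slice of W.
num_devices = 4
shards = np.split(W, num_devices, axis=1)

# Each device computes x @ W_shard independently (on its own GPU in practice).
partial_outputs = [x @ shard for shard in shards]

# Concatenating the partial results reproduces the full layer output.
y_parallel = np.concatenate(partial_outputs, axis=1)
assert np.allclose(y_parallel, x @ W)
```

In a real deployment, each shard lives on a separate GPU and the results are exchanged over NVLink; TensorRT-LLM performs this partitioning and coordination automatically.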
To find the best Triton model configuration, developers can use Model Analyzer to
identify optimal parameter settings for batch size, model concurrency, and precision for
efficient inference. This process can optimize hardware usage, maximize model
throughput, increase reliability, and allow for better hardware sizing by better managing
the GPU memory footprint. However, this is just one step in the end-to-end deployment
process. Model Navigator automates the steps from trained model to deployment by
converting models to TensorRT, validating accuracy after TensorRT conversion,
optimizing configurations with Model Analyzer, generating model configuration files for
Triton’s model repository, and finally generating a Helm chart to deploy on Kubernetes.
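Once a model is live behind Triton, applications typically query it over HTTP or gRPC. The following is a minimal client sketch using the tritonclient Python package, assuming a Triton server at localhost:8000 serving a hypothetical model named my_model; the tensor names, shapes, and data types are placeholders that must match the model’s configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton server (8000 is Triton's default HTTP port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request; names and shapes must match the model's config file.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input__0", batch.shape, "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("output__0")]

# Run inference and read the result back as a NumPy array.
response = client.infer("my_model", inputs, outputs=outputs)
result = response.as_numpy("output__0")
print(result.shape)
```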
Given the diversity of AI use cases across industries, a one-size-fits-all approach to
accelerated AI inference is far from optimal. To that end, NVIDIA has created
application-specific frameworks to accelerate developer productivity and address the
common challenges of deploying AI within those specific applications. Figure 11 provides
a quick overview of a few of these.
To help convey how NVIDIA’s application-specific frameworks accelerate the path to
developing and deploying AI in production, we will zoom into four use cases: generative
AI / LLMs, conversational AI, recommender systems, and computer vision, including the
challenges inherent within each and how to address them using a full-stack approach.
NVIDIA Riva provides state-of-the-art models, fully accelerated pipelines, and tools to
easily add speech and translation AI capabilities to real-time applications like intelligent
virtual assistants, call center agent assists, and video conferencing. Riva components are
fully customizable, so you can adapt the applications for your use case and industry and
deploy them in all clouds, on-premises, at the edge, and on embedded systems. In
addition, Riva offers packaged AI workflows for audio transcriptions and intelligent
virtual assistants.
Under the hood, Riva applies powerful NVIDIA TensorRT optimizations to models,
configures the NVIDIA Triton Inference Server for model serving, and exposes the
models as a service through a standard API that can easily be integrated into
applications. For domain-specific data, users can fine-tune Riva speech and translation
models with NVIDIA NeMo to achieve the best possible accuracy.
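As a sketch of consuming such a service, the example below uses the nvidia-riva-client Python package for offline speech recognition, assuming a Riva server at localhost:50051 and a local WAV file; the exact parameters may differ across Riva releases.

```python
import riva.client

# Connect to a running Riva server (50051 is the default gRPC port).
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Transcribe a local audio file in offline (batch) mode.
with open("audio.wav", "rb") as fh:  # placeholder file name
    audio_bytes = fh.read()

response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```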
NVIDIA Riva is part of the NVIDIA AI Enterprise software platform, and can be purchased
for production deployments with unlimited usage on all clouds, access to NVIDIA AI
experts, and long-term support. Riva containers are also available for free for
development for 90 days to all members of the NVIDIA Developer Program on NGC.
Organizations looking to deploy Riva-based applications can apply to try Riva on NVIDIA
Launchpad, a program that provides short-term access to enterprise-grade NVIDIA
hardware and software via a web browser.
As the volume of data available to power these systems grows, data scientists and ML
engineers are increasingly moving from traditional ML methods to highly expressive DL
models to improve the quality of their recommendations. In the future, they will rely on
an ensemble of tools, techniques, and frameworks to deploy at scale.
Recommenders work by collecting information, such as what movies you tell your video
streaming app you want to see, ratings and reviews you’ve submitted, purchases you’ve
made, and other actions you’ve taken in the past. These data sets are often huge and
tabular, with multiple entries of metadata, including product and customer interactions.
They can be hundreds of terabytes in size and require massive compute, connectivity,
and storage performance to train effectively.
With NVIDIA GPUs, you can exploit data parallelism through columnar data processing
instead of traditional row-based reading designed initially for CPUs. This provides higher
performance and cost savings. Current DL-based models for recommender systems,
such as DLRM, Wide and Deep (W&D), Neural Collaborative Filtering (NCF), and
Variational AutoEncoder (VAE), are part of the NVIDIA GPU-accelerated DL model
portfolio, which covers a wide range of network architectures and applications in many
domains beyond recommender systems, including image, text, and speech analysis.
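To make the columnar-processing point concrete, here is a small sketch using the RAPIDS cuDF library, which exposes a pandas-like API that executes on the GPU; the file and column names are hypothetical:

```python
import cudf

# Load tabular interaction data directly into GPU memory (Parquet is columnar).
interactions = cudf.read_parquet("interactions.parquet")  # placeholder file

# Columnar operations run over entire columns in parallel on the GPU.
rating = interactions["rating"]
interactions["rating_norm"] = (rating - rating.mean()) / rating.std()

# Aggregate per-user statistics, a common feature-engineering step for
# recommender pipelines.
user_stats = interactions.groupby("user_id").agg(
    {"rating": "mean", "item_id": "count"}
)
print(user_stats.head())
```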
Computer Vision
Image-centric use cases have been at the center of the DL phenomenon, going back to
AlexNet, which won the ImageNet competition in 2012, signaling what we refer to as the
“Big Bang” of DL and AI. Computer vision has a broad range of applications, including
smart cities, agriculture, autonomous driving, consumer electronics, gaming, healthcare,
manufacturing, and retail services to name a few. In all these applications, computer
vision is the technology that enables the cameras and vision systems to perceive,
analyze, and interpret information in images and videos.
Modern cities are dotted with video cameras that generate a massive amount of data
every day. Deep learning-based computer vision is the best way to turn this raw video
data into actionable insights, and NVIDIA GPU-based inference is the only way to do it in
real-time. To enable developers, NVIDIA offers a variety of GPU-accelerated libraries,
SDKs, and application frameworks for every stage of the computer vision pipeline,
including codecs, data processing, training, and inference, from the edge to the cloud.
Data Processing
CV-CUDA - An open-source, low-level library that easily integrates into existing custom
CV applications to accelerate video and image processing.
DALI - A holistic framework for loading, decoding, and processing data, offering
operators for augmenting 2D and 3D image, video, and audio data. It provides
convenient portability of data processing between training and inference (see the
pipeline sketch after this list).
NPP - A comprehensive set of image/video/signal processing ops in a closed source
library for efficient processing of large images (e.g., 20kx20k pixels). Performs up to 30x
faster than CPU-only implementations.
Optical Flow - Computes flow vectors to detect and track objects in successive video
frames and to interpolate or extrapolate video frames for smoother video playback.
VPI - A low-level library providing CV operators for use in AI/CV pipelines running on
embedded devices like Jetson or on dGPUs, including in thermally and energy-
constrained environments. Also supports CPUs, GPUs, PVA, VIC, and OFA.
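As an example of the data-processing stage, the following minimal NVIDIA DALI pipeline sketch loads, decodes, resizes, and normalizes images on the GPU; the directory path and sizes are placeholder values:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def image_pipeline():
    # Read JPEG files and labels from a directory tree (placeholder path).
    jpegs, labels = fn.readers.file(file_root="images/", random_shuffle=True)
    # "mixed" decoding starts on the CPU and finishes on the GPU.
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,
                                      output_layout="CHW")
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()
```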
Training
NVIDIA Omniverse Replicator - A core extension of the Omniverse platform, Replicator
allows developers to bootstrap the model training process by generating photo-realistic,
physically-aware training datasets.
NVIDIA TAO - Create highly accurate, customized, and enterprise-ready AI models with
this low-code toolkit and deploy them on any device (GPUs, CPUs, and MCUs), whether
at the edge or in the cloud.
Fraud Detection
Transaction fraud is a multi-billion-dollar problem. Detecting true fraud is critical, but
traditional systems have historically generated many more false-positive than true-fraud
signals. Now, advanced machine learning and deep learning techniques are improving
detection and, at the same time, drastically cutting false-positive rates. AI is
revolutionizing multi-trillion-dollar industries and powering growth around the world,
making a significant impact on organizations’ bottom lines.
Leveraging NVIDIA’s full-stack platform, leading banks are deploying enterprise AI
capabilities that reduce operational costs, drive higher revenues, improve customer
satisfaction, and create long-term competitive advantage. NVIDIA Triton Inference
Server and the NVIDIA TensorRT SDK, part of NVIDIA AI Enterprise software, can help
with easy deployment, running, and scaling of AI models to deliver meaningful outcomes
in financial services. Accelerated inference can benefit modern-day inference pipelines
across cloud and on-premises environments.
NVIDIA LaunchPad
NVIDIA LaunchPad is a universal proving ground, offering extensive testing of the latest
NVIDIA enterprise hardware and software. It expedites short-term trials, facilitates long-
term proofs of concepts (POCs), and accelerates the development of both managed and
standalone services.
Begin with a tailored development environment or explore a wide selection of hands-on
labs. These labs offer guided experiences across use cases ranging from AI and data
science to 3D design and infrastructure optimization. Enterprises can access essential
hardware and software stacks through private hosted infrastructure.
Enterprise Support
As AI initiatives move into the production stage, the need for a trusted, scalable support
model for enterprise becomes vital to ensuring AI projects stay on track. NVIDIA
Enterprise Support is offered through NVIDIA AI Enterprise and includes:
> Broad Platform Support: Full enterprise grade support for multiple deployment
options across on-prem, hybrid and multi-cloud environments
> Access to NVIDIA AI Experts: Local business hours (e.g., 9 a.m.-5 p.m.) support
includes guidance on configuration and performance, and escalations to engineering
> Priority notification: Latest security fixes and maintenance releases
> Long-term support: Up to 3 years for designated software branches
> Customized support upgrade option: Designated Technical Account Manager (TAM) or
Business Critical support for 24x7 live agent access with a one-hour response time for
severe issues
Request a free 90-day NVIDIA AI Enterprise evaluation license, which includes access to
the NVIDIA enterprise-grade inference platform and NVIDIA enterprise support.
Trademarks
NVIDIA, the NVIDIA logo, NVIDIA TensorRT, NVIDIA Triton, NVIDIA Merlin, NVIDIA Hopper, NVIDIA Grace, NVIDIA Ampere, BlueField, DGX, NVIDIA Grace
Hopper, NVIDIA-Certified Systems, NVIDIA Jetson, NVIDIA CUDA, and NVIDIA Orin are trademarks and/or registered trademarks of
NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which
they are associated.
Copyright
© 2023 NVIDIA Corporation. All rights reserved.