
InferBench: Understanding Deep Learning Inference Serving with an Automatic Benchmarking System

Huaizheng Zhang 1, Yizheng Huang 1, Yonggang Wen 1, Jianxiong Yin 2, Kyle Guan 3
1 Nanyang Technological University, Singapore; 2 NVIDIA AI Tech Center; 3 Nokia Bell Labs.
Correspondence to: Huaizheng Zhang <[email protected]>.

arXiv:2011.02327v3 [cs.LG] 5 Jan 2021. Preliminary work. Under review. Do not distribute.

ABSTRACT

Deep learning (DL) models have become core modules for many applications. However, deploying these models without careful performance benchmarking that considers both hardware and software impact often leads to poor service and costly operational expenditure. To facilitate DL model deployment, we implement an automatic and comprehensive benchmark system for DL developers. To accomplish benchmark-related tasks, developers only need to prepare a configuration file consisting of a few lines of code. Our system, deployed to a leader server in DL clusters, dispatches users' benchmark jobs to follower workers. Next, the corresponding requests, workloads, and even models can be generated automatically by the system to conduct DL serving benchmarks. Finally, developers can leverage many analysis tools and models in our system to gain insights into the trade-offs of different system configurations. In addition, a two-tier scheduler is incorporated to avoid unnecessary interference and improve the average job completion time by up to 1.43x (equivalent to a 30% reduction). Our system design follows the best practices of DL cluster operations to expedite developers' day-to-day DL service evaluation efforts. We conduct many benchmark experiments to provide in-depth and comprehensive evaluations. We believe these results are of great value as guidelines for DL service configuration and resource allocation.

1 INTRODUCTION

Deep learning (DL) tools are transforming our life, as they are incorporated into more and more cloud services and applications (Tran et al., 2015; Bahdanau et al., 2014; Gatys et al., 2016; Silver et al., 2017). Facing this ever-increasing DL deployment demand, many companies, tech giants and startups alike, are engaging in a fierce arms race to develop customized inference hardware (Jouppi et al., 2017), software (Baylor et al., 2017) and optimization tools (Chen et al., 2018) to support DL services. Hence, there is a huge need to concurrently develop systems to study and benchmark their performance as guidelines for deploying high-performance and cost-effective DL services.

Unlike conventional web frameworks and techniques, DL techniques, as well as their inference hardware and software platforms, are still evolving at a rapid pace. To address this challenge, an easy-to-use and highly deployable benchmark system is needed to assist DL developers in their daily performance evaluation tasks and service configuration. Also, the system must provide out-of-the-box methodologies to assess and understand the intricate interactions among DL models, hardware, and software across a variety of configurations (e.g., batch size and layer number) (Wang et al., 2020). The obtained results can be used as guidelines for future design.

Though existing benchmark studies have made substantial contributions to understanding DL inference performance (Reddi et al., 2020; Ignatov et al., 2019; Coleman et al., 2017), they still do not adequately address the aforementioned challenges. The MLPerf inference benchmark (Reddi et al., 2020), as the current state-of-the-art solution, lacks the much-needed configurability, as it leaves the implementation details to developers. As a result, developers have to spend days or even weeks preparing a benchmark submission for a fair comparison. In addition, the results (Reddi et al., 2020) only showed the importance of hardware and software configuration but provided very limited insights and prescriptive guidelines for DL service configuration and resource allocation. Moreover, benchmarking performance using only one or two models optimized for specific inference tasks (e.g., a well-trained ResNet50 model (He et al., 2016) for image classification) provides a very limited understanding of the impact of hyper-parameters (e.g., layer number and batch size) on performance or resource utilization for a wide range of inference applications.

In this work, we propose an automatic and complete DL serving benchmark system (as shown in Figure 1) to address these needs and narrow the gaps of existing systems. We design our system following the best practices of DL cluster design.
Figure 1. The overview of the proposed benchmarking system. The system first accepts users' benchmarking tasks. Then it distributes the tasks to dedicated servers to complete them automatically. Finally, it sends a detailed report and guidelines back to users.

Figure 2. Three benchmark tiers for evaluating DL inference. The performance (e.g., latency and cost) of an AI application is influenced by many factors, which can be categorized into three classes: hardware, software, and pipeline.

The goal is to provide an end-to-end solution that frees DL developers from tedious and potentially error-prone benchmarking tasks (e.g., boilerplate code writing, data collection and workload generation). In addition to the models chosen by users, the system can effortlessly generate and iterate models with different hyper-parameters (e.g., different layer types and different numbers of layers) to adequately explore the design space. The system also provides an extensive set of analysis tools and models to help developers choose the best configuration for their applications under constraints like latency, cloud cost, etc. Besides, we implement a two-tier scheduler in the system to improve service efficiency.

For the logical clarity and comprehensiveness of benchmarking evaluations, we designate impact factors for DL inference into three tiers, as shown in Figure 2. For the hardware tier, we select five representative platforms for evaluation. For the software tier, we choose four representative online serving infrastructures. For the pipeline tier, we simulate real-world workloads and examine a specific inference service with three types of transmission technologies. We use metrics such as tail latency, cloud cost and resource usage to measure their performance, and analysis models like Roofline (Williams et al., 2009) and heat maps to profile DL serving system properties.

In summary, the main contributions of this paper are as follows:

• We build an automatic and distributed benchmark system in DL clusters to streamline DL serving inference benchmarks for developers.

• We conduct a comprehensive performance analysis on three tiers with many DL applications under various workloads, providing users insights to trade off latency, cost, and throughput as well as to configure services.

• We use generated models to study the sensitivity of hardware performance to model hyper-parameters and derive valuable insights for system design.

• We implement a scheduler to ensure safe benchmark progress and reduce the average benchmark job completion time.

In the remainder of this paper, we first introduce the background knowledge of DL inference serving and our motivation in Section 2. Then, we detail three benchmark tiers as well as their benchmarking metrics in Section 3. Next, we present the system implementation and the employed methodologies in Section 4. We employ our system to perform benchmark tasks and evaluate its performance in Section 5. Finally, we briefly introduce the related work in Section 6 and summarize our paper in Section 7.

2 BACKGROUND AND MOTIVATION

In this section, we introduce the typical workflow of building DL services, including pre-deployment, post-deployment and online serving. We illustrate how current benchmark studies cannot fully address the challenges of the new workloads, thus motivating us to build a new benchmark system.

2.1 Deep Learning Service Pre-Development

After receiving models from data scientists, engineers need to perform many optimization and service configuration tasks before models go to production, as shown in Figure 3. First, models along with their artifacts, such as weight files and corresponding processors (e.g., image resizing), will be stored and versioned (Chard et al., 2019). Developers may try to re-implement processors with a production language such as Java and test them. Second, engineers will optimize models (e.g., INT8 conversion or model compression (Han et al., 2015)) while maintaining good accuracy. As such, engineers need to check every new model to ensure that it meets both accuracy and latency requirements. Third, the users will choose a serving system such as
Tensorflow-Serving (TFS) (Baylor et al., 2017) to bind their models as a service. Different serving systems perform differently over a wide range of hardware choices. Moreover, these systems have many new features such as dynamic batching (Crankshaw et al., 2017), resulting in a complex configuration space with varied performance. Developers need to perform rigorous evaluations for a service. Finally, a service can be dispatched to a cluster to serve customers.

Figure 3. The pipeline of AI service deployment. To build a high-performance and low-cost AI service, users need to spend a lot of time on optimization and benchmarking.

Observation 1: A benchmark task for building a service often takes tens of iterations. Also, developers need to consider the trade-offs among many impact factors (IFs) under SLOs (Service-Level Objectives) or a budget to make a judicious choice. To efficiently complete the task, developers need a simple, easy-to-use and customizable system.

Observation 2: Benchmark tasks need a dedicated and isolated runtime environment. As AutoML techniques are becoming popular, these tasks will consume more resources, resulting in long delays in getting benchmark results. This motivates us to propose a way to perform benchmarks with efficient use of resources.

2.2 Deep Learning Service Post-Development

Upon finishing the pre-development, the developers still need to 1) study the characteristics of accelerators (e.g., TPU (Jouppi et al., 2017) and FPGA) for upgrading them; 2) design better resource allocation mechanisms for DL serving in clusters; and 3) monitor DL services to diagnose performance issues immediately. The automation of these tasks has been studied extensively. For instance, while many engineers manually write DL operators, TVM (Chen et al., 2018) automatically generates operators to run models on specific hardware efficiently. Also, to increase GPU resource utilization, MPS (NVIDIA, 2020a) and Salus (Yu & Chowdhury, 2019) provide support to share a GPU among multiple models. All of these studies require a lot of benchmarking effort. However, current benchmark studies ignore these post-deployment activities and provide very little support for developers.

Observation 3: Resource usage under different scenarios is not fully studied, which limits the development of resource allocation methods for DL services. Developers need to analyze resource usage with varied settings to improve resource utilization.

Observation 4: Current studies of simple isolated models cannot be easily generalized to analyze system bottlenecks (e.g., compute- or memory-bound). Also, only benchmarking these models does not help in understanding the influence of hyper-parameters on both DL hardware and model performance. More effective analysis methodologies should be incorporated.

2.3 DL Inference Serving Workflow

In this section, we depict how a DL service processes users' requests and point out the limitations of current studies. As shown in Figure 4, customers first send requests from a mobile app or website. These requests will be pre-processed either on the client or server side to meet the format requirements of a model service. The requests are then forwarded to a frontend in a server for further dispatching. Next, each request will be fed into a backend where one or multiple models are loaded by a serving platform (e.g., TFS). Once the inference is completed, the predictions will be post-processed (either in a server or a client) and displayed to the end users.

Figure 4. A simplified view of a DL inference pipeline. Users first send requests through a mobile or web app. A service broker then distributes requests to backend servers for processing. Usually, during the processing, a request will go through a pre-processor, a batch manager, a model server and a post-processor.

Observation 5: The end-to-end design of the pipeline determines the performance of a service. As such, it is very useful to have a detailed benchmark to locate the bottleneck.

Observation 6: Since a service processes hundreds of thousands of requests daily (Hazelwood et al., 2018), the mitigation of tail latency under varied arrival rates is critical. Serving software usually includes functions to address this issue. Developers should have a tool to analyze this.

Observation 7: Existing DL serving systems (Crankshaw et al., 2017; Baylor et al., 2017) provide several mechanisms, such as dynamic batching, to improve the performance of a service. A benchmark system should explore these advanced features.

3 BENCHMARK TIERS

In this section, we describe three benchmark tiers: hardware, software, and pipeline, as shown in Figure 2.
3.1 Tier 1 - Hardware

For the first tier, our system mainly focuses on inference hardware performance as well as the interactions between hardware and models under varied configurations and scenarios. In addition to the latency and throughput measurement capabilities provided in previous research, we extensively study the cost, the sensitivity of hardware performance to model hyper-parameters, and the hardware bottleneck for different types of models. We next detail the metrics.

Latency & Throughput. These two metrics are widely used to measure hardware. In general, online real-time services require low latency, and offline processing systems prefer high throughput. Unlike CPUs, new accelerators such as TPUs and GPUs encourage batch processing to improve resource utilization. We extensively study this property with our system.

Cost. When developers implement a DL service, they must consider their budget. To support this, our system provides tools to measure energy, CO2 emission (Anthony et al., 2020), and cloud cost under different devices and cloud providers, with the aim of presenting a comprehensive study.

Sensitivity to Model Hyper-parameters. DL models have many hyper-parameters, such as layer types (e.g., LSTM (Hochreiter & Schmidhuber, 1997)) and the number of layers. Our system can generate models to study their influence and provide insights.

Memory & Computation. The performance of a model on a device is decided by both the computation and memory capacity of the device. We explore both of them with real-world and generated models under a wide range of hyper-parameters.

3.2 Tier 2 - Software

In the second tier, we study the impact of software such as model formats and serving platforms. Here, the system provides complete support for serving platforms such as TFS (Olston et al., 2017) and Triton Inference Server (NVIDIA, 2020b), which have not been fully investigated in previous studies. Users can easily extend our system to support more platforms with the provided APIs. We study the following features.

Tail Latency. A well-designed serving platform can effectively mitigate the effect of tail latency, which is critical for online services. We provide a detailed study of the performance under varied request arrival rates.

Resource Usage. Serving software may bring overhead or save resources through its dedicated design. Understanding this behavior leads to better resource allocation.

Advanced Features. To improve resource utilization, serving software provides some useful functions like dynamic batching. Our system can help to study their impacts.

3.3 Tier 3 - Pipeline

In this tier, we aim to look into each stage to identify the bottleneck of a service under various conditions. The system supports exploring the performance of each stage.

Latency per Stage. The DL serving pipeline often consists of five stages: pre-processing, transmission, batching, DL inference, and post-processing. In addition to the DL inference stage, the other stages also have varied performance under different conditions (e.g., transmission under varied networking conditions). We will discuss this in detail. Meanwhile, the cold start, which refers to the time interval (milliseconds to seconds) needed to start a system, varies across systems. It is also a critical metric, as it decides the provisioning time.

Summary. The huge configuration space leads to many trade-offs, including but not limited to latency versus throughput, accuracy versus latency, cost versus quality, and sharing versus dedicated use. With the help of our system, DL developers can inspect these trade-offs for better configuration and future upgrades.

4 SYSTEM DESIGN AND IMPLEMENTATION

Design Philosophy. We aim to design an automatic and distributed benchmark system which can be 1) operated independently to perform benchmark tasks; 2) incorporated into DL lifecycle management (Zaharia et al., 2018) or DL continuous integration (Zhang et al., 2020; Karlaš et al., 2020) systems to further improve the automation of DL service development and evaluation; and 3) connected to a monitoring system for DL service diagnosis. Accordingly, our system can serve users involved at various stages of a DL service pipeline, i.e., data scientists, DL deployment engineers, hardware engineers, and AIOps engineers, by significantly reducing the often manual and error-prone tasks.

4.1 System Overview

As shown in Figure 1, our centralized benchmark system uses a leader/follower architecture. The leader server manages the whole system by accepting users' benchmark submissions and dispatching benchmark tasks to specific follower workers, guided by a task scheduler. The leader server also generates the corresponding requests and workloads for later evaluation. Meanwhile, users can choose either to register their own models to our system or to use the different iterations of canonical models generated by our system for benchmarking. Next, the follower workers, which can be any of the idle servers in a cluster, execute users' benchmark tasks according to the submissions' specifications. Specifically, a worker loads and serves a model with an infrastructure such as TensorFlow-Serving (TFS) and then invokes the corresponding pre-processing and post-processing functions to complete a service pipeline. Finally, users can start clients to simulate real-world workloads by sending requests to the DL service to evaluate its performance.

For evaluation, a metric collector is utilized as a daemon process to obtain detailed performance information via a probe module. At the same time, a logging module is enabled to record the running system's status, including both hardware and software. All of this information is sent to a PerfDB (performance database). After evaluation, users can aggregate the performance data and use our built-in analysis models to extract insights. We also implement a recommender and a leaderboard for service configuration and resource allocation.
Table 1. Five hardware platforms we used for experiments.

ID | Platform (Arch) | Version                | Memory | Peak TFLOPS (FP32/FP16) | Memory Bandwidth (GB/s) | AWS (Instances) | Google Cloud (Instances)
C1 | CPU             | Intel Xeon E5-2698 v4  | 128 GB | -                       | -                       | -               | -
G1 | GPU (Volta)     | Tesla V100             | 32 GB  | 15.7 (31.4)             | 900                     | 4               | 4
G2 | GPU (Turing)    | GeForce 2080Ti         | 11 GB  | 14.25 (28.5)            | 616                     | -               | -
G3 | GPU (Turing)    | Tesla T4               | 16 GB  | 8.1 (16.2)              | 300                     | 7               | 3
G4 | GPU (Pascal)    | Tesla P4               | 8 GB   | 5.5 (11.0)              | 192                     | -               | 3

Figure 5. The benchmarking system architecture. A management plane running in the leader server is designed to control and monitor the system and users' benchmarking tasks. We divide an inference benchmark into four stages and provide a set of functions to complete it automatically.

4.2 System Implementation Detail

Our implementation of the system (shown in Figure 5) follows three best practices:

Modularity. Modularity can provide 1) a seamless extension to support evolving DL applications, 2) a natural adaptation to existing model management and serving systems, and 3) easy customization by developers.

Scalability. As cluster-based model training and deployment have become the de facto practice in both industry and academia, our system supports all the computation and management functionalities required by cluster computing, with the aim of keeping users' manual efforts (e.g., manually moving model and result files) to a minimum.

System Integrity and Security. Since starting a benchmarking task without proper coordination with the leader can potentially interrupt a running service, the system shall have the capability to monitor the environment status so as to decide whether a task can be executed or must be scheduled for later.

4.2.1 Management Plane

The management plane, designed to control all benchmark tasks in users' clusters, consists of four functional blocks: a task manager, a monitor, a scheduler, and other utility functions. The goal is to provide an interface for users' fine-grained control inputs.

Task Manager. The task manager accepts users' benchmark submissions and logs related information, such as the user name, task ID, and submission timestamp. It then dispatches each task to a specific worker according to the task specification and starts the related procedures. In our implementation, we use MongoDB as the backend. Developers can easily replace the backend with their preferred database, such as MySQL.

Monitor. This functional block collects and aggregates system resource usage. Specifically, we use two backends, cAdvisor (Google, 2018) and Node Exporter (NVIDIA, 2018), for logging the status of the serving container (e.g., CPU usage) and the hardware status (e.g., GPU usage), respectively.

Scheduler. We aim to design a multi-tenant system to organize a team's daily benchmark tasks and avoid potential conflicts and interference. As such, we implement a scheduler for these benchmark jobs to minimize the average job completion time (JCT) and improve efficiency. Specifically, we design a simple baseline method (discussed in Section 4.3.2), upon which developers can build their own algorithms according to their workload profiles.
Utility Functions. Currently, we have two utility functions: a sharing manager and a configuration recommender. The sharing manager helps users configure MPS, the de facto software for sharing one GPU among multiple DL models, to support sharing benchmarks. The recommender helps users make a simple decision on service configuration. Users need to input an SLO (e.g., latency), and the system will return the top 3 configurations.

4.2.2 Stage 1 - Generate

In the first stage, our system prepares requests, workloads (sending patterns), and models for benchmarking according to users' specifications. From their submission (a YAML file), the system first chooses to call either real-world or generated models and then prepares the corresponding requests and workloads to perform the tasks. We next describe the four main functions of this stage.
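The exact schema of this YAML submission is not spelled out here; as a purely hypothetical illustration (all field names below are assumptions rather than the system's documented schema), a few-line submission could look like the following sketch, parsed with PyYAML:

```python
# Hypothetical benchmark submission; field names are illustrative assumptions,
# not the system's documented schema.
import yaml  # pip install pyyaml

submission_text = """
task_id: "0001"
model: bert-large          # a registered model name, or 'generated'
format: onnx               # savedmodel | torchscript | onnx | tensorrt
serving: onnx-runtime      # tfs | tris | fastapi-onnx | fastapi-torchscript
hardware: G3               # platform ID from Table 1 (Tesla T4)
batch_sizes: [1, 2, 4, 8, 16, 32]
workload:
  pattern: poisson
  arrival_rate: 120        # requests per second
  duration_s: 300
metrics: [latency, throughput, cost, gpu_util]
"""

submission = yaml.safe_load(submission_text)
print(submission["workload"]["arrival_rate"])  # -> 120
```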
Model Repository. As each team can produce tens or even hundreds of models, this module is designed to help them organize different versions of models. It has four APIs, including register, update, search, and delete, based on MongoDB with GridFS. Both model weights and basic information, such as the model name and the dataset used, can be stored in the repository.

Canonical Model Generator. The module contains a trainer and an exporter to generate canonical models under different hyper-parameters, including the batch size, the neuron number, and the layer number (see the appendix for details). We build models by repeatedly stacking the four most commonly used blocks (layers): a fully-connected layer (FC), a residual block (CNN), an LSTM layer (RNN), and an attention block (Transformer (Vaswani et al., 2017)), to obtain four groups of models. They are trained with Tensorflow and exported to a format that can be served by Tensorflow-Serving. Different from the isolated real-world models registered by users, the canonical models can help explore the sensitivity of hardware performance to model hyper-parameters, which can not only benchmark target platforms but also profile their properties.
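As a minimal sketch of this idea for the FC group only (assuming TensorFlow 2 with Keras; the residual, LSTM, and attention groups would follow the same stacking pattern, and the input dimension and sweep values below are illustrative), the generator can be approximated as:

```python
# Sketch of the fully-connected (FC) group of canonical models: the layer count
# and neuron count are swept as hyper-parameters, and each variant is exported
# as a SavedModel directory that TensorFlow-Serving can load.
import tensorflow as tf

def build_fc_model(num_layers: int, num_neurons: int, input_dim: int = 784):
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(num_layers):
        x = tf.keras.layers.Dense(num_neurons, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1000, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

for layers in (2, 8):
    for neurons in (64, 256):
        model = build_fc_model(layers, neurons)
        # One version directory per hyper-parameter setting, TFS-style layout.
        model.save(f"models/fc_l{layers}_n{neurons}/1")
```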
Request Generator. To save developers' preparation time, the module stores many kinds of data selected from widely used datasets such as ImageNet (Deng et al., 2009) and also has an interface for users to upload their own test data.

Workload Generator. Since the requests must be sent following a pattern for benchmarking, we implement this workload generator. We have built many modes into our system to meet diverse testing scenarios (including both online services and offline services). Developers can further customize all of these modes. For instance, we have a pattern that simulates request arrival processes following a Poisson distribution with a specified arrival rate.
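A minimal sketch of this Poisson pattern, with the actual request call left as a placeholder, could look as follows: inter-arrival gaps are drawn from an exponential distribution whose mean is the inverse of the arrival rate.

```python
# Poisson arrival pattern: exponential inter-arrival gaps with mean 1/rate.
import random
import time

def poisson_workload(send_request, arrival_rate: float, duration_s: float):
    """Call send_request() following a Poisson process of `arrival_rate` req/s."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        time.sleep(random.expovariate(arrival_rate))  # exponential gap
        send_request()

# Example: 120 requests/second against a placeholder client for 10 seconds.
poisson_workload(lambda: None, arrival_rate=120.0, duration_s=10.0)
```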
4.2.3 Stage 2 - Serve

In the second stage, the system performs users' benchmark tasks according to the schedule. To expedite the execution, we implement several unified functionalities, including a pre-processor, a post-processor, and a model server.

Pre-processor & Post-processor. We collect and implement many out-of-the-box processing functions for different DL models in our system. For instance, for image classification models, we have image resizing and tensor conversion functions. For text classification models, we provide a set of tokenizer methods. These functions can be called individually or jointly according to model specifications. For instance, when analyzing videos, we need to build a pre-processing pipeline that consists of video decoding and image resizing functions. The post-processor also has a variety of functionalities. For example, the post-processor can match a prediction class ID to a label in a database for classification tasks.

Following our design, developers can extend the system by implementing their own processors. Our system also provides the flexibility for users to offload these functions to the client side or run them directly on the server side.

Model Server. The benchmarking of DL serving software infrastructure has not been thoroughly investigated by current benchmark studies. As such, we provide this module to bridge the gap. Serving software can transform models into services and expedite model development. We adopt two kinds of DL serving infrastructure into our system. The first is DL-specific software, with Tensorflow-Serving (TFS) and Triton Inference Server (TrIS) as two representatives. The second is a general web framework (e.g., FastAPI and Flask) with an optimized runtime (e.g., ONNX Runtime (Microsoft, 2019)). We wrap many of them in our system so that developers can start a service for benchmarking with ease.

4.2.4 Stage 3 - Collect

In this stage, three modules are implemented to ensure both fine-grained performance collection and reproducibility. All of the information will be sent to the performance database.

Prober. The module examines the DL serving pipeline to obtain the performance of each stage. As shown in Figure 4, to evaluate the performance of a serving pipeline with multiple stages, the prober sets endpoints at the boundaries of each stage. Then it triggers the metric collector to obtain the measurement results corresponding to each of them. The results help developers detect the performance bottlenecks in an inference pipeline.
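As an illustration of the endpoint idea (the stage functions below are placeholders, not the system's actual implementation), a per-stage timer can be sketched as:

```python
# Sketch of the prober idea: wrap every pipeline stage with a timer so the
# metric collector can attribute latency to pre-processing, inference, etc.
import time
from contextlib import contextmanager

stage_latencies = {}

@contextmanager
def probe(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latencies.setdefault(stage, []).append(time.perf_counter() - start)

def handle_request(raw_bytes, preprocess, infer, postprocess):
    with probe("pre-processing"):
        batch = preprocess(raw_bytes)
    with probe("inference"):
        pred = infer(batch)
    with probe("post-processing"):
        return postprocess(pred)
```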
Metric Collector. This module contains a series of evaluation metrics as well as their implementations, as described in Section 3. The module is designed to be composable: it can either be fully automated during the benchmarking process or be called selectively to meet users' demands.

Logger. This module records the system runtime information and parameter settings. As discussed in Section 3, DL inference performance is affected by the model, hardware, input size, etc. To ensure the reproducibility of benchmarking results, we use a logger to track all of the above information during evaluations. The information falls into two categories: runtime environment information and evaluation settings. Runtime environment information includes hardware types, serving software names, etc. Evaluation settings include model names, training frameworks, etc.

4.2.5 Stage 4 - Analyze

In the last stage, users can get initial benchmark results from a database and use the analysis models and tools built into our system to gain insights. A leaderboard is provided for users to check results.

PerfDB. We design a performance database to store benchmark results and log information from the collection stage. The database runs a daemon process in the leader server to collect data from the follower workers in a cluster. Like the model repository, it exposes many functions for results management with MongoDB as the backend.

Aggregator. The module is used to aggregate results from PerfDB for a comprehensive analysis. We adopt many analysis models (e.g., Roofline) in our system to draw more insightful conclusions. The models are described in Section 4.3.1.

Leaderboard. In the leaderboard, developers can sort results by different metrics, such as energy consumption and cloud cost, to configure a high-performance and cost-effective DL service. We also provide a visualizer for presenting the results.

4.3 System Methodology

This section introduces the analysis models adopted in our system and a simple scheduling method for benchmark tasks.

4.3.1 Analysis Model

Roofline. The Roofline model (Williams et al., 2009) is widely used to study the performance of applications running on hardware devices. We use Roofline to estimate two core capacity metrics of an accelerator: computation and memory bandwidth. Compared to a simple percent-of-peak estimate, it evaluates the quality of attained performance and explores a performance bound more effectively. As DL models' performance is affected by many hyper-parameters (e.g., neuron numbers), we use generated canonical models to explore the performance as well as the interactions between hardware and models, instead of only simple real-world models.
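Concretely, the Roofline bound is the minimum of the peak compute rate and the product of memory bandwidth and arithmetic intensity. A small sketch using the FP32 V100 figures from Table 1:

```python
# Roofline bound: attainable FLOP/s = min(peak compute, bandwidth * intensity).
# Peak numbers below are the FP32 Tesla V100 figures from Table 1.
PEAK_TFLOPS = 15.7        # TFLOP/s
BANDWIDTH_GBS = 900.0     # GB/s

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Arithmetic intensity in FLOP/Byte; result in TFLOP/s."""
    memory_roof = BANDWIDTH_GBS * arithmetic_intensity / 1e3  # GFLOP/s -> TFLOP/s
    return min(PEAK_TFLOPS, memory_roof)

# Below the ridge point a workload is memory-bound; above it, compute-bound.
ridge_point = PEAK_TFLOPS * 1e3 / BANDWIDTH_GBS  # ~17.4 FLOP/Byte for the V100
for ai in (1.0, 10.0, ridge_point, 100.0):
    print(f"AI={ai:6.1f} FLOP/B -> bound {attainable_tflops(ai):5.1f} TFLOP/s")
```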
CDF Plots. Cumulative distribution function (CDF) plots give the probability that a performance metric (e.g., latency) is less than or equal to a target (e.g., a latency SLO). With such plots, we compare the capability of varied software to process real-world workloads.

Heat Maps. Heat maps are powerful tools for understanding the sensitivity of performance (e.g., utilization) to models' hyper-parameters. We rely on heat maps to measure how system performance varies with the hyper-parameters, helping developers better understand the interactions between models and systems.

Other Plots. We also design and upgrade some basic bar plots to summarize the performance obtained from our system. Compared to a table that lists results, our plots highlight the insights and give intuitive guidelines for service configuration.

4.3.2 Scheduler Design

Our system invokes a scheduling agent in a two-tier manner to perform benchmarking tasks efficiently. In the first tier, a newly submitted job is dispatched by the leader server to a relatively idle follower worker with adequate resources to execute the job. The second tier determines the order of job execution on a follower worker. Specifically, suppose the system has a set of jobs J waiting to be processed in a scheduling interval. For each job j ∈ J, the total time to process the job is t_j = t_wait(j) + t_proc(j). The optimization target is to minimize the total time T, where T = Σ_{j∈J} t_j.

Algorithm 1 Benchmark Job Scheduling
  Input: jobs J; workers W_1, ..., W_k
  for all jobs j ∈ J do
    Select the worker W_min with the shortest queue
    Enqueue j to W_min and remove j from J
  end for
  for all workers W_1, ..., W_k do
    Re-order the queued jobs in ascending order of processing time
  end for
  Execute the jobs

To solve the problem, we implement a global load balancer (LB) to decide the job placement and a scheduler at every follower worker to allocate resources, as shown in Algorithm 1. First, workers publish their current queue length (i.e., the time to process all waiting jobs) to the leader server. Then the LB distributes each job to the worker that minimizes its waiting time. Next, each worker re-orders its jobs in ascending order of processing time. Finally, workers execute the jobs sequentially.
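A minimal sketch of Algorithm 1 in Python, assuming (as in Section 5.5) that each job's processing time is known in advance, is shown below; the job names and durations are illustrative only.

```python
# Sketch of the two-tier scheduler in Algorithm 1: a queue-aware load balancer
# on the leader plus shortest-job-first (SJF) ordering on every follower worker.
def schedule(jobs, num_workers):
    """jobs: list of (job_id, processing_time_s). Returns per-worker run order."""
    queues = [[] for _ in range(num_workers)]
    queue_len = [0.0] * num_workers           # total queued processing time

    # Tier 1: dispatch each job to the worker with the shortest queue.
    for job_id, proc_time in jobs:
        w = min(range(num_workers), key=lambda i: queue_len[i])
        queues[w].append((job_id, proc_time))
        queue_len[w] += proc_time

    # Tier 2: each worker re-orders its queue shortest-job-first.
    for q in queues:
        q.sort(key=lambda job: job[1])
    return queues

def average_jct(queues):
    """Average job completion time = mean of (waiting + processing) per job."""
    total, count = 0.0, 0
    for q in queues:
        elapsed = 0.0
        for _, proc_time in q:
            elapsed += proc_time
            total += elapsed
            count += 1
    return total / max(count, 1)

jobs = [("bert-t4", 300), ("resnet-v100", 120), ("gan-p4", 600), ("tc-2080ti", 60)]
print(average_jct(schedule(jobs, num_workers=2)))
```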
5 EVALUATION RESULTS

In this section, we describe the experimental settings and the evaluations of our system. We also illustrate the importance of benchmark scheduling via a simple case study.

5.1 Experimental Setup

We deploy our system to a server in a private cluster and study five types of hardware, four types of software platforms, and a pipeline with three scenarios. In the process, we have registered many real-world models and generated hundreds of models for the benchmark study.

Hardware Platforms. The specifications of the five hardware platforms are listed in Table 1, including one CPU platform as the reference and four GPU platforms (V100, 2080Ti, T4, and P4). These GPUs have many different architecture designs (e.g., Volta and Turing), thus providing us with a wide range of computational capabilities. We also survey the cloud instances that provide these GPUs.

Software Platforms. As shown in Figure 6, we investigate four serving infrastructures in our experiments. Tensorflow-Serving (TFS) is the default serving platform for the Tensorflow SavedModel format, which is converted from Tensorflow models. Triton Inference Server (TrIS) supports many formats, such as the TensorRT format, which can be converted from many other formats. Both TorchScript (converted from PyTorch) and ONNX (converted from PyTorch) had no stable serving infrastructure during our implementation, so we use their default optimized runtimes, torch.jit and ONNX Runtime, respectively, to load models and wrap them as services with FastAPI (a popular web framework (FastAPI, 2019)) by following the official documentation. To use these platforms, our system first converts newly trained models into optimized and serialized formats. We then use gRPC APIs to test them with Docker containers. All of them have been adopted into our system to free developers from engineering work.

Figure 6. Four serving software infrastructures under our test. Our system accepts models trained with PyTorch and Tensorflow. Then it converts the models into many serialized and optimized serving formats such as ONNX and TensorRT. These serving models will be bound with a serving infrastructure like Tensorflow-Serving for benchmarking.

Pipelines. We build a pipeline (shown in Figure 4) that consists of a service broker that accepts users' requests and dispatches them to a backend with a running DL service, as our testbed in the system. We test the whole pipeline in three network scenarios, including LAN, 4G LTE, and campus WiFi.

5.2 Hardware Platform Characterization

We present hardware benchmarking results with both real-world and generated models in this section. More results can be found in the leaderboard at our website (we omit it here due to the anonymous review). Unless otherwise stated, all experiments are conducted with the TensorFlow SavedModel format and TFS 2.3.0.

Figure 7. The latency and throughput comparisons of different hardware: (a) Bert-Large; (b) ResNet50; (c) GPU (V100) over CPU throughput speedup. We showcase two representative models with varied batch sizes. We also show the speedup of GPU over CPU under the latency SLO.

Latency & Throughput. We first plot how latency changes with different batch sizes and types of hardware for two inference models (Bert-Large and ResNet50) in Figures 7a and 7b, respectively. The batch size for the CPU is fixed at one. The plots show that, for latency, GPU platforms perform better than the CPU for small batch sizes (less than eight). When the batch size becomes large, the latency becomes much longer for two types of GPU. A larger batch size often provides higher throughput. We caution developers to check their latency SLO first and use our configuration recommendation to select the hardware and batch size before deploying the service.
In Figure 7c, we plot the speedup ratio for different inference models on one type of GPU (V100). The inference models (see the appendix for more details) include OD (object detection), GAN (CycleGAN (Zhu et al., 2017)), TC (text classification) and IC (image classification). The evaluation shows a wide range of speedup ratios, from 3.6x to 47.4x. We use the model latency on the CPU as each service's SLO and input them to our system. The system can recommend the best batch size and the speedup ratio of each service as a reference.

Figure 8. The three cost comparisons across different GPUs: (a) ResNet50 energy and CO2 emission; (b) ResNet50 cloud cost.

Cost. We examine three kinds of costs, including energy consumption, CO2 emission, and cloud cost. Figure 8a shows the energy consumption and CO2 emission per request of the ResNet50 model in a batch-processing manner. In this scenario, more powerful GPUs consume more energy and emit more CO2. We note that most energy is consumed at batch size one. This can be attributed to the overhead associated with context start, which can be amortized with larger batch sizes.

For cloud cost, we use the hourly rates of different GPU instances from Google Cloud Platform and AWS. Our focus is the benchmark comparison provided by our system rather than ranking providers, so we use [C1, C2] and [I1, I2, I3] as labels for providers and instances, respectively. From the results plotted in Figure 8b, we observe that 1) for the same device (V100), different providers have different hourly rates; 2) GPU costs vary with computational capability, and although the T4 (I3) GPU is more powerful than the P4 (I2) GPU, it has a lower price; and 3) as the batch size increases, more images can be processed hourly, so the cost per request decreases. With this capability provided by our system, users can choose the best cloud providers and instances for their services.
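The underlying arithmetic is simple: the cost per request is the hourly instance rate divided by the number of requests served per hour, so higher throughput at larger batch sizes directly lowers the per-request cost. A small sketch (the hourly rates and throughputs below are placeholders, not actual provider prices or measurements):

```python
# Cost per request = hourly instance rate / requests served per hour.
def cost_per_request(hourly_rate_usd: float, throughput_rps: float) -> float:
    requests_per_hour = throughput_rps * 3600.0
    return hourly_rate_usd / requests_per_hour

# Larger batch sizes raise throughput and therefore lower the per-request cost.
for batch, rps in [(1, 180.0), (8, 650.0), (32, 1100.0)]:
    cents = cost_per_request(0.90, rps) * 100  # placeholder rate of $0.90/hour
    print(f"batch={batch:2d}  cost per request = {cents:.2e} cents")
```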
Figure 9. The GPU utilization with varied hyper-parameters of different models: (a) CNN models; (b) Transformer models.

Performance Sensitivity. We evaluate performance sensitivity to hyper-parameters such as the batch size and the layer number with generated models on a V100 GPU. Each time we select two parameters and keep the others fixed. Figure 9 shows two examples. For a CNN-type model, GPU utilization increases with both the batch size and the model depth, indicating that a GPU exploits parallelism within the batch size and depth of this kind of model. For a transformer model, the model's depth has more impact, indicating that a transformer model with more layers will utilize GPU resources more. Since more hardware accelerators (e.g., TPUs) are emerging, we provide a powerful tool for hardware engineers to explore the hyper-parameter influences.

Figure 10. The Roofline model analysis with both real-world models and generated models: (a) real-world models; (b) generated models.

Computation and Memory Bandwidth. The performance of a GPU is decided by both its computation capacity and its memory bandwidth. We apply the Roofline model, which represents the relationship between a model's operational intensity and its operations per second, to give a quantitative comparison. The ceiling line (in red) shows the theoretical bandwidth and computation capability of a GPU (V100). For real-world CNN models (shown in Figure 10a), we observe that the two lightweight models (MobileNet (Howard et al., 2017)) are more memory-bound and the other, heavier models are more compute-bound. This is in line with real-world practice, where the performance improvement of MobileNet does not align with its small number of parameters; it cannot fully utilize the computation of GPUs. With our system, data scientists can easily understand their models for optimization. Figure 10b presents the Roofline analysis of generated models. The operations per second increase as the arithmetic intensity increases. Larger batch sizes make MLP models more compute-bound, whereas more layers and more neurons make them more memory-bound. This insight cannot be obtained from isolated real-world models. Hardware engineers can apply our system to analyze the inference performance of both newly proposed model structures and accelerators.
Figure 11. The tail latency of four systems under varied workloads: (a) batch size; (b) arrival rate (KDE); (c) arrival rate (CDF); (d) four software platforms. The batch size, arrival rate, and serving software influence tail latency a lot.

5.3 Software Platform Characterization

We prepare several benchmark submissions to evaluate software platforms. Each serves a model for a duration of up to 5 minutes. The workload generator is used to simulate a Poisson distribution with varied arrival rates to understand software performance. We also explore advanced features like the dynamic batching specific to DL serving.

Tail Latency. We compare the tail latency across varied software and arrival rates. We use TFS with ResNet50 as a case study. As shown in Figure 11a, a larger batch size accounts for a longer tail latency, though it can save service providers' cost. Both Figure 11b and Figure 11c indicate that TFS cannot adequately handle spike loads.
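The P95 and P99 values reported in Figure 11 are simply high percentiles of the measured per-request latency distribution; a minimal sketch of the computation, with synthetic latencies standing in for real measurements, is:

```python
# Tail latency = high percentiles of the measured per-request latencies.
import numpy as np

def tail_latency(latencies_ms):
    lat = np.asarray(latencies_ms, dtype=float)
    return {p: float(np.percentile(lat, p)) for p in (50, 95, 99)}

# Synthetic, long-tailed example; real values come from the load test.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=12.0, size=10_000)
print(tail_latency(samples))  # e.g. {50: ..., 95: ..., 99: ...}
```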
We further compare the four platforms for the same image classification service (based on ResNet50) on a V100 GPU, as shown in Figure 11d. TrIS from NVIDIA performs best, with no surprise, since it includes many GPU optimization techniques. ONNX Runtime with a simple web framework performs better than TFS, a dedicated serving system for DL inference. In conclusion, when configuring a latency-sensitive service, developers need to take the impact of the batch size and the serving software (on latency) into consideration. Both academia and industry have invested efforts in serving software development, and our system can make things easier.

Advanced Feature: Dynamic Batching. Two representative systems, TFS and TrIS, provide dynamic batching settings. To use the feature, developers need to set a number of hyper-parameters, such as the maximum batch size and the maximum queue time. We use the two software platforms to serve a ResNet50 model and send requests concurrently. In this case, TrIS can utilize the feature and improve the throughput steadily, while TFS performs even worse than without dynamic batching at a small concurrency. This reminds us that before starting to use the feature, engineers should understand their scenarios first and tune the two parameters accordingly.

Figure 12. The throughput comparison of two serving software platforms with the dynamic batching feature: (a) ResNet50 served by TFS; (b) ResNet50 served by TrIS.
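To make the two knobs concrete, the sketch below shows a toy batcher driven by a maximum batch size and a maximum queue delay. It is schematic only and is not TFS's or TrIS's actual implementation; the serving runtimes implement this logic internally.

```python
# Toy dynamic batcher: collect requests until either the batch is full or the
# maximum queue delay has elapsed, then run one forward pass over the batch.
import queue
import time

def batching_loop(request_queue, run_batch, max_batch_size=8, max_queue_delay_s=0.005):
    while True:
        first = request_queue.get()            # block until one request arrives
        batch = [first]
        deadline = time.time() + max_queue_delay_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)                       # single batched inference call
```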
Resource Usage. Understanding the resource utilization pattern often leads to better resource allocation. We test two services (BERT with an arrival rate of 30 requests/second and batch size 1, and ResNet50 with 160 requests/second and batch size 1). The results are shown in Figure 13. We observe that GPU utilization is dynamic under varied workloads and tends toward under-utilization at a low arrival rate (even when a heavy model like BERT is loaded). This gives developers a large enough room for optimization.

Figure 13. The GPU utilization with different serving software under varied workloads: (a) BERT; (b) ResNet50.

5.4 Inference Pipeline Decomposition

We simulate a real-world DL service by building a simple pipeline and sending requests using three networking transmission technologies. We use ResNet50 and TFS as the inference model and the serving system, respectively.

The results in Figure 14a show that the transmission time is comparable to the inference time for small batch sizes. As the batch size increases, the inference time accounts for a much larger portion of the total time. The results (Figure 14b) also show that, for the same service, 4G LTE has the longest end-to-end latency. This indicates that for a mobile application, sending requests to a cloud DL service can incur high latency.
imum queue time. We use the two software to serve a high latency.
Benchmarking Deep Learning Inference Serving

140 Transmission (LAN)


107
106
LAN 6 R ELATED W ORK
120 Pre-processing WLAN (Campus Wifi)
105 WAN (4G LTE)
Latency (ms)

Latency (ms)
100 Inference

80
Post-processing 104 Inference Time The system’s design follows the best practice of previous
60
103
benchmarking works and we provide a highly efficient clus-
102
40
101 ter system. We categorize previous studies into two classes:
20
0
100
0
micro-benchmark and macro-benchmark. Micro-benchmark
1 2 4 8 16 1 2 4
Batch Size (client side) Batch Size (client side) focuses more on low-level operators. DeepBench (Baidu,
(a) Batch size 2017) is designed to study the kernel-level operations such
(b) Networking condition
30
as GEMM across multiple hardware devices, including both
25
TensorFlow Serving (TFS) cloud servers and edge nodes. AI matrix (Alibaba, 2018)
Triton Inference Server
gets inspiration from DeepBench and further extends it to
Latency (sec)

20

15 benchmark both layers (e.g., RNN) and models (e.g., Faster-


10 RCNN). These studies focus on small sets of impact fac-
5 tors and can not simulate the scenarios in practice. Macro-
0
IC1 IC2 IC3 TC1 TC2 TC3 TC4 OD1 OD2 GAN1
benchmark such as Fathom (Adolf et al., 2016) and AI
Model Structure Benchmark (Ignatov et al., 2019) collect a set of models and
(c) The cold start latency of different models with two software study their performance on the mobile side. DawnBench
(Coleman et al., 2017) and MLPerf Inference (Reddi et al.,
Figure 14. The pipeline decomposition results. 2020) support more scenarios and provide competition for
both industry and academia to measure their systems. In
comparison, our work provides full support for developers
We also compare the cold start time of different models to speed up their benchmark process and configure new ser-
with two software. TrIS has a longer starting time than vices with ease. Our system also explores a wider space to
that of TFS. Even for a small image classification model, it evaluate DL systems for a comprehensive study. ParaDNN
needs more than 10 seconds to prepare. In practice, the long (Wang et al., 2020) also uses analysis models like Roofline
starting time can pose challenges for resource provisioning. and Heat maps to study DL platforms, but it focuses on
training workload and lack of configurability. In general,
5.5 Case Study: Task Scheduling we implement the system to complement existing inference
To the best of our knowledge, current benchmarking tasks benchmark suits and obtain more comprehensive analysis
are often carried out in an error-to-prone fashion. Since tasks results.
need to be executed in an idle server, the system status needs
to be checked. If this check is neglected, all tasks located 7 S UMMARY AND F UTURE W ORK
in a worker will crash. As there are no specific scheduling
methods to address these issues before this work, we im- With the rapid development of deep learning (DL) mod-
plement two methods as the baselines, a round-robin (RR) els and the related hardware and software, benchmarking
load balancer (LB) with First-Come-First-Serve (FCFS) and for service configuration and system upgrade becomes de-
an LB with Short-Job-First (SJF). The results 15 show that velopers’ day-to-day tasks. Previous work focuses on the
our scheduler, queue aware (QA) LB with SJF, can reduce isolated DL model benchmark study and leaves tedious and
the average job-completion-time JCT by 1.43 (equivalent error-prone tasks to developers. In this work, we design and
of 30% reduction). Now we assume that the processing implement an automatic benchmarking system to address
time of every benchmark task is determined before they are these issues. Our system streamlines the task executions
executed. We leave the study of a scheduler for jobs with and explores a wide range of design space. It provides many
stochastic processing time for future work. analysis models to gain insights for resource allocation and
service operations. A scheduler is implemented to improve
1.0
efficiency. We conduct many experiments to demonstrate
1.6
1.43x
1.4
1.22x 0.8
the functions of our system and provide many practical
Speedup Ratio

1.2
guidelines. We plan to continue upgrading the system and
Probability

1.00x 0.6
1.0
0.8
0.4
design more APIs to give users more flexibility to adopt our
0.6 RR + FCFS
0.4 0.2 RR + SJF system into their own DL services
QA + SJF
0.2
0.0
0.0 0 500 1000 1500 2000 2500 3000 3500 4000
RR + FCFS RR + SJF QA + SJF Job Completion Time (s)

(a) Speedup ratio (b) CDF

Figure 15. The performance comparison of three schedulers.


REFERENCES

Adolf, R., Rama, S., Reagen, B., Wei, G.-Y., and Brooks, D. Fathom: Reference workloads for modern deep learning methods. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10. IEEE, 2016.

Alibaba. AI Matrix. https://github.com/alibaba/ai-matrix, 2018. Accessed: 2020-09-02.

Anthony, L. F. W., Kanding, B., and Selvan, R. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051, 2020.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Baidu. Benchmarking deep learning operations on different hardware. https://github.com/baidu-research/DeepBench, 2017. Accessed: 2020-09-02.

Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395, 2017.

Chard, R., Li, Z., Chard, K., Ward, L., Babuji, Y., Woodard, A., Tuecke, S., Blaiszik, B., Franklin, M., and Foster, I. DLHub: Model and data serving for science. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 283–292. IEEE, 2019.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018.

Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. DAWNBench: An end-to-end deep learning benchmark and competition. Training, 100(101):102, 2017.

Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 613–627, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

FastAPI. FastAPI framework, high performance, easy to learn, fast to code, ready for production. https://fastapi.tiangolo.com/, 2019. Accessed: 2020-06-03.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.

Google. cAdvisor: Analyzes resource usage and performance characteristics of running containers. https://github.com/google/cadvisor, 2018. Accessed: 2020-06-03.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., et al. Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620–629. IEEE, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Wu, M., Xu, L., and Van Gool, L. AI Benchmark: All about deep learning on smartphones in 2019. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3617–3635. IEEE, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.

Karlaš, B., Interlandi, M., Renggli, C., Wu, W., Zhang, C., Mukunthu Iyappan Babu, D., Edwards, J., Lauren, C., Xu, A., and Weimer, M. Building continuous integration services for machine learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2415, 2020.

Microsoft. ONNX Runtime: cross-platform, high performance scoring engine for ML models. https://github.com/microsoft/onnxruntime, 2019. Accessed: 2020-05-07.

NVIDIA. NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring Tesla GPUs in cluster environments. https://developer.nvidia.com/dcgm, 2018. Accessed: 2020-10-02.

NVIDIA. Multi-Process Service. https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, 2020a. Accessed: 2020-06-28.

NVIDIA. The Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. https://github.com/NVIDIA/triton-inference-server, 2020b. Accessed: 2020-05-07.

Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, F., Rajashekhar, V., Ramesh, S., and Soyke, J. TensorFlow-Serving: Flexible, high-performance ML serving. arXiv preprint arXiv:1712.06139, 2017.

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., et al. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459. IEEE, 2020.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, Y., Wei, G.-Y., and Brooks, D. A systematic methodology for analysis of deep learning hardware and software platforms. Proceedings of Machine Learning and Systems, 2020:30–43, 2020.

Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

Yu, P. and Chowdhury, M. Salus: Fine-grained GPU sharing primitives for deep learning applications. arXiv preprint arXiv:1902.04610, 2019.

Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S. A., Konwinski, A., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., et al. Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull., 41(4):39–45, 2018.

Zhang, H., Li, Y., Huang, Y., Wen, Y., Yin, J., and Guan, K. MLModelCI: An automatic cloud platform for efficient MLaaS. arXiv preprint arXiv:2006.05096, 2020.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
