InferBench
Huaizheng Zhang, Yizheng Huang, Yonggang Wen, Jianxiong Yin, Kyle Guan
ABSTRACT
Deep learning (DL) models have become core modules for many applications. However, deploying these models without careful performance benchmarking that considers the impact of both hardware and software often leads to poor service quality and costly operational expenditure. To facilitate the deployment of DL models, we implement an automatic and comprehensive benchmark system for DL developers. To accomplish benchmark-related tasks, developers only need to prepare a configuration file consisting of a few lines of code. Our system, deployed to a leader server in DL clusters, dispatches users' benchmark jobs to follower workers. Next, the corresponding requests, workloads, and even models are generated automatically by the system to conduct DL serving benchmarks. Finally, developers can leverage many analysis tools and models in our system to gain insights into the trade-offs of different system configurations. In addition, a two-tier scheduler is incorporated to avoid unnecessary interference and improve the average job completion time by up to 1.43x (equivalent to a 30% reduction). Our system design follows the best practices of DL cluster operations to expedite developers' day-to-day DL service evaluation efforts. We conduct extensive benchmark experiments to provide in-depth and comprehensive evaluations. We believe these results are of great value as guidelines for DL service configuration and resource allocation.
Figure 1. The overview of the proposed benchmarking system. The system first accepts users' benchmarking tasks. Then it distributes the tasks to dedicated servers to complete them automatically. Finally, it sends a detailed report and guidelines back to users.

Figure 2. Three benchmark tiers for evaluating DL inference. The performance (e.g., latency and cost) of an AI application is influenced by many factors, which can be categorized into three classes: hardware, software, and pipeline.
The goal is to provide an end-to-end solution that frees DL developers from tedious and potentially error-prone benchmarking tasks (e.g., boilerplate code writing, data collection and workload generation). In addition to the models chosen by users, the system can effortlessly generate and iterate models with different hyper-parameters (e.g., different layer types and different numbers of layers) to adequately explore the design space. The system also provides an extensive set of analysis tools and models to help developers choose the best configuration for their applications under constraints such as latency and cloud cost. In addition, we implement a two-tier scheduler in the system to improve service efficiency.

For the logical clarity and comprehensiveness of benchmarking evaluations, we designate impact factors for DL inference into three tiers, as shown in Figure 2. For the hardware tier, we select five representative platforms for evaluation. For the software tier, we choose four representative online serving infrastructures. For the pipeline tier, we simulate the real-world workload and examine a specific inference service with three types of transmission technologies. We use metrics such as tail latency, cloud cost, and resource usage to measure their performance, and analysis models such as Roofline (Williams et al., 2009) and heat maps to profile DL serving system properties.

In summary, the main contributions of this paper are as follows:

• We build an automatic and distributed benchmark system in DL clusters to streamline DL inference serving benchmarks for developers.

• We conduct a comprehensive performance analysis on three tiers with many DL applications under various workloads, providing users with insights to trade off latency, cost, and throughput as well as to configure services.

• We use generated models to study the sensitivity of hardware performance to model hyper-parameters and derive valuable insights for system design.

• We implement a scheduler to ensure safe benchmark progress and reduce the average benchmark job completion time.

In the remainder of this paper, we first introduce the background knowledge of DL inference serving and our motivation in Section 2. Then, we detail the three benchmark tiers as well as their benchmarking metrics in Section 3. Next, we present the system implementation and the employed methodologies in Section 4. We employ our system to perform benchmark tasks and evaluate its performance in Section 5. Finally, we briefly introduce the related work in Section 6 and summarize our paper in Section 7.

2 BACKGROUND AND MOTIVATION

In this section, we introduce the typical workflow of building DL services, including pre-deployment, post-deployment, and online serving. We illustrate how current benchmark studies cannot fully address the challenges of these new workloads, thus motivating us to build a new benchmark system.

2.1 Deep Learning Service Pre-Deployment

After receiving models from data scientists, engineers need to perform many optimization and service configuration tasks before models go to production, as shown in Figure 3. First, models, along with their artifacts such as weight files and corresponding processors (e.g., image resizing), will be stored and versioned (Chard et al., 2019). Developers may try to re-implement processors with a production language such as Java and test them. Second, engineers will optimize models (e.g., INT8 conversion or model compression (Han et al., 2015)) while maintaining good accuracy. As such, engineers need to check every new model to ensure that it meets both accuracy and latency requirements. Third, users will choose a serving system such as
TensorFlow-Serving (TFS) (Baylor et al., 2017) to bind their models as a service. Different serving systems perform differently over a wide range of hardware choices. Moreover, these systems have many new features such as dynamic batching (Crankshaw et al., 2017), resulting in complex configurations with varied performance. Developers need to perform rigorous evaluations for a service. Finally, a service can be dispatched to a cluster to serve customers.

Figure 3. The pipeline of AI service deployment. To build a high-performance and low-cost AI service, users need to spend a lot of time on optimization and benchmarking.

Figure 4. A simplified view of a DL inference pipeline. Users first send requests through a mobile or web app. A service broker then distributes requests to backend servers for processing. Usually, during the processing, a request will go through a pre-processor, a batch manager, a model server, and a post-processor.

Observation 1: A benchmark task for building a service often takes tens of iterations. Also, developers need to consider the trade-off among many impact factors (IFs) under SLOs (Service-Level Objectives) or a budget to make a judicious choice. To efficiently complete the task, developers need a simple, easy-to-use, and customizable system.

Observation 2: The benchmark tasks need a dedicated and isolated runtime environment. As AutoML techniques are becoming popular, these tasks will consume more resources, resulting in long delays in getting benchmark results. This motivates us to propose a way to perform a benchmark with efficient use of resources.

Observation 3: ... improve resource utilization.

Observation 4: Current studies of simple, isolated models cannot be easily generalized to analyze system bottlenecks (e.g., compute- or memory-bound). Also, benchmarking only these models does not help to understand the influence of hyper-parameters on both DL hardware and model performance. More effective analysis methodologies should be incorporated.

3 BENCHMARK TIERS

In this section, we describe three benchmark tiers: hardware, software, and pipeline, as shown in Figure 2.
3.1 Tier 1 - Hardware

For the first tier, our system mainly focuses on inference hardware performance as well as the interactions between hardware and models under varied configurations and scenarios. In addition to the latency and throughput measurement capabilities provided in previous research, we extensively study the cost, the sensitivity of hardware performance to model hyper-parameters, and the bottlenecks with different types of models. We next detail the metrics.

Latency & Throughput. These two metrics are widely used to measure hardware. In general, online real-time services require low latency, while offline processing systems prefer high throughput. Unlike CPUs, new accelerators such as TPUs and GPUs encourage batch processing to improve resource utilization. We extensively study this property with our system.

Cost. When developers implement a DL service, they must consider their budget. To support this, our system provides tools to measure energy, CO2 emission (Anthony et al., 2020), and cloud cost under different devices and cloud providers, with an aim to present a comprehensive study.
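To make these measurements concrete, the following sketch shows how per-request cost metrics of this kind can be derived from measured throughput and average power draw; the instance price and grid carbon intensity below are illustrative assumptions, not values reported by our system.

    # Sketch: derive per-request cost metrics from measured throughput and power draw.
    # The price and carbon-intensity constants are illustrative assumptions.

    def per_request_metrics(throughput_rps, avg_power_watts,
                            hourly_price_usd=3.06,        # assumed on-demand GPU instance price
                            grid_co2_g_per_kwh=475.0):    # assumed grid carbon intensity
        """Return (energy in J, CO2 in mg, cloud cost in cents) per request."""
        energy_j = avg_power_watts / throughput_rps          # W = J/s, divided by req/s
        energy_kwh = energy_j / 3.6e6                        # 1 kWh = 3.6e6 J
        co2_mg = energy_kwh * grid_co2_g_per_kwh * 1000.0    # grams -> milligrams
        cost_cents = hourly_price_usd * 100.0 / 3600.0 / throughput_rps
        return energy_j, co2_mg, cost_cents

    # Example: a service sustaining 400 req/s at an average draw of 250 W.
    print(per_request_metrics(400.0, 250.0))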
Sensitivity of Model Hyper-parameters. DL models have many hyper-parameters, such as layer types (e.g., LSTM (Hochreiter & Schmidhuber, 1997)) and the number of layers. Our system can generate models to study their influence and provide insights.
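As an illustration of such model generation, the sketch below uses PyTorch's TransformerEncoderLayer to produce a family of canonical Transformer models over a grid of depths and to time their forward passes; the grid values and the timing loop are illustrative, not the exact generator in our system.

    import time
    import torch
    import torch.nn as nn

    def make_transformer(num_layers, d_model=512, nhead=8):
        # One canonical model per point in the hyper-parameter sweep.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        return nn.TransformerEncoder(layer, num_layers=num_layers)

    def time_forward(model, batch_size=8, seq_len=64, d_model=512, repeats=10):
        x = torch.randn(seq_len, batch_size, d_model)
        model.eval()
        with torch.no_grad():
            model(x)                                   # warm-up pass
            start = time.perf_counter()
            for _ in range(repeats):
                model(x)
        return (time.perf_counter() - start) / repeats

    for depth in (2, 4, 8, 12):                        # illustrative depth grid
        latency = time_forward(make_transformer(depth))
        print(f"layers={depth}  latency={latency * 1e3:.1f} ms")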
Memory & Computation. The performance of a model Design Philosophy. We aim to design an automatic and
on a device is decided by both computation and memory distributed benchmark system which can be 1) operated
capacity from the device. We explore both of them with independently to perform benchmark tasks; 2) incorporated
real-world and generated models under a wide range of into DL lifecycle management (Zaharia et al., 2018) or
hyper-parameters. DL continues integration (Zhang et al., 2020; Karlaš et al.,
2020) systems to further improve the automation of DL
3.2 Tier 2 - Software service development and evaluation; 3) and connected to a
In the second tier, we study the impacts of software such monitor system for DL service diagnose. Accordingly, our
as formats and serving platform. Here, the system provides system can serve users involved at various stages of a DL
complete support for serving platforms such as TFS (Olston service pipeline: data scientists, DL deployment engineers,
et al., 2017) and Trion Inference Serving (NVIDIA, 2020b), hardware engineers, and AIOps engineers, by significantly
which have not been fully investigated in the previous study. reducing the often manual and error-prone tasks.
Users can easily extend our systems to support more plat-
forms with provided APIs. We will study the following 4.1 System Overview
features. As shown in Figure 1, our centralized benchmark system
Tail Latency. A well-designed serving platform can effec- uses a Leader/Follower architecture. The leader server man-
tively mitigate the effect of tail latency which is critical for ages the whole system by accepting users’ benchmark sub-
online services. We provide a detailed study of the perfor- missions and dispatching benchmark tasks to specific fol-
mance under varied request arrival rates. lower workers, guided by a task scheduler. The leader server
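To make the decomposition concrete, the following sketch shows one way per-stage latency can be recorded around a five-stage pipeline; the stage callables and their names are placeholders, not the interfaces of our system.

    import time
    from contextlib import contextmanager

    timings = {}

    @contextmanager
    def stage(name):
        # Accumulate wall-clock time spent in each pipeline stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

    def handle_request(raw_input, preprocess, transmit, batch, infer, postprocess):
        # Hypothetical five-stage pipeline; each argument is a callable supplied by the service.
        with stage("pre-processing"):
            x = preprocess(raw_input)
        with stage("transmission"):
            x = transmit(x)
        with stage("batching"):
            x = batch(x)
        with stage("inference"):
            y = infer(x)
        with stage("post-processing"):
            return postprocess(y)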
Summary. The huge configuration space leads to many trade-offs, including but not limited to Latency versus Throughput, Accuracy versus Latency, Cost versus Quality, and Sharing versus Dedicated. With the help of our system, DL developers can inspect these trade-offs for better configuration and future upgrading.

4 SYSTEM DESIGN AND IMPLEMENTATION

Design Philosophy. We aim to design an automatic and distributed benchmark system that can be 1) operated independently to perform benchmark tasks; 2) incorporated into DL lifecycle management (Zaharia et al., 2018) or DL continuous integration (Zhang et al., 2020; Karlaš et al., 2020) systems to further improve the automation of DL service development and evaluation; and 3) connected to a monitoring system for DL service diagnosis. Accordingly, our system can serve users involved at various stages of a DL service pipeline — data scientists, DL deployment engineers, hardware engineers, and AIOps engineers — by significantly reducing the often manual and error-prone tasks.
4.1 System Overview

As shown in Figure 1, our centralized benchmark system uses a Leader/Follower architecture. The leader server manages the whole system by accepting users' benchmark submissions and dispatching benchmark tasks to specific follower workers, guided by a task scheduler. The leader server also generates the corresponding requests and workloads for later evaluation. Meanwhile, users can choose either to register their own models to our system or to use the different iterations of canonical models generated by our system for benchmarking. Next, the follower workers, which can be any of the idle servers in a cluster, execute users' benchmark tasks according to the submissions' specifications. Specifically, a worker loads and serves a model with an infrastructure such as TensorFlow-Serving (TFS) and then invokes the corresponding pre-processing and post-processing functions to complete a service pipeline. Finally, users can start clients to simulate real-world workloads by sending requests to the DL service to evaluate its performance.

For evaluation, a metric collector is utilized as a daemon process to obtain detailed performance information via a probe module. At the same time, a logging module is enabled to record the running system's status, including both hardware and software. All of this information will be sent to PerfDB (a performance database). After evaluation, users can aggregate the performance data and use our built-in analysis models to extract insights. We also implement a recommender and a leaderboard for service configuration and resource allocation.

Figure 5. The benchmarking system architecture. A management plane running in the leader server is designed to control and monitor the system and users' benchmarking tasks. We divide an inference benchmark into four stages (generate, serve, collect, and analyze) and provide a set of functions to complete it automatically.

4.2 System Implementation Detail

Our implementation of the system (shown in Figure 5) follows three best practices:

Modularity. Modularity can provide 1) a seamless extension to support evolving DL applications, 2) a natural adaptation to existing model management and serving systems, and 3) easy customization by developers.

Scalability. As cluster-based model training and deployment have become the de facto practice for both industry and academia, our system supports all the computation and management functionalities required by cluster computing, with the aim of keeping users' manual efforts (e.g., manually moving model and result files) to a minimum.

System Integrity and Security. Since starting a benchmarking task without proper coordination with the leader can potentially interrupt a running service, the system shall have the capability to monitor the environment status so as to decide whether a task can be executed or should be scheduled for later.

4.2.1 Management Plane

The management plane, designed to control all benchmark tasks in users' clusters, consists of four functional blocks: task manager, monitor, scheduler, and other utility functions. The goal is to provide an interface for users' fine-grained control inputs.

Task Manager. The task manager accepts users' benchmark submissions and logs related information, such as the user name, task ID, and submission timestamp. The task manager then dispatches the task to a specific worker according to the task specification and starts the related procedures. In our implementation, we use MongoDB as the backend. Developers can easily replace the backend with their preferred database, such as MySQL.
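As an illustration of the task manager's bookkeeping, the sketch below registers a benchmark submission in a MongoDB backend with pymongo; the connection URI, database and collection names, and document fields are illustrative assumptions rather than our exact schema.

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://leader-server:27017")   # assumed leader-side MongoDB
    tasks = client["inferbench"]["benchmark_tasks"]          # assumed database/collection names

    submission = {
        "user": "alice",                     # illustrative fields only
        "model": "resnet50",
        "serving_format": "tensorrt_fp16",
        "serving_infra": "triton",
        "hardware": "v100",
        "batch_sizes": [1, 2, 4, 8, 16, 32],
        "arrival_rate_rps": 120,
        "slo_ms": 100,
        "submitted_at": datetime.now(timezone.utc),
        "status": "queued",
    }

    task_id = tasks.insert_one(submission).inserted_id       # task manager records the job
    print("registered benchmark task", task_id)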
Monitor. This functional block collects and aggregates system resource usage. Specifically, we use two backends, cAdvisor (Google, 2018) and Node Exporter (NVIDIA, 2018), for logging the status of the serving container (e.g., CPU usage) and the hardware status (e.g., GPU usage), respectively.

Scheduler. We aim to design a multi-tenant system to organize a team's daily benchmark tasks and avoid potential conflict and interference. As such, we implement a scheduler for these benchmark jobs to minimize the average job completion time (JCT) and improve efficiency. Specifically, we design a simple baseline method (discussed in Section 4.3.2), upon which developers can extend to design their own algorithms per the workload profiles.

Utility Functions. Currently, we have two functions, a ...
... in Section 3. This module is designed to be composable: it can either be fully automated during the benchmarking process or be called selectively to meet users' demands.

Logger. This module records the system runtime information and parameter settings. As discussed in Section 3, DL inference performance is affected by the model, hardware, input size, etc. To ensure the reproducibility of the benchmarking results, we use a logger to track all of the above information during evaluations. The information falls into two categories: runtime environment information and evaluation settings. Runtime environment information includes hardware types, serving software names, etc. Evaluation settings include model names, training frameworks, etc.

4.2.5 Stage 4 - Analyze

In the last stage, users can get initial benchmark results from a database and use the analysis models and tools built into our system to gain insights. A leaderboard is provided for users to check results.

PerfDB. We design a performance database to store benchmark results and log information from the collection stage. The database runs as a daemon process in the leader server to collect data from the follower workers in a cluster. As with the model repository, it exposes many functions for results management, with MongoDB as the backend.

Aggregator. This module is used to aggregate results from PerfDB for a comprehensive analysis. We adopt many analysis models (e.g., Roofline) in our system to draw more insightful conclusions. The models are described in Section 4.3.1.

Leaderboard. In the leaderboard, developers can sort results by different metrics, such as energy consumption and cloud cost, to configure a high-performance and cost-effective DL service. We also provide a visualizer for presenting the results.

4.3 System Methodology

This section introduces the analysis models adopted in our system and a simple scheduling method for benchmark tasks.

4.3.1 Analysis Models

Roofline. The Roofline model (Williams et al., 2009) is widely used to study the performance of applications running on hardware devices. We use Roofline to estimate two core capacity metrics of an accelerator: computation and memory bandwidth. Compared to a simple percent-of-peak estimate, it evaluates the quality of attained performance and explores a performance bound more effectively. As DL models' performance is affected by many hyper-parameters (e.g., neuron numbers), we use generated canonical models to explore the performance as well as the interactions between hardware and models, instead of simple real-world models.
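A minimal sketch of the Roofline calculation: attainable performance is the minimum of the device's peak compute and the product of arithmetic intensity and peak memory bandwidth; the peak numbers below are illustrative placeholders, not measured values.

    def roofline(arith_intensity_flops_per_byte,
                 peak_flops=125e12,           # assumed accelerator peak compute (FLOP/s)
                 peak_bandwidth=900e9):       # assumed peak memory bandwidth (bytes/s)
        """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
        return min(peak_flops, arith_intensity_flops_per_byte * peak_bandwidth)

    # A workload is memory-bound if its intensity is below the "ridge point".
    ridge_point = 125e12 / 900e9                  # ~139 FLOPs/byte for these assumed peaks
    for ai in (10, 50, 139, 300):
        bound = "memory-bound" if ai < ridge_point else "compute-bound"
        print(f"AI={ai:>4} FLOPs/byte -> {roofline(ai) / 1e12:.1f} TFLOP/s ({bound})")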
CDF Plots. Cumulative distribution function (CDF) plots give the probability that a performance metric (e.g., latency) is less than or equal to its target (e.g., a latency SLO). With these plots, we compare the capability of varied software to process real-world workloads.
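A small sketch of how such a CDF can be computed from collected per-request latencies and used to read off tail percentiles and the probability of meeting a latency SLO; the latency sample and the SLO value are synthetic examples.

    import numpy as np

    def latency_cdf(latencies_ms):
        # Empirical CDF: sorted latencies vs. cumulative probability.
        xs = np.sort(np.asarray(latencies_ms))
        ps = np.arange(1, len(xs) + 1) / len(xs)
        return xs, ps

    latencies = np.random.gamma(shape=4.0, scale=10.0, size=10_000)   # synthetic sample
    xs, ps = latency_cdf(latencies)

    slo_ms = 100.0                                                    # example SLO
    attainment = np.mean(latencies <= slo_ms)                         # P(latency <= SLO)
    p95, p99 = np.percentile(latencies, [95, 99])
    print(f"P95={p95:.1f} ms  P99={p99:.1f} ms  SLO attainment={attainment:.3f}")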
Heat Maps. Heat maps are powerful tools to understand the sensitivity of performance (e.g., utilization) to models' hyper-parameters. We rely on heat maps to measure how system performance varies with the hyper-parameters, to help developers better understand the interactions between models and systems.

Other Plots. We also design and upgrade some basic bar plots to summarize the performance obtained from our system. Compared to a table that lists results, our plots highlight the insights and give intuitive guidelines for service configuration.

4.3.2 Scheduler Design

Our system invokes a scheduling agent in a two-tier manner to perform benchmarking tasks efficiently. In the first tier, a newly submitted job is dispatched by the leader server to a relatively idle follower worker with adequate resources to execute the job. The second tier determines the order of job execution on a follower worker. Specifically, suppose the system has a set of jobs $J$ waiting to be processed in a scheduling interval. For each job $j \in J$, the total time to process the job is $t_j = t_j^{\mathrm{wait}} + t_j^{\mathrm{proc}}$. The optimization target is to minimize $T = \sum_{j \in J} t_j$.

Algorithm 1 Benchmark Job Scheduling
  Input: jobs J, workers W_1, ..., W_k
  for all jobs j in J do
    Select an idle worker W_min with the shortest queue
    Enqueue j to W_min and remove j from J
  end for
  for all workers W_1, ..., W_k do
    Re-order the queued jobs in ascending order
  end for
  Execute the jobs

To solve the problem, we implement a global load balancer (LB) to decide the job placement and a scheduler at every follower worker to allocate resources, as shown in Algorithm 1. First, workers publish their current queue length (i.e., the time to process all waiting jobs) to the leader server. Then the LB distributes a job to a worker, minimizing the waiting time. Next, each worker re-orders its jobs in ascending order. Finally, workers execute the jobs sequentially.
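The sketch below mirrors Algorithm 1 under simplifying assumptions: per-job processing-time estimates (e.g., from profiling) are assumed to be available, the load balancer places each job on the worker with the shortest queue, and each worker then orders its queue shortest-job-first before executing sequentially.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Job:
        name: str
        est_seconds: float          # estimated processing time (assumed, e.g., from profiling)

    @dataclass
    class Worker:
        name: str
        queue: List[Job] = field(default_factory=list)

        def queue_length(self) -> float:
            # Time needed to drain all jobs currently waiting on this worker.
            return sum(j.est_seconds for j in self.queue)

    def schedule(jobs: List[Job], workers: List[Worker]) -> None:
        # Tier 1: global load balancing - place each job on the least-loaded worker.
        for job in jobs:
            target = min(workers, key=Worker.queue_length)
            target.queue.append(job)
        # Tier 2: per-worker ordering - shortest-job-first to reduce the average JCT.
        for w in workers:
            w.queue.sort(key=lambda j: j.est_seconds)

    workers = [Worker("w1"), Worker("w2")]
    schedule([Job("bert-v100", 600), Job("resnet50-t4", 120), Job("gan-v100", 300)], workers)
    for w in workers:
        print(w.name, [j.name for j in w.queue])

Ordering short jobs first on each worker reduces the waiting time of the jobs behind them, which is what drives down the average job completion time.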
5 EVALUATION

In this section, we describe the experimental settings and the evaluations of our system. We also illustrate the importance of benchmark scheduling via a simple case study.

5.1 Experimental Setup

We deploy our system to a server in a private cluster and study five types of hardware, four types of software platforms, and a pipeline with three scenarios. In the process, we have registered many real-world models and generated hundreds of models for the benchmark study. ... instances that provide the GPUs in cloud providers.

Figure 6. Four serving software infrastructures under our test. Our system accepts models trained with PyTorch and TensorFlow. Then it converts the models into many serialized and optimized serving formats such as ONNX and TensorRT. These serving models will be bound with a serving infrastructure like TensorFlow-Serving for benchmarking.

TensorFlow-Serving (TFS) is the default serving platform for the TensorFlow SavedModel format, which is converted from TensorFlow models. Triton Inference Server (TrIS) supports many formats, such as the TensorRT format, which can be converted from many ... TorchScript and ONNX (converted from PyTorch) have no stable serving infrastructure during our implementation, so we use their ...
5.2 Hardware Characterization

... different inference models on one type of GPU (V100). The inference models (see the appendix for more details) include OD (object detection), GAN (CycleGAN (Zhu et al., 2017)), TC (text classification), and IC (image classification). The evaluation shows a wide range of speedup ratios, from 3.6x to 47.4x. We use each model's CPU latency as the corresponding service's SLO and input them to our system. The system can then recommend the best batch size and the speedup ratio of each service as a reference.
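As an illustration of this recommendation step, the following sketch selects, for each service, the largest profiled batch size whose latency still meets the CPU-derived SLO; the profiling numbers are made-up examples, not measured results.

    # profile[batch_size] = measured GPU latency per request in ms (illustrative numbers)
    profile = {1: 6.0, 2: 6.5, 4: 7.2, 8: 9.0, 16: 14.0, 32: 27.0}
    cpu_latency_slo_ms = 20.0        # SLO taken from the CPU latency of the same model

    def recommend_batch_size(profile, slo_ms):
        feasible = [(bs, lat) for bs, lat in profile.items() if lat <= slo_ms]
        if not feasible:
            return None
        # Prefer the largest feasible batch size: higher throughput at acceptable latency.
        return max(feasible, key=lambda item: item[0])[0]

    best = recommend_batch_size(profile, cpu_latency_slo_ms)
    print(f"recommended batch size: {best}, "
          f"speedup vs. CPU latency: {cpu_latency_slo_ms / profile[best]:.1f}x")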
Cost. We examine three kinds of costs, including energy consumption, CO2 emission, and cloud cost. Figure 8a ...

Figure 9. The GPU utilization with varied hyper-parameters of different models: (a) CNN models (number of residual blocks); (b) Transformer models (number of transformer blocks).
5.3 Software Platform Characterization

We prepare several benchmark submissions to evaluate software platforms. Each submission serves a model for up to 5 minutes. The workload generator is used to simulate a Poisson distribution with varied arrival rates to understand software performance. We also explore advanced features like the dynamic batching specific to DL serving.
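A minimal sketch of such a workload generator, assuming the aiohttp library and a placeholder serving endpoint: Poisson arrivals are produced by drawing exponentially distributed inter-arrival gaps for the configured rate and firing each request as an asynchronous task.

    import asyncio
    import random
    import aiohttp

    async def poisson_workload(url, rate_rps, duration_s, payload=b"{}"):
        """Send requests with exponential inter-arrival gaps, i.e., a Poisson process."""
        async with aiohttp.ClientSession() as session:

            async def fire():
                # One request; the response body is read and discarded.
                async with session.post(url, data=payload) as resp:
                    await resp.read()

            loop = asyncio.get_running_loop()
            deadline = loop.time() + duration_s
            tasks = []
            while loop.time() < deadline:
                await asyncio.sleep(random.expovariate(rate_rps))   # mean gap = 1 / rate
                tasks.append(asyncio.create_task(fire()))
            await asyncio.gather(*tasks, return_exceptions=True)

    # Example: 120 req/s against a hypothetical serving endpoint for 60 seconds.
    asyncio.run(poisson_workload("http://follower:8501/v1/models/resnet50:predict", 120, 60))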
... accounts for a longer tail latency, though it can save service providers' cost. Both Figure 11b and Figure 11c indicate that TFS cannot adequately handle spike load.

Figure 11. The tail latency of four systems under varied workloads. The batch size, arrival rate, and serving software influence tail latency a lot: (a) ResNet50 served by TFS; (b) ResNet50 served by TrIS; (c) arrival rate (CDF); (d) four software platforms.

We further compare four different platforms for the same image classification service (based on ResNet50) on a V100 GPU, as shown in Figure 11d. TrIS from NVIDIA performs best, which is no surprise since it includes many GPU optimization techniques. ONNX Runtime with a simple web framework performs better than TFS, a dedicated serving system for DL inference. In conclusion, when configuring a latency-sensitive service, developers need to take the impact of batch size and of the serving software (on latency) into consideration. Both academia and industry have invested efforts in serving software development, and our system can make things easier.
Advanced Feature: Dynamic Batching. Two representative systems, TFS and TrIS, provide dynamic batching settings. To use the feature, developers need to set a number of hyper-parameters, such as the maximum batch size and the maximum queue time. We use the two systems to serve a ...

Figure 12. The throughput comparison of two serving software systems with the dynamic batching feature.
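To make the role of these two knobs concrete, the following sketch implements the idea behind server-side dynamic batching: queued requests are flushed either when the batch reaches the maximum size or when the oldest request has waited for the maximum queue time. It is a conceptual illustration, not the configuration interface of TFS or TrIS.

    import time
    from queue import Queue, Empty

    def batching_loop(request_queue: Queue, run_batch,
                      max_batch_size=8, max_queue_time_ms=5.0):
        """Group incoming requests into batches bounded by size and queueing delay."""
        while True:
            first = request_queue.get()                 # block until one request arrives
            batch = [first]
            deadline = time.perf_counter() + max_queue_time_ms / 1000.0
            while len(batch) < max_batch_size:
                remaining = deadline - time.perf_counter()
                if remaining <= 0:
                    break                               # oldest request waited long enough
                try:
                    batch.append(request_queue.get(timeout=remaining))
                except Empty:
                    break
            run_batch(batch)                            # one forward pass over the whole batch

Larger values of the maximum batch size and maximum queue time trade per-request latency for throughput, which is the effect compared in Figure 12.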
Resource Usage. Understanding the resource utilization pattern often leads to better resource allocation. We test two services (BERT with an arrival rate of 30 requests/second and batch size 1, and ResNet50 with 160 requests/second and batch size 1). The results are shown in Figure 13. We observe that the GPU utilization is dynamic under varied workloads and tends to be under-utilized at a low arrival rate (even when a heavy model like BERT is loaded). This gives developers a large enough room for optimization.

Figure 13. The GPU utilization with different serving software under varied workloads: (a) BERT; (b) ResNet50.

5.4 Inference Pipeline Decomposition

We simulate a real-world DL service by building a simple pipeline and sending requests using three networking transmission technologies. We use ResNet50 and TFS as the inference model and the serving system, respectively.

The results in Figure 14a show that the transmission time is comparable to the inference time for small batch sizes. As the batch size increases, the inference time accounts for a much larger portion of the total time. The results (Figure 14b) also show that for the same service, 4G LTE has the longest end-to-end latency. This indicates that for a mobile application, sending requests to a cloud DL service can incur high latency.
6 RELATED WORK

The system's design follows the best practices of previous benchmarking works, and we provide a highly efficient cluster system. We categorize previous studies into two classes: micro-benchmarks and macro-benchmarks. Micro-benchmarks focus more on low-level operators. DeepBench (Baidu, 2017) is designed to study kernel-level operations such as GEMM across multiple hardware devices, including both cloud servers and edge nodes. AI Matrix (Alibaba, 2018) gets inspiration from DeepBench and further extends it to ...

7 CONCLUSION

... guidelines. We plan to continue upgrading the system and design more APIs to give users more flexibility to adopt our system into their own DL services.
REFERENCES

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database.

Karlaš, B., Interlandi, M., Renggli, C., Wu, W., Zhang, C., Mukunthu Iyappan Babu, D., Edwards, J., Lauren, C., Xu, A., and Weimer, M. Building continuous integration services for machine learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2415, 2020.

Microsoft. Onnx runtime: cross-platform, high performance scoring engine for ml models. https://github.com/microsoft/onnxruntime, 2019. Accessed: 2020-05-07.

NVIDIA. Nvidia data center gpu manager (dcgm) is a suite of tools for managing and monitoring tesla gpus in cluster environments. https://developer.nvidia.com/dcgm, 2018. Accessed: 2020-10-02.

NVIDIA. Multi-process service. https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, 2020a. Accessed: 2020-06-28.

NVIDIA. The triton inference server provides a cloud inferencing solution optimized for nvidia gpus. https://github.com/NVIDIA/triton-inference-server, 2020b. Accessed: 2020-05-07.

Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, F., Rajashekhar, V., Ramesh, S., and Soyke, J. Tensorflow-serving: Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139, 2017.

Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497, 2015.

Williams, S., Waterman, A., and Patterson, D. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

Yu, P. and Chowdhury, M. Salus: Fine-grained gpu sharing primitives for deep learning applications. arXiv preprint arXiv:1902.04610, 2019.

Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S. A., Konwinski, A., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., et al. Accelerating the machine learning lifecycle with mlflow. IEEE Data Eng. Bull., 41(4):39–45, 2018.

Zhang, H., Li, Y., Huang, Y., Wen, Y., Yin, J., and Guan, K. Mlmodelci: An automatic cloud platform for efficient mlaas. arXiv preprint arXiv:2006.05096, 2020.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.