
MLPerf Inference Benchmark

Vijay Janapa Reddi 1 Christine Cheng 2 David Kanter 3 Peter Mattson 4 Guenther Schmuelling 5
Carole-Jean Wu 6 Brian Anderson 4 Maximilien Breughe 7 Mark Charlebois 8 William Chou 8
Ramesh Chukka 2 Cody Coleman 9 Sam Davis 10 Pan Deng 11 Greg Diamos 12 Jared Duke 4 Dave Fick 13
J. Scott Gardner 14 Itay Hubara 15 Sachin Idgunji 7 Thomas B. Jablin 4 Jeff Jiao 16 Tom St. John 17
Pankaj Kanwar 4 David Lee 18 Jeffery Liao 19 Anton Lokhmotov 20 Francisco Massa 6 Peng Meng 11
Paulius Micikevicius 7 Colin Osborne 21 Gennady Pekhimenko 22 Arun Tejusve Raghunath Rajan 2
Dilip Sequeira 7 Ashish Sirasao 23 Fei Sun 24 Hanlin Tang 2 Michael Thomson 25 Frank Wei 26 Ephrem Wu 23
Lingjie Xu 26 Koichi Yamada 2 Bing Yu 18 George Yuan 7 Aaron Zhong 16 Peizhao Zhang 6 Yuchen Zhou 27

arXiv:1911.02549v1 [cs.LG] 6 Nov 2019

Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of
different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that
incorporate existing models span at least three orders of magnitude in power consumption and four orders of magnitude
in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more
software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system
performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for
industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. Driven by more
than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf implements a set of rules and
practices to ensure comparability across systems with wildly differing architectures. In this paper, we present the method
and design principles of the initial MLPerf Inference release. The first call for submissions garnered more than 600
inference-performance measurements from 14 organizations, representing over 30 systems that show a range of capabilities.

Affiliations: 1 Harvard University, 2 Intel, 3 Real World Insights, 4 Google, 5 Microsoft, 6 Facebook, 7 NVIDIA, 8 Qualcomm, 9 Stanford University, 10 Myrtle, 11 Tencent, 12 Landing AI, 13 Mythic, 14 Advantage Engineering, 15 Habana Labs, 16 Alibaba T-Head, 17 Tesla, 18 MediaTek, 19 Synopsys, 20 dividiti, 21 Arm, 22 University of Toronto, 23 Xilinx, 24 Alibaba (formerly Facebook), 25 Centaur Technology, 26 Alibaba Cloud, 27 General Motors. MLPerf Inference is the product of individuals from these organizations who led the benchmarking effort and of submitters who produced the first set of benchmark results. Both groups are necessary to create a successful industry benchmark. We credit the submitters and their organizations in the acknowledgments. Send correspondence to [email protected].

1 INTRODUCTION

Machine learning (ML) powers a variety of applications from computer vision (He et al., 2016; Goodfellow et al., 2014; Liu et al., 2016; Krizhevsky et al., 2012) and natural-language processing (Vaswani et al., 2017; Devlin et al., 2018) to self-driving cars (Xu et al., 2018; Badrinarayanan et al., 2017) and autonomous robotics (Levine et al., 2018). These applications are deployed at large scale and require substantial investment to optimize inference performance. Although training of ML models has been a development bottleneck and a considerable expense (Amodei & Hernandez, 2018), inference has become a critical workload, since models can serve as many as 200 trillion queries and perform over 6 billion translations a day (Lee et al., 2019b).

To address these growing computational demands, hardware, software, and system developers have focused on inference performance for a variety of use cases by designing optimized ML hardware and software systems. Estimates indicate that over 100 companies are producing or are on the verge of producing optimized inference chips. By comparison, only about 20 companies target training.

Each system takes a unique approach to inference and presents a trade-off between latency, throughput, power, and model quality. For example, quantization and reduced precision are powerful techniques for improving inference latency, throughput, and power efficiency at the expense of accuracy (Han et al., 2015; 2016). After training with floating-point numbers, compressing model weights enables better performance by decreasing memory-bandwidth requirements and increasing computational throughput (e.g., by using wider vectors). Similarly, many weights can be removed to boost sparsity, which can reduce the memory footprint and the number of operations (Han et al., 2015; Molchanov et al., 2016; Li et al., 2016). Support for these techniques varies among systems, however, and these optimizations can drastically reduce final model quality. Hence, the field needs an ML inference benchmark that can quantify these trade-offs in an architecturally neutral, representative, and reproducible manner.
The challenge is the ecosystem's many possible combinations of machine-learning tasks, models, data sets, frameworks, tool sets, libraries, architectures, and inference engines, which make inference benchmarking almost intractable. The spectrum of ML tasks is broad, including but not limited to image classification and localization, object detection and segmentation, machine translation, automatic speech recognition, text to speech, and recommendations. Even for a specific task, such as image classification, many ML models are viable. These models serve in a variety of scenarios that range from taking a single picture on a smartphone to continuously and concurrently detecting pedestrians through multiple cameras in an autonomous vehicle. Consequently, ML tasks have vastly different quality requirements and real-time-processing demands. Even implementations of functions and operations that the models typically rely on can be highly framework specific, and they increase the complexity of the design and the task.

Both academic and industrial organizations have developed ML inference benchmarks. Examples include AIMatrix (Alibaba, 2018), EEMBC MLMark (EEMBC, 2019), and AIXPRT (Principled Technologies, 2019) from industry, as well as AI Benchmark (Ignatov et al., 2019), TBD (Zhu et al., 2018), Fathom (Adolf et al., 2016), and DAWNBench (Coleman et al., 2017) from academia. Each one has made substantial contributions to ML benchmarking, but they were developed without input from ML-system designers. As a result, there is no consensus on representative models, metrics, tasks, and rules across these benchmarks. For example, some efforts focus too much on specific ML applications (e.g., computer vision) or specific domains (e.g., embedded inference). Moreover, it is important to devise the right performance metrics for inference so the evaluation accurately reflects how these models operate in practice. Latency, for instance, is the primary metric in many initial benchmarking efforts, but latency-bounded throughput is more relevant for many cloud inference scenarios.

Therefore, two critical needs remain unmet: (i) standard evaluation criteria for ML inference systems and (ii) an extensive (but reasonable) set of ML applications/models that cover existing inference systems across all major domains. MLPerf Inference answers the call with a benchmark suite that complements MLPerf Training (Mattson et al., 2019). Jointly developed by the industry with input from academic researchers, more than 30 organizations as well as more than 200 ML engineers and practitioners assisted in the benchmark design and engineering process. This community architected MLPerf Inference to measure inference performance across a wide variety of ML hardware, software, systems, and services. The benchmark suite defines a set of tasks (models, data sets, scenarios, and quality targets) that represent real-world deployments, and it specifies the evaluation metrics. In addition, the benchmark suite comes with permissive rules that allow comparison of different architectures under realistic scenarios.

Unlike traditional SPEC CPU–style benchmarks that run out of the box (Dixit, 1991), MLPerf promotes competition by allowing vendors to reimplement and optimize the benchmark for their system and then submit the results. To make results comparable, it defines detailed rules. It provides guidelines on how to benchmark inference systems, including when to start the performance-measurement timing, what preprocessing to perform before invoking the model, and which transformations and optimizations to employ. Such meticulous specifications help ensure comparability across ML systems because all follow the same rules.

We describe the design principles and architecture of the MLPerf Inference benchmark's initial release (v0.5). We received over 600 submissions across a variety of tasks, frameworks, and platforms from 14 organizations. Audit tests validated the submissions, and the tests cleared 595 of them as valid. The final results show a four-orders-of-magnitude performance variation ranging from embedded devices and smartphones to data-center systems. MLPerf Inference adopts the following principles for a tailored approach to industry-standard benchmarking:

1. Pick representative workloads that everyone can access.
2. Evaluate systems in realistic scenarios.
3. Set target qualities and tail-latency bounds in accordance with real use cases.
4. Allow the benchmarks to flexibly showcase both hardware and software capabilities.
5. Permit the benchmarks to change rapidly in response to the evolving ML ecosystem.

The rest of the paper is organized as follows: Section 2 provides background, describing the differences in ML training versus ML inference and the challenges to creating a benchmark that covers the broad ML inference landscape. Section 3 describes the goals of MLPerf Inference. Section 4 presents MLPerf's underlying inference-benchmark architecture and reveals the design choices for version 0.5. Section 5 summarizes the submission, review, and reporting process. Section 6 highlights v0.5 submission results to demonstrate that MLPerf Inference is a well-crafted industry benchmark. Section 7 shares the important lessons learned and prescribes a tentative roadmap for future work. Section 8 compares MLPerf Inference with prior efforts. Section 9 concludes the paper. Section 10 acknowledges the individuals who contributed to the benchmark's development or validated the effort by submitting results.
2 BENCHMARKING CHALLENGES

We provide background on ML execution (Section 2.1) and describe the extreme heterogeneity that makes developing an ML inference benchmark challenging (Section 2.2).

2.1 ML Pipeline

Machine learning generally involves a series of complicated tasks (Figure 1). Nearly every ML pipeline begins by acquiring data to train and test the models. Raw data is typically sanitized and normalized before use because real-world data often contains errors, irrelevancies, or biases that reduce the quality and accuracy of ML models.

Figure 1. Stages of a typical ML pipeline. The first stage involves gathering data to train the models. The raw data is often noisy, so it requires processing before training a deep neural network (DNN). The hardware landscape for DNN training and inference is diverse.

ML benchmarking focuses on two phases: training and inference. During training, models learn to make predictions from inputs. For example, a model may learn to predict the subject of a photograph or the most fluent translation of a sentence from English to German. During inference, models make predictions about their inputs, but they no longer learn. This phase is increasingly crucial as ML moves from research to practice, serving trillions of queries daily. Despite its apparent simplicity relative to training, the task of balancing latency, throughput, and accuracy for real-world applications makes optimizing inference difficult.

2.2 ML Inference Benchmarking Complexity

Creating a useful ML benchmark involves four critical challenges: (1) the diversity of models, (2) the variety of deployment scenarios, (3) the array of inference systems, and (4) the lack of a standard inference workflow.

2.2.1 Diversity of Models

Even for a single task, such as image classification, numerous models present different trade-offs between accuracy and computational complexity, as Figure 2 shows. These models vary tremendously in compute and memory requirements (e.g., a 50x difference in Gflops), while the corresponding Top-1 accuracy ranges from 55% to 83% (Bianco et al., 2018). This variation creates a Pareto frontier rather than one optimal choice.

Figure 2. An example of ML-model diversity for image classification (figure from Bianco et al. (2018)). No single model is optimal; each one presents a unique design trade-off between accuracy, memory requirements, and computational complexity.

Choosing the right model depends on the application. For example, pedestrian detection in autonomous vehicles has a much higher accuracy requirement than does labeling animals in photographs, owing to the different consequences of wrong predictions. Similarly, quality-of-service requirements for inference vary by several orders of magnitude, from effectively no latency requirement for offline processes to milliseconds for real-time applications. Covering this design space necessitates careful selection of models that represent realistic scenarios.

Another challenge is that models vary wildly, so it is difficult to draw meaningful comparisons. In many cases, such as in Figure 2, a small accuracy change (e.g., a few percent) can drastically change the computational requirements (e.g., 5–10x). For example, SE-ResNeXt-50 (Hu et al., 2018; Xie et al., 2017) and Xception (Chollet, 2017) achieve roughly the same accuracy (∼79%) but exhibit a 2x difference in computational requirements (∼4 Gflops versus ∼8 Gflops).

2.2.2 Diversity of Deployment Scenarios

In addition to accuracy and computational complexity, the availability and arrival patterns of the input data vary with the deployment scenario. For example, in offline batch processing such as photo categorization, all the data may be readily available in (network) storage, allowing accelerators to reach and maintain peak performance. By contrast, translation, image tagging, and other web applications may experience variable arrival patterns based on end-user traffic.

Similarly, real-time applications such as augmented reality and autonomous vehicles handle a constant flow of data
rather than having it all in memory. Although the same general model architecture could be employed in each scenario, data batching and similar optimizations may be inapplicable, leading to drastically different performance. Timing the on-device inference latency alone fails to reflect the real-world inference requirements.

2.2.3 Diversity of Inference Systems

The possible combinations of different inference applications, data sets, models, machine-learning frameworks, tool sets, libraries, systems, and platforms are numerous. Figure 3 shows the breadth and depth of the ML space. The hardware and software sides exhibit substantial complexity.

Figure 3. Software and hardware options at every level of the inference stack. The combinations across the layers make benchmarking ML inference systems a particularly challenging problem.

On the software side, about a dozen ML frameworks commonly serve for developing deep-learning models, such as Caffe/Caffe2 (Jia et al., 2014), Chainer (Tokui et al., 2015), CNTK (Seide & Agarwal, 2016), Keras (Chollet et al., 2015), MXNet (Chen et al., 2015), TensorFlow (Abadi et al., 2016), and PyTorch (Paszke et al., 2017). Independently, there are also many optimized libraries, such as cuDNN (Chetlur et al., 2014), Intel MKL (Intel, 2018a), and FBGEMM (Khudia et al., 2018), supporting various inference run times, such as Apple CoreML (Apple, 2017), Intel OpenVino (Intel, 2018b), NVIDIA TensorRT (NVIDIA), ONNX Runtime (Bai et al., 2019), Qualcomm SNPE (Qualcomm), and TF-Lite (Lee et al., 2019a).

Each combination has idiosyncrasies that make supporting the most current neural-network model architectures a challenge. Consider the Non-Maximum Suppression (NMS) operator implementation for object detection. When training object-detection models in TensorFlow, the regular NMS operator smooths out imprecise bounding boxes for a single object. But this implementation is unavailable in TensorFlow Lite, which is tailored for mobile and instead implements fast NMS. As a result, when converting the model from TensorFlow to TensorFlow Lite, the accuracy of SSD-MobileNets-v1 decreases from 23.1% to 22.3% mAP. These types of subtle differences make it hard to port models exactly from one framework to another.
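To make the NMS example concrete, the following sketch shows a generic greedy NMS in NumPy. It is an illustration only: it is neither TensorFlow's regular NMS nor TensorFlow Lite's fast NMS, and the 0.5 IoU threshold is an assumption; it simply shows the kind of post-processing operator whose differing implementations cause the accuracy gap described above.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```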
On the hardware side, platforms are tremendously diverse, ranging from familiar processors (e.g., CPUs, GPUs, and DSPs) to FPGAs, ASICs, and exotic accelerators such as analog and mixed-signal processors. Each platform comes with hardware-specific features and constraints that enable or disrupt performance depending on the model and scenario. Combining this diversity with the range of software systems above presents a unique challenge to deriving a robust and useful ML benchmark that meets industry needs.

2.2.4 Lack of a Standard Inference Workflow

There are many ways to optimize model performance. For example, quantizing floating-point weights decreases memory footprint and bandwidth requirements and increases computational throughput (wider vectors), but it also decreases model accuracy. Some platforms require quantization because they lack floating-point support. Low-power mobile devices, for example, call for such an optimization.

Other transformations are more complicated and change the network structure to boost performance further or exploit unique features of the inference platform. An example is reshaping image data from space to depth. The enormous variety of ML inference hardware and software means no one method can prepare trained models for all deployments.
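As a minimal sketch of the weight quantization mentioned above, the snippet below applies symmetric 8-bit quantization with a single per-tensor scale. This is an illustration rather than any submitter's toolchain: production flows typically use calibration data, per-channel scales, or quantization-aware training to limit the accuracy loss.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of an FP32 tensor to INT8 (per-tensor scale)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)   # stand-in for a trained weight tensor
q, scale = quantize_int8(w)
max_error = np.abs(w - dequantize(q, scale)).max()  # the rounding error that can cost accuracy
```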
3 MLPERF INFERENCE GOALS

To overcome the challenges, MLPerf Inference adopted a set of principles for developing a robust yet flexible benchmark suite based on community-driven development.

3.1 Representative, Broadly Accessible Workloads

For the initial version 0.5, we chose tasks that reflect major commercial and research scenarios for a large class of submitters and that capture a broad set of computing motifs. To focus on the realistic rules and testing infrastructure, we selected a minimum-viable-benchmark approach to accelerate the development process. Where possible, we adopted models that were part of the MLPerf Training v0.6 suite (Mattson et al., 2019), thereby amortizing the benchmark-development effort.

The current version's tasks and models are modest in scope. MLPerf Inference v0.5 comprises three tasks and five models: image classification (ResNet-50 (He et al., 2016) and MobileNet-v1 (Howard et al., 2017)), object detection (SSD-ResNet34—i.e., SSD (Liu et al., 2016) with a ResNet34 backbone—and SSD-MobileNet-v1—i.e., SSD with a MobileNet-v1 backbone), and machine translation (GNMT (Wu et al., 2016)). We plan to add others.

We chose our tasks and models through a consensus-driven process and considered community feedback to ensure their relevance. Our models are mature and have earned broad community support. Because the industry has studied them and can build efficient systems, benchmarking is accessible and provides a snapshot that shows the state of ML systems. Moreover, we focused heavily on the benchmark's modular design to make adding new models and tasks less costly. As we show in Section 6.7, our design has allowed MLPerf Inference users to easily add new models. Our plan is to extend the scope to include more areas, tasks, models, and so on. Additionally, we aim to maintain consistency and alignment between the training and inference benchmarks.

3.2 System Evaluation Using Realistic Scenarios

As our submission results show, ML inference systems vary in power consumption across four or more orders of magnitude and cover a wide variety of applications as well as physical deployments that range from deeply embedded devices to smartphones to data centers. The applications have a variety of usage models and many figures of merit, which in turn require multiple performance metrics. For example, the figure of merit for an image-recognition system that classifies a video camera's output will be entirely different than for a cloud-based translation system. To address these various models, we surveyed MLPerf's broad membership, which includes both customers and vendors. On the basis of that feedback, we identified four scenarios that represent many critical inference applications.

Our goal is a method that simulates the realistic behavior of the inference system under test; such a feature is unique among AI benchmarks. To this end, we developed the Load Generator (LoadGen) tool, which is a query-traffic generator that mimics the behavior of real-world systems. It has four scenarios: single-stream, multistream, server, and offline. They emulate the ML-workload behavior of mobile devices, autonomous vehicles, robotics, and cloud-based setups.

3.3 Target Qualities and Tail-Latency Bounds

Quality and performance are intimately connected for all forms of machine learning, but the role of quality targets in inference is distinct from that in training. For training, the performance metric is the time to train to a specific quality, making accuracy a first-order consideration. For inference, the starting point is a pretrained reference model that achieves a target quality. Still, many system architectures can sacrifice model quality to achieve lower latency, lower total cost of ownership (TCO), or higher throughput.

The trade-offs between accuracy, latency, and TCO are application specific. Trading 1% model accuracy for 50% lower TCO is prudent when identifying cat photos, but it is less so during online pedestrian detection. For MLPerf, we define a model's quality targets. To reflect this important aspect of real-world deployments, we established per-model and scenario targets for inference latency and model quality. The latency bounds and target qualities are based on input gathered from end users.

3.4 Flexibility to Showcase Hardware and Software

Systems benchmarks can be characterized as language level (SPECInt (Dixit, 1991)), API level (LINPACK (Dongarra, 1988)), or semantic level (TPC (Council, 2005)). The ML community has embraced a wide variety of languages and libraries, so MLPerf Inference is a semantic-level benchmark. This type specifies the task to be accomplished and the general rules of the road, but it leaves implementation details to the submitters.

The MLPerf Inference benchmarks are flexible enough that submitters can optimize the reference models, run them through their preferred software tool chain, and execute them on their hardware of choice. Thus, MLPerf Inference has two divisions: closed and open. Strict rules govern the closed division, whereas the open division is more permissive and allows submitters to change the model, achieve different quality targets, and so on. The closed division is designed to address the lack of a standard inference-benchmarking workflow.

Within each division, submitters may file their results under specific categories on the basis of their hardware and
software components' availability. There are three system categories: available; preview; and research, development, or other systems. Systems in the first category are available off the shelf, while systems in the second category allow vendors to provide a sneak peek into their capabilities. At the other extreme are bleeding-edge ML solutions in the third category that are not ready for production use.

In summary, MLPerf Inference allows submitters to exhibit many different systems across varying product-innovation, maturity, and support levels.

3.5 Benchmark Changes for Rapidly Evolving ML

MLPerf Inference v0.5 is only the beginning. The benchmark will evolve. We are working to add more models (e.g., recommendation and time-series models), more scenarios (e.g., "burst" mode), better tools (e.g., a mobile application), and better metrics (e.g., timing preprocessing) to more accurately reflect the performance of the whole ML pipeline.

4 DESIGN AND IMPLEMENTATION

In this section we describe the design and implementation of MLPerf Inference v0.5. We also define the components of an inference system (Section 4.1) and detail how an inference query flows through one such system (Section 4.2). Our discussion also covers the MLPerf Inference tasks for v0.5 (Section 4.3).

4.1 Inference System Under Test (SUT)

A complete MLPerf Inference system contains multiple components: a data set, a system under test (SUT), the Load Generator (LoadGen), and an accuracy script. Figure 4 shows an overview of an MLPerf Inference system. The data set, LoadGen, and accuracy script are fixed for all submissions and are provided by MLPerf. Submitters have wide discretion to implement an SUT according to their architecture's requirements and their engineering judgment. By establishing a clear boundary between submitter-owned and MLPerf-owned components, the benchmark maintains comparability among submissions.

Figure 4. MLPerf Inference system under test (SUT) and how the components integrate. (1) The LoadGen requests that the SUT load samples; (2–3) the SUT loads samples into memory; (4) the SUT signals the LoadGen when it is ready; (5) the LoadGen issues requests to the SUT; (6) the benchmark processes the results and returns the results to the LoadGen; and (7) the LoadGen outputs logs, which the accuracy script then reads and verifies.

4.2 Life of a Query

At startup, the LoadGen requests that the SUT load samples into memory. The MLPerf Inference rules allow them to be loaded into DRAM as an untimed operation. The SUT loads the samples into DRAM and may perform other untimed operations as the rules stipulate. These untimed operations may include but are not limited to compilation, cache warmup, and preprocessing.

The SUT signals the LoadGen when it is ready to receive the first query. A query is a request for inference on one or more samples. The LoadGen sends queries to the SUT in accordance with the selected scenario. Depending on that scenario, it can submit queries one at a time, at regular intervals, or in a Poisson distribution.

The SUT runs inference on each query and sends the response back to the LoadGen, which either logs the response or discards it. After the run, an accuracy script checks the logged responses to determine whether the model accuracy is within tolerance.

We provide a clear interface between the SUT and LoadGen so new scenarios and experiments can be handled in the LoadGen and rolled out to all models and SUTs without extra effort. Doing so also facilitates compliance and auditing, since many technical rules about query arrivals, timing, and accuracy are implemented outside of submitter code. As we describe in Section 6.7, one submitter obtained results for over 60 image-classification and object-detection models.

Moreover, placing the performance-measurement code outside of submitter code is congruent with MLPerf's goal of end-to-end system benchmarking. To that end, the LoadGen measures the holistic performance of the entire SUT rather than any individual part. Finally, this condition enhances the benchmark's realism: inference engines typically serve as black-box components of larger systems.

4.3 Benchmark Tasks

Designing ML benchmarks is fundamentally different from designing non-ML benchmarks. MLPerf defines high-level tasks (e.g., image classification) that a machine-learning system can perform. For each one, we provide a canonical reference model in a few widely used frameworks. The reference model and weights offer concrete instantiations of the ML task, but formal mathematical equivalence is unnecessary. For example, a fully connected layer can be implemented with different cache-blocking and evaluation strategies. Consequently, submitting results requires optimizations to achieve good performance.
The concept of a reference model and a valid class of equivalent implementations creates freedom for most ML systems while still enabling relevant comparisons of inference systems. MLPerf provides reference models using 32-bit floating-point weights and, for convenience, also provides carefully implemented equivalent models to address the three most popular formats: TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), and ONNX (Bai et al., 2019).

As Table 1 illustrates, we selected a set of vision and language tasks along with associated reference models. We chose vision and translation because they are widely used across all computing systems, from edge devices to cloud data centers. Additionally, mature and well-behaved reference models with different architectures (e.g., CNNs and RNNs) were available.

Area     | Task                           | Reference Model                                        | Data Set            | Quality Target
Vision   | Image classification (heavy)   | ResNet-50 v1.5, 25.6M parameters, 7.8 GOPS/input       | ImageNet (224x224)  | 99% of FP32 (76.456%) Top-1 accuracy
Vision   | Image classification (light)   | MobileNet-v1 224, 4.2M parameters, 1.138 GOPS/input    | ImageNet (224x224)  | 98% of FP32 (71.676%) Top-1 accuracy
Vision   | Object detection (heavy)       | SSD-ResNet34, 36.3M parameters, 433 GOPS/input         | COCO (1,200x1,200)  | 99% of FP32 (0.20 mAP)
Vision   | Object detection (light)       | SSD-MobileNet-v1, 6.91M parameters, 2.47 GOPS/input    | COCO (300x300)      | 99% of FP32 (0.22 mAP)
Language | Machine translation            | GNMT, 210M parameters                                  | WMT16 EN-DE         | 99% of FP32 (23.9 SacreBLEU)

Table 1. ML tasks in MLPerf Inference v0.5. Each one reflects critical commercial and research use cases for a large class of submitters, and together they also capture a broad set of computing motifs (e.g., CNNs and RNNs).

For the vision tasks, we defined both heavyweight and lightweight models. The former are representative of systems with greater compute resources, such as a data center or autonomous vehicle, where increasing the computation cost for better accuracy is a reasonable trade-off. In contrast, the latter models are appropriate for systems with constrained compute resources and low latency requirements, such as smartphones and low-cost embedded devices.

For all tasks, we standardized on free and publicly available data sets to ensure the entire community can participate. Because of licensing restrictions on some data sets (e.g., ImageNet), we do not host them directly. Instead, the data is downloaded before running the benchmark.

4.3.1 Image Classification

Image classification is widely used in commercial applications and is also a de facto standard for evaluating ML-system performance. A classifier network takes an image as input and selects the class that best describes it. Example applications include photo searches, text extraction from images, and industrial automation, such as object sorting and defect detection.

For image classification, we use the standard ImageNet 2012 data set (Deng et al., 2009) and crop to 224x224 during preprocessing. We selected two models: a higher-accuracy and more computationally expensive heavyweight model as well as a computationally lightweight model that is faster but less accurate. Image-classification quality is the classifier's Top-1 accuracy.
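A minimal preprocessing sketch, assuming a typical resize-then-center-crop pipeline consistent with the 224x224 crop mentioned above, is shown below. The 256-pixel short-side resize and the mean/standard-deviation constants are common ImageNet defaults, not values quoted from the MLPerf rules; the authoritative preprocessing is defined by the reference implementation.

```python
import numpy as np
from PIL import Image

def preprocess(path, out_size=224, resize=256):
    """Resize the short side, center-crop, and normalize one ImageNet sample."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = resize / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    left = (img.width - out_size) // 2
    top = (img.height - out_size) // 2
    img = img.crop((left, top, left + out_size, top + out_size))
    x = np.asarray(img, dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed ImageNet statistics
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    return (x - mean) / std   # 224x224x3 float32 tensor (HWC)
```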
The heavyweight model, ResNet-50 v1.5 (He et al., 2016; MLPerf, 2019), comes directly from the MLPerf Training suite to maintain alignment. ResNet-50 is the most common network for performance claims. Unfortunately, it has multiple subtly different implementations that make most comparisons difficult. In our training suite, we specifically selected ResNet-50 v1.5 to ensure useful comparisons and compatibility across major frameworks. We also extensively studied and characterized the network for reproducibility and low run-to-run training variation, making it an obvious and low-risk choice.

The lightweight model, MobileNets-v1 224 (Howard et al., 2017), is built around smaller, depth-wise-separable convolutions to reduce the model complexity and computational burden. MobileNets is a family of models that offer
varying compute and accuracy options—we selected the full-width, full-resolution MobileNet-v1-1.0-224. This network reduces the parameters by 6.1x and the operations by 6.8x compared with ResNet-50 v1.5. We evaluated both MobileNet-v1 and v2 (Sandler et al., 2018) for the MLPerf Inference v0.5 suite and selected the former, as it has garnered wider adoption.

4.3.2 Object Detection

Object detection is a complex vision task that determines the coordinates of bounding boxes around objects in an image and classifies the image. Object detectors typically use a pretrained image-classifier network as a backbone or a feature extractor, then perform regression for localization and bounding-box selection. Object detection is crucial for automotive applications, such as detecting hazards and analyzing traffic, and for mobile-retail tasks, such as identifying items in a picture.

For object detection, we chose the COCO data set (Lin et al., 2014) with both a lightweight and heavyweight model. Our small model uses the 300x300 image size, which is typical of resolutions in smartphones and other compact devices. For the larger model, we upscale the data set to more closely represent the output of a high-definition image sensor (1.44 MP total). The choice of the larger input size is based on community feedback, especially from automotive and industrial-automation customers. The quality metric for object detection is mean average precision (mAP).

The heavyweight object detector's reference model is SSD (Liu et al., 2016) with a ResNet34 backbone, which also comes from our training benchmark. The lightweight object detector's reference model uses a MobileNet-v1-1.0 backbone, which is more typical for constrained computing environments. We selected the MobileNet feature detector on the basis of feedback from the mobile and embedded communities.

4.3.3 Translation

Neural machine translation (NMT) is popular in the rapidly evolving field of natural-language processing. NMT models translate a sequence of words from a source language to a target language and are used in translation applications and services. Our translation data set is WMT16 EN-DE (WMT, 2016). The quality measurement is Bilingual Evaluation Understudy Score (Bleu) (Papineni et al., 2002). In MLPerf Inference, we specifically employ SacreBleu (Post, 2018). For the translation, we chose GNMT (Wu et al., 2016), which employs a well-established recurrent-neural-network (RNN) architecture and is part of the training benchmark. GNMT is representative of RNNs, which are popular for sequential and time-series data, and it ensures our reference-model suite captures a wide variety of compute motifs.

4.4 Quality Targets

Many architectures can trade model quality for lower latency, lower TCO, or greater throughput. To reflect this important aspect of real-world deployments, we established per-model and scenario targets for latency and model quality. We adopted quality targets that for 8-bit quantization were achievable with considerable effort.

MLPerf Inference requires that almost all implementations achieve a quality target within 1% of the FP32 reference model's accuracy (e.g., the ResNet-50 v1.5 model achieves 76.46% Top-1 accuracy, and an equivalent model must achieve at least 75.70% Top-1 accuracy). Initial experiments, however, showed that for mobile-focused networks—MobileNet and SSD-MobileNet—the accuracy loss was unacceptable without retraining. We were unable to proceed with the low accuracy because performance benchmarking would become unrepresentative.

To address the accuracy drop, we took three steps. First, we trained the MobileNet models for quantization-friendly weights, enabling us to narrow the quality window to 2%. Second, to reduce the training sensitivity of MobileNet-based submissions, we provided equivalent MobileNet and SSD-MobileNet implementations quantized to an 8-bit integer format. Third, for SSD-MobileNet, we reduced the quality requirement to 22.0 mAP to account for the challenges of using MobileNets as a backbone.

To improve the submission comparability, we disallow retraining. Our prior experience and feasibility studies confirmed that for 8-bit integer arithmetic, which was an expected deployment path for many systems, the ∼1% relative-accuracy target was easily achievable without retraining.

4.5 Scenarios and Metrics

The diverse inference applications have various usage models and figures of merit, which in turn require multiple performance metrics. To address these models, we specify four scenarios that represent important inference applications. Each one has a unique performance metric, as Table 2 illustrates. The LoadGen discussed in Section 4.7 simulates the scenarios and measures the performance.

Single-stream. This scenario represents one inference-query stream with a query sample size of one, reflecting the many client applications where responsiveness is critical. An example is offline voice transcription on Google's Pixel 4 smartphone. To measure performance, the LoadGen injects a single query; when the query is complete, it records the completion time and injects the next query. The performance metric is the query stream's 90th-percentile latency.
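The single-stream metric can be illustrated with the sketch below: queries are issued back to back, each completion latency is recorded, and the 90th percentile is taken over the recorded latencies. This is a simplified illustration, not the LoadGen itself; in particular, the percentile is computed as a plain order statistic, and the official LoadGen's exact convention may differ.

```python
import math
import time

def run_single_stream(issue_query, num_queries=1024):
    """Issue queries one at a time and record each completion latency in seconds."""
    latencies = []
    for _ in range(num_queries):
        start = time.monotonic()
        issue_query()                       # blocking inference on a single sample
        latencies.append(time.monotonic() - start)
    return latencies

def percentile(latencies, p=0.90):
    """p-th percentile as an order statistic over the sorted latencies."""
    ordered = sorted(latencies)
    idx = min(len(ordered) - 1, math.ceil(p * len(ordered)) - 1)
    return ordered[idx]
```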
Scenario            | Query Generation                | Metric                                       | Samples/Query   | Examples
Single-stream (SS)  | Sequential                      | 90th-percentile latency                      | 1               | Typing autocomplete, real-time AR
Multistream (MS)    | Arrival interval with dropping  | Number of streams subject to latency bound   | N               | Multicamera driver assistance, large-scale automation
Server (S)          | Poisson distribution            | Queries per second subject to latency bound  | 1               | Translation website
Offline (O)         | Batch                           | Throughput                                   | At least 24,576 | Photo categorization

Table 2. Scenario description and metrics. Each scenario targets a real-world use case based on customer and vendor input.

Multistream. This scenario represents applications with a stream of queries, but each query comprises multiple inferences, reflecting a variety of industrial-automation and remote-sensing applications. For example, many autonomous vehicles analyze frames from six to eight cameras that stream simultaneously.

To model a concurrent scenario, the LoadGen sends a new query comprising N input samples at a fixed time interval (e.g., 50 ms). The interval is benchmark specific and also acts as a latency bound that ranges from 50 to 100 milliseconds. If the system is available, it processes the incoming query. If it is still processing the prior query in an interval, it skips the interval and delays the remaining queries by one interval.

No more than 1% of the queries may produce one or more skipped intervals. A query's N input samples are contiguous in memory, which accurately reflects production input pipelines and avoids penalizing systems that would otherwise require that samples be copied to a contiguous memory region before starting inference. The performance metric is the integer number of streams that the system supports while meeting the QoS requirement.

Server. This scenario represents online server applications where query arrival is random and latency is important. Almost every consumer-facing website is a good example, including services such as online translation from Baidu, Google, and Microsoft. For this scenario, the load generator sends queries, with one sample each, in accordance with a Poisson distribution. The SUT responds to each query within a benchmark-specific latency bound that varies from 15 to 250 milliseconds. No more than 1% of queries may exceed the latency bound for the vision tasks and no more than 3% may do so for translation. The server scenario's performance metric is the Poisson parameter that indicates the queries per second achievable while meeting the QoS requirement.

Offline. This scenario represents batch-processing applications where all the input data is immediately available and latency is unconstrained. An example is identifying the people and locations in a photo album. For the offline scenario, the LoadGen sends to the system a single query that includes all sample-data IDs to be processed, and the system is free to process the input data in any order. Similar to the multistream scenario, neighboring samples in the query are contiguous in memory. The metric for the offline scenario is throughput measured in samples per second.

For the multistream and server scenarios, latency is a critical component of the system behavior and will constrain various performance optimizations. For example, most inference systems require a minimum (and architecture-specific) batch size to achieve full utilization of the underlying computational resources. But in a server scenario, the arrival rate of inference queries is random, so systems must carefully optimize for tail latency and potentially process inferences with a suboptimal batch size.

Table 3 shows the relevant latency constraints for each task in v0.5. As with other aspects of MLPerf, we selected these constraints on the basis of community consultation and feasibility assessments. The multistream arrival times for most vision tasks correspond to a frame rate of 15–20 Hz, which is a minimum for many applications. The server QoS constraints derive from estimates of the inference timing budget given an overall user latency target.

Task                          | Multistream Arrival Time | Server QoS Constraint
Image classification (heavy)  | 50 ms                    | 15 ms
Image classification (light)  | 50 ms                    | 10 ms
Object detection (heavy)      | 66 ms                    | 100 ms
Object detection (light)      | 50 ms                    | 10 ms
Machine translation           | 100 ms                   | 250 ms

Table 3. Latency constraints for each task in the multistream and server scenarios.
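The sketch below illustrates how query arrival times can be generated for the server scenario (Poisson arrivals, i.e., exponentially distributed inter-arrival gaps) and the multistream scenario (fixed intervals such as the 50 ms bound above), plus the server QoS check. It is an illustration of the scenario definitions, not the LoadGen's implementation; all parameter names are placeholders.

```python
import random

def server_arrivals(qps, duration_s, seed=0):
    """Poisson arrivals: exponential inter-arrival times with rate qps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while t < duration_s:
        t += rng.expovariate(qps)
        times.append(t)
    return times

def multistream_arrivals(interval_s, num_queries):
    """Fixed-interval arrivals, e.g. 0.05 s for the 50 ms vision tasks in Table 3."""
    return [i * interval_s for i in range(num_queries)]

def meets_server_qos(latencies, bound_s, allowed_overrun=0.01):
    """Server rule: at most 1% of queries (3% for translation) may exceed the bound."""
    late = sum(1 for lat in latencies if lat > bound_s)
    return late <= allowed_overrun * len(latencies)
```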
Tail-Latency Percentile | Confidence Interval | Error Margin | Inferences | Rounded Inferences
90%                     | 99%                 | 0.50%        | 23,886     | 3 × 2^13 = 24,576
95%                     | 99%                 | 0.25%        | 50,425     | 7 × 2^13 = 57,344
99%                     | 99%                 | 0.05%        | 262,742    | 33 × 2^13 = 270,336

Table 4. Query requirements for statistical confidence. All results must meet the minimum LoadGen scenario requirements.

Model                         | Single-stream (Queries / Samples per Query) | Multistream | Server    | Offline
Image classification (heavy)  | 1K / 1                                      | 270K / N    | 270K / 1  | 1 / 24K
Image classification (light)  | 1K / 1                                      | 270K / N    | 270K / 1  | 1 / 24K
Object detection (heavy)      | 1K / 1                                      | 270K / N    | 270K / 1  | 1 / 24K
Object detection (light)      | 1K / 1                                      | 270K / N    | 270K / 1  | 1 / 24K
Machine translation           | 1K / 1                                      | 90K / N     | 90K / 1   | 1 / 24K

Table 5. Number of queries and samples per query for each task and scenario.

4.6 Statistical Confidence

To ensure our results are statistically robust and adequately capture steady-state system behavior, each task and scenario combination requires a minimum number of queries. That number is determined by the tail-latency percentile, the desired margin, and the desired confidence interval.

Confidence is the probability that a latency bound is within a particular margin of the reported result. We chose a 99% confidence bound and set the margin to a value much less than the difference between the tail-latency percentage and 100%. Conceptually, that margin ought to be relatively small. Thus, we selected a margin that is one-twentieth of the difference between the tail-latency percentage and 100%.

The equations are as follows:

Margin = (1 − TailLatency) / 20    (1)

NumQueries = (NormsInv((1 − Confidence) / 2))^2 × TailLatency × (1 − TailLatency) / Margin^2    (2)

Table 4 shows the query requirements. The total query count and tail-latency percentile are scenario and task specific. The single-stream scenario only requires 1,024 queries, and the offline scenario requires a single query containing at least 24,576 samples. The single-stream scenario has the fewest queries to execute because we wanted the run time to be short enough that embedded platforms and smartphones could complete the runs quickly.

For scenarios with latency constraints, our goal is to ensure a 99% confidence interval that the constraints hold. As a result, the benchmarks with more-stringent latency constraints require more queries in a highly nonlinear fashion. The number of queries is based on the aforementioned statistics and is rounded up to the nearest multiple of 2^13.

A 99th-percentile guarantee requires 262,742 queries, which rounds up to 33 × 2^13, or 270K. For both multistream and server, this guarantee for vision tasks requires 270K queries, as Table 5 shows. Because a multistream benchmark will process N samples per query, the total number of samples will be N × 270K. Machine translation has a 97th-percentile latency guarantee and requires only 90K queries.
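Equations (1) and (2) can be checked numerically. The short sketch below reproduces the query counts in Table 4 and their rounding to multiples of 2^13; it assumes SciPy is available for the inverse normal CDF (any equivalent routine works).

```python
import math
from scipy.stats import norm   # inverse normal CDF (NormsInv)

def num_queries(tail_latency, confidence=0.99):
    margin = (1.0 - tail_latency) / 20.0                                     # Equation (1)
    z = norm.ppf((1.0 - confidence) / 2.0)                                   # NormsInv((1 - Confidence)/2)
    return round(z**2 * tail_latency * (1.0 - tail_latency) / margin**2)     # Equation (2)

for p in (0.90, 0.95, 0.99):
    n = num_queries(p)                          # 23886, 50425, 262742 (Table 4)
    rounded = math.ceil(n / 2**13) * 2**13      # 24576, 57344, 270336
    print(f"{p:.2f}: {n} -> {rounded}")
```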
For repeatability, we run both the multistream and server scenarios several times. But the multistream scenario's arrival rate and query count guarantee a 2.5- to 7-hour run time. To strike a balance between repeatability and run time, we require five runs for the server scenario, with the result being the minimum of these five runs. The other scenarios require one run. We expect to revisit this choice in future benchmark versions.

All benchmarks must also run for at least 60 seconds and process additional queries and/or samples as the scenarios require. The minimum run time ensures they will measure the equilibrium behavior of power-management systems and systems that support dynamic voltage and frequency scaling (DVFS), particularly for the single-stream scenario with a small number of queries.

4.7 Load Generator

The LoadGen is a traffic generator that loads the SUT and measures performance. Its behavior is controlled by a configuration file it reads at the start of the benchmark run. The LoadGen produces the query traffic according to the rules of the previously described scenarios (i.e., single-stream, multistream, server, and offline). Additionally, the LoadGen collects information for logging, debugging, and postprocessing the data. It records queries and responses from the SUT, and at the end of the run, it reports statistics, summarizes the results, and determines whether the run was valid.

Figure 5 shows how the LoadGen generates query traffic for each scenario. In the server scenario, for instance, it issues queries in accordance with a Poisson distribution to mimic a server's query-arrival rates. In the single-stream case, it issues a query to the SUT and waits for completion of that query before issuing another.

Figure 5. The timing and number of queries from the Load Generator (LoadGen) vary between benchmark scenarios. All five ML tasks can run in any one of the four scenarios.

4.7.1 Design

MLPerf will evolve, introducing new tasks and removing old ones as the field progresses. Accordingly, the LoadGen's design is flexible enough to handle changes to the
inference-task suite. We achieve this feat by decoupling the LoadGen from the benchmarks and the internal representations (e.g., the model, scenarios, and quality and latency metrics).

The LoadGen is implemented as a standalone C++ module with well-defined APIs; the benchmark calls it through these APIs (and vice versa through callbacks). This decoupling at the API level allows it to easily support various language bindings, permitting benchmark implementations in any language. Presently, the LoadGen supports Python, C, and C++ bindings; additional bindings can be added.

Another major benefit of decoupling the LoadGen from the benchmark is that the LoadGen is extensible to support more scenarios. Currently, MLPerf supports four of them; we may add more, such as a multitenancy mode where the SUT must continuously serve multiple models while maintaining QoS constraints.

4.7.2 Implementation

The LoadGen abstracts the details of the data set (e.g., images) behind sample IDs. Data-set samples receive an index between 0 and N. A query represents the smallest input unit that the benchmark ingests from the LoadGen. It consists of one or more data-set sample IDs, each with a corresponding response ID to differentiate between multiple instances of the same sample.

The rationale for a response ID is that for any given task and scenario—say, an image-classification multistream scenario—the LoadGen may reissue the same data (i.e., an image with a unique sample ID) multiple times across the different streams. To differentiate between them, the LoadGen must assign different reference IDs to accurately track when each sample finished processing.

At the start, the LoadGen directs the benchmark to load a list of samples into memory. Loading is untimed and the SUT may also perform allowed data preprocessing. The LoadGen then issues queries, passing sample IDs to the benchmark for execution. The queries are pre-generated to reduce overhead during the timed portion of the test.

As the benchmark finishes processing the queries, it informs the LoadGen through a function named QuerySamplesComplete. The LoadGen makes no assumptions regarding how the SUT may partition its work, so any thread can call this function with any set of samples in any order. QuerySamplesComplete is thread safe, is wait-free bounded, and makes no syscalls, allowing it to scale recording to millions of samples per second and to minimize the performance variance introduced by the LoadGen, which would affect long-tail latency.

The LoadGen maintains a logging thread that gathers events as they stream in from other threads. At the end of the benchmark run, it outputs a set of logs that report the performance and accuracy stats.
4.7.3 Operating Modes

The LoadGen has two primary operating modes: accuracy and performance. Both are necessary to make a valid MLPerf submission.

Accuracy mode. The LoadGen goes through the entire data set for the ML task. The model's task is to run inference on the complete data set. Afterward, accuracy results appear in the log files, ensuring that the model met the required quality target.

Performance mode. The LoadGen avoids going through the entire data set, as the system's performance can be determined by subjecting it to enough data-set samples.

4.7.4 Validation Features

The LoadGen has features that ensure the submission system complies with the rules. In addition, it can self-check to determine whether its source code has been modified during the submission process. To facilitate validation, the submitter provides an experimental config file that allows use of non-default LoadGen features. For v0.5, the LoadGen enables the following four tests.

Accuracy verification. The purpose of this test is to ensure valid inferences in performance mode. By default, the results that the inference system returns to the LoadGen are not logged and thus are not checked for accuracy. This choice reduces or eliminates processing overhead to allow accurate measurement of the inference system's performance. In this test, results returned from the SUT to the LoadGen are logged randomly. The log is checked against the log generated in accuracy mode to ensure consistency.

On-the-fly caching detection. By default, the LoadGen
MLPerf Inference Benchmark

duces queries by randomly selecting with replacement from 4.8.1 Prohibited Optimizations
the data set, and inference systems may receive queries
MLPerf Inference currently prohibits retraining and pruning
with duplicate samples. This outcome is likely for high-
to ensure comparability, although this restriction may fail to
performance systems that process many samples relative to
reflect realistic deployment for some large companies. The
the data-set size. To represent realistic deployments, the
interlocking requirements to use reference weights (possibly
MLPerf rules prohibit caching of queries or intermediate
with calibration) and minimum accuracy targets are most
data. The test has two parts. The first part generates queries
important for ensuring comparability in the closed division.
with unique sample indices. The second generates queries
The open division explicitly allows retraining and pruning.
with duplicate sample indices. Performance is measured
in each case. The way to detect caching is to determine We prohibit caching to simplify the benchmark design. In
whether the test with duplicate sample indices runs signifi- practice, real inference systems cache queries. For exam-
cantly faster than the test with unique sample indices. ple, “I love you” is one of Google Translate’s most frequent
queries, but the service does not translate the phrase ab initio
Alternate-random-seed testing. In ordinary operation, the
each time. Realistically modeling caching in a benchmark,
LoadGen produces queries on the basis of a fixed random
however, is a challenge because cache hit rates vary substan-
seed. Optimizations based on that seed are prohibited. The
tially with the application. Furthermore, our data sets are
alternate-random-seed test replaces the official random seed
relatively small, and large systems could easily cache them
with alternates and measures the resulting performance.
in their entirety.
4.8 Model Equivalence

The goal of MLPerf Inference is to measure realistic system-level performance across a wide variety of architectures. But the four properties of realism, comparability, architecture neutrality, and friendliness to small submission teams require careful trade-offs.

Some inference deployments involve teams of compiler, computer-architecture, and machine-learning experts aggressively co-optimizing the training and inference systems to achieve cost, accuracy, and latency targets across a massive global customer base. An unconstrained inference benchmark, however, would disadvantage companies with less experience and fewer ML-training resources.

Therefore, we set the model-equivalence rules to allow submitters to, for efficiency, reimplement models on different architectures. The rules provide a complete list of disallowed techniques and a list of allowed technique examples. We chose an explicit blacklist to encourage a wide range of techniques and to support architectural diversity. The list of examples illustrates the boundaries of the blacklist while also encouraging common and appropriate optimizations.
Examples of allowed techniques include the following: arbitrary data arrangement as well as different input and in-memory representations of weights, mathematically equivalent transformations (e.g., tanh versus logistic, ReluX versus ReluY, and any linear transformation of an activation function), approximations (e.g., replacing a transcendental function with a polynomial), processing queries out of order within the scenario's limits, replacing dense operations with mathematically equivalent sparse operations, fusing or unfusing operations, dynamically switching between one or more batch sizes, and mixing experts that combine differently quantized weights.
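For instance, the tanh-versus-logistic rewrite is mathematically exact, which a few lines of Python can confirm (an illustrative check, not part of the benchmark):

import math

def tanh_via_logistic(x):
    # tanh(x) = 2 * sigmoid(2x) - 1, so a backend may substitute one
    # activation kernel for the other without changing the mathematics.
    sigma = 1.0 / (1.0 + math.exp(-2.0 * x))
    return 2.0 * sigma - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_logistic(x) - math.tanh(x)) < 1e-12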

4.8.1 Prohibited Optimizations

MLPerf Inference currently prohibits retraining and pruning to ensure comparability, although this restriction may fail to reflect realistic deployment for some large companies. The interlocking requirements to use reference weights (possibly with calibration) and minimum accuracy targets are most important for ensuring comparability in the closed division. The open division explicitly allows retraining and pruning.

We prohibit caching to simplify the benchmark design. In practice, real inference systems cache queries. For example, "I love you" is one of Google Translate's most frequent queries, but the service does not translate the phrase ab initio each time. Realistically modeling caching in a benchmark, however, is a challenge because cache hit rates vary substantially with the application. Furthermore, our data sets are relatively small, and large systems could easily cache them in their entirety.

We also prohibit optimizations that are benchmark aware or data-set aware and that are inapplicable to production environments. For example, real query traffic is unpredictable, but for the benchmark, the traffic pattern is predetermined by the pseudorandom-number-generator seed. Optimizations that take advantage of a fixed number of queries or that use knowledge of the LoadGen implementation are prohibited. Similarly, any optimization employing statistical knowledge of the performance or accuracy data sets is prohibited. Finally, we disallow any technique that takes advantage of the upscaled images in the 1,200x1,200 COCO data set for the heavyweight object detector.

4.8.2 Preprocessing and Data Types

Ideally, a whole-system benchmark should capture all performance-relevant operations. MLPerf, however, explicitly allows untimed preprocessing. There is no vendor- or application-neutral preprocessing. For example, systems with integrated cameras can use hardware/software co-design to ensure that images arrive in memory in an ideal format; systems accepting JPEGs from the Internet cannot.

In the interest of architecture and application neutrality, we adopted a permissive approach to untimed preprocessing. Implementations may transform their inputs into system-specific ideal forms as an untimed operation.
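As a sketch of what such untimed preprocessing might look like for an image input (the normalization constants and layout choice are illustrative, not mandated):

import numpy as np

def untimed_preprocess(image_hwc_uint8):
    # Cast, normalize, and convert HWC to the CHW layout an accelerator
    # might prefer; only the inference itself is timed.
    x = image_hwc_uint8.astype(np.float32) / 255.0
    x = (x - 0.5) / 0.5
    return np.ascontiguousarray(x.transpose(2, 0, 1))

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
prepared = untimed_preprocess(image)   # performed outside the timed run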
MLPerf explicitly allows and enables quantization to a wide variety of numerical formats to ensure architecture neutrality. Submitters must pre-register their numerics to help guide accuracy-target discussions. The approved list for the closed division includes INT4, INT8, INT16, UINT8, UINT16, FP11 (sign, 5-bit mantissa, and 5-bit exponent), FP16, bfloat16, and FP32.

Quantization to lower-precision formats typically requires calibration to ensure sufficient inference quality. For each reference model, MLPerf provides a small, fixed data set that can be used to calibrate a quantized network. Additionally, it offers MobileNet versions that are prequantized to INT8, since without retraining (which we disallow) the accuracy falls dramatically.
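A simplified sketch of how a submitter might derive a symmetric INT8 scale from the provided calibration set (max-absolute-value calibration; real toolchains often use more sophisticated schemes):

import numpy as np

def calibrate_scale(calibration_batches):
    # Max-abs calibration over the small, fixed calibration set.
    max_abs = max(float(np.max(np.abs(b))) for b in calibration_batches)
    return max_abs / 127.0

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
calibration = [rng.normal(size=(8, 64)).astype(np.float32) for _ in range(4)]
scale = calibrate_scale(calibration)
x = rng.normal(size=(8, 64)).astype(np.float32)
error = np.abs(x - dequantize(quantize_int8(x, scale), scale)).max()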
5 Submission, Review, and Reporting

In this section, we describe the submission process for MLPerf Inference v0.5 (Section 5.1). All submissions are peer reviewed for validity (Section 5.2). Finally, we describe how we report the results to the public (Section 5.3).
5.1 Submission

An MLPerf Inference submission contains information about the SUT: performance scores, benchmark code, a system-description file that highlights the SUT's main configuration characteristics (e.g., accelerator count, CPU count, software release, and memory system), and LoadGen log files detailing the performance and accuracy runs for a set of task and scenario combinations. All this data is uploaded to a public GitHub repository for peer review and validation before release.
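For illustration, a system-description file might look roughly like the following; the exact schema and required field names are set by the submission rules, so treat these keys and values as placeholders:

import json

# Illustrative entries only; the authoritative field names and required
# keys come from the MLPerf submission rules.
system_description = {
    "submitter": "ExampleCorp",
    "division": "closed",
    "category": "available",
    "host_processor_model_name": "Example CPU",
    "host_processors_per_node": 2,
    "accelerator_model_name": "Example Accelerator",
    "accelerators_per_node": 8,
    "host_memory_capacity": "384 GB",
    "framework": "ExampleRuntime 1.0",
    "operating_system": "Ubuntu 18.04",
}

with open("system_description.json", "w") as f:
    json.dump(system_description, f, indent=2)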
MLPerf Inference is a suite of tasks and scenarios that ensures broad coverage, but a submission can contain a subset of the tasks and scenarios. Many traditional benchmarks, such as SPEC CPU, require submissions for all their components. This approach is logical for a general-purpose processor that runs arbitrary code, but ML systems are often highly specialized. For example, some are solely designed for vision or wake-word detection and cannot run other network types. Others target particular scenarios, such as a single-stream application, and are not intended for server-style applications (or vice versa). Accordingly, we allow submitters flexibility in selecting tasks and scenarios.
5.1.1 Divisions

MLPerf Inference has two divisions for submitting results: closed and open. Submitters can send results to either or both, but they must use the same data set. The open division, however, allows free model selection and unrestricted optimization to foster ML-system innovation.

Closed division. The closed division enables comparisons of different systems. Submitters employ the same models, data sets, and quality targets to ensure comparability across wildly different architectures. This division requires preprocessing, postprocessing, and a model that is equivalent to the reference implementation. It also permits calibration for quantization (using the calibration data set we provide) and prohibits retraining.
Open division. The open division fosters innovation in ML systems, algorithms, optimization, and hardware/software co-design. Submitters must still perform the same ML task, but they may change the model architecture and the quality targets. This division allows arbitrary pre- and postprocessing and arbitrary models, including techniques such as retraining. In general, submissions are not directly comparable with each other or with closed submissions. Each open submission must include documentation about how it deviates from the closed division. Caveat emptor!

5.1.2 Categories

Submitters must classify their submissions into one of three categories on the basis of hardware- and software-component availability: available; preview; and research, development, or other systems. This requirement helps consumers of the results identify the systems' maturity level and whether they are readily available (either for rent online or for purchase).

Available systems. Available systems are generally the most mature and have stringent hardware- and software-availability requirements.

An available cloud system must have accessible pricing (either publicly or by request), have been rented by at least one third party, have public evidence of availability (e.g., a web page or company statement saying the product is available), and be "reasonably available" for additional third parties to rent by the submission date.

An on-premises system is available if all its components that substantially determine ML performance are available either individually or in aggregate (development boards that meet the substantially determined clause are allowed). An available component or system must have available pricing (either publicly advertised or available by request), have been shipped to at least one third party, have public evidence of availability (e.g., a web page or company statement saying the product is available), and be "reasonably available" for purchase by additional third parties by the submission date. In addition, submissions for on-premises systems must describe the system and its components in sufficient detail so that third parties can build a similar system.

Available systems must use a publicly available software stack consisting of the software components that substantially determine ML performance but are absent from the source code. An available software component must be well supported for general use and available for download.

Preview systems. Preview systems contain components that will meet the criteria for the available category within 180 days or by the next submission cycle, whichever is later. This restriction applies to both the hardware and software requirements. The goal of the preview category is to enable participants to submit results for new systems without burdening product-development cycles with the MLPerf schedule. Any system submitted to preview must then be submitted to available during the next cycle.

Research, development, or other systems. Research, development, or other (RDO) systems contain components not intended for production or general availability. An example is a prototype system that is a proof of concept. An RDO system includes one or more RDO components. These components submitted in one cycle may not be submitted as available until the third cycle or until 181 days have passed, whichever is later.
5.2 Review and Validation

MLPerf Inference submissions are self- and peer-reviewed for compliance with all rules. Compliance issues are tracked and raised with submitters, who must resolve them and then resubmit results.

A challenge of benchmarking inference systems is that many include proprietary and closed-source components, such as inference engines and quantization flows, that make peer review difficult. To accommodate these systems while ensuring reproducible results that are free from common errors, we developed a validation suite to assist with peer review. Our validation tools perform experiments that help determine whether a submission complies with the defined rules. MLPerf Inference provides a suite of validation tests that submitters must run to qualify their submission as valid. MLPerf v0.5 tests the submission system using LoadGen validation features (Section 4.7.4).

In addition to LoadGen's validation features, we use custom data sets to detect result caching. This behavior is validated by replacing the reference data set with a custom data set. We measure the quality and performance of the system operating on this custom data set and compare the results with operation on the reference data set.
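The comparison can be sketched as follows; the interface and the 10% tolerance are illustrative assumptions rather than the official procedure:

def compare_data_sets(run_benchmark, reference_data, custom_data, tolerance=0.10):
    # Run the identical benchmark on the reference data set and on a
    # custom one; a system that memorized reference results shows a
    # suspicious gap between the two runs.
    ref_accuracy, ref_qps = run_benchmark(reference_data)
    cus_accuracy, cus_qps = run_benchmark(custom_data)
    gap = abs(ref_qps - cus_qps) / ref_qps
    return {
        "reference": (ref_accuracy, ref_qps),
        "custom": (cus_accuracy, cus_qps),
        "suspicious": gap > tolerance,
    }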
5.3 Reporting

All results are published on the MLPerf website following review and validation. MLPerf Inference does not require that submitters include results for all the ML tasks. Therefore, some systems lack results for certain tasks and scenarios.

MLPerf Inference does not provide a "summary score." Often in benchmarking, there is a strong desire to distill the capabilities of a complex system to a single score to enable a comparison of different systems. But not all ML tasks are equally important for all systems, and the job of weighting some more heavily than others is highly subjective.

At best, weighting and summarization are driven by the submitter catering to unique customer needs, as some systems may be optimized for specific ML tasks. For instance, some real-world systems are more highly optimized for vision than for translation. In such scenarios, averaging the results across all tasks makes no sense, as the submitter may not be targeting particular markets.

6 Results

We received over 600 submissions in all three categories (available, preview, and RDO) across the closed and open divisions. Our results are the most extensive corpus of inference performance data available to the public, covering a range of ML tasks and scenarios, hardware architectures, and software run times. Each has gone through extensive review before receiving approval as a valid MLPerf result. After review, we cleared 595 results as valid.

We evaluated the closed-division results on the basis of four of the five objectives our benchmark aimed to achieve. The exception is setting target qualities and tail-latency bounds in accordance with real use cases, which we do not discuss because a static benchmark setting applies to every inference task. Omitting that isolated objective, we present our analysis as follows:

• Pick representative workloads that everyone can access (Sections 6.1 and 6.2).

• Evaluate systems in realistic scenarios (Section 6.3).

• Allow the benchmark to flexibly showcase both hardware and software capabilities (Sections 6.4, 6.5, and 6.6).

• Permit the benchmark to change rapidly in response to the evolving ML ecosystem (Section 6.7).

6.1 Accessibility and Global Reach

A primary goal for MLPerf Inference was to create a widely available benchmark. To this end, the first round of submissions came from 14 worldwide organizations, hailing from the United States, Canada, Russia, the European Union, the Middle East, India, China, and South Korea, as Figure 6 shows.

Figure 6. MLPerf Inference's accessibility and global reach. The organizations responding to the v0.5 call for submissions hail from around the world, including the United States, Canada, the European Union, Russia, the Middle East, India, China, and South Korea. This domestic and international adoption reflects the community's perspective that the benchmark is comprehensive and scientifically rigorous, and worthy of engineering time for submissions.

The submitters represent many organizations that range from startups to original equipment manufacturers (OEMs), cloud-service providers, and system integrators. They include Alibaba, Centaur Technology, Dell EMC, dividiti, FuriosaAI, Google, Habana, Hailo, Inspur, Intel, NVIDIA, Polytechnic University of Milan, Qualcomm, and Tencent.
6.2 Task Coverage

MLPerf Inference v0.5 submitters are allowed to pick any task to evaluate their system's performance. The distribution of results across tasks can thus reveal whether those tasks are of interest to ML-system vendors.

We analyzed the submissions to determine the overall task coverage. Figure 7 shows the breakdown for the tasks and models in the closed division. Although the most popular model was, unsurprisingly, ResNet-50 v1.5, it was just under three times as popular as GNMT, the least popular model. This small spread and the otherwise uniform distribution suggest we selected a representative set of tasks.

Figure 7. Results from the closed division: the share of submissions per model (ResNet-50 v1.5: 54 results, 32.5%; MobileNet-v1: 37, 22.3%; SSD-MobileNet-v1: 29, 17.5%; SSD-ResNet34: 27, 16.3%; GNMT: 19, 11.4%). The distribution of models indicates MLPerf Inference capably selected representative workloads for the initial v0.5 benchmark release.

In addition to selecting representative tasks, another goal is to provide vendors with varying quality and performance targets. Depending on the use case, the ideal ML model may differ (as Figure 2 shows, a vast range of models can target a given task). Our results reveal that vendors equally supported different models for the same task because each model has unique quality and performance trade-offs. In the case of object detection, we saw nearly the same number of submissions for both SSD-MobileNet-v1 and SSD-ResNet34.
6.3 Scenario Usage employing the hardware without the corresponding frame-
work may still succeed, but the performance may fall short
We aim to evaluate systems in realistic use cases—a major of the hardware’s potential. The table shows that CPUs have
motivator for the LoadGen (Section 4.7) and scenarios (see
Section 4.5). To this end, Table 6 shows the distribution of
results across the various task and scenario combinations. S INGLE -S TREAM M ULTISTREAM S ERVER O FFLINE

Across all the tasks, the single-stream and offline scenarios GNMT 2 0 6 11
M OBILE N ET- V 1 18 3 5 11
are the most widely used and are also the easiest to optimize
R ES N ET-50 V 1.5 19 5 10 20
and run. Server and multistream were more complicated
SSD-M OBILE N ET- V 1 8 3 5 13
and had longer run times because of the QoS requirements SSD-R ES N ET 34 4 4 7 12
and more-numerous queries. T OTAL 51 15 33 67

GNMT garnered no multistream submissions, possibly be-


cause the constant arrival interval is unrealistic in machine Table 6. Closed-division submissions for the tasks and LoadGen
translation. Therefore, it was the only model and scenario scenarios. The high coverage of models and scenarios implies that
the benchmark captures important real-world use cases.

6.4 Processor Types

Machine-learning solutions can be deployed on a variety of platforms, ranging from fully general-purpose CPUs to programmable GPUs and DSPs, FPGAs, and fixed-function accelerators. Our results reflect this diversity.

Figure 8 shows that the MLPerf Inference submissions covered most hardware categories. The system diversity indicates that our inference benchmark suite and method for v0.5 can evaluate any processor architecture.

Figure 8. Results from the closed division: the number of results per processor type (DSP, FPGA, CPU, ASIC, and GPU), broken down by model. The results cover many processor architectures. Almost every kind—CPUs, GPUs, DSPs, FPGAs, and ASICs—appeared in the submissions.

6.5 Software Frameworks

In addition to the various hardware types, there are many ML software frameworks. Table 7 shows the variety of frameworks used to benchmark the hardware platforms. ML software plays a vital role in unleashing the hardware's performance.

Some run times are specifically designed to work with certain types of hardware to fully harness their capabilities; employing the hardware without the corresponding framework may still succeed, but the performance may fall short of the hardware's potential. The table shows that CPUs have the most framework diversity and that TensorFlow has the most architectural variety.

Table 7. Summary of software framework versus hardware architecture (ASIC, CPU, DSP, FPGA, and GPU) in the closed division, covering Arm NN, FuriosaAI, Hailo SDK, HanGuang-AI, ONNX, OpenVINO, PyTorch, SNPE, Synapse, TensorFlow, TF-Lite, and TensorRT. The hardware benchmarking involves many different frameworks. Preventing submitters from reimplementing the benchmark would have made it impossible to support the diversity of systems tested.
6.6 Diversity of Systems

The MLPerf Inference v0.5 submissions cover a broad range of systems on the power and performance scale, from mobile and edge devices to cloud computing. The performance delta between the smallest and largest inference systems is four orders of magnitude, or about 10,000x.

Table 8 shows the performance range for each task and scenario in the closed division (except for GNMT, which had no multistream submissions). For example, in the case of ResNet-50 v1.5 offline, the highest-performing system is over 10,000x faster than the lowest-performing one. Unsurprisingly, the former comprised multiple ML accelerators, whereas the latter was a low-power laptop-class CPU. This delta for single-stream is surprising given that additional accelerators cannot reduce latency, and it reflects an even more extensive range of systems than the other scenarios. In particular, the single-stream scenario includes many smartphone processors, which target very low power.

Table 8. Closed-division performance summary across tasks and scenarios. Each entry is the ratio of the highest to lowest performance. The performance range is as much as 10,000x. GNMT appears as N/A for multistream because it had no submissions.

                    Single-Stream  Multistream  Server  Offline
GNMT                2              N/A          5       2,367
MobileNet-v1        1,199          29           9       438
ResNet-50 v1.5      11,154         27           26      10,289
SSD-MobileNet-v1    8              36           25      657
SSD-ResNet34        8              44           9       147

Figure 9 shows the results across all tasks and scenarios. In cases such as the MobileNet-v1 single-stream scenario (SS), ResNet-50 v1.5 SS, and SSD-MobileNet-v1 SS, systems exhibit a large performance difference (100x). Because these models have many applications, the systems that target them cover everything from low-power embedded devices to high-performance servers. GNMT server (S) shows much less performance variation between systems.

The broad performance range implies that the selected tasks (as a starting point) for MLPerf Inference v0.5 are general enough to represent a variety of use cases and market segments. The wide array of systems also indicates that our method (LoadGen, metrics, etc.) is broadly applicable.
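The ratios in Table 8 are straightforward to reproduce from raw scores; a short sketch (with a made-up input format):

def performance_ranges(results):
    # results maps (model, scenario) -> list of closed-division scores.
    return {
        key: max(scores) / min(scores)
        for key, scores in results.items()
        if len(scores) >= 2
    }

toy = {("ResNet-50 v1.5", "Offline"): [3.2, 450.0, 33000.0]}
assert round(performance_ranges(toy)[("ResNet-50 v1.5", "Offline")]) == 10312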

Figure 9. Results from the closed division. Normalized performance distribution on a log scale (log10) across models for the single-stream (SS), multistream (MS), server (S), and offline (O) scenarios. The boxplot shows the performance distribution of all system submissions for a specific model and scenario combination. The results are normalized to the slowest system representing that combination. A wide range emerges across all tasks and scenarios. GNMT MS is absent because no submitter ran the multistream scenario.

6.7 Open Division

The open division is the vanguard of MLPerf's benchmarking efforts. It is less rigid than the closed division; we received over 400 results. The submitters ranged from startups to large organizations.

A few highlights from the open division are the use of 4-bit quantization to boost performance, an exploration of a wide range of models to perform the ML task (instead of using the reference model), and a demonstration of one system's ability to deliver high throughput even under tighter latency bounds—tighter than those in the closed-division rules.

In addition, we received a submission that pushed the limits of mobile-chipset performance. Typically, most vendors use one accelerator at a time to do inference. In this case, a vendor concurrently employed multiple accelerators to deliver high throughput in a multistream scenario—a rarity in conventional mobile use cases. Nevertheless, it shows that the MLPerf Inference open division is encouraging the industry to push the limits of systems.

In yet another interesting submission, two organizations jointly evaluated 12 object-detection models—YOLO v3 (Redmon & Farhadi, 2018), Faster-RCNN (Ren et al., 2015) with a variety of backbones, and SSD (Liu et al., 2016) with a variety of backbones—on a desktop platform. The open-division results save practitioners and researchers from having to manually perform similar explorations, while also showcasing potential techniques and optimizations.
7 Lessons Learned

We reflect on our v0.5 benchmark-development effort and share some lessons we learned from the experience.

7.1 Community-Driven Benchmark Development

There are two main approaches to building an industry-standard benchmark. One is to create the benchmark in house, release it, and encourage the community to adopt it. The other is first to consult the community and then build the benchmark through a consensus-based effort. The former approach is useful when seeding an idea, but the latter is necessary to develop an industry-standard benchmark. MLPerf Inference employed the latter.

MLPerf Inference began as a community-driven effort on July 12, 2018. We consulted more than 15 organizations. Since then, many other organizations have joined the MLPerf Inference working group. Applying the wisdom of several ML engineers and practitioners, we built the benchmark from the ground up, soliciting input from the ML-systems community as well as hardware end users. This collaborative effort led us to directly address the industry's diverse needs from the start. For instance, the LoadGen and scenarios emerged from our desire to span the many inference-benchmark needs of various organizations.

Although convincing competing organizations to agree on a benchmark is a challenge, it is still possible—as MLPerf Inference shows. Every organization has unique requirements and expectations, so reaching a consensus was sometimes tricky. In the interest of progress, everyone agreed to make decisions on the basis of "grudging consensus." These decisions were not always in favor of any one organization. Organizations would comply to keep the process moving or defer their requirements to a future version so benchmark development could continue.

Ultimately, MLPerf Inference exists because competing organizations saw beyond their self-interest and worked together to achieve a common goal: establishing the best ways to measure ML inference performance.

7.2 Perfect Is the Enemy of Good

MLPerf Inference v0.5 has a modest number of tasks and models. Early in the development process, it was slated to cover 11 ML tasks: image classification, object detection, speech recognition, machine translation, recommendation, text (e.g., sentiment) classification, language modeling, text to speech, face identification, image segmentation, and image enhancement. We chose these tasks to cover the full breadth of ML applications relevant to the industry.

As it matured, however, engineering hurdles and the participating organizations' benchmark-carrying capacity limited our effort. The engineering hurdles included specifying and developing the LoadGen system, defining the scenarios, and building the reference implementations. The LoadGen, for instance, involved 11 engineers from nine organizations. The reference implementations involved 34 people from 15 organizations contributing to our GitHub repository.

We deemed that overcoming the engineering hurdles was a priority, as they would otherwise limit our ability to represent various workloads and to grow in the long term. Hence, rather than incorporating many tasks and models right away, we trimmed the number of tasks to five and focused on developing a proper method and infrastructure.

With the hurdles out of the way, a small team or even an individual can add new models. For instance, thanks to the LoadGen and a complementary workflow-automation technology (Fursin et al., 2016), one MLPerf contributor with only three employees swept more than 60 computer-vision models in the open division.

Similarly, adding another task would require only a modest effort to integrate with the LoadGen and implement the model. This flexibility allows us to accommodate the changing ML landscape, and it saves practitioners and researchers from having to perform these explorations manually, all while showcasing potential techniques and optimizations for future versions of the closed division.

7.3 Audits and Auditability

MLPerf is committed to integrity through rigorous submitter cross-auditing and to the privacy of the auditing process. This process was uncontentious and smooth flowing. Three innovations helped ease the audit process: permissive rules, the LoadGen, and the submission checker.

Concerns arose during rule-making that submitters would discover loopholes in the blacklist, allowing them to "break" the benchmark and, consequently, undermine the legitimacy of the entire MLPerf project. Submitters worked together to patch loopholes as they appeared because all are invested in the success of the benchmark.

The LoadGen improved auditability by separating measurement and experimental setup into a shared component. The only possible error in the experimental procedure is use of the wrong LoadGen settings. The LoadGen, therefore, significantly reduced compliance issues.

Finally, MLPerf provided a script for checking submissions. The script allowed submitters to verify that they submitted all required files in the right formats along with the correct directory layouts. It also verified LoadGen settings and scanned logs for noncompliance.

The submission-checker script kept all submissions relatively uniform and allowed submitters to quickly identify and resolve potential problems. In future revisions, MLPerf will aim to expand the range of issues the submission script discovers. We also plan to include additional checker scripts and tools to further smooth the audit process.
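A toy version of such a checker is sketched below; the directory layout and the log marker it looks for are assumptions for illustration, not the official specification:

from pathlib import Path

REQUIRED_FILES = [
    "system_description.json",
    "measurements/mlperf.conf",
    "results/performance/run_1/mlperf_log_summary.txt",
    "results/accuracy/mlperf_log_accuracy.json",
]  # illustrative layout, not the official directory specification

def check_submission(root):
    # Verify that the expected files exist and scan the summary log for
    # an invalid-result marker.
    root = Path(root)
    problems = [p for p in REQUIRED_FILES if not (root / p).exists()]

    summary = root / REQUIRED_FILES[2]
    if summary.exists() and "INVALID" in summary.read_text():
        problems.append("LoadGen reported an invalid run")
    return problems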
8 Prior Art in AI/ML Benchmarking

The following summary describes prior AI/ML inference benchmarking. Each of these benchmarks has made unique contributions. MLPerf has strived to incorporate and build on the best aspects of previous work while ensuring it includes community input. Compared with earlier efforts, MLPerf brings more-rigorous performance metrics that we carefully selected for each major use case along with a much wider (but still compact) set of ML applications and models based on the community's input.

AI Benchmark. AI Benchmark (Ignatov et al., 2019) is arguably the first mobile-inference benchmark suite. It covers 21 computer-vision and AI tests grouped in 11 sections. These tests are predominantly computer-vision tasks (image recognition, face detection, and object detection), which are also well represented in the MLPerf suite. The AI Benchmark results and leaderboard focus primarily on Android smartphones and only measure inference latency. The suite provides a summary score, but it does not explicitly specify the quality targets. Relative to AI Benchmark, we aim at a wider variety of devices (submissions for v0.5 range from IoT devices to server-scale systems) and multiple scenarios. Another important distinction is that MLPerf does not endorse a summary score, as we mentioned previously.

EEMBC MLMark. EEMBC MLMark (EEMBC, 2019) is an ML benchmark suite designed to measure the performance and accuracy of embedded inference devices. It includes image-classification (ResNet-50 v1 and MobileNet-v1) and object-detection (SSD-MobileNet-v1) workloads, and its metrics are latency and throughput. Its latency and throughput modes are roughly analogous to the MLPerf single-stream and offline modes.
MLMark measures performance at explicit batch sizes, whereas MLPerf allows submitters to choose the best batch sizes for different scenarios. Also, the former imposes no target-quality restrictions, whereas the latter imposes stringent restrictions.

Fathom. An early ML benchmark, Fathom (Adolf et al., 2016) provides a suite of neural-network models that incorporate several types of layers (e.g., convolution, fully connected, and RNN). Still, it focuses on throughput rather than accuracy. Fathom was an inspiration for MLPerf: in particular, we likewise included a suite of models that comprise various layer types. Compared with Fathom, MLPerf provides both PyTorch and TensorFlow reference implementations for optimization, ensuring that the models in both frameworks are equivalent, and it also introduces a variety of inference scenarios with different performance metrics.

AIXPRT. Developed by Principled Technologies, AIXPRT (Principled Technologies, 2019) is a closed, proprietary AI benchmark that emphasizes ease of use. It consists of image-classification, object-detection, and recommender workloads. AIXPRT publishes prebuilt binaries that employ specific inference frameworks on supported platforms. The goal of this approach is apparently to allow technical press and enthusiasts to quickly run the benchmark. Binaries are built using Intel OpenVINO, TensorFlow, and NVIDIA TensorRT tool kits for the vision workloads, as well as MXNet for the recommendation system. AIXPRT runs these workloads using FP32 and INT8 numbers with optional batching and multi-instance, and it evaluates performance by measuring latency and throughput. The documentation and quality requirements are unpublished but are available to members. In contrast, MLPerf tasks are supported on any framework, tool kit, or OS; they have precise quality requirements; and they work with a variety of scenarios.

AI Matrix. AI Matrix (Alibaba, 2018) is Alibaba's AI-accelerator benchmark for both cloud and edge deployment. It takes the novel approach of offering four benchmark types. First, it includes micro-benchmarks that cover basic operators such as matrix multiplication and convolutions that come primarily from DeepBench. Second, it measures performance for common layers, such as fully connected layers. Third, it includes numerous full models that closely track internal applications. Fourth, it offers a synthetic benchmark designed to match the characteristics of real workloads. The full AI Matrix models primarily target TensorFlow and Caffe, which Alibaba employs extensively and which are mostly open source. We have a smaller model collection and focus on simulating scenarios using LoadGen.

DeepBench. Microbenchmarks such as DeepBench (Baidu, 2017) measure the library implementation of kernel-level operations (e.g., 5,124x700x2,048 GEMM) that are important for performance in production models. They are useful for efficient model development but fail to address the complexity of testing and evaluating full ML models.

TBD (Training Benchmarks for DNNs). TBD (Zhu et al., 2018) is a joint project of the University of Toronto and Microsoft Research that focuses on ML training. It provides a wide spectrum of ML models in three frameworks (TensorFlow, MXNet, and CNTK), along with a powerful tool chain for their improvement. It primarily focuses on evaluating GPU performance and only has one full model (Deep Speech 2) that covers inference. We considered including TBD's Deep Speech 2 model but lacked the time.

DawnBench. DawnBench (Coleman et al., 2017) was the first multi-entrant benchmark competition to measure the end-to-end performance of deep-learning systems. It allowed optimizations across model architectures, optimization procedures, software frameworks, and hardware platforms. DawnBench inspired MLPerf, but our benchmark offers more tasks, models, and scenarios.

To summarize, MLPerf Inference builds on the best of prior work and improves on it, in part through community-driven feedback (Section 7.1). The result has been new features, such as the LoadGen (which can run models in different scenarios), the open and closed divisions, and so on.

9 Conclusion

More than 200 ML researchers, practitioners, and engineers from academia and industry helped to bring the MLPerf Inference benchmark from concept (June 2018) to result submission (October 2019). This team, drawn from 32 organizations, developed the reference implementations and rules, and submitted over 600 performance measurements gathered on a wide range of systems. Of these performance measurements, 595 cleared the audit process as valid submissions and were approved for public consumption.

MLPerf Inference v0.5 is just the beginning. The key to any benchmark's success, especially in a rapidly changing field such as ML, is a development process that can respond quickly to changes in the ecosystem. Work has already started on the next version. We expect to update the current models (e.g., MobileNet-v1 to v2), expand the list of tasks (e.g., recommendation), increase the processing requirements by scaling the data-set sizes (e.g., 2 MP for SSD large), allow aggressive performance optimizations (e.g., retraining for quantization), simplify benchmarking through better infrastructure (e.g., a mobile app), and increase the challenge to systems by improving the metrics (e.g., measuring power and adjusting the quality targets).

We welcome your input and contributions. Visit the MLPerf website (https://mlperf.org) for additional details. Results for v0.5 are available online (https://github.com/mlperf/inference_results_v0.5).

10 Acknowledgements

MLPerf Inference is the work of many individuals from multiple organizations. In this section, we acknowledge all those who helped produce the first set of results or supported the overall benchmark development.

Alibaba T-Head
Zhi Cai, Danny Chen, Liang Han, Jimmy He, David Mao, Benjamin Shen, ZhongWei Yao, Kelly Yin, XiaoTao Zai, Xiaohui Zhao, Jesse Zhou, and Guocai Zhu.

Baidu
Newsha Ardalani, Ken Church, and Joel Hestness.

Cadence
Debajyoti Pal.

Centaur Technology
Bryce Arden, Glenn Henry, CJ Holthaus, Kimble Houck, Kyle O'Brien, Parviz Palangpour, Benjamin Seroussi, and Tyler Walker.

Dell EMC
Frank Han, Bhavesh Patel, Vilmara Rocio Sanchez, and Rengan Xu.

dividiti
Grigori Fursin and Leo Gordon.

Facebook
Soumith Chintala, Kim Hazelwood, Bill Jia, and Sean Lee.

FuriosaAI
Dongsun Kim and Sol Kim.

Google
Michael Banfield, Victor Bittorf, Bo Chen, Dehao Chen, Ke Chen, Chiachen Chou, Sajid Dalvi, Suyog Gupta, Blake Hechtman, Terry Heo, Andrew Howard, Sachin Joglekar, Allan Knies, Naveen Kumar, Cindy Liu, Thai Nguyen, Tayo Oguntebi, Yuechao Pan, Mangpo Phothilimthana, Jue Wang, Shibo Wang, Tao Wang, Qiumin Xu, Cliff Young, Ce Zheng, and Zongwei Zhou.

Hailo
Ohad Agami, Mark Grobman, and Tamir Tapuhi.

Intel
Md Faijul Amin, Thomas Atta-fosu, Haim Barad, Barak Battash, Amit Bleiweiss, Maor Busidan, Deepak R Canchi, Baishali Chaudhuri, Xi Chen, Elad Cohen, Xu Deng, Pradeep Dubey, Matthew Eckelman, Alex Fradkin, Daniel Franch, Srujana Gattupalli, Xiaogang Gu, Amit Gur, MingXiao Huang, Barak Hurwitz, Ramesh Jaladi, Rohit Kalidindi, Lior Kalman, Manasa Kankanala, Andrey Karpenko, Noam Korem, Evgeny Lazarev, Hongzhen Liu, Guokai Ma, Andrey Malyshev, Manu Prasad Manmanthan, Ekaterina Matrosova, Jerome Mitchell, Arijit Mukhopadhyay, Jitender Patil, Reuven Richman, Rachitha Prem Seelin, Maxim Shevtshov, Avi Shimalkovski, Dan Shirron, Hui Wu, Yong Wu, Ethan Xie, Cong Xu, Feng Yuan, and Eliran Zimmerman.

MediaTek
Bing Yu.

Microsoft
Scott McKay, Tracy Sharpe, and Changming Sun.

Myrtle
Peter Baldwin.

NVIDIA
Felix Abecassis, Vikram Anjur, Jeremy Appleyard, Julie Bernauer, Anandi Bharwani, Ritika Borkar, Lee Bushen, Charles Chen, Ethan Cheng, Melissa Collins, Niall Emmart, Michael Fertig, Prashant Gaikwad, Anirban Ghosh, Mitch Harwell, Po-Han Huang, Wenting Jiang, Patrick Judd, Prethvi Kashinkunti, Milind Kulkarni, Garvit Kulshreshta, Jonas Li, Allen Liu, Kai Ma, Alan Menezes, Maxim Milakov, Rick Napier, Brian Nguyen, Ryan Olson, Robert Overman, Jhalak Patel, Brian Pharris, Yujia Qi, Randall Radmer, Supriya Rao, Scott Ricketts, Nuno Santos, Madhumita Sridhara, Markus Tavenrath, Rishi Thakka, Ani Vaidya, KS Venkatraman, Jin Wang, Chris Wilkerson, Eric Work, and Bruce Zhan.

Politecnico di Milano
Emanuele Vitali.

Qualcomm
Srinivasa Chaitanya Gopireddy, Pradeep Jilagam, Chirag Patel, Harris Teague, and Mike Tremaine.

Samsung
Rama Harihara, Jungwook Hong, David Tannenbaum, Simon Waters, and Andy White.

Stanford University
Peter Bailis and Matei Zaharia.

Supermicro
Srini Bala, Ravi Chintala, Alec Duroy, Raju Penumatcha, Gayatri Pichai, and Sivanagaraju Yarramaneni.

Unaffiliated
Michael Gschwind and Justin Sang.

University of California, Berkeley / Google
David Patterson.

Xilinx
Ziheng Gao, Yiming Hu, Satya Keerthi Chand Kudupudi, Ji Lu, Lu Tian, and Treeman Zheng.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Adolf, R., Rama, S., Reagen, B., Wei, G.-Y., and Brooks, D. Fathom: Reference workloads for modern deep learning methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on, pp. 1–10. IEEE, 2016.

Alibaba. AI Matrix. https://aimatrix.ai/en-us/, 2018.

Amodei, D. and Hernandez, D. AI and compute. https://blog.openai.com/ai-and-compute/, 2018.

Apple. Core ML: Integrate machine learning models into your app. https://developer.apple.com/documentation/coreml, 2017.

Badrinarayanan, V., Kendall, A., and Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

Bai, J., Lu, F., Zhang, K., et al. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019.

Baidu. DeepBench: Benchmarking deep learning operations on different hardware. https://github.com/baidu-research/DeepBench, 2017.

Bianco, S., Cadene, R., Celona, L., and Napoletano, P. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.

Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258, 2017.

Chollet, F. et al. Keras. https://keras.io, 2015.

Page 21 of 23
MLPerf Inference Benchmark

Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. DAWNBench: An end-to-end deep learning benchmark and competition. NIPS ML Systems Workshop, 2017.

Council, T. P. P. Transaction Processing Performance Council. Web site, http://www.tpc.org, 2005.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dixit, K. M. The SPEC benchmarks. Parallel Computing, 17(10-11):1195–1209, 1991.

Dongarra, J. The LINPACK benchmark: An explanation. In Proceedings of the 1st International Conference on Supercomputing, pp. 456–474, London, UK, 1988. Springer-Verlag. ISBN 3-540-18991-2. URL http://dl.acm.org/citation.cfm?id=647970.742568.

EEMBC. Introducing the EEMBC MLMark benchmark. https://www.eembc.org/mlmark/index.php, 2019.

Fursin, G., Lokhmotov, A., and Plowman, E. Collective Knowledge: Towards R&D sustainability. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 864–869. IEEE, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. IEEE, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.

Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Wu, M., Xu, L., and Van Gool, L. AI Benchmark: All about deep learning on smartphones in 2019. arXiv preprint arXiv:1910.06663, 2019.

Intel. Intel Math Kernel Library. https://software.intel.com/en-us/mkl, 2018a.

Intel. Intel Distribution of OpenVINO Toolkit. https://software.intel.com/en-us/openvino-toolkit, 2018b.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.

Khudia, D. S., Basu, P., and Deng, S. Open-sourcing FBGEMM for state-of-the-art server-side inference. https://engineering.fb.com/ml-applications/fbgemm/, 2018.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., and Grundmann, M. On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989, 2019a.

Lee, K., Rao, V., and Arnold, W. C. Accelerating Facebook's infrastructure with application-specific hardware. https://engineering.fb.com/data-center-engineering/accelerating-infrastructure/, 3 2019b.

Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 2014.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.

Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D., Tang, H., Wei, G.-Y., Bailis, P., Bittorf, V., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Jia, B., Kang, D., Kanter, D., Kumar, N., Liao, J., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V. J., Robie, T., John, T. S., Wu, C.-J., Xu, L., Young, C., and Zaharia, M. MLPerf training benchmark, 2019.

MLPerf. MLPerf reference: ResNet in TensorFlow. https://github.com/mlperf/training/tree/master/image_classification/tensorflow/official, 2019.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

NVIDIA. NVIDIA TensorRT: Programmable inference accelerator. https://developer.nvidia.com/tensorrt.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Post, M. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.

Principled Technologies. AIXPRT community preview. https://www.principledtechnologies.com/benchmarkxprt/aixprt/, 2019.

Qualcomm. Snapdragon Neural Processing Engine SDK reference guide. https://developer.qualcomm.com/docs/snpe/overview.html.

Redmon, J. and Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Seide, F. and Agarwal, A. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.

Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: A next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), volume 5, pp. 1–6, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

WMT. First Conference on Machine Translation, 2016. URL http://www.statmt.org/wmt16/.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.

Xu, D., Anguelov, D., and Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 244–253, 2018.

Zhu, H., Akrout, M., Zheng, B., Pelegris, A., Jayarajan, A., Phanishayee, A., Schroeder, B., and Pekhimenko, G. Benchmarking and analyzing deep neural network training. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100. IEEE, 2018.
