MLPerf Inference Benchmark
Vijay Janapa Reddi 1 Christine Cheng 2 David Kanter 3 Peter Mattson 4 Guenther Schmuelling 5
Carole-Jean Wu 6 Brian Anderson 4 Maximilien Breughe 7 Mark Charlebois 8 William Chou 8
Ramesh Chukka 2 Cody Coleman 9 Sam Davis 10 Pan Deng 11 Greg Diamos 12 Jared Duke 4 Dave Fick 13
J. Scott Gardner 14 Itay Hubara 15 Sachin Idgunji 7 Thomas B. Jablin 4 Jeff Jiao 16 Tom St. John 17
Pankaj Kanwar 4 David Lee 18 Jeffery Liao 19 Anton Lokhmotov 20 Francisco Massa 6 Peng Meng 11
Paulius Micikevicius 7 Colin Osborne 21 Gennady Pekhimenko 22 Arun Tejusve Raghunath Rajan 2
Dilip Sequeira 7 Ashish Sirasao 23 Fei Sun 24 Hanlin Tang 2 Michael Thomson 25 Frank Wei 26 Ephrem Wu 23
Lingjie Xu 26 Koichi Yamada 2 Bing Yu 18 George Yuan 7 Aaron Zhong 16 Peizhao Zhang 6 Yuchen Zhou 27
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of
different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that
incorporate existing models span at least three orders of magnitude in power consumption and four orders of magnitude
in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more
software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system
performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for
industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. Driven by more
than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf implements a set of rules and
practices to ensure comparability across systems with wildly differing architectures. In this paper, we present the method
and design principles of the initial MLPerf Inference release. The first call for submissions garnered more than 600
inference-performance measurements from 14 organizations, representing over 30 systems that show a range of capabilities.
The challenge is the ecosystem's many possible combinations of machine-learning tasks, models, data sets, frameworks, tool sets, libraries, architectures, and inference engines, which make inference benchmarking almost intractable. The spectrum of ML tasks is broad, including but not limited to image classification and localization, object detection and segmentation, machine translation, automatic speech recognition, text to speech, and recommendations. Even for a specific task, such as image classification, many ML models are viable. These models serve in a variety of scenarios that range from taking a single picture on a smartphone to continuously and concurrently detecting pedestrians through multiple cameras in an autonomous vehicle. Consequently, ML tasks have vastly different quality requirements and real-time-processing demands. Even implementations of functions and operations that the models typically rely on can be highly framework specific, and they increase the complexity of the design and the task.

Both academic and industrial organizations have developed ML inference benchmarks. Examples include AI Matrix (Alibaba, 2018), EEMBC MLMark (EEMBC, 2019), and AIXPRT (Principled Technologies, 2019) from industry, as well as AI Benchmark (Ignatov et al., 2019), TBD (Zhu et al., 2018), Fathom (Adolf et al., 2016), and DAWNBench (Coleman et al., 2017) from academia. Each one has made substantial contributions to ML benchmarking, but they were developed without input from ML-system designers. As a result, there is no consensus on representative models, metrics, tasks, and rules across these benchmarks. For example, some efforts focus too much on specific ML applications (e.g., computer vision) or specific domains (e.g., embedded inference). Moreover, it is important to devise the right performance metrics for inference so the evaluation accurately reflects how these models operate in practice. Latency, for instance, is the primary metric in many initial benchmarking efforts, but latency-bounded throughput is more relevant for many cloud inference scenarios.

Therefore, two critical needs remain unmet: (i) standard evaluation criteria for ML inference systems and (ii) an extensive (but reasonable) set of ML applications/models that cover existing inference systems across all major domains. MLPerf Inference answers the call with a benchmark suite that complements MLPerf Training (Mattson et al., 2019). Jointly developed by industry with input from academic researchers, the benchmark drew on more than 30 organizations as well as more than 200 ML engineers and practitioners during its design and engineering process. This community architected MLPerf Inference to measure inference performance across a wide variety of ML hardware, software, systems, and services. The benchmark suite defines a set of tasks (models, data sets, scenarios, and quality targets) that represent real-world deployments, and it specifies the evaluation metrics. In addition, the benchmark suite comes with permissive rules that allow comparison of different architectures under realistic scenarios.

Unlike traditional SPEC CPU–style benchmarks that run out of the box (Dixit, 1991), MLPerf promotes competition by allowing vendors to reimplement and optimize the benchmark for their system and then submit the results. To make results comparable, it defines detailed rules. It provides guidelines on how to benchmark inference systems, including when to start the performance-measurement timing, what preprocessing to perform before invoking the model, and which transformations and optimizations to employ. Such meticulous specifications help ensure comparability across ML systems because all follow the same rules.

We describe the design principles and architecture of the MLPerf Inference benchmark's initial release (v0.5). We received over 600 submissions across a variety of tasks, frameworks, and platforms from 14 organizations. Audit tests validated the submissions, and the tests cleared 595 of them as valid. The final results show a four-orders-of-magnitude performance variation ranging from embedded devices and smartphones to data-center systems. MLPerf Inference adopts the following principles for a tailored approach to industry-standard benchmarking:

1. Pick representative workloads that everyone can access.
2. Evaluate systems in realistic scenarios.
3. Set target qualities and tail-latency bounds in accordance with real use cases.
4. Allow the benchmarks to flexibly showcase both hardware and software capabilities.
5. Permit the benchmarks to change rapidly in response to the evolving ML ecosystem.

The rest of the paper is organized as follows: Section 2 provides background, describing the differences in ML training versus ML inference and the challenges to creating a benchmark that covers the broad ML inference landscape. Section 3 describes the goals of MLPerf Inference. Section 4 presents MLPerf's underlying inference-benchmark architecture and reveals the design choices for version 0.5. Section 5 summarizes the submission, review, and reporting process. Section 6 highlights v0.5 submission results to demonstrate that MLPerf Inference is a well-crafted industry benchmark. Section 7 shares the important lessons learned and prescribes a tentative roadmap for future work. Section 8 compares MLPerf Inference with prior efforts. Section 9 concludes the paper. Section 10 acknowledges the individuals who contributed to the benchmark's development or validated the effort by submitting results.
Figure 1. Stages of a typical ML pipeline (input data, sanitization/feature extraction, training, and inference, yielding model predictions and metrics). The first stage involves gathering data to train the models. The raw data is often noisy, so it requires processing before training a deep neural network (DNN). The hardware landscape for DNN training and inference is diverse.
2.1 ML Pipeline

Machine learning generally involves a series of complicated tasks (Figure 1). Nearly every ML pipeline begins by acquiring data to train and test the models. Raw data is typically sanitized and normalized before use because real-world data often contains errors, irrelevancies, or biases that reduce the quality and accuracy of ML models.

ML benchmarking focuses on two phases: training and inference. During training, models learn to make predictions from inputs. For example, a model may learn to predict the subject of a photograph or the most fluent translation of a sentence from English to German. During inference, models make predictions about their inputs, but they no longer learn. This phase is increasingly crucial as ML moves from research to practice, serving trillions of queries daily. Despite its apparent simplicity relative to training, the task of balancing latency, throughput, and accuracy for real-world applications makes optimizing inference difficult.

2.2 ML Inference Benchmarking Complexity

Creating a useful ML benchmark involves four critical challenges: (1) the diversity of models, (2) the variety of deployment scenarios, (3) the array of inference systems, and (4) the lack of a standard inference workflow.

2.2.1 Diversity of Models

Even for a single task, such as image classification, numerous models present different trade-offs between accuracy and computational complexity, as Figure 2 shows. These

Choosing the right model depends on the application. For example, pedestrian detection in autonomous vehicles has a much higher accuracy requirement than does labeling animals in photographs, owing to the different consequences of wrong predictions. Similarly, quality-of-service requirements for inference vary by several orders of magnitude, from effectively no latency requirement for offline processes to milliseconds for real-time applications. Covering this design space necessitates careful selection of models that represent realistic scenarios.

Another challenge is that models vary wildly, so it is difficult to draw meaningful comparisons. In many cases, such as in Figure 2, a small accuracy change (e.g., a few percent) can drastically change the computational requirements (e.g., 5–10x). For example, SE-ResNeXt-50 (Hu et al., 2018; Xie et al., 2017) and Xception (Chollet, 2017) achieve roughly the same accuracy (∼79%) but exhibit a 2x difference in computational requirements (∼4 Gflops versus ∼8 Gflops).

2.2.2 Diversity of Deployment Scenarios

In addition to accuracy and computational complexity, the availability and arrival patterns of the input data vary with the deployment scenario. For example, in offline batch processing such as photo categorization, all the data may be readily available in (network) storage, allowing accelerators to reach and maintain peak performance. By contrast, translation, image tagging, and other web applications may experience variable arrival patterns based on end-user traffic.

Similarly, real-time applications such as augmented reality and autonomous vehicles handle a constant flow of data
[Figure: options at each level of the ML software/hardware stack — ML data sets (ImageNet, COCO, VOC, KITTI, WMT); ML models (ResNet, GoogleNet, SqueezeNet, MobileNet, SSD, GNMT); ML frameworks (TensorFlow, PyTorch, Caffe, MXNet, CNTK, PaddlePaddle, Theano); graph formats (NNEF, ONNX); graph compilers (XLA, nGraph, Glow, TVM); optimized libraries (MKL-DNN, CUDA, cuBLAS, OpenBLAS, Eigen); operating systems (Linux, Windows, MacOS, Android, RTOS); and hardware targets (CPUs, GPUs, TPUs, NPUs, DSPs, FPGAs, and accelerators).]
At startup, the LoadGen requests that the SUT load samples into memory. The MLPerf Inference rules allow them to be loaded into DRAM as an untimed operation. The SUT loads the samples into DRAM and may perform other untimed operations as the rules stipulate. These untimed operations may include but are not limited to compilation, cache warmup, and preprocessing.

The SUT signals the LoadGen when it is ready to receive the first query. A query is a request for inference on one or more samples.

Designing ML benchmarks is fundamentally different from designing non-ML benchmarks. MLPerf defines high-level tasks (e.g., image classification) that a machine-learning system can perform. For each one, we provide a canonical reference model in a few widely used frameworks. The reference model and weights offer concrete instantiations of the ML task, but formal mathematical equivalence is unnecessary. For example, a fully connected layer can be implemented with different cache-blocking and evaluation strategies. Consequently, submitting results requires opti-
Model              Area     Task                           Parameters   Data set (resolution)   Quality target                          Compute per input
ResNet-50 v1.5     Vision   Image classification (heavy)   25.6M        ImageNet (224x224)      99% of FP32 (76.456%) Top-1 accuracy    7.8 GOPS
MobileNet-v1 224   Vision   Image classification (light)   4.2M         ImageNet (224x224)      98% of FP32 (71.676%) Top-1 accuracy    1.138 GOPS
SSD-ResNet34       Vision   Object detection (heavy)       36.3M        COCO (1,200x1,200)      99% of FP32 (0.20 mAP)                  433 GOPS

Table 1. ML tasks in MLPerf Inference v0.5. Each one reflects critical commercial and research use cases for a large class of submitters, and together they also capture a broad set of computing motifs (e.g., CNNs and RNNs).
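The relative quality targets in Table 1 imply absolute accuracy floors; a small check of the arithmetic (our own illustration; the 75.69% value is quoted as "at least 75.70%" in Section 4.4, presumably after rounding):

```python
# Absolute Top-1 floors implied by Table 1's relative quality targets.
fp32_top1 = {"ResNet-50 v1.5": (76.456, 0.99), "MobileNet-v1 224": (71.676, 0.98)}
for model, (fp32_acc, target) in fp32_top1.items():
    print(f"{model}: >= {target * fp32_acc:.2f}% Top-1 required")
# ResNet-50 v1.5: >= 75.69% Top-1 required
# MobileNet-v1 224: >= 70.24% Top-1 required
```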
varying compute and accuracy options—we selected the full-width, full-resolution MobileNet-v1-1.0-224. This network reduces the parameters by 6.1x and the operations by 6.8x compared with ResNet-50 v1.5. We evaluated both MobileNet-v1 and v2 (Sandler et al., 2018) for the MLPerf Inference v0.5 suite and selected the former, as it has garnered wider adoption.

4.3.2 Object Detection

Object detection is a complex vision task that determines the coordinates of bounding boxes around objects in an image and classifies the image. Object detectors typically use a pretrained image-classifier network as a backbone or a feature extractor, then perform regression for localization and bounding-box selection. Object detection is crucial for automotive applications, such as detecting hazards and analyzing traffic, and for mobile-retail tasks, such as identifying items in a picture.

For object detection, we chose the COCO data set (Lin et al., 2014) with both a lightweight and heavyweight model. Our small model uses the 300x300 image size, which is typical of resolutions in smartphones and other compact devices. For the larger model, we upscale the data set to more closely represent the output of a high-definition image sensor (1.44 MP total). The choice of the larger input size is based on community feedback, especially from automotive and industrial-automation customers. The quality metric for object detection is mean average precision (mAP).

The heavyweight object detector's reference model is SSD (Liu et al., 2016) with a ResNet34 backbone, which also comes from our training benchmark. The lightweight object detector's reference model uses a MobileNet-v1-1.0 backbone, which is more typical for constrained computing environments. We selected the MobileNet feature detector on the basis of feedback from the mobile and embedded communities.

4.3.3 Translation

Neural machine translation (NMT) is popular in the rapidly evolving field of natural-language processing. NMT models translate a sequence of words from a source language to a target language and are used in translation applications and services. Our translation data set is WMT16 EN-DE (WMT, 2016). The quality measurement is the Bilingual Evaluation Understudy score (BLEU) (Papineni et al., 2002). In MLPerf Inference, we specifically employ SacreBLEU (Post, 2018). For the translation, we chose GNMT (Wu et al., 2016), which employs a well-established recurrent-neural-network (RNN) architecture and is part of the training benchmark. GNMT is representative of RNNs, which are popular for sequential and time-series data, and it ensures our reference model suite captures a wide variety of compute motifs.

4.4 Quality Targets

Many architectures can trade model quality for lower latency, lower TCO, or greater throughput. To reflect this important aspect of real-world deployments, we established per-model and scenario targets for latency and model quality. We adopted quality targets that for 8-bit quantization were achievable with considerable effort.

MLPerf Inference requires that almost all implementations achieve a quality target within 1% of the FP32 reference model's accuracy (e.g., the ResNet-50 v1.5 model achieves 76.46% Top-1 accuracy, and an equivalent model must achieve at least 75.70% Top-1 accuracy). Initial experiments, however, showed that for mobile-focused networks—MobileNet and SSD-MobileNet—the accuracy loss was unacceptable without retraining. We were unable to proceed with the low accuracy because performance benchmarking would become unrepresentative.

To address the accuracy drop, we took three steps. First, we trained the MobileNet models for quantization-friendly weights, enabling us to narrow the quality window to 2%. Second, to reduce the training sensitivity of MobileNet-based submissions, we provided equivalent MobileNet and SSD-MobileNet implementations quantized to an 8-bit integer format. Third, for SSD-MobileNet, we reduced the quality requirement to 22.0 mAP to account for the challenges of using MobileNets as a backbone.

To improve the submission comparability, we disallow retraining. Our prior experience and feasibility studies confirmed that for 8-bit integer arithmetic, which was an expected deployment path for many systems, the ∼1% relative-accuracy target was easily achievable without retraining.

4.5 Scenarios and Metrics

The diverse inference applications have various usage models and figures of merit, which in turn require multiple performance metrics. To address these models, we specify four scenarios that represent important inference applications. Each one has a unique performance metric, as Table 2 illustrates. The LoadGen discussed in Section 4.7 simulates the scenarios and measures the performance.

Single-stream. This scenario represents one inference-query stream with a query sample size of one, reflecting the many client applications where responsiveness is critical. An example is offline voice transcription on Google's Pixel 4 smartphone. To measure performance, the LoadGen injects a single query; when the query is complete, it records the completion time and injects the next query. The performance metric is the query stream's 90th-percentile latency.
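The single-stream measurement loop is simple enough to sketch. The snippet below is illustrative only (the real harness is the LoadGen, not this code); run_inference is a hypothetical stand-in for the SUT:

```python
import time

def single_stream_p90_ms(run_inference, samples):
    """Issue one query at a time and report the 90th-percentile latency in ms."""
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        run_inference(sample)                    # hypothetical blocking SUT call
        latencies.append((time.perf_counter() - start) * 1e3)
    latencies.sort()
    return latencies[min(len(latencies) - 1, int(0.90 * len(latencies)))]
```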
Scenario             Query generation                 Metric                                       Samples per query   Example use cases
Single-stream (SS)   Sequential                       90th-percentile latency                      1                   Typing autocomplete, real-time AR
Multistream (MS)     Arrival interval with dropping   Number of streams subject to latency bound   N                   Multicamera driver assistance, large-scale automation

Table 2. Scenario description and metrics. Each scenario targets a real-world use case based on customer and vendor input.
Multistream. This scenario represents applications with a stream of queries, but each query comprises multiple inferences, reflecting a variety of industrial-automation and remote-sensing applications. For example, many autonomous vehicles analyze frames from six to eight cameras that stream simultaneously.

To model a concurrent scenario, the LoadGen sends a new query comprising N input samples at a fixed time interval (e.g., 50 ms). The interval is benchmark specific and also acts as a latency bound that ranges from 50 to 100 milliseconds. If the system is available, it processes the incoming query. If it is still processing the prior query in an interval, it skips the interval and delays the remaining queries by one interval.

No more than 1% of the queries may produce one or more skipped intervals. A query's N input samples are contiguous in memory, which accurately reflects production input pipelines and avoids penalizing systems that would otherwise require that samples be copied to a contiguous memory region before starting inference. The performance metric is the integer number of streams that the system supports while meeting the QoS requirement.
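A minimal simulation of the interval-skipping rule may help; this is a sketch of our reading of the rule (the authoritative logic lives in the LoadGen), with query_latency_ms a hypothetical per-query latency model:

```python
def multistream_qos_ok(query_latency_ms, num_queries, interval_ms=50.0,
                       max_skip_fraction=0.01):
    """Simulate fixed-interval multistream issue with interval skipping."""
    issue_time = 0.0            # when the current query is issued (ms)
    queries_with_skips = 0
    for _ in range(num_queries):
        finish = issue_time + query_latency_ms()   # hypothetical per-query latency
        skipped = 0
        # If the SUT is still busy at the next scheduled issue point, that
        # interval is skipped and the remaining queries slip by one interval.
        while finish > issue_time + interval_ms:
            issue_time += interval_ms
            skipped += 1
        if skipped:
            queries_with_skips += 1
        issue_time += interval_ms                  # next scheduled issue point
    return queries_with_skips / num_queries <= max_skip_fraction
```

Sweeping the number of samples per query (N, which determines the per-query latency) upward until this check fails would yield the reported metric: the largest integer number of streams that still meets the bound.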
Server. This scenario represents online server applications where query arrival is random and latency is important. Almost every consumer-facing website is a good example, including services such as online translation from Baidu, Google, and Microsoft. For this scenario, the load generator sends queries, with one sample each, in accordance with a Poisson distribution. The SUT responds to each query within a benchmark-specific latency bound that varies from 15 to 250 milliseconds. No more than 1% of queries may exceed the latency bound for the vision tasks, and no more than 3% may do so for translation. The server scenario's performance metric is the Poisson parameter that indicates the queries per second achievable while meeting the QoS requirement.

Offline. This scenario represents batch-processing applications where all the input data is immediately available and latency is unconstrained. An example is identifying the people and locations in a photo album. For the offline scenario, the LoadGen sends to the system a single query that includes all sample-data IDs to be processed, and the system is free to process the input data in any order. Similar to the multistream scenario, neighboring samples in the query are contiguous in memory. The metric for the offline scenario is throughput measured in samples per second.

For the multistream and server scenarios, latency is a critical component of the system behavior and will constrain various performance optimizations. For example, most inference systems require a minimum (and architecture-specific) batch size to achieve full utilization of the underlying computational resources. But in a server scenario, the arrival rate of inference queries is random, so systems must carefully optimize for tail latency and potentially process inferences with a suboptimal batch size.

Table 3 shows the relevant latency constraints for each task in v0.5. As with other aspects of MLPerf, we selected these constraints on the basis of community consultation and feasibility assessments. The multistream arrival times for most vision tasks correspond to a frame rate of 15–20 Hz, which is a minimum for many applications. The server QoS constraints derive from estimates of the inference timing budget given an overall user latency target.

Task                            Multistream arrival time   Server QoS constraint
Image classification (heavy)    50 ms                      15 ms
Image classification (light)    50 ms                      10 ms
Object detection (heavy)        66 ms                      100 ms
Object detection (light)        50 ms                      10 ms
Machine translation             100 ms                     250 ms

Table 3. Latency constraints for each task in the multistream and server scenarios.
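To make the server metric concrete, the following sketch generates Poisson query arrivals at a candidate rate and checks the tail-latency rule (e.g., at most 1% of queries over the 15 ms bound for heavyweight image classification in Table 3). It is an illustration of the metric, not the LoadGen implementation; service_time_ms is a hypothetical latency model:

```python
import random

def server_qos_ok(service_time_ms, rate_qps, duration_s,
                  latency_bound_ms=15.0, max_violation_fraction=0.01):
    """Single-worker sketch: Poisson arrivals, FIFO service, tail-latency check."""
    arrival_ms, free_at_ms = 0.0, 0.0
    total, violations = 0, 0
    end_ms = duration_s * 1e3
    while True:
        arrival_ms += random.expovariate(rate_qps) * 1e3   # exponential gaps (ms)
        if arrival_ms >= end_ms:
            break
        start_ms = max(arrival_ms, free_at_ms)             # queue if the SUT is busy
        free_at_ms = start_ms + service_time_ms()          # hypothetical service time
        latency_ms = free_at_ms - arrival_ms               # queueing delay + service
        total += 1
        violations += latency_ms > latency_bound_ms
    return total > 0 and violations / total <= max_violation_fraction
```

The reported server-scenario result is then, conceptually, the largest rate_qps (the Poisson parameter) for which the check still passes.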
Tail-latency percentile   Confidence interval   Error margin   Inferences   Rounded inferences
95%                       99%                   0.25%          50,425       7 x 2^13 = 57,344
99%                       99%                   0.05%          262,742      33 x 2^13 = 270,336

Table 4. Query requirements for statistical confidence. All results must meet the minimum LoadGen scenario requirements.

Model                          Single-stream   Multistream   Server     Offline
                               (queries/samples per query)
Image classification (light)   1K / 1          270K / N      270K / 1   1 / 24K
Object detection (heavy)       1K / 1          270K / N      270K / 1   1 / 24K
Object detection (light)       1K / 1          270K / N      270K / 1   1 / 24K
Machine translation            1K / 1          90K / N       90K / 1    1 / 24K

Table 5. Number of queries and samples per query for each task.
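The inference counts in Table 4 are consistent with the usual normal-approximation confidence bound for estimating a tail percentile from Bernoulli observations, rounded up to a multiple of 2^13. The following check is our reconstruction of that arithmetic, not a formula quoted from the text:

```python
from math import ceil
from statistics import NormalDist

def required_inferences(percentile, confidence, margin):
    """Normal-approximation sample size for estimating a tail percentile."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # two-sided critical value
    p = percentile
    return round(z * z * p * (1 - p) / (margin ** 2))

for p, margin in [(0.95, 0.0025), (0.99, 0.0005)]:
    n = required_inferences(p, 0.99, margin)
    rounded = ceil(n / 2**13) * 2**13                  # round up to a 2^13 multiple
    print(p, n, rounded)
# 0.95 50425 57344     (7 x 2^13)
# 0.99 262742 270336   (33 x 2^13)
```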
Figure 5. The timing and number of queries from the Load Generator (LoadGen) vary between benchmark scenarios. All five ML tasks can run in any one of the four scenarios. (Panels: single-stream issues one-sample queries back to back, where tj is the processing time of the jth query; multistream issues N-sample queries at a constant, per-benchmark interval t; server draws arrival times t0, t1, t2, ... from a Poisson(λ) process; offline issues a single query containing all samples.)

task suite. We achieve this feat by decoupling the LoadGen from the benchmarks and the internal representations (e.g., the model, scenarios, and quality and latency metrics).

The LoadGen is implemented as a standalone C++ module with well-defined APIs; the benchmark calls it through these APIs (and vice versa through callbacks). This decoupling at the API level allows it to easily support various language bindings, permitting benchmark implementations in any language. Presently, the LoadGen supports Python, C, and C++ bindings; additional bindings can be added.
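To make the division of labor concrete, the sketch below shows the shape of a Python harness built on these bindings: a query-sample library (QSL) with load/unload callbacks, an SUT with an issue-queries callback, and completions reported through QuerySamplesComplete. Treat it as a sketch under assumptions: the exact constructor signatures differ across LoadGen releases (some versions of ConstructSUT take an extra latency-reporting callback), and preprocess/run_model are hypothetical stand-ins for the submitter's code.

```python
import mlperf_loadgen as lg   # Python binding name assumed; verify against your install

samples_in_memory = {}

def preprocess(sample_index):            # hypothetical: fetch and format one input
    return sample_index

def run_model(model_input):              # hypothetical stand-in for the SUT's inference
    return model_input

def load_samples(sample_indices):        # untimed: called before the measured run
    for idx in sample_indices:
        samples_in_memory[idx] = preprocess(idx)

def unload_samples(sample_indices):
    for idx in sample_indices:
        samples_in_memory.pop(idx, None)

def issue_queries(query_samples):        # timed: called by the LoadGen with sample IDs
    responses = []
    for qs in query_samples:
        run_model(samples_in_memory[qs.index])
        # The response ID (qs.id) lets the LoadGen track repeated samples separately.
        responses.append(lg.QuerySampleResponse(qs.id, 0, 0))
    lg.QuerySamplesComplete(responses)   # may be called from any thread, in any order

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.SingleStream
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```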
Another major benefit of decoupling the LoadGen from the benchmark is that the LoadGen is extensible to support more scenarios. Currently, MLPerf supports four of them; we may add more, such as a multitenancy mode where the SUT must continuously serve multiple models while maintaining QoS constraints.

The LoadGen abstracts the details of the data set (e.g., images) behind sample IDs. Data-set samples receive an index between 0 and N. A query represents the smallest input unit that the benchmark ingests from the LoadGen. It consists of one or more data-set sample IDs, each with a corresponding response ID to differentiate between multiple instances of the same sample.

The rationale for a response ID is that for any given task and scenario—say, an image-classification multistream scenario—the LoadGen may reissue the same data (i.e., an image with a unique sample ID) multiple times across the different streams. To differentiate between them, the LoadGen must assign different reference IDs to accurately track when each sample finished processing.

At the start, the LoadGen directs the benchmark to load a list of samples into memory. Loading is untimed and the SUT may also perform allowed data preprocessing. The LoadGen then issues queries, passing sample IDs to the benchmark for execution on the inference hardware. The queries are pre-generated to reduce overhead during the timed portion of the test.

As the benchmark finishes processing the queries, it informs the LoadGen through a function named QuerySamplesComplete. The LoadGen makes no assumptions regarding how the SUT may partition its work, so any thread can call this function with any set of samples in any order. QuerySamplesComplete is thread safe, is wait-free bounded, and makes no syscalls, allowing it to scale recording to millions of samples per second and to minimize the performance variance introduced by the LoadGen, which would affect long-tail latency.

The LoadGen maintains a logging thread that gathers events as they stream in from other threads. At the end of the benchmark run, it outputs a set of logs that report the performance and accuracy stats.

4.7.3 Operating Modes

The LoadGen has two primary operating modes: accuracy and performance. Both are necessary to make a valid MLPerf submission.

Accuracy mode. The LoadGen goes through the entire data set for the ML task. The model's task is to run inference on the complete data set. Afterward, accuracy results appear in the log files, ensuring that the model met the required quality target.

Performance mode. The LoadGen avoids going through the entire data set, as the system's performance can be determined by subjecting it to enough data-set samples.

The LoadGen has features that ensure the submission system complies with the rules. In addition, it can self-check to determine whether its source code has been modified during the submission process. To facilitate validation, the submitter provides an experimental config file that allows use of non-default LoadGen features. For v0.5, the LoadGen enables the following four tests.

Accuracy verification. The purpose of this test is to ensure valid inferences in performance mode. By default, the results that the inference system returns to the LoadGen are not logged and thus are not checked for accuracy. This choice reduces or eliminates processing overhead to allow accurate measurement of the inference system's performance. In this test, results returned from the SUT to the LoadGen are logged randomly. The log is checked against the log generated in accuracy mode to ensure consistency.
On-the-fly caching detection. By default, LoadGen produces queries by randomly selecting with replacement from the data set, and inference systems may receive queries with duplicate samples. This outcome is likely for high-performance systems that process many samples relative to the data-set size. To represent realistic deployments, the MLPerf rules prohibit caching of queries or intermediate data. The test has two parts. The first part generates queries with unique sample indices. The second generates queries with duplicate sample indices. Performance is measured in each case. The way to detect caching is to determine whether the test with duplicate sample indices runs significantly faster than the test with unique sample indices.

Alternate-random-seed testing. In ordinary operation, the LoadGen produces queries on the basis of a fixed random seed. Optimizations based on that seed are prohibited. The alternate-random-seed test replaces the official random seed with alternates and measures the resulting performance.

4.8 Model Equivalence

The goal of MLPerf Inference is to measure realistic system-level performance across a wide variety of architectures. But the four properties of realism, comparability, architecture neutrality, and friendliness to small submission teams require careful trade-offs.

Some inference deployments involve teams of compiler, computer-architecture, and machine-learning experts aggressively co-optimizing the training and inference systems to achieve cost, accuracy, and latency targets across a massive global customer base. An unconstrained inference benchmark, however, would disadvantage companies with less experience and fewer ML-training resources.

Therefore, we set the model-equivalence rules to allow submitters to, for efficiency, reimplement models on different architectures. The rules provide a complete list of disallowed techniques and a list of allowed technique examples. We chose an explicit blacklist to encourage a wide range of techniques and to support architectural diversity. The list of examples illustrates the boundaries of the blacklist while also encouraging common and appropriate optimizations.

Examples of allowed techniques include the following: arbitrary data arrangement as well as different input and in-memory representations of weights, mathematically equivalent transformations (e.g., tanh versus logistic, ReluX versus ReluY, and any linear transformation of an activation function), approximations (e.g., replacing a transcendental function with a polynomial), processing queries out of order within the scenario's limits, replacing dense operations with mathematically equivalent sparse operations, fusing or unfusing operations, dynamically switching between one or more batch sizes, and mixing experts that combine differently quantized weights.
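As a concrete instance of the "tanh versus logistic" equivalence in the list above: tanh(x) = 2·sigmoid(2x) − 1, so a platform that exposes only a logistic primitive can still compute a mathematically identical tanh activation via a linear transformation. A quick numerical check (our illustration, not text from the rules):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh(x) == 2 * sigmoid(2x) - 1 for all x, so swapping one primitive for the
# other plus a linear transformation is an allowed, mathematically equivalent change.
for x in [-3.0, -0.5, 0.0, 0.7, 2.5]:
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
```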
4.8.1 Prohibited Optimizations

MLPerf Inference currently prohibits retraining and pruning to ensure comparability, although this restriction may fail to reflect realistic deployment for some large companies. The interlocking requirements to use reference weights (possibly with calibration) and minimum accuracy targets are most important for ensuring comparability in the closed division. The open division explicitly allows retraining and pruning.

We prohibit caching to simplify the benchmark design. In practice, real inference systems cache queries. For example, "I love you" is one of Google Translate's most frequent queries, but the service does not translate the phrase ab initio each time. Realistically modeling caching in a benchmark, however, is a challenge because cache hit rates vary substantially with the application. Furthermore, our data sets are relatively small, and large systems could easily cache them in their entirety.

We also prohibit optimizations that are benchmark aware or data-set aware and that are inapplicable to production environments. For example, real query traffic is unpredictable, but for the benchmark, the traffic pattern is predetermined by the pseudorandom-number-generator seed. Optimizations that take advantage of a fixed number of queries or that use knowledge of the LoadGen implementation are prohibited. Similarly, any optimization employing statistical knowledge of the performance or accuracy data sets is prohibited. Finally, we disallow any technique that takes advantage of the upscaled images in the 1,200x1,200 COCO data set for the heavyweight object detector.

4.8.2 Preprocessing and Data Types

Ideally, a whole-system benchmark should capture all performance-relevant operations. MLPerf, however, explicitly allows untimed preprocessing. There is no vendor- or application-neutral preprocessing. For example, systems with integrated cameras can use hardware/software co-design to ensure that images arrive in memory in an ideal format; systems accepting JPEGs from the Internet cannot.

In the interest of architecture and application neutrality, we adopted a permissive approach to untimed preprocessing. Implementations may transform their inputs into system-specific ideal forms as an untimed operation.

MLPerf explicitly allows and enables quantization to a wide variety of numerical formats to ensure architecture neutrality. Submitters must pre-register their numerics to help guide accuracy-target discussions. The approved list for the closed division includes INT4, INT8, INT16, UINT8, UINT16, FP11 (sign, 5-bit mantissa, and 5-bit exponent), FP16, bfloat16, and FP32.

Quantization to lower-precision formats typically requires calibration to ensure sufficient inference quality. For each reference model, MLPerf provides a small, fixed data set that can be used to calibrate a quantized network. Additionally, it offers MobileNet versions that are prequantized to INT8, since without retraining (which we disallow) the accuracy falls dramatically.
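To illustrate what calibration involves, the sketch below shows one common post-training recipe: deriving a symmetric per-tensor INT8 scale from the calibration set's maximum absolute activation value. It is an illustration of the general technique, not the specific procedure MLPerf or any submitter uses:

```python
import numpy as np

def calibrate_scale(calibration_activations):
    """Per-tensor symmetric INT8 scale from calibration-set activations."""
    max_abs = max(float(np.abs(a).max()) for a in calibration_activations)
    return max_abs / 127.0                      # map [-max_abs, max_abs] onto [-127, 127]

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Example: derive a scale from a few calibration batches, then quantize a tensor.
calib_batches = [np.random.randn(8, 224, 224, 3).astype(np.float32) for _ in range(4)]
scale = calibrate_scale(calib_batches)
x = calib_batches[0]
reconstruction_error = np.abs(dequantize(quantize_int8(x, scale), scale) - x).max()
```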
5 SUBMISSION, REVIEW, AND REPORTING

In this section, we describe the submission process for MLPerf Inference v0.5 (Section 5.1). All submissions are peer reviewed for validity (Section 5.2). Finally, we describe how we report the results to the public (Section 5.3).

5.1 Submission

An MLPerf Inference submission contains information about the SUT: performance scores, benchmark code, a system-description file that highlights the SUT's main configuration characteristics (e.g., accelerator count, CPU count, software release, and memory system), and LoadGen log files detailing the performance and accuracy runs for a set of task and scenario combinations. All this data is uploaded to a public GitHub repository for peer review and validation before release.

MLPerf Inference is a suite of tasks and scenarios that ensures broad coverage, but a submission can contain a subset of the tasks and scenarios. Many traditional benchmarks, such as SPEC CPU, require submissions for all their components. This approach is logical for a general-purpose processor that runs arbitrary code, but ML systems are often highly specialized. For example, some are solely designed for vision or wake-word detection and cannot run other network types. Others target particular scenarios, such as a single-stream application, and are not intended for server-style applications (or vice versa). Accordingly, we allow submitters flexibility in selecting tasks and scenarios.

5.1.1 Divisions

MLPerf Inference has two divisions for submitting results: closed and open. Submitters can send results to either or both, but they must use the same data set. The open division, however, allows free model selection and unrestricted optimization to foster ML-system innovation.

Closed division. The closed division enables comparisons of different systems. Submitters employ the same models, data sets, and quality targets to ensure comparability across wildly different architectures. This division requires preprocessing, postprocessing, and a model that is equivalent to the reference implementation. It also permits calibration for quantization (using the calibration data set we provide) and prohibits retraining.

Open division. The open division fosters innovation in ML systems, algorithms, optimization, and hardware/software co-design. Submitters must still perform the same ML task, but they may change the model architecture and the quality targets. This division allows arbitrary pre- and postprocessing and arbitrary models, including techniques such as retraining. In general, submissions are not directly comparable with each other or with closed submissions. Each open submission must include documentation about how it deviates from the closed division. Caveat emptor!

5.1.2 Categories

Submitters must classify their submissions into one of three categories on the basis of hardware- and software-component availability: available; preview; and research, development, or other systems. This requirement helps consumers of the results identify the systems' maturity level and whether they are readily available (either for rent online or for purchase).

Available systems. Available systems are generally the most mature and have stringent hardware- and software-availability requirements.

An available cloud system must have accessible pricing (either publicly or by request), have been rented by at least one third party, have public evidence of availability (e.g., a web page or company statement saying the product is available), and be "reasonably available" for additional third parties to rent by the submission date.

An on-premises system is available if all its components that substantially determine ML performance are available either individually or in aggregate (development boards that meet the substantially determined clause are allowed). An available component or system must have available pricing (either publicly advertised or available by request), have been shipped to at least one third party, have public evidence of availability (e.g., a web page or company statement saying the product is available), and be "reasonably available" for purchase by additional third parties by the submission date. In addition, submissions for on-premises systems must describe the system and its components in sufficient detail so that third parties can build a similar system.

Available systems must use a publicly available software stack consisting of the software components that substantially determine ML performance but are absent from the source code. An available software component must be well supported for general use and available for download.

Preview systems. Preview systems contain components that will meet the criteria for the available category within 180 days or by the next submission cycle, whichever is later. This restriction applies to both the hardware and software requirements. The goal of the preview category is to en-
shows.

The submitters represent many organizations that range from startups to original equipment manufacturers (OEMs), cloud-service providers, and system integrators. They include Alibaba, Centaur Technology, Dell EMC, dividiti, FuriosaAI, Google, Habana, Hailo, Inspur, Intel, NVIDIA, Polytechnic University of Milan, Qualcomm, and Tencent.

6.2 Task Coverage

MLPerf Inference v0.5 submitters are allowed to pick any task to evaluate their system's performance. The distribution of results across tasks can thus reveal whether those tasks are of interest to ML-system vendors.

We analyzed the submissions to determine the overall task coverage. Figure 7 shows the breakdown for the tasks and models in the closed division. Although the most popular model was, unsurprisingly, ResNet-50 v1.5, it was just under three times as popular as GNMT, the least popular model. This small spread and the otherwise uniform distribution suggest we selected a representative set of tasks.

Figure 7. Results from the closed division. The distribution of models indicates MLPerf Inference capably selected representative workloads for the initial v0.5 benchmark release. (The largest share is ResNet-50 v1.5, with 54 results, or 32.5% of the total.)

In addition to selecting representative tasks, another goal is to provide vendors with varying quality and performance targets. Depending on the use case, the ideal ML model may differ (as Figure 2 shows, a vast range of models can target a given task). Our results reveal that vendors equally supported different models for the same task because each model has unique quality and performance trade-offs. In the case of object detection, we saw the same number of submissions for both SSD-MobileNet-v1 and SSD-ResNet34.

6.3 Scenario Usage

We aim to evaluate systems in realistic use cases—a major motivator for the LoadGen (Section 4.7) and scenarios (see Section 4.5). To this end, Table 6 shows the distribution of results across the various task and scenario combinations. Across all the tasks, the single-stream and offline scenarios are the most widely used and are also the easiest to optimize and run. Server and multistream were more complicated and had longer run times because of the QoS requirements and more-numerous queries.

Model              Single-stream   Multistream   Server   Offline
GNMT               2               0             6        11
MobileNet-v1       18              3             5        11
ResNet-50 v1.5     19              5             10       20
SSD-MobileNet-v1   8               3             5        13
SSD-ResNet34       4               4             7        12
Total              51              15            33       67

Table 6. Number of closed-division results for each task and scenario combination; a zero indicates a combination with no submissions.

6.4 Processor Types

Machine-learning solutions can be deployed on a variety of platforms, ranging from fully general-purpose CPUs to programmable GPUs and DSPs, FPGAs, and fixed-function accelerators. Our results reflect this diversity.

Figure 8 shows that the MLPerf Inference submissions covered most hardware categories. The system diversity indicates that our inference benchmark suite and method for v0.5 can evaluate any processor architecture.

6.5 Software Frameworks

In addition to the various hardware types are many ML software frameworks. Table 7 shows the variety of frameworks used to benchmark the hardware platforms. ML software plays a vital role in unleashing the hardware's performance.

Some run times are specifically designed to work with certain types of hardware to fully harness their capabilities; employing the hardware without the corresponding framework may still succeed, but the performance may fall short of the hardware's potential.
The table shows that CPUs have the most framework diversity and that TensorFlow has the most architectural variety.

Figure 8. Results from the closed division. The results cover many processor architectures. Almost every kind—CPUs, GPUs, DSPs, FPGAs, and ASICs—appeared in the submissions.

Table 7. Summary of software framework versus hardware architecture in the closed division (frameworks represented: HanGuang-AI, ONNX, OpenVINO, PyTorch, SNPE, Synapse, TensorFlow, TF-Lite, and TensorRT). The hardware benchmarking involves many different frameworks. Preventing submitters from reimplementing the benchmark would have made it impossible to support the diversity of systems tested.
Figure 9. Results from the closed division. Normalized performance distribution on a log scale (log10) across models for the single-stream (SS), multistream (MS), server (S), and offline (O) scenarios. The boxplot shows the performance distribution of all system submissions for a specific model and scenario combination. The results are normalized to the slowest system representing that combination. A wide range emerges across all tasks and scenarios. GNMT MS is absent because no submitter ran the multistream scenario.
industry to push the limits of systems.

In yet another interesting submission, two organizations jointly evaluated 12 object-detection models—YOLO v3 (Redmon & Farhadi, 2018), Faster-RCNN (Ren et al., 2015) with a variety of backbones, and SSD (Liu et al., 2016) with a variety of backbones—on a desktop platform. The open-division results save practitioners and researchers from having to manually perform similar explorations, while also showcasing potential techniques and optimizations.

7 LESSONS LEARNED

We reflect on our v0.5 benchmark-development effort and share some lessons we learned from the experience.

7.1 Community-Driven Benchmark Development

There are two main approaches to building an industry-standard benchmark. One is to create the benchmark in house, release it, and encourage the community to adopt it. The other is first to consult the community and then build the benchmark through a consensus-based effort. The former approach is useful when seeding an idea, but the latter is necessary to develop an industry-standard benchmark. MLPerf Inference employed the latter.

MLPerf Inference began as a community-driven effort on July 12, 2018. We consulted more than 15 organizations. Since then, many other organizations have joined the MLPerf Inference working group. Applying the wisdom of several ML engineers and practitioners, we built the benchmark from the ground up, soliciting input from the ML-systems community as well as hardware end users. This collaborative effort led us to directly address the industry's diverse needs from the start. For instance, the LoadGen and scenarios emerged from our desire to span the many inference-benchmark needs of various organizations.

Although convincing competing organizations to agree on a benchmark is a challenge, it is still possible—as MLPerf Inference shows. Every organization has unique requirements and expectations, so reaching a consensus was sometimes tricky. In the interest of progress, everyone agreed to make decisions on the basis of "grudging consensus." These decisions were not always in favor of any one organization. Organizations would comply to keep the process moving or defer their requirements to a future version so benchmark development could continue.

Ultimately, MLPerf Inference exists because competing organizations saw beyond their self-interest and worked together to achieve a common goal: establishing the best
formance at explicit batch sizes, whereas MLPerf allows submitters to choose the best batch sizes for different scenarios. Also, the former imposes no target-quality restrictions, whereas the latter imposes stringent restrictions.

Fathom. An early ML benchmark, Fathom (Adolf et al., 2016) provides a suite of neural-network models that incorporate several types of layers (e.g., convolution, fully connected, and RNN). Still, it focuses on throughput rather than accuracy. Fathom was an inspiration for MLPerf: in particular, we likewise included a suite of models that comprise various layer types. Compared with Fathom, MLPerf provides both PyTorch and TensorFlow reference implementations for optimization, ensuring that the models in both frameworks are equivalent, and it also introduces a variety of inference scenarios with different performance metrics.

AIXPRT. Developed by Principled Technologies, AIXPRT (Principled Technologies, 2019) is a closed, proprietary AI benchmark that emphasizes ease of use. It consists of image-classification, object-detection, and recommender workloads. AIXPRT publishes prebuilt binaries that employ specific inference frameworks on supported platforms. The goal of this approach is apparently to allow technical press and enthusiasts to quickly run the benchmark. Binaries are built using Intel OpenVINO, TensorFlow, and NVIDIA TensorRT tool kits for the vision workloads, as well as MXNet for the recommendation system. AIXPRT runs these workloads using FP32 and INT8 numbers with optional batching and multi-instance, and it evaluates performance by measuring latency and throughput. The documentation and quality requirements are unpublished but are available to members. In contrast, MLPerf tasks are supported on any framework, tool kit, or OS; they have precise quality requirements; and they work with a variety of scenarios.

AI Matrix. AI Matrix (Alibaba, 2018) is Alibaba's AI-accelerator benchmark for both cloud and edge deployment. It takes the novel approach of offering four benchmark types. First, it includes micro-benchmarks that cover basic operators such as matrix multiplication and convolutions that come primarily from DeepBench. Second, it measures performance for common layers, such as fully connected layers. Third, it includes numerous full models that closely track internal applications. Fourth, it offers a synthetic benchmark designed to match the characteristics of real workloads. The full AI Matrix models primarily target TensorFlow and Caffe, which Alibaba employs extensively and which are mostly open source. We have a smaller model collection and focus on simulating scenarios using LoadGen.

DeepBench. Microbenchmarks such as DeepBench (Baidu, 2017) measure the library implementation of kernel-level operations (e.g., 5,124x700x2,048 GEMM) that are important for performance in production models. They are useful for efficient model development but fail to address the complexity of testing and evaluating full ML models.

TBD (Training Benchmarks for DNNs). TBD (Zhu et al., 2018) is a joint project of the University of Toronto and Microsoft Research that focuses on ML training. It provides a wide spectrum of ML models in three frameworks (TensorFlow, MXNet, and CNTK), along with a powerful tool chain for their improvement. It primarily focuses on evaluating GPU performance and only has one full model (Deep Speech 2) that covers inference. We considered including TBD's Deep Speech 2 model but lacked the time.

DAWNBench. DAWNBench (Coleman et al., 2017) was the first multi-entrant benchmark competition to measure the end-to-end performance of deep-learning systems. It allowed optimizations across model architectures, optimization procedures, software frameworks, and hardware platforms. DAWNBench inspired MLPerf, but our benchmark offers more tasks, models, and scenarios.

To summarize, MLPerf Inference builds on the best of prior work and improves on it, in part through community-driven feedback (Section 7.1). The result has been new features, such as the LoadGen (which can run models in different scenarios), the open and closed divisions, and so on.

9 CONCLUSION

More than 200 ML researchers, practitioners, and engineers from academia and industry helped to bring the MLPerf Inference benchmark from concept (June 2018) to result submission (October 2019). This team, drawn from 32 organizations, developed the reference implementations and rules, and submitted over 600 performance measurements gathered on a wide range of systems. Of these performance measurements, 595 cleared the audit process as valid submissions and were approved for public consumption.

MLPerf Inference v0.5 is just the beginning. The key to any benchmark's success, especially in a rapidly changing field such as ML, is a development process that can respond quickly to changes in the ecosystem. Work has already started on the next version. We expect to update the current models (e.g., MobileNet-v1 to v2), expand the list of tasks (e.g., recommendation), increase the processing requirements by scaling the data-set sizes (e.g., 2 MP for SSD large), allow aggressive performance optimizations (e.g., retraining for quantization), simplify benchmarking through better infrastructure (e.g., a mobile app), and increase the challenge to systems by improving the metrics (e.g., measuring power and adjusting the quality targets).

We welcome your input and contributions. Visit the MLPerf website (https://mlperf.org) for additional details. Results for v0.5 are available online (https://github.com/mlperf/inference_results_v0.5).
10 ACKNOWLEDGEMENTS

MLPerf Inference is the work of many individuals from multiple organizations. In this section, we acknowledge all those who helped produce the first set of results or supported the overall benchmark development.

ALIBABA T-HEAD

Zhi Cai, Danny Chen, Liang Han, Jimmy He, David Mao, Benjamin Shen, ZhongWei Yao, Kelly Yin, XiaoTao Zai, Xiaohui Zhao, Jesse Zhou, and Guocai Zhu.

BAIDU

Newsha Ardalani, Ken Church, and Joel Hestness.

CADENCE

Debajyoti Pal.

HAILO

Ohad Agami, Mark Grobman, and Tamir Tapuhi.

INTEL

Md Faijul Amin, Thomas Atta-fosu, Haim Barad, Barak Battash, Amit Bleiweiss, Maor Busidan, Deepak R Canchi, Baishali Chaudhuri, Xi Chen, Elad Cohen, Xu Deng, Pradeep Dubey, Matthew Eckelman, Alex Fradkin, Daniel Franch, Srujana Gattupalli, Xiaogang Gu, Amit Gur, MingXiao Huang, Barak Hurwitz, Ramesh Jaladi, Rohit Kalidindi, Lior Kalman, Manasa Kankanala, Andrey Karpenko, Noam Korem, Evgeny Lazarev, Hongzhen Liu, Guokai Ma, Andrey Malyshev, Manu Prasad Manmanthan, Ekaterina Matrosova, Jerome Mitchell, Arijit Mukhopadhyay, Jitender Patil, Reuven Richman, Rachitha Prem Seelin, Maxim Shevtshov, Avi Shimalkovski, Dan Shirron, Hui Wu, Yong Wu, Ethan Xie, Cong Xu, Feng Yuan, and Eliran Zimmerman.
QUALCOMM

Srinivasa Chaitanya Gopireddy, Pradeep Jilagam, Chirag Patel, Harris Teague, and Mike Tremaine.

SAMSUNG

Rama Harihara, Jungwook Hong, David Tannenbaum, Simon Waters, and Andy White.

STANFORD UNIVERSITY

Peter Bailis and Matei Zaharia.

SUPERMICRO

Srini Bala, Ravi Chintala, Alec Duroy, Raju Penumatcha, Gayatri Pichai, and Sivanagaraju Yarramaneni.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Adolf, R., Rama, S., Reagen, B., Wei, G.-Y., and Brooks, D. Fathom: Reference workloads for modern deep learning methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on, pp. 1–10. IEEE, 2016.

Alibaba. AI Matrix. https://aimatrix.ai/en-us/, 2018.

Amodei, D. and Hernandez, D. AI and compute. https://blog.openai.com/ai-and-compute/, 2018.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. DAWNBench: An end-to-end deep learning benchmark and competition. NIPS ML Systems Workshop, 2017.

Council, T. P. P. Transaction Processing Performance Council. Web site, http://www.tpc.org, 2005.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dixit, K. M. The SPEC benchmarks. Parallel Computing, 17(10-11):1195–1209, 1991.

Dongarra, J. The LINPACK benchmark: An explanation. In Proceedings of the 1st International Conference on Supercomputing, pp. 456–474, London, UK, 1988. Springer-Verlag. ISBN 3-540-18991-2. URL http://dl.acm.org/citation.cfm?id=647970.742568.

EEMBC. Introducing the EEMBC MLMark benchmark. https://www.eembc.org/mlmark/index.php, 2019.

Fursin, G., Lokhmotov, A., and Plowman, E. Collective Knowledge: Towards R&D sustainability. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 864–869. IEEE, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. IEEE, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.

Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Wu, M., Xu, L., and Van Gool, L. AI Benchmark: All about deep learning on smartphones in 2019. arXiv preprint arXiv:1910.06663, 2019.

Intel. Intel Math Kernel Library. https://software.intel.com/en-us/mkl, 2018a.

Intel. Intel Distribution of OpenVINO Toolkit. https://software.intel.com/en-us/openvino-toolkit, 2018b.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.

Khudia, D. S., Basu, P., and Deng, S. Open-sourcing FBGEMM for state-of-the-art server-side inference. https://engineering.fb.com/ml-applications/fbgemm/, 2018.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., and Grundmann, M. On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989, 2019a.

Lee, K., Rao, V., and Arnold, W. C. Accelerating Facebook's infrastructure with application-specific hardware. https://engineering.fb.com/data-center-engineering/accelerating-infrastructure/, 3 2019b.

Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 2014.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.

Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D., Tang, H., Wei, G.-Y., Bailis, P., Bittorf, V., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Jia, B., Kang, D., Kanter, D., Kumar, N., Liao, J., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V. J., Robie, T., John, T. S., Wu, C.-J., Xu, L., Young, C., and Zaharia, M. MLPerf training benchmark, 2019.

MLPerf. MLPerf Reference: ResNet in TensorFlow. https://github.com/mlperf/training/tree/master/image_classification/tensorflow/official, 2019.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

NVIDIA. NVIDIA TensorRT: Programmable inference accelerator. https://developer.nvidia.com/tensorrt.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Post, M. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.

Principled Technologies. AIXPRT community preview. https://www.principledtechnologies.com/benchmarkxprt/aixprt/, 2019.

Qualcomm. Snapdragon Neural Processing Engine SDK reference guide. https://developer.qualcomm.com/docs/snpe/overview.html.

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Seide, F. and Agarwal, A. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.

Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), volume 5, pp. 1–6, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

WMT. First Conference on Machine Translation, 2016. URL http://www.statmt.org/wmt16/.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.

Xu, D., Anguelov, D., and Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 244–253, 2018.

Zhu, H., Akrout, M., Zheng, B., Pelegris, A., Jayarajan, A., Phanishayee, A., Schroeder, B., and Pekhimenko, G. Benchmarking and analyzing deep neural network training. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100. IEEE, 2018.