
AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs

Manni Wang* (Xi'an Jiao Tong University and Microsoft Research, [email protected])
Shaohua Ding* (National Key Laboratory for Novel Software Technology, Nanjing University, and Microsoft Research, [email protected])
Ting Cao (Microsoft Research, [email protected])
Yunxin Liu (Microsoft Research, [email protected])
Fengyuan Xu (National Key Laboratory for Novel Software Technology, Nanjing University, [email protected])
Abstract
On-device deep learning (DL) inference has attracted vast interest. Mobile CPUs are the most common hardware for on-device inference, and many inference frameworks have been developed for them. Yet, due to the hardware complexity, DL inference on mobile CPUs suffers from two common issues: poor performance scalability on the asymmetric multiprocessor, and energy inefficiency.

We identify the root causes: improper task partitioning and unbalanced task distribution for the poor scalability, and unawareness of model behaviour for the energy inefficiency. Based on that, we propose a novel technique called AsyMo for the thread pool implementation of DL frameworks to solve the two issues. The key design principle is to leverage the execution determinism of DL inference, and build an optimal execution plan offline by jointly considering model structures and hardware characteristics. For performance scalability, AsyMo implements cost-model-directed partitioning and asymmetry-aware task scheduling to properly divide and fairly schedule tasks on asymmetric CPUs. For energy saving, AsyMo determines the least-energy-cost frequency based on the data reuse rate of a model.

AsyMo is evaluated on different models and DL frameworks. All gain substantial improvement. For example, AsyMo shows up to 46% performance and 37% energy-efficiency improvement for convolution-dominant models, and up to 97% performance and 1.22x energy-efficiency improvement for fully-connect-dominant models, compared to an optimized TensorFlow on off-the-shelf mobile CPUs.

CCS Concepts
• Human-centered computing → Ubiquitous and mobile computing systems and tools; • Computing methodologies → Parallel computing methodologies.

Keywords
Mobile CPU, Deep Neural Networks, Asymmetric multiprocessor, Cost model, Energy efficiency

ACM Reference Format:
Manni Wang, Shaohua Ding, Ting Cao, Yunxin Liu, and Fengyuan Xu. 2021. AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs. In The 27th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '21), October 25–29, 2021, New Orleans, LA, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3447993.3448625

* Both authors contributed equally to this paper. The work was done during their internship at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ACM MobiCom '21, October 25–29, 2021, New Orleans, LA, USA
© 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8342-4/21/10...$15.00. https://doi.org/10.1145/3447993.3448625

1 Introduction
DL technology is used extensively in mobile and edge applications [59], such as image editing, face detection, and speech recognition. On-device DL inference is gaining momentum, due to its advantages in privacy protection, internet resilience, and quick response compared to on-cloud inference. Therefore, many DL frameworks and libraries have provided dedicated support for on-device inference.

Nearly all on-device inferences run on mobile CPUs, according to a recent study from Facebook [58]. Though various AI accelerators have been developed, mobile CPUs are still the most used due to their general availability, mature programming environment, robust support for diverse models, and increasingly better performance. Mobile GPUs are also widely available, but they provide only as much performance as mobile CPUs on the majority of Android devices, and many DL models are not supported on mobile GPUs [58]. Thus, we focus on mobile CPUs in this paper.

Issues However, we find that current on-device inference on mobile CPUs suffers from two common inefficiency issues. The first issue is poor performance scalability on the asymmetric multiprocessor (AMP). Mobile CPUs feature an asymmetric design such as ARM big.LITTLE technology [21].
There is a big-core processor with higher CPU performance, as well as a little-core processor with lower performance and energy cost. Unfortunately, as shown in Fig. 1, DL inference barely gains speedup by using both processors compared to just the big-core processor, although the little one provides additional compute capability (e.g., 58% more on Kirin 970). This is undesirable since inference latency is crucial for end users. DL frameworks should be capable of utilizing all the available compute capability.

Figure 1: DL inference barely gains speedup by using both processors compared to just using the big processor for ResNet-50 in different frameworks (TensorFlow, TensorFlow Lite, ONNX Runtime, FeatherCNN, and Caffe2 with NNPACK) on Kirin 970 with Android.

The other issue is energy inefficiency because of improper CPU frequency setting. The OS cannot identify the most energy-efficient frequency because it is unaware of DL behaviour characteristics. It tends to set the highest CPU frequency for inference, which is good for performance rather than energy. For example, ResNet-101 [22] consumes 81% more energy while saving only 32% inference time on Snapdragon 845 at the highest frequency compared to the least-energy-cost frequency. Besides, the OS frequency scaling is not responsive enough, particularly for short-run inferences. For example, the frequency only starts to gradually increase after 20% of the total inference time for MobileNet V1 [26]. PredJoule [5] identifies the efficient frequency through extensive energy measurements for every layer of a DL model on mobile CPUs. This is infeasible for the large number of models.

Root causes We deeply analyze the reasons for the poor scalability issue. The first reason we find is the unbalanced task distribution on AMP cores (a task in this paper means a serial computation unit which can be assigned to run on a core). DL models are composed of tensor operations, particularly matrix multiplication (MM). For parallel execution, the thread pool (e.g., Eigen [15] and OpenMP [46]) of DL frameworks partitions MMs into sub-MM tasks, and then assigns the tasks to each thread of the pool in a round-robin way. The OS then schedules these threads to run on the AMP cores. Unfortunately, the current scheduling result is not ideal. We find that (1) the task distribution between the big and little processors is not proportional to their compute capability, and the little one executes much fewer tasks; (2) the task distribution between the cores within each processor is also unbalanced due to the interference-prone mobile environment. Therefore, the performance gain from adding the little processor is far below expectation. It is necessary to implement asymmetry-aware task assignment in DL frameworks rather than just relying on OS scheduling.

Another implicit reason for the poor scalability is inferior task partitioning. We find that without proper partitioning, the performance gain from adding the little processor is still under expectation (e.g., 20% vs 58% on Kirin 970) even with asymmetry-aware task assignment.

MM partitioning (also called blocking or tiling) is a classical question and used to be actively studied on symmetric CPUs [7, 34, 54–56, 64]. It is not that active in recent years because the large cache on server CPUs can hold the data for most MMs, and thus performance is not sensitive to block size. However, block size still has a big impact on mobile CPUs (e.g., 30% on Kirin 970). The de-facto partitioning method used by current mobile DL frameworks is based on ATLAS [56]. It partitions matrices according to two considerations: the smaller matrix is always the inner matrix in the loop, and the blocks of a sub-MM task can be held in cache.

Challenges There are four major challenges on mobile AMP CPUs that current partitioning methods cannot solve. (1) Hardware asymmetry. Current partitioning uses a unified block size on AMP cores, which harms performance and misleads fair task assignment based on the number of tasks. (2) Separated caches between the big and little processors on most mobile CPUs. Accessing the remote cache can incur 5x the latency of the local cache. Current partitioning and scheduling neglect this and may cause remote cache accesses. (3) High competition for the small cache (e.g., a 2 MB cache shared by four cores on the big processor of Kirin 970). ATLAS carefully determines the innermost matrix for cache reuse. However, the high competition from multiple cores for the small cache can possibly cause cache thrashing. (4) Interference-prone environment. Overlooking this, current partitioning always results in lagging threads.

Our approach This paper proposes the AsyMo system with novel techniques to solve both the performance scalability and energy efficiency issues on mobile AMP CPUs. The primary design principle is based on the fact that DL inference is deterministic. That is, given a DL model, its execution is entirely determined by the model itself. Therefore, by jointly considering the model structure and AMP CPU characteristics, the optimal model execution plan, e.g., task partition and frequency setting, can be built offline. Guided by this principle, AsyMo integrates the following components.

For performance scalability, AsyMo coordinates a cost-model-directed block partitioning and an asymmetry-aware task scheduling method, which comprehensively consider the challenges of mobile AMP CPUs. The partitioning is first conducted on a processor level, and then on a core level to find the task size predicted to have the minimum MM latency by the cost model. The scheduling can balance tasks on each core and avoid unnecessary data movement between processors. The cost model is formulated by considering the task-size impact on every aspect that contributes to the latency, such as memory accesses, task scheduling cost, and degree of parallelism. The parameters of the cost model can be trained with common MMs for DL inference on each new CPU rather than each DL model, or be set by empirical values.

AsyMo determines the least-energy frequency based on the finding that DL models are typical computation- or memory-intensive workloads on mobile CPUs, which can be determined by the data reuse rate (i.e., operational intensity) of a DL model. Therefore, AsyMo offline profiles energy curves over frequency for computation and memory-access benchmarks on target CPUs. Based on the data reuse rate of a DL model, the least-energy frequency can be found on the corresponding energy curve.

As far as we know, AsyMo is the first thread pool implementation that can achieve performance scalability for DL inference on mobile AMP CPUs, and also gain substantial energy saving.
Implemented at the thread pool level, the techniques of AsyMo are complementary to optimizations at the framework and kernel levels. We implement AsyMo based on the thread pool of Eigen [15], and then apply it to TensorFlow [18] (TF) and TensorFlow Lite [19] (TFLite), which are the most widely used DL frameworks (39% in total) on mobile devices according to the DL app analysis [59]. Besides, we also apply AsyMo to FeatherCNN [35] and ONNX Runtime [42] (ORT) to show its portability. FeatherCNN shows better performance than other libraries such as NNPACK and OpenBLAS [35]. We evaluate MM computation, as well as end-to-end DL inference for a range of CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) models on different mobile CPUs (Snapdragon 845 [47] and Kirin 970 [23]) and OSes. For MM executions, the speedup can reach up to 49%. For inference, AsyMo can achieve 46% performance and 37% energy efficiency improvement for convolution-dominant models, and up to 97% and 1.22x improvement respectively for fully-connect-dominant models, compared to an optimized version of TF v2.0 on Kirin 970 with Android.

To sum up, the key contributions of the paper are:
• Analyze performance and energy issues for DL inference on mobile CPUs, and identify the root causes (Section 3).
• Propose asymmetry-aware partitioning and scheduling methods to improve performance scalability of DL inference on mobile AMP CPUs (Section 4.2 and Section 4.3).
• Propose to find the least-energy frequency based on the data reuse rate of a DL model as well as the energy-frequency curves of the CPU (Section 4.4).
• Implement AsyMo and achieve significant performance and energy improvement on different frameworks, CPUs and OSes for both RNN and CNN models (Section 5 and Section 6).

2 Background
To understand the performance scalability and energy issues, this section introduces current task scheduling and partitioning for DL inference, as well as OS frequency scaling for mobile AMP CPUs.

Parallelism in DL inference A DL model is generally represented as a dataflow graph. A major advantage of this representation is parallel processing. A node of the graph is a compute operation (abbreviated as op) and the connected edges are the tensors consumed or produced by an op. Two levels of parallelism exist in the graph: inter- and intra-op parallelism. They can be implemented by two thread pools for better performance.

Fig. 2 shows a simple dataflow graph as an example. The inter-op thread pool can process ops without data dependency, such as ops B and C, in parallel. After being pre-processed, B and C are partitioned into tasks (notated as b and c) and sent to the shared intra-op thread pool to execute. Each inter-op thread waits until all the tasks of its op are finished to resume post-processing. Compared to the compute tasks in the intra-op pool, the time cost of the inter-op pool is minor. Thus, the intra-op thread pool is the optimization target of AsyMo.

Figure 2: Inter- and intra-op parallel processing for ops B and C in the example dataflow graph. The notation b and c shows the partitioned tasks for B and C.

Current thread pool implementations, e.g., Eigen and OpenMP, evenly distribute an op's tasks to the task queue of each thread without consideration for AMP CPUs. The number of threads in the intra-op pool is normally set to the number of CPU cores (i.e., hardware threads). The threads in the pool will then be scheduled to run on CPU cores by the underlying OS thread scheduler. An illustrative sketch of this baseline behaviour follows.
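To make the baseline concrete, here is a minimal C++ sketch of such an AMP-oblivious intra-op pool (this is an illustration, not the actual Eigen or OpenMP code; all names are made up): an op's sub-MM tasks are spread round-robin over one queue per worker thread, and the OS later decides which core each thread runs on.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <vector>

    // Baseline intra-op pool sketch: tasks are distributed evenly over
    // per-thread queues with no notion of big or little cores.
    struct BaselineIntraOpPool {
      std::vector<std::deque<std::function<void()>>> queues;  // one per thread
      explicit BaselineIntraOpPool(int num_threads) : queues(num_threads) {}

      void EnqueueOpTasks(const std::vector<std::function<void()>>& tasks) {
        for (std::size_t i = 0; i < tasks.size(); ++i)
          queues[i % queues.size()].push_back(tasks[i]);  // round-robin
      }
    };

Because the distribution ignores whether a queue belongs to a thread that will land on a big or a little core, the imbalance analyzed in Section 3 follows directly.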
MM partitioning MM is the major cost of DL inference. Table 1 shows the MM time cost in representative CNN and RNN models on Kirin 970, ranging from 66% to 99% of the total time (experimental settings in Section 6). The reason is that the convolution op (Conv) for CNN is normally implemented as MM for better performance [9]. The fully-connect op (FC) for RNN is a matrix-vector multiplication (MV) when the batch size is one for inference, which is a special case of MM. Thus, proper partitioning for MM is critical for inference performance.

Table 1: The execution time % of MM in DL models
MobileNetsV1 65.76%   SqueezeNet 72.48%   ResNet-18 83.84%   SSD-MobileNetV1 68.00%   Char-RNN 97.45%
ResNet-50 84.81%   ResNet-101 88.27%   VGG-16 96.00%   RNN-classifier 99.11%   AlexNet 94.82%

Fig. 3 illustrates an example MM (M, K) x (K, N) which is partitioned into (mc, K) x (K, nc) tasks and then assigned to threads in the pool. To fully utilize SIMD registers, an MM kernel normally has an elementary block size that cannot be partitioned, i.e., mr and nr in the figure. Albeit the accumulation dimension, K can be partitioned too if necessary.

Figure 3: (M, K) x (K, N) is partitioned into (mc, K) x (K, nc) tasks for parallel execution. (mr, K) x (K, nr) is the basic computing block. The partitioning loop in the figure is:

    for (i = 0; i < M; i += mc)
      for (j = 0; j < N; j += nc)
        threadpool->schedule([=]() {
          for (x = i; x < i + mc; x += mr)
            for (y = j; y < j + nc; y += nr)
              mm([mr, K], [K, nr]); });

The selection of mc and nc in current DL frameworks is basically based on the heuristics of ATLAS and the hardware thread number, to make sure that the computation is equally divided among the threads and the task data can be held in cache. As aforementioned, these partitioning methods cannot solve the performance issue on mobile AMP CPUs.
Mobile AMP and OS DVFS Single-ISA asymmetric multicore architecture [33] is proposed to achieve both the performance and power requirements. It is widely adopted by mobile products, represented by ARM big.LITTLE technology. Compared to the big-core processor, the little-core processor has lower CPU frequency, smaller cache and memory bandwidth, an in-order pipeline, and lower power cost. Before the DynamIQ Shared Unit (DSU) technique [2], ARM big and little processors have separate caches. DSU enables an optional shared L3 cache. (Section 6.1 has the detailed specs for the platforms used in this paper.)

For CPU frequency setting, the big and little processors normally have isolated power domains, and thus can set their frequencies separately. Each processor has a range of frequencies that OS DVFS (Dynamic Voltage and Frequency Scaling) can set based on the current workload for energy efficiency. Per-core frequency setting is not supported in ARM CPUs yet.

For timely frequency response according to workload change, the Schedutil governor [8] of OS DVFS is integrated into the state-of-the-art OS thread scheduler for mobile AMP CPUs, called Energy-Aware Scheduling (EAS) [3] (note the difference between thread scheduling in the OS and task scheduling in the intra-op thread pool). Thus, Schedutil can configure the frequency immediately when the scheduler changes threads. The frequency response is much faster than the workload-sampling-based Ondemand DVFS governor [8].

However, as we will show in Section 3, EAS Schedutil still has a mismatch with short-run DL inference, which motivates the direct frequency setting of AsyMo.

On-device DNN accelerators Though various AI accelerators have been developed, CPUs are the dominant hardware for on-device inference [58] for the following reasons. Firstly, CPUs are always available on every mobile/edge device, but AI accelerators are not. For example, Edge TPU is currently only available on Google Pixel 4. Secondly, the ecosystem of AI accelerators is closed and immature. Most of the accelerators and their inference frameworks are black boxes [29, 48, 49], raising the barrier to wide usage. Thirdly, specialized accelerators lack the flexibility to adapt to novel DL algorithms, and thus may not always perform well. For example, we evaluate MobileNetV3 [25], the latest version of MobileNet designed to run on mobile devices, on a range of AI chips including Google Edge TPU [20] and Intel Movidius VPU [44]. Fig. 4 shows the results of the chips that can successfully run the model. The mobile CPU runs the fastest albeit with much lower theoretical peak performance. Moreover, mobile CPUs have similar micro-architecture, and the optimization techniques can be generally applied. However, AI accelerators have quite different architectures, and the optimizations need to be customized for each one. Based on these reasons, mobile CPUs should still play an important role for on-device inference in the coming future.

Figure 4: MobileNetV3 runs faster on a Snapdragon 855 CPU than on mobile AI accelerators (GPU, DSP, Rockchip NPU, Movidius VPU). Frameworks used are TensorFlow Lite for the CPU; Qualcomm SNPE [48] for the GPU and DSP; RKNN for the Rockchip NPU [49]; and OpenVINO [29] for the Intel Movidius VPU.

3 Performance evaluation and motivation
We have introduced the current design problems for DL inference on mobile AMP CPUs. This section analyzes how the problems harm performance scalability and energy efficiency.

3.1 Poor performance scalability on AMP
As shown in Fig. 1, current DL inference barely gains speedup by using both processors. To understand the reason, we take TF as an example, and record the processor usage for a range of CNN models in Fig. 5. Clearly, the little-core processor is seriously underutilized, with only 9% usage on average. The usage of the big processor is not ideal either, only 70% on average.

Figure 5: Average CPU usage of the big and little processors for CNN inference in TF on Kirin 970 with Android. The little processor is seriously underutilized.

As explained in the last section, the intra-op thread pool evenly distributes tasks to each thread, and then the OS assigns threads to CPU cores. We record the actual number of tasks executed on each core for every op. The results expose a two-level unbalanced task distribution which causes the low CPU utilization.

The first level is the unbalance between the big and little processors. The number of tasks executed on the little processor is much lower than its capability. Take the MM in MobileNets V1 as an example: the average number of sub-MM tasks executed on a little core is 0.68, while the number on a big core is 3.82, so the ratio is 5.63. However, their capability ratio (i.e., performance ratio) for running MM is only 1.73.

OS EAS is designed for mobile AMP. It schedules threads based on comprehensive considerations of asymmetric core capability, CPU utilization, and predicted energy cost. The result here shows that the intra-op threads are improperly assigned to the big processor much more often than to the little one by EAS.

The second level is the unbalance within a processor. For example, the task distribution of an MM in VGG-16 is 90, 77, 94, 95 for the big-core processor, and 40, 36, 25, 33 for the little-core processor (each processor has four cores in Kirin 970). This is possibly due to interference from some Android background services or improper decisions of EAS. This unbalance degrades the average CPU usage. The lagging core that executes the most tasks becomes the performance bottleneck.

Therefore, the OS EAS for intra-op thread scheduling is far from ideal, due to the lack of workload understanding on each thread. For example, EAS does not know that there are equally-sized tasks on each thread, which should be distributed proportionally according to core capability. By comparison, with the information of tasks, fair task assignment for AMP should be implemented in the intra-op thread pool. This motivates the asymmetry-aware task scheduling of AsyMo.
3.2 DVFS mismatch
Through integration with the OS thread scheduler, EAS Schedutil is expected to set CPU frequency in a timely fashion. We evaluate it using the short-run MobileNets V1 as an example, and measure the CPU power curve during inference in Fig. 6 (measurement methodology in Section 6.1).

Figure 6: A mismatch between the power curve and the inference period for MobileNets V1 by OS EAS Schedutil on Kirin 970 with Android (idle power is subtracted).

Surprisingly, there is a big mismatch between OS CPU frequency scaling and DL inference. The CPU power, an indicator of CPU frequency, only starts to gradually increase about 34 ms (20% of total inference time) after the inference starts. When the inference is done, the power starts to descend about 25 ms later, and drags a long tail to finally reach the idle power about 1,283 ms later, causing a big waste of energy. We also evaluate the traditional sampling-based Ondemand governor. Much worse than Schedutil, the power starts to increase about 775 ms after the inference starts. This makes the inference time 6x longer than the Schedutil result.

Users can avoid OS frequency scaling and set the CPU frequency directly for their workload through the Userspace governor [8], with <1 ms frequency transition latency. By setting the highest CPU frequency for the inference period of MobileNets V1, we reduce the energy cost by 57% compared to EAS Schedutil, which shows a big potential for energy saving by using a proper CPU frequency.

As such, AsyMo sets the CPU frequency directly for DL inference to eliminate the extra energy cost and performance slowdown due to the DVFS mismatch.
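For concreteness, the following is a minimal sketch of such direct frequency setting through the standard Linux cpufreq sysfs interface enabled by the Userspace governor. This is an illustration rather than AsyMo's actual code; it assumes root permission, and the core/policy numbering is device-specific.

    #include <fstream>
    #include <string>

    // Write a single value to a sysfs file; returns false on failure.
    static bool WriteSysfs(const std::string& path, const std::string& value) {
      std::ofstream f(path);
      if (!f) return false;
      f << value;
      return static_cast<bool>(f);
    }

    // Pin one core's frequency (in kHz), e.g. SetCoreFrequencyKHz(4, 2362000)
    // for a big core on Kirin 970. Cores in one cluster share a frequency
    // domain, so setting any core of the cluster is enough on big.LITTLE parts.
    bool SetCoreFrequencyKHz(int cpu, long khz) {
      const std::string base =
          "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cpufreq/";
      return WriteSysfs(base + "scaling_governor", "userspace") &&
             WriteSysfs(base + "scaling_setspeed", std::to_string(khz));
    }

Switching the policy to the userspace governor once and then writing scaling_setspeed keeps the transition latency within the <1 ms reported above.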
4 AsyMo system design

4.1 System overview
The design of AsyMo is based on the unique characteristics of both DL inference and mobile AMP CPUs.

The DL inference characteristics considered are: (i) deterministic execution. Given a DL model graph, the entire execution is determined, such as the ops and tensor sizes. The task partition and efficient frequency can therefore be configured during the one-run framework initialization after a model is loaded; (ii) embarrassingly parallel tensor operations. AsyMo just needs to balance the number of tasks on each core according to core capability, rather than needing to identify critical paths or data dependencies as for general applications [30]; (iii) typical hardware behaviour characteristics. Conv is a typical computation-intensive workload while FC is memory-intensive (explanation in Section 4.4), which facilitates the selection of the least-energy frequency for models dominated by these ops.

The mobile AMP CPU characteristics considered are: (i) asymmetric core capability. Task partitioning and scheduling are customized for the big and little processors respectively; (ii) small caches. The partitioning carefully selects block sizes to reduce total memory accesses; (iii) separated caches. The partitioning and scheduling avoid costly data transfer between processors; (iv) interference-prone environment. Execution on mobile CPUs is easily interfered with by system services or other workloads, so the partitioning and scheduling need to avoid any core becoming a bottleneck.

Considering these unique characteristics, the workflow of AsyMo is shown in Fig. 7. After a model is loaded in the one-run framework initialization, the cost-model-directed block partitioning of AsyMo calculates the proper task size for every MM of the model on both big and little processors. The least-energy frequency for the model is found by AsyMo based on its data reuse rate (i.e., memory- or computation-intensive) as well as the processor's energy-frequency curves.

Figure 7: AsyMo (blue blocks) workflow: during the one-run initialization, the CNN/RNN model goes through block partition (block size) and frequency setting (efficient frequency from the energy-frequency curve); during the inference run, tasks are scheduled on the intra-op thread pool.

After all the preparation in initialization, during the inference run, the asymmetry-aware scheduler of AsyMo binds each intra-op thread to one CPU core, and schedules tasks fairly to each thread.

To meet the diverse requirements of apps, AsyMo supports two configurable modes: 1) a latency-first mode, where the highest CPU frequency is set for the best performance; 2) an energy-first mode, where the most energy-efficient CPU frequency obtained by AsyMo is used to achieve the minimum energy cost. Note that both modes have better performance and lower energy cost than the default DL inference without AsyMo (except for VGG-16, shown in Section 6). Next, we introduce each technique of AsyMo in detail.

4.2 Cost-model-directed block partitioning
The current partitioning method based on ATLAS cannot solve the challenges on mobile AMP CPUs, which results in inferior performance. AsyMo partitioning comprehensively considers all these challenges. This section will first explain the ideas behind AsyMo, and then introduce the proposed cost model.

Design guidelines Due to the interference-prone environment, we find that there are always lagging threads, even though balanced task assignment is implemented and task stealing from a busy thread to a free thread is enabled. Based on this, if a task is too big, task stealing alone cannot reduce the work on lagging threads. Therefore, we construct a partitioning guideline on mobile CPUs: for better task balance, task size should be minimized.

However, we find that reducing task size may increase memory accesses. The reason is that all the tasks are executed in parallel without a particular order. Unless the cache can hold all the data of an MM, as is frequently the case on servers but rare on mobile CPUs, cache thrashing is possible, i.e., every task has to load data from memory. Thus, another partitioning guideline for small-cache mobile CPUs is that task size should be maximized, but remain within cache limitations, so that memory accesses are minimal.

The two competing guidelines result in a trade-off task size which has the least MM latency.
The configuration space on mobile CPUs is too large (on the order of millions of settings) for empirical search to be practical. By considering the task-size impact on every aspect that contributes to the latency, including memory accesses, degree of parallelism, and scheduling cost, we formulate an MM cost model. It can predict the cost of each task size and find the size with the minimum MM latency. Plus, the input, filter, and feature map sizes of DL models are normally within a specific set, so the model easily achieves good accuracy. This model only needs to be trained offline once for each CPU, and can then be applied to various DL inferences.

Figure 8: An MM execution process of AsyMo. During initialization, (1) AsyMo calculates the block partition strategy for the big and little processors; (2) during inference, it schedules the feature-map copying and sub-MM tasks (notated by t in the figure) to each core within a processor.

MM cost model Table 2 shows the cost model. The input is the block size on each dimension and the output is the predicted latency. The cost calculation process first calculates the cost of a sequential unit in Steps (1) and (2). Since K is the accumulation dimension, the tasks along K normally run sequentially. Step (3) calculates the parallel MM execution cost based on the cost of one sequential unit, the number of sequential units (M/mc · N/nc), and the number of threads. Step (4) calculates the other costs, including the framework cost and a task scheduling cost proportional to the number of tasks. The unbalance cost is the time that some threads wait for others to finish all their tasks; thus it is a multiple of the sequential cost. Including this cost is particularly important for the interference-prone environment. Finally, the whole MM latency cost for a task size can be obtained in Step (5).

Table 2: Latency cost model for an (M, K) x (K, N) MM.
Variables: mc, nc, kc: block size on each dimension; mr, nr: elementary block size (ref. Fig. 3)
Constants: Thread#: thread number = core number; t_flop, t_data: cost of a FLOP and of a data access
Training parameters: t_sched, t_fram: cost of a task scheduling and of the framework; p: unbalance coefficient
Cost calculation process:
(1) FLOPs and data size of a task: FLOP_task = mc · nc · kc; Data_task = (DataTypeSize / CachelineSize) · (mc · kc + kc · nc + 2 · mc · nc)
(2) Cost of a sequential unit: Cost_seq = (K / kc) · (t_flop · FLOP_task + t_data · Data_task)
(3) Cost of parallel execution: Cost_par = (1 / Thread#) · (M / mc) · (N / nc) · Cost_seq
(4) Cost of unbalance + scheduling + framework: Cost_other = p · Cost_seq + t_sched · (M / mc) · (N / nc) · (K / kc) + t_fram
(5) Total MM cost: Cost = Cost_par + Cost_other
Constraints:
(1) Cache size: mr · kc + nr · kc + mr · nr <= L1Size; (mc · kc + nc · kc + mc · nc) x Thread# <= L2Size
(2) Least memory accesses: mc = nc = sqrt(MN / (i · Thread#)), i ∈ N

To reduce memory accesses, the cost model sets two constraints on the block size. Constraint (1) is to make sure that the data of one task can be held in cache. The elementary compute block (introduced in Section 2) should reside in the L1 cache. This constraint is for the two-level cache on most mobile CPUs, where L2 is shared by all cores within a processor.

Constraint (2) is to reduce the total memory accesses of the MM. They are calculated by multiplying the number of tasks by the memory accesses of a task, i.e., (M/mc) · (N/nc) · (K/kc) · Data_task, which equals Eq. (1). Since (M/mc) · (N/nc) = i x Thread#, i ∈ N, i.e., the number of sequential units is a multiple of the number of threads, for a fixed i the minimum value of Eq. (1) is achieved when mc = nc. That is how Constraint (2) is derived.

Data_total = (N/nc) · M · K + (M/mc) · N · K + 2 · (K/kc) · M · N    (1)

From Constraint (2), we can get a candidate mc and nc for each integer i. Through the cost model, we can find the least-cost mc and nc as the final partitioning result. A sketch of this search is given below.
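To make the cost model concrete, here is an illustrative C++ sketch of evaluating Table 2 and enumerating the Constraint (2) candidates. The coefficient values, the upper bound on i, and all names are placeholders; the real coefficients come from the offline regression described in Section 5.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Placeholder container for the trained coefficients of Table 2.
    struct CostModelParams {
      double t_flop, t_data, t_sched, t_fram, p;
      int threads;                 // cores of the target processor
      double l2_bytes;             // L2 budget of that processor
      double elem_bytes = 4;       // Float32
      double cacheline_bytes = 64;
    };

    // Steps (1)-(5) of Table 2 for one candidate block size.
    double PredictLatency(int64_t M, int64_t N, int64_t K,
                          int64_t mc, int64_t nc, int64_t kc,
                          const CostModelParams& c) {
      const double tasks = double(M) / mc * double(N) / nc;      // sequential units
      const double flops = double(mc) * nc * kc;
      const double data  = c.elem_bytes / c.cacheline_bytes *
                           (mc * kc + kc * nc + 2.0 * mc * nc);
      const double seq   = double(K) / kc * (c.t_flop * flops + c.t_data * data);
      const double par   = seq * tasks / c.threads;
      const double other = c.p * seq +
                           c.t_sched * tasks * (double(K) / kc) + c.t_fram;
      return par + other;
    }

    // Constraint (2): for each i, try mc = nc = sqrt(MN / (i * Thread#)),
    // discard candidates that break the L2 budget, keep the least-cost size.
    int64_t BestBlockSize(int64_t M, int64_t N, int64_t K, int64_t kc,
                          const CostModelParams& c) {
      int64_t best = 0;
      double best_cost = 1e300;
      for (int i = 1; i <= 64; ++i) {
        int64_t b = std::max<int64_t>(
            1, std::llround(std::sqrt(double(M) * N / (i * c.threads))));
        if ((b * kc + kc * b + b * b) * c.elem_bytes * c.threads > c.l2_bytes)
          continue;  // violates the cache-size constraint
        double cost = PredictLatency(M, N, K, b, b, kc, c);
        if (cost < best_cost) { best_cost = cost; best = b; }
      }
      return best;
    }

Running this search once per MM shape is cheap, which is why it can stay entirely inside the one-run framework initialization.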
A special case is when M ≪ N (or N ≪ M), such as MV; the calculated block size may then be larger than M (or N). If so, AsyMo does not partition M (or N), but only partitions the other dimensions.

Partitioning for big and little processors As shown in Fig. 8 (1), AsyMo first conducts processor-level partitioning by dividing the bigger matrix into two sub-matrices. Then, it conducts core-level partitioning within each processor. The reasons for the processor-level division are: (1) the proper block size is different on each processor; (2) the separation can avoid costly data transfer between the two processors. The division ratio is based on the offline-measured processor capability (i.e., the performance difference running MM) as well as the current processor usage.

To sum up, besides the design for AMP, the specific advantage of AsyMo partitioning is that it considers the trade-off between task balance and memory access reduction. The resulting block size is normally much smaller than that of other partitioning methods.
Even with interference, in collaboration with task stealing, tasks are then easier to balance among threads. Thus, the results show much improved performance scalability (results in Section 6).

4.3 Asymmetry-aware scheduling
The guidelines for AsyMo scheduling are: 1) to balance the number of tasks on each core and processor; 2) to avoid unnecessary data movement between processors.

AsyMo sets the intra-op thread pool size equal to the core number, and uses OS thread affinity to bind each thread to a core, as shown in Fig. 8 (2). This guarantees that the tasks actually executed on each core are consistent with AsyMo scheduling.

Based on the partitioning result for the big and little processors, AsyMo schedules each task to the shortest thread queue in the corresponding processor. If a thread's task queue is empty, it can steal tasks from other threads (i.e., work stealing). This is preferentially done from the longest thread queue within the same processor. If a thread on a big core fails to steal a task within its processor, it is allowed to steal a task from a thread on a little core. However, the other way around is forbidden, because the block size of a big core can be too large for the cache of a little core. A sketch of the binding and the victim-selection policy follows.
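The following minimal C++ sketch illustrates the two mechanisms just described: per-core thread binding via the standard Linux affinity API, and cluster-aware victim selection for work stealing. Queue bookkeeping and synchronization are omitted; the names and data layout are illustrative, not AsyMo's actual implementation.

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling intra-op worker to exactly one core, so the tasks in
    // its queue are guaranteed to run on that core.
    bool PinCurrentThreadToCore(int core_id) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core_id, &set);
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    // Victim selection only: prefer the longest queue in the thief's own
    // cluster; only a big-core thread may fall back to a little-core victim.
    int PickVictim(bool thief_is_big,
                   const int* queue_len, const int* core_is_big, int cores) {
      int victim = -1, longest = 0;
      for (int c = 0; c < cores; ++c)          // same-cluster victims first
        if (core_is_big[c] == int(thief_is_big) && queue_len[c] > longest) {
          longest = queue_len[c]; victim = c;
        }
      if (victim < 0 && thief_is_big)          // big may steal from little
        for (int c = 0; c < cores; ++c)
          if (!core_is_big[c] && queue_len[c] > longest) {
            longest = queue_len[c]; victim = c;
          }
      return victim;  // little cores never steal from big cores
    }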
There are two kinds of tasks scheduled in the thread pool, as shown in Fig. 8: data-copying tasks and sub-MM tasks. Data copying is an optimization method to fully utilize cache space and locality. It has been used in general MM implementations since being proposed by ATLAS, and achieves better performance. After blocks are partitioned, it copies the data of a block into a continuous memory space (i.e., block-major format) before computation; this is the data-copying task.

AsyMo schedules the data-copying task and the corresponding compute task within the same processor, which avoids costly accesses to the remote cache. Other thread pool implementations, however, randomly schedule data-copying and sub-MM tasks to cores. It is possible that the data for a sub-MM task resides in the remote cache, which causes data transfer between processors.
remote cache, which causes data transfer between processors. memory-access benchmarks on target CPUs. Then, AsyMo can find
the least-energy frequency for MM and MV ops of a model by the
4.4 Frequency setting for energy efficiency corresponding curve.
Experimental verification To verify the idea, we profile the en-
AsyMo selects the least-energy frequency based on the guideline that ergy curves over frequency for self-written memory-access and
the least-energy frequency for a DL model is determined by its data computation benchmarks offline, and also the real energy curves
reuse rate (i.e., the operational intensity in a roofline model [57]). for a Conv-dominant model ResNet-18, and an FC-dominant model
The reason is as follows. Char-RNN [43] shown in Fig. 9. The lowest frequencies and curve
Design considerations Energy cost of a workload is calculated shapes between the two figures are consistent. AsyMo can find the
as the sum of static and dynamic energy i.e., Energy = Time × efficient frequencies for DL inference based on the offline profiled
Powerstatic + Time × Powerdynamic . Powerstatic normally keeps con- curves.
1
stant with CPU frequency, while Time ∝ f req. and Powerdynamic ∝ The figures show unexpected results for the energy curves of
voltage2 × f req [12]. As frequency increases, the static energy cost memory accesses. As Fig. 9 (b) shows, before 1.86 GHz, the Char-
reduces, while the dynamic energy increases. Therefore, there will RNN result is basically as expected. The time and energy do not
be a trade-off frequency with the least total energy cost. reduce much with CPU frequency. However, there is a big time and
This least-energy frequency depends on the data reuse rate of energy drop after 1.86 GHz.
the workload, because it impacts the response of Time to CPU fre- To explain this result, we profile the memory access latency on
quency. For example, for high computation-intensive workloads, two mobile CPUs shown in Fig. 10. Surprisingly, on Kirin 970, the
Time reduces accordingly as frequency increases, and so does the random access latency (solid line) drops between 1.86 and 2.09 GHz,
static energy. By comparison, for high memory-intensive work- matching the RNN energy and time curve. Similar latency drop hap-
loads, Time does not reduce much with CPU frequency, since it pens on Snapdragon 845 too. Thus, for ARM CPUs, the random
has less impact on memory access latency. This is why for energy memory access latency drops at certain CPU frequencies. This
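The trade-off can be written out explicitly; the short derivation below only restates the proportionalities above (α and β are unspecified, workload-dependent constants, not values from the paper):

    E(f) = T(f) · P_static + T(f) · P_dynamic(f),   with   P_dynamic(f) ∝ V(f)² · f

    Compute-bound:  T(f) ≈ α / f   ⇒  the static term falls as 1/f while the
                    dynamic term grows roughly with V(f)², so E(f) has an
                    interior minimum at an intermediate frequency.
    Memory-bound:   T(f) ≈ β (nearly flat)  ⇒  E(f) is dominated by
                    T · P_dynamic(f), so a lower frequency usually costs less energy.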
For DL inference, MM (for Conv) and MV (for FC) are typical computation-intensive and memory-intensive (without the parameter copying in initialization) workloads, respectively. This can be seen from a simplified data-reuse formula for MM (without considering block partitioning), assuming that the input matrices are loaded from memory once and the result matrix is written back to memory once, and counting multiply and accumulate operations separately. The data reuse for an MM (M, K) x (K, N) is calculated by Eq. (2).

Data_reuse = 2MNK / (MK + NK + 2MN) = 2 / (1/N + 1/M + 2/K)    (2)

For MV, the data reuse is bounded by 2, since N or M is 1. For MM, the smallest dimension is much larger than 1, ranging from 20 to 540 in our benchmark DL models; thus the maximum data reuse of MM is bounded by the smallest dimension of the input matrices. The greater the data reuse, the fewer memory accesses are needed. Therefore, MM is computation-intensive and MV is memory-intensive.
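As a quick illustration of Eq. (2) (a throwaway helper, not part of AsyMo):

    #include <cstdint>

    // Operational intensity of an (M,K) x (K,N) multiplication under the
    // assumptions stated above (inputs read once, result written once,
    // multiply and accumulate counted separately), i.e. Eq. (2).
    double DataReuse(int64_t M, int64_t N, int64_t K) {
      return 2.0 * M * N * K /
             (double(M) * K + double(N) * K + 2.0 * double(M) * N);
    }
    // Example: DataReuse(1, 1024, 1024) is about 2 (an MV, memory-bound),
    // while DataReuse(256, 256, 1024) is about 205 (an MM, compute-bound).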
Based on all the considerations above, the idea of AsyMo is to profile energy curves over frequency offline for computation and memory-access benchmarks on the target CPUs. Then, AsyMo can find the least-energy frequency for the MM and MV ops of a model from the corresponding curve.

Experimental verification To verify the idea, we profile the energy curves over frequency for self-written memory-access and computation benchmarks offline, and also the real energy curves for a Conv-dominant model, ResNet-18, and an FC-dominant model, Char-RNN [43], shown in Fig. 9. The least-energy frequencies and curve shapes of the two figures are consistent. AsyMo can therefore find the efficient frequencies for DL inference based on the offline-profiled curves.

Figure 9: The least-energy frequencies are consistent between (a) the offline-measured energy curves for memory-access and computation workloads, and (b) the ResNet-18 and Char-RNN models (measurement method in Section 5).

The figures show unexpected results for the energy curves of memory accesses. As Fig. 9 (b) shows, before 1.86 GHz the Char-RNN result is basically as expected: the time and energy do not reduce much with CPU frequency. However, there is a big time and energy drop after 1.86 GHz.

To explain this result, we profile the memory access latency on two mobile CPUs, shown in Fig. 10. Surprisingly, on Kirin 970 the random access latency (solid line) drops between 1.86 and 2.09 GHz, matching the RNN energy and time curves. A similar latency drop happens on Snapdragon 845 too. Thus, for ARM CPUs, the random memory access latency drops at certain CPU frequencies. This is why the memory-intensive RNN has a much lower energy cost at a higher frequency.

Figure 10: The latency of random (right axis) and stream (left axis) memory access on a single big core at different frequency steps of Kirin 970 and Snapdragon 845. Snapdragon 845 has a bigger frequency range than Kirin 970.

Frequency setting for other ops We have also considered extending AsyMo to set a different frequency for different ops of a model based on their data reuse rates. However, current DL models are dominated either by Conv (e.g., CNN models) or by FC (e.g., RNN models). Other ops like ReLU and Softmax take little time to run or are fused with the Conv layers. Thus, to avoid extra frequency transition cost, the current AsyMo only sets the frequency for Conv and FC ops.

5 Cost model training and energy profiling
We implement AsyMo in Eigen [15] due to its popular usage. DL frameworks can call the APIs of AsyMo to utilize its thread pool and frequency setting for DL inference. The asymmetry-aware thread pool of AsyMo is implemented in Eigen's NonBlockingThreadPool. The block partitioning is added in TensorContractionThreadPool, which conducts block partitioning and then enqueues the data-copying and sub-MM tasks.

Cost model training The cost model only needs to be trained once for each CPU rather than for each model. The input, filter, and feature map sizes of DL models are normally within a specific set. The training data set can therefore be much smaller than for general MM, but still effective for DL models.

We select 15 MM sizes and 20 MV sizes for the data set. The MM sizes are chosen from the Conv ops of two representative CNN models: VGG-16, an example of complex models, and MobileNets V1, an example of light-weight models. For an (M, K) x (K, 1) MV, M and K are set as the power-of-two values within the ranges (256, 2048) and (124, 2048) respectively. Each MM or MV also includes a range of block sizes. In total, there are 270 settings in the data set. AsyMo trains the cost model through linear regression and K-fold (K = 10) cross-validation using the scikit-learn package.

The R² value (an index showing how close the data are to the regression line) of the trained model is 0.97 and 0.98 respectively for the big and little processors. Applying the best partitioning found by AsyMo to our benchmark DL models, the performance difference is only 3% to 5% compared to the best empirically searched result. The time cost of profiling and training is about one hour on Kirin 970.

Profiling of energy-frequency curves To find the least-energy frequency, AsyMo offline profiles energy curves over frequency for computation and memory-access benchmarks on the target CPUs. This profiling only needs to be done once for each CPU, and the time cost to profile energy is insignificant. In particular, the energy curves are purely hardware related, and ideally they should be provided by hardware vendors.

We use Android APIs to read the voltage and current of the battery and USB power supply on the mobile phone, and a power monitor on the Hikey970 development board. We can thus get the real power of the computation and memory-access benchmarks at each frequency, i.e., P_comp_freq and P_mem_freq, as well as the average memory access latency t_mem_freq (Section 6.1 gives the detailed measurement methodology). Then, the energy curves for computation and memory access are calculated by (1/freq) x P_comp_freq and t_mem_freq x P_mem_freq, respectively. A sketch of the resulting frequency selection follows.
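Given those curves, picking the least-energy frequency is a simple table lookup. The following is a hedged sketch; the array names and curve representation are assumptions, not AsyMo's actual data structures.

    #include <cstddef>
    #include <vector>

    // Per-frequency measurements for one processor, filled offline:
    // available frequency steps, benchmark power, and memory access latency.
    struct EnergyCurves {
      std::vector<double> freqs_hz, p_compute_w, p_mem_w, mem_latency_s;
    };

    double LeastEnergyFrequency(const EnergyCurves& c, bool compute_bound) {
      double best_f = c.freqs_hz.front(), best_e = 1e300;
      for (std::size_t i = 0; i < c.freqs_hz.size(); ++i) {
        // Energy per unit of work: (1/f) * P_comp for compute-bound ops,
        // t_mem * P_mem for memory-bound ops, following the curves above.
        double e = compute_bound ? c.p_compute_w[i] / c.freqs_hz[i]
                                 : c.mem_latency_s[i] * c.p_mem_w[i];
        if (e < best_e) { best_e = e; best_f = c.freqs_hz[i]; }
      }
      return best_f;
    }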
Both the big and little core processors have a range of frequencies that can be set. However, for the little processor, lowering the frequency is not very helpful for power reduction. For example, on Kirin 970 there is only a 7% power difference between the lowest (0.51 GHz) and highest (1.84 GHz) frequency on the little processor (Cortex-A53), while there is a 26% power difference between the lowest (0.68 GHz) and highest (2.36 GHz) frequency on the big processor (Cortex-A73). Thus, AsyMo fixes the little-core processor at its highest frequency, and only scales the frequency of the big processor.

6 Evaluation

6.1 Experimental methodology
Hardware and OS We use a Hikey970 development board with a HiSilicon Kirin 970 SoC (Kirin 970 for short), running Android 9 Pie OS (Android for short), as the main experimental platform. Compared to a phone, it is easier to conduct power monitoring and temperature control on a development board. To show portable performance, we also evaluate AsyMo on Kirin 970 running Debian Linux 4.9.78 aarch64 (Debian for short), and on a Google Pixel 3 XL with a Snapdragon 845 SoC (Snapdragon 845 for short) running Android 9 Pie. We state explicitly when results are from these two settings. The default DVFS governor for both OSes is Schedutil. Note that although both Kirin 970 and Snapdragon 845 run Android 9 Pie, since Android has customized code for different hardware, the behaviour may still differ.

Table 3 shows the CPU specs for Kirin 970 and Snapdragon 845. Peak performance is measured by a self-implemented MM. Memory bandwidth and latency are profiled by LMbench [41]. The latency and bandwidth vary with the CPU frequency; we list the minimum latency and maximum bandwidth, respectively.

We measure the real power cost with a Monsoon high-voltage power monitor [28] on the Hikey970 board. On the Pixel 3 XL phone, the power of the battery and USB is read from the Android APIs. The sampling rate is set to 5 kHz. Energy is the integral of power over time. We calculate it by multiplying the average measured power during inference by the inference time. Note that the power measured is for the whole development board or phone rather than just the CPU, so the static power can be higher. However, this does not affect the fairness of the energy comparison.

The core utilization is sampled from /proc/stat every 200 ms. We keep the inference running while sampling, and make sure there are >20 samples for each model. The profiling thread runs on a different machine rather than the measured CPU to avoid increasing core utilization. The reported core utilization is the average of all the samples for a core.
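The utilization sampling just described boils down to reading /proc/stat twice and differencing the jiffy counters. A minimal sketch (field handling follows proc(5); this is not the authors' actual profiler):

    #include <cctype>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct CpuSample { unsigned long long busy = 0, total = 0; };

    // One sample per core ("cpu0", "cpu1", ...); busy excludes idle/iowait.
    std::vector<CpuSample> ReadProcStat() {
      std::vector<CpuSample> out;
      std::ifstream f("/proc/stat");
      std::string line;
      while (std::getline(f, line)) {
        if (line.size() < 4 || line.compare(0, 3, "cpu") != 0 ||
            !std::isdigit(static_cast<unsigned char>(line[3])))
          continue;
        std::istringstream ss(line);
        std::string cpu; ss >> cpu;
        CpuSample s; unsigned long long v; int field = 0;
        while (ss >> v) {
          s.total += v;
          if (field != 3 && field != 4) s.busy += v;  // skip idle, iowait
          ++field;
        }
        out.push_back(s);
      }
      return out;
    }

    // Utilization of one core between an earlier sample a and a later sample b.
    double Utilization(const CpuSample& a, const CpuSample& b) {
      unsigned long long dt = b.total - a.total;
      return dt ? double(b.busy - a.busy) / dt : 0.0;
    }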
Table 3: Experimental platform specs
Kirin 970 — big core cluster: ARM Cortex-A73, 4 cores; little core cluster: ARM Cortex-A53, 4 cores. Pipeline: out-of-order (big) / in-order (little). CPU frequency: 0.68–2.36 GHz (big) / 0.51–1.84 GHz (little). Peak performance: 8.8 GFLOPS (big) / 5.1 GFLOPS (little). L1 D/I: 64 KB private (big) / 32 KB private (little). L2: 2 MB shared (big) / 1 MB shared (little). Cacheline size: 64 B. Memory read bandwidth: 4.6 GB/s max (big) / 0.86 GB/s max (little). Random memory access latency: 200 ns min (big) / 200 ns min (little).
Snapdragon 845 — big core cluster: ARM Cortex-A75, 4 cores; little core cluster: ARM Cortex-A55, 4 cores. Pipeline: out-of-order (big) / in-order (little). CPU frequency: 0.83–2.80 GHz (big) / 0.30–1.77 GHz (little). Peak performance: 13.0 GFLOPS (big) / 5.4 GFLOPS (little). L1 D/I: 64 KB private (big) / 32 KB private (little). L2: 256 KB private (big) / 128 KB private (little); shared L3: 2 MB. Cacheline size: 64 B. Memory read bandwidth: 10 GB/s max (big) / 3.2 GB/s max (little). Random memory access latency: 200 ns min (big) / 210 ns min (little).

Framework and model configurations AsyMo is applied to TF, TFLite, FeatherCNN and ORT. The version of TensorFlow and TFLite is the recently released v2.0 with Eigen 3.3.7. They are compiled for Android by the C++ compiler of android-ndk-r18b with the option --cpu=arm64-v8a.

The version of FeatherCNN is v0.1-beta. It is compiled for Android using android-ndk-r19c and Android API version 21. ONNX Runtime is v1.1.0 with Eigen 3.3.90, compiled by android-ndk-r18b with Android API version 24.

AsyMo is evaluated with a range of typical DL models with various model or computation complexity and memory usage [6]. The CNN models included are MobileNets V1 [26], SqueezeNet [27], SSD-MobileNetV1 [38], ResNet-18/50/101 [22], VGG-16 [50], and AlexNet [32]. The RNN models are Char-RNN [43] and an RNN classifier. CNN models are commonly dominated by Conv ops, except for AlexNet, whose FC ops take ~78% of the total time. Thus, we categorize the models into either Conv-dominant or FC-dominant groups.

The types of the model parameters are all Float32. All the CNN models use the default input size and NN structure, except for VGG-16, where an FC-4096 layer is removed because of an out-of-memory error. The configurations for the two RNN models are: Char-RNN: hidden_size 512, layers 2, batch_size 1, input_size 65, length 40, cell LSTM; RNN classifier: hidden_size 1024, layers 3, batch_size 1, input_size 512, length 20, cell GRU.

The reported inference time and energy are the arithmetic mean of 20 runs with a 1 s delay between each run (except for the continuous inference experiment). The first inference time is excluded. The reason is that current DL frameworks normally use lazy initialization; the first inference also includes some one-time framework cost and runs several times slower than the following ones.

Optimized TF baseline We made two modifications to default TF to build our baseline. One is to pre-copy model parameters into continuous memory during initialization to eliminate copies during inference. The other is a parallel implementation of MV (default TF only supports sequential MV); the block number of this parallel MV is equal to the thread number. These two modifications are straightforward and have already been adopted in some DL frameworks/libraries [14, 34, 40]. The optimized baseline (denoted as TF⋆) is 32% faster than default TF for Conv-dominant models on average, and achieves up to 4.30x better performance for FC-dominated models on Kirin 970 with Android.

6.2 Results
This section first shows the performance scalability improvement for MM by utilizing AsyMo in TF, ORT and FeatherCNN. Then, whole-DL-model inference results are shown by utilizing all the techniques of AsyMo compared to TF v2.0⋆. The model evaluation is mainly on TF because of its sound support for various models.

MM results Fig. 11 shows the performance scalability improvement of MM (an average over a range of MM sizes used by DL) by applying AsyMo to different frameworks. Default TF and ORT use different versions of Eigen for the thread pool implementation, while FeatherCNN uses OpenMP, and each of them has a different serial MM kernel implementation. AsyMo gains portable speedup on all of them. With the better partition strategy and fair scheduling, AsyMo improves the CPU utilization of both big and little processors, and achieves performance scalability on AMP. For example, in FeatherCNN (Fig. 11 (b)), the 4-big-core speedup is 2.65x while AsyMo reaches 3.4x; on both big and little processors, FeatherCNN is only 3.35x while AsyMo is 5.31x.

Figure 11: Performance scalability of MM over the number of cores (1–4 are big and 5–8 are little cores) by applying AsyMo to (a) TF; (b) FeatherCNN; (c) ORT at max CPU frequency on Kirin 970 with Android.

Model inference results Now we show the results of real DL model inference on both big and little processors. AsyMo can greatly reduce latency and energy compared to TensorFlow⋆ in both the latency-first and energy-first modes.

For the latency-first mode, the CPU frequency is set at max (i.e., 2.36 GHz) for both AsyMo and TensorFlow⋆ for a fair comparison. Fig. 12a shows the performance and energy efficiency improvement of AsyMo (Time_TF / Time_AsyMo and Energy_TF / Energy_AsyMo) for the DL models. Table 4 lists the actual measured time (average, with min and max), energy, and model size.

Table 4: Time and energy cost of DL models in TensorFlow⋆ and AsyMo at max CPU frequency on Kirin 970 with Android (values in parentheses are the min and max of the measured time)
Model | FLOPs (10^9) | Params (10^6) | Time (s) TF⋆ | Time (s) AsyMo | Energy (J) TF⋆ | Energy (J) AsyMo
MobileNets V1 | 1.14 | 4.25 | 0.09 (0.08, 0.11) | 0.06 (0.06, 0.07) | 0.87 | 0.75
SqueezeNet | 1.67 | 1.25 | 0.11 (0.09, 0.13) | 0.07 (0.07, 0.08) | 1.12 | 0.88
SSD-MobileNetV1 | 2.47 | 6.82 | 0.17 (0.15, 0.19) | 0.12 (0.11, 0.13) | 1.78 | 1.31
ResNet-18 | 3.47 | 16.02 | 0.18 (0.16, 0.19) | 0.12 (0.11, 0.13) | 1.78 | 1.29
ResNet-50 | 6.69 | 25.61 | 0.34 (0.30, 0.37) | 0.22 (0.21, 0.22) | 3.24 | 2.44
ResNet-101 | 14.39 | 44.68 | 0.63 (0.58, 0.69) | 0.43 (0.42, 0.45) | 5.57 | 5.29
VGG-16 | 30.80 | 68.15 | 1.24 (1.21, 1.27) | 0.62 (0.61, 0.63) | 11.69 | 7.71
Char-RNN | 0.13 | 3.28 | 0.38 (0.30, 0.46) | 0.03 (0.02, 0.07) | 2.38 | 0.43
AlexNet | 1.44 | 60.97 | 0.39 (0.38, 0.41) | 0.06 (0.05, 0.08) | 2.69 | 0.61
RNN-classifier | 0.76 | 12.64 | 1.03 (0.84, 1.06) | 0.11 (0.11, 0.11) | 7.52 | 1.46
Speedup of TF*
2.6
1.8 Schedule Block Partition

(max freq)
Related to TF*

2.6
1.8 Performance Energy Efficiency
(max freq)

1.8
1.4
1.48
1.8
1.4
1.22 1.0 1.0
1 t 1 8 0 g t er
sV
e tV t-1 t-5 01 -16 Av NN Ne ifi
1.00 t eN Ne e
-1 G -R x ss
Ne et VG
1.0 1.0
Ne ez li e esN ar le la
V1 et 1 8 0 01 -16 g
NN et er ile ue Re
s s N
Ch
A c
ts N tV t-1 t-5 t-1 GG Av -R xN ifi ob Sq ob R Re N-
e ze Ne e Ne r e ss M RN
leN ee li e esN s sN
e V ha Al cl
a M D-
bi qu ob R Re Re
C N- SS
o S
M D-
M RN
SS Figure 13: The performance improvement breakdown for pre-
(a) arranged parameters, asymmetry-aware scheduling, and cost-
model-based block partition compared to default TensorFlow
Related to TF*

2.6
1.8 for Conv- (left axis) and FC-dominant (right axis) groups at
(best freq)

1.46
1.8
max CPU frequency on Kirin 970 with Android.
1.4 1.37

1.0 1.0
1 et 1 8 0 01 -16 g
NN et r beginning of inference using Schedutil. The third reason can also ex-
sV eN tV t-1 -5 -1 Av xN fie
t
Ne e et G -R si
Ne ez le esN sN N et VG ar Al
e
l as plain why for some models e.g.,SqueezeNet, the speedup of efficient
ile ue i
ob R
e s Ch -c
ob Sq
R Re N
frequency over Schedutil is even higher than Fig. 12b.
M D-
M RN
SS Performance improvement breakdown To show the perfor-
(b)
mance improvement of each technique of AsyMo, Fig. 13 breaks
Figure 12: The relative performance and energy efficiency im-
down the speedup at max frequency in Fig. 12a for asymmetry-aware
provement of AsyMo on Kirin 970 with Android for Conv- (left
scheduling and cost-model-based block partition.
axis) and FC-dominant (right axis) groups at (a) max CPU fre-
AsyMo scheduling improves performance by 24% for Conv-
quency; (b) most efficient frequency for AsyMo and Schedutil
dominant models on average. This improvement comes from the fair
for TensorFlow⋆.
task scheduling and better cache locality. The improvement for the
long-running VGG-16 is relatively lower than the other models. It is
For energy-first, AsyMo can find the efficient frequency through its energy-frequency curves. Since AsyMo changes FC-dominant models from a memory-intensive workload to a computation-intensive one, the efficient frequency for all the benchmark models is set to 1.86 GHz in AsyMo on Hikey970. Fig. 12b shows the improvement of AsyMo at this frequency compared to default TensorFlow with EAS Schedutil, which mostly sets the CPU to the highest frequency. AsyMo improves energy efficiency by 37% for Conv-dominant models on average compared to the Schedutil governor. The efficiency improvement for the three FC-dominated models is 83%, 1.22× and 75% respectively. The performance improvement is similar to Fig. 12a.

As discussed in Section 3.2, the energy cost reduction comes from 1) the efficient frequency selected by AsyMo using less power than the max frequency; 2) eliminating the additional energy cost of the long power tail of Schedutil after the inference is done; and 3) reduced running time from removing the mismatched frequency period at the beginning of inference under Schedutil. The third reason also explains why for some models, e.g., SqueezeNet, the speedup of the efficient frequency over Schedutil is even higher than in Fig. 12b.
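A minimal sketch of this energy-first frequency choice is shown below. It is illustrative only (not AsyMo's implementation), and the per-frequency power and latency values are placeholders standing in for what an offline energy-frequency curve and latency estimate would provide.

```python
# Illustrative energy-first frequency selection (not AsyMo's actual code).
# Each candidate is (frequency in GHz, average power in W, predicted latency in s);
# the values below are placeholder assumptions.
def least_energy_frequency(candidates):
    best_freq, best_energy = None, float("inf")
    for freq, power_w, latency_s in candidates:
        energy_j = power_w * latency_s      # energy of one inference at this frequency
        if energy_j < best_energy:
            best_freq, best_energy = freq, energy_j
    return best_freq, best_energy

curve = [(0.68, 1.0, 0.95), (1.21, 1.7, 0.55), (1.86, 2.6, 0.33), (2.36, 4.5, 0.27)]
freq, energy = least_energy_frequency(curve)
print(f"least-energy frequency: {freq} GHz, {energy:.3f} J per inference")
```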
Performance improvement breakdown To show the performance improvement of each technique of AsyMo, Fig. 13 breaks down the speedup at max frequency in Fig. 12a for asymmetry-aware scheduling and cost-model-based block partition.

AsyMo scheduling improves performance by 24% for Conv-dominant models on average. This improvement comes from the fair task scheduling and the better cache locality. The improvement for the long-running VGG-16 is relatively lower than for the other models because, as explained in Section 3.1, the default workload balance for VGG-16 is better than for the others and its default little-core usage is already 28%. The small increase for the three FC-dominant models is because their eight blocks are too few to be scheduled fairly.

Block partition improves performance by another 24% on average. Compared to the default Eigen block partition, the partition cost model of AsyMo considers reducing both data accesses and sequential waiting time, and it generates about 6 times more tasks for each op on average. More tasks, i.e., smaller blocks, facilitate workload balance and improve the degree of parallelism. This is particularly important for ops that get only a small number of tasks under default Eigen and are thus more sensitive to imbalance. For this reason, the improvement for SqueezeNet and MobileNets V1 is a bit higher than for the others.
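The following sketch conveys the flavor of such a cost-model-directed partition search. It is not AsyMo's real cost model (which also accounts for asymmetric cache sizes, scheduling overhead, and framework overhead); the cache size, candidate block sizes, and weights here are assumptions made up for the example.

```python
# Illustrative cost-model-directed block-partition search for an M x K by K x N
# matrix multiplication. The two cost terms only mirror the effects named in the
# text (data accesses and sequential waiting); all constants are placeholders.
def choose_block(M, N, K, n_threads, cache_bytes=512 * 1024, elem_bytes=4):
    best = None
    for bm in (32, 64, 128, 256):
        for bn in (32, 64, 128, 256):
            # operand panels plus the result block of one task should fit in cache
            if (bm * K + K * bn + bm * bn) * elem_bytes > cache_bytes:
                continue
            tasks = -(-M // bm) * -(-N // bn)      # ceil(M/bm) * ceil(N/bn)
            if tasks < 4 * n_threads:              # too few tasks to balance AMP cores
                continue
            data_access = tasks * (bm * K + K * bn) * elem_bytes  # bytes touched overall
            waiting = bm * bn * K / n_threads      # proxy for the sequential tail per task
            cost = data_access + 50.0 * waiting
            if best is None or cost < best[0]:
                best = (cost, bm, bn, tasks)
    return best

cost, bm, bn, tasks = choose_block(M=1024, N=1024, K=1024, n_threads=8)
print(f"block {bm}x{bn}, {tasks} tasks, estimated cost {cost:.2e}")
```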
Figure 14: The performance improvement of AsyMo compared to TensorFlow⋆ on Snapdragon 845 with Android and Kirin 970 with Debian at max CPU frequency.

Figure 16: The performance improvement of AsyMo compared to TFLite⋆ on Kirin 970 with Android.

Results on other platforms The results above are from Kirin 970 with Android. To show the portability of AsyMo, we also evaluate it on different hardware (Snapdragon 845) and a different OS (Debian Linux 4.9.78). Fig. 14 shows the performance improvement results.

The average performance improvement on Snapdragon 845 is 52% for Conv-dominant models, about 29% higher than on Kirin 970. The major reason is that the default CPU usage of the big cores on Snapdragon 845 is lower than on Kirin 970, so there is more room for AsyMo to improve. The performance improvement for the FC-dominant models is 63%, 66% and 37% respectively, smaller than the results on Kirin 970. This is because the little-core capability is about a third of a big core's on Snapdragon 845, and about a half on Kirin 970. Assume a big core's capability is 1; then going from a sequential MV on one big core to perfect parallelism on four big and four little cores gives a speedup of 4 + 4/3 = 5.33 on Snapdragon 845, while on Kirin 970 it gives 4 + 4/2 = 6.
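The ideal-speedup arithmetic above generalizes to any big.LITTLE configuration; a small worked check:

```python
# Upper bound on MV speedup over a single big core when all cores are perfectly
# utilized: each little core contributes its capability relative to a big core.
def ideal_amp_speedup(n_big, n_little, little_over_big):
    return n_big + n_little * little_over_big

print(ideal_amp_speedup(4, 4, 1 / 3))  # Snapdragon 845: 5.33
print(ideal_amp_speedup(4, 4, 1 / 2))  # Kirin 970: 6.0
```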

Figure 15: The relative performance and energy efficiency improvement of AsyMo on Snapdragon 845 with Android for Conv- (left axis) and FC-dominant (right axis) groups at (a) max CPU frequency; (b) most efficient frequency for AsyMo and Schedutil for TensorFlow⋆.

Figure 17: The relative performance improvement of AsyMo on TensorFlow⋆ with background load interference for Conv- (left axis) and FC-dominant (right axis) groups at max CPU frequency on Snapdragon 845 with Android.

Figure 18: The relative performance improvement of AsyMo on TensorFlow⋆ w/o delay between inference runs for Conv- (left axis) and FC-dominant (right axis) groups on Snapdragon 845 with Android.

Energy consumption of Snapdragon 845 is also evaluated. For the latency-first mode, CPU frequency is set at max (2.65 GHz). The energy efficiency is improved by 23% on average for Conv-dominant models, as shown in Fig. 15a.
Due to the smaller static power, the increase in CPU utilization results in a larger increase in power than on Kirin 970, which is why the energy efficiency improvement is similar even though the performance improvement is larger. For the energy-first mode, CPU frequency is set to 1.21 GHz according to the energy curve of the computation workload. The average performance improvement for Conv-dominant models is 19%, and the energy efficiency improves by 1.27×, as shown in Fig. 15b. Running at this low frequency, models gain acceleration mostly from correcting the mismatch discussed in Section 3.2, which decreases as inference time increases. This explains why VGG-16 is 25% slower.

The speedup for Kirin 970 with Debian is about 16% smaller on average for Conv-dominant models than for Kirin 970 with Android. This is because, counter-intuitively, the baseline CPU utilization of Conv-dominant models on Debian is better than on Android, especially for MobileNets V1, although this Debian version doesn't have EAS scheduling for big and little cores. It is possibly because Android has more background services that disturb the inference runs.

TFLite is designed for DL inference on mobile devices. Eigen is also its fallback choice for the thread pool implementation and the Float32 Conv ops. Thus, we also evaluate the performance of AsyMo compared to TFLite⋆ with the Eigen library for Conv-dominant models in Fig. 16. TFLite has its own sequential implementation for MV and doesn't use Eigen's, so we didn't evaluate FC-dominant models for TFLite. The average performance improvement is 32%. The result shows that although TFLite is specifically designed for mobile devices, AsyMo can still gain a large improvement.

Background load interference The robustness of AsyMo under background load interference is evaluated with both static and dynamic background load. For static load, we use the controllable load generator stress [1] and run a single-thread sqrt() (which represents computation load) or malloc() (which represents memory load). For dynamic load, we use the replay tool RERAN [17] to record the user input of playing the game Cut the Rope [63] from level 10 to level 12, and replay it during each test. The results in Fig. 17 show that Conv-dominant models can still gain 25% speedup with static computation load, 27% speedup with static memory load, and 27% speedup under dynamic game load. FC-dominant models can gain up to 58% speedup with static computation load, 73% speedup with static memory load, and 57% speedup under dynamic game load. The smaller block size and the work-stealing mechanism help balance the loads among cores when there is background interference.
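For reference, the static loads described above map onto the standard stress command line roughly as in the sketch below. This is only one way to drive the tool (the paper's actual harness is not shown), and it assumes the usual --cpu/--vm/--timeout flags of stress.

```python
# Sketch of launching the static background loads with the `stress` tool
# (assumes the standard stress CLI; not the paper's actual test harness).
import subprocess

def start_static_load(kind, seconds=300):
    if kind == "compute":
        # one worker spinning on sqrt(), i.e. the computation load
        args = ["stress", "--cpu", "1", "--timeout", str(seconds)]
    elif kind == "memory":
        # one worker spinning on malloc()/free(), i.e. the memory load
        args = ["stress", "--vm", "1", "--timeout", str(seconds)]
    else:
        raise ValueError(f"unknown load kind: {kind}")
    return subprocess.Popen(args)

load = start_static_load("compute")
# ... run the inference benchmark while the load is active ...
load.terminate()
```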
Continuous inference In the previous experiments, a 1 s delay is inserted between inferences to maintain a neat and stable experimental environment for generating reproducible results. We also evaluate the performance of long continuous inference runs (we use 1000 runs here) to show the potential thermal-throttling impact on AsyMo. As shown in Fig. 18, Conv-dominant models can gain 43% speedup on average. The reason for the drop in performance improvement is that AsyMo increases CPU utilization and therefore generates more heat under continuous inference runs, so thermal throttling may be triggered earlier. A performance increase is also observed for SSD-MobileNetV1, ResNet-18, VGG-16 and AlexNet, possibly because continuous running reduces the time spent waking sleeping threads.

Figure 19: The relative performance improvement of AsyMo and WASH [31] on TensorFlow⋆ for Conv- (left axis) and FC-dominant (right axis) groups at max CPU frequency on Snapdragon 845 with Android.
Comparison with other AMP schedulers There are some general thread scheduling algorithms for AMP machines. For a quantitative comparison, we implement WASH [31], one of the state-of-the-art AMP thread schedulers, on TensorFlow⋆. Since it is a general-purpose thread scheduler, WASH is unaware of the MM block partition and just schedules threads according to core capability (there are no critical threads for MM). Therefore, for WASH, we use the default block partition strategy of TensorFlow and schedule threads according to core capability (i.e., three times more threads on a big core than on a little core on Snapdragon 845). The experimental results in Fig. 19 show that, with WASH, performance is reduced by 5% on average relative to the baseline for Conv-dominant models. This illustrates that without the proper block partition of AsyMo, merely scheduling threads according to core differences does not benefit performance.
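To make the contrast concrete, the sketch below shows the generic idea of capability-proportional distribution plus work stealing over already-partitioned sub-block tasks. It is illustrative pseudocode in Python, not the implementation of either AsyMo or WASH, and the core capabilities are placeholder values.

```python
# Illustrative capability-proportional task distribution with work stealing
# (a generic sketch, not AsyMo's or WASH's actual code).
from collections import deque

def distribute(tasks, cores):
    """cores: list of (core_id, relative_capability). Returns per-core task queues."""
    total = sum(cap for _, cap in cores)
    queues = {cid: deque() for cid, _ in cores}
    start = 0
    for i, (cid, cap) in enumerate(cores):
        n = len(tasks) - start if i == len(cores) - 1 else round(len(tasks) * cap / total)
        queues[cid].extend(tasks[start:start + n])
        start += n
    return queues

def next_task(queues, cid):
    """Pop from the local queue; when empty, steal from the longest queue."""
    if queues[cid]:
        return queues[cid].popleft()
    victim = max(queues, key=lambda c: len(queues[c]))
    return queues[victim].pop() if queues[victim] else None

cores = [("big0", 1.0), ("big1", 1.0), ("little0", 0.5), ("little1", 0.5)]
queues = distribute(list(range(24)), cores)
print({cid: len(q) for cid, q in queues.items()})  # {'big0': 8, 'big1': 8, 'little0': 4, 'little1': 4}
```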
7 Related work
Block partition tuning There are many works on auto-tuning GPU work-group size by empirical searching [36, 45, 53] or model-based searching [13, 16, 24, 51]. Empirical searching runs different configurations on real hardware to find the best one and thus has a high searching cost. Model-based searching either manually derives a performance model or automatically trains a machine learning model to predict the running cost of different configurations. However, CPUs and GPUs differ in architecture, and thus the performance models for GPUs cannot be applied directly to CPUs.
The MM block partition strategies for CPUs normally use heuristics to empirically search for the optimal block size [7, 54]. [34] recommends using the largest block size possible that does not incur self-interference within an array. ATLAS [54] tries to put the matrix that can be held in cache as the innermost matrix since it will be invoked many times, or it puts the two operand sub-matrices and the result sub-matrix into cache. PHiPAC [7] searches for the best block size according to the register and cache sizes. TVM [10] and NeoCPU [39] are designed to generate low-level optimized code for DL models by searching a code space and choosing a better operator according to the predicted or measured cost during compilation. SOL [52] is a middleware that transparently supports heterogeneous hardware and determines the optimal memory layout in the compiling session. DeepCPU [64] accelerates RNN inference on x86 server CPUs. Its block partition strategy also considers reducing slow-memory accesses. All those methods are mainly designed for symmetric server CPUs with large caches, and the search-based mechanisms impose high search overhead. In comparison, AsyMo formalizes an analytical cost-prediction model to quickly find the best partition for each DL model. AsyMo is designed for asymmetric mobile CPUs with much smaller caches, and it also considers parallelism, heterogeneous cache size, scheduling overhead, and framework overhead for DL.
AMP thread scheduling There are thread scheduling algorithms designed for AMP machines [3, 31, 62]. They schedule threads to big or little cores by monitoring their hardware behaviours or criticality. WASH [31] proportionally schedules threads to big and little cores according to the core capability. COLAB [62] makes coordinated core assignment and thread selection decisions based on the performance estimation of each thread on different cores, as well as identified communication patterns and bottleneck threads. Gomatheeshwari et al. [4] utilize a lightweight deep neural network (LW-DNN) to predict the optimal core for each workload. Compared with AsyMo, these works are for general thread scheduling at the OS or language-virtual-machine level. They are unaware of the code logic running on each thread, such as the matrix multiplication in this paper. Therefore, these schedulers cannot conduct block partition for matrix multiplication and then schedule the blocks. By comparison, AsyMo partitions blocks first and then schedules the sub-block tasks according to the core capability.
Energy efficiency for mobile inference PredJoule [5] empirically measures the energy cost of each layer of a DL model under different CPU/GPU DVFS settings to find the least-energy settings under latency requirements. However, the setting space is large and real measurement can be slow. Besides, as shown in the paper, the Conv and FC layers dominate the total cost; it is not rewarding to measure the energy of layers like ReLU and Softmax, which are normally fused with the Conv layers anyway. An energy estimation tool [60] for DL models was developed and used for energy-aware model pruning [61] and compression [37], but it is specialized for the Eyeriss [11] DL accelerator only, and thus cannot be used for commercial hardware. AsyMo derives an energy model to find the efficient CPU frequency for real ARM CPUs.

8 Conclusion
This paper reveals the performance scalability issue due to imbalanced task distribution, and the energy inefficiency due to DVFS mismatch, for mobile DL inference. To solve these, AsyMo is proposed to properly partition the MM blocks, schedule tasks among threads for fairness, and set the most energy-efficient frequency. Both performance and energy efficiency are improved greatly by AsyMo in different DL frameworks on various platforms.

9 Acknowledgments
We thank our anonymous reviewers and shepherd for the helpful comments and feedback. We thank Dr. John Zigman for the valuable discussions and suggestions. Shaohua Ding and Fengyuan Xu were partially supported by the National Natural Science Foundation of China under Grant NSFC-61872180.
References
[1] 2019. stress-android. https://github.com/m-ric/stress-android/
[2] ARM. 2017. ARM documentation set for DynamIQ Shared Unit. https://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.cortexa.dsunit/index.html
[3] ARM. 2019. Energy Aware Scheduling (EAS). https://developer.arm.com/tools-and-software/open-source-software/linux-kernel/energy-aware-scheduling
[4] Gomatheeshwari B and J. Selvakumar. 2020. Appropriate allocation of workloads on performance asymmetric multicore architectures via deep learning algorithms. Microprocessors and Microsystems 73 (2020), 102996. https://doi.org/10.1016/j.micpro.2020.102996
[5] Soroush Bateni, Husheng Zhou, Yuankun Zhu, and Cong Liu. 2018. PredJoule: A Timing-Predictable Energy Optimization Framework for Deep Neural Networks. In RTSS. IEEE Computer Society, 107–118.
[6] Simone Bianco, Rémi Cadène, Luigi Celona, and Paolo Napoletano. 2018. Benchmark Analysis of Representative Deep Neural Network Architectures. IEEE Access 6 (2018), 64270–64277.
[7] Jeff A. Bilmes, Krste Asanovic, Chee-Whye Chin, and James Demmel. 1997. Optimizing Matrix Multiply Using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology. In International Conference on Supercomputing. ACM, 340–347.
[8] Dominik Brodowski. 2020. CPUFreq Governors. https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
[9] Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 578–594.
[11] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ISCA. IEEE Computer Society, 367–379.
[12] Intel Corporation. 2004. Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor (White Paper).
[13] Chris Cummins, Pavlos Petoumenos, Michel Steuwer, and Hugh Leather. 2015. Autotuning OpenCL Workgroup Size for Stencil Patterns. CoRR abs/1511.02490 (2015).
[14] Marat Dukhan. 2018. NNPACK, acceleration package for neural networks on multi-core CPUs. https://github.com/Maratyszcza/NNPACK
[15] Eigen. 2020. Eigen. https://eigen.tuxfamily.org/
[16] Thomas L. Falch and Anne C. Elster. 2015. Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability. In IPDPS Workshops. IEEE Computer Society, 1231–1240.
[17] L. Gomez, I. Neamtiu, T. Azim, and T. Millstein. 2013. RERAN: Timing- and touch-sensitive record and replay for Android. In 2013 35th International Conference on Software Engineering (ICSE). 72–81. https://doi.org/10.1109/ICSE.2013.6606553
[18] Google. 2019. TensorFlow: An end-to-end open source machine learning platform. https://www.tensorflow.org/
[19] Google. 2019. TensorFlow Lite: Deploy machine learning models on mobile and IoT devices. https://www.tensorflow.org/lite
[20] Google. 2020. Edge TPU. https://cloud.google.com/edge-tpu/
[21] Peter Greenhalgh. 2011. Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. https://www.cl.cam.ac.uk/~rdm34/big.LITTLE.pdf
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. [n.d.]. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. 770–778. https://doi.org/10.1109/CVPR.2016.90
[23] HiSilicon. 2019. Kirin. https://www.hisilicon.com/en/Products/ProductList/Kirin
[24] Connor Holmes, Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu. 2019. GRNN: Low-Latency and Scalable RNN Inference on GPUs. In EuroSys. ACM, 41:1–41:16.
[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. 2019. Searching for MobileNetV3. arXiv preprint, arXiv:1905.02244.
[26] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. [n.d.]. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR ([n.d.]). https://arxiv.org/abs/1704.04861
[27] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016). https://arxiv.org/abs/1602.07360
[28] Monsoon Solutions Inc. 2019. Monsoon. https://www.msoon.com/online-store
[29] Intel. 2020. OpenVINO: Deploy high-performance, deep learning inference. https://software.intel.com/en-us/openvino-toolkit
[30] Ivan Jibaja, Ting Cao, Stephen M. Blackburn, and Kathryn S. McKinley. 2016. Portable performance on asymmetric multicore processors. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, Barcelona, Spain, March 12-18, 2016, Björn Franke, Youfeng Wu, and Fabrice Rastello (Eds.). ACM, 24–35. https://doi.org/10.1145/2854038.2854047
[31] I. Jibaja, T. Cao, S. M. Blackburn, and K. S. McKinley. 2016. Portable Performance on Asymmetric Multicore Processors. In 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 24–35.
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS'12). Curran Associates Inc., USA, 1097–1105. https://dl.acm.org/citation.cfm?id=2999134.2999257
[33] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. 2004. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In ISCA. IEEE Computer Society, 64–75.
[34] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The cache performance and optimizations of blocked algorithms. ACM SIGOPS Operating Systems Review 25, Special Issue (1991), 63–74.
[35] Haidong Lan, Jintao Meng, Christian Hundt, Bertil Schmidt, Minwen Deng, Xiaoning Wang, Weiguo Liu, and Yu Qiao. 2019. FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures. IEEE Transactions on Parallel and Distributed Systems PP (09 2019), 1–1. https://doi.org/10.1109/TPDS.2019.2939785
[36] Yinan Li, Jack J. Dongarra, and Stanimire Tomov. 2009. A Note on Auto-tuning GEMM for GPUs. In Computational Science - ICCS 2009, 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 5544), Gabrielle Allen, Jaroslaw Nabrzyski, Edward Seidel, G. Dick van Albada, Jack J. Dongarra, and Peter M. A. Sloot (Eds.). Springer, 884–892. https://doi.org/10.1007/978-3-642-01970-8_89
[37] Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. 2018. On-Demand Deep Model Compression for Mobile Devices: A Usage-Driven Model Selection Framework. In MobiSys. ACM, 389–400.
[38] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single Shot MultiBox Detector. https://arxiv.org/abs/1512.02325 To appear.
[39] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN Model Inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 1025–1040. https://www.usenix.org/conference/atc19/presentation/liu-yizhi
[40] Marat Dukhan, Yiming Wu, Hao Lu, and Bert Maher. 2018. Quantized Neural Network PACKage. https://github.com/pytorch/QNNPACK
[41] Larry McVoy and Carl Staelin. 1996. Lmbench: Portable Tools for Performance Analysis. In Proceedings of the 1996 Annual Conference on USENIX Annual Technical Conference (USENIX ATC '96).
[42] Microsoft. 2019. ONNX Runtime. https://github.com/microsoft/onnxruntime
[43] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. [n.d.]. Recurrent neural network based language model. In INTERSPEECH 2010.
[44] Intel Movidius. 2020. Ultimate Performance at Ultra-Low Power: Intel Movidius Myriad X VPU. https://www.movidius.com/myriadx
[45] Cedric Nugteren and Valeriu Codreanu. 2015. CLTune: A Generic Auto-Tuner for OpenCL Kernels. In MCSoC. IEEE Computer Society, 195–202.
[46] OpenMP. 2020. The OpenMP API specification for parallel programming. https://www.openmp.org/
[47] Qualcomm. 2019. Snapdragon 845 Mobile Platform. https://www.qualcomm.com/products/snapdragon-845-mobile-platform
[48] Qualcomm. 2020. Snapdragon Neural Processing Engine SDK. https://developer.qualcomm.com/docs/snpe/overview.html
[49] Rockchip. 2020. High performance AI development platform. https://t.rock-chips.com/en/
[50] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015. https://arxiv.org/abs/1409.1556
[51] Mingcong Song, Yang Hu, Huixiang Chen, and Tao Li. 2017. Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. In HPCA. IEEE Computer Society, 1–12.
[52] Nicolas Weber and Felipe Huici. 2020. SOL: Effortless Device Support for AI Frameworks without Source Code Changes. arXiv:2003.10688 [cs.DC]
[53] Ben van Werkhoven, Jason Maassen, Henri E. Bal, and Frank J. Seinstra. 2014. Optimizing convolution operations on GPUs using adaptive tiling. In Future Generation Computer Systems. 14–26.
[54] R. Clinton Whaley and Jack J. Dongarra. 1998. Automatically tuned linear algebra software. In SC '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE, 38–38.
[55] R. Clinton Whaley and Jack J. Dongarra. 1999. Automatically Tuned Linear Algebra Software. In PPSC. SIAM.
[56] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2000. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing 27 (2000), 2001.
[57] Samuel Williams, Andrew Waterman, and David A. Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. https://doi.org/10.1145/1498765.1498785
[58] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim M. Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. 2019. Machine Learning at Facebook: Understanding Inference at the Edge. In 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, DC, USA, February 16-20, 2019. IEEE, 331–344. https://doi.org/10.1109/HPCA.2019.00048
[59] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. 2019. A First Look at Deep Learning Apps on Smartphones. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019. ACM, 2125–2136. https://doi.org/10.1145/3308558.3313591
[60] Tien-Ju Yang, Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. 2017. A method to estimate the energy consumption of deep neural networks. In ACSSC. IEEE, 1916–1920.
[61] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning. In CVPR. IEEE Computer Society, 6071–6079.
[62] T. Yu, R. Zhong, V. Janjic, P. Petoumenos, J. Zhai, H. Leather, and J. Thomson. 2021. Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors. IEEE Transactions on Parallel and Distributed Systems 32, 5 (2021), 1224–1237. https://doi.org/10.1109/TPDS.2020.3045279
[63] ZeptoLab. 2020. Cut the Rope. https://cuttherope.net/#ctr
[64] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. 2018. DeepCPU: Serving RNN-based Deep Learning Models 10x Faster. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 951–965. https://www.usenix.org/conference/atc18/presentation/zhang-minjia
