
Hardware-friendly User-specific Machine Learning

for Edge Devices

VIDUSHI GOYAL, REETUPARNA DAS, and VALERIA BERTACCO,


University of Michigan, USA

Machine learning (ML) on resource-constrained edge devices is expensive and often requires offloading
computation to the cloud, which may compromise the privacy of user data. In contrast, the type of data
processed at edge devices is user-specific and limited to a few inference classes. In this work, we explore
building smaller, user-specific machine learning models, rather than utilizing a generic, compute-intensive
machine learning model that caters to a diverse range of users. We first present a hardware-friendly, light-
weight pruning technique to create user-specific models directly on mobile platforms, while simultaneously
executing inferences. The proposed technique leverages compute sharing between pruning and inference,
customizes the backward pass of training, and chooses a pruning granularity for efficient processing on edge.
We then propose architectural support to prune user-specific models on a systolic edge ML inference accel-
erator. We demonstrate that user-specific models provide speedups of 2.9× and 2.3× on mobile CPUs for
the ResNet-50 and Inception-V3 models.

CCS Concepts: • Computer systems organization → Neural networks; Embedded hardware; • Com-
puting methodologies → Machine learning;
Additional Key Words and Phrases: Datasets, neural networks, image classification, pruning, inference,
personalized ML

ACM Reference format:


Vidushi Goyal, Reetuparna Das, and Valeria Bertacco. 2022. Hardware-friendly User-specific Machine Learn-
ing for Edge Devices. ACM Trans. Embedd. Comput. Syst. 21, 5, Article 62 (October 2022), 29 pages.
https://doi.org/10.1145/3524125

1 INTRODUCTION
Machine learning (ML) has revolutionized technology in the past decade. It offers a wide range
of applications, e.g., computer vision [51], video recognition [50], and autonomous driving [36].
Machine learning is also used to design intelligent communication systems [29] that analyze complex communication scenarios and make optimal predictions to achieve a high Quality of Service (QoS). For example, prior works [22, 33, 49] use a deep learning-based approach for
MIMO detection. Furthermore, prior works [37, 61] have proposed deep learning solutions for the

This work was supported by the Semiconductor Research Corporation (SRC), System Level Design (SLD) thrust and by the
Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.
Authors’ addresses: V. Goyal, R. Das, and V. Bertacco, University of Michigan, 2260 Hayward Street, Ann Arbor, Michigan,
48105, USA; emails: {vidushi, reetudas, valeria}@umich.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
1539-9087/2022/10-ART62 $15.00
https://doi.org/10.1145/3524125


Fig. 1. Comparison of accuracy, model-size, and execution time across various edge and cloud ML models
for ImageNet dataset.

complex task of channel estimation. As a result of its increasing popularity, researchers have stud-
ied ML extensively for various computing platforms, including CPU [16, 63], GPU [27, 58, 63], and
FPGA [15, 20, 48]. Its memory and compute-intensive nature have also led to the development of
specialized architectures, including TPU [31], NPU [4], and several ML accelerators [13, 44, 53].
Recently, machine learning has emerged as a leading technique for improving the ways humans
interact with machines, for example, voice recognition by IoT devices like Alexa/ Google Home,
face recognition by smart cameras, recommendation systems [42] for online shopping, personal-
ized news feeds, and many more. Furthermore, many frequently used smartphone applications,
like Facebook, Gallery/Photos, Instagram, Netflix, and so on, rely heavily on machine learning.
Despite its ubiquitous presence, machine learning is still one of the most latency-sensitive and
energy-intensive applications for small, resource-constrained IoT/edge devices. IoT or edge devices
are usually powered by tiny ARM cores or micro-controllers, which work along with a few special-
ized compute IP blocks or accelerators. Due to the limited compute capacity of edge devices, there
is a wide gap between the performance of ML applications on edge platforms and server or desk-
top platforms. For instance, as shown in Figure 1, there is a wide gap between the execution time
of large and accurate cloud ML models, like the Inception-V3 and Resnet-50, and small, but less
accurate, edge-device friendly ML models, like Mobilenet and Shufflenet. Apart from performance,
the compute and memory intensive nature of ML applications also make them energy intensive,
thereby reducing the battery life of edge platforms. Unlike servers or desktops that are always
plugged into a power source, edge platforms are powered by lithium-ion batteries, and the small form factor of edge devices limits battery size and charge capacity. Furthermore, the portable and often remote nature of edge devices precludes frequent charging. Hence, ideally, edge devices should
be able to compute high-precision cloud ML models at the speed of small edge ML models and
consume minimal energy.
Today, to compute these large and accurate models, edge devices follow the common practice of
offloading incoming machine learning requests to the cloud/back-end server. But such back-and-
forth communication with the cloud raises additional issues. First, communicating to cloud servers
requires fast and reliable internet connectivity, which can be a constraint in remote places. Second,
transferring to the cloud leads to additional transmission latency and energy, which can impact
overall performance and energy efficiency. Moreover, depending on the network traffic and avail-
able bandwidth, the transmission latency can result in violation of the tight latency requirements
of many popular machine learning-based applications. Third, and most important, offloading to
a back-end server requires users to share their personal data with a back-end commodity server,


leading to privacy and data-breach concerns. With the increasing frequency of cyber-attacks, shar-
ing user-data about every single activity may lead to harmful implications. For example, user-data
can be exploited to study habits or daily routines of users, which can then be manipulated by ma-
licious parties. All of the above concerns make it challenging to offload computation to the cloud
reliably.
In order to address the above concerns, emerging techniques move computation closer to the
edge/user device by computing either on edge servers or on edge devices. For example, Apos-
tolopoulos et al. [7] propose a distributed solution based on game theory techniques to optimally
offload partial computation to the multi-access edge servers in a risk-aware fashion. While an
edge server-based distributed solution, like the above work, reduces the overhead and risk of cloud
computing, computing entirely on the edge devices completely eliminates those concerns. Hence,
there have been significant efforts in pushing machine learning to edge devices [10, 52, 66]. One
such emerging technique is Federated learning [9, 34], which endorses computation of all machine
learning-related operations, ML inference, and ML training, locally at the user-device. It fine-tunes
the original model with incoming new inputs and shares only model updates rather than raw user-
data with the back-end server. Our work is inspired by this effort to keep all the ML computations
local to the user-device. We leverage the user interaction with the device to learn user preferences
without sharing any data with other devices or the cloud. We then use this knowledge to make
ML lightweight and more amenable to edge devices.
To address these challenges, in this work we present MyML, a hardware-software solution that
makes computationally intensive and accurate machine learning feasible at edge devices. The idea
behind the approach is that the machine learning models are built for accurate predictions over a
diverse range of classification classes to serve numerous users. However, individual users usually
interact with a handful of classes. Drawing upon the common knowledge that machine learning
models are over-parameterized, we explore the possibility of creating small user-specific models
based on user-preferences rather than utilizing one large, standard model for all users.
We leverage the transfer learning [62] approach to create such small, user-specific models to
improve performance and energy efficiency, instead of defaulting to the complex original model.
Transfer learning avoids the expensive and time-consuming process of creating new models from
scratch by transferring the knowledge from already available models. In transfer learning, the top
layers that are used for feature extraction can be transferred or used from other related domains,
while the bottom layers for classification are fine-tuned for new domains. It is an approach to learn
models for a new domain by re-training the currently available models with inputs belonging to
the new domain. We draw upon this insight to build small, user-specific models by simultaneously
pruning and re-training the current original model for inputs belonging to user-classes, locally at
the user device in an efficient way that is viable for resource-constrained edge devices.
We first developed a hardware cognizant software solution to create user-specific models with-
out sending user data to the cloud. We propose a hardware-friendly, bottom-up pruning scheme,
which utilizes the unique opportunity of simultaneous inference and pruning to share computa-
tion between the two. In bottom-up pruning, we prune one layer (or group of layers) at a time and
start pruning from the last layers of the model, moving up to the top layers. Bottom-up pruning
utilizes a structured pruning approach to achieve high training efficiency on edge CPU and edge
accelerator platforms. This work explores two kinds of structured channel pruning, symmetric and
asymmetric pruning, that have different trade-offs between pruning rates and pruning granular-
ity. Symmetric pruning works at coarse pruning granularity, leading to a lower pruning rate, but it
does not need a fine control mechanism and the related overhead. On the other hand, asymmetric
pruning prunes at a finer granularity, thus, yielding a high pruning rate. However, asymmetric
pruning requires a sophisticated book-keeping control mechanism for fine-grained computing,


which has a small overhead. Based on the properties of the underlying hardware, we show that
Edge accelerator platforms, like Edge TPU [1] with the 2D systolic array, can support symmetric
pruning. In contrast, asymmetric pruning can be enabled at edge CPU-only platforms, supporting
a fine-grained bookkeeping control mechanism.
We show that, for the widely accepted Resnet-50 model, our user-specific model for five user-
classes is 4.3× smaller and has comparable accuracy (≤1% accuracy drop) to the original ML model
while speeding up inference by 2.9×. For the more complex Inception-V3 model, our user-specific
model for five user-classes is 4.7× smaller and has comparable accuracy (≤1% accuracy drop) to the
original ML model while speeding up inference by 2.3×. Our first sensitivity study is for per-layer
pruning and learning rates. We show that the bottom-most group of layers contributes the most to the model size and has the highest pruning rate of 78%. The pruning rates drop gradually as we move to the top layers, which stabilizes accuracy. On the learning rate front, the bottom layers
have higher learning rates to facilitate initial fast learning. The learning rates then drop slowly
for top layers for a stable and accurate model. The second sensitivity study on training batch size,
which determines the size of the data-set required to train a model, gives an optimal batch size
of 8 with best tradeoff between dataset size, accuracy, and model size. The highest batch size of
64 with a much larger dataset did not offer significant benefits in model size and accuracy. Our
last sensitivity study, on an increasing number of user-classes, shows that our approach is scalable
to a wider set of user-classes representing an expansion of user-preferences, resulting in a model
reduction of 3.2× for 40 classes. Furthermore, our bottom-up pruning technique can converge to
a user-specific model by processing 200 images per class at a pruning/training throughput of 2.94
images/sec and 2.56 images/sec for ResNet-50 and Inception-V3, respectively, on the octa-core
Snapdragon mobile SoC.
Further, we develop a collaborative system that computes ML inferences at the edge using the
user-specific model and track any changes in user-preferences based on prediction probability
and entropy over probability distribution. Based on the estimated divergence in user-preferences,
it determines when to discard the current user-specific model and bring back the original model to
restart a new user-specific model building process. Since all the computation – inference, tracking,
building models – is carried out locally at the edge device, our proposed system ensures user
privacy.
Finally, we propose architectural support to build user-specific models on heterogeneous edge
devices comprising general-purpose CPUs and edge ML accelerators by enabling pruning on ac-
celerators designed to support just the inference. We re-purpose Edge TPU, which computes infer-
ence in int8 precision, to also support the backward pass of the pruning phase in block floating
point (BFP16) precision. We show that, by using bottom-up pruning and BFP16 precision, for
the Resnet-50 model, we can reduce the model size by 2.6× and have accuracy comparable (≤1%
accuracy drop) to the original model while speeding up inference by 1.5×. Furthermore, for the
Inception-V3 model, we can reduce the model size by 2.2× and have accuracy comparable (≤1%
accuracy drop) to the original model while speeding up inference by 2.25×. Moreover, the bottom-
up pruning technique gives a pruning/training throughput of 10 images/sec and 7.54 images/sec
for ResNet-50 and Inception-V3, respectively, on the re-purposed Edge TPU.
The remainder of this manuscript is organized as follows. Section 2 describes the process of
building user-specific models, detailing our three phase process to create user-specific models.
In Section 3 and Section 4, we discuss the pruning granularity and pruning on edge accelerator,
respectively. Section 5 describes the methodology and experimental setup. In Section 6, we evaluate
the benefits of MyML along with some important sensitivity studies, and discuss the practicality
and privacy aspect of MyML. We describe the current limitation and future scope of this work in
Section 7 and Section 8, respectively, discuss related works in Section 9, and conclude.


Fig. 2. Our three-phase end-to-end process to learn user preferences, build the user-specific model, and deploy it in real time.

2 BUILDING THE USER MODEL


Usually, machine learning models are built to serve numerous users with diverse choices and pref-
erences; however, individual users have limited preferences. In this work, we explore the possibil-
ity of creating small, user-specific models according to user-preferences, rather than resorting to
a larger generalized model for all users. We create such user-specific models based on the transfer
learning method and avail user classes as a dimension to prune big ML models.
As also shown in Figure 2, we employ a three-phase process to create and run a user-specific
model as follows:
Learning Phase: In this phase, we use a tracking mechanism to learn user-preferences based
on the output of the original model. The tracking mechanism identifies the most frequent cate-
gories/classes in the first batch of input, termed as learning window, as user classes. The number
of user classes depends on the number of categories with which the user frequently interacts. If
any category contributes to more than x% of total inputs in the learning window, we consider it
user class. In our experiments, we have a tunable learning window of 50-100 images with an ad-
justable value of x. Therefore, for x of 15%, each category must appear at least 7-8 times among the
total of 50 input trials. We also show a sensitivity study for increasing the number of user classes.
Our learning phase is deliberately simple: we mark any category appearing more than x% (a tunable parameter) of the time in the learning window as a user class, and we treat the non-frequent classes as one-time outliers. This approach can be modified to include any category encountered during the
learning phase. Another possible approach is to utilize our thresholding mechanism from the col-
laborative edge system, discussed in a later section, to send the outliers back to the cloud server.
Pruning Phase: Pruning is defined as re-training of the current model with the pruned weights set to zero. During the pruning phase, we use the incoming inputs to prune and re-train the original
model to create a user-specific model for user classes learned during the first phase. We prune a
block of layers for one epoch to optimize for training cycles. We use the output of the original
model as ground truth for pruning. The goal is to build a user-specific model that has accuracy
within 1% relative to the original model. Note, incoming inputs that do not belong to the classes
identified during the learning phase are discarded and not used for pruning.
Inference Phase: After the pruning phase is complete and all layers are pruned, we switch the
current working model at the edge device to the newly generated user-specific model. The pruning
phase is a one-time process. Once the user-specific model is ready, inference can run efficiently
without compromising accuracy until user data deviates from the current learned user classes. We
discuss the process applied when user data begins to deviate from the learned preferences in Section 2.3.
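To make the learning phase concrete, the following is a minimal sketch of the tracking step described above. It assumes the original model is available as a callable returning a top-1 class label; the names and helper structure are illustrative rather than our actual implementation.

```python
from collections import Counter

def learn_user_classes(original_model, inputs, window=50, x_percent=15):
    """Learning-phase sketch: any class whose share of predictions in the
    learning window exceeds x% is marked as a user class."""
    counts = Counter()
    for img in inputs[:window]:
        counts[original_model(img)] += 1        # top-1 prediction of the original model
    threshold = (x_percent / 100.0) * window    # e.g., 15% of 50 images = 7.5 appearances
    return {cls for cls, n in counts.items() if n >= threshold}
```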


2.1 Pruning Background


Building a user-specific model is a one-time, but expensive, process. It requires pruning and weight
updates of the original model for learned user classes, which are time and resource-consuming
processes. Pruning is essentially re-training of the model with the pruned weights set to zero. It
consists of two passes – forward pass and backward pass. The forward pass is computationally the
same as inference and is used to calculate the error between the output prediction of the current
pruned model and ground truth. This final error, along with output activation for each layer, is
utilized during the backward pass. The backward pass itself has two steps – error propagation
and weight update. In error propagation, the final error calculated during the forward pass is
propagated back to individual layers to get error contributions from each layer. This per layer
error, along with per layer output activation stored during the forward pass, is used to find weight
updates for each layer. The pruning mechanism explained above involves compute- and memory-intensive 2D and 3D convolutions, which makes it demanding for resource-constrained edge platforms.
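As a simplified illustration of the two backward-pass steps, the sketch below uses a dense (fully connected) layer in place of the 2D/3D convolutions discussed above, with a mask holding pruned weights at zero; it is a minimal sketch, not the convolution kernels we actually use.

```python
import numpy as np

def pruned_layer_backward(act_in, weights, mask, grad_out, lr=0.01):
    """One layer of the backward pass, simplified to a dense layer.
    act_in:   activations stored during the forward pass (batch x in)
    weights:  layer weights (in x out), pruned entries already zero
    mask:     1 for kept weights, 0 for pruned weights
    grad_out: error propagated from the layer below (batch x out)"""
    grad_in = grad_out @ weights.T               # error propagation to the layer above
    grad_w = act_in.T @ grad_out                 # weight-update term from stored activations
    weights = (weights - lr * grad_w) * mask     # pruned weights stay zero after the update
    return grad_in, weights
```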
The goal of this work is to build user-specific models locally at edge devices in order to preserve
user privacy. To achieve this goal, we strive to make the complete pruning process lightweight by
making the forward and backward pass less intensive. The first step is to use layer-wise pruning,
where we prune/retrain one layer at a time while keeping the rest of the layers constant.
The intuition behind layer-wise pruning is to share compute for the constant layers between the
inference pass of the original model, currently serving incoming user requests, and the forward
pass of the pruning model. Within layer-wise pruning, we explore two flavors of pruning – top-
bottom pruning and bottom-up pruning.
In top-bottom layer-wise pruning, we start pruning from the top or first layers and iteratively
move down to prune bottom layers. In this approach, when, say, layer 1 is being pruned, the re-
mainder of the bottom layers will be the same and can be shared with the original model. However,
since the original model and the pruning model start diverging at the pruned layers, there will be
two sets of incoming inputs to the shared layers – one input coming from the unpruned/original
model top layers and the other input coming from pruned model top layers. We explore the possi-
bility of using the difference (or delta) between the two sets of inputs to reduce compute. If there
are a large number of zeros in the delta/difference, there can be a significant amount of compute
sharing between the original model and the pruning model. However, contrary to our expecta-
tions, we did not find a significant fraction of zeroes in the delta. Hence, we did not pursue the
top-bottom pruning approach further.

2.2 Bottom-up Pruning


For the other option of bottom-up layer-wise pruning, we start pruning from the bottom-most layer
and move up to prune top layers in an iterative fashion. When the last layer is being pruned, all the
layers above it are the same as the original model layers and can share compute with it. Since the
original model and pruning model start diverging at the current pruning layer, there is no input
mismatch until this pruning layer. This property makes it easier to share compute between the
original model and the pruning model. Therefore, in this work, we opt for the hardware-friendly,
bottom-up pruning technique to reduce computation related to building the user-specific model.
The bottom-up pruning mechanism reduces the computation required for the forward pass, as well
as the backward pass, of the pruning process, as discussed in detail below.
2.2.1 Compute Sharing between Inference and Forward Pass. Since MyML carries out all the
computation locally at the user device, it presents a unique opportunity of re-using inference com-
putation for the forward pass of the pruning process. In bottom-up pruning, we perform layer-wise
pruning starting from the last layer, as shown in Figure 3. While layer n is being pruned, all layers


Fig. 3. Bottom-up pruning shares compute for the inference and pruning paths until layer n − 1. The paths diverge at the current pruning layer n, and weights are updated for layers n to last.

up to layer n − 1 are frozen and identical to the original model. Hence, both inference and pruning
paths can share the computation carried out until layer n − 1. Starting from layer n, the inference
path (to the left) will continue processing the original model to predict user output, while the prun-
ing path (to the right) computes the remaining portion of the forward pass of the pruning process
separately. Thus, only the current pruning layer n, and the layers below it, require separate pro-
cessing during the forward pass. The pruning path calculates the error/loss based on logits (output
of the last fully connected layer) coming from the pruned model, and inference from the original
model as ground truth. This error is then used to compute weight updates for layers n to last. Note,
we only have inference results from the original model available in real-time, and hence we use it
as ground truth for the pruning phase. This methodology still achieves an accuracy within 1% of
the original model for the dataset belonging to the user classes only.
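The compute sharing can be sketched as follows, assuming the model is available as a list of per-layer callables and the pruned copies of layers n to last as a separate list; the names are illustrative, and the sketch omits batching and loss-computation details.

```python
def shared_forward(layers, pruned_layers, n, image):
    """Bottom-up pruning sketch: layers 0..n-1 are frozen, computed once,
    and reused by both the inference path and the pruning path."""
    act = image
    for layer in layers[:n]:                # shared prefix, identical in both paths
        act = layer(act)

    inference_out = act                     # inference path: finish with the original layers
    for layer in layers[n:]:
        inference_out = layer(inference_out)

    pruned_logits = act                     # pruning path: finish with pruned copies of layers n..last
    for layer in pruned_layers:
        pruned_logits = layer(pruned_logits)

    ground_truth = inference_out.argmax()   # original model's prediction serves as the label
    return pruned_logits, ground_truth
```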
2.2.2 Reduced Backward Pass. The backward pass of pruning is a combination of 3D and 2D con-
volution for error-propagation and weight updates, respectively [40]. Bottom-up pruning provides
the opportunity to reduce the number of weight updates and computations for the backward pass.
The proposed scheme freezes all the layers until layer n−1. Thus, we do not require weight updates
for layers 1 to n −1. Weight updates are needed only for the current pruning layer n and beyond, re-
ducing overall pruning time. Furthermore, since we are progressively pruning layer by layer, when
pruning layer n, all the layers > n will have already been pruned and left with just unpruned chan-
nels/weights. Thus, the layers n + 1 to layer last require weight updates for fewer channels, further
reducing computation time as we progressively prune layers in a bottom-up fashion. A significant
advantage of the bottom-up pruning technique is that we are able to keep the original model (and its accuracy benefits) in service while reusing its top-layer compute for the pruning process. We replace the original model with the user model once all the layers, up to the first layer, are pruned.
Bottom-up pruning complexity: The proposed framework has an additional pruning path
to train the user-specific model. This comes with the added complexity of forward pass and
backward pass. As discussed before, the backward pass of a convolution layer is comprised of a 3D convolution for error propagation and a 2D convolution for the weight update, both of which are based on a matrix-matrix multiplication kernel. Furthermore, the forward pass, which is computationally similar to inference, consists of a matrix-matrix multiplication-based 3D convolution for each layer. Matrix-matrix multiplication has a worst-case computational complexity of O(n³).


Fig. 4. Collaborative edge system with a tracking unit that checks for divergence in user-preferences by
counting the number of predictions belonging to user classes.

The pruning path adds three matrix multiplication kernels, corresponding to the forward and backward passes, to update the weights of each convolution layer; thus, it increases the complexity linearly by a factor of three, i.e., 3·O(n³). Equations (1) and (2) show the computational complexity of the pruning and inference paths. In Equation (1), the number of updates varies per layer: the bottom layers, which are pruned first, receive the most updates, and the count decreases as we move toward the top layers, which are pruned last.

Total_complexity_pruning_path = Σ_{i=bottom}^{top} num_updates_lyr_i × (3 × O(n³))    (1)

Total_complexity_inference_path = total_num_layers × O(n³)    (2)
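Equations (1) and (2) can be evaluated directly from a per-layer update schedule, as in the short sketch below, where cost_n3 stands in for the O(n³) kernel cost; the schedule values are placeholders, not measured counts.

```python
def path_complexities(num_updates_per_layer, total_num_layers, cost_n3=1.0):
    """Relative costs from Equations (1) and (2)."""
    pruning_path = sum(u * 3 * cost_n3 for u in num_updates_per_layer)
    inference_path = total_num_layers * cost_n3
    return pruning_path, inference_path
```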

2.3 Collaborative Edge System


Once the user-specific model is constructed during the pruning phase, using bottom-up pruning
and the appropriate pruning granularity, we replace the original model with the user model and
enter the inference phase. However, the user can still switch or change preferences with time. If the
user-specific model trained for previous user-preferences is utilized for new preferences/choices, it
will lead to severe accuracy degradation. Hence, there is a need for a mechanism to detect a switch
in user-preferences. Therefore, for the inference phase, we develop a collaborative edge system to
find a deviation in user-preferences with a tracking unit, as shown in Figure 4. In this adaptive
system, we use the newly created user-specific model to classify incoming inputs and a tracking
unit to count both the number of inputs classified within user classes and the number of inputs
classified outside of user classes. If the tracking unit classifies the majority of inputs as outside of
user classes, we conclude that user-preferences have changed and restart the user model building
process. Since the user-specific model is biased to predict one of the user classes, we cannot directly
use the predictions to determine if the incoming input is within or outside of user classes. Instead,
we use the probability of the predicted user class and entropy of the probability distribution over
all the user classes as the metric to tag the input to user class or out-of-user class. The output
probability is high, and entropy is low for inputs belonging to the user classes. In contrast, the
output probability is low, and entropy is high (implying the model has low confidence) for inputs
belonging to out-of-user classes. Any input with an output probability less than a threshold and
entropy higher than an empirically determined threshold is marked as belonging to an out-of-user
class. For a running 50 image window, if more than 70% of images are estimated to be outside
learned user classes, we infer that user-preferences have changed. Thus, we discard the current
user-specific model and restore the original model to restart from the learning phase of the user


Fig. 5. Symmetric Channel Pruning: As a result of pruning the same channel IDs across all filters in layer n,
the corresponding (red) filter in layer n − 1 is pruned completely. All channels and connections shown in red
are pruned.
model creation process to identify new user-preferences. We assume the large original unpruned
model is stored in flash (or DRAM) and fetched from there when the CPU summons it for the classification task.
The threshold selection for output probability and entropy is guided by the statistical definition
of the parameters. For the output prediction probability, the threshold should be at least 50% to
claim that the input was confidently predicted. Thus, to make our adaptive system more robust, we
chose a slightly higher value of 60% (or 0.6). Furthermore, for the entropy threshold, we first deter-
mined the maximum entropy. The maximum entropy of the prediction probability distribution is reached when all the classes have an equal probability of prediction. For a model built for five user classes, the equal probability is 1/5 (0.2), which, upon plugging into the entropy equation H = −Σ p log₂(p), gives the maximum entropy value of 2.32. For our collaborative system, to further reinforce model confidence
and robustness, we set the entropy threshold to 1.5, which is less than half of the maximum value.
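Putting the two thresholds together, the tracking unit can be sketched as below, using the 0.6 probability threshold, the 1.5 entropy threshold, a 50-image window, and the 70% out-of-class trigger discussed above; the interface (a stream of per-input probability distributions from the user model) is assumed for illustration only.

```python
import math
from collections import deque

PROB_THRESHOLD, ENTROPY_THRESHOLD = 0.6, 1.5
WINDOW, OUT_OF_CLASS_FRACTION = 50, 0.7

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def user_preferences_changed(prob_stream):
    """prob_stream yields the user model's probability distribution per input."""
    flags = deque(maxlen=WINDOW)        # 1 = likely outside the learned user classes
    for probs in prob_stream:
        out_of_class = max(probs) < PROB_THRESHOLD and entropy(probs) > ENTROPY_THRESHOLD
        flags.append(1 if out_of_class else 0)
        if len(flags) == WINDOW and sum(flags) / WINDOW > OUT_OF_CLASS_FRACTION:
            return True                 # discard the user model and restore the original model
    return False
```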

3 PRUNING GRANULARITY
Our pruning techniques are guided by the underlying hardware computation granularity. For ex-
ample, a CPU with a SIMD width of 16 slots can compute and prune at the finest granularity of 16
continuous operations in parallel. If a few of the operations, say slots 8 and 10, are zeroed during
pruning, we cannot skip the two operations and get performance benefits, since the CPU com-
putes those 16 slots together in parallel. The CPU can skip computations only if all 16 operations
are zeroed out during pruning. Thus, to utilize the parallelism provided by the compute engine,
for example, SIMD width of CPU, we have to map only non-zero contiguous operations to each
block of compute. Hence, in this work, to create pruned user-specific models, we explore two kinds
of structured pruning techniques – symmetric channel pruning and asymmetric channel pruning.
Since channel pruning zeroes out all continuous weights in a pruned channel, we can skip com-
putations for all weights of the channel altogether to improve performance and energy efficiency.
Furthermore, prior work has shown that classic sparse formats can lead to performance loss for
pruned weight matrices [63]. Our choice of an entire channel (2D filter) as a pruning granularity
allows us to retain the dense format for inference computation, even with pruned weight matrices.

3.1 Symmetric Pruning


In symmetric channel pruning, the same channel IDs are pruned across all the 3D filters in a given
layer, which is a more constrained approach illustrated in Figure 5. Channel IDs with the lowest


Fig. 6. Asymmetric Channel Pruning: Channels are pruned independently with no restrictions. More chan-
nels are pruned with this approach because of its flexible nature. All channels and connections shown in red
are pruned.

L2 norm values, summed across all the filters, are pruned (shown in red) in layer n. This step
further leads to the pruning of filters in the previous layer n − 1, corresponding to channel IDs
pruned in the current layer n. Symmetric pruning transforms one dense convolution layer structure into another dense convolution layer structure; hence, it does not require any bookkeeping
mechanism for convolution layers to store which channels were pruned. The removal of filters in
layer n − 1 removes the corresponding output channel and the input channel for the subsequent
layer n. Therefore, in layer n, only input channels corresponding to unpruned filter channels are
present. Thus, we do not need to store any extra channel information for the convolution operation
to map input channels correctly to the remaining filter channels. Note that for identity mapping in residual networks like ResNet, we need to store the unpruned channel indices to gather unpruned
channels and drop the pruned channels in the identity path for the final addition operation of
identity branch and the parallel running convolution branch.
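The channel-selection step can be sketched with NumPy as below, assuming layer-n weights are stored as (filters, channels, kH, kW); this illustrates the summed-L2-norm criterion and the induced filter removal in layer n − 1, not our TensorFlow implementation.

```python
import numpy as np

def symmetric_prune(weights_n, weights_n_minus_1, prune_ratio=0.5):
    """Prune the same channel IDs (lowest summed L2 norm) across all filters of
    layer n and drop the corresponding filters of layer n-1.
    weights_n:         (F_n, C_n, kH, kW), where C_n == F_{n-1}
    weights_n_minus_1: (F_{n-1}, C_{n-1}, kH, kW)"""
    # L2 norm of each channel, summed across all filters of layer n
    per_channel_norm = np.sqrt((weights_n ** 2).sum(axis=(2, 3))).sum(axis=0)
    num_keep = int(weights_n.shape[1] * (1 - prune_ratio))
    keep_ids = np.sort(np.argsort(per_channel_norm)[-num_keep:])
    # Both layers stay dense; keep_ids is also what the residual identity path needs to gather
    return weights_n[:, keep_ids], weights_n_minus_1[keep_ids], keep_ids
```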

3.2 Asymmetric Pruning


In asymmetric pruning, we prune individual channel IDs with the least L2 norm values, across all
the filters, with no restriction to choose the same channel IDs across filters, as shown in Figure 6.
This makes the approach more flexible and boosts the potential for increasing the pruning rate.
Asymmetric pruning maintains the current dense model structure with only unpruned channel
weights but requires a bookkeeping mechanism to store only unpruned channels in a dense
format.
Bookkeeping: The purpose of the bookkeeping mechanism is to store only non-zero channels
and map them correctly to their corresponding input channels. Each filter of the convolution layer
requires different input channels. For instance, in Figure 6, the first channel for the top filter is
pruned; thus, only the last two channels need their corresponding input channels. Similarly, the
last two channels are pruned for the bottom filter; thus, it requires only the first input channel for
computation. Due to this asymmetric behavior, we store all the input channels in dense format.
But we need to map non-zero channels in each filter to their corresponding input channels. We
achieve this through the bookkeeping mechanism, where we only store non-zero filter/weight
channels in a contiguous manner along with some meta-data. To map weight/filters channels to
their corresponding input channels, we store a channel mask offset, the number of non-zero
channels [nnz] per filter, and the difference pointer {diff} between consecutive non-zero channels
to point to the correct input channel.


Fig. 7. Bookkeeping mechanism with channel mask offset, difference pointer {diff}, and number of non-zero channels [nnz].

Convolution is a matrix multiplication operation between the flattened 2D input and the flattened 2D weights. Each row of the input activation matrix is one input window from the Image-to-column
(Im2Col) operation, which is multiplied by different filters represented by columns of the weight
matrix. We show a small working example in Figure 7. The input pointer starts with channel mask
offset (in orange) to compute the first channel and then uses the stored difference pointer {diff} (in
blue) to move to the next non-zero channel. For instance, to skip pruned channels and move to
the next non-zero channel, we store a diff pointer value of {+3}. We simultaneously keep a counter
of the number of non-zero channels computed/visited. When the counter equals the stored number of non-zero channels (nnz) for the filter, it indicates that no more non-zero channels
are present in the filter. We use this as an indicator to perform the accumulation or aggregation
operation across all the filter channels. In Figure 7, for filter F1, we do aggregation when the
counter becomes equal to the nnz value of 2 after computing the two non-zero channels of the filter.
To move to the first non-zero channel for the next filter, F2, we reset the counter and accumulator
and then simply use the diff pointer {−2} to move to the correct location. This approach incurs
a small overhead for storing meta-data of the mask offset, difference pointer, and the number of
non-zero channels. However, this overhead corresponds to merely 24KB per layer for our pruning
rates.
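The walkthrough above can be summarized by the sketch below, written in Python for clarity; the actual implementation extends XNNPACK's SIMD GEMM, so the metadata layout and names here are illustrative. Each input channel and each kept filter channel is represented by a flattened vector.

```python
import numpy as np

def asym_conv_channelwise(inputs, nz_channels, mask_offset, diffs, nnz):
    """inputs:      dense list of per-input-channel vectors (all input channels kept)
    nz_channels: kept filter channels, stored contiguously across all filters
    mask_offset: input channel index of the very first kept channel
    diffs[k]:    step to the next kept channel's input channel (may cross filters, e.g., -2)
    nnz[f]:      number of kept channels in filter f"""
    outputs, k, in_ch = [], 0, mask_offset
    for f in range(len(nnz)):
        acc, counter = 0.0, 0
        while counter < nnz[f]:                  # visit only this filter's kept channels
            acc += float(np.dot(inputs[in_ch], nz_channels[k]))
            counter += 1
            if k + 1 < len(nz_channels):
                in_ch += diffs[k]                # diff pointer skips the pruned channels
            k += 1
        outputs.append(acc)                      # counter == nnz: aggregate for filter f
    return outputs
```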

3.3 Which Pruning Granularity to Use


The choice of pruning granularity depends on the user platform on which the machine learning
inference will be computed. Mobile platforms with only multi-core CPUs or mobile apps that pre-
fer to run machine learning on CPUs [60] should opt for asymmetric pruning because CPUs can
control and compute at a finer granularity, such as SIMD width. CPUs can efficiently handle the
bookkeeping mechanism required to support asymmetric pruning and thus benefit from high prun-
ing rates. However, edge accelerators that map machine learning computation to dense 2D systolic
arrays do not have the capability to control or skip computations at fine granularity. Thus, such


Fig. 8. FP32 precision: 1-bit sign, 23-bit mantissa, and 8-bit exponent for each element.
Fig. 9. Block floating point (BFP16) precision: 1-bit sign and 7-bit mantissa for each element; an 8-bit exponent is shared across all the elements in a block.

systolic-array-based edge accelerators can take advantage of symmetric pruning, which does not
require any bookkeeping, while still availing themselves of the benefits of pruning.

4 PRUNING ON EDGE ACCELERATOR


Edge accelerators or neural engines for machine learning-related tasks have become an integral
part of most state-of-the-art mobile SoC platforms [4, 5]. In such a heterogeneous CPU-edge
accelerator system, general-purpose CPU cores offload the incoming ML inference requests to
neural engines for faster and more energy-efficient computations. Unlike multi-core CPUs, where
the parallelism is limited by supported SIMD width (in order of 10s), edge accelerators provide a
high level of parallelism with hundreds of compute units working in parallel to produce the final
output. Therefore, in MyML, we also propose architectural techniques to enable the pruning phase
on a heterogeneous CPU-edge accelerator system. In this work, we use an accelerator modeled
after Google’s Edge TPU [1]. Edge TPU is a systolic array-based ML inference accelerator that
supports processing elements (PEs) with an 8-bit multiply-accumulate (MAC) operation. It is
used to compute the General Matrix Multiplication (GEMM) kernel that forms the backbone of
ML inference and training. For an N × N systolic array, each column computes N MAC operations
in parallel to produce an output activation. Thus, in one cycle, N output activations (one from
each column) are computed in parallel. Moreover, Edge TPU has a streamlined dataflow in which the read/write latency of inputs/outputs is hidden by the MAC computation, with only the one-time weight-loading latency exposed. Hence, Edge TPU outperforms a multi-core SoC with ARM ISA cores [2], which lacks such pipelining and overlap and also incurs general-purpose instruction processing
overheads. Edge accelerators are designed to support only inference in int8 precision due to power
and area constraints. But the backward pass of the training process needs floating-point precision
to compute weight updates. Supporting a dedicated accelerator for floating-point is expensive in
terms of area/power. It is also not justified for short pruning periods since the majority of the
time is spent on inferences. Hence, to make our solution practical for edge platforms with an
inference accelerator, we repurpose an int8 edge accelerator to enable the higher precision needed
for pruning. The repurposing technique proposed in this work can have a broader scope and can
be used for Cloud TPU or other systolic array-based ML compute engines.

4.1 Repurposed Edge TPU


In order to re-purpose the int8 precision edge accelerator for pruning, we want to leverage the
int8 computation done at Edge TPU and append it with some lightweight computation done at
the CPU. Therefore, we propose using Block Floating Point (BFP) [15] as an alternative to the
FP32 floating-point format to compute the backward pass on Edge TPU. For FP32 representation,
each element has a dedicated sign, mantissa, and exponent bits as shown in Figure 8. In contrast,
for BFP representation, each block/vector shares the same exponent across all the elements, while
each element in a block has an individual mantissa, as shown in Figure 9. The dot product of any


Fig. 10. Re-purposed Edge TPU for training.

two BFP blocks is shown in Equation (3), where m_a, m_b are the mantissa bits and e_a, e_b are the exponents of Block_a and Block_b. The dot product of the two blocks is the dot product of the mantissa bits,
while the exponents can simply be added to get the exponent for the final dot product. In MyML,
Edge TPU is used to compute the dot product of the mantissa bits of weights and input activation,
while the host CPU appends the BFP output from the Edge TPU with the sum of the exponent bits.
Thus, the BFP format maps well to Edge TPU because the computation related to shared exponents
can be conducted at CPU and the computation for mantissa bits can be completed separately at
Edge TPU. We use the BFP16 floating-point format, which has an 8-bit signed mantissa and 8-
bit exponent. The 8-bit signed mantissa can be mapped directly onto an Edge TPU architecture that supports 8-bit fixed-point MAC PEs.
Block_a = {m_a1, ..., m_an} × e_a,    Block_b = {m_b1, ..., m_bn} × e_b

Block_a · Block_b = ( Σ_{i=1}^{n} m_ai · m_bi ) × e_{a+b}    (3)
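Equation (3) maps onto the hardware split as follows: the systolic array performs only the integer dot product of the mantissas, while the host CPU adds the two shared exponents. The short sketch below is a numerical illustration of this split, not the Edge TPU datapath.

```python
import numpy as np

def bfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product of two BFP blocks.
    mant_a, mant_b: int8 mantissas of the block elements
    exp_a, exp_b:   shared (power-of-two) exponents of the two blocks"""
    mant_dot = int(np.dot(mant_a.astype(np.int32), mant_b.astype(np.int32)))  # accelerator-side work
    # The pair represents mant_dot * 2**(exp_a + exp_b); the exponent add happens on the host CPU.
    return mant_dot, exp_a + exp_b
```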
Figure 10 shows the complete block diagram for the re-purposed Edge TPU. Each column of
the systolic array is mapped to the individual filters of a layer and represents one block sharing
a common exponent. Similarly, each column of the input vector streaming into the systolic array
is mapped to an individual input window and represents one block sharing one exponent. Before
computation, BFP16 weights are loaded from DRAM into the PEs of the systolic array. This weight
loading is a one-time process, which is followed by a long computation phase. Note, only 8-bit
mantissas of loaded BFP16 weights are used during the computation phase. The 8-bit exponent
flows through directly to the output. During the computing phase, 8-bit mantissas of input blocks
are streamed into the 2D systolic array and the output of each column of the systolic array is the


dot product of filter weights and input window activation mapped to that column. In one cycle of
the computing phase, Edge TPU completes N MAC operations in each of the N columns, resulting
in N ×N completed operations in one cycle. The output from the Edge TPU is piped out to the
host CPU, which then appends the output with the sum of the exponent bits of input and weight
block to deliver the exponent bits for the final output in BFP16 format. Furthermore, to accumulate
the output for large filters spanning across multiple GEMM kernel calls, we use normalization to
convert the BFP16 output from Edge TPU to FP32. The FP32 outputs can then be easily accumulated
across multiple GEMM kernel calls to obtain the final output for the filters.
To support the CPU-repurposed Edge TPU system, we add three components at the host CPU.
The first component is floating-point to block floating point convertor (FP2BFP) that con-
verts the input and intermediate activations in FP32 precision to BFP16 precision. The mantissa
bits of the BFP16 inputs are sent to Edge TPU for further computation. The second component is
the exponent adder, which adds the 8-bit exponents of input and weight BFP16 blocks to give the
final exponent of the output block. The last component is for FP32 normalization to convert BFP16
output blocks to FP32 blocks. Thus, our system fully implements the conversion process between
FP32 and BFP16 (and vice-versa) on the host CPU, as shown in Figure 10.
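A minimal sketch of the FP2BFP convertor and the FP32 normalization step is shown below, assuming a block size of 64 and an 8-bit signed mantissa (7 mantissa bits plus sign); helper names and the rounding details are illustrative.

```python
import numpy as np

BLOCK, MANT_BITS = 64, 7    # 64 elements per block; 7 mantissa bits plus a sign bit

def fp32_to_bfp16(block):
    """FP2BFP: one shared exponent per block, int8 mantissa per element."""
    blk = np.asarray(block, dtype=np.float32)[:BLOCK]
    shared_exp = int(np.floor(np.log2(np.max(np.abs(blk)) + 1e-38)))  # largest element sets the exponent
    lsb_exp = shared_exp - MANT_BITS
    mant = np.clip(np.round(blk / 2.0 ** lsb_exp), -128, 127).astype(np.int8)
    return mant, lsb_exp

def bfp16_to_fp32(mant, lsb_exp):
    """FP32 normalization, so partial outputs can be accumulated across GEMM calls."""
    return mant.astype(np.float32) * (2.0 ** lsb_exp)
```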

4.2 Conversion Error from FP32 to BFP16


Conversion from FP32 precision, which has 23-bit of mantissa and 8-bit of exponent (as shown
in Figure 8), to BFP16, which has 8-bit of signed mantissa and 8-bit of exponent and wherein
exponent is shared by all the elements within a block (as shown in Figure 9), leads to a deviation of
current/working weight and activation values from original values. The error due to the conversion
of FP32 values to BFP16 stems from the reduction in mantissa bits and from the block size, i.e., the number
of elements in a block of BFP16. While mantissa bits have to be set to 8 to match the underlying
Edge TPU precision, the block size is flexible. Smaller block size reduces the chances of overflow or
underflow and the divergence from original (FP32) values, thus, reducing the error. However, FP32
to BFP16 conversion overhead depends on the number of BFP16 blocks to convert. It increases
for small block sizes with more BFP16 blocks. Hence, it is a trade-off between the sizes of each
block and the number of blocks. Moreover, the minimum block size is limited by the dimension
of the systolic array present at Edge TPU. The small dimension of Edge TPU (i.e., 64) constrains
the block size and reduces the chances of overflow or underflow, resulting in less error. Thus, the
block size is set to 64 to match Edge TPU’s 64x64 systolic PE block. One column/row of matrix
multiply is divided into multiple blocks of size 64, i.e., 64 elements in each block. Furthermore,
even when using a small block size of 64 and a mantissa width of 8, we observe a significant drop in model
accuracy with BFP16 precision. The original unpruned Inception-V3 model with FP32 precision has
an accuracy of 79.2% for the user-specific dataset. This accuracy drops down to 76% when using
BFP16 precision for the same unpruned original model. Thus, there is a significant 3.2% accuracy
drop from the error generated due to conversion from FP32 to BFP16 precision. We regain this
accuracy drop by retraining the model during the user-specific pruning process to achieve 79.2%
accuracy.

5 METHODOLOGY
We evaluate MyML for the image classification task with the Inception-V3 [55] and Resnet-50 [23]
models. We show results primarily for a user-dataset comprising five randomly chosen classes from the ImageNet dataset [14], representing a user-preference. We prune the model using TensorFlow's
tf-slim framework to obtain pruning rates and measure accuracy. We also extend the TensorFlow
framework to support block floating point (BFP16) precision by adding a floating-point (FP32) to
the BFP16 conversion module. This is a generalized module that can be configured for different


Table 1. Architectural Specifications for the Snapdragon 855 Octa-core SoC Representing the Mobile CPU

Cores      [email protected] GHz, [email protected] GHz, [email protected] GHz
L1 cache   1x128KB, 3x128KB, 4x128KB
L2 cache   1x512KB, 3x256KB, 4x128KB
L3 cache   2MB
DRAM       LPDDR4 6GB @ 2133 MHz, 34.1 GB/s

block-size, mantissa bits, and exponent bits. For our experiments, we have set the block size to 64,
the exponent width to 8 bits, and the mantissa width to 7 bits with one additional bit to represent the sign.
For mobile CPU performance evaluation, we use the XNNPACK [6] library, which provides a
SIMD implementation of 3D convolution using the ARM Neon ISA. We extend this library further
to add SIMD support for 2D convolution. Using the GEMM implementation of XNNPACK[6], we
can skip entire blocks, corresponding to pruned channels, for asymmetric pruning with the book-
keeping mechanism explained in Section 3. We measure execution time and energy consumption
by executing these kernels on Samsung S10e mobile phone hosting Snapdragon 855 Octa-core
mobile SoC with the complete architectural configuration listed in Table 1.
To evaluate the performance for Edge TPU, we use the SCALESim [47] simulator, which gives
compute cycles for a given systolic array configuration and assumes a TPU operating frequency of
500MHz. Since SCALESim supports only 3D convolution, we extended it to support 2D convolution
for the backward pass.
As discussed in Section 2, our learning window size is 50-100 images, and the minimum appear-
ance frequency is ∼15% of window size for a class to be marked as user-class. We divided the total
layers into four blocks and pruned one block at a time, starting from the bottom block, per our
proposed bottom-up pruning technique. Each block was trained for 40 images/user-class, which
accrued to 200 images for the five-class user subset, accounting for a total of 1,000 images for the
pruning phase. Furthermore, to optimize for accuracy as well as training cycles at the mobile de-
vice, we trained each block for one epoch, with the option of training the last block for multiple
epochs to improve model accuracy. We trained the last block for just one additional epoch to
build a robust user-specific model in our experiments.
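For reference, the block-wise schedule described above can be summarized as a small configuration sketch (values taken from the text; the structure itself is illustrative).

```python
# Bottom-up, block-wise pruning schedule for the five-class user subset.
PRUNING_SCHEDULE = {
    "num_blocks": 4,                   # layers split into four blocks, pruned bottom-up
    "images_per_class_per_block": 40,  # 40 images/user-class -> 200 images per block
    "user_classes": 5,
    "epochs_per_block": 1,
    "extra_epochs_last_block": 1,      # last block trained for one additional epoch
    "total_pruning_images": 1000,      # 4 blocks x 200 images + 200 for the extra epoch
}
```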

6 EVALUATION
We evaluate MyML on two distinct platforms – mobile CPU and Edge TPU – with various prun-
ing configurations for Inception-V3 and ResNet-50 models. For mobile CPU, we compare the user-
specific model pruned using asymmetric pruning with two baseline models: the original unpruned
model with pruning type as none and the original model pruned using channel pruning (user-
agnostic), which represents the prior user-agnostic pruning works. For TPU, we compare the
user-specific model pruned using symmetric pruning with the original model. Note, Edge TPU
is designed for dense GEMM matrix computation and cannot accrue the benefits of user-agnostic
channel pruning; hence, we do not report any user-agnostic pruning configuration for TPU.

6.1 Inception-V3
The Inception model was first developed by Szegedy et al. [54]. It was an important milestone
because it shifted the contemporary trend of building deeper models to wider models. Deeper
models are more prone to over-fitting. Hence, instead of having one filter at one level, these models
include multiple filters in one level to form a wider network. The Inception-V3 model [55] is an
advanced version that reduced computational bottlenecks.


Fig. 11. Inception-V3: Inference latency and model size for different pruning types on the mobile CPU platform, which supports Int8 precision for inference and FP32 for pruning.
Fig. 12. Inception-V3: Model accuracy for the user-specific dataset and the complete ImageNet dataset for different pruning types on the mobile CPU platform, which supports Int8 precision for inference and FP32 for pruning.

Inference Performance and Accuracy: As shown in Figure 11, we find that the user-specific
model built using asymmetric pruning on the mobile CPU is 2.3× faster, corresponding to a 4.7×
reduction in model size, as compared to the original model. Moreover, compared to the pruned
user-agnostic model, the user-specific model provides a 1.4× speedup, along with a 2.5× reduction
in model size. The newly built user-specific model has an accuracy of 78.8%, with less than a 1%
accuracy drop on the user-dataset, as compared to the original model with an accuracy of 79.2%, as
shown in Figure 12. The user-specific model has higher accuracy compared to the user-agnostic
pruned model on the user-dataset because the user-specific model is pruned (and re-trained) only for user-classes. On the other hand, the user-agnostic model is pruned to maintain a combined average accuracy across all 1,000 classes of the complete ImageNet dataset. Hence, user-agnostic pruning maintains accuracy on the complete dataset, whereas the accuracy of the user-specific model drops to <1% (close to zero) because most of those inputs fall outside the user-classes.
Thus, we can conclude that the user-specific model yields an accuracy comparable to the origi-
nal model for inputs belonging to user-classes but does not work for inputs outside user-classes,
reinforcing the correct behavior of user-specific models.
For inference on Edge TPU, we observe that inference time and model size reduce by 2.25×
and 2.2×, respectively, for the user-specific model built with symmetric pruning over the original
model (as shown in Figure 13), while maintaining an accuracy of 79.2% over the user-dataset (as
shown in Figure 14). Furthermore, similar to the mobile CPU platform, accuracy also drops to
<1% (close to zero) for the complete Imagenet dataset on the Edge TPU platform.
There are two factors that contribute to performance improvement in Edge TPU. The first is
due to the reduction of model size because of channel pruning. The second is the reduction in
the Image-to-column (Im2col) operation that is a part of the pre-processing step. Inputs can be
piped out to the Edge TPU only once they are flattened out and converted to a 2D matrix to map
to a 2D systolic array. This operation depends on the number of input channels of the convolution
layer. Since we remove complete filters and corresponding output/input activation channels as
part of symmetric pruning, we end up reducing Im2col operations as well. This leads to additional
performance benefits over the GEMM operation reduction.
Pruning Performance: In Table 2, we report the duration of the pruning phase, which comprises
1,000 images. We find that the mobile CPU with asymmetric pruning can process 2.56 images/sec,
which accumulates to a total time of 390s for the pruning phase. Edge TPU with symmetric pruning

ACM Transactions on Embedded Computing Systems, Vol. 21, No. 5, Article 62. Publication date: October 2022.
Hardware-friendly User-specific Machine Learning for Edge Devices 62:17

Fig. 13. Inception-V3: Inference latency and model size for different pruning types on the Edge TPU platform, which supports Int8 precision for inference and BFP16 for pruning.
Fig. 14. Inception-V3: Model accuracy for the user-specific dataset and the complete ImageNet dataset for different pruning types on the Edge TPU platform, which supports Int8 precision for inference and BFP16 for pruning.

Table 2. Inception-V3 Pruning Comparison between Mobile CPU and


Repurposed Edge TPU

Platform Pruning Type Precision Pruning Phase


Duration
mobile CPU User-Specific Asymmetric FP32 390.34 (s)
Edge TPU User-Specific Symmetric BFP 16 132.6 (s)

can process 7.54 images/sec, aggregating to 132s for the pruning phase. Our repurposed Edge TPU
is able to reduce pruning time by ≈3×. We expect the pruning to be a one-time cost for long
inference phases where user classes remain stable.
Energy: We also observe improvement in the energy efficiency of computing the models on our
mobile device. The energy per inference reduces to 0.98J for the user-specific model, compared
to 1.54J and 1.27J for the original and pruned user-agnostic model, respectively. This corresponds to energy reductions of 54% and 27% for the user-specific model relative to the original unpruned model and the pruned user-agnostic model, respectively.

6.2 ResNet-50
ResNet-50 is a widely used machine learning model for image recognition/classification tasks and has been broadly adopted by industry and academia. It achieves an accuracy of 75.6% with a 21.7 MB model size and is an integral part of the MLPerf [41] AI inference and training benchmark suite for datacenters, developed collaboratively by academia, research labs, and industry. It was the first network to introduce the concept of identity mapping [23], which made training easier and improved generalization. In this work, we include ResNet-50 in our experiments to demonstrate the benefits as well as the broad applicability of MyML; more generally, the MyML technique can be applied to any deep neural network with convolution layers.
Inference Performance and Accuracy: As shown in Figure 15, we find that the user-specific
model built using asymmetric pruning on the mobile CPU is 2.93× faster, corresponding to
a 4.3× reduction in model size, as compared to the original model. Moreover, compared to
the user-agnostic model, the user-specific model provides a 1.55× speedup, along with a 2.5×
reduction in model size.

Fig. 15. ResNet-50: Inference latency and model size for different pruning types on mobile CPU platform that supports Int8 precision for inference and FP32 for pruning.

Fig. 16. ResNet-50: Model accuracy for user-specific dataset and the complete Imagenet dataset for different pruning types on mobile CPU platform that supports Int8 precision for inference and FP32 for pruning.

Fig. 17. ResNet-50: Inference latency and model size for different pruning types on Edge TPU platform that supports Int8 precision for inference and BFP16 for pruning.

Fig. 18. ResNet-50: Model accuracy for user-specific dataset and the complete Imagenet dataset for different pruning types on TPU platform that supports Int8 precision for inference and BFP16 for pruning.

The newly built user-specific model has an accuracy of 73.2%, which is
within a 1% accuracy margin of the original model's 72.4% accuracy on the user-dataset, as shown in Figure 16. Also, since the user-specific model is pruned (and retrained) only for user-classes, it has significantly higher accuracy than the user-agnostic model on the user-dataset. Furthermore, for the complete dataset, with inputs belonging to classes outside the user-classes, the accuracy drops to <1% when using the user-specific model, confirming its correct behavior.
For symmetric pruning on Edge TPU, we show in Figure 17 that user-specific model size can
be reduced by 2.6× from 21.7 MB to 8.3 MB, resulting in a speedup of 1.5×. The user-specific
model also improves the accuracy to 73.6% for the user-dataset, within a 1% margin, compared to the
unpruned model accuracy of 72.4%, as shown in Figure 18. Similar to the mobile CPU platform, the
accuracy drops to <1% (close to zero) for the user-specific model on the complete Imagenet dataset.
As discussed for the Inception-V3 model, there are two factors that contribute to performance
improvement in Edge TPU. The first is the reduction of model size because of channel pruning
and the second is the reduction in the Image-to-column (Im2col) operation that is a part of the
pre-processing step.
Table 3. ResNet-50 Pruning Comparison between Mobile CPU and Repurposed Edge TPU

Platform     Pruning Type               Precision   Pruning Phase Duration
mobile CPU   User-Specific Asymmetric   FP32        340.05 s
Edge TPU     User-Specific Symmetric    BFP16       99.58 s

Fig. 19. Trace showing MyML in real time. We show the three phases – learning, pruning, and inference – of our end-to-end system as well as illustrate the working of the tracking unit that monitors the change in user preferences.

Pruning performance: In Table 3, we report the duration of the pruning phase, which comprises 1,000 images. We find that the mobile CPU with asymmetric pruning can process 2.94 images/sec, which accumulates to a total time of 340s for the pruning phase. The Edge TPU with symmetric pruning
can process 10 images/sec, aggregating to 99.58s for the pruning phase. Our repurposed Edge TPU
is able to reduce pruning time by ≈3.42×. We expect the pruning to be a one-time cost for long
inference phases where user classes remain stable.
Energy: We also observe improvement in the energy efficiency of computing models on the mobile
device. The energy per inference reduces to 0.4J for the user-specific model, compared to 0.98J and
0.67J for the original and pruned user-agnostic model, respectively.
In the rest of the evaluation section, we only show results for the Inception-V3 model. The ResNet-50 model is made up of convolution layers similar to Inception-V3 and, based on the above discussion, the behavior of user-specific models built from the two original models is consistent. Hence, trends and insights gained from the Inception-V3 model are also applicable to ResNet-50.

6.3 Adaptive System


For the adaptive system, we start with a dataset that contains only user-class inputs. After building the user-specific model and using it for predictions in the inference phase, we modify our dataset to include inputs belonging to classes outside the user-classes. We gradually increase the fraction of such outside-user-class inputs over a 50-image window, as sketched below. For the first 10 images of the window, one out of every five inputs lies outside the user-classes. For the next 10 images, two out of every five inputs lie outside the user-classes. Following this pattern, all inputs lie outside the user-classes for the last 10 images of the 50-image window.
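A minimal sketch of this transition window is shown below; the image pools and function name are hypothetical placeholders, and only the mixing schedule follows the description above.

```python
import random

def build_transition_window(user_images, outlier_images, window=50, step=10):
    """Build the 50-image transition trace described above: the fraction of
    outside-user-class inputs rises from 1/5 in the first 10 images to 5/5 in
    the last 10 images, in steps of 1/5 every 10 images."""
    trace = []
    for block in range(window // step):      # five blocks of ten images
        outliers_per_five = block + 1        # 1, 2, 3, 4, then 5 out of every 5 inputs
        for i in range(step):
            outside = (i % 5) < outliers_per_five
            pool = outlier_images if outside else user_images
            trace.append(random.choice(pool))
    return trace
```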
In Figure 19, we show the effectiveness of our proposed collaborative system on the above dis-
cussed user trace for the Inception-V3 model. We show the model building process for the given
user trace and also depict our system’s behavior once user preferences start changing. At time t = 0,
the system kick-starts from the learning phase for a 50 input window. Once it learns user classes,
it enters the pruning phase, where it uses the original model for inference while simultaneously creating a user-specific model. Once the pruning phase is complete, we switch the original model with the newly created user-specific model for inference.

Fig. 20. Per layer asymmetric channel pruning showing model size, learning rate, and pruning time for bottom-up pruning.
During all phases, our tracking mechanism checks for divergence in user preferences. For our
experiments, the tracker counts the number of outside-user-classes inputs over a window of 50
input images and reports divergence if the count exceeds a threshold of 70%. In Figure 19, we
show the running average over the tracking window. Since the tracker resets its count after each
window, we observe a seesaw pattern in our trace. As shown in the figure, when the user slowly starts changing preferences around 1,900s in real time, the tracker count shoots up and crosses the set threshold. The system then switches back to the learning phase.
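The tracking logic can be summarized with the following minimal sketch; the class name and interface are our own illustration, the 50-image window and 70% threshold follow our experiments, and the per-input outlier test is assumed to come from the entropy/probability check discussed in Section 7.

```python
class PreferenceTracker:
    """Count outside-user-class inputs over a fixed window and flag divergence
    when the count exceeds a threshold fraction of the window."""

    def __init__(self, window=50, threshold=0.7):
        self.window = window
        self.threshold = threshold
        self.seen = 0
        self.outside = 0

    def observe(self, is_outside_user_class):
        """Record one inference; return True when divergence is detected."""
        self.seen += 1
        self.outside += int(is_outside_user_class)
        diverged = self.outside > self.threshold * self.window
        if diverged or self.seen == self.window:
            # Reset every window -- this produces the seesaw pattern in the trace.
            self.seen, self.outside = 0, 0
        return diverged
```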
The trace also shows the correct and incorrect predictions by the appropriate inference model
in each phase. We see that there is a drop in the correct predictions only in the period where user
preferences transition to new classes.

6.4 Sensitivity Studies


In this section, we conducted separate studies for an in-depth analysis of the MyML technique.
Layer breakdown: The aim of this study was to understand in detail the pruning rates, pruning
times, and learning rates for individual layers comprising the deep neural network model.
Pruning rate: As shown in Figures 20 and 21 for asymmetric and symmetric channel pruning,
model size decreases as we prune more layers and move towards top layers from group 1 to group
4. Asymmetric channel pruning achieves a higher model size reduction than symmetric channel pruning because asymmetric pruning is more flexible than symmetric pruning, which is constrained in its channel ID selection. However, both of these pruning types follow similar trends for each group
of layers, as discussed below.
Group 1, consisting of layers 7a, 7b, and 7c, has the highest pruning rates, leading to a drastic reduction in model size. This reduction occurs because the bottom-most layers have a large number of filters and channels, accumulating to 12MB of model size. Furthermore, the bottom layers contribute towards the prediction stack, which makes them more amenable to pruning. The model accuracy increases on pruning group 1 because we tune the prediction layers for the specific user-classes, giving a significant boost to accuracy. As we move up in the model and prune the top layers, the model size and the corresponding pruning rates slowly decrease until group 4. This behavior occurs because the top layers are smaller in size and built for feature extraction, making them less prunable. Furthermore, as we move up in the model and prune the top layers, the accuracy drops with each group until it is similar to the original model accuracy.

Fig. 21. Per layer symmetric channel pruning showing model size, learning rate, and pruning time for bottom-up pruning.

Moving from group 3 to group 4 gives a small reduction in model size; however, it stabilizes the
error and makes the model more robust.
Pruning time: We also show the pruning time for each group of layers in Figures 20 and 21. Pruning time increases as we move up in the model, following the bottom-up pruning technique, because while pruning layer n, all the layers from layer n to the last layer are re-trained. For example, while pruning the layers in group 4, all the layers in groups 1 to 3 are re-trained as well. Thus, group 4, the top-most group of layers, takes the largest share of the time: we train almost the entire model, except the untouched top feature extraction layer, and we train for one extra epoch to obtain a stable model. Furthermore, pruning time is shorter for symmetric pruning on the Edge TPU accelerator than for asymmetric pruning on the general-purpose mobile CPU.
Learning rate: Inspired by the common training practice of starting with a high learning rate for the first few epochs and gradually lowering it in later epochs, we form a learning rate schedule for the bottom-up pruning, as shown in Figures 20 and 21. The bottom groups, comprising group 1 and group 2, have the highest learning rate of 0.001. Learning rates are reduced by an order of magnitude for the top layers in group 3 and group 4. Reducing the learning rate over time/layers allows the model to learn vigorously and jump across various local minima at the start of the pruning process, and then gradually slow down to settle at a minimum with a very low loss value.
We also observe a difference in learning rates between asymmetric and symmetric pruning for
top layers/groups. The learning rates for top layers are relatively higher for symmetric pruning.
We suspect this is because, for symmetric pruning, more error is accumulated due to floating-point
to block floating-point conversion as we move up to top layers. Therefore, there is a need to have
higher learning rates in order to evade local minima to make up for the extra error and stabilize
to a low final loss value.
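The bottom-up ordering and the group-wise learning rates can be summarized with the following sketch; the prune_group and retrain helpers, the group names, and the exact layer assignment are hypothetical placeholders, while the ordering and learning rates follow Figures 20 and 21.

```python
# Groups are ordered bottom-up: group 1 sits next to the prediction stack,
# group 4 is the last (top-most) group we touch before the untouched
# feature-extraction layers.
GROUPS = ["group1", "group2", "group3", "group4"]
LEARNING_RATES = {"group1": 1e-3, "group2": 1e-3, "group3": 1e-4, "group4": 1e-4}

def bottom_up_prune(model, user_data, prune_group, retrain):
    pruned_so_far = []
    for group in GROUPS:
        prune_group(model, group, user_data)    # remove channels in this group
        pruned_so_far.append(group)
        # Re-train every layer from the newly pruned group down to the output,
        # i.e., all groups pruned so far, at this group's learning rate; this is
        # why pruning time grows as we move up the model.
        retrain(model, layers=pruned_so_far, lr=LEARNING_RATES[group], data=user_data)
    return model
```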
Training batch size: Training batch size is an important parameter for the MyML approach to
create a user-specific model. The batch size determines the number of times we can update weights
for a given number of images/inputs in the dataset. For example, a batch size of 10 for a dataset with
100 images will lead to 10 model updates, whereas a batch size of 25 with the 100 image dataset
will give only 4 model updates. Though a smaller batch size can give more model updates, keeping
the size too low can lead to the model jumping around different local minima and not stabilizing
at a small loss value. Hence, there is a trade-off between batch size and the accuracy (robustness)
of the model. In MyML, we want to keep a small batch size to have faster model updates within a
reasonable amount of user-data. Therefore, we present a sensitivity study with a batch size in the
range of 128 to 8, as shown in Figure 22.

Fig. 22. Model size and accuracy for increasing training batch size.

Fig. 23. Scalability of user-specific model with an increasing number of user classes.

In this study, we keep the number of updates constant (at 25); thus, the dataset size varies with the training batch size. We observe that, at the largest batch
size of 128 and dataset size of 3.2k, we can achieve the smallest model size (highest pruning rate)
because the model has more data to learn and evade local minima to settle at a stable loss. A higher
pruning rate demands more data to converge at a low loss value. As the batch size reduces, the
model size increases (i.e., pruning rate reduces) to stabilize at a low stable loss with less amount of
data. However, the difference in model size is not very significant: a model created with a batch size of 8 is only 7% bigger than a model built with a batch size of 128. Across all the training batch sizes, we maintain accuracy within a 1% margin of the baseline accuracy of
79.2%. Therefore, for this work, we choose the batch size of 8 for training user-specific models that
have an accuracy within 1% of the original unpruned model.
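The relationship between the batch size, the fixed number of updates, and the amount of user data required can be illustrated with the short sketch below, which simply restates the numbers of this study.

```python
# With the number of weight updates held constant at 25, the user data required
# for pruning scales linearly with the training batch size.
UPDATES = 25
for batch_size in (128, 64, 32, 16, 8):
    print(f"batch size {batch_size:3d} -> {UPDATES * batch_size} images")
# batch size 128 -> 3200 images (the 3.2k dataset); batch size 8 -> 200 images
```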
Scalability with number of user classes: The second study measures the utility of building
a small user-specific model over user-agnostic pruning as the user diversifies its preferences or
choices. Therefore, we conduct a sensitivity study with an increasing number of user classes rep-
resenting user-preferences. As with the previously discussed results, we train these user-specific models to maintain accuracy within a 1% margin. Figure 23 shows that, even on increasing the number
of user classes from 5 to 40, we achieve significant model reduction over user-agnostic pruning. For
instance, with 40 user-classes, MyML gives 1.5× and 2.8× reduction compared to the user-agnostic
pruned and original model, respectively. This reduction in model size demonstrates the advantage
of utilizing user-specific models over the original generic model even as the user expands its pref-
erences. The increase in model size from 5 to 10 classes is 1.35×; this increase reduces to 1.13× from
10 to 20 classes and 1.1× from 20 to 40 classes. The increase in model size is highest when we ex-
pand from 5 to 10 classes, and thereafter, it tapers off as we expand to include more classes. Thus, we can infer that even as user-preferences expand beyond 40 classes, the user-specific model size will grow only by a small fraction compared to its size at 40 classes.
Ablation Study: We also performed an ablation study by choosing five known nearby classes – pickup truck, tow truck, trailer truck, tractor, and recreational R.V. – as user-classes for building a user-specific model. We found that the model size for Inception-V3 can be reduced to 6.2 MB from 23 MB while maintaining an accuracy of 79.2% for the user subset. This validates our hypothesis
that, by leveraging user-preferences, we can build tiny user-specific ML models to improve the
efficiency of ML applications on user-devices.

6.5 Discussion
To demonstrate the practicality of our approach, we apply a real-world dataset from Kaggle [3]
to the Inception-v3 model to obtain the ratio of user-classes and outliers. We test our method on
500 images. The learning phase operates on the first 100 images, and the remaining 400 images
determine the fraction of user-classes and outliers in the dataset. Unlike Imagenet, where each
image is manually processed to have exactly one object/entity, the Kaggle real-world dataset has
multiple objects in each image. Thus, we take the top-five predicted classes for each image in our
analysis, which account for 119 unique classes within the 500 image window. We mark the ten most
frequent classes in the learning window as user-classes and find that 95% of the remaining images
belong to these user-classes. Thus, we observe that the ten frequently appearing user-classes (8.4%
of total classes) consistently encapsulate 95% of images.
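The learning-phase heuristic and the coverage measurement described above can be sketched as follows; the per-image top-5 prediction format and function names are hypothetical, while the top-k selection and the coverage computation follow the description above.

```python
from collections import Counter

def learn_user_classes(learning_window_top5, k=10):
    """Mark the k most frequently predicted classes in the learning window
    as the user-classes."""
    counts = Counter(cls for top5 in learning_window_top5 for cls in top5)
    return {cls for cls, _ in counts.most_common(k)}

def coverage(user_classes, remaining_top5):
    """Fraction of the remaining images whose top-5 predictions contain a user-class."""
    hits = sum(any(cls in user_classes for cls in top5) for top5 in remaining_top5)
    return hits / len(remaining_top5)
```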
Our collaborative system has a tunable threshold for outlier tolerance. With a 95% threshold,
the collaborative system can tolerate a maximum of 5% outliers before discarding the current user-
specific model to create a new user-specific model for new preferences. We offer two solutions to
handle outliers. The first solution sends the 5% outliers to the cloud server for computation, which
applies the bulky original model, ensuring privacy for 95% of the inputs. The second solution is to
infer the 5% outliers using the bulky original model on the local edge device. The second solution
ensures privacy for all inputs by computing everything locally. Hence, MyML preserves privacy for at least 95% of user inputs regardless of the method chosen, and for 100% of user inputs when the 5% outliers are computed locally on the edge device using the original model.
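The two outlier-handling options can be summarized with the sketch below; is_outlier stands in for the entropy/probability check, and offload_to_cloud is a hypothetical RPC hook rather than an actual API.

```python
def run_inference(x, user_model, original_model, is_outlier, offload_to_cloud=None):
    """Route ~95% of inputs to the on-device user-specific model; handle the
    ~5% outliers either in the cloud (option 1) or with the original model
    kept locally on the edge device (option 2)."""
    if not is_outlier(x):
        return user_model(x)             # common case: user-specific model on device
    if offload_to_cloud is not None:
        return offload_to_cloud(x)       # option 1: ship outliers to the cloud server
    return original_model(x)             # option 2: fall back to the local original model
```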
We show that for the collaborative system, which sends the 5% outliers to the cloud, the speedup
is 2.2×. Additionally, computing the outliers on the device results in 2.1× speedup. Note that the
remaining 95% of the inputs are computed only at the edge device employing the user-specific
model. These speedups are lower than the 2.3× speedup achieved when only the user-specific
model is applied to all inputs without the differentiation of outliers or its extra computation.

7 LIMITATIONS
Though we provide an end-to-end holistic approach to learn, build, and deploy user-specific models
based on user-preferences, there are still some limitations and scope for improvement in this work.
This work assumes that there is a pre-built accurate original model ready to serve the user
that acts as ground truth. Training such a baseline model from scratch is a very expensive and
time-consuming process. However, it is a one-time process that can be done in the back-end cloud
server.
In this paper, user-specific models are derived from original models, which are convolution
layer-based deep neural networks. Hence, the user-specific models are, in turn, made of computa-
tionally complex convolution layers. There can be an alternate way to build smaller user models
from scratch comprising simpler MLP layers and much lower depth. This approach has not been
explored in this paper. The above limitation can be further expanded to study the switching point from a simpler multi-layer perceptron (MLP) layer-based user-specific model to a pruned con-
volution layer-based user-specific model. When user-preferences belong to a few classes (e.g., 5),
simpler models may provide good accuracy with smaller model sizes. However, as user-preferences
expand to a large number of classes, simpler models might not be accurate, and we may need to
switch to pruned complex models.
We determine user-preferences by simply choosing the top-k appearing classes/categories in
the learning phase window. Here, the value of k is static and pre-defined by the user or the vendor.
This can be replaced by a sophisticated dynamic approach that is independent of k.
We use entropy and output probability to determine whether the inputs belong to user-classes
or outside user-classes. Once the majority of inputs in a window are estimated to be outside
user-classes, we discard the current user-specific model and swap it with the original model. This
approach hurts the accuracy of the input window based on which we detect divergence in user-
preferences. It can be replaced by a more fine-grained approach, where we can send individual
inputs to the original model if they are marked as outside user-classes. However, such a fine-
grained system will require more robust statistics apart from prediction probability and entropy.
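As a concrete illustration of the current coarse check, the sketch below flags an input as lying outside the user-classes when the user-specific model is not confident in its prediction; the thresholds are hypothetical placeholders, since their exact values are not fixed by this work.

```python
import numpy as np

def is_outside_user_classes(probs, p_min=0.5, h_max=1.5):
    """Estimate whether an input lies outside the user-classes from the
    user-specific model's output probabilities: a low top probability or a high
    prediction entropy signals an out-of-scope input."""
    probs = np.asarray(probs, dtype=np.float64)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return (probs.max() < p_min) or (entropy > h_max)
```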

8 FUTURE SCOPE
The idea of building user-specific models can be scaled to other domains, such as natural lan-
guage processing (NLP) and recommendation systems, since users usually interact with a small
subset of words/items rather than the large corpus for which models are trained. However, each of
these domains has different kernels and bottlenecks. For example, NLP uses recurrent neural net-
work (RNN) models and recommendation systems consist of embedding tables and MLP layers.
Though the broad idea of creating user-specific models is applicable to each domain, the optimiza-
tions will be different from the current work and unique to each domain. Pruning methods will
differ for RNN layers and MLP layers from currently supported convolution layers. We are explor-
ing this direction in our future work. Furthermore, in this work, we explore only channel-based
pruning granularities to map on SIMD enabled mobile CPU cores and systolic array-based Edge
TPU. However, there is further scope for instead using weight pruning to achieve higher pruning rates and building accelerators for better hardware support.

9 RELATED WORK
In this section, we discuss relevant prior works that we encountered while developing this work.
Online learning: Federated learning [9] is an emerging edge or user device-oriented approach
that advocates learning on edge devices to eliminate privacy concerns related to sending user data
back to the cloud for training. Instead, it trains the model at the edge and shares model updates
(instead of raw data) with the back-end cloud. Federated learning enables privacy-preserving con-
tinuous training across many users but builds a generic model. Our solution is inspired by the
federated learning approach that keeps all computation local to the edge device; however, we
build smaller user-specific models in an efficient manner that is feasible at the edge device.
Another closely related work is transfer learning [62], a technique to learn models for new or
smaller domains from already available trained models. It utilizes the top layers as is from available trained models for the new domain and finetunes (sometimes prunes) the remaining layers for the new dataset. FixyNN [59] is a transfer learning-based approach that builds multiple mod-
els for different domains/datasets via transfer learning by keeping the feature extraction layers
constant and learning only the remaining layers. Thus, at run time, it shares the computation for
feature extraction across all the models and does individual computation for the rest of the model.
The above two works do not leverage user-preferences. To the best of our knowledge, MyML is the first work that utilizes transfer learning to build small, user-specific models. We develop
hardware-friendly, bottom-up pruning where computation is shared between pruning and infer-
ence, and the backward pass is simplified to make the user-specific transfer learning efficient on
resource-constrained edge devices.
Offline learning: There are works that aim to extract small student models from the baseline
teacher model to reduce model size. Knowledge distillation [26] is one such seminal work that
changes the objective function to train on soft targets, logits that are inputs of the softmax layer,
for a small dataset. FitNet [46] builds thinner and deeper networks based on knowledge distillation
to also include hints from the intermediate layers. AMC [24] is an automated technique that utilizes
reinforcement learning to provide the model compression policy for mobile devices. Prior works
[8, 11] use function preserving network transformations [12] to build new compressed models.
Pruning techniques: A common and effective approach to reducing the model size involves
pruning weights. Many prior works [21, 25, 35, 38, 63, 64] guide which weights and how they
should be pruned. We use a subset of these techniques to prune our model but exclusively for the
user-specific data, rather than the complete dataset used for building the model. Bit Prudent [57]
has utilized asymmetric pruning for an in-cache acceleration of ML inference by adding a coa-
lescing unit. However, in our work, we enable asymmetric pruning on the CPU by using channel
offset and input diff pointers for our GEMM implementation. We also compare this work with a
recent state-of-the-art work, PatDNN [43], a complementary approach to the existing problem in the industry, i.e., running large models efficiently on mobile/edge devices. It has shown a 4.4× convolution layer reduction (4.5MB) for ResNet-50 with a top-5 accuracy of 92.5% on Imagenet. Until now, we have pruned our models to maintain top-1 accuracy; however, for a fair comparison with PatDNN, we now prune our user-specific model to maintain top-5 accuracy. We show that our user-specific model for ten user-classes can reduce the convolution layer footprint to 1.02 MB (19× compression) with a top-5 accuracy of 92.6%. We also compare results for the number of FLOPs
in Resnet-50 with DMCP [18]. We find that our user-specific model, which applies asymmetric
pruning, provides an accuracy of 73.2% at the cost of 259M FLOPs for the user-specific dataset.
In contrast, for a comparable number of FLOPs, the DMCP-pruned model achieves only an ac-
curacy of 66.4% for the complete dataset. To achieve a comparable target accuracy of 74.4%, the
DMCP-pruned model entails 1.1G FLOPs. Thus, DMCP has 7 points lower accuracy than MyML
for a comparable number of FLOPs and requires four times more FLOPs to regain a comparable
accuracy.
Accelerating pruning: More recently, many works [16, 17, 30, 39, 65] have focused on accel-
erating the pruning process. Prior works [39, 65] are based on the insight that instead of waiting
for a baseline model to be fully trained before pruning it, the pruning process can be moved up and mixed with the baseline model training process. They proactively prune/remove
near-zero weights after the first few training epochs under the assumption that near-zero weights
will not revive during later training epochs. In this work, we instead opt for a more conservative
and widely recognized approach of pruning a completely trained baseline model. [16, 17] exploit
sparsity in dense DNN training computation and map it efficiently onto general-purpose CPU cores. In addition, [30] reduces the memory footprint by proposing a lossless and a lossy encoding scheme for convolution and ReLU layers to improve the performance of DNN pruning. These ap-
proaches are complementary to our proposed techniques to improve pruning performance.
Accelerating inference at edge node: Prior work [19] has reported complete analyses of com-
puting machine learning inference at edge devices and compared it with computing inference at
the cloud. According to their analysis, for the state-of-the-art devices and frameworks, conducting
inference at the edge is not energy or latency-efficient. Many prior works [10, 28, 32, 45, 52, 56, 66] have focused on efficient machine learning at the edge or mobile devices. [28, 56] have proposed
accelerators to improve machine learning on edge. [10, 45, 66] utilize input similarity to make
machine learning efficient for continuous mobile vision and speech recognition. Our solution of
building a user-specific model to improve efficiency is complementary to the above works and
can benefit from the above approaches. [52] developed a system to detect if the incoming images
are unseen by the working model at the edge device, sending only the unseen images back to the
cloud for progressive training. Previous work [32] has proposed solutions to partition the inference computation between the edge and the cloud to reduce data movement. They find that not all layers
are compute-intensive, and thus, they compute some layers at the edge device while the remain-
ing layers are computed on the cloud. However, the above two works still require significant data
movement and sharing of private user data. A study by Facebook [60] concludes that machine
learning is carried out on CPUs for most of its users. Therefore, there is a push for efficient ma-
chine learning on multi-core CPUs. Our CPU-friendly pruning approach makes our solution more
relevant.

10 CONCLUSION
To circumvent the problems arising from offloading machine learning to the cloud, in this work, we
present MyML, a hardware-software solution that supports machine learning at edge devices. We
leverage the transfer learning approach to create small, lightweight, user-specific ML models based
on user-preferences instead of defaulting to a large, compute-intensive ML model. We propose
hardware-friendly, bottom-up pruning, which can be utilized by any mobile platform, and we also
repurpose a systolic array-based edge accelerator to support user-specific transfer learning on
edge devices without any cloud services intervention, thus ensuring user privacy. We also present
a collaborative edge system that tracks deviations in user preferences to switch back to the original
model from the user-specific model and restart the model building process.

REFERENCES
[1] [n.d.]. Edge Tpu. https://cloud.google.com/edge-tpu.
[2] [n.d.]. Edge TPU Performance Benchmarks. https://coral.ai/docs/edgetpu/benchmarks/.
[3] [n.d.]. Intel Image Classification: Image Scene Classification of Multiclass. https://www.kaggle.com/puneet6060/
intel-image-classification/version/2.
[4] [n.d.]. iPhone 12 Pro Specifications. https://www.apple.com/iphone-12-pro/.
[5] [n.d.]. What is the NPU in Galaxy and What Does It Do? https://www.samsung.com/global/galaxy/what-is/npu/.
[6] [n.d.]. XNNPACK. https://github.com/google/XNNPACK.
[7] Pavlos Athanasios Apostolopoulos, Eirini Eleni Tsiropoulou, and Symeon Papavassiliou. 2020. Risk-aware data of-
floading in multi-server multi-access edge computing environment. IEEE/ACM Transactions on Networking 28, 3
(2020), 1405–1418.
[8] Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M. Kitani. 2017. N2N learning: Network to network
compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030 (2017).
[9] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe
Kiddon, Jakub Konečny, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel
Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. arXiv preprint
arXiv:1902.01046 (2019).
[10] Mark Buckler, Philip Bedoukian, Suren Jayasuriya, and Adrian Sampson. 2018. EVA2: Exploiting temporal redun-
dancy in live computer vision. arXiv preprint arXiv:1803.06312 (2018).
[11] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-level network transformation for efficient
architecture search. In International Conference on Machine Learning. PMLR, 678–687.
[12] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2015. Net2Net: Accelerating learning via knowledge transfer.
arXiv preprint arXiv:1511.05641 (2015).
[13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for
convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture
(ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40


[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image
database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[15] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay,
Michael Haselman, Logan Adams, Mahdi Ghandi, et al. 2018. A configurable cloud-scale DNN processor for real-
time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1–14.
[16] Zhangxiaowen Gong, Houxiang Ji, Christopher W. Fletcher, Christopher J. Hughes, Sara Baghsorkhi, and Josep Tor-
rellas. 2020. SAVE: Sparsity-aware vector engine for accelerating DNN training and inference on CPUs. In 2020 53rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 796–810.
[17] Zhangxiaowen Gong, Houxiang Ji, Christopher W. Fletcher, Christopher J. Hughes, and Josep Torrellas. 2020. Sparse-
Train: Leveraging dynamic sparsity in software for training DNNs on general-purpose SIMD processors. In Proceed-
ings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 279–292.
[18] Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. 2020. DMCP: Differentiable Markov Channel Pruning for
neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1539–1547.
[19] Ramyad Hadidi et al. 2019. Characterizing the deployment of deep neural networks on commercial edge devices. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC).
[20] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al.
2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. 75–84.
[21] Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning both weights and connections for efficient
neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume
1. 1135–1143.
[22] Hengtao He, Chao-Kai Wen, Shi Jin, and Geoffrey Ye Li. 2020. Model-driven deep learning for MIMO detection. IEEE
Transactions on Signal Processing 68 (2020), 1702–1715.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks.
arXiv:1603.05027 [cs.CV]
[24] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for model compression and
acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV). 784–800.
[25] Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In
Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015).
[27] Connor Holmes, Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu. 2019. GRNN: Low-latency and scalable RNN
inference on GPUs. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–16.
[28] Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, and Li-De Chen.
2019. ECNN: A block-based and highly-parallel CNN accelerator for edge inference. In Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitecture. 182–195.
[29] Xin-Lin Huang, Xiaomin Ma, and Fei Hu. 2018. Machine learning and intelligent communications. Mobile Networks
and Applications 23, 1 (2018), 68–70.
[30] Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, and Gennady Pekhimenko. 2018. Gist: Efficient data
encoding for deep neural network training. In 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 776–789.
[31] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh
Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Pro-
ceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.
[32] Yiping Kang et al. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN
Notices 52, 4 (2017), 615–629.
[33] Mehrdad Khani, Mohammad Alizadeh, Jakob Hoydis, and Phil Fleming. 2020. Adaptive neural signal detection for
massive MIMO. IEEE Transactions on Wireless Communications 19, 8 (2020), 5635–5648.
[34] Jakub Konečnỳ, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016.
Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[35] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient ConvNets.
arXiv preprint arXiv:1608.08710 (2016).
[36] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md. E. Haque, Lingjia Tang, and Jason Mars. 2018. The
architectural implications of autonomous driving: Constraints and acceleration. In Proceedings of the Twenty-Third
International Conference on Architectural Support for Programming Languages and Operating Systems. 751–766.
[37] Changqing Luo, Jinlong Ji, Qianlong Wang, Xuhui Chen, and Pan Li. 2018. Channel state information prediction for
5G wireless communications: A deep learning approach. IEEE Transactions on Network Science and Engineering 7, 1
(2018), 227–236.


[38] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A filter level pruning method for deep neural network
compression. In Proceedings of the IEEE International Conference on Computer Vision. 5058–5066.
[39] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. PruneTrain: Fast
neural network training by dynamic sparse model reconfiguration. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis. 1–13.
[40] Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and
Andreas Moshovos. 2020. TensorDash: Exploiting sparsity to accelerate deep neural network training. In 2020 53rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 781–795.
[41] Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micike-
vicius, David Patterson, Guenther Schmuelling, Hanlin Tang, et al. 2020. MLPerf: An industry standard benchmark
suite for machine learning performance. IEEE Micro 40, 2 (2020), 8–16.
[42] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park,
Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, et al. 2019. Deep learning recommendation model
for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
[43] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN:
Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the
Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems.
907–922.
[44] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany,
Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional
neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[45] Marc Riera, Jose-Maria Arnau, and Antonio González. 2018. Computation reuse in DNNs by exploiting input simi-
larity. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 57–68.
[46] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014.
FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[47] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic
CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).
[48] Mohammad Samragh, Mohammad Ghasemzadeh, and Farinaz Koushanfar. 2017. Customizing neural networks for
efficient FPGA implementation. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 85–92.
[49] Neev Samuel, Tzvi Diskin, and Ami Wiesel. 2017. Deep MIMO detection. In 2017 IEEE 18th International Workshop
on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 1–5.
[50] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos.
arXiv preprint arXiv:1406.2199 (2014).
[51] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[52] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. 2018. In-Situ
AI: Towards autonomous and incremental deep learning for IoT systems. In 2018 IEEE International Symposium on
High Performance Computer Architecture (HPCA). IEEE, 92–103.
[53] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[54] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1–9.
[55] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2818–2826.
[56] Siqi Wang, Gayathri Ananthanarayanan, Yifan Zeng, Neeraj Goel, Anuj Pathania, and Tulika Mitra. 2019. High-
throughput CNN inference on embedded ARM Big. LITTLE multicore processors. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems 39, 10 (2019), 2254–2267.
[57] Xiaowei Wang, Jiecao Yu, Charles Augustine, Ravi Iyer, and Reetuparna Das. 2019. Bit prudent in-cache accelera-
tion of deep convolutional neural networks. In 2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, 81–93.
[58] Ziheng Wang. 2020. SparseRT: Accelerating unstructured sparsity on GPUs for deep learning inference. In Proceed-
ings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 31–42.
[59] Paul N. Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, and Matthew
Mattina. 2019. FixyNN: Efficient hardware for mobile computer vision via transfer learning. arXiv preprint
arXiv:1902.11128 (2019).


[60] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad
Isaac, Yangqing Jia, Bill Jia, et al. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019
IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 331–344.
[61] Hao Ye, Geoffrey Ye Li, and Biing-Hwang Juang. 2017. Power of deep learning for channel estimation and signal
detection in OFDM systems. IEEE Wireless Communications Letters 7, 1 (2017), 114–117.
[62] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural net-
works?. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2. 3320–
3328.
[63] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel:
Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45,
2 (2017), 548–560.
[64] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and
Larry S. Davis. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 9194–9203.
[65] Jiaqi Zhang, Xiangru Chen, Mingcong Song, and Tao Li. 2019. Eager pruning: Algorithm and architecture support
for fast training of deep neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 292–303.
[66] Yuhao Zhu, Anand Samajdar, Matthew Mattina, and Paul Whatmough. 2018. Euphrates: Algorithm-SoC Co-design
for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232 (2018).

Received July 2021; revised February 2022; accepted March 2022
