Hardware-Friendly User-Specific Machine Learning For Edge Devices
Machine learning (ML) on resource-constrained edge devices is expensive and often requires offloading
computation to the cloud, which may compromise the privacy of user data. In contrast, the type of data
processed at edge devices is user-specific and limited to a few inference classes. In this work, we explore
building smaller, user-specific machine learning models, rather than utilizing a generic, compute-intensive
machine learning model that caters to a diverse range of users. We first present a hardware-friendly, light-
weight pruning technique to create user-specific models directly on mobile platforms, while simultaneously
executing inferences. The proposed technique leverages compute sharing between pruning and inference,
customizes the backward pass of training, and chooses a pruning granularity for efficient processing on edge.
We then propose architectural support to prune user-specific models on a systolic edge ML inference accel-
erator. We demonstrate that user-specific models provide a speedup of 2.9× and 2.3× on the mobile CPUs for
the ResNet-50 and Inception-V3 models.
CCS Concepts: • Computer systems organization → Neural networks; Embedded hardware; • Com-
puting methodologies → Machine learning;
Additional Key Words and Phrases: Datasets, neural networks, image classification, pruning, inference,
personalized ML
1 INTRODUCTION
Machine learning (ML) has revolutionized technology in the past decade. It offers a wide range
of applications, e.g., computer vision [51], video recognition [50], and autonomous driving [36].
Machine learning is also used to design intelligent communication systems [29] that analyze complex
communication scenarios and make optimal predictions to achieve high Quality of Service (QoS).
For example, prior works [22, 33, 49] use a deep learning-based approach for
MIMO detection. Furthermore, prior works [37, 61] have proposed deep learning solutions for the
This work was supported by the Semiconductor Research Corporation (SRC), System Level Design (SLD) thrust and by the
Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.
Authors’ addresses: V. Goyal, R. Das, and V. Bertacco, University of Michigan, 2260 Hayward Street, Ann Arbor, Michigan,
48105, USA; emails: {vidushi, reetudas, valeria}@umich.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
1539-9087/2022/10-ART62 $15.00
https://doi.org/10.1145/3524125
Fig. 1. Comparison of accuracy, model size, and execution time across various edge and cloud ML models
for the ImageNet dataset.
complex task of channel estimation. As a result of its increasing popularity, researchers have stud-
ied ML extensively for various computing platforms, including CPU [16, 63], GPU [27, 58, 63], and
FPGA [15, 20, 48]. Its memory and compute-intensive nature have also led to the development of
specialized architectures, including TPU [31], NPU [4], and several ML accelerators [13, 44, 53].
Recently, machine learning has emerged as a leading technique for improving the ways humans
interact with machines, for example, voice recognition by IoT devices like Alexa/Google Home,
face recognition by smart cameras, recommendation systems [42] for online shopping, personal-
ized news feeds, and many more. Furthermore, many frequently used smartphone applications,
like Facebook, Gallery/Photos, Instagram, Netflix, and so on, rely heavily on machine learning.
Despite its ubiquitous presence, machine learning is still one of the most latency-sensitive and
energy-intensive applications for small, resource-constrained IoT/edge devices. IoT or edge devices
are usually powered by tiny ARM cores or micro-controllers, which work along with a few special-
ized compute IP blocks or accelerators. Due to the limited compute capacity of edge devices, there
is a wide gap between the performance of ML applications on edge platforms and server or desk-
top platforms. For instance, as shown in Figure 1, there is a large difference in execution time
between large, accurate cloud ML models, like Inception-V3 and Resnet-50, and small, but less
accurate, edge-device-friendly ML models, like Mobilenet and Shufflenet. Apart from performance,
the compute and memory intensive nature of ML applications also make them energy intensive,
thereby reducing the battery life of edge platforms. Unlike servers or desktops that are always
plugged into a power source, edge platforms are powered by lithium-ion batteries, and the small form
factor of edge devices limits battery size and charge capacity. Furthermore, the portable and often
remote nature of edge devices precludes their frequent charging. Hence, ideally, edge devices should
be able to compute high-precision cloud ML models at the speed of small edge ML models and
consume minimal energy.
Today, to compute these large and accurate models, edge devices follow the common practice of
offloading incoming machine learning requests to the cloud/back-end server. But such back-and-
forth communication with the cloud raises additional issues. First, communicating to cloud servers
requires fast and reliable internet connectivity, which can be a constraint in remote places. Second,
transferring to the cloud leads to additional transmission latency and energy, which can impact
overall performance and energy efficiency. Moreover, depending on the network traffic and avail-
able bandwidth, the transmission latency can result in violation of the tight latency requirements
of many popular machine learning-based applications. Third, and most important, offloading to
a back-end server requires users to share their personal data with a back-end commodity server,
leading to privacy and data-breach concerns. With the increasing frequency of cyber-attacks, shar-
ing user-data about every single activity may lead to harmful implications. For example, user-data
can be exploited to study habits or daily routines of users, which can then be manipulated by ma-
licious parties. All of the above concerns make it challenging to offload computation to the cloud
reliably.
In order to address the above concerns, emerging techniques move computation closer to the
edge/user device by computing either on edge servers or on edge devices. For example, Apos-
tolopoulos et al. [7] propose a distributed solution based on game theory techniques to optimally
offload partial computation to the multi-access edge servers in a risk-aware fashion. While an
edge server-based distributed solution, like the above work, reduces the overhead and risk of cloud
computing, computing entirely on the edge devices completely eliminates those concerns. Hence,
there have been significant efforts in pushing machine learning to edge devices [10, 52, 66]. One
such emerging technique is Federated learning [9, 34], which endorses computing all machine
learning-related operations, both ML inference and ML training, locally at the user device. It fine-tunes
the original model with incoming new inputs and shares only model updates rather than raw user-
data with the back-end server. Our work is inspired by this effort to keep all the ML computations
local to the user-device. We leverage the user interaction with the device to learn user preferences
without sharing any data with other devices or the cloud. We then use this knowledge to make
ML lightweight and more amenable to edge devices.
To address these challenges, in this work we present MyML, a hardware-software solution that
makes computationally intensive and accurate machine learning feasible at edge devices. The idea
behind the approach is that machine learning models are built to make accurate predictions over a
diverse range of classes in order to serve numerous users. However, individual users usually
interact with a handful of classes. Drawing upon the common knowledge that machine learning
models are over-parameterized, we explore the possibility of creating small user-specific models
based on user-preferences rather than utilizing one large, standard model for all users.
We leverage the transfer learning [62] approach to create such small, user-specific models to
improve performance and energy efficiency, instead of defaulting to the complex original model.
Transfer learning avoids the expensive and time-consuming process of creating new models from
scratch by transferring the knowledge from already available models. In transfer learning, the top
layers that are used for feature extraction can be transferred or used from other related domains,
while the bottom layers for classification are fine-tuned for new domains. It is an approach to learn
models for a new domain by re-training the currently available models with inputs belonging to
the new domain. We draw upon this insight to build small, user-specific models by simultaneously
pruning and re-training the current original model for inputs belonging to user-classes, locally at
the user device in an efficient way that is viable for resource-constrained edge devices.
We first develop a hardware-cognizant software solution to create user-specific models with-
out sending user data to the cloud. We propose a hardware-friendly, bottom-up pruning scheme,
which utilizes the unique opportunity of simultaneous inference and pruning to share computa-
tion between the two. In bottom-up pruning, we prune one layer (or group of layers) at a time and
start pruning from the last layers of the model, moving up to the top layers. Bottom-up pruning
utilizes a structured pruning approach to achieve high training efficiency on edge CPU and edge
accelerator platforms. This work explores two kinds of structured channel pruning, symmetric and
asymmetric pruning, that have different trade-offs between pruning rates and pruning granular-
ity. Symmetric pruning works at coarse pruning granularity, leading to a lower pruning rate, but it
does not need a fine control mechanism and the related overhead. On the other hand, asymmetric
pruning prunes at a finer granularity, thus, yielding a high pruning rate. However, asymmetric
pruning requires a sophisticated book-keeping control mechanism for fine-grained computing,
which has a small overhead. Based on the properties of the underlying hardware, we show that
edge accelerator platforms, like the Edge TPU [1] with its 2D systolic array, can support symmetric
pruning. In contrast, asymmetric pruning can be enabled on edge CPU-only platforms, which can
support a fine-grained bookkeeping control mechanism.
We show that, for the widely accepted Resnet-50 model, our user-specific model for five user-
classes is 4.3× smaller and has comparable accuracy (≤1% accuracy drop) to the original ML model
while speeding up inference by 2.9×. For the more complex Inception-V3 model, our user-specific
model for five user-classes is 4.7× smaller and has comparable accuracy (≤1% accuracy drop) to the
original ML model while speeding up inference by 2.3×. Our first sensitivity study examines per-layer
pruning and learning rates. We show that the bottom-most group of layers contributes the most
to the model size and has the highest pruning rate of 78%. The pruning rates drop gradually as
we move to the top layers, which stabilizes accuracy. On the learning rate front, the bottom layers
have higher learning rates to facilitate initial fast learning. The learning rates then drop slowly
for the top layers, yielding a stable and accurate model. The second sensitivity study, on training batch size,
which determines the size of the dataset required to train a model, identifies an optimal batch size
of 8, offering the best tradeoff between dataset size, accuracy, and model size. The largest batch size of
128, with a much larger dataset, did not offer significant benefits in model size and accuracy. Our
last sensitivity study, on an increasing number of user-classes, shows that our approach is scalable
to a wider set of user-classes representing an expansion of user-preferences, resulting in a model
reduction of 3.2× for 40 classes. Furthermore, our bottom-up pruning technique can converge to
a user-specific model by processing 200 images per class at a pruning/training throughput of 2.94
images/sec and 2.56 images/sec for ResNet-50 and Inception-V3, respectively, on the octa-core
Snapdragon mobile SoC.
Further, we develop a collaborative system that computes ML inferences at the edge using the
user-specific model and tracks any changes in user-preferences based on prediction probability
and entropy over the probability distribution. Based on the estimated divergence in user-preferences,
it determines when to discard the current user-specific model and bring back the original model to
restart a new user-specific model building process. Since all the computation – inference, tracking,
building models – is carried out locally at the edge device, our proposed system ensures user
privacy.
Finally, we propose architectural support to build user-specific models on heterogeneous edge
devices comprising general-purpose CPUs and edge ML accelerators by enabling pruning on ac-
celerators designed to support just the inference. We re-purpose Edge TPU, which computes infer-
ence in int8 precision, to also support the backward pass of the pruning phase in block floating
point (BFP16) precision. We show that, by using bottom-up pruning and BFP16 precision, for
the Resnet-50 model, we can reduce the model size by 2.6× and have accuracy comparable (≤1%
accuracy drop) to the original model while speeding up inference by 1.5×. Furthermore, for the
Inception-V3 model, we can reduce the model size by 2.2× and have accuracy comparable (≤1%
accuracy drop) to the original model while speeding up inference by 2.25×. Moreover, the bottom-
up pruning technique gives a pruning/training throughput of 10 images/sec and 7.54 images/sec
for ResNet-50 and Inception-V3, respectively, on the re-purposed Edge TPU.
The remainder of this manuscript is organized as follows. Section 2 details our three-phase process
for building user-specific models. In Section 3 and Section 4, we discuss pruning granularity and pruning
on the edge accelerator, respectively. Section 5 describes the methodology and experimental setup. In
Section 6, we evaluate the benefits of MyML along with several important sensitivity studies, and discuss
the practicality and privacy aspects of MyML. We describe the current limitations and future scope of this
work in Sections 7 and 8, respectively, discuss related work in Section 9, and conclude.
Fig. 2. Our three-phase end-to-end process to learn user preferences, build a user-specific model, and deploy
it in real time.
Fig. 3. Bottom-up pruning shares compute between the inference and pruning paths until layer n − 1. The
paths diverge at the current pruning layer n, and weights are updated for layers n to last.
up to layer n − 1 are frozen and identical to the original model. Hence, both inference and pruning
paths can share the computation carried out until layer n − 1. Starting from layer n, the inference
path (to the left) will continue processing the original model to predict user output, while the prun-
ing path (to the right) computes the remaining portion of the forward pass of the pruning process
separately. Thus, only the current pruning layer n, and the layers below it, require separate pro-
cessing during the forward pass. The pruning path calculates the error/loss based on logits (output
of the last fully connected layer) coming from the pruned model, and inference from the original
model as ground truth. This error is then used to compute weight updates for layers n to last. Note
that only the inference results from the original model are available in real time, and hence we use
them as ground truth for the pruning phase. This methodology still achieves an accuracy within 1% of
the original model for the dataset belonging to the user classes.
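To make the shared computation concrete, the sketch below (NumPy-style Python) runs the frozen trunk once and feeds its activations to both the inference head and the partially pruned tail; `original_layers`, `pruned_tail`, and the squared-error loss on logits are illustrative placeholders rather than the exact implementation.

```python
import numpy as np

def shared_forward(x, original_layers, pruned_tail, n):
    """Sketch of the shared inference/pruning forward pass.

    original_layers : hypothetical list of callables for the original model
    pruned_tail     : hypothetical list of callables for layers n..last of the
                      (partially) pruned user-specific model
    Layers 0..n-1 are frozen and identical in both models, so their
    activations are computed once and reused by both paths.
    """
    act = x
    for layer in original_layers[:n]:        # shared, frozen trunk
        act = layer(act)

    inf_act = act
    for layer in original_layers[n:]:        # inference path (original model)
        inf_act = layer(inf_act)
    prediction = np.argmax(inf_act, axis=-1) # served to the user in real time

    prune_act = act
    for layer in pruned_tail:                # pruning path (pruned tail only)
        prune_act = layer(prune_act)

    # The original model's output acts as ground truth for the pruned tail;
    # a squared error on logits is used here purely as a placeholder loss.
    loss = np.mean((prune_act - inf_act) ** 2)
    return prediction, loss
```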
2.2.2 Reduced Backward Pass. The backward pass of pruning is a combination of 3D and 2D con-
volution for error-propagation and weight updates, respectively [40]. Bottom-up pruning provides
the opportunity to reduce the number of weight updates and computations for the backward pass.
The proposed scheme freezes all the layers until layer n − 1. Thus, we do not require weight updates
for layers 1 to n − 1. Weight updates are needed only for the current pruning layer n and beyond, re-
ducing overall pruning time. Furthermore, since we are progressively pruning layer by layer, when
pruning layer n, all the layers > n will have already been pruned and left with just unpruned chan-
nels/weights. Thus, the layers n + 1 to layer last require weight updates for fewer channels, further
reducing computation time as we progressively prune layers in a bottom-up fashion. A significant
advantage of the bottom-up pruning technique is that we are able to keep the original model
(and its accuracy benefits) while reusing the compute of the top layers for the pruning process. We
replace the original model with the user-specific model once all the layers, up to the first layer, are pruned.
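A minimal Keras-style sketch of the reduced backward pass is given below; the paper's implementation uses TensorFlow's tf-slim, so the `freeze_below` helper and its use of `layer.trainable` only illustrate how weight updates are restricted to layers n and beyond.

```python
import tensorflow as tf

def freeze_below(model: tf.keras.Model, n: int) -> tf.keras.Model:
    """Freeze layers 0..n-1 so that weight updates (and hence most of the
    backward-pass work) are restricted to layers n..last.

    Illustrative only: the paper's pruning is implemented with tf-slim, not
    Keras, but the effect of the frozen trunk is the same.
    """
    for i, layer in enumerate(model.layers):
        layer.trainable = (i >= n)   # optimizer updates apply only to layers n..last
    return model
```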
Bottom-up pruning complexity: The proposed framework has an additional pruning path
to train the user-specific model. This comes with the added complexity of a forward and a
backward pass. As discussed before, the backward pass of a convolution layer is comprised of a 3D
convolution for error propagation and a 2D convolution for weight updates, both of which are based
on the matrix-matrix multiplication kernel. Furthermore, the forward pass, which is computationally
similar to inference, consists of a matrix-matrix multiplication-based 3D convolution for each
layer. Matrix-matrix multiplication has a worst-case computational complexity of O(n³). The pruning path
adds three matrix multiplication kernels, corresponding to the forward and backward passes, to update
Fig. 4. Collaborative edge system with a tracking unit that checks for divergence in user-preferences by
counting the number of predictions belonging to user classes.
the weights of each convolution layer; thus, it increases the complexity linearly by three times, i.e.,
3 × O(n³). Equations (1) and (2) show the computational complexity of the pruning and inference paths.
For Equation (1), the number of updates varies per layer: the bottom layers, which are
pruned first, undergo more updates, and the number of updates decreases as we move to the top layers, which
are pruned toward the end.
Total_complexity_pruning_path = Σ_{i=bottom}^{top} num_updates_lyr_i × (3 × O(n³))    (1)
Fig. 5. Symmetric Channel Pruning: As a result of pruning the same channel IDs across all filters in layer n,
the corresponding (red) filter in layer n − 1 is pruned completely. All channels and connections shown in red
are pruned.
model creation process to identify new user-preferences. We assume the large original unpruned
model to be stored in flash (or DRAM) and fetched from there on being summoned by the CPU for
the classification task.
The threshold selection for output probability and entropy is guided by the statistical definition
of the parameters. For the output prediction probability, the threshold should be at least 50% to
claim that the input was confidently predicted. Thus, to make our adaptive system more robust, we
chose a slightly higher value of 60% (or 0.6). Furthermore, for the entropy threshold, we first deter-
mined the maximum entropy. The maximum entropy of the prediction probability distribution is
when all the classes have an equal probability of prediction. For a model built for five user classes,
equal probability is 1/5 (0.2), which, upon plugging into the entropy equation (−Σ p log₂(p)), gives
the maximum entropy value of 2.32. For our collaborative system, to further reinforce model confidence
and robustness, we set the entropy threshold to 1.5, which is well below the maximum value.
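The following short Python sketch reproduces this threshold reasoning; the 0.6 probability and 1.5 entropy thresholds are the values from the text, while the `confident` helper itself is a hypothetical wrapper.

```python
import math

def max_entropy(num_classes: int) -> float:
    """Entropy (in bits) of a uniform prediction distribution, the worst case."""
    p = 1.0 / num_classes
    return -sum(p * math.log2(p) for _ in range(num_classes))

def confident(probs, prob_thresh=0.6, entropy_thresh=1.5) -> bool:
    """Hypothetical confidence test on the user-specific model's softmax output."""
    top = max(probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return top >= prob_thresh and entropy <= entropy_thresh

# For five user classes the maximum entropy is log2(5) ~= 2.32 bits,
# matching the value quoted above.
print(round(max_entropy(5), 2))
```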
3 PRUNING GRANULARITY
Our pruning techniques are guided by the underlying hardware computation granularity. For ex-
ample, a CPU with a SIMD width of 16 slots can compute and prune at the finest granularity of 16
continuous operations in parallel. If a few of the operations, say slots 8 and 10, are zeroed during
pruning, we cannot skip the two operations and get performance benefits, since the CPU com-
putes those 16 slots together in parallel. The CPU can skip computations if all the 16 operations
are zeroed out during pruning. Thus, to utilize the parallelism provided by the compute engine,
for example, SIMD width of CPU, we have to map only non-zero contiguous operations to each
block of compute. Hence, in this work, to create pruned user-specific models, we explore two kinds
of structured pruning techniques – symmetric channel pruning and asymmetric channel pruning.
Since channel pruning zeroes out all continuous weights in a pruned channel, we can skip com-
putations for all weights of the channel altogether to improve performance and energy efficiency.
Furthermore, prior work has shown that classic sparse formats can lead to performance loss for
pruned weight matrices [63]. Our choice of an entire channel (2D filter) as a pruning granularity
allows us to retain the dense format for inference computation, even with pruned weight matrices.
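The sketch below illustrates, under simplified assumptions, why channel pruning preserves dense computation: whole channel blocks are dropped from both the weights and the im2col'ed input, and the remaining work is still a contiguous dense GEMM. The function and argument names are hypothetical.

```python
import numpy as np

def gemm_skip_pruned_channels(inp_cols, filters, keep_channels):
    """Sketch: channel pruning removes whole contiguous blocks, keeping GEMM dense.

    inp_cols      : im2col'ed input of shape [C, K*K, num_windows]
    filters       : layer weights of shape [F, C, K*K]
    keep_channels : indices of unpruned channels (the bookkeeping state)

    Because an entire K*K channel slice is either kept or removed, the
    remaining work is still a dense matrix multiply over contiguous blocks;
    scattered zero weights inside a SIMD block would not allow this skipping.
    """
    keep = np.asarray(keep_channels)
    W = filters[:, keep, :]                   # [F, C', K*K], still dense
    X = inp_cols[keep, :, :]                  # [C', K*K, num_windows]
    F, Cp, KK = W.shape
    return W.reshape(F, Cp * KK) @ X.reshape(Cp * KK, -1)   # [F, num_windows]
```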
Fig. 6. Asymmetric Channel Pruning: Channels are pruned independently with no restrictions. More chan-
nels are pruned with this approach because of its flexible nature. All channels and connections shown in red
are pruned.
L2 norm values, summed across all the filters, are pruned (shown in red) in layer n. This step
further leads to the pruning of filters in the previous layer n − 1, corresponding to channel IDs
pruned in the current layer n. Symmetric pruning transforms the existing dense convolution layer
structure into another dense convolution layer structure; hence, it does not require any bookkeeping
mechanism for convolution layers to record which channels were pruned. The removal of filters in
layer n − 1 removes the corresponding output channel and the input channel for the subsequent
layer n. Therefore, in layer n, only input channels corresponding to unpruned filter channels are
present. Thus, we do not need to store any extra channel information for the convolution operation
to map input channels correctly to the remaining filter channels. Note that, for identity mapping in
residual networks like ResNet, we need to store the unpruned channel indices to gather the unpruned
channels and drop the pruned channels in the identity path for the final addition of the
identity branch and the parallel convolution branch.
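A minimal NumPy sketch of symmetric channel pruning is shown below, assuming a [filters, channels, K, K] weight layout and a hypothetical `prune_frac` knob; channel IDs with the lowest L2 norms (summed across all filters of layer n) are removed from layer n, and the matching filters of layer n − 1 are dropped with them.

```python
import numpy as np

def symmetric_channel_prune(weights_n, weights_n_minus_1, prune_frac=0.5):
    """Sketch of symmetric channel pruning between adjacent convolution layers.

    weights_n         : layer n weights,   shape [F_n, C_n, K, K]
    weights_n_minus_1 : layer n-1 weights, shape [C_n, C_{n-1}, K, K]
    prune_frac        : fraction of channels to remove (hypothetical knob)

    The same channel IDs are removed from every filter of layer n, so the
    matching filters of layer n-1 are dropped as well and the convolution
    layers need no bookkeeping of pruned channels.
    """
    # L2 norm of each input-channel slice of layer n, summed across all filters.
    scores = np.sqrt((weights_n ** 2).sum(axis=(2, 3))).sum(axis=0)   # [C_n]
    n_keep = max(1, int(len(scores) * (1.0 - prune_frac)))
    keep = np.sort(np.argsort(scores)[len(scores) - n_keep:])         # kept channel IDs

    pruned_n = weights_n[:, keep, :, :]                  # same channels dropped in all filters
    pruned_n_minus_1 = weights_n_minus_1[keep, :, :, :]  # corresponding filters dropped
    return pruned_n, pruned_n_minus_1, keep
```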
Fig. 7. Bookkeeping mechanism with channel mask offset, difference point {diff}, and number of non-zero
channels [nnz].
Fig. 8. FP32 precision: 1-bit sign, 23-bit mantissa, and 8-bit exponent for each element.
Fig. 9. Block floating point (BFP16) precision: 1-bit sign and 7-bit mantissa for each element. 8-bit shared
exponent across all the elements in a block.
systolic-array-based edge accelerators can take advantage of symmetric pruning, which does not
require any bookkeeping, while still availing themselves of the benefits of pruning.
two BFP blocks is shown in Equation (3), where m_a, m_b are the mantissa bits and e_a, e_b are the exponents
of Block_a and Block_b. The dot product of the two blocks is the dot product of the mantissa bits,
while the exponents can simply be added to get the exponent for the final dot product. In MyML,
Edge TPU is used to compute the dot product of the mantissa bits of weights and input activation,
while the host CPU appends the BFP output from the Edge TPU with the sum of the exponent bits.
Thus, the BFP format maps well to Edge TPU because the computation related to shared exponents
can be conducted at CPU and the computation for mantissa bits can be completed separately at
Edge TPU. We use the BFP16 floating-point format, which has an 8-bit signed mantissa and 8-
bit exponent. The 8-bit signed mantissa can be mapped directly on an Edge TPU architecture,
supporting 8-bit fixed point MAC PEs.
Block_a = {m_a1, ..., m_an} × e_a,    Block_b = {m_b1, ..., m_bn} × e_b
Block_a · Block_b = (Σ_{i=1}^{n} m_ai · m_bi) × e_{a+b}    (3)
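The sketch below mirrors Equation (3): the integer mantissa dot product corresponds to the work done on the Edge TPU's 8-bit MAC array, and the shared-exponent addition to the host-CPU step; `bfp_dot` is an illustrative helper, not the accelerator interface.

```python
import numpy as np

def bfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product of two BFP blocks, following Equation (3).

    mant_a, mant_b : int8 mantissa vectors of the two blocks
    exp_a, exp_b   : shared integer exponents of the two blocks
    The integer mantissa dot product is what the 8-bit MAC array computes;
    adding the shared exponents is the cheap host-CPU step.
    """
    mant_dot = int(np.dot(mant_a.astype(np.int32), mant_b.astype(np.int32)))
    return mant_dot, exp_a + exp_b   # value = mant_dot * 2 ** (exp_a + exp_b)
```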
Figure 10 shows the complete block diagram for the re-purposed Edge TPU. Each column of
the systolic array is mapped to the individual filters of a layer and represents one block sharing
a common exponent. Similarly, each column of the input vector streaming into the systolic array
is mapped to an individual input window and represents one block sharing one exponent. Before
computation, BFP16 weights are loaded from DRAM into the PEs of the systolic array. This weight
loading is a one-time process, which is followed by a long computation phase. Note, only 8-bit
mantissas of loaded BFP16 weights are used during the computation phase. The 8-bit exponent
flows through directly to the output. During the computing phase, 8-bit mantissas of input blocks
are streamed into the 2D systolic array and the output of each column of the systolic array is the
dot product of filter weights and input window activation mapped to that column. In one cycle of
the computing phase, Edge TPU completes N MAC operations in each of the N columns, resulting
in N × N completed operations in one cycle. The output from the Edge TPU is piped out to the
host CPU, which then appends the output with the sum of the exponent bits of input and weight
block to deliver the exponent bits for the final output in BFP16 format. Furthermore, to accumulate
the output for large filters spanning across multiple GEMM kernel calls, we use normalization to
convert the BFP16 output from Edge TPU to FP32. The FP32 outputs can then be easily accumulated
across multiple GEMM kernel calls to obtain the final output for the filters.
To support the CPU-repurposed Edge TPU system, we add three components at the host CPU.
The first component is a floating-point to block floating-point converter (FP2BFP) that con-
verts the input and intermediate activations in FP32 precision to BFP16 precision. The mantissa
bits of the BFP16 inputs are sent to Edge TPU for further computation. The second component is
the exponent adder, which adds the 8-bit exponents of input and weight BFP16 blocks to give the
final exponent of the output block. The last component is for FP32 normalization to convert BFP16
output blocks to FP32 blocks. Thus, our system fully implements the conversion process between
FP32 and BFP16 (and vice-versa) on the host CPU, as shown in Figure 10.
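A minimal sketch of the FP2BFP conversion and its inverse is given below, assuming the shared exponent is taken from the largest element of the block; the exact rounding and block handling of the paper's TensorFlow module may differ.

```python
import numpy as np

def fp32_to_bfp16(block, mantissa_bits=7):
    """FP32 block -> (int8 mantissas, shared exponent). Minimal FP2BFP sketch."""
    block = np.asarray(block, dtype=np.float32)
    exps = np.floor(np.log2(np.abs(block) + 1e-38)).astype(np.int32)
    shared_exp = int(exps.max()) + 1              # +1 so the largest mantissa fits in 8 signed bits
    scale = 2.0 ** (shared_exp - mantissa_bits)   # weight of one mantissa LSB
    mantissas = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return mantissas, shared_exp

def bfp16_to_fp32(mantissas, shared_exp, mantissa_bits=7):
    """Inverse conversion used for FP32 normalization/accumulation on the host."""
    return mantissas.astype(np.float32) * 2.0 ** (shared_exp - mantissa_bits)
```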
5 METHODOLOGY
We evaluate MyML for the image classification task with the Inception-V3 [55] and Resnet-50 [23]
models. We show results primarily for a user-dataset comprising five randomly chosen classes
from Imagenet dataset [14], representing user-preference. We prune the model using TensorFlow’s
tf-slim framework to obtain pruning rates and measure accuracy. We also extend the TensorFlow
framework to support block floating point (BFP16) precision by adding a floating-point (FP32) to
the BFP16 conversion module. This is a generalized module that can be configured for different
block-size, mantissa bits, and exponent bits. For our experiments, we set the block size to 64, the
exponent width to 8 bits, and the mantissa width to 7 bits, with one additional bit for the sign.
For mobile CPU performance evaluation, we use the XNNPACK [6] library, which provides a
SIMD implementation of 3D convolution using the ARM Neon ISA. We extend this library further
to add SIMD support for 2D convolution. Using the GEMM implementation of XNNPACK[6], we
can skip entire blocks, corresponding to pruned channels, for asymmetric pruning with the book-
keeping mechanism explained in Section 3. We measure execution time and energy consumption
by executing these kernels on Samsung S10e mobile phone hosting Snapdragon 855 Octa-core
mobile SoC with the complete architectural configuration listed in Table 1.
To evaluate the performance for Edge TPU, we use the SCALESim [47] simulator, which gives
compute cycles for a given systolic array configuration and assumes a TPU operating frequency of
500MHz. Since SCALESim supports only 3D convolution, we extended it to support 2D convolution
for the backward pass.
As discussed in Section 2, our learning window size is 50-100 images, and the minimum appear-
ance frequency is ∼15% of window size for a class to be marked as user-class. We divided the total
layers into four blocks and pruned one block at a time, starting from the bottom block, per our
proposed bottom-up pruning technique. Each block was trained for 40 images/user-class, which
accrued to 200 images for the five-class user subset, accounting for a total of 1,000 images for the
pruning phase. Furthermore, to optimize for accuracy as well as training cycles at the mobile de-
vice, we trained each block for one epoch, with the option of training the last block for multiple
epochs to improve model accuracy. In our experiments, we trained the last block for just one additional
epoch to build a robust user-specific model.
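The learning-phase selection of user classes can be summarized by the short sketch below, which assumes top-1 predictions from the original model and the window size and ~15% frequency threshold quoted above; `select_user_classes` is a hypothetical helper.

```python
from collections import Counter

def select_user_classes(predictions, window_size=100, min_frac=0.15):
    """Pick user classes from the learning window of top-1 predictions.

    predictions : class IDs produced by the original model during the
                  learning window (50-100 images in the text)
    A class is marked as a user class if it appears in at least ~15% of
    the window.
    """
    window = predictions[:window_size]
    counts = Counter(window)
    threshold = min_frac * len(window)
    return sorted(cls for cls, cnt in counts.items() if cnt >= threshold)
```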
6 EVALUATION
We evaluate MyML on two distinct platforms – mobile CPU and Edge TPU – with various prun-
ing configurations for Inception-V3 and ResNet-50 models. For mobile CPU, we compare the user-
specific model pruned using asymmetric pruning with two baseline models: the original unpruned
model with pruning type as none and the original model pruned using channel pruning (user-
agnostic), which represents the prior user-agnostic pruning works. For TPU, we compare the
user-specific model pruned using symmetric pruning with the original model. Note, Edge TPU
is designed for dense GEMM matrix computation and cannot accrue the benefits of user-agnostic
channel pruning; hence, we do not report any user-agnostic pruning configuration for TPU.
6.1 Inception-V3
The inception model was first developed by Szegedy et al. [54]. It was an important milestone
because it shifted the contemporary trend of building deeper models to wider models. Deeper
models are more prone to over-fitting. Hence, instead of having one filter at one level, these models
include multiple filters in one level to form a wider network. The Inception-V3 model [55] is an
advanced version that reduced computational bottlenecks.
Fig. 11. Inception-V3: Inference latency and model size for different pruning types on the mobile CPU
platform that supports Int8 precision for inference and FP32 for pruning.
Fig. 12. Inception-V3: Model accuracy for the user-specific dataset and the complete Imagenet dataset for
different pruning types on the mobile CPU platform that supports Int8 precision for inference and FP32 for
pruning.
Inference Performance and Accuracy: As shown in Figure 11, we find that the user-specific
model built using asymmetric pruning on the mobile CPU is 2.3× faster, corresponding to a 4.7×
reduction in model size, as compared to the original model. Moreover, compared to the pruned
user-agnostic model, the user-specific model provides a 1.4× speedup, along with a 2.5× reduction
in model size. The newly built user-specific model has an accuracy of 78.8%, with less than a 1%
accuracy drop on the user-dataset, as compared to the original model with an accuracy of 79.2%, as
shown in Figure 12. The user-specific model has higher accuracy compared to the user-agnostic
pruned model for the user-dataset because the user-specific model is pruned (and re-trained) only
for user-classes. On the other hand, the user-agnostic model is pruned to maintain combined average
accuracy across all the 1,000 classes of the complete Imagenet dataset. Hence, user-agnostic pruning
maintains accuracy on the complete dataset, whereas accuracy drops to <1% (close to zero) for the
user-specific model because the inputs belong to classes outside the user-classes.
Thus, we can conclude that the user-specific model yields an accuracy comparable to the origi-
nal model for inputs belonging to user-classes but does not work for inputs outside user-classes,
reinforcing the correct behavior of user-specific models.
For inference on Edge TPU, we observe that inference time and model size reduce by 2.25×
and 2.2×, respectively, for the user-specific model built with symmetric pruning over the original
model (as shown in Figure 13), while maintaining an accuracy of 79.2% over the user-dataset (as
shown in Figure 14). Furthermore, similar to the mobile CPU platform, accuracy also drops to
<1% (close to zero) for the complete Imagenet dataset on the Edge TPU platform.
There are two factors that contribute to performance improvement in Edge TPU. The first is
due to the reduction of model size because of channel pruning. The second is the reduction in
the Image-to-column (Im2col) operation that is a part of the pre-processing step. Inputs can be
piped out to the Edge TPU only once they are flattened out and converted to a 2D matrix to map
to a 2D systolic array. This operation depends on the number of input channels of the convolution
layer. Since we remove complete filters and corresponding output/input activation channels as
part of symmetric pruning, we end up reducing Im2col operations as well. This leads to additional
performance benefits over the GEMM operation reduction.
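For reference, a simplified im2col routine (stride 1, no padding) is sketched below; its output height, and therefore its cost, grows with the number of input channels, which is why removing channels through symmetric pruning also shrinks this pre-processing step.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Minimal im2col (no padding): flatten each KxK window into a column.

    x : input activation of shape [C, H, W]
    Returns a matrix of shape [C*k*k, num_windows]; its height, and hence
    the cost of building it, grows linearly with the number of input
    channels C, so symmetric channel pruning also shrinks this step.
    """
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            cols[:, idx] = x[:, i:i + k, j:j + k].reshape(-1)
            idx += 1
    return cols
```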
Pruning Performance: In Table 2, we report the duration of the pruning phase, comprising
1,000 images. We find that the mobile CPU with asymmetric pruning can process 2.56 images/sec,
which accumulates to a total time of 390s for the pruning phase. Edge TPU with symmetric pruning
Fig. 13. Inception-V3: Inference latency and model size for different pruning types on the Edge TPU platform
that supports Int8 precision for inference and BFP16 for pruning.
Fig. 14. Inception-V3: Model accuracy for the user-specific dataset and the complete Imagenet dataset for
different pruning types on the TPU platform that supports Int8 precision for inference and BFP16 for
pruning.
can process 7.54 images/sec, aggregating to 132s for the pruning phase. Our repurposed Edge TPU
is able to reduce pruning time by ≈3×. We expect the pruning to be a one-time cost for long
inference phases where user classes remain stable.
Energy: We also observe improvement in the energy efficiency of computing the models on our
mobile device. The energy per inference reduces to 0.98J for the user-specific model, compared
to 1.54J and 1.27J for the original and pruned user-agnostic model, respectively. This results in
energy reductions of 54% and 27%, respectively, for the user-specific model compared to the pruned
original model and the original unpruned model.
6.2 Resnet-50
ResNet-50 is a crucial machine learning model for image recognition/classification tasks, and has
been widely adopted by industry and academia. It is an integral part of the MLPerf [41] AI inference
and training benchmark suite for datacenters, developed in collaboration by academia, research labs,
and industry. ResNet-50 offers a reasonable accuracy of 75.6% with a 21.7 MB model size. It was the first network
to introduce the concept of identity mapping [23], which made training easier and improved gen-
eralization. In this work, we include ResNet-50 in our experiments to demonstrate the benefits as
well as the broad applicability of MyML. We generalize that the MyML technique can be applied
to any deep neural network with convolution layers.
Inference Performance and Accuracy: As shown in Figure 15, we find that the user-specific
model built using asymmetric pruning on the mobile CPU is 2.93× faster, corresponding to
a 4.3× reduction in model size, as compared to the original model. Moreover, compared to
the user-agnostic model, the user-specific model provides a 1.55× speedup, along with a 2.5×
reduction in model size. The newly built user-specific model has an accuracy of 73.2%, which is
Fig. 15. ResNet-50: Inference latency and model size for different pruning types on the mobile CPU platform
that supports Int8 precision for inference and FP32 for pruning.
Fig. 16. ResNet-50: Model accuracy for the user-specific dataset and the complete Imagenet dataset for
different pruning types on the mobile CPU platform that supports Int8 precision for inference and FP32 for
pruning.
Fig. 17. ResNet-50: Inference latency and model size for different pruning types on the Edge TPU platform
that supports Int8 precision for inference and BFP16 for pruning.
Fig. 18. ResNet-50: Model accuracy for the user-specific dataset and the complete Imagenet dataset for
different pruning types on the TPU platform that supports Int8 precision for inference and BFP16 for pruning.
within a 1% accuracy margin, compared to the original model with 72.4% on the user-dataset, as shown
in Figure 16. Also, since the user-specific model is pruned (and retrained) only for user-classes,
it has significantly higher accuracy as compared to the user-agnostic model for the user-dataset.
Furthermore, for the complete dataset with inputs belonging to outside user-classes, the accuracy
drops to <1% on using the user-specific model, ensuring its correct behavior.
For symmetric pruning on Edge TPU, we show in Figure 17 that user-specific model size can
be reduced by 2.6× from 21.7 MB to 8.3 MB, resulting in a speedup of 1.5×. The user-specific
model also improves the accuracy to 73.6% for the user-dataset, within 1% margin, compared to
unpruned model accuracy of 72.4%, as shown in Figure 18. Similar to the mobile CPU platform, the
accuracy drops to <1% (close to zero) for the user-specific model on the complete Imagenet dataset.
As discussed for the Inception-V3 model, there are two factors that contribute to performance
improvement in Edge TPU. The first is the reduction of model size because of channel pruning
and the second is the reduction in the Image-to-column (Im2col) operation that is a part of the
pre-processing step.
Pruning performance: In Table 3, we report the duration of the pruning phase, comprising
1,000 images. We find that the mobile CPU with asymmetric pruning can process 2.94 images/sec,
Fig. 19. Trace showing MyML in real time. We show the three phases – learning, pruning, and inference – of
our end-to-end system, and illustrate the working of the tracking unit that monitors changes in user
preferences.
which accumulates to a total time of 340s for the pruning phase. Edge TPU with symmetric pruning
can process 10 images/sec, aggregating to 99.58s for the pruning phase. Our repurposed Edge TPU
is able to reduce pruning time by ≈3.42×. We expect the pruning to be a one-time cost for long
inference phases where user classes remain stable.
Energy: We also observe improvement in the energy efficiency of computing models on the mobile
device. The energy per inference reduces to 0.4J for the user-specific model, compared to 0.98J and
0.67J for the original and pruned user-agnostic model, respectively.
In the rest of the evaluation section, we only show results for the Inception-v3 model. The
ResNet-50 model is made up of convolution layers similar to Inception-V3 and, based on the above
discussion, the behavior of user-specific models built from the two original models is consistent.
Hence, trends and insights gained from the Inception-V3 model will also apply to ResNet-50.
Fig. 20. Per layer asymmetric channel pruning showing model size, learning rate, and pruning time for
bottom-up pruning.
creating a user-specific model. Once the pruning phase is complete, we switch the original model
with the newly created user-specific model for inference.
During all phases, our tracking mechanism checks for divergence in user preferences. For our
experiments, the tracker counts the number of outside-user-classes inputs over a window of 50
input images and reports divergence if the count exceeds a threshold of 70%. In Figure 19, we
show the running average over the tracking window. Since the tracker resets its count after each
window, we observe a seesaw pattern in our trace. As shown in the figure, when the user slowly
starts changing the preferences around 1,900s in real-time, the tracker count shoots up and crosses
the set threshold. The system then switches to a learning phase.
The trace also shows the correct and incorrect predictions by the appropriate inference model
in each phase. We see that there is a drop in the correct predictions only in the period where user
preferences transition to new classes.
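A minimal sketch of this tracking unit is shown below; the 50-input window and 70% divergence threshold come from the text, while the class/confidence bookkeeping in `PreferenceTracker` is an illustrative simplification.

```python
class PreferenceTracker:
    """Illustrative divergence tracker for the collaborative edge system.

    Over each window of `window_size` inputs it counts predictions that fall
    outside the current user classes (or that fail the confidence test on the
    user-specific model's output); exceeding `divergence_frac` of the window
    triggers a switch back to the original model and a new learning phase.
    """
    def __init__(self, user_classes, window_size=50, divergence_frac=0.7):
        self.user_classes = set(user_classes)
        self.window_size = window_size
        self.divergence_frac = divergence_frac
        self.seen = 0
        self.outside = 0

    def observe(self, predicted_class, is_confident) -> bool:
        """Returns True when user preferences are deemed to have diverged."""
        self.seen += 1
        if predicted_class not in self.user_classes or not is_confident:
            self.outside += 1
        diverged = self.outside > self.divergence_frac * self.window_size
        if self.seen == self.window_size:   # reset per window (the seesaw in Fig. 19)
            self.seen = 0
            self.outside = 0
        return diverged
```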
Fig. 21. Per layer symmetric channel pruning showing model size, learning rate, and pruning time for bottom-
up pruning.
Moving from group 3 to group 4 gives a small reduction in model size; however, it stabilizes the
error and makes the model more robust.
Pruning time: We also show the pruning time for each group of layers in Figures 20 and 21.
Pruning time increases as we move up in the model, following the bottom-up pruning technique,
because while pruning layer n, all the layers from layer n to last layer will be re-trained. For ex-
ample, while pruning layers in group 4, all the layers in the groups 1 to 3 will be re-trained. Thus,
group 4, the top-most group of layers, takes a big chunk of time. This is because we train almost
the entire model, except the untouched top feature extraction layer, and we train for one extra
epoch to get a stable model. Furthermore, pruning time is shorter for symmetric pruning on the
edge TPU accelerator as compared to asymmetric pruning on general purpose mobile CPU.
Learning rate: Inspired by a commonly used training procedure that starts with a high learning
rate for the first few epochs and lowers it gradually for later epochs, we also form a learning rate
schedule for the bottom-up pruning, as shown in Figures 20 and 21. The bottom groups, comprising
group 1 and group 2, have the highest learning rate of 0.001. Learning rates are reduced by an order
of magnitude for the top layers in group 3 and group 4. Reducing the learning rate over time/layers allows
the model to learn vigorously and jump around various local minima at the start of the pruning
process, and then gradually slow down to settle on a global minimum with a very low loss value.
We also observe a difference in learning rates between asymmetric and symmetric pruning for
top layers/groups. The learning rates for top layers are relatively higher for symmetric pruning.
We suspect this is because, for symmetric pruning, more error is accumulated due to floating-point
to block floating-point conversion as we move up to top layers. Therefore, there is a need to have
higher learning rates in order to evade local minima to make up for the extra error and stabilize
to a low final loss value.
Training batch size: Training batch size is an important parameter for the MyML approach to
create a user-specific model. The batch size determines the number of times we can update weights
for a given number of images/inputs in the dataset. For example, a batch size of 10 for a dataset with
100 images will lead to 10 model updates, whereas a batch size of 25 with the 100 image dataset
will give only 4 model updates. Though a smaller batch size can give more model updates, keeping
the size too low can lead to the model jumping around different local minima and not stabilizing
at a small loss value. Hence, there is a trade-off between batch size and the accuracy (robustness)
of the model. In MyML, we want to keep a small batch size to have faster model updates within a
reasonable amount of user-data. Therefore, we present a sensitivity study with a batch size in the
range of 128 to 8, as shown in Figure 22. In this study, we keep the number of updates constant
Fig. 22. Model size and accuracy for increasing training batch size.
Fig. 23. Scalability of user-specific model with an increasing number of user classes.
(at 25); thus, the dataset size varies with the training batch size. We observe that, at the largest batch
size of 128 and dataset size of 3.2k, we can achieve the smallest model size (highest pruning rate)
because the model has more data to learn and evade local minima to settle at a stable loss. A higher
pruning rate demands more data to converge at a low loss value. As the batch size reduces, the
model size increases (i.e., pruning rate reduces) to stabilize at a low stable loss with less amount of
data. However, the difference between the model sizes is not very significant. A model created with
a batch size of 8 is only 7% bigger than a model built with a batch size of 128. Across
all the training batch sizes, we maintain accuracy within a 1% margin of the baseline accuracy of
79.2%. Therefore, for this work, we choose the batch size of 8 for training user-specific models that
have an accuracy within 1% of the original unpruned model.
Scalability with number of user classes: The second study measures the utility of building
a small user-specific model over user-agnostic pruning as the user diversifies their preferences or
choices. Therefore, we conduct a sensitivity study with an increasing number of user classes rep-
resenting user-preference. As with the previously discussed results, we train these user-specific models
to maintain accuracy within a margin of 1%. Figure 23 shows that, even on increasing the number
of user classes from 5 to 40, we achieve significant model reduction over user-agnostic pruning. For
instance, with 40 user-classes, MyML gives 1.5× and 2.8× reduction compared to the user-agnostic
pruned and original model, respectively. This reduction in model size demonstrates the advantage
of utilizing user-specific models over the original generic model even as the user expands its pref-
erences. The increase in model size from 5 to 10 classes is 1.35×; this increase reduces to 1.13× from
10 to 20 classes and 1.1× from 20 to 40 classes. The increase in model size is highest when we ex-
pand from 5 to 10 classes, and thereafter, it tapers off as we expand to include more classes. Thus,
we can infer that, even as user-preferences expand beyond 40 classes, the user-specific model size
will grow by only a small fraction compared to the model size at 40 classes.
Ablation Study: We also performed an ablation study by choosing five known nearby classes –
pickup truck, tow truck, trailer truck, tractor, and recreational R.V. – as user-classes for building a
user-specific model. We found that the model size for Inception-V3 can be reduced to 6.2 MB from
23 MB while maintaining an accuracy of 79.2% for the user subset. This validates our hypothesis
that, by leveraging user-preferences, we can build tiny user-specific ML models to improve the
efficiency of ML applications on user-devices.
6.5 Discussion
To demonstrate the practicality of our approach, we apply a real-world dataset from Kaggle [3]
to the Inception-v3 model to obtain the ratio of user-classes and outliers. We test our method on
500 images. The learning phase operates on the first 100 images, and the remaining 400 images
determine the fraction of user-classes and outliers in the dataset. Unlike Imagenet, where each
image is manually processed to have exactly one object/entity, the Kaggle real-world dataset has
multiple objects in each image. Thus, we take the top-five predicted classes for each image in our
analysis, which account for 119 unique classes within the 500 image window. We mark the ten most
frequent classes in the learning window as user-classes and find that 95% of the remaining images
belong to these user-classes. Thus, we observe that the ten frequently appearing user-classes (8.4%
of total classes) consistently encapsulate 95% of images.
Our collaborative system has a tunable threshold for outlier tolerance. With a 95% threshold,
the collaborative system can tolerate a maximum of 5% outliers before discarding the current user-
specific model to create a new user-specific model for new preferences. We offer two solutions to
handle outliers. The first solution sends the 5% outliers to the cloud server for computation, which
applies the bulky original model, ensuring privacy for 95% of the inputs. The second solution is to
infer the 5% outliers using the bulky original model on the local edge device, which ensures privacy
for all inputs by computing everything locally. Hence, MyML enforces privacy for at least 95% of user
inputs regardless of the method chosen, and for 100% of user inputs when the outliers are computed
locally on the edge device using the original model.
We show that for the collaborative system, which sends the 5% outliers to the cloud, the speedup
is 2.2×. Additionally, computing the outliers on the device results in 2.1× speedup. Note that the
remaining 95% of the inputs are computed only at the edge device employing the user-specific
model. These speedups are lower than the 2.3× speedup achieved when only the user-specific
model is applied to all inputs without the differentiation of outliers or its extra computation.
7 LIMITATIONS
Though we provide an end-to-end holistic approach to learn, build, and deploy user-specific models
based on user-preferences, there are still some limitations and scope for improvement in this work.
This work assumes that there is a pre-built accurate original model ready to serve the user
that acts as ground truth. Training such a baseline model from scratch is a very expensive and
time-consuming process. However, it is a one-time process that can be done in the back-end cloud
server.
In this paper, user-specific models are derived from original models, which are convolution
layer-based deep neural networks. Hence, the user-specific models are, in turn, made of computa-
tionally complex convolution layers. There can be an alternate way to build smaller user models
from scratch comprising simpler MLP layers and much lower depth. This approach has not been
explored in this paper. The above limitation can be further expanded to study the switching point
from a simpler multi-layer perceptron (MLP) layer-based user-specific model to a pruned convolution
layer-based user-specific model. When user-preferences belong to a few classes (e.g., 5),
simpler models may provide good accuracy with smaller model sizes. However, as user-preferences
expand to a large number of classes, simpler models might not be accurate, and we may need to
switch to pruned complex models.
We determine user-preferences by simply choosing the top-k appearing classes/categories in
the learning phase window. Here, the value of k is static and pre-defined by the user or the vendor.
This can be replaced by a sophisticated dynamic approach that is independent of k.
We use entropy and output probability to determine whether the inputs belong to user-classes
or outside user-classes. Once the majority of inputs in a window are estimated to be outside
user-classes, we discard the current user-specific model and swap it with the original model. This
approach hurts accuracy over the input window used to detect divergence in user-
preferences. It can be replaced by a more fine-grained approach, where we can send individual
inputs to the original model if they are marked as outside user-classes. However, such a fine-
grained system will require more robust statistics apart from prediction probability and entropy.
8 FUTURE SCOPE
The idea of building user-specific models can be scaled to other domains, such as natural lan-
guage processing (NLP) and recommendation systems, since users usually interact with a small
subset of words/items rather than the large corpus for which models are trained. However, each of
these domains has different kernels and bottlenecks. For example, NLP uses recurrent neural net-
work (RNN) models and recommendation systems consist of embedding tables and MLP layers.
Though the broad idea of creating user-specific models is applicable to each domain, the optimiza-
tions will be different from the current work and unique to each domain. Pruning methods will
differ for RNN layers and MLP layers from currently supported convolution layers. We are explor-
ing this direction in our future work. Furthermore, in this work, we explore only channel-based
pruning granularities to map on SIMD enabled mobile CPU cores and systolic array-based Edge
TPU. However, there is further scope to instead use weight pruning to achieve higher pruning
rates and to build accelerators that provide better hardware support.
9 RELATED WORK
In this section, we discuss relevant prior works that we encountered while developing this work.
Online learning: Federated learning [9] is an emerging edge or user device-oriented approach
that advocates learning on edge devices to eliminate privacy concerns related to sending user data
back to the cloud for training. Instead, it trains the model at the edge and shares model updates
(instead of raw data) with the back-end cloud. Federated learning enables privacy-preserving con-
tinuous training across many users, but it builds a generic model. Our solution is inspired by the
federated learning approach of keeping all computation local to the edge device; however, we
build smaller, user-specific models, and we do so efficiently enough to be feasible on the edge device.
Another closely related work is transfer learning [62], a technique to learn models for new or
smaller domains from already available trained models. It reuses the top layers of an available
trained model as-is for the new domain and finetunes (and sometimes prunes) the remaining layers
for the new dataset. FixyNN [59] is a transfer learning-based approach that builds multiple models
for different domains/datasets by keeping the feature-extraction layers constant and learning only
the remaining layers. Thus, at run time, it shares the feature-extraction computation across all the
models and performs individual computation for the rest of each model.
The above two works do not leverage user-preferences. To the best of our knowledge, MyML
is the first work that utilizes transfer learning to build small, user-specific models. We develop
hardware-friendly, bottom-up pruning where computation is shared between pruning and infer-
ence, and the backward pass is simplified to make the user-specific transfer learning efficient on
resource-constrained edge devices.
Offline learning: There are works that aim to extract small student models from a baseline
teacher model to reduce model size. Knowledge distillation [26] is one such seminal work; it changes
the objective function so that the student trains on soft targets derived from the teacher's output
distribution, using a small dataset (sketched below). FitNets [46] builds thinner and deeper networks
based on knowledge distillation, additionally using hints from the intermediate layers. AMC [24] is an
automated technique that utilizes reinforcement learning to derive a model compression policy for mobile
devices. Prior works [8, 11] use function-preserving network transformations [12] to build new compressed models.
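For reference, the soft-target objective at the heart of knowledge distillation can be sketched as follows. This is a simplified NumPy version under our own naming; the full objective in [26] additionally combines this term with a standard cross-entropy loss on the hard labels.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's softened output distribution (the 'soft targets')."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(-np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean())
```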
Pruning techniques: A common and effective approach to reducing model size is to prune
weights. Many prior works [21, 25, 35, 38, 63, 64] guide which weights should be pruned and how.
We use a subset of these techniques to prune our model, but exclusively for the user-specific data
rather than the complete dataset used for building the model. Bit Prudent [57] utilizes asymmetric
pruning for in-cache acceleration of ML inference by adding a coalescing unit. In our work, by
contrast, we enable asymmetric pruning on the CPU by using channel offsets and input diff pointers
in our GEMM implementation (a minimal sketch of the channel-offset gather follows this paragraph).
We also compare this work with PatDNN [43], a recent state-of-the-art and complementary approach
to the industry problem of running large models efficiently on mobile/edge devices. PatDNN reports a
4.4× reduction in convolution-layer footprint (to 4.5 MB) for ResNet-50 with a top-5 accuracy of 92.5% on ImageNet.
Until now, we have pruned our models to maintain top-1 accuracy; however, for a fair comparison
with PatDNN, we now prune our user-specific model to maintain top-5 accuracy. We show that
our user-specific model for ten user-classes can reduce the convolution-layer footprint to 1.02 MB
(19× compression) with a top-5 accuracy of 92.6%. We also compare the number of FLOPs
for ResNet-50 against DMCP [18]. We find that our user-specific model, which applies asymmetric
pruning, provides an accuracy of 73.2% at a cost of 259M FLOPs on the user-specific dataset.
In contrast, for a comparable number of FLOPs, the DMCP-pruned model achieves an accuracy
of only 66.4% on the complete dataset. To achieve a comparable target accuracy of 74.4%, the
DMCP-pruned model requires 1.1G FLOPs. Thus, DMCP has 7 points lower accuracy than MyML
for a comparable number of FLOPs and requires about four times more FLOPs to regain comparable
accuracy.
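To illustrate the channel-offset idea mentioned above, the sketch below shows a GEMM over only the surviving input channels of a layer (a 1×1 convolution lowered to GEMM, with im2col and the input-diff pointers omitted). The names and shapes are ours and are not taken from the actual kernel.

```python
import numpy as np

def pruned_layer_gemm(activations, weights, kept_channels):
    """Asymmetric channel pruning via channel offsets: gather only the
    unpruned input channels, then run a dense GEMM on the reduced set.

    activations   : (C_in, H*W) input feature map, flattened spatially
    weights       : (C_out, len(kept_channels)) pruned weight matrix
    kept_channels : indices (offsets) of the surviving input channels
    """
    gathered = activations[kept_channels, :]   # channel-offset gather
    return weights @ gathered                  # (C_out, H*W) output feature map

# Example: a layer where only input channels 0 and 3 survive pruning.
acts = np.random.rand(4, 16)                   # 4 input channels, 4x4 spatial
w = np.random.rand(8, 2)                       # 8 output channels, 2 kept inputs
out = pruned_layer_gemm(acts, w, np.array([0, 3]))
```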
Accelerating pruning: More recently, many works [16, 17, 30, 39, 65] have focused on accel-
erating the pruning process. Prior works [39, 65] are based on the insight that, instead of waiting
for a baseline model to be fully trained before pruning it, the pruning process can be moved earlier
and interleaved with baseline training. They proactively prune near-zero weights after the first few
training epochs under the assumption that near-zero weights will not revive during later training
epochs. In this work, we instead opt for the more conservative and widely recognized approach of
pruning a completely trained baseline model. Prior works [16, 17] exploit sparsity in dense DNN
training computation and map it efficiently onto general-purpose CPU cores. In addition, Gist [30]
reduces memory footprint by proposing lossless and lossy encoding schemes for convolution and
ReLU layers to improve the performance of DNN pruning. These approaches are complementary to
our proposed techniques for improving pruning performance.
Accelerating inference at the edge node: Prior work [19] has reported a comprehensive analysis of
computing machine learning inference on edge devices and compared it with computing inference in
the cloud. According to this analysis, with state-of-the-art devices and frameworks, conducting
inference at the edge is neither energy- nor latency-efficient. Many prior works [10, 28, 32, 45, 52, 56, 66]
have focused on efficient machine learning at the edge or on mobile devices. Prior works [28, 56]
have proposed accelerators to improve machine learning at the edge. Others [10, 45, 66] utilize
input similarity to make machine learning efficient for continuous mobile vision and speech
recognition. Our solution of building a user-specific model to improve efficiency is complementary
to these works and can benefit from their approaches. In-Situ AI [52] develops a system to detect
whether incoming images are unseen by the working model at the edge device, sending only the
unseen images back to the cloud for progressive training. Neurosurgeon [32] proposes solutions to
partition the inference computation between the edge and the cloud to reduce data movement; it
finds that not all layers are compute-intensive and thus computes some layers at the edge device
while the remaining layers are computed in the cloud. However, the above two works still require
significant data movement and the sharing of private user data. A study by Facebook [60] concludes
that machine learning inference is carried out on CPUs for most of its users; therefore, there is a
push for efficient machine learning on multi-core CPUs. Our CPU-friendly pruning approach makes
our solution all the more relevant.
10 CONCLUSION
To circumvent the problems arising from offloading machine learning to the cloud, in this work, we
present MyML, a hardware-software solution that supports machine learning at edge devices. We
leverage the transfer learning approach to create small, lightweight, user-specific ML models based
on user-preferences instead of defaulting to a large, compute-intensive ML model. We propose
hardware-friendly, bottom-up pruning, which can be utilized by any mobile platform, and we also
repurpose a systolic array-based edge accelerator to support user-specific transfer learning on
edge devices without any cloud intervention, thus preserving user privacy. We also present a
collaborative edge system that tracks deviations in user-preferences to switch back from the
user-specific model to the original model and restart the model-building process.
REFERENCES
[1] [n.d.]. Edge TPU. https://cloud.google.com/edge-tpu.
[2] [n.d.]. Edge TPU Performance Benchmarks. https://coral.ai/docs/edgetpu/benchmarks/.
[3] [n.d.]. Intel Image Classification: Image Scene Classification of Multiclass. https://www.kaggle.com/puneet6060/
intel-image-classification/version/2.
[4] [n.d.]. iPhone 12 Pro Specifications. https://www.apple.com/iphone-12-pro/.
[5] [n.d.]. What is the NPU in Galaxy and What Does It Do? https://www.samsung.com/global/galaxy/what-is/npu/.
[6] [n.d.]. XNNPACK. https://github.com/google/XNNPACK.
[7] Pavlos Athanasios Apostolopoulos, Eirini Eleni Tsiropoulou, and Symeon Papavassiliou. 2020. Risk-aware data of-
floading in multi-server multi-access edge computing environment. IEEE/ACM Transactions on Networking 28, 3
(2020), 1405–1418.
[8] Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M. Kitani. 2017. N2N learning: Network to network
compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030 (2017).
[9] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe
Kiddon, Jakub Konečny, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel
Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. arXiv preprint
arXiv:1902.01046 (2019).
[10] Mark Buckler, Philip Bedoukian, Suren Jayasuriya, and Adrian Sampson. 2018. EVA2: Exploiting temporal redun-
dancy in live computer vision. arXiv preprint arXiv:1803.06312 (2018).
[11] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-level network transformation for efficient
architecture search. In International Conference on Machine Learning. PMLR, 678–687.
[12] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2015. Net2Net: Accelerating learning via knowledge transfer.
arXiv preprint arXiv:1511.05641 (2015).
[13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for
convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture
(ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image
database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[15] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay,
Michael Haselman, Logan Adams, Mahdi Ghandi, et al. 2018. A configurable cloud-scale DNN processor for real-
time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1–14.
[16] Zhangxiaowen Gong, Houxiang Ji, Christopher W. Fletcher, Christopher J. Hughes, Sara Baghsorkhi, and Josep Tor-
rellas. 2020. SAVE: Sparsity-aware vector engine for accelerating DNN training and inference on CPUs. In 2020 53rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 796–810.
[17] Zhangxiaowen Gong, Houxiang Ji, Christopher W. Fletcher, Christopher J. Hughes, and Josep Torrellas. 2020. Sparse-
Train: Leveraging dynamic sparsity in software for training DNNs on general-purpose SIMD processors. In Proceed-
ings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 279–292.
[18] Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. 2020. DMCP: Differentiable Markov Channel Pruning for
neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1539–1547.
[19] Ramyad Hadidi et al. 2019. Characterizing the deployment of deep neural networks on commercial edge devices. In
2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE.
[20] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al.
2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. 75–84.
[21] Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning both weights and connections for efficient
neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume
1. 1135–1143.
[22] Hengtao He, Chao-Kai Wen, Shi Jin, and Geoffrey Ye Li. 2020. Model-driven deep learning for MIMO detection. IEEE
Transactions on Signal Processing 68 (2020), 1702–1715.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. arXiv
preprint arXiv:1603.05027 (2016).
[24] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for model compression and
acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV). 784–800.
[25] Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In
Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015).
[27] Connor Holmes, Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu. 2019. GRNN: Low-latency and scalable RNN
inference on GPUs. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–16.
[28] Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, and Li-De Chen.
2019. ECNN: A block-based and highly-parallel CNN accelerator for edge inference. In Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitecture. 182–195.
[29] Xin-Lin Huang, Xiaomin Ma, and Fei Hu. 2018. Machine learning and intelligent communications. Mobile Networks
and Applications 23, 1 (2018), 68–70.
[30] Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, and Gennady Pekhimenko. 2018. Gist: Efficient data
encoding for deep neural network training. In 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 776–789.
[31] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh
Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Pro-
ceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.
[32] Yiping Kang et al. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN
Notices 52, 4 (2017), 615–629.
[33] Mehrdad Khani, Mohammad Alizadeh, Jakob Hoydis, and Phil Fleming. 2020. Adaptive neural signal detection for
massive MIMO. IEEE Transactions on Wireless Communications 19, 8 (2020), 5635–5648.
[34] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016.
Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[35] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient ConvNets.
arXiv preprint arXiv:1608.08710 (2016).
[36] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md. E. Haque, Lingjia Tang, and Jason Mars. 2018. The
architectural implications of autonomous driving: Constraints and acceleration. In Proceedings of the Twenty-Third
International Conference on Architectural Support for Programming Languages and Operating Systems. 751–766.
[37] Changqing Luo, Jinlong Ji, Qianlong Wang, Xuhui Chen, and Pan Li. 2018. Channel state information prediction for
5G wireless communications: A deep learning approach. IEEE Transactions on Network Science and Engineering 7, 1
(2018), 227–236.
[38] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A filter level pruning method for deep neural network
compression. In Proceedings of the IEEE International Conference on Computer Vision. 5058–5066.
[39] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. PruneTrain: Fast
neural network training by dynamic sparse model reconfiguration. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis. 1–13.
[40] Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and
Andreas Moshovos. 2020. TensorDash: Exploiting sparsity to accelerate deep neural network training. In 2020 53rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 781–795.
[41] Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micike-
vicius, David Patterson, Guenther Schmuelling, Hanlin Tang, et al. 2020. MLPerf: An industry standard benchmark
suite for machine learning performance. IEEE Micro 40, 2 (2020), 8–16.
[42] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park,
Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, et al. 2019. Deep learning recommendation model
for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
[43] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN:
Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the
Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems.
907–922.
[44] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany,
Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional
neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[45] Marc Riera, Jose-Maria Arnau, and Antonio González. 2018. Computation reuse in DNNs by exploiting input simi-
larity. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 57–68.
[46] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014.
FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[47] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic
CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).
[48] Mohammad Samragh, Mohammad Ghasemzadeh, and Farinaz Koushanfar. 2017. Customizing neural networks for
efficient FPGA implementation. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 85–92.
[49] Neev Samuel, Tzvi Diskin, and Ami Wiesel. 2017. Deep MIMO detection. In 2017 IEEE 18th International Workshop
on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 1–5.
[50] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos.
arXiv preprint arXiv:1406.2199 (2014).
[51] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[52] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. 2018. In-Situ
AI: Towards autonomous and incremental deep learning for IoT systems. In 2018 IEEE International Symposium on
High Performance Computer Architecture (HPCA). IEEE, 92–103.
[53] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[54] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1–9.
[55] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2818–2826.
[56] Siqi Wang, Gayathri Ananthanarayanan, Yifan Zeng, Neeraj Goel, Anuj Pathania, and Tulika Mitra. 2019. High-
throughput CNN inference on embedded ARM big.LITTLE multicore processors. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems 39, 10 (2019), 2254–2267.
[57] Xiaowei Wang, Jiecao Yu, Charles Augustine, Ravi Iyer, and Reetuparna Das. 2019. Bit prudent in-cache accelera-
tion of deep convolutional neural networks. In 2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, 81–93.
[58] Ziheng Wang. 2020. SparseRT: Accelerating unstructured sparsity on GPUs for deep learning inference. In Proceed-
ings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 31–42.
[59] Paul N. Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, and Matthew
Mattina. 2019. FixyNN: Efficient hardware for mobile computer vision via transfer learning. arXiv preprint
arXiv:1902.11128 (2019).
[60] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad
Isaac, Yangqing Jia, Bill Jia, et al. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019
IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 331–344.
[61] Hao Ye, Geoffrey Ye Li, and Biing-Hwang Juang. 2017. Power of deep learning for channel estimation and signal
detection in OFDM systems. IEEE Wireless Communications Letters 7, 1 (2017), 114–117.
[62] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural net-
works?. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2. 3320–
3328.
[63] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel:
Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45,
2 (2017), 548–560.
[64] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and
Larry S. Davis. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 9194–9203.
[65] Jiaqi Zhang, Xiangru Chen, Mingcong Song, and Tao Li. 2019. Eager pruning: Algorithm and architecture support
for fast training of deep neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 292–303.
[66] Yuhao Zhu, Anand Samajdar, Matthew Mattina, and Paul Whatmough. 2018. Euphrates: Algorithm-SoC Co-design
for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232 (2018).