
To appear at the 31st International Conference on Electronics, Circuits and Systems (ICECS), Nov. 18-20, 2024, Nancy, France. arXiv:2409.16815v1 [cs.LG], 25 Sep 2024.

Accelerating TinyML Inference on Microcontrollers through Approximate Kernels

Giorgos Armeniakos∗, Georgios Mentzos∗, Dimitrios Soudris∗
∗National Technical University of Athens, GR
{armeniakos, gmentzos, dsoudris}@microlab.ntua.gr

Abstract—The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework first unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto-optimal solutions feature on average a 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements the corresponding reduction becomes even more pronounced.

Index Terms—Approximate Computing, MCUs, TinyML

I. INTRODUCTION

In recent years, the proliferation of low-cost IoT microcontroller units (MCUs) has significantly expanded the Tiny Machine Learning (TinyML) domain [1], enabling real-time data processing on tiny devices. Despite the energy efficiency of MCUs, their limited resources and high latency challenge the deployment of deep learning models on small-scale hardware. Consequently, new optimizations and customized architectures are needed to bridge the resource gap, making reconfigurable MCUs an attractive option for ML acceleration.

In this effort, ARM's CMSIS-NN [2] software library offers efficient neural network operations for MCUs running on Arm Cortex-M CPUs, achieving nearly an 11x latency improvement compared to TensorFlow Lite Micro on several ImageNet models deployed on an STM32H743 board. TinyEngine [3], a system-model co-design framework, combines neural architecture search with a memory-optimized inference library, resulting in average latency and SRAM usage reductions of 2.1x and 2.4x, respectively, compared to CMSIS-NN. However, relevant frameworks focus on fitting models within memory constraints rather than reducing the inference latency of large models. For instance, TinyEngine requires about 1.3s to execute an mcunet-in4 ImageNet model on a 160MHz MCU, highlighting existing latency challenges in real-time applications.

In this work we investigate the feasibility of efficiently utilizing MCUs to enhance DNN performance. By integrating Approximate Computing (AC) [4] principles with optimized software kernels, we develop an automated framework that generates specialized approximate code for specific Convolutional Neural Networks (CNNs). Our approach utilizes flash memory to unpack kernel code within convolution layers, eliminating instruction overheads. Subsequently, by leveraging the unpacked operations and the fact that each computation contributes uniquely to the final output, we employ an offline significance-aware computation skipping approach, in which certain operations are either skipped or retained. Through design space exploration (DSE), our framework identifies Pareto-optimal solutions, each offering a unique accuracy-latency trade-off tailored to user requirements. Compared to the state-of-the-art CMSIS-NN, our approach achieves a 21% latency reduction with no degradation in Top-1 classification accuracy on CIFAR-10-trained CNN models, while for lower accuracy requirements (<5%), our method outperforms even commercial frameworks.

Our novel contributions within this work are as follows:
1) This is the first work that evaluates the impact of approximate computing on the optimized inference library of CMSIS-NN, targeting MCUs.
2) We propose an automated cooperative approximation framework for accelerating CNN inference on MCUs, available at https://github.com/GeorgeMentzos/ATAMAN-AuTo-driven-Approximation-and-Microcontroller-AcceleratioN-Toolkit.
3) Using our framework, we demonstrate that, in many cases, approximate computing is able to realize larger and faster networks than conventional ones on tiny devices.

II. COOPERATIVE APPROXIMATION FRAMEWORK FOR INFERENCE OPTIMIZATION

This section describes our cooperative approximation framework for deploying approximate DNNs on microcontrollers. In brief, we first describe our basic kernel customizations and how we eliminate the associated overheads of existing inference libraries. Then, we analyze our layer-based code unpacking, showing its latency benefits over typical implementations of the targeted kernels, and finally we describe our significance-aware computation skipping exploration, which offers the flexibility to trade classification accuracy for further inference acceleration. An abstract overview of our framework is depicted in Fig. 1.

Fig. 1. Abstract overview of our framework: starting from a trained CNN and user-defined configs (approximation thresholds), the flow comprises ① layer-based code unpacking, ② input distribution capture, ③ significance S[ ] calculation, ④ S-aware computation skipping and ⑤ DSE, and produces an S-specific CNN deployment.

A. Customized kernels for NN deployment

To meet our deployment scenarios' unique requirements, we use CMSIS-NN as our baseline inference library.

Work partially supported by the Horizon Europe research and innovation program via the “CONVOLVE” project under grant agreement No. 101070374.

© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers
or lists, or reuse of any copyrighted component of this work in other works.
Our approximation framework customizes the generated code to support only the layers and functions that are essential for the given model. Unlike CMSIS-NN and most existing inference libraries (e.g., TF-Lite Micro), we offload operations on the model's structural parameters from runtime to compile time, enhancing inference efficiency and reducing flash memory usage by up to 30%. This extra flash memory allows us to unpack more kernels or entire layers, improving the granularity of our skipping approximation.
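For illustration, such compile-time specialization could amount to emitting the model's layer parameters as constants in the generated source instead of parsing descriptor structures at runtime. The sketch below is a hypothetical example of what such generated definitions might look like, not our framework's actual output.

#include <stdint.h>

/* Hypothetical generated header for one convolution layer: every shape and
 * quantization parameter is a compile-time constant, so nothing has to be
 * parsed, stored in RAM, or resolved at runtime.                           */
#define CONV1_IN_CH        3
#define CONV1_OUT_CH      32
#define CONV1_KERNEL_DIM   5
#define CONV1_IN_DIM      32
#define CONV1_OUT_DIM     32
#define CONV1_STRIDE       1
#define CONV1_PADDING      2
#define CONV1_OUT_RSHIFT   9   /* example fixed-point requantization shift */

/* Weights and biases remain in flash (const) instead of being copied to RAM. */
static const int8_t  conv1_wt[CONV1_OUT_CH * CONV1_IN_CH *
                              CONV1_KERNEL_DIM * CONV1_KERNEL_DIM] = { 0 /* generated values */ };
static const int32_t conv1_bias[CONV1_OUT_CH] = { 0 /* generated values */ };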
Our focus is on optimizing the convolutional layers, as most cycles in CNN models are consumed by these operations [5]. A convolution operation in CMSIS-NN involves computing a dot product between the filter weights and a small receptive field within the input feature map, followed by matrix multiplication. We extend these kernels with cycle counters to profile parts of the C code for individual operators, providing insights into the model's baseline performance. These counters are deactivated during runtime.
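One common way to implement such per-operator cycle counting on Cortex-M cores with a DWT unit is sketched below. This is a generic, illustrative example built on the standard CMSIS core registers, given only to make the profiling step concrete; it is not necessarily how our counters are implemented. cycle_counter_init() would run once at startup of a profiling build, and profile_operator() would wrap each targeted kernel.

#include <stdint.h>
#include "stm32u5xx.h"   /* assumed device header; pulls in the CMSIS core definitions (DWT, CoreDebug) */

/* Enable the DWT cycle counter once at startup (profiling builds only). */
static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace/DWT     */
    DWT->CYCCNT = 0u;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start cycle counting */
}

/* Wrap a single operator and return the CPU cycles it consumed. */
static uint32_t profile_operator(void (*op)(void))
{
    uint32_t start = DWT->CYCCNT;
    op();
    return DWT->CYCCNT - start;
}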

Table I shows the characteristics (baseline accuracy, topology, latency) of our models deployed on an STM32-Nucleo-U575ZI-Q board at 160MHz, trained on the CIFAR-10 dataset with 8-bit post-training quantization. Inputs have a 32x32 resolution and are normalized to [0,1]. As shown, even for a small model with fewer than 5M parameters, latency exceeds 80ms, while for larger models like AlexNet, 87% of the flash memory remains unused. This inspires us to leverage the available flash for customized kernels optimized for specific models.

TABLE I
EVALUATION OF OUR BASELINE CIFAR-10 ALEXNET AND LENET ON AN STM32-NUCLEO FITTING 2000KB ROM AND 768KB RAM

CNN       Acc (%)   Topol.¹   # MAC Ops   Latency (ms)   Flash Usage (%)   RAM (KB)
AlexNet   71.9      5-2-2     16.1M       179.9          13                212.16
LeNet     71.6      3-2-2     4.5M        82.8           12                183.5

¹ Topology of the network in Conv - MaxPooling - Fully Connected layers, respectively.

B. Layer-based code unpacking

Typical convolution kernels on MCUs are usually implemented in a matrix format, where inputs and weights are retrieved from memory following a specific pattern. This pattern includes details such as the order in which data elements are fetched, the stride or step size for moving through the data, and any necessary padding or adjustments to ensure the correct alignment of the data for convolution. Instead, our framework performs an automated layer-based code unpacking (see Fig. 1, ①), where each operation is "unpacked" and included as an intrinsic function in the final generated code. Our unpacking technique fundamentally differs from typical unrolling, since it utilizes the known constant values (weights) within each iteration. This approach enables more optimized and efficient code generation, as it allows for additional compiler optimizations.

The primary benefits of our code-unpacked kernels that lead to reduced execution cycles include the following (a brief illustrative sketch is given after the list):
1) Similar to typical unrolling techniques, our code unpacking eliminates branch instruction overheads within convolutional kernels.
2) Our automated procedure allocates fixed weights to each operand, removing the need to adapt and load the weights during the convolution process. This leads to simplified operations that are predictable in terms of type and can be adjusted based on input values to enhance inference speed.
3) The CMSIS mat_mult kernel calculates the partial products using the SMLAD instruction (SIMD logic), which performs two 16-bit signed multiplications and accumulates the results into a 32-bit operand. Hence, a pre-processing step is required to convert the data to the 16-bit data type. Instead, our fixed-weight replacement avoids this time-consuming operation: since we know the values of the weights a priori, the conversion is handled by an offline processing step that concatenates two int16 (sign-extended int8) weights. For instance, an SMLAD (MAC) instruction with the "hardwired" value w12 = 4194324 represents two multiplications with w1 = 64 and w2 = 20, since 64 · 2^16 + 20 = 4194324.
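To make the unpacking idea concrete, the fragment below contrasts a generic packed inner loop with a fully unpacked multiply-accumulate whose weight pair is hardwired as a packed constant. It is a minimal sketch assuming a Cortex-M core with the DSP extension and the CMSIS __SMLAD intrinsic; it is not code emitted by our framework.

#include <stdint.h>
#include "cmsis_gcc.h"   /* assumed compiler header; provides __SMLAD on cores with the DSP extension */

/* Generic CMSIS-style inner loop: activations and weights are loaded and
 * packed into 32-bit operands at runtime before every __SMLAD (n even).   */
static int32_t dot_generic(const int16_t *a, const int16_t *w, int n, int32_t bias)
{
    int32_t acc = bias;
    for (int i = 0; i < n; i += 2) {
        uint32_t av = ((uint32_t)(uint16_t)a[i + 1] << 16) | (uint16_t)a[i];
        uint32_t wv = ((uint32_t)(uint16_t)w[i + 1] << 16) | (uint16_t)w[i];
        acc = (int32_t)__SMLAD(av, wv, (uint32_t)acc);
    }
    return acc;
}

/* Unpacked equivalent for one weight pair known at generation time:
 * w1 = 64 and w2 = 20 are pre-packed as 64 * 2^16 + 20 = 4194324, so no
 * weight loads, no packing and no loop/branch instructions remain.        */
static int32_t mac_unpacked(uint32_t a_packed /* two int16 activations */, int32_t acc)
{
    return (int32_t)__SMLAD(a_packed, 4194324u, (uint32_t)acc);
}

In a fully unpacked layer, the generated code essentially becomes a straight-line sequence of such calls, one per pair of products, which is what consumes the otherwise unused flash memory.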
The length of the unpacked code is considered with respect to the available unused flash memory, creating an interesting trade-off between these two metrics. In line with works like [3] and [6], which report an average flash memory utilization of less than 25%, we demonstrate in this work that a fully unpacked fixed-weight convolution can be effortlessly enabled. For instance, even in the worst case of AlexNet with 5 convolution layers, our framework fitted all the kernel instructions using less than 60% of the available flash memory.

C. Significance-aware skipping

In this section, we describe how we leverage our layer-based code unpacking to systematically omit certain operations that are considered insignificant for our classification tasks, or to retain others as significant. Unlike other approaches that skip entire channels or even layers [7], our framework can omit operations at the finest granularity, which, to the best of our knowledge, no other work has targeted before in software libraries for MCUs.

Our significance-driven analysis is motivated by two facts: 1) each computation within the convolution makes a unique contribution to the final output, meaning that certain computations could potentially be skipped without compromising classification accuracy, and 2) effectively reducing the total number of computations could provide a valuable trade-off between accuracy and latency.
The accumulation of each channel during the matrix multiplication (mat_mult kernel) is calculated as a weighted sum plus an initialized bias:

Sum_c = b + \sum_{\forall i} a_i \cdot w_i,    (1)

where b is the initialized bias, w_i are the trained coefficients (weights) and a_i are the inputs from the respective channel. Intuitively, when inputs are multiplied by large numbers, they tend to produce significantly more impactful products (a_i \cdot w_i) in the final result than inputs multiplied by small values. However, it is worth noting that the significance of the product a_i \cdot w_i also depends on the value of a_i. Thus, we define the significance (③) of each product as follows:

S_i = \left| \frac{E[a_i] \cdot w_i}{\sum_{\forall i} E[a_i] \cdot w_i} \right|,    (2)

where E[a_i] is the average expected value of the input a_i. In other words, (2) calculates the long-term expected contribution of each product a_i \cdot w_i to the total sum Sum_c of the respective channel. If the sum equals zero, which happens in a vast minority of cases, we consider the corresponding significance S_i to be large, and thus the product is retained. For each channel and Sum_c, the calculation of S_i, \forall i, is straightforward and involves capturing the input values' distribution (②) from a small portion of the dataset.
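As a concrete illustration of (2), the following C sketch estimates E[a_i] for one channel from a handful of calibration activations and flags the products whose significance falls at or below a threshold τ. The function name, buffer layout and size limit are hypothetical, not part of our framework's interface.

#include <math.h>
#include <stdint.h>

/* For one output channel with n products: estimate E[a_i] over n_samples
 * calibration activations and mark products with S_i <= tau as skippable.  */
void mark_skippable(const float *acts,   /* acts[s * n + i]: calibration activations */
                    int n_samples,
                    const float *w,      /* trained weights w_i of this channel      */
                    int n,               /* number of products (assumed <= 1024)     */
                    float tau,
                    uint8_t *skip)       /* out: 1 if product i can be skipped       */
{
    float mean[1024];
    float denom = 0.0f;

    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int s = 0; s < n_samples; s++)
            sum += acts[s * n + i];
        mean[i] = sum / (float)n_samples;          /* E[a_i]             */
        denom  += mean[i] * w[i];                  /* sum_i E[a_i] * w_i */
    }

    for (int i = 0; i < n; i++) {
        /* If the expected sum is (near) zero, treat S_i as large: keep the product. */
        float s_i = (fabsf(denom) < 1e-9f) ? 1.0f
                                           : fabsf(mean[i] * w[i] / denom);  /* Eq. (2) */
        skip[i] = (s_i <= tau) ? 1u : 0u;
    }
}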
By exploiting this high-level information, we minimize the total number of computations required for each summation (Sum_c) at compile time, and thus we approximate the summation accordingly. Specifically, each product a_i \cdot w_i whose significance S_i is less than or equal to a given threshold τ is omitted from the generated code, while the rest are retained. Thus, our approximate summation per channel is now represented by:

Sum'_c = b + \sum_{\forall i} (a_i \cdot w_i) - \sum_{\forall i: S_i \le \tau} (a_i \cdot w_i).    (3)
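Realized in the generated code, (3) simply means that the low-significance multiply-accumulate instructions are never emitted. The fragment below is a purely hypothetical example of what one unpacked output-channel computation could then look like; the constants, indices and packed-activation layout are invented for illustration.

#include <stdint.h>
#include "cmsis_gcc.h"   /* assumed; provides __SMLAD as in the earlier sketch */

/* Hypothetical generated code for one output channel: of the four original
 * packed MACs, those whose significance satisfied S_i <= tau were dropped
 * at generation time, so only the significant products and the bias remain. */
static inline int32_t conv1_out_ch17(const uint32_t *a_packed)
{
    int32_t acc = 142;                                               /* bias b */
    acc = (int32_t)__SMLAD(a_packed[0], 4194324u,   (uint32_t)acc);  /* kept   */
    acc = (int32_t)__SMLAD(a_packed[3], 0xFFF50002u, (uint32_t)acc); /* kept   */
    /* products for a_packed[1] and a_packed[2] omitted: S_i <= tau            */
    return acc;
}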
Finally, we perform an exhaustive DSE w.r.t. the targeted layers and the values of τ, ranging in [0, 0.1] with a step of 0.001 and 0.01 for LeNet and AlexNet, respectively. This exploration is performed offline and only once. Every approximate configuration, denoting which layers and computations are approximated, undergoes simulation to calculate the classification accuracy. Subsequently, a Pareto analysis is conducted to determine the trade-offs between accuracy and the total number of perforated MAC operations, leading to a model with increased speedup. Note that in this work we exclusively concentrate on the convolution layers, and therefore the model's behavior, when considering the rest of the functions, remains rather predictable. Consequently, the clock cycles reported by our counters during our simulations [5] closely align with the cycles of the actual model deployment and yield representative gain percentages with respect to the "unpacked" model.
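As a rough, simplified sketch of this sweep (the real exploration also varies which layers are approximated; the simulator and cost callbacks below are hypothetical placeholders), the offline loop could look as follows, with Pareto filtering applied afterwards to the collected points.

#include <stddef.h>

typedef struct { float tau; float accuracy; float mac_reduction; } dse_point_t;

/* Offline DSE sketch: sweep tau in [0, tau_max] with a fixed step and record
 * accuracy and MAC reduction per configuration (callbacks are placeholders). */
size_t run_dse(float tau_max, float tau_step,
               float (*simulate_accuracy)(float tau),
               float (*estimate_mac_reduction)(float tau),
               dse_point_t *points, size_t max_points)
{
    size_t n = 0;
    int steps = (int)(tau_max / tau_step + 0.5f);
    for (int k = 0; k <= steps && n < max_points; k++) {
        float tau = (float)k * tau_step;
        points[n].tau           = tau;
        points[n].accuracy      = simulate_accuracy(tau);       /* offline simulation    */
        points[n].mac_reduction = estimate_mac_reduction(tau);  /* share of skipped MACs */
        n++;
    }
    return n;   /* Pareto-optimal points are extracted from this set afterwards */
}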
On average, the DSE required less than 2 hours using 6 threads. The aforementioned execution times refer to an Intel i7-8750H with 32GB of RAM. Following the DSE analysis, we extract the suitable approximate configuration (⑤) based on the user's specified accuracy loss threshold and desired speedup. Subsequently, our framework generates the approximate code (④), which is then compiled and deployed to the MCU.

Fig. 2. Pareto space between accuracy and normalized MAC unit reduction is depicted for our computation skipping approach within all convolution layers for AlexNet (a) and LeNet (b).

III. EXPERIMENTAL RESULTS AND ANALYSIS

In this section we evaluate the efficiency of our proposed framework in reducing inference latency at the cost of some classification accuracy, and we investigate the impact of approximate computing within the context of TinyML on MCUs. We evaluate the inference latency, classification accuracy, memory usage and energy of our approximate designs against the state-of-the-art exact models [2], and we also compare our framework against the closed-source X-CUBE-AI [8] framework. All experiments are evaluated on an STM32U575ZIT6Q SoC, an ARM Cortex-M33 based MCU running at 160 MHz, with 2MB of Flash and 768KB of RAM.

Before deploying the final approximate design and measuring its latency, an initial analysis is required (⑤). This offline analysis assists in extracting approximate designs based on the accuracy loss threshold specified by the user, and it also avoids reconfiguring the MCU multiple times (i.e., once per design of the DSE), which could potentially result in flash memory deterioration. Hence, Fig. 2 presents the Pareto space between accuracy and normalized MAC unit reduction achieved by our skipping approximation for the two examined CNNs. In Fig. 2, the MAC reduction concerns only the convolution layers. The black 'x' is our exact baseline design [2]. The blue dots on the graph correspond to approximate configurations. The green triangles, on the other hand, form the Pareto front.
TABLE II
COMPARISON WITH STATE-OF-THE-ART CMSIS [2] AND X-CUBE-AI [8] FOR TWO CNNS DEPLOYED ON AN STM32U575ZI-Q BOARD FITTING 2MB FLASH AND 768KB RAM. THREE ACCURACY LOSS THRESHOLDS HAVE BEEN CONSIDERED.

Library           Network         Top-1 Accuracy (%)   Latency (ms)   Flash (KB)   # MAC Ops   Energy (mJ)
CMSIS-NN          LeNet           71.6                 82.8           239          4.5M        2.73
CMSIS-NN          AlexNet         71.9                 179.9          267          16.1M       5.94
X-CUBE-AI         LeNet           71.6                 63.5           154          4.5M        2.10
X-CUBE-AI         AlexNet         71.9                 150.7          178          16.1M       4.97
Proposed (ours)   LeNet (0%)      71.6                 72.7           761          3.3M        2.40
Proposed (ours)   LeNet (5%)      66.7                 66.8           704          2.9M        2.20
Proposed (ours)   LeNet (10%)     61.6                 59.8           681          2.4M        1.98
Proposed (ours)   AlexNet (0%)    72.4                 124.8          1080         7.5M        4.12
Proposed (ours)   AlexNet (5%)    67.1                 111.3          954          6.2M        3.67
Proposed (ours)   AlexNet (10%)   62.1                 101.5          891          5.5M        3.35

These configurations particularly represent the percentage of operations that are skipped and the indexes of these operations in the final generated code. Note that the number of explored configurations/designs is model dependent. As aforementioned, the DSE was performed with various significance thresholds τ, steps and examined layers for both CNNs. In total, we evaluated more than 10,000 approximate designs for LeNet and AlexNet, separately. On average, our "only skipping" approximation achieves a 44% MAC reduction while delivering classification accuracy identical to the exact baseline, and this number rises further to 57% on average for both models when compromising 5% accuracy loss.

In Table II we report some important metrics of our framework. To generate this table we considered three conservative accuracy loss thresholds (i.e., 0%, 5% and 10%) and we report the latency, Top-1 accuracy, flash, and energy metrics for the latency-optimized approximate designs after deployment on the examined MCU. As aforementioned, due to the nature of the targeted AI applications, such as real-time processing, fast inference is one of the foremost requirements when targeting DNNs on MCUs, and so prioritizing it over strict accuracy constraints is a typical procedure [3]. As depicted in Table II, our cooperative approximation approach, which includes both code unpacking and significance-aware skipping approximation, achieves an average speedup of 21% while incurring no degradation (zero accuracy loss) compared to the exact baseline [2]. Moreover, the respective speedup increases to 36% when accepting approximately 10% accuracy loss. In Table II, we also provide a comparison of our models with the state-of-the-art homogeneous inference library X-CUBE-AI. Although X-CUBE-AI attains a 12% lower latency for the precise LeNet(0%) compared to our framework, it is worth noting that for the more complex CNN, AlexNet, our approach outperforms X-CUBE-AI. Specifically, we achieve an increased speedup of 17% with identical classification accuracy, while even for LeNet we achieve better latency at a 7% accuracy loss. As shown, our framework can surpass even commercial tools like X-CUBE-AI (which also have very limited flexibility), providing an accuracy-latency trade-off that was previously unattainable for optimized libraries like CMSIS.

Lastly, we undertake a qualitative evaluation, comparing our approximation framework with other state-of-the-art methodologies. When compared to CMix-NN [9] using a model with 13.8M MAC operations, our framework achieves a latency of 124ms on a 160MHz MCU. This means that, compared to CMix-NN [9], our framework achieves a remarkable 62% reduction in latency, with negligible accuracy degradation. Additionally, uTVM [10], an end-to-end ML compiler framework tailored for bare-metal MCUs, reports a 13% latency overhead compared to CMSIS when using a similar LeNet model architecture. For the same model, our approach outperforms uTVM, achieving an additional 32% speedup with an accuracy loss of less than 5%.

IV. CONCLUSION

In this work, to address the notable latency limitations of MCUs, we introduce a cooperative framework that combines approximate computing with software kernel optimizations. Through a systematic kernel-based computation skipping approach, our framework effectively removes operations deemed insignificant for the model's inference, resulting in accelerated inference at the expense of varying accuracy trade-offs. These trade-offs have the potential to open avenues for more AI applications and enable the execution of more complex deep neural networks on tiny MCUs.

REFERENCES

[1] V. Rajapakse, I. Karunanayake, and N. Ahmed, "Intelligence at the extreme edge: A survey on reformable TinyML," ACM Comput. Surv., vol. 55, no. 13s, Jul. 2023.
[2] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," Jan. 2018.
[3] J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, "MCUNet: Tiny deep learning on IoT devices," in Proc. 34th Int. Conf. on Neural Information Processing Systems (NIPS'20), Red Hook, NY, USA: Curran Associates Inc., 2020.
[4] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," ACM Comput. Surv., vol. 55, no. 4, Nov. 2022.
[5] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green, P. Warden, T. Ansell, and V. J. Reddi, "CFU Playground: Full-stack open-source framework for tiny machine learning (TinyML) acceleration on FPGAs," in 2023 IEEE Int. Symp. on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 157-167.
[6] Z. Jia, D. Li, C. Liu, L. Liao, X. Xu, L. Ping, and Y. Shi, "TinyML design contest for life-threatening ventricular arrhythmia detection," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2023.
[7] J. Zhang, X. Chen, M. Song, and T. Li, "Eager pruning: Algorithm and architecture support for fast training of deep neural networks," in 2019 ACM/IEEE 46th Annual Int. Symp. on Computer Architecture (ISCA), 2019, pp. 292-303.
[8] STMicroelectronics, "X-CUBE-AI: AI expansion pack for STM32CubeMX," Dec. 2019. [Online]. Available: https://www.st.com/en/embedded-software/x-cube-ai.html
[9] A. Capotondi, M. Rusci, M. Fariselli, and L. Benini, "CMix-NN: Mixed low-precision CNN library for memory-constrained edge devices," IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 67, 2020.
[10] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An automated end-to-end optimizing compiler for deep learning," in Proc. 13th USENIX Conf. on Operating Systems Design and Implementation (OSDI'18), 2018.
